Template-Type: ReDIF-Article 1.0
Author-Name: Wensheng Zhu
Author-X-Name-First: Wensheng
Author-X-Name-Last: Zhu
Author-Name: Yuan Jiang
Author-X-Name-First: Yuan
Author-X-Name-Last: Jiang
Author-Name: Heping Zhang
Author-X-Name-First: Heping
Author-X-Name-Last: Zhang
Title: Nonparametric Covariate-Adjusted Association Tests Based on the Generalized Kendall's Tau
Abstract:
Identifying the risk factors for comorbidity is important in psychiatric
research. Empirically, studies have shown that testing multiple correlated
traits simultaneously is more powerful than testing a single trait at a
time in association analysis. Furthermore, for complex diseases,
especially mental illnesses and behavioral disorders, the traits are often
recorded in different scales, such as dichotomous, ordinal, and
quantitative. In the absence of covariates, nonparametric association
tests have been developed for multiple complex traits to study
comorbidity. However, genetic studies generally contain measurements of
some covariates that may affect the relationship between the risk factors
of major interest (such as genes) and the outcomes. While it is relatively
easy to adjust for these covariates in a parametric model for quantitative
traits, it is challenging to adjust for covariates when there are multiple
complex traits with possibly different scales. In this article, we propose
a nonparametric test for multiple complex traits that can adjust for
covariate effects. The test aims to achieve an optimal scheme of
adjustment by using a maximum statistic calculated from multiple adjusted
test statistics. We derive the asymptotic null distribution of the maximum
test statistic and also propose a resampling approach, both of which can
be used to assess the significance of our test. Simulations are conducted
to compare the Type I error and power of the nonparametric adjusted test
to the unadjusted test and other existing adjusted tests. The empirical
results suggest that our proposed test increases power through
adjustment for covariates when environmental effects are present and is
more robust to model misspecification than some existing parametric
adjusted tests. We further demonstrate the advantage of our test by
analyzing a dataset on genetics of alcoholism.
Journal: Journal of the American Statistical Association
Pages: 1-11
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.643707
File-URL: http://hdl.handle.net/10.1080/01621459.2011.643707
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:1-11
Template-Type: ReDIF-Article 1.0
Author-Name: Xiaoyan Shi
Author-X-Name-First: Xiaoyan
Author-X-Name-Last: Shi
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Author-Name: Joseph G. Ibrahim
Author-X-Name-First: Joseph G.
Author-X-Name-Last: Ibrahim
Author-Name: Faming Liang
Author-X-Name-First: Faming
Author-X-Name-Last: Liang
Author-Name: Jeffrey Lieberman
Author-X-Name-First: Jeffrey
Author-X-Name-Last: Lieberman
Author-Name: Martin Styner
Author-X-Name-First: Martin
Author-X-Name-Last: Styner
Title: Intrinsic Regression Models for Medial Representation of Subcortical Structures
Abstract:
The aim of this article is to develop a semiparametric model to describe
the variability of the medial representation of subcortical structures,
which belongs to a Riemannian manifold, and establish its association with
covariates of interest, such as diagnostic status, age, and gender. We
develop a two-stage estimation procedure to calculate the parameter
estimates. The first stage is to calculate an intrinsic least squares
estimator of the parameter vector using the annealing evolutionary
stochastic approximation Monte Carlo algorithm, and then the second stage
is to construct a set of estimating equations to obtain a more efficient
estimate with the intrinsic least squares estimate as the starting point.
We use Wald statistics to test linear hypotheses of unknown parameters and
establish their limiting distributions. Simulation studies are used to
evaluate the accuracy of our parameter estimates and the finite sample
performance of the Wald statistics. We apply our methods to the detection
of the difference in the morphological changes of the left and right
hippocampi between schizophrenia patients and healthy controls using a
medial shape description. This article has online supplementary material.
Journal: Journal of the American Statistical Association
Pages: 12-23
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.643710
File-URL: http://hdl.handle.net/10.1080/01621459.2011.643710
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:12-23
Template-Type: ReDIF-Article 1.0
Author-Name: Debbie J. Dupuis
Author-X-Name-First: Debbie J.
Author-X-Name-Last: Dupuis
Title: Modeling Waves of Extreme Temperature: The Changing Tails of Four Cities
Abstract:
Heat waves are a serious threat to society, the environment, and the
economy. Estimates of the recurrence probabilities of heat waves may be
obtained following the successful modeling of daily maximum temperature,
but working with the latter is difficult as we have to recognize, and
allow for, not only a time trend but also seasonality in the mean and in
the variability, as well as serial correlation. Furthermore, as the
extreme values of daily maximum temperature have a different form of
nonstationarity from the body, additional modeling is required to
completely capture the realities. We present a time series model for the
daily maximum temperature and use an exceedance over high thresholds
approach to model the upper tail of the distribution of its scaled
residuals. We show how a change-point analysis can be used to
identify seasons of constant crossing rates and how a time-dependent shape
parameter can then be introduced to capture a change in the distribution
of the exceedances. Daily maximum temperature series for Des Moines, New
York, Portland, and Tucson are analyzed. In-sample and out-of-sample
goodness-of-fit measures show that the proposed model is an excellent fit
to the data. The fitted model is then used to estimate the recurrence
probabilities of runs over seasonally high temperatures, and we show that
the probability of long and intense heat waves has increased considerably
over 50 years. We also find that the increases vary by city and by time of
year.
Journal: Journal of the American Statistical Association
Pages: 24-39
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.643732
File-URL: http://hdl.handle.net/10.1080/01621459.2011.643732
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:24-39
Template-Type: ReDIF-Article 1.0
Author-Name: Lawrence C. McCandless
Author-X-Name-First: Lawrence C.
Author-X-Name-Last: McCandless
Author-Name: Sylvia Richardson
Author-X-Name-First: Sylvia
Author-X-Name-Last: Richardson
Author-Name: Nicky Best
Author-X-Name-First: Nicky
Author-X-Name-Last: Best
Title: Adjustment for Missing Confounders Using External Validation Data and Propensity Scores
Abstract:
Reducing bias from missing confounders is a challenging problem in the
analysis of observational data. Information about missing variables is
sometimes available from external validation data, such as surveys or
secondary samples drawn from the same source population. In principle, the
validation data permit us to recover information about the missing data,
but the difficulty is in eliciting a valid model for the nuisance
distribution of the missing confounders. Motivated by a British study of
the effects of trihalomethane exposure on risk of full-term low
birthweight, we describe a flexible Bayesian procedure for adjusting for a
vector of missing confounders using external validation data. We summarize
the missing confounders with a scalar summary score using the propensity
score methodology of Rosenbaum and Rubin. The score has the property that
it induces conditional independence between the exposure and the missing
confounders, given the measured confounders. It balances the unmeasured
confounders across exposure groups, within levels of measured covariates.
To adjust for bias, we need only model and adjust for the summary score
during Markov chain Monte Carlo computation. Simulation results illustrate
that the proposed method reduces bias from several missing confounders
over a range of different sample sizes for the validation data. Appendices
A--C are available as online supplementary material.
Journal: Journal of the American Statistical Association
Pages: 40-51
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.643739
File-URL: http://hdl.handle.net/10.1080/01621459.2011.643739
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:40-51
Template-Type: ReDIF-Article 1.0
Author-Name: James Y. Dai
Author-X-Name-First: James Y.
Author-X-Name-Last: Dai
Author-Name: Peter B. Gilbert
Author-X-Name-First: Peter B.
Author-X-Name-Last: Gilbert
Author-Name: Benoît R. Mâsse
Author-X-Name-First: Benoît R.
Author-X-Name-Last: Mâsse
Title: Partially Hidden Markov Model for Time-Varying Principal Stratification in HIV Prevention Trials
Abstract:
It is frequently of interest to estimate the intervention effect that
adjusts for post-randomization variables in clinical trials. In the
recently completed HPTN 035 trial, there is differential condom use
between the three microbicide gel arms and the no-gel control arm, so
intention-to-treat (ITT) analyses only assess the net treatment effect
that includes the indirect treatment effect mediated through differential
condom use. Various statistical methods in causal inference have been
developed to adjust for post-randomization variables. We extend the
principal stratification framework to time-varying behavioral variables in
HIV prevention trials with a time-to-event endpoint, using a partially
hidden Markov model (pHMM). We formulate the causal estimand of interest,
establish assumptions that enable identifiability of the causal
parameters, and develop maximum likelihood methods for estimation.
Application of our model on the HPTN 035 trial reveals an interesting
pattern of prevention effectiveness among different condom-use principal
strata.
Journal: Journal of the American Statistical Association
Pages: 52-65
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.643743
File-URL: http://hdl.handle.net/10.1080/01621459.2011.643743
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:52-65
Template-Type: ReDIF-Article 1.0
Author-Name: Jooyoung Jeon
Author-X-Name-First: Jooyoung
Author-X-Name-Last: Jeon
Author-Name: James W. Taylor
Author-X-Name-First: James W.
Author-X-Name-Last: Taylor
Title: Using Conditional Kernel Density Estimation for Wind Power Density Forecasting
Abstract:
Of the various renewable energy resources, wind power is widely
recognized as one of the most promising. The management of wind farms and
electricity systems can benefit greatly from the availability of estimates
of the probability distribution of wind power generation. However, most
research has focused on point forecasting of wind power. In this article,
we develop an approach to producing density forecasts for the wind power
generated at individual wind farms. Our interest is in intraday data and
prediction from 1 to 72 hours ahead. We model wind power in terms of wind
speed and wind direction. In this framework, there are two key
uncertainties. First, there is the inherent uncertainty in wind speed and
direction, and we model this using a bivariate vector autoregressive
moving average-generalized autoregressive conditional heteroscedastic
(VARMA-GARCH) model, with a Student t error distribution,
in the Cartesian space of wind speed and direction. Second, there is the
stochastic nature of the relationship of wind power to wind speed
(described by the power curve), and to wind direction. We model this using
conditional kernel density (CKD) estimation, which enables a nonparametric
modeling of the conditional density of wind power. Using Monte Carlo
simulation of the VARMA-GARCH model and CKD estimation, density forecasts
of wind speed and direction are converted to wind power density forecasts.
Our work is novel in several respects: previous wind power studies have
not modeled a stochastic power curve; to accommodate time evolution in the
power curve, we incorporate a time decay factor within the CKD method; and
the CKD method is conditional on a density, rather than a single value.
The new approach is evaluated using datasets from four Greek wind farms.
Journal: Journal of the American Statistical Association
Pages: 66-79
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.643745
File-URL: http://hdl.handle.net/10.1080/01621459.2011.643745
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:66-79
Template-Type: ReDIF-Article 1.0
Author-Name: Tristan Zajonc
Author-X-Name-First: Tristan
Author-X-Name-Last: Zajonc
Title: Bayesian Inference for Dynamic Treatment Regimes: Mobility, Equity, and Efficiency in Student Tracking
Abstract:
Policies in health, education, and economics often unfold sequentially
and adapt to changing conditions. Such time-varying treatments pose
problems for standard program evaluation methods because intermediate
outcomes are simultaneously pretreatment confounders and posttreatment
outcomes. This article extends the Bayesian perspective on causal
inference and optimal treatment to these types of dynamic treatment
regimes. A unifying idea remains ignorable treatment assignment, which now
sequentially includes selection on intermediate outcomes. I present
methods to estimate the causal effect of arbitrary regimes, recover the
optimal regime, and characterize the set of feasible outcomes under
different regimes. I demonstrate these methods through an application to
optimal student tracking in ninth and tenth grade mathematics. For the
sample considered, student mobility under the status-quo regime is
significantly below the optimal rate and existing policies reinforce
between-student inequality. An easy-to-implement optimal dynamic tracking
regime, which promotes more students to honors in tenth grade, increases
average final achievement to 0.07 standard deviations above the status quo
while lowering inequality; there is no binding equity-efficiency tradeoff.
The proposed methods provide a flexible and principled approach to causal
inference for time-varying treatments and optimal treatment choice under
uncertainty. This article has online supplementary material.
Journal: Journal of the American Statistical Association
Pages: 80-92
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.643747
File-URL: http://hdl.handle.net/10.1080/01621459.2011.643747
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:80-92
Template-Type: ReDIF-Article 1.0
Author-Name: Alexandre Rodrigues
Author-X-Name-First: Alexandre
Author-X-Name-Last: Rodrigues
Author-Name: Peter J. Diggle
Author-X-Name-First: Peter J.
Author-X-Name-Last: Diggle
Title: Bayesian Estimation and Prediction for Inhomogeneous Spatiotemporal Log-Gaussian Cox Processes Using Low-Rank Models, With Application to Criminal Surveillance
Abstract:
In this article, we propose a method for conducting likelihood-based
inference for a class of nonstationary spatiotemporal log-Gaussian Cox
processes. The method uses convolution-based models to capture
spatiotemporal correlation structure, is computationally feasible even for
large datasets, and does not require knowledge of the underlying spatial
intensity of the process. We describe an application to a surveillance
system for detecting emergent spatiotemporal clusters of homicides in Belo
Horizonte, Brazil, and discuss the advantages and drawbacks of our
model-based approach by comparison with other spatiotemporal surveillance
methods that have been proposed in the literature.
Journal: Journal of the American Statistical Association
Pages: 93-101
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.644496
File-URL: http://hdl.handle.net/10.1080/01621459.2011.644496
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:93-101
Template-Type: ReDIF-Article 1.0
Author-Name: Keyur H. Desai
Author-X-Name-First: Keyur H.
Author-X-Name-Last: Desai
Author-Name: John D. Storey
Author-X-Name-First: John D.
Author-X-Name-Last: Storey
Title: Cross-Dimensional Inference of Dependent High-Dimensional Data
Abstract:
A growing number of modern scientific problems in areas such as genomics,
neurobiology, and spatial epidemiology involve the measurement and
analysis of thousands of related features that may be stochastically
dependent at arbitrarily strong levels. In this work, we consider the
scenario where the features follow a multivariate Normal distribution. We
demonstrate that dependence is manifested as random variation shared among
features, and that standard methods may yield highly unstable inference
due to dependence, even when the dependence is fully parameterized and
utilized in the procedure. We propose a “cross-dimensional
inference” framework that alleviates the problems due to dependence
by modeling and removing the variation shared among features, while also
properly regularizing estimation across features. We demonstrate the
framework on both simultaneous point estimation and multiple hypothesis
testing in scenarios derived from the scientific applications of interest.
Journal: Journal of the American Statistical Association
Pages: 135-151
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.645777
File-URL: http://hdl.handle.net/10.1080/01621459.2011.645777
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:135-151
Template-Type: ReDIF-Article 1.0
Author-Name: Bing Li
Author-X-Name-First: Bing
Author-X-Name-Last: Li
Author-Name: Hyonho Chun
Author-X-Name-First: Hyonho
Author-X-Name-Last: Chun
Author-Name: Hongyu Zhao
Author-X-Name-First: Hongyu
Author-X-Name-Last: Zhao
Title: Sparse Estimation of Conditional Graphical Models With Application to Gene Networks
Abstract:
In many applications the graph structure in a network arises from two
sources: intrinsic connections and connections due to external effects. We
introduce a sparse estimation procedure for graphical models that is
capable of isolating the intrinsic connections by removing the external
effects. Technically, this is formulated as a conditional
graphical model, in which the external effects are modeled as predictors,
and the graph is determined by the conditional precision matrix. We
introduce two sparse estimators of this matrix using the reproducing kernel
Hilbert space combined with lasso and adaptive lasso. We establish the
sparsity, variable selection consistency, oracle property, and the
asymptotic distributions of the proposed estimators. We also develop their
convergence rate when the dimension of the conditional precision matrix
goes to infinity. The methods are compared with sparse estimators for
unconditional graphical models, and with the constrained maximum
likelihood estimate that assumes a known graph structure. The methods are
applied to a genetic data set to construct a gene network conditioning on
single-nucleotide polymorphisms.
Journal: Journal of the American Statistical Association
Pages: 152-167
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.644498
File-URL: http://hdl.handle.net/10.1080/01621459.2011.644498
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:152-167
Template-Type: ReDIF-Article 1.0
Author-Name: Yanyuan Ma
Author-X-Name-First: Yanyuan
Author-X-Name-Last: Ma
Author-Name: Liping Zhu
Author-X-Name-First: Liping
Author-X-Name-Last: Zhu
Title: A Semiparametric Approach to Dimension Reduction
Abstract:
We provide a novel approach to dimension-reduction problems, completely
different from those in the existing literature. We cast the
dimension-reduction problem in a semiparametric estimation framework and
derive estimating equations. Viewing this problem from the new angle
allows us to derive a rich class of estimators, and obtain the classical
dimension reduction techniques as special cases in this class. The
semiparametric approach also reveals that, in the inverse regression
context, the common assumption of linearity and/or constant variance on
the covariates can be removed, while keeping the estimation structure
intact, at the cost of performing additional nonparametric regression. The
semiparametric estimators without these common assumptions are illustrated
through simulation studies and a real data example. This article has
online supplementary material.
Journal: Journal of the American Statistical Association
Pages: 168-179
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.646925
File-URL: http://hdl.handle.net/10.1080/01621459.2011.646925
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:168-179
Template-Type: ReDIF-Article 1.0
Author-Name: Tatiyana V. Apanasovich
Author-X-Name-First: Tatiyana V.
Author-X-Name-Last: Apanasovich
Author-Name: Marc G. Genton
Author-X-Name-First: Marc G.
Author-X-Name-Last: Genton
Author-Name: Ying Sun
Author-X-Name-First: Ying
Author-X-Name-Last: Sun
Title: A Valid Matérn Class of Cross-Covariance Functions for Multivariate Random Fields With Any Number of Components
Abstract:
We introduce a valid parametric family of cross-covariance functions for
multivariate spatial random fields where each component has a covariance
function from a well-celebrated Matérn class. Unlike previous attempts,
our model indeed allows for various smoothnesses and rates of correlation
decay for any number of vector components. We present the conditions on
the parameter space that result in valid models with varying degrees of
complexity. We discuss practical implementations, including
reparameterizations to reflect the conditions on the parameter space and
an iterative algorithm to increase the computational efficiency. We
perform various Monte Carlo simulation experiments to explore the
performances of our approach in terms of estimation and cokriging. The
application of the proposed multivariate Matérn model is illustrated on
two meteorological datasets: temperature/pressure over the Pacific
Northwest (bivariate) and wind/temperature/pressure in Oklahoma
(trivariate). In the latter case, our flexible trivariate Matérn model is
valid and yields better predictive scores compared with a parsimonious
model with common scale parameters.
Journal: Journal of the American Statistical Association
Pages: 180-193
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.643197
File-URL: http://hdl.handle.net/10.1080/01621459.2011.643197
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:180-193
Template-Type: ReDIF-Article 1.0
Author-Name: Huixia Judy Wang
Author-X-Name-First: Huixia Judy
Author-X-Name-Last: Wang
Author-Name: Xingdong Feng
Author-X-Name-First: Xingdong
Author-X-Name-Last: Feng
Title: Multiple Imputation for M-Regression With Censored Covariates
Abstract:
We develop a new multiple imputation approach for
M-regression models with censored covariates. Instead of
specifying parametric likelihoods, our method imputes the censored
covariates by their conditional quantiles given the observed data, where
the conditional quantiles are estimated through fitting a censored
quantile regression process. The resulting estimator is shown to be
consistent and asymptotically normal, and it improves the estimation
efficiency by using information from cases with censored covariates.
Compared with existing methods, the proposed method is more flexible as it
does not require stringent parametric assumptions on the distributions of
either the regression errors or the covariates. The finite sample
performance of the proposed method is assessed through a simulation study
and the analysis of a C-reactive protein dataset in the 2007--2008
National Health and Nutrition Examination Survey. This article has
supplementary material online.
Journal: Journal of the American Statistical Association
Pages: 194-204
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.643198
File-URL: http://hdl.handle.net/10.1080/01621459.2011.643198
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:194-204
Template-Type: ReDIF-Article 1.0
Author-Name: Qian M. Zhou
Author-X-Name-First: Qian M.
Author-X-Name-Last: Zhou
Author-Name: Peter X.-K. Song
Author-X-Name-First: Peter X.-K.
Author-X-Name-Last: Song
Author-Name: Mary E. Thompson
Author-X-Name-First: Mary E.
Author-X-Name-Last: Thompson
Title: Information Ratio Test for Model Misspecification in Quasi-Likelihood Inference
Abstract:
In this article, we focus on the circumstances in quasi-likelihood
inference in which the estimation accuracy of mean structure parameters is
guaranteed by correct specification of the first moment, but the
estimation efficiency could be diminished due to misspecification of the
second moment. We propose an information ratio (IR) statistic to test for
model misspecification of the variance/covariance structure through a
comparison between two forms of information matrix: the negative
sensitivity matrix and the variability matrix. We establish asymptotic
distributions of the proposed IR test statistics. We also suggest an
approximation to the asymptotic distribution of the IR statistic via a
perturbation resampling method. Moreover, we propose a selection criterion
based on the IR test to select the best fitting variance/covariance
structure from a class of candidates. Through simulation studies, it is
shown that the IR statistic provides a powerful statistical tool to detect
different scenarios of misspecification of the variance/covariance
structures. In addition, the IR test as well as the proposed model
selection procedure shows substantial improvement over some of the
existing statistical methods. The IR-based model selection procedure is
illustrated by analyzing the Madras Longitudinal Schizophrenia data.
Appendices are included in the supplemental materials, which are available
online.
Journal: Journal of the American Statistical Association
Pages: 205-213
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.645785
File-URL: http://hdl.handle.net/10.1080/01621459.2011.645785
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:205-213
Template-Type: ReDIF-Article 1.0
Author-Name: Lan Wang
Author-X-Name-First: Lan
Author-X-Name-Last: Wang
Author-Name: Yichao Wu
Author-X-Name-First: Yichao
Author-X-Name-Last: Wu
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Title: Quantile Regression for Analyzing Heterogeneity in Ultra-High Dimension
Abstract:
Ultra-high dimensional data often display heterogeneity due to either
heteroscedastic variance or other forms of non-location-scale covariate
effects. To accommodate heterogeneity, we advocate a more general
interpretation of sparsity, which assumes that only a small number of
covariates influence the conditional distribution of the response
variable, given all candidate covariates; however, the sets of relevant
covariates may differ when we consider different segments of the
conditional distribution. In this framework, we investigate the
methodology and theory of nonconvex, penalized quantile regression in
ultra-high dimension. The proposed approach has two distinctive features:
(1) It enables us to explore the entire conditional distribution of the
response variable, given the ultra-high-dimensional covariates, and
provides a more realistic picture of the sparsity pattern; (2) it requires
substantially weaker conditions compared with alternative methods in the
literature; thus, it greatly alleviates the difficulty of model checking
in the ultra-high dimension. In the theoretical development, it is challenging
to deal with both the nonsmooth loss function and the nonconvex penalty
function in ultra-high-dimensional parameter space. We introduce a novel,
sufficient optimality condition that relies on a convex differencing
representation of the penalized loss function and the subdifferential
calculus. Exploring this optimality condition enables us to establish the
oracle property for sparse quantile regression in the ultra-high dimension
under relaxed conditions. The proposed method greatly enhances existing
tools for ultra-high-dimensional data analysis. Monte Carlo simulations
demonstrate the usefulness of the proposed procedure. The real data
example we analyzed demonstrates that the new approach reveals
substantially more information as compared with alternative methods. This
article has online supplementary material.
Journal: Journal of the American Statistical Association
Pages: 214-222
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2012.656014
File-URL: http://hdl.handle.net/10.1080/01621459.2012.656014
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:214-222
Template-Type: ReDIF-Article 1.0
Author-Name: Xiaotong Shen
Author-X-Name-First: Xiaotong
Author-X-Name-Last: Shen
Author-Name: Wei Pan
Author-X-Name-First: Wei
Author-X-Name-Last: Pan
Author-Name: Yunzhang Zhu
Author-X-Name-First: Yunzhang
Author-X-Name-Last: Zhu
Title: Likelihood-Based Selection and Sharp Parameter Estimation
Abstract:
In high-dimensional data analysis, feature selection is one
effective means of dimension reduction, which is followed by parameter
estimation. Concerning the accuracy of selection and estimation, we study
nonconvex constrained and regularized likelihoods in the presence of
nuisance parameters. Theoretically, we show that constrained
L0 likelihood and its computational surrogate
are optimal in that they achieve feature selection consistency and sharp
parameter estimation, under one necessary condition required for any
method to be selection consistent and to achieve sharp parameter
estimation. It permits up to exponentially many candidate features.
Computationally, we develop difference convex methods to implement the
computational surrogate through primal and dual subproblems. These results
establish a central role of L0 constrained
and regularized likelihoods in feature selection and parameter estimation
involving selection. As applications of the general method and theory, we
perform feature selection in linear regression and logistic regression,
and estimate a precision matrix in Gaussian graphical models. In these
situations, we gain a new theoretical insight and obtain favorable
numerical results. Finally, we discuss an application to predict the
metastasis status of breast cancer patients with their gene expression
profiles. This article has online supplementary material.
Journal: Journal of the American Statistical Association
Pages: 223-232
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.645783
File-URL: http://hdl.handle.net/10.1080/01621459.2011.645783
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:223-232
Template-Type: ReDIF-Article 1.0
Author-Name: Daniel J. Nordman
Author-X-Name-First: Daniel J.
Author-X-Name-Last: Nordman
Author-Name: Soumendra N. Lahiri
Author-X-Name-First: Soumendra N.
Author-X-Name-Last: Lahiri
Title: Block Bootstraps for Time Series With Fixed Regressors
Abstract:
This article examines block bootstrap methods in linear regression models
with weakly dependent error variables and nonstochastic regressors.
Contrary to intuition, the tapered block bootstrap (TBB) with a smooth
taper not only loses its superior bias properties but may also fail to be
consistent in the regression problem. A similar problem, albeit at a
smaller scale, is shown to exist for the moving and the circular block
bootstrap (MBB and CBB, respectively). As a remedy, an additional block
randomization step is introduced that balances out the effects of
nonuniform regression weights, and restores the superiority of the
(modified) TBB. The randomization step also improves the MBB or CBB.
Interestingly, the stationary bootstrap (SB) automatically balances out
regression weights through its probabilistic blocking mechanism, without
requiring any modification, and enjoys a kind of robustness. Optimal block
sizes are explicitly determined for block bootstrap variance estimators
under regression. Finite sample performance and practical uses of the
methods are illustrated through a simulation study and two data examples,
respectively. Supplementary materials are available online.
Journal: Journal of the American Statistical Association
Pages: 233-246
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.646929
File-URL: http://hdl.handle.net/10.1080/01621459.2011.646929
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:233-246
Template-Type: ReDIF-Article 1.0
Author-Name: Zonghui Hu
Author-X-Name-First: Zonghui
Author-X-Name-Last: Hu
Author-Name: Dean A. Follmann
Author-X-Name-First: Dean A.
Author-X-Name-Last: Follmann
Author-Name: Jing Qin
Author-X-Name-First: Jing
Author-X-Name-Last: Qin
Title: Semiparametric Double Balancing Score Estimation for Incomplete Data With Ignorable Missingness
Abstract:
When estimating the marginal mean response with missing observations, a
critical issue is robustness to model misspecification. In this article,
we propose a semiparametric estimation method with extended double
robustness that attains the optimal efficiency under less stringent
requirements for model specification than the doubly robust estimators. In
this semiparametric estimation, covariate information is collapsed into a
two-dimensional score S, with one dimension for (i) the
pattern of missingness and the other for (ii) the pattern of response,
both estimated from some working parametric models. The mean response
E(Y) is then estimated by the sample
mean of E(Y∣S),
which is estimated via nonparametric regression. The semiparametric
estimator is consistent if either the “core” of (i) or the
“core” of (ii) is captured by S, and attains
the optimal efficiency if both are captured by S. As the
“cores” can be obtained without correctly specifying the
full parametric models for (i) or (ii), the proposed estimator can be more
robust than other doubly robust estimators. As S contains the
propensity score as one component, the proposed estimator avoids the use
and the shortcomings of inverse propensity weighting. This semiparametric
estimator is most appealing for high-dimensional covariates, where fully
correct model specification is challenging and nonparametric estimation is
not feasible due to the problem of dimensionality. Numerical performance
is investigated by simulation studies.
Journal: Journal of the American Statistical Association
Pages: 247-257
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2012.656009
File-URL: http://hdl.handle.net/10.1080/01621459.2012.656009
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:247-257
Template-Type: ReDIF-Article 1.0
Author-Name: Gaurav Sharma
Author-X-Name-First: Gaurav
Author-X-Name-Last: Sharma
Author-Name: Thomas Mathew
Author-X-Name-First: Thomas
Author-X-Name-Last: Mathew
Title: One-Sided and Two-Sided Tolerance Intervals in General Mixed and Random Effects Models Using Small-Sample Asymptotics
Abstract:
The computation of tolerance intervals in mixed and random effects models
has not been satisfactorily addressed in a general setting when the data
are unbalanced and/or when covariates are present. This article derives
satisfactory one-sided and two-sided tolerance intervals in such a general
scenario, by applying small-sample asymptotic procedures. In the case of
one-sided tolerance limits, the problem reduces to the interval estimation
of a percentile, and accurate confidence limits are derived using
small-sample asymptotics. In the case of a two-sided tolerance interval,
the problem does not reduce to an interval estimation problem; however, it
is possible to derive an approximate margin of error statistic that is an
upper confidence limit for a linear combination of the variance
components. For the latter problem, small-sample asymptotic procedures can
once again be used to arrive at an accurate upper confidence limit. In the
article, balanced and unbalanced data situations are treated separately,
and computational issues are addressed in detail. Extensive numerical
results show that the tolerance intervals derived based on small-sample
asymptotics exhibit satisfactory performance regardless of the sample
size. The results are illustrated using some examples. Some technical
derivations, additional simulation results, and R codes are available
online as supplementary materials.
Journal: Journal of the American Statistical Association
Pages: 258-267
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.640592
File-URL: http://hdl.handle.net/10.1080/01621459.2011.640592
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:258-267
Template-Type: ReDIF-Article 1.0
Author-Name: Moreno Bevilacqua
Author-X-Name-First: Moreno
Author-X-Name-Last: Bevilacqua
Author-Name: Carlo Gaetan
Author-X-Name-First: Carlo
Author-X-Name-Last: Gaetan
Author-Name: Jorge Mateu
Author-X-Name-First: Jorge
Author-X-Name-Last: Mateu
Author-Name: Emilio Porcu
Author-X-Name-First: Emilio
Author-X-Name-Last: Porcu
Title: Estimating Space and Space-Time Covariance Functions for Large Data Sets: A Weighted Composite Likelihood Approach
Abstract:
In this article, we propose two methods for estimating space and
space-time covariance functions from a Gaussian random field, based on the
composite likelihood idea. The first method relies on the maximization of
a weighted version of the composite likelihood function, while the second
one is based on the solution of a weighted composite score equation. This
last scheme is quite general and could be applied to any kind of composite
likelihood. An information criterion for model selection based on the
first estimation method is also introduced. The methods are useful for
practitioners looking for a good balance between computational complexity
and statistical efficiency. The effectiveness of the methods is
illustrated through examples, simulation experiments, and by analyzing a
dataset on ozone measurements.
Journal: Journal of the American Statistical Association
Pages: 268-280
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.646928
File-URL: http://hdl.handle.net/10.1080/01621459.2011.646928
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:268-280
Template-Type: ReDIF-Article 1.0
Author-Name: Luke Bornn
Author-X-Name-First: Luke
Author-X-Name-Last: Bornn
Author-Name: Gavin Shaddick
Author-X-Name-First: Gavin
Author-X-Name-Last: Shaddick
Author-Name: James V. Zidek
Author-X-Name-First: James V.
Author-X-Name-Last: Zidek
Title: Modeling Nonstationary Processes Through Dimension Expansion
Abstract:
In this article, we propose a novel approach to modeling nonstationary
spatial fields. The proposed method works by expanding the geographic
plane over which these processes evolve into higher-dimensional spaces,
transforming and clarifying complex patterns in the physical plane. By
combining aspects of multidimensional scaling, group lasso, and latent
variable models, a dimensionally sparse projection is found in which the
originally nonstationary field exhibits stationarity. Following a
comparison with existing methods in a simulated environment, dimension
expansion is studied on a classic test-bed dataset historically used to
study nonstationary models. Following this, we explore the use of
dimension expansion in modeling air pollution in the United Kingdom, a
process known to be strongly influenced by rural/urban effects, amongst
others, which gives rise to a nonstationary field.
Journal: Journal of the American Statistical Association
Pages: 281-289
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.646919
File-URL: http://hdl.handle.net/10.1080/01621459.2011.646919
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:281-289
Template-Type: ReDIF-Article 1.0
Author-Name: Michael S. Smith
Author-X-Name-First: Michael S.
Author-X-Name-Last: Smith
Author-Name: Mohamad A. Khaled
Author-X-Name-First: Mohamad A.
Author-X-Name-Last: Khaled
Title: Estimation of Copula Models With Discrete Margins via Bayesian Data Augmentation
Abstract:
Estimation of copula models with discrete margins can be difficult beyond
the bivariate case. We show how this can be achieved by augmenting the
likelihood with continuous latent variables, and computing inference using
the resulting augmented posterior. To evaluate this, we propose two
efficient Markov chain Monte Carlo sampling schemes. One generates the
latent variables as a block using a Metropolis-Hastings step with a
proposal that is close to its target distribution, the other generates
them one at a time. Our method applies to all parametric copulas where the
conditional copula functions can be evaluated, not just elliptical copulas
as in much previous work. Moreover, the copula parameters can be estimated
jointly with any marginal parameters, and Bayesian selection ideas can be
employed. We establish the effectiveness of the estimation method by
modeling consumer behavior in online retail using Archimedean and Gaussian
copulas. The example shows that elliptical copulas can be poor at modeling
dependence in discrete data, just as they can be in the continuous case.
To demonstrate the potential in higher dimensions, we estimate
16-dimensional D-vine copulas for a longitudinal model of usage of a
bicycle path in the city of Melbourne, Australia. The estimates reveal an
interesting serial dependence structure that can be represented in a
parsimonious fashion using Bayesian selection of independence pair-copula
components. Finally, we extend our results and method to the case where
some margins are discrete and others continuous. Supplemental materials
for the article are also available online.
Journal: Journal of the American Statistical Association
Pages: 290-303
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.644501
File-URL: http://hdl.handle.net/10.1080/01621459.2011.644501
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:290-303
Template-Type: ReDIF-Article 1.0
Author-Name: Yulia V. Marchenko
Author-X-Name-First: Yulia V.
Author-X-Name-Last: Marchenko
Author-Name: Marc G. Genton
Author-X-Name-First: Marc G.
Author-X-Name-Last: Genton
Title: A Heckman Selection-t Model
Abstract:
Sample selection arises often in practice as a result of the partial
observability of the outcome of interest in a study. In the presence of
sample selection, the observed data do not represent a random sample from
the population, even after controlling for explanatory variables. That is,
data are missing not at random. Thus, standard analysis using only
complete cases will lead to biased results. Heckman introduced a sample
selection model to analyze such data and proposed a full maximum
likelihood estimation method under the assumption of normality. The method
was criticized in the literature because of its sensitivity to the
normality assumption. In practice, data, such as income or expenditure
data, often violate the normality assumption because of heavier tails. We
first establish a new link between sample selection models and recently
studied families of extended skew-elliptical distributions. Then, this
allows us to introduce a selection-t (SLt) model, which
models the error distribution using a Student's t
distribution. We study its properties and investigate the finite-sample
performance of the maximum likelihood estimators for this model. We
compare the performance of the SLt model to the conventional Heckman
selection-normal (SLN) model and apply it to analyze ambulatory
expenditures. Unlike the SLN model, our analysis using the SLt model
provides statistical evidence for the existence of sample selection bias
in these data. We also investigate the performance of the test for sample
selection bias based on the SLt model and compare it with the performances
of several tests used with the SLN model. Our findings indicate that the
latter tests can be misleading in the presence of heavy-tailed data.
Journal: Journal of the American Statistical Association
Pages: 304-317
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2012.656011
File-URL: http://hdl.handle.net/10.1080/01621459.2012.656011
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:304-317
Template-Type: ReDIF-Article 1.0
Author-Name: Ying Qing Chen
Author-X-Name-First: Ying Qing
Author-X-Name-Last: Chen
Author-Name: Nan Hu
Author-X-Name-First: Nan
Author-X-Name-Last: Hu
Author-Name: Su-Chun Cheng
Author-X-Name-First: Su-Chun
Author-X-Name-Last: Cheng
Author-Name: Philippa Musoke
Author-X-Name-First: Philippa
Author-X-Name-Last: Musoke
Author-Name: Lue Ping Zhao
Author-X-Name-First: Lue Ping
Author-X-Name-Last: Zhao
Title: Estimating Regression Parameters in an Extended Proportional Odds Model
Abstract:
The proportional odds model may serve as a useful alternative to the Cox
proportional hazards model to study association between covariates and
their survival functions in medical studies. In this article, we study an
extended proportional odds model that incorporates the so-called
“external” time-varying covariates. In the extended model,
regression parameters have a direct interpretation of comparing survival
functions, without specifying the baseline survival odds function.
Semiparametric and maximum likelihood estimation procedures are proposed
to estimate the extended model. Our methods are demonstrated by Monte
Carlo simulations, and applied to a landmark randomized clinical trial of
short-course nevirapine (NVP) for mother-to-child transmission (MTCT) of
human immunodeficiency virus type-1 (HIV-1). An additional application is
an analysis of the well-known Veterans Administration (VA) lung cancer
trial.
Journal: Journal of the American Statistical Association
Pages: 318-330
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2012.656021
File-URL: http://hdl.handle.net/10.1080/01621459.2012.656021
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:318-330
Template-Type: ReDIF-Article 1.0
Author-Name: Ruoqing Zhu
Author-X-Name-First: Ruoqing
Author-X-Name-Last: Zhu
Author-Name: Michael R. Kosorok
Author-X-Name-First: Michael R.
Author-X-Name-Last: Kosorok
Title: Recursively Imputed Survival Trees
Abstract:
We propose recursively imputed survival tree (RIST) regression for
right-censored data. This new nonparametric regression procedure uses a
novel recursive imputation approach combined with extremely randomized
trees that allows significantly better use of censored data than previous
tree-based methods, yielding improved model fit and reduced prediction
error. The proposed method can also be viewed as a type of Monte Carlo EM
algorithm, which generates extra diversity in the tree-based fitting
process. Simulation studies and data analyses demonstrate the superior
performance of RIST compared with previous methods.
Journal: Journal of the American Statistical Association
Pages: 331-340
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.637468
File-URL: http://hdl.handle.net/10.1080/01621459.2011.637468
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:331-340
Template-Type: ReDIF-Article 1.0
Author-Name: Sebastian Irle
Author-X-Name-First: Sebastian
Author-X-Name-Last: Irle
Author-Name: Helmut Schäfer
Author-X-Name-First: Helmut
Author-X-Name-Last: Schäfer
Title: Interim Design Modifications in Time-to-Event Studies
Abstract:
We propose a flexible method for interim design modifications in
time-to-event studies. With this method, it is possible to inspect the
data at any time during the course of the study, without the need for
prespecification of a learning phase, and to make certain types of design
modifications depending on the interim data without compromising the Type
I error risk. The method can be applied to studies designed with a
conventional statistical test, fixed sample, or group sequential, even
when no adaptive interim analysis and no specific method for design
adaptations (such as combination tests) had been foreseen in the protocol.
Currently, the method supports design changes such as an extension of the
recruitment or follow-up period, certain modifications of the number and
schedule of interim analyses, and changes of inclusion criteria. In
contrast to existing methods offering the same
flexibility, our approach allows us to make use of the full interim
information collected until the time of the adaptive data inspection. This
includes time-to-event data from patients who have already experienced an
event at the time of the data inspection, and preliminary information from
patients still alive, even if this information is predictive for survival,
such as early treatment response in a cancer clinical trial. Our method is
an extension of the so-called conditional rejection probability (CRP)
principle. It is based on the conditional distribution of the test
statistic given the final value of the same test statistic from a
subsample, namely the learning sample. It is developed in detail for the
example of the logrank statistic, for which we derive this conditional
distribution using martingale techniques.
Journal: Journal of the American Statistical Association
Pages: 341-348
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.644141
File-URL: http://hdl.handle.net/10.1080/01621459.2011.644141
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:341-348
Template-Type: ReDIF-Article 1.0
Author-Name: Bodhisattva Sen
Author-X-Name-First: Bodhisattva
Author-X-Name-Last: Sen
Author-Name: Probal Chaudhuri
Author-X-Name-First: Probal
Author-X-Name-Last: Chaudhuri
Title: On Fractile Transformation of Covariates in Regression
Abstract:
The need for comparing two regression functions arises frequently in
statistical applications. Comparison of the usual regression functions is
not very meaningful in situations where the distributions and the ranges
of the covariates are different for the populations. For instance, in
econometric studies, the prices of commodities and people's incomes
observed at different time points may not be on comparable scales due to
inflation and other economic factors. In this article, we describe a
method of standardizing the covariates and estimating the transformed
regression functions, which then become comparable. We develop smooth
estimates of the fractile regression function and study its statistical
properties analytically as well as numerically. We also provide a few real
examples that illustrate the difficulty in comparing the usual regression
functions and motivate the need for the fractile transformation. Our
analysis of the real examples leads to new and useful statistical
conclusions that are missed by comparison of the usual regression
functions.
Journal: Journal of the American Statistical Association
Pages: 349-361
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.646916
File-URL: http://hdl.handle.net/10.1080/01621459.2011.646916
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:349-361
Template-Type: ReDIF-Article 1.0
Author-Name: Anirban Bhattacharya
Author-X-Name-First: Anirban
Author-X-Name-Last: Bhattacharya
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Simplex Factor Models for Multivariate Unordered Categorical Data
Abstract:
Gaussian latent factor models are routinely used for modeling of
dependence in continuous, binary, and ordered categorical data. For
unordered categorical variables, Gaussian latent factor models lead to
challenging computation and complex modeling structures. As an
alternative, we propose a novel class of simplex factor models. In the
single-factor case, the model treats the different categorical outcomes as
independent with unknown marginals. The model can characterize flexible
dependence structures parsimoniously with few factors, and as factors are
added, any multivariate categorical data distribution can be accurately
approximated. Using a Bayesian approach for computation and inferences, a
Markov chain Monte Carlo (MCMC) algorithm is proposed that scales well
with increasing dimension, with the number of factors treated as unknown.
We develop an efficient proposal for updating the base probability vector
in hierarchical Dirichlet models. Theoretical properties are described,
and we evaluate the approach through simulation examples. Applications are
described for modeling dependence in nucleotide sequences and prediction
from high-dimensional categorical features.
Journal: Journal of the American Statistical Association
Pages: 362-377
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.646934
File-URL: http://hdl.handle.net/10.1080/01621459.2011.646934
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:362-377
Template-Type: ReDIF-Article 1.0
Author-Name: Ranjan Maitra
Author-X-Name-First: Ranjan
Author-X-Name-Last: Maitra
Author-Name: Volodymyr Melnykov
Author-X-Name-First: Volodymyr
Author-X-Name-Last: Melnykov
Author-Name: Soumendra N. Lahiri
Author-X-Name-First: Soumendra N.
Author-X-Name-Last: Lahiri
Title: Bootstrapping for Significance of Compact Clusters in Multidimensional Datasets
Abstract:
This article proposes a bootstrap approach for assessing significance in
the clustering of multidimensional datasets. The procedure compares two
models and declares the more complicated model a better candidate if there
is significant evidence in its favor. The performance of the procedure is
illustrated on two well-known classification datasets and comprehensively
evaluated in terms of its ability to estimate the number of components via
extensive simulation studies, with excellent results. The methodology is
also applied to the problem of k-means color quantization
of several standard images in the literature and is demonstrated to be a
viable approach for determining the minimal and optimal numbers of colors
needed to display an image without significant loss in resolution.
Additional illustrations and performance evaluations are provided in the
online supplementary material.
Journal: Journal of the American Statistical Association
Pages: 378-392
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.646935
File-URL: http://hdl.handle.net/10.1080/01621459.2011.646935
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:378-392
Template-Type: ReDIF-Article 1.0
Author-Name: Peter Z. G. Qian
Author-X-Name-First: Peter Z. G.
Author-X-Name-Last: Qian
Title: Sliced Latin Hypercube Designs
Abstract:
This article proposes a method for constructing a new type of
space-filling design, called a sliced Latin hypercube design, intended for
running computer experiments. Such a design is a special Latin hypercube
design that can be partitioned into slices of smaller Latin hypercube
designs. It is desirable to use the constructed designs for collective
evaluations of computer models and ensembles of multiple computer models.
The proposed construction method is easy to implement, capable of
accommodating any number of factors, and flexible in run size. Examples
are given to illustrate the method. Sampling properties of the constructed
designs are examined. Numerical illustration is provided to corroborate
the derived theoretical results.
Journal: Journal of the American Statistical Association
Pages: 393-399
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2011.644132
File-URL: http://hdl.handle.net/10.1080/01621459.2011.644132
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:393-399
Template-Type: ReDIF-Article 1.0
Author-Name: Dávid Papp
Author-X-Name-First: Dávid
Author-X-Name-Last: Papp
Title: Optimal Designs for Rational Function Regression
Abstract:
We consider the problem of finding optimal nonsequential designs for a
large class of regression models involving polynomials and rational
functions with heteroscedastic noise also given by a polynomial or
rational weight function. Since the design weights can be found easily by
existing methods once the support is known, we concentrate on determining
the support of the optimal design. The proposed method treats D-, E-, A-,
and Φp-optimal designs in a unified
manner, and generates a polynomial whose zeros are the support points of
the optimal approximate design, generalizing a number of previously known
results of the same flavor. The method is based on a mathematical
optimization model that can incorporate various criteria of optimality and
can be solved efficiently by well-established numerical optimization
methods. In contrast to optimization-based methods previously proposed for
the solution of similar design problems, our method also has theoretical
guarantee of its algorithmic efficiency; in concordance with the theory,
the actual running times of all numerical examples considered in the paper
are negligible. The numerical stability of the method is demonstrated in
an example involving high-degree polynomials. As a corollary, an upper
bound on the size of the support set of the minimally supported optimal
designs is also found.
Journal: Journal of the American Statistical Association
Pages: 400-411
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2012.656035
File-URL: http://hdl.handle.net/10.1080/01621459.2012.656035
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:400-411
Template-Type: ReDIF-Article 1.0
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Yingying Li
Author-X-Name-First: Yingying
Author-X-Name-Last: Li
Author-Name: Ke Yu
Author-X-Name-First: Ke
Author-X-Name-Last: Yu
Title: Vast Volatility Matrix Estimation Using High-Frequency Data for Portfolio Selection
Abstract:
Portfolio allocation with a gross-exposure constraint is an effective
method to increase the efficiency and stability of portfolio selection
among a vast pool of assets, as demonstrated by Fan, Zhang, and Yu. The
required high-dimensional volatility matrix can be estimated by using
high-frequency financial data. This enables us to better adapt to the
local volatilities and local correlations among a vast number of assets
and to increase significantly the sample size for estimating the
volatility matrix. This article studies the volatility matrix estimation
using high-dimensional, high-frequency data from the perspective of
portfolio selection. Specifically, we propose the use of
“pairwise-refresh time” and “all-refresh time”
methods based on the concept of “refresh time” proposed by
Barndorff-Nielsen, Hansen, Lunde, and Shephard for the estimation of vast
covariance matrix and compare their merits in the portfolio selection. We
establish the concentration inequalities of the estimates, which guarantee
desirable properties of the estimated volatility matrix in vast asset
allocation with gross-exposure constraints. Extensive numerical studies
are made via carefully designed simulations. Compared with the methods
based on low-frequency daily data, our methods can capture the most recent
trend of the time-varying volatility and correlation, and hence provide
more accurate guidance for the portfolio allocation in the next time period.
The advantage of using high-frequency data is significant in our
simulation and empirical studies, which consist of 50 simulated assets and
30 constituent stocks of Dow Jones Industrial Average index.
Journal: Journal of the American Statistical Association
Pages: 412-428
Issue: 497
Volume: 107
Year: 2012
Month: 3
X-DOI: 10.1080/01621459.2012.656041
File-URL: http://hdl.handle.net/10.1080/01621459.2012.656041
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:497:p:412-428
Template-Type: ReDIF-Article 1.0
Author-Name: Lane F. Burgette
Author-X-Name-First: Lane F.
Author-X-Name-Last: Burgette
Author-Name: Jerome P. Reiter
Author-X-Name-First: Jerome P.
Author-X-Name-Last: Reiter
Title: Nonparametric Bayesian Multiple Imputation for Missing Data Due to Mid-Study Switching of Measurement Methods
Abstract:
Investigators often change how variables are measured during the middle
of data-collection, for example, in hopes of obtaining greater accuracy or
reducing costs. The resulting data comprise sets of observations measured
on two (or more) different scales, which complicates interpretation and
can create bias in analyses that rely directly on the differentially
measured variables. We develop approaches based on multiple imputation for
handling mid-study changes in measurement for settings without calibration
data, that is, no subjects are measured on both (all) scales. This setting
creates a seemingly insurmountable problem for multiple imputation: since
the measurements never appear jointly, there is no information in the data
about their association. We resolve the problem by making an often
scientifically reasonable assumption that each measurement regime
accurately ranks the samples but on differing scales, so that, for
example, an individual at the qth percentile on one scale
should be at about the qth percentile on the other scale.
We use rank-preservation assumptions to develop three imputation
strategies that flexibly transform measurements made in one scale to
measurements made in another: a Markov chain Monte Carlo (MCMC)-free
approach based on permuting ranks of measurements, and two approaches
based on dependent Dirichlet process (DDP) mixture models for imputing
values conditional on covariates. We use simulations to illustrate
conditions under which each strategy performs well, and present guidance
on when to apply each. We apply these methods to a study of birth outcomes
in which investigators collected mothers’ blood samples to measure
levels of environmental contaminants. Midway through data ascertainment,
the study switched from one analytical lab to another. The distributions
of blood lead levels differ greatly across the two labs, suggesting that
the labs report measurements according to different scales. We use
nonparametric Bayesian imputation models to obtain sets of plausible
measurements on a common scale, and estimate quantile regressions of birth
weight on various environmental contaminants.
Journal: Journal of the American Statistical Association
Pages: 439-449
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2011.643713
File-URL: http://hdl.handle.net/10.1080/01621459.2011.643713
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:439-449
Template-Type: ReDIF-Article 1.0
Author-Name: Paolo Frumento
Author-X-Name-First: Paolo
Author-X-Name-Last: Frumento
Author-Name: Fabrizia Mealli
Author-X-Name-First: Fabrizia
Author-X-Name-Last: Mealli
Author-Name: Barbara Pacini
Author-X-Name-First: Barbara
Author-X-Name-Last: Pacini
Author-Name: Donald B. Rubin
Author-X-Name-First: Donald B.
Author-X-Name-Last: Rubin
Title: Evaluating the Effect of Training on Wages in the Presence of Noncompliance, Nonemployment, and Missing Outcome Data
Abstract:
The effects of a job training program, Job Corps, on both employment and
wages are evaluated using data from a randomized study. Principal
stratification is used to address, simultaneously, the complications of
noncompliance, wages that are only partially defined because of
nonemployment, and unintended missing outcomes. The first two
complications are of substantive interest, whereas the third is a
nuisance. The objective is to find a parsimonious model that can be used
to inform public policy. We conduct a likelihood-based analysis using
finite mixture models estimated by the expectation-maximization (EM)
algorithm. We maintain an exclusion restriction assumption for the effect
of assignment on employment and wages for noncompliers, but not on
missingness. We provide estimates under the “missing at
random” assumption, and assess the robustness of our results to
deviations from it. The plausibility of meaningful restrictions is
investigated by means of scaled log-likelihood ratio statistics.
Substantive conclusions include the following. For compliers, the effect
on employment is negative in the short term; it becomes positive in the
long term, but these effects are small at best. For always employed
compliers, that is, compliers who are employed whether trained or not
trained, positive effects on wages are found at all time periods. Our
analysis reveals that background characteristics of individuals differ
markedly across the principal strata. We found evidence that the program
should have been better targeted, in the sense of being designed
differently for different groups of people, and specific suggestions are
offered. Previous analyses of this dataset, which did not address all
complications in a principled manner, led to less nuanced conclusions
about Job Corps.
Journal: Journal of the American Statistical Association
Pages: 450-466
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2011.643719
File-URL: http://hdl.handle.net/10.1080/01621459.2011.643719
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:450-466
Template-Type: ReDIF-Article 1.0
Author-Name: Earvin Balderama
Author-X-Name-First: Earvin
Author-X-Name-Last: Balderama
Author-Name: Frederic Paik Schoenberg
Author-X-Name-First: Frederic Paik
Author-X-Name-Last: Schoenberg
Author-Name: Erin Murray
Author-X-Name-First: Erin
Author-X-Name-Last: Murray
Author-Name: Philip W. Rundel
Author-X-Name-First: Philip W.
Author-X-Name-Last: Rundel
Title: Application of Branching Models in the Study of Invasive Species
Abstract:
Earthquake occurrences are often described using a class of branching
models called epidemic-type aftershock sequence (ETAS) models. The name
derives from the fact that the model allows earthquakes to cause
aftershocks, and then those aftershocks may induce subsequent aftershocks,
and so on. Despite their value in seismology, such models have not
previously been used in studying the incidence of invasive plant and
animal species. Here, we apply ETAS models to study the spread of an
invasive species in Costa Rica (Musa velutina, or red
banana). One challenge in this ecological application is that fitting the
model requires the originations of the plants, which are not observed but
may be estimated using field data on the heights of the plants on a given
date and their empirical growth rates. We then characterize the estimated
spatial-temporal rate of spread of red banana plants using a space-time
ETAS model.
Journal: Journal of the American Statistical Association
Pages: 467-476
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2011.641402
File-URL: http://hdl.handle.net/10.1080/01621459.2011.641402
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:467-476
Template-Type: ReDIF-Article 1.0
Author-Name: Giseon Heo
Author-X-Name-First: Giseon
Author-X-Name-Last: Heo
Author-Name: Jennifer Gamble
Author-X-Name-First: Jennifer
Author-X-Name-Last: Gamble
Author-Name: Peter T. Kim
Author-X-Name-First: Peter T.
Author-X-Name-Last: Kim
Title: Topological Analysis of Variance and the Maxillary Complex
Abstract:
It is common to reduce the dimensionality of data before applying
classical multivariate analysis techniques in statistics. Persistent
homology, a recent development in computational topology, has been shown
to be useful for analyzing high-dimensional (nonlinear) data. In this
article, we connect computational topology with the traditional analysis
of variance and demonstrate the value of combining these approaches on a
three-dimensional orthodontic landmark dataset derived from the maxillary
complex. Indeed, combining appropriate techniques of both persistent
homology and analysis of variance results in a better understanding of the
data’s nonlinear features over and above what could have been
achieved by classical means. Supplementary material for this article is
available online.
Journal: Journal of the American Statistical Association
Pages: 477-492
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2011.641430
File-URL: http://hdl.handle.net/10.1080/01621459.2011.641430
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:477-492
Template-Type: ReDIF-Article 1.0
Author-Name: Lu Wang
Author-X-Name-First: Lu
Author-X-Name-Last: Wang
Author-Name: Andrea Rotnitzky
Author-X-Name-First: Andrea
Author-X-Name-Last: Rotnitzky
Author-Name: Xihong Lin
Author-X-Name-First: Xihong
Author-X-Name-Last: Lin
Author-Name: Randall E. Millikan
Author-X-Name-First: Randall E.
Author-X-Name-Last: Millikan
Author-Name: Peter F. Thall
Author-X-Name-First: Peter F.
Author-X-Name-Last: Thall
Title: Evaluation of Viable Dynamic Treatment Regimes in a Sequentially Randomized Trial of Advanced Prostate Cancer
Abstract:
We present new statistical analyses of data arising from a clinical trial
designed to compare two-stage dynamic treatment regimes (DTRs) for
advanced prostate cancer. The trial protocol mandated that patients be
initially randomized among four chemotherapies, and that those who
responded poorly be re-randomized to one of the remaining candidate
therapies. The primary aim was to compare the DTRs’ overall success
rates, with success defined by the occurrence of successful responses in
each of two consecutive courses of the patient’s therapy. Of the
150 study participants, 47 did not complete their therapy as per the
algorithm. However, 35 of them did so for reasons that precluded further
chemotherapy, that is, toxicity and/or progressive disease. Consequently,
rather than comparing the overall success rates of the DTRs in the
unrealistic event that these patients had remained on their assigned
chemotherapies, we conducted an analysis that compared viable switch rules
defined by the per-protocol rules but with the additional provision that
patients who developed toxicity or progressive disease switch to a
non-prespecified therapeutic or palliative strategy. This modification
involved consideration of bivariate per-course outcomes encoding both
efficacy and toxicity. We used numerical scores elicited from the
trial’s principal investigator to quantify the clinical
desirability of each bivariate per-course outcome, and defined one
endpoint as their average over all courses of treatment. Two other simpler
sets of scores as well as log survival time were also used as endpoints.
Estimation of each DTR-specific mean score was conducted using inverse
probability weighted methods that assumed that missingness in the 12
remaining dropouts was informative but explainable in that it only
depended on past recorded data. We conducted additional worst- and
best-case analyses to evaluate sensitivity of our findings to extreme
departures from the explainable dropout assumption.
Journal: Journal of the American Statistical Association
Pages: 493-508
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2011.641416
File-URL: http://hdl.handle.net/10.1080/01621459.2011.641416
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:493-508
Template-Type: ReDIF-Article 1.0
Author-Name: Daniel Almirall
Author-X-Name-First: Daniel
Author-X-Name-Last: Almirall
Author-Name: Daniel J. Lizotte
Author-X-Name-First: Daniel J.
Author-X-Name-Last: Lizotte
Author-Name: Susan A. Murphy
Author-X-Name-First: Susan A.
Author-X-Name-Last: Murphy
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 509-512
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.665615
File-URL: http://hdl.handle.net/10.1080/01621459.2012.665615
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:509-512
Template-Type: ReDIF-Article 1.0
Author-Name: Paul Chaffee
Author-X-Name-First: Paul
Author-X-Name-Last: Chaffee
Author-Name: Mark van der Laan
Author-X-Name-First: Mark
Author-X-Name-Last: van der Laan
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 513-517
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.665197
File-URL: http://hdl.handle.net/10.1080/01621459.2012.665197
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:513-517
Template-Type: ReDIF-Article 1.0
Author-Name: Lu Wang
Author-X-Name-First: Lu
Author-X-Name-Last: Wang
Author-Name: Andrea Rotnitzky
Author-X-Name-First: Andrea
Author-X-Name-Last: Rotnitzky
Author-Name: Xihong Lin
Author-X-Name-First: Xihong
Author-X-Name-Last: Lin
Author-Name: Randall E. Millikan
Author-X-Name-First: Randall E.
Author-X-Name-Last: Millikan
Author-Name: Peter F. Thall
Author-X-Name-First: Peter F.
Author-X-Name-Last: Thall
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 518-520
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.665198
File-URL: http://hdl.handle.net/10.1080/01621459.2012.665198
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:518-520
Template-Type: ReDIF-Article 1.0
Author-Name: Michael E. Sobel
Author-X-Name-First: Michael E.
Author-X-Name-Last: Sobel
Title: Does Marriage Boost Men’s Wages?: Identification of Treatment Effects in Fixed Effects Regression Models for Panel Data
Abstract:
Social scientists have generated a large and inconclusive literature on
the effect(s) of marriage on men’s wages. Researchers have
hypothesized that the wage premium enjoyed by married men may reflect both
a tendency for more productive men to marry and an effect of marriage on
productivity. To sort out these explanations, researchers have used fixed
effects regression models for panel data to adjust for selection on
unobserved time-invariant confounders, interpreting coefficients for the
time-varying marriage variables as effects. However, they did not define
these effects or give conditions under which the regression coefficients
would warrant a causal interpretation. Consequently, they failed to
appropriately adjust for important time-varying confounders and
misinterpreted their results. Regression models for panel data with
unobserved time-invariant confounders are also widely used in many other
policy-relevant contexts and the same problems arise there. This article
draws on recent statistical work on causal inference with longitudinal
data to clarify these problems and help researchers use appropriate
methods to model their data. A basic set of treatment effects is defined
and used to define derived effects. Causal models for panel data with
unobserved time-invariant confounders are defined and the treatment
effects are reexpressed in terms of these models. Ignorability conditions
under which the parameters of the causal models are identified from the
regression models are given. Even when these hold, a number of interesting
and important treatment effects are typically not identified.
Journal: Journal of the American Statistical Association
Pages: 521-529
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2011.646917
File-URL: http://hdl.handle.net/10.1080/01621459.2011.646917
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:521-529
Template-Type: ReDIF-Article 1.0
Author-Name: Xi Luo
Author-X-Name-First: Xi
Author-X-Name-Last: Luo
Author-Name: Dylan S. Small
Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small
Author-Name: Chiang-Shan R. Li
Author-X-Name-First: Chiang-Shan R.
Author-X-Name-Last: Li
Author-Name: Paul R. Rosenbaum
Author-X-Name-First: Paul R.
Author-X-Name-Last: Rosenbaum
Title: Inference With Interference Between Units in an fMRI Experiment of Motor Inhibition
Abstract:
An experimental unit is an opportunity to randomly apply or withhold a
treatment. There is interference between units if the application of the
treatment to one unit may also affect other units. In cognitive
neuroscience, a common form of experiment presents a sequence of stimuli
or requests for cognitive activity at random to each experimental subject
and measures biological aspects of brain activity that follow these
requests. Each subject is then many experimental units, and interference
between units within an experimental subject is likely, in part because
the stimuli follow one another quickly and in part because human subjects
learn or become experienced or primed or bored as the experiment proceeds.
We use a recent functional magnetic resonance imaging (fMRI) experiment
concerned with the inhibition of motor activity to illustrate and further
develop recently proposed methodology for inference in the presence of
interference. A simulation evaluates the power of competing procedures.
Journal: Journal of the American Statistical Association
Pages: 530-541
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.655954
File-URL: http://hdl.handle.net/10.1080/01621459.2012.655954
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:530-541
Template-Type: ReDIF-Article 1.0
Author-Name: Li Li
Author-X-Name-First: Li
Author-X-Name-Last: Li
Author-Name: Joseph J. Eron
Author-X-Name-First: Joseph J.
Author-X-Name-Last: Eron
Author-Name: Heather Ribaudo
Author-X-Name-First: Heather
Author-X-Name-Last: Ribaudo
Author-Name: Roy M. Gulick
Author-X-Name-First: Roy M.
Author-X-Name-Last: Gulick
Author-Name: Brent A. Johnson
Author-X-Name-First: Brent A.
Author-X-Name-Last: Johnson
Title: Evaluating the Effect of Early Versus Late ARV Regimen Change if Failure on an Initial Regimen: Results From the AIDS Clinical Trials Group Study A5095
Abstract:
The current goal of initial antiretroviral (ARV) therapy is suppression
of plasma human immunodeficiency virus (HIV)-1 RNA levels to below 200
copies per milliliter. A proportion of HIV-infected patients who initiate
antiretroviral therapy in clinical practice or antiretroviral clinical
trials either fail to suppress HIV-1 RNA or have HIV-1 RNA levels rebound
on therapy. Frequently, these patients have sustained CD4 cell count
responses and limited or no clinical symptoms and, therefore, have
potentially limited indications for altering therapy, which they may be
tolerating well despite increased viral replication. On the other hand,
increased viral replication on therapy leads to selection of resistance
mutations to the antiretroviral agents comprising their therapy and
potentially cross-resistance to other agents in the same class decreasing
the likelihood of response to subsequent antiretroviral therapy. The
optimal time to switch antiretroviral therapy to ensure sustained
virologic suppression and prevent clinical events in patients who have
rebound in their HIV-1 RNA, yet are stable, is not known. Randomized
clinical trials to compare early versus delayed switching have been
difficult to design and more difficult to enroll. In some clinical trials,
such as the AIDS Clinical Trials Group (ACTG) Study A5095, patients
randomized to initial antiretroviral treatment combinations, who fail to
suppress HIV-1 RNA or have a rebound of HIV-1 RNA on therapy are allowed
to switch from the initial ARV regimen to a new regimen, based on
clinician and patient decisions. We delineate a statistical framework to
estimate the effect of early versus late regimen change using data from
ACTG A5095 in the context of two-stage designs. In causal inference, a
large class of doubly robust estimators are derived through semiparametric
theory with applications to missing data problems. This class of
estimators is motivated through geometric arguments and relies on large
samples for good performance. By now, several authors have noted that a
doubly robust estimator may be suboptimal when the outcome model is
misspecified even if it is semiparametric efficient when the outcome
regression model is correctly specified. Through auxiliary variables,
two-stage designs, and within the contextual backdrop of our scientific
problem and clinical study, we propose improved doubly robust, locally
efficient estimators of a population mean and average causal effect for
early versus delayed switching to second-line ARV treatment regimens. Our
analysis of the ACTG A5095 data further demonstrates how methods that use
auxiliary variables can improve over methods that ignore them. Using the
methods developed here, we conclude that patients who switch within 8
weeks of virologic failure have better clinical outcomes, on average, than
patients who delay switching to a new second-line ARV regimen after
failing on the initial regimen. Ordinary statistical methods fail to find
such differences. This article has online supplementary material.
Journal: Journal of the American Statistical Association
Pages: 542-554
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2011.646932
File-URL: http://hdl.handle.net/10.1080/01621459.2011.646932
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:542-554
Template-Type: ReDIF-Article 1.0
Author-Name: Dulal K. Bhaumik
Author-X-Name-First: Dulal K.
Author-X-Name-Last: Bhaumik
Author-Name: Anup Amatya
Author-X-Name-First: Anup
Author-X-Name-Last: Amatya
Author-Name: Sharon-Lise T. Normand
Author-X-Name-First: Sharon-Lise T.
Author-X-Name-Last: Normand
Author-Name: Joel Greenhouse
Author-X-Name-First: Joel
Author-X-Name-Last: Greenhouse
Author-Name: Eloise Kaizar
Author-X-Name-First: Eloise
Author-X-Name-Last: Kaizar
Author-Name: Brian Neelon
Author-X-Name-First: Brian
Author-X-Name-Last: Neelon
Author-Name: Robert D. Gibbons
Author-X-Name-First: Robert D.
Author-X-Name-Last: Gibbons
Title: Meta-Analysis of Rare Binary Adverse Event Data
Abstract:
We examine the use of fixed-effects and random-effects moment-based
meta-analytic methods for analysis of binary adverse-event data. Special
attention is paid to the case of rare adverse events that are commonly
encountered in routine practice. We study estimation of model parameters
and between-study heterogeneity. In addition, we examine traditional
approaches to hypothesis testing of the average treatment effect and
detection of the heterogeneity of treatment effect across studies. We
derive three new methods, a simple (unweighted) average treatment effect
estimator, a new heterogeneity estimator, and a parametric bootstrapping
test for heterogeneity. We then study the statistical properties of both
the traditional and the new methods via simulation. We find that in
general, moment-based estimators of combined treatment effects and
heterogeneity are biased and the degree of bias is proportional to the
rarity of the event under study. The new methods eliminate much, but not
all, of this bias. The various estimators and hypothesis testing methods
are then compared and contrasted using an example dataset on treatment of
stable coronary artery disease.
Journal: Journal of the American Statistical Association
Pages: 555-567
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.664484
File-URL: http://hdl.handle.net/10.1080/01621459.2012.664484
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:555-567
Template-Type: ReDIF-Article 1.0
Author-Name: Hakmook Kang
Author-X-Name-First: Hakmook
Author-X-Name-Last: Kang
Author-Name: Hernando Ombao
Author-X-Name-First: Hernando
Author-X-Name-Last: Ombao
Author-Name: Crystal Linkletter
Author-X-Name-First: Crystal
Author-X-Name-Last: Linkletter
Author-Name: Nicole Long
Author-X-Name-First: Nicole
Author-X-Name-Last: Long
Author-Name: David Badre
Author-X-Name-First: David
Author-X-Name-Last: Badre
Title: Spatio-Spectral Mixed-Effects Model for Functional Magnetic Resonance Imaging Data
Abstract:
The goal of this article is to model cognitive control related activation
among predefined regions of interest (ROIs) of the human brain while
properly adjusting for the underlying spatio-temporal correlations.
Standard approaches to fMRI analysis do not simultaneously take into
account both the spatial and temporal correlations that are prevalent in
fMRI data. This is primarily due to the computational complexity of
estimating the spatio-temporal covariance matrix. More specifically, they
do not take into account multiscale spatial correlation (between-ROIs and
within-ROI). To address these limitations, we propose a spatio-spectral
mixed-effects model. Working in the spectral domain simplifies the
temporal covariance structure because the Fourier coefficients are
approximately uncorrelated across frequencies. Additionally, by
incorporating voxel-specific and ROI-specific random effects, the model is
able to capture the multiscale spatial covariance structure:
distance-dependent local correlation (within an ROI), and
distance-independent global correlation (between-ROIs).
Building on existing theory on linear mixed-effects models to conduct
estimation and inference, we applied our model to fMRI data to study
activation in prespecified ROIs in the prefrontal cortex and estimate the
correlation structure in the network. Simulation studies demonstrate that
ignoring the multiscale correlation leads to higher false positive error
rates.
Journal: Journal of the American Statistical Association
Pages: 568-577
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.664503
File-URL: http://hdl.handle.net/10.1080/01621459.2012.664503
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:568-577
Template-Type: ReDIF-Article 1.0
Author-Name: Thomas Barrios
Author-X-Name-First: Thomas
Author-X-Name-Last: Barrios
Author-Name: Rebecca Diamond
Author-X-Name-First: Rebecca
Author-X-Name-Last: Diamond
Author-Name: Guido W. Imbens
Author-X-Name-First: Guido W.
Author-X-Name-Last: Imbens
Author-Name: Michal Kolesár
Author-X-Name-First: Michal
Author-X-Name-Last: Kolesár
Title: Clustering, Spatial Correlations, and Randomization Inference
Abstract:
It is a standard practice in regression analyses to allow for clustering
in the error covariance matrix if the explanatory variable of interest
varies at a more aggregate level (e.g., the state level) than the units of
observation (e.g., individuals). Often, however, the structure of the
error covariance matrix is more complex, with correlations not vanishing
for units in different clusters. Here, we explore the implications of such
correlations for the actual and estimated precision of least squares
estimators. Our main theoretical result is that with equal-sized clusters,
if the covariate of interest is randomly assigned at the cluster level,
only accounting for nonzero covariances at the cluster level, and ignoring
correlations between clusters as well as differences in within-cluster
correlations, leads to valid confidence intervals. However, in the absence
of random assignment of the covariates, ignoring general correlation
structures may lead to biases in standard errors. We illustrate our
findings using the 5% public-use census data. Based on these results, we
recommend that researchers, as a matter of routine, explore the extent of
spatial correlations in explanatory variables beyond state-level
clustering.
Journal: Journal of the American Statistical Association
Pages: 578-591
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.682524
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682524
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:578-591
Template-Type: ReDIF-Article 1.0
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Jingjin Zhang
Author-X-Name-First: Jingjin
Author-X-Name-Last: Zhang
Author-Name: Ke Yu
Author-X-Name-First: Ke
Author-X-Name-Last: Yu
Title: Vast Portfolio Selection With Gross-Exposure Constraints
Abstract:
This article introduces large portfolio selection using
gross-exposure constraints. It shows that with gross-exposure constraints,
the empirically selected optimal portfolios based on estimated covariance
matrices have similar performance to the theoretical optimal ones and
there is no error accumulation effect from estimation of vast covariance
matrices. This gives theoretical justification to the empirical results by
Jagannathan and Ma. It also shows that the no-short-sale portfolio can be
improved by allowing some short positions. The applications to portfolio
selection, tracking, and improvements are also addressed. The utility of
our new approach is illustrated by simulation and empirical studies on the
100 Fama--French industrial portfolios and the 600 stocks randomly
selected from Russell 3000.
Journal: Journal of the American Statistical Association
Pages: 592-606
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.682825
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682825
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:592-606
Template-Type: ReDIF-Article 1.0
Author-Name: Alessandra Luati
Author-X-Name-First: Alessandra
Author-X-Name-Last: Luati
Author-Name: Tommaso Proietti
Author-X-Name-First: Tommaso
Author-X-Name-Last: Proietti
Author-Name: Marco Reale
Author-X-Name-First: Marco
Author-X-Name-Last: Reale
Title: The Variance Profile
Abstract:
The variance profile is defined as the power mean of the spectral density
function of a stationary stochastic process. It is a continuous and
nondecreasing function of the power parameter, p, which
returns the minimum of the spectrum
(p→−∞), the interpolation error
variance (harmonic mean, p=−1), the prediction
error variance (geometric mean, p=0), the unconditional
variance (arithmetic mean, p=1), and the maximum of the
spectrum (p→∞). The variance profile
provides a useful characterization of a stochastic process; we focus in
particular on the class of fractionally integrated processes. Moreover, it
enables a direct and immediate derivation of the Szegö-Kolmogorov
formula and the interpolation error variance formula. The article proposes
a nonparametric estimator of the variance profile based on the power mean
of the smoothed sample spectrum, and proves its consistency and its
asymptotic normality. From the empirical standpoint, we propose and
illustrate the use of the variance profile for estimating the long memory
parameter in climatological and financial time series and for assessing
structural change.
Journal: Journal of the American Statistical Association
Pages: 607-621
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.682832
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682832
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:607-621
Template-Type: ReDIF-Article 1.0
Author-Name: Viktor Todorov
Author-X-Name-First: Viktor
Author-X-Name-Last: Todorov
Author-Name: George Tauchen
Author-X-Name-First: George
Author-X-Name-Last: Tauchen
Title: Inverse Realized Laplace Transforms for Nonparametric Volatility Density Estimation in Jump-Diffusions
Abstract:
This article develops a nonparametric estimator of the stochastic
volatility density of a discretely observed Itô semimartingale in the
setting of an increasing time span and finer mesh of the observation grid.
There are two basic steps involved. The first step is aggregating the
high-frequency increments into the realized Laplace transform, which is a
robust nonparametric estimate of the underlying volatility Laplace
transform. The second step is using a regularized kernel to invert the
realized Laplace transform. These two steps are relatively quick and easy
to compute, so the nonparametric estimator is practicable. The article
also derives bounds for the mean squared error of the estimator. The
regularity conditions are sufficiently general to cover empirically
important cases such as level jumps and possible dependencies between
volatility moves and either diffusive or jump moves in the semimartingale.
The Monte Carlo analysis in this study indicates that the nonparametric
estimator is reliable and reasonably accurate in realistic estimation
contexts. An empirical application to 5-min data for three large-cap
stocks, 1997--2010, reveals the importance of big short-term volatility
spikes in generating high levels of stock price variability over and above
those induced by price jumps. The application also shows how to trace out
the dynamic response of the volatility density to both positive and
negative jumps in the stock price.
Journal: Journal of the American Statistical Association
Pages: 622-635
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.682854
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682854
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:622-635
Template-Type: ReDIF-Article 1.0
Author-Name: James O. Berger
Author-X-Name-First: James O.
Author-X-Name-Last: Berger
Author-Name: Jose M. Bernardo
Author-X-Name-First: Jose M.
Author-X-Name-Last: Bernardo
Author-Name: Dongchu Sun
Author-X-Name-First: Dongchu
Author-X-Name-Last: Sun
Title: Objective Priors for Discrete Parameter Spaces
Abstract:
This article considers the development of objective prior distributions
for discrete parameter spaces. Formal approaches to such
development—such as the reference prior
approach—often result in a constant prior for a discrete parameter,
which is questionable for problems that exhibit certain types of
structure. To take advantage of structure, this article proposes embedding
the original problem in a continuous problem that preserves the structure,
and then using standard reference prior theory to determine the
appropriate objective prior. Four different possibilities for this
embedding are explored, and applied to a population-size model, the
hypergeometric distribution, the multivariate hypergeometric distribution,
the binomial-beta distribution, and the binomial distribution. The
recommended objective priors for the first, third, and fourth problems are
new.
Journal: Journal of the American Statistical Association
Pages: 636-648
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.682538
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682538
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:636-648
Template-Type: ReDIF-Article 1.0
Author-Name: Valen E. Johnson
Author-X-Name-First: Valen E.
Author-X-Name-Last: Johnson
Author-Name: David Rossell
Author-X-Name-First: David
Author-X-Name-Last: Rossell
Title: Bayesian Model Selection in High-Dimensional Settings
Abstract:
Standard assumptions incorporated into Bayesian model selection
procedures result in procedures that are not competitive with commonly
used penalized likelihood methods. We propose modifications of these
methods by imposing nonlocal prior densities on model parameters. We show
that the resulting model selection procedures are consistent in linear
model settings when the number of possible covariates p
is bounded by the number of observations n, a property
that has not been extended to other model selection procedures. In
addition to consistently identifying the true model, the proposed
procedures provide accurate estimates of the posterior probability that
each identified model is correct. Through simulation studies, we
demonstrate that these model selection procedures perform as well or
better than commonly used penalized likelihood methods in a range of
simulation settings. Proofs of the primary theorems are provided in the
Supplementary Material that is available online.
Journal: Journal of the American Statistical Association
Pages: 649-660
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.682536
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682536
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:649-660
Template-Type: ReDIF-Article 1.0
Author-Name: Peter Hall
Author-X-Name-First: Peter
Author-X-Name-Last: Hall
Author-Name: Michael G. Schimek
Author-X-Name-First: Michael G.
Author-X-Name-Last: Schimek
Title: Moderate-Deviation-Based Inference for Random Degeneration in Paired Rank Lists
Abstract:
Consider a problem where N items (objects or
individuals) are judged by assessors using their perceptions of a set of
performance criteria, or alternatively by technical devices. In
particular, two assessors might rank the items between 1 and
N on the basis of relative performance, independently of
each other. We can aggregate the rank lists by assigning
one if the two assessors agree, and zero
otherwise, and we can modify this approach to make it robust against
irregularities. In this article, we consider methods and algorithms that
can be used to address this problem. We study their theoretical properties
in the case of a model based on nonstationary Bernoulli trials, and we
report on their numerical properties for both simulated and real data.
Journal: Journal of the American Statistical Association
Pages: 661-672
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.682539
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682539
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:661-672
Template-Type: ReDIF-Article 1.0
Author-Name: Wenguang Sun
Author-X-Name-First: Wenguang
Author-X-Name-Last: Sun
Author-Name: Alexander C. McLain
Author-X-Name-First: Alexander C.
Author-X-Name-Last: McLain
Title: Multiple Testing of Composite Null Hypotheses in Heteroscedastic Models
Abstract:
In large-scale studies, the true effect sizes often range continuously
from zero to small to large, and are observed with heteroscedastic errors.
In practical situations where the failure to reject small deviations from
the null is inconsequential, specifying an indifference region (or forming
composite null hypotheses) can greatly reduce the number of unimportant
discoveries in multiple testing. The heteroscedasticity issue poses new
challenges for multiple testing with composite nulls. In particular, the
conventional framework in multiple testing, which involves rescaling or
standardization, is likely to distort the scientific question. We propose
the concept of a composite null distribution for heteroscedastic models
and develop an optimal testing procedure that minimizes the false
nondiscovery rate, subject to a constraint on the false discovery rate.
The proposed approach is different from conventional methods in that the
effect size, statistical significance, and multiplicity issues are
addressed integrally. The external information of heteroscedastic errors
is incorporated for optimal simultaneous inference. The new features and
advantages of our approach are demonstrated using both simulated and real
data. The numerical studies demonstrate that our new procedure enjoys
superior performance with greater accuracy and better interpretability of
results.
Journal: Journal of the American Statistical Association
Pages: 673-687
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.664505
File-URL: http://hdl.handle.net/10.1080/01621459.2012.664505
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:673-687
Template-Type: ReDIF-Article 1.0
Author-Name: Liuquan Sun
Author-X-Name-First: Liuquan
Author-X-Name-Last: Sun
Author-Name: Xinyuan Song
Author-X-Name-First: Xinyuan
Author-X-Name-Last: Song
Author-Name: Jie Zhou
Author-X-Name-First: Jie
Author-X-Name-Last: Zhou
Author-Name: Lei Liu
Author-X-Name-First: Lei
Author-X-Name-Last: Liu
Title: Joint Analysis of Longitudinal Data With Informative Observation Times and a Dependent Terminal Event
Abstract:
In many longitudinal studies, repeated measures are often correlated with
observation times. Also, there may exist a dependent terminal event such
as death that stops the follow-up. In this article, we propose a new joint
model for the analysis of longitudinal data in the presence of both
informative observation times and a dependent terminal event via latent
variables. Estimating equation approaches are developed for parameter
estimation, and the resulting estimators are shown to be consistent and
asymptotically normal. In addition, some graphical and numerical
procedures are presented for model checking. Simulation studies
demonstrate that the proposed method performs well for practical settings.
An application to a medical cost study of chronic heart failure patients
from the University of Virginia Health System is provided.
Journal: Journal of the American Statistical Association
Pages: 688-700
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.682528
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682528
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:688-700
Template-Type: ReDIF-Article 1.0
Author-Name: Jianhui Zhou
Author-X-Name-First: Jianhui
Author-X-Name-Last: Zhou
Author-Name: Annie Qu
Author-X-Name-First: Annie
Author-X-Name-Last: Qu
Title: Informative Estimation and Selection of Correlation Structure for Longitudinal Data
Abstract:
Identifying an informative correlation structure is important in
improving estimation efficiency for longitudinal data. We approximate the
empirical estimator of the correlation matrix by groups of known basis
matrices that represent different correlation structures, and transform
the correlation structure selection problem to a covariate selection
problem. To address both the complexity and the informativeness of the
correlation matrix, we minimize an objective function that consists of two
parts: the difference between the empirical information and a model
approximation of the correlation matrix, and a penalty that penalizes
models with too many basis matrices. The unique feature of the proposed
estimation and selection of correlation structure is that it does not
require the specification of the likelihood function, and therefore it is
applicable for discrete longitudinal data. We carry out the proposed
method through a groupwise penalty strategy, which is able to identify
more complex structures. The proposed method possesses the oracle property
and selects the true correlation structure consistently. In addition, the
estimator of the correlation parameters follows a normal distribution
asymptotically. Simulation studies and a data example confirm that the
proposed method works effectively in estimating and selecting the true
structure in finite samples, and it enables improvement in estimation
efficiency by selecting the true structures.
Journal: Journal of the American Statistical Association
Pages: 701-710
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.682534
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682534
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:701-710
Template-Type: ReDIF-Article 1.0
Author-Name: Mian Huang
Author-X-Name-First: Mian
Author-X-Name-Last: Huang
Author-Name: Weixin Yao
Author-X-Name-First: Weixin
Author-X-Name-Last: Yao
Title: Mixture of Regression Models With Varying Mixing Proportions: A Semiparametric Approach
Abstract:
In this article, we study a class of semiparametric mixtures of
regression models, in which the regression functions are linear functions
of the predictors, but the mixing proportions are smoothing functions of a
covariate. We propose a one-step backfitting estimation procedure to
achieve the optimal convergence rates for both regression parameters and
the nonparametric functions of mixing proportions. We derive the
asymptotic bias and variance of the one-step estimate, and further
establish its asymptotic normality. A modified
expectation-maximization-type (EM-type) estimation procedure is
investigated. We show that the modified EM algorithms preserve the
asymptotic ascent property. Numerical simulations are conducted to examine
the finite sample performance of the estimation procedures. The proposed
methodology is further illustrated via an analysis of a real dataset.
Journal: Journal of the American Statistical Association
Pages: 711-724
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.682541
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682541
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:711-724
Template-Type: ReDIF-Article 1.0
Author-Name: Peng Wang
Author-X-Name-First: Peng
Author-X-Name-Last: Wang
Author-Name: Guei-feng Tsai
Author-X-Name-First: Guei-feng
Author-X-Name-Last: Tsai
Author-Name: Annie Qu
Author-X-Name-First: Annie
Author-X-Name-Last: Qu
Title: Conditional Inference Functions for Mixed-Effects Models With Unspecified Random-Effects Distribution
Abstract:
In longitudinal studies, mixed-effects models are important for
addressing subject-specific effects. However, most existing approaches
assume a normal distribution for the random effects, and this could affect
the bias and efficiency of the fixed-effects estimator. Even in cases
where the estimation of the fixed effects is robust with a misspecified
distribution of the random effects, the estimation of the random effects
could be invalid. We propose a new approach to estimate fixed and random
effects using conditional quadratic inference functions (QIFs). The new
approach does not require the specification of likelihood functions or a
normality assumption for random effects. It can also accommodate serial
correlation between observations within the same cluster, in addition to
mixed-effects modeling. Other advantages include not requiring the
estimation of the unknown variance components associated with the random
effects, or the nuisance parameters associated with the working
correlations. We establish asymptotic results for the fixed-effect
parameter estimators that do not rely on the consistency of the
random-effect estimators. Real data examples and simulations are used to
compare the new approach with the penalized quasi-likelihood (PQL)
approach, and SAS GLIMMIX and nonlinear mixed-effects model (NLMIXED)
procedures. Supplemental materials including technical details are
available online.
Journal: Journal of the American Statistical Association
Pages: 725-736
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.665199
File-URL: http://hdl.handle.net/10.1080/01621459.2012.665199
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:725-736
Template-Type: ReDIF-Article 1.0
Author-Name: Jun Li
Author-X-Name-First: Jun
Author-X-Name-Last: Li
Author-Name: Juan A. Cuesta-Albertos
Author-X-Name-First: Juan A.
Author-X-Name-Last: Cuesta-Albertos
Author-Name: Regina Y. Liu
Author-X-Name-First: Regina Y.
Author-X-Name-Last: Liu
Title: DD-Classifier: Nonparametric Classification Procedure Based on DD-Plot
Abstract:
Using the DD-plot (depth vs. depth plot), we introduce a
new nonparametric classification algorithm and call it
DD-classifier. The algorithm is completely nonparametric,
and it requires no prior knowledge of the underlying distributions or the
form of the separating curve. Thus, it can be applied to a wide range of
classification problems. The algorithm is completely data driven and its
classification outcome can be easily visualized in a two-dimensional plot
regardless of the dimension of the data. Moreover, it has the advantage of
bypassing the estimation of underlying parameters such as means and
scales, which is often required by the existing classification procedures.
We study the asymptotic properties of the DD-classifier
and its misclassification rate. Specifically, we show that
DD-classifier is asymptotically equivalent to the Bayes
rule under suitable conditions, and it can achieve Bayes error for a
family broader than elliptical distributions. The performance of the
classifier is also examined using simulated and real datasets. Overall,
the DD-classifier performs well across a broad range of
settings, and compares favorably with existing classifiers. It can also be
robust against outliers or contamination.
Journal: Journal of the American Statistical Association
Pages: 737-753
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.688462
File-URL: http://hdl.handle.net/10.1080/01621459.2012.688462
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:737-753
Template-Type: ReDIF-Article 1.0
Author-Name: Ute Hahn
Author-X-Name-First: Ute
Author-X-Name-Last: Hahn
Title: A Studentized Permutation Test for the Comparison of Spatial Point Patterns
Abstract:
In this study, a new test is proposed for the hypothesis that two (or
more) observed point patterns are realizations of the same spatial point
process model. To this end, the point patterns are divided into disjoint
quadrats, on each of which an estimate of Ripley’s
K-function is calculated. The two groups of empirical
K-functions are compared by a permutation test using a
Studentized test statistic. The proposed test performs convincingly in
terms of empirical level and power in a simulation study, even for point
patterns where the K-function estimates on neighboring
subsamples are not strictly exchangeable. It also shows improved behavior
compared with a test suggested by Diggle et al. for the comparison of
groups of independently replicated point patterns. In an application to
two point patterns from pathology that represent capillary positions in
sections of healthy and cancerous tissue, our Studentized permutation test
indicates statistical significance, although the patterns cannot be
clearly distinguished by the eye.
Journal: Journal of the American Statistical Association
Pages: 754-764
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.688463
File-URL: http://hdl.handle.net/10.1080/01621459.2012.688463
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:754-764
Template-Type: ReDIF-Article 1.0
Author-Name: Ta-Hsin Li
Author-X-Name-First: Ta-Hsin
Author-X-Name-Last: Li
Title: Quantile Periodograms
Abstract:
Two periodogram-like functions, called quantile periodograms, are
introduced for spectral analysis of time series. The quantile periodograms
are constructed from trigonometric quantile regression and motivated by
different interpretations of the ordinary periodogram. Analytical and
numerical results demonstrate the capability of the quantile periodograms
for detecting hidden periodicity in the quantiles and for providing an
additional view of time-series data. A connection between the quantile
periodograms and the so-called level-crossing spectrum is established
through an asymptotic analysis.
Journal: Journal of the American Statistical Association
Pages: 765-776
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.682815
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682815
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:765-776
Template-Type: ReDIF-Article 1.0
Author-Name: Thomas J. Fisher
Author-X-Name-First: Thomas J.
Author-X-Name-Last: Fisher
Author-Name: Colin M. Gallagher
Author-X-Name-First: Colin M.
Author-X-Name-Last: Gallagher
Title: New Weighted Portmanteau Statistics for Time Series Goodness of Fit Testing
Abstract:
We exploit ideas from high-dimensional data analysis to derive new
portmanteau tests that are based on the trace of the square of the
mth order autocorrelation matrix. The resulting
statistics are weighted sums of the squares of the sample autocorrelation
coefficients that, unlike many other tests appearing in the literature,
are numerically stable even when the number of lags considered is
relatively close to the sample size. The statistics behave asymptotically
as a linear combination of chi-squared random variables and their
asymptotic distribution can be approximated by a gamma distribution. The
proposed tests are modified to check for nonlinearity and to check the
adequacy of a fitted nonlinear model. Simulation evidence indicates that
the proposed goodness of fit tests tend to have higher power than other
tests appearing in the literature, particularly in detecting long-memory
nonlinear models. The efficacy of the proposed methods is demonstrated by
investigating nonlinear effects in Apple, Inc., and Nikkei-300 daily
returns during the 2006--2007 calendar years. The supplementary materials
for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 777-787
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.688465
File-URL: http://hdl.handle.net/10.1080/01621459.2012.688465
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:777-787
Template-Type: ReDIF-Article 1.0
Author-Name: Christopher R. Genovese
Author-X-Name-First: Christopher R.
Author-X-Name-Last: Genovese
Author-Name: Marco Perone-Pacifico
Author-X-Name-First: Marco
Author-X-Name-Last: Perone-Pacifico
Author-Name: Isabella Verdinelli
Author-X-Name-First: Isabella
Author-X-Name-Last: Verdinelli
Author-Name: Larry Wasserman
Author-X-Name-First: Larry
Author-X-Name-Last: Wasserman
Title: The Geometry of Nonparametric Filament Estimation
Abstract:
We consider the problem of estimating filamentary structure from
d-dimensional point process data. We make some
connections with computational geometry and develop nonparametric methods
for estimating the filaments. We show that, under weak conditions, the
filaments have a simple geometric representation as the medial axis of the
data distribution’s support. Our methods convert an estimator of
the support’s boundary into an estimator of the filaments. We also
find the rates of convergence of our estimators. Proofs of all results are
in the supplementary material available online.
Journal: Journal of the American Statistical Association
Pages: 788-799
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.682527
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682527
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:788-799
Template-Type: ReDIF-Article 1.0
Author-Name: Sylvain Sardy
Author-X-Name-First: Sylvain
Author-X-Name-Last: Sardy
Title: Smooth Blockwise Iterative Thresholding: A Smooth Fixed Point Estimator Based on the Likelihood’s Block Gradient
Abstract:
The proposed smooth blockwise iterative thresholding estimator (SBITE) is
a model selection technique defined as a fixed point reached by iterating
a likelihood gradient-based thresholding function. The smooth James--Stein
thresholding function has two regularization parameters λ and
ν, and a smoothness parameter s. It enjoys
smoothness like ridge regression and selects variables like lasso.
Focusing on Gaussian regression, we show that SBITE is uniquely defined,
and that its Stein unbiased risk estimate is a smooth function of λ
and ν, for better selection of the two regularization parameters. We
perform a Monte Carlo simulation to investigate the predictive and oracle
properties of this smooth version of adaptive lasso. The motivation is a
gravitational wave burst detection problem from several concomitant time
series. A nonparametric wavelet-based estimator is developed to combine
information from all captors by block-thresholding multiresolution
coefficients. We study how the smoothness parameter s
tempers the erraticity of the risk estimate, and derive a universal
threshold, an information criterion, and an oracle inequality in this
canonical setting.
Journal: Journal of the American Statistical Association
Pages: 800-813
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.664527
File-URL: http://hdl.handle.net/10.1080/01621459.2012.664527
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:800-813
Template-Type: ReDIF-Article 1.0
Author-Name: Daniel Percival
Author-X-Name-First: Daniel
Author-X-Name-Last: Percival
Title: Structured, Sparse Aggregation
Abstract:
This article introduces a method for aggregating many least-squares
estimators so that the resulting estimate has two properties: sparsity and
structure. That is, only a few candidate covariates are used in the
resulting model, and the selected covariates follow some structure over
the candidate covariates that is assumed to be known a priori. Although
sparsity is well studied in many settings, including aggregation,
structured sparse methods are still emerging. We demonstrate a general
framework for structured sparse aggregation that allows for a wide variety
of structures, including overlapping grouped structures and general
structural penalties defined as set functions on the set of covariates. We
show that such estimators satisfy structured sparse oracle
inequalities—their finite sample risk adapts to the structured
sparsity of the target. These inequalities reveal that under suitable
settings, the structured sparse estimator performs at least as well as,
and potentially much better than, a sparse aggregation estimator. We
empirically establish the effectiveness of the method using simulation and
an application to HIV drug resistance.
Journal: Journal of the American Statistical Association
Pages: 814-823
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.682542
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682542
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:814-823
Template-Type: ReDIF-Article 1.0
Author-Name: W. Brannath
Author-X-Name-First: W.
Author-X-Name-Last: Brannath
Author-Name: G. Gutjahr
Author-X-Name-First: G.
Author-X-Name-Last: Gutjahr
Author-Name: P. Bauer
Author-X-Name-First: P.
Author-X-Name-Last: Bauer
Title: Probabilistic Foundation of Confirmatory Adaptive Designs
Abstract:
Adaptive designs allow the investigator of a confirmatory trial to react
to unforeseen developments by changing the design. This broad flexibility
comes at the price of a complex statistical model where important
components, such as the adaptation rule, remain unspecified. It has thus
been doubted whether Type I error control can be guaranteed in general
adaptive designs. This criticism is fully justified as long as the
probabilistic framework on which an adaptive design is based remains vague
and implicit. Therefore, an indispensable step lies in the clarification
of the probabilistic fundamentals of adaptive testing. We demonstrate that
the two main principles of adaptive designs, namely the conditional Type I
error rate and the conditional invariance principle, will provide Type I
error rate control, if the conditional distribution of the second-stage
data, given the first-stage data, can be described in terms of a
regression model. A similar assumption is required for regression analysis
where the distribution of the covariates is a nuisance parameter and the
model needs to be identifiable independently from the covariate
distribution. We further show that under the assumption of a regression
model, the events of an arbitrary adaptive design can be embedded into a
formal probability space without the need of posing any restrictions on
the adaptation rule. As a consequence of our results, artificial
constraints that had to be imposed on the investigator only for
mathematical tractability of the model are no longer necessary.
Journal: Journal of the American Statistical Association
Pages: 824-832
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.682540
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682540
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:824-832
Template-Type: ReDIF-Article 1.0
Author-Name: Alberto Abadie
Author-X-Name-First: Alberto
Author-X-Name-Last: Abadie
Author-Name: Guido W. Imbens
Author-X-Name-First: Guido W.
Author-X-Name-Last: Imbens
Title: A Martingale Representation for Matching Estimators
Abstract:
Matching estimators are widely used in statistical data analysis.
However, the large sample distribution of matching estimators has been
derived only for particular cases. This article establishes a martingale
representation for matching estimators. This representation allows the use
of martingale limit theorems to derive the large sample distribution of
matching estimators. As an illustration of the applicability of the
theory, we derive the asymptotic distribution of a matching estimator when
matching is carried out without replacement, a result previously
unavailable in the literature. In addition, we apply the techniques
proposed in this article to derive a correction to the standard error of a
sample mean when missing data are imputed using the “hot
deck,” a matching imputation method widely used in the Current
Population Survey (CPS) and other large surveys in the social sciences. We
demonstrate the empirical relevance of our methods using two Monte Carlo
designs based on actual datasets. In these Monte Carlo exercises, the
large sample distribution of matching estimators derived in this article
provides an accurate approximation to the small sample behavior of these
estimators. In addition, our simulations show that standard errors that do
not take into account hot-deck imputation of missing data may be severely
downward biased, while standard errors that incorporate the correction for
hot-deck imputation perform extremely well. This article has online
supplementary materials.
Journal: Journal of the American Statistical Association
Pages: 833-843
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.682537
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682537
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:833-843
Template-Type: ReDIF-Article 1.0
Author-Name: Pierre Perron
Author-X-Name-First: Pierre
Author-X-Name-Last: Perron
Author-Name: Tomoyoshi Yabu
Author-X-Name-First: Tomoyoshi
Author-X-Name-Last: Yabu
Title: Testing for Trend in the Presence of Autoregressive Error: A Comment
Journal: Journal of the American Statistical Association
Pages: 844-844
Issue: 498
Volume: 107
Year: 2012
Month: 6
X-DOI: 10.1080/01621459.2012.668638
File-URL: http://hdl.handle.net/10.1080/01621459.2012.668638
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:498:p:844-844
Template-Type: ReDIF-Article 1.0
Author-Name: Ioanna Manolopoulou
Author-X-Name-First: Ioanna
Author-X-Name-Last: Manolopoulou
Author-Name: Melanie P. Matheu
Author-X-Name-First: Melanie P.
Author-X-Name-Last: Matheu
Author-Name: Michael D. Cahalan
Author-X-Name-First: Michael D.
Author-X-Name-Last: Cahalan
Author-Name: Mike West
Author-X-Name-First: Mike
Author-X-Name-Last: West
Author-Name: Thomas B. Kepler
Author-X-Name-First: Thomas B.
Author-X-Name-Last: Kepler
Title: Bayesian Spatio-Dynamic Modeling in Cell Motility Studies: Learning Nonlinear Taxic Fields Guiding the Immune Response
Abstract:
We develop and analyze models of the spatio-temporal organization of
lymphocytes in the lymph nodes and spleen. The spatial dynamics of these
immune system white blood cells are influenced by biochemical fields and
represent key components of the overall immune response to vaccines and
infections. A primary goal is to learn about the structure of these fields
that fundamentally shape the immune response. We define dynamic models of
single-cell motion involving nonparametric representations of scalar
potential fields underlying the directional biochemical fields that guide
cellular motion. Bayesian hierarchical extensions define multicellular
models for aggregating models and data on colonies of cells. Analysis via
customized Markov chain Monte Carlo methods leads to Bayesian inference on
cell-specific and population parameters together with the underlying
spatial fields. Our case study explores data from multiphoton intravital
microscopy in lymph nodes of mice, and we use a number of visualization
tools to summarize and compare posterior inferences on the
three-dimensional taxic fields.
Journal: Journal of the American Statistical Association
Pages: 855-865
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.655995
File-URL: http://hdl.handle.net/10.1080/01621459.2012.655995
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:855-865
Template-Type: ReDIF-Article 1.0
Author-Name: Laura A. Hatfield
Author-X-Name-First: Laura A.
Author-X-Name-Last: Hatfield
Author-Name: Mark E. Boye
Author-X-Name-First: Mark E.
Author-X-Name-Last: Boye
Author-Name: Michelle D. Hackshaw
Author-X-Name-First: Michelle D.
Author-X-Name-Last: Hackshaw
Author-Name: Bradley P. Carlin
Author-X-Name-First: Bradley P.
Author-X-Name-Last: Carlin
Title: Multilevel Bayesian Models for Survival Times and Longitudinal Patient-Reported Outcomes With Many Zeros
Abstract:
Regulatory approval of new therapies often depends on demonstrating
prolonged survival. Particularly when these survival benefits are modest,
consideration of therapeutic benefits to patient-reported outcomes (PROs)
may add value to the traditional biomedical clinical trial endpoints. We
extend a popular class of joint models for longitudinal and survival data
to accommodate the excessive zeros common in PROs, building hierarchical
Bayesian models that combine information from longitudinal PRO
measurements and survival outcomes. The model development is motivated by
a clinical trial for malignant pleural mesothelioma, a rapidly fatal form
of pulmonary cancer usually associated with asbestos exposure. By
separately modeling the presence and severity of PROs, using our
zero-augmented beta (ZAB) likelihood, we are able to model PROs on their
original scale and learn about individual-level parameters from both
presence and severity of symptoms. Correlations among an individual's PROs
and survival are modeled using latent random variables, adjusting the
fitted trajectories to better accommodate the observed data for each
individual. This work contributes to understanding the impact of treatment
on two aspects of mesothelioma: patients’ subjective experience of
the disease process and their progression-free survival times. We uncover
important differences between outcome types that are associated with
therapy (periodic, worse in both treatment groups after therapy
initiation) and those that are responsive to treatment (aperiodic,
gradually widening gap between treatment groups). Finally, our work raises
questions for future investigation into multivariate modeling, choice of
link functions, and the relative contributions of multiple data sources in
joint modeling contexts.
Journal: Journal of the American Statistical Association
Pages: 875-885
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.664517
File-URL: http://hdl.handle.net/10.1080/01621459.2012.664517
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:875-885
Template-Type: ReDIF-Article 1.0
Author-Name: Sally Picciotto
Author-X-Name-First: Sally
Author-X-Name-Last: Picciotto
Author-Name: Miguel A. Hernán
Author-X-Name-First: Miguel A.
Author-X-Name-Last: Hernán
Author-Name: John H. Page
Author-X-Name-First: John H.
Author-X-Name-Last: Page
Author-Name: Jessica G. Young
Author-X-Name-First: Jessica G.
Author-X-Name-Last: Young
Author-Name: James M. Robins
Author-X-Name-First: James M.
Author-X-Name-Last: Robins
Title: Structural Nested Cumulative Failure Time Models to Estimate the Effects of Interventions
Abstract:
In the presence of time-varying confounders affected by prior treatment,
standard statistical methods for failure time analysis may be biased.
Methods that correctly adjust for this type of covariate include the
parametric g-formula, inverse probability weighted estimation of marginal
structural Cox proportional hazards models, and g-estimation of structural
nested accelerated failure time models. In this article, we propose a
novel method to estimate the causal effect of a time-dependent treatment
on failure in the presence of informative right-censoring and
time-dependent confounders that may be affected by past treatment:
g-estimation of structural nested cumulative failure time models
(SNCFTMs). An SNCFTM considers the conditional effect of a final treatment
at time m on the outcome at each later time
k by modeling the ratio of two counterfactual cumulative
risks at time k under treatment regimes that differ only
at time m. Inverse probability weights are used to adjust
for informative censoring. We also present a procedure that, under certain
“no-interaction” conditions, uses the g-estimates of the
model parameters to calculate unconditional cumulative risks under
nondynamic (static) treatment regimes. The procedure is illustrated with
an example using data from a longitudinal cohort study, in which the
“treatments” are healthy behaviors and the outcome is
coronary heart disease.
Journal: Journal of the American Statistical Association
Pages: 886-900
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.682532
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682532
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:886-900
Template-Type: ReDIF-Article 1.0
Author-Name: José R. Zubizarreta
Author-X-Name-First: José R.
Author-X-Name-Last: Zubizarreta
Author-Name: Mark Neuman
Author-X-Name-First: Mark
Author-X-Name-Last: Neuman
Author-Name: Jeffrey H. Silber
Author-X-Name-First: Jeffrey H.
Author-X-Name-Last: Silber
Author-Name: Paul R. Rosenbaum
Author-X-Name-First: Paul R.
Author-X-Name-Last: Rosenbaum
Title: Contrasting Evidence Within and Between Institutions That Provide Treatment in an Observational Study of Alternate Forms of Anesthesia
Abstract:
In a randomized trial, subjects are assigned to treatment or control by
the flip of a fair coin. In many nonrandomized or observational studies,
subjects find their way to treatment or control in two steps, either or
both of which may lead to biased comparisons. By a vague process, perhaps
affected by proximity or sociodemographic issues, subjects find their way
to institutions that provide treatment. Once at such an institution, a
second process, perhaps thoughtful and deliberate, assigns individuals to
treatment or control. In the current article, the institutions are
hospitals, and the treatment under study is the use of general anesthesia
alone versus some use of regional anesthesia during surgery. For a
specific operation, the use of regional anesthesia may be typical in one
hospital and atypical in another. A new matched design is proposed for
studies of this sort, one that creates two types of nonoverlapping matched
pairs. Using a new extension of optimal matching with fine balance, pairs
of the first type exactly balance treatment assignment across
institutions, so each institution appears in the treated group with the
same frequency that it appears in the control group; hence, differences
between institutions that affect everyone in the same way cannot bias this
comparison. Pairs of the second type compare institutions that assign most
subjects to treatment and other institutions that assign most subjects to
control, so each institution is represented in the treated group if it
typically assigns subjects to treatment or, alternatively, in the control
group if it typically assigns subjects to control, and no institution
appears in both groups. By and large, in the second type of matched pair,
subjects became treated subjects or controls by choosing an institution,
not by a thoughtful and deliberate process of selecting subjects for
treatment within institutions. The design provides two evidence factors,
that is, two tests of the null hypothesis of no treatment effect that are
independent when the null hypothesis is true, where each factor is largely
unaffected by certain unmeasured biases that could readily invalidate the
other factor. The two factors permit separate and combined sensitivity
analyses, where the magnitude of bias affecting the two factors may
differ. The case of knee surgery in the study of regional versus general
anesthesia is considered in detail.
Journal: Journal of the American Statistical Association
Pages: 901-915
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.682533
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682533
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:901-915
Template-Type: ReDIF-Article 1.0
Author-Name: Qirong Ho
Author-X-Name-First: Qirong
Author-X-Name-Last: Ho
Author-Name: Ankur P. Parikh
Author-X-Name-First: Ankur P.
Author-X-Name-Last: Parikh
Author-Name: Eric P. Xing
Author-X-Name-First: Eric P.
Author-X-Name-Last: Xing
Title: A Multiscale Community Blockmodel for Network Exploration
Abstract:
Real-world networks exhibit a complex set of phenomena such as underlying
hierarchical organization, multiscale interaction, and varying topologies
of communities. Most existing methods do not adequately capture the
intrinsic interplay among such phenomena. We propose a nonparametric
multiscale community blockmodel (MSCB) to model the generation of
hierarchies in social communities, selective membership of actors to
subsets of these communities, and the resultant networks due to within-
and cross-community interactions. By using the nested Chinese restaurant
process, our model automatically infers the hierarchy structure from the
data. We develop a collapsed Gibbs sampling algorithm for posterior
inference, conduct extensive validation using synthetic networks, and
demonstrate the utility of our model in real-world datasets, such as
predator--prey networks and citation networks.
Journal: Journal of the American Statistical Association
Pages: 916-934
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.682530
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682530
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:916-934
Template-Type: ReDIF-Article 1.0
Author-Name: Jianhua Hu
Author-X-Name-First: Jianhua
Author-X-Name-Last: Hu
Author-Name: Xuming He
Author-X-Name-First: Xuming
Author-X-Name-Last: He
Title: Searching for Alternative Splicing With a Joint Model on Probe Measurability and Expression Intensities
Abstract:
The exon tiling array offers a high throughput technology to search for
aberrant splicing in biomedical research, but few methods of analysis for
splicing detection have been tested both statistically and empirically.
Noisy measurements on nonresponsive probe selection regions and outlying
intensities at some of the samples tend to distort model-based
assessments. We propose a robust analysis of variance approach that
incorporates an informative model on probe measurability and uses median
regression rank scores for better reliability in alternative splicing
detection. We study the validity and effectiveness of our proposed
approach in contrast with some of the existing methods through an
empirical investigation of a brain cancer experiment, where a set of
biologically validated genes for splicing and nonsplicing are available.
Our study demonstrates favorable performance of the proposed ranking
method, but shows that analysis of statistical significance cannot be
trusted from any conventional use of p-values. We warn
against any routine attempt to interpret p-values and their
derivatives in model-based detection of alternative splicing.
Journal: Journal of the American Statistical Association
Pages: 935-945
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.682801
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682801
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:935-945
Template-Type: ReDIF-Article 1.0
Author-Name: Chiung-Yu Huang
Author-X-Name-First: Chiung-Yu
Author-X-Name-Last: Huang
Author-Name: Jing Qin
Author-X-Name-First: Jing
Author-X-Name-Last: Qin
Title: Composite Partial Likelihood Estimation Under Length-Biased Sampling, With Application to a Prevalent Cohort Study of Dementia
Abstract:
The Canadian Study of Health and Aging (CSHA) employed a prevalent cohort
design to study survival after onset of dementia, where patients with
dementia were sampled and the onset time of dementia was determined
retrospectively. The prevalent cohort sampling scheme favors individuals
who survive longer. Thus, the observed survival times are subject to
length bias. In recent years, there has been a rising interest in
developing estimation procedures for prevalent cohort survival data that
not only account for length bias but also actually exploit the incidence
distribution of the disease to improve efficiency. This article considers
semiparametric estimation of the Cox model for the time from dementia
onset to death under a stationarity assumption with respect to the disease
incidence. Under the stationarity condition, the semiparametric
maximum likelihood estimation is expected to be fully efficient yet
difficult to perform for statistical practitioners, as the likelihood
depends on the baseline hazard function in a complicated way. Moreover,
the asymptotic properties of the semiparametric maximum likelihood
estimator are not well-studied. Motivated by the composite likelihood
method (Besag 1974), we develop a composite partial likelihood method that
retains the simplicity of the popular partial likelihood estimator and can
be easily performed using standard statistical software. When applied to
the CSHA data, the proposed method estimates a significant difference in
survival between the vascular dementia group and the possible Alzheimer's
disease group, while the partial likelihood method for left-truncated and
right-censored data yields a greater standard error and a 95% confidence
interval covering 0, thus highlighting the practical value of employing a
more efficient methodology. To check the assumption of stable disease for
the CSHA data, we also present new graphical and numerical tests in the
article. The R code used to obtain the maximum composite partial
likelihood estimator for the CSHA data is available in the online
Supplementary Material, posted on the journal web site.
Journal: Journal of the American Statistical Association
Pages: 946-957
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.682544
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682544
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:946-957
Template-Type: ReDIF-Article 1.0
Author-Name: Brent Kreider
Author-X-Name-First: Brent
Author-X-Name-Last: Kreider
Author-Name: John V. Pepper
Author-X-Name-First: John V.
Author-X-Name-Last: Pepper
Author-Name: Craig Gundersen
Author-X-Name-First: Craig
Author-X-Name-Last: Gundersen
Author-Name: Dean Jolliffe
Author-X-Name-First: Dean
Author-X-Name-Last: Jolliffe
Title: Identifying the Effects of SNAP (Food Stamps) on Child Health Outcomes When Participation Is Endogenous and Misreported
Abstract:
The literature assessing the efficacy of the Supplemental Nutrition
Assistance Program (SNAP), formerly known as the Food Stamp Program, has
long puzzled over positive associations between SNAP receipt and various
undesirable health outcomes such as food insecurity. Assessing the causal
impacts of SNAP, however, is hampered by two key identification problems:
endogenous selection into participation and extensive systematic
underreporting of participation status. Using data from the National
Health and Nutrition Examination Survey (NHANES), we extend partial
identification bounding methods to account for these two identification
problems in a single unifying framework. Specifically, we derive
informative bounds on the average treatment effect (ATE) of SNAP on child
food insecurity, poor general health, obesity, and anemia across a range
of different assumptions used to address the selection and classification
error problems. In particular, to address the selection problem, we apply
relatively weak nonparametric assumptions on the latent outcomes, selected
treatments, and observed covariates. To address the classification error
problem, we formalize a new approach that uses auxiliary administrative
data on the size of the SNAP caseload to restrict the magnitudes and
patterns of SNAP reporting errors. Layering successively stronger
assumptions, an objective of our analysis is to make transparent how the
strength of the conclusions varies with the strength of the identifying
assumptions. Under the weakest restrictions, there is substantial
ambiguity; we cannot rule out the possibility that SNAP increases or
decreases poor health. Under stronger but plausible assumptions used to
address the selection and classification error problems, we find that
commonly cited relationships between SNAP and poor health outcomes provide
a misleading picture about the true impacts of the program. Our tightest
bounds identify favorable impacts of SNAP on child health.
Journal: Journal of the American Statistical Association
Pages: 958-975
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.682828
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682828
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:958-975
Template-Type: ReDIF-Article 1.0
Author-Name: María José García-Zattera
Author-X-Name-First: María José
Author-X-Name-Last: García-Zattera
Author-Name: Alejandro Jara
Author-X-Name-First: Alejandro
Author-X-Name-Last: Jara
Author-Name: Emmanuel Lesaffre
Author-X-Name-First: Emmanuel
Author-X-Name-Last: Lesaffre
Author-Name: Guillermo Marshall
Author-X-Name-First: Guillermo
Author-X-Name-Last: Marshall
Title: Modeling of Multivariate Monotone Disease Processes in the Presence of Misclassification
Abstract:
Motivated by a longitudinal oral health study, the
Signal--Tandmobiel® study, we propose a multivariate binary
inhomogeneous Markov model in which unobserved correlated response
variables are subject to an unconstrained misclassification process and
have a monotone behavior. The multivariate baseline distributions and
Markov transition matrices of the unobserved processes are defined as a
function of covariates through the specification of compatible full
conditional distributions. Distinct misclassification models are
discussed. In all cases, the possibility that different examiners were
involved in the scoring of the responses of a given subject across time is
taken into account. A full Bayesian implementation of the model is
described and its performance is evaluated using simulated data. We
provide theoretical and empirical evidence that the parameters can be
estimated without any external information about the misclassification
parameters. Finally, the analyses of the motivating study are presented.
Appendices 1--7 are available in the online supplementary materials.
Journal: Journal of the American Statistical Association
Pages: 976-989
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.682804
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682804
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:976-989
Template-Type: ReDIF-Article 1.0
Author-Name: A. Adam Ding
Author-X-Name-First: A. Adam
Author-X-Name-Last: Ding
Author-Name: Shaonan Tian
Author-X-Name-First: Shaonan
Author-X-Name-Last: Tian
Author-Name: Yan Yu
Author-X-Name-First: Yan
Author-X-Name-Last: Yu
Author-Name: Hui Guo
Author-X-Name-First: Hui
Author-X-Name-Last: Guo
Title: A Class of Discrete Transformation Survival Models With Application to Default Probability Prediction
Abstract:
Corporate bankruptcy prediction plays a central role in academic finance
research, business practice, and government regulation. Consequently,
accurate default probability prediction is extremely important. We propose
to apply a discrete transformation family of survival models to corporate
default risk predictions. A class of Box-Cox transformations and
logarithmic transformations is naturally adopted. The proposed
transformation model family is shown to include the popular Shumway model
and the grouped relative risk model. We show that a transformation
parameter different from those two models is needed for default prediction
using a bankruptcy dataset. In addition, we show using out-of-sample
validation statistics that our model improves performance. We use the
estimated default probability to examine a popular asset pricing question
and determine whether default risk has carried a premium. Due to some
distinct features of the bankruptcy application, the proposed class of
discrete transformation survival models with time-varying covariates is
different from the continuous survival models in the survival analysis
literature. Their similarities and differences are discussed.
Journal: Journal of the American Statistical Association
Pages: 990-1003
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.682806
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682806
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:990-1003
Template-Type: ReDIF-Article 1.0
Author-Name: Hai Nguyen
Author-X-Name-First: Hai
Author-X-Name-Last: Nguyen
Author-Name: Noel Cressie
Author-X-Name-First: Noel
Author-X-Name-Last: Cressie
Author-Name: Amy Braverman
Author-X-Name-First: Amy
Author-X-Name-Last: Braverman
Title: Spatial Statistical Data Fusion for Remote Sensing Applications
Abstract:
Aerosols are tiny solid or liquid particles suspended in the atmosphere;
examples of aerosols include windblown dust, sea salts, volcanic ash,
smoke from wildfires, and pollution from factories. The global
distribution of aerosols is a topic of great interest in climate studies
since aerosols can either cool or warm the atmosphere depending on their
location, type, and interaction with clouds. Aerosol concentrations are
important input components of global climate models, and it is crucial to
accurately estimate aerosol concentrations from remote sensing instruments
so as to minimize errors “downstream” in climate models.
Currently, space-based observations of aerosols are available from two
remote sensing instruments on board NASA's Terra spacecraft: the
Multiangle Imaging SpectroRadiometer (MISR), and the MODerate-resolution
Imaging Spectrometer (MODIS). These two instruments have complementary
coverage, spatial support, and retrieval characteristics, making it
advantageous to combine information from both sources to make optimal
inferences about global aerosol distributions. In this article, we predict
the true aerosol process from two noisy and possibly biased datasets, and
we also estimate the uncertainties of these estimates. Our data-fusion
methodology scales linearly and bears some resemblance to Fixed Rank
Kriging (FRK), a variant of kriging that is designed for spatial
interpolation of a single, massive dataset. Our spatial statistical
approach does not require assumptions of stationarity or isotropy and,
crucially, allows for change of spatial support. We compare our
methodology to FRK and Bayesian melding, and we show that ours has
superior prediction standard errors compared to FRK and much faster
computational speed compared to Bayesian melding.
Journal: Journal of the American Statistical Association
Pages: 1004-1018
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.694717
File-URL: http://hdl.handle.net/10.1080/01621459.2012.694717
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1004-1018
Template-Type: ReDIF-Article 1.0
Author-Name: Qin Zhou
Author-X-Name-First: Qin
Author-X-Name-Last: Zhou
Author-Name: Changliang Zou
Author-X-Name-First: Changliang
Author-X-Name-Last: Zou
Author-Name: Zhaojun Wang
Author-X-Name-First: Zhaojun
Author-X-Name-Last: Wang
Author-Name: Wei Jiang
Author-X-Name-First: Wei
Author-X-Name-Last: Jiang
Title: Likelihood-Based EWMA Charts for Monitoring Poisson Count Data With Time-Varying Sample Sizes
Abstract:
Many applications involve monitoring incidence rates of the Poisson
distribution when the sample size varies over time. Recently, a couple of
cumulative sum and exponentially weighted moving average (EWMA) control
charts have been proposed to tackle this problem by taking the varying
sample size into consideration. However, we argue that some of these
charts, which perform quite well in terms of average run length (ARL), may
not be appealing in practice because they have rather unsatisfactory run
length distributions. With some charts, the specified in-control (IC) ARL
is attained with elevated probabilities of very short and very long runs,
as compared with a geometric distribution. This is reflected in a larger
run length standard deviation than that of a geometric distribution and an
elevated probability of false alarms with short runs, which, in turn, hurt
an operator's confidence in valid alarms. Furthermore, with many charts,
the IC ARL exhibits considerable variations with different patterns of
sample sizes. Under the framework of weighted likelihood ratio test, this
article suggests a new EWMA control chart which automatically integrates
the varying sample sizes with the EWMA scheme. It is fast to compute, easy
to construct, and quite efficient in detecting changes of Poisson rates.
Two important features of the proposed method are that the IC run length
distribution is similar to that of a geometric distribution and the IC ARL
is robust to various patterns of sample size variation. Our simulation
results show that the proposed chart is generally more effective and
robust compared with existing EWMA charts. A health surveillance example
based on mortality data from New Mexico is used to illustrate the
implementation of the proposed method. This article has online
supplementary materials.
Journal: Journal of the American Statistical Association
Pages: 1049-1062
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.682811
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682811
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1049-1062
Template-Type: ReDIF-Article 1.0
Author-Name: Anastasios Panagiotelis
Author-X-Name-First: Anastasios
Author-X-Name-Last: Panagiotelis
Author-Name: Claudia Czado
Author-X-Name-First: Claudia
Author-X-Name-Last: Czado
Author-Name: Harry Joe
Author-X-Name-First: Harry
Author-X-Name-Last: Joe
Title: Pair Copula Constructions for Multivariate Discrete Data
Abstract:
Multivariate discrete response data can be found in diverse fields,
including econometrics, finance, biometrics, and psychometrics. Our
contribution, through this study, is to introduce a new class of models
for multivariate discrete data based on pair copula constructions (PCCs)
that has two major advantages. First, by deriving the conditions under
which any multivariate discrete distribution can be decomposed as a PCC,
we show that discrete PCCs attain highly flexible dependence structures.
Second, the computational burden of evaluating the likelihood for an
m-dimensional discrete PCC only grows quadratically with
m. This compares favorably to existing models for which
computing the likelihood either requires the evaluation of
2-super-m terms or slow numerical integration methods. We
demonstrate the high quality of inference function for margins and maximum
likelihood estimates, both under a simulated setting and for an
application to a longitudinal discrete dataset on headache severity. This
article has online supplementary material.
Journal: Journal of the American Statistical Association
Pages: 1063-1072
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.682850
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682850
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1063-1072
Template-Type: ReDIF-Article 1.0
Author-Name: Jens-Peter Kreiss
Author-X-Name-First: Jens-Peter
Author-X-Name-Last: Kreiss
Author-Name: Efstathios Paparoditis
Author-X-Name-First: Efstathios
Author-X-Name-Last: Paparoditis
Title: The Hybrid Wild Bootstrap for Time Series
Abstract:
We introduce a new and simple bootstrap procedure for general linear
processes, called the hybrid wild bootstrap. The hybrid wild bootstrap
generates frequency domain replicates of the periodogram that
asymptotically correctly imitate the first- and second-order
properties of the ordinary periodogram, including its weak
dependence structure at different frequencies. As a consequence,
the hybrid wild bootstrapped periodogram consistently approximates
the distribution of statistics that can be expressed as
functionals of the periodogram, including the
important class of spectral means for which all so far existing frequency
domain bootstrap methods generally fail. Moreover, by inverting the hybrid
wild bootstrapped discrete Fourier transform, pseudo-observations in the
time domain are obtained. The generated time domain pseudo-observations
can be used to approximate correctly the random behavior of statistics,
the distribution of which depends on the first-, second-, and, to some
extent, on the fourth-order structure of the underlying linear process.
Thus, the proposed hybrid wild bootstrap procedure applied to general time
series overcomes several of the limitations of standard linear time domain
bootstrap methods.
Journal: Journal of the American Statistical Association
Pages: 1073-1084
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.695664
File-URL: http://hdl.handle.net/10.1080/01621459.2012.695664
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1073-1084
Template-Type: ReDIF-Article 1.0
Author-Name: Victor M. Panaretos
Author-X-Name-First: Victor M.
Author-X-Name-Last: Panaretos
Author-Name: Kjell Konis
Author-X-Name-First: Kjell
Author-X-Name-Last: Konis
Title: Nonparametric Construction of Multivariate Kernels
Abstract:
We propose a nonparametric method for constructing multivariate kernels
tuned to the configuration of the sample, for density estimation
in R-super-d, d moderate.
The motivation behind the approach is to break down the construction of
the kernel into two parts: determining its overall shape and then its
global concentration. We consider a framework that is essentially
nonparametric, as opposed to the usual bandwidth matrix parameterization.
The shape of the kernel to be employed is determined by applying the
backprojection operator, the dual of the Radon transform, to a collection
of one-dimensional kernels, each optimally tuned to the concentration of
the corresponding one-dimensional projections of the data. Once an overall
shape is determined, the global concentration is controlled by a simple
scaling. It is seen that the kernel estimators thus developed are easy and
extremely fast to compute, and perform at least as well in practice as
parametric kernels with cross-validated or otherwise tuned covariance
structure. Connections with integral geometry are discussed, and the
approach is illustrated under a wide range of scenarios in two and three
dimensions, via an R package developed for its implementation.
Journal: Journal of the American Statistical Association
Pages: 1085-1095
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.695657
File-URL: http://hdl.handle.net/10.1080/01621459.2012.695657
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1085-1095
Template-Type: ReDIF-Article 1.0
Author-Name: Jiahua Chen
Author-X-Name-First: Jiahua
Author-X-Name-Last: Chen
Author-Name: Pengfei Li
Author-X-Name-First: Pengfei
Author-X-Name-Last: Li
Author-Name: Yuejiao Fu
Author-X-Name-First: Yuejiao
Author-X-Name-Last: Fu
Title: Inference on the Order of a Normal Mixture
Abstract:
Finite normal mixture models are used in a wide range of applications.
Hypothesis testing on the order of the normal mixture is an important yet
unsolved problem. Existing procedures often lack a rigorous theoretical
foundation. Many are also hard to implement numerically. In this article,
we develop a new method to fill the void in this important area. An
effective expectation-maximization (EM) test is invented for testing the
null hypothesis of an arbitrary order m_0 under a
finite normal mixture model. For any positive integer
m_0 ⩾ 2, the limiting distribution of the proposed test
statistic is . We also use a novel computer
experiment to provide empirical formulas for the tuning parameter
selection. The finite sample performance of the test is examined through
simulation studies. Real-data examples are provided. The procedure has
been implemented in R code. The p-values for testing the
null order of m_0 = 2 or m_0 = 3 can be calculated
with a single command. This article has
supplementary materials available online.
Journal: Journal of the American Statistical Association
Pages: 1096-1105
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.695668
File-URL: http://hdl.handle.net/10.1080/01621459.2012.695668
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1096-1105
Template-Type: ReDIF-Article 1.0
Author-Name: Yingqi Zhao
Author-X-Name-First: Yingqi
Author-X-Name-Last: Zhao
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: A. John Rush
Author-X-Name-First: A. John
Author-X-Name-Last: Rush
Author-Name: Michael R. Kosorok
Author-X-Name-First: Michael R.
Author-X-Name-Last: Kosorok
Title: Estimating Individualized Treatment Rules Using Outcome Weighted Learning
Abstract:
There is increasing interest in discovering individualized treatment
rules (ITRs) for patients who have heterogeneous responses to treatment.
In particular, one aims to find an optimal ITR that is a deterministic
function of patient-specific characteristics maximizing expected clinical
outcome. In this article, we first show that estimating such an optimal
treatment rule is equivalent to a classification problem where each
subject is weighted proportional to his or her clinical outcome. We then
propose an outcome weighted learning approach based on the support vector
machine framework. We show that the resulting estimator of the treatment
rule is consistent. We further obtain a finite sample bound for the
difference between the expected outcome using the estimated ITR and that
of the optimal treatment rule. The performance of the proposed approach is
demonstrated via simulation studies and an analysis of chronic depression
data.
Journal: Journal of the American Statistical Association
Pages: 1106-1118
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.695674
File-URL: http://hdl.handle.net/10.1080/01621459.2012.695674
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1106-1118
Template-Type: ReDIF-Article 1.0
Author-Name: Daniel L. Sussman
Author-X-Name-First: Daniel L.
Author-X-Name-Last: Sussman
Author-Name: Minh Tang
Author-X-Name-First: Minh
Author-X-Name-Last: Tang
Author-Name: Donniell E. Fishkind
Author-X-Name-First: Donniell E.
Author-X-Name-Last: Fishkind
Author-Name: Carey E. Priebe
Author-X-Name-First: Carey E.
Author-X-Name-Last: Priebe
Title: A Consistent Adjacency Spectral Embedding for Stochastic Blockmodel Graphs
Abstract:
We present a method to estimate block membership of nodes in a random
graph generated by a stochastic blockmodel. We use an embedding procedure
motivated by the random dot product graph model, a particular example of
the latent position model. The embedding associates each node with a
vector; these vectors are clustered via minimization of a square error
criterion. We prove that this method is consistent for assigning nodes to
blocks, as only a negligible number of nodes will be misassigned. We prove
consistency of the method for directed and undirected graphs. The
consistent block assignment makes possible consistent parameter estimation
for a stochastic blockmodel. We extend this result to the setting where the
number of blocks grows slowly with the number of nodes. Our method is also
computationally feasible even for very large graphs. We compare our method
with Laplacian spectral clustering through analysis of simulated data and
a graph derived from Wikipedia documents.
Journal: Journal of the American Statistical Association
Pages: 1119-1128
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.699795
File-URL: http://hdl.handle.net/10.1080/01621459.2012.699795
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1119-1128
Template-Type: ReDIF-Article 1.0
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Author-Name: Wei Zhong
Author-X-Name-First: Wei
Author-X-Name-Last: Zhong
Author-Name: Liping Zhu
Author-X-Name-First: Liping
Author-X-Name-Last: Zhu
Title: Feature Screening via Distance Correlation Learning
Abstract:
This article is concerned with screening features in
ultrahigh-dimensional data analysis, which has become increasingly
important in diverse scientific fields. We develop a sure independence
screening procedure based on the distance correlation (DC-SIS). The DC-SIS
can be implemented as easily as the sure independence screening (SIS)
procedure based on the Pearson correlation proposed by Fan and Lv.
However, the DC-SIS can significantly improve on the SIS. Fan and Lv
established the sure screening property for the SIS based on linear
models, but the sure screening property is valid for the DC-SIS under more
general settings, including linear models. Furthermore, the implementation
of the DC-SIS does not require model specification (e.g., linear model or
generalized linear model) for responses or predictors. This is a very
appealing property in ultrahigh-dimensional data analysis. Moreover, the
DC-SIS can be used directly to screen grouped predictor variables and
multivariate response variables. We establish the sure screening property
for the DC-SIS, and conduct simulations to examine its finite sample
performance. A numerical comparison indicates that the DC-SIS performs
much better than the SIS in various models. We also illustrate the DC-SIS
through a real-data example.
Journal: Journal of the American Statistical Association
Pages: 1129-1139
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.695654
File-URL: http://hdl.handle.net/10.1080/01621459.2012.695654
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1129-1139
Template-Type: ReDIF-Article 1.0
Author-Name: Holger Dette
Author-X-Name-First: Holger
Author-X-Name-Last: Dette
Author-Name: Matthias Trampisch
Author-X-Name-First: Matthias
Author-X-Name-Last: Trampisch
Title: Optimal Designs for Quantile Regression Models
Abstract:
Despite their importance, optimal designs for quantile regression models
have not been developed so far. In this article, we investigate the
D-optimal design problem for nonlinear quantile
regression analysis. We provide a necessary condition to check the
optimality of a given design and use it to determine bounds for the number
of support points of locally D-optimal designs. The
results are illustrated by determining locally D-optimal, Bayesian, and
standardized maximin D-optimal designs for quantile
regression analysis in the Michaelis--Menten and EMAX models, which are
widely used in
such important fields as toxicology, pharmacokinetics, and dose--response
modeling.
Journal: Journal of the American Statistical Association
Pages: 1140-1151
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.695665
File-URL: http://hdl.handle.net/10.1080/01621459.2012.695665
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1140-1151
Template-Type: ReDIF-Article 1.0
Author-Name: Sebastian Kurtek
Author-X-Name-First: Sebastian
Author-X-Name-Last: Kurtek
Author-Name: Anuj Srivastava
Author-X-Name-First: Anuj
Author-X-Name-Last: Srivastava
Author-Name: Eric Klassen
Author-X-Name-First: Eric
Author-X-Name-Last: Klassen
Author-Name: Zhaohua Ding
Author-X-Name-First: Zhaohua
Author-X-Name-Last: Ding
Title: Statistical Modeling of Curves Using Shapes and Related Features
Abstract:
Motivated by the problems of analyzing protein backbones, diffusion
tensor magnetic resonance imaging (DT-MRI) fiber tracts in the human
brain, and other problems involving curves, in this study we present some
statistical models of parameterized curves in R^n, in terms of combinations of
features such as shape, location, scale, and orientation. For each
combination of interest, we identify a representation manifold, endow it
with a Riemannian metric, and outline tools for computing sample
statistics on these manifolds. An important characteristic of the chosen
representations is that the ensuing comparison and modeling of curves is
invariant to how the curves are parameterized. The nuisance variables,
including parameterization, are removed by forming quotient spaces under
appropriate group actions. In the case of shape analysis, the resulting
spaces are quotient spaces of Hilbert spheres, and we derive certain
wrapped truncated normal densities for capturing variability in observed
curves. We demonstrate these models using both artificial data and real
data involving DT-MRI fiber tracts from multiple subjects and protein
backbones from the Shape Retrieval Contest of Non-rigid 3D Models (SHREC)
2010 database.
Journal: Journal of the American Statistical Association
Pages: 1152-1165
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.699770
File-URL: http://hdl.handle.net/10.1080/01621459.2012.699770
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1152-1165
Template-Type: ReDIF-Article 1.0
Author-Name: Raymond Carroll
Author-X-Name-First: Raymond
Author-X-Name-Last: Carroll
Author-Name: Aurore Delaigle
Author-X-Name-First: Aurore
Author-X-Name-Last: Delaigle
Author-Name: Peter Hall
Author-X-Name-First: Peter
Author-X-Name-Last: Hall
Title: Deconvolution When Classifying Noisy Data Involving Transformations
Abstract:
In the present study, we consider the problem of classifying spatial data
distorted by a linear transformation or convolution and contaminated by
additive random noise. In this setting, we show that classifier
performance can be improved if we carefully invert the data before the
classifier is applied. However, the inverse transformation is not
constructed so as to recover the original signal, and in fact, we show
that taking the latter approach is generally inadvisable. We introduce a
fully data-driven procedure based on cross-validation, and use several
classifiers to illustrate numerical properties of our approach.
Theoretical arguments are given in support of our claims. Our procedure is
applied to data generated by light detection and ranging (Lidar)
technology, where we improve on earlier approaches to classifying
aerosols. This article has supplementary materials online.
Journal: Journal of the American Statistical Association
Pages: 1166-1177
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.699793
File-URL: http://hdl.handle.net/10.1080/01621459.2012.699793
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1166-1177
Template-Type: ReDIF-Article 1.0
Author-Name: Mike Danilov
Author-X-Name-First: Mike
Author-X-Name-Last: Danilov
Author-Name: Víctor J. Yohai
Author-X-Name-First: Víctor J.
Author-X-Name-Last: Yohai
Author-Name: Ruben H. Zamar
Author-X-Name-First: Ruben H.
Author-X-Name-Last: Zamar
Title: Robust Estimation of Multivariate Location and Scatter in the Presence of Missing Data
Abstract:
Two main issues regarding data quality are data contamination (outliers)
and data completion (missing data). These two problems have attracted much
attention and research but surprisingly, they are seldom considered
together. Popular robust methods such as S-estimators of
multivariate location and scatter offer protection against outliers but
cannot deal with missing data, except for the obviously inefficient
approach of deleting all incomplete cases. We generalize the definition of
S-estimators of multivariate location and scatter to
simultaneously deal with missing data and outliers. We show that the
proposed estimators are strongly consistent under elliptical models when
data are missing completely at random. We derive an
algorithm similar to the Expectation-Maximization algorithm for computing
the proposed estimators. This algorithm is initialized by an extension for
missing data of the minimum volume ellipsoid. We assess the performance of
our proposal by Monte Carlo simulation and give some real data examples.
This article has supplementary material online.
Journal: Journal of the American Statistical Association
Pages: 1178-1186
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.699792
File-URL: http://hdl.handle.net/10.1080/01621459.2012.699792
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1178-1186
Template-Type: ReDIF-Article 1.0
Author-Name: Chenlei Leng
Author-X-Name-First: Chenlei
Author-X-Name-Last: Leng
Author-Name: Cheng Yong Tang
Author-X-Name-First: Cheng Yong
Author-X-Name-Last: Tang
Title: Sparse Matrix Graphical Models
Abstract:
Matrix-variate observations are frequently encountered in many
contemporary statistical problems due to a rising need to organize and
analyze data with structured information. In this article, we propose a
novel sparse matrix graphical model for these types of statistical
problems. By penalizing, respectively, two precision matrices
corresponding to the rows and columns, our method yields a sparse matrix
graphical model that synthetically characterizes the underlying
conditional independence structure. Our model is more parsimonious and is
practically more interpretable than the conventional sparse vector-variate
graphical models. Asymptotic analysis shows that our penalized likelihood
estimates enjoy better convergence rates than those of the vector-variate
graphical model. The finite sample performance of the proposed method is
illustrated via extensive simulation studies and analyses of several real
datasets.
Journal: Journal of the American Statistical Association
Pages: 1187-1200
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.706133
File-URL: http://hdl.handle.net/10.1080/01621459.2012.706133
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1187-1200
Template-Type: ReDIF-Article 1.0
Author-Name: Gabriel Chandler
Author-X-Name-First: Gabriel
Author-X-Name-Last: Chandler
Author-Name: Wolfgang Polonik
Author-X-Name-First: Wolfgang
Author-X-Name-Last: Polonik
Title: Mode Identification of Volatility in Time-Varying Autoregression
Abstract:
In many applications, time series exhibit nonstationary behavior that
might reasonably be modeled as a time-varying autoregressive (AR) process.
In the context of such a model, we discuss the problem of testing for
modality of the variance function. We propose a test of modality that is
local and, when used iteratively, can be used to identify the total number
of modes in a given series. This problem is closely related to peak
detection and identification, which has applications in many fields. We
propose a test that, under appropriate assumptions, is asymptotically
distribution free under the null hypothesis, even though nonparametric
estimation of the AR parameter functions is involved. Simulation studies
and applications to real datasets illustrate the behavior of the test.
Journal: Journal of the American Statistical Association
Pages: 1217-1229
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.703877
File-URL: http://hdl.handle.net/10.1080/01621459.2012.703877
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1217-1229
Template-Type: ReDIF-Article 1.0
Author-Name: Shurong Zheng
Author-X-Name-First: Shurong
Author-X-Name-Last: Zheng
Author-Name: Ning-Zhong Shi
Author-X-Name-First: Ning-Zhong
Author-X-Name-Last: Shi
Author-Name: Zhengjun Zhang
Author-X-Name-First: Zhengjun
Author-X-Name-Last: Zhang
Title: Generalized Measures of Correlation for Asymmetry, Nonlinearity, and Beyond
Abstract:
Applicability of Pearson's correlation as a measure of explained variance
is by now well understood. One of its limitations is that it does not
account for asymmetry in explained variance. Aiming to develop broadly
applicable correlation measures, we study a pair of generalized measures
of correlation (GMC) that deal with asymmetries in explained variances,
and linear or nonlinear relations between random variables. We present
examples under which the paired measures are identical, and they become a
symmetric correlation measure that is the same as the squared Pearson's
correlation coefficient. As a result, Pearson's correlation is a special
case of GMC. Theoretical properties show that GMC is applicable in a wide
range of settings and can lead to more meaningful conclusions and
improved decision making. In statistical inference, the joint asymptotics
of the kernel-based estimators for GMC are derived and are used to test
whether or not two random variables are symmetric in explaining variances.
The testing results give important guidance in practical model selection
problems. The efficiency of the test statistics is illustrated in
simulation examples. In real-data analysis, we present an important
application of GMC in explained variances and market movements among three
important economic and financial monetary indicators. This article has
online supplementary materials.
Journal: Journal of the American Statistical Association
Pages: 1239-1252
Issue: 499
Volume: 107
Year: 2012
Month: 9
X-DOI: 10.1080/01621459.2012.710509
File-URL: http://hdl.handle.net/10.1080/01621459.2012.710509
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:499:p:1239-1252
Template-Type: ReDIF-Article 1.0
Author-Name: William Astle
Author-X-Name-First: William
Author-X-Name-Last: Astle
Author-Name: Maria De Iorio
Author-X-Name-First: Maria
Author-X-Name-Last: De Iorio
Author-Name: Sylvia Richardson
Author-X-Name-First: Sylvia
Author-X-Name-Last: Richardson
Author-Name: David Stephens
Author-X-Name-First: David
Author-X-Name-Last: Stephens
Author-Name: Timothy Ebbels
Author-X-Name-First: Timothy
Author-X-Name-Last: Ebbels
Title: A Bayesian Model of NMR Spectra for the Deconvolution and Quantification of Metabolites in Complex Biological Mixtures
Abstract:
Nuclear magnetic resonance (NMR) spectra are widely used in metabolomics
to obtain profiles of metabolites dissolved in biofluids such as cell
supernatants. Methods for estimating metabolite concentrations from these
spectra are presently confined to manual peak fitting and to binning
procedures for integrating resonance peaks. Extensive information on the
patterns of spectral resonance generated by human metabolites is now
available in online databases. By incorporating this information into a
Bayesian model, we can deconvolve resonance peaks from a spectrum and
obtain explicit concentration estimates for the corresponding metabolites.
Spectral resonances that cannot be deconvolved in this way may also be of
scientific interest; so, we model them jointly using wavelets. We describe
a Markov chain Monte Carlo algorithm that allows us to sample from the
joint posterior distribution of the model parameters, using specifically
designed block updates to improve mixing. The strong prior on resonance
patterns allows the algorithm to identify peaks corresponding to
particular metabolites automatically, eliminating the need for manual peak
assignment. We assess our method for peak alignment and concentration
estimation. Except in cases when the target resonance signal is very weak,
alignment is unbiased and precise. We compare the Bayesian concentration
estimates with those obtained from a conventional numerical integration
method and find that our point estimates have six-fold lower mean squared
error. Finally, we apply our method to a spectral dataset taken from an
investigation of the metabolic response of yeast to recombinant protein
expression. We estimate the concentrations of 26 metabolites and compare
with manual quantification by five expert spectroscopists. We discuss the
reason for discrepancies and the robustness of our method's concentration
estimates. This article has supplementary materials online.
Journal: Journal of the American Statistical Association
Pages: 1259-1271
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.695661
File-URL: http://hdl.handle.net/10.1080/01621459.2012.695661
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1259-1271
Template-Type: ReDIF-Article 1.0
Author-Name: Yuan Wang
Author-X-Name-First: Yuan
Author-X-Name-Last: Wang
Author-Name: J. S. Marron
Author-X-Name-First: J. S.
Author-X-Name-Last: Marron
Author-Name: Burcu Aydin
Author-X-Name-First: Burcu
Author-X-Name-Last: Aydin
Author-Name: Alim Ladha
Author-X-Name-First: Alim
Author-X-Name-Last: Ladha
Author-Name: Elizabeth Bullitt
Author-X-Name-First: Elizabeth
Author-X-Name-Last: Bullitt
Author-Name: Haonan Wang
Author-X-Name-First: Haonan
Author-X-Name-Last: Wang
Title: A Nonparametric Regression Model With Tree-Structured Response
Abstract:
Developments in science and technology over the last two decades have
motivated the study of complex data objects. In this article, we consider
the topological properties of a population of tree-structured objects. Our
interest centers on modeling the relationship between a tree-structured
response and other covariates. For tree-structured objects, this poses
serious challenges since most regression methods rely on linear operations
in Euclidean space. We generalize the notion of nonparametric regression
to the case of a tree-structured response variable. In addition, we
develop a fast algorithm and give its theoretical justification. We
implement the proposed method to analyze a dataset of human brain artery
trees. An important lesson is that smoothing in the full tree space can
reveal much deeper scientific insights than the simple smoothing of
summary statistics. This article has supplementary materials online.
Journal: Journal of the American Statistical Association
Pages: 1272-1285
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.699348
File-URL: http://hdl.handle.net/10.1080/01621459.2012.699348
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1272-1285
Template-Type: ReDIF-Article 1.0
Author-Name: Manuel Wiesenfarth
Author-X-Name-First: Manuel
Author-X-Name-Last: Wiesenfarth
Author-Name: Tatyana Krivobokova
Author-X-Name-First: Tatyana
Author-X-Name-Last: Krivobokova
Author-Name: Stephan Klasen
Author-X-Name-First: Stephan
Author-X-Name-Last: Klasen
Author-Name: Stefan Sperlich
Author-X-Name-First: Stefan
Author-X-Name-Last: Sperlich
Title: Direct Simultaneous Inference in Additive Models and Its Application to Model Undernutrition
Abstract:
This article proposes a simple and fast approach to build simultaneous
confidence bands and perform specification tests for smooth curves in
additive models. The method allows for handling of spatially heterogeneous
functions and their derivatives as well as heteroscedasticity in the data.
It is applied to study the determinants of chronic undernutrition of
Kenyan children, with a particular focus on the highly nonlinear age
pattern in undernutrition. Model estimation using the mixed model
representation of penalized splines in combination with simultaneous
probability calculations based on the volume-of-tube formula enables
simultaneous inference directly, that is, without resampling methods.
Finite sample properties of simultaneous confidence bands and
specification tests are investigated in simulations. To facilitate and
enhance its application, the method has been implemented in the R package
AdaptFitOS.
Journal: Journal of the American Statistical Association
Pages: 1286-1296
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.682809
File-URL: http://hdl.handle.net/10.1080/01621459.2012.682809
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1286-1296
Template-Type: ReDIF-Article 1.0
Author-Name: Martin A. Lindquist
Author-X-Name-First: Martin A.
Author-X-Name-Last: Lindquist
Title: Functional Causal Mediation Analysis With an Application to Brain Connectivity
Abstract:
Mediation analysis is often used in the behavioral sciences to
investigate the role of intermediate variables that lie on the causal path
between a randomized treatment and an outcome variable. Typically,
mediation is assessed using structural equation models (SEMs), with model
coefficients interpreted as causal effects. In this article, we present an
extension of SEMs to the functional data analysis (FDA) setting that
allows the mediating variable to be a continuous function rather than a
single scalar measure, thus providing the opportunity to study the
functional effects of the mediator on the outcome. We provide sufficient
conditions for identifying the average causal effects of the functional
mediators using the extended SEM, as well as weaker conditions under which
an instrumental variable estimand may be interpreted as an effect. The
method is applied to data from a functional magnetic resonance imaging
(fMRI) study of thermal pain that sought to determine whether activation
in certain brain regions mediated the effect of applied temperature on
self-reported pain. Our approach provides valuable information about the
timing of the mediating effect that is not readily available when using
the standard nonfunctional approach. To the best of our knowledge, this
work provides the first application of causal inference to the FDA
framework.
Journal: Journal of the American Statistical Association
Pages: 1297-1309
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.695640
File-URL: http://hdl.handle.net/10.1080/01621459.2012.695640
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1297-1309
Template-Type: ReDIF-Article 1.0
Author-Name: Sandra M. Mohammed
Author-X-Name-First: Sandra M.
Author-X-Name-Last: Mohammed
Author-Name: Damla Şentürk
Author-X-Name-First: Damla
Author-X-Name-Last: Şentürk
Author-Name: Lorien S. Dalrymple
Author-X-Name-First: Lorien S.
Author-X-Name-Last: Dalrymple
Author-Name: Danh V. Nguyen
Author-X-Name-First: Danh V.
Author-X-Name-Last: Nguyen
Title: Measurement Error Case Series Models With Application to Infection-Cardiovascular Risk in Older Patients on Dialysis
Abstract:
Infection and cardiovascular disease are leading causes of
hospitalization and death in older patients on dialysis. Our recent work
found an increase in the relative incidence of cardiovascular outcomes
during the ∼ 30 days after infection-related hospitalizations using
the case series model, which adjusts for measured and unmeasured baseline
confounders. However, a major challenge in modeling/assessing the
infection-cardiovascular risk hypothesis is that the exact times of
infection onset, or more generally “exposure” onset, cannot be
ascertained based on hospitalization data. Only imprecise markers of the
timing of infection onsets are available. Although there is a large
literature on measurement error in the predictors in regression modeling,
to date, to our knowledge there is no work on measurement error in the
timing of a time-varying exposure. Thus, we propose a new class of models,
the measurement error case series (MECS) models, to account for
measurement error in time-varying exposure onsets. We characterize the
general nature of bias resulting from estimation that ignores measurement
error and propose a bias-corrected estimator for the MECS models. We
examine in detail the accuracy of the proposed method in estimating the relative
incidence of cardiovascular events. Hospitalization data from the United
States Renal Data System, which captures nearly all (>99%) patients with
end-stage renal disease in the United States over time, are used to
illustrate the proposed method. The results suggest that the estimate of
the relative incidence of cardiovascular events during the 30 days
after infections, a period where acute effects of infection on vascular
endothelium may be most pronounced, is substantially attenuated in the
presence of infection onset measurement error.
Journal: Journal of the American Statistical Association
Pages: 1310-1323
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.695648
File-URL: http://hdl.handle.net/10.1080/01621459.2012.695648
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1310-1323
Template-Type: ReDIF-Article 1.0
Author-Name: Yuanjia Wang
Author-X-Name-First: Yuanjia
Author-X-Name-Last: Wang
Author-Name: Tanya P. Garcia
Author-X-Name-First: Tanya P.
Author-X-Name-Last: Garcia
Author-Name: Yanyuan Ma
Author-X-Name-First: Yanyuan
Author-X-Name-Last: Ma
Title: Nonparametric Estimation for Censored Mixture Data With Application to the Cooperative Huntington’s Observational Research Trial
Abstract:
This work presents methods for estimating genotype-specific outcome
distributions from genetic epidemiology studies where the event times are
subject to right censoring, the genotypes are not directly observed, and
the data arise from a mixture of scientifically meaningful subpopulations.
Examples of such studies include kin-cohort studies and quantitative trait
locus (QTL) studies. Current methods for analyzing censored mixture data
include two types of nonparametric maximum likelihood estimators (NPMLEs;
Type I and Type II) that do not make parametric assumptions on the
genotype-specific density functions. Although both NPMLEs are commonly
used, we show that one is inefficient and the other inconsistent. To
overcome these deficiencies, we propose three classes of consistent
nonparametric estimators that do not assume parametric density models and
are easy to implement. They are based on inverse probability weighting
(IPW), augmented IPW (AIPW), and nonparametric imputation (IMP). AIPW
achieves the efficiency bound without additional modeling assumptions.
Extensive simulation experiments demonstrate satisfactory performance of
these estimators even when the data are heavily censored. We apply these
estimators to the Cooperative Huntington’s Observational Research
Trial (COHORT), and provide age-specific estimates of the effect of
mutation in the Huntington gene on mortality using a sample of family
members. The close approximation of the estimated noncarrier survival
rates to those of the U.S. population indicates small ascertainment bias in
the COHORT family sample. Our analyses underscore an elevated risk of
death in Huntington gene mutation carriers compared with that in
noncarriers for a wide age range, and suggest that the mutation equally
affects survival rates in both genders. The estimated survival rates are
useful in genetic counseling for providing guidelines on interpreting the
risk of death associated with a positive genetic test, and in helping
future subjects at risk to make informed decisions on whether to undergo
genetic mutation testing. Technical details and additional numerical
results are provided in the online supplementary materials.
Journal: Journal of the American Statistical Association
Pages: 1324-1338
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.699353
File-URL: http://hdl.handle.net/10.1080/01621459.2012.699353
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1324-1338
Template-Type: ReDIF-Article 1.0
Author-Name: Hsiu-Hsi Chen
Author-X-Name-First: Hsiu-Hsi
Author-X-Name-Last: Chen
Author-Name: Amy Ming-Fang Yen
Author-X-Name-First: Amy Ming-Fang
Author-X-Name-Last: Yen
Author-Name: Laszlo Tabár
Author-X-Name-First: Laszlo
Author-X-Name-Last: Tabár
Title: A Stochastic Model for Calibrating the Survival Benefit of Screen-Detected Cancers
Abstract:
Comparison of the survival of clinically detected and screen-detected
cancer cases from either population-based service screening programs or
opportunistic screening is often distorted by both lead-time and length
biases. Both are correlated with each other and are also affected by
measurement errors and tumor attributes such as regional lymph node
spread. We propose a general stochastic approach to calibrate the survival
benefit of screen-detected cancers related to both biases, measurement
errors, and tumor attributes. We apply our proposed method to breast
cancer screening data from one arm of the Swedish Two-County trial in the
trial period together with the subsequent service screening for the same
cohort. When there is no calibration, the results—assuming a
constant (exponentially distributed) post-lead-time hazard rate (i.e., a
homogeneous stochastic process)—show a 57% reduction in breast
cancer death over 25 years. After correction, the reduction was 30%,
with approximately 12% of the overestimation being due to lead-time bias
and 15% due to length bias. The additional impacts of measurement errors
(sensitivity and specificity) depend on the type of the proposed model and
follow-up time. The corresponding analysis with the Weibull
distribution—relaxing the assumption of a constant hazard
rate—yielded similar findings, with no statistically significant
difference from the exponential model. The proposed calibration approach
allows the benefit of a service cancer screening program to be fairly
evaluated. This article has supplementary materials online.
Journal: Journal of the American Statistical Association
Pages: 1339-1359
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.716335
File-URL: http://hdl.handle.net/10.1080/01621459.2012.716335
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1339-1359
Template-Type: ReDIF-Article 1.0
Author-Name: José R. Zubizarreta
Author-X-Name-First: José R.
Author-X-Name-Last: Zubizarreta
Title: Using Mixed Integer Programming for Matching in an Observational Study of Kidney Failure After Surgery
Abstract:
This article presents a new method for optimal matching in observational
studies based on mixed integer programming. Unlike widely used matching
methods based on network algorithms, which attempt to achieve covariate
balance by minimizing the total sum of distances between treated units and
matched controls, this new method achieves covariate balance directly,
either by minimizing both the total sum of distances and a weighted sum of
specific measures of covariate imbalance, or by minimizing the total sum
of distances while constraining the measures of imbalance to be less than
or equal to certain tolerances. The inclusion of these extra terms in the
objective function or the use of these additional constraints explicitly
optimizes or constrains the criteria that will be used to evaluate the
quality of the match. For example, the method minimizes or constrains
differences in univariate moments, such as means, variances, and skewness;
differences in multivariate moments, such as correlations between
covariates; differences in quantiles; and differences in statistics, such
as the Kolmogorov--Smirnov statistic, to minimize the differences in both
location and shape of the empirical distributions of the treated units and
matched controls. While balancing several of these measures, it is also
possible to impose constraints for exact and near-exact matching, and fine
and near-fine balance for more than one nominal covariate, whereas network
algorithms can finely or near-finely balance only a single nominal
covariate. From a practical standpoint, this method eliminates the
guesswork involved in current optimal matching methods, and offers a
controlled and systematic way of improving covariate balance by focusing
the matching efforts on certain measures of covariate imbalance and their
corresponding weights or tolerances. A matched case--control study of
acute kidney injury after surgery among Medicare patients illustrates
these features in detail. A new R package called
mipmatch implements the method.
Journal: Journal of the American Statistical Association
Pages: 1360-1371
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.703874
File-URL: http://hdl.handle.net/10.1080/01621459.2012.703874
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1360-1371
Template-Type: ReDIF-Article 1.0
Author-Name: Donatello Telesca
Author-X-Name-First: Donatello
Author-X-Name-Last: Telesca
Author-Name: Peter Müller
Author-X-Name-First: Peter
Author-X-Name-Last: Müller
Author-Name: Steven M. Kornblau
Author-X-Name-First: Steven M.
Author-X-Name-Last: Kornblau
Author-Name: Marc A. Suchard
Author-X-Name-First: Marc A.
Author-X-Name-Last: Suchard
Author-Name: Yuan Ji
Author-X-Name-First: Yuan
Author-X-Name-Last: Ji
Title: Modeling Protein Expression and Protein Signaling Pathways
Abstract:
High-throughput functional proteomic technologies provide a way to
quantify the expression of proteins of interest. Statistical inference
centers on identifying the activation state of proteins and their patterns
of molecular interaction formalized as dependence structure. Inference on
dependence structure is particularly important when proteins are selected
because they are part of a common molecular pathway. In that case,
inference on dependence structure reveals properties of the underlying
pathway. We propose a probability model that represents molecular
interactions at the level of hidden binary latent variables that can be
interpreted as indicators for active versus inactive states of the
proteins. The proposed approach exploits available expert knowledge about
the target pathway to define an informative prior on the hidden
conditional dependence structure. An important feature of this prior is
that it provides an instrument to explicitly anchor the model space to a
set of interactions of interest, favoring a local search approach to model
determination. We apply our model to reverse-phase protein array data from
a study on acute myeloid leukemia. Our inference identifies relevant
subpathways in relation to the unfolding of the biological process under
study.
Journal: Journal of the American Statistical Association
Pages: 1372-1384
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.706121
File-URL: http://hdl.handle.net/10.1080/01621459.2012.706121
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1372-1384
Template-Type: ReDIF-Article 1.0
Author-Name: Daniel Manrique-Vallier
Author-X-Name-First: Daniel
Author-X-Name-Last: Manrique-Vallier
Author-Name: Jerome P. Reiter
Author-X-Name-First: Jerome P.
Author-X-Name-Last: Reiter
Title: Estimating Identification Disclosure Risk Using Mixed Membership Models
Abstract:
Statistical agencies and other organizations that disseminate data are
obligated to protect data subjects’ confidentiality. For example,
ill-intentioned individuals might link data subjects to records in other
databases by matching on common characteristics (keys). Successful links
are particularly problematic for data subjects with combinations of keys
that are unique in the population. Hence, as part of their assessments of
disclosure risks, many data stewards estimate the probabilities that
sample uniques on sets of discrete keys are also population uniques on
those keys. This is typically done using log-linear modeling on the keys.
However, log-linear models can yield biased estimates of cell
probabilities for sparse contingency tables with many zero counts, which
often occurs in databases with many keys. This bias can result in
unreliable estimates of probabilities of uniqueness and, hence,
misrepresentations of disclosure risks. We propose an alternative to
log-linear models for datasets with sparse keys based on a Bayesian
version of grade of membership (GoM) models. We present a Bayesian GoM
model for multinomial variables and offer a Markov chain Monte Carlo
algorithm for fitting the model. We evaluate the approach by treating data
from a recent U.S. Census Bureau public use microdata sample as a
population, taking simple random samples from that population, and
benchmarking estimated probabilities of uniqueness against population
values. Compared to log-linear models, GoM models provide more accurate
estimates of the total number of uniques in the samples. Additionally,
they offer record-level predictions of uniqueness that dominate those
based on log-linear models. This article has online supplementary
materials.
Journal: Journal of the American Statistical Association
Pages: 1385-1394
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.710508
File-URL: http://hdl.handle.net/10.1080/01621459.2012.710508
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1385-1394
Template-Type: ReDIF-Article 1.0
Author-Name: Michelle R. Danaher
Author-X-Name-First: Michelle R.
Author-X-Name-Last: Danaher
Author-Name: Anindya Roy
Author-X-Name-First: Anindya
Author-X-Name-Last: Roy
Author-Name: Zhen Chen
Author-X-Name-First: Zhen
Author-X-Name-Last: Chen
Author-Name: Sunni L. Mumford
Author-X-Name-First: Sunni L.
Author-X-Name-Last: Mumford
Author-Name: Enrique F. Schisterman
Author-X-Name-First: Enrique F.
Author-X-Name-Last: Schisterman
Title: Minkowski--Weyl Priors for Models With Parameter Constraints: An Analysis of the BioCycle Study
Abstract:
We propose a general framework for performing full Bayesian analysis
under linear inequality parameter constraints. The proposal is motivated
by the BioCycle Study, a large cohort study of hormone levels of healthy
women where certain well-established linear inequality constraints on the
log-hormone levels should be accounted for in the statistical inferential
procedure. Based on the Minkowski--Weyl decomposition of polyhedral
regions, we propose a class of priors that are fully supported on the
parameter space with linear inequality constraints, and we fit a Bayesian
linear mixed model to the BioCycle data using such a prior. We observe
positive associations between estrogen and progesterone levels and
F2-isoprostanes, a marker for oxidative stress. These findings
are of particular interest to reproductive epidemiologists.
Journal: Journal of the American Statistical Association
Pages: 1395-1409
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.712414
File-URL: http://hdl.handle.net/10.1080/01621459.2012.712414
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1395-1409
Template-Type: ReDIF-Article 1.0
Author-Name: Vanja Dukic
Author-X-Name-First: Vanja
Author-X-Name-Last: Dukic
Author-Name: Hedibert F. Lopes
Author-X-Name-First: Hedibert F.
Author-X-Name-Last: Lopes
Author-Name: Nicholas G. Polson
Author-X-Name-First: Nicholas G.
Author-X-Name-Last: Polson
Title: Tracking Epidemics With Google Flu Trends Data and a State-Space SEIR Model
Abstract:
In this article, we use Google Flu Trends data together with a sequential
surveillance model based on state-space methodology to track the evolution
of an epidemic process over time. We embed a classical mathematical
epidemiology model [a susceptible-exposed-infected-recovered (SEIR) model]
within the state-space framework, thereby extending the SEIR dynamics to
allow changes through time. The implementation of this model is based on a
particle filtering algorithm, which learns about the epidemic process
sequentially through time and provides updated estimated odds of a
pandemic with each new surveillance data point. We show how our approach,
in combination with sequential Bayes factors, can serve as an online
diagnostic tool for influenza pandemics. We take a close look at the Google
Flu Trends data describing the spread of flu in the United States during
2003--2009 and in nine separate U.S. states chosen to represent a wide
range of health care and emergency system strengths and weaknesses. This
article has online supplementary materials.
Journal: Journal of the American Statistical Association
Pages: 1410-1426
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.713876
File-URL: http://hdl.handle.net/10.1080/01621459.2012.713876
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1410-1426
Template-Type: ReDIF-Article 1.0
Author-Name: Donatello Telesca
Author-X-Name-First: Donatello
Author-X-Name-Last: Telesca
Author-Name: Elena A. Erosheva
Author-X-Name-First: Elena A.
Author-X-Name-Last: Erosheva
Author-Name: Derek A. Kreager
Author-X-Name-First: Derek A.
Author-X-Name-Last: Kreager
Author-Name: Ross L. Matsueda
Author-X-Name-First: Ross L.
Author-X-Name-Last: Matsueda
Title: Modeling Criminal Careers as Departures From a Unimodal Population Age--Crime Curve: The Case of Marijuana Use
Abstract:
A major aim of longitudinal analyses of life-course data is to describe
the within- and between-individual variability in a behavioral outcome,
such as crime. Statistical analyses of such data typically draw on mixture
and mixed-effects growth models. In this work, we present a functional
analytic point of view and develop an alternative method that models
individual crime trajectories as departures from a population age--crime
curve. Drawing on empirical and theoretical claims in criminology, we
assume a unimodal population age--crime curve and allow individual
expected crime trajectories to differ by their levels of offending and
patterns of temporal misalignment. We extend Bayesian hierarchical curve
registration methods to accommodate count data and to incorporate
influence of baseline covariates on individual behavioral trajectories.
Analyzing self-reported counts of yearly marijuana use from the Denver
Youth Survey, we examine the influence of race and gender categories on
differences in levels and timing of marijuana smoking. We find that our
approach offers a flexible model for longitudinal crime trajectories and
allows for a rich array of inferences of interest to criminologists and
drug abuse researchers. This article has supplementary materials online.
Journal: Journal of the American Statistical Association
Pages: 1427-1440
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.716328
File-URL: http://hdl.handle.net/10.1080/01621459.2012.716328
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1427-1440
Template-Type: ReDIF-Article 1.0
Author-Name: Summer S. Han
Author-X-Name-First: Summer S.
Author-X-Name-Last: Han
Author-Name: Philip S. Rosenberg
Author-X-Name-First: Philip S.
Author-X-Name-Last: Rosenberg
Author-Name: Nilanjan Chatterjee
Author-X-Name-First: Nilanjan
Author-X-Name-Last: Chatterjee
Title: Testing for Gene--Environment and Gene--Gene Interactions Under Monotonicity Constraints
Abstract:
Recent genome-wide association studies (GWASs) designed to detect the
main effects of genetic markers have had considerable success with many
findings validated by replication studies. However, relatively few
findings of gene--gene or gene--environment interactions have been
successfully reproduced. Besides the main issues associated with
insufficient sample size in current studies, a complication is that
interactions that rank high based on p-values often
correspond to extreme forms of joint effects that are biologically less
plausible. To reduce false positives and to increase power, we develop
various gene--environment/gene--gene tests based on biologically more
plausible constraints using bivariate isotonic regressions for
case--control data. We extend our method to exploit gene--environment or
gene--gene independence information, integrating the approach proposed by
Chatterjee and Carroll. We propose appropriate nonparametric and
parametric permutation procedures for evaluating the significance of the
tests. Simulations show that our method gains power over traditional
unconstrained methods by reducing the sizes of alternative parameter
spaces. We apply our method to several real-data examples, including an
analysis of bladder cancer data to detect interactions between the
NAT2 gene and smoking. We also show that the proposed
method is computationally feasible for large-scale problems by applying it
to the National Cancer Institute (NCI) lung cancer GWAS data.
Journal: Journal of the American Statistical Association
Pages: 1441-1452
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.726892
File-URL: http://hdl.handle.net/10.1080/01621459.2012.726892
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1441-1452
Template-Type: ReDIF-Article 1.0
Author-Name: Huixia Judy Wang
Author-X-Name-First: Huixia Judy
Author-X-Name-Last: Wang
Author-Name: Deyuan Li
Author-X-Name-First: Deyuan
Author-X-Name-Last: Li
Author-Name: Xuming He
Author-X-Name-First: Xuming
Author-X-Name-Last: He
Title: Estimation of High Conditional Quantiles for Heavy-Tailed Distributions
Abstract:
Estimation of conditional quantiles at very high or low tails is of
interest in numerous applications. Quantile regression provides a
convenient and natural way of quantifying the impact of covariates at
different quantiles of a response distribution. However, high tails are
often associated with data sparsity, so quantile regression estimation can
suffer from high variability at tails especially for heavy-tailed
distributions. In this article, we develop new estimation methods for high
conditional quantiles by first estimating the intermediate conditional
quantiles in a conventional quantile regression framework and then
extrapolating these estimates to the high tails based on reasonable
assumptions on tail behaviors. We establish the asymptotic properties of
the proposed estimators and demonstrate through simulation studies that
the proposed methods enjoy higher accuracy than the conventional quantile
regression estimates. In a real application involving statistical
downscaling of daily precipitation in the Chicago area, the proposed
methods provide more stable results quantifying the chance of heavy
precipitation in the area. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 1453-1464
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.716382
File-URL: http://hdl.handle.net/10.1080/01621459.2012.716382
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1453-1464
Template-Type: ReDIF-Article 1.0
Author-Name: Xianchao Xie
Author-X-Name-First: Xianchao
Author-X-Name-Last: Xie
Author-Name: S. C. Kou
Author-X-Name-First: S. C.
Author-X-Name-Last: Kou
Author-Name: Lawrence D. Brown
Author-X-Name-First: Lawrence D.
Author-X-Name-Last: Brown
Title: SURE Estimates for a Heteroscedastic Hierarchical Model
Abstract:
Hierarchical models are extensively studied and widely used in statistics
and many other scientific areas. They provide an effective tool for
combining information from similar resources and achieving partial pooling
of inference. Since the seminal work by James and Stein (1961) and Stein
(1962), shrinkage estimation has become one major focus for hierarchical
models. For the homoscedastic normal model, it is well known that
shrinkage estimators, especially the James--Stein estimator, have good risk
properties. The heteroscedastic model, though more appropriate for
practical applications, is less well studied, and it is unclear what types
of shrinkage estimators are superior in terms of the risk. We propose in
this article a class of shrinkage estimators based on Stein’s
unbiased estimate of risk (SURE). We study asymptotic properties of
various common estimators as the number of means to be estimated grows
(p → ∞). We establish the asymptotic
optimality property for the SURE estimators. We then extend our
construction to create a class of semiparametric shrinkage estimators and
establish corresponding asymptotic optimality results. We emphasize that
though the form of our SURE estimators is partially obtained through a
normal model at the sampling level, their optimality properties do not
heavily depend on such distributional assumptions. We apply the methods to
two real datasets and obtain encouraging results.
Journal: Journal of the American Statistical Association
Pages: 1465-1479
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.728154
File-URL: http://hdl.handle.net/10.1080/01621459.2012.728154
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1465-1479
Template-Type: ReDIF-Article 1.0
Author-Name: Lingzhou Xue
Author-X-Name-First: Lingzhou
Author-X-Name-Last: Xue
Author-Name: Shiqian Ma
Author-X-Name-First: Shiqian
Author-X-Name-Last: Ma
Author-Name: Hui Zou
Author-X-Name-First: Hui
Author-X-Name-Last: Zou
Title: Positive-Definite ℓ1-Penalized Estimation of Large Covariance Matrices
Abstract:
The thresholding covariance estimator has nice asymptotic properties for
estimating sparse large covariance matrices, but it often has negative
eigenvalues when used in real data analysis. To fix this drawback of
thresholding estimation, we develop a positive-definite
ℓ1-penalized covariance estimator for estimating sparse
large covariance matrices. We derive an efficient alternating direction
method to solve the challenging optimization problem and establish its
convergence properties. Under weak regularity conditions, nonasymptotic
statistical theory is also established for the proposed estimator. The
competitive finite-sample performance of our proposal is demonstrated by
both simulation and real applications.
Journal: Journal of the American Statistical Association
Pages: 1480-1491
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.725386
File-URL: http://hdl.handle.net/10.1080/01621459.2012.725386
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1480-1491
Template-Type: ReDIF-Article 1.0
Author-Name: Layla Parast
Author-X-Name-First: Layla
Author-X-Name-Last: Parast
Author-Name: Su-Chun Cheng
Author-X-Name-First: Su-Chun
Author-X-Name-Last: Cheng
Author-Name: Tianxi Cai
Author-X-Name-First: Tianxi
Author-X-Name-Last: Cai
Title: Landmark Prediction of Long-Term Survival Incorporating Short-Term Event Time Information
Abstract:
In recent years, a wide range of markers have become available as
potential tools to predict risk or progression of disease. In addition to
such biological and genetic markers, short-term outcome information may be
useful in predicting long-term disease outcomes. When such information is
available, it would be desirable to combine this along with predictive
markers to improve the prediction of long-term survival. Most existing
methods for incorporating censored short-term event information in
predicting long-term survival focus on modeling the disease process and
are derived under restrictive parametric models in a multistate survival
setting. When such model assumptions fail to hold, the resulting
prediction of long-term outcomes may be invalid or inaccurate. When there
is only a single discrete baseline covariate, a fully nonparametric
estimation procedure to incorporate short-term event time information has
been previously proposed. However, such an approach is not feasible for
settings with one or more continuous covariates due to the curse of
dimensionality. In this article, we propose to incorporate short-term
event time information along with multiple covariates collected up to a
landmark point via a flexible varying-coefficient model. To evaluate and
compare the prediction performance of the resulting landmark prediction
rule, we use robust nonparametric procedures that do not require the
correct specification of the proposed varying-coefficient model.
Simulation studies suggest that the proposed procedures perform well in
finite samples. We illustrate them here using a dataset of postdialysis
patients with end-stage renal disease.
Journal: Journal of the American Statistical Association
Pages: 1492-1501
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.721281
File-URL: http://hdl.handle.net/10.1080/01621459.2012.721281
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1492-1501
Template-Type: ReDIF-Article 1.0
Author-Name: P. L. Davies
Author-X-Name-First: P. L.
Author-X-Name-Last: Davies
Title: Interactions in the Analysis of Variance
Abstract:
The standard model for the analysis of variance is over-parameterized.
The resulting identifiability problem is typically solved by placing
linear constraints on the parameters. In the case of the interactions,
these require that the marginal sums be zero. Although seemingly neutral,
these conditions have unintended consequences: the interactions are of
necessity connected whether or not this is justified, the minimum number
of nonzero interactions is four, and, in particular, it is not possible to
have a single interaction in one cell. There is no reason why nature
should conform to these constraints. The approach taken in this article is
one of sparsity: the linear factor effects are chosen so as to minimize
the number of nonzero interactions subject to consistency with the data.
The resulting interactions are attached to individual cells making their
interpretation easier irrespective of whether they are isolated or form
clusters. In general, the calculation of a sparse solution is a difficult
combinatorial problem but the special nature of the analysis of variance
simplifies matters considerably. In many cases, the sparse
L0 solution coincides with the L1 solution obtained by minimizing the sum of
the absolute residuals, which can be calculated quickly. The identity of
the two solutions can be checked either algorithmically or by applying
known sufficient conditions for equality.
Journal: Journal of the American Statistical Association
Pages: 1502-1509
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.726895
File-URL: http://hdl.handle.net/10.1080/01621459.2012.726895
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1502-1509
Template-Type: ReDIF-Article 1.0
Author-Name: Justin S. Dyer
Author-X-Name-First: Justin S.
Author-X-Name-Last: Dyer
Author-Name: Art B. Owen
Author-X-Name-First: Art B.
Author-X-Name-Last: Owen
Title: Correct Ordering in the Zipf--Poisson Ensemble
Abstract:
Rankings based on counts are often presented to identify popular items,
such as baby names, English words, or Web sites. This article shows that,
in some examples, the number of correctly identified items can be very
small. We introduce a standard error versus rank plot to diagnose possible
misrankings. Then to explain the slowly growing number of correct ranks,
we model the entire set of count data via a Zipf--Poisson ensemble with
independent X_i ∼ Poi(N i^(−α)) for α > 1 and
N > 0 and integers i ⩾ 1. We show
that as N → ∞, the first
n′(N) random variables have their
proper order relative to each other, with
probability tending to 1, for n′ up to
(AN/log N)^(1/(α + 2))
with A = α^2(α + 2)/4. We also show that
the rate N^(1/(α + 2)) cannot be achieved. The
ordering of the first n′(N)
entities does not preclude interlopers at some
m > n′. However, we show that the
first n″ random variables are correctly ordered
exclusive of any interlopers, with probability tending to 1 if
n″ ⩽ (BN/log N)^(1/(α + 2))
for any B > A. We also show how to
compute the cutoff for alternative models such as a
Zipf--Mandelbrot--Poisson ensemble.
Journal: Journal of the American Statistical Association
Pages: 1510-1517
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.734177
File-URL: http://hdl.handle.net/10.1080/01621459.2012.734177
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1510-1517
Template-Type: ReDIF-Article 1.0
Author-Name: Lisha Chen
Author-X-Name-First: Lisha
Author-X-Name-Last: Chen
Author-Name: Jianhua Z. Huang
Author-X-Name-First: Jianhua Z.
Author-X-Name-Last: Huang
Title: Sparse Reduced-Rank Regression for Simultaneous Dimension Reduction and Variable Selection
Abstract:
The reduced-rank regression is an effective method in predicting multiple
response variables from the same set of predictor variables. It reduces
the number of model parameters and takes advantage of interrelations
between the response variables and hence improves predictive accuracy. We
propose to select relevant variables for reduced-rank regression by using
a sparsity-inducing penalty. We apply a group-lasso type penalty that
treats each row of the matrix of the regression coefficients as a group
and show that this penalty satisfies certain desirable invariance
properties. We develop two numerical algorithms to solve the penalized
regression problem and establish the asymptotic consistency of the
proposed method. In particular, the manifold structure of the reduced-rank
regression coefficient matrix is considered and studied in our theoretical
analysis. In our simulation study and real data analysis, the new method
is compared with several existing variable selection methods for
multivariate regression and exhibits competitive performance in prediction
and variable selection.
Journal: Journal of the American Statistical Association
Pages: 1533-1545
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.734178
File-URL: http://hdl.handle.net/10.1080/01621459.2012.734178
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1533-1545
Template-Type: ReDIF-Article 1.0
Author-Name: S. C. Kou
Author-X-Name-First: S. C.
Author-X-Name-Last: Kou
Author-Name: Benjamin P. Olding
Author-X-Name-First: Benjamin P.
Author-X-Name-Last: Olding
Author-Name: Martin Lysy
Author-X-Name-First: Martin
Author-X-Name-Last: Lysy
Author-Name: Jun S. Liu
Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu
Title: A Multiresolution Method for Parameter Estimation of Diffusion Processes
Abstract:
Diffusion process models are widely used in science, engineering, and
finance. Most diffusion processes are described by stochastic differential
equations in continuous time. In practice, however, data are typically
observed only at discrete time points. Except for a few very special
cases, no analytic form exists for the likelihood of such discretely
observed data. For this reason, parametric inference is often achieved by
using discrete-time approximations, with accuracy controlled through the
introduction of missing data. We present a new multiresolution Bayesian
framework to address the inference difficulty. The methodology relies on
the use of multiple approximations and extrapolation and is significantly
faster and more accurate than known strategies based on Gibbs sampling. We
apply the multiresolution approach to three data-driven inference
problems, one of which features a multivariate diffusion model with an
entirely unobserved component.
Journal: Journal of the American Statistical Association
Pages: 1558-1574
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.720899
File-URL: http://hdl.handle.net/10.1080/01621459.2012.720899
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1558-1574
Template-Type: ReDIF-Article 1.0
Author-Name: Ori Rosen
Author-X-Name-First: Ori
Author-X-Name-Last: Rosen
Author-Name: Sally Wood
Author-X-Name-First: Sally
Author-X-Name-Last: Wood
Author-Name: David S. Stoffer
Author-X-Name-First: David S.
Author-X-Name-Last: Stoffer
Title: AdaptSPEC: Adaptive Spectral Estimation for Nonstationary Time Series
Abstract:
We propose a method for analyzing possibly nonstationary time series by
adaptively dividing the time series into an unknown but finite number of
segments and estimating the corresponding local spectra by smoothing
splines. The model is formulated in a Bayesian framework, and the
estimation relies on reversible jump Markov chain Monte Carlo (RJMCMC)
methods. For a given segmentation of the time series, the likelihood
function is approximated via a product of local Whittle likelihoods. Thus,
no parametric assumption is made about the process underlying the time
series. The number and lengths of the segments are assumed unknown and may
change from one MCMC iteration to another. The frequentist properties of
the method are investigated by simulation, and applications to
electroencephalograms and the El Niño Southern Oscillation phenomenon
are described in detail.
Journal: Journal of the American Statistical Association
Pages: 1575-1589
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.716340
File-URL: http://hdl.handle.net/10.1080/01621459.2012.716340
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1575-1589
Template-Type: ReDIF-Article 1.0
Author-Name: Kehui Chen
Author-X-Name-First: Kehui
Author-X-Name-Last: Chen
Author-Name: Hans-Georg Müller
Author-X-Name-First: Hans-Georg
Author-X-Name-Last: Müller
Title: Modeling Repeated Functional Observations
Abstract:
We introduce a new methodological framework for repeatedly observed and
thus dependent functional data, aiming at situations where curves are
recorded repeatedly for each subject in a sample. Our methodology covers
the case where the recordings of the curves are scheduled on a regular and
dense grid and also situations more typical for longitudinal studies,
where the timing of recordings is often sparse and random. The proposed
models lead to an interpretable and straightforward decomposition of the
inherent variation in repeatedly observed functional data and are
implemented through a straightforward two-step functional principal
component analysis. We provide consistency results and asymptotic
convergence rates for the estimated model components. We compare the
proposed model with an alternative approach via a two-dimensional
Karhunen--Loève expansion and illustrate it through the analysis of
longitudinal mortality data from period lifetables that are repeatedly
observed for a sample of countries over many years, and also through
simulation studies. This article has online supplementary materials.
Journal: Journal of the American Statistical Association
Pages: 1599-1609
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.734196
File-URL: http://hdl.handle.net/10.1080/01621459.2012.734196
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1599-1609
Template-Type: ReDIF-Article 1.0
Author-Name: Howard D. Bondell
Author-X-Name-First: Howard D.
Author-X-Name-Last: Bondell
Author-Name: Brian J. Reich
Author-X-Name-First: Brian J.
Author-X-Name-Last: Reich
Title: Consistent High-Dimensional Bayesian Variable Selection via Penalized Credible Regions
Abstract:
For high-dimensional data, particularly when the number of predictors
greatly exceeds the sample size, selection of relevant predictors for
regression is a challenging problem. Methods such as sure screening,
forward selection, or penalized regressions are commonly used. Bayesian
variable selection methods place prior distributions on the parameters
along with a prior over model space, or equivalently, a mixture prior on
the parameters having mass at zero. Since exhaustive enumeration is not
feasible, posterior model probabilities are often obtained via long Markov
chain Monte Carlo (MCMC) runs. The chosen model can depend heavily on
various choices for priors and also posterior thresholds. Alternatively,
we propose a conjugate prior only on the full model parameters and use
sparse solutions within posterior credible regions to perform selection.
These posterior credible regions often have closed-form representations,
and it is shown that these sparse solutions can be computed via existing
algorithms. The approach is shown to outperform common methods in the
high-dimensional setting, particularly under correlation. By searching for
a sparse solution within a joint credible region, consistent model
selection is established. Furthermore, it is shown that, under certain
conditions, the use of marginal credible intervals can give consistent
selection up to the case where the dimension grows exponentially in the
sample size. The proposed approach successfully accomplishes variable
selection in the high-dimensional setting, while avoiding pitfalls that
plague typical Bayesian variable selection methods.
Journal: Journal of the American Statistical Association
Pages: 1610-1624
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.716344
File-URL: http://hdl.handle.net/10.1080/01621459.2012.716344
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1610-1624
Template-Type: ReDIF-Article 1.0
Author-Name: Haipeng Xing
Author-X-Name-First: Haipeng
Author-X-Name-Last: Xing
Author-Name: Zhiliang Ying
Author-X-Name-First: Zhiliang
Author-X-Name-Last: Ying
Title: A Semiparametric Change-Point Regression Model for Longitudinal Observations
Abstract:
Many longitudinal studies involve relating an outcome process to a set of
possibly time-varying covariates, giving rise to the usual regression
models for longitudinal data. When the purpose of the study is to
investigate the covariate effects when the experimental environment undergoes
abrupt changes or to locate the periods with different levels of covariate
effects, a simple and easy-to-interpret approach is to introduce
change-points in regression coefficients. In this connection, we propose a
semiparametric change-point regression model, in which the error process
(stochastic component) is nonparametric and the baseline mean function
(functional part) is completely unspecified, the observation times are
allowed to be subject specific, and the number, locations, and magnitudes
of change-points are unknown and need to be estimated. We further develop
an estimation procedure that combines the recent advance in semiparametric
analysis based on counting process argument and multiple change-points
inference and discuss its large sample properties, including consistency
and asymptotic normality, under suitable regularity conditions. Simulation
results show that the proposed methods work well under a variety of
scenarios. An application to a real dataset is also given.
Journal: Journal of the American Statistical Association
Pages: 1625-1637
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.712425
File-URL: http://hdl.handle.net/10.1080/01621459.2012.712425
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1625-1637
Template-Type: ReDIF-Article 1.0
Author-Name: Paul S. Clarke
Author-X-Name-First: Paul S.
Author-X-Name-Last: Clarke
Author-Name: Frank Windmeijer
Author-X-Name-First: Frank
Author-X-Name-Last: Windmeijer
Title: Instrumental Variable Estimators for Binary Outcomes
Abstract:
Instrumental variables (IVs) can be used to construct estimators of
exposure effects on the outcomes of studies affected by nonignorable
selection of the exposure. Estimators that fail to adjust for the effects
of nonignorable selection will be biased and inconsistent. Such situations
commonly arise in observational studies, but are also a problem for
randomized experiments affected by nonignorable noncompliance. In this
article, we review IV estimators for studies in which the outcome is
binary, and consider the links between different approaches developed in
the statistics and econometrics literatures. The implicit assumptions made
by each method are highlighted and compared within our framework. We
illustrate our findings through the reanalysis of a randomized
placebo-controlled trial, and highlight important directions for future
work in this area.
Journal: Journal of the American Statistical Association
Pages: 1638-1652
Issue: 500
Volume: 107
Year: 2012
Month: 12
X-DOI: 10.1080/01621459.2012.734171
File-URL: http://hdl.handle.net/10.1080/01621459.2012.734171
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:107:y:2012:i:500:p:1638-1652
Template-Type: ReDIF-Article 1.0
Author-Name: Robert N. Rodriguez
Author-X-Name-First: Robert N.
Author-X-Name-Last: Rodriguez
Title: Building the Big Tent for Statistics
Journal: Journal of the American Statistical Association
Pages: 1-6
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2013.771010
File-URL: http://hdl.handle.net/10.1080/01621459.2013.771010
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:1-6
Template-Type: ReDIF-Article 1.0
Author-Name: Haeran Cho
Author-X-Name-First: Haeran
Author-X-Name-Last: Cho
Author-Name: Yannig Goude
Author-X-Name-First: Yannig
Author-X-Name-Last: Goude
Author-Name: Xavier Brossat
Author-X-Name-First: Xavier
Author-X-Name-Last: Brossat
Author-Name: Qiwei Yao
Author-X-Name-First: Qiwei
Author-X-Name-Last: Yao
Title: Modeling and Forecasting Daily Electricity Load Curves: A Hybrid Approach
Abstract:
We propose a hybrid approach for the modeling and the
short-term forecasting of electricity loads. Two building blocks of our
approach are (1) modeling the overall trend and seasonality by fitting a
generalized additive model to the weekly averages of the
load and (2) modeling the dependence structure across consecutive
daily loads via curve linear regression. For the latter,
a new methodology is proposed for linear regression with both curve
response and curve regressors. The key idea behind the proposed
methodology is dimension reduction based on a singular value decomposition
in a Hilbert space, which reduces the curve regression problem to several
ordinary (i.e., scalar) linear regression problems. We illustrate the
hybrid method using French electricity loads between 1996 and 2009, on
which we also compare our method with other available models including the
Électricité de France operational model. Supplementary materials
for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 7-21
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.722900
File-URL: http://hdl.handle.net/10.1080/01621459.2012.722900
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:7-21
Template-Type: ReDIF-Article 1.0
Author-Name: Ephraim M. Hanks
Author-X-Name-First: Ephraim M.
Author-X-Name-Last: Hanks
Author-Name: Mevin B. Hooten
Author-X-Name-First: Mevin B.
Author-X-Name-Last: Hooten
Title: Circuit Theory and Model-Based Inference for Landscape Connectivity
Abstract:
Circuit theory has seen extensive recent use in the field of
ecology, where it is often applied to study functional connectivity. The
landscape is typically represented by a network of nodes and resistors,
with the resistance between nodes a function of landscape characteristics.
The effective distance between two locations on a landscape is represented
by the resistance distance between the nodes in the network. Circuit
theory has been applied to many other scientific fields for exploratory
analyses, but parametric models for circuits are not common in the
scientific literature. To model circuits explicitly, we demonstrate a link
between Gaussian Markov random fields and contemporary circuit theory
using a covariance structure that induces the necessary resistance
distance. This provides a parametric model for second-order observations
from such a system. In the landscape ecology setting, the proposed model
provides a simple framework where inference can be obtained for effects
that landscape features have on functional connectivity. We illustrate the
approach through a landscape genetics study linking gene flow in alpine
chamois (Rupicapra rupicapra) to the underlying
landscape.
Journal: Journal of the American Statistical Association
Pages: 22-33
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.724647
File-URL: http://hdl.handle.net/10.1080/01621459.2012.724647
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:22-33
Template-Type: ReDIF-Article 1.0
Author-Name: Roee Gutman
Author-X-Name-First: Roee
Author-X-Name-Last: Gutman
Author-Name: Christopher C. Afendulis
Author-X-Name-First: Christopher C.
Author-X-Name-Last: Afendulis
Author-Name: Alan M. Zaslavsky
Author-X-Name-First: Alan M.
Author-X-Name-Last: Zaslavsky
Title: A Bayesian Procedure for File Linking to Analyze End-of-Life Medical Costs
Abstract:
End-of-life medical expenses are a significant proportion of
all health care expenditures. These costs were studied using costs of
services from Medicare claims and cause of death (CoD) from death
certificates. In the absence of a unique identifier linking the two
datasets, common variables identified unique matches for only 33% of
deaths. The remaining cases formed cells with multiple cases (32% in cells
with an equal number of cases from each file and 35% in cells with an
unequal number). We sampled from the joint posterior distribution of model
parameters and the permutations that link cases from the two files within
each cell. The linking models included the regression of location of death
on CoD and other parameters, and the regression of cost measures with a
monotone missing data pattern on CoD and other demographic
characteristics. Permutations were sampled by enumerating the exact
distribution for small cells and by the Metropolis algorithm for large
cells. Sparse matrix data structures enabled efficient calculations
despite the large dataset (≈1.7 million cases). The procedure
generates m datasets in which the matches between the two
files are imputed. The m datasets can be analyzed
independently and results can be combined using Rubin's multiple
imputation rules. Our approach can be applied in other file-linking
applications. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 34-47
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.726889
File-URL: http://hdl.handle.net/10.1080/01621459.2012.726889
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:34-47
Template-Type: ReDIF-Article 1.0
Author-Name: Man-Wai Ho
Author-X-Name-First: Man-Wai
Author-X-Name-Last: Ho
Author-Name: Wanzhu Tu
Author-X-Name-First: Wanzhu
Author-X-Name-Last: Tu
Author-Name: Pulak Ghosh
Author-X-Name-First: Pulak
Author-X-Name-Last: Ghosh
Author-Name: Ram C. Tiwari
Author-X-Name-First: Ram C.
Author-X-Name-Last: Tiwari
Title: A Nested Dirichlet Process Analysis of Cluster Randomized Trial Data With Application in Geriatric Care Assessment
Abstract:
In cluster randomized trials, patients seen by the same
physician are randomized to the same treatment arm as a group. Besides the
natural clustering of patients due to cluster/group randomization,
interactions between an individual patient and the attending physician
within the group could just as well influence patient care outcomes.
Despite the intuitive relevance of these interactions to treatment
assessment, few studies have thus far examined their influences. Whether
and to what extent these interactions affect assessment of the treatment
effect remains unexplored. In fact, few statistical models provide ready
accommodation for such interactions. In this research, we propose a
general modeling framework based on the nested Dirichlet process (nDP) for
assessing treatment effect in cluster randomized trials. The proposed
methodology explicitly accounts for physician--patient interactions by
assuming that the interactions follow unspecified group-specific
distributions from an nDP. In addition to accounting for
physician--patient interactions, the model greatly enhances the
flexibility of traditional mixed effects models by allowing for nonnormally
distributed random effects, thus alleviating concerns about mixed effect
misspecification and sidestepping verification of distributional
assumptions on random effects. At the same time, the model retains the
mixed models' ability to make inferences on fixed effects. The proposed
method is easily extendable to more complicated hierarchical clustering
structures. We introduce the method in the context of a real cluster
randomized trial. A comprehensive simulation study was conducted to assess
the operating characteristics of the proposed nDP model.
Journal: Journal of the American Statistical Association
Pages: 48-68
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.734164
File-URL: http://hdl.handle.net/10.1080/01621459.2012.734164
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:48-68
Template-Type: ReDIF-Article 1.0
Author-Name: Riten Mitra
Author-X-Name-First: Riten
Author-X-Name-Last: Mitra
Author-Name: Peter Müller
Author-X-Name-First: Peter
Author-X-Name-Last: Müller
Author-Name: Shoudan Liang
Author-X-Name-First: Shoudan
Author-X-Name-Last: Liang
Author-Name: Lu Yue
Author-X-Name-First: Lu
Author-X-Name-Last: Yue
Author-Name: Yuan Ji
Author-X-Name-First: Yuan
Author-X-Name-Last: Ji
Title: A Bayesian Graphical Model for ChIP-Seq Data on Histone Modifications
Abstract:
Histone modifications (HMs) are an important
post-translational feature. Different types of HMs are believed to
co-exist and co-regulate biological processes such as gene expression and,
therefore, are intrinsically dependent on each other. We develop inference
for this complex biological network of HMs based on a graphical model
using ChIP-Seq data. A critical computational hurdle in the inference for
the proposed graphical model is the evaluation of a normalization constant
in an autologistic model that builds on the graphical model. We tackle the
problem by Monte Carlo evaluation of ratios of normalization constants. We
carry out a set of simulations to validate the proposed approach and to
compare it with a standard approach using Bayesian networks. We report
inference on HM dependence in a case study with ChIP-Seq data from a next
generation sequencing experiment. An important feature of our approach is
that we can report coherent probabilities and estimates related to any
event or parameter of interest, including honest uncertainties. Posterior
inference is obtained from a joint probability model on latent indicators
for the recorded HMs. We illustrate this in the motivating case study. An
R package including an implementation of posterior simulation in C is
available from Riten Mitra upon request.
Journal: Journal of the American Statistical Association
Pages: 69-80
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.746058
File-URL: http://hdl.handle.net/10.1080/01621459.2012.746058
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:69-80
Template-Type: ReDIF-Article 1.0
Author-Name: Michael W. Robbins
Author-X-Name-First: Michael W.
Author-X-Name-Last: Robbins
Author-Name: Sujit K. Ghosh
Author-X-Name-First: Sujit K.
Author-X-Name-Last: Ghosh
Author-Name: Joshua D. Habiger
Author-X-Name-First: Joshua D.
Author-X-Name-Last: Habiger
Title: Imputation in High-Dimensional Economic Data as Applied to the Agricultural Resource Management Survey
Abstract:
In this article, we consider imputation in the USDA's
Agricultural Resource Management Survey (ARMS) data, which is a complex,
high-dimensional economic dataset. We develop a robust joint model for
ARMS data, which requires that variables are transformed using a suitable
class of marginal densities (e.g., skew normal family). We assume that the
transformed variables may be linked through a Gaussian copula, which
enables construction of the joint model via a sequence of conditional
linear models. We also discuss the criteria used to select the predictors
for each conditional model. For the purpose of developing an imputation
method that is conducive to these model assumptions, we propose a
regression-based technique that allows for flexibility in the selection of
conditional models while providing a valid joint distribution. In this
procedure, labeled as iterative sequential regression (ISR), parameter
estimates and imputations are obtained using a Markov chain Monte Carlo
sampling method. Finally, we apply the proposed method to the full ARMS
data, and we present a thorough data analysis that serves to gauge the
appropriateness of the resulting imputations. Our results demonstrate the
effectiveness of the proposed algorithm and illustrate the specific
deficiencies of existing methods. Supplementary materials for this article
are available online.
Journal: Journal of the American Statistical Association
Pages: 81-95
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.734158
File-URL: http://hdl.handle.net/10.1080/01621459.2012.734158
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:81-95
Template-Type: ReDIF-Article 1.0
Author-Name: Mark C. Wheldon
Author-X-Name-First: Mark C.
Author-X-Name-Last: Wheldon
Author-Name: Adrian E. Raftery
Author-X-Name-First: Adrian E.
Author-X-Name-Last: Raftery
Author-Name: Samuel J. Clark
Author-X-Name-First: Samuel J.
Author-X-Name-Last: Clark
Author-Name: Patrick Gerland
Author-X-Name-First: Patrick
Author-X-Name-Last: Gerland
Title: Reconstructing Past Populations With Uncertainty From Fragmentary Data
Abstract:
Current methods for reconstructing human populations of the
past by age and sex are deterministic or do not formally account for
measurement error. We propose a method for simultaneously estimating
age-specific population counts, fertility rates, mortality rates, and net
international migration flows from fragmentary data that incorporates
measurement error. Inference is based on joint posterior probability
distributions that yield fully probabilistic interval estimates. It is
designed for the kind of data commonly collected in modern demographic
surveys and censuses. Population dynamics over the period of
reconstruction are modeled by embedding formal demographic accounting
relationships in a Bayesian hierarchical model. Informative priors are
specified for vital rates, migration rates, population counts at baseline,
and their respective measurement error variances. We investigate
calibration of central posterior marginal probability intervals by
simulation and demonstrate the method by reconstructing the female
population of Burkina Faso from 1960 to 2005. Supplementary materials for
this article are available online and the method is implemented in the R
package "popReconstruct."
Journal: Journal of the American Statistical Association
Pages: 96-110
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.737729
File-URL: http://hdl.handle.net/10.1080/01621459.2012.737729
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:96-110
Template-Type: ReDIF-Article 1.0
Author-Name: Duchwan Ryu
Author-X-Name-First: Duchwan
Author-X-Name-Last: Ryu
Author-Name: Faming Liang
Author-X-Name-First: Faming
Author-X-Name-Last: Liang
Author-Name: Bani K. Mallick
Author-X-Name-First: Bani K.
Author-X-Name-Last: Mallick
Title: Sea Surface Temperature Modeling using Radial Basis Function Networks With a Dynamically Weighted Particle Filter
Abstract:
The sea surface temperature (SST) is an important factor of
the earth climate system. A deep understanding of SST is essential for
climate monitoring and prediction. In general, SST follows a nonlinear
pattern in both time and location and can be modeled by a dynamic system
which changes with time and location. In this article, we propose a radial
basis function network-based dynamic model which is able to catch the
nonlinearity of the data and propose to use the dynamically weighted
particle filter to estimate the parameters of the dynamic model. We
analyze the SST observed in the Caribbean Islands area after a hurricane
using the proposed dynamic model. Compared to the traditional grid-based
approach, which requires a supercomputer due to its high computational
demand, our approach requires much less CPU time and makes real-time
forecasting of SST feasible on a personal computer. Supplementary materials
for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 111-123
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.734151
File-URL: http://hdl.handle.net/10.1080/01621459.2012.734151
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:111-123
Template-Type: ReDIF-Article 1.0
Author-Name: Drew A. Linzer
Author-X-Name-First: Drew A.
Author-X-Name-Last: Linzer
Title: Dynamic Bayesian Forecasting of Presidential Elections in the States
Abstract:
I present a dynamic Bayesian forecasting model that enables
early and accurate prediction of U.S. presidential election outcomes at
the state level. The method systematically combines information from
historical forecasting models in real time with results from the large
number of state-level opinion surveys that are released publicly during
the campaign. The result is a set of forecasts that are initially as good
as the historical model, and then gradually increase in accuracy as
Election Day nears. I employ a hierarchical specification to overcome the
limitation that not every state is polled on every day, allowing the model
to borrow strength both across states and, through the use of random-walk
priors, across time. The model also filters away day-to-day variation in
the polls due to sampling error and national campaign effects, which
enables daily tracking of voter preferences toward the presidential
candidates at the state and national levels. Simulation techniques are
used to estimate the candidates' probability of winning each state and,
consequently, a majority of votes in the Electoral College. I apply the
model to preelection polls from the 2008 presidential campaign and
demonstrate that the victory of Barack Obama was never realistically in
doubt.
Journal: Journal of the American Statistical Association
Pages: 124-134
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.737735
File-URL: http://hdl.handle.net/10.1080/01621459.2012.737735
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:124-134
Template-Type: ReDIF-Article 1.0
Author-Name: Jesse Y. Hsu
Author-X-Name-First: Jesse Y.
Author-X-Name-Last: Hsu
Author-Name: Dylan S. Small
Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small
Author-Name: Paul R. Rosenbaum
Author-X-Name-First: Paul R.
Author-X-Name-Last: Rosenbaum
Title: Effect Modification and Design Sensitivity in Observational Studies
Abstract:
In an observational study of treatment effects, subjects are
not randomly assigned to treatment or control, so differing outcomes in
treated and control groups may reflect a bias from nonrandom assignment
rather than a treatment effect. After adjusting for measured pretreatment
covariates, perhaps by matching, a sensitivity analysis determines the
magnitude of bias from an unmeasured covariate that would need to be
present to alter the conclusions of the naive analysis that presumes
adjustments eliminated all bias. Other things being equal, larger effects
tend to be less sensitive to bias than smaller effects. Effect
modification is an interaction between a treatment and a pretreatment
covariate controlled by matching, so that the treatment effect is larger
at some values of the covariate than at others. In the presence of effect
modification, it is possible that results are less sensitive to bias in
subgroups experiencing larger effects. Two cases are considered: (i) an a
priori grouping into a few categories based on covariates controlled by
matching and (ii) a grouping discovered empirically in the data at hand.
In case (i), subgroup specific bounds on p-values are
combined using the truncated product of p-values. In case
(ii), information that is fixed under the null hypothesis of no treatment
effect is used to partition matched pairs in the hope of identifying pairs
with larger effects. The methods are evaluated using an asymptotic device,
the design sensitivity, and using simulation. Sensitivity analysis for a
test of the global null hypothesis of no effect is converted to
sensitivity analyses for subgroup analyses using closed testing. A study
of an intervention to control malaria in Africa is used to illustrate.
Journal: Journal of the American Statistical Association
Pages: 135-148
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.742018
File-URL: http://hdl.handle.net/10.1080/01621459.2012.742018
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:135-148
Template-Type: ReDIF-Article 1.0
Author-Name: Edoardo M. Airoldi
Author-X-Name-First: Edoardo M.
Author-X-Name-Last: Airoldi
Author-Name: Alexander W. Blocker
Author-X-Name-First: Alexander W.
Author-X-Name-Last: Blocker
Title: Estimating Latent Processes on a Network From Indirect Measurements
Abstract:
In a communication network, point-to-point traffic volumes
over time are critical for designing protocols that route information
efficiently and for maintaining security, whether at the scale of an
Internet service provider or within a corporation. While technically
feasible, the direct measurement of point-to-point traffic imposes a heavy
burden on network performance and is typically not implemented. Instead,
indirect aggregate traffic volumes are routinely collected. We consider
the problem of estimating point-to-point traffic volumes, x(t), from
aggregate traffic volumes, y(t), given information about the network
routing protocol encoded in a matrix A. This estimation task can be
reformulated as finding the solutions to a sequence of ill-posed linear
inverse problems, y(t) = A x(t), since
the number of origin-destination routes of interest is higher than the
number of aggregate measurements available. Here, we introduce
a novel multilevel state-space model (SSM) of aggregate traffic volumes
with realistic features. We implement a naïve strategy for estimating
unobserved point-to-point traffic volumes from indirect measurements of
aggregate traffic, based on particle filtering. We then develop a more
efficient two-stage inference strategy that relies on model-based
regularization: a simple model is used to calibrate regularization
parameters that lead to efficient/scalable inference in the multilevel
SSM. We apply our methods to corporate and academic networks, where we
show that the proposed inference strategy outperforms existing approaches
and scales to larger networks. We also design a simulation study to
explore the factors that influence the performance. Our results suggest
that model-based regularization may be an efficient strategy for inference
in other complex multilevel models. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 149-164
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.756328
File-URL: http://hdl.handle.net/10.1080/01621459.2012.756328
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:149-164
Template-Type: ReDIF-Article 1.0
Author-Name: Asaf Weinstein
Author-X-Name-First: Asaf
Author-X-Name-Last: Weinstein
Author-Name: William Fithian
Author-X-Name-First: William
Author-X-Name-Last: Fithian
Author-Name: Yoav Benjamini
Author-X-Name-First: Yoav
Author-X-Name-Last: Benjamini
Title: Selection Adjusted Confidence Intervals With More Power to Determine the Sign
Abstract:
In many current large-scale problems, confidence intervals
(CIs) are constructed only for the parameters that are large, as indicated
by their estimators, ignoring the smaller parameters. Such selective
inference poses a problem to the usual marginal CIs that no longer offer
the right level of coverage, not even on the average over the selected
parameters. We address this problem by developing three methods to
construct short and valid CIs for the location parameter of a symmetric
unimodal distribution, while conditioning on its estimator being larger
than some constant threshold. In two of these methods, the CI is further
required to offer early sign determination, that is, to avoid including
parameters of both signs for relatively small values of the estimator. One
of the two, the Conditional Quasi-Conventional CI, offers a good balance
between length and sign determination while protecting from the effect of
selection. The CI is not symmetric, extending more toward 0 than away from
it, nor is it of constant shape. However, when the estimator is far away
from the threshold, the proposed CI tends to the usual marginal one. In
spite of its complexity, it is specified by closed form expressions, up to
a small set of constants that are each the solution of a single variable
equation. When multiple testing procedures are used to control
the false discovery rate or other error rates, the resulting threshold for
selecting may be data dependent. We show that conditioning the above CIs
on the data-dependent threshold still offers false coverage-statement rate
(FCR) for many widely used testing procedures. For these reasons, the
conditional CIs for the parameters selected this way are an attractive
alternative to the available general FCR adjusted intervals. We
demonstrate the use of the method in the analysis of some 14,000
correlations between hormone change and brain activity change in response
to the subjects being exposed to stressful movie clips. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 165-176
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.737740
File-URL: http://hdl.handle.net/10.1080/01621459.2012.737740
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:165-176
Template-Type: ReDIF-Article 1.0
Author-Name: S. M. Schennach
Author-X-Name-First: S. M.
Author-X-Name-Last: Schennach
Author-Name: Yingyao Hu
Author-X-Name-First: Yingyao
Author-X-Name-Last: Hu
Title: Nonparametric Identification and Semiparametric Estimation of Classical Measurement Error Models Without Side Information
Abstract:
Virtually all methods aimed at correcting for covariate
measurement error in regressions rely on some form of additional
information (e.g., validation data, known error distributions, repeated
measurements, or instruments). In contrast, we establish that the fully
nonparametric classical errors-in-variables model is identifiable from
data on the regressor and the dependent variable alone, unless the model
takes a very specific parametric form. This parametric family includes
(but is not limited to) the linear specification with normally distributed
variables as a well-known special case. This result relies on standard
primitive regularity conditions taking the form of smoothness constraints
and nonvanishing characteristic functions' assumptions. Our approach can
handle both monotone and nonmonotone specifications, provided the latter
oscillate a finite number of times. Given that the very specific
unidentified parametric functional form is arguably the exception rather
than the rule, this identification result should have a wide
applicability. It leads to a new perspective on handling measurement error
in nonlinear and nonparametric models, opening the way to a novel and
practical approach to correct for measurement error in datasets where it
was previously considered impossible (due to the lack of additional
information regarding the measurement error). We suggest an estimator
based on non/semiparametric maximum likelihood, derive its asymptotic
properties, and illustrate the effectiveness of the method with a
simulation study and an application to the relationship between firm
investment behavior and market value, the latter being notoriously
mismeasured. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 177-186
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.751872
File-URL: http://hdl.handle.net/10.1080/01621459.2012.751872
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:177-186
Template-Type: ReDIF-Article 1.0
Author-Name: Garritt Page
Author-X-Name-First: Garritt
Author-X-Name-Last: Page
Author-Name: Abhishek Bhattacharya
Author-X-Name-First: Abhishek
Author-X-Name-Last: Bhattacharya
Author-Name: David Dunson
Author-X-Name-First: David
Author-X-Name-Last: Dunson
Title: Classification via Bayesian Nonparametric Learning of Affine Subspaces
Abstract:
It has become common for datasets to contain large numbers of
variables in studies conducted in areas such as genetics, machine vision,
image analysis, and many others. When analyzing such data, parametric
models are often too inflexible while nonparametric procedures tend to be
nonrobust because of insufficient data on these high-dimensional spaces.
This is particularly true when interest lies in building efficient
classifiers in the presence of many predictor variables. When dealing with
these types of data, it is often the case that most of the variability
tends to lie along a few directions, or more generally along a much
smaller dimensional submanifold of the data space. In this article, we
propose a class of models that flexibly learn about this submanifold while
simultaneously performing dimension reduction in classification. This
methodology allows the cell probabilities to vary nonparametrically based
on a few coordinates expressed as linear combinations of the predictors.
Also, as opposed to many black-box methods for dimensionality reduction,
the proposed model is appealing in having clearly interpretable and
identifiable parameters that provide insight into which predictors are
important in determining accurate classification boundaries. Gibbs
sampling methods are developed for posterior computation, and the methods
are illustrated using simulated and real data applications.
Journal: Journal of the American Statistical Association
Pages: 187-201
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2013.763566
File-URL: http://hdl.handle.net/10.1080/01621459.2013.763566
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:187-201
Template-Type: ReDIF-Article 1.0
Author-Name: Arlene Naranjo
Author-X-Name-First: Arlene
Author-X-Name-Last: Naranjo
Author-Name: A. Alexandre Trindade
Author-X-Name-First: A. Alexandre
Author-X-Name-Last: Trindade
Author-Name: George Casella
Author-X-Name-First: George
Author-X-Name-Last: Casella
Title: Extending the State-Space Model to Accommodate Missing Values in Responses and Covariates
Abstract:
This article proposes an extended state-space model for
accommodating multivariate panel data. The novel aspect of this
contribution is an adjustment to the classical model for multiple subjects
that allows missingness in the covariates in addition to the responses.
Missing covariate data are handled by a second state-space model nested
inside the first to represent unobserved exogenous information. Relevant
Kalman filter equations are derived, and explicit expressions are provided
for both the E- and M-steps of an expectation-maximization (EM) algorithm,
to obtain maximum (Gaussian) likelihood estimates of all model parameters.
In the presence of missing data, the resulting EM algorithm becomes
computationally intractable, but a simplification of the M-step leads to a
new procedure that is shown to be an expectation/conditional maximization
(ECM) algorithm under exogeneity of the covariates. Simulation studies
reveal that the approach appears to be relatively robust to moderate
percentages of missing data, even with fewer subjects and time points, and
that estimates are generally consistent with the asymptotics. The
methodology is applied to a dataset from a published panel study of
elderly patients with impaired respiratory function. Forecasted values
thus obtained may serve as an "early-warning" mechanism for identifying
patients whose lung function is nearing critical levels. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 202-216
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.746066
File-URL: http://hdl.handle.net/10.1080/01621459.2012.746066
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:202-216
Template-Type: ReDIF-Article 1.0
Author-Name: Jane Paik Kim
Author-X-Name-First: Jane Paik
Author-X-Name-Last: Kim
Author-Name: Wenbin Lu
Author-X-Name-First: Wenbin
Author-X-Name-Last: Lu
Author-Name: Tony Sit
Author-X-Name-First: Tony
Author-X-Name-Last: Sit
Author-Name: Zhiliang Ying
Author-X-Name-First: Zhiliang
Author-X-Name-Last: Ying
Title: A Unified Approach to Semiparametric Transformation Models Under General Biased Sampling Schemes
Abstract:
We propose a unified estimation method for semiparametric
linear transformation models under general biased sampling schemes. The
new estimator is obtained from a set of counting process-based unbiased
estimating equations, developed through introducing a general weighting
scheme that offsets the sampling bias. The usual asymptotic properties,
including consistency and asymptotic normality, are established under
suitable regularity conditions. A closed-form formula is derived for the
limiting variance and the plug-in estimator is shown to be consistent. We
demonstrate the unified approach through the special cases of left
truncation, length bias, the case-cohort design, and variants thereof.
Simulation studies and applications to real datasets are presented.
Journal: Journal of the American Statistical Association
Pages: 217-227
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.746073
File-URL: http://hdl.handle.net/10.1080/01621459.2012.746073
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:217-227
Template-Type: ReDIF-Article 1.0
Author-Name: Qian Jiang
Author-X-Name-First: Qian
Author-X-Name-Last: Jiang
Author-Name: Hansheng Wang
Author-X-Name-First: Hansheng
Author-X-Name-Last: Wang
Author-Name: Yingcun Xia
Author-X-Name-First: Yingcun
Author-X-Name-Last: Xia
Author-Name: Guohua Jiang
Author-X-Name-First: Guohua
Author-X-Name-Last: Jiang
Title: On a Principal Varying Coefficient Model
Abstract:
We propose a novel varying coefficient model (VCM), called
principal varying coefficient model (PVCM), by characterizing the varying
coefficients through linear combinations of a few principal functions.
Compared with the conventional VCM, PVCM reduces the actual number of
nonparametric functions and thus has better estimation efficiency.
Compared with the semivarying coefficient model (SVCM), PVCM is more
flexible but with the same estimation efficiency when the number of
principal functions in PVCM and the number of varying coefficients in SVCM
are the same. Model estimation and identification are investigated, and
the better estimation efficiency is justified theoretically. Incorporating
the estimation with the L1 penalty, variables in the linear combinations
can be selected
automatically, and hence, the estimation efficiency can be further
improved. Numerical experiments suggest that the model together with the
estimation method is useful even when the number of covariates is large.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 228-236
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.736904
File-URL: http://hdl.handle.net/10.1080/01621459.2012.736904
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:228-236
Template-Type: ReDIF-Article 1.0
Author-Name: Zhenghui Feng
Author-X-Name-First: Zhenghui
Author-X-Name-Last: Feng
Author-Name: Xuerong Meggie Wen
Author-X-Name-First: Xuerong Meggie
Author-X-Name-Last: Wen
Author-Name: Zhou Yu
Author-X-Name-First: Zhou
Author-X-Name-Last: Yu
Author-Name: Lixing Zhu
Author-X-Name-First: Lixing
Author-X-Name-Last: Zhu
Title: On Partial Sufficient Dimension Reduction With Applications to Partially Linear Multi-Index Models
Abstract:
Partial dimension reduction is a general method to seek
informative convex combinations of predictors of primary interest, which
includes dimension reduction as its special case when the predictors in
the remaining part are constants. In this article, we propose a novel
method to conduct partial dimension reduction estimation for predictors of
primary interest without assuming that the remaining predictors are
categorical. To this end, we first take the dichotomization step such that
any existing approach for partial dimension reduction estimation can be
employed. Then we take the expectation step to integrate over all the
dichotomic predictors to identify the partial central subspace. As an
example, we use the partially linear multi-index model to illustrate its
applications for semiparametric modeling. Simulations and real data
examples are given to illustrate our methodology.
Journal: Journal of the American Statistical Association
Pages: 237-246
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.746065
File-URL: http://hdl.handle.net/10.1080/01621459.2012.746065
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:237-246
Template-Type: ReDIF-Article 1.0
Author-Name: Wei Lin
Author-X-Name-First: Wei
Author-X-Name-Last: Lin
Author-Name: Jinchi Lv
Author-X-Name-First: Jinchi
Author-X-Name-Last: Lv
Title: High-Dimensional Sparse Additive Hazards Regression
Abstract:
High-dimensional sparse modeling with censored survival data
is of great practical importance, as exemplified by modern applications in
high-throughput genomic data analysis and credit risk analysis. In this
article, we propose a class of regularization methods for simultaneous
variable selection and estimation in the additive hazards model, by
combining the nonconcave penalized likelihood approach and the pseudoscore
method. In a high-dimensional setting where the dimensionality can grow
fast, polynomially or nonpolynomially, with the sample size, we establish
the weak oracle property and oracle property under mild, interpretable
conditions, thus providing strong performance guarantees for the proposed
methodology. Moreover, we show that the regularity conditions required by
the L1 method are
substantially relaxed by a certain class of sparsity-inducing concave
penalties. As a result, concave penalties such as the smoothly clipped
absolute deviation, minimax concave penalty, and smooth integration of
counting and absolute deviation can significantly improve on the
L1 method and yield sparser
models with better prediction performance. We present a coordinate descent
algorithm for efficient implementation and rigorously investigate its
convergence properties. The practical use and effectiveness of the
proposed methods are demonstrated by simulation studies and a real data
example.
Journal: Journal of the American Statistical Association
Pages: 247-264
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.746068
File-URL: http://hdl.handle.net/10.1080/01621459.2012.746068
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:247-264
Template-Type: ReDIF-Article 1.0
Author-Name: Tony Cai
Author-X-Name-First: Tony
Author-X-Name-Last: Cai
Author-Name: Weidong Liu
Author-X-Name-First: Weidong
Author-X-Name-Last: Liu
Author-Name: Yin Xia
Author-X-Name-First: Yin
Author-X-Name-Last: Xia
Title: Two-Sample Covariance Matrix Testing and Support Recovery in High-Dimensional and Sparse Settings
Abstract:
In the high-dimensional setting, this article considers three
interrelated problems: (a) testing the equality of two covariance matrices
Σ1 and Σ2; (b) recovering the support of Σ1 - Σ2; and (c) testing the
equality of Σ1 and Σ2 row by row. We propose a new test for testing the
hypothesis H0: Σ1 = Σ2 and investigate its theoretical and numerical
properties.
null distribution of the test statistic is derived and the power of the
test is studied. The test is shown to enjoy certain optimality and to be
especially powerful against sparse alternatives. The simulation results
show that the test significantly outperforms the existing methods both in
terms of size and power. Analysis of a prostate cancer dataset is carried
out to demonstrate the application of the testing procedures. When the
null hypothesis of equal covariance matrices is rejected, it is often of
significant interest to further investigate how they differ from each
other. Motivated by applications in genomics, we also consider recovering
the support of Σ1 - Σ2 and testing the equality of the two covariance
matrices row by row. New
procedures are introduced and their properties are studied. Applications
to gene selection are also discussed. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 265-277
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.758041
File-URL: http://hdl.handle.net/10.1080/01621459.2012.758041
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:265-277
Template-Type: ReDIF-Article 1.0
Author-Name: Jing Lei
Author-X-Name-First: Jing
Author-X-Name-Last: Lei
Author-Name: James Robins
Author-X-Name-First: James
Author-X-Name-Last: Robins
Author-Name: Larry Wasserman
Author-X-Name-First: Larry
Author-X-Name-Last: Wasserman
Title: Distribution-Free Prediction Sets
Abstract:
This article introduces a new approach to prediction by
bringing together two different nonparametric ideas: distribution-free
inference and nonparametric smoothing. Specifically, we consider the
problem of constructing nonparametric tolerance/prediction sets. We start
from the general conformal prediction approach, and we use a kernel
density estimator as a measure of agreement between a sample point and the
underlying distribution. The resulting prediction set is shown to be
closely related to plug-in density level sets with carefully chosen cutoff
values. Under standard smoothness conditions, we get an asymptotic
efficiency result that is near optimal for a wide range of function
classes. But the coverage is guaranteed whether or not the smoothness
conditions hold and regardless of the sample size. The performance of our
method is investigated through simulation studies and illustrated in a
real data example.
Journal: Journal of the American Statistical Association
Pages: 278-287
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.751873
File-URL: http://hdl.handle.net/10.1080/01621459.2012.751873
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:278-287
Template-Type: ReDIF-Article 1.0
Author-Name: Fei Fu
Author-X-Name-First: Fei
Author-X-Name-Last: Fu
Author-Name: Qing Zhou
Author-X-Name-First: Qing
Author-X-Name-Last: Zhou
Title: Learning Sparse Causal Gaussian Networks With Experimental Intervention: Regularization and Coordinate Descent
Abstract:
Causal networks are graphically represented by directed
acyclic graphs (DAGs). Learning causal networks from data is a challenging
problem due to the size of the space of DAGs, the acyclicity constraint
placed on the graphical structures, and the presence of equivalence
classes. In this article, we develop an L1-penalized likelihood approach
to estimate the structure of
causal Gaussian networks. A blockwise coordinate descent algorithm, which
takes advantage of the acyclicity constraint, is proposed for seeking a
local maximizer of the penalized likelihood. We establish that model
selection consistency for causal Gaussian networks can be achieved with
the adaptive lasso penalty and sufficient experimental interventions.
Simulation and real data examples are used to demonstrate the
effectiveness of our method. In particular, our method shows satisfactory
performance for DAGs with 200 nodes, which have about 20,000 free
parameters. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 288-300
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.754359
File-URL: http://hdl.handle.net/10.1080/01621459.2012.754359
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:288-300
Template-Type: ReDIF-Article 1.0
Author-Name: Ryan Martin
Author-X-Name-First: Ryan
Author-X-Name-Last: Martin
Author-Name: Chuanhai Liu
Author-X-Name-First: Chuanhai
Author-X-Name-Last: Liu
Title: Inferential Models: A Framework for Prior-Free Posterior Probabilistic Inference
Abstract:
Posterior probabilistic statistical inference without priors
is an important but so far elusive goal. Fisher's fiducial inference,
Dempster--Shafer theory of belief functions, and Bayesian inference with
default priors are attempts to achieve this goal but, to date, none has
given a completely satisfactory picture. This article presents a new
framework for probabilistic inference, based on inferential models (IMs),
which not only provides data-dependent probabilistic measures of
uncertainty about the unknown parameter, but also does so with an
automatic long-run frequency-calibration property. The key to this new
approach is the identification of an unobservable auxiliary variable
associated with observable data and unknown parameter, and the prediction
of this auxiliary variable with a random set before conditioning on data.
Here we present a three-step IM construction, and prove a
frequency-calibration property of the IM's belief function under mild
conditions. A corresponding optimality theory is developed, which helps to
resolve the nonuniqueness issue. Several examples are presented to
illustrate this new approach.
Journal: Journal of the American Statistical Association
Pages: 301-313
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.747960
File-URL: http://hdl.handle.net/10.1080/01621459.2012.747960
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:301-313
Template-Type: ReDIF-Article 1.0
Author-Name: Christoph Rothe
Author-X-Name-First: Christoph
Author-X-Name-Last: Rothe
Author-Name: Dominik Wied
Author-X-Name-First: Dominik
Author-X-Name-Last: Wied
Title: Misspecification Testing in a Class of Conditional Distributional Models
Abstract:
We propose a specification test for a wide range of
parametric models for the conditional distribution function of an outcome
variable given a vector of covariates. The test is based on the
Cramer--von Mises distance between an unrestricted estimate of the joint
distribution function of the data and a restricted estimate that imposes
the structure implied by the model. The procedure is straightforward to
implement, is consistent against fixed alternatives, has nontrivial power
against local deviations of order n^(-1/2) from the null hypothesis, and
does not require the choice
of smoothing parameters. In an empirical application, we use our test to
study the validity of various models for the conditional distribution of
wages in the United States.
Journal: Journal of the American Statistical Association
Pages: 314-324
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.736903
File-URL: http://hdl.handle.net/10.1080/01621459.2012.736903
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:314-324
Template-Type: ReDIF-Article 1.0
Author-Name: Faming Liang
Author-X-Name-First: Faming
Author-X-Name-Last: Liang
Author-Name: Yichen Cheng
Author-X-Name-First: Yichen
Author-X-Name-Last: Cheng
Author-Name: Qifan Song
Author-X-Name-First: Qifan
Author-X-Name-Last: Song
Author-Name: Jincheol Park
Author-X-Name-First: Jincheol
Author-X-Name-Last: Park
Author-Name: Ping Yang
Author-X-Name-First: Ping
Author-X-Name-Last: Yang
Title: A Resampling-Based Stochastic Approximation Method for Analysis of Large Geostatistical Data
Abstract:
The Gaussian geostatistical model has been widely used in
modeling of spatial data. However, it is challenging to computationally
implement this method because it requires the inversion of a large
covariance matrix, particularly when there is a large number of
observations. This article proposes a resampling-based stochastic
approximation method to address this challenge. At each iteration of the
proposed method, a small subsample is drawn from the full dataset, and
then the current estimate of the parameters is updated accordingly under
the framework of stochastic approximation. Since the proposed method makes
use of only a small proportion of the data at each iteration, it avoids
inverting large covariance matrices and thus is scalable to large
datasets. The proposed method also leads to a general parameter estimation
approach, maximum mean log-likelihood estimation, which includes the
popular maximum (log)-likelihood estimation (MLE) approach as a special
case and is expected to play an important role in analyzing large
datasets. Under mild conditions, it is shown that the estimator resulting
from the proposed method converges in probability to a set of parameter
values of equivalent Gaussian probability measures, and that the estimator
is asymptotically normally distributed. To the best of the authors'
knowledge, the present study is the first one on asymptotic normality
under infill asymptotics for general covariance functions. The proposed
method is illustrated with large datasets, both simulated and real.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 325-339
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.746061
File-URL: http://hdl.handle.net/10.1080/01621459.2012.746061
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:325-339
Template-Type: ReDIF-Article 1.0
Author-Name: G. García-Donato
Author-X-Name-First: G.
Author-X-Name-Last: García-Donato
Author-Name: M. A. Martínez-Beneito
Author-X-Name-First: M. A.
Author-X-Name-Last: Martínez-Beneito
Title: On Sampling Strategies in Bayesian Variable Selection Problems With Large Model Spaces
Abstract:
One important aspect of Bayesian model selection is how to
deal with huge model spaces, since the exhaustive enumeration of all the
models entertained is not feasible and inferences have to be based on the
very small proportion of models visited. This is the case for the variable
selection problem with a moderately large number of possible explanatory
variables considered in this article. We review some of the strategies
proposed in the literature, from a theoretical point of view using
arguments of sampling theory and in practical terms using several examples
with a known answer. All our results seem to indicate that sampling
methods with frequency-based estimators outperform searching methods with
renormalized estimators. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 340-352
Issue: 501
Volume: 108
Year: 2013
Month: 3
X-DOI: 10.1080/01621459.2012.742443
File-URL: http://hdl.handle.net/10.1080/01621459.2012.742443
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:501:p:340-352
Template-Type: ReDIF-Article 1.0
Author-Name: Roderick J. Little
Author-X-Name-First: Roderick J.
Author-X-Name-Last: Little
Title: In Praise of Simplicity not Mathematistry! Ten Simple Powerful Ideas for the Statistical Scientist
Abstract:
Ronald Fisher was by all accounts a first-rate mathematician,
but he saw himself as a scientist, not a mathematician, and he railed
against what George Box called (in his Fisher lecture) "mathematistry."
Mathematics is the indispensable foundation of statistics, but for me the
real excitement and value of our subject lies in its application to other
disciplines. We should not view statistics as another branch of
mathematics and favor mathematical complexity over clarifying,
formulating, and solving real-world problems. Valuing simplicity, I
describe 10 simple and powerful ideas that have influenced my thinking
about statistics, in my areas of research interest: missing data, causal
inference, survey sampling, and statistical modeling in general. The
overarching theme is that statistics is a missing data problem and the
goal is to predict unknowns with appropriate measures of uncertainty.
Journal: Journal of the American Statistical Association
Pages: 359-369
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.787932
File-URL: http://hdl.handle.net/10.1080/01621459.2013.787932
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:359-369
Template-Type: ReDIF-Article 1.0
Author-Name: Edward Ip
Author-X-Name-First: Edward
Author-X-Name-Last: Ip
Author-Name: Qiang Zhang
Author-X-Name-First: Qiang
Author-X-Name-Last: Zhang
Author-Name: Jack Rejeski
Author-X-Name-First: Jack
Author-X-Name-Last: Rejeski
Author-Name: Tammy Harris
Author-X-Name-First: Tammy
Author-X-Name-Last: Harris
Author-Name: Stephen Kritchevsky
Author-X-Name-First: Stephen
Author-X-Name-Last: Kritchevsky
Title: Partially Ordered Mixed Hidden Markov Model for the Disablement Process of Older Adults
Abstract:
At both the individual and societal levels, the health and
economic burden of disability in older adults is enormous in developed
countries, including the U.S. Recent studies have revealed that the
disablement process in older adults often comprises episodic periods of
impaired functioning and periods that are relatively free of disability,
amid a secular and natural trend of decline in functioning. Rather than an
irreversible, progressive event that is analogous to a chronic disease,
disability is better conceptualized and mathematically modeled as states
that do not necessarily follow a strict linear order of good to bad.
Statistical tools, including Markov models, which allow bidirectional
transition between states, and random effects models, which allow
individual-specific rate of secular decline, are pertinent. In this
article, we propose a mixed effects, multivariate, hidden Markov model to
handle partially ordered disability states. The model generalizes the
continuation ratio model for ordinal data in the generalized linear model
literature and provides a formal framework for testing the effects of risk
factors and/or an intervention on the transitions between different
disability states. Under a generalization of the proportional odds ratio
assumption, the proposed model circumvents the problem of a potentially
large number of parameters when the number of states and the number of
covariates are substantial. We describe a maximum likelihood method for
estimating the partially ordered, mixed effects model and show how the
model can be applied to a longitudinal dataset that consists of
N = 2903 older adults followed for 10 years in the Health
Aging and Body Composition Study. We further statistically test the
effects of various risk factors upon the probabilities of transition into
various severe disability states. The result can be used to inform
geriatric and public health science researchers who study the disablement
process. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 370-384
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.770307
File-URL: http://hdl.handle.net/10.1080/01621459.2013.770307
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:370-384
Template-Type: ReDIF-Article 1.0
Author-Name: Mauricio Sadinle
Author-X-Name-First: Mauricio
Author-X-Name-Last: Sadinle
Author-Name: Stephen E. Fienberg
Author-X-Name-First: Stephen E.
Author-X-Name-Last: Fienberg
Title: A Generalized Fellegi--Sunter Framework for Multiple Record Linkage With Application to Homicide Record Systems
Abstract:
We present a probabilistic method for linking multiple
datafiles. This task is not trivial in the absence of unique identifiers
for the individuals recorded. This is a common scenario when linking
census data to coverage measurement surveys for census coverage
evaluation, and in general when multiple record systems need to be
integrated for posterior analysis. Our method generalizes the
Fellegi--Sunter theory for linking records from two datafiles and its
modern implementations. The goal of multiple record linkage is to classify
the record K-tuples coming from K
datafiles according to the different matching patterns. Our method
incorporates the transitivity of agreement in the computation of the data
used to model matching probabilities. We use a mixture model to fit
matching probabilities via maximum likelihood using the
Expectation--Maximization algorithm. We present a method to decide the
record K-tuples' membership in the subsets of matching
patterns and we prove its optimality. We apply our method to the
integration of the three Colombian homicide record systems and perform a
simulation study to explore the performance of the method under
measurement error and different scenarios. The proposed method works well
and opens new directions for future research.
Journal: Journal of the American Statistical Association
Pages: 385-397
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2012.757231
File-URL: http://hdl.handle.net/10.1080/01621459.2012.757231
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:385-397
Template-Type: ReDIF-Article 1.0
Author-Name: Natallia Katenka
Author-X-Name-First: Natallia
Author-X-Name-Last: Katenka
Author-Name: Elizaveta Levina
Author-X-Name-First: Elizaveta
Author-X-Name-Last: Levina
Author-Name: George Michailidis
Author-X-Name-First: George
Author-X-Name-Last: Michailidis
Title: Tracking Multiple Targets Using Binary Decisions From Wireless Sensor Networks
Abstract:
This article introduces a framework for tracking multiple
targets over time using binary decisions collected by a wireless sensor
network, and applies the methodology to two case studies: an experiment
involving tracking people and a dataset adapted from a project tracking
zebras in Kenya. The tracking approach is based on a penalized maximum
likelihood framework, and allows for sensor failures, targets appearing
and disappearing over time, and complex intersecting target trajectories.
We show that binary decisions about the presence/absence of a target in a
sensor's neighborhood, corrected locally by a method known as local vote
decision fusion, provide the most robust performance in noisy environments
and give good tracking results in applications.
Journal: Journal of the American Statistical Association
Pages: 398-410
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.770284
File-URL: http://hdl.handle.net/10.1080/01621459.2013.770284
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:398-410
Template-Type: ReDIF-Article 1.0
Author-Name: Avishek Chakraborty
Author-X-Name-First: Avishek
Author-X-Name-Last: Chakraborty
Author-Name: Bani K. Mallick
Author-X-Name-First: Bani K.
Author-X-Name-Last: Mallick
Author-Name: Ryan G. McClarren
Author-X-Name-First: Ryan G.
Author-X-Name-Last: McClarren
Author-Name: Carolyn C. Kuranz
Author-X-Name-First: Carolyn C.
Author-X-Name-Last: Kuranz
Author-Name: Derek Bingham
Author-X-Name-First: Derek
Author-X-Name-Last: Bingham
Author-Name: Michael J. Grosskopf
Author-X-Name-First: Michael J.
Author-X-Name-Last: Grosskopf
Author-Name: Erica M. Rutter
Author-X-Name-First: Erica M.
Author-X-Name-Last: Rutter
Author-Name: Hayes F. Stripling
Author-X-Name-First: Hayes F.
Author-X-Name-Last: Stripling
Author-Name: R. Paul Drake
Author-X-Name-First: R. Paul
Author-X-Name-Last: Drake
Title: Spline-Based Emulators for Radiative Shock Experiments With Measurement Error
Abstract:
Radiation hydrodynamics and radiative shocks are of
fundamental interest in high-energy-density physics research due to
their importance in understanding astrophysical phenomena such as
supernovae. In the laboratory, experiments can produce shocks with
fundamentally similar physics on reduced scales. However, the cost and
time constraints of the experiment necessitate use of a computer algorithm
to generate a reasonable number of outputs for making valid inference. We
focus on modeling emulators that can efficiently assimilate these two
sources of information accounting for their intrinsic differences. The
goal is to learn how to predict the breakout time of the shock given the
information on associated parameters such as pressure and energy. Under
the framework of the Kennedy--O'Hagan model, we introduce an emulator
based on adaptive splines. Depending on the preference of having an
interpolator for the computer code output or a computationally fast model,
a couple of different variants are proposed. Those choices are shown to
perform better than the conventional Gaussian-process-based emulator and a
few other choices of nonstationary models. For the shock experiment
dataset, a number of features related to computer model validation, such as
using an interpolator, the necessity of a discrepancy function, or accounting
for experimental heterogeneity, are discussed, implemented, and validated for
the current dataset. In addition to the typical Gaussian measurement error
for real data, we consider alternative specifications suitable to
incorporate noninformativeness in error distributions, more in agreement
with the current experiment. Comparative diagnostics, to highlight the
effect of measurement error model on predictive uncertainty, are also
presented. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 411-428
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.770688
File-URL: http://hdl.handle.net/10.1080/01621459.2013.770688
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:411-428
Template-Type: ReDIF-Article 1.0
Author-Name: Curtis B. Storlie
Author-X-Name-First: Curtis B.
Author-X-Name-Last: Storlie
Author-Name: Sarah E. Michalak
Author-X-Name-First: Sarah E.
Author-X-Name-Last: Michalak
Author-Name: Heather M. Quinn
Author-X-Name-First: Heather M.
Author-X-Name-Last: Quinn
Author-Name: Andrew J. Dubois
Author-X-Name-First: Andrew J.
Author-X-Name-Last: Dubois
Author-Name: Steven A. Wender
Author-X-Name-First: Steven A.
Author-X-Name-Last: Wender
Author-Name: David H. Dubois
Author-X-Name-First: David H.
Author-X-Name-Last: Dubois
Title: A Bayesian Reliability Analysis of Neutron-Induced Errors in High Performance Computing Hardware
Abstract:
A soft error is an undesired change in an electronic device's
state, for example, a bit flip in computer memory, that does not
permanently affect its functionality. In microprocessor systems,
neutron-induced soft errors can cause crashes and silent data corruption
(SDC). SDC occurs when a soft error produces a computational result that
is incorrect, without the system issuing a warning or error message.
Hence, neutron-induced soft errors are a major concern for high
performance computing platforms that perform scientific computation.
Through accelerated neutron beam testing of hardware in its field
configuration, the frequencies of failures (crashes) and of SDCs in
hardware from the Roadrunner platform, the first Petaflop supercomputer,
are estimated. The impact of key factors on field performance is
investigated and estimates of field reliability are provided. Finally, a
novel statistical approach for the analysis of interval-censored survival
data with mixed effects and uncertainty in the interval endpoints, key
features of the experimental data, is presented. Supplementary materials
for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 429-440
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.770694
File-URL: http://hdl.handle.net/10.1080/01621459.2013.770694
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:429-440
Template-Type: ReDIF-Article 1.0
Author-Name: Monica Costa Dias
Author-X-Name-First: Monica Costa
Author-X-Name-Last: Dias
Author-Name: Hidehiko Ichimura
Author-X-Name-First: Hidehiko
Author-X-Name-Last: Ichimura
Author-Name: Gerard J. van den Berg
Author-X-Name-First: Gerard J.
Author-X-Name-Last: van den Berg
Title: Treatment Evaluation With Selective Participation and Ineligibles
Abstract:
Matching methods for treatment evaluation based on a
conditional independence assumption do not balance selective unobserved
differences between treated and nontreated. We derive a simple correction
term if there is an instrument that shifts the treatment probability to
zero in specific cases. Policies with eligibility restrictions, where
treatment is impossible if some variable exceeds a certain value, provide
a natural application. In an empirical analysis, we exploit the age
eligibility restriction in the Swedish Youth Practice subsidized work
program for the young unemployed, where compliance is imperfect among the
young. Adjusting the matching estimator for selectivity changes the
results toward making subsidized work detrimental in moving individuals
into employment.
Journal: Journal of the American Statistical Association
Pages: 441-455
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.795447
File-URL: http://hdl.handle.net/10.1080/01621459.2013.795447
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:441-455
Template-Type: ReDIF-Article 1.0
Author-Name: David A. Friedenberg
Author-X-Name-First: David A.
Author-X-Name-Last: Friedenberg
Author-Name: Christopher R. Genovese
Author-X-Name-First: Christopher R.
Author-X-Name-Last: Genovese
Title: Straight to the Source: Detecting Aggregate Objects in Astronomical Images With Proper Error Control
Abstract:
The next generation of telescopes, coming online in the next
decade, will acquire terabytes of image data each night. Collectively,
these large images will contain billions of interesting objects, which
astronomers call sources. One critical task for
astronomers is to construct from the image data a detailed source
catalog that gives the sky coordinates and other properties of
all detected sources. The source catalog is the primary data product
produced by most telescopes and serves as an important input for studies
that build and test new astrophysical theories. To construct an accurate
catalog, the sources must first be detected in the image. A variety of
effective source detection algorithms exist in the astronomical
literature, but few, if any, provide rigorous statistical control of error
rates. A variety of multiple testing procedures exist in the statistical
literature that can provide rigorous error control over pixelwise errors,
but these do not provide control over errors at the level of sources,
which is what astronomers need. In this article, we propose a technique
that is effective at source detection while providing rigorous control on
sourcewise error rates. We demonstrate our approach with data from the
Chandra X-ray Observatory Satellite. Our method is competitive with
existing astronomical methods, even finding two new sources that were
missed by previous studies, while providing stronger performance
guarantees and without requiring costly follow-up studies that are
commonly required with current techniques.
Journal: Journal of the American Statistical Association
Pages: 456-468
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.779829
File-URL: http://hdl.handle.net/10.1080/01621459.2013.779829
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:456-468
Template-Type: ReDIF-Article 1.0
Author-Name: Tyler J. VanderWeele
Author-X-Name-First: Tyler J.
Author-X-Name-Last: VanderWeele
Author-Name: Guanglei Hong
Author-X-Name-First: Guanglei
Author-X-Name-Last: Hong
Author-Name: Stephanie M. Jones
Author-X-Name-First: Stephanie M.
Author-X-Name-Last: Jones
Author-Name: Joshua L. Brown
Author-X-Name-First: Joshua L.
Author-X-Name-Last: Brown
Title: Mediation and Spillover Effects in Group-Randomized Trials: A Case Study of the 4Rs Educational Intervention
Abstract:
Peer influence and social interactions can give rise to
spillover effects in which the exposure of one individual may affect
outcomes of other individuals. Even if the intervention under study occurs
at the group or cluster level as in group-randomized trials, spillover
effects can occur when the mediator of interest is measured at a lower
level than the treatment. Evaluators who choose groups rather than
individuals as experimental units in a randomized trial often anticipate
that the desirable changes in targeted social behaviors will be reinforced
through interference among individuals in a group exposed to the same
treatment. In an empirical evaluation of the effect of a school-wide
intervention on reducing individual students' depressive symptoms, schools
in matched pairs were randomly assigned to the 4Rs intervention or the
control condition. Class quality was hypothesized as an important mediator
assessed at the classroom level. We reason that the quality of one
classroom may affect outcomes of children in another classroom because
children interact not simply with their classmates but also with those
from other classes in the hallways or on the playground. In investigating
the role of class quality as a mediator, failure to account for such
spillover effects of one classroom on the outcomes of children in other
classrooms can potentially result in bias and problems with
interpretation. Using a counterfactual conceptualization of direct,
indirect, and spillover effects, we provide a framework that can
accommodate issues of mediation and spillover effects in group randomized
trials. We show that the total effect can be decomposed into a natural
direct effect, a within-classroom mediated effect, and a spillover
mediated effect. We give identification conditions for each of the causal
effects of interest and provide results on the consequences of ignoring
"interference" or "spillover effects" when they are in fact present. Our
modeling approach disentangles these effects. The analysis examines
whether the 4Rs intervention has an effect on children's depressive
symptoms through changing the quality of other classes as well as through
changing the quality of a child's own class. Supplementary materials for
this article are available online.
Journal: Journal of the American Statistical Association
Pages: 469-482
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.779832
File-URL: http://hdl.handle.net/10.1080/01621459.2013.779832
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:469-482
Template-Type: ReDIF-Article 1.0
Author-Name: Yueqing Wang
Author-X-Name-First: Yueqing
Author-X-Name-Last: Wang
Author-Name: Xin Jiang
Author-X-Name-First: Xin
Author-X-Name-Last: Jiang
Author-Name: Bin Yu
Author-X-Name-First: Bin
Author-X-Name-Last: Yu
Author-Name: Ming Jiang
Author-X-Name-First: Ming
Author-X-Name-Last: Jiang
Title: A Hierarchical Bayesian Approach for Aerosol Retrieval Using MISR Data
Abstract:
Atmospheric aerosols can cause serious damage to human health
and reduce life expectancy. Using the radiances observed by NASA's
Multi-angle Imaging SpectroRadiometer (MISR), the current MISR operational
algorithm retrieves aerosol optical depth (AOD) at 17.6 km resolution. A
systematic study of aerosols and their impact on public health, especially
in highly populated urban areas, requires finer-resolution estimates of
AOD's spatial distribution. We embed MISR's operational weighted least
squares criterion and its forward calculations for AOD retrievals in a
likelihood framework and further expand into a hierarchical Bayesian model
to adapt to finer spatial resolution of 4.4 km. To take advantage of AOD's
spatial smoothness, our method borrows strength from data at neighboring
areas by postulating a Gaussian Markov random field prior for AOD. Our
model considers AOD and aerosol mixing vectors as continuous variables,
whose inference is carried out using Metropolis-within-Gibbs sampling
methods. Retrieval uncertainties are quantified by posterior
variabilities. We also develop a parallel Markov chain Monte Carlo (MCMC)
algorithm to improve computational efficiency. We assess our retrieval
performance using ground-based measurements from the AErosol RObotic
NETwork (AERONET) and satellite images from Google Earth. Based on case
studies in the greater Beijing area, China, we show that 4.4 km resolution
can improve both the accuracy and coverage of remotely sensed aerosol
retrievals, as well as our understanding of the spatial and seasonal
behaviors of aerosols. This is particularly important during high-AOD
events, which often indicate severe air pollution.
Journal: Journal of the American Statistical Association
Pages: 483-493
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.796834
File-URL: http://hdl.handle.net/10.1080/01621459.2013.796834
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:483-493
Template-Type: ReDIF-Article 1.0
Author-Name: Sungduk Kim
Author-X-Name-First: Sungduk
Author-X-Name-Last: Kim
Author-Name: Zhen Chen
Author-X-Name-First: Zhen
Author-X-Name-Last: Chen
Author-Name: Zhiwei Zhang
Author-X-Name-First: Zhiwei
Author-X-Name-Last: Zhang
Author-Name: Bruce G. Simons-Morton
Author-X-Name-First: Bruce G.
Author-X-Name-Last: Simons-Morton
Author-Name: Paul S. Albert
Author-X-Name-First: Paul S.
Author-X-Name-Last: Albert
Title: Bayesian Hierarchical Poisson Regression Models: An Application to a Driving Study With Kinematic Events
Abstract:
Although there is evidence that teenagers are at a high risk
of crashes in the early months after licensure, the driving behavior of
these teenagers is not well understood. The Naturalistic Teenage Driving
Study (NTDS) is the first U.S. study to document continuous driving
performance of newly licensed teenagers during their first 18 months of
licensure. Counts of kinematic events such as the number of rapid
accelerations are available for each trip, and their incidence rates
represent different aspects of driving behavior. We propose a hierarchical
Poisson regression model incorporating overdispersion, heterogeneity, and
serial correlation as well as a semiparametric mean structure. Analysis of
the NTDS data is carried out with a hierarchical Bayesian framework using
reversible jump Markov chain Monte Carlo algorithms to accommodate the
flexible mean structure. We show that driving with a passenger and night
driving decrease kinematic events, while having risky friends increases
these events. Further, the within-subject variation in these events is
comparable to the between-subject variation. This methodology will be
useful for other intensively collected longitudinal count data, where
event rates are low and interest focuses on estimating the mean and
variance structure of the process. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 494-503
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.770702
File-URL: http://hdl.handle.net/10.1080/01621459.2013.770702
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:494-503
Template-Type: ReDIF-Article 1.0
Author-Name: Zahra Siddique
Author-X-Name-First: Zahra
Author-X-Name-Last: Siddique
Title: Partially Identified Treatment Effects Under Imperfect Compliance: The Case of Domestic Violence
Abstract:
The Minneapolis Domestic Violence Experiment (MDVE) is a
randomized social experiment with imperfect compliance that has been
extremely influential in how police officers respond to misdemeanor
domestic violence. This article reexamines data from the MDVE, using
recent literature on partial identification to find recidivism associated
with a policy that arrests misdemeanor domestic violence suspects rather
than not arresting them. Using partially identified bounds on the average
treatment effect, I find that arresting rather than not arresting suspects
can potentially reduce recidivism by more than two-and-a-half times the
corresponding intent-to-treat estimate and more than two times the
corresponding local average treatment effect, even when making minimal
assumptions on counterfactuals.
Journal: Journal of the American Statistical Association
Pages: 504-513
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.779836
File-URL: http://hdl.handle.net/10.1080/01621459.2013.779836
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:504-513
Template-Type: ReDIF-Article 1.0
Author-Name: Josue G. Martinez
Author-X-Name-First: Josue G.
Author-X-Name-Last: Martinez
Author-Name: Kirsten M. Bohn
Author-X-Name-First: Kirsten M.
Author-X-Name-Last: Bohn
Author-Name: Raymond J. Carroll
Author-X-Name-First: Raymond J.
Author-X-Name-Last: Carroll
Author-Name: Jeffrey S. Morris
Author-X-Name-First: Jeffrey S.
Author-X-Name-Last: Morris
Title: A Study of Mexican Free-Tailed Bat Chirp Syllables: Bayesian Functional Mixed Models for Nonstationary Acoustic Time Series
Abstract:
We describe a new approach to analyze chirp syllables of
free-tailed bats from two regions of Texas in which they are predominant:
Austin and College Station. Our goal is to characterize any systematic
regional differences in the mating chirps and assess whether individual
bats have signature chirps. The data are analyzed by modeling spectrograms
of the chirps as responses in a Bayesian functional mixed model. Given the
variable chirp lengths, we compute the spectrograms on a relative time
scale interpretable as the relative chirp position, using a variable
window overlap based on chirp length. We use two-dimensional wavelet
transforms to capture correlation within the spectrogram in our modeling
and obtain adaptive regularization of the estimates and inference for the
region-specific spectrograms. Our model includes random effect
spectrograms at the bat level to account for correlation among chirps from
the same bat and to assess relative variability in chirp spectrograms
within and between bats. The modeling of spectrograms using functional
mixed models is a general approach for the analysis of replicated
nonstationary time series, such as our acoustical signals, to relate
aspects of the signals to various predictors, while accounting for
between-signal structure. This can be done on raw spectrograms when all
signals are of the same length and can be done using spectrograms defined
on a relative time scale for signals of variable length in settings where
the idea of defining correspondence across signals based on relative
position is sensible. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 514-526
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.793118
File-URL: http://hdl.handle.net/10.1080/01621459.2013.793118
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:514-526
Template-Type: ReDIF-Article 1.0
Author-Name: Lihui Zhao
Author-X-Name-First: Lihui
Author-X-Name-Last: Zhao
Author-Name: Lu Tian
Author-X-Name-First: Lu
Author-X-Name-Last: Tian
Author-Name: Tianxi Cai
Author-X-Name-First: Tianxi
Author-X-Name-Last: Cai
Author-Name: Brian Claggett
Author-X-Name-First: Brian
Author-X-Name-Last: Claggett
Author-Name: L. J. Wei
Author-X-Name-First: L. J.
Author-X-Name-Last: Wei
Title: Effectively Selecting a Target Population for a Future Comparative Study
Abstract:
When comparing a new treatment with a control in a randomized
clinical study, the treatment effect is generally assessed by evaluating a
summary measure over a specific study population. The success of the trial
heavily depends on the choice of such a population. In this article, we
show a systematic, effective way to identify a promising population, for
which the new treatment is expected to have a desired benefit, using the
data from a current study involving similar comparator treatments.
Specifically, using the existing data, we first create a parametric
scoring system as a function of multiple baseline covariates to estimate
subject-specific treatment differences. Based on this scoring system, we
specify a desired level of treatment difference and obtain a subgroup of
patients, defined as those whose estimated scores exceed this threshold.
An empirically calibrated threshold-specific treatment difference curve
across a range of score values is constructed. The subpopulation of
patients satisfying any given level of treatment benefit can then be
identified accordingly. To avoid bias due to overoptimism, we use a
cross-training-evaluation method for implementing the above two-step
procedure. We then show how to select the best scoring system among all
competing models. Furthermore, for cases in which only a single
prespecified working model is involved, inference procedures are proposed
for the average treatment difference over a range of score values using
the entire dataset and are justified theoretically and numerically.
Finally, the proposals are illustrated with the data from two clinical
trials in treating HIV and cardiovascular diseases. Note that if we are
not interested in designing a new study for comparing similar treatments,
the new procedure can also be quite useful for the management of future
patients, so that treatment may be targeted toward those who would receive
nontrivial benefits to compensate for the risk or cost of the new
treatment. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 527-539
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.770705
File-URL: http://hdl.handle.net/10.1080/01621459.2013.770705
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:527-539
Template-Type: ReDIF-Article 1.0
Author-Name: Hua Zhou
Author-X-Name-First: Hua
Author-X-Name-Last: Zhou
Author-Name: Lexin Li
Author-X-Name-First: Lexin
Author-X-Name-Last: Li
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Title: Tensor Regression with Applications in Neuroimaging Data Analysis
Abstract:
Classical regression methods treat covariates as a vector and
estimate a corresponding vector of regression coefficients. Modern
applications in medical imaging generate covariates of more complex form
such as multidimensional arrays (tensors). Traditional statistical and
computational methods are proving insufficient for analysis of these
high-throughput data due to their ultrahigh dimensionality as well as
complex structure. In this article, we propose a new family of tensor
regression models that efficiently exploit the special structure of tensor
covariates. Under this framework, ultrahigh dimensionality is reduced to a
manageable level, resulting in efficient estimation and prediction. A fast
and highly scalable estimation algorithm is proposed for maximum
likelihood estimation and its associated asymptotic properties are
studied. Effectiveness of the new methods is demonstrated on both
synthetic and real MRI data. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 540-552
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.776499
File-URL: http://hdl.handle.net/10.1080/01621459.2013.776499
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:540-552
Template-Type: ReDIF-Article 1.0
Author-Name: Yuanjia Wang
Author-X-Name-First: Yuanjia
Author-X-Name-Last: Wang
Author-Name: Huaihou Chen
Author-X-Name-First: Huaihou
Author-X-Name-Last: Chen
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: Christine Mauro
Author-X-Name-First: Christine
Author-X-Name-Last: Mauro
Author-Name: Naihua Duan
Author-X-Name-First: Naihua
Author-X-Name-Last: Duan
Author-Name: M. Katherine Shear
Author-X-Name-First: M. Katherine
Author-X-Name-Last: Shear
Title: Auxiliary Marker-Assisted Classification in the Absence of Class Identifiers
Abstract:
Constructing classification rules for accurate diagnosis of a
disorder is an important goal in medical practice. In many clinical
applications, there is no clinically significant anatomical or
physiological deviation that exists to identify the gold standard disease
status to inform development of classification algorithms. Despite the
absence of perfect disease class identifiers, there are usually one or
more disease-informative auxiliary markers along with feature variables
that comprise known symptoms. Existing statistical learning approaches do
not effectively draw information from auxiliary prognostic markers. We
propose a large margin classification method, with particular emphasis on
the support vector machine, assisted by available informative markers to
classify disease without knowing a subject's true disease status. We view
this task as statistical learning in the presence of missing data, and
introduce a pseudo-Expectation-Maximization (EM) algorithm to the
classification. A major difference between a regular EM algorithm and the
algorithm proposed here is that we do not model the distribution of
missing data given the observed feature variables either parametrically or
semiparametrically. We also propose a sparse variable selection method
embedded in the pseudo-EM algorithm. Theoretical examination shows that
the proposed classification rule is Fisher consistent, and that under a
linear rule, the proposed selection has an oracle variable selection
property and the estimated coefficients are asymptotically normal. We
apply the methods to build decision rules for including subjects in
clinical trials of a new psychiatric disorder and present four
applications to data available at the University of California, Irvine
Machine Learning Repository.
Journal: Journal of the American Statistical Association
Pages: 553-565
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.775949
File-URL: http://hdl.handle.net/10.1080/01621459.2013.775949
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:553-565
Template-Type: ReDIF-Article 1.0
Author-Name: Arpita Ghosh
Author-X-Name-First: Arpita
Author-X-Name-Last: Ghosh
Author-Name: Fred A. Wright
Author-X-Name-First: Fred A.
Author-X-Name-Last: Wright
Author-Name: Fei Zou
Author-X-Name-First: Fei
Author-X-Name-Last: Zou
Title: Unified Analysis of Secondary Traits in Case--Control Association Studies
Abstract:
It has been repeatedly shown that in case--control
association studies, analysis of a secondary trait that ignores the
original sampling scheme can produce highly biased risk estimates.
Although a number of approaches have been proposed to properly analyze
secondary traits, most approaches fail to reproduce the marginal logistic
model assumed for the original case--control trait and/or do not allow for
interaction between secondary trait and genotype marker on primary disease
risk. In addition, the flexible handling of covariates remains
challenging. We present a general retrospective likelihood framework to
perform association testing for both binary and continuous secondary
traits, which respects marginal models and incorporates the interaction
term. We provide a computational algorithm, based on a reparameterized
approximate profile likelihood, for obtaining the maximum likelihood (ML)
estimate and its standard error for the genetic effect on secondary
traits, in the presence of covariates. For completeness, we also present
an alternative pseudo-likelihood method for handling covariates. We
describe extensive simulations to evaluate the performance of the ML
estimator in comparison with the pseudo-likelihood and other competing
methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 566-576
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.793121
File-URL: http://hdl.handle.net/10.1080/01621459.2013.793121
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:566-576
Template-Type: ReDIF-Article 1.0
Author-Name: Ting Zhang
Author-X-Name-First: Ting
Author-X-Name-Last: Zhang
Title: Clustering High-Dimensional Time Series Based on Parallelism
Abstract:
This article considers the problem of clustering
high-dimensional time series based on trend parallelism. The underlying
process is modeled as a nonparametric trend function contaminated by
locally stationary errors, a special class of nonstationary processes. For
each group where the parallelism holds, I semiparametrically estimate its
representative trend function and vertical shifts of group members, and
establish their central limit theorems. An information criterion,
consisting of in-group similarities and the number of groups, is then proposed
for the purpose of clustering. I prove its theoretical consistency and
propose a splitting-coalescence algorithm to reduce the computational
burden in practice. The method is illustrated by both simulation and a
real-data example.
Journal: Journal of the American Statistical Association
Pages: 577-588
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2012.760458
File-URL: http://hdl.handle.net/10.1080/01621459.2012.760458
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:577-588
Template-Type: ReDIF-Article 1.0
Author-Name: Faming Liang
Author-X-Name-First: Faming
Author-X-Name-Last: Liang
Author-Name: Qifan Song
Author-X-Name-First: Qifan
Author-X-Name-Last: Song
Author-Name: Kai Yu
Author-X-Name-First: Kai
Author-X-Name-Last: Yu
Title: Bayesian Subset Modeling for High-Dimensional Generalized Linear Models
Abstract:
This article presents a new prior setting for
high-dimensional generalized linear models, which leads to a Bayesian
subset regression (BSR) with the maximum a posteriori model approximately
equivalent to the minimum extended Bayesian information criterion model.
The consistency of the resulting posterior is established under mild
conditions. Further, a variable screening procedure is proposed based on
the marginal inclusion probability, which shares the same properties of
sure screening and consistency with the existing sure independence
screening (SIS) and iterative sure independence screening (ISIS)
procedures. However, since the proposed procedure makes use of joint
information from all predictors, it generally outperforms SIS and ISIS in
real applications. This article also makes extensive comparisons of BSR
with the popular penalized likelihood methods, including Lasso, elastic
net, SIS, and ISIS. The numerical results indicate that BSR can generally
outperform the penalized likelihood methods. The models selected by BSR
tend to be sparser and, more importantly, of higher prediction ability. In
addition, the performance of the penalized likelihood methods tends to
deteriorate as the number of predictors increases, while this is not
significant for BSR. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 589-606
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2012.761942
File-URL: http://hdl.handle.net/10.1080/01621459.2012.761942
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:589-606
Template-Type: ReDIF-Article 1.0
Author-Name: J. T. Gene Hwang
Author-X-Name-First: J. T. Gene
Author-X-Name-Last: Hwang
Author-Name: Zhigen Zhao
Author-X-Name-First: Zhigen
Author-X-Name-Last: Zhao
Title: Empirical Bayes Confidence Intervals for Selected Parameters in High-Dimensional Data
Abstract:
Modern statistical problems often involve a large number of
populations and hence a large number of parameters that characterize these
populations. It is common for scientists to use data to select the most
significant populations, such as those with the largest t
statistics. The scientific interest often lies in studying and making
inferences regarding these parameters, called the selected
parameters, corresponding to the selected populations. The
current statistical practices either apply a traditional procedure
assuming there were no selection (a practice that is not valid) or use a
Bonferroni-type procedure that is valid but very conservative and
often noninformative. In this article, we propose valid and sharp
confidence intervals that allow scientists to select parameters and to
make inferences for the selected parameters based on the same data. This
type of confidence interval allows the users to zero in on the most
interesting selected parameters without collecting more data. The validity
of confidence intervals is defined as the controlling of Bayes coverage
probability so that it is no less than a nominal level uniformly over a
class of prior distributions for the parameter. When a mixed model is
assumed and the random effects are the key parameters, this validity
criterion is exactly the frequentist criterion, since the Bayes coverage
probability is identical to the frequentist coverage probability. Assuming
that the observations are normally distributed with unequal and unknown
variances, we select parameters with the largest t
statistics. We then construct sharp empirical Bayes confidence intervals
for these selected parameters, which have either a large Bayes coverage
probability or a small Bayes false coverage rate uniformly for a class of
priors. Our intervals, applicable to any high-dimensional data, are
applied to microarray data and are shown to be better than all the
alternatives. It is also anticipated that the same intervals would be
valid for any selection rule. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 607-618
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.771102
File-URL: http://hdl.handle.net/10.1080/01621459.2013.771102
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:607-618
Template-Type: ReDIF-Article 1.0
Author-Name: Rong Liu
Author-X-Name-First: Rong
Author-X-Name-Last: Liu
Author-Name: Lijian Yang
Author-X-Name-First: Lijian
Author-X-Name-Last: Yang
Author-Name: Wolfgang K. Härdle
Author-X-Name-First: Wolfgang K.
Author-X-Name-Last: Härdle
Title: Oracally Efficient Two-Step Estimation of Generalized Additive Model
Abstract:
The generalized additive model (GAM) is a multivariate
nonparametric regression tool for non-Gaussian responses including binary
and count data. We propose a spline-backfitted kernel (SBK) estimator for
the component functions and the constant, which are oracally efficient
under weak dependence. The SBK technique is both computationally expedient
and theoretically reliable, thus usable for analyzing high-dimensional
time series. Inference can be made on component functions based on
asymptotic normality. Simulation evidence strongly corroborates the
asymptotic theory. The method is applied to estimate insolvency probability
and obtains a higher accuracy ratio than a previous study. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 619-631
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.763726
File-URL: http://hdl.handle.net/10.1080/01621459.2013.763726
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:619-631
Template-Type: ReDIF-Article 1.0
Author-Name: Xueqin Wang
Author-X-Name-First: Xueqin
Author-X-Name-Last: Wang
Author-Name: Yunlu Jiang
Author-X-Name-First: Yunlu
Author-X-Name-Last: Jiang
Author-Name: Mian Huang
Author-X-Name-First: Mian
Author-X-Name-Last: Huang
Author-Name: Heping Zhang
Author-X-Name-First: Heping
Author-X-Name-Last: Zhang
Title: Robust Variable Selection With Exponential Squared Loss
Abstract:
Robust variable selection procedures through penalized
regression have been gaining increased attention in the literature. They
can be used to perform variable selection and are expected to yield robust
estimates. However, to the best of our knowledge, the robustness of those
penalized regression procedures has not been well characterized. In this
article, we propose a class of penalized robust regression estimators
based on exponential squared loss. The motivation for this new procedure
is that it enables us to characterize its robustness in a way that has not
been done for the existing procedures, while its performance is near
optimal and superior to some recently developed methods. Specifically,
under defined regularity conditions, our estimators are root-n consistent and possess the
oracle property. Importantly, we show that our estimators can achieve the
highest asymptotic breakdown point of 1/2 and that their influence
functions are bounded with respect to the outliers in either the response
or the covariate domain. We performed simulation studies to compare our
proposed method with some recent methods, using the oracle method as the
benchmark. We consider common sources of influential points. Our
simulation studies reveal that our proposed method performs similarly to
the oracle method in terms of the model error and the positive selection
rate even in the presence of influential points. In contrast, other
existing procedures have a much lower noncausal selection rate.
Furthermore, we reanalyze the Boston Housing Price Dataset and the Plasma
Beta-Carotene Level Dataset that are commonly used examples for regression
diagnostics of influential points. Our analysis reveals the discrepancies
between our robust method and other penalized regression methods,
underscoring the importance of developing and applying robust penalized
regression methods.
Journal: Journal of the American Statistical Association
Pages: 632-643
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.766613
File-URL: http://hdl.handle.net/10.1080/01621459.2013.766613
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:632-643
Template-Type: ReDIF-Article 1.0
Author-Name: Howard D. Bondell
Author-X-Name-First: Howard D.
Author-X-Name-Last: Bondell
Author-Name: Leonard A. Stefanski
Author-X-Name-First: Leonard A.
Author-X-Name-Last: Stefanski
Title: Efficient Robust Regression via Two-Stage Generalized Empirical Likelihood
Abstract:
Large- and finite-sample efficiency and resistance to
outliers are the key goals of robust statistics. Although often not
simultaneously attainable, we develop and study a linear regression
estimator that comes close. Efficiency is obtained from the estimator's
close connection to generalized empirical likelihood, and its favorable
robustness properties are obtained by constraining the associated sum of
(weighted) squared residuals. We prove maximum attainable finite-sample
replacement breakdown point and full asymptotic efficiency for normal
errors. Simulation evidence shows that compared to existing robust
regression estimators, the new estimator has relatively high efficiency
for small sample sizes and comparable outlier resistance. The estimator is
further illustrated and compared to existing methods via application to a
real dataset with purported outliers.
Journal: Journal of the American Statistical Association
Pages: 644-655
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.779847
File-URL: http://hdl.handle.net/10.1080/01621459.2013.779847
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:644-655
Template-Type: ReDIF-Article 1.0
Author-Name: Bo Li
Author-X-Name-First: Bo
Author-X-Name-Last: Li
Author-Name: Marc G. Genton
Author-X-Name-First: Marc G.
Author-X-Name-Last: Genton
Title: Nonparametric Identification of Copula Structures
Abstract:
We propose a unified framework for testing a variety of
assumptions commonly made about the structure of copulas, including
symmetry, radial symmetry, joint symmetry, associativity and
Archimedeanity, and max-stability. Our test is nonparametric and based on
the asymptotic distribution of the empirical copula process. We perform
simulation experiments to evaluate our test and conclude that our method
is reliable and powerful for assessing common assumptions on the structure
of copulas, particularly when the sample size is moderately large. We
illustrate our testing approach on two datasets.
Journal: Journal of the American Statistical Association
Pages: 666-675
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.787083
File-URL: http://hdl.handle.net/10.1080/01621459.2013.787083
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:666-675
Template-Type: ReDIF-Article 1.0
Author-Name: Hohsuk Noh
Author-X-Name-First: Hohsuk
Author-X-Name-Last: Noh
Author-Name: Anouar El Ghouch
Author-X-Name-First: Anouar
Author-X-Name-Last: El Ghouch
Author-Name: Taoufik Bouezmarni
Author-X-Name-First: Taoufik
Author-X-Name-Last: Bouezmarni
Title: Copula-Based Regression Estimation and Inference
Abstract:
We investigate a new approach to estimating a regression
function based on copulas. The main idea behind this approach is to write
the regression function in terms of a copula and marginal distributions.
Once the copula and the marginal distributions are estimated, we use the
plug-in method to construct our new estimator. Because various methods are
available in the literature for estimating both a copula and a
distribution, this idea provides a rich and flexible family of regression
estimators. We provide some asymptotic results related to this
copula-based regression modeling when the copula is estimated via profile
likelihood and the marginals are estimated nonparametrically. We also
study the finite sample performance of the estimator and illustrate its
usefulness by analyzing data from air pollution studies.
Journal: Journal of the American Statistical Association
Pages: 676-688
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.783842
File-URL: http://hdl.handle.net/10.1080/01621459.2013.783842
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:676-688
Template-Type: ReDIF-Article 1.0
Author-Name: Dong Hwan Oh
Author-X-Name-First: Dong Hwan
Author-X-Name-Last: Oh
Author-Name: Andrew J. Patton
Author-X-Name-First: Andrew J.
Author-X-Name-Last: Patton
Title: Simulated Method of Moments Estimation for Copula-Based Multivariate Models
Abstract:
This article considers the estimation of the parameters of a
copula via a simulated method of moments (MM) type approach. This approach
is attractive when the likelihood of the copula model is not known in
closed form, or when the researcher has a set of dependence measures or
other functionals of the copula that are of particular interest. The
proposed approach naturally also nests MM and generalized method of
moments estimators. Drawing on results for simulation-based estimation and
on recent work in empirical copula process theory, we show the consistency
and asymptotic normality of the proposed estimator, and obtain a simple
test of overidentifying restrictions as a specification test. The results
apply to both iid and time series data. We analyze the finite-sample
behavior of these estimators in an extensive simulation study. We apply
the model to a group of seven financial stock returns and find evidence of
statistically significant tail dependence, and mild evidence that the
dependence between these assets is stronger in crashes than booms.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 689-700
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.785952
File-URL: http://hdl.handle.net/10.1080/01621459.2013.785952
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:689-700
Template-Type: ReDIF-Article 1.0
Author-Name: Chunpeng Fan
Author-X-Name-First: Chunpeng
Author-X-Name-Last: Fan
Author-Name: Jason P. Fine
Author-X-Name-First: Jason P.
Author-X-Name-Last: Fine
Title: Linear Transformation Model With Parametric Covariate Transformations
Abstract:
The traditional linear transformation model assumes a linear
relationship between the transformed response and the covariates. However,
in real data, this linear relationship may be violated. We propose a
linear transformation model that allows parametric covariate
transformations to recover the linearity. Although the proposed
generalization may seem rather simple, the inferential issues are quite
challenging due to loss of identifiability under the null of no effects of
transformed covariates. This article develops tests for such hypotheses.
We establish rigorous inferences for parameters and the unspecified
transformation function when the transformed covariates have nonzero
effects. The estimates and tests perform well in simulation studies using
a realistic sample size. We also develop goodness-of-fit tests for the
transformation and R-squared for model
comparison. GAGurine data are used to illustrate the practical utility of
the proposed methods.
Journal: Journal of the American Statistical Association
Pages: 701-712
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.770707
File-URL: http://hdl.handle.net/10.1080/01621459.2013.770707
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:701-712
Template-Type: ReDIF-Article 1.0
Author-Name: Yunzhang Zhu
Author-X-Name-First: Yunzhang
Author-X-Name-Last: Zhu
Author-Name: Xiaotong Shen
Author-X-Name-First: Xiaotong
Author-X-Name-Last: Shen
Author-Name: Wei Pan
Author-X-Name-First: Wei
Author-X-Name-Last: Pan
Title: Simultaneous Grouping Pursuit and Feature Selection Over an Undirected Graph
Abstract:
In high-dimensional regression, grouping pursuit and feature
selection have their own merits while complementing each other in battling
the curse of dimensionality. To seek a parsimonious model, we perform
simultaneous grouping pursuit and feature selection over an arbitrary
undirected graph with each node corresponding to one predictor. When the
corresponding nodes are reachable from each other over the graph,
regression coefficients can be grouped, whose absolute values are the same
or close. This is motivated from gene network analysis, where genes tend
to work in groups according to their biological functionalities. Through a
nonconvex penalty, we develop a computational strategy and analyze the
proposed method. Theoretical analysis indicates that the proposed method
reconstructs the oracle estimator, that is, the unbiased least-square
estimator given the true grouping, leading to consistent reconstruction of
grouping structures and informative features, as well as to optimal
parameter estimation. Simulation studies suggest that the method combines
the benefit of grouping pursuit with that of feature selection, and
compares favorably against its competitors in selection accuracy and
predictive performance. An application to eQTL data is used to illustrate
the methodology, where a network is incorporated into analysis through an
undirected graph.
Journal: Journal of the American Statistical Association
Pages: 713-725
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.770704
File-URL: http://hdl.handle.net/10.1080/01621459.2013.770704
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:713-725
Template-Type: ReDIF-Article 1.0
Author-Name: Zhou Zhou
Author-X-Name-First: Zhou
Author-X-Name-Last: Zhou
Title: Heteroscedasticity and Autocorrelation Robust Structural Change Detection
Abstract:
The assumption of (weak) stationarity is crucial for the
validity of most of the conventional tests of structure change in time
series. Under complicated nonstationary temporal dynamics, we argue that
traditional testing procedures result in mixed structural change signals
of the first and second order and hence could lead to biased testing
results. The article proposes a simple and unified bootstrap testing
procedure that provides consistent testing results under general forms of
smooth and abrupt changes in the temporal dynamics of the time series.
Monte Carlo experiments are performed to compare our testing procedure
with various traditional tests. Our robust bootstrap test is applied to
testing changes in an environmental and a financial time series and our
procedure is shown to provide more reliable results than the conventional
tests.
Journal: Journal of the American Statistical Association
Pages: 726-740
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.787184
File-URL: http://hdl.handle.net/10.1080/01621459.2013.787184
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:726-740
Template-Type: ReDIF-Article 1.0
Author-Name: Lawrence D. Brown
Author-X-Name-First: Lawrence D.
Author-X-Name-Last: Brown
Author-Name: Eitan Greenshtein
Author-X-Name-First: Eitan
Author-X-Name-Last: Greenshtein
Author-Name: Ya'acov Ritov
Author-X-Name-First: Ya'acov
Author-X-Name-Last: Ritov
Title: The Poisson Compound Decision Problem Revisited
Abstract:
The compound decision problem for a vector of independent
Poisson random variables with possibly different means has a
half-century-old solution. However, it appears that the classical solution
needs smoothing adjustment. We discuss three such adjustments. We also
present another approach that first transforms the problem into the normal
compound decision problem. A simulation study shows the effectiveness of
the procedures in improving the performance over that of the classical
procedure. A real data example is also provided. The procedures depend on
a smoothness parameter that can be selected using a nonstandard
cross-validation step, which is of independent interest. Finally, we
mention some asymptotic results.
Journal: Journal of the American Statistical Association
Pages: 741-749
Issue: 502
Volume: 108
Year: 2013
Month: 6
X-DOI: 10.1080/01621459.2013.771582
File-URL: http://hdl.handle.net/10.1080/01621459.2013.771582
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:502:p:741-749
Template-Type: ReDIF-Article 1.0
Author-Name: Matt Taddy
Author-X-Name-First: Matt
Author-X-Name-Last: Taddy
Title: Multinomial Inverse Regression for Text Analysis
Abstract:
Text data, including speeches, stories, and other document
forms, are often connected to sentiment variables that
are of interest for research in marketing, economics, and elsewhere. They
are also very high dimensional and difficult to incorporate into statistical
analyses. This article introduces a straightforward framework of
sentiment-sufficient dimension reduction for text data. Multinomial
inverse regression is introduced as a general tool for simplifying
predictor sets that can be represented as draws from a multinomial
distribution, and we show that logistic regression of phrase counts onto
document annotations can be used to obtain low-dimensional document
representations that are rich in sentiment information. To facilitate this
modeling, a novel estimation technique is developed for multinomial
logistic regression with very high-dimensional response. In particular,
independent Laplace priors with unknown variance are assigned to each
regression coefficient, and we detail an efficient routine for
maximization of the joint posterior over coefficients and their prior
scale. This "gamma-lasso" scheme yields stable and effective estimation
for general high-dimensional logistic regression, and we argue that it
will be superior to current methods in many settings. Guidelines for prior
specification are provided, algorithm convergence is detailed, and
estimator properties are outlined from the perspective of the literature
on nonconcave likelihood penalization. Related work on sentiment analysis
from statistics, econometrics, and machine learning is surveyed and
connected. Finally, the methods are applied in two detailed examples and
we provide out-of-sample prediction studies to illustrate their
effectiveness.
Journal: Journal of the American Statistical Association
Pages: 755-770
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2012.734168
File-URL: http://hdl.handle.net/10.1080/01621459.2012.734168
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:755-770
Template-Type: ReDIF-Article 1.0
Author-Name: Justin Grimmer
Author-X-Name-First: Justin
Author-X-Name-Last: Grimmer
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 770-771
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.822383
File-URL: http://hdl.handle.net/10.1080/01621459.2013.822383
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:770-771
Template-Type: ReDIF-Article 1.0
Author-Name: David M. Blei
Author-X-Name-First: David M.
Author-X-Name-Last: Blei
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 771-772
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.827983
File-URL: http://hdl.handle.net/10.1080/01621459.2013.827983
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:771-772
Template-Type: ReDIF-Article 1.0
Author-Name: Matt Taddy
Author-X-Name-First: Matt
Author-X-Name-Last: Taddy
Title: Rejoinder: Efficiency and Structure in MNIR
Journal: Journal of the American Statistical Association
Pages: 772-774
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.821408
File-URL: http://hdl.handle.net/10.1080/01621459.2013.821408
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:772-774
Template-Type: ReDIF-Article 1.0
Author-Name: Juhee Lee
Author-X-Name-First: Juhee
Author-X-Name-Last: Lee
Author-Name: Peter Müller
Author-X-Name-First: Peter
Author-X-Name-Last: Müller
Author-Name: Yitan Zhu
Author-X-Name-First: Yitan
Author-X-Name-Last: Zhu
Author-Name: Yuan Ji
Author-X-Name-First: Yuan
Author-X-Name-Last: Ji
Title: A Nonparametric Bayesian Model for Local Clustering With Application to Proteomics
Abstract:
We propose a nonparametric Bayesian local clustering
(NoB-LoC) approach for heterogeneous data. NoB-LoC implements inference
for nested clusters as posterior inference under a Bayesian model. Using
protein expression data as an example, the NoB-LoC model defines a protein
(column) cluster as a set of proteins that give rise to the same partition
of the samples (rows). In other words, the sample partitions are nested
within protein clusters. The common clustering of the samples gives
meaning to the protein clusters. Any pair of samples might belong to the
same cluster for one protein set but to different clusters for another
protein set. These local features are different from features obtained by
global clustering approaches such as hierarchical clustering, which create
only one partition of samples that applies for all the proteins in the
dataset. In addition, the NoB-LoC model is different from most other local
or nested clustering methods, which define clusters based on common
parameters in the sampling model. As an added and important feature, the
NoB-LoC method probabilistically excludes sets of irrelevant proteins and
samples that do not meaningfully cocluster with other proteins and
samples, thus improving the inference on the clustering of the remaining
proteins and samples. Inference is guided by a joint probability model for
all the random elements. We provide a simulation study and a motivating
example to demonstrate the unique features of the NoB-LoC model.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 775-788
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.784705
File-URL: http://hdl.handle.net/10.1080/01621459.2013.784705
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:775-788
Template-Type: ReDIF-Article 1.0
Author-Name: Peter B. Gilbert
Author-X-Name-First: Peter B.
Author-X-Name-Last: Gilbert
Author-Name: Bryan E. Shepherd
Author-X-Name-First: Bryan E.
Author-X-Name-Last: Shepherd
Author-Name: Michael G. Hudgens
Author-X-Name-First: Michael G.
Author-X-Name-Last: Hudgens
Title: Sensitivity Analysis of Per-Protocol Time-to-Event Treatment Efficacy in Randomized Clinical Trials
Abstract:
Assessing per-protocol (PP) treatment efficacy on a
time-to-event endpoint is a common objective of randomized clinical
trials. The typical analysis uses the same method employed for the
intention-to-treat analysis (e.g., standard survival analysis) applied to
the subgroup meeting protocol adherence criteria. However, due to
potential post-randomization selection bias, this analysis may mislead
about treatment efficacy. Moreover, while there is extensive literature on
methods for assessing causal treatment effects in compliers, these methods
do not apply to a common class of trials where (a) the primary objective
compares survival curves, (b) it is inconceivable to assign participants
to be adherent and event free before adherence is measured, and (c) the
exclusion restriction assumption fails to hold. HIV vaccine efficacy
trials including the recent RV144 trial exemplify this class, because many
primary endpoints (e.g., HIV infections) occur before adherence is
measured, and nonadherent subjects who receive some of the planned
immunizations may be partially protected. Therefore, we develop methods
for assessing PP treatment efficacy for this problem class, considering
three causal estimands of interest. Because these estimands are not
identifiable from the observable data, we develop nonparametric bounds and
semiparametric sensitivity analysis methods that yield estimated ignorance
and uncertainty intervals. The methods are applied to RV144. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 789-800
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.786649
File-URL: http://hdl.handle.net/10.1080/01621459.2013.786649
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:789-800
Template-Type: ReDIF-Article 1.0
Author-Name: James Raymer
Author-X-Name-First: James
Author-X-Name-Last: Raymer
Author-Name: Arkadiusz Wiśniowski
Author-X-Name-First: Arkadiusz
Author-X-Name-Last: Wiśniowski
Author-Name: Jonathan J. Forster
Author-X-Name-First: Jonathan J.
Author-X-Name-Last: Forster
Author-Name: Peter W. F. Smith
Author-X-Name-First: Peter W. F.
Author-X-Name-Last: Smith
Author-Name: Jakub Bijak
Author-X-Name-First: Jakub
Author-X-Name-Last: Bijak
Title: Integrated Modeling of European Migration
Abstract:
International migration data in Europe are collected by
individual countries with separate collection systems and designs. As a
result, reported data are inconsistent in availability, definition, and
quality. In this article, we propose a Bayesian model to overcome the
limitations of the various data sources. The focus is on estimating recent
international migration flows among 31 countries in the European Union and
European Free Trade Association from 2002 to 2008, using data collated by
Eurostat. We also incorporate covariate information and information
provided by experts on the effects of undercount, measurement, and
accuracy of data collection systems. The methodology is integrated and
produces a synthetic database with measures of uncertainty for
international migration flows and other model parameters. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 801-819
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.789435
File-URL: http://hdl.handle.net/10.1080/01621459.2013.789435
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:801-819
Template-Type: ReDIF-Article 1.0
Author-Name: Brian J. Reich
Author-X-Name-First: Brian J.
Author-X-Name-Last: Reich
Author-Name: Dipankar Bandyopadhyay
Author-X-Name-First: Dipankar
Author-X-Name-Last: Bandyopadhyay
Author-Name: Howard D. Bondell
Author-X-Name-First: Howard D.
Author-X-Name-Last: Bondell
Title: A Nonparametric Spatial Model for Periodontal Data With Nonrandom Missingness
Abstract:
Periodontal disease (PD) progression is often quantified by
clinical attachment level (CAL) defined as the distance down a tooth's
root that is detached from the surrounding bone. Measured at six locations
per tooth throughout the mouth (excluding the molars), it gives rise to a
dependent data setup. These data are often reduced to a one-number
summary, such as the whole-mouth average or the number of observations
greater than a threshold, to be used as the response in a regression to
identify important covariates related to the current state of a subject's
periodontal health. Rather than a simple one-number summary, we set
forward to analyze all available CAL data for each subject, exploiting the
presence of spatial dependence, nonstationarity, and nonnormality. Also,
many subjects have a considerable proportion of missing teeth, which
cannot be considered missing at random because PD is the leading cause of
adult tooth loss. Under a Bayesian paradigm, we propose a nonparametric
flexible spatial (joint) model of observed CAL and the locations of missing
teeth via kernel convolution methods, incorporating the aforementioned
features of CAL data under a unified framework. Application of this
methodology to a dataset recording the periodontal health of an
African-American population, as well as simulation studies, reveals the gain
in model fit and inference, and provides a new perspective on unraveling
covariate--response relationships in the presence of complexities posed by
these data.
Journal: Journal of the American Statistical Association
Pages: 820-831
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.795487
File-URL: http://hdl.handle.net/10.1080/01621459.2013.795487
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:820-831
Template-Type: ReDIF-Article 1.0
Author-Name: Jason L. Morrissette
Author-X-Name-First: Jason L.
Author-X-Name-Last: Morrissette
Author-Name: Michael P. McDermott
Author-X-Name-First: Michael P.
Author-X-Name-Last: McDermott
Title: Estimation and Inference Concerning Ordered Means in Analysis of Covariance Models With Interactions
Abstract:
When interactions are identified in analysis of covariance
models, it becomes important to identify values of the covariates for
which there are significant differences or, more generally, significant
contrasts among the group mean responses. Inferential procedures that
incorporate a priori order restrictions among the group mean responses
would be expected to be superior to those that ignore this information. In
this article, we focus on analysis of covariance models with prespecified
order restrictions on the mean response across the levels of a grouping
variable when the grouping variable may interact with model covariates. In
order for the restrictions to hold in the presence of interactions, it is
necessary to impose the requirement that the restrictions hold over all
levels of interacting categorical covariates and across prespecified
ranges of interacting continuous covariates. The parameter estimation
procedure involves solving a quadratic programming minimization problem
with a carefully specified constraint matrix. Simultaneous confidence
intervals for treatment group contrasts and tests for equality of the
ordered group mean responses are determined by exploiting previously
unconnected literature. The proposed methods are motivated by a clinical
trial of the dopamine agonist pramipexole for the treatment of early-stage
Parkinson's disease.
Journal: Journal of the American Statistical Association
Pages: 832-839
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.797355
File-URL: http://hdl.handle.net/10.1080/01621459.2013.797355
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:832-839
Template-Type: ReDIF-Article 1.0
Author-Name: Roland Langrock
Author-X-Name-First: Roland
Author-X-Name-Last: Langrock
Author-Name: David L. Borchers
Author-X-Name-First: David L.
Author-X-Name-Last: Borchers
Author-Name: Hans J. Skaug
Author-X-Name-First: Hans J.
Author-X-Name-Last: Skaug
Title: Markov-Modulated Nonhomogeneous Poisson Processes for Modeling Detections in Surveys of Marine Mammal Abundance
Abstract:
We consider Markov-modulated nonhomogeneous Poisson processes
for modeling sightings of marine mammals in shipboard or aerial surveys.
In such surveys, detection of an animal is possible only when it surfaces,
and with some species a substantial proportion of animals is missed
because they are diving and thus not available for detection. This needs
to be adequately accounted for to avoid biased abundance estimates. The
tendency of surfacing events of marine mammals to occur in clusters
motivates consideration of the flexible class of Markov-modulated Poisson
processes in this context. We embed these models in distance sampling
models, introducing nonhomogeneity in the process to account for the fact
that the observer's probability of detecting an animal decreases with
increasing distance to the animal. We derive approximate expressions for
the likelihood of Markov-modulated nonhomogeneous Poisson processes that
enable us to estimate the model parameters through numerical maximum
likelihood. The performance of the approach is investigated in an
extensive simulation study, and applications to pilot and beaked whale tag
data as well as to minke whale tag and survey data demonstrate its
relevance in abundance estimation.
Journal: Journal of the American Statistical Association
Pages: 840-851
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.797356
File-URL: http://hdl.handle.net/10.1080/01621459.2013.797356
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:840-851
Template-Type: ReDIF-Article 1.0
Author-Name: Jonathan Rougier
Author-X-Name-First: Jonathan
Author-X-Name-Last: Rougier
Author-Name: Michael Goldstein
Author-X-Name-First: Michael
Author-X-Name-Last: Goldstein
Author-Name: Leanna House
Author-X-Name-First: Leanna
Author-X-Name-Last: House
Title: Second-Order Exchangeability Analysis for Multimodel Ensembles
Abstract:
The challenge of understanding complex systems often gives
rise to a multiplicity of models. It is natural to consider whether the
outputs of these models can be combined to produce a system prediction
that is more informative than the output of any one of the models taken in
isolation, and, in particular, to consider the relationship between the
spread of model outputs and system uncertainty. We describe a statistical
framework for such a combination, based on the exchangeability of the
models, and their coexchangeability with the system. We demonstrate the
simplest implementation of our framework in the context of climate
prediction. Throughout we work entirely in means and variances to avoid
the necessity of specifying higher-order quantities for which we often
lack well-founded judgments.
Journal: Journal of the American Statistical Association
Pages: 852-863
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.802963
File-URL: http://hdl.handle.net/10.1080/01621459.2013.802963
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:852-863
Template-Type: ReDIF-Article 1.0
Author-Name: Tao Yu
Author-X-Name-First: Tao
Author-X-Name-Last: Yu
Author-Name: Pengfei Li
Author-X-Name-First: Pengfei
Author-X-Name-Last: Li
Title: Spatial Shrinkage Estimation of Diffusion Tensors on Diffusion-Weighted Imaging Data
Abstract:
Diffusion tensor imaging (DTI), based on the
diffusion-weighted imaging (DWI) data acquired from magnetic resonance
experiments, has been widely used to analyze the physical structure of
white-matter fibers in the human brain in vivo. The raw DWI data, however,
carry noise; this contaminates the diffusion tensor (DT) estimates and
introduces systematic bias into the induced eigenvalues. These bias
components affect the effectiveness of fiber-tracking algorithms. In this
article, we propose a two-stage spatial shrinkage estimation (SpSkE)
procedure to accommodate the spatial information carried in DWI data in DT
estimation and to reduce the bias components in the corresponding derived
eigenvalues. To this end, in the framework of the heteroscedastic linear
model, SpSkE incorporates L1-type penalization and the locally weighted least-squares
function. The theoretical properties of SpSkE are explored. The
effectiveness of SpSkE is further illustrated by simulation and real-data
examples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 864-875
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.804408
File-URL: http://hdl.handle.net/10.1080/01621459.2013.804408
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:864-875
Template-Type: ReDIF-Article 1.0
Author-Name: Francesco C. Stingo
Author-X-Name-First: Francesco C.
Author-X-Name-Last: Stingo
Author-Name: Michele Guindani
Author-X-Name-First: Michele
Author-X-Name-Last: Guindani
Author-Name: Marina Vannucci
Author-X-Name-First: Marina
Author-X-Name-Last: Vannucci
Author-Name: Vince D. Calhoun
Author-X-Name-First: Vince D.
Author-X-Name-Last: Calhoun
Title: An Integrative Bayesian Modeling Approach to Imaging Genetics
Abstract:
In this article we present a Bayesian hierarchical modeling
approach for imaging genetics, where the interest lies in linking brain
connectivity across multiple individuals to their genetic information. We
have available data from a functional magnetic resonance imaging (fMRI)
study on schizophrenia. Our goals are to identify brain regions of
interest (ROIs) with discriminating activation patterns between
schizophrenic patients and healthy controls, and to relate the ROIs'
activations with available genetic information from single nucleotide
polymorphisms (SNPs) on the subjects. For this task, we develop a
hierarchical mixture model that includes several innovative
characteristics: it incorporates the selection of ROIs that discriminate
the subjects into separate groups; it allows the mixture components to
depend on selected covariates; it includes prior models that capture
structural dependencies among the ROIs. Applied to the schizophrenia
dataset, the model leads to the simultaneous selection of a set of
discriminatory ROIs and the relevant SNPs, together with the
reconstruction of the correlation structure of the selected regions. To
the best of our knowledge, our work represents the first attempt at a
rigorous modeling strategy for imaging genetics data that incorporates all
such features.
Journal: Journal of the American Statistical Association
Pages: 876-891
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.804409
File-URL: http://hdl.handle.net/10.1080/01621459.2013.804409
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:876-891
Template-Type: ReDIF-Article 1.0
Author-Name: Jin Zhang
Author-X-Name-First: Jin
Author-X-Name-Last: Zhang
Author-Name: Thomas M. Braun
Author-X-Name-First: Thomas M.
Author-X-Name-Last: Braun
Title: A Phase I Bayesian Adaptive Design to Simultaneously Optimize Dose and Schedule Assignments Both Between and Within Patients
Abstract:
In traditional schedule or dose--schedule finding designs,
patients are assumed to receive their assigned dose--schedule combination
throughout the trial even though the combination may be found to have an
undesirable toxicity profile, which contradicts actual clinical practice.
Since no systematic approach exists to optimize intrapatient
dose--schedule assignment, we propose a Phase I clinical trial design that
extends existing approaches to optimize dose and schedule solely between
patients by incorporating adaptive variations to dose--schedule
assignments within patients as the study proceeds. Our design is based on
a Bayesian nonmixture cure rate model that incorporates multiple
administrations each patient receives with the per-administration dose
included as a covariate. Simulations demonstrate that our design
identifies safe dose and schedule combinations as well as the traditional
method that does not allow for intrapatient dose--schedule reassignments,
but with a larger number of patients assigned to safe combinations.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 892-901
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.806927
File-URL: http://hdl.handle.net/10.1080/01621459.2013.806927
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:892-901
Template-Type: ReDIF-Article 1.0
Author-Name: Jerry Q. Cheng
Author-X-Name-First: Jerry Q.
Author-X-Name-Last: Cheng
Author-Name: Minge Xie
Author-X-Name-First: Minge
Author-X-Name-Last: Xie
Author-Name: Rong Chen
Author-X-Name-First: Rong
Author-X-Name-Last: Chen
Author-Name: Fred Roberts
Author-X-Name-First: Fred
Author-X-Name-Last: Roberts
Title: A Latent Source Model to Detect Multiple Spatial Clusters With Application in a Mobile Sensor Network for Surveillance of Nuclear Materials
Abstract:
Potential nuclear attacks are among the most devastating
terrorist attacks, with severe loss of human lives as well as damage to
infrastructure. To deter such threats, it becomes increasingly vital to
have sophisticated nuclear surveillance and detection systems deployed in
major cities in the United States, such as New York City. In this article,
we design a mobile sensor network and develop statistical algorithms and
models to provide consistent and pervasive surveillance of nuclear
materials in major cities. The network consists of a large number of
vehicles on which nuclear sensors and Global Positioning System (GPS)
tracking devices are installed. Real-time sensor readings and GPS
information are transmitted to and processed at a central surveillance
center. Mathematical and statistical analyses are performed, in which we
mimic a signal-generating process and develop a latent source modeling
framework to detect multiple spatial clusters. A Monte Carlo
expectation-maximization algorithm is developed to estimate model
parameters, detect significant clusters, and identify their locations and
sizes. We also determine the number of clusters using a modified Akaike
Information Criterion/Bayesian Information Criterion. Simulation studies
to evaluate the effectiveness and detection power of such a network are
described.
Journal: Journal of the American Statistical Association
Pages: 902-913
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.808945
File-URL: http://hdl.handle.net/10.1080/01621459.2013.808945
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:902-913
Template-Type: ReDIF-Article 1.0
Author-Name: Yichen Qin
Author-X-Name-First: Yichen
Author-X-Name-Last: Qin
Author-Name: Carey E. Priebe
Author-X-Name-First: Carey E.
Author-X-Name-Last: Priebe
Title: Maximum Lq-Likelihood Estimation via the Expectation-Maximization Algorithm: A Robust Estimation of Mixture Models
Abstract:
We introduce a maximum Lq-likelihood
estimation (MLqE) of mixture models using our proposed
expectation-maximization (EM) algorithm, namely the EM algorithm with
Lq-likelihood (EM-Lq). Properties of the
MLqE obtained from the proposed EM-Lq
are studied through simulated mixture model data. Compared with the
maximum likelihood estimation (MLE), which is obtained from the EM
algorithm, the MLqE provides a more robust estimation
against outliers for small sample sizes. In particular, we study the
performance of the MLqE in the context of the gross error
model, where the true model of interest is a mixture of two normal
distributions, and the contamination component is a third normal
distribution with a large variance. A numerical comparison between the
MLqE and the MLE for this gross error model is presented
in terms of Kullback--Leibler (KL) distance and relative efficiency.
Journal: Journal of the American Statistical Association
Pages: 914-928
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.787933
File-URL: http://hdl.handle.net/10.1080/01621459.2013.787933
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:914-928
Template-Type: ReDIF-Article 1.0
Author-Name: Mian Huang
Author-X-Name-First: Mian
Author-X-Name-Last: Huang
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Author-Name: Shaoli Wang
Author-X-Name-First: Shaoli
Author-X-Name-Last: Wang
Title: Nonparametric Mixture of Regression Models
Abstract:
Motivated by an analysis of U.S. house price index (HPI)
data, we propose nonparametric finite mixture of regression models. We
study the identifiability issue of the proposed models, and develop an
estimation procedure by employing kernel regression. We further
systematically study the sampling properties of the proposed estimators,
and establish their asymptotic normality. A modified EM algorithm is
proposed to carry out the estimation procedure. We show that our algorithm
preserves the ascent property of the EM algorithm in an asymptotic sense.
Monte Carlo simulations are conducted to examine the finite sample
performance of the proposed estimation procedure. The proposed methodology
is illustrated with an empirical analysis of the U.S. HPI data.
Journal: Journal of the American Statistical Association
Pages: 929-941
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.772897
File-URL: http://hdl.handle.net/10.1080/01621459.2013.772897
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:929-941
Template-Type: ReDIF-Article 1.0
Author-Name: Mahbubul Majumder
Author-X-Name-First: Mahbubul
Author-X-Name-Last: Majumder
Author-Name: Heike Hofmann
Author-X-Name-First: Heike
Author-X-Name-Last: Hofmann
Author-Name: Dianne Cook
Author-X-Name-First: Dianne
Author-X-Name-Last: Cook
Title: Validation of Visual Statistical Inference, Applied to Linear Models
Abstract:
Statistical graphics play a crucial role in exploratory data
analysis, model checking, and diagnosis. The lineup protocol enables
statistical significance testing of visual findings, bridging the gulf
between exploratory and inferential statistics. In this article,
inferential methods for statistical graphics are developed further by
refining the terminology of visual inference and framing the lineup
protocol in a context that allows direct comparison with conventional
tests in scenarios when a conventional test exists. This framework is used
to compare the performance of the lineup protocol against conventional
statistical testing in the scenario of fitting linear models. A human
subjects experiment is conducted using simulated data to provide
controlled conditions. Results suggest that the lineup protocol performs
comparably with the conventional tests, and expectedly outperforms them
when data are contaminated, a scenario where assumptions required for
performing a conventional test are violated. Surprisingly, visual tests
have higher power than the conventional tests when the effect size is
large. And, interestingly, there may be some super-visual individuals who
yield better performance and power than the conventional test even in the
most difficult tasks. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 942-956
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.808157
File-URL: http://hdl.handle.net/10.1080/01621459.2013.808157
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:942-956
Template-Type: ReDIF-Article 1.0
Author-Name: Susan Wei
Author-X-Name-First: Susan
Author-X-Name-Last: Wei
Author-Name: Michael R. Kosorok
Author-X-Name-First: Michael R.
Author-X-Name-Last: Kosorok
Title: Latent Supervised Learning
Abstract:
This article introduces a new machine learning task, called
latent supervised learning, where the goal is to learn a binary classifier
from continuous training labels that serve as surrogates
for the unobserved class labels. We investigate a specific model where the
surrogate variable arises from a two-component Gaussian mixture with
unknown means and variances, and the component membership is determined by
a hyperplane in the covariate space. The estimation of the separating
hyperplane and the Gaussian mixture parameters forms what shall be
referred to as the change-line classification problem. We propose a
data-driven sieve maximum likelihood estimator for the hyperplane, which
in turn can be used to estimate the parameters of the Gaussian mixture.
The estimator is shown to be consistent. Simulations as well as empirical
data show the estimator has high classification accuracy.
Journal: Journal of the American Statistical Association
Pages: 957-970
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.789695
File-URL: http://hdl.handle.net/10.1080/01621459.2013.789695
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:957-970
Template-Type: ReDIF-Article 1.0
Author-Name: Colin O. Wu
Author-X-Name-First: Colin O.
Author-X-Name-Last: Wu
Author-Name: Xin Tian
Author-X-Name-First: Xin
Author-X-Name-Last: Tian
Title: Nonparametric Estimation of Conditional Distributions and Rank-Tracking Probabilities With Time-Varying Transformation Models in Longitudinal Studies
Abstract:
An objective of longitudinal analysis is to estimate the
conditional distributions of an outcome variable through a regression
model. The approaches based on modeling the conditional means are not
appropriate for this task when the conditional distributions are skewed or
cannot be approximated by a normal distribution through a known
transformation. We study a class of time-varying transformation models and
a two-step smoothing method for the estimation of the conditional
distribution functions. Based on our models, we propose a rank-tracking
probability and a rank-tracking probability ratio to measure the strength
of tracking ability of an outcome variable at two different time points.
Our models and estimation method can be applied to a wide range of
scientific objectives that cannot be evaluated by the conditional
mean-based models. We derive the asymptotic properties for the two-step
local polynomial estimators of the conditional distribution functions.
Finite sample properties of our procedures are investigated through a
simulation study. Application of our models and estimation method is
demonstrated through an epidemiological study of childhood growth and
blood pressure. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 971-982
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.808949
File-URL: http://hdl.handle.net/10.1080/01621459.2013.808949
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:971-982
Template-Type: ReDIF-Article 1.0
Author-Name: Xiaoke Zhang
Author-X-Name-First: Xiaoke
Author-X-Name-Last: Zhang
Author-Name: Byeong U. Park
Author-X-Name-First: Byeong U.
Author-X-Name-Last: Park
Author-Name: Jane-ling Wang
Author-X-Name-First: Jane-ling
Author-X-Name-Last: Wang
Title: Time-Varying Additive Models for Longitudinal Data
Abstract:
The additive model is an effective dimension-reduction
approach that also provides flexibility in modeling the relation between a
response variable and key covariates. The literature has largely been
developed for scalar responses and vector covariates. In this article, more complex
data are of interest, where both the response and the covariates are
functions. We propose a functional additive model together with a new
backfitting algorithm to estimate the unknown regression functions, whose
components are time-dependent additive functions of the covariates. Such
functional data may not be completely observed since measurements may only
be collected intermittently at discrete time points. We develop a unified
platform and an efficient approach that cover both dense and sparse
functional data, together with the theory needed for statistical inference. We also
establish the oracle properties of the proposed estimators of the
component functions.
Journal: Journal of the American Statistical Association
Pages: 983-998
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.778776
File-URL: http://hdl.handle.net/10.1080/01621459.2013.778776
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:983-998
Template-Type: ReDIF-Article 1.0
Author-Name: P. Richard Hahn
Author-X-Name-First: P. Richard
Author-X-Name-Last: Hahn
Author-Name: Carlos M. Carvalho
Author-X-Name-First: Carlos M.
Author-X-Name-Last: Carvalho
Author-Name: Sayan Mukherjee
Author-X-Name-First: Sayan
Author-X-Name-Last: Mukherjee
Title: Partial Factor Modeling: Predictor-Dependent Shrinkage for Linear Regression
Abstract:
We develop a modified Gaussian factor model for the purpose
of inducing predictor-dependent shrinkage for linear regression. The new
model predicts well across a wide range of covariance structures, on real
and simulated data. Furthermore, the new model facilitates variable
selection in the case of correlated predictor variables, which often
stymies other methods.
Journal: Journal of the American Statistical Association
Pages: 999-1008
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.779843
File-URL: http://hdl.handle.net/10.1080/01621459.2013.779843
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:999-1008
Template-Type: ReDIF-Article 1.0
Author-Name: Xiaolei Xun
Author-X-Name-First: Xiaolei
Author-X-Name-Last: Xun
Author-Name: Jiguo Cao
Author-X-Name-First: Jiguo
Author-X-Name-Last: Cao
Author-Name: Bani Mallick
Author-X-Name-First: Bani
Author-X-Name-Last: Mallick
Author-Name: Arnab Maity
Author-X-Name-First: Arnab
Author-X-Name-Last: Maity
Author-Name: Raymond J. Carroll
Author-X-Name-First: Raymond J.
Author-X-Name-Last: Carroll
Title: Parameter Estimation of Partial Differential Equation Models
Abstract:
Partial differential equation (PDE) models are commonly used
to model complex dynamic systems in applied sciences such as biology and
finance. The forms of these PDE models are usually proposed by experts
based on their prior knowledge and understanding of the dynamic system.
Parameters in PDE models often have interesting scientific
interpretations, but their values are often unknown and need to be
estimated from the measurements of the dynamic system in the presence of
measurement errors. Most PDEs used in practice have no analytic solutions,
and can only be solved with numerical methods. Currently, methods for
estimating PDE parameters require repeatedly solving PDEs numerically
under thousands of candidate parameter values, and thus the computational
load is high. In this article, we propose two methods to estimate
parameters in PDE models: a parameter cascading method and a Bayesian
approach. In both methods, the underlying dynamic process modeled with the
PDE model is represented via basis function expansion. For the parameter
cascading method, we develop two nested levels of optimization to estimate
the PDE parameters. For the Bayesian method, we develop a joint model for
the data and the PDE, together with a novel hierarchical model that allows
us to employ Markov chain Monte Carlo (MCMC) techniques to make posterior
inference. Simulation studies show that the Bayesian method and parameter
cascading method are comparable, and both outperform other available
methods in terms of estimation accuracy. The two methods are demonstrated
by estimating parameters in a PDE model from long-range infrared light
detection and ranging data. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 1009-1020
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.794730
File-URL: http://hdl.handle.net/10.1080/01621459.2013.794730
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1009-1020
Template-Type: ReDIF-Article 1.0
Author-Name: Stéphane Guerrier
Author-X-Name-First: Stéphane
Author-X-Name-Last: Guerrier
Author-Name: Jan Skaloud
Author-X-Name-First: Jan
Author-X-Name-Last: Skaloud
Author-Name: Yannick Stebler
Author-X-Name-First: Yannick
Author-X-Name-Last: Stebler
Author-Name: Maria-Pia Victoria-Feser
Author-X-Name-First: Maria-Pia
Author-X-Name-Last: Victoria-Feser
Title: Wavelet-Variance-Based Estimation for Composite Stochastic Processes
Abstract:
This article presents a new estimation method for the
parameters of a time series model. We consider here composite Gaussian
processes that are the sum of independent Gaussian processes, each of
which explains an important aspect of the time series, as is the case in
engineering and natural sciences. The proposed estimation method offers an
alternative to classical estimation based on the likelihood, that is
straightforward to implement and often the only feasible estimation method
with complex models. The estimator is obtained by optimizing a criterion
based on a standardized distance between the sample wavelet variance (WV)
estimates and the model-based WV. Indeed, the WV provides a decomposition
of the process variance across different scales, so that it contains
information about different features of the stochastic
model. We derive the asymptotic properties of the proposed estimator for
inference and perform a simulation study to compare our estimator to the
MLE and the LSE with different models. We also set sufficient conditions,
which are easy to verify, on composite models for our estimator to be
consistent. We use the new estimator to estimate the stochastic error
parameters of the sum of three first-order Gauss--Markov processes by
means of a sample of over 800,000 observations obtained from gyroscopes
that compose inertial navigation systems. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 1021-1030
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.799920
File-URL: http://hdl.handle.net/10.1080/01621459.2013.799920
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1021-1030
Template-Type: ReDIF-Article 1.0
Author-Name: Cheryl J. Flynn
Author-X-Name-First: Cheryl J.
Author-X-Name-Last: Flynn
Author-Name: Clifford M. Hurvich
Author-X-Name-First: Clifford M.
Author-X-Name-Last: Hurvich
Author-Name: Jeffrey S. Simonoff
Author-X-Name-First: Jeffrey S.
Author-X-Name-Last: Simonoff
Title: Efficiency for Regularization Parameter Selection in Penalized Likelihood Estimation of Misspecified Models
Abstract:
It has been shown that Akaike information criterion
(AIC)-type criteria are asymptotically efficient selectors of the tuning
parameter in nonconcave penalized regression methods under the assumption
that the population variance is known or that a consistent estimator is
available. We relax this assumption to prove that AIC itself is
asymptotically efficient and we study its performance in finite samples.
In classical regression, it is known that AIC tends to select overly
complex models when the dimension of the maximum candidate model is large
relative to the sample size. Simulation studies suggest that AIC suffers
from the same shortcomings when used in penalized regression. We therefore
propose the use of the classical corrected AIC (AICc) as an alternative and prove that
it maintains the desired asymptotic properties. To broaden our results, we
further prove the efficiency of AIC for penalized likelihood methods in
the context of generalized linear models with no dispersion parameter.
Similar results exist in the literature but only for a restricted set of
candidate models. By employing results from the classical literature on
maximum-likelihood estimation in misspecified models, we are able to
establish this result for a general set of candidate models. We use
simulations to assess the performance of AIC and AICc, as well as that of other
selectors, in finite samples for both smoothly clipped absolute deviation
(SCAD)-penalized and Lasso regressions and a real data example is
considered. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1031-1043
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.801775
File-URL: http://hdl.handle.net/10.1080/01621459.2013.801775
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1031-1043
Template-Type: ReDIF-Article 1.0
Author-Name: Yingying Fan
Author-X-Name-First: Yingying
Author-X-Name-Last: Fan
Author-Name: Jinchi Lv
Author-X-Name-First: Jinchi
Author-X-Name-Last: Lv
Title: Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space
Abstract:
High-dimensional data analysis has motivated a spectrum of
regularization methods for variable selection and sparse modeling, with
two popular methods being convex and concave ones. A long debate has taken
place on whether one class dominates the other, an important question both
in theory and to practitioners. In this article, we characterize the
asymptotic equivalence of regularization methods, with general penalty
functions, in a thresholded parameter space under the generalized linear
model setting, where the dimensionality can grow exponentially with the
sample size. To assess their performance, we establish the oracle
inequalities (as in Bickel, Ritov, and Tsybakov 2009) of the global
minimizer for these methods under various prediction and variable
selection losses. These results reveal an interesting phase transition
phenomenon. For polynomially growing dimensionality, the
L1-regularization method of
Lasso and concave methods are asymptotically equivalent, having the same
convergence rates in the oracle inequalities. For exponentially growing
dimensionality, concave methods are asymptotically equivalent but have
faster convergence rates than the Lasso. We also establish a stronger
property of the oracle risk inequalities of the regularization methods, as
well as the sampling properties of computable solutions. Our new
theoretical results are illustrated and justified by simulation and real
data examples.
Journal: Journal of the American Statistical Association
Pages: 1044-1061
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.803972
File-URL: http://hdl.handle.net/10.1080/01621459.2013.803972
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1044-1061
Template-Type: ReDIF-Article 1.0
Author-Name: Huixia Judy Wang
Author-X-Name-First: Huixia Judy
Author-X-Name-Last: Wang
Author-Name: Deyuan Li
Author-X-Name-First: Deyuan
Author-X-Name-Last: Li
Title: Estimation of Extreme Conditional Quantiles Through Power Transformation
Abstract:
The estimation of extreme conditional quantiles is an
important issue in numerous disciplines. Quantile regression (QR) provides
a natural way to capture the covariate effects at different tails of the
response distribution. However, without any distributional assumptions,
estimation from conventional QR is often unstable at the tails, especially
for heavy-tailed distributions due to data sparsity. In this article, we
develop a new three-stage estimation procedure that integrates QR and
extreme value theory by estimating intermediate conditional quantiles
using QR and extrapolating these estimates to tails based on extreme value
theory. Using the power-transformed QR, the proposed method allows more
flexibility than existing methods that rely on the linearity of quantiles
on the original scale, while extending the applicability of parametric
models to borrow information across covariates without resorting to
nonparametric smoothing. In addition, we propose a test procedure to
assess the commonality of the extreme value index, which could be useful for
obtaining more efficient estimation by sharing information across
covariates. We establish the asymptotic properties of the proposed method
and demonstrate its value through a simulation study and the analysis of
medical cost data. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 1062-1074
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.820134
File-URL: http://hdl.handle.net/10.1080/01621459.2013.820134
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1062-1074
Template-Type: ReDIF-Article 1.0
Author-Name: Antonio F. Galvao
Author-X-Name-First: Antonio F.
Author-X-Name-Last: Galvao
Author-Name: Carlos Lamarche
Author-X-Name-First: Carlos
Author-X-Name-Last: Lamarche
Author-Name: Luiz Renato Lima
Author-X-Name-First: Luiz Renato
Author-X-Name-Last: Lima
Title: Estimation of Censored Quantile Regression for Panel Data With Fixed Effects
Abstract:
This article investigates estimation of censored quantile
regression (QR) models with fixed effects. Standard available methods are
not appropriate for estimation of a censored QR model with a large number
of parameters or with covariates correlated with unobserved individual
heterogeneity. Motivated by these limitations, the article proposes
estimators that are obtained by applying fixed effects QR to subsets of
observations selected either parametrically or nonparametrically. We
derive the limiting distribution of the new estimators under joint limits,
and conduct Monte Carlo simulations to assess their small sample
performance. An empirical application of the method to study the impact of
the 1964 Civil Rights Act on the black--white earnings gap is considered.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1075-1089
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.818002
File-URL: http://hdl.handle.net/10.1080/01621459.2013.818002
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1075-1089
Template-Type: ReDIF-Article 1.0
Author-Name: Yanyuan Ma
Author-X-Name-First: Yanyuan
Author-X-Name-Last: Ma
Author-Name: Mijeong Kim
Author-X-Name-First: Mijeong
Author-X-Name-Last: Kim
Author-Name: Marc G. Genton
Author-X-Name-First: Marc G.
Author-X-Name-Last: Genton
Title: Semiparametric Efficient and Robust Estimation of an Unknown Symmetric Population Under Arbitrary Sample Selection Bias
Abstract:
We propose semiparametric methods to estimate the center and
shape of a symmetric population when a representative sample of the
population is unavailable due to selection bias. We allow an arbitrary
sample selection mechanism determined by the data collection procedure,
and we do not impose any parametric form on the population distribution.
Under this general framework, we construct a family of consistent
estimators of the center that is robust to population model
misspecification, and we identify the efficient member that reaches the
minimum possible estimation variance. The asymptotic properties and finite
sample performance of the estimation and inference procedures are
illustrated through theoretical analysis and simulations. A data example
is also provided to illustrate the usefulness of the methods in practice.
Journal: Journal of the American Statistical Association
Pages: 1090-1104
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.816184
File-URL: http://hdl.handle.net/10.1080/01621459.2013.816184
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1090-1104
Template-Type: ReDIF-Article 1.0
Author-Name: Davy Paindaveine
Author-X-Name-First: Davy
Author-X-Name-Last: Paindaveine
Author-Name: Germain Van bever
Author-X-Name-First: Germain
Author-X-Name-Last: Van bever
Title: From Depth to Local Depth: A Focus on Centrality
Abstract:
Aiming at analyzing multimodal or nonconvexly supported
distributions through data depth, we introduce a local extension of depth.
Our construction is obtained by conditioning the distribution to
appropriate depth-based neighborhoods and has the advantages, among
others, of maintaining affine-invariance and applying to all depths in a
generic way. Most importantly, unlike their competitors, which (for
extreme localization) rather measure probability mass, the resulting
local depths focus on centrality and remain of a genuine
depth nature at any locality level. We derive their main properties,
establish consistency of their sample versions, and study their behavior
under extreme localization. We present two applications of the proposed
local depth (for classification and for symmetry testing), and we extend
our construction to the regression depth context. Throughout, we
illustrate the results on several datasets, both artificial and real,
univariate and multivariate. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 1105-1119
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.813390
File-URL: http://hdl.handle.net/10.1080/01621459.2013.813390
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1105-1119
Template-Type: ReDIF-Article 1.0
Author-Name: Fabrizia Mealli
Author-X-Name-First: Fabrizia
Author-X-Name-Last: Mealli
Author-Name: Barbara Pacini
Author-X-Name-First: Barbara
Author-X-Name-Last: Pacini
Title: Using Secondary Outcomes to Sharpen Inference in Randomized Experiments With Noncompliance
Abstract:
We develop new methods for analyzing randomized experiments
with noncompliance and, by extension, instrumental variable settings, when
the often controversial, but key, exclusion restriction assumption is
violated. We show how existing large-sample bounds on intention-to-treat
effects for the subpopulations of compliers, never-takers, and
always-takers can be tightened by exploiting the joint distribution of the
outcome of interest and a secondary outcome, for which the exclusion
restriction is satisfied. The derived bounds can be used to detect
violations of the exclusion restriction and the magnitude of these
violations in instrumental variables settings. It is shown that the
reduced width of the bounds depends on the strength of the association of
the auxiliary variable with the primary outcome and the compliance status.
We also show how the setup we consider offers new identifying assumptions
for intention-to-treat effects. The role of the auxiliary information is
shown in two examples of a real social job training experiment and a
simulated medical randomized encouragement study. We also discuss issues
of inference in finite samples and show how to conduct Bayesian analysis
in our partial and point identified settings. Supplementary materials for
this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1120-1131
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.802238
File-URL: http://hdl.handle.net/10.1080/01621459.2013.802238
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1120-1131
Template-Type: ReDIF-Article 1.0
Author-Name: Ryan Martin
Author-X-Name-First: Ryan
Author-X-Name-Last: Martin
Author-Name: Chuanhai Liu
Author-X-Name-First: Chuanhai
Author-X-Name-Last: Liu
Title: Correction
Abstract:
This is to provide corrections to Theorems 1 and 3 in Martin
and Liu (2013). The latter correction also casts further light on the role
of nested predictive random sets.
Journal: Journal of the American Statistical Association
Pages: 1138-1139
Issue: 503
Volume: 108
Year: 2013
Month: 9
X-DOI: 10.1080/01621459.2013.796885
File-URL: http://hdl.handle.net/10.1080/01621459.2013.796885
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:503:p:1138-1139
Template-Type: ReDIF-Article 1.0
Author-Name: Marie Davidian
Author-X-Name-First: Marie
Author-X-Name-Last: Davidian
Title: The International Year of Statistics: A Celebration and A Call to Action
Journal: Journal of the American Statistical Association
Pages: 1141-1146
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.844019
File-URL: http://hdl.handle.net/10.1080/01621459.2013.844019
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1141-1146
Template-Type: ReDIF-Article 1.0
Author-Name: Blakeley B. McShane
Author-X-Name-First: Blakeley B.
Author-X-Name-Last: McShane
Author-Name: Shane T. Jensen
Author-X-Name-First: Shane T.
Author-X-Name-Last: Jensen
Author-Name: Allan I. Pack
Author-X-Name-First: Allan I.
Author-X-Name-Last: Pack
Author-Name: Abraham J. Wyner
Author-X-Name-First: Abraham J.
Author-X-Name-Last: Wyner
Title: Statistical Learning With Time Series Dependence: An Application to Scoring Sleep in Mice
Abstract:
We develop methodology that combines statistical learning
methods with generalized Markov models, thereby enhancing the former to
account for time series dependence. Our methodology can accommodate very
general and very long-term time dependence structures in an easily
estimable and computationally tractable fashion. We apply our methodology
to the scoring of sleep behavior in mice. As methods currently used to
score sleep in mice are expensive, invasive, and labor intensive, there is
considerable interest in developing high-throughput automated systems
which would allow many mice to be scored cheaply and quickly. Previous
efforts at automation have been able to differentiate sleep from
wakefulness, but they are unable to differentiate the rare and important
state of rapid eye movement (REM) sleep from non-REM sleep. Key
difficulties in detecting REM are that (i) REM is much rarer than non-REM
and wakefulness, (ii) REM looks similar to non-REM in terms of the
observed covariates, (iii) the data are noisy, and (iv) the data contain
strong time dependence structures crucial for differentiating REM from
non-REM. Our new approach (i) shows improved differentiation of REM from
non-REM sleep and (ii) accurately estimates aggregate quantities of sleep
in our application to video-based sleep scoring of mice. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1147-1162
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.779838
File-URL: http://hdl.handle.net/10.1080/01621459.2013.779838
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1147-1162
Template-Type: ReDIF-Article 1.0
Author-Name: Kerby Shedden
Author-X-Name-First: Kerby
Author-X-Name-Last: Shedden
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1162-1163
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.836970
File-URL: http://hdl.handle.net/10.1080/01621459.2013.836970
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1162-1163
Template-Type: ReDIF-Article 1.0
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: Yuanjia Wang
Author-X-Name-First: Yuanjia
Author-X-Name-Last: Wang
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1164-1164
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.836971
File-URL: http://hdl.handle.net/10.1080/01621459.2013.836971
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1164-1164
Template-Type: ReDIF-Article 1.0
Author-Name: Blakeley B. McShane
Author-X-Name-First: Blakeley B.
Author-X-Name-Last: McShane
Author-Name: Shane T. Jensen
Author-X-Name-First: Shane T.
Author-X-Name-Last: Jensen
Author-Name: Allan I. Pack
Author-X-Name-First: Allan I.
Author-X-Name-Last: Pack
Author-Name: Abraham J. Wyner
Author-X-Name-First: Abraham J.
Author-X-Name-Last: Wyner
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 1165-1172
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.844021
File-URL: http://hdl.handle.net/10.1080/01621459.2013.844021
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1165-1172
Template-Type: ReDIF-Article 1.0
Author-Name: Tao Liu
Author-X-Name-First: Tao
Author-X-Name-Last: Liu
Author-Name: Joseph W. Hogan
Author-X-Name-First: Joseph W.
Author-X-Name-Last: Hogan
Author-Name: Lisa Wang
Author-X-Name-First: Lisa
Author-X-Name-Last: Wang
Author-Name: Shangxuan Zhang
Author-X-Name-First: Shangxuan
Author-X-Name-Last: Zhang
Author-Name: Rami Kantor
Author-X-Name-First: Rami
Author-X-Name-Last: Kantor
Title: Optimal Allocation of Gold Standard Testing Under Constrained Availability: Application to Assessment of HIV Treatment Failure
Abstract:
The World Health Organization (WHO) guidelines for monitoring
the effectiveness of human immunodeficiency virus (HIV) treatment in
resource-limited settings are mostly based on clinical and immunological
markers (e.g., CD4 cell counts). Recent research indicates that the
guidelines are inadequate and can result in high error rates. Viral load
(VL) is considered the "gold standard," yet its widespread use is limited
by cost and infrastructure. In this article, we propose a diagnostic
algorithm that uses information from routinely collected clinical and
immunological markers to guide a selective use of VL testing for
diagnosing HIV treatment failure, under the assumption that VL testing is
available only at a certain portion of patient visits. Our algorithm
identifies the patient subpopulation, such that the use of limited VL
testing on them minimizes a predefined risk (e.g., misdiagnosis error
rate). Diagnostic properties of our proposed algorithm are assessed by
simulations. For illustration, data from the Miriam Hospital Immunology
Clinic (Providence, RI) are analyzed.
Journal: Journal of the American Statistical Association
Pages: 1173-1188
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.810149
File-URL: http://hdl.handle.net/10.1080/01621459.2013.810149
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1173-1188
Template-Type: ReDIF-Article 1.0
Author-Name: Takahiro Hoshino
Author-X-Name-First: Takahiro
Author-X-Name-Last: Hoshino
Title: Semiparametric Bayesian Estimation for Marginal Parametric Potential Outcome Modeling: Application to Causal Inference
Abstract:
We propose a new semiparametric Bayesian model for causal
inference in which assignment to treatment depends on potential outcomes.
The model uses the probit stick-breaking process mixture proposed by Chung
and Dunson (2009), a variant of the Dirichlet process mixture modeling. In
contrast to previous Bayesian models, the proposed model directly
estimates the parameters of the marginal parametric model of potential
outcomes, while it relaxes the strong ignorability assumption, and
requires no parametric model assumption for the assignment model and
conditional distribution of the covariate vector. The proposed estimation
method is more robust than maximum likelihood estimation, in that it does
not require knowledge of the full joint distribution of potential
outcomes, covariates, and assignments. In addition, the method is more
efficient than fully nonparametric Bayes methods. We apply this model to
infer the differential effects of cognitive and noncognitive skills on the
wages of production and nonproduction workers using panel data from the
National Longitudinal Survey of Youth in 1979. The study also presents the
causal effect of online word-of-mouth on Web site browsing behavior.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1189-1204
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.835656
File-URL: http://hdl.handle.net/10.1080/01621459.2013.835656
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1189-1204
Template-Type: ReDIF-Article 1.0
Author-Name: Malka Gorfine
Author-X-Name-First: Malka
Author-X-Name-Last: Gorfine
Author-Name: Li Hsu
Author-X-Name-First: Li
Author-X-Name-Last: Hsu
Author-Name: Giovanni Parmigiani
Author-X-Name-First: Giovanni
Author-X-Name-Last: Parmigiani
Title: Frailty Models for Familial Risk With Application to Breast Cancer
Abstract:
In evaluating familial risk for disease we have two main
statistical tasks: assessing the probability of carrying an inherited
genetic mutation conferring higher risk, and predicting the absolute risk
of developing diseases over time for those individuals whose mutation
status is known. Despite substantial progress, much remains unknown about
the role of genetic and environmental risk factors, about the sources of
variation in risk among families that carry high-risk mutations, and about
the sources of familial aggregation beyond major Mendelian effects. These
sources of heterogeneity contribute substantial variation in risk across
families. In this article we present simple and efficient methods for
accounting for this variation in familial risk assessment. Our methods are
based on frailty models. We implemented them in the context of
generalizing Mendelian models of cancer risk, and compared our approaches
to others that do not consider heterogeneity across families. Our
extensive simulation study demonstrates that when predicting the risk of
developing a disease over time conditional on carrier status, accounting
for heterogeneity results in a substantial improvement in the area under
the curve of the receiver operating characteristic. On the other hand, the
improvement for carriership probability estimation is more limited. We
illustrate the utility of the proposed approach through the analysis of
BRCA1 and BRCA2 mutation carriers in the Washington Ashkenazi Kin-Cohort
Study of Breast Cancer. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 1205-1215
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.818001
File-URL: http://hdl.handle.net/10.1080/01621459.2013.818001
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1205-1215
Template-Type: ReDIF-Article 1.0
Author-Name: Huaihou Chen
Author-X-Name-First: Huaihou
Author-X-Name-Last: Chen
Author-Name: Yuanjia Wang
Author-X-Name-First: Yuanjia
Author-X-Name-Last: Wang
Author-Name: Myunghee Cho Paik
Author-X-Name-First: Myunghee Cho
Author-X-Name-Last: Paik
Author-Name: H. Alex Choi
Author-X-Name-First: H. Alex
Author-X-Name-Last: Choi
Title: A Marginal Approach to Reduced-Rank Penalized Spline Smoothing With Application to Multilevel Functional Data
Abstract:
Multilevel functional data are collected in many biomedical
studies. For example, in a study of the effect of Nimodipine on patients
with subarachnoid hemorrhage (SAH), patients underwent multiple 4-hr
treatment cycles. Within each treatment cycle, subjects' vital signs were
reported every 10 min. These data have a natural multilevel structure with
treatment cycles nested within subjects and measurements nested within
cycles. Most literature on nonparametric analysis of such multilevel
functional data focuses on conditional approaches using functional mixed
effects models. However, parameters obtained from the conditional models
do not have direct interpretations as population average effects. When
population effects are of interest, we may employ marginal regression
models. In this work, we propose marginal approaches to fit multilevel
functional data through penalized spline generalized estimating equation
(penalized spline GEE). The procedure is effective for modeling multilevel
correlated generalized outcomes as well as continuous outcomes without
suffering from numerical difficulties. We provide a variance estimator
robust to misspecification of correlation structure. We investigate the
large sample properties of the penalized spline GEE estimator with
multilevel continuous data and show that the asymptotics falls into two
categories. In the small knots scenario, the estimated mean function is
asymptotically efficient when the true correlation function is used and
the asymptotic bias does not depend on the working correlation matrix. In
the large knots scenario, both the asymptotic bias and variance depend on
the working correlation. We propose a new method to select the smoothing
parameter for penalized spline GEE based on an estimate of the asymptotic
mean squared error (MSE). We conduct extensive simulation studies to
examine properties of the proposed estimator under different correlation
structures and sensitivity of the variance estimation to the choice of
smoothing parameter. Finally, we apply the methods to the SAH study to
evaluate a recent debate on discontinuing the use of Nimodipine in the
clinical community. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 1216-1229
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.826134
File-URL: http://hdl.handle.net/10.1080/01621459.2013.826134
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1216-1229
Template-Type: ReDIF-Article 1.0
Author-Name: Shane T. Jensen
Author-X-Name-First: Shane T.
Author-X-Name-Last: Jensen
Author-Name: Jared Park
Author-X-Name-First: Jared
Author-X-Name-Last: Park
Author-Name: Alexander F. Braunstein
Author-X-Name-First: Alexander F.
Author-X-Name-Last: Braunstein
Author-Name: Jon McAuliffe
Author-X-Name-First: Jon
Author-X-Name-Last: McAuliffe
Title: Bayesian Hierarchical Modeling of the HIV Evolutionary Response to Therapy
Abstract:
A major challenge for the treatment of human immunodeficiency
virus (HIV) infection is the development of therapy-resistant strains. We
present a statistical model that quantifies the evolution of HIV
populations when exposed to particular therapies. A hierarchical Bayesian
approach is used to estimate differences in rates of nucleotide changes
between treatment- and control-group sequences. Each group's rates are
allowed to vary spatially along the HIV genome. We employ a coalescent
structure to address the sequence diversity within the treatment and
control HIV populations. We evaluate the model in simulations and estimate
HIV evolution in two different applications: a conventional drug therapy
and an antisense gene therapy. In both studies, we detect evidence of
evolutionary escape response in the HIV population. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1230-1242
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.830449
File-URL: http://hdl.handle.net/10.1080/01621459.2013.830449
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1230-1242
Template-Type: ReDIF-Article 1.0
Author-Name: Matias D. Cattaneo
Author-X-Name-First: Matias D.
Author-X-Name-Last: Cattaneo
Author-Name: Richard K. Crump
Author-X-Name-First: Richard K.
Author-X-Name-Last: Crump
Author-Name: Michael Jansson
Author-X-Name-First: Michael
Author-X-Name-Last: Jansson
Title: Generalized Jackknife Estimators of Weighted Average Derivatives
Abstract:
With the aim of improving the quality of asymptotic
distributional approximations for nonlinear functionals of nonparametric
estimators, this article revisits the large-sample properties of an
important member of that class, namely a kernel-based weighted average
derivative estimator. Asymptotic linearity of the estimator is established
under weak conditions. Indeed, we show that the bandwidth conditions
employed are necessary in some cases. A bias-corrected version of the
estimator is proposed and shown to be asymptotically linear under yet
weaker bandwidth conditions. Implementation details of the estimators
are discussed, including bandwidth selection procedures. Consistency of an
analog estimator of the asymptotic variance is also established. Numerical
results from a simulation study and an empirical illustration are
reported. To establish the results, a novel result on uniform convergence
rates for kernel estimators is obtained. The online supplemental material
to this article includes details on the theoretical proofs and other
analytic derivations, and further results from the simulation study.
Journal: Journal of the American Statistical Association
Pages: 1243-1256
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2012.745810
File-URL: http://hdl.handle.net/10.1080/01621459.2012.745810
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1243-1256
Template-Type: ReDIF-Article 1.0
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1257-1258
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.854172
File-URL: http://hdl.handle.net/10.1080/01621459.2013.854172
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1257-1258
Template-Type: ReDIF-Article 1.0
Author-Name: Holger Dette
Author-X-Name-First: Holger
Author-X-Name-Last: Dette
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1258-1260
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.859516
File-URL: http://hdl.handle.net/10.1080/01621459.2013.859516
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1258-1260
Template-Type: ReDIF-Article 1.0
Author-Name: Enno Mammen
Author-X-Name-First: Enno
Author-X-Name-Last: Mammen
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1260-1262
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.829000
File-URL: http://hdl.handle.net/10.1080/01621459.2013.829000
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1260-1262
Template-Type: ReDIF-Article 1.0
Author-Name: Xiaohong Chen
Author-X-Name-First: Xiaohong
Author-X-Name-Last: Chen
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1262-1264
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.855352
File-URL: http://hdl.handle.net/10.1080/01621459.2013.855352
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1262-1264
Template-Type: ReDIF-Article 1.0
Author-Name: Matias D. Cattaneo
Author-X-Name-First: Matias D.
Author-X-Name-Last: Cattaneo
Author-Name: Richard K. Crump
Author-X-Name-First: Richard K.
Author-X-Name-Last: Crump
Author-Name: Michael Jansson
Author-X-Name-First: Michael
Author-X-Name-Last: Jansson
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 1265-1268
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.856717
File-URL: http://hdl.handle.net/10.1080/01621459.2013.856717
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1265-1268
Template-Type: ReDIF-Article 1.0
Author-Name: Aurore Delaigle
Author-X-Name-First: Aurore
Author-X-Name-Last: Delaigle
Author-Name: Peter Hall
Author-X-Name-First: Peter
Author-X-Name-Last: Hall
Title: Classification Using Censored Functional Data
Abstract:
We consider classification of functional data when the
training curves are not observed on the same interval. Different types of
classifier are suggested, one of which involves a new curve extension
procedure. Our approach enables us to exploit the information contained in
the endpoints of these intervals by incorporating it in an explicit but
flexible way. We study asymptotic properties of our classifiers, and show
that, in a variety of settings, they can even produce asymptotically
perfect classification. The performance of our techniques is illustrated
in applications to real and simulated data. Supplementary materials for
this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1269-1283
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.824893
File-URL: http://hdl.handle.net/10.1080/01621459.2013.824893
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1269-1283
Template-Type: ReDIF-Article 1.0
Author-Name: Yehua Li
Author-X-Name-First: Yehua
Author-X-Name-Last: Li
Author-Name: Naisyin Wang
Author-X-Name-First: Naisyin
Author-X-Name-Last: Wang
Author-Name: Raymond J. Carroll
Author-X-Name-First: Raymond J.
Author-X-Name-Last: Carroll
Title: Selecting the Number of Principal Components in Functional Data
Abstract:
Functional principal component analysis (FPCA) has become the
most widely used dimension reduction tool for functional data analysis. We
consider functional data measured at random, subject-specific time points,
contaminated with measurement error, allowing for both sparse and dense
functional data, and propose novel information criteria to select the
number of principal components in such data. We propose a Bayesian
information criterion based on marginal modeling that can consistently
select the number of principal components for both sparse and dense
functional data. For dense functional data, we also develop an Akaike
information criterion based on the expected Kullback--Leibler information
under a Gaussian assumption. In connecting with the time series
literature, we also consider a class of information criteria proposed for
factor analysis of multivariate time series and show that they are still
consistent for dense functional data, if a prescribed undersmoothing
scheme is undertaken in the FPCA algorithm. We perform intensive
simulation studies and show that the proposed information criteria vastly
outperform existing methods for this type of data. Surprisingly, our
empirical evidence shows that our information criteria proposed for dense
functional data also perform well for sparse functional data. An empirical
example using colon carcinogenesis data is also provided to illustrate the
results. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1284-1294
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.788980
File-URL: http://hdl.handle.net/10.1080/01621459.2013.788980
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1284-1294
Template-Type: ReDIF-Article 1.0
Author-Name: Jingfei Zhang
Author-X-Name-First: Jingfei
Author-X-Name-Last: Zhang
Author-Name: Yuguo Chen
Author-X-Name-First: Yuguo
Author-X-Name-Last: Chen
Title: Sampling for Conditional Inference on Network Data
Abstract:
Random graphs with given vertex degrees have been widely used
as a model for many real-world complex networks. However, both statistical
inference and analytic study of such networks present great challenges. In
this article, we propose a new sequential importance sampling method for
sampling networks with a given degree sequence. These samples can be used
to approximate closely the null distributions of a number of test
statistics involved in such networks and provide an accurate estimate of
the total number of networks with given vertex degrees. We study the
asymptotic behavior of the proposed algorithm and prove that the
importance weight remains bounded as the size of the graph grows. This
property guarantees that the proposed sampling algorithm can still work
efficiently even for large sparse graphs. We apply our method to a range
of examples to demonstrate its efficiency in real problems.
Journal: Journal of the American Statistical Association
Pages: 1295-1307
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2012.758587
File-URL: http://hdl.handle.net/10.1080/01621459.2012.758587
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1295-1307
Template-Type: ReDIF-Article 1.0
Author-Name: Lisha Chen
Author-X-Name-First: Lisha
Author-X-Name-Last: Chen
Author-Name: Winston Wei Dou
Author-X-Name-First: Winston Wei
Author-X-Name-Last: Dou
Author-Name: Zhihua Qiao
Author-X-Name-First: Zhihua
Author-X-Name-Last: Qiao
Title: Ensemble Subsampling for Imbalanced Multivariate Two-Sample Tests
Abstract:
Some existing nonparametric two-sample tests for equality of
multivariate distributions perform unsatisfactorily when the two sample
sizes are unbalanced. In particular, the power of these tests tends to
diminish with increasingly unbalanced sample sizes. In this article, we
propose a new testing procedure to solve this problem. The proposed test,
based on the nearest neighbor method by Schilling, employs a novel
ensemble subsampling scheme to remedy this issue. More specifically, the
test statistic is a weighted average of a collection of statistics, each
associated with a randomly selected subsample of the data. We derive the
asymptotic distribution of the test statistic under the null hypothesis
and show that the new test is consistent against all alternatives when the
ratio of the sample sizes either goes to a finite limit or tends to
infinity. Via simulated data examples we demonstrate that the new test has
increasing power with increasing sample size ratio when the size of the
smaller sample is fixed. The test is applied to a real-data example in the
field of corporate finance. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 1308-1323
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.800763
File-URL: http://hdl.handle.net/10.1080/01621459.2013.800763
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1308-1323
Template-Type: ReDIF-Article 1.0
Author-Name: Tsuyoshi Kunihama
Author-X-Name-First: Tsuyoshi
Author-X-Name-Last: Kunihama
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Bayesian Modeling of Temporal Dependence in Large Sparse Contingency Tables
Abstract:
It is of interest in many applications to study trends over
time in relationships among categorical variables, such as age group,
ethnicity, religious affiliation, political party, and preference for
particular policies. At each time point, a sample of individuals provides
responses to a set of questions, with different individuals sampled at
each time. In such settings, there tends to be an abundance of missing data
and the variables being measured may change over time. At each time point,
we obtain a large sparse contingency table, with the number of cells
often much larger than the number of individuals being surveyed. To borrow
information across time in modeling large sparse contingency tables, we
propose a Bayesian autoregressive tensor factorization approach. The
proposed model relies on a probabilistic Parafac factorization of the
joint pmf characterizing the categorical data distribution at each time
point, with autocorrelation included across times. We develop efficient
computational methods that rely on Markov chain Monte Carlo. The methods
are evaluated through simulation examples and applied to social survey
data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1324-1338
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.823866
File-URL: http://hdl.handle.net/10.1080/01621459.2013.823866
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1324-1338
Template-Type: ReDIF-Article 1.0
Author-Name: Nicholas G. Polson
Author-X-Name-First: Nicholas G.
Author-X-Name-Last: Polson
Author-Name: James G. Scott
Author-X-Name-First: James G.
Author-X-Name-Last: Scott
Author-Name: Jesse Windle
Author-X-Name-First: Jesse
Author-X-Name-Last: Windle
Title: Bayesian Inference for Logistic Models Using Pólya--Gamma Latent Variables
Abstract:
We propose a new data-augmentation strategy for fully
Bayesian inference in models with binomial likelihoods. The approach
appeals to a new class of Pólya--Gamma distributions, which are
constructed in detail. A variety of examples are presented to show the
versatility of the method, including logistic regression, negative
binomial regression, nonlinear mixed-effect models, and spatial models for
count data. In each case, our data-augmentation strategy leads to simple,
effective methods for posterior inference that (1) circumvent the need for
analytic approximations, numerical integration, or Metropolis--Hastings;
and (2) outperform other known data-augmentation strategies, both in ease
of use and in computational efficiency. All methods, including an
efficient sampler for the Pólya--Gamma distribution, are implemented in
the R package BayesLogit. Supplementary materials
for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1339-1349
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.829001
File-URL: http://hdl.handle.net/10.1080/01621459.2013.829001
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1339-1349
Template-Type: ReDIF-Article 1.0
Author-Name: Wentao Li
Author-X-Name-First: Wentao
Author-X-Name-Last: Li
Author-Name: Zhiqiang Tan
Author-X-Name-First: Zhiqiang
Author-X-Name-Last: Tan
Author-Name: Rong Chen
Author-X-Name-First: Rong
Author-X-Name-Last: Chen
Title: Two-Stage Importance Sampling With Mixture Proposals
Abstract:
For importance sampling (IS), multiple proposals can be
combined to address different aspects of a target distribution. There are
various methods for IS with multiple proposals, including Hesterberg's
stratified IS estimator, Owen and Zhou's regression estimator, and Tan's
maximum likelihood estimator. For the problem of efficiently allocating
samples to different proposals, it is natural to use a pilot sample to
select the mixture proportions before the actual sampling and estimation.
However, most existing discussions of such a two-stage procedure are
empirical. In this article, we establish a theoretical framework for
applying the two-stage procedure to various methods, including the
asymptotic properties and the choice of the pilot sample size. Our
simulation studies show that these two-stage estimators can outperform estimators
with naive choices of mixture proportions. Furthermore, while Owen and
Zhou's and Tan's estimators are designed for estimating normalizing
constants, we extend their usage and the two-stage procedure to estimating
expectations and show that the improvement is still preserved in this
extension.
Journal: Journal of the American Statistical Association
Pages: 1350-1365
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.831980
File-URL: http://hdl.handle.net/10.1080/01621459.2013.831980
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1350-1365
Template-Type: ReDIF-Article 1.0
Author-Name: Jian Zhang
Author-X-Name-First: Jian
Author-X-Name-Last: Zhang
Title: Epistatic Clustering: A Model-Based Approach for Identifying Links Between Clusters
Abstract:
Most clustering methods assume that the data can be
represented by mutually exclusive clusters, although this assumption may
not hold in practice. For example, in gene expression microarray
studies, investigators have often found that a gene can play multiple
functions in a cell and may, therefore, belong to more than one cluster
simultaneously, and that gene clusters can be linked to each other in
certain pathways. This article examines the effect of the above assumption
on the likelihood of finding latent clusters using theoretical
calculations and simulation studies, for which the epistatic structures
were known in advance, and on real data analyses. To explore potential
links between clusters, we introduce an epistatic mixture model which
extends the Gaussian mixture by including epistatic terms. A generalized
expectation-maximization (EM) algorithm is developed to compute the
related maximum likelihood estimators. The Bayesian information criterion
is then used to determine the order of the proposed model. A bootstrap
test is proposed for testing whether the epistatic mixture model is a
significantly better fit to the data than a standard mixture model in
which each data point belongs to one cluster. The asymptotic properties of
the proposed estimators are also investigated when the number of analysis
units is large. The results demonstrate that the epistatic links between
clusters do have a serious effect on the accuracy of clustering and that
our epistatic approach can substantially reduce such an effect and improve
the fit.
Journal: Journal of the American Statistical Association
Pages: 1366-1384
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.835661
File-URL: http://hdl.handle.net/10.1080/01621459.2013.835661
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1366-1384
Template-Type: ReDIF-Article 1.0
Author-Name: Sanat K. Sarkar
Author-X-Name-First: Sanat K.
Author-X-Name-Last: Sarkar
Author-Name: Jingjing Chen
Author-X-Name-First: Jingjing
Author-X-Name-Last: Chen
Author-Name: Wenge Guo
Author-X-Name-First: Wenge
Author-X-Name-Last: Guo
Title: Multiple Testing in a Two-Stage Adaptive Design With Combination Tests Controlling FDR
Abstract:
Testing multiple null hypotheses in two stages to decide
which of these can be rejected or accepted at the first stage and which
should be followed up for further testing having had additional
observations is of importance in many scientific studies. We develop two
procedures, each with two different combination functions, Fisher's and
Simes', to combine p-values from two stages, given
prespecified boundaries on the first-stage p-values in
terms of the false discovery rate (FDR) and controlling the overall FDR at
a desired level. The FDR control is proved when the pairs of first- and
second-stage p-values are independent and those
corresponding to the null hypotheses are identically distributed as a pair
(p1, p2) satisfying the p-clud property. We
did simulations to show that (1) our two-stage procedures can have
significant power improvements over the first-stage Benjamini--Hochberg
(BH) procedure compared to the improvement offered by the ideal BH
procedure that one would have used had the second stage data been
available for all the hypotheses, and can continue to control the FDR
under some dependence situations, and (2) can offer considerable cost
savings compared to the ideal BH procedure. The procedures are illustrated
through a real gene expression dataset. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 1385-1401
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.835662
File-URL: http://hdl.handle.net/10.1080/01621459.2013.835662
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1385-1401
Template-Type: ReDIF-Article 1.0
Author-Name: Luo Lu
Author-X-Name-First: Luo
Author-X-Name-Last: Lu
Author-Name: Hui Jiang
Author-X-Name-First: Hui
Author-X-Name-Last: Jiang
Author-Name: Wing H. Wong
Author-X-Name-First: Wing H.
Author-X-Name-Last: Wong
Title: Multivariate Density Estimation by Bayesian Sequential Partitioning
Abstract:
Consider a class of densities that are piecewise constant
functions over partitions of the sample space defined by sequential
coordinate partitioning. We introduce a prior distribution for a density
in this function class and derive in closed form the marginal posterior
distribution of the corresponding partition. A computationally efficient
method, based on sequential importance sampling, is presented for the
inference of the partition from this posterior distribution. Compared to
traditional approaches such as the kernel method or the histogram, the
Bayesian sequential partitioning (BSP) method proposed here is capable of
providing much more accurate estimates when the sample space is of
moderate to high dimension. We illustrate this by simulated as well as
real data examples. The examples also demonstrate how BSP can be used to
design new classification methods competitive with the state of the art.
Journal: Journal of the American Statistical Association
Pages: 1402-1410
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.813389
File-URL: http://hdl.handle.net/10.1080/01621459.2013.813389
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1402-1410
Template-Type: ReDIF-Article 1.0
Author-Name: Min Yang
Author-X-Name-First: Min
Author-X-Name-Last: Yang
Author-Name: Stefanie Biedermann
Author-X-Name-First: Stefanie
Author-X-Name-Last: Biedermann
Author-Name: Elina Tang
Author-X-Name-First: Elina
Author-X-Name-Last: Tang
Title: On Optimal Designs for Nonlinear Models: A General and Efficient Algorithm
Abstract:
Finding optimal designs for nonlinear models is challenging
in general. Although some recent results allow us to focus on a simple
subclass of designs for most problems, deriving a specific optimal design
still mainly depends on numerical approaches. There is a need for a general
and efficient algorithm that is more broadly applicable than the current
state-of-the-art methods. We present a new algorithm that can be used to
find optimal designs with respect to a broad class of optimality criteria,
when the model parameters or functions thereof are of interest, and for
both locally optimal and multistage design strategies. We prove
convergence to the optimal design, and show in various examples that the
new algorithm outperforms the current state-of-the-art algorithms.
Journal: Journal of the American Statistical Association
Pages: 1411-1420
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.806268
File-URL: http://hdl.handle.net/10.1080/01621459.2013.806268
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1411-1420
Template-Type: ReDIF-Article 1.0
Author-Name: Ming-yen Cheng
Author-X-Name-First: Ming-yen
Author-X-Name-Last: Cheng
Author-Name: Hau-tieng Wu
Author-X-Name-First: Hau-tieng
Author-X-Name-Last: Wu
Title: Local Linear Regression on Manifolds and Its Geometric Interpretation
Abstract:
High-dimensional data analysis has been an active area, and
the main focus areas have been variable selection and dimension reduction.
In practice, it occurs often that the variables are located on an unknown,
lower-dimensional nonlinear manifold. Under this manifold assumption, one
purpose of this article is regression and gradient estimation on the
manifold, and another is developing a new tool for manifold learning. As
regards the first aim, we suggest directly reducing the dimensionality to
the intrinsic dimension d of the manifold, and performing
the popular local linear regression (LLR) on a tangent plane estimate. An
immediate consequence is a dramatic reduction in the computational time
when the ambient space dimension p >> d.
We provide rigorous theoretical justification of the convergence of the
proposed regression and gradient estimators by carefully analyzing the
curvature, boundary, and nonuniform sampling effects. We propose a
bandwidth selector that can handle heteroscedastic errors. With reference
to the second aim, we analyze carefully the asymptotic behavior of our
regression estimator both in the interior and near the boundary of the
manifold, and make explicit its relationship with manifold learning, in
particular estimating the Laplace--Beltrami operator of the manifold. In
this context, we also make clear that it is important to use a smaller
bandwidth in the tangent plane estimation than in the LLR. A simulation
study and applications to the Isomap face data and a clinically computed
tomography scan dataset are used to illustrate the computational speed and
estimation accuracy of our methods. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 1421-1434
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.827984
File-URL: http://hdl.handle.net/10.1080/01621459.2013.827984
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1421-1434
Template-Type: ReDIF-Article 1.0
Author-Name: Shelby J. Haberman
Author-X-Name-First: Shelby J.
Author-X-Name-Last: Haberman
Author-Name: Sandip Sinharay
Author-X-Name-First: Sandip
Author-X-Name-Last: Sinharay
Title: Generalized Residuals for General Models for Contingency Tables With Application to Item Response Theory
Abstract:
Generalized residuals are a tool employed in the analysis of
contingency tables to examine possible sources of model error. They have
typically been applied to log-linear models and to latent-class models. A
general approach to generalized residuals is developed for a very general
class of models for contingency tables. To illustrate their use,
generalized residuals are applied to models based on item response theory
(IRT). Such models are commonly applied to the analysis of standardized
achievement or aptitude tests. To obtain a realistic perspective on
application of generalized residuals, actual testing data are employed.
Journal: Journal of the American Statistical Association
Pages: 1435-1444
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.835660
File-URL: http://hdl.handle.net/10.1080/01621459.2013.835660
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1435-1444
Template-Type: ReDIF-Article 1.0
Author-Name: Bin Zhu
Author-X-Name-First: Bin
Author-X-Name-Last: Zhu
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Locally Adaptive Bayes Nonparametric Regression via Nested Gaussian Processes
Abstract:
We propose a nested Gaussian process (nGP) as a locally
adaptive prior for Bayesian nonparametric regression. Specified through a
set of stochastic differential equations (SDEs), the nGP imposes a
Gaussian process prior for the function's mth-order
derivative. The nesting comes in through including a local instantaneous
mean function, which is drawn from another Gaussian process inducing
adaptivity to locally varying smoothness. We discuss the support of the
nGP prior in terms of the closure of a reproducing kernel Hilbert space,
and consider theoretical properties of the posterior. The posterior mean
under the nGP prior is shown to be equivalent to the minimizer of a nested
penalized sum-of-squares involving penalties for both the global and local
roughness of the function. Using highly efficient Markov chain Monte Carlo
for posterior inference, the proposed method performs well in simulation
studies compared to several alternatives, and is scalable to massive data,
illustrated through a proteomics application.
Journal: Journal of the American Statistical Association
Pages: 1445-1456
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.838568
File-URL: http://hdl.handle.net/10.1080/01621459.2013.838568
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1445-1456
Template-Type: ReDIF-Article 1.0
Author-Name: Dylan S. Small
Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small
Author-Name: Jing Cheng
Author-X-Name-First: Jing
Author-X-Name-Last: Cheng
Author-Name: M. Elizabeth Halloran
Author-X-Name-First: M. Elizabeth
Author-X-Name-Last: Halloran
Author-Name: Paul R. Rosenbaum
Author-X-Name-First: Paul R.
Author-X-Name-Last: Rosenbaum
Title: Case Definition and Design Sensitivity
Abstract:
In a case-referent study, cases of disease are compared to
noncases with respect to their antecedent exposure to a treatment in an
effort to determine whether exposure causes some cases of the disease.
Because exposure is not randomly assigned in the population, as it would
be if the population were a vast randomized trial, exposed and unexposed
subjects may differ prior to exposure with respect to covariates that may
or may not have been measured. After controlling for measured preexposure
differences, for instance by matching, a sensitivity analysis asks about
the magnitude of bias from unmeasured covariates that would need to be
present to alter the conclusions of a study that presumed matching for
observed covariates removes all bias. The definition of a case of disease
affects sensitivity to unmeasured bias. We explore this issue using: (i)
an asymptotic tool, the design sensitivity, (ii) a simulation for finite
samples, and (iii) an example. Under favorable circumstances, a narrower
case definition can yield an increase in the design sensitivity, and hence
an increase in the power of a sensitivity analysis. Also, we discuss an
adaptive method that seeks to discover the best case definition from the
data at hand while controlling for multiple testing. An implementation in
R is available as SensitivityCaseControl.
Journal: Journal of the American Statistical Association
Pages: 1457-1468
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.820660
File-URL: http://hdl.handle.net/10.1080/01621459.2013.820660
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1457-1468
Template-Type: ReDIF-Article 1.0
Author-Name: Ying Hung
Author-X-Name-First: Ying
Author-X-Name-Last: Hung
Author-Name: Yijie Wang
Author-X-Name-First: Yijie
Author-X-Name-Last: Wang
Author-Name: Veronika Zarnitsyna
Author-X-Name-First: Veronika
Author-X-Name-Last: Zarnitsyna
Author-Name: Cheng Zhu
Author-X-Name-First: Cheng
Author-X-Name-Last: Zhu
Author-Name: C. F. Jeff Wu
Author-X-Name-First: C. F. Jeff
Author-X-Name-Last: Wu
Title: Hidden Markov Models With Applications in Cell Adhesion Experiments
Abstract:
Estimation of the number of hidden states is challenging in
hidden Markov models. Motivated by the analysis of a specific type of cell
adhesion experiments, a new framework based on a hidden Markov model and
double penalized order selection is proposed. The order selection
procedure is shown to be consistent in estimating the number of states. A
modified expectation--maximization algorithm is introduced to efficiently
estimate parameters in the model. Simulations show that the proposed
framework outperforms existing methods. Applications of the proposed
methodology to real data demonstrate the accuracy of estimating
receptor--ligand bond lifetimes and waiting times, which are essential in
kinetic parameter estimation. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 1469-1479
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.836973
File-URL: http://hdl.handle.net/10.1080/01621459.2013.836973
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1469-1479
Template-Type: ReDIF-Article 1.0
Author-Name: Marina Bogomolov
Author-X-Name-First: Marina
Author-X-Name-Last: Bogomolov
Author-Name: Ruth Heller
Author-X-Name-First: Ruth
Author-X-Name-Last: Heller
Title: Discovering Findings That Replicate From a Primary Study of High Dimension to a Follow-Up Study
Abstract:
We consider the problem of identifying whether findings
replicate from one study of high dimension to another, when the primary
study guides the selection of hypotheses to be examined in the follow-up
study as well as when there is no division of roles into the primary and
the follow-up study. We show that existing meta-analysis methods are not
appropriate for this problem, and suggest novel methods instead. We prove
that our multiple testing procedures control for appropriate error rates.
The suggested family-wise error rate controlling procedure is valid for
arbitrary dependence among the test statistics within each study. A more
powerful procedure is suggested for false discovery rate (FDR) control. We
prove that this procedure controls the FDR if the test statistics are
independent within the primary study, and independent or have positive
dependence in the follow-up study. For arbitrary dependence within the
primary study, and either arbitrary dependence or positive dependence in
the follow-up study, simple conservative modifications of the procedure
control the FDR. We demonstrate the usefulness of these procedures via
simulations and real data examples. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 1480-1492
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.829002
File-URL: http://hdl.handle.net/10.1080/01621459.2013.829002
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1480-1492
Template-Type: ReDIF-Article 1.0
Author-Name: Li Ma
Author-X-Name-First: Li
Author-X-Name-Last: Ma
Title: Adaptive Testing of Conditional Association Through Recursive Mixture Modeling
Abstract:
In many case-control studies, a central goal is to test for
association or dependence between the predictors and the response.
Relevant covariates must be conditioned on to avoid false positives and
loss in power. Conditioning on covariates is easy in parametric frameworks
such as logistic regression, by incorporating the covariates into the
model as additional variables. In contrast, nonparametric methods such as
the Cochran-Mantel-Haenszel test accomplish conditioning by dividing the
data into strata, one for each possible covariate value. In modern
applications, this often gives rise to numerous strata, most of which are
sparse due to the multidimensionality of the covariate and/or predictor
space, while in reality, the covariate space often consists of just a
small number of subsets with differential response-predictor dependence.
We introduce a Bayesian approach to inferring from the data such an
effective stratification and testing for association accordingly. The core
of our framework is a recursive mixture model on the retrospective
distribution of the predictors, whose mixing distribution is a prior on
partitions of the covariate space. Inference under the model can
proceed efficiently in closed form through a sequence of recursions,
striking a balance between model flexibility and computational
tractability. Simulation studies show that our method substantially
outperforms classical tests under various scenarios. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1493-1505
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.838899
File-URL: http://hdl.handle.net/10.1080/01621459.2013.838899
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1493-1505
Template-Type: ReDIF-Article 1.0
Author-Name: Young Min Kim
Author-X-Name-First: Young Min
Author-X-Name-Last: Kim
Author-Name: Soumendra N. Lahiri
Author-X-Name-First: Soumendra N.
Author-X-Name-Last: Lahiri
Author-Name: Daniel J. Nordman
Author-X-Name-First: Daniel J.
Author-X-Name-Last: Nordman
Title: A Progressive Block Empirical Likelihood Method for Time Series
Abstract:
This article develops a new blockwise empirical likelihood
(BEL) method for stationary, weakly dependent time processes, called the
progressive block empirical likelihood (PBEL). In contrast to the standard
version of BEL, which uses data blocks of constant length for a given
sample size and whose performance can depend crucially on the block length
selection, this new approach involves a data-blocking scheme where blocks
increase in length by an arithmetic progression. Consequently, no block
length selections are required for the PBEL method, which implies a
certain type of robustness for this version of BEL. For inference of
smooth functions of the process mean, theoretical results establish the
chi-squared limit of the log-likelihood ratio based on PBEL, which can be
used to calibrate confidence regions. Using the same progressive block
scheme, distributional extensions are also provided for other
nonparametric likelihoods with time series in the family of Cressie--Read
discrepancies. Simulation evidence indicates that the PBEL method can
perform comparably to the standard BEL in coverage accuracy (when the
latter uses a "good" block choice) and can exhibit more stability, without
the need to select a usual block length. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 1506-1516
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.847374
File-URL: http://hdl.handle.net/10.1080/01621459.2013.847374
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1506-1516
Template-Type: ReDIF-Article 1.0
Author-Name: Yuanshan Wu
Author-X-Name-First: Yuanshan
Author-X-Name-Last: Wu
Author-Name: Guosheng Yin
Author-X-Name-First: Guosheng
Author-X-Name-Last: Yin
Title: Cure Rate Quantile Regression for Censored Data With a Survival Fraction
Abstract:
Censored quantile regression offers a valuable complement to
the traditional Cox proportional hazards model for survival analysis.
Survival times tend to be right-skewed, particularly when there exists a
substantial fraction of long-term survivors who are either cured or immune
to the event of interest. For survival data with a cure possibility, we
propose cure rate quantile regression under the common censoring scheme
that survival times and censoring times are conditionally independent
given the covariates. In a mixture formulation, we apply censored quantile
regression to model the survival times of susceptible subjects and
logistic regression to model the indicators of whether patients are
susceptible. We develop two estimation methods using martingale-based
equations: One approach fully uses all regression quantiles by iterating
estimation between the cure rate and quantile regression parameters; and
the other separates the two via a nonparametric kernel smoothing
estimator. We establish the uniform consistency and weak convergence
properties for the estimators obtained from both methods. The proposed
model is evaluated through extensive simulation studies and illustrated
with a bone marrow transplantation data example. Technical proofs of key
theorems are given in Appendices A, B, and C, while those of lemmas and
additional simulation studies on model misspecification and comparisons
with other models are provided in the online Supplementary Materials A and
B.
Journal: Journal of the American Statistical Association
Pages: 1517-1531
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.837368
File-URL: http://hdl.handle.net/10.1080/01621459.2013.837368
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1517-1531
Template-Type: ReDIF-Article 1.0
Author-Name: Tianxi Cai
Author-X-Name-First: Tianxi
Author-X-Name-Last: Cai
Author-Name: Yingye Zheng
Author-X-Name-First: Yingye
Author-X-Name-Last: Zheng
Title: Resampling Procedures for Making Inference Under Nested Case--Control Studies
Abstract:
The nested case--control (NCC) design has been widely adopted
as a cost-effective solution in many large cohort studies for risk
assessment with expensive markers, such as the emerging biologic and
genetic markers. To analyze data from NCC studies, conditional logistic
regression and maximum likelihood-based methods have been proposed.
However, most of these methods either cannot be easily extended beyond the
Cox model or require additional modeling assumptions. More generally
applicable approaches based on inverse probability weighting (IPW) have
been proposed as useful alternatives. However, due to the complex
correlation structure induced by repeated finite risk set sampling,
interval estimation for such IPW estimators remains challenging, especially
when the estimation involves nonsmooth objective functions or when making
simultaneous inferences about functions. Standard resampling procedures
such as the bootstrap cannot accommodate the correlation and thus are not
directly applicable. In this article, we propose a resampling procedure
that can provide valid estimates for the distribution of a broad class of
IPW estimators. Simulation results suggest that the proposed procedures
perform well in settings where an analytical variance estimator is
infeasible to derive or performs suboptimally. The new procedures are
illustrated with data from the Framingham Offspring Study to characterize
individual level cardiovascular risks over time based on the Framingham
risk score, C-reactive protein, and a genetic risk score. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1532-1544
Issue: 504
Volume: 108
Year: 2013
Month: 12
X-DOI: 10.1080/01621459.2013.856715
File-URL: http://hdl.handle.net/10.1080/01621459.2013.856715
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:108:y:2013:i:504:p:1532-1544
Template-Type: ReDIF-Article 1.0
Author-Name: Luo Xiao
Author-X-Name-First: Luo
Author-X-Name-Last: Xiao
Author-Name: Sally W. Thurston
Author-X-Name-First: Sally W.
Author-X-Name-Last: Thurston
Author-Name: David Ruppert
Author-X-Name-First: David
Author-X-Name-Last: Ruppert
Author-Name: Tanzy M. T. Love
Author-X-Name-First: Tanzy M. T.
Author-X-Name-Last: Love
Author-Name: Philip W. Davidson
Author-X-Name-First: Philip W.
Author-X-Name-Last: Davidson
Title: Bayesian Models for Multiple Outcomes in Domains With Application to the Seychelles Child Development Study
Abstract:
The Seychelles Child Development Study (SCDS) examines the effects of
prenatal exposure to methylmercury on the functioning of the central
nervous system. The SCDS data include 20 outcomes measured on 9-year-old
children that can be classified broadly in four outcome classes or
"domains": cognition, memory, motor, and social behavior. Previous
analyses and scientific theory suggest that these outcomes may belong to
more than one of these domains, rather than only a single domain as is
frequently assumed for modeling. We present a framework for examining the
effects of exposure and other covariates when the outcomes may each belong
to more than one domain and where we also want to learn about the
assignment of outcomes to domains. Each domain is defined by a sentinel
outcome, which is preassigned to that domain only. All other outcomes can
belong to multiple domains and are not preassigned. Our model allows
exposure and covariate effects to differ across domains and across
outcomes within domains, and includes random subject-specific effects that
model correlations between outcomes within and across domains. We take a
Bayesian MCMC approach. Results from the Seychelles study and from
extensive simulations show that our model can effectively determine sparse
domain assignment, and at the same time give increased power to detect
overall, domain-specific, and outcome-specific exposure and covariate
effects relative to separate models for each endpoint. When fit to the
Seychelles data, several outcomes were classified as partly belonging to
domains other than their originally assigned domains. In retrospect, the
new partial domain assignments are reasonable and, as we discuss, suggest
important scientific insights about the nature of the outcomes. Checks of
model misspecification were improved relative to a model that assumes each
outcome is in a single domain. Supplementary materials for this article
are available online.
Journal: Journal of the American Statistical Association
Pages: 1-10
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.830070
File-URL: http://hdl.handle.net/10.1080/01621459.2013.830070
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:1-10
Template-Type: ReDIF-Article 1.0
Author-Name: Hui Huang
Author-X-Name-First: Hui
Author-X-Name-Last: Huang
Author-Name: Xiaomei Ma
Author-X-Name-First: Xiaomei
Author-X-Name-Last: Ma
Author-Name: Rasmus Waagepetersen
Author-X-Name-First: Rasmus
Author-X-Name-Last: Waagepetersen
Author-Name: Theodore R. Holford
Author-X-Name-First: Theodore R.
Author-X-Name-Last: Holford
Author-Name: Rong Wang
Author-X-Name-First: Rong
Author-X-Name-Last: Wang
Author-Name: Harvey Risch
Author-X-Name-First: Harvey
Author-X-Name-Last: Risch
Author-Name: Lloyd Mueller
Author-X-Name-First: Lloyd
Author-X-Name-Last: Mueller
Author-Name: Yongtao Guan
Author-X-Name-First: Yongtao
Author-X-Name-Last: Guan
Title: A New Estimation Approach for Combining Epidemiological Data From Multiple Sources
Abstract:
We propose a novel two-step procedure to
combine epidemiological data obtained from diverse sources with the aim to
quantify risk factors affecting the probability that an individual
develops a certain disease, such as cancer. In the first step, we derive all
possible unbiased estimating functions based on a group of cases and a
group of controls each time. In the second step, we combine these
estimating functions efficiently to make full use of the information
contained in the data. Our approach is computationally simple and flexible. We
illustrate its efficacy through simulation and apply it to investigate
pancreatic cancer risks based on data obtained from the Connecticut Tumor
Registry, a population-based case--control study, and the Behavioral Risk
Factor Surveillance System, which is a state-based system of health
surveys. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 11-23
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.870904
File-URL: http://hdl.handle.net/10.1080/01621459.2013.870904
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:11-23
Template-Type: ReDIF-Article 1.0
Author-Name: Marco Carone
Author-X-Name-First: Marco
Author-X-Name-Last: Carone
Author-Name: Masoud Asgharian
Author-X-Name-First: Masoud
Author-X-Name-Last: Asgharian
Author-Name: Nicholas P. Jewell
Author-X-Name-First: Nicholas P.
Author-X-Name-Last: Jewell
Title: Estimating the Lifetime Risk of Dementia in the Canadian Elderly Population Using Cross-Sectional Cohort Survival Data
Abstract:
Dementia is one of the world's major
public health challenges. The lifetime risk of dementia is the proportion
of individuals who ever develop dementia during their lifetime. Despite
its importance to epidemiologists and policy-makers, this measure does not
seem to have been estimated in the Canadian population. Data from a birth
cohort study of dementia are not available. Instead, we must rely on data
from the Canadian Study of Health and Aging, a large cross-sectional study
of dementia with follow-up for survival. These data present challenges
because they include substantial loss to follow-up and are not
representatively drawn from the target population because of structural
sampling biases. A first bias is imparted by the cross-sectional sampling
scheme, while a second bias is a result of stratified sampling. Estimation
of the lifetime risk and related quantities in the presence of these
biases has not been previously addressed in the literature. We develop and
study nonparametric estimators of the lifetime risk, the remaining
lifetime risk, and cumulative risk at specific ages, accounting for these
complexities. In particular, we reveal the fact that estimation of the
lifetime risk is invariant to stratification by current age at sampling.
We present simulation results validating our methodology, and provide
novel facts about the epidemiology of dementia in Canada using data from
the Canadian Study of Health and Aging. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 24-35
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.859076
File-URL: http://hdl.handle.net/10.1080/01621459.2013.859076
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:24-35
Template-Type: ReDIF-Article 1.0
Author-Name: Patrick M. Joyce
Author-X-Name-First: Patrick M.
Author-X-Name-Last: Joyce
Author-Name: Donald Malec
Author-X-Name-First: Donald
Author-X-Name-Last: Malec
Author-Name: Roderick J. A. Little
Author-X-Name-First: Roderick J. A.
Author-X-Name-Last: Little
Author-Name: Aaron Gilary
Author-X-Name-First: Aaron
Author-X-Name-Last: Gilary
Author-Name: Alfredo Navarro
Author-X-Name-First: Alfredo
Author-X-Name-Last: Navarro
Author-Name: Mark E. Asiala
Author-X-Name-First: Mark E.
Author-X-Name-Last: Asiala
Title: Statistical Modeling Methodology for the Voting Rights Act Section 203 Language Assistance Determinations
Abstract:
Section 203 of the Voting Rights Act
includes provisions requiring the use of election materials in languages
other than English for states or political subdivisions, specifically,
when a minimum number of voting-age U.S. citizens of specified language
minority groups who are unable to speak English very well and have
obtained less than a fifth-grade education is met. Data on these
characteristics are provided by the 2010 Census and the American Community
Survey (ACS), a general purpose sample survey designed to produce a large
volume of estimates across the spectrum of the nation's geographic areas
and subgroups of the population. This article describes the small-area
model and the estimation methods that were developed and applied to create
the list of 2011 political subdivisions that were subject to the
provisions.
Journal: Journal of the American Statistical Association
Pages: 36-47
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.859077
File-URL: http://hdl.handle.net/10.1080/01621459.2013.859077
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:36-47
Template-Type: ReDIF-Article 1.0
Author-Name: Liqun Xi
Author-X-Name-First: Liqun
Author-X-Name-Last: Xi
Author-Name: Kristin Brogaard
Author-X-Name-First: Kristin
Author-X-Name-Last: Brogaard
Author-Name: Qingyang Zhang
Author-X-Name-First: Qingyang
Author-X-Name-Last: Zhang
Author-Name: Bruce Lindsay
Author-X-Name-First: Bruce
Author-X-Name-Last: Lindsay
Author-Name: Jonathan Widom
Author-X-Name-First: Jonathan
Author-X-Name-Last: Widom
Author-Name: Ji-Ping Wang
Author-X-Name-First: Ji-Ping
Author-X-Name-Last: Wang
Title: A Locally Convoluted Cluster Model for Nucleosome Positioning Signals in Chemical Maps
Abstract:
The nucleosome is the fundamental packing
unit of DNA in eukaryotic cells, and its positioning plays a critical role
in the regulation of gene expression and chromosome functions. A
recently developed chemical mapping method can potentially map nucleosomes
at unprecedented single-base-pair resolution. The existence of
overlapping nucleosomes due to cell mixture or cell dynamics, however,
causes convolution of nucleosome positioning signals. In this article, we
introduce a locally convoluted cluster model and a maximum likelihood
deconvolution approach, and illustrate the effectiveness of this approach
in quantification of the nucleosome positional signal in the chemical
mapping data. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 48-62
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.862169
File-URL: http://hdl.handle.net/10.1080/01621459.2013.862169
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:48-62
Template-Type: ReDIF-Article 1.0
Author-Name: Lucas Janson
Author-X-Name-First: Lucas
Author-X-Name-Last: Janson
Author-Name: Bala Rajaratnam
Author-X-Name-First: Bala
Author-X-Name-Last: Rajaratnam
Title: A Methodology for Robust Multiproxy Paleoclimate Reconstructions and Modeling of Temperature Conditional Quantiles
Abstract:
Great strides have been made in the field
of reconstructing past temperatures based on models relating temperature
to temperature-sensitive paleoclimate proxies. One of the goals of such
reconstructions is to assess if current climate is anomalous in a
millennial context. These regression-based approaches model the
conditional mean of the temperature distribution as a function of
paleoclimate proxies (or vice versa). Some of the recent focus in the area
has considered methods that help reduce the uncertainty inherent in such
statistical paleoclimate reconstructions, with the ultimate goal of
improving the confidence that can be attached to such endeavors. A second
important scientific focus in the subject area is the area of forward
models for proxies, the goal of which is to understand the way
paleoclimate proxies are driven by temperature and other environmental
variables. One of the primary contributions of this article is novel
statistical methodology for (i) quantile regression (QR) with
autoregressive residual structure, (ii) estimation of corresponding model
parameters, (iii) development of a rigorous framework for specifying
uncertainty estimates of quantities of interest, yielding (iv) statistical
byproducts that address the two scientific foci discussed above. We show
that by using the above statistical methodology, we can demonstrably
produce a more robust reconstruction than is possible by using
conditional-mean-fitting methods. Our reconstruction shares some of the
common features of past reconstructions, but we also gain useful insights.
More importantly, we are able to demonstrate a significantly smaller
uncertainty than that from previous regression methods. In addition, the
QR component allows us to model, in a more complete and flexible way than
least squares, the conditional distribution of temperature given proxies.
This relationship can be used to inform forward models relating how
proxies are driven by temperature.
Journal: Journal of the American Statistical Association
Pages: 63-77
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.848807
File-URL: http://hdl.handle.net/10.1080/01621459.2013.848807
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:63-77
Template-Type: ReDIF-Article 1.0
Author-Name: Naim Rashid
Author-X-Name-First: Naim
Author-X-Name-Last: Rashid
Author-Name: Wei Sun
Author-X-Name-First: Wei
Author-X-Name-Last: Sun
Author-Name: Joseph G. Ibrahim
Author-X-Name-First: Joseph G.
Author-X-Name-Last: Ibrahim
Title: Some Statistical Strategies for DAE-seq Data Analysis: Variable Selection and Modeling Dependencies Among Observations
Abstract:
In DAE (DNA after enrichment)-seq
experiments, genomic regions related with certain biological processes are
enriched/isolated by an assay and are then sequenced on a high-throughput
sequencing platform to determine their genomic positions. Statistical
analysis of DAE-seq data aims to detect genomic regions with significant
aggregations of isolated DNA fragments ("enriched regions") versus all the
other regions ("background"). However, many confounding factors may
influence DAE-seq signals. In addition, the signals in adjacent genomic
regions may exhibit strong correlations, which invalidate the independence
assumption employed by many existing methods. To mitigate these issues, we
develop a novel autoregressive Hidden Markov model (AR-HMM) to account for
covariate effects and violations of the independence assumption. We
demonstrate that our AR-HMM leads to improved performance in identifying
enriched regions in both simulated and real datasets, especially in
epigenetic datasets with broader regions of DAE-seq signal enrichment.
We also introduce a variable selection procedure in the context of the
HMM/AR-HMM where the observations are not independent and the mean value
of each state-specific emission distribution is modeled by some
covariates. We study the theoretical properties of this variable selection
procedure and demonstrate its efficacy in simulated and real DAE-seq data.
In summary, we develop several practical approaches for DAE-seq data
analysis that are also applicable to more general problems in statistics.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 78-94
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.869222
File-URL: http://hdl.handle.net/10.1080/01621459.2013.869222
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:78-94
Template-Type: ReDIF-Article 1.0
Author-Name: Corwin Matthew Zigler
Author-X-Name-First: Corwin Matthew
Author-X-Name-Last: Zigler
Author-Name: Francesca Dominici
Author-X-Name-First: Francesca
Author-X-Name-Last: Dominici
Title: Uncertainty in Propensity Score Estimation: Bayesian Methods for Variable Selection and Model-Averaged Causal Effects
Abstract:
Causal inference with observational data
frequently relies on the notion of the propensity score (PS) to adjust
treatment comparisons for observed confounding factors. As decisions in
the era of "big data" are increasingly reliant on large and complex
collections of digital data, researchers are frequently confronted with
decisions regarding which of a high-dimensional covariate set to include
in the PS model to satisfy the assumptions necessary for estimating
average causal effects. Typically, simple or ad hoc methods are employed
to arrive at a single PS model, without acknowledging the uncertainty
associated with the model selection. We propose three Bayesian methods for
PS variable selection and model averaging that (a) select relevant
variables from a set of candidate variables to include in the PS model and
(b) estimate causal treatment effects as weighted averages of estimates
under different PS models. The associated weight for each PS model
reflects the data-driven support for that model's ability to adjust for
the necessary variables. We illustrate features of our proposed approaches
with a simulation study, and ultimately use our methods to compare the
effectiveness of surgical versus nonsurgical treatment for brain tumors
among 2606 Medicare beneficiaries. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 95-107
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.869498
File-URL: http://hdl.handle.net/10.1080/01621459.2013.869498
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:95-107
Template-Type: ReDIF-Article 1.0
Author-Name: Ziyue Liu
Author-X-Name-First: Ziyue
Author-X-Name-Last: Liu
Author-Name: Anne R. Cappola
Author-X-Name-First: Anne R.
Author-X-Name-Last: Cappola
Author-Name: Leslie J. Crofford
Author-X-Name-First: Leslie J.
Author-X-Name-Last: Crofford
Author-Name: Wensheng Guo
Author-X-Name-First: Wensheng
Author-X-Name-Last: Guo
Title: Modeling Bivariate Longitudinal Hormone Profiles by Hierarchical State Space Models
Abstract:
The hypothalamic-pituitary-adrenal (HPA) axis is crucial in coping with
stress and maintaining homeostasis. Hormones produced by the HPA axis
exhibit both complex univariate longitudinal profiles and complex
relationships among different hormones. Consequently, modeling these
multivariate longitudinal hormone profiles is a challenging task. In this
article, we propose a bivariate hierarchical state space model, in which
each hormone profile is modeled by a hierarchical state space model, with
both population-average and subject-specific components. The bivariate
model is constructed by concatenating the univariate models based on the
hypothesized relationship. Because of the flexibility of the state space
framework, the resultant models not only can handle complex individual
profiles, but also can incorporate complex relationships between two
hormones, including both concurrent and feedback relationships. Estimation
and inference are based on marginal likelihood and posterior means and
variances. Computationally efficient Kalman filtering and smoothing
algorithms are used for implementation. Application of the proposed method
to a study of chronic fatigue syndrome and fibromyalgia reveals that the
relationships between adrenocorticotropic hormone and cortisol in the
patient group are weaker than in healthy controls. Supplementary materials
for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 108-118
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.830071
File-URL: http://hdl.handle.net/10.1080/01621459.2013.830071
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:108-118
Template-Type: ReDIF-Article 1.0
Author-Name: Yinshan Zhao
Author-X-Name-First: Yinshan
Author-X-Name-Last: Zhao
Author-Name: David K. B. Li
Author-X-Name-First: David K. B.
Author-X-Name-Last: Li
Author-Name: A. John Petkau
Author-X-Name-First: A. John
Author-X-Name-Last: Petkau
Author-Name: Andrew Riddehough
Author-X-Name-First: Andrew
Author-X-Name-Last: Riddehough
Author-Name: Anthony Traboulsee
Author-X-Name-First: Anthony
Author-X-Name-Last: Traboulsee
Title: Detection of Unusual Increases in MRI Lesion Counts in Individual Multiple Sclerosis Patients
Abstract:
Data and Safety Monitoring Boards (DSMBs)
for multiple sclerosis clinical trials consider an increase of
contrast-enhancing lesions on repeated magnetic resonance imaging an
indicator for potential adverse events. However, there are no published
studies that clearly identify what should be considered an "unexpected
increase" of lesion activity for a patient. To address this problem, we
consider as an index the likelihood of observing lesion counts as large as
those observed on the recent scans of a patient conditional on the
patient's lesion counts on previous scans. To estimate this index, we rely
on random effects models. Given the patient-specific random effect, we
assume that the repeated lesion counts from the same patient follow a
negative binomial distribution and may be correlated over time. We fit the
model using data collected from the trial under DSMB review and update the
estimation when new data are to be reviewed. We consider two estimation
procedures: maximum likelihood for a fully parameterized model and a
simple semiparametric method for a model with an unspecified distribution
for the random effects. We examine the performance of our methods using
simulations and illustrate the approach using data from a clinical trial.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 119-132
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.847373
File-URL: http://hdl.handle.net/10.1080/01621459.2013.847373
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:119-132
Template-Type: ReDIF-Article 1.0
Author-Name: Ben B. Hansen
Author-X-Name-First: Ben B.
Author-X-Name-Last: Hansen
Author-Name: Paul R. Rosenbaum
Author-X-Name-First: Paul R.
Author-X-Name-Last: Rosenbaum
Author-Name: Dylan S. Small
Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small
Title: Clustered Treatment Assignments and Sensitivity to Unmeasured Biases in Observational Studies
Abstract:
Clustered treatment assignment occurs when
individuals are grouped into clusters prior to treatment and whole
clusters, not individuals, are assigned to treatment or control. In
randomized trials, clustered assignments may be required because the
treatment must be applied to all children in a classroom, or to all
patients at a clinic, or to all radio listeners in the same media market.
The most common cluster randomized design pairs 2S
clusters into S pairs based on similar pretreatment
covariates, then picks one cluster in each pair at random for treatment,
the other cluster being assigned to control. Typically, group
randomization increases sampling variability and so is less efficient,
less powerful, than randomization at the individual level, but it may be
unavoidable when it is impractical to treat just a few people within each
cluster. Related issues arise in nonrandomized, observational studies of
treatment effects, but in this case one must examine the sensitivity of
conclusions to bias from nonrandom selection of clusters for treatment.
Although clustered assignment increases sampling variability in
observational studies, as it does in randomized experiments, it also tends
to decrease sensitivity to unmeasured biases, and as the number of cluster
pairs increases the latter effect overtakes the former, dominating it when
allowance is made for nontrivial biases in treatment assignment.
Intuitively, a given magnitude of departure from random assignment can do
more harm if it acts on individual students than if it is restricted to
act on whole classes, because the bias is unable to pick the strongest
individual students for treatment, and this is especially true if a
serious effort is made to pair clusters that appeared similar prior to
treatment. We examine this issue using an asymptotic measure, the design
sensitivity, some inequalities that exploit convexity, simulation, and an
application concerned with the flooding of villages in Bangladesh.
Journal: Journal of the American Statistical Association
Pages: 133-144
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.863157
File-URL: http://hdl.handle.net/10.1080/01621459.2013.863157
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:133-144
Template-Type: ReDIF-Article 1.0
Author-Name: Genevera I. Allen
Author-X-Name-First: Genevera I.
Author-X-Name-Last: Allen
Author-Name: Logan Grosenick
Author-X-Name-First: Logan
Author-X-Name-Last: Grosenick
Author-Name: Jonathan Taylor
Author-X-Name-First: Jonathan
Author-X-Name-Last: Taylor
Title: A Generalized Least-Square Matrix Decomposition
Abstract:
Variables in many big-data settings are
structured, arising, for example, from measurements on a regular grid as
in imaging and time series or from spatial-temporal measurements as in
climate studies. Classical multivariate techniques ignore these structural
relationships often resulting in poor performance. We propose a
generalization of principal components analysis (PCA) that is appropriate
for massive datasets with structured variables or known two-way
dependencies. By finding the best low-rank approximation of the data with
respect to a transposable quadratic norm, our decomposition, entitled the
generalized least-square matrix decomposition (GMD),
directly accounts for structural relationships. As many variables in
high-dimensional settings are often irrelevant, we also regularize our
matrix decomposition by adding two-way penalties to encourage sparsity or
smoothness. We develop fast computational algorithms using our methods to
perform generalized PCA (GPCA), sparse GPCA, and functional GPCA on
massive datasets. Through simulations and a whole brain functional MRI
example, we demonstrate the utility of our methodology for dimension
reduction, signal recovery, and feature selection with high-dimensional
structured data. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 145-159
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.852978
File-URL: http://hdl.handle.net/10.1080/01621459.2013.852978
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:145-159
Template-Type: ReDIF-Article 1.0
Author-Name: François Portier
Author-X-Name-First: François
Author-X-Name-Last: Portier
Author-Name: Bernard Delyon
Author-X-Name-First: Bernard
Author-X-Name-Last: Delyon
Title: Bootstrap Testing of the Rank of a Matrix via Least-Squared Constrained Estimation
Abstract:
To test whether an unknown matrix M0 has a given rank (null hypothesis
denoted H0), we consider a statistic that is a squared distance between
an estimator and the submanifold of fixed-rank matrices. Under H0, this
statistic converges to a weighted chi-squared distribution. We introduce
the constrained bootstrap (CS bootstrap) to estimate the law of this
statistic under H0. An important point is that even if H0 fails, the CS
bootstrap reproduces the behavior of the statistic under H0. As a
consequence, the CS bootstrap is employed to estimate the nonasymptotic
quantile for testing the rank.
We prove the consistency of the procedure, and simulations shed light on
the accuracy of the CS bootstrap relative to the traditional asymptotic
comparison. More generally, the results are extended to test whether an
unknown parameter belongs to a submanifold of the Euclidean space.
Finally, the CS bootstrap is easy to compute, handles a large family of
tests, and works under mild assumptions.
Journal: Journal of the American Statistical Association
Pages: 160-172
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.847841
File-URL: http://hdl.handle.net/10.1080/01621459.2013.847841
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:160-172
Template-Type: ReDIF-Article 1.0
Author-Name: Nicolas J-B. Brunel
Author-X-Name-First: Nicolas J-B.
Author-X-Name-Last: Brunel
Author-Name: Quentin Clairon
Author-X-Name-First: Quentin
Author-X-Name-Last: Clairon
Author-Name: Florence d'Alché-Buc
Author-X-Name-First: Florence
Author-X-Name-Last: d'Alché-Buc
Title: Parametric Estimation of Ordinary Differential Equations With Orthogonality Conditions
Abstract:
Differential equations are commonly used to model dynamical deterministic
systems in applications. When statistical parameter estimation is required
to calibrate theoretical models to data, classical statistical estimators
are often confronted with complex and potentially ill-posed optimization
problems. As a consequence, alternatives to classical parametric
estimators are needed to obtain reliable estimates. We propose a
gradient matching approach for the estimation of parametric Ordinary
Differential Equations (ODE) observed with noise. Starting from a
nonparametric proxy of a true solution of the ODE, we build a parametric
estimator based on a variational characterization of the solution. As a
Generalized Moment Estimator, our estimator must satisfy a set of
orthogonal conditions that are solved in the least squares sense. Despite
the use of a nonparametric estimator, we prove the root-n consistency and
asymptotic normality of the Orthogonal Conditions estimator. We can derive
confidence sets thanks to a closed-form expression for the asymptotic
variance. Finally, the OC estimator is compared to classical estimators in
several (simulated and real) experiments and ODE models to show its
versatility and relevance with respect to classical Gradient Matching and
Nonlinear Least Squares estimators. In particular, we show on a real
dataset of influenza infection that the approach gives reliable estimates.
Moreover, we show that our approach can deal directly with more elaborated
models such as Delay Differential Equation (DDE). Supplementary materials
for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 173-185
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.841583
File-URL: http://hdl.handle.net/10.1080/01621459.2013.841583
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:173-185
Template-Type: ReDIF-Article 1.0
Author-Name: Alan Huang
Author-X-Name-First: Alan
Author-X-Name-Last: Huang
Title: Joint Estimation of the Mean and Error Distribution in Generalized Linear Models
Abstract:
This article introduces a semiparametric extension of generalized linear
models that is based on a full probability model, but does not require
specification of an error distribution or variance function for the data.
The approach involves treating the error distribution as an
infinite-dimensional parameter, which is then estimated simultaneously
with the mean-model parameters using a maximum empirical likelihood
approach. The resulting estimators are shown to be consistent and jointly
asymptotically normal in distribution. When interest lies only in
inferences on the mean-model parameters, we show that maximizing out the
error distribution leads to profile empirical log-likelihood ratio
statistics that have asymptotic χ² distributions under the
null. Simulation studies demonstrate that the proposed method can be more
accurate than existing methods that offer the same level of flexibility
and generality, especially with smaller sample sizes. The theoretical and
numerical results are complemented by a data analysis example.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 186-196
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.824892
File-URL: http://hdl.handle.net/10.1080/01621459.2013.824892
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:186-196
Template-Type: ReDIF-Article 1.0
Author-Name: Christina D. Wang
Author-X-Name-First: Christina D.
Author-X-Name-Last: Wang
Author-Name: Per A. Mykland
Author-X-Name-First: Per A.
Author-X-Name-Last: Mykland
Title: The Estimation of Leverage Effect With High-Frequency Data
Abstract:
The leverage effect has become an
extensively studied phenomenon that describes the (usually) negative
relation between stock returns and their volatility. Although this
characteristic of stock returns is well acknowledged, most studies of the
phenomenon are based on cross-sectional calibration with parametric
models. On the statistical side, most previous works are conducted over
daily or longer return horizons, and few of them have carefully studied
its estimation, especially with high-frequency data. However, estimation
of the leverage effect is important because sensible inference is possible
only when the leverage effect is estimated reliably. In this article, we
provide nonparametric estimation for a class of stochastic measures of
leverage effect. To construct estimators with good statistical properties,
we introduce a new stochastic leverage effect parameter. The estimators
and their statistical properties are provided in cases both with and
without microstructure noise, under the stochastic volatility model. In
asymptotics, the consistency and limiting distribution of the estimators
are derived and corroborated by simulation results. For consistency, a
previously unknown bias correction factor is added to the estimators.
Applications of the estimators are also explored. This estimator provides
the opportunity to study high-frequency regression, which leads to the
prediction of volatility using not only previous volatility but also the
leverage effect. The estimator also reveals a theoretical connection
between skewness and the leverage effect, which further leads to the
prediction of skewness. Furthermore, by adopting ideas similar to those
used in estimating the leverage effect, it is easy to extend the methods to
study other important aspects of stock returns, such as volatility of
volatility.
Journal: Journal of the American Statistical Association
Pages: 197-215
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.864189
File-URL: http://hdl.handle.net/10.1080/01621459.2013.864189
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:197-215
Template-Type: ReDIF-Article 1.0
Author-Name: Eun Ryung Lee
Author-X-Name-First: Eun Ryung
Author-X-Name-Last: Lee
Author-Name: Hohsuk Noh
Author-X-Name-First: Hohsuk
Author-X-Name-Last: Noh
Author-Name: Byeong U. Park
Author-X-Name-First: Byeong U.
Author-X-Name-Last: Park
Title: Model Selection via Bayesian Information Criterion for Quantile Regression Models
Abstract:
Bayesian information criterion (BIC) is known to identify the true model
consistently as long as the predictor dimension is finite. Recently, its
moderate modifications have been shown to be consistent in model selection
even when the number of variables diverges. Those works have been done
mostly in mean regression, but rarely in quantile regression. The
best-known results about BIC for quantile regression are for linear models
with a fixed number of variables. In this article, we investigate how BIC
can be adapted to high-dimensional linear quantile regression and show
that a modified BIC is consistent in model selection when the number of
variables diverges as the sample size increases. We also discuss how it
can be used for choosing the regularization parameters of penalized
approaches that are designed to conduct variable selection and shrinkage
estimation simultaneously. Moreover, we extend the results to structured
nonparametric quantile models with a diverging number of covariates. We
illustrate our theoretical results via some simulated examples and a real
data analysis on human eye disease. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 216-229
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.836975
File-URL: http://hdl.handle.net/10.1080/01621459.2013.836975
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:216-229
Template-Type: ReDIF-Article 1.0
Author-Name: Ruosha Li
Author-X-Name-First: Ruosha
Author-X-Name-Last: Li
Author-Name: Yu Cheng
Author-X-Name-First: Yu
Author-X-Name-Last: Cheng
Author-Name: Jason P. Fine
Author-X-Name-First: Jason P.
Author-X-Name-Last: Fine
Title: Quantile Association Regression Models
Abstract:
It is often important to study the
association between two continuous variables. In this work, we propose a
novel regression framework for assessing conditional associations on
quantiles. We develop general methodology which permits covariate effects
on both the marginal quantile models for the two variables and their
quantile associations. The proposed quantile copula models have
straightforward interpretation, facilitating a comprehensive view of
association structure which is much richer than that based on standard
product moment and rank correlations. We show that the resulting
estimators are uniformly consistent and weakly convergent as a process of
the quantile index. Simple variance estimators are presented which perform
well in numerical studies. Extensive simulations and a real data example
demonstrate the practical utility of the methodology. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 230-242
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.847375
File-URL: http://hdl.handle.net/10.1080/01621459.2013.847375
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:230-242
Template-Type: ReDIF-Article 1.0
Author-Name: Ching-Kang Ing
Author-X-Name-First: Ching-Kang
Author-X-Name-Last: Ing
Author-Name: Chiao-Yi Yang
Author-X-Name-First: Chiao-Yi
Author-X-Name-Last: Yang
Title: Predictor Selection for Positive Autoregressive Processes
Abstract:
Let observations y1, …, yn be generated from a first-order
autoregressive (AR) model with positive errors. In both the stationary
and unit root cases, we derive moment bounds and limiting distributions
of an extreme value estimator of the AR coefficient. These results
enable us to provide asymptotic expressions for the mean squared error
(MSE) of this estimator and the mean squared prediction error (MSPE) of
the corresponding predictor of yn+1. Based on these expressions, we
compare the relative performance of the extreme value predictor
(estimator) and the least-squares predictor (estimator) from the MSPE
(MSE) point of view. Our
comparison reveals that the better predictor (estimator) is determined not
only by whether a unit root exists, but also by the behavior of the
underlying error distribution near the origin, and hence is difficult to
identify in practice. To circumvent this difficulty, we suggest choosing
the predictor (estimator) with the smaller accumulated prediction error
and show that the predictor (estimator) chosen in this way is
asymptotically equivalent to the better one. Both real and simulated
datasets are used to illustrate the proposed method. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 243-253
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.836974
File-URL: http://hdl.handle.net/10.1080/01621459.2013.836974
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:243-253
Template-Type: ReDIF-Article 1.0
Author-Name: Tomohiro Ando
Author-X-Name-First: Tomohiro
Author-X-Name-Last: Ando
Author-Name: Ker-Chau Li
Author-X-Name-First: Ker-Chau
Author-X-Name-Last: Li
Title: A Model-Averaging Approach for High-Dimensional Regression
Abstract:
This article considers high-dimensional regression problems in which the
number of predictors p exceeds the sample size
n. We develop a model-averaging procedure for
high-dimensional regression problems. Unlike most variable selection
studies featuring the identification of true predictors, our focus here is
on the prediction accuracy for the true conditional mean of
y given the p predictors. Our method
consists of two steps. The first step is to construct a class of
regression models, each with a smaller number of regressors, to avoid the
degeneracy of the information matrix. The second step is to find suitable
model weights for averaging. To minimize the prediction error, we estimate
the model weights using a delete-one cross-validation procedure. Departing
from the model-averaging literature, which requires that the weights sum
to one, an important improvement we introduce is to remove this
constraint. We derive some theoretical results to justify our procedure. A
theorem is proved, showing that delete-one cross-validation achieves the
lowest possible prediction loss asymptotically. This optimality result
requires a condition that unravels an important feature of
high-dimensional regression. The prediction error of any individual model
in the class for averaging is required to be higher than the classic root
n rate under the traditional parametric regression. This
condition reflects the difficulty of high-dimensional regression and it
depicts a situation especially meaningful for p >
n. We also conduct a simulation study to illustrate the
merits of the proposed approach over several existing methods, including
lasso, group lasso, forward regression, Phase Coupled (PC)-simple
algorithm, Akaike information criterion (AIC) model-averaging, Bayesian
information criterion (BIC) model-averaging methods, and SCAD (smoothly
clipped absolute deviation). This approach uses quadratic programming to
overcome the computing time issue commonly encountered in the
cross-validation literature. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 254-265
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.838168
File-URL: http://hdl.handle.net/10.1080/01621459.2013.838168
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:254-265
Template-Type: ReDIF-Article 1.0
Author-Name: Jingyuan Liu
Author-X-Name-First: Jingyuan
Author-X-Name-Last: Liu
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Author-Name: Rongling Wu
Author-X-Name-First: Rongling
Author-X-Name-Last: Wu
Title: Feature Selection for Varying Coefficient Models With Ultrahigh-Dimensional Covariates
Abstract:
This article is concerned with feature
screening and variable selection for varying coefficient models with
ultrahigh-dimensional covariates. We propose a new feature screening
procedure for these models based on conditional correlation coefficient.
We systematically study the theoretical properties of the proposed
procedure, and establish its sure screening property and ranking
consistency. To enhance the finite sample performance of the proposed
procedure, we further develop an iterative feature screening procedure.
Monte Carlo simulation studies were conducted to examine the performance
of the proposed procedures. In practice, we advocate a two-stage approach
for varying coefficient models. The two-stage approach consists of (a)
reducing the ultrahigh dimensionality by using the proposed procedure and
(b) applying regularization methods for dimension-reduced varying
coefficient models to make statistical inferences on the coefficient
functions. We illustrate the proposed two-stage approach by a real data
example. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 266-274
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.850086
File-URL: http://hdl.handle.net/10.1080/01621459.2013.850086
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:266-274
Template-Type: ReDIF-Article 1.0
Author-Name: Fang Han
Author-X-Name-First: Fang
Author-X-Name-Last: Han
Author-Name: Han Liu
Author-X-Name-First: Han
Author-X-Name-Last: Liu
Title: Scale-Invariant Sparse PCA on High-Dimensional Meta-Elliptical Data
Abstract:
We propose a semiparametric method for
conducting scale-invariant sparse principal component analysis (PCA) on
high-dimensional non-Gaussian data. Compared with sparse PCA, our method
has a weaker modeling assumption and is more robust to possible data
contamination. Theoretically, the proposed method achieves a parametric
rate of convergence in estimating the parameters of interest under a
flexible semiparametric distribution family; computationally, the proposed
method exploits a rank-based procedure and is as efficient as sparse PCA;
empirically, our method outperforms most competing methods on both
synthetic and real-world datasets.
Journal: Journal of the American Statistical Association
Pages: 275-287
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.844699
File-URL: http://hdl.handle.net/10.1080/01621459.2013.844699
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:275-287
Template-Type: ReDIF-Article 1.0
Author-Name: Lan Liu
Author-X-Name-First: Lan
Author-X-Name-Last: Liu
Author-Name: Michael G. Hudgens
Author-X-Name-First: Michael G.
Author-X-Name-Last: Hudgens
Title: Large Sample Randomization Inference of Causal Effects in the Presence of Interference
Abstract:
Recently, there has been increasing
interest in making causal inference when interference is possible. In the
presence of interference, treatment may have several types of effects. In
this article, we consider inference about such effects when the population
consists of groups of individuals where interference is possible within
groups but not between groups. A two-stage randomization design is assumed
where in the first stage groups are randomized to different treatment
allocation strategies and in the second stage individuals are randomized
to treatment or control conditional on the strategy assigned to their
group in the first stage. For this design, the asymptotic distributions of
estimators of the causal effects are derived when either the number of
individuals per group or the number of groups grows large. Under certain
homogeneity assumptions, the asymptotic distributions provide
justification for Wald-type confidence intervals (CIs) and tests.
Empirical results demonstrate that the Wald CIs have good coverage in
finite samples and are narrower than CIs based on either the Chebyshev or
Hoeffding inequalities provided the number of groups is not too small. The
methods are illustrated by two examples which consider the effects of
cholera vaccination and an intervention to encourage voting.
Journal: Journal of the American Statistical Association
Pages: 288-301
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.844698
File-URL: http://hdl.handle.net/10.1080/01621459.2013.844698
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:288-301
Template-Type: ReDIF-Article 1.0
Author-Name: A. C. Davison
Author-X-Name-First: A. C.
Author-X-Name-Last: Davison
Author-Name: D. A. S. Fraser
Author-X-Name-First: D. A. S.
Author-X-Name-Last: Fraser
Author-Name: N. Reid
Author-X-Name-First: N.
Author-X-Name-Last: Reid
Author-Name: N. Sartori
Author-X-Name-First: N.
Author-X-Name-Last: Sartori
Title: Accurate Directional Inference for Vector Parameters in Linear Exponential Families
Abstract:
We consider inference on a vector-valued parameter of interest in a linear
exponential family, in the presence of a finite-dimensional nuisance
parameter. Based on higher-order asymptotic theory for likelihood, we
propose a directional test whose p-value is computed
using one-dimensional integration. The work simplifies and develops
earlier research on directional tests for continuous models and on
higher-order inference for discrete models, and the examples include
contingency tables and logistic regression. Examples and simulations
illustrate the high accuracy of the method, which we compare with the
usual likelihood ratio test and with an adjusted version due to Skovgaard.
In high-dimensional settings, such as covariance selection, the approach
works essentially perfectly, whereas its competitors can fail
catastrophically.
Journal: Journal of the American Statistical Association
Pages: 302-314
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.839451
File-URL: http://hdl.handle.net/10.1080/01621459.2013.839451
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:302-314
Template-Type: ReDIF-Article 1.0
Author-Name: Simon Barthelmé
Author-X-Name-First: Simon
Author-X-Name-Last: Barthelmé
Author-Name: Nicolas Chopin
Author-X-Name-First: Nicolas
Author-X-Name-Last: Chopin
Title: Expectation Propagation for Likelihood-Free Inference
Abstract:
Many models of interest in the natural and
social sciences have no closed-form likelihood function, which means that
they cannot be treated using the usual techniques of statistical
inference. In the case where such models can be efficiently simulated,
Bayesian inference is still possible thanks to the approximate Bayesian
computation (ABC) algorithm. Although many refinements have been
suggested, ABC inference is still far from routine. ABC is often
excruciatingly slow due to very low acceptance rates. In addition, ABC
requires introducing a vector of "summary statistics" s(y), the choice of
which is relatively arbitrary and often requires some trial and error,
making the whole process laborious for the user. We introduce in
this work the EP-ABC algorithm, which is an adaptation to the
likelihood-free context of the variational approximation algorithm known
as expectation propagation. The main advantage of EP-ABC
is that it is faster by a few orders of magnitude than standard
algorithms, while producing an overall approximation error that is
typically negligible. A second advantage of EP-ABC is that it replaces the
usual global ABC constraint ‖s(y) - s(y⋆)‖ ⩽ ϵ, where s(y⋆) is the
vector of summary statistics computed on the whole dataset, by n local
constraints of the form ‖s_i(y_i) - s_i(y⋆_i)‖ ⩽ ϵ that apply
separately to each data point. In particular, it is often possible to
take s_i(y_i) = y_i, making it possible to do away with summary
statistics entirely.
In that case, EP-ABC makes it possible to approximate directly the
evidence (marginal likelihood) of the model. Comparisons are performed in
three real-world applications that are typical of likelihood-free
inference, including one application in neuroscience that is novel, and
possibly too challenging for standard ABC techniques.
Journal: Journal of the American Statistical Association
Pages: 315-333
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.864178
File-URL: http://hdl.handle.net/10.1080/01621459.2013.864178
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:315-333
Template-Type: ReDIF-Article 1.0
Author-Name: David S. Matteson
Author-X-Name-First: David S.
Author-X-Name-Last: Matteson
Author-Name: Nicholas A. James
Author-X-Name-First: Nicholas A.
Author-X-Name-Last: James
Title: A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data
Abstract:
Change point analysis has applications in
a wide variety of fields. The general problem concerns the inference of a
change in distribution for a set of time-ordered observations. Sequential
detection is an online version in which new data are continually arriving
and are analyzed adaptively. We are concerned with the related, but
distinct, offline version, in which retrospective analysis of an entire
sequence is performed. For a set of multivariate observations of arbitrary
dimension, we consider nonparametric estimation of both the number of
change points and the positions at which they occur. We do not make any
assumptions regarding the nature of the change in distribution or any
distribution assumptions beyond the existence of the αth absolute
moment, for some α ∈ (0, 2). Estimation is based on
hierarchical clustering and we propose both divisive and agglomerative
algorithms. The divisive method is shown to provide consistent estimates
of both the number and the location of change points under standard
regularity assumptions. We compare the proposed approach with competing
methods in a simulation study. Methods from cluster analysis are applied
to assess performance and to allow simple comparisons of location
estimates, even when the estimated number differs. We conclude with
applications in genetics, finance, and spatio-temporal analysis.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 334-345
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.849605
File-URL: http://hdl.handle.net/10.1080/01621459.2013.849605
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:334-345
Template-Type: ReDIF-Article 1.0
Author-Name: Gery Geenens
Author-X-Name-First: Gery
Author-X-Name-Last: Geenens
Title: Probit Transformation for Kernel Density Estimation on the Unit Interval
Abstract:
Kernel estimation of a probability density
function supported on the unit interval has proved difficult, because of
the well-known boundary bias issues a conventional kernel density
estimator would necessarily face in this situation. Transforming the
variable of interest into a variable whose density has unconstrained
support, estimating that density, and obtaining an estimate of the density
of the original variable through back-transformation, seems a natural idea
to easily get rid of the boundary problems. In practice, however, a simple
and efficient implementation of this methodology is far from immediate,
and the few attempts found in the literature have been reported not to
perform well. In this article, the main reasons for this failure are
identified and an easy way to correct them is suggested. It turns out that
combining the transformation idea with local likelihood density estimation
produces viable density estimators, mostly free from boundary issues.
Their asymptotic properties are derived, and a practical cross-validation
bandwidth selection rule is devised. Extensive simulations demonstrate the
excellent performance of these estimators compared to their main
competitors for a wide range of density shapes. In fact, they turn out to
be the best choice overall. Finally, they are used to successfully
estimate a density of nonstandard shape supported on [0, 1] from a
small-size real data sample.
Journal: Journal of the American Statistical Association
Pages: 346-358
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.842173
File-URL: http://hdl.handle.net/10.1080/01621459.2013.842173
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:346-358
Template-Type: ReDIF-Article 1.0
Author-Name: Kenji Fukumizu
Author-X-Name-First: Kenji
Author-X-Name-Last: Fukumizu
Author-Name: Chenlei Leng
Author-X-Name-First: Chenlei
Author-X-Name-Last: Leng
Title: Gradient-Based Kernel Dimension Reduction for Regression
Abstract:
This article proposes a novel approach to linear dimension reduction for
regression using nonparametric estimation with positive-definite kernels
or reproducing kernel Hilbert spaces (RKHSs). The purpose of the dimension
reduction is to find such directions in the explanatory variables that
explain the response sufficiently: this is called sufficient
dimension reduction. The proposed method is based on an estimator
for the gradient of the regression function considered for the feature
vectors mapped into RKHSs. It is proved that the method is able to
estimate the directions that achieve sufficient dimension reduction. In
comparison with other existing methods, the proposed one has wide
applicability without strong assumptions on the distributions or the type
of variables, and needs only eigendecomposition for estimating the
projection matrix. The theoretical analysis shows that the estimator is
consistent at a certain rate under some conditions. The experimental
results demonstrate that the proposed method successfully finds effective
directions with efficient computation even for high-dimensional
explanatory variables.
Journal: Journal of the American Statistical Association
Pages: 359-370
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.838167
File-URL: http://hdl.handle.net/10.1080/01621459.2013.838167
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:359-370
Template-Type: ReDIF-Article 1.0
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: D. Y. Lin
Author-X-Name-First: D. Y.
Author-X-Name-Last: Lin
Title: Efficient Estimation of Semiparametric Transformation Models for Two-Phase Cohort Studies
Abstract:
Under two-phase cohort designs, such as
case--cohort and nested case--control sampling, information on observed
event times, event indicators, and inexpensive covariates is collected in
the first phase, and the first-phase information is used to select
subjects for measurements of expensive covariates in the second phase;
inexpensive covariates are also used in the data analysis to control for
confounding and to evaluate interactions. This article provides efficient
estimation of semiparametric transformation models for such designs,
accommodating both discrete and continuous covariates, and allowing
inexpensive and expensive covariates to be correlated. The estimation is
based on the maximization of a modified nonparametric likelihood function
through a generalization of the expectation--maximization algorithm. The
resulting estimators are shown to be consistent, asymptotically normal and
asymptotically efficient with easily estimated variances. Simulation
studies demonstrate that the asymptotic approximations are accurate in
practical situations. Empirical data from Wilms' tumor studies and the
Atherosclerosis Risk in Communities (ARIC) study are presented.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 371-383
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.842172
File-URL: http://hdl.handle.net/10.1080/01621459.2013.842172
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:371-383
Template-Type: ReDIF-Article 1.0
Author-Name: Layla Parast
Author-X-Name-First: Layla
Author-X-Name-Last: Parast
Author-Name: Lu Tian
Author-X-Name-First: Lu
Author-X-Name-Last: Tian
Author-Name: Tianxi Cai
Author-X-Name-First: Tianxi
Author-X-Name-Last: Cai
Title: Landmark Estimation of Survival and Treatment Effect in a Randomized Clinical Trial
Abstract:
In many studies with a survival outcome,
it is often not feasible to fully observe the primary event of interest.
This often leads to heavy censoring and thus, difficulty in efficiently
estimating survival or comparing survival rates between two groups. In
certain diseases, baseline covariates and the event time of nonfatal
intermediate events may be associated with overall survival. In these
settings, incorporating such additional information may lead to gains in
efficiency in estimation of survival and testing for a difference in
survival between two treatment groups. If gains in efficiency can be
achieved, it may then be possible to decrease the sample size of patients
required for a study to achieve a particular power level or decrease the
duration of the study. Most existing methods for incorporating
intermediate events and covariates to predict survival focus on estimation
of relative risk parameters and/or the joint distribution of events under
semiparametric models. However, in practice, these model assumptions may
not hold and hence may lead to biased estimates of the marginal survival.
In this article, we propose a seminonparametric two-stage procedure to
estimate and compare t-year survival rates by
incorporating intermediate event information observed before some landmark
time, which serves as a useful approach to overcome semicompeting risk
issues. In a randomized clinical trial setting, we further improve
efficiency through an additional calibration step. Simulation studies
demonstrate substantial potential gains in efficiency in terms of
estimation and power. We illustrate our proposed procedures using an AIDS
Clinical Trial Protocol 175 dataset by estimating survival and examining
the difference in survival between two treatment groups: zidovudine and
zidovudine plus zalcitabine. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 384-394
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.842488
File-URL: http://hdl.handle.net/10.1080/01621459.2013.842488
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:384-394
Template-Type: ReDIF-Article 1.0
Author-Name: Bruce G. Lindsay
Author-X-Name-First: Bruce G.
Author-X-Name-Last: Lindsay
Author-Name: Marianthi Markatou
Author-X-Name-First: Marianthi
Author-X-Name-Last: Markatou
Author-Name: Surajit Ray
Author-X-Name-First: Surajit
Author-X-Name-Last: Ray
Title: Kernels, Degrees of Freedom, and Power Properties of Quadratic Distance Goodness-of-Fit Tests
Abstract:
In this article, we study the power properties of quadratic-distance-based
goodness-of-fit tests. First, we introduce the concept of a root
kernel and discuss the considerations that enter the selection of
this kernel. We derive an easy to use normal approximation to the power of
quadratic distance goodness-of-fit tests and base the construction of a
noncentrality index, an analogue of the traditional
noncentrality parameter, on it. This leads to a method akin to the
Neyman-Pearson lemma for constructing optimal kernels for specific
alternatives. We then introduce a midpower analysis as a
device for choosing optimal degrees of freedom for a family of
alternatives of interest. Finally, we introduce a new diffusion kernel,
called the Pearson-normal kernel, and study the extent to
which the normal approximation to the power of tests based on this kernel
is valid. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 395-410
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.836972
File-URL: http://hdl.handle.net/10.1080/01621459.2013.836972
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:395-410
Template-Type: ReDIF-Article 1.0
Author-Name: Gerda Claeskens
Author-X-Name-First: Gerda
Author-X-Name-Last: Claeskens
Author-Name: Mia Hubert
Author-X-Name-First: Mia
Author-X-Name-Last: Hubert
Author-Name: Leen Slaets
Author-X-Name-First: Leen
Author-X-Name-Last: Slaets
Author-Name: Kaveh Vakili
Author-X-Name-First: Kaveh
Author-X-Name-Last: Vakili
Title: Multivariate Functional Halfspace Depth
Abstract:
This article defines and studies a depth
for multivariate functional data. By the multivariate nature and by
including a weight function, it acknowledges important characteristics of
functional data, namely differences in the amount of local amplitude,
shape, and phase variation. We study both population and finite sample
versions. The multivariate sample of curves may include warping functions,
derivatives, and integrals of the original curves for a better overall
representation of the functional data via the depth. We present a
simulation study and data examples that confirm the good performance of
this depth function. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 411-423
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.856795
File-URL: http://hdl.handle.net/10.1080/01621459.2013.856795
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:411-423
Template-Type: ReDIF-Article 1.0
Author-Name: Victor M. Panaretos
Author-X-Name-First: Victor M.
Author-X-Name-Last: Panaretos
Author-Name: Tung Pham
Author-X-Name-First: Tung
Author-X-Name-Last: Pham
Author-Name: Zhigang Yao
Author-X-Name-First: Zhigang
Author-X-Name-Last: Yao
Title: Principal Flows
Abstract:
We revisit the problem of extending the
notion of principal component analysis (PCA) to multivariate datasets that
satisfy nonlinear constraints, therefore lying on Riemannian manifolds.
Our aim is to determine curves on the manifold that retain their canonical
interpretability as principal components, while at the same time being
flexible enough to capture nongeodesic forms of variation. We introduce
the concept of a principal flow, a curve on the manifold passing through
the mean of the data, and with the property that, at any point of the
curve, the tangent velocity vector attempts to fit the first eigenvector
of a tangent space PCA locally at that same point, subject to a smoothness
constraint. That is, a particle flowing along the principal flow attempts
to move along a path of maximal variation of the data, up to smoothness
constraints. The rigorous definition of a principal flow is given by means
of a Lagrangian variational problem, and its solution is reduced to an ODE
problem via the Euler--Lagrange method. Conditions for existence and
uniqueness are provided, and an algorithm is outlined for the numerical
solution of the problem. Higher order principal flows are also defined. It
is shown that global principal flows yield the usual principal components
on a Euclidean space. By means of examples, it is illustrated that the
principal flow is able to capture patterns of variation that can escape
other manifold PCA methods.
Journal: Journal of the American Statistical Association
Pages: 424-436
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2013.849199
File-URL: http://hdl.handle.net/10.1080/01621459.2013.849199
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:424-436
Template-Type: ReDIF-Article 1.0
Author-Name: Suprateek Kundu
Author-X-Name-First: Suprateek
Author-X-Name-Last: Kundu
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Bayes Variable Selection in Semiparametric Linear Models
Abstract:
There is a rich literature on Bayesian
variable selection for parametric models. Our focus is on generalizing
methods and asymptotic theory established for mixtures of
g-priors to semiparametric linear regression models
having unknown residual densities. Using a Dirichlet process location
mixture for the residual density, we propose a semiparametric
g-prior which incorporates an unknown matrix of cluster
allocation indicators. For this class of priors, posterior computation can
proceed via a straightforward stochastic search variable selection
algorithm. In addition, Bayes factor and variable selection consistency
is shown to result under a class of proper priors on g
even when the number of candidate predictors p is allowed
to increase much faster than sample size n, while making
sparsity assumptions on the true model size.
Journal: Journal of the American Statistical Association
Pages: 437-447
Issue: 505
Volume: 109
Year: 2014
Month: 3
X-DOI: 10.1080/01621459.2014.881153
File-URL: http://hdl.handle.net/10.1080/01621459.2014.881153
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:505:p:437-447
Template-Type: ReDIF-Article 1.0
Author-Name: Yongling Xiao
Author-X-Name-First: Yongling
Author-X-Name-Last: Xiao
Author-Name: Michal Abrahamowicz
Author-X-Name-First: Michal
Author-X-Name-Last: Abrahamowicz
Author-Name: Erica E. M. Moodie
Author-X-Name-First: Erica E. M.
Author-X-Name-Last: Moodie
Author-Name: Rainer Weber
Author-X-Name-First: Rainer
Author-X-Name-Last: Weber
Author-Name: James Young
Author-X-Name-First: James
Author-X-Name-Last: Young
Title: Flexible Marginal Structural Models for Estimating the Cumulative Effect of a Time-Dependent Treatment on the Hazard: Reassessing the Cardiovascular Risks of Didanosine Treatment in the Swiss HIV Cohort Study
Abstract:
The association between antiretroviral
treatment and cardiovascular disease (CVD) risk in HIV-positive persons
has been the subject of much debate since the Data collection on Adverse
events of Anti-HIV Drugs (D:A:D) study reported that recent use of two
antiretroviral drugs, abacavir (ABC) and didanosine (DDI), was associated
with increased risk. We focus on the potential impact of DDI use, as this
drug has not been studied as intensively as ABC. We propose a flexible
marginal structural Cox model with weighted cumulative exposure modeling
(Cox WCE MSM) to address two key challenges encountered when using
observational longitudinal data to assess the adverse effects of
medication: (1) the need to model the cumulative effect of a
time-dependent treatment and (2) the need to control for time-dependent
confounders that also act as mediators of the effect of past treatment.
Simulations confirm that the Cox WCE MSM yields accurate estimates of the
causal treatment effect given complex exposure effects and time-dependent
confounding. We then use the new flexible Cox WCE MSM to assess the
association between DDI use and CVD risk in the Swiss HIV Cohort Study. In
contrast to the nonsignificant results obtained with conventional
parametric Cox MSMs, our new Cox WCE MSM identifies a significant
short-term risk increase due to DDI use in the previous year.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 455-464
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.872650
File-URL: http://hdl.handle.net/10.1080/01621459.2013.872650
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:455-464
Template-Type: ReDIF-Article 1.0
Author-Name: Eduardo S. Ayra
Author-X-Name-First: Eduardo S.
Author-X-Name-Last: Ayra
Author-Name: David Ríos Insua
Author-X-Name-First: David Ríos
Author-X-Name-Last: Insua
Author-Name: Javier Cano
Author-X-Name-First: Javier
Author-X-Name-Last: Cano
Title: To Fuel or Not to Fuel? Is that the Question?
Abstract:
According to the International Air
Transport Association, the industry fuel bill accounts for more than 25%
of the annual airline operating costs. In times of severe economic
constraints and increasing fuel costs, air carriers are looking for ways
to reduce costs and improve fuel efficiency without putting flight safety
into jeopardy. In particular, this is inducing discussions on how much
additional fuel to put in a planned route to avoid diverting to an
alternate airport due to Air Traffic Flow Management delays. We provide
here a general model to support such decisions. We illustrate it with a
case study and provide comparison with the current practice, showing the
relevance of our approach.
Journal: Journal of the American Statistical Association
Pages: 465-476
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.879060
File-URL: http://hdl.handle.net/10.1080/01621459.2013.879060
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:465-476
Template-Type: ReDIF-Article 1.0
Author-Name: Isadora Antoniano-Villalobos
Author-X-Name-First: Isadora
Author-X-Name-Last: Antoniano-Villalobos
Author-Name: Sara Wade
Author-X-Name-First: Sara
Author-X-Name-Last: Wade
Author-Name: Stephen G. Walker
Author-X-Name-First: Stephen G.
Author-X-Name-Last: Walker
Title: A Bayesian Nonparametric Regression Model With Normalized Weights: A Study of Hippocampal Atrophy in Alzheimer's Disease
Abstract:
Hippocampal volume is one of the best
established biomarkers for Alzheimer's disease. However, for appropriate
use in clinical trials research, the evolution of hippocampal volume needs
to be well understood. Recent theoretical models propose a sigmoidal
pattern for its evolution. To support this theory, the use of Bayesian
nonparametric regression mixture models seems particularly suitable due to
the flexibility that models of this type can achieve and the
unsatisfactory predictive properties of semiparametric methods. In this
article, our aim is to develop an interpretable Bayesian nonparametric
regression model which allows inference with combinations of both
continuous and discrete covariates, as required for a full analysis of the
dataset. Simple arguments regarding the interpretation of Bayesian
nonparametric regression mixtures lead naturally to regression weights
based on normalized sums. Difficulty in working with the intractable
normalizing constant is overcome thanks to recent advances in MCMC methods
and the development of a novel auxiliary variable scheme. We apply the new
model and MCMC method to study the dynamics of hippocampal volume, and our
results provide statistical evidence in support of the theoretical
hypothesis.
Journal: Journal of the American Statistical Association
Pages: 477-490
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.879061
File-URL: http://hdl.handle.net/10.1080/01621459.2013.879061
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:477-490
Template-Type: ReDIF-Article 1.0
Author-Name: Daniel M. Percival
Author-X-Name-First: Daniel M.
Author-X-Name-Last: Percival
Author-Name: Donald B. Percival
Author-X-Name-First: Donald B.
Author-X-Name-Last: Percival
Author-Name: Donald W. Denbo
Author-X-Name-First: Donald W.
Author-X-Name-Last: Denbo
Author-Name: Edison Gica
Author-X-Name-First: Edison
Author-X-Name-Last: Gica
Author-Name: Paul Y. Huang
Author-X-Name-First: Paul Y.
Author-X-Name-Last: Huang
Author-Name: Harold O. Mofjeld
Author-X-Name-First: Harold O.
Author-X-Name-Last: Mofjeld
Author-Name: Michael C. Spillane
Author-X-Name-First: Michael C.
Author-X-Name-Last: Spillane
Title: Automated Tsunami Source Modeling Using the Sweeping Window Positive Elastic Net
Abstract:
In response to hazards posed by
earthquake-induced tsunamis, the National Oceanic and Atmospheric
Administration developed a system for issuing timely warnings to coastal
communities. This system, in part, involves matching data collected in
real time from deep-ocean buoys to a database of precomputed geophysical
models, each associated with a geographical location. Currently, trained
operators must handpick models from the database using the epicenter of
the earthquake as guidance, which can delay issuing of warnings. In this
article, we introduce an automatic procedure to select models to improve
the timing and accuracy of these warnings. This procedure uses an
elastic-net-based penalized and constrained linear least-squares estimator
in conjunction with a sweeping window. This window ensures that selected
models are close spatially, which is desirable from geophysical
considerations. We use the Akaike information criterion to settle on a
particular window and to set the tuning parameters associated with the
elastic net. Test data from the 2006 Kuril Islands and the devastating
2011 Japan tsunamis show that the automatic procedure yields model fits
and verification equal to or better than those from a time-consuming
hand-selected solution.
Journal: Journal of the American Statistical Association
Pages: 491-499
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.879062
File-URL: http://hdl.handle.net/10.1080/01621459.2013.879062
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:491-499
Template-Type: ReDIF-Article 1.0
Author-Name: Carl Schmertmann
Author-X-Name-First: Carl
Author-X-Name-Last: Schmertmann
Author-Name: Emilio Zagheni
Author-X-Name-First: Emilio
Author-X-Name-Last: Zagheni
Author-Name: Joshua R. Goldstein
Author-X-Name-First: Joshua R.
Author-X-Name-Last: Goldstein
Author-Name: Mikko Myrskylä
Author-X-Name-First: Mikko
Author-X-Name-Last: Myrskylä
Title: Bayesian Forecasting of Cohort Fertility
Abstract:
There are signs that fertility in rich
countries may have stopped declining, but this depends critically on
whether women currently in reproductive ages are postponing or reducing
lifetime fertility. Analysis of average completed family sizes requires
forecasts of remaining fertility for women born 1970-1995. We propose a
Bayesian model for fertility that incorporates a priori information about
patterns over age and time. We use a new dataset, the Human Fertility
Database (HFD), to construct improper priors that give high weight to
historically plausible rate surfaces. In the age dimension, cohort
schedules should be well approximated by principal components of HFD
schedules. In the time dimension, series should be smooth and
approximately linear over short spans. We calibrate priors so that
approximation residuals have theoretical distributions similar to
historical HFD data. Our priors use quadratic penalties and imply a
high-dimensional normal posterior distribution for each country's
fertility surface. Forecasts for HFD cohorts currently aged 15-44 show
consistent patterns. In the United States, Northern Europe, and Western
Europe, slight rebounds in completed fertility are likely. In Central and
Southern Europe, East Asia, and Brazil, there is little evidence for a
rebound. Our methods could be applied to other forecasting and
missing-data problems with only minor modifications.
Journal: Journal of the American Statistical Association
Pages: 500-513
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2014.881738
File-URL: http://hdl.handle.net/10.1080/01621459.2014.881738
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:500-513
Template-Type: ReDIF-Article 1.0
Author-Name: Dandan Liu
Author-X-Name-First: Dandan
Author-X-Name-Last: Liu
Author-Name: Yingye Zheng
Author-X-Name-First: Yingye
Author-X-Name-Last: Zheng
Author-Name: Ross L. Prentice
Author-X-Name-First: Ross L.
Author-X-Name-Last: Prentice
Author-Name: Li Hsu
Author-X-Name-First: Li
Author-X-Name-Last: Hsu
Title: Estimating Risk With Time-to-Event Data: An Application to the Women's Health Initiative
Abstract:
Accurate and individualized risk
prediction is critical for population control of chronic diseases such as
cancer and cardiovascular disease. Large cohort studies provide valuable
resources for building risk prediction models, as the risk factors are
collected at the baseline and subjects are followed over time until
disease occurrence or termination of the study. However, for rare diseases
the baseline risk may not be estimated reliably based on cohort data only,
due to sparse events. In this article, we propose to make use of external
information to improve efficiency for estimating time-dependent absolute
risk. We derive the relationship between external disease incidence rates
and the baseline risk, and incorporate the external disease incidence
information into estimation of absolute risks, while allowing for
potential differences in disease incidence rates between the cohort and
external sources. The asymptotic properties, namely, uniform consistency
and weak convergence, of the proposed estimators are established.
Simulation results show that the proposed estimator for absolute risk is
more efficient than that based on the Breslow estimator, which does not
use external disease incidence rates. A large cohort study, the Women's
Health Initiative Observational Study, is used to illustrate the proposed
method. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 514-524
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2014.881739
File-URL: http://hdl.handle.net/10.1080/01621459.2014.881739
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:514-524
Template-Type: ReDIF-Article 1.0
Author-Name: Ick Hoon Jin
Author-X-Name-First: Ick Hoon
Author-X-Name-Last: Jin
Author-Name: Suyu Liu
Author-X-Name-First: Suyu
Author-X-Name-Last: Liu
Author-Name: Peter F. Thall
Author-X-Name-First: Peter F.
Author-X-Name-Last: Thall
Author-Name: Ying Yuan
Author-X-Name-First: Ying
Author-X-Name-Last: Yuan
Title: Using Data Augmentation to Facilitate Conduct of Phase I-II Clinical Trials With Delayed Outcomes
Abstract:
A practical impediment in adaptive
clinical trials is that outcomes must be observed soon enough to apply
decision rules to choose treatments for new patients. For example, if
outcomes take up to six weeks to evaluate and the accrual rate is one
patient per week, on average three new patients will be accrued while
waiting to evaluate the outcomes of the previous three patients. The
question is how to treat the new patients. This logistical problem
persists throughout the trial. Various ad hoc practical solutions are
used, none entirely satisfactory. We focus on this problem in phase I-II
clinical trials that use binary toxicity and efficacy, defined in terms of
event times, to choose doses adaptively for successive cohorts. We propose
a general approach to this problem that treats late-onset outcomes as
missing data, uses data augmentation to impute missing outcomes from
posterior predictive distributions computed from partial follow-up times
and complete outcome data, and applies the design's decision rules using
the completed data. We illustrate the method with two cancer trials
conducted using a phase I-II design based on efficacy-toxicity trade-offs,
including a computer simulation study. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 525-536
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2014.881740
File-URL: http://hdl.handle.net/10.1080/01621459.2014.881740
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:525-536
Template-Type: ReDIF-Article 1.0
Author-Name: Haim Y. Bar
Author-X-Name-First: Haim Y.
Author-X-Name-Last: Bar
Author-Name: James G. Booth
Author-X-Name-First: James G.
Author-X-Name-Last: Booth
Author-Name: Martin T. Wells
Author-X-Name-First: Martin T.
Author-X-Name-Last: Wells
Title: A Bivariate Model for Simultaneous Testing in Bioinformatics Data
Abstract:
We develop a novel approach for testing
treatment effects in high-throughput data. Most previous works on this
topic focused on testing for differences between the means, but recently
it has been recognized that testing for differential variation is probably
as important. We take this a step further and introduce a bivariate
modeling strategy that accounts for both differential expression and
differential variation. Our model-based approach, in which the
differential mean and variance are considered random effects, results in
shrinkage estimation and powerful tests as it borrows strength across
levels. We show in simulations that the method yields a substantial gain
in the power to detect differential means when differential variation is
present. Our case studies show that the model is realistic in a wide range
of applications. Furthermore, a hierarchical estimation approach
implemented using the EM algorithm results in a computationally efficient
method which is particularly well-suited for "multiple testing"
situations. Finally, we develop a power and sample size calculation tool
that mirrors the estimation and inference method described in this
article, and can be used to design experiments involving thousands of
simultaneous tests.
Journal: Journal of the American Statistical Association
Pages: 537-547
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2014.884502
File-URL: http://hdl.handle.net/10.1080/01621459.2014.884502
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:537-547
Template-Type: ReDIF-Article 1.0
Author-Name: Xiaosun Lu
Author-X-Name-First: Xiaosun
Author-X-Name-Last: Lu
Author-Name: J. S. Marron
Author-X-Name-First: J. S.
Author-X-Name-Last: Marron
Author-Name: Perry Haaland
Author-X-Name-First: Perry
Author-X-Name-Last: Haaland
Title: Object-Oriented Data Analysis of Cell Images
Abstract:
This article discusses a study of cell
images in cell culture biology from an object-oriented point of view. The
motivation of this research is to develop a statistical approach to cell
image analysis that better supports the automated development of stem cell
growth media. A major hurdle in this process is the need for human
expertise, based on studying cells under the microscope, to make decisions
about the next step of the cell culture process. We aim to use digital
imaging technology coupled with statistical analysis to tackle this
important problem. The discussion in this article highlights a common
critical issue: choice of data objects. Instead of conventionally treating
either the individual cells or the wells (a container in which the cells
are grown) as data objects, a new type of data object is proposed: the
union of a well with its corresponding set of cells. The image data
analysis suggests that the cell-well unions can be a better choice of data
objects than the cells or the wells alone. The data are available in the
online supplementary materials.
Journal: Journal of the American Statistical Association
Pages: 548-559
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2014.884503
File-URL: http://hdl.handle.net/10.1080/01621459.2014.884503
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:548-559
Template-Type: ReDIF-Article 1.0
Author-Name: Kun Chen
Author-X-Name-First: Kun
Author-X-Name-Last: Chen
Author-Name: Kung-Sik Chan
Author-X-Name-First: Kung-Sik
Author-X-Name-Last: Chan
Author-Name: Nils Chr. Stenseth
Author-X-Name-First: Nils Chr.
Author-X-Name-Last: Stenseth
Title: Source-Sink Reconstruction Through Regularized Multicomponent Regression Analysis-With Application to Assessing Whether North Sea Cod Larvae Contributed to Local Fjord Cod in Skagerrak
Abstract:
The problem of reconstructing the
source-sink dynamics arises in many biological systems. Our research is
motivated by marine applications where newborns are passively dispersed by
ocean currents from several potential spawning sources to settle in
various nursery regions that collectively constitute the sink. The
reconstruction of the sparse source-sink linkage pattern, that is, to
identify which sources contribute to which regions in the sink, is a
challenging task in marine ecology. We derive a constrained nonlinear
multicomponent regression model for source-sink reconstruction, which is
capable of simultaneously selecting important linkages from the sources to
the sink regions and making inference about the unobserved spawning
activities at the sources. A sparsity-inducing and
nonnegativity-constrained regularization approach is developed for model
estimation, and theoretically we show that our estimator enjoys the oracle
properties. The empirical performance of the method is investigated via
simulation studies mimicking real ecological applications. We examine the
transport hypothesis that Atlantic cod larvae were transported by sea
currents from the North Sea to a few exposed coastal fjords along the
Norwegian Skagerrak. Our findings on the spawning date distribution are
consistent with results from previous studies, and the proposed approach
for the first time provides valid statistical support for the larval drift
conjecture. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 560-573
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2014.898583
File-URL: http://hdl.handle.net/10.1080/01621459.2014.898583
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:560-573
Template-Type: ReDIF-Article 1.0
Author-Name: L. A. Stefanski
Author-X-Name-First: L. A.
Author-X-Name-Last: Stefanski
Author-Name: Yichao Wu
Author-X-Name-First: Yichao
Author-X-Name-Last: Wu
Author-Name: Kyle White
Author-X-Name-First: Kyle
Author-X-Name-Last: White
Title: Variable Selection in Nonparametric Classification Via Measurement Error Model Selection Likelihoods
Abstract:
Using the relationships among ridge
regression, LASSO estimation, and measurement error attenuation as
motivation, a new measurement-error-model-based approach to variable
selection is developed. After describing the approach in the familiar
context of linear regression, we apply it to the problem of variable
selection in nonparametric classification, resulting in a new kernel-based
classifier with LASSO-like shrinkage and variable-selection properties.
Finite-sample performance of the new classification method is studied via
simulation and real data examples, and consistency of the method is
studied theoretically. Supplementary materials for the article are
available online.
Journal: Journal of the American Statistical Association
Pages: 574-589
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.858630
File-URL: http://hdl.handle.net/10.1080/01621459.2013.858630
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:574-589
Template-Type: ReDIF-Article 1.0
Author-Name: Ngai Hang Chan
Author-X-Name-First: Ngai Hang
Author-X-Name-Last: Chan
Author-Name: Chun Yip Yau
Author-X-Name-First: Chun Yip
Author-X-Name-Last: Yau
Author-Name: Rong-Mao Zhang
Author-X-Name-First: Rong-Mao
Author-X-Name-Last: Zhang
Title: Group LASSO for Structural Break Time Series
Abstract:
Consider a structural break autoregressive
(SBAR) process in which, for j = 1, ..., m + 1,
{t_1, ..., t_m} are change-points with
1 = t_0 < t_1 < ⋅⋅⋅ < t_{m+1} = n + 1,
σ(·) is a measurable function, and {ϵ_t} are
white noise with unit variance. In practice, the number of change-points
m is usually assumed to be known and small, because a
large m would involve a huge amount of computational
burden for parameter estimation. By reformulating the problem in a
variable selection context, the group least absolute shrinkage and
selection operator (LASSO) is proposed to estimate an SBAR model when
m is unknown. It is shown that both m
and the locations of the change-points {t_1, ..., t_m} can be
consistently estimated from the data, and the computation can be
performed efficiently. An improved practical version that incorporates
the group LASSO and the stepwise regression variable selection
technique is discussed. Simulation studies
are conducted to assess the finite sample performance. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 590-599
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.866566
File-URL: http://hdl.handle.net/10.1080/01621459.2013.866566
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:590-599
Template-Type: ReDIF-Article 1.0
Author-Name: Guangming Pan
Author-X-Name-First: Guangming
Author-X-Name-Last: Pan
Author-Name: Jiti Gao
Author-X-Name-First: Jiti
Author-X-Name-Last: Gao
Author-Name: Yanrong Yang
Author-X-Name-First: Yanrong
Author-X-Name-Last: Yang
Title: Testing Independence Among a Large Number of High-Dimensional Random Vectors
Abstract:
Capturing dependence among a large number
of high-dimensional random vectors is a very important and challenging
problem. By arranging n random vectors of length
p in the form of a matrix, we develop a linear spectral
statistic of the constructed matrix to test whether the n
random vectors are independent or not. Specifically, the proposed
statistic can also be applied to n random vectors, each
of whose elements can be written as either a linear stationary process or
a linear combination of independent random variables. The asymptotic
distribution of the proposed test statistic is established as
n → ∞. To avoid
estimating the spectrum of each random vector, a modified test statistic,
which is based on splitting the original n vectors into
two equal parts and eliminating the term that contains the inner structure
of each random vector or time series, is constructed. The facts that the
limiting distribution is normal and there is no need to know the inner
structure of each investigated random vector result in simple
implementation of the constructed test statistic. Simulation results
demonstrate that the proposed test is powerful against several commonly
used dependence structures. An empirical application to detecting
dependence among the closing prices of several stocks in the S&P 500 also
illustrates the applicability and effectiveness of the proposed test.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 600-612
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.872037
File-URL: http://hdl.handle.net/10.1080/01621459.2013.872037
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:600-612
Template-Type: ReDIF-Article 1.0
Author-Name: Giorgos Minas
Author-X-Name-First: Giorgos
Author-X-Name-Last: Minas
Author-Name: John A.D. Aston
Author-X-Name-First: John A.D.
Author-X-Name-Last: Aston
Author-Name: Nigel Stallard
Author-X-Name-First: Nigel
Author-X-Name-Last: Stallard
Title: Adaptive Multivariate Global Testing
Abstract:
We present a methodology for dealing with
recent challenges in testing global hypotheses using multivariate
observations. The proposed tests target situations, often arising in
emerging applications of neuroimaging, where the sample size
n is relatively small compared with the observations'
dimension K. We employ adaptive designs allowing for
sequential modifications of the test statistics adapting to accumulated
data. The adaptations are optimal in the sense of maximizing the
predictive power of the test at each interim analysis while still
controlling the Type I error. Optimality is obtained by a general result
applicable to typical adaptive design settings. Further, we prove that the
potentially high-dimensional design space of the tests can be reduced to a
low-dimensional projection space enabling us to perform simpler power
analysis studies, including comparisons to alternative tests. We
illustrate the substantial improvement in efficiency that the proposed
tests can make over standard tests, especially in the case of
n smaller or slightly larger than K. The
methods are also studied empirically using both simulated data and data
from an EEG study, where the use of prior knowledge substantially
increases the power of the test. Supplementary materials for this article
are available online.
Journal: Journal of the American Statistical Association
Pages: 613-623
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.870905
File-URL: http://hdl.handle.net/10.1080/01621459.2013.870905
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:613-623
Template-Type: ReDIF-Article 1.0
Author-Name: Jing Lei
Author-X-Name-First: Jing
Author-X-Name-Last: Lei
Title: Adaptive Global Testing for Functional Linear Models
Abstract:
This article studies global testing of the
slope function in functional linear regression models. A major challenge
in functional global testing is to choose the dimension of projection when
approximating the functional regression model by a finite dimensional
multivariate linear regression model. We develop a new method that
simultaneously tests the slope vectors in a sequence of functional
principal components regression models. The sequence of models being
tested is determined by the sample size and is an integral part of the
testing procedure. Our theoretical analysis shows that the proposed method
is uniformly powerful over a class of smooth alternatives when the signal
to noise ratio exceeds the detection boundary. The methods and results
reflect the deep connection between the functional linear regression model
and the Gaussian sequence model. We also present an extensive simulation
study and a real data example to illustrate the finite sample performance
of our method. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 624-634
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.856794
File-URL: http://hdl.handle.net/10.1080/01621459.2013.856794
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:624-634
Template-Type: ReDIF-Article 1.0
Author-Name: Xu Liu
Author-X-Name-First: Xu
Author-X-Name-Last: Liu
Author-Name: Hongmei Jiang
Author-X-Name-First: Hongmei
Author-X-Name-Last: Jiang
Author-Name: Yong Zhou
Author-X-Name-First: Yong
Author-X-Name-Last: Zhou
Title: Local Empirical Likelihood Inference for Varying-Coefficient Density-Ratio Models Based on Case-Control Data
Abstract:
In this article, we develop a
varying-coefficient density-ratio model for case-control studies. The case
and control samples come from two different distributions. Under the model
assumption, the ratio of the two densities is related to the linear
combination of covariates with varying coefficients through a known
function. A special case is the exponential tilt model where the log ratio
of the two densities is a linear function of covariates. We propose a
local empirical likelihood (EL) approach to estimate the nonparametric
coefficient functions. Under some regularity assumptions, the proposed
estimators are shown to be consistent and asymptotically normally
distributed. The sieve empirical likelihood ratio (SELR) test statistic
for testing whether the varying coefficients are constant, and other
related hypotheses, is constructed and shown to follow approximately a
chi-squared distribution. We introduce a modified bootstrap procedure to
estimate the null distribution of the SELR when the sample size is small. We
also examine the performance of the proposed method for finite sample sizes
through simulation studies and illustrate it with a real dataset.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 635-646
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.858629
File-URL: http://hdl.handle.net/10.1080/01621459.2013.858629
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:635-646
Template-Type: ReDIF-Article 1.0
Author-Name: Bruno Scarpa
Author-X-Name-First: Bruno
Author-X-Name-Last: Scarpa
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Enriched Stick-Breaking Processes for Functional Data
Abstract:
In many applications involving functional
data, prior information is available about the proportion of curves having
different attributes. It is not straightforward to include such
information in existing procedures for functional data analysis.
Generalizing the functional Dirichlet process (FDP), we propose a class of
stick-breaking priors for distributions of functions. These priors
incorporate functional atoms drawn from constrained stochastic processes.
The stick-breaking weights are specified to allow user-specified prior
probabilities for curve attributes, with hyperpriors accommodating
uncertainty. Compared with the FDP, the random distribution is enriched
for curves having attributes known to be common. Theoretical properties
are considered, methods are developed for posterior computation, and the
approach is illustrated using data on temperature curves in menstrual
cycles.
Journal: Journal of the American Statistical Association
Pages: 647-660
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.866564
File-URL: http://hdl.handle.net/10.1080/01621459.2013.866564
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:647-660
Template-Type: ReDIF-Article 1.0
Author-Name: Shuzhuan Zheng
Author-X-Name-First: Shuzhuan
Author-X-Name-Last: Zheng
Author-Name: Lijian Yang
Author-X-Name-First: Lijian
Author-X-Name-Last: Yang
Author-Name: Wolfgang K. Härdle
Author-X-Name-First: Wolfgang K.
Author-X-Name-Last: Härdle
Title: A Smooth Simultaneous Confidence Corridor for the Mean of Sparse Functional Data
Abstract:
Functional data analysis (FDA) has become
an important area of statistics research in the recent decade, yet a
smooth simultaneous confidence corridor (SCC) does not exist in the
literature for the mean function of sparse functional data. SCC is a
powerful tool for making statistical inference on an entire unknown
function, nonetheless classic "Hungarian embedding" techniques for
establishing asymptotic correctness of SCC completely fail for sparse
functional data. We propose a local linear SCC and a shoal of confidence
intervals (SCI) for the mean function of sparse functional data, and
establish that it is asymptotically equivalent to the SCC of independent
regression data, using new results from Gaussian process extreme value
theory. The SCC procedure is examined in simulations for its superior
theoretical accuracy and performance, and used to analyze growth curve
data, confirming findings with quantified high significance levels.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 661-673
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.866899
File-URL: http://hdl.handle.net/10.1080/01621459.2013.866899
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:661-673
Template-Type: ReDIF-Article 1.0
Author-Name: Roger Koenker
Author-X-Name-First: Roger
Author-X-Name-Last: Koenker
Author-Name: Ivan Mizera
Author-X-Name-First: Ivan
Author-X-Name-Last: Mizera
Title: Convex Optimization, Shape Constraints, Compound Decisions, and Empirical Bayes Rules
Abstract:
Estimation of mixture densities for the
classical Gaussian compound decision problem and their associated
(empirical) Bayes rules is considered from two new perspectives. The
first, motivated by Brown and Greenshtein, introduces a nonparametric
maximum likelihood estimator of the mixture density subject to a
monotonicity constraint on the resulting Bayes rule. The second, motivated
by Jiang and Zhang, proposes a new approach to computing the
Kiefer-Wolfowitz nonparametric maximum likelihood estimator for mixtures.
In contrast to prior methods for these problems, our new approaches are
cast as convex optimization problems that can be efficiently solved by
modern interior point methods. In particular, we show that the
reformulation of the Kiefer-Wolfowitz estimator as a convex optimization
problem reduces the computational effort by several orders of
magnitude for typical problems, by comparison to prior
EM-algorithm based methods, and thus greatly expands the practical
applicability of the resulting methods. Our new procedures are compared
with several existing empirical Bayes methods in simulations employing the
well-established design of Johnstone and Silverman. Some further
comparisons are made based on prediction of baseball batting averages. A
Bernoulli mixture application is briefly considered in the penultimate
section.
Journal: Journal of the American Statistical Association
Pages: 674-685
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.869224
File-URL: http://hdl.handle.net/10.1080/01621459.2013.869224
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:674-685
Template-Type: ReDIF-Article 1.0
Author-Name: Hua Zhou
Author-X-Name-First: Hua
Author-X-Name-Last: Zhou
Author-Name: Yichao Wu
Author-X-Name-First: Yichao
Author-X-Name-Last: Wu
Title: A Generic Path Algorithm for Regularized Statistical Estimation
Abstract:
Regularization is widely used in
statistics and machine learning to prevent overfitting and gear solution
toward prior information. In general, a regularized estimation problem
minimizes the sum of a loss function and a penalty term. The penalty term
is usually weighted by a tuning parameter and encourages certain
constraints on the parameters to be estimated. Particular choices of
constraints lead to the popular lasso, fused-lasso, and other generalized
ℓ1 penalized regression methods. In this article we
follow a recent idea by Wu and propose an exact path solver based on
ordinary differential equations (EPSODE) that works for any convex loss
function and can deal with generalized ℓ1 penalties as
well as more complicated regularization such as inequality constraints
encountered in shape-restricted regressions and nonparametric density
estimation. Nonasymptotic error bounds for the equality regularized
estimates are derived. In practice, the EPSODE can be coupled with AIC,
BIC, C_p, or cross-validation to select an optimal tuning parameter, or provide a
convenient model space for performing model averaging or aggregation. Our
applications to generalized ℓ1 regularized generalized
linear models, shape-restricted regressions, Gaussian graphical models,
and nonparametric density estimation showcase the potential of the EPSODE
algorithm. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 686-699
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.864166
File-URL: http://hdl.handle.net/10.1080/01621459.2013.864166
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:686-699
Template-Type: ReDIF-Article 1.0
Author-Name: Hulin Wu
Author-X-Name-First: Hulin
Author-X-Name-Last: Wu
Author-Name: Tao Lu
Author-X-Name-First: Tao
Author-X-Name-Last: Lu
Author-Name: Hongqi Xue
Author-X-Name-First: Hongqi
Author-X-Name-Last: Xue
Author-Name: Hua Liang
Author-X-Name-First: Hua
Author-X-Name-Last: Liang
Title: Sparse Additive Ordinary Differential Equations for Dynamic Gene Regulatory Network Modeling
Abstract:
The gene regulation network (GRN) is a
high-dimensional complex system, which can be represented by various
mathematical or statistical models. The ordinary differential equation
(ODE) model is one of the popular dynamic GRN models. High-dimensional
linear ODE models have been proposed to identify GRNs, but with a
limitation of the linear regulation effect assumption. In this article, we
propose a sparse additive ODE (SA-ODE) model, coupled with ODE estimation
methods and adaptive group least absolute shrinkage and selection operator
(LASSO) techniques, to model dynamic GRNs that could flexibly deal with
nonlinear regulation effects. The asymptotic properties of the proposed
method are established and simulation studies are performed to validate
the proposed approach. An application example for identifying the
nonlinear dynamic GRN of T-cell activation is used to illustrate the
usefulness of the proposed method.
Journal: Journal of the American Statistical Association
Pages: 700-716
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.859617
File-URL: http://hdl.handle.net/10.1080/01621459.2013.859617
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:700-716
Template-Type: ReDIF-Article 1.0
Author-Name: Aurore Delaigle
Author-X-Name-First: Aurore
Author-X-Name-Last: Delaigle
Author-Name: Peter Hall
Author-X-Name-First: Peter
Author-X-Name-Last: Hall
Title: Parametrically Assisted Nonparametric Estimation of a Density in the Deconvolution Problem
Abstract:
Nonparametric estimation of a density from
contaminated data is a difficult problem, for which convergence rates are
notoriously slow. We introduce parametrically assisted nonparametric
estimators which can dramatically improve on the performance of standard
nonparametric estimators when the assumed model is close to the true
density, without greatly degrading the quality of purely nonparametric
estimators in other cases. We establish optimal convergence rates for our
problem and discuss estimators that attain these rates. The very good
numerical properties of the methods are illustrated via a simulation
study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 717-729
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.857611
File-URL: http://hdl.handle.net/10.1080/01621459.2013.857611
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:717-729
Template-Type: ReDIF-Article 1.0
Author-Name: Forrest W. Crawford
Author-X-Name-First: Forrest W.
Author-X-Name-Last: Crawford
Author-Name: Vladimir N. Minin
Author-X-Name-First: Vladimir N.
Author-X-Name-Last: Minin
Author-Name: Marc A. Suchard
Author-X-Name-First: Marc A.
Author-X-Name-Last: Suchard
Title: Estimation for General Birth-Death Processes
Abstract:
Birth-death processes (BDPs) are
continuous-time Markov chains that track the number of "particles" in a
system over time. While widely used in population biology, genetics, and
ecology, statistical inference of the instantaneous particle birth and
death rates remains largely limited to restrictive linear BDPs in which
per-particle birth and death rates are constant. Researchers often observe
the number of particles at discrete times, necessitating data augmentation
procedures such as expectation-maximization (EM) to find maximum
likelihood estimates (MLEs). For BDPs on finite state-spaces, there are
powerful matrix methods for computing the conditional expectations needed
for the E-step of the EM algorithm. For BDPs on infinite state-spaces,
closed-form solutions for the E-step are available for some linear models,
but most previous work has resorted to time-consuming simulation.
Remarkably, we show that the E-step conditional expectations can be
expressed as convolutions of computable transition probabilities for any
general BDP with arbitrary rates. This important observation, along with a
convenient continued fraction representation of the Laplace transforms of
the transition probabilities, allows for novel and efficient computation
of the conditional expectations for all BDPs, eliminating the need for
truncation of the state-space or costly simulation. We use this insight to
derive EM algorithms that yield maximum likelihood estimation for general
BDPs characterized by various rate models, including generalized linear
models (GLM). We show that our Laplace convolution technique outperforms
competing methods when they are available and demonstrate a technique to
accelerate EM algorithm convergence. We validate our approach using
synthetic data and then apply our methods to cancer cell growth and
estimation of mutation parameters in microsatellite evolution.
Journal: Journal of the American Statistical Association
Pages: 730-747
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.866565
File-URL: http://hdl.handle.net/10.1080/01621459.2013.866565
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:730-747
Template-Type: ReDIF-Article 1.0
Author-Name: Marco Di Marzio
Author-X-Name-First: Marco
Author-X-Name-Last: Di Marzio
Author-Name: Agnese Panzera
Author-X-Name-First: Agnese
Author-X-Name-Last: Panzera
Author-Name: Charles C. Taylor
Author-X-Name-First: Charles C.
Author-X-Name-Last: Taylor
Title: Nonparametric Regression for Spherical Data
Abstract:
We develop nonparametric smoothing for
regression when both the predictor and the response variables are defined
on a sphere of whatever dimension. A local polynomial fitting approach is
pursued, which retains all the advantages in terms of rate optimality,
interpretability, and ease of implementation widely observed in the
standard setting. Our estimates have a multi-output nature, meaning that
each coordinate is estimated separately within a scheme of regression with
a linear response. The main properties include linearity and
rotational equivariance. This research has been motivated by the fact that
very few models describe this kind of regression, and the existing methods
are not widely applicable because they are parametric in nature and also
require the prediction and response spaces to have the same dimensionality,
along with a nonrandom design. Our approach does not suffer from these
limitations.
Real-data case studies and simulation experiments are used to illustrate
the effectiveness of the method.
Journal: Journal of the American Statistical Association
Pages: 748-763
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.866567
File-URL: http://hdl.handle.net/10.1080/01621459.2013.866567
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:748-763
Template-Type: ReDIF-Article 1.0
Author-Name: Miguel de Carvalho
Author-X-Name-First: Miguel
Author-X-Name-Last: de Carvalho
Author-Name: Anthony C. Davison
Author-X-Name-First: Anthony C.
Author-X-Name-Last: Davison
Title: Spectral Density Ratio Models for Multivariate Extremes
Abstract:
The modeling of multivariate extremes has
received increasing recent attention because of its importance in risk
assessment. In classical statistics of extremes, the joint distribution of
two or more extremes has a nonparametric form, subject to moment
constraints. This article develops a semiparametric model for the
situation where several multivariate extremal distributions are linked
through the action of a covariate on an unspecified baseline distribution,
through a so-called density ratio model. Theoretical and numerical aspects
of empirical likelihood inference for this model are discussed, and an
application is given to pairs of extreme forest temperatures.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 764-776
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.872651
File-URL: http://hdl.handle.net/10.1080/01621459.2013.872651
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:764-776
Template-Type: ReDIF-Article 1.0
Author-Name: Chao Wang
Author-X-Name-First: Chao
Author-X-Name-Last: Wang
Author-Name: Heng Liu
Author-X-Name-First: Heng
Author-X-Name-Last: Liu
Author-Name: Jian-Feng Yao
Author-X-Name-First: Jian-Feng
Author-X-Name-Last: Yao
Author-Name: Richard A. Davis
Author-X-Name-First: Richard A.
Author-X-Name-Last: Davis
Author-Name: Wai Keung Li
Author-X-Name-First: Wai Keung
Author-X-Name-Last: Li
Title: Self-Excited Threshold Poisson Autoregression
Abstract:
This article studies theory and inference
of an observation-driven model for time series of counts. It is assumed
that the observations follow a Poisson distribution conditioned on an
accompanying intensity process, which is equipped with a two-regime
structure according to the magnitude of the lagged observations.
Generalized from the Poisson autoregression, it allows more flexible, and
even negative, correlation in the observations, which cannot be produced
by the single-regime model. Classical Markov chain theory and Lyapunov's
method are used to derive the conditions under which the process has a
unique invariant probability measure and to show a strong law of large
numbers of the intensity process. Moreover, the asymptotic theory of the
maximum likelihood estimates of the parameters is established. A
simulation study and a real-data application are considered, where the
model is applied to the number of major earthquakes in the world.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 777-787
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.872994
File-URL: http://hdl.handle.net/10.1080/01621459.2013.872994
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:777-787
Template-Type: ReDIF-Article 1.0
Author-Name: Zongwu Cai
Author-X-Name-First: Zongwu
Author-X-Name-Last: Cai
Author-Name: Xian Wang
Author-X-Name-First: Xian
Author-X-Name-Last: Wang
Title: Selection of Mixed Copula Model via Penalized Likelihood
Abstract:
A fundamental issue of applying a copula
method in applications is how to choose an appropriate copula function for
a given problem. In this article we address this issue by proposing a new
copula selection approach via penalized likelihood plus a shrinkage
operator. The proposed method selects an appropriate copula function and
estimates the related parameters simultaneously. We establish the
asymptotic properties of the proposed penalized likelihood estimator,
including the rate of convergence and asymptotic normality and
abnormality. In particular, when the true coefficient parameters may lie on
the boundary of the parameter space and the dependence parameters lie in
an unidentified subset of the parameter space, we show that the limiting
distribution of the boundary parameter estimator is half-normal and the
penalized likelihood estimator of the unidentified parameter converges to an
arbitrary value. Finally, Monte Carlo simulation studies are carried out
to illustrate the finite sample performance of the proposed approach and
the proposed method is used to investigate the correlation structure and
comovement of financial stock markets.
Journal: Journal of the American Statistical Association
Pages: 788-801
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.873366
File-URL: http://hdl.handle.net/10.1080/01621459.2013.873366
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:788-801
Template-Type: ReDIF-Article 1.0
Author-Name: Antonio Lijoi
Author-X-Name-First: Antonio
Author-X-Name-Last: Lijoi
Author-Name: Bernardo Nipoti
Author-X-Name-First: Bernardo
Author-X-Name-Last: Nipoti
Title: A Class of Hazard Rate Mixtures for Combining Survival Data From Different Experiments
Abstract:
Mixture models for hazard rate functions
are widely used tools for addressing the statistical analysis of survival
data subject to a censoring mechanism. The present article introduces a
new class of vectors of random hazard rate functions that are expressed as
kernel mixtures of dependent completely random measures. This leads to
define dependent nonparametric prior processes that are suitably tailored
to draw inferences in the presence of heterogeneous observations. Besides
its flexibility, an important appealing feature of our proposal is
analytical tractability: we are, indeed, able to determine some relevant
distributional properties and a posterior characterization that is also
the key for devising an efficient Markov chain Monte Carlo sampler. For
illustrative purposes, we specialize our general results to a class of
dependent extended gamma processes. We finally display a few numerical
examples, including both simulated and real two-sample datasets: these
allow us to identify the effect of a borrowing strength phenomenon and
provide evidence of the effectiveness of the prior to deal with datasets
for which the proportional hazards assumption does not hold true.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 802-814
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.869499
File-URL: http://hdl.handle.net/10.1080/01621459.2013.869499
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:802-814
Template-Type: ReDIF-Article 1.0
Author-Name: R. Dennis Cook
Author-X-Name-First: R. Dennis
Author-X-Name-Last: Cook
Author-Name: Xin Zhang
Author-X-Name-First: Xin
Author-X-Name-Last: Zhang
Title: Fused Estimators of the Central Subspace in Sufficient Dimension Reduction
Abstract:
When studying the regression of a
univariate variable Y on a vector x of
predictors, most existing sufficient dimension-reduction (SDR) methods
require the construction of slices of Y
to estimate moments of the conditional distribution of X
given Y. But there is no widely accepted method for
choosing the number of slices, while a poorly chosen slicing scheme may
produce miserable results. We propose a novel and easily implemented
fusing method that can mitigate the problem of choosing a slicing scheme
and improve estimation efficiency at the same time. We develop two fused
estimators, called FIRE and DIRE, based on an optimal inverse regression
estimator. The asymptotic variance of FIRE is no larger than that of the
original methods regardless of the choice of slicing scheme, while DIRE is
less computationally intensive and more robust. Simulation studies show that
the fused estimators perform effectively the same as or substantially
better than the parent methods. Fused estimators based on other methods
can be developed in parallel: fused sliced inverse regression (SIR), fused
central solution space (CSS)-SIR, and fused likelihood-based method (LAD)
are introduced briefly. Simulation studies of the fused CSS-SIR and fused
LAD estimators show substantial gain over their parent methods. A real
data example is also presented for illustration and comparison.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 815-827
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.866563
File-URL: http://hdl.handle.net/10.1080/01621459.2013.866563
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:815-827
Template-Type: ReDIF-Article 1.0
Author-Name: Veronika Ročková
Author-X-Name-First: Veronika
Author-X-Name-Last: Ročková
Author-Name: Edward I. George
Author-X-Name-First: Edward I.
Author-X-Name-Last: George
Title: EMVS: The EM Approach to Bayesian Variable Selection
Abstract:
Despite rapid developments in stochastic
search algorithms, the practicality of Bayesian variable selection methods
has continued to pose challenges. High-dimensional data are now routinely
analyzed, typically with many more covariates than observations. To
broaden the applicability of Bayesian variable selection for such
high-dimensional linear regression contexts, we propose EMVS, a
deterministic alternative to stochastic search based on an EM algorithm
which exploits a conjugate mixture prior formulation to quickly find
posterior modes. Combining a spike-and-slab regularization diagram for the
discovery of active predictor sets with subsequent rigorous evaluation of
posterior model probabilities, EMVS rapidly identifies promising sparse
high posterior probability submodels. External structural information such
as likely covariate groupings or network topologies is easily incorporated
into the EMVS framework. Deterministic annealing variants are seen to
improve the effectiveness of our algorithms by mitigating the posterior
multimodality associated with variable selection priors. The usefulness of
the EMVS approach is demonstrated on real high-dimensional data, where
computational complexity renders stochastic search less practical.
Journal: Journal of the American Statistical Association
Pages: 828-846
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.869223
File-URL: http://hdl.handle.net/10.1080/01621459.2013.869223
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:828-846
Template-Type: ReDIF-Article 1.0
Author-Name: Faming Liang
Author-X-Name-First: Faming
Author-X-Name-Last: Liang
Author-Name: Yichen Cheng
Author-X-Name-First: Yichen
Author-X-Name-Last: Cheng
Author-Name: Guang Lin
Author-X-Name-First: Guang
Author-X-Name-Last: Lin
Title: Simulated Stochastic Approximation Annealing for Global Optimization With a Square-Root Cooling Schedule
Abstract:
Simulated annealing has been widely used
in the solution of optimization problems. As is well known, simulated
annealing cannot be guaranteed to locate the global optima unless a
logarithmic cooling schedule is used. However, the logarithmic cooling
schedule is so slow that the required CPU time is prohibitive in practice.
This article proposes a new stochastic optimization
algorithm, the so-called simulated stochastic approximation annealing
algorithm, which is a combination of simulated annealing and the
stochastic approximation Monte Carlo algorithm. Under the framework of
stochastic approximation, it is shown that the new algorithm can work with
a cooling schedule in which the temperature can decrease much faster than
in the logarithmic cooling schedule, for example, a square-root cooling
schedule, while guaranteeing the global optima to be reached when the
temperature tends to zero. The new algorithm has been tested on a few
benchmark optimization problems, including feed-forward neural network
training and protein-folding. The numerical results indicate that the new
algorithm can significantly outperform simulated annealing and other
competitors. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 847-863
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2013.872993
File-URL: http://hdl.handle.net/10.1080/01621459.2013.872993
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:847-863
Template-Type: ReDIF-Article 1.0
Author-Name: L. Chen
Author-X-Name-First: L.
Author-X-Name-Last: Chen
Author-Name: W. W. Dou
Author-X-Name-First: W. W.
Author-X-Name-Last: Dou
Author-Name: Z. Qiao
Author-X-Name-First: Z.
Author-X-Name-Last: Qiao
Title: "Ensemble Subsampling for Imbalanced Multivariate Two-Sample Tests,"
Journal: Journal of the American Statistical Association
Pages: 871-871
Issue: 506
Volume: 109
Year: 2014
Month: 6
X-DOI: 10.1080/01621459.2014.899497
File-URL: http://hdl.handle.net/10.1080/01621459.2014.899497
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:506:p:871-871
Template-Type: ReDIF-Article 1.0
Author-Name: Kassandra Fronczyk
Author-X-Name-First: Kassandra
Author-X-Name-Last: Fronczyk
Author-Name: Athanasios Kottas
Author-X-Name-First: Athanasios
Author-X-Name-Last: Kottas
Title: A Bayesian Nonparametric Modeling Framework for Developmental Toxicity Studies
Abstract:
We develop a Bayesian nonparametric mixture modeling framework for
replicated count responses in dose-response settings. We explore this
methodology for modeling and risk assessment in developmental toxicity
studies, where the primary objective is to determine the relationship
between the level of exposure to a toxic chemical and the probability of a
physiological or biochemical response, or death. Data from these
experiments typically involve features that cannot be captured by standard
parametric approaches. To provide flexibility in the functional form of
both the response distribution and the probability of positive response,
the proposed mixture model is built from a dependent Dirichlet process
prior, with the dependence of the mixing distributions governed by the
dose level. The methodology is tested with a simulation study, which
involves also comparison with semiparametric Bayesian approaches to
highlight the practical utility of the dependent Dirichlet process
nonparametric mixture model. Further illustration is provided through the
analysis of data from two developmental toxicity studies.
Journal: Journal of the American Statistical Association
Pages: 873-888
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2013.830445
File-URL: http://hdl.handle.net/10.1080/01621459.2013.830445
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:873-888
Template-Type: ReDIF-Article 1.0
Author-Name: Peter Müller
Author-X-Name-First: Peter
Author-X-Name-Last: Müller
Author-Name: Fernando Quintana
Author-X-Name-First: Fernando
Author-X-Name-Last: Quintana
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 889-889
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.955987
File-URL: http://hdl.handle.net/10.1080/01621459.2014.955987
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:889-889
Template-Type: ReDIF-Article 1.0
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 890-891
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.955988
File-URL: http://hdl.handle.net/10.1080/01621459.2014.955988
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:890-891
Template-Type: ReDIF-Article 1.0
Author-Name: Kassandra Fronczyk
Author-X-Name-First: Kassandra
Author-X-Name-Last: Fronczyk
Author-Name: Athanasios Kottas
Author-X-Name-First: Athanasios
Author-X-Name-Last: Kottas
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 891-893
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.932171
File-URL: http://hdl.handle.net/10.1080/01621459.2014.932171
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:891-893
Template-Type: ReDIF-Article 1.0
Author-Name: Matthew W. Wheeler
Author-X-Name-First: Matthew W.
Author-X-Name-Last: Wheeler
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Author-Name: Sudha P. Pandalai
Author-X-Name-First: Sudha P.
Author-X-Name-Last: Pandalai
Author-Name: Brent A. Baker
Author-X-Name-First: Brent A.
Author-X-Name-Last: Baker
Author-Name: Amy H. Herring
Author-X-Name-First: Amy H.
Author-X-Name-Last: Herring
Title: Mechanistic Hierarchical Gaussian Processes
Abstract:
The statistics literature on functional data analysis focuses primarily on
flexible black-box approaches, which are designed to allow individual
curves to have essentially any shape while characterizing variability.
Such methods typically cannot incorporate mechanistic information, which
is commonly expressed in terms of differential equations. Motivated by
studies of muscle activation, we propose a nonparametric Bayesian approach
that takes into account mechanistic understanding of muscle physiology. A
novel class of hierarchical Gaussian processes is defined that favors
curves consistent with differential equations defined on motor, damper,
spring systems. A Gibbs sampler is proposed to sample from the posterior
distribution and applied to a study of rats exposed to noninjurious muscle
activation protocols. Although motivated by muscle force data, a parallel
approach can be used to include mechanistic information in broad
functional data analysis applications.
Journal: Journal of the American Statistical Association
Pages: 894-904
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.899234
File-URL: http://hdl.handle.net/10.1080/01621459.2014.899234
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:894-904
Template-Type: ReDIF-Article 1.0
Author-Name: Yuan Jiang
Author-X-Name-First: Yuan
Author-X-Name-Last: Jiang
Author-Name: Ni Li
Author-X-Name-First: Ni
Author-X-Name-Last: Li
Author-Name: Heping Zhang
Author-X-Name-First: Heping
Author-X-Name-Last: Zhang
Title: Identifying Genetic Variants for Addiction via Propensity Score Adjusted Generalized Kendall's Tau
Abstract:
Identifying replicable genetic variants for addiction has been extremely
challenging. Besides the common difficulties with genome-wide association
studies (GWAS), environmental factors are known to be critical to
addiction, and comorbidity is widely observed. Despite the importance of
environmental factors and comorbidity for addiction study, few GWAS
analyses adequately considered them due to the limitations of the existing
statistical methods. Although parametric methods have been developed to
adjust for covariates in association analysis, difficulties arise when the
traits are multivariate because there is no ready-to-use model for them.
Recent nonparametric development includes U-statistics to
measure the phenotype-genotype association weighted by a similarity score
of covariates. However, it is not clear how to optimize the similarity
score. Therefore, we propose a semiparametric method to measure the
association adjusted by covariates. In our approach, the nonparametric
U-statistic is adjusted by parametric estimates of
propensity scores using the idea of inverse probability weighting. The new
measurement is shown to be asymptotically unbiased under our null
hypothesis while the previous nonweighted and weighted ones are not.
Simulation results show that our test improves power as opposed to the
nonweighted and two other weighted U-statistic methods,
and it is particularly powerful for detecting gene-environment
interactions. Finally, we apply our proposed test to the Study of
Addiction: Genetics and Environment (SAGE) to identify genetic variants
for addiction. Novel genetic variants are found from our analysis, which
warrant further investigation in the future.
Journal: Journal of the American Statistical Association
Pages: 905-930
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.901223
File-URL: http://hdl.handle.net/10.1080/01621459.2014.901223
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:905-930
Template-Type: ReDIF-Article 1.0
Author-Name: Peter F. Thall
Author-X-Name-First: Peter F.
Author-X-Name-Last: Thall
Author-Name: Hoang Q. Nguyen
Author-X-Name-First: Hoang Q.
Author-X-Name-Last: Nguyen
Author-Name: Sarah Zohar
Author-X-Name-First: Sarah
Author-X-Name-Last: Zohar
Author-Name: Pierre Maton
Author-X-Name-First: Pierre
Author-X-Name-Last: Maton
Title: Optimizing Sedative Dose in Preterm Infants Undergoing Treatment for Respiratory Distress Syndrome
Abstract:
The intubation-surfactant-extubation (INSURE) procedure is used worldwide
to treat preterm newborn infants suffering from respiratory distress
syndrome, which is caused by an insufficient amount of the chemical
surfactant in the lungs. With INSURE, the infant is intubated, surfactant
is administered via the tube to the trachea, and at completion the infant
is extubated. This improves the infant's ability to breathe and thus
decreases the risk of long-term neurological or motor disabilities. To
perform the intubation safely, the newborn infant first must be sedated.
Despite extensive experience with INSURE, there is no consensus on what
sedative dose is best. This article describes a Bayesian sequentially
adaptive design for a multi-institution clinical trial to optimize the
sedative dose given to preterm infants undergoing the INSURE procedure.
The design is based on three clinical outcomes, two efficacy and one
adverse, using elicited numerical utilities of the eight possible
elementary outcomes. A flexible Bayesian parametric trivariate
dose-outcome model is assumed, with the prior derived from elicited mean
outcome probabilities. Doses are chosen adaptively for successive cohorts
of infants using posterior mean utilities, subject to safety and efficacy
constraints. A computer simulation study of the design is presented.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 931-943
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.904789
File-URL: http://hdl.handle.net/10.1080/01621459.2014.904789
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:931-943
Template-Type: ReDIF-Article 1.0
Author-Name: Radu Herbei
Author-X-Name-First: Radu
Author-X-Name-Last: Herbei
Author-Name: L. Mark Berliner
Author-X-Name-First: L. Mark
Author-X-Name-Last: Berliner
Title: Estimating Ocean Circulation: An MCMC Approach With Approximated Likelihoods via the Bernoulli Factory
Abstract:
We provide a Bayesian analysis of ocean circulation based on data
collected in the South Atlantic Ocean. The analysis incorporates a
reaction-diffusion partial differential equation that is not solvable in
closed form. This leads to an intractable likelihood function. We describe
a novel Markov chain Monte Carlo approach that does not require a
likelihood evaluation. Rather, we use unbiased estimates of the likelihood
and a Bernoulli factory to decide whether or not proposed states are
accepted. The variates required to estimate the likelihood function are
obtained via a Feynman-Kac representation. This lifts the common
restriction of selecting a regular grid for the physical model and
eliminates the need for data preprocessing. We implement our approach
using the parallel graphic processing unit (GPU) computing environment.
Journal: Journal of the American Statistical Association
Pages: 944-954
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.914439
File-URL: http://hdl.handle.net/10.1080/01621459.2014.914439
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:944-954
Template-Type: ReDIF-Article 1.0
Author-Name: Celine Marielle Laffont
Author-X-Name-First: Celine Marielle
Author-X-Name-Last: Laffont
Author-Name: Marc Vandemeulebroecke
Author-X-Name-First: Marc
Author-X-Name-Last: Vandemeulebroecke
Author-Name: Didier Concordet
Author-X-Name-First: Didier
Author-X-Name-Last: Concordet
Title: Multivariate Analysis of Longitudinal Ordinal Data With Mixed Effects Models, With Application to Clinical Outcomes in Osteoarthritis
Abstract:
Our objective was to evaluate the efficacy of robenacoxib in
osteoarthritic dogs using four ordinal responses measured repeatedly over
time. We propose a multivariate probit mixed effects model to describe the
joint evolution of the endpoints and to reveal the intrinsic correlations
between responses that are not due to the treatment effect. Maximum likelihood
computation is intractable within reasonable time frames. We therefore use
a pairwise modeling approach in combination with a stochastic EM
algorithm. Multidimensional ordinal responses with longitudinal
measurements are a common feature in clinical trials. However, the
standard methods for data analysis use unidimensional models, resulting in
a loss of information. Our methodology provides substantially greater
insight than these methods for the evaluation of treatment effects and
shows a good performance at low computational cost. We thus believe that
it could be used in routine practice to optimize the evaluation of
treatment efficacy.
Journal: Journal of the American Statistical Association
Pages: 955-966
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.917977
File-URL: http://hdl.handle.net/10.1080/01621459.2014.917977
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:955-966
Template-Type: ReDIF-Article 1.0
Author-Name: Michael E. Sobel
Author-X-Name-First: Michael E.
Author-X-Name-Last: Sobel
Author-Name: Martin A. Lindquist
Author-X-Name-First: Martin A.
Author-X-Name-Last: Lindquist
Title: Causal Inference for fMRI Time Series Data With Systematic Errors of Measurement in a Balanced On/Off Study of Social Evaluative Threat
Abstract:
Functional magnetic resonance imaging (fMRI) has facilitated major
advances in understanding human brain function. Neuroscientists are
interested in using fMRI to study the effects of external stimuli on brain
activity and causal relationships among brain regions, but have not stated
what is meant by causation or defined the effects they purport to
estimate. Building on Rubin's causal model, we construct a framework for
causal inference using blood oxygenation level dependent (BOLD) fMRI time
series data. In the usual statistical literature on causal inference,
potential outcomes, assumed to be measured without systematic error, are
used to define unit and average causal effects. However, in general the
potential BOLD responses are measured with stimulus dependent systematic
error. Thus we define unit and average causal effects that are free of
systematic error. In contrast to the usual case of a randomized experiment
where adjustment for intermediate outcomes leads to biased estimates of
treatment effects, here the failure to adjust for task dependent
systematic error leads to biased estimates. We therefore adjust for
systematic error using measured "noise covariates," using a linear mixed
model to estimate the effects and the systematic error. Our results are
important for neuroscientists, who typically do not adjust for systematic
error. They should also prove useful to researchers in other areas where
responses are measured with error and in fields where large amounts of
data are collected on relatively few subjects. To illustrate our approach,
we reanalyze data from a social evaluative threat task, comparing the
findings with results that ignore systematic error.
Journal: Journal of the American Statistical Association
Pages: 967-976
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.922886
File-URL: http://hdl.handle.net/10.1080/01621459.2014.922886
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:967-976
Template-Type: ReDIF-Article 1.0
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Author-Name: Zakaria Khondker
Author-X-Name-First: Zakaria
Author-X-Name-Last: Khondker
Author-Name: Zhaohua Lu
Author-X-Name-First: Zhaohua
Author-X-Name-Last: Lu
Author-Name: Joseph G. Ibrahim
Author-X-Name-First: Joseph G.
Author-X-Name-Last: Ibrahim
Title: Bayesian Generalized Low Rank Regression Models for Neuroimaging Phenotypes and Genetic Markers
Abstract:
We propose a Bayesian generalized low-rank regression model (GLRR) for the
analysis of both high-dimensional responses and covariates. This
development is motivated by performing searches for associations between
genetic variants and brain imaging phenotypes. GLRR integrates a low-rank
matrix to approximate the high-dimensional regression coefficient matrix
and a dynamic factor model to model the high-dimensional
covariance matrix of brain imaging phenotypes. Local hypothesis testing is
developed to identify significant covariates on high-dimensional
responses. Posterior computation proceeds via an efficient Markov chain
Monte Carlo algorithm. A simulation study is performed to evaluate the
finite sample performance of GLRR and its comparison with several
competing approaches. We apply GLRR to investigate the impact of 1071 SNPs
in the top 40 genes reported by the AlzGene database on the volumes of 93
regions of interest (ROIs) obtained from the Alzheimer's Disease
Neuroimaging Initiative (ADNI). Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 977-990
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.923775
File-URL: http://hdl.handle.net/10.1080/01621459.2014.923775
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:977-990
Template-Type: ReDIF-Article 1.0
Author-Name: Bradley Efron
Author-X-Name-First: Bradley
Author-X-Name-Last: Efron
Title: Estimation and Accuracy After Model Selection
Abstract:
Classical statistical theory ignores model selection in assessing
estimation accuracy. Here we consider bootstrap methods for computing
standard errors and confidence intervals that take model selection into
account. The methodology involves bagging, also known as bootstrap
smoothing, to tame the erratic discontinuities of selection-based
estimators. A useful new formula for the accuracy of bagging then provides
standard errors for the smoothed estimators. Two examples, nonparametric
and parametric, are carried through in detail: a regression model where
the choice of degree (linear, quadratic, cubic, ...) is determined by the
Cp criterion and a Lasso-based estimation
problem.
Journal: Journal of the American Statistical Association
Pages: 991-1007
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2013.823775
File-URL: http://hdl.handle.net/10.1080/01621459.2013.823775
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:991-1007
Template-Type: ReDIF-Article 1.0
Author-Name: Lan Wang
Author-X-Name-First: Lan
Author-X-Name-Last: Wang
Author-Name: Ben Sherwood
Author-X-Name-First: Ben
Author-X-Name-Last: Sherwood
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1007-1010
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.905399
File-URL: http://hdl.handle.net/10.1080/01621459.2014.905399
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1007-1010
Template-Type: ReDIF-Article 1.0
Author-Name: Dimitris N. Politis
Author-X-Name-First: Dimitris N.
Author-X-Name-Last: Politis
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1010-1013
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.905788
File-URL: http://hdl.handle.net/10.1080/01621459.2014.905788
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1010-1013
Template-Type: ReDIF-Article 1.0
Author-Name: Shuva Gupta
Author-X-Name-First: Shuva
Author-X-Name-Last: Gupta
Author-Name: S. N. Lahiri
Author-X-Name-First: S. N.
Author-X-Name-Last: Lahiri
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1013-1015
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.905789
File-URL: http://hdl.handle.net/10.1080/01621459.2014.905789
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1013-1015
Template-Type: ReDIF-Article 1.0
Author-Name: Andrew Gelman
Author-X-Name-First: Andrew
Author-X-Name-Last: Gelman
Author-Name: Aki Vehtari
Author-X-Name-First: Aki
Author-X-Name-Last: Vehtari
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1015-1016
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.906153
File-URL: http://hdl.handle.net/10.1080/01621459.2014.906153
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1015-1016
Template-Type: ReDIF-Article 1.0
Author-Name: Nils Lid Hjort
Author-X-Name-First: Nils Lid
Author-X-Name-Last: Hjort
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1017-1020
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.923315
File-URL: http://hdl.handle.net/10.1080/01621459.2014.923315
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1017-1020
Template-Type: ReDIF-Article 1.0
Author-Name: Bradley Efron
Author-X-Name-First: Bradley
Author-X-Name-Last: Efron
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 1021-1022
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.932172
File-URL: http://hdl.handle.net/10.1080/01621459.2014.932172
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1021-1022
Template-Type: ReDIF-Article 1.0
Author-Name: Ke Deng
Author-X-Name-First: Ke
Author-X-Name-Last: Deng
Author-Name: Simeng Han
Author-X-Name-First: Simeng
Author-X-Name-Last: Han
Author-Name: Kate J. Li
Author-X-Name-First: Kate J.
Author-X-Name-Last: Li
Author-Name: Jun S. Liu
Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu
Title: Bayesian Aggregation of Order-Based Rank Data
Abstract:
Rank aggregation, that is, combining several ranking functions (called
base rankers) to get aggregated, usually stronger rankings of a given set
of items, is encountered in many disciplines. Most methods in the
literature assume that base rankers of interest are equally reliable. It
is very common in practice, however, that some rankers are more
informative and reliable than others. It is desirable to distinguish high
quality base rankers from low quality ones and treat them differently.
Some methods achieve this by assigning prespecified weights to base
rankers. But there are no systematic and principled strategies for
designing a proper weighting scheme for a practical problem. In this
article, we propose a Bayesian approach, called Bayesian aggregation of
rank data (BARD), to overcome this limitation. By attaching a quality
parameter to each base ranker and estimating these parameters along with
the aggregation process, BARD measures reliabilities of base rankers in a
quantitative way and makes use of this information to improve the
aggregated ranking. In addition, we design a method to detect highly
correlated rankers and to account for their information redundancy
appropriately. Both simulation studies and real data applications show
that BARD significantly outperforms existing methods when the quality of
base rankers varies greatly.
Journal: Journal of the American Statistical Association
Pages: 1023-1039
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2013.878660
File-URL: http://hdl.handle.net/10.1080/01621459.2013.878660
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1023-1039
Template-Type: ReDIF-Article 1.0
Author-Name: Andrew J. Womack
Author-X-Name-First: Andrew J.
Author-X-Name-Last: Womack
Author-Name: Luis León-Novelo
Author-X-Name-First: Luis
Author-X-Name-Last: León-Novelo
Author-Name: George Casella
Author-X-Name-First: George
Author-X-Name-Last: Casella
Title: Inference From Intrinsic Bayes' Procedures Under Model Selection and Uncertainty
Abstract:
In this article, we present a fully coherent and consistent objective
Bayesian analysis of the linear regression model using intrinsic priors.
The intrinsic prior is a scaled mixture of g-priors and
promotes shrinkage toward the subspace defined by a base (or null) model.
While it has been established that the intrinsic prior provides consistent
model selectors across a range of models, the posterior distribution of
the model parameters has not previously been investigated. We prove that
the posterior distribution of the model parameters is consistent under
both model selection and model averaging when the number of regressors is
fixed. Further, we derive tractable expressions for the intrinsic
posterior distribution as well as sampling algorithms for both a selected
model and model averaging. We compare the intrinsic prior to other
mixtures of g-priors and provide details on the
consistency properties of modified versions of the Zellner-Siow prior and
hyper g-priors. Supplementary materials for this article
are available online.
Journal: Journal of the American Statistical Association
Pages: 1040-1053
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.880348
File-URL: http://hdl.handle.net/10.1080/01621459.2014.880348
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1040-1053
Template-Type: ReDIF-Article 1.0
Author-Name: T. Tony Cai
Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai
Author-Name: Mark Low
Author-X-Name-First: Mark
Author-X-Name-Last: Low
Author-Name: Zongming Ma
Author-X-Name-First: Zongming
Author-X-Name-Last: Ma
Title: Adaptive Confidence Bands for Nonparametric Regression Functions
Abstract:
This article proposes a new formulation for the construction of adaptive
confidence bands (CBs) in nonparametric function estimation problems. The
CBs constructed here have a size that adapts to the smoothness of the
function while guaranteeing that both the relative excess mass of the
function lying outside the band and the measure of the set of points where
the function lies outside the band are small. It is shown that the bands adapt over a
maximum range of Lipschitz classes. The adaptive CB can be easily
implemented in standard statistical software with wavelet support. We
investigate the numerical performance of the procedure using both
simulated and real datasets. The numerical results agree well with the
theoretical analysis. The procedure can be easily modified and used for
other nonparametric function estimation models. Supplementary materials
for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1054-1070
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2013.879260
File-URL: http://hdl.handle.net/10.1080/01621459.2013.879260
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1054-1070
Template-Type: ReDIF-Article 1.0
Author-Name: Marc Hallin
Author-X-Name-First: Marc
Author-X-Name-Last: Hallin
Author-Name: Davy Paindaveine
Author-X-Name-First: Davy
Author-X-Name-Last: Paindaveine
Author-Name: Thomas Verdebout
Author-X-Name-First: Thomas
Author-X-Name-Last: Verdebout
Title: Efficient R-Estimation of Principal and Common Principal Components
Abstract:
We propose rank-based estimators of principal components, both in the
one-sample and, under the assumption of common principal
components, in the m-sample cases. Those
estimators are obtained via a rank-based version of Le Cam's one-step
method, combined with an estimation of cross-information
quantities. Under arbitrary elliptical distributions with, in the
m-sample case, possibly heterogeneous radial densities,
those R-estimators remain root-n consistent and
asymptotically normal, while achieving asymptotic efficiency under
correctly specified radial densities. Contrary to their traditional
counterparts computed from empirical covariances, they do not require any
moment conditions. When based on Gaussian score functions, in the
one-sample case, they uniformly dominate their classical competitors in
the Pitman sense. Their AREs with respect to other robust procedures are
quite high: up to 30, in the Gaussian case, with respect to minimum
covariance determinant estimators. Their finite-sample performances are
investigated via a Monte Carlo study.
Journal: Journal of the American Statistical Association
Pages: 1071-1083
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.880057
File-URL: http://hdl.handle.net/10.1080/01621459.2014.880057
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1071-1083
Template-Type: ReDIF-Article 1.0
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Linglong Kong
Author-X-Name-First: Linglong
Author-X-Name-Last: Kong
Title: Spatially Varying Coefficient Model for Neuroimaging Data With Jump Discontinuities
Abstract:
Motivated by recent work on studying massive imaging data in various
neuroimaging studies, we propose a novel spatially varying coefficient
model (SVCM) to capture the varying association between imaging measures
in a three-dimensional volume (or two-dimensional surface) with a set of
covariates. Two stylized features of neuroimaging data are the presence of
multiple piecewise smooth regions with unknown edges and jumps and
substantial spatial correlations. To specifically account for these two
features, SVCM includes a measurement model with multiple varying
coefficient functions, a jumping surface model for each varying
coefficient function, and a functional principal component model. We
develop a three-stage estimation procedure to simultaneously estimate the
varying coefficient functions and the spatial correlations. The estimation
procedure includes a fast multiscale adaptive estimation and testing
procedure to independently estimate each varying coefficient function,
while preserving its edges among different piecewise-smooth regions. We
systematically investigate the asymptotic properties (e.g., consistency
and asymptotic normality) of the multiscale adaptive parameter estimates.
We also establish the uniform convergence rate of the estimated spatial
covariance function and its associated eigenvalues and eigenfunctions. Our
Monte Carlo simulation and real-data analysis have confirmed the excellent
performance of SVCM. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 1084-1098
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.881742
File-URL: http://hdl.handle.net/10.1080/01621459.2014.881742
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1084-1098
Template-Type: ReDIF-Article 1.0
Author-Name: Valentin Patilea
Author-X-Name-First: Valentin
Author-X-Name-Last: Patilea
Author-Name: Hamdi Raïssi
Author-X-Name-First: Hamdi
Author-X-Name-Last: Raïssi
Title: Testing Second-Order Dynamics for Autoregressive Processes in Presence of Time-Varying Variance
Abstract:
This article considers the volatility modeling for autoregressive
univariate time series. A benchmark approach is the stationary
autoregressive conditional heteroscedasticity (ARCH) model of Engle.
Motivated by real data evidence, processes with nonconstant unconditional
variance and ARCH effects have been recently introduced. We take into
account this type of nonstationarity in variance and propose simple
testing procedures for ARCH effects. Adaptive McLeod and Li's portmanteau
and ARCH-LM tests for checking the presence of such second-order dynamics
are provided. The standard versions of these tests, commonly used by
practitioners, suppose constant unconditional variance. The failure of
these standard tests with time-varying unconditional variance is
highlighted. The theoretical results are illustrated by means of simulated
and real data. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 1099-1111
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.884504
File-URL: http://hdl.handle.net/10.1080/01621459.2014.884504
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1099-1111
Template-Type: ReDIF-Article 1.0
Author-Name: Andrew Harvey
Author-X-Name-First: Andrew
Author-X-Name-Last: Harvey
Author-Name: Alessandra Luati
Author-X-Name-First: Alessandra
Author-X-Name-Last: Luati
Title: Filtering With Heavy Tails
Abstract:
An unobserved components model in which the signal is buried in noise that
is non-Gaussian may throw up observations that, when judged by the
Gaussian yardstick, are outliers. We describe an observation-driven model,
based on a conditional Student's t-distribution, which is
tractable and retains some of the desirable features of the linear
Gaussian model. Letting the dynamics be driven by the score of the
conditional distribution leads to a specification that is not only easy to
implement, but which also facilitates the development of a comprehensive
and relatively straightforward theory for the asymptotic distribution of
the maximum likelihood estimator. The methods are illustrated with an
application to rail travel in the United Kingdom. The final part of the
article shows how the model may be extended to include explanatory
variables.
Journal: Journal of the American Statistical Association
Pages: 1112-1122
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.887011
File-URL: http://hdl.handle.net/10.1080/01621459.2014.887011
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1112-1122
Template-Type: ReDIF-Article 1.0
Author-Name: Bo Wang
Author-X-Name-First: Bo
Author-X-Name-Last: Wang
Author-Name: Jian Qing Shi
Author-X-Name-First: Jian Qing
Author-X-Name-Last: Shi
Title: Generalized Gaussian Process Regression Model for Non-Gaussian Functional Data
Abstract:
In this article, we propose a generalized Gaussian process concurrent
regression model for functional data, where the functional response
variable has a binomial, Poisson, or other non-Gaussian distribution from
an exponential family, while the covariates are mixed functional and
scalar variables. The proposed model offers a nonparametric generalized
concurrent regression method for functional data with multidimensional
covariates, and provides a natural framework on modeling common mean
structure and covariance structure simultaneously for repeatedly observed
functional data. The mean structure provides overall information about the
observations, while the covariance structure captures the characteristics
of each individual batch. The prior specification of the
covariance kernel enables us to accommodate a wide class of nonlinear
models. The definition of the model, the inference, and the implementation
as well as its asymptotic properties are discussed. Several numerical
examples with different non-Gaussian response variables are presented.
Some technical details and more numerical examples as well as an extension
of the model are provided as supplementary materials.
Journal: Journal of the American Statistical Association
Pages: 1123-1133
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.889021
File-URL: http://hdl.handle.net/10.1080/01621459.2014.889021
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1123-1133
Template-Type: ReDIF-Article 1.0
Author-Name: Yong-Dao Zhou
Author-X-Name-First: Yong-Dao
Author-X-Name-Last: Zhou
Author-Name: Hongquan Xu
Author-X-Name-First: Hongquan
Author-X-Name-Last: Xu
Title: Space-Filling Fractional Factorial Designs
Abstract:
Fractional factorial designs are widely used in various scientific
investigations and industrial applications. Level permutation of factors
could alter their geometrical structures and statistical properties. This
article studies space-filling properties of fractional factorial designs
under two commonly used space-filling measures, discrepancy and maximin
distance. When all possible level permutations are considered, the average
discrepancy is expressed as a linear combination of generalized word
length pattern for fractional factorial designs with any number of levels
and any discrepancy defined by a reproducing kernel. Generalized minimum
aberration designs are shown to have good space-filling properties on
average in terms of both discrepancy and distance. Several novel
relationships between distance distribution and generalized word length
pattern are derived. It is also shown that level permutations can improve
space-filling properties for many existing saturated designs. A two-step
construction procedure is proposed and three-, four-, and five-level
space-filling fractional factorial designs are obtained. These new designs
have better space-filling properties, such as larger distance and lower
discrepancy, than existing ones, and are recommended for use in practice.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1134-1144
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2013.873367
File-URL: http://hdl.handle.net/10.1080/01621459.2013.873367
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1134-1144
Template-Type: ReDIF-Article 1.0
Author-Name: Paul R. Rosenbaum
Author-X-Name-First: Paul R.
Author-X-Name-Last: Rosenbaum
Title: Weighted M-statistics With Superior Design Sensitivity in Matched Observational Studies With Multiple Controls
Abstract:
In a nonrandomized or observational study, a weak association between
receipt of the treatment and an outcome may be explained not as effects
caused by the treatment but rather by a small bias in the assignment of
individuals to treatment or control; however, a strong association may be
explained as noncausal only by a large bias. The strength of the
association between treatment and outcome is not uniform across the data
from a study, and this motivates giving greater weight where the
association is stronger. In an observational study with treated-control
matched pairs, it is known that results are less sensitive to unmeasured
biases if pairs with small absolute differences in outcomes are given
little weight in the analysis; more precisely, such a test statistic has
superior design sensitivity. How should outcomes be weighted if an
observational study is matched in sets with one treated subject and
several controls? An M-statistic is the quantity equated
to zero in defining Huber's M-estimates, including the
mean, and it is used in testing hypotheses and setting confidence limits.
In matched sets, a weighted M-statistic increases the
weight of some matched sets and decreases the weight of others. Not unlike
the case of matched pairs, weighted M-statistics with
suitable weights have larger design sensitivities, and hence greater power
in a sensitivity analysis, than unweighted statistics for symmetric
unimodal errors, such as Normal, logistic, or
t-distributed errors. This issue is examined using an
asymptotic measure, the design sensitivity, and using simulation. For one
Normal sampling situation, weighting the matched sets increased the power
of a 0.05 level sensitivity analysis from 0.05 without weights to 0.75
with weights. An example from NHANES 2009-2010 concerning methylmercury in
the blood of people who consume large amounts of fish is used to
illustrate.
Journal: Journal of the American Statistical Association
Pages: 1145-1158
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2013.879261
File-URL: http://hdl.handle.net/10.1080/01621459.2013.879261
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1145-1158
Template-Type: ReDIF-Article 1.0
Author-Name: Peisong Han
Author-X-Name-First: Peisong
Author-X-Name-Last: Han
Title: Multiply Robust Estimation in Regression Analysis With Missing Data
Abstract:
Doubly robust estimators are widely used in missing-data analysis. They
provide double protection on estimation consistency against model
misspecifications. However, they allow only a single model for the
missingness mechanism and a single model for the data distribution, and
the assumption that one of these two models is correctly specified is
restrictive in practice. For regression analysis with possibly missing
outcome, we propose an estimation method that allows multiple models for
both the missingness mechanism and the data distribution. The resulting
estimator is consistent if any one of those multiple models is correctly
specified, and thus provides multiple protection on consistency. This
estimator is also robust against extreme values of the fitted missingness
probability, which, for most doubly robust estimators, can lead to
erroneously large inverse probability weights that may jeopardize the
numerical performance. The numerical implementation of the proposed method
through a modified Newton-Raphson algorithm is discussed. The asymptotic
distribution of the resulting estimator is derived, based on which we
study the estimation efficiency and provide ways to improve the
efficiency. As an application, we analyze the data collected from the AIDS
Clinical Trials Group Protocol 175.
Journal: Journal of the American Statistical Association
Pages: 1159-1173
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.880058
File-URL: http://hdl.handle.net/10.1080/01621459.2014.880058
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1159-1173
Template-Type: ReDIF-Article 1.0
Author-Name: Tianle Chen
Author-X-Name-First: Tianle
Author-X-Name-Last: Chen
Author-Name: Yuanjia Wang
Author-X-Name-First: Yuanjia
Author-X-Name-Last: Wang
Author-Name: Huaihou Chen
Author-X-Name-First: Huaihou
Author-X-Name-Last: Chen
Author-Name: Karen Marder
Author-X-Name-First: Karen
Author-X-Name-Last: Marder
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Title: Targeted Local Support Vector Machine for Age-Dependent Classification
Abstract:
We develop methods to accurately predict whether presymptomatic
individuals are at risk of a disease based on their various marker
profiles, which offers an opportunity for early intervention well before
definitive clinical diagnosis. For many diseases, existing clinical
literature may suggest the risk of disease varies with some markers of
biological and etiological importance, for example, age. To identify
effective prediction rules using nonparametric decision functions,
standard statistical learning approaches treat markers with clear
biological importance (e.g., age) and other markers without prior
knowledge on disease etiology interchangeably as input variables.
Therefore, these approaches may be inadequate in singling out and
preserving the effects from the biologically important variables,
especially in the presence of potential noise markers. Using age as an
example of a salient marker to receive special care in the analysis, we
propose a local smoothing large margin classifier implemented with support
vector machine (SVM) to construct effective age-dependent classification
rules. The method adaptively adjusts the age effect and separately tunes age
and other markers to achieve optimal performance. We derive the asymptotic
risk bound of the local smoothing SVM and perform extensive simulation
studies to compare with standard approaches. We apply the proposed method
to two studies of premanifest Huntington's disease (HD) subjects and
controls to construct age-sensitive predictive scores for the risk of HD
and risk of receiving HD diagnosis during the study period. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1174-1187
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.881743
File-URL: http://hdl.handle.net/10.1080/01621459.2014.881743
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1174-1187
Template-Type: ReDIF-Article 1.0
Author-Name: Bing Li
Author-X-Name-First: Bing
Author-X-Name-Last: Li
Author-Name: Hyonho Chun
Author-X-Name-First: Hyonho
Author-X-Name-Last: Chun
Author-Name: Hongyu Zhao
Author-X-Name-First: Hongyu
Author-X-Name-Last: Zhao
Title: On an Additive Semigraphoid Model for Statistical Networks With Application to Pathway Analysis
Abstract:
We introduce a nonparametric method for estimating non-Gaussian graphical
models based on a new statistical relation called additive conditional
independence, which is a three-way relation among random vectors that
resembles the logical structure of conditional independence. Additive
conditional independence allows us to use a one-dimensional kernel
regardless of the dimension of the graph, which not only avoids the curse
of dimensionality but also simplifies computation. It also gives rise to a
parallel structure to the Gaussian graphical model that replaces the
precision matrix by an additive precision operator. The estimators derived
from additive conditional independence cover the recently introduced
nonparanormal graphical model as a special case, but outperform it when
the Gaussian copula assumption is violated. We compare the new method with
existing ones by simulations and in genetic pathway analysis.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1188-1204
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.882842
File-URL: http://hdl.handle.net/10.1080/01621459.2014.882842
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1188-1204
Template-Type: ReDIF-Article 1.0
Author-Name: Yehua Li
Author-X-Name-First: Yehua
Author-X-Name-Last: Li
Author-Name: Yongtao Guan
Author-X-Name-First: Yongtao
Author-X-Name-Last: Guan
Title: Functional Principal Component Analysis of Spatiotemporal Point Processes With Applications in Disease Surveillance
Abstract:
In disease surveillance applications, the disease events are modeled by
spatiotemporal point processes. We propose a new class of semiparametric
generalized linear mixed model for such data, where the event rate is
related to some known risk factors and some unknown latent random effects.
We model the latent spatiotemporal process as spatially correlated
functional data, and propose Poisson maximum likelihood and composite
likelihood methods based on spline approximations to estimate the mean and
covariance functions of the latent process. By performing functional
principal component analysis on the latent process, we can better
understand the correlation structure in the point process. We also propose
an empirical Bayes method to predict the latent spatial random effects,
which can help highlight hot areas with unusually high event rates. Under
an increasing domain and increasing knots asymptotic framework, we
establish the asymptotic distribution for the parametric components in the
model and the asymptotic convergence rates for the functional principal
component estimators. We illustrate the methodology through a simulation
study and an application to the Connecticut Tumor Registry data.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1205-1215
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.885434
File-URL: http://hdl.handle.net/10.1080/01621459.2014.885434
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1205-1215
Template-Type: ReDIF-Article 1.0
Author-Name: Michael Rosenblum
Author-X-Name-First: Michael
Author-X-Name-Last: Rosenblum
Author-Name: Han Liu
Author-X-Name-First: Han
Author-X-Name-Last: Liu
Author-Name: En-Hsu Yen
Author-X-Name-First: En-Hsu
Author-X-Name-Last: Yen
Title: Optimal Tests of Treatment Effects for the Overall Population and Two Subpopulations in Randomized Trials, Using Sparse Linear Programming
Abstract:
We propose new, optimal methods for analyzing randomized trials, when it
is suspected that treatment effects may differ in two predefined
subpopulations. Such subpopulations could be defined by a biomarker or
risk factor measured at baseline. The goal is to simultaneously learn
which subpopulations benefit from an experimental treatment, while
providing strong control of the familywise Type I error rate. We formalize
this as a multiple testing problem and show it is computationally
infeasible to solve using existing techniques. Our solution involves a
novel approach, in which we first transform the original multiple testing
problem into a large, sparse linear program. We then solve this problem
using advanced optimization techniques. This general method can solve a
variety of multiple testing problems and decision theory problems related
to optimal trial design, for which no solution was previously available.
In particular, we construct new multiple testing procedures that satisfy
minimax and Bayes optimality criteria. For a given optimality criterion,
our new approach yields the optimal tradeoff between power to detect an
effect in the overall population versus power to detect effects in
subpopulations. We demonstrate our approach in examples motivated by two
randomized trials of new treatments for HIV. Supplementary materials for
this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1216-1228
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2013.879063
File-URL: http://hdl.handle.net/10.1080/01621459.2013.879063
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1216-1228
Template-Type: ReDIF-Article 1.0
Author-Name: Shan Luo
Author-X-Name-First: Shan
Author-X-Name-Last: Luo
Author-Name: Zehua Chen
Author-X-Name-First: Zehua
Author-X-Name-Last: Chen
Title: Sequential Lasso Cum EBIC for Feature Selection With Ultra-High Dimensional Feature Space
Abstract:
In this article, we propose a method called sequential Lasso (SLasso) for
feature selection in sparse high-dimensional linear models. The SLasso
selects features by sequentially solving partially penalized least squares
problems where the features selected in earlier steps are not penalized.
The SLasso uses extended BIC (EBIC) as the stopping rule. The procedure
stops when EBIC reaches a minimum. The asymptotic properties of SLasso are
considered when the dimension of the feature space is ultra high and the
number of relevant features diverges. We show that, with probability
converging to 1, the SLasso first selects all the relevant features before
any irrelevant features can be selected, and that the EBIC decreases until
it attains the minimum at the model consisting of exactly all the relevant
features and then begins to increase. These results establish the
selection consistency of SLasso. The SLasso estimators of the final model
are ordinary least squares estimators. The selection consistency implies
the oracle property of SLasso. The asymptotic distribution of the SLasso
estimators with diverging number of relevant features is provided. The
SLasso is compared with other methods in simulation studies, which
demonstrate that SLasso is a desirable approach with an edge over the
other methods. The SLasso, together with the other methods, is applied to
microarray data for mapping disease genes. Supplementary materials for
this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1229-1240
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2013.877275
File-URL: http://hdl.handle.net/10.1080/01621459.2013.877275
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1229-1240
Template-Type: ReDIF-Article 1.0
Author-Name: Alexander Aue
Author-X-Name-First: Alexander
Author-X-Name-Last: Aue
Author-Name: Rex C. Y. Cheung
Author-X-Name-First: Rex C. Y.
Author-X-Name-Last: Cheung
Author-Name: Thomas C. M. Lee
Author-X-Name-First: Thomas C. M.
Author-X-Name-Last: Lee
Author-Name: Ming Zhong
Author-X-Name-First: Ming
Author-X-Name-Last: Zhong
Title: Segmented Model Selection in Quantile Regression Using the Minimum Description Length Principle
Abstract:
This article proposes new model-fitting techniques for quantiles of an
observed data sequence, including methods for data segmentation and
variable selection. The main contribution, however, is in providing a
means to perform these two tasks simultaneously. This is achieved by
matching the data with the best-fitting piecewise quantile regression
model, where the fit is determined by a penalization derived from the
minimum description length principle. The resulting optimization problem
is solved with the use of genetic algorithms. The proposed, fully
automatic procedures are, unlike traditional break point procedures, not
based on repeated hypothesis tests, and do not require, unlike most
variable selection procedures, the specification of a tuning parameter.
Theoretical large-sample properties are derived. Empirical comparisons
with existing break point and variable selection methods for quantiles
indicate that the new procedures work well in practice.
Journal: Journal of the American Statistical Association
Pages: 1241-1256
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.889022
File-URL: http://hdl.handle.net/10.1080/01621459.2014.889022
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1241-1256
Template-Type: ReDIF-Article 1.0
Author-Name: Chen Xu
Author-X-Name-First: Chen
Author-X-Name-Last: Xu
Author-Name: Jiahua Chen
Author-X-Name-First: Jiahua
Author-X-Name-Last: Chen
Title: The Sparse MLE for Ultrahigh-Dimensional Feature Screening
Abstract:
Feature selection is fundamental for modeling high-dimensional data,
where the number of features can be huge and much larger than the sample
size. Since the feature space is so large, many traditional procedures
become numerically infeasible. It is hence essential to first remove most
apparently noninfluential features before any elaborative analysis.
Recently, several procedures have been developed for this purpose, which
include the sure-independent-screening (SIS) as a widely used technique.
To gain computational efficiency, the SIS screens features based on their
individual predicting power. In this article, we propose a new screening
method via the sparsity-restricted maximum likelihood estimator (SMLE).
The new method naturally takes into account the joint effects of features
in the screening process, which gives it an edge to potentially outperform
the existing methods. This conjecture is further supported by the
simulation studies under a number of modeling settings. We show that the
proposed method is screening consistent in the context of
ultrahigh-dimensional generalized linear models. Supplementary materials
for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1257-1269
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2013.879531
File-URL: http://hdl.handle.net/10.1080/01621459.2013.879531
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1257-1269
Template-Type: ReDIF-Article 1.0
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Yunbei Ma
Author-X-Name-First: Yunbei
Author-X-Name-Last: Ma
Author-Name: Wei Dai
Author-X-Name-First: Wei
Author-X-Name-Last: Dai
Title: Nonparametric Independence Screening in Sparse Ultra-High-Dimensional Varying Coefficient Models
Abstract:
The varying coefficient model is an important class of nonparametric
statistical model, which allows us to examine how the effects of
covariates vary with exposure variables. When the number of covariates is
large, the issue of variable selection arises. In this article, we propose
and investigate marginal nonparametric screening methods to screen
variables in sparse ultra-high-dimensional varying coefficient models. The
proposed nonparametric independence screening (NIS) selects variables by
ranking a measure of the nonparametric marginal contributions of each
covariate given the exposure variable. The sure independent screening
property is established under some mild technical conditions when the
dimensionality is of nonpolynomial order, and the dimensionality reduction
of NIS is quantified. To enhance the practical utility and finite sample
performance, two data-driven iterative NIS (INIS) methods are proposed for
selecting thresholding parameters and variables: conditional permutation
and greedy methods, resulting in conditional-INIS and greedy-INIS. The
effectiveness and flexibility of the proposed methods are further
illustrated by simulation studies and real data applications.
Journal: Journal of the American Statistical Association
Pages: 1270-1284
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2013.879828
File-URL: http://hdl.handle.net/10.1080/01621459.2013.879828
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1270-1284
Template-Type: ReDIF-Article 1.0
Author-Name: Ning Hao
Author-X-Name-First: Ning
Author-X-Name-Last: Hao
Author-Name: Hao Helen Zhang
Author-X-Name-First: Hao Helen
Author-X-Name-Last: Zhang
Title: Interaction Screening for Ultrahigh-Dimensional Data
Abstract:
In ultrahigh-dimensional data analysis, it is extremely challenging to
identify important interaction effects, and a top concern in practice is
computational feasibility. For a dataset with n
observations and p predictors, the augmented design
matrix including all linear and order-2 terms is of size
n × (p^2 +
3p)/2. When p is large, say more than
tens of hundreds, the number of interactions is enormous and beyond the
capacity of standard machines and software tools for storage and analysis.
In theory, the interaction-selection consistency is hard to achieve in
high-dimensional settings. Interaction effects have heavier tails and more
complex covariance structures than main effects in a random design, making
theoretical analysis difficult. In this article, we propose to tackle
these issues by forward-selection-based procedures called iFOR, which
identify interaction effects in a greedy forward fashion while maintaining
the natural hierarchical model structure. Two algorithms, iFORT and iFORM,
are studied. Computationally, the iFOR procedures are designed to be
simple and fast to implement. No complex optimization tools are needed,
since only OLS-type calculations are involved; the iFOR algorithms avoid
storing and manipulating the whole augmented matrix, so the memory and CPU
requirement is minimal; the computational complexity is
linear in p for sparse models, hence
feasible for p >> n. Theoretically, we
prove that they possess sure screening property for ultrahigh-dimensional
settings. Numerical examples are used to demonstrate their finite sample
performance. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 1285-1301
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.881741
File-URL: http://hdl.handle.net/10.1080/01621459.2014.881741
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1285-1301
Template-Type: ReDIF-Article 1.0
Author-Name: Xiaofeng Shao
Author-X-Name-First: Xiaofeng
Author-X-Name-Last: Shao
Author-Name: Jingsi Zhang
Author-X-Name-First: Jingsi
Author-X-Name-Last: Zhang
Title: Martingale Difference Correlation and Its Use in High-Dimensional Variable Screening
Abstract:
In this article, we propose a new metric, the so-called martingale
difference correlation, to measure the departure of conditional mean
independence between a scalar response variable V and a
vector predictor variable U. Our metric is a natural
extension of distance correlation proposed by Székely, Rizzo, and Bakirov,
which is used to measure the dependence between V and
U. The martingale difference correlation and its
empirical counterpart inherit a number of desirable features of distance
correlation and sample distance correlation, such as algebraic simplicity
and elegant theoretical properties. We further use martingale difference
correlation as a marginal utility to do high-dimensional variable
screening to screen out variables that do not contribute to conditional
mean of the response given the covariates. Further extension to
conditional quantile screening is also described in detail and sure
screening properties are rigorously justified. Both simulation results and
real data illustrations demonstrate the effectiveness of martingale
difference correlation-based screening procedures in comparison with the
existing counterparts. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 1302-1318
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.887012
File-URL: http://hdl.handle.net/10.1080/01621459.2014.887012
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1302-1318
Template-Type: ReDIF-Article 1.0
Author-Name: Holger Dette
Author-X-Name-First: Holger
Author-X-Name-Last: Dette
Author-Name: Ria Van Hecke
Author-X-Name-First: Ria
Author-X-Name-Last: Van Hecke
Author-Name: Stanislav Volgushev
Author-X-Name-First: Stanislav
Author-X-Name-Last: Volgushev
Title: Some Comments on Copula-Based Regression
Abstract:
In a recent article, Noh, El Ghouch, and Bouezmarni proposed a new
semiparametric estimate of a regression function with a multivariate
predictor, which is based on a specification of the dependence structure
between the predictor and the response by means of a parametric copula.
This comment investigates the effect which occurs under misspecification
of the parametric model. We demonstrate by means of several examples that
even for a one- or two-dimensional predictor the error caused by a "wrong"
specification of the parametric family is rather severe, if the regression
is not monotone in one of the components of the predictor. Moreover, we
also show that these problems occur for all of the commonly used copula
families and we illustrate in several examples that the copula-based
regression may lead to invalid results even when flexible copula models
such as vine copulas (with the common parametric families) are used in the
estimation procedure.
Journal: Journal of the American Statistical Association
Pages: 1319-1324
Issue: 507
Volume: 109
Year: 2014
Month: 9
X-DOI: 10.1080/01621459.2014.916577
File-URL: http://hdl.handle.net/10.1080/01621459.2014.916577
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:507:p:1319-1324
Template-Type: ReDIF-Article 1.0
Author-Name: Michael R. Wierzbicki
Author-X-Name-First: Michael R.
Author-X-Name-Last: Wierzbicki
Author-Name: Li-Bing Guo
Author-X-Name-First: Li-Bing
Author-X-Name-Last: Guo
Author-Name: Qing-Tao Du
Author-X-Name-First: Qing-Tao
Author-X-Name-Last: Du
Author-Name: Wensheng Guo
Author-X-Name-First: Wensheng
Author-X-Name-Last: Guo
Title: Sparse Semiparametric Nonlinear Model With Application to Chromatographic Fingerprints
Abstract:
Traditional Chinese herbal medications (TCHMs) are composed of a multitude
of compounds and the identification of their active composition is an
important area of research. Chromatography provides a visual
representation of a TCHM sample's composition by outputting a curve
characterized by spikes corresponding to compounds in the sample. Across
different experimental conditions, the location of the spikes can be
shifted, preventing direct comparison of curves and forcing compound
identification to be possible only within each experiment. In this
article, we propose a sparse semiparametric nonlinear modeling framework
for the establishment of a standardized chromatographic fingerprint.
Data-driven basis expansion is used to model the common shape of the
curves, while a parametric time warping function registers the
individual curves. Penalized weighted least-squares with the adaptive
lasso penalty provides a unified criterion for registration, model
selection, and estimation. Furthermore, the adaptive lasso estimators
possess attractive sampling properties. A back-fitting algorithm is
proposed for estimation. Performance is assessed through simulation and we
apply the model to chromatographic data of rhubarb collected from
different experimental conditions and establish a standardized fingerprint
as a first step in TCHM research.
Journal: Journal of the American Statistical Association
Pages: 1339-1349
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2013.836969
File-URL: http://hdl.handle.net/10.1080/01621459.2013.836969
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1339-1349
Template-Type: ReDIF-Article 1.0
Author-Name: Pang Du
Author-X-Name-First: Pang
Author-X-Name-Last: Du
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1349-1350
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.926686
File-URL: http://hdl.handle.net/10.1080/01621459.2014.926686
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1349-1350
Template-Type: ReDIF-Article 1.0
Author-Name: Huaihou Chen
Author-X-Name-First: Huaihou
Author-X-Name-Last: Chen
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1350-1353
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.972158
File-URL: http://hdl.handle.net/10.1080/01621459.2014.972158
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1350-1353
Template-Type: ReDIF-Article 1.0
Author-Name: Michael R. Wierzbicki
Author-X-Name-First: Michael R.
Author-X-Name-Last: Wierzbicki
Author-Name: Li-Bing Guo
Author-X-Name-First: Li-Bing
Author-X-Name-Last: Guo
Author-Name: Qing-Tao Du
Author-X-Name-First: Qing-Tao
Author-X-Name-Last: Du
Author-Name: Wensheng Guo
Author-X-Name-First: Wensheng
Author-X-Name-Last: Guo
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 1353-1354
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.972161
File-URL: http://hdl.handle.net/10.1080/01621459.2014.972161
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1353-1354
Template-Type: ReDIF-Article 1.0
Author-Name: Yoonsuh Jung
Author-X-Name-First: Yoonsuh
Author-X-Name-Last: Jung
Author-Name: Jianhua Z. Huang
Author-X-Name-First: Jianhua Z.
Author-X-Name-Last: Huang
Author-Name: Jianhua Hu
Author-X-Name-First: Jianhua
Author-X-Name-Last: Hu
Title: Biomarker Detection in Association Studies: Modeling SNPs Simultaneously via Logistic ANOVA
Abstract:
In genome-wide association studies, the primary task is to detect
biomarkers in the form of single nucleotide polymorphisms (SNPs) that have
nontrivial associations with a disease phenotype and some other important
clinical/environmental factors. However, the extremely large number of
SNPs compared to the sample size inhibits application of classical methods
such as multiple logistic regression. Currently, the most commonly
used approach is still to analyze one SNP at a time. In this article, we
propose to consider the genotypes of the SNPs simultaneously via a
logistic analysis of variance (ANOVA) model, which expresses the logit
transformed mean of SNP genotypes as the summation of the SNP effects,
effects of the disease phenotype and/or other clinical variables, and the
interaction effects. We use a reduced-rank representation of the
interaction-effect matrix for dimensionality reduction, and employ the
L1-penalty in a penalized likelihood framework
to filter out the SNPs that have no associations. We develop a
majorization-minimization algorithm for computational implementation. In
addition, we propose a modified BIC criterion to select the penalty
parameters and determine the rank number. The proposed method is applied
to a multiple sclerosis dataset and simulated datasets and shows promise
in biomarker detection.
Journal: Journal of the American Statistical Association
Pages: 1355-1367
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.928217
File-URL: http://hdl.handle.net/10.1080/01621459.2014.928217
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1355-1367
Template-Type: ReDIF-Article 1.0
Author-Name: Jonathan R. Stroud
Author-X-Name-First: Jonathan R.
Author-X-Name-Last: Stroud
Author-Name: Michael S. Johannes
Author-X-Name-First: Michael S.
Author-X-Name-Last: Johannes
Title: Bayesian Modeling and Forecasting of 24-Hour High-Frequency Volatility
Abstract:
This article estimates models of high-frequency index futures returns
using "around-the-clock" 5-min returns that incorporate the following key
features: multiple persistent stochastic volatility factors, jumps in
prices and volatilities, seasonal components capturing time of the day
patterns, correlations between return and volatility shocks, and
announcement effects. We develop an integrated MCMC approach to estimate
interday and intraday parameters and states using high-frequency data
without resorting to various aggregation measures like realized
volatility. We provide a case study using financial crisis data from 2007
to 2009, and use particle filters to construct likelihood functions for
model comparison and out-of-sample forecasting from 2009 to 2012. We show
that our approach improves realized volatility forecasts by up to 50% over
existing benchmarks and is also useful for risk management and trading
applications. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 1368-1384
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.937003
File-URL: http://hdl.handle.net/10.1080/01621459.2014.937003
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1368-1384
Template-Type: ReDIF-Article 1.0
Author-Name: Dimitris Rizopoulos
Author-X-Name-First: Dimitris
Author-X-Name-Last: Rizopoulos
Author-Name: Laura A. Hatfield
Author-X-Name-First: Laura A.
Author-X-Name-Last: Hatfield
Author-Name: Bradley P. Carlin
Author-X-Name-First: Bradley P.
Author-X-Name-Last: Carlin
Author-Name: Johanna J. M. Takkenberg
Author-X-Name-First: Johanna J. M.
Author-X-Name-Last: Takkenberg
Title: Combining Dynamic Predictions From Joint Models for Longitudinal and Time-to-Event Data Using Bayesian Model Averaging
Abstract:
The joint modeling of longitudinal and time-to-event data is an active
area of statistics research that has received a lot of attention in recent
years. More recently, a new and attractive application of this type of
model has been to obtain individualized predictions of survival
probabilities and/or of future longitudinal responses. The advantageous
feature of these predictions is that they are dynamically updated as extra
longitudinal responses are collected for the subjects of interest,
providing real-time risk assessment using all recorded information. The
aim of this article is two-fold. First, to highlight the importance of
modeling the association structure between the longitudinal and event time
responses that can greatly influence the derived predictions, and second,
to illustrate how we can improve the accuracy of the derived predictions
by suitably combining joint models with different association structures.
The second goal is achieved using Bayesian model averaging, which, in this
setting, has the very intriguing feature that the model weights are not
fixed but rather subject- and time-dependent, implying that at
different follow-up times predictions for the same subject may be based on
different models. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 1385-1397
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.931236
File-URL: http://hdl.handle.net/10.1080/01621459.2014.931236
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1385-1397
Template-Type: ReDIF-Article 1.0
Author-Name: Marian Farah
Author-X-Name-First: Marian
Author-X-Name-Last: Farah
Author-Name: Paul Birrell
Author-X-Name-First: Paul
Author-X-Name-Last: Birrell
Author-Name: Stefano Conti
Author-X-Name-First: Stefano
Author-X-Name-Last: Conti
Author-Name: Daniela De Angelis
Author-X-Name-First: Daniela
Author-X-Name-Last: De Angelis
Title: Bayesian Emulation and Calibration of a Dynamic Epidemic Model for A/H1N1 Influenza
Abstract:
In this article, we develop a Bayesian framework for parameter estimation
of a computationally expensive dynamic epidemic model using time series
epidemic data. Specifically, we work with a model for A/H1N1 influenza,
which is implemented as a deterministic computer
simulator, taking as input the underlying epidemic
parameters and calculating the corresponding time series of reported
infections. To obtain Bayesian inference for the epidemic parameters, the
simulator is embedded in the likelihood for the reported epidemic data.
However, the simulator is computationally slow, making it impractical to
use in Bayesian estimation where a large number of simulator runs is
required. We propose an efficient approximation to the simulator using an
emulator, a statistical model that combines a Gaussian
process (GP) prior for the output function of the simulator with a dynamic
linear model (DLM) for its evolution through time. This modeling framework
is both flexible and tractable, resulting in efficient posterior inference
through Markov chain Monte Carlo (MCMC). The proposed dynamic emulator is
then used in a calibration procedure to obtain posterior inference for the
parameters of the influenza epidemic.
Journal: Journal of the American Statistical Association
Pages: 1398-1411
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.934453
File-URL: http://hdl.handle.net/10.1080/01621459.2014.934453
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1398-1411
Template-Type: ReDIF-Article 1.0
Author-Name: Hui Huang
Author-X-Name-First: Hui
Author-X-Name-Last: Huang
Author-Name: Yehua Li
Author-X-Name-First: Yehua
Author-X-Name-Last: Li
Author-Name: Yongtao Guan
Author-X-Name-First: Yongtao
Author-X-Name-Last: Guan
Title: Joint Modeling and Clustering Paired Generalized Longitudinal Trajectories With Application to Cocaine Abuse Treatment Data
Abstract:
In a cocaine dependence treatment study, we have paired binary
longitudinal trajectories that record the cocaine use patterns of each
patient before and after a treatment. To better understand the drug-using
behaviors among the patients, we propose a general framework based on
functional data analysis to jointly model and cluster these paired
non-Gaussian longitudinal trajectories. Our approach assumes that the
response variables follow distributions from the exponential family, with
the canonical parameters determined by some latent Gaussian processes. To
reduce the dimensionality of the latent processes, we express them by a
truncated Karhunen-Loève (KL) expansion, allowing the mean and covariance
functions to be different across clusters. We further represent the mean
and eigenfunctions by flexible spline bases, and determine the
orders of the truncated KL expansions using data-driven methods. By
treating the cluster membership as a missing value, we cluster the cocaine
use trajectories by a likelihood-based approach. The cluster membership
and parameter estimates are jointly estimated by a Monte Carlo EM
algorithm with Gibbs sampling steps. We discover subgroups of patients
with distinct behaviors in terms of overall probability of use, binge
versus periodic use patterns, etc. The joint modeling approach also sheds
new light on relating relapse behavior to baseline patterns in each
subgroup. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1412-1424
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.957286
File-URL: http://hdl.handle.net/10.1080/01621459.2014.957286
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1412-1424
Template-Type: ReDIF-Article 1.0
Author-Name: Jonathan E. Gellar
Author-X-Name-First: Jonathan E.
Author-X-Name-Last: Gellar
Author-Name: Elizabeth Colantuoni
Author-X-Name-First: Elizabeth
Author-X-Name-Last: Colantuoni
Author-Name: Dale M. Needham
Author-X-Name-First: Dale M.
Author-X-Name-Last: Needham
Author-Name: Ciprian M. Crainiceanu
Author-X-Name-First: Ciprian M.
Author-X-Name-Last: Crainiceanu
Title: Variable-Domain Functional Regression for Modeling ICU Data
Abstract:
We introduce a class of scalar-on-function regression models with
subject-specific functional predictor domains. The fundamental idea is to
consider a bivariate functional parameter that depends both on the
functional argument and on the width of the functional predictor domain.
Both parametric and nonparametric models are introduced to fit the
functional coefficient. The nonparametric model is theoretically and
practically invariant to functional support transformation, or support
registration. Methods were motivated by and applied to a study of
association between daily measures of the Intensive Care Unit (ICU)
sequential organ failure assessment (SOFA) score and two outcomes:
in-hospital mortality, and physical impairment at hospital discharge among
survivors. Methods are generally applicable to a large number of new
studies that record continuous variables over unequal domains.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1425-1439
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.940044
File-URL: http://hdl.handle.net/10.1080/01621459.2014.940044
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1425-1439
Template-Type: ReDIF-Article 1.0
Author-Name: Daniel J. Graham
Author-X-Name-First: Daniel J.
Author-X-Name-Last: Graham
Author-Name: Emma J. McCoy
Author-X-Name-First: Emma J.
Author-X-Name-Last: McCoy
Author-Name: David A. Stephens
Author-X-Name-First: David A.
Author-X-Name-Last: Stephens
Title: Quantifying Causal Effects of Road Network Capacity Expansions on Traffic Volume and Density via a Mixed Model Propensity Score Estimator
Abstract:
Road network capacity expansions are frequently proposed as solutions to
urban traffic congestion but are controversial because it is thought that
they can directly "induce" growth in traffic volumes. This article
quantifies causal effects of road network capacity expansions on aggregate
urban traffic volume and density in U.S. cities using a mixed model
propensity score (PS) estimator. The motivation for this approach is that
we seek to estimate a dose-response relationship between capacity and
volume but suspect confounding from both observed and unobserved
characteristics. Analytical results and simulations show that a
longitudinal mixed model PS approach can be used to adjust effectively for
time-invariant unobserved confounding via random effects (RE). Our
empirical results indicate that network capacity expansions can cause
substantial increases in aggregate urban traffic volumes such that even
major capacity increases can actually lead to little or no reduction in
network traffic densities. This result has important implications for
optimal urban transportation strategies. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 1440-1449
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.956871
File-URL: http://hdl.handle.net/10.1080/01621459.2014.956871
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1440-1449
Template-Type: ReDIF-Article 1.0
Author-Name: Dungang Liu
Author-X-Name-First: Dungang
Author-X-Name-Last: Liu
Author-Name: Regina Y. Liu
Author-X-Name-First: Regina Y.
Author-X-Name-Last: Liu
Author-Name: Min-ge Xie
Author-X-Name-First: Min-ge
Author-X-Name-Last: Xie
Title: Exact Meta-Analysis Approach for Discrete Data and its Application to 2 × 2 Tables With Rare Events
Abstract:
This article proposes a general exact meta-analysis approach for
synthesizing inferences from multiple studies of discrete data. The
approach combines the p-value functions (also known as
significance functions) associated with the exact tests
from individual studies. It encompasses a broad class of exact
meta-analysis methods, as it permits broad choices for the combining
elements, such as tests used in individual studies, and any parameter of
interest. The approach yields statements that explicitly account for the
impact of individual studies on the overall inference, in terms of
efficiency/power and the Type I error rate. These statements also give
rise to empirical methods for further enhancing the combined inference.
Although the proposed approach is for general discrete settings, for
convenience, it is illustrated throughout using the setting of
meta-analysis of multiple 2 × 2 tables. In the context of rare
events data, such as observing few, zero, or zero total (i.e., zero events
in both arms) outcomes in binomial trials or 2 × 2 tables, most
existing meta-analysis methods rely on large-sample approximations, which
may yield invalid inference. The corrections commonly applied to zero
outcomes in rare events data, aimed at improving numerical performance,
can also incur undesirable consequences. The proposed approach applies readily
to any rare event setting, including even the zero total event studies
without any artificial correction. While debates continue on whether or
how zero total event studies should be incorporated in meta-analysis, the
proposed approach has the advantage of automatically including those
studies and thus making use of all available data. Through numerical
studies in rare events settings, the proposed exact approach is shown to
be efficient and to generally outperform commonly used meta-analysis
methods, including Mantel-Haenszel and Peto methods.
Journal: Journal of the American Statistical Association
Pages: 1450-1465
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.946318
File-URL: http://hdl.handle.net/10.1080/01621459.2014.946318
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1450-1465
Template-Type: ReDIF-Article 1.0
Author-Name: Edoardo M. Airoldi
Author-X-Name-First: Edoardo M.
Author-X-Name-Last: Airoldi
Author-Name: Thiago Costa
Author-X-Name-First: Thiago
Author-X-Name-Last: Costa
Author-Name: Federico Bassetti
Author-X-Name-First: Federico
Author-X-Name-Last: Bassetti
Author-Name: Fabrizio Leisen
Author-X-Name-First: Fabrizio
Author-X-Name-Last: Leisen
Author-Name: Michele Guindani
Author-X-Name-First: Michele
Author-X-Name-Last: Guindani
Title: Generalized Species Sampling Priors With Latent Beta Reinforcements
Abstract:
Many popular Bayesian nonparametric priors can be characterized in terms
of exchangeable species sampling sequences. However, in some applications,
exchangeability may not be appropriate. We introduce a novel and
probabilistically coherent family of nonexchangeable species sampling
sequences characterized by a tractable predictive probability function
with weights driven by a sequence of independent Beta random variables. We
compare their theoretical clustering properties with those of the
Dirichlet process and the two-parameter Poisson-Dirichlet process. The
proposed construction provides a complete characterization of the joint
process, in contrast to existing work. We then propose the use of such a
process as a prior distribution in a hierarchical Bayes modeling framework,
and we describe a Markov chain Monte Carlo sampler for posterior
inference. We evaluate the performance of the prior and the robustness of
the resulting inference in a simulation study, providing a comparison with
popular Dirichlet process mixtures and hidden Markov models. Finally, we
develop an application to the detection of chromosomal aberrations in
breast cancer by leveraging array comparative genomic hybridization (CGH)
data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1466-1480
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.950735
File-URL: http://hdl.handle.net/10.1080/01621459.2014.950735
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1466-1480
Template-Type: ReDIF-Article 1.0
Author-Name: Kelvin Gu
Author-X-Name-First: Kelvin
Author-X-Name-Last: Gu
Author-Name: Debdeep Pati
Author-X-Name-First: Debdeep
Author-X-Name-Last: Pati
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Bayesian Multiscale Modeling of Closed Curves in Point Clouds
Abstract:
Modeling object boundaries based on image or point cloud data is
frequently necessary in medical and scientific applications ranging from
detecting tumor contours for targeted radiation therapy, to the
classification of organisms based on their structural information. In
low-contrast images or sparse and noisy point clouds, there is often
insufficient data to recover local segments of the boundary in isolation.
Thus, it becomes critical to model the entire boundary in the form of a
closed curve. To achieve this, we develop a Bayesian hierarchical model
that expresses highly diverse 2D objects in the form of closed curves. The
model is based on a novel multiscale deformation process. By relating
multiple objects through a hierarchical formulation, we can successfully
recover missing boundaries by borrowing structural information from
similar objects at the appropriate scale. Furthermore, the model's latent
parameters help interpret the population, indicating dimensions of
significant structural variability and also specifying a "central curve"
that summarizes the collection. Theoretical properties of our prior are
studied in specific cases and efficient Markov chain Monte Carlo methods
are developed, evaluated through simulation examples and applied to
panorex teeth images for modeling teeth contours and also to a brain tumor
contour detection problem.
Journal: Journal of the American Statistical Association
Pages: 1481-1494
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.934825
File-URL: http://hdl.handle.net/10.1080/01621459.2014.934825
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1481-1494
Template-Type: ReDIF-Article 1.0
Author-Name: Qing Zhou
Author-X-Name-First: Qing
Author-X-Name-Last: Zhou
Title: Monte Carlo Simulation for Lasso-Type Problems by Estimator Augmentation
Abstract:
Regularized linear regression under the ℓ1 penalty, such
as the Lasso, has been shown to be effective in variable selection and
sparse modeling. The sampling distribution of an
ℓ1-penalized estimator β̂ is hard to
determine, as the estimator is defined by an optimization problem that in
general can only be solved numerically, and many of its components may be
exactly zero. Let S be the subgradient of the
ℓ1 norm of the coefficient vector
β evaluated at β̂. We find that
the joint sampling distribution of β̂ and
S, together called an augmented estimator, is much more
tractable and has a closed-form density under a normal error distribution
in both low-dimensional (p ⩽ n)
and high-dimensional (p > n) settings.
Given β and the error variance
σ², one may employ standard Monte Carlo methods, such as
Markov chain Monte Carlo (MCMC) and importance sampling (IS), to draw
samples from the distribution of the augmented estimator and calculate
expectations with respect to the sampling distribution of
β̂. We develop a
few concrete Monte Carlo algorithms and demonstrate with numerical
examples that our approach may offer huge advantages and great flexibility
in studying sampling distributions in ℓ1-penalized
linear regression. We also establish nonasymptotic bounds on the
difference between the true sampling distribution of
β̂ and its
estimator obtained by plugging in estimated parameters, which justifies
the validity of Monte Carlo simulation from an estimated sampling
distribution even when p ≫ n →
∞.
Journal: Journal of the American Statistical Association
Pages: 1495-1516
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.946035
File-URL: http://hdl.handle.net/10.1080/01621459.2014.946035
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1495-1516
Template-Type: ReDIF-Article 1.0
Author-Name: Lu Tian
Author-X-Name-First: Lu
Author-X-Name-Last: Tian
Author-Name: Ash A. Alizadeh
Author-X-Name-First: Ash A.
Author-X-Name-Last: Alizadeh
Author-Name: Andrew J. Gentles
Author-X-Name-First: Andrew J.
Author-X-Name-Last: Gentles
Author-Name: Robert Tibshirani
Author-X-Name-First: Robert
Author-X-Name-Last: Tibshirani
Title: A Simple Method for Estimating Interactions Between a Treatment and a Large Number of Covariates
Abstract:
We consider a setting in which we have a treatment and a potentially large
number of covariates for a set of observations, and wish to model their
relationship with an outcome of interest. We propose a simple method for
modeling interactions between the treatment and covariates. The idea is to
modify the covariate in a simple way, and then fit a standard model using
the modified covariates and no main effects. We show that coupled with an
efficiency augmentation procedure, this method produces clinically
meaningful estimators in a variety of settings. It can be useful for
practicing personalized medicine: determining from a large set of
biomarkers, the subset of patients that can potentially benefit from a
treatment. We apply the method to both simulated datasets and real trial
data. The modified covariates idea can be used for other purposes, for
example, large scale hypothesis testing for determining which of a set of
covariates interact with a treatment variable. Supplementary materials for
this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1517-1532
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.951443
File-URL: http://hdl.handle.net/10.1080/01621459.2014.951443
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1517-1532
Template-Type: ReDIF-Article 1.0
Author-Name: Y. J. Hu
Author-X-Name-First: Y. J.
Author-X-Name-Last: Hu
Author-Name: D. Y. Lin
Author-X-Name-First: D. Y.
Author-X-Name-Last: Lin
Author-Name: W. Sun
Author-X-Name-First: W.
Author-X-Name-Last: Sun
Author-Name: D. Zeng
Author-X-Name-First: D.
Author-X-Name-Last: Zeng
Title: A Likelihood-Based Framework for Association Analysis of Allele-Specific Copy Numbers
Abstract:
Copy number variants (CNVs) and single nucleotide polymorphisms (SNPs)
coexist throughout the human genome and jointly contribute to phenotypic
variations. Thus, it is desirable to consider both types of variants, as
characterized by allele-specific copy numbers (ASCNs), in association
studies of complex human diseases. Current SNP genotyping technologies
capture the CNV and SNP information simultaneously via fluorescent
intensity measurements. The common practice of calling ASCNs from the
intensity measurements and then using the ASCN calls in downstream
association analysis has important limitations. First, the association
tests are prone to false-positive findings when differential measurement
errors between cases and controls arise from differences in DNA quality or
handling. Second, the uncertainties in the ASCN calls are ignored. We
present a general framework for the integrated analysis of CNVs and SNPs,
including the analysis of total copy numbers as a special case. Our
approach combines the ASCN calling and the association analysis into a
single step while allowing for differential measurement errors. We
construct likelihood functions that properly account for case-control
sampling and measurement errors. We establish the asymptotic properties of
the maximum likelihood estimators and develop EM algorithms to implement
the corresponding inference procedures. The advantages of the proposed
methods over the existing ones are demonstrated through realistic
simulation studies and an application to a genome-wide association study
of schizophrenia. Extensions to next-generation sequencing data are
discussed.
Journal: Journal of the American Statistical Association
Pages: 1533-1545
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.908777
File-URL: http://hdl.handle.net/10.1080/01621459.2014.908777
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1533-1545
Template-Type: ReDIF-Article 1.0
Author-Name: Zudi Lu
Author-X-Name-First: Zudi
Author-X-Name-Last: Lu
Author-Name: Dag Tjøstheim
Author-X-Name-First: Dag
Author-X-Name-Last: Tjøstheim
Title: Nonparametric Estimation of Probability Density Functions for Irregularly Observed Spatial Data
Abstract:
Nonparametric estimation of probability density functions, both marginal
and joint densities, is a very useful tool in statistics. The kernel
method is popular and applicable to dependent data, including time series
and spatial data. But at least for the joint density, one has had to
assume that data are observed at regular time intervals or on a regular
grid in space. Though this is not very restrictive in the time series
case, it often is in the spatial case. In fact, to a large degree it has
precluded applications of nonparametric methods to spatial data because
such data often are irregularly positioned over space. In this article, we
propose nonparametric kernel estimators for both the marginal and in
particular the joint probability density functions for nongridded spatial
data. Large sample distributions of the proposed estimators are
established under mild conditions, and a new framework of expanding-domain
infill asymptotics is suggested to overcome the shortcomings of spatial
asymptotics in the existing literature. A practical, reasonable selection
of the bandwidths on the basis of cross-validation is also proposed. We
demonstrate by both simulations and real data examples of moderate sample
size that the proposed methodology is effective and useful in uncovering
nonlinear spatial dependence for general, including non-Gaussian,
distributions. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 1546-1564
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.947376
File-URL: http://hdl.handle.net/10.1080/01621459.2014.947376
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1546-1564
Template-Type: ReDIF-Article 1.0
Author-Name: Fangpo Wang
Author-X-Name-First: Fangpo
Author-X-Name-Last: Wang
Author-Name: Alan E. Gelfand
Author-X-Name-First: Alan E.
Author-X-Name-Last: Gelfand
Title: Modeling Space and Space-Time Directional Data Using Projected Gaussian Processes
Abstract:
Directional data naturally arise in many scientific fields, such as
oceanography (wave direction), meteorology (wind direction), and biology
(animal movement direction). Our contribution is to develop a fully
model-based approach to capture structured spatial dependence for modeling
directional data at different spatial locations. We build a projected
Gaussian spatial process, induced from a bivariate Gaussian
spatial process. We discuss the properties of the projected Gaussian
process and show how to fit this process as a model for data, using
suitable latent variables, with Markov chain Monte Carlo methods. We also
show how to implement spatial interpolation and conduct model comparison
in this setting. Simulated examples are provided as proof of concept. A
data application arises for modeling wave direction data in the Adriatic
Sea, off the coast of Italy. In fact, these directional data are available
across time, requiring a spatio-temporal model for its analysis. We
discuss and illustrate this extension.
Journal: Journal of the American Statistical Association
Pages: 1565-1580
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.934454
File-URL: http://hdl.handle.net/10.1080/01621459.2014.934454
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1565-1580
Template-Type: ReDIF-Article 1.0
Author-Name: Matthew Plumlee
Author-X-Name-First: Matthew
Author-X-Name-Last: Plumlee
Title: Fast Prediction of Deterministic Functions Using Sparse Grid Experimental Designs
Abstract:
Random field models have been widely employed to develop a predictor of an
expensive function based on observations from an experiment. The
traditional framework for developing a predictor with random field models
can fail due to the computational burden it requires. This problem is
often seen in cases where the input of the expensive function is high
dimensional. While many previous works have focused on developing an
approximate predictor to resolve these issues, this article investigates
a different solution mechanism. We demonstrate that when a general set of
designs is employed, the resulting predictor is quick to compute and has
reasonable accuracy. The fast computation of the predictor is made
possible through an algorithm proposed by this work. This article also
demonstrates methods to quickly evaluate the likelihood of the
observations and describes some fast maximum likelihood estimates for
unknown parameters of the random field. The computational savings can be
several orders of magnitude when the input is located in a
high-dimensional space. Beyond the fast computation of the predictor,
existing research has demonstrated that a subset of these designs generate
predictors that are asymptotically efficient. This work details some
empirical comparisons to the more common space-filling designs that verify
the designs are competitive in terms of resulting prediction accuracy.
Journal: Journal of the American Statistical Association
Pages: 1581-1591
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.900250
File-URL: http://hdl.handle.net/10.1080/01621459.2014.900250
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1581-1591
Template-Type: ReDIF-Article 1.0
Author-Name: Bradley Jones
Author-X-Name-First: Bradley
Author-X-Name-Last: Jones
Author-Name: Dibyen Majumdar
Author-X-Name-First: Dibyen
Author-X-Name-Last: Majumdar
Title: Optimal Supersaturated Designs
Abstract:
We consider screening experiments where an investigator wishes to study
many factors using fewer observations. Our focus is on experiments with
two-level factors and a main effects model with intercept. Since the
number of parameters is larger than the number of observations,
traditional methods of inference and design are unavailable. In 1959, Box
suggested the use of supersaturated designs and in 1962,
Booth and Cox introduced measures for efficiency of these designs,
including E(s²), which is the
average of the squares of the off-diagonal entries of the information
matrix, ignoring the intercept. For a design to be
E(s²)-optimal, the main effect of
every factor must be orthogonal to the intercept (factors are
balanced), and among all designs that satisfy this
condition, it should minimize
E(s²). This is a natural approach
since it identifies the most nearly orthogonal design,
and orthogonal designs enjoy many desirable properties, including efficient
parameter estimation. Factor balance in an
E(s²)-optimal design has the
consequence that the intercept is the most precisely estimated parameter.
We introduce and study
UE(s²)-optimality, which is
essentially the same as
E(s²)-optimality, except that we
do not insist on factor balance. We also provide a method of construction.
We introduce a second criterion from a traditional design optimality
theory viewpoint. We use minimization of bias as our estimation criterion,
and minimization of the variance of the minimum
bias estimator as the design optimality criterion. Using
D-optimality as the specific design optimality criterion,
we introduce D-optimal supersaturated designs. We show
that D-optimal supersaturated designs can be constructed
from D-optimal chemical balance weighing designs obtained
by Galil and Kiefer (1980, 1982), Cheng (1980), and other authors. It turns
out that, except when the number of observations and the number of factors
are in a certain range, a
UE(s²)-optimal design is also a
D-optimal supersaturated design. Moreover, these designs
have an interesting connection to Bayes optimal designs. When the prior
variance is large enough, a D-optimal supersaturated
design is Bayes D-optimal, and when the prior variance is
small enough, a UE(s²)-optimal
design is Bayes D-optimal. While
E(s²)-optimal designs yield
precise intercept estimates, our study indicates that
UE(s²)-optimal designs generally
produce more efficient estimates for the main effects of the factors.
Based on theoretical properties and the study of examples, we recommend
UE(s²)-optimal designs for
screening experiments.
Journal: Journal of the American Statistical Association
Pages: 1592-1600
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.938810
File-URL: http://hdl.handle.net/10.1080/01621459.2014.938810
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1592-1600
Template-Type: ReDIF-Article 1.0
Author-Name: Alberto Abadie
Author-X-Name-First: Alberto
Author-X-Name-Last: Abadie
Author-Name: Guido W. Imbens
Author-X-Name-First: Guido W.
Author-X-Name-Last: Imbens
Author-Name: Fanyin Zheng
Author-X-Name-First: Fanyin
Author-X-Name-Last: Zheng
Title: Inference for Misspecified Models With Fixed Regressors
Abstract:
Following the work of Eicker, Huber, and White, it is common in empirical
work to report standard errors that are robust against general
misspecification. In a regression setting, these standard errors are valid
for the parameter that minimizes the squared difference between the
conditional expectation and a linear approximation, averaged over the
population distribution of the covariates. Here, we discuss an alternative
parameter that corresponds to the approximation to the conditional
expectation based on minimization of the squared difference averaged over
the sample, rather than the population, distribution of the covariates. We
argue that in some cases this may be a more interesting parameter. We
derive the asymptotic variance for this parameter, which is generally
smaller than the Eicker-Huber-White robust variance, and propose a
consistent estimator for this asymptotic variance. Supplementary materials
for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1601-1614
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.928218
File-URL: http://hdl.handle.net/10.1080/01621459.2014.928218
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1601-1614
Template-Type: ReDIF-Article 1.0
Author-Name: Michael Rosenthal
Author-X-Name-First: Michael
Author-X-Name-Last: Rosenthal
Author-Name: Wei Wu
Author-X-Name-First: Wei
Author-X-Name-Last: Wu
Author-Name: Eric Klassen
Author-X-Name-First: Eric
Author-X-Name-Last: Klassen
Author-Name: Anuj Srivastava
Author-X-Name-First: Anuj
Author-X-Name-Last: Srivastava
Title: Spherical Regression Models Using Projective Linear Transformations
Abstract:
This article studies the problem of modeling relationship between two
spherical (or directional) random variables in a regression setup. Here
the predictor and the response variables are constrained to be on a unit
sphere and, due to this nonlinear condition, the standard Euclidean
regression models do not apply. Several past papers have studied this
problem, termed spherical regression, by modeling the response variable
with a von Mises-Fisher (VMF) density with the mean given by a rotation of
the predictor variable. The few papers that go beyond rigid rotations are
limited to one- or two-dimensional spheres. This article extends the mean
transformations to a larger group--the projective linear group of
transformations--on unit spheres of arbitrary dimensions, while keeping
the VMF density to model the noise. It develops a Newton-Raphson algorithm
on the special linear group for estimating the MLE of the regression parameter
and establishes its asymptotic properties when the sample-size becomes
large. Through a variety of experiments, using data taken from projective
shape analysis, cloud tracking, etc., and some simulations, this article
demonstrates improvements in the prediction and modeling performance of
the proposed framework over previously used models. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1615-1624
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.892881
File-URL: http://hdl.handle.net/10.1080/01621459.2014.892881
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1615-1624
Template-Type: ReDIF-Article 1.0
Author-Name: Jing Ning
Author-X-Name-First: Jing
Author-X-Name-Last: Ning
Author-Name: Jing Qin
Author-X-Name-First: Jing
Author-X-Name-Last: Qin
Author-Name: Yu Shen
Author-X-Name-First: Yu
Author-X-Name-Last: Shen
Title: Score Estimating Equations from Embedded Likelihood Functions Under Accelerated Failure Time Model
Abstract:
The semiparametric accelerated failure time (AFT) model is one of the most
popular models for analyzing time-to-event outcomes. One appealing feature
of the AFT model is that the observed failure time data can be transformed
to independent and identically distributed random variables without covariate
effects. We describe a class of estimating equations based on the score
functions for the transformed data, which are derived from the full
likelihood function under commonly used semiparametric models such as the
proportional hazards or proportional odds model. The methods of estimating
regression parameters under the AFT model can be applied to traditional
right-censored survival data as well as more complex time-to-event data
subject to length-biased sampling. We establish the asymptotic properties
and evaluate the small sample performance of the proposed estimators. We
illustrate the proposed methods through applications in two examples.
Journal: Journal of the American Statistical Association
Pages: 1625-1635
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.946034
File-URL: http://hdl.handle.net/10.1080/01621459.2014.946034
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1625-1635
Template-Type: ReDIF-Article 1.0
Author-Name: Xiao Song
Author-X-Name-First: Xiao
Author-X-Name-Last: Song
Author-Name: Ching-Yun Wang
Author-X-Name-First: Ching-Yun
Author-X-Name-Last: Wang
Title: Proportional Hazards Model With Covariate Measurement Error and Instrumental Variables
Abstract:
In biomedical studies, covariates with measurement error may occur in
survival data. Existing approaches mostly require certain replications on
the error-contaminated covariates, which may not be available in the data.
In this article, we develop a simple nonparametric correction approach for
estimation of the regression parameters in the proportional hazards model
using a subset of the sample where instrumental variables are observed.
The instrumental variables are related to the covariates through a general
nonparametric model, and no distributional assumptions are placed on the
error and the underlying true covariates. We further propose a novel
generalized method of moments nonparametric correction estimator to
improve the efficiency over the simple correction approach. The efficiency
gain can be substantial when the calibration subsample is small compared
to the whole sample. The estimators are shown to be consistent and
asymptotically normal. Performance of the estimators is evaluated via
simulation studies and by an application to data from an HIV clinical
trial. Estimation of the baseline hazard function is not addressed.
Journal: Journal of the American Statistical Association
Pages: 1636-1646
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.896805
File-URL: http://hdl.handle.net/10.1080/01621459.2014.896805
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1636-1646
Template-Type: ReDIF-Article 1.0
Author-Name: Jenný Brynjarsdóttir
Author-X-Name-First: Jenný
Author-X-Name-Last: Brynjarsdóttir
Author-Name: L. Mark Berliner
Author-X-Name-First: L. Mark
Author-X-Name-Last: Berliner
Title: Dimension-Reduced Modeling of Spatio-Temporal Processes
Abstract:
The field of spatial and spatio-temporal statistics is increasingly faced
with the challenge of very large datasets. The classical approach to
spatial and spatio-temporal modeling is very computationally demanding
when datasets are large, which has led to interest in methods that use
dimension-reduction techniques. In this article, we focus on modeling of
two spatio-temporal processes where the primary goal is to predict one
process from the other and where datasets for both processes are large. We
outline a general dimension-reduced Bayesian hierarchical modeling
approach where spatial structures of both processes are modeled in terms
of a low number of basis vectors, hence reducing the spatial dimension of
the problem. Temporal evolution of the processes and their dependence is
then modeled through the coefficients of the basis vectors. We present a
new method of obtaining data-dependent basis vectors, which is geared
toward the goal of predicting one process from the other. We apply these
methods to a statistical downscaling example, where surface temperatures
on a coarse grid over Antarctica are downscaled onto a finer grid.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1647-1659
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.904232
File-URL: http://hdl.handle.net/10.1080/01621459.2014.904232
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1647-1659
Template-Type: ReDIF-Article 1.0
Author-Name: Brian Claggett
Author-X-Name-First: Brian
Author-X-Name-Last: Claggett
Author-Name: Minge Xie
Author-X-Name-First: Minge
Author-X-Name-Last: Xie
Author-Name: Lu Tian
Author-X-Name-First: Lu
Author-X-Name-Last: Tian
Title: Meta-Analysis With Fixed, Unknown, Study-Specific Parameters
Abstract:
Meta-analysis is a valuable tool for combining information from
independent studies. However, most common meta-analysis techniques rely on
distributional assumptions that are difficult, if not impossible, to
verify. For instance, in the commonly used fixed-effects and
random-effects models, we take for granted that the underlying study-level
parameters are either exactly the same across individual studies or that
they are realizations of a random sample from a population, often under a
parametric distributional assumption. In this article, we present a new
framework for summarizing information obtained from multiple studies and
make inference that is not dependent on any distributional assumption for
the study-level parameters. Specifically, we assume the study-level
parameters are unknown, fixed parameters and draw inferences about, for
example, the quantiles of this set of parameters using study-specific
summary statistics. This type of problem is known to be quite challenging
(see Hall and Miller). We use a novel resampling method via the confidence
distributions of the study-level parameters to construct confidence
intervals for the above quantiles. We justify the validity of the interval
estimation procedure asymptotically and compare the new procedure with the
standard bootstrapping method. We also illustrate our proposal with the
data from a recent meta-analysis of the treatment effect from an
antioxidant on the prevention of contrast-induced nephropathy.
Journal: Journal of the American Statistical Association
Pages: 1660-1671
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.957288
File-URL: http://hdl.handle.net/10.1080/01621459.2014.957288
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1660-1671
Template-Type: ReDIF-Article 1.0
Author-Name: Hongyu Miao
Author-X-Name-First: Hongyu
Author-X-Name-Last: Miao
Author-Name: Hulin Wu
Author-X-Name-First: Hulin
Author-X-Name-Last: Wu
Author-Name: Hongqi Xue
Author-X-Name-First: Hongqi
Author-X-Name-Last: Xue
Title: Generalized Ordinary Differential Equation Models
Abstract:
Existing estimation methods for ordinary differential equation (ODE)
models are not applicable to discrete data. The generalized ODE (GODE)
model is therefore proposed and investigated for the first time. We
develop the likelihood-based parameter estimation and inference methods
for GODE models. We propose robust computing algorithms and rigorously
investigate the asymptotic properties of the proposed estimator by
considering both measurement errors and numerical errors in solving ODEs.
The simulation study and application of our methods to an influenza viral
dynamics study suggest that the proposed methods have a superior
performance in terms of accuracy over the existing ODE model estimation
approach and the extended smoothing-based (ESB) method. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1672-1682
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.957287
File-URL: http://hdl.handle.net/10.1080/01621459.2014.957287
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1672-1682
Template-Type: ReDIF-Article 1.0
Author-Name: Yunzhang Zhu
Author-X-Name-First: Yunzhang
Author-X-Name-Last: Zhu
Author-Name: Xiaotong Shen
Author-X-Name-First: Xiaotong
Author-X-Name-Last: Shen
Author-Name: Wei Pan
Author-X-Name-First: Wei
Author-X-Name-Last: Pan
Title: Structural Pursuit Over Multiple Undirected Graphs
Abstract:
Gaussian graphical models are useful to analyze and visualize conditional
dependence relationships between interacting units. Motivated from network
analysis under different experimental conditions, such as gene networks
for disparate cancer subtypes, we model structural changes over multiple
networks with possible heterogeneities. In particular, we estimate
multiple precision matrices describing dependencies among interacting
units through maximum penalized likelihood. Of particular interest are
homogeneous groups of similar entries across these matrices and zero
entries within them, referred to as clustering and sparseness structures,
respectively. A nonconvex method is proposed to seek a sparse
representation for each matrix and identify clusters of the entries across
the matrices. Computationally, we develop an efficient method on the basis
of difference convex programming, the augmented Lagrangian method and the
blockwise coordinate descent method, which is scalable to hundreds of
graphs with thousands of nodes through a simple necessary and sufficient
partition rule, which divides nodes into smaller disjoint subproblems
excluding zero-coefficient nodes for arbitrary graphs with convex
relaxation. Theoretically, a finite-sample error bound is derived for the
proposed method to reconstruct the clustering and sparseness structures.
This leads to consistent reconstruction of these two structures
simultaneously, permitting the number of unknown parameters to be
exponential in the sample size, and yielding the optimal performance of
the oracle estimator as if the true structures were given a priori.
Simulation studies suggest that the method enjoys the benefit of pursuing
these two disparate kinds of structures, and compares favorably against
its convex counterpart in the accuracy of structure pursuit and parameter
estimation.
Journal: Journal of the American Statistical Association
Pages: 1683-1696
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.921182
File-URL: http://hdl.handle.net/10.1080/01621459.2014.921182
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1683-1696
Template-Type: ReDIF-Article 1.0
Author-Name: Markus Frölich
Author-X-Name-First: Markus
Author-X-Name-Last: Frölich
Author-Name: Martin Huber
Author-X-Name-First: Martin
Author-X-Name-Last: Huber
Title: Treatment Evaluation With Multiple Outcome Periods Under Endogeneity and Attrition
Abstract:
This article develops a nonparametric methodology for treatment evaluation
with multiple outcome periods under treatment endogeneity and missing
outcomes. We use instrumental variables, pretreatment characteristics, and
short-term (or intermediate) outcomes to identify the average treatment
effect on the outcomes of compliers (the subpopulation whose treatment
reacts to the instrument) in multiple periods based on inverse probability
weighting. Treatment selection and attrition may depend on both observed
characteristics and the unobservable compliance type, which is possibly
related to unobserved factors. We also provide a simulation study and
apply our methods to the evaluation of a policy intervention targeting
college achievement, where we find that controlling for attrition
considerably affects the effect estimates. Supplementary materials for
this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1697-1711
Issue: 508
Volume: 109
Year: 2014
Month: 12
X-DOI: 10.1080/01621459.2014.896804
File-URL: http://hdl.handle.net/10.1080/01621459.2014.896804
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:109:y:2014:i:508:p:1697-1711
Template-Type: ReDIF-Article 1.0
Author-Name: Nathaniel Schenker
Author-X-Name-First: Nathaniel
Author-X-Name-Last: Schenker
Title: Why Your Involvement Matters
Abstract:
The International Year of Statistics, 2013, focused on outreach in a
wonderful way. As we celebrate the ASA's 175th anniversary in 2014, it is
worthwhile to look inward as well and think about how to keep our
association and profession strong, so that our successors will be able to
celebrate the 275th anniversary. The ASA, with its long history, its fine
staff and organization, and its financial resource base, is well
positioned to serve the profession, and indeed society, and it is very
successful at doing so. But the real measure of the health of our
association is the size and level of engagement of its membership, whose
participation is a major source of the ASA's strength. So, what is it that
compels people to be members? One might argue that it is the tangible
benefits that we receive in exchange for our dues--magazine and journal
subscriptions, discounted meeting registrations, and so on. Although such
benefits are attractive, I believe they are not the primary reasons people
are ASA members. What compels people is the value they find through
involvement in the association. Unlike benefits, which are objective,
value is subjective, varying over time and varying from member to member
or group to group. And unlike benefits, which can be listed as bullet
points, value is best borne out in personal experiences. In this address,
I will use experiences that ASA members have shared with me, along with
experiences of my own, to paint a picture of the deep value that
involvement in the ASA has provided. I also will challenge you to continue
to find the extraordinary value available through involvement in our
association.
Journal: Journal of the American Statistical Association
Pages: 1-5
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2015.1021616
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1021616
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:1-5
Template-Type: ReDIF-Article 1.0
Author-Name: Zhengyi Zhou
Author-X-Name-First: Zhengyi
Author-X-Name-Last: Zhou
Author-Name: David S. Matteson
Author-X-Name-First: David S.
Author-X-Name-Last: Matteson
Author-Name: Dawn B. Woodard
Author-X-Name-First: Dawn B.
Author-X-Name-Last: Woodard
Author-Name: Shane G. Henderson
Author-X-Name-First: Shane G.
Author-X-Name-Last: Henderson
Author-Name: Athanasios C. Micheas
Author-X-Name-First: Athanasios C.
Author-X-Name-Last: Micheas
Title: A Spatio-Temporal Point Process Model for Ambulance Demand
Abstract:
Ambulance demand estimation at fine time and location scales is critical
for fleet management and dynamic deployment. We are motivated by the
problem of estimating the spatial distribution of ambulance demand in
Toronto, Canada, as it changes over discrete 2-hour intervals. This
large-scale dataset is sparse at the desired temporal resolutions and
exhibits location-specific serial dependence, daily, and weekly
seasonality. We address these challenges by introducing a novel
characterization of time-varying Gaussian mixture models. We fix the
mixture component distributions across all time periods to overcome data
sparsity and accurately describe Toronto's spatial structure, while
representing the complex spatio-temporal dynamics through time-varying
mixture weights. We constrain the mixture weights to capture weekly
seasonality, and apply a conditionally autoregressive prior on the mixture
weights of each component to represent location-specific short-term serial
dependence and daily seasonality. While estimation may be performed using
a fixed number of mixture components, we also extend to estimate the
number of components using birth-and-death Markov chain Monte Carlo. The
proposed model is shown to give higher statistical predictive accuracy and
to reduce the error in predicting emergency medical service operational
performance by as much as two-thirds compared to a typical industry
practice.
Journal: Journal of the American Statistical Association
Pages: 6-15
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.941466
File-URL: http://hdl.handle.net/10.1080/01621459.2014.941466
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:6-15
Template-Type: ReDIF-Article 1.0
Author-Name: Xu Tang
Author-X-Name-First: Xu
Author-X-Name-Last: Tang
Author-Name: Fah F. Gan
Author-X-Name-First: Fah F.
Author-X-Name-Last: Gan
Author-Name: Lingyun Zhang
Author-X-Name-First: Lingyun
Author-X-Name-Last: Zhang
Title: Risk-Adjusted Cumulative Sum Charting Procedure Based on Multiresponses
Abstract:
The cumulative sum charting procedure is traditionally used in the
manufacturing industry for monitoring the quality of products. Recently,
it has been extended to monitoring surgical outcomes. Unlike a
manufacturing process where the raw material is usually reasonably
homogeneous, patients' risks of surgical failure are usually different. It
has been proposed in the literature that the binary outcomes from a
surgical procedure be adjusted using the preoperative risk based on a
likelihood-ratio scoring method. Such a crude binary classification of
surgical outcome is naive, however. It is unreasonable to regard a patient
who has a full recovery as having the same quality of outcome as another
patient who survived but remained bed-ridden for life. For a patient who
survives an operation,
there can be many different grades of recovery. Thus, it makes sense to
consider a risk-adjusted cumulative sum charting procedure based on more
than two outcomes to better monitor surgical performance. In this article,
we develop such a chart and study its performance.
Journal: Journal of the American Statistical Association
Pages: 16-26
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.960965
File-URL: http://hdl.handle.net/10.1080/01621459.2014.960965
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:16-26
Template-Type: ReDIF-Article 1.0
Author-Name: Alexander M. Franks
Author-X-Name-First: Alexander M.
Author-X-Name-Last: Franks
Author-Name: Gábor Csárdi
Author-X-Name-First: Gábor
Author-X-Name-Last: Csárdi
Author-Name: D. Allan Drummond
Author-X-Name-First: D. Allan
Author-X-Name-Last: Drummond
Author-Name: Edoardo M. Airoldi
Author-X-Name-First: Edoardo M.
Author-X-Name-Last: Airoldi
Title: Estimating a Structured Covariance Matrix From Multilab Measurements in High-Throughput Biology
Abstract:
We consider the problem of quantifying the degree of coordination between
transcription and translation, in yeast. Several studies have reported a
surprising lack of coordination over the years, in organisms as different
as yeast and humans, using diverse technologies. However, a close look at
this literature suggests that the lack of reported correlation may not
reflect the biology of regulation. These reports do not control for
between-study biases and structure in the measurement errors, ignore key
aspects of how the data connect to the estimand, and systematically
underestimate the correlation as a consequence. Here, we design a careful
meta-analysis of 27 yeast datasets, supported by a multilevel model, full
uncertainty quantification, a suite of sensitivity analyses, and novel
theory, to produce a more accurate estimate of the correlation between
mRNA and protein levels--a proxy for coordination. From a statistical
perspective, this problem motivates new theory on the impact of noise,
model misspecifications, and nonignorable missing data on estimates of the
correlation between high-dimensional responses. We find that the
correlation between mRNA and protein levels is quite high under the
studied conditions, in yeast, suggesting that post-transcriptional
regulation plays a less prominent role than previously thought.
Journal: Journal of the American Statistical Association
Pages: 27-44
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.964404
File-URL: http://hdl.handle.net/10.1080/01621459.2014.964404
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:27-44
Template-Type: ReDIF-Article 1.0
Author-Name: Antonio R. Linero
Author-X-Name-First: Antonio R.
Author-X-Name-Last: Linero
Author-Name: Michael J. Daniels
Author-X-Name-First: Michael J.
Author-X-Name-Last: Daniels
Title: A Flexible Bayesian Approach to Monotone Missing Data in Longitudinal Studies With Nonignorable Missingness With Application to an Acute Schizophrenia Clinical Trial
Abstract:
We develop a Bayesian nonparametric model for a longitudinal response in
the presence of nonignorable missing data. Our general approach is to
first specify a working model that flexibly models the missingness and
full outcome processes jointly. We specify a Dirichlet process mixture of
missing at random (MAR) models as a prior on the joint distribution of the
working model. This aspect of the model governs the fit of the observed
data by modeling the observed data distribution as the marginalization
over the missing data in the working model. We then separately specify the
conditional distribution of the missing data given the observed data and
dropout. This approach allows us to identify the distribution of the
missing data using identifying restrictions as a starting point. We
propose a framework for introducing sensitivity parameters, allowing us to
vary the untestable assumptions about the missing data mechanism smoothly.
Informative priors on the space of missing data assumptions can be
specified to combine inferences under many different assumptions into a
final inference and accurately characterize uncertainty. These methods are
motivated by, and applied to, data from a clinical trial assessing the
efficacy of a new treatment for acute schizophrenia. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 45-55
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.969424
File-URL: http://hdl.handle.net/10.1080/01621459.2014.969424
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:45-55
Template-Type: ReDIF-Article 1.0
Author-Name: Giwhyun Lee
Author-X-Name-First: Giwhyun
Author-X-Name-Last: Lee
Author-Name: Yu Ding
Author-X-Name-First: Yu
Author-X-Name-Last: Ding
Author-Name: Marc G. Genton
Author-X-Name-First: Marc G.
Author-X-Name-Last: Genton
Author-Name: Le Xie
Author-X-Name-First: Le
Author-X-Name-Last: Xie
Title: Power Curve Estimation With Multivariate Environmental Factors for Inland and Offshore Wind Farms
Abstract:
In the wind industry, a power curve refers to the functional relationship
between the power output generated by a wind turbine and the wind speed at
the time of power generation. Power curves are used in practice for a
number of important tasks including predicting wind power production and
assessing a turbine's energy production efficiency. Nevertheless, actual
wind power data indicate that the power output is affected by more than
just wind speed. Several other environmental factors, such as wind
direction, air density, humidity, turbulence intensity, and wind shears,
have potential impact. Yet, in industry practice, as well as in the
literature, current power curve models primarily consider wind speed and,
sometimes, wind speed and direction. We propose an additive multivariate
kernel method that can include the aforementioned environmental factors as
a new power curve model. Our model provides, conditional on a given
environmental condition, both the point estimation and density estimation
of power output. It is able to capture the nonlinear relationships between
environmental factors and the wind power output, as well as the high-order
interaction effects among some of the environmental factors. Using
operational data associated with four turbines in an inland wind farm and
two turbines in an offshore wind farm, we demonstrate the improvement
achieved by our kernel method.
Journal: Journal of the American Statistical Association
Pages: 56-67
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.977385
File-URL: http://hdl.handle.net/10.1080/01621459.2014.977385
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:56-67
Template-Type: ReDIF-Article 1.0
Author-Name: Curtis B. Storlie
Author-X-Name-First: Curtis B.
Author-X-Name-Last: Storlie
Author-Name: William A. Lane
Author-X-Name-First: William A.
Author-X-Name-Last: Lane
Author-Name: Emily M. Ryan
Author-X-Name-First: Emily M.
Author-X-Name-Last: Ryan
Author-Name: James R. Gattiker
Author-X-Name-First: James R.
Author-X-Name-Last: Gattiker
Author-Name: David M. Higdon
Author-X-Name-First: David M.
Author-X-Name-Last: Higdon
Title: Calibration of Computational Models With Categorical Parameters and Correlated Outputs via Bayesian Smoothing Spline ANOVA
Abstract:
It has become commonplace to use complex computer models to predict
outcomes in regions where data do not exist. Typically these models need
to be calibrated and validated using some experimental data, which often
consists of multiple correlated outcomes. In addition, some of the model
parameters may be categorical in nature, such as a pointer variable to
alternate models (or submodels) for some of the physics of the system.
Here, we present a general approach for calibration in such situations
where an emulator of the computationally demanding models and a
discrepancy term from the model to reality are represented within a
Bayesian smoothing spline (BSS) ANOVA framework. The BSS-ANOVA framework
has several advantages over the traditional Gaussian process, including
ease of handling categorical inputs and correlated outputs, and improved
computational efficiency. Finally, this framework is applied to the
problem that motivated its design: the calibration of a computational fluid
dynamics (CFD) model of a bubbling fluidized bed, which is used as an
absorber in a CO2 capture system. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 68-82
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.979993
File-URL: http://hdl.handle.net/10.1080/01621459.2014.979993
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:68-82
Template-Type: ReDIF-Article 1.0
Author-Name: Kentaro Fukumoto
Author-X-Name-First: Kentaro
Author-X-Name-Last: Fukumoto
Title: What Happens Depends on When It Happens: Copula-Based Ordered Event History Analysis of Civil War Duration and Outcome
Abstract:
Scholars are interested in not just what event happens but also when the
event happens. If there is dependence among events or dependence between
time and events, however, the currently common methods (e.g., competing
risks approaches) produce biased estimates. To deal with these problems,
this article proposes a new method of copula-based ordered event history
analysis (COEHA). A merit of working with copulas is that, whatever
marginal distributions time and event variables follow (including the Cox
model), researchers can derive whatever joint distribution exists between
the two. Application of the COEHA model to a dataset from civil wars
supports two controversial hypotheses. First, as wars become longer, rebel
victory becomes more likely but settlement does not (there is dependence
between time and events at both tails). Second, stronger rebels make wars
shorter but do not necessarily tend to win, as experts predict but fail to
establish (rebels' strength shortens time but has no effect on which
events occur). Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 83-92
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.979994
File-URL: http://hdl.handle.net/10.1080/01621459.2014.979994
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:83-92
Template-Type: ReDIF-Article 1.0
Author-Name: Tingting Zhang
Author-X-Name-First: Tingting
Author-X-Name-Last: Zhang
Author-Name: Jingwei Wu
Author-X-Name-First: Jingwei
Author-X-Name-Last: Wu
Author-Name: Fan Li
Author-X-Name-First: Fan
Author-X-Name-Last: Li
Author-Name: Brian Caffo
Author-X-Name-First: Brian
Author-X-Name-Last: Caffo
Author-Name: Dana Boatman-Reich
Author-X-Name-First: Dana
Author-X-Name-Last: Boatman-Reich
Title: A Dynamic Directional Model for Effective Brain Connectivity Using Electrocorticographic (ECoG) Time Series
Abstract:
We introduce a dynamic directional model (DDM) for studying brain
effective connectivity based on intracranial electrocorticographic (ECoG)
time series. The DDM consists of two parts: a set of differential
equations describing neuronal activity of brain components (state
equations), and observation equations linking the underlying neuronal
states to observed data. When applied to functional MRI or EEG data, DDMs
usually have complex formulations and thus can accommodate only a few
regions, due to limitations in spatial resolution and/or temporal
resolution of these imaging modalities. In contrast, we formulate our
model in the context of ECoG data. The combined high temporal and spatial
resolution of ECoG data results in a much simpler DDM, allowing
investigation of complex connections between many regions. To identify
functionally segregated subnetworks, a form of biologically economical
brain networks, we propose the Potts model for the DDM parameters. The
neuronal states of brain components are represented by cubic spline bases
and the parameters are estimated by minimizing a log-likelihood criterion
that combines the state and observation equations. The Potts model is
converted to the Potts penalty in the penalized regression approach to
achieve sparsity in parameter estimation, for which a fast iterative
algorithm is developed. The methods are applied to an auditory ECoG
dataset.
Journal: Journal of the American Statistical Association
Pages: 93-106
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.988213
File-URL: http://hdl.handle.net/10.1080/01621459.2014.988213
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:93-106
Template-Type: ReDIF-Article 1.0
Author-Name: Weibing Huang
Author-X-Name-First: Weibing
Author-X-Name-Last: Huang
Author-Name: Charles-Albert Lehalle
Author-X-Name-First: Charles-Albert
Author-X-Name-Last: Lehalle
Author-Name: Mathieu Rosenbaum
Author-X-Name-First: Mathieu
Author-X-Name-Last: Rosenbaum
Title: Simulating and Analyzing Order Book Data: The Queue-Reactive Model
Abstract:
Through the analysis of a dataset of ultra high frequency order book
updates, we introduce a model which accommodates the empirical properties
of the full order book together with the stylized facts of lower frequency
financial data. To do so, we split the time interval of interest into
periods in which a well chosen reference price, typically the midprice,
remains constant. Within these periods, we view the limit order book as a
Markov queuing system. Indeed, we assume that the intensities of the order
flows only depend on the current state of the order book. We establish the
limiting behavior of this model and estimate its parameters from market
data. Then, to design a relevant model for the whole period of interest,
we use a stochastic mechanism that allows us to switch from one period of
constant reference price to another. Beyond accurately reproducing the
behavior of market data, we show that our framework can be
very useful for practitioners, notably as a market simulator or as a tool
for the transaction cost analysis of complex trading algorithms.
Journal: Journal of the American Statistical Association
Pages: 107-122
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.982278
File-URL: http://hdl.handle.net/10.1080/01621459.2014.982278
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:107-122
Template-Type: ReDIF-Article 1.0
Author-Name: Matthew J. Heaton
Author-X-Name-First: Matthew J.
Author-X-Name-Last: Heaton
Author-Name: Stephan R. Sain
Author-X-Name-First: Stephan R.
Author-X-Name-Last: Sain
Author-Name: Andrew J. Monaghan
Author-X-Name-First: Andrew J.
Author-X-Name-Last: Monaghan
Author-Name: Olga V. Wilhelmi
Author-X-Name-First: Olga V.
Author-X-Name-Last: Wilhelmi
Author-Name: Mary H. Hayden
Author-X-Name-First: Mary H.
Author-X-Name-Last: Hayden
Title: An Analysis of an Incomplete Marked Point Pattern of Heat-Related 911 Calls
Abstract:
We analyze an incomplete marked point pattern of heat-related 911 calls
between 2006 and 2010 in Houston, TX, primarily to investigate
conditions that are associated with increased vulnerability to
heat-related morbidity and, secondarily, build a statistical model that
can be used as a public health tool to predict the volume of 911 calls
given a time frame and heat exposure. We model the calls as arising from a
nonhomogeneous Cox process with unknown intensity measure. By using the
kernel convolution construction of a Gaussian process, the intensity
surface is modeled using a low-dimensional representation and properly
adheres to circular domain constraints. We account for the incomplete
observations by marginalizing the joint intensity measure over the domain
of the missing marks and also demonstrate model-based imputation. We find
that spatial regions of high risk for heat-related 911 calls are
temporally dynamic, with the highest risk occurring in urban areas during
the day. We also find that elderly populations have a higher
probability of calling 911 with heat-related issues than younger
populations. Finally, the age of individuals and hour of the day with the
highest intensity of heat-related 911 calls varies by race/ethnicity.
Supplementary materials are included with this article.
Journal: Journal of the American Statistical Association
Pages: 123-135
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.983229
File-URL: http://hdl.handle.net/10.1080/01621459.2014.983229
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:123-135
Template-Type: ReDIF-Article 1.0
Author-Name: J. L. Scealy
Author-X-Name-First: J. L.
Author-X-Name-Last: Scealy
Author-Name: Patrice de Caritat
Author-X-Name-First: Patrice
Author-X-Name-Last: de Caritat
Author-Name: Eric C. Grunsky
Author-X-Name-First: Eric C.
Author-X-Name-Last: Grunsky
Author-Name: Michail T. Tsagris
Author-X-Name-First: Michail T.
Author-X-Name-Last: Tsagris
Author-Name: A. H. Welsh
Author-X-Name-First: A. H.
Author-X-Name-Last: Welsh
Title: Robust Principal Component Analysis for Power Transformed Compositional Data
Abstract:
Geochemical surveys collect sediment or rock samples, measure the
concentration of chemical elements, and report these typically either in
weight percent or in parts per million (ppm). There are usually a large
number of elements measured and the distributions are often skewed,
containing many potential outliers. We present a new robust principal
component analysis (PCA) method for geochemical survey data that involves
first transforming the compositional data onto a manifold using a relative
power transformation. A flexible set of moment assumptions is made that
takes the special geometry of the manifold into account. The Kent
distribution moment structure arises as a special case when the chosen
manifold is the hypersphere. We derive simple moment and robust estimators
(RO) of the parameters which are also applicable in high-dimensional
settings. The resulting PCA based on these estimators is done in the
tangent space and is related to the power transformation method used in
correspondence analysis. To illustrate, we analyze major oxide data from
the National Geochemical Survey of Australia. When compared with the
traditional approach in the literature based on the centered log-ratio
transformation, the new PCA method is shown to be more successful at
dimension reduction and gives interpretable results.
Journal: Journal of the American Statistical Association
Pages: 136-148
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.990563
File-URL: http://hdl.handle.net/10.1080/01621459.2014.990563
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:136-148
Template-Type: ReDIF-Article 1.0
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Xin Tong
Author-X-Name-First: Xin
Author-X-Name-Last: Tong
Author-Name: Yao Zeng
Author-X-Name-First: Yao
Author-X-Name-Last: Zeng
Title: Multi-Agent Inference in Social Networks: A Finite Population Learning Approach
Abstract:
When people in a society want to make inference about some parameter, each
person may want to use data collected by other people. Information (data)
exchange in social networks is usually costly, so to make reliable
statistical decisions, people need to weigh the benefits and costs of
information acquisition. Conflicts of interests and coordination problems
will arise in the process. Classical statistics does not consider people's
incentives and interactions in the data-collection process. To address
this imperfection, this work explores multi-agent Bayesian inference
problems with a game theoretic social network model. Motivated by our
interest in aggregate inference at the societal level, we propose a new
concept, finite population learning, to address whether,
with high probability, a large fraction of people in a given finite
population network can make "good" inference. Serving as a foundation,
this concept enables us to study the long-run trend of aggregate inference
quality as population grows. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 149-158
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.893885
File-URL: http://hdl.handle.net/10.1080/01621459.2014.893885
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:149-158
Template-Type: ReDIF-Article 1.0
Author-Name: Christine Peterson
Author-X-Name-First: Christine
Author-X-Name-Last: Peterson
Author-Name: Francesco C. Stingo
Author-X-Name-First: Francesco C.
Author-X-Name-Last: Stingo
Author-Name: Marina Vannucci
Author-X-Name-First: Marina
Author-X-Name-Last: Vannucci
Title: Bayesian Inference of Multiple Gaussian Graphical Models
Abstract:
In this article, we propose a Bayesian approach to inference on multiple
Gaussian graphical models. Specifically, we address the problem of
inferring multiple undirected networks in situations where some of the
networks may be unrelated, while others share common features. We link the
estimation of the graph structures via a Markov random field (MRF) prior,
which encourages common edges. We learn which sample groups have a shared
graph structure by placing a spike-and-slab prior on the parameters that
measure network relatedness. This approach allows us to share information
between sample groups, when appropriate, as well as to obtain a measure of
relative network similarity across groups. Our modeling framework
incorporates relevant prior knowledge through an edge-specific informative
prior and can encourage similarity to an established network. Through
simulations, we demonstrate the utility of our method in summarizing
relative network similarity and compare its performance against related
methods. We find improved accuracy of network estimation, particularly
when the sample sizes within each subgroup are moderate. We also
illustrate the application of our model to infer protein networks for
various cancer subtypes and under different experimental conditions.
Journal: Journal of the American Statistical Association
Pages: 159-174
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.896806
File-URL: http://hdl.handle.net/10.1080/01621459.2014.896806
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:159-174
Template-Type: ReDIF-Article 1.0
Author-Name: Zheng Tracy Ke
Author-X-Name-First: Zheng Tracy
Author-X-Name-Last: Ke
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Yichao Wu
Author-X-Name-First: Yichao
Author-X-Name-Last: Wu
Title: Homogeneity Pursuit
Abstract:
This article explores the homogeneity of coefficients in high-dimensional
regression, which extends the sparsity concept and is more general and
suitable for many applications. Homogeneity arises when regression
coefficients corresponding to neighboring geographical regions or a
similar cluster of covariates are expected to be approximately the same.
Sparsity corresponds to a special case of homogeneity with a large cluster
of known atom zero. In this article, we propose a new method called
clustering algorithm in regression via data-driven segmentation (CARDS) to
explore homogeneity. New mathematics is provided on the gain that can be
achieved by exploring homogeneity. Statistical properties of two versions
of CARDS are analyzed. In particular, the asymptotic normality of our
proposed CARDS estimator is established, which reveals better estimation
accuracy for homogeneous parameters than that without homogeneity
exploration. When our methods are combined with sparsity exploration,
further efficiency can be achieved beyond the exploration of sparsity
alone. This provides additional insights into the power of exploring
low-dimensional structures in high-dimensional regression: homogeneity and
sparsity. Our results also shed light on the properties of the fused
Lasso. The newly developed method is further illustrated by simulation
studies and applications to real data. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 175-194
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.892882
File-URL: http://hdl.handle.net/10.1080/01621459.2014.892882
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:175-194
Template-Type: ReDIF-Article 1.0
Author-Name: D. L. Borchers
Author-X-Name-First: D. L.
Author-X-Name-Last: Borchers
Author-Name: B. C. Stevenson
Author-X-Name-First: B. C.
Author-X-Name-Last: Stevenson
Author-Name: D. Kidney
Author-X-Name-First: D.
Author-X-Name-Last: Kidney
Author-Name: L. Thomas
Author-X-Name-First: L.
Author-X-Name-Last: Thomas
Author-Name: T. A. Marques
Author-X-Name-First: T. A.
Author-X-Name-Last: Marques
Title: A Unifying Model for Capture-Recapture and Distance Sampling Surveys of Wildlife Populations
Abstract:
A fundamental problem in wildlife ecology and management is estimation of
population size or density. The two dominant methods in this area are
capture-recapture (CR) and distance sampling (DS), each with its own
largely separate literature. We develop a class of models that synthesizes
them. It accommodates a spectrum of models ranging from nonspatial CR
models (with no information on animal locations) through to DS and
mark-recapture distance sampling (MRDS) models, in which animal locations
are observed without error. Between these lie spatially explicit
capture-recapture (SECR) models that include only capture locations, and a
variety of models with less location data than are typical of DS surveys
but more than are normally used on SECR surveys. In addition to unifying
CR and DS models, the class provides a means of improving inference from
SECR models by adding supplementary location data, and a means of
incorporating measurement error into DS and MRDS models. We illustrate
their utility by comparing inference on acoustic surveys of gibbons and
frogs using only capture locations, using estimated angles (gibbons) and
combinations of received signal strength and time-of-arrival data (frogs),
and on a visual MRDS survey of whales, comparing estimates with exact and
estimated distances. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 195-204
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.893884
File-URL: http://hdl.handle.net/10.1080/01621459.2014.893884
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:195-204
Template-Type: ReDIF-Article 1.0
Author-Name: Paul R. Rosenbaum
Author-X-Name-First: Paul R.
Author-X-Name-Last: Rosenbaum
Title: Bahadur Efficiency of Sensitivity Analyses in Observational Studies
Abstract:
An observational study draws inferences about treatment effects when
treatments are not randomly assigned, as they would be in a randomized
experiment. The naive analysis of an observational study assumes that
adjustments for measured covariates suffice to remove bias from nonrandom
treatment assignment. A sensitivity analysis in an observational study
determines the magnitude of bias from nonrandom treatment assignment that
would need to be present to alter the qualitative conclusions of the naive
analysis, say leading to the acceptance of a null hypothesis rejected in
the naive analysis. Observational studies vary greatly in their
sensitivity to unmeasured biases, but a poor choice of test statistic can
lead to an exaggerated report of sensitivity to bias. The Bahadur
efficiency of a sensitivity analysis is introduced, calculated, and
connected to established concepts, such as the power of a sensitivity
analysis and the design sensitivity. The Bahadur slope equals zero when
the sensitivity parameter equals the design sensitivity, but the Bahadur
slope permits more refined distinctions. Specifically, the Bahadur
relative efficiency can also compare the relative performance of two test
statistics at a value of the sensitivity parameter below the minimum of
their design sensitivities. Adaptive procedures that combine several tests
can achieve the best design sensitivity and the best Bahadur slope of
their component tests. Ultimately, in sufficiently large sample sizes,
design sensitivity is more important than efficiency for the power of a
sensitivity analysis, and the exponential rate at which design
sensitivity overtakes efficiency is characterized.
Journal: Journal of the American Statistical Association
Pages: 205-217
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.960968
File-URL: http://hdl.handle.net/10.1080/01621459.2014.960968
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:205-217
Template-Type: ReDIF-Article 1.0
Author-Name: Marc Hallin
Author-X-Name-First: Marc
Author-X-Name-Last: Hallin
Author-Name: Chintan Mehta
Author-X-Name-First: Chintan
Author-X-Name-Last: Mehta
Title: R-Estimation for Asymmetric Independent Component Analysis
Abstract:
Independent component analysis (ICA) recently has attracted much attention
in the statistical literature as an appealing alternative to elliptical
models. Whereas k-dimensional elliptical densities depend
on a single unspecified radial density,
k-dimensional independent component distributions involve
k unspecified component densities. In practice, for given
sample size n and dimension k, this
makes the statistical analysis much harder. We focus here on the
estimation, from an independent sample, of the mixing/demixing matrix of
the model. Traditional methods (FOBI, Kernel-ICA, FastICA) mainly
originate from the engineering literature. Their consistency requires
moment conditions, they are poorly robust, and do not achieve any type of
asymptotic efficiency. When based on robust scatter matrices, the
two-scatter methods developed by Oja, Sirkia, and Eriksson in 2006 and
Nordhausen, Oja, and Ollila in 2008 enjoy better robustness features, but
their optimality properties remain unclear. The "classical semiparametric"
approach by Chen and Bickel in 2006, by contrast, achieves
semiparametric efficiency, but requires the estimation of the densities of
the k unobserved independent components. As a reaction,
an efficient (signed-)rank-based approach was proposed by Ilmonen and
Paindaveine in 2011 for the case of symmetric component densities. The
performance of their estimators is quite good, but they unfortunately fail
to be root-n consistent as soon as one of the component
densities violates the symmetry assumption. In this article, using ranks
rather than signed ranks, we extend their approach to the asymmetric case
and propose a one-step R-estimator for ICA mixing
matrices. The finite-sample performances of those estimators are
investigated and compared to those of existing methods under moderately
large sample sizes. Particularly good performances are obtained from a
version involving data-driven scores taking into account the skewness and
kurtosis of residuals. Finally, we show, by an empirical exercise, that
our methods also may provide excellent results in a context such as image
analysis, where the basic assumptions of ICA are quite unlikely to hold.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 218-232
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.909316
File-URL: http://hdl.handle.net/10.1080/01621459.2014.909316
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:218-232
Template-Type: ReDIF-Article 1.0
Author-Name: Linglong Kong
Author-X-Name-First: Linglong
Author-X-Name-Last: Kong
Author-Name: Douglas P. Wiens
Author-X-Name-First: Douglas P.
Author-X-Name-Last: Wiens
Title: Model-Robust Designs for Quantile Regression
Abstract:
We give methods for the construction of designs for regression models,
when the purpose of the investigation is the estimation of the conditional
quantile function, and the estimation method is quantile regression. The
designs are robust against misspecified response functions, and against
unanticipated heteroscedasticity. The methods are illustrated by example,
and in a case study in which they are applied to growth charts.
Journal: Journal of the American Statistical Association
Pages: 233-245
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.969427
File-URL: http://hdl.handle.net/10.1080/01621459.2014.969427
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:233-245
Template-Type: ReDIF-Article 1.0
Author-Name: Guodong Li
Author-X-Name-First: Guodong
Author-X-Name-Last: Li
Author-Name: Yang Li
Author-X-Name-First: Yang
Author-X-Name-Last: Li
Author-Name: Chih-Ling Tsai
Author-X-Name-First: Chih-Ling
Author-X-Name-Last: Tsai
Title: Quantile Correlations and Quantile Autoregressive Modeling
Abstract:
In this article, we propose two important measures, quantile correlation
(QCOR) and quantile partial correlation (QPCOR). We then apply them to
quantile autoregressive (QAR) models, and introduce two valuable
quantities, the quantile autocorrelation function (QACF) and the quantile
partial autocorrelation function (QPACF). This allows us to extend the
Box-Jenkins three-stage procedure (model identification, model parameter
estimation, and model diagnostic checking) from classical autoregressive
models to quantile autoregressive models. Specifically, the QPACF of an
observed time series can be employed to identify the autoregressive order,
while the QACF of residuals obtained from the fitted model can be used to
assess the model adequacy. We not only demonstrate the asymptotic
properties of QCOR and QPCOR, but also show the large sample results of
QACF, QPACF, and the quantile version of the Box-Pierce test. Moreover, we
obtain the bootstrap approximations to the distributions of parameter
estimators and proposed measures. Simulation studies indicate that the
proposed methods perform well in finite samples, and an empirical example
is presented to illustrate usefulness. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 246-261
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.892007
File-URL: http://hdl.handle.net/10.1080/01621459.2014.892007
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:246-261
Template-Type: ReDIF-Article 1.0
Author-Name: Francis K. C. Hui
Author-X-Name-First: Francis K. C.
Author-X-Name-Last: Hui
Author-Name: David I. Warton
Author-X-Name-First: David I.
Author-X-Name-Last: Warton
Author-Name: Scott D. Foster
Author-X-Name-First: Scott D.
Author-X-Name-Last: Foster
Title: Tuning Parameter Selection for the Adaptive Lasso Using ERIC
Abstract:
The adaptive Lasso is a commonly applied penalty for variable selection in
regression modeling. Like all penalties though, its performance depends
critically on the choice of the tuning parameter. One method for choosing
the tuning parameter is via information criteria, such as those based on
AIC and BIC. However, these criteria were developed for use with
unpenalized maximum likelihood estimators, and it is not clear that they
take into account the effects of penalization. In this article, we propose
the extended regularized information criterion (ERIC) for choosing the
tuning parameter in adaptive Lasso regression. ERIC extends the BIC to
account for the effect of applying the adaptive Lasso on the bias-variance
tradeoff. This leads to a criterion whose penalty for model complexity is
itself a function of the tuning parameter. We show the tuning parameter
chosen by ERIC is selection consistent when the number of variables grows
with sample size, and that this consistency holds in a wider range of
contexts compared to using BIC to choose the tuning parameter. Simulations
show that ERIC can significantly outperform BIC and other information
criteria proposed (for choosing the tuning parameter) in selecting the
true model. For ultra high-dimensional data (p >
n), we consider a two-stage approach combining sure
independence screening with adaptive Lasso regression using ERIC, which is
selection consistent and performs strongly in simulation. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 262-269
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.951444
File-URL: http://hdl.handle.net/10.1080/01621459.2014.951444
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:262-269
Template-Type: ReDIF-Article 1.0
Author-Name: Wei Lin
Author-X-Name-First: Wei
Author-X-Name-Last: Lin
Author-Name: Rui Feng
Author-X-Name-First: Rui
Author-X-Name-Last: Feng
Author-Name: Hongzhe Li
Author-X-Name-First: Hongzhe
Author-X-Name-Last: Li
Title: Regularization Methods for High-Dimensional Instrumental Variables Regression With an Application to Genetical Genomics
Abstract:
In genetical genomics studies, it is important to jointly analyze gene
expression data and genetic variants in exploring their associations with
complex traits, where the dimensionality of gene expressions and genetic
variants can both be much larger than the sample size. Motivated by such
modern applications, we consider the problem of variable selection and
estimation in high-dimensional sparse instrumental variables models. To
overcome the difficulty of high dimensionality and unknown optimal
instruments, we propose a two-stage regularization framework for
identifying and estimating important covariate effects while selecting and
estimating optimal instruments. The methodology extends the classical
two-stage least squares estimator to high dimensions by exploiting
sparsity using sparsity-inducing penalty functions in both stages. The
resulting procedure is efficiently implemented by coordinate descent
optimization. For the representative L1
regularization and a class of concave regularization methods, we establish
estimation, prediction, and model selection properties of the two-stage
regularized estimators in the high-dimensional setting where the
dimensionalities of covariates and instruments are both allowed to grow
exponentially with the sample size. The practical performance of the
proposed method is evaluated by simulation studies and its usefulness is
illustrated by an analysis of mouse obesity data. Supplementary materials
for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 270-288
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.908125
File-URL: http://hdl.handle.net/10.1080/01621459.2014.908125
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:270-288
Template-Type: ReDIF-Article 1.0
Author-Name: Qiang Sun
Author-X-Name-First: Qiang
Author-X-Name-Last: Sun
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Author-Name: Yufeng Liu
Author-X-Name-First: Yufeng
Author-X-Name-Last: Liu
Author-Name: Joseph G. Ibrahim
Author-X-Name-First: Joseph G.
Author-X-Name-Last: Ibrahim
Title: SPReM: Sparse Projection Regression Model For High-Dimensional Linear Regression
Abstract:
The aim of this article is to develop a sparse projection regression
modeling (SPReM) framework to perform multivariate regression modeling
with a large number of responses and a multivariate covariate of interest.
We propose two novel heritability ratios to simultaneously perform
dimension reduction, response selection, estimation, and testing, while
explicitly accounting for correlations among multivariate responses. Our
SPReM is devised to specifically address the low statistical power issue
of many standard statistical approaches, such as Hotelling's
T^2 test statistic or a mass univariate analysis,
for high-dimensional data. We formulate the estimation problem of SPReM as
a novel sparse unit rank projection (SURP) problem and propose a fast
optimization algorithm for SURP. Furthermore, we extend SURP to the sparse
multirank projection (SMURP) by adopting a sequential SURP approximation.
Theoretically, we have systematically investigated the convergence
properties of SURP and the convergence rate of SURP estimates. Our
simulation results and real data analysis have shown that SPReM
outperforms other state-of-the-art methods.
Journal: Journal of the American Statistical Association
Pages: 289-302
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.892008
File-URL: http://hdl.handle.net/10.1080/01621459.2014.892008
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:289-302
Template-Type: ReDIF-Article 1.0
Author-Name: Juan Shen
Author-X-Name-First: Juan
Author-X-Name-Last: Shen
Author-Name: Xuming He
Author-X-Name-First: Xuming
Author-X-Name-Last: He
Title: Inference for Subgroup Analysis With a Structured Logistic-Normal Mixture Model
Abstract:
In this article, we propose a statistical model for the purpose of
identifying a subgroup that has an enhanced treatment effect as well as
the variables that are predictive of the subgroup membership. The need for
such subgroup identification arises in clinical trials and in market
segmentation analysis. By using a structured logistic-normal mixture
model, our proposed framework enables us to perform a confirmatory
statistical test for the existence of subgroups, and at the same time, to
construct predictive scores for the subgroup membership. The inferential
procedure proposed in the article is built on the recent literature on
hypothesis testing for Gaussian mixtures, but the structured
logistic-normal mixture model enjoys some distinctive properties that are
unavailable to the simpler Gaussian mixture models. With the bootstrap
approximations, the proposed tests are shown to be powerful and, equally
importantly, insensitive to the choice of tuning parameters. As an
illustration, we analyze a dataset from the AIDS Clinical Trials Group 320
study and show how the proposed methodology can help detect a potential
subgroup of AIDS patients who may react much more favorably to the
addition of a protease inhibitor to a conventional regimen than other
patients.
Journal: Journal of the American Statistical Association
Pages: 303-312
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.894763
File-URL: http://hdl.handle.net/10.1080/01621459.2014.894763
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:303-312
Template-Type: ReDIF-Article 1.0
Author-Name: Eben Kenah
Author-X-Name-First: Eben
Author-X-Name-Last: Kenah
Title: Semiparametric Relative-Risk Regression for Infectious Disease Transmission Data
Abstract:
This article introduces semiparametric relative-risk regression models for
infectious disease data. The units of analysis in these models are pairs
of individuals at risk of transmission. The hazard of infectious contact
from i to j consists of a baseline
hazard multiplied by a relative risk function that can be a function of
infectiousness covariates for i, susceptibility
covariates for j, and pairwise covariates. When
who-infects-whom is observed, we derive a profile likelihood maximized
over all possible baseline hazard functions that is similar to the Cox
partial likelihood. When who-infects-whom is not observed, we derive an EM
algorithm to maximize the profile likelihood integrated over all possible
combinations of who-infected-whom. This extends the most important class
of regression models in survival analysis to infectious disease
epidemiology. These methods can be implemented in standard statistical
software, and they will be able to address important scientific questions
about emerging infectious diseases with greater clarity, flexibility, and
rigor than current statistical methods allow. Supplementary materials for
this article are available online.
Journal: Journal of the American Statistical Association
Pages: 313-325
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.896807
File-URL: http://hdl.handle.net/10.1080/01621459.2014.896807
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:313-325
Template-Type: ReDIF-Article 1.0
Author-Name: Dungang Liu
Author-X-Name-First: Dungang
Author-X-Name-Last: Liu
Author-Name: Regina Y. Liu
Author-X-Name-First: Regina Y.
Author-X-Name-Last: Liu
Author-Name: Minge Xie
Author-X-Name-First: Minge
Author-X-Name-Last: Xie
Title: Multivariate Meta-Analysis of Heterogeneous Studies Using Only Summary Statistics: Efficiency and Robustness
Abstract:
Meta-analysis has been widely used to synthesize evidence from multiple
studies for common hypotheses or parameters of interest. However, it has
not yet been fully developed for incorporating heterogeneous studies,
which arise often in applications due to different study designs,
populations, or outcomes. For heterogeneous studies, the parameter of
interest may not be estimable for certain studies, and in such a case,
these studies are typically excluded from conventional meta-analysis. The
exclusion of such studies can lead to a nonnegligible loss of
information. This article introduces a meta-analysis for heterogeneous
studies by combining the confidence density functions
derived from the summary statistics of individual studies, hence referred
to as the CD approach. It includes all the studies in the analysis and
makes use of all information, direct as well as indirect. Under a general
likelihood inference framework, this new approach is shown to have several
desirable properties, including: (i) it is asymptotically as efficient as
the maximum likelihood approach using individual participant data (IPD)
from all studies; (ii) unlike the IPD analysis, the CD approach requires
only summary statistics, not individual-level data; and (iii) it is robust
against misspecification of the working
covariance structure of parameter estimates. Besides its own theoretical
significance, the last property also substantially broadens the
applicability of the CD approach. All the properties of the CD approach
are further confirmed by data simulated from a randomized clinical trials
setting as well as by real data on aircraft landing performance. Overall,
one obtains a unifying approach for combining summary statistics,
subsuming many of the existing meta-analysis methods as special cases.
Journal: Journal of the American Statistical Association
Pages: 326-340
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.899235
File-URL: http://hdl.handle.net/10.1080/01621459.2014.899235
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:326-340
Template-Type: ReDIF-Article 1.0
Author-Name: Shujie Ma
Author-X-Name-First: Shujie
Author-X-Name-Last: Ma
Author-Name: Peter X.-K. Song
Author-X-Name-First: Peter X.-K.
Author-X-Name-Last: Song
Title: Varying Index Coefficient Models
Abstract:
There is a long history of using interactions in regression analysis to
investigate alterations in covariate effects on response variables. In
this article, we aim to address two kinds of new challenges arising from
the inclusion of such high-order effects in the regression model for
complex data. The first kind concerns a situation where interaction
effects of individual covariates are weak but those of combined covariates
are strong, and the other kind pertains to the presence of nonlinear
interactive effects directed by low-effect covariates. We propose a new
class of semiparametric models with varying index coefficients, which
enables us to model and assess nonlinear interaction effects between
grouped covariates on the response variable. As a result, most of the
existing semiparametric regression models are special cases of our
proposed models. We develop a numerically stable and computationally fast
estimation procedure using both profile least squares method and local
fitting. We establish both estimation consistency and asymptotic normality
for the proposed estimators of index coefficients as well as the oracle
property for the nonparametric function estimator. In addition, a
generalized likelihood ratio test is provided to test for the existence of
interaction effects or the existence of nonlinear interaction effects. Our
models and estimation methods are illustrated by simulation studies, and
by an analysis of child growth data to evaluate alterations in growth
rates incurred by mother's exposures to endocrine disrupting compounds
during pregnancy. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 341-356
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.903185
File-URL: http://hdl.handle.net/10.1080/01621459.2014.903185
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:341-356
Template-Type: ReDIF-Article 1.0
Author-Name: Jianhua Hu
Author-X-Name-First: Jianhua
Author-X-Name-Last: Hu
Author-Name: Hongjian Zhu
Author-X-Name-First: Hongjian
Author-X-Name-Last: Zhu
Author-Name: Feifang Hu
Author-X-Name-First: Feifang
Author-X-Name-Last: Hu
Title: A Unified Family of Covariate-Adjusted Response-Adaptive Designs Based on Efficiency and Ethics
Abstract:
Response-adaptive designs have recently attracted increasing attention
in the literature because of their advantages in efficiency and medical
ethics. To develop personalized medicine, covariate information plays an
important role in both design and analysis of clinical trials. A challenge
is how to incorporate covariate information in response-adaptive designs
while considering issues of both efficiency and medical ethics. To address
this problem, we propose a new and unified family of covariate-adjusted
response-adaptive (CARA) designs based on two general measurements of
efficiency and ethics. Important properties (including asymptotic
properties) of the proposed procedures are studied under categorical
covariates. This new family of designs not only introduces new desirable
CARA designs, but also unifies several important designs in the
literature. We demonstrate the proposed procedures through examples,
simulations, and a discussion of related earlier work.
Journal: Journal of the American Statistical Association
Pages: 357-367
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.903846
File-URL: http://hdl.handle.net/10.1080/01621459.2014.903846
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:357-367
Template-Type: ReDIF-Article 1.0
Author-Name: Jiejun Du
Author-X-Name-First: Jiejun
Author-X-Name-Last: Du
Author-Name: Ian L. Dryden
Author-X-Name-First: Ian L.
Author-X-Name-Last: Dryden
Author-Name: Xianzheng Huang
Author-X-Name-First: Xianzheng
Author-X-Name-Last: Huang
Title: Size and Shape Analysis of Error-Prone Shape Data
Abstract:
We consider the problem of comparing sizes and shapes of objects when
landmark data are prone to measurement error. We show that naive
implementation of ordinary Procrustes analysis that ignores measurement
error can compromise inference. To account for measurement error, we
propose the conditional score method for matching configurations, which
guarantees consistent inference under mild model assumptions. The effects
of measurement error on inference from naive Procrustes analysis and the
performance of the proposed method are illustrated via simulation and
application in three real data examples. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 368-379
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.908779
File-URL: http://hdl.handle.net/10.1080/01621459.2014.908779
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:368-379
Template-Type: ReDIF-Article 1.0
Author-Name: Alexander Aue
Author-X-Name-First: Alexander
Author-X-Name-Last: Aue
Author-Name: Diogo Dubart Norinho
Author-X-Name-First: Diogo Dubart
Author-X-Name-Last: Norinho
Author-Name: Siegfried Hörmann
Author-X-Name-First: Siegfried
Author-X-Name-Last: Hörmann
Title: On the Prediction of Stationary Functional Time Series
Abstract:
This article addresses the prediction of stationary functional time
series. Existing contributions to this problem have largely focused on the
special case of first-order functional autoregressive processes because of
their technical tractability and the current lack of advanced functional
time series methodology. It is shown here how standard multivariate
prediction techniques can be used in this context. The connection between
functional and multivariate predictions is made precise for the important
case of vector and functional autoregressions. The proposed method is easy
to implement, making use of existing statistical software packages, and
may, therefore, be attractive to a broader, possibly nonacademic,
audience. Its practical applicability is enhanced through the introduction
of a novel functional final prediction error model selection criterion
that allows for an automatic determination of the lag structure and the
dimensionality of the model. The usefulness of the proposed methodology is
demonstrated in a simulation study and an application to environmental
data, namely the prediction of daily pollution curves describing the
concentration of particulate matter in ambient air. It is found that the
proposed prediction method often significantly outperforms existing
methods.
Journal: Journal of the American Statistical Association
Pages: 378-392
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.909317
File-URL: http://hdl.handle.net/10.1080/01621459.2014.909317
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:378-392
Template-Type: ReDIF-Article 1.0
Author-Name: Jessica Minnier
Author-X-Name-First: Jessica
Author-X-Name-Last: Minnier
Author-Name: Ming Yuan
Author-X-Name-First: Ming
Author-X-Name-Last: Yuan
Author-Name: Jun S. Liu
Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu
Author-Name: Tianxi Cai
Author-X-Name-First: Tianxi
Author-X-Name-Last: Cai
Title: Risk Classification With an Adaptive Naive Bayes Kernel Machine Model
Abstract:
Genetic studies of complex traits have uncovered only a small number of
risk markers explaining a small fraction of heritability and adding little
improvement to disease risk prediction. Standard single marker methods may
lack power in selecting informative markers or estimating effects. Most
existing methods also typically do not account for nonlinearity.
Identifying markers with weak signals and estimating their joint effects
among many noninformative markers remains challenging. One potential
approach is to group markers based on biological knowledge such as gene
structure. If markers in a group tend to have similar effects, proper
usage of the group structure could improve power and efficiency in
estimation. We propose a two-stage method relating markers to disease risk
by taking advantage of known gene-set structures. Imposing a naive Bayes
kernel machine (KM) model, we estimate gene-set specific risk models that
relate each gene-set to the outcome in stage I. The KM framework
efficiently models potentially nonlinear effects of predictors without
requiring explicit specification of functional forms. In stage II, we
aggregate information across gene-sets via a regularization procedure.
Estimation and computational efficiency are further improved with kernel
principal component analysis. Asymptotic results for model estimation and
gene-set selection are derived and numerical studies suggest that the
proposed procedure could outperform existing procedures for constructing
genetic risk models.
Journal: Journal of the American Statistical Association
Pages: 393-404
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.908778
File-URL: http://hdl.handle.net/10.1080/01621459.2014.908778
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:393-404
Template-Type: ReDIF-Article 1.0
Author-Name: Nadja Klein
Author-X-Name-First: Nadja
Author-X-Name-Last: Klein
Author-Name: Thomas Kneib
Author-X-Name-First: Thomas
Author-X-Name-Last: Kneib
Author-Name: Stefan Lang
Author-X-Name-First: Stefan
Author-X-Name-Last: Lang
Title: Bayesian Generalized Additive Models for Location, Scale, and Shape for Zero-Inflated and Overdispersed Count Data
Abstract:
Frequent problems in applied research preventing the application of the
classical Poisson log-linear model for analyzing count data include
overdispersion, an excess of zeros compared to the Poisson distribution,
correlated responses, as well as complex predictor structures comprising
nonlinear effects of continuous covariates, interactions or spatial
effects. We propose a general class of Bayesian generalized additive
models for zero-inflated and overdispersed count data within the framework
of generalized additive models for location, scale, and shape where
semiparametric predictors can be specified for several parameters of a
count data distribution. As standard options for applied work we consider
the zero-inflated Poisson, the negative binomial and the zero-inflated
negative binomial distribution. The additive predictor specifications rely
on basis function approximations for the different types of effects in
combination with Gaussian smoothness priors. We develop Bayesian inference
based on Markov chain Monte Carlo simulation techniques where suitable
proposal densities are constructed based on iteratively weighted least
squares approximations to the full conditionals. To ensure practicability
of the inference, we consider theoretical properties such as the nontrivial
question of whether the joint posterior is proper. The proposed approach is
evaluated in simulation studies and applied to count data arising from
patent citations and claim frequencies in car insurance. For the
comparison of models with respect to the distribution, we consider
quantile residuals as an effective graphical device and scoring rules that
allow us to quantify the predictive ability of the models. The deviance
information criterion is used to select appropriate predictor
specifications once a response distribution has been chosen. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 405-419
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.912955
File-URL: http://hdl.handle.net/10.1080/01621459.2014.912955
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:405-419
Template-Type: ReDIF-Article 1.0
Author-Name: Efstathia Bura
Author-X-Name-First: Efstathia
Author-X-Name-Last: Bura
Author-Name: Liliana Forzani
Author-X-Name-First: Liliana
Author-X-Name-Last: Forzani
Title: Sufficient Reductions in Regressions With Elliptically Contoured Inverse Predictors
Abstract:
There are two general approaches based on inverse regression for
estimating the linear sufficient reductions for the regression of
Y on X: the moment-based approach such as
SIR, PIR, SAVE, and DR, and the likelihood-based approach such as
principal fitted components (PFC) and likelihood acquired directions (LAD)
when the inverse predictors, X|Y, are
normal. By construction, these methods extract information from the first
two conditional moments of X|Y; they can
only estimate linear reductions and thus form the linear
sufficient dimension reduction (SDR) methodology. When
var(X|Y) is constant, E(X|Y) contains the
reduction and it can be estimated using PFC. When var(X|Y)
is nonconstant, PFC misses the information in the variance, and
second-moment-based methods (SAVE, DR, LAD) are used instead, resulting in
efficiency loss in the estimation of the mean-based reduction. In this
article we prove that (a) if X|Y is
elliptically contoured with parameters and density
gY, there is no linear
nontrivial sufficient reduction except if gY
is the normal density with constant variance; (b) for nonnormal
elliptically contoured data, all existing linear SDR
methods only estimate part of the reduction; (c) a sufficient reduction of
X for the regression of Y on X
comprises a linear and a nonlinear component.
Journal: Journal of the American Statistical Association
Pages: 420-434
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.914440
File-URL: http://hdl.handle.net/10.1080/01621459.2014.914440
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:420-434
Template-Type: ReDIF-Article 1.0
Author-Name: P. Richard Hahn
Author-X-Name-First: P. Richard
Author-X-Name-Last: Hahn
Author-Name: Carlos M. Carvalho
Author-X-Name-First: Carlos M.
Author-X-Name-Last: Carvalho
Title: Decoupling Shrinkage and Selection in Bayesian Linear Models: A Posterior Summary Perspective
Abstract:
Selecting a subset of variables for linear models remains an active area
of research. This article reviews many of the recent contributions to the
Bayesian model selection and shrinkage prior literature. A posterior
variable selection summary is proposed, which distills a full posterior
distribution over regression coefficients into a sequence of sparse linear
predictors.
Journal: Journal of the American Statistical Association
Pages: 435-448
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2014.993077
File-URL: http://hdl.handle.net/10.1080/01621459.2014.993077
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:435-448
Template-Type: ReDIF-Article 1.0
Author-Name: Stephen E. Fienberg
Author-X-Name-First: Stephen E.
Author-X-Name-Last: Fienberg
Author-Name: James S. Hodges
Author-X-Name-First: James S.
Author-X-Name-Last: Hodges
Author-Name: Liying Luo
Author-X-Name-First: Liying
Author-X-Name-Last: Luo
Title: Letter to the Editor
Journal: Journal of the American Statistical Association
Pages: 457-457
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2015.1008100
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1008100
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:457a-457a
Template-Type: ReDIF-Article 1.0
Author-Name: Y. Claire Yang
Author-X-Name-First: Y. Claire
Author-X-Name-Last: Yang
Author-Name: Kenneth C. Land
Author-X-Name-First: Kenneth C.
Author-X-Name-Last: Land
Title: Reply
Journal: Journal of the American Statistical Association
Pages: 457-457
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2015.1008843
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1008843
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:457b-457b
Template-Type: ReDIF-Article 1.0
Author-Name: Wenjiang J. Fu
Author-X-Name-First: Wenjiang J.
Author-X-Name-Last: Fu
Title: Reply
Journal: Journal of the American Statistical Association
Pages: 458-458
Issue: 509
Volume: 110
Year: 2015
Month: 3
X-DOI: 10.1080/01621459.2015.1008849
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1008849
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:509:p:458-458
Template-Type: ReDIF-Article 1.0
Author-Name: James G. Scott
Author-X-Name-First: James G.
Author-X-Name-Last: Scott
Author-Name: Ryan C. Kelly
Author-X-Name-First: Ryan C.
Author-X-Name-Last: Kelly
Author-Name: Matthew A. Smith
Author-X-Name-First: Matthew A.
Author-X-Name-Last: Smith
Author-Name: Pengcheng Zhou
Author-X-Name-First: Pengcheng
Author-X-Name-Last: Zhou
Author-Name: Robert E. Kass
Author-X-Name-First: Robert E.
Author-X-Name-Last: Kass
Title: False Discovery Rate Regression: An Application to Neural Synchrony Detection in Primary Visual Cortex
Abstract:
This article introduces false discovery rate regression, a method for
incorporating covariate information into large-scale multiple-testing
problems. FDR regression estimates a relationship between test-level
covariates and the prior probability that a given observation is a signal.
It then uses this estimated relationship to inform the outcome of each
test in a way that controls the overall false discovery rate at a
prespecified level. This poses many subtle issues at the interface between
inference and computation, and we investigate several variations of the
overall approach. Simulation evidence suggests that: (1) when covariate
effects are present, FDR regression improves power for a fixed
false-discovery rate; and (2) when covariate effects are absent, the
method is robust, in the sense that it does not lead to inflated error
rates. We apply the method to neural recordings from primary visual
cortex. The goal is to detect pairs of neurons that exhibit
fine-time-scale interactions, in the sense that they fire together more
often than expected due to chance. Our method detects roughly 50% more
synchronous pairs versus a standard FDR-controlling analysis. The
companion R package FDRreg implements all methods described in the
article. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 459-471
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.990973
File-URL: http://hdl.handle.net/10.1080/01621459.2014.990973
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:459-471
Template-Type: ReDIF-Article 1.0
Author-Name: Xiangrong Kong
Author-X-Name-First: Xiangrong
Author-X-Name-Last: Kong
Author-Name: Mei-Cheng Wang
Author-X-Name-First: Mei-Cheng
Author-X-Name-Last: Wang
Author-Name: Ronald Gray
Author-X-Name-First: Ronald
Author-X-Name-Last: Gray
Title: Analysis of Longitudinal Multivariate Outcome Data From Couples Cohort Studies: Application to HPV Transmission Dynamics
Abstract:
We consider a specific situation of correlated data where multiple
outcomes are repeatedly measured on each member of a couple. Such
multivariate longitudinal data from couples may exhibit multi-faceted
correlations that can be further complicated if there are polygamous
partnerships. An example is data from cohort studies on human
papillomavirus (HPV) transmission dynamics in heterosexual couples. HPV is
a common sexually transmitted disease with 14 known oncogenic types
causing anogenital cancers. The binary outcomes on the multiple types
measured in couples over time may introduce inter-type, intra-couple, and
temporal correlations. Simple analysis using generalized estimating
equations or random effects models lacks interpretability and cannot fully
use the available information. We developed a hybrid modeling strategy
using Markov transition models together with pairwise composite likelihood
for analyzing such data. The method can be used to identify risk factors
associated with HPV transmission and persistence, estimate difference in
risks between male-to-female and female-to-male HPV transmission, compare
type-specific transmission risks within couples, and characterize the
inter-type and intra-couple associations. Applying the method to HPV
couple data collected in a Ugandan male circumcision (MC) trial, we
assessed the effect of MC and the role of gender on risks of HPV
transmission and persistence. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 472-485
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.991394
File-URL: http://hdl.handle.net/10.1080/01621459.2014.991394
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:472-485
Template-Type: ReDIF-Article 1.0
Author-Name: Jie Li
Author-X-Name-First: Jie
Author-X-Name-Last: Li
Author-Name: Yili Hong
Author-X-Name-First: Yili
Author-X-Name-Last: Hong
Author-Name: Ram Thapa
Author-X-Name-First: Ram
Author-X-Name-Last: Thapa
Author-Name: Harold E. Burkhart
Author-X-Name-First: Harold E.
Author-X-Name-Last: Burkhart
Title: Survival Analysis of Loblolly Pine Trees With Spatially Correlated Random Effects
Abstract:
Loblolly pine, a native pine species of the southeastern United States, is
the most-planted species for commercial timber. Predicting survival of
loblolly pine following planting is of great interest to researchers in
forestry science as it is closely related to the yield of timber. Data
were collected from a region-wide thinning study, where permanent plots,
located at 182 sites ranging from central Texas east to Florida and north
to Delaware, were established in 1980-1981. One of the main objectives of
this study was to investigate the relationship between the survival of
loblolly pine trees and several important covariates such as age, thinning
types, and physiographic regions, while adjusting for spatial correlation
among different sites. We use a semiparametric proportional hazards model
to describe the effects of covariates on the survival time, and
incorporate the spatial random effects in the model to describe the
spatial correlation among different sites. We apply the
expectation-maximization (EM) algorithm to estimate the parameters in the
model and conduct simulations to validate the estimation procedure. We
also compare the proposed method with existing methods through simulations
and discussions. Then we apply the developed method to the large-scale
loblolly pine tree survival data and interpret the results. We conclude
this article with discussions on the advantages of the proposed method,
major findings of data analysis, and directions for future research.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 486-502
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.995793
File-URL: http://hdl.handle.net/10.1080/01621459.2014.995793
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:486-502
Template-Type: ReDIF-Article 1.0
Author-Name: Yanxun Xu
Author-X-Name-First: Yanxun
Author-X-Name-Last: Xu
Author-Name: Peter Müller
Author-X-Name-First: Peter
Author-X-Name-Last: Müller
Author-Name: Yuan Yuan
Author-X-Name-First: Yuan
Author-X-Name-Last: Yuan
Author-Name: Kamalakar Gulukota
Author-X-Name-First: Kamalakar
Author-X-Name-Last: Gulukota
Author-Name: Yuan Ji
Author-X-Name-First: Yuan
Author-X-Name-Last: Ji
Title: MAD Bayes for Tumor Heterogeneity--Feature Allocation With Exponential Family Sampling
Abstract:
We propose small-variance asymptotic
approximations for inference on tumor heterogeneity (TH) using
next-generation sequencing data. Understanding TH is an important and open
research problem in biology. The lack of appropriate statistical inference
is a critical gap in existing methods that the proposed approach aims to
fill. We build on a hierarchical model with an exponential family
likelihood and a feature allocation prior. The proposed implementation of
posterior inference generalizes similar small-variance approximations
proposed by Kulis and Jordan and Broderick, Kulis, and Jordan for
inference with Dirichlet process mixture and Indian buffet process prior
models under normal sampling. We show that the new algorithm can
successfully recover latent structures of different haplotypes and
subclones and is orders of magnitude faster than available Markov chain Monte Carlo
samplers. The latter are practically infeasible for high-dimensional
genomics data. The proposed approach is scalable, easy to implement, and
benefits from the flexibility of Bayesian nonparametric models. More
importantly, it provides a useful tool for applied scientists to estimate
cell subtypes in tumor samples. R code is available at
http://www.ma.utexas.edu/users/yxu/. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 503-514
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.995794
File-URL: http://hdl.handle.net/10.1080/01621459.2014.995794
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:503-514
Template-Type: ReDIF-Article 1.0
Author-Name: Samuel D. Pimentel
Author-X-Name-First: Samuel D.
Author-X-Name-Last: Pimentel
Author-Name: Rachel R. Kelz
Author-X-Name-First: Rachel R.
Author-X-Name-Last: Kelz
Author-Name: Jeffrey H. Silber
Author-X-Name-First: Jeffrey H.
Author-X-Name-Last: Silber
Author-Name: Paul R. Rosenbaum
Author-X-Name-First: Paul R.
Author-X-Name-Last: Rosenbaum
Title: Large, Sparse Optimal Matching With Refined Covariate Balance in an Observational Study of the Health Outcomes Produced by New Surgeons
Abstract:
Every newly trained surgeon performs her first unsupervised operation. How
do the health outcomes of her patients compare with the patients of
experienced surgeons? Using data from 498 hospitals, we compare 1252 pairs,
each comprising a new surgeon and an experienced surgeon working at the same
hospital. We introduce a new form of matching that matches patients of
each new surgeon to patients of an otherwise similar experienced surgeon
at the same hospital, perfectly balancing 176 surgical procedures and
closely balancing a total of 2.9 million categories of patients;
additionally, the individual patient pairs are as close as possible. A new
goal for matching is introduced, called "refined covariate balance," in
which a sequence of nested, ever more refined, nominal covariates is
balanced as closely as possible, emphasizing the first or coarsest
covariate in that sequence. A new algorithm for matching is proposed and
the main new results prove that the algorithm finds the closest match in
terms of the total within-pair covariate distances among all matches that
achieve refined covariate balance. Unlike previous approaches to forcing
balance on covariates, the new algorithm creates multiple paths to a match
in a network, where paths that introduce imbalances are penalized and
hence avoided to the extent possible. The algorithm exploits a sparse
network to quickly optimize a match that is about two orders of magnitude
larger than is typical in statistical matching problems, thereby
permitting much more extensive use of fine and near-fine balance
constraints. The match was constructed in a few minutes using a network
optimization algorithm implemented in R. An R package called rcbalance
implementing the method is available from CRAN.
Journal: Journal of the American Statistical Association
Pages: 515-527
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.997879
File-URL: http://hdl.handle.net/10.1080/01621459.2014.997879
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:515-527
Template-Type: ReDIF-Article 1.0
Author-Name: Hui Yao
Author-X-Name-First: Hui
Author-X-Name-Last: Yao
Author-Name: Sungduk Kim
Author-X-Name-First: Sungduk
Author-X-Name-Last: Kim
Author-Name: Ming-Hui Chen
Author-X-Name-First: Ming-Hui
Author-X-Name-Last: Chen
Author-Name: Joseph G. Ibrahim
Author-X-Name-First: Joseph G.
Author-X-Name-Last: Ibrahim
Author-Name: Arvind K. Shah
Author-X-Name-First: Arvind K.
Author-X-Name-Last: Shah
Author-Name: Jianxin Lin
Author-X-Name-First: Jianxin
Author-X-Name-Last: Lin
Title: Bayesian Inference for Multivariate Meta-Regression With a Partially Observed Within-Study Sample Covariance Matrix
Abstract:
Multivariate meta-regression models are commonly used in settings where
the response variable is naturally multidimensional. Such settings are
common in cardiovascular and diabetes studies where the goal is to study
cholesterol levels once a certain medication is given. In this setting,
the natural multivariate endpoint is low density lipoprotein cholesterol
(LDL-C), high density lipoprotein cholesterol (HDL-C), and triglycerides
(TG). In this article, we examine study level
(aggregate) multivariate meta-data from 26 Merck sponsored double-blind,
randomized, active, or placebo-controlled clinical trials on adult
patients with primary hypercholesterolemia. Our goal is to develop a
methodology for carrying out Bayesian inference for multivariate
meta-regression models with study level data when the within-study sample
covariance matrix S for the multivariate response data is
partially observed. Specifically, the proposed methodology is based on
postulating a multivariate random effects regression model with an unknown
within-study covariance matrix Σ in which we treat the within-study
sample correlations as missing data, the standard deviations of the
within-study sample covariance matrix S are assumed
observed, and given Σ, S follows a Wishart
distribution. Thus, we treat the off-diagonal elements of
S as missing data, and these missing elements are sampled
from the appropriate full conditional distribution in a Markov chain Monte
Carlo (MCMC) sampling scheme via a novel transformation based on partial
correlations. We further propose several structures (models) for Σ,
which allow for borrowing strength across different treatment arms and
trials. The proposed methodology is assessed using simulated as well as
real data, and the results are shown to be quite promising. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 528-544
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2015.1006065
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1006065
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:528-544
Template-Type: ReDIF-Article 1.0
Author-Name: P. Z. Hadjipantelis
Author-X-Name-First: P. Z.
Author-X-Name-Last: Hadjipantelis
Author-Name: J. A. D. Aston
Author-X-Name-First: J. A. D.
Author-X-Name-Last: Aston
Author-Name: H. G. Müller
Author-X-Name-First: H. G.
Author-X-Name-Last: Müller
Author-Name: J. P. Evans
Author-X-Name-First: J. P.
Author-X-Name-Last: Evans
Title: Unifying Amplitude and Phase Analysis: A Compositional Data Approach to Functional Multivariate Mixed-Effects Modeling of Mandarin Chinese
Abstract:
Mandarin Chinese is a tonal language; the pitch (or
F0) of its utterances carries considerable
linguistic information. However, speech samples from different individuals
are subject to changes in amplitude and phase, which must be accounted for
in any analysis that attempts to provide a linguistically meaningful
description of the language. A joint model for amplitude, phase, and
duration is presented, which combines elements from functional data
analysis, compositional data analysis, and linear mixed effects models. By
decomposing functions via a functional principal component analysis, and
connecting registration functions to compositional data analysis, a joint
multivariate mixed effect model can be formulated, which gives insights
into the relationship between the different modes of variation as well as
their dependence on linguistic and nonlinguistic covariates. The model is
applied to the COSPRO-1 dataset, a comprehensive database of spoken
Taiwanese Mandarin, containing approximately 50,000 phonetically diverse
sample F0 contours (syllables), and reveals
that phonetic information is jointly carried by both amplitude and phase
variation. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 545-559
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2015.1006729
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1006729
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:545-559
Template-Type: ReDIF-Article 1.0
Author-Name: Ran Tao
Author-X-Name-First: Ran
Author-X-Name-Last: Tao
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: Nora Franceschini
Author-X-Name-First: Nora
Author-X-Name-Last: Franceschini
Author-Name: Kari E. North
Author-X-Name-First: Kari E.
Author-X-Name-Last: North
Author-Name: Eric Boerwinkle
Author-X-Name-First: Eric
Author-X-Name-Last: Boerwinkle
Author-Name: Dan-Yu Lin
Author-X-Name-First: Dan-Yu
Author-X-Name-Last: Lin
Title: Analysis of Sequence Data Under Multivariate Trait-Dependent Sampling
Abstract:
High-throughput DNA sequencing allows for the genotyping of common and
rare variants for genetic association studies. At the present time and for
the foreseeable future, it is not economically feasible to sequence all
individuals in a large cohort. A cost-effective strategy is to sequence
those individuals with extreme values of a quantitative trait. We consider
the design under which the sampling depends on multiple quantitative
traits. Under such trait-dependent sampling, standard linear regression
analysis can result in bias of parameter estimation, inflation of Type I
error, and loss of power. We construct a likelihood function that properly
reflects the sampling mechanism and uses all available data. We implement
a computationally efficient EM algorithm and establish the theoretical
properties of the resulting maximum likelihood estimators. Our methods can
be used to perform separate inference on each trait or simultaneous
inference on multiple traits. We pay special attention to gene-level
association tests for rare variants. We demonstrate the superiority of the
proposed methods over standard linear regression through extensive
simulation studies. We provide applications to the Cohorts for Heart and
Aging Research in Genomic Epidemiology Targeted Sequencing Study and the
National Heart, Lung, and Blood Institute Exome Sequencing Project.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 560-572
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2015.1008099
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1008099
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:560-572
Template-Type: ReDIF-Article 1.0
Author-Name: Bradley J. Barney
Author-X-Name-First: Bradley J.
Author-X-Name-Last: Barney
Author-Name: Federica Amici
Author-X-Name-First: Federica
Author-X-Name-Last: Amici
Author-Name: Filippo Aureli
Author-X-Name-First: Filippo
Author-X-Name-Last: Aureli
Author-Name: Josep Call
Author-X-Name-First: Josep
Author-X-Name-Last: Call
Author-Name: Valen E. Johnson
Author-X-Name-First: Valen E.
Author-X-Name-Last: Johnson
Title: Joint Bayesian Modeling of Binomial and Rank Data for Primate Cognition
Abstract:
In recent years, substantial effort has been devoted to methods for
analyzing data containing mixed response types, but such techniques
typically do not include rank data among the response types. Some unique
challenges exist in analyzing rank data, particularly when ties are
prevalent. We present techniques for jointly modeling binomial and rank
data using Bayesian latent variable models. We apply these techniques to
compare the cognitive abilities of nonhuman primates based on their
performance on 17 cognitive tasks scored on either a rank or binomial
scale. To jointly model the rank and binomial responses, we assume that
responses are implicitly determined by latent cognitive abilities. We then
model the latent variables using random effects models, with identifying
restrictions chosen to promote parsimonious prior specification and model
inferences. Results from the primate cognitive data are presented to
illustrate the methodology. Our results suggest that the ordering of the
cognitive abilities of species varies significantly across tasks,
indicating a partially independent evolution of cognitive abilities in
primates. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 573-582
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2015.1016223
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1016223
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:573-582
Template-Type: ReDIF-Article 1.0
Author-Name: Ying-Qi Zhao
Author-X-Name-First: Ying-Qi
Author-X-Name-Last: Zhao
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: Eric B. Laber
Author-X-Name-First: Eric B.
Author-X-Name-Last: Laber
Author-Name: Michael R. Kosorok
Author-X-Name-First: Michael R.
Author-X-Name-Last: Kosorok
Title: New Statistical Learning Methods for Estimating Optimal Dynamic Treatment Regimes
Abstract:
Dynamic treatment regimes (DTRs) are sequential decision rules for
individual patients that can adapt over time to an evolving illness. The
goal is to accommodate heterogeneity among patients and find the DTR which
will produce the best long-term outcome if implemented. We introduce two
new statistical learning methods for estimating the optimal DTR, termed
backward outcome weighted learning (BOWL), and simultaneous outcome
weighted learning (SOWL). These approaches convert individualized
treatment selection into either a sequential or a simultaneous
classification problem, and can thus be applied by modifying existing
machine learning techniques. The proposed methods are based on directly
maximizing over all DTRs a nonparametric estimator of the expected
long-term outcome; this is fundamentally different from regression-based
methods, for example, Q-learning, which indirectly
attempt such maximization and rely heavily on the correctness of
postulated regression models. We prove that the resulting rules are
consistent, and provide finite sample bounds for the errors using the
estimated rules. Simulation results suggest the proposed methods produce
superior DTRs compared with Q-learning, especially in
small samples. We illustrate the methods using data from a clinical trial
for smoking cessation. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 583-598
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.937488
File-URL: http://hdl.handle.net/10.1080/01621459.2014.937488
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:583-598
Template-Type: ReDIF-Article 1.0
Author-Name: R. Dennis Cook
Author-X-Name-First: R. Dennis
Author-X-Name-Last: Cook
Author-Name: Xin Zhang
Author-X-Name-First: Xin
Author-X-Name-Last: Zhang
Title: Foundations for Envelope Models and Methods
Abstract:
Envelopes were recently proposed by Cook, Li and Chiaromonte as a method
for reducing estimative and predictive variations in multivariate linear
regression. We extend their formulation, proposing a general definition of
an envelope and a general framework for adapting envelope methods to any
estimation procedure. We apply the new envelope methods to weighted least
squares, generalized linear models and Cox regression. Simulations and
illustrative data analysis show the potential for envelope methods to
significantly improve standard methods in linear discriminant analysis,
logistic regression and Poisson regression. Supplementary materials for
this article are available online.
Journal: Journal of the American Statistical Association
Pages: 599-611
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.983235
File-URL: http://hdl.handle.net/10.1080/01621459.2014.983235
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:599-611
Template-Type: ReDIF-Article 1.0
Author-Name: C. F. Jeff Wu
Author-X-Name-First: C. F. Jeff
Author-X-Name-Last: Wu
Title: Post-Fisherian Experimentation: From Physical to Virtual
Abstract:
Fisher's pioneering work in design of experiments has inspired further
work with broader applications, especially in industrial experimentation.
This article discusses three topics in physical experiments: principles of
effect hierarchy, sparsity, and heredity for factorial designs, a new
method called conditional main effect (CME) for de-aliasing aliased
effects, and robust parameter design. I also review the recent emergence
of virtual experiments on a computer. Some major challenges in computer
experiments, which must go beyond Fisherian principles, are outlined.
Journal: Journal of the American Statistical Association
Pages: 612-620
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.914441
File-URL: http://hdl.handle.net/10.1080/01621459.2014.914441
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:612-620
Template-Type: ReDIF-Article 1.0
Author-Name: Sy Han Chiou
Author-X-Name-First: Sy Han
Author-X-Name-Last: Chiou
Author-Name: Sangwook Kang
Author-X-Name-First: Sangwook
Author-X-Name-Last: Kang
Author-Name: Jun Yan
Author-X-Name-First: Jun
Author-X-Name-Last: Yan
Title: Semiparametric Accelerated Failure Time Modeling for Clustered Failure Times From Stratified Sampling
Abstract:
Clustered failure times often arise from studies with stratified sampling
designs where it is desired to reduce both cost and sampling error.
Semiparametric accelerated failure time (AFT) models have not been used as
frequently as Cox relative risk models in such settings due to lack of
efficient and reliable computing routines for inference. The challenge
is rooted in the nonsmoothness of the rank-based estimating functions, and for
clustered data, the asymptotic properties of the estimator from the
weighted version have not been available. The recently proposed induced
smoothing approach, which provides fast and accurate rank-based inferences
for AFT models, is generalized to incorporate weights to accommodate
stratified sampling designs. The estimator from the induced smoothing
weighted estimating equations is shown to be consistent and to have the same
asymptotic distribution as that from the nonsmooth version, which has not
been developed before. The variance of the estimator is estimated by
computationally efficient sandwich estimators aided by a multiplier
bootstrap. The proposed method is assessed in extensive simulation studies
where the estimators appear to provide valid and efficient inferences. A
stratified case-cohort design with clustered times to tooth extraction in
a dental study illustrates the usefulness of the method.
Journal: Journal of the American Statistical Association
Pages: 621-629
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.917978
File-URL: http://hdl.handle.net/10.1080/01621459.2014.917978
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:621-629
Template-Type: ReDIF-Article 1.0
Author-Name: Hengjian Cui
Author-X-Name-First: Hengjian
Author-X-Name-Last: Cui
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Author-Name: Wei Zhong
Author-X-Name-First: Wei
Author-X-Name-Last: Zhong
Title: Model-Free Feature Screening for Ultrahigh Dimensional Discriminant Analysis
Abstract:
This work is concerned with marginal sure independence feature screening
for ultrahigh dimensional discriminant analysis. The response variable is
categorical in discriminant analysis. This enables us to use the
conditional distribution function to construct a new index for feature
screening. In this article, we propose a marginal feature screening
procedure based on the empirical conditional distribution function. We
establish the sure screening and ranking consistency properties for the
proposed procedure without assuming any moment condition on the
predictors. The proposed procedure enjoys several appealing merits. First,
it is model-free in that its implementation does not require specification
of a regression model. Second, it is robust to heavy-tailed distributions
of predictors and the presence of potential outliers. Third, it allows the
categorical response to have a diverging number of classes, of order
O(n^κ) for some κ ⩾ 0. We assess the
finite sample properties of the proposed procedure
by Monte Carlo simulation studies and numerical comparison. We further
illustrate the proposed methodology by empirical analyses of two real-life
datasets. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 630-641
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.920256
File-URL: http://hdl.handle.net/10.1080/01621459.2014.920256
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:630-641
Template-Type: ReDIF-Article 1.0
Author-Name: Bo Jiang
Author-X-Name-First: Bo
Author-X-Name-Last: Jiang
Author-Name: Chao Ye
Author-X-Name-First: Chao
Author-X-Name-Last: Ye
Author-Name: Jun S. Liu
Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu
Title: Nonparametric K-Sample Tests via Dynamic Slicing
Abstract:
K-sample testing problems arise in many scientific
applications and have attracted statisticians' attention for many years.
We propose an omnibus nonparametric method based on an optimal
discretization (aka "slicing") of continuous random variables in the test.
The novelty of our approach lies in the inclusion of a term penalizing the
number of slices (i.e., the resolution of the discretization) so as to
regularize the corresponding likelihood-ratio test statistic. An efficient
dynamic programming algorithm is developed to determine the optimal
slicing scheme. Asymptotic and finite-sample properties such as power and
null distribution of the resulting test statistic are studied. We compare
the proposed testing method with some existing well-known methods and
demonstrate its statistical power through extensive simulation studies as
well as a real data example. A dynamic slicing method for the one-sample
testing problem is further developed and studied under the same framework.
Supplementary materials including technical derivations and proofs are
available online.
Journal: Journal of the American Statistical Association
Pages: 642-653
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.920257
File-URL: http://hdl.handle.net/10.1080/01621459.2014.920257
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:642-653
Template-Type: ReDIF-Article 1.0
Author-Name: Philip Preuss
Author-X-Name-First: Philip
Author-X-Name-Last: Preuss
Author-Name: Ruprecht Puchstein
Author-X-Name-First: Ruprecht
Author-X-Name-Last: Puchstein
Author-Name: Holger Dette
Author-X-Name-First: Holger
Author-X-Name-Last: Dette
Title: Detection of Multiple Structural Breaks in Multivariate Time Series
Abstract:
We propose a new nonparametric procedure (referred to as MuBreD) for the
detection and estimation of multiple structural breaks in the
autocovariance function of a multivariate (second-order) piecewise
stationary process, which also identifies the components of the series
where the breaks occur. MuBreD is based on a comparison of the estimated
spectral distribution on different segments of the observed time series
and consists of three steps: it starts with a consistent test, which
allows us to prove the existence of structural breaks at a controlled Type
I error. Second, it estimates sets containing possible break points and
finally these sets are reduced to identify the relevant structural breaks
and corresponding components which are responsible for the changes in the
autocovariance structure. In contrast to all other methods proposed in the
literature, our approach does not make any parametric assumptions, is not
designed specifically for detecting a single change point, and addresses
the problem of multiple structural breaks in the autocovariance function
directly with no use of the binary segmentation algorithm. We prove that
the new procedure detects all components and the corresponding locations
where structural breaks occur with probability converging to one as the
sample size increases and provide data-driven rules for the selection of
all regularization parameters. The results are illustrated by analyzing
financial asset returns, and in a simulation study it is demonstrated that
MuBreD outperforms the currently available nonparametric methods for
detecting breaks in the dependency structure of multivariate time series.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 654-668
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.920613
File-URL: http://hdl.handle.net/10.1080/01621459.2014.920613
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:654-668
Template-Type: ReDIF-Article 1.0
Author-Name: Wei Ma
Author-X-Name-First: Wei
Author-X-Name-Last: Ma
Author-Name: Feifang Hu
Author-X-Name-First: Feifang
Author-X-Name-Last: Hu
Author-Name: Lixin Zhang
Author-X-Name-First: Lixin
Author-X-Name-Last: Zhang
Title: Testing Hypotheses of Covariate-Adaptive Randomized Clinical Trials
Abstract:
Covariate-adaptive designs are often implemented to balance important
covariates in clinical trials. However, the theoretical properties of
conventional testing hypotheses are usually unknown under
covariate-adaptive randomized clinical trials. In the literature, most
studies are based on simulations. In this article, we provide theoretical
foundation of hypothesis testing under covariate-adaptive designs based on
linear models. We derive the asymptotic distributions of the test
statistics of testing both treatment effects and the significance of
covariates under null and alternative hypotheses. Under a large class of
covariate-adaptive designs, (i) the hypothesis testing to compare
treatment effects is usually conservative, with a Type I error smaller than the nominal level;
(ii) the hypothesis testing to compare treatment effects is usually more
powerful than complete randomization; and (iii) the hypothesis testing for
significance of covariates is still valid. The class includes most of the
covariate-adaptive designs in the literature, including Pocock and
Simon's marginal procedure and the stratified permuted block design.
Numerical studies are also performed to assess their corresponding finite
sample properties. Supplementary material for this article is available
online.
Journal: Journal of the American Statistical Association
Pages: 669-680
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.922469
File-URL: http://hdl.handle.net/10.1080/01621459.2014.922469
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:669-680
Template-Type: ReDIF-Article 1.0
Author-Name: Grace Y. Yi
Author-X-Name-First: Grace Y.
Author-X-Name-Last: Yi
Author-Name: Yanyuan Ma
Author-X-Name-First: Yanyuan
Author-X-Name-Last: Ma
Author-Name: Donna Spiegelman
Author-X-Name-First: Donna
Author-X-Name-Last: Spiegelman
Author-Name: Raymond J. Carroll
Author-X-Name-First: Raymond J.
Author-X-Name-Last: Carroll
Title: Functional and Structural Methods With Mixed Measurement Error and Misclassification in Covariates
Abstract:
Covariate measurement imprecision or errors arise frequently in many
areas. It is well known that ignoring such errors can substantially
degrade the quality of inference or even yield erroneous results. Although
in practice both covariates subject to measurement error and covariates
subject to misclassification can occur, research attention in the
literature has mainly focused on addressing either one of these problems
separately. To fill this gap, we develop estimation and inference methods
that accommodate both characteristics simultaneously. Specifically, we
consider measurement error and misclassification in generalized linear
models under the scenario that an external validation study is available,
and systematically develop a number of effective functional and structural
methods. Our methods can be applied to different situations to meet
various objectives.
Journal: Journal of the American Statistical Association
Pages: 681-696
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.922777
File-URL: http://hdl.handle.net/10.1080/01621459.2014.922777
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:681-696
Template-Type: ReDIF-Article 1.0
Author-Name: Catalina A. Vallejos
Author-X-Name-First: Catalina A.
Author-X-Name-Last: Vallejos
Author-Name: Mark F. J. Steel
Author-X-Name-First: Mark F. J.
Author-X-Name-Last: Steel
Title: Objective Bayesian Survival Analysis Using Shape Mixtures of Log-Normal Distributions
Abstract:
Survival models such as the Weibull or log-normal lead to inference that
is not robust to the presence of outliers. They also assume that all
heterogeneity between individuals can be modeled through covariates. This
article considers the use of infinite mixtures of lifetime distributions
as a solution for these two issues. This can be interpreted as the
introduction of a random effect in the survival distribution. We introduce
the family of shape mixtures of log-normal distributions, which covers a
wide range of density and hazard functions. Bayesian inference under
nonsubjective priors based on the Jeffreys' rule is examined and
conditions for posterior propriety are established. The existence of the
posterior distribution on the basis of a sample of point observations is
not always guaranteed and a solution through set observations is
implemented. In addition, we propose a method for outlier detection based
on the mixture structure. A simulation study illustrates the performance
of our methods under different scenarios and an application to a real
dataset is provided. Supplementary materials for the article, which
include R code, are available online.
Journal: Journal of the American Statistical Association
Pages: 697-710
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.923316
File-URL: http://hdl.handle.net/10.1080/01621459.2014.923316
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:697-710
Template-Type: ReDIF-Article 1.0
Author-Name: Juhee Lee
Author-X-Name-First: Juhee
Author-X-Name-Last: Lee
Author-Name: Peter F. Thall
Author-X-Name-First: Peter F.
Author-X-Name-Last: Thall
Author-Name: Yuan Ji
Author-X-Name-First: Yuan
Author-X-Name-Last: Ji
Author-Name: Peter Müller
Author-X-Name-First: Peter
Author-X-Name-Last: Müller
Title: Bayesian Dose-Finding in Two Treatment Cycles Based on the Joint Utility of Efficacy and Toxicity
Abstract:
This article proposes a phase I/II clinical trial design for adaptively
and dynamically optimizing each patient's dose in each of two cycles of
therapy based on the joint binary efficacy and toxicity outcomes in each
cycle. A dose-outcome model is assumed that includes a Bayesian
hierarchical latent variable structure to induce association among the
outcomes and also facilitate posterior computation. Doses are chosen in
each cycle based on posteriors of a model-based objective function,
similar to a reinforcement learning or Q-learning function, defined in
terms of numerical utilities of the joint outcomes in each cycle. For each
patient, the procedure outputs a sequence of two actions, one for each
cycle, with each action being the decision to either treat the patient at
a chosen dose or not to treat. The cycle 2 action depends on the
individual patient's cycle 1 dose and outcomes. In addition, decisions are
based on posterior inference using other patients' data, and therefore,
the proposed method is adaptive both within and between patients. A
simulation study of the method is presented, including comparison to
two-cycle extensions of the conventional 3 + 3 algorithm, continual
reassessment method, and a Bayesian model-based design, and evaluation of
robustness. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 711-722
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.926815
File-URL: http://hdl.handle.net/10.1080/01621459.2014.926815
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:711-722
Template-Type: ReDIF-Article 1.0
Author-Name: Xuerong Chen
Author-X-Name-First: Xuerong
Author-X-Name-Last: Chen
Author-Name: Alan T. K. Wan
Author-X-Name-First: Alan T. K.
Author-X-Name-Last: Wan
Author-Name: Yong Zhou
Author-X-Name-First: Yong
Author-X-Name-Last: Zhou
Title: Efficient Quantile Regression Analysis With Missing Observations
Abstract:
This article examines the problem of estimation in a quantile regression
model when observations are missing at random under independent and
nonidentically distributed errors. We consider three approaches of
handling this problem based on nonparametric inverse probability
weighting, estimating equations projection, and a combination of both. An
important distinguishing feature of our methods is their ability to handle
missing response and/or partially missing covariates, whereas existing
techniques can handle only one or the other, but not both. We prove that
our methods yield asymptotically equivalent estimators that achieve the
desirable asymptotic properties of unbiasedness, normality, and
root-n consistency.
Because we do not assume that the errors are identically distributed, our
theoretical results are valid under heteroscedasticity, a particularly
strong feature of our methods. Under the special case of identical error
distributions, all of our proposed estimators achieve the semiparametric
efficiency bound. To facilitate the practical implementation of these
methods, we develop an iterative method based on the majorize/minimize
algorithm for computing the quantile regression estimates, and a bootstrap
method for computing their variances. Our simulation findings suggest that
all three methods have good finite sample properties. We further
illustrate these methods by a real data example. Supplementary materials
for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 723-741
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.928219
File-URL: http://hdl.handle.net/10.1080/01621459.2014.928219
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:723-741
Template-Type: ReDIF-Article 1.0
Author-Name: Nikolaos Sgouropoulos
Author-X-Name-First: Nikolaos
Author-X-Name-Last: Sgouropoulos
Author-Name: Qiwei Yao
Author-X-Name-First: Qiwei
Author-X-Name-Last: Yao
Author-Name: Claudia Yastremiz
Author-X-Name-First: Claudia
Author-X-Name-Last: Yastremiz
Title: Matching a Distribution by Matching Quantiles Estimation
Abstract:
Motivated by the problem of selecting representative portfolios for
backtesting counterparty credit risks, we propose a matching quantiles
estimation (MQE) method for matching a target distribution by that of a
linear combination of a set of random variables. An iterative procedure
based on the ordinary least-squares estimation (OLS) is proposed to
compute MQE. MQE can be easily modified by adding a LASSO penalty term if
a sparse representation is desired, or by restricting the matching to a
certain range of quantiles to match only part of the target distribution. The
convergence of the algorithm and the asymptotic properties of the
estimator, both with and without LASSO, are established. A measure and an
associated statistical test are proposed to assess the goodness-of-match.
The finite sample properties are illustrated by simulation. An application
in selecting a counterparty representative portfolio with a real dataset
is reported. The proposed MQE also finds applications in portfolio
tracking, which demonstrates the usefulness of combining MQE with LASSO.
Journal: Journal of the American Statistical Association
Pages: 742-759
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.929522
File-URL: http://hdl.handle.net/10.1080/01621459.2014.929522
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:742-759
Template-Type: ReDIF-Article 1.0
Author-Name: Randy C. S. Lai
Author-X-Name-First: Randy C. S.
Author-X-Name-Last: Lai
Author-Name: Jan Hannig
Author-X-Name-First: Jan
Author-X-Name-Last: Hannig
Author-Name: Thomas C. M. Lee
Author-X-Name-First: Thomas C. M.
Author-X-Name-Last: Lee
Title: Generalized Fiducial Inference for Ultrahigh-Dimensional Regression
Abstract:
In recent years, the ultrahigh-dimensional linear regression problem has
attracted enormous attention from the research community. Under the
sparsity assumption, most of the published work is devoted to the
selection and estimation of the predictor variables with nonzero
coefficients. This article studies a different but fundamentally important
aspect of this problem: uncertainty quantification for parameter estimates
and model choices. To be more specific, this article proposes methods for
deriving a probability density function on the set of all possible models,
and also for constructing confidence intervals for the corresponding
parameters. These proposed methods are developed using the generalized
fiducial methodology, which is a variant of Fisher's controversial
fiducial idea. Theoretical properties of the proposed methods are studied,
and in particular it is shown that statistical inference based on the
proposed methods has correct asymptotic frequentist properties. In
terms of empirical performance, the proposed methods are tested by
simulation experiments and an application to a real dataset. Finally, this
work can also be seen as an interesting and successful application of
Fisher's fiducial idea to an important and contemporary problem. To the
best of the authors' knowledge, this is the first time the fiducial
idea has been applied to a so-called "large p small
n" problem. A connection to objective Bayesian model
selection is also discussed.
Journal: Journal of the American Statistical Association
Pages: 760-772
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.931237
File-URL: http://hdl.handle.net/10.1080/01621459.2014.931237
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:760-772
Template-Type: ReDIF-Article 1.0
Author-Name: Kuangyu Wen
Author-X-Name-First: Kuangyu
Author-X-Name-Last: Wen
Author-Name: Ximing Wu
Author-X-Name-First: Ximing
Author-X-Name-Last: Wu
Title: An Improved Transformation-Based Kernel Estimator of Densities on the Unit Interval
Abstract:
The kernel density estimator (KDE) suffers boundary biases when applied to
densities on bounded supports, which are assumed to be the unit interval.
Transformations mapping the unit interval to the real line can be used to
remove boundary biases. However, this approach may induce erratic tail
behaviors when the estimated density of transformed data is transformed
back to its original scale. We propose a modified, transformation-based
KDE that employs a tapered and tilted back-transformation. We derive the
theoretical properties of the new estimator and show that it
asymptotically dominates the naive transformation-based estimator while
maintaining its simplicity. We then propose three automatic methods of
smoothing parameter selection. Our Monte Carlo simulations demonstrate the
good finite sample performance of the proposed estimator, especially for
densities with poles near the boundaries. An example with real data is
provided.
Journal: Journal of the American Statistical Association
Pages: 773-783
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.969426
File-URL: http://hdl.handle.net/10.1080/01621459.2014.969426
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:773-783
Template-Type: ReDIF-Article 1.0
Author-Name: Ke Zhu
Author-X-Name-First: Ke
Author-X-Name-Last: Zhu
Author-Name: Shiqing Ling
Author-X-Name-First: Shiqing
Author-X-Name-Last: Ling
Title: LADE-Based Inference for ARMA Models With Unspecified and Heavy-Tailed Heteroscedastic Noises
Abstract:
This article develops a systematic procedure of statistical inference for
the auto-regressive moving average (ARMA) model with unspecified and
heavy-tailed heteroscedastic noises. We first investigate the least
absolute deviation estimator (LADE) and the self-weighted LADE for the
model. Both estimators are shown to be strongly consistent and
asymptotically normal when the noise has a finite variance and infinite
variance, respectively. Both the LADE and the self-weighted LADE converge
at the rate n-super--1/2, which is faster than that of the least-squares
estimator (LSE) for the ARMA model when the
tail index of generalized auto-regressive conditional heteroskedasticity
(GARCH) noises is in (0, 4], and thus they are more efficient in this
case. Since their asymptotic covariance matrices cannot be estimated
directly from the sample, we develop the random weighting approach for
statistical inference under this nonstandard case. We further propose a
novel sign-based portmanteau test for model adequacy. A simulation study is
carried out to assess the performance of our procedure, and a real
illustrative example is given. Supplementary materials for this article
are available online.
Journal: Journal of the American Statistical Association
Pages: 784-794
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.977386
File-URL: http://hdl.handle.net/10.1080/01621459.2014.977386
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:784-794
Template-Type: ReDIF-Article 1.0
Author-Name: Li Ma
Author-X-Name-First: Li
Author-X-Name-Last: Ma
Title: Scalable Bayesian Model Averaging Through Local Information Propagation
Abstract:
This article shows that a probabilistic version of the classical
forward-stepwise variable inclusion procedure can serve as a general
data-augmentation scheme for model space distributions in (generalized)
linear models. This latent variable representation takes the form of a
Markov process, thereby allowing information propagation algorithms to be
applied for sampling from model space posteriors. In particular, we
propose a sequential Monte Carlo method for achieving effective unbiased
Bayesian model averaging in high-dimensional problems, using proposal
distributions constructed using local information propagation. The
method--called LIPS for local information propagation based sampling--is
illustrated using real and simulated examples with dimensionality ranging
from 15 to 1000, and its performance in estimating posterior inclusion
probabilities and in out-of-sample prediction is compared to those of
several other methods--namely, MCMC, BAS, iBMA, and LASSO. In addition, it
is shown that the latent variable representation can also serve as a
modeling tool for specifying model space priors that account for knowledge
regarding model complexity and conditional inclusion relationships.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 795-809
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.980908
File-URL: http://hdl.handle.net/10.1080/01621459.2014.980908
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:795-809
Template-Type: ReDIF-Article 1.0
Author-Name: Harry Crane
Author-X-Name-First: Harry
Author-X-Name-Last: Crane
Title: Clustering from Categorical Data Sequences
Abstract:
The three-parameter cluster model is a combinatorial stochastic process
that generates categorical response sequences by randomly perturbing a
fixed clustering parameter. This clear relationship between the observed
data and the underlying clustering is particularly attractive in cluster
analysis, in which supervised learning is a common goal and missing data
is a familiar issue. The model is well equipped for this task, as it can
handle missing data, perform out-of-sample inference, and accommodate both
independent and dependent data sequences. Moreover, its clustering
parameter lies in the unrestricted space of partitions, so that the number
of clusters need not be specified beforehand. We establish these and other
theoretical properties and also demonstrate the model on datasets from
epidemiology, genetics, political science, and legal studies.
Journal: Journal of the American Statistical Association
Pages: 810-823
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.983521
File-URL: http://hdl.handle.net/10.1080/01621459.2014.983521
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:810-823
Template-Type: ReDIF-Article 1.0
Author-Name: Peter Radchenko
Author-X-Name-First: Peter
Author-X-Name-Last: Radchenko
Author-Name: Xinghao Qiao
Author-X-Name-First: Xinghao
Author-X-Name-Last: Qiao
Author-Name: Gareth M. James
Author-X-Name-First: Gareth M.
Author-X-Name-Last: James
Title: Index Models for Sparsely Sampled Functional Data
Abstract:
The regression problem involving functional predictors has many important
applications and a number of functional regression methods have been
developed. However, a common complication in functional data analysis is
one of sparsely observed curves, that is, predictors that are observed,
with error, at only a small subset of the possible time points. Such sparsely
observed data induce an errors-in-variables model, where
one must account for measurement error in the functional predictors. Faced
with sparsely observed data, most current functional regression methods
simply estimate the unobserved predictors and treat them as fully
observed, thus failing to account for the extra uncertainty from the
measurement error. We propose a new functional errors-in-variables
approach, sparse index model functional estimation (SIMFE), which uses a
functional index model formulation to deal with sparsely observed
predictors. SIMFE has several advantages over more traditional methods.
First, the index model implements a nonlinear regression and uses an
accurate supervised method to estimate the lower dimensional space into
which the predictors should be projected. Second, SIMFE can be applied to
both scalar and functional responses and multiple predictors. Finally,
SIMFE uses a mixed effects model to effectively deal with very sparsely
observed functional predictors and to correctly model the measurement
error. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 824-836
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.931859
File-URL: http://hdl.handle.net/10.1080/01621459.2014.931859
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:824-836
Template-Type: ReDIF-Article 1.0
Author-Name: Karl Bruce Gregory
Author-X-Name-First: Karl Bruce
Author-X-Name-Last: Gregory
Author-Name: Raymond J. Carroll
Author-X-Name-First: Raymond J.
Author-X-Name-Last: Carroll
Author-Name: Veerabhadran Baladandayuthapani
Author-X-Name-First: Veerabhadran
Author-X-Name-Last: Baladandayuthapani
Author-Name: Soumendra N. Lahiri
Author-X-Name-First: Soumendra N.
Author-X-Name-Last: Lahiri
Title: A Two-Sample Test for Equality of Means in High Dimension
Abstract:
We develop a test statistic for testing the equality of two population
mean vectors in the "large-p-small-n"
setting. Such a test must surmount the rank-deficiency of the sample
covariance matrix, which breaks down the classic Hotelling
T-super-2 test. The proposed procedure, called the
generalized component test, avoids full estimation of the covariance
matrix by assuming that the p components admit a logical
ordering such that the dependence between components is related to their
displacement. The test is shown to be competitive with other recently
developed methods under ARMA and long-range dependence structures and to
achieve superior power for heavy-tailed data. The test does not assume
equality of covariance matrices between the two populations, is robust to
heteroscedasticity in the component variances, and requires very little
computation time, which allows its use in settings with very large
p. An analysis of mitochondrial calcium concentration in
mouse cardiac muscles over time and of copy number variations in a
glioblastoma multiforme dataset from The Cancer Genome Atlas are carried
out to illustrate the test. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 837-849
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.934826
File-URL: http://hdl.handle.net/10.1080/01621459.2014.934826
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:837-849
Template-Type: ReDIF-Article 1.0
Author-Name: Yunxiao Chen
Author-X-Name-First: Yunxiao
Author-X-Name-Last: Chen
Author-Name: Jingchen Liu
Author-X-Name-First: Jingchen
Author-X-Name-Last: Liu
Author-Name: Gongjun Xu
Author-X-Name-First: Gongjun
Author-X-Name-Last: Xu
Author-Name: Zhiliang Ying
Author-X-Name-First: Zhiliang
Author-X-Name-Last: Ying
Title: Statistical Analysis of Q-Matrix Based Diagnostic Classification Models
Abstract:
Diagnostic classification models (DCMs) have recently gained prominence in
educational assessment, psychiatric evaluation, and many other
disciplines. Central to the model specification is the so-called
Q-matrix that provides a qualitative specification of the
item-attribute relationship. In this article, we develop theories on the
identifiability for the Q-matrix under the DINA and the
DINO models. We further propose an estimation procedure for the
Q-matrix through the regularized maximum likelihood. The
applicability of this procedure is not limited to the DINA or the DINO
model and can be applied to essentially all Q-matrix
based DCMs. Simulation studies show that the proposed method recovers the
true Q-matrix with high probability. Furthermore,
two case studies are presented. The first case is a dataset on fraction
subtraction (educational application) and the second case is a subsample
of the National Epidemiological Survey on Alcohol and Related Conditions
concerning social anxiety disorder (psychiatric application).
Journal: Journal of the American Statistical Association
Pages: 850-866
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.934827
File-URL: http://hdl.handle.net/10.1080/01621459.2014.934827
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:850-866
Template-Type: ReDIF-Article 1.0
Author-Name: Shaoting Li
Author-X-Name-First: Shaoting
Author-X-Name-Last: Li
Author-Name: Jiahua Chen
Author-X-Name-First: Jiahua
Author-X-Name-Last: Chen
Author-Name: Jianhua Guo
Author-X-Name-First: Jianhua
Author-X-Name-Last: Guo
Author-Name: Bing-Yi Jing
Author-X-Name-First: Bing-Yi
Author-X-Name-Last: Jing
Author-Name: Shui-Ying Tsang
Author-X-Name-First: Shui-Ying
Author-X-Name-Last: Tsang
Author-Name: Hong Xue
Author-X-Name-First: Hong
Author-X-Name-Last: Xue
Title: Likelihood Ratio Test for Multi-Sample Mixture Model and Its Application to Genetic Imprinting
Abstract:
Genomic imprinting is a known aspect of the etiology of many diseases. The
imprinting phenomenon depicts differential expression levels of the allele
depending on its parental origin. When the parental origin is unknown, the
expression level has a finite normal mixture distribution. In such
applications, a random sample of expression levels consists of three
subsamples according to the number of minor alleles an individual
possesses, of which one is the mixture and the other two are homogeneous.
This understanding leads to a likelihood ratio test (LRT) for the presence
of imprinting. Because of the nonregularity of the finite mixture model,
the classical asymptotic conclusions on likelihood-based inference are not
applicable. We show that the maximum likelihood estimator of the mixing
distribution remains consistent. More interestingly, thanks to the
homogeneous subsamples, the LRT statistic has an elegant and rather
distinct null limiting distribution, an equal mixture of the χ-super-2
distributions with one and two degrees of freedom. Simulation studies confirm that the limiting
distribution provides precise approximations of the finite sample
distributions under various parameter settings. The LRT is applied to
expression data. Our analyses provide evidence for imprinting for a number
of isoform expressions.
Journal: Journal of the American Statistical Association
Pages: 867-877
Issue: 510
Volume: 110
Year: 2015
Month: 6
X-DOI: 10.1080/01621459.2014.939272
File-URL: http://hdl.handle.net/10.1080/01621459.2014.939272
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:510:p:867-877
Template-Type: ReDIF-Article 1.0
Author-Name: Mariel M. Finucane
Author-X-Name-First: Mariel M.
Author-X-Name-Last: Finucane
Author-Name: Christopher J. Paciorek
Author-X-Name-First: Christopher J.
Author-X-Name-Last: Paciorek
Author-Name: Gretchen A. Stevens
Author-X-Name-First: Gretchen A.
Author-X-Name-Last: Stevens
Author-Name: Majid Ezzati
Author-X-Name-First: Majid
Author-X-Name-Last: Ezzati
Title: Semiparametric Bayesian Density Estimation With Disparate Data Sources: A Meta-Analysis of Global Childhood Undernutrition
Abstract:
Undernutrition, resulting in restricted growth, and quantified here using
height-for-age z-scores, is an important contributor to
childhood morbidity and mortality. Since all levels of mild, moderate, and
severe undernutrition are of clinical and public health importance, it is
of interest to estimate the shape of the z-scores'
distributions. We present a finite normal mixture model that uses data on
4.3 million children to make annual country-specific estimates of these
distributions for under-5-year-old children in the world's 141 low- and
middle-income countries between 1985 and 2011. We incorporate both
individual-level data when available, as well as aggregated summary
statistics from studies whose individual-level data could not be obtained.
We place a hierarchical Bayesian probit stick-breaking model on the
mixture weights. The model allows for nonlinear changes in time, and it
borrows strength in time, in covariates, and within and across regional
country clusters to make estimates where data are uncertain, sparse, or
missing. This work addresses three important problems that often arise in
the fields of public health surveillance and global health monitoring.
First, data are always incomplete. Second, different data sources commonly
use different reporting metrics. Last, distributions, and especially their
tails, are often of substantive interest.
Journal: Journal of the American Statistical Association
Pages: 889-901
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.937487
File-URL: http://hdl.handle.net/10.1080/01621459.2014.937487
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:889-901
Template-Type: ReDIF-Article 1.0
Author-Name: Christopher K. Wikle
Author-X-Name-First: Christopher K.
Author-X-Name-Last: Wikle
Author-Name: Scott H. Holan
Author-X-Name-First: Scott H.
Author-X-Name-Last: Holan
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 901-903
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2015.1073083
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1073083
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:901-903
Template-Type: ReDIF-Article 1.0
Author-Name: Jim Hodges
Author-X-Name-First: Jim
Author-X-Name-Last: Hodges
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 903-905
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2015.1073084
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1073084
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:903-905
Template-Type: ReDIF-Article 1.0
Author-Name: Mariel M. Finucane
Author-X-Name-First: Mariel M.
Author-X-Name-Last: Finucane
Author-Name: Christopher J. Paciorek
Author-X-Name-First: Christopher J.
Author-X-Name-Last: Paciorek
Author-Name: Gretchen A. Stevens
Author-X-Name-First: Gretchen A.
Author-X-Name-Last: Stevens
Author-Name: Majid Ezzati
Author-X-Name-First: Majid
Author-X-Name-Last: Ezzati
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 906-909
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2015.1073085
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1073085
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:906-909
Template-Type: ReDIF-Article 1.0
Author-Name: José R. Zubizarreta
Author-X-Name-First: José R.
Author-X-Name-Last: Zubizarreta
Title: Stable Weights that Balance Covariates for Estimation With Incomplete Outcome Data
Abstract:
Weighting methods that adjust for observed covariates, such as inverse
probability weighting, are widely used for causal inference and estimation
with incomplete outcome data. Part of the appeal of such methods is that
one set of weights can be used to estimate a range of treatment effects
based on different outcomes, or a variety of population means for several
variables. However, this appeal can be diminished in practice by the
instability of the estimated weights and by the difficulty of adequately
adjusting for observed covariates in some settings. To address these
limitations, this article presents a new weighting method that finds the
weights of minimum variance that adjust or balance the empirical
distribution of the observed covariates up to levels prespecified by the
researcher. This method allows the researcher to balance very precisely
the means of the observed covariates and other features of their marginal
and joint distributions, such as variances and correlations and also, for
example, the quantiles of interactions of pairs and triples of observed
covariates, thus balancing entire two- and three-way marginals. Since the
weighting method is based on a well-defined convex optimization problem,
duality theory provides insight into the behavior of the variance of the
optimal weights in relation to the level of covariate balance adjustment,
answering the question: how much does tightening a balance constraint
increase the variance of the weights? Also, the weighting method runs in
polynomial time, so relatively large datasets can be handled quickly. An
implementation of the method is provided in the new package sbw for R.
This article shows some theoretical properties of the resulting weights
and illustrates their use by analyzing both a dataset from the 2010
Chilean earthquake and a simulated example.
Journal: Journal of the American Statistical Association
Pages: 910-922
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2015.1023805
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1023805
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:910-922
Template-Type: ReDIF-Article 1.0
Author-Name: Beom Seuk Hwang
Author-X-Name-First: Beom Seuk
Author-X-Name-Last: Hwang
Author-Name: Zhen Chen
Author-X-Name-First: Zhen
Author-X-Name-Last: Chen
Title: An Integrated Bayesian Nonparametric Approach for Stochastic and Variability Orders in ROC Curve Estimation: An Application to Endometriosis Diagnosis
Abstract:
In estimating ROC curves of multiple tests, some a priori constraints may
exist, either between the healthy and diseased populations within a test
or between tests within a population. In this article, we propose an
integrated modeling approach for ROC curves that jointly accounts for
stochastic and variability orders. The stochastic order constrains the
distributional centers of the diseased and healthy populations within a
test, while the variability order constrains the distributional spreads of
the tests within each of the populations. Under a Bayesian nonparametric
framework, we used features of the Dirichlet process mixture to
incorporate these order constraints in a natural way. We applied the
proposed approach to data from the Physician Reliability Study that
investigated the accuracy of diagnosing endometriosis using different
clinical information. To address the issue of no gold standard in the real
data, we used a sensitivity analysis approach that exploited diagnoses
from a panel of experts. To demonstrate the performance of the
methodology, we conducted simulation studies with varying sample sizes,
distributional assumptions, and order constraints. Supplementary materials
for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 923-934
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2015.1023806
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1023806
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:923-934
Template-Type: ReDIF-Article 1.0
Author-Name: Charles Hokayem
Author-X-Name-First: Charles
Author-X-Name-Last: Hokayem
Author-Name: Christopher Bollinger
Author-X-Name-First: Christopher
Author-X-Name-Last: Bollinger
Author-Name: James P. Ziliak
Author-X-Name-First: James P.
Author-X-Name-Last: Ziliak
Title: The Role of CPS Nonresponse in the Measurement of Poverty
Abstract:
The Current Population Survey Annual Social and Economic Supplement (CPS
ASEC) serves as the data source for official income, poverty, and
inequality statistics in the United States. There is a concern that the
rise in nonresponse to earnings questions could deteriorate data quality
and distort estimates of these important metrics. We use a dataset of
internal ASEC records matched to Social Security Detailed Earnings Records
(DER) to study the impact of earnings nonresponse on estimates of poverty
from 1997 to 2008. Our analysis does not treat the administrative data as the
"truth"; instead, we rely on information from both administrative and
survey data. We compare a "full response" poverty rate that assumes all
ASEC respondents provided earnings data to the official poverty rate to
gauge the nonresponse bias. On average, we find the nonresponse bias is
about 1.0 percentage point.
Journal: Journal of the American Statistical Association
Pages: 935-945
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2015.1029576
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1029576
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:935-945
Template-Type: ReDIF-Article 1.0
Author-Name: Chao Huang
Author-X-Name-First: Chao
Author-X-Name-Last: Huang
Author-Name: Martin Styner
Author-X-Name-First: Martin
Author-X-Name-Last: Styner
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Title: Clustering High-Dimensional Landmark-Based Two-Dimensional Shape Data
Abstract:
An important goal in image analysis is to cluster and recognize objects of
interest according to the shapes of their boundaries. Clustering such
objects faces at least four major challenges including a curved shape
space, a high-dimensional feature space, a complex spatial correlation
structure, and shape variation associated with some covariates (e.g., age
or gender). The aim of this article is to develop a penalized model-based
clustering framework to cluster landmark-based planar shape data, while
explicitly addressing these challenges. Specifically, a mixture of
offset-normal shape factor analyzers (MOSFA) is proposed with mixing
proportions defined through a regression model (e.g., logistic) and an
offset-normal shape distribution in each component for data in the curved
shape space. A latent factor analysis model is introduced to explicitly
model the complex spatial correlation. A penalized likelihood approach
with both an adaptive pairwise fused Lasso penalty and an
L2 penalty is used to automatically
realize variable selection via thresholding and deliver a sparse solution.
Our real data analysis has confirmed the excellent finite-sample
performance of MOSFA in revealing meaningful clusters in the corpus
callosum shape data obtained from the Attention Deficit Hyperactivity
Disorder-200 (ADHD-200) study. Supplementary materials for this article
are available online.
Journal: Journal of the American Statistical Association
Pages: 946-961
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2015.1034802
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1034802
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:946-961
Template-Type: ReDIF-Article 1.0
Author-Name: Yi-Juan Hu
Author-X-Name-First: Yi-Juan
Author-X-Name-Last: Hu
Author-Name: Wei Sun
Author-X-Name-First: Wei
Author-X-Name-Last: Sun
Author-Name: Jung-Ying Tzeng
Author-X-Name-First: Jung-Ying
Author-X-Name-Last: Tzeng
Author-Name: Charles M. Perou
Author-X-Name-First: Charles M.
Author-X-Name-Last: Perou
Title: Proper Use of Allele-Specific Expression Improves Statistical Power for cis-eQTL Mapping with RNA-Seq Data
Abstract:
Studies of expression quantitative trait loci (eQTLs) offer insight into
the molecular mechanisms of loci that were found to be associated with
complex diseases; these mechanisms can be classified into
cis- and trans-acting regulation. At
present, high-throughput RNA sequencing (RNA-seq) is rapidly replacing
expression microarrays to assess gene expression abundance. Unlike
microarrays that only measure the total expression of each gene, RNA-seq
also provides information on allele-specific expression (ASE), which can
be used to distinguish cis-eQTLs from
trans-eQTLs and, more importantly, enhance
cis-eQTL mapping. However, assessing the
cis-effect of a candidate eQTL on a gene requires
knowledge of the haplotypes connecting the candidate eQTL and the gene,
which cannot be inferred with certainty. The existing two-stage approach
that first phases the candidate eQTL against the gene and then treats the
inferred phase as observed in the association analysis tends to attenuate
the estimated cis-effect and reduce the power for
detecting a cis-eQTL. In this article, we provide a
maximum-likelihood framework for cis-eQTL mapping with
RNA-seq data. Our approach integrates the inference of haplotypes and the
association analysis into a single stage, and is thus unbiased and
statistically powerful. We also develop a pipeline for performing a
comprehensive scan of all local eQTLs for all genes in the genome by
controlling the false discovery rate, and implement the methods in a
computationally efficient software program. The advantages of the proposed
methods over the existing ones are demonstrated through realistic
simulation studies and an application to empirical breast cancer data from
The Cancer Genome Atlas project. Supplementary materials for this article
are available online.
Journal: Journal of the American Statistical Association
Pages: 962-974
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2015.1038449
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1038449
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:962-974
Template-Type: ReDIF-Article 1.0
Author-Name: Wei Sun
Author-X-Name-First: Wei
Author-X-Name-Last: Sun
Author-Name: Yufeng Liu
Author-X-Name-First: Yufeng
Author-X-Name-Last: Liu
Author-Name: James J. Crowley
Author-X-Name-First: James J.
Author-X-Name-Last: Crowley
Author-Name: Ting-Huei Chen
Author-X-Name-First: Ting-Huei
Author-X-Name-Last: Chen
Author-Name: Hua Zhou
Author-X-Name-First: Hua
Author-X-Name-Last: Zhou
Author-Name: Haitao Chu
Author-X-Name-First: Haitao
Author-X-Name-Last: Chu
Author-Name: Shunping Huang
Author-X-Name-First: Shunping
Author-X-Name-Last: Huang
Author-Name: Pei-Fen Kuan
Author-X-Name-First: Pei-Fen
Author-X-Name-Last: Kuan
Author-Name: Yuan Li
Author-X-Name-First: Yuan
Author-X-Name-Last: Li
Author-Name: Darla Miller
Author-X-Name-First: Darla
Author-X-Name-Last: Miller
Author-Name: Ginger Shaw
Author-X-Name-First: Ginger
Author-X-Name-Last: Shaw
Author-Name: Yichao Wu
Author-X-Name-First: Yichao
Author-X-Name-Last: Wu
Author-Name: Vasyl Zhabotynsky
Author-X-Name-First: Vasyl
Author-X-Name-Last: Zhabotynsky
Author-Name: Leonard McMillan
Author-X-Name-First: Leonard
Author-X-Name-Last: McMillan
Author-Name: Fei Zou
Author-X-Name-First: Fei
Author-X-Name-Last: Zou
Author-Name: Patrick F. Sullivan
Author-X-Name-First: Patrick F.
Author-X-Name-Last: Sullivan
Author-Name: Fernando Pardo-Manuel De Villena
Author-X-Name-First: Fernando Pardo-Manuel
Author-X-Name-Last: De Villena
Title: IsoDOT Detects Differential RNA-Isoform Expression/Usage With Respect to a Categorical or Continuous Covariate With High Sensitivity and Specificity
Abstract:
We have developed a statistical method named IsoDOT to assess differential
isoform expression (DIE) and differential isoform usage (DIU) using
RNA-seq data. Here isoform usage refers to relative isoform expression
given the total expression of the corresponding gene. IsoDOT performs two
tasks that cannot be accomplished by existing methods: to test DIE/DIU
with respect to a continuous covariate, and to test DIE/DIU for one case
versus one control. The latter situation is not uncommon in
practice; for example, comparing the paternal and maternal alleles of one
individual or comparing tumor and normal samples of one cancer patient.
Simulation studies demonstrate the high sensitivity and specificity of
IsoDOT. We apply IsoDOT to study the effects of haloperidol treatment on
the mouse transcriptome and identify a group of genes whose isoform usages
respond to haloperidol treatment. Supplementary materials for this article
are available online.
Journal: Journal of the American Statistical Association
Pages: 975-986
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2015.1040880
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1040880
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:975-986
Template-Type: ReDIF-Article 1.0
Author-Name: Hang J. Kim
Author-X-Name-First: Hang J.
Author-X-Name-Last: Kim
Author-Name: Lawrence H. Cox
Author-X-Name-First: Lawrence H.
Author-X-Name-Last: Cox
Author-Name: Alan F. Karr
Author-X-Name-First: Alan F.
Author-X-Name-Last: Karr
Author-Name: Jerome P. Reiter
Author-X-Name-First: Jerome P.
Author-X-Name-Last: Reiter
Author-Name: Quanli Wang
Author-X-Name-First: Quanli
Author-X-Name-Last: Wang
Title: Simultaneous Edit-Imputation for Continuous Microdata
Abstract:
Many statistical organizations collect data that are expected to satisfy
linear constraints; as examples, component variables should sum to total
variables, and ratios of pairs of variables should be bounded by
expert-specified constants. When reported data violate constraints,
organizations identify and replace values potentially in error in a
process known as edit-imputation. To date, most approaches separate the
error localization and imputation steps, typically using optimization
methods to identify the variables to change followed by hot deck
imputation. We present an approach that fully integrates editing and
imputation for continuous microdata under linear constraints. Our approach
relies on a Bayesian hierarchical model that includes (i) a flexible joint
probability model for the underlying true values of the data with support
only on the set of values that satisfy all editing constraints, (ii) a
model for latent indicators of the variables that are in error, and (iii)
a model for the reported responses for variables in error. We illustrate
the potential advantages of the Bayesian editing approach over existing
approaches using simulation studies. We apply the model to edit faulty
data from the 2007 U.S. Census of Manufactures. Supplementary materials
for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 987-999
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2015.1040881
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1040881
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:987-999
Template-Type: ReDIF-Article 1.0
Author-Name: Li-Chu Chien
Author-X-Name-First: Li-Chu
Author-X-Name-Last: Chien
Author-Name: Yuh-Jenn Wu
Author-X-Name-First: Yuh-Jenn
Author-X-Name-Last: Wu
Author-Name: Chao A. Hsiung
Author-X-Name-First: Chao A.
Author-X-Name-Last: Hsiung
Author-Name: Lu-Hai Wang
Author-X-Name-First: Lu-Hai
Author-X-Name-Last: Wang
Author-Name: I-Shou Chang
Author-X-Name-First: I-Shou
Author-X-Name-Last: Chang
Title: Smoothed Lexis Diagrams With Applications to Lung and Breast Cancer Trends in Taiwan
Abstract:
Cancer surveillance research often begins with a rate matrix, also called
a Lexis diagram, of cancer incidence derived from cancer registry and
census data. Lexis diagrams with 3- or 5-year intervals for age group and
for calendar year of diagnosis are often considered. This simple smoothing
approach suffers from a significant limitation: important details useful
in studying time trends may be lost in the averaging process involved in
generating a summary rate. This article constructs a smoothed Lexis
diagram and indicates its use in cancer surveillance research.
Specifically, we use a Poisson model to describe the relationship between
the number of new cases, the number of people at risk, and a smoothly
varying incidence rate for the study of the incidence rate function. Based
on the Poisson model, we use the standard Lexis diagram to introduce
priors through the coefficients of Bernstein polynomials and propose a
Bayesian approach to construct a smoothed Lexis diagram for the study of
the effects of age, period, and cohort on incidence rates in terms of
straightforward graphical displays. These include the age-specific rates
by year of birth, age-specific rates by year of diagnosis, year-specific
rates by age of diagnosis, and cohort-specific rates by age of diagnosis.
We illustrate our approach by studying the trends in lung and breast
cancer incidence in Taiwan. We find that for nearly every age group the
incidence rates for lung adenocarcinoma and female invasive breast cancer
increased rapidly in the past two decades and those for male lung squamous
cell carcinoma started to decrease, which is consistent with the decline
in the male smoking rate that began in 1985. Since the analyses indicate
strong age, period, and cohort effects, it seems that both lung cancer and
breast cancer will become more important public health problems in Taiwan.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1000-1012
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2015.1042106
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1042106
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1000-1012
Template-Type: ReDIF-Article 1.0
Author-Name: Kosuke Imai
Author-X-Name-First: Kosuke
Author-X-Name-Last: Imai
Author-Name: Marc Ratkovic
Author-X-Name-First: Marc
Author-X-Name-Last: Ratkovic
Title: Robust Estimation of Inverse Probability Weights for Marginal Structural Models
Abstract:
Marginal structural models (MSMs) are becoming increasingly popular as a
tool for causal inference from longitudinal data. Unlike standard
regression models, MSMs can adjust for time-dependent observed confounders
while avoiding the bias due to the direct adjustment for covariates
affected by the treatment. Despite their theoretical appeal, a main
practical difficulty of MSMs is the required estimation of inverse
probability weights. Previous studies have found that MSMs can be highly
sensitive to misspecification of the treatment assignment model even when the
number of time periods is moderate. To address this problem, we generalize
the covariate balancing propensity score (CBPS) methodology of Imai and
Ratkovic to longitudinal analysis settings. The CBPS estimates the inverse
probability weights such that the resulting covariate balance is improved.
Unlike the standard approach, the proposed methodology incorporates all
covariate balancing conditions across multiple time periods. Since the
number of these conditions grows exponentially as the number of time
periods increases, we also propose a low-rank approximation to ease the
computational burden. Our simulation and empirical studies suggest that
the CBPS significantly improves the empirical performance of MSMs by
making the treatment assignment model more robust to misspecification.
Open-source software is available for implementing the proposed methods.
Journal: Journal of the American Statistical Association
Pages: 1013-1023
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.956872
File-URL: http://hdl.handle.net/10.1080/01621459.2014.956872
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1013-1023
Template-Type: ReDIF-Article 1.0
Author-Name: Karel Vermeulen
Author-X-Name-First: Karel
Author-X-Name-Last: Vermeulen
Author-Name: Stijn Vansteelandt
Author-X-Name-First: Stijn
Author-X-Name-Last: Vansteelandt
Title: Bias-Reduced Doubly Robust Estimation
Abstract:
Over the past decade, doubly robust estimators have been proposed for a
variety of target parameters in causal inference and missing data models.
These are asymptotically unbiased when at least one of two nuisance
working models is correctly specified, regardless of which. While their
asymptotic distribution is not affected by the choice of
root-n consistent estimators of the nuisance parameters
indexing these working models when all working models are correctly
specified, this choice of estimators can have a dramatic impact under
misspecification of at least one working model. In this article, we will
therefore propose a simple and generic estimation principle for the
nuisance parameters indexing each of the working models, which is designed
to improve the performance of the doubly robust estimator of interest,
relative to the default use of maximum likelihood estimators for the
nuisance parameters. The proposed approach locally minimizes the squared
first-order asymptotic bias of the doubly robust estimator under
misspecification of both working models and results in doubly robust
estimators with easy-to-calculate asymptotic variance. It moreover
improves the stability of the weights in those doubly robust estimators
which invoke inverse probability weighting. Simulation studies confirm the
desirable finite-sample performance of the proposed estimators.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1024-1036
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.958155
File-URL: http://hdl.handle.net/10.1080/01621459.2014.958155
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1024-1036
Template-Type: ReDIF-Article 1.0
Author-Name: Alexander Volfovsky
Author-X-Name-First: Alexander
Author-X-Name-Last: Volfovsky
Author-Name: Peter D. Hoff
Author-X-Name-First: Peter D.
Author-X-Name-Last: Hoff
Title: Testing for Nodal Dependence in Relational Data Matrices
Abstract:
Relational data are often represented as a square matrix, the entries of
which record the relationships between pairs of objects. Many statistical
methods for the analysis of such data assume some degree of similarity or
dependence between objects in terms of the way they relate to each other.
However, formal tests for such dependence have not been developed. We
provide a test for such dependence using the framework of the matrix
normal model, a type of multivariate normal distribution parameterized in
terms of row- and column-specific covariance matrices. We develop a
likelihood ratio test (LRT) for row and column dependence based on the
observation of a single relational data matrix. We obtain a reference
distribution for the LRT statistic, thereby providing an exact test for
the presence of row or column correlations in a square relational data
matrix. Additionally, we provide extensions of the test to accommodate
common features of such data, such as undefined diagonal entries, a
nonzero mean, multiple observations, and deviations from normality.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1037-1046
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.965777
File-URL: http://hdl.handle.net/10.1080/01621459.2014.965777
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1037-1046
Template-Type: ReDIF-Article 1.0
Author-Name: Bailey K. Fosdick
Author-X-Name-First: Bailey K.
Author-X-Name-Last: Fosdick
Author-Name: Peter D. Hoff
Author-X-Name-First: Peter D.
Author-X-Name-Last: Hoff
Title: Testing and Modeling Dependencies Between a Network and Nodal Attributes
Abstract:
Network analysis is often focused on characterizing the dependencies
between network relations and node-level attributes. Potential
relationships are typically explored by modeling the network as a function
of the nodal attributes or by modeling the attributes as a function of the
network. These methods require specification of the exact nature of the
association between the network and attributes, reduce the network data to
a small number of summary statistics, and are unable to provide
predictions simultaneously for missing attribute and network information.
Existing methods that model the attributes and network jointly also assume
the data are fully observed. In this article, we introduce a unified
approach to analysis that addresses these shortcomings. We use a
previously developed latent variable model to obtain a low-dimensional
representation of the network in terms of node-specific network factors.
We introduce a novel testing procedure to determine if dependencies exist
between the network factors and attributes as a surrogate for a test of
dependence between the network and attributes. We also present a joint
model for the network relations and attributes, for use if the hypothesis
of independence is rejected, which can capture a variety of dependence
patterns and be used to make inference and predictions for missing
observations.
Journal: Journal of the American Statistical Association
Pages: 1047-1056
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2015.1008697
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1008697
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1047-1056
Template-Type: ReDIF-Article 1.0
Author-Name: Laura Azzimonti
Author-X-Name-First: Laura
Author-X-Name-Last: Azzimonti
Author-Name: Laura M. Sangalli
Author-X-Name-First: Laura M.
Author-X-Name-Last: Sangalli
Author-Name: Piercesare Secchi
Author-X-Name-First: Piercesare
Author-X-Name-Last: Secchi
Author-Name: Maurizio Domanin
Author-X-Name-First: Maurizio
Author-X-Name-Last: Domanin
Author-Name: Fabio Nobile
Author-X-Name-First: Fabio
Author-X-Name-Last: Nobile
Title: Blood Flow Velocity Field Estimation Via Spatial Regression With PDE Penalization
Abstract:
We propose an innovative method for the accurate estimation of surfaces
and spatial fields when prior knowledge of the phenomenon under study is
available. The prior knowledge included in the model derives from physics,
physiology, or mechanics of the problem at hand, and is formalized in
terms of a partial differential equation governing the phenomenon
behavior, as well as conditions that the phenomenon has to satisfy at the
boundary of the problem domain. The proposed models exploit advanced
scientific computing techniques and specifically make use of the finite
element method. The estimators have a penalized regression form and the
usual inferential tools are derived. Both the pointwise and the areal data
frameworks are considered. The driving application concerns the estimation
of the blood flow velocity field in a section of a carotid artery, using
data provided by echo-color Doppler. This applied problem arises within a
research project that aims at studying atherosclerosis pathogenesis.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1057-1071
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.946036
File-URL: http://hdl.handle.net/10.1080/01621459.2014.946036
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1057-1071
Template-Type: ReDIF-Article 1.0
Author-Name: C. Villa
Author-X-Name-First: C.
Author-X-Name-Last: Villa
Author-Name: S. G. Walker
Author-X-Name-First: S. G.
Author-X-Name-Last: Walker
Title: An Objective Approach to Prior Mass Functions for Discrete Parameter Spaces
Abstract:
We present a novel approach to constructing objective prior distributions
for discrete parameter spaces. These types of parameter spaces are
particularly problematic, as it appears that common objective procedures
to design prior distributions are problem specific. We propose an
objective criterion, based on loss functions, instead of trying to define
objective probabilities directly. We systematically apply this criterion
to a series of discrete scenarios, previously considered in the
literature, and compare the priors. The proposed approach applies to any
discrete parameter space, making it appealing as it does not involve
different concepts according to the model. Supplementary materials for
this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1072-1082
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.946319
File-URL: http://hdl.handle.net/10.1080/01621459.2014.946319
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1072-1082
Template-Type: ReDIF-Article 1.0
Author-Name: Tianhao Wang
Author-X-Name-First: Tianhao
Author-X-Name-Last: Wang
Author-Name: Yingcun Xia
Author-X-Name-First: Yingcun
Author-X-Name-Last: Xia
Title: Whittle Likelihood Estimation of Nonlinear Autoregressive Models With Moving Average Residuals
Abstract:
The Whittle likelihood estimation (WLE) has played a fundamental role in
the development of both theory and computation of time series analysis.
However, WLE is only applicable to models whose theoretical spectral
density function (SDF) is known up to the parameters in the models. In
this article, we propose a residual-based WLE, called extended WLE (XWLE),
which can estimate models with their SDFs only partially available,
including many popular time series models with correlated residuals.
Asymptotic properties of XWLE are established. In particular, XWLE is
asymptotically equivalent to WLE in estimating linear ARMA models, and is
also capable of estimating nonlinear AR models with MA residuals and even
with exogenous variables. The finite-sample performance of XWLE is
examined through simulated examples and a real data analysis.
Journal: Journal of the American Statistical Association
Pages: 1083-1099
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.946513
File-URL: http://hdl.handle.net/10.1080/01621459.2014.946513
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1083-1099
Template-Type: ReDIF-Article 1.0
Author-Name: Graciela Boente
Author-X-Name-First: Graciela
Author-X-Name-Last: Boente
Author-Name: Matías Salibian-Barrera
Author-X-Name-First: Matías
Author-X-Name-Last: Salibian-Barrera
Title: S-Estimators for Functional Principal Component Analysis
Abstract:
Principal component analysis is a widely used technique that provides an
optimal lower-dimensional approximation to multivariate or functional
datasets. These approximations can be very useful in identifying potential
outliers among high-dimensional or functional observations. In this
article, we propose a new class of estimators for principal components
based on robust scale estimators. For a fixed dimension
q, we robustly estimate the
q-dimensional linear space that provides the best
prediction for the data, in the sense of minimizing the sum of robust
scale estimators of the coordinates of the residuals. We also study an
extension to the infinite-dimensional case. Our method is consistent for
elliptical random vectors, and is Fisher consistent for elliptically
distributed random elements on arbitrary Hilbert spaces. Numerical
experiments show that our proposal is highly competitive when compared
with other methods. We illustrate our approach on a real dataset, where
the robust estimator discovers atypical observations that would have been
missed otherwise. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 1100-1111
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.946991
File-URL: http://hdl.handle.net/10.1080/01621459.2014.946991
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1100-1111
Template-Type: ReDIF-Article 1.0
Author-Name: Jian Zhu
Author-X-Name-First: Jian
Author-X-Name-Last: Zhu
Author-Name: Trivellore E. Raghunathan
Author-X-Name-First: Trivellore E.
Author-X-Name-Last: Raghunathan
Title: Convergence Properties of a Sequential Regression Multiple Imputation Algorithm
Abstract:
A sequential regression or chained equations imputation approach uses a
Gibbs sampling-type iterative algorithm that imputes the missing values
using a sequence of conditional regression models. It is a flexible
approach for handling different types of variables and complex data
structures. Many simulation studies have shown that the multiple
imputation inferences based on this procedure have desirable repeated
sampling properties. However, a theoretical weakness of this approach is
that the specification of a set of conditional regression models may not
be compatible with a joint distribution of the variables being imputed.
Hence, the convergence properties of the iterative algorithm are not well
understood. This article develops conditions for convergence and assesses
the properties of inferences from both compatible and incompatible
sequence of regression models. The results are established for the missing
data pattern where each subject may be missing a value on at most one
variable. The sequence of regression models is assumed to provide an
empirically good fit for the data, as chosen by the imputer based on appropriate model
diagnostics. The results are used to develop criteria for the choice of
regression models. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 1112-1124
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.948117
File-URL: http://hdl.handle.net/10.1080/01621459.2014.948117
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1112-1124
Template-Type: ReDIF-Article 1.0
Author-Name: Hua Yun Chen
Author-X-Name-First: Hua Yun
Author-X-Name-Last: Chen
Author-Name: Daniel E. Rader
Author-X-Name-First: Daniel E.
Author-X-Name-Last: Rader
Author-Name: Mingyao Li
Author-X-Name-First: Mingyao
Author-X-Name-Last: Li
Title: Likelihood Inferences on Semiparametric Odds Ratio Model
Abstract:
A flexible semiparametric odds ratio model has been proposed to unify and
to extend both the log-linear model and the joint normal model for data
with a mix of discrete and continuous variables. The semiparametric odds
ratio model is particularly useful for analyzing biased sampling designs.
However, statistical inference of the model has not been systematically
studied when more than one nonparametric component is involved in the
model. In this article, we study the maximum semiparametric likelihood
approach to estimation and inference of the semiparametric odds ratio
model. We show that the maximum semiparametric likelihood estimator of the
odds ratio parameter is consistent and asymptotically normally
distributed. We also establish statistical inference under a misspecified
semiparametric odds ratio model, which is important when handling weak
identifiability in conditionally specified models under biased sampling
designs. We use simulation studies to demonstrate that the proposed
approaches have satisfactory finite sample performance. Finally, we
illustrate the proposed approach by analyzing multiple traits in a
genome-wide association study of high-density lipid protein. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1125-1135
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.948544
File-URL: http://hdl.handle.net/10.1080/01621459.2014.948544
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1125-1135
Template-Type: ReDIF-Article 1.0
Author-Name: Jiming Jiang
Author-X-Name-First: Jiming
Author-X-Name-Last: Jiang
Author-Name: Thuan Nguyen
Author-X-Name-First: Thuan
Author-X-Name-Last: Nguyen
Author-Name: J. Sunil Rao
Author-X-Name-First: J. Sunil
Author-X-Name-Last: Rao
Title: The E-MS Algorithm: Model Selection With Incomplete Data
Abstract:
We propose a procedure associated with the idea of the E-M algorithm for
model selection in the presence of missing data. The idea extends the
concept of parameters to include both the model and the parameters under
the model, and thus allows the model to be part of the E-M iterations. We
develop the procedure, known as the E-MS algorithm, under the assumption
that the class of candidate models is finite. Some special cases of the
procedure are considered, including E-MS with the generalized information
criteria (GIC), and E-MS with the adaptive fence (AF; Jiang et al.). We
prove numerical convergence of the E-MS algorithm as well as consistency
in model selection of the limiting model of the E-MS convergence, for E-MS
with GIC and E-MS with AF. We study the impact on model selection of
different missing data mechanisms. Furthermore, we carry out extensive
simulation studies on the finite-sample performance of the E-MS with
comparisons to other procedures. The methodology is also illustrated on a
real data analysis involving QTL mapping for an agricultural study on
barley grains. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 1136-1147
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.948545
File-URL: http://hdl.handle.net/10.1080/01621459.2014.948545
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1136-1147
Template-Type: ReDIF-Article 1.0
Author-Name: Deng Pan
Author-X-Name-First: Deng
Author-X-Name-Last: Pan
Author-Name: Haijin He
Author-X-Name-First: Haijin
Author-X-Name-Last: He
Author-Name: Xinyuan Song
Author-X-Name-First: Xinyuan
Author-X-Name-Last: Song
Author-Name: Liuquan Sun
Author-X-Name-First: Liuquan
Author-X-Name-Last: Sun
Title: Regression Analysis of Additive Hazards Model With Latent Variables
Abstract:
We propose an additive hazards model with latent variables to investigate
the observed and latent risk factors of the failure time of interest. Each
latent risk factor is characterized by correlated observed variables
through a confirmatory factor analysis model. We develop a hybrid
procedure that combines the expectation-maximization (EM) algorithm and
the borrow-strength estimation approach to estimate the model parameters.
We establish the consistency and asymptotic normality of the parameter
estimators. Various nice features, including finite sample performance of
the proposed methodology, are demonstrated by simulation studies. Our
model is applied to a study concerning the risk factors of chronic kidney
disease for Type 2 diabetic patients. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 1148-1159
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.950083
File-URL: http://hdl.handle.net/10.1080/01621459.2014.950083
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1148-1159
Template-Type: ReDIF-Article 1.0
Author-Name: Yumou Qiu
Author-X-Name-First: Yumou
Author-X-Name-Last: Qiu
Author-Name: Song Xi Chen
Author-X-Name-First: Song Xi
Author-X-Name-Last: Chen
Title: Bandwidth Selection for High-Dimensional Covariance Matrix Estimation
Abstract:
The banding estimator of Bickel and Levina and its tapering version of
Cai, Zhang, and Zhou are important high-dimensional covariance estimators.
Both estimators require a bandwidth parameter. We propose a bandwidth
selector for the banding estimator by minimizing an empirical estimate of
the expected squared Frobenius norms of the estimation error matrix. The
ratio consistency of the bandwidth selector is established. We provide a
lower bound for the coverage probability of the underlying bandwidth being
contained in an interval around the bandwidth estimate. Extensions to the
bandwidth selection for the tapering estimator and threshold level
selection for the thresholding covariance estimator are made. Numerical
simulations and a case study on sonar spectrum data are conducted to
demonstrate the proposed approaches. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 1160-1174
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.950375
File-URL: http://hdl.handle.net/10.1080/01621459.2014.950375
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1160-1174
Template-Type: ReDIF-Article 1.0
Author-Name: Chun Yip Yau
Author-X-Name-First: Chun Yip
Author-X-Name-Last: Yau
Author-Name: Chong Man Tang
Author-X-Name-First: Chong Man
Author-X-Name-Last: Tang
Author-Name: Thomas C. M. Lee
Author-X-Name-First: Thomas C. M.
Author-X-Name-Last: Lee
Title: Estimation of Multiple-Regime Threshold Autoregressive Models With Structural Breaks
Abstract:
The threshold autoregressive (TAR) model is a class of nonlinear time
series models that have been widely used in many areas. Due to its
nonlinear nature, one major difficulty in fitting a TAR model is the
estimation of the thresholds. As a first contribution, this article
develops an automatic procedure to estimate the number and values of the
thresholds, as well as the corresponding AR order and parameter values in
each regime. These parameter estimates are defined as the minimizers of an
objective function derived from the minimum description length (MDL)
principle. A genetic algorithm (GA) is constructed to efficiently solve
the associated minimization problem. The second contribution of this
article is the extension of this framework to piecewise TAR modeling; that
is, the time series is partitioned into different segments for which each
segment can be adequately modeled by a TAR model, while models from
adjacent segments are different. For such piecewise TAR modeling, a
procedure is developed to estimate the number and locations of the
breakpoints, together with all other parameters in each segment. Desirable
theoretical results are derived to lend support to the proposed
methodology. Simulation experiments and an application to U.S. GNP data
are used to illustrate the empirical performance of the methodology.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1175-1186
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.954706
File-URL: http://hdl.handle.net/10.1080/01621459.2014.954706
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1175-1186
Template-Type: ReDIF-Article 1.0
Author-Name: Hongyuan Cao
Author-X-Name-First: Hongyuan
Author-X-Name-Last: Cao
Author-Name: Mathew M. Churpek
Author-X-Name-First: Mathew M.
Author-X-Name-Last: Churpek
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: Jason P. Fine
Author-X-Name-First: Jason P.
Author-X-Name-Last: Fine
Title: Analysis of the Proportional Hazards Model With Sparse Longitudinal Covariates
Abstract:
Regression analysis of censored failure observations via the proportional
hazards model permits time-varying covariates that are observed at death
times. In practice, such longitudinal covariates are typically sparse and
only measured at infrequent and irregularly spaced follow-up times. Full
likelihood analyses of joint models for longitudinal and survival data
impose stringent modeling assumptions that are difficult to verify in
practice and that are complicated both inferentially and computationally.
In this article, a simple kernel weighted score function is proposed with
minimal assumptions. Two scenarios are considered: half kernel estimation
in which observation ceases at the time of the event and full kernel
estimation for data where observation may continue after the event, as
with recurrent events data. It is established that these estimators are
consistent and asymptotically normal. However, they converge at rates that
are slower than the parametric rates that may be achieved with fully
observed covariates, with the full kernel method achieving an optimal
convergence rate that is superior to that of the half kernel method.
Simulation results demonstrate that the large sample approximations are
adequate for practical use and may yield improved performance relative to
the last-value-carried-forward approach and the joint modeling method. The
analysis of the data from a cardiac arrest study demonstrates the utility
of the proposed methods. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 1187-1196
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.957289
File-URL: http://hdl.handle.net/10.1080/01621459.2014.957289
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1187-1196
Template-Type: ReDIF-Article 1.0
Author-Name: Claudia Kirch
Author-X-Name-First: Claudia
Author-X-Name-Last: Kirch
Author-Name: Birte Muhsal
Author-X-Name-First: Birte
Author-X-Name-Last: Muhsal
Author-Name: Hernando Ombao
Author-X-Name-First: Hernando
Author-X-Name-Last: Ombao
Title: Detection of Changes in Multivariate Time Series With Application to EEG Data
Abstract:
The primary contributions of this article are rigorously developed novel
statistical methods for detecting change points in multivariate time
series. We extend the class of score type change point statistics
considered in 2007 by Hušková, Prášková, and Steinebach to
the vector autoregressive (VAR) case and the epidemic change alternative.
Our proposed procedures do not require the observed time series to
actually follow the VAR model. Instead, following the strategy implicitly
employed by practitioners, our approach takes model misspecification into
account so that our detection procedure uses the model background merely
for feature extraction. We derive the asymptotic distributions of our test
statistics and show that our procedure has asymptotic power of 1. The
proposed test statistics require the estimation of the inverse of the
long-run covariance matrix which is particularly difficult in
higher-dimensional settings (i.e., where the dimension of the time series
and the dimension of the parameter vector are both large). Thus we
robustify the proposed test statistics and investigate their finite sample
properties via extensive numerical experiments. Finally, we apply our
procedure to electroencephalograms and demonstrate its potential impact in
identifying change points in complex brain processes during a cognitive
motor task.
Journal: Journal of the American Statistical Association
Pages: 1197-1216
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.957545
File-URL: http://hdl.handle.net/10.1080/01621459.2014.957545
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1197-1216
Template-Type: ReDIF-Article 1.0
Author-Name: David Azriel
Author-X-Name-First: David
Author-X-Name-Last: Azriel
Author-Name: Armin Schwartzman
Author-X-Name-First: Armin
Author-X-Name-Last: Schwartzman
Title: The Empirical Distribution of a Large Number of Correlated Normal Variables
Abstract:
Motivated by the advent of high-dimensional, highly correlated data, this
work studies the limit behavior of the empirical cumulative distribution
function (ecdf) of standard normal random variables under arbitrary
correlation. First, we provide a necessary and sufficient condition for
convergence of the ecdf to the standard normal distribution. Next, under
general correlation, we show that the ecdf limit is a random, possibly
infinite, mixture of normal distribution functions that depends on a
number of latent variables and can serve as an asymptotic approximation to
the ecdf in high dimensions. We provide conditions under which the
dimension of the ecdf limit, defined as the smallest number of effective
latent variables, is finite. Estimates of the latent variables are
provided and their consistency proved. We demonstrate these methods in a
real high-dimensional data example from brain imaging where it is shown
that, while the study exhibits apparently strongly significant results,
they can be entirely explained by correlation, as captured by the
asymptotic approximation developed here. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 1217-1228
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.958156
File-URL: http://hdl.handle.net/10.1080/01621459.2014.958156
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1217-1228
Template-Type: ReDIF-Article 1.0
Author-Name: Xanthi Pedeli
Author-X-Name-First: Xanthi
Author-X-Name-Last: Pedeli
Author-Name: Anthony C. Davison
Author-X-Name-First: Anthony C.
Author-X-Name-Last: Davison
Author-Name: Konstantinos Fokianos
Author-X-Name-First: Konstantinos
Author-X-Name-Last: Fokianos
Title: Likelihood Estimation for the INAR(p) Model by Saddlepoint Approximation
Abstract:
Saddlepoint techniques have been used successfully in many applications,
owing to the high accuracy with which they can approximate intractable
densities and tail probabilities. This article concerns their use for the
estimation of high-order integer-valued autoregressive,
INAR(p), processes. Conditional least squares estimation
and maximum likelihood estimation have been proposed for
INAR(p) models, but the first is inefficient for
estimating parametric models, and the second becomes difficult to
implement as the order p increases. We propose a simple
saddlepoint approximation to the log-likelihood that performs well even in
the tails of the distribution and with complicated INAR models. We
consider Poisson and negative binomial innovations, and show empirically
that the estimator that maximises the saddlepoint approximation behaves
very similarly to the maximum likelihood estimator in realistic settings.
The approach is applied to data on meningococcal disease counts.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1229-1238
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.983230
File-URL: http://hdl.handle.net/10.1080/01621459.2014.983230
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1229-1238
Template-Type: ReDIF-Article 1.0
Author-Name: Lo-Bin Chang
Author-X-Name-First: Lo-Bin
Author-X-Name-Last: Chang
Author-Name: Donald Geman
Author-X-Name-First: Donald
Author-X-Name-Last: Geman
Title: Tracking Cross-Validated Estimates of Prediction Error as Studies Accumulate
Abstract:
In recent years, "reproducibility" has emerged as a key factor in
evaluating applications of statistics to the biomedical sciences, for
example, learning predictors of disease phenotypes from high-throughput
"omics" data. In particular, "validation" is undermined when error rates
on newly acquired data are sharply higher than those originally reported.
More precisely, when data are collected from m "studies"
representing possibly different subphenotypes, more generally different
mixtures of subphenotypes, the error rates in cross-study validation (CSV)
are observed to be larger than those obtained in ordinary randomized
cross-validation (RCV), although the "gap" seems to close as
m increases. Whereas these findings are hardly surprising
for a heterogeneous underlying population, this discrepancy is then seen as
a barrier to translational research. We provide a statistical formulation
in the large-sample limit: studies themselves are modeled as components of
a mixture and all error rates are optimal (Bayes) for a two-class problem.
Our results cohere with the trends observed in practice and suggest what
is likely to be observed with large samples and consistent density
estimators, namely, that the CSV error rate exceeds the RCV error rates
for any m, the latter (appropriately averaged) increases
with m, and both converge to the optimal rate for the
whole population.
Journal: Journal of the American Statistical Association
Pages: 1239-1247
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.1002926
File-URL: http://hdl.handle.net/10.1080/01621459.2014.1002926
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1239-1247
Template-Type: ReDIF-Article 1.0
Author-Name: Faming Liang
Author-X-Name-First: Faming
Author-X-Name-Last: Liang
Author-Name: Qifan Song
Author-X-Name-First: Qifan
Author-X-Name-Last: Song
Author-Name: Peihua Qiu
Author-X-Name-First: Peihua
Author-X-Name-Last: Qiu
Title: An Equivalent Measure of Partial Correlation Coefficients for High-Dimensional Gaussian Graphical Models
Abstract:
Gaussian graphical models (GGMs) are frequently used to explore networks,
such as gene regulatory networks, among a set of variables. Under the
classical theory of GGMs, the construction of Gaussian graphical networks
amounts to finding the pairs of variables with nonzero partial correlation
coefficients. However, this is infeasible for high-dimensional problems
for which the number of variables is larger than the sample size. In this
article, we propose a new measure of partial correlation coefficient,
which is evaluated with a reduced conditional set and thus feasible for
high-dimensional problems. Under the Markov property and adjacency
faithfulness conditions, the new measure of partial correlation
coefficient is equivalent to the true partial correlation coefficient in
construction of Gaussian graphical networks. Based on the new measure of
partial correlation coefficient, we propose a multiple hypothesis
test-based method for the construction of Gaussian graphical networks.
Furthermore, we establish the consistency of the proposed method under
mild conditions. The proposed method outperforms the existing methods,
such as the PC, graphical Lasso, nodewise regression, and
qp-average methods, especially for the problems for which
a large number of indirect associations are present. The proposed method
has a computational complexity of nearly O(p^2), and is flexible in data
integration, network comparison, and covariate adjustment.
Journal: Journal of the American Statistical Association
Pages: 1248-1265
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2015.1012391
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1012391
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1248-1265
Template-Type: ReDIF-Article 1.0
Author-Name: Kehui Chen
Author-X-Name-First: Kehui
Author-X-Name-Last: Chen
Author-Name: Jing Lei
Author-X-Name-First: Jing
Author-X-Name-Last: Lei
Title: Localized Functional Principal Component Analysis
Abstract:
We propose localized functional principal component analysis (LFPCA),
looking for orthogonal basis functions with localized support regions that
explain most of the variability of a random process. The LFPCA is
formulated as a convex optimization problem through a novel deflated
Fantope localization method and is implemented through an efficient
algorithm to obtain the global optimum. We prove that the proposed LFPCA
converges to the original functional principal component analysis (FPCA)
when the tuning parameters are chosen appropriately. Simulation shows that
the proposed LFPCA with tuning parameters chosen by cross-validation can
almost perfectly recover the true eigenfunctions and significantly improve
the estimation accuracy when the eigenfunctions are truly supported on
some subdomains. In the scenario that the original eigenfunctions are not
localized, the proposed LFPCA also serves as a nice tool in finding
orthogonal basis functions that balance between interpretability and the
capability of explaining variability of the data. The analysis of
country mortality data reveals interesting features that cannot be found by
standard FPCA methods. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 1266-1275
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2015.1016225
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1016225
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1266-1275
Template-Type: ReDIF-Article 1.0
Author-Name: Jan De Neve
Author-X-Name-First: Jan
Author-X-Name-Last: De Neve
Author-Name: Olivier Thas
Author-X-Name-First: Olivier
Author-X-Name-Last: Thas
Title: A Regression Framework for Rank Tests Based on the Probabilistic Index Model
Abstract:
We demonstrate how many classical rank tests, such as the
Wilcoxon-Mann-Whitney, Kruskal-Wallis, and Friedman tests, can be embedded
in a statistical modeling framework and how the method can be used to
construct new rank tests. In addition to hypothesis testing, the method
allows for estimating effect sizes with an informative interpretation,
resulting in a better understanding of the data. Supplementary materials
for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1276-1283
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2015.1016226
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1016226
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1276-1283
Template-Type: ReDIF-Article 1.0
Author-Name: Tucker McElroy
Author-X-Name-First: Tucker
Author-X-Name-Last: McElroy
Author-Name: Brian Monsell
Author-X-Name-First: Brian
Author-X-Name-Last: Monsell
Title: Model Estimation, Prediction, and Signal Extraction for Nonstationary Stock and Flow Time Series Observed at Mixed Frequencies
Abstract:
An important practical problem for statistical agencies and central banks
that publish economic data is the seasonal adjustment of mixed frequency
stock and flow time series. This may arise in practice due to changes in
funding of a particular survey. Mathematically, the problem can be reduced
to the need to compute imputations, forecasts, and backcasts from a given
model of the highest available frequency data. The nonstationarity of the
economic time series coupled with the alteration of sampling frequency
makes the problem of model estimation and imputation challenging. For flow
data the analysis cannot be recast as a missing value problem, so that
time series imputation methods are ineffective. We provide explicit
formulas and algorithms that allow one to compute the log Gaussian
likelihood of the mixed sample, as well as any imputations and forecasts.
Formulas for the relevant mean squared error are also derived. We evaluate
the methodology through simulations, and illustrate the techniques on some
economic time series.
Journal: Journal of the American Statistical Association
Pages: 1284-1303
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2014.978452
File-URL: http://hdl.handle.net/10.1080/01621459.2014.978452
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1284-1303
Template-Type: ReDIF-Article 1.0
Author-Name: Graeme Blair
Author-X-Name-First: Graeme
Author-X-Name-Last: Blair
Author-Name: Kosuke Imai
Author-X-Name-First: Kosuke
Author-X-Name-Last: Imai
Author-Name: Yang-Yang Zhou
Author-X-Name-First: Yang-Yang
Author-X-Name-Last: Zhou
Title: Design and Analysis of the Randomized Response Technique
Abstract:
About a half century ago, in 1965, Warner proposed the randomized response
method as a survey technique to reduce potential bias due to nonresponse
and social desirability when asking questions about sensitive behaviors
and beliefs. This method asks respondents to use a randomization device,
such as a coin flip, whose outcome is unobserved by the interviewer. By
introducing random noise, the method conceals individual responses and
protects respondent privacy. While numerous methodological advances have
been made, we find surprisingly few applications of this promising survey
technique. In this article, we address this gap by (1) reviewing standard
designs available to applied researchers, (2) developing various
multivariate regression techniques for substantive analyses, (3) proposing
power analyses to help improve research designs, (4) presenting new robust
designs that are based on less stringent assumptions than those of the
standard designs, and (5) making all described methods available through
open-source software. We illustrate some of these methods with an original
survey about militant groups in Nigeria.
Journal: Journal of the American Statistical Association
Pages: 1304-1319
Issue: 511
Volume: 110
Year: 2015
Month: 9
X-DOI: 10.1080/01621459.2015.1050028
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1050028
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:511:p:1304-1319
Template-Type: ReDIF-Article 1.0
Author-Name: David Morganstein
Author-X-Name-First: David
Author-X-Name-Last: Morganstein
Title: Statistics: Making Better Decisions
Journal: Journal of the American Statistical Association
Pages: 1325-1330
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1106790
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1106790
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1325-1330
Template-Type: ReDIF-Article 1.0
Author-Name: Joshua D. Angrist
Author-X-Name-First: Joshua D.
Author-X-Name-Last: Angrist
Author-Name: Miikka Rokkanen
Author-X-Name-First: Miikka
Author-X-Name-Last: Rokkanen
Title: Wanna Get Away? Regression Discontinuity Estimation of Exam School Effects Away From the Cutoff
Abstract:
In regression discontinuity (RD) studies exploiting an award or admissions
cutoff, causal effects are nonparametrically identified for those near the
cutoff. The effect of treatment on inframarginal applicants is also of
interest, but identification of such effects requires stronger assumptions
than those required for identification at the cutoff. This article
discusses RD identification and estimation away from the cutoff. Our
identification strategy exploits the availability of dependent variable
predictors other than the running variable. Conditional on these
predictors, the running variable is assumed to be ignorable. This
identification strategy is used to study effects of Boston exam schools
for inframarginal applicants. Identification based on the conditional
independence assumptions imposed in our framework yields reasonably
precise and surprisingly robust estimates of the effects of exam school
attendance on inframarginal applicants. These estimates suggest that the
causal effects of exam school attendance for 9th grade applicants with
running variable values well away from admissions cutoffs differ little
from those for applicants with values that put them on the margin of
acceptance. An extension to fuzzy designs is shown to identify causal
effects for compliers away from the cutoff. Supplementary materials for
this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1331-1344
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1012259
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1012259
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1331-1344
Template-Type: ReDIF-Article 1.0
Author-Name: Michael G. Hudgens
Author-X-Name-First: Michael G.
Author-X-Name-Last: Hudgens
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1345-1347
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1033058
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1033058
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1345-1347
Template-Type: ReDIF-Article 1.0
Author-Name: Thomas Lemieux
Author-X-Name-First: Thomas
Author-X-Name-Last: Lemieux
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1347-1348
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1054490
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1054490
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1347-1348
Template-Type: ReDIF-Article 1.0
Author-Name: Joshua D. Angrist
Author-X-Name-First: Joshua D.
Author-X-Name-Last: Angrist
Author-Name: Miikka Rokkanen
Author-X-Name-First: Miikka
Author-X-Name-Last: Rokkanen
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 1348-1349
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1106189
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1106189
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1348-1349
Template-Type: ReDIF-Article 1.0
Author-Name: Bo Jiang
Author-X-Name-First: Bo
Author-X-Name-Last: Jiang
Author-Name: Jun S. Liu
Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu
Title: Bayesian Partition Models for Identifying Expression Quantitative Trait Loci
Abstract:
Expression quantitative trait loci (eQTLs) are genomic locations
associated with changes of expression levels of certain genes. By assaying
gene expressions and genetic variations simultaneously on a genome-wide
scale, scientists wish to discover genomic loci responsible for expression
variations of a set of genes. The task can be viewed as a multivariate
regression problem with variable selection on both responses (gene
expression) and covariates (genetic variations), including also multi-way
interactions among covariates. Instead of learning a predictive model of
quantitative trait given combinations of genetic markers, we adopt an
inverse modeling perspective to model the distribution of genetic markers
conditional on gene expression traits. A particular strength of our method
is its ability to detect interactive effects of genetic variations with
high power even when their marginal effects are weak, addressing a key
weakness of many existing eQTL mapping methods. Furthermore, we introduce
a hierarchical model to capture the dependence structure among correlated
genes. Through simulation studies and a real data example in yeast, we
demonstrate how our Bayesian hierarchical partition model achieves a
significantly improved power in detecting eQTLs compared to existing
methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1350-1361
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1049746
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1049746
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1350-1361
Template-Type: ReDIF-Article 1.0
Author-Name: Liangliang Wang
Author-X-Name-First: Liangliang
Author-X-Name-Last: Wang
Author-Name: Alexandre Bouchard-Côté
Author-X-Name-First: Alexandre
Author-X-Name-Last: Bouchard-Côté
Author-Name: Arnaud Doucet
Author-X-Name-First: Arnaud
Author-X-Name-Last: Doucet
Title: Bayesian Phylogenetic Inference Using a Combinatorial Sequential Monte Carlo Method
Abstract:
The application of Bayesian methods to large-scale phylogenetics problems
is increasingly limited by computational issues, motivating the
development of methods that can complement existing Markov chain Monte
Carlo (MCMC) schemes. Sequential Monte Carlo (SMC) methods are approximate
inference algorithms that have become very popular for time series models.
Such methods have been recently developed to address phylogenetic
inference problems but currently available techniques are only applicable
to a restricted class of phylogenetic tree models compared to MCMC. In
this article, we propose an original combinatorial SMC (CSMC) method to
approximate posterior phylogenetic tree distributions, which is applicable
to a general class of models and can be easily combined with MCMC to infer
evolutionary parameters. Our method only relies on the existence of a
flexible partially ordered set structure and is more generally applicable
to sampling problems on combinatorial spaces. We demonstrate that the
proposed CSMC algorithm provides consistent estimates under weak
assumptions, is computationally fast, and is additionally easily
parallelizable. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 1362-1374
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1054487
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1054487
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1362-1374
Template-Type: ReDIF-Article 1.0
Author-Name: Jian Zhang
Author-X-Name-First: Jian
Author-X-Name-Last: Zhang
Author-Name: Li Su
Author-X-Name-First: Li
Author-X-Name-Last: Su
Title: Temporal Autocorrelation-Based Beamforming With MEG Neuroimaging Data
Abstract:
Characterizing the brain source activity using magnetoencephalography
(MEG) requires solving an ill-posed inverse problem. Most source
reconstruction procedures are performed in terms of power comparison.
However, in the presence of voxel-specific noises, the direct power
analysis can be misleading due to the power distortion as suggested by our
multiple trial MEG study on a face-perception experiment. To tackle the
issue, we propose a temporal autocorrelation-based method for the above
analysis. The new method improves the face-perception analysis and
identifies several differences between neuronal responses to face and
scrambled-face stimuli. By the simulated and real data analyses, we
demonstrate that, compared to the existing methods, the new proposal can be
more robust to voxel-specific noises without compromising its accuracy
in source localization. We further establish the consistency for
estimating the proposed index when the number of sensors and the number of
time instants are sufficiently large. In particular, we show that the
proposed procedure can make a better focus on true sources than its
precedents in terms of peak segregation coefficient. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1375-1388
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1054488
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1054488
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1375-1388
Template-Type: ReDIF-Article 1.0
Author-Name: Paul R. Rosenbaum
Author-X-Name-First: Paul R.
Author-X-Name-Last: Rosenbaum
Title: Some Counterclaims Undermine Themselves in Observational Studies
Abstract:
Claims based on observational studies that a treatment has certain effects
are often met with counterclaims asserting that the treatment is without
effect and that the observed associations are produced by biased treatment
assignment.
Some counterclaims undermine themselves in the following specific sense:
presuming the counterclaim to be true may strengthen the support that the
original data provide for the original claim, so that the counterclaim
fails in its role as a critique of the original claim. In mathematics, a
proof by contradiction supposes a proposition to be true en route to
proving that the proposition is false. Analogously, the supposition that a
particular counterclaim is true may justify an otherwise unjustified
statistical analysis, and this added analysis may interpret the original
data as providing even stronger support for the original claim. More
precisely, the original study is sensitive to unmeasured biases of a
particular magnitude, but an analysis that supposes the counterclaim to be
true may be insensitive to much larger unmeasured biases. The issues are
illustrated using data from the U.S. Fatal Accident Reporting System.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1389-1398
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1054489
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1054489
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1389-1398
Template-Type: ReDIF-Article 1.0
Author-Name: G. O. Mohler
Author-X-Name-First: G. O.
Author-X-Name-Last: Mohler
Author-Name: M. B. Short
Author-X-Name-First: M. B.
Author-X-Name-Last: Short
Author-Name: Sean Malinowski
Author-X-Name-First: Sean
Author-X-Name-Last: Malinowski
Author-Name: Mark Johnson
Author-X-Name-First: Mark
Author-X-Name-Last: Johnson
Author-Name: G. E. Tita
Author-X-Name-First: G. E.
Author-X-Name-Last: Tita
Author-Name: Andrea L. Bertozzi
Author-X-Name-First: Andrea L.
Author-X-Name-Last: Bertozzi
Author-Name: P. J. Brantingham
Author-X-Name-First: P. J.
Author-X-Name-Last: Brantingham
Title: Randomized Controlled Field Trials of Predictive Policing
Abstract:
The concentration of police resources in stable crime hotspots has proven
effective in reducing crime, but the extent to which police can disrupt
dynamically changing crime hotspots is unknown. Police must be able to
anticipate the future location of dynamic hotspots to disrupt them. Here
we report results of two randomized controlled trials of near real-time
epidemic-type aftershock sequence (ETAS) crime forecasting, one trial
within three divisions of the Los Angeles Police Department and the other
trial within two divisions of the Kent Police Department (United Kingdom).
We investigate the extent to which (i) ETAS models of short-term crime
risk outperform existing best practice of hotspot maps produced by
dedicated crime analysts, (ii) police officers in the field can
dynamically patrol predicted hotspots given limited resources, and (iii)
crime can be reduced by predictive policing algorithms under realistic law
enforcement resource constraints. While previous hotspot policing
experiments fix treatment and control hotspots throughout the experimental
period, we use a novel experimental design to allow treatment and control
hotspots to change dynamically over the course of the experiment. Our
results show that ETAS models predict 1.4--2.2 times as much crime as a
dedicated crime analyst using existing criminal intelligence and hotspot
mapping practice. Police patrols using ETAS forecasts led to
an average 7.4% reduction in crime volume as a function of patrol time,
whereas patrols based upon analyst predictions showed no significant
effect. Dynamic police patrol in response to ETAS crime forecasts can
disrupt opportunities for crime and lead to real crime reductions.
Journal: Journal of the American Statistical Association
Pages: 1399-1411
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1077710
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1077710
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1399-1411
Template-Type: ReDIF-Article 1.0
Author-Name: Kari Lock Morgan
Author-X-Name-First: Kari Lock
Author-X-Name-Last: Morgan
Author-Name: Donald B. Rubin
Author-X-Name-First: Donald B.
Author-X-Name-Last: Rubin
Title: Rerandomization to Balance Tiers of Covariates
Abstract:
When conducting a randomized experiment, if an allocation yields treatment
groups that differ meaningfully with respect to relevant covariates,
groups should be rerandomized. The process involves specifying an explicit
criterion for whether an allocation is acceptable, based on a measure of
covariate balance, and rerandomizing units until an acceptable allocation
is obtained. Here, we illustrate how rerandomization could have improved
the design of an already conducted randomized experiment on vocabulary and
mathematics training programs, then provide a rerandomization procedure
for covariates that vary in importance, and finally offer other extensions
for rerandomization, including methods addressing computational
efficiency. When covariates vary in a priori importance, better balance
should be required for more important covariates. Rerandomization based on
Mahalanobis distance preserves the joint distribution of covariates, but
balances all covariates equally. Here, we propose rerandomizing based on
Mahalanobis distance within tiers of covariate importance. Because
balancing covariates in one tier will in general also partially balance
covariates in other tiers, for each subsequent tier we explicitly balance
only the components orthogonal to covariates in more important tiers.
Journal: Journal of the American Statistical Association
Pages: 1412-1421
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1079528
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1079528
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1412-1421
Template-Type: ReDIF-Article 1.0
Author-Name: Ian W. McKeague
Author-X-Name-First: Ian W.
Author-X-Name-Last: McKeague
Author-Name: Min Qian
Author-X-Name-First: Min
Author-X-Name-Last: Qian
Title: An Adaptive Resampling Test for Detecting the Presence of Significant Predictors
Abstract:
This article investigates marginal screening for detecting the presence of
significant predictors in high-dimensional regression. Screening large
numbers of predictors is a challenging problem due to the nonstandard
limiting behavior of post-model-selected estimators. There is a common
misconception that the oracle property for such estimators is a panacea,
but the oracle property only holds away from the null hypothesis of
interest in marginal screening. To address this difficulty, we propose an
adaptive resampling test (ART). Our approach provides an alternative to
the popular (yet conservative) Bonferroni method of controlling
family-wise error rates. ART is adaptive in the sense that thresholding is
used to decide whether the centered percentile bootstrap applies, and
otherwise adapts to the nonstandard asymptotics in the tightest way
possible. The performance of the approach is evaluated using a simulation
study and applied to gene expression data and HIV drug resistance data.
Journal: Journal of the American Statistical Association
Pages: 1422-1433
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1095099
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1095099
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1422-1433
Template-Type: ReDIF-Article 1.0
Author-Name: A. Chatterjee
Author-X-Name-First: A.
Author-X-Name-Last: Chatterjee
Author-Name: S. N. Lahiri
Author-X-Name-First: S. N.
Author-X-Name-Last: Lahiri
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1434-1438
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1102143
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1102143
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1434-1438
Template-Type: ReDIF-Article 1.0
Author-Name: Rajen D. Shah
Author-X-Name-First: Rajen D.
Author-X-Name-Last: Shah
Author-Name: Richard J. Samworth
Author-X-Name-First: Richard J.
Author-X-Name-Last: Samworth
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1439-1442
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1102142
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1102142
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1439-1442
Template-Type: ReDIF-Article 1.0
Author-Name: Emre Barut
Author-X-Name-First: Emre
Author-X-Name-Last: Barut
Author-Name: Huixia Judy Wang
Author-X-Name-First: Huixia Judy
Author-X-Name-Last: Wang
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1442-1445
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1100619
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1100619
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1442-1445
Template-Type: ReDIF-Article 1.0
Author-Name: Lawrence D. Brown
Author-X-Name-First: Lawrence D.
Author-X-Name-Last: Brown
Author-Name: Daniel McCarthy
Author-X-Name-First: Daniel
Author-X-Name-Last: McCarthy
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1446-1449
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1099536
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1099536
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1446-1449
Template-Type: ReDIF-Article 1.0
Author-Name: Alexandre Belloni
Author-X-Name-First: Alexandre
Author-X-Name-Last: Belloni
Author-Name: Victor Chernozhukov
Author-X-Name-First: Victor
Author-X-Name-Last: Chernozhukov
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1449-1451
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1098545
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1098545
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1449-1451
Template-Type: ReDIF-Article 1.0
Author-Name: Yichi Zhang
Author-X-Name-First: Yichi
Author-X-Name-Last: Zhang
Author-Name: Eric B. Laber
Author-X-Name-First: Eric B.
Author-X-Name-Last: Laber
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1451-1454
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1106403
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1106403
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1451-1454
Template-Type: ReDIF-Article 1.0
Author-Name: Sai Li
Author-X-Name-First: Sai
Author-X-Name-Last: Li
Author-Name: Ritwik Mitra
Author-X-Name-First: Ritwik
Author-X-Name-Last: Mitra
Author-Name: Cun-Hui Zhang
Author-X-Name-First: Cun-Hui
Author-X-Name-Last: Zhang
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1455-1456
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1106404
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1106404
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1455-1456
Template-Type: ReDIF-Article 1.0
Author-Name: Hannes Leeb
Author-X-Name-First: Hannes
Author-X-Name-Last: Leeb
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1457-1459
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1109516
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1109516
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1457-1459
Template-Type: ReDIF-Article 1.0
Author-Name: Ian W. McKeague
Author-X-Name-First: Ian W.
Author-X-Name-Last: McKeague
Author-Name: Min Qian
Author-X-Name-First: Min
Author-X-Name-Last: Qian
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 1459-1462
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1107431
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1107431
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1459-1462
Template-Type: ReDIF-Article 1.0
Author-Name: Stephen S. M. Lee
Author-X-Name-First: Stephen S. M.
Author-X-Name-Last: Lee
Author-Name: Mehdi Soleymani
Author-X-Name-First: Mehdi
Author-X-Name-Last: Soleymani
Title: A Simple Formula for Mixing Estimators With Different Convergence Rates
Abstract:
Suppose that two estimators are available for estimating an unknown
parameter θ, and are known to have convergence rates n^{-1/2} and
r_n = o(n^{-1/2}), respectively, based on a sample of size n. Typically,
the more efficient n^{-1/2}-rate estimator is less robust than the
slower one, and a definitive choice cannot be easily made between them
under practical circumstances. We propose a simple mixture estimator, in
the form of a linear combination of the two estimators, which
successfully reaps the benefits of both. We prove that the mixture
estimator possesses a kind of oracle property so that it captures the
fast n^{-1/2} convergence rate of the efficient estimator when conditions
are favorable, and is at least r_n-consistent otherwise. Applications of
the mixture estimator are illustrated with examples drawn from different
problem settings including orthogonal function regression, local
polynomial regression, density derivative estimation, and bootstrap
inferences for possibly dependent data.
Journal: Journal of the American Statistical Association
Pages: 1463-1478
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.960966
File-URL: http://hdl.handle.net/10.1080/01621459.2014.960966
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1463-1478
Template-Type: ReDIF-Article 1.0
Author-Name: Anirban Bhattacharya
Author-X-Name-First: Anirban
Author-X-Name-Last: Bhattacharya
Author-Name: Debdeep Pati
Author-X-Name-First: Debdeep
Author-X-Name-Last: Pati
Author-Name: Natesh S. Pillai
Author-X-Name-First: Natesh S.
Author-X-Name-Last: Pillai
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Dirichlet--Laplace Priors for Optimal Shrinkage
Abstract:
Penalized regression methods, such as L1
regularization, are routinely used in high-dimensional applications, and
there is a rich literature on optimality properties under sparsity
assumptions. In the Bayesian paradigm, sparsity is routinely induced
through two-component mixture priors having a probability mass at zero,
but such priors encounter daunting computational problems in high
dimensions. This has motivated continuous shrinkage priors, which can be
expressed as global-local scale mixtures of Gaussians, facilitating
computation. In contrast to the frequentist literature, little is known
about the properties of such priors and the convergence and concentration
of the corresponding posterior distribution. In this article, we propose a
new class of Dirichlet--Laplace priors, which possess optimal posterior
concentration and lead to efficient posterior computation. Finite sample
performance of Dirichlet--Laplace priors relative to alternatives is
assessed in simulated and real data examples.
Journal: Journal of the American Statistical Association
Pages: 1479-1490
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.960967
File-URL: http://hdl.handle.net/10.1080/01621459.2014.960967
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1479-1490
Template-Type: ReDIF-Article 1.0
Author-Name: Weizhen Wang
Author-X-Name-First: Weizhen
Author-X-Name-Last: Wang
Title: Exact Optimal Confidence Intervals for Hypergeometric Parameters
Abstract:
For a hypergeometric distribution, where N is the population size, M is
the number of population units with some attribute, and n is the given
sample size, there are two parametric cases: (i) N is unknown and M is
given; (ii) M is unknown and N is given. For each
case, we first show that the minimum coverage probability of commonly used
approximate intervals is much smaller than the nominal level for any
n, then we provide exact smallest lower and upper
one-sided confidence intervals and an exact admissible two-sided
confidence interval, a complete set of solutions, for each parameter.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1491-1499
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.966191
File-URL: http://hdl.handle.net/10.1080/01621459.2014.966191
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1491-1499
Template-Type: ReDIF-Article 1.0
Author-Name: Rajarshi Guhaniyogi
Author-X-Name-First: Rajarshi
Author-X-Name-Last: Guhaniyogi
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Bayesian Compressed Regression
Abstract:
As an alternative to variable selection or shrinkage in high-dimensional
regression, we propose to randomly compress the predictors prior to
analysis. This dramatically reduces storage and computational bottlenecks,
performing well when the predictors can be projected to a low-dimensional
linear subspace with minimal loss of information about the response. As
opposed to existing Bayesian dimensionality reduction approaches, the
exact posterior distribution conditional on the compressed data is
available analytically, speeding up computation by many orders of
magnitude while also bypassing robustness issues due to convergence and
mixing problems with MCMC. Model averaging is used to reduce sensitivity
to the random projection matrix, while accommodating uncertainty in the
subspace dimension. Strong theoretical support is provided for the
approach by showing near parametric convergence rates for the predictive
density in the large p small n
asymptotic paradigm. Practical performance relative to competitors is
illustrated in simulations and real data applications.
Journal: Journal of the American Statistical Association
Pages: 1500-1514
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.969425
File-URL: http://hdl.handle.net/10.1080/01621459.2014.969425
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1500-1514
Template-Type: ReDIF-Article 1.0
Author-Name: Zifang Guo
Author-X-Name-First: Zifang
Author-X-Name-Last: Guo
Author-Name: Lexin Li
Author-X-Name-First: Lexin
Author-X-Name-Last: Li
Author-Name: Wenbin Lu
Author-X-Name-First: Wenbin
Author-X-Name-Last: Lu
Author-Name: Bing Li
Author-X-Name-First: Bing
Author-X-Name-Last: Li
Title: Groupwise Dimension Reduction via Envelope Method
Abstract:
The family of sufficient dimension reduction (SDR) methods that produce
informative combinations of predictors, or indices, are particularly
useful for high-dimensional regression analysis. In many such analyses,
it is increasingly common that a priori subject knowledge of the
predictors is available; for example, they belong to different groups.
While many recent SDR proposals have greatly expanded the scope of the
methods’ applicability, how to effectively incorporate the prior
predictor structure information remains a challenge. In this article, we
aim at dimension reduction that recovers full regression information while
preserving the predictor group structure. Built upon a new concept of the
direct sum envelope, we introduce a systematic way to incorporate the
group information in most existing SDR estimators. As a result, the
reduction outcomes are much easier to interpret. Moreover, the envelope
method provides a principled way to build a variety of prior structures
into dimension reduction analysis. Both simulations and real data analysis
demonstrate the competent numerical performance of the new method.
Journal: Journal of the American Statistical Association
Pages: 1515-1527
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.970687
File-URL: http://hdl.handle.net/10.1080/01621459.2014.970687
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1515-1527
Template-Type: ReDIF-Article 1.0
Author-Name: Antonio F. Galvao
Author-X-Name-First: Antonio F.
Author-X-Name-Last: Galvao
Author-Name: Liang Wang
Author-X-Name-First: Liang
Author-X-Name-Last: Wang
Title: Uniformly Semiparametric Efficient Estimation of Treatment Effects With a Continuous Treatment
Abstract:
This article studies identification, estimation, and inference of general
unconditional treatment effects models with continuous treatment under the
ignorability assumption. We show identification of the parameters of
interest, the dose--response functions, under the assumption that
selection to treatment is based on observables. We propose a
semiparametric two-step estimator, and consider estimation of the
dose--response functions through moment restriction models with
generalized residual functions that are possibly nonsmooth. This general
formulation includes average and quantile treatment effects as special
cases. The asymptotic properties of the estimator are derived, namely,
uniform consistency, weak convergence, and semiparametric efficiency. We
also develop statistical inference procedures and establish the validity
of a bootstrap approach to implement these methods in practice. Monte
Carlo simulations show that the proposed methods have good finite sample
properties. Finally, we apply the proposed methods to estimate the
unconditional average and quantile effects of mothers’ weight gain
and age on birthweight. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 1528-1542
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.978005
File-URL: http://hdl.handle.net/10.1080/01621459.2014.978005
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1528-1542
Template-Type: ReDIF-Article 1.0
Author-Name: J. Marcus Jobe
Author-X-Name-First: J. Marcus
Author-X-Name-Last: Jobe
Author-Name: Michael Pokojovy
Author-X-Name-First: Michael
Author-X-Name-Last: Pokojovy
Title: A Cluster-Based Outlier Detection Scheme for Multivariate Data
Abstract:
Detection power of the squared Mahalanobis distance statistic is
significantly reduced when several outliers exist within a multivariate
dataset of interest. To overcome this masking effect, we propose a
computer-intensive cluster-based approach that incorporates a reweighted
version of Rousseeuw’s minimum covariance determinant method with a
multi-step cluster-based algorithm that initially filters out potential
masking points. Compared to the most robust procedures, simulation studies
show that our new method is better for outlier detection. Additional real
data comparisons are given. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 1543-1551
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.983231
File-URL: http://hdl.handle.net/10.1080/01621459.2014.983231
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1543-1551
Template-Type: ReDIF-Article 1.0
Author-Name: Ryan Martin
Author-X-Name-First: Ryan
Author-X-Name-Last: Martin
Title: Plausibility Functions and Exact Frequentist Inference
Abstract:
In the frequentist program, inferential methods with exact control on
error rates are a primary focus. The standard approach, however, is to
rely on asymptotic approximations, which may not be suitable. This article
presents a general framework for the construction of exact frequentist
procedures based on plausibility functions. It is shown that the
plausibility function-based tests and confidence regions have the desired
frequentist properties in finite samples—no large-sample
justification needed. An extension of the proposed method is also given
for problems involving nuisance parameters. Examples demonstrate that the
plausibility function-based method is both exact and efficient in a wide
variety of problems.
Journal: Journal of the American Statistical Association
Pages: 1552-1561
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.983232
File-URL: http://hdl.handle.net/10.1080/01621459.2014.983232
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1552-1561
Template-Type: ReDIF-Article 1.0
Author-Name: Jing Zhou
Author-X-Name-First: Jing
Author-X-Name-Last: Zhou
Author-Name: Anirban Bhattacharya
Author-X-Name-First: Anirban
Author-X-Name-Last: Bhattacharya
Author-Name: Amy H. Herring
Author-X-Name-First: Amy H.
Author-X-Name-Last: Herring
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Bayesian Factorizations of Big Sparse Tensors
Abstract:
It has become routine to collect data that are structured as multiway
arrays (tensors). There is an enormous literature on low rank and sparse
matrix factorizations, but limited consideration of extensions to the
tensor case in statistics. The most common low rank tensor factorization
relies on parallel factor analysis (PARAFAC), which expresses a rank
k tensor as a sum of rank-one tensors. In contingency
table applications in which the sample size is far smaller than the
number of cells in the table, the low rank assumption is not sufficient
and PARAFAC has poor performance. We induce an additional layer of
dimension reduction by allowing the effective rank to vary across
dimensions of the table. Taking a Bayesian approach, we place priors on
terms in the factorization and develop an efficient Gibbs sampler for
posterior computation. Theory is provided showing posterior concentration
rates in high-dimensional settings, and the methods are shown to have
excellent performance in simulations and several real data applications.
Journal: Journal of the American Statistical Association
Pages: 1562-1576
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.983233
File-URL: http://hdl.handle.net/10.1080/01621459.2014.983233
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1562-1576
Template-Type: ReDIF-Article 1.0
Author-Name: Jiwei Zhao
Author-X-Name-First: Jiwei
Author-X-Name-Last: Zhao
Author-Name: Jun Shao
Author-X-Name-First: Jun
Author-X-Name-Last: Shao
Title: Semiparametric Pseudo-Likelihoods in Generalized Linear Models With Nonignorable Missing Data
Abstract:
We consider identifiability and estimation in a generalized linear model
in which the response variable and some covariates have missing values and
the missing data mechanism is nonignorable and unspecified. We adopt a
pseudo-likelihood approach that makes use of an instrumental variable to
help identify unknown parameters in the presence of nonignorable
missing data. Explicit conditions for the identifiability of parameters
are given. Some asymptotic properties of the parameter estimators based on
maximizing the pseudo-likelihood are established. Explicit asymptotic
covariance matrix and its estimator are also derived in some cases. For
the numerical maximization of the pseudo-likelihood, we develop a two-step
iteration algorithm that decomposes a nonconcave maximization problem into
two problems of maximizing concave functions. Some simulation results and
an application to a dataset from cotton factory workers are also
presented.
Journal: Journal of the American Statistical Association
Pages: 1577-1590
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.983234
File-URL: http://hdl.handle.net/10.1080/01621459.2014.983234
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1577-1590
Template-Type: ReDIF-Article 1.0
Author-Name: Laurent E. Calvet
Author-X-Name-First: Laurent E.
Author-X-Name-Last: Calvet
Author-Name: Veronika Czellar
Author-X-Name-First: Veronika
Author-X-Name-Last: Czellar
Author-Name: Elvezio Ronchetti
Author-X-Name-First: Elvezio
Author-X-Name-Last: Ronchetti
Title: Robust Filtering
Abstract:
Filtering methods are powerful tools to estimate the hidden state of a
state-space model from observations available in real time. However, they
are known to be highly sensitive to the presence of small
misspecifications of the underlying model and to outliers in the
observation process. In this article, we show that the methodology of
robust statistics can be adapted to sequential filtering. We define a
filter as being robust if the relative error in the state distribution
caused by misspecifications is uniformly bounded by a linear function of
the perturbation size. Since standard filters are nonrobust even in the
simplest cases, we propose robustified filters which provide accurate
state inference in the presence of model misspecifications. The robust
particle filter naturally mitigates the degeneracy problems that plague
the bootstrap particle filter (Gordon, Salmond, and Smith) and its many
extensions. We illustrate the good properties of robust filters in linear
and nonlinear state-space examples. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 1591-1606
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.983520
File-URL: http://hdl.handle.net/10.1080/01621459.2014.983520
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1591-1606
Template-Type: ReDIF-Article 1.0
Author-Name: Qifan Song
Author-X-Name-First: Qifan
Author-X-Name-Last: Song
Author-Name: Faming Liang
Author-X-Name-First: Faming
Author-X-Name-Last: Liang
Title: High-Dimensional Variable Selection With Reciprocal L1-Regularization
Abstract:
During the past decade, penalized likelihood methods have been widely used
in variable selection problems, where the penalty functions are typically
symmetric about 0, continuous and nondecreasing in (0, ∞). We
propose a new penalized likelihood method, reciprocal Lasso (or in short,
rLasso), based on a new class of penalty functions that are decreasing in
(0, ∞), discontinuous at 0, and converge to infinity when the
coefficients approach zero. The new penalty functions assign infinite
penalties to nearly zero coefficients; in contrast, the conventional penalty
functions assign such coefficients nearly zero penalties (e.g., Lasso
and smoothly clipped absolute deviation [SCAD]) or constant penalties
(e.g., L0 penalty). This distinguishing
feature makes rLasso very attractive for variable selection, as it
effectively avoids selecting overly dense models. We establish the
consistency of the rLasso for variable selection and coefficient
estimation under both the low- and high-dimensional settings. Since the
rLasso penalty functions induce an objective function with multiple local
minima, we also propose an efficient Monte Carlo optimization algorithm to
solve the involved minimization problem. Our simulation results show that
the rLasso outperforms other popular penalized likelihood methods, such as
Lasso, SCAD, minimax concave penalty, sure independence screening,
iterative sure independence screening, and extended Bayesian information
criterion. It can produce sparser and more accurate coefficient estimates,
and recover the true model with higher probability. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1607-1620
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.984812
File-URL: http://hdl.handle.net/10.1080/01621459.2014.984812
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1607-1620
Template-Type: ReDIF-Article 1.0
Author-Name: Ryan Martin
Author-X-Name-First: Ryan
Author-X-Name-Last: Martin
Author-Name: Chuanhai Liu
Author-X-Name-First: Chuanhai
Author-X-Name-Last: Liu
Title: Marginal Inferential Models: Prior-Free Probabilistic Inference on Interest Parameters
Abstract:
The inferential models (IM) framework provides prior-free,
frequency-calibrated, and posterior probabilistic inference. The key is
the use of random sets to predict unobservable auxiliary variables
connected to the observable data and unknown parameters. When nuisance
parameters are present, a marginalization step can reduce the dimension of
the auxiliary variable which, in turn, leads to more efficient inference.
For regular problems, exact marginalization can be achieved, and we give
conditions for marginal IM validity. We show that our approach provides
exact and efficient marginal inference in several challenging problems,
including a many-normal-means problem. In nonregular problems, we propose
a generalized marginalization technique and prove its validity. Details
are given for two benchmark examples, namely, the Behrens--Fisher and
gamma mean problems.
Journal: Journal of the American Statistical Association
Pages: 1621-1631
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.985827
File-URL: http://hdl.handle.net/10.1080/01621459.2014.985827
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1621-1631
Template-Type: ReDIF-Article 1.0
Author-Name: Hiroyuki Kasahara
Author-X-Name-First: Hiroyuki
Author-X-Name-Last: Kasahara
Author-Name: Katsumi Shimotsu
Author-X-Name-First: Katsumi
Author-X-Name-Last: Shimotsu
Title: Testing the Number of Components in Normal Mixture Regression Models
Abstract:
Testing the number of components in finite normal mixture models is a
long-standing challenge because of its nonregularity. This article studies
likelihood-based testing of the number of components in normal mixture
regression models with heteroscedastic components. We construct a
likelihood-based test of the null hypothesis of
m0 components against the alternative
hypothesis of m0 + 1 components for any
m0. The null asymptotic distribution of the
proposed modified EM test statistic is the maximum of
m0 random variables that can be easily
simulated. The simulations show that the proposed test has very good
finite sample size and power properties. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 1632-1645
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.986272
File-URL: http://hdl.handle.net/10.1080/01621459.2014.986272
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1632-1645
Template-Type: ReDIF-Article 1.0
Author-Name: Daniel K. Sewell
Author-X-Name-First: Daniel K.
Author-X-Name-Last: Sewell
Author-Name: Yuguo Chen
Author-X-Name-First: Yuguo
Author-X-Name-Last: Chen
Title: Latent Space Models for Dynamic Networks
Abstract:
Dynamic networks are used in a variety of fields to represent the
structure and evolution of the relationships between entities. We present
a model which embeds longitudinal network data as trajectories in a latent
Euclidean space. We propose a Markov chain Monte Carlo (MCMC) algorithm to
estimate the model parameters and latent positions of the actors in the
network. The model yields meaningful visualization of dynamic networks,
giving the researcher insight into the evolution and the structure, both
local and global, of the network. The model handles directed or undirected
edges, accommodates missing edges, and lends itself well to predicting
future edges. Further, a novel approach is given to detect and visualize
an attracting influence between actors using only the edge information. We
use the case-control likelihood approximation to speed up the estimation
algorithm, modifying it slightly to account for missing data. We apply the
latent space model to data collected from a Dutch classroom, and a
cosponsorship network collected on members of the U.S. House of
Representatives, illustrating the usefulness of the model by providing
insights into the networks. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 1646-1657
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.988214
File-URL: http://hdl.handle.net/10.1080/01621459.2014.988214
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1646-1657
Template-Type: ReDIF-Article 1.0
Author-Name: Lan Wang
Author-X-Name-First: Lan
Author-X-Name-Last: Wang
Author-Name: Bo Peng
Author-X-Name-First: Bo
Author-X-Name-Last: Peng
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Title: A High-Dimensional Nonparametric Multivariate Test for Mean Vector
Abstract:
This work is concerned with testing the population mean vector of
nonnormal high-dimensional multivariate data. Several tests for
high-dimensional mean vector, based on modifying the classical Hotelling
T-super-2 test, have been proposed in the literature.
Despite their usefulness, they tend to have unsatisfactory power
performance for heavy-tailed multivariate data, which frequently arise in
genomics and quantitative finance. This article proposes a novel
high-dimensional nonparametric test for the population mean vector for a
general class of multivariate distributions. With the aid of new tools in
modern probability theory, we prove that the limiting null distribution
of the proposed test is normal under mild conditions when
p is substantially larger than n. We
further study the local power of the proposed test and compare its
relative efficiency with a modified Hotelling T-super-2
test for high-dimensional data. An interesting finding is that the newly
proposed test can have even more substantial power gain with large
p than the traditional nonparametric multivariate test
does with finite fixed p. We study the finite sample
performance of the proposed test via Monte Carlo simulations. We further
illustrate its application by an empirical analysis of a genomics dataset.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1658-1669
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.988215
File-URL: http://hdl.handle.net/10.1080/01621459.2014.988215
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1658-1669
Template-Type: ReDIF-Article 1.0
Author-Name: Yuanshan Wu
Author-X-Name-First: Yuanshan
Author-X-Name-Last: Wu
Author-Name: Yanyuan Ma
Author-X-Name-First: Yanyuan
Author-X-Name-Last: Ma
Author-Name: Guosheng Yin
Author-X-Name-First: Guosheng
Author-X-Name-Last: Yin
Title: Smoothed and Corrected Score Approach to Censored Quantile Regression With Measurement Errors
Abstract:
Censored quantile regression is an important alternative to the Cox
proportional hazards model in survival analysis. In contrast to the usual
central covariate effects, quantile regression can effectively
characterize the covariate effects at different quantiles of the survival
time. When covariates are measured with errors, it is known that naively
treating mismeasured covariates as error-free would result in estimation
bias. Under censored quantile regression, we propose smoothed and
corrected estimating equations to obtain consistent estimators. We
establish consistency and asymptotic normality for the proposed estimators
of quantile regression coefficients. Compared with the naive estimator,
the proposed method can eliminate the estimation bias under various
measurement error distributions and model error distributions. We conduct
simulation studies to examine the finite-sample properties of the new
method and apply it to a lung cancer study. Supplementary materials for
this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1670-1683
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.989323
File-URL: http://hdl.handle.net/10.1080/01621459.2014.989323
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1670-1683
Template-Type: ReDIF-Article 1.0
Author-Name: Tyler H. McCormick
Author-X-Name-First: Tyler H.
Author-X-Name-Last: McCormick
Author-Name: Tian Zheng
Author-X-Name-First: Tian
Author-X-Name-Last: Zheng
Title: Latent Surface Models for Networks Using Aggregated Relational Data
Abstract:
Despite increased interest across a range of scientific applications in
modeling and understanding social network structure, collecting complete
network data remains logistically and financially challenging, especially
in the social sciences. This article introduces a latent surface
representation of social network structure for partially observed network
data. We derive a multivariate measure of expected (latent) distance
between an observed actor and unobserved actors with given features. We
also draw novel parallels between our work and dependent data in spatial
and ecological statistics. We demonstrate the contribution of our model
using a random digit-dial telephone survey and a multiyear prospective
study of the relationship between network structure and the spread of
infectious disease. The model proposed here is related to previous network
models that represent high-dimensional structure through a projection to
a low-dimensional latent geometric surface, encoding dependence as distance
in the space. We develop a latent surface model for cases when complete
network data are unavailable. We focus specifically on aggregated
relational data (ARD) which measure network structure indirectly by asking
respondents how many connections they have with members of a certain
subpopulation (e.g., How many individuals do you know who are HIV
positive?) and are easily added to existing surveys. Instead of
conditioning on the (latent) distance between two members of the network,
the latent surface model for ARD conditions on the expected distance
between a survey respondent and the center of a subpopulation on a latent
manifold surface. A spherical latent surface and angular distance across
the sphere’s surface facilitate tractable computation of this
expectation. This model estimates relative homogeneity between groups in
the population and variation in the propensity for interaction between
respondents and group members. The model also estimates features of groups
that are difficult to reach using standard surveys (e.g., the homeless).
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1684-1695
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.991395
File-URL: http://hdl.handle.net/10.1080/01621459.2014.991395
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1684-1695
Template-Type: ReDIF-Article 1.0
Author-Name: Jin Xu
Author-X-Name-First: Jin
Author-X-Name-Last: Xu
Author-Name: Jiajie Chen
Author-X-Name-First: Jiajie
Author-X-Name-Last: Chen
Author-Name: Peter Z. G. Qian
Author-X-Name-First: Peter Z. G.
Author-X-Name-Last: Qian
Title: Sequentially Refined Latin Hypercube Designs: Reusing Every Point
Abstract:
The use of iteratively enlarged Latin hypercube designs for running
computer experiments has recently gained popularity in practice. This
approach conducts an initial experiment with a computer code using a Latin
hypercube design and then runs a follow-up experiment with additional runs
elaborately chosen so that the combined design set for the two experiments
forms a larger Latin hypercube design. This augmenting process can be
repeated over multiple stages, where in each stage the augmented design set is
guaranteed to be a Latin hypercube design. We provide a theoretical
framework to put this approach on a firm footing. Numerical examples are
given to corroborate the derived theoretical results. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1696-1706
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.993078
File-URL: http://hdl.handle.net/10.1080/01621459.2014.993078
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1696-1706
Template-Type: ReDIF-Article 1.0
Author-Name: Noah Simon
Author-X-Name-First: Noah
Author-X-Name-Last: Simon
Author-Name: Robert Tibshirani
Author-X-Name-First: Robert
Author-X-Name-Last: Tibshirani
Title: A Permutation Approach to Testing Interactions for Binary Response by Comparing Correlations Between Classes
Abstract:
To date, testing interactions in high dimensions remains a challenging task.
Existing methods often have issues with sensitivity to modeling
assumptions and with nominal p-values that are heavily asymptotic. To
help alleviate these issues, we propose a permutation-based method for
testing marginal interactions with a binary response. Our method searches
for pairwise correlations that differ between classes. In this article, we
compare our method on real and simulated data to the standard approach of
running many pairwise logistic models. On simulated data our method finds
more significant interactions at a lower false discovery rate (especially
in the presence of main effects). On real genomic data, although there is
no gold standard, our method finds apparent signal and tells a believable
story, while logistic regression does not. We also give asymptotic
consistency results under not too restrictive assumptions. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1707-1716
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.993079
File-URL: http://hdl.handle.net/10.1080/01621459.2014.993079
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1707-1716
Template-Type: ReDIF-Article 1.0
Author-Name: Wenxin Jiang
Author-X-Name-First: Wenxin
Author-X-Name-Last: Jiang
Author-Name: Yu Zhao
Author-X-Name-First: Yu
Author-X-Name-Last: Zhao
Title: On Asymptotic Distributions and Confidence Intervals for LIFT Measures in Data Mining
Abstract:
A LIFT measure, such as the response rate, lift, or the percentage of
captured response, is a fundamental measure of effectiveness for a scoring
rule obtained from data mining, which is estimated from a set of
validation data. In this article, we study how to construct confidence
intervals of the LIFT measures. We point out the subtlety of this task and
explain how simple binomial confidence intervals can have incorrect
coverage probabilities, due to omitting variation from the sample
percentile of the scoring rule. We derive the asymptotic distribution
using some advanced empirical process theory and the functional delta
method in the Appendix. The additional variation is shown to be related to
a conditional mean response, which can be estimated by a local averaging
of the responses over the scores from the validation data. Alternatively,
a subsampling method is shown to provide a valid confidence interval,
without needing to estimate the conditional mean response. Numerical
experiments are conducted to compare these different methods regarding the
coverage probabilities and the lengths of the resulting confidence
intervals.
Journal: Journal of the American Statistical Association
Pages: 1717-1725
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.993080
File-URL: http://hdl.handle.net/10.1080/01621459.2014.993080
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1717-1725
Template-Type: ReDIF-Article 1.0
Author-Name: Xueqin Wang
Author-X-Name-First: Xueqin
Author-X-Name-Last: Wang
Author-Name: Wenliang Pan
Author-X-Name-First: Wenliang
Author-X-Name-Last: Pan
Author-Name: Wenhao Hu
Author-X-Name-First: Wenhao
Author-X-Name-Last: Hu
Author-Name: Yuan Tian
Author-X-Name-First: Yuan
Author-X-Name-Last: Tian
Author-Name: Heping Zhang
Author-X-Name-First: Heping
Author-X-Name-Last: Zhang
Title: Conditional Distance Correlation
Abstract:
Statistical inference on conditional dependence is essential in many
fields including genetic association studies and graphical models. The
classic measures focus on linear conditional correlations and are
incapable of characterizing nonlinear conditional relationship including
nonmonotonic relationship. To overcome this limitation, we introduce a
nonparametric measure of conditional dependence for multivariate random
variables with arbitrary dimensions. Our measure possesses the necessary
and intuitive properties as a correlation index. Briefly, it is zero
almost surely if and only if two multivariate random variables are
conditionally independent given a third random variable. More importantly,
the sample version of this measure can be expressed elegantly as the root
of a V or U-process with random kernels and has desirable theoretical
properties. Based on the sample version, we propose a test for conditional
independence, which is proven to be more powerful than some recently
developed tests through our numerical simulations. The advantage of our
test is even greater when the relationship between the multivariate random
variables given the third random variable cannot be expressed in a linear
or monotonic function of one random variable versus the other. We also
show that the sample measure is consistent and weakly convergent, and the
test statistic is asymptotically normal. By applying our test in a real
data analysis, we are able to identify two conditionally associated gene
expressions, which otherwise cannot be revealed. Thus, our measure of
conditional dependence is not only conceptually appealing, but also of
important practical utility. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 1726-1734
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2014.993081
File-URL: http://hdl.handle.net/10.1080/01621459.2014.993081
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1726-1734
Template-Type: ReDIF-Article 1.0
Author-Name: Gauri Sankar Datta
Author-X-Name-First: Gauri Sankar
Author-X-Name-Last: Datta
Author-Name: Abhyuday Mandal
Author-X-Name-First: Abhyuday
Author-X-Name-Last: Mandal
Title: Small Area Estimation With Uncertain Random Effects
Abstract:
Random effects models play an important role in model-based small area
estimation. Random effects account for any lack of fit of a regression
model for the population means of small areas on a set of explanatory
variables. In a recent article, Datta, Hall, and Mandal showed that if the
random effects can be dispensed with via a suitable test, then the model
parameters and the small area means may be estimated with substantially
higher accuracy. The work of Datta, Hall, and Mandal is most useful when
the number of small areas, m, is moderately large. For
large m, the null hypothesis of no random effects will
likely be rejected. Rejection of the null hypothesis is usually caused by
a few large residuals signifying a departure of the direct estimator from
the synthetic regression estimator. As a flexible alternative to the
Fay--Herriot random effects model and the approach in Datta, Hall, and
Mandal, in this article we consider a mixture model for random effects. It
is reasonably expected that small areas with population means explained
adequately by covariates have little model error, and the other areas with
means not adequately explained by covariates will require a random
component added to the regression model. This model is a useful
alternative to the usual random effects model: the data determine the
extent of lack of fit of the regression model for a particular small area
and include a random effect only if needed. Unlike the Datta, Hall, and Mandal
approach which recommends excluding random effects from all small areas if
a test of null hypothesis of no random effects is not rejected, the
present model is more flexible. We used this mixture model to estimate
poverty ratios for related children aged 5--17 for the 50 U.S. states
and Washington, DC. This application is motivated by the SAIPE project of
the U.S. Census Bureau. We empirically evaluated the accuracy of the
direct estimates and the estimates obtained from our mixture model and the
Fay--Herriot random effects model. These empirical evaluations and a
simulation study, in conjunction with a lower posterior variance of the
new estimates, show that the new estimates are more accurate than both the
frequentist and the Bayes estimates resulting from the standard
Fay--Herriot model. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 1735-1744
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1016526
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1016526
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1735-1744
Template-Type: ReDIF-Article 1.0
Author-Name: Brigham R. Frandsen
Author-X-Name-First: Brigham R.
Author-X-Name-Last: Frandsen
Title: Treatment Effects With Censoring and Endogeneity
Abstract:
This article develops a nonparametric approach to identification and
estimation of treatment effects on censored outcomes when treatment may be
endogenous and have arbitrarily heterogenous effects. Identification is
based on an instrumental variable that satisfies the exclusion and
monotonicity conditions standard in the local average treatment effects
framework. The article proposes a censored quantile treatment effects
estimator, derives its asymptotic distribution, and illustrates its
performance using Monte Carlo simulations. Even in the exogenous case, the
estimator performs better in finite samples than existing censored
quantile regression estimators, and performs nearly as well as maximum
likelihood estimators in cases where their distributional assumptions
hold. An empirical application to a subsidized job training program finds
that participation significantly and dramatically reduced the duration of
jobless spells, especially at the right tail of the distribution.
Journal: Journal of the American Statistical Association
Pages: 1745-1752
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1017577
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1017577
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1745-1752
Template-Type: ReDIF-Article 1.0
Author-Name: Sebastian Calonico
Author-X-Name-First: Sebastian
Author-X-Name-Last: Calonico
Author-Name: Matias D. Cattaneo
Author-X-Name-First: Matias D.
Author-X-Name-Last: Cattaneo
Author-Name: Rocío Titiunik
Author-X-Name-First: Rocío
Author-X-Name-Last: Titiunik
Title: Optimal Data-Driven Regression Discontinuity Plots
Abstract:
Exploratory data analysis plays a central role in applied statistics and
econometrics. In the popular regression-discontinuity (RD) design, the use
of graphical analysis has been strongly advocated because it provides both
easy presentation and transparent validation of the design. RD plots are
nowadays widely used in applications, despite their formal properties being
unknown: these plots are typically presented employing ad hoc choices of
tuning parameters, which makes these procedures less automatic and more
subjective. In this article, we formally study the most common RD plot
based on an evenly spaced binning of the data, and propose several
(optimal) data-driven choices for the number of bins depending on the goal
of the researcher. These RD plots are constructed either to approximate
the underlying unknown regression functions without imposing smoothness in
the estimator, or to approximate the underlying variability of the raw
data while smoothing out the otherwise uninformative scatterplot of the
data. In addition, we introduce an alternative RD plot based on quantile
spaced binning, study its formal properties, and propose similar (optimal)
data-driven choices for the number of bins. The main proposed data-driven
selectors employ spacings estimators, which are simple and easy to
implement in applications because they do not require additional choices
of tuning parameters. Altogether, our results offer an array of
alternative RD plots that are objective and automatic when implemented,
providing a reliable benchmark for graphical analysis in RD designs. We
illustrate the performance of our automatic RD plots using several
empirical examples and a Monte Carlo study. All results are readily
available in R and STATA using the software packages described in
Calonico, Cattaneo, and Titiunik. Supplementary materials for this article
are available online.
Journal: Journal of the American Statistical Association
Pages: 1753-1769
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1017578
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1017578
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1753-1769
Template-Type: ReDIF-Article 1.0
Author-Name: Ruoqing Zhu
Author-X-Name-First: Ruoqing
Author-X-Name-Last: Zhu
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: Michael R. Kosorok
Author-X-Name-First: Michael R.
Author-X-Name-Last: Kosorok
Title: Reinforcement Learning Trees
Abstract:
In this article, we introduce a new type of tree-based method,
reinforcement learning trees (RLT), which exhibits significantly improved
performance over traditional methods such as random forests (Breiman 2001)
under high-dimensional settings. The innovations are three-fold. First,
the new method implements reinforcement learning at each selection of a
splitting variable during the tree construction processes. By splitting on
the variable that brings the greatest future improvement in later splits,
rather than choosing the one with largest marginal effect from the
immediate split, the constructed tree uses the available samples in a more
efficient way. Moreover, such an approach enables linear combination cuts
at little extra computational cost. Second, we propose a variable muting
procedure that progressively eliminates noise variables during the
construction of each individual tree. The muting procedure also takes
advantage of reinforcement learning and prevents noise variables from
being considered in the search for splitting rules, so that toward
terminal nodes, where the sample size is small, the splitting rules are
still constructed from only strong variables. Last, we investigate
asymptotic properties of the proposed method under basic assumptions and
discuss rationale in general settings. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 1770-1784
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1036994
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1036994
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1770-1784
Template-Type: ReDIF-Article 1.0
Author-Name: Aurore Delaigle
Author-X-Name-First: Aurore
Author-X-Name-Last: Delaigle
Author-Name: Wen-Xin Zhou
Author-X-Name-First: Wen-Xin
Author-X-Name-Last: Zhou
Title: Nonparametric and Parametric Estimators of Prevalence From Group Testing Data With Aggregated Covariates
Abstract:
Group testing is a technique employed in large screening studies involving
infectious disease, where individuals in the study are grouped before
being observed. Parametric and nonparametric estimators of conditional
prevalence have been developed in the group testing literature, in the
case where the binary variable indicating the disease status is available
only for the group, but the explanatory variable is observed for each
individual. However, for reasons such as the high cost of assays, the
confidentiality of the patients, or the impossibility of measuring a
concentration under a detection limit, the explanatory variable is
observable only in an aggregated form and the existing techniques are no
longer valid. We develop consistent parametric and nonparametric
estimators of the conditional prevalence in this complex problem. We
establish theoretical properties of our estimators and illustrate their
practical performance on simulated and real data. We extend our techniques
to the case where the group status is measured imperfectly, and to the
setting where the covariate is aggregated and the individual status is
available. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1785-1796
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1054491
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1054491
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1785-1796
Template-Type: ReDIF-Article 1.0
Author-Name: Xiaofeng Shao
Author-X-Name-First: Xiaofeng
Author-X-Name-Last: Shao
Title: Self-Normalization for Time Series: A Review of Recent Developments
Abstract:
This article reviews some recent developments on the inference of time
series data using the self-normalized approach. We aim to provide a
detailed discussion about the use of self-normalization in different
contexts and highlight the distinctive features associated with each problem
and connections among these recent developments. The topics covered
include: confidence interval construction for a parameter in a weakly
dependent stationary time series setting, change point detection in the
mean, robust inference in regression models with weakly dependent errors,
inference for nonparametric time series regression, inference for long
memory time series, locally stationary time series and near-integrated
time series, change point detection, and two-sample inference for
functional time series, as well as the use of self-normalization for
spatial data and spatial-temporal data. Some new variations of the
self-normalized approach are also introduced with additional simulation
results. We also provide a brief review of related inferential methods,
such as blockwise empirical likelihood and subsampling, which were
recently developed under the fixed-b asymptotic
framework. We conclude the article with a summary of merits and
limitations of self-normalization in the time series context and potential
topics for future investigation.
Journal: Journal of the American Statistical Association
Pages: 1797-1817
Issue: 512
Volume: 110
Year: 2015
Month: 12
X-DOI: 10.1080/01621459.2015.1050493
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1050493
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:110:y:2015:i:512:p:1797-1817
Template-Type: ReDIF-Article 1.0
Author-Name: Ci-Ren Jiang
Author-X-Name-First: Ci-Ren
Author-X-Name-Last: Jiang
Author-Name: John A. D. Aston
Author-X-Name-First: John A. D.
Author-X-Name-Last: Aston
Author-Name: Jane-Ling Wang
Author-X-Name-First: Jane-Ling
Author-X-Name-Last: Wang
Title: A Functional Approach to Deconvolve Dynamic Neuroimaging Data
Abstract:
Positron emission tomography (PET) is an imaging technique which can be
used to investigate chemical changes in human biological processes such as
cancer development or neurochemical reactions. Most dynamic PET scans are
currently analyzed based on the assumption that linear first-order
kinetics can be used to adequately describe the system under observation.
However, there has recently been strong evidence that this is not the
case. To provide an analysis of PET data that is free from this
compartmental assumption, we propose a nonparametric deconvolution and
analysis model for dynamic PET data based on functional principal
component analysis. This yields flexibility in the possible deconvolved
functions while still performing well when a linear compartmental model
setup is the true data generating mechanism. As the deconvolution needs to
be performed on only a relatively small number of basis functions rather
than voxel by voxel in the entire three-dimensional volume, the
methodology is both robust to typical brain imaging noise levels while
also being computationally efficient. The new methodology is investigated
through simulations on both one-dimensional functions and two-dimensional images and is
also applied to a neuroimaging study whose goal is the quantification of
opioid receptor concentration in the brain.
Journal: Journal of the American Statistical Association
Pages: 1-13
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1060241
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1060241
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:1-13
Template-Type: ReDIF-Article 1.0
Author-Name: P. Richard Hahn
Author-X-Name-First: P. Richard
Author-X-Name-Last: Hahn
Author-Name: Jared S. Murray
Author-X-Name-First: Jared S.
Author-X-Name-Last: Murray
Author-Name: Ioanna Manolopoulou
Author-X-Name-First: Ioanna
Author-X-Name-Last: Manolopoulou
Title: A Bayesian Partial Identification Approach to Inferring the Prevalence of Accounting Misconduct
Abstract:
This article describes the use of flexible Bayesian regression models for
estimating a partially identified probability function. Our approach
permits efficient sensitivity analysis concerning the posterior impact of
priors on the partially identified component of the regression model. The
new methodology is illustrated on an important problem where only
partially observed data are available—inferring the prevalence of
accounting misconduct among publicly traded U.S. businesses. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 14-26
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1084307
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1084307
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:14-26
Template-Type: ReDIF-Article 1.0
Author-Name: Zhiguang Huo
Author-X-Name-First: Zhiguang
Author-X-Name-Last: Huo
Author-Name: Ying Ding
Author-X-Name-First: Ying
Author-X-Name-Last: Ding
Author-Name: Silvia Liu
Author-X-Name-First: Silvia
Author-X-Name-Last: Liu
Author-Name: Steffi Oesterreich
Author-X-Name-First: Steffi
Author-X-Name-Last: Oesterreich
Author-Name: George Tseng
Author-X-Name-First: George
Author-X-Name-Last: Tseng
Title: Meta-Analytic Framework for Sparse K-Means to Identify Disease Subtypes in Multiple Transcriptomic Studies
Abstract:
Disease phenotyping by omics data has become a popular approach that
potentially can lead to better personalized treatment. Identifying disease
subtypes via unsupervised machine learning is the first step toward this
goal. In this article, we extend a sparse K-means method
toward a meta-analytic framework to identify novel disease subtypes when
expression profiles of multiple cohorts are available. The lasso
regularization and meta-analysis identify a unique set of gene features
for subtype characterization. An additional pattern matching reward
function guarantees consistent subtype signatures across studies. The
method was evaluated by simulations and leukemia and breast cancer
datasets. The identified disease subtypes from meta-analysis were
characterized with improved accuracy and stability compared to single
study analysis. The breast cancer model was applied to an independent
METABRIC dataset and generated improved survival difference between
subtypes. These results provide a basis for diagnosis and development of
targeted treatments for disease subgroups. Supplementary materials for
this article are available online.
Journal: Journal of the American Statistical Association
Pages: 27-42
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1086354
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1086354
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:27-42
Template-Type: ReDIF-Article 1.0
Author-Name: Mehdi Maadooliat
Author-X-Name-First: Mehdi
Author-X-Name-Last: Maadooliat
Author-Name: Lan Zhou
Author-X-Name-First: Lan
Author-X-Name-Last: Zhou
Author-Name: Seyed Morteza Najibi
Author-X-Name-First: Seyed Morteza
Author-X-Name-Last: Najibi
Author-Name: Xin Gao
Author-X-Name-First: Xin
Author-X-Name-Last: Gao
Author-Name: Jianhua Z. Huang
Author-X-Name-First: Jianhua Z.
Author-X-Name-Last: Huang
Title: Collective Estimation of Multiple Bivariate Density Functions With Application to Angular-Sampling-Based Protein Loop Modeling
Abstract:
This article develops a method for simultaneous estimation of density
functions for a collection of populations of protein backbone angle pairs
using a data-driven, shared basis that is constructed by bivariate spline
functions defined on a triangulation of the bivariate domain. The circular
nature of angular data is taken into account by imposing appropriate
smoothness constraints across boundaries of the triangles. Maximum
penalized likelihood is used to fit the model and an alternating blockwise
Newton-type algorithm is developed for computation. A simulation study
shows that the collective estimation approach is statistically more
efficient than estimating the densities individually. The proposed method
was used to estimate neighbor-dependent distributions of protein backbone
dihedral angles (i.e., Ramachandran distributions). The estimated
distributions were applied to protein loop modeling, one of the most
challenging open problems in protein structure prediction, by feeding them
into an angular-sampling-based loop structure prediction framework. Our
estimated distributions compared favorably to the Ramachandran
distributions estimated by fitting a hierarchical Dirichlet process model;
and in particular, our distributions showed significant improvements on
the hard cases where existing methods do not work well.
Journal: Journal of the American Statistical Association
Pages: 43-56
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1099535
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1099535
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:43-56
Template-Type: ReDIF-Article 1.0
Author-Name: Won Chang
Author-X-Name-First: Won
Author-X-Name-Last: Chang
Author-Name: Murali Haran
Author-X-Name-First: Murali
Author-X-Name-Last: Haran
Author-Name: Patrick Applegate
Author-X-Name-First: Patrick
Author-X-Name-Last: Applegate
Author-Name: David Pollard
Author-X-Name-First: David
Author-X-Name-Last: Pollard
Title: Calibrating an Ice Sheet Model Using High-Dimensional Binary Spatial Data
Abstract:
Rapid retreat of ice in the Amundsen Sea sector of West Antarctica may
cause drastic sea level rise, posing significant risks to populations in
low-lying coastal regions. Calibration of computer models representing the
behavior of the West Antarctic Ice Sheet is key for informative
projections of future sea level rise. However, both the relevant
observations and the model output are high-dimensional binary spatial
data; existing computer model calibration methods are unable to handle
such data. Here we present a novel calibration method for computer models
whose output is in the form of binary spatial data. To mitigate the
computational and inferential challenges posed by our approach, we apply a
generalized principal component based dimension reduction method. To
demonstrate the utility of our method, we calibrate the PSU3D-ICE model by
comparing the output from a 499-member perturbed-parameter ensemble with
observations from the Amundsen Sea sector of the ice sheet. Our methods
help rigorously characterize the parameter uncertainty even in the
presence of systematic data-model discrepancies and dependence in the
errors. Our method also helps inform environmental risk analyses by
contributing to improved projections of sea level rise from the ice
sheets. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 57-72
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1108199
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1108199
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:57-72
Template-Type: ReDIF-Article 1.0
Author-Name: Lisa M. Pham
Author-X-Name-First: Lisa M.
Author-X-Name-Last: Pham
Author-Name: Luis Carvalho
Author-X-Name-First: Luis
Author-X-Name-Last: Carvalho
Author-Name: Scott Schaus
Author-X-Name-First: Scott
Author-X-Name-Last: Schaus
Author-Name: Eric D. Kolaczyk
Author-X-Name-First: Eric D.
Author-X-Name-Last: Kolaczyk
Title: Perturbation Detection Through Modeling of Gene Expression on a Latent Biological Pathway Network: A Bayesian Hierarchical Approach
Abstract:
Cellular response to a perturbation is the result of a dynamic system of
biological variables linked in a complex network. A major challenge in
drug and disease studies is identifying the key factors of a biological
network that are essential in determining the cell’s fate. Here,
our goal is the identification of perturbed pathways from high-throughput
gene expression data. We develop a three-level hierarchical model, where
(i) the first level captures the relationship between gene expression and
biological pathways using confirmatory factor analysis, (ii) the second
level models the behavior within an underlying network of pathways induced
by an unknown perturbation using a conditional autoregressive model, and
(iii) the third level is a spike-and-slab prior on the perturbations. We
then identify perturbations through posterior-based variable selection. We
illustrate our approach using gene transcription drug perturbation
profiles from the DREAM7 drug sensitivity prediction challenge dataset.
Our proposed method identified regulatory pathways that are known to play
a causative role and that were not readily resolved using gene set
enrichment analysis or exploratory factor models. Simulation results are
presented assessing the performance of this model relative to a
network-free variant and its robustness to inaccuracies in biological
databases. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 73-92
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1110523
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1110523
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:73-92
Template-Type: ReDIF-Article 1.0
Author-Name: Matthew R. Schofield
Author-X-Name-First: Matthew R.
Author-X-Name-Last: Schofield
Author-Name: Richard J. Barker
Author-X-Name-First: Richard J.
Author-X-Name-Last: Barker
Author-Name: Andrew Gelman
Author-X-Name-First: Andrew
Author-X-Name-Last: Gelman
Author-Name: Edward R. Cook
Author-X-Name-First: Edward R.
Author-X-Name-Last: Cook
Author-Name: Keith R. Briffa
Author-X-Name-First: Keith R.
Author-X-Name-Last: Briffa
Title: A Model-Based Approach to Climate Reconstruction Using Tree-Ring Data
Abstract:
Quantifying long-term historical climate is fundamental to understanding
recent climate change. Most instrumentally recorded climate data are only
available for the past 200 years, so proxy observations from natural
archives are often considered. We describe a model-based approach to
reconstructing climate defined in terms of raw tree-ring measurement data
that simultaneously accounts for nonclimatic and climatic variability. In
this approach, we specify a joint model for the tree-ring data and climate
variable that we fit using Bayesian inference. We consider a range of
prior densities and compare the modeling approach to current methodology
using an example case of Scots pine from Torneträsk, Sweden, to
reconstruct growing season temperature. We describe how current approaches
translate into particular model assumptions. We explore how changes to
various components in the model-based approach affect the resulting
reconstruction. We show that minor changes in model specification can have
little effect on model fit but lead to large changes in the predictions.
In particular, the periods of relatively warmer and cooler temperatures
are robust between models, but the magnitude of the resulting temperatures
is highly model dependent. Such sensitivity may not be apparent with
traditional approaches because the underlying statistical model is often
hidden or poorly described. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 93-106
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1110524
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1110524
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:93-106
Template-Type: ReDIF-Article 1.0
Author-Name: Nilanjan Chatterjee
Author-X-Name-First: Nilanjan
Author-X-Name-Last: Chatterjee
Author-Name: Yi-Hau Chen
Author-X-Name-First: Yi-Hau
Author-X-Name-Last: Chen
Author-Name: Paige Maas
Author-X-Name-First: Paige
Author-X-Name-Last: Maas
Author-Name: Raymond J. Carroll
Author-X-Name-First: Raymond J.
Author-X-Name-Last: Carroll
Title: Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-Level Information From External Big Data Sources
Abstract:
Information from various public and private data sources of extremely
large sample size is now increasingly available for research purposes.
Statistical methods are needed for using information from such big data
sources while analyzing data from individual studies that may collect more
detailed information required for addressing specific hypotheses of
interest. In this article, we consider the problem of building regression
models based on individual-level data from an “internal”
study while using summary-level information, such as information on
parameters for reduced models, from an “external” big data
source. We identify a set of very general constraints that link internal
and external models. These constraints are used to develop a framework for
semiparametric maximum likelihood inference that allows the distribution
of covariates to be estimated using either the internal sample or an
external reference sample. We develop extensions for handling complex
stratified sampling designs, such as case-control sampling, for the
internal study. Asymptotic theory and variance estimators are developed
for each case. We use simulation studies and a real data application to
assess the performance of the proposed methods in contrast to the
generalized regression calibration methodology that is popular in the
sample survey literature. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 107-117
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1123157
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1123157
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:107-117
Template-Type: ReDIF-Article 1.0
Author-Name: Peisong Han
Author-X-Name-First: Peisong
Author-X-Name-Last: Han
Author-Name: Jerald F. Lawless
Author-X-Name-First: Jerald F.
Author-X-Name-Last: Lawless
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 118-121
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2016.1149399
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1149399
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:118-121
Template-Type: ReDIF-Article 1.0
Author-Name: Sebastien Haneuse
Author-X-Name-First: Sebastien
Author-X-Name-Last: Haneuse
Author-Name: Claudia Rivera
Author-X-Name-First: Claudia
Author-X-Name-Last: Rivera
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 121-122
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2016.1149401
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1149401
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:121-122
Template-Type: ReDIF-Article 1.0
Author-Name: Thomas A. Louis
Author-X-Name-First: Thomas A.
Author-X-Name-Last: Louis
Author-Name: Niels Keiding
Author-X-Name-First: Niels
Author-X-Name-Last: Keiding
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 123-124
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2016.1149403
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1149403
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:123-124
Template-Type: ReDIF-Article 1.0
Author-Name: Joel A. Mefford
Author-X-Name-First: Joel A.
Author-X-Name-Last: Mefford
Author-Name: Noah A. Zaitlen
Author-X-Name-First: Noah A.
Author-X-Name-Last: Zaitlen
Author-Name: John S. Witte
Author-X-Name-First: John S.
Author-X-Name-Last: Witte
Title: Comment: A Human Genetics Perspective
Journal: Journal of the American Statistical Association
Pages: 124-127
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2016.1149404
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1149404
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:124-127
Template-Type: ReDIF-Article 1.0
Author-Name: Chirag J. Patel
Author-X-Name-First: Chirag J.
Author-X-Name-Last: Patel
Author-Name: Francesca Dominici
Author-X-Name-First: Francesca
Author-X-Name-Last: Dominici
Title: Comment: Addressing the Need for Portability in Big Data Model Building and Calibration
Journal: Journal of the American Statistical Association
Pages: 127-129
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2016.1149406
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1149406
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:127-129
Template-Type: ReDIF-Article 1.0
Author-Name: Nilanjan Chatterjee
Author-X-Name-First: Nilanjan
Author-X-Name-Last: Chatterjee
Author-Name: Yi-Hau Chen
Author-X-Name-First: Yi-Hau
Author-X-Name-Last: Chen
Author-Name: Paige Maas
Author-X-Name-First: Paige
Author-X-Name-Last: Maas
Author-Name: Raymond J. Carroll
Author-X-Name-First: Raymond J.
Author-X-Name-Last: Carroll
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 130-131
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2016.1149407
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1149407
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:130-131
Template-Type: ReDIF-Article 1.0
Author-Name: Hyunseung Kang
Author-X-Name-First: Hyunseung
Author-X-Name-Last: Kang
Author-Name: Anru Zhang
Author-X-Name-First: Anru
Author-X-Name-Last: Zhang
Author-Name: T. Tony Cai
Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai
Author-Name: Dylan S. Small
Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small
Title: Instrumental Variables Estimation With Some Invalid Instruments and its Application to Mendelian Randomization
Abstract:
Instrumental variables have been widely used for estimating the causal
effect between exposure and outcome. Conventional estimation methods
require complete knowledge about all the instruments’ validity; a
valid instrument must not have a direct effect on the outcome and not be
related to unmeasured confounders. Often, this is impractical as
highlighted by Mendelian randomization studies where genetic markers are
used as instruments and complete knowledge about instruments’
validity is equivalent to complete knowledge about the involved
genes’ functions. In this article, we propose a method for
estimation of causal effects when this complete knowledge is absent. It is
shown that causal effects are identified and can be estimated as long as
less than 50% of instruments are invalid, without knowing which of the
instruments are invalid. We also introduce conditions for identification
when the 50% threshold is violated. A fast penalized ℓ1
estimation method, called sisVIVE, is introduced for estimating the causal
effect without knowing which instruments are valid, with theoretical
guarantees on its performance. The proposed method is demonstrated on
simulated data and a real Mendelian randomization study concerning the
effect of body mass index (BMI) on health-related quality of life (HRQL)
index. An R package sisVIVE is available on CRAN.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 132-144
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2014.994705
File-URL: http://hdl.handle.net/10.1080/01621459.2014.994705
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:132-144
Template-Type: ReDIF-Article 1.0
Author-Name: Xiaoyan Sun
Author-X-Name-First: Xiaoyan
Author-X-Name-Last: Sun
Author-Name: Limin Peng
Author-X-Name-First: Limin
Author-X-Name-Last: Peng
Author-Name: Yijian Huang
Author-X-Name-First: Yijian
Author-X-Name-Last: Huang
Author-Name: HuiChuan J. Lai
Author-X-Name-First: HuiChuan J.
Author-X-Name-Last: Lai
Title: Generalizing Quantile Regression for Counting Processes With Applications to Recurrent Events
Abstract:
In survival analysis, quantile regression has become a useful approach to
account for covariate effects on the distribution of an event time of
interest. In this article, we discuss how quantile regression can be
extended to model counting processes and thus lead to a broader regression
framework for survival data. We specifically investigate the proposed
modeling of counting processes for recurrent events data. We show that the
new recurrent events model retains the desirable features of quantile
regression such as easy interpretation and good model flexibility, while
accommodating various observation schemes encountered in observational
studies. We develop a general theoretical and inferential framework for
the new counting process model, which unifies with an existing method for
censored quantile regression. As another useful contribution of this work,
we propose a sample-based covariance estimation procedure, which provides
a useful complement to the prevailing bootstrapping approach. We
demonstrate the utility of our proposals via simulation studies and an
application to a dataset from the U.S. Cystic Fibrosis Foundation Patient
Registry (CFFPR). Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 145-156
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2014.995795
File-URL: http://hdl.handle.net/10.1080/01621459.2014.995795
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:145-156
Template-Type: ReDIF-Article 1.0
Author-Name: Peng Ding
Author-X-Name-First: Peng
Author-X-Name-Last: Ding
Author-Name: Tirthankar Dasgupta
Author-X-Name-First: Tirthankar
Author-X-Name-Last: Dasgupta
Title: A Potential Tale of Two-by-Two Tables From Completely Randomized Experiments
Abstract:
Causal inference in completely randomized treatment-control studies with
binary outcomes is discussed from Fisherian, Neymanian, and Bayesian
perspectives, using the potential outcomes model. A randomization-based
justification of Fisher’s exact test is provided. Arguing that the
crucial assumption of constant causal effect is often unrealistic, and
holds only for extreme cases, some new asymptotic and Bayesian inferential
procedures are proposed. The proposed procedures exploit the intrinsic
nonadditivity of unit-level causal effects, can be applied to linear and
nonlinear estimands, and dominate the existing methods, as verified
theoretically and also through simulation studies. Supplementary materials
for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 157-168
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2014.995796
File-URL: http://hdl.handle.net/10.1080/01621459.2014.995796
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:157-168
Template-Type: ReDIF-Article 1.0
Author-Name: Rui Pan
Author-X-Name-First: Rui
Author-X-Name-Last: Pan
Author-Name: Hansheng Wang
Author-X-Name-First: Hansheng
Author-X-Name-Last: Wang
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Title: Ultrahigh-Dimensional Multiclass Linear Discriminant Analysis by Pairwise Sure Independence Screening
Abstract:
This article is concerned with the problem of feature screening for
multiclass linear discriminant analysis under ultrahigh-dimensional
setting. We allow the number of classes to be relatively large. As a
result, the total number of relevant features is larger than usual. This
makes the related classification problem much more challenging than the
conventional one, where the number of classes is small (very often two).
To solve the problem, we propose a novel pairwise sure independence
screening method for linear discriminant analysis with an
ultrahigh-dimensional predictor. The proposed procedure is directly
applicable to the situation with many classes. We further prove that the
proposed method is screening consistent. Simulation studies are conducted
to assess the finite sample performance of the new procedure. We also
demonstrate the proposed methodology via an empirical analysis of a real
life example on handwritten Chinese character recognition.
Journal: Journal of the American Statistical Association
Pages: 169-179
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2014.998760
File-URL: http://hdl.handle.net/10.1080/01621459.2014.998760
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:169-179
Template-Type: ReDIF-Article 1.0
Author-Name: Wenjiang Fu
Author-X-Name-First: Wenjiang
Author-X-Name-Last: Fu
Title: Constrained Estimators and Consistency of a Regression Model on a Lexis Diagram
Abstract:
This article considers a regression model on a Lexis diagram of an
a × p table with a single response
in each cell following a distribution in the exponential family. A
regression model on the fixed effects of a rows,
p columns, and a + p
− 1 diagonals induces a singular design matrix and yields multiple
estimators, leading to the parameter identifiability problem in
age--period--cohort analysis in social sciences, demography, and
epidemiology, where assessment of secular trend in age, period, and birth
cohort of social events (e.g., violence) and diseases (e.g., cancer) is of
interest. Similar problems also exist in other settings, such as in
supersaturated designs. In this article, we study the finite sample
properties of the multiple estimators, propose a penalized profile
likelihood method to study the consistency and asymptotic bias, and
demonstrate the results through simulations and data analysis. As a
by-product, the identifiability problem is addressed with consistent
estimation for model parameters and secular trend. We conclude that
consistent estimation can be identified through estimable functions and
asymptotic studies in regressions with a singular design. Our method
provides a novel approach to studying asymptotics of multiple estimators
with a diverging number of nuisance parameters.
Journal: Journal of the American Statistical Association
Pages: 180-199
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2014.998761
File-URL: http://hdl.handle.net/10.1080/01621459.2014.998761
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:180-199
Template-Type: ReDIF-Article 1.0
Author-Name: Michalis K. Titsias
Author-X-Name-First: Michalis K.
Author-X-Name-Last: Titsias
Author-Name: Christopher C. Holmes
Author-X-Name-First: Christopher C.
Author-X-Name-Last: Holmes
Author-Name: Christopher Yau
Author-X-Name-First: Christopher
Author-X-Name-Last: Yau
Title: Statistical Inference in Hidden Markov Models Using k-Segment Constraints
Abstract:
Hidden Markov models (HMMs) are one of the most widely used statistical
methods for analyzing sequence data. However, the reporting of output from
HMMs has largely been restricted to the presentation of the most-probable
(MAP) hidden state sequence, found via the Viterbi algorithm, or the
sequence of most probable marginals using the forward--backward algorithm.
In this article, we expand the amount of information we could obtain from
the posterior distribution of an HMM by introducing linear-time dynamic
programming recursions that, conditional on a user-specified constraint on
the number of segments, allow us to (i) find MAP sequences, (ii) compute
posterior probabilities, and (iii) simulate sample paths. We collectively
call these recursions k-segment algorithms and illustrate
their utility using simulated and real examples. We also highlight the
prospective and retrospective use of k-segment
constraints for fitting HMMs or exploring existing model fits.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 200-215
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2014.998762
File-URL: http://hdl.handle.net/10.1080/01621459.2014.998762
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:200-215
Template-Type: ReDIF-Article 1.0
Author-Name: Francesco Bartolucci
Author-X-Name-First: Francesco
Author-X-Name-Last: Bartolucci
Author-Name: Monia Lupparelli
Author-X-Name-First: Monia
Author-X-Name-Last: Lupparelli
Title: Pairwise Likelihood Inference for Nested Hidden Markov Chain Models for Multilevel Longitudinal Data
Abstract:
In the context of multilevel longitudinal data, where sample units are
collected in clusters, an important aspect that should be accounted for is
the unobserved heterogeneity between sample units and between clusters.
For this aim, we propose an approach based on nested hidden (latent)
Markov chains, which are associated with every sample unit and with every
cluster. The approach allows us to account for the previously mentioned
forms of unobserved heterogeneity in a dynamic fashion; it also allows us
to account for the correlation that may arise between the responses
provided by the units belonging to the same cluster. Under the assumed
model, computing the manifest distribution of these response variables is
infeasible even with a few units per cluster. Therefore, we make inference
on this model through a composite likelihood function based on all the
possible pairs of subjects within each cluster. Properties of the
composite likelihood estimator are assessed by simulation. The proposed
approach is illustrated through an application to a dataset concerning a
sample of Italian workers in which a binary response variable for the
worker receiving an illness benefit was repeatedly observed. Supplementary
materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 216-228
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2014.998935
File-URL: http://hdl.handle.net/10.1080/01621459.2014.998935
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:216-228
Template-Type: ReDIF-Article 1.0
Author-Name: T. Tony Cai
Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai
Author-Name: Weidong Liu
Author-X-Name-First: Weidong
Author-X-Name-Last: Liu
Title: Large-Scale Multiple Testing of Correlations
Abstract:
Multiple testing of correlations arises in many applications including
gene coexpression network analysis and brain connectivity analysis. In
this article, we consider large-scale simultaneous testing for
correlations in both the one-sample and two-sample settings. New multiple
testing procedures are proposed and a bootstrap method is introduced for
estimating the proportion of the nulls falsely rejected among all the true
nulls. We investigate the properties of the proposed procedures both
theoretically and numerically. It is shown that the procedures
asymptotically control the overall false discovery rate and false
discovery proportion at the nominal level. Simulation results show that
the methods perform well numerically in terms of both the size and power
of the test and significantly outperform two alternative methods. The
two-sample procedure is also illustrated by an analysis of a prostate
cancer dataset for the detection of changes in coexpression patterns
between gene expression levels. Supplementary materials for this article
are available online.
Journal: Journal of the American Statistical Association
Pages: 229-240
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2014.999157
File-URL: http://hdl.handle.net/10.1080/01621459.2014.999157
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:229-240
Template-Type: ReDIF-Article 1.0
Author-Name: Yunzhang Zhu
Author-X-Name-First: Yunzhang
Author-X-Name-Last: Zhu
Author-Name: Xiaotong Shen
Author-X-Name-First: Xiaotong
Author-X-Name-Last: Shen
Author-Name: Changqing Ye
Author-X-Name-First: Changqing
Author-X-Name-Last: Ye
Title: Personalized Prediction and Sparsity Pursuit in Latent Factor Models
Abstract:
Personalized information filtering extracts the information specifically
relevant to a user, predicting his/her preference over a large number of
items, based on the opinions of like-minded users or on the items' content. This
problem is cast into the framework of regression and classification, where
we integrate additional user-specific and content-specific predictors in
partial latent models, for higher predictive accuracy. In particular, we
factorize a user-over-item preference matrix into a product of two
matrices, each representing a user’s preference and an item
preference by users. Then we propose a likelihood method to seek the
sparsest latent factorization, from a class of overcomplete
factorizations, possibly with a high percentage of missing values. This
promotes additional sparsity beyond rank reduction. Computationally, we
design methods based on a “decomposition and combination”
strategy, to break large-scale optimization into many small subproblems to
solve in a recursive and parallel manner. On this basis, we implement the
proposed methods through multi-platform shared-memory parallel
programming, and through Mahout, a library for scalable machine learning
and data mining, for MapReduce computation. For example, our methods
scale to a dataset of three billion observations on a
single machine with sufficient memory, with good timings. Both
theoretical and numerical investigations show that the proposed methods
exhibit a significant improvement in accuracy over state-of-the-art
scalable methods. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 241-252
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2014.999158
File-URL: http://hdl.handle.net/10.1080/01621459.2014.999158
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:241-252
Template-Type: ReDIF-Article 1.0
Author-Name: T. Tony Cai
Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai
Author-Name: Ming Yuan
Author-X-Name-First: Ming
Author-X-Name-Last: Yuan
Title: Minimax and Adaptive Estimation of Covariance Operator for Random Variables Observed on a Lattice Graph
Abstract:
Covariance structure plays an important role in high-dimensional
statistical inference. In a range of applications including imaging
analysis and fMRI studies, random variables are observed on a lattice
graph. In such a setting, it is important to account for the lattice
structure when estimating the covariance operator. In this article, we
consider both minimax and adaptive estimation of the covariance operator
over collections of polynomially decaying and exponentially decaying
parameter spaces. We first establish the minimax rates of convergence for
estimating the covariance operator under the operator norm. The results
show that the dimension of the lattice graph significantly affects the
optimal rates of convergence, often much more so than the dimension of the
random variables. We then consider adaptive estimation of the covariance
operator. A fully data-driven block thresholding procedure is proposed and
is shown to be adaptively rate optimal simultaneously over a wide range of
polynomially decaying and exponentially decaying parameter spaces. The
adaptive block thresholding procedure is easy to implement, and numerical
experiments are carried out to illustrate the merit of the procedure.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 253-265
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2014.1001067
File-URL: http://hdl.handle.net/10.1080/01621459.2014.1001067
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:253-265
Template-Type: ReDIF-Article 1.0
Author-Name: Xingdong Feng
Author-X-Name-First: Xingdong
Author-X-Name-Last: Feng
Author-Name: Liping Zhu
Author-X-Name-First: Liping
Author-X-Name-Last: Zhu
Title: Estimation and Testing of Varying Coefficients in Quantile Regression
Abstract:
In this article, we establish a novel connection between the null
hypothesis H0 on the coefficients and a
rank-reducible form of the varying coefficient model in quantile
regression. We use B-splines to approximate the varying
coefficients in the rank-reducible model, and make use of the fact that
the null hypothesis H0 implies a
unidimensional structure of a transformed coefficient matrix for the
B-spline basis functions. By evaluating the
unidimensional structure, we alleviate the difficulty of testing such
hypotheses commonly considered in varying coefficient quantile models. We
demonstrate through numerical studies that the proposed method can be much
more powerful than the rank score test which is widely used in the
quantile regression literature. Supplementary materials for this article
are available online.
Journal: Journal of the American Statistical Association
Pages: 266-274
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2014.1001068
File-URL: http://hdl.handle.net/10.1080/01621459.2014.1001068
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:266-274
Template-Type: ReDIF-Article 1.0
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Yang Feng
Author-X-Name-First: Yang
Author-X-Name-Last: Feng
Author-Name: Jiancheng Jiang
Author-X-Name-First: Jiancheng
Author-X-Name-Last: Jiang
Author-Name: Xin Tong
Author-X-Name-First: Xin
Author-X-Name-Last: Tong
Title: Feature Augmentation via Nonparametrics and Selection (FANS) in High-Dimensional Classification
Abstract:
We propose a high-dimensional classification method that involves
nonparametric feature augmentation. Knowing that marginal density ratios
are the most powerful univariate classifiers, we use the ratio estimates
to transform the original feature measurements. Subsequently, penalized
logistic regression is invoked, taking as input the newly transformed or
augmented features. This procedure trains models equipped with local
complexity and global simplicity, thereby avoiding the curse of
dimensionality while creating a flexible nonlinear decision boundary. The
resulting method is called feature augmentation via nonparametrics and
selection (FANS). We motivate FANS by generalizing the naive Bayes model,
writing the log ratio of joint densities as a linear combination of those
of marginal densities. It is related to generalized additive models, but
has better interpretability and computability. Risk bounds are developed
for FANS. In numerical analysis, FANS is compared with competing methods,
so as to provide a guideline on its best application domain. Real data
analysis demonstrates that FANS performs very competitively on benchmark
email spam and gene expression datasets. Moreover, FANS is implemented by
an extremely fast algorithm through parallel computing.
Journal: Journal of the American Statistical Association
Pages: 275-287
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1005212
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1005212
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:275-287
Template-Type: ReDIF-Article 1.0
Author-Name: Jianhua Guo
Author-X-Name-First: Jianhua
Author-X-Name-Last: Guo
Author-Name: Jianchang Hu
Author-X-Name-First: Jianchang
Author-X-Name-Last: Hu
Author-Name: Bing-Yi Jing
Author-X-Name-First: Bing-Yi
Author-X-Name-Last: Jing
Author-Name: Zhen Zhang
Author-X-Name-First: Zhen
Author-X-Name-Last: Zhang
Title: Spline-Lasso in High-Dimensional Linear Regression
Abstract:
We consider a high-dimensional linear regression problem, where the
covariates (features) are ordered in some meaningful way, and the number
of covariates p can be much larger than the sample size
n. The fused lasso of Tibshirani et al. is designed
especially to tackle this type of problem; it yields sparse coefficients,
selects grouped variables, and encourages a locally constant coefficient
profile within each group. However, in some applications, the effects of
different features within a group might be different and change smoothly.
In this article, we propose a new spline-lasso or more generally,
spline-MCP to better capture the different effects within the group. The
newly proposed method is very easy to implement since it can be easily
turned into a lasso or MCP problem. Simulations show that the method works
very effectively both in feature selection and prediction accuracy. A real
application is also given to illustrate the benefits of the method.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 288-297
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1005839
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1005839
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:288-297
Template-Type: ReDIF-Article 1.0
Author-Name: Wentao Li
Author-X-Name-First: Wentao
Author-X-Name-Last: Li
Author-Name: Rong Chen
Author-X-Name-First: Rong
Author-X-Name-Last: Chen
Author-Name: Zhiqiang Tan
Author-X-Name-First: Zhiqiang
Author-X-Name-Last: Tan
Title: Efficient Sequential Monte Carlo With Multiple Proposals and Control Variates
Abstract:
Sequential Monte Carlo is a useful simulation-based method for online
filtering of state-space models. For certain complex state-space models, a
single proposal distribution is usually not satisfactory and using
multiple proposal distributions is a general approach to address various
aspects of the filtering problem. This article proposes an efficient
method of using multiple proposals in combination with control variates.
The likelihood approach of Tan (2004) is used in both resampling and
estimation. The new algorithm is shown to be asymptotically more efficient
than the direct use of multiple proposals and control variates. The
guidance for selecting multiple proposals and control variates is also
given. Numerical examples are used to demonstrate that the new algorithm
can significantly improve over the bootstrap filter and auxiliary particle
filter.
Journal: Journal of the American Statistical Association
Pages: 298-313
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1006364
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1006364
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:298-313
Template-Type: ReDIF-Article 1.0
Author-Name: Chao Du
Author-X-Name-First: Chao
Author-X-Name-Last: Du
Author-Name: Chu-Lan Michael Kao
Author-X-Name-First: Chu-Lan Michael
Author-X-Name-Last: Kao
Author-Name: S. C. Kou
Author-X-Name-First: S. C.
Author-X-Name-Last: Kou
Title: Stepwise Signal Extraction via Marginal Likelihood
Abstract:
This article studies the estimation of a stepwise signal. To determine the
number and locations of change-points of the stepwise signal, we formulate
a maximum marginal likelihood estimator, which can be computed with a
quadratic cost using dynamic programming. We carry out an extensive
investigation on the choice of the prior distribution and study the
asymptotic properties of the maximum marginal likelihood estimator. We
propose to treat each possible set of change-points equally and adopt an
empirical Bayes approach to specify the prior distribution of segment
parameters. A detailed simulation study is performed to compare the
effectiveness of this method with other existing methods. We demonstrate
our method on single-molecule enzyme reaction data and on DNA array
comparative genomic hybridization (CGH) data. Our study shows that this
method is applicable to a wide range of models and offers appealing
results in practice. Supplementary materials for this article are
available online.
Journal: Journal of the American Statistical Association
Pages: 314-330
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1006365
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1006365
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:314-330
Template-Type: ReDIF-Article 1.0
Author-Name: Jacopo Mandozzi
Author-X-Name-First: Jacopo
Author-X-Name-Last: Mandozzi
Author-Name: Peter Bühlmann
Author-X-Name-First: Peter
Author-X-Name-Last: Bühlmann
Title: Hierarchical Testing in the High-Dimensional Setting With Correlated Variables
Abstract:
We propose a method for testing whether hierarchically ordered groups of
potentially correlated variables are significant for explaining a response
in a high-dimensional linear model. In the presence of highly correlated
variables, as is very common in high-dimensional data, it seems
indispensable to go beyond an approach of inferring individual regression
coefficients, and we show that detecting the smallest groups of variables
(MTDs: minimal true detections) is realistic. Thanks to the hierarchy
among the groups of variables, powerful multiple testing adjustment is
possible, which leads to a data-driven choice of the resolution level for
the groups. Our procedure, based on repeated sample splitting, is shown to
asymptotically control the familywise error rate and we provide empirical
results for simulated and real data which complement the theoretical
analysis. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 331-343
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1007209
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1007209
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:331-343
Template-Type: ReDIF-Article 1.0
Author-Name: Ying Wei
Author-X-Name-First: Ying
Author-X-Name-Last: Wei
Author-Name: Xiaoyu Song
Author-X-Name-First: Xiaoyu
Author-X-Name-Last: Song
Author-Name: Mengling Liu
Author-X-Name-First: Mengling
Author-X-Name-Last: Liu
Author-Name: Iuliana Ionita-Laza
Author-X-Name-First: Iuliana
Author-X-Name-Last: Ionita-Laza
Author-Name: Joan Reibman
Author-X-Name-First: Joan
Author-X-Name-Last: Reibman
Title: Quantile Regression in the Secondary Analysis of Case--Control Data
Abstract:
Case--control design is widely used in epidemiology and other fields to
identify factors associated with a disease. Data collected from existing
case--control studies can also provide a cost-effective way to investigate
the association of risk factors with secondary outcomes. When the
secondary outcome is a continuous random variable, most of the existing
methods focus on the statistical inference on the mean of the secondary
outcome. In this article, we propose a quantile-based approach to
facilitating a comprehensive investigation of covariates’ effects
on multiple quantiles of the secondary outcome. We construct a new family
of estimating equations combining observed and pseudo outcomes, which lead
to consistent estimation of conditional quantiles using case--control
data. Simulations are conducted to evaluate the performance of our
proposed approach, and a case--control study on genetic association with
asthma is used to demonstrate the method. Supplementary materials for this
article are available online.
Journal: Journal of the American Statistical Association
Pages: 344-354
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1008101
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1008101
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:344-354
Template-Type: ReDIF-Article 1.0
Author-Name: Yuan Jiang
Author-X-Name-First: Yuan
Author-X-Name-Last: Jiang
Author-Name: Yunxiao He
Author-X-Name-First: Yunxiao
Author-X-Name-Last: He
Author-Name: Heping Zhang
Author-X-Name-First: Heping
Author-X-Name-Last: Zhang
Title: Variable Selection With Prior Information for Generalized Linear Models via the Prior LASSO Method
Abstract:
LASSO is a popular statistical tool often used in conjunction with
generalized linear models that can simultaneously select variables and
estimate parameters. When there are many variables of interest, as in
current biological and biomedical studies, the power of LASSO can be
limited. Fortunately, a wealth of biological and biomedical data has been
collected that may contain useful information about the importance of
certain variables. This article proposes an extension of LASSO, namely,
prior LASSO (pLASSO), to incorporate that prior information into penalized
generalized linear models. The goal is achieved by adding in the LASSO
criterion function an additional measure of the discrepancy between the
prior information and the model. For linear regression, the whole solution
path of the pLASSO estimator can be found with a procedure similar to the
least angle regression (LARS). Asymptotic theories and simulation results
show that pLASSO provides significant improvement over LASSO when the
prior information is relatively accurate. When the prior information is
less reliable, pLASSO shows great robustness to the misspecification. We
illustrate the application of pLASSO using a real dataset from a
genome-wide association study. Supplementary materials for this article
are available online.
Journal: Journal of the American Statistical Association
Pages: 355-376
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1008363
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1008363
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:355-376
Template-Type: ReDIF-Article 1.0
Author-Name: Faming Liang
Author-X-Name-First: Faming
Author-X-Name-Last: Liang
Author-Name: Ick Hoon Jin
Author-X-Name-First: Ick Hoon
Author-X-Name-Last: Jin
Author-Name: Qifan Song
Author-X-Name-First: Qifan
Author-X-Name-Last: Song
Author-Name: Jun S. Liu
Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu
Title: An Adaptive Exchange Algorithm for Sampling From Distributions With Intractable Normalizing Constants
Abstract:
Sampling from the posterior distribution for a model whose normalizing
constant is intractable is a long-standing problem in statistical
research. We propose a new algorithm, adaptive auxiliary variable exchange
algorithm, or, in short, adaptive exchange (AEX) algorithm, to tackle this
problem. The new algorithm can be viewed as an MCMC extension of the
exchange algorithm, which generates auxiliary variables via an importance
sampling procedure from a Markov chain running in parallel. The
convergence of the algorithm is established under mild conditions.
Compared to the exchange algorithm, the new algorithm removes the
requirement that the auxiliary variables must be drawn using a perfect
sampler, and thus can be applied to many models for which the perfect
sampler is not available or very expensive. Compared to the approximate
exchange algorithms, such as the double Metropolis-Hastings sampler, the
new algorithm overcomes their theoretical difficulty in convergence. The
new algorithm is tested on the spatial autologistic and autonormal models.
The numerical results indicate that the new algorithm is particularly
useful for problems in which the underlying system is strongly
dependent. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 377-393
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1009072
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1009072
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:377-393
Template-Type: ReDIF-Article 1.0
Author-Name: Mengjie Chen
Author-X-Name-First: Mengjie
Author-X-Name-Last: Chen
Author-Name: Zhao Ren
Author-X-Name-First: Zhao
Author-X-Name-Last: Ren
Author-Name: Hongyu Zhao
Author-X-Name-First: Hongyu
Author-X-Name-Last: Zhao
Author-Name: Harrison Zhou
Author-X-Name-First: Harrison
Author-X-Name-Last: Zhou
Title: Asymptotically Normal and Efficient Estimation of Covariate-Adjusted Gaussian Graphical Model
Abstract:
We propose an asymptotically normal and efficient procedure to estimate
every finite subgraph for covariate-adjusted Gaussian graphical model. As
a consequence, a confidence interval as well as p-value
can be obtained for each edge. The procedure is tuning-free and enjoys
easy implementation and efficient computation through parallel estimation
on subgraphs or edges. We apply the asymptotic normality result to perform
support recovery through edge-wise adaptive thresholding. This support
recovery procedure is called ANTAC, standing for asymptotically normal
estimation with thresholding after adjusting covariates. ANTAC outperforms
other methodologies in the literature in a range of simulation studies. We
apply ANTAC to identify gene--gene interactions using an eQTL dataset. Our
result achieves better interpretability and accuracy in comparison with a
state-of-the-art method. Supplementary materials for the article are
available online.
Journal: Journal of the American Statistical Association
Pages: 394-406
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1010039
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1010039
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:394-406
Template-Type: ReDIF-Article 1.0
Author-Name: Matthew Reimherr
Author-X-Name-First: Matthew
Author-X-Name-Last: Reimherr
Author-Name: Dan Nicolae
Author-X-Name-First: Dan
Author-X-Name-Last: Nicolae
Title: Estimating Variance Components in Functional Linear Models With Applications to Genetic Heritability
Abstract:
Quantifying heritability is the first step in understanding the
contribution of genetic variation to the risk architecture of complex
human diseases and traits. Heritability can be estimated for univariate
phenotypes from nonfamily data using linear mixed effects models. There
is, however, no fully developed methodology for defining or estimating
heritability from longitudinal studies. By examining longitudinal studies,
researchers have the opportunity to better understand the genetic
influence on the temporal development of diseases, which can be vital for
populations with rapidly changing phenotypes such as children or the
elderly. To define and estimate heritability for longitudinally measured
phenotypes, we present a framework based on functional data analysis, FDA.
While our procedures have important genetic consequences, they also
represent a substantial development for FDA. In particular, we present a
very general methodology for constructing optimal, unbiased estimates of
variance components in functional linear models. Such a problem is
challenging as likelihoods and densities do not readily generalize to
infinite-dimensional settings. Our procedure can be viewed as a functional
generalization of the minimum norm quadratic unbiased estimation
procedure, MINQUE, presented by C. R. Rao, and is equivalent to residual
maximum likelihood, REML, in univariate settings. We apply our methodology
to the Childhood Asthma Management Program, CAMP, a 4-year longitudinal
study examining the long term effects of daily asthma medications on
children.
Journal: Journal of the American Statistical Association
Pages: 407-422
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1016224
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1016224
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:407-422
Template-Type: ReDIF-Article 1.0
Author-Name: Zeng-Hua Lu
Author-X-Name-First: Zeng-Hua
Author-X-Name-Last: Lu
Title: Extended MaxT Tests of One-Sided Hypotheses
Abstract:
In many statistical applications of one-sided tests of multiple hypotheses
researchers are often concerned not only with global tests of the
intersection of individual hypotheses, but also with multiple tests of
individual hypotheses. For example, in clinical trial studies researchers
often need to find out the efficacy of a treatment, as well as the
significance of each outcome measurement (endpoint) of the treatment. This
article proposes MaxT type tests aiming at improving the global power of
existing MaxT tests. Our extended MaxT tests are constructed by adding an
extra component to the maximand set of existing MaxT tests. The added
component is a weighted sum of other components. Some power properties
relating to choices of weight are studied. Our simulation study shows that
the proposed tests can considerably improve the global power of existing
MaxT tests and can also outperform many other global tests under some
alternatives and/or some nonnormal distributions. Furthermore, it is shown
that such global power improvement may involve little loss of power on
multiple testing. Two real data examples on clinical trial studies
reported in the literature are reexamined. The results of our tests
suggest stronger evidence on treatment effects over MaxT tests and
likelihood ratio tests while changing little on the evidence concerning
endpoint testing. Supplementary materials for this article are available
online.
Journal: Journal of the American Statistical Association
Pages: 423-437
Issue: 513
Volume: 111
Year: 2016
Month: 3
X-DOI: 10.1080/01621459.2015.1019509
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1019509
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:513:p:423-437
Template-Type: ReDIF-Article 1.0
Author-Name: Daniele Durante
Author-X-Name-First: Daniele
Author-X-Name-Last: Durante
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Author-Name: Joshua T. Vogelstein
Author-X-Name-First: Joshua T.
Author-X-Name-Last: Vogelstein
Title: Nonparametric Bayes Modeling of Populations of Networks
Abstract:
Replicated network data are increasingly available in many research fields. For example, in connectomic applications, interconnections among brain regions are collected for each patient under study, motivating statistical models which can flexibly characterize the probabilistic generative mechanism underlying these network-valued data. Available models for a single network are not designed specifically for inference on the entire probability mass function of a network-valued random variable and therefore lack flexibility in characterizing the distribution of relevant topological structures. We propose a flexible Bayesian nonparametric approach for modeling the population distribution of network-valued data. The joint distribution of the edges is defined via a mixture model that reduces dimensionality and efficiently incorporates network information within each mixture component by leveraging latent space representations. The formulation leads to an efficient Gibbs sampler and provides simple and coherent strategies for inference and goodness-of-fit assessments. We provide theoretical results on the flexibility of our model and illustrate improved performance—compared to state-of-the-art models—in simulations and application to human brain networks. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1516-1530
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2016.1219260
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1219260
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1516-1530
Template-Type: ReDIF-Article 1.0
Author-Name: Xinyu Zhang
Author-X-Name-First: Xinyu
Author-X-Name-Last: Zhang
Author-Name: Haiying Wang
Author-X-Name-First: Haiying
Author-X-Name-Last: Wang
Author-Name: Yanyuan Ma
Author-X-Name-First: Yanyuan
Author-X-Name-Last: Ma
Author-Name: Raymond J. Carroll
Author-X-Name-First: Raymond J.
Author-X-Name-Last: Carroll
Title: Linear Model Selection When Covariates Contain Errors
Abstract:
Prediction precision is arguably the most relevant criterion of a model in practice and is often a sought-after property. A common difficulty with covariates measured with errors is the impossibility of performing prediction evaluation on the data even if a model is completely given without any unknown parameters. We bypass this inherent difficulty by using special properties of moment relations in linear regression models with measurement errors. The end product is a model selection procedure that achieves the same optimality properties that are achieved in classical linear regression models without covariate measurement error. Asymptotically, the procedure selects the model with the minimum prediction error in general, and selects the smallest correct model if the regression relation is indeed linear. Our model selection procedure is useful in prediction when future covariates without measurement error become available, for example, due to improved technology or better management and design of data collection procedures. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1553-1561
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2016.1219262
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1219262
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1553-1561
Template-Type: ReDIF-Article 1.0
Author-Name: Wayne A. Fuller
Author-X-Name-First: Wayne A.
Author-X-Name-Last: Fuller
Author-Name: Jason C. Legg
Author-X-Name-First: Jason C.
Author-X-Name-Last: Legg
Author-Name: Yang Li
Author-X-Name-First: Yang
Author-X-Name-Last: Li
Title: Bootstrap Variance Estimation for Rejective Sampling
Abstract:
Replication procedures have proven useful for variance estimation in large-scale complex surveys. As an extension of bootstrap procedures to rejective samples, we define a bootstrap sample that is a rejective, unequal probability, with-replacement sample selected from the original sample. A modification of the bootstrap with improved performance is suggested for stratified samples with small stratum sizes. Simulations for Poisson and stratified rejective samples support the use of replicates in estimating the variance of the regression estimator for rejective samples.
Journal: Journal of the American Statistical Association
Pages: 1562-1570
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2016.1222285
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1222285
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1562-1570
Template-Type: ReDIF-Article 1.0
Author-Name: Gongjun Xu
Author-X-Name-First: Gongjun
Author-X-Name-Last: Xu
Author-Name: Tony Sit
Author-X-Name-First: Tony
Author-X-Name-Last: Sit
Author-Name: Lan Wang
Author-X-Name-First: Lan
Author-X-Name-Last: Wang
Author-Name: Chiung-Yu Huang
Author-X-Name-First: Chiung-Yu
Author-X-Name-Last: Huang
Title: Estimation and Inference of Quantile Regression for Survival Data Under Biased Sampling
Abstract:
Biased sampling occurs frequently in economics, epidemiology, and medical studies, either by design or due to the data collection mechanism. Failing to take the sampling bias into account usually leads to incorrect inference. We propose a unified estimation procedure and a computationally fast resampling method to make statistical inference for quantile regression with survival data under general biased sampling schemes, including but not limited to length-biased sampling, the case-cohort design, and variants thereof. We establish the uniform consistency and weak convergence of the proposed estimator as a process of the quantile level. We also investigate more efficient estimation using the generalized method of moments and derive the asymptotic normality. We further propose a new resampling method for inference, which differs from alternative procedures in that it does not require repeatedly solving estimating equations. It is proved that the resampling method consistently estimates the asymptotic covariance matrix. The unified framework proposed in this article provides researchers and practitioners with a convenient tool for analyzing data collected from various designs. Simulation studies and applications to real datasets are presented for illustration. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1571-1586
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2016.1222286
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1222286
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1571-1586
Template-Type: ReDIF-Article 1.0
Author-Name: Kyle R. White
Author-X-Name-First: Kyle R.
Author-X-Name-Last: White
Author-Name: Leonard A. Stefanski
Author-X-Name-First: Leonard A.
Author-X-Name-Last: Stefanski
Author-Name: Yichao Wu
Author-X-Name-First: Yichao
Author-X-Name-Last: Wu
Title: Variable Selection in Kernel Regression Using Measurement Error Selection Likelihoods
Abstract:
This article develops a nonparametric shrinkage and selection estimator via the measurement error selection likelihood approach recently proposed by Stefanski, Wu, and White. The measurement error kernel regression operator (MEKRO) has the same form as the Nadaraya–Watson kernel estimator, but optimizes a measurement error model selection likelihood to estimate the kernel bandwidths. Much like LASSO or COSSO solution paths, MEKRO results in solution paths depending on a tuning parameter that controls shrinkage and selection via a bound on the harmonic mean of the pseudo-measurement error standard deviations. We use small-sample-corrected AIC to select the tuning parameter. Large-sample properties of MEKRO are studied and small-sample properties are explored via Monte Carlo experiments and applications to data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1587-1597
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2016.1222287
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1222287
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1587-1597
Template-Type: ReDIF-Article 1.0
Author-Name: Michalis K. Titsias
Author-X-Name-First: Michalis K.
Author-X-Name-Last: Titsias
Author-Name: Christopher Yau
Author-X-Name-First: Christopher
Author-X-Name-Last: Yau
Title: The Hamming Ball Sampler
Abstract:
We introduce the Hamming ball sampler, a novel Markov chain Monte Carlo algorithm, for efficient inference in statistical models involving high-dimensional discrete state spaces. The sampling scheme uses an auxiliary variable construction that adaptively truncates the model space allowing iterative exploration of the full model space. The approach generalizes conventional Gibbs sampling schemes for discrete spaces and provides an intuitive means for user-controlled balance between statistical efficiency and computational tractability. We illustrate the generic utility of our sampling algorithm through application to a range of statistical models. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1598-1611
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2016.1222288
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1222288
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1598-1611
Template-Type: ReDIF-Article 1.0
Author-Name: Xu He
Author-X-Name-First: Xu
Author-X-Name-Last: He
Title: Rotated Sphere Packing Designs
Abstract:
We propose a new class of space-filling designs called rotated sphere packing designs for computer experiments. The approach starts from the asymptotically optimal positioning of identical balls that covers the unit cube. Properly scaled, rotated, translated, and extracted, such designs are excellent under the maximin distance criterion, low in discrepancy, good in projective uniformity, and thus useful for both prediction and numerical integration purposes. We provide a fast algorithm to construct such designs for any number of dimensions and points, with R code available online. Theoretical and numerical results are also provided. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1612-1622
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2016.1222289
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1222289
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1612-1622
Template-Type: ReDIF-Article 1.0
Author-Name: Mingyuan Zhou
Author-X-Name-First: Mingyuan
Author-X-Name-Last: Zhou
Author-Name: Stefano Favaro
Author-X-Name-First: Stefano
Author-X-Name-Last: Favaro
Author-Name: Stephen G. Walker
Author-X-Name-First: Stephen G.
Author-X-Name-Last: Walker
Title: Frequency of Frequencies Distributions and Size-Dependent Exchangeable Random Partitions
Abstract:
Motivated by the fundamental problem of modeling the frequency of frequencies (FoF) distribution, this article introduces the concept of a cluster structure to define a probability function that governs the joint distribution of a random count and its exchangeable random partitions. A cluster structure, naturally arising from a completely random measure mixed Poisson process, allows the probability distribution of the random partitions of a subset of a population to be dependent on the population size, a distinct and motivated feature that makes it more flexible than a partition structure. This allows it to model an entire FoF distribution whose structural properties change as the population size varies. An FoF vector can be simulated by drawing an infinite number of Poisson random variables, or by a stick-breaking construction with a finite random number of steps. A generalized negative binomial process model is proposed to generate a cluster structure, where in the prior the number of clusters is finite and Poisson distributed, and the cluster sizes follow a truncated negative binomial distribution. We propose a simple Gibbs sampling algorithm to extrapolate the FoF vector of a population given the FoF vector of a sample taken without replacement from the population. We illustrate our results and demonstrate the advantages of the proposed models through the analysis of real text, genomic, and survey data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1623-1635
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2016.1222290
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1222290
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1623-1635
Template-Type: ReDIF-Article 1.0
Author-Name: Pieralberto Guarniero
Author-X-Name-First: Pieralberto
Author-X-Name-Last: Guarniero
Author-Name: Adam M. Johansen
Author-X-Name-First: Adam M.
Author-X-Name-Last: Johansen
Author-Name: Anthony Lee
Author-X-Name-First: Anthony
Author-X-Name-Last: Lee
Title: The Iterated Auxiliary Particle Filter
Abstract:
We present an offline, iterated particle filter to facilitate statistical inference in general state space hidden Markov models. Given a model and a sequence of observations, the associated marginal likelihood L is central to likelihood-based inference for unknown statistical parameters. We define a class of “twisted” models: each member is specified by a sequence of positive functions ψ and has an associated ψ-auxiliary particle filter that provides unbiased estimates of L. We identify a sequence ψ* that is optimal in the sense that the ψ*-auxiliary particle filter’s estimate of L has zero variance. In practical applications, ψ* is unknown so the ψ*-auxiliary particle filter cannot straightforwardly be implemented. We use an iterative scheme to approximate ψ* and demonstrate empirically that the resulting iterated auxiliary particle filter significantly outperforms the bootstrap particle filter in challenging settings. Applications include parameter estimation using a particle Markov chain Monte Carlo algorithm.
Journal: Journal of the American Statistical Association
Pages: 1636-1647
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2016.1222291
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1222291
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1636-1647
Template-Type: ReDIF-Article 1.0
Author-Name: Shujie Ma
Author-X-Name-First: Shujie
Author-X-Name-Last: Ma
Author-Name: Yanyuan Ma
Author-X-Name-First: Yanyuan
Author-X-Name-Last: Ma
Author-Name: Yanqing Wang
Author-X-Name-First: Yanqing
Author-X-Name-Last: Wang
Author-Name: Eli S. Kravitz
Author-X-Name-First: Eli S.
Author-X-Name-Last: Kravitz
Author-Name: Raymond J. Carroll
Author-X-Name-First: Raymond J.
Author-X-Name-Last: Carroll
Title: A Semiparametric Single-Index Risk Score Across Populations
Abstract:
We consider a problem motivated by issues in nutritional epidemiology, across diseases and populations. In this area, it is becoming increasingly common for diseases to be modeled by a single diet score, such as the Healthy Eating Index, the Mediterranean Diet Score, etc. For each disease and for each population, a partially linear single-index model is fit. The partially linear aspect of the problem is allowed to differ in each population and disease. However, and crucially, the single-index itself, having to do with the diet score, is common to all diseases and populations, and the nonparametrically estimated functions of the single-index are the same up to a scale parameter. Using B-splines with an increasing number of knots, we develop a method to solve the problem, and display its asymptotic theory. An application to the NIH-AARP Study of Diet and Health is described, where we show the advantages of using multiple diseases and populations simultaneously rather than one at a time in understanding the effect of increased milk consumption. Simulations illustrate the properties of the methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1648-1662
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2016.1222944
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1222944
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1648-1662
Template-Type: ReDIF-Article 1.0
Author-Name: Susanne M. Schennach
Author-X-Name-First: Susanne M.
Author-X-Name-Last: Schennach
Author-Name: Daniel Wilhelm
Author-X-Name-First: Daniel
Author-X-Name-Last: Wilhelm
Title: A Simple Parametric Model Selection Test
Abstract:
We propose a simple model selection test for choosing between two parametric likelihoods, which can be applied in the most general setting without any assumptions on the relation between the candidate models and the true distribution. That is, both, one, or neither of the models may be correctly specified or misspecified; they may be nested, nonnested, strictly nonnested, or overlapping. Unlike in previous testing approaches, no pretesting is needed, since in each case the same test statistic together with a standard normal critical value can be used. The new procedure controls asymptotic size uniformly over a large class of data-generating processes. We demonstrate its finite sample properties in a Monte Carlo experiment and its practical relevance in an empirical application comparing Keynesian versus new classical macroeconomic models. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1663-1674
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2016.1224716
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1224716
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1663-1674
Template-Type: ReDIF-Article 1.0
Author-Name: Yong-Dao Zhou
Author-X-Name-First: Yong-Dao
Author-X-Name-Last: Zhou
Author-Name: Hongquan Xu
Author-X-Name-First: Hongquan
Author-X-Name-Last: Xu
Title: Composite Designs Based on Orthogonal Arrays and Definitive Screening Designs
Abstract:
Central composite designs are widely used in practice for factor screening and building response surface models. We study two classes of new composite designs. The first class consists of a two-level factorial design and a three-level orthogonal array; the second consists of a two-level factorial design and a three-level definitive screening design. We derive bounds on their efficiencies for estimating all and part of the parameters in a second-order model and obtain some general theoretical results. New composite designs are constructed. They are more efficient than central composite designs and other existing designs. Supplementary materials are available online.
Journal: Journal of the American Statistical Association
Pages: 1675-1683
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2016.1228535
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1228535
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1675-1683
Template-Type: ReDIF-Article 1.0
Author-Name: Yen-Chi Chen
Author-X-Name-First: Yen-Chi
Author-X-Name-Last: Chen
Author-Name: Christopher R. Genovese
Author-X-Name-First: Christopher R.
Author-X-Name-Last: Genovese
Author-Name: Larry Wasserman
Author-X-Name-First: Larry
Author-X-Name-Last: Wasserman
Title: Density Level Sets: Asymptotics, Inference, and Visualization
Abstract:
We study the plug-in estimator for density level sets under Hausdorff loss. We derive asymptotic theory for this estimator, and based on this theory, we develop two bootstrap confidence regions for level sets. We introduce a new technique for visualizing density level sets, even in multidimensions, which is easy to interpret and efficient to compute. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1684-1696
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2016.1228536
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1228536
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1684-1696
Template-Type: ReDIF-Article 1.0
Author-Name: Shizhe Chen
Author-X-Name-First: Shizhe
Author-X-Name-Last: Chen
Author-Name: Ali Shojaie
Author-X-Name-First: Ali
Author-X-Name-Last: Shojaie
Author-Name: Daniela M. Witten
Author-X-Name-First: Daniela M.
Author-X-Name-Last: Witten
Title: Network Reconstruction From High-Dimensional Ordinary Differential Equations
Abstract:
We consider the task of learning a dynamical system from high-dimensional time-course data. For instance, we might wish to estimate a gene regulatory network from gene expression data measured at discrete time points. We model the dynamical system nonparametrically as a system of additive ordinary differential equations. Most existing methods for parameter estimation in ordinary differential equations estimate the derivatives from noisy observations. This is known to be challenging and inefficient. We propose a novel approach that does not involve derivative estimation. We show that the proposed method can consistently recover the true network structure even in high dimensions, and we demonstrate empirical improvement over competing approaches. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1697-1707
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2016.1229197
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1229197
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1697-1707
Template-Type: ReDIF-Article 1.0
Author-Name: Daniel Manrique-Vallier
Author-X-Name-First: Daniel
Author-X-Name-Last: Manrique-Vallier
Author-Name: Jerome P. Reiter
Author-X-Name-First: Jerome P.
Author-X-Name-Last: Reiter
Title: Bayesian Simultaneous Edit and Imputation for Multivariate Categorical Data
Abstract:
In categorical data, it is typically the case that some combinations of variables are theoretically impossible, such as a 3-year-old child who is married or a man who is pregnant. In practice, however, reported values often include such structural zeros due to, for example, respondent mistakes or data processing errors. To purge data of such errors, many statistical organizations use a process known as edit-imputation. The basic idea is first to select reported values to change according to some heuristic or loss function, and second to replace those values with plausible imputations. This two-stage process typically does not fully use information in the data when determining locations of errors, nor does it appropriately reflect uncertainty resulting from the edits and imputations. We present an alternative approach to editing and imputation for categorical microdata with structural zeros that addresses these shortcomings. Specifically, we use a Bayesian hierarchical model that couples a stochastic model for the measurement error process with a Dirichlet process mixture of multinomial distributions for the underlying, error-free values. The latter model is restricted to have support only on the set of theoretically possible combinations. We illustrate this integrated approach to editing and imputation using simulation studies with data from the 2000 U.S. census, and compare it to a two-stage edit-imputation routine. Supplementary material is available online.
Journal: Journal of the American Statistical Association
Pages: 1708-1719
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2016.1231612
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1231612
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1708-1719
Template-Type: ReDIF-Article 1.0
Author-Name: Taisuke Otsu
Author-X-Name-First: Taisuke
Author-X-Name-Last: Otsu
Author-Name: Yoshiyasu Rai
Author-X-Name-First: Yoshiyasu
Author-X-Name-Last: Rai
Title: Bootstrap Inference of Matching Estimators for Average Treatment Effects
Abstract:
It is known that the naive bootstrap is not asymptotically valid for a matching estimator of the average treatment effect with a fixed number of matches. In this article, we propose asymptotically valid inference methods for matching estimators based on the weighted bootstrap. The key is to construct bootstrap counterparts by resampling based on certain linear forms of the estimators. Our weighted bootstrap is applicable to the matching estimators of both the average treatment effect and its counterpart for the treated population. Also, by incorporating the bias correction method of Abadie and Imbens (2011), our method can be asymptotically valid even for matching based on a vector of covariates. A simulation study indicates that the weighted bootstrap method compares favorably with the asymptotic normal approximation. As an empirical illustration, we apply the proposed method to the National Supported Work data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1720-1732
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2016.1231613
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1231613
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1720-1732
Template-Type: ReDIF-Article 1.0
Author-Name: Karthik Bharath
Author-X-Name-First: Karthik
Author-X-Name-Last: Bharath
Author-Name: Prabhanjan Kambadur
Author-X-Name-First: Prabhanjan
Author-X-Name-Last: Kambadur
Author-Name: Dipak K. Dey
Author-X-Name-First: Dipak K.
Author-X-Name-Last: Dey
Author-Name: Arvind Rao
Author-X-Name-First: Arvind
Author-X-Name-Last: Rao
Author-Name: Veerabhadran Baladandayuthapani
Author-X-Name-First: Veerabhadran
Author-X-Name-Last: Baladandayuthapani
Title: Statistical Tests for Large Tree-Structured Data
Abstract:
We develop a general statistical framework for the analysis and inference of large tree-structured data, with a focus on developing asymptotic goodness-of-fit tests. We first propose a consistent statistical model for binary trees, from which we develop a class of invariant tests. Using the model for binary trees, we then construct tests for general trees by using the distributional properties of the continuum random tree, which arises as the invariant limit for a broad class of models for tree-structured data based on conditioned Galton–Watson processes. The test statistics for the goodness-of-fit tests are simple to compute and are asymptotically distributed as χ² and F random variables. We illustrate our methods on an important application of detecting tumor heterogeneity in brain cancer. We use a novel approach with tree-based representations of magnetic resonance images and employ the developed tests to ascertain tumor heterogeneity between two groups of patients. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1733-1743
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2016.1240081
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1240081
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1733-1743
Template-Type: ReDIF-Article 1.0
Author-Name: Yacine Aït-Sahalia
Author-X-Name-First: Yacine
Author-X-Name-Last: Aït-Sahalia
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Roger J. A. Laeven
Author-X-Name-First: Roger J. A.
Author-X-Name-Last: Laeven
Author-Name: Christina Dan Wang
Author-X-Name-First: Christina Dan
Author-X-Name-Last: Wang
Author-Name: Xiye Yang
Author-X-Name-First: Xiye
Author-X-Name-Last: Yang
Title: Estimation of the Continuous and Discontinuous Leverage Effects
Abstract:
This article examines the leverage effect, or the generally negative covariation between asset returns and their changes in volatility, under a general setup that allows the log-price and volatility processes to be Itô semimartingales. We decompose the leverage effect into continuous and discontinuous parts and develop statistical methods to estimate them. We establish the asymptotic properties of these estimators. We also extend our methods and results (for the continuous leverage) to the situation where there is market microstructure noise in the observed returns. We show in Monte Carlo simulations that our estimators have good finite sample performance. When applying our methods to real data, our empirical results provide convincing evidence of the presence of the two leverage effects, especially the discontinuous one. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1744-1758
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2016.1240082
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1240082
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1744-1758
Template-Type: ReDIF-Article 1.0
Author-Name: Wei Liu
Author-X-Name-First: Wei
Author-X-Name-Last: Liu
Author-Name: Zhiwei Zhang
Author-X-Name-First: Zhiwei
Author-X-Name-Last: Zhang
Author-Name: Lei Nie
Author-X-Name-First: Lei
Author-X-Name-Last: Nie
Author-Name: Guoxing Soon
Author-X-Name-First: Guoxing
Author-X-Name-Last: Soon
Title: A Case Study in Personalized Medicine: Rilpivirine Versus Efavirenz for Treatment-Naive HIV Patients
Abstract:
Rilpivirine and efavirenz are two major nonnucleoside reverse transcriptase inhibitors currently available in the U.S. for treatment-naive adult patients infected with human immunodeficiency virus (HIV). Two randomized clinical trials comparing the two drugs suggested that their relative efficacy may depend on baseline viral load and CD4 cell count. This article is concerned with the potential utilities of these biomarkers in developing individualized treatment regimes that attempt to maximize the virologic response rate or the median of a composite outcome that combines virologic response with change in CD4 cell count (dCD4). Working with the median composite outcome removes the need to assign numerical values to the composite outcome, as would be necessary if we were to maximize its mean, and reduces the influence of extreme dCD4 values. To estimate the target quantities for a given treatment regime, we use G-computation, inverse probability weighting (IPW), and augmented IPW methods to deal with censoring and missing data under a monotone coarsening framework. The resulting estimates form the basis for optimization in a class of candidate regimes indexed by a small number of parameters. A cross-validation procedure is used to remove the resubstitution bias in evaluating an optimized treatment regime. Application of these methods to the HIV trial data yields candidate regimes of different forms together with cross-validated performance measure estimates, which suggest that optimized treatment regimes may be able to improve virologic response (but not the composite outcome) over uniform regimes that prescribe one drug for all patients. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1381-1392
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1280404
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1280404
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1381-1392
Template-Type: ReDIF-Article 1.0
Author-Name: Chuan Hong
Author-X-Name-First: Chuan
Author-X-Name-Last: Hong
Author-Name: Yang Ning
Author-X-Name-First: Yang
Author-X-Name-Last: Ning
Author-Name: Shuang Wang
Author-X-Name-First: Shuang
Author-X-Name-Last: Wang
Author-Name: Hao Wu
Author-X-Name-First: Hao
Author-X-Name-Last: Wu
Author-Name: Raymond J. Carroll
Author-X-Name-First: Raymond J.
Author-X-Name-Last: Carroll
Author-Name: Yong Chen
Author-X-Name-First: Yong
Author-X-Name-Last: Chen
Title: PLEMT: A Novel Pseudolikelihood-Based EM Test for Homogeneity in Generalized Exponential Tilt Mixture Models
Abstract:
Motivated by analyses of DNA methylation data, we propose a semiparametric mixture model, namely, the generalized exponential tilt mixture model, to account for heterogeneity between differentially methylated and nondifferentially methylated subjects in the cancer group, and capture the differences in higher order moments (e.g., mean and variance) between subjects in cancer and normal groups. A pairwise pseudolikelihood is constructed to eliminate the unknown nuisance function. To circumvent boundary and nonidentifiability problems as in parametric mixture models, we modify the pseudolikelihood by adding a penalty function. In addition, a test with a simple asymptotic distribution has computational advantages over permutation-based tests for high-dimensional genetic or epigenetic data. We propose a pseudolikelihood-based expectation–maximization test, and show the proposed test follows a simple chi-squared limiting distribution. Simulation studies show that the proposed test controls Type I errors well and has better power compared to several current tests. In particular, the proposed test outperforms the commonly used tests under all simulation settings considered, especially when there are variance differences between two groups. The proposed test is applied to a real dataset to identify differentially methylated sites between ovarian cancer subjects and normal subjects. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1393-1404
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1280405
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1280405
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1393-1404
Template-Type: ReDIF-Article 1.0
Author-Name: Robert T. Krafty
Author-X-Name-First: Robert T.
Author-X-Name-Last: Krafty
Author-Name: Ori Rosen
Author-X-Name-First: Ori
Author-X-Name-Last: Rosen
Author-Name: David S. Stoffer
Author-X-Name-First: David S.
Author-X-Name-Last: Stoffer
Author-Name: Daniel J. Buysse
Author-X-Name-First: Daniel J.
Author-X-Name-Last: Buysse
Author-Name: Martica H. Hall
Author-X-Name-First: Martica H.
Author-X-Name-Last: Hall
Title: Conditional Spectral Analysis of Replicated Multiple Time Series With Application to Nocturnal Physiology
Abstract:
This article considers the problem of analyzing associations between power spectra of multiple time series and cross-sectional outcomes when data are observed from multiple subjects. The motivating application comes from sleep medicine, where researchers are able to noninvasively record physiological time series signals during sleep. The frequency patterns of these signals, which can be quantified through the power spectrum, contain interpretable information about biological processes. An important problem in sleep research is drawing connections between power spectra of time series signals and clinical characteristics; these connections are key to understanding biological pathways through which sleep affects, and can be treated to improve, health. Such analyses are challenging as they must overcome the complicated structure of a power spectrum from multiple time series as a complex positive-definite matrix-valued function. This article proposes a new approach to such analyses based on a tensor-product spline model of Cholesky components of outcome-dependent power spectra. The approach flexibly models power spectra as nonparametric functions of frequency and outcome while preserving geometric constraints. Formulated in a fully Bayesian framework, a Whittle likelihood-based Markov chain Monte Carlo (MCMC) algorithm is developed for automated model fitting and for conducting inference on associations between outcomes and spectral measures. The method is used to analyze data from a study of sleep in older adults and uncovers new insights into how stress and arousal are connected to the amount of time one spends in bed. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1405-1416
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1281811
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1281811
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1405-1416
Template-Type: ReDIF-Article 1.0
Author-Name: Xiaodong Li
Author-X-Name-First: Xiaodong
Author-X-Name-Last: Li
Author-Name: Xu He
Author-X-Name-First: Xu
Author-X-Name-Last: He
Author-Name: Yuanzhen He
Author-X-Name-First: Yuanzhen
Author-X-Name-Last: He
Author-Name: Hui Zhang
Author-X-Name-First: Hui
Author-X-Name-Last: Zhang
Author-Name: Zhong Zhang
Author-X-Name-First: Zhong
Author-X-Name-Last: Zhang
Author-Name: Dennis K. J. Lin
Author-X-Name-First: Dennis K. J.
Author-X-Name-Last: Lin
Title: The Design and Analysis for the Icing Wind Tunnel Experiment of a New Deicing Coating
Abstract:
A new kind of deicing coating is developed to provide aircraft with efficient and durable protection from icing-induced dangers. The icing wind tunnel experiment is indispensable in confirming the usefulness of a deicing coating. Due to the high cost of each batch relative to the available budget, an efficient design of the icing wind tunnel experiment is crucial. The challenges in designing this experiment are multi-fold. It involves between-block factors and within-block factors, incomplete blocking with random effects, related factors, hard-to-change factors, and nuisance factors. Traditional designs and theories cannot be directly applied. To overcome these challenges, we propose using a step-by-step design strategy that includes applying a cross array structure for between-block factors and within-block factors, a group of balanced conditions for optimizing incomplete blocking, a run order method to achieve the minimum number of level changes for hard-to-change factors, and a zero aliased matrix for the nuisance factors. New (theoretical) results for D-optimal design of incomplete blocking experiments with random block effects and minimum number of level changes are obtained. Results of the experiments show that this novel deicing coating is promising in offering both high efficiency of ice reduction and a long service lifetime. The methodology proposed here is generalizable to other applications that involve nonstandard design problems. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1417-1429
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1281812
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1281812
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1417-1429
Template-Type: ReDIF-Article 1.0
Author-Name: Boyu Ren
Author-X-Name-First: Boyu
Author-X-Name-Last: Ren
Author-Name: Sergio Bacallado
Author-X-Name-First: Sergio
Author-X-Name-Last: Bacallado
Author-Name: Stefano Favaro
Author-X-Name-First: Stefano
Author-X-Name-Last: Favaro
Author-Name: Susan Holmes
Author-X-Name-First: Susan
Author-X-Name-Last: Holmes
Author-Name: Lorenzo Trippa
Author-X-Name-First: Lorenzo
Author-X-Name-Last: Trippa
Title: Bayesian Nonparametric Ordination for the Analysis of Microbial Communities
Abstract:
Human microbiome studies use sequencing technologies to measure the abundance of bacterial species or Operational Taxonomic Units (OTUs) in samples of biological material. Typically the data are organized in contingency tables with OTU counts across heterogeneous biological samples. In the microbial ecology community, ordination methods are frequently used to investigate latent factors or clusters that capture and describe variations of OTU counts across biological samples. It remains important to evaluate how uncertainty in estimates of each biological sample’s microbial distribution propagates to ordination analyses, including visualization of clusters and projections of biological samples on low-dimensional spaces. We propose a Bayesian analysis for dependent distributions to endow frequently used ordinations with estimates of uncertainty. A Bayesian nonparametric prior for dependent normalized random measures is constructed, which is marginally equivalent to the normalized generalized Gamma process, a well-known prior for nonparametric analyses. In our prior, the dependence and similarity between microbial distributions is represented by latent factors that concentrate in a low-dimensional space. We use a shrinkage prior to tune the dimensionality of the latent factors. The resulting posterior samples of model parameters can be used to evaluate uncertainty in analyses routinely applied in microbiome studies. Specifically, by combining them with multivariate data analysis techniques we can visualize credible regions in ecological ordination plots. The characteristics of the proposed model are illustrated through a simulation study and applications in two microbiome datasets. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1430-1442
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1288631
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1288631
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1430-1442
Template-Type: ReDIF-Article 1.0
Author-Name: Caleb H. Miles
Author-X-Name-First: Caleb H.
Author-X-Name-Last: Miles
Author-Name: Ilya Shpitser
Author-X-Name-First: Ilya
Author-X-Name-Last: Shpitser
Author-Name: Phyllis Kanki
Author-X-Name-First: Phyllis
Author-X-Name-Last: Kanki
Author-Name: Seema Meloni
Author-X-Name-First: Seema
Author-X-Name-Last: Meloni
Author-Name: Eric J. Tchetgen Tchetgen
Author-X-Name-First: Eric J.
Author-X-Name-Last: Tchetgen Tchetgen
Title: Quantifying an Adherence Path-Specific Effect of Antiretroviral Therapy in the Nigeria PEPFAR Program
Abstract:
Since the early 2000s, evidence has accumulated for a significant differential effect of first-line antiretroviral therapy (ART) regimens on human immunodeficiency virus (HIV) viral load suppression. This finding was replicated in our data from the Harvard President’s Emergency Plan for AIDS Relief (PEPFAR) program in Nigeria. Investigators were interested in finding the source of these differences, that is, understanding the mechanisms through which one regimen outperforms another, particularly via adherence. This question can be naturally formulated via mediation analysis with adherence playing the role of a mediator. Existing mediation analysis results, however, have relied on an assumption of no exposure-induced confounding of the intermediate variable, and generally require an assumption of no unmeasured confounding for nonparametric identification. Both assumptions are violated by the presence of drug toxicity. In this article, we relax these assumptions and show that certain path-specific effects remain identified under weaker conditions. We focus on the path-specific effect solely mediated by adherence and not by toxicity and propose an estimator for this effect. We illustrate with simulations and present results from a study applying the methodology to the Harvard PEPFAR data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1443-1452
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1295862
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1295862
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1443-1452
Template-Type: ReDIF-Article 1.0
Author-Name: K. Sham Bhat
Author-X-Name-First: K. Sham
Author-X-Name-Last: Bhat
Author-Name: David S. Mebane
Author-X-Name-First: David S.
Author-X-Name-Last: Mebane
Author-Name: Priyadarshi Mahapatra
Author-X-Name-First: Priyadarshi
Author-X-Name-Last: Mahapatra
Author-Name: Curtis B. Storlie
Author-X-Name-First: Curtis B.
Author-X-Name-Last: Storlie
Title: Upscaling Uncertainty with Dynamic Discrepancy for a Multi-Scale Carbon Capture System
Abstract:
Uncertainties from model parameters and model discrepancy from small-scale models impact the accuracy and reliability of predictions of large-scale systems. Inadequate representation of these uncertainties may result in inaccurate and overconfident predictions during scale-up to larger systems. Hence, multiscale modeling efforts must accurately quantify the effect of the propagation of uncertainties during upscaling. Using a Bayesian approach, we calibrate a small-scale solid sorbent model to thermogravimetric (TGA) data on a functional profile using chemistry-based priors. Crucial to this effort is the representation of model discrepancy, which uses a Bayesian smoothing splines (BSS-ANOVA) framework. Our uncertainty quantification (UQ) approach could be considered intrusive as it includes the discrepancy function within the chemical rate expressions, resulting in a set of stochastic differential equations. Such an approach allows for easily propagating uncertainty by propagating the joint model parameter and discrepancy posterior into the larger-scale system of rate expressions. The broad UQ framework presented here could be applicable to virtually all areas of science where multiscale modeling is used. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1453-1467
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1295863
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1295863
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1453-1467
Template-Type: ReDIF-Article 1.0
Author-Name: Ran Tao
Author-X-Name-First: Ran
Author-X-Name-Last: Tao
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: Dan-Yu Lin
Author-X-Name-First: Dan-Yu
Author-X-Name-Last: Lin
Title: Efficient Semiparametric Inference Under Two-Phase Sampling, With Applications to Genetic Association Studies
Abstract:
In modern epidemiological and clinical studies, the covariates of interest may involve genome sequencing, biomarker assay, or medical imaging and thus are prohibitively expensive to measure on a large number of subjects. A cost-effective solution is the two-phase design, under which the outcome and inexpensive covariates are observed for all subjects during the first phase and that information is used to select subjects for measurements of expensive covariates during the second phase. For example, subjects with extreme values of quantitative traits were selected for whole-exome sequencing in the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project (ESP). Herein, we consider general two-phase designs, where the outcome can be continuous or discrete, and inexpensive covariates can be continuous and correlated with expensive covariates. We propose a semiparametric approach to regression analysis by approximating the conditional density functions of expensive covariates given inexpensive covariates with B-spline sieves. We devise a computationally efficient and numerically stable EM-algorithm to maximize the sieve likelihood. In addition, we establish the consistency, asymptotic normality, and asymptotic efficiency of the estimators. Furthermore, we demonstrate the superiority of the proposed methods over existing ones through extensive simulation studies. Finally, we present applications to the aforementioned NHLBI ESP. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1468-1476
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1295864
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1295864
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1468-1476
Template-Type: ReDIF-Article 1.0
Author-Name: Xinran Li
Author-X-Name-First: Xinran
Author-X-Name-Last: Li
Author-Name: Peng Ding
Author-X-Name-First: Peng
Author-X-Name-Last: Ding
Title: General Forms of Finite Population Central Limit Theorems with Applications to Causal Inference
Abstract:
Frequentists’ inference often delivers point estimators associated with confidence intervals or sets for parameters of interest. Constructing the confidence intervals or sets requires understanding the sampling distributions of the point estimators, which, in many but not all cases, are related to asymptotic Normal distributions ensured by central limit theorems. Although previous literature has established various forms of central limit theorems for statistical inference in super population models, we still need general and convenient forms of central limit theorems for some randomization-based causal analyses of experimental data, where the parameters of interest are functions of a finite population and randomness comes solely from the treatment assignment. We use central limit theorems for sample surveys and rank statistics to establish general forms of the finite population central limit theorems that are particularly useful for proving asymptotic distributions of randomization tests under the sharp null hypothesis of zero individual causal effects, and for obtaining the asymptotic repeated sampling distributions of the causal effect estimators. The new central limit theorems hold for general experimental designs with multiple treatment levels, multiple treatment factors and vector outcomes, and are immediately applicable for studying the asymptotic properties of many methods in causal inference, including instrumental variable, regression adjustment, rerandomization, cluster-randomized experiments, and so on. Previously, the asymptotic properties of these problems were often based on heuristic arguments, which in fact rely on general forms of finite population central limit theorems that had not been established before. Our new theorems fill this gap by providing a more solid theoretical foundation for asymptotic randomization-based causal inference. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1759-1769
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1295865
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1295865
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1759-1769
Template-Type: ReDIF-Article 1.0
Author-Name: D. L. Oberski
Author-X-Name-First: D. L.
Author-X-Name-Last: Oberski
Author-Name: A. Kirchner
Author-X-Name-First: A.
Author-X-Name-Last: Kirchner
Author-Name: S. Eckman
Author-X-Name-First: S.
Author-X-Name-Last: Eckman
Author-Name: F. Kreuter
Author-X-Name-First: F.
Author-X-Name-Last: Kreuter
Title: Evaluating the Quality of Survey and Administrative Data with Generalized Multitrait-Multimethod Models
Abstract:
Administrative data are increasingly important in statistics, but, like other types of data, may contain measurement errors. To prevent such errors from invalidating analyses of scientific interest, it is therefore essential to estimate the extent of measurement errors in administrative data. Currently, however, most approaches to evaluate such errors involve either prohibitively expensive audits or comparison with a survey that is assumed perfect. We introduce the “generalized multitrait-multimethod” (GMTMM) model, which can be seen as a general framework for evaluating the quality of administrative and survey data simultaneously. This framework allows both survey and administrative data to contain random and systematic measurement errors. Moreover, it accommodates common features of administrative data such as discreteness, nonlinearity, and nonnormality, improving similar existing models. The use of the GMTMM model is demonstrated by application to linked survey-administrative data from the German Federal Employment Agency on income from employment, and a simulation study evaluates the estimates obtained and their robustness to model misspecification. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1477-1489
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1302338
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1302338
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1477-1489
Template-Type: ReDIF-Article 1.0
Author-Name: Siem Jan Koopman
Author-X-Name-First: Siem Jan
Author-X-Name-Last: Koopman
Author-Name: Rutger Lit
Author-X-Name-First: Rutger
Author-X-Name-Last: Lit
Author-Name: André Lucas
Author-X-Name-First: André
Author-X-Name-Last: Lucas
Title: Intraday Stochastic Volatility in Discrete Price Changes: The Dynamic Skellam Model
Abstract:
We study intraday stochastic volatility for four liquid stocks traded on the New York Stock Exchange using a new dynamic Skellam model for high-frequency tick-by-tick discrete price changes. Since the likelihood function is analytically intractable, we rely on numerical methods for its evaluation. Given the high number of observations per series per day (1000 to 10,000), we adopt computationally efficient methods including Monte Carlo integration. The intraday dynamics of volatility and the high number of trades without price impact require nontrivial adjustments to the basic dynamic Skellam model. In-sample residual diagnostics and goodness-of-fit statistics show that the final model provides a good fit to the data. An extensive day-to-day forecasting study of intraday volatility shows that the dynamic modified Skellam model provides accurate forecasts compared to alternative modeling approaches. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1490-1503
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1302878
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1302878
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1490-1503
Template-Type: ReDIF-Article 1.0
Author-Name: Michel H. Hof
Author-X-Name-First: Michel H.
Author-X-Name-Last: Hof
Author-Name: Anita C. Ravelli
Author-X-Name-First: Anita C.
Author-X-Name-Last: Ravelli
Author-Name: Aeilko H. Zwinderman
Author-X-Name-First: Aeilko H.
Author-X-Name-Last: Zwinderman
Title: A Probabilistic Record Linkage Model for Survival Data
Abstract:
In the absence of a unique identifier, combining information from multiple files relies on partially identifying variables (e.g., gender, initials). With a record linkage procedure, these variables are used to distinguish record pairs that belong together (matches) from record pairs that do not belong together (nonmatches). Generally, the combined strength of the partially identifying variables is too low, causing imperfect linkage; some true nonmatches are identified as matches and, on the other hand, some true matches as nonmatches. To avoid bias in further analyses, it is necessary to correct for imperfect linkage. In this article, pregnancy data from the Perinatal Registry of the Netherlands were used to estimate the associations between the (baseline) characteristics from the first delivery and the time to a second delivery. Because of privacy regulations, no unique identifier was available to determine which pregnancies belonged to the same woman. To deal with imperfect linkage in a time-to-event setting, where we have a file with baseline characteristics and a file with event times, we developed a joint model in which the record linkage procedure and the time-to-event analysis are performed simultaneously. R code and example data are available as online supplemental material.
Journal: Journal of the American Statistical Association
Pages: 1504-1515
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1311262
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1311262
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1504-1515
Template-Type: ReDIF-Article 1.0
Author-Name: Scott W. Linderman
Author-X-Name-First: Scott W.
Author-X-Name-Last: Linderman
Author-Name: David M. Blei
Author-X-Name-First: David M.
Author-X-Name-Last: Blei
Title: Comment: A Discussion of “Nonparametric Bayes Modeling of Populations of Networks”
Journal: Journal of the American Statistical Association
Pages: 1543-1547
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1388244
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1388244
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1543-1547
Template-Type: ReDIF-Article 1.0
Author-Name: Nicholas J. Foti
Author-X-Name-First: Nicholas J.
Author-X-Name-Last: Foti
Author-Name: Emily B. Fox
Author-X-Name-First: Emily B.
Author-X-Name-Last: Fox
Title: Comment: Nonparametric Bayes Modeling of Populations of Networks
Journal: Journal of the American Statistical Association
Pages: 1539-1543
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1388245
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1388245
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1539-1543
Template-Type: ReDIF-Article 1.0
Author-Name: Adrian E. Raftery
Author-X-Name-First: Adrian E.
Author-X-Name-Last: Raftery
Title: Comment: Extending the Latent Position Model for Networks
Journal: Journal of the American Statistical Association
Pages: 1531-1534
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1389736
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1389736
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1531-1534
Template-Type: ReDIF-Article 1.0
Author-Name: Mark S. Handcock
Author-X-Name-First: Mark S.
Author-X-Name-Last: Handcock
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1537-1539
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1389737
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1389737
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1537-1539
Template-Type: ReDIF-Article 1.0
Author-Name: Tamara Broderick
Author-X-Name-First: Tamara
Author-X-Name-Last: Broderick
Title: Comment: Nonparametric Bayes Modeling of Populations of Networks
Journal: Journal of the American Statistical Association
Pages: 1534-1537
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1389738
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1389738
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1534-1537
Template-Type: ReDIF-Article 1.0
Author-Name: Samuel D. Pimentel
Author-X-Name-First: Samuel D.
Author-X-Name-Last: Pimentel
Author-Name: Rachel R. Kelz
Author-X-Name-First: Rachel R.
Author-X-Name-Last: Kelz
Author-Name: Jeffrey H. Silber
Author-X-Name-First: Jeffrey H.
Author-X-Name-Last: Silber
Author-Name: Paul R. Rosenbaum
Author-X-Name-First: Paul R.
Author-X-Name-Last: Rosenbaum
Title: Correction
Journal: Journal of the American Statistical Association
Pages: 1770-1770
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1395640
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1395640
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1770-1770
Template-Type: ReDIF-Article 1.0
Author-Name: Daniele Durante
Author-X-Name-First: Daniele
Author-X-Name-Last: Durante
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Author-Name: Joshua T. Vogelstein
Author-X-Name-First: Joshua T.
Author-X-Name-Last: Vogelstein
Title: Rejoinder: Nonparametric Bayes Modeling of Populations of Networks
Journal: Journal of the American Statistical Association
Pages: 1547-1552
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1395643
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1395643
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1547-1552
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Editorial Collaborators
Journal: Journal of the American Statistical Association
Pages: 1784-1791
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1395645
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1395645
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1784-1791
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Editorial Board EOV
Journal: Journal of the American Statistical Association
Pages: ebi-ebi
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1400347
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1400347
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:ebi-ebi
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Book Reviews
Journal: Journal of the American Statistical Association
Pages: 1771-1783
Issue: 520
Volume: 112
Year: 2017
Month: 10
X-DOI: 10.1080/01621459.2017.1411709
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1411709
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:520:p:1771-1783
Template-Type: ReDIF-Article 1.0
Author-Name: Xin Zhou
Author-X-Name-First: Xin
Author-X-Name-Last: Zhou
Author-Name: Nicole Mayer-Hamblett
Author-X-Name-First: Nicole
Author-X-Name-Last: Mayer-Hamblett
Author-Name: Umer Khan
Author-X-Name-First: Umer
Author-X-Name-Last: Khan
Author-Name: Michael R. Kosorok
Author-X-Name-First: Michael R.
Author-X-Name-Last: Kosorok
Title: Residual Weighted Learning for Estimating Individualized Treatment Rules
Abstract:
Personalized medicine has received increasing attention among statisticians, computer scientists, and clinical practitioners. A major component of personalized medicine is the estimation of individualized treatment rules (ITRs). Recently, Zhao et al. proposed outcome weighted learning (OWL) to construct ITRs that directly optimize the clinical outcome. Although OWL opens the door to introducing machine learning techniques to optimal treatment regimes, it still has some problems in performance. (1) The estimated ITR of OWL is affected by a simple shift of the outcome. (2) The rule from OWL tries to keep treatment assignments that subjects actually received. (3) There is no variable selection mechanism with OWL. All of them weaken the finite sample performance of OWL. In this article, we propose a general framework, called residual weighted learning (RWL), to alleviate these problems, and hence to improve finite sample performance. Unlike OWL which weights misclassification errors by clinical outcomes, RWL weights these errors by residuals of the outcome from a regression fit on clinical covariates excluding treatment assignment. We use the smoothed ramp loss function in RWL and provide a difference of convex (d.c.) algorithm to solve the corresponding nonconvex optimization problem. By estimating residuals with linear models or generalized linear models, RWL can effectively deal with different types of outcomes, such as continuous, binary, and count outcomes. We also propose variable selection methods for linear and nonlinear rules, respectively, to further improve the performance. We show that the resulting estimator of the treatment rule is consistent. We further obtain a rate of convergence for the difference between the expected outcome using the estimated ITR and that of the optimal treatment rule. The performance of the proposed RWL methods is illustrated in simulation studies and in an analysis of cystic fibrosis clinical trial data. 
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 169-187
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2015.1093947
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1093947
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:169-187
Template-Type: ReDIF-Article 1.0
Author-Name: Qing Yang
Author-X-Name-First: Qing
Author-X-Name-Last: Yang
Author-Name: Guangming Pan
Author-X-Name-First: Guangming
Author-X-Name-Last: Pan
Title: Weighted Statistic in Detecting Faint and Sparse Alternatives for High-Dimensional Covariance Matrices
Abstract:
This article considers testing equality of two population covariance matrices when the data dimension p diverges with the sample size n (p/n → c > 0). We propose a weighted test statistic that is data-driven and powerful in both faint alternatives (many small disturbances) and sparse alternatives (several large disturbances). Its asymptotic null distribution is derived by large random matrix theory without assuming the existence of a limiting cumulative distribution function of the population covariance matrix. The simulation results confirm that our statistic is powerful against all alternatives, while other tests given in the literature fail in at least one situation. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 188-200
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2015.1122602
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1122602
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:188-200
Template-Type: ReDIF-Article 1.0
Author-Name: Matthias Katzfuss
Author-X-Name-First: Matthias
Author-X-Name-Last: Katzfuss
Title: A Multi-Resolution Approximation for Massive Spatial Datasets
Abstract:
Automated sensing instruments on satellites and aircraft have enabled the collection of massive amounts of high-resolution observations of spatial fields over large spatial regions. If these datasets can be efficiently exploited, they can provide new insights on a wide variety of issues. However, traditional spatial-statistical techniques such as kriging are not computationally feasible for big datasets. We propose a multi-resolution approximation (M-RA) of Gaussian processes observed at irregular locations in space. The M-RA process is specified as a linear combination of basis functions at multiple levels of spatial resolution, which can capture spatial structure from very fine to very large scales. The basis functions are automatically chosen to approximate a given covariance function, which can be nonstationary. All computations involving the M-RA, including parameter inference and prediction, are highly scalable for massive datasets. Crucially, the inference algorithms can also be parallelized to take full advantage of large distributed-memory computing environments. In comparisons using simulated data and a large satellite dataset, the M-RA outperforms a related state-of-the-art method. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 201-214
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2015.1123632
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1123632
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:201-214
Template-Type: ReDIF-Article 1.0
Author-Name: Thaís C. O. Fonseca
Author-X-Name-First: Thaís C. O.
Author-X-Name-Last: Fonseca
Author-Name: Marco A. R. Ferreira
Author-X-Name-First: Marco A. R.
Author-X-Name-Last: Ferreira
Title: Dynamic Multiscale Spatiotemporal Models for Poisson Data
Abstract:
We propose a new class of dynamic multiscale models for Poisson spatiotemporal processes. Specifically, we use a multiscale spatial Poisson factorization to decompose the Poisson process at each time point into spatiotemporal multiscale coefficients. We then connect these spatiotemporal multiscale coefficients through time with a novel Dirichlet evolution. Further, we propose a simulation-based full Bayesian posterior analysis. In particular, we develop filtering equations for updating of information forward in time and smoothing equations for integration of information backward in time, and use these equations to develop a forward filter backward sampler for the spatiotemporal multiscale coefficients. Because the multiscale coefficients are conditionally independent a posteriori, our full Bayesian posterior analysis is scalable, computationally efficient, and highly parallelizable. Moreover, the Dirichlet evolution of each spatiotemporal multiscale coefficient is parametrized by a discount factor that encodes the relevance of the temporal evolution of the spatiotemporal multiscale coefficient. Therefore, the analysis of discount factors provides a powerful way to identify regions with distinctive spatiotemporal dynamics. Finally, we illustrate the usefulness of our multiscale spatiotemporal Poisson methodology with two applications. The first application examines mortality ratios in the state of Missouri, and the second application considers tornado reports in the American Midwest.
Journal: Journal of the American Statistical Association
Pages: 215-234
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2015.1129968
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1129968
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:215-234
Template-Type: ReDIF-Article 1.0
Author-Name: Shaojun Guo
Author-X-Name-First: Shaojun
Author-X-Name-Last: Guo
Author-Name: John Leigh Box
Author-X-Name-First: John Leigh
Author-X-Name-Last: Box
Author-Name: Wenyang Zhang
Author-X-Name-First: Wenyang
Author-X-Name-Last: Zhang
Title: A Dynamic Structure for High-Dimensional Covariance Matrices and Its Application in Portfolio Allocation
Abstract:
Estimation of high-dimensional covariance matrices is an interesting and important research topic. In this article, we propose a dynamic structure and develop an estimation procedure for high-dimensional covariance matrices. Asymptotic properties are derived to justify the estimation procedure and simulation studies are conducted to demonstrate its performance when the sample size is finite. By exploring a financial application, an empirical study shows that portfolio allocation based on dynamic high-dimensional covariance matrices can significantly outperform the market from 1995 to 2014. Our proposed method also outperforms portfolio allocation based on the sample covariance matrix, the covariance matrix based on factor models, and the shrinkage estimator of covariance matrix. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 235-253
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2015.1129969
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1129969
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:235-253
Template-Type: ReDIF-Article 1.0
Author-Name: David Rossell
Author-X-Name-First: David
Author-X-Name-Last: Rossell
Author-Name: Donatello Telesca
Author-X-Name-First: Donatello
Author-X-Name-Last: Telesca
Title: Nonlocal Priors for High-Dimensional Estimation
Abstract:
Jointly achieving parsimony and good predictive power in high dimensions is a main challenge in statistics. Nonlocal priors (NLPs) possess appealing properties for model choice, but their use for estimation has not been studied in detail. We show that for regular models NLP-based Bayesian model averaging (BMA) shrinks spurious parameters either at fast polynomial or quasi-exponential rates as the sample size n increases, while nonspurious parameter estimates are not shrunk. We extend some results to linear models with dimension p growing with n. Coupled with our theoretical investigations, we outline the constructive representation of NLPs as mixtures of truncated distributions that enables simple posterior sampling and extending NLPs beyond previous proposals. Our results show notable high-dimensional estimation for linear models with p ≫ n at low computational cost. NLPs provided lower estimation error than benchmark and hyper-g priors, SCAD and LASSO in simulations, and in gene expression data achieved higher cross-validated R2 with fewer predictors. Remarkably, these results were obtained without prescreening variables. Our findings contribute to the debate of whether different priors should be used for estimation and model selection, showing that selection priors may actually be desirable for high-dimensional estimation. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 254-265
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2015.1130634
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1130634
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:254-265
Template-Type: ReDIF-Article 1.0
Author-Name: Tao Zou
Author-X-Name-First: Tao
Author-X-Name-Last: Zou
Author-Name: Wei Lan
Author-X-Name-First: Wei
Author-X-Name-Last: Lan
Author-Name: Hansheng Wang
Author-X-Name-First: Hansheng
Author-X-Name-Last: Wang
Author-Name: Chih-Ling Tsai
Author-X-Name-First: Chih-Ling
Author-X-Name-Last: Tsai
Title: Covariance Regression Analysis
Abstract:
This article introduces covariance regression analysis for a p-dimensional response vector. The proposed method explores the regression relationship between the p-dimensional covariance matrix and auxiliary information. We study three types of estimators: maximum likelihood, ordinary least squares, and feasible generalized least squares estimators. Then, we demonstrate that these regression estimators are consistent and asymptotically normal. Furthermore, we obtain the high dimensional and large sample properties of the corresponding covariance matrix estimators. Simulation experiments are presented to demonstrate the performance of both regression and covariance matrix estimates. An example is analyzed from the Chinese stock market to illustrate the usefulness of the proposed covariance regression model. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 266-281
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2015.1131699
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1131699
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:266-281
Template-Type: ReDIF-Article 1.0
Author-Name: Jack Kuipers
Author-X-Name-First: Jack
Author-X-Name-Last: Kuipers
Author-Name: Giusi Moffa
Author-X-Name-First: Giusi
Author-X-Name-Last: Moffa
Title: Partition MCMC for Inference on Acyclic Digraphs
Abstract:
Acyclic digraphs are the underlying representation of Bayesian networks, a widely used class of probabilistic graphical models. Learning the underlying graph from data is a way of gaining insights about the structural properties of a domain. Structure learning forms one of the inference challenges of statistical graphical models. Markov chain Monte Carlo (MCMC) methods, notably structure MCMC, which sample graphs from the posterior distribution given the data, are probably the only viable option for Bayesian model averaging. Score modularity and restrictions on the number of parents of each node allow the graphs to be grouped into larger collections, which can be scored as a whole to improve the chain’s convergence. Current examples of algorithms taking advantage of grouping are the biased order MCMC, which acts on the alternative space of permuted triangular matrices, and nonergodic edge reversal moves. Here, we propose a novel algorithm, which employs the underlying combinatorial structure of DAGs to define a new grouping. As a result, convergence is improved compared to structure MCMC, while still retaining the property of producing an unbiased sample. Finally, the method can be combined with edge reversal moves to improve the sampler further. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 282-299
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2015.1133426
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1133426
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:282-299
Template-Type: ReDIF-Article 1.0
Author-Name: Jonghyun Yun
Author-X-Name-First: Jonghyun
Author-X-Name-Last: Yun
Author-Name: Fan Yang
Author-X-Name-First: Fan
Author-X-Name-Last: Yang
Author-Name: Yuguo Chen
Author-X-Name-First: Yuguo
Author-X-Name-Last: Chen
Title: Augmented Particle Filters
Abstract:
Particle filters have been widely used for online filtering problems in state–space models (SSMs). The currently available proposal distributions depend either only on the state dynamics, or only on the observation, or on both sources of information but are not available for general SSMs. In this article, we develop a new particle filtering algorithm, called the augmented particle filter (APF), for online filtering problems in SSMs. The APF combines two sets of particles from the observation equation and the state equation, and the state space is augmented to facilitate the weight computation. Theoretical justification of the APF is provided, and the connection between the APF and the optimal particle filter (OPF) in some special SSMs is investigated. The APF shares similar properties with the OPF, but the APF can be applied to a much wider range of models than the OPF. Simulation studies show that the APF performs similarly to or better than the OPF when the OPF is available, and the APF can perform better than other filtering algorithms in the literature when the OPF is not available.
Journal: Journal of the American Statistical Association
Pages: 300-313
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2015.1135803
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1135803
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:300-313
Template-Type: ReDIF-Article 1.0
Author-Name: Roderick J. Little
Author-X-Name-First: Roderick J.
Author-X-Name-Last: Little
Author-Name: Donald B. Rubin
Author-X-Name-First: Donald B.
Author-X-Name-Last: Rubin
Author-Name: Sahar Z. Zangeneh
Author-X-Name-First: Sahar Z.
Author-X-Name-Last: Zangeneh
Title: Conditions for Ignoring the Missing-Data Mechanism in Likelihood Inferences for Parameter Subsets
Abstract:
For likelihood-based inferences from data with missing values, models are generally needed for both the data and the missing-data mechanism. However, modeling the mechanism can be challenging, and parameters are often poorly identified. Rubin in 1976 showed that for likelihood and Bayesian inference, sufficient conditions for ignoring the missing data mechanism are (a) the missing data are missing at random (MAR), in the sense that missingness does not depend on the missing values after conditioning on the observed data and (b) the parameters of the data model and the missingness mechanism are distinct, that is, there are no a priori ties, via parameter space restrictions or prior distributions, between these two sets of parameters. These conditions are sufficient but not always necessary, and they relate to the full vector of parameters of the data model. We propose definitions of partially MAR and ignorability for a subvector of the parameters of particular substantive interest, for direct likelihood/Bayesian and frequentist likelihood-based inference. We apply these definitions to a variety of examples. We also discuss conditioning on the pattern of missingness, as an alternative strategy for avoiding the need to model the missingness mechanism.
Journal: Journal of the American Statistical Association
Pages: 314-320
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2015.1136826
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1136826
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:314-320
Template-Type: ReDIF-Article 1.0
Author-Name: Colin B. Fogarty
Author-X-Name-First: Colin B.
Author-X-Name-Last: Fogarty
Author-Name: Pixu Shi
Author-X-Name-First: Pixu
Author-X-Name-Last: Shi
Author-Name: Mark E. Mikkelsen
Author-X-Name-First: Mark E.
Author-X-Name-Last: Mikkelsen
Author-Name: Dylan S. Small
Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small
Title: Randomization Inference and Sensitivity Analysis for Composite Null Hypotheses With Binary Outcomes in Matched Observational Studies
Abstract:
We present methods for conducting hypothesis testing and sensitivity analyses for composite null hypotheses in matched observational studies when outcomes are binary. Causal estimands discussed include the causal risk difference, causal risk ratio, and the effect ratio. We show that inference under the assumption of no unmeasured confounding can be performed by solving an integer linear program, while inference allowing for unmeasured confounding of a given strength requires solving an integer quadratic program. Through simulation studies and data examples, we demonstrate that our formulation allows these problems to be solved in an expedient manner even for large datasets and for large strata. We further exhibit that through our formulation, one can assess the impact of various assumptions about the potential outcomes on the performed inference. R scripts are provided that implement our methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 321-331
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1138865
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1138865
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:321-331
Template-Type: ReDIF-Article 1.0
Author-Name: Jia Li
Author-X-Name-First: Jia
Author-X-Name-Last: Li
Author-Name: Viktor Todorov
Author-X-Name-First: Viktor
Author-X-Name-Last: Todorov
Author-Name: George Tauchen
Author-X-Name-First: George
Author-X-Name-Last: Tauchen
Title: Robust Jump Regressions
Abstract:
We develop robust inference methods for studying linear dependence between the jumps of discretely observed processes at high frequency. Unlike classical linear regressions, jump regressions are determined by a small number of jumps occurring over a fixed time interval and the rest of the components of the processes around the jump times. The latter are the continuous martingale parts of the processes as well as observation noise. By sampling more frequently, the role of these components, which are hidden in the observed price, shrinks asymptotically. The robustness of our inference procedure is with respect to outliers, which are of particular importance in the current setting of a relatively small number of jump observations. This is achieved by using nonsmooth loss functions (like L1) in the estimation. Unlike classical robust methods, the limit of the objective function here remains nonsmooth. The proposed method is also robust to measurement error in the observed processes, which is achieved by locally smoothing the high-frequency increments. In an empirical application to financial data, we illustrate the usefulness of the robust techniques by contrasting the behavior of robust and ordinary least squares (OLS)-type jump regressions in periods including disruptions of the financial markets such as so-called “flash crashes.” Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 332-341
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1138866
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1138866
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:332-341
Template-Type: ReDIF-Article 1.0
Author-Name: Yuan Huang
Author-X-Name-First: Yuan
Author-X-Name-Last: Huang
Author-Name: Qingzhao Zhang
Author-X-Name-First: Qingzhao
Author-X-Name-Last: Zhang
Author-Name: Sanguo Zhang
Author-X-Name-First: Sanguo
Author-X-Name-Last: Zhang
Author-Name: Jian Huang
Author-X-Name-First: Jian
Author-X-Name-Last: Huang
Author-Name: Shuangge Ma
Author-X-Name-First: Shuangge
Author-X-Name-Last: Ma
Title: Promoting Similarity of Sparsity Structures in Integrative Analysis With Penalization
Abstract:
For data with high-dimensional covariates but small sample sizes, the analysis of single datasets often generates unsatisfactory results. The integrative analysis of multiple independent datasets provides an effective way of pooling information and outperforms single-dataset and several alternative multi-dataset methods. Under many scenarios, multiple datasets are expected to share common important covariates, that is, the corresponding models have similarity in their sparsity structures. However, the existing methods do not have a mechanism to promote the similarity in sparsity structures in integrative analysis. In this study, we consider penalized variable selection and estimation in integrative analysis. We develop an L0-penalty-based method, which explicitly promotes the similarity in sparsity structures. Computationally, it is realized using a coordinate descent algorithm. Theoretically, it has the selection and estimation consistency properties. Under a wide spectrum of simulation scenarios, it has identification and estimation performance comparable to or better than the alternatives. In the analysis of three lung cancer datasets with gene expression measurements, it identifies genes with sound biological implications and satisfactory prediction performance. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 342-350
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1139497
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1139497
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:342-350
Template-Type: ReDIF-Article 1.0
Author-Name: Kwun Chuen Gary Chan
Author-X-Name-First: Kwun Chuen Gary
Author-X-Name-Last: Chan
Author-Name: Mei-Cheng Wang
Author-X-Name-First: Mei-Cheng
Author-X-Name-Last: Wang
Title: Semiparametric Modeling and Estimation of the Terminal Behavior of Recurrent Marker Processes Before Failure Events
Abstract:
Recurrent event processes with marker measurements are largely studied with forward time models starting from an initial event. Interestingly, the processes could exhibit important terminal behavior during a time period before occurrence of the failure event. A natural and direct way to study recurrent events prior to a failure event is to align the processes using the failure event as the time origin and to examine the terminal behavior by a backward time model. This article studies regression models for backward recurrent marker processes by counting time backward from the failure event. A three-level semiparametric regression model is proposed for jointly modeling the time to a failure event, the backward recurrent event process, and the marker observed at the time of each backward recurrent event. The first level is a proportional hazards model for the failure time, the second level is a proportional rate model for the recurrent events occurring before the failure event, and the third level is a proportional mean model for the marker given the occurrence of a recurrent event backward in time. By jointly modeling the three components, estimating equations can be constructed for marked counting processes to estimate the target parameters in the three-level regression models. Large sample properties of the proposed estimators are established. The proposed models and methods are illustrated by a community-based AIDS clinical trial to examine the terminal behavior of frequencies and severities of opportunistic infections among HIV-infected individuals in the last 6 months of life.
Journal: Journal of the American Statistical Association
Pages: 351-362
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1140051
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1140051
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:351-362
Template-Type: ReDIF-Article 1.0
Author-Name: Simón Lunagómez
Author-X-Name-First: Simón
Author-X-Name-Last: Lunagómez
Author-Name: Sayan Mukherjee
Author-X-Name-First: Sayan
Author-X-Name-Last: Mukherjee
Author-Name: Robert L. Wolpert
Author-X-Name-First: Robert L.
Author-X-Name-Last: Wolpert
Author-Name: Edoardo M. Airoldi
Author-X-Name-First: Edoardo M.
Author-X-Name-Last: Airoldi
Title: Geometric Representations of Random Hypergraphs
Abstract:
We introduce a novel parameterization of distributions on hypergraphs based on the geometry of points in ${\mathbb{R}}^d$. The idea is to induce distributions on hypergraphs by placing priors on point configurations via spatial processes. This specification is then used to infer conditional independence models, or Markov structure, for multivariate distributions. This approach results in a broader class of conditional independence models beyond standard graphical models. Factorizations that cannot be retrieved via a graph are possible. Inference of nondecomposable graphical models is possible without requiring decomposability, or the need of Gaussian assumptions. This approach leads to new Metropolis-Hastings Markov chain Monte Carlo algorithms with both local and global moves in graph space, generally offers greater control on the distribution of graph features than currently possible, and naturally extends to hypergraphs. We provide a comparative performance evaluation against state-of-the-art approaches, and illustrate the utility of this approach on simulated and real data.
Journal: Journal of the American Statistical Association
Pages: 363-383
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1141686
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1141686
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:363-383
Template-Type: ReDIF-Article 1.0
Author-Name: Ilze Kalnina
Author-X-Name-First: Ilze
Author-X-Name-Last: Kalnina
Author-Name: Dacheng Xiu
Author-X-Name-First: Dacheng
Author-X-Name-Last: Xiu
Title: Nonparametric Estimation of the Leverage Effect: A Trade-Off Between Robustness and Efficiency
Abstract:
We consider two new approaches to nonparametric estimation of the leverage effect. The first approach uses stock prices alone. The second approach uses the data on stock prices as well as a certain volatility instrument, such as the Chicago Board Options Exchange (CBOE) volatility index (VIX) or the Black–Scholes implied volatility. The theoretical justification for the instrument-based estimator relies on a certain invariance property, which can be exploited when high-frequency data are available. The price-only estimator is more robust since it is valid under weaker assumptions. However, in the presence of a valid volatility instrument, the price-only estimator is inefficient as the instrument-based estimator has a faster rate of convergence. We consider an empirical application, in which we study the relationship between the leverage effect and the debt-to-equity ratio, credit risk, and illiquidity. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 384-396
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1141687
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1141687
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:384-396
Template-Type: ReDIF-Article 1.0
Author-Name: Hao Chen
Author-X-Name-First: Hao
Author-X-Name-Last: Chen
Author-Name: Jerome H. Friedman
Author-X-Name-First: Jerome H.
Author-X-Name-Last: Friedman
Title: A New Graph-Based Two-Sample Test for Multivariate and Object Data
Abstract:
Two-sample tests for multivariate data and especially for non-Euclidean data are not well explored. This article presents a novel test statistic based on a similarity graph constructed on the pooled observations from the two samples. It can be applied to multivariate data and non-Euclidean data as long as a dissimilarity measure on the sample space can be defined, which can usually be provided by domain experts. Existing tests based on a similarity graph lack power either for location or for scale alternatives. The new test uses a common pattern that was overlooked previously, and works for both types of alternatives. The test exhibits substantial power gains in simulation studies. Its asymptotic permutation null distribution is derived and shown to work well under finite samples, facilitating its application to large datasets. The new test is illustrated on two applications: The assessment of covariate balance in a matched observational study, and the comparison of network data under different conditions.
Journal: Journal of the American Statistical Association
Pages: 397-409
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1147356
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1147356
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:397-409
Template-Type: ReDIF-Article 1.0
Author-Name: Shujie Ma
Author-X-Name-First: Shujie
Author-X-Name-Last: Ma
Author-Name: Jian Huang
Author-X-Name-First: Jian
Author-X-Name-Last: Huang
Title: A Concave Pairwise Fusion Approach to Subgroup Analysis
Abstract:
An important step in developing individualized treatment strategies is correct identification of subgroups of a heterogeneous population to allow specific treatment for each subgroup. This article considers the problem using samples drawn from a population consisting of subgroups with different mean values, along with certain covariates. We propose a penalized approach for subgroup analysis based on a regression model, in which heterogeneity is driven by unobserved latent factors and thus can be represented by using subject-specific intercepts. We apply concave penalty functions to pairwise differences of the intercepts. This procedure automatically divides the observations into subgroups. To implement the proposed approach, we develop an alternating direction method of multipliers algorithm with concave penalties and demonstrate its convergence. We also establish the theoretical properties of our proposed estimator and determine the order requirement of the minimal difference of signals between groups to recover them. These results provide a sound basis for making statistical inference in subgroup analysis. Our proposed method is further illustrated by simulation studies and analysis of a Cleveland heart disease dataset. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 410-423
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1148039
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1148039
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:410-423
Template-Type: ReDIF-Article 1.0
Author-Name: Mark Fiecas
Author-X-Name-First: Mark
Author-X-Name-Last: Fiecas
Author-Name: Jürgen Franke
Author-X-Name-First: Jürgen
Author-X-Name-Last: Franke
Author-Name: Rainer von Sachs
Author-X-Name-First: Rainer
Author-X-Name-Last: von Sachs
Author-Name: Joseph Tadjuidje Kamgaing
Author-X-Name-First: Joseph
Author-X-Name-Last: Tadjuidje Kamgaing
Title: Shrinkage Estimation for Multivariate Hidden Markov Models
Abstract:
Motivated by a changing market environment over time, we consider high-dimensional data, such as financial returns, generated by a hidden Markov model that allows for switching between different regimes or states. To obtain more stable estimates of the covariance matrices of the different states, potentially driven by a number of observations that is small compared to the dimension, we modify the expectation–maximization (EM) algorithm so that it yields shrinkage estimators for the covariance matrices. The final algorithm turns out to produce better estimates not only for the covariance matrices but also for the transition matrix. It results in a more stable and reliable filter that allows for reconstructing the values of the hidden Markov chain. In addition to a simulation study performed in this article, we also present a series of theoretical results that include dimensionality asymptotics and provide the motivation for certain techniques used in the algorithm. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 424-435
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1148608
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1148608
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:424-435
Template-Type: ReDIF-Article 1.0
Author-Name: Andreas Alfons
Author-X-Name-First: Andreas
Author-X-Name-Last: Alfons
Author-Name: Christophe Croux
Author-X-Name-First: Christophe
Author-X-Name-Last: Croux
Author-Name: Peter Filzmoser
Author-X-Name-First: Peter
Author-X-Name-Last: Filzmoser
Title: Robust Maximum Association Estimators
Abstract:
The maximum association between two multivariate variables $\boldsymbol{X}$ and $\boldsymbol{Y}$ is defined as the maximal value that a bivariate association measure between one-dimensional projections ${\boldsymbol{\alpha}}^{t}\boldsymbol{X}$ and ${\boldsymbol{\beta}}^{t}\boldsymbol{Y}$ can attain. Taking the Pearson correlation as projection index results in the first canonical correlation coefficient. We propose to use more robust association measures, such as Spearman’s or Kendall’s rank correlation, or association measures derived from bivariate scatter matrices. We study the robustness of the proposed maximum association measures and the corresponding estimators of the coefficients yielding the maximum association. In the important special case of $\boldsymbol{Y}$ being univariate, maximum rank correlation estimators yield regression estimators that are invariant against monotonic transformations of the response. We obtain asymptotic variances for this special case. It turns out that maximum rank correlation estimators combine good efficiency and robustness properties. Simulations and a real data example illustrate the robustness and the power for handling nonlinear relationships of these estimators. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 436-445
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1148609
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1148609
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:436-445
Template-Type: ReDIF-Article 1.0
Author-Name: Andreas Hagemann
Author-X-Name-First: Andreas
Author-X-Name-Last: Hagemann
Title: Cluster-Robust Bootstrap Inference in Quantile Regression Models
Abstract:
In this article I develop a wild bootstrap procedure for cluster-robust inference in linear quantile regression models. I show that the bootstrap leads to asymptotically valid inference on the entire quantile regression process in a setting with a large number of small, heterogeneous clusters and provides consistent estimates of the asymptotic covariance function of that process. The proposed bootstrap procedure is easy to implement and performs well even when the number of clusters is much smaller than the sample size. An application to Project STAR data is provided. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 446-456
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1148610
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1148610
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:446-456
Template-Type: ReDIF-Article 1.0
Author-Name: Thomas A. Murray
Author-X-Name-First: Thomas A.
Author-X-Name-Last: Murray
Author-Name: Peter F. Thall
Author-X-Name-First: Peter F.
Author-X-Name-Last: Thall
Author-Name: Ying Yuan
Author-X-Name-First: Ying
Author-X-Name-Last: Yuan
Author-Name: Sarah McAvoy
Author-X-Name-First: Sarah
Author-X-Name-Last: McAvoy
Author-Name: Daniel R. Gomez
Author-X-Name-First: Daniel R.
Author-X-Name-Last: Gomez
Title: Robust Treatment Comparison Based on Utilities of Semi-Competing Risks in Non-Small-Cell Lung Cancer
Abstract:
A design is presented for a randomized clinical trial comparing two second-line treatments, chemotherapy versus chemotherapy plus reirradiation, for treatment of recurrent non-small-cell lung cancer. The central research question is whether the potential efficacy benefit that adding reirradiation to chemotherapy may provide justifies its potential for increasing the risk of toxicity. The design uses two co-primary outcomes: time to disease progression or death, and time to severe toxicity. Because patients may be given an active third-line treatment at disease progression that confounds second-line treatment effects on toxicity and survival following disease progression, for the purpose of this comparative study follow-up ends at disease progression or death. In contrast, follow-up for disease progression or death continues after severe toxicity, so these are semi-competing risks. A conditionally conjugate Bayesian model that is robust to misspecification is formulated using piecewise exponential distributions. A numerical utility function is elicited from the physicians that characterizes desirabilities of the possible co-primary outcome realizations. A comparative test based on posterior mean utilities is proposed. A simulation study is presented to evaluate test performance for a variety of treatment differences, and a sensitivity assessment to the elicited utility function is performed. General guidelines are given for constructing a design in similar settings, and a computer program for simulation and trial conduct is provided. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 11-23
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1176926
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1176926
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:11-23
Template-Type: ReDIF-Article 1.0
Author-Name: J. L. Scealy
Author-X-Name-First: J. L.
Author-X-Name-Last: Scealy
Author-Name: A. H. Welsh
Author-X-Name-First: A. H.
Author-X-Name-Last: Welsh
Title: A Directional Mixed Effects Model for Compositional Expenditure Data
Abstract:
Compositional data are vectors of proportions defined on the unit simplex, and this type of constrained data occurs frequently in government surveys. It is also possible for compositional data to be correlated due to the clustering or grouping of the observations within small domains or areas. We propose a new class of mixed models for compositional data based on the Kent distribution for directional data, where the random effects also have Kent distributions. One useful property of the new directional mixed model is that the marginal mean direction has a closed form and is interpretable. The random effects enter the model in a multiplicative way via the product of a set of rotation matrices, and the conditional mean direction is a random rotation of the marginal mean direction. In small area estimation settings, the mean proportions are usually of primary interest, and these are shown to be simple functions of the marginal mean direction. For estimation, we apply a quasi-likelihood method which results in solving a new set of generalized estimating equations, and these are shown to have low bias in typical situations. For inference, we use a nonparametric bootstrap method for clustered data which does not rely on estimates of the shape parameters (shape parameters are difficult to estimate in Kent models). We analyze data from the 2009–2010 Australian Household Expenditure Survey CURF (confidentialized unit record file). We predict the proportions of total weekly expenditure on food and housing costs for households in a chosen set of domains. The new approach is shown to be more tractable than the traditional approach based on the logratio transformation.
Journal: Journal of the American Statistical Association
Pages: 24-36
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1189336
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1189336
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:24-36
Template-Type: ReDIF-Article 1.0
Author-Name: Bledar A. Konomi
Author-X-Name-First: Bledar A.
Author-X-Name-Last: Konomi
Author-Name: Georgios Karagiannis
Author-X-Name-First: Georgios
Author-X-Name-Last: Karagiannis
Author-Name: Kevin Lai
Author-X-Name-First: Kevin
Author-X-Name-Last: Lai
Author-Name: Guang Lin
Author-X-Name-First: Guang
Author-X-Name-Last: Lin
Title: Bayesian Treed Calibration: An Application to Carbon Capture With AX Sorbent
Abstract:
In cases where field (or experimental) measurements are not available, computer models can simulate real physical or engineering systems to reproduce their outcomes. They are usually calibrated in light of experimental data to create a better representation of the real system. Statistical methods based on Gaussian processes for calibration and prediction have been especially important when the computer models are expensive and experimental data are limited. In this article, we develop the Bayesian treed calibration (BTC) as an extension of standard Gaussian process calibration methods to deal with nonstationary computer models and/or their discrepancy from the field (or experimental) data. Our proposed method partitions both the calibration and observable input space, based on a binary tree partitioning, into subregions where existing model calibration methods can be applied to connect a computer model with the real system. The estimation of the parameters in the proposed model is carried out using Markov chain Monte Carlo (MCMC) computational techniques. Different strategies have been applied to improve mixing. We illustrate our method in two artificial examples and a real application that concerns the capture of carbon dioxide with AX amine-based sorbents. The source code and the examples analyzed in this article are available as part of the supplementary materials.
Journal: Journal of the American Statistical Association
Pages: 37-53
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1190279
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1190279
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:37-53
Template-Type: ReDIF-Article 1.0
Author-Name: Audrey Mauguen
Author-X-Name-First: Audrey
Author-X-Name-Last: Mauguen
Author-Name: Emily C. Zabor
Author-X-Name-First: Emily C.
Author-X-Name-Last: Zabor
Author-Name: Nancy E. Thomas
Author-X-Name-First: Nancy E.
Author-X-Name-Last: Thomas
Author-Name: Marianne Berwick
Author-X-Name-First: Marianne
Author-X-Name-Last: Berwick
Author-Name: Venkatraman E. Seshan
Author-X-Name-First: Venkatraman E.
Author-X-Name-Last: Seshan
Author-Name: Colin B. Begg
Author-X-Name-First: Colin B.
Author-X-Name-Last: Begg
Title: Defining Cancer Subtypes With Distinctive Etiologic Profiles: An Application to the Epidemiology of Melanoma
Abstract:
We showcase a novel analytic strategy to identify subtypes of cancer that possess distinctive causal factors, that is, subtypes that are “etiologically” distinct. The method involves the integrated analysis of two types of study design: an incident series of cases with double primary cancers with detailed information on tumor characteristics that can be used to define the subtypes; a case-series of incident cases with information on known risk factors that can be used to investigate the specific risk factors that distinguish the subtypes. The methods are applied to a rich melanoma dataset with detailed information on pathologic tumor factors, and comprehensive information on known genetic and environmental risk factors for melanoma. Identification of the optimal subtyping solution is accomplished using a novel clustering analysis that seeks to maximize a measure that characterizes the distinctiveness of the distributions of risk factors across the subtypes and that is a function of the correlations of tumor factors in the case-specific tumor pairs. This analysis is challenged by the presence of extensive missing data. If successful, studies of this nature offer the opportunity for efficient study design to identify unknown risk factors whose effects are concentrated in defined subtypes. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 54-63
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1191499
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1191499
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:54-63
Template-Type: ReDIF-Article 1.0
Author-Name: Ian Barnett
Author-X-Name-First: Ian
Author-X-Name-Last: Barnett
Author-Name: Rajarshi Mukherjee
Author-X-Name-First: Rajarshi
Author-X-Name-Last: Mukherjee
Author-Name: Xihong Lin
Author-X-Name-First: Xihong
Author-X-Name-Last: Lin
Title: The Generalized Higher Criticism for Testing SNP-Set Effects in Genetic Association Studies
Abstract:
It is of substantial interest to study the effects of genes, genetic pathways, and networks on the risk of complex diseases. These genetic constructs each contain multiple SNPs, which are often correlated and function jointly, and might be large in number. However, only a sparse subset of SNPs in a genetic construct is generally associated with the disease of interest. In this article, we propose the generalized higher criticism (GHC) to test for the association between an SNP-set and a disease outcome. The higher criticism is a test traditionally used in high-dimensional signal detection settings when marginal test statistics are independent and the number of parameters is very large. However, these assumptions do not always hold in genetic association studies, due to linkage disequilibrium among SNPs and the finite number of SNPs in an SNP-set in each genetic construct. The proposed GHC overcomes the limitations of the higher criticism by allowing for arbitrary correlation structures among the SNPs in an SNP-set, while performing accurate analytic p-value calculations for any finite number of SNPs in the SNP-set. We obtain the detection boundary of the GHC test. Using simulations, we empirically compare the power of the GHC method with that of existing SNP-set tests over a range of genetic regions with varied correlation structures and signal sparsity. We apply the proposed methods to analyze the CGEM breast cancer genome-wide association study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 64-76
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1192039
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1192039
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:64-76
Template-Type: ReDIF-Article 1.0
Author-Name: Sungmin Kim
Author-X-Name-First: Sungmin
Author-X-Name-Last: Kim
Author-Name: Kevin Potter
Author-X-Name-First: Kevin
Author-X-Name-Last: Potter
Author-Name: Peter F. Craigmile
Author-X-Name-First: Peter F.
Author-X-Name-Last: Craigmile
Author-Name: Mario Peruggia
Author-X-Name-First: Mario
Author-X-Name-Last: Peruggia
Author-Name: Trisha Van Zandt
Author-X-Name-First: Trisha
Author-X-Name-Last: Van Zandt
Title: A Bayesian Race Model for Recognition Memory
Abstract:
Many psychological models use the idea of a trace, which represents a change in a person’s cognitive state that arises as a result of processing a given stimulus. These models assume that a trace is always laid down when a stimulus is processed. In addition, some of these models explain how response times (RTs) and response accuracies arise from a process in which the different traces race against each other. In this article, we present a Bayesian hierarchical model of RT and accuracy in a difficult recognition memory experiment. The model includes a stochastic component that probabilistically determines whether a trace is laid down. The RTs and accuracies are modeled using a minimum gamma race model, with extra model components that allow for the effects of stimulus, sequential dependencies, and trend. Subject-specific effects, as well as ancillary effects due to processes such as perceptual encoding and guessing, are also captured in the hierarchy. Predictive checks show that our model fits the data well. Marginal likelihood evaluations show better predictive performance of our model compared to an approximate Weibull model. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 77-91
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1194844
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1194844
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:77-91
Template-Type: ReDIF-Article 1.0
Author-Name: Curtis B. Storlie
Author-X-Name-First: Curtis B.
Author-X-Name-Last: Storlie
Author-Name: Brian J. Reich
Author-X-Name-First: Brian J.
Author-X-Name-Last: Reich
Author-Name: William N. Rust
Author-X-Name-First: William N.
Author-X-Name-Last: Rust
Author-Name: Lawrence O. Ticknor
Author-X-Name-First: Lawrence O.
Author-X-Name-Last: Ticknor
Author-Name: Amanda M. Bonnie
Author-X-Name-First: Amanda M.
Author-X-Name-Last: Bonnie
Author-Name: Andrew J. Montoya
Author-X-Name-First: Andrew J.
Author-X-Name-Last: Montoya
Author-Name: Sarah E. Michalak
Author-X-Name-First: Sarah E.
Author-X-Name-Last: Michalak
Title: Spatiotemporal Modeling of Node Temperatures in Supercomputers
Abstract:
Los Alamos National Laboratory is home to many large supercomputing clusters. These clusters require an enormous amount of power (∼500–2000 kW each), and most of this energy is converted into heat. Thus, cooling the components of the supercomputer becomes a critical and expensive endeavor. Recently, a project was initiated to investigate the effect that changes to the cooling system in a machine room had on three large machines that were housed there. Coupled with this goal was the aim to develop a general good practice for characterizing the effect of cooling changes and monitoring machine node temperatures in this and other machine rooms. This article focuses on the statistical approach used to quantify the effect that several cooling changes to the room had on the temperatures of the individual nodes of the computers. The largest cluster in the room has 1600 nodes that run a variety of jobs during general use. Since extreme temperatures are important, a normal distribution plus a generalized Pareto distribution for the upper tail is used to model the marginal distribution, along with a Gaussian process copula to account for spatio-temporal dependence. A Gaussian Markov random field (GMRF) model is used to model the spatial effects on the node temperatures as the cooling changes take place. This model is then used to assess the condition of the node temperatures after each change to the room. The analysis approach was used to uncover the cause of a problematic episode of overheating nodes on one of the supercomputing clusters. This same approach can easily be applied to monitor and investigate cooling systems at other data centers as well. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 92-108
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1195271
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1195271
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:92-108
Template-Type: ReDIF-Article 1.0
Author-Name: M. P. Wand
Author-X-Name-First: M. P.
Author-X-Name-Last: Wand
Title: Fast Approximate Inference for Arbitrarily Large Semiparametric Regression Models via Message Passing
Abstract:
We show how the notion of message passing can be used to streamline the algebra and computer coding for fast approximate inference in large Bayesian semiparametric regression models. In particular, this approach is amenable to handling arbitrarily large models of particular types once a set of primitive operations is established. The approach is founded upon a message passing formulation of mean field variational Bayes that utilizes factor graph representations of statistical models. The underlying principles apply to general Bayesian hierarchical models although we focus on semiparametric regression. The notion of factor graph fragments is introduced and is shown to facilitate compartmentalization of the required algebra and coding. The resultant algorithms have ready-to-implement closed form expressions and allow a broad class of arbitrarily large semiparametric regression models to be handled. Ongoing software projects such as Infer.NET and Stan support variational-type inference for particular model classes. This article is not concerned with software packages per se and focuses on the underlying tenets of scalable variational inference algorithms. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 137-168
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1197833
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1197833
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:137-168
Template-Type: ReDIF-Article 1.0
Author-Name: Michael W. Robbins
Author-X-Name-First: Michael W.
Author-X-Name-Last: Robbins
Author-Name: Jessica Saunders
Author-X-Name-First: Jessica
Author-X-Name-Last: Saunders
Author-Name: Beau Kilmer
Author-X-Name-First: Beau
Author-X-Name-Last: Kilmer
Title: A Framework for Synthetic Control Methods With High-Dimensional, Micro-Level Data: Evaluating a Neighborhood-Specific Crime Intervention
Abstract:
The synthetic control method is an increasingly popular tool for analysis of program efficacy. Here, it is applied to a neighborhood-specific crime intervention in Roanoke, VA, and several novel contributions are made to the synthetic control toolkit. We examine high-dimensional data at a granular level (the treated area has several cases, a large number of untreated comparison cases, and multiple outcome measures). Calibration is used to develop weights that exactly match the synthetic control to the treated region across several outcomes and time periods. Further, we illustrate the importance of adjusting the estimated effect of treatment for the design effect implicit within the weights. A permutation procedure is proposed wherein countless placebo areas can be constructed, enabling estimation of p-values under a robust set of assumptions. An omnibus statistic is introduced that is used to jointly test for the presence of an intervention effect across multiple outcomes and post-intervention time periods. Analyses indicate that the Roanoke crime intervention did decrease crime levels, but the estimated effect of the intervention is not as statistically significant as it would have been had less rigorous approaches been used. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 109-126
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1213634
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1213634
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:109-126
Template-Type: ReDIF-Article 1.0
Author-Name: Brenda López Cabrera
Author-X-Name-First: Brenda López
Author-X-Name-Last: Cabrera
Author-Name: Franziska Schulz
Author-X-Name-First: Franziska
Author-X-Name-Last: Schulz
Title: Forecasting Generalized Quantiles of Electricity Demand: A Functional Data Approach
Abstract:
Electricity load forecasts are an integral part of many decision-making processes in the electricity market. However, most literature on electricity load forecasting concentrates on deterministic forecasts, neglecting possibly important information about uncertainty. A more complete picture of future demand can be obtained by using distributional forecasts, allowing for more efficient decision-making. A predictive density can be fully characterized by tail measures such as quantiles and expectiles. Furthermore, interest often lies in the accurate estimation of tail events rather than in the mean or median. We propose a new methodology to obtain probabilistic forecasts of electricity load that is based on functional data analysis of generalized quantile curves. The core of the methodology is dimension reduction based on functional principal components of tail curves with dependence structure. The approach has several advantages, such as flexible inclusion of explanatory variables like meteorological forecasts and no distributional assumptions. The methodology is applied to load data from a transmission system operator (TSO) and a balancing unit in Germany. Our forecast method is evaluated against other models including the TSO forecast model. It outperforms them in terms of mean absolute percentage error and mean squared error. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 127-136
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1219259
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1219259
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:127-136
Template-Type: ReDIF-Article 1.0
Author-Name: Valen E. Johnson
Author-X-Name-First: Valen E.
Author-X-Name-Last: Johnson
Author-Name: Richard D. Payne
Author-X-Name-First: Richard D.
Author-X-Name-Last: Payne
Author-Name: Tianying Wang
Author-X-Name-First: Tianying
Author-X-Name-Last: Wang
Author-Name: Alex Asher
Author-X-Name-First: Alex
Author-X-Name-Last: Asher
Author-Name: Soutrik Mandal
Author-X-Name-First: Soutrik
Author-X-Name-Last: Mandal
Title: On the Reproducibility of Psychological Science
Abstract:
Investigators from a large consortium of scientists recently performed a multi-year study in which they replicated 100 psychology experiments. Although statistically significant results were reported in 97% of the original studies, statistical significance was achieved in only 36% of the replicated studies. This article presents a reanalysis of these data based on a formal statistical model that accounts for publication bias by treating outcomes from unpublished studies as missing data, while simultaneously estimating the distribution of effect sizes for those studies that tested nonnull effects. The resulting model suggests that more than 90% of tests performed in eligible psychology experiments tested negligible effects, and that publication biases based on p-values caused the observed rates of nonreproducibility. The results of this reanalysis provide a compelling argument for both increasing the threshold required for declaring scientific discoveries and for adopting statistical summaries of evidence that account for the high proportion of tested hypotheses that are false. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1-10
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1240079
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1240079
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:1-10
Template-Type: ReDIF-Article 1.0
Author-Name: Dustin Tran
Author-X-Name-First: Dustin
Author-X-Name-Last: Tran
Author-Name: David M. Blei
Author-X-Name-First: David M.
Author-X-Name-Last: Blei
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 156-158
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1270044
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270044
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:156-158
Template-Type: ReDIF-Article 1.0
Author-Name: Wanzhu Tu
Author-X-Name-First: Wanzhu
Author-X-Name-Last: Tu
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 158-161
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1270045
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270045
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:158-161
Template-Type: ReDIF-Article 1.0
Author-Name: Philip T. Reiss
Author-X-Name-First: Philip T.
Author-X-Name-Last: Reiss
Author-Name: Jeff Goldsmith
Author-X-Name-First: Jeff
Author-X-Name-Last: Goldsmith
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 161-164
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1270049
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270049
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:161-164
Template-Type: ReDIF-Article 1.0
Author-Name: Simon N. Wood
Author-X-Name-First: Simon N.
Author-X-Name-Last: Wood
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 164-166
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1270050
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270050
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:164-166
Template-Type: ReDIF-Article 1.0
Author-Name: M. P. Wand
Author-X-Name-First: M. P.
Author-X-Name-Last: Wand
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 166-168
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1270051
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270051
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:166-168
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Correction
Journal: Journal of the American Statistical Association
Pages: 465-465
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1270057
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270057
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:465-465
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Correction
Abstract:
Articles in the June 2016 issue of the Journal of the American Statistical Association unintentionally omitted some author affiliations. Following is a complete list of authors and their affiliations.
Journal: Journal of the American Statistical Association
Pages: 466-469
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2016.1270064
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270064
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:466-469
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Book Reviews
Journal: Journal of the American Statistical Association
Pages: 457-464
Issue: 517
Volume: 112
Year: 2017
Month: 1
X-DOI: 10.1080/01621459.2017.1286186
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1286186
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:517:p:457-464
Template-Type: ReDIF-Article 1.0
Author-Name: Chung Eun Lee
Author-X-Name-First: Chung Eun
Author-X-Name-Last: Lee
Author-Name: Xiaofeng Shao
Author-X-Name-First: Xiaofeng
Author-X-Name-Last: Shao
Title: Martingale Difference Divergence Matrix and Its Application to Dimension Reduction for Stationary Multivariate Time Series
Abstract:
In this article, we introduce a new methodology to perform dimension reduction for a stationary multivariate time series. Our method is motivated by the consideration of optimal prediction and focuses on reducing the effective dimension of the conditional mean of the time series given the past information. In particular, we seek a contemporaneous linear transformation such that the transformed time series has two parts, with one part being conditionally mean independent of the past. To achieve this goal, we first propose the so-called martingale difference divergence matrix (MDDM), which can quantify the conditional mean independence of V ∈ R^p given U ∈ R^q and also encodes the number and form of linear combinations of V that are conditionally mean independent of U. Our dimension reduction procedure is based on the eigen-decomposition of the cumulative martingale difference divergence matrix, which extends MDDM to the time series context. Interestingly, our dimension reduction framework admits a static factor model representation that differs subtly from the existing static factor model used in the time series literature. Some theory is also provided on the rates of convergence of the eigenvalues and eigenvectors of the sample cumulative MDDM in the fixed-dimensional setting. Favorable finite-sample performance is demonstrated via simulations and real data illustrations in comparison with some existing methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 216-229
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1240083
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1240083
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:216-229
Template-Type: ReDIF-Article 1.0
Author-Name: Susan Athey
Author-X-Name-First: Susan
Author-X-Name-Last: Athey
Author-Name: Dean Eckles
Author-X-Name-First: Dean
Author-X-Name-Last: Eckles
Author-Name: Guido W. Imbens
Author-X-Name-First: Guido W.
Author-X-Name-Last: Imbens
Title: Exact p-Values for Network Interference
Abstract:
We study the calculation of exact p-values for a large class of nonsharp null hypotheses about treatment effects in a setting with data from experiments involving members of a single connected network. The class includes null hypotheses that limit the effect of one unit’s treatment status on another according to the distance between units, for example, the hypothesis might specify that the treatment status of immediate neighbors has no effect, or that units more than two edges away have no effect. We also consider hypotheses concerning the validity of sparsification of a network (e.g., based on the strength of ties) and hypotheses restricting heterogeneity in peer effects (so that, e.g., only the number or fraction treated among neighboring units matters). Our general approach is to define an artificial experiment, such that the null hypothesis that was not sharp for the original experiment is sharp for the artificial experiment, and such that the randomization analysis for the artificial experiment is validated by the design of the original experiment.
Journal: Journal of the American Statistical Association
Pages: 230-240
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1241178
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1241178
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:230-240
Template-Type: ReDIF-Article 1.0
Author-Name: Kehui Chen
Author-X-Name-First: Kehui
Author-X-Name-Last: Chen
Author-Name: Jing Lei
Author-X-Name-First: Jing
Author-X-Name-Last: Lei
Title: Network Cross-Validation for Determining the Number of Communities in Network Data
Abstract:
The stochastic block model (SBM) and its variants have been a popular tool for analyzing large network data with community structures. In this article, we develop an efficient network cross-validation (NCV) approach to determine the number of communities, as well as to choose between the regular stochastic block model and the degree corrected block model (DCBM). The proposed NCV method is based on a block-wise node-pair splitting technique, combined with an integrated step of community recovery using sub-blocks of the adjacency matrix. We prove that the probability of under-selection vanishes as the number of nodes increases, under mild conditions satisfied by a wide range of popular community recovery algorithms. The solid performance of our method is also demonstrated in extensive simulations and two data examples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 241-251
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1246365
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1246365
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:241-251
Template-Type: ReDIF-Article 1.0
Author-Name: Fang Han
Author-X-Name-First: Fang
Author-X-Name-Last: Han
Author-Name: Han Liu
Author-X-Name-First: Han
Author-X-Name-Last: Liu
Title: ECA: High-Dimensional Elliptical Component Analysis in Non-Gaussian Distributions
Abstract:
We present a robust alternative to principal component analysis (PCA)—called elliptical component analysis (ECA)—for analyzing high-dimensional, elliptically distributed data. ECA estimates the eigenspace of the covariance matrix of the elliptical data. To cope with heavy-tailed elliptical distributions, a multivariate rank statistic is exploited. At the model-level, we consider two settings: either that the leading eigenvectors of the covariance matrix are nonsparse or that they are sparse. Methodologically, we propose ECA procedures for both nonsparse and sparse settings. Theoretically, we provide both nonasymptotic and asymptotic analyses quantifying the theoretical performances of ECA. In the nonsparse setting, we show that ECA’s performance is highly related to the effective rank of the covariance matrix. In the sparse setting, the results are twofold: (i) we show that the sparse ECA estimator based on a combinatoric program attains the optimal rate of convergence; (ii) based on some recent developments in estimating sparse leading eigenvectors, we show that a computationally efficient sparse ECA estimator attains the optimal rate of convergence under a suboptimal scaling. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 252-268
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1246366
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1246366
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:252-268
Template-Type: ReDIF-Article 1.0
Author-Name: Jiming Jiang
Author-X-Name-First: Jiming
Author-X-Name-Last: Jiang
Author-Name: J. Sunil Rao
Author-X-Name-First: J. Sunil
Author-X-Name-Last: Rao
Author-Name: Jie Fan
Author-X-Name-First: Jie
Author-X-Name-Last: Fan
Author-Name: Thuan Nguyen
Author-X-Name-First: Thuan
Author-X-Name-Last: Nguyen
Title: Classified Mixed Model Prediction
Abstract:
Many practical problems are related to prediction, where the main interest is at the subject (e.g., personalized medicine) or (small) sub-population (e.g., small community) level. In such cases, it is possible to make substantial gains in prediction accuracy by identifying a class that a new subject belongs to. This way, the new subject is potentially associated with a random effect corresponding to the same class in the training data, so that the method of mixed model prediction can be used to make the best prediction. We propose a new method, called classified mixed model prediction (CMMP), to achieve this goal. We develop CMMP for both prediction of mixed effects and prediction of future observations, and consider different scenarios where there may or may not be a “match” of the new subject among the training-data subjects. Theoretical and empirical studies are carried out to study the properties of CMMP, including prediction intervals based on CMMP, and its comparison with existing methods. In particular, we show that, even if an actual match does not exist between the class of the new observations and those of the training data, CMMP still helps in improving prediction accuracy. Two real-data examples are considered. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 269-279
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1246367
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1246367
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:269-279
Template-Type: ReDIF-Article 1.0
Author-Name: Stephen Reid
Author-X-Name-First: Stephen
Author-X-Name-Last: Reid
Author-Name: Jonathan Taylor
Author-X-Name-First: Jonathan
Author-X-Name-Last: Taylor
Author-Name: Robert Tibshirani
Author-X-Name-First: Robert
Author-X-Name-Last: Tibshirani
Title: A General Framework for Estimation and Inference From Clusters of Features
Abstract:
Applied statistical problems often come with prespecified groupings of predictors. It is natural to test for the presence of simultaneous group-wide signal for groups in isolation, or for multiple groups together. Current tests for the presence of such signals include the classical F-test or a t-test on unsupervised group prototypes (either group centroids or first principal components). In this article, we propose test statistics that aim for power improvements over these classical approaches. In particular, we first create group prototypes, with reference to the response, and then test with likelihood ratio statistics incorporating only these prototypes. We propose a model, called the “prototype model,” which naturally models this two-step procedure. Furthermore, we introduce an inferential schema detailing the unique considerations for different combinations of prototype formation and univariate/multivariate testing models. The prototype model also suggests new applications to estimation and prediction. Prototype formation often relies on variable selection, which invalidates classical Gaussian test theory. We use recent advances in selective inference to account for selection in the prototyping step and retain test validity. Simulation experiments suggest that our testing procedure enjoys more power than do classical approaches. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 280-293
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1246368
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1246368
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:280-293
Template-Type: ReDIF-Article 1.0
Author-Name: Ling Ma
Author-X-Name-First: Ling
Author-X-Name-Last: Ma
Author-Name: Rajeshwari Sundaram
Author-X-Name-First: Rajeshwari
Author-X-Name-Last: Sundaram
Title: Analysis of Gap Times Based on Panel Count Data With Informative Observation Times and Unknown Start Time
Abstract:
In biomedical studies, one is often interested in repeat events for which longitudinal observations occur only intermittently, resulting in panel count data. The first stage of labor, measured through unit increments of cervical dilation in pregnant women, provides such an example. Obstetricians are interested in assessing the gap time distribution of per-unit increments of cervical dilation for better management of the labor process. Typically, only intermittent medical examinations for cervical dilation occur after (already dilated) women are admitted to the hospital, and the observation frequency is very likely correlated with how quickly or slowly a woman dilates. Thus, one could view such data as panel count data with informative observation times and an unknown start time. Here, we propose semiparametric proportional rate models for the event process and the observation process, with a multiplicative subject-specific frailty variable capturing the correlation between the two processes. Inference procedures for the gap times between consecutive events are proposed for when the start times are known as well as when they are unknown, using a likelihood-based approach and estimating equations. The methodology is assessed through simulation studies and the establishment of large-sample properties. A detailed analysis using the proposed methods is applied to data from two studies: the Collaborative Perinatal Project and the Consortium on Safe Labor. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 294-305
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1246369
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1246369
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:294-305
Template-Type: ReDIF-Article 1.0
Author-Name: Emilie Devijver
Author-X-Name-First: Emilie
Author-X-Name-Last: Devijver
Author-Name: Mélina Gallopin
Author-X-Name-First: Mélina
Author-X-Name-Last: Gallopin
Title: Block-Diagonal Covariance Selection for High-Dimensional Gaussian Graphical Models
Abstract:
Gaussian graphical models are widely used to infer and visualize networks of dependencies between continuous variables. However, inferring the graph is difficult when the sample size is small compared to the number of variables. To reduce the number of parameters to estimate in the model, we propose a nonasymptotic model selection procedure supported by strong theoretical guarantees based on an oracle type inequality and a minimax lower bound. The covariance matrix of the model is approximated by a block-diagonal matrix. The structure of this matrix is detected by thresholding the sample covariance matrix, where the threshold is selected using the slope heuristic. Based on the block-diagonal structure of the covariance matrix, the estimation problem is divided into several independent problems: subsequently, the network of dependencies between variables is inferred using the graphical lasso algorithm in each block. The performance of the procedure is illustrated on simulated data. An application to a real gene expression dataset with a limited sample size is also presented: the dimension reduction allows attention to be objectively focused on interactions among smaller subsets of genes, leading to a more parsimonious and interpretable modular network. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 306-314
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1247002
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1247002
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:306-314
Template-Type: ReDIF-Article 1.0
Author-Name: Zhao Chen
Author-X-Name-First: Zhao
Author-X-Name-Last: Chen
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Title: Error Variance Estimation in Ultrahigh-Dimensional Additive Models
Abstract:
Error variance estimation plays an important role in statistical inference for high-dimensional regression models. This article concerns error variance estimation in high-dimensional sparse additive models. We study the asymptotic behavior of the traditional mean squared error, the naive estimate of the error variance, and show that it may significantly underestimate the error variance due to spurious correlations, which are even higher in nonparametric models than in linear models. We further propose an accurate estimate of the error variance in ultrahigh-dimensional sparse additive models by effectively integrating sure independence screening and refitted cross-validation techniques. The root-n consistency and asymptotic normality of the resulting estimate are established. We conduct a Monte Carlo simulation study to examine the finite-sample performance of the newly proposed estimate. A real data example is used to illustrate the proposed methodology. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 315-327
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1251440
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1251440
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:315-327
Template-Type: ReDIF-Article 1.0
Author-Name: Yin Xia
Author-X-Name-First: Yin
Author-X-Name-Last: Xia
Author-Name: Tianxi Cai
Author-X-Name-First: Tianxi
Author-X-Name-Last: Cai
Author-Name: T. Tony Cai
Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai
Title: Multiple Testing of Submatrices of a Precision Matrix With Applications to Identification of Between Pathway Interactions
Abstract:
Making accurate inference for gene regulatory networks, including inferring about pathway-by-pathway interactions, is an important and difficult task. Motivated by such genomic applications, we consider multiple testing for conditional dependence between subgroups of variables. Under a Gaussian graphical model framework, the problem is translated into simultaneous testing for a collection of submatrices of a high-dimensional precision matrix, with each submatrix summarizing the dependence structure between two subgroups of variables. A novel multiple testing procedure is proposed and both theoretical and numerical properties of the procedure are investigated. The asymptotic null distribution of the test statistic for an individual hypothesis is established and the proposed multiple testing procedure is shown to asymptotically control the false discovery rate (FDR) and false discovery proportion (FDP) at the prespecified level under regularity conditions. Simulations show that the procedure works well in controlling the FDR and has good power in detecting the true interactions. The procedure is applied to a breast cancer gene expression study to identify between-pathway interactions. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 328-339
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1251930
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1251930
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:328-339
Template-Type: ReDIF-Article 1.0
Author-Name: Jeffrey W. Miller
Author-X-Name-First: Jeffrey W.
Author-X-Name-Last: Miller
Author-Name: Matthew T. Harrison
Author-X-Name-First: Matthew T.
Author-X-Name-Last: Harrison
Title: Mixture Models With a Prior on the Number of Components
Abstract:
A natural Bayesian approach for mixture models with an unknown number of components is to take the usual finite mixture model with symmetric Dirichlet weights, and put a prior on the number of components—that is, to use a mixture of finite mixtures (MFM). The most commonly used method of inference for MFMs is reversible jump Markov chain Monte Carlo, but it can be nontrivial to design good reversible jump moves, especially in high-dimensional spaces. Meanwhile, there are samplers for Dirichlet process mixture (DPM) models that are relatively simple and are easily adapted to new applications. It turns out that, in fact, many of the essential properties of DPMs are also exhibited by MFMs—an exchangeable partition distribution, restaurant process, random measure representation, and stick-breaking representation—and crucially, the MFM analogues are simple enough that they can be used much like the corresponding DPM properties. Consequently, many of the powerful methods developed for inference in DPMs can be directly applied to MFMs as well; this simplifies the implementation of MFMs and can substantially improve mixing. We illustrate with real and simulated data, including high-dimensional gene expression data used to discriminate cancer subtypes. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 340-356
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1255636
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1255636
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:340-356
Template-Type: ReDIF-Article 1.0
Author-Name: Shengchun Kong
Author-X-Name-First: Shengchun
Author-X-Name-Last: Kong
Author-Name: Bin Nan
Author-X-Name-First: Bin
Author-X-Name-Last: Nan
Author-Name: John D. Kalbfleisch
Author-X-Name-First: John D.
Author-X-Name-Last: Kalbfleisch
Author-Name: Rajiv Saran
Author-X-Name-First: Rajiv
Author-X-Name-Last: Saran
Author-Name: Richard Hirth
Author-X-Name-First: Richard
Author-X-Name-Last: Hirth
Title: Conditional Modeling of Longitudinal Data With Terminal Event
Abstract:
We consider a random effects model for longitudinal data with the occurrence of an informative terminal event that is subject to right censoring. Existing methods for analyzing such data include the joint modeling approach using latent frailty and the marginal estimating equation approach using inverse probability weighting; in both cases the effect of the terminal event on the response variable is not explicit and thus not easily interpreted. In contrast, we treat the terminal event time as a covariate in a conditional model for the longitudinal data, which provides a straightforward interpretation while keeping the usual relationship of interest between the longitudinally measured response variable and covariates for times that are far from the terminal event. A two-stage semiparametric likelihood-based approach is proposed for estimating the regression parameters; first, the conditional distribution of the right-censored terminal event time given other covariates is estimated and then the likelihood function for the longitudinal event given the terminal event and other regression parameters is maximized. The method is illustrated by numerical simulations and by analyzing medical cost data for patients with end-stage renal disease. Desirable asymptotic properties are provided. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 357-368
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1255637
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1255637
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:357-368
Template-Type: ReDIF-Article 1.0
Author-Name: BaoLuo Sun
Author-X-Name-First: BaoLuo
Author-X-Name-Last: Sun
Author-Name: Eric J. Tchetgen Tchetgen
Author-X-Name-First: Eric J.
Author-X-Name-Last: Tchetgen Tchetgen
Title: On Inverse Probability Weighting for Nonmonotone Missing at Random Data
Abstract:
The development of coherent missing data models to account for nonmonotone missing at random (MAR) data by inverse probability weighting (IPW) remains to date largely unresolved. As a consequence, IPW has essentially been restricted for use only in monotone MAR settings. We propose a class of models for nonmonotone missing data mechanisms that spans the MAR model, while allowing the underlying full data law to remain unrestricted. For parametric specifications within the proposed class, we introduce an unconstrained maximum likelihood estimator for estimating the missing data probabilities which is easily implemented using existing software. To circumvent potential convergence issues with this procedure, we also introduce a constrained Bayesian approach to estimate the missing data process which is guaranteed to yield inferences that respect all model restrictions. The efficiency of standard IPW estimation is improved by incorporating information from incomplete cases through an augmented estimating equation which is optimal within a large class of estimating equations. We investigate the finite-sample properties of the proposed estimators in extensive simulations and illustrate the new methodology in an application evaluating key correlates of preterm delivery for infants born to HIV-infected mothers in Botswana, Africa. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 369-379
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1256814
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1256814
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:369-379
Template-Type: ReDIF-Article 1.0
Author-Name: Quefeng Li
Author-X-Name-First: Quefeng
Author-X-Name-Last: Li
Author-Name: Guang Cheng
Author-X-Name-First: Guang
Author-X-Name-Last: Cheng
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Yuyan Wang
Author-X-Name-First: Yuyan
Author-X-Name-Last: Wang
Title: Embracing the Blessing of Dimensionality in Factor Models
Abstract:
Factor modeling is an essential tool for exploring intrinsic dependence structures among high-dimensional random variables. Much progress has been made for estimating the covariance matrix from a high-dimensional factor model. However, the blessing of dimensionality has not yet been fully embraced in the literature: much of the available data are often ignored in constructing covariance matrix estimates. If our goal is to accurately estimate a covariance matrix of a set of targeted variables, shall we employ additional data, which are beyond the variables of interest, in the estimation? In this article, we provide sufficient conditions for an affirmative answer, and further quantify its gain in terms of Fisher information and convergence rate. In fact, even an oracle-like result (as if all the factors were known) can be achieved when a sufficiently large number of variables is used. The idea of using data as much as possible brings computational challenges. A divide-and-conquer algorithm is thus proposed to alleviate the computational burden, and also shown not to sacrifice any statistical accuracy in comparison with a pooled analysis. Simulation studies further confirm our advocacy for the use of full data, and demonstrate the effectiveness of the above algorithm. Our proposal is applied to a microarray data example that shows empirical benefits of using more data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 380-389
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1256815
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1256815
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:380-389
Template-Type: ReDIF-Article 1.0
Author-Name: Fan Li
Author-X-Name-First: Fan
Author-X-Name-Last: Li
Author-Name: Kari Lock Morgan
Author-X-Name-First: Kari Lock
Author-X-Name-Last: Morgan
Author-Name: Alan M. Zaslavsky
Author-X-Name-First: Alan M.
Author-X-Name-Last: Zaslavsky
Title: Balancing Covariates via Propensity Score Weighting
Abstract:
Covariate balance is crucial for unconfounded descriptive or causal comparisons. However, lack of balance is common in observational studies. This article considers weighting strategies for balancing covariates. We define a general class of weights—the balancing weights—that balance the weighted distributions of the covariates between treatment groups. These weights incorporate the propensity score to weight each group to an analyst-selected target population. This class unifies existing weighting methods, including commonly used weights such as inverse-probability weights as special cases. General large-sample results on nonparametric estimation based on these weights are derived. We further propose a new weighting scheme, the overlap weights, in which each unit’s weight is proportional to the probability of that unit being assigned to the opposite group. The overlap weights are bounded, and minimize the asymptotic variance of the weighted average treatment effect among the class of balancing weights. The overlap weights also possess a desirable small-sample exact balance property, based on which we propose a new method that achieves exact balance for means of any selected set of covariates. Two applications illustrate these methods and compare them with other approaches.
Journal: Journal of the American Statistical Association
Pages: 390-400
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1260466
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1260466
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:390-400
Template-Type: ReDIF-Article 1.0
Author-Name: Abhra Sarkar
Author-X-Name-First: Abhra
Author-X-Name-Last: Sarkar
Author-Name: Debdeep Pati
Author-X-Name-First: Debdeep
Author-X-Name-Last: Pati
Author-Name: Antik Chakraborty
Author-X-Name-First: Antik
Author-X-Name-Last: Chakraborty
Author-Name: Bani K. Mallick
Author-X-Name-First: Bani K.
Author-X-Name-Last: Mallick
Author-Name: Raymond J. Carroll
Author-X-Name-First: Raymond J.
Author-X-Name-Last: Carroll
Title: Bayesian Semiparametric Multivariate Density Deconvolution
Abstract:
We consider the problem of multivariate density deconvolution when interest lies in estimating the distribution of a vector-valued random variable X but precise measurements on X are not available, observations being contaminated by measurement errors U. The existing sparse literature on the problem assumes the density of the measurement errors to be completely known. We propose robust Bayesian semiparametric multivariate deconvolution approaches when the measurement error density of U is not known but replicated proxies are available for at least some individuals. Additionally, we allow the variability of U to depend on the associated unobserved values of X through unknown relationships, which also automatically includes the case of multivariate multiplicative measurement errors. Basic properties of finite mixture models, multivariate normal kernels, and exchangeable priors are exploited in novel ways to meet modeling and computational challenges. Theoretical results showing the flexibility of the proposed methods in capturing a wide variety of data-generating processes are provided. We illustrate the efficiency of the proposed methods in recovering the density of X through simulation experiments. The methodology is applied to estimate the joint consumption pattern of different dietary components from contaminated 24-hour recalls. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 401-416
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1260467
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1260467
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:401-416
Template-Type: ReDIF-Article 1.0
Author-Name: Rajesh Ranganath
Author-X-Name-First: Rajesh
Author-X-Name-Last: Ranganath
Author-Name: David M. Blei
Author-X-Name-First: David M.
Author-X-Name-Last: Blei
Title: Correlated Random Measures
Abstract:
We develop correlated random measures, random measures where the atom weights can exhibit a flexible pattern of dependence, and use them to develop powerful hierarchical Bayesian nonparametric models. Hierarchical Bayesian nonparametric models are usually built from completely random measures, a Poisson-process-based construction in which the atom weights are independent. Completely random measures imply strong independence assumptions in the corresponding hierarchical model, and these assumptions are often misplaced in real-world settings. Correlated random measures address this limitation. They model correlation within the measure by using a Gaussian process in concert with the Poisson process. With correlated random measures, for example, we can develop a latent feature model for which we can infer both the properties of the latent features and their dependency pattern. We develop several other examples as well. We study a correlated random measure model of pairwise count data. We derive an efficient variational inference algorithm and show improved predictive performance on large datasets of documents, web clicks, and electronic health records. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 417-430
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1260468
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1260468
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:417-430
Template-Type: ReDIF-Article 1.0
Author-Name: Veronika Ročková
Author-X-Name-First: Veronika
Author-X-Name-Last: Ročková
Author-Name: Edward I. George
Author-X-Name-First: Edward I.
Author-X-Name-Last: George
Title: The Spike-and-Slab LASSO
Abstract:
Despite the wide adoption of spike-and-slab methodology for Bayesian variable selection, its potential for penalized likelihood estimation has largely been overlooked. In this article, we bridge this gap by cross-fertilizing these two paradigms with the Spike-and-Slab LASSO procedure for variable selection and parameter estimation in linear regression. We introduce a new class of self-adaptive penalty functions that arise from a fully Bayes spike-and-slab formulation, ultimately moving beyond the separable penalty framework. A virtue of these nonseparable penalties is their ability to borrow strength across coordinates, adapt to ensemble sparsity information and exert multiplicity adjustment. The Spike-and-Slab LASSO procedure harvests efficient coordinate-wise implementations with a path-following scheme for dynamic posterior exploration. We show on simulated data that the fully Bayes penalty mimics oracle performance, providing a viable alternative to cross-validation. We develop theory for the separable and nonseparable variants of the penalty, showing rate-optimality of the global mode as well as optimal posterior concentration when p > n. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 431-444
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1260469
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1260469
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:431-444
Template-Type: ReDIF-Article 1.0
Author-Name: Yiyuan She
Author-X-Name-First: Yiyuan
Author-X-Name-Last: She
Author-Name: Zhifeng Wang
Author-X-Name-First: Zhifeng
Author-X-Name-Last: Wang
Author-Name: He Jiang
Author-X-Name-First: He
Author-X-Name-Last: Jiang
Title: Group Regularized Estimation Under Structural Hierarchy
Abstract:
Variable selection for models including interactions between explanatory variables often needs to obey certain hierarchical constraints. Weak or strong structural hierarchy requires that the existence of an interaction term implies at least one or both associated main effects to be present in the model. Lately, this problem has attracted a lot of attention, but existing computational algorithms converge slowly even with a moderate number of predictors. Moreover, in contrast to the rich literature on ordinary variable selection, there is a lack of statistical theory to show reasonably low error rates of hierarchical variable selection. This work investigates a new class of estimators that make use of multiple group penalties to capture structural parsimony. We show that the proposed estimators enjoy sharp rate oracle inequalities, and give the minimax lower bounds in strong and weak hierarchical variable selection. A general-purpose algorithm is developed with guaranteed convergence and global optimality. Simulations and real data experiments demonstrate the efficiency and efficacy of the proposed approach. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 445-454
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1260470
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1260470
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:445-454
Template-Type: ReDIF-Article 1.0
Author-Name: Marco Battiston
Author-X-Name-First: Marco
Author-X-Name-Last: Battiston
Author-Name: Stefano Favaro
Author-X-Name-First: Stefano
Author-X-Name-Last: Favaro
Author-Name: Yee Whye Teh
Author-X-Name-First: Yee Whye
Author-X-Name-Last: Teh
Title: Multi-Armed Bandit for Species Discovery: A Bayesian Nonparametric Approach
Abstract:
Let (P1, …, PJ) denote J populations of animals from distinct regions. A priori, it is unknown which species are present in each region and what their corresponding frequencies are. Species are shared among populations and each species can be present in more than one region with its frequency varying across populations. In this article, we consider the problem of sequentially sampling these populations to observe the greatest number of different species. We adopt a Bayesian nonparametric approach and endow (P1, …, PJ) with a hierarchical Pitman–Yor process prior. As a consequence of the hierarchical structure, the J unknown discrete probability measures share the same support, that of their common random base measure. Given this prior choice, we propose a sequential rule that, at every time step, given the information available up to that point, selects the population from which to collect the next observation. Rather than picking the population with the highest posterior estimate of producing a new value, the proposed rule includes a Thompson sampling step to better balance the exploration–exploitation trade-off. We also propose an extension of the algorithm to deal with incidence data, where multiple observations are collected in a time period. The performance of the proposed algorithms is assessed through a simulation study and compared to three other strategies. Finally, we compare these algorithms using a dataset of species of trees, collected from different plots in South America. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 455-466
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1261711
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1261711
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:455-466
Template-Type: ReDIF-Article 1.0
Author-Name: Pavel Krupskii
Author-X-Name-First: Pavel
Author-X-Name-Last: Krupskii
Author-Name: Raphaël Huser
Author-X-Name-First: Raphaël
Author-X-Name-Last: Huser
Author-Name: Marc G. Genton
Author-X-Name-First: Marc G.
Author-X-Name-Last: Genton
Title: Factor Copula Models for Replicated Spatial Data
Abstract:
We propose a new copula model that can be used with replicated spatial data. Unlike the multivariate normal copula, the proposed copula is based on the assumption that a common factor exists and affects the joint dependence of all measurements of the process. Moreover, the proposed copula can model tail dependence and tail asymmetry. The model is parameterized in terms of a covariance function that may be chosen from the many models proposed in the literature, such as the Matérn model. For some choice of common factors, the joint copula density is given in closed form and therefore likelihood estimation is very fast. In the general case, one-dimensional numerical integration is needed to calculate the likelihood, but estimation is still reasonably fast even with large datasets. We use simulation studies to show the wide range of dependence structures that can be generated by the proposed model with different choices of common factors. We apply the proposed model to spatial temperature data and compare its performance with some popular geostatistics models. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 467-479
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2016.1261712
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1261712
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:467-479
Template-Type: ReDIF-Article 1.0
Author-Name: Yuanjia Wang
Author-X-Name-First: Yuanjia
Author-X-Name-Last: Wang
Author-Name: Haoda Fu
Author-X-Name-First: Haoda
Author-X-Name-Last: Fu
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Title: Learning Optimal Personalized Treatment Rules in Consideration of Benefit and Risk: With an Application to Treating Type 2 Diabetes Patients With Insulin Therapies
Abstract:
Individualized medical decision making is often complex due to patient treatment response heterogeneity. Pharmacotherapy may exhibit distinct efficacy and safety profiles for different patient populations. An “optimal” treatment that maximizes clinical benefit for a patient may also lead to concern of safety due to a high risk of adverse events. Thus, to guide individualized clinical decision making and deliver optimal tailored treatments, maximizing clinical benefit should be considered in the context of controlling for potential risk. In this work, we propose two approaches to identify a personalized optimal treatment strategy that maximizes clinical benefit under a constraint on the average risk. We derive the theoretical optimal treatment rule under the risk constraint and draw an analogy to the Neyman–Pearson lemma to prove the theorem. We present algorithms that can be easily implemented by any off-the-shelf quadratic programming package. We conduct extensive simulation studies to show satisfactory risk control when maximizing the clinical benefit. Finally, we apply our method to a randomized trial of type 2 diabetes patients to guide optimal utilization of the first line insulin treatments based on individual patient characteristics while controlling for the rate of hypoglycemia events. We identify baseline glycated hemoglobin level, body mass index, and fasting blood glucose as three key factors among 18 biomarkers to differentiate treatment assignments, and demonstrate a successful control of the risk of hypoglycemia in both the training and testing datasets.
Journal: Journal of the American Statistical Association
Pages: 1-13
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1303386
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1303386
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:1-13
Template-Type: ReDIF-Article 1.0
Author-Name: Danielle Braun
Author-X-Name-First: Danielle
Author-X-Name-Last: Braun
Author-Name: Malka Gorfine
Author-X-Name-First: Malka
Author-X-Name-Last: Gorfine
Author-Name: Hormuzd A. Katki
Author-X-Name-First: Hormuzd A.
Author-X-Name-Last: Katki
Author-Name: Argyrios Ziogas
Author-X-Name-First: Argyrios
Author-X-Name-Last: Ziogas
Author-Name: Giovanni Parmigiani
Author-X-Name-First: Giovanni
Author-X-Name-Last: Parmigiani
Title: Nonparametric Adjustment for Measurement Error in Time-to-Event Data: Application to Risk Prediction Models
Abstract:
Mismeasured time-to-event data used as a predictor in risk prediction models will lead to inaccurate predictions. This arises in the context of self-reported family history, a time-to-event predictor often measured with error, used in Mendelian risk prediction models. Using validation data, we propose a method to adjust for this type of error. We estimate the measurement error process using a nonparametric smoothed Kaplan–Meier estimator, and use Monte Carlo integration to implement the adjustment. We apply our method to simulated data in the context of both Mendelian and multivariate survival prediction models. Simulations are evaluated using measures of mean squared error of prediction (MSEP), area under the receiver operating characteristic curve (ROC-AUC), and the ratio of observed to expected number of events. These results show that our method mitigates the effects of measurement error mainly by improving calibration and total accuracy. We illustrate our method in the context of Mendelian risk prediction models focusing on misreporting of breast cancer, fitting the measurement error model on data from the University of California at Irvine, and applying our method to counselees from the Cancer Genetics Network. We show that our method improves overall calibration, especially in low risk deciles. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 14-25
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1311261
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1311261
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:14-25
Template-Type: ReDIF-Article 1.0
Author-Name: Jouni Kuha
Author-X-Name-First: Jouni
Author-X-Name-Last: Kuha
Author-Name: Sarah Butt
Author-X-Name-First: Sarah
Author-X-Name-Last: Butt
Author-Name: Myrsini Katsikatsou
Author-X-Name-First: Myrsini
Author-X-Name-Last: Katsikatsou
Author-Name: Chris J. Skinner
Author-X-Name-First: Chris J.
Author-X-Name-Last: Skinner
Title: The Effect of Probing “Don’t Know” Responses on Measurement Quality and Nonresponse in Surveys
Abstract:
In survey interviews, “Don’t know” (DK) responses are commonly treated as missing data. One way to reduce the rate of such responses is to probe initial DK answers with a follow-up question designed to encourage respondents to give substantive, non-DK responses. However, such probing can also reduce data quality by introducing additional or differential measurement error. We propose a latent variable model for analyzing the effects of probing on responses to survey questions. The model makes it possible to separate measurement effects of probing from true differences between respondents who do and do not require probing. We analyze new data from an experiment, which compared responses to two multi-item batteries of questions with and without probing. In this study, probing reduced the rate of DK responses by around a half. However, it also had substantial measurement effects, in that probed answers were often weaker measures of constructs of interest than were unprobed answers. These effects were larger for questions on attitudes than for pseudo-knowledge questions on perceptions of external facts. The results provide evidence against the use of probing of “Don’t know” responses, at least for the kinds of items and respondents considered in this study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 26-40
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1323640
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1323640
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:26-40
Template-Type: ReDIF-Article 1.0
Author-Name: Guillaume Basse
Author-X-Name-First: Guillaume
Author-X-Name-Last: Basse
Author-Name: Avi Feller
Author-X-Name-First: Avi
Author-X-Name-Last: Feller
Title: Analyzing Two-Stage Experiments in the Presence of Interference
Abstract:
Two-stage randomization is a powerful design for estimating treatment effects in the presence of interference; that is, when one individual’s treatment assignment affects another individual’s outcomes. Our motivating example is a two-stage randomized trial evaluating an intervention to reduce student absenteeism in the School District of Philadelphia. In that experiment, households with multiple students were first assigned to treatment or control; then, in treated households, one student was randomly assigned to treatment. Using this example, we highlight key considerations for analyzing two-stage experiments in practice. Our first contribution is to address additional complexities that arise when household sizes vary; in this case, researchers must decide between assigning equal weight to households or equal weight to individuals. We propose unbiased estimators for a broad class of individual- and household-weighted estimands, with corresponding theoretical and estimated variances. Our second contribution is to connect two common approaches for analyzing two-stage designs: linear regression and randomization inference. We show that, with suitably chosen standard errors, these two approaches yield identical point and variance estimates, which is somewhat surprising given the complex randomization scheme. Finally, we explore options for incorporating covariates to improve precision. We confirm our analytic results via simulation studies and apply these methods to the attendance study, finding substantively meaningful spillover effects.
Journal: Journal of the American Statistical Association
Pages: 41-55
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1323641
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1323641
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:41-55
Template-Type: ReDIF-Article 1.0
Author-Name: Maria DeYoreo
Author-X-Name-First: Maria
Author-X-Name-Last: DeYoreo
Author-Name: Athanasios Kottas
Author-X-Name-First: Athanasios
Author-X-Name-Last: Kottas
Title: Modeling for Dynamic Ordinal Regression Relationships: An Application to Estimating Maturity of Rockfish in California
Abstract:
We develop a Bayesian nonparametric framework for modeling ordinal regression relationships, which evolve in discrete time. The motivating application involves a key problem in fisheries research on estimating dynamically evolving relationships between age, length, and maturity, the latter recorded on an ordinal scale. The methodology builds from nonparametric mixture modeling for the joint stochastic mechanism of covariates and latent continuous responses. This approach yields highly flexible inference for ordinal regression functions while at the same time avoiding the computational challenges of parametric models that arise from estimation of cut-off points relating the latent continuous and ordinal responses. A novel dependent Dirichlet process prior for time-dependent mixing distributions extends the model to the dynamic setting. The methodology is used for a detailed study of relationships between maturity, age, and length for Chilipepper rockfish, using data collected over 15 years along the coast of California. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 68-80
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1328357
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1328357
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:68-80
Template-Type: ReDIF-Article 1.0
Author-Name: Siamak Zamani Dadaneh
Author-X-Name-First: Siamak Zamani
Author-X-Name-Last: Dadaneh
Author-Name: Xiaoning Qian
Author-X-Name-First: Xiaoning
Author-X-Name-Last: Qian
Author-Name: Mingyuan Zhou
Author-X-Name-First: Mingyuan
Author-X-Name-Last: Zhou
Title: BNP-Seq: Bayesian Nonparametric Differential Expression Analysis of Sequencing Count Data
Abstract:
We perform differential expression analysis of high-throughput sequencing count data under a Bayesian nonparametric framework, removing sophisticated ad hoc pre-processing steps commonly required in existing algorithms. We propose to use the gamma (beta) negative binomial process, which takes into account different sequencing depths using sample-specific negative binomial probability (dispersion) parameters, to detect differentially expressed genes by comparing the posterior distributions of gene-specific negative binomial dispersion (probability) parameters. These model parameters are inferred by borrowing statistical strength across both the genes and samples. Extensive experiments on both simulated and real-world RNA sequencing count data show that the proposed differential expression analysis algorithms clearly outperform previously proposed ones in terms of the areas under both the receiver operating characteristic and precision-recall curves. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 81-94
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1328358
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1328358
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:81-94
Template-Type: ReDIF-Article 1.0
Author-Name: Meredith L. Wallace
Author-X-Name-First: Meredith L.
Author-X-Name-Last: Wallace
Author-Name: Daniel J. Buysse
Author-X-Name-First: Daniel J.
Author-X-Name-Last: Buysse
Author-Name: Anne Germain
Author-X-Name-First: Anne
Author-X-Name-Last: Germain
Author-Name: Martica H. Hall
Author-X-Name-First: Martica H.
Author-X-Name-Last: Hall
Author-Name: Satish Iyengar
Author-X-Name-First: Satish
Author-X-Name-Last: Iyengar
Title: Variable Selection for Skewed Model-Based Clustering: Application to the Identification of Novel Sleep Phenotypes
Abstract:
In sleep research, applying finite mixture models to sleep characteristics captured through multiple data types, including self-reported sleep diary, a wrist monitor capturing movement (actigraphy), and brain waves (polysomnography), may suggest new phenotypes that reflect underlying disease mechanisms. However, a direct mixture model application is challenging because there are many sleep variables from which to choose, and sleep variables are often highly skewed even in homogeneous samples. Moreover, previous sleep research findings indicate that some of the most clinically interesting solutions will be those that incorporate all three data types. Thus, we present two novel skewed variable selection algorithms based on the multivariate skew normal (MSN) distribution: one that selects the best set of variables ignoring data type and another that embraces the exploratory nature of clustering and suggests multiple statistically plausible sets of variables that each incorporate all data types. Through a simulation study, we empirically compare our approach with other asymmetric and normal dimension reduction strategies for clustering. Finally, we demonstrate our methods using a sample of older adults with and without insomnia. The proposed MSN-based variable selection algorithm appears to be suitable for both MSN and multivariate normal cluster distributions, especially with moderate to large sample sizes. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 95-110
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1330202
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1330202
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:95-110
Template-Type: ReDIF-Article 1.0
Author-Name: Ross P. Hilton
Author-X-Name-First: Ross P.
Author-X-Name-Last: Hilton
Author-Name: Yuchen Zheng
Author-X-Name-First: Yuchen
Author-X-Name-Last: Zheng
Author-Name: Nicoleta Serban
Author-X-Name-First: Nicoleta
Author-X-Name-Last: Serban
Title: Modeling Heterogeneity in Healthcare Utilization Using Massive Medical Claims Data
Abstract:
We introduce a modeling approach for characterizing heterogeneity in healthcare utilization using massive medical claims data. We first translate the medical claims observed for a large study population and across five years into individual-level discrete events of care called utilization sequences. We model the utilization sequences using an exponential proportional hazards mixture model to capture heterogeneous behaviors in patients’ healthcare utilization. The objective is to cluster patients according to their longitudinal utilization behaviors and to determine the main drivers of variation in healthcare utilization while controlling for the demographic, geographic, and health characteristics of the patients. Due to the computational infeasibility of fitting a parametric proportional hazards model for high-dimensional, large-sample-size data, we use an iterative one-step procedure to estimate the model parameters and impute the cluster membership. The approach is used to draw inferences on utilization behaviors of children in the Medicaid system with persistent asthma across six states. We conclude with policy implications for targeted interventions to improve adherence to recommended care practices for pediatric asthma. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 111-121
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1330203
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1330203
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:111-121
Template-Type: ReDIF-Article 1.0
Author-Name: Peng Shi
Author-X-Name-First: Peng
Author-X-Name-Last: Shi
Author-Name: Lu Yang
Author-X-Name-First: Lu
Author-X-Name-Last: Yang
Title: Pair Copula Constructions for Insurance Experience Rating
Abstract:
In nonlife insurance, insurers use experience rating to adjust premiums to reflect policyholders’ previous claim experience. Performing prospective experience rating can be challenging when the claim distribution is complex. For instance, insurance claims are semicontinuous in that a fraction of zeros is often associated with an otherwise positive continuous outcome from a right-skewed and long-tailed distribution. Practitioners use a credibility premium, which is a special form of the shrinkage estimator in the longitudinal data framework. However, the linear predictor is not informative especially when the outcome follows a mixed distribution. In this article, we introduce a mixed vine pair copula construction framework for modeling semicontinuous longitudinal claims. In the proposed framework, a two-component mixture regression is employed to accommodate the zero inflation and thick tails in the claim distribution. The temporal dependence among repeated observations is modeled using a sequence of bivariate conditional copulas based on a mixed D-vine. We emphasize that the resulting predictive distribution allows insurers to incorporate past experience into future premiums in a nonlinear fashion and the classic linear predictor can be viewed as a nested case. In the application, we examine a unique claims dataset of government property insurance from the state of Wisconsin. Due to the discrepancies between the claim and premium distributions, we employ an ordered Lorenz curve to evaluate the predictive performance. We show that the proposed approach offers substantial opportunities for separating risks and identifying profitable business when compared with alternative experience rating methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 122-133
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1330692
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1330692
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:122-133
Template-Type: ReDIF-Article 1.0
Author-Name: Ryan Warnick
Author-X-Name-First: Ryan
Author-X-Name-Last: Warnick
Author-Name: Michele Guindani
Author-X-Name-First: Michele
Author-X-Name-Last: Guindani
Author-Name: Erik Erhardt
Author-X-Name-First: Erik
Author-X-Name-Last: Erhardt
Author-Name: Elena Allen
Author-X-Name-First: Elena
Author-X-Name-Last: Allen
Author-Name: Vince Calhoun
Author-X-Name-First: Vince
Author-X-Name-Last: Calhoun
Author-Name: Marina Vannucci
Author-X-Name-First: Marina
Author-X-Name-Last: Vannucci
Title: A Bayesian Approach for Estimating Dynamic Functional Network Connectivity in fMRI Data
Abstract:
Dynamic functional connectivity, that is, the study of how interactions among brain regions change dynamically over the course of an fMRI experiment, has recently received wide interest in the neuroimaging literature. Current approaches for studying dynamic connectivity often rely on ad hoc approaches for inference, with the fMRI time courses segmented by a sequence of sliding windows. We propose a principled Bayesian approach to dynamic functional connectivity, which is based on the estimation of time-varying networks. Our method utilizes a hidden Markov model for classification of latent cognitive states, achieving estimation of the networks in an integrated framework that borrows strength over the entire time course of the experiment. Furthermore, we assume that the graph structures, which define the connectivity states at each time point, are related within a super-graph, to encourage the selection of the same edges among related graphs. We apply our method to simulated task-based fMRI data, where we show how our approach allows the decoupling of the task-related activations and the functional connectivity states. We also analyze data from an fMRI sensorimotor task experiment on an individual healthy subject and obtain results that support the role of particular anatomical regions in modulating the interaction between executive control and attention networks.
Journal: Journal of the American Statistical Association
Pages: 134-151
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1379404
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1379404
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:134-151
Template-Type: ReDIF-Article 1.0
Author-Name: John C. Duchi
Author-X-Name-First: John C.
Author-X-Name-Last: Duchi
Author-Name: Michael I. Jordan
Author-X-Name-First: Michael I.
Author-X-Name-Last: Jordan
Author-Name: Martin J. Wainwright
Author-X-Name-First: Martin J.
Author-X-Name-Last: Wainwright
Title: Minimax Optimal Procedures for Locally Private Estimation
Abstract:
Working under a model of privacy in which data remain private even from the statistician, we study the tradeoff between privacy guarantees and the risk of the resulting statistical estimators. We develop private versions of classical information-theoretical bounds, in particular those due to Le Cam, Fano, and Assouad. These inequalities allow for a precise characterization of statistical rates under local privacy constraints and the development of provably (minimax) optimal estimation procedures. We provide a treatment of several canonical families of problems: mean estimation and median estimation, generalized linear models, and nonparametric density estimation. For all of these families, we provide lower and upper bounds that match up to constant factors, and exhibit new (optimal) privacy-preserving mechanisms and computationally efficient estimators that achieve the bounds. Additionally, we present a variety of experimental results for estimation problems involving sensitive data, including salaries, censored blog posts and articles, and drug abuse; these experiments demonstrate the importance of deriving optimal procedures. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 182-201
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1389735
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1389735
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:182-201
Template-Type: ReDIF-Article 1.0
Author-Name: Joseph Guinness
Author-X-Name-First: Joseph
Author-X-Name-Last: Guinness
Author-Name: Dorit Hammerling
Author-X-Name-First: Dorit
Author-X-Name-Last: Hammerling
Title: Compression and Conditional Emulation of Climate Model Output
Abstract:
Numerical climate model simulations run at high spatial and temporal resolutions generate massive quantities of data. As our computing capabilities continue to increase, storing all of the data is not sustainable, and thus it is important to develop methods for representing the full datasets by smaller compressed versions. We propose a statistical compression and decompression algorithm based on storing a set of summary statistics as well as a statistical model describing the conditional distribution of the full dataset given the summary statistics. We decompress the data by computing conditional expectations and conditional simulations from the model given the summary statistics. Conditional expectations represent our best estimate of the original data but are subject to oversmoothing in space and time. Conditional simulations introduce realistic small-scale noise so that the decompressed fields are neither too smooth nor too rough compared with the original data. Considerable attention is paid to accurately modeling the original dataset—1 year of daily mean temperature data—particularly with regard to the inherent spatial nonstationarity in global fields, and to determining the statistics to be stored, so that the variation in the original data can be closely captured, while allowing for fast decompression and conditional emulation on modest computers. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 56-67
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1395339
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1395339
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:56-67
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Corrigendum
Journal: Journal of the American Statistical Association
Pages: 486-486
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1395340
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1395340
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:486-486
Template-Type: ReDIF-Article 1.0
Author-Name: Petra M. Kuhnert
Author-X-Name-First: Petra M.
Author-X-Name-Last: Kuhnert
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 168-170
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1415904
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1415904
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:168-170
Template-Type: ReDIF-Article 1.0
Author-Name: William F. Christensen
Author-X-Name-First: William F.
Author-X-Name-Last: Christensen
Author-Name: C. Shane Reese
Author-X-Name-First: C. Shane
Author-X-Name-Last: Reese
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 171-173
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1415905
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1415905
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:171-173
Template-Type: ReDIF-Article 1.0
Author-Name: Noel Cressie
Author-X-Name-First: Noel
Author-X-Name-Last: Cressie
Title: Mission CO2ntrol: A Statistical Scientist's Role in Remote Sensing of Atmospheric Carbon Dioxide
Abstract:
Too much carbon dioxide (CO2) in the atmosphere is a threat to long-term sustainability of Earth's ecosystem. Atmospheric CO2 is a leading greenhouse gas that has increased to levels not seen since the middle Pliocene (approximately 3.6 million years ago). One of the US National Aeronautics and Space Administration's (NASA) remote sensing missions is the Orbiting Carbon Observatory-2, whose principal science objective is to estimate the global geographic distribution of CO2 sources and sinks at Earth's surface, through time. This starts with raw radiances (Level 1), moves on to retrievals of the atmospheric state (Level 2), from which maps of gap-filled and de-noised geophysical variables and their uncertainties are made (Level 3). With the aid of a model of transport in the atmosphere, CO2 fluxes (Level 4) can be obtained from Level 2 data directly or possibly through Level 3. Decisions about how to mitigate or manage CO2 could be thought of as Level 5. Hierarchical statistical modeling is used to qualify and quantify the uncertainties at each level. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 152-168
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1419136
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1419136
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:152-168
Template-Type: ReDIF-Article 1.0
Author-Name: Marc G. Genton
Author-X-Name-First: Marc G.
Author-X-Name-Last: Genton
Author-Name: Jaehong Jeong
Author-X-Name-First: Jaehong
Author-X-Name-Last: Jeong
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 176-178
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1419137
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1419137
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:176-178
Template-Type: ReDIF-Article 1.0
Author-Name: Frédéric Chevallier
Author-X-Name-First: Frédéric
Author-X-Name-Last: Chevallier
Author-Name: François-Marie Bréon
Author-X-Name-First: François-Marie
Author-X-Name-Last: Bréon
Title: Comment
Abstract:
Based on the measurements of the OCO-2 satellite, Noel Cressie addresses a particularly hard challenge for Earth observation, arguably an extreme case in remote sensing. He is one of the very few who have expertise in most of the processing chain, and his article brilliantly discusses the diverse underlying statistical challenges. In this comment, we provide a complementary view of the topic to qualify its prospects as drawn by N. Cressie at the end of his article. We first summarize the motivation of OCO-2-type programs; we then present the corresponding challenges before discussing the prospects.
Journal: Journal of the American Statistical Association
Pages: 173-175
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1419138
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1419138
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:173-175
Template-Type: ReDIF-Article 1.0
Author-Name: Noel Cressie
Author-X-Name-First: Noel
Author-X-Name-Last: Cressie
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 178-181
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2017.1421541
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1421541
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:178-181
Template-Type: ReDIF-Article 1.0
Author-Name: Anderson Y. Zhang
Author-X-Name-First: Anderson Y.
Author-X-Name-Last: Zhang
Author-Name: Harrison H. Zhou
Author-X-Name-First: Harrison H.
Author-X-Name-Last: Zhou
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 201-203
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2018.1442605
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442605
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:201-203
Template-Type: ReDIF-Article 1.0
Author-Name: Alfred Hero
Author-X-Name-First: Alfred
Author-X-Name-Last: Hero
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 203-204
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2018.1442606
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442606
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:203-204
Template-Type: ReDIF-Article 1.0
Author-Name: Vishesh Karwa
Author-X-Name-First: Vishesh
Author-X-Name-Last: Karwa
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 204-207
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2018.1442607
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442607
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:204-207
Template-Type: ReDIF-Article 1.0
Author-Name: Moritz Hardt
Author-X-Name-First: Moritz
Author-X-Name-Last: Hardt
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 207-208
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2018.1442608
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442608
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:207-208
Template-Type: ReDIF-Article 1.0
Author-Name: Aaron Roth
Author-X-Name-First: Aaron
Author-X-Name-Last: Roth
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 208-211
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2018.1442610
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442610
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:208-211
Template-Type: ReDIF-Article 1.0
Author-Name: John C. Duchi
Author-X-Name-First: John C.
Author-X-Name-Last: Duchi
Author-Name: Michael I. Jordan
Author-X-Name-First: Michael I.
Author-X-Name-Last: Jordan
Author-Name: Martin J. Wainwright
Author-X-Name-First: Martin J.
Author-X-Name-Last: Wainwright
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 212-215
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2018.1442611
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442611
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:212-215
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Erratum
Journal: Journal of the American Statistical Association
Pages: 487-487
Issue: 521
Volume: 113
Year: 2018
Month: 1
X-DOI: 10.1080/01621459.2018.1460558
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1460558
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:521:p:487-487
Template-Type: ReDIF-Article 1.0
Author-Name: Edoardo M. Airoldi
Author-X-Name-First: Edoardo M.
Author-X-Name-Last: Airoldi
Author-Name: Jonathan M. Bischof
Author-X-Name-First: Jonathan M.
Author-X-Name-Last: Bischof
Title: Improving and Evaluating Topic Models and Other Models of Text
Abstract:
An ongoing challenge in the analysis of document collections is how to summarize content in terms of a set of inferred themes that can be interpreted substantively in terms of topics. The current practice of parameterizing the themes in terms of most frequent words limits interpretability by ignoring the differential use of words across topics. Here, we show that words that are both frequent and exclusive to a theme are more effective at characterizing topical content, and we propose a regularization scheme that leads to better estimates of these quantities. We consider a supervised setting where professional editors have assigned documents to topic categories, organized into a tree, in which leaf nodes correspond to more specific topics. Each document is annotated with multiple categories, at different levels of the tree. We introduce a hierarchical Poisson convolution model to analyze these annotated documents. A parallelized Hamiltonian Monte Carlo sampler allows the inference to scale to millions of documents. The model leverages the structure among categories defined by professional editors to infer a clear semantic description for each topic in terms of words that are both frequent and exclusive. In this supervised setting, we validate the efficacy of word frequency and exclusivity at characterizing topical content on two very large collections of documents, from Reuters and the New York Times. In an unsupervised setting, we then consider a simplified version of the model that shares the same regularization scheme with the previous model. We carry out a large randomized experiment on Amazon Mechanical Turk to demonstrate that topic summaries based on frequency and exclusivity, estimated using the proposed regularization scheme, are more interpretable than currently established frequency-based summaries, and that the proposed model produces more efficient estimates of exclusivity than the currently established models.
Journal: Journal of the American Statistical Association
Pages: 1381-1403
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1051182
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1051182
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1381-1403
Template-Type: ReDIF-Article 1.0
Author-Name: Xinlei Wang
Author-X-Name-First: Xinlei
Author-X-Name-Last: Wang
Author-Name: Johan Lim
Author-X-Name-First: Johan
Author-X-Name-Last: Lim
Author-Name: Lynne Stokes
Author-X-Name-First: Lynne
Author-X-Name-Last: Stokes
Title: Using Ranked Set Sampling With Cluster Randomized Designs for Improved Inference on Treatment Effects
Abstract:
This article examines the use of ranked set sampling (RSS) with cluster randomized designs (CRDs), for potential improvement in estimation and detection of treatment or intervention effects. Outcome data in cluster randomized studies typically have nested structures, where hierarchical linear models (HLMs) become a natural choice for data analysis. However, nearly all theoretical developments in RSS to date are within the structure of one-level models. Thus, implementation of RSS at one or more levels of an HLM will require development of new theory and methods. Under RSS-structured CRDs developed to incorporate RSS at different levels, a nonparametric estimator of the treatment effect is proposed, and its theoretical properties are studied under a general HLM that has almost no distributional assumptions. We formally quantify the magnitude of the improvement from using RSS over SRS (simple random sampling), investigate the relationship between design parameters and relative efficiency, establish connections with one-level RSS under completely balanced CRDs, and study the impact of clustering and imperfect ranking. Further, based on the proposed RSS estimator, a new test is constructed to detect treatment effects, which is distribution-free and easy to use. Simulation studies confirm that in general, the proposed test is more powerful than the conventional F-test for the original CRDs, especially for small or medium effect sizes. Two empirical studies, one using data from educational research (i.e., the motivating application) and the other using human dental data, show that our methods work well in real-world settings, that our theory provides useful predictions at the stage of experimental design, and that substantial gains may be obtained from using RSS at either level. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1576-1590
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1093946
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1093946
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1576-1590
Template-Type: ReDIF-Article 1.0
Author-Name: Patrick R. Conrad
Author-X-Name-First: Patrick R.
Author-X-Name-Last: Conrad
Author-Name: Youssef M. Marzouk
Author-X-Name-First: Youssef M.
Author-X-Name-Last: Marzouk
Author-Name: Natesh S. Pillai
Author-X-Name-First: Natesh S.
Author-X-Name-Last: Pillai
Author-Name: Aaron Smith
Author-X-Name-First: Aaron
Author-X-Name-Last: Smith
Title: Accelerating Asymptotically Exact MCMC for Computationally Intensive Models via Local Approximations
Abstract:
We construct a new framework for accelerating Markov chain Monte Carlo in posterior sampling problems where standard methods are limited by the computational cost of the likelihood, or of numerical models embedded therein. Our approach introduces local approximations of these models into the Metropolis–Hastings kernel, borrowing ideas from deterministic approximation theory, optimization, and experimental design. Previous efforts at integrating approximate models into inference typically sacrifice either the sampler’s exactness or efficiency; our work seeks to address these limitations by exploiting useful convergence characteristics of local approximations. We prove the ergodicity of our approximate Markov chain, showing that it samples asymptotically from the exact posterior distribution of interest. We describe variations of the algorithm that employ either local polynomial approximations or local Gaussian process regressors. Our theoretical results reinforce the key observation underlying this article: when the likelihood has some local regularity, the number of model evaluations per Markov chain Monte Carlo (MCMC) step can be greatly reduced without biasing the Monte Carlo average. Numerical experiments demonstrate multiple order-of-magnitude reductions in the number of forward model evaluations used in representative ordinary differential equation (ODE) and partial differential equation (PDE) inference problems, with both synthetic and real data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1591-1607
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1096787
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1096787
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1591-1607
Template-Type: ReDIF-Article 1.0
Author-Name: Veronika Ročková
Author-X-Name-First: Veronika
Author-X-Name-Last: Ročková
Author-Name: Edward I. George
Author-X-Name-First: Edward I.
Author-X-Name-Last: George
Title: Fast Bayesian Factor Analysis via Automatic Rotations to Sparsity
Abstract:
Rotational post hoc transformations have traditionally played a key role in enhancing the interpretability of factor analysis. Regularization methods also serve to achieve this goal by prioritizing sparse loading matrices. In this work, we bridge these two paradigms with a unifying Bayesian framework. Our approach deploys intermediate factor rotations throughout the learning process, greatly enhancing the effectiveness of sparsity inducing priors. These automatic rotations to sparsity are embedded within a PXL-EM algorithm, a Bayesian variant of parameter-expanded EM for posterior mode detection. By iterating between soft-thresholding of small factor loadings and transformations of the factor basis, we obtain (a) dramatic accelerations, (b) robustness against poor initializations, and (c) better oriented sparse solutions. To avoid the prespecification of the factor cardinality, we extend the loading matrix to have infinitely many columns with the Indian buffet process (IBP) prior. The factor dimensionality is learned from the posterior, which is shown to concentrate on sparse matrices. Our deployment of PXL-EM performs a dynamic posterior exploration, outputting a solution path indexed by a sequence of spike-and-slab priors. For accurate recovery of the factor loadings, we deploy the spike-and-slab LASSO prior, a two-component refinement of the Laplace prior. A companion criterion, motivated as an integral lower bound, is provided to effectively select the best recovery. The potential of the proposed procedure is demonstrated on both simulated and real high-dimensional data, which would render posterior simulation impractical. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1608-1622
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1100620
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1100620
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1608-1622
Template-Type: ReDIF-Article 1.0
Author-Name: Ville A. Satopää
Author-X-Name-First: Ville A.
Author-X-Name-Last: Satopää
Author-Name: Robin Pemantle
Author-X-Name-First: Robin
Author-X-Name-Last: Pemantle
Author-Name: Lyle H. Ungar
Author-X-Name-First: Lyle H.
Author-X-Name-Last: Ungar
Title: Modeling Probability Forecasts via Information Diversity
Abstract:
Randomness in scientific estimation is generally assumed to arise from unmeasured or uncontrolled factors. However, when combining subjective probability estimates, heterogeneity stemming from people’s cognitive or information diversity is often more important than measurement noise. This article presents a novel framework that uses partially overlapping information sources. A specific model is proposed within that framework and applied to the task of aggregating the probabilities given by a group of forecasters who predict whether an event will occur or not. Our model describes the distribution of information across forecasters in terms of easily interpretable parameters and shows how the optimal amount of extremizing of the average probability forecast (shifting it closer to its nearest extreme) varies as a function of the forecasters’ information overlap. Our model thus gives a more principled understanding of the historically ad hoc practice of extremizing average forecasts. Supplementary material for this article is available online.
Journal: Journal of the American Statistical Association
Pages: 1623-1633
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1100621
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1100621
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1623-1633
Template-Type: ReDIF-Article 1.0
Author-Name: Pramita Bagchi
Author-X-Name-First: Pramita
Author-X-Name-Last: Bagchi
Author-Name: Moulinath Banerjee
Author-X-Name-First: Moulinath
Author-X-Name-Last: Banerjee
Author-Name: Stilian A. Stoev
Author-X-Name-First: Stilian A.
Author-X-Name-Last: Stoev
Title: Inference for Monotone Functions Under Short- and Long-Range Dependence: Confidence Intervals and New Universal Limits
Abstract:
We introduce new pointwise confidence interval estimates for monotone functions observed with additive, dependent noise. Our methodology applies to both short- and long-range dependence regimes for the errors. The interval estimates are obtained via the method of inversion of certain discrepancy statistics. This approach avoids the estimation of nuisance parameters such as the derivative of the unknown function, which previous methods are forced to deal with. The resulting estimates are therefore more accurate, stable, and widely applicable in practice under minimal assumptions on the trend and error structure. The dependence of the errors, especially long-range dependence, leads to new phenomena, where new universal limits based on convex minorant functionals of drifted fractional Brownian motion emerge. Some extensions to uniform confidence bands are also developed. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1634-1647
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1100622
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1100622
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1634-1647
Template-Type: ReDIF-Article 1.0
Author-Name: Pietro Coretto
Author-X-Name-First: Pietro
Author-X-Name-Last: Coretto
Author-Name: Christian Hennig
Author-X-Name-First: Christian
Author-X-Name-Last: Hennig
Title: Robust Improper Maximum Likelihood: Tuning, Computation, and a Comparison With Other Methods for Robust Gaussian Clustering
Abstract:
The two main topics of this article are the introduction of the “optimally tuned robust improper maximum likelihood estimator” (OTRIMLE) for robust clustering based on the multivariate Gaussian model for clusters, and a comprehensive simulation study comparing the OTRIMLE to maximum likelihood in Gaussian mixtures with and without noise component, mixtures of t-distributions, and the TCLUST approach for trimmed clustering. The OTRIMLE uses an improper constant density for modeling outliers and noise. This can be chosen optimally so that the nonnoise part of the data looks as close to a Gaussian mixture as possible. Some deviation from Gaussianity can be traded in for lowering the estimated noise proportion. Covariance matrix constraints and computation of the OTRIMLE are also treated. In the simulation study, all methods are confronted with setups in which their model assumptions are not exactly fulfilled, and to evaluate the experiments in a standardized way by misclassification rates, a new model-based definition of “true clusters” is introduced that deviates from the usual identification of mixture components with clusters. In the study, every method turns out to be superior for one or more setups, but the OTRIMLE achieves the most satisfactory overall performance. The methods are also applied to two real datasets, one without and one with known “true” clusters. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1648-1659
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1100996
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1100996
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1648-1659
Template-Type: ReDIF-Article 1.0
Author-Name: Rebecca C. Steorts
Author-X-Name-First: Rebecca C.
Author-X-Name-Last: Steorts
Author-Name: Rob Hall
Author-X-Name-First: Rob
Author-X-Name-Last: Hall
Author-Name: Stephen E. Fienberg
Author-X-Name-First: Stephen E.
Author-X-Name-Last: Fienberg
Title: A Bayesian Approach to Graphical Record Linkage and Deduplication
Abstract:
We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture–recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1660-1672
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1105807
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1105807
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1660-1672
Template-Type: ReDIF-Article 1.0
Author-Name: Wang Miao
Author-X-Name-First: Wang
Author-X-Name-Last: Miao
Author-Name: Peng Ding
Author-X-Name-First: Peng
Author-X-Name-Last: Ding
Author-Name: Zhi Geng
Author-X-Name-First: Zhi
Author-X-Name-Last: Geng
Title: Identifiability of Normal and Normal Mixture Models with Nonignorable Missing Data
Abstract:
Missing data problems arise in many applied research studies. They may jeopardize statistical inference of the model of interest, if the missing mechanism is nonignorable, that is, the missing mechanism depends on the missing values themselves even conditional on the observed data. With a nonignorable missing mechanism, the model of interest is often not identifiable without imposing further assumptions. We find that even if the missing mechanism has a known parametric form, the model is not identifiable without specifying a parametric outcome distribution. Although it is fundamental for valid statistical inference, identifiability under nonignorable missing mechanisms is not established for many commonly used models. In this article, we first demonstrate identifiability of the normal distribution under monotone missing mechanisms. We then extend it to the normal mixture and t mixture models with nonmonotone missing mechanisms. We discover that models under the Logistic missing mechanism are less identifiable than those under the Probit missing mechanism. We give necessary and sufficient conditions for identifiability of models under the Logistic missing mechanism, which sometimes can be checked in real data analysis. We illustrate our methods using a series of simulations, and apply them to a real-life dataset. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1673-1683
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1105808
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1105808
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1673-1683
Template-Type: ReDIF-Article 1.0
Author-Name: Valentin Patilea
Author-X-Name-First: Valentin
Author-X-Name-Last: Patilea
Author-Name: César Sánchez-Sellero
Author-X-Name-First: César
Author-X-Name-Last: Sánchez-Sellero
Author-Name: Matthieu Saumard
Author-X-Name-First: Matthieu
Author-X-Name-Last: Saumard
Title: Testing the Predictor Effect on a Functional Response
Abstract:
This article examines the problem of nonparametric testing for the no-effect of a random covariate (or predictor) on a functional response. This means testing whether the conditional expectation of the response given the covariate is almost surely zero or not, without imposing any model relating response and covariate. The covariate could be univariate, multivariate, or functional. Our test statistic is a quadratic form involving univariate nearest neighbor smoothing and the asymptotic critical values are given by the standard normal law. When the covariate is multidimensional or functional, a preliminary dimension reduction device is used, which allows the effect of the covariate to be summarized into a univariate random quantity. The test is able to detect not only linear but nonparametric alternatives. The responses could have conditional variance of unknown form and the law of the covariate does not need to be known. An empirical study with simulated and real data shows that the test performs well in applications. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1684-1695
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1110031
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1110031
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1684-1695
Template-Type: ReDIF-Article 1.0
Author-Name: Bing-Yi Jing
Author-X-Name-First: Bing-Yi
Author-X-Name-Last: Jing
Author-Name: Zhouping Li
Author-X-Name-First: Zhouping
Author-X-Name-Last: Li
Author-Name: Guangming Pan
Author-X-Name-First: Guangming
Author-X-Name-Last: Pan
Author-Name: Wang Zhou
Author-X-Name-First: Wang
Author-X-Name-Last: Zhou
Title: On SURE-Type Double Shrinkage Estimation
Abstract:
The article is concerned with empirical Bayes shrinkage estimators for the heteroscedastic hierarchical normal model using Stein's unbiased estimate of risk (SURE). Recently, Xie, Kou, and Brown proposed a class of estimators for this type of problem and established their asymptotic optimality properties under the assumption of known but unequal variances. In this article, we consider this problem with unequal and unknown variances, which may be more appropriate in real situations. By placing priors on both means and variances, we propose novel SURE-type double shrinkage estimators that shrink both means and variances. Optimal properties for these estimators are derived under certain regularity conditions. Extensive simulation studies are conducted to compare the newly developed methods with other shrinkage techniques. Finally, the methods are applied to the well-known baseball dataset and a gene expression dataset. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1696-1704
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1110032
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1110032
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1696-1704
Template-Type: ReDIF-Article 1.0
Author-Name: Naveen N. Narisetty
Author-X-Name-First: Naveen N.
Author-X-Name-Last: Narisetty
Author-Name: Vijayan N. Nair
Author-X-Name-First: Vijayan N.
Author-X-Name-Last: Nair
Title: Extremal Depth for Functional Data and Applications
Abstract:
We propose a new notion called “extremal depth” (ED) for functional data, discuss its properties, and compare its performance with existing concepts. The proposed notion is based on a measure of extreme “outlyingness.” ED has several desirable properties that are not shared by other notions and is especially well suited for obtaining central regions of functional data and function spaces. In particular: (a) the central region achieves the nominal (desired) simultaneous coverage probability; (b) there is a correspondence between ED-based (simultaneous) central regions and appropriate pointwise central regions; and (c) the method is resistant to certain classes of functional outliers. The article examines the performance of ED and compares it with other depth notions. Its usefulness is demonstrated through applications to constructing central regions, functional boxplots, outlier detection, and simultaneous confidence bands in regression problems. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1705-1714
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1110033
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1110033
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1705-1714
Template-Type: ReDIF-Article 1.0
Author-Name: Pier Luigi Conti
Author-X-Name-First: Pier Luigi
Author-X-Name-Last: Conti
Author-Name: Daniela Marella
Author-X-Name-First: Daniela
Author-X-Name-Last: Marella
Author-Name: Mauro Scanu
Author-X-Name-First: Mauro
Author-X-Name-Last: Scanu
Title: Statistical Matching Analysis for Complex Survey Data With Applications
Abstract:
The goal of statistical matching is the estimation of a joint distribution having observed only samples from its marginals. The lack of joint observations on the variables of interest is the reason for the uncertainty about the joint population distribution function. In the present article, the notion of matching error is introduced, and upper-bounded via an appropriate measure of uncertainty. Then, an estimate of the distribution function for the variables not jointly observed is constructed on the basis of a modification of the conditional independence assumption in the presence of logical constraints. The corresponding measure of uncertainty is estimated via sample data. Finally, a simulation study is performed, and an application to a real case is provided. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1715-1725
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1112803
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1112803
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1715-1725
Template-Type: ReDIF-Article 1.0
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Lingzhou Xue
Author-X-Name-First: Lingzhou
Author-X-Name-Last: Xue
Author-Name: Hui Zou
Author-X-Name-First: Hui
Author-X-Name-Last: Zou
Title: Multitask Quantile Regression Under the Transnormal Model
Abstract:
We consider estimating multitask quantile regression under the transnormal model, with a focus on the high-dimensional setting. We derive a surprisingly simple closed-form solution through rank-based covariance regularization. In particular, we propose the rank-based ℓ1 penalization with positive-definite constraints for estimating sparse covariance matrices, and the rank-based banded Cholesky decomposition regularization for estimating banded precision matrices. By taking advantage of the alternating direction method of multipliers, a nearest correlation matrix projection is introduced that inherits the sampling properties of the unprojected estimator. Our work combines strengths of quantile regression and rank-based covariance regularization to simultaneously deal with nonlinearity and nonnormality in high-dimensional regression. Furthermore, the proposed method strikes a good balance between robustness and efficiency, achieves the "oracle"-like convergence rate, and provides a provable prediction interval in the high-dimensional setting. The finite-sample performance of the proposed method is also examined. The performance of our proposed rank-based method is demonstrated in a real application to analyze protein mass spectroscopy data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1726-1735
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1113973
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1113973
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1726-1735
Template-Type: ReDIF-Article 1.0
Author-Name: J. D. Godolphin
Author-X-Name-First: J. D.
Author-X-Name-Last: Godolphin
Title: A Link Between the E-Value and the Robustness of Block Designs
Abstract:
This article investigates the robustness of binary incomplete block designs against giving rise to a disconnected design in the event of observation loss. A link is established between the E-value of a planned design and the extent of observation loss that can be experienced while still guaranteeing an eventual design from which all treatment contrasts can be estimated. Patterns of missing observations covered include loss of entire blocks and loss of individual observations. Simple bounds are provided enabling practitioners to easily assess the robustness of a planned design.
Journal: Journal of the American Statistical Association
Pages: 1736-1745
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1114949
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1114949
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1736-1745
Template-Type: ReDIF-Article 1.0
Author-Name: Botond Cseke
Author-X-Name-First: Botond
Author-X-Name-Last: Cseke
Author-Name: Andrew Zammit-Mangion
Author-X-Name-First: Andrew
Author-X-Name-Last: Zammit-Mangion
Author-Name: Tom Heskes
Author-X-Name-First: Tom
Author-X-Name-Last: Heskes
Author-Name: Guido Sanguinetti
Author-X-Name-First: Guido
Author-X-Name-Last: Sanguinetti
Title: Sparse Approximate Inference for Spatio-Temporal Point Process Models
Abstract:
Spatio-temporal log-Gaussian Cox process models play a central role in the analysis of spatially distributed systems in several disciplines. Yet, scalable inference remains computationally challenging both due to the high-resolution modeling generally required and the analytically intractable likelihood function. Here, we exploit the sparsity structure typical of (spatially) discretized log-Gaussian Cox process models by using approximate message-passing algorithms. The proposed algorithms scale well with the state dimension and the length of the temporal horizon with moderate loss in distributional accuracy. They hence provide a flexible and faster alternative to both nonlinear filtering-smoothing type algorithms and to approaches that implement the Laplace method or expectation propagation on (block) sparse latent Gaussian models. We infer the parameters of the latent Gaussian model using a structured variational Bayes approach. We demonstrate the proposed framework on simulation studies with both Gaussian and point-process observations and use it to reconstruct the conflict intensity and dynamics in Afghanistan from the WikiLeaks Afghan War Diary. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1746-1763
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1115357
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1115357
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1746-1763
Template-Type: ReDIF-Article 1.0
Author-Name: Claudio Agostinelli
Author-X-Name-First: Claudio
Author-X-Name-Last: Agostinelli
Author-Name: Víctor J. Yohai
Author-X-Name-First: Víctor J.
Author-X-Name-Last: Yohai
Title: Composite Robust Estimators for Linear Mixed Models
Abstract:
The classical Tukey–Huber contamination model (CCM) is a commonly adopted framework to describe the mechanism of outlier generation in robust statistics. Given a dataset with n observations and p variables, under the CCM, an outlier is an entire unit, even if only one or a few of its values are corrupted. Classical robust procedures were designed to cope with this type of outlier. Recently, a new mechanism of outlier generation was introduced, namely, the independent contamination model (ICM), where the events that each cell of the data matrix is an outlier are independent and have the same probability. The ICM poses new challenges to robust statistics since the percentage of contaminated rows increases dramatically with p, often reaching more than 50%, whereas classical affine equivariant robust procedures have a breakdown point of 50% at most. For the ICM, we propose a new type of robust method, namely, composite robust procedures, inspired by the idea of composite likelihood, where low-dimensional likelihoods, very often likelihoods of pairs, are aggregated to obtain a tractable approximation of the full likelihood. Our composite robust procedures are built on pairs of observations to gain robustness under the ICM. We propose composite τ-estimators for linear mixed models. Composite τ-estimators are proved to have a high breakdown point under both the CCM and the ICM. A Monte Carlo study shows that while classical S-estimators can only cope with outliers generated by the CCM, the estimators proposed here are resistant to both CCM and ICM outliers. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1764-1774
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1115358
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1115358
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1764-1774
Template-Type: ReDIF-Article 1.0
Author-Name: Xinyu Zhang
Author-X-Name-First: Xinyu
Author-X-Name-Last: Zhang
Author-Name: Dalei Yu
Author-X-Name-First: Dalei
Author-X-Name-Last: Yu
Author-Name: Guohua Zou
Author-X-Name-First: Guohua
Author-X-Name-Last: Zou
Author-Name: Hua Liang
Author-X-Name-First: Hua
Author-X-Name-Last: Liang
Title: Optimal Model Averaging Estimation for Generalized Linear Models and Generalized Linear Mixed-Effects Models
Abstract:
Considering model averaging estimation in generalized linear models, we propose a weight choice criterion based on the Kullback–Leibler (KL) loss with a penalty term. This criterion differs in principle from that for continuous observations, but reduces to the Mallows criterion in that situation. We prove that the corresponding model averaging estimator is asymptotically optimal under certain assumptions. We further extend our concern to the generalized linear mixed-effects model framework and establish the associated theory. Numerical experiments illustrate that the proposed method is promising.
Journal: Journal of the American Statistical Association
Pages: 1775-1790
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1115762
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1115762
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1775-1790
Template-Type: ReDIF-Article 1.0
Author-Name: Abhra Sarkar
Author-X-Name-First: Abhra
Author-X-Name-Last: Sarkar
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Bayesian Nonparametric Modeling of Higher Order Markov Chains
Abstract:
We consider the problem of flexible modeling of higher order Markov chains when an upper bound on the order of the chain is known but the true order and nature of the serial dependence are unknown. We propose Bayesian nonparametric methodology based on conditional tensor factorizations, which can characterize any transition probability with a specified maximal order. The methodology selects the important lags and captures higher order interactions among the lags, while also facilitating calculation of Bayes factors for a variety of hypotheses of interest. We design efficient Markov chain Monte Carlo algorithms for posterior computation, allowing for uncertainty in the set of important lags to be included and in the nature and order of the serial dependence. The methods are illustrated using simulation experiments and real world applications. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1791-1803
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1115763
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1115763
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1791-1803
Template-Type: ReDIF-Article 1.0
Author-Name: Degui Li
Author-X-Name-First: Degui
Author-X-Name-Last: Li
Author-Name: Junhui Qian
Author-X-Name-First: Junhui
Author-X-Name-Last: Qian
Author-Name: Liangjun Su
Author-X-Name-First: Liangjun
Author-X-Name-Last: Su
Title: Panel Data Models With Interactive Fixed Effects and Multiple Structural Breaks
Abstract:
In this article, we consider estimation of common structural breaks in panel data models with unobservable interactive fixed effects. We introduce a penalized principal component (PPC) estimation procedure with an adaptive group fused LASSO to detect the multiple structural breaks in the models. Under some mild conditions, we show that with probability approaching one the proposed method can correctly determine the unknown number of breaks and consistently estimate the common break dates. Furthermore, we estimate the regression coefficients through the post-LASSO method and establish the asymptotic distribution theory for the resulting estimators. The developed methodology and theory are applicable to the case of dynamic panel data models. Simulation results demonstrate that the proposed method works well in finite samples, with low false detection probability when there is no structural break and high probability of correctly estimating the number of breaks when structural breaks exist. We finally apply our method to study the environmental Kuznets curve for 74 countries over 40 years and detect two breaks in the data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1804-1819
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1119696
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1119696
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1804-1819
Template-Type: ReDIF-Article 1.0
Author-Name: Colin B. Fogarty
Author-X-Name-First: Colin B.
Author-X-Name-Last: Fogarty
Author-Name: Dylan S. Small
Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small
Title: Sensitivity Analysis for Multiple Comparisons in Matched Observational Studies Through Quadratically Constrained Linear Programming
Abstract:
A sensitivity analysis in an observational study assesses the robustness of significant findings to unmeasured confounding. While sensitivity analyses in matched observational studies have been well addressed when there is a single outcome variable, accounting for multiple comparisons through the existing methods yields overly conservative results when there are multiple outcome variables of interest. This stems from the fact that unmeasured confounding cannot affect the probability of assignment to treatment differently depending on the outcome being analyzed. Existing methods allow this to occur by combining the results of individual sensitivity analyses to assess whether at least one hypothesis is significant, which in turn results in an overly pessimistic assessment of a study's sensitivity to unobserved biases. By solving a quadratically constrained linear program, we are able to perform a sensitivity analysis while enforcing that unmeasured confounding must have the same impact on the treatment assignment probabilities across outcomes for each individual in the study. We show that this allows for uniform improvements in the power of a sensitivity analysis not only for testing the overall null of no effect, but also for null hypotheses on specific outcome variables while strongly controlling the familywise error rate. We illustrate our method through an observational study on the effect of smoking on naphthalene exposure. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1820-1830
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1120675
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1120675
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1820-1830
Template-Type: ReDIF-Article 1.0
Author-Name: J. R. Lockwood
Author-X-Name-First: J. R.
Author-X-Name-Last: Lockwood
Author-Name: Daniel F. McCaffrey
Author-X-Name-First: Daniel F.
Author-X-Name-Last: McCaffrey
Title: Matching and Weighting With Functions of Error-Prone Covariates for Causal Inference
Abstract:
Matching estimators are commonly used to estimate causal effects in nonexperimental settings. Covariate measurement error can be problematic for matching estimators when observational treatment groups differ on latent quantities observed only through error-prone surrogates. We establish necessary and sufficient conditions for matching and weighting with functions of observed covariates to yield unconfounded causal effect estimators, generalizing results from the standard (i.e., no measurement error) case. We establish that in common covariate measurement error settings, including continuous variables with continuous measurement error, discrete variables with misclassification, and factor and item response theory models, no single function of the observed covariates computed for all units in a study is appropriate for matching. However, we demonstrate that in some circumstances, it is possible to create different functions of the observed covariates for treatment and control units to construct a variable appropriate for matching. We also demonstrate the counterintuitive result that in some settings, it is possible to selectively contaminate the covariates with additional measurement error to construct a variable appropriate for matching. We discuss the implications of our results for the choice between matching and weighting estimators with error-prone covariates. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1831-1839
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2015.1122601
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1122601
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1831-1839
Template-Type: ReDIF-Article 1.0
Author-Name: Guanhua Chen
Author-X-Name-First: Guanhua
Author-X-Name-Last: Chen
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: Michael R. Kosorok
Author-X-Name-First: Michael R.
Author-X-Name-Last: Kosorok
Title: Personalized Dose Finding Using Outcome Weighted Learning
Abstract:
In dose-finding clinical trials, it is becoming increasingly important to account for individual-level heterogeneity while searching for optimal doses to ensure an optimal individualized dose rule (IDR) maximizes the expected beneficial clinical outcome for each individual. In this article, we advocate a randomized trial design where candidate dose levels assigned to study subjects are randomly chosen from a continuous distribution within a safe range. To estimate the optimal IDR using such data, we propose an outcome weighted learning method based on a nonconvex loss function, which can be solved efficiently using a difference of convex functions algorithm. The consistency and convergence rate for the estimated IDR are derived, and its small-sample performance is evaluated via simulation studies. We demonstrate that the proposed method outperforms competing approaches. Finally, we illustrate this method using data from a cohort study for warfarin (an anti-thrombotic drug) dosing. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1509-1521
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1148611
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1148611
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1509-1521
Template-Type: ReDIF-Article 1.0
Author-Name: Martin Lysy
Author-X-Name-First: Martin
Author-X-Name-Last: Lysy
Author-Name: Natesh S. Pillai
Author-X-Name-First: Natesh S.
Author-X-Name-Last: Pillai
Author-Name: David B. Hill
Author-X-Name-First: David B.
Author-X-Name-Last: Hill
Author-Name: M. Gregory Forest
Author-X-Name-First: M. Gregory
Author-X-Name-Last: Forest
Author-Name: John W. R. Mellnik
Author-X-Name-First: John W. R.
Author-X-Name-Last: Mellnik
Author-Name: Paula A. Vasquez
Author-X-Name-First: Paula A.
Author-X-Name-Last: Vasquez
Author-Name: Scott A. McKinley
Author-X-Name-First: Scott A.
Author-X-Name-Last: McKinley
Title: Model Comparison and Assessment for Single Particle Tracking in Biological Fluids
Abstract:
State-of-the-art techniques in passive particle-tracking microscopy provide high-resolution path trajectories of diverse foreign particles in biological fluids. For particles on the order of 1 μm diameter, these paths are generally inconsistent with simple Brownian motion. Yet, despite an abundance of data confirming these findings and their wide-ranging scientific implications, stochastic modeling of the complex particle motion has received comparatively little attention. Even among posited models, there is virtually no literature on likelihood-based inference, model comparisons, and other quantitative assessments. In this article, we develop a rigorous and computationally efficient Bayesian methodology to address this gap. We analyze two of the most prevalent candidate models for 30-sec paths of 1 μm diameter tracer particles in human lung mucus: fractional Brownian motion (fBM) and a Generalized Langevin Equation (GLE) consistent with viscoelastic theory. Our model comparisons distinctly favor GLE over fBM, with the former describing the data remarkably well up to the timescales for which we have reliable information. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1413-1426
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1158716
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1158716
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1413-1426
Template-Type: ReDIF-Article 1.0
Author-Name: Yize Zhao
Author-X-Name-First: Yize
Author-X-Name-Last: Zhao
Author-Name: Matthias Chung
Author-X-Name-First: Matthias
Author-X-Name-Last: Chung
Author-Name: Brent A. Johnson
Author-X-Name-First: Brent A.
Author-X-Name-Last: Johnson
Author-Name: Carlos S. Moreno
Author-X-Name-First: Carlos S.
Author-X-Name-Last: Moreno
Author-Name: Qi Long
Author-X-Name-First: Qi
Author-X-Name-Last: Long
Title: Hierarchical Feature Selection Incorporating Known and Novel Biological Information: Identifying Genomic Features Related to Prostate Cancer Recurrence
Abstract:
Our work is motivated by a prostate cancer study aimed at identifying mRNA and miRNA biomarkers that are predictive of cancer recurrence after prostatectomy. It has been shown in the literature that incorporating known biological information on pathway memberships and interactions among biomarkers improves feature selection of high-dimensional biomarkers in relation to disease risk. Biological information is often represented by graphs or networks, in which biomarkers are represented by nodes and interactions among them are represented by edges; however, biological information is often not fully known. For example, the role of microRNAs (miRNAs) in regulating gene expression is not fully understood and the miRNA regulatory network is not fully established, in which case new strategies are needed for feature selection. To this end, we treat unknown biological information as missing data (i.e., missing edges in graphs), different from commonly encountered missing data problems where variable values are missing. We propose a new concept of imputing unknown biological information based on observed data and define the imputed information as the novel biological information. In addition, we propose a hierarchical group penalty to encourage sparsity and feature selection at both the pathway level and the within-pathway level, which, combined with the imputation step, allows for incorporation of known and novel biological information. While it is applicable to general regression settings, we develop and investigate the proposed approach in the context of semiparametric accelerated failure time models motivated by our data example. Data application and simulation studies show that incorporation of novel biological information improves performance in risk prediction and feature selection and the proposed penalty outperforms the extensions of several existing penalties. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1427-1439
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1164051
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1164051
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1427-1439
Template-Type: ReDIF-Article 1.0
Author-Name: Mark Fiecas
Author-X-Name-First: Mark
Author-X-Name-Last: Fiecas
Author-Name: Hernando Ombao
Author-X-Name-First: Hernando
Author-X-Name-Last: Ombao
Title: Modeling the Evolution of Dynamic Brain Processes During an Associative Learning Experiment
Abstract:
We develop a new time series model to investigate the dynamic interactions between the nucleus accumbens and the hippocampus during an associative learning experiment. Preliminary analyses indicated that the spectral properties of the local field potentials at these two regions changed over the trials of the experiment. While many models already take into account nonstationarity within a single trial, the evolution of the dynamics across trials is often ignored. Our proposed model, the slowly evolving locally stationary process (SEv-LSP), is designed to capture nonstationarity both within a trial and across trials. We rigorously define the evolving evolutionary spectral density matrix, which we estimate using a two-stage procedure. In the first stage, we compute the within-trial time-localized periodogram matrix. In the second stage, we develop a data-driven approach that combines information from trial-specific local periodogram matrices. Through simulation studies, we show the utility of our proposed method for analyzing time series data with different evolutionary structures. Finally, we use the SEv-LSP model to demonstrate the evolving dynamics between the hippocampus and the nucleus accumbens during an associative learning experiment. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1440-1453
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1165683
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1165683
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1440-1453
Template-Type: ReDIF-Article 1.0
Author-Name: J. T. Gaskins
Author-X-Name-First: J. T.
Author-X-Name-Last: Gaskins
Author-Name: M. J. Daniels
Author-X-Name-First: M. J.
Author-X-Name-Last: Daniels
Author-Name: B. H. Marcus
Author-X-Name-First: B. H.
Author-X-Name-Last: Marcus
Title: Bayesian Methods for Nonignorable Dropout in Joint Models in Smoking Cessation Studies
Abstract:
Inference on data with missingness can be challenging, particularly if the knowledge that a measurement was unobserved provides information about its distribution. Our work is motivated by the Commit to Quit II study, a smoking cessation trial that measured smoking status and weight change as weekly outcomes. It is expected that dropout in this study was informative and that patients with missed measurements are more likely to be smoking, even after conditioning on their observed smoking and weight history. We jointly model the categorical smoking status and continuous weight change outcomes by assuming normal latent variables for cessation and by extending the usual pattern mixture model to the bivariate case. The model includes a novel approach to sharing information across patterns through a Bayesian shrinkage framework to improve estimation stability for sparsely observed patterns. To accommodate the presumed informativeness of the missing data in a parsimonious manner, we model the unidentified components of the model under a nonfuture dependence assumption and specify departures from missing at random through sensitivity parameters, whose distributions are elicited from a subject-matter expert. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1454-1465
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1167693
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1167693
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1454-1465
Template-Type: ReDIF-Article 1.0
Author-Name: Jared S. Murray
Author-X-Name-First: Jared S.
Author-X-Name-Last: Murray
Author-Name: Jerome P. Reiter
Author-X-Name-First: Jerome P.
Author-X-Name-Last: Reiter
Title: Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models With Local Dependence
Abstract:
We present a nonparametric Bayesian joint model for multivariate continuous and categorical variables, with the intention of developing a flexible engine for multiple imputation of missing values. The model fuses Dirichlet process mixtures of multinomial distributions for categorical variables with Dirichlet process mixtures of multivariate normal distributions for continuous variables. We incorporate dependence between the continuous and categorical variables by (1) modeling the means of the normal distributions as component-specific functions of the categorical variables and (2) forming distinct mixture components for the categorical and continuous data with probabilities that are linked via a hierarchical model. This structure allows the model to capture complex dependencies between the categorical and continuous data with minimal tuning by the analyst. We apply the model to impute missing values due to item nonresponse in an evaluation of the redesign of the Survey of Income and Program Participation (SIPP). The goal is to compare estimates from a field test with the new design to estimates from selected individuals from a panel collected under the old design. We show that accounting for the missing data changes some conclusions about the comparability of the distributions in the two datasets. We also perform an extensive repeated sampling simulation using similar data from complete cases in an existing SIPP panel, comparing our proposed model to a default application of multiple imputation by chained equations. Imputations based on the proposed model tend to have better repeated sampling properties than the default application of chained equations in this realistic setting. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1466-1479
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1174132
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1174132
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1466-1479
Template-Type: ReDIF-Article 1.0
Author-Name: Wolfgang Karl Härdle
Author-X-Name-First: Wolfgang Karl
Author-X-Name-Last: Härdle
Author-Name: Brenda López Cabrera
Author-X-Name-First: Brenda
Author-X-Name-Last: López Cabrera
Author-Name: Ostap Okhrin
Author-X-Name-First: Ostap
Author-X-Name-Last: Okhrin
Author-Name: Weining Wang
Author-X-Name-First: Weining
Author-X-Name-Last: Wang
Title: Localizing Temperature Risk
Abstract:
On the temperature derivative market, modeling temperature volatility is an important issue for pricing and hedging. To apply the pricing tools of financial mathematics, one needs to isolate a Gaussian risk factor. A conventional model for temperature dynamics is a stochastic model with seasonality and intertemporal autocorrelation. Empirical work based on seasonality and autocorrelation correction reveals that the obtained residuals are heteroscedastic with a periodic pattern. The object of this research is to estimate this heteroscedastic function so that, after scale normalization, a pure standardized Gaussian variable appears. Earlier works investigated temperature risk in different locations and showed that neither parametric component functions nor a local linear smoother with constant smoothing parameter are flexible enough to generally describe the variance process well. Therefore, we consider a local adaptive modeling approach to find, at each time point, an optimal smoothing parameter to locally estimate the seasonality and volatility. Our approach provides a more flexible and accurate fitting procedure for localized temperature risk by achieving nearly normal risk factors. We also employ our model to forecast the temperature in different cities and compare it to a model developed in 2005 by Campbell and Diebold. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1491-1508
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1180985
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1180985
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1491-1508
Template-Type: ReDIF-Article 1.0
Author-Name: Simon N. Wood
Author-X-Name-First: Simon N.
Author-X-Name-Last: Wood
Author-Name: Natalya Pya
Author-X-Name-First: Natalya
Author-X-Name-Last: Pya
Author-Name: Benjamin Säfken
Author-X-Name-First: Benjamin
Author-X-Name-Last: Säfken
Title: Smoothing Parameter and Model Selection for General Smooth Models
Abstract:
This article discusses a general framework for smoothing parameter estimation for models with regular likelihoods constructed in terms of unknown smooth functions of covariates. Gaussian random effects and parametric terms may also be present. By construction the method is numerically stable and convergent, and enables smoothing parameter uncertainty to be quantified. The latter enables us to fix a well-known problem with AIC for such models, thereby improving the range of model selection tools available. The smooth functions are represented by reduced-rank, spline-like smoothers, with associated quadratic penalties measuring function smoothness. Model estimation is by penalized likelihood maximization, where the smoothing parameters controlling the extent of penalization are estimated by Laplace approximate marginal likelihood. The methods cover, for example, generalized additive models for nonexponential family responses (e.g., beta, ordered categorical, scaled t distribution, negative binomial and Tweedie distributions), generalized additive models for location scale and shape (e.g., two-stage zero inflation models, and Gaussian location-scale models), Cox proportional hazards models and multivariate additive models. The framework reduces the implementation of new model classes to the coding of some standard derivatives of the log-likelihood. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1548-1563
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1180986
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1180986
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1548-1563
Template-Type: ReDIF-Article 1.0
Author-Name: Jörg Polzehl
Author-X-Name-First: Jörg
Author-X-Name-Last: Polzehl
Author-Name: Karsten Tabelow
Author-X-Name-First: Karsten
Author-X-Name-Last: Tabelow
Title: Low SNR in Diffusion MRI Models
Abstract:
Noise is a common issue for all magnetic resonance imaging (MRI) techniques such as diffusion MRI and obviously leads to variability of the estimates in any model describing the data. Increasing spatial resolution in MR experiments further diminishes the signal-to-noise ratio (SNR). However, with low SNR the expected signal deviates from the true value. Common modeling approaches therefore lead to a bias in estimated model parameters. Adjustments require an analysis of the data generating process and a characterization of the resulting distribution of the imaging data. We provide an adequate quasi-likelihood approach that employs these characteristics. We elaborate on the effects of typical data preprocessing and analyze the bias effects related to low SNR for the example of the diffusion tensor model in diffusion MRI. We then demonstrate the relevance of the problem using data from the Human Connectome Project. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1480-1490
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1222284
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1222284
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1480-1490
Template-Type: ReDIF-Article 1.0
Author-Name: Michael P. Wallace
Author-X-Name-First: Michael P.
Author-X-Name-Last: Wallace
Author-Name: Erica E. M. Moodie
Author-X-Name-First: Erica E. M.
Author-X-Name-Last: Moodie
Author-Name: David A. Stephens
Author-X-Name-First: David A.
Author-X-Name-Last: Stephens
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1530-1534
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1240080
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1240080
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1530-1534
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Correction
Journal: Journal of the American Statistical Association
Pages: 1852-1852
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1240685
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1240685
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1852-1852
Template-Type: ReDIF-Article 1.0
Author-Name: Alexander R. Luedtke
Author-X-Name-First: Alexander R.
Author-X-Name-Last: Luedtke
Author-Name: Mark J. van der Laan
Author-X-Name-First: Mark J.
Author-X-Name-Last: van der Laan
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1526-1530
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1242427
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1242427
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1526-1530
Template-Type: ReDIF-Article 1.0
Author-Name: Min Qian
Author-X-Name-First: Min
Author-X-Name-Last: Qian
Title: Comment
Abstract:
This commentary deals with issues related to the article by Chen, Zeng, and Kosorok. We present several potential modifications of the outcome weighted learning approach. Those modifications are based on truncated l2 loss. One advantage of l2 loss is that it is differentiable everywhere, which makes it more stable and computationally more tractable.
Journal: Journal of the American Statistical Association
Pages: 1538-1541
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1243479
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1243479
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1538-1541
Template-Type: ReDIF-Article 1.0
Author-Name: Elizabeth L. Ogburn
Author-X-Name-First: Elizabeth L.
Author-X-Name-Last: Ogburn
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1534-1537
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1243480
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1243480
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1534-1537
Template-Type: ReDIF-Article 1.0
Author-Name: Michael Rosenblum
Author-X-Name-First: Michael
Author-X-Name-Last: Rosenblum
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1541-1542
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1243481
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1243481
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1541-1542
Template-Type: ReDIF-Article 1.0
Author-Name: Tianxi Cai
Author-X-Name-First: Tianxi
Author-X-Name-Last: Cai
Author-Name: Lu Tian
Author-X-Name-First: Lu
Author-X-Name-Last: Tian
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1521-1524
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1244064
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1244064
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1521-1524
Template-Type: ReDIF-Article 1.0
Author-Name: Jun Fan
Author-X-Name-First: Jun
Author-X-Name-Last: Fan
Author-Name: Ming Yuan
Author-X-Name-First: Ming
Author-X-Name-Last: Yuan
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1524-1525
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1244065
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1244065
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1524-1525
Template-Type: ReDIF-Article 1.0
Author-Name: Edoardo M. Airoldi
Author-X-Name-First: Edoardo M.
Author-X-Name-Last: Airoldi
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 1410-1412
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1245070
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1245070
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1410-1412
Template-Type: ReDIF-Article 1.0
Author-Name: David M. Blei
Author-X-Name-First: David M.
Author-X-Name-Last: Blei
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1408-1410
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1245071
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1245071
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1408-1410
Template-Type: ReDIF-Article 1.0
Author-Name: Aleksandrina Goeva
Author-X-Name-First: Aleksandrina
Author-X-Name-Last: Goeva
Author-Name: Eric D. Kolaczyk
Author-X-Name-First: Eric D.
Author-X-Name-Last: Kolaczyk
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1405-1408
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1245072
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1245072
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1405-1408
Template-Type: ReDIF-Article 1.0
Author-Name: Matt Taddy
Author-X-Name-First: Matt
Author-X-Name-Last: Taddy
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1403-1405
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1245073
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1245073
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1403-1405
Template-Type: ReDIF-Article 1.0
Author-Name: Guanhua Chen
Author-X-Name-First: Guanhua
Author-X-Name-Last: Chen
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: Michael R. Kosorok
Author-X-Name-First: Michael R.
Author-X-Name-Last: Kosorok
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 1543-1547
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1250573
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1250573
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1543-1547
Template-Type: ReDIF-Article 1.0
Author-Name: Thomas Kneib
Author-X-Name-First: Thomas
Author-X-Name-Last: Kneib
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1563-1565
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1250576
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1250576
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1563-1565
Template-Type: ReDIF-Article 1.0
Author-Name: Thomas W. Yee
Author-X-Name-First: Thomas W.
Author-X-Name-Last: Yee
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1565-1568
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1250579
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1250579
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1565-1568
Template-Type: ReDIF-Article 1.0
Author-Name: Sonja Greven
Author-X-Name-First: Sonja
Author-X-Name-Last: Greven
Author-Name: Fabian Scheipl
Author-X-Name-First: Fabian
Author-X-Name-Last: Scheipl
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1568-1573
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1250580
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1250580
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1568-1573
Template-Type: ReDIF-Article 1.0
Author-Name: Simon N. Wood
Author-X-Name-First: Simon N.
Author-X-Name-Last: Wood
Author-Name: Natalya Pya
Author-X-Name-First: Natalya
Author-X-Name-Last: Pya
Author-Name: Benjamin Säfken
Author-X-Name-First: Benjamin
Author-X-Name-Last: Säfken
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 1573-1575
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1250583
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1250583
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1573-1575
Template-Type: ReDIF-Article 1.0
Author-Name: Jessica Utts
Author-X-Name-First: Jessica
Author-X-Name-Last: Utts
Title: Appreciating Statistics
Journal: Journal of the American Statistical Association
Pages: 1373-1380
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1250592
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1250592
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1373-1380
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Editorial Collaborators
Journal: Journal of the American Statistical Association
Pages: 1853-1861
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1255066
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1255066
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1853-1861
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Book Reviews
Journal: Journal of the American Statistical Association
Pages: 1840-1851
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1257826
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1257826
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:1840-1851
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Editorial Board EOV
Journal: Journal of the American Statistical Association
Pages: ebi-ebi
Issue: 516
Volume: 111
Year: 2016
Month: 10
X-DOI: 10.1080/01621459.2016.1267991
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1267991
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:516:p:ebi-ebi
Template-Type: ReDIF-Article 1.0
Author-Name: Daniel Peña
Author-X-Name-First: Daniel
Author-X-Name-Last: Peña
Author-Name: Victor J. Yohai
Author-X-Name-First: Victor J.
Author-X-Name-Last: Yohai
Title: Generalized Dynamic Principal Components
Abstract:
Brillinger defined dynamic principal components (DPC) for time series based on a reconstruction criterion. He gave a very elegant theoretical solution and proposed an estimator which is consistent under stationarity. Here, we propose a new entirely empirical approach to DPC. The main differences from the existing methods—mainly Brillinger's procedure—are (1) the DPC we propose need not be a linear combination of the observations and (2) it can be based on a variety of loss functions including robust ones. Unlike Brillinger, we do not establish any consistency results; however, contrary to Brillinger's, which has a very strong stationarity flavor, our concept aims at a better adaptation to possible nonstationary features of the series. We also present a robust version of our procedure that allows us to estimate the DPC when the series have outlier contamination. We give iterative algorithms to compute the proposed procedures that can be used with a large number of variables. Our nonrobust and robust procedures are illustrated with real datasets. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1121-1131
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1072542
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1072542
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1121-1131
Template-Type: ReDIF-Article 1.0
Author-Name: Ritabrata Das
Author-X-Name-First: Ritabrata
Author-X-Name-Last: Das
Author-Name: Moulinath Banerjee
Author-X-Name-First: Moulinath
Author-X-Name-Last: Banerjee
Author-Name: Bin Nan
Author-X-Name-First: Bin
Author-X-Name-Last: Nan
Author-Name: Huiyong Zheng
Author-X-Name-First: Huiyong
Author-X-Name-Last: Zheng
Title: Fast Estimation of Regression Parameters in a Broken-Stick Model for Longitudinal Data
Abstract:
Estimation of change-point locations in the broken-stick model has significant applications in modeling important biological phenomena. In this article, we present a computationally economical likelihood-based approach for estimating change-point(s) efficiently in both cross-sectional and longitudinal settings. Our method, based on local smoothing in a shrinking neighborhood of each change-point, is shown via simulations to be computationally more viable than existing methods that rely on search procedures, with dramatic gains in the multiple change-point case. The proposed estimates are shown to have $\sqrt{n}$-consistency and asymptotic normality—in particular, they are asymptotically efficient in the cross-sectional setting—allowing us to provide meaningful statistical inference. As our primary and motivating (longitudinal) application, we study the Michigan Bone Health and Metabolism Study cohort data to describe patterns of change in log estradiol levels, before and after the final menstrual period, for which a two change-point broken-stick model appears to be a good fit. We also illustrate our method on a plant growth dataset in the cross-sectional setting. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1132-1143
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1073154
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1073154
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1132-1143
Template-Type: ReDIF-Article 1.0
Author-Name: Mingyuan Zhou
Author-X-Name-First: Mingyuan
Author-X-Name-Last: Zhou
Author-Name: Oscar Hernan Madrid Padilla
Author-X-Name-First: Oscar Hernan Madrid
Author-X-Name-Last: Padilla
Author-Name: James G. Scott
Author-X-Name-First: James G.
Author-X-Name-Last: Scott
Title: Priors for Random Count Matrices Derived from a Family of Negative Binomial Processes
Abstract:
We define a family of probability distributions for random count matrices with a potentially unbounded number of rows and columns. The three distributions we consider are derived from the gamma-Poisson, gamma-negative binomial, and beta-negative binomial processes, which we refer to generically as a family of negative-binomial processes. Because the models lead to closed-form update equations within the context of a Gibbs sampler, they are natural candidates for nonparametric Bayesian priors over count matrices. A key aspect of our analysis is the recognition that although the random count matrices within the family are defined by a row-wise construction, their columns can be shown to be independent and identically distributed (iid). This fact is used to derive explicit formulas for drawing all the columns at once. Moreover, by analyzing these matrices’ combinatorial structure, we describe how to sequentially construct a column-iid random count matrix one row at a time, and derive the predictive distribution of a new row count vector with previously unseen features. We describe the similarities and differences between the three priors, and argue that the greater flexibility of the gamma- and beta-negative binomial processes—especially their ability to model over-dispersed, heavy-tailed count data—makes these well suited to a wide variety of real-world applications. As an example of our framework, we construct a naive-Bayes text classifier to categorize a count vector into one of several existing random count matrices of different categories. The classifier supports an unbounded number of features and, unlike most existing methods, it does not require a predefined finite vocabulary to be shared by all the categories, and needs neither feature selection nor parameter tuning. Both the gamma- and beta-negative binomial processes are shown to significantly outperform the gamma-Poisson process when applied to document categorization, with comparable performance to other state-of-the-art supervised text classification algorithms. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1144-1156
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1075407
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1075407
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1144-1156
Template-Type: ReDIF-Article 1.0
Author-Name: Samuel D. Pimentel
Author-X-Name-First: Samuel D.
Author-X-Name-Last: Pimentel
Author-Name: Dylan S. Small
Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small
Author-Name: Paul R. Rosenbaum
Author-X-Name-First: Paul R.
Author-X-Name-Last: Rosenbaum
Title: Constructed Second Control Groups and Attenuation of Unmeasured Biases
Abstract:
The informal folklore of observational studies claims that if an irrelevant observed covariate is left uncontrolled, say unmatched, then it will influence treatment assignment in haphazard ways, thereby diminishing the biases from unmeasured covariates. We prove a result along these lines: it is true, in a certain sense, to a limited degree, under certain conditions. Alas, the conditions are neither inconsequential nor easy to check in empirical work; indeed, they are often dubious, more often implausible. We suggest the result is most useful in the computerized construction of a second control group, where the investigator can see more in available data without necessarily believing the required conditions. One of the two control groups controls for the possibly irrelevant observed covariate, the other control group either leaves it uncontrolled or forces separation; therefore, the investigator views one situation from two angles under different assumptions. A pair of sensitivity analyses for the two control groups is coordinated by a weighted Holm or recycling procedure built around the possibility of slight attenuation of bias in one control group. Issues are illustrated using an observational study of the possible effects of cigarette smoking as a cause of increased homocysteine levels, a risk factor for cardiovascular disease. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1157-1167
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1076342
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1076342
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1157-1167
Template-Type: ReDIF-Article 1.0
Author-Name: Fernando A. Quintana
Author-X-Name-First: Fernando A.
Author-X-Name-Last: Quintana
Author-Name: Wesley O. Johnson
Author-X-Name-First: Wesley O.
Author-X-Name-Last: Johnson
Author-Name: L. Elaine Waetjen
Author-X-Name-First: L. Elaine
Author-X-Name-Last: Waetjen
Author-Name: Ellen B. Gold
Author-X-Name-First: Ellen B.
Author-X-Name-Last: Gold
Title: Bayesian Nonparametric Longitudinal Data Analysis
Abstract:
Practical Bayesian nonparametric methods have been developed across a wide variety of contexts. Here, we develop a novel statistical model that generalizes standard mixed models for longitudinal data that include flexible mean functions as well as combined compound symmetry (CS) and autoregressive (AR) covariance structures. AR structure is often specified through the use of a Gaussian process (GP) with covariance functions that allow longitudinal data to be more correlated if they are observed closer in time than if they are observed farther apart. We allow for AR structure by considering a broader class of models that incorporates a Dirichlet process mixture (DPM) over the covariance parameters of the GP. We are able to take advantage of modern Bayesian statistical methods in making full predictive inferences about characteristics of longitudinal profiles and their differences across covariate combinations. We also take advantage of the generality of our model, which provides for estimation of a variety of covariance structures. We observe that models that fail to incorporate CS or AR structure can result in very poor estimation of a covariance or correlation matrix. In our illustration using hormone data observed on women through the menopausal transition, biology dictates the use of a generalized family of sigmoid functions as a model for time trends across subpopulation categories.
Journal: Journal of the American Statistical Association
Pages: 1168-1181
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1076725
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1076725
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1168-1181
Template-Type: ReDIF-Article 1.0
Author-Name: Ning Zhang
Author-X-Name-First: Ning
Author-X-Name-Last: Zhang
Author-Name: Daniel W. Apley
Author-X-Name-First: Daniel W.
Author-X-Name-Last: Apley
Title: Brownian Integrated Covariance Functions for Gaussian Process Modeling: Sigmoidal Versus Localized Basis Functions
Abstract:
Gaussian process modeling, or kriging, is a popular method for modeling data from deterministic computer simulations, and the most common choices of covariance function are Gaussian, power exponential, and Matérn. A characteristic of these covariance functions is that the basis functions associated with their corresponding response predictors are localized, in the sense that they decay to zero as the input location moves away from the simulated input sites. As a result, the predictors tend to revert to the prior mean, which can result in a bumpy fitted response surface. In contrast, a fractional Brownian field model results in a predictor with basis functions that are nonlocalized and more sigmoidal in shape, although it suffers from drawbacks such as inability to represent smooth response surfaces. We propose a class of Brownian integrated covariance functions obtained by incorporating an integrator (as in the white noise integral representation of a fractional Brownian field) into any stationary covariance function. Brownian integrated covariance models result in predictor basis functions that are nonlocalized and sigmoidal, but they are capable of modeling smooth response surfaces. We discuss fundamental differences between Brownian integrated and other covariance functions, and we illustrate by comparing Brownian integrated power exponential with regular power exponential kriging models in a number of examples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1182-1195
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1077711
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1077711
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1182-1195
Template-Type: ReDIF-Article 1.0
Author-Name: Ziqi Chen
Author-X-Name-First: Ziqi
Author-X-Name-Last: Chen
Author-Name: Chenlei Leng
Author-X-Name-First: Chenlei
Author-X-Name-Last: Leng
Title: Dynamic Covariance Models
Abstract:
An important problem in contemporary statistics is to understand the relationship among a large number of variables based on a dataset, usually with p, the number of the variables, much larger than n, the sample size. Recent efforts have focused on modeling static covariance matrices where pairwise covariances are considered invariant. In many real systems, however, these pairwise relations often change. To characterize the changing correlations in a high-dimensional system, we study a class of dynamic covariance models (DCMs) assumed to be sparse, and investigate for the first time a unified theory for understanding their nonasymptotic error rates and model selection properties. In particular, in the challenging high-dimensional regime, we highlight a new uniform consistency theory in which the sample size can be seen as n^{4/5} when the bandwidth parameter is chosen as h ∝ n^{-1/5} to account for the dynamics. We show that this result holds uniformly over a range of the variable used for modeling the dynamics. The convergence rate bears the mark of the familiar bias-variance trade-off in the kernel smoothing literature. We illustrate the results with simulations and the analysis of a neuroimaging dataset. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1196-1207
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1077712
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1077712
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1196-1207
Template-Type: ReDIF-Article 1.0
Author-Name: Ming-Yen Cheng
Author-X-Name-First: Ming-Yen
Author-X-Name-Last: Cheng
Author-Name: Toshio Honda
Author-X-Name-First: Toshio
Author-X-Name-Last: Honda
Author-Name: Jin-Ting Zhang
Author-X-Name-First: Jin-Ting
Author-X-Name-Last: Zhang
Title: Forward Variable Selection for Sparse Ultra-High Dimensional Varying Coefficient Models
Abstract:
Varying coefficient models have numerous applications in a wide scope of scientific areas. While enjoying nice interpretability, they also allow for flexibility in modeling dynamic impacts of the covariates. But, in the new era of big data, it is challenging to select the relevant variables when the dimensionality is very large. Recently, several works have focused on this important problem under sparsity assumptions; however, they are subject to some limitations. We introduce an appealing forward selection procedure. It selects important variables sequentially according to a reduction in sum of squares criterion and it employs a Bayesian information criterion (BIC)-based stopping rule. Clearly, it is simple to implement and fast to compute, and possesses many other desirable properties from theoretical and numerical viewpoints. The BIC is a special case of the extended BIC (EBIC) when an extra tuning parameter in the latter vanishes. We establish rigorous screening consistency results when either BIC or EBIC is used as the stopping criterion. The theoretical results depend on some conditions on the eigenvalues related to the design matrices, which can be relaxed in some situations. Results of an extensive simulation study and a real data example are also presented to show the efficacy and usefulness of our procedure. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1209-1221
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1080708
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1080708
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1209-1221
Template-Type: ReDIF-Article 1.0
Author-Name: Srijan Sengupta
Author-X-Name-First: Srijan
Author-X-Name-Last: Sengupta
Author-Name: Stanislav Volgushev
Author-X-Name-First: Stanislav
Author-X-Name-Last: Volgushev
Author-Name: Xiaofeng Shao
Author-X-Name-First: Xiaofeng
Author-X-Name-Last: Shao
Title: A Subsampled Double Bootstrap for Massive Data
Abstract:
The bootstrap is a popular and powerful method for assessing precision of estimators and inferential methods. However, for massive datasets that are increasingly prevalent, the bootstrap becomes prohibitively costly in computation and its feasibility is questionable even with modern parallel computing platforms. Recently, Kleiner and co-authors proposed a method called BLB (bag of little bootstraps) for massive data, which is more computationally scalable with little sacrifice of statistical accuracy. Building on BLB and the idea of fast double bootstrap, we propose a new resampling method, the subsampled double bootstrap, for both independent data and time series data. We establish consistency of the subsampled double bootstrap under mild conditions for both independent and dependent cases. Methodologically, the subsampled double bootstrap is superior to BLB in terms of running time, greater sample coverage, and automatic implementation with fewer tuning parameters for a given time budget. Its advantage relative to BLB and bootstrap is also demonstrated in numerical simulations and a data illustration. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1222-1232
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1080709
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1080709
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1222-1232
Template-Type: ReDIF-Article 1.0
Author-Name: Yanxun Xu
Author-X-Name-First: Yanxun
Author-X-Name-Last: Xu
Author-Name: Peter Müller
Author-X-Name-First: Peter
Author-X-Name-Last: Müller
Author-Name: Abdus S. Wahed
Author-X-Name-First: Abdus S.
Author-X-Name-Last: Wahed
Author-Name: Peter F. Thall
Author-X-Name-First: Peter F.
Author-X-Name-Last: Thall
Title: Bayesian Nonparametric Estimation for Dynamic Treatment Regimes With Sequential Transition Times
Abstract:
We analyze a dataset arising from a clinical trial involving multi-stage chemotherapy regimes for acute leukemia. The trial design was a 2 × 2 factorial for frontline therapies only. Motivated by the idea that subsequent salvage treatments affect survival time, we model therapy as a dynamic treatment regime (DTR), that is, an alternating sequence of adaptive treatments or other actions and transition times between disease states. These sequences may vary substantially between patients, depending on how the regime plays out. To evaluate the regimes, mean overall survival time is expressed as a weighted average of the means of all possible sums of successive transition times. We assume a Bayesian nonparametric survival regression model for each transition time, with a dependent Dirichlet process prior and Gaussian process base measure (DDP-GP). Posterior simulation is implemented by Markov chain Monte Carlo (MCMC) sampling. We provide general guidelines for constructing a prior using empirical Bayes methods. The proposed approach is compared with inverse probability of treatment weighting, including a doubly robust augmented version of this approach, for both single-stage and multi-stage regimes with treatment assignment depending on baseline covariates. The simulations show that the proposed nonparametric Bayesian approach can substantially improve inference compared to existing methods. An R program for implementing the DDP-GP-based Bayesian nonparametric analysis is freely available at www.ams.jhu.edu/yxu70. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 921-950
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1086353
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1086353
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:921-950
Template-Type: ReDIF-Article 1.0
Author-Name: Ulrich K. Müller
Author-X-Name-First: Ulrich K.
Author-X-Name-Last: Müller
Author-Name: Andriy Norets
Author-X-Name-First: Andriy
Author-X-Name-Last: Norets
Title: Coverage Inducing Priors in Nonstandard Inference Problems
Abstract:
We consider the construction of set estimators that possess both Bayesian credibility and frequentist coverage properties. We show that under mild regularity conditions there exists a prior distribution that induces (1 − α) frequentist coverage of a (1 − α) credible set. In contrast to the previous literature, this result does not rely on asymptotic normality or invariance, so it can be applied in nonstandard inference problems. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1233-1241
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1086654
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1086654
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1233-1241
Template-Type: ReDIF-Article 1.0
Author-Name: Junhui Wang
Author-X-Name-First: Junhui
Author-X-Name-Last: Wang
Author-Name: Xiaotong Shen
Author-X-Name-First: Xiaotong
Author-X-Name-Last: Shen
Author-Name: Yiwen Sun
Author-X-Name-First: Yiwen
Author-X-Name-Last: Sun
Author-Name: Annie Qu
Author-X-Name-First: Annie
Author-X-Name-Last: Qu
Title: Classification With Unstructured Predictors and an Application to Sentiment Analysis
Abstract:
Unstructured data refer to information that lacks certain structures and cannot be organized in a predefined fashion. Unstructured data often involve words, texts, graphs, objects, or multimedia types of files that are difficult to process and analyze with traditional computational tools and statistical methods. This work explores ordinal classification for unstructured predictors with ordered class categories, where imprecise information concerning strengths of association between predictors is available for predicting class labels. However, imprecise information here is expressed in terms of a directed graph, with each node representing a predictor and a directed edge containing pairwise strengths of association between two nodes. One of the targeted applications for unstructured data arises from sentiment analysis, which identifies and extracts the relevant content or opinion of a document concerning a specific event of interest. We integrate the imprecise predictor relations into linear relational constraints over classification function coefficients, where large margin ordinal classifiers are introduced, subject to many quadratically linear constraints. The proposed classifiers are then applied in sentiment analysis using binary word predictors. Computationally, we implement ordinal support vector machines and ψ-learning through a scalable quadratic programming package based on sparse word representations. Theoretically, we show that using relationships among unstructured predictors improves prediction accuracy of classification significantly. We illustrate an application for sentiment analysis using consumer text reviews and movie review data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1242-1253
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1089771
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1089771
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1242-1253
Template-Type: ReDIF-Article 1.0
Author-Name: Will Wei Sun
Author-X-Name-First: Will Wei
Author-X-Name-Last: Sun
Author-Name: Xingye Qiao
Author-X-Name-First: Xingye
Author-X-Name-Last: Qiao
Author-Name: Guang Cheng
Author-X-Name-First: Guang
Author-X-Name-Last: Cheng
Title: Stabilized Nearest Neighbor Classifier and its Statistical Properties
Abstract:
The stability of statistical analysis is an important indicator for reproducibility, which is one main principle of the scientific method. It entails that similar statistical conclusions can be reached based on independent samples from the same underlying population. In this article, we introduce a general measure of classification instability (CIS) to quantify the sampling variability of the prediction made by a classification method. Interestingly, the asymptotic CIS of any weighted nearest neighbor classifier turns out to be proportional to the Euclidean norm of its weight vector. Based on this concise form, we propose a stabilized nearest neighbor (SNN) classifier, which distinguishes itself from other nearest neighbor classifiers, by taking the stability into consideration. In theory, we prove that SNN attains the minimax optimal convergence rate in risk, and a sharp convergence rate in CIS. The latter rate result is established for general plug-in classifiers under a low-noise condition. Extensive simulated and real examples demonstrate that SNN achieves a considerable improvement in CIS over existing nearest neighbor classifiers, with comparable classification accuracy. We implement the algorithm in a publicly available R package snn. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1254-1265
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1089772
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1089772
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1254-1265
Template-Type: ReDIF-Article 1.0
Author-Name: Emre Barut
Author-X-Name-First: Emre
Author-X-Name-Last: Barut
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Anneleen Verhasselt
Author-X-Name-First: Anneleen
Author-X-Name-Last: Verhasselt
Title: Conditional Sure Independence Screening
Abstract:
Independence screening is powerful for variable selection when the number of variables is massive. Commonly used independence screening methods are based on marginal correlations or their variants. When some prior knowledge on a certain important set of variables is available, a natural assessment on the relative importance of the other predictors is their conditional contributions to the response given the known set of variables. This results in conditional sure independence screening (CSIS). CSIS produces a rich family of alternative screening methods by different choices of the conditioning set and can help reduce the number of false positive and false negative selections when covariates are highly correlated. This article proposes and studies CSIS in generalized linear models. We give conditions under which sure screening is possible and derive an upper bound on the number of selected variables. We also spell out the situation under which CSIS yields model selection consistency and the properties of CSIS when a data-driven conditioning set is used. Moreover, we provide two data-driven methods to select the thresholding parameter of conditional screening. The utility of the procedure is illustrated by simulation studies and analysis of two real datasets. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1266-1277
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1092974
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1092974
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1266-1277
Template-Type: ReDIF-Article 1.0
Author-Name: Christopher J. Bennett
Author-X-Name-First: Christopher J.
Author-X-Name-Last: Bennett
Author-Name: Brennan S. Thompson
Author-X-Name-First: Brennan S.
Author-X-Name-Last: Thompson
Title: Graphical Procedures for Multiple Comparisons Under General Dependence
Abstract:
It has been more than half a century since Tukey first introduced graphical displays that relate nonoverlap of confidence intervals to statistically significant differences between parameter estimates. In this article, we show how Tukey’s graphical overlap procedure can be modified to accommodate general forms of dependence within and across samples. We also develop a procedure that can be used to more effectively resolve rankings within the tails of the distributions of parameter values, thereby generalizing existing methods for “multiple comparisons with the best.” We show that these new procedures retain the simplicity of Tukey’s original procedure, while maintaining asymptotic control of the familywise error rate under very general conditions. Simple examples are used throughout to illustrate the procedures. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1278-1288
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1093941
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1093941
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1278-1288
Template-Type: ReDIF-Article 1.0
Author-Name: Gang Li
Author-X-Name-First: Gang
Author-X-Name-Last: Li
Author-Name: Qing Yang
Author-X-Name-First: Qing
Author-X-Name-Last: Yang
Title: Joint Inference for Competing Risks Survival Data
Abstract:
This article develops joint inferential methods for the cause-specific hazard function and the cumulative incidence function of a specific type of failure to assess the effects of a variable on the time to the type of failure of interest in the presence of competing risks. Joint inference for the two functions is needed in practice because (i) they describe different characteristics of a given type of failure, (ii) they do not uniquely determine each other, and (iii) the effects of a variable on the two functions can be different and one often does not know which effects are to be expected. We study both the group comparison problem and the regression problem. We also discuss joint inference for other related functions. Our simulation shows that our joint tests can be considerably more powerful than the Bonferroni method, which has important practical implications for the analysis and design of clinical studies with competing risks data. We illustrate our method using a Hodgkin disease dataset and a lymphoma dataset. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1289-1300
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1093942
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1093942
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1289-1300
Template-Type: ReDIF-Article 1.0
Author-Name: Samiran Sinha
Author-X-Name-First: Samiran
Author-X-Name-Last: Sinha
Author-Name: Yanyuan Ma
Author-X-Name-First: Yanyuan
Author-X-Name-Last: Ma
Title: Analysis of Proportional Odds Models With Censoring and Errors-in-Covariates
Abstract:
We propose a consistent method for estimating both the finite- and infinite-dimensional parameters of the proportional odds model when a covariate is subject to measurement error and time-to-events are subject to right censoring. The proposed method does not rely on the distributional assumption of the true covariate, which is not observed in the data. In addition, the proposed estimator does not require the measurement error to be normally distributed or to have any other specific distribution, and we do not attempt to assess the error distribution. Instead, we construct martingale-based estimators through inversion, using only the moment properties of the error distribution, estimable from multiple erroneous measurements of the true covariate. The theoretical properties of the estimators are established and the finite sample performance is demonstrated via simulations. We illustrate the usefulness of the method by analyzing a dataset from a clinical study on AIDS. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1301-1312
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1093943
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1093943
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1301-1312
Template-Type: ReDIF-Article 1.0
Author-Name: Efstathia Bura
Author-X-Name-First: Efstathia
Author-X-Name-Last: Bura
Author-Name: Sabrina Duarte
Author-X-Name-First: Sabrina
Author-X-Name-Last: Duarte
Author-Name: Liliana Forzani
Author-X-Name-First: Liliana
Author-X-Name-Last: Forzani
Title: Sufficient Reductions in Regressions With Exponential Family Inverse Predictors
Abstract:
We develop methodology for identifying and estimating sufficient reductions in regressions with predictors that, given the response, follow a multivariate exponential family distribution. This setup includes regressions where predictors are all continuous, all categorical, or mixtures of categorical and continuous. We derive the minimal sufficient reduction of the predictors and its maximum likelihood estimator by modeling the conditional distribution of the predictors given the response. Whereas nearly all extant estimators of sufficient reductions are linear and only partly capture the sufficient reduction, our method is not limited to linear reductions. It also provides the exact form of the sufficient reduction, which is exhaustive, its maximum likelihood (ML) estimates via an iteratively reweighted least squares (IRLS) estimation algorithm, and asymptotic tests for the dimension of the regression. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1313-1329
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1093944
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1093944
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1313-1329
Template-Type: ReDIF-Article 1.0
Author-Name: Carsten Jentsch
Author-X-Name-First: Carsten
Author-X-Name-Last: Jentsch
Author-Name: Claudia Kirch
Author-X-Name-First: Claudia
Author-X-Name-Last: Kirch
Title: How Much Information Does Dependence Between Wavelet Coefficients Contain?
Abstract:
This article is motivated by several articles that propose statistical inference where the independence of wavelet coefficients for both short- as well as long-range dependent time series is assumed. We focus on the sample variance and investigate the influence of the dependence between wavelet coefficients on this statistic. To this end, we derive asymptotic distributional properties of the sample variance for a time series that is synthesized, ignoring some or all dependence between wavelet coefficients. We show that the second-order properties differ from those of the true time series whose wavelet coefficients have the same marginal distribution except in the independent Gaussian case. This holds true even if the dependency is correct within each level and only the dependence between levels is ignored. In the case of sample autocovariances and sample autocorrelations at lag one, we indicate that first-order properties are erroneous. In a second step, we investigate several nonparametric bootstrap schemes in the wavelet domain, which take more and more dependence into account until finally the full dependency is mimicked. We obtain very similar results, where only a bootstrap mimicking the full covariance structure correctly can be valid asymptotically. A simulation study supports our theoretical findings for the wavelet domain bootstraps. For long-range-dependent time series with long-memory parameter d > 1/4, we show that some additional problems occur, which cannot be solved easily without using additional information for the bootstrap. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1330-1345
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1093945
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1093945
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1330-1345
Template-Type: ReDIF-Article 1.0
Author-Name: Paula Moraga
Author-X-Name-First: Paula
Author-X-Name-Last: Moraga
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1110-1111
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1116989
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1116989
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1110-1111
Template-Type: ReDIF-Article 1.0
Author-Name: Peter J. Diggle
Author-X-Name-First: Peter J.
Author-X-Name-Last: Diggle
Author-Name: Emanuele Giorgi
Author-X-Name-First: Emanuele
Author-X-Name-Last: Giorgi
Title: Model-Based Geostatistics for Prevalence Mapping in Low-Resource Settings
Abstract:
In low-resource settings, prevalence mapping relies on empirical prevalence data from a finite, often spatially sparse, set of surveys of communities within the region of interest, possibly supplemented by remotely sensed images that can act as proxies for environmental risk factors. A standard geostatistical model for data of this kind is a generalized linear mixed model with binomial error distribution, logistic link, and a combination of explanatory variables and a Gaussian spatial stochastic process in the linear predictor. In this article, we first review statistical methods and software associated with this standard model, then consider several methodological extensions whose development has been motivated by the requirements of specific applications. These include: methods for combining randomized survey data with data from nonrandomized, and therefore potentially biased, surveys; spatio-temporal extensions; and spatially structured zero-inflation. Throughout, we illustrate the methods with disease mapping applications that have arisen through our involvement with a range of African public health programs.
Journal: Journal of the American Statistical Association
Pages: 1096-1120
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2015.1123158
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1123158
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1096-1120
Template-Type: ReDIF-Article 1.0
Author-Name: Yang Chen
Author-X-Name-First: Yang
Author-X-Name-Last: Chen
Author-Name: Kuang Shen
Author-X-Name-First: Kuang
Author-X-Name-Last: Shen
Author-Name: Shu-Ou Shan
Author-X-Name-First: Shu-Ou
Author-X-Name-Last: Shan
Author-Name: S. C. Kou
Author-X-Name-First: S. C.
Author-X-Name-Last: Kou
Title: Analyzing Single-Molecule Protein Transportation Experiments via Hierarchical Hidden Markov Models
Abstract:
To maintain proper cellular functions, over 50% of proteins encoded in the genome need to be transported to cellular membranes. The molecular mechanism behind such a process, often referred to as protein targeting, is not well understood. Single-molecule experiments are designed to unveil the detailed mechanisms and reveal the functions of different molecular machineries involved in the process. The experimental data consist of hundreds of stochastic time traces from the fluorescence recordings of the experimental system. We introduce a Bayesian hierarchical model on top of hidden Markov models (HMMs) to analyze these data and use the statistical results to answer the biological questions. In addition to resolving the biological puzzles and delineating the regulating roles of different molecular complexes, our statistical results enable us to propose a more detailed mechanism for the late stages of the protein targeting process.
Journal: Journal of the American Statistical Association
Pages: 951-966
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1140050
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1140050
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:951-966
Template-Type: ReDIF-Article 1.0
Author-Name: Alexander W. Blocker
Author-X-Name-First: Alexander W.
Author-X-Name-Last: Blocker
Author-Name: Edoardo M. Airoldi
Author-X-Name-First: Edoardo M.
Author-X-Name-Last: Airoldi
Title: Template-Based Models for Genome-Wide Analysis of Next-Generation Sequencing Data at Base-Pair Resolution
Abstract:
We consider the problem of estimating the genome-wide distribution of nucleosome positions from paired-end sequencing data. We develop a modeling approach based on nonparametric templates to control for the variability along the sequence of read counts associated with nucleosomal DNA due to enzymatic digestion and other sample preparation steps, and we develop a calibrated Bayesian method to detect local concentrations of nucleosome positions. We also introduce a set of estimands that provides rich, interpretable summaries of nucleosome positioning. Inference is carried out via a distributed Hamiltonian Monte Carlo algorithm that can scale linearly with the length of the genome being analyzed. We provide MPI-based Python implementations of the proposed methods, stand-alone and on Amazon EC2, which can provide inferences on an entire Saccharomyces cerevisiae genome in less than 1 hr on EC2. We evaluate the accuracy and reproducibility of the inferences leveraging a factorially designed simulation study and experimental replicates. The template-based approach we develop here is also applicable to single-end sequencing data by using alternative sources of fragment length information, and to ordered and sequential data more generally. It provides a flexible and scalable alternative to mixture models, hidden Markov models, and Parzen-window methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 967-987
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1141095
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1141095
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:967-987
Template-Type: ReDIF-Article 1.0
Author-Name: Margaret E. Roberts
Author-X-Name-First: Margaret E.
Author-X-Name-Last: Roberts
Author-Name: Brandon M. Stewart
Author-X-Name-First: Brandon M.
Author-X-Name-Last: Stewart
Author-Name: Edoardo M. Airoldi
Author-X-Name-First: Edoardo M.
Author-X-Name-Last: Airoldi
Title: A Model of Text for Experimentation in the Social Sciences
Abstract:
Statistical models of text have become increasingly popular in statistics and computer science as a method of exploring large document collections. Social scientists often want to move beyond exploration, to measurement and experimentation, and make inference about social and political processes that drive discourse and content. In this article, we develop a model of text data that supports this type of substantive research. Our approach is to posit a hierarchical mixed membership model for analyzing topical content of documents, in which mixing weights are parameterized by observed covariates. In this model, topical prevalence and topical content are specified as a simple generalized linear model on an arbitrary number of document-level covariates, such as news source and time of release, enabling researchers to introduce elements of the experimental design that informed document collection into the model, within a generally applicable framework. We demonstrate the proposed methodology by analyzing a collection of news reports about China, where we allow the prevalence of topics to evolve over time and vary across newswire services. Our methods quantify the effect of newswire source on both the frequency and nature of topic coverage. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 988-1003
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1141684
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1141684
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:988-1003
Template-Type: ReDIF-Article 1.0
Author-Name: Sung Won Han
Author-X-Name-First: Sung Won
Author-X-Name-Last: Han
Author-Name: Gong Chen
Author-X-Name-First: Gong
Author-X-Name-Last: Chen
Author-Name: Myun-Seok Cheon
Author-X-Name-First: Myun-Seok
Author-X-Name-Last: Cheon
Author-Name: Hua Zhong
Author-X-Name-First: Hua
Author-X-Name-Last: Zhong
Title: Estimation of Directed Acyclic Graphs Through Two-Stage Adaptive Lasso for Gene Network Inference
Abstract:
Graphical models are a popular approach to finding dependence and conditional independence relationships between gene expressions. Directed acyclic graphs (DAGs) are a special class of directed graphical models, in which all edges are directed and no directed cycles are present. DAGs are well-known models for discovering causal relationships between genes in gene regulatory networks. However, estimating DAGs without assuming a known ordering is challenging due to high dimensionality, the acyclic constraints, and the presence of equivalence classes in observational data. To overcome these challenges, we propose a two-stage adaptive Lasso approach, called NS-DIST, which performs neighborhood selection (NS) in stage 1 and then estimates DAGs by the discrete improving search with Tabu (DIST) algorithm within the selected neighborhood. Simulation studies are presented to demonstrate the effectiveness of the method and its computational efficiency. Two real data examples are used to demonstrate the practical use of our method for gene regulatory network inference. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1004-1019
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1142880
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1142880
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1004-1019
Template-Type: ReDIF-Article 1.0
Author-Name: Shahin Tavakoli
Author-X-Name-First: Shahin
Author-X-Name-Last: Tavakoli
Author-Name: Victor M. Panaretos
Author-X-Name-First: Victor M.
Author-X-Name-Last: Panaretos
Title: Detecting and Localizing Differences in Functional Time Series Dynamics: A Case Study in Molecular Biophysics
Abstract:
Motivated by the problem of inferring the molecular dynamics of DNA in solution, and linking them with its base-pair composition, we consider the problem of comparing the dynamics of functional time series (FTS), and of localizing any inferred differences in frequency and along curvelength. The approach we take is one of Fourier analysis, where the complete second-order structure of the FTS is encoded by its spectral density operator, indexed by frequency and curvelength. The comparison is broken down to a hierarchy of stages: at a global level, we compare the spectral density operators of the two FTS, across frequencies and curvelength, based on a Hilbert–Schmidt criterion; then, we localize any differences to specific frequencies; and, finally, we further localize any differences along the length of the random curves, that is, in physical space. A hierarchical multiple testing approach guarantees control of the averaged false discovery rate over the selected frequencies. In this sense, we are able to attribute any differences to distinct dynamic (frequency) and spatial (curvelength) contributions. Our approach is presented and illustrated by means of a case study in molecular biophysics: how can one use molecular dynamics simulations of short strands of DNA to infer their temporal dynamics at the scaling limit, and probe whether these depend on the sequence encoded in these strands? Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1020-1035
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1147355
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1147355
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1020-1035
Template-Type: ReDIF-Article 1.0
Author-Name: Tyler H. McCormick
Author-X-Name-First: Tyler H.
Author-X-Name-Last: McCormick
Author-Name: Zehang Richard Li
Author-X-Name-First: Zehang Richard
Author-X-Name-Last: Li
Author-Name: Clara Calvert
Author-X-Name-First: Clara
Author-X-Name-Last: Calvert
Author-Name: Amelia C. Crampin
Author-X-Name-First: Amelia C.
Author-X-Name-Last: Crampin
Author-Name: Kathleen Kahn
Author-X-Name-First: Kathleen
Author-X-Name-Last: Kahn
Author-Name: Samuel J. Clark
Author-X-Name-First: Samuel J.
Author-X-Name-Last: Clark
Title: Probabilistic Cause-of-Death Assignment Using Verbal Autopsies
Abstract:
In regions without complete-coverage civil registration and vital statistics systems there is uncertainty about even the most basic demographic indicators. In such regions, the majority of deaths occur outside hospitals and are not recorded. Worldwide, fewer than one-third of deaths are assigned a cause, with the least information available from the most impoverished nations. In populations like this, verbal autopsy (VA) is a commonly used tool to assess cause of death and estimate cause-specific mortality rates and the distribution of deaths by cause. VA uses an interview with caregivers of the decedent to elicit data describing the signs and symptoms leading up to the death. This article develops a new statistical tool known as InSilicoVA to classify cause of death using information acquired through VA. InSilicoVA shares uncertainty between cause of death assignments for specific individuals and the distribution of deaths by cause across the population. Using side-by-side comparisons with both observed and simulated data, we demonstrate that InSilicoVA has distinct advantages compared to currently available methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1036-1049
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1152191
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1152191
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1036-1049
Template-Type: ReDIF-Article 1.0
Author-Name: Chen Yue
Author-X-Name-First: Chen
Author-X-Name-Last: Yue
Author-Name: Vadim Zipunnikov
Author-X-Name-First: Vadim
Author-X-Name-Last: Zipunnikov
Author-Name: Pierre-Louis Bazin
Author-X-Name-First: Pierre-Louis
Author-X-Name-Last: Bazin
Author-Name: Dzung Pham
Author-X-Name-First: Dzung
Author-X-Name-Last: Pham
Author-Name: Daniel Reich
Author-X-Name-First: Daniel
Author-X-Name-Last: Reich
Author-Name: Ciprian Crainiceanu
Author-X-Name-First: Ciprian
Author-X-Name-Last: Crainiceanu
Author-Name: Brian Caffo
Author-X-Name-First: Brian
Author-X-Name-Last: Caffo
Title: Parameterization of White Matter Manifold-Like Structures Using Principal Surfaces
Abstract:
In this article, we are concerned with data generated from a diffusion tensor imaging (DTI) experiment. The goal is to parameterize manifold-like white matter tracts, such as the corpus callosum, using principal surfaces. The problem is approached by finding a geometrically motivated surface-based representation of the corpus callosum and visualizing fractional anisotropy (FA) values projected onto the surface. The method also applies to any other diffusion summary. An algorithm is proposed that (a) constructs the principal surface of a corpus callosum; (b) flattens the surface into a parametric two-dimensional (2D) map; and (c) projects associated FA values on the map. The algorithm is applied to a longitudinal study containing 466 diffusion tensor images of 176 multiple sclerosis (MS) patients observed at multiple visits. For each subject and visit, the study contains a registered DTI scan of the corpus callosum at roughly 20,000 voxels. Extensive simulation studies demonstrate fast convergence and robust performance of the algorithm under a variety of challenging scenarios.
Journal: Journal of the American Statistical Association
Pages: 1050-1060
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1164050
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1164050
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1050-1060
Template-Type: ReDIF-Article 1.0
Author-Name: Kyu Ha Lee
Author-X-Name-First: Kyu Ha
Author-X-Name-Last: Lee
Author-Name: Francesca Dominici
Author-X-Name-First: Francesca
Author-X-Name-Last: Dominici
Author-Name: Deborah Schrag
Author-X-Name-First: Deborah
Author-X-Name-Last: Schrag
Author-Name: Sebastien Haneuse
Author-X-Name-First: Sebastien
Author-X-Name-Last: Haneuse
Title: Hierarchical Models for Semicompeting Risks Data With Application to Quality of End-of-Life Care for Pancreatic Cancer
Abstract:
Readmission following discharge from an initial hospitalization is a key marker of quality of healthcare in the United States. For the most part, readmission has been studied among patients with “acute” health conditions, such as pneumonia and heart failure, with analyses based on a logistic-Normal generalized linear mixed model. Naïve application of this model to the study of readmission among patients with “advanced” health conditions such as pancreatic cancer, however, is problematic because it ignores death as a competing risk. A more appropriate analysis is to imbed such a study within the semicompeting risks framework. To our knowledge, however, no comprehensive statistical methods have been developed for cluster-correlated semicompeting risks data. To resolve this gap in the literature, we propose a novel hierarchical modeling framework for the analysis of cluster-correlated semicompeting risks data that permits parametric or nonparametric specifications for a range of components, giving analysts substantial flexibility as they consider their own analyses. Estimation and inference are performed within the Bayesian paradigm since it facilitates the straightforward characterization of (posterior) uncertainty for all model parameters, including hospital-specific random effects. Model comparison and choice are performed via the deviance information criterion and the log-pseudo marginal likelihood statistic, both of which are based on a partially marginalized likelihood. An efficient computational scheme, based on the Metropolis-Hastings-Green algorithm, is developed and has been implemented in the R package SemiCompRisks. A comprehensive simulation study shows that the proposed framework performs very well in a range of data scenarios, and outperforms competitor analysis strategies. The proposed framework is motivated by and illustrated with an ongoing study of the risk of readmission among Medicare beneficiaries diagnosed with pancreatic cancer.
Using data on n = 5298 patients at J = 112 hospitals in the six New England states between 2000–2009, key scientific questions we consider include the role of patient-level risk factors on the risk of readmission and the extent of variation in risk across hospitals not explained by differences in patient case-mix. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1075-1095
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1164052
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1164052
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1075-1095
Template-Type: ReDIF-Article 1.0
Author-Name: Leonhard Held
Author-X-Name-First: Leonhard
Author-X-Name-Last: Held
Author-Name: Stefanie Muff
Author-X-Name-First: Stefanie
Author-X-Name-Last: Muff
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 1108-1110
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1164705
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1164705
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1108-1110
Template-Type: ReDIF-Article 1.0
Author-Name: Jan Hannig
Author-X-Name-First: Jan
Author-X-Name-Last: Hannig
Author-Name: Hari Iyer
Author-X-Name-First: Hari
Author-X-Name-Last: Iyer
Author-Name: Randy C. S. Lai
Author-X-Name-First: Randy C. S.
Author-X-Name-Last: Lai
Author-Name: Thomas C. M. Lee
Author-X-Name-First: Thomas C. M.
Author-X-Name-Last: Lee
Title: Generalized Fiducial Inference: A Review and New Results
Abstract:
R. A. Fisher, the father of modern statistics, proposed the idea of fiducial inference during the first half of the 20th century. While his proposal led to interesting methods for quantifying uncertainty, other prominent statisticians of the time did not accept Fisher’s approach as it became apparent that some of Fisher’s bold claims about the properties of fiducial distribution did not hold up for multi-parameter problems. Beginning around the year 2000, the authors and collaborators started to reinvestigate the idea of fiducial inference and discovered that Fisher’s approach, when properly generalized, would open doors to solve many important and difficult inference problems. They termed their generalization of Fisher’s idea generalized fiducial inference (GFI). The main idea of GFI is to carefully transfer randomness from the data to the parameter space using an inverse of a data-generating equation without the use of Bayes’ theorem. The resulting generalized fiducial distribution (GFD) can then be used for inference. After more than a decade of investigations, the authors and collaborators have developed a unifying theory for GFI, and provided GFI solutions to many challenging practical problems in different fields of science and industry. Overall, they have demonstrated that GFI is a valid, useful, and promising approach for conducting statistical inference. The goal of this article is to deliver a timely and concise introduction to GFI, to present some of the latest results, as well as to list some related open research problems. It is the authors’ hope that their contributions to GFI will stimulate the growth and usage of this exciting approach for statistical inference. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1346-1361
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1165102
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1165102
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1346-1361
Template-Type: ReDIF-Article 1.0
Author-Name: Fiona Steele
Author-X-Name-First: Fiona
Author-X-Name-Last: Steele
Author-Name: Elizabeth Washbrook
Author-X-Name-First: Elizabeth
Author-X-Name-Last: Washbrook
Author-Name: Christopher Charlton
Author-X-Name-First: Christopher
Author-X-Name-Last: Charlton
Author-Name: William J. Browne
Author-X-Name-First: William J.
Author-X-Name-Last: Browne
Title: A Longitudinal Mixed Logit Model for Estimation of Push and Pull Effects in Residential Location Choice
Abstract:
We develop a random effects discrete choice model for the analysis of households’ choice of neighborhood over time. The model is parameterized in a way that exploits longitudinal data to separate the influence of neighborhood characteristics on the decision to move out of the current area (“push” effects) and on the choice of one destination over another (“pull” effects). Random effects are included to allow for unobserved heterogeneity between households in their propensity to move, and in the importance placed on area characteristics. The model also includes area-level random effects. The combination of a large choice set, large sample size, and repeated observations means that existing estimation approaches are often infeasible. We, therefore, propose an efficient MCMC algorithm for the analysis of large-scale datasets. The model is applied in an analysis of residential choice in England using data from the British Household Panel Survey linked to neighborhood-level census data. We consider how effects of area deprivation and distance from the current area depend on household characteristics and life course transitions in the previous year. We find substantial differences between households in the effects of deprivation on out-mobility and selection of destination, with evidence of severely constrained choices among less-advantaged households. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1061-1074
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1180984
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1180984
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1061-1074
Template-Type: ReDIF-Article 1.0
Author-Name: Qian Guan
Author-X-Name-First: Qian
Author-X-Name-Last: Guan
Author-Name: Eric B. Laber
Author-X-Name-First: Eric B.
Author-X-Name-Last: Laber
Author-Name: Brian J. Reich
Author-X-Name-First: Brian J.
Author-X-Name-Last: Reich
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 936-942
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1200911
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1200911
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:936-942
Template-Type: ReDIF-Article 1.0
Author-Name: Jingxiang Chen
Author-X-Name-First: Jingxiang
Author-X-Name-Last: Chen
Author-Name: Yufeng Liu
Author-X-Name-First: Yufeng
Author-X-Name-Last: Liu
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: Rui Song
Author-X-Name-First: Rui
Author-X-Name-Last: Song
Author-Name: Yingqi Zhao
Author-X-Name-First: Yingqi
Author-X-Name-Last: Zhao
Author-Name: Michael R. Kosorok
Author-X-Name-First: Michael R.
Author-X-Name-Last: Kosorok
Title: Comment
Abstract:
Xu, Müller, Wahed, and Thall proposed a Bayesian model to analyze an acute leukemia study involving multi-stage chemotherapy regimes. We discuss two alternative methods, Q-learning and O-learning, to solve the same problem from the machine learning point of view. The numerical studies show that these methods are flexible and, in some situations, have advantages in handling treatment heterogeneity while remaining robust to model misspecification.
Journal: Journal of the American Statistical Association
Pages: 942-947
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1200914
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1200914
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:942-947
Template-Type: ReDIF-Article 1.0
Author-Name: Lorenzo Trippa
Author-X-Name-First: Lorenzo
Author-X-Name-Last: Trippa
Author-Name: Giovanni Parmigiani
Author-X-Name-First: Giovanni
Author-X-Name-Last: Parmigiani
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 947-948
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1200915
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1200915
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:947-948
Template-Type: ReDIF-Article 1.0
Author-Name: Yanxun Xu
Author-X-Name-First: Yanxun
Author-X-Name-Last: Xu
Author-Name: Peter Müller
Author-X-Name-First: Peter
Author-X-Name-Last: Müller
Author-Name: Abdus S. Wahed
Author-X-Name-First: Abdus S.
Author-X-Name-Last: Wahed
Author-Name: Peter Thall
Author-X-Name-First: Peter
Author-X-Name-Last: Thall
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 948-950
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1200917
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1200917
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:948-950
Template-Type: ReDIF-Article 1.0
Author-Name: Jon Wakefield
Author-X-Name-First: Jon
Author-X-Name-Last: Wakefield
Author-Name: Daniel Simpson
Author-X-Name-First: Daniel
Author-X-Name-Last: Simpson
Author-Name: Jessica Godwin
Author-X-Name-First: Jessica
Author-X-Name-Last: Godwin
Title: Comment: Getting into Space with a Weight Problem
Journal: Journal of the American Statistical Association
Pages: 1111-1118
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1200918
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1200918
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1111-1118
Template-Type: ReDIF-Article 1.0
Author-Name: Peter J. Diggle
Author-X-Name-First: Peter J.
Author-X-Name-Last: Diggle
Author-Name: Emanuele Giorgi
Author-X-Name-First: Emanuele
Author-X-Name-Last: Giorgi
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 1119-1120
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1200919
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1200919
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1119-1120
Template-Type: ReDIF-Article 1.0
Author-Name: Wouter Duivesteijn
Author-X-Name-First: Wouter
Author-X-Name-Last: Duivesteijn
Title: Correction to Jin-Ting Zhang’s “Approximate and Asymptotic Distributions of Chi-Squared-Type Mixtures With Applications’’
Abstract:
Zhang derives approximations for the distribution of a mixture of chi-squared distributions. The two derived approximation bounds in Theorem 1.1 both contain an arithmetic error. These are corrected here.
Journal: Journal of the American Statistical Association
Pages: 1370-1371
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1200980
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1200980
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1370-1371
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Book Reviews
Journal: Journal of the American Statistical Association
Pages: 1362-1369
Issue: 515
Volume: 111
Year: 2016
Month: 7
X-DOI: 10.1080/01621459.2016.1235436
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1235436
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:515:p:1362-1369
Template-Type: ReDIF-Article 1.0
Author-Name: Ning Hao
Author-X-Name-First: Ning
Author-X-Name-Last: Hao
Author-Name: Yang Feng
Author-X-Name-First: Yang
Author-X-Name-Last: Feng
Author-Name: Hao Helen Zhang
Author-X-Name-First: Hao Helen
Author-X-Name-Last: Zhang
Title: Model Selection for High-Dimensional Quadratic Regression via Regularization
Abstract:
Quadratic regression (QR) models naturally extend linear models by considering interaction effects between the covariates. To conduct model selection in QR, it is important to maintain the hierarchical model structure between main effects and interaction effects. Existing regularization methods generally achieve this goal by solving complex optimization problems, which usually demand high computational cost and hence are not feasible for high-dimensional data. This article focuses on scalable regularization methods for model selection in high-dimensional QR. We first consider two-stage regularization methods and establish theoretical properties of the two-stage LASSO. Then, a new regularization method, called regularization algorithm under marginality principle (RAMP), is proposed to compute a hierarchy-preserving regularization solution path efficiently. Both methods are further extended to solve generalized QR models. Numerical results are also shown to demonstrate the performance of the methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 615-625
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2016.1264956
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1264956
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:615-625
Template-Type: ReDIF-Article 1.0
Author-Name: Antonio R. Linero
Author-X-Name-First: Antonio R.
Author-X-Name-Last: Linero
Title: Bayesian Regression Trees for High-Dimensional Prediction and Variable Selection
Abstract:
Decision tree ensembles are an extremely popular tool for obtaining high-quality predictions in nonparametric regression problems. Unmodified, however, many commonly used decision tree ensemble methods do not adapt to sparsity in the regime in which the number of predictors is larger than the number of observations. A recent stream of research concerns the construction of decision tree ensembles that are motivated by a generative probabilistic model, the most influential method being the Bayesian additive regression trees (BART) framework. In this article, we take a Bayesian point of view on this problem and show how to construct priors on decision tree ensembles that are capable of adapting to sparsity in the predictors by placing a sparsity-inducing Dirichlet hyperprior on the splitting proportions of the regression tree prior. We characterize the asymptotic distribution of the number of predictors included in the model and show how this prior can be easily incorporated into existing Markov chain Monte Carlo schemes. We demonstrate that our approach yields useful posterior inclusion probabilities for each predictor and illustrate the usefulness of our approach relative to other decision tree ensemble approaches on both simulated and real datasets. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 626-636
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2016.1264957
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1264957
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:626-636
Template-Type: ReDIF-Article 1.0
Author-Name: Ting Zhang
Author-X-Name-First: Ting
Author-X-Name-Last: Zhang
Author-Name: Liliya Lavitas
Author-X-Name-First: Liliya
Author-X-Name-Last: Lavitas
Title: Unsupervised Self-Normalized Change-Point Testing for Time Series
Abstract:
We propose a new self-normalized method for testing change points in the time series setting. Self-normalization has been celebrated for its ability to avoid direct estimation of the nuisance asymptotic variance and its flexibility of being generalized to handle quantities other than the mean. However, it was developed and mainly studied for constructing confidence intervals for quantities associated with a stationary time series, and its adaptation to change-point testing can be nontrivial as direct implementation can lead to tests with nonmonotonic power. Compared with existing results on using self-normalization in this direction, the current article proposes a new self-normalized change-point test that does not require prespecifying the number of total change points and is thus unsupervised. In addition, we propose a new contrast-based approach in generalizing self-normalized statistics to handle quantities other than the mean, which is specifically tailored for change-point testing. Monte Carlo simulations are presented to illustrate the finite-sample performance of the proposed method. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 637-648
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2016.1270214
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270214
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:637-648
Template-Type: ReDIF-Article 1.0
Author-Name: Clara Happ
Author-X-Name-First: Clara
Author-X-Name-Last: Happ
Author-Name: Sonja Greven
Author-X-Name-First: Sonja
Author-X-Name-Last: Greven
Title: Multivariate Functional Principal Component Analysis for Data Observed on Different (Dimensional) Domains
Abstract:
Existing approaches for multivariate functional principal component analysis are restricted to data on the same one-dimensional interval. The presented approach focuses on multivariate functional data on different domains that may differ in dimension, such as functions and images. The theoretical basis for multivariate functional principal component analysis is given in terms of a Karhunen–Loève Theorem. For the practically relevant case of a finite Karhunen–Loève representation, a relationship between univariate and multivariate functional principal component analysis is established. This offers an estimation strategy to calculate multivariate functional principal components and scores based on their univariate counterparts. For the resulting estimators, asymptotic results are derived. The approach can be extended to finite univariate expansions in general, not necessarily orthonormal bases. It is also applicable for sparse functional data or data with measurement error. A flexible R implementation is available on CRAN. The new method is shown to be competitive to existing approaches for data observed on a common one-dimensional domain. The motivating application is a neuroimaging study, where the goal is to explore how longitudinal trajectories of a neuropsychological test score covary with FDG-PET brain scans at baseline. Supplementary material, including detailed proofs, additional simulation results, and software, is available online.
Journal: Journal of the American Statistical Association
Pages: 649-659
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2016.1273115
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1273115
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:649-659
Template-Type: ReDIF-Article 1.0
Author-Name: Alexander Hanbo Li
Author-X-Name-First: Alexander Hanbo
Author-X-Name-Last: Li
Author-Name: Jelena Bradic
Author-X-Name-First: Jelena
Author-X-Name-Last: Bradic
Title: Boosting in the Presence of Outliers: Adaptive Classification With Nonconvex Loss Functions
Abstract:
This article examines the role and the efficiency of nonconvex loss functions for binary classification problems. In particular, we investigate how to design adaptive and effective boosting algorithms that are robust to the presence of outliers in the data or to the presence of errors in the observed data labels. We demonstrate that nonconvex losses play an important role for prediction accuracy because of the diminishing gradient properties—the ability of the losses to efficiently adapt to the outlying data. We propose a new boosting framework called ArchBoost that uses the diminishing gradient property directly and leads to boosting algorithms that are provably robust. Along with the ArchBoost framework, a family of nonconvex losses is proposed, which leads to the new robust boosting algorithms, named adaptive robust boosting (ARB). Furthermore, we develop a new breakdown point analysis and a new influence function analysis that demonstrate gains in robustness. Moreover, based only on local curvatures, we establish statistical and optimization properties of the proposed ArchBoost algorithms with highly nonconvex losses. Extensive numerical and real data examples illustrate theoretical properties and reveal advantages over the existing boosting methods when data are perturbed by an adversary or otherwise. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 660-674
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2016.1273116
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1273116
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:660-674
Template-Type: ReDIF-Article 1.0
Author-Name: Federico Bassetti
Author-X-Name-First: Federico
Author-X-Name-Last: Bassetti
Author-Name: Roberto Casarin
Author-X-Name-First: Roberto
Author-X-Name-Last: Casarin
Author-Name: Francesco Ravazzolo
Author-X-Name-First: Francesco
Author-X-Name-Last: Ravazzolo
Title: Bayesian Nonparametric Calibration and Combination of Predictive Distributions
Abstract:
We introduce a Bayesian approach to predictive density calibration and combination that accounts for parameter uncertainty and model set incompleteness through the use of random calibration functionals and random combination weights. Building on the work of Ranjan and Gneiting, we use infinite beta mixtures for the calibration. The proposed Bayesian nonparametric approach takes advantage of the flexibility of Dirichlet process mixtures to achieve any continuous deformation of linearly combined predictive distributions. The inference procedure is based on a combination of Gibbs and slice sampling. We provide some conditions under which the proposed probabilistic calibration converges in terms of weak posterior consistency to the true underlying density for both cases of iid and Markovian observations. This calibration property improves upon the earlier calibration approaches. We study the methodology in simulation examples with fat tails and multimodal densities and apply it to density forecasts of daily S&P returns and daily maximum wind speed at the Frankfurt airport. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 675-685
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2016.1273117
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1273117
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:675-685
Template-Type: ReDIF-Article 1.0
Author-Name: L. Fattorini
Author-X-Name-First: L.
Author-X-Name-Last: Fattorini
Author-Name: M. Marcheselli
Author-X-Name-First: M.
Author-X-Name-Last: Marcheselli
Author-Name: L. Pratelli
Author-X-Name-First: L.
Author-X-Name-Last: Pratelli
Title: Design-Based Maps for Finite Populations of Spatial Units
Abstract:
The estimation of the values of a survey variable in finite populations of spatial units is considered for making maps when samples of spatial units are selected by probabilistic sampling schemes. The single values are estimated by means of an inverse distance weighting predictor. The design-based asymptotic properties of the resulting maps, referred to as the design-based maps, are considered when the study area remains fixed and the sizes of the spatial units tend to zero. Conditions ensuring design-based asymptotic unbiasedness and consistency are derived. They essentially require the existence of a pointwise or uniformly continuous density function of the survey variable onto the study area, some regularities in the size and shape of the units, and the use of spatially balanced designs to select units. The continuity assumption can be relaxed into a Riemann integrability assumption when estimation is performed at a sufficiently small spatial grain and the estimates are subsequently aggregated at a greater grain. A computationally simple mean squared error estimator is proposed. A simulation study is performed to assess the theoretical results. An application to estimate the map of wine cultivations in Tuscany (Central Italy) is considered. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 686-697
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2016.1278174
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1278174
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:686-697
Template-Type: ReDIF-Article 1.0
Author-Name: Asaf Weinstein
Author-X-Name-First: Asaf
Author-X-Name-Last: Weinstein
Author-Name: Zhuang Ma
Author-X-Name-First: Zhuang
Author-X-Name-Last: Ma
Author-Name: Lawrence D. Brown
Author-X-Name-First: Lawrence D.
Author-X-Name-Last: Brown
Author-Name: Cun-Hui Zhang
Author-X-Name-First: Cun-Hui
Author-X-Name-Last: Zhang
Title: Group-Linear Empirical Bayes Estimates for a Heteroscedastic Normal Mean
Abstract:
The problem of estimating the mean of a normal vector with known but unequal variances introduces substantial difficulties that impair the adequacy of traditional empirical Bayes estimators. By taking a different approach that treats the known variances as part of the random observations, we restore symmetry and thus the effectiveness of such methods. We suggest a group-linear empirical Bayes estimator, which collects observations with similar variances and applies a spherically symmetric estimator to each group separately. The proposed estimator is motivated by a new oracle rule which is stronger than the best linear rule, and thus provides a more ambitious benchmark than that considered in the previous literature. Our estimator asymptotically achieves the new oracle risk (under appropriate conditions) and at the same time is minimax. The group-linear estimator is particularly advantageous in situations where the true means and observed variances are empirically dependent. To demonstrate the merits of the proposed methods in real applications, we analyze the baseball data used by Brown (2008), where the group-linear methods achieved the prediction error of the best nonparametric estimates that have been applied to the dataset, and significantly lower error than other parametric and semiparametric empirical Bayes estimators.
Journal: Journal of the American Statistical Association
Pages: 698-710
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1280406
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1280406
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:698-710
Template-Type: ReDIF-Article 1.0
Author-Name: Holger Dette
Author-X-Name-First: Holger
Author-X-Name-Last: Dette
Author-Name: Kathrin Möllenhoff
Author-X-Name-First: Kathrin
Author-X-Name-Last: Möllenhoff
Author-Name: Stanislav Volgushev
Author-X-Name-First: Stanislav
Author-X-Name-Last: Volgushev
Author-Name: Frank Bretz
Author-X-Name-First: Frank
Author-X-Name-Last: Bretz
Title: Equivalence of Regression Curves
Abstract:
This article investigates the problem whether the difference between two parametric models m1, m2 describing the relation between a response variable and several covariates in two different groups is practically irrelevant, such that inference can be performed on the basis of the pooled sample. Statistical methodology is developed to test the hypotheses H0: d(m1, m2) ⩾ ϵ versus H1: d(m1, m2) < ϵ to demonstrate equivalence between the two regression curves m1, m2 for a prespecified threshold ϵ, where d denotes a metric measuring the distance between m1 and m2. Our approach is based on the asymptotic properties of a suitable estimator $d(\hat{m}_1, \hat{m}_2)$ of this distance. To improve the approximation of the nominal level for small sample sizes, a bootstrap test is developed, which addresses the specific form of the interval hypotheses. In particular, data have to be generated under the null hypothesis, which implicitly defines a manifold for the parameter vector. The results are illustrated by means of a simulation study and a data example. It is demonstrated that the new methods substantially improve currently available approaches with respect to power and approximation of the nominal level.
Journal: Journal of the American Statistical Association
Pages: 711-729
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1281813
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1281813
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:711-729
Template-Type: ReDIF-Article 1.0
Author-Name: Chong Zhang
Author-X-Name-First: Chong
Author-X-Name-Last: Zhang
Author-Name: Wenbo Wang
Author-X-Name-First: Wenbo
Author-X-Name-Last: Wang
Author-Name: Xingye Qiao
Author-X-Name-First: Xingye
Author-X-Name-Last: Qiao
Title: On Reject and Refine Options in Multicategory Classification
Abstract:
In many real applications of statistical learning, a decision made from misclassification can be too costly to afford; in this case, a reject option, which defers the decision until further investigation is conducted, is often preferred. In recent years, there has been much development for binary classification with a reject option. Yet, little progress has been made for the multicategory case. In this article, we propose margin-based multicategory classification methods with a reject option. In addition, and more importantly, we introduce a new and unique refine option for the multicategory problem, where the class of an observation is predicted to be from a set of class labels, whose cardinality is not necessarily one. The main advantage of both options lies in their capacity of identifying error-prone observations. Moreover, the refine option can provide more constructive information for classification by effectively ruling out implausible classes. Efficient implementations have been developed for the proposed methods. On the theoretical side, we offer a novel statistical learning theory and show a fast convergence rate of the excess ℓ-risk of our methods with emphasis on diverging dimensionality and number of classes. The results can be further improved under a low noise assumption and be generalized to the excess 0-d-1 risk. Finite-sample upper bounds for the reject and reject/refine rates are also provided. A set of comprehensive simulation and real data studies has shown the usefulness of the new learning tools compared to regular multicategory classifiers. Detailed proofs of theorems and extended numerical results are included in the supplemental materials available online.
Journal: Journal of the American Statistical Association
Pages: 730-745
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1282372
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1282372
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:730-745
Template-Type: ReDIF-Article 1.0
Author-Name: Kejun He
Author-X-Name-First: Kejun
Author-X-Name-Last: He
Author-Name: Heng Lian
Author-X-Name-First: Heng
Author-X-Name-Last: Lian
Author-Name: Shujie Ma
Author-X-Name-First: Shujie
Author-X-Name-Last: Ma
Author-Name: Jianhua Z. Huang
Author-X-Name-First: Jianhua Z.
Author-X-Name-Last: Huang
Title: Dimensionality Reduction and Variable Selection in Multivariate Varying-Coefficient Models With a Large Number of Covariates
Abstract:
Motivated by the study of gene and environment interactions, we consider a multivariate response varying-coefficient model with a large number of covariates. The need of nonparametrically estimating a large number of coefficient functions given relatively limited data poses a big challenge for fitting such a model. To overcome the challenge, we develop a method that incorporates three ideas: (i) reduce the number of unknown functions to be estimated by using (noncentered) principal components; (ii) approximate the unknown functions by polynomial splines; (iii) apply sparsity-inducing penalization to select relevant covariates. The three ideas are integrated into a penalized least-squares framework. Our asymptotic theory shows that the proposed method can consistently identify relevant covariates and can estimate the corresponding coefficient functions with the same convergence rate as when only the relevant variables are included in the model. We also develop a novel computational algorithm to solve the penalized least-squares problem by combining proximal algorithms and optimization over Stiefel manifolds. Our method is illustrated using data from the Framingham Heart Study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 746-754
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1285774
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1285774
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:746-754
Template-Type: ReDIF-Article 1.0
Author-Name: Forrest W. Crawford
Author-X-Name-First: Forrest W.
Author-X-Name-Last: Crawford
Author-Name: Jiacheng Wu
Author-X-Name-First: Jiacheng
Author-X-Name-Last: Wu
Author-Name: Robert Heimer
Author-X-Name-First: Robert
Author-X-Name-Last: Heimer
Title: Hidden Population Size Estimation From Respondent-Driven Sampling: A Network Approach
Abstract:
Estimating the size of stigmatized, hidden, or hard-to-reach populations is a major problem in epidemiology, demography, and public health research. Capture–recapture and multiplier methods are standard tools for inference of hidden population sizes, but they require random sampling of target population members, which is rarely possible. Respondent-driven sampling (RDS) is a survey method for hidden populations that relies on social link tracing. The RDS recruitment process is designed to spread through the social network connecting members of the target population. In this article, we show how to use network data revealed by RDS to estimate hidden population size. The key insight is that the recruitment chain, timing of recruitments, and network degrees of recruited subjects provide information about the number of individuals belonging to the target population who are not yet in the sample. We use a computationally efficient Bayesian method to integrate over the missing edges in the subgraph of recruited individuals. We validate the method using simulated data and apply the technique to estimate the number of people who inject drugs in St. Petersburg, Russia. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 755-766
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1285775
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1285775
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:755-766
Template-Type: ReDIF-Article 1.0
Author-Name: Sebastian Calonico
Author-X-Name-First: Sebastian
Author-X-Name-Last: Calonico
Author-Name: Matias D. Cattaneo
Author-X-Name-First: Matias D.
Author-X-Name-Last: Cattaneo
Author-Name: Max H. Farrell
Author-X-Name-First: Max H.
Author-X-Name-Last: Farrell
Title: On the Effect of Bias Estimation on Coverage Accuracy in Nonparametric Inference
Abstract:
Nonparametric methods play a central role in modern empirical work. While they provide inference procedures that are more robust to parametric misspecification bias, they may be quite sensitive to tuning parameter choices. We study the effects of bias correction on confidence interval coverage in the context of kernel density and local polynomial regression estimation, and prove that bias correction can be preferred to undersmoothing for minimizing coverage error and increasing robustness to tuning parameter choice. This is achieved using a novel, yet simple, Studentization, which leads to a new way of constructing kernel-based bias-corrected confidence intervals. In addition, for practical cases, we derive coverage error optimal bandwidths and discuss easy-to-implement bandwidth selectors. For interior points, we show that the mean-squared error (MSE)-optimal bandwidth for the original point estimator (before bias correction) delivers the fastest coverage error decay rate after bias correction when second-order (equivalent) kernels are employed, but is otherwise suboptimal because it is too “large.” Finally, for odd-degree local polynomial regression, we show that, as with point estimation, coverage error adapts to boundary points automatically when appropriate Studentization is used; however, the MSE-optimal bandwidth for the original point estimator is suboptimal. All the results are established using valid Edgeworth expansions and illustrated with simulated data. Our findings have important consequences for empirical work as they indicate that bias-corrected confidence intervals, coupled with appropriate standard errors, have smaller coverage error and are less sensitive to tuning parameter choices in practically relevant cases where additional smoothness is available. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 767-779
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1285776
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1285776
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:767-779
Template-Type: ReDIF-Article 1.0
Author-Name: Alexander R. Luedtke
Author-X-Name-First: Alexander R.
Author-X-Name-Last: Luedtke
Author-Name: Mark J. van der Laan
Author-X-Name-First: Mark J. van der
Author-X-Name-Last: Laan
Title: Parametric-Rate Inference for One-Sided Differentiable Parameters
Abstract:
Suppose one has a collection of parameters indexed by a (possibly infinite dimensional) set. Given data generated from some distribution, the objective is to estimate the maximal parameter in this collection evaluated at the distribution that generated the data. This estimation problem is typically nonregular when the maximizing parameter is nonunique, and as a result standard asymptotic techniques generally fail in this case. We present a technique for developing parametric-rate confidence intervals for the quantity of interest in these nonregular settings. We show that our estimator is asymptotically efficient when the maximizing parameter is unique so that regular estimation is possible. We apply our technique to a recent example from the literature in which one wishes to report the maximal absolute correlation between a prespecified outcome and one of p predictors. The simplicity of our technique enables an analysis of the previously open case where p grows with sample size. Specifically, we only require that log p grows slower than $\sqrt{n}$, where n is the sample size. We show that, unlike earlier approaches, our method scales to massive datasets: the point estimate and confidence intervals can be constructed in O(np) time. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 780-788
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1285777
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1285777
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:780-788
Template-Type: ReDIF-Article 1.0
Author-Name: Ery Arias-Castro
Author-X-Name-First: Ery
Author-X-Name-Last: Arias-Castro
Author-Name: Rui M. Castro
Author-X-Name-First: Rui M.
Author-X-Name-Last: Castro
Author-Name: Ervin Tánczos
Author-X-Name-First: Ervin
Author-X-Name-Last: Tánczos
Author-Name: Meng Wang
Author-X-Name-First: Meng
Author-X-Name-Last: Wang
Title: Distribution-Free Detection of Structured Anomalies: Permutation and Rank-Based Scans
Abstract:
The scan statistic is by far the most popular method for anomaly detection, with applications in syndromic surveillance, signal and image processing, and target detection based on sensor networks, among others. The use of the scan statistic in such settings yields a hypothesis testing procedure, where the null hypothesis corresponds to the absence of anomalous behavior. If the null distribution is known, then calibration of a scan-based test is relatively easy, as it can be done by Monte Carlo simulation. When the null distribution is unknown, it is less straightforward. We investigate two procedures. The first one is a calibration by permutation and the other is a rank-based scan test, which is distribution-free and less sensitive to outliers. Furthermore, the rank scan test requires only a one-time calibration for a given data size, making it computationally much more appealing. In both cases, we quantify the performance loss with respect to an oracle scan test that knows the null distribution. We show that using one of these calibration procedures results in only a very small loss of power in the context of a natural exponential family. This includes the classical normal location model, popular in signal processing, and the Poisson model, popular in syndromic surveillance. We perform numerical experiments on simulated data further supporting our theory and also on a real dataset from genomics. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 789-801
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1286240
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1286240
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:789-801
Template-Type: ReDIF-Article 1.0
Author-Name: Li Ma
Author-X-Name-First: Li
Author-X-Name-Last: Ma
Author-Name: Jacopo Soriano
Author-X-Name-First: Jacopo
Author-X-Name-Last: Soriano
Title: Efficient Functional ANOVA Through Wavelet-Domain Markov Groves
Abstract:
We introduce a wavelet-domain method for functional analysis of variance (fANOVA). It is based on a Bayesian hierarchical model that employs a graphical hyperprior in the form of a Markov grove (MG)—that is, a collection of Markov trees—for linking the presence/absence of factor effects at all location-scale combinations, thereby incorporating the natural clustering of factor effects in the wavelet-domain across locations and scales. Inference under the model enjoys both analytical simplicity and computational efficiency. Specifically, the posterior of the full hierarchical model is available in closed form through a pyramid algorithm operationally similar to Mallat’s pyramid algorithm for discrete wavelet transform (DWT), achieving for exact Bayesian inference the same computational efficiency—linear in both the number of observations and the number of locations—as for carrying out the DWT. In particular, posterior probabilities of the presence of factor contributions to functional variation are directly available from the pyramid algorithm, while posterior samples for the factor effects can be drawn directly from the exact posterior through standard (not Markov chain) Monte Carlo. We investigate the performance of our method through extensive simulation and show that it substantially outperforms existing wavelet-domain fANOVA methods in a variety of common settings. We illustrate the method through analyzing the orthosis data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 802-818
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1286241
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1286241
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:802-818
Template-Type: ReDIF-Article 1.0
Author-Name: Joshua Chan
Author-X-Name-First: Joshua
Author-X-Name-Last: Chan
Author-Name: Roberto Leon-Gonzalez
Author-X-Name-First: Roberto
Author-X-Name-Last: Leon-Gonzalez
Author-Name: Rodney W. Strachan
Author-X-Name-First: Rodney W.
Author-X-Name-Last: Strachan
Title: Invariant Inference and Efficient Computation in the Static Factor Model
Abstract:
Factor models are used in a wide range of areas. Two issues with Bayesian versions of these models are a lack of invariance to ordering of and scaling of the variables and computational inefficiency. This article develops invariant and efficient Bayesian methods for estimating static factor models. This approach leads to inference that does not depend upon the ordering or scaling of the variables, and we provide arguments to explain this invariance. Beginning from identified parameters which are subject to orthogonality restrictions, we use parameter expansions to obtain a specification with computationally convenient conditional posteriors. We show significant gains in computational efficiency. Identifying restrictions that are commonly employed result in interpretable factors or loadings and, using our approach, these can be imposed ex-post. This allows us to investigate several alternative identifying (noninvariant) schemes without the need to respecify and resample the model. We illustrate the methods with two macroeconomic datasets.
Journal: Journal of the American Statistical Association
Pages: 819-828
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1287080
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1287080
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:819-828
Template-Type: ReDIF-Article 1.0
Author-Name: HaiYing Wang
Author-X-Name-First: HaiYing
Author-X-Name-Last: Wang
Author-Name: Rong Zhu
Author-X-Name-First: Rong
Author-X-Name-Last: Zhu
Author-Name: Ping Ma
Author-X-Name-First: Ping
Author-X-Name-Last: Ma
Title: Optimal Subsampling for Large Sample Logistic Regression
Abstract:
For massive data, subsampling algorithms are a popular family of methods for downsizing the data volume and reducing the computational burden. Existing studies focus on approximating the ordinary least-squares estimate in linear regression, where statistical leverage scores are often used to define subsampling probabilities. In this article, we propose fast subsampling algorithms to efficiently approximate the maximum likelihood estimate in logistic regression. We first establish consistency and asymptotic normality of the estimator from a general subsampling algorithm, and then derive optimal subsampling probabilities that minimize the asymptotic mean squared error of the resultant estimator. An alternative minimization criterion is also proposed to further reduce the computational cost. The optimal subsampling probabilities depend on the full-data estimate, so we develop a two-step algorithm to approximate the optimal subsampling procedure. This algorithm is computationally efficient and achieves a significant reduction in computing time compared to the full-data approach. Consistency and asymptotic normality of the estimator from the two-step algorithm are also established. Synthetic and real datasets are used to evaluate the practical performance of the proposed method. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 829-844
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1292914
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1292914
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:829-844
Template-Type: ReDIF-Article 1.0
Author-Name: Dungang Liu
Author-X-Name-First: Dungang
Author-X-Name-Last: Liu
Author-Name: Heping Zhang
Author-X-Name-First: Heping
Author-X-Name-Last: Zhang
Title: Residuals and Diagnostics for Ordinal Regression Models: A Surrogate Approach
Abstract:
Ordinal outcomes are common in scientific research and everyday practice, and we often rely on regression models to make inference. A long-standing problem with such regression analyses is the lack of effective diagnostic tools for validating model assumptions. The difficulty arises from the fact that an ordinal variable has discrete values that are labeled with, but are not, numerical values. The values merely represent ordered categories. In this article, we propose a surrogate approach to defining residuals for an ordinal outcome Y. The idea is to define a continuous variable S as a “surrogate” of Y and then obtain residuals based on S. For the general class of cumulative link regression models, we study the residual’s theoretical and graphical properties. We show that the residual has null properties similar to those of the common residuals for continuous outcomes. Our numerical studies demonstrate that the residual has power to detect misspecification with respect to (1) mean structures; (2) link functions; (3) heteroscedasticity; (4) proportionality; and (5) mixed populations. The proposed residual also enables us to develop numeric measures for goodness of fit using classical distance notions. Our results suggest that compared to a previously defined residual, our residual can reveal deeper insights into model diagnostics. We stress that this work focuses on residual analysis, rather than hypothesis testing. The latter has limited utility as it only provides a single p-value, whereas our residual can reveal what components of the model are misspecified and advise how to make improvements. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 845-854
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1292915
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1292915
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:845-854
Template-Type: ReDIF-Article 1.0
Author-Name: Alexandre Bouchard-Côté
Author-X-Name-First: Alexandre
Author-X-Name-Last: Bouchard-Côté
Author-Name: Sebastian J. Vollmer
Author-X-Name-First: Sebastian J.
Author-X-Name-Last: Vollmer
Author-Name: Arnaud Doucet
Author-X-Name-First: Arnaud
Author-X-Name-Last: Doucet
Title: The Bouncy Particle Sampler: A Nonreversible Rejection-Free Markov Chain Monte Carlo Method
Abstract:
Many Markov chain Monte Carlo techniques currently available rely on discrete-time reversible Markov processes whose transition kernels are variations of the Metropolis–Hastings algorithm. We explore and generalize an alternative scheme recently introduced in the physics literature (Peters and de With 2012) where the target distribution is explored using a continuous-time nonreversible piecewise-deterministic Markov process. In the Metropolis–Hastings algorithm, a trial move to a region of lower target density, equivalently of higher “energy,” than the current state can be rejected with positive probability. In this alternative approach, a particle moves along straight lines around the space and, when facing a high energy barrier, it is not rejected but its path is modified by bouncing against this barrier. By reformulating this algorithm using inhomogeneous Poisson processes, we exploit standard sampling techniques to simulate exactly this Markov process in a wide range of scenarios of interest. Additionally, when the target distribution is given by a product of factors dependent only on subsets of the state variables, such as the posterior distribution associated with a probabilistic graphical model, this method can be modified to take advantage of this structure by allowing computationally cheaper “local” bounces, which only involve the state variables associated with a factor, while the other state variables keep on evolving. In this context, by leveraging techniques from chemical kinetics, we propose several computationally efficient implementations. Experimentally, this new class of Markov chain Monte Carlo schemes compares favorably to state-of-the-art methods on various Bayesian inference tasks, including for high-dimensional models and large datasets. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 855-867
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1294075
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1294075
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:855-867
Template-Type: ReDIF-Article 1.0
Author-Name: Rahul Mukerjee
Author-X-Name-First: Rahul
Author-X-Name-Last: Mukerjee
Author-Name: Tirthankar Dasgupta
Author-X-Name-First: Tirthankar
Author-X-Name-Last: Dasgupta
Author-Name: Donald B. Rubin
Author-X-Name-First: Donald B.
Author-X-Name-Last: Rubin
Title: Using Standard Tools From Finite Population Sampling to Improve Causal Inference for Complex Experiments
Abstract:
This article considers causal inference for treatment contrasts from a randomized experiment using potential outcomes in a finite population setting. Adopting a Neymanian repeated sampling approach that integrates such causal inference with finite population survey sampling, an inferential framework is developed for general mechanisms of assigning experimental units to multiple treatments. This framework extends classical methods by allowing the possibility of randomization restrictions and unequal replications. Novel conditions that are “milder” than strict additivity of treatment effects, yet permit unbiased estimation of the finite population sampling variance of any treatment contrast estimator, are derived. The consequences of departures from such conditions are also studied under the criterion of minimax bias, and a new justification for using the Neymanian conservative sampling variance estimator in experiments is provided. The proposed approach can readily be extended to the case of treatments with a general factorial structure.
Journal: Journal of the American Statistical Association
Pages: 868-881
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1294076
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1294076
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:868-881
Template-Type: ReDIF-Article 1.0
Author-Name: Dandan Liu
Author-X-Name-First: Dandan
Author-X-Name-Last: Liu
Author-Name: Tianxi Cai
Author-X-Name-First: Tianxi
Author-X-Name-Last: Cai
Author-Name: Anna Lok
Author-X-Name-First: Anna
Author-X-Name-Last: Lok
Author-Name: Yingye Zheng
Author-X-Name-First: Yingye
Author-X-Name-Last: Zheng
Title: Nonparametric Maximum Likelihood Estimators of Time-Dependent Accuracy Measures for Survival Outcome Under Two-Stage Sampling Designs
Abstract:
Large prospective cohort studies of rare chronic diseases require thoughtful planning of study designs, especially for biomarker studies when measurements are based on stored tissue or blood specimens. Two-phase designs, including nested case–control and case-cohort sampling designs, provide cost-effective strategies for conducting biomarker evaluation studies. Existing literature for biomarker assessment under two-phase designs largely focuses on simple inverse probability weighting (IPW) estimators. Drawing on recent theoretical development on the maximum likelihood estimators for relative risk parameters in two-phase studies, we propose nonparametric maximum likelihood-based estimators to evaluate the accuracy and predictiveness of a risk prediction biomarker under both types of two-phase designs. In addition, hybrid estimators that combine the IPW estimators with the maximum likelihood estimation procedure are proposed to improve efficiency and alleviate computational burden. We derive large sample properties of the proposed estimators and evaluate their finite sample performance using numerical studies. We illustrate the new procedures using a two-phase biomarker study aiming to evaluate the accuracy of a novel biomarker, des-γ-carboxy prothrombin, for early detection of hepatocellular carcinoma. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 882-892
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1295866
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1295866
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:882-892
Template-Type: ReDIF-Article 1.0
Author-Name: Kin Yau Wong
Author-X-Name-First: Kin Yau
Author-X-Name-Last: Wong
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: D. Y. Lin
Author-X-Name-First: D. Y.
Author-X-Name-Last: Lin
Title: Efficient Estimation for Semiparametric Structural Equation Models With Censored Data
Abstract:
Structural equation modeling is commonly used to capture complex structures of relationships among multiple variables, both latent and observed. We propose a general class of structural equation models with a semiparametric component for potentially censored survival times. We consider nonparametric maximum likelihood estimation and devise a combined expectation-maximization and Newton-Raphson algorithm for its implementation. We establish conditions for model identifiability and prove the consistency, asymptotic normality, and semiparametric efficiency of the estimators. Finally, we demonstrate the satisfactory performance of the proposed methods through simulation studies and provide an application to a motivating cancer study that contains a variety of genomic variables. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 893-905
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1299626
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1299626
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:893-905
Template-Type: ReDIF-Article 1.0
Author-Name: Ori Davidov
Author-X-Name-First: Ori
Author-X-Name-Last: Davidov
Author-Name: Casey M. Jelsema
Author-X-Name-First: Casey M.
Author-X-Name-Last: Jelsema
Author-Name: Shyamal Peddada
Author-X-Name-First: Shyamal
Author-X-Name-Last: Peddada
Title: Testing for Inequality Constraints in Singular Models by Trimming or Winsorizing the Variance Matrix
Abstract:
There are many applications in which a statistic follows, at least asymptotically, a normal distribution with a singular or nearly singular variance matrix. A classic example occurs in linear regression models under multicollinearity, but there are many more such examples. There is well-developed theory for testing linear equality constraints when the alternative is two-sided and the variance matrix is either singular or nonsingular. In recent years, there has been considerable, and growing, interest in developing methods for situations in which the estimated variance matrix is nearly singular. However, there is no corresponding methodology for addressing one-sided, that is, constrained or ordered alternatives. In this article, we develop a unified framework for analyzing such problems. Our approach may be viewed as the trimming or winsorizing of the eigenvalues of the corresponding variance matrix. The proposed methodology is applicable to a wide range of scientific problems and to a variety of statistical models in which inequality constraints arise. We illustrate the methodology using data from a gene expression microarray experiment obtained from the NIEHS’ Fibroid Growth Study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 906-918
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1301258
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1301258
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:906-918
Template-Type: ReDIF-Article 1.0
Author-Name: Jia Chen
Author-X-Name-First: Jia
Author-X-Name-Last: Chen
Author-Name: Degui Li
Author-X-Name-First: Degui
Author-X-Name-Last: Li
Author-Name: Oliver Linton
Author-X-Name-First: Oliver
Author-X-Name-Last: Linton
Author-Name: Zudi Lu
Author-X-Name-First: Zudi
Author-X-Name-Last: Lu
Title: Semiparametric Ultra-High Dimensional Model Averaging of Nonlinear Dynamic Time Series
Abstract:
We propose two semiparametric model averaging schemes for nonlinear dynamic time series regression models with a very large number of covariates including exogenous regressors and auto-regressive lags. Our objective is to obtain more accurate estimates and forecasts of time series by using a large number of conditioning variables in a nonparametric way. In the first scheme, we introduce a kernel sure independence screening (KSIS) technique to screen out the regressors whose marginal regression (or autoregression) functions do not make a significant contribution to estimating the joint multivariate regression function; we then propose a semiparametric penalized method of model averaging marginal regression (MAMAR) for the regressors and auto-regressors that survive the screening procedure, to further select the regressors that have significant effects on estimating the multivariate regression function and predicting the future values of the response variable. In the second scheme, we impose an approximate factor modeling structure on the ultra-high dimensional exogenous regressors and use the principal component analysis to estimate the latent common factors; we then apply the penalized MAMAR method to select the estimated common factors and the lags of the response variable that are significant. In each of the two schemes, we construct the optimal combination of the significant marginal regression and autoregression functions. Asymptotic properties for these two schemes are derived under some regularity conditions. Numerical studies including both simulation and an empirical application to forecasting inflation are given to illustrate the proposed methodology. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 919-932
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1302339
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1302339
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:919-932
Template-Type: ReDIF-Article 1.0
Author-Name: Peter Ganong
Author-X-Name-First: Peter
Author-X-Name-Last: Ganong
Author-Name: Simon Jäger
Author-X-Name-First: Simon
Author-X-Name-Last: Jäger
Title: A Permutation Test for the Regression Kink Design
Abstract:
The regression kink (RK) design is an increasingly popular empirical method for estimating causal effects of policies, such as the effect of unemployment benefits on unemployment duration. Using simulation studies based on data from existing RK designs, we empirically document that the statistical significance of RK estimators based on conventional standard errors can be spurious. In the simulations, false positives arise as a consequence of nonlinearities in the underlying relationship between the outcome and the assignment variable, confirming concerns about the misspecification bias of discontinuity estimators pointed out by Calonico, Cattaneo, and Titiunik. As a complement to standard RK inference, we propose that researchers construct a distribution of placebo estimates in regions with and without a policy kink and use this distribution to gauge statistical significance. Under the assumption that the location of the kink point is random, this permutation test has exact size in finite samples for testing a sharp null hypothesis of no effect of the policy on the outcome. We implement simulation studies based on existing RK applications that estimate the effect of unemployment benefits on unemployment duration and show that our permutation test as well as inference procedures proposed by Calonico, Cattaneo, and Titiunik improve upon the size of standard approaches, while having sufficient power to detect an effect of unemployment benefits on unemployment duration. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 494-504
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1328356
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1328356
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:494-504
Template-Type: ReDIF-Article 1.0
Author-Name: Zhigang Yao
Author-X-Name-First: Zhigang
Author-X-Name-Last: Yao
Author-Name: Ye Zhang
Author-X-Name-First: Ye
Author-X-Name-Last: Zhang
Author-Name: Zhidong Bai
Author-X-Name-First: Zhidong
Author-X-Name-Last: Bai
Author-Name: William F. Eddy
Author-X-Name-First: William F.
Author-X-Name-Last: Eddy
Title: Estimating the Number of Sources in Magnetoencephalography Using Spiked Population Eigenvalues
Abstract:
Magnetoencephalography (MEG) is an advanced imaging technique used to measure the magnetic fields outside the human head produced by the electrical activity inside the brain. Various source localization methods in MEG require the knowledge of the underlying active sources, which are identified a priori. Common methods used to estimate the number of sources include principal component analysis (PCA) or information criterion methods, both of which make use of the eigenvalue distribution of the data, thus avoiding solving the time-consuming inverse problem. Unfortunately, all these methods are very sensitive to the signal-to-noise ratio (SNR), as the sample extreme eigenvalues do not necessarily reflect the perturbation of the population ones. To uncover the unknown sources from the very noisy MEG data, we introduce a framework, referred to as the intrinsic dimensionality (ID) of the optimal transformation for the SNR rescaling functional. It is defined as the number of the spiked population eigenvalues of the associated transformed data matrix. It is shown that the ID yields a more reasonable estimate for the number of sources than its sample counterparts, especially when the SNR is small. By means of examples, we illustrate that the new method is able to capture the number of signal sources in MEG that can escape PCA or other information criterion-based methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 505-518
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1341411
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1341411
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:505-518
Template-Type: ReDIF-Article 1.0
Author-Name: Xi Chen
Author-X-Name-First: Xi
Author-X-Name-Last: Chen
Author-Name: Kaoru Irie
Author-X-Name-First: Kaoru
Author-X-Name-Last: Irie
Author-Name: David Banks
Author-X-Name-First: David
Author-X-Name-Last: Banks
Author-Name: Robert Haslinger
Author-X-Name-First: Robert
Author-X-Name-Last: Haslinger
Author-Name: Jewell Thomas
Author-X-Name-First: Jewell
Author-X-Name-Last: Thomas
Author-Name: Mike West
Author-X-Name-First: Mike
Author-X-Name-Last: West
Title: Scalable Bayesian Modeling, Monitoring, and Analysis of Dynamic Network Flow Data
Abstract:
Traffic flow count data in networks arise in many applications, such as automobile or aviation transportation, certain directed social network contexts, and Internet studies. Using an example of Internet browser traffic flow through site-segments of an international news website, we present Bayesian analyses of two linked classes of models which, in tandem, allow fast, scalable, and interpretable Bayesian inference. We first develop flexible state-space models for streaming count data, able to adaptively characterize and quantify network dynamics efficiently in real-time. We then use these models as emulators of more structured, time-varying gravity models that allow formal dissection of network dynamics. This yields interpretable inferences on traffic flow characteristics, and on dynamics in interactions among network nodes. Bayesian monitoring theory defines a strategy for sequential model assessment and adaptation in cases when network flow data deviate from model-based predictions. Exploratory and sequential monitoring analyses of evolving traffic on a network of web site-segments in e-commerce demonstrate the utility of this coupled Bayesian emulation approach to analysis of streaming network count data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 519-533
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1345742
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1345742
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:519-533
Template-Type: ReDIF-Article 1.0
Author-Name: Bruce J. Swihart
Author-X-Name-First: Bruce J.
Author-X-Name-Last: Swihart
Author-Name: Michael P. Fay
Author-X-Name-First: Michael P.
Author-X-Name-Last: Fay
Author-Name: Kazutoyo Miura
Author-X-Name-First: Kazutoyo
Author-X-Name-Last: Miura
Title: Statistical Methods for Standard Membrane-Feeding Assays to Measure Transmission Blocking or Reducing Activity in Malaria
Abstract:
Transmission blocking vaccines for malaria are not designed to directly protect vaccinated people from malaria disease, but to reduce the probability of infecting other people by interfering with the growth of the malaria parasite in mosquitoes. Standard membrane-feeding assays compare the growth of parasites in mosquitoes fed a test sample (using antibodies from a vaccinated person) with that in mosquitoes fed a control sample. There is debate about whether to estimate the transmission reducing activity (TRA), which compares the mean number of parasites between test and control samples, or the transmission blocking activity (TBA), which compares the proportion of infected mosquitoes. TBA appears biologically more important since each mosquito with any parasites is potentially infective; however, TBA is less reproducible and may be an overly strict criterion for screening vaccine candidates. Through a statistical model, we show that the TBA estimand depends on μc, the mean number of parasites in the control mosquitoes, a parameter not easily experimentally controlled. We develop a standardized TBA estimator based on the model and a given target value for μc which has better mean squared error than alternative methods. We discuss types of statistical inference needed for using these assays for vaccine development. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 534-545
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1356313
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356313
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:534-545
Template-Type: ReDIF-Article 1.0
Author-Name: Kun Chen
Author-X-Name-First: Kun
Author-X-Name-Last: Chen
Author-Name: Neha Mishra
Author-X-Name-First: Neha
Author-X-Name-Last: Mishra
Author-Name: Joan Smyth
Author-X-Name-First: Joan
Author-X-Name-Last: Smyth
Author-Name: Haim Bar
Author-X-Name-First: Haim
Author-X-Name-Last: Bar
Author-Name: Elizabeth Schifano
Author-X-Name-First: Elizabeth
Author-X-Name-Last: Schifano
Author-Name: Lynn Kuo
Author-X-Name-First: Lynn
Author-X-Name-Last: Kuo
Author-Name: Ming-Hui Chen
Author-X-Name-First: Ming-Hui
Author-X-Name-Last: Chen
Title: A Tailored Multivariate Mixture Model for Detecting Proteins of Concordant Change Among Virulent Strains of Clostridium Perfringens
Abstract:
Necrotic enteritis (NE) is a serious disease of poultry caused by the bacterium C. perfringens. To identify proteins of C. perfringens that confer virulence with respect to NE, the protein secretions of four NE disease-producing strains and one baseline nondisease-producing strain of C. perfringens were examined. The problem then becomes a clustering task, for the identification of two extreme groups of proteins that were produced at either concordantly higher or concordantly lower levels across all four disease-producing strains compared to the baseline, when most of the proteins do not exhibit significant change across all strains. However, the existence of some nuisance proteins of discordant change may severely distort any biologically meaningful cluster pattern. We develop a tailored multivariate clustering approach to robustly identify the proteins of concordant change. Using a three-component normal mixture model as the skeleton, our approach incorporates several constraints to account for biological expectations and data characteristics. More importantly, we adopt a sparse mean-shift parameterization in the reference distribution, coupled with a regularized estimation approach, to flexibly accommodate proteins of discordant change. We explore the connections and differences between our approach and other robust clustering methods, and resolve the issue of unbounded likelihood under an eigenvalue-ratio condition. Simulation studies demonstrate the superior performance of our method compared with a number of alternative approaches. Our protein analysis along with further biological investigations may shed light on the discovery of the complete set of virulence factors in NE. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 546-559
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1356314
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356314
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:546-559
Template-Type: ReDIF-Article 1.0
Author-Name: Li Hsu
Author-X-Name-First: Li
Author-X-Name-Last: Hsu
Author-Name: Malka Gorfine
Author-X-Name-First: Malka
Author-X-Name-Last: Gorfine
Author-Name: David Zucker
Author-X-Name-First: David
Author-X-Name-Last: Zucker
Title: On Estimation of the Hazard Function From Population-Based Case–Control Studies
Abstract:
The population-based case–control study design has been widely used for studying the etiology of chronic diseases. It is well established that the Cox proportional hazards model can be adapted to the case–control study and hazard ratios can be estimated by a (conditional) logistic regression model with time as either a matched set or a covariate. However, the baseline hazard function, a critical component in absolute risk assessment, is unidentifiable, because the ratio of cases to controls is controlled by the investigators and does not reflect the true disease incidence rate in the population. In this article, we propose a simple and innovative approach, which makes use of routinely collected family history information, to estimate the baseline hazard function for any logistic regression model that is fit to the risk factor data collected on cases and controls. We establish that the proposed baseline hazard function estimator is consistent and asymptotically normal and show via simulation that it performs well in finite samples. We illustrate the proposed method by a population-based case–control study of prostate cancer where the association of various risk factors is assessed and the family history information is used to estimate the baseline hazard function. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 560-570
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1356315
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356315
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:560-570
Template-Type: ReDIF-Article 1.0
Author-Name: Haiming Zhou
Author-X-Name-First: Haiming
Author-X-Name-Last: Zhou
Author-Name: Timothy Hanson
Author-X-Name-First: Timothy
Author-X-Name-Last: Hanson
Title: A Unified Framework for Fitting Bayesian Semiparametric Models to Arbitrarily Censored Survival Data, Including Spatially Referenced Data
Abstract:
A comprehensive, unified approach to modeling arbitrarily censored spatial survival data is presented for the three most commonly used semiparametric models: proportional hazards, proportional odds, and accelerated failure time. Unlike many other approaches, all manner of censored survival times are simultaneously accommodated including uncensored, interval censored, current-status, left and right censored, and mixtures of these. Left-truncated data are also accommodated leading to models for time-dependent covariates. Both georeferenced (location exactly observed) and areally observed (location known up to a geographic unit such as a county) spatial locations are handled; formal variable selection makes model selection especially easy. Model fit is assessed with conditional Cox–Snell residual plots, and model choice is carried out via log pseudo marginal likelihood (LPML) and deviance information criterion (DIC). Baseline survival is modeled with a novel transformed Bernstein polynomial prior. All models are fit via a new function which calls efficient compiled C++ in the R package spBayesSurv. The methodology is broadly illustrated with simulations and real data applications. An important finding is that proportional odds and accelerated failure time models often fit significantly better than the commonly used proportional hazards model. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 571-581
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1356316
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356316
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:571-581
Template-Type: ReDIF-Article 1.0
Author-Name: Liang Li
Author-X-Name-First: Liang
Author-X-Name-Last: Li
Author-Name: Chih-Hsien Wu
Author-X-Name-First: Chih-Hsien
Author-X-Name-Last: Wu
Author-Name: Jing Ning
Author-X-Name-First: Jing
Author-X-Name-Last: Ning
Author-Name: Xuelin Huang
Author-X-Name-First: Xuelin
Author-X-Name-Last: Huang
Author-Name: Ya-Chen Tina Shih
Author-X-Name-First: Ya-Chen Tina
Author-X-Name-Last: Shih
Author-Name: Yu Shen
Author-X-Name-First: Yu
Author-X-Name-Last: Shen
Title: Semiparametric Estimation of Longitudinal Medical Cost Trajectory
Abstract:
Estimating the average monthly medical costs from disease diagnosis to a terminal event such as death for an incident cohort of patients is a topic of immense interest to researchers in health policy and health economics because patterns of average monthly costs over time reveal how medical costs vary across phases of care. The statistical challenges to estimating monthly medical costs longitudinally are multifold; the longitudinal cost trajectory (formed by plotting the average monthly costs from diagnosis to the terminal event) is likely to be nonlinear, with its shape depending on the time of the terminal event, which can be subject to right censoring. The goal of this article is to tackle this statistically challenging topic by estimating the conditional mean cost at any month t given the time of the terminal event s. The longitudinal cost trajectories with different terminal event times form a bivariate surface of t and s, under the constraint t ⩽ s. We propose to estimate this surface using bivariate penalized splines in an expectation-maximization algorithm that treats the censored terminal event times as missing data. We evaluate the proposed model and estimation method in simulations and apply the method to the medical cost data of an incident cohort of stage IV breast cancer patients from the Surveillance, Epidemiology, and End Results–Medicare Linked Database. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 582-592
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1361329
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1361329
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:582-592
Template-Type: ReDIF-Article 1.0
Author-Name: Yuhang Xu
Author-X-Name-First: Yuhang
Author-X-Name-Last: Xu
Author-Name: Yehua Li
Author-X-Name-First: Yehua
Author-X-Name-Last: Li
Author-Name: Dan Nettleton
Author-X-Name-First: Dan
Author-X-Name-Last: Nettleton
Title: Nested Hierarchical Functional Data Modeling and Inference for the Analysis of Functional Plant Phenotypes
Abstract:
In a plant science Root Image Study, the process of seedling roots bending in response to gravity is recorded using digital cameras, and the bending rates are modeled as functional plant phenotype data. The functional phenotypes are collected from seeds representing a large variety of genotypes and have a three-level nested hierarchical structure, with seeds nested in groups nested in genotypes. The seeds are imaged on different days of the lunar cycle, and an important scientific question is whether there are lunar effects on root bending. We allow the mean function of the bending rate to depend on the lunar day and model the phenotypic variation between genotypes, groups of seeds imaged together, and individual seeds by hierarchical functional random effects. We estimate the covariance functions of the functional random effects by a fast penalized tensor product spline approach, perform multi-level functional principal component analysis (FPCA) using the best linear unbiased predictor of the principal component scores, and improve the efficiency of mean estimation by iterative decorrelation. We choose the number of principal components using a conditional Akaike information criterion and test the lunar day effect using generalized likelihood ratio test statistics based on the marginal and conditional likelihoods. We also propose a permutation procedure to evaluate the null distribution of the test statistics. Our simulation studies show that our model selection criterion selects the correct number of principal components with remarkably high frequency, and the likelihood-based tests based on FPCA have higher power than a test based on working independence. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 593-606
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2017.1366907
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1366907
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:593-606
Template-Type: ReDIF-Article 1.0
Author-Name: Sonja A. Swanson
Author-X-Name-First: Sonja A.
Author-X-Name-Last: Swanson
Author-Name: Miguel A. Hernán
Author-X-Name-First: Miguel A.
Author-X-Name-Last: Hernán
Author-Name: Matthew Miller
Author-X-Name-First: Matthew
Author-X-Name-Last: Miller
Author-Name: James M. Robins
Author-X-Name-First: James M.
Author-X-Name-Last: Robins
Author-Name: Thomas S. Richardson
Author-X-Name-First: Thomas S.
Author-X-Name-Last: Richardson
Title: Partial Identification of the Average Treatment Effect Using Instrumental Variables: Review of Methods for Binary Instruments, Treatments, and Outcomes
Abstract:
Several methods have been proposed for partially or point identifying the average treatment effect (ATE) using instrumental variable (IV) type assumptions. The descriptions of these methods are widespread across the statistical, economic, epidemiologic, and computer science literature, and the connections between the methods have not been readily apparent. In the setting of a binary instrument, treatment, and outcome, we review proposed methods for partial and point identification of the ATE under IV assumptions, express the identification results in a common notation and terminology, and propose a taxonomy that is based on sets of identifying assumptions. We further demonstrate and provide software for the application of these methods to estimate bounds. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 933-947
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2018.1434530
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1434530
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:933-947
Template-Type: ReDIF-Article 1.0
Author-Name: Houshmand Shirani-Mehr
Author-X-Name-First: Houshmand
Author-X-Name-Last: Shirani-Mehr
Author-Name: David Rothschild
Author-X-Name-First: David
Author-X-Name-Last: Rothschild
Author-Name: Sharad Goel
Author-X-Name-First: Sharad
Author-X-Name-Last: Goel
Author-Name: Andrew Gelman
Author-X-Name-First: Andrew
Author-X-Name-Last: Gelman
Title: Disentangling Bias and Variance in Election Polls
Abstract:
It is well known among researchers and practitioners that election polls suffer from a variety of sampling and nonsampling errors, often collectively referred to as total survey error. Reported margins of error typically only capture sampling variability, and in particular, generally ignore nonsampling errors in defining the target population (e.g., errors due to uncertainty in who will vote). Here, we empirically analyze 4221 polls for 608 state-level presidential, senatorial, and gubernatorial elections between 1998 and 2014, all of which were conducted during the final three weeks of the campaigns. Comparing to the actual election outcomes, we find that average survey error as measured by root mean square error is approximately 3.5 percentage points, about twice as large as that implied by most reported margins of error. We decompose survey error into election-level bias and variance terms. We find that average absolute election-level bias is about 2 percentage points, indicating that polls for a given election often share a common component of error. This shared error may stem from the fact that polling organizations often face similar difficulties in reaching various subgroups of the population, and that they rely on similar screening rules when estimating who will vote. We also find that average election-level variance is higher than implied by simple random sampling, in part because polling organizations often use complex sampling designs and adjustment procedures. We conclude by discussing how these results help explain polling failures in the 2016 U.S. presidential election, and offer recommendations to improve polling practice.
Journal: Journal of the American Statistical Association
Pages: 607-614
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2018.1448823
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1448823
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:607-614
Template-Type: ReDIF-Article 1.0
Author-Name: Barry D. Nussbaum
Author-X-Name-First: Barry D.
Author-X-Name-Last: Nussbaum
Title: Statistics: Essential Now More Than Ever
Abstract:
Each year, the Journal of the American Statistical Association publishes the presidential address from the Joint Statistical Meetings. Here, we present the 2017 address verbatim save for the addition of references and a few minor editorial corrections.
Journal: Journal of the American Statistical Association
Pages: 489-493
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2018.1463486
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1463486
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:489-493
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Book Reviews
Journal: Journal of the American Statistical Association
Pages: 948-953
Issue: 522
Volume: 113
Year: 2018
Month: 4
X-DOI: 10.1080/01621459.2018.1486071
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1486071
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:522:p:948-953
Template-Type: ReDIF-Article 1.0
Author-Name: Karun Adusumilli
Author-X-Name-First: Karun
Author-X-Name-Last: Adusumilli
Author-Name: Taisuke Otsu
Author-X-Name-First: Taisuke
Author-X-Name-Last: Otsu
Title: Empirical Likelihood for Random Sets
Abstract:
In many statistical applications, the observed data take the form of sets rather than points. Examples include bracket data in survey analysis, tumor growth and rock grain images in morphology analysis, and noisy measurements on the support function of a convex set in medical imaging and robotic vision. Additionally, in studies of treatment effects, researchers often wish to conduct inference on nonparametric bounds for the effects which can be expressed by means of random sets. This article develops the concept of nonparametric likelihood for random sets and its mean, known as the Aumann expectation, and proposes general inference methods by adapting the theory of empirical likelihood. Several examples, such as regression with bracket income data, Boolean models for tumor growth, bound analysis on treatment effects, and image analysis via support functions, illustrate the usefulness of the proposed methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1064-1075
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1188107
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1188107
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1064-1075
Template-Type: ReDIF-Article 1.0
Author-Name: Kin Wai Chan
Author-X-Name-First: Kin Wai
Author-X-Name-Last: Chan
Author-Name: Chun Yip Yau
Author-X-Name-First: Chun Yip
Author-X-Name-Last: Yau
Title: Automatic Optimal Batch Size Selection for Recursive Estimators of Time-Average Covariance Matrix
Abstract:
The time-average covariance matrix (TACM) Σ := ∑_{k∈ℤ} Γ_k, where Γ_k is the auto-covariance function, is an important quantity for the inference of the mean of an ℝ^d-valued stationary process (d ⩾ 1). This article proposes two recursive estimators for Σ with optimal asymptotic mean square error (AMSE) under different strengths of serial dependence. The optimal estimator involves a batch size selection, which requires knowledge of a smoothness parameter ϒ_β := ∑_{k∈ℤ} |k|^β Γ_k, for some β. This article also develops recursive estimators for ϒ_β. Combining these two estimators, we obtain a fully automatic procedure for optimal online estimation of Σ. Consistency and convergence rates of the proposed estimators are derived. Applications to confidence region construction and Markov chain Monte Carlo convergence diagnosis are discussed. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1076-1089
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1189337
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1189337
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1076-1089
Template-Type: ReDIF-Article 1.0
Author-Name: Neil Shephard
Author-X-Name-First: Neil
Author-X-Name-Last: Shephard
Author-Name: Justin J. Yang
Author-X-Name-First: Justin J.
Author-X-Name-Last: Yang
Title: Continuous Time Analysis of Fleeting Discrete Price Moves
Abstract:
This article proposes a novel model of financial prices where (i) prices are discrete; (ii) prices change in continuous time; (iii) a high proportion of price changes are reversed in a fraction of a second. Our model is analytically tractable and directly formulated in terms of the calendar time and price impact curve. The resulting càdlàg price process is a piecewise constant semimartingale with finite activity, finite variation, and no Brownian motion component. We use moment-based estimations to fit four high-frequency futures datasets and demonstrate the descriptive power of our proposed model. This model is able to describe the observed dynamics of price changes over three different orders of magnitude of time intervals. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1090-1106
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1192544
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1192544
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1090-1106
Template-Type: ReDIF-Article 1.0
Author-Name: Yun Yang
Author-X-Name-First: Yun
Author-X-Name-Last: Yang
Author-Name: Surya T. Tokdar
Author-X-Name-First: Surya T.
Author-X-Name-Last: Tokdar
Title: Joint Estimation of Quantile Planes Over Arbitrary Predictor Spaces
Abstract:
In spite of the recent surge of interest in quantile regression, joint estimation of linear quantile planes remains a great challenge in statistics and econometrics. We propose a novel parameterization that characterizes any collection of noncrossing quantile planes over arbitrarily shaped convex predictor domains in any dimension by means of unconstrained scalar, vector and function valued parameters. Statistical models based on this parameterization inherit a fast computation of the likelihood function, enabling penalized likelihood or Bayesian approaches to model fitting. We introduce a complete Bayesian methodology by using Gaussian process prior distributions on the function valued parameters and develop a robust and efficient Markov chain Monte Carlo parameter estimation. The resulting method is shown to offer posterior consistency under mild tail and regularity conditions. We present several illustrative examples where the new method is compared against existing approaches and is found to offer better accuracy, coverage and model fit. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1107-1120
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1192545
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1192545
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1107-1120
Template-Type: ReDIF-Article 1.0
Author-Name: Thomas S. Richardson
Author-X-Name-First: Thomas S.
Author-X-Name-Last: Richardson
Author-Name: James M. Robins
Author-X-Name-First: James M.
Author-X-Name-Last: Robins
Author-Name: Linbo Wang
Author-X-Name-First: Linbo
Author-X-Name-Last: Wang
Title: On Modeling and Estimation for the Relative Risk and Risk Difference
Abstract:
A common problem in formulating models for the relative risk and risk difference is the variation dependence between these parameters and the baseline risk, which is a nuisance model. We address this problem by proposing the conditional log odds-product as a preferred nuisance model. This novel nuisance model not only facilitates maximum-likelihood estimation, but also permits doubly-robust estimation for the parameters of interest. Our approach is illustrated via simulations and a data analysis. An R package brm implementing the proposed methods is available on CRAN. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1121-1130
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1192546
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1192546
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1121-1130
Template-Type: ReDIF-Article 1.0
Author-Name: Lexin Li
Author-X-Name-First: Lexin
Author-X-Name-Last: Li
Author-Name: Xin Zhang
Author-X-Name-First: Xin
Author-X-Name-Last: Zhang
Title: Parsimonious Tensor Response Regression
Abstract:
Aiming at abundant scientific and engineering data with not only high dimensionality but also complex structure, we study the regression problem with a multidimensional array (tensor) response and a vector predictor. Applications include, among others, comparing tensor images across groups after adjusting for additional covariates, which is of central interest in neuroimaging analysis. We propose parsimonious tensor response regression adopting a generalized sparsity principle. It models all voxels of the tensor response jointly, while accounting for the inherent structural information among the voxels. It effectively reduces the number of free parameters, leading to feasible computation and improved interpretation. We achieve model estimation through a nascent technique called the envelope method, which identifies the immaterial information and focuses the estimation based upon the material information in the tensor response. We demonstrate that the resulting estimator is asymptotically efficient, and it enjoys a competitive finite sample performance. We also illustrate the new method on two real neuroimaging studies. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1131-1146
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1193022
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1193022
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1131-1146
Template-Type: ReDIF-Article 1.0
Author-Name: David Choi
Author-X-Name-First: David
Author-X-Name-Last: Choi
Title: Estimation of Monotone Treatment Effects in Network Experiments
Abstract:
Randomized experiments on social networks pose statistical challenges, due to the possibility of interference between units. We propose new methods for finding confidence intervals on the attributable treatment effect in such settings. The methods do not require partial interference, but instead require an identifying assumption that is similar to requiring nonnegative treatment effects. Network or spatial information can be used to customize the test statistic; in principle, this can increase power without making assumptions on the data-generating process. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1147-1155
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1194845
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1194845
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1147-1155
Template-Type: ReDIF-Article 1.0
Author-Name: Xiao Wang
Author-X-Name-First: Xiao
Author-X-Name-Last: Wang
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Title: Generalized Scalar-on-Image Regression Models via Total Variation
Abstract:
The use of imaging markers to predict clinical outcomes can have a great impact in public health. The aim of this article is to develop a class of generalized scalar-on-image regression models via total variation (GSIRM-TV), in the sense of generalized linear models, for scalar response and imaging predictor with the presence of scalar covariates. A key novelty of GSIRM-TV is the assumption that the slope function (or image) belongs to the space of bounded total variation, explicitly accounting for the piecewise smooth nature of most imaging data. We develop an efficient penalized total variation optimization to estimate the unknown slope function and other parameters. We also establish nonasymptotic error bounds on the excess risk. These bounds are explicitly specified in terms of sample size, image size, and image smoothness. Our simulations demonstrate a superior performance of GSIRM-TV against many existing approaches. We apply GSIRM-TV to the analysis of hippocampus data obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1156-1168
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1194846
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1194846
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1156-1168
Template-Type: ReDIF-Article 1.0
Author-Name: Jialiang Li
Author-X-Name-First: Jialiang
Author-X-Name-Last: Li
Author-Name: Chao Huang
Author-X-Name-First: Chao
Author-X-Name-Last: Huang
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Title: A Functional Varying-Coefficient Single-Index Model for Functional Response Data
Abstract:
Motivated by the analysis of imaging data, we propose a novel functional varying-coefficient single-index model (FVCSIM) to carry out the regression analysis of functional response data on a set of covariates of interest. FVCSIM represents a new extension of varying-coefficient single-index models for scalar responses collected from cross-sectional and longitudinal studies. An efficient estimation procedure is developed to iteratively estimate varying coefficient functions, link functions, index parameter vectors, and the covariance function of individual functions. We systematically examine the asymptotic properties of all estimators, including the weak convergence of the estimated varying coefficient functions, the asymptotic distribution of the estimated index parameter vectors, and the uniform convergence rate of the estimated covariance function and their spectrum. Simulation studies are carried out to assess the finite-sample performance of the proposed procedure. We apply FVCSIM to investigate the development of white matter diffusivities along the corpus callosum skeleton obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study. Supplementary material for this article is available online.
Journal: Journal of the American Statistical Association
Pages: 1169-1181
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1195742
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1195742
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1169-1181
Template-Type: ReDIF-Article 1.0
Author-Name: Tomohiro Ando
Author-X-Name-First: Tomohiro
Author-X-Name-Last: Ando
Author-Name: Jushan Bai
Author-X-Name-First: Jushan
Author-X-Name-Last: Bai
Title: Clustering Huge Number of Financial Time Series: A Panel Data Approach With High-Dimensional Predictors and Factor Structures
Abstract:
This article introduces a new procedure for clustering a large number of financial time series based on high-dimensional panel data with grouped factor structures. The proposed method attempts to capture the level of similarity of each of the time series based on sensitivity to observable factors as well as to the unobservable factor structure. The proposed method allows for correlations between observable and unobservable factors and also allows for cross-sectional and serial dependence and heteroscedasticities in the error structure, which are common in financial markets. In addition, theoretical properties are established for the procedure. We apply the method to analyze the returns for over 6000 international stocks from over 100 financial markets. The empirical analysis quantifies the extent to which the U.S. subprime crisis spilled over to the global financial markets. Furthermore, we find that nominal classifications based on either listed market, industry, country or region are insufficient to characterize the heterogeneity of the global financial markets. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1182-1198
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1195743
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1195743
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1182-1198
Template-Type: ReDIF-Article 1.0
Author-Name: Simon N. Wood
Author-X-Name-First: Simon N.
Author-X-Name-Last: Wood
Author-Name: Zheyuan Li
Author-X-Name-First: Zheyuan
Author-X-Name-Last: Li
Author-Name: Gavin Shaddick
Author-X-Name-First: Gavin
Author-X-Name-Last: Shaddick
Author-Name: Nicole H. Augustin
Author-X-Name-First: Nicole H.
Author-X-Name-Last: Augustin
Title: Generalized Additive Models for Gigadata: Modeling the U.K. Black Smoke Network Daily Data
Abstract:
We develop scalable methods for fitting penalized regression spline based generalized additive models with of the order of 10^4 coefficients to up to 10^8 data. Computational feasibility rests on: (i) a new iteration scheme for estimation of model coefficients and smoothing parameters, avoiding poorly scaling matrix operations; (ii) parallelization of the iteration's pivoted block Cholesky and basic matrix operations; (iii) the marginal discretization of model covariates to reduce memory footprint, with efficient scalable methods for computing required crossproducts directly from the discrete representation. Marginal discretization enables much finer discretization than joint discretization would permit. We were motivated by the need to model four decades' worth of daily particulate data from the U.K. Black Smoke and Sulphur Dioxide Monitoring Network. Although reduced in size recently, over 2000 stations have at some time been part of the network, resulting in some 10 million measurements. Modeling at a daily scale is desirable for accurate trend estimation and mapping, and to provide daily exposure estimates for epidemiological cohort studies. Because of the dataset size, previous work has focused on modeling time or space averaged pollution levels, but this is unsatisfactory from a health perspective, since it is often acute exposure locally and on the time scale of days that is of most importance in driving adverse health outcomes. If computed by conventional means our black smoke model would require a half terabyte of storage just for the model matrix, whereas we are able to compute with it on a desktop workstation. The best previously available reduced memory footprint method would have required three orders of magnitude more computing time than our new method. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1199-1210
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1195744
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1195744
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1199-1210
Template-Type: ReDIF-Article 1.0
Author-Name: Cyrus J. DiCiccio
Author-X-Name-First: Cyrus J.
Author-X-Name-Last: DiCiccio
Author-Name: Joseph P. Romano
Author-X-Name-First: Joseph P.
Author-X-Name-Last: Romano
Title: Robust Permutation Tests For Correlation And Regression Coefficients
Abstract:
Given a sample from a bivariate distribution, consider the problem of testing independence. A permutation test based on the sample correlation is known to be an exact level α test. However, when used to test the null hypothesis that the samples are uncorrelated, the permutation test can have rejection probability that is far from the nominal level. Further, the permutation test can have a large Type 3 (directional) error rate, whereby there can be a large probability that the permutation test rejects because the sample correlation is a large positive value, when in fact the true correlation is negative. It will be shown that studentizing the sample correlation leads to a permutation test which is exact under independence and asymptotically controls the probability of Type 1 (or Type 3) errors. These conclusions are based on our results describing the almost sure limiting behavior of the randomization distribution. We will also present asymptotically robust randomization tests for regression coefficients, including a result based on a modified procedure of Freedman and Lane. Simulations and empirical applications are included. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1211-1220
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1202117
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1202117
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1211-1220
Template-Type: ReDIF-Article 1.0
Author-Name: Jon Arni Steingrimsson
Author-X-Name-First: Jon Arni
Author-X-Name-Last: Steingrimsson
Author-Name: Robert L. Strawderman
Author-X-Name-First: Robert L.
Author-X-Name-Last: Strawderman
Title: Estimation in the Semiparametric Accelerated Failure Time Model With Missing Covariates: Improving Efficiency Through Augmentation
Abstract:
This article considers linear regression with missing covariates and a right censored outcome. We first consider a general two-phase outcome sampling design, where full covariate information is only ascertained for subjects in phase two and sampling occurs under an independent Bernoulli sampling scheme with known subject-specific sampling probabilities that depend on phase one information (e.g., survival time, failure status and covariates). The semiparametric information bound is derived for estimating the regression parameter in this setting. We also introduce a more practical class of augmented estimators that is shown to improve asymptotic efficiency over simple but inefficient inverse probability of sampling weighted estimators. Estimation for known sampling weights and extensions to the case of estimated sampling weights are both considered. The allowance for estimated sampling weights permits covariates to be missing at random according to a monotone but unknown mechanism. The asymptotic properties of the augmented estimators are derived and simulation results demonstrate substantial efficiency improvements over simpler inverse probability of sampling weighted estimators in the indicated settings. With suitable modification, the proposed methodology can also be used to improve augmented estimators previously used for missing covariates in a Cox regression model. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1221-1235
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1205500
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1205500
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1221-1235
Template-Type: ReDIF-Article 1.0
Author-Name: Ganggang Xu
Author-X-Name-First: Ganggang
Author-X-Name-Last: Xu
Author-Name: Marc G. Genton
Author-X-Name-First: Marc G.
Author-X-Name-Last: Genton
Title: Tukey g-and-h Random Fields
Abstract:
We propose a new class of trans-Gaussian random fields named Tukey g-and-h (TGH) random fields to model non-Gaussian spatial data. The proposed TGH random fields have extremely flexible marginal distributions, possibly skewed and/or heavy-tailed, and, therefore, have a wide range of applications. The special formulation of the TGH random field enables an automatic search for the most suitable transformation for the dataset of interest while estimating model parameters. Asymptotic properties of the maximum likelihood estimator and the probabilistic properties of the TGH random fields are investigated. An efficient estimation procedure, based on maximum approximated likelihood, is proposed and an extreme spatial outlier detection algorithm is formulated. Kriging and probabilistic prediction with TGH random fields are developed along with prediction confidence intervals. The predictive performance of TGH random fields is demonstrated through extensive simulation studies and an application to a dataset of total precipitation in the southeast of the United States. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1236-1249
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1205501
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1205501
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1236-1249
Template-Type: ReDIF-Article 1.0
Author-Name: Pengfei Li
Author-X-Name-First: Pengfei
Author-X-Name-Last: Li
Author-Name: Yukun Liu
Author-X-Name-First: Yukun
Author-X-Name-Last: Liu
Author-Name: Jing Qin
Author-X-Name-First: Jing
Author-X-Name-Last: Qin
Title: Semiparametric Inference in a Genetic Mixture Model
Abstract:
In genetic backcross studies, data are often collected from complex mixtures of distributions with known mixing proportions. Previous approaches to the inference of these genetic mixture models involve parameterizing the component distributions. However, model misspecification of any form is expected to have detrimental effects. We propose a semiparametric likelihood method for genetic mixture models: the empirical likelihood under the exponential tilting model assumption, in which the log ratio of the probability (density) functions from the components is linear in the observations. An application to mice cancer genetics involves random numbers of offspring within a litter. In other words, the cluster size is a random variable. We wish to test the null hypothesis that there is no difference between the two components in the mixture model, but unfortunately we find that the Fisher information is degenerate. As a consequence, the conventional two-term expansion in the likelihood ratio statistic does not work. By using a higher-order expansion, we are able to establish a nonstandard convergence rate of N^{-1/4} for the odds ratio parameter estimator $\hat{\beta}$. Moreover, the limiting distribution of the empirical likelihood ratio statistic is derived. The underlying distribution function of each component can also be estimated semiparametrically. Analogously to the full parametric approach, we develop an expectation and maximization algorithm for finding the semiparametric maximum likelihood estimator. Simulation results and a real cancer application indicate that the proposed semiparametric method works much better than parametric methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1250-1260
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1208614
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1208614
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1250-1260
Template-Type: ReDIF-Article 1.0
Author-Name: Lizhen Lin
Author-X-Name-First: Lizhen
Author-X-Name-Last: Lin
Author-Name: Brian St. Thomas
Author-X-Name-First: Brian
Author-X-Name-Last: St. Thomas
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Extrinsic Local Regression on Manifold-Valued Data
Abstract:
We propose an extrinsic regression framework for modeling data with manifold valued responses and Euclidean predictors. Regression with manifold responses has wide applications in shape analysis, neuroscience, medical imaging, and many other areas. Our approach embeds the manifold where the responses lie onto a higher dimensional Euclidean space, obtains a local regression estimate in that space, and then projects this estimate back onto the image of the manifold. Outside the regression setting both intrinsic and extrinsic approaches have been proposed for modeling iid manifold-valued data. However, to our knowledge our work is the first to take an extrinsic approach to the regression problem. The proposed extrinsic regression framework is general, computationally efficient, and theoretically appealing. Asymptotic distributions and convergence rates of the extrinsic regression estimates are derived and a large class of examples is considered indicating the wide applicability of our approach. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1261-1273
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1208615
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1208615
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1261-1273
Template-Type: ReDIF-Article 1.0
Author-Name: Matthew Plumlee
Author-X-Name-First: Matthew
Author-X-Name-Last: Plumlee
Title: Bayesian Calibration of Inexact Computer Models
Abstract:
Bayesian calibration is used to study computer models in the presence of both a calibration parameter and model bias. The parameter in the predominant methodology is left undefined. This results in an issue where the posterior of the parameter is suboptimally broad. There have been no generally accepted alternatives to date. This article proposes using Bayesian calibration, where the prior distribution on the bias is orthogonal to the gradient of the computer model. Problems associated with Bayesian calibration are shown to be mitigated through analytic results in addition to examples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1274-1285
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1211016
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1211016
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1274-1285
Template-Type: ReDIF-Article 1.0
Author-Name: Kyle Vincent
Author-X-Name-First: Kyle
Author-X-Name-Last: Vincent
Author-Name: Steve Thompson
Author-X-Name-First: Steve
Author-X-Name-Last: Thompson
Title: Estimating Population Size With Link-Tracing Sampling
Abstract:
We present a new design and method for estimating the size of a hidden population best reached through a link-tracing design. The design is based on selecting initial samples at random and then adaptively tracing links to add new members. The inferential procedure involves the Rao–Blackwell theorem applied to a sufficient statistic markedly different from the usual one that arises in sampling from a finite population. The strategy involves a combination of link-tracing and mark-recapture estimation methods. An empirical application is described. The result demonstrates that the strategy can efficiently incorporate adaptively selected members of the sample into the inferential procedure. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1286-1295
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1212712
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1212712
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1286-1295
Template-Type: ReDIF-Article 1.0
Author-Name: Ming-Yueh Huang
Author-X-Name-First: Ming-Yueh
Author-X-Name-Last: Huang
Author-Name: Chin-Tsang Chiang
Author-X-Name-First: Chin-Tsang
Author-X-Name-Last: Chiang
Title: An Effective Semiparametric Estimation Approach for the Sufficient Dimension Reduction Model
Abstract:
In exploratory data analysis, the sufficient dimension reduction model has been widely used to characterize the conditional distribution of interest. Different from the existing approaches, our main achievement is to simultaneously estimate two essential elements, basis and structural dimension, of the central subspace and the bandwidth of a kernel distribution estimator through a single estimation criterion. With an appropriate order of kernel function, the proposed estimation procedure can be effectively carried out by starting with a dimension of zero until the first local minimum is reached. Meanwhile, the optimal bandwidth selector is ensured to be a valid tuning parameter for the central subspace estimator. An important advantage of this estimation technique is its flexibility to allow a response to be discrete and some of the covariates to be discrete or categorical, provided that a certain continuity condition holds. Under very mild assumptions, we further derive the uniform consistency of the introduced optimization function and the consistency of the resulting estimators. Moreover, the asymptotic normality of the central subspace estimator is established with an estimated rather than exact structural dimension. In extensive simulations, the developed approach generally outperforms the competitors. Data from previous studies are also used to illustrate the proposal. On the whole, our methodology is very effective in estimating the central subspace and conditional distribution, highly flexible in adapting diverse types of responses and covariates, and practically feasible in obtaining an asymptotically optimal and valid bandwidth estimator. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1296-1310
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1215987
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1215987
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1296-1310
Template-Type: ReDIF-Article 1.0
Author-Name: Kazuki Uematsu
Author-X-Name-First: Kazuki
Author-X-Name-Last: Uematsu
Author-Name: Yoonkyung Lee
Author-X-Name-First: Yoonkyung
Author-X-Name-Last: Lee
Title: On Theoretically Optimal Ranking Functions in Bipartite Ranking
Abstract:
This article investigates the theoretical relation between loss criteria and the optimal ranking functions driven by the criteria in bipartite ranking. In particular, the relation between area under the ROC curve (AUC) maximization and minimization of ranking risk under a convex loss is examined. We characterize general conditions for ranking-calibrated loss functions in a pairwise approach, and show that the best ranking functions under convex ranking-calibrated loss criteria produce the same ordering as the likelihood ratio of the positive category to the negative category over the instance space. The result illuminates the parallel between ranking and classification in general, and suggests the notion of consistency in ranking when convex ranking risk is minimized as in the RankBoost algorithm for instance. For a certain class of loss functions including the exponential loss and the binomial deviance, we specify the optimal ranking function explicitly in relation to the underlying probability distribution. In addition, we present an in-depth analysis of hinge loss optimization for ranking and point out that the RankSVM may produce potentially many ties or granularity in ranking scores due to the singularity of the hinge loss, which could result in ranking inconsistency. The theoretical findings are illustrated with numerical examples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1311-1322
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1215988
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1215988
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1311-1322
Template-Type: ReDIF-Article 1.0
Author-Name: Francis K. C. Hui
Author-X-Name-First: Francis K. C.
Author-X-Name-Last: Hui
Author-Name: Samuel Müller
Author-X-Name-First: Samuel
Author-X-Name-Last: Müller
Author-Name: A. H. Welsh
Author-X-Name-First: A. H.
Author-X-Name-Last: Welsh
Title: Joint Selection in Mixed Models using Regularized PQL
Abstract:
The application of generalized linear mixed models presents some major challenges for both estimation, due to the intractable marginal likelihood, and model selection, as we usually want to jointly select over both fixed and random effects. We propose to overcome these challenges by combining penalized quasi-likelihood (PQL) estimation with sparsity inducing penalties on the fixed and random coefficients. The resulting approach, referred to as regularized PQL, is a computationally efficient method for performing joint selection in mixed models. A key aspect of regularized PQL involves the use of a group based penalty for the random effects: sparsity is induced such that all the coefficients for a random effect are shrunk to zero simultaneously, which in turn leads to the random effect being removed from the model. Despite being a quasi-likelihood approach, we show that regularized PQL is selection consistent, that is, it asymptotically selects the true set of fixed and random effects, in the setting where the cluster size grows with the number of clusters. Furthermore, we propose an information criterion for choosing the single tuning parameter and show that it facilitates selection consistency. Simulations demonstrate regularized PQL outperforms several currently employed methods for joint selection even if the cluster size is small compared to the number of clusters, while also offering dramatic reductions in computation time. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1323-1333
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1215989
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1215989
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1323-1333
Template-Type: ReDIF-Article 1.0
Author-Name: Ulrich K. Müller
Author-X-Name-First: Ulrich K.
Author-X-Name-Last: Müller
Author-Name: Yulong Wang
Author-X-Name-First: Yulong
Author-X-Name-Last: Wang
Title: Fixed-k Asymptotic Inference About Tail Properties
Abstract:
We consider inference about tail properties of a distribution from an iid sample, based on extreme value theory. All of the numerous previous suggestions rely on asymptotics where eventually, an infinite number of observations from the tail behave as predicted by extreme value theory, enabling the consistent estimation of the key tail index, and the construction of confidence intervals using the delta method or other classic approaches. In small samples, however, extreme value theory might well provide good approximations for only a relatively small number of tail observations. To accommodate this concern, we develop asymptotically valid confidence intervals for high quantile and tail conditional expectations that only require extreme value theory to hold for the largest k observations, for a given and fixed k. Small-sample simulations show that these “fixed-k” intervals have excellent small-sample coverage properties, and we illustrate their use with mainland U.S. hurricane data. In addition, we provide an analytical result about the additional asymptotic robustness of the fixed-k approach compared to kn → ∞ inference.
Journal: Journal of the American Statistical Association
Pages: 1334-1343
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1215990
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1215990
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1334-1343
Template-Type: ReDIF-Article 1.0
Author-Name: Xuan Bi
Author-X-Name-First: Xuan
Author-X-Name-Last: Bi
Author-Name: Annie Qu
Author-X-Name-First: Annie
Author-X-Name-Last: Qu
Author-Name: Junhui Wang
Author-X-Name-First: Junhui
Author-X-Name-Last: Wang
Author-Name: Xiaotong Shen
Author-X-Name-First: Xiaotong
Author-X-Name-Last: Shen
Title: A Group-Specific Recommender System
Abstract:
In recent years, there has been a growing demand to develop efficient recommender systems which track users’ preferences and recommend potential items of interest to users. In this article, we propose a group-specific method to use dependency information from users and items which share similar characteristics under the singular value decomposition framework. The new approach is effective for the “cold-start” problem, where, in the testing set, the majority of responses are obtained from new users or for new items, and their preference information is not available from the training set. One advantage of the proposed model is that we are able to incorporate information from the missing mechanism and group-specific features through clustering based on the numbers of ratings from each user and other variables associated with missing patterns. In addition, since this type of data involves large-scale customer records, traditional algorithms are not computationally scalable. To implement the proposed method, we propose a new algorithm that embeds a back-fitting algorithm into alternating least squares, which avoids large matrix operations and large memory storage, and therefore makes it feasible to achieve scalable computing. Our simulation studies and MovieLens data analysis both indicate that the proposed group-specific method improves prediction accuracy significantly compared to existing competitive recommender system approaches. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1344-1353
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1219261
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1219261
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1344-1353
Template-Type: ReDIF-Article 1.0
Author-Name: Mike G. Tsionas
Author-X-Name-First: Mike G.
Author-X-Name-Last: Tsionas
Title: “When, Where, and How” of Efficiency Estimation: Improved Procedures for Stochastic Frontier Modeling
Abstract:
The issues of functional form, distributions of the error components, and endogeneity are for the most part still open in stochastic frontier models. The same is true when it comes to imposition of restrictions of monotonicity and curvature, making efficiency estimation an elusive goal. In this article, we attempt to consider these problems simultaneously and offer practical solutions to the problems raised by Stone and addressed by Badunenko, Henderson and Kumbhakar. We provide major extensions to smoothly mixing regressions and fractional polynomial approximations for both the functional form of the frontier and the structure of inefficiency. Endogeneity is handled, simultaneously, using copulas. We provide detailed computational experiments and an application to U.S. banks. To explore the posteriors of the new models we rely heavily on sequential Monte Carlo techniques.
Journal: Journal of the American Statistical Association
Pages: 948-965
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1246364
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1246364
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:948-965
Template-Type: ReDIF-Article 1.0
Author-Name: Zihuai He
Author-X-Name-First: Zihuai
Author-X-Name-Last: He
Author-Name: Min Zhang
Author-X-Name-First: Min
Author-X-Name-Last: Zhang
Author-Name: Seunggeun Lee
Author-X-Name-First: Seunggeun
Author-X-Name-Last: Lee
Author-Name: Jennifer A. Smith
Author-X-Name-First: Jennifer A.
Author-X-Name-Last: Smith
Author-Name: Sharon L. R. Kardia
Author-X-Name-First: Sharon L. R.
Author-X-Name-Last: Kardia
Author-Name: V. Diez Roux
Author-X-Name-First: V. Diez
Author-X-Name-Last: Roux
Author-Name: Bhramar Mukherjee
Author-X-Name-First: Bhramar
Author-X-Name-Last: Mukherjee
Title: Set-Based Tests for the Gene–Environment Interaction in Longitudinal Studies
Abstract:
We propose a generalized score type test for set-based inference for the gene–environment interaction with longitudinally measured quantitative traits. The test is robust to misspecification of within subject correlation structure and has enhanced power compared to existing alternatives. Unlike tests for marginal genetic association, set-based tests for the gene–environment interaction face the challenges of a potentially misspecified and high-dimensional main effect model under the null hypothesis. We show that our proposed test is robust to main effect misspecification of environmental exposure and genetic factors under the gene–environment independence condition. When genetic and environmental factors are dependent, the method of sieves is further proposed to eliminate potential bias due to a misspecified main effect of a continuous environmental exposure. A weighted principal component analysis approach is developed to perform dimension reduction when the number of genetic variants in the set is large relative to the sample size. The methods are motivated by an example from the Multi-Ethnic Study of Atherosclerosis (MESA), investigating interaction between measures of neighborhood environment and genetic regions on longitudinal measures of blood pressure over a study period of about seven years with four exams. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 966-978
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1252266
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1252266
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:966-978
Template-Type: ReDIF-Article 1.0
Author-Name: Ethan X. Fang
Author-X-Name-First: Ethan X.
Author-X-Name-Last: Fang
Author-Name: Min-Dian Li
Author-X-Name-First: Min-Dian
Author-X-Name-Last: Li
Author-Name: Michael I. Jordan
Author-X-Name-First: Michael I.
Author-X-Name-Last: Jordan
Author-Name: Han Liu
Author-X-Name-First: Han
Author-X-Name-Last: Liu
Title: Mining Massive Amounts of Genomic Data: A Semiparametric Topic Modeling Approach
Abstract:
Characterizing the functional relevance of transcription factors (TFs) in different biological contexts is pivotal in systems biology. Given the massive amount of genomic data, computational identification of TFs is emerging as a useful approach to bridge functional genomics with disease risk loci. In this article, we use large-scale gene expression and chromatin immunoprecipitation (ChIP) data corpuses to conduct high-throughput TF-biological context association analysis. This work makes two contributions: (i) From a methodological perspective, we propose a unified topic modeling framework for exploring and analyzing large and complex genomic datasets. Under this framework, we develop new statistical optimization algorithms and semiparametric theoretical analysis, which are also applicable to a variety of large-scale data analyses. (ii) From an experimental perspective, our method generates an informative list of tumor-related TFs and their possibly affected tumor types. Our data-driven analysis of 38 TFs in 68 tumor biological contexts identifies functional signatures of epigenetic regulators, such as SUZ12 and SETDB1, and nuclear receptors, in many tumor types. In particular, the TF signature of SUZ12 is present in a broad range of tumor types, many of which have not been reported before. In summary, our work establishes a robust method to identify the association between TFs and biological contexts. Given the limited amount of genome-wide binding profiles of TFs and the massive number of expression profiles, our work provides a useful tool to deconvolute the gene regulatory network for tumors and other biological contexts. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 921-932
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1256812
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1256812
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:921-932
Template-Type: ReDIF-Article 1.0
Author-Name: Weiyi Xie
Author-X-Name-First: Weiyi
Author-X-Name-Last: Xie
Author-Name: Sebastian Kurtek
Author-X-Name-First: Sebastian
Author-X-Name-Last: Kurtek
Author-Name: Karthik Bharath
Author-X-Name-First: Karthik
Author-X-Name-Last: Bharath
Author-Name: Ying Sun
Author-X-Name-First: Ying
Author-X-Name-Last: Sun
Title: A Geometric Approach to Visualization of Variability in Functional Data
Abstract:
We propose a new method for the construction and visualization of boxplot-type displays for functional data. We use a recent functional data analysis framework, based on a representation of functions called square-root slope functions, to decompose observed variation in functional data into three main components: amplitude, phase, and vertical translation. We then construct separate displays for each component, using the geometry and metric of each representation space, based on a novel definition of the median, the two quartiles, and extreme observations. The outlyingness of functional data is a very complex concept. Thus, we propose to identify outliers based on any of the three main components after decomposition. We provide a variety of visualization tools for the proposed boxplot-type displays including surface plots. We evaluate the proposed method using extensive simulations and then focus our attention on three real data applications including exploratory data analysis of sea surface temperature functions, electrocardiogram functions, and growth curves. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 979-993
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1256813
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1256813
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:979-993
Template-Type: ReDIF-Article 1.0
Author-Name: Jingfei Zhang
Author-X-Name-First: Jingfei
Author-X-Name-Last: Zhang
Author-Name: Jiguo Cao
Author-X-Name-First: Jiguo
Author-X-Name-Last: Cao
Title: Finding Common Modules in a Time-Varying Network with Application to the Gene Regulation Network
Abstract:
Finding functional modules in gene regulation networks is an important task in systems biology. Many methods have been proposed for finding communities in static networks; however, the application of such methods is limited due to the dynamic nature of gene regulation networks. In this article, we first propose a statistical framework for detecting common modules in the Drosophila melanogaster time-varying gene regulation network. We then develop both a significance test and a robustness test for the identified modular structure. We apply an enrichment analysis to our community findings, which reveals interesting results. Moreover, we investigate the consistency property of our proposed method under a time-varying stochastic block model framework with a temporal correlation structure. Although we focus on gene regulation networks in our work, our method is general and can be applied to other time-varying networks. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 994-1008
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1260465
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1260465
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:994-1008
Template-Type: ReDIF-Article 1.0
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Author-Name: Dan Shen
Author-X-Name-First: Dan
Author-X-Name-Last: Shen
Author-Name: Xuewei Peng
Author-X-Name-First: Xuewei
Author-X-Name-Last: Peng
Author-Name: Leo Yufeng Liu
Author-X-Name-First: Leo Yufeng
Author-X-Name-Last: Liu
Title: MWPCR: Multiscale Weighted Principal Component Regression for High-Dimensional Prediction
Abstract:
We propose a multiscale weighted principal component regression (MWPCR) framework for the use of high-dimensional features with strong spatial structure (e.g., smoothness and correlation) to predict an outcome variable, such as disease status. This development is motivated by identifying imaging biomarkers that could potentially aid detection, diagnosis, assessment of prognosis, prediction of response to treatment, and monitoring of disease status, among many others. The MWPCR can be regarded as a novel integration of principal components analysis (PCA), kernel methods, and regression models. In MWPCR, we introduce various weight matrices to prewhiten high-dimensional feature vectors, perform matrix decomposition for both dimension reduction and feature extraction, and build a prediction model by using the extracted features. Examples of such weight matrices include an importance score weight matrix for the selection of individual features at each location and a spatial weight matrix for the incorporation of the spatial pattern of feature vectors. We integrate the importance score weights with the spatial weights to recover the low-dimensional structure of high-dimensional features. We demonstrate the utility of our methods through extensive simulations and real data analyses of the Alzheimer’s disease neuroimaging initiative (ADNI) dataset. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1009-1021
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1261710
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1261710
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1009-1021
Template-Type: ReDIF-Article 1.0
Author-Name: Tao Wang
Author-X-Name-First: Tao
Author-X-Name-Last: Wang
Author-Name: Hongyu Zhao
Author-X-Name-First: Hongyu
Author-X-Name-Last: Zhao
Title: Constructing Predictive Microbial Signatures at Multiple Taxonomic Levels
Abstract:
Recent advances in DNA sequencing technology have enabled rapid advances in our understanding of the contribution of the human microbiome to many aspects of normal human physiology and disease. A major goal of human microbiome studies is the identification of important groups of microbes that are predictive of host phenotypes. However, the large number of bacterial taxa and the compositional nature of the data make this goal difficult to achieve using traditional approaches. Furthermore, the microbiome data are structured in the sense that bacterial taxa are not independent of one another and are related evolutionarily by a phylogenetic tree. To deal with these challenges, we introduce the concept of variable fusion for high-dimensional compositional data and propose a novel tree-guided variable fusion method. Our method is based on the linear regression model with tree-guided penalty functions. It incorporates the tree information node-by-node and is capable of building predictive models comprised of bacterial taxa at different taxonomic levels. A gut microbiome data analysis and simulations are presented to illustrate the good performance of the proposed method. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1022-1031
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1270213
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270213
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1022-1031
Template-Type: ReDIF-Article 1.0
Author-Name: Sihai Dave Zhao
Author-X-Name-First: Sihai Dave
Author-X-Name-Last: Zhao
Author-Name: T. Tony Cai
Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai
Author-Name: Thomas P. Cappola
Author-X-Name-First: Thomas P.
Author-X-Name-Last: Cappola
Author-Name: Kenneth B. Margulies
Author-X-Name-First: Kenneth B.
Author-X-Name-Last: Margulies
Author-Name: Hongzhe Li
Author-X-Name-First: Hongzhe
Author-X-Name-Last: Li
Title: Sparse Simultaneous Signal Detection for Identifying Genetically Controlled Disease Genes
Abstract:
Genome-wide association studies (GWAS) and differential expression analyses have had limited success in finding genes that cause complex diseases such as heart failure (HF), a leading cause of death in the United States. This article proposes a new statistical approach that integrates GWAS and expression quantitative trait loci (eQTL) data to identify important HF genes. For such genes, genetic variations that perturb their expression are also likely to influence disease risk. The proposed method thus tests for the presence of simultaneous signals: SNPs that are associated with the gene’s expression as well as with disease. An analytic expression for the p-value is obtained, and the method is shown to be asymptotically adaptively optimal under certain conditions. It also allows the GWAS and eQTL data to be collected from different groups of subjects, enabling investigators to integrate public resources with their own data. Simulation experiments show that it can be more powerful than standard approaches and also robust to linkage disequilibrium between variants. The method is applied to an extensive analysis of HF genomics and identifies several genes with biological evidence for being functionally relevant in the etiology of HF. It is implemented in the R package ssa. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1032-1046
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1270825
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1270825
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1032-1046
Template-Type: ReDIF-Article 1.0
Author-Name: E. I. George
Author-X-Name-First: E. I.
Author-X-Name-Last: George
Author-Name: V. Ročková
Author-X-Name-First: V.
Author-X-Name-Last: Ročková
Author-Name: P. R. Rosenbaum
Author-X-Name-First: P. R.
Author-X-Name-Last: Rosenbaum
Author-Name: V. A. Satopää
Author-X-Name-First: V. A.
Author-X-Name-Last: Satopää
Author-Name: J. H. Silber
Author-X-Name-First: J. H.
Author-X-Name-Last: Silber
Title: Mortality Rate Estimation and Standardization for Public Reporting: Medicare’s Hospital Compare
Abstract:
Bayesian models are increasingly fit to large administrative datasets and then used to make individualized recommendations. In particular, Medicare’s Hospital Compare webpage provides information to patients about specific hospital mortality rates for a heart attack or acute myocardial infarction (AMI). Hospital Compare’s current recommendations are based on a random-effects logit model with a random hospital indicator and patient risk factors. Except for the largest hospitals, these individual recommendations or predictions are not checkable against data, because data from smaller hospitals are too limited to provide a meaningful check. Before individualized Bayesian recommendations, people derived general advice from empirical studies of many hospitals, for example, prefer hospitals of Type 1 to Type 2 because the risk is lower at Type 1 hospitals. Here, we calibrate these Bayesian recommendation systems by checking, out of sample, whether their predictions aggregate to give correct general advice derived from another sample. This process of calibrating individualized predictions against general empirical advice leads to substantial revisions in the Hospital Compare model for AMI mortality. To make appropriately calibrated predictions, our revised models incorporate information about hospital volume, nursing staff, medical residents, and the hospital’s ability to perform cardiovascular procedures. For the ultimate purpose of comparisons, hospital mortality rates must be standardized to adjust for patient mix variation across hospitals. We find that indirect standardization, as currently used by Hospital Compare, fails to adequately control for differences in patient risk factors and systematically underestimates mortality rates at the low volume hospitals. To provide good control and correctly calibrated rates, we propose direct standardization instead. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 933-947
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1276021
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1276021
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:933-947
Template-Type: ReDIF-Article 1.0
Author-Name: Wesley Tansey
Author-X-Name-First: Wesley
Author-X-Name-Last: Tansey
Author-Name: Alex Athey
Author-X-Name-First: Alex
Author-X-Name-Last: Athey
Author-Name: Alex Reinhart
Author-X-Name-First: Alex
Author-X-Name-Last: Reinhart
Author-Name: James G. Scott
Author-X-Name-First: James G.
Author-X-Name-Last: Scott
Title: Multiscale Spatial Density Smoothing: An Application to Large-Scale Radiological Survey and Anomaly Detection
Abstract:
We consider the problem of estimating a spatially varying density function, motivated by problems that arise in large-scale radiological survey and anomaly detection. In this context, the density functions to be estimated are the background gamma-ray energy spectra at sites spread across a large geographical area, such as nuclear production and waste-storage sites, military bases, medical facilities, university campuses, or the downtown of a city. Several challenges combine to make this a difficult problem. First, the spectral density at any given spatial location may have both smooth and nonsmooth features. Second, the spatial correlation in these density functions is neither stationary nor locally isotropic. Finally, at some spatial locations, there are very little data. We present a method called multiscale spatial density smoothing that successfully addresses these challenges. The method is based on recursive dyadic partition of the sample space, and therefore shares much in common with other multiscale methods, such as wavelets and Pólya-tree priors. We describe an efficient algorithm for finding a maximum a posteriori (MAP) estimate that leverages recent advances in convex optimization for nonsmooth functions. We apply multiscale spatial density smoothing to real data collected on the background gamma-ray spectra at locations across a large university campus. The method exhibits state-of-the-art performance for spatial smoothing in density estimation, and it leads to substantial improvements in power when used in conjunction with existing methods for detecting the kinds of radiological anomalies that may have important consequences for public health and safety.
Journal: Journal of the American Statistical Association
Pages: 1047-1063
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2016.1276461
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1276461
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1047-1063
Template-Type: ReDIF-Article 1.0
Author-Name: Eric D. Schoen
Author-X-Name-First: Eric D.
Author-X-Name-Last: Schoen
Author-Name: Nha Vo-Thanh
Author-X-Name-First: Nha
Author-X-Name-Last: Vo-Thanh
Author-Name: Peter Goos
Author-X-Name-First: Peter
Author-X-Name-Last: Goos
Title: Two-Level Orthogonal Screening Designs With 24, 28, 32, and 36 Runs
Abstract:
The potential of two-level orthogonal designs to fit models with main effects and two-factor interaction effects is commonly assessed through the correlation between contrast vectors involving these effects. We study the complete catalog of nonisomorphic orthogonal two-level 24-run designs involving 3–23 factors and we identify the best few designs in terms of these correlations. By modifying an existing enumeration algorithm, we identify the best few 28-run designs involving 3–14 factors and the best few 36-run designs in 3–18 factors as well. Based on a complete catalog of 7570 designs with 28 runs and 27 factors, we also seek good 28-run designs with more than 14 factors. Finally, starting from a unique 31-factor design in 32 runs that minimizes the maximum correlation among the contrast vectors for main effects and two-factor interactions, we obtain 32-run designs that have low values for this correlation. To demonstrate the added value of our work, we provide a detailed comparison of our designs to the alternatives available in the literature. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1354-1369
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2017.1279547
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1279547
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1354-1369
Template-Type: ReDIF-Article 1.0
Author-Name: Blakeley B. McShane
Author-X-Name-First: Blakeley B.
Author-X-Name-Last: McShane
Author-Name: David Gal
Author-X-Name-First: David
Author-X-Name-Last: Gal
Title: Statistical Significance and the Dichotomization of Evidence
Abstract:
In light of recent concerns about reproducibility and replicability, the ASA issued a Statement on Statistical Significance and p-values aimed at those who are not primarily statisticians. While the ASA Statement notes that statistical significance and p-values are “commonly misused and misinterpreted,” it does not discuss and document broader implications of these errors for the interpretation of evidence. In this article, we review research on how applied researchers who are not primarily statisticians misuse and misinterpret p-values in practice and how this can lead to errors in the interpretation of evidence. We also present new data showing, perhaps surprisingly, that researchers who are primarily statisticians are also prone to misuse and misinterpret p-values thus resulting in similar errors. In particular, we show that statisticians tend to interpret evidence dichotomously based on whether or not a p-value crosses the conventional 0.05 threshold for statistical significance. We discuss implications and offer recommendations.
Journal: Journal of the American Statistical Association
Pages: 885-895
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2017.1289846
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1289846
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:885-895
Template-Type: ReDIF-Article 1.0
Author-Name: Alfredo Farjat
Author-X-Name-First: Alfredo
Author-X-Name-Last: Farjat
Author-Name: Brian J. Reich
Author-X-Name-First: Brian J.
Author-X-Name-Last: Reich
Author-Name: Joseph Guinness
Author-X-Name-First: Joseph
Author-X-Name-Last: Guinness
Author-Name: Ross Whetten
Author-X-Name-First: Ross
Author-X-Name-Last: Whetten
Author-Name: Steven McKeand
Author-X-Name-First: Steven
Author-X-Name-Last: McKeand
Author-Name: Fikret Isik
Author-X-Name-First: Fikret
Author-X-Name-Last: Isik
Title: Optimal Seed Deployment Under Climate Change Using Spatial Models: Application to Loblolly Pine in the Southeastern US
Abstract:
Provenance tests are a common tool in forestry designed to identify superior genotypes for planting at specific locations. The trials are replicated experiments established with seed from parent trees collected from different regions and grown at several locations. In this work, a Bayesian spatial approach is developed for modeling the expected relative performance of seed sources using climate variables as predictors associated with the origin of seed source and the planting site. The proposed modeling technique accounts for the spatial dependence in the data and introduces a separable Matérn covariance structure that provides a flexible means to estimate effects associated with the origin and planting site locations. The statistical model was used to develop a quantitative tool for seed deployment aimed to identify the location of superior performing seed sources that could be suitable for a specific planting site under a given climate scenario. Cross-validation results indicate that the proposed spatial models provide superior predictive ability compared to multiple linear regression methods in unobserved locations. The general trend of performance predictions based on future climate scenarios suggests an optimal assisted migration of loblolly pine seed sources from southern and warmer regions to northern and colder areas in the southern USA. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 909-920
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2017.1292179
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1292179
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:909-920
Template-Type: ReDIF-Article 1.0
Author-Name: Andrew Gelman
Author-X-Name-First: Andrew
Author-X-Name-Last: Gelman
Author-Name: John Carlin
Author-X-Name-First: John
Author-X-Name-Last: Carlin
Title: Some Natural Solutions to the p-Value Communication Problem—and Why They Won’t Work
Journal: Journal of the American Statistical Association
Pages: 899-901
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2017.1311263
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1311263
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:899-901
Template-Type: ReDIF-Article 1.0
Author-Name: William M. Briggs
Author-X-Name-First: William M.
Author-X-Name-Last: Briggs
Title: The Substitute for p-Values
Abstract:
If it was not obvious before, after reading McShane and Gal, the conclusion is that p-values should be proscribed. There are no good uses for them; indeed, every use either violates frequentist theory, is fallacious, or is based on a misunderstanding. A replacement for p-values is suggested, based on predictive models.
Journal: Journal of the American Statistical Association
Pages: 897-898
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2017.1311264
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1311264
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:897-898
Template-Type: ReDIF-Article 1.0
Author-Name: Eric B. Laber
Author-X-Name-First: Eric B.
Author-X-Name-Last: Laber
Author-Name: Kerby Shedden
Author-X-Name-First: Kerby
Author-X-Name-Last: Shedden
Title: Statistical Significance and the Dichotomization of Evidence: The Relevance of the ASA Statement on Statistical Significance and p-Values for Statisticians
Journal: Journal of the American Statistical Association
Pages: 902-904
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2017.1311265
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1311265
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:902-904
Template-Type: ReDIF-Article 1.0
Author-Name: Donald Berry
Author-X-Name-First: Donald
Author-X-Name-Last: Berry
Title: A p-Value to Die For
Journal: Journal of the American Statistical Association
Pages: 895-897
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2017.1316279
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1316279
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:895-897
Template-Type: ReDIF-Article 1.0
Author-Name: Blakeley B. McShane
Author-X-Name-First: Blakeley B.
Author-X-Name-Last: McShane
Author-Name: David Gal
Author-X-Name-First: David
Author-X-Name-Last: Gal
Title: Rejoinder: Statistical Significance and the Dichotomization of Evidence
Journal: Journal of the American Statistical Association
Pages: 904-908
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2017.1323642
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1323642
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:904-908
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Book Reviews
Journal: Journal of the American Statistical Association
Pages: 1370-1379
Issue: 519
Volume: 112
Year: 2017
Month: 7
X-DOI: 10.1080/01621459.2017.1367179
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1367179
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:519:p:1370-1379
Template-Type: ReDIF-Article 1.0
Author-Name: P. Richard Hahn
Author-X-Name-First: P. Richard
Author-X-Name-Last: Hahn
Author-Name: Ryan Martin
Author-X-Name-First: Ryan
Author-X-Name-Last: Martin
Author-Name: Stephen G. Walker
Author-X-Name-First: Stephen G.
Author-X-Name-Last: Walker
Title: On Recursive Bayesian Predictive Distributions
Abstract:
A Bayesian framework is attractive in the context of prediction, but a fast recursive update of the predictive distribution has apparently been out of reach, in part because Monte Carlo methods are generally used to compute the predictive. This article shows that online Bayesian prediction is possible by characterizing the Bayesian predictive update in terms of a bivariate copula, making it unnecessary to pass through the posterior to update the predictive. In standard models, the Bayesian predictive update corresponds to familiar choices of copula but, in nonparametric problems, the appropriate copula may not have a closed-form expression. In such cases, our new perspective suggests a fast recursive approximation to the predictive density, in the spirit of Newton’s predictive recursion algorithm, but without requiring evaluation of normalizing constants. Consistency of the new algorithm is shown, and numerical examples demonstrate its quality performance in finite samples compared to fully Bayesian and kernel methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1085-1093
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1304219
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1304219
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1085-1093
Template-Type: ReDIF-Article 1.0
Author-Name: Audrey Boruvka
Author-X-Name-First: Audrey
Author-X-Name-Last: Boruvka
Author-Name: Daniel Almirall
Author-X-Name-First: Daniel
Author-X-Name-Last: Almirall
Author-Name: Katie Witkiewitz
Author-X-Name-First: Katie
Author-X-Name-Last: Witkiewitz
Author-Name: Susan A. Murphy
Author-X-Name-First: Susan A.
Author-X-Name-Last: Murphy
Title: Assessing Time-Varying Causal Effect Moderation in Mobile Health
Abstract:
In mobile health interventions aimed at behavior change and maintenance, treatments are provided in real time to manage current or impending high-risk situations or promote healthy behaviors in near real time. Currently there is great scientific interest in developing data analysis approaches to guide the development of mobile interventions. In particular data from mobile health studies might be used to examine effect moderators—individual characteristics, time-varying context, or past treatment response that moderate the effect of current treatment on a subsequent response. This article introduces a formal definition for moderated effects in terms of potential outcomes, a definition that is particularly suited to mobile interventions, where treatment occasions are numerous, individuals are not always available for treatment, and potential moderators might be influenced by past treatment. Methods for estimating moderated effects are developed and compared. The proposed approach is illustrated using BASICS-Mobile, a smartphone-based intervention designed to curb heavy drinking and smoking among college students. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1112-1121
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1305274
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1305274
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1112-1121
Template-Type: ReDIF-Article 1.0
Author-Name: Ashkan Ertefaie
Author-X-Name-First: Ashkan
Author-X-Name-Last: Ertefaie
Author-Name: Dylan S. Small
Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small
Author-Name: Paul R. Rosenbaum
Author-X-Name-First: Paul R.
Author-X-Name-Last: Rosenbaum
Title: Quantitative Evaluation of the Trade-Off of Strengthened Instruments and Sample Size in Observational Studies
Abstract:
Weak instruments produce causal inferences that are sensitive to small failures of the assumptions underlying an instrumental variable, so strong instruments are preferred. The possibility of strengthening an instrument at the price of a reduced sample size has been proposed in the statistical literature and used in the medical literature, but there has not been a theoretical study of the trade-off of instrument strength and sample size. This trade-off and related questions are examined using the Bahadur efficiency of a test or a sensitivity analysis. A moderate increase in instrument strength is worth more than an enormous increase in sample size. This is true with a flawless instrument, and the difference is more pronounced when allowance is made for small unmeasured biases in the instrument. A new method of strengthening an instrument is proposed: it discards half the sample to learn empirically where the instrument is strong, then discards part of the remaining half to avoid areas where the instrument is weak; however, the gains in instrument strength can more than compensate for the loss of sample size. The example is drawn from a study of the effectiveness of high-level neonatal intensive care units in saving the lives of premature infants.
Journal: Journal of the American Statistical Association
Pages: 1122-1134
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1305275
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1305275
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1122-1134
Template-Type: ReDIF-Article 1.0
Author-Name: Chao Wang
Author-X-Name-First: Chao
Author-X-Name-Last: Wang
Author-Name: Kung-Sik Chan
Author-X-Name-First: Kung-Sik
Author-X-Name-Last: Chan
Title: Quasi-Likelihood Estimation of a Censored Autoregressive Model With Exogenous Variables
Abstract:
Maximum likelihood estimation of a censored autoregressive model with exogenous variables (CARX) requires computing the conditional likelihood of blocks of data of variable dimensions. As the random block dimension generally increases with the censoring rate, maximum likelihood estimation becomes quickly numerically intractable with increasing censoring. We introduce a new estimation approach using the complete-incomplete data framework with the complete data comprising the observations were there no censoring. We introduce a system of unbiased estimating equations motivated by the complete-data score vector, for estimating a CARX model. The proposed quasi-likelihood method reduces to maximum likelihood estimation when there is no censoring, and it is computationally efficient. We derive the consistency and asymptotic normality of the quasi-likelihood estimator, under mild regularity conditions. We illustrate the efficacy of the proposed method by simulations and a real application on phosphorus concentration in river water.
Journal: Journal of the American Statistical Association
Pages: 1135-1145
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1307115
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1307115
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1135-1145
Template-Type: ReDIF-Article 1.0
Author-Name: Jing Lei
Author-X-Name-First: Jing
Author-X-Name-Last: Lei
Author-Name: Max G’Sell
Author-X-Name-First: Max
Author-X-Name-Last: G’Sell
Author-Name: Alessandro Rinaldo
Author-X-Name-First: Alessandro
Author-X-Name-Last: Rinaldo
Author-Name: Ryan J. Tibshirani
Author-X-Name-First: Ryan J.
Author-X-Name-Last: Tibshirani
Author-Name: Larry Wasserman
Author-X-Name-First: Larry
Author-X-Name-Last: Wasserman
Title: Distribution-Free Predictive Inference for Regression
Abstract:
We develop a general framework for distribution-free predictive inference in regression, using conformal inference. The proposed methodology allows for the construction of a prediction band for the response variable using any estimator of the regression function. The resulting prediction band preserves the consistency properties of the original estimator under standard assumptions, while guaranteeing finite-sample marginal coverage even when these assumptions do not hold. We analyze and compare, both empirically and theoretically, the two major variants of our conformal framework: full conformal inference and split conformal inference, along with a related jackknife method. These methods offer different tradeoffs between statistical accuracy (length of resulting prediction intervals) and computational efficiency. As extensions, we develop a method for constructing valid in-sample prediction intervals called rank-one-out conformal inference, which has essentially the same computational efficiency as split conformal inference. We also describe an extension of our procedures for producing prediction bands with locally varying length, to adapt to heteroscedasticity in the data. Finally, we propose a model-free notion of variable importance, called leave-one-covariate-out or LOCO inference. Accompanying this article is an R package conformalInference that implements all of the proposals we have introduced. In the spirit of reproducibility, all of our empirical results can also be easily (re)generated using this package.
Journal: Journal of the American Statistical Association
Pages: 1094-1111
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1307116
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1307116
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1094-1111
Template-Type: ReDIF-Article 1.0
Author-Name: Hao Chen
Author-X-Name-First: Hao
Author-X-Name-Last: Chen
Author-Name: Xu Chen
Author-X-Name-First: Xu
Author-X-Name-Last: Chen
Author-Name: Yi Su
Author-X-Name-First: Yi
Author-X-Name-Last: Su
Title: A Weighted Edge-Count Two-Sample Test for Multivariate and Object Data
Abstract:
Two-sample tests for multivariate data and non-Euclidean data are widely used in many fields. Parametric tests are mostly restricted to certain types of data that meet the assumptions of the parametric models. In this article, we study a nonparametric testing procedure that uses graphs representing the similarity among observations. It can be applied to any data type as long as an informative similarity measure on the sample space can be defined. The classic test based on a similarity graph has a problem when the two sample sizes are different. We solve the problem by applying appropriate weights to different components of the classic test statistic. The new test exhibits substantial power gains in simulation studies. Its asymptotic permutation null distribution is derived and shown to work well under finite samples, facilitating its application to large datasets. The new test is illustrated through an analysis on a real dataset of network data.
Journal: Journal of the American Statistical Association
Pages: 1146-1155
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1307757
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1307757
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1146-1155
Template-Type: ReDIF-Article 1.0
Author-Name: Wesley Tansey
Author-X-Name-First: Wesley
Author-X-Name-Last: Tansey
Author-Name: Oluwasanmi Koyejo
Author-X-Name-First: Oluwasanmi
Author-X-Name-Last: Koyejo
Author-Name: Russell A. Poldrack
Author-X-Name-First: Russell A.
Author-X-Name-Last: Poldrack
Author-Name: James G. Scott
Author-X-Name-First: James G.
Author-X-Name-Last: Scott
Title: False Discovery Rate Smoothing
Abstract:
We present false discovery rate (FDR) smoothing, an empirical-Bayes method for exploiting spatial structure in large multiple-testing problems. FDR smoothing automatically finds spatially localized regions of significant test statistics. It then relaxes the threshold of statistical significance within these regions, and tightens it elsewhere, in a manner that controls the overall false discovery rate at a given level. This results in increased power and cleaner spatial separation of signals from noise. The approach requires solving a nonstandard high-dimensional optimization problem, for which an efficient augmented-Lagrangian algorithm is presented. In simulation studies, FDR smoothing exhibits state-of-the-art performance at modest computational cost. In particular, it is shown to be far more robust than existing methods for spatially dependent multiple testing. We also apply the method to a dataset from an fMRI experiment on spatial working memory, where it detects patterns that are much more biologically plausible than those detected by standard FDR-controlling methods. All code for FDR smoothing is publicly available in Python and R (https://github.com/tansey/smoothfdr). Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1156-1171
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1319838
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1319838
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1156-1171
Template-Type: ReDIF-Article 1.0
Author-Name: Stefan Wager
Author-X-Name-First: Stefan
Author-X-Name-Last: Wager
Author-Name: Susan Athey
Author-X-Name-First: Susan
Author-X-Name-Last: Athey
Title: Estimation and Inference of Heterogeneous Treatment Effects using Random Forests
Abstract:
Many scientific and engineering challenges—ranging from personalized medicine to customized marketing recommendations—require an understanding of treatment effect heterogeneity. In this article, we develop a nonparametric causal forest for estimating heterogeneous treatment effects that extends Breiman’s widely used random forest algorithm. In the potential outcomes framework with unconfoundedness, we show that causal forests are pointwise consistent for the true treatment effect and have an asymptotically Gaussian and centered sampling distribution. We also discuss a practical method for constructing asymptotic confidence intervals for the true treatment effect that are centered at the causal forest estimates. Our theoretical results rely on a generic Gaussian theory for a large family of random forest algorithms. To our knowledge, this is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference. In experiments, we find causal forests to be substantially more powerful than classical methods based on nearest-neighbor matching, especially in the presence of irrelevant covariates.
Journal: Journal of the American Statistical Association
Pages: 1228-1242
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1319839
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1319839
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1228-1242
Template-Type: ReDIF-Article 1.0
Author-Name: Sokbae Lee
Author-X-Name-First: Sokbae
Author-X-Name-Last: Lee
Author-Name: Yuan Liao
Author-X-Name-First: Yuan
Author-X-Name-Last: Liao
Author-Name: Myung Hwan Seo
Author-X-Name-First: Myung Hwan
Author-X-Name-Last: Seo
Author-Name: Youngki Shin
Author-X-Name-First: Youngki
Author-X-Name-Last: Shin
Title: Oracle Estimation of a Change Point in High-Dimensional Quantile Regression
Abstract:
In this article, we consider a high-dimensional quantile regression model where the sparsity structure may differ between two sub-populations. We develop ℓ1-penalized estimators of both regression coefficients and the threshold parameter. Our penalized estimators not only select covariates but also discriminate between a model with homogeneous sparsity and a model with a change point. As a result, it is not necessary to know or pretest whether the change point is present, or where it occurs. Our estimator of the change point achieves an oracle property in the sense that its asymptotic distribution is the same as if the unknown active sets of regression coefficients were known. Importantly, we establish this oracle property without a perfect covariate selection, thereby avoiding the need for the minimum level condition on the signals of active covariates. Dealing with high-dimensional quantile regression with an unknown change point calls for a new proof technique since the quantile loss function is nonsmooth and furthermore the corresponding objective function is nonconvex with respect to the change point. The technique developed in this article is applicable to a general M-estimation framework with a change point, which may be of independent interest. The proposed methods are then illustrated via Monte Carlo experiments and an application to tipping in the dynamics of racial segregation. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1184-1194
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1319840
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1319840
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1184-1194
Template-Type: ReDIF-Article 1.0
Author-Name: Quentin Clairon
Author-X-Name-First: Quentin
Author-X-Name-Last: Clairon
Author-Name: Nicolas J.-B. Brunel
Author-X-Name-First: Nicolas J.-B.
Author-X-Name-Last: Brunel
Title: Optimal Control and Additive Perturbations Help in Estimating Ill-Posed and Uncertain Dynamical Systems
Abstract:
Ordinary differential equations (ODE) are routinely calibrated on real data for estimating unknown parameters or for reverse-engineering. Nevertheless, standard statistical techniques can give disappointing results because of the complex relationship between parameters and states, which makes the corresponding estimation problem ill-posed. Moreover, ODE are mechanistic models that are prone to modeling errors, whose influences on inference are often neglected during statistical analysis. We propose a regularized estimation framework, called Tracking, which consists in adding a perturbation (L2 function) to the original ODE. This perturbation facilitates data fitting and also represents possible model misspecifications, so that parameter estimation is done by solving a trade-off between data fidelity and model fidelity. We show that the underlying optimization problem is an optimal control problem that can be solved by the Pontryagin maximum principle for general nonlinear and partially observed ODE. The same methodology can be used for the joint estimation of finite and time-varying parameters. We show, in the case of a well-specified parametric model, that our estimator is consistent and reaches the root-n rate. In addition, numerical experiments considering various sources of model misspecification show that Tracking still furnishes accurate estimates. Finally, we consider semiparametric estimation on both simulated data and on a real data example. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1195-1209
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1319841
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1319841
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1195-1209
Template-Type: ReDIF-Article 1.0
Author-Name: José R. Berrendero
Author-X-Name-First: José R.
Author-X-Name-Last: Berrendero
Author-Name: Antonio Cuevas
Author-X-Name-First: Antonio
Author-X-Name-Last: Cuevas
Author-Name: José L. Torrecilla
Author-X-Name-First: José L.
Author-X-Name-Last: Torrecilla
Title: On the Use of Reproducing Kernel Hilbert Spaces in Functional Classification
Abstract:
The Hájek–Feldman dichotomy establishes that two Gaussian measures are either mutually absolutely continuous with respect to each other (and hence there is a Radon–Nikodym density for each measure with respect to the other one) or mutually singular. Unlike the case of finite-dimensional Gaussian measures, there are nontrivial examples of both situations when dealing with Gaussian stochastic processes. This article provides: (a) Explicit expressions for the optimal (Bayes) rule and the minimal classification error probability in several relevant problems of supervised binary classification of mutually absolutely continuous Gaussian processes. The approach relies on some classical results in the theory of reproducing kernel Hilbert spaces (RKHS). (b) An interpretation, in terms of mutual singularity, for the so-called “near perfect classification” phenomenon. We show that the asymptotically optimal rule proposed by these authors can be identified with the sequence of optimal rules for an approximating sequence of classification problems in the absolutely continuous case. (c) As an application, we discuss a natural variable selection method, which essentially consists of taking the original functional data X(t), t ∈ [0, 1] to a d-dimensional marginal (X(t1), …, X(td)), which is chosen to minimize the classification error of the corresponding Fisher’s linear rule. We give precise conditions under which this discrimination method achieves the minimal classification error of the original functional problem. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1210-1218
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1320287
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1320287
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1210-1218
Template-Type: ReDIF-Article 1.0
Author-Name: Eric B. Laber
Author-X-Name-First: Eric B.
Author-X-Name-Last: Laber
Author-Name: Ana-Maria Staicu
Author-X-Name-First: Ana-Maria
Author-X-Name-Last: Staicu
Title: Functional Feature Construction for Individualized Treatment Regimes
Abstract:
Evidence-based personalized medicine formalizes treatment selection as an individualized treatment regime that maps up-to-date patient information into the space of possible treatments. Available patient information may include static features such as race, gender, family history, genetic and genomic information, as well as longitudinal information including the emergence of comorbidities, waxing and waning of symptoms, side-effect burden, and adherence. Dynamic information measured at multiple time points before treatment assignment should be included as input to the treatment regime. However, subject longitudinal measurements are typically sparse, irregularly spaced, noisy, and vary in number across subjects. Existing estimators for treatment regimes require equal information be measured on each subject and thus standard practice is to summarize longitudinal subject information into a scalar, ad hoc summary during data preprocessing. This reduction of the longitudinal information to a scalar feature precedes estimation of a treatment regime and is therefore not informed by subject outcomes, treatments, or covariates. Furthermore, we show that this reduction requires more stringent causal assumptions for consistent estimation than are necessary. We propose a data-driven method for constructing maximally prescriptive yet interpretable features that can be used with standard methods for estimating optimal treatment regimes. In our proposed framework, we treat the subject longitudinal information as a realization of a stochastic process observed with error at discrete time points. Functionals of this latent process are then combined with outcome models to estimate an optimal treatment regime. The proposed methodology requires weaker causal assumptions than Q-learning with an ad hoc scalar summary and is consistent for the optimal treatment regime. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1219-1227
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1321545
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1321545
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1219-1227
Template-Type: ReDIF-Article 1.0
Author-Name: Lorenzo Trapani
Author-X-Name-First: Lorenzo
Author-X-Name-Last: Trapani
Title: A Randomized Sequential Procedure to Determine the Number of Factors
Abstract:
This article proposes a procedure to estimate the number of common factors k in a static approximate factor model. The building block of the analysis is the fact that the first k eigenvalues of the covariance matrix of the data diverge, while the others stay bounded. On the grounds of this, we propose a test for the null that the ith eigenvalue diverges, using a randomized test statistic based directly on the estimated eigenvalue. The test only requires minimal assumptions on the data, and no assumptions are required on factors, loadings or idiosyncratic errors. The randomized tests are then employed in a sequential procedure to determine k. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1341-1349
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1328359
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1328359
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1341-1349
Template-Type: ReDIF-Article 1.0
Author-Name: Matias D. Cattaneo
Author-X-Name-First: Matias D.
Author-X-Name-Last: Cattaneo
Author-Name: Michael Jansson
Author-X-Name-First: Michael
Author-X-Name-Last: Jansson
Author-Name: Whitney K. Newey
Author-X-Name-First: Whitney K.
Author-X-Name-Last: Newey
Title: Inference in Linear Regression Models with Many Covariates and Heteroscedasticity
Abstract:
The linear regression model is widely used in empirical work in economics, statistics, and many other disciplines. Researchers often include many covariates in their linear model specification in an attempt to control for confounders. We give inference methods that allow for many covariates and heteroscedasticity. Our results are obtained using high-dimensional approximations, where the number of included covariates is allowed to grow as fast as the sample size. We find that all of the usual versions of Eicker–White heteroscedasticity consistent standard error estimators for linear models are inconsistent under these asymptotics. We then propose a new heteroscedasticity consistent standard error formula that is fully automatic and robust to both (conditional) heteroscedasticity of unknown form and the inclusion of possibly many covariates. We apply our findings to three settings: parametric linear models with many covariates, linear panel models with many fixed effects, and semiparametric semi-linear models with many technical regressors. Simulation evidence consistent with our theoretical results is provided, and the proposed methods are also illustrated with an empirical application. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1350-1361
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1328360
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1328360
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1350-1361
Template-Type: ReDIF-Article 1.0
Author-Name: Quan Zhou
Author-X-Name-First: Quan
Author-X-Name-Last: Zhou
Author-Name: Yongtao Guan
Author-X-Name-First: Yongtao
Author-X-Name-Last: Guan
Title: On the Null Distribution of Bayes Factors in Linear Regression
Abstract:
We show that under the null, 2 log(Bayes factor) is asymptotically distributed as a weighted sum of chi-squared random variables with a shifted mean. This claim holds for Bayesian multi-linear regression with a family of conjugate priors, namely, the normal-inverse-gamma prior, the g-prior, and the normal prior. Our results have three immediate impacts. First, we can compute analytically a p-value associated with a Bayes factor without the need of permutation. We provide a software package that can evaluate the p-value associated with a Bayes factor efficiently and accurately. Second, the null distribution is illuminating to some intrinsic properties of the Bayes factor, namely, how the Bayes factor quantitatively depends on the prior and the genesis of Bartlett’s paradox. Third, enlightened by the null distribution of the Bayes factor, we formulate a novel scaled Bayes factor that depends less on the prior and is immune to Bartlett’s paradox. When two tests have an identical p-value, the test with a larger power tends to have a larger scaled Bayes factor, a desirable property that is missing for the (unscaled) Bayes factor. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1362-1371
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1328361
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1328361
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1362-1371
Template-Type: ReDIF-Article 1.0
Author-Name: Lan Wang
Author-X-Name-First: Lan
Author-X-Name-Last: Wang
Author-Name: Yu Zhou
Author-X-Name-First: Yu
Author-X-Name-Last: Zhou
Author-Name: Rui Song
Author-X-Name-First: Rui
Author-X-Name-Last: Song
Author-Name: Ben Sherwood
Author-X-Name-First: Ben
Author-X-Name-Last: Sherwood
Title: Quantile-Optimal Treatment Regimes
Abstract:
Finding the optimal treatment regime (or a series of sequential treatment regimes) based on individual characteristics has important applications in areas such as precision medicine, government policies, and active labor market interventions. In the current literature, the optimal treatment regime is usually defined as the one that maximizes the average benefit in the potential population. This article studies a general framework for estimating the quantile-optimal treatment regime, which is of importance in many real-world applications. Given a collection of treatment regimes, we consider robust estimation of the quantile-optimal treatment regime, which does not require the analyst to specify an outcome regression model. We propose an alternative formulation of the estimator as a solution of an optimization problem with an estimated nuisance parameter. This novel representation allows us to investigate the asymptotic theory of the estimated optimal treatment regime using empirical process techniques. We derive theory involving a nonstandard convergence rate and a nonnormal limiting distribution. The same nonstandard convergence rate would also occur if the mean optimality criterion is applied, but this has not been studied. Thus, our results fill an important theoretical gap for a general class of policy search methods in the literature. The article investigates both static and dynamic treatment regimes. In addition, doubly robust estimation and alternative optimality criteria, such as those based on Gini’s mean difference or weighted quantiles, are investigated. Numerical simulations demonstrate the performance of the proposed estimator. A data example from a trial in HIV+ patients is used to illustrate the application. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1243-1254
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1330204
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1330204
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1243-1254
Template-Type: ReDIF-Article 1.0
Author-Name: Pallavi Basu
Author-X-Name-First: Pallavi
Author-X-Name-Last: Basu
Author-Name: T. Tony Cai
Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai
Author-Name: Kiranmoy Das
Author-X-Name-First: Kiranmoy
Author-X-Name-Last: Das
Author-Name: Wenguang Sun
Author-X-Name-First: Wenguang
Author-X-Name-Last: Sun
Title: Weighted False Discovery Rate Control in Large-Scale Multiple Testing
Abstract:
The use of weights provides an effective strategy to incorporate prior domain knowledge in large-scale inference. This article studies weighted multiple testing in a decision-theoretical framework. We develop oracle and data-driven procedures that aim to maximize the expected number of true positives subject to a constraint on the weighted false discovery rate. The asymptotic validity and optimality of the proposed methods are established. The results demonstrate that incorporating informative domain knowledge enhances the interpretability of results and precision of inference. Simulation studies show that the proposed method controls the error rate at the nominal level, and the gain in power over existing methods is substantial in many settings. An application to a genome-wide association study is discussed. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1172-1183
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1336443
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1336443
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1172-1183
Template-Type: ReDIF-Article 1.0
Author-Name: Thomas A. Murray
Author-X-Name-First: Thomas A.
Author-X-Name-Last: Murray
Author-Name: Ying Yuan
Author-X-Name-First: Ying
Author-X-Name-Last: Yuan
Author-Name: Peter F. Thall
Author-X-Name-First: Peter F.
Author-X-Name-Last: Thall
Title: A Bayesian Machine Learning Approach for Optimizing Dynamic Treatment Regimes
Abstract:
Medical therapy often consists of multiple stages, with a treatment chosen by the physician at each stage based on the patient’s history of treatments and clinical outcomes. These decisions can be formalized as a dynamic treatment regime. This article describes a new approach for optimizing dynamic treatment regimes, which bridges the gap between Bayesian inference and existing approaches, like Q-learning. The proposed approach fits a series of Bayesian regression models, one for each stage, in reverse sequential order. Each model uses as a response variable the remaining payoff assuming optimal actions are taken at subsequent stages, and as covariates the current history and relevant actions at that stage. The key difficulty is that the optimal decision rules at subsequent stages are unknown, and even if these decision rules were known the relevant response variables may be counterfactual. However, posterior distributions can be derived from the previously fitted regression models for the optimal decision rules and the counterfactual response variables under a particular set of rules. The proposed approach averages over these posterior distributions when fitting each regression model. An efficient sampling algorithm for estimation is presented, along with simulation studies that compare the proposed approach with Q-learning. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1255-1267
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1340887
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1340887
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1255-1267
Template-Type: ReDIF-Article 1.0
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Donggyu Kim
Author-X-Name-First: Donggyu
Author-X-Name-Last: Kim
Title: Robust High-Dimensional Volatility Matrix Estimation for High-Frequency Factor Model
Abstract:
High-frequency financial data allow us to estimate large volatility matrices with a relatively short time horizon. Many novel statistical methods have been introduced to address large volatility matrix estimation problems from a high-dimensional Itô process with microstructural noise contamination. Their asymptotic theories require sub-Gaussian or some finite high-order moment assumptions for observed log-returns. These assumptions are at odds with the heavy tail phenomenon that is pandemic in financial stock returns, and new procedures are needed to mitigate the influence of heavy tails. In this article, we introduce the Huber loss function with a diverging threshold to develop a robust realized volatility estimation. We show that it has the sub-Gaussian concentration around the volatility with only finite fourth moments of observed log-returns. With the proposed robust estimator as input, we further regularize it by using the principal orthogonal component thresholding (POET) procedure to estimate the large volatility matrix that admits an approximate factor structure. We establish the asymptotic theories for such low-rank plus sparse matrices. A simulation study is conducted to check the finite sample performance of the proposed estimation methods.
Journal: Journal of the American Statistical Association
Pages: 1268-1283
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1340888
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1340888
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1268-1283
Template-Type: ReDIF-Article 1.0
Author-Name: Gongjun Xu
Author-X-Name-First: Gongjun
Author-X-Name-Last: Xu
Author-Name: Zhuoran Shang
Author-X-Name-First: Zhuoran
Author-X-Name-Last: Shang
Title: Identifying Latent Structures in Restricted Latent Class Models
Abstract:
This article focuses on a family of restricted latent structure models with wide applications in psychological and educational assessment, where the model parameters are restricted via a latent structure matrix to reflect prespecified assumptions on the latent attributes. Such a latent matrix is often provided by experts and assumed to be correct upon construction, yet it may be subjective and misspecified. Recognizing this problem, researchers have been developing methods to estimate the matrix from data. However, the fundamental issue of the identifiability of the latent structure matrix has not been addressed until now. The first goal of this article is to establish identifiability conditions that ensure the estimability of the structure matrix. With the theoretical development, the second part of the article proposes a likelihood-based method to estimate the latent structure from the data. Simulation studies show that the proposed method outperforms the existing approaches. We further illustrate the method through a dataset in educational assessment. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1284-1295
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1340889
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1340889
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1284-1295
Template-Type: ReDIF-Article 1.0
Author-Name: Jonas Mueller
Author-X-Name-First: Jonas
Author-X-Name-Last: Mueller
Author-Name: Tommi Jaakkola
Author-X-Name-First: Tommi
Author-X-Name-Last: Jaakkola
Author-Name: David Gifford
Author-X-Name-First: David
Author-X-Name-Last: Gifford
Title: Modeling Persistent Trends in Distributions
Abstract:
We present a nonparametric framework to model a short sequence of probability distributions that vary both due to underlying effects of sequential progression and confounding noise. To distinguish between these two types of variation and estimate the sequential-progression effects, our approach leverages an assumption that these effects follow a persistent trend. This work is motivated by the recent rise of single-cell RNA-sequencing experiments over a brief time course, which aim to identify genes relevant to the progression of a particular biological process across diverse cell populations. While classical statistical tools focus on scalar-response regression or order-agnostic differences between distributions, it is desirable in this setting to consider both the full distributions as well as the structure imposed by their ordering. We introduce a new regression model for ordinal covariates where responses are univariate distributions and the underlying relationship reflects consistent changes in the distributions over increasing levels of the covariate. This concept is formalized as a trend in distributions, which we define as an evolution that is linear under the Wasserstein metric. Implemented via a fast alternating projections algorithm, our method exhibits numerous strengths in simulations and analyses of single-cell gene expression data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1296-1310
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1341412
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1341412
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1296-1310
Template-Type: ReDIF-Article 1.0
Author-Name: Harry Crane
Author-X-Name-First: Harry
Author-X-Name-Last: Crane
Author-Name: Walter Dempsey
Author-X-Name-First: Walter
Author-X-Name-Last: Dempsey
Title: Edge Exchangeable Models for Interaction Networks
Abstract:
Many modern network datasets arise from processes of interactions in a population, such as phone calls, email exchanges, co-authorships, and professional collaborations. In such interaction networks, the edges comprise the fundamental statistical units, making a framework for edge-labeled networks more appropriate for statistical analysis. In this context, we initiate the study of edge exchangeable network models and explore their basic statistical properties. Several theoretical and practical features make edge exchangeable models better suited to many applications in network analysis than more common vertex-centric approaches. In particular, edge exchangeable models allow for sparse structure and power law degree distributions, both of which are widely observed empirical properties that cannot be handled naturally by more conventional approaches. Our discussion culminates in the Hollywood model, which we identify here as the canonical family of edge exchangeable distributions. The Hollywood model is computationally tractable, admits a clear interpretation, exhibits good theoretical properties, and performs reasonably well in estimation and prediction as we demonstrate on real network datasets. As a generalization of the Hollywood model, we further identify the vertex components model as a nonparametric subclass of models with a convenient stick breaking construction.
Journal: Journal of the American Statistical Association
Pages: 1311-1326
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1341413
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1341413
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1311-1326
Template-Type: ReDIF-Article 1.0
Author-Name: Max Sommerfeld
Author-X-Name-First: Max
Author-X-Name-Last: Sommerfeld
Author-Name: Stephan Sain
Author-X-Name-First: Stephan
Author-X-Name-Last: Sain
Author-Name: Armin Schwartzman
Author-X-Name-First: Armin
Author-X-Name-Last: Schwartzman
Title: Confidence Regions for Spatial Excursion Sets From Repeated Random Field Observations, With an Application to Climate
Abstract:
The goal of this article is to give confidence regions for the excursion set of a spatial function above a given threshold from repeated noisy observations on a fine grid of fixed locations. Given an asymptotically Gaussian estimator of the target function, a pair of data-dependent nested excursion sets are constructed that are sub- and super-sets of the true excursion set, respectively, with a desired confidence. Asymptotic coverage probabilities are determined via a multiplier bootstrap method, not requiring Gaussianity of the original data nor stationarity or smoothness of the limiting Gaussian field. The method is used to determine regions in North America where the mean summer and winter temperatures are expected to increase by mid-21st century by more than 2 degrees Celsius.
Journal: Journal of the American Statistical Association
Pages: 1327-1340
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1341838
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1341838
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1327-1340
Template-Type: ReDIF-Article 1.0
Author-Name: Uri Keich
Author-X-Name-First: Uri
Author-X-Name-Last: Keich
Author-Name: William Stafford Noble
Author-X-Name-First: William Stafford
Author-X-Name-Last: Noble
Title: Controlling the FDR in Imperfect Matches to an Incomplete Database
Abstract:
We consider the problem of controlling the false discovery rate (FDR) among discoveries from searching an incomplete database. This problem differs from the classical multiple testing setting because there are two different types of false discoveries: those arising from objects that have no match in the database and those that are incorrectly matched. We show that commonly used FDR controlling procedures are inadequate for this setup, a special case of which is tandem mass spectrum identification. We then derive a novel FDR controlling approach which extensive simulations suggest is unbiased. We also compare its performance with problem-specific as well as general FDR controlling procedures using both simulated and real mass spectrometry data.
Journal: Journal of the American Statistical Association
Pages: 973-982
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1375931
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1375931
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:973-982
Template-Type: ReDIF-Article 1.0
Author-Name: Michele Santacatterina
Author-X-Name-First: Michele
Author-X-Name-Last: Santacatterina
Author-Name: Matteo Bottai
Author-X-Name-First: Matteo
Author-X-Name-Last: Bottai
Title: Optimal Probability Weights for Inference With Constrained Precision
Abstract:
Probability weights are used in many areas of research including complex survey designs, missing data analysis, and adjustment for confounding factors. They are useful analytic tools but can lead to statistical inefficiencies when they contain outlying values. This issue is frequently tackled by replacing large weights with smaller ones or by normalizing them through smoothing functions. While these approaches are practical, they are also prone to yield biased inferences. This article introduces a method for obtaining optimal weights, defined as those with smallest Euclidean distance from target weights among all sets of weights that satisfy a constraint on the variance of the resulting weighted estimator. The optimal weights yield minimum-bias estimators among all estimators with specified precision. The method is based on solving a constrained nonlinear optimization problem whose Lagrange multipliers and objective function can help assess the trade-off between bias and precision of the resulting weighted estimator. The finite-sample performance of the optimally weighted estimator is assessed in a simulation study, and its applicability is illustrated through an analysis of heterogeneity over age of the effect of the timing of treatment initiation on long-term treatment efficacy in patients infected with human immunodeficiency virus in Sweden.
Journal: Journal of the American Statistical Association
Pages: 983-991
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1375932
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1375932
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:983-991
Template-Type: ReDIF-Article 1.0
Author-Name: Sungduk Kim
Author-X-Name-First: Sungduk
Author-X-Name-Last: Kim
Author-Name: Paul S. Albert
Author-X-Name-First: Paul S.
Author-X-Name-Last: Albert
Title: Latent Variable Poisson Models for Assessing the Regularity of Circadian Patterns over Time
Abstract:
Many researchers in biology and medicine have focused on trying to understand biological rhythms and their potential impact on disease. A common biological rhythm is circadian, where the cycle repeats itself every 24 hours. However, a disturbance of the circadian pattern may be indicative of future disease. In this article, we develop new statistical methodology for assessing the degree of disturbance or irregularity in a circadian pattern for count sequences that are observed over time in a population of individuals. We develop a latent variable Poisson modeling approach with both circadian and stochastic short-term trend (autoregressive latent process) components that allow for individual variation in the degree of each component. A parameterization is proposed for modeling covariate dependence on the proportion of these two model components across individuals. In addition, we incorporate covariate dependence in the overall mean, the magnitude of the trend, and the phase-shift of the circadian pattern. Innovative Markov chain Monte Carlo sampling is used to carry out Bayesian posterior computation. Several variations of the proposed models are considered and compared using the deviance information criterion. We illustrate this methodology with longitudinal physical activity count data measured in a longitudinal cohort of adolescents.
Journal: Journal of the American Statistical Association
Pages: 992-1002
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1379402
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1379402
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:992-1002
Template-Type: ReDIF-Article 1.0
Author-Name: Daniel Backenroth
Author-X-Name-First: Daniel
Author-X-Name-Last: Backenroth
Author-Name: Jeff Goldsmith
Author-X-Name-First: Jeff
Author-X-Name-Last: Goldsmith
Author-Name: Michelle D. Harran
Author-X-Name-First: Michelle D.
Author-X-Name-Last: Harran
Author-Name: Juan C. Cortes
Author-X-Name-First: Juan C.
Author-X-Name-Last: Cortes
Author-Name: John W. Krakauer
Author-X-Name-First: John W.
Author-X-Name-Last: Krakauer
Author-Name: Tomoko Kitago
Author-X-Name-First: Tomoko
Author-X-Name-Last: Kitago
Title: Modeling Motor Learning Using Heteroscedastic Functional Principal Components Analysis
Abstract:
We propose a novel method for estimating population-level and subject-specific effects of covariates on the variability of functional data. We extend the functional principal components analysis framework by modeling the variance of principal component scores as a function of covariates and subject-specific random effects. In a setting where principal components are largely invariant across subjects and covariate values, modeling the variance of these scores provides a flexible and interpretable way to explore factors that affect the variability of functional data. Our work is motivated by a novel dataset from an experiment assessing upper extremity motor control, and quantifies the reduction in movement variability associated with skill learning. The proposed methods can be applied broadly to understand movement variability, in settings that include motor learning, impairment due to injury or disease, and recovery. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1003-1015
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1379403
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1379403
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1003-1015
Template-Type: ReDIF-Article 1.0
Author-Name: Suyu Liu
Author-X-Name-First: Suyu
Author-X-Name-Last: Liu
Author-Name: Beibei Guo
Author-X-Name-First: Beibei
Author-X-Name-Last: Guo
Author-Name: Ying Yuan
Author-X-Name-First: Ying
Author-X-Name-Last: Yuan
Title: A Bayesian Phase I/II Trial Design for Immunotherapy
Abstract:
Immunotherapy is an innovative treatment approach that stimulates a patient’s immune system to fight cancer. It demonstrates characteristics distinct from conventional chemotherapy and stands to revolutionize cancer treatment. We propose a Bayesian phase I/II dose-finding design that incorporates the unique features of immunotherapy by simultaneously considering three outcomes: immune response, toxicity, and efficacy. The objective is to identify the biologically optimal dose, defined as the dose with the highest desirability in the risk–benefit tradeoff. An Emax model is utilized to describe the marginal distribution of the immune response. Conditional on the immune response, we jointly model toxicity and efficacy using a latent variable approach. Using the accumulating data, we adaptively randomize patients to experimental doses based on the continuously updated model estimates. A simulation study shows that our proposed design has good operating characteristics in terms of selecting the target dose and allocating patients to the target dose. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1016-1027
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1383260
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1383260
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1016-1027
Template-Type: ReDIF-Article 1.0
Author-Name: Daisy Philtron
Author-X-Name-First: Daisy
Author-X-Name-Last: Philtron
Author-Name: Yafei Lyu
Author-X-Name-First: Yafei
Author-X-Name-Last: Lyu
Author-Name: Qunhua Li
Author-X-Name-First: Qunhua
Author-X-Name-Last: Li
Author-Name: Debashis Ghosh
Author-X-Name-First: Debashis
Author-X-Name-Last: Ghosh
Title: Maximum Rank Reproducibility: A Nonparametric Approach to Assessing Reproducibility in Replicate Experiments
Abstract:
The identification of reproducible signals from the results of replicate high-throughput experiments is an important part of modern biological research. Often little is known about the dependence structure and the marginal distribution of the data, motivating the development of a nonparametric approach to assess reproducibility. The procedure, which we call the maximum rank reproducibility (MaRR) procedure, uses a maximum rank statistic to parse reproducible signals from noise without making assumptions about the distribution of reproducible signals. Because it uses the rank scale, this procedure can be easily applied to a variety of data types. One application is to assess the reproducibility of RNA-seq technology using data produced by the sequencing quality control (SEQC) consortium, which coordinated a multi-laboratory effort to assess reproducibility across three RNA-seq platforms. Our results on simulations and SEQC data show that the MaRR procedure effectively controls false discovery rates, has desirable power properties, and compares well to existing methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1028-1039
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1397521
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1397521
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1028-1039
Template-Type: ReDIF-Article 1.0
Author-Name: Adam N. Glynn
Author-X-Name-First: Adam N.
Author-X-Name-Last: Glynn
Author-Name: Konstantin Kashin
Author-X-Name-First: Konstantin
Author-X-Name-Last: Kashin
Title: Front-Door Versus Back-Door Adjustment With Unmeasured Confounding: Bias Formulas for Front-Door and Hybrid Adjustments With Application to a Job Training Program
Abstract:
We demonstrate that the front-door adjustment can be a useful alternative to standard covariate adjustments (i.e., back-door adjustments), even when the assumptions required for the front-door approach do not hold. We do this by providing asymptotic bias formulas for the front-door approach that can be compared directly to bias formulas for the back-door approach. In some cases, this allows the tightening of bounds on treatment effects. We also show that under one-sided noncompliance, the front-door approach does not rely on the use of control units. This finding has implications for the design of studies when treatment cannot be withheld from individuals (perhaps for ethical reasons). We illustrate these points with an application to the National Job Training Partnership Act Study.
Journal: Journal of the American Statistical Association
Pages: 1040-1049
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1398657
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1398657
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1040-1049
Template-Type: ReDIF-Article 1.0
Author-Name: Carlos M. Carvalho
Author-X-Name-First: Carlos M.
Author-X-Name-Last: Carvalho
Author-Name: Hedibert F. Lopes
Author-X-Name-First: Hedibert F.
Author-X-Name-Last: Lopes
Author-Name: Robert E. McCulloch
Author-X-Name-First: Robert E.
Author-X-Name-Last: McCulloch
Title: On the Long-Run Volatility of Stocks
Abstract:
In this article, we investigate whether or not the volatility per period of stocks is lower over longer horizons. Taking the perspective of an investor, we evaluate the predictive variance of k-period returns under different model and prior specifications. We adopt the state-space framework of Pástor and Stambaugh to model the dynamics of expected returns and evaluate the effects of prior elicitation in the resulting volatility estimates. Part of the developments includes an extension that incorporates time-varying volatilities and covariances in a constrained prior information set-up. Our conclusion for the U.S. market, under plausible prior specifications, is that stocks are less volatile in the long run. Model assessment exercises demonstrate that the models and priors supporting our main conclusions are in accordance with the data. To assess the generality of the results, we extend our analysis to a number of international equity indices. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1050-1069
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1407769
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1407769
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1050-1069
Template-Type: ReDIF-Article 1.0
Author-Name: Qingyuan Zhao
Author-X-Name-First: Qingyuan
Author-X-Name-Last: Zhao
Author-Name: Dylan S. Small
Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small
Author-Name: Paul R. Rosenbaum
Author-X-Name-First: Paul R.
Author-X-Name-Last: Rosenbaum
Title: Cross-Screening in Observational Studies That Test Many Hypotheses
Abstract:
We discuss observational studies that test many causal hypotheses, either hypotheses about many outcomes or many treatments. To be credible an observational study that tests many causal hypotheses must demonstrate that its conclusions are neither artifacts of multiple testing nor of small biases from nonrandom treatment assignment. In a sense that needs to be defined carefully, hidden within a sensitivity analysis for nonrandom assignment is an enormous correction for multiple testing: In the absence of bias, it is extremely improbable that multiple testing alone would create an association insensitive to moderate biases. We propose a new strategy called “cross-screening,” different from but motivated by recent work of Bogomolov and Heller on replicability. Cross-screening splits the data in half at random, uses the first half to plan a study carried out on the second half, then uses the second half to plan a study carried out on the first half, and reports the more favorable conclusions of the two studies correcting using the Bonferroni inequality for having done two studies. If the two studies happen to concur, then they achieve Bogomolov–Heller replicability; however, importantly, replicability is not required for strong control of the family-wise error rate, and either study alone suffices for firm conclusions. In randomized studies with just a few null hypotheses, cross-screening is not an attractive method when compared with conventional methods of multiplicity control. However, cross-screening has substantially higher power when hundreds or thousands of hypotheses are subjected to sensitivity analyses in an observational study of moderate size. We illustrate the technique by comparing 46 biomarkers in individuals who consume large quantities of fish versus little or no fish. The R package CrossScreening on CRAN implements the cross-screening method. 
Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1070-1084
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1407770
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1407770
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1070-1084
Template-Type: ReDIF-Article 1.0
Author-Name: Faming Liang
Author-X-Name-First: Faming
Author-X-Name-Last: Liang
Author-Name: Qizhai Li
Author-X-Name-First: Qizhai
Author-X-Name-Last: Li
Author-Name: Lei Zhou
Author-X-Name-First: Lei
Author-X-Name-Last: Zhou
Title: Bayesian Neural Networks for Selection of Drug Sensitive Genes
Abstract:
Recent advances in high-throughput biotechnologies have provided an unprecedented opportunity for biomarker discovery, which, from a statistical point of view, can be cast as a variable selection problem. This problem is challenging due to the high-dimensional and nonlinear nature of omics data and, in general, it suffers from three difficulties: (i) an unknown functional form of the nonlinear system, (ii) variable selection consistency, and (iii) high-demanding computation. To circumvent the first difficulty, we employ a feed-forward neural network to approximate the unknown nonlinear function, motivated by its universal approximation ability. To circumvent the second difficulty, we conduct structure selection for the neural network, which induces variable selection, by choosing appropriate prior distributions that lead to the consistency of variable selection. To circumvent the third difficulty, we implement the population stochastic approximation Monte Carlo algorithm, a parallel adaptive Markov chain Monte Carlo algorithm, on the OpenMP platform that provides a linear speedup for the simulation with the number of cores of the computer. The numerical results indicate that the proposed method can work very well for identification of relevant variables for high-dimensional nonlinear systems. The proposed method is successfully applied to identification of the genes that are associated with anticancer drug sensitivities based on the data collected in the cancer cell line encyclopedia study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 955-972
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2017.1409122
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1409122
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:955-972
Template-Type: ReDIF-Article 1.0
Author-Name: Jaewoo Park
Author-X-Name-First: Jaewoo
Author-X-Name-Last: Park
Author-Name: Murali Haran
Author-X-Name-First: Murali
Author-X-Name-Last: Haran
Title: Bayesian Inference in the Presence of Intractable Normalizing Functions
Abstract:
Models with intractable normalizing functions arise frequently in statistics. Common examples of such models include exponential random graph models for social networks and Markov point processes for ecology and disease modeling. Inference for these models is complicated because the normalizing functions of their probability distributions include the parameters of interest. In Bayesian analysis, they result in so-called doubly intractable posterior distributions which pose significant computational challenges. Several Monte Carlo methods have emerged in recent years to address Bayesian inference for such models. We provide a framework for understanding the algorithms, and elucidate connections among them. Through multiple simulated and real data examples, we compare and contrast the computational and statistical efficiency of these algorithms and discuss their theoretical bases. Our study provides practical recommendations for practitioners along with directions for future research for Markov chain Monte Carlo (MCMC) methodologists. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1372-1390
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2018.1448824
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1448824
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1372-1390
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Book Reviews
Journal: Journal of the American Statistical Association
Pages: 1391-1394
Issue: 523
Volume: 113
Year: 2018
Month: 7
X-DOI: 10.1080/01621459.2018.1513232
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1513232
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:523:p:1391-1394
Template-Type: ReDIF-Article 1.0
Author-Name: Tianxi Cai
Author-X-Name-First: Tianxi
Author-X-Name-Last: Cai
Author-Name: T. Tony Cai
Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai
Author-Name: Anru Zhang
Author-X-Name-First: Anru
Author-X-Name-Last: Zhang
Title: Structured Matrix Completion with Applications to Genomic Data Integration
Abstract:
Matrix completion has attracted significant recent attention in many fields including statistics, applied mathematics, and electrical engineering. Current literature on matrix completion focuses primarily on independent sampling models under which the individual observed entries are sampled independently. Motivated by applications in genomic data integration, we propose a new framework of structured matrix completion (SMC) to treat structured missingness by design. Specifically, our proposed method aims at efficient matrix recovery when a subset of the rows and columns of an approximately low-rank matrix are observed. We provide theoretical justification for the proposed SMC method and derive a lower bound for the estimation errors, which together establish the optimal rate of recovery over certain classes of approximately low-rank matrices. Simulation studies show that the method performs well in finite samples under a variety of configurations. The method is applied to integrate several ovarian cancer genomic studies with different extents of genomic measurements, which enables us to construct more accurate prediction rules for ovarian cancer survival. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 621-633
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1021005
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1021005
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:621-633
Template-Type: ReDIF-Article 1.0
Author-Name: Chris J. Oates
Author-X-Name-First: Chris J.
Author-X-Name-Last: Oates
Author-Name: Theodore Papamarkou
Author-X-Name-First: Theodore
Author-X-Name-Last: Papamarkou
Author-Name: Mark Girolami
Author-X-Name-First: Mark
Author-X-Name-Last: Girolami
Title: The Controlled Thermodynamic Integral for Bayesian Model Evidence Evaluation
Abstract:
Approximation of the model evidence is well known to be challenging. One promising approach is based on thermodynamic integration, but a key concern is that the thermodynamic integral can suffer from high variability in many applications. This article considers the reduction of variance that can be achieved by exploiting control variates in this setting. Our methodology applies whenever the gradient of both the log-likelihood and the log-prior with respect to the parameters can be efficiently evaluated. Results obtained on regression models and popular benchmark datasets demonstrate a significant and sometimes dramatic reduction in estimator variance and provide insight into the wider applicability of control variates to evidence estimation. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 634-645
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1021006
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1021006
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:634-645
Template-Type: ReDIF-Article 1.0
Author-Name: Shuyuan He
Author-X-Name-First: Shuyuan
Author-X-Name-Last: He
Author-Name: Wei Liang
Author-X-Name-First: Wei
Author-X-Name-Last: Liang
Author-Name: Junshan Shen
Author-X-Name-First: Junshan
Author-X-Name-Last: Shen
Author-Name: Grace Yang
Author-X-Name-First: Grace
Author-X-Name-Last: Yang
Title: Empirical Likelihood for Right Censored Lifetime Data
Abstract:
When the empirical likelihood (EL) of a parameter θ is constructed with right censored data, literature shows that −2log(empirical likelihood ratio) typically has an asymptotic scaled chi-squared distribution, where the scale parameter is a function of some unknown asymptotic variances. Therefore, the EL construction of confidence intervals for θ requires an additional estimation of the scale parameter. Additional estimation would reduce the coverage accuracy for θ. By using a special influence function as an estimating function, we prove that under very general conditions, −2log(empirical likelihood ratio) has an asymptotic standard chi-squared distribution with one degree of freedom. This eliminates the need for estimating the scale parameter as well as eases some of the often demanding computations of the EL method. Our estimating function yields a smaller asymptotic variance than those of Wang and Jing (2001) and Qin and Zhao (2007). Thus, it is not surprising that confidence intervals using the special influence functions give a better coverage accuracy as demonstrated by simulations.
Journal: Journal of the American Statistical Association
Pages: 646-655
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1024058
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1024058
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:646-655
Template-Type: ReDIF-Article 1.0
Author-Name: Yun Yang
Author-X-Name-First: Yun
Author-X-Name-Last: Yang
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Bayesian Conditional Tensor Factorizations for High-Dimensional Classification
Abstract:
In many application areas, data are collected on a categorical response and high-dimensional categorical predictors, with the goals being to build a parsimonious model for classification while doing inferences on the important predictors. In settings such as genomics, there can be complex interactions among the predictors. By using a carefully structured Tucker factorization, we define a model that can characterize any conditional probability, while facilitating variable selection and modeling of higher-order interactions. Following a Bayesian approach, we propose a Markov chain Monte Carlo algorithm for posterior computation accommodating uncertainty in the predictors to be included. Under near low-rank assumptions, the posterior distribution for the conditional probability is shown to achieve close to the parametric rate of contraction even in ultra high-dimensional settings. The methods are illustrated using simulation examples and biomedical applications. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 656-669
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1029129
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1029129
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:656-669
Template-Type: ReDIF-Article 1.0
Author-Name: Michael W. Robbins
Author-X-Name-First: Michael W.
Author-X-Name-Last: Robbins
Author-Name: Colin M. Gallagher
Author-X-Name-First: Colin M.
Author-X-Name-Last: Gallagher
Author-Name: Robert B. Lund
Author-X-Name-First: Robert B.
Author-X-Name-Last: Lund
Title: A General Regression Changepoint Test for Time Series Data
Abstract:
This article develops a test for a single changepoint in a general setting that allows for correlated time series regression errors, a seasonal cycle, time-varying regression factors, and covariate information. Within, a changepoint statistic is constructed from likelihood ratio principles and its asymptotic distribution is derived. The asymptotic distribution of the changepoint statistic is shown to be invariant to the seasonal cycle and the covariates should the latter obey some simple limit laws; however, the limit distribution depends on any time-varying factors. A new test based on ARMA residuals is developed and is shown to have favorable properties with finite samples. Driving our work is a changepoint analysis of the Mauna Loa record of monthly carbon dioxide concentrations. This series has a pronounced seasonal cycle, a nonlinear trend, heavily correlated regression errors, and covariate information in the form of climate oscillations. In the end, we find a prominent changepoint in the early 1990s, often attributed to the eruption of Mount Pinatubo, which cannot be explained by covariates. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 670-683
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1029130
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1029130
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:670-683
Template-Type: ReDIF-Article 1.0
Author-Name: Ying Yan
Author-X-Name-First: Ying
Author-X-Name-Last: Yan
Author-Name: Grace Y. Yi
Author-X-Name-First: Grace Y.
Author-X-Name-Last: Yi
Title: A Class of Functional Methods for Error-Contaminated Survival Data Under Additive Hazards Models with Replicate Measurements
Abstract:
Covariate measurement error has attracted extensive interest in survival analysis. Since Prentice, a large number of inference methods have been developed to handle error-prone data that are modulated with proportional hazards models. In contrast to proportional hazards models, additive hazards models offer a flexible tool to delineate survival processes. However, there is little research on measurement error effects under additive hazards models. In this article, we systematically investigate this important problem. New insights into measurement error effects are revealed, as opposed to well-documented results for proportional hazards models. In particular, we explore asymptotic bias of ignoring measurement error in the analysis. To correct for the induced bias, we develop a class of functional correction methods for measurement error effects. The validity of the proposed methods is carefully examined, and we investigate issues of model checking and model misspecification. Theoretical results are established, and are complemented with numerical assessments. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 684-695
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1034317
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1034317
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:684-695
Template-Type: ReDIF-Article 1.0
Author-Name: Irina Gaynanova
Author-X-Name-First: Irina
Author-X-Name-Last: Gaynanova
Author-Name: James G. Booth
Author-X-Name-First: James G.
Author-X-Name-Last: Booth
Author-Name: Martin T. Wells
Author-X-Name-First: Martin T.
Author-X-Name-Last: Wells
Title: Simultaneous Sparse Estimation of Canonical Vectors in the p ≫ N Setting
Abstract:
This article considers the problem of sparse estimation of canonical vectors in linear discriminant analysis when p ≫ N. Several methods have been proposed in the literature that estimate one canonical vector in the two-group case. However, G − 1 canonical vectors can be considered if the number of groups is G. In the multi-group context, it is common to estimate canonical vectors in a sequential fashion. Moreover, separate prior estimation of the covariance structure is often required. We propose a novel methodology for direct estimation of canonical vectors. In contrast to existing techniques, the proposed method estimates all canonical vectors at once, performs variable selection across all the vectors, and comes with theoretical guarantees on the variable selection and classification consistency. First, we highlight the fact that in the N > p setting the canonical vectors can be expressed in a closed form up to an orthogonal transformation. Second, we propose an extension of this form to the p ≫ N setting and achieve feature selection by using a group penalty. The resulting optimization problem is convex and can be solved using a block-coordinate descent algorithm. The practical performance of the method is evaluated through simulation studies as well as real data applications. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 696-706
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1034318
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1034318
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:696-706
Template-Type: ReDIF-Article 1.0
Author-Name: Guan Yu
Author-X-Name-First: Guan
Author-X-Name-Last: Yu
Author-Name: Yufeng Liu
Author-X-Name-First: Yufeng
Author-X-Name-Last: Liu
Title: Sparse Regression Incorporating Graphical Structure Among Predictors
Abstract:
With the abundance of high-dimensional data in various disciplines, sparse regularized techniques are very popular these days. In this article, we make use of the structure information among predictors to improve sparse regression models. Typically, such structure information can be modeled by the connectivity of an undirected graph using all predictors as nodes of the graph. Most existing methods use this undirected graph edge-by-edge to encourage the regression coefficients of corresponding connected predictors to be similar. However, such methods do not directly use the neighborhood information of the graph. Furthermore, if there are more edges in the predictor graph, the corresponding regularization term will be more complicated. In this article, we incorporate the graph information node-by-node, instead of edge-by-edge as used in most existing methods. Our proposed method is very general and it includes adaptive Lasso, group Lasso, and ridge regression as special cases. Both theoretical and numerical studies demonstrate the effectiveness of the proposed method for simultaneous estimation, prediction, and model selection. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 707-720
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1034319
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1034319
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:707-720
Template-Type: ReDIF-Article 1.0
Author-Name: Long Feng
Author-X-Name-First: Long
Author-X-Name-Last: Feng
Author-Name: Changliang Zou
Author-X-Name-First: Changliang
Author-X-Name-Last: Zou
Author-Name: Zhaojun Wang
Author-X-Name-First: Zhaojun
Author-X-Name-Last: Wang
Title: Multivariate-Sign-Based High-Dimensional Tests for the Two-Sample Location Problem
Abstract:
This article concerns tests for the two-sample location problem when data dimension is larger than the sample size. Existing multivariate-sign-based procedures are not robust against high dimensionality, producing tests with Type I error rates far away from nominal levels. This is mainly due to the biases from estimating location parameters. We propose a novel test to overcome this issue by using the “leave-one-out” idea. The proposed test statistic is scalar-invariant and thus is particularly useful when different components have different scales in high-dimensional data. Asymptotic properties of the test statistic are studied. Compared with other existing approaches, simulation studies show that the proposed method behaves well in terms of sizes and power. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 721-735
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1035380
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1035380
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:721-735
Template-Type: ReDIF-Article 1.0
Author-Name: Jin Tang
Author-X-Name-First: Jin
Author-X-Name-Last: Tang
Author-Name: Yehua Li
Author-X-Name-First: Yehua
Author-X-Name-Last: Li
Author-Name: Yongtao Guan
Author-X-Name-First: Yongtao
Author-X-Name-Last: Guan
Title: Generalized Quasi-Likelihood Ratio Tests for Semiparametric Analysis of Covariance Models in Longitudinal Data
Abstract:
We model generalized longitudinal data from multiple treatment groups by a class of semiparametric analysis of covariance models, which take into account the parametric effects of time-dependent covariates and the nonparametric time effects. In these models, the treatment effects are represented by nonparametric functions of time, and we propose a generalized quasi-likelihood ratio test procedure to test if these functions are identical. Our estimation procedure is based on profile estimating equations combined with local linear smoothers. We find that the much-celebrated Wilks phenomenon, which is well established for independent data, still holds for longitudinal data if a working independence correlation structure is assumed in the test statistic. However, this property does not hold in general, especially when the working variance function is misspecified. Our empirical study also shows that incorporating correlation into the test statistic does not necessarily improve the power of the test. The proposed methods are illustrated with simulation studies and a real application from opioid dependence treatments. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 736-747
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1036995
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1036995
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:736-747
Template-Type: ReDIF-Article 1.0
Author-Name: Pete Bunch
Author-X-Name-First: Pete
Author-X-Name-Last: Bunch
Author-Name: Simon Godsill
Author-X-Name-First: Simon
Author-X-Name-Last: Godsill
Title: Approximations of the Optimal Importance Density Using Gaussian Particle Flow Importance Sampling
Abstract:
Recently developed particle flow algorithms provide an alternative to importance sampling for drawing particles from a posterior distribution, and a number of particle filters based on this principle have been proposed. Samples are drawn from the prior and then moved according to some dynamics over an interval of pseudo-time such that their final values are distributed according to the desired posterior. In practice, implementing a particle flow sampler requires multiple layers of approximation, with the result that the final samples do not in general have the correct posterior distribution. In this article we consider using an approximate Gaussian flow for sampling with a class of nonlinear Gaussian models. We use the particle flow within an importance sampler, correcting for the discrepancy between the target and actual densities with importance weights. We present a suitable numerical integration procedure for use with this flow and an accompanying step-size control algorithm. In a filtering context, we use the particle flow to sample from the optimal importance density, rather than the filtering density itself, avoiding the need to make analytical or numerical approximations of the predictive density. Simulations using particle flow importance sampling within a particle filter demonstrate significant improvement over standard approximations of the optimal importance density, and the algorithm falls within the standard sequential Monte Carlo framework.
Journal: Journal of the American Statistical Association
Pages: 748-762
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1038387
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1038387
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:748-762
Template-Type: ReDIF-Article 1.0
Author-Name: Yiyuan She
Author-X-Name-First: Yiyuan
Author-X-Name-Last: She
Author-Name: Shijie Li
Author-X-Name-First: Shijie
Author-X-Name-Last: Li
Author-Name: Dapeng Wu
Author-X-Name-First: Dapeng
Author-X-Name-Last: Wu
Title: Robust Orthogonal Complement Principal Component Analysis
Abstract:
Recently, the robustification of principal component analysis (PCA) has attracted lots of attention from statisticians, engineers, and computer scientists. In this work, we study the type of outliers that are not necessarily apparent in the original observation space but can seriously affect the principal subspace estimation. Based on a mathematical formulation of such transformed outliers, a novel robust orthogonal complement principal component analysis (ROC-PCA) is proposed. The framework combines the popular sparsity-enforcing and low-rank regularization techniques to deal with row-wise outliers as well as element-wise outliers. A nonasymptotic oracle inequality guarantees the accuracy and high breakdown performance of ROC-PCA in finite samples. To tackle the computational challenges, an efficient algorithm is developed on the basis of Stiefel manifold optimization and iterative thresholding. Furthermore, a batch variant is proposed to significantly reduce the cost in ultra high dimensions. The article also points out a pitfall of a common practice of singular value decomposition (SVD) reduction in robust PCA. Experiments show the effectiveness and efficiency of ROC-PCA in both synthetic and real data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 763-771
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1042107
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1042107
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:763-771
Template-Type: ReDIF-Article 1.0
Author-Name: Lin Zhang
Author-X-Name-First: Lin
Author-X-Name-Last: Zhang
Author-Name: Veerabhadran Baladandayuthapani
Author-X-Name-First: Veerabhadran
Author-X-Name-Last: Baladandayuthapani
Author-Name: Hongxiao Zhu
Author-X-Name-First: Hongxiao
Author-X-Name-Last: Zhu
Author-Name: Keith A. Baggerly
Author-X-Name-First: Keith A.
Author-X-Name-Last: Baggerly
Author-Name: Tadeusz Majewski
Author-X-Name-First: Tadeusz
Author-X-Name-Last: Majewski
Author-Name: Bogdan A. Czerniak
Author-X-Name-First: Bogdan A.
Author-X-Name-Last: Czerniak
Author-Name: Jeffrey S. Morris
Author-X-Name-First: Jeffrey S.
Author-X-Name-Last: Morris
Title: Functional CAR Models for Large Spatially Correlated Functional Datasets
Abstract:
We develop a functional conditional autoregressive (CAR) model for spatially correlated data for which functions are collected on areal units of a lattice. Our model performs functional response regression while accounting for spatial correlations with potentially nonseparable and nonstationary covariance structure, in both the space and functional domains. We show theoretically that our construction leads to a CAR model at each functional location, with spatial covariance parameters varying and borrowing strength across the functional domain. Using basis transformation strategies, the nonseparable spatial-functional model is computationally scalable to enormous functional datasets, generalizable to different basis functions, and can be used on functions defined on higher dimensional domains such as images. Through simulation studies, we demonstrate that accounting for the spatial correlation in our modeling leads to improved functional regression performance. Applied to a high-throughput spatially correlated copy number dataset, the model identifies genetic markers not identified by comparable methods that ignore spatial correlations. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 772-786
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1042581
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1042581
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:772-786
Template-Type: ReDIF-Article 1.0
Author-Name: Chiung-Yu Huang
Author-X-Name-First: Chiung-Yu
Author-X-Name-Last: Huang
Author-Name: Jing Qin
Author-X-Name-First: Jing
Author-X-Name-Last: Qin
Author-Name: Huei-Ting Tsai
Author-X-Name-First: Huei-Ting
Author-X-Name-Last: Tsai
Title: Efficient Estimation of the Cox Model with Auxiliary Subgroup Survival Information
Abstract:
With the rapidly increasing availability of data in the public domain, combining information from different sources to infer about associations or differences of interest has become an emerging challenge to researchers. This article presents a novel approach to improve efficiency in estimating the survival time distribution by synthesizing information from the individual-level data with t-year survival probabilities from external sources such as disease registries. While disease registries provide accurate and reliable overall survival statistics for the disease population, critical pieces of information that influence both choice of treatment and clinical outcomes usually are not available in the registry database. To combine with the published information, we propose to summarize the external survival information via a system of nonlinear population moments and estimate the survival time model using empirical likelihood methods. The proposed approach is more flexible than the conventional meta-analysis in the sense that it can automatically combine survival information for different subgroups and the information may be derived from different studies. Moreover, an extended estimator that allows for a different baseline risk in the aggregate data is also studied. Empirical likelihood ratio tests are proposed to examine whether the auxiliary survival information is consistent with the individual-level data. Simulation studies show that the proposed estimators yield a substantial gain in efficiency over the conventional partial likelihood approach. Two sets of data analysis are conducted to illustrate the methods and theory.
Journal: Journal of the American Statistical Association
Pages: 787-799
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1044090
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1044090
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:787-799
Template-Type: ReDIF-Article 1.0
Author-Name: Abhirup Datta
Author-X-Name-First: Abhirup
Author-X-Name-Last: Datta
Author-Name: Sudipto Banerjee
Author-X-Name-First: Sudipto
Author-X-Name-Last: Banerjee
Author-Name: Andrew O. Finley
Author-X-Name-First: Andrew O.
Author-X-Name-Last: Finley
Author-Name: Alan E. Gelfand
Author-X-Name-First: Alan E.
Author-X-Name-Last: Gelfand
Title: Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets
Abstract:
Spatial process models for analyzing geostatistical data entail computations that become prohibitive as the number of spatial locations becomes large. This article develops a class of highly scalable nearest-neighbor Gaussian process (NNGP) models to provide fully model-based inference for large geostatistical datasets. We establish that the NNGP is a well-defined spatial process providing legitimate finite-dimensional Gaussian densities with sparse precision matrices. We embed the NNGP as a sparsity-inducing prior within a rich hierarchical modeling framework and outline how computationally efficient Markov chain Monte Carlo (MCMC) algorithms can be executed without storing or decomposing large matrices. The floating point operations (flops) per iteration of this algorithm are linear in the number of spatial locations, thereby rendering substantial scalability. We illustrate the computational and inferential benefits of the NNGP over competing methods using simulation studies and also analyze forest biomass from a massive U.S. Forest Inventory dataset at a scale that precludes alternative dimension-reducing methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 800-812
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1044091
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1044091
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:800-812
Template-Type: ReDIF-Article 1.0
Author-Name: Zhou Yu
Author-X-Name-First: Zhou
Author-X-Name-Last: Yu
Author-Name: Yuexiao Dong
Author-X-Name-First: Yuexiao
Author-X-Name-Last: Dong
Author-Name: Li-Xing Zhu
Author-X-Name-First: Li-Xing
Author-X-Name-Last: Zhu
Title: Trace Pursuit: A General Framework for Model-Free Variable Selection
Abstract:
We propose trace pursuit for model-free variable selection under the sufficient dimension-reduction paradigm. Two distinct algorithms are proposed: stepwise trace pursuit and forward trace pursuit. Stepwise trace pursuit achieves selection consistency with fixed p. Forward trace pursuit can serve as an initial screening step to speed up the computation in the case of ultrahigh dimensionality. The screening consistency property of forward trace pursuit based on sliced inverse regression is established. Finite sample performances of trace pursuit and other model-free variable selection methods are compared through numerical studies. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 813-821
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1050494
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1050494
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:813-821
Template-Type: ReDIF-Article 1.0
Author-Name: F. Jay Breidt
Author-X-Name-First: F. Jay
Author-X-Name-Last: Breidt
Author-Name: Jean D. Opsomer
Author-X-Name-First: Jean D.
Author-X-Name-Last: Opsomer
Author-Name: Ismael Sanchez-Borrego
Author-X-Name-First: Ismael
Author-X-Name-Last: Sanchez-Borrego
Title: Nonparametric Variance Estimation Under Fine Stratification: An Alternative to Collapsed Strata
Abstract:
Fine stratification is commonly used to control the distribution of a sample from a finite population and to improve the precision of resulting estimators. One-per-stratum designs represent the finest possible stratification and occur in practice, but designs with very low numbers of elements per stratum (say, two or three) are also common. The classical variance estimator in this context is the collapsed stratum estimator, which relies on creating larger “pseudo-strata” and computing the sum of the squared differences between estimated stratum totals across the pseudo-strata. We propose here a nonparametric alternative that replaces the pseudo-strata by kernel-weighted stratum neighborhoods and uses deviations from a fitted mean function to estimate the variance. We establish the asymptotic behavior of the kernel-based estimator and show its superior practical performance relative to the collapsed stratum variance estimator in a simulation study. An application to data from the U.S. Consumer Expenditure Survey illustrates the potential of the method in practice.
Journal: Journal of the American Statistical Association
Pages: 822-833
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1058264
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1058264
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:822-833
Template-Type: ReDIF-Article 1.0
Author-Name: Jacob Bien
Author-X-Name-First: Jacob
Author-X-Name-Last: Bien
Author-Name: Florentina Bunea
Author-X-Name-First: Florentina
Author-X-Name-Last: Bunea
Author-Name: Luo Xiao
Author-X-Name-First: Luo
Author-X-Name-Last: Xiao
Title: Convex Banding of the Covariance Matrix
Abstract:
We introduce a new sparse estimator of the covariance matrix for high-dimensional models in which the variables have a known ordering. Our estimator, which is the solution to a convex optimization problem, is equivalently expressed as an estimator that tapers the sample covariance matrix by a Toeplitz, sparsely banded, data-adaptive matrix. As a result of this adaptivity, the convex banding estimator enjoys theoretical optimality properties not attained by previous banding or tapered estimators. In particular, our convex banding estimator is minimax rate adaptive in Frobenius and operator norms, up to log factors, over commonly studied classes of covariance matrices, and over more general classes. Furthermore, it correctly recovers the bandwidth when the true covariance is exactly banded. Our convex formulation admits a simple and efficient algorithm. Empirical studies demonstrate its practical effectiveness and illustrate that our exactly banded estimator works well even when the true covariance matrix is only close to a banded matrix, confirming our theoretical results. Our method compares favorably with all existing methods, in terms of accuracy and speed. We illustrate the practical merits of the convex banding estimator by showing that it can be used to improve the performance of discriminant analysis for classifying sound recordings. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 834-845
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1058265
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1058265
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:834-845
Template-Type: ReDIF-Article 1.0
Author-Name: Aaron Fisher
Author-X-Name-First: Aaron
Author-X-Name-Last: Fisher
Author-Name: Brian Caffo
Author-X-Name-First: Brian
Author-X-Name-Last: Caffo
Author-Name: Brian Schwartz
Author-X-Name-First: Brian
Author-X-Name-Last: Schwartz
Author-Name: Vadim Zipunnikov
Author-X-Name-First: Vadim
Author-X-Name-Last: Zipunnikov
Title: Fast, Exact Bootstrap Principal Component Analysis for p > 1 Million
Abstract:
Many have suggested a bootstrap procedure for estimating the sampling variability of principal component analysis (PCA) results. However, when the number of measurements per subject (p) is much larger than the number of subjects (n), calculating and storing the leading principal components (PCs) from each bootstrap sample can be computationally infeasible. To address this, we outline methods for fast, exact calculation of bootstrap PCs, eigenvalues, and scores. Our methods leverage the fact that all bootstrap samples occupy the same n-dimensional subspace as the original sample. As a result, all bootstrap PCs are limited to the same n-dimensional subspace and can be efficiently represented by their low-dimensional coordinates in that subspace. Several uncertainty metrics can be computed solely based on the bootstrap distribution of these low-dimensional coordinates, without calculating or storing the p-dimensional bootstrap components. Fast bootstrap PCA is applied to a dataset of sleep electroencephalogram recordings (p = 900, n = 392), and to a dataset of brain magnetic resonance images (MRIs) (p ≈ 3 million, n = 352). For the MRI dataset, our method allows for standard errors for the first three PCs based on 1000 bootstrap samples to be calculated on a standard laptop in 47 min, as opposed to approximately 4 days with standard methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 846-860
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1062383
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1062383
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:846-860
Template-Type: ReDIF-Article 1.0
Author-Name: Baojiang Chen
Author-X-Name-First: Baojiang
Author-X-Name-Last: Chen
Author-Name: Pengfei Li
Author-X-Name-First: Pengfei
Author-X-Name-Last: Li
Author-Name: Jing Qin
Author-X-Name-First: Jing
Author-X-Name-Last: Qin
Author-Name: Tao Yu
Author-X-Name-First: Tao
Author-X-Name-Last: Yu
Title: Using a Monotonic Density Ratio Model to Find the Asymptotically Optimal Combination of Multiple Diagnostic Tests
Abstract:
With the advent of new technology, new biomarker studies have become essential in cancer research. To achieve optimal sensitivity and specificity, one needs to combine different diagnostic tests. The celebrated Neyman–Pearson lemma enables us to use the density ratio to optimally combine different diagnostic tests. In this article, we propose a semiparametric model by directly modeling the density ratio between the diseased and nondiseased population as an unspecified monotonic nondecreasing function of a linear or nonlinear combination of multiple diagnostic tests. This method is appealing in that it is not necessary to assume separate models for the diseased and nondiseased populations. Further, the proposed method provides an asymptotically optimal way to combine multiple test results. We use a pool-adjacent-violation-algorithm to find the semiparametric maximum likelihood estimate of the receiver operating characteristic (ROC) curve. Using modern empirical process theory we show cubic root n consistency for the ROC curve and the underlying Euclidean parameter estimation. Extensive simulations show that the proposed method outperforms its competitors. We apply the method to two real-data applications. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 861-874
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1066681
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1066681
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:861-874
Template-Type: ReDIF-Article 1.0
Author-Name: Stanislav Minsker
Author-X-Name-First: Stanislav
Author-X-Name-Last: Minsker
Author-Name: Ying-Qi Zhao
Author-X-Name-First: Ying-Qi
Author-X-Name-Last: Zhao
Author-Name: Guang Cheng
Author-X-Name-First: Guang
Author-X-Name-Last: Cheng
Title: Active Clinical Trials for Personalized Medicine
Abstract:
Individualized treatment rules (ITRs) tailor treatments according to individual patient characteristics. They can significantly improve patient care and are thus becoming increasingly popular. The data collected during randomized clinical trials are often used to estimate the optimal ITRs. However, these trials are generally expensive to run, and, moreover, they are not designed to efficiently estimate ITRs. In this article, we propose a cost-effective estimation method from an active learning perspective. In particular, our method recruits only the “most informative” patients (in terms of learning the optimal ITRs) from an ongoing clinical trial. Simulation studies and real-data examples show that our active clinical trial method significantly improves on competing methods. We derive risk bounds and show that they support these observed empirical advantages. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 875-887
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1066682
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1066682
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:875-887
Template-Type: ReDIF-Article 1.0
Author-Name: Emilio Porcu
Author-X-Name-First: Emilio
Author-X-Name-Last: Porcu
Author-Name: Moreno Bevilacqua
Author-X-Name-First: Moreno
Author-X-Name-Last: Bevilacqua
Author-Name: Marc G. Genton
Author-X-Name-First: Marc G.
Author-X-Name-Last: Genton
Title: Spatio-Temporal Covariance and Cross-Covariance Functions of the Great Circle Distance on a Sphere
Abstract:
In this article, we propose stationary covariance functions for processes that evolve temporally over a sphere, as well as cross-covariance functions for multivariate random fields defined over a sphere. For such processes, the great circle distance is the natural metric that should be used to describe spatial dependence. Given the mathematical difficulties for the construction of covariance functions for processes defined over spheres cross time, approximations of the state of nature have been proposed in the literature by using the Euclidean (based on map projections) and the chordal distances. We present several methods of construction based on the great circle distance and provide closed-form expressions for both spatio-temporal and multivariate cases. A simulation study assesses the discrepancy between the great circle distance, chordal distance, and Euclidean distance based on a map projection both in terms of estimation and prediction in a space-time and a bivariate spatial setting, where the space is in this case the Earth. We revisit the analysis of Total Ozone Mapping Spectrometer (TOMS) data and investigate differences in terms of estimation and prediction between the aforementioned distance-based approaches. Both simulation and real data highlight sensible differences in terms of estimation of the spatial scale parameter. As far as prediction is concerned, the differences can be appreciated only when the interpoint distances are large, as demonstrated by an illustrative example. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 888-898
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1072541
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1072541
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:888-898
Template-Type: ReDIF-Article 1.0
Author-Name: Ryan J. Tibshirani
Author-X-Name-First: Ryan J.
Author-X-Name-Last: Tibshirani
Author-Name: Jonathan Taylor
Author-X-Name-First: Jonathan
Author-X-Name-Last: Taylor
Author-Name: Richard Lockhart
Author-X-Name-First: Richard
Author-X-Name-Last: Lockhart
Author-Name: Robert Tibshirani
Author-X-Name-First: Robert
Author-X-Name-Last: Tibshirani
Title: Exact Post-Selection Inference for Sequential Regression Procedures
Abstract:
We propose new inference tools for forward stepwise regression, least angle regression, and the lasso. Assuming a Gaussian model for the observation vector y, we first describe a general scheme to perform valid inference after any selection event that can be characterized as y falling into a polyhedral set. This framework allows us to derive conditional (post-selection) hypothesis tests at any step of forward stepwise or least angle regression, or any step along the lasso regularization path, because, as it turns out, selection events for these procedures can be expressed as polyhedral constraints on y. The p-values associated with these tests are exactly uniform under the null distribution, in finite samples, yielding exact Type I error control. The tests can also be inverted to produce confidence intervals for appropriate underlying regression parameters. The R package selectiveInference, freely available on the CRAN repository, implements the new inference tools described in this article. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 600-620
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1108848
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1108848
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:600-620
Template-Type: ReDIF-Article 1.0
Author-Name: Colin B. Fogarty
Author-X-Name-First: Colin B.
Author-X-Name-Last: Fogarty
Author-Name: Mark E. Mikkelsen
Author-X-Name-First: Mark E.
Author-X-Name-Last: Mikkelsen
Author-Name: David F. Gaieski
Author-X-Name-First: David F.
Author-X-Name-Last: Gaieski
Author-Name: Dylan S. Small
Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small
Title: Discrete Optimization for Interpretable Study Populations and Randomization Inference in an Observational Study of Severe Sepsis Mortality
Abstract:
Motivated by an observational study of the effect of hospital ward versus intensive care unit admission on severe sepsis mortality, we develop methods to address two common problems in observational studies: (1) when there is a lack of covariate overlap between the treated and control groups, how to define an interpretable study population wherein inference can be conducted without extrapolating with respect to important variables; and (2) how to use randomization inference to form confidence intervals for the average treatment effect with binary outcomes. Our solution to problem (1) incorporates existing suggestions in the literature while yielding a study population that is easily understood in terms of the covariates themselves, and can be solved using an efficient branch-and-bound algorithm. We address problem (2) by solving a linear integer program to use the worst-case variance of the average treatment effect among values for unobserved potential outcomes that are compatible with the null hypothesis. Our analysis finds no evidence for a difference between the 60-day mortality rates if all individuals were admitted to the ICU and if all patients were admitted to the hospital ward among less severely ill patients and among patients with cryptic septic shock. We implement our methodology in R, providing scripts in the supplementary material.
Journal: Journal of the American Statistical Association
Pages: 447-458
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1112802
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1112802
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:447-458
Template-Type: ReDIF-Article 1.0
Author-Name: Bo Zhou
Author-X-Name-First: Bo
Author-X-Name-Last: Zhou
Author-Name: David E. Moorman
Author-X-Name-First: David E.
Author-X-Name-Last: Moorman
Author-Name: Sam Behseta
Author-X-Name-First: Sam
Author-X-Name-Last: Behseta
Author-Name: Hernando Ombao
Author-X-Name-First: Hernando
Author-X-Name-Last: Ombao
Author-Name: Babak Shahbaba
Author-X-Name-First: Babak
Author-X-Name-Last: Shahbaba
Title: A Dynamic Bayesian Model for Characterizing Cross-Neuronal Interactions During Decision-Making
Abstract:
The goal of this article is to develop a novel statistical model for studying cross-neuronal spike train interactions during decision-making. For an individual to successfully complete the task of decision-making, a number of temporally organized events must occur: stimuli must be detected, potential outcomes must be evaluated, behaviors must be executed or inhibited, and outcomes (such as reward or no-reward) must be experienced. Due to the complexity of this process, it is likely the case that decision-making is encoded by the temporally precise interactions between large populations of neurons. Most existing statistical models, however, are inadequate for analyzing such a phenomenon because they provide only an aggregated measure of interactions over time. To address this considerable limitation, we propose a dynamic Bayesian model that captures the time-varying nature of neuronal activity (such as the time-varying strength of the interactions between neurons). The proposed method yielded results that reveal new insight into the dynamic nature of population coding in the prefrontal cortex during decision-making. In our analysis, we note that while some neurons in the prefrontal cortex do not synchronize their firing activity until the presence of a reward, a different set of neurons synchronizes their activity shortly after stimulus onset. These differentially synchronizing subpopulations of neurons suggest a continuum of population representation of the reward-seeking task. Second, our analyses also suggest that the degree of synchronization differs between the rewarded and nonrewarded conditions. Moreover, the proposed model is scalable to handle data on many simultaneously recorded neurons and is applicable to analyzing other types of multivariate time series data with latent structure. Supplementary materials (including computer codes) for our article are available online.
Journal: Journal of the American Statistical Association
Pages: 459-471
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1116988
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1116988
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:459-471
Template-Type: ReDIF-Article 1.0
Author-Name: Jonathan R. Bradley
Author-X-Name-First: Jonathan R.
Author-X-Name-Last: Bradley
Author-Name: Christopher K. Wikle
Author-X-Name-First: Christopher K.
Author-X-Name-Last: Wikle
Author-Name: Scott H. Holan
Author-X-Name-First: Scott H.
Author-X-Name-Last: Holan
Title: Bayesian Spatial Change of Support for Count-Valued Survey Data With Application to the American Community Survey
Abstract:
We introduce Bayesian spatial change of support (COS) methodology for count-valued survey data with known survey variances. Our proposed methodology is motivated by the American Community Survey (ACS), an ongoing survey administered by the U.S. Census Bureau that provides timely information on several key demographic variables. Specifically, the ACS produces 1-year, 3-year, and 5-year “period-estimates,” and corresponding margins of errors, for published demographic and socio-economic variables recorded over predefined geographies within the United States. Despite the availability of these predefined geographies, it is often of interest to data-users to specify customized user-defined spatial supports. In particular, it is useful to estimate demographic variables defined on “new” spatial supports in “real-time.” This problem is known as spatial COS, which is typically performed under the assumption that the data follow a Gaussian distribution. However, count-valued survey data is naturally non-Gaussian and, hence, we consider modeling these data using a Poisson distribution. Additionally, survey-data are often accompanied by estimates of error, which we incorporate into our analysis. We interpret Poisson count-valued data in small areas as an aggregation of events from a spatial point process. This approach provides us with the flexibility necessary to allow ACS users to consider a variety of spatial supports in “real-time.” We show the effectiveness of our approach through a simulated example as well as through an analysis using public-use ACS data.
Journal: Journal of the American Statistical Association
Pages: 472-487
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1117471
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1117471
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:472-487
Template-Type: ReDIF-Article 1.0
Author-Name: Angela Noufaily
Author-X-Name-First: Angela
Author-X-Name-Last: Noufaily
Author-Name: Paddy Farrington
Author-X-Name-First: Paddy
Author-X-Name-Last: Farrington
Author-Name: Paul Garthwaite
Author-X-Name-First: Paul
Author-X-Name-Last: Garthwaite
Author-Name: Doyo Gragn Enki
Author-X-Name-First: Doyo Gragn
Author-X-Name-Last: Enki
Author-Name: Nick Andrews
Author-X-Name-First: Nick
Author-X-Name-Last: Andrews
Author-Name: Andre Charlett
Author-X-Name-First: Andre
Author-X-Name-Last: Charlett
Title: Detection of Infectious Disease Outbreaks From Laboratory Data With Reporting Delays
Abstract:
Many statistical surveillance systems for the timely detection of outbreaks of infectious disease operate on laboratory data. Such data typically incur reporting delays between the time at which a specimen is collected for diagnostic purposes, and the time at which the results of the laboratory analysis become available. Statistical surveillance systems currently in use usually make some ad hoc adjustment for such delays, or use counts by time of report. We propose a new statistical approach that takes account of the delays explicitly, by monitoring the number of specimens identified in the current and past m time units, where m is a tuning parameter. Values expected in the absence of an outbreak are estimated from counts observed in recent years (typically 5 years). We study the method in the context of an outbreak detection system used in the United Kingdom and several other European countries. We propose a suitable test statistic for the null hypothesis that no outbreak is currently occurring. We derive its null variance, incorporating uncertainty about the estimated delay distribution. Simulations and applications to some test datasets suggest the method works well, and can improve performance over ad hoc methods in current use. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 488-499
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1119047
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1119047
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:488-499
Template-Type: ReDIF-Article 1.0
Author-Name: Matthew Plumlee
Author-X-Name-First: Matthew
Author-X-Name-Last: Plumlee
Author-Name: V. Roshan Joseph
Author-X-Name-First: V. Roshan
Author-X-Name-Last: Joseph
Author-Name: Hui Yang
Author-X-Name-First: Hui
Author-X-Name-Last: Yang
Title: Calibrating Functional Parameters in the Ion Channel Models of Cardiac Cells
Abstract:
Computational modeling is a popular tool to understand a diverse set of complex systems. The output from a computational model depends on a set of parameters that are unknown to the designer, but a modeler can estimate them by collecting physical data. In the described study of the ion channels of ventricular myocytes, the parameter of interest is a function as opposed to a scalar or a set of scalars. This article develops a new modeling strategy to nonparametrically study the functional parameter using Bayesian inference with Gaussian process prior distributions. A new sampling scheme is devised to address this unique problem.
Journal: Journal of the American Statistical Association
Pages: 500-509
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1119695
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1119695
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:500-509
Template-Type: ReDIF-Article 1.0
Author-Name: Laura Forastiere
Author-X-Name-First: Laura
Author-X-Name-Last: Forastiere
Author-Name: Fabrizia Mealli
Author-X-Name-First: Fabrizia
Author-X-Name-Last: Mealli
Author-Name: Tyler J. VanderWeele
Author-X-Name-First: Tyler J.
Author-X-Name-Last: VanderWeele
Title: Identification and Estimation of Causal Mechanisms in Clustered Encouragement Designs: Disentangling Bed Nets Using Bayesian Principal Stratification
Abstract:
Exploration of causal mechanisms is often important for researchers and policymakers to understand how an intervention works and how it can be improved. This task can be crucial in clustered encouragement designs (CEDs). Encouragement design studies arise frequently when the treatment cannot be enforced because of ethical or practical constraints and an encouragement intervention (information campaigns, incentives, etc.) is conceived with the purpose of increasing the uptake of the treatment of interest. By design, encouragements always entail the complication of noncompliance. Encouragements can also give rise to a variety of mechanisms, particularly when encouragement is assigned at the cluster level. Social interactions among units within the same cluster can result in spillover effects. Disentangling the effect of encouragement through spillover effects from that through the enhancement of the treatment would give better insight into the intervention and it could be compelling for planning the scaling-up phase of the program. Building on previous works on CEDs and noncompliance, we use the principal stratification framework to define stratum-specific causal effects, that is, effects for specific latent subpopulations, defined by the joint potential compliance statuses under both encouragement conditions. We show how the latter stratum-specific causal effects are related to the decomposition commonly used in the literature and provide flexible homogeneity assumptions under which an extrapolation across principal strata allows one to disentangle the effects. Estimation of causal estimands can be performed with Bayesian inferential methods using hierarchical models to account for clustering. We illustrate the proposed methodology by analyzing a cluster randomized experiment implemented in Zambia and designed to evaluate the impact on malaria prevalence of an agricultural loan program intended to increase the bed net coverage. Farmer households assigned to the program could take advantage of a deferred payment and a discount in the purchase of new bed nets. Our analysis shows a lack of evidence of an effect of the offering of the program to a cluster of households through spillover effects, that is, through a greater bed net coverage in the neighborhood. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 510-525
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1125788
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1125788
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:510-525
Template-Type: ReDIF-Article 1.0
Author-Name: Damião Nóbrega Da Silva
Author-X-Name-First: Damião Nóbrega
Author-X-Name-Last: Da Silva
Author-Name: Chris Skinner
Author-X-Name-First: Chris
Author-X-Name-Last: Skinner
Author-Name: Jae Kwang Kim
Author-X-Name-First: Jae Kwang
Author-X-Name-Last: Kim
Title: Using Binary Paradata to Correct for Measurement Error in Survey Data Analysis
Abstract:
Paradata refers here to data at unit level on an observed auxiliary variable, not usually of direct scientific interest, which may be informative about the quality of the survey data for the unit. There is increasing interest among survey researchers in how to use such data. Its use to reduce bias from nonresponse has received more attention so far than its use to correct for measurement error. This article considers the latter with a focus on binary paradata indicating the presence of measurement error. A motivating application concerns inference about a regression model, where earnings is a covariate measured with error and whether a respondent refers to pay records is the paradata variable. We specify a parametric model allowing for either normally or t-distributed measurement errors and discuss the assumptions required to identify the regression coefficients. We propose two estimation approaches that take account of complex survey designs: pseudo-maximum likelihood estimation and parametric fractional imputation. These approaches are assessed in a simulation study and are applied to a regression of a measure of deprivation given earnings and other covariates using British Household Panel Survey data. It is found that the proposed approach to correcting for measurement error reduces bias and improves on the precision of a simple approach based on accurate observations. We outline briefly possible extensions to uses of this approach at earlier stages in the survey process. Supplemental materials are available online.
Journal: Journal of the American Statistical Association
Pages: 526-537
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1130632
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1130632
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:526-537
Template-Type: ReDIF-Article 1.0
Author-Name: Wei Liu
Author-X-Name-First: Wei
Author-X-Name-Last: Liu
Author-Name: Zhiwei Zhang
Author-X-Name-First: Zhiwei
Author-X-Name-Last: Zhang
Author-Name: R. Jason Schroeder
Author-X-Name-First: R. Jason
Author-X-Name-Last: Schroeder
Author-Name: Martin Ho
Author-X-Name-First: Martin
Author-X-Name-Last: Ho
Author-Name: Bo Zhang
Author-X-Name-First: Bo
Author-X-Name-Last: Zhang
Author-Name: Cynthia Long
Author-X-Name-First: Cynthia
Author-X-Name-Last: Long
Author-Name: Hui Zhang
Author-X-Name-First: Hui
Author-X-Name-Last: Zhang
Author-Name: Telba Z. Irony
Author-X-Name-First: Telba Z.
Author-X-Name-Last: Irony
Title: Joint Estimation of Treatment and Placebo Effects in Clinical Trials With Longitudinal Blinding Assessments
Abstract:
In some therapeutic areas, treatment evaluation is frequently complicated by a possible placebo effect (i.e., the psychobiological effect of a patient's knowledge or belief of being treated). When a substantial placebo effect is likely to exist, it is important to distinguish the treatment and placebo effects in quantifying the clinical benefit of a new treatment. These causal effects can be formally defined in a joint causal model that includes treatment (e.g., new vs. placebo) and treatmentality (i.e., a patient's belief or mentality about which treatment she or he has received) as separate exposures. Information about the treatmentality exposure can be obtained from blinding assessments, which are increasingly common in clinical trials where blinding success is in question. Assuming that treatmentality has a lagged effect and is measured at multiple time points, this article is concerned with joint evaluation of treatment and placebo effects in clinical trials with longitudinal follow-up, possibly with monotone missing data. We describe and discuss several methods adapted from the longitudinal causal inference literature, apply them to a weight loss study, and compare them in simulation experiments that mimic the weight loss study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 538-548
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1130633
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1130633
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:538-548
Template-Type: ReDIF-Article 1.0
Author-Name: Zhe Yu
Author-X-Name-First: Zhe
Author-X-Name-Last: Yu
Author-Name: Raquel Prado
Author-X-Name-First: Raquel
Author-X-Name-Last: Prado
Author-Name: Erin Burke Quinlan
Author-X-Name-First: Erin Burke
Author-X-Name-Last: Quinlan
Author-Name: Steven C. Cramer
Author-X-Name-First: Steven C.
Author-X-Name-Last: Cramer
Author-Name: Hernando Ombao
Author-X-Name-First: Hernando
Author-X-Name-Last: Ombao
Title: Understanding the Impact of Stroke on Brain Motor Function: A Hierarchical Bayesian Approach
Abstract:
Stroke is a disturbance in blood supply to the brain resulting in the loss of brain functions, particularly motor function. A study was conducted by the UCI Neurorehabilitation Lab to investigate the impact of stroke on motor-related brain regions. Functional MRI (fMRI) data were collected from stroke patients and healthy controls while the subjects performed a simple motor task. In addition to affecting local neuronal activation strength, stroke might also alter communications (i.e., connectivity) between brain regions. We develop a hierarchical Bayesian modeling approach for the analysis of multi-subject fMRI data that allows us to explore brain changes due to stroke. Our approach simultaneously estimates activation and condition-specific connectivity at the group level, and provides estimates for region/subject-specific hemodynamic response functions. Moreover, our model uses spike-and-slab priors to allow for direct posterior inference on the connectivity network. Our results indicate that motor-control regions show greater activation in the unaffected hemisphere and the midline surface in stroke patients than those same regions in healthy controls during the simple motor task. We also note increased connectivity within secondary motor regions in stroke subjects. These findings provide insight into altered neural correlates of movement in subjects who suffered a stroke. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 549-563
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1133425
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1133425
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:549-563
Template-Type: ReDIF-Article 1.0
Author-Name: Eric W. Fox
Author-X-Name-First: Eric W.
Author-X-Name-Last: Fox
Author-Name: Martin B. Short
Author-X-Name-First: Martin B.
Author-X-Name-Last: Short
Author-Name: Frederic P. Schoenberg
Author-X-Name-First: Frederic P.
Author-X-Name-Last: Schoenberg
Author-Name: Kathryn D. Coronges
Author-X-Name-First: Kathryn D.
Author-X-Name-Last: Coronges
Author-Name: Andrea L. Bertozzi
Author-X-Name-First: Andrea L.
Author-X-Name-Last: Bertozzi
Title: Modeling E-mail Networks and Inferring Leadership Using Self-Exciting Point Processes
Abstract:
We propose various self-exciting point process models for the times when e-mails are sent between individuals in a social network. Using an expectation–maximization (EM)-type approach, we fit these models to an e-mail network dataset from West Point Military Academy and the Enron e-mail dataset. We argue that the self-exciting models adequately capture major temporal clustering features in the data and perform better than traditional stationary Poisson models. We also investigate how accounting for diurnal and weekly trends in e-mail activity improves the overall fit to the observed network data. A motivation and application for fitting these self-exciting models is to use parameter estimates to characterize important e-mail communication behaviors such as the baseline sending rates, average reply rates, and average response times. A primary goal is to use these features, estimated from the self-exciting models, to infer the underlying leadership status of users in the West Point and Enron networks. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 564-584
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1135802
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1135802
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:564-584
Template-Type: ReDIF-Article 1.0
Author-Name: Peter Goos
Author-X-Name-First: Peter
Author-X-Name-Last: Goos
Author-Name: Bradley Jones
Author-X-Name-First: Bradley
Author-X-Name-Last: Jones
Author-Name: Utami Syafitri
Author-X-Name-First: Utami
Author-X-Name-Last: Syafitri
Title: I-Optimal Design of Mixture Experiments
Abstract:
In mixture experiments, the factors under study are proportions of the ingredients of a mixture. The special nature of the factors necessitates specific types of regression models, and specific types of experimental designs. Although mixture experiments usually are intended to predict the response(s) for all possible formulations of the mixture and to identify optimal proportions for each of the ingredients, little research has been done concerning their I-optimal design. This is surprising given that I-optimal designs minimize the average variance of prediction and, therefore, seem more appropriate for mixture experiments than the commonly used D-optimal designs, which focus on a precise model estimation rather than precise predictions. In this article, we provide the first detailed overview of the literature on the I-optimal design of mixture experiments and identify several contradictions. For the second-order and the special cubic model, we present continuous I-optimal designs and contrast them with the published results. We also study exact I-optimal designs, and compare them in detail to continuous I-optimal designs and to D-optimal designs. One striking result of our work is that the performance of D-optimal designs in terms of the I-optimality criterion very strongly depends on which of the D-optimal designs is considered. Supplemental materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 899-911
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2015.1136632
File-URL: http://hdl.handle.net/10.1080/01621459.2015.1136632
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:899-911
Template-Type: ReDIF-Article 1.0
Author-Name: Daniel Cervone
Author-X-Name-First: Daniel
Author-X-Name-Last: Cervone
Author-Name: Alex D’Amour
Author-X-Name-First: Alex
Author-X-Name-Last: D’Amour
Author-Name: Luke Bornn
Author-X-Name-First: Luke
Author-X-Name-Last: Bornn
Author-Name: Kirk Goldsberry
Author-X-Name-First: Kirk
Author-X-Name-Last: Goldsberry
Title: A Multiresolution Stochastic Process Model for Predicting Basketball Possession Outcomes
Abstract:
Basketball games evolve continuously in space and time as players constantly interact with their teammates, the opposing team, and the ball. However, current analyses of basketball outcomes rely on discretized summaries of the game that reduce such interactions to tallies of points, assists, and similar events. In this article, we propose a framework for using optical player tracking data to estimate, in real time, the expected number of points obtained by the end of a possession. This quantity, called expected possession value (EPV), derives from a stochastic process model for the evolution of a basketball possession. We model this process at multiple levels of resolution, differentiating between continuous, infinitesimal movements of players, and discrete events such as shot attempts and turnovers. Transition kernels are estimated using hierarchical spatiotemporal models that share information across players while remaining computationally tractable on very large data sets. In addition to estimating EPV, these models reveal novel insights on players’ decision-making tendencies as a function of their spatial strategy. In the supplementary material, we provide a data sample and R code for further exploration of our model and its results.
Journal: Journal of the American Statistical Association
Pages: 585-599
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2016.1141685
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1141685
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:585-599
Template-Type: ReDIF-Article 1.0
Author-Name: Ryan J. Tibshirani
Author-X-Name-First: Ryan J.
Author-X-Name-Last: Tibshirani
Author-Name: Jonathan Taylor
Author-X-Name-First: Jonathan
Author-X-Name-Last: Taylor
Author-Name: Richard Lockhart
Author-X-Name-First: Richard
Author-X-Name-Last: Lockhart
Author-Name: Robert Tibshirani
Author-X-Name-First: Robert
Author-X-Name-Last: Tibshirani
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 618-620
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2016.1182787
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1182787
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:618-620
Template-Type: ReDIF-Article 1.0
Author-Name: Lawrence D. Brown
Author-X-Name-First: Lawrence D.
Author-X-Name-Last: Brown
Author-Name: Kory D. Johnson
Author-X-Name-First: Kory D.
Author-X-Name-Last: Johnson
Title: Comment
Journal: Journal of the American Statistical Association
Pages: 614-617
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2016.1182788
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1182788
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:614-617
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Book Reviews
Journal: Journal of the American Statistical Association
Pages: 912-919
Issue: 514
Volume: 111
Year: 2016
Month: 4
X-DOI: 10.1080/01621459.2016.1200851
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1200851
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:111:y:2016:i:514:p:912-919
Template-Type: ReDIF-Article 1.0
Author-Name: Mauricio Sadinle
Author-X-Name-First: Mauricio
Author-X-Name-Last: Sadinle
Title: Bayesian Estimation of Bipartite Matchings for Record Linkage
Abstract:
The bipartite record linkage task consists of merging two disparate datafiles containing information on two overlapping sets of entities. This is nontrivial in the absence of unique identifiers and it is important for a wide variety of applications given that it needs to be solved whenever we have to combine information from different sources. Most statistical techniques currently used for record linkage are derived from a seminal article by Fellegi and Sunter in 1969. These techniques usually assume independence in the matching statuses of record pairs to derive estimation procedures and optimal point estimators. We argue that this independence assumption is unreasonable and instead target a bipartite matching between the two datafiles as our parameter of interest. Bayesian implementations allow us to quantify uncertainty on the matching decisions and derive a variety of point estimators using different loss functions. We propose partial Bayes estimates that allow uncertain parts of the bipartite matching to be left unresolved. We evaluate our approach to record linkage using a variety of challenging scenarios and show that it outperforms the traditional methodology. We illustrate the advantages of our methods merging two datafiles on casualties from the civil war of El Salvador. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 600-612
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1148612
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1148612
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:600-612
Template-Type: ReDIF-Article 1.0
Author-Name: Yijian Huang
Author-X-Name-First: Yijian
Author-X-Name-Last: Huang
Title: Restoration of Monotonicity Respecting in Dynamic Regression
Abstract:
Dynamic regression models, including the quantile regression model and Aalen’s additive hazards model, are widely adopted to investigate evolving covariate effects. Yet lack of monotonicity respecting with standard estimation procedures remains an outstanding issue. Advances have recently been made, but none provides a complete resolution. In this article, we propose a novel adaptive interpolation method to restore monotonicity respecting, by successively identifying and then interpolating nearest monotonicity-respecting points of an original estimator. Under mild regularity conditions, the resulting regression coefficient estimator is shown to be asymptotically equivalent to the original. Our numerical studies have demonstrated that the proposed estimator is much smoother and may have better finite-sample efficiency than the original as well as, when available (only in special cases), other competing monotonicity-respecting estimators. Illustration with a clinical study is provided.
Journal: Journal of the American Statistical Association
Pages: 613-622
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1149070
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1149070
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:613-622
Template-Type: ReDIF-Article 1.0
Author-Name: David S. Matteson
Author-X-Name-First: David S.
Author-X-Name-Last: Matteson
Author-Name: Ruey S. Tsay
Author-X-Name-First: Ruey S.
Author-X-Name-Last: Tsay
Title: Independent Component Analysis via Distance Covariance
Abstract:
This article introduces a novel statistical framework for independent component analysis (ICA) of multivariate data. We propose methodology for estimating mutually independent components, and a versatile resampling-based procedure for inference, including misspecification testing. Independent components are estimated by combining a nonparametric probability integral transformation with a generalized nonparametric whitening method based on distance covariance that simultaneously minimizes all forms of dependence among the components. We prove the consistency of our estimator under minimal regularity conditions and detail conditions for consistency under model misspecification, all while placing assumptions on the observations directly, not on the latent components. U statistics of certain Euclidean distances between sample elements are combined to construct a test statistic for mutually independent components. The proposed measures and tests are based on both necessary and sufficient conditions for mutual independence. We demonstrate the improvements of the proposed method over several competing methods in simulation studies, and we apply the proposed ICA approach to two real examples and contrast it with principal component analysis.
Journal: Journal of the American Statistical Association
Pages: 623-637
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1150851
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1150851
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:623-637
Template-Type: ReDIF-Article 1.0
Author-Name: Kristin A. Linn
Author-X-Name-First: Kristin A.
Author-X-Name-Last: Linn
Author-Name: Eric B. Laber
Author-X-Name-First: Eric B.
Author-X-Name-Last: Laber
Author-Name: Leonard A. Stefanski
Author-X-Name-First: Leonard A.
Author-X-Name-Last: Stefanski
Title: Interactive Q-Learning for Quantiles
Abstract:
A dynamic treatment regime is a sequence of decision rules, each of which recommends treatment based on features of patient medical history such as past treatments and outcomes. Existing methods for estimating optimal dynamic treatment regimes from data optimize the mean of a response variable. However, the mean may not always be the most appropriate summary of performance. We derive estimators of decision rules for optimizing probabilities and quantiles computed with respect to the response distribution for two-stage, binary treatment settings. This enables estimation of dynamic treatment regimes that optimize the cumulative distribution function of the response at a prespecified point or a prespecified quantile of the response distribution such as the median. The proposed methods perform favorably in simulation experiments. We illustrate our approach with data from a sequentially randomized trial where the primary outcome is remission of depression symptoms. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 638-649
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1155993
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1155993
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:638-649
Template-Type: ReDIF-Article 1.0
Author-Name: Shujie Ma
Author-X-Name-First: Shujie
Author-X-Name-Last: Ma
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Author-Name: Chih-Ling Tsai
Author-X-Name-First: Chih-Ling
Author-X-Name-Last: Tsai
Title: Variable Screening via Quantile Partial Correlation
Abstract:
In quantile linear regression with ultrahigh-dimensional data, we propose an algorithm for screening all candidate variables and subsequently selecting relevant predictors. Specifically, we first employ quantile partial correlation for screening, and then we apply the extended Bayesian information criterion (EBIC) for best subset selection. Our proposed method can successfully select predictors when the variables are highly correlated, and it can also identify variables that make a contribution to the conditional quantiles but are marginally uncorrelated or weakly correlated with the response. Theoretical results show that the proposed algorithm can yield the sure screening set. By controlling the false selection rate, model selection consistency can be achieved theoretically. In practice, we propose using EBIC for best subset selection so that the resulting model is screening consistent. Simulation studies demonstrate that the proposed algorithm performs well, and an empirical example is presented. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 650-663
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1156545
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1156545
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:650-663
Template-Type: ReDIF-Article 1.0
Author-Name: Qingning Zhou
Author-X-Name-First: Qingning
Author-X-Name-Last: Zhou
Author-Name: Tao Hu
Author-X-Name-First: Tao
Author-X-Name-Last: Hu
Author-Name: Jianguo Sun
Author-X-Name-First: Jianguo
Author-X-Name-Last: Sun
Title: A Sieve Semiparametric Maximum Likelihood Approach for Regression Analysis of Bivariate Interval-Censored Failure Time Data
Abstract:
Interval-censored failure time data arise in a number of fields and many authors have discussed various issues related to their analysis. However, most of the existing methods are for univariate data and there exists only limited research on bivariate data, especially on regression analysis of bivariate interval-censored data. We present a class of semiparametric transformation models for the problem and, for inference, a sieve maximum likelihood approach is developed. The model provides great flexibility, in particular including the commonly used proportional hazards model as a special case, and in the approach, Bernstein polynomials are employed. The strong consistency and asymptotic normality of the resulting estimators of regression parameters are established and, furthermore, the estimators are shown to be asymptotically efficient. Extensive simulation studies are conducted and indicate that the proposed method works well for practical situations. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 664-672
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1158113
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1158113
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:664-672
Template-Type: ReDIF-Article 1.0
Author-Name: Vikram V. Garg
Author-X-Name-First: Vikram V.
Author-X-Name-Last: Garg
Author-Name: Roy H. Stogner
Author-X-Name-First: Roy H.
Author-X-Name-Last: Stogner
Title: Hierarchical Latin Hypercube Sampling
Abstract:
Latin hypercube sampling (LHS) is a robust, scalable Monte Carlo method that is used in many areas of science and engineering. We present a new algorithm for generating hierarchic Latin hypercube sets (HLHS) that are recursively divisible into LHS subsets. Based on this new construction, we introduce a hierarchical incremental LHS (HILHS) method that allows the user to employ LHS in a flexibly incremental setting. This overcomes a drawback of many LHS schemes that require the entire sample set to be selected a priori, or only allow very large increments. We derive the sampling properties for HLHS designs and HILHS estimators. We also present numerical studies that showcase the flexible incrementation offered by HILHS.
Journal: Journal of the American Statistical Association
Pages: 673-682
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1158717
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1158717
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:673-682
Template-Type: ReDIF-Article 1.0
Author-Name: Fasheng Sun
Author-X-Name-First: Fasheng
Author-X-Name-Last: Sun
Author-Name: Boxin Tang
Author-X-Name-First: Boxin
Author-X-Name-Last: Tang
Title: A Method of Constructing Space-Filling Orthogonal Designs
Abstract:
This article presents a method of constructing a rich class of orthogonal designs that include orthogonal Latin hypercubes as special cases. Two prominent features of the method are its simplicity and generality. In addition to orthogonality, the resulting designs enjoy some attractive space-filling properties, making them very suitable for computer experiments.
Journal: Journal of the American Statistical Association
Pages: 683-689
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1159211
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1159211
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:683-689
Template-Type: ReDIF-Article 1.0
Author-Name: Ruiyan Luo
Author-X-Name-First: Ruiyan
Author-X-Name-Last: Luo
Author-Name: Xin Qi
Author-X-Name-First: Xin
Author-X-Name-Last: Qi
Title: Function-on-Function Linear Regression by Signal Compression
Abstract:
We consider functional linear regression models with a functional response and multiple functional predictors, with the goal of finding the best finite-dimensional approximation to the signal part of the response function. Defining the integrated squared correlation coefficient between a random variable and a random function, we propose to solve a penalized generalized functional eigenvalue problem, whose solutions satisfy that projections on the original predictors generate new scalar uncorrelated variables and these variables have the largest integrated squared correlation coefficient with the signal function. With these new variables, we transform the original function-on-function regression model to a function-on-scalar regression model whose predictors are uncorrelated, and estimate the model by the penalized least-squares method. This method is also extended to models with both multiple functional and scalar predictors. We provide the asymptotic consistency and the corresponding convergence rates for our estimates. Simulation studies in various settings and for both one and multiple functional predictors demonstrate that our approach has good predictive performance and is very computationally efficient. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 690-705
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1164053
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1164053
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:690-705
Template-Type: ReDIF-Article 1.0
Author-Name: Alexander Schnurr
Author-X-Name-First: Alexander
Author-X-Name-Last: Schnurr
Author-Name: Herold Dehling
Author-X-Name-First: Herold
Author-X-Name-Last: Dehling
Title: Testing for Structural Breaks via Ordinal Pattern Dependence
Abstract:
We propose new concepts to analyze and model the dependence structure between two time series. Our methods rely exclusively on the order structure of the data points. Hence, the methods are stable under monotone transformations of the time series and robust against small perturbations or measurement errors. Ordinal pattern dependence can be characterized by four parameters. We propose estimators for these parameters, and we calculate their asymptotic distributions. Furthermore, we derive a test for structural breaks within the dependence structure. All results are supplemented by simulation studies and empirical examples. For three consecutive data points attaining different values, there are six possibilities for how their values can be ordered. These possibilities are called ordinal patterns. Our first idea is simply to count the number of coincidences of patterns in both time series and to compare this with the expected number in the case of independence. If we detect a lot of coincident patterns, it would indicate that the up-and-down behavior is similar. Hence, our concept can be seen as a way to measure nonlinear “correlation.” We show in the last section how to generalize the concept to capture various other kinds of dependence.
Journal: Journal of the American Statistical Association
Pages: 706-720
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1164706
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1164706
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:706-720
Template-Type: ReDIF-Article 1.0
Author-Name: David B. Dahl
Author-X-Name-First: David B.
Author-X-Name-Last: Dahl
Author-Name: Ryan Day
Author-X-Name-First: Ryan
Author-X-Name-Last: Day
Author-Name: Jerry W. Tsai
Author-X-Name-First: Jerry W.
Author-X-Name-Last: Tsai
Title: Random Partition Distribution Indexed by Pairwise Information
Abstract:
We propose a random partition distribution indexed by pairwise similarity information such that partitions compatible with the similarities are given more probability. The use of pairwise similarities, in the form of distances, is common in some clustering algorithms (e.g., hierarchical clustering), but we show how to use this type of information to define a prior partition distribution for flexible Bayesian modeling. A defining feature of the distribution is that it allocates probability among partitions within a given number of subsets, but it does not shift probability among sets of partitions with different numbers of subsets. Our distribution places more probability on partitions that group similar items yet keeps the total probability of partitions with a given number of subsets constant. The distribution of the number of subsets (and its moments) is available in closed-form and is not a function of the similarities. Our formulation has an explicit probability mass function (with a tractable normalizing constant) so the full suite of MCMC methods may be used for posterior inference. We compare our distribution with several existing partition distributions, showing that our formulation has attractive properties. We provide three demonstrations to highlight the features and relative performance of our distribution.
Journal: Journal of the American Statistical Association
Pages: 721-732
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1165103
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1165103
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:721-732
Template-Type: ReDIF-Article 1.0
Author-Name: Daniel R. Kowal
Author-X-Name-First: Daniel R.
Author-X-Name-Last: Kowal
Author-Name: David S. Matteson
Author-X-Name-First: David S.
Author-X-Name-Last: Matteson
Author-Name: David Ruppert
Author-X-Name-First: David
Author-X-Name-Last: Ruppert
Title: A Bayesian Multivariate Functional Dynamic Linear Model
Abstract:
We present a Bayesian approach for modeling multivariate, dependent functional data. To account for the three dominant structural features in the data—functional, time dependent, and multivariate components—we extend hierarchical dynamic linear models for multivariate time series to the functional data setting. We also develop Bayesian spline theory in a more general constrained optimization framework. The proposed methods identify a time-invariant functional basis for the functional observations, which is smooth and interpretable, and can be made common across multivariate observations for additional information sharing. The Bayesian framework permits joint estimation of the model parameters, provides exact inference (up to MCMC error) on specific parameters, and allows generalized dependence structures. Sampling from the posterior distribution is accomplished with an efficient Gibbs sampling algorithm. We illustrate the proposed framework with two applications: (1) multi-economy yield curve data from the recent global recession, and (2) local field potential brain signals in rats, for which we develop a multivariate functional time series approach for multivariate time–frequency analysis. Supplementary materials, including R code and the multi-economy yield curve data, are available online.
Journal: Journal of the American Statistical Association
Pages: 733-744
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1165104
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1165104
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:733-744
Template-Type: ReDIF-Article 1.0
Author-Name: Yifei Sun
Author-X-Name-First: Yifei
Author-X-Name-Last: Sun
Author-Name: Mei-Cheng Wang
Author-X-Name-First: Mei-Cheng
Author-X-Name-Last: Wang
Title: Evaluating Utility Measurement From Recurrent Marker Processes in the Presence of Competing Terminal Events
Abstract:
In follow-up studies, utility marker measurements are usually collected upon the occurrence of recurrent events until a terminal event such as death takes place. In this article, we define the recurrent marker process to characterize utility accumulation over time. For example, with medical cost and repeated hospitalizations being treated as marker and recurrent events, respectively, the recurrent marker process is the trajectory of cumulative cost, which stops increasing after death. In many applications, competing risks arise as subjects are at risk of more than one mutually exclusive terminal event, such as death from different causes, and modeling the recurrent marker process for each failure type is often of interest. However, censoring creates challenges in the methodological development, because for censored subjects, both failure type and recurrent marker process after censoring are unobserved. To circumvent this problem, we propose a nonparametric framework for the recurrent marker process with competing terminal events. In the presence of competing risks, we start with an estimator by using marker information from uncensored subjects. As a result, the estimator can be inefficient under heavy censoring. To improve efficiency, we propose a second estimator by combining the first estimator with auxiliary information from the estimate under the noncompeting risks model. The large sample properties and optimality of the second estimator are established. Simulation studies and an application to the SEER-Medicare linked data are presented to illustrate the proposed methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 745-756
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1166113
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1166113
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:745-756
Template-Type: ReDIF-Article 1.0
Author-Name: Xianyang Zhang
Author-X-Name-First: Xianyang
Author-X-Name-Last: Zhang
Author-Name: Guang Cheng
Author-X-Name-First: Guang
Author-X-Name-Last: Cheng
Title: Simultaneous Inference for High-Dimensional Linear Models
Abstract:
This article proposes a bootstrap-assisted procedure to conduct simultaneous inference for high-dimensional sparse linear models based on the recent desparsifying Lasso estimator. Our procedure allows the dimension of the parameter vector of interest to be exponentially larger than sample size, and it automatically accounts for the dependence within the desparsifying Lasso estimator. Moreover, our simultaneous testing method can be naturally coupled with the margin screening to enhance its power in sparse testing with a reduced computational cost, or with the step-down method to provide a strong control for the family-wise error rate. In theory, we prove that our simultaneous testing procedure asymptotically achieves the prespecified significance level, and enjoys certain optimality in terms of its power even when the model errors are non-Gaussian. Our general theory is also useful in studying the support recovery problem. To broaden the applicability, we further extend our main results to generalized linear models with convex loss functions. The effectiveness of our methods is demonstrated via simulation studies. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 757-768
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1166114
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1166114
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:757-768
Template-Type: ReDIF-Article 1.0
Author-Name: Ailin Fan
Author-X-Name-First: Ailin
Author-X-Name-Last: Fan
Author-Name: Rui Song
Author-X-Name-First: Rui
Author-X-Name-Last: Song
Author-Name: Wenbin Lu
Author-X-Name-First: Wenbin
Author-X-Name-Last: Lu
Title: Change-Plane Analysis for Subgroup Detection and Sample Size Calculation
Abstract:
We propose a systematic method for testing and identifying a subgroup with an enhanced treatment effect. We adopt a change-plane technique to first test the existence of a subgroup, and then identify the subgroup if the null hypothesis on nonexistence of such a subgroup is rejected. A semiparametric model is considered for the response with an unspecified baseline function and an interaction between a subgroup indicator and treatment. A doubly robust test statistic is constructed based on this model, and asymptotic distributions of the test statistic under both null and local alternative hypotheses are derived. Moreover, a sample size calculation method for subgroup detection is developed based on the proposed statistic. The finite sample performance of the proposed test is evaluated via simulations. Finally, the proposed methods for subgroup identification and sample size calculation are applied to data from an AIDS study.
Journal: Journal of the American Statistical Association
Pages: 769-778
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1166115
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1166115
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:769-778
Template-Type: ReDIF-Article 1.0
Author-Name: Yang Ni
Author-X-Name-First: Yang
Author-X-Name-Last: Ni
Author-Name: Francesco C. Stingo
Author-X-Name-First: Francesco C.
Author-X-Name-Last: Stingo
Author-Name: Veerabhadran Baladandayuthapani
Author-X-Name-First: Veerabhadran
Author-X-Name-Last: Baladandayuthapani
Title: Sparse Multi-Dimensional Graphical Models: A Unified Bayesian Framework
Abstract:
Multi-dimensional data constituted by measurements along multiple axes have emerged across many scientific areas such as genomics and cancer surveillance. A common objective is to investigate the conditional dependencies among the variables along each axis, taking into account the multi-dimensional structure of the data. Traditional multivariate approaches are unsuitable for such highly structured data due to inefficiency, loss of power, and lack of interpretability. In this article, we propose a novel class of multi-dimensional graphical models based on matrix decompositions of the precision matrices along each dimension. Our approach is a unified framework applicable to both directed and undirected decomposable graphs as well as arbitrary combinations of these. Exploiting the marginalization of the likelihood, we develop efficient posterior sampling schemes based on partially collapsed Gibbs samplers. Empirically, through simulation studies, we show the superior performance of our approach in comparison with those of benchmark and state-of-the-art methods. We illustrate our approaches using two datasets: ovarian cancer proteomics and U.S. cancer mortality. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 779-793
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1167694
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1167694
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:779-793
Template-Type: ReDIF-Article 1.0
Author-Name: Gongjun Xu
Author-X-Name-First: Gongjun
Author-X-Name-Last: Xu
Author-Name: Sy Han Chiou
Author-X-Name-First: Sy Han
Author-X-Name-Last: Chiou
Author-Name: Chiung-Yu Huang
Author-X-Name-First: Chiung-Yu
Author-X-Name-Last: Huang
Author-Name: Mei-Cheng Wang
Author-X-Name-First: Mei-Cheng
Author-X-Name-Last: Wang
Author-Name: Jun Yan
Author-X-Name-First: Jun
Author-X-Name-Last: Yan
Title: Joint Scale-Change Models for Recurrent Events and Failure Time
Abstract:
Recurrent event data arise frequently in various fields such as biomedical sciences, public health, engineering, and social sciences. In many instances, the observation of the recurrent event process can be stopped by the occurrence of a correlated failure event, such as treatment failure and death. In this article, we propose a joint scale-change model for the recurrent event process and the failure time, where a shared frailty variable is used to model the association between the two types of outcomes. In contrast to the popular Cox-type joint modeling approaches, the regression parameters in the proposed joint scale-change model have marginal interpretations. The proposed approach is robust in the sense that no parametric assumption is imposed on the distribution of the unobserved frailty and that we do not need the strong Poisson-type assumption for the recurrent event process. We establish consistency and asymptotic normality of the proposed semiparametric estimators under suitable regularity conditions. To estimate the corresponding variances of the estimators, we develop a computationally efficient resampling-based procedure. Simulation studies and an analysis of hospitalization data from the Danish Psychiatric Central Register illustrate the performance of the proposed method. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 794-805
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1173557
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1173557
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:794-805
Template-Type: ReDIF-Article 1.0
Author-Name: Andrés F. Barrientos
Author-X-Name-First: Andrés F.
Author-X-Name-Last: Barrientos
Author-Name: Alejandro Jara
Author-X-Name-First: Alejandro
Author-X-Name-Last: Jara
Author-Name: Fernando A. Quintana
Author-X-Name-First: Fernando A.
Author-X-Name-Last: Quintana
Title: Fully Nonparametric Regression for Bounded Data Using Dependent Bernstein Polynomials
Abstract:
We propose a novel class of probability models for sets of predictor-dependent probability distributions with bounded domain. The proposal extends the Dirichlet–Bernstein prior for single density estimation, by using dependent stick-breaking processes. A general model class and two simplified versions are discussed in detail. Appealing theoretical properties such as continuity, association structure, marginal distribution, large support, and consistency of the posterior distribution are established for all models. The behavior of the models is illustrated using simulated and real-life data. The simulated data are also used to compare the proposed methodology to existing methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 806-825
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1180987
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1180987
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:806-825
Template-Type: ReDIF-Article 1.0
Author-Name: Yifei Sun
Author-X-Name-First: Yifei
Author-X-Name-Last: Sun
Author-Name: Chiung-Yu Huang
Author-X-Name-First: Chiung-Yu
Author-X-Name-Last: Huang
Author-Name: Mei-Cheng Wang
Author-X-Name-First: Mei-Cheng
Author-X-Name-Last: Wang
Title: Nonparametric Benefit–Risk Assessment Using Marker Process in the Presence of a Terminal Event
Abstract:
Benefit–risk assessment is a crucial step in the medical decision process. In many biomedical studies, both longitudinal marker measurements and time to a terminal event serve as important endpoints for benefit–risk assessment. The effect of an intervention or a treatment on the longitudinal marker process, however, can be in conflict with its effect on the time to the terminal event. Thus, questions arise on how to evaluate treatment effects based on the two endpoints, for the purpose of deciding on which treatment is most likely to benefit the patients. In this article, we present a unified framework for benefit–risk assessment using the observed longitudinal markers and time to event data. We propose a cumulative weighted marker process to synthesize information from the two endpoints, and use its mean function at a prespecified time point as a benefit–risk summary measure. We consider nonparametric estimation of the summary measure under two scenarios: (i) the longitudinal marker is measured intermittently during the study period, and (ii) the value of the longitudinal marker is observed throughout the entire follow-up period. The large-sample properties of the estimators are derived and compared. Simulation studies and data examples show that the proposed methods are easy to implement and reliable for practical use. Supplemental materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 826-836
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1180988
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1180988
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:826-836
Template-Type: ReDIF-Article 1.0
Author-Name: Ang Li
Author-X-Name-First: Ang
Author-X-Name-Last: Li
Author-Name: Rina Foygel Barber
Author-X-Name-First: Rina Foygel
Author-X-Name-Last: Barber
Title: Accumulation Tests for FDR Control in Ordered Hypothesis Testing
Abstract:
Multiple testing problems arising in modern scientific applications can involve simultaneously testing thousands or even millions of hypotheses, with relatively few true signals. In this article, we consider the multiple testing problem where prior information is available (for instance, from an earlier study under different experimental conditions), that can allow us to test the hypotheses as a ranked list to increase the number of discoveries. Given an ordered list of n hypotheses, the aim is to select a data-dependent cutoff k and declare the first k hypotheses to be statistically significant while bounding the false discovery rate (FDR). Generalizing several existing methods, we develop a family of “accumulation tests” to choose a cutoff k that adapts to the amount of signal at the top of the ranked list. We introduce a new method in this family, the HingeExp method, which offers higher power to detect true signals compared to existing techniques. Our theoretical results prove that these methods control a modified FDR on finite samples, and characterize the power of the methods in the family. We apply the tests to simulated data, including a high-dimensional model selection problem for linear regression. We also compare accumulation tests to existing methods for multiple testing on a real data problem of identifying differential gene expression over a dosage gradient. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 837-849
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1180989
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1180989
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:837-849
Template-Type: ReDIF-Article 1.0
Author-Name: Hélène Juillard
Author-X-Name-First: Hélène
Author-X-Name-Last: Juillard
Author-Name: Guillaume Chauvet
Author-X-Name-First: Guillaume
Author-X-Name-Last: Chauvet
Author-Name: Anne Ruiz-Gazen
Author-X-Name-First: Anne
Author-X-Name-Last: Ruiz-Gazen
Title: Estimation Under Cross-Classified Sampling With Application to a Childhood Survey
Abstract:
The cross-classified sampling design consists in drawing samples from a two-dimensional population, independently in each dimension. Such a design is commonly used in consumer price index surveys and has been recently applied to draw a sample of babies in the French Longitudinal Survey on Childhood, by crossing a sample of maternity units and a sample of days. We propose to derive a general theory of estimation for this sampling design. We consider the Horvitz–Thompson estimator for a total, and show that the cross-classified design will usually result in a loss of efficiency as compared to the widespread two-stage design. We obtain the asymptotic distribution of the Horvitz–Thompson estimator and several unbiased variance estimators. Facing the problem of possibly negative values, we propose simplified nonnegative variance estimators and study their bias under a super-population model. The proposed estimators are compared for totals and ratios on simulated data. An application on real data from the French Longitudinal Survey on Childhood is also presented, and we make some recommendations. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 850-858
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1186028
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1186028
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:850-858
Template-Type: ReDIF-Article 1.0
Author-Name: Guohui Wu
Author-X-Name-First: Guohui
Author-X-Name-Last: Wu
Author-Name: Scott H. Holan
Author-X-Name-First: Scott H.
Author-X-Name-Last: Holan
Title: Bayesian Hierarchical Multi-Population Multistate Jolly–Seber Models With Covariates: Application to the Pallid Sturgeon Population Assessment Program
Abstract:
Estimating abundance for multiple populations is of fundamental importance to many ecological monitoring programs. Equally important is quantifying the spatial distribution and characterizing the migratory behavior of target populations within the study domain. To achieve these goals, we propose a Bayesian hierarchical multi-population multistate Jolly–Seber model that incorporates covariates. The model is proposed using a state-space framework and has several distinct advantages. First, multiple populations within the same study area can be modeled simultaneously. As a consequence, it is possible to achieve improved parameter estimation by “borrowing strength” across different populations. In many cases, such as our motivating example involving endangered species, this borrowing of strength is crucial, as there is relatively less information for one of the populations under consideration. Second, in addition to accommodating covariate information, we develop a computationally efficient Markov chain Monte Carlo algorithm that requires no tuning. Importantly, the model we propose allows us to draw inference on each population as well as on multiple populations simultaneously. Finally, we demonstrate the effectiveness of our method through a motivating example of estimating the spatial distribution and migration of hatchery and wild populations of the endangered pallid sturgeon (Scaphirhynchus albus), using data from the Pallid Sturgeon Population Assessment Program on the Lower Missouri River. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 471-483
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1211531
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1211531
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:471-483
Template-Type: ReDIF-Article 1.0
Author-Name: Giampiero Marra
Author-X-Name-First: Giampiero
Author-X-Name-Last: Marra
Author-Name: Rosalba Radice
Author-X-Name-First: Rosalba
Author-X-Name-Last: Radice
Author-Name: Till Bärnighausen
Author-X-Name-First: Till
Author-X-Name-Last: Bärnighausen
Author-Name: Simon N. Wood
Author-X-Name-First: Simon N.
Author-X-Name-Last: Wood
Author-Name: Mark E. McGovern
Author-X-Name-First: Mark E.
Author-X-Name-Last: McGovern
Title: A Simultaneous Equation Approach to Estimating HIV Prevalence With Nonignorable Missing Responses
Abstract:
Estimates of HIV prevalence are important for policy to establish the health status of a country’s population and to evaluate the effectiveness of population-based interventions and campaigns. However, participation rates in testing for surveillance conducted as part of household surveys, on which many of these estimates are based, can be low. HIV positive individuals may be less likely to participate because they fear disclosure, in which case estimates obtained using conventional approaches to deal with missing data, such as imputation-based methods, will be biased. We develop a Heckman-type simultaneous equation approach that accounts for nonignorable selection, but unlike previous implementations, allows for spatial dependence and does not impose a homogenous selection process on all respondents. In addition, our framework addresses the issue of separation, where for instance some factors are severely unbalanced and highly predictive of the response, which would ordinarily prevent model convergence. Estimation is carried out within a penalized likelihood framework where smoothing is achieved using a parameterization of the smoothing criterion, which makes estimation more stable and efficient. We provide the software for straightforward implementation of the proposed approach, and apply our methodology to estimating national and sub-national HIV prevalence in Swaziland, Zimbabwe, and Zambia. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 484-496
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1224713
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1224713
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:484-496
Template-Type: ReDIF-Article 1.0
Author-Name: Ephraim M. Hanks
Author-X-Name-First: Ephraim M.
Author-X-Name-Last: Hanks
Title: Modeling Spatial Covariance Using the Limiting Distribution of Spatio-Temporal Random Walks
Abstract:
We present an approach for modeling areal spatial covariance in observed genetic allele data by considering the stationary (limiting) distribution of a spatio-temporal Markov random walk model for gene flow. This stationary distribution corresponds to an intrinsic simultaneous autoregressive (SAR) model for spatial correlation, and provides a principled approach to specifying areal spatial models when a spatio-temporal generating process can be assumed. We apply the approach to a study of spatial genetic variation of trout in a stream network in Connecticut, USA.
Journal: Journal of the American Statistical Association
Pages: 497-507
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1224714
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1224714
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:497-507
Template-Type: ReDIF-Article 1.0
Author-Name: Beibei Guo
Author-X-Name-First: Beibei
Author-X-Name-Last: Guo
Author-Name: Ying Yuan
Author-X-Name-First: Ying
Author-X-Name-Last: Yuan
Title: Bayesian Phase I/II Biomarker-Based Dose Finding for Precision Medicine With Molecularly Targeted Agents
Abstract:
The optimal dose for treating patients with a molecularly targeted agent may differ according to the patient's individual characteristics, such as biomarker status. In this article, we propose a Bayesian phase I/II dose-finding design to find the optimal dose that is personalized for each patient according to his/her biomarker status. To overcome the curse of dimensionality caused by the relatively large number of biomarkers and their interactions with the dose, we employ canonical partial least squares (CPLS) to extract a small number of components from the covariate matrix containing the dose, biomarkers, and dose-by-biomarker interactions. Using these components as the covariates, we model the ordinal toxicity and efficacy using the latent-variable approach. Our model accounts for important features of molecularly targeted agents. We quantify the desirability of the dose using a utility function and propose a two-stage dose-finding algorithm to find the personalized optimal dose according to each patient's individual biomarker profile. Simulation studies show that our proposed design has good operating characteristics, with a high probability of identifying the personalized optimal dose. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 508-520
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1228534
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1228534
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:508-520
Template-Type: ReDIF-Article 1.0
Author-Name: Justin Strait
Author-X-Name-First: Justin
Author-X-Name-Last: Strait
Author-Name: Sebastian Kurtek
Author-X-Name-First: Sebastian
Author-X-Name-Last: Kurtek
Author-Name: Emily Bartha
Author-X-Name-First: Emily
Author-X-Name-Last: Bartha
Author-Name: Steven N. MacEachern
Author-X-Name-First: Steven N.
Author-X-Name-Last: MacEachern
Title: Landmark-Constrained Elastic Shape Analysis of Planar Curves
Abstract:
Various approaches to statistical shape analysis exist in current literature. They mainly differ in the representations, metrics, and/or methods for alignment of shapes. One such approach is based on landmarks, that is, mathematically or structurally meaningful points, which ignores the remaining outline information. Elastic shape analysis, a more recent approach, attempts to fix this by using a special functional representation of the parametrically defined outline to perform shape registration, and subsequent statistical analyses. However, the lack of landmark identification can lead to unnatural alignment, particularly in biological and medical applications, where certain features are crucial to shape structure, comparison, and modeling. The main contribution of this work is the definition of a joint landmark-constrained elastic statistical shape analysis framework. We treat landmark points as constraints in the full shape analysis process. Thus, we inherit benefits of both methods: the landmarks help disambiguate shape alignment when the fully automatic elastic shape analysis framework produces unsatisfactory solutions. We provide standard statistical tools on the landmark-constrained shape space including mean and covariance calculation, classification, clustering, and tangent principal component analysis (PCA). We demonstrate the benefits of the proposed framework on complex shapes from the MPEG-7 dataset and two real data examples: mice T2 vertebrae and Hawaiian Drosophila fly wings. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 521-533
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1236726
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1236726
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:521-533
Template-Type: ReDIF-Article 1.0
Author-Name: Brian L. Egleston
Author-X-Name-First: Brian L.
Author-X-Name-Last: Egleston
Author-Name: Robert G. Uzzo
Author-X-Name-First: Robert G.
Author-X-Name-Last: Uzzo
Author-Name: Yu-Ning Wong
Author-X-Name-First: Yu-Ning
Author-X-Name-Last: Wong
Title: Latent Class Survival Models Linked by Principal Stratification to Investigate Heterogenous Survival Subgroups Among Individuals With Early-Stage Kidney Cancer
Abstract:
Rates of kidney cancer have been increasing, with small incidental tumors experiencing the fastest growth rates. Much of the increase could be due to increased use of CT scans, MRIs, and ultrasounds for unrelated conditions. Many tumors might never have been detected or become symptomatic in the past. This suggests that many patients might benefit from less aggressive therapy, such as active surveillance by which tumors are surgically removed only if they become sufficiently large. However, it has been difficult for clinicians to identify subgroups of patients for whom treatment might be especially beneficial or harmful. In this work, we use a principal stratification framework to estimate the proportion and characteristics of individuals who have large or small hazard rates of death in two treatment arms. This allows us to assess who might be helped or harmed by aggressive treatment. We also use Weibull mixture models. This work differs from much previous work in that the survival classes upon which principal stratification is based are latent variables. That is, survival class is not an observed variable. We apply this work using Surveillance Epidemiology and End Results-Medicare claims data. Clinicians can use our methods for investigating treatments with heterogeneous effects.
Journal: Journal of the American Statistical Association
Pages: 534-546
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1240078
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1240078
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:534-546
Template-Type: ReDIF-Article 1.0
Author-Name: José R. Zubizarreta
Author-X-Name-First: José R.
Author-X-Name-Last: Zubizarreta
Author-Name: Luke Keele
Author-X-Name-First: Luke
Author-X-Name-Last: Keele
Title: Optimal Multilevel Matching in Clustered Observational Studies: A Case Study of the Effectiveness of Private Schools Under a Large-Scale Voucher System
Abstract:
A distinctive feature of a clustered observational study is its multilevel or nested data structure arising from the assignment of treatment, in a nonrandom manner, to groups or clusters of units or individuals. Examples are ubiquitous in the health and social sciences including patients in hospitals, employees in firms, and students in schools. What is the optimal matching strategy in a clustered observational study? At first thought, one might start by matching clusters of individuals and then, within matched clusters, continue by matching individuals. But as we discuss in this article, the optimal strategy is the opposite: in typical applications, where the intracluster correlation is not one, it is best to first match individuals and, once all possible combinations of matched individuals are known, then match clusters. In this article, we use dynamic and integer programming to implement this strategy and extend optimal matching methods to hierarchical and multilevel settings. Among other matched designs, our strategy can approximate a paired clustered randomized study by finding the largest sample of matched pairs of treated and control individuals within matched pairs of treated and control clusters that is balanced according to specifications given by the investigator. This strategy directly balances covariates both at the cluster and individual levels and does not require estimating the propensity score, although the propensity score can be balanced as an additional covariate. We illustrate our results with a case study of the comparative effectiveness of public versus private voucher schools in Chile, a question of intense policy debate in the country at present.
Journal: Journal of the American Statistical Association
Pages: 547-560
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1240683
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1240683
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:547-560
Template-Type: ReDIF-Article 1.0
Author-Name: Dalia Chakrabarty
Author-X-Name-First: Dalia
Author-X-Name-Last: Chakrabarty
Title: A New Bayesian Test to Test for the Intractability-Countering Hypothesis
Abstract:
We present a new test of hypothesis in which we seek the probability of the null conditioned on the data, where the null is a simplification undertaken to counter the intractability of the more complex model that the simpler null model is nested within. With the more complex model rendered intractable, the null model uses a simplifying assumption that capacitates the learning of an unknown parameter vector given the data. Bayes factors are shown to be known only up to a ratio of unknown data-dependent constants—a problem that cannot be cured using prescriptions similar to those suggested to solve the problem that noninformative priors cause for Bayes factor computation. Thus, a new test is needed in which we can circumvent Bayes factor computation. In this test, we undertake generation of data from the model in which the null hypothesis is true and can achieve support in the measured data for the null by comparing the marginalized posterior of the model parameter given the measured data, to that given such generated data. However, such a ratio of marginalized posteriors can confound interpretation of comparison of support in one measured data for a null, with that in another dataset for a different null. Given an application in which such comparison is undertaken, we alternatively define support in a measured dataset for a null by identifying the model parameters that are less consistent with the measured data than is minimally possible given the generated data, and realizing that the higher the number of such parameter values, the lower the support in the measured data for the null. Then, the probability of the null conditional on the data is given within a Markov chain Monte Carlo (MCMC)-based scheme, by marginalizing the posterior given the measured data, over parameter values that are as, or more consistent with the measured data, than with the generated data.
In the aforementioned application, we test the hypothesis that a galactic state-space bears an isotropic geometry, where the (missing) data comprising measurements of some components of the state-space vector of a sample of observed galactic particles are implemented to Bayesianly learn the gravitational mass density of all matter in the galaxy. In lieu of an assumption about the state-space being isotropic, the likelihood of the sought gravitational mass density given the data is intractable. For a real example galaxy, we find unequal values of the probability of the null—that the host state-space is isotropic—given two different datasets, implying that in this galaxy, the system state-space constitutes at least two disjoint sub-volumes that the two datasets, respectively, live in. Implementation on simulated galactic data is also undertaken, as is an empirical illustration on the well-known O-ring data, to test for the form of the thermal variation of the failure probability of the O-rings. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 561-577
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1240684
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1240684
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:561-577
Template-Type: ReDIF-Article 1.0
Author-Name: Mevin B. Hooten
Author-X-Name-First: Mevin B.
Author-X-Name-Last: Hooten
Author-Name: Devin S. Johnson
Author-X-Name-First: Devin S.
Author-X-Name-Last: Johnson
Title: Basis Function Models for Animal Movement
Abstract:
Advances in satellite-based data collection techniques have served as a catalyst for new statistical methodology to analyze these data. In wildlife ecological studies, satellite-based data and methodology have provided a wealth of information about animal space use and the investigation of individual-based animal–environment relationships. With the technology for data collection improving dramatically over time, we are left with massive archives of historical animal telemetry data of varying quality. While many contemporary statistical approaches for inferring movement behavior are specified in discrete time, we develop a flexible continuous-time stochastic integral equation framework that is amenable to reduced-rank second-order covariance parameterizations. We demonstrate how the associated first-order basis functions can be constructed to mimic behavioral characteristics in realistic trajectory processes using telemetry data from mule deer and mountain lion individuals in western North America. Our approach is parallelizable and provides inference for heterogeneous trajectories using nonstationary spatial modeling techniques that are feasible for large telemetry datasets. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 578-589
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1246250
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1246250
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:578-589
Template-Type: ReDIF-Article 1.0
Author-Name: Matthew Blackwell
Author-X-Name-First: Matthew
Author-X-Name-Last: Blackwell
Title: Instrumental Variable Methods for Conditional Effects and Causal Interaction in Voter Mobilization Experiments
Abstract:
In democratic countries, voting is one of the most important ways for citizens to influence policy and hold their representatives accountable. And yet, in the United States and many other countries, rates of voter turnout are alarmingly low. Every election cycle, mobilization efforts encourage citizens to vote and ensure that elections reflect the true will of the people. To establish the most effective way of encouraging voter turnout, this article seeks to differentiate between (1) the synergy hypothesis that multiple instances of voter contact increase the effectiveness of a single form of contact, and (2) the diminishing returns hypothesis that multiple instances of contact are less effective or even counterproductive. Remarkably, previous studies have been unable to compare these hypotheses because extant approaches to analyzing experiments with noncompliance cannot speak to questions of causal interaction. I resolve this impasse by extending the traditional instrumental variables framework to accommodate multiple treatment–instrument pairs, which allows for the estimation of conditional and interaction effects to adjudicate between synergy and diminishing returns. The analysis of two voter mobilization field experiments provides the first evidence of diminishing returns to follow-up contact and a cautionary tale about experimental design for these quantities. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 590-599
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1246363
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1246363
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:590-599
Template-Type: ReDIF-Article 1.0
Author-Name: Sokbae Lee
Author-X-Name-First: Sokbae
Author-X-Name-Last: Lee
Author-Name: Myung Hwan Seo
Author-X-Name-First: Myung Hwan
Author-X-Name-Last: Seo
Author-Name: Youngki Shin
Author-X-Name-First: Youngki
Author-X-Name-Last: Shin
Title: Correction
Abstract:
This note provides a correction to Lee, S., Seo, M. H., and Shin, Y. (2011), “Testing for Threshold Effects in Regression Models,” Journal of the American Statistical Association, 106, 220–231.
Journal: Journal of the American Statistical Association
Pages: 883-883
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2016.1273114
File-URL: http://hdl.handle.net/10.1080/01621459.2016.1273114
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:883-883
Template-Type: ReDIF-Article 1.0
Author-Name: David M. Blei
Author-X-Name-First: David M.
Author-X-Name-Last: Blei
Author-Name: Alp Kucukelbir
Author-X-Name-First: Alp
Author-X-Name-Last: Kucukelbir
Author-Name: Jon D. McAuliffe
Author-X-Name-First: Jon D.
Author-X-Name-Last: McAuliffe
Title: Variational Inference: A Review for Statisticians
Abstract:
One of the core problems of modern statistics is to approximate difficult-to-compute probability densities. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation involving the posterior density. In this article, we review variational inference (VI), a method from machine learning that approximates probability densities through optimization. VI has been used in many applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling. The idea behind VI is to first posit a family of densities and then to find a member of that family which is close to the target density. Closeness is measured by Kullback–Leibler divergence. We review the ideas behind mean-field variational inference, discuss the special case of VI applied to exponential family models, present a full example with a Bayesian mixture of Gaussians, and derive a variant that uses stochastic optimization to scale up to massive data. We discuss modern research in VI and highlight important open problems. VI is powerful, but it is not yet well understood. Our hope in writing this article is to catalyze statistical research on this class of algorithms. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 859-877
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2017.1285773
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1285773
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:859-877
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Book Reviews
Journal: Journal of the American Statistical Association
Pages: 878-882
Issue: 518
Volume: 112
Year: 2017
Month: 4
X-DOI: 10.1080/01621459.2017.1325629
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1325629
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:112:y:2017:i:518:p:878-882
Template-Type: ReDIF-Article 1.0
Author-Name: Stéphane Guerrier
Author-X-Name-First: Stéphane
Author-X-Name-Last: Guerrier
Author-Name: Elise Dupuis-Lozeron
Author-X-Name-First: Elise
Author-X-Name-Last: Dupuis-Lozeron
Author-Name: Yanyuan Ma
Author-X-Name-First: Yanyuan
Author-X-Name-Last: Ma
Author-Name: Maria-Pia Victoria-Feser
Author-X-Name-First: Maria-Pia
Author-X-Name-Last: Victoria-Feser
Title: Simulation-Based Bias Correction Methods for Complex Models
Abstract:
Along with the ever-increasing data size and model complexity, an important challenge frequently encountered in constructing new estimators or in implementing a classical one such as the maximum likelihood estimator, is the computational aspect of the estimation procedure. To carry out estimation, approximate methods such as pseudo-likelihood functions or approximated estimating equations are increasingly used in practice as these methods are typically easier to implement numerically although they can lead to inconsistent and/or biased estimators. In this context, we extend and provide refinements on the known bias correction properties of two simulation-based methods, respectively, indirect inference and bootstrap, each with two alternatives. These results allow one to build a framework defining simulation-based estimators that can be implemented for complex models. Indeed, based on a biased or even inconsistent estimator, several simulation-based methods can be used to define new estimators that are both consistent and with reduced finite sample bias. This framework includes the classical method of indirect inference for bias correction without requiring specification of an auxiliary model. We demonstrate the equivalence between one version of the indirect inference and the iterative bootstrap, both of which correct sample biases up to order n−3. The iterative method can be thought of as a computationally efficient algorithm to solve the optimization problem of the indirect inference. Our results provide different tools to correct the asymptotic as well as finite sample biases of estimators and give insight on which method should be applied for the problem at hand. The usefulness of the proposed approach is illustrated with the estimation of robust income distributions and generalized linear latent variable models. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 146-157
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1380031
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1380031
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:146-157
Template-Type: ReDIF-Article 1.0
Author-Name: Jingshu Wang
Author-X-Name-First: Jingshu
Author-X-Name-Last: Wang
Author-Name: Art B. Owen
Author-X-Name-First: Art B.
Author-X-Name-Last: Owen
Title: Admissibility in Partial Conjunction Testing
Abstract:
Meta-analysis combines results from multiple studies aiming to increase power in finding their common effect. It would typically reject the null hypothesis of no effect if any one of the studies shows strong significance. The partial conjunction null hypothesis is rejected only when at least r of n component hypotheses are nonnull, with r = 1 corresponding to a usual meta-analysis. Compared with meta-analysis, it can encourage replicable findings across studies. A by-product of it when applied to different r values is a confidence interval of r quantifying the proportion of nonnull studies. Benjamini and Heller (2008) provided a valid test for the partial conjunction null by ignoring the r − 1 smallest p-values and applying a valid meta-analysis p-value to the remaining n − r + 1 p-values. We provide necessary and sufficient conditions for an admissible combined p-value for the partial conjunction hypothesis among monotone tests. Non-monotone tests always dominate monotone tests but are usually too unreasonable to be used in practice. Based on these findings, we propose a generalized form of Benjamini and Heller’s test which allows usage of various types of meta-analysis p-values, and apply our method to an example in assessing replicable benefit of new anticoagulants across subgroups of patients for stroke prevention.
Journal: Journal of the American Statistical Association
Pages: 158-168
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1385465
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1385465
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:158-168
Template-Type: ReDIF-Article 1.0
Author-Name: Paul Fearnhead
Author-X-Name-First: Paul
Author-X-Name-Last: Fearnhead
Author-Name: Guillem Rigaill
Author-X-Name-First: Guillem
Author-X-Name-Last: Rigaill
Title: Changepoint Detection in the Presence of Outliers
Abstract:
Many traditional methods for identifying changepoints can struggle in the presence of outliers, or when the noise is heavy-tailed. Often they will infer additional changepoints to fit the outliers. To overcome this problem, data often needs to be preprocessed to remove outliers, though this is difficult for applications where the data needs to be analyzed online. We present an approach to changepoint detection that is robust to the presence of outliers. The idea is to adapt existing penalized cost approaches for detecting changes so that they use loss functions that are less sensitive to outliers. We argue that loss functions that are bounded, such as the classical biweight loss, are particularly suitable—as we show that only bounded loss functions are robust to arbitrarily extreme outliers. We present an efficient dynamic programming algorithm that can find the optimal segmentation under our penalized cost criteria. Importantly, this algorithm can be used in settings where the data needs to be analyzed online. We show that we can consistently estimate the number of changepoints, and accurately estimate their locations, using the biweight loss function. We demonstrate the usefulness of our approach for applications such as analyzing well-log data, detecting copy number variation, and detecting tampering of wireless devices. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 169-183
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1385466
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1385466
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:169-183
Template-Type: ReDIF-Article 1.0
Author-Name: Yang Ni
Author-X-Name-First: Yang
Author-X-Name-Last: Ni
Author-Name: Francesco C. Stingo
Author-X-Name-First: Francesco C.
Author-X-Name-Last: Stingo
Author-Name: Veerabhadran Baladandayuthapani
Author-X-Name-First: Veerabhadran
Author-X-Name-Last: Baladandayuthapani
Title: Bayesian Graphical Regression
Abstract:
We consider the problem of modeling conditional independence structures in heterogeneous data in the presence of additional subject-level covariates—termed graphical regression. We propose a novel specification of a conditional (in)dependence function of covariates—which allows the structure of a directed graph to vary flexibly with the covariates; imposes sparsity in both edge and covariate selection; produces both subject-specific and predictive graphs; and is computationally tractable. We provide theoretical justifications of our modeling endeavor, in terms of graphical model selection consistency. We demonstrate the performance of our method through rigorous simulation studies. We illustrate our approach in a cancer genomics-based precision medicine paradigm, wherein we explore gene regulatory networks in multiple myeloma taking prognostic clinical factors into account to obtain both population-level and subject-level gene regulatory networks. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 184-197
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1389739
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1389739
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:184-197
Template-Type: ReDIF-Article 1.0
Author-Name: Xiaojun Mao
Author-X-Name-First: Xiaojun
Author-X-Name-Last: Mao
Author-Name: Song Xi Chen
Author-X-Name-First: Song Xi
Author-X-Name-Last: Chen
Author-Name: Raymond K. W. Wong
Author-X-Name-First: Raymond K. W.
Author-X-Name-Last: Wong
Title: Matrix Completion With Covariate Information
Abstract:
This article investigates the problem of matrix completion from corrupted data when additional covariates are available. Despite being seldom considered in the matrix completion literature, these covariates often provide valuable information for completing the unobserved entries of the high-dimensional target matrix A0. Given a covariate matrix X with its rows representing the row covariates of A0, we consider a column-space-decomposition model A0 = Xβ0 + B0, where β0 is a coefficient matrix and B0 is a low-rank matrix orthogonal to X in terms of column space. This model facilitates a clear separation between the interpretable covariate effects (Xβ0) and the flexible hidden factor effects (B0). Besides, our work allows the probabilities of observation to depend on the covariate matrix, and hence a missing-at-random mechanism is permitted. We propose a novel penalized estimator for A0 by utilizing both Frobenius-norm and nuclear-norm regularizations with an efficient and scalable algorithm. Asymptotic convergence rates of the proposed estimators are studied. The empirical performance of the proposed methodology is illustrated via both numerical experiments and a real data application.
Journal: Journal of the American Statistical Association
Pages: 198-210
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1389740
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1389740
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:198-210
Template-Type: ReDIF-Article 1.0
Author-Name: Xinghao Qiao
Author-X-Name-First: Xinghao
Author-X-Name-Last: Qiao
Author-Name: Shaojun Guo
Author-X-Name-First: Shaojun
Author-X-Name-Last: Guo
Author-Name: Gareth M. James
Author-X-Name-First: Gareth M.
Author-X-Name-Last: James
Title: Functional Graphical Models
Abstract:
Graphical models have attracted increasing attention in recent years, especially in settings involving high-dimensional data. In particular, Gaussian graphical models are used to model the conditional dependence structure among multiple Gaussian random variables. As a result of its computational efficiency, the graphical lasso (glasso) has become one of the most popular approaches for fitting high-dimensional graphical models. In this paper, we extend the graphical models concept to model the conditional dependence structure among p random functions. In this setting, not only is p large, but each function is itself a high-dimensional object, posing an additional level of statistical and computational complexity. We develop an extension of the glasso criterion (fglasso), which estimates the functional graphical model by imposing a block sparsity constraint on the precision matrix, via a group lasso penalty. The fglasso criterion can be optimized using an efficient block coordinate descent algorithm. We establish the concentration inequalities of the estimates, which guarantee the desirable graph support recovery property, that is, with probability tending to one, the fglasso will correctly identify the true conditional dependence structure. Finally, we show that the fglasso significantly outperforms possible competing methods through both simulations and an analysis of a real-world electroencephalography dataset comparing alcoholic and nonalcoholic patients.
Journal: Journal of the American Statistical Association
Pages: 211-222
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1390466
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1390466
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:211-222
Template-Type: ReDIF-Article 1.0
Author-Name: Mauricio Sadinle
Author-X-Name-First: Mauricio
Author-X-Name-Last: Sadinle
Author-Name: Jing Lei
Author-X-Name-First: Jing
Author-X-Name-Last: Lei
Author-Name: Larry Wasserman
Author-X-Name-First: Larry
Author-X-Name-Last: Wasserman
Title: Least Ambiguous Set-Valued Classifiers With Bounded Error Levels
Abstract:
In most classification tasks, there are observations that are ambiguous and therefore difficult to correctly label. Set-valued classifiers output sets of plausible labels rather than a single label, thereby giving a more appropriate and informative treatment to the labeling of ambiguous instances. We introduce a framework for multiclass set-valued classification, where the classifiers guarantee user-defined levels of coverage or confidence (the probability that the true label is contained in the set) while minimizing the ambiguity (the expected size of the output). We first derive oracle classifiers assuming the true distribution to be known. We show that the oracle classifiers are obtained from level sets of the functions that define the conditional probability of each class. Then we develop estimators with good asymptotic and finite sample properties. The proposed estimators build on existing single-label classifiers. The optimal classifier can sometimes output the empty set, but we provide two solutions to fix this issue that are suitable for various practical needs. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 223-234
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1395341
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1395341
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:223-234
Template-Type: ReDIF-Article 1.0
Author-Name: Amy Willis
Author-X-Name-First: Amy
Author-X-Name-Last: Willis
Title: Confidence Sets for Phylogenetic Trees
Abstract:
Inferring evolutionary histories (phylogenetic trees) has important applications in biology, criminology, and public health. However, phylogenetic trees are complex mathematical objects that reside in a non-Euclidean space, which complicates their analysis. While our mathematical, algorithmic, and probabilistic understanding of phylogenies in their metric space is mature, rigorous inferential infrastructure is as yet undeveloped. In this manuscript, we unify recent computational and probabilistic advances to construct tree-valued confidence sets. The procedure accounts for both center and multiple directions of tree-valued variability. We draw on block replicates to improve testing, identifying the best supported most recent ancestor of the Zika virus, and formally testing the hypothesis that a Floridian dentist with AIDS infected two of his patients with HIV. The method illustrates connections between variability in Euclidean and tree space, opening phylogenetic tree analysis to techniques available in the multivariate Euclidean setting. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 235-244
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1395342
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1395342
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:235-244
Template-Type: ReDIF-Article 1.0
Author-Name: Li Ma
Author-X-Name-First: Li
Author-X-Name-Last: Ma
Author-Name: Jialiang Mao
Author-X-Name-First: Jialiang
Author-X-Name-Last: Mao
Title: Fisher Exact Scanning for Dependency
Abstract:
We introduce a method—called Fisher exact scanning (FES)—for testing and identifying variable dependency that generalizes Fisher’s exact test on 2 × 2 contingency tables to R × C contingency tables and continuous sample spaces. FES proceeds through scanning over the sample space using windows in the form of 2 × 2 tables of various sizes, and on each window conducting a Fisher’s exact test. Based on a factorization of Fisher’s multivariate hypergeometric (MHG) likelihood into the product of the univariate hypergeometric likelihoods, we show that there exists a coarse-to-fine, sequential generative representation for the MHG model in the form of a Bayesian network, which in turn implies the mutual independence (up to deviation due to discreteness) among the Fisher’s exact tests conducted under FES. This allows an exact characterization of the joint null distribution of the p-values and gives rise to an effective inference recipe through simple multiple testing procedures such as Šidák and Bonferroni corrections, eliminating the need for resampling. In addition, FES can characterize dependency through reporting significant windows after multiple testing control. The computational complexity of FES is approximately linear in the sample size, which along with the avoidance of resampling makes it ideal for analyzing massive datasets. We use extensive numerical studies to illustrate the workings of FES and compare it to several state-of-the-art methods for testing dependency in both statistical and computational performance. Finally, we apply FES to analyzing a microbiome dataset and further investigate its relationship with other popular dependency metrics in that context. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 245-258
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1397522
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1397522
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:245-258
Template-Type: ReDIF-Article 1.0
Author-Name: Anna Bellach
Author-X-Name-First: Anna
Author-X-Name-Last: Bellach
Author-Name: Michael R. Kosorok
Author-X-Name-First: Michael R.
Author-X-Name-Last: Kosorok
Author-Name: Ludger Rüschendorf
Author-X-Name-First: Ludger
Author-X-Name-Last: Rüschendorf
Author-Name: Jason P. Fine
Author-X-Name-First: Jason P.
Author-X-Name-Last: Fine
Title: Weighted NPMLE for the Subdistribution of a Competing Risk
Abstract:
Direct regression modeling of the subdistribution has become popular for analyzing data with multiple, competing event types. All general approaches so far are based on nonlikelihood-based procedures and target covariate effects on the subdistribution. We introduce a novel weighted likelihood function that allows for a direct extension of the Fine–Gray model to a broad class of semiparametric regression models. The model accommodates time-dependent covariate effects on the subdistribution hazard. To motivate the proposed likelihood method, we derive standard nonparametric estimators and discuss a new interpretation based on pseudo risk sets. We establish consistency and asymptotic normality of the estimators and propose a sandwich estimator of the variance. In comprehensive simulation studies, we demonstrate the solid performance of the weighted nonparametric maximum likelihood estimation in the presence of independent right censoring. We provide an application to a very large bone marrow transplant dataset, thereby illustrating its practical utility. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 259-270
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1401540
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1401540
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:259-270
Template-Type: ReDIF-Article 1.0
Author-Name: Yang Li
Author-X-Name-First: Yang
Author-X-Name-Last: Li
Author-Name: Jun S. Liu
Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu
Title: Robust Variable and Interaction Selection for Logistic Regression and General Index Models
Abstract:
Under the logistic regression framework, we propose a forward-backward method, SODA, for variable selection with both main and quadratic interaction terms. In the forward stage, SODA adds in predictors that have significant overall effects, whereas in the backward stage SODA removes unimportant terms to optimize the extended Bayesian information criterion (EBIC). Compared with existing methods for variable selection in quadratic discriminant analysis, SODA can deal with high-dimensional data in which the number of predictors is much larger than the sample size and does not require the joint normality assumption on predictors, leading to much enhanced robustness. We further extend SODA to conduct variable selection and model fitting for general index models. Compared with existing variable selection methods based on the sliced inverse regression (SIR), SODA requires neither linearity nor constant variance condition and is thus more robust. Our theoretical analysis establishes the variable-selection consistency of SODA under high-dimensional settings, and our simulation studies as well as real-data applications demonstrate superior performances of SODA in dealing with non-Gaussian design matrices in both logistic and general index models. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 271-286
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1401541
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1401541
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:271-286
Template-Type: ReDIF-Article 1.0
Author-Name: Yacine Aït-Sahalia
Author-X-Name-First: Yacine
Author-X-Name-Last: Aït-Sahalia
Author-Name: Dacheng Xiu
Author-X-Name-First: Dacheng
Author-X-Name-Last: Xiu
Title: Principal Component Analysis of High-Frequency Data
Abstract:
We develop the necessary methodology to conduct principal component analysis at high frequency. We construct estimators of realized eigenvalues, eigenvectors, and principal components, and provide the asymptotic distribution of these estimators. Empirically, we study the high-frequency covariance structure of the constituents of the S&P 100 Index using as little as one week of high-frequency data at a time, and examine whether it is compatible with the evidence accumulated over decades of lower frequency returns. We find a surprising consistency between the low- and high-frequency structures. During the recent financial crisis, the first principal component becomes increasingly dominant, explaining up to 60% of the variation on its own, while the second principal component drives the common variation of financial sector stocks. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 287-303
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1401542
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1401542
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:287-303
Template-Type: ReDIF-Article 1.0
Author-Name: Peng Ding
Author-X-Name-First: Peng
Author-X-Name-Last: Ding
Author-Name: Avi Feller
Author-X-Name-First: Avi
Author-X-Name-Last: Feller
Author-Name: Luke Miratrix
Author-X-Name-First: Luke
Author-X-Name-Last: Miratrix
Title: Decomposing Treatment Effect Variation
Abstract:
Understanding and characterizing treatment effect variation in randomized experiments has become essential for going beyond the “black box” of the average treatment effect. Nonetheless, traditional statistical approaches often ignore or assume away such variation. In the context of randomized experiments, this article proposes a framework for decomposing overall treatment effect variation into a systematic component explained by observed covariates and a remaining idiosyncratic component. Our framework is fully randomization-based, with estimates of treatment effect variation that are entirely justified by the randomization itself. Our framework can also account for noncompliance, which is an important practical complication. We make several contributions. First, we show that randomization-based estimates of systematic variation are very similar in form to estimates from fully interacted linear regression and two-stage least squares. Second, we use these estimators to develop an omnibus test for systematic treatment effect variation, both with and without noncompliance. Third, we propose an R2-like measure of treatment effect variation explained by covariates and, when applicable, noncompliance. Finally, we assess these methods via simulation studies and apply them to the Head Start Impact Study, a large-scale randomized experiment. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 304-317
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1407322
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1407322
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:304-317
Template-Type: ReDIF-Article 1.0
Author-Name: Rahul Mazumder
Author-X-Name-First: Rahul
Author-X-Name-Last: Mazumder
Author-Name: Arkopal Choudhury
Author-X-Name-First: Arkopal
Author-X-Name-Last: Choudhury
Author-Name: Garud Iyengar
Author-X-Name-First: Garud
Author-X-Name-Last: Iyengar
Author-Name: Bodhisattva Sen
Author-X-Name-First: Bodhisattva
Author-X-Name-Last: Sen
Title: A Computational Framework for Multivariate Convex Regression and Its Variants
Abstract:
We study the nonparametric least squares estimator (LSE) of a multivariate convex regression function. The LSE, given as the solution to a quadratic program with O(n^2) linear constraints (n being the sample size), is difficult to compute for large problems. Exploiting problem specific structure, we propose a scalable algorithmic framework based on the augmented Lagrangian method to compute the LSE. We develop a novel approach to obtain smooth convex approximations to the fitted (piecewise affine) convex LSE and provide formal bounds on the quality of approximation. When the number of samples is not too large compared to the dimension of the predictor, we propose a regularization scheme—Lipschitz convex regression—where we constrain the norm of the subgradients, and study the rates of convergence of the obtained LSE. Our algorithmic framework is simple and flexible and can be easily adapted to handle variants: estimation of a nondecreasing/nonincreasing convex/concave (with or without a Lipschitz bound) function. We perform numerical studies illustrating the scalability of the proposed algorithm—on some instances our proposal leads to more than a 10,000-fold improvement in runtime when compared to off-the-shelf interior point solvers for problems with n = 500.
Journal: Journal of the American Statistical Association
Pages: 318-331
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1407771
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1407771
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:318-331
Template-Type: ReDIF-Article 1.0
Author-Name: Benjamin B. Risk
Author-X-Name-First: Benjamin B.
Author-X-Name-Last: Risk
Author-Name: David S. Matteson
Author-X-Name-First: David S.
Author-X-Name-Last: Matteson
Author-Name: David Ruppert
Author-X-Name-First: David
Author-X-Name-Last: Ruppert
Title: Linear Non-Gaussian Component Analysis Via Maximum Likelihood
Abstract:
Independent component analysis (ICA) is popular in many applications, including cognitive neuroscience and signal processing. Due to computational constraints, principal component analysis (PCA) is used for dimension reduction prior to ICA (PCA+ICA), which could remove important information. The problem is that interesting independent components (ICs) could be mixed in several principal components that are discarded and then these ICs cannot be recovered. We formulate a linear non-Gaussian component model with Gaussian noise components. To estimate the model parameters, we propose likelihood component analysis (LCA), in which dimension reduction and latent variable estimation are achieved simultaneously. Our method orders components by their marginal likelihood rather than ordering components by variance as in PCA. We present a parametric LCA using the logistic density and a semiparametric LCA using tilted Gaussians with cubic B-splines. Our algorithm is scalable to datasets common in applications (e.g., hundreds of thousands of observations across hundreds of variables with dozens of latent components). In simulations, latent components are recovered that are discarded by PCA+ICA methods. We apply our method to multivariate data and demonstrate that LCA is a useful data visualization and dimension reduction tool that reveals features not apparent from PCA or PCA+ICA. We also apply our method to a functional magnetic resonance imaging experiment from the Human Connectome Project and identify artifacts missed by PCA+ICA. We present theoretical results on identifiability of the linear non-Gaussian component model and consistency of LCA. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 332-343
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1407772
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1407772
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:332-343
Template-Type: ReDIF-Article 1.0
Author-Name: S. Luo
Author-X-Name-First: S.
Author-X-Name-Last: Luo
Author-Name: R. Song
Author-X-Name-First: R.
Author-X-Name-Last: Song
Author-Name: M. Styner
Author-X-Name-First: M.
Author-X-Name-Last: Styner
Author-Name: J. H. Gilmore
Author-X-Name-First: J. H.
Author-X-Name-Last: Gilmore
Author-Name: H. Zhu
Author-X-Name-First: H.
Author-X-Name-Last: Zhu
Title: FSEM: Functional Structural Equation Models for Twin Functional Data
Abstract:
The aim of this article is to develop a novel class of functional structural equation models (FSEMs) for dissecting functional genetic and environmental effects on twin functional data, while characterizing the varying association between functional data and covariates of interest. We propose a three-stage estimation procedure to estimate varying coefficient functions for various covariates (e.g., gender) as well as three covariance operators for the genetic and environmental effects. We develop an inference procedure based on weighted likelihood ratio statistics to test the genetic/environmental effect at either a fixed location or a compact region. We also systematically carry out the theoretical analysis of the estimated varying functions, the weighted likelihood ratio statistics, and the estimated covariance operators. We conduct extensive Monte Carlo simulations to examine the finite-sample performance of the estimation and inference procedures. We apply the proposed FSEM to quantify the degree of genetic and environmental effects on twin white matter tracts obtained from the UNC early brain development study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 344-357
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1407773
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1407773
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:344-357
Template-Type: ReDIF-Article 1.0
Author-Name: Zijian Guo
Author-X-Name-First: Zijian
Author-X-Name-Last: Guo
Author-Name: Wanjie Wang
Author-X-Name-First: Wanjie
Author-X-Name-Last: Wang
Author-Name: T. Tony Cai
Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai
Author-Name: Hongzhe Li
Author-X-Name-First: Hongzhe
Author-X-Name-Last: Li
Title: Optimal Estimation of Genetic Relatedness in High-Dimensional Linear Models
Abstract:
Estimating the genetic relatedness between two traits based on genome-wide association data is an important problem in genetics research. In the framework of high-dimensional linear models, we introduce two measures of genetic relatedness and develop optimal estimators for them. One is genetic covariance, which is defined to be the inner product of the two regression vectors, and the other is genetic correlation, which is the inner product normalized by the lengths of the two vectors. We propose functional de-biased estimators (FDEs), which consist of an initial estimation step with the plug-in scaled Lasso estimator, and a further bias correction step. We also develop estimators of the quadratic functionals of the regression vectors, which can be used to estimate the heritability of each trait. The estimators are shown to be minimax rate-optimal and can be efficiently implemented. Simulation results show that FDEs provide better estimates of the genetic relatedness than simple plug-in estimates. FDE is also applied to an analysis of a yeast segregant dataset with multiple traits to estimate the genetic relatedness among these traits. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 358-369
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1407774
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1407774
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:358-369
Template-Type: ReDIF-Article 1.0
Author-Name: Jon Arni Steingrimsson
Author-X-Name-First: Jon Arni
Author-X-Name-Last: Steingrimsson
Author-Name: Liqun Diao
Author-X-Name-First: Liqun
Author-X-Name-Last: Diao
Author-Name: Robert L. Strawderman
Author-X-Name-First: Robert L.
Author-X-Name-Last: Strawderman
Title: Censoring Unbiased Regression Trees and Ensembles
Abstract:
This article proposes a novel paradigm for building regression trees and ensemble learning in survival analysis. Generalizations of the classification and regression trees (CART) and random forests (RF) algorithms for general loss functions, and in the latter case more general bootstrap procedures, are both introduced. These results, in combination with an extension of the theory of censoring unbiased transformations (CUTs) applicable to loss functions, underpin the development of two new classes of algorithms for constructing survival trees and survival forests: censoring unbiased regression trees and censoring unbiased regression ensembles. For a certain “doubly robust” CUT of squared error loss, we further show how these new algorithms can be implemented using existing software (e.g., CART, RF). Comparisons of these methods to existing ensemble procedures for predicting survival probabilities are provided in both simulated settings and through applications to four datasets. It is shown that these new methods either improve upon, or remain competitive with, existing implementations of random survival forests, conditional inference forests, and recursively imputed survival trees.
Journal: Journal of the American Statistical Association
Pages: 370-383
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1407775
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1407775
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:370-383
Template-Type: ReDIF-Article 1.0
Author-Name: Yaowu Liu
Author-X-Name-First: Yaowu
Author-X-Name-Last: Liu
Author-Name: Jun Xie
Author-X-Name-First: Jun
Author-X-Name-Last: Xie
Title: Accurate and Efficient P-value Calculation Via Gaussian Approximation: A Novel Monte-Carlo Method
Abstract:
It is of fundamental interest in statistics to test the significance of a set of covariates. For example, in genome-wide association studies, a joint null hypothesis of no genetic effect is tested for a set of multiple genetic variants. The minimum p-value method, higher criticism, and Berk–Jones tests are particularly effective when the covariates with nonzero effects are sparse. However, the correlations among covariates and the non-Gaussian distribution of the response pose a great challenge to the p-value calculation of the three tests. In practice, permutation is commonly used to obtain accurate p-values, but it is computationally very intensive, especially when we need to conduct a large number of hypothesis tests. In this paper, we propose a Gaussian approximation method based on a Monte Carlo scheme, which is computationally more efficient than permutation while still achieving similar accuracy. We derive nonasymptotic approximation error bounds that could vanish in the limit even if the number of covariates is much larger than the sample size. Through real-genotype-based simulations and data analysis of a genome-wide association study of Crohn’s disease, we compare the accuracy and computation cost of our proposed method, of permutation, and of the method based on asymptotic distribution. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 384-392
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1407776
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1407776
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:384-392
Template-Type: ReDIF-Article 1.0
Author-Name: Cody Alsaker
Author-X-Name-First: Cody
Author-X-Name-Last: Alsaker
Author-Name: F. Jay Breidt
Author-X-Name-First: F. Jay
Author-X-Name-Last: Breidt
Author-Name: Mark J. van der Woerd
Author-X-Name-First: Mark J.
Author-X-Name-Last: van der Woerd
Title: Minimum Mean Squared Error Estimation of the Radius of Gyration in Small-Angle X-Ray Scattering Experiments
Abstract:
Small-angle X-ray scattering (SAXS) is a technique that yields low-resolution structural information of biological macromolecules by exposing a large ensemble of molecules in solution to a powerful X-ray beam. The beam interacts with the molecules and the intensity of the scattered beam is recorded on a detector plate. The radius of gyration for a molecule, which is a measure of the spread of its mass, can be estimated from the lowest scattering angles of SAXS data. This estimation method requires specification of a window of scattering angles. Under a local polynomial model with autoregressive errors, we develop methodology and supporting asymptotic theory for selection of an optimal window, minimum mean square error estimation of the radius of gyration, and estimation of its variance. Simulation studies confirm the quality of our asymptotic approximations and the superior performance of the proposed methodology relative to the accepted standard. Our semi-automated methodology makes it feasible to estimate the radius of gyration many times, from replicated SAXS data under various experimental conditions, in an objective and reproducible manner. This in turn allows for secondary analyses of the dataset of estimates, as we demonstrate with a split–split plot analysis for 357 SAXS intensity curves. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 39-47
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1408467
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1408467
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:39-47
Template-Type: ReDIF-Article 1.0
Author-Name: HaiYing Wang
Author-X-Name-First: HaiYing
Author-X-Name-Last: Wang
Author-Name: Min Yang
Author-X-Name-First: Min
Author-X-Name-Last: Yang
Author-Name: John Stufken
Author-X-Name-First: John
Author-X-Name-Last: Stufken
Title: Information-Based Optimal Subdata Selection for Big Data Linear Regression
Abstract:
Extraordinary amounts of data are being produced in many branches of science. Proven statistical methods are no longer applicable with extraordinarily large datasets due to computational limitations. A critical step in big data analysis is data reduction. Existing investigations in the context of linear regression focus on subsampling-based methods. However, not only is this approach prone to sampling errors, it also leads to a covariance matrix of the estimators that is typically bounded from below by a term that is of the order of the inverse of the subdata size. We propose a novel approach, termed information-based optimal subdata selection (IBOSS). Compared to leading existing subdata methods, the IBOSS approach has the following advantages: (i) it is significantly faster; (ii) it is suitable for distributed parallel computing; (iii) the variances of the slope parameter estimators converge to 0 as the full data size increases even if the subdata size is fixed, that is, the convergence rate depends on the full data size; (iv) data analysis for IBOSS subdata is straightforward and the sampling distribution of an IBOSS estimator is easy to assess. Theoretical results and extensive simulations demonstrate that the IBOSS approach is superior to subsampling-based methods, sometimes by orders of magnitude. The advantages of the new approach are also illustrated through analysis of real data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 393-405
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1408468
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1408468
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:393-405
Template-Type: ReDIF-Article 1.0
Author-Name: Raymond K. W. Wong
Author-X-Name-First: Raymond K. W.
Author-X-Name-Last: Wong
Author-Name: Yehua Li
Author-X-Name-First: Yehua
Author-X-Name-Last: Li
Author-Name: Zhengyuan Zhu
Author-X-Name-First: Zhengyuan
Author-X-Name-Last: Zhu
Title: Partially Linear Functional Additive Models for Multivariate Functional Data
Abstract:
We investigate a class of partially linear functional additive models (PLFAM) that predicts a scalar response by both parametric effects of a multivariate predictor and nonparametric effects of a multivariate functional predictor. We jointly model multiple functional predictors that are cross-correlated using multivariate functional principal component analysis (mFPCA), and model the nonparametric effects of the principal component scores as additive components in the PLFAM. To address the high-dimensional nature of functional data, we let the number of mFPCA components diverge to infinity with the sample size, and adopt the component selection and smoothing operator (COSSO) penalty to select relevant components and regularize the fitting. A fundamental difference between our framework and the existing high-dimensional additive models is that the mFPCA scores are estimated with error, and the magnitude of measurement error increases with the order of mFPCA. We establish the asymptotic convergence rate for our estimator, while allowing the number of components to diverge. When the number of additive components is fixed, we also establish the asymptotic distribution for the partially linear coefficients. The practical performance of the proposed methods is illustrated via simulation studies and a crop yield prediction application. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 406-418
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1411268
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1411268
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:406-418
Template-Type: ReDIF-Article 1.0
Author-Name: Damian Brzyski
Author-X-Name-First: Damian
Author-X-Name-Last: Brzyski
Author-Name: Alexej Gossmann
Author-X-Name-First: Alexej
Author-X-Name-Last: Gossmann
Author-Name: Weijie Su
Author-X-Name-First: Weijie
Author-X-Name-Last: Su
Author-Name: Małgorzata Bogdan
Author-X-Name-First: Małgorzata
Author-X-Name-Last: Bogdan
Title: Group SLOPE – Adaptive Selection of Groups of Predictors
Abstract:
Sorted L-One Penalized Estimation (SLOPE; Bogdan et al. 2013, 2015) is a relatively new convex optimization procedure, which allows for adaptive selection of regressors under sparse high-dimensional designs. Here, we extend the idea of SLOPE to deal with the situation when one aims at selecting whole groups of explanatory variables instead of single regressors. Such groups can be formed by clustering strongly correlated predictors or groups of dummy variables corresponding to different levels of the same qualitative predictor. We formulate the respective convex optimization problem, group SLOPE (gSLOPE), and propose an efficient algorithm for its solution. We also define a notion of the group false discovery rate (gFDR) and provide a choice of the sequence of tuning parameters for gSLOPE so that gFDR is provably controlled at a prespecified level if the groups of variables are orthogonal to each other. Moreover, we prove that the resulting procedure adapts to unknown sparsity and is asymptotically minimax with respect to the estimation of the proportions of variance of the response variable explained by regressors from different groups. We also provide a method for the choice of the regularizing sequence when variables in different groups are not orthogonal but statistically independent and illustrate its good properties with computer simulations. Finally, we illustrate the advantages of gSLOPE in the context of Genome Wide Association Studies. R package grpSLOPE with an implementation of our method is available on The Comprehensive R Archive Network.
Journal: Journal of the American Statistical Association
Pages: 419-433
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1411269
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1411269
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:419-433
Template-Type: ReDIF-Article 1.0
Author-Name: Raphaël Huser
Author-X-Name-First: Raphaël
Author-X-Name-Last: Huser
Author-Name: Jennifer L. Wadsworth
Author-X-Name-First: Jennifer L.
Author-X-Name-Last: Wadsworth
Title: Modeling Spatial Processes with Unknown Extremal Dependence Class
Abstract:
Many environmental processes exhibit weakening spatial dependence as events become more extreme. Well-known limiting models, such as max-stable or generalized Pareto processes, cannot capture this, which can lead to a preference for models that exhibit a property known as asymptotic independence. However, weakening dependence does not automatically imply asymptotic independence, and whether the process is truly asymptotically (in)dependent is usually far from clear. The distinction is key as it can have a large impact upon extrapolation, that is, the estimated probabilities of events more extreme than those observed. In this work, we present a single spatial model that is able to capture both dependence classes in a parsimonious manner, and with a smooth transition between the two cases. The model covers a wide range of possibilities from asymptotic independence through to complete dependence, and permits weakening dependence of extremes even under asymptotic dependence. Censored likelihood-based inference for the implied copula is feasible in moderate dimensions due to closed-form margins. The model is applied to oceanographic datasets with ambiguous true limiting dependence structure. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 434-444
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1411813
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1411813
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:434-444
Template-Type: ReDIF-Article 1.0
Author-Name: Geir-Arne Fuglstad
Author-X-Name-First: Geir-Arne
Author-X-Name-Last: Fuglstad
Author-Name: Daniel Simpson
Author-X-Name-First: Daniel
Author-X-Name-Last: Simpson
Author-Name: Finn Lindgren
Author-X-Name-First: Finn
Author-X-Name-Last: Lindgren
Author-Name: Håvard Rue
Author-X-Name-First: Håvard
Author-X-Name-Last: Rue
Title: Constructing Priors that Penalize the Complexity of Gaussian Random Fields
Abstract:
Priors are important for achieving proper posteriors with physically meaningful covariance structures for Gaussian random fields (GRFs) since the likelihood typically only provides limited information about the covariance structure under in-fill asymptotics. We extend the recent penalized complexity prior framework and develop a principled joint prior for the range and the marginal variance of one-dimensional, two-dimensional, and three-dimensional Matérn GRFs with fixed smoothness. The prior is weakly informative and penalizes complexity by shrinking the range toward infinity and the marginal variance toward zero. We propose guidelines for selecting the hyperparameters, and a simulation study shows that the new prior provides a principled alternative to reference priors that can leverage prior knowledge to achieve shorter credible intervals while maintaining good coverage. We extend the prior to a nonstationary GRF parameterized through local ranges and marginal standard deviations, and introduce a scheme for selecting the hyperparameters based on the coverage of the parameters when fitting simulated stationary data. The approach is applied to a dataset of annual precipitation in southern Norway and the scheme for selecting the hyperparameters leads to conservative estimates of nonstationarity and improved predictive performance over the stationary model. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 445-452
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1415907
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1415907
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:445-452
Template-Type: ReDIF-Article 1.0
Author-Name: Zeda Li
Author-X-Name-First: Zeda
Author-X-Name-Last: Li
Author-Name: Robert T. Krafty
Author-X-Name-First: Robert T.
Author-X-Name-Last: Krafty
Title: Adaptive Bayesian Time–Frequency Analysis of Multivariate Time Series
Abstract:
This article introduces a nonparametric approach to multivariate time-varying power spectrum analysis. The procedure adaptively partitions a time series into an unknown number of approximately stationary segments, where some spectral components may remain unchanged across segments, allowing components to evolve differently over time. Local spectra within segments are fit through Whittle likelihood-based penalized spline models of modified Cholesky components, which provide flexible nonparametric estimates that preserve positive definite structures of spectral matrices. The approach is formulated in a Bayesian framework, in which the number and location of partitions are random, and relies on reversible jump Markov chain and Hamiltonian Monte Carlo methods that can adapt to the unknown number of segments and parameters. By averaging over the distribution of partitions, the approach can approximate both abrupt and slowly varying changes in spectral matrices. Empirical performance is evaluated in simulation studies and illustrated through analyses of electroencephalography during sleep and of the El Niño-Southern Oscillation. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 453-465
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1415908
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1415908
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:453-465
Template-Type: ReDIF-Article 1.0
Author-Name: Marco Di Marzio
Author-X-Name-First: Marco
Author-X-Name-Last: Di Marzio
Author-Name: Agnese Panzera
Author-X-Name-First: Agnese
Author-X-Name-Last: Panzera
Author-Name: Charles C. Taylor
Author-X-Name-First: Charles C.
Author-X-Name-Last: Taylor
Title: Nonparametric Rotations for Sphere-Sphere Regression
Abstract:
Regression of data represented as points on a hypersphere has traditionally been treated using parametric families of transformations that include the simple rigid rotation as an important, special case. On the other hand, nonparametric methods have generally focused on modeling a scalar response through a spherical predictor by representing the regression function as a polynomial, leading to component-wise estimation of a spherical response. We propose a very flexible, simple regression model where for each location of the manifold a specific rotation matrix is to be estimated. To make this approach tractable, we assume continuity of the regression function that, in turn, allows for approximations of rotation matrices based on a series expansion. It is seen that the nonrigidity of our technique motivates an iterative estimation within a Newton–Raphson learning scheme, which exhibits bias reduction properties. Extensions to general shape matching are also outlined. Both simulations and real data are used to illustrate the results. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 466-476
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2017.1421542
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1421542
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:466-476
Template-Type: ReDIF-Article 1.0
Author-Name: Yang Ni
Author-X-Name-First: Yang
Author-X-Name-Last: Ni
Author-Name: Francesco C. Stingo
Author-X-Name-First: Francesco C.
Author-X-Name-Last: Stingo
Author-Name: Min Jin Ha
Author-X-Name-First: Min Jin
Author-X-Name-Last: Ha
Author-Name: Rehan Akbani
Author-X-Name-First: Rehan
Author-X-Name-Last: Akbani
Author-Name: Veerabhadran Baladandayuthapani
Author-X-Name-First: Veerabhadran
Author-X-Name-Last: Baladandayuthapani
Title: Bayesian Hierarchical Varying-Sparsity Regression Models with Application to Cancer Proteogenomics
Abstract:
Identifying patient-specific prognostic biomarkers is of critical importance in developing personalized treatment for clinically and molecularly heterogeneous diseases such as cancer. In this article, we propose a novel regression framework, Bayesian hierarchical varying-sparsity regression (BEHAVIOR) models, to select clinically relevant disease markers by integrating proteogenomic (proteomic+genomic) and clinical data. Our methods allow flexible modeling of protein–gene relationships and induce sparsity in both protein–gene and protein–survival relationships, to select genomically driven prognostic protein markers at the patient level. Simulation studies demonstrate the superior performance of BEHAVIOR against competing methods in terms of both protein marker selection and survival prediction. We apply BEHAVIOR to The Cancer Genome Atlas (TCGA) proteogenomic pan-cancer data and find several interesting prognostic proteins and pathways that are shared across multiple cancers and some that exclusively pertain to specific cancers. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available online.
Journal: Journal of the American Statistical Association
Pages: 48-60
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2018.1434529
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1434529
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:48-60
Template-Type: ReDIF-Article 1.0
Author-Name: Mark D. Risser
Author-X-Name-First: Mark D.
Author-X-Name-Last: Risser
Author-Name: Christopher J. Paciorek
Author-X-Name-First: Christopher J.
Author-X-Name-Last: Paciorek
Author-Name: Dáithí A. Stone
Author-X-Name-First: Dáithí A.
Author-X-Name-Last: Stone
Title: Spatially Dependent Multiple Testing Under Model Misspecification, With Application to Detection of Anthropogenic Influence on Extreme Climate Events
Abstract:
The Weather Risk Attribution Forecast (WRAF) is a forecasting tool that uses output from global climate models to make simultaneous attribution statements about whether and how greenhouse gas emissions have contributed to extreme weather across the globe. However, in conducting a large number of simultaneous hypothesis tests, the WRAF is prone to identifying false “discoveries.” A common technique for addressing this multiple testing problem is to adjust the procedure in a way that controls the proportion of true null hypotheses that are incorrectly rejected, or the false discovery rate (FDR). Unfortunately, generic FDR procedures suffer from low power when the hypotheses are dependent, and techniques designed to account for dependence are sensitive to misspecification of the underlying statistical model. In this article, we develop a Bayesian decision-theoretical approach for dependent multiple testing and a nonparametric hierarchical statistical model that flexibly controls false discovery and is robust to model misspecification. We illustrate the robustness of our procedure to model error with a simulation study, using a framework that accounts for generic spatial dependence and allows the practitioner to flexibly specify the decision criteria. Finally, we apply our procedure to several seasonal forecasts and discuss implementation for the WRAF workflow. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 61-78
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2018.1451335
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1451335
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:61-78
Template-Type: ReDIF-Article 1.0
Author-Name: Kwonsang Lee
Author-X-Name-First: Kwonsang
Author-X-Name-Last: Lee
Author-Name: Dylan S. Small
Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small
Title: Estimating the Malaria Attributable Fever Fraction Accounting for Parasites Being Killed by Fever and Measurement Error
Abstract:
Malaria is a major health problem in many tropical regions. Fever is a characteristic symptom of malaria. The fraction of fevers that are attributable to malaria, the malaria attributable fever fraction (MAFF), is an important public health measure in that the MAFF can be used to calculate the number of fevers that would be avoided if malaria was eliminated. Despite such causal interpretation, the MAFF has not been considered in the framework of causal inference. We define the MAFF using the potential outcome framework, and define causal assumptions that current estimation methods rely on. Furthermore, we demonstrate that one of the assumptions—that the parasite density is correctly measured—generally does not hold because (i) fever kills some parasites and (ii) parasite density is measured with error. In the presence of these problems, we reveal that current MAFF estimators can be significantly biased. To develop a consistent estimator, we propose a novel maximum likelihood estimation method based on exponential family g-modeling. Under the assumption that the measurement error mechanism and the magnitude of the fever killing effect are known, we show that our proposed method provides approximately unbiased estimates of the MAFF in simulation studies. A sensitivity analysis is developed to assess the impact of different magnitudes of fever killing and different measurement error mechanisms. Finally, we apply our proposed method to estimate the MAFF in Kilombero, Tanzania. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 79-92
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2018.1469989
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1469989
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:79-92
Template-Type: ReDIF-Article 1.0
Author-Name: Edward H. Kennedy
Author-X-Name-First: Edward H.
Author-X-Name-Last: Kennedy
Author-Name: Steve Harris
Author-X-Name-First: Steve
Author-X-Name-Last: Harris
Author-Name: Luke J. Keele
Author-X-Name-First: Luke J.
Author-X-Name-Last: Keele
Title: Survivor-Complier Effects in the Presence of Selection on Treatment, With Application to a Study of Prompt ICU Admission
Abstract:
Pretreatment selection or censoring (“selection on treatment”) can occur when two treatment levels are compared ignoring the third option of neither treatment, in “censoring by death” settings where treatment is only defined for those who survive long enough to receive it, or in general in studies where the treatment is only defined for a subset of the population. Unfortunately, the standard instrumental variable (IV) estimand is not defined in the presence of such selection, so we consider estimating a new survivor-complier causal effect. Although this effect is generally not identified under standard IV assumptions, it is possible to construct sharp bounds. We derive these bounds and give a corresponding data-driven sensitivity analysis, along with nonparametric yet efficient estimation methods. Importantly, our approach allows for high-dimensional confounding adjustment, and valid inference even after employing machine learning. Incorporating covariates can tighten bounds dramatically, especially when they are strong predictors of the selection process. We apply the methods in a UK cohort study of critical care patients to examine the mortality effects of prompt admission to the intensive care unit, using ICU bed availability as an instrument. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 93-104
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2018.1469990
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1469990
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:93-104
Template-Type: ReDIF-Article 1.0
Author-Name: Mamadou Yauck
Author-X-Name-First: Mamadou
Author-X-Name-Last: Yauck
Author-Name: Louis-Paul Rivest
Author-X-Name-First: Louis-Paul
Author-X-Name-Last: Rivest
Author-Name: Greg Rothman
Author-X-Name-First: Greg
Author-X-Name-Last: Rothman
Title: Capture-Recapture Methods for Data on the Activation of Applications on Mobile Phones
Abstract:
This work is concerned with the analysis of marketing data on the activation of applications (apps) on mobile devices. Each application has a hashed identification number that is specific to the device on which it has been installed. This number can be registered by a platform at each activation of the application. Activations on the same device are linked together using the identification number. By focusing on activations that took place at a business location, one can create a capture-recapture dataset about devices, that is, users, that “visited” the business: the units are owners of mobile devices and the capture occasions are time intervals such as days. A unit is captured when she activates an application, provided that this activation is recorded by the platform providing the data. Statistical capture-recapture techniques can be applied to the app data to estimate the total number of users that visited the business over a time period, thereby providing an indirect estimate of foot traffic. This article argues that the robust design, a method for dealing with a nested mark-recapture experiment, can be used in this context. A new algorithm for estimating the parameters of a robust design with a fairly large number of capture occasions and a simple parametric bootstrap variance estimator are proposed. Moreover, new estimation methods and new theoretical results are introduced for a wider application of the robust design. This is used to analyze a dataset about the mobile devices that visited the auto-dealerships of a major auto brand in a U.S. metropolitan area over a period of one and a half years. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 105-114
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2018.1469991
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1469991
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:105-114
Template-Type: ReDIF-Article 1.0
Author-Name: Anna Louise Schröder
Author-X-Name-First: Anna Louise
Author-X-Name-Last: Schröder
Author-Name: Hernando Ombao
Author-X-Name-First: Hernando
Author-X-Name-Last: Ombao
Title: FreSpeD: Frequency-Specific Change-Point Detection in Epileptic Seizure Multi-Channel EEG Data
Abstract:
The goal in this article is to develop a practical tool that identifies changes in brain activity as recorded in electroencephalograms (EEG). Our method is devised to detect possibly subtle disruptions in normal brain functioning that precede the onset of an epileptic seizure. Moreover, it is able to capture the evolution of seizure spread from one region (or channel) to another. The proposed frequency-specific change-point detection method (FreSpeD) deploys a cumulative sum-type test statistic within a binary segmentation algorithm. We demonstrate the theoretical properties of FreSpeD and show its robustness to parameter choice and its advantages over two competing methods. Furthermore, the FreSpeD method produces directly interpretable output. When applied to epileptic seizure EEG data, FreSpeD identifies the correct brain region as the focal point of seizure and the timing of the seizure onset. Moreover, FreSpeD detects changes in cross-coherence immediately before seizure onset which indicate an evolution leading up to the seizure. These changes are subtle and were not captured by the methods that previously analyzed the same EEG data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 115-128
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2018.1476238
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1476238
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:115-128
Template-Type: ReDIF-Article 1.0
Author-Name: Li Li
Author-X-Name-First: Li
Author-X-Name-Last: Li
Author-Name: Alejandro Jara
Author-X-Name-First: Alejandro
Author-X-Name-Last: Jara
Author-Name: María José García-Zattera
Author-X-Name-First: María José
Author-X-Name-Last: García-Zattera
Author-Name: Timothy E. Hanson
Author-X-Name-First: Timothy E.
Author-X-Name-Last: Hanson
Title: Marginal Bayesian Semiparametric Modeling of Mismeasured Multivariate Interval-Censored Data
Abstract:
Motivated by data gathered in an oral health study, we propose a Bayesian nonparametric approach for population-averaged modeling of correlated time-to-event data, when the responses can only be determined to lie in an interval obtained from a sequence of examination times and the determination of the occurrence of the event is subject to misclassification. The joint model for the true, unobserved time-to-event data is defined semiparametrically; proportional hazards, proportional odds, and accelerated failure time (proportional quantiles) are all fit and compared. The baseline distribution is modeled as a flexible tailfree prior. The joint model is completed by considering a parametric copula function. A general misclassification model is discussed in detail, considering the possibility that different examiners were involved in the assessment of the occurrence of the events for a given subject across time. We provide empirical evidence that the model can be used to estimate the underlying time-to-event distribution and the misclassification parameters without any external information about the latter parameters. We also illustrate the effect on the statistical inferences of neglecting the presence of misclassification. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 129-145
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2018.1476240
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1476240
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:129-145
Template-Type: ReDIF-Article 1.0
Author-Name: Tingting Zhou
Author-X-Name-First: Tingting
Author-X-Name-Last: Zhou
Author-Name: Michael R. Elliott
Author-X-Name-First: Michael R.
Author-X-Name-Last: Elliott
Author-Name: Roderick J. A. Little
Author-X-Name-First: Roderick J. A.
Author-X-Name-Last: Little
Title: Penalized Spline of Propensity Methods for Treatment Comparison
Abstract:
Valid causal inference from observational studies requires controlling for confounders. When time-dependent confounders are present that serve as mediators of treatment effects and affect future treatment assignment, standard regression methods for controlling for confounders fail. Similar issues also arise in trials with sequential randomization, when randomization at later time points is based on intermediate outcomes from earlier randomized assignments. We propose a robust multiple imputation-based approach to causal inference in this setting called penalized spline of propensity methods for treatment comparison (PENCOMP), which builds on the penalized spline of propensity prediction method for missing data problems. PENCOMP estimates causal effects by imputing missing potential outcomes with flexible spline models and draws inference based on imputed and observed outcomes. Under the SUTVA, positivity, and ignorability assumptions, PENCOMP has a double robustness property for causal effects. Simulations suggest that it tends to outperform doubly robust marginal structural modeling when the weights are variable. We apply our method to the multicenter AIDS cohort study to estimate the effect of antiretroviral treatment on CD4 counts in HIV-infected patients. Supplementary materials for this article are available online. Code submitted with this article was checked by an Associate Editor for Reproducibility and is available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1-19
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2018.1518234
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1518234
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:1-19
Template-Type: ReDIF-Article 1.0
Author-Name: Andrew J. Spieker
Author-X-Name-First: Andrew J.
Author-X-Name-Last: Spieker
Title: Comment on Penalized Spline of Propensity Methods for Treatment Comparison by Zhou, Elliott, and Little
Journal: Journal of the American Statistical Association
Pages: 20-23
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2018.1537913
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537913
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:20-23
Template-Type: ReDIF-Article 1.0
Author-Name: Joseph Antonelli
Author-X-Name-First: Joseph
Author-X-Name-Last: Antonelli
Author-Name: Michael J. Daniels
Author-X-Name-First: Michael J.
Author-X-Name-Last: Daniels
Title: Discussion of PENCOMP
Journal: Journal of the American Statistical Association
Pages: 24-27
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2018.1537914
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537914
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:24-27
Template-Type: ReDIF-Article 1.0
Author-Name: Qingxia Chen
Author-X-Name-First: Qingxia
Author-X-Name-Last: Chen
Author-Name: Frank E. Harrell
Author-X-Name-First: Frank E.
Author-X-Name-Last: Harrell
Title: Comment: Penalized Spline of Propensity Methods for Treatment Comparison
Journal: Journal of the American Statistical Association
Pages: 28-30
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2018.1537915
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537915
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:28-30
Template-Type: ReDIF-Article 1.0
Author-Name: Shu Yang
Author-X-Name-First: Shu
Author-X-Name-Last: Yang
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Title: Discussion of “Penalized Spline of Propensity Methods for Treatment Comparison” by Zhou, Elliott, and Little
Journal: Journal of the American Statistical Association
Pages: 30-32
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2018.1537916
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537916
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:30-32
Template-Type: ReDIF-Article 1.0
Author-Name: Georgia Papadogeorgou
Author-X-Name-First: Georgia
Author-X-Name-Last: Papadogeorgou
Author-Name: Fan Li
Author-X-Name-First: Fan
Author-X-Name-Last: Li
Title: Discussion of “Penalized Spline of Propensity Methods for Treatment Comparison”
Journal: Journal of the American Statistical Association
Pages: 32-35
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2018.1543120
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1543120
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:32-35
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Corrigendum
Journal: Journal of the American Statistical Association
Pages: 484-484
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2018.1548858
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1548858
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:484-484
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Correction
Journal: Journal of the American Statistical Association
Pages: 485-485
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2018.1548859
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1548859
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:485-485
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Correction
Journal: Journal of the American Statistical Association
Pages: 486-486
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2018.1548861
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1548861
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:486-486
Template-Type: ReDIF-Article 1.0
Author-Name: Tingting Zhou
Author-X-Name-First: Tingting
Author-X-Name-Last: Zhou
Author-Name: Michael R. Elliott
Author-X-Name-First: Michael R.
Author-X-Name-Last: Elliott
Author-Name: Roderick J. A. Little
Author-X-Name-First: Roderick J. A.
Author-X-Name-Last: Little
Title: Penalized Spline of Propensity Methods for Treatment Comparison: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 35-38
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2019.1576439
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1576439
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:35-38
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Editorial Collaborators
Journal: Journal of the American Statistical Association
Pages: 487-494
Issue: 525
Volume: 114
Year: 2019
Month: 1
X-DOI: 10.1080/01621459.2019.1583915
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1583915
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:525:p:487-494
Template-Type: ReDIF-Article 1.0
Author-Name: Lisa Morrissey LaVange
Author-X-Name-First: Lisa Morrissey
Author-X-Name-Last: LaVange
Title: Choose to Lead
Abstract:
Each year, the Journal of the American Statistical Association publishes the presidential address from the Joint Statistical Meetings. Here we present the 2018 address verbatim save for the addition of references and a few minor editorial corrections.
Journal: Journal of the American Statistical Association
Pages: 1427-1435
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2019.1661183
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1661183
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1427-1435
Template-Type: ReDIF-Article 1.0
Author-Name: Christopher Jackson
Author-X-Name-First: Christopher
Author-X-Name-Last: Jackson
Author-Name: Anne Presanis
Author-X-Name-First: Anne
Author-X-Name-Last: Presanis
Author-Name: Stefano Conti
Author-X-Name-First: Stefano
Author-X-Name-Last: Conti
Author-Name: Daniela De Angelis
Author-X-Name-First: Daniela
Author-X-Name-Last: De Angelis
Title: Value of Information: Sensitivity Analysis and Research Design in Bayesian Evidence Synthesis
Abstract:
Suppose we have a Bayesian model that combines evidence from several different sources. We want to know which model parameters most affect the estimate or decision from the model, or which of the parameter uncertainties drive the decision uncertainty. Furthermore, we want to prioritize what further data should be collected. These questions can be addressed by Value of Information (VoI) analysis, in which we estimate expected reductions in loss from learning specific parameters or collecting data of a given design. We describe the theory and practice of VoI for Bayesian evidence synthesis, using and extending ideas from health economics, computer modeling and Bayesian design. The methods are general to a range of decision problems including point estimation and choices between discrete actions. We apply them to a model for estimating prevalence of HIV infection, combining indirect information from surveys, registers, and expert beliefs. This analysis shows which parameters contribute most of the uncertainty about each prevalence estimate, and the expected improvements in precision from specific amounts of additional data. These benefits can be traded with the costs of sampling to determine an optimal sample size. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1436-1449
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1562932
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1562932
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1436-1449
Template-Type: ReDIF-Article 1.0
Author-Name: Devin Francom
Author-X-Name-First: Devin
Author-X-Name-Last: Francom
Author-Name: Bruno Sansó
Author-X-Name-First: Bruno
Author-X-Name-Last: Sansó
Author-Name: Vera Bulaevskaya
Author-X-Name-First: Vera
Author-X-Name-Last: Bulaevskaya
Author-Name: Donald Lucas
Author-X-Name-First: Donald
Author-X-Name-Last: Lucas
Author-Name: Matthew Simpson
Author-X-Name-First: Matthew
Author-X-Name-Last: Simpson
Title: Inferring Atmospheric Release Characteristics in a Large Computer Experiment Using Bayesian Adaptive Splines
Abstract:
An atmospheric release of hazardous material, whether accidental or intentional, can be catastrophic for those in the path of the plume. Predicting the path of a plume based on characteristics of the release (location, amount, and duration) and meteorological conditions is an active research area highly relevant for emergency and long-term response to these releases. As a result, researchers have developed particle dispersion simulators to provide plume path predictions that incorporate release characteristics and meteorological conditions. However, since release characteristics and meteorological conditions are often unknown, the inverse problem is of great interest, that is, based on all the observations of the plume so far, what can be inferred about the release characteristics? This is the question we seek to answer using plume observations from a controlled release at the Diablo Canyon Nuclear Power Plant in Central California. With access to a large number of evaluations of a computationally expensive particle dispersion simulator that includes continuous and categorical inputs and spatio-temporal output, building a fast statistical surrogate model (or emulator) presents many statistical challenges, but is an essential tool for inverse modeling and sensitivity analysis. We achieve accurate emulation using Bayesian adaptive splines to model weights on empirical orthogonal functions. We use this emulator as well as appropriately identifiable simulator discrepancy and observational error models to calibrate the simulator, thus finding a posterior distribution of characteristics of the release. Since the release was controlled, these characteristics are known, making it possible to compare our findings to the truth. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1450-1465
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1562933
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1562933
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1450-1465
Template-Type: ReDIF-Article 1.0
Author-Name: Yimeng Xie
Author-X-Name-First: Yimeng
Author-X-Name-Last: Xie
Author-Name: Li Xu
Author-X-Name-First: Li
Author-X-Name-Last: Xu
Author-Name: Jie Li
Author-X-Name-First: Jie
Author-X-Name-Last: Li
Author-Name: Xinwei Deng
Author-X-Name-First: Xinwei
Author-X-Name-Last: Deng
Author-Name: Yili Hong
Author-X-Name-First: Yili
Author-X-Name-Last: Hong
Author-Name: Korine Kolivras
Author-X-Name-First: Korine
Author-X-Name-Last: Kolivras
Author-Name: David N. Gaines
Author-X-Name-First: David N.
Author-X-Name-Last: Gaines
Title: Spatial Variable Selection and An Application to Virginia Lyme Disease Emergence
Abstract:
Lyme disease is an infectious disease caused by the bacterium Borrelia burgdorferi sensu stricto. In the United States, Lyme disease is one of the most common infectious diseases. The major endemic areas of the disease are New England, Mid-Atlantic, East-North Central, South Atlantic, and West North-Central. Virginia is on the front-line of the disease’s diffusion from the northeast to the south. One of the research objectives for the infectious disease community is to identify environmental and economic variables that are associated with the emergence of Lyme disease. In this article, we use a spatial Poisson regression model to link the spatial disease counts and environmental and economic variables, and develop a spatial variable selection procedure to effectively identify important factors by using an adaptive elastic net penalty. The proposed methods can automatically select important covariates, while adjusting for possible spatial correlations of disease counts. The performance of the proposed method is studied and compared with existing methods via a comprehensive simulation study. We apply the developed variable selection methods to the Virginia Lyme disease data and identify important variables that are new to the literature. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1466-1480
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1564670
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1564670
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1466-1480
Template-Type: ReDIF-Article 1.0
Author-Name: Oliver Stoner
Author-X-Name-First: Oliver
Author-X-Name-Last: Stoner
Author-Name: Theo Economou
Author-X-Name-First: Theo
Author-X-Name-Last: Economou
Author-Name: Gabriela Drummond Marques da Silva
Author-X-Name-First: Gabriela
Author-X-Name-Last: Drummond Marques da Silva
Title: A Hierarchical Framework for Correcting Under-Reporting in Count Data
Abstract:
Tuberculosis poses a global health risk and Brazil is among the top 20 countries by absolute mortality. However, this epidemiological burden is masked by under-reporting, which impairs planning for effective intervention. We present a comprehensive investigation and application of a Bayesian hierarchical approach to modeling and correcting under-reporting in tuberculosis counts, a general problem arising in observational count data. The framework is applicable to fully under-reported data, relying only on an informative prior distribution for the mean reporting rate to supplement the partial information in the data. Covariates are used to inform both the true count-generating process and the under-reporting mechanism, while also allowing for complex spatio-temporal structures. We present several sensitivity analyses based on simulation experiments to aid the elicitation of the prior distribution for the mean reporting rate and decisions relating to the inclusion of covariates. Both prior and posterior predictive model checking are presented, as well as a critical evaluation of the approach. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1481-1492
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2019.1573732
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1573732
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1481-1492
Template-Type: ReDIF-Article 1.0
Author-Name: Bradley C. Saul
Author-X-Name-First: Bradley C.
Author-X-Name-Last: Saul
Author-Name: Michael G. Hudgens
Author-X-Name-First: Michael G.
Author-X-Name-Last: Hudgens
Author-Name: Michael A. Mallin
Author-X-Name-First: Michael A.
Author-X-Name-Last: Mallin
Title: Downstream Effects of Upstream Causes
Abstract:
The United States Environmental Protection Agency considers nutrient pollution in stream ecosystems one of the United States’ most pressing environmental challenges. But limited independent replicates, lack of experimental randomization, and space- and time-varying confounding handicap causal inference on effects of nutrient pollution. In this article, the causal g-methods are extended to allow for exposures to vary in time and space in order to assess the effects of nutrient pollution on chlorophyll a—a proxy for algal production. Publicly available data from North Carolina’s Cape Fear River and a simulation study are used to show how causal effects of upstream nutrient concentrations on downstream chlorophyll a levels may be estimated from typical water quality monitoring data. Estimates obtained from the parametric g-formula, a marginal structural model, and a structural nested model indicate that chlorophyll a concentrations at Lock and Dam 1 were influenced by nitrate concentrations measured 86 to 109 km upstream, an area where four major industrial and municipal point sources discharge wastewater. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1493-1504
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2019.1574226
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1574226
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1493-1504
Template-Type: ReDIF-Article 1.0
Author-Name: Zhengwu Zhang
Author-X-Name-First: Zhengwu
Author-X-Name-Last: Zhang
Author-Name: Maxime Descoteaux
Author-X-Name-First: Maxime
Author-X-Name-Last: Descoteaux
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Nonparametric Bayes Models of Fiber Curves Connecting Brain Regions
Abstract:
In studying structural inter-connections in the human brain, it is common to first estimate fiber bundles connecting different regions relying on diffusion MRI. These fiber bundles act as highways for neural activity. Current statistical methods reduce the rich information into an adjacency matrix, with the elements containing a count of fibers or a mean diffusion feature along the fibers. The goal of this article is to avoid discarding the rich geometric information of fibers, developing flexible models for characterizing the population distribution of fibers between brain regions of interest within and across different individuals. We start by decomposing each fiber into a rotation matrix, shape and translation from a global reference curve. These components are viewed as data lying on a product space composed of different Euclidean spaces and manifolds. To nonparametrically model the distribution within and across individuals, we rely on a hierarchical mixture of product kernels specific to the component spaces. Taking a Bayesian approach to inference, we develop efficient methods for posterior sampling. The approach automatically produces clusters of fibers within and across individuals. Applying the method to Human Connectome Project data, we find interesting relationships between brain fiber geometry and reading ability. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1505-1517
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2019.1574582
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1574582
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1505-1517
Template-Type: ReDIF-Article 1.0
Author-Name: Chris J. Oates
Author-X-Name-First: Chris J.
Author-X-Name-Last: Oates
Author-Name: Jon Cockayne
Author-X-Name-First: Jon
Author-X-Name-Last: Cockayne
Author-Name: Robert G. Aykroyd
Author-X-Name-First: Robert G.
Author-X-Name-Last: Aykroyd
Author-Name: Mark Girolami
Author-X-Name-First: Mark
Author-X-Name-Last: Girolami
Title: Bayesian Probabilistic Numerical Methods in Time-Dependent State Estimation for Industrial Hydrocyclone Equipment
Abstract:
The use of high-power industrial equipment, such as large-scale mixing equipment or a hydrocyclone for separation of particles in liquid suspension, demands careful monitoring to ensure correct operation. The fundamental task of state-estimation for the liquid suspension can be posed as a time-evolving inverse problem and solved with Bayesian statistical methods. In this article, we extend Bayesian methods to incorporate statistical models for the error that is incurred in the numerical solution of the physical governing equations. This enables full uncertainty quantification within a principled computation-precision trade-off, in contrast to the over-confident inferences that are obtained when all sources of numerical error are ignored. The method is cast within a sequential Monte Carlo framework and an optimized implementation is provided in Python.
Journal: Journal of the American Statistical Association
Pages: 1518-1531
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2019.1574583
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1574583
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1518-1531
Template-Type: ReDIF-Article 1.0
Author-Name: Dean Knox
Author-X-Name-First: Dean
Author-X-Name-Last: Knox
Author-Name: Teppei Yamamoto
Author-X-Name-First: Teppei
Author-X-Name-Last: Yamamoto
Author-Name: Matthew A. Baum
Author-X-Name-First: Matthew A.
Author-X-Name-Last: Baum
Author-Name: Adam J. Berinsky
Author-X-Name-First: Adam J.
Author-X-Name-Last: Berinsky
Title: Design, Identification, and Sensitivity Analysis for Patient Preference Trials
Abstract:
Social and medical scientists are often concerned that the external validity of experimental results may be compromised because of heterogeneous treatment effects. If a treatment has different effects on those who would choose to take it and those who would not, the average treatment effect estimated in a standard randomized controlled trial (RCT) may give a misleading picture of its impact outside of the study sample. Patient preference trials (PPTs), where participants’ preferences over treatment options are incorporated in the study design, provide a possible solution. In this paper, we provide a systematic analysis of PPTs based on the potential outcomes framework of causal inference. We propose a general design for PPTs with multi-valued treatments, where participants state their preferred treatments and are then randomized into either a standard RCT or a self-selection condition. We derive nonparametric sharp bounds on the average causal effects among each choice-based subpopulation of participants under the proposed design. We also propose a sensitivity analysis for the violation of the key ignorability assumption sufficient for identifying the target causal quantity. The proposed design and methodology are illustrated with an original study of partisan news media and its behavioral impact. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1532-1546
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2019.1585248
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1585248
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1532-1546
Template-Type: ReDIF-Article 1.0
Author-Name: J. L. Scealy
Author-X-Name-First: J. L.
Author-X-Name-Last: Scealy
Author-Name: Andrew T. A. Wood
Author-X-Name-First: Andrew T. A.
Author-X-Name-Last: Wood
Title: Scaled von Mises–Fisher Distributions and Regression Models for Paleomagnetic Directional Data
Abstract:
We propose a new distribution for analyzing paleomagnetic directional data, that is, a novel transformation of the von Mises–Fisher distribution. The new distribution has ellipse-like symmetry, as does the Kent distribution; however, unlike the Kent distribution the normalizing constant in the new density is easy to compute and estimation of the shape parameters is straightforward. To accommodate outliers, the model also incorporates an additional shape parameter, which controls the tail-weight of the distribution. We also develop a general regression model framework that allows both the mean direction and the shape parameters of the error distribution to depend on covariates. The proposed regression procedure is shown to be equivariant with respect to the choice of coordinate system for the directional response. To illustrate, we analyze paleomagnetic directional data from the GEOMAGIA50.v3 database. We predict the mean direction at various geological time points and show that there is significant heteroscedasticity present. It is envisaged that the regression structures and error distribution proposed here will also prove useful when covariate information is available with (i) other types of directional response data; and (ii) square-root transformed compositional data of general dimension. Supplementary materials for this article are available online. Code submitted with this article was checked by an Associate Editor for Reproducibility and is available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1547-1560
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2019.1585249
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1585249
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1547-1560
Template-Type: ReDIF-Article 1.0
Author-Name: Xueying Tang
Author-X-Name-First: Xueying
Author-X-Name-Last: Tang
Author-Name: Yang Yang
Author-X-Name-First: Yang
Author-X-Name-Last: Yang
Author-Name: Hong-Jie Yu
Author-X-Name-First: Hong-Jie
Author-X-Name-Last: Yu
Author-Name: Qiao-Hong Liao
Author-X-Name-First: Qiao-Hong
Author-X-Name-Last: Liao
Author-Name: Nikolay Bliznyuk
Author-X-Name-First: Nikolay
Author-X-Name-Last: Bliznyuk
Title: A Spatio-Temporal Modeling Framework for Surveillance Data of Multiple Infectious Pathogens With Small Laboratory Validation Sets
Abstract:
Many surveillance systems of infectious diseases are syndrome-based, capturing patients by clinical manifestation. Only a fraction of patients, mostly severe cases, undergo laboratory validation to identify the underlying pathogen. Motivated by the need to understand transmission dynamics and associated risk factors of enteroviruses causing the hand, foot, and mouth disease (HFMD) in China, we developed a Bayesian spatio-temporal modeling framework for surveillance data of infectious diseases with small validation sets. A novel approach was proposed to sample unobserved pathogen-specific patient counts over space and time and was compared to an existing sampling approach. The practical utility of this framework in identifying key parameters was assessed in simulations for a range of realistic sizes of the validation set. Several designs of sampling patients for laboratory validation were compared with and without aggregation of sparse validation data. The methodology was applied to the 2009 HFMD epidemic in southern China to evaluate transmissibility and the effects of climatic conditions for the leading pathogens of the disease, enterovirus 71 and Coxsackie A16. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1561-1573
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2019.1585250
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1585250
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1561-1573
Template-Type: ReDIF-Article 1.0
Author-Name: Yixin Wang
Author-X-Name-First: Yixin
Author-X-Name-Last: Wang
Author-Name: David M. Blei
Author-X-Name-First: David M.
Author-X-Name-Last: Blei
Title: The Blessings of Multiple Causes
Abstract:
Causal inference from observational data is a vital problem, but it comes with strong assumptions. Most methods assume that we observe all confounders, variables that affect both the causal variables and the outcome variables. This assumption is standard but it is also untestable. In this article, we develop the deconfounder, a way to do causal inference with weaker assumptions than the traditional methods require. The deconfounder is designed for problems of multiple causal inference: scientific studies that involve multiple causes whose effects are simultaneously of interest. Specifically, the deconfounder combines unsupervised machine learning and predictive model checking to use the dependencies among multiple causes as indirect evidence for some of the unobserved confounders. We develop the deconfounder algorithm, prove that it is unbiased, and show that it requires weaker assumptions than traditional causal inference. We analyze its performance in three types of studies: semi-simulated data around smoking and lung cancer, semi-simulated data around genome-wide association studies, and a real dataset about actors and movie revenue. The deconfounder is an effective approach to estimating causal effects in problems of multiple causal inference. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1574-1596
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2019.1686987
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1686987
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1574-1596
Template-Type: ReDIF-Article 1.0
Author-Name: Alexander D’Amour
Author-X-Name-First: Alexander
Author-X-Name-Last: D’Amour
Title: Comment: Reflections on the Deconfounder
Journal: Journal of the American Statistical Association
Pages: 1597-1601
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2019.1689138
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1689138
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1597-1601
Template-Type: ReDIF-Article 1.0
Author-Name: Susan Athey
Author-X-Name-First: Susan
Author-X-Name-Last: Athey
Author-Name: Guido W. Imbens
Author-X-Name-First: Guido W.
Author-X-Name-Last: Imbens
Author-Name: Michael Pollmann
Author-X-Name-First: Michael
Author-X-Name-Last: Pollmann
Title: Comment on: “The Blessings of Multiple Causes” by Yixin Wang and David M. Blei
Journal: Journal of the American Statistical Association
Pages: 1602-1604
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2019.1691008
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1691008
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1602-1604
Template-Type: ReDIF-Article 1.0
Author-Name: Kosuke Imai
Author-X-Name-First: Kosuke
Author-X-Name-Last: Imai
Author-Name: Zhichao Jiang
Author-X-Name-First: Zhichao
Author-X-Name-Last: Jiang
Title: Comment: The Challenges of Multiple Causes
Abstract:
We begin by congratulating Yixin Wang and David Blei for their thought-provoking article that opens up a new research frontier in the field of causal inference. The authors directly tackle the challenging question of how to infer causal effects of many treatments in the presence of unmeasured confounding. We expect their article to have a major impact by further advancing our understanding of this important methodological problem. This commentary has two goals. We first critically review the deconfounder method and point out its advantages and limitations. We then briefly consider three possible ways to address some of the limitations of the deconfounder method.
Journal: Journal of the American Statistical Association
Pages: 1605-1610
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2019.1689137
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1689137
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1605-1610
Template-Type: ReDIF-Article 1.0
Author-Name: Elizabeth L. Ogburn
Author-X-Name-First: Elizabeth L.
Author-X-Name-Last: Ogburn
Author-Name: Ilya Shpitser
Author-X-Name-First: Ilya
Author-X-Name-Last: Shpitser
Author-Name: Eric J. Tchetgen Tchetgen
Author-X-Name-First: Eric J. Tchetgen
Author-X-Name-Last: Tchetgen
Title: Comment on “Blessings of Multiple Causes”
Journal: Journal of the American Statistical Association
Pages: 1611-1615
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2019.1689139
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1689139
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1611-1615
Template-Type: ReDIF-Article 1.0
Author-Name: Yixin Wang
Author-X-Name-First: Yixin
Author-X-Name-Last: Wang
Author-Name: David M. Blei
Author-X-Name-First: David M.
Author-X-Name-Last: Blei
Title: The Blessings of Multiple Causes: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 1616-1619
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2019.1690841
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1690841
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1616-1619
Template-Type: ReDIF-Article 1.0
Author-Name: Kai Zhang
Author-X-Name-First: Kai
Author-X-Name-Last: Zhang
Title: BET on Independence
Abstract:
We study the problem of nonparametric dependence detection. Many existing methods may suffer severe power loss due to nonuniform consistency, which we illustrate with a paradox. To avoid such power loss, we approach the nonparametric test of independence through the new framework of binary expansion statistics (BEStat) and binary expansion testing (BET), which examine dependence through a novel binary expansion filtration approximation of the copula. Through a Hadamard transform, we find that the symmetry statistics in the filtration are complete sufficient statistics for dependence. These statistics are also uncorrelated under the null. By using symmetry statistics, the BET avoids the problem of nonuniform consistency and improves upon a wide class of commonly used methods (a) by achieving the minimax rate in sample size requirement for reliable power and (b) by providing clear interpretations of global relationships upon rejection of independence. The binary expansion approach also connects the symmetry statistics with the current computing system to facilitate efficient bitwise implementation. We illustrate the BET with a study of the distribution of stars in the night sky and with an exploratory data analysis of the TCGA breast cancer data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1620-1637
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1537921
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537921
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1620-1637
Template-Type: ReDIF-Article 1.0
Author-Name: Shubhadeep Chakraborty
Author-X-Name-First: Shubhadeep
Author-X-Name-Last: Chakraborty
Author-Name: Xianyang Zhang
Author-X-Name-First: Xianyang
Author-X-Name-Last: Zhang
Title: Distance Metrics for Measuring Joint Dependence with Application to Causal Inference
Abstract:
Many statistical applications require the quantification of joint dependence among more than two random vectors. In this work, we generalize the notion of distance covariance to quantify joint dependence among d ≥ 2 random vectors. We introduce the high-order distance covariance to measure the so-called Lancaster interaction dependence. The joint distance covariance is then defined as a linear combination of pairwise distance covariances and their higher-order counterparts which together completely characterize mutual independence. We further introduce some related concepts including the distance cumulant, distance characteristic function, and rank-based distance covariance. Empirical estimators are constructed based on certain Euclidean distances between sample elements. We study the large-sample properties of the estimators and propose a bootstrap procedure to approximate their sampling distributions. The asymptotic validity of the bootstrap procedure is justified under both the null and alternative hypotheses. The new metrics are employed to perform model selection in causal inference, which is based on the joint independence testing of the residuals from the fitted structural equation models. The effectiveness of the method is illustrated via both simulated and real datasets. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1638-1650
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1513364
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1513364
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1638-1650
Template-Type: ReDIF-Article 1.0
Author-Name: Xinran Li
Author-X-Name-First: Xinran
Author-X-Name-Last: Li
Author-Name: Peng Ding
Author-X-Name-First: Peng
Author-X-Name-Last: Ding
Author-Name: Qian Lin
Author-X-Name-First: Qian
Author-X-Name-Last: Lin
Author-Name: Dawei Yang
Author-X-Name-First: Dawei
Author-X-Name-Last: Yang
Author-Name: Jun S. Liu
Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu
Title: Randomization Inference for Peer Effects
Abstract:
Many previous causal inference studies require no interference, that is, the potential outcomes of a unit do not depend on the treatments of other units. However, this no-interference assumption becomes unreasonable when a unit interacts with other units in the same group or cluster. In a motivating application, a top Chinese university admits students through two channels: the college entrance exam (also known as Gaokao) and recommendation (often based on Olympiads in various subjects). The university randomly assigns students to dorms, each of which hosts four students. Students within the same dorm live together and have extensive interactions. Therefore, it is likely that peer effects exist and the no-interference assumption does not hold. It is important to understand peer effects, because they give useful guidance for future roommate assignment to improve the performance of students. We define peer effects using potential outcomes. We then propose a randomization-based inference framework to study peer effects with arbitrary numbers of peers and peer types. Our inferential procedure does not assume any parametric model on the outcome distribution. Our analysis gives useful practical guidance for policy makers of the university. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1651-1664
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1512863
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1512863
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1651-1664
Template-Type: ReDIF-Article 1.0
Author-Name: Iavor Bojinov
Author-X-Name-First: Iavor
Author-X-Name-Last: Bojinov
Author-Name: Neil Shephard
Author-X-Name-First: Neil
Author-X-Name-Last: Shephard
Title: Time Series Experiments and Causal Estimands: Exact Randomization Tests and Trading
Abstract:
We define causal estimands for experiments on single time series, extending the potential outcome framework to dealing with temporal data. Our approach allows the estimation of a broad class of these estimands and exact randomization-based p-values for testing causal effects, without imposing stringent assumptions. We further derive a general central limit theorem that can be used to conduct conservative tests and build confidence intervals for causal effects. Finally, we provide three methods for generalizing our approach to multiple units that are receiving the same class of treatment, over time. We test our methodology on simulated “potential autoregressions,” which have a causal interpretation. Our methodology is partially inspired by data from a large number of experiments carried out by a financial company that compared the impact of two different ways of trading equity futures contracts. We use our methodology to make causal statements about their trading methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1665-1682
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1527225
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1527225
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1665-1682
Template-Type: ReDIF-Article 1.0
Author-Name: Daniel Peña
Author-X-Name-First: Daniel
Author-X-Name-Last: Peña
Author-Name: Ezequiel Smucler
Author-X-Name-First: Ezequiel
Author-X-Name-Last: Smucler
Author-Name: Victor J. Yohai
Author-X-Name-First: Victor J.
Author-X-Name-Last: Yohai
Title: Forecasting Multiple Time Series With One-Sided Dynamic Principal Components
Abstract:
We define one-sided dynamic principal components (ODPC) for time series as linear combinations of the present and past values of the series that minimize the reconstruction mean squared error. Dynamic principal components have usually been defined as functions of both past and future values of the series and are therefore not appropriate for forecasting purposes. In contrast, it is shown that the ODPC introduced in this article can be successfully used for forecasting high-dimensional multiple time series. An alternating least-squares algorithm to compute the proposed ODPC is presented. We prove that for stationary and ergodic time series the estimated values converge to their population analogs. We also prove that asymptotically, when both the number of series and the sample size go to infinity, if the data follow a dynamic factor model, the reconstruction obtained with ODPC converges in mean square to the common part of the factor model. The results of a simulation study show that the forecasts obtained with ODPC compare favorably with those obtained using other forecasting methods based on dynamic factor models. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1683-1694
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1520117
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1520117
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1683-1694
Template-Type: ReDIF-Article 1.0
Author-Name: Torben G. Andersen
Author-X-Name-First: Torben G.
Author-X-Name-Last: Andersen
Author-Name: Martin Thyrsgaard
Author-X-Name-First: Martin
Author-X-Name-Last: Thyrsgaard
Author-Name: Viktor Todorov
Author-X-Name-First: Viktor
Author-X-Name-Last: Todorov
Title: Time-Varying Periodicity in Intraday Volatility
Abstract:
We develop a nonparametric test for whether return volatility exhibits time-varying intraday periodicity using a long time series of high-frequency data. Our null hypothesis, commonly adopted in work on volatility modeling, is that volatility follows a stationary process combined with a constant time-of-day periodic component. We construct time-of-day volatility estimates and studentize the high-frequency returns with these periodic components. If the intraday periodicity is invariant, then the distribution of the studentized returns should be identical across the trading day. Consequently, the test compares the empirical characteristic function of the studentized returns across the trading day. The limit distribution of the test depends on the error in recovering volatility from discrete return data and the empirical process error associated with estimating volatility moments through their sample counterparts. Critical values are computed via easy-to-implement simulation. In an empirical application to S&P 500 index returns, we find strong evidence for variation in the intraday volatility pattern driven in part by the current level of volatility. When volatility is elevated, the period preceding the market close constitutes a significantly higher fraction of the total daily integrated volatility than during low volatility regimes. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1695-1707
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1512864
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1512864
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1695-1707
Template-Type: ReDIF-Article 1.0
Author-Name: Anru Zhang
Author-X-Name-First: Anru
Author-X-Name-Last: Zhang
Author-Name: Rungang Han
Author-X-Name-First: Rungang
Author-X-Name-Last: Han
Title: Optimal Sparse Singular Value Decomposition for High-Dimensional High-Order Data
Abstract:
In this article, we consider the sparse tensor singular value decomposition, which aims for dimension reduction on high-dimensional high-order data with certain sparsity structure. A method named sparse tensor alternating thresholding for singular value decomposition (STAT-SVD) is proposed. The proposed procedure features a novel double projection & thresholding scheme, which provides a sharp criterion for thresholding in each iteration. Compared with the regular tensor SVD model, STAT-SVD permits more robust estimation under weaker assumptions. Both the upper and lower bounds for estimation accuracy are developed. The proposed procedure is shown to be minimax rate-optimal in a general class of situations. Simulation studies show that STAT-SVD performs well under a variety of configurations. We also illustrate the merits of the proposed procedure on a longitudinal tensor dataset on European country mortality rates. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1708-1725
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1527227
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1527227
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1708-1725
Template-Type: ReDIF-Article 1.0
Author-Name: Qian Lin
Author-X-Name-First: Qian
Author-X-Name-Last: Lin
Author-Name: Zhigen Zhao
Author-X-Name-First: Zhigen
Author-X-Name-Last: Zhao
Author-Name: Jun S. Liu
Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu
Title: Sparse Sliced Inverse Regression via Lasso
Abstract:
For multiple index models, it has recently been shown that the sliced inverse regression (SIR) is consistent for estimating the sufficient dimension reduction (SDR) space if and only if ρ = lim p/n = 0, where p is the dimension and n is the sample size. Thus, when p is of the same or a higher order of n, additional assumptions such as sparsity must be imposed in order to ensure consistency for SIR. By constructing artificial response variables made up from top eigenvectors of the estimated conditional covariance matrix, we introduce a simple Lasso regression method to obtain an estimate of the SDR space. The resulting algorithm, Lasso-SIR, is shown to be consistent and achieves the optimal convergence rate under certain sparsity conditions when p is of order o(n²λ²), where λ is the generalized signal-to-noise ratio. We also demonstrate the superior performance of Lasso-SIR compared with existing approaches via extensive numerical studies and several real data examples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1726-1739
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1520115
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1520115
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1726-1739
Template-Type: ReDIF-Article 1.0
Author-Name: Efang Kong
Author-X-Name-First: Efang
Author-X-Name-Last: Kong
Author-Name: Yingcun Xia
Author-X-Name-First: Yingcun
Author-X-Name-Last: Xia
Author-Name: Wei Zhong
Author-X-Name-First: Wei
Author-X-Name-Last: Zhong
Title: Composite Coefficient of Determination and Its Application in Ultrahigh Dimensional Variable Screening
Abstract:
In this article, we propose to measure the dependence between two random variables through a composite coefficient of determination (CCD) of a set of nonparametric regressions. These regressions take consecutive binarizations of one variable as the response and the other variable as the predictor. The resulting measure is invariant to monotonic marginal variable transformation, rendering it robust against heavy-tailed distributions and outliers, and convenient for independence testing. Estimation of CCD could be done through kernel smoothing, with a consistency rate of root-n. CCD is a natural measure of the importance of variables in regression and its sure screening property, when used for variable screening, is also established. Comprehensive simulation studies and real data analysis show that the newly proposed measure quite often turns out to be the most preferred compared to other existing methods both in independence testing and in variable screening. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1740-1751
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1514305
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1514305
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1740-1751
Template-Type: ReDIF-Article 1.0
Author-Name: Weixin Yao
Author-X-Name-First: Weixin
Author-X-Name-Last: Yao
Author-Name: Debmalya Nandy
Author-X-Name-First: Debmalya
Author-X-Name-Last: Nandy
Author-Name: Bruce G. Lindsay
Author-X-Name-First: Bruce G.
Author-X-Name-Last: Lindsay
Author-Name: Francesca Chiaromonte
Author-X-Name-First: Francesca
Author-X-Name-Last: Chiaromonte
Title: Covariate Information Matrix for Sufficient Dimension Reduction
Abstract:
Building upon recent research on the applications of the density information matrix, we develop a tool for sufficient dimension reduction (SDR) in regression problems called covariate information matrix (CIM). CIM exhaustively identifies the central subspace (CS) and provides a rank ordering of the reduced covariates in terms of their regression information. Compared to other popular SDR methods, CIM does not require distributional assumptions on the covariates, or estimation of the mean regression function. CIM is implemented via eigen-decomposition of a matrix estimated with a previously developed efficient nonparametric density estimation technique. We also propose a bootstrap-based diagnostic plot for estimating the dimension of the CS. Results of simulations and real data applications demonstrate superior or competitive performance of CIM compared to that of some other SDR methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1752-1764
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1515080
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1515080
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1752-1764
Template-Type: ReDIF-Article 1.0
Author-Name: Francis K. C. Hui
Author-X-Name-First: Francis K. C.
Author-X-Name-Last: Hui
Author-Name: C. You
Author-X-Name-First: C.
Author-X-Name-Last: You
Author-Name: H. L. Shang
Author-X-Name-First: H. L.
Author-X-Name-Last: Shang
Author-Name: Samuel Müller
Author-X-Name-First: Samuel
Author-X-Name-Last: Müller
Title: Semiparametric Regression Using Variational Approximations
Abstract:
Semiparametric regression offers a flexible framework for modeling nonlinear relationships between a response and covariates. A prime example is the class of generalized additive models (GAMs), where splines (say) are used to approximate nonlinear functional components in conjunction with a quadratic penalty to control for overfitting. Estimation and inference are then generally performed based on the penalized likelihood, or under a mixed model framework. The penalized likelihood framework is fast but potentially unstable, and choosing the smoothing parameters needs to be done externally using cross-validation, for instance. The mixed model framework tends to be more stable and offers a natural way for choosing the smoothing parameters, but for nonnormal responses involves an intractable integral. In this article, we introduce a new framework for semiparametric regression based on variational approximations (VA). The approach possesses the stability and natural inference tools of the mixed model framework, while achieving computation times comparable to using penalized likelihood. Focusing on GAMs, we derive fully tractable variational likelihoods for some common response types. We present several features of the VA framework for inference, including a variational information matrix for inference on parametric components, and a closed-form update for estimating the smoothing parameter. We demonstrate the consistency of the VA estimates, and an asymptotic normality result for the parametric component of the model. Simulation studies show the VA framework performs similarly to and sometimes better than currently available software for fitting GAMs. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1765-1777
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1518235
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1518235
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1765-1777
Template-Type: ReDIF-Article 1.0
Author-Name: Kin Yau Wong
Author-X-Name-First: Kin Yau
Author-X-Name-Last: Wong
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: D. Y. Lin
Author-X-Name-First: D. Y.
Author-X-Name-Last: Lin
Title: Robust Score Tests With Missing Data in Genomics Studies
Abstract:
Analysis of genomic data is often complicated by the presence of missing values, which may arise due to cost or other reasons. The prevailing approach of single imputation is generally invalid if the imputation model is misspecified. In this article, we propose a robust score statistic based on imputed data for testing the association between a phenotype and a genomic variable with (partially) missing values. We fit a semiparametric regression model for the genomic variable against an arbitrary function of the linear predictor in the phenotype model and impute each missing value by its estimated posterior expectation. We show that the score statistic with such imputed values is asymptotically unbiased under general missing-data mechanisms, even when the imputation model is misspecified. We develop a spline-based method to estimate the semiparametric imputation model and derive the asymptotic distribution of the corresponding score statistic with a consistent variance estimator using sieve approximation theory and empirical process theory. The proposed test is computationally feasible regardless of the number of independent variables in the imputation model. We demonstrate the advantages of the proposed method over existing methods through extensive simulation studies and provide an application to a major cancer genomics study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1778-1786
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1514304
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1514304
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1778-1786
Template-Type: ReDIF-Article 1.0
Author-Name: X. Jessie Jeng
Author-X-Name-First: X. Jessie
Author-X-Name-Last: Jeng
Author-Name: Teng Zhang
Author-X-Name-First: Teng
Author-X-Name-Last: Zhang
Author-Name: Jung-Ying Tzeng
Author-X-Name-First: Jung-Ying
Author-X-Name-Last: Tzeng
Title: Efficient Signal Inclusion With Genomic Applications
Abstract:
This article addresses the challenge of efficiently capturing a high proportion of true signals for subsequent data analyses when sample sizes are relatively limited with respect to data dimension. We propose the signal missing rate (SMR) as a new measure for false-negative control to account for the variability of false-negative proportion. Novel data-adaptive procedures are developed to control SMR without incurring many unnecessary false positives under dependence. We justify the efficiency and adaptivity of the proposed methods via theory and simulation. The proposed methods are applied to GWAS on human height to effectively remove irrelevant single nucleotide polymorphisms (SNPs) while retaining a high proportion of relevant SNPs for subsequent polygenic analysis. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1787-1799
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1518236
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1518236
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1787-1799
Template-Type: ReDIF-Article 1.0
Author-Name: James M. Salter
Author-X-Name-First: James M.
Author-X-Name-Last: Salter
Author-Name: Daniel B. Williamson
Author-X-Name-First: Daniel B.
Author-X-Name-Last: Williamson
Author-Name: John Scinocca
Author-X-Name-First: John
Author-X-Name-Last: Scinocca
Author-Name: Viatcheslav Kharin
Author-X-Name-First: Viatcheslav
Author-X-Name-Last: Kharin
Title: Uncertainty Quantification for Computer Models With Spatial Output Using Calibration-Optimal Bases
Abstract:
The calibration of complex computer codes using uncertainty quantification (UQ) methods is a rich area of statistical methodological development. When applying these techniques to simulators with spatial output, it is now standard to use principal component decomposition to reduce the dimensions of the outputs in order to allow Gaussian process emulators to predict the output for calibration. We introduce the “terminal case,” in which the model cannot reproduce observations to within model discrepancy, and for which standard calibration methods in UQ fail to give sensible results. We show that even when there is no such issue with the model, the standard decomposition on the outputs can and usually does lead to a terminal case analysis. We present a simple test to allow a practitioner to establish whether their experiment will result in a terminal case analysis, and a methodology for defining calibration-optimal bases that avoid this whenever it is not inevitable. We present the optimal rotation algorithm for doing this, and demonstrate its efficacy for an idealized example for which the usual principal component methods fail. We apply these ideas to the CanAM4 model to demonstrate the terminal case issue arising for climate models. We discuss climate model tuning and the estimation of model discrepancy within this context, and show how the optimal rotation algorithm can be used in developing practical climate model tuning tools. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1800-1814
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1514306
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1514306
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1800-1814
Template-Type: ReDIF-Article 1.0
Author-Name: Gang Li
Author-X-Name-First: Gang
Author-X-Name-Last: Li
Author-Name: Xiaoyan Wang
Author-X-Name-First: Xiaoyan
Author-X-Name-Last: Wang
Title: Prediction Accuracy Measures for a Nonlinear Model and for Right-Censored Time-to-Event Data
Abstract:
This article develops a pair of new prediction summary measures for a nonlinear prediction function with right-censored time-to-event data. The first measure, defined as the proportion of explained variance by a linearly corrected prediction function, quantifies the potential predictive power of the nonlinear prediction function. The second measure, defined as the proportion of explained prediction error by its corrected prediction function, gauges the closeness of the prediction function to its corrected version and serves as a supplementary measure to indicate (by a value less than 1) whether the correction is needed to fulfill its potential predictive power and quantify how much prediction error reduction can be realized with the correction. The two measures together provide a complete summary of the predictive accuracy of the nonlinear prediction function. We motivate these measures by first establishing a variance decomposition and a prediction error decomposition at the population level and then deriving uncensored and censored sample versions of these decompositions. We note that for the least square prediction function under the linear model with no censoring, the first measure reduces to the classical coefficient of determination and the second measure degenerates to 1. We show that the sample measures are consistent estimators of their population counterparts and conduct extensive simulations to investigate their finite sample properties. A real data illustration is provided using the PBC data. An R package PAmeasures has been developed and made available via the CRAN R library. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1815-1825
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1515079
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1515079
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1815-1825
Template-Type: ReDIF-Article 1.0
Author-Name: Stephane Shao
Author-X-Name-First: Stephane
Author-X-Name-Last: Shao
Author-Name: Pierre E. Jacob
Author-X-Name-First: Pierre E.
Author-X-Name-Last: Jacob
Author-Name: Jie Ding
Author-X-Name-First: Jie
Author-X-Name-Last: Ding
Author-Name: Vahid Tarokh
Author-X-Name-First: Vahid
Author-X-Name-Last: Tarokh
Title: Bayesian Model Comparison with the Hyvärinen Score: Computation and Consistency
Abstract:
The Bayes factor is a widely used criterion in model comparison and its logarithm is a difference of out-of-sample predictive scores under the logarithmic scoring rule. However, when some of the candidate models involve vague priors on their parameters, the log-Bayes factor features an arbitrary additive constant that hinders its interpretation. As an alternative, we consider model comparison using the Hyvärinen score. We propose a method to consistently estimate this score for parametric models, using sequential Monte Carlo methods. We show that this score can be estimated for models with tractable likelihoods as well as nonlinear non-Gaussian state-space models with intractable likelihoods. We prove the asymptotic consistency of this new model selection criterion under strong regularity assumptions in the case of nonnested models, and we provide qualitative insights for the nested case. We also use existing characterizations of proper scoring rules on discrete spaces to extend the Hyvärinen score to discrete observations. Our numerical illustrations include Lévy-driven stochastic volatility models and diffusion models for population dynamics. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1826-1837
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1518237
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1518237
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1826-1837
Template-Type: ReDIF-Article 1.0
Author-Name: Annalisa Cadonna
Author-X-Name-First: Annalisa
Author-X-Name-Last: Cadonna
Author-Name: Athanasios Kottas
Author-X-Name-First: Athanasios
Author-X-Name-Last: Kottas
Author-Name: Raquel Prado
Author-X-Name-First: Raquel
Author-X-Name-Last: Prado
Title: Bayesian Spectral Modeling for Multiple Time Series
Abstract:
We develop a novel Bayesian modeling approach to spectral density estimation for multiple time series. The log-periodogram distribution for each series is modeled as a mixture of Gaussian distributions with frequency-dependent weights and mean functions. The implied model for the log-spectral density is a mixture of linear mean functions with frequency-dependent weights. The mixture weights are built through successive differences of a logit-normal distribution function with frequency-dependent parameters. Building from the construction for a single spectral density, we develop a hierarchical extension for multiple time series. Specifically, we set the mean functions to be common to all spectral densities and make the weights specific to the time series through the parameters of the logit-normal distribution. In addition to accommodating flexible spectral density shapes, a practically important feature of the proposed formulation is that it allows for ready posterior simulation through a Gibbs sampler with closed form full conditional distributions for all model parameters. The modeling approach is illustrated with simulated datasets and used for spectral analysis of multichannel electroencephalographic recordings, which provides a key motivating application for the proposed methodology.
Journal: Journal of the American Statistical Association
Pages: 1838-1853
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1520114
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1520114
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1838-1853
Template-Type: ReDIF-Article 1.0
Author-Name: Fei Jiang
Author-X-Name-First: Fei
Author-X-Name-Last: Jiang
Author-Name: Lu Tian
Author-X-Name-First: Lu
Author-X-Name-Last: Tian
Author-Name: Haoda Fu
Author-X-Name-First: Haoda
Author-X-Name-Last: Fu
Author-Name: Takahiro Hasegawa
Author-X-Name-First: Takahiro
Author-X-Name-Last: Hasegawa
Author-Name: L. J. Wei
Author-X-Name-First: L. J.
Author-X-Name-Last: Wei
Title: Robust Alternatives to ANCOVA for Estimating the Treatment Effect via a Randomized Comparative Study
Abstract:
In comparing two treatments via a randomized clinical trial, the analysis of covariance (ANCOVA) technique is often utilized to estimate an overall treatment effect. The ANCOVA is generally perceived as a more efficient procedure than its simple two sample estimation counterpart. Unfortunately, when the ANCOVA model is nonlinear, the resulting estimator is generally not consistent. Recently, various nonparametric alternatives to the ANCOVA, such as the augmentation methods, have been proposed to estimate the treatment effect by adjusting the covariates. However, the properties of these alternatives have not been studied in the presence of treatment allocation imbalance. In this article, we take a different approach to explore how to improve the precision of the naive two-sample estimate even when the observed distributions of baseline covariates between two groups are dissimilar. Specifically, we derive a bias-adjusted estimation procedure constructed from a conditional inference principle via relevant ancillary statistics from the observed covariates. This estimator is shown to be asymptotically equivalent to an augmentation estimator under the unconditional setting. We utilize the data from a clinical trial for evaluating a combination treatment of cardiovascular diseases to illustrate our findings.
Journal: Journal of the American Statistical Association
Pages: 1854-1864
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1527226
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1527226
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1854-1864
Template-Type: ReDIF-Article 1.0
Author-Name: Benjamin D. Youngman
Author-X-Name-First: Benjamin D.
Author-X-Name-Last: Youngman
Title: Generalized Additive Models for Exceedances of High Thresholds With an Application to Return Level Estimation for U.S. Wind Gusts
Abstract:
Generalized additive model (GAM) forms offer a flexible approach to capturing marginal variation. Such forms are used here to represent distributional variation in extreme values and presented in terms of spatio-temporal variation, which is often evident in environmental processes. A two-stage procedure is proposed that identifies extreme values as exceedances of a high threshold, which is defined as a fixed quantile and estimated by quantile regression. Excesses of the threshold are modelled with the generalized Pareto distribution (GPD). GAM forms are adopted for the threshold and GPD parameters, and directly estimated—in particular smoothing parameters—by restricted maximum likelihood, which provides an objective and relatively fast method of inference. The GAM models are used to produce return level maps for extreme wind gust speeds over the United States, which show extreme quantiles of the distribution of annual maximum gust speeds. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1865-1879
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1529596
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1529596
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1865-1879
Template-Type: ReDIF-Article 1.0
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Yuan Ke
Author-X-Name-First: Yuan
Author-X-Name-Last: Ke
Author-Name: Qiang Sun
Author-X-Name-First: Qiang
Author-X-Name-Last: Sun
Author-Name: Wen-Xin Zhou
Author-X-Name-First: Wen-Xin
Author-X-Name-Last: Zhou
Title: FarmTest: Factor-Adjusted Robust Multiple Testing With Approximate False Discovery Control
Abstract:
Large-scale multiple testing with correlated and heavy-tailed data arises in a wide range of research areas from genomics, medical imaging to finance. Conventional methods for estimating the false discovery proportion (FDP) often ignore the effect of heavy-tailedness and the dependence structure among test statistics, and thus may lead to inefficient or even inconsistent estimation. Also, the commonly imposed joint normality assumption is arguably too stringent for many applications. To address these challenges, in this article we propose a factor-adjusted robust multiple testing (FarmTest) procedure for large-scale simultaneous inference with control of the FDP. We demonstrate that robust factor adjustments are extremely important in both controlling the FDP and improving the power. We identify general conditions under which the proposed method produces a consistent estimate of the FDP. As a byproduct that is of independent interest, we establish an exponential-type deviation inequality for a robust U-type covariance estimator under the spectral norm. Extensive numerical experiments demonstrate the advantage of the proposed method over several state-of-the-art methods, especially when the data are generated from heavy-tailed distributions. The proposed procedures are implemented in the R-package FarmTest. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1880-1893
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1527700
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1527700
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1880-1893
Template-Type: ReDIF-Article 1.0
Author-Name: Will Wei Sun
Author-X-Name-First: Will Wei
Author-X-Name-Last: Sun
Author-Name: Lexin Li
Author-X-Name-First: Lexin
Author-X-Name-Last: Li
Title: Dynamic Tensor Clustering
Abstract:
Dynamic tensor data are becoming prevalent in numerous applications. Existing tensor clustering methods either fail to account for the dynamic nature of the data, or are inapplicable to a general-order tensor. There is also a gap between statistical guarantee and computational efficiency for existing tensor clustering solutions. In this article, we propose a new dynamic tensor clustering method that works for a general-order dynamic tensor, and enjoys both strong statistical guarantee and high computational efficiency. Our proposal is based on a new structured tensor factorization that encourages both sparsity and smoothness in parameters along the specified tensor modes. Computationally, we develop a highly efficient optimization algorithm that benefits from substantial dimension reduction. Theoretically, we first establish a nonasymptotic error bound for the estimator from the structured tensor factorization. Built upon this error bound, we then derive the rate of convergence of the estimated cluster centers, and show that the estimated clusters recover the true cluster structures with high probability. Moreover, our proposed method can be naturally extended to co-clustering of multiple modes of the tensor data. The efficacy of our method is illustrated through simulations and a brain dynamic functional connectivity analysis from an autism spectrum disorder study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1894-1907
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1527701
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1527701
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1894-1907
Template-Type: ReDIF-Article 1.0
Author-Name: Zhao Ren
Author-X-Name-First: Zhao
Author-X-Name-Last: Ren
Author-Name: Yongjian Kang
Author-X-Name-First: Yongjian
Author-X-Name-Last: Kang
Author-Name: Yingying Fan
Author-X-Name-First: Yingying
Author-X-Name-Last: Fan
Author-Name: Jinchi Lv
Author-X-Name-First: Jinchi
Author-X-Name-Last: Lv
Title: Tuning-Free Heterogeneous Inference in Massive Networks
Abstract:
Heterogeneity is often natural in many contemporary applications involving massive data. While posing new challenges to effective learning, it can play a crucial role in powering meaningful scientific discoveries through the integration of information among subpopulations of interest. In this article, we exploit multiple networks with Gaussian graphs to encode the connectivity patterns of a large number of features on the subpopulations. To uncover the underlying sparsity structures across subpopulations, we suggest a framework of large-scale tuning-free heterogeneous inference, where the number of networks is allowed to diverge. In particular, two new tests, the chi-based and the linear functional-based tests, are introduced and their asymptotic null distributions are established. Under mild regularity conditions, we establish that both tests are optimal in achieving the testable region boundary and the sample size requirement for the latter test is minimal. Both theoretical guarantees and the tuning-free property stem from efficient multiple-network estimation by our newly suggested heterogeneous group square-root Lasso for high-dimensional multi-response regression with heterogeneous noises. To solve this convex program, we further introduce a scalable algorithm that enjoys provable convergence to the global optimum. Both computational and theoretical advantages are elucidated through simulation and real data examples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1908-1925
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2018.1537920
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537920
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1908-1925
Template-Type: ReDIF-Article 1.0
Author-Name: Neil Pearce
Author-X-Name-First: Neil
Author-X-Name-Last: Pearce
Title: Handbook of Statistical Methods for Case-Control Studies.
Journal: Journal of the American Statistical Association
Pages: 1926-1928
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2019.1691865
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1691865
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1926-1928
Template-Type: ReDIF-Article 1.0
Author-Name: A. Alexandre Trindade
Author-X-Name-First: A. Alexandre
Author-X-Name-Last: Trindade
Title: Linear Models and the Relevant Distributions and Matrix Algebra.
Journal: Journal of the American Statistical Association
Pages: 1928-1929
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2019.1691864
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1691864
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:1928-1929
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Editorial Collaborators
Journal: Journal of the American Statistical Association
Pages: W1930-W1938
Issue: 528
Volume: 114
Year: 2019
Month: 10
X-DOI: 10.1080/01621459.2019.1690842
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1690842
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:528:p:W1930-W1938
Template-Type: ReDIF-Article 1.0
Author-Name: Qinshu Lian
Author-X-Name-First: Qinshu
Author-X-Name-Last: Lian
Author-Name: James S. Hodges
Author-X-Name-First: James S.
Author-X-Name-Last: Hodges
Author-Name: Haitao Chu
Author-X-Name-First: Haitao
Author-X-Name-Last: Chu
Title: A Bayesian Hierarchical Summary Receiver Operating Characteristic Model for Network Meta-Analysis of Diagnostic Tests
Abstract:
In studies evaluating the accuracy of diagnostic tests, three designs are commonly used, crossover, randomized, and noncomparative. Existing methods for meta-analysis of diagnostic tests mainly consider the simple cases in which the reference test in all or none of the studies can be considered a gold standard test, and in which all studies use either a randomized or noncomparative design. The proliferation of diagnostic instruments and the diversity of study designs create a need for more general methods to combine studies that include or do not include a gold standard test and that use various designs. This article extends the Bayesian hierarchical summary receiver operating characteristic model to network meta-analysis of diagnostic tests to simultaneously compare multiple tests within a missing data framework. The method accounts for correlations between multiple tests and for heterogeneity between studies. It also allows different studies to include different subsets of diagnostic tests and provides flexibility in the choice of summary statistics. The model is evaluated using simulations and illustrated using real data on tests for deep vein thrombosis, with sensitivity analyses. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 949-961
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1476239
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1476239
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:949-961
Template-Type: ReDIF-Article 1.0
Author-Name: Steffen Ventz
Author-X-Name-First: Steffen
Author-X-Name-Last: Ventz
Author-Name: Matteo Cellamare
Author-X-Name-First: Matteo
Author-X-Name-Last: Cellamare
Author-Name: Sergio Bacallado
Author-X-Name-First: Sergio
Author-X-Name-Last: Bacallado
Author-Name: Lorenzo Trippa
Author-X-Name-First: Lorenzo
Author-X-Name-Last: Trippa
Title: Bayesian Uncertainty Directed Trial Designs
Abstract:
Most Bayesian response-adaptive designs unbalance randomization rates toward the most promising arms with the goal of increasing the number of positive treatment outcomes during the study, even though the primary aim of the trial is different. We discuss Bayesian uncertainty directed designs (BUD), a class of Bayesian designs in which the investigator specifies an information measure tailored to the experiment. All decisions during the trial are selected to optimize the available information at the end of the study. The approach can be applied to several designs, ranging from early stage multi-arm trials to biomarker-driven and multi-endpoint studies. We discuss the asymptotic limit of the patient allocation proportion to treatments, and illustrate the finite-sample operating characteristics of BUD designs through examples, including multi-arm trials, biomarker-stratified trials, and trials with multiple co-primary endpoints. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 962-974
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1497497
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1497497
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:962-974
Template-Type: ReDIF-Article 1.0
Author-Name: Zhonghua Liu
Author-X-Name-First: Zhonghua
Author-X-Name-Last: Liu
Author-Name: Xihong Lin
Author-X-Name-First: Xihong
Author-X-Name-Last: Lin
Title: A Geometric Perspective on the Power of Principal Component Association Tests in Multiple Phenotype Studies
Abstract:
Joint analysis of multiple phenotypes can increase statistical power in genetic association studies. Principal component analysis, as a popular dimension reduction method, especially when the number of phenotypes is high dimensional, has been proposed to analyze multiple correlated phenotypes. It has been empirically observed that the first PC, which summarizes the largest amount of variance, can be less powerful than higher-order PCs and other commonly used methods in detecting genetic association signals. In this article, we investigate the properties of PCA-based multiple phenotype analysis from a geometric perspective by introducing a novel concept called principal angle. A particular PC is powerful if its principal angle is 0° and is powerless if its principal angle is 90°. Without prior knowledge about the true principal angle, each PC can be powerless. We propose linear, nonlinear, and data-adaptive omnibus tests by combining PCs. We demonstrate that the Wald test is a special quadratic PC-based test. We show that the omnibus PC test is robust and powerful in a wide range of scenarios. We study the properties of the proposed methods using power analysis and eigen-analysis. The subtle differences and close connections between these combined PC methods are illustrated graphically in terms of their rejection boundaries. Our proposed tests have convex acceptance regions and hence are admissible. The p-values for the proposed tests can be efficiently calculated analytically and the proposed tests have been implemented in a publicly available R package MPAT. We conduct simulation studies in both low- and high-dimensional settings with various signal vectors and correlation structures. We apply the proposed tests to the joint analysis of metabolic syndrome-related phenotypes with datasets collected from four international consortia to demonstrate the effectiveness of the proposed combined PC testing procedures. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 975-990
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1513363
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1513363
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:975-990
Template-Type: ReDIF-Article 1.0
Author-Name: Qian Li
Author-X-Name-First: Qian
Author-X-Name-Last: Li
Author-Name: Damla Şentürk
Author-X-Name-First: Damla
Author-X-Name-Last: Şentürk
Author-Name: Catherine A. Sugar
Author-X-Name-First: Catherine A.
Author-X-Name-Last: Sugar
Author-Name: Shafali Jeste
Author-X-Name-First: Shafali
Author-X-Name-Last: Jeste
Author-Name: Charlotte DiStefano
Author-X-Name-First: Charlotte
Author-X-Name-Last: DiStefano
Author-Name: Joel Frohlich
Author-X-Name-First: Joel
Author-X-Name-Last: Frohlich
Author-Name: Donatello Telesca
Author-X-Name-First: Donatello
Author-X-Name-Last: Telesca
Title: Inferring Brain Signals Synchronicity From a Sample of EEG Readings
Abstract:
Inferring patterns of synchronous brain activity from a heterogeneous sample of electroencephalograms is scientifically and methodologically challenging. While it is intuitively and statistically appealing to rely on readings from more than one individual in order to highlight recurrent patterns of brain activation, pooling information across subjects presents nontrivial methodological problems. We discuss some of the scientific issues associated with the understanding of synchronized neuronal activity and propose a methodological framework for statistical inference from a sample of EEG readings. Our work builds on classical contributions in time-series, clustering, and functional data analysis, in an effort to reframe a challenging inferential problem in the context of familiar analytical techniques. Some attention is paid to computational issues, with a proposal based on the combination of machine learning and Bayesian techniques. Code submitted with this article was checked by an Associate Editor for Reproducibility and is available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 991-1001
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1518233
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1518233
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:991-1001
Template-Type: ReDIF-Article 1.0
Author-Name: Justin Strait
Author-X-Name-First: Justin
Author-X-Name-Last: Strait
Author-Name: Oksana Chkrebtii
Author-X-Name-First: Oksana
Author-X-Name-Last: Chkrebtii
Author-Name: Sebastian Kurtek
Author-X-Name-First: Sebastian
Author-X-Name-Last: Kurtek
Title: Automatic Detection and Uncertainty Quantification of Landmarks on Elastic Curves
Abstract:
A population quantity of interest in statistical shape analysis is the location of landmarks, which are points that aid in reconstructing and representing shapes of objects. We provide an automated, model-based approach to inferring landmarks given a sample of shape data. The model is formulated based on a linear reconstruction of the shape, passing through the specified points, and a Bayesian inferential approach is described for estimating unknown landmark locations. The question of how many landmarks to select is addressed in two different ways: (1) by defining a criterion-based approach and (2) joint estimation of the number of landmarks along with their locations. Efficient methods for posterior sampling are also discussed. We motivate our approach using several simulated examples, as well as data obtained from applications in computer vision, biology, and medical imaging. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1002-1017
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1527224
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1527224
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1002-1017
Template-Type: ReDIF-Article 1.0
Author-Name: Yang Chen
Author-X-Name-First: Yang
Author-X-Name-Last: Chen
Author-Name: Xiao-Li Meng
Author-X-Name-First: Xiao-Li
Author-X-Name-Last: Meng
Author-Name: Xufei Wang
Author-X-Name-First: Xufei
Author-X-Name-Last: Wang
Author-Name: David A. van Dyk
Author-X-Name-First: David A.
Author-X-Name-Last: van Dyk
Author-Name: Herman L. Marshall
Author-X-Name-First: Herman L.
Author-X-Name-Last: Marshall
Author-Name: Vinay L. Kashyap
Author-X-Name-First: Vinay L.
Author-X-Name-Last: Kashyap
Title: Calibration Concordance for Astronomical Instruments via Multiplicative Shrinkage
Abstract:
Calibration data are often obtained by observing several well-understood objects simultaneously with multiple instruments, such as satellites for measuring astronomical sources. Analyzing such data and obtaining proper concordance among the instruments is challenging when the physical source models are not well understood, when there are uncertainties in “known” physical quantities, or when data quality varies in ways that cannot be fully quantified. Furthermore, the number of model parameters increases with both the number of instruments and the number of sources. Thus, concordance of the instruments requires careful modeling of the mean signals, the intrinsic source differences, and measurement errors. In this article, we propose a log-Normal model and a more general log-t model that respect the multiplicative nature of the mean signals via a half-variance adjustment, yet permit imperfections in the mean modeling to be absorbed by residual variances. We present analytical solutions in the form of power shrinkage in special cases and develop reliable Markov chain Monte Carlo algorithms for general cases, both of which are available in the Python module CalConcordance. We apply our method to several datasets including a combination of observations of active galactic nuclei (AGN) and spectral line emission from the supernova remnant E0102, obtained with a variety of X-ray telescopes such as Chandra, XMM-Newton, Suzaku, and Swift. The data are compiled by the International Astronomical Consortium for High Energy Calibration. We demonstrate that our method provides helpful and practical guidance for astrophysicists when adjusting for disagreements among instruments. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1018-1037
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1528978
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1528978
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1018-1037
Template-Type: ReDIF-Article 1.0
Author-Name: David Benkeser
Author-X-Name-First: David
Author-X-Name-Last: Benkeser
Author-Name: Peter B. Gilbert
Author-X-Name-First: Peter B.
Author-X-Name-Last: Gilbert
Author-Name: Marco Carone
Author-X-Name-First: Marco
Author-X-Name-Last: Carone
Title: Estimating and Testing Vaccine Sieve Effects Using Machine Learning
Abstract:
When available, vaccines are an effective means of disease prevention. Unfortunately, efficacious vaccines have not yet been developed for several major infectious diseases, including HIV and malaria. Vaccine sieve analysis studies whether and how the efficacy of a vaccine varies with the genetics of the pathogen of interest, which can guide subsequent vaccine development and deployment. In sieve analyses, the effect of the vaccine on the cumulative incidence corresponding to each of several possible genotypes is often assessed within a competing risks framework. In the context of clinical trials, the estimators employed in these analyses generally do not account for covariates, even though the latter may be predictive of the study endpoint or censoring. Motivated by two recent preventive vaccine efficacy trials for HIV and malaria, we develop new methodology for vaccine sieve analysis. Our approach offers improved validity and efficiency relative to existing approaches by allowing covariate adjustment through ensemble machine learning. We derive results that indicate how to perform statistical inference using our estimators. Our analysis of the HIV and malaria trials shows markedly increased precision—up to doubled efficiency in both trials—under more plausible assumptions compared with standard methodology. Our findings provide greater evidence for vaccine sieve effects in both trials. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1038-1049
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1529594
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1529594
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1038-1049
Template-Type: ReDIF-Article 1.0
Author-Name: Furong Li
Author-X-Name-First: Furong
Author-X-Name-Last: Li
Author-Name: Huiyan Sang
Author-X-Name-First: Huiyan
Author-X-Name-Last: Sang
Title: Spatial Homogeneity Pursuit of Regression Coefficients for Large Datasets
Abstract:
Spatial regression models have been widely used to describe the relationship between a response variable and some explanatory variables over a region of interest, taking into account the spatial dependence of the observations. In many applications, relationships between response variables and covariates are expected to exhibit complex spatial patterns. We propose a new approach, referred to as spatially clustered coefficient (SCC) regression, to detect spatially clustered patterns in the regression coefficients. It incorporates spatial neighborhood information through a carefully constructed regularization to automatically detect change points in space and to achieve computational scalability. Our numerical studies suggest that SCC works very effectively, capturing not only clustered coefficients, but also smoothly varying coefficients because of its strong local adaptivity. This flexibility allows researchers to explore various spatial structures in regression coefficients. We also establish theoretical properties of SCC. We use SCC to explore the relationship between the temperature and salinity of sea water in the Atlantic basin; this can provide important insights about the evolution of individual water masses and the pathway and strength of meridional overturning circulation in oceanography. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1050-1062
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1529595
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1529595
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1050-1062
Template-Type: ReDIF-Article 1.0
Author-Name: Samuel I. Berchuck
Author-X-Name-First: Samuel I.
Author-X-Name-Last: Berchuck
Author-Name: Jean-Claude Mwanza
Author-X-Name-First: Jean-Claude
Author-X-Name-Last: Mwanza
Author-Name: Joshua L. Warren
Author-X-Name-First: Joshua L.
Author-X-Name-Last: Warren
Title: Diagnosing Glaucoma Progression With Visual Field Data Using a Spatiotemporal Boundary Detection Method
Abstract:
Diagnosing glaucoma progression is critical for limiting irreversible vision loss. A common method for assessing glaucoma progression uses a longitudinal series of visual fields (VFs) acquired at regular intervals. VF data are characterized by a complex spatiotemporal structure due to the data generating process and ocular anatomy. Thus, advanced statistical methods are needed to make clinical determinations regarding progression status. We introduce a spatiotemporal boundary detection model that allows the underlying anatomy of the optic disc to dictate the spatial structure of the VF data across time. We show that our new method provides novel insight into vision loss that improves diagnosis of glaucoma progression using data from the Vein Pulsation Study Trial in Glaucoma and the Lions Eye Institute trial registry. Simulations are presented, showing the proposed methodology is preferred over existing spatial methods for VF data. Supplementary materials for this article are available online and the method is implemented in the R package womblR.
Journal: Journal of the American Statistical Association
Pages: 1063-1074
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1537911
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537911
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1063-1074
Template-Type: ReDIF-Article 1.0
Author-Name: Jesson J. Einmahl
Author-X-Name-First: Jesson J.
Author-X-Name-Last: Einmahl
Author-Name: John H. J. Einmahl
Author-X-Name-First: John H. J.
Author-X-Name-Last: Einmahl
Author-Name: Laurens de Haan
Author-X-Name-First: Laurens
Author-X-Name-Last: de Haan
Title: Limits to Human Life Span Through Extreme Value Theory
Abstract:
There is no scientific consensus on the fundamental question whether the probability distribution of the human life span has a finite endpoint or not and, if so, whether this upper limit changes over time. Our study uses a unique dataset of the ages at death—in days—of all (about 285,000) Dutch residents, born in the Netherlands, who died in the years 1986–2015 at a minimum age of 92 years and is based on extreme value theory, the coherent approach to research problems of this type. Unlike some other studies, we base our analysis on the configuration of thousands of mortality data of old people, not just the few oldest old. We find compelling statistical evidence that there is indeed an upper limit to the life span of men and to that of women for all the 30 years we consider and, moreover, that there are no indications of trends in these upper limits over the last 30 years, despite the fact that the number of people reaching high age (say 95 years) almost tripled. We also present estimates for the endpoints, for the force of mortality at very high age, and for the so-called perseverance parameter. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1075-1080
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1537912
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537912
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1075-1080
Template-Type: ReDIF-Article 1.0
Author-Name: Shahin Tavakoli
Author-X-Name-First: Shahin
Author-X-Name-Last: Tavakoli
Author-Name: Davide Pigoli
Author-X-Name-First: Davide
Author-X-Name-Last: Pigoli
Author-Name: John A. D. Aston
Author-X-Name-First: John A. D.
Author-X-Name-Last: Aston
Author-Name: John S. Coleman
Author-X-Name-First: John S.
Author-X-Name-Last: Coleman
Title: A Spatial Modeling Approach for Linguistic Object Data: Analyzing Dialect Sound Variations Across Great Britain
Abstract:
Dialect variation is of considerable interest in linguistics and other social sciences. However, traditionally it has been studied using proxies (transcriptions) rather than acoustic recordings directly. We introduce novel statistical techniques to analyze geolocalized speech recordings and to explore the spatial variation of pronunciations continuously over the region of interest, as opposed to traditional isoglosses, which provide a discrete partition of the region. Data of this type require an explicit modeling of the variation in the mean and the covariance. Usual Euclidean metrics are not appropriate, and we therefore introduce the concept of d-covariance, which allows consistent estimation both in space and at individual locations. We then propose spatial smoothing for these objects which accounts for the possibly nonconvex geometry of the domain of interest. We apply the proposed method to data from the spoken part of the British National Corpus, deposited at the British Library, London, and we produce maps of the dialect variation over Great Britain. In addition, the methods allow for acoustic reconstruction across the domain of interest, allowing researchers to listen to the statistical analysis. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1081-1096
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2019.1607357
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1607357
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1081-1096
Template-Type: ReDIF-Article 1.0
Author-Name: Ian L. Dryden
Author-X-Name-First: Ian L.
Author-X-Name-Last: Dryden
Author-Name: Simon P. Preston
Author-X-Name-First: Simon P.
Author-X-Name-Last: Preston
Author-Name: Katie E. Severn
Author-X-Name-First: Katie E.
Author-X-Name-Last: Severn
Title: Discussion: Object-Oriented Data Analysis, Power Metrics, and Graph Laplacians
Journal: Journal of the American Statistical Association
Pages: 1097-1098
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2019.1635477
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1635477
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1097-1098
Template-Type: ReDIF-Article 1.0
Author-Name: Alexander Petersen
Author-X-Name-First: Alexander
Author-X-Name-Last: Petersen
Author-Name: Hans-Georg Müller
Author-X-Name-First: Hans-Georg
Author-X-Name-Last: Müller
Title: Discussion: A Spatial Modeling Approach for Linguistic Object Data: Analyzing Dialect Sound Variations Across Great Britain, by Shahin Tavakoli et al.
Journal: Journal of the American Statistical Association
Pages: 1099-1101
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2019.1635478
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1635478
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1099-1101
Template-Type: ReDIF-Article 1.0
Author-Name: J. S. Marron
Author-X-Name-First: J. S.
Author-X-Name-Last: Marron
Title: Discussion: A Spatial Modeling Approach for Linguistic Object Data: Analysing Dialect Sound Variations Across Great Britain, by Shahin Tavakoli et al.
Journal: Journal of the American Statistical Association
Pages: 1102-1102
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2019.1639513
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1639513
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1102-1102
Template-Type: ReDIF-Article 1.0
Author-Name: Shahin Tavakoli
Author-X-Name-First: Shahin
Author-X-Name-Last: Tavakoli
Author-Name: Davide Pigoli
Author-X-Name-First: Davide
Author-X-Name-Last: Pigoli
Author-Name: John A. D. Aston
Author-X-Name-First: John A. D.
Author-X-Name-Last: Aston
Author-Name: John S. Coleman
Author-X-Name-First: John S.
Author-X-Name-Last: Coleman
Title: Rejoinder for “A Spatial Modeling Approach for Linguistic Object Data: Analyzing Dialect Sound Variations Across Great Britain”
Journal: Journal of the American Statistical Association
Pages: 1103-1104
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2019.1655931
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1655931
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1103-1104
Template-Type: ReDIF-Article 1.0
Author-Name: Patrick Rubin-Delanchy
Author-X-Name-First: Patrick
Author-X-Name-Last: Rubin-Delanchy
Author-Name: Nicholas A. Heard
Author-X-Name-First: Nicholas A.
Author-X-Name-Last: Heard
Author-Name: Daniel J. Lawson
Author-X-Name-First: Daniel J.
Author-X-Name-Last: Lawson
Title: Meta-Analysis of Mid-p-Values: Some New Results based on the Convex Order
Abstract:
The mid-p-value is a proposed improvement on the ordinary p-value for the case where the test statistic is partially or completely discrete. In this case, the ordinary p-value is conservative, meaning that its null distribution is larger than a uniform distribution on the unit interval, in the usual stochastic order. The mid-p-value is not conservative. However, its null distribution is dominated by the uniform distribution in a different stochastic order, called the convex order. This property leads us to discover some new finite-sample and asymptotic bounds on functions of mid-p-values, which can be used to combine results from different hypothesis tests conservatively, yet more powerfully, using mid-p-values rather than p-values. Our methodology is demonstrated on real data from a cyber-security application.
Journal: Journal of the American Statistical Association
Pages: 1105-1112
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1469994
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1469994
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1105-1112
Template-Type: ReDIF-Article 1.0
Author-Name: Jeffrey W. Miller
Author-X-Name-First: Jeffrey W.
Author-X-Name-Last: Miller
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Robust Bayesian Inference via Coarsening
Abstract:
The standard approach to Bayesian inference is based on the assumption that the distribution of the data belongs to the chosen model class. However, even a small violation of this assumption can have a large impact on the outcome of a Bayesian procedure. We introduce a novel approach to Bayesian inference that improves robustness to small departures from the model: rather than conditioning on the event that the observed data are generated by the model, one conditions on the event that the model generates data close to the observed data, in a distributional sense. When closeness is defined in terms of relative entropy, the resulting “coarsened” posterior can be approximated by simply tempering the likelihood—that is, by raising the likelihood to a fractional power—thus, inference can usually be implemented via standard algorithms, and one can even obtain analytical solutions when using conjugate priors. Some theoretical properties are derived, and we illustrate the approach with real and simulated data using mixture models and autoregressive models of unknown order. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1113-1125
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1469995
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1469995
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1113-1125
Template-Type: ReDIF-Article 1.0
Author-Name: Mickaël De Backer
Author-X-Name-First: Mickaël
Author-X-Name-Last: De Backer
Author-Name: Anouar El Ghouch
Author-X-Name-First: Anouar El
Author-X-Name-Last: Ghouch
Author-Name: Ingrid Van Keilegom
Author-X-Name-First: Ingrid
Author-X-Name-Last: Van Keilegom
Title: An Adapted Loss Function for Censored Quantile Regression
Abstract:
In this article, we study a novel approach for the estimation of quantiles when facing potential right censoring of the responses. Contrary to the existing literature on the subject, the adopted strategy of this article is to tackle censoring at the very level of the loss function usually employed for the computation of quantiles, the so-called “check” function. For interpretation purposes, a simple comparison with the latter reveals how censoring is accounted for in the newly proposed loss function. Subsequently, when considering the inclusion of covariates for conditional quantile estimation, by defining a new general loss function the proposed methodology opens the gate to numerous parametric, semiparametric, and nonparametric modeling techniques. To illustrate this statement, we consider the well-studied linear regression under the usual assumption of conditional independence between the true response and the censoring variable. For practical minimization of the studied loss function, we also provide a simple algorithmic procedure shown to yield satisfactory results for the proposed estimator with respect to the existing literature in an extensive simulation study. From a more theoretical perspective, consistency and asymptotic normality of the estimator for linear regression are obtained using several recent results on nonsmooth semiparametric estimating equations with an infinite-dimensional nuisance parameter, while numerical examples illustrate the adequacy of a simple bootstrap procedure for inferential purposes. Lastly, an application to a real dataset is used to further illustrate the validity and finite sample performance of the proposed estimator. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1126-1137
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1469996
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1469996
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1126-1137
Template-Type: ReDIF-Article 1.0
Author-Name: Xinbing Kong
Author-X-Name-First: Xinbing
Author-X-Name-Last: Kong
Author-Name: Jiangyan Wang
Author-X-Name-First: Jiangyan
Author-X-Name-Last: Wang
Author-Name: Jinbao Xing
Author-X-Name-First: Jinbao
Author-X-Name-Last: Xing
Author-Name: Chao Xu
Author-X-Name-First: Chao
Author-X-Name-Last: Xu
Author-Name: Chao Ying
Author-X-Name-First: Chao
Author-X-Name-Last: Ying
Title: Factor and Idiosyncratic Empirical Processes
Abstract:
The distributions of the common and idiosyncratic components for an individual variable are important in forecasting and applications. However, they are not identified with low-dimensional observations. Using the recently developed theory for the large-dimensional approximate factor model for large panel data, the common and idiosyncratic components can be estimated consistently. Based on the estimated common and idiosyncratic components, we construct the empirical processes for estimation of the distribution functions of the common and idiosyncratic components. We prove that the two empirical processes are oracle efficient when T = o(p), where p and T are the dimension and sample size, respectively. This demonstrates that the factor and idiosyncratic empirical processes behave as well as the empirical processes pretending that the common and idiosyncratic components for an individual variable are directly observable. Based on this oracle property, we construct simultaneous confidence bands (SCBs) for the distributions of the common and idiosyncratic components. For the first-order consistency of the estimated distribution functions, $\sqrt{T} = o(p)$ suffices. Extensive simulation studies confirm that the estimated bands have good coverage frequencies. Our real data analysis shows that the common-component distribution has a structural change during the crisis in 2008, while the idiosyncratic-component distribution does not change much. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1138-1146
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1469997
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1469997
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1138-1146
Template-Type: ReDIF-Article 1.0
Author-Name: Yixin Wang
Author-X-Name-First: Yixin
Author-X-Name-Last: Wang
Author-Name: David M. Blei
Author-X-Name-First: David M.
Author-X-Name-Last: Blei
Title: Frequentist Consistency of Variational Bayes
Abstract:
A key challenge for modern Bayesian statistics is how to perform scalable inference of posterior distributions. To address this challenge, variational Bayes (VB) methods have emerged as a popular alternative to the classical Markov chain Monte Carlo (MCMC) methods. VB methods tend to be faster while achieving comparable predictive performance. However, there are few theoretical results around VB. In this article, we establish frequentist consistency and asymptotic normality of VB methods. Specifically, we connect VB methods to point estimates based on variational approximations, called frequentist variational approximations, and we use the connection to prove a variational Bernstein–von Mises theorem. The theorem leverages the theoretical characterizations of frequentist variational approximations to understand asymptotic properties of VB. In summary, we prove that (1) the VB posterior converges to the Kullback–Leibler (KL) minimizer of a normal distribution, centered at the truth and (2) the corresponding variational expectation of the parameter is consistent and asymptotically normal. As applications of the theorem, we derive asymptotic properties of VB posteriors in Bayesian mixture models, Bayesian generalized linear mixed models, and Bayesian stochastic block models. We conduct a simulation study to illustrate these theoretical results. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1147-1161
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1473776
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1473776
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1147-1161
Template-Type: ReDIF-Article 1.0
Author-Name: Ery Arias-Castro
Author-X-Name-First: Ery
Author-X-Name-Last: Arias-Castro
Author-Name: Beatriz Pateiro-López
Author-X-Name-First: Beatriz
Author-X-Name-Last: Pateiro-López
Author-Name: Alberto Rodríguez-Casal
Author-X-Name-First: Alberto
Author-X-Name-Last: Rodríguez-Casal
Title: Minimax Estimation of the Volume of a Set Under the Rolling Ball Condition
Abstract:
We consider the problem of estimating the volume of a compact domain in a Euclidean space based on a uniform sample from the domain. We assume that the domain has a boundary with positive reach. We propose a data-splitting approach to correct the bias of the plug-in estimator based on the sample α-convex hull. We show that this simple estimator achieves a minimax lower bound that we derive. Some numerical experiments corroborate our theoretical findings. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1162-1173
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1482751
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1482751
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1162-1173
Template-Type: ReDIF-Article 1.0
Author-Name: Marco Carone
Author-X-Name-First: Marco
Author-X-Name-Last: Carone
Author-Name: Alexander R. Luedtke
Author-X-Name-First: Alexander R.
Author-X-Name-Last: Luedtke
Author-Name: Mark J. van der Laan
Author-X-Name-First: Mark J.
Author-X-Name-Last: van der Laan
Title: Toward Computerized Efficient Estimation in Infinite-Dimensional Models
Abstract:
Despite the risk of misspecification they are tied to, parametric models continue to be used in statistical practice because they are simple and convenient to use. In particular, efficient estimation procedures in parametric models are easy to describe and implement. Unfortunately, the same cannot be said of semiparametric and nonparametric models. While the latter often reflect the level of available scientific knowledge more appropriately, performing efficient inference in these models is generally challenging. The efficient influence function is a key analytic object from which the construction of asymptotically efficient estimators can potentially be streamlined. However, the theoretical derivation of the efficient influence function requires specialized knowledge and is often a difficult task, even for experts. In this article, we present a novel representation of the efficient influence function and describe a numerical procedure for approximating its evaluation. The approach generalizes the nonparametric procedures of Frangakis et al. and Luedtke, Carone, and van der Laan to arbitrary models. We present theoretical results to support our proposal and illustrate the method in the context of several semiparametric problems. The proposed approach is an important step toward automating efficient estimation in general statistical models, thereby rendering more accessible the use of realistic models in statistical analyses. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1174-1190
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1482752
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1482752
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1174-1190
Template-Type: ReDIF-Article 1.0
Author-Name: Lixia Hu
Author-X-Name-First: Lixia
Author-X-Name-Last: Hu
Author-Name: Tao Huang
Author-X-Name-First: Tao
Author-X-Name-Last: Huang
Author-Name: Jinhong You
Author-X-Name-First: Jinhong
Author-X-Name-Last: You
Title: Estimation and Identification of a Varying-Coefficient Additive Model for Locally Stationary Processes
Abstract:
The additive model and the varying-coefficient model are both powerful regression tools, with wide practical applications. However, our empirical study of financial data has shown that both of these models have drawbacks when applied to locally stationary time series. For the analysis of functional data, Zhang and Wang have proposed a flexible regression method, called the varying-coefficient additive model (VCAM), and presented a two-step spline estimation method. Motivated by their approach, we adopt the VCAM to characterize the time-varying regression function in a locally stationary context. We propose a three-step spline estimation method and show its consistency and asymptotic normality. For the purpose of model diagnosis, we suggest an L2-distance test statistic to check the multiplicative assumption, and propose a two-stage penalty procedure to identify the additive terms and the varying-coefficient terms provided that the VCAM is applicable. We also present the asymptotic distribution of the proposed test statistics and demonstrate the consistency of the two-stage model identification procedure. Simulation studies investigating the finite-sample performance of the estimation and model diagnosis methods confirm the validity of our asymptotic theory. The financial data are also analyzed. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1191-1204
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1482753
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1482753
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1191-1204
Template-Type: ReDIF-Article 1.0
Author-Name: Naveen N. Narisetty
Author-X-Name-First: Naveen N.
Author-X-Name-Last: Narisetty
Author-Name: Juan Shen
Author-X-Name-First: Juan
Author-X-Name-Last: Shen
Author-Name: Xuming He
Author-X-Name-First: Xuming
Author-X-Name-Last: He
Title: Skinny Gibbs: A Consistent and Scalable Gibbs Sampler for Model Selection
Abstract:
We consider the computational and statistical issues for high-dimensional Bayesian model selection under Gaussian spike-and-slab priors. To avoid the large matrix computations needed in a standard Gibbs sampler, we propose a novel Gibbs sampler called “Skinny Gibbs,” which is much more scalable to high-dimensional problems, both in memory and in computational efficiency. In particular, its computational complexity grows only linearly in p, the number of predictors, while retaining the property of strong model selection consistency even when p is much greater than the sample size n. The present article focuses on logistic regression due to its broad applicability as a representative member of the generalized linear models. We compare our proposed method with several leading variable selection methods through a simulation study to show that Skinny Gibbs has a strong performance as indicated by our theoretical work. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1205-1217
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1482754
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1482754
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1205-1217
Template-Type: ReDIF-Article 1.0
Author-Name: Lingrui Gan
Author-X-Name-First: Lingrui
Author-X-Name-Last: Gan
Author-Name: Naveen N. Narisetty
Author-X-Name-First: Naveen N.
Author-X-Name-Last: Narisetty
Author-Name: Feng Liang
Author-X-Name-First: Feng
Author-X-Name-Last: Liang
Title: Bayesian Regularization for Graphical Models With Unequal Shrinkage
Abstract:
We consider a Bayesian framework for estimating a high-dimensional sparse precision matrix, in which adaptive shrinkage and sparsity are induced by a mixture of Laplace priors. Besides discussing our formulation from the Bayesian standpoint, we investigate the MAP (maximum a posteriori) estimator from a penalized likelihood perspective that gives rise to a new nonconvex penalty approximating the ℓ0 penalty. Optimal error rates for estimation consistency in terms of various matrix norms, along with selection consistency for sparse structure recovery, are shown for the unique MAP estimator under mild conditions. For fast and efficient computation, an EM algorithm is proposed to compute the MAP estimator of the precision matrix and (approximate) posterior probabilities on the edges of the underlying sparse structure. Through extensive simulation studies and a real application to call center data, we demonstrate the strong performance of our method compared with existing alternatives. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1218-1231
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1482755
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1482755
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1218-1231
Template-Type: ReDIF-Article 1.0
Author-Name: Fei Gao
Author-X-Name-First: Fei
Author-X-Name-Last: Gao
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: David Couper
Author-X-Name-First: David
Author-X-Name-Last: Couper
Author-Name: D. Y. Lin
Author-X-Name-First: D. Y.
Author-X-Name-Last: Lin
Title: Semiparametric Regression Analysis of Multiple Right- and Interval-Censored Events
Abstract:
Health sciences research often involves both right- and interval-censored events because the occurrence of a symptomatic disease can only be observed up to the end of follow-up, while the occurrence of an asymptomatic disease can only be detected through periodic examinations. We formulate the effects of potentially time-dependent covariates on the joint distribution of multiple right- and interval-censored events through semiparametric proportional hazards models with random effects that capture the dependence both within and between the two types of events. We consider nonparametric maximum likelihood estimation and develop a simple and stable EM algorithm for computation. We show that the resulting estimators are consistent and the parametric components are asymptotically normal and efficient with a covariance matrix that can be consistently estimated by profile likelihood or nonparametric bootstrap. In addition, we leverage the joint modelling to provide dynamic prediction of disease incidence based on the evolving event history. Furthermore, we assess the performance of the proposed methods through extensive simulation studies. Finally, we provide an application to a major epidemiological cohort study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1232-1240
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1482756
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1482756
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1232-1240
Template-Type: ReDIF-Article 1.0
Author-Name: Shirong Deng
Author-X-Name-First: Shirong
Author-X-Name-Last: Deng
Author-Name: Xingqiu Zhao
Author-X-Name-First: Xingqiu
Author-X-Name-Last: Zhao
Title: Covariate-Adjusted Regression for Distorted Longitudinal Data With Informative Observation Times
Abstract:
In many longitudinal studies, repeated response and predictors are not directly observed, but can be treated as distorted by unknown functions of a common confounding covariate. Moreover, longitudinal data involve an observation process which may be informative about the longitudinal response process in practice. To deal with such complex data, we propose a class of flexible semiparametric covariate-adjusted joint models. The new models not only allow for the longitudinal response to be correlated with observation times through latent variables and completely unspecified link functions, but they also characterize distorted longitudinal response and predictors by unknown multiplicative factors depending on time and a confounding covariate. For estimation of regression parameters in the proposed models, we develop a novel covariate-adjusted estimating equation approach which does not rely on forms of link functions and distributions of frailties. The asymptotic properties of the resulting parameter estimators are established and examined by simulation studies. A longitudinal data example containing calcium absorption and intake measurements is provided for illustration. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1241-1250
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1482757
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1482757
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1241-1250
Template-Type: ReDIF-Article 1.0
Author-Name: Jia Guo
Author-X-Name-First: Jia
Author-X-Name-Last: Guo
Author-Name: Bu Zhou
Author-X-Name-First: Bu
Author-X-Name-Last: Zhou
Author-Name: Jin-Ting Zhang
Author-X-Name-First: Jin-Ting
Author-X-Name-Last: Zhang
Title: New Tests for Equality of Several Covariance Functions for Functional Data
Abstract:
In this article, we propose two new tests for the equality of the covariance functions of several functional populations, namely, a quasi-GPF test and a quasi-Fmax test, whose test statistics are obtained by globalizing a pointwise quasi-F-test statistic with integration and by taking its supremum over some time interval of interest, respectively. Unlike several existing tests, they are scale-invariant in the sense that their test statistics will not change if we multiply each of the observed functions by any nonzero function of time. We derive the asymptotic random expressions of the two tests under the null hypothesis and show that under some mild conditions, the asymptotic null distribution of the quasi-GPF test is a chi-squared-type mixture whose distribution can be well approximated by a simple scaled chi-squared distribution. We also propose a random permutation method for approximating the null distributions of the quasi-GPF and quasi-Fmax tests. The asymptotic distributions of the two tests under a local alternative are also investigated, and the two tests are shown to be root-n consistent. A theoretical power comparison between the quasi-GPF test and the L2-norm-based test proposed in the literature is also given. Simulation studies are presented to demonstrate the finite-sample performance of the new tests against five existing tests. An illustrative example is also presented. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1251-1263
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1483827
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1483827
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1251-1263
Template-Type: ReDIF-Article 1.0
Author-Name: Niklas Pfister
Author-X-Name-First: Niklas
Author-X-Name-Last: Pfister
Author-Name: Peter Bühlmann
Author-X-Name-First: Peter
Author-X-Name-Last: Bühlmann
Author-Name: Jonas Peters
Author-X-Name-First: Jonas
Author-X-Name-Last: Peters
Title: Invariant Causal Prediction for Sequential Data
Abstract:
We investigate the problem of inferring the causal predictors of a response Y from a set of d explanatory variables (X1, …, Xd). Classical ordinary least-square regression includes all predictors that reduce the variance of Y. Using only the causal predictors instead leads to models that have the advantage of remaining invariant under interventions; loosely speaking they lead to invariance across different “environments” or “heterogeneity patterns.” More precisely, the conditional distribution of Y given its causal predictors is the same for all observations, provided that there are no interventions on Y. Recent work exploits such a stability to infer causal relations from data with different but known environments. We show that even without having knowledge of the environments or heterogeneity pattern, inferring causal relations is possible for time-ordered (or any other type of sequentially ordered) data. In particular, this allows detecting instantaneous causal relations in multivariate linear time series, which is usually not the case for Granger causality. Besides novel methodology, we provide statistical confidence bounds and asymptotic detection results for inferring causal predictors, and present an application to monetary policy in macroeconomics. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1264-1276
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1491403
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1491403
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1264-1276
Template-Type: ReDIF-Article 1.0
Author-Name: Wei Qian
Author-X-Name-First: Wei
Author-X-Name-Last: Qian
Author-Name: Shanshan Ding
Author-X-Name-First: Shanshan
Author-X-Name-Last: Ding
Author-Name: R. Dennis Cook
Author-X-Name-First: R. Dennis
Author-X-Name-Last: Cook
Title: Sparse Minimum Discrepancy Approach to Sufficient Dimension Reduction with Simultaneous Variable Selection in Ultrahigh Dimension
Abstract:
Sufficient dimension reduction (SDR) is known to be a powerful tool for achieving data reduction and data visualization in regression and classification problems. In this work, we study ultrahigh-dimensional SDR problems and propose solutions under a unified minimum discrepancy approach with regularization. When p grows exponentially with n, consistency results in both central subspace estimation and variable selection are established simultaneously for important SDR methods, including sliced inverse regression (SIR), principal fitted component (PFC), and sliced average variance estimation (SAVE). Special sparse structures of large predictor or error covariance are also considered for potentially better performance. In addition, the proposed approach is equipped with a new algorithm to efficiently solve the regularized objective functions and a new data-driven procedure to determine structural dimension and tuning parameters, without the need to invert a large covariance matrix. Simulations and a real data analysis are offered to demonstrate the promise of our proposal in ultrahigh-dimensional settings. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1277-1290
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1497498
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1497498
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1277-1290
Template-Type: ReDIF-Article 1.0
Author-Name: Qingyuan Zhao
Author-X-Name-First: Qingyuan
Author-X-Name-Last: Zhao
Author-Name: Dylan S. Small
Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small
Author-Name: Weijie Su
Author-X-Name-First: Weijie
Author-X-Name-Last: Su
Title: Multiple Testing When Many p-Values are Uniformly Conservative, with Application to Testing Qualitative Interaction in Educational Interventions
Abstract:
In the evaluation of treatment effects, it is of major policy interest to know if the treatment is beneficial for some and harmful for others, a phenomenon known as qualitative interaction. We formulate this question as a multiple testing problem with many conservative null p-values, in which the classical multiple testing methods may lose power substantially. We propose a simple technique—conditioning—to improve the power. A crucial assumption we need is uniform conservativeness, meaning for any conservative p-value p, the conditional distribution (p/τ) | p ⩽ τ is stochastically larger than the uniform distribution on (0, 1) for any τ. We show this property holds for one-sided tests in a one-dimensional exponential family (e.g., testing for qualitative interaction) as well as testing |μ| ⩽ η using a statistic Y ∼ N(μ, 1) (e.g., testing for practical importance with threshold η). We propose an adaptive method to select the threshold τ. Our theoretical and simulation results suggest that the proposed tests gain significant power when many p-values are uniformly conservative and lose little power when no p-value is uniformly conservative. We apply our method to two educational intervention datasets. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1291-1304
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1497499
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1497499
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1291-1304
Template-Type: ReDIF-Article 1.0
Author-Name: Yuqing Pan
Author-X-Name-First: Yuqing
Author-X-Name-Last: Pan
Author-Name: Qing Mai
Author-X-Name-First: Qing
Author-X-Name-Last: Mai
Author-Name: Xin Zhang
Author-X-Name-First: Xin
Author-X-Name-Last: Zhang
Title: Covariate-Adjusted Tensor Classification in High Dimensions
Abstract:
In contemporary scientific research, it is often of great interest to predict a categorical response based on a high-dimensional tensor (i.e., multi-dimensional array) and additional covariates. Motivated by applications in science and engineering, we propose a comprehensive and interpretable discriminant analysis model, called the CATCH model (short for covariate-adjusted tensor classification in high dimensions). The CATCH model efficiently integrates the covariates and the tensor to predict the categorical outcome. It also jointly explains the complicated relationships among the covariates, the tensor predictor, and the categorical response. The tensor structure is used to achieve easy interpretation and accurate prediction. To tackle the new computational and statistical challenges arising from the intimidating tensor dimensions, we propose a penalized approach to select a subset of the tensor predictor entries that affect classification after adjustment for the covariates. An efficient algorithm is developed to take advantage of the tensor structure in the penalized estimation. Theoretical results confirm that the proposed method achieves variable selection and prediction consistency, even when the tensor dimension is much larger than the sample size. The superior performance of our method over existing methods is demonstrated in extensive simulated and real data examples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1305-1319
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1497500
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1497500
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1305-1319
Template-Type: ReDIF-Article 1.0
Author-Name: Shulei Wang
Author-X-Name-First: Shulei
Author-X-Name-Last: Wang
Author-Name: Ming Yuan
Author-X-Name-First: Ming
Author-X-Name-Last: Yuan
Title: Combined Hypothesis Testing on Graphs With Applications to Gene Set Enrichment Analysis
Abstract:
Motivated by gene set enrichment analysis, we investigate the problem of combined hypothesis testing on a graph. A general framework is introduced to make effective use of the structural information of the underlying graph when testing multivariate means. A new testing procedure is proposed within this framework, and shown to be optimal in that it can consistently detect departures from the collective null at a rate that no other test could improve, for almost all graphs. We also provide general performance bounds for the proposed test under any specific graph, and illustrate their utility through several common types of graphs. Numerical experiments are presented to further demonstrate the merits of our approach.
Journal: Journal of the American Statistical Association
Pages: 1320-1338
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1497501
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1497501
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1320-1338
Template-Type: ReDIF-Article 1.0
Author-Name: Frank Windmeijer
Author-X-Name-First: Frank
Author-X-Name-Last: Windmeijer
Author-Name: Helmut Farbmacher
Author-X-Name-First: Helmut
Author-X-Name-Last: Farbmacher
Author-Name: Neil Davies
Author-X-Name-First: Neil
Author-X-Name-Last: Davies
Author-Name: George Davey Smith
Author-X-Name-First: George
Author-X-Name-Last: Davey Smith
Title: On the Use of the Lasso for Instrumental Variables Estimation with Some Invalid Instruments
Abstract:
We investigate the behavior of the Lasso for selecting invalid instruments in linear instrumental variables models for estimating causal effects of exposures on outcomes, as proposed recently by Kang et al. Invalid instruments are such that they fail the exclusion restriction and enter the model as explanatory variables. We show that for this setup, the Lasso may not consistently select the invalid instruments if these are relatively strong. We propose a median estimator that is consistent when less than 50% of the instruments are invalid, and its consistency does not depend on the relative strength of the instruments, or their correlation structure. We show that this estimator can be used for adaptive Lasso estimation, with the resulting estimator having oracle properties. The methods are applied to a Mendelian randomization study to estimate the causal effect of body mass index (BMI) on diastolic blood pressure, using data on individuals from the UK Biobank, with 96 single nucleotide polymorphisms as potential instruments for BMI. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1339-1350
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1498346
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1498346
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1339-1350
Template-Type: ReDIF-Article 1.0
Author-Name: Yuval Benjamini
Author-X-Name-First: Yuval
Author-X-Name-Last: Benjamini
Author-Name: Jonathan Taylor
Author-X-Name-First: Jonathan
Author-X-Name-Last: Taylor
Author-Name: Rafael A. Irizarry
Author-X-Name-First: Rafael A.
Author-X-Name-Last: Irizarry
Title: Selection-Corrected Statistical Inference for Region Detection With High-Throughput Assays
Abstract:
Scientists use high-dimensional measurement assays to detect and prioritize regions of strong signal in spatially organized domains. Examples include finding methylation-enriched genomic regions using microarrays, and active cortical areas using brain imaging. The most common procedure for detecting potential regions is to group neighboring sites where the signal passes a threshold. However, one needs to account for the selection bias induced by this procedure to avoid diminished effects when generalizing to a population. This article introduces pin-down inference, a model and an inference framework that permit population inference for these detected regions. Pin-down inference provides nonasymptotic point and confidence interval estimators for the mean effect in the region that account for local selection bias. Our estimators accommodate nonstationary covariances that are typical of these data, allowing researchers to better compare regions of different sizes and correlation structures. Inference is provided within a conditional one-parameter exponential family per region, with truncations that match the selection constraints. A secondary screening-and-adjustment step allows pruning the set of detected regions, while controlling the false-coverage rate over the reported regions. We apply the method to genomic regions with differing DNA-methylation rates across tissue. Our method provides superior power compared to other conditional and nonparametric approaches. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1351-1365
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1498347
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1498347
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1351-1365
Template-Type: ReDIF-Article 1.0
Author-Name: Abdelaati Daouia
Author-X-Name-First: Abdelaati
Author-X-Name-Last: Daouia
Author-Name: Irène Gijbels
Author-X-Name-First: Irène
Author-X-Name-Last: Gijbels
Author-Name: Gilles Stupfler
Author-X-Name-First: Gilles
Author-X-Name-Last: Stupfler
Title: Extremiles: A New Perspective on Asymmetric Least Squares
Abstract:
Quantiles and expectiles of a distribution are found to be useful descriptors of its tail in the same way as the median and mean are related to its central behavior. This article considers a valuable alternative class to expectiles, called extremiles, which parallels the class of quantiles and includes the family of expected minima and expected maxima. The new class is motivated from several angles, which reveals its specific merits and strengths. Extremiles offer better capability of fitting both location and spread in data points and provide an appropriate theory that better displays the interesting features of long-tailed distributions. We discuss their estimation in the range of the data and beyond the sample maximum. A number of motivating examples are given to illustrate the utility of estimated extremiles in modeling noncentral behavior. There is in particular an interesting connection with coherent measures of risk protection. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1366-1381
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1498348
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1498348
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1366-1381
Template-Type: ReDIF-Article 1.0
Author-Name: J. C. Escanciano
Author-X-Name-First: J. C.
Author-X-Name-Last: Escanciano
Author-Name: S. C. Goh
Author-X-Name-First: S. C.
Author-X-Name-Last: Goh
Title: Quantile-Regression Inference With Adaptive Control of Size
Abstract:
Regression quantiles have asymptotic variances that depend on the conditional densities of the response variable given regressors. This article develops a new estimate of the asymptotic variance of regression quantiles that leads any resulting Wald-type test or confidence region to behave as well in large samples as its infeasible counterpart in which the true conditional response densities are embedded. We give explicit guidance on implementing the new variance estimator to control adaptively the size of any resulting Wald-type test. Monte Carlo evidence indicates the potential of our approach to deliver powerful tests of heterogeneity of quantile treatment effects in covariates with good size performance over different quantile levels, data-generating processes, and sample sizes. We also include an empirical example. Supplementary material is available online.
Journal: Journal of the American Statistical Association
Pages: 1382-1393
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1505624
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1505624
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1382-1393
Template-Type: ReDIF-Article 1.0
Author-Name: James E. Johndrow
Author-X-Name-First: James E.
Author-X-Name-Last: Johndrow
Author-Name: Aaron Smith
Author-X-Name-First: Aaron
Author-X-Name-Last: Smith
Author-Name: Natesh Pillai
Author-X-Name-First: Natesh
Author-X-Name-Last: Pillai
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: MCMC for Imbalanced Categorical Data
Abstract:
Many modern applications collect highly imbalanced categorical data, with some categories relatively rare. Bayesian hierarchical models combat data sparsity by borrowing information, while also quantifying uncertainty. However, posterior computation presents a fundamental barrier to routine use; a single class of algorithms does not work well in all settings, and practitioners waste time trying different types of Markov chain Monte Carlo (MCMC) approaches. This article was motivated by an application to quantitative advertising in which we encountered extremely poor computational performance for data augmentation MCMC algorithms but obtained excellent performance for adaptive Metropolis. To obtain a deeper understanding of this behavior, we derive theoretical results on the computational complexity of commonly used data augmentation algorithms and the Random Walk Metropolis algorithm for highly imbalanced binary data. In this regime, our results show that the computational complexity of Metropolis is logarithmic in sample size, while data augmentation is polynomial in sample size. The root cause of this poor performance of data augmentation is a discrepancy between the rates at which the target density and MCMC step sizes concentrate. Our methods also show that MCMC algorithms that exhibit a similar discrepancy will fail in large samples—a result with substantial practical impact. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1394-1403
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1505626
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1505626
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1394-1403
Template-Type: ReDIF-Article 1.0
Author-Name: Wensheng Zhu
Author-X-Name-First: Wensheng
Author-X-Name-Last: Zhu
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: Rui Song
Author-X-Name-First: Rui
Author-X-Name-Last: Song
Title: Proper Inference for Value Function in High-Dimensional Q-Learning for Dynamic Treatment Regimes
Abstract:
Dynamic treatment regimes are sequences of decision rules in which each treatment decision is tailored over time according to patients’ responses to previous treatments as well as covariate history. There is a growing interest in the development of correct statistical inference for optimal dynamic treatment regimes to handle the challenges of nonregularity problems in the presence of nonrespondents who have zero treatment effects, especially when the dimension of the tailoring variables is high. In this article, we propose a high-dimensional Q-learning (HQ-learning) method to facilitate the inference of optimal values and parameters. The proposed method allows us to simultaneously estimate the optimal dynamic treatment regimes and select the important variables that truly contribute to the individual reward. At the same time, hard thresholding is introduced in the method to eliminate the effects of the nonrespondents. The asymptotic properties of the parameter estimators as well as the estimated optimal value function are then established by adjusting for the bias due to thresholding. Both simulation studies and real data analysis demonstrate satisfactory performance in obtaining proper inference for the value function of the optimal dynamic treatment regimes. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1404-1417
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2018.1506341
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1506341
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1404-1417
Template-Type: ReDIF-Article 1.0
Author-Name: Hongying Dai
Author-X-Name-First: Hongying
Author-X-Name-Last: Dai
Title: Asymptotic Analysis of Mixed Effects Models: Theory, Applications, and Open Problems
Journal: Journal of the American Statistical Association
Pages: 1418-1420
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2019.1662242
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1662242
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1418-1420
Template-Type: ReDIF-Article 1.0
Author-Name: Kaixian Yu
Author-X-Name-First: Kaixian
Author-X-Name-Last: Yu
Title: Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics.
Journal: Journal of the American Statistical Association
Pages: 1420-1421
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2019.1662241
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1662241
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1420-1421
Template-Type: ReDIF-Article 1.0
Author-Name: Shu Yang
Author-X-Name-First: Shu
Author-X-Name-Last: Yang
Title: Flexible Imputation of Missing Data, 2nd ed.
Journal: Journal of the American Statistical Association
Pages: 1421-1421
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2019.1662249
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1662249
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1421-1421
Template-Type: ReDIF-Article 1.0
Author-Name: Ofer Harel
Author-X-Name-First: Ofer
Author-X-Name-Last: Harel
Title: Missing and Modified Data in Nonparametric Estimation: With R Examples.
Journal: Journal of the American Statistical Association
Pages: 1421-1423
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2019.1662248
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1662248
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1421-1423
Template-Type: ReDIF-Article 1.0
Author-Name: Anna Snavely
Author-X-Name-First: Anna
Author-X-Name-Last: Snavely
Title: Randomization, Masking, and Allocation Concealment.
Journal: Journal of the American Statistical Association
Pages: 1423-1424
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2019.1662247
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1662247
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1423-1424
Template-Type: ReDIF-Article 1.0
Author-Name: Chen Zhou
Author-X-Name-First: Chen
Author-X-Name-Last: Zhou
Title: Risk Theory: A Heavy Tail Approach.
Journal: Journal of the American Statistical Association
Pages: 1424-1425
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2019.1662244
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1662244
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1424-1425
Template-Type: ReDIF-Article 1.0
Author-Name: Dootika Vats
Author-X-Name-First: Dootika
Author-X-Name-Last: Vats
Title: Simulation and the Monte Carlo Method, 3rd ed.
Journal: Journal of the American Statistical Association
Pages: 1425-1425
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2019.1662243
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1662243
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1425-1425
Template-Type: ReDIF-Article 1.0
Author-Name: Jae-Kwang Kim
Author-X-Name-First: Jae-Kwang
Author-X-Name-Last: Kim
Title: Statistical Data Fusion
Journal: Journal of the American Statistical Association
Pages: 1425-1426
Issue: 527
Volume: 114
Year: 2019
Month: 7
X-DOI: 10.1080/01621459.2019.1662245
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1662245
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:527:p:1425-1426
Template-Type: ReDIF-Article 1.0
Author-Name: Ganggang Xu
Author-X-Name-First: Ganggang
Author-X-Name-Last: Xu
Author-Name: Rasmus Waagepetersen
Author-X-Name-First: Rasmus
Author-X-Name-Last: Waagepetersen
Author-Name: Yongtao Guan
Author-X-Name-First: Yongtao
Author-X-Name-Last: Guan
Title: Stochastic Quasi-Likelihood for Case-Control Point Pattern Data
Abstract:
We propose a novel stochastic quasi-likelihood estimation procedure for case-control point processes. Quasi-likelihood for point processes depends on a certain optimal weight function, and for the new method the weight function is stochastic since it depends on the control point pattern. The new procedure also provides a computationally efficient implementation of quasi-likelihood for univariate point processes, in which case a synthetic control point process is simulated by the user. Under mild conditions, the proposed approach yields consistent and asymptotically normal parameter estimators. We further show that the estimators are optimal in the sense that the associated Godambe information is maximal within a wide class of estimating functions for case-control point processes. The effectiveness of the proposed method is further illustrated using extensive simulation studies and two data examples.
Journal: Journal of the American Statistical Association
Pages: 631-644
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2017.1421543
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1421543
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:631-644
Template-Type: ReDIF-Article 1.0
Author-Name: Edward H. Kennedy
Author-X-Name-First: Edward H.
Author-X-Name-Last: Kennedy
Title: Nonparametric Causal Effects Based on Incremental Propensity Score Interventions
Abstract:
Most work in causal inference considers deterministic interventions that set each unit’s treatment to some fixed value. However, under positivity violations these interventions can lead to nonidentification, inefficiency, and effects with little practical relevance. Further, corresponding effects in longitudinal studies are highly sensitive to the curse of dimensionality, resulting in widespread use of unrealistic parametric models. We propose a novel solution to these problems: incremental interventions that shift propensity score values rather than set treatments to fixed values. Incremental interventions have several crucial advantages. First, they avoid positivity assumptions entirely. Second, they require no parametric assumptions and yet still admit a simple characterization of longitudinal effects, independent of the number of timepoints. For example, they allow longitudinal effects to be visualized with a single curve instead of lists of coefficients. After characterizing incremental interventions and giving identifying conditions for corresponding effects, we also develop general efficiency theory, propose efficient nonparametric estimators that can attain fast convergence rates even when incorporating flexible machine learning, and propose a bootstrap-based confidence band and simultaneous test of no treatment effect. Finally, we explore finite-sample performance via simulation, and apply the methods to study time-varying sociological effects of incarceration on entry into marriage. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 645-656
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2017.1422737
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1422737
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:645-656
Template-Type: ReDIF-Article 1.0
Author-Name: Leqin Wu
Author-X-Name-First: Leqin
Author-X-Name-Last: Wu
Author-Name: Xing Qiu
Author-X-Name-First: Xing
Author-X-Name-Last: Qiu
Author-Name: Ya-xiang Yuan
Author-X-Name-First: Ya-xiang
Author-X-Name-Last: Yuan
Author-Name: Hulin Wu
Author-X-Name-First: Hulin
Author-X-Name-Last: Wu
Title: Parameter Estimation and Variable Selection for Big Systems of Linear Ordinary Differential Equations: A Matrix-Based Approach
Abstract:
Ordinary differential equations (ODEs) are widely used to model the dynamic behavior of a complex system. Parameter estimation and variable selection for a “Big System” with linear ODEs are very challenging due to the need for nonlinear optimization in an ultra-high-dimensional parameter space. In this article, we develop a parameter estimation and variable selection method based on the ideas of similarity transformation and separable least squares (SLS). Simulation studies demonstrate that the proposed matrix-based SLS method estimates the coefficient matrix more accurately, and performs variable selection for a linear ODE system with thousands of dimensions and millions of parameters much better, than the direct least squares method and the vector-based two-stage method that are currently available. We applied this new method to two real datasets—a yeast cell cycle gene expression dataset with 30 dimensions and 930 unknown parameters and the Standard & Poor’s 1500 index stock price data with 1250 dimensions and 1,563,750 unknown parameters—to illustrate the utility and numerical performance of the proposed parameter estimation and variable selection method for big systems in practice. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 657-667
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2017.1423074
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1423074
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:657-667
Template-Type: ReDIF-Article 1.0
Author-Name: Michael I. Jordan
Author-X-Name-First: Michael I.
Author-X-Name-Last: Jordan
Author-Name: Jason D. Lee
Author-X-Name-First: Jason D.
Author-X-Name-Last: Lee
Author-Name: Yun Yang
Author-X-Name-First: Yun
Author-X-Name-Last: Yang
Title: Communication-Efficient Distributed Statistical Inference
Abstract:
We present a communication-efficient surrogate likelihood (CSL) framework for solving distributed statistical inference problems. CSL provides a communication-efficient surrogate to the global likelihood that can be used for low-dimensional estimation, high-dimensional regularized estimation, and Bayesian inference. For low-dimensional estimation, CSL provably improves upon naive averaging schemes and facilitates the construction of confidence intervals. For high-dimensional regularized estimation, CSL leads to a minimax-optimal estimator with controlled communication cost. For Bayesian inference, CSL can be used to form a communication-efficient quasi-posterior distribution that converges to the true posterior. This quasi-posterior procedure significantly improves the computational efficiency of Markov chain Monte Carlo (MCMC) algorithms even in a nondistributed setting. We present both theoretical analysis and experiments to explore the properties of the CSL approximation. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 668-681
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1429274
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1429274
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:668-681
Template-Type: ReDIF-Article 1.0
Author-Name: Michael Hornstein
Author-X-Name-First: Michael
Author-X-Name-Last: Hornstein
Author-Name: Roger Fan
Author-X-Name-First: Roger
Author-X-Name-Last: Fan
Author-Name: Kerby Shedden
Author-X-Name-First: Kerby
Author-X-Name-Last: Shedden
Author-Name: Shuheng Zhou
Author-X-Name-First: Shuheng
Author-X-Name-Last: Zhou
Title: Joint Mean and Covariance Estimation with Unreplicated Matrix-Variate Data
Abstract:
It has been proposed that complex populations, such as those that arise in genomics studies, may exhibit dependencies among observations as well as among variables. This gives rise to the challenging problem of analyzing unreplicated high-dimensional data with unknown mean and dependence structures. Matrix-variate approaches that impose various forms of (inverse) covariance sparsity allow flexible dependence structures to be estimated, but cannot directly be applied when the mean and covariance matrices are estimated jointly. We present a practical method utilizing generalized least squares and penalized (inverse) covariance estimation to address this challenge. We establish consistency and obtain rates of convergence for estimating the mean parameters and covariance matrices. The advantages of our approaches are: (i) dependence graphs and covariance structures can be estimated in the presence of unknown mean structure; (ii) the mean structure becomes more efficiently estimated when accounting for the dependence structure among observations; and (iii) inferences about the mean parameters become correctly calibrated. We use simulation studies and analysis of genomic data from a twin study of ulcerative colitis to illustrate the statistical convergence and the performance of our methods in practical settings. Several lines of evidence show that the test statistics for differential gene expression produced by our methods are correctly calibrated and improve power over conventional methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 682-696
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1429275
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1429275
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:682-696
Template-Type: ReDIF-Article 1.0
Author-Name: Ryan J. Tibshirani
Author-X-Name-First: Ryan J.
Author-X-Name-Last: Tibshirani
Author-Name: Saharon Rosset
Author-X-Name-First: Saharon
Author-X-Name-Last: Rosset
Title: Excess Optimism: How Biased is the Apparent Error of an Estimator Tuned by SURE?
Abstract:
Nearly all estimators in statistical prediction come with an associated tuning parameter, in one way or another. Common practice, given data, is to choose the tuning parameter value that minimizes a constructed estimate of the prediction error of the estimator; we focus on Stein’s unbiased risk estimator, or SURE, which forms an unbiased estimate of the prediction error by augmenting the observed training error with an estimate of the degrees of freedom of the estimator. Parameter tuning via SURE minimization has been advocated by many authors, in a wide variety of problem settings, and in general, it is natural to ask: what is the prediction error of the SURE-tuned estimator? An obvious strategy would be simply use the apparent error estimate as reported by SURE, that is, the value of the SURE criterion at its minimum, to estimate the prediction error of the SURE-tuned estimator. But this is no longer unbiased; in fact, we would expect that the minimum of the SURE criterion is systematically biased downwards for the true prediction error. In this work, we define the excess optimism of the SURE-tuned estimator to be the amount of this downward bias in the SURE minimum. We argue that the following two properties motivate the study of excess optimism: (i) an unbiased estimate of excess optimism, added to the SURE criterion at its minimum, gives an unbiased estimate of the prediction error of the SURE-tuned estimator; (ii) excess optimism serves as an upper bound on the excess risk, that is, the difference between the risk of the SURE-tuned estimator and the oracle risk (where the oracle uses the best fixed tuning parameter choice). We study excess optimism in two common settings: shrinkage estimators and subset regression estimators. Our main results include a James–Stein-like property of the SURE-tuned shrinkage estimator, which is shown to dominate the MLE; and both upper and lower bounds on excess optimism for SURE-tuned subset regression. 
In the latter setting, when the collection of subsets is nested, our bounds are particularly tight, and reveal that in the case of no signal, the excess optimism is always between 0 and 10 degrees of freedom, regardless of how many models are being selected from. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 697-712
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1429276
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1429276
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:697-712
Template-Type: ReDIF-Article 1.0
Author-Name: Qingyuan Zhao
Author-X-Name-First: Qingyuan
Author-X-Name-Last: Zhao
Title: On Sensitivity Value of Pair-Matched Observational Studies
Abstract:
This article proposes a new quantity called the “sensitivity value,” which is defined as the minimum strength of unmeasured confounders needed to change the qualitative conclusions of a naive analysis assuming no unmeasured confounder. We establish the asymptotic normality of the sensitivity value in pair-matched observational studies. The theoretical results are then used to approximate the power of a sensitivity analysis and select the design of a study. We explore the potential to use sensitivity values to screen multiple hypotheses in the presence of unmeasured confounding using a microarray dataset. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 713-722
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1429277
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1429277
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:713-722
Template-Type: ReDIF-Article 1.0
Author-Name: Benjamin Frot
Author-X-Name-First: Benjamin
Author-X-Name-Last: Frot
Author-Name: Luke Jostins
Author-X-Name-First: Luke
Author-X-Name-Last: Jostins
Author-Name: Gilean McVean
Author-X-Name-First: Gilean
Author-X-Name-Last: McVean
Title: Graphical Model Selection for Gaussian Conditional Random Fields in the Presence of Latent Variables
Abstract:
We consider the problem of learning a conditional Gaussian graphical model in the presence of latent variables. Building on recent advances in this field, we suggest a method that decomposes the parameters of a conditional Markov random field into the sum of a sparse and a low-rank matrix. We derive convergence bounds for this estimator and show that it is well-behaved in the high-dimensional regime as well as “sparsistent” (i.e., capable of recovering the graph structure). We then show how proximal gradient algorithms and semi-definite programming techniques can be employed to fit the model to thousands of variables. Through extensive simulations, we illustrate the conditions required for identifiability and show that there is a wide range of situations in which this model performs significantly better than its counterparts, for example, by accommodating more latent variables. Finally, the suggested method is applied to two datasets comprising individual-level data on genetic variants and metabolite levels. We show that our results replicate better than alternative approaches and show enriched biological signal. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 723-734
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1434531
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1434531
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:723-734
Template-Type: ReDIF-Article 1.0
Author-Name: Satyajit Ghosh
Author-X-Name-First: Satyajit
Author-X-Name-Last: Ghosh
Author-Name: Kshitij Khare
Author-X-Name-First: Kshitij
Author-X-Name-Last: Khare
Author-Name: George Michailidis
Author-X-Name-First: George
Author-X-Name-Last: Michailidis
Title: High-Dimensional Posterior Consistency in Bayesian Vector Autoregressive Models
Abstract:
Vector autoregressive (VAR) models aim to capture linear temporal interdependencies among multiple time series. They have been widely used in macroeconomics and financial econometrics and more recently have found novel applications in functional genomics and neuroscience. These applications have also accentuated the need to investigate the behavior of the VAR model in a high-dimensional regime, which provides novel insights into the role of temporal dependence for regularized estimates of the model’s parameters. However, hardly anything is known regarding properties of the posterior distribution for Bayesian VAR models in such regimes. In this work, we consider a VAR model with two prior choices for the autoregressive coefficient matrix: a nonhierarchical matrix-normal prior and a hierarchical prior, which corresponds to an arbitrary scale mixture of normals. We establish posterior consistency for both these priors under standard regularity assumptions, when the dimension p of the VAR model grows with the sample size n (but still remains smaller than n). A special case corresponds to a shrinkage prior that introduces (group) sparsity in the columns of the model coefficient matrices. The performance of the model estimates are illustrated on synthetic and real macroeconomic datasets. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 735-748
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1437043
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1437043
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:735-748
Template-Type: ReDIF-Article 1.0
Author-Name: Alexandre Belloni
Author-X-Name-First: Alexandre
Author-X-Name-Last: Belloni
Author-Name: Victor Chernozhukov
Author-X-Name-First: Victor
Author-X-Name-Last: Chernozhukov
Author-Name: Kengo Kato
Author-X-Name-First: Kengo
Author-X-Name-Last: Kato
Title: Valid Post-Selection Inference in High-Dimensional Approximately Sparse Quantile Regression Models
Abstract:
This work proposes new inference methods for a regression coefficient of interest in a (heterogeneous) quantile regression model. We consider a high-dimensional model where the number of regressors potentially exceeds the sample size but a subset of them suffices to construct a reasonable approximation to the conditional quantile function. The proposed methods are (explicitly or implicitly) based on orthogonal score functions that protect against moderate model selection mistakes, which are often inevitable in the approximately sparse model considered in the present article. We establish the uniform validity of the proposed confidence regions for the quantile regression coefficient. Importantly, these methods directly apply to more than one variable and a continuum of quantile indices. In addition, the performance of the proposed methods is illustrated through Monte Carlo experiments and an empirical example, dealing with risk factors in childhood malnutrition. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 749-758
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1442339
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442339
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:749-758
Template-Type: ReDIF-Article 1.0
Author-Name: Yuanpei Cao
Author-X-Name-First: Yuanpei
Author-X-Name-Last: Cao
Author-Name: Wei Lin
Author-X-Name-First: Wei
Author-X-Name-Last: Lin
Author-Name: Hongzhe Li
Author-X-Name-First: Hongzhe
Author-X-Name-Last: Li
Title: Large Covariance Estimation for Compositional Data Via Composition-Adjusted Thresholding
Abstract:
High-dimensional compositional data arise naturally in many applications such as metagenomic data analysis. The observed data lie in a high-dimensional simplex, and conventional statistical methods often fail to produce sensible results due to the unit-sum constraint. In this article, we address the problem of covariance estimation for high-dimensional compositional data and introduce a composition-adjusted thresholding (COAT) method under the assumption that the basis covariance matrix is sparse. Our method is based on a decomposition relating the compositional covariance to the basis covariance, which is approximately identifiable as the dimensionality tends to infinity. The resulting procedure can be viewed as thresholding the sample centered log-ratio covariance matrix and hence is scalable for large covariance matrices. We rigorously characterize the identifiability of the covariance parameters, derive rates of convergence under the spectral norm, and provide theoretical guarantees on support recovery. Simulation studies demonstrate that the COAT estimator outperforms some existing optimization-based estimators. We apply the proposed method to the analysis of a microbiome dataset to understand the dependence structure among bacterial taxa in the human gut.
Journal: Journal of the American Statistical Association
Pages: 759-772
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1442340
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442340
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:759-772
Template-Type: ReDIF-Article 1.0
Author-Name: Zhenguo Gao
Author-X-Name-First: Zhenguo
Author-X-Name-Last: Gao
Author-Name: Zuofeng Shang
Author-X-Name-First: Zuofeng
Author-X-Name-Last: Shang
Author-Name: Pang Du
Author-X-Name-First: Pang
Author-X-Name-Last: Du
Author-Name: John L. Robertson
Author-X-Name-First: John L.
Author-X-Name-Last: Robertson
Title: Variance Change Point Detection Under a Smoothly-Changing Mean Trend with Application to Liver Procurement
Abstract:
Literature on change point analysis mostly requires a sudden change in the data distribution, either in a few parameters or the distribution as a whole. We are interested in the scenario where the variance of data may make a significant jump while the mean changes in a smooth fashion. The motivation is a liver procurement experiment monitoring organ surface temperature. Blindly applying the existing methods to the example can yield erroneous change point estimates since the smoothly changing mean violates the sudden-change assumption. We propose a penalized weighted least-squares approach with an iterative estimation procedure that integrates variance change point detection and smooth mean function estimation. The procedure starts with a consistent initial mean estimate ignoring the variance heterogeneity. Given the variance components, the mean function is estimated by smoothing splines as the minimizer of the penalized weighted least squares. Given the mean function, we propose a likelihood ratio test statistic for identifying the variance change point. The null distribution of the test statistic is derived together with the rates of convergence of all the parameter estimates. Simulations show excellent performance of the proposed method. Application analysis offers numerical support to noninvasive organ viability assessment by surface temperature monitoring. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 773-781
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1442341
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442341
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:773-781
Template-Type: ReDIF-Article 1.0
Author-Name: Jacob Bien
Author-X-Name-First: Jacob
Author-X-Name-Last: Bien
Title: Graph-Guided Banding of the Covariance Matrix
Abstract:
Regularization has become a primary tool for developing reliable estimators of the covariance matrix in high-dimensional settings. To curb the curse of dimensionality, numerous methods assume that the population covariance (or inverse covariance) matrix is sparse, while making no particular structural assumptions on the desired pattern of sparsity. A highly-related, yet complementary, literature studies the specific setting in which the measured variables have a known ordering, in which case a banded population matrix is often assumed. While the banded approach is conceptually and computationally easier than asking for “patternless sparsity,” it is only applicable in very specific situations (such as when data are measured over time or one-dimensional space). This work proposes a generalization of the notion of bandedness that greatly expands the range of problems in which banded estimators apply. We develop convex regularizers occupying the broad middle ground between the former approach of “patternless sparsity” and the latter reliance on having a known ordering. Our framework defines bandedness with respect to a known graph on the measured variables. Such a graph is available in diverse situations, and we provide a theoretical, computational, and applied treatment of two new estimators. An R package, called ggb, implements these new methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 782-792
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1442720
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1442720
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:782-792
Template-Type: ReDIF-Article 1.0
Author-Name: Prosper Dovonon
Author-X-Name-First: Prosper
Author-X-Name-Last: Dovonon
Author-Name: Sílvia Gonçalves
Author-X-Name-First: Sílvia
Author-X-Name-Last: Gonçalves
Author-Name: Ulrich Hounyo
Author-X-Name-First: Ulrich
Author-X-Name-Last: Hounyo
Author-Name: Nour Meddahi
Author-X-Name-First: Nour
Author-X-Name-Last: Meddahi
Title: Bootstrapping High-Frequency Jump Tests
Abstract:
The main contribution of this article is to propose a bootstrap test for jumps based on functions of realized volatility and bipower variation. Bootstrap intraday returns are randomly generated from a mean zero Gaussian distribution with a variance given by a local measure of integrated volatility (which we denote by $\{\hat{v}_{i}^{n}\}$). We first discuss a set of high-level conditions on $\{\hat{v}_{i}^{n}\}$ such that any bootstrap test of this form has the correct asymptotic size and is alternative-consistent. We then provide a set of primitive conditions that justify the choice of a thresholding-based estimator for $\{\hat{v}_{i}^{n}\}$. Our cumulant expansions show that the bootstrap is unable to mimic the higher-order bias of the test statistic. We propose a modification of the original bootstrap test which contains an appropriate bias correction term and for which second-order asymptotic refinements are obtained.
Journal: Journal of the American Statistical Association
Pages: 793-803
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1447485
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1447485
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:793-803
Template-Type: ReDIF-Article 1.0
Author-Name: Shanika L. Wickramasuriya
Author-X-Name-First: Shanika L.
Author-X-Name-Last: Wickramasuriya
Author-Name: George Athanasopoulos
Author-X-Name-First: George
Author-X-Name-Last: Athanasopoulos
Author-Name: Rob J. Hyndman
Author-X-Name-First: Rob J.
Author-X-Name-Last: Hyndman
Title: Optimal Forecast Reconciliation for Hierarchical and Grouped Time Series Through Trace Minimization
Abstract:
Large collections of time series often have aggregation constraints due to product or geographical groupings. The forecasts for the most disaggregated series are usually required to add up exactly to the forecasts of the aggregated series, a constraint we refer to as “coherence.” Forecast reconciliation is the process of adjusting forecasts to make them coherent. The reconciliation algorithm proposed by Hyndman et al. (2011) is based on a generalized least squares estimator that requires an estimate of the covariance matrix of the coherency errors (i.e., the errors that arise due to incoherence). We show that this matrix is impossible to estimate in practice due to identifiability conditions. We propose a new forecast reconciliation approach that incorporates the information from a full covariance matrix of forecast errors in obtaining a set of coherent forecasts. Our approach minimizes the mean squared error of the coherent forecasts across the entire collection of time series under the assumption of unbiasedness. The minimization problem has a closed-form solution. We make this solution scalable by providing a computationally efficient representation. We evaluate the performance of the proposed method compared to alternative methods using a series of simulation designs which take into account various features of the collected time series. This is followed by an empirical application using Australian domestic tourism data. The results indicate that the proposed method works well with artificial and real data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 804-819
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1448825
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1448825
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:804-819
Template-Type: ReDIF-Article 1.0
Author-Name: Simon N. Vandekar
Author-X-Name-First: Simon N.
Author-X-Name-Last: Vandekar
Author-Name: Philip T. Reiss
Author-X-Name-First: Philip T.
Author-X-Name-Last: Reiss
Author-Name: Russell T. Shinohara
Author-X-Name-First: Russell T.
Author-X-Name-Last: Shinohara
Title: Interpretable High-Dimensional Inference Via Score Projection With an Application in Neuroimaging
Abstract:
In the fields of neuroimaging and genetics, a key goal is testing the association of a single outcome with a very high-dimensional imaging or genetic variable. Often, summary measures of the high-dimensional variable are created to sequentially test and localize the association with the outcome. In some cases, the associations between the outcome and summary measures are significant, but subsequent tests used to localize differences are underpowered and do not identify regions associated with the outcome. Here, we propose a generalization of Rao’s score test based on projecting the score statistic onto a linear subspace of a high-dimensional parameter space. The approach provides a way to localize signal in the high-dimensional space by projecting the scores to the subspace where the score test was performed. This allows for inference in the high-dimensional space to be performed on the same degrees of freedom as the score test, effectively reducing the number of comparisons. Simulation results demonstrate the test has competitive power relative to others commonly used. We illustrate the method by analyzing a subset of the Alzheimer’s Disease Neuroimaging Initiative dataset. Results suggest cortical thinning of the frontal and temporal lobes may be a useful biological marker of Alzheimer’s disease risk. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 820-830
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1448826
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1448826
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:820-830
Template-Type: ReDIF-Article 1.0
Author-Name: Matias Quiroz
Author-X-Name-First: Matias
Author-X-Name-Last: Quiroz
Author-Name: Robert Kohn
Author-X-Name-First: Robert
Author-X-Name-Last: Kohn
Author-Name: Mattias Villani
Author-X-Name-First: Mattias
Author-X-Name-Last: Villani
Author-Name: Minh-Ngoc Tran
Author-X-Name-First: Minh-Ngoc
Author-X-Name-Last: Tran
Title: Speeding Up MCMC by Efficient Data Subsampling
Abstract:
We propose subsampling Markov chain Monte Carlo (MCMC), an MCMC framework where the likelihood function for n observations is estimated from a random subset of m observations. We introduce a highly efficient unbiased estimator of the log-likelihood based on control variates, such that the computing cost is much smaller than that of the full log-likelihood in standard MCMC. The likelihood estimate is bias-corrected and used in two dependent pseudo-marginal algorithms to sample from a perturbed posterior, for which we derive the asymptotic error with respect to n and m, respectively. We propose a practical estimator of the error and show that the error is negligible even for a very small m in our applications. We demonstrate that subsampling MCMC is substantially more efficient than standard MCMC in terms of sampling efficiency for a given computational budget, and that it outperforms other subsampling methods for MCMC proposed in the literature. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 831-843
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1448827
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1448827
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:831-843
Template-Type: ReDIF-Article 1.0
Author-Name: Simon Mak
Author-X-Name-First: Simon
Author-X-Name-Last: Mak
Author-Name: C. F. Jeff Wu
Author-X-Name-First: C. F. Jeff
Author-X-Name-Last: Wu
Title: cmenet: A New Method for Bi-Level Variable Selection of Conditional Main Effects
Abstract:
This article introduces a novel method for selecting main effects and a set of reparameterized effects called conditional main effects (CMEs), which capture the conditional effect of a factor at a fixed level of another factor. CMEs represent interpretable, domain-specific phenomena for a wide range of applications in engineering, social sciences, and genomics. The key challenge is in incorporating the implicit grouped structure of CMEs within the variable selection procedure itself. We propose a new method, cmenet, which employs two principles called CME coupling and CME reduction to effectively navigate the selection algorithm. Simulation studies demonstrate the improved CME selection performance of cmenet over more generic selection methods. Applied to a gene association study on fly wing shape, cmenet not only yields more parsimonious models and improved predictive performance over standard two-factor interaction analysis methods, but also reveals important insights on gene activation behavior, which can be used to guide further experiments. Efficient implementations of our algorithms are available in the R package cmenet in CRAN. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 844-856
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1448828
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1448828
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:844-856
Template-Type: ReDIF-Article 1.0
Author-Name: Ting Yan
Author-X-Name-First: Ting
Author-X-Name-Last: Yan
Author-Name: Binyan Jiang
Author-X-Name-First: Binyan
Author-X-Name-Last: Jiang
Author-Name: Stephen E. Fienberg
Author-X-Name-First: Stephen E.
Author-X-Name-Last: Fienberg
Author-Name: Chenlei Leng
Author-X-Name-First: Chenlei
Author-X-Name-Last: Leng
Title: Statistical Inference in a Directed Network Model With Covariates
Abstract:
Networks are often characterized by node heterogeneity for which nodes exhibit different degrees of interaction and link homophily for which nodes sharing common features tend to associate with each other. In this article, we rigorously study a directed network model that captures the former via node-specific parameterization and the latter by incorporating covariates. In particular, this model quantifies the extent of heterogeneity in terms of outgoingness and incomingness of each node by different parameters, thus allowing the number of heterogeneity parameters to be twice the number of nodes. We study the maximum likelihood estimation of the model and establish the uniform consistency and asymptotic normality of the resulting estimators. Numerical studies demonstrate our theoretical findings and two data analyses confirm the usefulness of our model. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 857-868
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1448829
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1448829
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:857-868
Template-Type: ReDIF-Article 1.0
Author-Name: Likai Chen
Author-X-Name-First: Likai
Author-X-Name-Last: Chen
Author-Name: Wei Biao Wu
Author-X-Name-First: Wei Biao
Author-X-Name-Last: Wu
Title: Testing for Trends in High-Dimensional Time Series
Abstract:
The article considers statistical inference for trends of high-dimensional time series. Based on a modified $\mathcal{L}^2$ distance between parametric and nonparametric trend estimators, we propose a de-diagonalized quadratic form test statistic for testing patterns on trends, such as linear, quadratic, or parallel forms. We develop an asymptotic theory for the test statistic. A Gaussian multiplier testing procedure is proposed and it has an improved finite sample performance. Our testing procedure is applied to spatio-temporal temperature data gathered from various locations across America. A simulation study is also presented to illustrate the performance of our testing method. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 869-881
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1456935
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1456935
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:869-881
Template-Type: ReDIF-Article 1.0
Author-Name: Rong Zhu
Author-X-Name-First: Rong
Author-X-Name-Last: Zhu
Author-Name: Alan T. K. Wan
Author-X-Name-First: Alan T. K.
Author-X-Name-Last: Wan
Author-Name: Xinyu Zhang
Author-X-Name-First: Xinyu
Author-X-Name-Last: Zhang
Author-Name: Guohua Zou
Author-X-Name-First: Guohua
Author-X-Name-Last: Zou
Title: A Mallows-Type Model Averaging Estimator for the Varying-Coefficient Partially Linear Model
Abstract:
In the last decade, significant theoretical advances have been made in the area of frequentist model averaging (FMA); however, the majority of this work has emphasized parametric model setups. This article considers FMA for the semiparametric varying-coefficient partially linear model (VCPLM), which has gained prominence to become an extensively used modeling tool in recent years. Within this context, we develop a Mallows-type criterion for assigning model weights and prove its asymptotic optimality. A simulation study and a real data analysis demonstrate that the FMA estimator that arises from this criterion is vastly preferred to information criterion score-based model selection and averaging estimators. Our analysis is complicated by the fact that the VCPLM is subject to uncertainty arising not only from the choice of covariates, but also whether the covariate should enter the parametric or nonparametric parts of the model. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 882-892
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1456936
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1456936
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:882-892
Template-Type: ReDIF-Article 1.0
Author-Name: Junxian Geng
Author-X-Name-First: Junxian
Author-X-Name-Last: Geng
Author-Name: Anirban Bhattacharya
Author-X-Name-First: Anirban
Author-X-Name-Last: Bhattacharya
Author-Name: Debdeep Pati
Author-X-Name-First: Debdeep
Author-X-Name-Last: Pati
Title: Probabilistic Community Detection With Unknown Number of Communities
Abstract:
A fundamental problem in network analysis is clustering the nodes into groups which share a similar connectivity pattern. Existing algorithms for community detection assume the knowledge of the number of clusters or estimate it a priori using various selection criteria and subsequently estimate the community structure. Ignoring the uncertainty in the first stage may lead to erroneous clustering, particularly when the community structure is vague. We instead propose a coherent probabilistic framework for simultaneous estimation of the number of communities and the community structure, adapting recently developed Bayesian nonparametric techniques to network models. An efficient Markov chain Monte Carlo (MCMC) algorithm is proposed which obviates the need to perform reversible jump MCMC on the number of clusters. The methodology is shown to outperform recently developed community detection algorithms in a variety of synthetic data examples and in benchmark real datasets. Using an appropriate metric on the space of all configurations, we develop nonasymptotic Bayes risk bounds even when the number of clusters is unknown. En route, we develop concentration properties of nonlinear functions of Bernoulli random variables, which may be of independent interest in analysis of related models. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 893-905
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1458618
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1458618
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:893-905
Template-Type: ReDIF-Article 1.0
Author-Name: Éric Lesage
Author-X-Name-First: Éric
Author-X-Name-Last: Lesage
Author-Name: David Haziza
Author-X-Name-First: David
Author-X-Name-Last: Haziza
Author-Name: Xavier D’Haultfœuille
Author-X-Name-First: Xavier
Author-X-Name-Last: D’Haultfœuille
Title: A Cautionary Tale on Instrumental Calibration for the Treatment of Nonignorable Unit Nonresponse in Surveys
Abstract:
Response rates have been steadily declining over the last decades, making survey estimates vulnerable to nonresponse bias. To reduce the potential bias, two weighting approaches are commonly used in National Statistical Offices: the one-step and the two-step approaches. In this article, we focus on the one-step approach, whereby the design weights are modified in a single step with two simultaneous goals in mind: reduce the nonresponse bias and ensure the consistency between survey estimates and known population totals. In particular, we examine the properties of instrumental calibration, a special case of the one-step approach that has received a lot of attention in the literature in recent years. Despite the rich literature on the topic, there remain some important gaps that this article aims to fill. First, we give a set of sufficient conditions required for establishing the consistency of instrumental calibration estimators. Also, we show that the latter may suffer from a large bias when some of these conditions are violated. Results from a simulation study support our findings. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 906-915
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1458619
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1458619
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:906-915
Template-Type: ReDIF-Article 1.0
Author-Name: Rongmao Zhang
Author-X-Name-First: Rongmao
Author-X-Name-Last: Zhang
Author-Name: Peter Robinson
Author-X-Name-First: Peter
Author-X-Name-Last: Robinson
Author-Name: Qiwei Yao
Author-X-Name-First: Qiwei
Author-X-Name-Last: Yao
Title: Identifying Cointegration by Eigenanalysis
Abstract:
We propose a new and easy-to-use method for identifying cointegrated components of nonstationary time series, consisting of an eigenanalysis for a certain nonnegative definite matrix. Our setting is model-free, and we allow the integer-valued integration orders of the observable series to be unknown, and to possibly differ. Consistency of estimates of the cointegration space and cointegration rank is established both when the dimension of the observable time series is fixed as sample size increases, and when it diverges slowly. The proposed methodology is also extended and justified in a fractional setting. A Monte Carlo study of finite-sample performance, and a small empirical illustration, are reported. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 916-927
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1458620
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1458620
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:916-927
Template-Type: ReDIF-Article 1.0
Author-Name: Wenliang Pan
Author-X-Name-First: Wenliang
Author-X-Name-Last: Pan
Author-Name: Xueqin Wang
Author-X-Name-First: Xueqin
Author-X-Name-Last: Wang
Author-Name: Weinan Xiao
Author-X-Name-First: Weinan
Author-X-Name-Last: Xiao
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Title: A Generic Sure Independence Screening Procedure
Abstract:
Extracting important features from ultra-high dimensional data is one of the primary tasks in statistical learning, information theory, precision medicine, and biological discovery. Many of the sure independence screening methods developed to meet these needs are suitable for special models under some assumptions. With the availability of more data types and possible models, a model-free generic screening procedure with fewer and less restrictive assumptions is desirable. In this article, we propose a generic nonparametric sure independence screening procedure, called BCor-SIS, on the basis of a recently developed universal dependence measure: Ball correlation. We show that the proposed procedure has strong screening consistency even when the dimensionality is an exponential order of the sample size without imposing sub-exponential moment assumptions on the data. We investigate the flexibility of this procedure by considering three commonly encountered challenging settings in biological discovery or precision medicine: iterative BCor-SIS, interaction pursuit, and survival outcomes. We use simulation studies and real data analyses to illustrate the versatility and practicability of our BCor-SIS method. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 928-937
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1462709
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1462709
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:928-937
Template-Type: ReDIF-Article 1.0
Author-Name: Jessica G. Young
Author-X-Name-First: Jessica G.
Author-X-Name-Last: Young
Author-Name: Roger W. Logan
Author-X-Name-First: Roger W.
Author-X-Name-Last: Logan
Author-Name: James M. Robins
Author-X-Name-First: James M.
Author-X-Name-Last: Robins
Author-Name: Miguel A. Hernán
Author-X-Name-First: Miguel A.
Author-X-Name-Last: Hernán
Title: Inverse Probability Weighted Estimation of Risk Under Representative Interventions in Observational Studies
Abstract:
Researchers are often interested in using observational data to estimate the effect on a health outcome of maintaining a continuous treatment within a prespecified range over time, for example, “always exercise at least 30 minutes per day.” There may be many precise interventions that could achieve this range. In this article, we consider representative interventions. These are special cases of random dynamic interventions: interventions under which treatment at each time is assigned according to a random draw from a distribution that may depend on a subject’s measured past. Estimators of risk under representative interventions on a time-varying treatment have previously been described based on g-estimation of structural nested cumulative failure time models. In this article, we consider an alternative approach based on inverse probability weighting (IPW) of marginal structural models. In particular, we show that the risk under a representative intervention on a time-varying continuous treatment can be consistently estimated via computationally simple IPW methods traditionally used for deterministic static (i.e., “nonrandom” and “nondynamic”) interventions for binary treatments. We present an application of IPW in this setting to estimate the 28-year risk of coronary heart disease under various representative interventions on lifestyle behaviors in the Nurses' Health Study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 938-947
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1469993
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1469993
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:938-947
Template-Type: ReDIF-Article 1.0
Author-Name: Wonyul Lee
Author-X-Name-First: Wonyul
Author-X-Name-Last: Lee
Author-Name: Michelle F. Miranda
Author-X-Name-First: Michelle F.
Author-X-Name-Last: Miranda
Author-Name: Philip Rausch
Author-X-Name-First: Philip
Author-X-Name-Last: Rausch
Author-Name: Veerabhadran Baladandayuthapani
Author-X-Name-First: Veerabhadran
Author-X-Name-Last: Baladandayuthapani
Author-Name: Massimo Fazio
Author-X-Name-First: Massimo
Author-X-Name-Last: Fazio
Author-Name: J. Crawford Downs
Author-X-Name-First: J. Crawford
Author-X-Name-Last: Downs
Author-Name: Jeffrey S. Morris
Author-X-Name-First: Jeffrey S.
Author-X-Name-Last: Morris
Title: Bayesian Semiparametric Functional Mixed Models for Serially Correlated Functional Data, With Application to Glaucoma Data
Abstract:
Glaucoma, a leading cause of blindness, is characterized by optic nerve damage related to intraocular pressure (IOP), but its full etiology is unknown. Researchers at UAB have devised a custom device to measure scleral strain continuously around the eye under fixed levels of IOP, which here is used to assess how strain varies around the posterior pole, with IOP, and across glaucoma risk factors such as age. The hypothesis is that scleral strain decreases with age, which could alter biomechanics of the optic nerve head and cause damage that could eventually lead to glaucoma. To evaluate this hypothesis, we adapted Bayesian Functional Mixed Models to model these complex data consisting of correlated functions on spherical scleral surface, with nonparametric age effects allowed to vary in magnitude and smoothness across the scleral surface, multi-level random effect functions to capture within-subject correlation, and functional growth curve terms to capture serial correlation across IOPs that can vary around the scleral surface. Our method yields fully Bayesian inference on the scleral surface or any aggregation or transformation thereof, and reveals interesting insights into the biomechanical etiology of glaucoma. The general modeling framework described is very flexible and applicable to many complex, high-dimensional functional data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 495-513
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1476242
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1476242
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:495-513
Template-Type: ReDIF-Article 1.0
Author-Name: Oscar Hernan Madrid Padilla
Author-X-Name-First: Oscar Hernan
Author-X-Name-Last: Madrid Padilla
Author-Name: Alex Athey
Author-X-Name-First: Alex
Author-X-Name-Last: Athey
Author-Name: Alex Reinhart
Author-X-Name-First: Alex
Author-X-Name-Last: Reinhart
Author-Name: James G. Scott
Author-X-Name-First: James G.
Author-X-Name-Last: Scott
Title: Sequential Nonparametric Tests for a Change in Distribution: An Application to Detecting Radiological Anomalies
Abstract:
We propose a sequential nonparametric test for detecting a change in distribution, based on windowed Kolmogorov–Smirnov statistics. The approach is simple, robust, highly computationally efficient, easy to calibrate, and requires no parametric assumptions about the underlying null and alternative distributions. We show that both the false-alarm rate and the power of our procedure are amenable to rigorous analysis, and that the method outperforms existing sequential testing procedures in practice. We then apply the method to the problem of detecting radiological anomalies, using data collected from measurements of the background gamma-radiation spectrum on a large university campus. In this context, the proposed method leads to substantial improvements in time-to-detection for the kind of radiological anomalies of interest in law-enforcement and border-security applications. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 514-528
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1476245
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1476245
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:514-528
Template-Type: ReDIF-Article 1.0
Author-Name: Naoki Egami
Author-X-Name-First: Naoki
Author-X-Name-Last: Egami
Author-Name: Kosuke Imai
Author-X-Name-First: Kosuke
Author-X-Name-Last: Imai
Title: Causal Interaction in Factorial Experiments: Application to Conjoint Analysis
Abstract:
We study causal interaction in factorial experiments, in which several factors, each with multiple levels, are randomized to form a large number of possible treatment combinations. Examples of such experiments include conjoint analysis, which is often used by social scientists to analyze multidimensional preferences in a population. To characterize the structure of causal interaction in factorial experiments, we propose a new causal interaction effect, called the average marginal interaction effect (AMIE). Unlike the conventional interaction effect, the relative magnitude of the AMIE does not depend on the choice of baseline conditions, making its interpretation intuitive even for higher-order interactions. We show that the AMIE can be nonparametrically estimated using ANOVA regression with weighted zero-sum constraints. Because the AMIEs are invariant to the choice of baseline conditions, we directly regularize them by collapsing levels and selecting factors within a penalized ANOVA framework. This regularized estimation procedure reduces false discovery rate and further facilitates interpretation. Finally, we apply the proposed methodology to the conjoint analysis of ethnic voting behavior in Africa and find clear patterns of causal interaction between politicians’ ethnicity and their prior records. The proposed methodology is implemented in an open source software package. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 529-540
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1476246
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1476246
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:529-540
Template-Type: ReDIF-Article 1.0
Author-Name: Seung Jun Shin
Author-X-Name-First: Seung Jun
Author-X-Name-Last: Shin
Author-Name: Ying Yuan
Author-X-Name-First: Ying
Author-X-Name-Last: Yuan
Author-Name: Louise C. Strong
Author-X-Name-First: Louise C.
Author-X-Name-Last: Strong
Author-Name: Jasmina Bojadzieva
Author-X-Name-First: Jasmina
Author-X-Name-Last: Bojadzieva
Author-Name: Wenyi Wang
Author-X-Name-First: Wenyi
Author-X-Name-Last: Wang
Title: Bayesian Semiparametric Estimation of Cancer-Specific Age-at-Onset Penetrance With Application to Li-Fraumeni Syndrome
Abstract:
Penetrance, which plays a key role in genetic research, is defined as the proportion of individuals with the genetic variants (i.e., genotype) that cause a particular trait and who have clinical symptoms of the trait (i.e., phenotype). We propose a Bayesian semiparametric approach to estimate the cancer-specific age-at-onset penetrance in the presence of the competing risk of multiple cancers. We employ a Bayesian semiparametric competing risk model to model the duration until individuals in a high-risk group develop different cancers, and accommodate family data using family-wise likelihoods. We tackle the ascertainment bias arising when family data are collected through probands in a high-risk population in which disease cases are more likely to be observed. We apply the proposed method to a cohort of 186 families with Li-Fraumeni syndrome identified through probands with sarcoma treated at MD Anderson Cancer Center from 1944 to 1982. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 541-552
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1482749
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1482749
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:541-552
Template-Type: ReDIF-Article 1.0
Author-Name: Lei Huang
Author-X-Name-First: Lei
Author-X-Name-Last: Huang
Author-Name: Jiawei Bai
Author-X-Name-First: Jiawei
Author-X-Name-Last: Bai
Author-Name: Andrada Ivanescu
Author-X-Name-First: Andrada
Author-X-Name-Last: Ivanescu
Author-Name: Tamara Harris
Author-X-Name-First: Tamara
Author-X-Name-Last: Harris
Author-Name: Mathew Maurer
Author-X-Name-First: Mathew
Author-X-Name-Last: Maurer
Author-Name: Philip Green
Author-X-Name-First: Philip
Author-X-Name-Last: Green
Author-Name: Vadim Zipunnikov
Author-X-Name-First: Vadim
Author-X-Name-Last: Zipunnikov
Title: Multilevel Matrix-Variate Analysis and its Application to Accelerometry-Measured Physical Activity in Clinical Populations
Abstract:
The number of studies where the primary measurement is a matrix is exploding. In response to this, we propose a statistical framework for modeling populations of repeatedly observed matrix-variate measurements. The 2D structure is handled via a matrix-variate distribution with decomposable row/column-specific covariance matrices, and a linear mixed effect framework is used to model the multilevel design. The proposed framework flexibly expands to accommodate many common crossed and nested designs and introduces two important concepts: the between-subject distance and intraclass correlation coefficient, both defined for matrix-variate data. The computational feasibility and performance of the approach are shown in extensive simulation studies. The method is motivated by and applied to a study that monitored physical activity of individuals diagnosed with congestive heart failure (CHF) over a 4- to 9-month period. The long-term patterns of physical activity are studied and compared in two CHF subgroups: with and without adverse clinical events. Supplementary materials for this article, which include de-identified accelerometry and clinical data, are available online.
Journal: Journal of the American Statistical Association
Pages: 553-564
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1482750
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1482750
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:553-564
Template-Type: ReDIF-Article 1.0
Author-Name: Domenico Giannone
Author-X-Name-First: Domenico
Author-X-Name-Last: Giannone
Author-Name: Michele Lenza
Author-X-Name-First: Michele
Author-X-Name-Last: Lenza
Author-Name: Giorgio E. Primiceri
Author-X-Name-First: Giorgio E.
Author-X-Name-Last: Primiceri
Title: Priors for the Long Run
Abstract:
We propose a class of prior distributions that discipline the long-run behavior of vector autoregressions (VARs). These priors can be naturally elicited using economic theory, which provides guidance on the joint dynamics of macroeconomic time series in the long run. Our priors for the long run are conjugate, and can thus be easily implemented using dummy observations and combined with other popular priors. In VARs with standard macroeconomic variables, a prior based on the long-run predictions of a wide class of theoretical models yields substantial improvements in the forecasting performance. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 565-580
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1483826
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1483826
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:565-580
Template-Type: ReDIF-Article 1.0
Author-Name: Xiangyu Luo
Author-X-Name-First: Xiangyu
Author-X-Name-Last: Luo
Author-Name: Yingying Wei
Author-X-Name-First: Yingying
Author-X-Name-Last: Wei
Title: Batch Effects Correction with Unknown Subtypes
Abstract:
High-throughput experimental data are accumulating exponentially in public databases. Unfortunately, mining valid scientific discoveries from these abundant resources is hampered by technical artifacts and inherent biological heterogeneity. The former are usually termed “batch effects,” and the latter is often modeled by subtypes. Existing methods either tackle batch effects provided that subtypes are known or cluster subtypes assuming that batch effects are absent. Consequently, there is a lack of research on the correction of batch effects in the presence of unknown subtypes. Here, we combine a location-and-scale adjustment model and model-based clustering into a novel hybrid one, the batch-effects-correction-with-unknown-subtypes model (BUS). BUS is capable of (a) correcting batch effects explicitly, (b) grouping samples that share similar characteristics into subtypes, (c) identifying features that distinguish subtypes, (d) allowing the number of subtypes to vary from batch to batch, (e) integrating batches from different platforms, and (f) enjoying a linear-order computation complexity. We prove the identifiability of BUS and provide conditions for study designs under which batch effects can be corrected. BUS is evaluated by simulation studies and a real breast cancer dataset combined from three batches measured on two platforms. Results from the breast cancer dataset offer much better biological insights than existing methods. We implement BUS as the free Bioconductor package BUScorrect. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 581-594
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1497494
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1497494
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:581-594
Template-Type: ReDIF-Article 1.0
Author-Name: Yakuan Chen
Author-X-Name-First: Yakuan
Author-X-Name-Last: Chen
Author-Name: Jeff Goldsmith
Author-X-Name-First: Jeff
Author-X-Name-Last: Goldsmith
Author-Name: R. Todd Ogden
Author-X-Name-First: R. Todd
Author-X-Name-Last: Ogden
Title: Functional Data Analysis of Dynamic PET Data
Abstract:
One application of positron emission tomography (PET), a nuclear imaging technique, in neuroscience involves in vivo estimation of the density of various proteins (often, neuroreceptors) in the brain. PET scanning begins with the injection of a radiolabeled tracer that binds preferentially to the target protein; tracer molecules are then continuously delivered to the brain via the bloodstream. By detecting the radioactive decay of the tracer over time, dynamic PET data are constructed to reflect the concentration of the target protein in the brain at each time. The fundamental problem in the analysis of dynamic PET data involves estimating the impulse response function (IRF), which is necessary for describing the binding behavior of the injected radiotracer. Virtually all existing methods have three common aspects: summarizing the entire IRF with a single scalar measure; modeling each subject separately; and the imposition of parametric restrictions on the IRF. In contrast, we propose a functional data analytic approach that regards each subject’s IRF as the basic analysis unit, models multiple subjects simultaneously, and estimates the IRF nonparametrically. We pose our model as a linear mixed effect model in which population level fixed effects and subject-specific random effects are expanded using a B-spline basis. Shrinkage and roughness penalties are incorporated in the model to enforce identifiability and smoothness of the estimated curves, respectively, while monotonicity and nonnegativity constraints impose biological information on estimates. We illustrate this approach by applying it to clinical PET data with subjects belonging to three diagnostic groups. We explore differences among groups by means of pointwise confidence intervals of the estimated mean curves based on bootstrap samples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 595-609
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1497495
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1497495
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:595-609
Template-Type: ReDIF-Article 1.0
Author-Name: Will Landau
Author-X-Name-First: Will
Author-X-Name-Last: Landau
Author-Name: Jarad Niemi
Author-X-Name-First: Jarad
Author-X-Name-Last: Niemi
Author-Name: Dan Nettleton
Author-X-Name-First: Dan
Author-X-Name-Last: Nettleton
Title: Fully Bayesian Analysis of RNA-seq Counts for the Detection of Gene Expression Heterosis
Abstract:
Heterosis, or hybrid vigor, is the enhancement of the phenotype of hybrid progeny relative to their inbred parents. Heterosis is extensively used in agriculture, but the underlying mechanisms are unclear. To investigate the molecular basis of phenotypic heterosis, researchers search tens of thousands of genes for heterosis with respect to expression in the transcriptome. Difficulty arises in the assessment of heterosis due to composite null hypotheses and nonuniform distributions for p-values under these null hypotheses. Thus, we develop a general hierarchical model for count data and a fully Bayesian analysis in which an efficient parallelized Markov chain Monte Carlo algorithm ameliorates the computational burden. We use our method to detect gene expression heterosis in a two-hybrid plant-breeding scenario, both in a real RNA-seq maize dataset and in simulation studies. In the simulation studies, we show our method has well-calibrated posterior probabilities and credible intervals when the model assumed in analysis matches the model used to simulate the data. Although model misspecification can adversely affect calibration, the methodology is still able to accurately rank genes. Finally, we show that hyperparameter posteriors are extremely narrow and an empirical Bayes (eBayes) approach based on posterior means from the fully Bayesian analysis provides virtually equivalent posterior probabilities, credible intervals, and gene rankings relative to the fully Bayesian solution. This evidence of equivalence provides support for the use of eBayes procedures in RNA-seq data analysis if accurate hyperparameter estimates can be obtained. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 610-621
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1497496
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1497496
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:610-621
Template-Type: ReDIF-Article 1.0
Author-Name: Yifei Wang
Author-X-Name-First: Yifei
Author-X-Name-Last: Wang
Author-Name: Daniel J. Tancredi
Author-X-Name-First: Daniel J.
Author-X-Name-Last: Tancredi
Author-Name: Diana L. Miglioretti
Author-X-Name-First: Diana L.
Author-X-Name-Last: Miglioretti
Title: Joint Indirect Standardization When Only Marginal Distributions are Observed in the Index Population
Abstract:
It is a common interest in medicine to determine whether a hospital meets a benchmark created from an aggregate reference population, after accounting for differences in distributions of multiple covariates. Due to the difficulties of collecting individual-level data, however, it is often the case that only marginal distributions of the covariates are available, making covariate-adjusted comparison challenging. We propose and evaluate a novel approach for conducting indirect standardization when only marginal covariate distributions of the studied hospital are known, but complete information is available for the reference hospitals. We do this with the aid of two existing methods: iterative proportional fitting, which estimates the cells of a contingency table when only marginal sums are known, and synthetic control methods, which create a counterfactual control group using a weighted combination of potential control groups. The proper application of these existing methods for indirect standardization would require accounting for the statistical uncertainties induced by a situation where no individual-level data are collected from the studied population. We address this need with a novel method which uses a random Dirichlet parameterization of the synthetic control weights to estimate uncertainty intervals for the standardized incidence ratio. We demonstrate our novel methods by estimating hospital-level standardized incidence ratios for comparing the adjusted probability of computed tomography examinations with high radiation doses, relative to a reference standard, and we evaluate our methods in a simulation study. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 622-630
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2018.1506340
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1506340
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:622-630
Template-Type: ReDIF-Article 1.0
Author-Name: Jing Su
Author-X-Name-First: Jing
Author-X-Name-Last: Su
Title: Book Review
Journal: Journal of the American Statistical Association
Pages: 948-948
Issue: 526
Volume: 114
Year: 2019
Month: 4
X-DOI: 10.1080/01621459.2019.1614762
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1614762
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:114:y:2019:i:526:p:948-948
Template-Type: ReDIF-Article 1.0
Author-Name: Clement Lee
Author-X-Name-First: Clement
Author-X-Name-Last: Lee
Author-Name: Darren J. Wilkinson
Author-X-Name-First: Darren J.
Author-X-Name-Last: Wilkinson
Title: A Hierarchical Model of Nonhomogeneous Poisson Processes for Twitter Retweets
Abstract:
We present a hierarchical model of nonhomogeneous Poisson processes (NHPP) for information diffusion on online social media, in particular Twitter retweets. The retweets of each original tweet are modelled by a NHPP, for which the intensity function is a product of time-decaying components and another component that depends on the follower count of the original tweet author. The latter allows us to explain or predict the ultimate retweet count by a network centrality-related covariate. The inference algorithm enables the Bayes factor to be computed, to facilitate model selection. Finally, the model is applied to the retweet datasets of two hashtags. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1-15
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2019.1585358
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1585358
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:1-15
Template-Type: ReDIF-Article 1.0
Author-Name: Jonathan P. Williams
Author-X-Name-First: Jonathan P.
Author-X-Name-Last: Williams
Author-Name: Curtis B. Storlie
Author-X-Name-First: Curtis B.
Author-X-Name-Last: Storlie
Author-Name: Terry M. Therneau
Author-X-Name-First: Terry M.
Author-X-Name-Last: Therneau
Author-Name: Clifford R. Jack Jr
Author-X-Name-First: Clifford R.
Author-X-Name-Last: Jack Jr
Author-Name: Jan Hannig
Author-X-Name-First: Jan
Author-X-Name-Last: Hannig
Title: A Bayesian Approach to Multistate Hidden Markov Models: Application to Dementia Progression
Abstract:
People are living longer than ever before, and with this arise new complications and challenges for humanity. Among the most pressing of these challenges is understanding the role of aging in the development of dementia. This article is motivated by the Mayo Clinic Study of Aging data for 4742 subjects since 2004, and how it can be used to draw inference on the role of aging in the development of dementia. We construct a hidden Markov model (HMM) to represent progression of dementia from states associated with the buildup of amyloid plaque in the brain, and the loss of cortical thickness. A hierarchical Bayesian approach is taken to estimate the parameters of the HMM with a truly time-inhomogeneous infinitesimal generator matrix, and response functions of the continuous-valued biomarker measurements are cut-point agnostic. A Bayesian approach with these features could be useful in many disease progression models. Additionally, an approach is illustrated for correcting a common bias in delayed enrollment studies, in which some or all subjects are not observed at baseline. Standard software is incapable of accounting for this critical feature, so code to perform the estimation of the model described below is made available online. Code submitted with this article was checked by an Associate Editor for Reproducibility and is available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 16-31
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2019.1594831
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1594831
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:16-31
Template-Type: ReDIF-Article 1.0
Author-Name: Curtis B. Storlie
Author-X-Name-First: Curtis B.
Author-X-Name-Last: Storlie
Author-Name: Terry M. Therneau
Author-X-Name-First: Terry M.
Author-X-Name-Last: Therneau
Author-Name: Rickey E. Carter
Author-X-Name-First: Rickey E.
Author-X-Name-Last: Carter
Author-Name: Nicholas Chia
Author-X-Name-First: Nicholas
Author-X-Name-Last: Chia
Author-Name: John R. Bergquist
Author-X-Name-First: John R.
Author-X-Name-Last: Bergquist
Author-Name: Jeanne M. Huddleston
Author-X-Name-First: Jeanne M.
Author-X-Name-Last: Huddleston
Author-Name: Santiago Romero-Brufau
Author-X-Name-First: Santiago
Author-X-Name-Last: Romero-Brufau
Title: Prediction and Inference With Missing Data in Patient Alert Systems
Abstract:
We describe the Bedside Patient Rescue (BPR) project, the goal of which is risk prediction of adverse events for non-intensive care unit patients using ∼100 variables (vitals, lab results, assessments, etc.). There are several missing predictor values for most patients, which in the health sciences is the norm, rather than the exception. A Bayesian approach is presented that addresses many of the shortcomings of standard approaches to missing predictors: (i) treatment of the uncertainty due to imputation is straightforward in the Bayesian paradigm, (ii) the predictor distribution is flexibly modeled as an infinite normal mixture with latent variables to explicitly account for discrete predictors (i.e., as in multivariate probit regression models), and (iii) certain missing not at random situations can be handled effectively by allowing the indicator of missingness into the predictor distribution only to inform the distribution of the missing variables. The proposed approach also has the benefit of providing a distribution for the prediction, including the uncertainty inherent in the imputation. Therefore, we can ask questions such as: is it possible this individual is at high risk but we are missing too much information to know for sure? How much would we reduce the uncertainty in our risk prediction by obtaining a particular missing value? This approach is applied to the BPR problem resulting in excellent predictive capability to identify deteriorating patients. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 32-46
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2019.1604359
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604359
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:32-46
Template-Type: ReDIF-Article 1.0
Author-Name: Adam N. Smith
Author-X-Name-First: Adam N.
Author-X-Name-Last: Smith
Author-Name: Greg M. Allenby
Author-X-Name-First: Greg M.
Author-X-Name-Last: Allenby
Title: Demand Models With Random Partitions
Abstract:
Many economic models of consumer demand require researchers to partition sets of products or attributes prior to the analysis. These models are common in applied problems when the product space is large or spans multiple categories. While the partition is traditionally fixed a priori, we let the partition be a model parameter and propose a Bayesian method for inference. The challenge is that demand systems are commonly multivariate models that are not conditionally conjugate with respect to partition indices, precluding the use of Gibbs sampling. We solve this problem by constructing a new location-scale partition distribution that can generate random-walk Metropolis–Hastings proposals and also serve as a prior. Our method is illustrated in the context of a store-level category demand model, where we find that allowing for partition uncertainty is important for preserving model flexibility, improving demand forecasts, and learning about the structure of demand. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 47-65
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2019.1604360
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604360
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:47-65
Template-Type: ReDIF-Article 1.0
Author-Name: Matthew J. Heaton
Author-X-Name-First: Matthew J.
Author-X-Name-Last: Heaton
Author-Name: Candace Berrett
Author-X-Name-First: Candace
Author-X-Name-Last: Berrett
Author-Name: Sierra Pugh
Author-X-Name-First: Sierra
Author-X-Name-Last: Pugh
Author-Name: Amber Evans
Author-X-Name-First: Amber
Author-X-Name-Last: Evans
Author-Name: Chantel Sloan
Author-X-Name-First: Chantel
Author-X-Name-Last: Sloan
Title: Modeling Bronchiolitis Incidence Proportions in the Presence of Spatio-Temporal Uncertainty
Abstract:
Bronchiolitis (inflammation of the lower respiratory tract) in infants is primarily due to viral infection and is the single most common cause of infant hospitalization in the United States. To increase epidemiological understanding of bronchiolitis (and, subsequently, develop better prevention strategies), this research analyzes data on infant bronchiolitis cases from the U.S. Military Health System between the years 2003–2013 in Norfolk, Virginia, USA. For privacy reasons, child home addresses, birth dates, and diagnosis dates were randomized (jittered) creating spatio-temporal uncertainty in the geographic location and timing of bronchiolitis incidents. Using spatio-temporal point patterns, we created a modeling strategy that accounts for the jittering to estimate and quantify the uncertainty for the incidence proportion (IP) of bronchiolitis. Additionally, we regress the IP onto key covariates including pollution where we adequately account for uncertainty in the pollution levels (i.e., covariate uncertainty) using a land use regression model. Our analysis results indicate that the IP is positively associated with sulfur dioxide and population density. Further, we demonstrate how scientific conclusions may change if various sources of uncertainty (either spatio-temporal or covariate uncertainty) are not accounted for. Code submitted with this article was checked by an Associate Editor for Reproducibility and is available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 66-78
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2019.1609480
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1609480
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:66-78
Template-Type: ReDIF-Article 1.0
Author-Name: Douglas R. Wilson
Author-X-Name-First: Douglas R.
Author-X-Name-Last: Wilson
Author-Name: Joseph G. Ibrahim
Author-X-Name-First: Joseph G.
Author-X-Name-Last: Ibrahim
Author-Name: Wei Sun
Author-X-Name-First: Wei
Author-X-Name-Last: Sun
Title: Mapping Tumor-Specific Expression QTLs in Impure Tumor Samples
Abstract:
The study of gene expression quantitative trait loci (eQTL) is an effective approach to illuminate the functional roles of genetic variants. Computational methods have been developed for eQTL mapping using gene expression data from microarray or RNA-seq technology. Application of these methods for eQTL mapping in tumor tissues is problematic because tumor tissues are composed of both tumor and infiltrating normal cells (e.g., immune cells) and eQTL effects may vary between tumor and infiltrating normal cells. To address this challenge, we have developed a new method for eQTL mapping using RNA-seq data from tumor samples. Our method separately estimates the eQTL effects in tumor and infiltrating normal cells using both total expression and allele-specific expression (ASE). We demonstrate that our method controls Type I error rate and has higher power than some alternative approaches. We applied our method to study RNA-seq data from The Cancer Genome Atlas and illustrated the similarities and differences of eQTL effects in tumor and normal cells. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 79-89
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2019.1609968
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1609968
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:79-89
Template-Type: ReDIF-Article 1.0
Author-Name: Hojin Yang
Author-X-Name-First: Hojin
Author-X-Name-Last: Yang
Author-Name: Veerabhadran Baladandayuthapani
Author-X-Name-First: Veerabhadran
Author-X-Name-Last: Baladandayuthapani
Author-Name: Arvind U.K. Rao
Author-X-Name-First: Arvind U.K.
Author-X-Name-Last: Rao
Author-Name: Jeffrey S. Morris
Author-X-Name-First: Jeffrey S.
Author-X-Name-Last: Morris
Title: Quantile Function on Scalar Regression Analysis for Distributional Data
Abstract:
Radiomics involves the study of tumor images to identify quantitative markers explaining cancer heterogeneity. The predominant approach is to extract hundreds to thousands of image features, including histogram features comprising summaries of the marginal distribution of pixel intensities, which leads to multiple testing problems and can miss out on insights not contained in the selected features. In this paper, we present methods to model the entire marginal distribution of pixel intensities via the quantile function as functional data, regressed on a set of demographic, clinical, and genetic predictors to investigate their effects on imaging-based cancer heterogeneity. We call this approach quantile functional regression, regressing subject-specific marginal distributions across repeated measurements on a set of covariates, allowing us to assess which covariates are associated with the distribution in a global sense, as well as to identify distributional features characterizing these differences, including mean, variance, skewness, heavy-tailedness, and various upper and lower quantiles. To account for smoothness in the quantile functions, account for intrafunctional correlation, and gain statistical power, we introduce custom basis functions we call quantlets that are sparse, regularized, near-lossless, and empirically defined, adapting to the features of a given dataset and containing a Gaussian subspace so non-Gaussianness can be assessed. We fit this model using a Bayesian framework that uses nonlinear shrinkage of quantlet coefficients to regularize the functional regression coefficients and provides fully Bayesian inference via Markov chain Monte Carlo. We demonstrate the benefit of the basis space modeling through simulation studies, and apply the method to a magnetic resonance imaging (MRI)-based radiomic dataset from glioblastoma multiforme to relate imaging-based quantile functions to various demographic, clinical, and genetic predictors, finding specific differences in tumor pixel intensity distribution between males and females and between tumors with and without DDIT3 mutations. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 90-106
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2019.1609969
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1609969
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:90-106
Template-Type: ReDIF-Article 1.0
Author-Name: Gareth M. James
Author-X-Name-First: Gareth M.
Author-X-Name-Last: James
Author-Name: Courtney Paulson
Author-X-Name-First: Courtney
Author-X-Name-Last: Paulson
Author-Name: Paat Rusmevichientong
Author-X-Name-First: Paat
Author-X-Name-Last: Rusmevichientong
Title: Penalized and Constrained Optimization: An Application to High-Dimensional Website Advertising
Abstract:
Firms are increasingly transitioning advertising budgets to Internet display campaigns, but this transition poses new challenges. These campaigns use numerous potential metrics for success (e.g., reach or click rate), and because each website represents a separate advertising opportunity, this is also an inherently high-dimensional problem. Further, advertisers often have constraints they wish to place on their campaign, such as targeting specific sub-populations or websites. These challenges require a method flexible enough to accommodate thousands of websites, as well as numerous metrics and campaign constraints. Motivated by this application, we consider the general constrained high-dimensional problem, where the parameters satisfy linear constraints. We develop the Penalized and Constrained optimization method (PaC) to compute the solution path for high-dimensional, linearly constrained criteria. PaC is extremely general; in addition to internet advertising, we show it encompasses many other potential applications, such as portfolio estimation, monotone curve estimation, and the generalized lasso. Computing the PaC coefficient path poses technical challenges, but we develop an efficient algorithm over a grid of tuning parameters. Through extensive simulations, we show PaC performs well. Finally, we apply PaC to a proprietary dataset in an exemplar Internet advertising case study and demonstrate its superiority over existing methods in this practical setting. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 107-122
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2019.1609970
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1609970
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:107-122
Template-Type: ReDIF-Article 1.0
Author-Name: Victor Chernozhukov
Author-X-Name-First: Victor
Author-X-Name-Last: Chernozhukov
Author-Name: Iván Fernández-Val
Author-X-Name-First: Iván
Author-X-Name-Last: Fernández-Val
Author-Name: Blaise Melly
Author-X-Name-First: Blaise
Author-X-Name-Last: Melly
Author-Name: Kaspar Wüthrich
Author-X-Name-First: Kaspar
Author-X-Name-Last: Wüthrich
Title: Generic Inference on Quantile and Quantile Effect Functions for Discrete Outcomes
Abstract:
Quantile and quantile effect (QE) functions are important tools for descriptive and causal analysis due to their natural and intuitive interpretation. Existing inference methods for these functions do not apply to discrete random variables. This article offers a simple, practical construction of simultaneous confidence bands for quantile and QE functions of possibly discrete random variables. It is based on a natural transformation of simultaneous confidence bands for distribution functions, which are readily available for many problems. The construction is generic and does not depend on the nature of the underlying problem. It works in conjunction with parametric, semiparametric, and nonparametric modeling methods for observed and counterfactual distributions, and does not depend on the sampling scheme. We apply our method to characterize the distributional impact of insurance coverage on health care utilization and obtain the distributional decomposition of the racial test score gap. We find that universal insurance coverage increases the number of doctor visits across the entire distribution, and that the racial test score gap is small at early ages but grows with age due to socio-economic factors, especially at the top of the distribution. Supplementary materials (additional results, R package, replication files) for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 123-137
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2019.1611581
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1611581
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:123-137
Template-Type: ReDIF-Article 1.0
Author-Name: Saharon Rosset
Author-X-Name-First: Saharon
Author-X-Name-Last: Rosset
Author-Name: Ryan J. Tibshirani
Author-X-Name-First: Ryan J.
Author-X-Name-Last: Tibshirani
Title: From Fixed-X to Random-X Regression: Bias-Variance Decompositions, Covariance Penalties, and Prediction Error Estimation
Abstract:
In statistical prediction, classical approaches for model selection and model evaluation based on covariance penalties are still widely used. Most of the literature on this topic is based on what we call the “Fixed-X” assumption, where covariate values are assumed to be nonrandom. By contrast, it is often more reasonable to take a “Random-X” view, where the covariate values are independently drawn for both training and prediction. To study the applicability of covariance penalties in this setting, we propose a decomposition of Random-X prediction error in which the randomness in the covariates contributes to both the bias and variance components. This decomposition is general, but we concentrate on the fundamental case of ordinary least-squares (OLS) regression. We prove that in this setting the move from Fixed-X to Random-X prediction results in an increase in both bias and variance. When the covariates are normally distributed and the linear model is unbiased, all terms in this decomposition are explicitly computable, which yields an extension of Mallows’ Cp that we call RCp. RCp also holds asymptotically for certain classes of nonnormal covariates. When the noise variance is unknown, plugging in the usual unbiased estimate leads to an approach that we call $\widehat{\mathrm{RCp}}$, which is closely related to Sp and generalized cross-validation (GCV). For excess bias, we propose an estimate based on the “shortcut-formula” for ordinary cross-validation (OCV), resulting in an approach we call RCp+. Theoretical arguments and numerical simulations suggest that RCp+ is typically superior to OCV, though the difference is small. We further examine the Random-X error of other popular estimators. The surprising result we get for ridge regression is that, in the heavily regularized regime, Random-X variance is smaller than Fixed-X variance, which can lead to smaller overall Random-X error. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 138-151
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1424632
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1424632
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:138-151
Template-Type: ReDIF-Article 1.0
Author-Name: Xiaotong Shen
Author-X-Name-First: Xiaotong
Author-X-Name-Last: Shen
Author-Name: Hsin-Cheng Huang
Author-X-Name-First: Hsin-Cheng
Author-X-Name-Last: Huang
Title: Discussion of “From Fixed-X to Random-X Regression: Bias-Variance Decompositions, Covariance Penalties, and Prediction Error Estimation”
Journal: Journal of the American Statistical Association
Pages: 152-156
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1543597
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1543597
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:152-156
Template-Type: ReDIF-Article 1.0
Author-Name: Stefan Wager
Author-X-Name-First: Stefan
Author-X-Name-Last: Wager
Title: Cross-Validation, Risk Estimation, and Model Selection: Comment on a Paper by Rosset and Tibshirani
Journal: Journal of the American Statistical Association
Pages: 157-160
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2020.1727235
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1727235
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:157-160
Template-Type: ReDIF-Article 1.0
Author-Name: Saharon Rosset
Author-X-Name-First: Saharon
Author-X-Name-Last: Rosset
Author-Name: Ryan J. Tibshirani
Author-X-Name-First: Ryan J.
Author-X-Name-Last: Tibshirani
Title: From Fixed-X to Random-X Regression: Bias-Variance Decompositions, Covariance Penalties, and Prediction Error Estimation: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 161-162
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2020.1727236
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1727236
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:161-162
Template-Type: ReDIF-Article 1.0
Author-Name: Maya B. Mathur
Author-X-Name-First: Maya B.
Author-X-Name-Last: Mathur
Author-Name: Tyler J. VanderWeele
Author-X-Name-First: Tyler J.
Author-X-Name-Last: VanderWeele
Title: Sensitivity Analysis for Unmeasured Confounding in Meta-Analyses
Abstract:
Random-effects meta-analyses of observational studies can produce biased estimates if the synthesized studies are subject to unmeasured confounding. We propose sensitivity analyses quantifying the extent to which unmeasured confounding of specified magnitude could reduce to below a certain threshold the proportion of true effect sizes that are scientifically meaningful. We also develop converse methods to estimate the strength of confounding capable of reducing the proportion of scientifically meaningful true effects to below a chosen threshold. These methods apply when a “bias factor” is assumed to be normally distributed across studies or is assessed across a range of fixed values. Our estimators are derived using recently proposed sharp bounds on confounding bias within a single study that do not make assumptions regarding the unmeasured confounders themselves or the functional form of their relationships with the exposure and outcome of interest. We provide an R package, EValue, and a free website that compute point estimates and inference and produce plots for conducting such sensitivity analyses. These methods facilitate principled use of random-effects meta-analyses of observational studies to assess the strength of causal evidence for a hypothesis. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 163-172
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1529598
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1529598
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:163-172
Template-Type: ReDIF-Article 1.0
Author-Name: Xi Chen
Author-X-Name-First: Xi
Author-X-Name-Last: Chen
Author-Name: Qihang Lin
Author-X-Name-First: Qihang
Author-X-Name-Last: Lin
Author-Name: Bodhisattva Sen
Author-X-Name-First: Bodhisattva
Author-X-Name-Last: Sen
Title: On Degrees of Freedom of Projection Estimators With Applications to Multivariate Nonparametric Regression
Abstract:
In this article, we consider the nonparametric regression problem with multivariate predictors. We provide a characterization of the degrees of freedom and divergence for estimators of the unknown regression function, which are obtained as outputs of linearly constrained quadratic optimization procedures; namely, minimizers of the least-squares criterion with linear constraints and/or quadratic penalties. As special cases of our results, we derive explicit expressions for the degrees of freedom in many nonparametric regression problems, for example, bounded isotonic regression, multivariate (penalized) convex regression, and additive total variation regularization. Our theory also yields, as special cases, known results on the degrees of freedom of many well-studied estimators in the statistics literature, such as ridge regression, the Lasso, and the generalized Lasso. Our results can be readily used to choose the tuning parameter(s) involved in the estimation procedure by minimizing Stein’s unbiased risk estimate. As a by-product of our analysis, we derive an interesting connection between bounded isotonic regression and isotonic regression on a general partially ordered set, which is of independent interest. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 173-186
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1537917
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537917
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:173-186
Template-Type: ReDIF-Article 1.0
Author-Name: Fangzheng Xie
Author-X-Name-First: Fangzheng
Author-X-Name-Last: Xie
Author-Name: Yanxun Xu
Author-X-Name-First: Yanxun
Author-X-Name-Last: Xu
Title: Bayesian Repulsive Gaussian Mixture Model
Abstract:
We develop a general class of Bayesian repulsive Gaussian mixture models that encourage well-separated clusters, aiming at reducing potentially redundant components produced by independent priors for locations (such as the Dirichlet process). Asymptotic results for the posterior distribution of the proposed models are derived, including posterior consistency and the posterior contraction rate in the context of nonparametric density estimation. More importantly, we show that, compared to the independent prior on the component centers, the repulsive prior introduces an additional shrinkage effect on the tail probability of the posterior number of components, which serves as a measure of model complexity. In addition, a generalized urn model that allows a random number of components and correlated component centers is developed based on the exchangeable partition distribution, which gives rise to the corresponding blocked-collapsed Gibbs sampler for posterior inference. We evaluate the performance and demonstrate the advantages of the proposed methodology through extensive simulation studies and real data analysis. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 187-203
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1537918
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537918
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:187-203
Template-Type: ReDIF-Article 1.0
Author-Name: Hui Zhao
Author-X-Name-First: Hui
Author-X-Name-Last: Zhao
Author-Name: Qiwei Wu
Author-X-Name-First: Qiwei
Author-X-Name-Last: Wu
Author-Name: Gang Li
Author-X-Name-First: Gang
Author-X-Name-Last: Li
Author-Name: Jianguo Sun
Author-X-Name-First: Jianguo
Author-X-Name-Last: Sun
Title: Simultaneous Estimation and Variable Selection for Interval-Censored Data With Broken Adaptive Ridge Regression
Abstract:
Simultaneous estimation and variable selection for the Cox model have been discussed by several authors when one observes right-censored failure time data. However, there does not seem to exist an established procedure for interval-censored data, a more general and complex type of failure time data, except for two parametric procedures. To address this, we propose a broken adaptive ridge (BAR) regression procedure that combines the strengths of quadratic regularization and adaptive weighted bridge shrinkage. In particular, the method allows the number of covariates to diverge with the sample size. Under some weak regularity conditions, unlike most of the existing variable selection methods, we establish both the oracle property and the grouping effect of the proposed BAR procedure. An extensive simulation study is conducted and indicates that the proposed approach works well in practical situations and deals with the collinearity problem better than other oracle-like methods. An application is also provided.
Journal: Journal of the American Statistical Association
Pages: 204-216
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1537922
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537922
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:204-216
Template-Type: ReDIF-Article 1.0
Author-Name: Yunzhang Zhu
Author-X-Name-First: Yunzhang
Author-X-Name-Last: Zhu
Author-Name: Xiaotong Shen
Author-X-Name-First: Xiaotong
Author-X-Name-Last: Shen
Author-Name: Wei Pan
Author-X-Name-First: Wei
Author-X-Name-Last: Pan
Title: On High-Dimensional Constrained Maximum Likelihood Inference
Abstract:
Inference in a high-dimensional situation may involve regularization of a certain form to treat overparameterization, imposing challenges to inference. The common practice of inference uses either a regularized model, as in inference after model selection, or bias reduction, known as “debiasing.” While the first ignores the statistical uncertainty inherent in regularization, the second reduces the bias inbred in regularization at the expense of increased variance. In this article, we propose a constrained maximum likelihood method for hypothesis testing involving unspecified nuisance parameters, with a focus on alleviating the impact of regularization on inference. In particular, for general composite hypotheses, we unregularize hypothesized parameters while regularizing nuisance parameters through an L0-constraint controlling the degree of sparseness. This approach is analogous to semiparametric likelihood inference in a high-dimensional situation. On this ground, for the Gaussian graphical model and linear regression, we derive conditions under which the asymptotic distribution of the constrained likelihood ratio is established, permitting the parameter dimension to increase with the sample size. Interestingly, the corresponding limiting distribution is chi-square or normal, depending on whether the co-dimension of a test is finite or increases with the sample size, leading to asymptotically similar tests. This goes beyond the classical Wilks phenomenon. Numerically, we demonstrate that the proposed method performs well against its competitors in various scenarios. Finally, we apply the proposed method to infer linkages in brain network analysis based on MRI data, to contrast Alzheimer’s disease patients against healthy subjects. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 217-230
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1540986
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1540986
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:217-230
Template-Type: ReDIF-Article 1.0
Author-Name: Qian Feng
Author-X-Name-First: Qian
Author-X-Name-Last: Feng
Author-Name: Quang Vuong
Author-X-Name-First: Quang
Author-X-Name-Last: Vuong
Author-Name: Haiqing Xu
Author-X-Name-First: Haiqing
Author-X-Name-Last: Xu
Title: Estimation of Heterogeneous Individual Treatment Effects With Endogenous Treatments
Abstract:
This article estimates individual treatment effects (ITEs) and their probability distribution in a triangular model with binary-valued endogenous treatments. Our estimation procedure takes two steps. First, we estimate the counterfactual outcome and hence the ITE for every observational unit in the sample. Second, we estimate the ITE density function of the whole population. Our estimation method does not suffer from the ill-posed inverse problem associated with inverting a nonlinear functional. Asymptotic properties of the proposed method are established. We study its finite sample properties in Monte Carlo experiments. We also illustrate our approach with an empirical application assessing the effects of 401(k) retirement programs on personal savings. Our results show that there exists a small but statistically significant proportion of individuals who experience negative effects, although the majority of ITEs are positive. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 231-240
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1543121
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1543121
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:231-240
Template-Type: ReDIF-Article 1.0
Author-Name: Pavlo Mozharovskyi
Author-X-Name-First: Pavlo
Author-X-Name-Last: Mozharovskyi
Author-Name: Julie Josse
Author-X-Name-First: Julie
Author-X-Name-Last: Josse
Author-Name: François Husson
Author-X-Name-First: François
Author-X-Name-Last: Husson
Title: Nonparametric Imputation by Data Depth
Abstract:
We present a single imputation method for missing values that borrows the idea of data depth, a measure of centrality defined for an arbitrary point of a space with respect to a probability distribution or data cloud. The method consists of iteratively maximizing the depth of each observation with missing values, and can be employed with any properly defined statistical depth function. Within each iteration, imputation reverts to optimization of quadratic, linear, or quasiconcave functions that are solved analytically by linear programming or the Nelder–Mead method. As it accounts for the underlying data topology, the procedure is distribution free, allows imputation close to the data geometry, can make predictions in situations where local imputation (k-nearest neighbors, random forest) cannot, and has attractive robustness and asymptotic properties under elliptical symmetry. It is shown that a special case, when using the Mahalanobis depth, has a direct connection to well-known methods for the multivariate normal model, such as iterated regression and regularized PCA. The methodology is extended to multiple imputation for data stemming from an elliptically symmetric distribution. Simulation and real data studies show good results compared with existing popular alternatives. The method has been implemented as an R package. Supplementary materials for the article are available online.
Journal: Journal of the American Statistical Association
Pages: 241-253
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1543123
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1543123
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:241-253
Template-Type: ReDIF-Article 1.0
Author-Name: Qiang Sun
Author-X-Name-First: Qiang
Author-X-Name-Last: Sun
Author-Name: Wen-Xin Zhou
Author-X-Name-First: Wen-Xin
Author-X-Name-Last: Zhou
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Title: Adaptive Huber Regression
Abstract:
Big data can easily be contaminated by outliers or contain variables with heavy-tailed distributions, which makes many conventional methods inadequate. To address this challenge, we propose the adaptive Huber regression for robust estimation and inference. The key observation is that the robustification parameter should adapt to the sample size, dimension, and moments for an optimal tradeoff between bias and robustness. Our theoretical framework deals with heavy-tailed distributions with bounded (1+δ)th moment for any δ>0. We establish a sharp phase transition for robust estimation of regression parameters in both low and high dimensions: when δ≥1, the estimator admits a sub-Gaussian-type deviation bound without sub-Gaussian assumptions on the data, while only a slower rate is available in the regime 0<δ<1, and the transition is smooth and optimal. In addition, we extend the methodology to allow both heavy-tailed predictors and observation noise. Simulation studies lend further support to the theory. In a genetic study of cancer cell lines that exhibit heavy-tailedness, the proposed methods are shown to be more robust and predictive. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 254-265
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1543124
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1543124
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:254-265
Template-Type: ReDIF-Article 1.0
Author-Name: Tomohiro Ando
Author-X-Name-First: Tomohiro
Author-X-Name-Last: Ando
Author-Name: Jushan Bai
Author-X-Name-First: Jushan
Author-X-Name-Last: Bai
Title: Quantile Co-Movement in Financial Markets: A Panel Quantile Model With Unobserved Heterogeneity
Abstract:
This article introduces a new procedure for analyzing the quantile co-movement of a large number of financial time series based on a large-scale panel data model with factor structures. The proposed method attempts to capture the unobservable heterogeneity of each of the financial time series based on sensitivity to explanatory variables and to the unobservable factor structure. In our model, the dimension of the common factor structure varies across quantiles, and the explanatory variables are allowed to depend on the factor structure. The proposed method allows for both cross-sectional and serial dependence, and heteroscedasticity, which are common in financial markets. We propose new estimation procedures for both frequentist and Bayesian frameworks. Consistency and asymptotic normality of the proposed estimator are established. We also propose a new model selection criterion for determining the number of common factors, together with theoretical support. We apply the method to analyze the returns for over 6000 international stocks from over 60 countries during the subprime crisis, the European sovereign debt crisis, and the subsequent period. The empirical analysis indicates that the common factor structure varies across quantiles. We find that the common factors for the quantiles and the common factors for the mean are different. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 266-279
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1543598
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1543598
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:266-279
Template-Type: ReDIF-Article 1.0
Author-Name: Cencheng Shen
Author-X-Name-First: Cencheng
Author-X-Name-Last: Shen
Author-Name: Carey E. Priebe
Author-X-Name-First: Carey E.
Author-X-Name-Last: Priebe
Author-Name: Joshua T. Vogelstein
Author-X-Name-First: Joshua T.
Author-X-Name-Last: Vogelstein
Title: From Distance Correlation to Multiscale Graph Correlation
Abstract:
Understanding and developing a correlation measure that can detect general dependencies is not only imperative to statistics and machine learning, but also crucial to general scientific discovery in the big data age. In this paper, we establish a new framework that generalizes distance correlation (Dcorr)—a correlation measure that was recently proposed and shown to be universally consistent for dependence testing against all joint distributions of finite moments—to the multiscale graph correlation (MGC). By using the characteristic functions and incorporating the nearest neighbor machinery, we formalize the population version of local distance correlations, define the optimal scale in a given dependency, and name the optimal local correlation as MGC. The new theoretical framework motivates a theoretically sound sample MGC and allows a number of desirable properties to be proved, including the universal consistency, convergence, and almost unbiasedness of the sample version. The advantages of MGC are illustrated via a comprehensive set of simulations with linear, nonlinear, univariate, multivariate, and noisy dependencies, where it loses almost no power in monotone dependencies while achieving better performance in general dependencies, compared to Dcorr and other popular methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 280-291
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1543125
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1543125
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:280-291
Template-Type: ReDIF-Article 1.0
Author-Name: Hai Shu
Author-X-Name-First: Hai
Author-X-Name-Last: Shu
Author-Name: Xiao Wang
Author-X-Name-First: Xiao
Author-X-Name-Last: Wang
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Title: D-CCA: A Decomposition-Based Canonical Correlation Analysis for High-Dimensional Datasets
Abstract:
A typical approach to the joint analysis of two high-dimensional datasets is to decompose each data matrix into three parts: a low-rank common matrix that captures the shared information across datasets, a low-rank distinctive matrix that characterizes the individual information within a single dataset, and an additive noise matrix. Existing decomposition methods often focus on the orthogonality between the common and distinctive matrices, but inadequately consider the more necessary orthogonal relationship between the two distinctive matrices. The latter guarantees that no more shared information is extractable from the distinctive matrices. We propose decomposition-based canonical correlation analysis (D-CCA), a novel decomposition method that defines the common and distinctive matrices from the ℓ2 space of random variables rather than the conventionally used Euclidean space, with a careful construction of the orthogonal relationship between the distinctive matrices. D-CCA represents a natural generalization of traditional canonical correlation analysis. The proposed estimators of the common and distinctive matrices are shown to be consistent and to have reasonably better performance than some state-of-the-art methods in both simulated data and a real data analysis of breast cancer data obtained from The Cancer Genome Atlas. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 292-306
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1543599
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1543599
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:292-306
Template-Type: ReDIF-Article 1.0
Author-Name: Wenliang Pan
Author-X-Name-First: Wenliang
Author-X-Name-Last: Pan
Author-Name: Xueqin Wang
Author-X-Name-First: Xueqin
Author-X-Name-Last: Wang
Author-Name: Heping Zhang
Author-X-Name-First: Heping
Author-X-Name-Last: Zhang
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Author-Name: Jin Zhu
Author-X-Name-First: Jin
Author-X-Name-Last: Zhu
Title: Ball Covariance: A Generic Measure of Dependence in Banach Space
Abstract:
Technological advances in science and engineering have led to the routine collection of large and complex data objects, where the dependence structure among those objects is often of great interest. Those complex objects (e.g., different brain subcortical structures) often reside in some Banach spaces, and hence their relationship cannot be well characterized by most of the existing measures of dependence, such as correlation coefficients developed in Hilbert spaces. To overcome the limitations of the existing measures, we propose Ball Covariance as a generic measure of dependence between two random objects in two possibly different Banach spaces. Our Ball Covariance possesses the following attractive properties: (i) it is nonparametric and model-free, which makes the proposed measure robust to model misspecification; (ii) it is nonnegative and equal to zero if and only if two random objects in two separable Banach spaces are independent; (iii) the empirical Ball Covariance is easy to compute and can be used as a test statistic of independence. We present both theoretical and numerical results to reveal the potential power of the Ball Covariance in detecting dependence. Importantly, we also analyze two real datasets to demonstrate the usefulness of Ball Covariance in detecting complex dependence. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 307-317
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1543600
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1543600
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:307-317
Template-Type: ReDIF-Article 1.0
Author-Name: Raffaele Argiento
Author-X-Name-First: Raffaele
Author-X-Name-Last: Argiento
Author-Name: Andrea Cremaschi
Author-X-Name-First: Andrea
Author-X-Name-Last: Cremaschi
Author-Name: Marina Vannucci
Author-X-Name-First: Marina
Author-X-Name-Last: Vannucci
Title: Hierarchical Normalized Completely Random Measures to Cluster Grouped Data
Abstract:
In this article, we propose a Bayesian nonparametric model for clustering grouped data. We adopt a hierarchical approach: at the highest level, each group of data is modeled according to a mixture, where the mixing distributions are conditionally independent normalized completely random measures (NormCRMs) centered on the same base measure, which is itself a NormCRM. The discreteness of the shared base measure implies that the processes at the data level share the same atoms. This desired feature allows observations from different groups to be clustered together. We obtain a representation of the hierarchical clustering model by marginalizing with respect to the infinite-dimensional NormCRMs. We investigate the properties of the clustering structure induced by the proposed model and provide theoretical results concerning the distribution of the number of clusters, within and between groups. Furthermore, we offer an interpretation in terms of a generalized Chinese restaurant franchise process, which allows for posterior inference under both conjugate and nonconjugate models. We develop algorithms for fully Bayesian inference and assess performance by means of a simulation study and a real-data illustration. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 318-333
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2019.1594833
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1594833
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:318-333
Template-Type: ReDIF-Article 1.0
Author-Name: Hyebin Song
Author-X-Name-First: Hyebin
Author-X-Name-Last: Song
Author-Name: Garvesh Raskutti
Author-X-Name-First: Garvesh
Author-X-Name-Last: Raskutti
Title: PUlasso: High-Dimensional Variable Selection With Presence-Only Data
Abstract:
In various real-world problems, we are presented with classification problems with positive and unlabeled data, referred to as presence-only responses. In this article, we study variable selection in the context of presence-only responses where the number of features or covariates p is large. The combination of presence-only responses and high dimensionality presents both statistical and computational challenges. We develop the PUlasso algorithm for variable selection and classification with positive and unlabeled responses. Our algorithm uses the majorization-minimization framework, which is a generalization of the well-known expectation-maximization (EM) algorithm. In particular, to make our algorithm scalable, we provide two computational speed-ups to the standard EM algorithm. We provide a theoretical guarantee where we first show that our algorithm converges to a stationary point, and then prove that any stationary point within a local neighborhood of the true parameter achieves the minimax optimal mean-squared error under both strict sparsity and group sparsity assumptions. We also demonstrate through simulations that our algorithm outperforms state-of-the-art algorithms in moderate-p settings in terms of classification performance. Finally, we demonstrate that our PUlasso algorithm performs well on a biochemistry example. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 334-347
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1546587
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1546587
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:334-347
Template-Type: ReDIF-Article 1.0
Author-Name: Radoslav Harman
Author-X-Name-First: Radoslav
Author-X-Name-Last: Harman
Author-Name: Lenka Filová
Author-X-Name-First: Lenka
Author-X-Name-Last: Filová
Author-Name: Peter Richtárik
Author-X-Name-First: Peter
Author-X-Name-Last: Richtárik
Title: A Randomized Exchange Algorithm for Computing Optimal Approximate Designs of Experiments
Abstract:
We propose a class of subspace ascent methods for computing optimal approximate designs that covers existing algorithms as well as new and more efficient ones. Within this class of methods, we construct a simple, randomized exchange algorithm (REX). Numerical comparisons suggest that the performance of REX is comparable or superior to that of state-of-the-art methods across a broad range of problem structures and sizes. We focus on the most commonly used criterion of D-optimality, which also has applications beyond experimental design, such as the construction of the minimum-volume ellipsoid containing a given set of data points. For D-optimality, we prove that the proposed algorithm converges to the optimum. We also provide formulas for the optimal exchange of weights in the case of the criterion of A-optimality, which enable one to use REX and some other algorithms for computing A-optimal and I-optimal designs. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 348-361
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1546588
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1546588
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:348-361
Template-Type: ReDIF-Article 1.0
Author-Name: Yingying Fan
Author-X-Name-First: Yingying
Author-X-Name-Last: Fan
Author-Name: Emre Demirkaya
Author-X-Name-First: Emre
Author-X-Name-Last: Demirkaya
Author-Name: Gaorong Li
Author-X-Name-First: Gaorong
Author-X-Name-Last: Li
Author-Name: Jinchi Lv
Author-X-Name-First: Jinchi
Author-X-Name-Last: Lv
Title: RANK: Large-Scale Inference With Graphical Nonlinear Knockoffs
Abstract:
Power and reproducibility are key to enabling refined scientific discoveries in contemporary big data applications with general high-dimensional nonlinear models. In this article, we provide theoretical foundations on the power and robustness of the model-X knockoffs procedure introduced recently by Candès, Fan, Janson, and Lv in the high-dimensional setting where the covariate distribution is characterized by a Gaussian graphical model. We establish that, under mild regularity conditions, the power of the oracle knockoffs procedure with known covariate distribution in high-dimensional linear models is asymptotically one as the sample size goes to infinity. When moving away from this ideal case, we suggest a modified model-X knockoffs method, called graphical nonlinear knockoffs (RANK), to accommodate an unknown covariate distribution. We provide theoretical justifications for the robustness of our modified procedure by showing that the false discovery rate (FDR) is asymptotically controlled at the target level and the power is asymptotically one with the estimated covariate distribution. To the best of our knowledge, this is the first formal theoretical result on the power of the knockoffs procedure. Simulation results demonstrate that, compared to existing approaches, our method performs competitively in both FDR control and power. A real dataset is analyzed to further assess the performance of the suggested knockoffs procedure. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 362-379
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1546589
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1546589
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:362-379
Template-Type: ReDIF-Article 1.0
Author-Name: Peng Wu
Author-X-Name-First: Peng
Author-X-Name-Last: Wu
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: Yuanjia Wang
Author-X-Name-First: Yuanjia
Author-X-Name-Last: Wang
Title: Matched Learning for Optimizing Individualized Treatment Strategies Using Electronic Health Records
Abstract:
Current guidelines for treatment decision making largely rely on data from randomized controlled trials (RCTs) studying average treatment effects. They may be inadequate to make individualized treatment decisions in real-world settings. Large-scale electronic health records (EHR) provide opportunities to fulfill the goals of personalized medicine and learn individualized treatment rules (ITRs) depending on patient-specific characteristics from real-world patient data. In this work, we tackle challenges with EHRs and propose a machine learning approach based on matching (M-learning) to estimate optimal ITRs from EHRs. This new learning method performs matching instead of inverse probability weighting as commonly used in many existing methods for estimating ITRs to more accurately assess individuals’ treatment responses to alternative treatments and alleviate confounding. Matching-based value functions are proposed to compare matched pairs under a unified framework, where various types of outcomes for measuring treatment response (including continuous, ordinal, and discrete outcomes) can easily be accommodated. We establish the Fisher consistency and convergence rate of M-learning. Through extensive simulation studies, we show that M-learning outperforms existing methods when propensity scores are misspecified or when unmeasured confounders are present in certain scenarios. Lastly, we apply M-learning to estimate optimal personalized second-line treatments for type 2 diabetes patients to achieve better glycemic control or reduce major complications using EHRs from New York Presbyterian Hospital. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 380-392
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1549050
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1549050
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:380-392
Template-Type: ReDIF-Article 1.0
Author-Name: Yaowu Liu
Author-X-Name-First: Yaowu
Author-X-Name-Last: Liu
Author-Name: Jun Xie
Author-X-Name-First: Jun
Author-X-Name-Last: Xie
Title: Cauchy Combination Test: A Powerful Test With Analytic p-Value Calculation Under Arbitrary Dependency Structures
Abstract:
Combining individual p-values to aggregate multiple small effects is of long-standing interest in statistics, dating back to the classic Fisher’s combination test. In modern large-scale data analysis, correlation and sparsity are common features, and efficient computation is a necessary requirement for dealing with massive data. To overcome these challenges, we propose a new test that takes advantage of the Cauchy distribution. Our test statistic has a simple form and is defined as a weighted sum of Cauchy transformations of individual p-values. We prove a nonasymptotic result that the tail of the null distribution of our proposed test statistic can be well approximated by a Cauchy distribution under arbitrary dependency structures. Based on this theoretical result, the p-value calculation of our proposed test is not only accurate, but also as simple as the classic z-test or t-test, making our test well suited for analyzing massive data. We further show that the power of the proposed test is asymptotically optimal in a strong sparsity setting. Extensive simulations demonstrate that the proposed test has both strong power against sparse alternatives and good accuracy with respect to p-value calculations, especially for very small p-values. The proposed test has also been applied to a genome-wide association study of Crohn’s disease and compared with several existing tests. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 393-402
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1554485
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1554485
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:393-402
Template-Type: ReDIF-Article 1.0
Author-Name: Dehan Kong
Author-X-Name-First: Dehan
Author-X-Name-Last: Kong
Author-Name: Baiguo An
Author-X-Name-First: Baiguo
Author-X-Name-Last: An
Author-Name: Jingwen Zhang
Author-X-Name-First: Jingwen
Author-X-Name-Last: Zhang
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Title: L2RM: Low-Rank Linear Regression Models for High-Dimensional Matrix Responses
Abstract:
The aim of this article is to develop a low-rank linear regression model to correlate a high-dimensional response matrix with a high-dimensional vector of covariates when the coefficient matrices have low-rank structures. We propose a fast and efficient screening procedure based on the spectral norm of each coefficient matrix to deal with the case when the number of covariates is extremely large. We develop an efficient estimation procedure based on trace norm regularization, which explicitly imposes the low-rank structure of the coefficient matrices. When both the dimension of the response matrix and that of the covariate vector diverge at an exponential order of the sample size, we investigate the sure independence screening property under some mild conditions. We also systematically investigate some theoretical properties of our estimation procedure, including estimation consistency, rank consistency, and a nonasymptotic error bound under some mild conditions. We further establish a theoretical guarantee for the overall solution of our two-step screening and estimation procedure. We examine the finite-sample performance of our screening and estimation methods using simulations and a large-scale imaging genetic dataset collected by the Philadelphia Neurodevelopmental Cohort study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 403-424
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1555092
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1555092
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:403-424
Template-Type: ReDIF-Article 1.0
Author-Name: Jean-Pierre Florens
Author-X-Name-First: Jean-Pierre
Author-X-Name-Last: Florens
Author-Name: Léopold Simar
Author-X-Name-First: Léopold
Author-X-Name-Last: Simar
Author-Name: Ingrid Van Keilegom
Author-X-Name-First: Ingrid
Author-X-Name-Last: Van Keilegom
Title: Estimation of the Boundary of a Variable Observed With Symmetric Error
Abstract:
Consider the model Y = X + ε with X = τ + Z, where τ is an unknown constant (the boundary of X), Z is a random variable defined on R+, ε is a symmetric error, and ε and Z are independent. Based on an iid sample of Y, we aim at identifying and estimating the boundary τ when the law of ε is unknown (apart from symmetry) and, in particular, its variance is unknown. We propose an estimation procedure based on a minimal distance approach and on the use of Laguerre polynomials. Asymptotic results as well as finite sample simulations are shown. The article also proposes an extension to stochastic frontier analysis, where the model is conditional on observed variables. The model becomes Y = τ(w1, w2) + Z + ε, where Y is a cost, w1 are the observed outputs, and w2 represents the observed values of other conditioning variables, so Z is the cost inefficiency. Some simulations again illustrate how the approach works in finite samples, and the proposed procedure is illustrated with data coming from post offices in France.
Journal: Journal of the American Statistical Association
Pages: 425-441
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1555093
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1555093
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:425-441
Template-Type: ReDIF-Article 1.0
Author-Name: Jingshen Wang
Author-X-Name-First: Jingshen
Author-X-Name-Last: Wang
Author-Name: Xuming He
Author-X-Name-First: Xuming
Author-X-Name-Last: He
Author-Name: Gongjun Xu
Author-X-Name-First: Gongjun
Author-X-Name-Last: Xu
Title: Debiased Inference on Treatment Effect in a High-Dimensional Model
Abstract:
This article concerns the potential bias in statistical inference on treatment effects when a large number of covariates are present in a linear or partially linear model. While the estimation bias in an under-fitted model is well understood, we address a lesser-known bias that arises from an over-fitted model. The over-fitting bias can be eliminated through data splitting at the cost of statistical efficiency, and we show that smoothing over random data splits can be pursued to mitigate the efficiency loss. We also discuss some of the existing methods for debiased inference and provide insights into their intrinsic bias-variance trade-off, which leads to an improvement in bias controls. Under appropriate conditions, we show that the proposed estimators for the treatment effects are asymptotically normal and their variances can be well estimated. We discuss the pros and cons of various methods both theoretically and empirically, and show that the proposed methods are valuable options in post-selection inference. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 442-454
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1558062
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1558062
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:442-454
Template-Type: ReDIF-Article 1.0
Author-Name: Timothée Tabouy
Author-X-Name-First: Timothée
Author-X-Name-Last: Tabouy
Author-Name: Pierre Barbillon
Author-X-Name-First: Pierre
Author-X-Name-Last: Barbillon
Author-Name: Julien Chiquet
Author-X-Name-First: Julien
Author-X-Name-Last: Chiquet
Title: Variational Inference for Stochastic Block Models From Sampled Data
Abstract:
This article deals with dyads that are not observed during the sampling of a network and the ensuing issues in the inference of the stochastic block model (SBM). We review sampling designs and recover the missing at random (MAR) and not missing at random (NMAR) conditions for the SBM. We introduce variants of the variational EM algorithm for inferring the SBM under various sampling designs (MAR and NMAR), all of which are available in an R package. Model selection criteria based on the integrated classification likelihood are derived for selecting both the number of blocks and the sampling design. We investigate the accuracy and the range of applicability of these algorithms with simulations. We explore two real-world networks from ethnology (a seed circulation network) and biology (a protein–protein interaction network), where the interpretations depend considerably on the sampling designs considered. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 455-466
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1562934
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1562934
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:455-466
Template-Type: ReDIF-Article 1.0
Author-Name: Aurore Delaigle
Author-X-Name-First: Aurore
Author-X-Name-Last: Delaigle
Author-Name: Wei Huang
Author-X-Name-First: Wei
Author-X-Name-Last: Huang
Author-Name: Shaoke Lei
Author-X-Name-First: Shaoke
Author-X-Name-Last: Lei
Title: Estimation of Conditional Prevalence From Group Testing Data With Missing Covariates
Abstract:
We consider estimating the conditional prevalence of a disease from data pooled according to the group testing mechanism. Consistent estimators have been proposed in the literature, but they rely on the data being available for all individuals. In infectious disease studies where group testing is frequently applied, the covariate is often missing for some individuals. In that case, unless the missingness occurs completely at random, applying the existing techniques to the complete cases without adjusting for missingness does not generally provide consistent estimators, and finding appropriate modifications is challenging. We develop a consistent spline estimator, derive its theoretical properties, and show how to adapt local polynomial and likelihood estimators to the missing data problem. We illustrate the numerical performance of our methods on simulated and real examples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 467-480
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2019.1566071
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1566071
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:467-480
Template-Type: ReDIF-Article 1.0
Author-Name: Thibault Vatter
Author-X-Name-First: Thibault
Author-X-Name-Last: Vatter
Title: Simulating Copulas: Stochastic Models, Sampling Algorithms, and Applications
Journal: Journal of the American Statistical Association
Pages: 481-482
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2020.1721244
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1721244
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:481-482
Template-Type: ReDIF-Article 1.0
Author-Name: Peter M. Aronow
Author-X-Name-First: Peter M.
Author-X-Name-Last: Aronow
Author-Name: Fredrik Sävje
Author-X-Name-First: Fredrik
Author-X-Name-Last: Sävje
Title: The Book of Why: The New Science of Cause and Effect
Journal: Journal of the American Statistical Association
Pages: 482-485
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2020.1721245
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1721245
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:482-485
Template-Type: ReDIF-Article 1.0
Author-Name: Noor Azina Ismail
Author-X-Name-First: Noor Azina
Author-X-Name-Last: Ismail
Title: Measuring Agreement: Models, Methods, and Applications
Journal: Journal of the American Statistical Association
Pages: 485-486
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2020.1721246
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1721246
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:485-486
Template-Type: ReDIF-Article 1.0
Author-Name: Qing Wang
Author-X-Name-First: Qing
Author-X-Name-Last: Wang
Title: Multivariate Kernel Smoothing and Its Applications
Journal: Journal of the American Statistical Association
Pages: 486-486
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2020.1721247
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1721247
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:486-486
Template-Type: ReDIF-Article 1.0
Author-Name: Anita D. Behme
Author-X-Name-First: Anita D.
Author-X-Name-Last: Behme
Title: Theory of Stochastic Objects: Probability, Stochastic Processes and Inference
Journal: Journal of the American Statistical Association
Pages: 486-487
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2020.1721248
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1721248
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:486-487
Template-Type: ReDIF-Article 1.0
Author-Name: Oliver Y. Chén
Author-X-Name-First: Oliver Y.
Author-X-Name-Last: Chén
Title: Big Data in Omics and Imaging: Integrated Analysis and Causal Inference
Journal: Journal of the American Statistical Association
Pages: 487-488
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2020.1721249
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1721249
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:487-488
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: RETRACTED ARTICLE: Smoothing with Couplings of Conditional Particle Filters
Journal: Journal of the American Statistical Association
Pages: 489-489
Issue: 529
Volume: 115
Year: 2020
Month: 1
X-DOI: 10.1080/01621459.2018.1505625
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1505625
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:529:p:489-489
Template-Type: ReDIF-Article 1.0
Author-Name: Shiwen Zhao
Author-X-Name-First: Shiwen
Author-X-Name-Last: Zhao
Author-Name: Barbara E. Engelhardt
Author-X-Name-First: Barbara E.
Author-X-Name-Last: Engelhardt
Author-Name: Sayan Mukherjee
Author-X-Name-First: Sayan
Author-X-Name-Last: Mukherjee
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Fast Moment Estimation for Generalized Latent Dirichlet Models
Abstract:
We develop a generalized method of moments (GMM) approach for fast parameter estimation in a new class of Dirichlet latent variable models with mixed data types. Parameter estimation via GMM has computational and statistical advantages over alternative methods, such as expectation maximization, variational inference, and Markov chain Monte Carlo. A key computational advantage of our method, Moment Estimation for latent Dirichlet models (MELD), is that parameter estimation does not require instantiation of the latent variables. Moreover, performance is agnostic to distributional assumptions of the observations. We derive population moment conditions after marginalizing out the sample-specific Dirichlet latent variables. The moment conditions only depend on component mean parameters. We illustrate the utility of our approach on simulated data, comparing results from MELD to alternative methods, and we show the promise of our approach through the application to several datasets. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1528-1540
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1341839
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1341839
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1528-1540
Template-Type: ReDIF-Article 1.0
Author-Name: Yichi Zhang
Author-X-Name-First: Yichi
Author-X-Name-Last: Zhang
Author-Name: Eric B. Laber
Author-X-Name-First: Eric B.
Author-X-Name-Last: Laber
Author-Name: Marie Davidian
Author-X-Name-First: Marie
Author-X-Name-Last: Davidian
Author-Name: Anastasios A. Tsiatis
Author-X-Name-First: Anastasios A.
Author-X-Name-Last: Tsiatis
Title: Interpretable Dynamic Treatment Regimes
Abstract:
Precision medicine is currently a topic of great interest in clinical and intervention science. A key component of precision medicine is that it is evidence-based, that is, data-driven, and consequently there has been tremendous interest in the estimation of precision medicine strategies using observational or randomized study data. One way to formalize precision medicine is through a treatment regime, which is a sequence of decision rules, one per stage of clinical intervention, that map up-to-date patient information to a recommended treatment. An optimal treatment regime is defined as one maximizing the mean of some cumulative clinical outcome if applied to a population of interest. It is well known that even under simple generative models an optimal treatment regime can be a highly nonlinear function of patient information. Consequently, a focal point of recent methodological research has been the development of flexible models for estimating optimal treatment regimes. However, in many settings, estimation of an optimal treatment regime is an exploratory analysis intended to generate new hypotheses for subsequent research and not to directly dictate treatment to new patients. In such settings, an estimated treatment regime that is interpretable in a domain context may be of greater value than an unintelligible treatment regime built using “black-box” estimation methods. We propose an estimator of an optimal treatment regime composed of a sequence of decision rules, each expressible as a list of “if-then” statements that can be presented as either a paragraph or a simple flowchart that is immediately interpretable to domain experts. The discreteness of these lists precludes smooth, that is, gradient-based, methods of estimation and leads to nonstandard asymptotics. Nevertheless, we provide a computationally efficient estimation algorithm, prove consistency of the proposed estimator, and derive rates of convergence. We illustrate the proposed methods using a series of simulation examples and an application to data from a sequential clinical trial on bipolar disorder. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1541-1549
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1345743
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1345743
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1541-1549
Template-Type: ReDIF-Article 1.0
Author-Name: Ling Zhou
Author-X-Name-First: Ling
Author-X-Name-Last: Zhou
Author-Name: Huazhen Lin
Author-X-Name-First: Huazhen
Author-X-Name-Last: Lin
Author-Name: Hua Liang
Author-X-Name-First: Hua
Author-X-Name-Last: Liang
Title: Efficient Estimation of the Nonparametric Mean and Covariance Functions for Longitudinal and Sparse Functional Data
Abstract:
We consider the estimation of the mean and covariance functions for longitudinal and sparse functional data by coupling the full quasi-likelihood with a modification of the local kernel smoothing method. The proposed estimators are shown to be consistent, asymptotically normal, and semiparametrically efficient in terms of their linear functionals. Their superiority to the competitors is further illustrated numerically through simulation studies. The method is applied to analyze data from an AIDS study and an atmospheric study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1550-1564
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1356317
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356317
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1550-1564
Template-Type: ReDIF-Article 1.0
Author-Name: Clément Dombry
Author-X-Name-First: Clément
Author-X-Name-Last: Dombry
Author-Name: Mathieu Ribatet
Author-X-Name-First: Mathieu
Author-X-Name-Last: Ribatet
Author-Name: Stilian Stoev
Author-X-Name-First: Stilian
Author-X-Name-Last: Stoev
Title: Probabilities of Concurrent Extremes
Abstract:
The statistical modeling of spatial extremes has been an active area of recent research with a growing domain of applications. Much of the existing methodology, however, focuses on the magnitudes of extreme events rather than on their timing. To address this gap, this article investigates the notion of extremal concurrence. Suppose that daily temperatures are measured at several synoptic stations. We say that extremes are concurrent if record maximum temperatures occur simultaneously, that is, on the same day for all stations. It is important to be able to understand, quantify, and model extremal concurrence. Under general conditions, we show that the finite sample concurrence probability converges to an asymptotic quantity, deemed extremal concurrence probability. Using Palm calculus, we establish general expressions for the extremal concurrence probability through the max-stable process emerging in the limit of the component-wise maxima of the sample. Explicit forms of the extremal concurrence probabilities are obtained for various max-stable models and several estimators are introduced. In particular, we prove that the pairwise extremal concurrence probability for max-stable vectors is precisely equal to the Kendall’s τ. The estimators are evaluated from simulations and applied to study temperature extremes in the United States. Results demonstrate that concurrence probability can be used to study, for example, the effect of global climate phenomena such as the El Niño Southern Oscillation (ENSO) or global warming on the spatial structure and areal impact of extremes.
Journal: Journal of the American Statistical Association
Pages: 1565-1582
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1356318
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356318
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1565-1582
Template-Type: ReDIF-Article 1.0
Author-Name: Yinchu Zhu
Author-X-Name-First: Yinchu
Author-X-Name-Last: Zhu
Author-Name: Jelena Bradic
Author-X-Name-First: Jelena
Author-X-Name-Last: Bradic
Title: Linear Hypothesis Testing in Dense High-Dimensional Linear Models
Abstract:
We propose a methodology for testing linear hypotheses in high-dimensional linear models. The proposed test does not impose any restriction on the size of the model, that is, model sparsity or the loading vector representing the hypothesis. Providing asymptotically valid methods for testing general linear functions of the regression parameters in high-dimensions is extremely challenging—especially without making restrictive or unverifiable assumptions on the number of nonzero elements. We propose to test the moment conditions related to the newly designed restructured regression, where the inputs are transformed and augmented features. These new features incorporate the structure of the null hypothesis directly. The test statistics are constructed in such a way that lack of sparsity in the original model parameter does not present a problem for the theoretical justification of our procedures. We establish asymptotically exact control on Type I error without imposing any sparsity assumptions on the model parameter or the vector representing the linear hypothesis. Our method is also shown to achieve certain optimality in detecting deviations from the null hypothesis. We demonstrate the favorable finite-sample performance of the proposed methods via a number of numerical examples and a real data example. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1583-1600
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1356319
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356319
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1583-1600
Template-Type: ReDIF-Article 1.0
Author-Name: Xiaoxiao Sun
Author-X-Name-First: Xiaoxiao
Author-X-Name-Last: Sun
Author-Name: Pang Du
Author-X-Name-First: Pang
Author-X-Name-Last: Du
Author-Name: Xiao Wang
Author-X-Name-First: Xiao
Author-X-Name-Last: Wang
Author-Name: Ping Ma
Author-X-Name-First: Ping
Author-X-Name-Last: Ma
Title: Optimal Penalized Function-on-Function Regression Under a Reproducing Kernel Hilbert Space Framework
Abstract:
Many scientific studies collect data where the response and predictor variables are both functions of time, location, or some other covariate. Understanding the relationship between these functional variables is a common goal in these studies. Motivated by two real-life examples, we present in this article a function-on-function regression model that can be used to analyze such kinds of functional data. Our estimator of the 2D coefficient function is the optimizer of a form of penalized least squares where the penalty enforces a certain level of smoothness on the estimator. Our first result is the representer theorem, which states that the exact optimizer of the penalized least squares actually resides in a data-adaptive finite-dimensional subspace, although the optimization problem is defined on a function space of infinite dimensions. This theorem then allows us to easily incorporate Gaussian quadrature into the optimization of the penalized least squares, which can be carried out through standard numerical procedures. We also show that our estimator achieves the minimax convergence rate in mean prediction under the framework of function-on-function regression. Extensive simulation studies demonstrate the numerical advantages of our method over the existing ones, where a sparse functional data extension is also introduced. The proposed method is then applied to our motivating examples of the benchmark Canadian weather data and a histone regulation study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1601-1611
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1356320
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356320
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1601-1611
Template-Type: ReDIF-Article 1.0
Author-Name: Matthew Dawson
Author-X-Name-First: Matthew
Author-X-Name-Last: Dawson
Author-Name: Hans-Georg Müller
Author-X-Name-First: Hans-Georg
Author-X-Name-Last: Müller
Title: Dynamic Modeling of Conditional Quantile Trajectories, With Application to Longitudinal Snippet Data
Abstract:
Longitudinal data are often plagued with sparsity of time points where measurements are available. The functional data analysis perspective has been shown to provide an effective and flexible approach to address this problem for the case where measurements are sparse but their times are randomly distributed over an interval. Here, we focus on a different scenario where available data can be characterized as snippets, which are very short stretches of longitudinal measurements. For each subject, the stretch of available data is much shorter than the time frame of interest, a common occurrence in accelerated longitudinal studies. An added challenge is introduced if a time proxy that is basic for usual longitudinal modeling is not available. This situation arises in the case of Alzheimer’s disease and comparable scenarios, where one is interested in time dynamics of declining performance, but the time of disease onset is unknown and chronological age does not provide a meaningful time reference for longitudinal modeling. Our main methodological contribution to address these challenges is to introduce conditional quantile trajectories for monotonic processes that emerge as solutions of a dynamic system. Our proposed estimates for these trajectories are shown to be uniformly consistent. Conditional quantile trajectories are useful descriptors of processes that quantify deterioration over time, such as hippocampal volumes in Alzheimer’s patients. We demonstrate how the proposed approach can be applied to longitudinal snippets data sampled from such processes. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1612-1624
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1356321
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356321
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1612-1624
Template-Type: ReDIF-Article 1.0
Author-Name: Minjie Fan
Author-X-Name-First: Minjie
Author-X-Name-Last: Fan
Author-Name: Debashis Paul
Author-X-Name-First: Debashis
Author-X-Name-Last: Paul
Author-Name: Thomas C. M. Lee
Author-X-Name-First: Thomas C. M.
Author-X-Name-Last: Lee
Author-Name: Tomoko Matsuo
Author-X-Name-First: Tomoko
Author-X-Name-Last: Matsuo
Title: Modeling Tangential Vector Fields on a Sphere
Abstract:
Physical processes that manifest as tangential vector fields on a sphere are common in geophysical and environmental sciences. These naturally occurring vector fields are often subject to physical constraints, such as being curl-free or divergence-free. We start with constructing parametric models for curl-free and divergence-free vector fields that are tangential to the unit sphere through applying the surface gradient or the surface curl operator to a scalar random potential field on the unit sphere. Using the Helmholtz–Hodge decomposition, we then construct a class of simple but flexible parametric models for general tangential vector fields, which are represented as a sum of a curl-free component and a divergence-free component. We propose a likelihood-based parameter estimation procedure, and show that fast computation is possible even for large datasets when the observations are on a regular latitude–longitude grid. Characteristics and practical utility of the proposed methodology are illustrated through extensive simulation studies and an application to a dataset of ocean surface wind velocities collected by satellite-based scatterometers. We also compare our model with a bivariate Matérn model and a non-stationary bivariate global model. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1625-1636
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1356322
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356322
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1625-1636
Template-Type: ReDIF-Article 1.0
Author-Name: Bing Li
Author-X-Name-First: Bing
Author-X-Name-Last: Li
Author-Name: Eftychia Solea
Author-X-Name-First: Eftychia
Author-X-Name-Last: Solea
Title: A Nonparametric Graphical Model for Functional Data With Application to Brain Networks Based on fMRI
Abstract:
We introduce a nonparametric graphical model whose observations on vertices are functions. Many modern applications, such as electroencephalogram and functional magnetic resonance imaging (fMRI), produce data of this type. The model is based on additive conditional independence (ACI), a statistical relation that captures the spirit of conditional independence without resorting to multi-dimensional kernels. The random functions are assumed to reside in a Hilbert space. No distributional assumption is imposed on the random functions: instead, their statistical relations are characterized nonparametrically by a second Hilbert space, which is a reproducing kernel Hilbert space whose kernel is determined by the inner product of the first Hilbert space. A precision operator is then constructed based on the second space, which characterizes ACI, and hence also the graph. The resulting estimator is relatively easy to compute, requiring no iterative optimization or inversion of large matrices. We establish the consistency and the convergence rate of the estimator. Through simulation studies we demonstrate that the estimator performs better than the functional Gaussian graphical model when the relations among vertices are nonlinear or heteroscedastic. The method is applied to an fMRI dataset to construct brain networks for patients with attention-deficit/hyperactivity disorder. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1637-1655
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1356726
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1356726
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1637-1655
Template-Type: ReDIF-Article 1.0
Author-Name: Siddhartha Chib
Author-X-Name-First: Siddhartha
Author-X-Name-Last: Chib
Author-Name: Minchul Shin
Author-X-Name-First: Minchul
Author-X-Name-Last: Shin
Author-Name: Anna Simoni
Author-X-Name-First: Anna
Author-X-Name-Last: Simoni
Title: Bayesian Estimation and Comparison of Moment Condition Models
Abstract:
In this article, we develop a Bayesian semiparametric analysis of moment condition models by casting the problem within the exponentially tilted empirical likelihood (ETEL) framework. We use this framework to develop a fully Bayesian analysis of correctly and misspecified moment condition models. We show that even under misspecification, the Bayesian ETEL posterior distribution satisfies the Bernstein–von Mises (BvM) theorem. We also develop a unified approach based on marginal likelihoods and Bayes factors for comparing different moment-restricted models and for discarding any misspecified moment restrictions. Computation of the marginal likelihoods is by the method of Chib (1995) as extended to Metropolis–Hastings samplers in Chib and Jeliazkov (2001). We establish the model selection consistency of the marginal likelihood and show that the marginal likelihood favors the model with the minimum number of parameters and the maximum number of valid moment restrictions. When the models are misspecified, the marginal likelihood model selection procedure selects the model that is closer to the (unknown) true data-generating process in terms of the Kullback–Leibler divergence. The ideas and results in this article broaden the theoretical underpinning and value of the Bayesian ETEL framework with many practical applications. The discussion is illuminated through several examples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1656-1668
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1358172
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1358172
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1656-1668
Template-Type: ReDIF-Article 1.0
Author-Name: Zachary M. Thomas
Author-X-Name-First: Zachary M.
Author-X-Name-Last: Thomas
Author-Name: Steven N. MacEachern
Author-X-Name-First: Steven N.
Author-X-Name-Last: MacEachern
Author-Name: Mario Peruggia
Author-X-Name-First: Mario
Author-X-Name-Last: Peruggia
Title: Reconciling Curvature and Importance Sampling Based Procedures for Summarizing Case Influence in Bayesian Models
Abstract:
Methods for summarizing case influence in Bayesian models take essentially two forms: (1) use common divergence measures for calculating distances between the full-data posterior and the case-deleted posterior, and (2) measure the impact of infinitesimal perturbations to the likelihood to study local case influence. Methods based on approach (1) lead naturally to considering the behavior of case-deletion importance sampling weights (the weights used to approximate samples from the case-deleted posterior using samples from the full posterior). Methods based on approach (2) lead naturally to considering the local curvature of the Kullback–Leibler divergence of the full posterior from a geometrically perturbed quasi-posterior. By examining the connections between the two approaches, we establish a rationale for employing low-dimensional summaries of case influence obtained entirely via the variance–covariance matrix of the log importance sampling weights. We illustrate the use of the proposed diagnostics using real and simulated data. Supplementary materials are available online.
Journal: Journal of the American Statistical Association
Pages: 1669-1683
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1360777
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1360777
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1669-1683
Template-Type: ReDIF-Article 1.0
Author-Name: Veronika Ročková
Author-X-Name-First: Veronika
Author-X-Name-Last: Ročková
Title: Particle EM for Variable Selection
Abstract:
Despite its long history of success, the EM algorithm has been vulnerable to local entrapment when the posterior/likelihood is multi-modal. This is particularly pronounced in spike-and-slab posterior distributions for Bayesian variable selection. The main thrust of this article is to introduce the particle EM algorithm, a new population-based optimization strategy that harvests multiple modes in search spaces that present many local maxima. Motivated by nonparametric variational Bayes strategies, particle EM achieves this goal by deploying an ensemble of interactive repulsive particles. These particles are geared toward uncharted areas of the posterior, providing a more comprehensive summary of its topography than simple parallel EM deployments. A sequential Monte Carlo variant of particle EM is also proposed that explores a sequence of annealed posteriors by sampling from a set of mutually avoiding particles. Particle EM outputs a deterministic reconstruction of the posterior distribution for approximate fully Bayes inference by capturing its essential modes and mode weights. This reconstruction reflects model selection uncertainty and is supported by asymptotic considerations, which indicate that the requisite number of particles need not be large in the presence of sparsity (when p > n). Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1684-1697
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1360778
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1360778
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1684-1697
Template-Type: ReDIF-Article 1.0
Author-Name: Chengchun Shi
Author-X-Name-First: Chengchun
Author-X-Name-Last: Shi
Author-Name: Wenbin Lu
Author-X-Name-First: Wenbin
Author-X-Name-Last: Lu
Author-Name: Rui Song
Author-X-Name-First: Rui
Author-X-Name-Last: Song
Title: A Massive Data Framework for M-Estimators with Cubic-Rate
Abstract:
The divide and conquer method is a common strategy for handling massive data. In this article, we study the divide and conquer method for cubic-rate estimators under the massive data framework. We develop a general theory for establishing the asymptotic distribution of the aggregated M-estimators using a weighted average with weights depending on the subgroup sample sizes. Under certain conditions on the growth rate of the number of subgroups, the resulting aggregated estimators are shown to have a faster convergence rate and an asymptotic normal distribution, which are more tractable in both computation and inference than the original M-estimators based on pooled data. Our theory applies to a wide class of M-estimators with cube root convergence rate, including the location estimator, maximum score estimator, and value search estimator. Empirical performance via simulations and a real data application also validates our theoretical findings. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1698-1709
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1360779
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1360779
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1698-1709
Template-Type: ReDIF-Article 1.0
Author-Name: Lorin Crawford
Author-X-Name-First: Lorin
Author-X-Name-Last: Crawford
Author-Name: Kris C. Wood
Author-X-Name-First: Kris C.
Author-X-Name-Last: Wood
Author-Name: Xiang Zhou
Author-X-Name-First: Xiang
Author-X-Name-Last: Zhou
Author-Name: Sayan Mukherjee
Author-X-Name-First: Sayan
Author-X-Name-Last: Mukherjee
Title: Bayesian Approximate Kernel Regression With Variable Selection
Abstract:
Nonlinear kernel regression models are often used in statistics and machine learning because they are more accurate than linear models. Variable selection for kernel regression models is a challenge partly because, unlike the linear regression setting, there is no clear concept of an effect size for regression coefficients. In this article, we propose a novel framework that provides an effect size analog for each explanatory variable in Bayesian kernel regression models when the kernel is shift-invariant—for example, the Gaussian kernel. We use function analytic properties of shift-invariant reproducing kernel Hilbert spaces (RKHS) to define a linear vector space that: (i) captures nonlinear structure, and (ii) can be projected onto the original explanatory variables. This projection onto the original explanatory variables serves as an analog of effect sizes. The specific function analytic property we use is that shift-invariant kernel functions can be approximated via random Fourier bases. Based on the random Fourier expansion, we propose a computationally efficient class of Bayesian approximate kernel regression (BAKR) models for both nonlinear regression and binary classification for which one can compute an analog of effect sizes. We illustrate the utility of BAKR by examining two important problems in statistical genetics: genomic selection (i.e., phenotypic prediction) and association mapping (i.e., inference of significant variants or loci). State-of-the-art methods for genomic selection and association mapping are based on kernel regression and linear models, respectively. BAKR is the first method that is competitive in both settings. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1710-1721
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1361830
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1361830
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1710-1721
Template-Type: ReDIF-Article 1.0
Author-Name: Jonas Harnau
Author-X-Name-First: Jonas
Author-X-Name-Last: Harnau
Author-Name: Bent Nielsen
Author-X-Name-First: Bent
Author-X-Name-Last: Nielsen
Title: Over-Dispersed Age-Period-Cohort Models
Abstract:
We consider inference and forecasting for aggregate data organized in a two-way table with age and cohort as indices, but without measures of exposure. This is modeled using a Poisson likelihood with an age-period-cohort structure for the mean while allowing for over-dispersion. We propose a repetitive structure that keeps the dimension of the table fixed while increasing the latent exposure. For this, we use a class of infinitely divisible distributions which include a variety of compound Poisson models and Poisson mixture models. This results in asymptotic F inference and t forecast distributions.
Journal: Journal of the American Statistical Association
Pages: 1722-1732
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1366908
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1366908
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1722-1732
Template-Type: ReDIF-Article 1.0
Author-Name: Roger S. Zoh
Author-X-Name-First: Roger S.
Author-X-Name-Last: Zoh
Author-Name: Abhra Sarkar
Author-X-Name-First: Abhra
Author-X-Name-Last: Sarkar
Author-Name: Raymond J. Carroll
Author-X-Name-First: Raymond J.
Author-X-Name-Last: Carroll
Author-Name: Bani K. Mallick
Author-X-Name-First: Bani K.
Author-X-Name-Last: Mallick
Title: A Powerful Bayesian Test for Equality of Means in High Dimensions
Abstract:
We develop a Bayes factor-based testing procedure for comparing two population means in high-dimensional settings. In “large-p-small-n” settings, Bayes factors based on proper priors require eliciting a large and complex p × p covariance matrix, whereas Bayes factors based on Jeffreys’ prior suffer the same impediment as the classical Hotelling T2 test statistic as they involve inversion of ill-formed sample covariance matrices. To circumvent this limitation, we propose that the Bayes factor be based on lower dimensional random projections of the high-dimensional data vectors. We choose the prior under the alternative to maximize the power of the test for a fixed threshold level, yielding a restricted most powerful Bayesian test (RMPBT). The final test statistic is based on the ensemble of Bayes factors corresponding to multiple replications of randomly projected data. We show that the test is unbiased and, under mild conditions, is also locally consistent. We demonstrate the efficacy of the approach through simulated and real data examples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1733-1741
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1371024
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1371024
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1733-1741
Template-Type: ReDIF-Article 1.0
Author-Name: David Rossell
Author-X-Name-First: David
Author-X-Name-Last: Rossell
Author-Name: Francisco J. Rubio
Author-X-Name-First: Francisco J.
Author-X-Name-Last: Rubio
Title: Tractable Bayesian Variable Selection: Beyond Normality
Abstract:
Bayesian variable selection often assumes normality, but the effects of model misspecification are not sufficiently understood. There are sound reasons behind this assumption, particularly for large p: ease of interpretation, analytical, and computational convenience. More flexible frameworks exist, including semi- or nonparametric models, often at the cost of some tractability. We propose a simple extension that allows for skewness and thicker-than-normal tails but preserves tractability. It leads to easy interpretation and a log-concave likelihood that facilitates optimization and integration. We asymptotically characterize parameter estimation and Bayes factor rates under certain model misspecification. Under suitable conditions, misspecified Bayes factors induce sparsity at the same rates as under the correct model. However, the rates to detect signal change by an exponential factor, often reducing sensitivity. These deficiencies can be ameliorated by inferring the error distribution, a simple strategy that can improve inference substantially. Our work focuses on the likelihood and can be combined with any likelihood penalty or prior, but here we focus on nonlocal priors to induce extra sparsity and ameliorate finite-sample effects caused by misspecification. We show the importance of considering the likelihood, rather than solely the prior, for Bayesian variable selection. The methodology is in R package ‘mombf.’ Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1742-1758
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1371025
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1371025
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1742-1758
Template-Type: ReDIF-Article 1.0
Author-Name: Francis K. C. Hui
Author-X-Name-First: Francis K. C.
Author-X-Name-Last: Hui
Author-Name: Samuel Müller
Author-X-Name-First: Samuel
Author-X-Name-Last: Müller
Author-Name: A. H. Welsh
Author-X-Name-First: A. H.
Author-X-Name-Last: Welsh
Title: Sparse Pairwise Likelihood Estimation for Multivariate Longitudinal Mixed Models
Abstract:
It is becoming increasingly common in longitudinal studies to collect and analyze data on multiple responses. For example, in the social sciences we may be interested in uncovering the factors driving mental health of individuals over time, where mental health is measured using a set of questionnaire items. One approach to analyzing such multi-dimensional data is multivariate mixed models, an extension of the standard univariate mixed model to handle multiple responses. Estimating multivariate mixed models presents a considerable challenge, however, let alone performing variable selection to uncover which covariates are important in driving each response. Motivated by composite likelihood ideas, we propose a new approach for estimation and fixed effects selection in multivariate mixed models, called approximate pairwise likelihood estimation and shrinkage (APLES). The method works by constructing a quadratic approximation to each term in the pairwise likelihood function, and then augmenting this approximate pairwise likelihood with a penalty that encourages both individual and group coefficient sparsity. This leads to a relatively fast method of selection, as we can use coordinate ascent type methods to construct the full regularization path for the model. Our method is the first to extend penalized likelihood estimation to multivariate generalized linear mixed models. We show that the APLES estimator attains a composite likelihood version of the oracle property. We propose a new information criterion for selecting the tuning parameter, which employs a dynamic model complexity penalty to facilitate aggressive shrinkage, and demonstrate that it asymptotically leads to selection consistency, that is, leads to the true model being selected. A simulation study demonstrates that the APLES estimator outperforms several univariate selection methods based on analyzing each outcome separately. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1759-1769
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1371026
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1371026
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1759-1769
Template-Type: ReDIF-Article 1.0
Author-Name: Denis Rybin
Author-X-Name-First: Denis
Author-X-Name-Last: Rybin
Author-Name: Robert Lew
Author-X-Name-First: Robert
Author-X-Name-Last: Lew
Author-Name: Michael J. Pencina
Author-X-Name-First: Michael J.
Author-X-Name-Last: Pencina
Author-Name: Maurizio Fava
Author-X-Name-First: Maurizio
Author-X-Name-Last: Fava
Author-Name: Gheorghe Doros
Author-X-Name-First: Gheorghe
Author-X-Name-Last: Doros
Title: Placebo Response as a Latent Characteristic: Application to Analysis of Sequential Parallel Comparison Design Studies
Abstract:
In clinical trials, placebo response can affect the inference about efficacy of the studied treatment. It is important to have a robust way to classify trial subjects with respect to their response to placebo. Simple, criterion-based classification may lead to classification error and bias the inference. The uncertainty about placebo response characteristic has to be factored into the treatment effect estimation. We propose a novel approach that views the placebo response as a latent characteristic and the study sample as an unlabeled mixture of “placebo responders” and “placebo nonresponders.” The likelihood-based methodology is used to estimate the treatment effect corrected for placebo response as defined within sequential parallel comparison design.
Journal: Journal of the American Statistical Association
Pages: 1411-1430
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1375930
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1375930
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1411-1430
Template-Type: ReDIF-Article 1.0
Author-Name: Ruth Heller
Author-X-Name-First: Ruth
Author-X-Name-Last: Heller
Author-Name: Nilanjan Chatterjee
Author-X-Name-First: Nilanjan
Author-X-Name-Last: Chatterjee
Author-Name: Abba Krieger
Author-X-Name-First: Abba
Author-X-Name-Last: Krieger
Author-Name: Jianxin Shi
Author-X-Name-First: Jianxin
Author-X-Name-Last: Shi
Title: Post-Selection Inference Following Aggregate Level Hypothesis Testing in Large-Scale Genomic Data
Abstract:
In many genomic applications, hypothesis tests are performed for powerful identification of signals by aggregating test-statistics across units within naturally defined classes. Following class-level testing, it is naturally of interest to identify the lower level units which contain true signals. Testing the individual units within a class without taking into account the fact that the class was selected using an aggregate-level test-statistic will produce biased inference. We develop a hypothesis testing framework that guarantees control of false positive rates conditional on the fact that the class was selected. Specifically, we develop procedures for calculating unit level p-values that allow rejection of null hypotheses controlling for two types of conditional error rates, one relating to the family-wise error rate and the other relating to the false discovery rate. We use simulation studies to illustrate validity and power of the proposed procedure in comparison to several possible alternatives. We illustrate the power of the method in a natural application involving whole-genome expression quantitative trait loci (eQTL) analysis across 17 tissue types using data from The Cancer Genome Atlas (TCGA) Project. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1770-1783
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1375933
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1375933
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1770-1783
Template-Type: ReDIF-Article 1.0
Author-Name: Federico A. Bugni
Author-X-Name-First: Federico A.
Author-X-Name-Last: Bugni
Author-Name: Ivan A. Canay
Author-X-Name-First: Ivan A.
Author-X-Name-Last: Canay
Author-Name: Azeem M. Shaikh
Author-X-Name-First: Azeem M.
Author-X-Name-Last: Shaikh
Title: Inference Under Covariate-Adaptive Randomization
Abstract:
This article studies inference for the average treatment effect in randomized controlled trials with covariate-adaptive randomization. Here, by covariate-adaptive randomization, we mean randomization schemes that first stratify according to baseline covariates and then assign treatment status so as to achieve “balance” within each stratum. Our main requirement is that the randomization scheme assigns treatment status within each stratum so that the fraction of units being assigned to treatment within each stratum has a well-behaved distribution centered around a proportion π as the sample size tends to infinity. Such schemes include, for example, Efron’s biased-coin design and stratified block randomization. When testing the null hypothesis that the average treatment effect equals a prespecified value in such settings, we first show that the usual two-sample t-test is conservative in the sense that it has limiting rejection probability under the null hypothesis no greater than, and typically strictly less than, the nominal level. We show, however, that a simple adjustment to the usual standard error of the two-sample t-test leads to a test that is exact in the sense that its limiting rejection probability under the null hypothesis equals the nominal level. Next, we consider the usual t-test (on the coefficient on treatment assignment) in a linear regression of outcomes on treatment assignment and indicators for each of the strata. We show that this test is exact for the important special case of randomization schemes with $\pi = \frac{1}{2}$, but is otherwise conservative. We again provide a simple adjustment to the standard errors that yields an exact test more generally. Finally, we study the behavior of a modified version of a permutation test, which we refer to as the covariate-adaptive permutation test, that only permutes treatment status for units within the same stratum.
When applied to the usual two-sample t-statistic, we show that this test is exact for randomization schemes with $\pi = \frac{1}{2}$ that additionally achieve what we refer to as “strong balance.” For randomization schemes with $\pi \neq \frac{1}{2}$, this test may have limiting rejection probability under the null hypothesis strictly greater than the nominal level. When applied to a suitably adjusted version of the two-sample t-statistic, however, we show that this test is exact for all randomization schemes that achieve “strong balance,” including those with $\pi \neq \frac{1}{2}$. A simulation study confirms the practical relevance of our theoretical results. We conclude with recommendations for empirical practice and an empirical illustration. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1784-1796
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1375934
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1375934
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1784-1796
Template-Type: ReDIF-Article 1.0
Author-Name: Chenglong Ye
Author-X-Name-First: Chenglong
Author-X-Name-Last: Ye
Author-Name: Yi Yang
Author-X-Name-First: Yi
Author-X-Name-Last: Yang
Author-Name: Yuhong Yang
Author-X-Name-First: Yuhong
Author-X-Name-Last: Yang
Title: Sparsity Oriented Importance Learning for High-Dimensional Linear Regression
Abstract:
With model selection uncertainty now well recognized as nonnegligible, data analysts should no longer be satisfied with the output of a single final model from a model selection process, regardless of its sophistication. To improve reliability and reproducibility in model choice, one constructive approach is to make good use of a sound variable importance measure. Although interesting importance measures are available and increasingly used in data analysis, little theoretical justification has been provided for them. In this article, we propose a new variable importance measure, sparsity oriented importance learning (SOIL), for high-dimensional regression from a sparse linear modeling perspective, taking into account variable selection uncertainty via the use of a sensible model weighting. The SOIL method is theoretically shown to have the inclusion/exclusion property: when the model weights are properly concentrated around the true model, the SOIL importance can well separate the variables in the true model from the rest. In particular, even if the signal is weak, SOIL rarely gives variables not in the true model significantly higher importance values than those in the true model. Extensive simulations in several illustrative settings and real-data examples with guided simulations show desirable properties of the SOIL importance in contrast to other importance measures. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1797-1812
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1377080
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1377080
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1797-1812
Template-Type: ReDIF-Article 1.0
Author-Name: Yacouba Boubacar Maïnassara
Author-X-Name-First: Yacouba
Author-X-Name-Last: Boubacar Maïnassara
Author-Name: Bruno Saussereau
Author-X-Name-First: Bruno
Author-X-Name-Last: Saussereau
Title: Diagnostic Checking in Multivariate ARMA Models With Dependent Errors Using Normalized Residual Autocorrelations
Abstract:
In this paper, we derive the asymptotic distribution of normalized residual empirical autocovariances and autocorrelations under weak assumptions on the noise. We propose new portmanteau statistics for vector autoregressive moving average models with uncorrelated but nonindependent innovations by using a self-normalization approach. We establish the asymptotic distribution of the proposed statistics. This asymptotic distribution is quite different from the usual chi-squared approximation used under the independent and identically distributed assumption on the noise, or the weighted sum of independent chi-squared random variables obtained under nonindependent innovations. A set of Monte Carlo experiments and an application to the daily returns of the CAC40 are presented. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1813-1827
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1380030
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1380030
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1813-1827
Template-Type: ReDIF-Article 1.0
Author-Name: Albert E. Parker
Author-X-Name-First: Albert E.
Author-X-Name-Last: Parker
Author-Name: Betsey Pitts
Author-X-Name-First: Betsey
Author-X-Name-Last: Pitts
Author-Name: Lindsey Lorenz
Author-X-Name-First: Lindsey
Author-X-Name-Last: Lorenz
Author-Name: Philip S. Stewart
Author-X-Name-First: Philip S.
Author-X-Name-Last: Stewart
Title: Polynomial Accelerated Solutions to a Large Gaussian Model for Imaging Biofilms: In Theory and Finite Precision
Abstract:
Three-dimensional confocal scanning laser microscope images offer dramatic visualizations of living biofilms before and after interventions. Here, we use confocal microscopy to study the effect of a treatment over time that causes a biofilm to swell and contract due to osmotic pressure changes. From these data (the video is provided in the supplementary materials), our goal is to reconstruct biofilm surfaces, to estimate the effect of the treatment on the biofilm’s volume, and to quantify the related uncertainties. We formulate the associated massive linear Bayesian inverse problem and then solve it using iterative samplers from large multivariate Gaussians that exploit well-established polynomial acceleration techniques from numerical linear algebra. Because of a general equivalence with linear solvers, these polynomial accelerated iterative samplers have known convergence rates, stopping criteria, and perform well in finite precision. An explicit algorithm is provided, for the first time, for an iterative sampler that is accelerated by the synergistic implementation of preconditioned conjugate gradient and Chebyshev polynomials. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1431-1442
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1409121
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1409121
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1431-1442
Template-Type: ReDIF-Article 1.0
Author-Name: Simon Mak
Author-X-Name-First: Simon
Author-X-Name-Last: Mak
Author-Name: Chih-Li Sung
Author-X-Name-First: Chih-Li
Author-X-Name-Last: Sung
Author-Name: Xingjian Wang
Author-X-Name-First: Xingjian
Author-X-Name-Last: Wang
Author-Name: Shiang-Ting Yeh
Author-X-Name-First: Shiang-Ting
Author-X-Name-Last: Yeh
Author-Name: Yu-Hung Chang
Author-X-Name-First: Yu-Hung
Author-X-Name-Last: Chang
Author-Name: V. Roshan Joseph
Author-X-Name-First: V. Roshan
Author-X-Name-Last: Joseph
Author-Name: Vigor Yang
Author-X-Name-First: Vigor
Author-X-Name-Last: Yang
Author-Name: C. F. Jeff Wu
Author-X-Name-First: C. F. Jeff
Author-X-Name-Last: Wu
Title: An Efficient Surrogate Model for Emulation and Physics Extraction of Large Eddy Simulations
Abstract:
In the quest for advanced propulsion and power-generation systems, high-fidelity simulations are too computationally expensive to survey the desired design space, and a new design methodology is needed that combines engineering physics, computer simulations, and statistical modeling. In this article, we propose a new surrogate model that provides efficient prediction and uncertainty quantification of turbulent flows in swirl injectors with varying geometries, devices commonly used in many engineering applications. The novelty of the proposed method lies in the incorporation of known physical properties of the fluid flow as simplifying assumptions for the statistical model. In view of the massive simulation data at hand, which are on the order of hundreds of gigabytes, these assumptions allow for accurate flow predictions in around an hour of computation time. In contrast, existing flow emulators that forgo such simplifications may require more computation time for training and prediction than is needed to conduct the simulation itself. Moreover, by accounting for coupling mechanisms between flow variables, the proposed model can jointly reduce prediction uncertainty and extract useful flow physics, which can then be used to guide further investigations. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1443-1456
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1409123
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1409123
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1443-1456
Template-Type: ReDIF-Article 1.0
Author-Name: Bhuvanesh Pareek
Author-X-Name-First: Bhuvanesh
Author-X-Name-Last: Pareek
Author-Name: Pulak Ghosh
Author-X-Name-First: Pulak
Author-X-Name-Last: Ghosh
Author-Name: Hugh N. Wilson
Author-X-Name-First: Hugh N.
Author-X-Name-Last: Wilson
Author-Name: Emma K. Macdonald
Author-X-Name-First: Emma K.
Author-X-Name-Last: Macdonald
Author-Name: Paul Baines
Author-X-Name-First: Paul
Author-X-Name-Last: Baines
Title: Tracking the Impact of Media on Voter Choice in Real Time: A Bayesian Dynamic Joint Model
Abstract:
Commonly used methods of evaluating the impact of marketing communications during political elections struggle to account for respondents’ exposures to these communications due to the problems associated with recall bias. In addition, they completely fail to account for the impact of mediated or earned communications, such as newspaper articles or television news, that are typically not within the control of the advertising party, nor are they effectively able to monitor consumers’ perceptual responses over time. This study, based on a new data collection technique using cell-phone text messaging (called real-time experience tracking, or RET), offers the potential to address these weaknesses. We propose an RET-based model of the impact of communications and apply it to a unique choice situation: voting behavior during the 2010 UK general election, which was dominated by three political parties. We develop a Bayesian zero-inflated dynamic multinomial choice model that enables joint modeling of the interplay and dynamics of the individual voter's choice intentions over time, the actual vote, and the heterogeneity in exposure to marketing communications over time. Results reveal the differential impact over time of paid and earned media, demonstrate a synergy between the two, and show the particular importance of exposure valence and not just frequency, contrary to the predominant practitioner emphasis on share-of-voice metrics. Results also suggest that while earned media have a diminishing impact on voting intentions as the final choice approaches, their valence continues to influence the final vote: a difference between drivers of intentions and behavior that implies that exposure valence remains critically important close to the final brand choice. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1457-1475
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1419134
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1419134
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1457-1475
Template-Type: ReDIF-Article 1.0
Author-Name: Xueying Tang
Author-X-Name-First: Xueying
Author-X-Name-Last: Tang
Author-Name: Malay Ghosh
Author-X-Name-First: Malay
Author-X-Name-Last: Ghosh
Author-Name: Neung Soo Ha
Author-X-Name-First: Neung Soo
Author-X-Name-Last: Ha
Author-Name: Joseph Sedransk
Author-X-Name-First: Joseph
Author-X-Name-Last: Sedransk
Title: Modeling Random Effects Using Global–Local Shrinkage Priors in Small Area Estimation
Abstract:
Small area estimation is becoming increasingly popular among survey statisticians. One very important program is Small Area Income and Poverty Estimation, undertaken by the United States Bureau of the Census, which aims at providing estimates related to income and poverty based on American Community Survey data at the state level and even at lower levels of geography. This article introduces global–local (GL) shrinkage priors for random effects in small area estimation to capture wide area-level variation when the number of small areas is very large. These priors employ two levels of parameters, global and local, to express the variances of area-specific random effects so that both small and large random effects can be captured properly. We show via simulations and data analysis that use of the GL priors can improve estimation results in most cases. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1476-1489
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2017.1419135
File-URL: http://hdl.handle.net/10.1080/01621459.2017.1419135
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1476-1489
Template-Type: ReDIF-Article 1.0
Author-Name: Alexander D. Bolton
Author-X-Name-First: Alexander D.
Author-X-Name-Last: Bolton
Author-Name: Nicholas A. Heard
Author-X-Name-First: Nicholas A.
Author-X-Name-Last: Heard
Title: Malware Family Discovery Using Reversible Jump MCMC Sampling of Regimes
Abstract:
Malware is computer software that has either been designed or modified with malicious intent. Hundreds of thousands of new malware threats appear on the internet each day. This is made possible through reuse of known exploits in computer systems that have not been fully eradicated; existing pieces of malware can be trivially modified and combined to create new malware, which is unknown to anti-virus programs. Finding new software with similarities to known malware is therefore an important goal in cyber-security. A dynamic instruction trace of a piece of software is the sequence of machine language instructions it generates when executed. Statistical analysis of a dynamic instruction trace can help reverse engineers infer the purpose and origin of the software that generated it. Instruction traces have been successfully modeled as simple Markov chains, but empirically there are change points in the structure of the traces, with recurring regimes of transition patterns. Here, reversible jump Markov chain Monte Carlo for change point detection is extended to incorporate regime-switching, allowing regimes to be inferred from malware instruction traces. A similarity measure for malware programs based on regime matching is then used to infer the originating families, leading to compelling performance results.
Journal: Journal of the American Statistical Association
Pages: 1490-1502
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2018.1423984
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1423984
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1490-1502
Template-Type: ReDIF-Article 1.0
Author-Name: Gen Li
Author-X-Name-First: Gen
Author-X-Name-Last: Li
Author-Name: Jianhua Z. Huang
Author-X-Name-First: Jianhua Z.
Author-X-Name-Last: Huang
Author-Name: Haipeng Shen
Author-X-Name-First: Haipeng
Author-X-Name-Last: Shen
Title: To Wait or Not to Wait: Two-Way Functional Hazards Model for Understanding Waiting in Call Centers
Abstract:
Telephone call centers offer a convenient communication channel between businesses and their customers. Efficient management of call centers requires accurate modeling of customer waiting behavior, which contains important information about customer patience (how long a customer is willing to wait) and service quality (how long a customer needs to wait to get served). Hazard functions offer dynamic characterization of customer waiting behavior and provide critical inputs for agent scheduling. Motivated by this application, we develop a two-way functional hazards (tF-Hazards) model to study customer waiting behavior as a function of two timescales: waiting duration and the time of day that a customer calls in. The model stems from a two-way piecewise constant hazard function, and imposes low-rank structure and smoothness on the hazard rates to enhance interpretability. We exploit an alternating direction method of multipliers algorithm to optimize a penalized likelihood function of the model. We carefully analyze the data from a U.S. Bank call center, and provide informative insights about customer patience and service quality patterns along waiting time and across different times of day. The findings provide primitive inputs for call center agent staffing and scheduling, as well as for call center practitioners to understand the effect of system protocols on customer waiting behavior. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1503-1514
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2018.1423985
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1423985
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1503-1514
Template-Type: ReDIF-Article 1.0
Author-Name: Abhra Sarkar
Author-X-Name-First: Abhra
Author-X-Name-Last: Sarkar
Author-Name: Jonathan Chabout
Author-X-Name-First: Jonathan
Author-X-Name-Last: Chabout
Author-Name: Joshua Jones Macopson
Author-X-Name-First: Joshua Jones
Author-X-Name-Last: Macopson
Author-Name: Erich D. Jarvis
Author-X-Name-First: Erich D.
Author-X-Name-Last: Jarvis
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Bayesian Semiparametric Mixed Effects Markov Models With Application to Vocalization Syntax
Abstract:
Studying the neurological, genetic, and evolutionary basis of human vocal communication mechanisms using animal vocalization models is an important field of neuroscience. The datasets typically comprise structured sequences of syllables or “songs” produced by animals from different genotypes under different social contexts. It has been difficult to come up with sophisticated statistical methods that appropriately model animal vocal communication syntax. We address this need by developing a novel Bayesian semiparametric framework for inference in such datasets. Our approach is built on a novel class of mixed effects Markov transition models for the songs that accommodate exogenous influences of genotype and context as well as animal-specific heterogeneity. Crucial advantages of the proposed approach include its ability to provide insights into key scientific queries related to global and local influences of the exogenous predictors on the transition dynamics via automated tests of hypotheses. The methodology is illustrated using simulation experiments and the aforementioned motivating application in neuroscience. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1515-1527
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2018.1423986
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1423986
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1515-1527
Template-Type: ReDIF-Article 1.0
Author-Name: Yingbo Li
Author-X-Name-First: Yingbo
Author-X-Name-Last: Li
Author-Name: Merlise A. Clyde
Author-X-Name-First: Merlise A.
Author-X-Name-Last: Clyde
Title: Mixtures of g-Priors in Generalized Linear Models
Abstract:
Mixtures of Zellner’s g-priors have been studied extensively in linear models and have been shown to have numerous desirable properties for Bayesian variable selection and model averaging. Several extensions of g-priors to generalized linear models (GLMs) have been proposed in the literature; however, the choice of prior distribution of g and the resulting properties for inference have received considerably less attention. In this article, we unify mixtures of g-priors in GLMs by assigning the truncated Compound Confluent Hypergeometric (tCCH) distribution to 1/(1 + g), which encompasses as special cases several mixtures of g-priors in the literature, such as the hyper-g, Beta-prime, truncated Gamma, incomplete inverse-Gamma, benchmark, robust, hyper-g/n, and intrinsic priors. Through an integrated Laplace approximation, the posterior distribution of 1/(1 + g) is in turn a tCCH distribution, and approximate marginal likelihoods are thus available analytically, leading to “Compound Hypergeometric Information Criteria” for model selection. We discuss the local geometric properties of the g-prior in GLMs and show how the desiderata for model selection proposed by Bayarri et al., such as asymptotic model selection consistency, intrinsic consistency, and measurement invariance, may be used to justify the prior and specific choices of the hyperparameters. We illustrate inference using these priors and contrast them to other approaches via simulation and real data examples. The methodology is implemented in the R package BAS and freely available on CRAN. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1828-1845
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2018.1469992
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1469992
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1828-1845
Template-Type: ReDIF-Article 1.0
Author-Name: Cheng-Han Yu
Author-X-Name-First: Cheng-Han
Author-X-Name-Last: Yu
Author-Name: Raquel Prado
Author-X-Name-First: Raquel
Author-X-Name-Last: Prado
Author-Name: Hernando Ombao
Author-X-Name-First: Hernando
Author-X-Name-Last: Ombao
Author-Name: Daniel Rowe
Author-X-Name-First: Daniel
Author-X-Name-Last: Rowe
Title: A Bayesian Variable Selection Approach Yields Improved Detection of Brain Activation From Complex-Valued fMRI
Abstract:
Voxel functional magnetic resonance imaging (fMRI) time courses are complex-valued signals giving rise to magnitude and phase data. Nevertheless, most studies use only the magnitude signals and thus discard half of the data that could potentially contain important information. Methods that make use of complex-valued fMRI (CV-fMRI) data have been shown to lead to superior power in detecting active voxels when compared to magnitude-only methods, particularly for small signal-to-noise ratios (SNRs). We present a new Bayesian variable selection approach for detecting brain activation at the voxel level from CV-fMRI data. We develop models with complex-valued spike-and-slab priors on the activation parameters that are able to combine the magnitude and phase information. We present a complex-valued EM variable selection algorithm that leads to fast detection at the voxel level in CV-fMRI slices and also consider full posterior inference via Markov chain Monte Carlo (MCMC). Model performance is illustrated through extensive simulation studies, including the analysis of physically based simulated CV-fMRI slices. Finally, we use the complex-valued Bayesian approach to detect active voxels in human CV-fMRI from a healthy individual who performed unilateral finger tapping in a designed experiment. The proposed approach leads to improved detection of activation in the expected motor-related brain regions and produces fewer false positive results than other methods for CV-fMRI. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1395-1410
Issue: 524
Volume: 113
Year: 2018
Month: 10
X-DOI: 10.1080/01621459.2018.1476244
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1476244
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:113:y:2018:i:524:p:1395-1410
Template-Type: ReDIF-Article 1.0
Author-Name: Daniela Castro-Camilo
Author-X-Name-First: Daniela
Author-X-Name-Last: Castro-Camilo
Author-Name: Raphaël Huser
Author-X-Name-First: Raphaël
Author-X-Name-Last: Huser
Title: Local Likelihood Estimation of Complex Tail Dependence Structures, Applied to U.S. Precipitation Extremes
Abstract:
To disentangle the complex nonstationary dependence structure of precipitation extremes over the entire contiguous United States (U.S.), we propose a flexible local approach based on factor copula models. Our subasymptotic spatial modeling framework yields nontrivial tail dependence structures, with a weakening dependence strength as events become more extreme; a feature commonly observed with precipitation data but not accounted for in classical asymptotic extreme-value models. To estimate the local extremal behavior, we fit the proposed model in small regional neighborhoods to high threshold exceedances, under the assumption of local stationarity, which allows us to gain in flexibility. By adopting a local censored likelihood approach, we make inference on a fine spatial grid, and we perform local estimation by taking advantage of distributed computing resources and the embarrassingly parallel nature of this estimation procedure. The local model is efficiently fitted at all grid points, and uncertainty is measured using a block bootstrap procedure. We carry out an extensive simulation study to show that our approach can adequately capture complex, nonstationary dependencies. In addition, our study of U.S. winter precipitation data reveals interesting differences in local tail structures over space, which have important implications for regional risk assessment of extreme precipitation events. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1037-1054
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1647842
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1647842
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1037-1054
Template-Type: ReDIF-Article 1.0
Author-Name: Douglas R. Wilson
Author-X-Name-First: Douglas R.
Author-X-Name-Last: Wilson
Author-Name: Chong Jin
Author-X-Name-First: Chong
Author-X-Name-Last: Jin
Author-Name: Joseph G. Ibrahim
Author-X-Name-First: Joseph G.
Author-X-Name-Last: Ibrahim
Author-Name: Wei Sun
Author-X-Name-First: Wei
Author-X-Name-Last: Sun
Title: ICeD-T Provides Accurate Estimates of Immune Cell Abundance in Tumor Samples by Allowing for Aberrant Gene Expression Patterns
Abstract:
Immunotherapies have attracted substantial research interest recently. The need to understand the underlying mechanisms of immunotherapies and to develop precision immunotherapy regimens has spurred great interest in characterizing immune cell composition within the tumor microenvironment. Several methods have been developed to estimate immune cell composition using gene expression data from bulk tumor samples. However, these methods are not flexible enough to handle aberrant patterns of gene expression data, for example, inconsistent cell type-specific gene expression between purified reference samples and tumor samples. We propose a novel statistical method for expression deconvolution called immune cell deconvolution in tumor tissues (ICeD-T). ICeD-T automatically identifies aberrant genes whose expression is inconsistent with the deconvolution model and down-weights their contributions to cell type abundance estimates. We evaluated the performance of ICeD-T versus existing methods in simulation studies and several real data analyses. ICeD-T displayed comparable or superior performance to these competing methods. Applying these methods to assess the relationship between immunotherapy response and immune cell composition, ICeD-T is able to identify significant associations that are missed by its competitors. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1055-1065
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1654874
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1654874
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1055-1065
Template-Type: ReDIF-Article 1.0
Author-Name: Qian Guan
Author-X-Name-First: Qian
Author-X-Name-Last: Guan
Author-Name: Brian J. Reich
Author-X-Name-First: Brian J.
Author-X-Name-Last: Reich
Author-Name: Eric B. Laber
Author-X-Name-First: Eric B.
Author-X-Name-Last: Laber
Author-Name: Dipankar Bandyopadhyay
Author-X-Name-First: Dipankar
Author-X-Name-Last: Bandyopadhyay
Title: Bayesian Nonparametric Policy Search With Application to Periodontal Recall Intervals
Abstract:
Tooth loss from periodontal disease is a major public health burden in the United States. Standard clinical practice is to recommend a dental visit every six months; however, this practice is not evidence-based, and poor dental outcomes and increasing dental insurance premiums indicate room for improvement. We consider a tailored approach that recommends recall time based on patient characteristics and medical history to minimize disease progression without increasing resource expenditures. We formalize this method as a dynamic treatment regime which comprises a sequence of decisions, one per stage of intervention, that follow a decision rule which maps current patient information to a recommendation for their next visit time. The dynamics of periodontal health, visit frequency, and patient compliance are complex, yet the estimated optimal regime must be interpretable to domain experts if it is to be integrated into clinical practice. We combine nonparametric Bayesian dynamics modeling with policy-search algorithms to estimate the optimal dynamic treatment regime within an interpretable class of regimes. Both simulation experiments and application to a rich database of electronic dental records from the HealthPartners HMO show that our proposed method leads to better dental health without increasing the average recommended recall time relative to competing methods. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1066-1078
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1660169
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1660169
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1066-1078
Template-Type: ReDIF-Article 1.0
Author-Name: Ryan Sun
Author-X-Name-First: Ryan
Author-X-Name-Last: Sun
Author-Name: Xihong Lin
Author-X-Name-First: Xihong
Author-X-Name-Last: Lin
Title: Genetic Variant Set-Based Tests Using the Generalized Berk–Jones Statistic With Application to a Genome-Wide Association Study of Breast Cancer
Abstract:
Studying the effects of groups of single nucleotide polymorphisms (SNPs), as in a gene, genetic pathway, or network, can provide novel insight into complex diseases such as breast cancer, uncovering new genetic associations and augmenting the information that can be gleaned from studying SNPs individually. Common challenges in set-based genetic association testing include weak effect sizes, correlation between SNPs in a SNP-set, and scarcity of signals, with individual SNP effects often ranging from extremely sparse to moderately sparse in number. Motivated by these challenges, we propose the Generalized Berk–Jones (GBJ) test for the association between a SNP-set and outcome. The GBJ extends the Berk–Jones statistic by accounting for correlation among SNPs, and it provides advantages over the Generalized Higher Criticism test when signals in a SNP-set are moderately sparse. We also provide an analytic p-value calculation for SNP-sets of any finite size, and we develop an omnibus statistic that is robust to the degree of signal sparsity. An additional advantage of our work is the ability to conduct inference using individual SNP summary statistics from a genome-wide association study (GWAS). We evaluate the finite sample performance of the GBJ through simulation and apply the method to identify breast cancer risk genes in a GWAS conducted by the Cancer Genetic Markers of Susceptibility Consortium. Our results suggest evidence of association between FGFR2 and breast cancer and also identify other potential susceptibility genes, complementing conventional SNP-level analysis. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1079-1091
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1660170
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1660170
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1079-1091
Template-Type: ReDIF-Article 1.0
Author-Name: Kenichiro McAlinn
Author-X-Name-First: Kenichiro
Author-X-Name-Last: McAlinn
Author-Name: Knut Are Aastveit
Author-X-Name-First: Knut Are
Author-X-Name-Last: Aastveit
Author-Name: Jouchi Nakajima
Author-X-Name-First: Jouchi
Author-X-Name-Last: Nakajima
Author-Name: Mike West
Author-X-Name-First: Mike
Author-X-Name-Last: West
Title: Multivariate Bayesian Predictive Synthesis in Macroeconomic Forecasting
Abstract:
We present new methodology and a case study in use of a class of Bayesian predictive synthesis (BPS) models for multivariate time series forecasting. This extends the foundational BPS framework to the multivariate setting, with detailed application in the topical and challenging context of multistep macroeconomic forecasting in a monetary policy setting. BPS evaluates—sequentially and adaptively over time—varying forecast biases and facets of miscalibration of individual forecast densities for multiple time series, and—critically—their time-varying inter-dependencies. We define BPS methodology for a new class of dynamic multivariate latent factor models implied by BPS theory. Structured dynamic latent factor BPS is here motivated by the application context—sequential forecasting of multiple U.S. macroeconomic time series with forecasts generated from several traditional econometric time series models. The case study highlights the potential of BPS to improve forecasts of multiple series at multiple forecast horizons, and its use in learning dynamic relationships among forecasting models or agents. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1092-1110
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1660171
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1660171
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1092-1110
Template-Type: ReDIF-Article 1.0
Author-Name: Yawen Guan
Author-X-Name-First: Yawen
Author-X-Name-Last: Guan
Author-Name: Margaret C. Johnson
Author-X-Name-First: Margaret C.
Author-X-Name-Last: Johnson
Author-Name: Matthias Katzfuss
Author-X-Name-First: Matthias
Author-X-Name-Last: Katzfuss
Author-Name: Elizabeth Mannshardt
Author-X-Name-First: Elizabeth
Author-X-Name-Last: Mannshardt
Author-Name: Kyle P. Messier
Author-X-Name-First: Kyle P.
Author-X-Name-Last: Messier
Author-Name: Brian J. Reich
Author-X-Name-First: Brian J.
Author-X-Name-Last: Reich
Author-Name: Joon J. Song
Author-X-Name-First: Joon J.
Author-X-Name-Last: Song
Title: Fine-Scale Spatiotemporal Air Pollution Analysis Using Mobile Monitors on Google Street View Vehicles
Abstract:
People are increasingly concerned with understanding their personal environment, including possible exposure to harmful air pollutants. To make informed decisions on their day-to-day activities, they are interested in real-time information on a localized scale. Publicly available, fine-scale, high-quality air pollution measurements acquired using mobile monitors represent a paradigm shift in measurement technologies. A methodological framework utilizing these increasingly fine-scale measurements to provide real-time air pollution maps and short-term air quality forecasts on a fine-resolution spatial scale could prove to be instrumental in increasing public awareness and understanding. The Google Street View study provides a unique source of data with spatial and temporal complexities, with the potential to provide information about commuter exposure and hot spots within city streets with high traffic. We develop a computationally efficient spatiotemporal model for these data and use the model to make short-term forecasts and high-resolution maps of current air pollution levels. We also show via an experiment that mobile networks can provide more nuanced information than an equally sized fixed-location network. This modeling framework has important real-world implications in understanding citizens’ personal environments, as data production and real-time availability continue to be driven by the ongoing development and improvement of mobile measurement technologies. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1111-1124
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1665526
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1665526
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1111-1124
Template-Type: ReDIF-Article 1.0
Author-Name: Naim U. Rashid
Author-X-Name-First: Naim U.
Author-X-Name-Last: Rashid
Author-Name: Quefeng Li
Author-X-Name-First: Quefeng
Author-X-Name-Last: Li
Author-Name: Jen Jen Yeh
Author-X-Name-First: Jen Jen
Author-X-Name-Last: Yeh
Author-Name: Joseph G. Ibrahim
Author-X-Name-First: Joseph G.
Author-X-Name-Last: Ibrahim
Title: Modeling Between-Study Heterogeneity for Improved Replicability in Gene Signature Selection and Clinical Prediction
Abstract:
In the genomic era, the identification of gene signatures associated with disease is of significant interest. Such signatures are often used to predict clinical outcomes in new patients and aid clinical decision-making. However, recent studies have shown that gene signatures are often not replicable. This occurrence has practical implications regarding the generalizability and clinical applicability of such signatures. To improve replicability, we introduce a novel approach to select gene signatures from multiple datasets whose effects are consistently nonzero and account for between-study heterogeneity. We build our model upon some rank-based quantities, facilitating integration over different genomic datasets. A high-dimensional penalized generalized linear mixed model is used to select gene signatures and address data heterogeneity. We compare our method to some commonly used strategies that select gene signatures ignoring between-study heterogeneity. We provide asymptotic results justifying the performance of our method and demonstrate its advantage in the presence of heterogeneity through thorough simulation studies. Lastly, we motivate our method through a case study subtyping pancreatic cancer patients from four gene expression studies. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1125-1138
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1671197
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1671197
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1125-1138
Template-Type: ReDIF-Article 1.0
Author-Name: Lorin Crawford
Author-X-Name-First: Lorin
Author-X-Name-Last: Crawford
Author-Name: Anthea Monod
Author-X-Name-First: Anthea
Author-X-Name-Last: Monod
Author-Name: Andrew X. Chen
Author-X-Name-First: Andrew X.
Author-X-Name-Last: Chen
Author-Name: Sayan Mukherjee
Author-X-Name-First: Sayan
Author-X-Name-Last: Mukherjee
Author-Name: Raúl Rabadán
Author-X-Name-First: Raúl
Author-X-Name-Last: Rabadán
Title: Predicting Clinical Outcomes in Glioblastoma: An Application of Topological and Functional Data Analysis
Abstract:
Glioblastoma multiforme (GBM) is an aggressive form of human brain cancer that is under active study in the field of cancer biology. Its rapid progression and the relative time cost of obtaining molecular data make other readily available forms of data, such as images, an important resource for actionable measures in patients. Our goal is to use information given by medical images taken from GBM patients in statistical settings. To do this, we design a novel statistic—the smooth Euler characteristic transform (SECT)—that quantifies magnetic resonance images of tumors. Due to its well-defined inner product structure, the SECT can be used in a wider range of functional and nonparametric modeling approaches than other previously proposed topological summary statistics. When applied to a cohort of GBM patients, we find that the SECT is a better predictor of clinical outcomes than both existing tumor shape quantifications and common molecular assays. Specifically, we demonstrate that SECT features alone explain more of the variance in GBM patient survival than gene expression, volumetric features, and morphometric features. The main takeaways from our findings are thus twofold. First, they suggest that images contain valuable information that can play an important role in clinical prognosis and other medical decisions. Second, they show that the SECT is a viable tool for the broader study of medical imaging informatics. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1139-1150
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1671198
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1671198
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1139-1150
Template-Type: ReDIF-Article 1.0
Author-Name: Amanda F. Mejia
Author-X-Name-First: Amanda F.
Author-X-Name-Last: Mejia
Author-Name: Mary Beth Nebel
Author-X-Name-First: Mary Beth
Author-X-Name-Last: Nebel
Author-Name: Yikai Wang
Author-X-Name-First: Yikai
Author-X-Name-Last: Wang
Author-Name: Brian S. Caffo
Author-X-Name-First: Brian S.
Author-X-Name-Last: Caffo
Author-Name: Ying Guo
Author-X-Name-First: Ying
Author-X-Name-Last: Guo
Title: Template Independent Component Analysis: Targeted and Reliable Estimation of Subject-level Brain Networks Using Big Data Population Priors
Abstract:
Large brain imaging databases contain a wealth of information on brain organization in the populations they target, and on individual variability. While such databases have been used to study group-level features of populations directly, they are currently underutilized as a resource to inform single-subject analysis. Here, we propose leveraging the information contained in large functional magnetic resonance imaging (fMRI) databases by establishing population priors to employ in an empirical Bayesian framework. We focus on estimation of brain networks as source signals in independent component analysis (ICA). We formulate a hierarchical “template” ICA model where source signals—including known population brain networks and subject-specific signals—are represented as latent variables. For estimation, we derive an expectation–maximization (EM) algorithm having an explicit solution. However, as this solution is computationally intractable, we also consider an approximate subspace algorithm and a faster two-stage approach. Through extensive simulation studies, we assess performance of both methods and compare with dual regression, a popular but ad-hoc method. The two proposed algorithms have similar performance, and both dramatically outperform dual regression. We also conduct a reliability study utilizing the Human Connectome Project and find that template ICA achieves substantially better performance than dual regression, achieving 75–250% higher intra-subject reliability. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1151-1177
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1679638
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1679638
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1151-1177
Template-Type: ReDIF-Article 1.0
Author-Name: Carles Bretó
Author-X-Name-First: Carles
Author-X-Name-Last: Bretó
Author-Name: Edward L. Ionides
Author-X-Name-First: Edward L.
Author-X-Name-Last: Ionides
Author-Name: Aaron A. King
Author-X-Name-First: Aaron A.
Author-X-Name-Last: King
Title: Panel Data Analysis via Mechanistic Models
Abstract:
Panel data, also known as longitudinal data, consist of a collection of time series. Each time series, which could itself be multivariate, comprises a sequence of measurements taken on a distinct unit. Mechanistic modeling involves writing down scientifically motivated equations describing the collection of dynamic systems giving rise to the observations on each unit. A defining characteristic of panel systems is that the dynamic interaction between units should be negligible. Panel models therefore consist of a collection of independent stochastic processes, generally linked through shared parameters while also having unit-specific parameters. To give the scientist flexibility in model specification, we are motivated to develop a framework for inference on panel data permitting the consideration of arbitrary nonlinear, partially observed panel models. We build on iterated filtering techniques that provide likelihood-based inference on nonlinear partially observed Markov process models for time series data. Our methodology depends on the latent Markov process only through simulation; this plug-and-play property ensures applicability to a large class of models. We demonstrate our methodology on a toy example and two epidemiological case studies. We address inferential and computational issues arising due to the combination of model complexity and dataset size. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1178-1188
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1604367
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604367
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1178-1188
Template-Type: ReDIF-Article 1.0
Author-Name: Christian H. Weiß
Author-X-Name-First: Christian H.
Author-X-Name-Last: Weiß
Title: Distance-Based Analysis of Ordinal Data and Ordinal Time Series
Abstract:
The dissimilarity of ordinal categories can be expressed with a distance measure. A unified approach relying on expected distances is proposed to obtain well-interpretable measures of location, dispersion, or symmetry of random variables, as well as measures of serial dependence within a given process. For special types of distance, these analytic tools lead to known approaches for ordinal or real-valued random variables. We also analyze the sample counterparts of the proposed measures and derive asymptotic results for practically important cases in ordinal data and time series analysis. Two real applications about the economic situation in Germany and the credit rating of European countries are presented. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1189-1200
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1604370
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604370
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1189-1200
Template-Type: ReDIF-Article 1.0
Author-Name: Chengchun Shi
Author-X-Name-First: Chengchun
Author-X-Name-Last: Shi
Author-Name: Wenbin Lu
Author-X-Name-First: Wenbin
Author-X-Name-Last: Lu
Author-Name: Rui Song
Author-X-Name-First: Rui
Author-X-Name-Last: Song
Title: A Sparse Random Projection-Based Test for Overall Qualitative Treatment Effects
Abstract:
In contrast to the classical “one-size-fits-all” approach, precision medicine proposes the customization of individualized treatment regimes to account for patients’ heterogeneity in response to treatments. Most existing works in the literature have focused on estimating optimal individualized treatment regimes. However, less attention has been devoted to hypothesis testing regarding the existence of overall qualitative treatment effects, especially when there are a large number of prognostic covariates. When covariates do not have qualitative treatment effects, the optimal treatment regime will assign the same treatment to all patients regardless of their covariate values. In this article, we consider testing the overall qualitative treatment effects of patients’ prognostic covariates in a high-dimensional setting. We propose a sample splitting method to construct the test statistic, based on a nonparametric estimator of the contrast function. When the dimension of covariates is large, we construct the test based on sparse random projections of covariates into a low-dimensional space. We prove the consistency of our test statistic. In the regular cases, we show that the asymptotic power function of our test statistic is asymptotically the same as that of the “oracle” test statistic, which is constructed based on the “optimal” projection matrix. Simulation studies and real data applications validate our theoretical findings. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1201-1213
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1604368
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604368
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1201-1213
Template-Type: ReDIF-Article 1.0
Author-Name: Valentina Corradi
Author-X-Name-First: Valentina
Author-X-Name-Last: Corradi
Author-Name: Walter Distaso
Author-X-Name-First: Walter
Author-X-Name-Last: Distaso
Author-Name: Marcelo Fernandes
Author-X-Name-First: Marcelo
Author-X-Name-Last: Fernandes
Title: Testing for Jump Spillovers Without Testing for Jumps
Abstract:
This article develops statistical tools for testing conditional independence among the jump components of the daily quadratic variation, which we estimate using intraday data. To avoid sequential bias distortion, we do not pretest for the presence of jumps. If the null is true, our test statistic based on daily integrated jumps weakly converges to a Gaussian random variable if both assets have jumps. If instead at least one asset has no jumps, then the statistic approaches zero in probability. We show how to compute asymptotically valid bootstrap-based critical values that result in a consistent test with asymptotic size equal to or smaller than the nominal size. Empirically, we study jump linkages between US futures and equity index markets. We find not only strong evidence of jump cross-excitation between the SPDR exchange-traded fund and E-mini futures on the S&P 500 index, but also that integrated jumps in the E-mini futures during the overnight period carry relevant information. Supplementary materials for this article are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1214-1226
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1609971
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1609971
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1214-1226
Template-Type: ReDIF-Article 1.0
Author-Name: Shan Luo
Author-X-Name-First: Shan
Author-X-Name-Last: Luo
Author-Name: Zehua Chen
Author-X-Name-First: Zehua
Author-X-Name-Last: Chen
Title: Feature Selection by Canonical Correlation Search in High-Dimensional Multiresponse Models With Complex Group Structures
Abstract:
High-dimensional multiresponse models with complex group structures in both the response variables and the covariates arise from current research in important fields such as genetics and medicine. However, not enough research has been done on such models. One of the few existing studies, if not the only one, is the article by Li, Nan, and Zhu, where the sparse group Lasso approach is extended to such models. In this article, we propose a novel approach named the sequential canonical correlation search (SCCS) procedure. In the SCCS procedure, the nonzero group-by-group blocks of regression coefficients are searched stepwise using a canonical correlation measure. Each step of the procedure consists of a block selection and a sparsity identification. The model selection criterion, EBIC, is used as the stopping rule of the procedure. We establish the selection consistency of the SCCS procedure and conduct simulation studies comparing it with existing methods. The SCCS procedure has two advantages over the sparse group Lasso method: (i) it is more accurate in the identification of nonzero coefficient blocks and their nonzero entries, and (ii) its implementation is not limited by the dimensionality of the models and requires much less computation. A real example in genetic studies is also considered. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1227-1235
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1609972
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1609972
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1227-1235
Template-Type: ReDIF-Article 1.0
Author-Name: Yin Xia
Author-X-Name-First: Yin
Author-X-Name-Last: Xia
Author-Name: T. Tony Cai
Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai
Author-Name: Wenguang Sun
Author-X-Name-First: Wenguang
Author-X-Name-Last: Sun
Title: GAP: A General Framework for Information Pooling in Two-Sample Sparse Inference
Abstract:
This article develops a general framework for exploiting the sparsity information in two-sample multiple testing problems. We propose to first construct a covariate sequence, in addition to the usual primary test statistics, to capture the sparsity structure, and then incorporate the auxiliary covariates in inference via a three-step algorithm consisting of grouping, adjusting and pooling (GAP). The GAP procedure provides a simple and effective framework for information pooling. An important advantage of GAP is its capability of handling various dependence structures such as those arising from high-dimensional linear regression, differential correlation analysis, and differential network analysis. We establish general conditions under which GAP is asymptotically valid for false discovery rate control, and show that these conditions are fulfilled in a range of settings, including testing multivariate normal means, high-dimensional linear regression, differential covariance or correlation matrices, and Gaussian graphical models. Numerical results demonstrate that existing methods can be significantly improved by the proposed framework. The GAP procedure is illustrated using a breast cancer study for identifying gene–gene interactions.
Journal: Journal of the American Statistical Association
Pages: 1236-1250
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1611585
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1611585
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1236-1250
Template-Type: ReDIF-Article 1.0
Author-Name: Jieli Shen
Author-X-Name-First: Jieli
Author-X-Name-Last: Shen
Author-Name: Regina Y. Liu
Author-X-Name-First: Regina Y.
Author-X-Name-Last: Liu
Author-Name: Min-ge Xie
Author-X-Name-First: Min-ge
Author-X-Name-Last: Xie
Title: iFusion: Individualized Fusion Learning
Abstract:
Inferences from different data sources can often be fused together, a process referred to as “fusion learning,” to yield more powerful findings than those from individual data sources alone. Effective fusion learning approaches are in growing demand as an increasing number of data sources have become easily available in this big data era. This article proposes a new fusion learning approach, called “iFusion,” for drawing efficient individualized inference by fusing learnings from relevant data sources. Specifically, iFusion (i) summarizes inferences from individual data sources as individual confidence distributions (CDs); (ii) forms a clique of individuals that bear relevance to the target individual and then combines the CDs from those relevant individuals; and, finally, (iii) draws inference for the target individual from the combined CD. In essence, iFusion strategically “borrows strength” from relevant individuals to enhance the efficiency of the target individual inference while preserving its validity. This article focuses on the setting where each individual study has a number of observations but its inference can be further improved by incorporating additional information from similar studies, which are referred to as its clique. Under this setting, iFusion is shown to achieve the oracle property under suitable conditions. It is also shown to be flexible and robust in handling heterogeneity arising from diverse data sources. The development is ideally suited for goal-directed applications. Computationally, iFusion is parallel in nature and scales up easily for big data. An efficient scalable algorithm is provided for implementation. Simulation studies and a real application in financial forecasting are presented. In effect, this article covers methodology, theory, computation, and application for individualized inference by iFusion.
Journal: Journal of the American Statistical Association
Pages: 1251-1267
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1672557
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1672557
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1251-1267
Template-Type: ReDIF-Article 1.0
Author-Name: Yassir Rabhi
Author-X-Name-First: Yassir
Author-X-Name-Last: Rabhi
Author-Name: Taoufik Bouezmarni
Author-X-Name-First: Taoufik
Author-X-Name-Last: Bouezmarni
Title: Nonparametric Inference for Copulas and Measures of Dependence Under Length-Biased Sampling and Informative Censoring
Abstract:
Length-biased data are often encountered in cross-sectional surveys and prevalent-cohort studies on disease durations. Under length-biased sampling, subjects with longer disease durations have a greater chance of being observed. As a result, covariate values linked to the longer survivors are favored by the sampling mechanism. When the sampled durations are also subject to right censoring, the censoring is informative. Modeling the dependence structure without adjusting for these issues leads to biased results. In this article, we consider copulas for modeling dependence when the collected data are length-biased, and account for both the informative censoring and the covariate bias that are naturally linked to length-biased sampling. We address nonparametric estimation of the bivariate distribution, the copula function and its density, and the Kendall and Spearman measures for right-censored length-biased data. The proposed estimator for the bivariate cdf is a Hadamard-differentiable functional of two MLEs (Kaplan–Meier and empirical cdf) and inherits their efficiency. Based on this estimator, we devise two estimators for the copula function and a local-polynomial estimator for the copula density that accounts for boundary bias. The limiting processes of the estimators are established by deriving their iid representations. As a by-product, we establish the oscillation behavior of the bivariate cdf estimator. In addition, we introduce estimators for the Kendall and Spearman measures and study their weak convergence. The proposed method is applied to analyze a set of right-censored length-biased data on survival with dementia, collected as part of a nationwide study in Canada.
Journal: Journal of the American Statistical Association
Pages: 1268-1278
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1611586
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1611586
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1268-1278
Template-Type: ReDIF-Article 1.0
Author-Name: Yin Xia
Author-X-Name-First: Yin
Author-X-Name-Last: Xia
Author-Name: Lexin Li
Author-X-Name-First: Lexin
Author-X-Name-Last: Li
Author-Name: Samuel N. Lockhart
Author-X-Name-First: Samuel N.
Author-X-Name-Last: Lockhart
Author-Name: William J. Jagust
Author-X-Name-First: William J.
Author-X-Name-Last: Jagust
Title: Simultaneous Covariance Inference for Multimodal Integrative Analysis
Abstract:
Multimodal integrative analysis fuses different types of data collected on the same set of experimental subjects. It is becoming a norm in many branches of scientific research, such as multi-omics and multimodal neuroimaging analysis. In this article, we address the problem of simultaneous covariance inference of associations between multiple modalities, which is of vital interest in multimodal integrative analysis. Recognizing that there are few readily available solutions in the literature for this type of problem, we develop a new simultaneous testing procedure. It provides an explicit quantification of statistical significance, much improved detection power, and rigorous false discovery control. Our proposal makes novel and useful contributions from both the scientific perspective and the statistical methodological perspective. We demonstrate the efficacy of the new method through both simulations and a multimodal positron emission tomography study of associations between two hallmark pathological proteins of Alzheimer’s disease.
Journal: Journal of the American Statistical Association
Pages: 1279-1291
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1623040
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1623040
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1279-1291
Template-Type: ReDIF-Article 1.0
Author-Name: Geneviève Robin
Author-X-Name-First: Geneviève
Author-X-Name-Last: Robin
Author-Name: Olga Klopp
Author-X-Name-First: Olga
Author-X-Name-Last: Klopp
Author-Name: Julie Josse
Author-X-Name-First: Julie
Author-X-Name-Last: Josse
Author-Name: Éric Moulines
Author-X-Name-First: Éric
Author-X-Name-Last: Moulines
Author-Name: Robert Tibshirani
Author-X-Name-First: Robert
Author-X-Name-Last: Tibshirani
Title: Main Effects and Interactions in Mixed and Incomplete Data Frames
Abstract:
A mixed data frame (MDF) is a table collecting categorical, numerical, and count observations. The use of MDFs is widespread in statistics, with numerous applications ranging from abundance data in ecology to recommender systems. In many cases, an MDF simultaneously exhibits main effects, such as row, column, or group effects, and interactions, for which a low-rank model has often been suggested. Although the literature on low-rank approximations is very substantial, with few exceptions, existing methods do not make it possible to incorporate main effects and interactions while providing statistical guarantees. The present work fills this gap. We propose an estimation method which recovers the main effects and the interactions simultaneously. We show that our method is near optimal under conditions which are met in our targeted applications. We also propose an optimization algorithm which provably converges to an optimal solution. Numerical experiments reveal that our method, mimi, performs well when the main effects are sparse and the interaction matrix has low rank. We also show that mimi compares favorably to existing methods, in particular when the main effects are significantly large compared to the interactions, and when the proportion of missing entries is large. The method is available as an R package on the Comprehensive R Archive Network. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1292-1303
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1623041
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1623041
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1292-1303
Template-Type: ReDIF-Article 1.0
Author-Name: Chunlin Li
Author-X-Name-First: Chunlin
Author-X-Name-Last: Li
Author-Name: Xiaotong Shen
Author-X-Name-First: Xiaotong
Author-X-Name-Last: Shen
Author-Name: Wei Pan
Author-X-Name-First: Wei
Author-X-Name-Last: Pan
Title: Likelihood Ratio Tests for a Large Directed Acyclic Graph
Abstract:
Inference of directional pairwise relations between interacting units in a directed acyclic graph (DAG), such as a regulatory gene network, is common in practice, but poses challenges because of the lack of inferential tools. For example, inferring a specific gene pathway of a regulatory gene network is biologically important, yet frequentist inference of the directionality of connections remains largely unexplored for regulatory models. In this article, we propose constrained likelihood ratio tests for inference of connectivity as well as directionality, subject to nonconvex acyclicity constraints in a Gaussian directed graphical model. In particular, we derive the asymptotic distributions of the constrained likelihood ratios in a high-dimensional situation. For testing of connectivity, the asymptotic distribution is either chi-squared or normal, depending on whether the number of testable links in the DAG model is small. For testing of directionality, the asymptotic distribution is the minimum of d independent chi-squared variables with one degree of freedom or a generalized Gamma distribution, depending on whether d is small, where d is the number of breakpoints in a hypothesized pathway. Moreover, we develop a computational method to perform the proposed tests, which integrates an alternating direction method of multipliers and difference convex programming. Finally, the power analysis and simulations suggest that the tests achieve the desired objectives of inference. An analysis of an Alzheimer’s disease gene expression dataset illustrates the utility of the proposed method to infer a directed pathway in a gene network.
Journal: Journal of the American Statistical Association
Pages: 1304-1319
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1623042
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1623042
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1304-1319
Template-Type: ReDIF-Article 1.0
Author-Name: Beniamino Hadj-Amar
Author-X-Name-First: Beniamino
Author-X-Name-Last: Hadj-Amar
Author-Name: Bärbel Finkenstädt Rand
Author-X-Name-First: Bärbel Finkenstädt
Author-X-Name-Last: Rand
Author-Name: Mark Fiecas
Author-X-Name-First: Mark
Author-X-Name-Last: Fiecas
Author-Name: Francis Lévi
Author-X-Name-First: Francis
Author-X-Name-Last: Lévi
Author-Name: Robert Huckstepp
Author-X-Name-First: Robert
Author-X-Name-Last: Huckstepp
Title: Bayesian Model Search for Nonstationary Periodic Time Series
Abstract:
We propose a novel Bayesian methodology for analyzing nonstationary time series that exhibit oscillatory behavior. We approximate the time series using a piecewise oscillatory model with unknown periodicities, where our goal is to estimate the change-points while simultaneously identifying the potentially changing periodicities in the data. Our proposed methodology is based on a trans-dimensional Markov chain Monte Carlo algorithm that simultaneously updates the change-points and the periodicities relevant to any segment between them. We show that the proposed methodology successfully identifies time changing oscillatory behavior in two applications which are relevant to e-Health and sleep research, namely the occurrence of ultradian oscillations in human skin temperature during the time of night rest, and the detection of instances of sleep apnea in plethysmographic respiratory traces. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1320-1335
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1623043
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1623043
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1320-1335
Template-Type: ReDIF-Article 1.0
Author-Name: Carina Gerstenberger
Author-X-Name-First: Carina
Author-X-Name-Last: Gerstenberger
Author-Name: Daniel Vogel
Author-X-Name-First: Daniel
Author-X-Name-Last: Vogel
Author-Name: Martin Wendler
Author-X-Name-First: Martin
Author-X-Name-Last: Wendler
Title: Tests for Scale Changes Based on Pairwise Differences
Abstract:
In many applications it is important to know whether the amount of fluctuation in a series of observations changes over time. In this article, we investigate different tests for detecting changes in the scale of mean-stationary time series. The classical approach, based on the CUSUM test applied to the squared centered observations, is very vulnerable to outliers and impractical for heavy-tailed data, which leads us to contemplate test statistics based on alternative, less outlier-sensitive scale estimators. It turns out that the tests based on Gini’s mean difference (the average of all pairwise distances) and generalized Qn estimators (sample quantiles of all pairwise distances) are very suitable candidates. They improve upon the classical test not only under heavy tails or in the presence of outliers, but also under normality. We use recent results on the process convergence of U-statistics and U-quantiles for dependent sequences to derive the limiting distribution of the test statistics and propose estimators for the long-run variance. We show the consistency of the tests and demonstrate the applicability of the new change-point detection methods with two real-life data examples from hydrology and finance. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1336-1348
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1629938
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1629938
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1336-1348
Template-Type: ReDIF-Article 1.0
Author-Name: Meng Li
Author-X-Name-First: Meng
Author-X-Name-Last: Li
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Comparing and Weighting Imperfect Models Using D-Probabilities
Abstract:
We propose a new approach for assigning weights to models using a divergence-based method (D-probabilities), relying on evaluating parametric models relative to a nonparametric Bayesian reference using Kullback–Leibler divergence. D-probabilities are useful in goodness-of-fit assessments, in comparing imperfect models, and in providing model weights to be used in model aggregation. D-probabilities avoid some of the disadvantages of Bayesian model probabilities, such as large sensitivity to prior choice, and tend to place higher weight on a greater diversity of models. In an application to linear model selection against a Gaussian process reference, we provide simple analytic forms for routine implementation and show that D-probabilities automatically penalize model complexity. Some asymptotic properties are described, and we provide interesting probabilistic interpretations of the proposed model weights. The framework is illustrated through simulation examples and an ozone data application. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1349-1360
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1611140
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1611140
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1349-1360
Template-Type: ReDIF-Article 1.0
Author-Name: Holger Dette
Author-X-Name-First: Holger
Author-X-Name-Last: Dette
Author-Name: Josua Gösmann
Author-X-Name-First: Josua
Author-X-Name-Last: Gösmann
Title: A Likelihood Ratio Approach to Sequential Change Point Detection for a General Class of Parameters
Abstract:
In this article, we propose a new approach for sequential monitoring of a general class of parameters of a d-dimensional time series, which can be estimated by approximately linear functionals of the empirical distribution function. We consider a closed-end method, which is motivated by the likelihood ratio test principle and compare the new method with two alternative procedures. We also incorporate self-normalization such that estimation of the long-run variance is not necessary. We prove that for a large class of testing problems the new detection scheme has asymptotic level α and is consistent. The asymptotic theory is illustrated for the important cases of monitoring a change in the mean, variance, and correlation. By means of a simulation study it is demonstrated that the new test performs better than the currently available procedures for these problems. Finally, the methodology is illustrated by a small data example investigating index prices from the dot-com bubble. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1361-1377
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1630562
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1630562
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1361-1377
Template-Type: ReDIF-Article 1.0
Author-Name: Karthik Bharath
Author-X-Name-First: Karthik
Author-X-Name-Last: Bharath
Author-Name: Sebastian Kurtek
Author-X-Name-First: Sebastian
Author-X-Name-Last: Kurtek
Title: Distribution on Warp Maps for Alignment of Open and Closed Curves
Abstract:
Alignment of curve data is an integral part of their statistical analysis, and can be achieved using model- or optimization-based approaches. The parameter space is usually the set of monotone, continuous warp maps of a domain. The infinite-dimensional nature of the parameter space encourages sampling-based approaches, which require a distribution on the set of warp maps. Moreover, the distribution should also enable sampling in the presence of important landmark information on the curves which constrains the warp maps. For alignment of closed and open curves in R^d, d = 1, 2, 3, possibly with landmark information, we provide a constructive, point-process-based definition of a distribution on the set of warp maps of [0, 1] and the unit circle S^1 that is (1) simple to sample from, and (2) possesses the desiderata for decomposition of the alignment problem with landmark constraints into multiple unconstrained ones. For warp maps on [0, 1], the distribution is related to the Dirichlet process. We demonstrate its utility by using it as a prior distribution on warp maps in a Bayesian model for alignment of two univariate curves, and as a proposal distribution in a stochastic algorithm that optimizes a suitable alignment functional for higher-dimensional curves. Several examples from simulated and real datasets are provided.
Journal: Journal of the American Statistical Association
Pages: 1378-1392
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1632066
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1632066
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1378-1392
Template-Type: ReDIF-Article 1.0
Author-Name: Tingyou Zhou
Author-X-Name-First: Tingyou
Author-X-Name-Last: Zhou
Author-Name: Liping Zhu
Author-X-Name-First: Liping
Author-X-Name-Last: Zhu
Author-Name: Chen Xu
Author-X-Name-First: Chen
Author-X-Name-Last: Xu
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Title: Model-Free Forward Screening Via Cumulative Divergence
Abstract:
Feature screening plays an important role in the analysis of ultrahigh dimensional data. Due to complicated model structure and high noise level, existing screening methods often suffer from model misspecification and the presence of outliers. To address these issues, we introduce a new metric named cumulative divergence (CD), and develop a CD-based forward screening procedure. This forward screening method is model-free and resistant to the presence of outliers in the response. It also incorporates the joint effects among covariates into the screening process. With a data-driven threshold, the new method can automatically determine the number of features that should be retained after screening. These merits make the CD-based screening very appealing in practice. Under certain regularity conditions, we show that the proposed method possesses the sure screening property. The performance of our proposal is illustrated through simulations and a real data example. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1393-1405
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1632078
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1632078
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1393-1405
Template-Type: ReDIF-Article 1.0
Author-Name: Guan Yu
Author-X-Name-First: Guan
Author-X-Name-Last: Yu
Author-Name: Quefeng Li
Author-X-Name-First: Quefeng
Author-X-Name-Last: Li
Author-Name: Dinggang Shen
Author-X-Name-First: Dinggang
Author-X-Name-Last: Shen
Author-Name: Yufeng Liu
Author-X-Name-First: Yufeng
Author-X-Name-Last: Liu
Title: Optimal Sparse Linear Prediction for Block-missing Multi-modality Data Without Imputation
Abstract:
In modern scientific research, data are often collected from multiple modalities. Since different modalities could provide complementary information, statistical prediction methods using multimodality data could deliver better prediction performance than using single modality data. However, one special challenge for using multimodality data is related to block-missing data. In practice, due to dropouts or the high cost of measurements, the observations of a certain modality can be missing completely for some subjects. In this paper, we propose a new direct sparse regression procedure using covariance from multimodality data (DISCOM). Our proposed DISCOM method includes two steps to find the optimal linear prediction of a continuous response variable using block-missing multimodality predictors. In the first step, rather than deleting or imputing missing data, we make use of all available information to estimate the covariance matrix of the predictors and the cross-covariance vector between the predictors and the response variable. The proposed new estimate of the covariance matrix is a linear combination of the identity matrix, the estimates of the intra-modality covariance matrix and the cross-modality covariance matrix. Flexible estimates for both the sub-Gaussian and heavy-tailed cases are considered. In the second step, based on the estimated covariance matrix and the estimated cross-covariance vector, an extended Lasso-type estimator is used to deliver a sparse estimate of the coefficients in the optimal linear prediction. The number of samples that are effectively used by DISCOM is the minimum number of samples with available observations from two modalities, which can be much larger than the number of samples with complete observations from all modalities. The effectiveness of the proposed method is demonstrated by theoretical studies, simulated examples, and a real application from the Alzheimer’s Disease Neuroimaging Initiative. The comparison between DISCOM and some existing methods also indicates the advantages of our proposed method.
Journal: Journal of the American Statistical Association
Pages: 1406-1419
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1632079
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1632079
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1406-1419
Template-Type: ReDIF-Article 1.0
Author-Name: Eardi Lila
Author-X-Name-First: Eardi
Author-X-Name-Last: Lila
Author-Name: John A. D. Aston
Author-X-Name-First: John A. D.
Author-X-Name-Last: Aston
Title: Statistical Analysis of Functions on Surfaces, With an Application to Medical Imaging
Abstract:
In functional data analysis, data are commonly assumed to be smooth functions on a fixed interval of the real line. In this work, we introduce a comprehensive framework for the analysis of functional data, whose domain is a two-dimensional manifold and the domain itself is subject to variability from sample to sample. We formulate a statistical model for such data, here called functions on surfaces, which enables a joint representation of the geometric and functional aspects, and propose an associated estimation framework. We assess the validity of the framework by performing a simulation study and we finally apply it to the analysis of neuroimaging data of cortical thickness, acquired from the brains of different subjects, and thus lying on domains with different geometries. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1420-1434
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1635479
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1635479
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1420-1434
Template-Type: ReDIF-Article 1.0
Author-Name: Zhigang Yao
Author-X-Name-First: Zhigang
Author-X-Name-Last: Yao
Author-Name: Zhenyue Zhang
Author-X-Name-First: Zhenyue
Author-X-Name-Last: Zhang
Title: Principal Boundary on Riemannian Manifolds
Abstract:
We consider the classification problem and focus on nonlinear methods for classification on manifolds. For multivariate datasets lying on an embedded nonlinear Riemannian manifold within the higher-dimensional ambient space, we aim to acquire a classification boundary for the classes with labels, using the intrinsic metric on the manifolds. Motivated by finding an optimal boundary between the two classes, we invent a novel approach—the principal boundary. From the perspective of classification, the principal boundary is defined as an optimal curve that moves in between the principal flows traced out from two classes of data, and at any point on the boundary, it maximizes the margin between the two classes. We estimate the boundary and its direction, supervised by the two principal flows. We show that the principal boundary yields the usual decision boundary found by the support vector machine in the sense that locally, the two boundaries coincide. Some optimality and convergence properties of the random principal boundary and its population counterpart are also shown. We illustrate how to find, use, and interpret the principal boundary with an application in real data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1435-1448
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1610660
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1610660
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1435-1448
Template-Type: ReDIF-Article 1.0
Author-Name: Matias D. Cattaneo
Author-X-Name-First: Matias D.
Author-X-Name-Last: Cattaneo
Author-Name: Michael Jansson
Author-X-Name-First: Michael
Author-X-Name-Last: Jansson
Author-Name: Xinwei Ma
Author-X-Name-First: Xinwei
Author-X-Name-Last: Ma
Title: Simple Local Polynomial Density Estimators
Abstract:
This article introduces an intuitive and easy-to-implement nonparametric density estimator based on local polynomial techniques. The estimator is fully boundary adaptive and automatic, but does not require prebinning or any other transformation of the data. We study the main asymptotic properties of the estimator, and use these results to provide principled estimation, inference, and bandwidth selection methods. As a substantive application of our results, we develop a novel discontinuity in density testing procedure, an important problem in regression discontinuity designs and other program evaluation settings. An illustrative empirical application is given. Two companion Stata and R software packages are provided.
Journal: Journal of the American Statistical Association
Pages: 1449-1455
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1635480
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1635480
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1449-1455
Template-Type: ReDIF-Article 1.0
Author-Name: Chaowen Zheng
Author-X-Name-First: Chaowen
Author-X-Name-Last: Zheng
Author-Name: Yichao Wu
Author-X-Name-First: Yichao
Author-X-Name-Last: Wu
Title: Nonparametric Estimation of Multivariate Mixtures
Abstract:
A multivariate mixture model is determined by three elements: the number of components, the mixing proportions, and the component distributions. Assuming that the number of components is given and that each mixture component has independent marginal distributions, we propose a nonparametric method to estimate the component distributions. The basic idea is to convert the estimation of component density functions to a problem of estimating the coordinates of the component density functions with respect to a good set of basis functions. Specifically, we construct a set of basis functions by using conditional density functions and try to recover the coordinates of component density functions with respect to this set of basis functions. Furthermore, we show that our estimator for the component density functions is consistent. Numerical studies are used to compare our algorithm with other existing nonparametric methods of estimating component distributions under the assumption of conditionally independent marginals.
Journal: Journal of the American Statistical Association
Pages: 1456-1471
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1635481
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1635481
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1456-1471
Template-Type: ReDIF-Article 1.0
Author-Name: Jack Kamm
Author-X-Name-First: Jack
Author-X-Name-Last: Kamm
Author-Name: Jonathan Terhorst
Author-X-Name-First: Jonathan
Author-X-Name-Last: Terhorst
Author-Name: Richard Durbin
Author-X-Name-First: Richard
Author-X-Name-Last: Durbin
Author-Name: Yun S. Song
Author-X-Name-First: Yun S.
Author-X-Name-Last: Song
Title: Efficiently Inferring the Demographic History of Many Populations With Allele Count Data
Abstract:
The sample frequency spectrum (SFS), or histogram of allele counts, is an important summary statistic in evolutionary biology, and is often used to infer the history of population size changes, migrations, and other demographic events affecting a set of populations. The expected multipopulation SFS under a given demographic model can be efficiently computed when the populations in the model are related by a tree, scaling to hundreds of populations. Admixture, back-migration, and introgression are common natural processes that violate the assumption of a tree-like population history, however, and until now the expected SFS could be computed for only a handful of populations when the demographic history is not a tree. In this article, we present a new method for efficiently computing the expected SFS and linear functionals of it, for demographies described by general directed acyclic graphs. This method can scale to more populations than previously possible for complex demographic histories including admixture. We apply our method to an 8-population SFS to estimate the timing and strength of a proposed “basal Eurasian” admixture event in human history. We implement and release our method in a new open-source software package momi2.
Journal: Journal of the American Statistical Association
Pages: 1472-1487
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1635482
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1635482
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1472-1487
Template-Type: ReDIF-Article 1.0
Author-Name: Wei Ma
Author-X-Name-First: Wei
Author-X-Name-Last: Ma
Author-Name: Yichen Qin
Author-X-Name-First: Yichen
Author-X-Name-Last: Qin
Author-Name: Yang Li
Author-X-Name-First: Yang
Author-X-Name-Last: Li
Author-Name: Feifang Hu
Author-X-Name-First: Feifang
Author-X-Name-Last: Hu
Title: Statistical Inference for Covariate-Adaptive Randomization Procedures
Abstract:
Covariate-adaptive randomization (CAR) procedures are frequently used in comparative studies to increase the covariate balance across treatment groups. However, because randomization inevitably uses the covariate information when forming balanced treatment groups, the validity of classical statistical methods after such randomization is often unclear. In this article, we derive the theoretical properties of statistical methods based on general CAR under the linear model framework. More importantly, we explicitly unveil the relationship between covariate-adaptive and inference properties by deriving the asymptotic representations of the corresponding estimators. We apply the proposed general theory to various randomization procedures such as complete randomization, rerandomization, pairwise sequential randomization, and Atkinson’s DA-biased coin design and compare their performance analytically. Based on the theoretical results, we then propose a new approach to obtain valid and more powerful tests. These results open a door to understand and analyze experiments based on CAR. Simulation studies provide further evidence of the advantages of the proposed framework and the theoretical results. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1488-1497
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1635483
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1635483
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1488-1497
Template-Type: ReDIF-Article 1.0
Author-Name: Federico Ricciardi
Author-X-Name-First: Federico
Author-X-Name-Last: Ricciardi
Author-Name: Alessandra Mattei
Author-X-Name-First: Alessandra
Author-X-Name-Last: Mattei
Author-Name: Fabrizia Mealli
Author-X-Name-First: Fabrizia
Author-X-Name-Last: Mealli
Title: Bayesian Inference for Sequential Treatments Under Latent Sequential Ignorability
Abstract:
We focus on causal inference for longitudinal treatments, where units are assigned to treatments at multiple time points, aiming to assess the effect of different treatment sequences on an outcome observed at a final point. A common assumption in similar studies is sequential ignorability (SI): treatment assignment at each time point is assumed independent of future potential outcomes given past observed outcomes and covariates. SI is questionable when treatment participation depends on individual choices, and treatment assignment may depend on unobservable quantities associated with future outcomes. We rely on principal stratification to formulate a relaxed version of SI: latent sequential ignorability (LSI) assumes that treatment assignment is conditionally independent of future potential outcomes given past treatments, covariates, and principal stratum membership, a latent variable defined by the joint value of observed and missing intermediate outcomes. We evaluate SI and LSI, using theoretical arguments and simulation studies to investigate the performance of the two assumptions when one holds and inference is conducted under both. Simulations show that when SI does not hold, inference performed under SI leads to misleading conclusions. Conversely, LSI generally leads to correct posterior distributions, irrespective of which assumption holds.
Journal: Journal of the American Statistical Association
Pages: 1498-1517
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1623039
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1623039
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1498-1517
Template-Type: ReDIF-Article 1.0
Author-Name: Colin B. Fogarty
Author-X-Name-First: Colin B.
Author-X-Name-Last: Fogarty
Title: Studentized Sensitivity Analysis for the Sample Average Treatment Effect in Paired Observational Studies
Abstract:
A fundamental limitation of causal inference in observational studies is that perceived evidence for an effect might instead be explained by factors not accounted for in the primary analysis. Methods for assessing the sensitivity of a study’s conclusions to unmeasured confounding have been established under the assumption that the treatment effect is constant across all individuals. In the potential presence of unmeasured confounding, it has been argued that certain patterns of effect heterogeneity may conspire with unobserved covariates to render the performed sensitivity analysis inadequate. We present a new method for conducting a sensitivity analysis for the sample average treatment effect in the presence of effect heterogeneity in paired observational studies. Our recommended procedure, called the studentized sensitivity analysis, represents an extension of recent work on studentized permutation tests to the case of observational studies, where randomizations are no longer drawn uniformly. The method naturally extends conventional tests for the sample average treatment effect in paired experiments to the case of unknown, but bounded, probabilities of assignment to treatment. In so doing, we illustrate that concerns about certain sensitivity analyses operating under the presumption of constant effects are largely unwarranted.
Journal: Journal of the American Statistical Association
Pages: 1518-1530
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1632072
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1632072
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1518-1530
Template-Type: ReDIF-Article 1.0
Author-Name: Gabrielle Simoneau
Author-X-Name-First: Gabrielle
Author-X-Name-Last: Simoneau
Author-Name: Erica E. M. Moodie
Author-X-Name-First: Erica E. M.
Author-X-Name-Last: Moodie
Author-Name: Jagtar S. Nijjar
Author-X-Name-First: Jagtar S.
Author-X-Name-Last: Nijjar
Author-Name: Robert W. Platt
Author-X-Name-First: Robert W.
Author-X-Name-Last: Platt
Author-Name: the Scottish Early Rheumatoid Arthritis Inception Cohort Investigators
Author-X-Name-First:
Author-X-Name-Last: the Scottish Early Rheumatoid Arthritis Inception Cohort Investigators
Title: Estimating Optimal Dynamic Treatment Regimes With Survival Outcomes
Abstract:
The statistical study of precision medicine is concerned with dynamic treatment regimes (DTRs) in which treatment decisions are tailored to patient-level information. Individuals are followed through multiple stages of clinical intervention, and the goal is to perform inferences on the sequence of personalized treatment decision rules to be applied in practice. Of interest is the identification of an optimal DTR, that is, the sequence of treatment decisions that yields the best expected outcome. Statistical methods for identifying optimal DTRs from observational data are theoretically complex and not easily implementable by researchers, especially when the outcome of interest is survival time. We propose a doubly robust, easy to implement method for estimating optimal DTRs with survival endpoints subject to right-censoring which requires solving a series of weighted generalized estimating equations. We provide a proof of consistency that relies on the balancing property of the weights and derive a formula for the asymptotic variance of the resulting estimators. We illustrate our novel approach with an application to the treatment of rheumatoid arthritis using observational data from the Scottish Early Rheumatoid Arthritis Inception Cohort. Our method, called dynamic weighted survival modeling, has been implemented in the DTRreg R package. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1531-1539
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1629939
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1629939
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1531-1539
Template-Type: ReDIF-Article 1.0
Author-Name: Shu Yang
Author-X-Name-First: Shu
Author-X-Name-Last: Yang
Author-Name: Peng Ding
Author-X-Name-First: Peng
Author-X-Name-Last: Ding
Title: Combining Multiple Observational Data Sources to Estimate Causal Effects
Abstract:
The era of big data has witnessed an increasing availability of multiple data sources for statistical analyses. We consider estimation of causal effects combining big main data with unmeasured confounders and smaller validation data with supplementary information on these confounders. Under the unconfoundedness assumption with completely observed confounders, the smaller validation data allow for constructing consistent estimators for causal effects, but the big main data can only give error-prone estimators in general. However, by leveraging the information in the big main data in a principled way, we can improve the estimation efficiencies yet preserve the consistencies of the initial estimators based solely on the validation data. Our framework applies to asymptotically normal estimators, including the commonly used regression imputation, weighting, and matching estimators, and does not require a correct specification of the model relating the unmeasured confounders to the observed variables. We also propose appropriate bootstrap procedures, which make our method straightforward to implement using software routines for existing estimators. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1540-1554
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2019.1609973
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1609973
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1540-1554
Template-Type: ReDIF-Article 1.0
Author-Name: Genevera I. Allen
Author-X-Name-First: Genevera I.
Author-X-Name-Last: Allen
Title: Handbook of Graphical Models
Journal: Journal of the American Statistical Association
Pages: 1555-1557
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2020.1801279
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1801279
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1555-1557
Template-Type: ReDIF-Article 1.0
Author-Name: Ling Leng
Author-X-Name-First: Ling
Author-X-Name-Last: Leng
Title: Statistical Computing With R
Journal: Journal of the American Statistical Association
Pages: 1557-1558
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2020.1801280
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1801280
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1557-1558
Template-Type: ReDIF-Article 1.0
Author-Name: Ming Chen
Author-X-Name-First: Ming
Author-X-Name-Last: Chen
Title: Time Series Clustering and Classification
Journal: Journal of the American Statistical Association
Pages: 1558-1558
Issue: 531
Volume: 115
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2020.1801281
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1801281
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:531:p:1558-1558
Template-Type: ReDIF-Article 1.0
Author-Name: Kaushik Jana
Author-X-Name-First: Kaushik
Author-X-Name-Last: Jana
Author-Name: Debasis Sengupta
Author-X-Name-First: Debasis
Author-X-Name-Last: Sengupta
Author-Name: Subrata Kundu
Author-X-Name-First: Subrata
Author-X-Name-Last: Kundu
Author-Name: Arindam Chakraborty
Author-X-Name-First: Arindam
Author-X-Name-Last: Chakraborty
Author-Name: Purnima Shaw
Author-X-Name-First: Purnima
Author-X-Name-Last: Shaw
Title: The Statistical Face of a Region Under Monsoon Rainfall in Eastern India
Abstract:
A region under rainfall is a contiguous spatial area receiving positive precipitation at a particular time. The probabilistic behavior of such a region is an issue of interest in meteorological studies. A region under rainfall can be viewed as a shape object of a special kind, where scale and rotational invariance are not necessarily desirable attributes of a mathematical representation. For modeling variation in objects of this type, we propose an approximation of the boundary that can be represented as a real valued function, and arrive at further approximation through functional principal component analysis, after suitable adjustment for asymmetry and incompleteness in the data. The analysis of an open access satellite dataset on monsoon precipitation over Eastern Indian subcontinent leads to explanation of most of the variation in shapes of the regions under rainfall through a handful of interpretable functions that can be further approximated parametrically. The most important aspect of shape is found to be the size followed by contraction/elongation, mostly along two pairs of orthogonal axes. The different modes of variation are remarkably stable across calendar years and across different thresholds for minimum size of the region. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1559-1573
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1681275
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1681275
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1559-1573
Template-Type: ReDIF-Article 1.0
Author-Name: Xiangnan Feng
Author-X-Name-First: Xiangnan
Author-X-Name-Last: Feng
Author-Name: Tengfei Li
Author-X-Name-First: Tengfei
Author-X-Name-Last: Li
Author-Name: Xinyuan Song
Author-X-Name-First: Xinyuan
Author-X-Name-Last: Song
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Title: Bayesian Scalar on Image Regression With Nonignorable Nonresponse
Abstract:
Medical imaging has become an increasingly important tool in screening, diagnosis, prognosis, and treatment of various diseases given its information visualization and quantitative assessment. The aim of this article is to develop a Bayesian scalar-on-image regression model to integrate high-dimensional imaging data and clinical data to predict cognitive, behavioral, or emotional outcomes, while allowing for nonignorable missing outcomes. Such a nonignorable nonresponse consideration is motivated by examining the association between baseline characteristics and cognitive abilities for 802 Alzheimer patients enrolled in the Alzheimer’s Disease Neuroimaging Initiative 1 (ADNI1), for which data are partially missing. Ignoring such missing data may distort the accuracy of statistical inference and provoke misleading results. To address this issue, we propose an imaging exponential tilting model to delineate the data missing mechanism and incorporate an instrumental variable to facilitate model identifiability, followed by a Bayesian framework with Markov chain Monte Carlo algorithms to conduct statistical inference. This approach is validated in simulation studies where both the finite sample performance and asymptotic properties are evaluated and compared with the model with fully observed data and that with a misspecified ignorable missing mechanism. Our proposed methods are finally carried out on the ADNI1 dataset, where they capture both clinical risk factors and imaging regions that are consistent with the existing literature and exhibit clinical significance. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1574-1597
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1686391
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1686391
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1574-1597
Template-Type: ReDIF-Article 1.0
Author-Name: Bingduo Yang
Author-X-Name-First: Bingduo
Author-X-Name-Last: Yang
Author-Name: Wei Long
Author-X-Name-First: Wei
Author-X-Name-Last: Long
Author-Name: Liang Peng
Author-X-Name-First: Liang
Author-X-Name-Last: Peng
Author-Name: Zongwu Cai
Author-X-Name-First: Zongwu
Author-X-Name-Last: Cai
Title: Testing the Predictability of U.S. Housing Price Index Returns Based on an IVX-AR Model
Abstract:
We use ten common macroeconomic variables to test for the predictability of the quarterly growth rate of the house price index (HPI) in the United States during 1975:Q1–2018:Q2. We extend the instrumental variable based Wald statistic (IVX-KMS) proposed by Kostakis, Magdalinos, and Stamatogiannis to a new instrumental variable based Wald statistic (IVX-AR) which accounts for serial correlation and heteroscedasticity in the error terms of the linear predictive regression model. Simulation results show that the proposed IVX-AR exhibits excellent size control regardless of the degree of serial correlation in the error terms and the persistency in the predictive variables, while IVX-KMS displays severe size distortions. The empirical results indicate that the percentage of residential fixed investment in GDP is a fairly robust predictor of the growth rate of HPI. However, other macroeconomic variables’ strong predictive ability detected by IVX-KMS is likely to be driven by the highly correlated error terms in the predictive regressions and thus becomes insignificant when the proposed IVX-AR method is implemented. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1598-1619
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1686392
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1686392
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1598-1619
Template-Type: ReDIF-Article 1.0
Author-Name: Yang Ni
Author-X-Name-First: Yang
Author-X-Name-Last: Ni
Author-Name: Peter Müller
Author-X-Name-First: Peter
Author-X-Name-Last: Müller
Author-Name: Yuan Ji
Author-X-Name-First: Yuan
Author-X-Name-Last: Ji
Title: Bayesian Double Feature Allocation for Phenotyping With Electronic Health Records
Abstract:
Electronic health records (EHR) provide opportunities for deeper understanding of human phenotypes—in our case, latent disease—based on statistical modeling. We propose a categorical matrix factorization method to infer latent diseases from EHR data. A latent disease is defined as an unknown biological aberration that causes a set of common symptoms for a group of patients. The proposed approach is based on a novel double feature allocation model which simultaneously allocates features to the rows and the columns of a categorical matrix. Using a Bayesian approach, available prior information on known diseases (e.g., hypertension and diabetes) greatly improves identifiability and interpretability of the latent diseases. We assess the proposed approach by simulation studies including mis-specified models and comparison with sparse latent factor models. In the application to a Chinese EHR dataset, we identify 10 latent diseases, each of which is shared by groups of subjects with specific health traits related to lipid disorder, thrombocytopenia, polycythemia, anemia, bacterial and viral infections, allergy, and malnutrition. The identification of the latent diseases can help healthcare officials better monitor the subjects’ ongoing health conditions and look into potential risk factors and approaches for disease prevention. We cross-check the reported latent diseases with medical literature and find agreement between our discovery and reported findings elsewhere. We provide an R package “dfa” implementing our method and an R shiny web application reporting the findings. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1620-1634
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1686985
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1686985
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1620-1634
Template-Type: ReDIF-Article 1.0
Author-Name: Samrachana Adhikari
Author-X-Name-First: Samrachana
Author-X-Name-Last: Adhikari
Author-Name: Sherri Rose
Author-X-Name-First: Sherri
Author-X-Name-Last: Rose
Author-Name: Sharon-Lise Normand
Author-X-Name-First: Sharon-Lise
Author-X-Name-Last: Normand
Title: Nonparametric Bayesian Instrumental Variable Analysis: Evaluating Heterogeneous Effects of Coronary Arterial Access Site Strategies
Abstract:
Percutaneous coronary interventions (PCIs) are nonsurgical procedures to open blocked blood vessels to the heart, frequently using a catheter to place a stent. The catheter can be inserted into the blood vessels using an artery in the groin or an artery in the wrist. Because clinical trials have indicated that access via the wrist may result in fewer post procedure complications, shortening the length of stay, and ultimately cost less than groin access, adoption of access via the wrist has been encouraged. However, patients treated in usual care are likely to differ from those participating in clinical trials, and there is reason to believe that the effectiveness of wrist access may differ between males and females. Moreover, the choice of artery access strategy is likely to be influenced by patient or physician unmeasured factors. To study the effectiveness of the two artery access site strategies on hospitalization charges, we use data from a state-mandated clinical registry including 7963 patients undergoing PCI. A hierarchical Bayesian likelihood-based instrumental variable analysis under a latent index modeling framework is introduced to jointly model outcomes and treatment status. Our approach accounts for unobserved heterogeneity via a latent factor structure, and permits nonparametric error distributions with Dirichlet process mixture models. Our results demonstrate that artery access in the wrist reduces hospitalization charges compared to access in the groin, with a higher mean reduction for male patients. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1635-1644
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1688663
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1688663
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1635-1644
Template-Type: ReDIF-Article 1.0
Author-Name: Changgee Chang
Author-X-Name-First: Changgee
Author-X-Name-Last: Chang
Author-Name: Jeong Hoon Jang
Author-X-Name-First: Jeong Hoon
Author-X-Name-Last: Jang
Author-Name: Amita Manatunga
Author-X-Name-First: Amita
Author-X-Name-Last: Manatunga
Author-Name: Andrew T. Taylor
Author-X-Name-First: Andrew T.
Author-X-Name-Last: Taylor
Author-Name: Qi Long
Author-X-Name-First: Qi
Author-X-Name-Last: Long
Title: A Bayesian Latent Class Model to Predict Kidney Obstruction in the Absence of Gold Standard
Abstract:
Kidney obstruction, if untreated in a timely manner, can lead to irreversible loss of renal function. A widely used technology for evaluations of kidneys with suspected obstruction is diuresis renography. However, it is generally very challenging for radiologists who typically interpret renography data in practice to build a high level of competency due to the low volume of renography studies and insufficient training. Another challenge is that there is currently no gold standard for detection of kidney obstruction. Seeking to develop a computer-aided diagnostic (CAD) tool that can assist practicing radiologists to reduce errors in the interpretation of kidney obstruction, a recent study collected data from diuresis renography, interpretations of the renography data from highly experienced nuclear medicine experts, as well as clinical data. To achieve the objective, we develop a statistical model that can be used as a CAD tool for assisting radiologists in kidney interpretation. We use a Bayesian latent class modeling approach for predicting kidney obstruction through the integrative analysis of time-series renogram data, expert ratings, and clinical variables. A nonparametric Bayesian latent factor regression approach is adopted for modeling renogram curves in which the coefficients of the basis functions are parameterized via the factor loadings dependent on the latent disease status and the extended latent factors that can also adjust for clinical variables. A hierarchical probit model is used for expert ratings, allowing for training with rating data from multiple experts while predicting with at most one expert, which makes the proposed model operable in practice. An efficient MCMC algorithm is developed to train the model and predict kidney obstruction with associated uncertainty. We demonstrate the superiority of the proposed method over several existing methods through extensive simulations. Analysis of the renal study also lends support to the usefulness of our model as a CAD tool to assist less experienced radiologists in the field. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1645-1663
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1689983
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1689983
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1645-1663
Template-Type: ReDIF-Article 1.0
Author-Name: Chih-Li Sung
Author-X-Name-First: Chih-Li
Author-X-Name-Last: Sung
Author-Name: Ying Hung
Author-X-Name-First: Ying
Author-X-Name-Last: Hung
Author-Name: William Rittase
Author-X-Name-First: William
Author-X-Name-Last: Rittase
Author-Name: Cheng Zhu
Author-X-Name-First: Cheng
Author-X-Name-Last: Zhu
Author-Name: C. F. J. Wu
Author-X-Name-First: C. F. J.
Author-X-Name-Last: Wu
Title: Calibration for Computer Experiments With Binary Responses and Application to Cell Adhesion Study
Abstract:
Calibration refers to the estimation of unknown parameters which are present in computer experiments but not available in physical experiments. An accurate estimation of these parameters is important because it provides a scientific understanding of the underlying system which is not available in physical experiments. Most of the work in the literature is limited to the analysis of continuous responses. Motivated by a study of cell adhesion experiments, we propose a new calibration framework for binary responses. Its application to the T cell adhesion data provides insight into the unknown values of the kinetic parameters which are difficult to determine by physical experiments due to the limitation of the existing experimental techniques. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1664-1674
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1699419
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1699419
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1664-1674
Template-Type: ReDIF-Article 1.0
Author-Name: Samuel D. Pimentel
Author-X-Name-First: Samuel D.
Author-X-Name-Last: Pimentel
Author-Name: Rachel R. Kelz
Author-X-Name-First: Rachel R.
Author-X-Name-Last: Kelz
Title: Optimal Tradeoffs in Matched Designs Comparing US-Trained and Internationally Trained Surgeons
Abstract:
Does receiving a medical education outside the United States impact a surgeon’s performance? We study this question by matching operations performed by internationally trained surgeons to those performed by US-trained surgeons in reanalysis of a large health outcomes study. An effective matched design must achieve several goals, including balancing covariate distributions marginally, ensuring units within individual pairs have similar values on key covariates, and using a sufficiently large sample from the raw data. Yet in our study, optimizing some of these goals forces less desirable results on others. We address such tradeoffs from a multi-objective optimization perspective by creating matched designs that are Pareto optimal with respect to two goals. We provide general tools for generating representative subsets of Pareto optimal solution sets and articulate how they can be used to improve decision-making in observational study design. In the motivating surgical outcomes study, formulating a multi-objective version of the problem helps us balance an important variable without sacrificing two other design goals, average closeness of matched pairs on a multivariate distance and size of the final matched sample. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1675-1688
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2020.1720693
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1720693
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1675-1688
Template-Type: ReDIF-Article 1.0
Author-Name: Emma G. Thomas
Author-X-Name-First: Emma G.
Author-X-Name-Last: Thomas
Author-Name: Lorenzo Trippa
Author-X-Name-First: Lorenzo
Author-X-Name-Last: Trippa
Author-Name: Giovanni Parmigiani
Author-X-Name-First: Giovanni
Author-X-Name-Last: Parmigiani
Author-Name: Francesca Dominici
Author-X-Name-First: Francesca
Author-X-Name-Last: Dominici
Title: Estimating the Effects of Fine Particulate Matter on 432 Cardiovascular Diseases Using Multi-Outcome Regression With Tree-Structured Shrinkage
Abstract:
The positive relationship between airborne fine particulate matter (PM2.5) and cardiovascular disease (CVD) is established. Little is known about effect size heterogeneity across distinct CVD outcomes. We conducted a multi-outcome case-crossover study of Medicare beneficiaries aged >65 years residing in the mainland USA from 2000 through 2012. The exposure was two-day average PM2.5 in each individual’s residential zipcode. The outcomes were hospitalization for 432 distinct CVDs defined by the International Classification of Diseases, Revision 9. Our dataset included almost 24 million CVD hospitalizations. We analyzed the data using multi-outcome regression with tree-structured shrinkage (MOReTreeS), a novel method that enables: (1) borrowing of strength across outcomes; (2) data-driven discovery of outcome groups that are similarly affected by the exposure; (3) estimation of a single effect for each group. MOReTreeS grouped 420 outcomes together; for this group, the odds ratio [OR] for hospitalization associated with a 10 μg m−3 increase in PM2.5 was 1.011 (95% credible interval [CI] = 1.011–1.012). The model identified congestive heart failure as having the strongest positive association with PM2.5 (OR = 1.019; 95%CI = 1.017–1.022). Some outcomes exhibited negative associations with PM2.5, including aortic dissection, subarachnoid and intracerebral hemorrhage, abdominal aneurysm, and essential hypertension; further research is needed to understand these counterintuitive findings. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1689-1699
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2020.1722134
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1722134
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1689-1699
Template-Type: ReDIF-Article 1.0
Author-Name: Lan Wang
Author-X-Name-First: Lan
Author-X-Name-Last: Wang
Author-Name: Bo Peng
Author-X-Name-First: Bo
Author-X-Name-Last: Peng
Author-Name: Jelena Bradic
Author-X-Name-First: Jelena
Author-X-Name-Last: Bradic
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Author-Name: Yunan Wu
Author-X-Name-First: Yunan
Author-X-Name-Last: Wu
Title: A Tuning-free Robust and Efficient Approach to High-dimensional Regression
Abstract:
We introduce a novel approach for high-dimensional regression with theoretical guarantees. The new procedure overcomes the challenge of tuning parameter selection of Lasso and possesses several appealing properties. It uses an easily simulated tuning parameter that automatically adapts to both the unknown random error distribution and the correlation structure of the design matrix. It is robust, with substantial efficiency gain for heavy-tailed random errors, while maintaining high efficiency for normal random errors. Compared with alternative robust regression procedures, it also enjoys the property of being equivariant when the response variable undergoes a scale transformation. Computationally, it can be efficiently solved via linear programming. Theoretically, under weak conditions on the random error distribution, we establish a finite-sample error bound with a near-oracle rate for the new estimator with the simulated tuning parameter. Our results make useful contributions to bridging the gap between the practice and theory of Lasso and its variants. We also prove that further improvement in efficiency can be achieved by a second-stage enhancement with some light tuning. Our simulation results demonstrate that the proposed methods often outperform cross-validated Lasso in various settings.
Journal: Journal of the American Statistical Association
Pages: 1700-1714
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2020.1840989
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1840989
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1700-1714
Template-Type: ReDIF-Article 1.0
Author-Name: Po-Ling Loh
Author-X-Name-First: Po-Ling
Author-X-Name-Last: Loh
Title: Comment on “A Tuning-Free Robust and Efficient Approach to High-Dimensional Regression”
Journal: Journal of the American Statistical Association
Pages: 1715-1716
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2020.1837141
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1837141
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1715-1716
Template-Type: ReDIF-Article 1.0
Author-Name: Xiudi Li
Author-X-Name-First: Xiudi
Author-X-Name-Last: Li
Author-Name: Ali Shojaie
Author-X-Name-First: Ali
Author-X-Name-Last: Shojaie
Title: Discussion of “A Tuning-Free Robust and Efficient Approach to High-Dimensional Regression”
Journal: Journal of the American Statistical Association
Pages: 1717-1719
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2020.1837139
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1837139
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1717-1719
Template-Type: ReDIF-Article 1.0
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Cong Ma
Author-X-Name-First: Cong
Author-X-Name-Last: Ma
Author-Name: Kaizheng Wang
Author-X-Name-First: Kaizheng
Author-X-Name-Last: Wang
Title: Comment on “A Tuning-Free Robust and Efficient Approach to High-Dimensional Regression”
Journal: Journal of the American Statistical Association
Pages: 1720-1725
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2020.1837138
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1837138
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1720-1725
Template-Type: ReDIF-Article 1.0
Author-Name: Lan Wang
Author-X-Name-First: Lan
Author-X-Name-Last: Wang
Author-Name: Bo Peng
Author-X-Name-First: Bo
Author-X-Name-Last: Peng
Author-Name: Jelena Bradic
Author-X-Name-First: Jelena
Author-X-Name-Last: Bradic
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Author-Name: Yunan Wu
Author-X-Name-First: Yunan
Author-X-Name-Last: Wu
Title: Rejoinder to “A Tuning-Free Robust and Efficient Approach to High-Dimensional Regression”
Journal: Journal of the American Statistical Association
Pages: 1726-1729
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2020.1843865
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1843865
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1726-1729
Template-Type: ReDIF-Article 1.0
Author-Name: Alexander M. Franks
Author-X-Name-First: Alexander M.
Author-X-Name-Last: Franks
Author-Name: Alexander D’Amour
Author-X-Name-First: Alexander
Author-X-Name-Last: D’Amour
Author-Name: Avi Feller
Author-X-Name-First: Avi
Author-X-Name-Last: Feller
Title: Flexible Sensitivity Analysis for Observational Studies Without Observable Implications
Abstract:
A fundamental challenge in observational causal inference is that assumptions about unconfoundedness are not testable from data. Assessing sensitivity to such assumptions is therefore important in practice. Unfortunately, some existing sensitivity analysis approaches inadvertently impose restrictions that are at odds with modern causal inference methods, which emphasize flexible models for observed data. To address this issue, we propose a framework that allows (1) flexible models for the observed data and (2) clean separation of the identified and unidentified parts of the sensitivity model. Our framework extends an approach from the missing data literature, known as Tukey’s factorization, to the causal inference setting. Under this factorization, we can represent the distributions of unobserved potential outcomes in terms of unidentified selection functions that posit a relationship between treatment assignment and unobserved potential outcomes. The sensitivity parameters in this framework are easily interpreted, and we provide heuristics for calibrating these parameters against observable quantities. We demonstrate the flexibility of this approach in two examples, where we estimate both average treatment effects and quantile treatment effects using Bayesian nonparametric models for the observed data.
Journal: Journal of the American Statistical Association
Pages: 1730-1746
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1604369
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604369
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1730-1746
Template-Type: ReDIF-Article 1.0
Author-Name: Leying Guan
Author-X-Name-First: Leying
Author-X-Name-Last: Guan
Author-Name: Xi Chen
Author-X-Name-First: Xi
Author-X-Name-Last: Chen
Author-Name: Wing Hung Wong
Author-X-Name-First: Wing Hung
Author-X-Name-Last: Wong
Title: Detecting Strong Signals in Gene Perturbation Experiments: An Adaptive Approach With Power Guarantee and FDR Control
Abstract:
The perturbation of a transcription factor should affect the expression levels of its direct targets. However, not all genes showing changes in expression are direct targets. To increase the chance of detecting direct targets, we propose a modified two-group model where the null group corresponds to genes which are not direct targets but can have small nonzero effects. We model the behavior of genes from the null set by a Gaussian distribution with unknown variance τ². To estimate τ², we focus on a simple estimation approach, the iterated empirical Bayes (EB) estimation. We conduct a detailed analysis of the properties of the iterated EB estimate and provide theoretical guarantees of its good performance under mild conditions. We provide simulations comparing the new modeling approach with existing methods, and the new approach shows more stable and better performance under different situations. We also apply it to a real dataset from gene knock-down experiments and obtain better results compared with the original two-group model testing for nonzero effects.
Journal: Journal of the American Statistical Association
Pages: 1747-1755
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1635484
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1635484
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1747-1755
Template-Type: ReDIF-Article 1.0
Author-Name: Yunxiao Chen
Author-X-Name-First: Yunxiao
Author-X-Name-Last: Chen
Author-Name: Xiaoou Li
Author-X-Name-First: Xiaoou
Author-X-Name-Last: Li
Author-Name: Siliang Zhang
Author-X-Name-First: Siliang
Author-X-Name-Last: Zhang
Title: Structured Latent Factor Analysis for Large-scale Data: Identifiability, Estimability, and Their Implications
Abstract:
Latent factor models are widely used to measure unobserved latent traits in social and behavioral sciences, including psychology, education, and marketing. When used in a confirmatory manner, design information is incorporated as zero constraints on corresponding parameters, yielding structured (confirmatory) latent factor models. In this article, we study how such design information affects the identifiability and the estimation of a structured latent factor model. Insights are gained through both asymptotic and nonasymptotic analyses. Our asymptotic results are established under a regime where both the number of manifest variables and the sample size diverge, motivated by applications to large-scale data. Under this regime, we define the structural identifiability of the latent factors and establish necessary and sufficient conditions that ensure structural identifiability. In addition, we propose an estimator which is shown to be consistent and rate optimal when structural identifiability holds. Finally, a nonasymptotic error bound is derived for this estimator, through which the effect of design information is further quantified. Our results shed light on the design of large-scale measurement in education and psychology and have important implications for measurement validity and reliability.
Journal: Journal of the American Statistical Association
Pages: 1756-1770
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1635485
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1635485
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1756-1770
Template-Type: ReDIF-Article 1.0
Author-Name: Jianwei Hu
Author-X-Name-First: Jianwei
Author-X-Name-Last: Hu
Author-Name: Hong Qin
Author-X-Name-First: Hong
Author-X-Name-Last: Qin
Author-Name: Ting Yan
Author-X-Name-First: Ting
Author-X-Name-Last: Yan
Author-Name: Yunpeng Zhao
Author-X-Name-First: Yunpeng
Author-X-Name-Last: Zhao
Title: Corrected Bayesian Information Criterion for Stochastic Block Models
Abstract:
Estimating the number of communities is one of the fundamental problems in community detection. We re-examine the Bayesian paradigm for stochastic block models (SBMs) and propose a “corrected Bayesian information criterion” (CBIC) to determine the number of communities, and show that the proposed criterion is consistent under mild conditions as the size of the network and the number of communities go to infinity. The CBIC outperforms the criteria used in Wang and Bickel and in Saldana, Yu, and Feng, which tend to underestimate and overestimate the number of communities, respectively. The results are further extended to degree-corrected SBMs. Numerical studies demonstrate our theoretical results.
Journal: Journal of the American Statistical Association
Pages: 1771-1783
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1637744
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1637744
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1771-1783
Template-Type: ReDIF-Article 1.0
Author-Name: Minsuk Shin
Author-X-Name-First: Minsuk
Author-X-Name-Last: Shin
Author-Name: Anirban Bhattacharya
Author-X-Name-First: Anirban
Author-X-Name-Last: Bhattacharya
Author-Name: Valen E. Johnson
Author-X-Name-First: Valen E.
Author-X-Name-Last: Johnson
Title: Functional Horseshoe Priors for Subspace Shrinkage
Abstract:
We introduce a new shrinkage prior on function spaces, called the functional horseshoe (fHS) prior, that encourages shrinkage toward parametric classes of functions. Unlike other shrinkage priors for parametric models, the fHS shrinkage acts on the shape of the function rather than inducing sparsity on model parameters. We study the efficacy of the proposed approach by showing an adaptive posterior concentration property on the function. We also demonstrate consistency of the model selection procedure that thresholds the shrinkage parameter of the fHS prior. We apply the fHS prior to nonparametric additive models and compare its performance with procedures based on the standard horseshoe prior and several penalized likelihood approaches. We find that the new procedure achieves smaller estimation error and more accurate model selection than other procedures in several simulated and real examples. Supplementary materials for this article, which contain additional simulated and real data examples, MCMC diagnostics, and proofs of the theoretical results, are available online.
Journal: Journal of the American Statistical Association
Pages: 1784-1797
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1654875
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1654875
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1784-1797
Template-Type: ReDIF-Article 1.0
Author-Name: Xiao Nie
Author-X-Name-First: Xiao
Author-X-Name-Last: Nie
Author-Name: Peter Chien
Author-X-Name-First: Peter
Author-X-Name-Last: Chien
Author-Name: Dane Morgan
Author-X-Name-First: Dane
Author-X-Name-Last: Morgan
Author-Name: Amy Kaczmarowski
Author-X-Name-First: Amy
Author-X-Name-Last: Kaczmarowski
Title: A Statistical Method for Emulation of Computer Models With Invariance-Preserving Properties, With Application to Structural Energy Prediction
Abstract:
Statistical design and analysis of computer experiments is a growing area in statistics. Computer models with structural invariance properties now appear frequently in materials science, physics, biology, and other fields. These properties are consequences of dependency on structural geometry and cannot be accommodated by standard statistical emulation methods. In this article, we propose a statistical framework for building emulators that preserve invariance. The framework uses a weighted complete graph to represent the geometry and introduces a new class of functions, called relabeling symmetric functions, associated with the graph. We establish a characterization theorem for relabeling symmetric functions and propose a nonparametric kernel method for estimating such functions. The effectiveness of the proposed method is illustrated by examples from materials science. Supplemental material for this article can be found online.
Journal: Journal of the American Statistical Association
Pages: 1798-1811
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1654876
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1654876
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1798-1811
Template-Type: ReDIF-Article 1.0
Author-Name: A. S. Hedayat
Author-X-Name-First: A. S.
Author-X-Name-Last: Hedayat
Author-Name: Heng Xu
Author-X-Name-First: Heng
Author-X-Name-Last: Xu
Author-Name: Wei Zheng
Author-X-Name-First: Wei
Author-X-Name-Last: Zheng
Title: Optimal Designs for the Two-Dimensional Interference Model
Abstract:
Recently, there have been some major advances in the theory of optimal designs for interference models when the block is arranged in a one-dimensional layout. Relatively speaking, the study of the two-dimensional interference model is quite limited, partly due to technical difficulties. This article tries to fill this gap. Specifically, we set the tone by characterizing all possible universally optimal designs simultaneously through one system of linear equations with respect to the proportions of block arrays. However, this system is not readily solvable due to the extremely large number of block arrays. This computational issue can be resolved by identifying a small subset of block arrays with the theoretical guarantee that any optimal design is supported by this subset. The two-dimensional layout of the block makes this task technically challenging, and we theoretically derive such a subset for any size of the treatment array and any number of treatments under comparison. This facilitates the development of algorithms for deriving either approximate or exact designs. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1812-1821
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1654877
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1654877
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1812-1821
Template-Type: ReDIF-Article 1.0
Author-Name: Yingying Fan
Author-X-Name-First: Yingying
Author-X-Name-Last: Fan
Author-Name: Jinchi Lv
Author-X-Name-First: Jinchi
Author-X-Name-Last: Lv
Author-Name: Mahrad Sharifvaghefi
Author-X-Name-First: Mahrad
Author-X-Name-Last: Sharifvaghefi
Author-Name: Yoshimasa Uematsu
Author-X-Name-First: Yoshimasa
Author-X-Name-Last: Uematsu
Title: IPAD: Stable Interpretable Forecasting with Knockoffs Inference
Abstract:
Interpretability and stability are two important features that are desired in many contemporary big data applications arising in statistics, economics, and finance. While the former is enjoyed to some extent by many existing forecasting approaches, the latter, in the sense of controlling the fraction of wrongly discovered features, which can greatly enhance interpretability, is still largely underdeveloped. To this end, in this article, we exploit the general framework of model-X knockoffs introduced recently in Candès, Fan, Janson and Lv [(2018), “Panning for Gold: ‘model X’ Knockoffs for High Dimensional Controlled Variable Selection,” Journal of the Royal Statistical Society, Series B, 80, 551–577], which is nonconventional for reproducible large-scale inference in that the framework is completely free of the use of p-values for significance testing, and suggest a new method of intertwined probabilistic factors decoupling (IPAD) for stable interpretable forecasting with knockoffs inference in high-dimensional models. The recipe of the method is constructing the knockoff variables by assuming a latent factor model that is exploited widely in economics and finance for the association structure of covariates. Our work is distinct from the existing literature in that we estimate the covariate distribution from data instead of assuming it is known when constructing the knockoff variables, our procedure does not require any sample splitting, we provide theoretical justification of the asymptotic false discovery rate control, and we also establish the theory for the power analysis. Several simulation examples and the real data analysis further demonstrate that the newly suggested method has appealing finite-sample performance, with desired interpretability and stability, compared to some popularly used forecasting methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1822-1834
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1654878
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1654878
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1822-1834
Template-Type: ReDIF-Article 1.0
Author-Name: Jelena Bradic
Author-X-Name-First: Jelena
Author-X-Name-Last: Bradic
Author-Name: Gerda Claeskens
Author-X-Name-First: Gerda
Author-X-Name-Last: Claeskens
Author-Name: Thomas Gueuning
Author-X-Name-First: Thomas
Author-X-Name-Last: Gueuning
Title: Fixed Effects Testing in High-Dimensional Linear Mixed Models
Abstract:
Many scientific and engineering challenges—ranging from pharmacokinetic drug dosage allocation and personalized medicine to marketing mix (4Ps) recommendations—require an understanding of unobserved heterogeneity to develop the best decision-making processes. In this article, we develop a hypothesis test and the corresponding p-value for testing the significance of the homogeneous structure in linear mixed models. A robust matching moment construction is used to create a test that adapts to the size of the model sparsity. When unobserved heterogeneity at a cluster level is constant, we show that our test is both consistent and unbiased even when the dimension of the model is extremely high. Our theoretical results rely on a new family of adaptive sparse estimators of the fixed effects that do not require consistent estimation of the random effects. Moreover, our inference results do not require consistent model selection. We show that moment matching can be extended to nonlinear mixed effects models and to generalized linear mixed effects models. In numerical and real data experiments, we find that the developed method is extremely accurate, adapts to the size of the underlying model, and is decidedly powerful in the presence of irrelevant covariates. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1835-1850
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1660172
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1660172
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1835-1850
Template-Type: ReDIF-Article 1.0
Author-Name: Xinwei Ma
Author-X-Name-First: Xinwei
Author-X-Name-Last: Ma
Author-Name: Jingshen Wang
Author-X-Name-First: Jingshen
Author-X-Name-Last: Wang
Title: Robust Inference Using Inverse Probability Weighting
Abstract:
Inverse probability weighting (IPW) is widely used in empirical work in economics and other disciplines. As Gaussian approximations perform poorly in the presence of “small denominators,” trimming is routinely employed as a regularization strategy. However, ad hoc trimming of the observations renders usual inference procedures invalid for the target estimand, even in large samples. In this article, we first show that the IPW estimator can have different (Gaussian or non-Gaussian) asymptotic distributions, depending on how “close to zero” the probability weights are and on how large the trimming threshold is. As a remedy, we propose an inference procedure that is robust not only to small probability weights entering the IPW estimator but also to a wide range of trimming threshold choices, by adapting to these different asymptotic distributions. This robustness is achieved by employing resampling techniques and by correcting a non-negligible trimming bias. We also propose an easy-to-implement method for choosing the trimming threshold by minimizing an empirical analogue of the asymptotic mean squared error. In addition, we show that our inference procedure remains valid with the use of a data-driven trimming threshold. We illustrate our method by revisiting a dataset from the National Supported Work program. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1851-1860
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1660173
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1660173
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1851-1860
Template-Type: ReDIF-Article 1.0
Author-Name: Yaniv Romano
Author-X-Name-First: Yaniv
Author-X-Name-Last: Romano
Author-Name: Matteo Sesia
Author-X-Name-First: Matteo
Author-X-Name-Last: Sesia
Author-Name: Emmanuel Candès
Author-X-Name-First: Emmanuel
Author-X-Name-Last: Candès
Title: Deep Knockoffs
Abstract:
This article introduces a machine for sampling approximate model-X knockoffs for arbitrary and unspecified data distributions using deep generative models. The main idea is to iteratively refine a knockoff sampling mechanism until a criterion measuring the validity of the produced knockoffs is optimized; this criterion is inspired by the popular maximum mean discrepancy in machine learning and can be thought of as measuring the distance to pairwise exchangeability between original and knockoff features. By building upon the existing model-X framework, we thus obtain a flexible and model-free statistical tool to perform controlled variable selection. Extensive numerical experiments and quantitative tests confirm the generality, effectiveness, and power of our deep knockoff machines. Finally, we apply this new method to a real study of mutations linked to changes in drug resistance in the human immunodeficiency virus. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1861-1872
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1660174
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1660174
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1861-1872
Template-Type: ReDIF-Article 1.0
Author-Name: Eduardo García-Portugués
Author-X-Name-First: Eduardo
Author-X-Name-Last: García-Portugués
Author-Name: Davy Paindaveine
Author-X-Name-First: Davy
Author-X-Name-Last: Paindaveine
Author-Name: Thomas Verdebout
Author-X-Name-First: Thomas
Author-X-Name-Last: Verdebout
Title: On Optimal Tests for Rotational Symmetry Against New Classes of Hyperspherical Distributions
Abstract:
Motivated by the central role played by rotationally symmetric distributions in directional statistics, we consider the problem of testing rotational symmetry on the hypersphere. We adopt a semiparametric approach and tackle problems where the location of the symmetry axis is either specified or unspecified. For each problem, we define two tests and study their asymptotic properties under very mild conditions. We introduce two new classes of directional distributions that extend the rotationally symmetric class and are of independent interest. We prove that each test is locally asymptotically maximin, in the Le Cam sense, for one kind of the alternatives given by the new classes of distributions, for both specified and unspecified symmetry axis. The tests, aimed to detect location- and scatter-like alternatives, are combined into convenient hybrid tests that are consistent against both alternatives. We perform Monte Carlo experiments that illustrate the finite-sample performances of the proposed tests and their agreement with the asymptotic results. Finally, the practical relevance of our tests is illustrated on a real data application from astronomy. The R package rotasym implements the proposed tests and allows practitioners to reproduce the data application. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1873-1887
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1665527
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1665527
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1873-1887
Template-Type: ReDIF-Article 1.0
Author-Name: S. R. Johnson
Author-X-Name-First: S. R.
Author-X-Name-Last: Johnson
Author-Name: D. A. Henderson
Author-X-Name-First: D. A.
Author-X-Name-Last: Henderson
Author-Name: R. J. Boys
Author-X-Name-First: R. J.
Author-X-Name-Last: Boys
Title: Revealing Subgroup Structure in Ranked Data Using a Bayesian WAND
Abstract:
Ranked data arise in many areas of application ranging from the ranking of up-regulated genes for cancer to the ranking of academic statistics journals. Complications can arise when rankers do not report a full ranking of all entities; for example, they might only report their top-M ranked entities after seeing some or all entities. It can also be useful to know whether rankers are equally informative, and whether some entities are effectively judged to be exchangeable. Revealing subgroup structure in the data may also be helpful in understanding the distribution of ranker views. In this paper, we propose a flexible Bayesian nonparametric model for identifying heterogeneous structure and ranker reliability in ranked data. The model is a weighted adapted nested Dirichlet (WAND) process mixture of Plackett–Luce models and inference proceeds through a simple and efficient Gibbs sampling scheme for posterior sampling. The richness of information in the posterior distribution allows us to infer many details of the structure both between ranker groups and between entity groups (within-ranker groups). Our modeling framework also facilitates a flexible representation of the posterior predictive distribution. This flexibility is important as we propose to use the posterior predictive distribution as the basis for addressing the rank aggregation problem, and also for identifying lack of model fit. The methodology is illustrated using several simulation studies and real data examples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1888-1901
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1665528
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1665528
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1888-1901
Template-Type: ReDIF-Article 1.0
Author-Name: P. Hall
Author-X-Name-First: P.
Author-X-Name-Last: Hall
Author-Name: I.M. Johnstone
Author-X-Name-First: I.M.
Author-X-Name-Last: Johnstone
Author-Name: J.T. Ormerod
Author-X-Name-First: J.T.
Author-X-Name-Last: Ormerod
Author-Name: M.P. Wand
Author-X-Name-First: M.P.
Author-X-Name-Last: Wand
Author-Name: J.C.F. Yu
Author-X-Name-First: J.C.F.
Author-X-Name-Last: Yu
Title: Fast and Accurate Binary Response Mixed Model Analysis via Expectation Propagation
Abstract:
Expectation propagation is a general prescription for approximation of integrals in statistical inference problems. Its literature is mainly concerned with Bayesian inference scenarios. However, expectation propagation can also be used to approximate integrals arising in frequentist statistical inference. We focus on likelihood-based inference for binary response mixed models and show that fast and accurate quadrature-free inference can be realized for the probit link case with multivariate random effects and higher levels of nesting. The approach is supported by asymptotic calculations in which expectation propagation is seen to provide consistent estimation of the exact likelihood surface. Numerical studies reveal the availability of fast, highly accurate and scalable methodology for binary mixed model analysis.
Journal: Journal of the American Statistical Association
Pages: 1902-1916
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1665529
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1665529
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1902-1916
Template-Type: ReDIF-Article 1.0
Author-Name: David Benkeser
Author-X-Name-First: David
Author-X-Name-Last: Benkeser
Author-Name: Maya Petersen
Author-X-Name-First: Maya
Author-X-Name-Last: Petersen
Author-Name: Mark J. van der Laan
Author-X-Name-First: Mark J.
Author-X-Name-Last: van der Laan
Title: Improved Small-Sample Estimation of Nonlinear Cross-Validated Prediction Metrics
Abstract:
When predicting an outcome is the scientific goal, one must decide on a metric by which to evaluate the quality of predictions. We consider the problem of measuring the performance of a prediction algorithm with the same data that were used to train the algorithm. Typical approaches involve bootstrapping or cross-validation. However, we demonstrate that bootstrap-based approaches often fail and standard cross-validation estimators may perform poorly. We provide a general study of cross-validation-based estimators that highlights the source of this poor performance, and propose an alternative framework for estimation using techniques from the efficiency theory literature. We provide a theorem establishing the weak convergence of our estimators. The general theorem is applied in detail to two specific examples and we discuss possible extensions to other parameters of interest. For the two explicit examples that we consider, our estimators demonstrate remarkable finite-sample improvements over standard approaches. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1917-1932
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1668794
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1668794
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1917-1932
Template-Type: ReDIF-Article 1.0
Author-Name: Scott A. Bruce
Author-X-Name-First: Scott A.
Author-X-Name-Last: Bruce
Author-Name: Cheng Yong Tang
Author-X-Name-First: Cheng Yong
Author-X-Name-Last: Tang
Author-Name: Martica H. Hall
Author-X-Name-First: Martica H.
Author-X-Name-Last: Hall
Author-Name: Robert T. Krafty
Author-X-Name-First: Robert T.
Author-X-Name-Last: Krafty
Title: Empirical Frequency Band Analysis of Nonstationary Time Series
Abstract:
The time-varying power spectrum of a time series process is a bivariate function that quantifies the magnitude of oscillations at different frequencies and times. To obtain low-dimensional, parsimonious measures from this functional parameter, applied researchers consider collapsed measures of power within local bands that partition the frequency space. Frequency bands commonly used in the scientific literature were historically derived, but they are not guaranteed to be optimal or justified for adequately summarizing information from a given time series process under current study. There is a dearth of methods for empirically constructing statistically optimal bands for a given signal. The goal of this article is to provide a standardized, unifying approach for deriving and analyzing customized frequency bands. A consistent, frequency-domain, iterative cumulative sum based scanning procedure is formulated to identify frequency bands that best preserve nonstationary information. A formal hypothesis testing procedure is also developed to test which, if any, frequency bands remain stationary. The proposed method is used to analyze heart rate variability of a patient during sleep and uncovers a refined partition of frequency bands that best summarize the time-varying power spectrum. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1933-1945
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1671199
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1671199
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1933-1945
Template-Type: ReDIF-Article 1.0
Author-Name: Ran Tao
Author-X-Name-First: Ran
Author-X-Name-Last: Tao
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: Dan-Yu Lin
Author-X-Name-First: Dan-Yu
Author-X-Name-Last: Lin
Title: Optimal Designs of Two-Phase Studies
Abstract:
The two-phase design is a cost-effective sampling strategy to evaluate the effects of covariates on an outcome when certain covariates are too expensive to be measured on all study subjects. Under such a design, the outcome and inexpensive covariates are measured on all subjects in the first phase and the first-phase information is used to select subjects for measurements of expensive covariates in the second phase. Previous research on two-phase studies has focused largely on the inference procedures rather than the design aspects. We investigate the design efficiency of the two-phase study, as measured by the semiparametric efficiency bound for estimating the regression coefficients of expensive covariates. We consider general two-phase studies, where the outcome variable can be continuous, discrete, or censored, and the second-phase sampling can depend on the first-phase data in any manner. We develop optimal or approximately optimal two-phase designs, which can be substantially more efficient than the existing designs. We demonstrate the improvements of the new designs over the existing ones through extensive simulation studies and two large medical studies. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1946-1959
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1671200
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1671200
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1946-1959
Template-Type: ReDIF-Article 1.0
Author-Name: Dachuan Chen
Author-X-Name-First: Dachuan
Author-X-Name-Last: Chen
Author-Name: Per A. Mykland
Author-X-Name-First: Per A.
Author-X-Name-Last: Mykland
Author-Name: Lan Zhang
Author-X-Name-First: Lan
Author-X-Name-Last: Zhang
Title: The Five Trolls Under the Bridge: Principal Component Analysis With Asynchronous and Noisy High Frequency Data
Abstract:
We develop a principal component analysis (PCA) for high frequency data. As in Northern fairy tales, there are trolls waiting for the explorer. The first three trolls are market microstructure noise, asynchronous sampling times, and edge effects in estimators. To get around these, a robust estimator of the spot covariance matrix is developed based on the smoothed two-scale realized variance (S-TSRV). The fourth troll is how to pass from the estimated time-varying covariance matrix to PCA. Under finite dimensionality, we develop this methodology through the estimation of realized spectral functions. Rates of convergence and central limit theory, as well as an estimator of standard error, are established. The fifth troll is high dimension on top of high frequency, where we also develop PCA. With the help of a new identity concerning the spot principal orthogonal complement, high-dimensional rates of convergence are established after eliminating several strong assumptions of classical PCA. As an application, we show that our first principal component (PC) closely matches but potentially outperforms the S&P 100 market index. From a statistical standpoint, the close match between the first PC and the market index also corroborates this PCA procedure and the underlying S-TSRV matrix, in the sense of Karl Popper. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1960-1977
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1672555
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1672555
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1960-1977
Template-Type: ReDIF-Article 1.0
Author-Name: Jing Lei
Author-X-Name-First: Jing
Author-X-Name-Last: Lei
Title: Cross-Validation With Confidence
Abstract:
Cross-validation is one of the most popular model and tuning parameter selection methods in statistics and machine learning. Despite its wide applicability, traditional cross-validation methods tend to overfit because they ignore the uncertainty in the testing sample. We develop a novel statistically principled inference tool based on cross-validation that takes into account the uncertainty in the testing sample. This method outputs a set of highly competitive candidate models containing the optimal one with guaranteed probability. As a consequence, our method can achieve consistent variable selection in a classical linear regression setting, for which existing cross-validation methods require unconventional split ratios. When used for tuning parameter selection, the method can provide an alternative trade-off between prediction accuracy and model interpretability to that offered by existing variants of cross-validation. We demonstrate the performance of the proposed method in several simulated and real data examples. Supplemental materials for this article can be found online.
Journal: Journal of the American Statistical Association
Pages: 1978-1997
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1672556
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1672556
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1978-1997
Template-Type: ReDIF-Article 1.0
Author-Name: Minerva Mukhopadhyay
Author-X-Name-First: Minerva
Author-X-Name-Last: Mukhopadhyay
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Targeted Random Projection for Prediction From High-Dimensional Features
Abstract:
We consider the problem of computationally efficient prediction with high dimensional and highly correlated predictors when accurate variable selection is effectively impossible. Direct application of penalization or Bayesian methods implemented with Markov chain Monte Carlo can be computationally daunting and unstable. A common solution is first-stage dimension reduction through screening or projecting the design matrix to a lower dimensional hyperplane. Screening is highly sensitive to threshold choice, while projections often have poor performance in very high dimensions. We propose targeted random projection (TARP) to combine positive aspects of both strategies. TARP uses screening to order the inclusion probabilities of the features in the projection matrix used for dimension reduction, leading to data-informed sparsity. We provide theoretical support for a Bayesian predictive algorithm based on TARP, including statistical and computational complexity guarantees. Examples for simulated and real data applications illustrate gains relative to a variety of competitors. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1998-2010
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1677240
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1677240
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:1998-2010
Template-Type: ReDIF-Article 1.0
Author-Name: Yilin Chen
Author-X-Name-First: Yilin
Author-X-Name-Last: Chen
Author-Name: Pengfei Li
Author-X-Name-First: Pengfei
Author-X-Name-Last: Li
Author-Name: Changbao Wu
Author-X-Name-First: Changbao
Author-X-Name-Last: Wu
Title: Doubly Robust Inference With Nonprobability Survey Samples
Abstract:
We establish a general framework for statistical inferences with nonprobability survey samples when relevant auxiliary information is available from a probability survey sample. We develop a rigorous procedure for estimating the propensity scores for units in the nonprobability sample, and construct doubly robust estimators for the finite population mean. Variance estimation is discussed under the proposed framework. Results from simulation studies show the robustness and the efficiency of our proposed estimators as compared to existing methods. The proposed method is used to analyze a nonprobability survey sample collected by the Pew Research Center with auxiliary information from the Behavioral Risk Factor Surveillance System and the Current Population Survey. Our results illustrate a general approach to inference with nonprobability samples and highlight the importance and usefulness of auxiliary information from probability survey samples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2011-2021
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1677241
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1677241
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2011-2021
Template-Type: ReDIF-Article 1.0
Author-Name: Jingfei Zhang
Author-X-Name-First: Jingfei
Author-X-Name-Last: Zhang
Author-Name: Will Wei Sun
Author-X-Name-First: Will Wei
Author-X-Name-Last: Sun
Author-Name: Lexin Li
Author-X-Name-First: Lexin
Author-X-Name-Last: Li
Title: Mixed-Effect Time-Varying Network Model and Application in Brain Connectivity Analysis
Abstract:
Time-varying networks are fast emerging in a wide range of scientific and business applications. Most existing dynamic network models are limited to a single-subject and discrete-time setting. In this article, we propose a mixed-effect network model that characterizes the continuous time-varying behavior of the network at the population level, while taking into account both individual subject variability and prior module information. We develop a multistep optimization procedure for a constrained likelihood estimation and derive the associated asymptotic properties. We demonstrate the effectiveness of our method through both simulations and an application to a study of brain development in youth. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2022-2036
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1677242
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1677242
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2022-2036
Template-Type: ReDIF-Article 1.0
Author-Name: Jonathan R. Bradley
Author-X-Name-First: Jonathan R.
Author-X-Name-Last: Bradley
Author-Name: Scott H. Holan
Author-X-Name-First: Scott H.
Author-X-Name-Last: Holan
Author-Name: Christopher K. Wikle
Author-X-Name-First: Christopher K.
Author-X-Name-Last: Wikle
Title: Bayesian Hierarchical Models With Conjugate Full-Conditional Distributions for Dependent Data From the Natural Exponential Family
Abstract:
We introduce a Bayesian approach for analyzing (possibly) high-dimensional dependent data that are distributed according to a member from the natural exponential family of distributions. This problem requires extensive methodological advancements, as jointly modeling high-dimensional dependent data leads to the so-called “big n problem.” The computational complexity of the “big n problem” is further exacerbated when allowing for non-Gaussian data models, as is the case here. Thus, we develop new computationally efficient distribution theory for this setting. In particular, we introduce the “conjugate multivariate distribution,” which is motivated by the Diaconis and Ylvisaker distribution. Furthermore, we provide substantial theoretical and methodological development including: results regarding conditional distributions, an asymptotic relationship with the multivariate normal distribution, conjugate prior distributions, and full-conditional distributions for a Gibbs sampler. To demonstrate the wide applicability of the proposed methodology, we provide two simulation studies and three applications based on an epidemiology dataset, a federal statistics dataset, and an environmental dataset, respectively. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2037-2052
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1677471
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1677471
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2037-2052
Template-Type: ReDIF-Article 1.0
Author-Name: Trambak Banerjee
Author-X-Name-First: Trambak
Author-X-Name-Last: Banerjee
Author-Name: Gourab Mukherjee
Author-X-Name-First: Gourab
Author-X-Name-Last: Mukherjee
Author-Name: Wenguang Sun
Author-X-Name-First: Wenguang
Author-X-Name-Last: Sun
Title: Adaptive Sparse Estimation With Side Information
Abstract:
The article considers the problem of estimating a high-dimensional sparse parameter in the presence of side information that encodes the sparsity structure. We develop a general framework that involves first using an auxiliary sequence to capture the side information, and then incorporating the auxiliary sequence in inference to reduce the estimation risk. The proposed method, which carries out adaptive Stein’s unbiased risk estimate-thresholding using side information (ASUS), is shown to have robust performance and enjoy optimality properties. We develop new theories to characterize regimes in which ASUS far outperforms competitive shrinkage estimators, and establish precise conditions under which ASUS is asymptotically optimal. Simulation studies are conducted to show that ASUS substantially improves the performance of existing methods in many settings. The methodology is applied for analysis of data from single cell virology studies and microarray time course experiments. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2053-2067
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1679639
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1679639
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2053-2067
Template-Type: ReDIF-Article 1.0
Author-Name: Kathleen T. Li
Author-X-Name-First: Kathleen T.
Author-X-Name-Last: Li
Title: Statistical Inference for Average Treatment Effects Estimated by Synthetic Control Methods
Abstract:
The synthetic control (SC) method, a powerful tool for estimating average treatment effects (ATE), is increasingly popular in fields such as statistics, economics, political science, and marketing. The SC is particularly suitable for estimating ATE with a single (or a few) treated unit(s), a fixed number of control units, and large pre- and post-treatment periods (which we refer to as “long panels”). To date, there has been no formal inference theory for the SC ATE estimator with long panels under general conditions. Existing work mostly uses placebo tests for inference, or permutation methods when the post-treatment period is small. In this article, we derive the asymptotic distribution of the SC and modified synthetic control (MSC) ATE estimators using projection theory. We show that a properly designed subsampling method can be used to obtain confidence intervals and conduct inference whereas the standard bootstrap cannot. Simulations and an empirical application that examines the effect of opening a physical showroom by an e-tailer demonstrate the usefulness of the MSC method in applications. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2068-2083
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1686986
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1686986
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2068-2083
Template-Type: ReDIF-Article 1.0
Author-Name: Junwei Lu
Author-X-Name-First: Junwei
Author-X-Name-Last: Lu
Author-Name: Mladen Kolar
Author-X-Name-First: Mladen
Author-X-Name-Last: Kolar
Author-Name: Han Liu
Author-X-Name-First: Han
Author-X-Name-Last: Liu
Title: Kernel Meets Sieve: Post-Regularization Confidence Bands for Sparse Additive Model
Abstract:
We develop a novel procedure for constructing confidence bands for components of a sparse additive model. Our procedure is based on a new kernel-sieve hybrid estimator that combines two of the most popular nonparametric estimation methods in the literature, kernel regression and the spline method, and is of interest in its own right. Existing methods for fitting sparse additive models are primarily based on sieve estimators, while the literature on confidence bands for nonparametric models is primarily based upon kernel or local polynomial estimators. Our kernel-sieve hybrid estimator combines the best of both worlds and allows us to provide a simple procedure for constructing confidence bands in high-dimensional sparse additive models. We prove that the confidence bands are asymptotically honest by studying approximation with a Gaussian process. Thorough numerical results on both synthetic data and real-world neuroscience data are provided to demonstrate the efficacy of the theory. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2084-2099
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2019.1689984
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1689984
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2084-2099
Template-Type: ReDIF-Article 1.0
Author-Name: Jordan J. Franks
Author-X-Name-First: Jordan J.
Author-X-Name-Last: Franks
Title: Handbook of Approximate Bayesian Computation.
Journal: Journal of the American Statistical Association
Pages: 2100-2101
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2020.1846973
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1846973
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2100-2101
Template-Type: ReDIF-Article 1.0
Author-Name: Yen-Chi Chen
Author-X-Name-First: Yen-Chi
Author-X-Name-Last: Chen
Title: Handbook of Mixture Analysis.
Journal: Journal of the American Statistical Association
Pages: 2101-2102
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2020.1846974
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1846974
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2101-2102
Template-Type: ReDIF-Article 1.0
Author-Name: Richard J. Cook
Author-X-Name-First: Richard J.
Author-X-Name-Last: Cook
Title: The Statistical Analysis of Multivariate Failure Time Data: A Marginal Modeling Approach.
Journal: Journal of the American Statistical Association
Pages: 2102-2104
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2020.1846975
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1846975
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2102-2104
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Editorial Collaborators
Journal: Journal of the American Statistical Association
Pages: 2105-2113
Issue: 532
Volume: 115
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2020.1846977
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1846977
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:532:p:2105-2113
Template-Type: ReDIF-Article 1.0
Author-Name: Andrew Gelman
Author-X-Name-First: Andrew
Author-X-Name-Last: Gelman
Author-Name: Zaiying Huang
Author-X-Name-First: Zaiying
Author-X-Name-Last: Huang
Title: Estimating Incumbency Advantage and Its Variation, as an Example of a Before–After Study
Abstract:
Incumbency advantage is one of the most widely studied features in American legislative elections. In this article we construct and implement an estimate that allows incumbency advantage to vary between individual incumbents. This model predicts that open-seat elections will be less variable than those with incumbents running, an observed empirical pattern that is not explained by previous models. We apply our method to the U.S. House of Representatives in the twentieth century. Our estimate of the overall pattern of incumbency advantage over time is similar to previous estimates (although slightly lower), and we also find a pattern of increasing variation. More generally, our multilevel model represents a new method for estimating effects in before–after studies.
Journal: Journal of the American Statistical Association
Pages: 437-446
Issue: 482
Volume: 103
Year: 2008
Month: 9
X-DOI: 10.1198/016214507000000626
File-URL: http://hdl.handle.net/
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:103:y:2008:i:482:p:437-446
Template-Type: ReDIF-Article 1.0
Author-Name: Zhuo Wang
Author-X-Name-First: Zhuo
Author-X-Name-Last: Wang
Author-Name: Yujing Jiang
Author-X-Name-First: Yujing
Author-X-Name-Last: Jiang
Author-Name: Hui Wan
Author-X-Name-First: Hui
Author-X-Name-Last: Wan
Author-Name: Jun Yan
Author-X-Name-First: Jun
Author-X-Name-Last: Yan
Author-Name: Xuebin Zhang
Author-X-Name-First: Xuebin
Author-X-Name-Last: Zhang
Title: Toward Optimal Fingerprinting in Detection and Attribution of Changes in Climate Extremes
Abstract:
Detection and attribution of climate change plays a central role in establishing the causal relationship between the observed changes in the climate and their possible causes. Optimal fingerprinting has been widely used as a standard method for detection and attribution analysis for mean climate conditions, but there has been no satisfactory analog for climate extremes. Here, we turn an intuitive concept, which incorporates the expected climate responses to external forcings into the location parameters of the marginal generalized extreme value (GEV) distributions of the observed extremes, into a practical and better-understood method. Marginal approaches based on a weighted sum of marginal GEV score equations are promising because they do not require specifying the dependence structure. Their computational efficiency makes them feasible for handling multiple forcings simultaneously. The method under working independence is recommended because it produces robust results in the presence of errors in variables. Our analyses show human influences on temperature extremes at the subcontinental scale. Compared with previous studies, we detected human influences in a slightly smaller number of regions. This is possibly due to the under-coverage of the confidence intervals in existing works, suggesting the need for careful examinations of the properties of the statistical methods in practice. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1-13
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1730852
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1730852
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:1-13
Template-Type: ReDIF-Article 1.0
Author-Name: Seyoung Park
Author-X-Name-First: Seyoung
Author-X-Name-Last: Park
Author-Name: Hao Xu
Author-X-Name-First: Hao
Author-X-Name-Last: Xu
Author-Name: Hongyu Zhao
Author-X-Name-First: Hongyu
Author-X-Name-Last: Zhao
Title: Integrating Multidimensional Data for Clustering Analysis With Applications to Cancer Patient Data
Abstract:
Advances in high-throughput genomic technologies coupled with large-scale studies including The Cancer Genome Atlas (TCGA) project have generated rich resources of diverse types of omics data to better understand cancer etiology and treatment responses. Clustering patients into subtypes with similar disease etiologies and/or treatment responses using multiple omics data types has the potential to improve the precision of clustering compared with using a single data type. However, in practice, patient clustering is still mostly based on a single type of omics data or ad hoc integration of clustering results from individual data types, leading to potential loss of information. By treating each omics data type as a different informative representation from patients, we propose a novel multi-view spectral clustering framework to integrate different omics data types measured from the same subject. We learn the weight of each data type as well as a similarity measure between patients via a nonconvex optimization framework. We solve the proposed nonconvex problem iteratively using the ADMM algorithm and show the convergence of the algorithm. The accuracy and robustness of the proposed clustering method are studied both in theory and through various synthetic data. When our method is applied to the TCGA data, the patient clusters inferred by our method show more significant differences in survival times between clusters than those inferred from existing clustering methods. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 14-26
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1730853
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1730853
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:14-26
Template-Type: ReDIF-Article 1.0
Author-Name: Souhaib Ben Taieb
Author-X-Name-First: Souhaib Ben
Author-X-Name-Last: Taieb
Author-Name: James W. Taylor
Author-X-Name-First: James W.
Author-X-Name-Last: Taylor
Author-Name: Rob J. Hyndman
Author-X-Name-First: Rob J.
Author-X-Name-Last: Hyndman
Title: Hierarchical Probabilistic Forecasting of Electricity Demand With Smart Meter Data
Abstract:
Decisions regarding the supply of electricity across a power grid must take into consideration the inherent uncertainty in demand. Optimal decision-making requires probabilistic forecasts for demand in a hierarchy with various levels of aggregation, such as substations, cities, and regions. The forecasts should be coherent in the sense that the forecast of the aggregated series should equal the sum of the forecasts of the corresponding disaggregated series. Coherency is essential, since the allocation of electricity at one level of the hierarchy relies on the appropriate amount being provided from the previous level. We introduce a new probabilistic forecasting method for a large hierarchy based on UK residential smart meter data. We find our method provides coherent and accurate probabilistic forecasts, as a result of an effective forecast combination. Furthermore, by avoiding distributional assumptions, we find that our method captures the variety of distributions in the smart meter hierarchy. Finally, the results confirm that, to ensure coherency in our large-scale hierarchy, it is sufficient to model a set of lower-dimension dependencies, rather than modeling the entire joint distribution of all series in the hierarchy. In achieving coherent and accurate hierarchical probabilistic forecasts, this work contributes to improved decision-making for smart grids. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 27-43
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1736081
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1736081
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:27-43
Template-Type: ReDIF-Article 1.0
Author-Name: Giovanni Nattino
Author-X-Name-First: Giovanni
Author-X-Name-Last: Nattino
Author-Name: Bo Lu
Author-X-Name-First: Bo
Author-X-Name-Last: Lu
Author-Name: Junxin Shi
Author-X-Name-First: Junxin
Author-X-Name-Last: Shi
Author-Name: Stanley Lemeshow
Author-X-Name-First: Stanley
Author-X-Name-Last: Lemeshow
Author-Name: Henry Xiang
Author-X-Name-First: Henry
Author-X-Name-Last: Xiang
Title: Triplet Matching for Estimating Causal Effects With Three Treatment Arms: A Comparative Study of Mortality by Trauma Center Level
Abstract:
Comparing outcomes across different levels of trauma centers is vital in evaluating regionalized trauma care. With observational data, it is critical to adjust for patient characteristics to render valid causal comparisons. Propensity score matching is a popular method to infer causal relationships in observational studies with two treatment arms. Few studies, however, have used matching designs with more than two groups, due to the complexity of matching algorithms. We fill the gap by developing an iterative matching algorithm for the three-group setting. Our algorithm outperforms the nearest neighbor algorithm and is shown to produce matched samples with total distance no larger than twice the optimal distance. We implement the evidence factors method for binary outcomes, which includes a randomization-based testing strategy and a sensitivity analysis for hidden bias in three-group matched designs. We apply our method to the Nationwide Emergency Department Sample data to compare emergency department mortality among non-trauma, level I, and level II trauma centers. Our tests suggest that the admission to a trauma center has a beneficial effect on mortality, assuming no unmeasured confounding. A sensitivity analysis for hidden bias shows that unmeasured confounders, moderately associated with the type of care received, may change the result qualitatively. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 44-53
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1737078
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1737078
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:44-53
Template-Type: ReDIF-Article 1.0
Author-Name: Kevin Z. Lin
Author-X-Name-First: Kevin Z.
Author-X-Name-Last: Lin
Author-Name: Han Liu
Author-X-Name-First: Han
Author-X-Name-Last: Liu
Author-Name: Kathryn Roeder
Author-X-Name-First: Kathryn
Author-X-Name-Last: Roeder
Title: Covariance-Based Sample Selection for Heterogeneous Data: Applications to Gene Expression and Autism Risk Gene Detection
Abstract:
Risk for autism can be influenced by genetic mutations in hundreds of genes. Based on findings showing that genes with highly correlated gene expressions are functionally interrelated, “guilt by association” methods such as DAWN have been developed to identify these autism risk genes. Previous research analyzes the BrainSpan dataset, which contains gene expression of brain tissues from varying regions and developmental periods. Since the spatiotemporal properties of brain tissue are known to affect the gene expression’s covariance, previous research has focused only on a specific subset of samples to avoid the issue of heterogeneity. This analysis leads to a potential loss of power when detecting risk genes. In this article, we develop a new method called covariance-based sample selection (COBS) to find a larger and more homogeneous subset of samples that share the same population covariance matrix for the downstream DAWN analysis. To demonstrate COBS's effectiveness, we use genetic risk scores from two sequential data freezes obtained in 2014 and 2020. We show COBS improves DAWN's ability to predict risk genes detected in the newer data freeze when using the risk scores of the older data freeze as input. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 54-67
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1738234
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1738234
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:54-67
Template-Type: ReDIF-Article 1.0
Author-Name: Lucy Xia
Author-X-Name-First: Lucy
Author-X-Name-Last: Xia
Author-Name: Richard Zhao
Author-X-Name-First: Richard
Author-X-Name-Last: Zhao
Author-Name: Yanhui Wu
Author-X-Name-First: Yanhui
Author-X-Name-Last: Wu
Author-Name: Xin Tong
Author-X-Name-First: Xin
Author-X-Name-Last: Tong
Title: Intentional Control of Type I Error Over Unconscious Data Distortion: A Neyman–Pearson Approach to Text Classification
Abstract:
This article addresses the challenges in classifying textual data obtained from open online platforms, which are vulnerable to distortion. Most existing classification methods minimize the overall classification error and may yield an undesirably large Type I error (relevant textual messages are classified as irrelevant), particularly when available data exhibit an asymmetry between relevant and irrelevant information. Data distortion exacerbates this situation and often leads to fallacious prediction. To deal with inestimable data distortion, we propose the use of the Neyman–Pearson (NP) classification paradigm, which minimizes Type II error under a user-specified Type I error constraint. Theoretically, we show that the NP oracle is unaffected by data distortion when the class conditional distributions remain the same. Empirically, we study a case of classifying posts about worker strikes obtained from a leading Chinese microblogging platform, which are frequently prone to extensive, unpredictable and inestimable censorship. We demonstrate that, even though the training and test data are susceptible to different distortion and therefore potentially follow different distributions, our proposed NP methods control the Type I error on test data at the targeted level. The methods and implementation pipeline proposed in our case study are applicable to many other problems involving data distortion. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 68-81
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1740711
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1740711
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:68-81
Template-Type: ReDIF-Article 1.0
Author-Name: Bikram Karmakar
Author-X-Name-First: Bikram
Author-X-Name-Last: Karmakar
Author-Name: Dylan S. Small
Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small
Author-Name: Paul R. Rosenbaum
Author-X-Name-First: Paul R.
Author-X-Name-Last: Rosenbaum
Title: Reinforced Designs: Multiple Instruments Plus Control Groups as Evidence Factors in an Observational Study of the Effectiveness of Catholic Schools
Abstract:
Absent randomization, causal conclusions gain strength if several independent evidence factors concur. We develop a method for constructing evidence factors from several instruments plus a direct comparison of treated and control groups, and we evaluate the method's performance in terms of design sensitivity and simulation. In the application, we consider the effectiveness of Catholic versus public high schools, constructing three evidence factors from three past strategies for studying this question, namely: (i) having nearby access to a Catholic school as an instrument, (ii) being Catholic as an instrument for attending Catholic school, and (iii) a direct comparison of students in Catholic and public high schools. Although these three analyses use the same data, we: (i) construct three essentially independent statistical tests of no effect that require very different assumptions, (ii) study the sensitivity of each test to the assumptions underlying that test, (iii) examine the degree to which independent tests dependent upon different assumptions concur, and (iv) pool evidence across independent factors. In the application, we conclude that the ostensible benefit of Catholic education depends critically on the validity of one instrument, and is therefore quite fragile. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 82-92
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1745811
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1745811
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:82-92
Template-Type: ReDIF-Article 1.0
Author-Name: Gregory P. Bopp
Author-X-Name-First: Gregory P.
Author-X-Name-Last: Bopp
Author-Name: Benjamin A. Shaby
Author-X-Name-First: Benjamin A.
Author-X-Name-Last: Shaby
Author-Name: Raphaël Huser
Author-X-Name-First: Raphaël
Author-X-Name-Last: Huser
Title: A Hierarchical Max-Infinitely Divisible Spatial Model for Extreme Precipitation
Abstract:
Understanding the spatial extent of extreme precipitation is necessary for determining flood risk and adequately designing infrastructure (e.g., stormwater pipes) to withstand such hazards. While environmental phenomena typically exhibit weakening spatial dependence at increasingly extreme levels, limiting max-stable process models for block maxima have a rigid dependence structure that does not capture this type of behavior. We propose a flexible Bayesian model from a broader family of (conditionally) max-infinitely divisible processes that allows for weakening spatial dependence at increasingly extreme levels, and due to a hierarchical representation of the likelihood in terms of random effects, our inference approach scales to large datasets. Therefore, our model not only has a flexible dependence structure, but it also allows for fast, fully Bayesian inference, prediction and conditional simulation in high dimensions. The proposed model is constructed using flexible random basis functions that are estimated from the data, allowing for straightforward inspection of the predominant spatial patterns of extremes. In addition, the described process possesses (conditional) max-stability as a special case, making inference on the tail dependence class possible. We apply our model to extreme precipitation in North-Eastern America, and show that the proposed model adequately captures the extremal behavior of the data. Interestingly, we find that the principal modes of spatial variation estimated from our model resemble observed patterns in extreme precipitation events occurring along the coast (e.g., with localized tropical cyclones and convective storms) and mountain range borders. Our model, which can easily be adapted to other types of environmental datasets, is therefore useful to identify extreme weather patterns and regions at risk. 
Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 93-106
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1750414
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1750414
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:93-106
Template-Type: ReDIF-Article 1.0
Author-Name: R. Glennie
Author-X-Name-First: R.
Author-X-Name-Last: Glennie
Author-Name: S. T. Buckland
Author-X-Name-First: S. T.
Author-X-Name-Last: Buckland
Author-Name: R. Langrock
Author-X-Name-First: R.
Author-X-Name-Last: Langrock
Author-Name: T. Gerrodette
Author-X-Name-First: T.
Author-X-Name-Last: Gerrodette
Author-Name: L. T. Ballance
Author-X-Name-First: L. T.
Author-X-Name-Last: Ballance
Author-Name: S. J. Chivers
Author-X-Name-First: S. J.
Author-X-Name-Last: Chivers
Author-Name: M. D. Scott
Author-X-Name-First: M. D.
Author-X-Name-Last: Scott
Title: Incorporating Animal Movement Into Distance Sampling
Abstract:
Distance sampling is a popular statistical method to estimate the density of wild animal populations. Conventional distance sampling represents animals as fixed points in space that are detected with an unknown probability that depends on the distance between the observer and the animal. Animal movement can cause substantial bias in density estimation. Methods to correct for responsive animal movement exist, but none account for nonresponsive movement independent of the observer. Here, an explicit animal movement model is incorporated into distance sampling, combining distance sampling survey data with animal telemetry data. Detection probability depends on the entire unobserved path the animal travels. The intractable integration over all possible animal paths is approximated by a hidden Markov model. A simulation study shows the method to be negligibly biased (<5%) in scenarios where conventional distance sampling overestimates abundance by up to 100%. The method is applied to line transect surveys (1999–2006) of spotted dolphins (Stenella attenuata) in the eastern tropical Pacific where abundance is shown to be positively biased by 21% on average, which can have substantial impact on the population dynamics estimated from these abundance estimates and on the choice of statistical methodology applied to future surveys. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 107-115
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1764362
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1764362
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:107-115
Template-Type: ReDIF-Article 1.0
Author-Name: Decai Liang
Author-X-Name-First: Decai
Author-X-Name-Last: Liang
Author-Name: Haozhe Zhang
Author-X-Name-First: Haozhe
Author-X-Name-Last: Zhang
Author-Name: Xiaohui Chang
Author-X-Name-First: Xiaohui
Author-X-Name-Last: Chang
Author-Name: Hui Huang
Author-X-Name-First: Hui
Author-X-Name-Last: Huang
Title: Modeling and Regionalization of China’s PM2.5 Using Spatial-Functional Mixture Models
Abstract:
Severe air pollution affects billions of people around the world, particularly in developing countries such as China. Effective emission control policies rely primarily on a proper assessment of air pollutants and accurate spatial clustering outcomes. Unfortunately, emission patterns are difficult to observe as they are highly confounded by many meteorological and geographical factors. In this study, we propose a novel approach for modeling and clustering PM2.5 concentrations across China. We model observed concentrations from monitoring stations as spatially dependent functional data and assume latent emission processes originate from a functional mixture model with each component as a spatio-temporal process. Cluster memberships of monitoring stations are modeled as a Markov random field, in which confounding effects are controlled through energy functions. The superior performance of our approach is demonstrated using extensive simulation studies. Our method is effective in dividing China and the Beijing-Tianjin-Hebei region into several regions based on PM2.5 concentrations, suggesting that separate local emission control policies are needed. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 116-132
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1764363
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1764363
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:116-132
Template-Type: ReDIF-Article 1.0
Author-Name: Ting-Huei Chen
Author-X-Name-First: Ting-Huei
Author-X-Name-Last: Chen
Author-Name: Nilanjan Chatterjee
Author-X-Name-First: Nilanjan
Author-X-Name-Last: Chatterjee
Author-Name: Maria Teresa Landi
Author-X-Name-First: Maria Teresa
Author-X-Name-Last: Landi
Author-Name: Jianxin Shi
Author-X-Name-First: Jianxin
Author-X-Name-Last: Shi
Title: A Penalized Regression Framework for Building Polygenic Risk Models Based on Summary Statistics From Genome-Wide Association Studies and Incorporating External Information
Abstract:
Large-scale genome-wide association studies (GWAS) provide opportunities for developing genetic risk prediction models that have the potential to improve disease prevention, intervention or treatment. The key step is to develop polygenic risk score (PRS) models with high predictive performance for a given disease, which typically requires a large training dataset for selecting truly associated single nucleotide polymorphisms (SNPs) and estimating effect sizes accurately. Here, we develop a comprehensive penalized regression for fitting l1-regularized regression models to GWAS summary statistics. We propose incorporating pleiotropy and annotation information into PRS (PANPRS) development through suitable formulation of penalty functions and associated tuning parameters. Extensive simulations show that PANPRS performs equally well or better than existing PRS methods when no functional annotation or pleiotropy is incorporated. When functional annotation data and pleiotropy are informative, PANPRS substantially outperforms existing PRS methods in simulations. Finally, we applied our methods to build PRS for type 2 diabetes and melanoma and found that incorporating relevant functional annotations and GWAS of genetically related traits improved prediction of these two complex diseases. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 133-143
Issue: 533
Volume: 116
Year: 2020
Month: 10
X-DOI: 10.1080/01621459.2020.1764849
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1764849
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2020:i:533:p:133-143
Template-Type: ReDIF-Article 1.0
Author-Name: Long Feng
Author-X-Name-First: Long
Author-X-Name-Last: Feng
Author-Name: Xuan Bi
Author-X-Name-First: Xuan
Author-X-Name-Last: Bi
Author-Name: Heping Zhang
Author-X-Name-First: Heping
Author-X-Name-Last: Zhang
Title: Brain Regions Identified as Being Associated With Verbal Reasoning Through the Use of Imaging Regression via Internal Variation
Abstract:
Brain-imaging data have been increasingly used to understand intellectual disabilities. Despite significant progress in biomedical research, the mechanisms for most of the intellectual disabilities remain unknown. Finding the underlying neurological mechanisms has proved difficult, especially in children due to the rapid development of their brains. We investigate verbal reasoning, which is a reliable measure of an individual’s general intellectual abilities, and develop a class of high-order imaging regression models to identify brain subregions which might be associated with this specific intellectual ability. A key novelty of our method is to take advantage of spatial brain structures, and specifically the piecewise smooth nature of most imaging coefficients in the form of high-order tensors. Our approach provides an effective and urgently needed method for identifying brain subregions potentially underlying certain intellectual disabilities. The idea behind our approach is a carefully constructed concept called internal variation (IV). The IV employs tensor decomposition and provides a computationally feasible substitution for total variation, which has been considered suitable to deal with similar problems but may not be scalable to high-order tensor regression. Before applying our method to analyze the real data, we conduct comprehensive simulation studies to demonstrate the validity of our method in imaging signal identification. Next, we present our results from the analysis of a dataset based on the Philadelphia Neurodevelopmental Cohort for which we preprocessed the data including reorienting, bias-field correcting, extracting, normalizing, and registering the magnetic resonance images from 978 individuals. 
Our analysis identified a subregion across the cingulate cortex and the corpus callosum as being associated with individuals’ verbal reasoning ability, which, to the best of our knowledge, is a novel region that has not been reported in the literature. This finding is useful in further investigation of functional mechanisms for verbal reasoning. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 144-158
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1766468
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1766468
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:144-158
Template-Type: ReDIF-Article 1.0
Author-Name: Michael R. Kosorok
Author-X-Name-First: Michael R.
Author-X-Name-Last: Kosorok
Author-Name: Eric B. Laber
Author-X-Name-First: Eric B.
Author-X-Name-Last: Laber
Author-Name: Dylan S. Small
Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Title: Introduction to the Theory and Methods Special Issue on Precision Medicine and Individualized Policy Discovery
Abstract:
We introduce the Theory and Methods Special Issue on Precision Medicine and Individualized Policy Discovery. The issue consists of four discussion papers, grouped into two pairs, and sixteen regular research papers that cover many important lines of research on data-driven decision making. We hope that the many provocative and original ideas presented herein will inspire further work and development in precision medicine and personalization.
Journal: Journal of the American Statistical Association
Pages: 159-161
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1863224
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1863224
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:159-161
Template-Type: ReDIF-Article 1.0
Author-Name: Yifan Cui
Author-X-Name-First: Yifan
Author-X-Name-Last: Cui
Author-Name: Eric Tchetgen Tchetgen
Author-X-Name-First: Eric
Author-X-Name-Last: Tchetgen Tchetgen
Title: A Semiparametric Instrumental Variable Approach to Optimal Treatment Regimes Under Endogeneity
Abstract:
There is a fast-growing literature on estimating optimal treatment regimes based on randomized trials or observational studies under a key identifying condition of no unmeasured confounding. Because confounding by unmeasured factors cannot generally be ruled out with certainty in observational studies or randomized trials subject to noncompliance, we propose a general instrumental variable (IV) approach to learning optimal treatment regimes under endogeneity. Specifically, we establish identification of both the value function E[Y_{D(L)}] for a given regime D and the optimal regime argmax_D E[Y_{D(L)}] with the aid of a binary IV, when the assumption of no unmeasured confounding fails to hold. We also construct novel multiply robust classification-based estimators. Furthermore, we propose to identify and estimate optimal treatment regimes among those who would comply to the assigned treatment under a monotonicity assumption. In this latter case, we establish the somewhat surprising result that complier optimal regimes can be consistently estimated without directly collecting compliance information and therefore without the complier average treatment effect itself being identified. Our approach is illustrated via extensive simulation studies and a data application on the effect of child rearing on labor participation. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 162-173
Issue: 533
Volume: 116
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2020.1783272
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1783272
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2020:i:533:p:162-173
Template-Type: ReDIF-Article 1.0
Author-Name: Hongxiang Qiu
Author-X-Name-First: Hongxiang
Author-X-Name-Last: Qiu
Author-Name: Marco Carone
Author-X-Name-First: Marco
Author-X-Name-Last: Carone
Author-Name: Ekaterina Sadikova
Author-X-Name-First: Ekaterina
Author-X-Name-Last: Sadikova
Author-Name: Maria Petukhova
Author-X-Name-First: Maria
Author-X-Name-Last: Petukhova
Author-Name: Ronald C. Kessler
Author-X-Name-First: Ronald C.
Author-X-Name-Last: Kessler
Author-Name: Alex Luedtke
Author-X-Name-First: Alex
Author-X-Name-Last: Luedtke
Title: Optimal Individualized Decision Rules Using Instrumental Variable Methods
Abstract:
There is an extensive literature on the estimation and evaluation of optimal individualized treatment rules in settings where all confounders of the effect of treatment on outcome are observed. We study the development of individualized decision rules in settings where some of these confounders may not have been measured but a valid binary instrument is available for a binary treatment. We first consider individualized treatment rules, which will naturally be most interesting in settings where it is feasible to intervene directly on treatment. We then consider a setting where intervening on treatment is infeasible, but intervening to encourage treatment is feasible. In both of these settings, we also handle the case that the treatment is a limited resource so that optimal interventions focus the available resources on those individuals who will benefit most from treatment. We evaluate an optimal individualized rule by its average causal effect relative to a prespecified reference rule. We develop methods to estimate optimal individualized rules and construct asymptotically efficient plug-in estimators of the corresponding average causal effect relative to a prespecified reference rule. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 174-191
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1745814
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1745814
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:174-191
Template-Type: ReDIF-Article 1.0
Author-Name: Sukjin Han
Author-X-Name-First: Sukjin
Author-X-Name-Last: Han
Title: Comment: Individualized Treatment Rules Under Endogeneity
Journal: Journal of the American Statistical Association
Pages: 192-195
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1831923
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1831923
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:192-195
Template-Type: ReDIF-Article 1.0
Author-Name: Bo Zhang
Author-X-Name-First: Bo
Author-X-Name-Last: Zhang
Author-Name: Hongming Pu
Author-X-Name-First: Hongming
Author-X-Name-Last: Pu
Title: Discussion of Cui and Tchetgen Tchetgen (2020) and Qiu et al. (2020)
Journal: Journal of the American Statistical Association
Pages: 196-199
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1832500
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1832500
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:196-199
Template-Type: ReDIF-Article 1.0
Author-Name: Yifan Cui
Author-X-Name-First: Yifan
Author-X-Name-Last: Cui
Author-Name: Eric Tchetgen Tchetgen
Author-X-Name-First: Eric
Author-X-Name-Last: Tchetgen Tchetgen
Title: Machine Intelligence for Individualized Decision Making Under a Counterfactual World: A Rejoinder
Journal: Journal of the American Statistical Association
Pages: 200-206
Issue: 533
Volume: 116
Year: 2021
Month: 2
X-DOI: 10.1080/01621459.2021.1872580
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1872580
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:200-206
Template-Type: ReDIF-Article 1.0
Author-Name: Hongxiang Qiu
Author-X-Name-First: Hongxiang
Author-X-Name-Last: Qiu
Author-Name: Marco Carone
Author-X-Name-First: Marco
Author-X-Name-Last: Carone
Author-Name: Ekaterina Sadikova
Author-X-Name-First: Ekaterina
Author-X-Name-Last: Sadikova
Author-Name: Maria Petukhova
Author-X-Name-First: Maria
Author-X-Name-Last: Petukhova
Author-Name: Ronald C. Kessler
Author-X-Name-First: Ronald C.
Author-X-Name-Last: Kessler
Author-Name: Alex Luedtke
Author-X-Name-First: Alex
Author-X-Name-Last: Luedtke
Title: Rejoinder: Optimal Individualized Decision Rules Using Instrumental Variable Methods
Journal: Journal of the American Statistical Association
Pages: 207-209
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1865166
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1865166
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:207-209
Template-Type: ReDIF-Article 1.0
Author-Name: Jared D. Huling
Author-X-Name-First: Jared D.
Author-X-Name-Last: Huling
Author-Name: Maureen A. Smith
Author-X-Name-First: Maureen A.
Author-X-Name-Last: Smith
Author-Name: Guanhua Chen
Author-X-Name-First: Guanhua
Author-X-Name-Last: Chen
Title: A Two-Part Framework for Estimating Individualized Treatment Rules From Semicontinuous Outcomes
Abstract:
Health care payments are an important component of health care utilization and are thus a major focus in health services and health policy applications. However, payment outcomes are semicontinuous in that over a given period of time some patients incur no payments and some patients incur large costs. Individualized treatment rules (ITRs) are a major part of the push for tailoring treatments and interventions to patients, yet there is little work focused on estimating ITRs from semicontinuous outcomes. In this article, we develop a framework for estimation of ITRs based on two-part modeling, wherein the ITR is estimated by separately targeting the zero part of the outcome and the strictly positive part. To improve performance when high-dimensional covariates are available, we leverage a scientifically plausible penalty that simultaneously selects variables and encourages the signs of coefficients for each variable to agree between the two components of the ITR. We develop an efficient algorithm for computation and prove oracle inequalities for the resulting estimation and prediction errors. We demonstrate the effectiveness of our approach in simulated examples and in a study of a health system intervention. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 210-223
Issue: 533
Volume: 116
Year: 2020
Month: 10
X-DOI: 10.1080/01621459.2020.1801449
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1801449
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2020:i:533:p:210-223
Template-Type: ReDIF-Article 1.0
Author-Name: Lin Liu
Author-X-Name-First: Lin
Author-X-Name-Last: Liu
Author-Name: Zach Shahn
Author-X-Name-First: Zach
Author-X-Name-Last: Shahn
Author-Name: James M. Robins
Author-X-Name-First: James M.
Author-X-Name-Last: Robins
Author-Name: Andrea Rotnitzky
Author-X-Name-First: Andrea
Author-X-Name-Last: Rotnitzky
Title: Efficient Estimation of Optimal Regimes Under a No Direct Effect Assumption
Abstract:
We derive new estimators of an optimal joint testing and treatment regime under the no direct effect (NDE) assumption that a given laboratory, diagnostic, or screening test has no effect on a patient’s clinical outcomes except through the effect of the test results on the choice of treatment. We model the optimal joint strategy with an optimal structural nested mean model (opt-SNMM). The proposed estimators are more efficient than previous estimators of the parameters of an opt-SNMM because they efficiently leverage the “NDE of testing” assumption. Our methods will be of importance to decision scientists who either perform cost-benefit analyses or are tasked with the estimation of the “value of information” supplied by an expensive diagnostic test (such as an MRI to screen for lung cancer). Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 224-239
Issue: 533
Volume: 116
Year: 2021
Month: 1
X-DOI: 10.1080/01621459.2020.1856117
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1856117
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:224-239
Template-Type: ReDIF-Article 1.0
Author-Name: Haoyu Chen
Author-X-Name-First: Haoyu
Author-X-Name-Last: Chen
Author-Name: Wenbin Lu
Author-X-Name-First: Wenbin
Author-X-Name-Last: Lu
Author-Name: Rui Song
Author-X-Name-First: Rui
Author-X-Name-Last: Song
Title: Statistical Inference for Online Decision Making: In a Contextual Bandit Setting
Abstract:
The online decision-making problem requires us to make a sequence of decisions based on incremental information. Common solutions often need to learn a reward model of different actions given the contextual information and then maximize the long-term reward. It is meaningful to know whether the posited model is reasonable and how the model performs in the asymptotic sense. We study this problem under the setup of the contextual bandit framework with a linear reward model. The ε-greedy policy is adopted to address the classic exploration-and-exploitation dilemma. Using the martingale central limit theorem, we show that the online ordinary least squares estimator of model parameters is asymptotically normal. When the linear model is misspecified, we propose the online weighted least squares estimator using the inverse propensity score weighting and also establish its asymptotic normality. Based on the properties of the parameter estimators, we further show that the in-sample inverse propensity weighted value estimator is asymptotically normal. We illustrate our results using simulations and an application to a news article recommendation dataset from Yahoo!. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 240-255
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1770098
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1770098
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:240-255
Template-Type: ReDIF-Article 1.0
Author-Name: Juliana Schulz
Author-X-Name-First: Juliana
Author-X-Name-Last: Schulz
Author-Name: Erica E. M. Moodie
Author-X-Name-First: Erica E. M.
Author-X-Name-Last: Moodie
Title: Doubly Robust Estimation of Optimal Dosing Strategies
Abstract:
The goal of precision medicine is to tailor treatment strategies on an individual patient level. Although several estimation techniques have been developed for determining optimal treatment rules, the majority of methods focus on the case of a dichotomous treatment, an example being the dynamic weighted ordinary least squares regression approach of Wallace and Moodie. We propose an extension to the aforementioned framework to allow for a continuous treatment with the ultimate goal of estimating optimal dosing strategies. The proposed method is shown to be doubly robust against model misspecification whenever the implemented weights satisfy a particular balancing condition. A broad class of weight functions can be derived from the balancing condition, providing a flexible regression based estimation method in the context of adaptive treatment strategies for continuous valued treatments. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 256-268
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1753521
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1753521
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:256-268
Template-Type: ReDIF-Article 1.0
Author-Name: Yuan Chen
Author-X-Name-First: Yuan
Author-X-Name-Last: Chen
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: Yuanjia Wang
Author-X-Name-First: Yuanjia
Author-X-Name-Last: Wang
Title: Learning Individualized Treatment Rules for Multiple-Domain Latent Outcomes
Abstract:
For many mental disorders, latent mental status from multiple-domain psychological or clinical symptoms may perform as a better characterization of the underlying disorder status than a simple summary score of the symptoms, and they may also serve as more reliable and representative features to differentiate treatment responses. Therefore, to address the complexity and heterogeneity of treatment responses for mental disorders, we provide a new paradigm for learning optimal individualized treatment rules (ITRs) by modeling patients’ latent mental status. We first learn the multi-domain latent states at baseline from the observed symptoms under a restricted Boltzmann machine (RBM) model, which encodes patients’ heterogeneous symptoms using an economical number of latent variables and yet remains flexible. We then optimize a value function defined by the latent states after treatment by exploiting a transformation of the observed symptoms based on the RBM without modeling the relationship between the latent mental states before and after treatment. The optimal treatment rules are derived using a weighted large margin classifier. We derive the convergence rate of the proposed estimator under the latent models. Simulation studies are conducted to test the performance of the proposed method. Finally, we apply the developed method to real world studies and we demonstrate the utility and advantage of our method in tailoring treatments for patients with major depression, and identify patient subgroups informative for treatment recommendations. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 269-282
Issue: 533
Volume: 116
Year: 2020
Month: 10
X-DOI: 10.1080/01621459.2020.1817751
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1817751
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2020:i:533:p:269-282
Template-Type: ReDIF-Article 1.0
Author-Name: Yinghao Pan
Author-X-Name-First: Yinghao
Author-X-Name-Last: Pan
Author-Name: Ying-Qi Zhao
Author-X-Name-First: Ying-Qi
Author-X-Name-Last: Zhao
Title: Improved Doubly Robust Estimation in Learning Optimal Individualized Treatment Rules
Abstract:
Individualized treatment rules (ITRs) recommend treatment according to patient characteristics. There is a growing interest in developing novel and efficient statistical methods in constructing ITRs. We propose an improved doubly robust estimator of the optimal ITRs. The proposed estimator is based on a direct optimization of an augmented inverse-probability weighted estimator of the expected clinical outcome over a class of ITRs. The method enjoys two key properties. First, it is doubly robust, meaning that the proposed estimator is consistent when either the propensity score or the outcome model is correct. Second, it achieves the smallest variance among the class of doubly robust estimators when the propensity score model is correctly specified, regardless of the specification of the outcome model. Simulation studies show that the estimated ITRs obtained from our method yield better results than those obtained from current popular methods. Data from the Sequenced Treatment Alternatives to Relieve Depression study is analyzed as an illustrative example. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 283-294
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1725522
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1725522
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:283-294
Template-Type: ReDIF-Article 1.0
Author-Name: Bo Zhang
Author-X-Name-First: Bo
Author-X-Name-Last: Zhang
Author-Name: Jordan Weiss
Author-X-Name-First: Jordan
Author-X-Name-Last: Weiss
Author-Name: Dylan S. Small
Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small
Author-Name: Qingyuan Zhao
Author-X-Name-First: Qingyuan
Author-X-Name-Last: Zhao
Title: Selecting and Ranking Individualized Treatment Rules With Unmeasured Confounding
Abstract:
It is common to compare individualized treatment rules based on the value function, which is the expected potential outcome under the treatment rule. Although the value function is not point-identified when there is unmeasured confounding, it still defines a partial order among the treatment rules under Rosenbaum’s sensitivity analysis model. We first consider how to compare two treatment rules with unmeasured confounding in the single-decision setting and then use this pairwise test to rank multiple treatment rules. We consider how to, among many treatment rules, select the best rules and select the rules that are better than a control rule. The proposed methods are illustrated using two real examples, one about the benefit of malaria prevention programs to different age groups and another about the effect of late retirement on senior health in different gender and occupation groups. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 295-308
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1736083
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1736083
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:295-308
Template-Type: ReDIF-Article 1.0
Author-Name: Wenchuan Guo
Author-X-Name-First: Wenchuan
Author-X-Name-Last: Guo
Author-Name: Xiao-Hua Zhou
Author-X-Name-First: Xiao-Hua
Author-X-Name-Last: Zhou
Author-Name: Shujie Ma
Author-X-Name-First: Shujie
Author-X-Name-Last: Ma
Title: Estimation of Optimal Individualized Treatment Rules Using a Covariate-Specific Treatment Effect Curve With High-Dimensional Covariates
Abstract:
With a large number of baseline covariates, we propose a new semiparametric modeling strategy for heterogeneous treatment effect estimation and individualized treatment selection, which are two major goals in personalized medicine. We achieve the first goal through estimating a covariate-specific treatment effect (CSTE) curve modeled as an unknown function of a weighted linear combination of all baseline covariates. The weight or the coefficient for each covariate is estimated by fitting a sparse semiparametric logistic single-index coefficient model. The CSTE curve is estimated by a spline-backfitted kernel procedure, which enables us to further construct a simultaneous confidence band (SCB) for the CSTE curve under a desired confidence level. Based on the SCB, we find the subgroups of patients that benefit from each treatment, so that we can make individualized treatment selection. The innovations of the proposed method are 3-fold. First, the proposed method can quantify variability associated with the estimated optimal individualized treatment rule with high-dimensional covariates. Second, the proposed method is very flexible to depict both local and global associations between the treatment and baseline covariates in the presence of high-dimensional covariates, and thus it enjoys flexibility while achieving dimensionality reduction. Third, the SCB achieves the nominal confidence level asymptotically, and it provides a uniform inferential tool in making individualized treatment decisions. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 309-321
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1865167
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1865167
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:309-321
Template-Type: ReDIF-Article 1.0
Author-Name: Ruitao Lin
Author-X-Name-First: Ruitao
Author-X-Name-Last: Lin
Author-Name: Peter F. Thall
Author-X-Name-First: Peter F.
Author-X-Name-Last: Thall
Author-Name: Ying Yuan
Author-X-Name-First: Ying
Author-X-Name-Last: Yuan
Title: BAGS: A Bayesian Adaptive Group Sequential Trial Design With Subgroup-Specific Survival Comparisons
Abstract:
A Bayesian group sequential design is proposed that performs survival comparisons within patient subgroups in randomized trials where treatment–subgroup interactions may be present. A latent subgroup membership variable is assumed to allow the design to adaptively combine homogeneous subgroups, or split heterogeneous subgroups, to improve the procedure’s within-subgroup power. If a baseline covariate related to survival is available, the design may incorporate this information to improve subgroup identification while basing the comparative test on the average hazard ratio. General guidelines are provided for calibrating prior hyperparameters and design parameters to control the overall Type I error rate and optimize performance. Simulations show that the design is robust under a wide variety of different scenarios. When two or more subgroups are truly homogeneous but differ from the other subgroups, the proposed method is substantially more powerful than tests that either ignore subgroups or conduct a separate test within each subgroup. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 322-334
Issue: 533
Volume: 116
Year: 2020
Month: 11
X-DOI: 10.1080/01621459.2020.1837142
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1837142
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2020:i:533:p:322-334
Template-Type: ReDIF-Article 1.0
Author-Name: Steve Yadlowsky
Author-X-Name-First: Steve
Author-X-Name-Last: Yadlowsky
Author-Name: Fabio Pellegrini
Author-X-Name-First: Fabio
Author-X-Name-Last: Pellegrini
Author-Name: Federica Lionetto
Author-X-Name-First: Federica
Author-X-Name-Last: Lionetto
Author-Name: Stefan Braune
Author-X-Name-First: Stefan
Author-X-Name-Last: Braune
Author-Name: Lu Tian
Author-X-Name-First: Lu
Author-X-Name-Last: Tian
Title: Estimation and Validation of Ratio-based Conditional Average Treatment Effects Using Observational Data
Abstract:
While sample sizes in randomized clinical trials are large enough to estimate the average treatment effect well, they are often insufficient for estimation of treatment-covariate interactions critical to studying data-driven precision medicine. Observational data from real world practice may play an important role in alleviating this problem. One common approach in trials is to predict the outcome of interest with separate regression models in each treatment arm, and estimate the treatment effect based on the contrast of the predictions. Unfortunately, this simple approach may induce spurious treatment-covariate interaction in observational studies when the regression model is misspecified. Motivated by the need of modeling the number of relapses in multiple sclerosis (MS) patients, where the ratio of relapse rates is a natural choice of the treatment effect, we propose to estimate the conditional average treatment effect (CATE) as the ratio of expected potential outcomes, and derive a doubly robust estimator of this CATE in a semiparametric model of treatment-covariate interactions. We also provide a validation procedure to check the quality of the estimator on an independent sample. We conduct simulations to demonstrate the finite sample performance of the proposed methods, and illustrate their advantages on real data by examining the treatment effect of dimethyl fumarate compared to teriflunomide in MS patients. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 335-352
Issue: 533
Volume: 116
Year: 2020
Month: 7
X-DOI: 10.1080/01621459.2020.1772080
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1772080
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2020:i:533:p:335-352
Template-Type: ReDIF-Article 1.0
Author-Name: Xinran Li
Author-X-Name-First: Xinran
Author-X-Name-Last: Li
Author-Name: Xiao-Li Meng
Author-X-Name-First: Xiao-Li
Author-X-Name-Last: Meng
Title: A Multi-resolution Theory for Approximating Infinite-p-Zero-n: Transitional Inference, Individualized Predictions, and a World Without Bias-Variance Tradeoff
Abstract:
Transitional inference is an empiricism concept, rooted and practiced in clinical medicine since ancient Greece. Knowledge and experiences gained from treating one entity (e.g., a disease or a group of patients) are applied to treat a related but distinctively different one (e.g., a similar disease or a new patient). This notion of “transition to the similar” renders individualized treatments an operational meaning, yet its theoretical foundation defies the familiar inductive inference framework. The uniqueness of entities is the result of potentially an infinite number of attributes (hence p=∞), which entails zero direct training sample size (i.e., n = 0) because genuine guinea pigs do not exist. However, the literature on wavelets and on sieve methods for nonparametric estimation suggests a principled approximation theory for transitional inference via a multi-resolution (MR) perspective, where we use the resolution level to index the degree of approximation to ultimate individuality. MR inference seeks a primary resolution indexing an indirect training sample, which provides enough matched attributes to increase the relevance of the results to the target individuals and yet still accumulate sufficient indirect sample sizes for robust estimation. Theoretically, MR inference relies on an infinite-term ANOVA-type decomposition, providing an alternative way to model sparsity via the decay rate of the resolution bias as a function of the primary resolution level. Unexpectedly, this decomposition reveals a world without variance when the outcome is a deterministic function of potentially infinitely many predictors. In this deterministic world, the optimal resolution prefers over-fitting in the traditional sense when the resolution bias decays sufficiently rapidly. 
Furthermore, there can be many “descents” in the prediction error curve, when the contributions of predictors are inhomogeneous and the ordering of their importance does not align with the order of their inclusion in prediction. These findings may hint at a deterministic approximation theory for understanding the apparently over-fitting resistant phenomenon of some over-saturated models in machine learning.
Journal: Journal of the American Statistical Association
Pages: 353-367
Issue: 533
Volume: 116
Year: 2020
Month: 12
X-DOI: 10.1080/01621459.2020.1844210
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1844210
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2020:i:533:p:353-367
Template-Type: ReDIF-Article 1.0
Author-Name: Ashkan Ertefaie
Author-X-Name-First: Ashkan
Author-X-Name-Last: Ertefaie
Author-Name: James R. McKay
Author-X-Name-First: James R.
Author-X-Name-Last: McKay
Author-Name: David Oslin
Author-X-Name-First: David
Author-X-Name-Last: Oslin
Author-Name: Robert L. Strawderman
Author-X-Name-First: Robert L.
Author-X-Name-Last: Strawderman
Title: Robust Q-Learning
Abstract:
Q-learning is a regression-based approach that is widely used to formalize the development of an optimal dynamic treatment strategy. Finite dimensional working models are typically used to estimate certain nuisance parameters, and misspecification of these working models can result in residual confounding and/or efficiency loss. We propose a robust Q-learning approach which allows estimating such nuisance parameters using data-adaptive techniques. We study the asymptotic behavior of our estimators and provide simulation studies that highlight the need for and usefulness of the proposed method in practice. We use the data from the “Extending Treatment Effectiveness of Naltrexone” multistage randomized trial to illustrate our proposed methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 368-381
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1753522
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1753522
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:368-381
Template-Type: ReDIF-Article 1.0
Author-Name: Peng Liao
Author-X-Name-First: Peng
Author-X-Name-Last: Liao
Author-Name: Predrag Klasnja
Author-X-Name-First: Predrag
Author-X-Name-Last: Klasnja
Author-Name: Susan Murphy
Author-X-Name-First: Susan
Author-X-Name-Last: Murphy
Title: Off-Policy Estimation of Long-Term Average Outcomes With Applications to Mobile Health
Abstract:
Due to the recent advancements in wearables and sensing technology, health scientists are increasingly developing mobile health (mHealth) interventions. In mHealth interventions, mobile devices are used to deliver treatment to individuals as they go about their daily lives. These treatments are generally designed to impact a near time, proximal outcome such as stress or physical activity. The mHealth intervention policies, often called just-in-time adaptive interventions, are decision rules that map an individual’s current state (e.g., individual’s past behaviors as well as current observations of time, location, social activity, stress, and urges to smoke) to a particular treatment at each of many time points. The vast majority of current mHealth interventions deploy expert-derived policies. In this article, we provide an approach for conducting inference about the performance of one or more such policies using historical data collected under a possibly different policy. Our measure of performance is the average of proximal outcomes over a long time period should the particular mHealth policy be followed. We provide an estimator as well as confidence intervals. This work is motivated by HeartSteps, an mHealth physical activity intervention. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 382-391
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1807993
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1807993
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:382-391
Template-Type: ReDIF-Article 1.0
Author-Name: Xinkun Nie
Author-X-Name-First: Xinkun
Author-X-Name-Last: Nie
Author-Name: Emma Brunskill
Author-X-Name-First: Emma
Author-X-Name-Last: Brunskill
Author-Name: Stefan Wager
Author-X-Name-First: Stefan
Author-X-Name-Last: Wager
Title: Learning When-to-Treat Policies
Abstract:
Many applied decision-making problems have a dynamic component: The policymaker needs not only to choose whom to treat, but also when to start which treatment. For example, a medical doctor may choose between postponing treatment (watchful waiting) and prescribing one of several available treatments during the many visits from a patient. We develop an “advantage doubly robust” estimator for learning such dynamic treatment rules using observational data under the assumption of sequential ignorability. We prove welfare regret bounds that generalize results for doubly robust learning in the single-step setting, and show promising empirical performance in several different contexts. Our approach is practical for policy optimization, and does not need any structural (e.g., Markovian) assumptions. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 392-409
Issue: 533
Volume: 116
Year: 2020
Month: 11
X-DOI: 10.1080/01621459.2020.1831925
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1831925
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2020:i:533:p:392-409
Template-Type: ReDIF-Article 1.0
Author-Name: Xinyu Hu
Author-X-Name-First: Xinyu
Author-X-Name-Last: Hu
Author-Name: Min Qian
Author-X-Name-First: Min
Author-X-Name-Last: Qian
Author-Name: Bin Cheng
Author-X-Name-First: Bin
Author-X-Name-Last: Cheng
Author-Name: Ying Kuen Cheung
Author-X-Name-First: Ying Kuen
Author-X-Name-Last: Cheung
Title: Personalized Policy Learning Using Longitudinal Mobile Health Data
Abstract:
Personalized policy represents a paradigm shift from one decision rule for all users to an individualized decision rule for each user. Developing personalized policy in mobile health applications imposes challenges. First, owing to lack of adherence, data from each user are limited. Second, unmeasured contextual factors can potentially impact decision making. Aiming to optimize immediate rewards, we propose using a generalized linear mixed modeling framework where population features and individual features are modeled as fixed and random effects, respectively, and synthesized to form the personalized policy. The group lasso type penalty is imposed to avoid overfitting of individual deviations from the population model. We examine the conditions under which the proposed method works in the presence of time-varying endogenous covariates, and provide conditional optimality and marginal consistency results of the expected immediate outcome under the estimated policies. We apply our method to develop personalized push (“prompt”) schedules in 294 app users, with the goal to maximize the prompt response rate given past app usage and other contextual factors. The proposed method compares favorably to existing estimation methods including using the R function “glmer” in a simulation study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 410-420
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2020.1785476
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1785476
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:410-420
Template-Type: ReDIF-Article 1.0
Author-Name: Yilun Sun
Author-X-Name-First: Yilun
Author-X-Name-Last: Sun
Author-Name: Lu Wang
Author-X-Name-First: Lu
Author-X-Name-Last: Wang
Title: Stochastic Tree Search for Estimating Optimal Dynamic Treatment Regimes
Abstract:
A dynamic treatment regime (DTR) is a sequence of decision rules that adapt to the time-varying states of an individual. Black-box learning methods have shown great potential in predicting the optimal treatments; however, the resulting DTRs lack interpretability, which is of paramount importance for medical experts to understand and implement. We present a stochastic tree-based reinforcement learning (ST-RL) method for estimating optimal DTRs in a multistage multitreatment setting with data from either randomized trials or observational studies. At each stage, ST-RL constructs a decision tree by first modeling the mean of counterfactual outcomes via nonparametric regression models, and then stochastically searching for the optimal tree-structured decision rule using a Markov chain Monte Carlo algorithm. We implement the proposed method in a backward inductive fashion through multiple decision stages. The proposed ST-RL delivers optimal DTRs with better interpretability and contributes to the existing literature in its non-greedy policy search. Additionally, ST-RL demonstrates stable and outstanding performances even with a large number of covariates, which is especially appealing when data are from large observational studies. We illustrate the performance of ST-RL through simulation studies, and also a real data application using esophageal cancer data collected from 1170 patients at MD Anderson Cancer Center from 1998 to 2012. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 421-432
Issue: 533
Volume: 116
Year: 2020
Month: 10
X-DOI: 10.1080/01621459.2020.1819294
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1819294
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2020:i:533:p:421-432
Template-Type: ReDIF-Article 1.0
Author-Name: Christopher Nemeth
Author-X-Name-First: Christopher
Author-X-Name-Last: Nemeth
Author-Name: Paul Fearnhead
Author-X-Name-First: Paul
Author-X-Name-Last: Fearnhead
Title: Stochastic Gradient Markov Chain Monte Carlo
Abstract:
Markov chain Monte Carlo (MCMC) algorithms are generally regarded as the gold standard technique for Bayesian inference. They are theoretically well-understood and conceptually simple to apply in practice. The drawback of MCMC is that performing exact inference generally requires all of the data to be processed at each iteration of the algorithm. For large datasets, the computational cost of MCMC can be prohibitive, which has led to recent developments in scalable Monte Carlo algorithms that have a significantly lower computational cost than standard MCMC. In this article, we focus on a particular class of scalable Monte Carlo algorithms, stochastic gradient Markov chain Monte Carlo (SGMCMC) which utilizes data subsampling techniques to reduce the per-iteration cost of MCMC. We provide an introduction to some popular SGMCMC algorithms and review the supporting theoretical results, as well as comparing the efficiency of SGMCMC algorithms against MCMC on benchmark examples. The supporting R code is available online at https://github.com/chris-nemeth/sgmcmc-review-paper.
Journal: Journal of the American Statistical Association
Pages: 433-450
Issue: 533
Volume: 116
Year: 2021
Month: 1
X-DOI: 10.1080/01621459.2020.1847120
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1847120
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:433-450
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Handbook of Spatial Epidemiology
Journal: Journal of the American Statistical Association
Pages: 451-453
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2021.1880230
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1880230
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:451-453
Template-Type: ReDIF-Article 1.0
Author-Name: Grace S. Chiu
Author-X-Name-First: Grace S.
Author-X-Name-Last: Chiu
Title: Handbook of Environmental and Ecological Statistics.
Journal: Journal of the American Statistical Association
Pages: 453-455
Issue: 533
Volume: 116
Year: 2021
Month: 3
X-DOI: 10.1080/01621459.2021.1880232
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1880232
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:453-455
Template-Type: ReDIF-Article 1.0
Author-Name: Xinkun Nie
Author-X-Name-First: Xinkun
Author-X-Name-Last: Nie
Author-Name: Emma Brunskill
Author-X-Name-First: Emma
Author-X-Name-Last: Brunskill
Author-Name: Stefan Wager
Author-X-Name-First: Stefan
Author-X-Name-Last: Wager
Title: Learning When-to-Treat Policies
Abstract:
Many applied decision-making problems have a dynamic component: The policymaker needs not only to choose whom to treat, but also when to start which treatment. For example, a medical doctor may choose between postponing treatment (watchful waiting) and prescribing one of several available treatments during the many visits from a patient. We develop an “advantage doubly robust” estimator for learning such dynamic treatment rules using observational data under the assumption of sequential ignorability. We prove welfare regret bounds that generalize results for doubly robust learning in the single-step setting, and show promising empirical performance in several different contexts. Our approach is practical for policy optimization, and does not need any structural (e.g., Markovian) assumptions. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 392-409
Issue: 533
Volume: 116
Year: 2021
Month: 1
X-DOI: 10.1080/01621459.2020.1831925
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1831925
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:392-409
Template-Type: ReDIF-Article 1.0
Author-Name: Yilun Sun
Author-X-Name-First: Yilun
Author-X-Name-Last: Sun
Author-Name: Lu Wang
Author-X-Name-First: Lu
Author-X-Name-Last: Wang
Title: Stochastic Tree Search for Estimating Optimal Dynamic Treatment Regimes
Abstract:
A dynamic treatment regime (DTR) is a sequence of decision rules that adapt to the time-varying states of an individual. Black-box learning methods have shown great potential in predicting the optimal treatments; however, the resulting DTRs lack interpretability, which is of paramount importance for medical experts to understand and implement. We present a stochastic tree-based reinforcement learning (ST-RL) method for estimating optimal DTRs in a multistage multitreatment setting with data from either randomized trials or observational studies. At each stage, ST-RL constructs a decision tree by first modeling the mean of counterfactual outcomes via nonparametric regression models, and then stochastically searching for the optimal tree-structured decision rule using a Markov chain Monte Carlo algorithm. We implement the proposed method in a backward inductive fashion through multiple decision stages. The proposed ST-RL delivers optimal DTRs with better interpretability and contributes to the existing literature in its non-greedy policy search. Additionally, ST-RL demonstrates stable and outstanding performances even with a large number of covariates, which is especially appealing when data are from large observational studies. We illustrate the performance of ST-RL through simulation studies, and also a real data application using esophageal cancer data collected from 1170 patients at MD Anderson Cancer Center from 1998 to 2012. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 421-432
Issue: 533
Volume: 116
Year: 2021
Month: 1
X-DOI: 10.1080/01621459.2020.1819294
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1819294
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:421-432
Template-Type: ReDIF-Article 1.0
Author-Name: Yuan Chen
Author-X-Name-First: Yuan
Author-X-Name-Last: Chen
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: Yuanjia Wang
Author-X-Name-First: Yuanjia
Author-X-Name-Last: Wang
Title: Learning Individualized Treatment Rules for Multiple-Domain Latent Outcomes
Abstract:
For many mental disorders, latent mental status from multiple-domain psychological or clinical symptoms may provide a better characterization of the underlying disorder status than a simple summary score of the symptoms, and they may also serve as more reliable and representative features to differentiate treatment responses. Therefore, to address the complexity and heterogeneity of treatment responses for mental disorders, we provide a new paradigm for learning optimal individualized treatment rules (ITRs) by modeling patients’ latent mental status. We first learn the multi-domain latent states at baseline from the observed symptoms under a restricted Boltzmann machine (RBM) model, which encodes patients’ heterogeneous symptoms using an economical number of latent variables and yet remains flexible. We then optimize a value function defined by the latent states after treatment by exploiting a transformation of the observed symptoms based on the RBM without modeling the relationship between the latent mental states before and after treatment. The optimal treatment rules are derived using a weighted large margin classifier. We derive the convergence rate of the proposed estimator under the latent models. Simulation studies are conducted to test the performance of the proposed method. Finally, we apply the developed method to real-world studies and we demonstrate the utility and advantage of our method in tailoring treatments for patients with major depression, and identify patient subgroups informative for treatment recommendations. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 269-282
Issue: 533
Volume: 116
Year: 2021
Month: 1
X-DOI: 10.1080/01621459.2020.1817751
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1817751
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:269-282
Template-Type: ReDIF-Article 1.0
Author-Name: Ruitao Lin
Author-X-Name-First: Ruitao
Author-X-Name-Last: Lin
Author-Name: Peter F. Thall
Author-X-Name-First: Peter F.
Author-X-Name-Last: Thall
Author-Name: Ying Yuan
Author-X-Name-First: Ying
Author-X-Name-Last: Yuan
Title: BAGS: A Bayesian Adaptive Group Sequential Trial Design With Subgroup-Specific Survival Comparisons
Abstract:
A Bayesian group sequential design is proposed that performs survival comparisons within patient subgroups in randomized trials where treatment–subgroup interactions may be present. A latent subgroup membership variable is assumed to allow the design to adaptively combine homogeneous subgroups, or split heterogeneous subgroups, to improve the procedure’s within-subgroup power. If a baseline covariate related to survival is available, the design may incorporate this information to improve subgroup identification while basing the comparative test on the average hazard ratio. General guidelines are provided for calibrating prior hyperparameters and design parameters to control the overall Type I error rate and optimize performance. Simulations show that the design is robust under a wide variety of different scenarios. When two or more subgroups are truly homogeneous but differ from the other subgroups, the proposed method is substantially more powerful than tests that either ignore subgroups or conduct a separate test within each subgroup. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 322-334
Issue: 533
Volume: 116
Year: 2021
Month: 1
X-DOI: 10.1080/01621459.2020.1837142
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1837142
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:322-334
Template-Type: ReDIF-Article 1.0
Author-Name: Jared D. Huling
Author-X-Name-First: Jared D.
Author-X-Name-Last: Huling
Author-Name: Maureen A. Smith
Author-X-Name-First: Maureen A.
Author-X-Name-Last: Smith
Author-Name: Guanhua Chen
Author-X-Name-First: Guanhua
Author-X-Name-Last: Chen
Title: A Two-Part Framework for Estimating Individualized Treatment Rules From Semicontinuous Outcomes
Abstract:
Health care payments are an important component of health care utilization and are thus a major focus in health services and health policy applications. However, payment outcomes are semicontinuous in that over a given period of time some patients incur no payments and some patients incur large costs. Individualized treatment rules (ITRs) are a major part of the push for tailoring treatments and interventions to patients, yet there is little work focused on estimating ITRs from semicontinuous outcomes. In this article, we develop a framework for estimation of ITRs based on two-part modeling, wherein the ITR is estimated by separately targeting the zero part of the outcome and the strictly positive part. To improve performance when high-dimensional covariates are available, we leverage a scientifically plausible penalty that simultaneously selects variables and encourages the signs of coefficients for each variable to agree between the two components of the ITR. We develop an efficient algorithm for computation and prove oracle inequalities for the resulting estimation and prediction errors. We demonstrate the effectiveness of our approach in simulated examples and in a study of a health system intervention. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 210-223
Issue: 533
Volume: 116
Year: 2021
Month: 1
X-DOI: 10.1080/01621459.2020.1801449
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1801449
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:210-223
Template-Type: ReDIF-Article 1.0
Author-Name: Ting-Huei Chen
Author-X-Name-First: Ting-Huei
Author-X-Name-Last: Chen
Author-Name: Nilanjan Chatterjee
Author-X-Name-First: Nilanjan
Author-X-Name-Last: Chatterjee
Author-Name: Maria Teresa Landi
Author-X-Name-First: Maria Teresa
Author-X-Name-Last: Landi
Author-Name: Jianxin Shi
Author-X-Name-First: Jianxin
Author-X-Name-Last: Shi
Title: A Penalized Regression Framework for Building Polygenic Risk Models Based on Summary Statistics From Genome-Wide Association Studies and Incorporating External Information
Abstract:
Large-scale genome-wide association studies (GWAS) provide opportunities for developing genetic risk prediction models that have the potential to improve disease prevention, intervention or treatment. The key step is to develop polygenic risk score (PRS) models with high predictive performance for a given disease, which typically requires a large training dataset for selecting truly associated single nucleotide polymorphisms (SNPs) and estimating effect sizes accurately. Here, we develop a comprehensive penalized regression for fitting l1 regularized regression models to GWAS summary statistics. We propose incorporating pleiotropy and annotation information into PRS (PANPRS) development through suitable formulation of penalty functions and associated tuning parameters. Extensive simulations show that PANPRS performs equally well or better than existing PRS methods when no functional annotation or pleiotropy is incorporated. When functional annotation data and pleiotropy are informative, PANPRS substantially outperforms existing PRS methods in simulations. Finally, we applied our methods to build PRS for type 2 diabetes and melanoma and found that incorporating relevant functional annotations and GWAS of genetically related traits improved prediction of these two complex diseases. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 133-143
Issue: 533
Volume: 116
Year: 2021
Month: 1
X-DOI: 10.1080/01621459.2020.1764849
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1764849
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:133-143
Template-Type: ReDIF-Article 1.0
Author-Name: Yifan Cui
Author-X-Name-First: Yifan
Author-X-Name-Last: Cui
Author-Name: Eric Tchetgen Tchetgen
Author-X-Name-First: Eric
Author-X-Name-Last: Tchetgen Tchetgen
Title: A Semiparametric Instrumental Variable Approach to Optimal Treatment Regimes Under Endogeneity
Abstract:
There is a fast-growing literature on estimating optimal treatment regimes based on randomized trials or observational studies under a key identifying condition of no unmeasured confounding. Because confounding by unmeasured factors cannot generally be ruled out with certainty in observational studies or randomized trials subject to noncompliance, we propose a general instrumental variable (IV) approach to learning optimal treatment regimes under endogeneity. Specifically, we establish identification of both the value function E[Y_{D(L)}] for a given regime D and optimal regimes argmax_D E[Y_{D(L)}] with the aid of a binary IV, when the assumption of no unmeasured confounding fails to hold. We also construct novel multiply robust classification-based estimators. Furthermore, we propose to identify and estimate optimal treatment regimes among those who would comply to the assigned treatment under a monotonicity assumption. In this latter case, we establish the somewhat surprising result that complier optimal regimes can be consistently estimated without directly collecting compliance information and therefore without the complier average treatment effect itself being identified. Our approach is illustrated via extensive simulation studies and a data application on the effect of child rearing on labor participation. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 162-173
Issue: 533
Volume: 116
Year: 2021
Month: 1
X-DOI: 10.1080/01621459.2020.1783272
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1783272
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:162-173
Template-Type: ReDIF-Article 1.0
Author-Name: Steve Yadlowsky
Author-X-Name-First: Steve
Author-X-Name-Last: Yadlowsky
Author-Name: Fabio Pellegrini
Author-X-Name-First: Fabio
Author-X-Name-Last: Pellegrini
Author-Name: Federica Lionetto
Author-X-Name-First: Federica
Author-X-Name-Last: Lionetto
Author-Name: Stefan Braune
Author-X-Name-First: Stefan
Author-X-Name-Last: Braune
Author-Name: Lu Tian
Author-X-Name-First: Lu
Author-X-Name-Last: Tian
Title: Estimation and Validation of Ratio-based Conditional Average Treatment Effects Using Observational Data
Abstract:
While sample sizes in randomized clinical trials are large enough to estimate the average treatment effect well, they are often insufficient for estimation of treatment-covariate interactions critical to studying data-driven precision medicine. Observational data from real world practice may play an important role in alleviating this problem. One common approach in trials is to predict the outcome of interest with separate regression models in each treatment arm, and estimate the treatment effect based on the contrast of the predictions. Unfortunately, this simple approach may induce spurious treatment-covariate interaction in observational studies when the regression model is misspecified. Motivated by the need of modeling the number of relapses in multiple sclerosis (MS) patients, where the ratio of relapse rates is a natural choice of the treatment effect, we propose to estimate the conditional average treatment effect (CATE) as the ratio of expected potential outcomes, and derive a doubly robust estimator of this CATE in a semiparametric model of treatment-covariate interactions. We also provide a validation procedure to check the quality of the estimator on an independent sample. We conduct simulations to demonstrate the finite sample performance of the proposed methods, and illustrate their advantages on real data by examining the treatment effect of dimethyl fumarate compared to teriflunomide in MS patients. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 335-352
Issue: 533
Volume: 116
Year: 2021
Month: 1
X-DOI: 10.1080/01621459.2020.1772080
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1772080
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:335-352
Template-Type: ReDIF-Article 1.0
Author-Name: Xinran Li
Author-X-Name-First: Xinran
Author-X-Name-Last: Li
Author-Name: Xiao-Li Meng
Author-X-Name-First: Xiao-Li
Author-X-Name-Last: Meng
Title: A Multi-resolution Theory for Approximating Infinite-p-Zero-n: Transitional Inference, Individualized Predictions, and a World Without Bias-Variance Tradeoff
Abstract:
Transitional inference is an empiricism concept, rooted and practiced in clinical medicine since ancient Greece. Knowledge and experiences gained from treating one entity (e.g., a disease or a group of patients) are applied to treat a related but distinctively different one (e.g., a similar disease or a new patient). This notion of “transition to the similar” renders individualized treatments an operational meaning, yet its theoretical foundation defies the familiar inductive inference framework. The uniqueness of entities is the result of potentially an infinite number of attributes (hence p=∞), which entails zero direct training sample size (i.e., n = 0) because genuine guinea pigs do not exist. However, the literature on wavelets and on sieve methods for nonparametric estimation suggests a principled approximation theory for transitional inference via a multi-resolution (MR) perspective, where we use the resolution level to index the degree of approximation to ultimate individuality. MR inference seeks a primary resolution indexing an indirect training sample, which provides enough matched attributes to increase the relevance of the results to the target individuals and yet still accumulate sufficient indirect sample sizes for robust estimation. Theoretically, MR inference relies on an infinite-term ANOVA-type decomposition, providing an alternative way to model sparsity via the decay rate of the resolution bias as a function of the primary resolution level. Unexpectedly, this decomposition reveals a world without variance when the outcome is a deterministic function of potentially infinitely many predictors. In this deterministic world, the optimal resolution prefers over-fitting in the traditional sense when the resolution bias decays sufficiently rapidly. Furthermore, there can be many “descents” in the prediction error curve, when the contributions of predictors are inhomogeneous and the ordering of their importance does not align with the order of their inclusion in prediction. These findings may hint at a deterministic approximation theory for understanding the apparently over-fitting resistant phenomenon of some over-saturated models in machine learning.
Journal: Journal of the American Statistical Association
Pages: 353-367
Issue: 533
Volume: 116
Year: 2021
Month: 1
X-DOI: 10.1080/01621459.2020.1844210
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1844210
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:533:p:353-367
Template-Type: ReDIF-Article 1.0
Author-Name: Shonosuke Sugasawa
Author-X-Name-First: Shonosuke
Author-X-Name-Last: Sugasawa
Title: Grouped Heterogeneous Mixture Modeling for Clustered Data
Abstract:
Clustered data are ubiquitous in a variety of scientific fields. In this article, we propose a flexible and interpretable modeling approach, called grouped heterogeneous mixture modeling, for clustered data, which models cluster-wise conditional distributions by mixtures of latent conditional distributions common to all the clusters. In the model, we assume that clusters are divided into a finite number of groups and mixing proportions are the same within the same group. We provide a simple generalized EM algorithm for computing the maximum likelihood estimator, and an information criterion to select the numbers of groups and latent distributions. We also propose structured grouping strategies by introducing penalties on grouping parameters in the likelihood function. Under the settings where both the number of clusters and cluster sizes tend to infinity, we present asymptotic properties of the maximum likelihood estimator and the information criterion. We demonstrate the proposed method through simulation studies and an application to crime risk modeling in Tokyo.
Journal: Journal of the American Statistical Association
Pages: 999-1010
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1777136
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1777136
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:999-1010
Template-Type: ReDIF-Article 1.0
Author-Name: Nathan Kallus
Author-X-Name-First: Nathan
Author-X-Name-Last: Kallus
Title: Rejoinder: New Objectives for Policy Learning
Journal: Journal of the American Statistical Association
Pages: 694-698
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1866580
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1866580
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:694-698
Template-Type: ReDIF-Article 1.0
Author-Name: Jared S. Murray
Author-X-Name-First: Jared S.
Author-X-Name-Last: Murray
Title: Log-Linear Bayesian Additive Regression Trees for Multinomial Logistic and Count Regression Models
Abstract:
We introduce Bayesian additive regression trees (BART) for log-linear models including multinomial logistic regression and count regression with zero-inflation and overdispersion. BART has been applied to nonparametric mean regression and binary classification problems in a range of settings. However, existing applications of BART have been mostly limited to models for Gaussian “data,” either observed or latent. This is primarily because efficient MCMC algorithms are available for Gaussian likelihoods. But while many useful models are naturally cast in terms of latent Gaussian variables, many others are not—including models considered in this article. We develop new data augmentation strategies and carefully specified prior distributions for these new models. Like the original BART prior, the new prior distributions are carefully constructed and calibrated to be flexible while guarding against overfitting. Together the new priors and data augmentation schemes allow us to implement an efficient MCMC sampler outside the context of Gaussian models. The utility of these new methods is illustrated with examples and an application to a previously published dataset. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 756-769
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1813587
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1813587
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:756-769
Template-Type: ReDIF-Article 1.0
Author-Name: Srinjoy Das
Author-X-Name-First: Srinjoy
Author-X-Name-Last: Das
Author-Name: Dimitris N. Politis
Author-X-Name-First: Dimitris N.
Author-X-Name-Last: Politis
Title: Predictive Inference for Locally Stationary Time Series With an Application to Climate Data
Abstract:
The model-free prediction principle of Politis has been successfully applied to general regression problems, as well as problems involving stationary time series. However, with long time series, for example, annual temperature measurements spanning over 100 years or daily financial returns spanning several years, it may be unrealistic to assume stationarity throughout the span of the dataset. In this article, we show how model-free prediction can be applied to handle time series that are only locally stationary, that is, they can be assumed to be stationary only over short time-windows. Surprisingly, there is little literature on point prediction for general locally stationary time series even in model-based setups, and there is no literature whatsoever on the construction of prediction intervals of locally stationary time series. We attempt to fill this gap here as well. Both one-step-ahead point predictors and prediction intervals are constructed, and the performance of model-free prediction is compared to that of model-based prediction using models that incorporate a trend and/or heteroscedasticity. Both aspects of the article, model-free and model-based, are novel in the context of time series that are locally (but not globally) stationary. We also demonstrate the application of our model-based and model-free prediction methods to speleothem climate data which exhibits local stationarity and show that our best model-free point prediction results outperform those obtained with the RAMPFIT algorithm previously used for analysis of this type of data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 919-934
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2019.1708368
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1708368
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:919-934
Template-Type: ReDIF-Article 1.0
Author-Name: Francesca Tang
Author-X-Name-First: Francesca
Author-X-Name-Last: Tang
Author-Name: Yang Feng
Author-X-Name-First: Yang
Author-X-Name-Last: Feng
Author-Name: Hamza Chiheb
Author-X-Name-First: Hamza
Author-X-Name-Last: Chiheb
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Title: The Interplay of Demographic Variables and Social Distancing Scores in Deep Prediction of U.S. COVID-19 Cases
Abstract:
With the severity of the COVID-19 outbreak, we characterize the nature of the growth trajectories of counties in the United States using a novel combination of spectral clustering and the correlation matrix. As the United States and the rest of the world are still suffering from the effects of the virus, the importance of assigning growth membership to counties and understanding the determinants of the growth is increasingly evident. For the two communities (faster versus slower growth trajectories) we cluster the counties into, the average between-group correlation is 88.4% whereas the average within-group correlations are 95.0% and 93.8%. The average growth rate for one group is 0.1589 and 0.1704 for the other, further suggesting that our methodology captures meaningful differences between the nature of the growth across various counties. Subsequently, we select the demographic features that are most statistically significant in distinguishing the communities: number of grocery stores, number of bars, Asian population, White population, median household income, number of people with bachelor’s degrees, and population density. Lastly, we effectively predict the future growth of a given county with a long short-term memory (LSTM) recurrent neural network using three social distancing scores. The best-performing model achieves a median out-of-sample R^2 of 0.6251 for a four-day-ahead prediction, and we find that the number of communities and social distancing features play an important role in producing more accurate forecasts. This comprehensive study captures the nature of the counties’ growth in cases at a very micro-level using growth communities, demographic factors, and social distancing performance to help government agencies utilize known information to make appropriate decisions regarding which counties to target for resources and funding. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 492-506
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2021.1901717
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1901717
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:492-506
Template-Type: ReDIF-Article 1.0
Author-Name: Sijia Li
Author-X-Name-First: Sijia
Author-X-Name-Last: Li
Author-Name: Xiudi Li
Author-X-Name-First: Xiudi
Author-X-Name-Last: Li
Author-Name: Alex Luedtke
Author-X-Name-First: Alex
Author-X-Name-Last: Luedtke
Title: Discussion of Kallus (2020) and Mo, Qi, and Liu (2020): New Objectives for Policy Learning
Journal: Journal of the American Statistical Association
Pages: 680-689
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1837140
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1837140
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:680-689
Template-Type: ReDIF-Article 1.0
Author-Name: Yingda Jiang
Author-X-Name-First: Yingda
Author-X-Name-Last: Jiang
Author-Name: Chi-Yang Chiu
Author-X-Name-First: Chi-Yang
Author-X-Name-Last: Chiu
Author-Name: Qi Yan
Author-X-Name-First: Qi
Author-X-Name-Last: Yan
Author-Name: Wei Chen
Author-X-Name-First: Wei
Author-X-Name-Last: Chen
Author-Name: Michael B. Gorin
Author-X-Name-First: Michael B.
Author-X-Name-Last: Gorin
Author-Name: Yvette P. Conley
Author-X-Name-First: Yvette P.
Author-X-Name-Last: Conley
Author-Name: M’Hamed Lajmi Lakhal-Chaieb
Author-X-Name-First: M’Hamed Lajmi
Author-X-Name-Last: Lakhal-Chaieb
Author-Name: Richard J. Cook
Author-X-Name-First: Richard J.
Author-X-Name-Last: Cook
Author-Name: Christopher I. Amos
Author-X-Name-First: Christopher I.
Author-X-Name-Last: Amos
Author-Name: Alexander F. Wilson
Author-X-Name-First: Alexander F.
Author-X-Name-Last: Wilson
Author-Name: Joan E. Bailey-Wilson
Author-X-Name-First: Joan E.
Author-X-Name-Last: Bailey-Wilson
Author-Name: Francis J. McMahon
Author-X-Name-First: Francis J.
Author-X-Name-Last: McMahon
Author-Name: Ana I. Vazquez
Author-X-Name-First: Ana I.
Author-X-Name-Last: Vazquez
Author-Name: Ao Yuan
Author-X-Name-First: Ao
Author-X-Name-Last: Yuan
Author-Name: Xiaogang Zhong
Author-X-Name-First: Xiaogang
Author-X-Name-Last: Zhong
Author-Name: Momiao Xiong
Author-X-Name-First: Momiao
Author-X-Name-Last: Xiong
Author-Name: Daniel E. Weeks
Author-X-Name-First: Daniel E.
Author-X-Name-Last: Weeks
Author-Name: Ruzong Fan
Author-X-Name-First: Ruzong
Author-X-Name-Last: Fan
Title: Gene-Based Association Testing of Dichotomous Traits With Generalized Functional Linear Mixed Models Using Extended Pedigrees: Applications to Age-Related Macular Degeneration
Abstract:
Genetics plays a role in age-related macular degeneration (AMD), a common cause of blindness in the elderly. There is a need for powerful methods for carrying out region-based association tests between a dichotomous trait like AMD and genetic variants on family data. Here, we apply our new generalized functional linear mixed models (GFLMM) developed to test for gene-based association in a set of AMD families. Using common and rare variants, we observe significant association with two known AMD genes: CFH and ARMS2. Using rare variants, we find suggestive signals in four genes: ASAH1, CLEC6A, TMEM63C, and SGSM1. Intriguingly, ASAH1 is down-regulated in AMD aqueous humor, and ASAH1 deficiency leads to retinal inflammation and increased vulnerability to oxidative stress. These findings were made possible by our GFLMM which model the effect of a major gene as a fixed mean, the polygenic contributions as a random variation, and the correlation of pedigree members by kinship coefficients. Simulations indicate that the GFLMM likelihood ratio tests (LRTs) accurately control the Type I error rates. The LRTs have similar or higher power than existing retrospective kernel and burden statistics. Our GFLMM-based statistics provide a new tool for conducting family-based genetic studies of complex diseases. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 531-545
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1799809
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1799809
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:531-545
Template-Type: ReDIF-Article 1.0
Author-Name: Stijn Vansteelandt
Author-X-Name-First: Stijn
Author-X-Name-Last: Vansteelandt
Author-Name: Oliver Dukes
Author-X-Name-First: Oliver
Author-X-Name-Last: Dukes
Title: Discussion of Kallus and Mo, Qi, and Liu: New Objectives for Policy Learning
Journal: Journal of the American Statistical Association
Pages: 675-679
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1844718
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1844718
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:675-679
Template-Type: ReDIF-Article 1.0
Author-Name: Ian L. Dryden
Author-X-Name-First: Ian L.
Author-X-Name-Last: Dryden
Author-Name: Alfred Kume
Author-X-Name-First: Alfred
Author-X-Name-Last: Kume
Author-Name: Phillip J. Paine
Author-X-Name-First: Phillip J.
Author-X-Name-Last: Paine
Author-Name: Andrew T. A. Wood
Author-X-Name-First: Andrew T. A.
Author-X-Name-Last: Wood
Title: Regression Modeling for Size-and-Shape Data Based on a Gaussian Model for Landmarks
Abstract:
In this article, we propose a regression model for size-and-shape response data. So far as we are aware, few such models have been explored in the literature to date. We assume a Gaussian model for labeled landmarks; these landmarks are used to represent the random objects under study. The regression structure, assumed in this article to be linear in the ambient space, enters through the landmark means. Two approaches to parameter estimation are considered. The first approach is based directly on the marginal likelihood for the landmark-based shapes. In the second approach, we treat the orientations of the landmarks as missing data, and we set up a model-consistent estimation procedure for the parameters using the EM algorithm. Both approaches raise challenging computational issues which we explain how to deal with. The usefulness of this regression modeling framework is demonstrated through real-data examples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1011-1022
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1724115
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1724115
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:1011-1022
Template-Type: ReDIF-Article 1.0
Author-Name: Rong Ma
Author-X-Name-First: Rong
Author-X-Name-Last: Ma
Author-Name: T. Tony Cai
Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai
Author-Name: Hongzhe Li
Author-X-Name-First: Hongzhe
Author-X-Name-Last: Li
Title: Global and Simultaneous Hypothesis Testing for High-Dimensional Logistic Regression Models
Abstract:
High-dimensional logistic regression is widely used in analyzing data with binary outcomes. In this article, global testing and large-scale multiple testing for the regression coefficients are considered in both single- and two-regression settings. A test statistic for testing the global null hypothesis is constructed using a generalized low-dimensional projection for bias correction and its asymptotic null distribution is derived. A lower bound for the global testing is established, which shows that the proposed test is asymptotically minimax optimal over some sparsity range. For testing the individual coefficients simultaneously, multiple testing procedures are proposed and shown to control the false discovery rate and falsely discovered variables asymptotically. Simulation studies are carried out to examine the numerical performance of the proposed tests and their superiority over existing methods. The testing procedures are also illustrated by analyzing a dataset of a metabolomics study that investigates the association between fecal metabolites and pediatric Crohn’s disease and the effects of treatment on such associations. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 984-998
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2019.1699421
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1699421
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:984-998
Template-Type: ReDIF-Article 1.0
Author-Name: Yifei Sun
Author-X-Name-First: Yifei
Author-X-Name-Last: Sun
Author-Name: Charles E. McCulloch
Author-X-Name-First: Charles E.
Author-X-Name-Last: McCulloch
Author-Name: Kieren A. Marr
Author-X-Name-First: Kieren A.
Author-X-Name-Last: Marr
Author-Name: Chiung-Yu Huang
Author-X-Name-First: Chiung-Yu
Author-X-Name-Last: Huang
Title: Recurrent Events Analysis With Data Collected at Informative Clinical Visits in Electronic Health Records
Abstract:
Although increasingly used as a data resource for assembling cohorts, electronic health records (EHRs) pose many analytic challenges. In particular, a patient’s health status influences when and what data are recorded, generating sampling bias in the collected data. In this article, we consider recurrent event analysis using EHR data. Conventional regression methods for event risk analysis usually require the values of covariates to be observed throughout the follow-up period. In EHR databases, time-dependent covariates are intermittently measured during clinical visits, and the timing of these visits is informative in the sense that it depends on the disease course. Simple methods, such as the last-observation-carried-forward approach, can lead to biased estimation. On the other hand, complex joint models require additional assumptions on the covariate process and cannot be easily extended to handle multiple longitudinal predictors. By incorporating sampling weights derived from estimating the observation time process, we develop a novel estimation procedure based on inverse-rate-weighting and kernel-smoothing for the semiparametric proportional rate model of recurrent events. The proposed methods do not require model specifications for the covariate processes and can easily handle multiple time-dependent covariates. Our methods are applied to a kidney transplant study for illustration. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 594-604
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1801447
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1801447
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:594-604
Template-Type: ReDIF-Article 1.0
Author-Name: Min Jin Ha
Author-X-Name-First: Min Jin
Author-X-Name-Last: Ha
Author-Name: Francesco Claudio Stingo
Author-X-Name-First: Francesco Claudio
Author-X-Name-Last: Stingo
Author-Name: Veerabhadran Baladandayuthapani
Author-X-Name-First: Veerabhadran
Author-X-Name-Last: Baladandayuthapani
Title: Bayesian Structure Learning in Multilayered Genomic Networks
Abstract:
Integrative network modeling of data arising from multiple genomic platforms provides insight into the holistic picture of the interactive system, as well as the flow of information across many disease domains including cancer. The basic data structure consists of a sequence of hierarchically ordered datasets for each individual subject, which facilitates integration of diverse inputs, such as genomic, transcriptomic, and proteomic data. A primary analytical task in such contexts is to model the layered architecture of networks where the vertices can be naturally partitioned into ordered layers, dictated by multiple platforms, and exhibit both undirected and directed relationships. We propose a multilayered Gaussian graphical model (mlGGM) to investigate conditional independence structures in such multilevel genomic networks in human cancers. We implement a Bayesian node-wise selection (BANS) approach based on variable selection techniques that coherently accounts for the multiple types of dependencies in mlGGM; this flexible strategy exploits edge-specific prior knowledge and selects sparse and interpretable models. Through simulated data generated under various scenarios, we demonstrate that BANS outperforms other existing multivariate regression-based methodologies. Our integrative genomic network analysis for key signaling pathways across multiple cancer types highlights commonalities and differences of p53 integrative networks and epigenetic effects of BRCA2 on p53 and its interaction with T68 phosphorylated CHK2, which may have translational utility for identifying biomarkers and therapeutic targets. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 605-618
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1775611
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1775611
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:605-618
Template-Type: ReDIF-Article 1.0
Author-Name: Xiaohan Yan
Author-X-Name-First: Xiaohan
Author-X-Name-Last: Yan
Author-Name: Jacob Bien
Author-X-Name-First: Jacob
Author-X-Name-Last: Bien
Title: Rare Feature Selection in High Dimensions
Abstract:
It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which many columns are highly sparse. The challenge posed by such “rare features” has received little attention despite its prevalence in diverse areas, ranging from natural language processing (e.g., rare words) to biology (e.g., rare species). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. Our strategy leverages side information in the form of a tree that encodes feature similarity. We apply our method to data from TripAdvisor, in which we predict the numerical rating of a hotel based on the text of the associated review. Our method achieves high accuracy by making effective use of rare words; by contrast, the lasso is unable to identify highly predictive words if they are too rare. A companion R package, called rare, implements our new estimator, using the alternating direction method of multipliers. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 887-900
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1796677
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1796677
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:887-900
Template-Type: ReDIF-Article 1.0
Author-Name: Eric J. Tchetgen Tchetgen
Author-X-Name-First: Eric J.
Author-X-Name-Last: Tchetgen Tchetgen
Author-Name: Isabel R. Fulcher
Author-X-Name-First: Isabel R.
Author-X-Name-Last: Fulcher
Author-Name: Ilya Shpitser
Author-X-Name-First: Ilya
Author-X-Name-Last: Shpitser
Title: Auto-G-Computation of Causal Effects on a Network
Abstract:
Methods for inferring average causal effects have traditionally relied on two key assumptions: (i) the intervention received by one unit cannot causally influence the outcome of another; and (ii) units can be organized into nonoverlapping groups such that outcomes of units in separate groups are independent. In this article, we develop new statistical methods for causal inference based on a single realization of a network of connected units for which neither assumption (i) nor (ii) holds. The proposed approach allows both for arbitrary forms of interference, whereby the outcome of a unit may depend on interventions received by other units with whom a network path through connected units exists; and long-range dependence, whereby outcomes for any two units likewise connected by a path in the network may be dependent. Under network versions of consistency and no unobserved confounding, inference is made tractable by an assumption that the network's outcome, treatment, and covariate vectors are a single realization of a certain chain graph model. This assumption allows inferences about various network causal effects via the auto-g-computation algorithm, a network generalization of Robins' well-known g-computation algorithm previously described for causal inference under assumptions (i) and (ii). Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 833-844
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1811098
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1811098
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:833-844
Template-Type: ReDIF-Article 1.0
Author-Name: Cheng Zhang
Author-X-Name-First: Cheng
Author-X-Name-Last: Zhang
Author-Name: Vu Dinh
Author-X-Name-First: Vu
Author-X-Name-Last: Dinh
Author-Name: Frederick A. Matsen
Author-X-Name-First: Frederick A.
Author-X-Name-Last: Matsen
Title: Nonbifurcating Phylogenetic Tree Inference via the Adaptive LASSO
Abstract:
Phylogenetic tree inference using deep DNA sequencing is reshaping our understanding of rapidly evolving systems, such as the within-host battle between viruses and the immune system. Densely sampled phylogenetic trees can contain special features, including sampled ancestors in which we sequence a genotype along with its direct descendants, and polytomies in which multiple descendants arise simultaneously. These features are apparent after identifying zero-length branches in the tree. However, current maximum-likelihood based approaches are not capable of revealing such zero-length branches. In this article, we find these zero-length branches by introducing adaptive-LASSO-type regularization estimators for the branch lengths of phylogenetic trees, deriving their properties, and showing regularization to be a practically useful approach for phylogenetics. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 858-873
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1778481
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1778481
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:858-873
Template-Type: ReDIF-Article 1.0
Author-Name: Kwonsang Lee
Author-X-Name-First: Kwonsang
Author-X-Name-Last: Lee
Author-Name: Dylan S. Small
Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small
Author-Name: Francesca Dominici
Author-X-Name-First: Francesca
Author-X-Name-Last: Dominici
Title: Discovering Heterogeneous Exposure Effects Using Randomization Inference in Air Pollution Studies
Abstract:
Several studies have provided strong evidence that long-term exposure to air pollution, even at low levels, increases risk of mortality. As regulatory actions are becoming prohibitively expensive, robust evidence to guide the development of targeted interventions to protect the most vulnerable is needed. In this article, we introduce a novel statistical method that (i) discovers subgroups whose effects substantially differ from the population mean, and (ii) uses randomization-based tests to assess discovered heterogeneous effects. Also, we develop a sensitivity analysis method to assess the robustness of the conclusions to unmeasured confounding bias. Via simulation studies and theoretical arguments, we demonstrate that hypothesis testing focusing on the discovered subgroups can substantially increase statistical power to detect heterogeneity of the exposure effects. We apply the proposed de novo method to the data of 1,612,414 Medicare beneficiaries in the New England region in the United States for the period 2000–2006. We find that seniors aged between 81 and 85 with low income and seniors aged 85 and above have statistically significant greater causal effects of long-term exposure to PM2.5 on 5-year mortality rate compared to the population mean.
Journal: Journal of the American Statistical Association
Pages: 569-580
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1870476
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1870476
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:569-580
Template-Type: ReDIF-Article 1.0
Author-Name: Laura Forastiere
Author-X-Name-First: Laura
Author-X-Name-Last: Forastiere
Author-Name: Edoardo M. Airoldi
Author-X-Name-First: Edoardo M.
Author-X-Name-Last: Airoldi
Author-Name: Fabrizia Mealli
Author-X-Name-First: Fabrizia
Author-X-Name-Last: Mealli
Title: Identification and Estimation of Treatment and Interference Effects in Observational Studies on Networks
Abstract:
Causal inference on a population of units connected through a network often presents technical challenges, including how to account for interference. In the presence of interference, for instance, potential outcomes of a unit depend on their treatment as well as on the treatments of other units, such as their neighbors in the network. In observational studies, a further complication is that the typical unconfoundedness assumption must be extended—say, to include the treatment of neighbors, and individual and neighborhood covariates—to guarantee identification and valid inference. Here, we propose new estimands that define treatment and interference effects. We then derive analytical expressions for the bias of a naive estimator that wrongly assumes away interference. The bias depends on the level of interference but also on the degree of association between individual and neighborhood treatments. We propose an extended unconfoundedness assumption that accounts for interference, and we develop new covariate-adjustment methods that lead to valid estimates of treatment and interference effects in observational studies on networks. Estimation is based on a generalized propensity score that balances individual and neighborhood covariates across units under different levels of individual treatment and of exposure to neighbors’ treatment. We carry out simulations, calibrated using friendship networks and covariates in a nationally representative longitudinal study of adolescents in grades 7–12 in the United States, to explore finite-sample performance in different realistic settings. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 901-918
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1768100
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1768100
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:901-918
Template-Type: ReDIF-Article 1.0
Author-Name: Zhicheng Ji
Author-X-Name-First: Zhicheng
Author-X-Name-Last: Ji
Author-Name: Hongkai Ji
Author-X-Name-First: Hongkai
Author-X-Name-Last: Ji
Title: Discussion of “Exponential-Family Embedding With Application to Cell Developmental Trajectories for Single-Cell RNA-seq Data”
Journal: Journal of the American Statistical Association
Pages: 471-474
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2021.1880920
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1880920
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:471-474
Template-Type: ReDIF-Article 1.0
Author-Name: Danijel Kivaranovic
Author-X-Name-First: Danijel
Author-X-Name-Last: Kivaranovic
Author-Name: Hannes Leeb
Author-X-Name-First: Hannes
Author-X-Name-Last: Leeb
Title: On the Length of Post-Model-Selection Confidence Intervals Conditional on Polyhedral Constraints
Abstract:
Valid inference after model selection is currently a very active area of research. The polyhedral method, introduced in an article by Lee et al., allows for valid inference after model selection if the model selection event can be described by polyhedral constraints. In that reference, the method is exemplified by constructing two valid confidence intervals when the Lasso estimator is used to select a model. We here study the length of these intervals. For one of these confidence intervals, which is easier to compute, we find that its expected length is always infinite. For the other of these confidence intervals, whose computation is more demanding, we give a necessary and sufficient condition for its expected length to be infinite. In simulations, we find that this sufficient condition is typically satisfied, unless the selected model includes almost all or almost none of the available regressors. For the distribution of confidence interval length, we find that the κ-quantiles behave like 1/(1−κ) for κ close to 1. Our results can also be used to analyze other confidence intervals that are based on the polyhedral method.
Journal: Journal of the American Statistical Association
Pages: 845-857
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1732989
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1732989
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:845-857
Template-Type: ReDIF-Article 1.0
Author-Name: Muxuan Liang
Author-X-Name-First: Muxuan
Author-X-Name-Last: Liang
Author-Name: Ying-Qi Zhao
Author-X-Name-First: Ying-Qi
Author-X-Name-Last: Zhao
Title: Discussion of Kallus (2020) and Mo et al. (2020)
Abstract:
We discuss results on improving the generalizability of individualized treatment rules, following the work of Kallus and Mo et al. We note that the advocated weights in the work of Kallus are connected to the efficient score of the contrast function. We further propose a likelihood-ratio-based method (LR-ITR) to accommodate covariate shifts, and compare it to the CTE-DR-ITR method proposed by Mo et al. We provide the upper bound on the risk function of the target population when both the covariate shift and the contrast function shift are present. Numerical studies show that LR-ITR can outperform CTE-DR-ITR when there is only covariate shift. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 690-693
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1833887
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1833887
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:690-693
Template-Type: ReDIF-Article 1.0
Author-Name: Ting Tian
Author-X-Name-First: Ting
Author-X-Name-Last: Tian
Author-Name: Jianbin Tan
Author-X-Name-First: Jianbin
Author-X-Name-Last: Tan
Author-Name: Wenxiang Luo
Author-X-Name-First: Wenxiang
Author-X-Name-Last: Luo
Author-Name: Yukang Jiang
Author-X-Name-First: Yukang
Author-X-Name-Last: Jiang
Author-Name: Minqiong Chen
Author-X-Name-First: Minqiong
Author-X-Name-Last: Chen
Author-Name: Songpan Yang
Author-X-Name-First: Songpan
Author-X-Name-Last: Yang
Author-Name: Canhong Wen
Author-X-Name-First: Canhong
Author-X-Name-Last: Wen
Author-Name: Wenliang Pan
Author-X-Name-First: Wenliang
Author-X-Name-Last: Pan
Author-Name: Xueqin Wang
Author-X-Name-First: Xueqin
Author-X-Name-Last: Wang
Title: The Effects of Stringent and Mild Interventions for Coronavirus Pandemic
Abstract:
The pandemic of COVID-19 has caused severe public health consequences around the world. Many interventions against COVID-19 have been implemented. It is of great public health and social importance to evaluate the effects of interventions in the pandemic of COVID-19. With the help of a synthetic control method, regression discontinuity, and a state-space compartmental model, we evaluated the treatment and stagewise effects of the intervention policies. We found statistically significant treatment effects of broad stringent interventions in Wenzhou and mild interventions in Shanghai to subdue the epidemic’s spread. If those reduction effects were not activated, the expected number of positive individuals would increase by 2.18 times on February 5, 2020, for Wenzhou and 7.69 times on February 4, 2020, for Shanghai, respectively. Alternatively, regression discontinuity identified that the stringent (p-value < 0.001) and mild (p-value = 0.024) interventions lowered the severity of the epidemic. Under compartmental modeling for different interventions, we understood the importance of implementing the interventions. The highest level alert to COVID-19 was practical and crucial at the early stage of the epidemic. Furthermore, the physical/social distancing policy was necessary once the spread of COVID-19 continued. If appropriate control measures were implemented, then the epidemic would be brought under control effectively and early. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 481-491
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2021.1897015
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1897015
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:481-491
Template-Type: ReDIF-Article 1.0
Author-Name: Dungang Liu
Author-X-Name-First: Dungang
Author-X-Name-Last: Liu
Author-Name: Shaobo Li
Author-X-Name-First: Shaobo
Author-X-Name-Last: Li
Author-Name: Yan Yu
Author-X-Name-First: Yan
Author-X-Name-Last: Yu
Author-Name: Irini Moustaki
Author-X-Name-First: Irini
Author-X-Name-Last: Moustaki
Title: Assessing Partial Association Between Ordinal Variables: Quantification, Visualization, and Hypothesis Testing
Abstract:
Partial association refers to the relationship between variables Y1,Y2,…,YK while adjusting for a set of covariates X={X1,…,Xp}. To assess such an association when Yk’s are recorded on ordinal scales, a classical approach is to use partial correlation between the latent continuous variables. This so-called polychoric correlation is inadequate, as it requires multivariate normality and it only reflects a linear association. We propose a new framework for studying ordinal-ordinal partial association by using Liu-Zhang’s surrogate residuals. We justify that, conditional on X, Yk and Yl are independent if and only if their corresponding surrogate residual variables are independent. Based on this result, we develop a general measure ϕ to quantify association strength. As opposed to polychoric correlation, ϕ does not rely on normality or models with the probit link, but instead it broadly applies to models with any link functions. It can capture a nonlinear or even nonmonotonic association. Moreover, the measure ϕ gives rise to a general procedure for testing the hypothesis of partial independence. Our framework also permits visualization tools, such as partial regression plots and three-dimensional P-P plots, to examine the association structure, which is otherwise unfeasible for ordinal data. We stress that the whole set of tools (measures, p-values, and graphics) is developed within a single unified framework, which allows a coherent inference. The analyses of the National Election Study (K = 5) and Big Five Personality Traits (K = 50) demonstrate that our framework leads to a much fuller assessment of partial association and yields deeper insights for domain researchers. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 955-968
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1796394
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1796394
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:955-968
Template-Type: ReDIF-Article 1.0
Author-Name: Karthika Mohan
Author-X-Name-First: Karthika
Author-X-Name-Last: Mohan
Author-Name: Judea Pearl
Author-X-Name-First: Judea
Author-X-Name-Last: Pearl
Title: Graphical Models for Processing Missing Data
Abstract:
This article reviews recent advances in missing data research using graphical models to represent multivariate dependencies. We first examine the limitations of traditional frameworks from three different perspectives: transparency, estimability, and testability. We then show how procedures based on graphical models can overcome these limitations and provide meaningful performance guarantees even when data are missing not at random (MNAR). In particular, we identify conditions that guarantee consistent estimation in broad categories of missing data problems, and derive procedures for implementing this estimation. Finally, we derive testable implications for missing data models in both missing at random and MNAR categories.
Journal: Journal of the American Statistical Association
Pages: 1023-1037
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2021.1874961
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1874961
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:1023-1037
Template-Type: ReDIF-Article 1.0
Author-Name: Kosuke Imai
Author-X-Name-First: Kosuke
Author-X-Name-Last: Imai
Author-Name: Zhichao Jiang
Author-X-Name-First: Zhichao
Author-X-Name-Last: Jiang
Author-Name: Anup Malani
Author-X-Name-First: Anup
Author-X-Name-Last: Malani
Title: Causal Inference With Interference and Noncompliance in Two-Stage Randomized Experiments
Abstract:
In many social science experiments, subjects often interact with each other and as a result one unit’s treatment influences the outcome of another unit. Over the last decade, significant progress has been made toward causal inference in the presence of such interference between units. Researchers have shown that the two-stage randomization of treatment assignment enables the identification of average direct and spillover effects. However, much of the literature has assumed perfect compliance with treatment assignment. In this article, we establish the nonparametric identification of the complier average direct and spillover effects in two-stage randomized experiments with interference and noncompliance. In particular, we consider the spillover effect of the treatment assignment on the treatment receipt as well as the spillover effect of the treatment receipt on the outcome. We propose consistent estimators and derive their randomization-based variances under the stratified interference assumption. We also prove the exact relationships between the proposed randomization-based estimators and the popular two-stage least squares estimators. The proposed methodology is motivated by and applied to our own randomized evaluation of India’s National Health Insurance Program (RSBY), where we find some evidence of spillover effects. The proposed methods are implemented via an open-source software package. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 632-644
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1775612
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1775612
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:632-644
Template-Type: ReDIF-Article 1.0
Author-Name: Haoyu Chen
Author-X-Name-First: Haoyu
Author-X-Name-Last: Chen
Author-Name: Wenbin Lu
Author-X-Name-First: Wenbin
Author-X-Name-Last: Lu
Author-Name: Rui Song
Author-X-Name-First: Rui
Author-X-Name-Last: Song
Title: Statistical Inference for Online Decision Making via Stochastic Gradient Descent
Abstract:
Online decision making aims to learn the optimal decision rule by making personalized decisions and updating the decision rule recursively. It has become easier than before with the help of big data, but new challenges also come along. Since the decision rule should be updated once per step, an offline update that uses all the historical data is inefficient in computation and storage. To this end, we propose a completely online algorithm that can make decisions and update the decision rule online via stochastic gradient descent. It is not only efficient but also supports all kinds of parametric reward models. Focusing on the statistical inference of online decision making, we establish the asymptotic normality of the parameter estimator produced by our algorithm and the online inverse probability weighted value estimator we used to estimate the optimal value. Online plugin estimators for the variance of the parameter and value estimators are also provided and shown to be consistent, so that interval estimation and hypothesis testing are possible using our method. The proposed algorithm and theoretical results are tested by simulations and a real data application to news article recommendation.
Journal: Journal of the American Statistical Association
Pages: 708-719
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1826325
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1826325
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:708-719
Template-Type: ReDIF-Article 1.0
Author-Name: Ricardo Moura
Author-X-Name-First: Ricardo
Author-X-Name-Last: Moura
Author-Name: Martin Klein
Author-X-Name-First: Martin
Author-X-Name-Last: Klein
Author-Name: John Zylstra
Author-X-Name-First: John
Author-X-Name-Last: Zylstra
Author-Name: Carlos A. Coelho
Author-X-Name-First: Carlos A.
Author-X-Name-Last: Coelho
Author-Name: Bimal Sinha
Author-X-Name-First: Bimal
Author-X-Name-Last: Sinha
Title: Inference for Multivariate Regression Model Based on Synthetic Data Generated Using Plug-in Sampling
Abstract:
In this article, the authors derive the likelihood-based exact inference for singly and multiply imputed synthetic data in the context of a multivariate regression model. The synthetic data are generated via the Plug-in Sampling method, where the unknown parameters in the model are set equal to the observed values of their point estimators based on the original data, and synthetic data are drawn from this estimated version of the model. Simulation studies are carried out in order to confirm the theoretical results. The authors provide exact test procedures, which, when multiple synthetic datasets are permissible, are compared with the asymptotic results of Reiter. An application using 2000 U.S. Current Population Survey public use data is discussed. Furthermore, properties of the proposed methodology are evaluated in scenarios where some of the conditions that were used to derive the methodology do not hold, namely for nonnormally and discretely distributed random variables, cases in which the inferential procedures developed still show very good performance.
Journal: Journal of the American Statistical Association
Pages: 720-733
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2021.1900860
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1900860
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:720-733
Template-Type: ReDIF-Article 1.0
Author-Name: Joshua Lukemire
Author-X-Name-First: Joshua
Author-X-Name-Last: Lukemire
Author-Name: Suprateek Kundu
Author-X-Name-First: Suprateek
Author-X-Name-Last: Kundu
Author-Name: Giuseppe Pagnoni
Author-X-Name-First: Giuseppe
Author-X-Name-Last: Pagnoni
Author-Name: Ying Guo
Author-X-Name-First: Ying
Author-X-Name-Last: Guo
Title: Bayesian Joint Modeling of Multiple Brain Functional Networks
Abstract:
Investigating the similarity and changes in brain networks under different mental conditions has become increasingly important in neuroscience research. A standard separate estimation strategy fails to pool information across networks and hence has reduced estimation accuracy and power to detect between-network differences. Motivated by an fMRI Stroop task experiment that involves multiple related tasks, we develop an integrative Bayesian approach for jointly modeling multiple brain networks that provides a systematic inferential framework for network comparisons. The proposed approach explicitly models shared and differential patterns via flexible Dirichlet process-based priors on edge probabilities. Conditional on edges, the connection strengths are modeled via a Bayesian spike-and-slab prior on the precision matrix off-diagonals. Numerical simulations illustrate that the proposed approach has increased power to detect true differential edges while providing adequate control on false positives and achieves greater network estimation accuracy compared to existing methods. The Stroop task data analysis reveals greater connectivity differences between task and fixation that are concentrated in brain regions previously identified as differentially activated in the Stroop task, and more nuanced connectivity differences between the exertion and relaxed tasks. In contrast, penalized modeling approaches involving computationally burdensome permutation tests reveal negligible network differences between conditions that seem biologically implausible. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 518-530
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1796357
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1796357
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:518-530
Template-Type: ReDIF-Article 1.0
Author-Name: Jordan Awan
Author-X-Name-First: Jordan
Author-X-Name-Last: Awan
Author-Name: Aleksandra Slavković
Author-X-Name-First: Aleksandra
Author-X-Name-Last: Slavković
Title: Structure and Sensitivity in Differential Privacy: Comparing K-Norm Mechanisms
Abstract:
Differential privacy (DP) provides a framework for provable privacy protection against arbitrary adversaries, while allowing the release of summary statistics and synthetic data. We address the problem of releasing a noisy real-valued statistic vector T, a function of sensitive data under DP, via the class of K-norm mechanisms with the goal of minimizing the noise added to achieve privacy. First, we introduce the sensitivity space of T, which extends the concepts of sensitivity polytope and sensitivity hull to the setting of arbitrary statistics T. We then propose a framework consisting of three methods for comparing the K-norm mechanisms: (1) a multivariate extension of stochastic dominance, (2) the entropy of the mechanism, and (3) the conditional variance given a direction, to identify the optimal K-norm mechanism. In all of these criteria, the optimal K-norm mechanism is generated by the convex hull of the sensitivity space. Using our methodology, we extend the objective perturbation and functional mechanisms and apply these tools to logistic and linear regression, allowing for private releases of statistical results. Via simulations and an application to a housing price dataset, we demonstrate that our proposed methodology offers a substantial improvement in utility for the same level of risk.
Journal: Journal of the American Statistical Association
Pages: 935-954
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1773831
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1773831
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:935-954
Template-Type: ReDIF-Article 1.0
Author-Name: Chi Wing Chu
Author-X-Name-First: Chi Wing
Author-X-Name-Last: Chu
Author-Name: Tony Sit
Author-X-Name-First: Tony
Author-X-Name-Last: Sit
Author-Name: Gongjun Xu
Author-X-Name-First: Gongjun
Author-X-Name-Last: Xu
Title: Transformed Dynamic Quantile Regression on Censored Data
Abstract:
We propose a class of power-transformed linear quantile regression models for time-to-event observations subject to censoring. By introducing a process of power transformation with different transformation parameters at individual quantile levels, our framework relaxes the assumption of logarithmic transformation on survival times and provides dynamic estimation of various quantile levels. With such formulation, our proposal no longer requires the potentially restrictive global linearity assumption imposed on a class of existing inference procedures for censored quantile regression. Uniform consistency and weak convergence of the proposed estimator as a process of quantile levels are established via the martingale-based argument. Numerical studies are presented to illustrate that the proposed estimator outperforms existing contenders under various settings.
Journal: Journal of the American Statistical Association
Pages: 874-886
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2019.1695623
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1695623
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:874-886
Template-Type: ReDIF-Article 1.0
Author-Name: Kiranmoy Das
Author-X-Name-First: Kiranmoy
Author-X-Name-Last: Das
Author-Name: Pulak Ghosh
Author-X-Name-First: Pulak
Author-X-Name-Last: Ghosh
Author-Name: Michael J. Daniels
Author-X-Name-First: Michael J.
Author-X-Name-Last: Daniels
Title: Modeling Multiple Time-Varying Related Groups: A Dynamic Hierarchical Bayesian Approach With an Application to the Health and Retirement Study
Abstract:
As the population of older individuals continues to grow, it is important to study the relationship among the variables measuring their financial health and physical health to better understand the demand for healthcare and health insurance. We propose a semiparametric approach to jointly model these variables. We use data from the Health and Retirement Study, which includes a set of correlated longitudinal variables measuring financial and physical health. In particular, we propose a dynamic hierarchical matrix stick-breaking process prior for some of the model parameters to account for the time-dependent aspects of our data. This prior introduces dependence among the parameters across different groups that varies over time. A Lasso-type shrinkage prior is specified for the covariates with time-invariant effects to select the set of covariates with significant effects on the outcomes. Through joint modeling, we are able to study the physical health of older individuals conditional on their financial health, and vice versa. Based on our analysis, we find that the health insurance (Medicare) provided by the U.S. government to older individuals is very effective and covers most medical expenditures. However, no health insurance plan conveniently covers the additional medical expenses due to chronic diseases such as cancer and heart problems. Simulation studies are performed to assess the operating characteristics of our proposed modeling approach. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 558-568
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2021.1886105
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1886105
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:558-568
Template-Type: ReDIF-Article 1.0
Author-Name: Maxime Rischard
Author-X-Name-First: Maxime
Author-X-Name-Last: Rischard
Author-Name: Zach Branson
Author-X-Name-First: Zach
Author-X-Name-Last: Branson
Author-Name: Luke Miratrix
Author-X-Name-First: Luke
Author-X-Name-Last: Miratrix
Author-Name: Luke Bornn
Author-X-Name-First: Luke
Author-X-Name-Last: Bornn
Title: Do School Districts Affect NYC House Prices? Identifying Border Differences Using a Bayesian Nonparametric Approach to Geographic Regression Discontinuity Designs
Abstract:
What is the premium on house price for a particular school district? To estimate this in New York City we use a novel implementation of a geographic regression discontinuity design (GeoRDD) built from Gaussian process regression (kriging) to model spatial structure. With a GeoRDD, we specifically examine price differences along borders between “treatment” and “control” school districts. GeoRDDs extend RDDs to multivariate settings; location is the forcing variable and the border between school districts constitutes the discontinuity threshold. We first obtain a Bayesian posterior distribution of the price difference function, our nominal treatment effect, along the border. We then address nuances of having a functional estimand defined on a border with potentially intricate topology, particularly when defining and estimating causal estimands of the local average treatment effect (LATE). We test for nonzero LATE with a calibrated hypothesis test with good frequentist properties, which we further validate using a placebo test. Using our methodology, we identify substantial differences in price across several borders. In one case, a border separating Brooklyn and Queens, we estimate a statistically significant 20% higher price for a house on the more desirable side. We also find that geographic features can undermine some of these comparisons. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 619-631
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1817749
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1817749
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:619-631
Template-Type: ReDIF-Article 1.0
Author-Name: Sharmistha Guha
Author-X-Name-First: Sharmistha
Author-X-Name-Last: Guha
Author-Name: Abel Rodriguez
Author-X-Name-First: Abel
Author-X-Name-Last: Rodriguez
Title: Bayesian Regression With Undirected Network Predictors With an Application to Brain Connectome Data
Abstract:
This article focuses on the relationship between a measure of creativity and the human brain network for subjects in a brain connectome dataset obtained using a diffusion weighted magnetic resonance imaging procedure. We identify brain regions and interconnections that have a significant effect on creativity. Brain networks are often expressed in terms of symmetric adjacency matrices, with row and column indices of the matrix representing the regions of interest (ROI), and a cell entry signifying the estimated number of fiber bundles connecting the corresponding row and column ROIs. Current statistical practices for regression analysis with the brain network as the predictor and the measure of creativity as the response typically vectorize the network predictor matrices prior to any analysis, thus failing to account for the important structural information in the network. This results in poor inferential and predictive performance in presence of small sample sizes. To answer the scientific questions discussed above, we develop a flexible Bayesian framework that avoids reshaping the network predictor matrix, draws inference on brain ROIs and interconnections significantly related to creativity, and enables accurate prediction of creativity from a brain network. A novel class of network shrinkage priors for the coefficient corresponding to the network predictor is proposed to achieve these goals simultaneously. The Bayesian framework allows characterization of uncertainty in the findings. Empirical results in simulation studies illustrate substantial inferential and predictive gains of the proposed framework in comparison with the ordinary high-dimensional Bayesian shrinkage priors and penalized optimization schemes. Our framework yields new insights into the relationship of brain regions with creativity, also providing the uncertainty associated with the scientific findings. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 581-593
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1772079
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1772079
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:581-593
Template-Type: ReDIF-Article 1.0
Author-Name: Weibin Mo
Author-X-Name-First: Weibin
Author-X-Name-Last: Mo
Author-Name: Zhengling Qi
Author-X-Name-First: Zhengling
Author-X-Name-Last: Qi
Author-Name: Yufeng Liu
Author-X-Name-First: Yufeng
Author-X-Name-Last: Liu
Title: Rejoinder: Learning Optimal Distributionally Robust Individualized Treatment Rules
Journal: Journal of the American Statistical Association
Pages: 699-707
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1866581
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1866581
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:699-707
Template-Type: ReDIF-Article 1.0
Author-Name: Kevin Z. Lin
Author-X-Name-First: Kevin Z.
Author-X-Name-Last: Lin
Author-Name: Jing Lei
Author-X-Name-First: Jing
Author-X-Name-Last: Lei
Author-Name: Kathryn Roeder
Author-X-Name-First: Kathryn
Author-X-Name-Last: Roeder
Title: Exponential-Family Embedding With Application to Cell Developmental Trajectories for Single-Cell RNA-Seq Data
Abstract:
Scientists often embed cells into a lower-dimensional space when studying single-cell RNA-seq data for improved downstream analyses such as developmental trajectory analyses, but the statistical properties of such nonlinear embedding methods are often not well understood. In this article, we develop the exponential-family SVD (eSVD), a nonlinear embedding method for both cells and genes jointly with respect to a random dot product model using exponential-family distributions. Our estimator uses alternating minimization, which enables us to have a computationally efficient method, prove the identifiability conditions and consistency of our method, and provide statistically principled procedures to tune our method. All these qualities help advance the single-cell embedding literature, and we provide extensive simulations to demonstrate that the eSVD is competitive compared to other embedding methods. We apply the eSVD via Gaussian distributions where the standard deviations are proportional to the means to analyze a single-cell dataset of oligodendrocytes in mouse brains. Using the eSVD estimated embedding, we then investigate the cell developmental trajectories of the oligodendrocytes. While previous results are not able to distinguish the trajectories among the mature oligodendrocyte cell types, our diagnostics and results demonstrate there are two major developmental trajectories that diverge at mature oligodendrocytes. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 457-470
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2021.1886106
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1886106
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:457-470
Template-Type: ReDIF-Article 1.0
Author-Name: Philip G. Sansom
Author-X-Name-First: Philip G.
Author-X-Name-Last: Sansom
Author-Name: David B. Stephenson
Author-X-Name-First: David B.
Author-X-Name-Last: Stephenson
Author-Name: Thomas J. Bracegirdle
Author-X-Name-First: Thomas J.
Author-X-Name-Last: Bracegirdle
Title: On Constraining Projections of Future Climate Using Observations and Simulations From Multiple Climate Models
Abstract:
Numerical climate models are used to project future climate change due to both anthropogenic and natural causes. Differences between projections from different climate models are a major source of uncertainty about future climate. Emergent relationships shared by multiple climate models have the potential to constrain our uncertainty when combined with historical observations. We combine projections from 13 climate models with observational data to quantify the impact of emergent relationships on projections of future warming in the Arctic at the end of the 21st century. We propose a hierarchical Bayesian framework based on a coexchangeable representation of the relationship between climate models and the Earth system. We show how emergent constraints fit into the coexchangeable representation, and extend it to account for internal variability simulated by the models and natural variability in the Earth system. Our analysis shows that projected warming in some regions of the Arctic may be more than 2 °C lower and our uncertainty reduced by up to 30% when constrained by historical observations. A detailed theoretical comparison with existing multi-model projection frameworks is also provided. In particular, we show that projections may be biased if we do not account for internal variability in climate model predictions. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 546-557
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1851696
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1851696
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:546-557
Template-Type: ReDIF-Article 1.0
Author-Name: Justin Khim
Author-X-Name-First: Justin
Author-X-Name-Last: Khim
Author-Name: Po-Ling Loh
Author-X-Name-First: Po-Ling
Author-X-Name-Last: Loh
Title: Permutation Tests for Infection Graphs
Abstract:
We formulate and analyze a novel hypothesis testing problem for inferring the edge structure of an infection graph. In our model, a disease spreads over a network via contagion or random infection, where the times between successive contagion events are independent exponential random variables with unknown rate parameters. A subset of nodes is also censored uniformly at random. Given the observed infection statuses of nodes in the network, the goal is to determine the underlying graph. We present a procedure based on permutation testing, and we derive sufficient conditions for the validity of our test in terms of automorphism groups of the graphs corresponding to the null and alternative hypotheses. Our test is easy to compute and does not involve estimating unknown parameters governing the process. We also derive risk bounds for our permutation test in a variety of settings, and relate our test statistic to approximate likelihood ratio testing and maximin tests. For graphs not satisfying the necessary symmetries, we provide an additional method for testing the significance of the graph structure, albeit at a higher computational cost. We conclude with an application to real data from an HIV infection network. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 770-782
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2019.1700128
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1700128
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:770-782
Template-Type: ReDIF-Article 1.0
Author-Name: Joris Chau
Author-X-Name-First: Joris
Author-X-Name-Last: Chau
Author-Name: Rainer von Sachs
Author-X-Name-First: Rainer
Author-X-Name-Last: von Sachs
Title: Intrinsic Wavelet Regression for Curves of Hermitian Positive Definite Matrices
Abstract:
Intrinsic wavelet transforms and wavelet estimation methods are introduced for curves in the non-Euclidean space of Hermitian positive definite matrices, with the application to Fourier spectral estimation of multivariate stationary time series in mind. The main focus is on intrinsic average-interpolation wavelet transforms in the space of positive definite matrices equipped with an affine-invariant Riemannian metric, and convergence rates of linear wavelet thresholding are derived for intrinsically smooth curves of Hermitian positive definite matrices. In the context of multivariate Fourier spectral estimation, intrinsic wavelet thresholding is equivariant under a change of basis of the time series, and nonlinear wavelet thresholding is able to capture localized features in the spectral density matrix across frequency, always guaranteeing positive definite estimates. The finite-sample performance of intrinsic wavelet thresholding is assessed by means of simulated data and compared to several benchmark estimators in the Riemannian manifold. Further illustrations are provided by examining the multivariate spectra of trial-replicated brain signal time series recorded during a learning experiment. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 819-832
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2019.1700129
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1700129
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:819-832
Template-Type: ReDIF-Article 1.0
Author-Name: Jasjeet S. Sekhon
Author-X-Name-First: Jasjeet S.
Author-X-Name-Last: Sekhon
Author-Name: Yotam Shem-Tov
Author-X-Name-First: Yotam
Author-X-Name-Last: Shem-Tov
Title: Inference on a New Class of Sample Average Treatment Effects
Abstract:
We derive new variance formulas for inference on a general class of estimands of causal average treatment effects in a randomized controlled trial. We generalize the seminal work of Robins and show that when the researcher’s objective is inference on the sample average treatment effect of the treated (SATT), a consistent variance estimator exists. Although this estimand is equal to the sample average treatment effect (SATE) in expectation, potentially large differences in both accuracy and coverage can occur by the change of estimand, even asymptotically. Inference on SATE, even using a conservative confidence interval, provides incorrect coverage of SATT. We demonstrate the applicability of the new theoretical results using an empirical application with hundreds of online experiments with an average sample size of approximately 100 million observations per experiment. An R package, estCI, that implements all the proposed estimation procedures is available. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 798-804
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1730854
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1730854
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:798-804
Template-Type: ReDIF-Article 1.0
Author-Name: Jian Hu
Author-X-Name-First: Jian
Author-X-Name-Last: Hu
Author-Name: Mingyao Li
Author-X-Name-First: Mingyao
Author-X-Name-Last: Li
Title: Discussion of “Exponential-Family Embedding With Application to Cell Developmental Trajectories for Single-Cell RNA-Seq Data”
Journal: Journal of the American Statistical Association
Pages: 475-477
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2021.1880919
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1880919
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:475-477
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Correction
Journal: Journal of the American Statistical Association
Pages: 1039-1039
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2021.1915023
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1915023
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:1039-1039
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Introduction to Theory and Methods Special Issue on Precision Medicine and Individualized Policy Discovery Part II
Journal: Journal of the American Statistical Association
Pages: 645-645
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2021.1916266
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1916266
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:645-645
Template-Type: ReDIF-Article 1.0
Author-Name: Marco Avella-Medina
Author-X-Name-First: Marco
Author-X-Name-Last: Avella-Medina
Title: Privacy-Preserving Parametric Inference: A Case for Robust Statistics
Abstract:
Differential privacy is a cryptographically motivated approach to privacy that has become a very active field of research over the last decade in theoretical computer science and machine learning. In this paradigm, one assumes there is a trusted curator who holds the data of individuals in a database and the goal of privacy is to simultaneously protect individual data while allowing the release of global characteristics of the database. In this setting, we introduce a general framework for parametric inference with differential privacy guarantees. We first obtain differentially private estimators based on bounded influence M-estimators by leveraging their gross-error sensitivity in the calibration of a noise term added to them to ensure privacy. We then show how a similar construction can also be applied to construct differentially private test statistics analogous to the Wald, score, and likelihood ratio tests. We provide statistical guarantees for all our proposals via an asymptotic analysis. An interesting consequence of our results is to further clarify the connection between differential privacy and robust statistics. In particular, we demonstrate that differential privacy is a weaker stability requirement than infinitesimal robustness, and show that robust M-estimators can be easily randomized to guarantee both differential privacy and robustness toward the presence of contaminated data. We illustrate our results both on simulated and real data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 969-983
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2019.1700130
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1700130
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:969-983
Template-Type: ReDIF-Article 1.0
Author-Name: Emily C. Hector
Author-X-Name-First: Emily C.
Author-X-Name-Last: Hector
Author-Name: Peter X.-K. Song
Author-X-Name-First: Peter X.-K.
Author-X-Name-Last: Song
Title: A Distributed and Integrated Method of Moments for High-Dimensional Correlated Data Analysis
Abstract:
This article is motivated by a regression analysis of electroencephalography (EEG) neuroimaging data with high-dimensional correlated responses with multilevel nested correlations. We develop a divide-and-conquer procedure implemented in a fully distributed and parallelized computational scheme for statistical estimation and inference of regression parameters. Despite significant efforts in the literature, the computational bottleneck associated with high-dimensional likelihoods prevents the scalability of existing methods. The proposed method addresses this challenge by dividing responses into subvectors to be analyzed separately and in parallel on a distributed platform using pairwise composite likelihood. Theoretical challenges related to combining results from dependent data are overcome in a statistically efficient way using a meta-estimator derived from Hansen’s generalized method of moments. We provide a rigorous theoretical framework for efficient estimation, inference, and goodness-of-fit tests. We develop an R package for ease of implementation. We illustrate our method’s performance with simulations and the analysis of the EEG data, and find that iron deficiency is significantly associated with two auditory recognition memory related potentials in the left parietal-occipital region of the brain. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 805-818
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1736082
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1736082
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:805-818
Template-Type: ReDIF-Article 1.0
Author-Name: Paolo Frumento
Author-X-Name-First: Paolo
Author-X-Name-Last: Frumento
Author-Name: Matteo Bottai
Author-X-Name-First: Matteo
Author-X-Name-Last: Bottai
Author-Name: Iván Fernández-Val
Author-X-Name-First: Iván
Author-X-Name-Last: Fernández-Val
Title: Parametric Modeling of Quantile Regression Coefficient Functions With Longitudinal Data
Abstract:
In ordinary quantile regression, quantiles of different order are estimated one at a time. An alternative approach, which is referred to as quantile regression coefficients modeling (qrcm), is to model quantile regression coefficients as parametric functions of the order of the quantile. In this article, we describe how the qrcm paradigm can be applied to longitudinal data. We introduce a two-level quantile function, in which two different quantile regression models are used to describe the (conditional) distribution of the within-subject response and that of the individual effects. We propose a novel type of penalized fixed-effects estimator, and discuss its advantages over standard methods based on l1 and l2 penalization. We provide model identifiability conditions, derive asymptotic properties, describe goodness-of-fit measures and model selection criteria, present simulation results, and discuss an application. The proposed method has been implemented in the R package qrcm.
Journal: Journal of the American Statistical Association
Pages: 783-797
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2021.1892702
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1892702
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:783-797
Template-Type: ReDIF-Article 1.0
Author-Name: Dean Eckles
Author-X-Name-First: Dean
Author-X-Name-Last: Eckles
Author-Name: Eytan Bakshy
Author-X-Name-First: Eytan
Author-X-Name-Last: Bakshy
Title: Bias and High-Dimensional Adjustment in Observational Studies of Peer Effects
Abstract:
Peer effects, in which an individual’s behavior is affected by peers’ behavior, are posited by multiple theories in the social sciences. Randomized field experiments that identify peer effects, however, are often expensive or infeasible, so many studies of peer effects use observational data, which is expected to suffer from confounding. Here we show, in the context of information and media diffusion, that high-dimensional adjustment of a nonexperimental control group (660 million observations) using propensity score models produces estimates of peer effects statistically indistinguishable from those using a large randomized experiment (215 million observations). Compared with the experiment, naive observational estimators overstate peer effects by over 300% and commonly available variables (e.g., demographics) offer little bias reduction. Adjusting for a measure of prior behaviors closely related to the focal behavior reduces this bias by 91%, while models adjusting for over 3700 past behaviors provide additional bias reduction, reducing bias by over 97%, which is statistically indistinguishable from unbiasedness. This demonstrates how detailed records of behavior can improve studies of social influence, information diffusion, and imitation; these results are encouraging for the credibility of some studies but also cautionary for studies of peer effects in rare or new behaviors. More generally, these results show how large, high-dimensional datasets and statistical learning can be used to improve causal inference. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 507-517
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1796393
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1796393
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:507-517
Template-Type: ReDIF-Article 1.0
Author-Name: Youngjun Choe
Author-X-Name-First: Youngjun
Author-X-Name-Last: Choe
Title: Design of experiments for generalized linear models
Journal: Journal of the American Statistical Association
Pages: 1038-1038
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2021.1921472
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1921472
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:1038-1038
Template-Type: ReDIF-Article 1.0
Author-Name: Nathan Kallus
Author-X-Name-First: Nathan
Author-X-Name-Last: Kallus
Title: More Efficient Policy Learning via Optimal Retargeting
Abstract:
Policy learning can be used to extract individualized treatment regimes from observational data in healthcare, civics, e-commerce, and beyond. One big hurdle to policy learning is a commonplace lack of overlap in the data for different actions, which can lead to unwieldy policy evaluation and poorly performing learned policies. We study a solution to this problem based on retargeting, that is, changing the population on which policies are optimized. We first argue that at the population level, retargeting may induce little to no bias. We then characterize the optimal reference policy and retargeting weights in both binary-action and multi-action settings. We do this in terms of the asymptotic efficient estimation variance of the new learning objective. We further consider weights that additionally control for potential bias due to retargeting. Extensive empirical results in a simulation study and a case study of personalized job counseling demonstrate that retargeting is a fairly easy way to significantly improve any policy learning procedure applied to observational data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 646-658
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1788948
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1788948
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:646-658
Template-Type: ReDIF-Article 1.0
Author-Name: Bowei Yan
Author-X-Name-First: Bowei
Author-X-Name-Last: Yan
Author-Name: Purnamrita Sarkar
Author-X-Name-First: Purnamrita
Author-X-Name-Last: Sarkar
Title: Covariate Regularized Community Detection in Sparse Graphs
Abstract:
In this article, we investigate community detection in networks in the presence of node covariates. In many instances, covariates and networks individually only give a partial view of the cluster structure, and one needs to jointly infer the full cluster structure by considering both. In statistics, an emerging body of work has focused on combining information from both the edges in the network and the node covariates to infer community memberships. However, so far the theoretical guarantees have been established in the dense regime, where the network alone can lead to perfect clustering under a broad parameter regime, and hence the role of covariates is often not clear. In this article, we examine sparse networks in conjunction with finite-dimensional sub-Gaussian mixtures as covariates under moderate separation conditions. In this setting, each individual source can only cluster a nonvanishing fraction of nodes correctly. We propose a simple optimization framework that improves clustering accuracy when the two sources carry partial information about the cluster memberships, and hence perform poorly on their own. Our optimization problem can be solved by scalable convex optimization algorithms. With a variety of simulated and real data examples, we show that the proposed method outperforms other existing methodology. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 734-745
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2019.1706541
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1706541
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:734-745
Template-Type: ReDIF-Article 1.0
Author-Name: Cong Ma
Author-X-Name-First: Cong
Author-X-Name-Last: Ma
Author-Name: Junwei Lu
Author-X-Name-First: Junwei
Author-X-Name-Last: Lu
Author-Name: Han Liu
Author-X-Name-First: Han
Author-X-Name-Last: Liu
Title: Inter-Subject Analysis: A Partial Gaussian Graphical Model Approach
Abstract:
Different from traditional intra-subject analysis, the goal of inter-subject analysis (ISA) is to explore the dependency structure between different subjects with the intra-subject dependency as nuisance. ISA has important applications in neuroscience to study the functional connectivity between brain regions under natural stimuli. We propose a modeling framework for ISA that is based on Gaussian graphical models, under which ISA can be converted to the problem of estimation and inference of a partial Gaussian graphical model. The main statistical challenge is that we do not impose sparsity constraints on the whole precision matrix and we only assume the inter-subject part is sparse. For estimation, we propose to estimate an alternative parameter to get around the nonsparse issue and it can achieve asymptotic consistency even if the intra-subject dependency is dense. For inference, we propose an “untangle and chord” procedure to de-bias our estimator. It is valid without the sparsity assumption on the inverse Hessian of the log-likelihood function. This inferential method is general and can be applied to many other statistical problems, thus it is of independent theoretical interest. Numerical experiments on both simulated and brain imaging data validate our methods and theory. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 746-755
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1841645
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1841645
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:746-755
Template-Type: ReDIF-Article 1.0
Author-Name: Kevin Z. Lin
Author-X-Name-First: Kevin Z.
Author-X-Name-Last: Lin
Author-Name: Jing Lei
Author-X-Name-First: Jing
Author-X-Name-Last: Lei
Author-Name: Kathryn Roeder
Author-X-Name-First: Kathryn
Author-X-Name-Last: Roeder
Title: Rejoinder for “Exponential-Family Embedding With Application to Cell Developmental Trajectories for Single-Cell RNA-Seq Data”
Journal: Journal of the American Statistical Association
Pages: 478-480
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2021.1892701
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1892701
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:478-480
Template-Type: ReDIF-Article 1.0
Author-Name: Weibin Mo
Author-X-Name-First: Weibin
Author-X-Name-Last: Mo
Author-Name: Zhengling Qi
Author-X-Name-First: Zhengling
Author-X-Name-Last: Qi
Author-Name: Yufeng Liu
Author-X-Name-First: Yufeng
Author-X-Name-Last: Liu
Title: Learning Optimal Distributionally Robust Individualized Treatment Rules
Abstract:
Recent developments in data-driven decision science have seen great advances in individualized decision making. Given data with individual covariates, treatment assignments, and outcomes, policy makers seek the best individualized treatment rule (ITR) that maximizes the expected outcome, known as the value function. Many existing methods assume that the training and testing distributions are the same. However, the estimated optimal ITR may have poor generalizability when the training and testing distributions are not identical. In this article, we consider the problem of finding an optimal ITR from a restricted ITR class where there are some unknown covariate changes between the training and testing distributions. We propose a novel distributionally robust ITR (DR-ITR) framework that maximizes the worst-case value function over a set of underlying distributions that are “close” to the training distribution. The resulting DR-ITR can guarantee reasonably good performance among all such distributions. We further propose a calibrating procedure that tunes the DR-ITR adaptively to a small amount of calibration data from a target population. In this way, the calibrated DR-ITR can be shown to enjoy better generalizability than the standard ITR based on our numerical studies. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 659-674
Issue: 534
Volume: 116
Year: 2021
Month: 4
X-DOI: 10.1080/01621459.2020.1796359
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1796359
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:659-674
Template-Type: ReDIF-Article 1.0
Author-Name: Min-ge Xie
Author-X-Name-First: Min-ge
Author-X-Name-Last: Xie
Author-Name: Zheshi Zheng
Author-X-Name-First: Zheshi
Author-X-Name-Last: Zheng
Title: Discussion of Professor Bradley Efron’s Article on “Prediction, Estimation, and Attribution”
Journal: Journal of the American Statistical Association
Pages: 667-671
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2020.1762614
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1762614
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:667-671
Template-Type: ReDIF-Article 1.0
Author-Name: Antony M. Overstall
Author-X-Name-First: Antony M.
Author-X-Name-Last: Overstall
Author-Name: David C. Woods
Author-X-Name-First: David C.
Author-X-Name-Last: Woods
Author-Name: Ben M. Parker
Author-X-Name-First: Ben M.
Author-X-Name-Last: Parker
Title: Bayesian Optimal Design for Ordinary Differential Equation Models With Application in Biological Science
Abstract:
Bayesian optimal design is considered for experiments where the response distribution depends on the solution to a system of nonlinear ordinary differential equations. The motivation is an experiment to estimate parameters in the equations governing the transport of amino acids through cell membranes in human placentas. Decision-theoretic Bayesian design of experiments for such nonlinear models is conceptually very attractive, allowing the formal incorporation of prior knowledge to overcome the parameter dependence of frequentist design and being less reliant on asymptotic approximations. However, the necessary approximation and maximization of the, typically analytically intractable, expected utility results in a computationally challenging problem. These issues are further exacerbated if the solution to the differential equations is not available in closed-form. This article proposes a new combination of a probabilistic solution to the equations embedded within a Monte Carlo approximation to the expected utility with cyclic descent of a smooth approximation to find the optimal design. A novel precomputation algorithm reduces the computational burden, making the search for an optimal design feasible for bigger problems. The methods are demonstrated by finding new designs for a number of common models derived from differential equations, and by providing optimal designs for the placenta experiment. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 583-598
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1617154
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1617154
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:583-598
Template-Type: ReDIF-Article 1.0
Author-Name: Pierre E. Jacob
Author-X-Name-First: Pierre E.
Author-X-Name-Last: Jacob
Author-Name: Fredrik Lindsten
Author-X-Name-First: Fredrik
Author-X-Name-Last: Lindsten
Author-Name: Thomas B. Schön
Author-X-Name-First: Thomas B.
Author-X-Name-Last: Schön
Title: Smoothing With Couplings of Conditional Particle Filters
Abstract:
In state–space models, smoothing refers to the task of estimating a latent stochastic process given noisy measurements related to the process. We propose an unbiased estimator of smoothing expectations. The lack-of-bias property has methodological benefits: independent estimators can be generated in parallel, and confidence intervals can be constructed from the central limit theorem to quantify the approximation error. To design unbiased estimators, we combine a generic debiasing technique for Markov chains with a Markov chain Monte Carlo algorithm for smoothing. The resulting procedure is widely applicable and we show in numerical experiments that the removal of the bias comes at a manageable increase in variance. We establish the validity of the proposed estimators under mild assumptions. Numerical experiments are provided on toy models, including a setting of highly informative observations, and for a realistic Lotka–Volterra model with an intractable transition density. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 721-729
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2018.1548856
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1548856
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:721-729
Template-Type: ReDIF-Article 1.0
Author-Name: Xinyu Zhang
Author-X-Name-First: Xinyu
Author-X-Name-Last: Zhang
Author-Name: Guohua Zou
Author-X-Name-First: Guohua
Author-X-Name-Last: Zou
Author-Name: Hua Liang
Author-X-Name-First: Hua
Author-X-Name-Last: Liang
Author-Name: Raymond J. Carroll
Author-X-Name-First: Raymond J.
Author-X-Name-Last: Carroll
Title: Parsimonious Model Averaging With a Diverging Number of Parameters
Abstract:
Model averaging generally provides better predictions than model selection, but the existing model averaging methods cannot lead to parsimonious models. Parsimony is an especially important property when the number of parameters is large. To achieve a parsimonious model averaging coefficient estimator, we suggest a novel criterion for choosing weights. Asymptotic properties are derived in two practical scenarios: (i) one or more correct models exist in the candidate model set and (ii) all candidate models are misspecified. Under the former scenario, it is proved that our method can put weight one on the smallest correct model, and the resulting model averaging estimators of coefficients have many zeros and thus lead to a parsimonious model. The asymptotic distribution of the estimators is also provided. Under the latter scenario, the focus is mainly on prediction, and we prove that the proposed procedure is asymptotically optimal in the sense that its squared prediction loss and risk are asymptotically identical to those of the best—but infeasible—model averaging estimator. Numerical analysis shows the promise of the proposed procedure over existing model averaging and selection methods.
Journal: Journal of the American Statistical Association
Pages: 972-984
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1604363
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604363
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:972-984
Template-Type: ReDIF-Article 1.0
Author-Name: Yichuan Zhao
Author-X-Name-First: Yichuan
Author-X-Name-Last: Zhao
Title: Empirical Likelihood Methods in Biomedicine and Health
Journal: Journal of the American Statistical Association
Pages: 1028-1029
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2020.1759986
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1759986
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:1028-1029
Template-Type: ReDIF-Article 1.0
Author-Name: Frederic P. Schoenberg
Author-X-Name-First: Frederic P.
Author-X-Name-Last: Schoenberg
Title: Theory of Spatial Statistics: A Concise Introduction
Journal: Journal of the American Statistical Association
Pages: 1033-1034
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2020.1759991
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1759991
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:1033-1034
Template-Type: ReDIF-Article 1.0
Author-Name: Jinhan Xie
Author-X-Name-First: Jinhan
Author-X-Name-Last: Xie
Author-Name: Yuanyuan Lin
Author-X-Name-First: Yuanyuan
Author-X-Name-Last: Lin
Author-Name: Xiaodong Yan
Author-X-Name-First: Xiaodong
Author-X-Name-Last: Yan
Author-Name: Niansheng Tang
Author-X-Name-First: Niansheng
Author-X-Name-Last: Tang
Title: Category-Adaptive Variable Screening for Ultra-High Dimensional Heterogeneous Categorical Data
Abstract:
The populations of interest in modern studies are very often heterogeneous. The population heterogeneity, the qualitative nature of the outcome variable, and the high dimensionality of the predictors pose significant challenges in statistical analysis. In this article, we introduce a category-adaptive screening procedure for high-dimensional heterogeneous data, designed to detect category-specific important covariates. The proposal is a model-free approach without any specification of a regression model, and an adaptive procedure in the sense that the set of active variables is allowed to vary across different categories, thus making it more flexible to accommodate heterogeneity. For response-selective sampling data, another main discovery of this article is that the proposed method works directly without any modification. Under mild regularity conditions, the new procedure is shown to possess the sure screening and ranking consistency properties. Simulation studies provide supportive evidence that the proposed method performs well under various settings and is effective at extracting category-specific information. Applications are illustrated with two real datasets. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 747-760
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1573734
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1573734
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:747-760
Template-Type: ReDIF-Article 1.0
Author-Name: Bradley Efron
Author-X-Name-First: Bradley
Author-X-Name-Last: Efron
Title: Prediction, Estimation, and Attribution
Abstract:
The scientific needs and computational limitations of the twentieth century fashioned classical statistical methodology. Both the needs and limitations have changed in the twenty-first, and so has the methodology. Large-scale prediction algorithms—neural nets, deep learning, boosting, support vector machines, random forests—have achieved star status in the popular press. They are recognizable as heirs to the regression tradition, but ones carried out at enormous scale and on titanic datasets. How do these algorithms compare with standard regression techniques such as ordinary least squares or logistic regression? Several key discrepancies will be examined, centering on the differences between prediction and estimation or prediction and attribution (significance testing). Most of the discussion is carried out through small numerical examples.
Journal: Journal of the American Statistical Association
Pages: 636-655
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2020.1762613
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1762613
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:636-655
Template-Type: ReDIF-Article 1.0
Author-Name: Robin Henderson
Author-X-Name-First: Robin
Author-X-Name-Last: Henderson
Author-Name: Irina Makarenko
Author-X-Name-First: Irina
Author-X-Name-Last: Makarenko
Author-Name: Paul Bushby
Author-X-Name-First: Paul
Author-X-Name-Last: Bushby
Author-Name: Andrew Fletcher
Author-X-Name-First: Andrew
Author-X-Name-Last: Fletcher
Author-Name: Anvar Shukurov
Author-X-Name-First: Anvar
Author-X-Name-Last: Shukurov
Title: Statistical Topology and the Random Interstellar Medium
Abstract:
We use topological methods to investigate the small-scale variation and local spatial characteristics of the interstellar medium (ISM) in three regions of the southern sky. We demonstrate that there are circumstances where topological methods can identify differences in distributions when conventional marginal or correlation analyses may not. We propose a nonparametric method for comparing two fields based on the counts of topological features and the geometry of the associated persistence diagrams. We investigate the expected distribution of topological structures quantified through Betti numbers under Gaussian random field (GRF) assumptions, which underlie many astrophysical models of the ISM. When we apply the methods to the astrophysical data, we find strong evidence that one of the three regions is both topologically dissimilar to the other two and not consistent with an underlying GRF model. This region is proximal to a region of recent star formation whereas the others are more distant. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 625-635
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1647841
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1647841
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:625-635
Template-Type: ReDIF-Article 1.0
Author-Name: Jerome Friedman
Author-X-Name-First: Jerome
Author-X-Name-Last: Friedman
Author-Name: Trevor Hastie
Author-X-Name-First: Trevor
Author-X-Name-Last: Hastie
Author-Name: Robert Tibshirani
Author-X-Name-First: Robert
Author-X-Name-Last: Tibshirani
Title: Discussion of “Prediction, Estimation, and Attribution” by Bradley Efron
Abstract:
Professor Efron has presented us with a thought-provoking paper on the relationship between prediction, estimation, and attribution in the modern era of data science. While we appreciate many of his arguments, we see more of a continuum between the old and new methodology, and the opportunity for both to improve through their synergy.
Journal: Journal of the American Statistical Association
Pages: 665-666
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2020.1762617
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1762617
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:665-666
Template-Type: ReDIF-Article 1.0
Author-Name: Briana J. K. Stephenson
Author-X-Name-First: Briana J. K.
Author-X-Name-Last: Stephenson
Author-Name: Amy H. Herring
Author-X-Name-First: Amy H.
Author-X-Name-Last: Herring
Author-Name: Andrew Olshan
Author-X-Name-First: Andrew
Author-X-Name-Last: Olshan
Title: Robust Clustering With Subpopulation-Specific Deviations
Abstract:
The National Birth Defects Prevention Study (NBDPS) is a case-control study of birth defects conducted across 10 U.S. states. Researchers are interested in characterizing the etiologic role of maternal diet, collected using a food frequency questionnaire. Because diet is multidimensional, dimension reduction methods such as cluster analysis are often used to summarize dietary patterns. In a large, heterogeneous population, traditional clustering methods, such as latent class analysis, used to estimate dietary patterns can produce a large number of clusters due to a variety of factors, including study size and regional diversity. These factors result in a loss of interpretability of patterns that may differ due to minor consumption changes. Based on an adaptation of the local partition process, we propose a new method, robust profile clustering, to handle these data complexities. Here, participants may be clustered at two levels: (1) globally, where women are assigned to an overall population-level cluster via an overfitted finite mixture model, and (2) locally, where regional variations in diet are accommodated via a beta-Bernoulli process dependent on subpopulation differences. We use our method to analyze the NBDPS data, deriving prepregnancy dietary patterns for women in the NBDPS while accounting for regional variability. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 521-537
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1611583
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1611583
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:521-537
Template-Type: ReDIF-Article 1.0
Author-Name: Giacomo Zanella
Author-X-Name-First: Giacomo
Author-X-Name-Last: Zanella
Title: Informed Proposals for Local MCMC in Discrete Spaces
Abstract:
There is a lack of methodological results to design efficient Markov chain Monte Carlo (MCMC) algorithms for statistical models with discrete-valued high-dimensional parameters. Motivated by this consideration, we propose a simple framework for the design of informed MCMC proposals (i.e., Metropolis–Hastings proposal distributions that appropriately incorporate local information about the target) which is naturally applicable to discrete spaces. Using Peskun-type comparisons of Markov kernels, we explicitly characterize the class of asymptotically optimal proposal distributions under this framework, which we refer to as locally balanced proposals. The resulting algorithms are straightforward to implement in discrete spaces and provide orders of magnitude improvements in efficiency compared to alternative MCMC schemes, including discrete versions of Hamiltonian Monte Carlo. Simulations are performed with both simulated and real datasets, including a detailed application to Bayesian record linkage. A direct connection with gradient-based MCMC suggests that locally balanced proposals can be seen as a natural way to extend the latter to discrete spaces. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 852-865
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1585255
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1585255
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:852-865
Template-Type: ReDIF-Article 1.0
Author-Name: Fei Jiang
Author-X-Name-First: Fei
Author-X-Name-Last: Jiang
Author-Name: Qing Cheng
Author-X-Name-First: Qing
Author-X-Name-Last: Cheng
Author-Name: Guosheng Yin
Author-X-Name-First: Guosheng
Author-X-Name-Last: Yin
Author-Name: Haipeng Shen
Author-X-Name-First: Haipeng
Author-X-Name-Last: Shen
Title: Functional Censored Quantile Regression
Abstract:
We propose a functional censored quantile regression model to describe the time-varying relationship between time-to-event outcomes and corresponding functional covariates. The time-varying effect is modeled as an unspecified function that is approximated via B-splines. A generalized approximate cross-validation method is developed to select the number of knots by minimizing the expected loss. We establish asymptotic properties of the method and the knot selection procedure. Furthermore, we conduct extensive simulation studies to evaluate the finite sample performance of our method. Finally, we analyze the functional relationship between ambulatory blood pressure trajectories and clinical outcome in stroke patients. The results reinforce the importance of the morning blood pressure surge phenomenon, whose effect has caught attention but remains controversial in the medical literature. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 931-944
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1602047
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1602047
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:931-944
Template-Type: ReDIF-Article 1.0
Author-Name: Jin-Ting Zhang
Author-X-Name-First: Jin-Ting
Author-X-Name-Last: Zhang
Author-Name: Jia Guo
Author-X-Name-First: Jia
Author-X-Name-Last: Guo
Author-Name: Bu Zhou
Author-X-Name-First: Bu
Author-X-Name-Last: Zhou
Author-Name: Ming-Yen Cheng
Author-X-Name-First: Ming-Yen
Author-X-Name-Last: Cheng
Title: A Simple Two-Sample Test in High Dimensions Based on L2-Norm
Abstract:
Testing the equality of two means is a fundamental inference problem. For high-dimensional data, the Hotelling’s T2-test either performs poorly or becomes inapplicable. Several modifications have been proposed to address this issue. However, most of them are based on asymptotic normality of the null distributions of their test statistics, which inevitably requires strong assumptions on the covariance. We study this problem thoroughly and propose an L2-norm based test that works under mild conditions and even when there are fewer observations than the dimension. Specifically, to cope with general nonnormality of the null distribution we employ the Welch–Satterthwaite χ2-approximation. We derive a sharp upper bound on the approximation error and use it to justify that the χ2-approximation is preferred to the normal approximation. Simple ratio-consistent estimators for the parameters in the χ2-approximation are given. Importantly, our test can cope with singularity or near singularity of the covariance, which is commonly seen in high dimensions and is the main cause of nonnormality. The power of the proposed test is also investigated. Extensive simulation studies and an application show that our test is at least comparable to and often outperforms several competitors in terms of size control, and the powers are comparable when their sizes are. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1011-1027
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1604366
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604366
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:1011-1027
Template-Type: ReDIF-Article 1.0
Author-Name: Chih-Li Sung
Author-X-Name-First: Chih-Li
Author-X-Name-Last: Sung
Author-Name: Ying Hung
Author-X-Name-First: Ying
Author-X-Name-Last: Hung
Author-Name: William Rittase
Author-X-Name-First: William
Author-X-Name-Last: Rittase
Author-Name: Cheng Zhu
Author-X-Name-First: Cheng
Author-X-Name-Last: Zhu
Author-Name: C. F. Jeff Wu
Author-X-Name-First: C. F.
Author-X-Name-Last: Jeff Wu
Title: A Generalized Gaussian Process Model for Computer Experiments With Binary Time Series
Abstract:
Non-Gaussian observations such as binary responses are common in some computer experiments. Motivated by the analysis of a class of cell adhesion experiments, we introduce a generalized Gaussian process model for binary responses, which shares some common features with standard GP models. In addition, the proposed model incorporates a flexible mean function that can capture different types of time series structures. Asymptotic properties of the estimators are derived, and an optimal predictor as well as its predictive distribution are constructed. Their performance is examined via two simulation studies. The methodology is applied to study computer simulations for cell adhesion experiments. The fitted model reveals important biological information in repeated cell bindings, which is not directly observable in lab experiments. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 945-956
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1604361
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604361
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:945-956
Template-Type: ReDIF-Article 1.0
Author-Name: Zhiliang Ying
Author-X-Name-First: Zhiliang
Author-X-Name-Last: Ying
Author-Name: Wen Yu
Author-X-Name-First: Wen
Author-X-Name-Last: Yu
Author-Name: Ziqiang Zhao
Author-X-Name-First: Ziqiang
Author-X-Name-Last: Zhao
Author-Name: Ming Zheng
Author-X-Name-First: Ming
Author-X-Name-Last: Zheng
Title: Regression Analysis of Doubly Truncated Data
Abstract:
Doubly truncated data are found in the astronomy, econometrics, and survival analysis literature. They arise when each observation is confined to an interval, that is, only those which fall within their respective intervals are observed along with the intervals. Unlike one-sided truncation, which can be handled by a counting process-based approach, doubly truncated data are much more difficult to handle. In their analysis of an astronomical dataset, Efron and Petrosian proposed some nonparametric methods for doubly truncated data. Motivated by their approach, as well as by the work of Bhattacharya et al. for right truncated data, we propose a general method for estimating the regression parameter when the dependent variable is subject to double truncation. It extends the Mann–Whitney-type rank estimator and can be computed easily by existing software packages. Weighted rank estimation is also considered for improving estimation efficiency. We show that the resulting estimators are consistent and asymptotically normal. Resampling schemes are proposed with large sample justification for approximating the limiting distributions. The quasar data in Efron and Petrosian and an AIDS incubation dataset are analyzed by the new method. Simulation results show that the proposed method works well.
Journal: Journal of the American Statistical Association
Pages: 810-821
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1585252
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1585252
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:810-821
Template-Type: ReDIF-Article 1.0
Author-Name: A. C. Davison
Author-X-Name-First: A. C.
Author-X-Name-Last: Davison
Title: Discussion
Journal: Journal of the American Statistical Association
Pages: 663-664
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2020.1762616
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1762616
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:663-664
Template-Type: ReDIF-Article 1.0
Author-Name: Yen-Chi Chen
Author-X-Name-First: Yen-Chi
Author-X-Name-Last: Chen
Title: Statistical Modelling by Exponential Families
Journal: Journal of the American Statistical Association
Pages: 1032-1032
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2020.1759989
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1759989
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:1032-1032
Template-Type: ReDIF-Article 1.0
Author-Name: Bin Yu
Author-X-Name-First: Bin
Author-X-Name-Last: Yu
Author-Name: Rebecca Barter
Author-X-Name-First: Rebecca
Author-X-Name-Last: Barter
Title: The Data Science Process: One Culture
Journal: Journal of the American Statistical Association
Pages: 672-674
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2020.1762615
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1762615
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:672-674
Template-Type: ReDIF-Article 1.0
Author-Name: Jialiang Mao
Author-X-Name-First: Jialiang
Author-X-Name-Last: Mao
Author-Name: Yuhan Chen
Author-X-Name-First: Yuhan
Author-X-Name-Last: Chen
Author-Name: Li Ma
Author-X-Name-First: Li
Author-X-Name-Last: Ma
Title: Bayesian Graphical Compositional Regression for Microbiome Data
Abstract:
An important task in microbiome studies is to test the existence of and give characterization to differences in the microbiome composition across groups of samples. Important challenges of this problem include the large within-group heterogeneities among samples and the existence of potential confounding variables that, when ignored, increase the chance of false discoveries and reduce the power for identifying true differences. We propose a probabilistic framework to overcome these issues by combining three ideas: (i) a phylogenetic tree-based decomposition of the cross-group comparison problem into a series of local tests, (ii) a graphical model that links the local tests to allow information sharing across taxa, and (iii) a Bayesian testing strategy that incorporates covariates and integrates out the within-group variation, avoiding potentially unstable point estimates. With the proposed method, we analyze the American Gut data to compare the gut microbiome composition of groups of participants with different dietary habits. Our analysis shows that (i) the frequencies of consuming fruit, seafood, vegetables, and whole grains are closely related to the gut microbiome composition and (ii) the conclusion of the analysis can change drastically when different sets of relevant covariates are adjusted for, indicating the necessity of carefully selecting and including possible confounders in the analysis when comparing microbiome compositions with data from observational studies. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 610-624
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1647212
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1647212
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:610-624
Template-Type: ReDIF-Article 1.0
Author-Name: Kyunghee Han
Author-X-Name-First: Kyunghee
Author-X-Name-Last: Han
Author-Name: Hans-Georg Müller
Author-X-Name-First: Hans-Georg
Author-X-Name-Last: Müller
Author-Name: Byeong U. Park
Author-X-Name-First: Byeong U.
Author-X-Name-Last: Park
Title: Additive Functional Regression for Densities as Responses
Abstract:
We propose and investigate additive density regression, a novel additive functional regression model for situations where the responses are random distributions that can be viewed as random densities and the predictors are vectors. Data in the form of samples of densities or distributions are increasingly encountered in statistical analysis and there is a need for flexible regression models that accommodate random densities as responses. Such models are of special interest for multivariate continuous predictors, where unrestricted nonparametric regression approaches are subject to the curse of dimensionality. Additive models can be expected to maintain one-dimensional rates of convergence while permitting a substantial degree of flexibility. This motivates the development of additive regression models for situations where multivariate continuous predictors are coupled with densities as responses. To overcome the problem that distributions do not form a vector space, we utilize a class of transformations that map densities to unrestricted square integrable functions and then deploy an additive functional regression model to fit the responses in the unrestricted space, finally transforming back to density space. We implement the proposed additive model with an extended version of smooth backfitting and establish the consistency of this approach, including rates of convergence. The proposed method is illustrated with an application to the distributions of baby names in the United States.
Journal: Journal of the American Statistical Association
Pages: 997-1010
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1604365
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604365
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:997-1010
Template-Type: ReDIF-Article 1.0
Author-Name: Amanda F. Mejia
Author-X-Name-First: Amanda F.
Author-X-Name-Last: Mejia
Author-Name: Yu (Ryan) Yue
Author-X-Name-First: Yu (Ryan)
Author-X-Name-Last: Yue
Author-Name: David Bolin
Author-X-Name-First: David
Author-X-Name-Last: Bolin
Author-Name: Finn Lindgren
Author-X-Name-First: Finn
Author-X-Name-Last: Lindgren
Author-Name: Martin A. Lindquist
Author-X-Name-First: Martin A.
Author-X-Name-Last: Lindquist
Title: A Bayesian General Linear Modeling Approach to Cortical Surface fMRI Data Analysis
Abstract:
Cortical surface functional magnetic resonance imaging (cs-fMRI) has recently grown in popularity versus traditional volumetric fMRI. In addition to offering better whole-brain visualization, dimension reduction, removal of extraneous tissue types, and improved alignment of cortical areas across subjects, it is also more compatible with common assumptions of Bayesian spatial models. However, as no spatial Bayesian model has been proposed for cs-fMRI data, most analyses continue to employ the classical general linear model (GLM), a “massive univariate” approach. Here, we propose a spatial Bayesian GLM for cs-fMRI, which employs a class of sophisticated spatial processes to model latent activation fields. We make several advances compared with existing spatial Bayesian models for volumetric fMRI. First, we use integrated nested Laplacian approximations, a highly accurate and efficient Bayesian computation technique, rather than variational Bayes. To identify regions of activation, we utilize an excursions set method based on the joint posterior distribution of the latent fields, rather than the marginal distribution at each location. Finally, we propose the first multi-subject spatial Bayesian modeling approach, which addresses a major gap in the existing literature. The methods are very computationally advantageous and are validated through simulation studies and two task fMRI studies from the Human Connectome Project. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 501-520
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1611582
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1611582
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:501-520
Template-Type: ReDIF-Article 1.0
Author-Name: Shan Yu
Author-X-Name-First: Shan
Author-X-Name-Last: Yu
Author-Name: Guannan Wang
Author-X-Name-First: Guannan
Author-X-Name-Last: Wang
Author-Name: Li Wang
Author-X-Name-First: Li
Author-X-Name-Last: Wang
Author-Name: Chenhui Liu
Author-X-Name-First: Chenhui
Author-X-Name-Last: Liu
Author-Name: Lijian Yang
Author-X-Name-First: Lijian
Author-X-Name-Last: Yang
Title: Estimation and Inference for Generalized Geoadditive Models
Abstract:
In many application areas, data are collected on a count or binary response with spatial covariate information. In this article, we introduce a new class of generalized geoadditive models (GGAMs) for spatial data distributed over complex domains. Through a link function, the proposed GGAM assumes that the mean of the discrete response variable depends on additive univariate functions of explanatory variables and a bivariate function to adjust for the spatial effect. We propose a two-stage approach for estimating and making inferences of the components in the GGAM. In the first stage, the univariate components and the geographical component in the model are approximated via univariate polynomial splines and bivariate penalized splines over triangulation, respectively. In the second stage, local polynomial smoothing is applied to the cleaned univariate data to average out the variation of the first-stage estimators. We investigate the consistency of the proposed estimators and the asymptotic normality of the univariate components. We also establish the simultaneous confidence band for each of the univariate components. The performance of the proposed method is evaluated by two simulation studies. We apply the proposed method to analyze the crash counts data in the Tampa-St. Petersburg urbanized area in Florida. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 761-774
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1574584
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1574584
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:761-774
Template-Type: ReDIF-Article 1.0
Author-Name: Zhengling Qi
Author-X-Name-First: Zhengling
Author-X-Name-Last: Qi
Author-Name: Dacheng Liu
Author-X-Name-First: Dacheng
Author-X-Name-Last: Liu
Author-Name: Haoda Fu
Author-X-Name-First: Haoda
Author-X-Name-Last: Fu
Author-Name: Yufeng Liu
Author-X-Name-First: Yufeng
Author-X-Name-Last: Liu
Title: Multi-Armed Angle-Based Direct Learning for Estimating Optimal Individualized Treatment Rules With Various Outcomes
Abstract:
Estimating an optimal individualized treatment rule (ITR) based on patients’ information is an important problem in precision medicine. An optimal ITR is a decision function that optimizes patients’ expected clinical outcomes. Many existing methods in the literature are designed for binary treatment settings with the interest of a continuous outcome. Much less work has been done on estimating optimal ITRs in multiple treatment settings with good interpretations. In this article, we propose angle-based direct learning (AD-learning) to efficiently estimate optimal ITRs with multiple treatments. Our proposed method can be applied to various types of outcomes, such as continuous, survival, or binary outcomes. Moreover, it has an interesting geometric interpretation on the effect of different treatments for each individual patient, which can help doctors and patients make better decisions. Finite sample error bounds have been established to provide a theoretical guarantee for AD-learning. Finally, we demonstrate the superior performance of our method via an extensive simulation study and real data applications. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 678-691
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2018.1529597
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1529597
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:678-691
Template-Type: ReDIF-Article 1.0
Author-Name: Bradley Efron
Author-X-Name-First: Bradley
Author-X-Name-Last: Efron
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 675-677
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2020.1762453
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1762453
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:675-677
Template-Type: ReDIF-Article 1.0
Author-Name: Wenjia Wang
Author-X-Name-First: Wenjia
Author-X-Name-Last: Wang
Author-Name: Rui Tuo
Author-X-Name-First: Rui
Author-X-Name-Last: Tuo
Author-Name: C. F. Jeff Wu
Author-X-Name-First: C. F.
Author-X-Name-Last: Jeff Wu
Title: On Prediction Properties of Kriging: Uniform Error Bounds and Robustness
Abstract:
Kriging based on Gaussian random fields is widely used in reconstructing unknown functions. The kriging method has pointwise predictive distributions which are computationally simple. However, in many applications one would like to predict for a range of untried points simultaneously. In this work, we obtain some error bounds for the simple and universal kriging predictor under the uniform metric. It works for a scattered set of input points in an arbitrary dimension, and also covers the case where the covariance function of the Gaussian process is misspecified. These results lead to a better understanding of the rate of convergence of kriging under the Gaussian or the Matérn correlation functions, the relationship between space-filling designs and kriging models, and the robustness of the Matérn correlation functions. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 920-930
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1598868
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1598868
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:920-930
Template-Type: ReDIF-Article 1.0
Author-Name: Lu Yang
Author-X-Name-First: Lu
Author-X-Name-Last: Yang
Author-Name: Edward W. Frees
Author-X-Name-First: Edward W.
Author-X-Name-Last: Frees
Author-Name: Zhengjun Zhang
Author-X-Name-First: Zhengjun
Author-X-Name-Last: Zhang
Title: Nonparametric Estimation of Copula Regression Models With Discrete Outcomes
Abstract:
Multivariate discrete outcomes are common in a wide range of areas including insurance, finance, and biology. When the interplay between outcomes is significant, quantifying dependencies among interrelated variables is of great importance. Due to their ability to accommodate dependence flexibly, copulas are being applied increasingly. Yet, the application of copulas on discrete data is still in its infancy; one of the biggest barriers is the nonuniqueness of copulas, calling into question model interpretations and predictions. In this article, we study copula estimation with discrete outcomes in a regression context. As the marginal distributions vary with covariates, inclusion of continuous regressors expands the region of support for consistent estimation of copulas. Because some properties of continuous outcomes do not carry over to discrete outcomes, specification of a copula model has been a problem. We propose a nonparametric estimator of copulas to identify the “hidden” dependence structure for discrete outcomes and develop its asymptotic properties. The proposed nonparametric estimator can also serve as a diagnostic tool for selecting a parametric form for copulas. In the simulation study, we explore the performance of the proposed estimator under different scenarios and provide guidance on when the choice of copulas is important. The performance of the estimator improves as discreteness diminishes. A practical bandwidth selector is also proposed. An empirical analysis examines a dataset from the Local Government Property Insurance Fund (LGPIF) in the state of Wisconsin. We apply the nonparametric estimator to model the dependence among claim frequencies from different types of insurance coverage. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 707-720
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2018.1546586
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1546586
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:707-720
Template-Type: ReDIF-Article 1.0
Author-Name: Han Li
Author-X-Name-First: Han
Author-X-Name-Last: Li
Author-Name: Minxuan Xu
Author-X-Name-First: Minxuan
Author-X-Name-Last: Xu
Author-Name: Jun S. Liu
Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu
Author-Name: Xiaodan Fan
Author-X-Name-First: Xiaodan
Author-X-Name-Last: Fan
Title: An Extended Mallows Model for Ranked Data Aggregation
Abstract:
In this article, we study the rank aggregation problem, which aims to find a consensus ranking by aggregating multiple ranking lists. To address the problem probabilistically, we formulate an elaborate ranking model for full and partial rankings by generalizing the Mallows model. Our model assumes that the ranked data are generated through a multistage ranking process that is explicitly governed by parameters that measure the overall quality and stability of the process. The new model is quite flexible and has a closed form expression. Under mild conditions, we can derive a few useful theoretical properties of the model. Furthermore, we propose an efficient statistic called rank coefficient to detect over-correlated rankings and a hierarchical ranking model to fit the data. Through extensive simulation studies and real applications, we evaluate the merits of our models and demonstrate that they outperform the state-of-the-art methods in diverse scenarios. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 730-746
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1573733
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1573733
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:730-746
Template-Type: ReDIF-Article 1.0
Author-Name: Xiwei Tang
Author-X-Name-First: Xiwei
Author-X-Name-Last: Tang
Author-Name: Xuan Bi
Author-X-Name-First: Xuan
Author-X-Name-Last: Bi
Author-Name: Annie Qu
Author-X-Name-First: Annie
Author-X-Name-Last: Qu
Title: Individualized Multilayer Tensor Learning With an Application in Imaging Analysis
Abstract:
This work is motivated by multimodality breast cancer imaging data, which is quite challenging in that the signals of discrete tumor-associated microvesicles are randomly distributed with heterogeneous patterns. This imposes a significant challenge for conventional imaging regression and dimension reduction models assuming a homogeneous feature structure. We develop an innovative multilayer tensor learning method to incorporate heterogeneity into a higher-order tensor decomposition and predict disease status effectively through utilizing subject-wise imaging features and multimodality information. Specifically, we construct a multilayer decomposition which leverages an individualized imaging layer in addition to a modality-specific tensor structure. One major advantage of our approach is that we are able to efficiently capture the heterogeneous spatial features of signals that are not characterized by a population structure while integrating multimodality information simultaneously. To achieve scalable computing, we develop a new bi-level block improvement algorithm. In theory, we investigate the algorithm's convergence property, the tensor signal recovery error bound, and the asymptotic consistency of the prediction model estimation. We also apply the proposed method to simulated and human breast cancer imaging data. Numerical results demonstrate that the proposed method outperforms other existing competing methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 836-851
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1585254
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1585254
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:836-851
Template-Type: ReDIF-Article 1.0
Author-Name: Degui Li
Author-X-Name-First: Degui
Author-X-Name-Last: Li
Author-Name: Peter M. Robinson
Author-X-Name-First: Peter M.
Author-X-Name-Last: Robinson
Author-Name: Han Lin Shang
Author-X-Name-First: Han Lin
Author-X-Name-Last: Shang
Title: Long-Range Dependent Curve Time Series
Abstract:
We introduce methods and theory for functional or curve time series with long-range dependence. The temporal sum of the curve process is shown to be asymptotically normally distributed, the conditions for this covering a functional version of fractionally integrated autoregressive moving averages. We also construct an estimate of the long-run covariance function, which we use, via functional principal component analysis, in estimating the orthonormal functions spanning the dominant subspace of the curves. In a semiparametric context, we propose an estimate of the memory parameter and establish its consistency. A Monte Carlo study of finite-sample performance is included, along with two empirical applications. The first of these finds a degree of stability and persistence in intraday stock returns. The second finds similarity in the extent of long memory in incremental age-specific fertility rates across some developed nations. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 957-971
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1604362
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604362
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:957-971
Template-Type: ReDIF-Article 1.0
Author-Name: Trambak Banerjee
Author-X-Name-First: Trambak
Author-X-Name-Last: Banerjee
Author-Name: Gourab Mukherjee
Author-X-Name-First: Gourab
Author-X-Name-Last: Mukherjee
Author-Name: Shantanu Dutta
Author-X-Name-First: Shantanu
Author-X-Name-Last: Dutta
Author-Name: Pulak Ghosh
Author-X-Name-First: Pulak
Author-X-Name-Last: Ghosh
Title: A Large-Scale Constrained Joint Modeling Approach for Predicting User Activity, Engagement, and Churn With Application to Freemium Mobile Games
Abstract:
We develop a constrained extremely zero-inflated joint (CEZIJ) modeling framework for simultaneously analyzing player activity, engagement, and dropouts (churns) in app-based mobile freemium games. Our proposed framework addresses the complex interdependencies between a player’s decision to use a freemium product, the extent of her direct and indirect engagement with the product, and her decision to permanently drop its usage. CEZIJ extends the existing class of joint models for longitudinal and survival data in several ways. It not only accommodates extremely zero-inflated responses in a joint model setting but also incorporates domain-specific, convex structural constraints on the model parameters. Longitudinal data from app-based mobile games usually exhibit a large set of potential predictors, and choosing the relevant set of predictors is highly desirable for various purposes including improved predictability. To achieve this goal, CEZIJ conducts simultaneous, coordinated selection of fixed and random effects in high-dimensional penalized generalized linear mixed models. For analyzing such large-scale datasets, variable selection and estimation are conducted via a distributed-computing-based split-and-conquer approach that massively increases scalability and provides better predictive performance over competing predictive methods. Our results reveal codependencies between varied player characteristics that promote player activity and engagement. Furthermore, the predicted churn probabilities exhibit idiosyncratic clusters of player profiles over time, based on which marketers and game managers can segment the playing population for improved monetization of app-based freemium games. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 538-554
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1611584
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1611584
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:538-554
Template-Type: ReDIF-Article 1.0
Author-Name: Ian Laga
Author-X-Name-First: Ian
Author-X-Name-Last: Laga
Author-Name: Xiaoyue Niu
Author-X-Name-First: Xiaoyue
Author-X-Name-Last: Niu
Title: Model-Based Geostatistics for Global Public Health: Methods and Applications
Journal: Journal of the American Statistical Association
Pages: 1030-1032
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2020.1759988
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1759988
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:1030-1032
Template-Type: ReDIF-Article 1.0
Author-Name: Giampiero Marra
Author-X-Name-First: Giampiero
Author-X-Name-Last: Marra
Author-Name: Rosalba Radice
Author-X-Name-First: Rosalba
Author-X-Name-Last: Radice
Title: Copula Link-Based Additive Models for Right-Censored Event Time Data
Abstract:
This article proposes an approach to estimate and make inference on the parameters of copula link-based survival models. The methodology allows for the margins to be specified using flexible parametric formulations for time-to-event data, the baseline survival functions to be modeled using monotonic splines, and each parameter of the assumed joint survival distribution to depend on an additive predictor incorporating several types of covariate effects. All the model’s coefficients as well as the smoothing parameters associated with the relevant components in the additive predictors are estimated using a carefully structured efficient and stable penalized likelihood algorithm. Some theoretical properties are also discussed. The proposed modeling framework is evaluated in a simulation study and illustrated using a real dataset. The relevant numerical computations can be easily carried out using the freely available GJRM R package. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 886-895
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1593178
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1593178
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:886-895
Template-Type: ReDIF-Article 1.0
Author-Name: Matthias Katzfuss
Author-X-Name-First: Matthias
Author-X-Name-Last: Katzfuss
Author-Name: Jonathan R. Stroud
Author-X-Name-First: Jonathan R.
Author-X-Name-Last: Stroud
Author-Name: Christopher K. Wikle
Author-X-Name-First: Christopher K.
Author-X-Name-Last: Wikle
Title: Ensemble Kalman Methods for High-Dimensional Hierarchical Dynamic Space-Time Models
Abstract:
We propose a new class of filtering and smoothing methods for inference in high-dimensional, nonlinear, non-Gaussian, spatio-temporal state-space models. The main idea is to combine the ensemble Kalman filter and smoother, developed in the geophysics literature, with state-space algorithms from the statistics literature. Our algorithms address a variety of estimation scenarios, including online and off-line state and parameter estimation. We take a Bayesian perspective, for which the goal is to generate samples from the joint posterior distribution of states and parameters. The key benefit of our approach is the use of ensemble Kalman methods for dimension reduction, which allows inference for high-dimensional state vectors. We compare our methods to existing ones, including ensemble Kalman filters, particle filters, and particle MCMC. Using a real data example of cloud motion and data simulated under a number of nonlinear and non-Gaussian scenarios, we show that our approaches outperform these existing methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 866-885
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1592753
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1592753
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:866-885
Template-Type: ReDIF-Article 1.0
Author-Name: Lin Su
Author-X-Name-First: Lin
Author-X-Name-Last: Su
Author-Name: Wenbin Lu
Author-X-Name-First: Wenbin
Author-X-Name-Last: Lu
Author-Name: Rui Song
Author-X-Name-First: Rui
Author-X-Name-Last: Song
Author-Name: Danyang Huang
Author-X-Name-First: Danyang
Author-X-Name-Last: Huang
Title: Testing and Estimation of Social Network Dependence With Time to Event Data
Abstract:
Nowadays, events spread rapidly along social networks. We are interested in whether people’s responses to an event are affected by their friends’ characteristics. For example, how soon will a person start playing a game given that his/her friends like it? Studying social network dependence is an emerging research area. In this work, we propose a novel latent spatial autocorrelation Cox model to study social network dependence with time-to-event data. The proposed model introduces a latent indicator to characterize whether a person’s survival time might be affected by his or her friends’ features. We first propose a score-type test for detecting the existence of social network dependence. If it exists, we further develop an EM-type algorithm to estimate the model parameters. The performance of the proposed test and estimators is illustrated by simulation studies and an application to a time-to-event dataset about playing a popular mobile game from one of the largest online social network platforms. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 570-582
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1617153
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1617153
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:570-582
Template-Type: ReDIF-Article 1.0
Author-Name: Neal S. Grantham
Author-X-Name-First: Neal S.
Author-X-Name-Last: Grantham
Author-Name: Yawen Guan
Author-X-Name-First: Yawen
Author-X-Name-Last: Guan
Author-Name: Brian J. Reich
Author-X-Name-First: Brian J.
Author-X-Name-Last: Reich
Author-Name: Elizabeth T. Borer
Author-X-Name-First: Elizabeth T.
Author-X-Name-Last: Borer
Author-Name: Kevin Gross
Author-X-Name-First: Kevin
Author-X-Name-Last: Gross
Title: MIMIX: A Bayesian Mixed-Effects Model for Microbiome Data From Designed Experiments
Abstract:
Recent advances in bioinformatics have made high-throughput microbiome data widely available, and new statistical tools are required to maximize the information gained from these data. For example, analysis of high-dimensional microbiome data from designed experiments remains an open area in microbiome research. Contemporary analyses work on metrics that summarize collective properties of the microbiome, but such reductions preclude inference on the fine-scale effects of environmental stimuli on individual microbial taxa. Other approaches model the proportions or counts of individual taxa as response variables in mixed models, but these methods fail to account for complex correlation patterns among microbial communities. In this article, we propose a novel Bayesian mixed-effects model that exploits cross-taxa correlations within the microbiome, a model we call microbiome mixed model (MIMIX). MIMIX offers global tests for treatment effects, local tests and estimation of treatment effects on individual taxa, quantification of the relative contribution from heterogeneous sources to microbiome variability, and identification of latent ecological subcommunities in the microbiome. MIMIX is tailored to large microbiome experiments using a combination of Bayesian factor analysis to efficiently represent dependence between taxa and Bayesian variable selection methods to achieve sparsity. We demonstrate the model using a simulation experiment and on a 2 × 2 factorial experiment of the effects of nutrient supplement and herbivore exclusion on the foliar fungal microbiome of Andropogon gerardii, a perennial bunchgrass, as part of the global Nutrient Network research initiative. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 599-609
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1626242
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1626242
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:599-609
Template-Type: ReDIF-Article 1.0
Author-Name: Chenlu Ke
Author-X-Name-First: Chenlu
Author-X-Name-Last: Ke
Author-Name: Xiangrong Yin
Author-X-Name-First: Xiangrong
Author-X-Name-Last: Yin
Title: Expected Conditional Characteristic Function-based Measures for Testing Independence
Abstract:
We propose a novel class of independence measures for testing independence between two random vectors based on the discrepancy between the conditional and the marginal characteristic functions. The relation between our index and other similar measures is studied, which indicates that they all belong to a large framework of reproducing kernel Hilbert spaces. If one of the variables is categorical, our asymmetric index extends the typical ANOVA to a kernel ANOVA that can test a more general hypothesis of equal distributions among groups. In addition, our index is also applicable when both variables are continuous. We develop two empirical estimates and obtain their respective asymptotic distributions. We illustrate the advantages of our approach by numerical studies across a variety of settings including a real data example. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 985-996
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1604364
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1604364
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:985-996
Template-Type: ReDIF-Article 1.0
Author-Name: Ionut Bebu
Author-X-Name-First: Ionut
Author-X-Name-Last: Bebu
Title: Innovative Strategies, Statistical Solutions and Simulations for Modern Clinical Trials
Journal: Journal of the American Statistical Association
Pages: 1029-1030
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2020.1759987
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1759987
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:1029-1030
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Correction
Journal: Journal of the American Statistical Association
Pages: 1035-1036
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2020.1724472
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1724472
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:1035-1036
Template-Type: ReDIF-Article 1.0
Author-Name: Emmanuel Candès
Author-X-Name-First: Emmanuel
Author-X-Name-Last: Candès
Author-Name: Chiara Sabatti
Author-X-Name-First: Chiara
Author-X-Name-Last: Sabatti
Title: Discussion of the Paper “Prediction, Estimation, and Attribution” by B. Efron
Journal: Journal of the American Statistical Association
Pages: 656-658
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2020.1762618
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1762618
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:656-658
Template-Type: ReDIF-Article 1.0
Author-Name: Daniel J. McDonald
Author-X-Name-First: Daniel J.
Author-X-Name-Last: McDonald
Title: Sufficient Dimension Reduction: Methods and Applications With R
Journal: Journal of the American Statistical Association
Pages: 1032-1033
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2020.1759990
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1759990
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:1032-1033
Template-Type: ReDIF-Article 1.0
Author-Name: Jean-Noël Bacro
Author-X-Name-First: Jean-Noël
Author-X-Name-Last: Bacro
Author-Name: Carlo Gaetan
Author-X-Name-First: Carlo
Author-X-Name-Last: Gaetan
Author-Name: Thomas Opitz
Author-X-Name-First: Thomas
Author-X-Name-Last: Opitz
Author-Name: Gwladys Toulemonde
Author-X-Name-First: Gwladys
Author-X-Name-Last: Toulemonde
Title: Hierarchical Space-Time Modeling of Asymptotically Independent Exceedances With an Application to Precipitation Data
Abstract:
The statistical modeling of space-time extremes in environmental applications is key to understanding complex dependence structures in original event data and to generating realistic scenarios for impact models. In this context of high-dimensional data, we propose a novel hierarchical model for high threshold exceedances defined over continuous space and time by embedding a space-time Gamma process convolution for the rate of an exponential variable, leading to asymptotic independence in space and time. Its physically motivated anisotropic dependence structure is based on geometric objects moving through space-time according to a velocity vector. We demonstrate that inference based on weighted pairwise likelihood is fast and accurate. The usefulness of our model is illustrated by an application to hourly precipitation data from a study region in Southern France, where it clearly improves on an alternative censored Gaussian space-time random field model. While classical limit models based on threshold-stability fail to appropriately capture relatively fast joint tail decay rates between asymptotic dependence and classical independence, strong empirical evidence from our application and other recent case studies motivates the use of more realistic asymptotic independence models such as ours. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 555-569
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1617152
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1617152
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:555-569
Template-Type: ReDIF-Article 1.0
Author-Name: Elynn Y. Chen
Author-X-Name-First: Elynn Y.
Author-X-Name-Last: Chen
Author-Name: Ruey S. Tsay
Author-X-Name-First: Ruey S.
Author-X-Name-Last: Tsay
Author-Name: Rong Chen
Author-X-Name-First: Rong
Author-X-Name-Last: Chen
Title: Constrained Factor Models for High-Dimensional Matrix-Variate Time Series
Abstract:
High-dimensional matrix-variate time series data are becoming widely available in many scientific fields, such as economics, biology, and meteorology. To achieve significant dimension reduction while preserving the intrinsic matrix structure and temporal dynamics in such data, Wang, Liu, and Chen proposed a matrix factor model that is shown to provide effective analysis. In this article, we establish a general framework for incorporating domain and prior knowledge in the matrix factor model through linear constraints. The proposed framework is shown to be useful in achieving parsimonious parameterization, facilitating interpretation of the latent matrix factor, and identifying specific factors of interest. Fully utilizing the prior-knowledge-induced constraints results in more efficient and accurate modeling, inference, and dimension reduction, as well as a clearer and better interpretation of the results. Constrained, multi-term, and partially constrained factor models for matrix-variate time series are developed, along with efficient estimation procedures and their asymptotic properties. We show that the convergence rates of the constrained factor loading matrices are much faster than those of the conventional matrix factor analysis under many situations. Simulation studies are carried out to demonstrate finite-sample performance of the proposed method and its associated asymptotic properties. We illustrate the proposed model with three applications, where the constrained matrix-factor models outperform their unconstrained counterparts in the power of variance explanation under the out-of-sample 10-fold cross-validation setting. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 775-793
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1584899
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1584899
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:775-793
Template-Type: ReDIF-Article 1.0
Author-Name: Abhijoy Saha
Author-X-Name-First: Abhijoy
Author-X-Name-Last: Saha
Author-Name: Karthik Bharath
Author-X-Name-First: Karthik
Author-X-Name-Last: Bharath
Author-Name: Sebastian Kurtek
Author-X-Name-First: Sebastian
Author-X-Name-Last: Kurtek
Title: A Geometric Variational Approach to Bayesian Inference
Abstract:
We propose a novel Riemannian geometric framework for variational inference in Bayesian models based on the nonparametric Fisher–Rao metric on the manifold of probability density functions. Under the square-root density representation, the manifold can be identified with the positive orthant of the unit hypersphere S∞ in L2, and the Fisher–Rao metric reduces to the standard L2 metric. Exploiting such a Riemannian structure, we formulate the task of approximating the posterior distribution as a variational problem on the hypersphere based on the α-divergence. This provides a tighter lower bound on the marginal distribution when compared to, and a corresponding upper bound unavailable with, approaches based on the Kullback–Leibler divergence. We propose a novel gradient-based algorithm for the variational problem based on Fréchet derivative operators motivated by the geometry of S∞, and examine its properties. Through simulations and real data applications, we demonstrate the utility of the proposed geometric framework and algorithm on several Bayesian models. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 822-835
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1585253
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1585253
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:822-835
Template-Type: ReDIF-Article 1.0
Author-Name: D. R. Cox
Author-X-Name-First: D. R.
Author-X-Name-Last: Cox
Title: Discussion of Paper by Brad Efron
Journal: Journal of the American Statistical Association
Pages: 659-659
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2020.1762451
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1762451
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:659-659
Template-Type: ReDIF-Article 1.0
Author-Name: Karen Kafadar
Author-X-Name-First: Karen
Author-X-Name-Last: Kafadar
Title: Reinforcing the Impact of Statistics on Society
Abstract:
What does statistics have to offer science and society, in this age of massive data, machine learning algorithms, and multiple online sources of tools for data analysis? I recall a few situations where statistics made a real difference and reinforced the impact of our discipline on society. Sometimes the difference lay in the insightful analysis and inference enabled by ground-breaking methods in our field like hypothesis testing, likelihood ratios, Bayesian models, the jackknife, and the bootstrap. But perhaps more often, the impacts came from thoughtful analyses before data were collected, and the questions that arose after the statistical analysis. The impact of understanding the problem, designing the experiment and data collection, conducting the pilot surveys, and raising important questions is substantial. Through sensible explorations following formal statistical procedures, statisticians have made contributions in many domains. In this presentation, I recall some examples that made a long-lasting impact. Some of them, like randomization in clinical trials, known and familiar to all, are so ingrained in our practice that the role of statistics has been forgotten. Others may be less familiar but nonetheless benefited greatly from the critical input of statisticians. All remind us that our field remains today not only relevant but critical to science and society.
Journal: Journal of the American Statistical Association
Pages: 491-500
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2020.1761217
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1761217
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:491-500
Template-Type: ReDIF-Article 1.0
Author-Name: Noel Cressie
Author-X-Name-First: Noel
Author-X-Name-Last: Cressie
Title: Comment: When Is It Data Science and When Is It Data Engineering?
Journal: Journal of the American Statistical Association
Pages: 660-662
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2020.1762619
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1762619
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:660-662
Template-Type: ReDIF-Article 1.0
Author-Name: Chih-Li Sung
Author-X-Name-First: Chih-Li
Author-X-Name-Last: Sung
Author-Name: Wenjia Wang
Author-X-Name-First: Wenjia
Author-X-Name-Last: Wang
Author-Name: Matthew Plumlee
Author-X-Name-First: Matthew
Author-X-Name-Last: Plumlee
Author-Name: Benjamin Haaland
Author-X-Name-First: Benjamin
Author-X-Name-Last: Haaland
Title: Multiresolution Functional ANOVA for Large-Scale, Many-Input Computer Experiments
Abstract:
The Gaussian process is a standard tool for building emulators for both deterministic and stochastic computer experiments. However, application of Gaussian process models is greatly limited in practice, particularly for large-scale and many-input computer experiments that have become typical. We propose a multiresolution functional ANOVA (MRFA) model as a computationally feasible emulation alternative. More generally, this model can be used for large-scale and many-input nonlinear regression problems. An overlapping group lasso approach is used for estimation, ensuring computational feasibility in a large-scale and many-input setting. New results on consistency and inference for the (potentially overlapping) group lasso in a high-dimensional setting are developed and applied to the proposed MRFA model. Importantly, these results allow us to quantify the uncertainty in our predictions. Numerical examples demonstrate that the proposed model enjoys marked computational advantages. Data capabilities, in terms of both sample size and dimension, meet or exceed best available emulation tools while meeting or exceeding emulation accuracy. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 908-919
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1595630
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1595630
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:908-919
Template-Type: ReDIF-Article 1.0
Author-Name: Guan Yu
Author-X-Name-First: Guan
Author-X-Name-Last: Yu
Author-Name: Liang Yin
Author-X-Name-First: Liang
Author-X-Name-Last: Yin
Author-Name: Shu Lu
Author-X-Name-First: Shu
Author-X-Name-Last: Lu
Author-Name: Yufeng Liu
Author-X-Name-First: Yufeng
Author-X-Name-Last: Liu
Title: Confidence Intervals for Sparse Penalized Regression With Random Designs
Abstract:
With the abundance of large data, sparse penalized regression techniques are commonly used in data analysis due to the advantage of simultaneous variable selection and estimation. A number of convex as well as nonconvex penalties have been proposed in the literature to achieve sparse estimates. Despite intense work in this area, how to perform valid inference for sparse penalized regression with a general penalty remains to be an active research problem. In this article, by making use of state-of-the-art optimization tools in stochastic variational inequality theory, we propose a unified framework to construct confidence intervals for sparse penalized regression with a wide range of penalties, including convex and nonconvex penalties. We study the inference for parameters under the population version of the penalized regression as well as parameters of the underlying linear model. Theoretical convergence properties of the proposed method are obtained. Several simulated and real data examples are presented to demonstrate the validity and effectiveness of the proposed inference procedure. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 794-809
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1585251
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1585251
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:794-809
Template-Type: ReDIF-Article 1.0
Author-Name: Hamid Javadi
Author-X-Name-First: Hamid
Author-X-Name-Last: Javadi
Author-Name: Andrea Montanari
Author-X-Name-First: Andrea
Author-X-Name-Last: Montanari
Title: Nonnegative Matrix Factorization Via Archetypal Analysis
Abstract:
Given a collection of data points, nonnegative matrix factorization (NMF) suggests expressing them as convex combinations of a small set of “archetypes” with nonnegative entries. This decomposition is unique only if the true archetypes are nonnegative and sufficiently sparse (or the weights are sufficiently sparse), a regime that is captured by the separability condition and its generalizations. In this article, we study an approach to NMF that can be traced back to the work of Cutler and Breiman [(1994), “Archetypal Analysis,” Technometrics, 36, 338–347] and does not require the data to be separable, while providing a generally unique decomposition. We optimize a trade-off between two objectives: we minimize the distance of the data points from the convex envelope of the archetypes (which can be interpreted as an empirical risk), while also minimizing the distance of the archetypes from the convex envelope of the data (which can be interpreted as a data-dependent regularization). The archetypal analysis method of Cutler and Breiman is recovered as the limiting case in which the last term is given infinite weight. We introduce a “uniqueness condition” on the data which is necessary for identifiability. We prove that, under uniqueness (plus additional regularity conditions on the geometry of the archetypes), our estimator is robust. While our approach requires solving a nonconvex optimization problem, we find that standard optimization methods succeed in finding good solutions for both real and synthetic data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 896-907
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2019.1594832
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1594832
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:896-907
Template-Type: ReDIF-Article 1.0
Author-Name: Daniel J. Luckett
Author-X-Name-First: Daniel J.
Author-X-Name-Last: Luckett
Author-Name: Eric B. Laber
Author-X-Name-First: Eric B.
Author-X-Name-Last: Laber
Author-Name: Anna R. Kahkoska
Author-X-Name-First: Anna R.
Author-X-Name-Last: Kahkoska
Author-Name: David M. Maahs
Author-X-Name-First: David M.
Author-X-Name-Last: Maahs
Author-Name: Elizabeth Mayer-Davis
Author-X-Name-First: Elizabeth
Author-X-Name-Last: Mayer-Davis
Author-Name: Michael R. Kosorok
Author-X-Name-First: Michael R.
Author-X-Name-Last: Kosorok
Title: Estimating Dynamic Treatment Regimes in Mobile Health Using V-Learning
Abstract:
The vision for precision medicine is to use individual patient characteristics to inform a personalized treatment plan that leads to the best possible healthcare for each patient. Mobile technologies have an important role to play in this vision as they offer a means to monitor a patient’s health status in real time and subsequently to deliver interventions if, when, and in the dose that they are needed. Dynamic treatment regimes formalize individualized treatment plans as sequences of decision rules, one per stage of clinical intervention, that map current patient information to a recommended treatment. However, most existing methods for estimating optimal dynamic treatment regimes are designed for a small number of fixed decision points occurring on a coarse time-scale. We propose a new reinforcement learning method for estimating an optimal treatment regime that is applicable to data collected using mobile technologies in an outpatient setting. The proposed method accommodates an indefinite time horizon and minute-by-minute decision making that are common in mobile health applications. We show that the proposed estimators are consistent and asymptotically normal under mild conditions. The proposed methods are applied to estimate an optimal dynamic treatment regime for controlling blood glucose levels in patients with type 1 diabetes.
Journal: Journal of the American Statistical Association
Pages: 692-706
Issue: 530
Volume: 115
Year: 2020
Month: 4
X-DOI: 10.1080/01621459.2018.1537919
File-URL: http://hdl.handle.net/10.1080/01621459.2018.1537919
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:115:y:2020:i:530:p:692-706
Template-Type: ReDIF-Article 1.0
Author-Name: Yixuan Qiu
Author-X-Name-First: Yixuan
Author-X-Name-Last: Qiu
Author-Name: Xiao Wang
Author-X-Name-First: Xiao
Author-X-Name-Last: Wang
Title: ALMOND: Adaptive Latent Modeling and Optimization via Neural Networks and Langevin Diffusion
Abstract:
Latent variable models cover a broad range of statistical and machine learning models, such as Bayesian models, linear mixed models, and Gaussian mixture models. Existing methods often suffer from two major challenges in practice: (a) a proper latent variable distribution is difficult to specify; (b) exact likelihood inference is formidable due to intractable computation. We propose a novel framework for the inference of latent variable models that overcomes these two limitations. This new framework allows for a fully data-driven latent variable distribution via deep neural networks, and the proposed stochastic gradient method, combined with the Langevin algorithm, is efficient and suitable for complex models and big data. We provide theoretical results for the Langevin algorithm, and establish the convergence analysis of the optimization method. This framework has demonstrated superior practical performance through simulation studies and a real data analysis. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1224-1236
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2019.1691563
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1691563
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1224-1236
Template-Type: ReDIF-Article 1.0
Author-Name: Qiang Sun
Author-X-Name-First: Qiang
Author-X-Name-Last: Sun
Author-Name: Heping Zhang
Author-X-Name-First: Heping
Author-X-Name-Last: Zhang
Title: Targeted Inference Involving High-Dimensional Data Using Nuisance Penalized Regression
Abstract:
Analysis of high-dimensional data has received considerable and increasing attention in statistics. In practice, we may not be interested in every variable that is observed. Instead, often some of the variables are of particular interest, and the remaining variables are nuisance. To this end, we propose the nuisance penalized regression, which does not penalize the parameters of interest. When the coherence between interest parameters and nuisance parameters is negligible, we show that the resulting estimator can be directly used for inference without any correction. When the coherence is not negligible, we propose an iterative procedure to further refine the estimate of the interest parameters, based on which we propose a modified profile-likelihood-based statistic for hypothesis testing. The utilities of our general results are demonstrated in three specific examples. Numerical studies lend further support to our method.
Journal: Journal of the American Statistical Association
Pages: 1472-1486
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1737079
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1737079
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1472-1486
Template-Type: ReDIF-Article 1.0
Author-Name: Shulei Wang
Author-X-Name-First: Shulei
Author-X-Name-Last: Wang
Author-Name: T. Tony Cai
Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai
Author-Name: Hongzhe Li
Author-X-Name-First: Hongzhe
Author-X-Name-Last: Li
Title: Optimal Estimation of Wasserstein Distance on a Tree With an Application to Microbiome Studies
Abstract:
The weighted UniFrac distance, a plug-in estimator of the Wasserstein distance of read counts on a tree, has been widely used to measure the microbial community difference in microbiome studies. Our investigation however shows that such a plug-in estimator, although intuitive and commonly used in practice, suffers from potential bias. Motivated by this finding, we study the problem of optimal estimation of the Wasserstein distance between two distributions on a tree from the sampled data in the high-dimensional setting. The minimax rate of convergence is established. To overcome the bias problem, we introduce a new estimator, referred to as the moment-screening estimator on a tree (MET), by using implicit best polynomial approximation that incorporates the tree structure. The new estimator is computationally efficient and is shown to be minimax rate-optimal. Numerical studies using both simulated and real biological datasets demonstrate the practical merits of MET, including reduced biases and statistically more significant differences in microbiome between the inactive Crohn’s disease patients and the normal controls. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1237-1253
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2019.1699422
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1699422
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1237-1253
Template-Type: ReDIF-Article 1.0
Author-Name: Laurens de Haan
Author-X-Name-First: Laurens
Author-X-Name-Last: de Haan
Author-Name: Chen Zhou
Author-X-Name-First: Chen
Author-X-Name-Last: Zhou
Title: Trends in Extreme Value Indices
Abstract:
We consider extreme value analysis for independent but nonidentically distributed observations. In particular, the observations do not share the same extreme value index. Assuming continuously changing extreme value indices, we provide a nonparametric estimate for the functional extreme value index. Besides estimating the extreme value index locally, we also provide a global estimator for the trend and its joint asymptotic theory. The asymptotic theory for the global estimator can be used for testing a prespecified parametric trend in the extreme value indices. In particular, it can be applied to test whether the extreme value index remains at a constant level across all observations.
Journal: Journal of the American Statistical Association
Pages: 1265-1279
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2019.1705307
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1705307
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1265-1279
Template-Type: ReDIF-Article 1.0
Author-Name: Stephen Bates
Author-X-Name-First: Stephen
Author-X-Name-Last: Bates
Author-Name: Emmanuel Candès
Author-X-Name-First: Emmanuel
Author-X-Name-Last: Candès
Author-Name: Lucas Janson
Author-X-Name-First: Lucas
Author-X-Name-Last: Janson
Author-Name: Wenshuo Wang
Author-X-Name-First: Wenshuo
Author-X-Name-Last: Wang
Title: Metropolized Knockoff Sampling
Abstract:
Model-X knockoffs is a wrapper that transforms essentially any feature importance measure into a variable selection algorithm, which discovers true effects while rigorously controlling the expected fraction of false positives. A frequently discussed challenge in applying this method is constructing knockoff variables, which are synthetic variables obeying a crucial exchangeability property with the explanatory variables under study. This article introduces techniques for knockoff generation in great generality: we provide a sequential characterization of all possible knockoff distributions, which leads to a Metropolis–Hastings formulation of an exact knockoff sampler. We further show how to use conditional independence structure to speed up computations. Combining these two threads, we introduce an explicit set of sequential algorithms and empirically demonstrate their effectiveness. Our theoretical analysis proves that our algorithms achieve near-optimal computational complexity in certain cases. The techniques we develop are sufficiently rich to enable knockoff sampling in challenging models including cases where the covariates are continuous and heavy-tailed, and follow a graphical model such as the Ising model. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1413-1427
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1729163
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1729163
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1413-1427
Template-Type: ReDIF-Article 1.0
Author-Name: Pierre E. Jacob
Author-X-Name-First: Pierre E.
Author-X-Name-Last: Jacob
Author-Name: Ruobin Gong
Author-X-Name-First: Ruobin
Author-X-Name-Last: Gong
Author-Name: Paul T. Edlefsen
Author-X-Name-First: Paul T.
Author-X-Name-Last: Edlefsen
Author-Name: Arthur P. Dempster
Author-X-Name-First: Arthur P.
Author-X-Name-Last: Dempster
Title: A Gibbs Sampler for a Class of Random Convex Polytopes
Abstract:
We present a Gibbs sampler for the Dempster–Shafer (DS) approach to statistical inference for categorical distributions. The DS framework extends the Bayesian approach, allows in particular the use of partial prior information, and yields three-valued uncertainty assessments representing probabilities “for,” “against,” and “don’t know” about formal assertions of interest. The proposed algorithm targets the distribution of a class of random convex polytopes which encapsulate the DS inference. The sampler relies on an equivalence between the iterative constraints of the vertex configuration and the nonnegativity of cycles in a fully connected directed graph. Illustrations include the testing of independence in 2 × 2 contingency tables and parameter estimation of the linkage model.
Journal: Journal of the American Statistical Association
Pages: 1181-1192
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2021.1881523
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1881523
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1181-1192
Template-Type: ReDIF-Article 1.0
Author-Name: Trevor Harris
Author-X-Name-First: Trevor
Author-X-Name-Last: Harris
Author-Name: Bo Li
Author-X-Name-First: Bo
Author-X-Name-Last: Li
Author-Name: Nathan J. Steiger
Author-X-Name-First: Nathan J.
Author-X-Name-Last: Steiger
Author-Name: Jason E. Smerdon
Author-X-Name-First: Jason E.
Author-X-Name-Last: Smerdon
Author-Name: Naveen Narisetty
Author-X-Name-First: Naveen
Author-X-Name-Last: Narisetty
Author-Name: J. Derek Tucker
Author-X-Name-First: J. Derek
Author-X-Name-Last: Tucker
Title: Evaluating Proxy Influence in Assimilated Paleoclimate Reconstructions—Testing the Exchangeability of Two Ensembles of Spatial Processes
Abstract:
Climate field reconstructions (CFRs) attempt to estimate spatiotemporal fields of climate variables in the past using climate proxies such as tree rings, ice cores, and corals. Data assimilation (DA) methods are a recent and promising new means of deriving CFRs that optimally fuse climate proxies with climate model output. Despite the growing application of DA-based CFRs, little is understood about how much the assimilated proxies change the statistical properties of the climate model data. To address this question, we propose a robust and computationally efficient method, based on functional data depth, to evaluate differences in the distributions of two spatiotemporal processes. We apply our test to study global and regional proxy influence in DA-based CFRs by comparing the background and analysis states, which are treated as two samples of spatiotemporal fields. We find that the analysis states are significantly altered from the climate-model-based background states due to the assimilation of proxies. Moreover, the difference between the analysis and background states increases with the number of proxies, even in regions far beyond proxy collection sites. Our approach allows us to characterize the added value of proxies, indicating where and when the analysis states are distinct from the background states. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1100-1113
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1799810
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1799810
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1100-1113
Template-Type: ReDIF-Article 1.0
Author-Name: Abhra Sarkar
Author-X-Name-First: Abhra
Author-X-Name-Last: Sarkar
Author-Name: Debdeep Pati
Author-X-Name-First: Debdeep
Author-X-Name-Last: Pati
Author-Name: Bani K. Mallick
Author-X-Name-First: Bani K.
Author-X-Name-Last: Mallick
Author-Name: Raymond J. Carroll
Author-X-Name-First: Raymond J.
Author-X-Name-Last: Carroll
Title: Bayesian Copula Density Deconvolution for Zero-Inflated Data in Nutritional Epidemiology
Abstract:
Estimating the marginal and joint densities of the long-term average intakes of different dietary components is an important problem in nutritional epidemiology. Since these variables cannot be directly measured, data are usually collected in the form of 24-hr recalls of the intakes, which show marked patterns of conditional heteroscedasticity. Significantly compounding the challenges, the recalls for episodically consumed dietary components also include exact zeros. The problem of estimating the density of the latent long-time intakes from their observed measurement error contaminated proxies is then a problem of deconvolution of densities with zero-inflated data. We propose a Bayesian semiparametric solution to the problem, building on a novel hierarchical latent variable framework that translates the problem to one involving continuous surrogates only. Crucial to accommodating important aspects of the problem, we then design a copula based approach to model the involved joint distributions, adopting different modeling strategies for the marginals of the different dietary components. We design efficient Markov chain Monte Carlo algorithms for posterior inference and illustrate the efficacy of the proposed method through simulation experiments. Applied to our motivating nutritional epidemiology problems, compared to other approaches, our method provides more realistic estimates of the consumption patterns of episodically consumed dietary components. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1075-1087
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1782220
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1782220
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1075-1087
Template-Type: ReDIF-Article 1.0
Author-Name: Richard A. Davis
Author-X-Name-First: Richard A.
Author-X-Name-Last: Davis
Author-Name: Konstantinos Fokianos
Author-X-Name-First: Konstantinos
Author-X-Name-Last: Fokianos
Author-Name: Scott H. Holan
Author-X-Name-First: Scott H.
Author-X-Name-Last: Holan
Author-Name: Harry Joe
Author-X-Name-First: Harry
Author-X-Name-Last: Joe
Author-Name: James Livsey
Author-X-Name-First: James
Author-X-Name-Last: Livsey
Author-Name: Robert Lund
Author-X-Name-First: Robert
Author-X-Name-Last: Lund
Author-Name: Vladas Pipiras
Author-X-Name-First: Vladas
Author-X-Name-Last: Pipiras
Author-Name: Nalini Ravishanker
Author-X-Name-First: Nalini
Author-X-Name-Last: Ravishanker
Title: Count Time Series: A Methodological Review
Abstract:
A growing interest in non-Gaussian time series, particularly in series comprised of nonnegative integers (counts), is taking place in today’s statistics literature. Count series naturally arise in fields such as agriculture, economics, epidemiology, finance, geology, meteorology, and sports. Unlike stationary Gaussian series, where autoregressive moving-averages are the primary modeling vehicle, no single class of models dominates the count landscape. As such, the literature has evolved in a somewhat ad hoc fashion, with different model classes being developed to tackle specific situations. This article is an attempt to summarize the current state of count time series modeling. The article first reviews models having prescribed marginal distributions, including some recent developments. This is followed by a discussion of state-space approaches. Multivariate extensions of the methods are then studied and Bayesian approaches to the problem are considered. The intent is to inform researchers and practitioners about the various types of count time series models arising in the modern literature. While estimation issues are not pursued in detail, reference to this literature is made.
Journal: Journal of the American Statistical Association
Pages: 1533-1547
Issue: 535
Volume: 116
Year: 2021
Month: 5
X-DOI: 10.1080/01621459.2021.1904957
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1904957
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1533-1547
Template-Type: ReDIF-Article 1.0
Author-Name: Thomas Kuenzer
Author-X-Name-First: Thomas
Author-X-Name-Last: Kuenzer
Author-Name: Siegfried Hörmann
Author-X-Name-First: Siegfried
Author-X-Name-Last: Hörmann
Author-Name: Piotr Kokoszka
Author-X-Name-First: Piotr
Author-X-Name-Last: Kokoszka
Title: Principal Component Analysis of Spatially Indexed Functions
Abstract:
We develop an expansion, similar in some respects to the Karhunen–Loève expansion, but which is more suitable for functional data indexed by spatial locations on a grid. Unlike the traditional Karhunen–Loève expansion, it takes into account the spatial dependence between the functions. By doing so, it provides a more efficient dimension reduction tool, both theoretically and in finite samples, for functional data with moderate spatial dependence. For such data, it also possesses other theoretical and practical advantages over the currently used approach. The article develops complete asymptotic theory and estimation methodology. The performance of the method is examined by a simulation study and data analysis. The new tools are implemented in an R package. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1444-1456
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1732395
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1732395
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1444-1456
Template-Type: ReDIF-Article 1.0
Author-Name: Matteo Fasiolo
Author-X-Name-First: Matteo
Author-X-Name-Last: Fasiolo
Author-Name: Simon N. Wood
Author-X-Name-First: Simon N.
Author-X-Name-Last: Wood
Author-Name: Margaux Zaffran
Author-X-Name-First: Margaux
Author-X-Name-Last: Zaffran
Author-Name: Raphaël Nedellec
Author-X-Name-First: Raphaël
Author-X-Name-Last: Nedellec
Author-Name: Yannig Goude
Author-X-Name-First: Yannig
Author-X-Name-Last: Goude
Title: Fast Calibrated Additive Quantile Regression
Abstract:
We propose a novel framework for fitting additive quantile regression models, which provides well-calibrated inference about the conditional quantiles and fast automatic estimation of the smoothing parameters, for model structures as diverse as those usable with distributional generalized additive models, while maintaining equivalent numerical efficiency and stability. The proposed methods are at once statistically rigorous and computationally efficient, because they are based on the general belief-updating framework of Bissiri, Holmes, and Walker for loss-based inference, but compute by adapting the stable fitting methods of Wood, Pya, and Säfken. We show how the pinball loss is statistically suboptimal relative to a novel smooth generalization, which also gives access to fast estimation methods. Further, we provide a novel calibration method for efficiently selecting the “learning rate” balancing the loss with the smoothing priors during inference, thereby obtaining reliable quantile uncertainty estimates. Our work was motivated by a probabilistic electricity load forecasting application, used here to demonstrate the proposed approach. The methods described here are implemented by the qgam R package, available on the Comprehensive R Archive Network (CRAN). Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1402-1412
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1725521
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1725521
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1402-1412
Template-Type: ReDIF-Article 1.0
Author-Name: Jonathan P Williams
Author-X-Name-First: Jonathan P
Author-X-Name-Last: Williams
Title: Discussion of “A Gibbs Sampler for a Class of Random Convex Polytopes”
Journal: Journal of the American Statistical Association
Pages: 1198-1200
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2021.1946405
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1946405
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1198-1200
Template-Type: ReDIF-Article 1.0
Author-Name: Jianwei Hu
Author-X-Name-First: Jianwei
Author-X-Name-Last: Hu
Author-Name: Jingfei Zhang
Author-X-Name-First: Jingfei
Author-X-Name-Last: Zhang
Author-Name: Hong Qin
Author-X-Name-First: Hong
Author-X-Name-Last: Qin
Author-Name: Ting Yan
Author-X-Name-First: Ting
Author-X-Name-Last: Yan
Author-Name: Ji Zhu
Author-X-Name-First: Ji
Author-X-Name-Last: Zhu
Title: Using Maximum Entry-Wise Deviation to Test the Goodness of Fit for Stochastic Block Models
Abstract:
The stochastic block model is widely used for detecting community structures in network data. How to test the goodness of fit of the model is one of the fundamental problems and has gained growing interest in recent years. In this article, we propose a novel goodness-of-fit test based on the maximum entry of the centered and rescaled adjacency matrix for the stochastic block model. One noticeable advantage of the proposed test is that the number of communities can be allowed to grow linearly with the number of nodes, up to a logarithmic factor. We prove that the null distribution of the test statistic converges in distribution to a Gumbel distribution, and we show that both the number of communities and the membership vector can be tested via the proposed method. Furthermore, we show that the proposed test has an asymptotic power guarantee against a class of alternatives. We also demonstrate that the proposed method can be extended to the degree-corrected stochastic block model. Both simulation studies and real-world data examples indicate that the proposed method works well. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1373-1382
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1722676
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1722676
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1373-1382
Template-Type: ReDIF-Article 1.0
Author-Name: Bingying Xie
Author-X-Name-First: Bingying
Author-X-Name-Last: Xie
Author-Name: Jun Shao
Author-X-Name-First: Jun
Author-X-Name-Last: Shao
Title: Nonparametric Estimation of Conditional Expectation with Auxiliary Information and Dimension Reduction
Abstract:
Nonparametric estimation of the conditional expectation E(Y|U) of an outcome Y given a covariate vector U is of primary importance in many statistical applications such as prediction and personalized medicine. In some problems, there is an additional auxiliary variable Z in the training dataset used to construct estimators, but Z is not available for future prediction or for selecting patient treatment in personalized medicine. For example, in the training dataset longitudinal outcomes are observed, but only the last outcome Y is of concern in the future prediction or analysis. The longitudinal outcomes other than the last point then constitute the variable Z, which is observed and related to both Y and U. Previous work on how to make use of Z in the estimation of E(Y|U) mainly focused on using Z in the construction of a linear function of U to reduce covariate dimension for better estimation. Using E(Y|U) = E{E(Y|U,Z)|U}, we propose a two-step estimation of the inner and outer expectations, respectively, with sufficient dimension reduction for kernel estimation in both steps. The information from Z is utilized not only in dimension reduction, but also directly in the estimation. Because there exist different ways to reduce dimension, we construct two estimators that may improve on the estimator that does not use Z. The improvements are shown in the convergence rate of the estimators as the sample size increases to infinity, as well as in finite sample simulation performance. A real data analysis about the selection of mammography intervention is presented for illustration.
Journal: Journal of the American Statistical Association
Pages: 1346-1357
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1713793
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1713793
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1346-1357
Template-Type: ReDIF-Article 1.0
Author-Name: Earl Lawrence
Author-X-Name-First: Earl
Author-X-Name-Last: Lawrence
Author-Name: Scott Vander Wiel
Author-X-Name-First: Scott
Author-X-Name-Last: Vander Wiel
Title: Comment on “A Gibbs Sampler for a Class of Random Convex Polytopes”
Journal: Journal of the American Statistical Association
Pages: 1201-1203
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2021.1947305
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1947305
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1201-1203
Template-Type: ReDIF-Article 1.0
Author-Name: Ian Laga
Author-X-Name-First: Ian
Author-X-Name-Last: Laga
Author-Name: Le Bao
Author-X-Name-First: Le
Author-X-Name-Last: Bao
Author-Name: Xiaoyue Niu
Author-X-Name-First: Xiaoyue
Author-X-Name-Last: Niu
Title: Thirty Years of The Network Scale-up Method
Abstract:
Estimating the size of hard-to-reach populations is an important problem for many fields. The network scale-up method (NSUM) is a relatively new approach to estimate the size of these hard-to-reach populations by asking respondents the question, “How many X’s do you know,” where X is the population of interest (e.g., “How many female sex workers do you know?”). The answers to these questions form aggregated relational data (ARD). The NSUM has been used to estimate the size of a variety of subpopulations, including female sex workers, drug users, and even children who have been hospitalized for choking. Within the network scale-up methodology, there are a multitude of estimators for the size of the hidden population, including direct estimators, maximum likelihood estimators, and Bayesian estimators. In this article, we first provide an in-depth analysis of ARD properties and the techniques to collect the data. Then, we comprehensively review different estimation methods in terms of the assumptions behind each model, the relationships between the estimators, and the practical considerations of implementing the methods. We apply many of the models discussed in the review to one canonical dataset and compare their performance and unique features, presented in the supplementary materials. Finally, we provide a summary of the dominant methods and an extensive list of the applications, and discuss the open problems and potential research directions in this area.
Journal: Journal of the American Statistical Association
Pages: 1548-1559
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2021.1935267
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1935267
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1548-1559
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Correction
Journal: Journal of the American Statistical Association
Pages: 1560-1560
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2021.1957322
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1957322
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1560-1560
Template-Type: ReDIF-Article 1.0
Author-Name: Giorgio Paulon
Author-X-Name-First: Giorgio
Author-X-Name-Last: Paulon
Author-Name: Fernando Llanos
Author-X-Name-First: Fernando
Author-X-Name-Last: Llanos
Author-Name: Bharath Chandrasekaran
Author-X-Name-First: Bharath
Author-X-Name-Last: Chandrasekaran
Author-Name: Abhra Sarkar
Author-X-Name-First: Abhra
Author-X-Name-Last: Sarkar
Title: Bayesian Semiparametric Longitudinal Drift-Diffusion Mixed Models for Tone Learning in Adults
Abstract:
Understanding how adult humans learn nonnative speech categories such as tone information has provided novel insights into the mechanisms underlying experience-dependent brain plasticity. Scientists have traditionally examined these questions using longitudinal learning experiments under a multi-category decision making paradigm. Drift-diffusion processes are popular in such contexts for their ability to mimic underlying neural mechanisms. Motivated by these problems, we develop a novel Bayesian semiparametric inverse Gaussian drift-diffusion mixed model for multi-alternative decision making in longitudinal settings. We design a Markov chain Monte Carlo algorithm for posterior computation. We evaluate the method’s empirical performance through synthetic experiments. Applied to our motivating longitudinal tone learning study, the method provides novel insights into how the biologically interpretable model parameters evolve with learning, differ between input-response tone combinations, and differ between well and poorly performing adults. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1114-1127
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1801448
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1801448
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1114-1127
Template-Type: ReDIF-Article 1.0
Author-Name: Persi Diaconis
Author-X-Name-First: Persi
Author-X-Name-Last: Diaconis
Author-Name: Guanyang Wang
Author-X-Name-First: Guanyang
Author-X-Name-Last: Wang
Title: Discussion of “A Gibbs Sampler for a Class of Random Convex Polytopes”
Journal: Journal of the American Statistical Association
Pages: 1193-1195
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2021.1950000
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1950000
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1193-1195
Template-Type: ReDIF-Article 1.0
Author-Name: Naim U. Rashid
Author-X-Name-First: Naim U.
Author-X-Name-Last: Rashid
Author-Name: Daniel J. Luckett
Author-X-Name-First: Daniel J.
Author-X-Name-Last: Luckett
Author-Name: Jingxiang Chen
Author-X-Name-First: Jingxiang
Author-X-Name-Last: Chen
Author-Name: Michael T. Lawson
Author-X-Name-First: Michael T.
Author-X-Name-Last: Lawson
Author-Name: Longshaokan Wang
Author-X-Name-First: Longshaokan
Author-X-Name-Last: Wang
Author-Name: Yunshu Zhang
Author-X-Name-First: Yunshu
Author-X-Name-Last: Zhang
Author-Name: Eric B. Laber
Author-X-Name-First: Eric B.
Author-X-Name-Last: Laber
Author-Name: Yufeng Liu
Author-X-Name-First: Yufeng
Author-X-Name-Last: Liu
Author-Name: Jen Jen Yeh
Author-X-Name-First: Jen Jen
Author-X-Name-Last: Yeh
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: Michael R. Kosorok
Author-X-Name-First: Michael R.
Author-X-Name-Last: Kosorok
Title: High-Dimensional Precision Medicine From Patient-Derived Xenografts
Abstract:
The complexity of human cancer often results in significant heterogeneity in response to treatment. Precision medicine offers the potential to improve patient outcomes by leveraging this heterogeneity. Individualized treatment rules (ITRs) formalize precision medicine as maps from the patient covariate space into the space of allowable treatments. The optimal ITR is that which maximizes the mean of a clinical outcome in a population of interest. Patient-derived xenograft (PDX) studies permit the evaluation of multiple treatments within a single tumor, and thus are ideally suited for estimating optimal ITRs. PDX data are characterized by correlated outcomes, a high-dimensional feature space, and a large number of treatments. Here we explore machine learning methods for estimating optimal ITRs from PDX data. We analyze data from a large PDX study to identify biomarkers that are informative for developing personalized treatment recommendations in multiple cancers. We estimate optimal ITRs using regression-based (Q-learning) and direct-search methods (outcome weighted learning). Finally, we implement a superlearner approach to combine multiple estimated ITRs and show that the resulting ITR performs better than any of the input ITRs, mitigating uncertainty regarding user choice. Our results indicate that PDX data are a valuable resource for developing individualized treatment strategies in oncology. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1140-1154
Issue: 535
Volume: 116
Year: 2020
Month: 11
X-DOI: 10.1080/01621459.2020.1828091
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1828091
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2020:i:535:p:1140-1154
Template-Type: ReDIF-Article 1.0
Author-Name: Xinzhou Guo
Author-X-Name-First: Xinzhou
Author-X-Name-Last: Guo
Author-Name: Xuming He
Author-X-Name-First: Xuming
Author-X-Name-Last: He
Title: Inference on Selected Subgroups in Clinical Trials
Abstract:
When existing clinical trial data suggest a promising subgroup, we must address the question of how good the selected subgroup really is. The usual statistical inference applied to the selected subgroup, assuming that the subgroup is chosen independent of the data, may lead to an overly optimistic evaluation of the selected subgroup. In this article, we address the issue of selection bias and develop a de-biasing bootstrap inference procedure for the best selected subgroup effect. The proposed inference procedure is model-free, easy to compute, and asymptotically sharp. We demonstrate the merit of our proposed method by reanalyzing the MONET1 trial and show that how the subgroup is selected post hoc should play an important role in any statistical analysis. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1498-1506
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1740096
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1740096
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1498-1506
Template-Type: ReDIF-Article 1.0
Author-Name: Glenn Shafer
Author-X-Name-First: Glenn
Author-X-Name-Last: Shafer
Title: Comment on “A Gibbs Sampler for a Class of Random Convex Polytopes,” by Pierre E. Jacob, Ruobin Gong, Paul T. Edlefsen, and Arthur P. Dempster
Journal: Journal of the American Statistical Association
Pages: 1196-1197
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2021.1950001
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1950001
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1196-1197
Template-Type: ReDIF-Article 1.0
Author-Name: Claudio Heinrich
Author-X-Name-First: Claudio
Author-X-Name-Last: Heinrich
Author-Name: Kristoffer H. Hellton
Author-X-Name-First: Kristoffer H.
Author-X-Name-Last: Hellton
Author-Name: Alex Lenkoski
Author-X-Name-First: Alex
Author-X-Name-Last: Lenkoski
Author-Name: Thordis L. Thorarinsdottir
Author-X-Name-First: Thordis L.
Author-X-Name-Last: Thorarinsdottir
Title: Multivariate Postprocessing Methods for High-Dimensional Seasonal Weather Forecasts
Abstract:
Seasonal weather forecasts are crucial for long-term planning in many practical situations, and skillful forecasts may have substantial economic and humanitarian implications. Current seasonal forecasting models require statistical postprocessing of the output to correct systematic biases and unrealistic uncertainty assessments. We propose a multivariate postprocessing approach using covariance tapering, combined with a dimension reduction step based on principal component analysis for efficient computation. Our proposed technique can correctly and efficiently handle nonstationary, non-isotropic, and negatively correlated spatial error patterns, and is applicable on a global scale. Further, a moving average approach to marginal postprocessing is shown to flexibly handle trends in biases caused by global warming as well as short training periods. In an application to global sea surface temperature forecasts issued by the Norwegian climate prediction model, our proposed methodology is shown to outperform known reference methods. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1048-1059
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1769634
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1769634
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1048-1059
Template-Type: ReDIF-Article 1.0
Author-Name: Ben Dai
Author-X-Name-First: Ben
Author-X-Name-Last: Dai
Author-Name: Xiaotong Shen
Author-X-Name-First: Xiaotong
Author-X-Name-Last: Shen
Author-Name: Junhui Wang
Author-X-Name-First: Junhui
Author-X-Name-Last: Wang
Author-Name: Annie Qu
Author-X-Name-First: Annie
Author-X-Name-Last: Qu
Title: Scalable Collaborative Ranking for Personalized Prediction
Abstract:
Personalized prediction presents an important yet challenging task: predicting user-specific preferences on a large number of items given limited information. It is often cast as a recommender system problem focusing on ordinal or continuous ratings, as in collaborative filtering and content-based filtering. In this article, we propose a new collaborative ranking system to predict the most-preferred items for each user given search queries. Particularly, we propose a ψ-ranker based on ranking functions incorporating information on users, items, and search queries through latent factor models. Moreover, we show that the proposed nonconvex surrogate pairwise ψ-loss performs well under four popular bipartite ranking losses: the sum loss, pairwise zero-one loss, discounted cumulative gain, and mean average precision. We develop a parallel computing strategy to optimize the intractable loss, with its two levels of nonconvex components, through difference-of-convex programming and block successive upper-bound minimization. Theoretically, we establish a probabilistic error bound for the ψ-ranker and show that its ranking error has a sharp rate of convergence in the general framework of bipartite ranking, even when the dimension of the model parameters diverges with the sample size. Consequently, this result also indicates that the ψ-ranker performs better than two major approaches in bipartite ranking: pairwise ranking and scoring. Finally, we demonstrate the utility of the ψ-ranker by comparing it with some strong competitors in the literature through simulated examples as well as Expedia booking data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1215-1223
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2019.1691562
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1691562
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1215-1223
Template-Type: ReDIF-Article 1.0
Author-Name: Lax Chan
Author-X-Name-First: Lax
Author-X-Name-Last: Chan
Author-Name: Bernard W. Silverman
Author-X-Name-First: Bernard W.
Author-X-Name-Last: Silverman
Author-Name: Kyle Vincent
Author-X-Name-First: Kyle
Author-X-Name-Last: Vincent
Title: Multiple Systems Estimation for Sparse Capture Data: Inferential Challenges When There Are Nonoverlapping Lists
Abstract:
Multiple systems estimation strategies have recently been applied to quantify hard-to-reach populations, particularly when estimating the number of victims of human trafficking and modern slavery. In such contexts, it is not uncommon to see sparse or even no overlap between some of the lists on which the estimates are based. Such gaps create difficulties in model fitting and selection, and we develop inference procedures to address these challenges. The approach is based on Poisson log-linear regression modeling. Issues investigated in detail include taking proper account of data sparsity in the estimation procedure, as well as the existence and identifiability of maximum likelihood estimates. A stepwise method for choosing the most suitable parameters is developed, together with a bootstrap approach to finding confidence intervals for the total population size. We apply the strategy to two empirical datasets of trafficking in US regions, and find that the approach results in stable, reasonable estimates. An accompanying R software implementation has been made publicly available. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1297-1306
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2019.1708748
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1708748
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1297-1306
Template-Type: ReDIF-Article 1.0
Author-Name: Chengchun Shi
Author-X-Name-First: Chengchun
Author-X-Name-Last: Shi
Author-Name: Rui Song
Author-X-Name-First: Rui
Author-X-Name-Last: Song
Author-Name: Wenbin Lu
Author-X-Name-First: Wenbin
Author-X-Name-Last: Lu
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Title: Statistical Inference for High-Dimensional Models via Recursive Online-Score Estimation
Abstract:
In this article, we develop a new estimation and valid inference method for single or low-dimensional regression coefficients in high-dimensional generalized linear models. The number of predictors is allowed to grow exponentially fast with respect to the sample size. The proposed estimator is computed by solving a score equation. We recursively conduct model selection to reduce the dimensionality from a high to a moderate scale and construct the score equation based on the selected variables. The proposed confidence interval (CI) achieves valid coverage without assuming consistency of the model selection procedure. When selection consistency is achieved, we show that the length of the proposed CI is asymptotically the same as that of the CI of the “oracle” method, which works as well as if the support of the control variables were known. In addition, we prove that the proposed CI is asymptotically narrower than the CIs constructed based on the desparsified Lasso estimator and the decorrelated score statistic. Simulation studies and real data applications are presented to support our theoretical findings. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1307-1318
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2019.1710154
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1710154
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1307-1318
Template-Type: ReDIF-Article 1.0
Author-Name: Rong Ma
Author-X-Name-First: Rong
Author-X-Name-Last: Ma
Author-Name: T. Tony Cai
Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai
Author-Name: Hongzhe Li
Author-X-Name-First: Hongzhe
Author-X-Name-Last: Li
Title: Optimal Permutation Recovery in Permuted Monotone Matrix Model
Abstract:
Motivated by recent research on quantifying bacterial growth dynamics based on genome assemblies, we consider a permuted monotone matrix model Y=ΘΠ+Z, where the rows represent different samples, the columns represent contigs in genome assemblies, and the elements represent log-read counts after preprocessing steps and Guanine-Cytosine (GC) adjustment. In this model, Θ is an unknown mean matrix with monotone entries for each row, Π is a permutation matrix that permutes the columns of Θ, and Z is a noise matrix. This article studies the problem of estimation/recovery of Π given the observed noisy matrix Y. We propose an estimator based on the best linear projection, which is shown to be minimax rate-optimal for both exact recovery, as measured by the 0-1 loss, and partial recovery, as quantified by the normalized Kendall’s tau distance. Simulation studies demonstrate the superior empirical performance of the proposed estimator over alternative methods. We demonstrate the methods using a synthetic metagenomics dataset of 45 closely related bacterial species and a real metagenomic dataset to compare the bacterial growth dynamics between the responders and the nonresponders among IBD patients after 8 weeks of treatment. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1358-1372
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1713794
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1713794
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1358-1372
Template-Type: ReDIF-Article 1.0
Author-Name: Jairo Diaz-Rodriguez
Author-X-Name-First: Jairo
Author-X-Name-Last: Diaz-Rodriguez
Author-Name: Dominique Eckert
Author-X-Name-First: Dominique
Author-X-Name-Last: Eckert
Author-Name: Hatef Monajemi
Author-X-Name-First: Hatef
Author-X-Name-Last: Monajemi
Author-Name: Stéphane Paltani
Author-X-Name-First: Stéphane
Author-X-Name-Last: Paltani
Author-Name: Sylvain Sardy
Author-X-Name-First: Sylvain
Author-X-Name-Last: Sardy
Title: Nonparametric Estimation of Galaxy Cluster Emissivity and Detection of Point Sources in Astrophysics With Two Lasso Penalties
Abstract:
Astrophysicists are interested in recovering the three-dimensional gas emissivity of a galaxy cluster from a two-dimensional telescope image. Blurring and point sources make this inverse problem harder to solve. The conventional approach requires, in a first step, identifying and masking the point sources. Instead, we model all astrophysical components in a single Poisson generalized linear model. To enforce sparsity on the parameters, maximum likelihood estimation is regularized with two l1 penalties, with weights λ1 for the radial emissivity and λ2 for the point sources. The method has the advantage of not requiring cross-validation to select λ1 and λ2. To judge the significance of interesting features, we quantify uncertainty with the bootstrap. We apply our method to data from two X-ray telescopes (XMM-Newton and Chandra) to estimate gas emissivity. The results are more stable and seem less biased than those of the conventional method, in particular in the outskirts of galaxy clusters. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1088-1099
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1796676
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1796676
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1088-1099
Template-Type: ReDIF-Article 1.0
Author-Name: Wendy L. Martinez
Author-X-Name-First: Wendy L.
Author-X-Name-Last: Martinez
Title: Back to Our Future: Text Analytics Insights
Abstract:
Each year, the Journal of the American Statistical Association publishes the American Statistical Association (ASA) presidential address from the Joint Statistical Meetings (JSM). Here, we present the 2020 address verbatim, save for the addition of references and a few minor editorial corrections.
Journal: Journal of the American Statistical Association
Pages: 1039-1047
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2021.1960760
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1960760
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1039-1047
Template-Type: ReDIF-Article 1.0
Author-Name: Aurore Delaigle
Author-X-Name-First: Aurore
Author-X-Name-Last: Delaigle
Author-Name: Peter Hall
Author-X-Name-First: Peter
Author-X-Name-Last: Hall
Author-Name: Wei Huang
Author-X-Name-First: Wei
Author-X-Name-Last: Huang
Author-Name: Alois Kneip
Author-X-Name-First: Alois
Author-X-Name-Last: Kneip
Title: Estimating the Covariance of Fragmented and Other Related Types of Functional Data
Abstract:
We consider the problem of estimating the covariance function of functional data that are only observed on a subset of their domain, such as fragments observed on small intervals or related types of functional data. We focus on situations where the data enable the computation of the empirical covariance function, or smooth versions of it, only on a subset of its domain containing a diagonal band. We show that estimating the covariance function consistently outside that subset is possible as long as the curves are sufficiently smooth. We establish conditions under which the covariance function is identifiable on its entire domain and propose a tensor product series approach for estimating it consistently. We derive asymptotic properties of our estimator and illustrate its finite sample properties on simulated and real data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1383-1401
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1723597
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1723597
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1383-1401
Template-Type: ReDIF-Article 1.0
Author-Name: Meiling Hao
Author-X-Name-First: Meiling
Author-X-Name-Last: Hao
Author-Name: Kin-yat Liu
Author-X-Name-First: Kin-yat
Author-X-Name-Last: Liu
Author-Name: Wei Xu
Author-X-Name-First: Wei
Author-X-Name-Last: Xu
Author-Name: Xingqiu Zhao
Author-X-Name-First: Xingqiu
Author-X-Name-Last: Zhao
Title: Semiparametric Inference for the Functional Cox Model
Abstract:
This article studies penalized semiparametric maximum partial likelihood estimation and hypothesis testing for the functional Cox model in analyzing right-censored data with both functional and scalar predictors. Deriving the asymptotic joint distribution of finite-dimensional and infinite-dimensional estimators is a very challenging theoretical problem due to the complexity of semiparametric models. For the problem, we construct the Sobolev space equipped with a special inner product and discover a new joint Bahadur representation of estimators of the unknown slope function and coefficients. Using this key tool, we establish the asymptotic joint normality of the proposed estimators and the weak convergence of the estimated slope function, and then construct local and global confidence intervals for an unknown slope function. Furthermore, we study a penalized partial likelihood ratio test, show that the test statistic enjoys the Wilks phenomenon, and also verify the optimality of the test. The theoretical results are examined through simulation studies, and a right-censored data example from the Improving Care of Acute Lung Injury Patients study is provided for illustration. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1319-1329
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2019.1710155
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1710155
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1319-1329
Template-Type: ReDIF-Article 1.0
Author-Name: Malka Gorfine
Author-X-Name-First: Malka
Author-X-Name-Last: Gorfine
Author-Name: Nir Keret
Author-X-Name-First: Nir
Author-X-Name-Last: Keret
Author-Name: Asaf Ben Arie
Author-X-Name-First: Asaf
Author-X-Name-Last: Ben Arie
Author-Name: David Zucker
Author-X-Name-First: David
Author-X-Name-Last: Zucker
Author-Name: Li Hsu
Author-X-Name-First: Li
Author-X-Name-Last: Hsu
Title: Marginalized Frailty-Based Illness-Death Model: Application to the UK-Biobank Survival Data
Abstract:
The UK Biobank is a large-scale health resource comprising genetic, environmental, and medical information on approximately 500,000 volunteer participants in the United Kingdom, recruited at ages 40–69 during the years 2006–2010. The project monitors the health and well-being of its participants. This work demonstrates how these data can be used to yield the building blocks for an interpretable risk-prediction model, in a semiparametric fashion, based on known genetic and environmental risk factors of various chronic diseases, such as colorectal cancer. An illness-death model is adopted, which inherently is a semi-competing risks model, since death can censor the disease, but not vice versa. Using a shared-frailty approach to account for the dependence between time to disease diagnosis and time to death, we provide a new illness-death model that assumes Cox models for the marginal hazard functions. The recruitment procedure used in this study introduces delayed entry to the data. An additional challenge arising from the recruitment procedure is that information coming from both prevalent and incident cases must be aggregated. Lastly, we do not observe any deaths prior to the minimal recruitment age, 40. In this work, we provide an estimation procedure for our new illness-death model that overcomes all the above challenges. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1155-1167
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1831922
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1831922
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1155-1167
Template-Type: ReDIF-Article 1.0
Author-Name: Rachel C. Nethery
Author-X-Name-First: Rachel C.
Author-X-Name-Last: Nethery
Author-Name: Fabrizia Mealli
Author-X-Name-First: Fabrizia
Author-X-Name-Last: Mealli
Author-Name: Jason D. Sacks
Author-X-Name-First: Jason D.
Author-X-Name-Last: Sacks
Author-Name: Francesca Dominici
Author-X-Name-First: Francesca
Author-X-Name-Last: Dominici
Title: Evaluation of the health impacts of the 1990 Clean Air Act Amendments using causal inference and machine learning
Abstract:
We develop a causal inference approach to estimate the number of adverse health events that were prevented due to changes in exposure to multiple pollutants attributable to a large-scale air quality intervention/regulation, with a focus on the 1990 Clean Air Act Amendments (CAAA). We introduce a causal estimand called the Total Events Avoided (TEA) by the regulation, defined as the difference in the number of health events expected under the no-regulation pollution exposures and the number observed with-regulation. We propose matching and machine learning methods that leverage population-level pollution and health data to estimate the TEA. Our approach improves upon traditional methods for regulation health impact analyses by formalizing causal identifying assumptions, utilizing population-level data, minimizing parametric assumptions, and collectively analyzing multiple pollutants. To reduce model-dependence, our approach estimates cumulative health impacts in the subset of regions with projected no-regulation features lying within the support of the observed with-regulation data, thereby providing a conservative but data-driven assessment to complement traditional parametric approaches. We analyze the health impacts of the CAAA in the US Medicare population in the year 2000, and our estimates suggest that large numbers of cardiovascular and dementia-related hospitalizations were avoided due to CAAA-attributable changes in pollution exposure.
Journal: Journal of the American Statistical Association
Pages: 1128-1139
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1803883
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1803883
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1128-1139
Template-Type: ReDIF-Article 1.0
Author-Name: Youjin Lee
Author-X-Name-First: Youjin
Author-X-Name-Last: Lee
Author-Name: Elizabeth L. Ogburn
Author-X-Name-First: Elizabeth L.
Author-X-Name-Last: Ogburn
Title: Network Dependence Can Lead to Spurious Associations and Invalid Inference
Abstract:
Researchers across the health and social sciences generally assume that observations are independent, even while relying on convenience samples that draw subjects from one or a small number of communities, schools, hospitals, etc. A paradigmatic example of this is the Framingham Heart Study (FHS). Many of the limitations of such samples are well-known, but the issue of statistical dependence due to social network ties has not previously been addressed. We show that, along with anticonservative variance estimation, this can result in spurious associations due to network dependence. Using a statistical test that we adapted from one developed for spatial autocorrelation, we test for network dependence in several of the thousands of influential papers that have been published using FHS data. Results suggest that some of the many decades of research on coronary heart disease, other health outcomes, and peer influence using FHS data may suffer from spurious associations, error-prone point estimates, and anticonservative inference due to unacknowledged network dependence. These issues are not unique to the FHS; as researchers in psychology, medicine, and beyond grapple with replication failures, this unacknowledged source of invalid statistical inference should be part of the conversation. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1060-1074
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1782219
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1782219
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1060-1074
Template-Type: ReDIF-Article 1.0
Author-Name: Ross L. Prentice
Author-X-Name-First: Ross L.
Author-X-Name-Last: Prentice
Author-Name: Shanshan Zhao
Author-X-Name-First: Shanshan
Author-X-Name-Last: Zhao
Title: Regression Models and Multivariate Life Tables
Abstract:
Semiparametric, multiplicative-form regression models are specified for marginal single and double failure hazard rates for the regression analysis of multivariate failure time data. Cox-type estimating functions are specified for single and double failure hazard ratio parameter estimation, and corresponding Aalen–Breslow estimators are specified for baseline hazard rates. Generalization to allow classification of failure times into a smaller set of failure types, with failures of the same type having common baseline hazard functions, is also included. Asymptotic distribution theory arises by generalization of the marginal single failure hazard rate estimation results of Lin et al. The Péano series representation for the bivariate survival function in terms of corresponding marginal single and double failure hazard rates leads to novel estimators for pairwise bivariate survival functions and pairwise dependency functions, at specified covariate history. Related asymptotic distribution theory follows from that for the marginal single and double failure hazard rates, the continuity and compact differentiability of the Péano series transformation, and bootstrap applicability. Simulation evaluation of the proposed estimation procedures is presented, and an application to multiple clinical outcomes in the Women’s Health Initiative Dietary Modification Trial is provided. Higher dimensional marginal hazard rate regression modeling is briefly mentioned. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1330-1345
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1713792
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1713792
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1330-1345
Template-Type: ReDIF-Article 1.0
Author-Name: Janice L. Scealy
Author-X-Name-First: Janice L.
Author-X-Name-Last: Scealy
Author-Name: Andrew T. A. Wood
Author-X-Name-First: Andrew T. A.
Author-X-Name-Last: Wood
Title: Analogues on the Sphere of the Affine-Equivariant Spatial Median
Abstract:
Robust estimation of location for data on the unit sphere S^{p-1}
is an important problem in directional statistics even though the influence functions of the sample mean direction and other location estimators are bounded. A significant limitation of previous literature on this topic is that robust estimators and procedures have been developed under the assumption that the underlying population is rotationally symmetric. This assumption often does not hold with real data and in these cases there is a needless loss of efficiency in the estimator. In this article, we propose two estimators for spherical data, both of which are analogous to the affine-equivariant spatial median in Euclidean space. The influence functions of the new location estimators are obtained under a new semiparametric elliptical symmetry model on the sphere and are shown to be standardized bias robust in the highly concentrated case; the influence function of the companion scatter matrix is also obtained. An iterative algorithm that computes both estimators is described. Asymptotic results, including consistency and asymptotic normality, are also derived for the location estimators that result from applying a fixed number of steps in this algorithm. Numerical studies demonstrate that both location estimators may be expected to perform well in practice in terms of efficiency and robustness. A brief example application from the geophysics literature is also provided. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1457-1471
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1733582
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1733582
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1457-1471
Template-Type: ReDIF-Article 1.0
Author-Name: Federico Ferrari
Author-X-Name-First: Federico
Author-X-Name-Last: Ferrari
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Bayesian Factor Analysis for Inference on Interactions
Abstract:
This article is motivated by the problem of inference on interactions among chemical exposures impacting human health outcomes. Chemicals often co-occur in the environment or in synthetic mixtures and as a result exposure levels can be highly correlated. We propose a latent factor joint model, which includes shared factors in both the predictor and response components while assuming conditional independence. By including a quadratic regression in the latent variables in the response component, we induce flexible dimension reduction in characterizing main effects and interactions. We propose a Bayesian approach to inference under this factor analysis for interactions (FIN) framework. Through appropriate modifications of the factor modeling structure, FIN can accommodate higher order interactions. We evaluate the performance using a simulation study and data from the National Health and Nutrition Examination Survey. Code is available on GitHub. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1521-1532
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1745813
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1745813
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1521-1532
Template-Type: ReDIF-Article 1.0
Author-Name: Md Kamrul Hasan Khan
Author-X-Name-First: Md Kamrul Hasan
Author-X-Name-Last: Khan
Author-Name: Avishek Chakraborty
Author-X-Name-First: Avishek
Author-X-Name-Last: Chakraborty
Author-Name: Giovanni Petris
Author-X-Name-First: Giovanni
Author-X-Name-Last: Petris
Author-Name: Barry T. Wilson
Author-X-Name-First: Barry T.
Author-X-Name-Last: Wilson
Title: Constrained Functional Regression of National Forest Inventory Data Over Time Using Remote Sensing Observations
Abstract:
The USDA Forest Service uses satellite imagery, along with a sample of national forest inventory field plots, to monitor and predict changes in forest conditions over time throughout the United States. We specifically focus on a 230,400 ha region in north-central Wisconsin between 2003 and 2012. The auxiliary data from the satellite imagery of this region are relatively dense in space and time, and can be used to learn how forest conditions changed over that decade. However, these records have a significant proportion of missing values due to weather conditions and system failures that we fill in first using a spatiotemporal model. Subsequently, we use the complete imagery as functional predictors in a two-component mixture model to capture the spatial variation in yearly average live tree basal area, an attribute of interest measured on field plots. We further modify the regression equation to accommodate a biophysical constraint on how plot-level live tree basal area can change from one year to the next. Findings from our analysis, represented with a series of maps, match known spatial patterns across the landscape. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1168-1180
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1860769
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1860769
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1168-1180
Template-Type: ReDIF-Article 1.0
Author-Name: Masayo Y. Hirose
Author-X-Name-First: Masayo Y.
Author-X-Name-Last: Hirose
Author-Name: Partha Lahiri
Author-X-Name-First: Partha
Author-X-Name-Last: Lahiri
Title: Multi-Goal Prior Selection: A Way to Reconcile Bayesian and Classical Approaches for Random Effects Models
Abstract:
The two-level normal hierarchical model has played an important role in statistical theory and applications. In this article, we first introduce a general adjusted maximum likelihood method for estimating the unknown variance component of the model and the associated empirical best linear unbiased predictor of the random effects. We then discuss a new idea for selecting a prior for the hyperparameters. The prior, called a multi-goal prior, produces Bayesian solutions for hyperparameters and random effects that match (in the higher order asymptotic sense) the corresponding classical solutions in the linear mixed model with respect to several properties. Moreover, we establish for the first time an analytical equivalence of the posterior variances under the proposed multi-goal prior and the corresponding parametric bootstrap second-order mean squared error estimates in the context of a random effects model.
Journal: Journal of the American Statistical Association
Pages: 1487-1497
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1737532
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1737532
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1487-1497
Template-Type: ReDIF-Article 1.0
Author-Name: Pierre E. Jacob
Author-X-Name-First: Pierre E.
Author-X-Name-Last: Jacob
Author-Name: Ruobin Gong
Author-X-Name-First: Ruobin
Author-X-Name-Last: Gong
Author-Name: Paul T. Edlefsen
Author-X-Name-First: Paul T.
Author-X-Name-Last: Edlefsen
Author-Name: Arthur P. Dempster
Author-X-Name-First: Arthur P.
Author-X-Name-Last: Dempster
Title: Rejoinder—A Gibbs Sampler for a Class of Random Convex Polytopes
Abstract:
We are very grateful to all commenters for their stimulating remarks, questions, as well as useful pointers to the literature which span a wide range of statistical methods over decades of research. We have neither the space nor the knowledge to answer many of the questions raised, and we only aim to offer some clarifications. We hope that readers will be as enthusiastic as ourselves about research on the topics discussed by the commenters. In the following, we refer to Diaconis and Wang as DW, Hoffman, Hannig and Zhang as HHZ, Lawrence and Vander Wiel as LV, Ruggeri as R, Shafer as S, and Williams as W.
Journal: Journal of the American Statistical Association
Pages: 1211-1214
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2021.1945458
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1945458
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1211-1214
Template-Type: ReDIF-Article 1.0
Author-Name: Yanxi Hou
Author-X-Name-First: Yanxi
Author-X-Name-Last: Hou
Author-Name: Xing Wang
Author-X-Name-First: Xing
Author-X-Name-Last: Wang
Title: Extreme and Inference for Tail Gini Functionals With Applications in Tail Risk Measurement
Abstract:
Tail risk analysis focuses on the problem of risk measurement in the tail regions of financial variables. As one crucial task in tail risk analysis for risk management, the measurement of tail risk variability is less addressed in the literature. Neither the theoretical results nor the inference methods are fully developed, which makes modeling implementation difficult. Practitioners are thus short of measurement methods to understand and evaluate tail risks, even when they have large amounts of valuable data in hand. In this article, we consider the measurement of tail variability under the tail scenarios of a systemic variable by extending Gini’s methodology. As we are particularly interested in the limit of the proposed measures as the risk level approaches the extreme status, we show, using extreme value techniques, how the tail dependence structure and marginal risk severity influence the limit of the proposed tail variability measures. We construct a nonparametric estimator and explore its asymptotic behavior. Furthermore, to provide practitioners with more measures for tail risk, we construct three coefficients/measures for tail risk from different perspectives and illustrate them in a real data analysis. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1428-1443
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1730855
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1730855
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1428-1443
Template-Type: ReDIF-Article 1.0
Author-Name: Kara E. Rudolph
Author-X-Name-First: Kara E.
Author-X-Name-Last: Rudolph
Author-Name: Oleg Sofrygin
Author-X-Name-First: Oleg
Author-X-Name-Last: Sofrygin
Author-Name: Mark J. van der Laan
Author-X-Name-First: Mark J.
Author-X-Name-Last: van der Laan
Title: Complier Stochastic Direct Effects: Identification and Robust Estimation
Abstract:
Mediation analysis is critical to understanding the mechanisms underlying exposure-outcome relationships. In this article, we identify the instrumental variable-direct effect of the exposure on the outcome not through the mediator, using randomization of the instrument. We call this estimand the complier stochastic direct effect (CSDE). To our knowledge, such an estimand has not previously been considered or estimated. We propose and evaluate several estimators for the CSDE: a ratio of inverse-probability of treatment-weighted estimators (IPTW), a ratio of estimating equation estimators (EE), a ratio of targeted minimum loss-based estimators (TMLE), and a TMLE that targets the CSDE directly. These estimators are applicable for a variety of study designs, including randomized encouragement trials, like the Moving to Opportunity housing voucher experiment we consider as an illustrative example, treatment discontinuities, and Mendelian randomization. We found the IPTW estimator to be the most sensitive to finite sample bias, resulting in bias of over 40% even when all models were correctly specified in a sample size of N = 100. In contrast, the EE estimator and TMLE that targets the CSDE directly were far less sensitive. The EE and TML estimators also have advantages in terms of efficiency and reduced reliance on correct parametric model specification. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1254-1264
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2019.1704292
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1704292
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1254-1264
Template-Type: ReDIF-Article 1.0
Author-Name: Fabrizio Ruggeri
Author-X-Name-First: Fabrizio
Author-X-Name-Last: Ruggeri
Title: Comment on “A Gibbs Sampler for a Class of Random Convex Polytopes” by P.E. Jacob, R. Gong, P.T. Edlefsen and A.P. Dempster
Journal: Journal of the American Statistical Association
Pages: 1204-1205
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2021.1946404
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1946404
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1204-1205
Template-Type: ReDIF-Article 1.0
Author-Name: Kentaro Hoffman
Author-X-Name-First: Kentaro
Author-X-Name-Last: Hoffman
Author-Name: Jan Hannig
Author-X-Name-First: Jan
Author-X-Name-Last: Hannig
Author-Name: Kai Zhang
Author-X-Name-First: Kai
Author-X-Name-Last: Zhang
Title: Comments on “A Gibbs Sampler for a Class of Random Convex Polytopes”
Journal: Journal of the American Statistical Association
Pages: 1206-1210
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2021.1950002
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1950002
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1206-1210
Template-Type: ReDIF-Article 1.0
Author-Name: Xialiang Dou
Author-X-Name-First: Xialiang
Author-X-Name-Last: Dou
Author-Name: Tengyuan Liang
Author-X-Name-First: Tengyuan
Author-X-Name-Last: Liang
Title: Training Neural Networks as Learning Data-adaptive Kernels: Provable Representation and Approximation Benefits
Abstract:
Consider the problem: given the data pair (x, y) drawn from a population with f*(x) = E[y | X = x], specify a neural network model and run gradient flow on the weights over time until reaching any stationarity. How does f_t, the function computed by the neural network at time t, relate to f*, in terms of approximation and representation? What are the provable benefits of the adaptive representation by neural networks compared to the pre-specified fixed basis representation in the classical nonparametric literature? We answer the above questions via a dynamic reproducing kernel Hilbert space (RKHS) approach indexed by the training process of neural networks. Firstly, we show that when reaching any local stationarity, gradient flow learns an adaptive RKHS representation and performs the global least-squares projection onto the adaptive RKHS, simultaneously. Secondly, we prove that as the RKHS is data-adaptive and task-specific, the residual for f* lies in a subspace that is potentially much smaller than the orthogonal complement of the RKHS. The result formalizes the representation and approximation benefits of neural networks. Lastly, we show that the neural network function computed by gradient flow converges to the kernel ridgeless regression with an adaptive kernel, in the limit of vanishing regularization. The adaptive kernel viewpoint provides new angles for studying the approximation, representation, generalization, and optimization advantages of neural networks.
Journal: Journal of the American Statistical Association
Pages: 1507-1520
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2020.1745812
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1745812
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1507-1520
Template-Type: ReDIF-Article 1.0
Author-Name: Xiwei Tang
Author-X-Name-First: Xiwei
Author-X-Name-Last: Tang
Author-Name: Fei Xue
Author-X-Name-First: Fei
Author-X-Name-Last: Xue
Author-Name: Annie Qu
Author-X-Name-First: Annie
Author-X-Name-Last: Qu
Title: Individualized Multidirectional Variable Selection
Abstract:
In this article, we propose a heterogeneous modeling framework which achieves individual-wise feature selection and heterogeneous covariates’ effects subgrouping simultaneously. In contrast to conventional model selection approaches, the new approach constructs a separation penalty with multidirectional shrinkages, which facilitates individualized modeling to distinguish strong signals from noisy ones and selects different relevant variables for different individuals. Meanwhile, the proposed model identifies subgroups among which individuals share similar covariates’ effects, and thus improves individualized estimation efficiency and feature selection accuracy. Moreover, the proposed model also incorporates within-individual correlation for longitudinal data to gain extra efficiency. We provide a general theoretical foundation under a double-divergence modeling framework where the number of individuals and the number of individual-wise measurements can both diverge, which enables inference on both an individual level and a population level. In particular, we establish a strong oracle property for the individualized estimator to ensure its optimal large sample property under various conditions. An efficient ADMM algorithm is developed for computational scalability. Simulation studies and applications to post-trauma mental disorder analysis with genetic variation and an HIV longitudinal treatment study are illustrated to compare the new approach to existing methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1280-1296
Issue: 535
Volume: 116
Year: 2021
Month: 7
X-DOI: 10.1080/01621459.2019.1705308
File-URL: http://hdl.handle.net/10.1080/01621459.2019.1705308
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:535:p:1280-1296
Template-Type: ReDIF-Article 1.0
Author-Name: Fangzheng Xie
Author-X-Name-First: Fangzheng
Author-X-Name-Last: Xie
Author-Name: Yanxun Xu
Author-X-Name-First: Yanxun
Author-X-Name-Last: Xu
Title: Bayesian Projected Calibration of Computer Models
Abstract:
We develop a Bayesian approach called the Bayesian projected calibration to address the problem of calibrating an imperfect computer model using observational data from an unknown complex physical system. The calibration parameter and the physical system are parameterized in an identifiable fashion via the L2-projection. A Gaussian process prior distribution is imposed on the physical system, which naturally induces a prior distribution on the calibration parameter through the L2-projection constraint. The calibration parameter is estimated through its posterior distribution, serving as a natural and nonasymptotic approach for the uncertainty quantification. We provide rigorous large sample justifications of the proposed approach by establishing the asymptotic normality of the posterior of the calibration parameter with the efficient covariance matrix. In addition to the theoretical analysis, two convenient computational algorithms based on stochastic approximation are designed with strong theoretical support. Through extensive simulation studies and the analyses of two real-world datasets, we show that the proposed Bayesian projected calibration can accurately estimate the calibration parameters, calibrate the computer models well, and compare favorably to alternative approaches. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1965-1982
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2020.1753519
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1753519
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1965-1982
Template-Type: ReDIF-Article 1.0
Author-Name: DongHyuk Lee
Author-X-Name-First: DongHyuk
Author-X-Name-Last: Lee
Author-Name: Bin Zhu
Author-X-Name-First: Bin
Author-X-Name-Last: Zhu
Title: A Semiparametric Kernel Independence Test With Application to Mutational Signatures
Abstract:
Cancers arise owing to somatic mutations, and the characteristic combinations of somatic mutations form mutational signatures. Despite many mutational signatures being identified, mutational processes underlying a number of mutational signatures remain unknown, which hinders the identification of interventions that may reduce somatic mutation burdens and prevent the development of cancer. We demonstrate that the unknown cause of a mutational signature can be inferred by the associated signatures with known etiology. However, existing association tests are not statistically powerful due to excess zeros in mutational signatures data. To address this limitation, we propose a semiparametric kernel independence test (SKIT). The SKIT statistic is defined as the integrated squared distance between mixed probability distributions and is decomposed into four disjoint components to pinpoint the source of dependency. We derive the asymptotic null distribution and prove the asymptotic convergence of power. Due to slow convergence to the asymptotic null distribution, a bootstrap method is employed to compute p-values. Simulation studies demonstrate that when zeros are prevalent, SKIT is more resilient to power loss than existing tests and robust to random errors. We applied SKIT to The Cancer Genome Atlas mutational signatures data for over 9000 tumors across 32 cancer types, and identified a novel association between signature 17 curated in the Catalogue of Somatic Mutations in Cancer and apolipoprotein B mRNA editing enzyme (APOBEC) signatures in gastrointestinal cancers. It indicates that APOBEC activity is likely associated with the unknown cause of signature 17. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1648-1661
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2020.1871357
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1871357
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1648-1661
Template-Type: ReDIF-Article 1.0
Author-Name: Qihui Su
Author-X-Name-First: Qihui
Author-X-Name-Last: Su
Author-Name: Zhongling Qin
Author-X-Name-First: Zhongling
Author-X-Name-Last: Qin
Author-Name: Liang Peng
Author-X-Name-First: Liang
Author-X-Name-Last: Peng
Author-Name: Gengsheng Qin
Author-X-Name-First: Gengsheng
Author-X-Name-Last: Qin
Title: Efficiently Backtesting Conditional Value-at-Risk and Conditional Expected Shortfall
Abstract:
Given the importance of backtesting risk models and forecasts for financial institutions and regulators, we develop an efficient empirical likelihood backtest for either conditional value-at-risk or conditional expected shortfall when the given risk variable is modeled by an ARMA-GARCH process. Using a two-step procedure, the proposed backtests require fewer finite moments than existing backtests, allowing for robustness to heavier tails. Furthermore, we add a constraint on the goodness of fit of the error distribution to provide more accurate risk forecasts and improved test power. A simulation study confirms the good finite sample performance of the new backtests, and empirical analyses demonstrate the usefulness of these efficient backtests for monitoring financial crises.
Journal: Journal of the American Statistical Association
Pages: 2041-2052
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2020.1763804
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1763804
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:2041-2052
Template-Type: ReDIF-Article 1.0
Author-Name: Xueyu Mao
Author-X-Name-First: Xueyu
Author-X-Name-Last: Mao
Author-Name: Purnamrita Sarkar
Author-X-Name-First: Purnamrita
Author-X-Name-Last: Sarkar
Author-Name: Deepayan Chakrabarti
Author-X-Name-First: Deepayan
Author-X-Name-Last: Chakrabarti
Title: Estimating Mixed Memberships With Sharp Eigenvector Deviations
Abstract:
We consider the problem of estimating community memberships of nodes in a network, where every node is associated with a vector determining its degree of membership in each community. Existing provably consistent algorithms often require strong assumptions about the population, are computationally expensive, and only provide an overall error bound for the whole community membership matrix. This article provides uniform rates of convergence for the inferred community membership vector of each node in a network generated from the mixed membership stochastic blockmodel (MMSB); to our knowledge, this is the first work to establish per-node rates for overlapping community detection in networks. We achieve this by establishing sharp row-wise eigenvector deviation bounds for MMSB. Based on the simplex structure inherent in the eigen-decomposition of the population matrix, we build on established corner-finding algorithms from the optimization community to infer the community membership vectors. Our results hold over a broad parameter regime where the average degree only grows poly-logarithmically with the number of nodes. Using experiments with simulated and real datasets, we show that our method achieves better error with lower variability over competing methods, and processes real world networks of up to 100,000 nodes within tens of seconds. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1928-1940
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2020.1751645
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1751645
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1928-1940
Template-Type: ReDIF-Article 1.0
Author-Name: Fan Zhou
Author-X-Name-First: Fan
Author-X-Name-Last: Zhou
Author-Name: Shikai Luo
Author-X-Name-First: Shikai
Author-X-Name-Last: Luo
Author-Name: Xiaohu Qie
Author-X-Name-First: Xiaohu
Author-X-Name-Last: Qie
Author-Name: Jieping Ye
Author-X-Name-First: Jieping
Author-X-Name-Last: Ye
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Title: Graph-Based Equilibrium Metrics for Dynamic Supply–Demand Systems With Applications to Ride-sourcing Platforms
Abstract:
How to dynamically measure the local-to-global spatio-temporal coherence between demand and supply networks is a fundamental task for ride-sourcing platforms, such as DiDi. Such coherence measurement is critically important for quantifying market efficiency and comparing different platform policies, such as dispatching. The aim of this paper is to introduce a graph-based equilibrium metric (GEM) to quantify the distance between demand and supply networks based on a weighted graph structure. We formulate GEM as the optimal objective value of an unbalanced optimal transport problem, which can be recast as an equivalent linear program and solved efficiently. We examine how GEM can help solve three operational tasks of ride-sourcing platforms. First, GEM achieves up to a 70.6% reduction in root-mean-square error over the second-best distance measurement in predicting the order answer rate. Second, using GEM to design the order dispatching policy increases drivers’ revenue by more than 1%, a substantial improvement at the platform’s scale. Third, GEM can serve as an endpoint for comparing different platform policies in A/B tests. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1688-1699
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1898409
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1898409
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1688-1699
Template-Type: ReDIF-Article 1.0
Author-Name: Natalie Dean
Author-X-Name-First: Natalie
Author-X-Name-Last: Dean
Author-Name: Yang Yang
Author-X-Name-First: Yang
Author-X-Name-Last: Yang
Title: Discussion of “Regression Models for Understanding COVID-19 Epidemic Dynamics With Incomplete Data”
Journal: Journal of the American Statistical Association
Pages: 1587-1590
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1982722
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1982722
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1587-1590
Template-Type: ReDIF-Article 1.0
Author-Name: Sangwook Kang
Author-X-Name-First: Sangwook
Author-X-Name-Last: Kang
Title: Advanced Survival Models
Journal: Journal of the American Statistical Association
Pages: 2098-2099
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1997014
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1997014
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:2098-2099
Template-Type: ReDIF-Article 1.0
Author-Name: Yutong Li
Author-X-Name-First: Yutong
Author-X-Name-Last: Li
Author-Name: Ruoqing Zhu
Author-X-Name-First: Ruoqing
Author-X-Name-Last: Zhu
Author-Name: Annie Qu
Author-X-Name-First: Annie
Author-X-Name-Last: Qu
Author-Name: Han Ye
Author-X-Name-First: Han
Author-X-Name-Last: Ye
Author-Name: Zhankun Sun
Author-X-Name-First: Zhankun
Author-X-Name-Last: Sun
Title: Topic Modeling on Triage Notes With Semiorthogonal Nonnegative Matrix Factorization
Abstract:
Emergency department (ED) crowding is a universal health issue that affects the efficiency of hospital management and patient care quality. ED crowding frequently occurs when a request for a ward-bed for a patient is delayed until a doctor makes an admission decision. In this case study, we build a classifier to predict the disposition of patients using manually typed nurse notes collected during triage, as provided by the Alberta Medical Center. These predictions can potentially be incorporated into early bed coordination and fast track streaming strategies to alleviate overcrowding and waiting times in the ED. However, these triage notes involve high-dimensional, noisy, and sparse text data, which make model-fitting and interpretation difficult. To address this issue, we propose a novel semiorthogonal nonnegative matrix factorization for both continuous and binary predictors to reduce the dimensionality and derive word topics. The triage notes can then be interpreted as a non-subtractive linear combination of orthogonal basis topic vectors. Our real data analysis shows that the triage notes contain strong predictive information toward classifying the disposition of patients for certain medical complaints, such as altered consciousness or stroke. Additionally, we show that the document-topic vectors generated by our method can be used as features to further improve classification accuracy by up to 1% across different medical complaints, for example, 74.3%–75.3% accuracy for patients with stroke symptoms. This improvement could be clinically impactful for certain patients, especially when the scale of hospital patients is large. Furthermore, the generated word-topic vectors provide a bi-clustering interpretation under each topic due to the orthogonal formulation, which can be beneficial for hospitals in better understanding the symptoms and reasons behind patients’ visits. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1609-1624
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2020.1862667
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1862667
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1609-1624
Template-Type: ReDIF-Article 1.0
Author-Name: Maxwell Kellogg
Author-X-Name-First: Maxwell
Author-X-Name-Last: Kellogg
Author-Name: Magne Mogstad
Author-X-Name-First: Magne
Author-X-Name-Last: Mogstad
Author-Name: Guillaume A. Pouliot
Author-X-Name-First: Guillaume A.
Author-X-Name-Last: Pouliot
Author-Name: Alexander Torgovitsky
Author-X-Name-First: Alexander
Author-X-Name-Last: Torgovitsky
Title: Combining Matching and Synthetic Control to Tradeoff Biases From Extrapolation and Interpolation
Abstract:
The synthetic control (SC) method is widely used in comparative case studies to adjust for differences in pretreatment characteristics. SC limits extrapolation bias at the potential expense of interpolation bias, whereas traditional matching estimators have the opposite properties. This complementarity motivates us to propose a matching and synthetic control (or MASC) estimator as a model averaging estimator that combines the standard SC and matching estimators. We show how to use a rolling-origin cross-validation procedure to train the MASC to resolve tradeoffs between interpolation and extrapolation bias. We use a series of empirically based placebo and Monte Carlo simulations to shed light on when the SC, matching, MASC, and penalized SC estimators do (and do not) perform well. Then, we apply these estimators to examine the economic costs of conflicts in the context of Spain.
Journal: Journal of the American Statistical Association
Pages: 1804-1816
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1979562
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1979562
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1804-1816
Template-Type: ReDIF-Article 1.0
Author-Name: Junhyung Park
Author-X-Name-First: Junhyung
Author-X-Name-Last: Park
Author-Name: Frederic Paik Schoenberg
Author-X-Name-First: Frederic Paik
Author-X-Name-Last: Schoenberg
Author-Name: Andrea L. Bertozzi
Author-X-Name-First: Andrea L.
Author-X-Name-Last: Bertozzi
Author-Name: P. Jeffrey Brantingham
Author-X-Name-First: P. Jeffrey
Author-X-Name-Last: Brantingham
Title: Investigating Clustering and Violence Interruption in Gang-Related Violent Crime Data Using Spatial–Temporal Point Processes With Covariates
Abstract:
Reported gang-related violent crimes in Los Angeles, California, from 1/1/14 to 12/31/17 are modeled using spatial–temporal marked Hawkes point processes with covariates. We propose an algorithm to estimate the spatial-temporally varying background rate nonparametrically as a function of demographic covariates. Kernel smoothing and generalized additive models are used in an attempt to model the background rate as closely as possible in an effort to differentiate inhomogeneity in the background rate from causal clustering or triggering of events. The models are fit to data from 2014 to 2016 and evaluated using data from 2017, based on log-likelihood and superthinned residuals. The impact of nonrandomized violence interruption performed by the City of Los Angeles Mayor’s Office of Gang Reduction and Youth Development (GRYD) Incident Response (IR) Program is assessed by comparing the triggering associated with GRYD IR Program events to the triggering associated with sub-sampled non-GRYD events selected to have a similar spatial–temporal distribution. The results suggest that GRYD IR Program violence interruption yields a reduction of approximately 18.3% in the retaliation rate in locations more than 130 m from the original reported crimes, and a reduction of 14.2% in retaliations within 130 m.
Journal: Journal of the American Statistical Association
Pages: 1674-1687
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1898408
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1898408
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1674-1687
Template-Type: ReDIF-Article 1.0
Author-Name: Jushan Bai
Author-X-Name-First: Jushan
Author-X-Name-Last: Bai
Author-Name: Serena Ng
Author-X-Name-First: Serena
Author-X-Name-Last: Ng
Title: Matrix Completion, Counterfactuals, and Factor Analysis of Missing Data
Abstract:
This article proposes an imputation procedure that uses the factors estimated from a tall block along with the re-rotated loadings estimated from a wide block to impute missing values in a panel of data. Assuming that a strong factor structure holds for the full panel of data and its sub-blocks, it is shown that the common component can be consistently estimated at four different rates of convergence without requiring regularization or iteration. An asymptotic analysis of the estimation error is obtained. An application of our analysis is estimation of counterfactuals when potential outcomes have a factor structure. We study the estimation of average and individual treatment effects on the treated and establish a normal distribution theory that can be useful for hypothesis testing.
Journal: Journal of the American Statistical Association
Pages: 1746-1763
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1967163
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1967163
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1746-1763
Template-Type: ReDIF-Article 1.0
Author-Name: Matias D. Cattaneo
Author-X-Name-First: Matias D.
Author-X-Name-Last: Cattaneo
Author-Name: Luke Keele
Author-X-Name-First: Luke
Author-X-Name-Last: Keele
Author-Name: Rocío Titiunik
Author-X-Name-First: Rocío
Author-X-Name-Last: Titiunik
Author-Name: Gonzalo Vazquez-Bare
Author-X-Name-First: Gonzalo
Author-X-Name-Last: Vazquez-Bare
Title: Extrapolating Treatment Effects in Multi-Cutoff Regression Discontinuity Designs
Abstract:
In nonexperimental settings, the regression discontinuity (RD) design is one of the most credible identification strategies for program evaluation and causal inference. However, RD treatment effect estimands are necessarily local, making statistical methods for the extrapolation of these effects a key area for development. We introduce a new method for extrapolation of RD effects that relies on the presence of multiple cutoffs, and is therefore design-based. Our approach employs an easy-to-interpret identifying assumption that mimics the idea of “common trends” in difference-in-differences designs. We illustrate our methods with data on a subsidized loan program for postsecondary education attendance in Colombia, and offer new evidence on program effects for students with test scores away from the cutoff that determined program eligibility. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1941-1952
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2020.1751646
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1751646
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1941-1952
Template-Type: ReDIF-Article 1.0
Author-Name: Azeem M. Shaikh
Author-X-Name-First: Azeem M.
Author-X-Name-Last: Shaikh
Author-Name: Panos Toulis
Author-X-Name-First: Panos
Author-X-Name-Last: Toulis
Title: Randomization Tests in Observational Studies With Staggered Adoption of Treatment
Abstract:
This article considers the problem of inference in observational studies with time-varying adoption of treatment. In addition to an unconfoundedness assumption that the potential outcomes are independent of the times at which units adopt treatment conditional on the units’ observed characteristics, our analysis assumes that the time at which each unit adopts treatment follows a Cox proportional hazards model. This assumption permits the time at which each unit adopts treatment to depend on the observed characteristics of the unit, but imposes the restriction that the probability of multiple units adopting treatment at the same time is zero. In this context, we study randomization tests of a null hypothesis that specifies that there is no treatment effect for all units and all time periods in a distributional sense. We first show that an infeasible test that treats the parameters of the Cox model as known has rejection probability under the null hypothesis no greater than the nominal level in finite samples. Since these parameters are unknown in practice, this result motivates a feasible test that replaces these parameters with consistent estimators. While the resulting test does not need to have the same finite-sample validity as the infeasible test, we show that it has limiting rejection probability under the null hypothesis no greater than the nominal level. In a simulation study, we examine the practical relevance of our theoretical results, including robustness to misspecification of the model for the time at which each unit adopts treatment. Finally, we provide an empirical application of our methodology using the synthetic control-based test statistic and tobacco legislation data found in Abadie, Diamond and Hainmueller. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1835-1848
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1974458
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1974458
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1835-1848
Template-Type: ReDIF-Article 1.0
Author-Name: Susan Athey
Author-X-Name-First: Susan
Author-X-Name-Last: Athey
Author-Name: Mohsen Bayati
Author-X-Name-First: Mohsen
Author-X-Name-Last: Bayati
Author-Name: Nikolay Doudchenko
Author-X-Name-First: Nikolay
Author-X-Name-Last: Doudchenko
Author-Name: Guido Imbens
Author-X-Name-First: Guido
Author-X-Name-Last: Imbens
Author-Name: Khashayar Khosravi
Author-X-Name-First: Khashayar
Author-X-Name-Last: Khosravi
Title: Matrix Completion Methods for Causal Panel Data Models
Abstract:
In this article, we study methods for estimating causal effects in settings with panel data, where some units are exposed to a treatment during some periods and the goal is estimating counterfactual (untreated) outcomes for the treated unit/period combinations. We propose a class of matrix completion estimators that uses the observed elements of the matrix of control outcomes corresponding to untreated unit/periods to impute the “missing” elements of the control outcome matrix, corresponding to treated units/periods. This leads to a matrix that well-approximates the original (incomplete) matrix, but has lower complexity according to the nuclear norm for matrices. We generalize results from the matrix completion literature by allowing the patterns of missing data to have a time series dependency structure that is common in social science applications. We present novel insights concerning the connections between the matrix completion literature, the literature on interactive fixed effects models and the literatures on program evaluation under unconfoundedness and synthetic control methods. We show that all these estimators can be viewed as focusing on the same objective function. They differ solely in the way they deal with identification, in some cases solely through regularization (our proposed nuclear norm matrix completion estimator) and in other cases primarily through imposing hard restrictions (the unconfoundedness and synthetic control approaches). The proposed method outperforms unconfoundedness-based or synthetic control estimators in simulations based on real data.
Journal: Journal of the American Statistical Association
Pages: 1716-1730
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1891924
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1891924
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1716-1730
Template-Type: ReDIF-Article 1.0
Author-Name: Xu Shi
Author-X-Name-First: Xu
Author-X-Name-Last: Shi
Author-Name: Xiaoou Li
Author-X-Name-First: Xiaoou
Author-X-Name-Last: Li
Author-Name: Tianxi Cai
Author-X-Name-First: Tianxi
Author-X-Name-Last: Cai
Title: Spherical Regression Under Mismatch Corruption With Application to Automated Knowledge Translation
Abstract:
Motivated by a series of applications in data integration, language translation, bioinformatics, and computer vision, we consider spherical regression with two sets of unit-length vectors when the data are corrupted by a small fraction of mismatches in the response-predictor pairs. We propose a three-step algorithm in which we initialize the parameters by solving an orthogonal Procrustes problem to estimate a translation matrix W, ignoring the mismatch. We then estimate a mapping matrix aiming to correct the mismatch, using hard-thresholding to induce sparsity while incorporating potential group information. We eventually obtain a refined estimate for W by removing the estimated mismatched pairs. We derive the error bound for the initial estimate of W in both fixed and high-dimensional settings. We demonstrate that the refined estimate of W achieves an error rate that is as good as if no mismatch were present. We show that our mapping recovery method not only correctly distinguishes one-to-one and one-to-many correspondences, but also consistently identifies the matched pairs and estimates the weight vector for combined correspondence. We examine the finite sample performance of the proposed method via extensive simulation studies, and with application to the unsupervised translation of medical codes using electronic health records data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1953-1964
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2020.1752219
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1752219
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1953-1964
Template-Type: ReDIF-Article 1.0
Author-Name: Corbin Quick
Author-X-Name-First: Corbin
Author-X-Name-Last: Quick
Author-Name: Rounak Dey
Author-X-Name-First: Rounak
Author-X-Name-Last: Dey
Author-Name: Xihong Lin
Author-X-Name-First: Xihong
Author-X-Name-Last: Lin
Title: Regression Models for Understanding COVID-19 Epidemic Dynamics With Incomplete Data
Abstract:
Modeling infectious disease dynamics has been critical throughout the COVID-19 pandemic. Of particular interest are the incidence, prevalence, and effective reproductive number (Rt). Estimating these quantities is challenging due to under-ascertainment, unreliable reporting, and time lags between infection, onset, and testing. We propose a Multilevel Epidemic Regression Model to Account for Incomplete Data (MERMAID) to jointly estimate Rt, ascertainment rates, incidence, and prevalence over time in one or multiple regions. Specifically, MERMAID allows for a flexible regression model of Rt that can incorporate geographic and time-varying covariates. To account for under-ascertainment, we (a) model the ascertainment probability over time as a function of testing metrics and (b) jointly model data on confirmed infections and population-based serological surveys. To account for delays between infection, onset, and reporting, we model stochastic lag times as missing data, and develop an EM algorithm to estimate the model parameters. We evaluate the performance of MERMAID in simulation studies, and assess its robustness by conducting sensitivity analyses in a range of scenarios of model misspecifications. We apply the proposed method to analyze COVID-19 daily confirmed infection counts, PCR testing data, and serological survey data across the United States. Based on our model, we estimate an overall COVID-19 prevalence of 12.5% (ranging from 2.4% in Maine to 20.2% in New York) and an overall ascertainment rate of 45.5% (ranging from 22.5% in New York to 81.3% in Rhode Island) in the United States from March to December 2020. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1561-1577
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.2001339
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2001339
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1561-1577
Template-Type: ReDIF-Article 1.0
Author-Name: Alan Julian Izenman
Author-X-Name-First: Alan Julian
Author-X-Name-Last: Izenman
Title: Sampling Algorithms for Discrete Markov Random Fields and Related Graphical Models
Abstract:
Discrete Markov random fields (MRFs) are undirected graphical models in which the nodes of a graph are discrete random variables with values usually represented by colors. Typically, graphs are taken to be square lattices, although more general graphs are also of interest. Such discrete MRFs have been studied in many disciplines. We describe the two most popular types of discrete MRFs, namely the two-state Ising model and the q-state Potts model, and variations such as the cellular automaton, the cellular Potts model, and the random cluster model, the latter of which is a continuous generalization of both the Ising and Potts models. Research interest is usually focused on providing algorithms for simulating from these models because the partition function is so computationally intractable that statistical inference for the parameters of the appropriate probability distribution becomes very complicated. Substantial improvements to the Metropolis algorithm have appeared in the form of cluster algorithms, such as the Swendsen–Wang and Wolff algorithms. We study the simulation processes of these algorithms, which update the color of a cluster of nodes at each iteration.
Journal: Journal of the American Statistical Association
Pages: 2065-2086
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1898410
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1898410
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:2065-2086
Template-Type: ReDIF-Article 1.0
Author-Name: Matias D. Cattaneo
Author-X-Name-First: Matias D.
Author-X-Name-Last: Cattaneo
Author-Name: Yingjie Feng
Author-X-Name-First: Yingjie
Author-X-Name-Last: Feng
Author-Name: Rocio Titiunik
Author-X-Name-First: Rocio
Author-X-Name-Last: Titiunik
Title: Prediction Intervals for Synthetic Control Methods
Abstract:
Uncertainty quantification is a fundamental problem in the analysis and interpretation of synthetic control (SC) methods. We develop conditional prediction intervals in the SC framework, and provide conditions under which these intervals offer finite-sample probability guarantees. Our method allows for covariate adjustment and nonstationary data. The construction begins by noting that the statistical uncertainty of the SC prediction is governed by two distinct sources of randomness: one coming from the construction of the (likely misspecified) SC weights in the pretreatment period, and the other coming from the unobservable stochastic error in the post-treatment period when the treatment effect is analyzed. Accordingly, our proposed prediction intervals are constructed taking into account both sources of randomness. For implementation, we propose a simulation-based approach along with finite-sample-based probability bound arguments, naturally leading to principled sensitivity analysis methods. We illustrate the numerical performance of our methods using empirical applications and a small simulation study. Python, R and Stata software packages implementing our methodology are available. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1865-1880
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1979561
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1979561
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1865-1880
Template-Type: ReDIF-Article 1.0
Author-Name: Andrew Gelman
Author-X-Name-First: Andrew
Author-X-Name-Last: Gelman
Author-Name: Aki Vehtari
Author-X-Name-First: Aki
Author-X-Name-Last: Vehtari
Title: What are the Most Important Statistical Ideas of the Past 50 Years?
Abstract:
We review the most important statistical ideas of the past half century, which we categorize as: counterfactual causal inference, bootstrapping and simulation-based inference, overparameterized models and regularization, Bayesian multilevel models, generic computation algorithms, adaptive decision analysis, robust inference, and exploratory data analysis. We discuss key contributions in these subfields, how they relate to modern computing and big data, and how they might be developed and extended in future decades. The goal of this article is to provoke thought and discussion regarding the larger themes of research in statistics and data science.
Journal: Journal of the American Statistical Association
Pages: 2087-2097
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1938081
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1938081
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:2087-2097
Template-Type: ReDIF-Article 1.0
Author-Name: Munir Hiabu
Author-X-Name-First: Munir
Author-X-Name-Last: Hiabu
Author-Name: Enno Mammen
Author-X-Name-First: Enno
Author-X-Name-Last: Mammen
Author-Name: M. Dolores Martínez-Miranda
Author-X-Name-First: M. Dolores
Author-X-Name-Last: Martínez-Miranda
Author-Name: Jens P. Nielsen
Author-X-Name-First: Jens P.
Author-X-Name-Last: Nielsen
Title: Smooth Backfitting of Proportional Hazards With Multiplicative Components
Abstract:
Smooth backfitting has proven to have a number of theoretical and practical advantages in structured regression. By projecting the data down onto the structured space of interest, smooth backfitting provides a direct link between data and estimator. This article introduces the ideas of smooth backfitting to survival analysis in a proportional hazard model, where we assume an underlying conditional hazard with multiplicative components. We develop asymptotic theory for the estimator. In a comprehensive simulation study, we show that our smooth backfitting estimator successfully circumvents the curse of dimensionality and outperforms existing estimators. This is especially the case in difficult situations, such as a high number of covariates and/or high correlation between the covariates, where other estimators tend to break down. We use the smooth backfitter in a practical application where we extend recent advances of in-sample forecasting methodology by allowing more information to be incorporated, while still obeying the structured requirements of in-sample forecasting. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1983-1993
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2020.1753520
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1753520
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1983-1993
Template-Type: ReDIF-Article 1.0
Author-Name: Guillaume Gerber
Author-X-Name-First: Guillaume
Author-X-Name-Last: Gerber
Author-Name: Yohann Le Faou
Author-X-Name-First: Yohann
Author-X-Name-Last: Le Faou
Author-Name: Olivier Lopez
Author-X-Name-First: Olivier
Author-X-Name-Last: Lopez
Author-Name: Michael Trupin
Author-X-Name-First: Michael
Author-X-Name-Last: Trupin
Title: The Impact of Churn on Client Value in Health Insurance, Evaluation Using a Random Forest Under Various Censoring Mechanisms
Abstract:
In the insurance broker market, commissions received by brokers are closely related to so-called “customer value”: the longer a policyholder keeps their contract, the more profit there is for the company and therefore the broker. Hence, predicting the time at which a potential policyholder will surrender their contract is essential to optimize a commercial process and define a prospect scoring. In this article, we propose a weighted random forest model to address this problem. Our model is designed to compensate for the impact of random censoring. We investigate different types of assumptions on the censoring, studying both the cases where it is independent or not from the covariates. We compare our approach with other standard methods which apply in our setting, using simulated and real data analysis. We show that our approach is very competitive in terms of quadratic error in addressing the given problem. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2053-2064
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2020.1764364
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1764364
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:2053-2064
Template-Type: ReDIF-Article 1.0
Author-Name: Jyotishka Datta
Author-X-Name-First: Jyotishka
Author-X-Name-Last: Datta
Author-Name: Bhramar Mukherjee
Author-X-Name-First: Bhramar
Author-X-Name-Last: Mukherjee
Title: Discussion on “Regression Models for Understanding COVID-19 Epidemic Dynamics With Incomplete Data”
Journal: Journal of the American Statistical Association
Pages: 1583-1586
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1982721
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1982721
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1583-1586
Template-Type: ReDIF-Article 1.0
Author-Name: Huazhang Li
Author-X-Name-First: Huazhang
Author-X-Name-Last: Li
Author-Name: Yaotian Wang
Author-X-Name-First: Yaotian
Author-X-Name-Last: Wang
Author-Name: Guofen Yan
Author-X-Name-First: Guofen
Author-X-Name-Last: Yan
Author-Name: Yinge Sun
Author-X-Name-First: Yinge
Author-X-Name-Last: Sun
Author-Name: Seiji Tanabe
Author-X-Name-First: Seiji
Author-X-Name-Last: Tanabe
Author-Name: Chang-Chia Liu
Author-X-Name-First: Chang-Chia
Author-X-Name-Last: Liu
Author-Name: Mark S. Quigg
Author-X-Name-First: Mark S.
Author-X-Name-Last: Quigg
Author-Name: Tingting Zhang
Author-X-Name-First: Tingting
Author-X-Name-Last: Zhang
Title: A Bayesian State-Space Approach to Mapping Directional Brain Networks
Abstract:
The human brain is a directional network system of brain regions involving directional connectivity. Seizures are a directional network phenomenon as abnormal neuronal activities start from a seizure onset zone (SOZ) and propagate to otherwise healthy regions. To localize the SOZ of an epileptic patient, clinicians use intracranial electroencephalography (iEEG) to record the patient’s intracranial brain activity in many small regions. iEEG data are high-dimensional multivariate time series. We build a state-space multivariate autoregression (SSMAR) for iEEG data to model the underlying directional brain network. To produce scientifically interpretable network results, we incorporate into the SSMAR the scientific knowledge that the underlying brain network tends to have a cluster structure. Specifically, we assign to the SSMAR parameters a stochastic-blockmodel-motivated prior, which reflects the cluster structure. We develop a Bayesian framework to estimate the SSMAR, infer directional connections, and identify clusters for the unobserved network edges. The new method is robust to violations of model assumptions and outperforms existing network methods. By applying the new method to an epileptic patient’s iEEG data, we reveal seizure initiation and propagation in the patient’s directional brain network and discover a unique directional connectivity property of the SOZ. Overall, the network results obtained in this study bring new insights into epileptic patients’ normal and abnormal epileptic brain mechanisms and have the potential to assist neurologists and clinicians in localizing the SOZ—a long-standing research focus in epilepsy diagnosis and treatment. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1637-1647
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2020.1865985
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1865985
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1637-1647
Template-Type: ReDIF-Article 1.0
Author-Name: Pierre Lafaye de Micheaux
Author-X-Name-First: Pierre Lafaye
Author-X-Name-Last: de Micheaux
Author-Name: Pavlo Mozharovskyi
Author-X-Name-First: Pavlo
Author-X-Name-Last: Mozharovskyi
Author-Name: Myriam Vimond
Author-X-Name-First: Myriam
Author-X-Name-Last: Vimond
Title: Depth for Curve Data and Applications
Abstract:
In 1975, John W. Tukey defined statistical data depth as a function that determines the centrality of an arbitrary point with respect to a data cloud or to a probability measure. During the last decades, this seminal idea of data depth evolved into a powerful tool proving to be useful in various fields of science. Recently, extending the notion of data depth to the functional setting attracted a lot of attention among theoretical and applied statisticians. We go further and suggest a notion of data depth suitable for data represented as curves, or trajectories, which is independent of the parameterization. We show that our curve depth satisfies theoretical requirements of general depth functions that are meaningful for trajectories. We apply our methodology to diffusion tensor brain images and also to pattern recognition of handwritten digits and letters. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1881-1897
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2020.1745815
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1745815
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1881-1897
Template-Type: ReDIF-Article 1.0
Author-Name: Bruno Ferman
Author-X-Name-First: Bruno
Author-X-Name-Last: Ferman
Title: On the Properties of the Synthetic Control Estimator with Many Periods and Many Controls
Abstract:
We consider the asymptotic properties of the synthetic control (SC) estimator when both the number of pretreatment periods and control units are large. If potential outcomes follow a linear factor model, we provide conditions under which the SC unit asymptotically recovers the factor structure of the treated unit, even when the pretreatment fit is imperfect. This happens when there are weights diluted among an increasing number of control units such that a weighted average of the factor structure of the control units asymptotically reconstructs the factor structure of the treated unit. In this case, the SC estimator is asymptotically unbiased even when treatment assignment is correlated with time-varying unobservables. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1764-1772
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1965613
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1965613
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1764-1772
Template-Type: ReDIF-Article 1.0
Author-Name: Anish Agarwal
Author-X-Name-First: Anish
Author-X-Name-Last: Agarwal
Author-Name: Devavrat Shah
Author-X-Name-First: Devavrat
Author-X-Name-Last: Shah
Author-Name: Dennis Shen
Author-X-Name-First: Dennis
Author-X-Name-Last: Shen
Author-Name: Dogyoon Song
Author-X-Name-First: Dogyoon
Author-X-Name-Last: Song
Title: On Robustness of Principal Component Regression
Abstract:
Principal component regression (PCR) is a simple, but powerful and ubiquitously utilized method. Its effectiveness is well established when the covariates exhibit low-rank structure. However, its ability to handle settings with noisy, missing, and mixed-valued, that is, discrete and continuous, covariates is not understood and remains an important open challenge. As the main contribution of this work, we establish the robustness of PCR, without any change, in this respect and provide meaningful finite-sample analysis. To do so, we establish that PCR is equivalent to performing linear regression after preprocessing the covariate matrix via hard singular value thresholding (HSVT). As a result, in the context of counterfactual analysis using observational data, we show PCR is equivalent to the recently proposed robust variant of the synthetic control method, known as robust synthetic control (RSC). As an immediate consequence, we obtain finite-sample analysis of the RSC estimator that was previously absent. As an important contribution to the synthetic controls literature, we establish that an (approximate) linear synthetic control exists in the setting of a generalized factor model, or latent variable model; traditionally in the literature, the existence of a synthetic control needs to be assumed to exist as an axiom. We further discuss a surprising implication of the robustness property of PCR with respect to noise, that is, PCR can learn a good predictive model even if the covariates are tactfully transformed to preserve differential privacy. Finally, this work advances the state-of-the-art analysis for HSVT by establishing stronger guarantees with respect to the ℓ2,∞-norm rather than the Frobenius norm as is commonly done in the matrix estimation literature, which may be of interest in its own right.
Journal: Journal of the American Statistical Association
Pages: 1731-1745
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1928513
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1928513
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1731-1745
Template-Type: ReDIF-Article 1.0
Author-Name: Jincheng Zhou
Author-X-Name-First: Jincheng
Author-X-Name-Last: Zhou
Author-Name: James S. Hodges
Author-X-Name-First: James S.
Author-X-Name-Last: Hodges
Author-Name: Haitao Chu
Author-X-Name-First: Haitao
Author-X-Name-Last: Chu
Title: A Bayesian Hierarchical CACE Model Accounting for Incomplete Noncompliance With Application to a Meta-analysis of Epidural Analgesia on Cesarean Section
Abstract:
Noncompliance with assigned treatments is a common challenge in analyzing and interpreting randomized clinical trials (RCTs). One way to handle noncompliance is to estimate the complier-average causal effect (CACE), the intervention’s efficacy in the subpopulation that complies with assigned treatment. In a two-step meta-analysis, one could first estimate CACE for each study, then combine them to estimate the population-averaged CACE. However, when some trials do not report noncompliance data, the two-step meta-analysis can be less efficient and potentially biased by excluding these trials. This article proposes a flexible Bayesian hierarchical CACE framework to simultaneously account for heterogeneous and incomplete noncompliance data in a meta-analysis of RCTs. The models are motivated by and used for a meta-analysis estimating the CACE of epidural analgesia on cesarean section, in which only 10 of 27 trials reported complete noncompliance data. The new analysis includes all 27 studies and the results present new insights on the causal effect after accounting for noncompliance. Compared to the estimated risk difference of 0.8% (95% CI: –0.3%, 1.9%) given by the two-step intention-to-treat meta-analysis, the estimated CACE is 4.1% (95% CrI: –0.3%, 10.5%). We also report simulation studies to evaluate the performance of the proposed method. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1700-1712
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1900859
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1900859
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1700-1712
Template-Type: ReDIF-Article 1.0
Author-Name: Xinyi Li
Author-X-Name-First: Xinyi
Author-X-Name-Last: Li
Author-Name: Li Wang
Author-X-Name-First: Li
Author-X-Name-Last: Wang
Author-Name: Huixia Judy Wang
Author-X-Name-First: Huixia Judy
Author-X-Name-Last: Wang
Title: Sparse Learning and Structure Identification for Ultrahigh-Dimensional Image-on-Scalar Regression
Abstract:
This article considers high-dimensional image-on-scalar regression, where the spatial heterogeneity of covariate effects on imaging responses is investigated via a flexible partially linear spatially varying coefficient model. To tackle the challenges of spatial smoothing over the imaging response’s complex domain consisting of regions of interest, we approximate the spatially varying coefficient functions via bivariate spline functions over triangulation. We first study estimation when the active constant coefficients and varying coefficient functions are known in advance. We then further develop a unified approach for simultaneous sparse learning and model structure identification in the presence of ultrahigh-dimensional covariates. Our method can identify zero, nonzero constant, and spatially varying components correctly and efficiently. The estimators of the constant coefficients and varying coefficient functions are consistent, and the constant coefficient estimators are asymptotically normal. The method is evaluated by Monte Carlo simulation studies and applied to a dataset provided by the Alzheimer’s Disease Neuroimaging Initiative. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1994-2008
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2020.1753523
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1753523
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1994-2008
Template-Type: ReDIF-Article 1.0
Author-Name: Eli Ben-Michael
Author-X-Name-First: Eli
Author-X-Name-Last: Ben-Michael
Author-Name: Avi Feller
Author-X-Name-First: Avi
Author-X-Name-Last: Feller
Author-Name: Jesse Rothstein
Author-X-Name-First: Jesse
Author-X-Name-Last: Rothstein
Title: The Augmented Synthetic Control Method
Abstract:
The synthetic control method (SCM) is a popular approach for estimating the impact of a treatment on a single unit in panel data settings. The “synthetic control” is a weighted average of control units that balances the treated unit’s pretreatment outcomes and other covariates as closely as possible. A critical feature of the original proposal is to use SCM only when the fit on pretreatment outcomes is excellent. We propose Augmented SCM as an extension of SCM to settings where such pretreatment fit is infeasible. Analogous to bias correction for inexact matching, augmented SCM uses an outcome model to estimate the bias due to imperfect pretreatment fit and then de-biases the original SCM estimate. Our main proposal, which uses ridge regression as the outcome model, directly controls pretreatment fit while minimizing extrapolation from the convex hull. This estimator can also be expressed as a solution to a modified synthetic controls problem that allows negative weights on some donor units. We bound the estimation error of this approach under different data-generating processes, including a linear factor model, and show how regularization helps to avoid over-fitting to noise. We demonstrate gains from Augmented SCM with extensive simulation studies and apply this framework to estimate the impact of the 2012 Kansas tax cuts on economic growth. We implement the proposed method in the new augsynth R package.
Journal: Journal of the American Statistical Association
Pages: 1789-1803
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1929245
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1929245
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1789-1803
Template-Type: ReDIF-Article 1.0
Author-Name: Fei Xue
Author-X-Name-First: Fei
Author-X-Name-Last: Xue
Author-Name: Annie Qu
Author-X-Name-First: Annie
Author-X-Name-Last: Qu
Title: Integrating Multisource Block-Wise Missing Data in Model Selection
Abstract:
For multisource data, blocks of variable information from certain sources are likely missing. Existing methods for handling missing data do not take structures of block-wise missing data into consideration. In this article, we propose a multiple block-wise imputation (MBI) approach, which incorporates imputations based on both complete and incomplete observations. Specifically, for a given missing pattern group, the imputations in MBI incorporate more samples from groups with fewer observed variables in addition to the group with complete observations. We propose to construct estimating equations based on all available information, and integrate informative estimating functions to achieve efficient estimators. We show that the proposed method has estimation and model selection consistency under both fixed-dimensional and high-dimensional settings. Moreover, the proposed estimator is asymptotically more efficient than the estimator based on a single imputation from complete observations only. In addition, the proposed method is not restricted to missing completely at random. Numerical studies and ADNI data application confirm that the proposed method outperforms existing variable selection methods under various missing mechanisms. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1914-1927
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2020.1751176
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1751176
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1914-1927
Template-Type: ReDIF-Article 1.0
Author-Name: Nicholas P. Jewell
Author-X-Name-First: Nicholas P.
Author-X-Name-Last: Jewell
Title: Statistical Models for COVID-19 Incidence, Cumulative Prevalence, and Rt
Journal: Journal of the American Statistical Association
Pages: 1578-1582
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1983436
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1983436
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1578-1582
Template-Type: ReDIF-Article 1.0
Author-Name: Jason Wu
Author-X-Name-First: Jason
Author-X-Name-Last: Wu
Author-Name: Peng Ding
Author-X-Name-First: Peng
Author-X-Name-Last: Ding
Title: Randomization Tests for Weak Null Hypotheses in Randomized Experiments
Abstract:
The Fisher randomization test (FRT) is appropriate for any test statistic, under a sharp null hypothesis that can recover all missing potential outcomes. However, it is often sought after to test a weak null hypothesis that the treatment does not affect the units on average. To use the FRT for a weak null hypothesis, we must address two issues. First, we need to impute the missing potential outcomes although the weak null hypothesis cannot determine all of them. Second, we need to choose a proper test statistic. For a general weak null hypothesis, we propose an approach to imputing missing potential outcomes under a compatible sharp null hypothesis. Building on this imputation scheme, we advocate a studentized statistic. The resulting FRT has multiple desirable features. First, it is model-free. Second, it is finite-sample exact under the sharp null hypothesis that we use to impute the potential outcomes. Third, it conservatively controls large-sample Type I error under the weak null hypothesis of interest. Therefore, our FRT is agnostic to the treatment effect heterogeneity. We establish a unified theory for general factorial experiments and extend it to stratified and clustered experiments. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1898-1913
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2020.1750415
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1750415
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1898-1913
Template-Type: ReDIF-Article 1.0
Author-Name: Alberto Abadie
Author-X-Name-First: Alberto
Author-X-Name-Last: Abadie
Author-Name: Matias D. Cattaneo
Author-X-Name-First: Matias D.
Author-X-Name-Last: Cattaneo
Title: Introduction to the Special Section on Synthetic Control Methods
Journal: Journal of the American Statistical Association
Pages: 1713-1715
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.2002600
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2002600
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1713-1715
Template-Type: ReDIF-Article 1.0
Author-Name: Sourav Chatterjee
Author-X-Name-First: Sourav
Author-X-Name-Last: Chatterjee
Title: A New Coefficient of Correlation
Abstract:
Is it possible to define a coefficient of correlation which is (a) as simple as the classical coefficients like Pearson’s correlation or Spearman’s correlation, and yet (b) consistently estimates some simple and interpretable measure of the degree of dependence between the variables, which is 0 if and only if the variables are independent and 1 if and only if one is a measurable function of the other, and (c) has a simple asymptotic theory under the hypothesis of independence, like the classical coefficients? This article answers this question in the affirmative, by producing such a coefficient. No assumptions are needed on the distributions of the variables. There are several coefficients in the literature that converge to 0 if and only if the variables are independent, but none that satisfy any of the other properties mentioned above. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2009-2022
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2020.1758115
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1758115
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:2009-2022
Template-Type: ReDIF-Article 1.0
Author-Name: Simón Lunagómez
Author-X-Name-First: Simón
Author-X-Name-Last: Lunagómez
Author-Name: Sofia C. Olhede
Author-X-Name-First: Sofia C.
Author-X-Name-Last: Olhede
Author-Name: Patrick J. Wolfe
Author-X-Name-First: Patrick J.
Author-X-Name-Last: Wolfe
Title: Modeling Network Populations via Graph Distances
Abstract:
This article introduces a new class of models for multiple networks. The core idea is to parameterize a distribution on labeled graphs in terms of a Fréchet mean graph (which depends on a user-specified choice of metric or graph distance) and a parameter that controls the concentration of this distribution about its mean. Entropy is the natural parameter for such control, varying from a point mass concentrated on the Fréchet mean itself to a uniform distribution over all graphs on a given vertex set. We provide a hierarchical Bayesian approach for exploiting this construction, along with straightforward strategies for sampling from the resultant posterior distribution. We conclude by demonstrating the efficacy of our approach via simulation studies and two multiple-network data analysis examples: one drawn from systems biology and the other from neuroscience. This article has online supplementary materials.
Journal: Journal of the American Statistical Association
Pages: 2023-2040
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2020.1763803
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1763803
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:2023-2040
Template-Type: ReDIF-Article 1.0
Author-Name: Colin B. Fogarty
Author-X-Name-First: Colin B.
Author-X-Name-Last: Fogarty
Author-Name: Kwonsang Lee
Author-X-Name-First: Kwonsang
Author-X-Name-Last: Lee
Author-Name: Rachel R. Kelz
Author-X-Name-First: Rachel R.
Author-X-Name-Last: Kelz
Author-Name: Luke J. Keele
Author-X-Name-First: Luke J.
Author-X-Name-Last: Keele
Title: Biased Encouragements and Heterogeneous Effects in an Instrumental Variable Study of Emergency General Surgical Outcomes
Abstract:
We investigate the efficacy of surgical versus nonsurgical management for two gastrointestinal conditions, colitis and diverticulitis, using observational data. We deploy an instrumental variable design with surgeons’ tendencies to operate as an instrument. Assuming instrument validity, we find that nonsurgical alternatives can reduce both hospital length of stay and the risk of complications, with estimated effects larger for septic patients than for nonseptic patients. The validity of our instrument is plausible but not ironclad, necessitating a sensitivity analysis. Existing sensitivity analyses for IV designs assume effect homogeneity, unlikely to hold here because of patient-specific physiology. We develop a new sensitivity analysis that accommodates arbitrary effect heterogeneity and exploits components explainable by observed features. We find that the results for nonseptic patients prove more robust to hidden bias despite having smaller estimated effects. For nonseptic patients, two individuals with identical observed characteristics would have to differ in their odds of assignment to a high tendency to operate surgeon by a factor of 2.34 to overturn our finding of a benefit for nonsurgical management in reducing length of stay. For septic patients, this value is only 1.64. Simulations illustrate that this phenomenon may be explained by differences in within-group heterogeneity. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1625-1636
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2020.1863220
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1863220
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1625-1636
Template-Type: ReDIF-Article 1.0
Author-Name: Victor Chernozhukov
Author-X-Name-First: Victor
Author-X-Name-Last: Chernozhukov
Author-Name: Kaspar Wüthrich
Author-X-Name-First: Kaspar
Author-X-Name-Last: Wüthrich
Author-Name: Yinchu Zhu
Author-X-Name-First: Yinchu
Author-X-Name-Last: Zhu
Title: An Exact and Robust Conformal Inference Method for Counterfactual and Synthetic Controls
Abstract:
We introduce new inference procedures for counterfactual and synthetic control methods for policy evaluation. We recast the causal inference problem as a counterfactual prediction and a structural breaks testing problem. This allows us to exploit insights from conformal prediction and structural breaks testing to develop permutation inference procedures that accommodate modern high-dimensional estimators, are valid under weak and easy-to-verify conditions, and are provably robust against misspecification. Our methods work in conjunction with many different approaches for predicting counterfactual mean outcomes in the absence of the policy intervention. Examples include synthetic controls, difference-in-differences, factor and matrix completion models, and (fused) time series panel data models. Our approach demonstrates an excellent small-sample performance in simulations and is taken to a data application where we re-evaluate the consequences of decriminalizing indoor prostitution. Open-source software for implementing our conformal inference methods is available.
Journal: Journal of the American Statistical Association
Pages: 1849-1864
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1920957
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1920957
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1849-1864
Template-Type: ReDIF-Article 1.0
Author-Name: Zhigang Li
Author-X-Name-First: Zhigang
Author-X-Name-Last: Li
Author-Name: Lu Tian
Author-X-Name-First: Lu
Author-X-Name-Last: Tian
Author-Name: A. James O’Malley
Author-X-Name-First: A. James
Author-X-Name-Last: O’Malley
Author-Name: Margaret R. Karagas
Author-X-Name-First: Margaret R.
Author-X-Name-Last: Karagas
Author-Name: Anne G. Hoen
Author-X-Name-First: Anne G.
Author-X-Name-Last: Hoen
Author-Name: Brock C. Christensen
Author-X-Name-First: Brock C.
Author-X-Name-Last: Christensen
Author-Name: Juliette C. Madan
Author-X-Name-First: Juliette C.
Author-X-Name-Last: Madan
Author-Name: Quran Wu
Author-X-Name-First: Quran
Author-X-Name-Last: Wu
Author-Name: Raad Z. Gharaibeh
Author-X-Name-First: Raad Z.
Author-X-Name-Last: Gharaibeh
Author-Name: Christian Jobin
Author-X-Name-First: Christian
Author-X-Name-Last: Jobin
Author-Name: Hongzhe Li
Author-X-Name-First: Hongzhe
Author-X-Name-Last: Li
Title: IFAA: Robust Association Identification and Inference for Absolute Abundance in Microbiome Analyses
Abstract:
The target of inference in microbiome analyses is usually relative abundance (RA) because RA in a sample (e.g., stool) can be considered as an approximation of RA in an entire ecosystem (e.g., gut). However, inference on RA suffers from the fact that RA are calculated by dividing absolute abundances (AAs) over the common denominator (CD), the summation of all AA (i.e., library size). Because of that, perturbation in one taxon will result in a change in the CD and thus cause false changes in RA of all other taxa, and those false changes could lead to false positive/negative findings. We propose a novel analysis approach (IFAA) to make robust inference on AA of an ecosystem that can circumvent the issues induced by the CD problem and compositional structure of RA. IFAA can also address the issues of overdispersion and handle zero-inflated data structures. IFAA identifies microbial taxa associated with the covariates in Phase 1 and estimates the association parameters by employing an independent reference taxon in Phase 2. Two real data applications are presented and extensive simulations show that IFAA outperforms other established existing approaches by a big margin in the presence of unbalanced library size. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1595-1608
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2020.1860770
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1860770
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1595-1608
Template-Type: ReDIF-Article 1.0
Author-Name: Alberto Abadie
Author-X-Name-First: Alberto
Author-X-Name-Last: Abadie
Author-Name: Jérémy L’Hour
Author-X-Name-First: Jérémy
Author-X-Name-Last: L’Hour
Title: A Penalized Synthetic Control Estimator for Disaggregated Data
Abstract:
Synthetic control methods are commonly applied in empirical research to estimate the effects of treatments or interventions on aggregate outcomes. A synthetic control estimator compares the outcome of a treated unit to the outcome of a weighted average of untreated units that best resembles the characteristics of the treated unit before the intervention. When disaggregated data are available, constructing separate synthetic controls for each treated unit may help avoid interpolation biases. However, the problem of finding a synthetic control that best reproduces the characteristics of a treated unit may not have a unique solution. Multiplicity of solutions is a particularly daunting challenge when the data include many treated and untreated units. To address this challenge, we propose a synthetic control estimator that penalizes the pairwise discrepancies between the characteristics of the treated units and the characteristics of the units that contribute to their synthetic controls. The penalization parameter trades off pairwise matching discrepancies with respect to the characteristics of each unit in the synthetic control against matching discrepancies with respect to the characteristics of the synthetic control unit as a whole. We study the properties of this estimator and propose data-driven choices of the penalization parameter.
Journal: Journal of the American Statistical Association
Pages: 1817-1834
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1971535
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1971535
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1817-1834
Template-Type: ReDIF-Article 1.0
Author-Name: Corbin Quick
Author-X-Name-First: Corbin
Author-X-Name-Last: Quick
Author-Name: Rounak Dey
Author-X-Name-First: Rounak
Author-X-Name-Last: Dey
Author-Name: Xihong Lin
Author-X-Name-First: Xihong
Author-X-Name-Last: Lin
Title: Rejoinder: Regression Models for Understanding COVID-19 Epidemic Dynamics With Incomplete Data
Journal: Journal of the American Statistical Association
Pages: 1591-1594
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.2001340
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2001340
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1591-1594
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Correction
Journal: Journal of the American Statistical Association
Pages: 2100-2100
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1969237
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1969237
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:2100-2100
Template-Type: ReDIF-Article 1.0
Author-Name: Ricardo Masini
Author-X-Name-First: Ricardo
Author-X-Name-Last: Masini
Author-Name: Marcelo C. Medeiros
Author-X-Name-First: Marcelo C.
Author-X-Name-Last: Medeiros
Title: Counterfactual Analysis With Artificial Controls: Inference, High Dimensions, and Nonstationarity
Abstract:
Recently, there has been growing interest in developing statistical tools to conduct counterfactual analysis with aggregate data when a single “treated” unit suffers an intervention, such as a policy change, and there is no obvious control group. Usually, the proposed methods are based on the construction of an artificial counterfactual from a pool of “untreated” peers, organized in a panel data structure. In this article, we consider a general framework for counterfactual analysis for high-dimensional, nonstationary data with either deterministic and/or stochastic trends, which nests well-established methods, such as the synthetic control. We propose a resampling procedure to test intervention effects that does not rely on postintervention asymptotics and that can be used even if there is only a single observation after the intervention. A simulation study is provided as well as an empirical application. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1773-1788
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1964978
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1964978
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1773-1788
Template-Type: ReDIF-Article 1.0
Author-Name: Dingdong Yi
Author-X-Name-First: Dingdong
Author-X-Name-Last: Yi
Author-Name: Shaoyang Ning
Author-X-Name-First: Shaoyang
Author-X-Name-Last: Ning
Author-Name: Chia-Jung Chang
Author-X-Name-First: Chia-Jung
Author-X-Name-Last: Chang
Author-Name: S. C. Kou
Author-X-Name-First: S. C.
Author-X-Name-Last: Kou
Title: Forecasting Unemployment Using Internet Search Data via PRISM
Abstract:
Big data generated from the Internet offer great potential for predictive analysis. Here we focus on using online users’ Internet search data to forecast unemployment initial claims weeks into the future, which provides timely insights into the direction of the economy. To this end, we present a novel method Penalized Regression with Inferred Seasonality Module (PRISM), which uses publicly available online search data from Google. PRISM is a semiparametric method, motivated by a general state-space formulation, and employs nonparametric seasonal decomposition and penalized regression. For forecasting unemployment initial claims, PRISM outperforms all previously available methods, including forecasting during the 2008–2009 financial crisis period and near-future forecasting during the COVID-19 pandemic period, both periods in which unemployment initial claims rose rapidly. The timely and accurate unemployment forecasts by PRISM could aid government agencies and financial institutions in assessing the economic trend and making well-informed decisions, especially in the face of economic turbulence.
Journal: Journal of the American Statistical Association
Pages: 1662-1673
Issue: 536
Volume: 116
Year: 2021
Month: 10
X-DOI: 10.1080/01621459.2021.1883436
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1883436
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:116:y:2021:i:536:p:1662-1673
Template-Type: ReDIF-Article 1.0
Author-Name: Xianyang Zhang
Author-X-Name-First: Xianyang
Author-X-Name-Last: Zhang
Author-Name: Jun Chen
Author-X-Name-First: Jun
Author-X-Name-Last: Chen
Title: Covariate Adaptive False Discovery Rate Control With Applications to Omics-Wide Multiple Testing
Abstract:
Conventional multiple testing procedures often assume that hypotheses for different features are exchangeable. However, in many scientific applications, additional covariate information regarding the patterns of signals and nulls is available. In this article, we introduce an FDR control procedure for large-scale inference problems that can incorporate covariate information. We develop a fast algorithm to implement the proposed procedure and prove its asymptotic validity even when the underlying likelihood ratio model is misspecified and the p-values are weakly dependent (e.g., strong mixing). Extensive simulations are conducted to study the finite sample performance of the proposed method and we demonstrate that the new approach improves over the state-of-the-art approaches by being flexible, robust, powerful, and computationally efficient. We finally apply the method to several omics datasets arising from genomics studies with the aim to identify omics features associated with some clinical and biological phenotypes. We show that the method is overall the most powerful among competing methods, especially when the signal is sparse. The proposed covariate adaptive multiple testing procedure is implemented in the R package CAMT. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 411-427
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1783273
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1783273
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:411-427
Template-Type: ReDIF-Article 1.0
Author-Name: Holger Dette
Author-X-Name-First: Holger
Author-X-Name-Last: Dette
Author-Name: Guangming Pan
Author-X-Name-First: Guangming
Author-X-Name-Last: Pan
Author-Name: Qing Yang
Author-X-Name-First: Qing
Author-X-Name-Last: Yang
Title: Estimating a Change Point in a Sequence of Very High-Dimensional Covariance Matrices
Abstract:
This article considers the problem of estimating a change point in the covariance matrix in a sequence of high-dimensional vectors, where the dimension is substantially larger than the sample size. A two-stage approach is proposed to efficiently estimate the location of the change point. The first step consists of a reduction of the dimension to identify elements of the covariance matrices corresponding to significant changes. In a second step, we use the components after dimension reduction to determine the position of the change point. Theoretical properties are developed for both steps, and numerical studies are conducted to support the new methodology. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 444-454
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1785477
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1785477
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:444-454
Template-Type: ReDIF-Article 1.0
Author-Name: Bei Jiang
Author-X-Name-First: Bei
Author-X-Name-Last: Jiang
Author-Name: Adrian E. Raftery
Author-X-Name-First: Adrian E.
Author-X-Name-Last: Raftery
Author-Name: Russell J. Steele
Author-X-Name-First: Russell J.
Author-X-Name-Last: Steele
Author-Name: Naisyin Wang
Author-X-Name-First: Naisyin
Author-X-Name-Last: Wang
Title: Balancing Inferential Integrity and Disclosure Risk Via Model Targeted Masking and Multiple Imputation
Abstract:
There is a growing expectation that data collected by government-funded studies should be openly available to ensure research reproducibility, which also increases concerns about data privacy. A strategy to protect individuals’ identity is to release multiply imputed (MI) synthetic datasets with masked sensitivity values. However, information loss or incorrectly specified imputation models can weaken or invalidate the inferences obtained from the MI-datasets. We propose a new masking framework with a data-augmentation (DA) component and a tuning mechanism that balances protection against identity disclosure with preservation of data utility. Applying it to a restricted-use Canadian Scleroderma Research Group (CSRG) dataset, we found that this DA-MI strategy achieved a 0% identity disclosure risk and preserved all inferential conclusions. It yielded 95% confidence intervals (CIs) that had overlaps of 98.5% (95.5%) on average with the CIs constructed using the full, unmasked CSRG dataset in a work-disability (interstitial lung disease) study. The CI-overlaps were lower for several other methods considered, ranging from 73.9% to 91.9% on average with the lowest value being 28.1%; such low CI-overlaps further led to some incorrect inferential conclusions. These findings indicate that the DA-MI masking framework facilitates sharing of useful research data while protecting participants’ identities. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 52-66
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2021.1909597
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1909597
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:52-66
Template-Type: ReDIF-Article 1.0
Author-Name: Rong Chen
Author-X-Name-First: Rong
Author-X-Name-Last: Chen
Author-Name: Dan Yang
Author-X-Name-First: Dan
Author-X-Name-Last: Yang
Author-Name: Cun-Hui Zhang
Author-X-Name-First: Cun-Hui
Author-X-Name-Last: Zhang
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 128-132
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2022.2035099
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2035099
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:128-132
Template-Type: ReDIF-Article 1.0
Author-Name: Adam Ciarleglio
Author-X-Name-First: Adam
Author-X-Name-Last: Ciarleglio
Author-Name: Eva Petkova
Author-X-Name-First: Eva
Author-X-Name-Last: Petkova
Author-Name: Ofer Harel
Author-X-Name-First: Ofer
Author-X-Name-Last: Harel
Title: Elucidating Age and Sex-Dependent Association Between Frontal EEG Asymmetry and Depression: An Application of Multiple Imputation in Functional Regression
Abstract:
Frontal power asymmetry (FA), a measure of brain function derived from electroencephalography, is a potential biomarker for major depressive disorder (MDD). Though FA is functional in nature, it is typically reduced to a scalar value prior to analysis, possibly obscuring its relationship with MDD and leading to a number of studies that have provided contradictory results. To overcome this issue, we sought to fit a functional regression model to characterize the association between FA and MDD status, adjusting for age, sex, cognitive ability, and handedness using data from a large clinical study that included both MDD and healthy control (HC) subjects. Since nearly 40% of the observations are missing data on either FA or cognitive ability, we propose an extension of multiple imputation (MI) by chained equations that allows for the imputation of both scalar and functional data. We also propose an extension of Rubin’s Rules for conducting valid inference in this setting. The proposed methods are evaluated in a simulation and applied to our FA data. For our FA data, a pooled analysis from the imputed datasets yielded similar results to those of the complete case analysis. We found that, among young females, HCs tended to have higher FA over the θ, α, and β frequency bands, but that the difference between HC and MDD subjects diminishes and ultimately reverses with age. For males, HCs tended to have higher FA in the β frequency band, regardless of age. Young male HCs had higher FA in the θ and α bands, but this difference diminishes with increasing age in the α band and ultimately reverses with increasing age in the θ band. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 12-26
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2021.1942011
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1942011
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:12-26
Template-Type: ReDIF-Article 1.0
Author-Name: Kori Khan
Author-X-Name-First: Kori
Author-X-Name-Last: Khan
Author-Name: Catherine A. Calder
Author-X-Name-First: Catherine A.
Author-X-Name-Last: Calder
Title: Restricted Spatial Regression Methods: Implications for Inference
Abstract:
The issue of spatial confounding between the spatial random effect and the fixed effects in regression analyses has been identified as a concern in the statistical literature. Multiple authors have offered perspectives and potential solutions. In this article, for the areal spatial data setting, we show that many of the methods designed to alleviate spatial confounding can be viewed as special cases of a general class of models. We refer to this class as restricted spatial regression (RSR) models, extending terminology currently in use. We offer a mathematically based exploration of the impact that RSR methods have on inference for regression coefficients for the linear model. We then explore whether these results hold in the generalized linear model setting for count data using simulations. We show that the use of these methods has counterintuitive consequences that defy the general expectations in the literature. In particular, our results and the accompanying simulations suggest that RSR methods will typically perform worse than nonspatial methods. These results have important implications for dimension reduction strategies in spatial regression modeling. Specifically, we demonstrate that the problems with RSR models cannot be fixed with a selection of “better” spatial basis vectors or dimension reduction techniques. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 482-494
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1788949
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1788949
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:482-494
Template-Type: ReDIF-Article 1.0
Author-Name: Tung-Yu Wu
Author-X-Name-First: Tung-Yu
Author-X-Name-Last: Wu
Author-Name: Y. X. Rachel Wang
Author-X-Name-First: Y. X.
Author-X-Name-Last: Rachel Wang
Author-Name: Wing H. Wong
Author-X-Name-First: Wing H.
Author-X-Name-Last: Wong
Title: Mini-Batch Metropolis–Hastings With Reversible SGLD Proposal
Abstract:
Traditional Markov chain Monte Carlo (MCMC) algorithms are computationally intensive and do not scale well to large data. In particular, the Metropolis–Hastings (MH) algorithm requires passing over the entire dataset to evaluate the likelihood ratio in each iteration. We propose a general framework for performing MH-MCMC using mini-batches of the whole dataset and show that this gives rise to approximately a tempered stationary distribution. We prove that the algorithm preserves the modes of the original target distribution and derive an error bound on the approximation with mild assumptions on the likelihood. To further extend the utility of the algorithm to high-dimensional settings, we construct a proposal with forward and reverse moves using stochastic gradient and show that the construction leads to reasonable acceptance probabilities. We demonstrate the performance of our algorithm in both low dimensional models and high dimensional neural network applications. Particularly in the latter case, compared to popular optimization methods, our method is more robust to the choice of learning rate and improves testing accuracy. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 386-394
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1782222
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1782222
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:386-394
Template-Type: ReDIF-Article 1.0
Author-Name: Gareth M. James
Author-X-Name-First: Gareth M.
Author-X-Name-Last: James
Author-Name: Peter Radchenko
Author-X-Name-First: Peter
Author-X-Name-Last: Radchenko
Author-Name: Bradley Rava
Author-X-Name-First: Bradley
Author-X-Name-Last: Rava
Title: Irrational Exuberance: Correcting Bias in Probability Estimates
Abstract:
We consider the common setting where one observes probability estimates for a large number of events, such as default risks for numerous bonds. Unfortunately, even with unbiased estimates, selecting events corresponding to the most extreme probabilities can result in systematically underestimating the true level of uncertainty. We develop an empirical Bayes approach “excess certainty adjusted probabilities” (ECAP), using a variant of Tweedie’s formula, which updates probability estimates to correct for selection bias. ECAP is a flexible nonparametric method, which directly estimates the score function associated with the probability estimates, so it does not need to make any restrictive assumptions about the prior on the true probabilities. ECAP also works well in settings where the probability estimates are biased. We demonstrate through theoretical results, simulations, and an analysis of two real-world datasets that ECAP can provide significant improvements over the original probability estimates. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 455-468
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1787175
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1787175
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:455-468
Template-Type: ReDIF-Article 1.0
Author-Name: Daniel Peña
Author-X-Name-First: Daniel
Author-X-Name-Last: Peña
Title: Comment on “Factor Models for High-Dimensional Tensor Time Series”
Journal: Journal of the American Statistical Association
Pages: 118-123
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2021.2024214
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2024214
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:118-123
Template-Type: ReDIF-Article 1.0
Author-Name: Oliver B. Linton
Author-X-Name-First: Oliver B.
Author-X-Name-Last: Linton
Author-Name: Haihan Tang
Author-X-Name-First: Haihan
Author-X-Name-Last: Tang
Title: Comment on “Factor Models for High-Dimensional Tensor Time Series” by Rong Chen, Dan Yang, and Cun-Hui Zhang
Journal: Journal of the American Statistical Association
Pages: 117-117
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2021.2018328
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2018328
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:117-117
Template-Type: ReDIF-Article 1.0
Author-Name: Jianyu Liu
Author-X-Name-First: Jianyu
Author-X-Name-Last: Liu
Author-Name: Haodong Wang
Author-X-Name-First: Haodong
Author-X-Name-Last: Wang
Author-Name: Wei Sun
Author-X-Name-First: Wei
Author-X-Name-Last: Sun
Author-Name: Yufeng Liu
Author-X-Name-First: Yufeng
Author-X-Name-Last: Liu
Title: Prioritizing Autism Risk Genes Using Personalized Graphical Models Estimated From Single-Cell RNA-seq Data
Abstract:
Hundreds of autism risk genes have been reported recently, mainly based on genetic studies where these risk genes have more de novo mutations in autism subjects than healthy controls. However, as a complex disease, autism is likely associated with more risk genes and many of them may not be identifiable through de novo mutations. We hypothesize that more autism risk genes can be identified through their connections with known autism risk genes in personalized gene–gene interaction graphs. We estimate such personalized graphs using single-cell RNA sequencing (scRNA-seq) while appropriately modeling the cell dependence and possible zero-inflation in the scRNA-seq data. The sample size, which is the number of cells per individual, ranges from 891 to 1241 in our case study using scRNA-seq data in autism subjects and controls. We consider 1500 genes in our analysis. Since the number of genes is larger than or comparable to the sample size, we perform penalized estimation. We score each gene’s relevance by applying a simple graph kernel smoothing method to each personalized graph. The molecular functions of the top-scored genes are related to autism. For example, a candidate gene RYR2, which encodes the protein ryanodine receptor 2, is involved in neurotransmission, a process that is impaired in ASD patients. While our method provides a systematic and unbiased approach to prioritizing autism risk genes, the relevance of these genes needs to be further validated in functional studies. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 38-51
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2021.1933495
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1933495
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:38-51
Template-Type: ReDIF-Article 1.0
Author-Name: Fan Bu
Author-X-Name-First: Fan
Author-X-Name-Last: Bu
Author-Name: Allison E. Aiello
Author-X-Name-First: Allison E.
Author-X-Name-Last: Aiello
Author-Name: Jason Xu
Author-X-Name-First: Jason
Author-X-Name-Last: Xu
Author-Name: Alexander Volfovsky
Author-X-Name-First: Alexander
Author-X-Name-Last: Volfovsky
Title: Likelihood-Based Inference for Partially Observed Epidemics on Dynamic Networks
Abstract:
We propose a generative model and an inference scheme for epidemic processes on dynamic, adaptive contact networks. Network evolution is formulated as a link-Markovian process, which is then coupled to an individual-level stochastic susceptible-infectious-recovered model, to describe the interplay between the dynamics of the disease spread and the contact network underlying the epidemic. A Markov chain Monte Carlo framework is developed for likelihood-based inference from partial epidemic observations, with a novel data augmentation algorithm specifically designed to deal with missing individual recovery times under the dynamic network setting. Through a series of simulation experiments, we demonstrate the validity and flexibility of the model as well as the efficacy and efficiency of the data augmentation inference scheme. The model is also applied to a recent real-world dataset on influenza-like-illness transmission with high-resolution social contact tracking records. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 510-526
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1790376
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1790376
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:510-526
Template-Type: ReDIF-Article 1.0
Author-Name: Ben Dai
Author-X-Name-First: Ben
Author-X-Name-Last: Dai
Author-Name: Xiaotong Shen
Author-X-Name-First: Xiaotong
Author-X-Name-Last: Shen
Author-Name: Junhui Wang
Author-X-Name-First: Junhui
Author-X-Name-Last: Wang
Title: Embedding Learning
Abstract:
Numerical embedding has become a standard technique for processing and analyzing unstructured data that cannot be expressed in a predefined fashion. It stores the main characteristics of data by mapping it onto a numerical vector. An embedding is often unsupervised and constructed by transfer learning from large-scale unannotated data. Given an embedding, a downstream learning method, referred to as a two-stage method, is applicable to unstructured data. In this article, we introduce a novel framework of embedding learning to deliver a higher learning accuracy than the two-stage method while identifying an optimal learning-adaptive embedding. In particular, we propose a concept of U-minimal sufficient learning-adaptive embeddings, based on which we seek an optimal one to maximize the learning accuracy subject to an embedding constraint. Moreover, when specializing the general framework to classification, we derive a graph embedding classifier based on a hyperlink tensor representing multiple hypergraphs, directed or undirected, characterizing multi-way relations of unstructured data. Numerically, we design algorithms based on blockwise coordinate descent and projected gradient descent to implement linear and feed-forward neural network classifiers, respectively. Theoretically, we establish a learning theory to quantify the generalization error of the proposed method. Moreover, we show, in linear regression, that the one-hot encoder is preferable among two-stage methods, yet its dimension restriction hinders its predictive performance. For a graph embedding classifier, the generalization error matches up to the standard fast rate or the parametric rate for linear or nonlinear classification. Finally, we demonstrate the utility of the classifiers on two benchmarks in grammatical classification and sentiment analysis. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 307-319
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1775614
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1775614
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:307-319
Template-Type: ReDIF-Article 1.0
Author-Name: Jialiang Li
Author-X-Name-First: Jialiang
Author-X-Name-Last: Li
Author-Name: Jing Lv
Author-X-Name-First: Jing
Author-X-Name-Last: Lv
Author-Name: Alan T. K. Wan
Author-X-Name-First: Alan T. K.
Author-X-Name-Last: Wan
Author-Name: Jun Liao
Author-X-Name-First: Jun
Author-X-Name-Last: Liao
Title: AdaBoost Semiparametric Model Averaging Prediction for Multiple Categories
Abstract:
Model averaging techniques are very useful for model-based prediction. However, most earlier works in this field focused on parametric models and continuous responses. In this article, we study varying coefficient multinomial logistic models and propose a semiparametric model averaging prediction (SMAP) approach for multi-category outcomes. The proposed procedure does not need any artificial specification of the index variable in the adopted varying coefficient sub-model structure to forecast the response. In particular, this new SMAP method is more flexible and robust against model misspecification. To improve the practical predictive performance, we combine SMAP with the AdaBoost algorithm to obtain more accurate estimations of class probabilities and model averaging weights. We compare our proposed methods with all existing model averaging approaches and a wide range of popular classification methods via extensive simulations. An automobile classification study is included to illustrate the merits of our methodology. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 495-509
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1790375
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1790375
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:495-509
Template-Type: ReDIF-Article 1.0
Author-Name: Ian Laga
Author-X-Name-First: Ian
Author-X-Name-Last: Laga
Author-Name: Xiaoyue Niu
Author-X-Name-First: Xiaoyue
Author-X-Name-Last: Niu
Author-Name: Le Bao
Author-X-Name-First: Le
Author-X-Name-Last: Bao
Title: Modeling the Marked Presence-Only Data: A Case Study of Estimating the Female Sex Worker Size in Malawi
Abstract:
Certain subpopulations like female sex workers (FSW), men who have sex with men (MSM), and people who inject drugs (PWID) often have higher prevalence of HIV/AIDS and are difficult to map directly due to stigma, discrimination, and criminalization. Fine-scale mapping of those populations contributes to the progress toward reducing the inequalities and ending the AIDS epidemic. In 2016 and 2017, the PLACE surveys were conducted at 3290 venues in 20 out of the total 28 districts in Malawi to estimate the FSW sizes. These venues represent a presence-only dataset where, instead of knowing both where people live and do not live (presence–absence data), only information about visited locations is available. In this study, we develop a Bayesian model for presence-only data and utilize the PLACE data to estimate the FSW size and uncertainty interval at a 1.5×1.5-km resolution for all of Malawi. The estimates can also be aggregated to any desirable level (city/district/region) for implementing targeted HIV prevention and treatment programs in FSW communities, which have been successful in lowering the incidence of HIV and other sexually transmitted infections. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 27-37
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2021.1944873
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1944873
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:27-37
Template-Type: ReDIF-Article 1.0
Author-Name: Hongjian Shi
Author-X-Name-First: Hongjian
Author-X-Name-Last: Shi
Author-Name: Mathias Drton
Author-X-Name-First: Mathias
Author-X-Name-Last: Drton
Author-Name: Fang Han
Author-X-Name-First: Fang
Author-X-Name-Last: Han
Title: Distribution-Free Consistent Independence Tests via Center-Outward Ranks and Signs
Abstract:
This article investigates the problem of testing independence of two random vectors of general dimensions. For this, we give for the first time a distribution-free consistent test. Our approach combines distance covariance with the center-outward ranks and signs developed by Marc Hallin and collaborators. In technical terms, the proposed test is consistent and distribution-free in the family of multivariate distributions with nonvanishing (Lebesgue) probability densities. Exploiting the (degenerate) U-statistic structure of the distance covariance and the combinatorial nature of Hallin’s center-outward ranks and signs, we are able to derive the limiting null distribution of our test statistic. The resulting asymptotic approximation is accurate already for moderate sample sizes and makes the test implementable without requiring permutation. The limiting distribution is derived via a more general result that gives a new type of combinatorial noncentral limit theorem for double- and multiple-indexed permutation statistics. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 395-410
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1782223
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1782223
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:395-410
Template-Type: ReDIF-Article 1.0
Author-Name: Jun Yu
Author-X-Name-First: Jun
Author-X-Name-Last: Yu
Author-Name: HaiYing Wang
Author-X-Name-First: HaiYing
Author-X-Name-Last: Wang
Author-Name: Mingyao Ai
Author-X-Name-First: Mingyao
Author-X-Name-Last: Ai
Author-Name: Huiming Zhang
Author-X-Name-First: Huiming
Author-X-Name-Last: Zhang
Title: Optimal Distributed Subsampling for Maximum Quasi-Likelihood Estimators With Massive Data
Abstract:
Nonuniform subsampling methods are effective in reducing the computational burden and maintaining estimation efficiency for massive data. Existing methods mostly focus on subsampling with replacement due to its high computational efficiency. If the data volume is so large that nonuniform subsampling probabilities cannot be calculated all at once, then subsampling with replacement is infeasible to implement. This article solves this problem using Poisson subsampling. We first derive optimal Poisson subsampling probabilities in the context of quasi-likelihood estimation under the A- and L-optimality criteria. For a practically implementable algorithm with approximated optimal subsampling probabilities, we establish the consistency and asymptotic normality of the resultant estimators. To deal with the situation that the full data are stored in different blocks or at multiple locations, we develop a distributed subsampling framework, in which statistics are computed simultaneously on smaller partitions of the full data. Asymptotic properties of the resultant aggregated estimator are investigated. We illustrate and evaluate the proposed strategies through numerical experiments on simulated and real datasets. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 265-276
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1773832
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1773832
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:265-276
Template-Type: ReDIF-Article 1.0
Author-Name: Youngjun Choe
Author-X-Name-First: Youngjun
Author-X-Name-Last: Choe
Title: An Introduction to Acceptance Sampling and SPC with R
Journal: Journal of the American Statistical Association
Pages: 528-528
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2022.2035160
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2035160
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:528-528
Template-Type: ReDIF-Article 1.0
Author-Name: James Y. Dai
Author-X-Name-First: James Y.
Author-X-Name-Last: Dai
Author-Name: Janet L. Stanford
Author-X-Name-First: Janet L.
Author-X-Name-Last: Stanford
Author-Name: Michael LeBlanc
Author-X-Name-First: Michael
Author-X-Name-Last: LeBlanc
Title: A Multiple-Testing Procedure for High-Dimensional Mediation Hypotheses
Abstract:
Mediation analysis is of rising interest in epidemiologic studies and clinical trials. Among existing methods, the joint significance test yields an overly conservative Type I error rate and low power, particularly for high-dimensional mediation hypotheses. In this article, we develop a multiple-testing procedure that accurately controls the family-wise error rate (FWER) and the false discovery rate (FDR) when testing high-dimensional mediation hypotheses. The core of our procedure is based on estimating the proportions of component null hypotheses and the underlying mixture null distribution of p-values. Theoretical developments and simulation experiments prove that the proposed procedure effectively controls FWER and FDR. Two mediation analyses on DNA methylation and cancer research are presented: assessing the mediation role of DNA methylation in genetic regulation of gene expression in primary prostate cancer samples; exploring the possibility of DNA methylation mediating the effect of exercise on prostate cancer progression. Results of data examples include well-behaved quantile-quantile plots and improved power to detect novel mediation relationships. An R package HDMT implementing the proposed procedure is freely accessible in CRAN. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 198-213
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1765785
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1765785
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:198-213
Template-Type: ReDIF-Article 1.0
Author-Name: Rong Chen
Author-X-Name-First: Rong
Author-X-Name-Last: Chen
Author-Name: Dan Yang
Author-X-Name-First: Dan
Author-X-Name-Last: Yang
Author-Name: Cun-Hui Zhang
Author-X-Name-First: Cun-Hui
Author-X-Name-Last: Zhang
Title: Factor Models for High-Dimensional Tensor Time Series
Abstract:
Large tensor (multi-dimensional array) data routinely appear nowadays in a wide range of applications, due to modern data collection capabilities. Often such observations are taken over time, forming tensor time series. In this article, we present a factor model approach to the analysis of high-dimensional dynamic tensor time series and multi-category dynamic transport networks. We propose two estimation procedures along with their theoretical properties and simulation results, and present two applications to illustrate the model and its interpretations.
Journal: Journal of the American Statistical Association
Pages: 94-116
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2021.1912757
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1912757
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:94-116
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: From Fixed-X to Random-X Regression: Bias-Variance Decompositions, Covariance Penalties, and Prediction Error Estimation: Correction
Journal: Journal of the American Statistical Association
Pages: 529-529
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2021.2016420
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2016420
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:529-529
Template-Type: ReDIF-Article 1.0
Author-Name: Jun Yang
Author-X-Name-First: Jun
Author-X-Name-Last: Yang
Author-Name: Zhou Zhou
Author-X-Name-First: Zhou
Author-X-Name-Last: Zhou
Title: Spectral Inference under Complex Temporal Dynamics
Abstract:
We develop a unified theory and methodology for the inference of evolutionary Fourier power spectra for a general class of locally stationary and possibly nonlinear processes. In particular, simultaneous confidence regions (SCR) with asymptotically correct coverage rates are constructed for the evolutionary spectral densities on a nearly optimally dense grid of the joint time-frequency domain. A simulation-based bootstrap method is proposed to implement the SCR. The SCR enables researchers and practitioners to visually evaluate the magnitude and pattern of the evolutionary power spectra with asymptotically accurate statistical guarantees. The SCR also serves as a unified tool for a wide range of statistical inference problems in time-frequency analysis, ranging from tests for white noise, stationarity, and time-frequency separability to the validation of nonstationary linear models.
Journal: Journal of the American Statistical Association
Pages: 133-155
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1764365
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1764365
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:133-155
Template-Type: ReDIF-Article 1.0
Author-Name: Abolfazl Safikhani
Author-X-Name-First: Abolfazl
Author-X-Name-Last: Safikhani
Author-Name: Ali Shojaie
Author-X-Name-First: Ali
Author-X-Name-Last: Shojaie
Title: Joint Structural Break Detection and Parameter Estimation in High-Dimensional Nonstationary VAR Models
Abstract:
Assuming stationarity is unrealistic in many time series applications. A more realistic alternative is to assume piecewise stationarity, where the model can change at potentially many change points. We propose a three-stage procedure for simultaneous estimation of change points and parameters of high-dimensional piecewise vector autoregressive (VAR) models. In the first step, we reformulate the change point detection problem as a high-dimensional variable selection one, and solve it using a penalized least-squares estimator with a total variation penalty. We show that the penalized estimation method over-estimates the number of change points, and propose a selection criterion to identify the change points. In the last step of our procedure, we estimate the VAR parameters in each of the segments. We prove that the proposed procedure consistently detects the number and location of change points, and provides consistent estimates of VAR parameters. The performance of the method is illustrated through several simulated and real data examples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 251-264
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1770097
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1770097
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:251-264
Template-Type: ReDIF-Article 1.0
Author-Name: Yaping Wang
Author-X-Name-First: Yaping
Author-X-Name-Last: Wang
Author-Name: Fasheng Sun
Author-X-Name-First: Fasheng
Author-X-Name-Last: Sun
Author-Name: Hongquan Xu
Author-X-Name-First: Hongquan
Author-X-Name-Last: Xu
Title: On Design Orthogonality, Maximin Distance, and Projection Uniformity for Computer Experiments
Abstract:
Space-filling designs are widely used in both computer and physical experiments. Column-orthogonality, maximin distance, and projection uniformity are three basic and popular space-filling criteria proposed from different perspectives, but their relationships have rarely been investigated. We show that the average squared correlation metric is a function of the pairwise L2-distances between the rows only. We further explore the connection between uniform projection designs and maximin L1-distance designs. Based on these connections, we develop new lower and upper bounds for column-orthogonality and projection uniformity from the perspective of distance between design points. These results not only provide new theoretical justifications for each criterion but also help in finding better space-filling designs under multiple criteria. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 375-385
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1782221
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1782221
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:375-385
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Correction to: Semiparametric Inference for Non-monotone Missing-Not-at-Random Data: the No Self-Censoring Model
Journal: Journal of the American Statistical Association
Pages: 530-530
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2021.2016421
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2016421
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:530-530
Template-Type: ReDIF-Article 1.0
Author-Name: Ray Bai
Author-X-Name-First: Ray
Author-X-Name-Last: Bai
Author-Name: Gemma E. Moran
Author-X-Name-First: Gemma E.
Author-X-Name-Last: Moran
Author-Name: Joseph L. Antonelli
Author-X-Name-First: Joseph L.
Author-X-Name-Last: Antonelli
Author-Name: Yong Chen
Author-X-Name-First: Yong
Author-X-Name-Last: Chen
Author-Name: Mary R. Boland
Author-X-Name-First: Mary R.
Author-X-Name-Last: Boland
Title: Spike-and-Slab Group Lassos for Grouped Regression and Sparse Generalized Additive Models
Abstract:
We introduce the spike-and-slab group lasso (SSGL) for Bayesian estimation and variable selection in linear regression with grouped variables. We further extend the SSGL to sparse generalized additive models (GAMs), thereby introducing the first nonparametric variant of the spike-and-slab lasso methodology. Our model simultaneously performs group selection and estimation, while our fully Bayes treatment of the mixture proportion allows for model complexity control and automatic self-adaptivity to different levels of sparsity. We develop theory to uniquely characterize the global posterior mode under the SSGL and introduce a highly efficient block coordinate ascent algorithm for maximum a posteriori estimation. We further employ de-biasing methods to provide uncertainty quantification of our estimates. Thus, implementation of our model avoids the computational intensiveness of Markov chain Monte Carlo in high dimensions. We derive posterior concentration rates for both grouped linear regression and sparse GAMs when the number of covariates grows at nearly exponential rate with sample size. Finally, we illustrate our methodology through extensive simulations and data analysis. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 184-197
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1765784
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1765784
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:184-197
Template-Type: ReDIF-Article 1.0
Author-Name: Derek Feng
Author-X-Name-First: Derek
Author-X-Name-Last: Feng
Author-Name: Randolf Altmeyer
Author-X-Name-First: Randolf
Author-X-Name-Last: Altmeyer
Author-Name: Derek Stafford
Author-X-Name-First: Derek
Author-X-Name-Last: Stafford
Author-Name: Nicholas A. Christakis
Author-X-Name-First: Nicholas A.
Author-X-Name-Last: Christakis
Author-Name: Harrison H. Zhou
Author-X-Name-First: Harrison H.
Author-X-Name-Last: Zhou
Title: Testing for Balance in Social Networks
Abstract:
Friendship and antipathy exist in concert with one another in real social networks. Despite the role they play in social interactions, antagonistic ties are poorly understood and infrequently measured. One important theory of negative ties that has received relatively little empirical evaluation is balance theory, the codification of the adage “the enemy of my enemy is my friend” and similar sayings. Unbalanced triangles are those with an odd number of negative ties, and the theory posits that such triangles are rare. To test for balance, previous works have used a permutation test on the edge signs. The flaw in this method, however, is that it assumes that negative and positive edges are interchangeable. In reality, they could not be more different. Here, we propose a novel test of balance that accounts for this discrepancy and show that our test is more accurate at detecting balance. Along the way, we prove asymptotic normality of the test statistic under our null model, which is of independent interest. Our case study is a novel dataset of signed networks we collected from 32 isolated, rural villages in Honduras. Contrary to previous results, we find that there is only marginal evidence for balance in social tie formation in this setting.
Journal: Journal of the American Statistical Association
Pages: 156-174
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1764850
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1764850
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:156-174
Template-Type: ReDIF-Article 1.0
Author-Name: Jialin Ouyang
Author-X-Name-First: Jialin
Author-X-Name-Last: Ouyang
Author-Name: Ming Yuan
Author-X-Name-First: Ming
Author-X-Name-Last: Yuan
Title: Comments on “Factor Models for High-Dimensional Tensor Time Series”
Journal: Journal of the American Statistical Association
Pages: 124-127
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2022.2028630
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2028630
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:124-127
Template-Type: ReDIF-Article 1.0
Author-Name: Valérie Garès
Author-X-Name-First: Valérie
Author-X-Name-Last: Garès
Author-Name: Jérémy Omer
Author-X-Name-First: Jérémy
Author-X-Name-Last: Omer
Title: Regularized Optimal Transport of Covariates and Outcomes in Data Recoding
Abstract:
When databases are constructed from heterogeneous sources, it is not unusual that different encodings are used for the same outcome. In such cases, it is necessary to recode the outcome variable before merging two databases. The method proposed for the recoding is an application of optimal transportation, in which we search for a bijective mapping between the distributions of this variable in the two databases. In this article, we build upon the work by Garés et al., who transport the distributions of categorical outcomes assuming that they are distributed equally in the two databases. Here, we extend the scope of the model to treat all situations where the covariates explain the outcomes similarly in the two databases. In particular, we do not require that the outcomes be distributed equally. For this, we propose a model where joint distributions of outcomes and covariates are transported. We also propose to enrich the model by relaxing the constraints on marginal distributions and adding an L1 regularization term. The performance of the models is evaluated in a simulation study, and they are applied to a real dataset. The code used in the computational assessment and in the simulation of test cases is publicly available in the GitHub repository: https://github.com/otrecoding/OTRecod.jl.
Journal: Journal of the American Statistical Association
Pages: 320-333
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1775615
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1775615
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:320-333
Template-Type: ReDIF-Article 1.0
Author-Name: Zhenhua Lin
Author-X-Name-First: Zhenhua
Author-X-Name-Last: Lin
Author-Name: Jane-Ling Wang
Author-X-Name-First: Jane-Ling
Author-X-Name-Last: Wang
Title: Mean and Covariance Estimation for Functional Snippets
Abstract:
We consider estimation of mean and covariance functions of functional snippets, which are short segments of functions possibly observed irregularly on an individual-specific subinterval that is much shorter than the entire study interval. Estimation of the covariance function for functional snippets is challenging since information for the far off-diagonal regions of the covariance structure is completely missing. We address this difficulty by decomposing the covariance function into a variance function component and a correlation function component. The variance function can be effectively estimated nonparametrically, while the correlation part is modeled parametrically, possibly with an increasing number of parameters, to handle the missing information in the far off-diagonal regions. Both theoretical analysis and numerical simulations suggest that this hybrid strategy is effective. In addition, we propose a new estimator for the variance of measurement errors and analyze its asymptotic properties. This estimator is required for the estimation of the variance function from noisy measurements. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 348-360
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1777138
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1777138
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:348-360
Template-Type: ReDIF-Article 1.0
Author-Name: Jinyuan Chang
Author-X-Name-First: Jinyuan
Author-X-Name-Last: Chang
Author-Name: Eric D. Kolaczyk
Author-X-Name-First: Eric D.
Author-X-Name-Last: Kolaczyk
Author-Name: Qiwei Yao
Author-X-Name-First: Qiwei
Author-X-Name-Last: Yao
Title: Estimation of Subgraph Densities in Noisy Networks
Abstract:
While it is common practice in applied network analysis to report various standard network summary statistics, these numbers are rarely accompanied by uncertainty quantification. Yet any error inherent in the measurements underlying the construction of the network, or in the network construction procedure itself, necessarily must propagate to any summary statistics reported. Here we study the problem of estimating the density of an arbitrary subgraph, given a noisy version of some underlying network as data. Under a simple model of network error, we show that consistent estimation of such densities is impossible when the rates of error are unknown and only a single network is observed. Accordingly, we develop method-of-moment estimators of network subgraph densities and error rates for the case where a minimal number of network replicates are available. These estimators are shown to be asymptotically normal as the number of vertices increases to infinity. We also provide confidence intervals for quantifying the uncertainty in these estimates based on the asymptotic normality. To construct the confidence intervals, a new and nonstandard bootstrap method is proposed to compute asymptotic variances, which is infeasible otherwise. We illustrate the proposed methods in the context of gene coexpression networks. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 361-374
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1778482
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1778482
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:361-374
Template-Type: ReDIF-Article 1.0
Author-Name: Chris McKennan
Author-X-Name-First: Chris
Author-X-Name-Last: McKennan
Author-Name: Dan Nicolae
Author-X-Name-First: Dan
Author-X-Name-Last: Nicolae
Title: Estimating and Accounting for Unobserved Covariates in High-Dimensional Correlated Data
Abstract:
Many high-dimensional and high-throughput biological datasets have complex sample correlation structures, which include longitudinal and multiple tissue data, as well as data with multiple treatment conditions or related individuals. These data, as well as nearly all high-throughput “omic” data, are influenced by technical and biological factors unknown to the researcher, which, if unaccounted for, can severely obfuscate estimation of and inference on the effects of interest. We therefore developed CBCV and CorrConf: provably accurate and computationally efficient methods to choose the number of and estimate latent confounding factors present in high-dimensional data with correlated or nonexchangeable residuals. We demonstrate each method’s superior performance compared to other state of the art methods by analyzing simulated multi-tissue gene expression data and identifying sex-associated DNA methylation sites in a real, longitudinal twin study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 225-236
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1769635
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1769635
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:225-236
Template-Type: ReDIF-Article 1.0
Author-Name: Mats J. Stensrud
Author-X-Name-First: Mats J.
Author-X-Name-Last: Stensrud
Author-Name: Jessica G. Young
Author-X-Name-First: Jessica G.
Author-X-Name-Last: Young
Author-Name: Vanessa Didelez
Author-X-Name-First: Vanessa
Author-X-Name-Last: Didelez
Author-Name: James M. Robins
Author-X-Name-First: James M.
Author-X-Name-Last: Robins
Author-Name: Miguel A. Hernán
Author-X-Name-First: Miguel A.
Author-X-Name-Last: Hernán
Title: Separable Effects for Causal Inference in the Presence of Competing Events
Abstract:
In time-to-event settings, the presence of competing events complicates the definition of causal effects. Here we propose the new separable effects to study the causal effect of a treatment on an event of interest. The separable direct effect is the treatment effect on the event of interest not mediated by its effect on the competing event. The separable indirect effect is the treatment effect on the event of interest only through its effect on the competing event. Similar to Robins and Richardson’s extended graphical approach for mediation analysis, the separable effects can only be identified under the assumption that the treatment can be decomposed into two distinct components that exert their effects through distinct causal pathways. Unlike existing definitions of causal effects in the presence of competing events, our estimands do not require cross-world contrasts or hypothetical interventions to prevent death. As an illustration, we apply our approach to a randomized clinical trial on estrogen therapy in individuals with prostate cancer. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 175-183
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1765783
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1765783
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:175-183
Template-Type: ReDIF-Article 1.0
Author-Name: Bingxin Zhao
Author-X-Name-First: Bingxin
Author-X-Name-Last: Zhao
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Title: On Genetic Correlation Estimation With Summary Statistics From Genome-Wide Association Studies
Abstract:
The cross-trait polygenic risk score (PRS) method has gained popularity for assessing the genetic correlation of complex traits using summary statistics from biobank-scale genome-wide association studies (GWAS). However, empirical evidence has shown a common bias phenomenon: highly significant cross-trait PRS can account for only a very small amount of genetic variance (R2 can be <1%) in independent testing GWAS. The aim of this paper is to investigate and address the bias phenomenon of cross-trait PRS in numerous GWAS applications. We show that the estimated genetic correlation can be asymptotically biased toward zero. A consistent cross-trait PRS estimator is then proposed to correct such asymptotic bias. In addition, we investigate whether or not SNP screening by GWAS p-values can lead to improved estimation and show the effect of overlapping samples among GWAS. We analyze GWAS summary statistics of reaction time and brain structural magnetic resonance imaging-based features measured in the Pediatric Imaging, Neurocognition, and Genetics study. We find that the raw cross-trait PRS estimators heavily underestimate the genetic similarity between cognitive function and human brain structures (mean R2=1.32%), whereas the bias-corrected estimators uncover the moderate degree of genetic overlap between these closely related heritable traits (mean R2=22.42%). Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1-11
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2021.1906684
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1906684
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:1-11
Template-Type: ReDIF-Article 1.0
Author-Name: Albert Xingyi Man
Author-X-Name-First: Albert Xingyi
Author-X-Name-Last: Man
Author-Name: Steven Andrew Culpepper
Author-X-Name-First: Steven Andrew
Author-X-Name-Last: Culpepper
Title: A Mode-Jumping Algorithm for Bayesian Factor Analysis
Abstract:
Exploratory factor analysis is a dimension-reduction technique commonly used in psychology, finance, genomics, neuroscience, and economics. Advances in computational power have opened the door for fully Bayesian treatments of factor analysis. One open problem is enforcing rotational identifiability of the latent factor loadings, as the loadings are not identified from the likelihood without further restrictions. Nonidentifiability of the loadings can cause posterior multimodality, which can produce misleading posterior summaries. The positive-diagonal, lower-triangular (PLT) constraint is the most commonly used restriction to guarantee identifiability, in which the upper m × m submatrix of the loadings is constrained to be a lower-triangular matrix with positive-diagonal elements. The PLT constraint can fail to guarantee identifiability if the constrained submatrix is singular. Furthermore, though the PLT constraint addresses identifiability-related multimodality, it introduces additional mixing issues. We introduce a new Bayesian sampling algorithm that efficiently explores the multimodal posterior surface and addresses issues with PLT-constrained approaches. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 277-290
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1773833
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1773833
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:277-290
Template-Type: ReDIF-Article 1.0
Author-Name: Ying-Qi Zhao
Author-X-Name-First: Ying-Qi
Author-X-Name-Last: Zhao
Title: Dynamic Treatment Regimes: Statistical Methods for Precision Medicine
Journal: Journal of the American Statistical Association
Pages: 527-527
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2022.2035159
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2035159
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:527-527
Template-Type: ReDIF-Article 1.0
Author-Name: Zhonghua Liu
Author-X-Name-First: Zhonghua
Author-X-Name-Last: Liu
Author-Name: Jincheng Shen
Author-X-Name-First: Jincheng
Author-X-Name-Last: Shen
Author-Name: Richard Barfield
Author-X-Name-First: Richard
Author-X-Name-Last: Barfield
Author-Name: Joel Schwartz
Author-X-Name-First: Joel
Author-X-Name-Last: Schwartz
Author-Name: Andrea A. Baccarelli
Author-X-Name-First: Andrea A.
Author-X-Name-Last: Baccarelli
Author-Name: Xihong Lin
Author-X-Name-First: Xihong
Author-X-Name-Last: Lin
Title: Large-Scale Hypothesis Testing for Causal Mediation Effects with Applications in Genome-wide Epigenetic Studies
Abstract:
In genome-wide epigenetic studies, it is of great scientific interest to assess whether the effect of an exposure on a clinical outcome is mediated through DNA methylation. However, statistical inference for causal mediation effects is challenged by the fact that one needs to test a large number of composite null hypotheses across the whole epigenome. Two popular tests, the Wald-type Sobel’s test and the joint significance test using the traditional null distribution, are underpowered and thus can miss important scientific discoveries. In this article, we show that the null distribution of Sobel’s test is not the standard normal distribution and the null distribution of the joint significance test is not uniform under the composite null of no mediation effect, especially in finite samples and under the singular point null case in which the exposure has no effect on the mediator and the mediator has no effect on the outcome. Our results explain why these two tests are underpowered and, more importantly, motivate us to develop a more powerful divide-aggregate composite-null test (DACT) for the composite null hypothesis of no mediation effect by leveraging epigenome-wide data. We adopted Efron’s empirical null framework for assessing statistical significance of the DACT test. We showed analytically that the proposed DACT method had improved power and could control the Type I error rate well. Our extensive simulation studies showed that, in finite samples, the DACT method properly controlled the Type I error rate and outperformed Sobel’s test and the joint significance test for detecting mediation effects. We applied the DACT method to the U.S. Department of Veterans Affairs Normative Aging Study, an ongoing prospective cohort study which included men who were aged 21 to 80 years at entry. We identified multiple DNA methylation CpG sites that might mediate the effect of smoking on lung function, with effect sizes ranging from –0.18 to –0.79 and the false discovery rate controlled at the 0.05 level, including CpG sites in the genes AHRR and F2RL3. Our sensitivity analysis found small residual correlations (less than 0.01) of the error terms between the outcome and mediator regressions, suggesting that our results are robust to unmeasured confounding factors. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 67-81
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2021.1914634
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1914634
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:67-81
Template-Type: ReDIF-Article 1.0
Author-Name: Félix Camirand Lemyre
Author-X-Name-First: Félix
Author-X-Name-Last: Camirand Lemyre
Author-Name: Raymond J. Carroll
Author-X-Name-First: Raymond J.
Author-X-Name-Last: Carroll
Author-Name: Aurore Delaigle
Author-X-Name-First: Aurore
Author-X-Name-Last: Delaigle
Title: Semiparametric Estimation of the Distribution of Episodically Consumed Foods Measured With Error
Abstract:
Dietary data collected from 24-hour dietary recalls are observed with significant measurement errors. In the nonparametric curve estimation literature, much of the effort has been devoted to designing methods that are consistent under contamination by noise, and these methods have traditionally been applied to analyze such data. However, some foods such as alcohol or fruits are consumed only episodically, and may not be consumed during the day when the 24-hour recall is administered. These so-called excess zeros make existing nonparametric estimators break down, and new techniques need to be developed for such data. We develop two new consistent semiparametric estimators of the distribution of such episodically consumed food data, making parametric assumptions only on some less important parts of the model. We establish their theoretical properties and illustrate the good performance of our fully data-driven methods in simulated and real data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 469-481
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1787840
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1787840
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:469-481
Template-Type: ReDIF-Article 1.0
Author-Name: Masako Ikefuji
Author-X-Name-First: Masako
Author-X-Name-Last: Ikefuji
Author-Name: Roger J. A. Laeven
Author-X-Name-First: Roger J. A.
Author-X-Name-Last: Laeven
Author-Name: Jan R. Magnus
Author-X-Name-First: Jan R.
Author-X-Name-Last: Magnus
Author-Name: Yuan Yue
Author-X-Name-First: Yuan
Author-X-Name-Last: Yue
Title: Earthquake Risk Embedded in Property Prices: Evidence From Five Japanese Cities
Abstract:
We analyze the impact of short-run (90 days) and long-run (30 years) earthquake risk on real estate transaction prices in five Japanese cities (Tokyo, Osaka, Nagoya, Fukuoka, and Sapporo), using quarterly data over the period 2006–2015. We exploit a rich panel dataset (331,343 observations) with property characteristics, ward attractiveness information, macroeconomic variables, and long-run seismic hazard data, supplemented with short-run earthquake probabilities generated from a seismic excitation model using historical earthquake occurrences. We design a hedonic property price model that allows for subjective probability weighting, employ a multivariate error components structure, and develop associated maximum likelihood estimation and variance computation procedures. Our approach enables us to identify the total compensation for earthquake risk embedded in property prices, to decompose this into pieces stemming from short-run and long-run risk, and to distinguish between objective and subjectively weighted (“distorted”) earthquake probabilities. We find that objective long-run earthquake probabilities have a statistically significant negative impact on property prices, whereas short-run earthquake probabilities become statistically significant only when we allow them to be distorted. The total compensation for earthquake risk amounts to an average –2.0% of log property prices, slightly more than the annual income of a middle-income Japanese household. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 82-93
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2021.1928512
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1928512
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:82-93
Template-Type: ReDIF-Article 1.0
Author-Name: Lilun Du
Author-X-Name-First: Lilun
Author-X-Name-Last: Du
Author-Name: Inchi Hu
Author-X-Name-First: Inchi
Author-X-Name-Last: Hu
Title: An Empirical Bayes Method for Chi-Squared Data
Abstract:
In a thought-provoking paper, Efron investigated the merit and limitation of an empirical Bayes method to correct selection bias based on Tweedie’s formula first reported in the study by Robbins. The exceptional virtue of Tweedie’s formula for the normal distribution lies in its representation of selection bias as a simple function of the derivative of log marginal likelihood. Since the marginal likelihood and its derivative can be estimated from the data directly without invoking prior information, bias correction can be carried out conveniently. We propose a Bayesian hierarchical model for chi-squared data such that the resulting Tweedie’s formula has the same virtue as that of the normal distribution. Because the family of noncentral chi-squared distributions, the common alternative distributions for chi-squared tests, does not constitute an exponential family, our results cannot be obtained by extending existing results. Furthermore, the corresponding Tweedie’s formula manifests new phenomena quite different from those of the normal distribution and suggests new ways of analyzing chi-squared data.
Journal: Journal of the American Statistical Association
Pages: 334-347
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1777137
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1777137
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:334-347
Template-Type: ReDIF-Article 1.0
Author-Name: Wanjun Liu
Author-X-Name-First: Wanjun
Author-X-Name-Last: Liu
Author-Name: Yuan Ke
Author-X-Name-First: Yuan
Author-X-Name-Last: Ke
Author-Name: Jingyuan Liu
Author-X-Name-First: Jingyuan
Author-X-Name-Last: Liu
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Title: Model-Free Feature Screening and FDR Control With Knockoff Features
Abstract:
This article proposes a model-free and data-adaptive feature screening method for ultrahigh-dimensional data. The proposed method is based on the projection correlation, which measures the dependence between two random vectors. This projection correlation based method does not require specifying a regression model, and applies to data in the presence of heavy tails and multivariate responses. It enjoys both sure screening and rank consistency properties under weak assumptions. A two-step approach, with the help of knockoff features, is advocated to specify the threshold for feature screening such that the false discovery rate (FDR) is controlled under a prespecified level. The proposed two-step approach enjoys both sure screening and FDR control simultaneously if the prespecified FDR level is greater than or equal to 1/s, where s is the number of active features. The superior empirical performance of the proposed method is illustrated by simulation examples and real data applications. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 428-443
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1783274
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1783274
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:428-443
Template-Type: ReDIF-Article 1.0
Author-Name: Sooin Yun
Author-X-Name-First: Sooin
Author-X-Name-Last: Yun
Author-Name: Xianyang Zhang
Author-X-Name-First: Xianyang
Author-X-Name-Last: Zhang
Author-Name: Bo Li
Author-X-Name-First: Bo
Author-X-Name-Last: Li
Title: Detection of Local Differences in Spatial Characteristics Between Two Spatiotemporal Random Fields
Abstract:
Comparing the spatial characteristics of spatiotemporal random fields is often in demand. However, the comparison can be challenging due to the high dimensionality of and dependence in the data. We develop a new multiple testing approach to detect local differences in the spatial characteristics of two spatiotemporal random fields by taking the spatial information into account. Our method adopts a two-component mixture model for location-wise p-values and then derives a new false discovery rate (FDR) control procedure, called the mirror procedure, to determine the optimal rejection region. This procedure is robust to model misspecification and allows for weak dependency among hypotheses. To integrate the spatial heterogeneity, we model the mixture probability and study the benefit, if any, of allowing the alternative distribution to be spatially varying. An EM-algorithm is developed to estimate the mixture model and implement the FDR procedure. We study the FDR control and the power of our new approach both theoretically and numerically, and apply the approach to compare the mean and teleconnection pattern between two synthetic climate fields. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 291-306
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1775613
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1775613
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:291-306
Template-Type: ReDIF-Article 1.0
Author-Name: Jiahui Yu
Author-X-Name-First: Jiahui
Author-X-Name-Last: Yu
Author-Name: Jian Shi
Author-X-Name-First: Jian
Author-X-Name-Last: Shi
Author-Name: Anna Liu
Author-X-Name-First: Anna
Author-X-Name-Last: Liu
Author-Name: Yuedong Wang
Author-X-Name-First: Yuedong
Author-X-Name-Last: Wang
Title: Smoothing Spline Semiparametric Density Models
Abstract:
Density estimation plays a fundamental role in many areas of statistics and machine learning. Parametric, nonparametric, and semiparametric density estimation methods have been proposed in the literature. Semiparametric density models are flexible in incorporating domain knowledge and uncertainty regarding the shape of the density function. Existing literature on semiparametric density models is scattered and lacks a systematic framework. In this article, we consider a unified framework based on reproducing kernel Hilbert space for modeling, estimation, computation, and theory. We propose general semiparametric density models for both a single sample and multiple samples, which include many existing semiparametric density models as special cases. We develop penalized likelihood based estimation methods and computational methods under different situations. We establish joint consistency and derive convergence rates of the proposed estimators for both finite-dimensional Euclidean parameters and an infinite-dimensional functional parameter. We validate our estimation methods empirically through simulations and an application. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 237-250
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1769636
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1769636
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:237-250
Template-Type: ReDIF-Article 1.0
Author-Name: David E. Allen
Author-X-Name-First: David E.
Author-X-Name-Last: Allen
Author-Name: Michael McAleer
Author-X-Name-First: Michael
Author-X-Name-Last: McAleer
Title: “Generalized Measures of Correlation for Asymmetry, Nonlinearity, and Beyond”: Some Antecedents on Causality
Abstract:
This note comments on the generalized measure of correlation (GMC) that was suggested by Zheng, Shi, and Zhang. The GMC concept was partly anticipated in some publications over 100 years earlier by Yule in the Proceedings of the Royal Society, and by Kendall. Other antecedents discussed include work on dependency by Renyi and Doksum and Samarov, together with the Yule–Simpson paradox. The GMC metric partly extends the concept of Granger causality, so that we consider causality, graphical analysis and alternative measures of dependency provided by copulas.
Journal: Journal of the American Statistical Association
Pages: 214-224
Issue: 537
Volume: 117
Year: 2022
Month: 1
X-DOI: 10.1080/01621459.2020.1768101
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1768101
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:537:p:214-224
Template-Type: ReDIF-Article 1.0
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Ricardo Masini
Author-X-Name-First: Ricardo
Author-X-Name-Last: Masini
Author-Name: Marcelo C. Medeiros
Author-X-Name-First: Marcelo C.
Author-X-Name-Last: Medeiros
Title: Do We Exploit all Information for Counterfactual Analysis? Benefits of Factor Models and Idiosyncratic Correction
Abstract:
Optimal pricing, that is, determining the price level that maximizes the profit or revenue of a given product, is a vital task for the retail industry. To select such a quantity, one needs first to estimate the price elasticity from the product demand. Regression methods usually fail to recover such elasticities due to confounding effects and price endogeneity. Therefore, randomized experiments are typically required. However, elasticities can be highly heterogeneous depending on the location of stores, for example. As the randomization frequently occurs at the municipal level, standard difference-in-differences methods may also fail. Possible solutions are based on methodologies to measure the effects of treatments on a single (or just a few) treated unit(s) based on counterfactuals constructed from artificial controls. For example, for each city in the treatment group, a counterfactual may be constructed from the untreated locations. In this article, we apply a novel high-dimensional statistical method to measure the effects of price changes on daily sales from a major retailer in Brazil. The proposed methodology combines principal components (factors) and sparse regressions, resulting in a method called Factor-Adjusted Regularized Method for Treatment evaluation (FarmTreat). The data consist of daily sales and prices of five different products over more than 400 municipalities. The products considered belong to the sweet and candies category, and experiments were conducted over the years 2016 and 2017. Our results confirm the hypothesis of a high degree of heterogeneity, yielding very different pricing strategies over distinct municipalities. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 574-590
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2021.2004895
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2004895
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:574-590
Template-Type: ReDIF-Article 1.0
Author-Name: Haojie Ren
Author-X-Name-First: Haojie
Author-X-Name-Last: Ren
Author-Name: Changliang Zou
Author-X-Name-First: Changliang
Author-X-Name-Last: Zou
Author-Name: Nan Chen
Author-X-Name-First: Nan
Author-X-Name-Last: Chen
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Title: Large-Scale Datastreams Surveillance via Pattern-Oriented-Sampling
Abstract:
Monitoring large-scale datastreams with limited resources has become increasingly important for real-time detection of abnormal activities in many applications. Ignoring the neighboring information of spatially structured data tends to diminish the detection effectiveness of traditional detection procedures. Despite the availability of large datasets, the challenges associated with designing an efficient change-detection procedure when clustering or spatial patterns exist are not yet well addressed. In this article, a design-adaptive testing procedure is developed for settings in which only a limited number of streaming observations can be accessed at each time. We derive an optimal sampling strategy, the pattern-oriented-sampling, with which the proposed test possesses asymptotically and locally best power under alternatives. Then, a sequential change-detection procedure is proposed by integrating this test with a generalized likelihood ratio approach. Benefiting from dynamically estimating the optimal sampling design, the proposed procedure is able to improve the sensitivity in detecting clustered changes compared with existing procedures. Its advantages are demonstrated in numerical simulations and a real data example. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 794-808
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1819295
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1819295
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:794-808
Template-Type: ReDIF-Article 1.0
Author-Name: Jiaying Gu
Author-X-Name-First: Jiaying
Author-X-Name-Last: Gu
Author-Name: Roger Koenker
Author-X-Name-First: Roger
Author-X-Name-Last: Koenker
Title: Nonparametric Maximum Likelihood Methods for Binary Response Models With Random Coefficients
Abstract:
The venerable method of maximum likelihood has found numerous recent applications in nonparametric estimation of regression and shape constrained densities. For mixture models the nonparametric maximum likelihood estimator (NPMLE) of Kiefer and Wolfowitz plays a central role in recent developments of empirical Bayes methods. The NPMLE has also been proposed by Cosslett as an estimation method for single index linear models for binary response with random coefficients. However, computational difficulties have hindered its application. Combining recent developments in computational geometry and convex optimization, we develop a new approach to computation for such models that dramatically increases their computational tractability. Consistency of the method is established for an expanded profile likelihood formulation. The methods are evaluated in simulation experiments, compared to the deconvolution methods of Gautier and Kitamura and illustrated in an application to modal choice for journey-to-work data in the Washington DC area. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 732-751
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1802284
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1802284
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:732-751
Template-Type: ReDIF-Article 1.0
Author-Name: Yiwei Fan
Author-X-Name-First: Yiwei
Author-X-Name-Last: Fan
Author-Name: Xiaoling Lu
Author-X-Name-First: Xiaoling
Author-X-Name-Last: Lu
Author-Name: Yufeng Liu
Author-X-Name-First: Yufeng
Author-X-Name-Last: Liu
Author-Name: Junlong Zhao
Author-X-Name-First: Junlong
Author-X-Name-Last: Zhao
Title: Angle-Based Hierarchical Classification Using Exact Label Embedding
Abstract:
Hierarchical classification problems are commonly seen in practice. However, most existing methods do not fully use the hierarchical information among class labels. In this article, a novel label embedding approach is proposed, which keeps the hierarchy of labels exactly, and reduces the complexity of the hypothesis space significantly. Based on the newly proposed label embedding approach, a new angle-based classifier is developed for hierarchical classification. Moreover, to handle massive data, a new (weighted) linear loss is designed, which has a closed form solution and is computationally efficient. Theoretical properties of the new method are established and intensive numerical comparisons with other methods are conducted. Both simulations and applications in document categorization demonstrate the advantages of the proposed method. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 704-717
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1801450
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1801450
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:704-717
Template-Type: ReDIF-Article 1.0
Author-Name: Rune Christiansen
Author-X-Name-First: Rune
Author-X-Name-Last: Christiansen
Author-Name: Matthias Baumann
Author-X-Name-First: Matthias
Author-X-Name-Last: Baumann
Author-Name: Tobias Kuemmerle
Author-X-Name-First: Tobias
Author-X-Name-Last: Kuemmerle
Author-Name: Miguel D. Mahecha
Author-X-Name-First: Miguel D.
Author-X-Name-Last: Mahecha
Author-Name: Jonas Peters
Author-X-Name-First: Jonas
Author-X-Name-Last: Peters
Title: Toward Causal Inference for Spatio-Temporal Data: Conflict and Forest Loss in Colombia
Abstract:
How does armed conflict influence tropical forest loss? For Colombia, both enhancing and reducing effect estimates have been reported. However, a lack of causal methodology has prevented establishing clear causal links between these two variables. In this work, we propose a class of causal models for spatio-temporal stochastic processes which allows us to formally define and quantify the causal effect of a vector of covariates X on a real-valued response Y. We introduce a procedure for estimating causal effects and a nonparametric hypothesis test for these effects being zero. Our application is based on geospatial information on conflict events and remote-sensing-based data on forest loss between 2000 and 2018 in Colombia. Across the entire country, we estimate the effect to be slightly negative (conflict reduces forest loss) but insignificant (P = 0.578), while at the provincial level, we find both positive effects (e.g., La Guajira, P = 0.047) and negative effects (e.g., Magdalena, P = 0.004). The proposed methods do not make strong distributional assumptions, and allow for arbitrarily many latent confounders, given that these confounders do not vary across time. Our theoretical findings are supported by simulations, and code is available online.
Journal: Journal of the American Statistical Association
Pages: 591-601
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2021.2013241
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2013241
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:591-601
Template-Type: ReDIF-Article 1.0
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Yuan Liao
Author-X-Name-First: Yuan
Author-X-Name-Last: Liao
Title: Learning Latent Factors From Diversified Projections and Its Applications to Over-Estimated and Weak Factors
Abstract:
Estimation and applications of factor models often rely on the crucial condition that the number of latent factors is consistently estimated, which in turn also requires that the factors be relatively strong, the data be stationary and weakly serially dependent, and the sample size be fairly large, although in practical applications one or several of these conditions may fail. In these cases, it is difficult to analyze the eigenvectors of the data matrix. To address this issue, we propose simple estimators of the latent factors using cross-sectional projections of the panel data, formed as weighted averages with predetermined weights. These weights are chosen to diversify away the idiosyncratic components, resulting in “diversified factors.” Because the projections are conducted cross-sectionally, they are robust to serial conditions, easy to analyze, and work even for a finite length of time series. We formally prove that this procedure is robust to over-estimating the number of factors, and illustrate it in several applications, including post-selection inference, big data forecasts, large covariance estimation, and factor specification tests. We also recommend several choices for the diversified weights. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 909-924
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1831927
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1831927
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:909-924
Template-Type: ReDIF-Article 1.0
Author-Name: Yan Dora Zhang
Author-X-Name-First: Yan Dora
Author-X-Name-Last: Zhang
Author-Name: Brian P. Naughton
Author-X-Name-First: Brian P.
Author-X-Name-Last: Naughton
Author-Name: Howard D. Bondell
Author-X-Name-First: Howard D.
Author-X-Name-Last: Bondell
Author-Name: Brian J. Reich
Author-X-Name-First: Brian J.
Author-X-Name-Last: Reich
Title: Bayesian Regression Using a Prior on the Model Fit: The R2-D2 Shrinkage Prior
Abstract:
Prior distributions for high-dimensional linear regression require specifying a joint distribution for the unobserved regression coefficients, which is inherently difficult. We instead propose a new class of shrinkage priors for linear regression via specifying a prior first on the model fit, in particular, the coefficient of determination, and then distributing through to the coefficients in a novel way. The proposed method compares favorably to previous approaches in terms of both concentration around the origin and tail behavior, which leads to improved performance both in posterior contraction and in empirical performance. The limiting behavior of the proposed prior is 1/x, both around the origin and in the tails. This behavior is optimal in the sense that it simultaneously lies on the boundary of being an improper prior both in the tails and around the origin. None of the existing shrinkage priors obtain this behavior in both regions simultaneously. We also demonstrate that our proposed prior leads to the same near-minimax posterior contraction rate as the spike-and-slab prior. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 862-874
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1825449
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1825449
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:862-874
Template-Type: ReDIF-Article 1.0
Author-Name: Gabriel Hassler
Author-X-Name-First: Gabriel
Author-X-Name-Last: Hassler
Author-Name: Max R. Tolkoff
Author-X-Name-First: Max R.
Author-X-Name-Last: Tolkoff
Author-Name: William L. Allen
Author-X-Name-First: William L.
Author-X-Name-Last: Allen
Author-Name: Lam Si Tung Ho
Author-X-Name-First: Lam Si Tung
Author-X-Name-Last: Ho
Author-Name: Philippe Lemey
Author-X-Name-First: Philippe
Author-X-Name-Last: Lemey
Author-Name: Marc A. Suchard
Author-X-Name-First: Marc A.
Author-X-Name-Last: Suchard
Title: Inferring Phenotypic Trait Evolution on Large Trees With Many Incomplete Measurements
Abstract:
Comparative biologists are often interested in inferring covariation between multiple biological traits sampled across numerous related taxa. To properly study these relationships, we must control for the shared evolutionary history of the taxa to avoid spurious inference. An additional challenge arises as obtaining a full suite of measurements becomes increasingly difficult with increasing taxa. This generally necessitates data imputation or integration, and existing control techniques typically scale poorly as the number of taxa increases. We propose an inference technique that integrates out missing measurements analytically and scales linearly with the number of taxa by using a post-order traversal algorithm under a multivariate Brownian diffusion (MBD) model to characterize trait evolution. We further exploit this technique to extend the MBD model to account for sampling error or nonheritable residual variance. We test these methods to examine mammalian life history traits, prokaryotic genomic and phenotypic traits, and HIV infection traits. We find computational efficiency increases that top two orders of magnitude over current best practices. While we focus on the utility of this algorithm in phylogenetic comparative methods, our approach generalizes to solve long-standing challenges in computing the likelihood for matrix-normal and multivariate normal distributions with missing data at scale. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 678-692
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1799812
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1799812
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:678-692
Template-Type: ReDIF-Article 1.0
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Yingying Fan
Author-X-Name-First: Yingying
Author-X-Name-Last: Fan
Author-Name: Xiao Han
Author-X-Name-First: Xiao
Author-X-Name-Last: Han
Author-Name: Jinchi Lv
Author-X-Name-First: Jinchi
Author-X-Name-Last: Lv
Title: Asymptotic Theory of Eigenvectors for Random Matrices With Diverging Spikes
Abstract:
Characterizing the asymptotic distributions of eigenvectors for large random matrices poses important challenges yet can provide useful insights into a range of statistical applications. To this end, in this article we introduce a general framework of asymptotic theory of eigenvectors for large spiked random matrices with diverging spikes and heterogeneous variances, and establish the asymptotic properties of the spiked eigenvectors and eigenvalues for the scenario of the generalized Wigner matrix noise. Under some mild regularity conditions, we provide the asymptotic expansions for the spiked eigenvalues and show that they are asymptotically normal after some normalization. For the spiked eigenvectors, we establish asymptotic expansions for the general linear combination and further show that it is asymptotically normal after some normalization, where the weight vector can be arbitrary. We also provide a more general asymptotic theory for the spiked eigenvectors using the bilinear form. Simulation studies verify the validity of our new theoretical results. Our family of models encompasses many popularly used ones such as the stochastic block models with or without overlapping communities for network analysis and the topic models for text analysis, and our general theory can be exploited for statistical inference in these large-scale applications. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 996-1009
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1840990
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1840990
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:996-1009
Template-Type: ReDIF-Article 1.0
Author-Name: Assaf Rabinowicz
Author-X-Name-First: Assaf
Author-X-Name-Last: Rabinowicz
Author-Name: Saharon Rosset
Author-X-Name-First: Saharon
Author-X-Name-Last: Rosset
Title: Cross-Validation for Correlated Data
Abstract:
K-fold cross-validation (CV) with squared error loss is widely used for evaluating predictive models, especially when strong distributional assumptions cannot be made. However, CV with squared error loss is not free from distributional assumptions, in particular in cases involving non-iid data. This article analyzes CV for correlated data. We present a criterion for the suitability of standard CV in the presence of correlations. When this criterion does not hold, we introduce a bias-corrected CV estimator, which we term CVc, that yields an unbiased estimate of prediction error in many settings where standard CV is invalid. We also demonstrate our results numerically, and find that introducing our correction substantially improves both model evaluation and model selection in simulations and real data studies. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 718-731
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1801451
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1801451
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:718-731
Template-Type: ReDIF-Article 1.0
Author-Name: D. Andrew Brown
Author-X-Name-First: D. Andrew
Author-X-Name-Last: Brown
Author-Name: Christopher S. McMahan
Author-X-Name-First: Christopher S.
Author-X-Name-Last: McMahan
Author-Name: Russell T. Shinohara
Author-X-Name-First: Russell T.
Author-X-Name-Last: Shinohara
Author-Name: Kristin A. Linn
Author-X-Name-First: Kristin A.
Author-X-Name-Last: Linn
Title: Bayesian Spatial Binary Regression for Label Fusion in Structural Neuroimaging
Abstract:
Alzheimer’s disease is a neurodegenerative condition that accelerates cognitive decline relative to normal aging. It is of critical scientific importance to gain a better understanding of early disease mechanisms in the brain to facilitate effective, targeted therapies. The volume of the hippocampus is often used in diagnosis and monitoring of the disease. Measuring this volume via neuroimaging is difficult since each hippocampus must either be manually identified or automatically delineated, a task referred to as segmentation. Automatic hippocampal segmentation often involves mapping a previously manually segmented image to a new brain image and propagating the labels to obtain an estimate of where each hippocampus is located in the new image. A more recent approach to this problem is to propagate labels from multiple manually segmented atlases and combine the results using a process known as label fusion. To date, most label fusion algorithms employ voting procedures with voting weights assigned directly or estimated via optimization. We propose using a fully Bayesian spatial regression model for label fusion that facilitates direct incorporation of covariate information while making accessible the entire posterior distribution. Our results suggest that incorporating tissue classification (e.g., gray matter) into the label fusion procedure can greatly improve segmentation when relatively homogeneous, healthy brains are used as atlases for diseased brains. The fully Bayesian approach also produces meaningful uncertainty measures about hippocampal volumes, information which can be leveraged to detect significant, scientifically meaningful differences between healthy and diseased populations, improving the potential for early detection and tracking of the disease. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 547-560
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2021.2014854
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2014854
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:547-560
Template-Type: ReDIF-Article 1.0
Author-Name: Yaowu Liu
Author-X-Name-First: Yaowu
Author-X-Name-Last: Liu
Author-Name: Zilin Li
Author-X-Name-First: Zilin
Author-X-Name-Last: Li
Author-Name: Xihong Lin
Author-X-Name-First: Xihong
Author-X-Name-Last: Lin
Title: A Minimax Optimal Ridge-Type Set Test for Global Hypothesis With Applications in Whole Genome Sequencing Association Studies
Abstract:
Testing a global hypothesis for a set of variables is a fundamental problem in statistics with a wide range of applications. A few well-known classical tests include Hotelling's T² test, the F-test, and the empirical Bayes based score test. These classical tests, however, are not robust to the signal strength and could have a substantial loss of power when signals are weak or moderate, a situation we commonly encounter in contemporary applications. In this article, we propose a minimax optimal ridge-type set test (MORST), a simple and generic method for testing a global hypothesis. The power of MORST is robust and considerably higher than that of the classical tests when the strength of signals is weak or moderate. In the meantime, MORST only requires a slight increase in computation compared to these existing tests, making it applicable to the analysis of massive genome-wide data. We also provide the generalizations of MORST that are parallel to the traditional Wald test and Rao's score test in asymptotic settings. Extensive simulations demonstrated the robust power of MORST and that the Type I error of MORST was well controlled. We applied MORST to the analysis of the whole-genome sequencing data from the Atherosclerosis Risk in Communities study, where MORST detected 20%–250% more signal regions than the classical tests. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 897-908
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1831926
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1831926
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:897-908
Template-Type: ReDIF-Article 1.0
Author-Name: Muxuan Liang
Author-X-Name-First: Muxuan
Author-X-Name-Last: Liang
Author-Name: Menggang Yu
Author-X-Name-First: Menggang
Author-X-Name-Last: Yu
Title: A Semiparametric Approach to Model Effect Modification
Abstract:
One fundamental statistical question for research areas such as precision medicine and health disparity is about discovering effect modification of treatment or exposure by observed covariates. We propose a semiparametric framework for identifying such effect modification. Instead of using the traditional outcome models, we directly posit semiparametric models on contrasts, or expected differences of the outcome under different treatment choices or exposures. Through semiparametric estimation theory, all valid estimating equations, including the efficient scores, are derived. Besides doubly robust loss functions, our approach also enables dimension reduction in the presence of many covariates. The asymptotic and non-asymptotic properties of the proposed methods are explored via a unified statistical and algorithmic analysis. Comparison with existing methods in both simulation and real data analysis demonstrates the superiority of our estimators, especially for an efficiency-improved version. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 752-764
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1811099
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1811099
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:752-764
Template-Type: ReDIF-Article 1.0
Author-Name: Jiayi Wang
Author-X-Name-First: Jiayi
Author-X-Name-Last: Wang
Author-Name: Raymond K. W. Wong
Author-X-Name-First: Raymond K. W.
Author-X-Name-Last: Wong
Author-Name: Xiaoke Zhang
Author-X-Name-First: Xiaoke
Author-X-Name-Last: Zhang
Title: Low-Rank Covariance Function Estimation for Multidimensional Functional Data
Abstract:
Multidimensional functional data arise in many fields nowadays. The covariance function plays an important role in the analysis of such increasingly common data. In this article, we propose a novel nonparametric covariance function estimation approach under the framework of reproducing kernel Hilbert spaces (RKHS) that can handle both sparse and dense functional data. We extend multilinear rank structures for (finite-dimensional) tensors to functions, which allow for flexible modeling of both covariance operators and marginal structures. The proposed framework can guarantee that the resulting estimator is automatically positive semidefinite, and can incorporate various spectral regularizations. The trace-norm regularization in particular can promote low ranks for both the covariance operator and marginal structures. Despite the lack of a closed form, under mild assumptions, the proposed estimator can achieve unified theoretical results that hold for any relative magnitudes between the sample size and the number of observations per sample field, and the rate of convergence reveals the phase-transition phenomenon from sparse to dense functional data. Based on a new representer theorem, an ADMM algorithm is developed for the trace-norm regularization. The appealing numerical performance of the proposed estimator is demonstrated by a simulation study and the analysis of a dataset from the Argo project. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 809-822
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1820344
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1820344
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:809-822
Template-Type: ReDIF-Article 1.0
Author-Name: Alberto Abadie
Author-X-Name-First: Alberto
Author-X-Name-Last: Abadie
Author-Name: Jann Spiess
Author-X-Name-First: Jann
Author-X-Name-Last: Spiess
Title: Robust Post-Matching Inference
Abstract:
Nearest-neighbor matching is a popular nonparametric tool to create balance between treatment and control groups in observational studies. As a preprocessing step before regression, matching reduces the dependence on parametric modeling assumptions. In current empirical practice, however, the matching step is often ignored in the calculation of standard errors and confidence intervals. In this article, we show that ignoring the matching step results in asymptotically valid standard errors if matching is done without replacement and the regression model is correctly specified relative to the population regression function of the outcome variable on the treatment variable and all the covariates used for matching. However, standard errors that ignore the matching step are not valid if matching is conducted with replacement or, more crucially, if the second step regression model is misspecified in the sense indicated above. Moreover, correct specification of the regression model is not required for consistent estimation of treatment effects with matched data. We show that two easily implementable alternatives produce approximations to the distribution of the post-matching estimator that are robust to misspecification. A simulation study and an empirical example demonstrate the empirical relevance of our results. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 983-995
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1840383
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1840383
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:983-995
Template-Type: ReDIF-Article 1.0
Author-Name: Yang Liu
Author-X-Name-First: Yang
Author-X-Name-Last: Liu
Author-Name: Feifang Hu
Author-X-Name-First: Feifang
Author-X-Name-Last: Hu
Title: Balancing Unobserved Covariates With Covariate-Adaptive Randomized Experiments
Abstract:
Balancing important covariates is often critical in clinical trials and causal inference. Stratified permuted block (STR-PB) and covariate-adaptive randomization (CAR) procedures are widely used to balance observed covariates in practice. The balance properties of these procedures with respect to the observed covariates have been well studied. However, it has been questioned whether these methods will also yield a good balance for the unobserved covariates. In this article, we develop a general framework for the analysis of the unobserved covariates imbalance. These results are applicable to develop and compare the balance properties of complete randomization (CR), STR-PB, and CAR procedures with respect to the unobserved covariates. To quantify the improvement obtained by using STR-PB and CAR procedures rather than CR, we introduce the percentage reduction in variance of the unobserved covariates imbalance and compare these quantities. Our results demonstrate the benefits of using CAR or STR-PB (when the number of strata is small relative to the sample size) in terms of balancing unobserved covariates. These results also pave the way for future research into the effect of unobserved covariates in covariate-adaptive randomized experiments in clinical trials, as well as many other applications. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 875-886
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1825450
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1825450
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:875-886
Template-Type: ReDIF-Article 1.0
Author-Name: Changliang Zou
Author-X-Name-First: Changliang
Author-X-Name-Last: Zou
Author-Name: Yuan Ke
Author-X-Name-First: Yuan
Author-X-Name-Last: Ke
Author-Name: Wenyang Zhang
Author-X-Name-First: Wenyang
Author-X-Name-Last: Zhang
Title: Estimation of Low Rank High-Dimensional Multivariate Linear Models for Multi-Response Data
Abstract:
In this article, we study low rank high-dimensional multivariate linear models (LRMLM) for high-dimensional multi-response data. We propose an intuitively appealing estimation approach and develop an algorithm for implementation purposes. Asymptotic properties are established to justify the estimation procedure theoretically. Intensive simulation studies are also conducted to demonstrate performance when the sample size is finite, and a comparison is made with some popular methods from the literature. The results show the proposed estimator outperforms all of the alternative methods under various circumstances. Finally, using our suggested estimation procedure we apply the LRMLM to analyze an environmental dataset and predict concentrations of PM2.5 at the locations concerned. The results illustrate how the proposed method provides more accurate predictions than the alternative approaches.
Journal: Journal of the American Statistical Association
Pages: 693-703
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1799813
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1799813
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:693-703
Template-Type: ReDIF-Article 1.0
Author-Name: Yunxiao Li
Author-X-Name-First: Yunxiao
Author-X-Name-Last: Li
Author-Name: Yi-Juan Hu
Author-X-Name-First: Yi-Juan
Author-X-Name-Last: Hu
Author-Name: Glen A. Satten
Author-X-Name-First: Glen A.
Author-X-Name-Last: Satten
Title: A Bottom-Up Approach to Testing Hypotheses That Have a Branching Tree Dependence Structure, With Error Rate Control
Abstract:
Modern statistical analyses often involve testing large numbers of hypotheses. In many situations, these hypotheses may have an underlying tree structure that both helps determine the order in which tests should be conducted and imposes a dependency between tests that must be accounted for. Our motivating example comes from testing the association between a trait of interest and groups of microbes that have been organized into operational taxonomic units (OTUs) or amplicon sequence variants (ASVs). Given p-values from association tests for each individual OTU or ASV, we would like to know if we can declare a certain species, genus, or higher taxonomic group to be associated with the trait. For this problem, a bottom-up testing algorithm that starts at the lowest level of the tree (OTUs or ASVs) and proceeds upward through successively higher taxonomic groupings (species, genus, family, etc.) is required. We develop such a bottom-up testing algorithm that controls a novel error rate that we call the false selection rate. By simulation, we also show that our approach is better at finding driver taxa, the highest-level taxa below which there are dense association signals. We illustrate our approach using data from a study of the microbiome among patients with ulcerative colitis and healthy controls. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 664-677
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1799811
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1799811
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:664-677
Template-Type: ReDIF-Article 1.0
Author-Name: Chenguang Dai
Author-X-Name-First: Chenguang
Author-X-Name-Last: Dai
Author-Name: Jun S. Liu
Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu
Title: Monte Carlo Approximation of Bayes Factors via Mixing With Surrogate Distributions
Abstract:
By mixing the target posterior distribution with a surrogate distribution, of which the normalizing constant is tractable, we propose a method for estimating the marginal likelihood using the Wang–Landau algorithm. We show that a faster convergence of the proposed method can be achieved via the momentum acceleration. Two implementation strategies are detailed: (i) facilitating global jumps between the posterior and surrogate distributions via the multiple-try Metropolis (MTM); (ii) constructing the surrogate via the variational approximation. When a surrogate is difficult to come by, we describe a new jumping mechanism for general reversible jump Markov chain Monte Carlo algorithms, which combines the MTM and a directional sampling algorithm. We illustrate the proposed methods on several statistical models, including the log-Gaussian Cox process, the Bayesian Lasso, the logistic regression, and the g-prior Bayesian variable selection. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 765-780
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1811100
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1811100
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:765-780
Template-Type: ReDIF-Article 1.0
Author-Name: Zeya Wang
Author-X-Name-First: Zeya
Author-X-Name-Last: Wang
Author-Name: Veerabhadran Baladandayuthapani
Author-X-Name-First: Veerabhadran
Author-X-Name-Last: Baladandayuthapani
Author-Name: Ahmed O. Kaseb
Author-X-Name-First: Ahmed O.
Author-X-Name-Last: Kaseb
Author-Name: Hesham M. Amin
Author-X-Name-First: Hesham M.
Author-X-Name-Last: Amin
Author-Name: Manal M. Hassan
Author-X-Name-First: Manal M.
Author-X-Name-Last: Hassan
Author-Name: Wenyi Wang
Author-X-Name-First: Wenyi
Author-X-Name-Last: Wang
Author-Name: Jeffrey S. Morris
Author-X-Name-First: Jeffrey S.
Author-X-Name-Last: Morris
Title: Bayesian Edge Regression in Undirected Graphical Models to Characterize Interpatient Heterogeneity in Cancer
Abstract:
It is well established that interpatient heterogeneity in cancer may significantly affect genomic data analyses and, in particular, network topologies. Most existing graphical model methods estimate a single population-level graph for a genomic or proteomic network. In many investigations, these networks depend on patient-specific indicators that characterize the heterogeneity of individual networks across subjects with respect to subject-level covariates. Examples include assessments of how the network varies with patient-specific prognostic scores or comparisons of tumor and normal graphs while accounting for tumor purity as a continuous predictor. In this article, we propose a novel edge regression model for undirected graphs, which estimates conditional dependencies as a function of subject-level covariates. We evaluate our model performance through simulation studies focused on comparing tumor and normal graphs while adjusting for tumor purity. In application to a dataset of proteomic measurements on plasma samples from patients with hepatocellular carcinoma (HCC), we ascertain how blood protein networks vary with disease severity, as measured by HepatoScore, a novel biomarker signature measuring disease severity. Our case study shows that network connectivity increases with HepatoScore, and a set of hub proteins as well as important protein connections are identified at different HepatoScore levels, which may provide important biological insights into the development of precision therapies for HCC.
Journal: Journal of the American Statistical Association
Pages: 533-546
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2021.2000866
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2000866
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:533-546
Template-Type: ReDIF-Article 1.0
Author-Name: Tianxi Li
Author-X-Name-First: Tianxi
Author-X-Name-Last: Li
Author-Name: Lihua Lei
Author-X-Name-First: Lihua
Author-X-Name-Last: Lei
Author-Name: Sharmodeep Bhattacharyya
Author-X-Name-First: Sharmodeep
Author-X-Name-Last: Bhattacharyya
Author-Name: Koen Van den Berge
Author-X-Name-First: Koen
Author-X-Name-Last: Van den Berge
Author-Name: Purnamrita Sarkar
Author-X-Name-First: Purnamrita
Author-X-Name-Last: Sarkar
Author-Name: Peter J. Bickel
Author-X-Name-First: Peter J.
Author-X-Name-Last: Bickel
Author-Name: Elizaveta Levina
Author-X-Name-First: Elizaveta
Author-X-Name-Last: Levina
Title: Hierarchical Community Detection by Recursive Partitioning
Abstract:
The problem of community detection in networks is usually formulated as finding a single partition of the network into some “correct” number of communities. We argue that it is more interpretable and in some regimes more accurate to construct a hierarchical tree of communities instead. This can be done with a simple top-down recursive partitioning algorithm, starting with a single community and separating the nodes into two communities by spectral clustering repeatedly, until a stopping rule suggests there are no further communities. This class of algorithms is model-free, computationally efficient, and requires no tuning other than selecting a stopping rule. We show that there are regimes where this approach outperforms K-way spectral clustering, and propose a natural framework for analyzing the algorithm’s theoretical performance, the binary tree stochastic block model. Under this model, we prove that the algorithm correctly recovers the entire community tree under relatively mild assumptions. We apply the algorithm to a gene network based on gene co-occurrence in 1580 research papers on anemia, and identify six clusters of genes in a meaningful hierarchy. We also illustrate the algorithm on a dataset of statistics papers. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 951-968
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1833888
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1833888
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:951-968
Template-Type: ReDIF-Article 1.0
Author-Name: Lazhi Wang
Author-X-Name-First: Lazhi
Author-X-Name-Last: Wang
Author-Name: David E. Jones
Author-X-Name-First: David E.
Author-X-Name-Last: Jones
Author-Name: Xiao-Li Meng
Author-X-Name-First: Xiao-Li
Author-X-Name-Last: Meng
Title: Warp Bridge Sampling: The Next Generation
Abstract:
Bridge sampling is an effective Monte Carlo (MC) method for estimating the ratio of normalizing constants of two probability densities, a routine computational problem in statistics, physics, chemistry, and other fields. The MC error of the bridge sampling estimator is determined by the amount of overlap between the two densities. In the case of unimodal densities, Warp-I, II, and III transformations are effective for increasing the initial overlap, but they are less so for multimodal densities. This article introduces Warp-U transformations that aim to transform multimodal densities into unimodal ones (hence “U”) without altering their normalizing constants. The construction of a Warp-U transformation starts with a normal (or other convenient) mixture distribution ϕmix that has reasonable overlap with the target density p, whose normalizing constant is unknown. The stochastic transformation that maps ϕmix back to its generating distribution N(0,1) is then applied to p, yielding its Warp-U version, which we denote p̃. Typically, p̃ is unimodal and has substantially increased overlap with ϕ. Furthermore, we prove that the overlap between p̃ and N(0,1) is guaranteed to be no less than the overlap between p and ϕmix, in terms of any f-divergence. We propose a computationally efficient method to find an appropriate ϕmix, and a simple but effective approach to remove the bias which results from estimating the normalizing constant and fitting ϕmix with the same data. We illustrate our findings using 10- and 50-dimensional highly irregular multimodal densities, and demonstrate how Warp-U sampling can be used to improve the final estimation step of the Generalized Wang–Landau algorithm, a powerful sampling and estimation approach. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 835-851
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1825447
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1825447
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:835-851
Template-Type: ReDIF-Article 1.0
Author-Name: Lars Arne Jordanger
Author-X-Name-First: Lars Arne
Author-X-Name-Last: Jordanger
Author-Name: Dag Tjøstheim
Author-X-Name-First: Dag
Author-X-Name-Last: Tjøstheim
Title: Nonlinear Spectral Analysis: A Local Gaussian Approach
Abstract:
The spectral distribution f(ω) of a stationary time series {Yt}t∈Z can be used to investigate whether or not periodic structures are present in {Yt}t∈Z, but f(ω) has some limitations due to its dependence on the autocovariances γ(h). For example, f(ω) cannot distinguish white iid noise from GARCH-type models (whose terms are dependent, but uncorrelated), which implies that f(ω) can be an inadequate tool when {Yt}t∈Z contains asymmetries and nonlinear dependencies. Asymmetries between the upper and lower tails of a time series can be investigated by means of the local Gaussian autocorrelations, and these local measures of dependence can be used to construct the local Gaussian spectral density presented in this paper. A key feature of the new local spectral density is that it coincides with f(ω) for Gaussian time series, which implies that it can be used to detect non-Gaussian traits in the time series under investigation. In particular, if f(ω) is flat, then peaks and troughs of the new local spectral density can indicate nonlinear traits, which potentially might discover local periodic phenomena that remain undetected in an ordinary spectral analysis.
Journal: Journal of the American Statistical Association
Pages: 1010-1027
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1840991
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1840991
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:1010-1027
Template-Type: ReDIF-Article 1.0
Author-Name: Jeremiah Zhe Liu
Author-X-Name-First: Jeremiah Zhe
Author-X-Name-Last: Liu
Author-Name: Wenying Deng
Author-X-Name-First: Wenying
Author-X-Name-Last: Deng
Author-Name: Jane Lee
Author-X-Name-First: Jane
Author-X-Name-Last: Lee
Author-Name: Pi-i Debby Lin
Author-X-Name-First: Pi-i Debby
Author-X-Name-Last: Lin
Author-Name: Linda Valeri
Author-X-Name-First: Linda
Author-X-Name-Last: Valeri
Author-Name: David C. Christiani
Author-X-Name-First: David C.
Author-X-Name-Last: Christiani
Author-Name: David C. Bellinger
Author-X-Name-First: David C.
Author-X-Name-Last: Bellinger
Author-Name: Robert O. Wright
Author-X-Name-First: Robert O.
Author-X-Name-Last: Wright
Author-Name: Maitreyi M. Mazumdar
Author-X-Name-First: Maitreyi M.
Author-X-Name-Last: Mazumdar
Author-Name: Brent A. Coull
Author-X-Name-First: Brent A.
Author-X-Name-Last: Coull
Title: A Cross-Validated Ensemble Approach to Robust Hypothesis Testing of Continuous Nonlinear Interactions: Application to Nutrition-Environment Studies
Abstract:
Gene-environment and nutrition-environment studies often involve testing of high-dimensional interactions between two sets of variables, each having potentially complex nonlinear main effects on an outcome. Construction of a valid and powerful hypothesis test for such an interaction is challenging, due to the difficulty in constructing an efficient and unbiased estimator for the complex, nonlinear main effects. In this work, we address this problem by proposing a cross-validated ensemble of kernels (CVEK) that learns the space of appropriate functions for the main effects using a cross-validated ensemble approach. With a carefully chosen library of base kernels, CVEK flexibly estimates the form of the main-effect functions from the data, and encourages test power by guarding against over-fitting under the alternative. The method is motivated by a study on the interaction between metal exposures in utero and maternal nutrition on children’s neurodevelopment in rural Bangladesh. The proposed tests identified evidence of an interaction between minerals and vitamins intake and arsenic and manganese exposures. Results suggest that the detrimental effects of these metals are most pronounced at low intake levels of the nutrients, suggesting nutritional interventions in pregnant women could mitigate the adverse impacts of in utero metal exposures on the children’s neurodevelopment. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 561-573
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2021.1962889
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1962889
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:561-573
Template-Type: ReDIF-Article 1.0
Author-Name: Hejian Sang
Author-X-Name-First: Hejian
Author-X-Name-Last: Sang
Author-Name: Jae Kwang Kim
Author-X-Name-First: Jae Kwang
Author-X-Name-Last: Kim
Author-Name: Danhyang Lee
Author-X-Name-First: Danhyang
Author-X-Name-Last: Lee
Title: Semiparametric Fractional Imputation Using Gaussian Mixture Models for Handling Multivariate Missing Data
Abstract:
Item nonresponse is frequently encountered in practice. Ignoring missing data can result in a loss of efficiency and lead to misleading inference. Fractional imputation is a frequentist approach to imputation for handling missing data. However, parametric fractional imputation may be subject to bias under model misspecification. In this article, we propose a novel semiparametric fractional imputation (SFI) method using Gaussian mixture models. The proposed method is computationally efficient and leads to robust estimation. The proposed method is further extended to incorporate categorical auxiliary information. The asymptotic model consistency and √n-consistency of the SFI estimator are also established. Some simulation studies are presented to check the finite sample performance of the proposed method. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 654-663
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1796358
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1796358
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:654-663
Template-Type: ReDIF-Article 1.0
Author-Name: Erin E. Gabriel
Author-X-Name-First: Erin E.
Author-X-Name-Last: Gabriel
Author-Name: Michael C. Sachs
Author-X-Name-First: Michael C.
Author-X-Name-Last: Sachs
Author-Name: Arvid Sjölander
Author-X-Name-First: Arvid
Author-X-Name-Last: Sjölander
Title: Causal Bounds for Outcome-Dependent Sampling in Observational Studies
Abstract:
Outcome-dependent sampling designs are common in many different scientific fields including epidemiology, ecology, and economics. As with all observational studies, such designs often suffer from unmeasured confounding, which generally precludes the nonparametric identification of causal effects. Nonparametric bounds can provide a way to narrow the range of possible values for a nonidentifiable causal effect without making additional untestable assumptions. The nonparametric bounds literature has almost exclusively focused on settings with random sampling, and the bounds have often been derived with a particular linear programming method. We derive novel bounds for the causal risk difference, often referred to as the average treatment effect, in six settings with outcome-dependent sampling and unmeasured confounding for a binary outcome and exposure. Our derivations of the bounds illustrate two approaches that may be applicable in other settings where the bounding problem cannot be directly stated as a system of linear constraints. We illustrate our derived bounds in a real data example involving the effect of vitamin D concentration on mortality. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 939-950
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1832502
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1832502
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:939-950
Template-Type: ReDIF-Article 1.0
Author-Name: Michele Peruzzi
Author-X-Name-First: Michele
Author-X-Name-Last: Peruzzi
Author-Name: Sudipto Banerjee
Author-X-Name-First: Sudipto
Author-X-Name-Last: Banerjee
Author-Name: Andrew O. Finley
Author-X-Name-First: Andrew O.
Author-X-Name-Last: Finley
Title: Highly Scalable Bayesian Geostatistical Modeling via Meshed Gaussian Processes on Partitioned Domains
Abstract:
We introduce a class of scalable Bayesian hierarchical models for the analysis of massive geostatistical datasets. The underlying idea combines approaches from high-dimensional geostatistics: partitioning the spatial domain and modeling the regions in the partition using a sparsity-inducing directed acyclic graph (DAG). We extend the model over the DAG to a well-defined spatial process, which we call the meshed Gaussian process (MGP). A major contribution is the development of MGPs on tessellated domains, accompanied by a Gibbs sampler for the efficient recovery of spatial random effects. In particular, the cubic MGP (Q-MGP) can harness high-performance computing resources by executing all large-scale operations in parallel within the Gibbs sampler, improving mixing and computing time compared to sequential updating schemes. Unlike some existing models for large spatial data, a Q-MGP facilitates massive caching of expensive matrix operations, making it particularly apt for spatiotemporal remote-sensing data. We compare Q-MGPs against state-of-the-art methods on large synthetic and real-world data. We also illustrate the method using Normalized Difference Vegetation Index data from the Serengeti park region to recover latent multivariate spatiotemporal random effects at millions of locations. The source code is available at github.com/mkln/meshgp. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 969-982
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1833889
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1833889
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:969-982
Template-Type: ReDIF-Article 1.0
Author-Name: The Editors
Title: Correction
Journal: Journal of the American Statistical Association
Pages: 1043-1043
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2022.2060607
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2060607
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:1043-1043
Template-Type: ReDIF-Article 1.0
Author-Name: Somabha Mukherjee
Author-X-Name-First: Somabha
Author-X-Name-Last: Mukherjee
Author-Name: Divyansh Agarwal
Author-X-Name-First: Divyansh
Author-X-Name-Last: Agarwal
Author-Name: Nancy R. Zhang
Author-X-Name-First: Nancy R.
Author-X-Name-Last: Zhang
Author-Name: Bhaswar B. Bhattacharya
Author-X-Name-First: Bhaswar B.
Author-X-Name-Last: Bhattacharya
Title: Distribution-Free Multisample Tests Based on Optimal Matchings With Applications to Single Cell Genomics
Abstract:
In this article, we propose a nonparametric graphical test based on optimal matching, for assessing the equality of multiple unknown multivariate probability distributions. Our procedure pools the data from the different classes to create a graph based on the minimum non-bipartite matching, and then utilizes the number of edges connecting data points from different classes to examine the closeness between the distributions. The proposed test is exactly distribution-free (the null distribution does not depend on the distribution of the data) and can be efficiently applied to multivariate as well as non-Euclidean data, whenever the inter-point distances are well-defined. We show that the test is universally consistent, and prove a distributional limit theorem for the test statistic under general alternatives. Through simulation studies, we demonstrate its superior performance against other common and well-known multisample tests. The method is applied to single cell transcriptomics data obtained from the peripheral blood, cancer tissue, and tumor-adjacent normal tissue of human subjects with hepatocellular carcinoma and non-small-cell lung cancer. Our method unveils patterns in how biochemical metabolic pathways are altered across immune cells in a cancer setting, depending on the tissue location. All of the methods described herein are implemented in the R package multicross. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 627-638
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1791131
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1791131
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:627-638
Template-Type: ReDIF-Article 1.0
Author-Name: Koen Jochmans
Author-X-Name-First: Koen
Author-X-Name-Last: Jochmans
Title: Heteroscedasticity-Robust Inference in Linear Regression Models With Many Covariates
Abstract:
We consider inference in linear regression models that is robust to heteroscedasticity and the presence of many control variables. When the number of control variables increases at the same rate as the sample size, the usual heteroscedasticity-robust estimators of the covariance matrix are inconsistent. Hence, tests based on these estimators are size distorted even in large samples. An alternative covariance-matrix estimator for such a setting is presented that complements recent work by Cattaneo, Jansson, and Newey. We provide high-level conditions for our approach to deliver (asymptotically) size-correct inference as well as more primitive conditions for three special cases. Simulation results and an empirical illustration to inference on the union premium are also provided. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 887-896
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1831924
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1831924
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:887-896
Template-Type: ReDIF-Article 1.0
Author-Name: Yang Ni
Author-X-Name-First: Yang
Author-X-Name-Last: Ni
Title: Bayesian Thinking in Biostatistics.
Journal: Journal of the American Statistical Association
Pages: 1041-1042
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2022.2069442
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2069442
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:1041-1042
Template-Type: ReDIF-Article 1.0
Author-Name: Gery Geenens
Author-X-Name-First: Gery
Author-X-Name-Last: Geenens
Author-Name: Pierre Lafaye de Micheaux
Author-X-Name-First: Pierre
Author-X-Name-Last: Lafaye de Micheaux
Title: The Hellinger Correlation
Abstract:
In this article, the defining properties of any valid measure of the dependence between two continuous random variables are revisited and complemented with two original ones, shown to imply other usual postulates. While other popular choices are proved to violate some of these requirements, a class of dependence measures satisfying all of them is identified. One particular measure, that we call the Hellinger correlation, appears as a natural choice within that class due to both its theoretical and intuitive appeal. A simple and efficient nonparametric estimator for that quantity is proposed, with its implementation publicly available in the R package HellCor. Synthetic and real-data examples illustrate the descriptive ability of the measure, which can also be used as test statistic for exact independence testing. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 639-653
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1791132
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1791132
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:639-653
Template-Type: ReDIF-Article 1.0
Author-Name: Xuan Bi
Author-X-Name-First: Xuan
Author-X-Name-Last: Bi
Author-Name: Long Feng
Author-X-Name-First: Long
Author-X-Name-Last: Feng
Author-Name: Cai Li
Author-X-Name-First: Cai
Author-X-Name-Last: Li
Author-Name: Heping Zhang
Author-X-Name-First: Heping
Author-X-Name-Last: Zhang
Title: Modeling Pregnancy Outcomes Through Sequentially Nested Regression Models
Abstract:
Polycystic ovary syndrome (PCOS) is one of the most common causes of infertility among women of reproductive age. Unfortunately, the etiology of PCOS is poorly understood. Large-scale clinical trials for pregnancy in polycystic ovary syndrome (PPCOS) were conducted to evaluate the effectiveness of treatments. Ovulation, pregnancy, and live birth are three sequentially nested binary outcomes, typically analyzed separately. However, the separate models may lose power in detecting the treatment effects and influential variables for live birth, due to decreased sample sizes and unbalanced event counts. It has been a long-held hypothesis among clinicians that some of the important variables for early pregnancy outcomes may continue their influence on live birth. To consider this possibility, we develop an l0-norm-based regularization method in favor of variables that have been identified from an earlier stage. Our approach explicitly bridges the connections across nested outcomes through computationally easy algorithms and enjoys theoretical guarantees of estimation and variable selection. By analyzing the PPCOS data, we successfully uncover the hidden influence of risk factors on live birth, which confirms clinical experience. Moreover, we provide novel infertility treatment recommendations (e.g., letrozole vs. clomiphene citrate) for women with PCOS to improve their chances of live birth. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 602-616
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2021.2006666
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2006666
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:602-616
Template-Type: ReDIF-Article 1.0
Author-Name: M. Hallin
Author-X-Name-First: M.
Author-X-Name-Last: Hallin
Author-Name: D. La Vecchia
Author-X-Name-First: D.
Author-X-Name-Last: La Vecchia
Author-Name: H. Liu
Author-X-Name-First: H.
Author-X-Name-Last: Liu
Title: Center-Outward R-Estimation for Semiparametric VARMA Models
Abstract:
We propose a new class of R-estimators for semiparametric VARMA models in which the innovation density plays the role of the nuisance parameter. Our estimators are based on the novel concepts of multivariate center-outward ranks and signs. We show that these concepts, combined with Le Cam’s asymptotic theory of statistical experiments, yield a class of semiparametric estimation procedures, which are efficient (at a given reference density), root-n consistent, and asymptotically normal under a broad class of (possibly non-elliptical) actual innovation densities. No kernel density estimation is required to implement our procedures. A Monte Carlo comparative study of our R-estimators and other routinely applied competitors demonstrates the benefits of the novel methodology, in large and small samples. Proofs, computational aspects, and further numerical results are available in the supplementary materials. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 925-938
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1832501
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1832501
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:925-938
Template-Type: ReDIF-Article 1.0
Author-Name: Luella Fu
Author-X-Name-First: Luella
Author-X-Name-Last: Fu
Author-Name: Bowen Gang
Author-X-Name-First: Bowen
Author-X-Name-Last: Gang
Author-Name: Gareth M. James
Author-X-Name-First: Gareth M.
Author-X-Name-Last: James
Author-Name: Wenguang Sun
Author-X-Name-First: Wenguang
Author-X-Name-Last: Sun
Title: Heteroscedasticity-Adjusted Ranking and Thresholding for Large-Scale Multiple Testing
Abstract:
Standardization has been a widely adopted practice in multiple testing, for it takes into account the variability in sampling and makes the test statistics comparable across different study units. However, despite conventional wisdom to the contrary, we show that there can be a significant loss in information from basing hypothesis tests on standardized statistics rather than the full data. We develop a new class of heteroscedasticity-adjusted ranking and thresholding (HART) rules that aim to improve existing methods by simultaneously exploiting commonalities and adjusting heterogeneities among the study units. The main idea of HART is to bypass standardization by directly incorporating both the summary statistic and its variance into the testing procedure. A key message is that the variance structure of the alternative distribution, which is subsumed under standardized statistics, is highly informative and can be exploited to achieve higher power. The proposed HART procedure is shown to be asymptotically valid and optimal for false discovery rate (FDR) control. Our simulation results demonstrate that HART achieves substantial power gain over existing methods at the same FDR level. We illustrate the implementation through a microarray analysis of myeloma.
Journal: Journal of the American Statistical Association
Pages: 1028-1040
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1840992
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1840992
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:1028-1040
Template-Type: ReDIF-Article 1.0
Author-Name: Zilin Li
Author-X-Name-First: Zilin
Author-X-Name-Last: Li
Author-Name: Yaowu Liu
Author-X-Name-First: Yaowu
Author-X-Name-Last: Liu
Author-Name: Xihong Lin
Author-X-Name-First: Xihong
Author-X-Name-Last: Lin
Title: Simultaneous Detection of Signal Regions Using Quadratic Scan Statistics With Applications to Whole Genome Association Studies
Abstract:
We consider in this article detection of signal regions associated with disease outcomes in whole genome association studies. Gene- or region-based methods have become increasingly popular in whole genome association analysis as a complementary approach to traditional individual variant analysis. However, these methods test for the association between an outcome and the genetic variants in a prespecified region, for example, a gene. In view of massive intergenic regions in whole genome sequencing (WGS) studies, we propose a computationally efficient quadratic scan (Q-SCAN) statistic based method to detect the existence and the locations of signal regions by scanning the genome continuously. The proposed method accounts for the correlation (linkage disequilibrium) among genetic variants, and allows for signal regions to have both causal and neutral variants, and the effects of signal variants to be in different directions. We study the asymptotic properties of the proposed Q-SCAN statistics. We derive an empirical threshold that controls for the family-wise error rate, and show that under regularity conditions the proposed method consistently selects the true signal regions. We perform simulation studies to evaluate the finite sample performance of the proposed method. Our simulation results show that the proposed procedure outperforms the existing methods, especially when signal regions have causal variants whose effects are in different directions, or are contaminated with neutral variants. We illustrate Q-SCAN by analyzing the WGS data from the Atherosclerosis Risk in Communities study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 823-834
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1822849
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1822849
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:823-834
Template-Type: ReDIF-Article 1.0
Author-Name: Eftychia Solea
Author-X-Name-First: Eftychia
Author-X-Name-Last: Solea
Author-Name: Bing Li
Author-X-Name-First: Bing
Author-X-Name-Last: Li
Title: Copula Gaussian Graphical Models for Functional Data
Abstract:
We introduce a statistical graphical model for multivariate functional data, which are common in medical applications such as EEG and fMRI. Recently published functional graphical models rely on the multivariate Gaussian process assumption, but we relax it by introducing the functional copula Gaussian graphical model (FCGGM). This model removes the marginal Gaussian assumption but retains the simplicity of the Gaussian dependence structure, which is particularly attractive for large data. We develop four estimators for the FCGGM and establish the consistency and the convergence rates of one of them. We compare our FCGGM with the existing functional Gaussian graphical model by simulations, and apply our method to an EEG dataset to construct brain networks. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 781-793
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1817750
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1817750
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:781-793
Template-Type: ReDIF-Article 1.0
Author-Name: Luca Frigau
Author-X-Name-First: Luca
Author-X-Name-Last: Frigau
Author-Name: Qiuyi Wu
Author-X-Name-First: Qiuyi
Author-X-Name-Last: Wu
Author-Name: David Banks
Author-X-Name-First: David
Author-X-Name-Last: Banks
Title: Optimizing the JSM Program
Abstract:
Sometimes the Joint Statistical Meetings (JSM) is frustrating to attend, because multiple sessions on the same topic are scheduled at the same time. This article uses seeded latent Dirichlet allocation and a scheduling optimization algorithm to substantially reduce overlapping content in the original schedule for the 2020 JSM program. Specifically, a measure based on total variation distance that ranges from 0 (random scheduling) to 1 (no overlapping content) finds that the original schedule had a score of 0.058, whereas our proposed schedule achieved a score of 0.371. This is a substantial improvement that would (i) increase participant satisfaction as measured by the post-JSM satisfaction survey, and (ii) save the American Statistical Association significant money by obviating the need for the traditional in-person meeting of the 47 program chairs and other organizers. The methodology developed in this work immediately applies to future JSMs and is easily modified to improve scheduling for any other scientific conference that has parallel sessions.
Journal: Journal of the American Statistical Association
Pages: 617-626
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2021.1978466
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1978466
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:617-626
Template-Type: ReDIF-Article 1.0
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Jianhua Guo
Author-X-Name-First: Jianhua
Author-X-Name-Last: Guo
Author-Name: Shurong Zheng
Author-X-Name-First: Shurong
Author-X-Name-Last: Zheng
Title: Estimating Number of Factors by Adjusted Eigenvalues Thresholding
Abstract:
Determining the number of common factors is an important and practical topic in high-dimensional factor models. The existing literature is mainly based on the eigenvalues of the covariance matrix. Owing to the incomparability of the eigenvalues of the covariance matrix caused by the heterogeneous scales of the observed variables, it is not easy to find an accurate relationship between these eigenvalues and the number of common factors. To overcome this limitation, we appeal to the correlation matrix and demonstrate, surprisingly, that the number of eigenvalues greater than 1 of the population correlation matrix is the same as the number of common factors under certain mild conditions. To exploit this relationship, we study random matrix theory based on the sample correlation matrix to correct biases in estimating the top eigenvalues and to account for errors in eigenvalue estimation. Thus, we propose a tuning-free scale-invariant adjusted correlation thresholding (ACT) method for determining the number of common factors in high-dimensional factor models, taking into account the sampling variabilities and biases of top sample eigenvalues. We also establish the optimality of the proposed ACT method in terms of minimal signal strength and the optimal threshold. Simulation studies lend further support to our proposed method and show that our estimator outperforms competing methods in most test cases. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 852-861
Issue: 538
Volume: 117
Year: 2022
Month: 4
X-DOI: 10.1080/01621459.2020.1825448
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1825448
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:538:p:852-861
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2096039_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Marianna Pensky
Author-X-Name-First: Marianna
Author-X-Name-Last: Pensky
Title: Discussion of “Confidence Intervals for Nonparametric Empirical Bayes Analysis”
Journal: Journal of the American Statistical Association
Pages: 1183-1185
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2096039
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2096039
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1183-1185
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2087659_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Chenguang Dai
Author-X-Name-First: Chenguang
Author-X-Name-Last: Dai
Author-Name: Jeremy Heng
Author-X-Name-First: Jeremy
Author-X-Name-Last: Heng
Author-Name: Pierre E. Jacob
Author-X-Name-First: Pierre E.
Author-X-Name-Last: Jacob
Author-Name: Nick Whiteley
Author-X-Name-First: Nick
Author-X-Name-Last: Whiteley
Title: An Invitation to Sequential Monte Carlo Samplers
Abstract:
Statisticians often use Monte Carlo methods to approximate probability distributions, primarily with Markov chain Monte Carlo and importance sampling. Sequential Monte Carlo samplers are a class of algorithms that combine both techniques to approximate distributions of interest and their normalizing constants. These samplers originate from particle filtering for state space models and have become general and scalable sampling techniques. This article describes sequential Monte Carlo samplers and their possible implementations, arguing that they remain under-used in statistics, despite their ability to perform sequential inference and to leverage parallel processing resources among other potential benefits. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1587-1600
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2087659
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2087659
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1587-1600
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1862670_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Belmiro P. M. Duarte
Author-X-Name-First: Belmiro P. M.
Author-X-Name-Last: Duarte
Author-Name: Anthony C. Atkinson
Author-X-Name-First: Anthony C.
Author-X-Name-Last: Atkinson
Author-Name: José F. O. Granjo
Author-X-Name-First: José F. O.
Author-X-Name-Last: Granjo
Author-Name: Nuno M. C. Oliveira
Author-X-Name-First: Nuno M. C.
Author-X-Name-Last: Oliveira
Title: Optimal Design of Experiments for Implicit Models
Abstract:
Explicit models representing the response variables as functions of the control variables are standard in virtually all scientific fields. For these models, there is a vast literature on the optimal design of experiments (ODoE) to provide good estimates of the parameters with the use of minimal resources. In contrast, the ODoE for implicit models is more complex and has not been systematically addressed. Nevertheless, there are practical examples where the models relating the response variables, the parameters and the factors are implicit or hardly convertible into an explicit form. We propose a general formulation for developing the theory of the ODoE for implicit algebraic models to specifically find continuous local designs. The treatment relies on converting the ODoE problem into an optimization problem of the nonlinear programming (NLP) class which includes the construction of the parameter sensitivities and the Cholesky decomposition of the Fisher information matrix. The NLP problem generated has multiple local optima, and we use global solvers, combined with an equivalence theorem from the theory of ODoE, to ensure the global optimality of our continuous optimal designs. We consider D- and A-optimality criteria and apply the approach to five examples of practical interest in chemistry and thermodynamics. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1424-1437
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1862670
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1862670
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1424-1437
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2040519_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Peng Shi
Author-X-Name-First: Peng
Author-X-Name-Last: Shi
Author-Name: Gee Y. Lee
Author-X-Name-First: Gee Y.
Author-X-Name-Last: Lee
Title: Copula Regression for Compound Distributions with Endogenous Covariates with Applications in Insurance Deductible Pricing
Abstract:
This article concerns deductible pricing in nonlife insurance contracts. The primary interest of insurers is the effect of the contract deductible on a policyholder’s aggregate loss, which is determined by a compound distribution where the sum of individual claim amounts is stopped by the number of claims. Policyholders choose the deductible level based on their hidden risks, which makes the deductible endogenous in the regressions for both claim frequency and claim severity. To address the endogeneity in the regression for the compound aggregate loss, we introduce a novel approach using pair copula constructions to jointly model the policyholder’s deductible, number of claims, and individual claim amounts, in the context of compound distributions. The proposed method provides insurers an empirical tool to uncover the underlying risk distribution of potential customers. In the application, we consider an insurance portfolio from the property insurance program that provides property coverage for buildings and contents of local government entities of Wisconsin. Using historical data on policyholders and insurance claims, we first provide empirical evidence of the endogeneity of the deductible. Interestingly, we find that the policyholder’s deductible is negatively associated with the claim frequency but positively associated with the claim severity. For the portfolio of policyholders, the endogenous deductible model provides superior prediction for 65% and 71% of policyholders for claim frequency and severity, respectively. The endogeneity of the deductible has significant managerial implications for insurance operations. In particular, the risk score suggested by the proposed method allows the insurer to identify additional profitable underwriting strategies, quantified by Gini indices of 0.22 and 0.13 when switching from the exogenous deductible premium and the insurer’s contract premium, respectively. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1094-1109
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2040519
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2040519
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1094-1109
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2093726_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Subhashis Ghosal
Author-X-Name-First: Subhashis
Author-X-Name-Last: Ghosal
Title: Discussion of “Confidence Intervals for Nonparametric Empirical Bayes Analysis” by Ignatiadis and Wager
Journal: Journal of the American Statistical Association
Pages: 1171-1174
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2093726
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2093726
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1171-1174
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2008403_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Nikolaos Ignatiadis
Author-X-Name-First: Nikolaos
Author-X-Name-Last: Ignatiadis
Author-Name: Stefan Wager
Author-X-Name-First: Stefan
Author-X-Name-Last: Wager
Title: Confidence Intervals for Nonparametric Empirical Bayes Analysis
Abstract:
In an empirical Bayes analysis, we use data from repeated sampling to imitate inferences made by an oracle Bayesian with extensive knowledge of the data-generating distribution. Existing results provide a comprehensive characterization of when and why empirical Bayes point estimates accurately recover oracle Bayes behavior. In this paper, we develop flexible and practical confidence intervals that provide asymptotic frequentist coverage of empirical Bayes estimands, such as the posterior mean or the local false sign rate. The coverage statements hold even when the estimands are only partially identified or when empirical Bayes point estimates converge very slowly. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1149-1166
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2021.2008403
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2008403
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1149-1166
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1858838_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Likun Zhang
Author-X-Name-First: Likun
Author-X-Name-Last: Zhang
Author-Name: Benjamin A. Shaby
Author-X-Name-First: Benjamin A.
Author-X-Name-Last: Shaby
Author-Name: Jennifer L. Wadsworth
Author-X-Name-First: Jennifer L.
Author-X-Name-Last: Wadsworth
Title: Hierarchical Transformed Scale Mixtures for Flexible Modeling of Spatial Extremes on Datasets With Many Locations
Abstract:
Flexible spatial models that allow transitions between tail dependence classes have recently appeared in the literature. However, inference for these models is computationally prohibitive, even in moderate dimensions, due to the necessity of repeatedly evaluating the multivariate Gaussian distribution function. In this work, we attempt to achieve truly high-dimensional inference for extremes of spatial processes, while retaining the desirable flexibility in the tail dependence structure, by modifying an established class of models based on scale mixtures of Gaussian processes. We show that the desired extremal dependence properties from the original models are preserved under the modification, and demonstrate that the corresponding Bayesian hierarchical model does not involve the expensive computation of the multivariate Gaussian distribution function. We fit our model to exceedances of a high threshold, and perform coverage analyses and cross-model checks to validate its ability to capture different types of tail characteristics. We use a standard adaptive Metropolis algorithm for model fitting, and further accelerate the computation via parallelization and Rcpp. Lastly, we apply the model to a dataset of a fire threat index on the Great Plains region of the United States, which is vulnerable to massively destructive wildfires. We find that the joint tail of the fire threat index exhibits a decaying dependence structure that cannot be captured by limiting extreme value models. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1357-1369
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1858838
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1858838
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1357-1369
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2101797_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Yang Zhou
Author-X-Name-First: Yang
Author-X-Name-Last: Zhou
Author-Name: Lirong Xue
Author-X-Name-First: Lirong
Author-X-Name-Last: Xue
Author-Name: Zhengyu Shi
Author-X-Name-First: Zhengyu
Author-X-Name-Last: Shi
Author-Name: Libo Wu
Author-X-Name-First: Libo
Author-X-Name-Last: Wu
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Title: Rejoinder
Journal: Journal of the American Statistical Association
Pages: 1066-1067
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2101797
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2101797
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1066-1067
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1859379_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: T. Tony Cai
Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai
Author-Name: Wenguang Sun
Author-X-Name-First: Wenguang
Author-X-Name-Last: Sun
Author-Name: Yin Xia
Author-X-Name-First: Yin
Author-X-Name-Last: Xia
Title: LAWS: A Locally Adaptive Weighting and Screening Approach to Spatial Multiple Testing
Abstract:
Exploiting spatial patterns in large-scale multiple testing promises to improve both power and interpretability of false discovery rate (FDR) analyses. This article develops a new class of locally adaptive weighting and screening (LAWS) rules that directly incorporates useful local patterns into inference. The idea involves constructing robust and structure-adaptive weights according to the estimated local sparsity levels. LAWS provides a unified framework for a broad range of spatial problems and is fully data-driven. It is shown that LAWS controls the FDR asymptotically under mild conditions on dependence. The finite sample performance is investigated using simulated data, which demonstrates that LAWS controls the FDR and outperforms existing methods in power. The efficiency gain is substantial in many settings. We further illustrate the merits of LAWS through applications to the analysis of two-dimensional and three-dimensional images. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1370-1383
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1859379
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1859379
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1370-1383
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1865168_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Ted Westling
Author-X-Name-First: Ted
Author-X-Name-Last: Westling
Title: Nonparametric Tests of the Causal Null With Nondiscrete Exposures
Abstract:
In many scientific studies, it is of interest to determine whether an exposure has a causal effect on an outcome. In observational studies, this is a challenging task due to the presence of confounding variables that affect both the exposure and the outcome. Many methods have been developed to test for the presence of a causal effect when all such confounding variables are observed and when the exposure of interest is discrete. In this article, we propose a class of nonparametric tests of the null hypothesis that there is no average causal effect of an arbitrary univariate exposure on an outcome in the presence of observed confounding. Our tests apply to discrete, continuous, and mixed discrete-continuous exposures. We demonstrate that our proposed tests are doubly robust consistent, that they have correct asymptotic Type I error if both nuisance parameters involved in the problem are estimated at fast enough rates, and that they have power to detect local alternatives approaching the null at the rate n^{-1/2}. We study the performance of our tests in numerical studies, and use them to test for the presence of a causal effect of BMI on immune response in early phase vaccine trials. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1551-1562
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1865168
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1865168
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1551-1562
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1855183_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Di Wang
Author-X-Name-First: Di
Author-X-Name-Last: Wang
Author-Name: Yao Zheng
Author-X-Name-First: Yao
Author-X-Name-Last: Zheng
Author-Name: Heng Lian
Author-X-Name-First: Heng
Author-X-Name-Last: Lian
Author-Name: Guodong Li
Author-X-Name-First: Guodong
Author-X-Name-Last: Li
Title: High-Dimensional Vector Autoregressive Time Series Modeling via Tensor Decomposition
Abstract:
The classical vector autoregressive model is a fundamental tool for multivariate time series analysis. However, it involves too many parameters when the number of time series and lag order are even moderately large. This article proposes to rearrange the transition matrices of the model into a tensor form such that the parameter space can be restricted along three directions simultaneously via tensor decomposition. In contrast, the reduced-rank regression method can restrict the parameter space in only one direction. Besides achieving substantial dimension reduction, the proposed model is interpretable from the factor modeling perspective. Moreover, to handle high-dimensional time series, this article considers imposing sparsity on factor matrices to improve the model interpretability and estimation efficiency, which leads to a sparsity-inducing estimator. For the low-dimensional case, we derive asymptotic properties of the proposed least squares estimator and introduce an alternating least squares algorithm. For the high-dimensional case, we establish nonasymptotic properties of the sparsity-inducing estimator and propose an ADMM algorithm for regularized estimation. Simulation experiments and a real data example demonstrate the advantages of the proposed approach over various existing methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1338-1356
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1855183
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1855183
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1338-1356
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2053136_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Xu Guo
Author-X-Name-First: Xu
Author-X-Name-Last: Guo
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Author-Name: Jingyuan Liu
Author-X-Name-First: Jingyuan
Author-X-Name-Last: Liu
Author-Name: Mudong Zeng
Author-X-Name-First: Mudong
Author-X-Name-Last: Zeng
Title: High-Dimensional Mediation Analysis for Selecting DNA Methylation Loci Mediating Childhood Trauma and Cortisol Stress Reactivity
Abstract:
Childhood trauma tends to influence cortisol stress reactivity through the mediating effects of DNA methylation. Houtepen et al. conducted a study to investigate the role of DNA methylation in cortisol stress reactivity and its relationship with childhood trauma. The study collected a dataset consisting of 385,882 DNA methylation loci, cortisol stress reactivity, a one-dimensional score on a childhood trauma questionnaire, and several covariates for 85 healthy individuals. Of great scientific interest is identifying the active mediating loci among the 385,882 candidates. Houtepen et al. conducted 385,882 linear mediation analyses, in each of which one locus was considered, and identified three active mediating loci. More recently, van Kesteren and Oberski proposed a coordinate-wise mediation filter (CMF) and applied it to the same dataset. They identified five active mediating loci. Unfortunately, the three loci identified by Houtepen et al. are completely different from the five loci identified by van Kesteren and Oberski, probably because neither analysis considered all loci jointly. The high dimensionality of the DNA methylation loci indeed necessitates new techniques for identifying active mediating loci and testing the direct and indirect effects of early life traumatic stress on later cortisol alteration. Motivated by the contradictory results of the aforementioned two scientific works, we develop a new estimation and testing procedure, and apply it to the same dataset analyzed by those works. We identify three new loci: cg19230917, cg06422529 and cg03199124, whose effect sizes are 321.196 (p-value = 0.035965), 418.173 (p-value = 0.000234) and 471.865 (p-value = 0.001691), respectively. These three loci possess both reasonable neurobiological interpretations and statistically significant effects via our proposed tests.
Based on our new procedure, we further confirm that childhood trauma does not have a significant direct effect on cortisol change; it affects cortisol only indirectly through DNA methylation, and the indirect effect is negative. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1110-1121
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2053136
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2053136
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1110-1121
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2093727_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Dongyue Xie
Author-X-Name-First: Dongyue
Author-X-Name-Last: Xie
Author-Name: Matthew Stephens
Author-X-Name-First: Matthew
Author-X-Name-Last: Stephens
Title: Discussion of “Confidence Intervals for Nonparametric Empirical Bayes Analysis”
Journal: Journal of the American Statistical Association
Pages: 1186-1191
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2093727
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2093727
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1186-1191
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1851236_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Anders Bredahl Kock
Author-X-Name-First: Anders Bredahl
Author-X-Name-Last: Kock
Author-Name: David Preinerstorfer
Author-X-Name-First: David
Author-X-Name-Last: Preinerstorfer
Author-Name: Bezirgen Veliyev
Author-X-Name-First: Bezirgen
Author-X-Name-Last: Veliyev
Title: Functional Sequential Treatment Allocation
Abstract:
Consider a setting in which a policy maker assigns subjects to treatments, observing each outcome before the next subject arrives. Initially, it is unknown which treatment is best, but the sequential nature of the problem permits learning about the effectiveness of the treatments. While the multi-armed-bandit literature has shed much light on the situation when the policy maker compares the effectiveness of the treatments through their mean, much less is known about other targets. This is restrictive, because a cautious decision maker may prefer to target a robust location measure such as a quantile or a trimmed mean. Furthermore, socio-economic decision making often requires targeting purpose specific characteristics of the outcome distribution, such as its inherent degree of inequality, welfare or poverty. In the present article, we introduce and study sequential learning algorithms when the distributional characteristic of interest is a general functional of the outcome distribution. Minimax expected regret optimality results are obtained within the subclass of explore-then-commit policies, and for the unrestricted class of all policies. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1311-1323
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1851236
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1851236
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1311-1323
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1844719_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Ben Dai
Author-X-Name-First: Ben
Author-X-Name-Last: Dai
Author-Name: Xiaotong Shen
Author-X-Name-First: Xiaotong
Author-X-Name-Last: Shen
Author-Name: Wing Wong
Author-X-Name-First: Wing
Author-X-Name-Last: Wong
Title: Coupled Generation
Abstract:
Instance generation creates representative examples to interpret a learning model, as in regression and classification. For example, representative sentences of a topic of interest describe the topic specifically for sentence categorization. In such a situation, a large number of unlabeled observations may be available in addition to labeled data; for example, many unclassified text corpora (unlabeled instances) are available with only a few classified sentences (labeled instances). In this article, we introduce a novel generative method, called a coupled generator, producing instances given a specific learning outcome, based on indirect and direct generators. The indirect generator uses the inverse principle to yield the corresponding inverse probability, enabling instance generation that leverages unlabeled data. The direct generator learns the distribution of an instance given its learning outcome. Then, the coupled generator seeks the best one from the indirect and direct generators, which is designed to enjoy the benefits of both and deliver higher generation accuracy. For sentence generation given a topic, we develop an embedding-based regression/classification in conjunction with an unconditional recurrent neural network for the indirect generator, whereas a conditional recurrent neural network is natural for the corresponding direct generator. Moreover, we derive finite-sample generation error bounds for the indirect and direct generators to reveal the generative aspects of both methods, thus explaining the benefits of the coupled generator. Finally, we apply the proposed methods to a real benchmark of abstract classification and demonstrate that the coupled generator composes reasonably good sentences from a dictionary to describe a specific topic of interest. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1243-1253
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1844719
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1844719
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1243-1253
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2098134_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Hyunwoo Park
Author-X-Name-First: Hyunwoo
Author-X-Name-Last: Park
Title: A History of Data Visualization and Graphic Communication
Journal: Journal of the American Statistical Association
Pages: 1601-1603
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2098134
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2098134
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1601-1603
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1862668_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Zhaoxing Gao
Author-X-Name-First: Zhaoxing
Author-X-Name-Last: Gao
Author-Name: Ruey S. Tsay
Author-X-Name-First: Ruey S.
Author-X-Name-Last: Tsay
Title: Modeling High-Dimensional Time Series: A Factor Model With Dynamically Dependent Factors and Diverging Eigenvalues
Abstract:
This article proposes a new approach to modeling high-dimensional time series by treating a p-dimensional time series as a nonsingular linear transformation of certain common factors and idiosyncratic components. Unlike the approximate factor models, we assume that the factors capture all the nontrivial dynamics of the data, but the cross-sectional dependence may be explained by both the factors and the idiosyncratic components. Under the proposed model, (a) the factor process is dynamically dependent and the idiosyncratic component is a white noise process, and (b) the largest eigenvalues of the covariance matrix of the idiosyncratic components may diverge to infinity as the dimension p increases. We propose a white noise testing procedure for high-dimensional time series to determine the number of white noise components and, hence, the number of common factors, and introduce a projected principal component analysis (PCA) to eliminate the diverging effect of the idiosyncratic noises. Asymptotic properties of the proposed method are established for both fixed p and diverging p as the sample size n increases to infinity. We use both simulated data and real examples to assess the performance of the proposed method. We also compare our method with two commonly used methods in the literature concerning the forecastability of the extracted factors and find that the proposed approach not only provides interpretable results, but also performs well in out-of-sample forecasting. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1398-1414
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1862668
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1862668
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1398-1414
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1863222_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Raiden B. Hasegawa
Author-X-Name-First: Raiden B.
Author-X-Name-Last: Hasegawa
Author-Name: Dylan S. Small
Author-X-Name-First: Dylan S.
Author-X-Name-Last: Small
Title: Estimating Malaria Vaccine Efficacy in the Absence of a Gold Standard Case Definition: Mendelian Factorial Design
Abstract:
Accurate estimates of malaria vaccine efficacy require a reliable definition of a malaria case. However, the symptoms of clinical malaria are unspecific, overlapping with other childhood illnesses. Additionally, children in endemic areas tolerate varying levels of parasitemia without symptoms. Together, this makes finding a gold-standard case definition challenging. We present a method to identify and estimate malaria vaccine efficacy that does not require an observable gold-standard case definition. Instead, we leverage genetic traits that are protective against malaria but not against other illnesses, for example, the sickle cell trait, to identify vaccine efficacy in a randomized trial. Inspired by Mendelian randomization, we introduce Mendelian factorial design, a method that augments a randomized trial with genetic variation to produce a natural factorial experiment, which identifies vaccine efficacy under realistic assumptions. A robust, covariance adjusted estimation procedure is developed for estimating vaccine efficacy on the risk ratio and incidence rate ratio scales. Simulations suggest that our estimator has good performance whereas standard methods are systematically biased. We demonstrate that a combined estimator using both our proposed estimator and the standard approach yields significant improvements when the Mendelian factor is only weakly protective. Our method can be applied in vaccine and prevention trials of other childhood diseases that have no gold-standard case definition and known genetic risk factors. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1466-1481
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1863222
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1863222
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1466-1481
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2102501_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Guido Imbens
Author-X-Name-First: Guido
Author-X-Name-Last: Imbens
Title: Comment on: “Confidence Intervals for Nonparametric Empirical Bayes Analysis” by Ignatiadis and Wager
Journal: Journal of the American Statistical Association
Pages: 1181-1182
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2102501
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2102501
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1181-1182
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1864382_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Matteo Bonvini
Author-X-Name-First: Matteo
Author-X-Name-Last: Bonvini
Author-Name: Edward H. Kennedy
Author-X-Name-First: Edward H.
Author-X-Name-Last: Kennedy
Title: Sensitivity Analysis via the Proportion of Unmeasured Confounding
Abstract:
In observational studies, identification of average treatment effects (ATEs) is generally achieved by assuming that the correct set of confounders has been measured and properly included in the relevant models. Because this assumption is both strong and untestable, a sensitivity analysis should be performed. Common approaches include modeling the bias directly or varying the propensity scores to probe the effects of a potential unmeasured confounder. In this article, we take a novel approach whereby the sensitivity parameter is the “proportion of unmeasured confounding”: the proportion of units for whom the treatment is not as good as randomized even after conditioning on the observed covariates. We consider different assumptions on the probability of a unit being unconfounded. In each case, we derive sharp bounds on the average treatment effect as a function of the sensitivity parameter and propose nonparametric estimators that allow flexible covariate adjustment. We also introduce a one-number summary of a study’s robustness to the number of confounded units. Finally, we explore finite-sample properties via simulation, and apply the methods to an observational database used to assess the effects of right heart catheterization. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1540-1550
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1864382
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1864382
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1540-1550
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2055559_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Yingtian Hu
Author-X-Name-First: Yingtian
Author-X-Name-Last: Hu
Author-Name: Mahmoud Zeydabadinezhad
Author-X-Name-First: Mahmoud
Author-X-Name-Last: Zeydabadinezhad
Author-Name: Longchuan Li
Author-X-Name-First: Longchuan
Author-X-Name-Last: Li
Author-Name: Ying Guo
Author-X-Name-First: Ying
Author-X-Name-Last: Guo
Title: A Multimodal Multilevel Neuroimaging Model for Investigating Brain Connectome Development
Abstract:
Recent advancements in multimodal neuroimaging such as functional MRI (fMRI) and diffusion MRI (dMRI) offer unprecedented opportunities to understand brain development. Most existing neurodevelopmental studies focus on using a single imaging modality to study microstructure or neural activations in localized brain regions. The developmental changes of brain network architecture in childhood and adolescence are not well understood. Our study made use of dMRI and resting-state fMRI imaging datasets from the Philadelphia Neurodevelopmental Cohort (PNC) study to characterize developmental changes in both structural and functional brain connectomes. A multimodal multilevel model (MMM) is developed and implemented in the PNC study to investigate brain maturation in both white matter structural connections and intrinsic functional connections. MMM addresses several major challenges in multimodal connectivity analysis. First, by using a first-level data generative model for observed measures and a second-level latent network model, MMM effectively infers underlying connection states from noisy imaging-based connectivity measurements. Second, MMM models the interplay between the structural and functional connections to capture the relationship between different brain connectomes. Third, MMM incorporates covariate effects in the network modeling to investigate network heterogeneity across subpopulations. Finally, by using a module-wise parameterization based on brain network topology, MMM is scalable to whole-brain connectomics. MMM analysis of the PNC study generates new insights into neurodevelopment during adolescence, including revealing that the majority of white fiber connectivity growth is related to the cognitive networks, where the most significant increase is found between the default mode and the executive control network, with a 15% increase in the probability of structural connections.
We also find that functional connectome development derives mainly from global functional integration rather than from direct anatomical connections. To the best of our knowledge, these findings have not been reported in the literature using multimodal connectomics. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1134-1148
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2055559
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2055559
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1134-1148
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1859380_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Laura Jula Vanegas
Author-X-Name-First: Laura
Author-X-Name-Last: Jula Vanegas
Author-Name: Merle Behr
Author-X-Name-First: Merle
Author-X-Name-Last: Behr
Author-Name: Axel Munk
Author-X-Name-First: Axel
Author-X-Name-Last: Munk
Title: Multiscale Quantile Segmentation
Abstract:
We introduce a new methodology for analyzing serial data by quantile regression, assuming that the underlying quantile function consists of constant segments. The procedure does not rely on any distributional assumption besides serial independence. It is based on a multiscale statistic, which allows one to control the (finite sample) probability of selecting the correct number of segments S at a given error level, which serves as a tuning parameter. For a proper choice of this parameter, this probability tends exponentially fast to one as the sample size increases. We further show that the location and size of segments are estimated at the minimax optimal rate (compared to a Gaussian setting) up to a log factor. Thereby, our approach leads to (asymptotically) uniform confidence bands for the entire quantile regression function in a fully nonparametric setup. The procedure is efficiently implemented using dynamic programming techniques with double heap structures, and software is provided. Simulations and data examples from genetic sequencing and ion channel recordings confirm the robustness of the proposed procedure, which at the same time reliably detects changes in quantiles from arbitrary distributions with precise statistical guarantees. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1384-1397
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1859380
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1859380
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1384-1397
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2093728_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Peter Hoff
Author-X-Name-First: Peter
Author-X-Name-Last: Hoff
Title: Coverage Properties of Empirical Bayes Intervals: A Discussion of “Confidence Intervals for Nonparametric Empirical Bayes Analysis” by Ignatiadis and Wager
Journal: Journal of the American Statistical Association
Pages: 1175-1178
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2093728
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2093728
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1175-1178
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1863223_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Xiao Liu
Author-X-Name-First: Xiao
Author-X-Name-Last: Liu
Author-Name: Kyongmin Yeo
Author-X-Name-First: Kyongmin
Author-X-Name-Last: Yeo
Author-Name: Siyuan Lu
Author-X-Name-First: Siyuan
Author-X-Name-Last: Lu
Title: Statistical Modeling for Spatio-Temporal Data From Stochastic Convection-Diffusion Processes
Abstract:
This article proposes a physical-statistical modeling approach for spatio-temporal data arising from a class of stochastic convection-diffusion processes. Such processes are widely found in scientific and engineering applications where fundamental physics imposes critical constraints on how data can be modeled and how models should be interpreted. The idea of spectrum decomposition is employed to approximate a physical spatio-temporal process by a linear combination of spatial basis functions and a multivariate random process of spectral coefficients. Unlike existing approaches that assume spatially and temporally invariant convection-diffusion, this article considers a more general scenario with spatially varying convection-diffusion and a nonzero-mean source-sink. As a result, the temporal dynamics of the spectral coefficients are coupled with one another, which can be interpreted from the perspective of physics as nonlinear energy redistribution across multiple scales. Because of the spatially varying convection-diffusion, the space-time covariance is nonstationary in space. The theoretical results are integrated into a hierarchical dynamical spatio-temporal model. The connection is established between the proposed model and existing models based on integro-difference equations. Computational efficiency and scalability are also investigated to make the proposed approach practical. The advantages of the proposed methodology are demonstrated by numerical examples, a case study, and comprehensive comparison studies. Computer code is available on GitHub. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1482-1499
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1863223
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1863223
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1482-1499
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1870984_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Zhengwu Zhang
Author-X-Name-First: Zhengwu
Author-X-Name-Last: Zhang
Author-Name: Xiao Wang
Author-X-Name-First: Xiao
Author-X-Name-Last: Wang
Author-Name: Linglong Kong
Author-X-Name-First: Linglong
Author-X-Name-Last: Kong
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Title: High-Dimensional Spatial Quantile Function-on-Scalar Regression
Abstract:
This article develops a novel spatial quantile function-on-scalar regression model, which studies the conditional spatial distribution of a high-dimensional functional response given scalar predictors. With the strength of both quantile regression and copula modeling, we are able to explicitly characterize the conditional distribution of the functional or image response on the whole spatial domain. Our method provides a comprehensive understanding of the effect of scalar covariates on functional responses across different quantile levels and also gives a practical way to generate new images for given covariate values. Theoretically, we establish the minimax rates of convergence for estimating coefficient functions under both fixed and random designs. We further develop an efficient primal-dual algorithm to handle high-dimensional image data. Simulations and real data analysis are conducted to examine the finite-sample performance.
Journal: Journal of the American Statistical Association
Pages: 1563-1578
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1870984
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1870984
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1563-1578
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1863812_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Kristian Bjørn Hessellund
Author-X-Name-First: Kristian Bjørn
Author-X-Name-Last: Hessellund
Author-Name: Ganggang Xu
Author-X-Name-First: Ganggang
Author-X-Name-Last: Xu
Author-Name: Yongtao Guan
Author-X-Name-First: Yongtao
Author-X-Name-Last: Guan
Author-Name: Rasmus Waagepetersen
Author-X-Name-First: Rasmus
Author-X-Name-Last: Waagepetersen
Title: Semiparametric Multinomial Logistic Regression for Multivariate Point Pattern Data
Abstract:
We propose a new method for the analysis of multivariate point pattern data observed in a heterogeneous environment and with complex intensity functions. We suggest semiparametric models for the intensity functions that depend on an unspecified factor common to all types of points. This is, for example, well suited for analyzing spatial covariate effects on events such as street crime activities that occur in a complex urban environment. A multinomial conditional composite likelihood function is introduced for estimation of intensity function regression parameters, and the asymptotic joint distribution of the resulting estimators is derived under mild conditions. Crucially, the asymptotic covariance matrix depends on ratios of cross pair correlation functions of the multivariate point process. To make valid statistical inference without restrictive assumptions, we construct consistent nonparametric estimators for these ratios. Finally, we construct standardized residual plots, predictive probability plots, and semiparametric intensity plots to validate and to visualize the findings of the model. The effectiveness of the proposed methodology is demonstrated through extensive simulation studies and an application analyzing the effects of socio-economic and demographic variables on occurrences of street crimes in Washington, DC. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1500-1515
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1863812
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1863812
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1500-1515
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1853547_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Tengyuan Liang
Author-X-Name-First: Tengyuan
Author-X-Name-Last: Liang
Author-Name: Hai Tran-Bach
Author-X-Name-First: Hai
Author-X-Name-Last: Tran-Bach
Title: Mehler’s Formula, Branching Process, and Compositional Kernels of Deep Neural Networks
Abstract:
We use a connection between compositional kernels and branching processes via Mehler’s formula to study deep neural networks. This new probabilistic insight provides us with a novel perspective on the mathematical role of activation functions in compositional neural networks. We study the unscaled and rescaled limits of the compositional kernels and explore the different phases of the limiting behavior as the compositional depth increases. We investigate the memorization capacity of the compositional kernels and neural networks by characterizing the interplay among compositional depth, sample size, dimensionality, and nonlinearity of the activation. Explicit formulas for the eigenvalues of the compositional kernel are provided, which quantify the complexity of the corresponding reproducing kernel Hilbert space. On the methodological front, we propose a new random features algorithm, which compresses the compositional layers by devising a new activation function. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1324-1337
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1853547
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1853547
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1324-1337
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2093725_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Bradley Efron
Author-X-Name-First: Bradley
Author-X-Name-Last: Efron
Title: Discussion of “Confidence Intervals for Nonparametric Empirical Bayes Analysis” by Nikolaos Ignatiadis and Stefan Wager
Journal: Journal of the American Statistical Association
Pages: 1179-1180
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2093725
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2093725
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1179-1180
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2096040_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Noel Cressie
Author-X-Name-First: Noel
Author-X-Name-Last: Cressie
Title: Nonparametric Empirical Bayes Prediction
Journal: Journal of the American Statistical Association
Pages: 1167-1170
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2096040
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2096040
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1167-1170
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2098135_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Sudipto Banerjee
Author-X-Name-First: Sudipto
Author-X-Name-Last: Banerjee
Title: Discussion of “Measuring Housing Vitality from Multi-Source Big Data and Machine Learning”
Journal: Journal of the American Statistical Association
Pages: 1063-1065
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2098135
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2098135
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1063-1065
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1847121_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Kolyan Ray
Author-X-Name-First: Kolyan
Author-X-Name-Last: Ray
Author-Name: Botond Szabó
Author-X-Name-First: Botond
Author-X-Name-Last: Szabó
Title: Variational Bayes for High-Dimensional Linear Regression With Sparse Priors
Abstract:
We study a mean-field spike and slab variational Bayes (VB) approximation to Bayesian model selection priors in sparse high-dimensional linear regression. Under compatibility conditions on the design matrix, oracle inequalities are derived for the mean-field VB approximation, implying that it converges to the sparse truth at the optimal rate and gives optimal prediction of the response vector. The empirical performance of our algorithm is studied, showing that it performs comparably to other state-of-the-art Bayesian variable selection methods. We also numerically demonstrate that the widely used coordinate-ascent variational inference algorithm can be highly sensitive to the parameter updating order, leading to potentially poor performance. To mitigate this, we propose a novel prioritized updating scheme that uses a data-driven updating order and performs better in simulations. The variational algorithm is implemented in the R package sparsevb. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1270-1281
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1847121
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1847121
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1270-1281
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1862671_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Fei Xue
Author-X-Name-First: Fei
Author-X-Name-Last: Xue
Author-Name: Yanqing Zhang
Author-X-Name-First: Yanqing
Author-X-Name-Last: Zhang
Author-Name: Wenzhuo Zhou
Author-X-Name-First: Wenzhuo
Author-X-Name-Last: Zhou
Author-Name: Haoda Fu
Author-X-Name-First: Haoda
Author-X-Name-Last: Fu
Author-Name: Annie Qu
Author-X-Name-First: Annie
Author-X-Name-Last: Qu
Title: Multicategory Angle-Based Learning for Estimating Optimal Dynamic Treatment Regimes With Censored Data
Abstract:
An optimal dynamic treatment regime (DTR) consists of a sequence of decision rules that maximize long-term benefits, which is applicable to chronic diseases such as HIV infection or cancer. In this article, we develop a novel angle-based approach to search for the optimal DTR under a multicategory treatment framework for survival data. The proposed method aims to maximize the conditional survival function of patients following a DTR. In contrast to most existing approaches, which are designed to maximize the expected survival time under a binary treatment framework, the proposed method solves the multicategory treatment problem over multiple stages for censored data. Specifically, the proposed method obtains the optimal DTR by integrating estimation of the decision rules at multiple stages into a single multicategory classification algorithm without imposing additional constraints, which is also more computationally efficient and robust. In theory, we establish Fisher consistency and provide a risk bound for the proposed estimator under regularity conditions. Our numerical studies show that the proposed method outperforms competing methods in terms of maximizing the conditional survival probability. We apply the proposed method to two real datasets: Framingham Heart Study data and acquired immunodeficiency syndrome clinical data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1438-1451
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1862671
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1862671
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1438-1451
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1841646_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Faming Liang
Author-X-Name-First: Faming
Author-X-Name-Last: Liang
Author-Name: Jingnan Xue
Author-X-Name-First: Jingnan
Author-X-Name-Last: Xue
Author-Name: Bochao Jia
Author-X-Name-First: Bochao
Author-X-Name-Last: Jia
Title: Markov Neighborhood Regression for High-Dimensional Inference
Abstract:
This article proposes an innovative method for constructing confidence intervals and assessing p-values in statistical inference for high-dimensional linear models. The proposed method breaks the high-dimensional inference problem into a series of low-dimensional inference problems: for each regression coefficient βi, the confidence interval and p-value are computed by regressing on a subset of variables selected according to the conditional independence relations between the corresponding variable Xi and the other variables. Since this subset of variables forms a Markov neighborhood of Xi in the Markov network formed by all the variables X1,X2,…,Xp, the proposed method is coined Markov neighborhood regression (MNR). The proposed method is tested on high-dimensional linear, logistic, and Cox regression. The numerical results indicate that the proposed method significantly outperforms existing ones. Based on MNR, a method for learning causal structures in high-dimensional linear models is proposed and applied to the identification of drug-sensitive genes and cancer driver genes. The idea of using conditional independence relations for dimension reduction is general and can potentially be extended to other high-dimensional or big data problems as well. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1200-1214
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1841646
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1841646
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1200-1214
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1875837_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Abdelaati Daouia
Author-X-Name-First: Abdelaati
Author-X-Name-Last: Daouia
Author-Name: Irène Gijbels
Author-X-Name-First: Irène
Author-X-Name-Last: Gijbels
Author-Name: Gilles Stupfler
Author-X-Name-First: Gilles
Author-X-Name-Last: Stupfler
Title: Extremile Regression
Abstract:
Regression extremiles define a least squares analogue of regression quantiles. They are determined by weighted expectations rather than tail probabilities. Of special interest is their intuitive meaning in terms of expected minima and maxima. Their use arises naturally in risk management where, in contrast to quantiles, they fulfill the coherency axiom and take the severity of tail losses into account. In addition, they are comonotonically additive and belong to both the family of spectral risk measures and the family of concave distortion risk measures. This article provides the first detailed study of extremiles in a general setting in the presence of covariates. We rely on local linear (least squares) check function minimization for estimating conditional extremiles and deriving the asymptotic normality of their estimators. We also extend extremile regression far into the tails of heavy-tailed distributions. Extrapolated estimators are constructed and their asymptotic theory is developed. Some applications to real data are provided.
Journal: Journal of the American Statistical Association
Pages: 1579-1586
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2021.1875837
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1875837
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1579-1586
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1864380_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Debmalya Nandy
Author-X-Name-First: Debmalya
Author-X-Name-Last: Nandy
Author-Name: Francesca Chiaromonte
Author-X-Name-First: Francesca
Author-X-Name-Last: Chiaromonte
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Title: Covariate Information Number for Feature Screening in Ultrahigh-Dimensional Supervised Problems
Abstract:
Contemporary high-throughput experimental and surveying techniques give rise to ultrahigh-dimensional supervised problems with sparse signals; that is, a limited number of observations (n), each with a very large number of covariates (p ≫ n), only a small share of which are truly associated with the response. In these settings, major concerns about computational burden, algorithmic stability, and statistical accuracy call for substantially reducing the feature space by eliminating redundant covariates before any sophisticated statistical analysis is applied. Along the lines of Pearson’s correlation-coefficient-based sure independence screening and other model- and correlation-based feature screening methods, we propose a model-free procedure called covariate information number-sure independence screening (CIS). CIS uses a marginal utility connected to the notion of the traditional Fisher information, possesses the sure screening property, and is applicable to any type of response (features) with continuous features (response). Simulations and an application to transcriptomic data on rats reveal the comparative strengths of CIS over some popular feature screening methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1516-1529
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1864380
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1864380
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1516-1529
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2096038_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Yang Zhou
Author-X-Name-First: Yang
Author-X-Name-Last: Zhou
Author-Name: Lirong Xue
Author-X-Name-First: Lirong
Author-X-Name-Last: Xue
Author-Name: Zhengyu Shi
Author-X-Name-First: Zhengyu
Author-X-Name-Last: Shi
Author-Name: Libo Wu
Author-X-Name-First: Libo
Author-X-Name-Last: Wu
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Title: Measuring Housing Vitality from Multi-Source Big Data and Machine Learning
Abstract:
Measuring timely high-resolution socioeconomic outcomes is critical for policymaking and evaluation, but such measures are hard to obtain reliably. With the help of machine learning and cheaply available data such as social media and nightlight, it is now possible to predict such indices at fine granularity. This article demonstrates an adaptive way to measure the time trend and spatial distribution of housing vitality (the number of occupied houses) with the help of multiple easily accessible datasets: energy, nightlight, and land-use data. We first identified high-frequency housing occupancy status from energy consumption data and then matched it with the monthly nightlight data. We then introduced the Factor-Augmented Regularized Model for prediction (FarmPredict) to deal with the dependence and collinearity issues among predictors by effectively lifting the prediction space, which is suitable for most machine learning algorithms. The heterogeneity issue in big data analysis is mitigated through the land-use data. FarmPredict allows us to extend the regional results to the city level, with a 76% out-of-sample explanation of the spatial and temporal variation in house usage. Since energy is indispensable for life, our method is highly transferable, requiring only publicly accessible data. Our article provides an alternative approach with statistical machine learning to predict socioeconomic outcomes without relying on existing census and survey data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1045-1059
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2096038
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2096038
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1045-1059
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2097086_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Wei Tu
Author-X-Name-First: Wei
Author-X-Name-Last: Tu
Author-Name: Bei Jiang
Author-X-Name-First: Bei
Author-X-Name-Last: Jiang
Author-Name: Linglong Kong
Author-X-Name-First: Linglong
Author-X-Name-Last: Kong
Title: Comments on “Measuring Housing Vitality from Multi-Source Big Data and Machine Learning”
Journal: Journal of the American Statistical Association
Pages: 1060-1062
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2097086
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2097086
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1060-1062
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2104726_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Nicholas J. Horton
Author-X-Name-First: Nicholas J.
Author-X-Name-Last: Horton
Title: Foundations of Statistics for Data Scientists: With R and Python
Journal: Journal of the American Statistical Association
Pages: 1603-1604
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2104726
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2104726
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1603-1604
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2041422_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Tianwen Ma
Author-X-Name-First: Tianwen
Author-X-Name-Last: Ma
Author-Name: Yang Li
Author-X-Name-First: Yang
Author-X-Name-Last: Li
Author-Name: Jane E. Huggins
Author-X-Name-First: Jane E.
Author-X-Name-Last: Huggins
Author-Name: Ji Zhu
Author-X-Name-First: Ji
Author-X-Name-Last: Zhu
Author-Name: Jian Kang
Author-X-Name-First: Jian
Author-X-Name-Last: Kang
Title: Bayesian Inferences on Neural Activity in EEG-Based Brain-Computer Interface
Abstract:
A brain-computer interface (BCI) is a system that translates brain activity into commands to operate technology. A common design for an electroencephalogram (EEG) BCI relies on the classification of the P300 event-related potential (ERP), which is a response elicited by the rare occurrence of target stimuli among common nontarget stimuli. Few existing ERP classifiers directly explore the underlying mechanism of the neural activity. To this end, we perform a novel Bayesian analysis of the probability distribution of multi-channel real EEG signals under the P300 ERP-BCI design. We aim to identify relevant spatial-temporal differences in the neural activity, which provides statistical evidence of P300 ERP responses and helps design individually efficient and accurate BCIs. As one key finding of our single-participant analysis, there is a 90% posterior probability that the target ERPs of the channels around the visual cortex reach their negative peaks around 200 milliseconds poststimulus. Our analysis identifies five important channels (PO7, PO8, Oz, P4, Cz) for the BCI speller, leading to a 100% prediction accuracy. In the analyses of nine other participants, we consistently select the identified five channels, and the selection frequencies are robust to small variations of bandpass filters and kernel hyperparameters. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1122-1133
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2041422
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2041422
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1122-1133
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1844211_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Ting Li
Author-X-Name-First: Ting
Author-X-Name-Last: Li
Author-Name: Tengfei Li
Author-X-Name-First: Tengfei
Author-X-Name-Last: Li
Author-Name: Zhongyi Zhu
Author-X-Name-First: Zhongyi
Author-X-Name-Last: Zhu
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Title: Regression Analysis of Asynchronous Longitudinal Functional and Scalar Data
Abstract:
Many modern large-scale longitudinal neuroimaging studies, such as the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study, have collected or are collecting asynchronous scalar and functional variables that are measured at distinct time points. The analysis of temporally asynchronous functional and scalar variables poses major technical challenges to many existing statistical approaches. We propose a class of generalized functional partial-linear varying-coefficient models to appropriately deal with these challenges by introducing both scalar and functional coefficients of interest and using kernel weighting methods. We design penalized kernel-weighted estimating equations to estimate scalar and functional coefficients, in which we represent functional coefficients using a rich truncated tensor product penalized B-spline basis. We establish the theoretical properties of the scalar and functional coefficient estimators, including consistency, convergence rate, prediction accuracy, and limiting distributions. We also propose a bootstrap method to test the nullity of both parametric and functional coefficients, while establishing the bootstrap consistency. Simulation studies and the analysis of the ADNI study are used to assess the finite-sample performance of our proposed approach. Our real data analysis reveals a significant relationship between fractional anisotropy density curves and cognitive function, with education, baseline disease status, and the APOE4 gene as major contributing factors. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1228-1242
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1844211
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1844211
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1228-1242
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1850461_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Qinglong Tian
Author-X-Name-First: Qinglong
Author-X-Name-Last: Tian
Author-Name: Fanqi Meng
Author-X-Name-First: Fanqi
Author-X-Name-Last: Meng
Author-Name: Daniel J. Nordman
Author-X-Name-First: Daniel J.
Author-X-Name-Last: Nordman
Author-Name: William Q. Meeker
Author-X-Name-First: William Q.
Author-X-Name-Last: Meeker
Title: Predicting the Number of Future Events
Abstract:
This article describes prediction methods for the number of future events from a population of units associated with an on-going time-to-event process. Examples include the prediction of warranty returns and the prediction of the number of future product failures that could cause serious threats to property or life. Important decisions such as whether a product recall should be mandated are often based on such predictions. Data, generally right-censored (and sometimes left-truncated and right-censored), are used to estimate the parameters of a time-to-event distribution. This distribution can then be used to predict the number of events over future periods of time. Such predictions are sometimes called within-sample predictions and differ from other prediction problems considered in most of the prediction literature. This article shows that the plug-in (also known as estimative or naive) prediction method is not asymptotically correct (i.e., for large amounts of data, the coverage probability always fails to converge to the nominal confidence level). However, a commonly used prediction calibration method is shown to be asymptotically correct for within-sample predictions, and two alternative predictive-distribution-based methods that perform better than the calibration method are presented and justified. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1296-1310
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1850461
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1850461
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1296-1310
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1863221_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Timothy W. Waite
Author-X-Name-First: Timothy W.
Author-X-Name-Last: Waite
Author-Name: David C. Woods
Author-X-Name-First: David C.
Author-X-Name-Last: Woods
Title: Minimax Efficient Random Experimental Design Strategies With Application to Model-Robust Design for Prediction
Abstract:
In game theory and statistical decision theory, a random (i.e., mixed) decision strategy often outperforms a deterministic strategy in minimax expected loss. As experimental design can be viewed as a game pitting the Statistician against Nature, the use of a random strategy to choose a design will often be beneficial. However, the topic of minimax-efficient random strategies for design selection is mostly unexplored, with consideration limited to Fisherian randomization of the allocation of a predetermined set of treatments to experimental units. Here, for the first time, novel and more flexible random design strategies are shown to have better properties than their deterministic counterparts in linear model estimation and prediction, including stronger bounds on both the expectation and survivor function of the loss distribution. Design strategies are considered for three important statistical problems: (i) parameter estimation in linear potential outcomes models, (ii) point prediction from a correct linear model, and (iii) global prediction from a linear model taking into account an L2-class of possible model discrepancy functions. The new random design strategies proposed for (iii) give a finite bound on the expected loss, a dramatic improvement compared to existing deterministic exact designs for which the expected loss is unbounded. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1452-1465
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1863221
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1863221
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1452-1465
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2027774_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Nathan B. Wikle
Author-X-Name-First: Nathan B.
Author-X-Name-Last: Wikle
Author-Name: Ephraim M. Hanks
Author-X-Name-First: Ephraim M.
Author-X-Name-Last: Hanks
Author-Name: Lucas R. F. Henneman
Author-X-Name-First: Lucas R. F.
Author-X-Name-Last: Henneman
Author-Name: Corwin M. Zigler
Author-X-Name-First: Corwin M.
Author-X-Name-Last: Zigler
Title: A Mechanistic Model of Annual Sulfate Concentrations in the United States
Abstract:
Understanding how individual pollution sources contribute to ambient sulfate pollution is critical for assessing past and future air quality regulations. Since attribution to specific sources is typically not encoded in spatial air pollution data, we develop a mechanistic model which we use to estimate, with uncertainty, the contribution of ambient sulfate concentrations attributable specifically to sulfur dioxide (SO2) emissions from individual coal-fired power plants in the central United States. We propose a multivariate Ornstein–Uhlenbeck (OU) process approximation to the dynamics of the underlying space-time chemical transport process, and its distributional properties are leveraged to specify novel probability models for spatial data that are viewed as either a snapshot or time-averaged observation of the OU process. Using US EPA SO2 emissions data from 193 power plants and state-of-the-art estimates of ground-level annual mean sulfate concentrations, we estimate that in 2011—a time of active power plant regulatory action—existing flue-gas desulfurization (FGD) technologies at 66 power plants reduced population-weighted exposure to ambient sulfate by 1.97 μg/m3 (95% CI: 1.80–2.15). Furthermore, we anticipate future regulatory benefits by estimating that installing FGD technologies at the five largest SO2-emitting facilities would reduce human exposure to ambient sulfate by an additional 0.45 μg/m3 (95% CI: 0.33–0.54). Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1082-1093
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2027774
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2027774
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1082-1093
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1864381_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Alan Riva-Palacio
Author-X-Name-First: Alan
Author-X-Name-Last: Riva-Palacio
Author-Name: Fabrizio Leisen
Author-X-Name-First: Fabrizio
Author-X-Name-Last: Leisen
Author-Name: Jim Griffin
Author-X-Name-First: Jim
Author-X-Name-Last: Griffin
Title: Survival Regression Models With Dependent Bayesian Nonparametric Priors
Abstract:
We present a novel Bayesian nonparametric model for regression in survival analysis. Our model builds on the classical neutral to the right model of Doksum and on the Cox proportional hazards model of Kim and Lee. The use of a vector of dependent Bayesian nonparametric priors allows us to efficiently model the hazard as a function of covariates while allowing nonproportionality. The model can be seen as having competing latent risks. We characterize the posterior of the underlying dependent vector of completely random measures and study the asymptotic behavior of the model. We show how an MCMC scheme can provide Bayesian inference for posterior means and credible intervals. The method is illustrated using simulated and real data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1530-1539
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1864381
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1864381
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1530-1539
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2024436_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Santiago Olivella
Author-X-Name-First: Santiago
Author-X-Name-Last: Olivella
Author-Name: Tyler Pratt
Author-X-Name-First: Tyler
Author-X-Name-Last: Pratt
Author-Name: Kosuke Imai
Author-X-Name-First: Kosuke
Author-X-Name-Last: Imai
Title: Dynamic Stochastic Blockmodel Regression for Network Data: Application to International Militarized Conflicts
Abstract:
The decision to engage in military conflict is shaped by many factors, including state- and dyad-level characteristics as well as the state’s membership in geopolitical coalitions. Supporters of the democratic peace theory, for example, hypothesize that the community of democratic states is less likely to wage war with each other. Such theories explain the ways in which nodal and dyadic characteristics affect the evolution of conflict patterns over time via their effects on group memberships. To test these arguments, we develop a dynamic model of network data by combining a hidden Markov model with a mixed-membership stochastic blockmodel that identifies latent groups underlying the network structure. Unlike existing models, we incorporate covariates that predict dynamic node memberships in latent groups as well as the direct formation of edges between dyads. While prior substantive research often assumes the decision to engage in international militarized conflict is independent across states and static over time, we demonstrate that conflict is driven by states’ evolving membership in geopolitical blocs. Our analysis of militarized disputes from 1816 to 2010 identifies two distinct blocs of democratic states, only one of which exhibits unusually low rates of conflict. Changes in monadic covariates like democracy shift states between coalitions, making some states more pacific but others more belligerent. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1068-1081
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2021.2024436
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2024436
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1068-1081
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1862669_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Daniel Malinsky
Author-X-Name-First: Daniel
Author-X-Name-Last: Malinsky
Author-Name: Ilya Shpitser
Author-X-Name-First: Ilya
Author-X-Name-Last: Shpitser
Author-Name: Eric J. Tchetgen Tchetgen
Author-X-Name-First: Eric J.
Author-X-Name-Last: Tchetgen Tchetgen
Title: Semiparametric Inference for Nonmonotone Missing-Not-at-Random Data: The No Self-Censoring Model
Abstract:
We study the identification and estimation of statistical functionals of multivariate data missing nonmonotonically and not-at-random, taking a semiparametric approach. Specifically, we assume that the missingness mechanism satisfies what has been previously called “no self-censoring” or “itemwise conditionally independent nonresponse,” which roughly corresponds to the assumption that no partially observed variable directly determines its own missingness status. We show that this assumption, combined with an odds ratio parameterization of the joint density, enables identification of functionals of interest, and we establish the semiparametric efficiency bound for the nonparametric model satisfying this assumption. We propose a practical augmented inverse probability weighted estimator, and in the setting with a (possibly high-dimensional) always-observed subset of covariates, our proposed estimator enjoys a certain double-robustness property. We explore the performance of our estimator with simulation experiments and on a previously studied dataset of HIV-positive mothers in Botswana. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1415-1423
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1862669
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1862669
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1415-1423
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1850460_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Yiyuan She
Author-X-Name-First: Yiyuan
Author-X-Name-Last: She
Author-Name: Zhifeng Wang
Author-X-Name-First: Zhifeng
Author-X-Name-Last: Wang
Author-Name: Jiahui Shen
Author-X-Name-First: Jiahui
Author-X-Name-Last: Shen
Title: Gaining Outlier Resistance With Progressive Quantiles: Fast Algorithms and Theoretical Studies
Abstract:
Outliers widely occur in big-data applications and may severely affect statistical estimation and inference. In this article, a framework of outlier-resistant estimation is introduced to robustify an arbitrarily given loss function. It has a close connection to the method of trimming and includes explicit outlyingness parameters for all samples, which in turn facilitates computation, theory, and parameter tuning. To tackle the issues of nonconvexity and nonsmoothness, we develop scalable algorithms with implementation ease and guaranteed fast convergence. In particular, a new technique is proposed to alleviate the requirement on the starting point such that on regular datasets, the number of data resamplings can be substantially reduced. Based on combined statistical and computational treatments, we are able to perform nonasymptotic analysis beyond M-estimation. The obtained resistant estimators, though not necessarily globally or even locally optimal, enjoy minimax rate optimality in both low dimensions and high dimensions. Experiments in regression, classification, and neural networks show excellent performance of the proposed methodology at the occurrence of gross outliers. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1282-1295
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1850460
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1850460
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1282-1295
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1841647_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Brenda Betancourt
Author-X-Name-First: Brenda
Author-X-Name-Last: Betancourt
Author-Name: Giacomo Zanella
Author-X-Name-First: Giacomo
Author-X-Name-Last: Zanella
Author-Name: Rebecca C. Steorts
Author-X-Name-First: Rebecca C.
Author-X-Name-Last: Steorts
Title: Random Partition Models for Microclustering Tasks
Abstract:
Traditional Bayesian random partition models assume that the size of each cluster grows linearly with the number of data points. While this is appealing for some applications, this assumption is not appropriate for other tasks such as entity resolution (ER), modeling of sparse networks, and DNA sequencing tasks. Such applications require models that yield clusters whose sizes grow sublinearly with the total number of data points—the microclustering property. Motivated by these issues, we propose a general class of random partition models that satisfy the microclustering property with well-characterized theoretical properties. Our proposed models overcome major limitations in the existing literature on microclustering models, namely a lack of interpretability, identifiability, and full characterization of model asymptotic properties. Crucially, we drop the classical assumption of having an exchangeable sequence of data points, and instead assume an exchangeable sequence of clusters. In addition, our framework provides flexibility in terms of the prior distribution of cluster sizes, computational tractability, and applicability to a large number of microclustering tasks. We establish theoretical properties of the resulting class of priors, where we characterize the asymptotic behavior of the number of clusters and of the proportion of clusters of a given size. Our framework allows a simple and efficient Markov chain Monte Carlo algorithm to perform statistical inference. We illustrate our proposed methodology on the microclustering task of ER, where we provide a simulation study and real experiments on survey panel data.
Journal: Journal of the American Statistical Association
Pages: 1215-1227
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1841647
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1841647
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1215-1227
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2093729_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Nikolaos Ignatiadis
Author-X-Name-First: Nikolaos
Author-X-Name-Last: Ignatiadis
Author-Name: Stefan Wager
Author-X-Name-First: Stefan
Author-X-Name-Last: Wager
Title: Rejoinder: Confidence Intervals for Nonparametric Empirical Bayes Analysis
Journal: Journal of the American Statistical Association
Pages: 1192-1199
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2022.2093729
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2093729
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1192-1199
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1844720_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220823T191300 git hash: 39867e6e2f
Author-Name: Peter Hoff
Author-X-Name-First: Peter
Author-X-Name-Last: Hoff
Title: Smaller p-Values via Indirect Information
Abstract:
This article develops p-values for evaluating means of normal populations that make use of indirect or prior information. A p-value of this type is based on a biased frequentist hypothesis test that has optimal average power with respect to a probability distribution that encodes indirect information about the mean parameter, resulting in a smaller p-value if the indirect information is accurate. In a variety of multiparameter settings, we show how to adaptively estimate the indirect information for each mean parameter while still maintaining uniformity of the p-values under their null hypotheses. This is done using a linking model through which indirect information about the mean of one population may be obtained from the data of other populations. Importantly, the linking model does not need to be correct to maintain the uniformity of the p-values under their null hypotheses. This methodology is illustrated in several data analysis scenarios, including small area inference, spatially arranged populations, interactions in linear regression, and generalized linear models. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1254-1269
Issue: 539
Volume: 117
Year: 2022
Month: 9
X-DOI: 10.1080/01621459.2020.1844720
File-URL: http://hdl.handle.net/10.1080/01621459.2020.1844720
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:539:p:1254-1269
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1904959_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Qing Mai
Author-X-Name-First: Qing
Author-X-Name-Last: Mai
Author-Name: Xin Zhang
Author-X-Name-First: Xin
Author-X-Name-Last: Zhang
Author-Name: Yuqing Pan
Author-X-Name-First: Yuqing
Author-X-Name-Last: Pan
Author-Name: Kai Deng
Author-X-Name-First: Kai
Author-X-Name-Last: Deng
Title: A Doubly Enhanced EM Algorithm for Model-Based Tensor Clustering
Abstract:
Modern scientific studies often collect datasets in the form of tensors. These datasets call for innovative statistical analysis methods. In particular, there is a pressing need for tensor clustering methods to understand the heterogeneity in the data. We propose a tensor normal mixture model approach to enable probabilistic interpretation and computational tractability. Our statistical model leverages the tensor covariance structure to reduce the number of parameters for parsimonious modeling, and at the same time explicitly exploits the correlations for better variable selection and clustering. We propose a doubly enhanced expectation–maximization (DEEM) algorithm to perform clustering under this model. Both the expectation-step and the maximization-step are carefully tailored for tensor data in order to maximize statistical accuracy and minimize computational costs in high dimensions. Theoretical studies confirm that DEEM achieves consistent clustering even when the dimension of each mode of the tensors grows at an exponential rate in the sample size. Numerical studies demonstrate favorable performance of DEEM in comparison to existing methods.
Journal: Journal of the American Statistical Association
Pages: 2120-2134
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1904959
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1904959
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2120-2134
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1876710_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Minsuk Shin
Author-X-Name-First: Minsuk
Author-X-Name-Last: Shin
Author-Name: Jun S. Liu
Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu
Title: Neuronized Priors for Bayesian Sparse Linear Regression
Abstract:
Although Bayesian variable selection methods have been intensively studied, their routine use in practice has not caught up with their non-Bayesian counterparts such as Lasso, likely due to difficulties in both computations and flexibilities of prior choices. To ease these challenges, we propose the neuronized priors to unify and extend some popular shrinkage priors, such as Laplace, Cauchy, horseshoe, and spike-and-slab priors. A neuronized prior can be written as the product of a Gaussian weight variable and a scale variable transformed from Gaussian via an activation function. Compared with classic spike-and-slab priors, the neuronized priors achieve the same explicit variable selection without employing any latent indicator variables, which results in both more efficient and flexible posterior sampling and more effective posterior modal estimation. Theoretically, we provide specific conditions on the neuronized formulation to achieve the optimal posterior contraction rate, and show that a broadly applicable MCMC algorithm achieves an exponentially fast convergence rate under the neuronized formulation. We also examine various simulated and real data examples and demonstrate that the neuronization representation is computationally more efficient than, or comparable to, its standard counterpart in all well-known cases. An R package NPrior is provided for using neuronized priors in Bayesian linear regression.
Journal: Journal of the American Statistical Association
Pages: 1695-1710
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1876710
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1876710
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1695-1710
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1891927_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Lucio Barabesi
Author-X-Name-First: Lucio
Author-X-Name-Last: Barabesi
Author-Name: Andrea Cerasa
Author-X-Name-First: Andrea
Author-X-Name-Last: Cerasa
Author-Name: Andrea Cerioli
Author-X-Name-First: Andrea
Author-X-Name-Last: Cerioli
Author-Name: Domenico Perrotta
Author-X-Name-First: Domenico
Author-X-Name-Last: Perrotta
Title: On Characterizations and Tests of Benford’s Law
Abstract:
Benford’s law defines a probability distribution for patterns of significant digits in real numbers. When the law is expected to hold for genuine observations, deviation from it can be taken as evidence of possible data manipulation. We derive results on a transform of the significand function that provide motivation for new tests of conformance to Benford’s law exploiting its sum-invariance characterization. We also study the connection between sum invariance of the first digit and the corresponding marginal probability distribution. We approximate the exact distribution of the new test statistics through a computationally efficient Monte Carlo algorithm. We investigate the power of our tests under different alternatives and we point out relevant situations in which they are clearly preferable to the available procedures. Finally, we show the application potential of our approach in the context of fraud detection in international trade.
Journal: Journal of the American Statistical Association
Pages: 1887-1903
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1891927
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1891927
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1887-1903
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2077209_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Le Bao
Author-X-Name-First: Le
Author-X-Name-Last: Bao
Author-Name: Changcheng Li
Author-X-Name-First: Changcheng
Author-X-Name-Last: Li
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Author-Name: Songshan Yang
Author-X-Name-First: Songshan
Author-X-Name-Last: Yang
Title: Causal Structural Learning on MPHIA Individual Dataset
Abstract:
The Population-based HIV Impact Assessment (PHIA) is an ongoing project that conducts nationally representative HIV-focused surveys for measuring national and regional progress toward UNAIDS’ 90-90-90 targets, the primary strategy to end the HIV epidemic. We believe the PHIA survey offers a unique opportunity to better understand the key factors that drive the HIV epidemics in the most affected countries in sub-Saharan Africa. In this article, we propose a novel causal structural learning algorithm to discover important covariates and potential causal pathways for 90-90-90 targets. Existing constraint-based causal structural learning algorithms are quite aggressive in edge removal. The proposed algorithm preserves more information about important features and potential causal pathways. It is applied to the Malawi PHIA (MPHIA) dataset and leads to interesting results. For example, it discovers age and condom usage to be important for female HIV awareness; the number of sexual partners to be important for male HIV awareness; and knowing the travel time to HIV care facilities leads to a higher chance of being treated for both females and males. We further compare and validate the proposed algorithm using BIC and Monte Carlo simulations, and show that the proposed algorithm achieves improvement in true positive rates in important feature discovery over existing algorithms. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1642-1655
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2022.2077209
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2077209
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1642-1655
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1902817_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Dungang Liu
Author-X-Name-First: Dungang
Author-X-Name-Last: Liu
Author-Name: Regina Y. Liu
Author-X-Name-First: Regina Y.
Author-X-Name-Last: Liu
Author-Name: Min-ge Xie
Author-X-Name-First: Min-ge
Author-X-Name-Last: Xie
Title: Nonparametric Fusion Learning for Multiparameters: Synthesize Inferences From Diverse Sources Using Data Depth and Confidence Distribution
Abstract:
Fusion learning refers to synthesizing inferences from multiple sources or studies to make a more effective inference and prediction than from any individual source or study alone. Most existing methods for synthesizing inferences rely on parametric model assumptions, such as normality, which often do not hold in practice. We propose a general nonparametric fusion learning framework for synthesizing inferences for multiparameters from different studies. The main tool underlying the proposed framework is the new notion of depth confidence distribution (depth-CD), which is developed by combining data depth and confidence distribution. Broadly speaking, a depth-CD is a data-driven nonparametric summary distribution of the available inferential information for a target parameter. We show that a depth-CD is a powerful inferential tool and, moreover, is an omnibus form of confidence regions, whose contours of level sets shrink toward the true parameter value. The proposed fusion learning approach combines depth-CDs from the individual studies, with each depth-CD constructed by nonparametric bootstrap and data depth. The approach is shown to be efficient, general, and robust. Specifically, it achieves high-order accuracy and Bahadur efficiency under suitably chosen combining elements. It allows the model or inference structure to be different among individual studies. And, it readily adapts to heterogeneous studies with a broad range of complex and irregular settings. This last property enables the approach to use indirect evidence from incomplete studies to gain efficiency for the overall inference. We develop the theoretical support for the proposed approach, and we also illustrate the approach in making combined inference for the common mean vector and correlation coefficient from several studies. The numerical results from simulated studies show the approach to be less biased and more efficient than the traditional approaches in nonnormal settings. The advantages of the approach are also demonstrated in a Federal Aviation Administration study of aircraft landing performance. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2086-2104
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1902817
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1902817
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2086-2104
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1909598_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Gaetano Romano
Author-X-Name-First: Gaetano
Author-X-Name-Last: Romano
Author-Name: Guillem Rigaill
Author-X-Name-First: Guillem
Author-X-Name-Last: Rigaill
Author-Name: Vincent Runge
Author-X-Name-First: Vincent
Author-X-Name-Last: Runge
Author-Name: Paul Fearnhead
Author-X-Name-First: Paul
Author-X-Name-Last: Fearnhead
Title: Detecting Abrupt Changes in the Presence of Local Fluctuations and Autocorrelated Noise
Abstract:
While there are a plethora of algorithms for detecting changes in mean in univariate time series, almost all struggle in real applications where there is autocorrelated noise or where the mean fluctuates locally between the abrupt changes that one wishes to detect. In these cases, default implementations, which are often based on assumptions of a constant mean between changes and independent noise, can lead to substantial overestimation of the number of changes. We propose a principled approach to detect such abrupt changes that models local fluctuations as a random walk process and autocorrelated noise via an AR(1) process. We then estimate the number and location of changepoints by minimizing a penalized cost based on this model. We develop a novel and efficient dynamic programming algorithm, DeCAFS, that can solve this minimization problem, despite the additional challenge of dependence across segments due to the autocorrelated noise, which makes existing algorithms inapplicable. Theory and empirical results show that our approach has greater power at detecting abrupt changes than existing approaches. We apply our method to measuring gene expression levels in bacteria. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2147-2162
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1909598
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1909598
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2147-2162
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1895810_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Jiayin Zheng
Author-X-Name-First: Jiayin
Author-X-Name-Last: Zheng
Author-Name: Yingye Zheng
Author-X-Name-First: Yingye
Author-X-Name-Last: Zheng
Author-Name: Li Hsu
Author-X-Name-First: Li
Author-X-Name-Last: Hsu
Title: Risk Projection for Time-to-Event Outcome Leveraging Summary Statistics With Source Individual-Level Data
Abstract:
Predicting risks of chronic diseases has become increasingly important in clinical practice. When a prediction model is developed in a cohort, there is great interest in applying the model to other cohorts. Due to potential discrepancy in baseline disease incidences between different cohorts and shifts in patient composition, the risk predicted by the model built in the source cohort often under- or over-estimates the risk in a new cohort. In this article, we assume the relative risks of predictors are the same between the two cohorts, and propose a novel weighted estimating equation approach to recalibrating the projected risk for the targeted population through updating the baseline risk. The recalibration leverages the knowledge about survival probabilities for the disease of interest and competing events, and summary information of risk factors from the target population. We establish the consistency and asymptotic normality of the proposed estimators. Extensive simulations demonstrate that the proposed estimators are robust, even if the risk factor distributions differ between the source and target populations, and gain efficiency if they are the same, as long as the information from the target is precise. The method is illustrated with a recalibration of a colorectal cancer prediction model.
Journal: Journal of the American Statistical Association
Pages: 2043-2055
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1895810
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1895810
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2043-2055
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1906685_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Peter Z. Schochet
Author-X-Name-First: Peter Z.
Author-X-Name-Last: Schochet
Author-Name: Nicole E. Pashley
Author-X-Name-First: Nicole E.
Author-X-Name-Last: Pashley
Author-Name: Luke W. Miratrix
Author-X-Name-First: Luke W.
Author-X-Name-Last: Miratrix
Author-Name: Tim Kautz
Author-X-Name-First: Tim
Author-X-Name-Last: Kautz
Title: Design-Based Ratio Estimators and Central Limit Theorems for Clustered, Blocked RCTs
Abstract:
This article develops design-based ratio estimators for clustered, blocked randomized controlled trials (RCTs), with an application to a federally funded, school-based RCT testing the effects of behavioral health interventions. We consider finite population weighted least-squares estimators for average treatment effects (ATEs), allowing for general weighting schemes and covariates. We consider models with block-by-treatment status interactions as well as restricted models with block indicators only. We prove new finite population central limit theorems for each block specification. We also discuss simple variance estimators that share features with commonly used cluster-robust standard error estimators. Simulations show that the design-based ATE estimator yields nominal rejection rates with standard errors near true ones, even with few clusters.
Journal: Journal of the American Statistical Association
Pages: 2135-2146
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1906685
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1906685
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2135-2146
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1882466_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Xiaowu Dai
Author-X-Name-First: Xiaowu
Author-X-Name-Last: Dai
Author-Name: Lexin Li
Author-X-Name-First: Lexin
Author-X-Name-Last: Li
Title: Kernel Ordinary Differential Equations
Abstract:
Ordinary differential equations (ODEs) are widely used in modeling biological and physical processes in science. In this article, we propose a new reproducing kernel-based approach for estimation and inference of ODEs given noisy observations. We do not assume the functional forms in the ODE to be known, or restrict them to be linear or additive, and we allow pairwise interactions. We perform sparse estimation to select individual functionals, and construct confidence intervals for the estimated signal trajectories. We establish the estimation optimality and selection consistency of kernel ODE under both the low-dimensional and high-dimensional settings, where the number of unknown functionals can be smaller or larger than the sample size. Our proposal builds upon the smoothing spline analysis of variance (SS-ANOVA) framework, but tackles several important problems that are not yet fully addressed, and thus extends the scope of existing SS-ANOVA as well. We demonstrate the efficacy of our method through numerous ODE examples.
Journal: Journal of the American Statistical Association
Pages: 1711-1725
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1882466
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1882466
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1711-1725
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1915319_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Yingying Zhang
Author-X-Name-First: Yingying
Author-X-Name-Last: Zhang
Author-Name: Huixia Judy Wang
Author-X-Name-First: Huixia Judy
Author-X-Name-Last: Wang
Author-Name: Zhongyi Zhu
Author-X-Name-First: Zhongyi
Author-X-Name-Last: Zhu
Title: Single-index Thresholding in Quantile Regression
Abstract:
Threshold regression models are useful for identifying subgroups with heterogeneous parameters. The conventional threshold regression models split the sample based on a single and observed threshold variable, which enforces the threshold point to be equal for all subgroups of the population. In this article, we consider a more flexible single-index threshold model in the quantile regression setup, in which the sample is split based on a linear combination of predictors. We propose a new estimator by smoothing the indicator function in thresholding, which enables Gaussian approximation for statistical inference and allows characterizing the limiting distribution when the quantile process is of interest. We further construct a mixed-bootstrap inference method with faster computation and a procedure for testing the constancy of the threshold parameters across quantiles. Finally, we demonstrate the value of the proposed methods via simulation studies, as well as through an application to executive compensation data.
Journal: Journal of the American Statistical Association
Pages: 2222-2237
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1915319
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1915319
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2222-2237
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1891926_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Daniel R. Kowal
Author-X-Name-First: Daniel R.
Author-X-Name-Last: Kowal
Title: Fast, Optimal, and Targeted Predictions Using Parameterized Decision Analysis
Abstract:
Prediction is critical for decision-making under uncertainty and lends validity to statistical inference. With targeted prediction, the goal is to optimize predictions for specific decision tasks of interest, which we represent via functionals. Although classical decision analysis extracts predictions from a Bayesian model, these predictions are often difficult to interpret and slow to compute. Instead, we design a class of parameterized actions for Bayesian decision analysis that produce optimal, scalable, and simple targeted predictions. For a wide variety of action parameterizations and loss functions—including linear actions with sparsity constraints for targeted variable selection—we derive a convenient representation of the optimal targeted prediction that yields efficient and interpretable solutions. Customized out-of-sample predictive metrics are developed to evaluate and compare among targeted predictors. Through careful use of the posterior predictive distribution, we introduce a procedure that identifies a set of near-optimal, or acceptable targeted predictors, which provide unique insights into the features and level of complexity needed for accurate targeted prediction. Simulations demonstrate excellent prediction, estimation, and variable selection capabilities. Targeted predictions are constructed for physical activity (PA) data from the National Health and Nutrition Examination Survey to better predict and understand the characteristics of intraday PA. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1875-1886
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1891926
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1891926
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1875-1886
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1896526_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Walter Dempsey
Author-X-Name-First: Walter
Author-X-Name-Last: Dempsey
Author-Name: Brandon Oselio
Author-X-Name-First: Brandon
Author-X-Name-Last: Oselio
Author-Name: Alfred Hero
Author-X-Name-First: Alfred
Author-X-Name-Last: Hero
Title: Hierarchical Network Models for Exchangeable Structured Interaction Processes
Abstract:
Network data often arise via a series of structured interactions among a population of constituent elements. E-mail exchanges, for example, have a single sender followed by potentially multiple receivers. Scientific articles, on the other hand, may have multiple subject areas and multiple authors. We introduce a statistical model, termed the Pitman-Yor hierarchical vertex components model (PY-HVCM), that is well suited for structured interaction data. The proposed PY-HVCM effectively models complex relational data by partial pooling of local information via a latent, shared population-level distribution. The PY-HVCM is a canonical example of hierarchical vertex components models—a subfamily of models for exchangeable structured interaction-labeled networks, that is, networks invariant to interaction relabeling. Theoretical analysis and supporting simulations provide clear model interpretation, and establish global sparsity and power law degree distribution. A computationally tractable Gibbs sampling algorithm is derived for inferring sparsity and power law properties of complex networks. We demonstrate the model on both the Enron e-mail dataset and an ArXiv dataset, showing goodness of fit of the model via posterior predictive validation.
Journal: Journal of the American Statistical Association
Pages: 2056-2073
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1896526
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1896526
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2056-2073
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1901718_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Xuening Zhu
Author-X-Name-First: Xuening
Author-X-Name-Last: Zhu
Author-Name: Zhanrui Cai
Author-X-Name-First: Zhanrui
Author-X-Name-Last: Cai
Author-Name: Yanyuan Ma
Author-X-Name-First: Yanyuan
Author-X-Name-Last: Ma
Title: Network Functional Varying Coefficient Model
Abstract:
We consider functional responses with network dependence observed for each individual at irregular time points. To model both the interindividual dependence and within-individual dynamic correlation, we propose a network functional varying coefficient (NFVC) model. The response of each individual is characterized by a linear combination of responses from its connected nodes and its exogenous covariates. All the model coefficients are allowed to be time dependent. The NFVC model adds to the richness of both the classical network autoregression model and the functional regression models. To overcome the complexity caused by the network interdependence, we devise a special nonparametric least-squares-type estimator, which is feasible when the responses are observed at irregular time points for different individuals. The estimator takes advantage of the sparsity of the network structure to reduce the computational burden. To further conduct the functional principal component analysis, a novel within-individual covariance function estimation method is proposed and studied. Theoretical properties of our estimators, which involve techniques related to empirical processes, nonparametrics, functional data analysis and various concentration inequalities, are analyzed. We analyze a social network dataset to illustrate the effectiveness of the proposed procedure.
Journal: Journal of the American Statistical Association
Pages: 2074-2085
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1901718
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1901718
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2074-2085
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1909599_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Jacob Fiksel
Author-X-Name-First: Jacob
Author-X-Name-Last: Fiksel
Author-Name: Abhirup Datta
Author-X-Name-First: Abhirup
Author-X-Name-Last: Datta
Author-Name: Agbessi Amouzou
Author-X-Name-First: Agbessi
Author-X-Name-Last: Amouzou
Author-Name: Scott Zeger
Author-X-Name-First: Scott
Author-X-Name-Last: Zeger
Title: Generalized Bayes Quantification Learning under Dataset Shift
Abstract:
Quantification learning is the task of prevalence estimation for a test population using predictions from a classifier trained on a different population. Quantification methods assume that the sensitivities and specificities of the classifier are either perfect or transportable from the training to the test population. These assumptions are inappropriate in the presence of dataset shift, when the misclassification rates in the training population are not representative of those for the test population. Quantification under dataset shift has been addressed only for single-class (categorical) predictions and assuming perfect knowledge of the true labels on a small subset of the test population. We propose generalized Bayes quantification learning (GBQL) that uses the entire compositional predictions from probabilistic classifiers and allows for uncertainty in true class labels for the limited labeled test data. Instead of positing a full model, we use a model-free Bayesian estimating equation approach to compositional data using Kullback–Leibler loss functions based only on a first-moment assumption. The idea will be useful in Bayesian compositional data analysis in general, as it is robust to different generating mechanisms for compositional data and allows 0’s and 1’s in the compositional outputs, thereby including categorical outputs as a special case. We show how our method yields existing quantification approaches as special cases. Extension to an ensemble GBQL that uses predictions from multiple classifiers, yielding inference robust to inclusion of a poor classifier, is discussed. We outline a fast and efficient Gibbs sampler using a rounding and coarsening approximation to the loss functions. We establish posterior consistency, asymptotic normality, and valid coverage of interval estimates from GBQL, which to our knowledge are the first theoretical results for a quantification approach in the presence of local labeled data.
We also establish a finite-sample posterior concentration rate. Empirical performance of GBQL is demonstrated through simulations and analysis of real data with evident dataset shift. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2163-2181
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1909599
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1909599
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2163-2181
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2054816_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Blakeley B. McShane
Author-X-Name-First: Blakeley B.
Author-X-Name-Last: McShane
Author-Name: Ulf Böckenholt
Author-X-Name-First: Ulf
Author-X-Name-Last: Böckenholt
Author-Name: Karsten T. Hansen
Author-X-Name-First: Karsten T.
Author-X-Name-Last: Hansen
Title: Variation and Covariation in Large-Scale Replication Projects: An Evaluation of Replicability
Abstract:
Over the last decade, large-scale replication projects across the biomedical and social sciences have reported relatively low replication rates. In these large-scale replication projects, replication has typically been evaluated based on a single replication study of some original study and dichotomously as successful or failed. However, evaluations of replicability that are based on a single study and are dichotomous are inadequate, and evaluations of replicability should instead be based on multiple studies, be continuous, and be multi-faceted. Further, such evaluations are in fact possible due to two characteristics shared by many large-scale replication projects. In this article, we provide such an evaluation for two prominent large-scale replication projects, one which replicated a phenomenon from cognitive psychology and another which replicated 13 phenomena from social psychology and behavioral economics. Our results indicate a very high degree of replicability in the former and a medium to low degree of replicability in the latter. They also suggest an unidentified covariate in each, namely ocular dominance in the former and political ideology in the latter, that is theoretically pertinent. We conclude by discussing evaluations of replicability at large, recommendations for future large-scale replication projects, and design-based model generalization. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1605-1621
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2022.2054816
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2054816
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1605-1621
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1895175_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Yan Sun
Author-X-Name-First: Yan
Author-X-Name-Last: Sun
Author-Name: Qifan Song
Author-X-Name-First: Qifan
Author-X-Name-Last: Song
Author-Name: Faming Liang
Author-X-Name-First: Faming
Author-X-Name-Last: Liang
Title: Consistent Sparse Deep Learning: Theory and Computation
Abstract:
Deep learning has been the engine powering many successes of data science. However, the deep neural network (DNN), as the basic model of deep learning, is often excessively over-parameterized, causing many difficulties in training, prediction and interpretation. We propose a frequentist-like method for learning sparse DNNs and justify its consistency under the Bayesian framework: the proposed method could learn a sparse DNN with at most O(n/log(n)) connections and nice theoretical guarantees such as posterior consistency, variable selection consistency and asymptotically optimal generalization bounds. In particular, we establish posterior consistency for the sparse DNN with a mixture Gaussian prior, show that the structure of the sparse DNN can be consistently determined using a Laplace approximation-based marginal posterior inclusion probability approach, and use Bayesian evidence to elicit sparse DNNs learned by an optimization method such as stochastic gradient descent in multiple runs with different initializations. The proposed method is computationally more efficient than standard Bayesian methods for large-scale sparse DNNs. The numerical results indicate that the proposed method can perform very well for large-scale network compression and high-dimensional nonlinear variable selection, both advancing interpretable machine learning.
Journal: Journal of the American Statistical Association
Pages: 1981-1995
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1895175
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1895175
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1981-1995
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1888740_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Sai Li
Author-X-Name-First: Sai
Author-X-Name-Last: Li
Author-Name: T. Tony Cai
Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai
Author-Name: Hongzhe Li
Author-X-Name-First: Hongzhe
Author-X-Name-Last: Li
Title: Inference for High-Dimensional Linear Mixed-Effects Models: A Quasi-Likelihood Approach
Abstract:
Linear mixed-effects models are widely used in analyzing clustered or repeated measures data. We propose a quasi-likelihood approach for estimation and inference of the unknown parameters in linear mixed-effects models with high-dimensional fixed effects. The proposed method is applicable to general settings where the dimension of the random effects and the cluster sizes are possibly large. Regarding the fixed effects, we provide rate optimal estimators and valid inference procedures that do not rely on the structural information of the variance components. We also study the estimation of variance components with high-dimensional fixed effects in general settings. The algorithms are easy to implement and computationally fast. The proposed methods are assessed in various simulation settings and are applied to a real study regarding the associations between body mass index and genetic polymorphic markers in a heterogeneous stock mice population.
Journal: Journal of the American Statistical Association
Pages: 1835-1846
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1888740
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1888740
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1835-1846
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1889565_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Gonzalo García-Donato
Author-X-Name-First: Gonzalo
Author-X-Name-Last: García-Donato
Author-Name: Rui Paulo
Author-X-Name-First: Rui
Author-X-Name-Last: Paulo
Title: Variable Selection in the Presence of Factors: A Model Selection Perspective
Abstract:
In the context of a Gaussian multiple regression model, we address the problem of variable selection when in the list of potential predictors there are factors, that is, categorical variables. We adopt a model selection perspective, that is, we approach the problem by constructing a class of models, each corresponding to a particular selection of active variables. The methodology is Bayesian and proceeds by computing the posterior probability of each of these models. We highlight the fact that the set of competing models depends on the dummy variable representation of the factors, an issue already documented by Fernández et al. in a particular example but that has not received any attention since then. We construct methodology that circumvents this problem and that presents very competitive frequentist behavior when compared with recently proposed techniques. Additionally, it is fully automatic, in that it does not require the specification of any tuning parameters.
Journal: Journal of the American Statistical Association
Pages: 1847-1857
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1889565
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1889565
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1847-1857
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1912758_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Yaniv Tenzer
Author-X-Name-First: Yaniv
Author-X-Name-Last: Tenzer
Author-Name: Micha Mandel
Author-X-Name-First: Micha
Author-X-Name-Last: Mandel
Author-Name: Or Zuk
Author-X-Name-First: Or
Author-X-Name-Last: Zuk
Title: Testing Independence Under Biased Sampling
Abstract:
Testing for dependence between pairs of random variables is a fundamental problem in statistics. In some applications, data are subject to selection bias that can create spurious dependence. An important example is truncation models, in which observed pairs are restricted to a specific subset of the X-Y plane. Standard tests for independence are not suitable in such cases, and alternative tests that take the selection bias into account are required. Here, we generalize the notion of quasi-independence with respect to the sampling mechanism, and study the problem of detecting any deviations from it. We develop two test statistics motivated by the classic Hoeffding’s statistic, and use two approaches to compute their distribution under the null: (i) a bootstrap-based approach, and (ii) a permutation test with nonuniform probability of permutations. We also handle an important application to the case of censoring with truncation, by estimating the biased sampling mechanism from the data. We prove the validity of the tests, and show, using simulations, that they improve power compared to competing methods for important special cases. The tests are applied to four datasets, two that are subject to truncation, with and without censoring, and two with bias mechanisms related to length bias.
Journal: Journal of the American Statistical Association
Pages: 2194-2206
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1912758
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1912758
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2194-2206
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1883437_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Yuehao Bai
Author-X-Name-First: Yuehao
Author-X-Name-Last: Bai
Author-Name: Joseph P. Romano
Author-X-Name-First: Joseph P.
Author-X-Name-Last: Romano
Author-Name: Azeem M. Shaikh
Author-X-Name-First: Azeem M.
Author-X-Name-Last: Shaikh
Title: Inference in Experiments With Matched Pairs
Abstract:
This article studies inference for the average treatment effect in randomized controlled trials where treatment status is determined according to a “matched pairs” design. By a “matched pairs” design, we mean that units are sampled iid from the population of interest, paired according to observed baseline covariates, and finally, within each pair, one unit is selected at random for treatment. This type of design is used routinely throughout the sciences, but fundamental questions about its implications for inference about the average treatment effect remain. The main requirement underlying our analysis is that pairs are formed so that units within pairs are suitably “close” in terms of the baseline covariates, and we develop novel results to ensure that pairs are formed in a way that satisfies this condition. Under this assumption, we show that, for the problem of testing the null hypothesis that the average treatment effect equals a prespecified value in such settings, the commonly used two-sample t-test and “matched pairs” t-test are conservative in the sense that these tests have limiting rejection probability under the null hypothesis no greater than and typically strictly less than the nominal level. We show, however, that a simple adjustment to the standard errors of these tests leads to a test that is asymptotically exact in the sense that its limiting rejection probability under the null hypothesis equals the nominal level. We also study the behavior of randomization tests that arise naturally in these types of settings. When implemented appropriately, we show that this approach also leads to a test that is asymptotically exact in the sense described previously, but additionally has finite-sample rejection probability no greater than the nominal level for certain distributions satisfying the null hypothesis. A simulation study and empirical application confirm the practical relevance of our theoretical results.
Journal: Journal of the American Statistical Association
Pages: 1726-1737
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1883437
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1883437
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1726-1737
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2117703_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Blakeley B. McShane
Author-X-Name-First: Blakeley B.
Author-X-Name-Last: McShane
Author-Name: Ulf Böckenholt
Author-X-Name-First: Ulf
Author-X-Name-Last: Böckenholt
Author-Name: Karsten T. Hansen
Author-X-Name-First: Karsten T.
Author-X-Name-Last: Hansen
Title: Modeling and Learning From Variation and Covariation
Journal: Journal of the American Statistical Association
Pages: 1627-1630
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2022.2117703
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2117703
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1627-1630
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2066536_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Wensheng Guo
Author-X-Name-First: Wensheng
Author-X-Name-Last: Guo
Author-Name: Mengying You
Author-X-Name-First: Mengying
Author-X-Name-Last: You
Author-Name: Jialin Yi
Author-X-Name-First: Jialin
Author-X-Name-Last: Yi
Author-Name: Michel A. Pontari
Author-X-Name-First: Michel A.
Author-X-Name-Last: Pontari
Author-Name: J. Richard Landis
Author-X-Name-First: J. Richard
Author-X-Name-Last: Landis
Title: Functional Mixed Effects Clustering with Application to Longitudinal Urologic Chronic Pelvic Pain Syndrome Symptom Data
Abstract:
By clustering patients with the urologic chronic pelvic pain syndromes (UCPPS) into homogeneous subgroups and associating these subgroups with baseline covariates and other clinical outcomes, we provide opportunities to investigate different potential elements of pathogenesis, which may also guide us in the selection of appropriate therapeutic targets. Motivated by the longitudinal urologic symptom data with extensive subject heterogeneity and differential variability of trajectories, we propose a functional clustering procedure where each subgroup is modeled by a functional mixed effects model, and the posterior probability is used to iteratively classify each subject into different subgroups. The classification takes into account both group-average trajectories and between-subject variabilities. We develop an equivalent state-space model for efficient computation. We also propose a cross-validation based Kullback–Leibler information criterion to choose the optimal number of subgroups. The performance of the proposed method is assessed through a simulation study. We apply our methods to longitudinal bi-weekly measures of a primary urological urinary symptom score from a UCPPS longitudinal cohort study, and identify four subgroups: moderate decline, mild decline, stable, and mild increase. The resulting clusters are associated with one-year changes in several clinically important outcomes, and are related to several clinically relevant baseline predictors, such as sleep disturbance score, physical quality of life, and painful urgency. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1631-1641
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2022.2066536
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2066536
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1631-1641
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1909600_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Jacob Vorstrup Goldman
Author-X-Name-First: Jacob Vorstrup
Author-X-Name-Last: Goldman
Author-Name: Torben Sell
Author-X-Name-First: Torben
Author-X-Name-Last: Sell
Author-Name: Sumeetpal Sidhu Singh
Author-X-Name-First: Sumeetpal Sidhu
Author-X-Name-Last: Singh
Title: Gradient-Based Markov Chain Monte Carlo for Bayesian Inference With Non-differentiable Priors
Abstract:
The use of nondifferentiable priors in Bayesian statistics has become increasingly popular, in particular in Bayesian imaging analysis. Current state-of-the-art methods are approximate in the sense that they replace the posterior with a smooth approximation via Moreau–Yosida envelopes, and apply gradient-based discretized diffusions to sample from the resulting distribution. We characterize the error of the Moreau–Yosida approximation and propose a novel implementation using underdamped Langevin dynamics. In mission-critical cases, however, replacing the posterior with an approximation may not be a viable option. Instead, we show that piecewise-deterministic Markov processes (PDMPs) can be used for exact posterior inference from distributions satisfying almost-everywhere differentiability. Furthermore, in contrast with diffusion-based methods, the suggested PDMP-based samplers place no assumptions on the shape of the prior, nor require access to a computationally cheap proximal operator, and consequently have a much broader scope of application. Through detailed numerical examples, including a nondifferentiable circular distribution and a nonconvex genomics model, we elucidate the relative strengths of these sampling methods on problems of moderate to high dimensions, underlining the benefits of PDMP-based methods when accurate sampling is decisive. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2182-2193
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1909600
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1909600
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2182-2193
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1895178_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Ziyang Lyu
Author-X-Name-First: Ziyang
Author-X-Name-Last: Lyu
Author-Name: A.H. Welsh
Author-X-Name-First: A.H.
Author-X-Name-Last: Welsh
Title: Asymptotics for EBLUPs: Nested Error Regression Models
Abstract:
In this article we derive the asymptotic distribution of estimated best linear unbiased predictors (EBLUPs) of the random effects in a nested error regression model. Under very mild conditions which do not require the assumption of normality, we show that asymptotically the distribution of the EBLUPs, as both the number of clusters and the cluster sizes diverge to infinity, is the convolution of the true distribution of the random effects and a normal distribution. This result yields very simple asymptotic approximations to and estimators of the prediction mean squared error of EBLUPs, and hence asymptotic prediction intervals for the unobserved random effects. We also derive a higher order approximation to the asymptotic mean squared error and provide a detailed theoretical and empirical comparison with the well-known analytical prediction mean squared error approximations and estimators proposed by Kackar and Harville, and by Prasad and Rao. We show that our simple estimator of the prediction mean squared errors of EBLUPs works very well in practice when both the number of clusters and the cluster sizes are sufficiently large. Finally, we illustrate the use of the asymptotic prediction intervals with data on radon measurements of houses in Massachusetts and Arizona.
Journal: Journal of the American Statistical Association
Pages: 2028-2042
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1895178
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1895178
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2028-2042
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1904958_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Tianxi Cai
Author-X-Name-First: Tianxi
Author-X-Name-Last: Cai
Author-Name: Molei Liu
Author-X-Name-First: Molei
Author-X-Name-Last: Liu
Author-Name: Yin Xia
Author-X-Name-First: Yin
Author-X-Name-Last: Xia
Title: Individual Data Protected Integrative Regression Analysis of High-Dimensional Heterogeneous Data
Abstract:
Evidence-based decision making often relies on meta-analyzing multiple studies, which enables more precise estimation and investigation of generalizability. Integrative analysis of multiple heterogeneous studies is, however, highly challenging in the ultra high-dimensional setting. The challenge is even more pronounced when the individual-level data cannot be shared across studies, known as the DataSHIELD constraint. Under sparse regression models that are assumed to be similar yet not identical across studies, we propose in this article a novel integrative estimation procedure for data-Shielding High-dimensional Integrative Regression (SHIR). SHIR protects individual data through a summary-statistics-based integration procedure, accommodates between-study heterogeneity in both the covariate distribution and model parameters, and attains consistent variable selection. Theoretically, SHIR is statistically more efficient than the existing distributed approaches that integrate debiased LASSO estimators from the local sites. Furthermore, the estimation error incurred by aggregating derived data is negligible compared to the statistical minimax rate, and SHIR is shown to be asymptotically equivalent in estimation to the ideal estimator obtained by sharing all data. The finite-sample performance of our method is studied and compared with existing approaches via extensive simulation settings. We further illustrate the utility of SHIR to derive phenotyping algorithms for coronary artery disease using electronic health records data from multiple chronic disease cohorts.
Journal: Journal of the American Statistical Association
Pages: 2105-2119
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1904958
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1904958
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2105-2119
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1886937_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Xi Chen
Author-X-Name-First: Xi
Author-X-Name-Last: Chen
Author-Name: Jason D. Lee
Author-X-Name-First: Jason D.
Author-X-Name-Last: Lee
Author-Name: He Li
Author-X-Name-First: He
Author-X-Name-Last: Li
Author-Name: Yun Yang
Author-X-Name-First: Yun
Author-X-Name-Last: Yang
Title: Distributed Estimation for Principal Component Analysis: An Enlarged Eigenspace Analysis
Abstract:
The growing size of modern datasets brings many challenges to the existing statistical estimation approaches, which calls for new distributed methodologies. This article studies distributed estimation for a fundamental statistical machine learning problem, principal component analysis (PCA). Despite the massive literature on top eigenvector estimation, much less has been developed for top-L-dim (L > 1) eigenspace estimation, especially in a distributed manner. We propose a novel multi-round algorithm for constructing the top-L-dim eigenspace for distributed data. Our algorithm takes advantage of shift-and-invert preconditioning and convex optimization. Our estimator is communication-efficient and achieves a fast convergence rate. In contrast to the existing divide-and-conquer algorithm, our approach has no restriction on the number of machines. Theoretically, the traditional Davis–Kahan theorem requires an explicit eigengap assumption to estimate the top-L-dim eigenspace. To abandon this eigengap assumption, we consider a new route in our analysis: instead of exactly identifying the top-L-dim eigenspace, we show that our estimator is able to cover the targeted top-L-dim population eigenspace. Our distributed algorithm can be applied to a wide range of statistical problems based on PCA, such as principal component regression and the single index model. Finally, we provide simulation studies to demonstrate the performance of the proposed distributed estimator.
Journal: Journal of the American Statistical Association
Pages: 1775-1786
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1886937
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1886937
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1775-1786
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2139708_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: S. Lynne Stokes
Author-X-Name-First: S. Lynne
Author-X-Name-Last: Stokes
Title: Sampling: Design and Analysis, 3rd ed.
Journal: Journal of the American Statistical Association
Pages: 2287-2288
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2022.2139708
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2139708
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2287-2288
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1888739_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Nabarun Deb
Author-X-Name-First: Nabarun
Author-X-Name-Last: Deb
Author-Name: Sujayam Saha
Author-X-Name-First: Sujayam
Author-X-Name-Last: Saha
Author-Name: Adityanand Guntuboyina
Author-X-Name-First: Adityanand
Author-X-Name-Last: Guntuboyina
Author-Name: Bodhisattva Sen
Author-X-Name-First: Bodhisattva
Author-X-Name-Last: Sen
Title: Two-Component Mixture Model in the Presence of Covariates
Abstract:
In this article, we study a generalization of the two-groups model in the presence of covariates—a problem that has recently received much attention in the statistical literature due to its applicability in multiple hypotheses testing problems. The model we consider allows for infinite dimensional parameters and offers flexibility in modeling the dependence of the response on the covariates. We discuss the identifiability issues arising in this model and systematically study several estimation strategies. We propose a tuning parameter-free nonparametric maximum likelihood method, implementable via the expectation-maximization algorithm, to estimate the unknown parameters. Further, we derive the rate of convergence of the proposed estimators—in particular we show that the finite sample Hellinger risk for every ‘approximate’ nonparametric maximum likelihood estimator achieves a near-parametric rate (up to logarithmic multiplicative factors). In addition, we propose and theoretically study two ‘marginal’ methods that are more scalable and easily implementable. We demonstrate the efficacy of our procedures through extensive simulation studies and relevant data analyses—one arising from neuroscience and the other from astronomy. We also outline the application of our methods to multiple testing. The companion R package NPMLEmix implements all the procedures proposed in this article.
Journal: Journal of the American Statistical Association
Pages: 1820-1834
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1888739
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1888739
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1820-1834
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1895177_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Chengchun Shi
Author-X-Name-First: Chengchun
Author-X-Name-Last: Shi
Author-Name: Lexin Li
Author-X-Name-First: Lexin
Author-X-Name-Last: Li
Title: Testing Mediation Effects Using Logic of Boolean Matrices
Abstract:
A central question in high-dimensional mediation analysis is to infer the significance of individual mediators. The main challenge is that the total number of potential paths that go through any mediator is super-exponential in the number of mediators. Most existing mediation inference solutions either explicitly impose that the mediators are conditionally independent given the exposure, or ignore any potential directed paths among the mediators. In this article, we propose a novel hypothesis testing procedure to evaluate individual mediation effects, while taking into account potential interactions among the mediators. Our proposal thus fills a crucial gap, and greatly extends the scope of existing mediation tests. Our key idea is to construct the test statistic using the logic of Boolean matrices, which enables us to establish the proper limiting distribution under the null hypothesis. We further employ screening, data splitting, and decorrelated estimation to reduce the bias and increase the power of the test. We show that our test can control both the size and false discovery rate asymptotically, and the power of the test approaches one, while allowing the number of mediators to diverge to infinity with the sample size. We demonstrate the efficacy of the method through simulations and a neuroimaging study of Alzheimer’s disease. A Python implementation of the proposed procedure is available at https://github.com/callmespring/LOGAN.
Journal: Journal of the American Statistical Association
Pages: 2014-2027
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1895177
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1895177
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2014-2027
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1915320_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: David Azriel
Author-X-Name-First: David
Author-X-Name-Last: Azriel
Author-Name: Lawrence D. Brown
Author-X-Name-First: Lawrence D.
Author-X-Name-Last: Brown
Author-Name: Michael Sklar
Author-X-Name-First: Michael
Author-X-Name-Last: Sklar
Author-Name: Richard Berk
Author-X-Name-First: Richard
Author-X-Name-Last: Berk
Author-Name: Andreas Buja
Author-X-Name-First: Andreas
Author-X-Name-Last: Buja
Author-Name: Linda Zhao
Author-X-Name-First: Linda
Author-X-Name-Last: Zhao
Title: Semi-Supervised Linear Regression
Abstract:
We study a regression problem where for some part of the data we observe both the label variable (Y) and the predictors (X), while for the other part of the data only the predictors are given. Such a problem arises, for example, when observations of the label variable are costly and may require a skilled human agent. When the conditional expectation E[Y|X] is not exactly linear, one can consider the best linear approximation to the conditional expectation, which can be estimated consistently by the least-squares estimates (LSE). The latter depends only on the labeled data. We suggest improved alternative estimates to the LSE that also use the unlabeled data. Our estimation method can be easily implemented and has simply described asymptotic properties. The new estimates asymptotically dominate the usual standard procedures under a certain nonlinearity condition on E[Y|X]; otherwise, they are asymptotically equivalent. The performance of the new estimator for small sample size is investigated in an extensive simulation study. A real data example of inferring the homeless population is used to illustrate the new methodology.
Journal: Journal of the American Statistical Association
Pages: 2238-2251
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1915320
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1915320
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2238-2251
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1893179_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: James Matuk
Author-X-Name-First: James
Author-X-Name-Last: Matuk
Author-Name: Karthik Bharath
Author-X-Name-First: Karthik
Author-X-Name-Last: Bharath
Author-Name: Oksana Chkrebtii
Author-X-Name-First: Oksana
Author-X-Name-Last: Chkrebtii
Author-Name: Sebastian Kurtek
Author-X-Name-First: Sebastian
Author-X-Name-Last: Kurtek
Title: Bayesian Framework for Simultaneous Registration and Estimation of Noisy, Sparse, and Fragmented Functional Data
Abstract:
In many applications, smooth processes generate data that are recorded under a variety of observational regimes, including dense sampling and sparse or fragmented observations that are often contaminated with error. The statistical goal of registering and estimating the individual underlying functions from discrete observations has thus far been mainly approached sequentially without formal uncertainty propagation, or in an application-specific manner by pooling information across subjects. We propose a unified Bayesian framework for simultaneous registration and estimation, which is flexible enough to accommodate inference on individual functions under general observational regimes. Our ability to do this relies on the specification of strongly informative prior models over the amplitude component of function variability using two strategies: a data-driven approach that defines an empirical basis for the amplitude subspace based on training data, and a shape-restricted approach when the relative location and number of extrema is well-understood. The proposed methods build on the elastic functional data analysis framework to separately model amplitude and phase variability inherent in functional data. We emphasize the importance of uncertainty quantification and visualization of these two components as they provide complementary information about the estimated functions. We validate the proposed framework using multiple simulation studies and real applications.
Journal: Journal of the American Statistical Association
Pages: 1964-1980
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1893179
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1893179
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1964-1980
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1886936_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Thomas W. Yee
Author-X-Name-First: Thomas W.
Author-X-Name-Last: Yee
Title: On the Hauck–Donner Effect in Wald Tests: Detection, Tipping Points, and Parameter Space Characterization
Abstract:
The Wald test remains ubiquitous in statistical practice despite shortcomings such as its inaccuracy in small samples and lack of invariance under reparameterization. This article examines another, lesser-known shortcoming called the Hauck–Donner effect (HDE), whereby a Wald test statistic is no longer monotone increasing as a function of increasing distance between the parameter estimate and the null value. Because it results in an upwardly biased p-value and a loss of power, the aberration can lead to very damaging consequences, such as in variable selection. The HDE afflicts many types of regression models and corresponds to estimates near the boundary of the parameter space. This article presents several new results, and its main contributions are to (i) propose a very general test for detecting the HDE in the class of vector generalized linear models (VGLMs), regardless of the underlying cause; (ii) fundamentally characterize the HDE by pairwise ratios of Wald, Rao score, and likelihood ratio test statistics for 1-parameter distributions with large samples; (iii) show that the parameter space may be partitioned into an interior encased by at least 5 HDE severity measures (faint, weak, moderate, strong, extreme); (iv) prove that a necessary condition for the HDE in a 2 by 2 table is a log odds ratio of at least 2; (v) give some practical guidelines about HDE-free hypothesis testing. Overall, practical post-fit tests can now be conducted on potentially any model estimated by iteratively reweighted least squares, especially the GLM and VGLM classes, the latter of which encompasses many popular regression models.
Journal: Journal of the American Statistical Association
Pages: 1763-1774
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1886936
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1886936
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1763-1774
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1893177_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Xiao Guo
Author-X-Name-First: Xiao
Author-X-Name-Last: Guo
Author-Name: Guang Cheng
Author-X-Name-First: Guang
Author-X-Name-Last: Cheng
Title: Moderate-Dimensional Inferences on Quadratic Functionals in Ordinary Least Squares
Abstract:
Statistical inferences for quadratic functionals of a linear regression parameter have found wide applications including signal detection, global testing, inference on the error variance, and the fraction of variance explained. Classical theory based on the ordinary least squares estimator works perfectly in the low-dimensional regime, but fails when the parameter dimension pn grows proportionally to the sample size n. In some cases, its performance is not satisfactory even when n ≥ 5pn. The main contribution of this article is to develop dimension-adaptive inferences for quadratic functionals when lim n→∞ pn/n = τ ∈ [0,1). We propose a bias-and-variance-corrected test statistic and demonstrate that its theoretical validity (such as consistency and asymptotic normality) is adaptive to both low dimension with τ = 0 and moderate dimension with τ ∈ (0,1). Our general theory holds, in particular, without Gaussian design/error or structural parameter assumptions, and applies to a broad class of quadratic functionals covering all aforementioned applications. As a by-product, we find that the classical fixed-dimensional results continue to hold if and only if the signal-to-noise ratio is large enough, say when pn diverges but slower than n. Extensive numerical results demonstrate the satisfactory performance of the proposed methodology even when pn ≥ 0.9n in some extreme cases. The mathematical arguments are based on random matrix theory and the leave-one-observation-out method.
Journal: Journal of the American Statistical Association
Pages: 1931-1950
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1893177
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1893177
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1931-1950
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2089572_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Fei Xue
Author-X-Name-First: Fei
Author-X-Name-Last: Xue
Author-Name: Xiwei Tang
Author-X-Name-First: Xiwei
Author-X-Name-Last: Tang
Author-Name: Grace Kim
Author-X-Name-First: Grace
Author-X-Name-Last: Kim
Author-Name: Karestan C. Koenen
Author-X-Name-First: Karestan C.
Author-X-Name-Last: Koenen
Author-Name: Chantel L. Martin
Author-X-Name-First: Chantel L.
Author-X-Name-Last: Martin
Author-Name: Sandro Galea
Author-X-Name-First: Sandro
Author-X-Name-Last: Galea
Author-Name: Derek Wildman
Author-X-Name-First: Derek
Author-X-Name-Last: Wildman
Author-Name: Monica Uddin
Author-X-Name-First: Monica
Author-X-Name-Last: Uddin
Author-Name: Annie Qu
Author-X-Name-First: Annie
Author-X-Name-Last: Qu
Title: Heterogeneous Mediation Analysis on Epigenomic PTSD and Traumatic Stress in a Predominantly African American Cohort
Abstract:
DNA methylation (DNAm) has been suggested to play a critical role in post-traumatic stress disorder (PTSD) by mediating the relationship between trauma and PTSD. However, this underlying mechanism of PTSD for African Americans still remains unknown. To fill this gap, in this article, we investigate how DNAm mediates the effects of traumatic experiences on PTSD symptoms in the Detroit Neighborhood Health Study (DNHS) (2008–2013), which involves primarily African American adults. To achieve this, we develop a new mediation analysis approach for high-dimensional potential DNAm mediators. A key novelty of our method is that we consider heterogeneity in mediation effects across subpopulations. Specifically, mediators in different subpopulations could have opposite effects on the outcome, and thus could be difficult to identify under a traditional homogeneous model framework. In contrast, the proposed method can estimate heterogeneous mediation effects and identify subpopulations in which individuals share similar effects. Simulation studies demonstrate that the proposed method outperforms existing methods for both homogeneous and heterogeneous data. We also present our mediation analysis results for a dataset with 125 participants and more than 450,000 CpG sites from the DNHS study. The proposed method finds three subgroups of subjects and identifies DNAm mediators corresponding to genes such as HSP90AA1 and NFATC1, which have been linked to PTSD symptoms in the literature. Our findings could be useful in future finer-grained investigation of PTSD mechanisms and in the development of new treatments for PTSD. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1669-1683
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2022.2089572
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2089572
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1669-1683
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1917417_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Yumou Qiu
Author-X-Name-First: Yumou
Author-X-Name-Last: Qiu
Author-Name: Xiao-Hua Zhou
Author-X-Name-First: Xiao-Hua
Author-X-Name-Last: Zhou
Title: Inference on Multi-level Partial Correlations Based on Multi-subject Time Series Data
Abstract:
Partial correlations are commonly used to analyze the conditional dependence among variables. In this work, we propose a hierarchical model to study both the subject- and population-level partial correlations based on multi-subject time-series data. Multiple testing procedures adaptive to temporally dependent data with false discovery proportion control are proposed to identify the nonzero partial correlations in both the subject and population levels. A computationally feasible algorithm is developed. Theoretical results and simulation studies demonstrate the good properties of the proposed procedures. We illustrate the application of the proposed methods in a real example of brain connectivity on fMRI data from normal healthy persons and patients with Parkinson’s disease. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2268-2282
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1917417
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1917417
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2268-2282
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2096618_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Paul De Boeck
Author-X-Name-First: Paul
Author-X-Name-Last: De Boeck
Author-Name: Michael L. DeKay
Author-X-Name-First: Michael L.
Author-X-Name-Last: DeKay
Author-Name: Menglin Xu
Author-X-Name-First: Menglin
Author-X-Name-Last: Xu
Title: The Potential of Factor Analysis for Replication, Generalization, and Integration
Journal: Journal of the American Statistical Association
Pages: 1622-1626
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2022.2096618
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2096618
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1622-1626
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2087658_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Dengdeng Yu
Author-X-Name-First: Dengdeng
Author-X-Name-Last: Yu
Author-Name: Linbo Wang
Author-X-Name-First: Linbo
Author-X-Name-Last: Wang
Author-Name: Dehan Kong
Author-X-Name-First: Dehan
Author-X-Name-Last: Kong
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Title: Mapping the Genetic-Imaging-Clinical Pathway with Applications to Alzheimer’s Disease
Abstract:
Alzheimer’s disease is a progressive form of dementia that results in problems with memory, thinking, and behavior. It often starts with abnormal aggregation and deposition of β amyloid and tau, followed by neuronal damage such as atrophy of the hippocampi, leading to Alzheimer’s disease (AD). The aim of this article is to map the genetic-imaging-clinical pathway for AD in order to delineate the genetically-regulated brain changes that drive disease progression based on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset. We develop a novel two-step approach to delineate the association between high-dimensional 2D hippocampal surface exposures and the Alzheimer’s Disease Assessment Scale (ADAS) cognitive score, while taking into account the ultra-high dimensional clinical and genetic covariates at baseline. Analysis results suggest that the radial distance of each pixel of both hippocampi is negatively associated with the severity of behavioral deficits conditional on observed clinical and genetic covariates. These associations are stronger in Cornu Ammonis region 1 (CA1) and subiculum subregions compared to Cornu Ammonis region 2 (CA2) and Cornu Ammonis region 3 (CA3) subregions. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1656-1668
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2022.2087658
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2087658
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1656-1668
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1887741_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Andrew Zammit-Mangion
Author-X-Name-First: Andrew
Author-X-Name-Last: Zammit-Mangion
Author-Name: Tin Lok James Ng
Author-X-Name-First: Tin Lok James
Author-X-Name-Last: Ng
Author-Name: Quan Vu
Author-X-Name-First: Quan
Author-X-Name-Last: Vu
Author-Name: Maurizio Filippone
Author-X-Name-First: Maurizio
Author-X-Name-Last: Filippone
Title: Deep Compositional Spatial Models
Abstract:
Spatial processes with nonstationary and anisotropic covariance structure are often used when modeling, analyzing, and predicting complex environmental phenomena. Such processes may often be expressed as ones that have stationary and isotropic covariance structure on a warped spatial domain. However, the warping function is generally difficult to fit and not constrained to be injective, often resulting in “space-folding.” Here, we propose modeling an injective warping function through a composition of multiple elemental injective functions in a deep-learning framework. We consider two cases: first, when these functions are known up to some weights that need to be estimated, and, second, when the weights in each layer are random. Inspired by recent methodological and technological advances in deep learning and deep Gaussian processes, we employ approximate Bayesian methods to make inference with these models using graphics processing units. Through simulation studies in one and two dimensions we show that the deep compositional spatial models are quick to fit, and are able to provide better predictions and uncertainty quantification than other deep stochastic models of similar complexity. We also show their remarkable capacity to model nonstationary, anisotropic spatial data using radiances from the MODIS instrument aboard the Aqua satellite.
Journal: Journal of the American Statistical Association
Pages: 1787-1808
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1887741
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1887741
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1787-1808
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1887742_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Jingnan Zhang
Author-X-Name-First: Jingnan
Author-X-Name-Last: Zhang
Author-Name: Xin He
Author-X-Name-First: Xin
Author-X-Name-Last: He
Author-Name: Junhui Wang
Author-X-Name-First: Junhui
Author-X-Name-Last: Wang
Title: Directed Community Detection With Network Embedding
Abstract:
Community detection in network data aims at grouping similar nodes sharing certain characteristics together. Most existing methods focus on detecting communities in undirected networks, where similarity between nodes is measured by their node features and whether they are connected. In this article, we propose a novel method to conduct network embedding and community detection simultaneously in a directed network. The network embedding model introduces two sets of vectors to represent the out- and in-nodes separately, and thus allows the same node to belong to different out- and in-communities. The community detection formulation equips the negative log-likelihood with a novel regularization term to encourage community structure among the node representations, and thus achieves better performance by jointly estimating the node embeddings and their community structures. To tackle the resultant optimization task, an efficient alternating updating scheme is developed. More importantly, the asymptotic properties of the proposed method are established in terms of both network embedding and community detection, which are also supported by numerical experiments on simulated and real examples.
Journal: Journal of the American Statistical Association
Pages: 1809-1819
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1887742
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1887742
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1809-1819
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1914635_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Quefeng Li
Author-X-Name-First: Quefeng
Author-X-Name-Last: Li
Author-Name: Lexin Li
Author-X-Name-First: Lexin
Author-X-Name-Last: Li
Title: Integrative Factor Regression and Its Inference for Multimodal Data Analysis
Abstract:
Multimodal data, where different types of data are collected from the same subjects, are fast emerging in a large variety of scientific applications. Factor analysis is commonly used in integrative analysis of multimodal data, and is particularly useful to overcome the curse of high dimensionality and high correlations. However, there is little work on statistical inference for factor analysis-based supervised modeling of multimodal data. In this article, we consider an integrative linear regression model that is built upon the latent factors extracted from multimodal data. We address three important questions: how to infer the significance of one data modality given the other modalities in the model; how to infer the significance of a combination of variables from one modality or across different modalities; and how to quantify the contribution, measured by the goodness of fit, of one data modality given the others. When answering each question, we explicitly characterize both the benefit and the extra cost of factor analysis. Those questions, to our knowledge, have not yet been addressed despite wide use of factor analysis in integrative multimodal analysis, and our proposal bridges an important gap. We study the empirical performance of our methods through simulations, and further illustrate with a multimodal neuroimaging analysis.
Journal: Journal of the American Statistical Association
Pages: 2207-2221
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1914635
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1914635
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2207-2221
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1917416_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Paromita Dubey
Author-X-Name-First: Paromita
Author-X-Name-Last: Dubey
Author-Name: Hans-Georg Müller
Author-X-Name-First: Hans-Georg
Author-X-Name-Last: Müller
Title: Modeling Time-Varying Random Objects and Dynamic Networks
Abstract:
Samples of dynamic or time-varying networks and other random object data such as time-varying probability distributions are increasingly encountered in modern data analysis. Common methods for time-varying data such as functional data analysis are infeasible when observations are time courses of networks or other complex non-Euclidean random objects that are elements of general metric spaces. In such spaces, only pairwise distances between the data objects are available and a strong limitation is that one cannot carry out arithmetic operations due to the lack of an algebraic structure. We combat this complexity by a generalized notion of mean trajectory taking values in the object space. For this, we adopt pointwise Fréchet means and then construct pointwise distance trajectories between the individual time courses and the estimated Fréchet mean trajectory, thus representing the time-varying objects and networks by functional data. Functional principal component analysis of these distance trajectories can reveal interesting features of dynamic networks and object time courses and is useful for downstream analysis. Our approach also makes it possible to study the empirical dynamics of time-varying objects, including dynamic regression to the mean or explosive behavior over time. We demonstrate desirable asymptotic properties of sample based estimators for suitable population targets under mild assumptions. The utility of the proposed methodology is illustrated with dynamic networks, time-varying distribution data and longitudinal growth data.
Journal: Journal of the American Statistical Association
Pages: 2252-2267
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1917416
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1917416
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2252-2267
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1884561_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Changcheng Li
Author-X-Name-First: Changcheng
Author-X-Name-Last: Li
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Title: Linear Hypothesis Testing in Linear Models With High-Dimensional Responses
Abstract:
In this article, we propose a new projection test for linear hypotheses on regression coefficient matrices in linear models with high-dimensional responses. We systematically study the theoretical properties of the proposed test. We first derive the optimal projection matrix for any given projection dimension to achieve the best power and provide an upper bound for the optimal dimension of the projection matrix. We further provide insights into how to construct the optimal projection matrix. One- and two-sample mean problems can be formulated as special cases of the linear hypotheses studied in this article. We both theoretically and empirically demonstrate that the proposed test can outperform the existing ones for one- and two-sample mean problems. We conduct Monte Carlo simulation to examine the finite sample performance and illustrate the proposed test by a real data example.
Journal: Journal of the American Statistical Association
Pages: 1738-1750
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1884561
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1884561
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1738-1750
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1893176_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Jiwei Zhao
Author-X-Name-First: Jiwei
Author-X-Name-Last: Zhao
Author-Name: Yanyuan Ma
Author-X-Name-First: Yanyuan
Author-X-Name-Last: Ma
Title: A Versatile Estimation Procedure Without Estimating the Nonignorable Missingness Mechanism
Abstract:
We consider the estimation problem in a regression setting where the outcome variable is subject to nonignorable missingness and identifiability is ensured by the shadow variable approach. We propose a versatile estimation procedure in which modeling of the missingness mechanism is completely bypassed. We show that our estimator is easy to implement and we derive the asymptotic theory of the proposed estimator. We also investigate some alternative estimators under different scenarios. Comprehensive simulation studies are conducted to demonstrate the finite sample performance of the method. We apply the estimator to a children’s mental health study to illustrate its usefulness.
Journal: Journal of the American Statistical Association
Pages: 1916-1930
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1893176
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1893176
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1916-1930
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1875838_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Sihai Dave Zhao
Author-X-Name-First: Sihai Dave
Author-X-Name-Last: Zhao
Author-Name: William Biscarri
Author-X-Name-First: William
Author-X-Name-Last: Biscarri
Title: A Regression Modeling Approach to Structured Shrinkage Estimation
Abstract:
Problems involving the simultaneous estimation of multiple parameters arise in many areas of theoretical and applied statistics. A canonical example is the estimation of a vector of normal means. Frequently, structural information about relationships between the parameters of interest is available. For example, in a gene expression denoising problem, genes with similar functions may have similar expression levels. Despite its importance, structural information has not been well-studied in the simultaneous estimation literature, perhaps in part because it poses challenges to the usual geometric or empirical Bayes shrinkage estimation paradigms. This article proposes that some of these challenges can be resolved by adopting an alternate paradigm, based on regression modeling. This approach can naturally incorporate structural information and also motivates new shrinkage estimation and inference procedures. As an illustration, this regression paradigm is used to develop a class of estimators with asymptotic risk optimality properties that perform well in simulations and in denoising gene expression data from a single cell RNA-sequencing experiment.
Journal: Journal of the American Statistical Association
Pages: 1684-1694
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1875838
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1875838
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1684-1694
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1892703_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Davy Paindaveine
Author-X-Name-First: Davy
Author-X-Name-Last: Paindaveine
Author-Name: Joséa Rasoafaraniaina
Author-X-Name-First: Joséa
Author-X-Name-Last: Rasoafaraniaina
Author-Name: Thomas Verdebout
Author-X-Name-First: Thomas
Author-X-Name-Last: Verdebout
Title: Preliminary Multiple-Test Estimation, With Applications to k-Sample Covariance Estimation
Abstract:
Multisample covariance estimation—that is, estimation of the covariance matrices associated with k distinct populations—is a classical problem in multivariate statistics. A common solution is to base estimation on the outcome of a test that these covariance matrices show some given pattern. Such a preliminary test may, for example, investigate whether or not the various covariance matrices are equal to each other (test of homogeneity), or whether or not they have common eigenvectors (test of common principal components), etc. Since it is usually unclear what the possible pattern might be, it is natural to consider a collection of such patterns, leading to a collection of preliminary tests, and to base estimation on the outcome of such a multiple testing rule. In the present work, we therefore study preliminary test estimation based on multiple tests. Since this is of interest also outside k-sample covariance estimation, we do so in a very general framework where it is only assumed that the sequence of models at hand is locally asymptotically normal. In this general setup, we define the proposed estimators and derive their asymptotic properties. We come back to k-sample covariance estimation to illustrate the asymptotic and finite-sample behaviors of our estimators. Finally, we treat a real data example that allows us to show their practical relevance in a supervised classification framework.
Journal: Journal of the American Statistical Association
Pages: 1904-1915
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1892703
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1892703
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1904-1915
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1884562_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Yangfan Zhang
Author-X-Name-First: Yangfan
Author-X-Name-Last: Zhang
Author-Name: Runmin Wang
Author-X-Name-First: Runmin
Author-X-Name-Last: Wang
Author-Name: Xiaofeng Shao
Author-X-Name-First: Xiaofeng
Author-X-Name-Last: Shao
Title: Adaptive Inference for Change Points in High-Dimensional Data
Abstract:
In this article, we propose a class of test statistics for a change point in the mean of high-dimensional independent data. Our test integrates the U-statistic based approach in a recent work by Wang et al. and the Lq-norm based high-dimensional test in a recent work by He et al., and inherits several appealing features such as being tuning parameter free and asymptotic independence for test statistics corresponding to even q’s. A simple combination of test statistics corresponding to several different q’s leads to a test with adaptive power property, that is, it can be powerful against both sparse and dense alternatives. On the estimation front, we obtain the convergence rate of the maximizer of our test statistic standardized by sample size when there is one change point in the mean and q = 2, and propose to combine our tests with a wild binary segmentation algorithm to estimate the change point number and locations when there are multiple change points. Numerical comparisons using both simulated and real data demonstrate the advantage of our adaptive test and its corresponding estimation method.
Journal: Journal of the American Statistical Association
Pages: 1751-1762
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1884562
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1884562
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1751-1762
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1891925_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Xi Chen
Author-X-Name-First: Xi
Author-X-Name-Last: Chen
Author-Name: Weidong Liu
Author-X-Name-First: Weidong
Author-X-Name-Last: Liu
Author-Name: Yichen Zhang
Author-X-Name-First: Yichen
Author-X-Name-Last: Zhang
Title: First-Order Newton-Type Estimator for Distributed Estimation and Inference
Abstract:
This article studies distributed estimation and inference for a general statistical problem with a convex loss that could be nondifferentiable. For the purpose of efficient computation, we restrict ourselves to stochastic first-order optimization, which enjoys low per-iteration complexity. To motivate the proposed method, we first investigate the theoretical properties of a straightforward divide-and-conquer stochastic gradient descent approach. Our theory shows that there is a restriction on the number of machines and this restriction becomes more stringent when the dimension p is large. To overcome this limitation, this article proposes a new multi-round distributed estimation procedure that approximates the Newton step using only stochastic subgradients. The key component in our method is the proposal of a computationally efficient estimator of Σ^{-1}w, where Σ is the population Hessian matrix and w is any given vector. Instead of estimating Σ (or Σ^{-1}), which usually requires the second-order differentiability of the loss, the proposed first-order Newton-type estimator (FONE) directly estimates the vector of interest Σ^{-1}w as a whole and is applicable to nondifferentiable losses. Our estimator also facilitates inference for the empirical risk minimizer: it turns out that the key term in the limiting covariance has the form Σ^{-1}w, which can be estimated by FONE.
Journal: Journal of the American Statistical Association
Pages: 1858-1874
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1891925
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1891925
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1858-1874
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1895176_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Stéphane Guerrier
Author-X-Name-First: Stéphane
Author-X-Name-Last: Guerrier
Author-Name: Roberto Molinari
Author-X-Name-First: Roberto
Author-X-Name-Last: Molinari
Author-Name: Maria-Pia Victoria-Feser
Author-X-Name-First: Maria-Pia
Author-X-Name-Last: Victoria-Feser
Author-Name: Haotian Xu
Author-X-Name-First: Haotian
Author-X-Name-Last: Xu
Title: Robust Two-Step Wavelet-Based Inference for Time Series Models
Abstract:
Latent time series models such as (the independent sum of) ARMA(p, q) models with additional stochastic processes are increasingly used for data analysis in biology, ecology, engineering, and economics. Inference on and/or prediction from these models can be highly challenging: (i) the data may contain outliers that can adversely affect the estimation procedure; (ii) the computational complexity can become prohibitive when the time series are extremely large; (iii) model selection adds another layer of (computational) complexity; and (iv) solutions that address (i), (ii), and (iii) simultaneously do not exist in practice. This paper aims at jointly addressing these challenges by proposing a general framework for robust two-step estimation based on a bounded influence M-estimator of the wavelet variance. We first develop the conditions for the joint asymptotic normality of the latter estimator thereby providing the necessary tools to perform (direct) inference for scale-based analysis of signals. Taking advantage of the model-independent weights of this first-step estimator, we then develop the asymptotic properties of two-step robust estimators using the framework of the generalized method of wavelet moments (GMWM). Simulation studies illustrate the good finite sample performance of the robust GMWM estimator and applied examples highlight the practical relevance of the proposed approach.
Journal: Journal of the American Statistical Association
Pages: 1996-2013
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1895176
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1895176
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1996-2013
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2139707_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Zixiao Wang
Author-X-Name-First: Zixiao
Author-X-Name-Last: Wang
Author-Name: Yi Feng
Author-X-Name-First: Yi
Author-X-Name-Last: Feng
Author-Name: Lin Liu
Author-X-Name-First: Lin
Author-X-Name-Last: Liu
Title: Semiparametric Regression with R
Journal: Journal of the American Statistical Association
Pages: 2283-2287
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2022.2139707
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2139707
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:2283-2287
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1893178_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20220907T060133 git hash: 85d61bd949
Author-Name: Likai Chen
Author-X-Name-First: Likai
Author-X-Name-Last: Chen
Author-Name: Weining Wang
Author-X-Name-First: Weining
Author-X-Name-Last: Wang
Author-Name: Wei Biao Wu
Author-X-Name-First: Wei Biao
Author-X-Name-Last: Wu
Title: Inference of Breakpoints in High-dimensional Time Series
Abstract:
For multiple change-point detection in high-dimensional time series, we provide asymptotic theory concerning the consistency and the asymptotic distribution of the breakpoint statistics and estimated break sizes. The theory backs up a simple two-step procedure for detecting and estimating multiple change-points. The proposed two-step procedure involves the maximum of a MOSUM (moving sum) type statistic in the first step and a CUSUM (cumulative sum) refinement step on an aggregated time series in the second step. Thus, for a fixed time point, we can both capture the biggest break across different coordinates and aggregate simultaneous breaks over multiple coordinates. Extending the existing high-dimensional Gaussian approximation theorem to dependent data with jumps, the theory allows us to characterize the size and power of our multiple change-point test asymptotically. Moreover, we can make inferences on the breakpoint estimates when the break sizes are small. Our theoretical setup incorporates both weak temporal and strong or weak cross-sectional dependence and is suitable for heavy-tailed innovations. A robust long-run covariance matrix estimation is proposed, which can be of independent interest. An application to detecting structural changes in the U.S. unemployment rate illustrates the usefulness of our method.
Journal: Journal of the American Statistical Association
Pages: 1951-1963
Issue: 540
Volume: 117
Year: 2022
Month: 10
X-DOI: 10.1080/01621459.2021.1893178
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1893178
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:117:y:2022:i:540:p:1951-1963
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1917418_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Xiao Han
Author-X-Name-First: Xiao
Author-X-Name-Last: Han
Author-Name: Xin Tong
Author-X-Name-First: Xin
Author-X-Name-Last: Tong
Author-Name: Yingying Fan
Author-X-Name-First: Yingying
Author-X-Name-Last: Fan
Title: Eigen Selection in Spectral Clustering: A Theory-Guided Practice
Abstract:
Based on a Gaussian mixture type model of K components, we derive eigen selection procedures that improve the usual spectral clustering algorithms in high-dimensional settings, which typically act on the top few eigenvectors of an affinity matrix (e.g., X^{⊤}X) derived from the data matrix X. Our selection principle formalizes two intuitions: (i) eigenvectors should be dropped when they have no clustering power; (ii) some eigenvectors corresponding to smaller spiked eigenvalues should be dropped due to estimation inaccuracy. Our selection procedures lead to new spectral clustering algorithms: ESSC for K = 2 and GESSC for K > 2. The newly proposed algorithms enjoy better stability and compare favorably against canonical alternatives, as demonstrated in extensive simulation and multiple real data studies. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 109-121
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1917418
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1917418
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:109-121
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2115916_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Moo K. Chung
Author-X-Name-First: Moo K.
Author-X-Name-Last: Chung
Author-Name: Jamie L. Hanson
Author-X-Name-First: Jamie L.
Author-X-Name-Last: Hanson
Author-Name: Richard J. Davidson
Author-X-Name-First: Richard J.
Author-X-Name-Last: Davidson
Author-Name: Seth D. Pollak
Author-X-Name-First: Seth D.
Author-X-Name-Last: Pollak
Title: Discussion of “LESA: Longitudinal Elastic Shape Analysis of Brain Subcortical Structures”
Journal: Journal of the American Statistical Association
Pages: 20-21
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2022.2115916
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2115916
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:20-21
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1953507_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Hao Chen
Author-X-Name-First: Hao
Author-X-Name-Last: Chen
Author-Name: Yin Xia
Author-X-Name-First: Yin
Author-X-Name-Last: Xia
Title: A Normality Test for High-dimensional Data Based on the Nearest Neighbor Approach
Abstract:
Many statistical methodologies for high-dimensional data assume the population is normal. Although a few multivariate normality tests have been proposed, to the best of our knowledge, none of them can properly control the Type I error when the dimension is larger than the number of observations. In this work, we propose a novel nonparametric test that uses the nearest neighbor information. The proposed method guarantees asymptotic Type I error control under the high-dimensional setting. Simulation studies verify the empirical size of the proposed test when the dimension grows with the sample size, and at the same time demonstrate its superior power compared with alternative methods. We also illustrate our approach through two widely used datasets from the high-dimensional classification and clustering literature, where deviation from the normality assumption may lead to invalid conclusions.
Journal: Journal of the American Statistical Association
Pages: 719-731
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1953507
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1953507
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:719-731
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1942012_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Jingyu He
Author-X-Name-First: Jingyu
Author-X-Name-Last: He
Author-Name: P. Richard Hahn
Author-X-Name-First: P. Richard
Author-X-Name-Last: Hahn
Title: Stochastic Tree Ensembles for Regularized Nonlinear Regression
Abstract:
This article develops a novel stochastic tree ensemble method for nonlinear regression, referred to as accelerated Bayesian additive regression trees, or XBART. By combining regularization and stochastic search strategies from Bayesian modeling with computationally efficient techniques from recursive partitioning algorithms, XBART attains state-of-the-art performance at prediction and function estimation. Simulation studies demonstrate that XBART provides accurate point-wise estimates of the mean function and does so faster than popular alternatives, such as BART, XGBoost, and neural networks (using Keras) on a variety of test functions. Additionally, it is demonstrated that using XBART to initialize the standard BART MCMC algorithm considerably improves credible interval coverage and reduces total run-time. Finally, two basic theoretical results are established: the single tree version of the model is asymptotically consistent and the Markov chain produced by the ensemble version of the algorithm has a unique stationary distribution.
Journal: Journal of the American Statistical Association
Pages: 551-570
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1942012
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1942012
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:551-570
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2173603_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: The Editors
Title: Correction to “Modeling Time-Varying Random Objects and Dynamic Networks”
Journal: Journal of the American Statistical Association
Pages: 778-778
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2023.2173603
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2173603
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:778-778
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1930547_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Wanchuang Zhu
Author-X-Name-First: Wanchuang
Author-X-Name-Last: Zhu
Author-Name: Yingkai Jiang
Author-X-Name-First: Yingkai
Author-X-Name-Last: Jiang
Author-Name: Jun S. Liu
Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu
Author-Name: Ke Deng
Author-X-Name-First: Ke
Author-X-Name-Last: Deng
Title: Partition–Mallows Model and Its Inference for Rank Aggregation
Abstract:
Learning how to aggregate ranking lists has been an active research area for many years and its advances have played a vital role in many applications ranging from bioinformatics to internet commerce. The problem of discerning reliability of rankers based only on the rank data is of great interest to many practitioners, but has received less attention from researchers. By dividing the ranked entities into two disjoint groups, that is, relevant and irrelevant/background ones, and incorporating the Mallows model for the relative ranking of relevant entities, we propose a framework for rank aggregation that can not only distinguish quality differences among the rankers but also provide the detailed ranking information for relevant entities. Theoretical properties of the proposed approach are established, and its advantages over existing approaches are demonstrated via simulation studies and real-data applications. Extensions of the proposed method to handle partial ranking lists and conduct covariate-assisted rank aggregation are also discussed.
Journal: Journal of the American Statistical Association
Pages: 343-359
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1930547
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1930547
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:343-359
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1920958_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Eugene Katsevich
Author-X-Name-First: Eugene
Author-X-Name-Last: Katsevich
Author-Name: Chiara Sabatti
Author-X-Name-First: Chiara
Author-X-Name-Last: Sabatti
Author-Name: Marina Bogomolov
Author-X-Name-First: Marina
Author-X-Name-Last: Bogomolov
Title: Filtering the Rejection Set While Preserving False Discovery Rate Control
Abstract:
Scientific hypotheses in a variety of applications have domain-specific structures, such as the tree structure of the international classification of diseases (ICD), the directed acyclic graph structure of the gene ontology (GO), or the spatial structure in genome-wide association studies. In the context of multiple testing, the resulting relationships among hypotheses can create redundancies among rejections that hinder interpretability. This leads to the practice of filtering rejection sets obtained from multiple testing procedures, which may in turn invalidate their inferential guarantees. We propose Focused BH, a simple, flexible, and principled methodology to adjust for the application of any prespecified filter. We prove that Focused BH controls the false discovery rate under various conditions, including when the filter satisfies an intuitive monotonicity property and the p-values are positively dependent. We demonstrate in simulations that Focused BH performs well across a variety of settings, and illustrate this method’s practical utility via analyses of real datasets based on ICD and GO.
Journal: Journal of the American Statistical Association
Pages: 165-176
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1920958
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1920958
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:165-176
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2173458_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Michael L. Stein
Author-X-Name-First: Michael L.
Author-X-Name-Last: Stein
Title: Editorial: What Makes for a Great Applications and Case Studies Paper?
Journal: Journal of the American Statistical Association
Pages: 1-2
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2023.2173458
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2173458
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:1-2
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1927741_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Arun K. Kuchibhotla
Author-X-Name-First: Arun K.
Author-X-Name-Last: Kuchibhotla
Author-Name: Rohit K. Patra
Author-X-Name-First: Rohit K.
Author-X-Name-Last: Patra
Author-Name: Bodhisattva Sen
Author-X-Name-First: Bodhisattva
Author-X-Name-Last: Sen
Title: Semiparametric Efficiency in Convexity Constrained Single-Index Model
Abstract:
We consider estimation and inference in a single-index regression model with an unknown convex link function. We introduce a convex and Lipschitz constrained least-squares estimator (CLSE) for both the parametric and the nonparametric components given independent and identically distributed observations. We prove the consistency and find the rates of convergence of the CLSE when the errors are assumed to have only q ≥ 2 moments and are allowed to depend on the covariates. When q ≥ 5, we establish the n^{−1/2}-rate of convergence and asymptotic normality of the estimator of the parametric component. Moreover, the CLSE is proved to be semiparametrically efficient if the errors happen to be homoscedastic. We develop and implement a numerically stable and computationally fast algorithm to compute our proposed estimator in the R package simest. We illustrate our methodology through extensive simulations and data analysis. Finally, our proof of efficiency is geometric and provides a general framework that can be used to prove efficiency of estimators in a wide variety of semiparametric models even when they do not satisfy the efficient score equation directly. Supplementary files for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 272-286
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1927741
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1927741
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:272-286
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1955688_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Bowen Gang
Author-X-Name-First: Bowen
Author-X-Name-Last: Gang
Author-Name: Wenguang Sun
Author-X-Name-First: Wenguang
Author-X-Name-Last: Sun
Author-Name: Weinan Wang
Author-X-Name-First: Weinan
Author-X-Name-Last: Wang
Title: Structure–Adaptive Sequential Testing for Online False Discovery Rate Control
Abstract:
Consider the online testing of a stream of hypotheses where a real-time decision must be made before the next data point arrives. The error rate is required to be controlled at all decision points. Conventional simultaneous testing rules are no longer applicable due to the more stringent error constraints and the absence of future data. Moreover, the online decision-making process may come to a halt when the total error budget, or alpha-wealth, is exhausted. This work develops a new class of structure-adaptive sequential testing (SAST) rules for online false discovery rate (FDR) control. A key element in our proposal is a new alpha-investing algorithm that precisely characterizes the gains and losses in sequential decision making. SAST captures time-varying structures of the data stream, learns the optimal threshold adaptively in an ongoing manner, and optimizes the alpha-wealth allocation across different time periods. We present theory and numerical results to show that SAST is asymptotically valid for online FDR control and achieves substantial power gain over existing online testing rules.
Journal: Journal of the American Statistical Association
Pages: 732-745
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1955688
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1955688
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:732-745
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1938083_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yujia Deng
Author-X-Name-First: Yujia
Author-X-Name-Last: Deng
Author-Name: Xiwei Tang
Author-X-Name-First: Xiwei
Author-X-Name-Last: Tang
Author-Name: Annie Qu
Author-X-Name-First: Annie
Author-X-Name-Last: Qu
Title: Correlation Tensor Decomposition and Its Application in Spatial Imaging Data
Abstract:
Multi-dimensional tensor data have gained increasing attention in recent years, especially in biomedical imaging analyses. However, most existing tensor models are based only on the mean information of imaging pixels. Motivated by multimodal optical imaging data in a breast cancer study, we develop a new tensor learning approach to use pixel-wise correlation information, which is represented through the higher-order correlation tensor. We propose a novel semi-symmetric correlation tensor decomposition method which effectively captures the informative spatial patterns of pixel-wise correlations to facilitate cancer diagnosis. We establish the theoretical properties for structure recovery and classification consistency. In addition, we develop an efficient algorithm to achieve computational scalability. Our simulation studies and an application to breast cancer imaging data all indicate that the proposed method outperforms other competing methods in terms of pattern recognition and prediction accuracy.
Journal: Journal of the American Statistical Association
Pages: 440-456
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1938083
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1938083
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:440-456
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1938582_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Alexander Henzi
Author-X-Name-First: Alexander
Author-X-Name-Last: Henzi
Author-Name: Gian-Reto Kleger
Author-X-Name-First: Gian-Reto
Author-X-Name-Last: Kleger
Author-Name: Johanna F. Ziegel
Author-X-Name-First: Johanna F.
Author-X-Name-Last: Ziegel
Title: Distributional (Single) Index Models
Abstract:
A Distributional (Single) Index Model (DIM) is a semiparametric model for distributional regression, that is, estimation of conditional distributions given covariates. The method is a combination of classical single-index models for the estimation of the conditional mean of a response given covariates, and isotonic distributional regression. The model for the index is parametric, whereas the conditional distributions are estimated nonparametrically under a stochastic ordering constraint. We show consistency of our estimators and apply them to a highly challenging dataset on the length of stay (LoS) of patients in intensive care units. We use the model to provide skillful and calibrated probabilistic predictions for the LoS of individual patients, which outperform the available methods in the literature.
Journal: Journal of the American Statistical Association
Pages: 489-503
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1938582
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1938582
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:489-503
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2120399_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: John A. D. Aston
Author-X-Name-First: John A. D.
Author-X-Name-Last: Aston
Author-Name: Eardi Lila
Author-X-Name-First: Eardi
Author-X-Name-Last: Lila
Title: Discussion of LESA: Longitudinal Elastic Shape Analysis of Brain Subcortical Structures
Journal: Journal of the American Statistical Association
Pages: 18-19
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2022.2120399
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2120399
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:18-19
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2102984_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Zhengwu Zhang
Author-X-Name-First: Zhengwu
Author-X-Name-Last: Zhang
Author-Name: Yuexuan Wu
Author-X-Name-First: Yuexuan
Author-X-Name-Last: Wu
Author-Name: Di Xiong
Author-X-Name-First: Di
Author-X-Name-Last: Xiong
Author-Name: Joseph G. Ibrahim
Author-X-Name-First: Joseph G.
Author-X-Name-Last: Ibrahim
Author-Name: Anuj Srivastava
Author-X-Name-First: Anuj
Author-X-Name-Last: Srivastava
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Title: LESA: Longitudinal Elastic Shape Analysis of Brain Subcortical Structures
Abstract:
Over the past 30 years, magnetic resonance imaging has become a ubiquitous tool for accurately visualizing the change and development of the brain’s subcortical structures (e.g., hippocampus). Although subcortical structures act as information hubs of the nervous system, their quantification is still in its infancy due to many challenges in shape extraction, representation, and modeling. Here, we develop a simple and efficient framework of longitudinal elastic shape analysis (LESA) for subcortical structures. Integrating ideas from elastic shape analysis of static surfaces and statistical modeling of sparse longitudinal data, LESA provides a set of tools for systematically quantifying changes of longitudinal subcortical surface shapes from raw structural MRI data. The key novelties of LESA include: (i) it can efficiently represent complex subcortical structures using a small number of basis functions and (ii) it can accurately delineate the spatiotemporal shape changes of the human subcortical structures. We applied LESA to analyze three longitudinal neuroimaging datasets and showcase its wide applications in estimating continuous shape trajectories, building life-span growth patterns, and comparing shape differences among different groups. In particular, with the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data, we found that Alzheimer’s Disease (AD) can significantly accelerate the shape changes of the lateral ventricle and the hippocampus between ages 60 and 75 compared with normal aging. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 3-17
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2022.2102984
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2102984
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:3-17
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1930546_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Jue Hou
Author-X-Name-First: Jue
Author-X-Name-Last: Hou
Author-Name: Jelena Bradic
Author-X-Name-First: Jelena
Author-X-Name-Last: Bradic
Author-Name: Ronghui Xu
Author-X-Name-First: Ronghui
Author-X-Name-Last: Xu
Title: Treatment Effect Estimation Under Additive Hazards Models With High-Dimensional Confounding
Abstract:
Estimating treatment effects for survival outcomes in the high-dimensional setting is critical for many biomedical applications and any application with censored observations. This article establishes an “orthogonal” score for learning treatment effects, using observational data with a potentially large number of confounders. The estimator allows for root-n, asymptotically valid confidence intervals, despite the bias induced by the regularization. Moreover, we develop a novel hazard difference (HDi) estimator. We establish rate double robustness through the cross-fitting formulation. Numerical experiments illustrate the finite sample performance, where we observe that the cross-fitted HDi estimator has the best performance. We study the effect of radical prostatectomy on conservative prostate cancer management through the SEER-Medicare linked data. Last, we provide an extension of both approaches to machine learning and to heterogeneous treatment effects. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 327-342
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1930546
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1930546
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:327-342
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1941053_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Kevin Guo
Author-X-Name-First: Kevin
Author-X-Name-Last: Guo
Author-Name: Guillaume Basse
Author-X-Name-First: Guillaume
Author-X-Name-Last: Basse
Title: The Generalized Oaxaca-Blinder Estimator
Abstract:
After performing a randomized experiment, researchers often use ordinary least-squares (OLS) regression to adjust for baseline covariates when estimating the average treatment effect. It is widely known that the resulting confidence interval is valid even if the linear model is misspecified. In this article, we generalize that conclusion to covariate adjustment with nonlinear models. We introduce an intuitive way to use any “simple” nonlinear model to construct a covariate-adjusted confidence interval for the average treatment effect. The confidence interval derives its validity from randomization alone, and when nonlinear models fit the data better than linear models, it is narrower than the usual interval from OLS adjustment.
Journal: Journal of the American Statistical Association
Pages: 524-536
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1941053
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1941053
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:524-536
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1919122_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Jie Chen
Author-X-Name-First: Jie
Author-X-Name-Last: Chen
Author-Name: Michael L. Stein
Author-X-Name-First: Michael L.
Author-X-Name-Last: Stein
Title: Linear-Cost Covariance Functions for Gaussian Random Fields
Abstract:
Gaussian random fields (GRF) are a fundamental stochastic model for spatiotemporal data analysis. An essential ingredient of a GRF is the covariance function that characterizes the joint Gaussian distribution of the field. Commonly used covariance functions give rise to fully dense and unstructured covariance matrices, for which required calculations are notoriously expensive to carry out for large data. In this work, we propose a construction of covariance functions that result in matrices with a hierarchical structure. Empowered by matrix algorithms that scale linearly with the matrix dimension, the hierarchical structure is proved to be efficient for a variety of random field computations, including sampling, kriging, and likelihood evaluation. Specifically, with n scattered sites, sampling and likelihood evaluation have an O(n) cost and kriging has an O(log n) cost after preprocessing, particularly favorable for the kriging of an extremely large number of sites (e.g., predicting on more sites than observed). We demonstrate comprehensive numerical experiments to show the use of the constructed covariance functions and their appealing computation time. Numerical examples on a laptop include simulated data of size up to one million, as well as a climate data product with over two million observations.
Journal: Journal of the American Statistical Association
Pages: 147-164
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1919122
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1919122
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:147-164
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1935268_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Cheng-Yu Sun
Author-X-Name-First: Cheng-Yu
Author-X-Name-Last: Sun
Author-Name: Boxin Tang
Author-X-Name-First: Boxin
Author-X-Name-Last: Tang
Title: Uniform Projection Designs and Strong Orthogonal Arrays
Abstract:
We explore the connections between uniform projection designs and strong orthogonal arrays of strength 2+ in this article. Both of these classes of designs are suitable for computer experiments and space-filling in two-dimensional margins, but they are motivated by different considerations. Uniform projection designs are introduced by Sun, Wang, and Xu to capture two-dimensional uniformity using the centered L2-discrepancy, whereas strong orthogonal arrays of strength 2+ are brought forth by He, Cheng, and Tang as they achieve stratifications in two dimensions on finer grids than ordinary orthogonal arrays. We first derive a new expression for the centered L2-discrepancy, which gives a decomposition of the criterion into a sum of squares where each square measures one aspect of design uniformity. This result is not only insightful in itself but also allows us to study strong orthogonal arrays in terms of the discrepancy criterion. More specifically, we show that strong orthogonal arrays of strength 2+ are optimal or nearly optimal under the uniform projection criterion.
Journal: Journal of the American Statistical Association
Pages: 417-423
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1935268
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1935268
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:417-423
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2163898_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Sabrina Giordano
Author-X-Name-First: Sabrina
Author-X-Name-Last: Giordano
Title: Data Science Ethics: Concepts, Techniques and Cautionary Tales
Journal: Journal of the American Statistical Association
Pages: 774-776
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2022.2163898
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2163898
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:774-776
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1955689_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yuqi Gu
Author-X-Name-First: Yuqi
Author-X-Name-Last: Gu
Author-Name: Gongjun Xu
Author-X-Name-First: Gongjun
Author-X-Name-Last: Xu
Title: A Joint MLE Approach to Large-Scale Structured Latent Attribute Analysis
Abstract:
Structured latent attribute models (SLAMs) are a family of discrete latent variable models widely used in education, psychology, and epidemiology to model multivariate categorical data. A SLAM assumes that multiple discrete latent attributes explain the dependence of observed variables in a highly structured fashion. Usually, the maximum marginal likelihood estimation approach is adopted for SLAMs, treating the latent attributes as random effects. The increasing scope of modern assessment data involves large numbers of observed variables and high-dimensional latent attributes. This poses challenges to classical estimation methods and requires new methodology and understanding of latent variable modeling. Motivated by this, we consider the joint maximum likelihood estimation (MLE) approach to SLAMs, treating latent attributes as fixed unknown parameters. We investigate estimability, consistency, and computation in the regime where sample size, number of variables, and number of latent attributes all can diverge. We establish the statistical consistency of the joint MLE and propose efficient algorithms that scale well to large-scale data for several popular SLAMs. Simulation studies demonstrate the superior empirical performance of the proposed methods. An application to real data from an international educational assessment gives interpretable findings of cognitive diagnosis.
Journal: Journal of the American Statistical Association
Pages: 746-760
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1955689
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1955689
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:746-760
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1923508_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Nabarun Deb
Author-X-Name-First: Nabarun
Author-X-Name-Last: Deb
Author-Name: Bodhisattva Sen
Author-X-Name-First: Bodhisattva
Author-X-Name-Last: Sen
Title: Multivariate Rank-Based Distribution-Free Nonparametric Testing Using Measure Transportation
Abstract:
In this article, we propose a general framework for distribution-free nonparametric testing in multi-dimensions, based on a notion of multivariate ranks defined using the theory of measure transportation. Unlike other existing proposals in the literature, these multivariate ranks share a number of useful properties with the usual one-dimensional ranks; most importantly, these ranks are distribution-free. This crucial observation allows us to design nonparametric tests that are exactly distribution-free under the null hypothesis. We demonstrate the applicability of this approach by constructing exact distribution-free tests for two classical nonparametric problems: (I) testing for mutual independence between random vectors, and (II) testing for the equality of multivariate distributions. In particular, we propose (multivariate) rank versions of distance covariance and energy statistic for testing scenarios (I) and (II), respectively. In both these problems, we derive the asymptotic null distribution of the proposed test statistics. We further show that our tests are consistent against all fixed alternatives. Moreover, the proposed tests are computationally feasible and are well-defined under minimal assumptions on the underlying distributions (e.g., they do not need any moment assumptions). We also demonstrate the efficacy of these procedures via extensive simulations. In the process of analyzing the theoretical properties of our procedures, we end up proving some new results in the theory of measure transportation and in the limit theory of permutation statistics using Stein’s method for exchangeable pairs, which may be of independent interest.
Journal: Journal of the American Statistical Association
Pages: 192-207
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1923508
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1923508
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:192-207
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1920959_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Zhenhua Lin
Author-X-Name-First: Zhenhua
Author-X-Name-Last: Lin
Author-Name: Miles E. Lopes
Author-X-Name-First: Miles E.
Author-X-Name-Last: Lopes
Author-Name: Hans-Georg Müller
Author-X-Name-First: Hans-Georg
Author-X-Name-Last: Müller
Title: High-Dimensional MANOVA Via Bootstrapping and Its Application to Functional and Sparse Count Data
Abstract:
We propose a new approach to the problem of high-dimensional multivariate ANOVA via bootstrapping max statistics that involve the differences of sample mean vectors. The proposed method proceeds via the construction of simultaneous confidence regions for the differences of population mean vectors. It is suited to simultaneously test the equality of several pairs of mean vectors of potentially more than two populations. By exploiting the variance decay property that is a natural feature in relevant applications, we are able to provide dimension-free and nearly parametric convergence rates for Gaussian approximation, bootstrap approximation, and the size of the test. We demonstrate the proposed approach with ANOVA problems for functional data and sparse count data. The proposed methodology is shown to work well in simulations and several real data applications.
Journal: Journal of the American Statistical Association
Pages: 177-191
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1920959
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1920959
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:177-191
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1947306_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Chencheng Cai
Author-X-Name-First: Chencheng
Author-X-Name-Last: Cai
Author-Name: Rong Chen
Author-X-Name-First: Rong
Author-X-Name-Last: Chen
Author-Name: Min-ge Xie
Author-X-Name-First: Min-ge
Author-X-Name-Last: Xie
Title: Individualized Group Learning
Abstract:
Many massive datasets are assembled through collections of information on a large number of individuals in a population. The analysis of such data, especially in the aspect of individualized inferences and solutions, has the potential to create significant value for practical applications. Traditionally, inference for an individual in the dataset relies either solely on the information about that individual or on summaries of the information about the whole population. However, with the availability of big data, we have the opportunity, as well as a unique challenge, to make a more effective individualized inference that takes into consideration both the population information and the individual discrepancy. To deal with possible heterogeneity within the population while providing effective and credible inferences for individuals in a dataset, this article develops a new approach called individualized group learning (iGroup). The iGroup approach uses local nonparametric techniques to generate an individualized group by pooling other entities in the population that share similar characteristics with the target individual, even when individual estimates are biased due to a limited number of observations. Three general cases of iGroup are discussed, and their asymptotic performances are investigated. Both theoretical results and empirical simulations reveal that, by applying iGroup, the performance of statistical inference at the individual level is ensured and can be substantially improved over inference based on either individual information alone or entire population information. The method has a broad range of applications. An example in financial statistics is presented.
Journal: Journal of the American Statistical Association
Pages: 622-638
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1947306
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1947306
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:622-638
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1918130_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Tao Zhang
Author-X-Name-First: Tao
Author-X-Name-Last: Zhang
Author-Name: Kengo Kato
Author-X-Name-First: Kengo
Author-X-Name-Last: Kato
Author-Name: David Ruppert
Author-X-Name-First: David
Author-X-Name-Last: Ruppert
Title: Bootstrap Inference for Quantile-based Modal Regression
Abstract:
In this article, we develop uniform inference methods for the conditional mode based on quantile regression. Specifically, we propose to estimate the conditional mode by minimizing the derivative of the estimated conditional quantile function defined by smoothing the linear quantile regression estimator, and develop two bootstrap methods, a novel pivotal bootstrap and the nonparametric bootstrap, for our conditional mode estimator. Building on high-dimensional Gaussian approximation techniques, we establish the validity of simultaneous confidence rectangles constructed from the two bootstrap methods for the conditional mode. We also extend the preceding analysis to the case where the dimension of the covariate vector is increasing with the sample size. Finally, we conduct simulation experiments and a real data analysis using the U.S. wage data to demonstrate the finite sample performance of our inference method. The supplemental materials include the wage dataset, R codes and an appendix containing proofs of the main results, additional simulation results, discussion of model misspecification and quantile crossing, and additional details of the numerical implementation.
Journal: Journal of the American Statistical Association
Pages: 122-134
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1918130
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1918130
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:122-134
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1945459_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Lilun Du
Author-X-Name-First: Lilun
Author-X-Name-Last: Du
Author-Name: Xu Guo
Author-X-Name-First: Xu
Author-X-Name-Last: Guo
Author-Name: Wenguang Sun
Author-X-Name-First: Wenguang
Author-X-Name-Last: Sun
Author-Name: Changliang Zou
Author-X-Name-First: Changliang
Author-X-Name-Last: Zou
Title: False Discovery Rate Control Under General Dependence By Symmetrized Data Aggregation
Abstract:
We develop a new class of distribution-free multiple testing rules for false discovery rate (FDR) control under general dependence. A key element in our proposal is a symmetrized data aggregation (SDA) approach to incorporating the dependence structure via sample splitting, data screening, and information pooling. The proposed SDA filter first constructs a sequence of ranking statistics that fulfill global symmetry properties, and then chooses a data-driven threshold along the ranking to control the FDR. The SDA filter substantially outperforms the Knockoff method in power under moderate to strong dependence, and is more robust than existing methods based on asymptotic p-values. We first develop finite-sample theories to provide an upper bound for the actual FDR under general dependence, and then establish the asymptotic validity of SDA for both the FDR and false discovery proportion control under mild regularity conditions. The procedure is implemented in the R package sdafilter. Numerical results confirm the effectiveness and robustness of SDA in FDR control and show that it achieves substantial power gain over existing methods in many settings.
Journal: Journal of the American Statistical Association
Pages: 607-621
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1945459
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1945459
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:607-621
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1929246_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yunan Wu
Author-X-Name-First: Yunan
Author-X-Name-Last: Wu
Author-Name: Lan Wang
Author-X-Name-First: Lan
Author-X-Name-Last: Wang
Author-Name: Haoda Fu
Author-X-Name-First: Haoda
Author-X-Name-Last: Fu
Title: Model-Assisted Uniformly Honest Inference for Optimal Treatment Regimes in High Dimension
Abstract:
This article develops new tools to quantify uncertainty in optimal decision making and to gain insight into which variables one should collect information about given the potential cost of measuring a large number of variables. We investigate simultaneous inference to determine if a group of variables is relevant for estimating an optimal decision rule in a high-dimensional semiparametric framework. The unknown link function permits flexible modeling of the interactions between the treatment and the covariates, but leads to nonconvex estimation in high dimension and imposes significant challenges for inference. We first establish that a local restricted strong convexity condition holds with high probability and that any feasible local sparse solution of the estimation problem can achieve the near-oracle estimation error bound. We further rigorously verify that a wild bootstrap procedure based on a debiased version of the local solution can provide asymptotically honest uniform inference for the effect of a group of variables on optimal decision making. The advantage of honest inference is that it does not require the initial estimator to achieve perfect model selection and does not require the zero and nonzero effects to be well-separated. We also propose an efficient algorithm for estimation. Our simulations suggest satisfactory performance. An example from a diabetes study illustrates the real application. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 305-314
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1929246
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1929246
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:305-314
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1950734_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Erin E. Gabriel
Author-X-Name-First: Erin E.
Author-X-Name-Last: Gabriel
Author-Name: Arvid Sjölander
Author-X-Name-First: Arvid
Author-X-Name-Last: Sjölander
Author-Name: Michael C. Sachs
Author-X-Name-First: Michael C.
Author-X-Name-Last: Sachs
Title: Nonparametric Bounds for Causal Effects in Imperfect Randomized Experiments
Abstract:
Nonignorable missingness and noncompliance can occur even in well-designed randomized experiments, making the intervention effect that the experiment was designed to estimate nonidentifiable. Nonparametric causal bounds provide a way to narrow the range of possible values for a nonidentifiable causal effect with minimal assumptions. We derive novel bounds for the causal risk difference for a binary outcome and intervention in randomized experiments with nonignorable missingness that is caused by a variety of mechanisms, with both perfect and imperfect compliance. We show that the so-called worst-case imputation, whereby all missing subjects on the intervention arm are assumed to have events and all missing subjects on the control or placebo arm are assumed to be event-free, can be too pessimistic in blinded studies with perfect compliance, and does not bound the correct estimand under imperfect compliance. We illustrate the use of the proposed bounds in our motivating data example of peanut consumption on the development of peanut allergies in infants. We find that, even accounting for potentially nonignorable missingness and noncompliance, our derived bounds confirm that regular exposure to peanuts reduces the risk of development of peanut allergies, making the results of this study much more compelling.
Journal: Journal of the American Statistical Association
Pages: 684-692
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1950734
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1950734
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:684-692
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1933498_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Wanrong Zhu
Author-X-Name-First: Wanrong
Author-X-Name-Last: Zhu
Author-Name: Xi Chen
Author-X-Name-First: Xi
Author-X-Name-Last: Chen
Author-Name: Wei Biao Wu
Author-X-Name-First: Wei Biao
Author-X-Name-Last: Wu
Title: Online Covariance Matrix Estimation in Stochastic Gradient Descent
Abstract:
The stochastic gradient descent (SGD) algorithm is widely used for parameter estimation, especially for huge datasets and online learning. While this recursive algorithm is popular for computation and memory efficiency, quantifying variability and randomness of the solutions has been rarely studied. This article aims at conducting statistical inference of SGD-based estimates in an online setting. In particular, we propose a fully online estimator for the covariance matrix of averaged SGD (ASGD) iterates only using the iterates from SGD. We formally establish our online estimator’s consistency and show that the convergence rate is comparable to offline counterparts. Based on the classic asymptotic normality results of ASGD, we construct asymptotically valid confidence intervals for model parameters. Upon receiving new observations, we can quickly update the covariance matrix estimate and the confidence intervals. This approach fits in an online setting and takes full advantage of SGD: efficiency in computation and memory.
Journal: Journal of the American Statistical Association
Pages: 393-404
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1933498
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1933498
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:393-404
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2123332_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Eleni Matechou
Author-X-Name-First: Eleni
Author-X-Name-Last: Matechou
Author-Name: Raffaele Argiento
Author-X-Name-First: Raffaele
Author-X-Name-Last: Argiento
Title: Capture-Recapture Models with Heterogeneous Temporary Emigration
Abstract:
We propose a novel approach for modeling capture-recapture (CR) data on open populations that exhibit temporary emigration, while also accounting for individual heterogeneity to allow for differences in visit patterns and capture probabilities between individuals. Our modeling approach combines changepoint processes—fitted using an adaptive approach—for inferring individual visits, with Bayesian mixture modeling—fitted using a nonparametric approach—for identifying clusters of individuals with similar visit patterns or capture probabilities. The proposed method is extremely flexible as it can be applied to any CR dataset and is not reliant upon specialized sampling schemes, such as Pollock’s robust design. We fit the new model to motivating data on salmon anglers collected annually at the Gaula river in Norway. Our results when analyzing data from the 2017, 2018, and 2019 seasons reveal two clusters of anglers—consistent across years—with substantially different visit patterns. Most anglers are allocated to the “occasional visitors” cluster, making infrequent and shorter visits with mean total length of stay at the river of around seven days, whereas there also exists a small cluster of “super visitors,” with regular and longer visits, with mean total length of stay of around 30 days in a season. Our estimate of the probability of catching salmon whilst at the river is more than three times higher than that obtained when using a model that does not account for temporary emigration, giving us a better understanding of the impact of fishing at the river. Finally, we discuss the effect of the COVID-19 pandemic on the angling population by modeling data from the 2020 season. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 56-69
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2022.2123332
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2123332
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:56-69
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1933499_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Francesco Denti
Author-X-Name-First: Francesco
Author-X-Name-Last: Denti
Author-Name: Federico Camerlenghi
Author-X-Name-First: Federico
Author-X-Name-Last: Camerlenghi
Author-Name: Michele Guindani
Author-X-Name-First: Michele
Author-X-Name-Last: Guindani
Author-Name: Antonietta Mira
Author-X-Name-First: Antonietta
Author-X-Name-Last: Mira
Title: A Common Atoms Model for the Bayesian Nonparametric Analysis of Nested Data
Abstract:
The use of large datasets for targeted therapeutic interventions requires new ways to characterize the heterogeneity observed across subgroups of a specific population. In particular, models for partially exchangeable data are needed for inference on nested datasets, where the observations are assumed to be organized in different units and some sharing of information is required to learn distinctive features of the units. In this manuscript, we propose a nested common atoms model (CAM) that is particularly suited for the analysis of nested datasets where the distributions of the units are expected to differ only over a small fraction of the observations sampled from each unit. The proposed CAM allows a two-layered clustering at the distributional and observational level and is amenable to scalable posterior inference through the use of a computationally efficient nested slice sampler algorithm. We further discuss how to extend the proposed modeling framework to handle discrete measurements, and we conduct posterior inference on a real microbiome dataset from a diet swap study to investigate how the alterations in intestinal microbiota composition are associated with different eating habits. We further investigate the performance of our model in capturing true distributional structures in the population by means of a simulation study.
Journal: Journal of the American Statistical Association
Pages: 405-416
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1933499
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1933499
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:405-416
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1938082_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Jie Zhou
Author-X-Name-First: Jie
Author-X-Name-Last: Zhou
Author-Name: Will Wei Sun
Author-X-Name-First: Will Wei
Author-X-Name-Last: Sun
Author-Name: Jingfei Zhang
Author-X-Name-First: Jingfei
Author-X-Name-Last: Zhang
Author-Name: Lexin Li
Author-X-Name-First: Lexin
Author-X-Name-Last: Li
Title: Partially Observed Dynamic Tensor Response Regression
Abstract:
In modern data science, dynamic tensor data prevail in numerous applications. An important task is to characterize the relationship between dynamic tensor datasets and external covariates. However, the tensor data are often only partially observed, rendering many existing methods inapplicable. In this article, we develop a regression model with a partially observed dynamic tensor as the response and external covariates as the predictor. We introduce the low-rankness, sparsity, and fusion structures on the regression coefficient tensor, and consider a loss function projected over the observed entries. We develop an efficient nonconvex alternating updating algorithm, and derive the finite-sample error bound of the actual estimator from each step of our optimization algorithm. Unobserved entries in the tensor response impose serious challenges. As a result, our proposal differs considerably in terms of estimation algorithm, regularity conditions, as well as theoretical properties, compared to the existing tensor completion or tensor response regression solutions. We illustrate the efficacy of our proposed method using simulations and two real applications, including a neuroimaging dementia study and a digital advertising study.
Journal: Journal of the American Statistical Association
Pages: 424-439
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1938082
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1938082
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:424-439
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1929248_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Shuhao Jiao
Author-X-Name-First: Shuhao
Author-X-Name-Last: Jiao
Author-Name: Alexander Aue
Author-X-Name-First: Alexander
Author-X-Name-Last: Aue
Author-Name: Hernando Ombao
Author-X-Name-First: Hernando
Author-X-Name-Last: Ombao
Title: Functional Time Series Prediction Under Partial Observation of the Future Curve
Abstract:
This article tackles one of the most fundamental goals in functional time series analysis, which is to provide reliable predictions for future functions. Existing methods for predicting a complete future functional observation use only completely observed trajectories. We develop a new method, called partial functional prediction (PFP), which uses both completely observed trajectories and partial information (available partial data) on the trajectory to be predicted. The PFP method includes an automatic selection criterion for tuning parameters based on minimizing the prediction error, and the convergence rate of the PFP prediction is established. Simulation studies demonstrate that incorporating the partially observed trajectory in the prediction outperforms existing methods with respect to mean squared prediction error. The PFP method is illustrated to be superior in the analysis of environmental data and traffic flow data.
Journal: Journal of the American Statistical Association
Pages: 315-326
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1929248
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1929248
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:315-326
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1933497_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Zheng Tracy Ke
Author-X-Name-First: Zheng Tracy
Author-X-Name-Last: Ke
Author-Name: Yucong Ma
Author-X-Name-First: Yucong
Author-X-Name-Last: Ma
Author-Name: Xihong Lin
Author-X-Name-First: Xihong
Author-X-Name-Last: Lin
Title: Estimation of the Number of Spiked Eigenvalues in a Covariance Matrix by Bulk Eigenvalue Matching Analysis
Abstract:
The spiked covariance model has gained increasing popularity in high-dimensional data analysis. A fundamental problem is the determination of the number of spiked eigenvalues, K. For estimation of K, most attention has focused on the use of the top eigenvalues of the sample covariance matrix, and there is little investigation into proper ways of using bulk eigenvalues to estimate K. We propose a principled approach to incorporating bulk eigenvalues in the estimation of K. Our method imposes a working model on the residual covariance matrix, which is assumed to be a diagonal matrix whose entries are drawn from a gamma distribution. Under this model, the bulk eigenvalues are asymptotically close to the quantiles of a fixed parametric distribution. This motivates us to propose a two-step method: the first step uses bulk eigenvalues to estimate parameters of this distribution, and the second step leverages these parameters to assist the estimation of K. The resulting estimator K̂ aggregates information in a large number of bulk eigenvalues. We show the consistency of K̂ under a standard spiked covariance model. We also propose a confidence interval estimate for K. Our extensive simulation studies show that the proposed method is robust and outperforms the existing methods in a range of scenarios. We apply the proposed method to the analysis of a lung cancer microarray dataset and the 1000 Genomes dataset.
Journal: Journal of the American Statistical Association
Pages: 374-392
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1933497
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1933497
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:374-392
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1918554_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Wenxuan Zhong
Author-X-Name-First: Wenxuan
Author-X-Name-Last: Zhong
Author-Name: Yiwen Liu
Author-X-Name-First: Yiwen
Author-X-Name-Last: Liu
Author-Name: Peng Zeng
Author-X-Name-First: Peng
Author-X-Name-Last: Zeng
Title: A Model-free Variable Screening Method Based on Leverage Score
Abstract:
With rapid advances in information technology, massive datasets are collected in all fields of science, such as biology, chemistry, and social science. Useful or meaningful information is extracted from these data often through statistical learning or model fitting. In massive datasets, both sample size and number of predictors can be large, in which case conventional methods face computational challenges. Recently, an innovative and effective sampling scheme based on leverage scores via singular value decompositions has been proposed to select rows of a design matrix as a surrogate of the full data in linear regression. Analogously, variable screening can be viewed as selecting rows of the design matrix. However, effective variable selection along this line of thinking remains elusive. In this article, we bridge this gap to propose a weighted leverage variable screening method by using both the left and right singular vectors of the design matrix. We show theoretically and empirically that the predictors selected using our method can consistently include true predictors not only for linear models but also for complicated general index models. Extensive simulation studies show that the weighted leverage screening method is highly computationally efficient and effective. We also demonstrate its success in identifying carcinoma related genes using spatial transcriptome data.
Journal: Journal of the American Statistical Association
Pages: 135-146
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1918554
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1918554
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:135-146
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1933496_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Hua Liu
Author-X-Name-First: Hua
Author-X-Name-Last: Liu
Author-Name: Jinhong You
Author-X-Name-First: Jinhong
Author-X-Name-Last: You
Author-Name: Jiguo Cao
Author-X-Name-First: Jiguo
Author-X-Name-Last: Cao
Title: A Dynamic Interaction Semiparametric Function-on-Scalar Model
Abstract:
Motivated by recent work studying massive functional data, such as the COVID-19 data, we propose a new dynamic interaction semiparametric function-on-scalar (DISeF) model. The proposed model is useful to explore the dynamic interaction among a set of covariates and their effects on the functional response. The proposed model includes many important models investigated recently as special cases. By approximating the unknown bivariate coefficient functions with tensor-product B-splines, a three-step efficient estimation procedure is developed to iteratively estimate the bivariate varying-coefficient functions, the vector of index parameters, and the covariance functions of random effects. We also establish the asymptotic properties of the estimators, including the convergence rate and their asymptotic distributions. In addition, we develop a test statistic to check whether the dynamic interaction varies with time/spatial locations, and we prove the asymptotic normality of the test statistic. The finite sample performance of our proposed method and of the test statistic is investigated with several simulation studies. Our proposed DISeF model is also used to analyze the COVID-19 data and the ADNI data. In both applications, hypothesis testing shows that the bivariate varying-coefficient functions significantly vary with the index and the time/spatial locations. For instance, we find that the interaction effect of population aging and socio-economic covariates, such as the numbers of hospital beds, physicians, and nurses per 1,000 people and GDP per capita, on the COVID-19 mortality rate varies in different periods of the COVID-19 pandemic. A healthcare infrastructure index related to the COVID-19 mortality rate is also estimated for 141 countries based on the proposed DISeF model.
Journal: Journal of the American Statistical Association
Pages: 360-373
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1933496
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1933496
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:360-373
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2161385_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Jörg Drechsler
Author-X-Name-First: Jörg
Author-X-Name-Last: Drechsler
Title: Differential Privacy for Government Agencies—Are We There Yet?
Abstract:
Government agencies typically need to take potential risks of disclosure into account whenever they publish statistics based on their data or give external researchers access to collected data. In this context, the promise of formal privacy guarantees offered by concepts such as differential privacy seems to be the panacea enabling the agencies to quantify and control the privacy loss incurred by any data release exactly. Nevertheless, despite the excitement in academia and industry, most agencies—with the prominent exception of the U.S. Census Bureau—have been reluctant to even consider the concept for their data release strategy. This article discusses potential reasons for this. We argue that the requirements for implementing differential privacy approaches at government agencies are often fundamentally different from the requirements in industry. This raises many challenges and questions that still need to be addressed before the concept can be used as an overarching principle when sharing data with the public. The article does not offer any solutions to these challenges. Instead, we hope to stimulate some collaborative research efforts, as we believe that many of the problems can only be addressed by interdisciplinary collaborations.
Journal: Journal of the American Statistical Association
Pages: 761-773
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2022.2161385
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2161385
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:761-773
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1944874_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yisu Jia
Author-X-Name-First: Yisu
Author-X-Name-Last: Jia
Author-Name: Stefanos Kechagias
Author-X-Name-First: Stefanos
Author-X-Name-Last: Kechagias
Author-Name: James Livsey
Author-X-Name-First: James
Author-X-Name-Last: Livsey
Author-Name: Robert Lund
Author-X-Name-First: Robert
Author-X-Name-Last: Lund
Author-Name: Vladas Pipiras
Author-X-Name-First: Vladas
Author-X-Name-Last: Pipiras
Title: Latent Gaussian Count Time Series
Abstract:
This article develops the theory and methods for modeling a stationary count time series via Gaussian transformations. The techniques use a latent Gaussian process and a distributional transformation to construct stationary series with very flexible correlation features that can have any prespecified marginal distribution, including the classical Poisson, generalized Poisson, negative binomial, and binomial structures. Gaussian pseudo-likelihood and implied Yule–Walker estimation paradigms, based on the autocovariance function of the count series, are developed via a new Hermite expansion. Particle filtering and sequential Monte Carlo methods are used to conduct likelihood estimation. Connections to state space models are made. Our estimation approaches are evaluated in a simulation study and the methods are used to analyze a count series of weekly retail sales. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 596-606
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1944874
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1944874
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:596-606
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1950003_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Arkajyoti Saha
Author-X-Name-First: Arkajyoti
Author-X-Name-Last: Saha
Author-Name: Sumanta Basu
Author-X-Name-First: Sumanta
Author-X-Name-Last: Basu
Author-Name: Abhirup Datta
Author-X-Name-First: Abhirup
Author-X-Name-Last: Datta
Title: Random Forests for Spatially Dependent Data
Abstract:
Spatial linear mixed models, consisting of a linear covariate effect and a Gaussian process (GP) distributed spatial random effect, are widely used for analyses of geospatial data. We consider the setting where the covariate effect is nonlinear. Random forests (RF) are popular for estimating nonlinear functions, but applications of RF to spatial data have often ignored the spatial correlation. We show that this adversely impacts the performance of RF. We propose RF-GLS, a novel and well-principled extension of RF, for estimating nonlinear covariate effects in spatial mixed models where the spatial correlation is modeled using a GP. RF-GLS extends RF in the same way generalized least squares (GLS) fundamentally extends ordinary least squares (OLS) to accommodate dependence in linear models. RF becomes a special case of RF-GLS, and is substantially outperformed by RF-GLS for both estimation and prediction across extensive numerical experiments with spatially correlated data. RF-GLS can also be used for functional estimation in other types of dependent data, such as time series. We prove consistency of RF-GLS for β-mixing dependent error processes, which include the popular spatial Matérn GP. As a byproduct, we also establish, to our knowledge, the first consistency result for RF under dependence. We establish results of independent importance, including a general consistency result for GLS optimizers of data-driven function classes and a uniform law of large numbers under β-mixing dependence with weaker assumptions. These new tools can be potentially useful for asymptotic analysis of other GLS-style estimators in nonparametric regression with dependent data.
Journal: Journal of the American Statistical Association
Pages: 665-683
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1950003
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1950003
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:665-683
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1928514_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yi Liu
Author-X-Name-First: Yi
Author-X-Name-Last: Liu
Author-Name: Veronika Ročková
Author-X-Name-First: Veronika
Author-X-Name-Last: Ročková
Title: Variable Selection Via Thompson Sampling
Abstract:
Thompson sampling is a heuristic algorithm for the multi-armed bandit problem with a long tradition in machine learning. The algorithm has a Bayesian spirit in the sense that it selects arms based on posterior samples of the reward probabilities of each arm. By forging a connection between combinatorial binary bandits and spike-and-slab variable selection, we propose a stochastic optimization approach to subset selection called Thompson variable selection (TVS). TVS is a framework for interpretable machine learning that does not rely on the underlying model being linear. TVS brings together Bayesian reinforcement learning and machine learning in order to extend the reach of Bayesian subset selection to nonparametric models and large datasets with very many predictors and/or very many observations. Depending on the choice of reward, TVS can be deployed in offline as well as online setups with streaming data batches. Tailoring multiplay bandits to variable selection, we provide regret bounds without necessarily assuming that the arm mean rewards are unrelated. We show very strong empirical performance on both simulated and real data. Unlike deterministic optimization methods for spike-and-slab variable selection, the stochastic nature of TVS makes it less prone to local convergence and thereby more robust.
Journal: Journal of the American Statistical Association
Pages: 287-304
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1928514
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1928514
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:287-304
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2123333_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yeonjoo Park
Author-X-Name-First: Yeonjoo
Author-X-Name-Last: Park
Author-Name: Bo Li
Author-X-Name-First: Bo
Author-X-Name-Last: Li
Author-Name: Yehua Li
Author-X-Name-First: Yehua
Author-X-Name-Last: Li
Title: Crop Yield Prediction Using Bayesian Spatially Varying Coefficient Models with Functional Predictors
Abstract:
Reliable prediction for crop yield is crucial for economic planning, food security monitoring, and agricultural risk management. This study aims to develop a crop yield forecasting model at large spatial scales using meteorological variables closely related to crop growth. The influence of climate patterns on agricultural productivity can be spatially inhomogeneous due to local soil and environmental conditions. We propose a Bayesian spatially varying functional model (BSVFM) to predict county-level corn yield for five Midwestern states, based on annual precipitation and daily maximum and minimum temperature trajectories modeled as multivariate functional predictors. The proposed model accommodates spatial correlation and measurement errors of functional predictors, and respects the spatially heterogeneous relationship between the response and associated predictors by allowing the functional coefficients to vary over space. The model also incorporates a Bayesian variable selection device to further expand its capacity to accommodate spatial heterogeneity. The proposed method is demonstrated to outperform other highly competitive methods in corn yield prediction, owing to the flexibility of allowing spatial heterogeneity with spatially varying coefficients in our model. Our study provides further insights into understanding the impact of climate change on crop yield. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 70-83
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2022.2123333
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2123333
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:70-83
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1942013_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Ines Wilms
Author-X-Name-First: Ines
Author-X-Name-Last: Wilms
Author-Name: Sumanta Basu
Author-X-Name-First: Sumanta
Author-X-Name-Last: Basu
Author-Name: Jacob Bien
Author-X-Name-First: Jacob
Author-X-Name-Last: Bien
Author-Name: David S. Matteson
Author-X-Name-First: David S.
Author-X-Name-Last: Matteson
Title: Sparse Identification and Estimation of Large-Scale Vector AutoRegressive Moving Averages
Abstract:
The vector autoregressive moving average (VARMA) model is fundamental to the theory of multivariate time series; however, identifiability issues have led practitioners to abandon it in favor of the simpler but more restrictive vector autoregressive (VAR) model. We narrow this gap with a new optimization-based approach to VARMA identification built upon the principle of parsimony. Among all equivalent data-generating models, we use convex optimization to seek the parameterization that is simplest in a certain sense. A user-specified strongly convex penalty is used to measure model simplicity, and that same penalty is then used to define an estimator that can be efficiently computed. We establish consistency of our estimators in a double-asymptotic regime. Our nonasymptotic error bound analysis accommodates both model specification and parameter estimation steps, a feature that is crucial for studying large-scale VARMA algorithms. Our analysis also provides new results on penalized estimation of infinite-order VAR, and elastic net regression under a singular covariance structure of regressors, which may be of independent interest. We illustrate the advantage of our method over VAR alternatives on three real data examples.
Journal: Journal of the American Statistical Association
Pages: 571-582
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1942013
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1942013
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:571-582
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1938583_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Charles E. McCulloch
Author-X-Name-First: Charles E.
Author-X-Name-Last: McCulloch
Author-Name: John M. Neuhaus
Author-X-Name-First: John M.
Author-X-Name-Last: Neuhaus
Title: Improving Predictions When Interest Focuses on Extreme Random Effects
Abstract:
Statistical models that generate predicted random effects are widely used to evaluate the performance of and rank patients, physicians, hospitals and health plans from longitudinal and clustered data. Predicted random effects have been proven to outperform treating clusters as fixed effects (essentially a categorical predictor variable) and using standard regression models, on average. These predicted random effects are often used to identify extreme or outlying values, such as poorly performing hospitals or patients with rapid declines in their health. When interest focuses on the extremes rather than performance on average, there has been no systematic investigation of best approaches. We develop novel methods for prediction of extreme values, evaluate their performance, and illustrate their application using data from the Osteoarthritis Initiative to predict walking speed in older adults. The new methods substantially outperform standard random and fixed-effects approaches for extreme values.
Journal: Journal of the American Statistical Association
Pages: 504-513
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1938583
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1938583
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:504-513
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1952877_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Manuel Arellano
Author-X-Name-First: Manuel
Author-X-Name-Last: Arellano
Author-Name: Stéphane Bonhomme
Author-X-Name-First: Stéphane
Author-X-Name-Last: Bonhomme
Title: Recovering Latent Variables by Matching
Abstract:
We propose an optimal-transport-based matching method to nonparametrically estimate linear models with independent latent variables. The method consists in generating pseudo-observations from the latent variables, so that the Euclidean distance between the model’s predictions and their matched counterparts in the data is minimized. We show that our nonparametric estimator is consistent, and we document that it performs well in simulated data. We apply this method to study the cyclicality of permanent and transitory income shocks in the Panel Study of Income Dynamics. We find that the dispersion of income shocks is approximately acyclical, whereas the skewness of permanent shocks is procyclical. By comparison, we find that the dispersion and skewness of shocks to hourly wages vary little with the business cycle. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 693-706
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1952877
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1952877
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:693-706
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2126779_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Matthew Simpson
Author-X-Name-First: Matthew
Author-X-Name-Last: Simpson
Author-Name: Scott H. Holan
Author-X-Name-First: Scott H.
Author-X-Name-Last: Holan
Author-Name: Christopher K. Wikle
Author-X-Name-First: Christopher K.
Author-X-Name-Last: Wikle
Author-Name: Jonathan R. Bradley
Author-X-Name-First: Jonathan R.
Author-X-Name-Last: Bradley
Title: Interpolating Population Distributions using Public-Use Data: An Application to Income Segregation using American Community Survey Data
Abstract:
The presence of income inequality is an important problem for demographers, policy makers, economists, and social scientists. A causal link has been hypothesized between income inequality and income segregation, which measures how much households with similar incomes cluster. The information theory index is used to measure income segregation; however, critics have suggested the divergence index instead. Motivated by this, we construct both indices using American Community Survey (ACS) estimates of features of the income distribution. Since the elimination of the decennial census long form, methods of computing these indices must be updated to interpolate ACS estimates and account for survey error. We propose a novel model-based method to do this that improves on previous approaches by using more types of estimates and by providing uncertainty quantification. We apply this method to estimate U.S. census tract-level income distributions, and in turn use these to construct both income segregation indices. We find major differences between the two indices and find evidence that the information index underestimates the relationship between income inequality and income segregation. The literature suggests interventions designed to reduce income inequality by reducing income segregation, or vice versa, so using the information index implicitly understates the value of these interventions. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 84-96
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2022.2126779
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2126779
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:84-96
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2139264_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Zhengwu Zhang
Author-X-Name-First: Zhengwu
Author-X-Name-Last: Zhang
Author-Name: Yuexuan Wu
Author-X-Name-First: Yuexuan
Author-X-Name-Last: Wu
Author-Name: Di Xiong
Author-X-Name-First: Di
Author-X-Name-Last: Xiong
Author-Name: Joseph G. Ibrahim
Author-X-Name-First: Joseph G.
Author-X-Name-Last: Ibrahim
Author-Name: Anuj Srivastava
Author-X-Name-First: Anuj
Author-X-Name-Last: Srivastava
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Title: Rejoinder: LESA: Longitudinal Elastic Shape Analysis of Brain Subcortical Structures
Journal: Journal of the American Statistical Association
Pages: 25-28
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2022.2139264
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2139264
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:25-28
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2128806_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Tianhai Zu
Author-X-Name-First: Tianhai
Author-X-Name-Last: Zu
Author-Name: Heng Lian
Author-X-Name-First: Heng
Author-X-Name-Last: Lian
Author-Name: Brittany Green
Author-X-Name-First: Brittany
Author-X-Name-Last: Green
Author-Name: Yan Yu
Author-X-Name-First: Yan
Author-X-Name-Last: Yu
Title: Ultra-High Dimensional Quantile Regression for Longitudinal Data: An Application to Blood Pressure Analysis
Abstract:
Despite major advances in research and treatment, identifying important genotype risk factors for high blood pressure remains challenging. Traditional genome-wide association studies (GWAS) focus on one single nucleotide polymorphism (SNP) at a time. We aim to select among over half a million SNPs along with time-varying phenotype variables via simultaneous modeling and variable selection, focusing on the most dangerous blood pressure levels at high quantiles. Taking advantage of rich data from a large-scale public health study, we develop and apply a novel quantile penalized generalized estimating equations (GEE) approach, incorporating several key aspects including ultra-high dimensional genetic SNPs, the longitudinal nature of blood pressure measurements, time-varying covariates, and conditional high quantiles of blood pressure. Importantly, we identify interesting new SNPs for high blood pressure. Moreover, we find that blood pressure levels are likely heterogeneous, with the important risk factors identified differing among quantiles. This comprehensive picture of conditional quantiles of blood pressure can allow more insights and targeted treatments. We provide an efficient computational algorithm and prove consistency, asymptotic normality, and the oracle property for the quantile penalized GEE estimators with ultra-high dimensional predictors. Moreover, we establish model-selection consistency for high-dimensional BIC. Simulation studies show the promise of the proposed approach. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 97-108
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2022.2128806
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2128806
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:97-108
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1948419_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Fangzheng Xie
Author-X-Name-First: Fangzheng
Author-X-Name-Last: Xie
Author-Name: Yanxun Xu
Author-X-Name-First: Yanxun
Author-X-Name-Last: Xu
Title: Efficient Estimation for Random Dot Product Graphs via a One-Step Procedure
Abstract:
We propose a one-step procedure to estimate the latent positions in random dot product graphs efficiently. Unlike the classical spectral-based methods, the proposed one-step procedure takes advantage of both the low-rank structure of the expected adjacency matrix and the Bernoulli likelihood information of the sampling model simultaneously. We show that for each vertex, the corresponding row of the one-step estimator (OSE) converges to a multivariate normal distribution after proper scaling and centering up to an orthogonal transformation, with an efficient covariance matrix. The initial estimator for the one-step procedure needs to satisfy the so-called approximate linearization property. The OSE improves the commonly adopted spectral embedding methods in the following sense: Globally for all vertices, it yields an asymptotic sum of squares error no greater than those of the spectral methods, and locally for each vertex, the asymptotic covariance matrix of the corresponding row of the OSE dominates those of the spectral embeddings in spectra. The usefulness of the proposed one-step procedure is demonstrated via numerical examples and the analysis of a real-world Wikipedia graph dataset.
Journal: Journal of the American Statistical Association
Pages: 651-664
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1948419
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1948419
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:651-664
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2105703_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Jiayin Zheng
Author-X-Name-First: Jiayin
Author-X-Name-Last: Zheng
Author-Name: Xinyuan Dong
Author-X-Name-First: Xinyuan
Author-X-Name-Last: Dong
Author-Name: Christina C. Newton
Author-X-Name-First: Christina C.
Author-X-Name-Last: Newton
Author-Name: Li Hsu
Author-X-Name-First: Li
Author-X-Name-Last: Hsu
Title: A Generalized Integration Approach to Association Analysis with Multi-category Outcome: An Application to a Tumor Sequencing Study of Colorectal Cancer and Smoking
Abstract:
Cancer is a heterogeneous disease, and rapid progress in sequencing and -omics technologies has enabled researchers to characterize tumors comprehensively. This has stimulated intense interest in studying how risk factors are associated with various heterogeneous tumor features. The Cancer Prevention Study-II (CPS-II) cohort is one of the largest prospective studies, particularly valuable for elucidating associations between cancer and risk factors. In this article, we investigate the association of smoking with novel colorectal tumor markers obtained from targeted sequencing. However, due to cost and logistic difficulties, only a limited number of tumors can be assayed, which limits our capability for studying these associations. Meanwhile, there are extensive studies assessing the association of smoking with overall cancer risk and established colorectal tumor markers. Importantly, such summary information is readily available from the literature. By linking this summary information to parameters of interest with proper constraints, we develop a generalized integration approach for the polytomous logistic regression model with outcome characterized by tumor features. The proposed approach gains efficiency by maximizing the joint likelihood of individual-level tumor data and external summary information under constraints that narrow the parameter search space. We apply the proposed method to the CPS-II data and identify associations of smoking with colorectal cancer risk differing by the mutational status of the APC and RNF43 genes, neither of which is identified by the conventional analysis of the CPS-II individual data only. These results help better understand the role of smoking in the etiology of colorectal cancer. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 29-42
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2022.2105703
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2105703
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:29-42
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2174869_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Kwun Chuen Gary Chan
Author-X-Name-First: Kwun Chuen Gary
Author-X-Name-Last: Chan
Title: Handbook of Measurement Error Models
Journal: Journal of the American Statistical Association
Pages: 776-777
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2023.2174869
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2174869
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:776-777
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1942014_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Katarzyna Reluga
Author-X-Name-First: Katarzyna
Author-X-Name-Last: Reluga
Author-Name: María-José Lombardía
Author-X-Name-First: María-José
Author-X-Name-Last: Lombardía
Author-Name: Stefan Sperlich
Author-X-Name-First: Stefan
Author-X-Name-Last: Sperlich
Title: Simultaneous Inference for Empirical Best Predictors With a Poverty Study in Small Areas
Abstract:
Today, generalized linear mixed models (GLMM) are broadly used in many fields. However, the development of tools for performing simultaneous inference has been largely neglected in this domain. A framework for joint inference is indispensable for carrying out statistically valid multiple comparisons of parameters of interest between all or several clusters. We therefore develop simultaneous confidence intervals and multiple testing procedures for empirical best predictors under GLMM. In addition, we apply our methodology to widely employed examples of mixed models, that is, the unit-level binomial, the area-level Poisson-gamma, and the area-level Poisson-lognormal mixed models. The asymptotic results are accompanied by extensive simulations. A case study on predicting poverty rates illustrates the applicability and advantages of our simultaneous inference tools.
Journal: Journal of the American Statistical Association
Pages: 583-595
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1942014
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1942014
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:583-595
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2123334_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Daiwei Zhang
Author-X-Name-First: Daiwei
Author-X-Name-Last: Zhang
Author-Name: Jian Kang
Author-X-Name-First: Jian
Author-X-Name-Last: Kang
Title: Discussion of “LESA: Longitudinal Elastic Shape Analysis of Brain Subcortical Structures”
Journal: Journal of the American Statistical Association
Pages: 22-24
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2022.2123334
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2123334
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:22-24
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1938084_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Ye Tian
Author-X-Name-First: Ye
Author-X-Name-Last: Tian
Author-Name: Yang Feng
Author-X-Name-First: Yang
Author-X-Name-Last: Feng
Title: RaSE: A Variable Screening Framework via Random Subspace Ensembles
Abstract:
Variable screening methods have been shown to be effective in dimension reduction under the ultra-high dimensional setting. Most existing screening methods are designed to rank the predictors according to their individual contributions to the response. As a result, variables that are marginally independent but jointly dependent with the response could be missed. In this work, we propose a new framework for variable screening, random subspace ensemble (RaSE), which works by evaluating the quality of random subspaces that may cover multiple predictors. This new screening framework can be naturally combined with any subspace evaluation criterion, which leads to an array of screening methods. The framework is capable of identifying signals with no marginal effect or with high-order interaction effects. It is shown to enjoy the sure screening property and rank consistency. We also develop an iterative version of RaSE screening with theoretical support. Extensive simulation studies and real-data analyses show the effectiveness of the new screening framework.
Journal: Journal of the American Statistical Association
Pages: 457-468
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1938084
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1938084
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:457-468
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1947307_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Guangyu Yang
Author-X-Name-First: Guangyu
Author-X-Name-Last: Yang
Author-Name: Baqun Zhang
Author-X-Name-First: Baqun
Author-X-Name-Last: Zhang
Author-Name: Min Zhang
Author-X-Name-First: Min
Author-X-Name-Last: Zhang
Title: Estimation of Knots in Linear Spline Models
Abstract:
The linear spline model is able to accommodate nonlinear effects while allowing for an easy interpretation. It has significant applications in studying threshold effects and change-points. However, its application in practice has been limited by the lack of knot estimation methods that are both rigorously studied and computationally convenient. A key difficulty in estimating knots lies in the nondifferentiability. In this article, we study influence functions of regular and asymptotically linear estimators for linear spline models using the semiparametric theory. Based on the theoretical development, we propose a simple semismooth estimating equation approach to circumvent the nondifferentiability issue using modified derivatives, in contrast to the previous smoothing-based methods. Without relying on any smoothing parameters, the proposed method is computationally convenient. To further improve numerical stability, a two-step algorithm taking advantage of the analytic solution available when knots are known is developed to solve the proposed estimating equation. Consistency and asymptotic normality are rigorously derived using the empirical process theory. Simulation studies have shown that the two-step algorithm performs well in terms of both statistical and computational properties and improves over existing methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 639-650
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1947307
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1947307
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:639-650
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1941054_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: María F. Gil–Leyva
Author-X-Name-First: María F.
Author-X-Name-Last: Gil–Leyva
Author-Name: Ramsés H. Mena
Author-X-Name-First: Ramsés H.
Author-X-Name-Last: Mena
Title: Stick-Breaking Processes With Exchangeable Length Variables
Abstract:
Our object of study is the general class of stick-breaking processes with exchangeable length variables. These generalize well-known Bayesian nonparametric priors in an unexplored direction. We give conditions to ensure that the respective species sampling process is proper and that the corresponding prior has full support. For a rich subclass we explain how, by tuning a single [0,1]-valued parameter, the stochastic ordering of the weights can be modulated, and Dirichlet and Geometric priors can be recovered. A general formula for the distribution of the latent allocation variables is derived and an MCMC algorithm is proposed for density estimation purposes.
Journal: Journal of the American Statistical Association
Pages: 537-550
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1941054
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1941054
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:537-550
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1924178_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Kuang-Yao Lee
Author-X-Name-First: Kuang-Yao
Author-X-Name-Last: Lee
Author-Name: Dingjue Ji
Author-X-Name-First: Dingjue
Author-X-Name-Last: Ji
Author-Name: Lexin Li
Author-X-Name-First: Lexin
Author-X-Name-Last: Li
Author-Name: Todd Constable
Author-X-Name-First: Todd
Author-X-Name-Last: Constable
Author-Name: Hongyu Zhao
Author-X-Name-First: Hongyu
Author-X-Name-Last: Zhao
Title: Conditional Functional Graphical Models
Abstract:
Graphical modeling of multivariate functional data is becoming increasingly important in a wide variety of applications. The changes of graph structure can often be attributed to external variables, such as the diagnosis status or time, the latter of which gives rise to the problem of dynamic graphical modeling. Most existing methods focus on estimating the graph by aggregating samples, but largely ignore the subject-level heterogeneity due to the external variables. In this article, we introduce a conditional graphical model for multivariate random functions, where we treat the external variables as the conditioning set, and allow the graph structure to vary with the external variables. Our method is built on two new linear operators, the conditional precision operator and the conditional partial correlation operator, which extend the precision matrix and the partial correlation matrix to both the conditional and functional settings. We show that their nonzero elements can be used to characterize the conditional graphs, and develop the corresponding estimators. We establish the uniform convergence of the proposed estimators and the consistency of the estimated graph, while allowing the graph size to grow with the sample size, and accommodating both completely and partially observed data. We demonstrate the efficacy of the method through both simulations and a study of brain functional connectivity network.
Journal: Journal of the American Statistical Association
Pages: 257-271
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1924178
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1924178
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:257-271
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1953506_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Edward McFowland
Author-X-Name-First: Edward
Author-X-Name-Last: McFowland
Author-Name: Cosma Rohilla Shalizi
Author-X-Name-First: Cosma Rohilla
Author-X-Name-Last: Shalizi
Title: Estimating Causal Peer Influence in Homophilous Social Networks by Inferring Latent Locations
Abstract:
Social influence cannot be identified from purely observational data on social networks, because such influence is generically confounded with latent homophily, that is, with a node’s network partners being informative about the node’s attributes and therefore its behavior. If the network grows according to either a latent community (stochastic block) model, or a continuous latent space model, then latent homophilous attributes can be consistently estimated from the global pattern of social ties. We show that, for common versions of those two network models, these estimates are so informative that controlling for estimated attributes allows for asymptotically unbiased and consistent estimation of social-influence effects in linear models. In particular, the bias shrinks at a rate that directly reflects how much information the network provides about the latent attributes. These are the first results on the consistent nonexperimental estimation of social-influence effects in the presence of latent homophily, and we discuss the prospects for generalizing them.
Journal: Journal of the American Statistical Association
Pages: 707-718
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1953506
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1953506
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:707-718
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1923509_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yingying Dong
Author-X-Name-First: Yingying
Author-X-Name-Last: Dong
Author-Name: Ying-Ying Lee
Author-X-Name-First: Ying-Ying
Author-X-Name-Last: Lee
Author-Name: Michael Gou
Author-X-Name-First: Michael
Author-X-Name-Last: Gou
Title: Regression Discontinuity Designs With a Continuous Treatment
Abstract:
The standard regression discontinuity (RD) design deals with a binary treatment. Many empirical applications of RD designs involve continuous treatments. This article establishes identification and robust bias-corrected inference for such RD designs. Causal identification is achieved by using any changes in the distribution of the continuous treatment at the RD threshold (including the usual mean change as a special case). We discuss a double-robust identification approach and propose an estimand that incorporates the standard fuzzy RD estimand as a special case. Applying the proposed approach, we estimate the impacts of bank capital on bank failure in the pre-Great Depression era in the United States. Our RD design takes advantage of the minimum capital requirements, which change discontinuously with town size.
Journal: Journal of the American Statistical Association
Pages: 208-221
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1923509
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1923509
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:208-221
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2110876_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Xiaoyu Song
Author-X-Name-First: Xiaoyu
Author-X-Name-Last: Song
Author-Name: Jiayi Ji
Author-X-Name-First: Jiayi
Author-X-Name-Last: Ji
Author-Name: Pei Wang
Author-X-Name-First: Pei
Author-X-Name-Last: Wang
Title: iProMix: A Mixture Model for Studying the Function of ACE2 based on Bulk Proteogenomic Data
Abstract:
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused over six million deaths in the ongoing COVID-19 pandemic. SARS-CoV-2 uses the ACE2 protein to enter human cells, raising a pressing need to characterize the proteins/pathways interacting with ACE2. Large-scale proteomic profiling technology is not yet mature at single-cell resolution for examining protein activities in disease-relevant cell types. We propose iProMix, a novel statistical framework to identify epithelial-cell-specific associations between ACE2 and other proteins/pathways with bulk proteomic data. iProMix decomposes the data and models the cell-type-specific conditional joint distribution of proteins through a mixture model. It improves cell-type composition estimation from prior input, and uses a nonparametric inference framework to account for the uncertainty of cell-type proportion estimates in hypothesis testing. Simulations demonstrate that iProMix has well-controlled false discovery rates and favorable power in nonasymptotic settings. We apply iProMix to the proteomic data of 110 (tumor-adjacent) normal lung tissue samples from the Clinical Proteomic Tumor Analysis Consortium lung adenocarcinoma study, and identify interferon α/γ response pathways as the most significant pathways associated with ACE2 protein abundances in epithelial cells. Strikingly, the association direction is sex-specific. This result casts light on the sex difference in COVID-19 incidences and outcomes, and motivates sex-specific evaluation of interferon therapies. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 43-55
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2022.2110876
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2110876
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:43-55
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1941052_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Lingzhu Li
Author-X-Name-First: Lingzhu
Author-X-Name-Last: Li
Author-Name: Xuehu Zhu
Author-X-Name-First: Xuehu
Author-X-Name-Last: Zhu
Author-Name: Lixing Zhu
Author-X-Name-First: Lixing
Author-X-Name-Last: Zhu
Title: Adaptive-to-Model Hybrid of Tests for Regressions
Abstract:
In model checking for regressions, nonparametric estimation-based tests usually have tractable limiting null distributions and are sensitive to oscillating alternative models, but suffer from the curse of dimensionality. In contrast, empirical process-based tests can, at the fastest possible rate, detect local alternatives distinct from the null model, yet are less sensitive to oscillating alternatives and rely on Monte Carlo approximation for critical value determination, which is computationally costly. We propose an adaptive-to-model hybrid of moment and conditional moment-based tests to fully inherit the merits of these two types of tests and avoid their shortcomings. Further, such a hybrid makes nonparametric estimation-based tests, under the alternatives, also share the merits of existing empirical process-based tests. The methodology can be readily applied to other kinds of data and to the construction of other hybrids. As a by-product in the sufficient dimension reduction field, a study on the residual-related central mean subspace and central subspace for model adaptation is devoted to showing when alternative models can be indicated and when they cannot. Numerical studies are conducted to verify the power of the proposed test.
Journal: Journal of the American Statistical Association
Pages: 514-523
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1941052
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1941052
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:514-523
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1923510_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Xin Xing
Author-X-Name-First: Xin
Author-X-Name-Last: Xing
Author-Name: Zhigen Zhao
Author-X-Name-First: Zhigen
Author-X-Name-Last: Zhao
Author-Name: Jun S. Liu
Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu
Title: Controlling False Discovery Rate Using Gaussian Mirrors
Abstract:
Simultaneously finding multiple influential variables and controlling the false discovery rate (FDR) for linear regression models is a fundamental problem. We here propose the Gaussian Mirror (GM) method, which creates for each predictor variable a pair of mirror variables by adding and subtracting a randomly generated Gaussian perturbation, and proceeds with a certain regression method, such as ordinary least squares or the Lasso (the mirror variables can also be created after selection). The mirror variables naturally lead to test statistics effective for controlling the FDR. Under a mild assumption on the dependence among the covariates, we show that the FDR can be controlled at any designated level asymptotically. We also demonstrate through extensive numerical studies that the GM method is more powerful than many existing methods for selecting relevant variables subject to FDR control, especially for cases when the covariates are highly correlated and the influential variables are not overly sparse.
Journal: Journal of the American Statistical Association
Pages: 222-241
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1923510
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1923510
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:222-241
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1923511_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Kosuke Imai
Author-X-Name-First: Kosuke
Author-X-Name-Last: Imai
Author-Name: Michael Lingzhi Li
Author-X-Name-First: Michael Lingzhi
Author-X-Name-Last: Li
Title: Experimental Evaluation of Individualized Treatment Rules
Abstract:
The increasing availability of individual-level data has led to numerous applications of individualized (or personalized) treatment rules (ITRs). Policy makers often wish to empirically evaluate ITRs and compare their relative performance before implementing them in a target population. We propose a new evaluation metric, the population average prescriptive effect (PAPE). The PAPE compares the performance of an ITR with that of a non-individualized treatment rule, which randomly treats the same proportion of units. Averaging the PAPE over a range of budget constraints yields our second evaluation metric, the area under the prescriptive effect curve (AUPEC). The AUPEC represents an overall performance measure for evaluation, like the area under the receiver operating characteristic curve (AUROC) does for classification, and is a generalization of the QINI coefficient used in uplift modeling. We use Neyman’s repeated sampling framework to estimate the PAPE and AUPEC and derive their exact finite-sample variances based on random sampling of units and random assignment of treatment. We extend our methodology to a common setting, in which the same experimental data are used to both estimate and evaluate ITRs. In this case, our variance calculation incorporates the additional uncertainty due to random splits of data used for cross-validation. The proposed evaluation metrics can be estimated without requiring modeling assumptions, asymptotic approximation, or resampling methods. As a result, they are applicable to any ITR, including those based on complex machine learning algorithms. An open-source software package is available for implementing the proposed methodology. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 242-256
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1923511
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1923511
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:242-256
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1938581_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Kun Zhou
Author-X-Name-First: Kun
Author-X-Name-Last: Zhou
Author-Name: Ker-Chau Li
Author-X-Name-First: Ker-Chau
Author-X-Name-Last: Li
Author-Name: Qing Zhou
Author-X-Name-First: Qing
Author-X-Name-Last: Zhou
Title: Honest Confidence Sets for High-Dimensional Regression by Projection and Shrinkage
Abstract:
The issue of honesty in constructing confidence sets arises in nonparametric regression. While the optimal rate in nonparametric estimation can be achieved and utilized to construct sharp confidence sets, severe degradation of the confidence level often happens after estimating the degree of smoothness. Similarly, for high-dimensional regression, oracle inequalities for sparse estimators could be utilized to construct sharp confidence sets. Yet, the degree of sparsity itself is unknown and needs to be estimated, which causes the honesty problem. To resolve this issue, we develop a novel method to construct honest confidence sets for sparse high-dimensional linear regression. The key idea in our construction is to separate signals into a strong and a weak group, and then construct confidence sets for each group separately. This is achieved by a projection and shrinkage approach, the latter implemented via Stein estimation and the associated Stein unbiased risk estimate. Our confidence set is honest over the full parameter space without any sparsity constraints, while its size adapts to the optimal rate of n^(-1/4) when the true parameter is indeed sparse. Moreover, under some form of a separation assumption between the strong and weak signals, the diameter of our confidence set can achieve a faster rate than existing methods. Through extensive numerical comparisons on both simulated and real data, we demonstrate that our method outperforms other competitors by large margins for finite samples, including oracle methods built upon the true sparsity of the underlying model.
Journal: Journal of the American Statistical Association
Pages: 469-488
Issue: 541
Volume: 118
Year: 2023
Month: 1
X-DOI: 10.1080/01621459.2021.1938581
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1938581
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:541:p:469-488
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1962328_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Francesca R. Crucinio
Author-X-Name-First: Francesca R.
Author-X-Name-Last: Crucinio
Author-Name: Arnaud Doucet
Author-X-Name-First: Arnaud
Author-X-Name-Last: Doucet
Author-Name: Adam M. Johansen
Author-X-Name-First: Adam M.
Author-X-Name-Last: Johansen
Title: A Particle Method for Solving Fredholm Equations of the First Kind
Abstract:
Fredholm integral equations of the first kind are the prototypical example of ill-posed linear inverse problems. They model, among other things, reconstruction of distorted noisy observations and indirect density estimation, and also appear in instrumental variable regression. However, their numerical solution remains a challenging problem. Many techniques currently available require a preliminary discretization of the domain of the solution and make strong assumptions about its regularity. For example, the popular expectation maximization smoothing (EMS) scheme requires the assumption of piecewise constant solutions, which is inappropriate for most applications. We propose here a novel particle method that circumvents these two issues. This algorithm can be thought of as a Monte Carlo approximation of the EMS scheme which not only performs an adaptive stochastic discretization of the domain but also results in smooth approximate solutions. We analyze the theoretical properties of the EMS iteration and of the corresponding particle algorithm. Compared to standard EMS, we show experimentally that our novel particle method provides state-of-the-art performance for realistic systems, including motion deblurring and reconstruction of cross-section images of the brain from positron emission tomography.
Journal: Journal of the American Statistical Association
Pages: 937-947
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1962328
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1962328
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:937-947
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2152342_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Wei Zhong
Author-X-Name-First: Wei
Author-X-Name-Last: Zhong
Author-Name: Chen Qian
Author-X-Name-First: Chen
Author-X-Name-Last: Qian
Author-Name: Wanjun Liu
Author-X-Name-First: Wanjun
Author-X-Name-Last: Liu
Author-Name: Liping Zhu
Author-X-Name-First: Liping
Author-X-Name-Last: Zhu
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Title: Feature Screening for Interval-Valued Response with Application to Study Association between Posted Salary and Required Skills
Abstract:
It is important to quantify the differences in returns to skills using online job advertisement data, which have attracted great interest in both the labor economics and statistics fields. In this article, we study the relationship between the posted salary and the job requirements in online labor markets. There are two challenges to deal with. First, the posted salary is always presented in an interval-valued form, for example, 5k–10k yuan per month. Simply taking the mid-point or the lower bound as a proxy for salary may result in biased estimators. Second, the number of potential skill words as predictors generated from the job advertisements by word segmentation is very large, and many of them may not contribute to the salary. To this end, we propose a new feature screening method, Absolute Distribution Difference Sure Independence Screening (ADD-SIS), to select important skill words for the interval-valued response. The marginal utility for feature screening is based on the difference of estimated distribution functions via nonparametric maximum likelihood estimation, which fully uses the interval information. It is model-free and robust to outliers. Numerical simulations show that the new method using the interval information is more effective at selecting important predictors than methods based only on single points of the intervals. In the real data application, we study the text data of job advertisements for data scientists and data analysts on a major Chinese online job posting website, and explore the important skill words for the salary. We find that skill words like optimization, long short-term memory (LSTM), convolutional neural networks (CNN), and collaborative filtering are positively correlated with the salary, while words like Excel, Office, and data collection may negatively contribute to the salary. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 805-817
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2022.2152342
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2152342
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:805-817
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1996376_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Xin-Bing Kong
Author-X-Name-First: Xin-Bing
Author-X-Name-Last: Kong
Author-Name: Jin-Guan Lin
Author-X-Name-First: Jin-Guan
Author-X-Name-Last: Lin
Author-Name: Cheng Liu
Author-X-Name-First: Cheng
Author-X-Name-Last: Liu
Author-Name: Guang-Ying Liu
Author-X-Name-First: Guang-Ying
Author-X-Name-Last: Liu
Title: Discrepancy Between Global and Local Principal Component Analysis on Large-Panel High-Frequency Data
Abstract:
In this article, we study the discrepancy between global principal component analysis (GPCA) and local principal component analysis (LPCA) in recovering the common components of large-panel high-frequency data. We measure the discrepancy by the total sum of squared differences between common components reconstructed from GPCA and LPCA. The asymptotic distribution of the discrepancy measure is provided when the factor space is time invariant, as the dimension p and sample size n tend to infinity simultaneously. Alternatively, when the factor space changes, the discrepancy measure explodes under a mild signal condition on the magnitude of time-variation of the factor space. We apply the theory to test the time invariance of the factor space. The test performs well in controlling the Type I error and detecting time-varying factor spaces, as checked by extensive simulation studies. A real data analysis provides strong evidence that the factor space is always time-varying within a time span longer than one week.
Journal: Journal of the American Statistical Association
Pages: 1333-1344
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1996376
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1996376
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1333-1344
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1970570_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Le Zhou
Author-X-Name-First: Le
Author-X-Name-Last: Zhou
Author-Name: Hui Zou
Author-X-Name-First: Hui
Author-X-Name-Last: Zou
Title: Cross-Fitted Residual Regression for High-Dimensional Heteroscedasticity Pursuit
Abstract:
There is a vast amount of work on high-dimensional regression. The common starting point for the existing theoretical work is to assume the data generating model is a homoscedastic linear regression model with some sparsity structure. In reality the homoscedasticity assumption is often violated, and hence understanding the heteroscedasticity of the data is of critical importance. In this article we systematically study the estimation of a high-dimensional heteroscedastic regression model. In particular, the emphasis is on how to detect and estimate the heteroscedasticity effects reliably and efficiently. To this end, we propose a cross-fitted residual regression approach and prove the resulting estimator is selection consistent for heteroscedasticity effects and establish its rates of convergence. Our estimator has tuning parameters to be determined by the data in practice. We propose a novel high-dimensional BIC for tuning parameter selection and establish its consistency. This is the first high-dimensional BIC result under heteroscedasticity. The theoretical analysis is more involved in order to handle heteroscedasticity, and we develop a couple of interesting new concentration inequalities that are of independent interest.
Journal: Journal of the American Statistical Association
Pages: 1056-1065
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1970570
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1970570
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1056-1065
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2133718_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Jin-Hong Du
Author-X-Name-First: Jin-Hong
Author-X-Name-Last: Du
Author-Name: Yifeng Guo
Author-X-Name-First: Yifeng
Author-X-Name-Last: Guo
Author-Name: Xueqin Wang
Author-X-Name-First: Xueqin
Author-X-Name-Last: Wang
Title: High-Dimensional Portfolio Selection with Cardinality Constraints
Abstract:
The expanding number of assets offers more opportunities for investors but poses new challenges for modern portfolio management (PM). As a central plank of PM, portfolio selection by expected utility maximization (EUM) faces uncontrollable estimation and optimization errors in ultrahigh-dimensional scenarios. Past strategies for high-dimensional PM mainly concern only large-cap companies and select many stocks, making PM impractical. We propose a sample-average-approximation-based portfolio strategy to tackle the difficulties above with cardinality constraints. Our strategy bypasses the estimation of mean and covariance, the Chinese walls in high-dimensional scenarios. Empirical results on S&P 500 and Russell 2000 show that an appropriate number of carefully chosen assets leads to better out-of-sample mean-variance efficiency. On Russell 2000, our best portfolio profits as much as the equally weighted portfolio but reduces the maximum drawdown and the average number of assets by 10% and 90%, respectively. The flexibility and the stability of incorporating factor signals for augmenting out-of-sample performances are also demonstrated. Our strategy balances the tradeoff among the return, the risk, and the number of assets with cardinality constraints. Therefore, we provide a theoretically sound and computationally efficient strategy to make PM practical in the growing global financial market. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 779-791
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2022.2133718
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2133718
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:779-791
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1987920_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Ying Hung
Author-X-Name-First: Ying
Author-X-Name-Last: Hung
Author-Name: Li-Hsiang Lin
Author-X-Name-First: Li-Hsiang
Author-X-Name-Last: Lin
Author-Name: C. F. Jeff Wu
Author-X-Name-First: C. F. Jeff
Author-X-Name-Last: Wu
Title: Optimal Simulator Selection
Abstract:
Computer simulators are widely used for the study of complex systems. In many applications, there are multiple simulators available with different scientific interpretations of the underlying mechanism, and the goal is to identify an optimal simulator based on the observed physical experiments. To achieve the goal, we propose a selection criterion based on leave-one-out cross-validation. This criterion consists of a goodness-of-fit measure and a generalized degrees of freedom term penalizing the simulator's sensitivity to perturbations in the physical observations. Asymptotic properties of the selected optimal simulator are discussed. It is shown that the proposed procedure includes a conventional calibration method as a special case. The finite sample performance of the proposed procedure is demonstrated through numerical examples. In an application to cell biology, an optimal simulator is selected, which can shed light on the T cell recognition mechanism in the human immune system. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1264-1271
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1987920
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1987920
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1264-1271
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1996378_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Jiangzhou Wang
Author-X-Name-First: Jiangzhou
Author-X-Name-Last: Wang
Author-Name: Jingfei Zhang
Author-X-Name-First: Jingfei
Author-X-Name-Last: Zhang
Author-Name: Binghui Liu
Author-X-Name-First: Binghui
Author-X-Name-Last: Liu
Author-Name: Ji Zhu
Author-X-Name-First: Ji
Author-X-Name-Last: Zhu
Author-Name: Jianhua Guo
Author-X-Name-First: Jianhua
Author-X-Name-Last: Guo
Title: Fast Network Community Detection With Profile-Pseudo Likelihood Methods
Abstract:
The stochastic block model is one of the most studied network models for community detection, and fitting its likelihood function on large-scale networks is known to be challenging. One prominent work that overcomes this computational challenge is the fast pseudo-likelihood approach proposed by Amini et al. for fitting stochastic block models to large sparse networks. However, this approach does not have a convergence guarantee, and may not be well suited for small- and medium-scale networks. In this article, we propose a novel likelihood-based approach that decouples row and column labels in the likelihood function, enabling a fast alternating maximization. This new method is computationally efficient, performs well for both small- and large-scale networks, and has a provable convergence guarantee. We show that our method provides strongly consistent estimates of communities in a stochastic block model. We further consider extensions of our proposed method to handle networks with degree heterogeneity and bipartite properties. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1359-1372
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1996378
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1996378
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1359-1372
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1990766_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Dongdong Li
Author-X-Name-First: Dongdong
Author-X-Name-Last: Li
Author-Name: X. Joan Hu
Author-X-Name-First: X. Joan
Author-X-Name-Last: Hu
Author-Name: Rui Wang
Author-X-Name-First: Rui
Author-X-Name-Last: Wang
Title: Evaluating Association Between Two Event Times with Observations Subject to Informative Censoring
Abstract:
This article is concerned with evaluating the association between two event times without specifying the joint distribution parametrically. This is particularly challenging when the observations on the event times are subject to informative censoring due to a terminating event such as death. There are few methods suitable for assessing covariate effects on association in this context. We link the joint distribution of the two event times and the informative censoring time using a nested copula function. We use flexible functional forms to specify the covariate effects on both the marginal and joint distributions. In a semiparametric model for the bivariate event time, we estimate simultaneously the association parameters, the marginal survival functions, and the covariate effects. A byproduct of the approach is a consistent estimator for the induced marginal survival function of each event time conditional on the covariates. We develop an easy-to-implement pseudolikelihood-based inference procedure, derive the asymptotic properties of the estimators, and conduct simulation studies to examine the finite-sample performance of the proposed approach. For illustration, we apply our method to analyze data from the breast cancer survivorship study that motivated this research. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1282-1294
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1990766
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1990766
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1282-1294
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2151447_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Roulan Jiang
Author-X-Name-First: Roulan
Author-X-Name-Last: Jiang
Author-Name: Xiang Zhan
Author-X-Name-First: Xiang
Author-X-Name-Last: Zhan
Author-Name: Tianying Wang
Author-X-Name-First: Tianying
Author-X-Name-Last: Wang
Title: A Flexible Zero-Inflated Poisson-Gamma Model with Application to Microbiome Sequence Count Data
Abstract:
In microbiome studies, it is of interest to use a sample from a population of microbes, such as the gut microbiota community, to estimate the population proportion of these taxa. However, due to biases introduced in sampling and preprocessing steps, these observed taxa abundances may not reflect true taxa abundance patterns in the ecosystem. Repeated measures, including longitudinal study designs, may be potential solutions to mitigate the discrepancy between observed abundances and true underlying abundances. Yet, widely observed zero-inflation and over-dispersion issues can distort downstream statistical analyses aiming to associate taxa abundances with covariates of interest. To this end, we propose a Zero-Inflated Poisson Gamma (ZIPG) model framework to address these aforementioned challenges. From a perspective of measurement errors, we accommodate the discrepancy between observations and truths by decomposing the mean parameter in Poisson regression into a true abundance level and a multiplicative measurement of sampling variability from the microbial ecosystem. Then, we provide a flexible ZIPG model framework by connecting both the mean abundance and the variability of abundances to different covariates, and build valid statistical inference procedures for both parameter estimation and hypothesis testing. Through comprehensive simulation studies and real data applications, the proposed ZIPG method provides significant insights into distinguished differential variability and mean abundance. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 792-804
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2022.2151447
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2151447
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:792-804
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1996379_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Stan Tendijck
Author-X-Name-First: Stan
Author-X-Name-Last: Tendijck
Author-Name: Emma Eastoe
Author-X-Name-First: Emma
Author-X-Name-Last: Eastoe
Author-Name: Jonathan Tawn
Author-X-Name-First: Jonathan
Author-X-Name-Last: Tawn
Author-Name: David Randell
Author-X-Name-First: David
Author-X-Name-Last: Randell
Author-Name: Philip Jonathan
Author-X-Name-First: Philip
Author-X-Name-Last: Jonathan
Title: Modeling the Extremes of Bivariate Mixture Distributions With Application to Oceanographic Data
Abstract:
There currently exist a variety of statistical methods for modeling bivariate extremes. However, when the dependence between variables is driven by more than one latent process, these methods are likely to fail to give reliable inferences. We consider situations in which the observed dependence at extreme levels is a mixture of a possibly unknown number of much simpler bivariate distributions. For such structures, we demonstrate the limitations of existing methods and propose two new methods: an extension of the Heffernan–Tawn conditional extreme value model to allow for mixtures and an extremal quantile-regression approach. The two methods are examined in a simulation study and then applied to oceanographic data. Finally, we discuss extensions including a subasymptotic version of the proposed model, which has the potential to give more efficient results by incorporating data that are less extreme. Both new methods outperform existing approaches when mixtures are present.
Journal: Journal of the American Statistical Association
Pages: 1373-1384
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1996379
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1996379
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1373-1384
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2000867_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Xinran Li
Author-X-Name-First: Xinran
Author-X-Name-Last: Li
Author-Name: Bo Jiang
Author-X-Name-First: Bo
Author-X-Name-Last: Jiang
Author-Name: Jun S. Liu
Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu
Title: Kernel-Based Partial Permutation Test for Detecting Heterogeneous Functional Relationship
Abstract:
We propose a kernel-based partial permutation test for checking the equality of functional relationship between response and covariates among different groups. The main idea, which is intuitive and easy to implement, is to keep the projections of the response vector Y on leading principal components of a kernel matrix fixed and permute Y’s projections on the remaining principal components. The proposed test allows for different choices of kernels, corresponding to different classes of functions under the null hypothesis. First, using linear or polynomial kernels, our partial permutation tests are exactly valid in finite samples for linear or polynomial regression models with Gaussian noise; similar results straightforwardly extend to kernels with finite feature spaces. Second, by allowing the kernel feature space to diverge with the sample size, the test can be large-sample valid for a wider class of functions. Third, for general kernels with possibly infinite-dimensional feature space, the partial permutation test is exactly valid when the covariates are exactly balanced across all groups, or asymptotically valid when the underlying function follows certain regularized Gaussian processes. We further suggest test statistics using the likelihood ratio between two (nested) Gaussian process regression models, and propose computationally efficient algorithms utilizing the EM algorithm and Newton’s method, where the latter also involves Fisher scoring and quadratic programming and is particularly useful when EM suffers from slow convergence. Extensions to correlated and non-Gaussian noise have also been investigated theoretically or numerically. Furthermore, the test can be extended to use multiple kernels together and can thus enjoy properties from each kernel. Both a simulation study and an application illustrate the properties of the proposed test.
Journal: Journal of the American Statistical Association
Pages: 1429-1447
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.2000867
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2000867
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1429-1447
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1969238_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Yongyi Guo
Author-X-Name-First: Yongyi
Author-X-Name-Last: Guo
Author-Name: Kaizheng Wang
Author-X-Name-First: Kaizheng
Author-X-Name-Last: Wang
Title: Communication-Efficient Accurate Statistical Estimation
Abstract:
When the data are stored in a distributed manner, direct applications of traditional statistical inference procedures are often prohibitive due to communication costs and privacy concerns. This article develops and investigates two communication-efficient accurate statistical estimators (CEASE), implemented through iterative algorithms for distributed optimization. In each iteration, node machines carry out computation in parallel and communicate with the central processor, which then broadcasts aggregated information to node machines for new updates. The algorithms adapt to the similarity among loss functions on node machines, and converge rapidly when each node machine has a large enough sample size. Moreover, they do not require good initialization and enjoy linear convergence guarantees under general conditions. The contraction rate of optimization errors is presented explicitly, with dependence on the local sample size unveiled. In addition, the improved statistical accuracy per iteration is derived. By regarding the proposed method as a multistep statistical estimator, we show that statistical efficiency can be achieved in finite steps in typical statistical applications. In addition, we give the conditions under which the one-step CEASE estimator is statistically efficient. Extensive numerical experiments on both synthetic and real data validate the theoretical results and demonstrate the superior performance of our algorithms.
Journal: Journal of the American Statistical Association
Pages: 1000-1010
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1969238
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1969238
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1000-1010
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1996377_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Reza Mohammadi
Author-X-Name-First: Reza
Author-X-Name-Last: Mohammadi
Author-Name: Hélène Massam
Author-X-Name-First: Hélène
Author-X-Name-Last: Massam
Author-Name: Gérard Letac
Author-X-Name-First: Gérard
Author-X-Name-Last: Letac
Title: Accelerating Bayesian Structure Learning in Sparse Gaussian Graphical Models
Abstract:
Bayesian structure learning in Gaussian graphical models is often done by search algorithms over the graph space. The conjugate prior for the precision matrix satisfying graphical constraints is the well-known G-Wishart. With this prior, the transition probabilities in the search algorithms necessitate evaluating the ratios of the prior normalizing constants of G-Wishart. In moderate to high dimensions, this ratio is often approximated by using sampling-based methods as computationally expensive updates in the search algorithm. Calculating this ratio so far has been a major computational bottleneck. We overcome this issue by presenting a search algorithm in which the ratio of normalizing constants is carried out by an explicit closed-form approximation. Using this approximation within our search algorithm yields significant improvement in the scalability of structure learning without sacrificing structure learning accuracy. We study the conditions under which the approximation is valid. We also evaluate the efficacy of our method with simulation studies. We show that the new search algorithm with our approximation outperforms state-of-the-art methods in both computational efficiency and accuracy. The implementation of our work is available in the R package BDgraph.
Journal: Journal of the American Statistical Association
Pages: 1345-1358
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1996377
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1996377
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1345-1358
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1963262_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Efstathios Paparoditis
Author-X-Name-First: Efstathios
Author-X-Name-Last: Paparoditis
Author-Name: Han Lin Shang
Author-X-Name-First: Han Lin
Author-X-Name-Last: Shang
Title: Bootstrap Prediction Bands for Functional Time Series
Abstract:
A bootstrap procedure for constructing prediction bands for a stationary functional time series is proposed. The procedure exploits a general vector autoregressive representation of the time-reversed series of Fourier coefficients appearing in the Karhunen–Loève representation of the functional process. It generates backward-in-time functional replicates that adequately mimic the dependence structure of the underlying process in a model-free way and have the same conditionally fixed curves at the end of each functional pseudo-time series. The bootstrap prediction error distribution is then calculated as the difference between the model-free, bootstrap-generated future functional observations and the functional forecasts obtained from the model used for prediction. This allows the estimated prediction error distribution to account for the innovation and estimation errors associated with prediction and the possible errors due to model misspecification. We establish the asymptotic validity of the bootstrap procedure in estimating the conditional prediction error distribution of interest, and we also show that the procedure enables the construction of prediction bands that achieve (asymptotically) the desired coverage. Prediction bands based on a consistent estimation of the conditional distribution of the studentized prediction error process are also introduced. Such bands account more appropriately for the local uncertainty of the prediction. Through a simulation study and the analysis of two datasets, we demonstrate the capabilities and the good finite-sample performance of the proposed method.
Journal: Journal of the American Statistical Association
Pages: 972-986
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1963262
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1963262
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:972-986
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1987251_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Francis K. C. Hui
Author-X-Name-First: Francis K. C.
Author-X-Name-Last: Hui
Author-Name: Samuel Müller
Author-X-Name-First: Samuel
Author-X-Name-Last: Müller
Author-Name: A. H. Welsh
Author-X-Name-First: A. H.
Author-X-Name-Last: Welsh
Title: GEE-Assisted Variable Selection for Latent Variable Models with Multivariate Binary Data
Abstract:
Multivariate data are commonly analyzed using one of two approaches: a conditional approach based on generalized linear latent variable models (GLLVMs) or some variation thereof, and a marginal approach based on generalized estimating equations (GEEs). With research on mixed models and GEEs having gone down separate paths, there is a common mindset to treat the two approaches as mutually exclusive, with which to use driven by the question of interest. In this article, focusing on multivariate binary responses, we study the connections between the parameters from conditional and marginal models, with the aim of using GEEs for fast variable selection in GLLVMs. This is accomplished through two main contributions. First, we show that GEEs are zero consistent for GLLVMs fitted to multivariate binary data. That is, if the true model is a GLLVM but we misspecify and fit GEEs, then the latter is able to asymptotically differentiate between truly zero versus nonzero coefficients in the former. Building on this result, we propose GEE-assisted variable selection for GLLVMs using score- and Wald-based information criteria to construct a fast forward selection path followed by pruning. We demonstrate GEE-assisted variable selection is selection consistent for the underlying GLLVM, with simulation studies demonstrating its strong finite sample performance and computational efficiency.
Journal: Journal of the American Statistical Association
Pages: 1252-1263
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1987251
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1987251
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1252-1263
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1956501_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yuxin Chen
Author-X-Name-First: Yuxin
Author-X-Name-Last: Chen
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Bingyan Wang
Author-X-Name-First: Bingyan
Author-X-Name-Last: Wang
Author-Name: Yuling Yan
Author-X-Name-First: Yuling
Author-X-Name-Last: Yan
Title: Convex and Nonconvex Optimization Are Both Minimax-Optimal for Noisy Blind Deconvolution Under Random Designs
Abstract:
We investigate the effectiveness of convex relaxation and nonconvex optimization in solving bilinear systems of equations under two different designs (i.e., a sort of random Fourier design and Gaussian design). Despite their wide applicability, the theoretical understanding of these two paradigms remains largely inadequate in the presence of random noise. The current article makes two contributions by demonstrating that (i) a two-stage nonconvex algorithm attains minimax-optimal accuracy within a logarithmic number of iterations, and (ii) convex relaxation also achieves minimax-optimal statistical accuracy vis-à-vis random noise. Both results significantly improve upon the state-of-the-art theoretical guarantees. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 858-868
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1956501
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1956501
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:858-868
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1981338_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Bingkai Wang
Author-X-Name-First: Bingkai
Author-X-Name-Last: Wang
Author-Name: Ryoko Susukida
Author-X-Name-First: Ryoko
Author-X-Name-Last: Susukida
Author-Name: Ramin Mojtabai
Author-X-Name-First: Ramin
Author-X-Name-Last: Mojtabai
Author-Name: Masoumeh Amin-Esmaeili
Author-X-Name-First: Masoumeh
Author-X-Name-Last: Amin-Esmaeili
Author-Name: Michael Rosenblum
Author-X-Name-First: Michael
Author-X-Name-Last: Rosenblum
Title: Model-Robust Inference for Clinical Trials that Improve Precision by Stratified Randomization and Covariate Adjustment
Abstract:
Two commonly used methods for improving precision and power in clinical trials are stratified randomization and covariate adjustment. However, many trials do not fully capitalize on the combined precision gains from these two methods, which can lead to wasted resources in terms of sample size and trial duration. We derive consistency and asymptotic normality of model-robust estimators that combine these two methods, and show that these estimators can lead to substantial gains in precision and power. Our theorems cover a class of estimators that handle continuous, binary, and time-to-event outcomes; missing outcomes under the missing at random assumption are handled as well. For each estimator, we give a formula for a consistent variance estimator that is model-robust and that fully captures variance reductions from stratified randomization and covariate adjustment. Also, we give the first proof (to the best of our knowledge) of consistency and asymptotic normality of the Kaplan–Meier estimator under stratified randomization, and we derive its asymptotic variance. The above results also hold for the biased-coin covariate-adaptive design. We demonstrate our results using data from three trials of substance use disorder treatments, where the variance reduction due to stratified randomization and covariate adjustment ranges from 1% to 36%. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1152-1163
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1981338
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1981338
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1152-1163
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1999820_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Hui Chen
Author-X-Name-First: Hui
Author-X-Name-Last: Chen
Author-Name: Haojie Ren
Author-X-Name-First: Haojie
Author-X-Name-Last: Ren
Author-Name: Fang Yao
Author-X-Name-First: Fang
Author-X-Name-Last: Yao
Author-Name: Changliang Zou
Author-X-Name-First: Changliang
Author-X-Name-Last: Zou
Title: Data-driven selection of the number of change-points via error rate control
Abstract:
In multiple change-point analysis, one of the main difficulties is to determine the number of change-points. Various consistent selection methods, including the use of the Schwarz information criterion and cross-validation, have been proposed to balance model fitting and complexity. However, there is a lack of systematic approaches that provide a theoretical guarantee of significance in determining the number of changes. In this paper, we introduce a data-adaptive selection procedure via error rate control based on order-preserving sample-splitting, which is applicable to most existing change-point methods. The key idea is to construct a series of statistics with a global symmetry property and then utilize the symmetry to derive a data-driven threshold. Under this general framework, we are able to rigorously investigate the false discovery proportion control, and show that the proposed method controls the false discovery rate (FDR) asymptotically under mild conditions while retaining the true change-points. Numerical experiments indicate that our selection procedure works well for many change-detection methods and is able to yield accurate FDR control in finite samples. Keywords: Empirical distribution; False discovery rate; Multiple change-point model; Sample-splitting; Symmetry; Uniform convergence.
Journal: Journal of the American Statistical Association
Pages: 1415-1428
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1999820
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1999820
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1415-1428
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1982723_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Zexi Song
Author-X-Name-First: Zexi
Author-X-Name-Last: Song
Author-Name: Zhiqiang Tan
Author-X-Name-First: Zhiqiang
Author-X-Name-Last: Tan
Title: Hamiltonian-Assisted Metropolis Sampling
Abstract:
Various Markov chain Monte Carlo (MCMC) methods are studied to improve upon random walk Metropolis sampling, for simulation from complex distributions. Examples include Metropolis-adjusted Langevin algorithms, Hamiltonian Monte Carlo, and other algorithms related to underdamped Langevin dynamics. We propose a broad class of irreversible sampling algorithms, called Hamiltonian-assisted Metropolis sampling (HAMS), and develop two specific algorithms with appropriate tuning and preconditioning strategies. Our HAMS algorithms are designed to simultaneously achieve two distinctive properties, while using an augmented target density with a momentum as an auxiliary variable. One is generalized detailed balance, which induces an irreversible exploration of the target. The other is a rejection-free property for a Gaussian target with a prespecified variance matrix. This property allows our preconditioned algorithms to perform satisfactorily with relatively large step sizes. Furthermore, we formulate a framework of generalized Metropolis–Hastings sampling, which not only highlights our construction of HAMS at a more abstract level, but also facilitates possible further development of irreversible MCMC algorithms. We present several numerical experiments, where the proposed algorithms consistently yield superior results among existing algorithms using the same preconditioning schemes.
Journal: Journal of the American Statistical Association
Pages: 1176-1194
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1982723
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1982723
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1176-1194
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1961784_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Tamara Fernández
Author-X-Name-First: Tamara
Author-X-Name-Last: Fernández
Author-Name: Arthur Gretton
Author-X-Name-First: Arthur
Author-X-Name-Last: Gretton
Author-Name: David Rindt
Author-X-Name-First: David
Author-X-Name-Last: Rindt
Author-Name: Dino Sejdinovic
Author-X-Name-First: Dino
Author-X-Name-Last: Sejdinovic
Title: A Kernel Log-Rank Test of Independence for Right-Censored Data
Abstract:
We introduce a general nonparametric independence test between right-censored survival times and covariates, which may be multivariate. Our test statistic has a dual interpretation, first in terms of the supremum of a potentially infinite collection of weight-indexed log-rank tests, with weight functions belonging to a reproducing kernel Hilbert space (RKHS) of functions; and second, as the norm of the difference of embeddings of certain finite measures into the RKHS, similar to the Hilbert–Schmidt Independence Criterion (HSIC) test-statistic. We study the asymptotic properties of the test, finding sufficient conditions to ensure our test correctly rejects the null hypothesis under any alternative. The test statistic can be computed straightforwardly, and the rejection threshold is obtained via an asymptotically consistent Wild Bootstrap procedure. Extensive investigations on both simulated and real data suggest that our testing procedure generally performs better than competing approaches in detecting complex nonlinear dependence.
Journal: Journal of the American Statistical Association
Pages: 925-936
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1961784
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1961784
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:925-936
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2183001_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Susan S. Ellenberg
Author-X-Name-First: Susan S.
Author-X-Name-Last: Ellenberg
Title: Statistical Thinking in Clinical Trials
Journal: Journal of the American Statistical Association
Pages: 1448-1449
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2023.2183001
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2183001
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1448-1449
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1987250_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Mehdi Dagdoug
Author-X-Name-First: Mehdi
Author-X-Name-Last: Dagdoug
Author-Name: Camelia Goga
Author-X-Name-First: Camelia
Author-X-Name-Last: Goga
Author-Name: David Haziza
Author-X-Name-First: David
Author-X-Name-Last: Haziza
Title: Model-Assisted Estimation Through Random Forests in Finite Population Sampling
Abstract:
In surveys, the interest lies in estimating finite population parameters such as population totals and means. In most surveys, some auxiliary information is available at the estimation stage. This information may be incorporated in the estimation procedures to increase their precision. In this article, we use random forests (RFs) to estimate the functional relationship between the survey variable and the auxiliary variables. In recent years, RFs have become attractive as National Statistical Offices now have access to a variety of data sources, potentially exhibiting a large number of observations on a large number of variables. We establish the theoretical properties of model-assisted procedures based on RFs and derive corresponding variance estimators. A model-calibration procedure for handling multiple survey variables is also discussed. The results of a simulation study suggest that the proposed point and variance estimation procedures perform well in terms of bias, efficiency, and coverage of normal-based confidence intervals, in a wide variety of settings. Finally, we apply the proposed methods using data on radio audiences collected by Médiamétrie, a French audience company. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1234-1251
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1987250
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1987250
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1234-1251
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1990768_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Florian Gunsilius
Author-X-Name-First: Florian
Author-X-Name-Last: Gunsilius
Author-Name: Susanne Schennach
Author-X-Name-First: Susanne
Author-X-Name-Last: Schennach
Title: Independent Nonlinear Component Analysis
Abstract:
The idea of summarizing the information contained in a large number of variables by a small number of “factors” or “principal components” has been broadly adopted in statistics. This article introduces a generalization of the widely used principal component analysis (PCA) to nonlinear settings, thus providing a new tool for dimension reduction and exploratory data analysis or representation. The distinguishing features of the method include (i) the ability to always deliver truly independent (instead of merely uncorrelated) factors; (ii) the use of optimal transport theory and Brenier maps to obtain a robust and efficient computational algorithm; (iii) the use of a new multivariate additive entropy decomposition to determine the most informative principal nonlinear components, and (iv) formally nesting PCA as a special case for linear Gaussian factor models. We illustrate the method’s effectiveness in an application to excess bond returns prediction from a large number of macro factors. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1305-1318
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1990768
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1990768
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1305-1318
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1969239_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Fabian Mies
Author-X-Name-First: Fabian
Author-X-Name-Last: Mies
Title: Functional Estimation and Change Detection for Nonstationary Time Series
Abstract:
Tests for structural breaks in time series should ideally be sensitive to breaks in the parameter of interest, while being robust to nuisance changes. Statistical analysis thus needs to allow for some form of nonstationarity under the null hypothesis of no change. In this article, estimators for integrated parameters of locally stationary time series are constructed and a corresponding functional central limit theorem is established, enabling change-point inference for a broad class of parameters under mild assumptions. The proposed framework covers all parameters which may be expressed as nonlinear functions of moments, for example kurtosis, autocorrelation, and coefficients in a linear regression model. To perform feasible inference based on the derived limit distribution, a bootstrap variant is proposed and its consistency is established. The methodology is illustrated by means of a simulation study and by an application to high-frequency asset prices.
Journal: Journal of the American Statistical Association
Pages: 1011-1022
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1969239
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1969239
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1011-1022
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1978467_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yinghao Pan
Author-X-Name-First: Yinghao
Author-X-Name-Last: Pan
Author-Name: Eric B. Laber
Author-X-Name-First: Eric B.
Author-X-Name-Last: Laber
Author-Name: Maureen A. Smith
Author-X-Name-First: Maureen A.
Author-X-Name-Last: Smith
Author-Name: Ying-Qi Zhao
Author-X-Name-First: Ying-Qi
Author-X-Name-Last: Zhao
Title: Reinforced Risk Prediction With Budget Constraint Using Irregularly Measured Data From Electronic Health Records
Abstract:
Uncontrolled glycated hemoglobin (HbA1c) levels are associated with adverse events among complex diabetic patients. These adverse events present serious health risks to affected patients and are associated with significant financial costs. Thus, a high-quality predictive model that could identify high-risk patients so as to inform preventative treatment has the potential to improve patient outcomes while reducing healthcare costs. Because the biomarker information needed to predict risk is costly and burdensome, it is desirable that such a model collect only as much information as is needed on each patient so as to render an accurate prediction. We propose a sequential predictive model that uses accumulating patient longitudinal data to classify patients as high-risk, low-risk, or uncertain. Patients classified as high-risk are then recommended to receive preventative treatment, and those classified as low-risk are recommended to receive standard care. Patients classified as uncertain are monitored until a high-risk or low-risk determination is made. We construct the model using claims and enrollment files from Medicare, linked with patient electronic health records (EHR) data. The proposed model uses functional principal components to accommodate noisy longitudinal data and weighting to deal with missingness and sampling bias. The proposed method demonstrates higher predictive accuracy and lower cost than competing methods in a series of simulation experiments and application to data on complex patients with diabetes. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1090-1101
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1978467
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1978467
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1090-1101
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1962720_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Zhimei Ren
Author-X-Name-First: Zhimei
Author-X-Name-Last: Ren
Author-Name: Yuting Wei
Author-X-Name-First: Yuting
Author-X-Name-Last: Wei
Author-Name: Emmanuel Candès
Author-X-Name-First: Emmanuel
Author-X-Name-Last: Candès
Title: Derandomizing Knockoffs
Abstract:
Model-X knockoffs is a general procedure that can leverage any feature importance measure to produce a variable selection algorithm, which discovers true effects while rigorously controlling the number or fraction of false positives. Model-X knockoffs is a randomized procedure which relies on the one-time construction of synthetic (random) variables. This article introduces a derandomization method by aggregating the selection results across multiple runs of the knockoffs algorithm. The derandomization step is designed to be flexible and can be adapted to any variable selection base procedure to yield stable decisions without compromising statistical power. When applied to the base procedure of Janson and Su, we prove that derandomized knockoffs controls both the per family error rate (PFER) and the k family-wise error rate (k-FWER). Furthermore, we carry out extensive numerical studies demonstrating tight Type I error control and markedly enhanced power when compared with alternative variable selection algorithms. Finally, we apply our approach to multistage genome-wide association studies of prostate cancer and report locations on the genome that are significantly associated with the disease. When cross-referenced with other studies, we find that the reported associations have been replicated. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 948-958
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1962720
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1962720
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:948-958
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2169150_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Niccolò Anceschi
Author-X-Name-First: Niccolò
Author-X-Name-Last: Anceschi
Author-Name: Augusto Fasano
Author-X-Name-First: Augusto
Author-X-Name-Last: Fasano
Author-Name: Daniele Durante
Author-X-Name-First: Daniele
Author-X-Name-Last: Durante
Author-Name: Giacomo Zanella
Author-X-Name-First: Giacomo
Author-X-Name-Last: Zanella
Title: Bayesian Conjugacy in Probit, Tobit, Multinomial Probit and Extensions: A Review and New Results
Abstract:
A broad class of models that routinely appear in several fields can be expressed as partially or fully discretized Gaussian linear regressions. Besides including classical Gaussian response settings, this class also encompasses probit, multinomial probit and tobit regression, among others, thereby yielding one of the most widely-implemented families of models in routine applications. The relevance of such representations has stimulated decades of research in the Bayesian field, mostly motivated by the fact that, unlike for Gaussian linear regression, the posterior distribution induced by such models does not seem to belong to a known class, under the commonly assumed Gaussian priors for the coefficients. This has motivated several solutions for posterior inference relying either on sampling-based strategies or on deterministic approximations that, however, still experience computational and accuracy issues, especially in high dimensions. The scope of this article is to review, unify and extend recent advances in Bayesian inference and computation for this core class of models. To address such a goal, we prove that the likelihoods induced by these formulations share a common analytical structure implying conjugacy with a broad class of distributions, namely the unified skew-normal (SUN), that generalize Gaussians to include skewness. This result unifies and extends recent conjugacy properties for specific models within the class analyzed, and opens new avenues for improved posterior inference, under a broader class of formulations and priors, via novel closed-form expressions, iid samplers from the exact SUN posteriors, and more accurate and scalable approximations from variational Bayes and expectation-propagation. Such advantages are illustrated in simulations and are expected to facilitate the routine-use of these core Bayesian models, while providing novel frameworks for studying theoretical properties and developing future extensions. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1451-1469
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2023.2169150
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2169150
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1451-1469
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2156348_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Cecilia Balocchi
Author-X-Name-First: Cecilia
Author-X-Name-Last: Balocchi
Author-Name: Sameer K. Deshpande
Author-X-Name-First: Sameer K.
Author-X-Name-Last: Deshpande
Author-Name: Edward I. George
Author-X-Name-First: Edward I.
Author-X-Name-Last: George
Author-Name: Shane T. Jensen
Author-X-Name-First: Shane T.
Author-X-Name-Last: Jensen
Title: Crime in Philadelphia: Bayesian Clustering with Particle Optimization
Abstract:
Accurate estimation of the change in crime over time is a critical first step toward better understanding of public safety in large urban environments. Bayesian hierarchical modeling is a natural way to study spatial variation in urban crime dynamics at the neighborhood level, since it facilitates principled “sharing of information” between spatially adjacent neighborhoods. Typically, however, cities contain many physical and social boundaries that may manifest as spatial discontinuities in crime patterns. In this situation, standard prior choices often yield overly smooth parameter estimates, which can ultimately produce mis-calibrated forecasts. To prevent potential over-smoothing, we introduce a prior that partitions the set of neighborhoods into several clusters and encourages spatial smoothness within each cluster. In terms of model implementation, conventional stochastic search techniques are computationally prohibitive, as they must traverse a combinatorially vast space of partitions. We introduce an ensemble optimization procedure that simultaneously identifies several high probability partitions by solving one optimization problem using a new local search strategy. We then use the identified partitions to estimate crime trends in Philadelphia between 2006 and 2017. On simulated and real data, our proposed method demonstrates good estimation and partition selection performance. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 818-829
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2022.2156348
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2156348
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:818-829
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1984927_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: David J. Edwards
Author-X-Name-First: David J.
Author-X-Name-Last: Edwards
Author-Name: Robert W. Mee
Author-X-Name-First: Robert W.
Author-X-Name-Last: Mee
Title: Structure of Nonregular Two-Level Designs
Abstract:
Two-level fractional factorial designs are often used in screening scenarios to identify active factors. This article investigates the block diagonal structure of the information matrix of nonregular two-level designs. This structure is appealing since estimates of parameters belonging to different diagonal submatrices are uncorrelated. As such, the covariance matrix of the least squares estimates is simplified and the number of linear dependencies is reduced. We connect the block diagonal information matrix to the parallel flats design (PFD) literature and gain insights into the structure of what is estimable and/or aliased using the concept of minimal dependent sets. We show how to determine the number of parallel flats for any given design, and how to construct a design with a specified number of parallel flats. The usefulness of our construction method is illustrated by producing designs for estimation of the two-factor interaction model with three or more parallel flats. We also provide a fuller understanding of recently proposed group orthogonal supersaturated designs. Benefits of PFDs for analysis, including bias containment, are also discussed.
Journal: Journal of the American Statistical Association
Pages: 1222-1233
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1984927
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1984927
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1222-1233
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1957900_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Zhe Fei
Author-X-Name-First: Zhe
Author-X-Name-Last: Fei
Author-Name: Qi Zheng
Author-X-Name-First: Qi
Author-X-Name-Last: Zheng
Author-Name: Hyokyoung G. Hong
Author-X-Name-First: Hyokyoung G.
Author-X-Name-Last: Hong
Author-Name: Yi Li
Author-X-Name-First: Yi
Author-X-Name-Last: Li
Title: Inference for High-Dimensional Censored Quantile Regression
Abstract:
With the availability of high-dimensional genetic biomarkers, it is of interest to identify heterogeneous effects of these predictors on patients’ survival, along with proper statistical inference. Censored quantile regression has emerged as a powerful tool for detecting heterogeneous effects of covariates on survival outcomes. To our knowledge, there is little work available to draw inferences on the effects of high-dimensional predictors for censored quantile regression (CQR). This article proposes a novel procedure to draw inference on all predictors within the framework of global CQR, which investigates covariate-response associations over an interval of quantile levels, instead of a few discrete values. The proposed estimator combines a sequence of low-dimensional model estimates that are based on multi-sample splittings and variable selection. We show that, under some regularity conditions, the estimator is consistent and asymptotically follows a Gaussian process indexed by the quantile level. Simulation studies indicate that our procedure can properly quantify the uncertainty of the estimates in high-dimensional settings. We apply our method to analyze the heterogeneous effects of SNPs residing in lung cancer pathways on patients’ survival, using the Boston Lung Cancer Survival Cohort, a cancer epidemiology study on the molecular mechanism of lung cancer.
Journal: Journal of the American Statistical Association
Pages: 898-912
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1957900
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1957900
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:898-912
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1956937_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yaqing Chen
Author-X-Name-First: Yaqing
Author-X-Name-Last: Chen
Author-Name: Zhenhua Lin
Author-X-Name-First: Zhenhua
Author-X-Name-Last: Lin
Author-Name: Hans-Georg Müller
Author-X-Name-First: Hans-Georg
Author-X-Name-Last: Müller
Title: Wasserstein Regression
Abstract:
The analysis of samples of random objects that do not lie in a vector space is gaining increasing attention in statistics. An important class of such object data is univariate probability measures defined on the real line. Adopting the Wasserstein metric, we develop a class of regression models for such data, where random distributions serve as predictors and the responses are either also distributions or scalars. To define this regression model, we use the geometry of tangent bundles of the space of random measures endowed with the Wasserstein metric for mapping distributions to tangent spaces. The proposed distribution-to-distribution regression model provides an extension of multivariate linear regression for Euclidean data and function-to-function regression for Hilbert space-valued data in functional data analysis. In simulations, it performs better than an alternative transformation approach where one maps distributions to a Hilbert space through the log quantile density transformation and then applies traditional functional regression. We derive asymptotic rates of convergence for the estimator of the regression operator and for predicted distributions and also study an extension to autoregressive models for distribution-valued time series. The proposed methods are illustrated with data on human mortality and distributional time series of house prices.
Journal: Journal of the American Statistical Association
Pages: 869-882
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1956937
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1956937
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:869-882
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1961783_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yunzhang Zhu
Author-X-Name-First: Yunzhang
Author-X-Name-Last: Zhu
Author-Name: Xiaotong Shen
Author-X-Name-First: Xiaotong
Author-X-Name-Last: Shen
Author-Name: Hui Jiang
Author-X-Name-First: Hui
Author-X-Name-Last: Jiang
Author-Name: Wing Hung Wong
Author-X-Name-First: Wing Hung
Author-X-Name-Last: Wong
Title: Collaborative Multilabel Classification
Abstract:
In multilabel classification, strong label dependence is often present and can be exploited, particularly word-to-word dependence defined by semantic labels. In such a situation, we develop a collaborative-learning framework to predict class labels based on label-predictor pairs and label-only data. For example, in image categorization and recognition, language expressions describe the content of an image together with a large number of words and phrases without associated images. This article proposes a new loss quantifying partial correctness for false positive and negative misclassifications due to label similarities. Given this loss, we develop the Bayes rule to capture label dependence by nonlinear classification. On this ground, we introduce a weighted random forest classifier for complete data and a stacking scheme for leveraging additional labels to enhance the performance of supervised learning based on label-predictor pairs. Importantly, we decompose multilabel classification into a sequence of independent learning tasks, based on which the computational complexity of our classifier becomes linear in the size of labels. Compared to existing classifiers without label-only data, the proposed classifier enjoys the computational benefit while enabling the detection of novel labels absent from training by exploring label dependence and leveraging label-only data for higher accuracy. Theoretically, we show that the proposed method reconstructs the Bayes performance consistently, achieving the desired learning accuracy. Numerically, we demonstrate that the proposed method compares favorably in terms of the proposed and Hamming losses against binary relevance and a regularized Ising classifier modeling conditional label dependence. Indeed, leveraging additional labels tends to improve the supervised performance, especially when the training sample is not very large, as in semisupervised learning. Finally, we demonstrate the utility of the proposed approach on the Microsoft COCO object detection challenge, PASCAL visual object classes challenge 2007, and Mediamill benchmark.
Journal: Journal of the American Statistical Association
Pages: 913-924
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1961783
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1961783
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:913-924
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1970569_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Elynn Y. Chen
Author-X-Name-First: Elynn Y.
Author-X-Name-Last: Chen
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Title: Statistical Inference for High-Dimensional Matrix-Variate Factor Models
Abstract:
This article considers the estimation and inference of the low-rank components in high-dimensional matrix-variate factor models, where each dimension of the matrix-variates (p × q) is comparable to or greater than the number of observations (T). We propose an estimation method called α-PCA that preserves the matrix structure and aggregates mean and contemporary covariance through a hyper-parameter α. We develop an inferential theory, establishing consistency, the rate of convergence, and the limiting distributions, under general conditions that allow for correlations across time, rows, or columns of the noise. We show both theoretical and empirical methods of choosing the best α, depending on the use-case criteria. Simulation results demonstrate the adequacy of the asymptotic results in approximating the finite sample properties. The α-PCA compares favorably with existing methods. Finally, we illustrate its applications with a real numeric dataset and two real image datasets. In all applications, the proposed estimation procedure outperforms previous methods in the power of variance explanation using out-of-sample 10-fold cross-validation. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1038-1055
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1970569
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1970569
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1038-1055
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1990769_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: T. Tony Cai
Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai
Author-Name: Zijian Guo
Author-X-Name-First: Zijian
Author-X-Name-Last: Guo
Author-Name: Rong Ma
Author-X-Name-First: Rong
Author-X-Name-Last: Ma
Title: Statistical Inference for High-Dimensional Generalized Linear Models With Binary Outcomes
Abstract:
This article develops a unified statistical inference framework for high-dimensional binary generalized linear models (GLMs) with general link functions. Both unknown and known design distribution settings are considered. A two-step weighted bias-correction method is proposed for constructing confidence intervals (CIs) and simultaneous hypothesis tests for individual components of the regression vector. Minimax lower bound for the expected length is established and the proposed CIs are shown to be rate-optimal up to a logarithmic factor. The numerical performance of the proposed procedure is demonstrated through simulation studies and an analysis of a single cell RNA-seq dataset, which yields interesting biological insights that integrate well into the current literature on the cellular immune response mechanisms as characterized by single-cell transcriptomics. The theoretical analysis provides important insights on the adaptivity of optimal CIs with respect to the sparsity of the regression vector. New lower bound techniques are introduced and they can be of independent interest to solve other inference problems in high-dimensional binary GLMs.
Journal: Journal of the American Statistical Association
Pages: 1319-1332
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1990769
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1990769
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1319-1332
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1963261_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yu-Ting Chen
Author-X-Name-First: Yu-Ting
Author-X-Name-Last: Chen
Author-Name: Jeng-Min Chiou
Author-X-Name-First: Jeng-Min
Author-X-Name-Last: Chiou
Author-Name: Tzee-Ming Huang
Author-X-Name-First: Tzee-Ming
Author-X-Name-Last: Huang
Title: Greedy Segmentation for a Functional Data Sequence
Abstract:
We present a new approach known as greedy segmentation (GS) to identify multiple changepoints for a functional data sequence. The proposed multiple changepoint detection criterion links detectability with the projection onto a suitably chosen subspace and the changepoint locations. The changepoint estimator identifies the true changepoints for any predetermined number of changepoint candidates, whether over- or under-specified. This theoretical finding supports the proposed GS estimator, which can be efficiently obtained in a greedy manner. The GS estimator’s consistency holds without being restricted to the conventional at-most-one-changepoint condition, and it is robust to the relative positions of the changepoints. Based on the GS estimator, the test statistic’s asymptotic distribution leads to the novel GS algorithm, which identifies the number and locations of changepoints. Using intensive simulation studies, we compare the finite sample performance of the GS approach with other competing methods. We also apply our method to temporal changepoint detection in weather datasets.
Journal: Journal of the American Statistical Association
Pages: 959-971
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1963261
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1963261
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:959-971
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1983437_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Chan Park
Author-X-Name-First: Chan
Author-X-Name-Last: Park
Author-Name: Hyunseung Kang
Author-X-Name-First: Hyunseung
Author-X-Name-Last: Kang
Title: Assumption-Lean Analysis of Cluster Randomized Trials in Infectious Diseases for Intent-to-Treat Effects and Network Effects
Abstract:
Cluster randomized trials (CRTs) are a popular design to study the effect of interventions in infectious disease settings. However, standard analysis of CRTs primarily relies on strong parametric methods, usually mixed-effect models to account for the clustering structure, and focuses on the overall intent-to-treat (ITT) effect to evaluate effectiveness. The article presents two assumption-lean methods to analyze two types of effects in CRTs, ITT effects and network effects among well-known compliance groups. For the ITT effects, we study the overall and the heterogeneous ITT effects among the observed covariates where we do not impose parametric models or asymptotic restrictions on cluster size. For the network effects among compliance groups, we propose a new bound-based method that uses pretreatment covariates, classification algorithms, and a linear program to obtain sharp bounds. A key feature of our method is that the bounds can become narrower as the classification algorithm improves and the method may also be useful for studies of partial identification with instrumental variables. We conclude by reanalyzing a CRT studying the effect of face masks and hand sanitizers on transmission of 2008 interpandemic influenza in Hong Kong.
Journal: Journal of the American Statistical Association
Pages: 1195-1206
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1983437
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1983437
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1195-1206
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1979011_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yash Deshpande
Author-X-Name-First: Yash
Author-X-Name-Last: Deshpande
Author-Name: Adel Javanmard
Author-X-Name-First: Adel
Author-X-Name-Last: Javanmard
Author-Name: Mohammad Mehrabi
Author-X-Name-First: Mohammad
Author-X-Name-Last: Mehrabi
Title: Online Debiasing for Adaptively Collected High-Dimensional Data With Applications to Time Series Analysis
Abstract:
Adaptive collection of data is commonplace in applications throughout science and engineering. From the point of view of statistical inference, however, adaptive data collection induces memory and correlation in the samples, and poses significant challenges. We consider high-dimensional linear regression, where the samples are collected adaptively, and the sample size n can be smaller than p, the number of covariates. In this setting, there are two distinct sources of bias: the first due to regularization imposed for consistent estimation, for example, using the LASSO, and the second due to adaptivity in collecting the samples. We propose “online debiasing,” a general procedure for estimators such as the LASSO, which addresses both sources of bias. In two concrete contexts (i) time series analysis and (ii) batched data collection, we demonstrate that online debiasing optimally debiases the LASSO estimate when the underlying parameter θ0 has sparsity of order o(n/log p). In this regime, the debiased estimator can be used to compute p-values and confidence intervals of optimal size.
Journal: Journal of the American Statistical Association
Pages: 1126-1139
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1979011
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1979011
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1126-1139
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1999819_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Aaron J. Molstad
Author-X-Name-First: Aaron J.
Author-X-Name-Last: Molstad
Author-Name: Adam J. Rothman
Author-X-Name-First: Adam J.
Author-X-Name-Last: Rothman
Title: A Likelihood-Based Approach for Multivariate Categorical Response Regression in High Dimensions
Abstract:
We propose a penalized likelihood method to fit the bivariate categorical response regression model. Our method allows practitioners to estimate which predictors are irrelevant, which predictors only affect the marginal distributions of the bivariate response, and which predictors affect both the marginal distributions and log odds ratios. To compute our estimator, we propose an efficient algorithm which we extend to settings where some subjects have only one response variable measured, that is, a semi-supervised setting. We derive an asymptotic error bound which illustrates the performance of our estimator in high-dimensional settings. Generalizations to the multivariate categorical response regression model are proposed. Finally, simulation studies and an application in pan-cancer risk prediction demonstrate the usefulness of our method in terms of interpretability and prediction accuracy. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1402-1414
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1999819
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1999819
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1402-1414
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1974867_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Edward L. Ionides
Author-X-Name-First: Edward L.
Author-X-Name-Last: Ionides
Author-Name: Kidus Asfaw
Author-X-Name-First: Kidus
Author-X-Name-Last: Asfaw
Author-Name: Joonha Park
Author-X-Name-First: Joonha
Author-X-Name-Last: Park
Author-Name: Aaron A. King
Author-X-Name-First: Aaron A.
Author-X-Name-Last: King
Title: Bagged Filters for Partially Observed Interacting Systems
Abstract:
Bagging (i.e., bootstrap aggregating) involves combining an ensemble of bootstrap estimators. We consider bagging for inference from noisy or incomplete measurements on a collection of interacting stochastic dynamic systems. Each system is called a unit, and each unit is associated with a spatial location. A motivating example arises in epidemiology, where each unit is a city: the majority of transmission occurs within a city, with smaller yet epidemiologically important interactions arising from disease transmission between cities. Monte Carlo filtering methods used for inference on nonlinear non-Gaussian systems can suffer from a curse of dimensionality (COD) as the number of units increases. We introduce the bagged filter (BF) methodology, which combines an ensemble of Monte Carlo filters, using spatiotemporally localized weights to select successful filters at each unit and time. We obtain conditions under which likelihood evaluation using a BF algorithm can beat a COD, and we demonstrate applicability even when these conditions do not hold. BF can outperform an ensemble Kalman filter on a coupled population dynamics model describing infectious disease transmission. A block particle filter (BPF) also performs well on this task, though the bagged filter respects smoothness and conservation laws that a BPF can violate. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1078-1089
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1974867
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1974867
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1078-1089
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1970571_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Francesca Gasperoni
Author-X-Name-First: Francesca
Author-X-Name-Last: Gasperoni
Author-Name: Alessandra Luati
Author-X-Name-First: Alessandra
Author-X-Name-Last: Luati
Author-Name: Lucia Paci
Author-X-Name-First: Lucia
Author-X-Name-Last: Paci
Author-Name: Enzo D’Innocenzo
Author-X-Name-First: Enzo
Author-X-Name-Last: D’Innocenzo
Title: Score-Driven Modeling of Spatio-Temporal Data
Abstract:
A simultaneous autoregressive score-driven model with autoregressive disturbances is developed for spatio-temporal data that may exhibit heavy tails. The model specification rests on a signal plus noise decomposition of a spatially filtered process, where the signal can be approximated by a nonlinear function of the past variables and a set of explanatory variables, while the noise follows a multivariate Student-t distribution. The key feature of the model is that the dynamics of the space-time varying signal are driven by the score of the conditional likelihood function. When the distribution is heavy-tailed, the score provides a robust update of the space-time varying location. Consistency and asymptotic normality of maximum likelihood estimators are derived along with the stochastic properties of the model. The motivating application of the proposed model comes from brain scans recorded through functional magnetic resonance imaging when subjects are at rest and not expected to react to any controlled stimulus. We identify spontaneous activations in brain regions as extreme values of a possibly heavy-tailed distribution, by accounting for spatial and temporal dependence.
Journal: Journal of the American Statistical Association
Pages: 1066-1077
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1970571
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1970571
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1066-1077
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1969240_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Danielle C. Tucker
Author-X-Name-First: Danielle C.
Author-X-Name-Last: Tucker
Author-Name: Yichao Wu
Author-X-Name-First: Yichao
Author-X-Name-Last: Wu
Author-Name: Hans-Georg Müller
Author-X-Name-First: Hans-Georg
Author-X-Name-Last: Müller
Title: Variable Selection for Global Fréchet Regression
Abstract:
Global Fréchet regression is an extension of linear regression to cover more general types of responses, such as distributions, networks, and manifolds, which are becoming more prevalent. In such models, predictors are Euclidean while responses are metric space valued. Predictor selection is of major relevance for regression modeling in the presence of multiple predictors but has not yet been addressed for Fréchet regression. Due to the metric space-valued nature of the responses, Fréchet regression models do not feature model parameters, and this lack of parameters makes it a major challenge to extend existing variable selection methods for linear regression to global Fréchet regression. In this work, we address this challenge and propose a novel variable selection method that overcomes it and has good practical performance. We provide theoretical support and demonstrate that the proposed variable selection method achieves selection consistency. We also explore the finite sample performance of the proposed method with numerical examples and data illustrations.
Journal: Journal of the American Statistical Association
Pages: 1023-1037
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1969240
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1969240
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1023-1037
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1990765_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Fan Xia
Author-X-Name-First: Fan
Author-X-Name-Last: Xia
Author-Name: Kwun Chuen Gary Chan
Author-X-Name-First: Kwun Chuen Gary
Author-X-Name-Last: Chan
Title: Identification, Semiparametric Efficiency, and Quadruply Robust Estimation in Mediation Analysis with Treatment-Induced Confounding
Abstract:
Natural mediation effects are often of interest when the goal is to understand a causal mechanism. However, most existing methods and their identification assumptions preclude treatment-induced confounders often present in practice. To address this fundamental limitation, we provide a set of assumptions that identify the natural direct effect in the presence of treatment-induced confounders. Even when some of those assumptions are violated, the estimand still has an interventional direct effect interpretation. We derive the semiparametric efficiency bound for the estimand, which unlike usual expressions, contains conditional densities that are variational dependent. We consider a reparameterization and propose a quadruply robust estimator that remains consistent under four types of possible misspecification and is also locally semiparametric efficient. We use simulation studies to demonstrate the proposed method and study an application to the 2017 Natality data to investigate the effect of prenatal care on preterm birth mediated by preeclampsia with smoking status during pregnancy being a potential treatment-induced confounder.
Journal: Journal of the American Statistical Association
Pages: 1272-1281
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1990765
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1990765
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1272-1281
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1956938_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Rui Tuo
Author-X-Name-First: Rui
Author-X-Name-Last: Tuo
Author-Name: Shiyuan He
Author-X-Name-First: Shiyuan
Author-X-Name-Last: He
Author-Name: Arash Pourhabib
Author-X-Name-First: Arash
Author-X-Name-Last: Pourhabib
Author-Name: Yu Ding
Author-X-Name-First: Yu
Author-X-Name-Last: Ding
Author-Name: Jianhua Z. Huang
Author-X-Name-First: Jianhua Z.
Author-X-Name-Last: Huang
Title: A Reproducing Kernel Hilbert Space Approach to Functional Calibration of Computer Models
Abstract:
This article develops a frequentist solution to the functional calibration problem, where the value of a calibration parameter in a computer model is allowed to vary with the value of control variables in the physical system. The need for functional calibration is motivated by engineering applications where using a constant calibration parameter results in a significant mismatch between outputs from the computer model and the physical experiment. Reproducing kernel Hilbert spaces (RKHS) are used to model the optimal calibration function, defined as the functional relationship between the calibration parameter and control variables that gives the best prediction. This optimal calibration function is estimated through penalized least squares with an RKHS-norm penalty and using physical data. An uncertainty quantification procedure is also developed for such estimates. Theoretical guarantees of the proposed method are provided in terms of prediction consistency and consistency of estimating the optimal calibration function. The proposed method is tested using both real and synthetic data and exhibits more robust performance in prediction and uncertainty quantification than the existing parametric functional calibration method and a state-of-the-art Bayesian method.
Journal: Journal of the American Statistical Association
Pages: 883-897
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1956938
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1956938
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:883-897
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1967164_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Nikolaos Ignatiadis
Author-X-Name-First: Nikolaos
Author-X-Name-Last: Ignatiadis
Author-Name: Sujayam Saha
Author-X-Name-First: Sujayam
Author-X-Name-Last: Saha
Author-Name: Dennis L. Sun
Author-X-Name-First: Dennis L.
Author-X-Name-Last: Sun
Author-Name: Omkar Muralidharan
Author-X-Name-First: Omkar
Author-X-Name-Last: Muralidharan
Title: Empirical Bayes Mean Estimation With Nonparametric Errors Via Order Statistic Regression on Replicated Data
Abstract:
We study empirical Bayes estimation of the effect sizes of N units from K noisy observations on each unit. We show that it is possible to achieve near-Bayes optimal mean squared error, without any assumptions or knowledge about the effect size distribution or the noise. The noise distribution can be heteroscedastic and vary arbitrarily from unit to unit. Our proposal, which we call Aurora, leverages the replication inherent in the K observations per unit and recasts the effect size estimation problem as a general regression problem. Aurora with linear regression provably matches the performance of a wide array of estimators including the sample mean, the trimmed mean, the sample median, as well as James-Stein shrunk versions thereof. Aurora automates effect size estimation for Internet-scale datasets, as we demonstrate on data from a large technology firm.
Journal: Journal of the American Statistical Association
Pages: 987-999
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1967164
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1967164
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:987-999
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1981913_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Chaonan Jiang
Author-X-Name-First: Chaonan
Author-X-Name-Last: Jiang
Author-Name: Davide La Vecchia
Author-X-Name-First: Davide
Author-X-Name-Last: La Vecchia
Author-Name: Elvezio Ronchetti
Author-X-Name-First: Elvezio
Author-X-Name-Last: Ronchetti
Author-Name: Olivier Scaillet
Author-X-Name-First: Olivier
Author-X-Name-Last: Scaillet
Title: Saddlepoint Approximations for Spatial Panel Data Models
Abstract:
We develop new higher-order asymptotic techniques for the Gaussian maximum likelihood estimator in a spatial panel data model, with fixed effects, time-varying covariates, and spatially correlated errors. Our saddlepoint density and tail area approximation feature a relative error of order O(1/(n(T−1))), with n being the cross-sectional dimension and T the time-series dimension. The main theoretical tool is the tilted-Edgeworth technique in a nonidentically distributed setting. The density approximation is always nonnegative, does not need resampling, and is accurate in the tails. Monte Carlo experiments on density approximation and testing in the presence of nuisance parameters illustrate the good performance of our approximation over first-order asymptotics and Edgeworth expansion. An empirical application to the investment–saving relationship in OECD (Organisation for Economic Co-operation and Development) countries shows disagreement between testing results based on the first-order asymptotics and saddlepoint techniques. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 1164-1175
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1981913
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1981913
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1164-1175
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1990767_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Xinhe Wang
Author-X-Name-First: Xinhe
Author-X-Name-Last: Wang
Author-Name: Tingyu Wang
Author-X-Name-First: Tingyu
Author-X-Name-Last: Wang
Author-Name: Hanzhong Liu
Author-X-Name-First: Hanzhong
Author-X-Name-Last: Liu
Title: Rerandomization in Stratified Randomized Experiments
Abstract:
Stratification and rerandomization are two well-known methods used in randomized experiments for balancing the baseline covariates. Renowned scholars in experimental design have recommended combining these two methods; however, limited studies have addressed the statistical properties of this combination. This article proposes two rerandomization methods to be used in stratified randomized experiments, based on the overall and stratum-specific Mahalanobis distances. The first method is applicable for nearly arbitrary numbers of strata, strata sizes, and stratum-specific proportions of the treated units. The second method, which is generally more efficient than the first method, is suitable for situations in which the number of strata is fixed with their sizes tending to infinity. Under the randomization inference framework, we obtain the asymptotic distributions of estimators used in these methods and the formulas of variance reduction when compared to stratified randomization. Our analysis does not require any modeling assumption regarding the potential outcomes. Moreover, we provide asymptotically conservative variance estimators and confidence intervals for the average treatment effect. The advantages of the proposed methods are exhibited through an extensive simulation study and a real-data example.
Journal: Journal of the American Statistical Association
Pages: 1295-1304
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1990767
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1990767
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1295-1304
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1955691_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Iván Díaz
Author-X-Name-First: Iván
Author-X-Name-Last: Díaz
Author-Name: Nicholas Williams
Author-X-Name-First: Nicholas
Author-X-Name-Last: Williams
Author-Name: Katherine L. Hoffman
Author-X-Name-First: Katherine L.
Author-X-Name-Last: Hoffman
Author-Name: Edward J. Schenck
Author-X-Name-First: Edward J.
Author-X-Name-Last: Schenck
Title: Nonparametric Causal Effects Based on Longitudinal Modified Treatment Policies
Abstract:
Most causal inference methods consider counterfactual variables under interventions that set the exposure to a fixed value. With continuous or multi-valued treatments or exposures, such counterfactuals may be of little practical interest because no feasible intervention can be implemented that would bring them about. Longitudinal modified treatment policies (LMTPs) are a recently developed nonparametric alternative that yield effects of immediate practical relevance with an interpretation in terms of meaningful interventions such as reducing or increasing the exposure by a given amount. LMTPs also have the advantage that they can be designed to satisfy the positivity assumption required for causal inference. We present a novel sequential regression formula that identifies the LMTP causal effect, study properties of the LMTP statistical estimand such as the efficient influence function and the efficiency bound, and propose four different estimators. Two of our estimators are efficient, and one is sequentially doubly robust in the sense that it is consistent if, for each time point, either an outcome regression or a treatment mechanism is consistently estimated. We perform numerical studies of the estimators, and present the results of our motivating study on hypoxemia and mortality in intubated Intensive Care Unit (ICU) patients. Software implementing our methods is provided in the form of the open source R package lmtp freely available on GitHub (https://github.com/nt-williams/lmtp) and CRAN.
Journal: Journal of the American Statistical Association
Pages: 846-857
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1955691
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1955691
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:846-857
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1981337_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: B. Zhang
Author-X-Name-First: B.
Author-X-Name-Last: Zhang
Author-Name: D. S. Small
Author-X-Name-First: D. S.
Author-X-Name-Last: Small
Author-Name: K. B. Lasater
Author-X-Name-First: K. B.
Author-X-Name-Last: Lasater
Author-Name: M. McHugh
Author-X-Name-First: M.
Author-X-Name-Last: McHugh
Author-Name: J. H. Silber
Author-X-Name-First: J. H.
Author-X-Name-Last: Silber
Author-Name: P. R. Rosenbaum
Author-X-Name-First: P. R.
Author-X-Name-Last: Rosenbaum
Title: Matching One Sample According to Two Criteria in Observational Studies
Abstract:
Multivariate matching has two goals: (i) to construct treated and control groups that have similar distributions of observed covariates, and (ii) to produce matched pairs or sets that are homogeneous in a few key covariates. When there are only a few binary covariates, both goals may be achieved by matching exactly for these few covariates. Commonly, however, there are many covariates, so goals (i) and (ii) come apart, and must be achieved by different means. As is also true in a randomized experiment, similar distributions can be achieved for a high-dimensional covariate, but close pairs can be achieved for only a few covariates. We introduce a new polynomial-time method for achieving both goals that substantially generalizes several existing methods; in particular, it can minimize the earthmover distance between two marginal distributions. The method involves minimum cost flow optimization in a network built around a tripartite graph, unlike the usual network built around a bipartite graph. In the tripartite graph, treated subjects appear twice, on the far left and the far right, with controls sandwiched between them, and efforts to balance covariates are represented on the right, while efforts to find close individual pairs are represented on the left. In this way, the two efforts may be pursued simultaneously without conflict. The method is applied to our ongoing study in the Medicare population of the relationship between superior nursing and sepsis mortality. The match2C package in R implements the method. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1140-1151
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1981337
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1981337
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1140-1151
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1979010_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Jiawei Zhang
Author-X-Name-First: Jiawei
Author-X-Name-Last: Zhang
Author-Name: Jie Ding
Author-X-Name-First: Jie
Author-X-Name-Last: Ding
Author-Name: Yuhong Yang
Author-X-Name-First: Yuhong
Author-X-Name-Last: Yang
Title: Is a Classification Procedure Good Enough?—A Goodness-of-Fit Assessment Tool for Classification Learning
Abstract:
In recent years, many nontraditional classification methods, such as random forest, boosting, and neural network, have been widely used in applications. Their performance is typically measured in terms of classification accuracy. While the classification error rate and the like are important, they do not address a fundamental question: Is the classification method underfitted? To the best of our knowledge, there is no existing method that can assess the goodness of fit of a general classification procedure. Indeed, the lack of a parametric assumption makes it challenging to construct proper tests. To overcome this difficulty, we propose a methodology called BAGofT that splits the data into a training set and a validation set. First, the classification procedure to assess is applied to the training set, which is also used to adaptively find a data grouping that reveals the most severe regions of underfitting. Then, based on this grouping, we calculate a test statistic by comparing the estimated success probabilities and the actual observed responses from the validation set. The data splitting guarantees that the size of the test is controlled under the null hypothesis, and the power of the test goes to one as the sample size increases under the alternative hypothesis. For testing parametric classification models, the BAGofT has a broader scope than the existing methods since it is not restricted to specific parametric models (e.g., logistic regression). Extensive simulation studies show the utility of the BAGofT when assessing general classification procedures and its strengths over some existing methods when testing parametric classification models. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1115-1125
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1979010
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1979010
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1115-1125
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1978468_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Matthew Blackwell
Author-X-Name-First: Matthew
Author-X-Name-Last: Blackwell
Author-Name: Nicole E. Pashley
Author-X-Name-First: Nicole E.
Author-X-Name-Last: Pashley
Title: Noncompliance and Instrumental Variables for 2^K Factorial Experiments
Abstract:
Factorial experiments are widely used to assess the marginal, joint, and interactive effects of multiple concurrent factors. While a robust literature covers the design and analysis of these experiments, there is less work on how to handle treatment noncompliance in this setting. To fill this gap, we introduce a new methodology that uses the potential outcomes framework for analyzing 2^K factorial experiments with noncompliance on any number of factors. This framework builds on and extends the literature on both instrumental variables and factorial experiments in several ways. First, we define novel, complier-specific quantities of interest for this setting and show how to generalize key instrumental variables assumptions. Second, we show how partial compliance across factors gives researchers a choice over different types of compliers to target in estimation. Third, we show how to conduct inference for these new estimands from both the finite-population and superpopulation asymptotic perspectives. Finally, we illustrate these techniques by applying them to a field experiment on the effectiveness of different forms of get-out-the-vote canvassing. New easy-to-use, open-source software implements the methodology. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1102-1114
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1978468
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1978468
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1102-1114
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2183128_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yang Ni
Author-X-Name-First: Yang
Author-X-Name-Last: Ni
Title: Handbook of Bayesian Variable Selection
Journal: Journal of the American Statistical Association
Pages: 1449-1450
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2023.2183128
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2183128
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1449-1450
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1984926_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Zihao Yang
Author-X-Name-First: Zihao
Author-X-Name-Last: Yang
Author-Name: Tianyi Qu
Author-X-Name-First: Tianyi
Author-X-Name-Last: Qu
Author-Name: Xinran Li
Author-X-Name-First: Xinran
Author-X-Name-Last: Li
Title: Rejective Sampling, Rerandomization, and Regression Adjustment in Survey Experiments
Abstract:
Classical randomized experiments, equipped with randomization-based inference, provide assumption-free inference for treatment effects. They have been the gold standard for drawing causal inference and provide excellent internal validity. However, they have also been criticized for questionable external validity, in the sense that the conclusion may not generalize well to a larger population. The randomized survey experiment is a design tool that can help mitigate this concern, by randomly selecting the experimental units from the target population of interest. However, as pointed out by Morgan and Rubin, chance imbalances often exist in covariate distributions between different treatment groups even under completely randomized experiments. Not surprisingly, such covariate imbalances also occur in randomized survey experiments. Furthermore, the covariate imbalances happen not only between different treatment groups, but also between the sampled experimental units and the overall population of interest. In this article, we propose a two-stage rerandomization design that can actively avoid undesirable covariate imbalances at both the sampling and treatment assignment stages. We further develop asymptotic theory for rerandomized survey experiments, demonstrating that rerandomization provides better covariate balance, more precise treatment effect estimators, and shorter large-sample confidence intervals. We also propose covariate adjustment to deal with remaining covariate imbalances after rerandomization, showing that it can further improve both the sampling and estimation precision. Our work allows general relationships among covariates at the sampling, treatment assignment, and analysis stages, and generalizes both rerandomization in classical randomized experiments and rejective sampling in survey sampling.
Journal: Journal of the American Statistical Association
Pages: 1207-1221
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1984926
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1984926
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1207-1221
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1999818_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Wei Liu
Author-X-Name-First: Wei
Author-X-Name-Last: Liu
Author-Name: Huazhen Lin
Author-X-Name-First: Huazhen
Author-X-Name-Last: Lin
Author-Name: Shurong Zheng
Author-X-Name-First: Shurong
Author-X-Name-Last: Zheng
Author-Name: Jin Liu
Author-X-Name-First: Jin
Author-X-Name-Last: Liu
Title: Generalized Factor Model for Ultra-High Dimensional Correlated Variables with Mixed Types
Abstract:
As high-dimensional data measured with mixed-type variables gradually become prevalent, it is particularly appealing to represent those mixed-type high-dimensional data using a much smaller set of so-called factors. Because the existing methods for factor analysis deal with only continuous variables, in this article we develop a generalized factor model, a corresponding algorithm, and theory for ultra-high dimensional mixed types of variables where both the sample size n and variable dimension p could diverge to infinity. Specifically, to solve the computational problem arising from the non-linearity and mixed types, we develop a two-step algorithm so that each update can be carried out in parallel across variables and samples by using an existing package. Theoretically, we establish the rate of convergence for the estimators of factors and loadings in the presence of nonlinear structure accompanied by mixed-type variables when both n and p diverge to infinity. Moreover, since the correct specification of the number of factors is crucial to both the theoretical and the empirical validity of factor models, we also develop a criterion based on a penalized loss to consistently estimate the number of factors under the framework of a generalized factor model. To demonstrate the advantages of the proposed method over the existing ones, we conducted extensive simulation studies and also applied it to the analysis of the NFBC1966 dataset and a cardiac arrhythmia dataset, resulting in more predictive and interpretable estimators for loadings and factors than existing factor models.
Journal: Journal of the American Statistical Association
Pages: 1385-1401
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1999818
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1999818
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:1385-1401
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_1955690_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Xiwei Tang
Author-X-Name-First: Xiwei
Author-X-Name-Last: Tang
Author-Name: Lexin Li
Author-X-Name-First: Lexin
Author-X-Name-Last: Li
Title: Multivariate Temporal Point Process Regression
Abstract:
Point process modeling is gaining increasing attention, as point process type data are emerging in a large variety of scientific applications. In this article, motivated by a neuronal spike trains study, we propose a novel point process regression model, where both the response and the predictor can be a high-dimensional point process. We model the predictor effects through the conditional intensities using a set of basis transferring functions in a convolutional fashion. We organize the corresponding transferring coefficients in the form of a three-way tensor, then impose the low-rank, sparsity, and subgroup structures on this coefficient tensor. These structures help reduce the dimensionality, integrate information across different individual processes, and facilitate the interpretation. We develop a highly scalable optimization algorithm for parameter estimation. We derive the large sample error bound for the recovered coefficient tensor, and establish the subgroup identification consistency, while allowing the dimension of the multivariate point process to diverge. We demonstrate the efficacy of our method through both simulations and a cross-area neuronal spike trains analysis in a sensory cortex study.
Journal: Journal of the American Statistical Association
Pages: 830-845
Issue: 542
Volume: 118
Year: 2023
Month: 4
X-DOI: 10.1080/01621459.2021.1955690
File-URL: http://hdl.handle.net/10.1080/01621459.2021.1955690
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:542:p:830-845
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2224409_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Mark N. Harris
Author-X-Name-First: Mark N.
Author-X-Name-Last: Harris
Title: Modern Applied Regressions: Bayesian and Frequentist Analysis of Categorical and Limited Response Variables with R and Stan
Journal: Journal of the American Statistical Association
Pages: 2209-2211
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2023.2224409
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2224409
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2209-2211
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2003201_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Leo L. Duan
Author-X-Name-First: Leo L.
Author-X-Name-Last: Duan
Title: Transport Monte Carlo: High-Accuracy Posterior Approximation via Random Transport
Abstract:
In Bayesian applications, there is a huge interest in rapid and accurate estimation of the posterior distribution, particularly for high dimensional or hierarchical models. In this article, we propose to use optimization to solve for a joint distribution (random transport plan) between two random variables, θ from the posterior distribution and β from the simple multivariate uniform. Specifically, we obtain an approximate estimate of the conditional distribution Π(β|θ) as an infinite mixture of simple location-scale changes; applying Bayes’ theorem, Π(θ|β) can be sampled as one of the reversed transforms from the uniform, with the weight proportional to the posterior density/mass function. This produces independent random samples with high approximation accuracy, as well as nice theoretical guarantees. Our method shows compelling advantages in performance and accuracy, compared to the state-of-the-art Markov chain Monte Carlo and approximations such as variational Bayes and normalizing flow. We illustrate this approach via several challenging applications, such as sampling from multi-modal distribution, estimating sparse signals in high dimension, and soft-thresholding of a graph with a prior on the degrees. Supplementary materials for this article, including the source code and additional comparisons with popular alternative algorithms, are available on the journal website.
Journal: Journal of the American Statistical Association
Pages: 1659-1670
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2003201
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2003201
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1659-1670
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2013851_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Xiaowu Dai
Author-X-Name-First: Xiaowu
Author-X-Name-Last: Dai
Author-Name: Lexin Li
Author-X-Name-First: Lexin
Author-X-Name-Last: Li
Title: Orthogonalized Kernel Debiased Machine Learning for Multimodal Data Analysis
Abstract:
Multimodal imaging has transformed neuroscience research. While it presents unprecedented opportunities, it also imposes serious challenges. Particularly, it is difficult to combine the merits of the interpretability attributed to a simple association model with the flexibility achieved by a highly adaptive nonlinear model. In this article, we propose an orthogonalized kernel debiased machine learning approach, which is built upon the Neyman orthogonality and a form of decomposition orthogonality, for multimodal data analysis. We target the setting that naturally arises in almost all multimodal studies, where there is a primary modality of interest, plus additional auxiliary modalities. We establish the root-N-consistency and asymptotic normality of the estimated primary parameter, the semi-parametric estimation efficiency, and the asymptotic validity of the confidence band of the predicted primary modality effect. Our proposal enjoys, to a good extent, both model interpretability and model flexibility. It is also considerably different from the existing statistical methods for multimodal data integration, as well as the orthogonality-based methods for high-dimensional inferences. We demonstrate the efficacy of our method through both simulations and an application to a multimodal neuroimaging study of Alzheimer’s disease. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1796-1810
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2013851
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2013851
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1796-1810
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2044333_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Sai Li
Author-X-Name-First: Sai
Author-X-Name-Last: Li
Author-Name: T. Tony Cai
Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai
Author-Name: Hongzhe Li
Author-X-Name-First: Hongzhe
Author-X-Name-Last: Li
Title: Transfer Learning in Large-Scale Gaussian Graphical Models with False Discovery Rate Control
Abstract:
Transfer learning for high-dimensional Gaussian graphical models (GGMs) is studied. The target GGM is estimated by incorporating the data from similar and related auxiliary studies, where the similarity between the target graph and each auxiliary graph is characterized by the sparsity of a divergence matrix. An estimation algorithm, Trans-CLIME, is proposed and shown to attain a faster convergence rate than the minimax rate in the single-task setting. Furthermore, we introduce a universal debiasing method that can be coupled with a range of initial graph estimators and can be analytically computed in one step. A debiased Trans-CLIME estimator is then constructed and is shown to be element-wise asymptotically normal. This fact is used to construct a multiple testing procedure for edge detection with false discovery rate control. The proposed estimation and multiple testing procedures demonstrate superior numerical performance in simulations and are applied to infer the gene networks in a target brain tissue by leveraging the gene expressions from multiple other brain tissues. A significant decrease in prediction errors and a significant increase in power for link detection are observed. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2171-2183
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2022.2044333
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2044333
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2171-2183
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2023550_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yen-Chi Chen
Author-X-Name-First: Yen-Chi
Author-X-Name-Last: Chen
Title: Statistical Inference with Local Optima
Abstract:
We study the statistical properties of an estimator derived by applying a gradient ascent method with multiple initializations to a multi-modal likelihood function. We derive the population quantity that is the target of this estimator and study the properties of confidence intervals (CIs) constructed from asymptotic normality and the bootstrap approach. In particular, we analyze the coverage deficiency due to finite number of random initializations. We also investigate the CIs by inverting the likelihood ratio test, the score test, and the Wald test, and we show that the resulting CIs may be very different. We propose a two-sample test procedure even when the maximum likelihood estimator is intractable. In addition, we analyze the performance of the EM algorithm under random initializations and derive the coverage of a CI with a finite number of initializations. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1940-1952
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2023550
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2023550
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1940-1952
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2025815_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Lizhen Nie
Author-X-Name-First: Lizhen
Author-X-Name-Last: Nie
Author-Name: Veronika Ročková
Author-X-Name-First: Veronika
Author-X-Name-Last: Ročková
Title: Bayesian Bootstrap Spike-and-Slab LASSO
Abstract:
The impracticality of posterior sampling has prevented the widespread adoption of spike-and-slab priors in high-dimensional applications. To alleviate the computational burden, optimization strategies have been proposed that quickly find local posterior modes. Trading off uncertainty quantification for computational speed, these strategies have enabled spike-and-slab deployments at scales that would previously have been infeasible. We build on one recent development in this strand of work: the Spike-and-Slab LASSO procedure. Instead of optimization, however, we explore multiple avenues for posterior sampling, some traditional and some new. Intrigued by the speed of Spike-and-Slab LASSO mode detection, we explore the possibility of sampling from an approximate posterior by performing MAP optimization on many independently perturbed datasets. To this end, we explore Bayesian bootstrap ideas and introduce a new class of jittered Spike-and-Slab LASSO priors with random shrinkage targets. These priors are a key constituent of the Bayesian Bootstrap Spike-and-Slab LASSO (BB-SSL) method proposed here. BB-SSL turns fast optimization into approximate posterior sampling. Beyond its scalability, we show that BB-SSL has strong theoretical support. Indeed, we find that the induced pseudo-posteriors contract around the truth at a near-optimal rate in sparse normal-means and in high-dimensional regression. We compare our algorithm to the traditional Stochastic Search Variable Selection (under Laplace priors) as well as many state-of-the-art methods for shrinkage priors. We show, both in simulations and on real data, that our method fares very well in these comparisons, often providing substantial computational gains. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2013-2028
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2022.2025815
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2025815
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2013-2028
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2013242_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Serge Aleshin-Guendel
Author-X-Name-First: Serge
Author-X-Name-Last: Aleshin-Guendel
Author-Name: Mauricio Sadinle
Author-X-Name-First: Mauricio
Author-X-Name-Last: Sadinle
Title: Multifile Partitioning for Record Linkage and Duplicate Detection
Abstract:
Merging datafiles containing information on overlapping sets of entities is a challenging task in the absence of unique identifiers, and is further complicated when some entities are duplicated in the datafiles. Most approaches to this problem have focused on linking two files assumed to be free of duplicates, or on detecting which records in a single file are duplicates. However, it is common in practice to encounter scenarios that fit somewhere in between or beyond these two settings. We propose a Bayesian approach for the general setting of multifile record linkage and duplicate detection. We use a novel partition representation to propose a structured prior for partitions that can incorporate prior information about the data collection processes of the datafiles in a flexible manner, and extend previous models for comparison data to accommodate the multifile setting. We also introduce a family of loss functions to derive Bayes estimates of partitions that allow uncertain portions of the partitions to be left unresolved. The performance of our proposed methodology is explored through extensive simulations. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1786-1795
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2013242
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2013242
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1786-1795
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2164287_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Chuan Tian
Author-X-Name-First: Chuan
Author-X-Name-Last: Tian
Author-Name: Duo Jiang
Author-X-Name-First: Duo
Author-X-Name-Last: Jiang
Author-Name: Austin Hammer
Author-X-Name-First: Austin
Author-X-Name-Last: Hammer
Author-Name: Thomas Sharpton
Author-X-Name-First: Thomas
Author-X-Name-Last: Sharpton
Author-Name: Yuan Jiang
Author-X-Name-First: Yuan
Author-X-Name-Last: Jiang
Title: Compositional Graphical Lasso Resolves the Impact of Parasitic Infection on Gut Microbial Interaction Networks in a Zebrafish Model
Abstract:
Understanding how microbes interact with each other is key to revealing the underlying role that microorganisms play in the host or environment and to identifying microorganisms as an agent that can potentially alter the host or environment. For example, understanding how microbial interactions associate with parasitic infection can help identify potential drugs or diagnostic tests for parasitic infection. To unravel the microbial interactions, existing tools often rely on graphical models to infer the conditional dependence of microbial abundances to represent their interactions. However, current methods do not simultaneously account for the discreteness, compositionality, and heterogeneity inherent to microbiome data. Thus, we build a new approach called “compositional graphical lasso” upon existing tools by incorporating the above characteristics into the graphical model explicitly. We illustrate the advantage of compositional graphical lasso over current methods under a variety of simulation scenarios and on a benchmark study, the Tara Oceans Project. Moreover, we present our results from the analysis of a dataset from the Zebrafish Parasite Infection Study, which aims to gain insight into how the gut microbiome and parasite burden covary during infection, thus uncovering novel putative methods of disrupting parasite success. Our approach identifies changes in interaction degree between infected and uninfected individuals for three taxa, Photobacterium, Gemmobacter, and Paucibacter, which are inversely predicted by other methods. Further investigation of these method-specific taxa interaction changes reveals their biological plausibility. In particular, we speculate on the potential pathobiotic roles of Photobacterium and Gemmobacter in the zebrafish gut, and the potential probiotic role of Paucibacter. Collectively, our analyses demonstrate that compositional graphical lasso provides a powerful means of accurately resolving interactions between microbiota and can thus drive novel biological discovery. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1500-1514
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2022.2164287
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2164287
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1500-1514
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2165929_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Ian Laga
Author-X-Name-First: Ian
Author-X-Name-Last: Laga
Author-Name: Le Bao
Author-X-Name-First: Le
Author-X-Name-Last: Bao
Author-Name: Xiaoyue Niu
Author-X-Name-First: Xiaoyue
Author-X-Name-Last: Niu
Title: A Correlated Network Scale-Up Model: Finding the Connection Between Subpopulations
Abstract:
Aggregated Relational Data (ARD), formed from “How many X’s do you know?” questions, is a powerful tool for learning important network characteristics with incomplete network data. Compared to traditional survey methods, ARD is attractive as it does not require a sample from the target population and does not ask respondents to self-reveal their own status. This is helpful for studying hard-to-reach populations like female sex workers who may be hesitant to reveal their status. From December 2008 to February 2009, the Kiev International Institute of Sociology (KIIS) collected ARD from 10,866 respondents to estimate the size of HIV-related groups in Ukraine. To analyze this data, we propose a new ARD model which incorporates respondent and group covariates in a regression framework and includes a bias term that is correlated between groups. We also introduce a new scaling procedure using the correlation structure to further reduce biases. The resulting size estimates of those most-at-risk of HIV infection can improve the HIV response efficiency in Ukraine. Additionally, the proposed model allows us to better understand two network features without the full network data: (a) What characteristics affect who respondents know, and (b) How is knowing someone from one group related to knowing people from other groups. These features can allow researchers to better recruit marginalized individuals into the prevention and treatment programs. Our proposed model and several existing NSUM models are implemented in the networkscaleup R package. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1515-1524
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2023.2165929
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2165929
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1515-1524
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2039671_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Xiaowu Dai
Author-X-Name-First: Xiaowu
Author-X-Name-Last: Dai
Author-Name: Xiang Lyu
Author-X-Name-First: Xiang
Author-X-Name-Last: Lyu
Author-Name: Lexin Li
Author-X-Name-First: Lexin
Author-X-Name-Last: Li
Title: Kernel Knockoffs Selection for Nonparametric Additive Models
Abstract:
Thanks to its fine balance between model flexibility and interpretability, the nonparametric additive model has been widely used, and variable selection for this type of model has been frequently studied. However, none of the existing solutions can control the false discovery rate (FDR) unless the sample size tends to infinity. The knockoff framework is a recent proposal that can address this issue, but few knockoff solutions are directly applicable to nonparametric models. In this article, we propose a novel kernel knockoffs selection procedure for the nonparametric additive model. We integrate three key components: the knockoffs, the subsampling for stability, and the random feature mapping for nonparametric function approximation. We show that the proposed method is guaranteed to control the FDR for any sample size, and achieves a power that approaches one as the sample size tends to infinity. We demonstrate the efficacy of our method through intensive simulations and comparisons with the alternative solutions. Our proposal thus makes useful contributions to the methodology of nonparametric variable selection, FDR-based inference, as well as knockoffs. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2158-2170
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2022.2039671
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2039671
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2158-2170
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2223689_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Qingzhao Zhang
Author-X-Name-First: Qingzhao
Author-X-Name-Last: Zhang
Author-Name: Shuangge Ma
Author-X-Name-First: Shuangge
Author-X-Name-Last: Ma
Title: Comment on “A Scale-Free Approach for False Discovery Rate Control in Generalized Linear Models” by Chenguang Dai, Buyu Lin, Xin Xing, and Jun S. Liu
Journal: Journal of the American Statistical Association
Pages: 1566-1568
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2023.2223689
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2223689
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1566-1568
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2011298_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Xiongtao Dai
Author-X-Name-First: Xiongtao
Author-X-Name-Last: Dai
Author-Name: Sara Lopez-Pintado
Author-X-Name-First: Sara
Author-X-Name-Last: Lopez-Pintado
Title: Tukey’s Depth for Object Data
Abstract:
We develop a novel exploratory tool for non-Euclidean object data based on data depth, extending celebrated Tukey’s depth for Euclidean data. The proposed metric halfspace depth, applicable to data objects in a general metric space, assigns to data points depth values that characterize the centrality of these points with respect to the distribution and provides an interpretable center-outward ranking. Desirable theoretical properties that generalize standard depth properties postulated for Euclidean data are established for the metric halfspace depth. The depth median, defined as the deepest point, is shown to have high robustness as a location descriptor both in theory and in simulation. We propose an efficient algorithm to approximate the metric halfspace depth and illustrate its ability to adapt to the intrinsic data geometry. The metric halfspace depth was applied to an Alzheimer’s disease study, revealing group differences in the brain connectivity, modeled as covariance matrices, for subjects in different stages of dementia. Based on phylogenetic trees of seven pathogenic parasites, our proposed metric halfspace depth was also used to construct a meaningful consensus estimate of the evolutionary history and to identify potential outlier trees. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1760-1772
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2011298
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2011298
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1760-1772
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2005608_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yubai Yuan
Author-X-Name-First: Yubai
Author-X-Name-Last: Yuan
Author-Name: Annie Qu
Author-X-Name-First: Annie
Author-X-Name-Last: Qu
Title: High-Order Joint Embedding for Multi-Level Link Prediction
Abstract:
Link prediction infers potential links from observed networks, and is one of the essential problems in network analyses. In contrast to traditional graph representation modeling which only predicts two-way pairwise relations, we propose a novel tensor-based joint network embedding approach that simultaneously encodes pairwise links and hyperlinks onto a latent space, which captures the dependency between pairwise and multi-way links in inferring potential unobserved hyperlinks. The major advantage of the proposed embedding procedure is that it incorporates both the pairwise relationships and subgroup-wise structure among nodes to capture richer network information. In addition, the proposed method introduces a hierarchical dependency among links to infer potential hyperlinks, and leads to better link prediction. In theory we establish the estimation consistency for the proposed embedding approach, and provide a faster convergence rate compared to link prediction using pairwise links or hyperlinks only. Numerical studies on both simulation settings and Facebook ego-networks indicate that the proposed method improves both hyperlink and pairwise link prediction accuracy compared to existing link prediction algorithms. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1692-1706
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2005608
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2005608
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1692-1706
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2231056_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: The Editors
Title: Bootstrap Prediction Bands for Functional Time Series
Journal: Journal of the American Statistical Association
Pages: 2211-2211
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2023.2231056
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2231056
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2211-2211
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2019045_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yujia Deng
Author-X-Name-First: Yujia
Author-X-Name-Last: Deng
Author-Name: Yubai Yuan
Author-X-Name-First: Yubai
Author-X-Name-Last: Yuan
Author-Name: Haoda Fu
Author-X-Name-First: Haoda
Author-X-Name-Last: Fu
Author-Name: Annie Qu
Author-X-Name-First: Annie
Author-X-Name-Last: Qu
Title: Query-Augmented Active Metric Learning
Abstract:
In this article, we propose an active metric learning method for clustering with pairwise constraints. The proposed method actively queries the label of informative instance pairs, while estimating underlying metrics by incorporating unlabeled instance pairs, which leads to a more accurate and efficient clustering process. In particular, we augment the queried constraints by generating more pairwise labels to provide additional information in learning a metric to enhance clustering performance. Furthermore, we increase the robustness of metric learning by updating the learned metric sequentially and penalizing the irrelevant features adaptively. In addition, we propose a novel active query strategy that evaluates the information gain of instance pairs more accurately by incorporating the neighborhood structure, which improves clustering efficiency without extra labeling cost. In theory, we provide a tighter error bound of the proposed metric learning method using augmented queries compared with methods using existing constraints only. Furthermore, we also investigate the improvement using the active query strategy instead of random selection. Numerical studies on simulation settings and real datasets indicate that the proposed method is especially advantageous when the signal-to-noise ratio between significant features and irrelevant features is low.
Journal: Journal of the American Statistical Association
Pages: 1862-1875
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2019045
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2019045
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1862-1875
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2026778_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Lan Luo
Author-X-Name-First: Lan
Author-X-Name-Last: Luo
Author-Name: Ling Zhou
Author-X-Name-First: Ling
Author-X-Name-Last: Zhou
Author-Name: Peter X.-K. Song
Author-X-Name-First: Peter X.-K.
Author-X-Name-Last: Song
Title: Real-Time Regression Analysis of Streaming Clustered Data With Possible Abnormal Data Batches
Abstract:
This article develops an incremental learning algorithm based on quadratic inference function (QIF) to analyze streaming datasets with correlated outcomes such as longitudinal data and clustered data. We propose a renewable QIF (RenewQIF) method within a paradigm of renewable estimation and incremental inference, in which parameter estimates are recursively renewed with current data and summary statistics of historical data, but with no use of any historical subject-level raw data. We compare our renewable estimation method with both offline QIF and offline generalized estimating equations (GEE) approach that process the entire cumulative subject-level data all together, and show theoretically and numerically that our renewable procedure enjoys statistical and computational efficiency. We also propose an approach to diagnose the homogeneity assumption of regression coefficients via a sequential goodness-of-fit test as a screening procedure on occurrences of abnormal data batches. We implement the proposed methodology by expanding existing Spark’s Lambda architecture for the operation of statistical inference and data quality diagnosis. We illustrate the proposed methodology by extensive simulation studies and an analysis of streaming car crash datasets from the National Automotive Sampling System-Crashworthiness Data System (NASS CDS). Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2029-2044
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2022.2026778
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2026778
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2029-2044
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2020126_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Rui Miao
Author-X-Name-First: Rui
Author-X-Name-Last: Miao
Author-Name: Xiaoke Zhang
Author-X-Name-First: Xiaoke
Author-X-Name-Last: Zhang
Author-Name: Raymond K. W. Wong
Author-X-Name-First: Raymond K. W.
Author-X-Name-Last: Wong
Title: A Wavelet-Based Independence Test for Functional Data With an Application to MEG Functional Connectivity
Abstract:
Measuring and testing the dependency between multiple random functions is often an important task in functional data analysis. In the literature, a model-based method relies on a model which is subject to the risk of model misspecification, while a model-free method only provides a correlation measure which is inadequate to test independence. In this paper, we adopt the Hilbert–Schmidt Independence Criterion (HSIC) to measure the dependency between two random functions. We develop a two-step procedure by first pre-smoothing each function based on its discrete and noisy measurements and then applying the HSIC to the recovered functions. To ensure the compatibility between the two steps such that the effect of the pre-smoothing error on the subsequent HSIC is asymptotically negligible when the data are densely measured, we propose a new wavelet thresholding method for pre-smoothing and the use of Besov-norm-induced kernels for HSIC. We also provide the corresponding asymptotic analysis. The superior numerical performance of the proposed method over existing ones is demonstrated in a simulation study. Moreover, in a magnetoencephalography (MEG) data application, the functional connectivity patterns identified by the proposed method are more anatomically interpretable than those by existing methods.
Journal: Journal of the American Statistical Association
Pages: 1876-1889
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2020126
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2020126
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1876-1889
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2183127_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Haoran Xue
Author-X-Name-First: Haoran
Author-X-Name-Last: Xue
Author-Name: Xiaotong Shen
Author-X-Name-First: Xiaotong
Author-X-Name-Last: Shen
Author-Name: Wei Pan
Author-X-Name-First: Wei
Author-X-Name-Last: Pan
Title: Causal Inference in Transcriptome-Wide Association Studies with Invalid Instruments and GWAS Summary Data
Abstract:
Transcriptome-Wide Association Studies (TWAS) have recently emerged as a popular tool to discover (putative) causal genes by integrating an outcome GWAS dataset with another gene expression/transcriptome GWAS (called eQTL) dataset. In our motivating and target application, we’d like to identify causal genes for Low-Density Lipoprotein cholesterol (LDL), which is crucial for developing new treatments for hyperlipidemia and cardiovascular diseases. The statistical principle underlying TWAS is (two-sample) two-stage least squares (2SLS) using multiple correlated SNPs as instrumental variables (IVs); it is closely related to typical (two-sample) Mendelian randomization (MR) using independent SNPs as IVs, which is expected to be impractical and lower-powered for TWAS (and some other) applications. However, often some of the SNPs used may not be valid IVs, for example, due to the widespread pleiotropy of their direct effects on the outcome not mediated through the gene of interest, leading to false conclusions by TWAS (or MR). Building on recent advances in sparse regression, we propose a robust and efficient inferential method to account for both hidden confounding and some invalid IVs via two-stage constrained maximum likelihood (2ScML), an extension of 2SLS. We first develop the proposed method with individual-level data, then extend it both theoretically and computationally to GWAS summary data for the most popular two-sample TWAS design, to which almost all existing robust IV regression methods are however not applicable. We show that the proposed method achieves asymptotically valid statistical inference on causal effects, demonstrating its wider applicability and superior finite-sample performance over the standard 2SLS/TWAS (and MR). We apply the methods to identify putative causal genes for LDL by integrating large-scale lipid GWAS summary data with eQTL data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1525-1537
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2023.2183127
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2183127
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1525-1537
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2165930_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Chenguang Dai
Author-X-Name-First: Chenguang
Author-X-Name-Last: Dai
Author-Name: Buyu Lin
Author-X-Name-First: Buyu
Author-X-Name-Last: Lin
Author-Name: Xin Xing
Author-X-Name-First: Xin
Author-X-Name-Last: Xing
Author-Name: Jun S. Liu
Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu
Title: A Scale-Free Approach for False Discovery Rate Control in Generalized Linear Models
Abstract:
The Generalized Linear Model (GLM) has been widely used in practice to model counts or other types of non-Gaussian data. This article introduces a framework for feature selection in the GLM that can achieve robust False Discovery Rate (FDR) control. The main idea is to construct a mirror statistic based on data perturbation to measure the importance of each feature. FDR control is achieved by taking advantage of the mirror statistic’s property that its sampling distribution is (asymptotically) symmetric about zero for any null feature. In the moderate-dimensional setting, that is, p/n→κ∈(0,1), we construct the mirror statistic based on the maximum likelihood estimation. In the high-dimensional setting, that is, p≫n, we use the debiased Lasso to build the mirror statistic. The proposed methodology is scale-free as it only hinges on the symmetry of the mirror statistic, and thus can be more robust in finite-sample cases compared to existing methods. Both simulation results and a real data application show that the proposed methods are capable of controlling the FDR and are often more powerful than existing methods including the Benjamini-Hochberg procedure and the knockoff filter. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1551-1565
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2023.2165930
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2165930
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1551-1565
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2003202_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Zhibo Cai
Author-X-Name-First: Zhibo
Author-X-Name-Last: Cai
Author-Name: Yingcun Xia
Author-X-Name-First: Yingcun
Author-X-Name-Last: Xia
Author-Name: Weiqiang Hang
Author-X-Name-First: Weiqiang
Author-X-Name-Last: Hang
Title: An Outer-Product-of-Gradient Approach to Dimension Reduction and its Application to Classification in High Dimensional Space
Abstract:
Sufficient dimension reduction (SDR) has progressed steadily. However, its ability to improve general function estimation or classification has not been well received, especially for high-dimensional data. In this article, we first devise a local linear smoother for high dimensional nonparametric regression and then utilise it in the outer-product-of-gradient (OPG) approach of SDR. We call the method high-dimensional OPG (HOPG). To apply SDR to classification in high-dimensional data, we propose an ensemble classifier by aggregating results of classifiers that are built on subspaces reduced by the random projection and HOPG consecutively from the data. Asymptotic results for both HOPG and the classifier are established. Superior performance over the existing methods is demonstrated in simulations and real data analyses. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1671-1681
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2003202
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2003202
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1671-1681
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2005609_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yichen Zhu
Author-X-Name-First: Yichen
Author-X-Name-Last: Zhu
Author-Name: Cheng Li
Author-X-Name-First: Cheng
Author-X-Name-Last: Li
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Classification Trees for Imbalanced Data: Surface-to-Volume Regularization
Abstract:
Classification algorithms face difficulties when one or more classes have limited training data. We are particularly interested in classification trees, due to their interpretability and flexibility. When data are limited in one or more of the classes, the estimated decision boundaries are often irregularly shaped due to the limited sample size, leading to poor generalization error. We propose a novel approach that penalizes the Surface-to-Volume Ratio (SVR) of the decision set, obtaining a new class of SVR-Tree algorithms. We develop a simple and computationally efficient implementation while proving estimation consistency for SVR-Tree and rate of convergence for an idealized empirical risk minimizer of SVR-Tree. SVR-Tree is compared with multiple algorithms that are designed to deal with imbalance through real data applications. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1707-1717
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2005609
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2005609
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1707-1717
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2021919_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Anna E. Dudek
Author-X-Name-First: Anna E.
Author-X-Name-Last: Dudek
Author-Name: Łukasz Lenart
Author-X-Name-First: Łukasz
Author-X-Name-Last: Lenart
Title: Spectral Density Estimation for Nonstationary Data With Nonzero Mean Function
Abstract:
We introduce a new approach for nonparametric spectral density estimation based on the subsampling technique, which we apply to an important class of nonstationary time series: almost periodically correlated sequences. In contrast to existing methods, our technique does not require demeaning of the data. On simulated data examples, we compare our estimator of the spectral density function with the classical one. Additionally, we propose a modified estimator, which allows us to reduce the leakage effect. Moreover, in the supplementary materials, we provide a simulation study and two real data economic applications. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1900-1910
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2021919
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2021919
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1900-1910
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2223578_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yin Xia
Author-X-Name-First: Yin
Author-X-Name-Last: Xia
Author-Name: T. Tony Cai
Author-X-Name-First: T. Tony
Author-X-Name-Last: Cai
Title: Discussion of “A Scale-Free Approach for False Discovery Rate Control in Generalized Linear Models” by Dai, Lin, Xing, and Liu
Journal: Journal of the American Statistical Association
Pages: 1569-1572
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2023.2223578
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2223578
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1569-1572
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2011735_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Xu Guo
Author-X-Name-First: Xu
Author-X-Name-Last: Guo
Author-Name: Haojie Ren
Author-X-Name-First: Haojie
Author-X-Name-Last: Ren
Author-Name: Changliang Zou
Author-X-Name-First: Changliang
Author-X-Name-Last: Zou
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Title: Threshold Selection in Feature Screening for Error Rate Control
Abstract:
The hard thresholding rule is commonly adopted in feature screening procedures to screen out unimportant predictors for ultrahigh-dimensional data. However, different thresholds are required to adapt to different contexts of screening problems, and an appropriate thresholding magnitude usually varies with the model and error distribution. With an ad hoc choice, it is unclear whether all of the important predictors are selected or not, and it is very likely that the procedures would include many unimportant features. We introduce a data-adaptive threshold selection procedure with error rate control, which is applicable to most kinds of popular screening methods. The key idea is to apply the sample-splitting strategy to construct a series of statistics with marginal symmetry property and then to utilize the symmetry for obtaining an approximation to the number of false discoveries. We show that the proposed method is able to asymptotically control the false discovery rate and per family error rate under certain conditions and still retains all of the important predictors. Three important examples are presented to illustrate the merits of the new proposed procedures. Numerical experiments indicate that the proposed methodology works well for many existing screening methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1773-1785
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2011735
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2011735
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1773-1785
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2016424_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Xingyu Zhou
Author-X-Name-First: Xingyu
Author-X-Name-Last: Zhou
Author-Name: Yuling Jiao
Author-X-Name-First: Yuling
Author-X-Name-Last: Jiao
Author-Name: Jin Liu
Author-X-Name-First: Jin
Author-X-Name-Last: Liu
Author-Name: Jian Huang
Author-X-Name-First: Jian
Author-X-Name-Last: Huang
Title: A Deep Generative Approach to Conditional Sampling
Abstract:
We propose a deep generative approach to sampling from a conditional distribution based on a unified formulation of conditional distribution and generalized nonparametric regression function using the noise-outsourcing lemma. The proposed approach aims at learning a conditional generator, so that a random sample from the target conditional distribution can be obtained by transforming a sample drawn from a reference distribution. The conditional generator is estimated nonparametrically with neural networks by matching appropriate joint distributions using the Kullback–Leibler divergence. An appealing aspect of our method is that it allows either or both of the predictor and the response to be high-dimensional and can handle both continuous and discrete type predictors and responses. We show that the proposed method is consistent in the sense that the conditional generator converges in distribution to the underlying conditional distribution under mild conditions. Our numerical experiments with simulated and benchmark image data validate the proposed method and demonstrate that it outperforms several existing conditional density estimation methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1837-1848
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2016424
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2016424
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1837-1848
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2004896_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Michael Law
Author-X-Name-First: Michael
Author-X-Name-Last: Law
Author-Name: Ya’acov Ritov
Author-X-Name-First: Ya’acov
Author-X-Name-Last: Ritov
Title: Inference and Estimation for Random Effects in High-Dimensional Linear Mixed Models
Abstract:
We consider three problems in high-dimensional linear mixed models. Without any assumptions on the design for the fixed effects, we construct asymptotic statistics for testing whether a collection of random effects is zero, derive an asymptotic confidence interval for a single random effect at the parametric rate n, and propose an empirical Bayes estimator for a part of the mean vector in ANOVA type models that performs asymptotically as well as the oracle Bayes estimator. We support our theoretical results with numerical simulations and provide comparisons with oracle estimators. The procedures developed are applied to the Trends in International Mathematics and Sciences Study (TIMSS) data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1682-1691
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2004896
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2004896
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1682-1691
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2044334_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Author-Name: Kai Xu
Author-X-Name-First: Kai
Author-X-Name-Last: Xu
Author-Name: Yeqing Zhou
Author-X-Name-First: Yeqing
Author-X-Name-Last: Zhou
Author-Name: Liping Zhu
Author-X-Name-First: Liping
Author-X-Name-Last: Zhu
Title: Testing the Effects of High-Dimensional Covariates via Aggregating Cumulative Covariances
Abstract:
In this article, we test for the effects of high-dimensional covariates on the response. In many applications, different components of covariates usually exhibit various levels of variation, which is ubiquitous in high-dimensional data. To simultaneously accommodate such heteroscedasticity and high dimensionality, we propose a novel test based on an aggregation of the marginal cumulative covariances, requiring no prior information on the specific form of regression models. Our proposed test statistic is scale-invariant, tuning-free and convenient to implement. The asymptotic normality of the proposed statistic is established under the null hypothesis. We further study the asymptotic relative efficiency of our proposed test with respect to the state-of-the-art universal tests in two different settings: one is designed for the high-dimensional linear model and the other is introduced in a completely model-free setting. A remarkable finding reveals that, thanks to the scale-invariance property, even under the high-dimensional linear models, our proposed test is asymptotically much more powerful than existing competitors for the covariates with heterogeneous variances while maintaining high efficiency for the homoscedastic ones. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2184-2194
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2022.2044334
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2044334
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2184-2194
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2195546_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yan Liu
Author-X-Name-First: Yan
Author-X-Name-Last: Liu
Author-Name: Dewei Wang
Author-X-Name-First: Dewei
Author-X-Name-Last: Wang
Author-Name: Li Li
Author-X-Name-First: Li
Author-X-Name-Last: Li
Author-Name: Dingsheng Li
Author-X-Name-First: Dingsheng
Author-X-Name-Last: Li
Title: Assessing Disparities in Americans’ Exposure to PCBs and PBDEs based on NHANES Pooled Biomonitoring Data
Abstract:
The National Health and Nutrition Examination Survey (NHANES) has been continuously biomonitoring Americans’ exposure to two families of harmful environmental chemicals: polychlorinated biphenyls (PCBs) and polybrominated diphenyl ethers (PBDEs). However, biomonitoring these chemicals is expensive. To save cost, in 2005, NHANES resorted to pooled biomonitoring; that is, amalgamating individual specimens to form a pool and measuring chemical levels from pools. Despite being publicly available, these pooled data gain limited applications in health studies. Among the few studies using these data, racial/age disparities were detected, but there is no control for confounding effects. These disadvantages are due to the complexity of pooled measurements and a dearth of statistical tools. Herein, we developed a regression-based method to unzip pooled measurements, which facilitated a comprehensive assessment of disparities in exposure to these chemicals. We found increasing dependence of PCBs on age and income, whereas PBDEs were the highest among adolescents and seniors and were elevated among the low-income population. In addition, Hispanics had the lowest PCBs and PBDEs among all demographic groups after controlling for potential confounders. These findings can guide the development of population-specific interventions to promote environmental justice. Moreover, both chemical levels declined throughout the period, indicating the effectiveness of existing regulatory policies. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1538-1550
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2023.2195546
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2195546
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1538-1550
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2003200_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Brian D. Williamson
Author-X-Name-First: Brian D.
Author-X-Name-Last: Williamson
Author-Name: Peter B. Gilbert
Author-X-Name-First: Peter B.
Author-X-Name-Last: Gilbert
Author-Name: Noah R. Simon
Author-X-Name-First: Noah R.
Author-X-Name-Last: Simon
Author-Name: Marco Carone
Author-X-Name-First: Marco
Author-X-Name-Last: Carone
Title: A General Framework for Inference on Algorithm-Agnostic Variable Importance
Abstract:
In many applications, it is of interest to assess the relative contribution of features (or subsets of features) toward the goal of predicting a response—in other words, to gauge the variable importance of features. Most recent work on variable importance assessment has focused on describing the importance of features within the confines of a given prediction algorithm. However, such assessment does not necessarily characterize the prediction potential of features, and may provide a misleading reflection of the intrinsic value of these features. To address this limitation, we propose a general framework for nonparametric inference on interpretable algorithm-agnostic variable importance. We define variable importance as a population-level contrast between the oracle predictiveness of all available features versus all features except those under consideration. We propose a nonparametric efficient estimation procedure that allows the construction of valid confidence intervals, even when machine learning techniques are used. We also outline a valid strategy for testing the null importance hypothesis. Through simulations, we show that our proposal has good operating characteristics, and we illustrate its use with data from a study of an antibody against HIV-1 infection. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1645-1658
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2003200
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2003200
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1645-1658
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2020658_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Amichai Painsky
Author-X-Name-First: Amichai
Author-X-Name-Last: Painsky
Title: Generalized Good-Turing Improves Missing Mass Estimation
Abstract:
Consider a finite sample from an unknown distribution over a countable alphabet. The missing mass refers to the probability of symbols that do not appear in the sample. Estimating the missing mass is a basic problem in statistics and related fields, which dates back to the early work of Laplace, and the more recent seminal contribution of Good and Turing. In this article, we introduce a generalized Good-Turing (GT) framework for missing mass estimation. We derive an upper-bound for the risk (in terms of mean squared error) and minimize it over the parameters of our framework. Our analysis distinguishes between two setups, depending on the (unknown) alphabet size. When the alphabet size is bounded from above, our risk-bound demonstrates a significant improvement compared to currently known results (which are typically oblivious to the alphabet size). Based on this bound, we introduce a numerically obtained estimator that improves upon GT. When the alphabet size is unrestricted, we apply our suggested risk-bound and introduce a closed-form estimator that again improves upon GT performance guarantees. Our suggested framework is easy to apply and does not require additional modeling assumptions. This makes it a favorable choice for practical applications. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1890-1899
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2020658
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2020658
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1890-1899
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2023551_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Wang Miao
Author-X-Name-First: Wang
Author-X-Name-Last: Miao
Author-Name: Wenjie Hu
Author-X-Name-First: Wenjie
Author-X-Name-Last: Hu
Author-Name: Elizabeth L. Ogburn
Author-X-Name-First: Elizabeth L.
Author-X-Name-Last: Ogburn
Author-Name: Xiao-Hua Zhou
Author-X-Name-First: Xiao-Hua
Author-X-Name-Last: Zhou
Title: Identifying Effects of Multiple Treatments in the Presence of Unmeasured Confounding
Abstract:
Identification of treatment effects in the presence of unmeasured confounding is a persistent problem in the social, biological, and medical sciences. The problem of unmeasured confounding in settings with multiple treatments is most common in statistical genetics and bioinformatics settings, where researchers have developed many successful statistical strategies without engaging deeply with the causal aspects of the problem. Recently there have been a number of attempts to bridge the gap between these statistical approaches and causal inference, but these attempts have either been shown to be flawed or have relied on fully parametric assumptions. In this article, we propose two strategies for identifying and estimating causal effects of multiple treatments in the presence of unmeasured confounding. The auxiliary variables approach leverages variables that are not causally associated with the outcome; in the case of a univariate confounder, our method only requires one auxiliary variable, unlike existing instrumental variable methods that would require as many instruments as there are treatments. An alternative null treatments approach relies on the assumption that at least half of the confounded treatments have no causal effect on the outcome, but does not require a priori knowledge of which treatments are null. Our identification strategies do not impose parametric assumptions on the outcome model and do not rest on estimation of the confounder. This article extends and generalizes existing work on unmeasured confounding with a single treatment and models commonly used in bioinformatics. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1953-1967
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2023551
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2023551
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1953-1967
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2018329_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Ruijia Wu
Author-X-Name-First: Ruijia
Author-X-Name-Last: Wu
Author-Name: Linjun Zhang
Author-X-Name-First: Linjun
Author-X-Name-Last: Zhang
Author-Name: T. Tony Cai
Author-X-Name-First: T.
Author-X-Name-Last: Tony Cai
Title: Sparse Topic Modeling: Computational Efficiency, Near-Optimal Algorithms, and Statistical Inference
Abstract:
Sparse topic modeling under the probabilistic latent semantic indexing (pLSI) model is studied. Novel and computationally fast algorithms for estimation and inference of both the word-topic matrix and the topic-document matrix are proposed and their theoretical properties are investigated. Both minimax upper and lower bounds are established and the results show that the proposed algorithms are rate-optimal, up to a logarithmic factor. Moreover, a refitting algorithm is proposed to establish asymptotic normality and construct valid confidence intervals for the individual entries of the word-topic and topic-document matrices. Simulation studies are carried out to investigate the numerical performance of the proposed algorithms. The results show that the proposed algorithms perform well numerically and are more accurate in a range of simulation settings compared to existing methods in the literature. In addition, the methods are illustrated through an analysis of the COVID-19 Open Research Dataset (CORD-19).
Journal: Journal of the American Statistical Association
Pages: 1849-1861
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2018329
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2018329
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1849-1861
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2006667_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Kuang-Yao Lee
Author-X-Name-First: Kuang-Yao
Author-X-Name-Last: Lee
Author-Name: Lexin Li
Author-X-Name-First: Lexin
Author-X-Name-Last: Li
Author-Name: Bing Li
Author-X-Name-First: Bing
Author-X-Name-Last: Li
Author-Name: Hongyu Zhao
Author-X-Name-First: Hongyu
Author-X-Name-Last: Zhao
Title: Nonparametric Functional Graphical Modeling Through Functional Additive Regression Operator
Abstract:
In this article, we develop a nonparametric graphical model for multivariate random functions. Most existing graphical models are restricted by the assumptions of multivariate Gaussian or copula Gaussian distributions, which also imply linear relations among the random variables or functions on different nodes. We relax those assumptions by building our graphical model based on a new statistical object—the functional additive regression operator. By carrying out regression and neighborhood selection at the operator level, our method can capture nonlinear relations without requiring any distributional assumptions. Moreover, the method is built using only one-dimensional kernels, thus avoiding the curse of dimensionality from which a fully nonparametric approach often suffers, and enabling us to work with large-scale networks. We derive error bounds for the estimated regression operator and establish graph estimation consistency, while allowing the number of functions to diverge at the exponential rate of the sample size. We demonstrate the efficacy of our method by both simulations and analysis of an electroencephalography dataset. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1718-1732
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2006667
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2006667
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1718-1732
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2231063_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Michael Law
Author-X-Name-First: Michael
Author-X-Name-Last: Law
Author-Name: Peter Bühlmann
Author-X-Name-First: Peter
Author-X-Name-Last: Bühlmann
Title: Discussion of “A Scale-Free Approach for False Discovery Rate Control in Generalized Linear Models”
Journal: Journal of the American Statistical Association
Pages: 1578-1583
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2023.2231063
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2231063
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1578-1583
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2029456_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Ganggang Xu
Author-X-Name-First: Ganggang
Author-X-Name-Last: Xu
Author-Name: Chen Liang
Author-X-Name-First: Chen
Author-X-Name-Last: Liang
Author-Name: Rasmus Waagepetersen
Author-X-Name-First: Rasmus
Author-X-Name-Last: Waagepetersen
Author-Name: Yongtao Guan
Author-X-Name-First: Yongtao
Author-X-Name-Last: Guan
Title: Semiparametric Goodness-of-Fit Test for Clustered Point Processes with a Shape-Constrained Pair Correlation Function
Abstract:
Specification of a parametric model for the intensity function is a fundamental task in statistics for spatial point processes. It is, therefore, crucial to be able to assess the appropriateness of a suggested model for a given point pattern dataset. For this purpose, we develop a new class of semiparametric goodness-of-fit tests for the specified parametric first-order intensity, without assuming a full data generating mechanism that is needed for the existing popular Monte Carlo tests. The proposed tests crucially rely on accurate nonparametric estimation of the second-order properties of a point process. To address this we propose a new nonparametric pair correlation function (PCF) estimator for clustered spatial point processes under some mild shape constraints, which is shown to achieve uniform consistency. The proposed test statistics are computationally efficient owing to closed-form asymptotic distributions and achieve the nominal size even for testing composite hypotheses. In practice, the proposed estimation and testing procedures provide effective tools to improve parametric intensity function modeling, which is demonstrated through extensive simulation studies as well as a real data analysis of street crime activity in Washington DC. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2072-2087
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2022.2029456
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2029456
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2072-2087
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2035736_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Jiashun Jin
Author-X-Name-First: Jiashun
Author-X-Name-Last: Jin
Author-Name: Zheng Tracy Ke
Author-X-Name-First: Zheng Tracy
Author-X-Name-Last: Ke
Author-Name: Shengming Luo
Author-X-Name-First: Shengming
Author-X-Name-Last: Luo
Author-Name: Minzhe Wang
Author-X-Name-First: Minzhe
Author-X-Name-Last: Wang
Title: Optimal Estimation of the Number of Network Communities
Abstract:
In network analysis, how to estimate the number of communities K is a fundamental problem. We consider a broad setting where we allow severe degree heterogeneity and a wide range of sparsity levels, and propose Stepwise Goodness of Fit (StGoF) as a new approach. This is a stepwise algorithm, where for m = 1, 2, …, we alternately use a community detection step and a goodness of fit (GoF) step. We adapt SCORE (Jin) for community detection, and propose a new GoF metric. We show that at step m, the GoF metric diverges to ∞ in probability for all m < K and converges to N(0, 1) if m = K. This gives rise to a consistent estimate for K. Also, we discover the right way to define the signal-to-noise ratio (SNR) for our problem and show that consistent estimates for K do not exist if SNR → 0, and StGoF is uniformly consistent for K if SNR → ∞. Therefore, StGoF achieves the optimal phase transition. Similar stepwise methods are known to face analytical challenges. We overcome the challenges by using a different stepwise scheme in StGoF and by deriving sharp results that were not previously available. The key to our analysis is to show that SCORE has the Nonsplitting Property (NSP). Primarily due to a nontractable rotation of eigenvectors dictated by the Davis–Kahan sin(θ) theorem, the NSP is nontrivial to prove and requires new techniques we develop. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2101-2116
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2022.2035736
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2035736
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2101-2116
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2224412_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Sai Li
Author-X-Name-First: Sai
Author-X-Name-Last: Li
Author-Name: Yisha Yao
Author-X-Name-First: Yisha
Author-X-Name-Last: Yao
Author-Name: Cun-Hui Zhang
Author-X-Name-First: Cun-Hui
Author-X-Name-Last: Zhang
Title: Comments on “A Scale-Free Approach for False Discovery Rate Control in Generalized Linear Models”
Journal: Journal of the American Statistical Association
Pages: 1586-1589
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2023.2224412
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2224412
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1586-1589
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2024437_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Lexin Li
Author-X-Name-First: Lexin
Author-X-Name-Last: Li
Author-Name: Jing Zeng
Author-X-Name-First: Jing
Author-X-Name-Last: Zeng
Author-Name: Xin Zhang
Author-X-Name-First: Xin
Author-X-Name-Last: Zhang
Title: Generalized Liquid Association Analysis for Multimodal Data Integration
Abstract:
Multimodal data are now prevailing in scientific research. One of the central questions in multimodal integrative analysis is to understand how two data modalities associate and interact with each other given another modality or demographic variables. The problem can be formulated as studying the associations among three sets of random variables, a question that has received relatively little attention in the literature. In this article, we propose a novel generalized liquid association analysis method, which offers a new and unique angle to this important class of problems of studying three-way associations. We extend the notion of liquid association from the univariate setting to the sparse, multivariate, and high-dimensional setting. We establish a population dimension reduction model, transform the problem to sparse Tucker decomposition of a three-way tensor, and develop a higher-order orthogonal iteration algorithm for parameter estimation. We derive the nonasymptotic error bound and asymptotic consistency of the proposed estimator, while allowing the variable dimensions to be larger than and diverge with the sample size. We demonstrate the efficacy of the method through both simulations and a multimodal neuroimaging application for Alzheimer’s disease research. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1984-1996
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2024437
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2024437
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1984-1996
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2183129_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: David Rios Insua
Author-X-Name-First: David
Author-X-Name-Last: Rios Insua
Author-Name: Roi Naveiro
Author-X-Name-First: Roi
Author-X-Name-Last: Naveiro
Author-Name: Víctor Gallego
Author-X-Name-First: Víctor
Author-X-Name-Last: Gallego
Author-Name: Jason Poulos
Author-X-Name-First: Jason
Author-X-Name-Last: Poulos
Title: Adversarial Machine Learning: Bayesian Perspectives
Abstract:
Adversarial Machine Learning (AML) is emerging as a major field aimed at protecting Machine Learning (ML) systems against security threats: in certain scenarios there may be adversaries that actively manipulate input data to fool learning systems. This creates a new class of security vulnerabilities that ML systems may face, and a new desirable property, adversarial robustness, which is essential for trusting operations based on ML outputs. Most work in AML is built upon a game-theoretic modeling of the conflict between a learning system and an adversary, ready to manipulate input data. This assumes that each agent knows their opponent’s interests and uncertainty judgments, facilitating inferences based on Nash equilibria. However, such a common-knowledge assumption is not realistic in the security scenarios typical of AML. After reviewing such game-theoretic approaches, we discuss the benefits that Bayesian perspectives provide when defending ML-based systems. We demonstrate how the Bayesian approach allows us to explicitly model our uncertainty about the opponent’s beliefs and interests, relaxing unrealistic assumptions, and providing more robust inferences. We illustrate this approach in supervised learning settings, and identify relevant future research problems. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2195-2206
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2023.2183129
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2183129
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2195-2206
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2002157_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yaoming Zhen
Author-X-Name-First: Yaoming
Author-X-Name-Last: Zhen
Author-Name: Junhui Wang
Author-X-Name-First: Junhui
Author-X-Name-Last: Wang
Title: Community Detection in General Hypergraph Via Graph Embedding
Abstract:
Conventional network data have largely focused on pairwise interactions between two entities, yet multi-way interactions among multiple entities have been frequently observed in real-life hypergraph networks. In this article, we propose a novel method for detecting community structure in general hypergraph networks, uniform or non-uniform. The proposed method introduces a null vertex to augment a non-uniform hypergraph into a uniform multi-hypergraph, and then embeds the multi-hypergraph in a low-dimensional vector space such that vertices within the same community are close to each other. The resultant optimization task can be efficiently tackled by an alternating updating scheme. The asymptotic consistencies of the proposed method are established in terms of both community detection and hypergraph estimation, which are also supported by numerical experiments on some synthetic and real-life hypergraph networks. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1620-1629
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2002157
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2002157
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1620-1629
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2008402_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Ethan X. Fang
Author-X-Name-First: Ethan X.
Author-X-Name-Last: Fang
Author-Name: Zhaoran Wang
Author-X-Name-First: Zhaoran
Author-X-Name-Last: Wang
Author-Name: Lan Wang
Author-X-Name-First: Lan
Author-X-Name-Last: Wang
Title: Fairness-Oriented Learning for Optimal Individualized Treatment Rules
Abstract:
There has recently been a surge in methodological development for optimal individualized treatment rule (ITR) estimation. The standard methods in the literature are designed to maximize the potential average performance (assuming larger outcomes are desirable). A notable drawback of the standard approach, due to heterogeneity in treatment response, is that the estimated optimal ITR may be suboptimal or even detrimental to certain disadvantaged subpopulations. Motivated by the importance of incorporating an appropriate fairness constraint in optimal decision making (e.g., assign treatment with protection to those with shorter survival time, or assign a job training program with protection to those with lower wages), we propose a new framework that aims to estimate an optimal ITR to maximize the average value with the guarantee that its tail performance exceeds a prespecified threshold. The optimal fairness-oriented ITR corresponds to a solution of a nonconvex optimization problem. To handle the computational challenge, we develop a new efficient first-order algorithm. We establish theoretical guarantees for the proposed estimator. Furthermore, we extend the proposed method to dynamic optimal ITRs. The advantages of the proposed approach over existing methods are demonstrated via extensive numerical studies and real data analysis.
Journal: Journal of the American Statistical Association
Pages: 1733-1746
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2008402
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2008402
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1733-1746
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2231224_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Martin Holub
Author-X-Name-First: Martin
Author-X-Name-Last: Holub
Author-Name: Patrícia Martinková
Author-X-Name-First: Patrícia
Author-X-Name-Last: Martinková
Title: Supervised Machine Learning for Text Analysis in R
Journal: Journal of the American Statistical Association
Pages: 2207-2209
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2023.2231224
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2231224
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2207-2209
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2016423_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Shunan Yao
Author-X-Name-First: Shunan
Author-X-Name-Last: Yao
Author-Name: Bradley Rava
Author-X-Name-First: Bradley
Author-X-Name-Last: Rava
Author-Name: Xin Tong
Author-X-Name-First: Xin
Author-X-Name-Last: Tong
Author-Name: Gareth James
Author-X-Name-First: Gareth
Author-X-Name-Last: James
Title: Asymmetric Error Control Under Imperfect Supervision: A Label-Noise-Adjusted Neyman–Pearson Umbrella Algorithm
Abstract:
Label noise in data has long been an important problem in supervised learning applications as it affects the effectiveness of many widely used classification methods. Recently, important real-world applications, such as medical diagnosis and cybersecurity, have generated renewed interest in the Neyman–Pearson (NP) classification paradigm, which constrains the more severe type of error (e.g., the Type I error) under a preferred level while minimizing the other (e.g., the Type II error). However, there has been little research on the NP paradigm under label noise. It is somewhat surprising that even when common NP classifiers ignore the label noise in the training stage, they are still able to control the Type I error with high probability. However, the price they pay is excessive conservativeness of the Type I error and hence a significant drop in power (i.e., 1 - Type II error). Assuming that domain experts provide lower bounds on the corruption severity, we propose the first theory-backed algorithm that adapts most state-of-the-art classification methods to the training label noise under the NP paradigm. The resulting classifiers not only control the Type I error with high probability under the desired level but also improve power.
Journal: Journal of the American Statistical Association
Pages: 1824-1836
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2016423
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2016423
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1824-1836
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2034632_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Jingfei Zhang
Author-X-Name-First: Jingfei
Author-X-Name-Last: Zhang
Author-Name: Yi Li
Author-X-Name-First: Yi
Author-X-Name-Last: Li
Title: High-Dimensional Gaussian Graphical Regression Models with Covariates
Abstract:
Though Gaussian graphical models have been widely used in many scientific fields, relatively limited progress has been made to link graph structures to external covariates. We propose a Gaussian graphical regression model, which regresses both the mean and the precision matrix of a Gaussian graphical model on covariates. In the context of co-expression quantitative trait locus (QTL) studies, our method can determine how genetic variants and clinical conditions modulate the subject-level network structures, and recover both the population-level and subject-level gene networks. Our framework encourages sparsity of covariate effects on both the mean and the precision matrix. In particular for the precision matrix, we stipulate simultaneous sparsity, that is, group sparsity and element-wise sparsity, on effective covariates and their effects on network edges, respectively. We establish variable selection consistency first under the case with known mean parameters and then a more challenging case with unknown means depending on external covariates, and establish in both cases the l2 convergence rates and the selection consistency of the estimated precision parameters. The utility and efficacy of our proposed method are demonstrated through simulation studies and an application to a co-expression QTL study with brain cancer patients. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2088-2100
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2022.2034632
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2034632
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2088-2100
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2016422_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Janice L. Scealy
Author-X-Name-First: Janice L.
Author-X-Name-Last: Scealy
Author-Name: Andrew T. A. Wood
Author-X-Name-First: Andrew T. A.
Author-X-Name-Last: Wood
Title: Score Matching for Compositional Distributions
Abstract:
Compositional data are challenging to analyse due to the non-negativity and sum-to-one constraints on the sample space. With real data, it is often the case that many of the compositional components are highly right-skewed, with large numbers of zeros. Major limitations of currently available models for compositional data include one or more of the following: insufficient flexibility in terms of distributional shape; difficulty in accommodating zeros in the data in estimation; and lack of computational viability in moderate to high dimensions. In this article, we propose a new model, the polynomially tilted pairwise interaction (PPI) model, for analysing compositional data. Maximum likelihood estimation is difficult for the PPI model. Instead, we propose novel score matching estimators, which entails extending the score matching approach to Riemannian manifolds with boundary. These new estimators are available in closed form and simulation studies show that they perform well in practice. As our main application, we analyse real microbiome count data with fixed totals using a multinomial latent variable model with a PPI model for the latent variable distribution. We prove that, under certain conditions, the new score matching estimators are consistent for the parameters in the new multinomial latent variable model.
Journal: Journal of the American Statistical Association
Pages: 1811-1823
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2016422
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2016422
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1811-1823
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2002156_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Decai Liang
Author-X-Name-First: Decai
Author-X-Name-Last: Liang
Author-Name: Hui Huang
Author-X-Name-First: Hui
Author-X-Name-Last: Huang
Author-Name: Yongtao Guan
Author-X-Name-First: Yongtao
Author-X-Name-Last: Guan
Author-Name: Fang Yao
Author-X-Name-First: Fang
Author-X-Name-Last: Yao
Title: Test of Weak Separability for Spatially Stationary Functional Field
Abstract:
For spatially dependent functional data, a generalized Karhunen-Loève expansion is commonly used to decompose data into an additive form of temporal components and spatially correlated coefficients. This structure provides a convenient model to investigate the space-time interactions, but may not hold for complex spatio-temporal processes. In this work, we introduce the concept of weak separability, and propose a formal test to examine its validity for non-replicated spatially stationary functional field. The asymptotic distribution of the test statistic that adapts to potentially diverging ranks is derived by constructing lag covariance estimation, which is easy to compute for practical implementation. We demonstrate the efficacy of the proposed test via simulations and illustrate its usefulness in two real examples: China PM 2.5 data and Harvard Forest data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1606-1619
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2002156
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2002156
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1606-1619
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2038180_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Zhengling Qi
Author-X-Name-First: Zhengling
Author-X-Name-Last: Qi
Author-Name: Jong-Shi Pang
Author-X-Name-First: Jong-Shi
Author-X-Name-Last: Pang
Author-Name: Yufeng Liu
Author-X-Name-First: Yufeng
Author-X-Name-Last: Liu
Title: On Robustness of Individualized Decision Rules
Abstract:
With the emergence of precision medicine, estimating optimal individualized decision rules (IDRs) has attracted tremendous attention in many scientific areas. Most existing literature has focused on finding optimal IDRs that can maximize the expected outcome for each individual. Motivated by complex individualized decision making procedures and the popular conditional value at risk (CVaR) measure, we propose a new robust criterion to estimate optimal IDRs in order to control the average lower tail of the individuals’ outcomes. In addition to improving the individualized expected outcome, our proposed criterion takes risks into consideration, and thus the resulting IDRs can prevent adverse events. The optimal IDR under our criterion can be interpreted as the decision rule that maximizes the “worst-case” scenario of the individualized outcome when the underlying distribution is perturbed within a constrained set. An efficient non-convex optimization algorithm is proposed with convergence guarantees. We investigate theoretical properties for our estimated optimal IDRs under the proposed criterion such as consistency and finite sample error bounds. Simulation studies and a real data application are used to further demonstrate the robust performance of our methods. Several extensions of the proposed method are also discussed. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2143-2157
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2022.2038180
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2038180
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2143-2157
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2037431_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yinpu Li
Author-X-Name-First: Yinpu
Author-X-Name-Last: Li
Author-Name: Antonio R. Linero
Author-X-Name-First: Antonio R.
Author-X-Name-Last: Linero
Author-Name: Jared Murray
Author-X-Name-First: Jared
Author-X-Name-Last: Murray
Title: Adaptive Conditional Distribution Estimation with Bayesian Decision Tree Ensembles
Abstract:
We present a Bayesian nonparametric model for conditional distribution estimation using Bayesian additive regression trees (BART). The generative model we use is based on rejection sampling from a base model. Like other BART models, our model is flexible, has a default prior specification, and is computationally convenient. To address the distinguished role of the response in our BART model, we introduce an approach to targeted smoothing of BART models which is of independent interest. We study the proposed model theoretically and provide sufficient conditions for the posterior distribution to concentrate at close to the minimax optimal rate adaptively over smoothness classes in the high-dimensional regime in which many predictors are irrelevant. To fit our model, we propose a data augmentation algorithm which allows for existing BART samplers to be extended with minimal effort. We illustrate the performance of our methodology on simulated data and use it to study the relationship between education and body mass index using data from the medical expenditure panel survey (MEPS). Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2129-2142
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2022.2037431
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2037431
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2129-2142
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2037430_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Federico Castelletti
Author-X-Name-First: Federico
Author-X-Name-Last: Castelletti
Author-Name: Stefano Peluso
Author-X-Name-First: Stefano
Author-X-Name-Last: Peluso
Title: Network Structure Learning Under Uncertain Interventions
Abstract:
Gaussian Directed Acyclic Graphs (DAGs) represent a powerful tool for learning the network of dependencies among variables, a task which is of primary interest in many fields and specifically in biology. Different DAGs may encode equivalent conditional independence structures, implying limited ability, with observational data, to identify causal relations. In many contexts however, measurements are collected under heterogeneous settings where variables are subject to exogenous interventions. Interventional data can improve the structure learning process whenever the targets of an intervention are known. However, these are often uncertain or completely unknown, as in the context of drug target discovery. We propose a Bayesian method for learning dependence structures and intervention targets from data subject to interventions on unknown variables of the system. Selected features of our approach include a DAG-Wishart prior on the DAG parameters, and the use of variable selection priors to express uncertainty on the targets. We provide theoretical results on the correct asymptotic identification of intervention targets and derive sufficient conditions for Bayes factor and posterior ratio consistency of the graph structure. Our method is applied in simulation and real-world settings to analyze perturbed protein data and assess antiepileptic drug therapies. Details of the MCMC algorithm and proofs of propositions are provided in the supplementary materials, together with more extensive results on simulations and applied studies. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2117-2128
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2022.2037430
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2037430
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2117-2128
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2023552_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Tianhao Wang
Author-X-Name-First: Tianhao
Author-X-Name-Last: Wang
Author-Name: Sarah J. Ratcliffe
Author-X-Name-First: Sarah J.
Author-X-Name-Last: Ratcliffe
Author-Name: Wensheng Guo
Author-X-Name-First: Wensheng
Author-X-Name-Last: Guo
Title: Time-to-Event Analysis with Unknown Time Origins via Longitudinal Biomarker Registration
Abstract:
In observational studies, the time origin of interest for time-to-event analysis is often unknown, such as the time of disease onset. Existing approaches to estimating the time origins are commonly built on extrapolating a parametric longitudinal model, which relies on rigid assumptions that can lead to biased inferences. In this paper, we introduce a flexible semiparametric curve registration model. It assumes the longitudinal trajectories follow a flexible common shape function with a person-specific disease progression pattern characterized by a random curve registration function, which is further used to model the unknown time origin as a random start time. This random time is used as a link to jointly model the longitudinal and survival data, where the unknown time origins are integrated out in the joint likelihood function, which facilitates unbiased and consistent estimation. Since the disease progression pattern naturally predicts time-to-event, we further propose a new functional survival model using the registration function as a predictor of the time-to-event. The asymptotic consistency and semiparametric efficiency of the proposed models are proved. Simulation studies and two real data applications demonstrate the effectiveness of this new approach. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1968-1983
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2023552
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2023552
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1968-1983
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2002158_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Ying Yang
Author-X-Name-First: Ying
Author-X-Name-Last: Yang
Author-Name: Fang Yao
Author-X-Name-First: Fang
Author-X-Name-Last: Yao
Title: Online Estimation for Functional Data
Abstract:
Functional data analysis has attracted considerable interest and is facing new challenges, one of which is the increasingly available data in a streaming manner. In this article we develop an online nonparametric method to dynamically update the estimates of mean and covariance functions for functional data. The kernel-type estimates can be decomposed into two sufficient statistics depending on the data-driven bandwidths. We propose to approximate the future optimal bandwidths by a sequence of dynamically changing candidates and combine the corresponding statistics across blocks to form the updated estimation. The proposed online method is easy to compute based on the stored sufficient statistics and the current data block. We derive the asymptotic normality and, more importantly, the relative efficiency lower bounds of the online estimates of mean and covariance functions. This provides insight into the relationship between estimation accuracy and computational cost driven by the length of candidate bandwidth sequence. Simulations and real data examples are provided to support such findings. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1630-1644
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2002158
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2002158
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1630-1644
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2027775_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Pulong Ma
Author-X-Name-First: Pulong
Author-X-Name-Last: Ma
Author-Name: Anindya Bhadra
Author-X-Name-First: Anindya
Author-X-Name-Last: Bhadra
Title: Beyond Matérn: On A Class of Interpretable Confluent Hypergeometric Covariance Functions
Abstract:
The Matérn covariance function is a popular choice for prediction in spatial statistics and uncertainty quantification literature. A key benefit of the Matérn class is that it is possible to get precise control over the degree of mean-square differentiability of the random process. However, the Matérn class possesses exponentially decaying tails, and thus, may not be suitable for modeling polynomially decaying dependence. This problem can be remedied using polynomial covariances; however, one loses control over the degree of mean-square differentiability of corresponding processes, in that random processes with existing polynomial covariances are either infinitely mean-square differentiable or nowhere mean-square differentiable at all. We construct a new family of covariance functions called the Confluent Hypergeometric (CH) class using a scale mixture representation of the Matérn class where one obtains the benefits of both Matérn and polynomial covariances. The resultant covariance contains two parameters: one controls the degree of mean-square differentiability near the origin and the other controls the tail heaviness, independently of each other. Using a spectral representation, we derive theoretical properties of this new covariance including equivalent measures and asymptotic behavior of the maximum likelihood estimators under infill asymptotics. The improved theoretical properties of the CH class are verified via extensive simulations. Application using NASA’s Orbiting Carbon Observatory-2 satellite data confirms the advantage of the CH class over the Matérn class, especially in extrapolative settings. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2045-2058
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2022.2027775
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2027775
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2045-2058
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2021920_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Gonzalo Vazquez-Bare
Author-X-Name-First: Gonzalo
Author-X-Name-Last: Vazquez-Bare
Title: Causal Spillover Effects Using Instrumental Variables
Abstract:
I set up a potential outcomes framework to analyze spillover effects using instrumental variables. I characterize the population compliance types in a setting in which spillovers can occur on both treatment take-up and outcomes, and provide conditions for identification of the marginal distribution of compliance types. I show that intention-to-treat (ITT) parameters aggregate multiple direct and spillover effects for different compliance types, and hence do not have a clear link to causally interpretable parameters. Moreover, rescaling ITT parameters by first-stage estimands generally recovers a weighted combination of average effects where the sum of weights is larger than one. I then analyze identification of causal direct and spillover effects under one-sided noncompliance, and show that causal effects can be estimated by 2SLS in this case. I illustrate the proposed methods using data from an experiment on social interactions and voting behavior. I also introduce an alternative assumption, independence of the peers’ types, that identifies parameters of interest under two-sided noncompliance by restricting the amount of heterogeneity in average potential outcomes. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1911-1922
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2021920
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2021920
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1911-1922
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2245686_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Chenguang Dai
Author-X-Name-First: Chenguang
Author-X-Name-Last: Dai
Author-Name: Buyu Lin
Author-X-Name-First: Buyu
Author-X-Name-Last: Lin
Author-Name: Xin Xing
Author-X-Name-First: Xin
Author-X-Name-Last: Xing
Author-Name: Jun S. Liu
Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu
Title: Rejoinder: A Scale-free Approach for False Discovery Rate Control in Generalized Linear Models
Journal: Journal of the American Statistical Association
Pages: 1590-1594
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2023.2245686
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2245686
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1590-1594
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2157727_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Xinzhou Guo
Author-X-Name-First: Xinzhou
Author-X-Name-Last: Guo
Author-Name: Waverly Wei
Author-X-Name-First: Waverly
Author-X-Name-Last: Wei
Author-Name: Molei Liu
Author-X-Name-First: Molei
Author-X-Name-Last: Liu
Author-Name: Tianxi Cai
Author-X-Name-First: Tianxi
Author-X-Name-Last: Cai
Author-Name: Chong Wu
Author-X-Name-First: Chong
Author-X-Name-Last: Wu
Author-Name: Jingshen Wang
Author-X-Name-First: Jingshen
Author-X-Name-Last: Wang
Title: Assessing the Most Vulnerable Subgroup to Type II Diabetes Associated with Statin Usage: Evidence from Electronic Health Record Data
Abstract:
There have been increased concerns that the use of statins, one of the most commonly prescribed drugs for treating coronary artery disease, is potentially associated with an increased risk of new-onset Type II diabetes (T2D). Nevertheless, to date, there is no robust evidence as to whether, and for which populations, taking statins indeed raises vulnerability to developing T2D. In this case study, leveraging the biobank and electronic health record data in the Partner Health System, we introduce a new data analysis pipeline and a novel statistical methodology that address existing limitations by (i) designing a rigorous causal framework that systematically examines the causal effects of statin usage on T2D risk in observational data, (ii) uncovering which patient subgroup is most vulnerable to developing T2D after taking statins, and (iii) assessing the replicability and statistical significance of the most vulnerable subgroup via a bootstrap calibration procedure. Our proposed approach delivers asymptotically sharp confidence intervals and a debiased estimate for the treatment effect of the most vulnerable subgroup in the presence of high-dimensional covariates. With our proposed approach, we find that females with high T2D genetic risk are at the highest risk of developing T2D due to statin usage. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1488-1499
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2022.2157727
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2157727
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1488-1499
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2223656_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Gerda Claeskens
Author-X-Name-First: Gerda
Author-X-Name-Last: Claeskens
Author-Name: Maarten Jansen
Author-X-Name-First: Maarten
Author-X-Name-Last: Jansen
Author-Name: Jing Zhou
Author-X-Name-First: Jing
Author-X-Name-Last: Zhou
Title: Discussion on: “A Scale-Free Approach for False Discovery Rate Control in Generalized Linear Models” by Dai, Lin, Xing, Liu
Journal: Journal of the American Statistical Association
Pages: 1573-1577
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2023.2223656
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2223656
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1573-1577
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2000868_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Harold D. Chiang
Author-X-Name-First: Harold D.
Author-X-Name-Last: Chiang
Author-Name: Kengo Kato
Author-X-Name-First: Kengo
Author-X-Name-Last: Kato
Author-Name: Yuya Sasaki
Author-X-Name-First: Yuya
Author-X-Name-Last: Sasaki
Title: Inference for High-Dimensional Exchangeable Arrays
Abstract:
We consider inference for high-dimensional separately and jointly exchangeable arrays where the dimensions may be much larger than the sample sizes. For both exchangeable arrays, we first derive high-dimensional central limit theorems over the rectangles and subsequently develop novel multiplier bootstraps with theoretical guarantees. These theoretical results rely on new technical tools such as a Hoeffding-type decomposition and maximal inequalities for the degenerate components in the Hoeffding-type decomposition for the exchangeable arrays. We exhibit applications of our methods to uniform confidence bands for density estimation under joint exchangeability and penalty choice for l1-penalized regression under separate exchangeability. Extensive simulations demonstrate precise uniform coverage rates. We illustrate by constructing uniform confidence bands for international trade network densities.
Journal: Journal of the American Statistical Association
Pages: 1595-1605
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2000868
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2000868
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1595-1605
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2232834_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Lucas Janson
Author-X-Name-First: Lucas
Author-X-Name-Last: Janson
Title: Discussion of “A Scale-Free Approach for False Discovery Rate Control in Generalized Linear Models” by Chenguang Dai, Buyu Lin, Xin Xing, and Jun S. Liu
Journal: Journal of the American Statistical Association
Pages: 1584-1585
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2023.2232834
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2232834
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1584-1585
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2156349_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Yize Zhao
Author-X-Name-First: Yize
Author-X-Name-Last: Zhao
Author-Name: Changgee Chang
Author-X-Name-First: Changgee
Author-X-Name-Last: Chang
Author-Name: Jingwen Zhang
Author-X-Name-First: Jingwen
Author-X-Name-Last: Zhang
Author-Name: Zhengwu Zhang
Author-X-Name-First: Zhengwu
Author-X-Name-Last: Zhang
Title: Genetic Underpinnings of Brain Structural Connectome for Young Adults
Abstract:
With distinct advantages in power over behavioral phenotypes, brain imaging traits have become emerging endophenotypes to dissect molecular contributions to behaviors and neuropsychiatric illnesses. Among different imaging features, brain structural connectivity (i.e., the structural connectome), which summarizes the anatomical connections between different brain regions, is one of the most cutting-edge yet under-investigated traits, and the genetic influence on structural connectome variation remains highly elusive. Relying on a landmark imaging genetics study for young adults, we develop a biologically plausible brain network response shrinkage model to comprehensively characterize the relationship between high-dimensional genetic variants and the structural connectome phenotype. Under a unified Bayesian framework, we accommodate the topology of the brain network and the biological architecture within the genome, and eventually establish a mechanistic mapping between genetic biomarkers and the associated brain sub-network units. An efficient expectation-maximization algorithm is developed to estimate the model and ensure computing feasibility. In the application to the Human Connectome Project Young Adult (HCP-YA) data, we establish the genetic underpinnings, which are highly interpretable under functional annotation and brain tissue eQTL analysis, for the brain white matter tracts connecting the hippocampus and two cerebral hemispheres. We also show the superiority of our method in extensive simulations. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1473-1487
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2022.2156349
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2156349
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1473-1487
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2021921_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Marc Hallin
Author-X-Name-First: Marc
Author-X-Name-Last: Hallin
Author-Name: Daniel Hlubinka
Author-X-Name-First: Daniel
Author-X-Name-Last: Hlubinka
Author-Name: Šárka Hudecová
Author-X-Name-First: Šárka
Author-X-Name-Last: Hudecová
Title: Efficient Fully Distribution-Free Center-Outward Rank Tests for Multiple-Output Regression and MANOVA
Abstract:
Extending rank-based inference to a multivariate setting such as multiple-output regression or MANOVA with unspecified d-dimensional error density has remained an open problem for more than half a century. None of the many solutions proposed so far enjoys the combination of distribution-freeness and efficiency that makes rank-based inference a successful tool in the univariate setting. A concept of center-outward multivariate ranks and signs based on measure transportation ideas has been introduced recently. Center-outward ranks and signs are not only distribution-free but achieve in dimension d > 1 the (essential) maximal ancillarity property of traditional univariate ranks. In the present work, we show that fully distribution-free testing procedures based on center-outward ranks can achieve parametric efficiency. We establish the Hájek representation and asymptotic normality results required in the construction of such tests in multiple-output regression and MANOVA models. Simulations and an empirical study demonstrate the excellent performance of the proposed procedures.
Journal: Journal of the American Statistical Association
Pages: 1923-1939
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2021921
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2021921
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1923-1939
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2027776_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Chengchun Shi
Author-X-Name-First: Chengchun
Author-X-Name-Last: Shi
Author-Name: Xiaoyu Wang
Author-X-Name-First: Xiaoyu
Author-X-Name-Last: Wang
Author-Name: Shikai Luo
Author-X-Name-First: Shikai
Author-X-Name-Last: Luo
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Author-Name: Jieping Ye
Author-X-Name-First: Jieping
Author-X-Name-Last: Ye
Author-Name: Rui Song
Author-X-Name-First: Rui
Author-X-Name-Last: Song
Title: Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework
Abstract:
A/B testing, or online experimentation, is a standard business strategy to compare a new product with an old one in pharmaceutical, technological, and traditional industries. Major challenges arise in online experiments on two-sided marketplace platforms (e.g., Uber), where there is only one unit that receives a sequence of treatments over time. In those experiments, the treatment at a given time impacts the current outcome as well as future outcomes. The aim of this article is to introduce a reinforcement learning framework for carrying out A/B testing in these experiments, while characterizing the long-term treatment effects. Our proposed testing procedure allows for sequential monitoring and online updating. It is generally applicable to a variety of treatment designs in different industries. In addition, we systematically investigate the theoretical properties (e.g., size and power) of our testing procedure. Finally, we apply our framework to both simulated data and a real-world data example obtained from a technological company to illustrate its advantage over the current practice. A Python implementation of our test is available at https://github.com/callmespring/CausalRL. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2059-2071
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2022.2027776
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2027776
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:2059-2071
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2024836_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Pierre Alquier
Author-X-Name-First: Pierre
Author-X-Name-Last: Alquier
Author-Name: Badr-Eddine Chérief-Abdellatif
Author-X-Name-First: Badr-Eddine
Author-X-Name-Last: Chérief-Abdellatif
Author-Name: Alexis Derumigny
Author-X-Name-First: Alexis
Author-X-Name-Last: Derumigny
Author-Name: Jean-David Fermanian
Author-X-Name-First: Jean-David
Author-X-Name-Last: Fermanian
Title: Estimation of Copulas via Maximum Mean Discrepancy
Abstract:
This article deals with robust inference for parametric copula models. Estimation using canonical maximum likelihood might be unstable, especially in the presence of outliers. We propose to use a procedure based on the maximum mean discrepancy (MMD) principle. We derive nonasymptotic oracle inequalities, consistency, and asymptotic normality of this new estimator. In particular, the oracle inequality holds without any assumption on the copula family, and can be applied in the presence of outliers or under misspecification. Moreover, in our MMD framework, the statistical inference of copula models for which there exists no density with respect to the Lebesgue measure on [0,1]^d, such as the Marshall-Olkin copula, becomes feasible. A simulation study shows the robustness of our new procedures, especially compared to pseudo-maximum likelihood estimation. An R package implementing the MMD estimator for copula models is available. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1997-2012
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2024836
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2024836
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1997-2012
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2008944_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20230119T200553 git hash: 724830af20
Author-Name: Haiqiang Ma
Author-X-Name-First: Haiqiang
Author-X-Name-Last: Ma
Author-Name: Jiming Jiang
Author-X-Name-First: Jiming
Author-X-Name-Last: Jiang
Title: Pseudo-Bayesian Classified Mixed Model Prediction
Abstract:
We propose a new classified mixed model prediction (CMMP) procedure, called pseudo-Bayesian CMMP, that uses network information in matching the group index between the training data and new data, whose characteristics of interest one wishes to predict. The current CMMP procedures do not incorporate such information; as a result, the methods are not consistent in terms of matching the group index. Although, as the number of training data groups increases, the current CMMP method can predict the mixed effects of interest consistently, its accuracy is not guaranteed when the number of groups is moderate, as is the case in many potential applications. The proposed pseudo-Bayesian CMMP procedure assumes a flexible working probability model for the group index of the new observation to match the index of a training data group, which may be viewed as a pseudo prior. We show that, given any working model satisfying mild conditions, the pseudo-Bayesian CMMP procedure is consistent and asymptotically optimal both in terms of matching the group index and in terms of predicting the mixed effect of interest associated with the new observations. The theoretical results are fully supported by results of empirical studies, including Monte Carlo simulations and real-data validation.
Journal: Journal of the American Statistical Association
Pages: 1747-1759
Issue: 543
Volume: 118
Year: 2023
Month: 7
X-DOI: 10.1080/01621459.2021.2008944
File-URL: http://hdl.handle.net/10.1080/01621459.2021.2008944
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:543:p:1747-1759
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2044824_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Zhuoran Yang
Author-X-Name-First: Zhuoran
Author-X-Name-Last: Yang
Author-Name: Mengxin Yu
Author-X-Name-First: Mengxin
Author-X-Name-Last: Yu
Title: Understanding Implicit Regularization in Over-Parameterized Single Index Model
Abstract:
In this article, we leverage over-parameterization to design regularization-free algorithms for the high-dimensional single index model and provide theoretical guarantees for the induced implicit regularization phenomenon. Specifically, we study both vector and matrix single index models where the link function is nonlinear and unknown, the signal parameter is either a sparse vector or a low-rank symmetric matrix, and the response variable can be heavy-tailed. To gain a better understanding of the role played by implicit regularization without excess technicality, we assume that the distribution of the covariates is known a priori. For both the vector and matrix settings, we construct an over-parameterized least-squares loss function by employing the score function transform and a robust truncation step designed specifically for heavy-tailed data. We propose to estimate the true parameter by applying regularization-free gradient descent to the loss function. When the initialization is close to the origin and the stepsize is sufficiently small, we prove that the obtained solution achieves minimax optimal statistical rates of convergence in both the vector and matrix cases. In addition, our experimental results support our theoretical findings and also demonstrate that our methods empirically outperform classical methods with explicit regularization in terms of both l2-statistical rate and variable selection consistency. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2315-2328
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2044824
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2044824
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2315-2328
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2061354_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Xiufan Yu
Author-X-Name-First: Xiufan
Author-X-Name-Last: Yu
Author-Name: Danning Li
Author-X-Name-First: Danning
Author-X-Name-Last: Li
Author-Name: Lingzhou Xue
Author-X-Name-First: Lingzhou
Author-X-Name-Last: Xue
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Title: Power-Enhanced Simultaneous Test of High-Dimensional Mean Vectors and Covariance Matrices with Application to Gene-Set Testing
Abstract:
Power-enhanced tests with high-dimensional data have received growing attention in theoretical and applied statistics in recent years. Existing tests possess their respective high-power regions, and we may lack prior knowledge about the alternatives when testing for a problem of interest in practice. There is a critical need to develop powerful testing procedures against more general alternatives. This article studies the joint test of two-sample mean vectors and covariance matrices for high-dimensional data. We first expand the high-power regions of high-dimensional mean tests or covariance tests to a wider alternative space and then combine their strengths together in the simultaneous test. We develop a new power-enhanced simultaneous test that is powerful to detect differences in either mean vectors or covariance matrices under either sparse or dense alternatives. We prove that the proposed testing procedures align with the power enhancement principles introduced by Fan, Liao, and Yao and achieve accurate asymptotic size and consistent asymptotic power. We demonstrate the finite-sample performance using simulation studies and a real application to find differentially expressed gene-sets in cancer studies. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2548-2561
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2061354
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2061354
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2548-2561
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2070070_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Baihua He
Author-X-Name-First: Baihua
Author-X-Name-Last: He
Author-Name: Shuangge Ma
Author-X-Name-First: Shuangge
Author-X-Name-Last: Ma
Author-Name: Xinyu Zhang
Author-X-Name-First: Xinyu
Author-X-Name-Last: Zhang
Author-Name: Li-Xing Zhu
Author-X-Name-First: Li-Xing
Author-X-Name-Last: Zhu
Title: Rank-Based Greedy Model Averaging for High-Dimensional Survival Data
Abstract:
Model averaging is an effective way to enhance prediction accuracy. However, most previous works focus on low-dimensional settings with completely observed responses. To attain an accurate prediction for the risk effect of survival data with high-dimensional predictors, we propose a novel method: rank-based greedy (RG) model averaging. Specifically, adopting the transformation model with splitting predictors as working models, we doubly use the smooth concordance index function to derive the candidate predictions and optimal model weights. The final prediction is achieved by weighted averaging all the candidates. Our approach is flexible, computationally efficient, and robust against model misspecification, as it neither requires the correctness of a joint model nor involves the estimation of the transformation function. We further adopt the greedy algorithm for high dimensions. Theoretically, we derive an asymptotic error bound for the optimal weights under some mild conditions. In addition, the summation of weights assigned to the correct candidate submodels is proven to approach one in probability when there are correct models included among the candidate submodels. Extensive numerical studies are carried out using both simulated and real datasets to show the proposed approach’s robust performance compared to the existing regularization approaches. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2658-2670
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2070070
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2070070
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2658-2670
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2044827_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Yanyan Zeng
Author-X-Name-First: Yanyan
Author-X-Name-Last: Zeng
Author-Name: Daolin Pang
Author-X-Name-First: Daolin
Author-X-Name-Last: Pang
Author-Name: Hongyu Zhao
Author-X-Name-First: Hongyu
Author-X-Name-Last: Zhao
Author-Name: Tao Wang
Author-X-Name-First: Tao
Author-X-Name-Last: Wang
Title: A Zero-Inflated Logistic Normal Multinomial Model for Extracting Microbial Compositions
Abstract:
High throughput sequencing data collected to study the microbiome provide information in the form of relative abundances and should be treated as compositions. Although many approaches including scaling and rarefaction have been proposed for converting raw count data into microbial compositions, most of these methods simply return zero values for zero counts. However, zeros can distort downstream analyses, and they can also pose problems for composition-aware methods. This problem is exacerbated with microbiome abundance data because they are sparse with excessive zeros. In addition to data sparsity, microbial composition estimation depends on other data characteristics such as high dimensionality, over-dispersion, and complex co-occurrence relationships. To address these challenges, we introduce a zero-inflated probabilistic PCA (ZIPPCA) model that accounts for the compositional nature of microbiome data, and propose an empirical Bayes approach to estimate microbial compositions. An efficient iterative algorithm, called classification variational approximation, is developed for carrying out maximum likelihood estimation. Moreover, we study the consistency and asymptotic normality of variational approximation estimator from the perspective of profile M-estimation. Extensive simulations and an application to a dataset from the Human Microbiome Project are presented to compare the performance of the proposed method with that of the existing methods. The method is implemented in R and available at https://github.com/YanyZeng/ZIPPCAlnm. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2356-2369
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2044827
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2044827
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2356-2369
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2089574_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Yinyin Chen
Author-X-Name-First: Yinyin
Author-X-Name-Last: Chen
Author-Name: Shishuang He
Author-X-Name-First: Shishuang
Author-X-Name-Last: He
Author-Name: Yun Yang
Author-X-Name-First: Yun
Author-X-Name-Last: Yang
Author-Name: Feng Liang
Author-X-Name-First: Feng
Author-X-Name-Last: Liang
Title: Learning Topic Models: Identifiability and Finite-Sample Analysis
Abstract:
Topic models provide a useful text-mining tool for learning, extracting, and discovering latent structures in large text corpora. Although a plethora of methods have been proposed for topic modeling, lacking in the literature is a formal theoretical investigation of the statistical identifiability and accuracy of latent topic estimation. In this article, we propose a maximum likelihood estimator (MLE) of latent topics based on a specific integrated likelihood that is naturally connected to the concept, in computational geometry, of volume minimization. Our theory introduces a new set of geometric conditions for topic model identifiability, conditions that are weaker than conventional separability conditions, which typically rely on the existence of pure topic documents or of anchor words. Weaker conditions allow a wider and thus potentially more fruitful investigation. We conduct finite-sample error analysis for the proposed estimator and discuss connections between our results and those of previous investigations. We conclude with empirical studies employing both simulated and real datasets. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2860-2875
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2089574
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2089574
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2860-2875
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2208388_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Moritz Berger
Author-X-Name-First: Moritz
Author-X-Name-Last: Berger
Author-Name: Ana Kowark
Author-X-Name-First: Ana
Author-X-Name-Last: Kowark
Author-Name: Rolf Rossaint
Author-X-Name-First: Rolf
Author-X-Name-Last: Rossaint
Author-Name: Mark Coburn
Author-X-Name-First: Mark
Author-X-Name-Last: Coburn
Author-Name: Matthias Schmid
Author-X-Name-First: Matthias
Author-X-Name-Last: Schmid
Title: Modeling Postoperative Mortality in Older Patients by Boosting Discrete-Time Competing Risks Models
Abstract:
Elderly patients are at a high risk of suffering from postoperative death. Personalized strategies to improve their recovery after intervention are therefore urgently needed. A popular way to analyze postoperative mortality is to develop a prognostic model that incorporates risk factors measured at hospital admission, for example, comorbidities. When building such models, numerous issues must be addressed, including censoring and the presence of competing events (such as discharge from hospital alive). Here we present a novel survival modeling approach to investigate 30-day inpatient mortality following intervention. The proposed method accounts for both grouped event times, for example, measured in 24-hour intervals, and competing events. Conceptually, the method is embedded in the framework of generalized additive models for location, scale, and shape (GAMLSS). Model fitting is performed using a component-wise gradient boosting algorithm, which allows for additional regularization steps via stability selection. We used this new modeling approach to analyze data from the Peri-interventional Outcome Study in the Elderly (POSE), which is a recent cohort study that enrolled 9862 elderly inpatients undergoing intervention under anesthesia. Application of the proposed boosting algorithm yielded six important risk factors (including both clinical variables and interventional characteristics) that either contributed to the hazard of death or to discharge from hospital alive. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2239-2249
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2023.2208388
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2208388
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2239-2249
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2096619_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Samuel Perreault
Author-X-Name-First: Samuel
Author-X-Name-Last: Perreault
Author-Name: Johanna G. Nešlehová
Author-X-Name-First: Johanna G.
Author-X-Name-Last: Nešlehová
Author-Name: Thierry Duchesne
Author-X-Name-First: Thierry
Author-X-Name-Last: Duchesne
Title: Hypothesis Tests for Structured Rank Correlation Matrices
Abstract:
Joint modeling of a large number of variables often requires dimension reduction strategies that lead to structural assumptions of the underlying correlation matrix, such as equal pair-wise correlations within subsets of variables. The underlying correlation matrix is thus of interest for both model specification and model validation. In this article, we develop tests of the hypothesis that the entries of the Kendall rank correlation matrix are linear combinations of a smaller number of parameters. The asymptotic behavior of the proposed test statistics is investigated both when the dimension is fixed and when it grows with the sample size. We pay special attention to the restricted hypothesis of partial exchangeability, which contains full exchangeability as a special case. We show that under partial exchangeability, the test statistics and their large-sample distributions simplify, which leads to computational advantages and better performance of the tests. We propose various scalable numerical strategies for implementation of the proposed procedures, investigate their behavior through simulations and power calculations under local alternatives, and demonstrate their use on a real dataset of mean sea levels at various geographical locations. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2889-2900
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2096619
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2096619
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2889-2900
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2060112_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Robin Dunn
Author-X-Name-First: Robin
Author-X-Name-Last: Dunn
Author-Name: Larry Wasserman
Author-X-Name-First: Larry
Author-X-Name-Last: Wasserman
Author-Name: Aaditya Ramdas
Author-X-Name-First: Aaditya
Author-X-Name-Last: Ramdas
Title: Distribution-Free Prediction Sets for Two-Layer Hierarchical Models
Abstract:
We consider the problem of constructing distribution-free prediction sets for data from two-layer hierarchical distributions. For iid data, prediction sets can be constructed using the method of conformal prediction. The validity of conformal prediction hinges on the exchangeability of the data, which does not hold when groups of observations come from distinct distributions, such as multiple observations on each patient in a medical database. We extend conformal methods to a hierarchical setting. We develop CDF pooling, single subsampling, and repeated subsampling approaches to construct prediction sets in unsupervised and supervised settings. We compare these approaches in terms of coverage and average set size. If asymptotic coverage is acceptable, we recommend CDF pooling for its balance between empirical coverage and average set size. If we desire coverage guarantees, then we recommend the repeated subsampling approach. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2491-2502
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2060112
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2060112
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2491-2502
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2080682_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Nicholas C. Henderson
Author-X-Name-First: Nicholas C.
Author-X-Name-Last: Henderson
Author-Name: Ravi Varadhan
Author-X-Name-First: Ravi
Author-X-Name-Last: Varadhan
Author-Name: Thomas A. Louis
Author-X-Name-First: Thomas A.
Author-X-Name-Last: Louis
Title: Improved Small Domain Estimation via Compromise Regression Weights
Abstract:
Shrinkage estimates of small domain parameters typically use a combination of a noisy “direct” estimate that only uses data from a specific small domain and a more stable regression estimate. When the regression model is misspecified, estimation performance for the noisier domains can suffer due to substantial shrinkage toward a poorly estimated regression surface. In this article, we introduce a new class of robust, empirically-driven regression weights that target estimation of the small domain means under potential misspecification of the global regression model. Our regression weights are a convex combination of the model-based weights associated with the best linear unbiased predictor (BLUP) and those associated with the observed best predictor (OBP). The mixing parameter in this convex combination is found by minimizing a novel, unbiased estimate of the mean-squared prediction error for the small domain means, and we label the associated small domain estimates the “compromise best predictor,” or CBP. Using a data-adaptive mixture for the regression weights enables the CBP to preserve the robustness of the OBP while retaining the main advantages of the EBLUP whenever the regression model is correct. We demonstrate the use of the CBP in an application estimating gait speed in older adults. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2793-2809
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2080682
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2080682
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2793-2809
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2057316_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Erin E. Gabriel
Author-X-Name-First: Erin E.
Author-X-Name-Last: Gabriel
Author-Name: Michael C. Sachs
Author-X-Name-First: Michael C.
Author-X-Name-Last: Sachs
Author-Name: Arvid Sjölander
Author-X-Name-First: Arvid
Author-X-Name-Last: Sjölander
Title: Sharp Nonparametric Bounds for Decomposition Effects with Two Binary Mediators
Abstract:
In randomized trials, once the total effect of the intervention has been estimated, it is often of interest to explore mechanistic effects through mediators along the causal pathway between the randomized treatment and the outcome. In the setting with two sequential mediators, there are a variety of decompositions of the total risk difference into mediation effects. We derive sharp and valid bounds for a number of mediation effects in the setting of two sequential mediators both with unmeasured confounding with the outcome. We provide five such bounds in the main text corresponding to two different decompositions of the total effect, as well as the controlled direct effect, with an additional 30 novel bounds provided in the supplementary materials corresponding to the terms of 24 four-way decompositions. We also show that, although it may seem that one can produce sharp bounds by adding or subtracting the limits of the sharp bounds for terms in a decomposition, this almost always produces valid, but not sharp bounds that can even be completely noninformative. We investigate the properties of the bounds by simulating random probability distributions under our causal model and illustrate how they are interpreted in a real data example. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2446-2453
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2057316
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2057316
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2446-2453
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2075369_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Michael Guggisberg
Author-X-Name-First: Michael
Author-X-Name-Last: Guggisberg
Title: A Bayesian Approach to Multiple-Output Quantile Regression
Abstract:
This article presents a Bayesian approach to multiple-output quantile regression. The prior can be elicited as ex-ante knowledge of the distance of the τ-Tukey depth contour to the Tukey median, the first prior of its kind. The parametric model is proven to be consistent and a procedure to obtain confidence intervals is proposed. A proposal for nonparametric multiple-output regression is also presented. These results add to the literature of misspecified Bayesian modeling, consistency, and prior elicitation for nonparametric multivariate modeling. The model is applied to the Tennessee Project Steps to Achieving Resilience (STAR) experiment and finds a joint increase in τ-quantile subpopulations for mathematics and reading scores given a decrease in the number of students per teacher. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2736-2745
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2075369
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2075369
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2736-2745
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2060835_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Alessandro Zito
Author-X-Name-First: Alessandro
Author-X-Name-Last: Zito
Author-Name: Tommaso Rigon
Author-X-Name-First: Tommaso
Author-X-Name-Last: Rigon
Author-Name: Otso Ovaskainen
Author-X-Name-First: Otso
Author-X-Name-Last: Ovaskainen
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Bayesian Modeling of Sequential Discoveries
Abstract:
We aim at modeling the appearance of distinct tags in a sequence of labeled objects. Common examples of this type of data include words in a corpus or distinct species in a sample. These sequential discoveries are often summarized via accumulation curves, which count the number of distinct entities observed in an increasingly large set of objects. We propose a novel Bayesian method for species sampling modeling by directly specifying the probability of a new discovery, therefore, allowing for flexible specifications. The asymptotic behavior and finite sample properties of such an approach are extensively studied. Interestingly, our enlarged class of sequential processes includes highly tractable special cases. We present a subclass of models characterized by appealing theoretical and computational properties, including one that shares the same discovery probability with the Dirichlet process. Moreover, due to strong connections with logistic regression models, the latter subclass can naturally account for covariates. We finally test our proposal on both synthetic and real data, with special emphasis on a large fungal biodiversity study in Finland. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2521-2532
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2060835
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2060835
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2521-2532
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2057860_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Thomas H. Scheike
Author-X-Name-First: Thomas H.
Author-X-Name-Last: Scheike
Author-Name: Torben Martinussen
Author-X-Name-First: Torben
Author-X-Name-Last: Martinussen
Author-Name: Brice Ozenne
Author-X-Name-First: Brice
Author-X-Name-Last: Ozenne
Title: Efficient Estimation in the Fine and Gray Model
Abstract:
Direct regression for the cumulative incidence function (CIF) has become increasingly popular since the Fine and Gray model was suggested (Fine and Gray) due to its more direct interpretation on the probability risk scale. We here consider estimation within the Fine and Gray model using the theory of semiparametric efficient estimation. We show that the Fine and Gray estimator is semiparametrically efficient in the case without censoring. In the case of right-censored data, however, we show that the Fine and Gray estimator is no longer semiparametrically efficient and derive the semiparametrically efficient estimator. This estimation approach involves complicated integral equations, and we therefore also derive a simpler estimator as an augmented version of the Fine and Gray estimator with respect to the censoring nuisance space. While the augmentation term involves the CIF of the competing risk, it also leads to a robustness property: the proposed estimators remain consistent even if one of the models for the censoring mechanism or the CIF of the competing risk are misspecified. We illustrate this robustness property using simulation studies, comparing the Fine–Gray estimator and its augmented version. When the competing cause has a high cumulative incidence we see a substantial gain in efficiency from adding the augmentation term with a very reasonable computation time. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2482-2490
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2057860
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2057860
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2482-2490
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2079514_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Peiliang Bai
Author-X-Name-First: Peiliang
Author-X-Name-Last: Bai
Author-Name: Abolfazl Safikhani
Author-X-Name-First: Abolfazl
Author-X-Name-Last: Safikhani
Author-Name: George Michailidis
Author-X-Name-First: George
Author-X-Name-Last: Michailidis
Title: Multiple Change Point Detection in Reduced Rank High Dimensional Vector Autoregressive Models
Abstract:
We study the problem of detecting and locating change points in high-dimensional Vector Autoregressive (VAR) models, whose transition matrices exhibit low rank plus sparse structure. We first address the problem of detecting a single change point using an exhaustive search algorithm and establish a finite sample error bound for its accuracy. Next, we extend the results to the case of multiple change points that can grow as a function of the sample size. Their detection is based on a two-step algorithm, wherein, in the first step, an exhaustive search for a candidate change point is employed over overlapping windows, and subsequently a backward elimination procedure is used to screen out redundant candidates. The two-step strategy yields consistent estimates of the number and the locations of the change points. To reduce computation cost, we also investigate conditions under which a surrogate VAR model with a weakly sparse transition matrix can accurately estimate the change points and their locations for data generated by the original model. This work also addresses and resolves a number of novel technical challenges posed by the nature of the VAR models under consideration. The effectiveness of the proposed algorithms and methodology is illustrated on both synthetic data and two real datasets. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2776-2792
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2079514
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2079514
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2776-2792
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2262009_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Alan Agresti
Author-X-Name-First: Alan
Author-X-Name-Last: Agresti
Title: Confidence Intervals for Discrete Data in Clinical Research
Journal: Journal of the American Statistical Association
Pages: 2945-2945
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2023.2262009
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2262009
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2945-2945
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2078331_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Sze Ming Lee
Author-X-Name-First: Sze Ming
Author-X-Name-Last: Lee
Author-Name: Tony Sit
Author-X-Name-First: Tony
Author-X-Name-Last: Sit
Author-Name: Gongjun Xu
Author-X-Name-First: Gongjun
Author-X-Name-Last: Xu
Title: Efficient Estimation for Censored Quantile Regression
Abstract:
Censored quantile regression (CQR) has received growing attention in survival analysis because of its flexibility in modeling heterogeneous effects of covariates. Advances have been made in developing various inferential procedures under different assumptions and settings. Under the conditional independence assumption, many existing CQR methods can be characterized either by stochastic integral-based estimating equations (see, e.g., Peng and Huang) or by locally weighted approaches to adjust for the censored observations (see, for instance, Wang and Wang). While apparently dissimilar strategies, in terms of both formulation and technique, have been proposed for CQR, the inter-relationships amongst these methods are rarely discussed in the literature. In addition, given the complicated structure of the asymptotic variance, there has been limited investigation into improving the estimation efficiency for censored quantile regression models. This article addresses these open questions by proposing a unified framework under which many conventional approaches for CQR are covered as special cases. The new formulation also facilitates the construction of the most efficient estimator for the parameters of interest amongst a general class of estimating functions. Asymptotic properties including consistency and weak convergence of the proposed estimator are established via the martingale-based argument. Numerical studies are presented to illustrate the promising performance of the proposed estimator as compared to existing contenders under various settings. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2762-2775
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2078331
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2078331
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2762-2775
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2069572_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Jacob Dorn
Author-X-Name-First: Jacob
Author-X-Name-Last: Dorn
Author-Name: Kevin Guo
Author-X-Name-First: Kevin
Author-X-Name-Last: Guo
Title: Sharp Sensitivity Analysis for Inverse Propensity Weighting via Quantile Balancing
Abstract:
Inverse propensity weighting (IPW) is a popular method for estimating treatment effects from observational data. However, its correctness relies on the untestable (and frequently implausible) assumption that all confounders have been measured. This article introduces a robust sensitivity analysis for IPW that estimates the range of treatment effects compatible with a given amount of unobserved confounding. The estimated range converges to the narrowest possible interval (under the given assumptions) that must contain the true treatment effect. Our proposal is a refinement of the influential sensitivity analysis by Zhao, Small, and Bhattacharya, which we show gives bounds that are too wide even asymptotically. This analysis is based on new partial identification results for Tan’s marginal sensitivity model. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2645-2657
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2069572
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2069572
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2645-2657
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2071720_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Clément Cerovecki
Author-X-Name-First: Clément
Author-X-Name-Last: Cerovecki
Author-Name: Vaidotas Characiejus
Author-X-Name-First: Vaidotas
Author-X-Name-Last: Characiejus
Author-Name: Siegfried Hörmann
Author-X-Name-First: Siegfried
Author-X-Name-Last: Hörmann
Title: The Maximum of the Periodogram of a Sequence of Functional Data
Abstract:
We study the periodogram operator of a sequence of functional data. Using recent advances in Gaussian approximation theory, we derive the asymptotic distribution of the maximum norm over all fundamental frequencies. We consider the case where the noise variables are independent and then generalize our results to functional linear processes. Our theory can be used for detecting periodic signals in functional time series when the length of the period is unknown. We demonstrate the proposed methodology in a simulation study as well as on real data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2712-2720
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2071720
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2071720
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2712-2720
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2209349_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Nir Keret
Author-X-Name-First: Nir
Author-X-Name-Last: Keret
Author-Name: Malka Gorfine
Author-X-Name-First: Malka
Author-X-Name-Last: Gorfine
Title: Analyzing Big EHR Data—Optimal Cox Regression Subsampling Procedure with Rare Events
Abstract:
Massive survival datasets are becoming increasingly prevalent with the development of the healthcare industry, and pose computational challenges unprecedented in traditional survival analysis use cases. In this work we analyze the UK Biobank colorectal cancer data with genetic and environmental risk factors, including a time-dependent coefficient, which transforms the dataset into "pseudo-observation" form, thus critically inflating its size. A popular way of coping with massive datasets is to downsample them, so that the analysis fits within the researcher's computational resources. Cox regression has remained one of the most popular statistical models for the analysis of survival data to date. This work addresses the settings of right censored and possibly left-truncated data with rare events, such that the observed failure times constitute only a small portion of the overall sample. We propose Cox regression subsampling-based estimators that approximate their full-data partial-likelihood-based counterparts, by assigning optimal sampling probabilities to censored observations, and including all observed failures in the analysis. The suggested methodology is applied to the UK Biobank data for building a colorectal cancer risk-prediction model, while reducing the computation time and memory requirements. Asymptotic properties of the proposed estimators are established under suitable regularity conditions, and simulation studies are carried out to evaluate their finite sample performance. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2262-2275
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2023.2209349
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2209349
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2262-2275
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2252041_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: George G. Vega Yon
Author-X-Name-First: George G.
Author-X-Name-Last: Vega Yon
Title: Power and Multicollinearity in Small Networks: A Discussion of “Tale of Two Datasets: Representativeness and Generalisability of Inference for Samples of Networks” by Krivitsky, Coletti, and Hens
Journal: Journal of the American Statistical Association
Pages: 2228-2231
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2023.2252041
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2252041
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2228-2231
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2054817_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Jing Lei
Author-X-Name-First: Jing
Author-X-Name-Last: Lei
Author-Name: Kevin Z. Lin
Author-X-Name-First: Kevin Z.
Author-X-Name-Last: Lin
Title: Bias-Adjusted Spectral Clustering in Multi-Layer Stochastic Block Models
Abstract:
We consider the problem of estimating common community structures in multi-layer stochastic block models, where each single layer may not have sufficient signal strength to recover the full community structure. In order to efficiently aggregate signal across different layers, we argue that the sum-of-squared adjacency matrices contain sufficient signal even when individual layers are very sparse. Our method uses a bias-removal step that is necessary when the squared noise matrices may overwhelm the signal in the very sparse regime. The analysis of our method relies on several novel tail probability bounds for matrix linear combinations with matrix-valued coefficients and matrix-valued quadratic forms, which may be of independent interest. The performance of our method and the necessity of bias removal are demonstrated on synthetic data and in a microarray analysis of gene co-expression networks. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2433-2445
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2054817
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2054817
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2433-2445
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2257267_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Nynke M. D. Niezink
Author-X-Name-First: Nynke M. D.
Author-X-Name-Last: Niezink
Title: Discussion of “A Tale of Two Datasets: Representativeness and Generalisability of Inference for Samples of Networks”
Journal: Journal of the American Statistical Association
Pages: 2232-2234
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2023.2257267
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2257267
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2232-2234
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2086132_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: David T. Frazier
Author-X-Name-First: David T.
Author-X-Name-Last: Frazier
Author-Name: David J. Nott
Author-X-Name-First: David J.
Author-X-Name-Last: Nott
Author-Name: Christopher Drovandi
Author-X-Name-First: Christopher
Author-X-Name-Last: Drovandi
Author-Name: Robert Kohn
Author-X-Name-First: Robert
Author-X-Name-Last: Kohn
Title: Bayesian Inference Using Synthetic Likelihood: Asymptotics and Adjustments
Abstract:
Implementing Bayesian inference is often computationally challenging in complex models, especially when calculating the likelihood is difficult. Synthetic likelihood is one approach for carrying out inference when the likelihood is intractable but simulating from the model is straightforward. The method constructs an approximate likelihood by taking a vector summary statistic as being multivariate normal, with the unknown mean and covariance estimated by simulation. Previous research demonstrates that the Bayesian implementation of synthetic likelihood can be more computationally efficient than approximate Bayesian computation, a popular likelihood-free method, in the presence of a high-dimensional summary statistic. Our article makes three contributions. The first shows that if the summary statistics are well-behaved, then the synthetic likelihood posterior is asymptotically normal and yields credible sets with the correct level of coverage. The second contribution compares the computational efficiency of Bayesian synthetic likelihood and approximate Bayesian computation. We show that Bayesian synthetic likelihood is computationally more efficient than approximate Bayesian computation. Based on the asymptotic results, the third contribution proposes using adjusted inference methods when a possibly misspecified form is assumed for the covariance matrix of the synthetic likelihood, such as diagonal or a factor model, to speed up computation. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2821-2832
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2086132
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2086132
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2821-2832
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2060113_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Chenguang Dai
Author-X-Name-First: Chenguang
Author-X-Name-Last: Dai
Author-Name: Buyu Lin
Author-X-Name-First: Buyu
Author-X-Name-Last: Lin
Author-Name: Xin Xing
Author-X-Name-First: Xin
Author-X-Name-Last: Xing
Author-Name: Jun S. Liu
Author-X-Name-First: Jun S.
Author-X-Name-Last: Liu
Title: False Discovery Rate Control via Data Splitting
Abstract:
Selecting relevant features associated with a given response variable is an important problem in many scientific fields. Quantifying quality and uncertainty of a selection result via false discovery rate (FDR) control has been of recent interest. This article introduces a data-splitting method (referred to as “DS”) to asymptotically control the FDR while maintaining a high power. For each feature, DS constructs a test statistic by estimating two independent regression coefficients via data splitting. FDR control is achieved by taking advantage of the statistic’s property that, for any null feature, its sampling distribution is symmetric about zero; whereas for a relevant feature, its sampling distribution has a positive mean. Furthermore, a Multiple Data Splitting (MDS) method is proposed to stabilize the selection result and boost the power. Surprisingly, with the FDR under control, MDS not only helps overcome the power loss caused by data splitting, but also results in a lower variance of the false discovery proportion (FDP) compared with all other methods in consideration. Extensive simulation studies and a real-data application show that the proposed methods are robust to the unknown distribution of features, easy to implement and computationally efficient, and are often the most powerful ones among competitors especially when the signals are weak and correlations or partial correlations among features are high. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2503-2520
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2060113
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2060113
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2503-2520
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2050244_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Marco Avarucci
Author-X-Name-First: Marco
Author-X-Name-Last: Avarucci
Author-Name: Paolo Zaffaroni
Author-X-Name-First: Paolo
Author-X-Name-Last: Zaffaroni
Title: Robust Estimation of Large Panels with Factor Structures
Abstract:
This article studies estimation of linear panel regression models with heterogeneous coefficients using a class of weighted least squares estimators, when both the regressors and the error possibly contain a common latent factor structure. Our theory is robust to the specification of such a factor structure because it does not require any information on the number of factors or estimation of the factor structure itself. Moreover, our theory is efficient, in certain circumstances, because it nests the GLS principle. We first show how our unfeasible weighted estimator provides a bias-adjusted estimator with the conventional limiting distribution, for situations in which the OLS is affected by a first-order bias. The technical challenge resolved in the article consists of showing how these properties are preserved for the feasible weighted estimator in a double-asymptotics setting. Our theory is illustrated by extensive Monte Carlo experiments and an empirical application that investigates the link between capital accumulation and economic growth in an international setting. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2394-2405
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2050244
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2050244
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2394-2405
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2199814_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Yao Zhang
Author-X-Name-First: Yao
Author-X-Name-Last: Zhang
Author-Name: Qingyuan Zhao
Author-X-Name-First: Qingyuan
Author-X-Name-Last: Zhao
Title: What is a Randomization Test?
Abstract:
The meaning of randomization tests has become obscure in statistics education and practice over the last century. This article makes a fresh attempt at rectifying this core concept of statistics. A new term—“quasi-randomization test”—is introduced to define significance tests based on theoretical models and distinguish these tests from the “randomization tests” based on the physical act of randomization. The practical importance of this distinction is illustrated through a real stepped-wedge cluster-randomized trial. Building on the recent literature on randomization inference, a general framework of conditional randomization tests is developed and some practical methods to construct conditioning events are given. The proposed terminology and framework are then applied to understand several widely used (quasi-)randomization tests, including Fisher’s exact test, permutation tests for treatment effect, quasi-randomization tests for independence and conditional independence, adaptive randomization, and conformal prediction. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2928-2942
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2023.2199814
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2199814
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2928-2942
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2044826_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Peter Kramlinger
Author-X-Name-First: Peter
Author-X-Name-Last: Kramlinger
Author-Name: Tatyana Krivobokova
Author-X-Name-First: Tatyana
Author-X-Name-Last: Krivobokova
Author-Name: Stefan Sperlich
Author-X-Name-First: Stefan
Author-X-Name-Last: Sperlich
Title: Marginal and Conditional Multiple Inference for Linear Mixed Model Predictors
Abstract:
In spite of its high practical relevance, cluster specific multiple inference for linear mixed model predictors has hardly been addressed so far. While marginal inference for population parameters is well understood, conditional inference for the cluster specific predictors is more intricate. This work introduces a general framework for multiple inference in linear mixed models for cluster specific predictors. Consistent confidence sets for multiple inference are constructed under both the marginal and the conditional law. Furthermore, it is shown that, remarkably, corresponding multiple marginal confidence sets are also asymptotically valid for conditional inference. These lend themselves to testing linear hypotheses using standard quantiles without the need for resampling techniques. All findings are validated in simulations and illustrated with a study on Covid-19 mortality in U.S. state prisons. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2344-2355
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2044826
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2044826
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2344-2355
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2049278_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Ting Ye
Author-X-Name-First: Ting
Author-X-Name-Last: Ye
Author-Name: Jun Shao
Author-X-Name-First: Jun
Author-X-Name-Last: Shao
Author-Name: Yanyao Yi
Author-X-Name-First: Yanyao
Author-X-Name-Last: Yi
Author-Name: Qingyuan Zhao
Author-X-Name-First: Qingyuan
Author-X-Name-Last: Zhao
Title: Toward Better Practice of Covariate Adjustment in Analyzing Randomized Clinical Trials
Abstract:
In randomized clinical trials, adjustments for baseline covariates at both design and analysis stages are highly encouraged by regulatory agencies. A recent trend is to use a model-assisted approach for covariate adjustment to gain credibility and efficiency while producing asymptotically valid inference even when the model is incorrect. In this article we present three considerations for better practice when model-assisted inference is applied to adjust for covariates under simple or covariate-adaptive randomized trials: (a) guaranteed efficiency gain: a model-assisted method should often gain but never hurt efficiency; (b) wide applicability: a valid procedure should be applicable, and preferably universally applicable, to all commonly used randomization schemes; (c) robust standard error: variance estimation should be robust to model misspecification and heteroscedasticity. To achieve these, we recommend a model-assisted estimator under an analysis of heterogeneous covariance working model that includes all covariates used in randomization. Our conclusions are based on an asymptotic theory that provides a clear picture of how covariate-adaptive randomization and regression adjustment alter statistical efficiency. Our theory is more general than the existing ones in terms of studying arbitrary functions of response means (including linear contrasts, ratios, and odds ratios), multiple arms, guaranteed efficiency gain, optimality, and universal applicability. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2370-2382
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2049278
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2049278
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2370-2382
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2068419_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Yichi Zhang
Author-X-Name-First: Yichi
Author-X-Name-Last: Zhang
Author-Name: Weining Shen
Author-X-Name-First: Weining
Author-X-Name-Last: Shen
Author-Name: Dehan Kong
Author-X-Name-First: Dehan
Author-X-Name-Last: Kong
Title: Covariance Estimation for Matrix-valued Data
Abstract:
Covariance estimation for matrix-valued data has received increasing interest in applications. Unlike previous works that rely heavily on the matrix normal distribution assumption and the requirement of fixed matrix size, we propose a class of distribution-free regularized covariance estimation methods for high-dimensional matrix data under a separability condition and a bandable covariance structure. Under these conditions, the original covariance matrix is decomposed into a Kronecker product of two bandable small covariance matrices representing the variability over row and column directions. We formulate a unified framework for estimating bandable covariance, and introduce an efficient algorithm based on rank one unconstrained Kronecker product approximation. The convergence rates of the proposed estimators are established, and the derived minimax lower bound shows our proposed estimator is rate-optimal under certain divergence regimes of matrix size. We further introduce a class of robust covariance estimators and provide theoretical guarantees to deal with heavy-tailed data. We demonstrate the superior finite-sample performance of our methods using simulations and real applications from a gridded temperature anomalies dataset and an S&P 500 stock data analysis. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2620-2631
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2068419
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2068419
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2620-2631
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2210336_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Ziyi Li
Author-X-Name-First: Ziyi
Author-X-Name-Last: Li
Author-Name: Yu Shen
Author-X-Name-First: Yu
Author-X-Name-Last: Shen
Author-Name: Jing Ning
Author-X-Name-First: Jing
Author-X-Name-Last: Ning
Title: Accommodating Time-Varying Heterogeneity in Risk Estimation under the Cox Model: A Transfer Learning Approach
Abstract:
Transfer learning has attracted increasing attention in recent years for adaptively borrowing information across different data cohorts in various settings. Cancer registries have been widely used in clinical research because of their easy accessibility and large sample size. Our method is motivated by the question of how to use cancer registry data as a complement to improve the estimation precision of individual risks of death for inflammatory breast cancer (IBC) patients at The University of Texas MD Anderson Cancer Center. When transferring information for risk estimation based on the cancer registries (i.e., source cohort) to a single cancer center (i.e., target cohort), time-varying population heterogeneity needs to be appropriately acknowledged. However, there is no literature on how to adaptively transfer knowledge on risk estimation with time-to-event data from the source cohort to the target cohort while adjusting for time-varying differences in event risks between the two sources. Our goal is to address this statistical challenge by developing a transfer learning approach under the Cox proportional hazards model. To allow data-adaptive levels of information borrowing, we impose Lasso penalties on the discrepancies in regression coefficients and baseline hazard functions between the two cohorts, which are jointly solved in the proposed transfer learning algorithm. As shown in the extensive simulation studies, the proposed method yields more precise individualized risk estimation than using the target cohort alone. Meanwhile, our method demonstrates satisfactory robustness against cohort differences compared with the method that directly combines the target and source data in the Cox model. We develop a more accurate risk estimation model for the MD Anderson IBC cohort given various treatment and baseline covariates, while adaptively borrowing information from the National Cancer Database to improve risk assessment. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2276-2287
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2023.2210336
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2210336
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2276-2287
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2063131_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Jian-Feng Cai
Author-X-Name-First: Jian-Feng
Author-X-Name-Last: Cai
Author-Name: Jingyang Li
Author-X-Name-First: Jingyang
Author-X-Name-Last: Li
Author-Name: Dong Xia
Author-X-Name-First: Dong
Author-X-Name-Last: Xia
Title: Generalized Low-Rank Plus Sparse Tensor Estimation by Fast Riemannian Optimization
Abstract:
We investigate a generalized framework to estimate a latent low-rank plus sparse tensor, where the low-rank tensor often captures the multi-way principal components and the sparse tensor accounts for potential model mis-specifications or heterogeneous signals that are unexplainable by the low-rank part. The framework flexibly covers both linear and generalized linear models, and can easily handle continuous or categorical variables. We propose a fast algorithm by integrating the Riemannian gradient descent and a novel gradient pruning procedure. Under suitable conditions, the algorithm converges linearly and can simultaneously estimate both the low-rank and sparse tensors. The statistical error bounds of final estimates are established in terms of the gradient of loss function. The error bounds are generally sharp under specific statistical models, for example, the sub-Gaussian robust PCA and Bernoulli tensor model. Moreover, our method achieves nontrivial error bounds for heavy-tailed tensor PCA whenever the noise has a finite (2+ε)th moment. We apply our method to analyze the international trade flow dataset and the statistician hypergraph coauthorship network, both yielding new and interesting findings. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2588-2604
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2063131
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2063131
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2588-2604
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2280383_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Pavel N. Krivitsky
Author-X-Name-First: Pavel N.
Author-X-Name-Last: Krivitsky
Author-Name: Pietro Coletti
Author-X-Name-First: Pietro
Author-X-Name-Last: Coletti
Author-Name: Niel Hens
Author-X-Name-First: Niel
Author-X-Name-Last: Hens
Title: Rejoinder to Discussion of “A Tale of Two Datasets: Representativeness and Generalisability of Inference for Samples of Networks”
Journal: Journal of the American Statistical Association
Pages: 2235-2238
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2023.2280383
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2280383
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2235-2238
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2231581_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Noirrit Kiran Chandra
Author-X-Name-First: Noirrit Kiran
Author-X-Name-Last: Chandra
Author-Name: Abhra Sarkar
Author-X-Name-First: Abhra
Author-X-Name-Last: Sarkar
Author-Name: John F. de Groot
Author-X-Name-First: John F.
Author-X-Name-Last: de Groot
Author-Name: Ying Yuan
Author-X-Name-First: Ying
Author-X-Name-Last: Yuan
Author-Name: Peter Müller
Author-X-Name-First: Peter
Author-X-Name-Last: Müller
Title: Bayesian Nonparametric Common Atoms Regression for Generating Synthetic Controls in Clinical Trials
Abstract:
The availability of electronic health records (EHR) has opened opportunities to supplement increasingly expensive and difficult-to-carry-out randomized controlled trials (RCTs) with evidence from readily available real-world data. In this article, we use EHR data to construct synthetic control arms for treatment-only single arm trials. We propose a novel nonparametric Bayesian common atoms mixture model that allows us to find equivalent population strata in the EHR and the treatment arm and then resample the EHR data to create equivalent patient populations under both the single arm trial and the resampled EHR. Resampling is implemented via a density-free importance sampling scheme. Using the synthetic control arm, inference for the treatment effect can then be carried out using any method available for RCTs. Alternatively, the proposed nonparametric Bayesian model allows straightforward model-based inference. In simulation experiments, the proposed method exhibits higher power than alternative methods in detecting treatment effects, specifically for nonlinear response functions. We apply the method to supplement single arm treatment-only glioblastoma studies with a synthetic control arm based on historical trials. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2301-2314
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2023.2231581
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2231581
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2301-2314
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2068420_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Yu Zhou
Author-X-Name-First: Yu
Author-X-Name-Last: Zhou
Author-Name: Lan Wang
Author-X-Name-First: Lan
Author-X-Name-Last: Wang
Author-Name: Rui Song
Author-X-Name-First: Rui
Author-X-Name-Last: Song
Author-Name: Tuoyi Zhao
Author-X-Name-First: Tuoyi
Author-X-Name-Last: Zhao
Title: Transformation-Invariant Learning of Optimal Individualized Decision Rules with Time-to-Event Outcomes
Abstract:
In many important applications of precision medicine, the outcome of interest is time to an event (e.g., death, relapse of disease) and the primary goal is to identify the optimal individualized decision rule (IDR) to prolong survival time. Existing work in this area has mostly focused on estimating the optimal IDR to maximize the restricted mean survival time in the population. We propose a new robust framework for estimating an optimal static or dynamic IDR with time-to-event outcomes based on an easy-to-interpret quantile criterion. The new method does not need to specify an outcome regression model and is robust to heavy-tailed distributions. The estimation problem corresponds to a nonregular M-estimation problem with both finite and infinite-dimensional nuisance parameters. Employing advanced empirical process techniques, we establish the statistical theory of the estimated parameter indexing the optimal IDR. Furthermore, we prove a novel result that the proposed approach can consistently estimate the optimal value function under mild conditions even when the optimal IDR is nonunique, which happens in the challenging setting of exceptional laws. We also propose a smoothed resampling procedure for inference. The proposed methods are implemented in the R-package QTOCen. We demonstrate the performance of the proposed new methods via extensive Monte Carlo studies and a real data application. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2632-2644
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2068420
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2068420
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2632-2644
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2071276_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Mats J. Stensrud
Author-X-Name-First: Mats J.
Author-X-Name-Last: Stensrud
Author-Name: James M. Robins
Author-X-Name-First: James M.
Author-X-Name-Last: Robins
Author-Name: Aaron Sarvet
Author-X-Name-First: Aaron
Author-X-Name-Last: Sarvet
Author-Name: Eric J. Tchetgen Tchetgen
Author-X-Name-First: Eric J.
Author-X-Name-Last: Tchetgen Tchetgen
Author-Name: Jessica G. Young
Author-X-Name-First: Jessica G.
Author-X-Name-Last: Young
Title: Conditional Separable Effects
Abstract:
Researchers are often interested in treatment effects on outcomes that are only defined conditional on posttreatment events. For example, in a study of the effect of different cancer treatments on quality of life at end of follow-up, the quality of life of individuals who die during the study is undefined. In these settings, naive contrasts of outcomes conditional on posttreatment events are not average causal effects, even in randomized experiments. Therefore, the effect in the principal stratum of those who would have the same value of the posttreatment variable regardless of treatment (such as the survivor average causal effect) is often advocated for causal inference. While principal stratum effects are average causal effects, they refer to a subset of the population that cannot be observed and may not exist. Therefore, it is not clear how these effects inform decisions or policies. Here we propose the conditional separable effects, quantifying causal effects of modified versions of the study treatment in an observable subset of the population. These effects, which may quantify direct effects of the study treatment, require transparent reasoning about candidate modified treatments and their mechanisms. We provide identifying conditions and various estimators of these effects along with an applied example. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2671-2683
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2071276
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2071276
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2671-2683
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2071721_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Hang Deng
Author-X-Name-First: Hang
Author-X-Name-Last: Deng
Author-Name: Qiyang Han
Author-X-Name-First: Qiyang
Author-X-Name-Last: Han
Author-Name: Bodhisattva Sen
Author-X-Name-First: Bodhisattva
Author-X-Name-Last: Sen
Title: Inference for Local Parameters in Convexity Constrained Models
Abstract:
In this article, we develop automated inference methods for “local” parameters in a collection of convexity constrained models based on the natural constrained tuning-free estimators. A canonical example is given by the univariate convex regression model, in which automated inference is drawn for the function value, the function derivative at a fixed interior point, and the anti-mode of the convex regression function, based on the widely used tuning-free, piecewise linear convex least squares estimator (LSE). The key to our inference proposal in this model is a pivotal joint limit distribution theory for the LS estimates of the local parameters, normalized appropriately by the length of a certain data-driven linear piece of the convex LSE. Such a pivotal limiting distribution instantly gives rise to confidence intervals for these local parameters, whose construction requires almost no more effort than computing the convex LSE itself. This inference method in the convex regression model is a special case of a general inference machinery that covers a number of convexity constrained models in which a limit distribution theory is available for model-specific estimators. Concrete models include: (i) log-concave density estimation, (ii) s-concave density estimation, (iii) convex nonincreasing density estimation, (iv) concave bathtub-shaped hazard function estimation, and (v) concave distribution function estimation from corrupted data. The proposed confidence intervals for all these models are proved to have asymptotically exact coverage and oracle length, and require no further information than the estimator itself. We provide extensive simulation evidence that validates our theoretical results. Real data applications and comparisons with competing methods are given to illustrate the usefulness of our inference proposals. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2721-2735
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2071721
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2071721
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2721-2735
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2089573_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Wenbo Wang
Author-X-Name-First: Wenbo
Author-X-Name-Last: Wang
Author-Name: Xingye Qiao
Author-X-Name-First: Xingye
Author-X-Name-Last: Qiao
Title: Set-Valued Support Vector Machine with Bounded Error Rates
Abstract:
This article concerns cautious classification models that are allowed to predict a set of class labels or reject to make a prediction when the uncertainty in the prediction is high. This set-valued classification approach is equivalent to the task of acceptance region learning, which aims to identify subsets of the input space, each of which is guaranteed to cover observations in a class with at least a predetermined probability. We propose to directly learn the acceptance regions through risk minimization, by making use of a truncated hinge loss and a constrained optimization framework. Collectively our theoretical analyses show that these acceptance regions, with high probability, satisfy simultaneously two properties: (a) they cover each class with a noncoverage rate bounded from above; (b) they give the least ambiguous predictions among all the acceptance regions satisfying (a). An efficient algorithm is developed and numerical studies are conducted using both simulated and real data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2847-2859
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2089573
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2089573
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2847-2859
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2057859_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Akihiko Nishimura
Author-X-Name-First: Akihiko
Author-X-Name-Last: Nishimura
Author-Name: Marc A. Suchard
Author-X-Name-First: Marc A.
Author-X-Name-Last: Suchard
Title: Prior-Preconditioned Conjugate Gradient Method for Accelerated Gibbs Sampling in “Large n, Large p” Bayesian Sparse Regression
Abstract:
In a modern observational study based on healthcare databases, the number of observations and of predictors typically range in the order of 10^5–10^6 and of 10^4–10^5. Despite the large sample size, data rarely provide sufficient information to reliably estimate such a large number of parameters. Sparse regression techniques provide potential solutions, one notable approach being the Bayesian method based on shrinkage priors. In the “large n and large p” setting, however, the required posterior computation encounters a bottleneck at repeated sampling from a high-dimensional Gaussian distribution, whose precision matrix Φ is expensive to compute and factorize. In this article, we present a novel algorithm to speed up this bottleneck based on the following observation: We can cheaply generate a random vector b such that the solution to the linear system Φβ=b has the desired Gaussian distribution. We can then solve the linear system by the conjugate gradient (CG) algorithm through matrix-vector multiplications by Φ; this involves no explicit factorization or calculation of Φ itself. Rapid convergence of CG in this context is guaranteed by the theory of prior-preconditioning we develop. We apply our algorithm to a clinically relevant large-scale observational study with n=72,489 patients and p=22,175 clinical covariates, designed to assess the relative risk of adverse events from two alternative blood anti-coagulants. Our algorithm demonstrates an order of magnitude speed-up in posterior inference, in our case cutting the computation time from two weeks to less than a day. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2468-2481
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2057859
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2057859
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2468-2481
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2096620_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Pratik Ramprasad
Author-X-Name-First: Pratik
Author-X-Name-Last: Ramprasad
Author-Name: Yuantong Li
Author-X-Name-First: Yuantong
Author-X-Name-Last: Li
Author-Name: Zhuoran Yang
Author-X-Name-First: Zhuoran
Author-X-Name-Last: Yang
Author-Name: Zhaoran Wang
Author-X-Name-First: Zhaoran
Author-X-Name-Last: Wang
Author-Name: Will Wei Sun
Author-X-Name-First: Will Wei
Author-X-Name-Last: Sun
Author-Name: Guang Cheng
Author-X-Name-First: Guang
Author-X-Name-Last: Cheng
Title: Online Bootstrap Inference For Policy Evaluation In Reinforcement Learning
Abstract:
The recent emergence of reinforcement learning (RL) has created a demand for robust statistical inference methods for the parameter estimates computed using these algorithms. Existing methods for inference in online learning are restricted to settings involving independently sampled observations, while inference methods in RL have so far been limited to the batch setting. The bootstrap is a flexible and efficient approach for statistical inference in online learning algorithms, but its efficacy in settings involving Markov noise, such as RL, has yet to be explored. In this article, we study the use of the online bootstrap method for inference in RL policy evaluation. In particular, we focus on the temporal difference (TD) learning and Gradient TD (GTD) learning algorithms, which are themselves special instances of linear stochastic approximation under Markov noise. The method is shown to be distributionally consistent for statistical inference in policy evaluation, and numerical experiments are included to demonstrate the effectiveness of this algorithm across a range of real RL environments. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2901-2914
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2096620
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2096620
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2901-2914
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2066537_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Hilda S. Ibriga
Author-X-Name-First: Hilda S.
Author-X-Name-Last: Ibriga
Author-Name: Will Wei Sun
Author-X-Name-First: Will Wei
Author-X-Name-Last: Sun
Title: Covariate-Assisted Sparse Tensor Completion
Abstract:
We aim to provably complete a sparse and highly missing tensor in the presence of covariate information along tensor modes. Our motivation comes from online advertising where users’ click-through-rates (CTR) on ads over various devices form a CTR tensor that has about 96% missing entries and has many zeros on nonmissing entries, which makes the standalone tensor completion method unsatisfactory. Besides the CTR tensor, additional ad features or user characteristics are often available. In this article, we propose Covariate-assisted Sparse Tensor Completion (COSTCO) to incorporate covariate information for the recovery of the sparse tensor. The key idea is to jointly extract latent components from both the tensor and the covariate matrix to learn a synthetic representation. Theoretically, we derive the error bound for the recovered tensor components and explicitly quantify the improvements on both the reveal probability condition and the tensor recovery accuracy due to covariates. Finally, we apply COSTCO to an advertisement dataset consisting of a CTR tensor and ad covariate matrix, leading to 23% accuracy improvement over the baseline. An important by-product is that ad latent components from COSTCO reveal interesting ad clusters, which are useful for better ad targeting. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2605-2619
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2066537
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2066537
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2605-2619
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2242627_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Pavel N. Krivitsky
Author-X-Name-First: Pavel N.
Author-X-Name-Last: Krivitsky
Author-Name: Pietro Coletti
Author-X-Name-First: Pietro
Author-X-Name-Last: Coletti
Author-Name: Niel Hens
Author-X-Name-First: Niel
Author-X-Name-Last: Hens
Title: A Tale of Two Datasets: Representativeness and Generalisability of Inference for Samples of Networks
Abstract:
The last two decades have seen considerable progress in foundational aspects of statistical network analysis, but the path from theory to application is not straightforward. Two large, heterogeneous samples of small networks of within-household contacts in Belgium were collected using two different but complementary sampling designs: one smaller but with all contacts in each household observed, the other larger and more representative but recording contacts of only one person per household. We wish to combine their strengths to learn the social forces that shape household contact formation and facilitate simulation for prediction of disease spread, while generalising to the population of households in the region. To accomplish this, we describe a flexible framework for specifying multi-network models in the exponential family class and identify the requirements for inference and prediction under this framework to be consistent, identifiable, and generalisable, even when data are incomplete; explore how these requirements may be violated in practice; and develop a suite of quantitative and graphical diagnostics for detecting violations and suggesting improvements to candidate models. We report on the effects of network size, geography, and household roles on household contact patterns (activity, heterogeneity in activity, and triadic closure). Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2213-2224
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2023.2242627
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2242627
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2213-2224
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2078330_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Mary Lai O. Salvaña
Author-X-Name-First: Mary Lai O.
Author-X-Name-Last: Salvaña
Author-Name: Amanda Lenzi
Author-X-Name-First: Amanda
Author-X-Name-Last: Lenzi
Author-Name: Marc G. Genton
Author-X-Name-First: Marc G.
Author-X-Name-Last: Genton
Title: Spatio-Temporal Cross-Covariance Functions under the Lagrangian Framework with Multiple Advections
Abstract:
When analyzing the spatio-temporal dependence in most environmental and earth sciences variables such as pollutant concentrations at different levels of the atmosphere, a special property is observed: the covariances and cross-covariances are stronger in certain directions. This property is attributed to the presence of natural forces, such as wind, which cause the transport and dispersion of these variables. These spatio-temporal dynamics prompted the use of the Lagrangian reference frame alongside any Gaussian spatio-temporal geostatistical model. Under this modeling framework, a whole new class emerged, known as the class of spatio-temporal covariance functions under the Lagrangian framework, with several developments already established in the univariate setting, in both stationary and nonstationary formulations, but less so in the multivariate case. Despite the many advances in this modeling approach, efforts have yet to be directed to probing the case for the use of multiple advections, especially when several variables are involved. Accounting for multiple advections would make the Lagrangian framework a more viable approach in modeling realistic multivariate transport scenarios. In this work, we establish a class of Lagrangian spatio-temporal cross-covariance functions with multiple advections, study its properties, and demonstrate its use on a bivariate pollutant dataset of particulate matter in Saudi Arabia. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2746-2761
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2078330
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2078330
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2746-2761
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2063130_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Zhaoxue Tong
Author-X-Name-First: Zhaoxue
Author-X-Name-Last: Tong
Author-Name: Zhanrui Cai
Author-X-Name-First: Zhanrui
Author-X-Name-Last: Cai
Author-Name: Songshan Yang
Author-X-Name-First: Songshan
Author-X-Name-Last: Yang
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Title: Model-Free Conditional Feature Screening with FDR Control
Abstract:
In this article, we propose a model-free conditional feature screening method with false discovery rate (FDR) control for ultra-high dimensional data. The proposed method is built upon a new measure of conditional independence. Thus, the new method does not require a specific functional form of the regression function and is robust to heavy-tailed responses and predictors. The variables to be conditional on are allowed to be multivariate. The proposed method enjoys sure screening and ranking consistency properties under mild regularity conditions. To control the FDR, we apply the Reflection via Data Splitting method and prove its theoretical guarantee using martingale theory and empirical process techniques. Simulated examples and real data analysis show that the proposed method performs very well compared with existing works. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2575-2587
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2063130
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2063130
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2575-2587
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2099402_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Augustin Chevallier
Author-X-Name-First: Augustin
Author-X-Name-Last: Chevallier
Author-Name: Paul Fearnhead
Author-X-Name-First: Paul
Author-X-Name-Last: Fearnhead
Author-Name: Matthew Sutton
Author-X-Name-First: Matthew
Author-X-Name-Last: Sutton
Title: Reversible Jump PDMP Samplers for Variable Selection
Abstract:
A new class of Markov chain Monte Carlo (MCMC) algorithms, based on simulating piecewise deterministic Markov processes (PDMPs), has recently shown great promise: they are nonreversible, can mix better than standard MCMC algorithms, and can use subsampling ideas to speed up computation in big data scenarios. However, current PDMP samplers can only sample from posterior densities that are differentiable almost everywhere, which precludes their use for model choice. Motivated by variable selection problems, we show how to develop reversible jump PDMP samplers that can jointly explore the discrete space of models and the continuous space of parameters. Our framework is general: it takes any existing PDMP sampler, and adds two types of trans-dimensional moves that allow for the addition or removal of a variable from the model. We show how the rates of these trans-dimensional moves can be calculated so that the sampler has the correct invariant distribution. We remove a variable from a model when the associated parameter is zero, and this means that the rates of the trans-dimensional moves do not depend on the likelihood. It is, thus, easy to implement a reversible jump version of any PDMP sampler that can explore a fixed model. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2915-2927
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2099402
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2099402
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2915-2927
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2081575_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Snigdha Panigrahi
Author-X-Name-First: Snigdha
Author-X-Name-Last: Panigrahi
Author-Name: Jonathan Taylor
Author-X-Name-First: Jonathan
Author-X-Name-Last: Taylor
Title: Approximate Selective Inference via Maximum Likelihood
Abstract:
Several strategies have been developed recently to ensure valid inference after model selection; some of these are easy to compute, while others fare better in terms of inferential power. In this article, we consider a selective inference framework for Gaussian data. We propose a new method for inference through approximate maximum likelihood estimation. Our goal is to: (a) achieve better inferential power with the aid of randomization, (b) bypass expensive MCMC sampling from exact conditional distributions that are hard to evaluate in closed forms. We construct approximate inferential quantities, for example, p-values and confidence intervals, by solving a fairly simple, convex optimization problem. We illustrate the potential of our method across wide-ranging values of signal-to-noise ratio in simulations. On a cancer gene expression dataset we find that our method improves upon the inferential power of some commonly used strategies for selective inference. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2810-2820
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2081575
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2081575
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2810-2820
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2225742_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Tianchen Xu
Author-X-Name-First: Tianchen
Author-X-Name-Last: Xu
Author-Name: Yuan Chen
Author-X-Name-First: Yuan
Author-X-Name-Last: Chen
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: Yuanjia Wang
Author-X-Name-First: Yuanjia
Author-X-Name-Last: Wang
Title: Mixed-Response State-Space Model for Analyzing Multi-Dimensional Digital Phenotypes
Abstract:
Digital technologies (e.g., mobile phones) can be used to obtain objective, frequent, and real-world digital phenotypes from individuals. However, modeling these data poses substantial challenges since observational data are subject to confounding and various sources of variabilities. For example, signals on patients’ underlying health status and treatment effects are mixed with variation due to the living environment and measurement noises. The digital phenotype data thus show extensive between- and within-patient variability, as well as variability across different health domains (e.g., motor, cognitive, and speaking). Motivated by a mobile health study of Parkinson’s disease (PD), we develop a mixed-response state-space (MRSS) model to jointly capture multi-dimensional, multi-modal digital phenotypes and their measurement processes by a finite number of latent state time series. These latent states reflect the dynamic health status and personalized time-varying treatment effects and can be used to adjust for informative measurements. For computation, we use the Kalman filter for Gaussian phenotypes and importance sampling with Laplace approximation for non-Gaussian phenotypes. We conduct comprehensive simulation studies and demonstrate the advantage of MRSS in modeling a mobile health study that remotely collects real-time digital phenotypes from PD patients. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2288-2300
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2023.2225742
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2225742
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2288-2300
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2071278_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Ye Tian
Author-X-Name-First: Ye
Author-X-Name-Last: Tian
Author-Name: Yang Feng
Author-X-Name-First: Yang
Author-X-Name-Last: Feng
Title: Transfer Learning Under High-Dimensional Generalized Linear Models
Abstract:
In this work, we study the transfer learning problem under high-dimensional generalized linear models (GLMs), which aims to improve the fit on target data by borrowing information from useful source data. Given which sources to transfer, we propose a transfer learning algorithm on GLM, and derive its ℓ1/ℓ2-estimation error bounds as well as a bound for a prediction error measure. The theoretical analysis shows that when the target and sources are sufficiently close to each other, these bounds could be improved over those of the classical penalized estimator using only target data under mild conditions. When we do not know which sources to transfer, an algorithm-free transferable source detection approach is introduced to detect informative sources. The detection consistency is proved under the high-dimensional GLM transfer learning setting. We also propose an algorithm to construct confidence intervals of each coefficient component, and the corresponding theories are provided. Extensive simulations and a real-data experiment verify the effectiveness of our algorithms. We implement the proposed GLM transfer learning algorithms in a new R package glmtrans, which is available on CRAN. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2684-2697
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2071278
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2071278
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2684-2697
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2053137_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Ruijian Han
Author-X-Name-First: Ruijian
Author-X-Name-Last: Han
Author-Name: Yiming Xu
Author-X-Name-First: Yiming
Author-X-Name-Last: Xu
Author-Name: Kani Chen
Author-X-Name-First: Kani
Author-X-Name-Last: Chen
Title: A General Pairwise Comparison Model for Extremely Sparse Networks
Abstract:
Statistical estimation using pairwise comparison data is an effective approach to analyzing large-scale sparse networks. In this article, we propose a general framework to model the mutual interactions in a network, which enjoys ample flexibility in terms of model parameterization. Under this setup, we show that the maximum likelihood estimator for the latent score vector of the subjects is uniformly consistent under a near-minimal condition on network sparsity. This condition is sharp in terms of the leading order asymptotics describing the sparsity. Our analysis uses a novel chaining technique and illustrates an important connection between graph topology and model consistency. Our results guarantee that the maximum likelihood estimator is justified for estimation in large-scale pairwise comparison networks where data are asymptotically deficient. Simulation studies are provided in support of our theoretical findings. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2422-2432
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2053137
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2053137
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2422-2432
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2050243_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Kean Ming Tan
Author-X-Name-First: Kean Ming
Author-X-Name-Last: Tan
Author-Name: Qiang Sun
Author-X-Name-First: Qiang
Author-X-Name-Last: Sun
Author-Name: Daniela Witten
Author-X-Name-First: Daniela
Author-X-Name-Last: Witten
Title: Sparse Reduced Rank Huber Regression in High Dimensions
Abstract:
We propose a sparse reduced rank Huber regression for analyzing large and complex high-dimensional data with heavy-tailed random noise. The proposed method is based on a convex relaxation of a rank- and sparsity-constrained nonconvex optimization problem, which is then solved using a block coordinate descent and an alternating direction method of multipliers algorithm. We establish nonasymptotic estimation error bounds under both Frobenius and nuclear norms in the high-dimensional setting. This is a major contribution over existing results in reduced rank regression, which mainly focus on rank selection and prediction consistency. Our theoretical results quantify the tradeoff between heavy-tailedness of the random noise and statistical bias. For random noise with bounded (1+δ)th moment with δ∈(0,1), the rate of convergence is a function of δ, and is slower than the sub-Gaussian-type deviation bounds; for random noise with bounded second moment, we obtain a rate of convergence as if sub-Gaussian noise were assumed. We illustrate the performance of the proposed method via extensive numerical studies and a data application. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2383-2393
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2050243
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2050243
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2383-2393
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2093206_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Zhu Li
Author-X-Name-First: Zhu
Author-X-Name-Last: Li
Author-Name: Weijie J. Su
Author-X-Name-First: Weijie J.
Author-X-Name-Last: Su
Author-Name: Dino Sejdinovic
Author-X-Name-First: Dino
Author-X-Name-Last: Sejdinovic
Title: Benign Overfitting and Noisy Features
Abstract:
Modern machine learning models often exhibit the benign overfitting phenomenon, which has recently been characterized using double descent curves. In addition to the classical U-shaped learning curve, the learning risk undergoes another descent as we increase the number of parameters beyond a certain threshold. In this article, we examine the conditions under which benign overfitting occurs in the random feature (RF) models, that is, in a two-layer neural network with fixed first layer weights. Adopting a novel view of random features, we show that benign overfitting emerges because of the noise residing in such features. The noise may already exist in the data and propagates to the features, or it may be added by the user to the features directly. Such noise plays an implicit yet crucial regularization role in the phenomenon. In addition, we derive the explicit tradeoff between the number of parameters and the prediction accuracy, and for the first time demonstrate that an overparameterized model can achieve the optimal learning rate in the minimax sense. Finally, our results indicate that the learning risk for overparameterized models exhibits multiple-descent, rather than double-descent, behavior, which has been empirically verified in recent works. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2876-2888
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2093206
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2093206
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2876-2888
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2257260_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Jaewoo Park
Author-X-Name-First: Jaewoo
Author-X-Name-Last: Park
Title: Bayesian Filtering and Smoothing, 2nd ed.
Journal: Journal of the American Statistical Association
Pages: 2943-2945
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2023.2257260
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2257260
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2943-2945
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2087660_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Marinho Bertanha
Author-X-Name-First: Marinho
Author-X-Name-Last: Bertanha
Author-Name: Eunyi Chung
Author-X-Name-First: Eunyi
Author-X-Name-Last: Chung
Title: Permutation Tests at Nonparametric Rates
Abstract:
Classical two-sample permutation tests for equality of distributions have exact size in finite samples, but they fail to control size for testing equality of parameters that summarize each distribution. This article proposes permutation tests for equality of parameters that are estimated at root-n or slower rates. Our general framework applies to both parametric and nonparametric models, with two samples or one sample split into two subsamples. Our tests have correct size asymptotically while preserving exact size in finite samples when distributions are equal. They have no loss in local asymptotic power compared to tests that use asymptotic critical values. We propose confidence sets with correct coverage in large samples that also have exact coverage in finite samples if distributions are equal up to a transformation. We apply our theory to four commonly-used hypothesis tests of nonparametric functions evaluated at a point. Lastly, simulations show good finite sample properties, and two empirical examples illustrate our tests in practice. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2833-2846
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2087660
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2087660
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2833-2846
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2061982_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Tomas Masak
Author-X-Name-First: Tomas
Author-X-Name-Last: Masak
Author-Name: Victor M. Panaretos
Author-X-Name-First: Victor M.
Author-X-Name-Last: Panaretos
Title: Random Surface Covariance Estimation by Shifted Partial Tracing
Abstract:
The problem of covariance estimation for replicated surface-valued processes is examined from the functional data analysis perspective. Considerations of statistical and computational efficiency often compel the use of separability of the covariance, even though the assumption may fail in practice. We consider a setting where the covariance structure may fail to be separable locally—either due to noise contamination or due to the presence of a nonseparable short-range dependent signal component. That is, the covariance is an additive perturbation of a separable component by a nonseparable but banded component. We introduce nonparametric estimators hinging on the novel concept of shifted partial tracing, enabling computationally efficient estimation of the model under dense observation. Due to the denoising properties of shifted partial tracing, our methods are shown to yield consistent estimators even under noisy discrete observation, without the need for smoothing. Further to deriving the convergence rates and limit theorems, we also show that the implementation of our estimators, including prediction, comes at no computational overhead relative to a separable model. Finally, we demonstrate empirical performance and computational feasibility of our methods in an extensive simulation study and on a real dataset. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2562-2574
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2061982
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2061982
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2562-2574
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2044825_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Qing Mai
Author-X-Name-First: Qing
Author-X-Name-Last: Mai
Author-Name: Di He
Author-X-Name-First: Di
Author-X-Name-Last: He
Author-Name: Hui Zou
Author-X-Name-First: Hui
Author-X-Name-Last: Zou
Title: Coordinatewise Gaussianization: Theories and Applications
Abstract:
In statistical analysis, researchers often perform coordinatewise Gaussianization such that each variable is marginally normal. The normal score transformation is a method for coordinatewise Gaussianization and is widely used in statistics, econometrics, genetics and other areas. However, few studies exist on the theoretical properties of the normal score transformation, especially in high-dimensional problems where the dimension p diverges with the sample size n. In this article, we show that the normal score transformation uniformly converges to its population counterpart even when log p = o(n/log n). Our result can justify the normal score transformation prior to any downstream statistical method to which the theoretical normal transformation is beneficial. The same results are established for the Winsorized normal transformation, another popular choice for coordinatewise Gaussianization. We demonstrate the benefits of coordinatewise Gaussianization by studying its applications to the Gaussian copula model, the nearest shrunken centroids classifier and distance correlation. The benefits are clearly shown in theory and supported by numerical studies. Moreover, we also point out scenarios where coordinatewise Gaussianization does not help and may even cause damage. We offer a general recommendation on how to use coordinatewise Gaussianization in applications. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2329-2343
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2044825
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2044825
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2329-2343
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2208390_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Georgia Papadogeorgou
Author-X-Name-First: Georgia
Author-X-Name-Last: Papadogeorgou
Author-Name: Carolina Bello
Author-X-Name-First: Carolina
Author-X-Name-Last: Bello
Author-Name: Otso Ovaskainen
Author-X-Name-First: Otso
Author-X-Name-Last: Ovaskainen
Author-Name: David B. Dunson
Author-X-Name-First: David B.
Author-X-Name-Last: Dunson
Title: Covariate-Informed Latent Interaction Models: Addressing Geographic & Taxonomic Bias in Predicting Bird–Plant Interactions
Abstract:
Reductions in natural habitats make it urgent that we better understand species’ interconnections and how biological communities respond to environmental changes. However, ecological studies of species’ interactions are limited by their geographic and taxonomic focus, which can distort our understanding of interaction dynamics. We focus on bird–plant interactions that refer to situations of potential fruit consumption and seed dispersal. We develop an approach for predicting species’ interactions that accounts for errors in the recorded interaction networks, addresses the geographic and taxonomic biases of existing studies, is based on latent factors to increase flexibility and borrow information across species, incorporates covariates in a flexible manner to inform the latent factors, and uses a meta-analysis dataset from 85 individual studies. We focus on interactions among 232 birds and 511 plants in the Atlantic Forest, and identify 5% of species pairs that have no recorded interaction but a posterior probability above 80% that the interaction is possible. Finally, we develop a permutation-based variable importance procedure for latent factor network models and identify that a bird’s body mass and a plant’s fruit diameter are important in driving the presence of species interactions, with a multiplicative relationship that exhibits both a thresholding and a matching behavior. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2250-2261
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2023.2208390
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2208390
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2250-2261
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2071279_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Zhaoxing Gao
Author-X-Name-First: Zhaoxing
Author-X-Name-Last: Gao
Author-Name: Ruey S. Tsay
Author-X-Name-First: Ruey S.
Author-X-Name-Last: Tsay
Title: Divide-and-Conquer: A Distributed Hierarchical Factor Approach to Modeling Large-Scale Time Series Data
Abstract:
This article proposes a hierarchical approximate-factor approach to analyzing high-dimensional, large-scale heterogeneous time series data using distributed computing. The new method employs a multiple-fold dimension reduction procedure using Principal Component Analysis (PCA) and shows great promise for modeling large-scale data that cannot be stored or analyzed by a single machine. Each computer at the basic level performs a PCA to extract common factors among the time series assigned to it and transfers those factors to one and only one node of the second level. Each second-level computer collects the common factors from its subordinates and performs another PCA to select the second-level common factors. This process is repeated until the central server is reached, which collects factors from its direct subordinates and performs a final PCA to select the global common factors. The noise terms of the second-level approximate factor model are the unique common factors of the first-level clusters. We focus on the case of two levels in our theoretical derivations, but the idea can easily be generalized to any finite number of hierarchies, and the proposed method is also applicable to data with heterogeneous and multilevel subcluster structures that are stored and analyzed by a single machine. We introduce a new diffusion index approach to forecasting based on the global and group-specific factors. Some clustering methods are discussed in the supplement when the group memberships are unknown. We further extend the analysis to unit-root nonstationary time series. Asymptotic properties of the proposed method are derived for the diverging dimension of the data in each computing unit and the sample size T. We use both simulated and real examples to assess the performance of the proposed method in finite samples, and compare our method with the commonly used ones in the literature concerning the forecasting ability of extracted factors.
Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2698-2711
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2071279
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2071279
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2698-2711
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2051519_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Weijing Tang
Author-X-Name-First: Weijing
Author-X-Name-Last: Tang
Author-Name: Kevin He
Author-X-Name-First: Kevin
Author-X-Name-Last: He
Author-Name: Gongjun Xu
Author-X-Name-First: Gongjun
Author-X-Name-Last: Xu
Author-Name: Ji Zhu
Author-X-Name-First: Ji
Author-X-Name-Last: Zhu
Title: Survival Analysis via Ordinary Differential Equations
Abstract:
This article introduces an Ordinary Differential Equation (ODE) notion for survival analysis. The ODE notion not only provides a unified modeling framework, but more importantly, also enables the development of a widely applicable, scalable, and easy-to-implement procedure for estimation and inference. Specifically, the ODE modeling framework unifies many existing survival models, such as the proportional hazards model, the linear transformation model, the accelerated failure time model, and the time-varying coefficient model as special cases. The generality of the proposed framework serves as the foundation of a widely applicable estimation procedure. As an illustrative example, we develop a sieve maximum likelihood estimator for a general semiparametric class of ODE models. In comparison to existing estimation methods, the proposed procedure has advantages in terms of computational scalability and numerical stability. Moreover, to address unique theoretical challenges induced by the ODE notion, we establish a new general sieve M-theorem for bundled parameters and show that the proposed sieve estimator is consistent and asymptotically normal, and achieves the semiparametric efficiency bound. The finite sample performance of the proposed estimator is examined in simulation studies and a real-world data example. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2406-2421
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2051519
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2051519
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2406-2421
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2060836_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Tetsuya Kaji
Author-X-Name-First: Tetsuya
Author-X-Name-Last: Kaji
Author-Name: Veronika Ročková
Author-X-Name-First: Veronika
Author-X-Name-Last: Ročková
Title: Metropolis–Hastings via Classification
Abstract:
This article develops a Bayesian computational platform at the interface between posterior sampling and optimization in models whose marginal likelihoods are difficult to evaluate. Inspired by contrastive learning and Generative Adversarial Networks (GAN), we reframe the likelihood function estimation problem as a classification problem. Pitting a Generator, who simulates fake data, against a Classifier, who tries to distinguish them from the real data, one obtains likelihood (ratio) estimators which can be plugged into the Metropolis–Hastings algorithm. The resulting Markov chains generate, at a steady state, samples from an approximate posterior whose asymptotic properties we characterize. Drawing upon connections with empirical Bayes and Bayesian misspecification, we quantify the convergence rate in terms of the contraction speed of the actual posterior and the convergence rate of the Classifier. Asymptotic normality results are also provided which justify the inferential potential of our approach. We illustrate the usefulness of our approach on examples that have challenged existing Bayesian likelihood-free approaches. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2533-2547
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2060836
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2060836
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2533-2547
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2223680_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Michael Schweinberger
Author-X-Name-First: Michael
Author-X-Name-Last: Schweinberger
Author-Name: Cornelius Fritz
Author-X-Name-First: Cornelius
Author-X-Name-Last: Fritz
Title: Discussion of “A Tale of Two Datasets: Representativeness and Generalisability of Inference for Samples of Networks” by Pavel N. Krivitsky, Pietro Coletti, and Niel Hens
Journal: Journal of the American Statistical Association
Pages: 2225-2227
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2023.2223680
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2223680
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2225-2227
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2057317_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20231214T103247 git hash: d7a2cb0857
Author-Name: Shulei Wang
Author-X-Name-First: Shulei
Author-X-Name-Last: Wang
Title: Self-supervised Metric Learning in Multi-View Data: A Downstream Task Perspective
Abstract:
Self-supervised metric learning has been a successful approach for learning a distance from an unlabeled dataset. The resulting distance is broadly useful for improving various distance-based downstream tasks, even when no information from downstream tasks is used in the metric learning stage. To gain insights into this approach, we develop a statistical framework to theoretically study how self-supervised metric learning can benefit downstream tasks in the context of multi-view data. Under this framework, we show that the target distance of metric learning satisfies several desired properties for the downstream tasks. On the other hand, our investigation suggests the target distance can be further improved by moderating each direction’s weights. In addition, our analysis precisely characterizes the improvement by self-supervised metric learning on four commonly used downstream tasks: sample identification, two-sample testing, k-means clustering, and k-nearest neighbor classification. When the distance is estimated from an unlabeled dataset, we establish the upper bound on distance estimation’s accuracy and the number of samples sufficient for downstream task improvement. Finally, numerical experiments are presented to support the theoretical results in the article. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 2454-2467
Issue: 544
Volume: 118
Year: 2023
Month: 10
X-DOI: 10.1080/01621459.2022.2057317
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2057317
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:118:y:2023:i:544:p:2454-2467
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2120400_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Laurens de Haan
Author-X-Name-First: Laurens
Author-X-Name-Last: de Haan
Author-Name: Chen Zhou
Author-X-Name-First: Chen
Author-X-Name-Last: Zhou
Title: Bootstrapping Extreme Value Estimators
Abstract:
This article develops a bootstrap analogue of the well-known asymptotic expansion of the tail quantile process in extreme value theory. One application of this result is to construct confidence intervals for estimators of the extreme value index such as the Probability Weighted Moment (PWM) estimator. For the peaks-over-threshold method, we show the bootstrap consistency of the confidence intervals. By contrast, the asymptotic expansion of the quantile process of the bootstrapped block maxima does not lead to a similar consistency result for the PWM estimator using the block maxima method. For both methods, we show by simulations that the sample variance of bootstrapped estimates can be a good approximation for the asymptotic variance of the original estimator. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 382-393
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2120400
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2120400
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:382-393
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2127360_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Christoph Alexander Weitkamp
Author-X-Name-First: Christoph Alexander
Author-X-Name-Last: Weitkamp
Author-Name: Katharina Proksch
Author-X-Name-First: Katharina
Author-X-Name-Last: Proksch
Author-Name: Carla Tameling
Author-X-Name-First: Carla
Author-X-Name-Last: Tameling
Author-Name: Axel Munk
Author-X-Name-First: Axel
Author-X-Name-Last: Munk
Title: Distribution of Distances based Object Matching: Asymptotic Inference
Abstract:
In this article, we aim to provide a statistical theory for object matching based on a lower bound of the Gromov-Wasserstein distance related to the distribution of (pairwise) distances of the considered objects. To this end, we model general objects as metric measure spaces. Based on this, we propose a simple and efficiently computable asymptotic statistical test for pose invariant object discrimination. This is based on a (β-trimmed) empirical version of the aforementioned lower bound. We derive the distributional limits of this test statistic for the trimmed and untrimmed case. For this purpose, we introduce a novel U-type process indexed in β and show its weak convergence. The theory developed is investigated in Monte Carlo simulations and applied to structural protein comparisons. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 538-551
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2127360
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2127360
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:538-551
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2102019_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Biao Cai
Author-X-Name-First: Biao
Author-X-Name-Last: Cai
Author-Name: Jingfei Zhang
Author-X-Name-First: Jingfei
Author-X-Name-Last: Zhang
Author-Name: Yongtao Guan
Author-X-Name-First: Yongtao
Author-X-Name-Last: Guan
Title: Latent Network Structure Learning From High-Dimensional Multivariate Point Processes
Abstract:
Learning the latent network structure from large scale multivariate point process data is an important task in a wide range of scientific and business applications. For instance, we might wish to estimate the neuronal functional connectivity network based on spiking times recorded from a collection of neurons. To characterize the complex processes underlying the observed data, we propose a new and flexible class of nonstationary Hawkes processes that allow both excitatory and inhibitory effects. We estimate the latent network structure using an efficient sparse least squares estimation approach. Using a thinning representation, we establish concentration inequalities for the first and second order statistics of the proposed Hawkes process. Such theoretical results enable us to establish the non-asymptotic error bound and the selection consistency of the estimated parameters. Furthermore, we describe a least squares loss based statistic for testing if the background intensity is constant in time. We demonstrate the efficacy of our proposed method through simulation studies and an application to a neuron spike train dataset. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 95-108
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2102019
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2102019
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:95-108
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2279695_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Patrick M. LeBlanc
Author-X-Name-First: Patrick M.
Author-X-Name-Last: LeBlanc
Author-Name: David Banks
Author-X-Name-First: David
Author-X-Name-Last: Banks
Author-Name: Linhui Fu
Author-X-Name-First: Linhui
Author-X-Name-Last: Fu
Author-Name: Mingyan Li
Author-X-Name-First: Mingyan
Author-X-Name-Last: Li
Author-Name: Zhengyu Tang
Author-X-Name-First: Zhengyu
Author-X-Name-Last: Tang
Author-Name: Qiuyi Wu
Author-X-Name-First: Qiuyi
Author-X-Name-Last: Wu
Title: Recommender Systems: A Review
Abstract:
Recommender systems are the engine of online advertising. Not only do they suggest movies, music, or romantic partners, but they also are used to select which advertisements to show to users. This paper reviews the basics of recommender system methodology and then looks at the emerging arena of active recommender systems.
Journal: Journal of the American Statistical Association
Pages: 773-785
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2023.2279695
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2279695
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:773-785
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2116331_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Lucy L. Gao
Author-X-Name-First: Lucy L.
Author-X-Name-Last: Gao
Author-Name: Jacob Bien
Author-X-Name-First: Jacob
Author-X-Name-Last: Bien
Author-Name: Daniela Witten
Author-X-Name-First: Daniela
Author-X-Name-Last: Witten
Title: Selective Inference for Hierarchical Clustering
Abstract:
Classical tests for a difference in means control the Type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated Type I error rate. Notably, this problem persists even if two separate and independent datasets are used to define the groups and to test for a difference in their means. To address this problem, in this article, we propose a selective inference approach to test for a difference in means between two clusters. Our procedure controls the selective Type I error rate by accounting for the fact that the choice of null hypothesis was made based on the data. We describe how to efficiently compute exact p-values for clusters obtained using agglomerative hierarchical clustering with many commonly used linkages. We apply our method to simulated data and to single-cell RNA-sequencing data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 332-342
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2116331
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2116331
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:332-342
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2140052_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Xin Ma
Author-X-Name-First: Xin
Author-X-Name-Last: Ma
Author-Name: Suprateek Kundu
Author-X-Name-First: Suprateek
Author-X-Name-Last: Kundu
Title: Multi-Task Learning with High-Dimensional Noisy Images
Abstract:
Recent medical imaging studies have given rise to distinct but inter-related datasets corresponding to multiple experimental tasks or longitudinal visits. Standard scalar-on-image regression models that fit each dataset separately are not equipped to leverage information across inter-related images, and existing multi-task learning approaches are compromised by the inability to account for the noise that is often observed in images. We propose a novel joint scalar-on-image regression framework involving wavelet-based image representations with grouped penalties that are designed to pool information across inter-related images for joint learning, and which explicitly accounts for noise in high-dimensional images via a projection-based approach. In the presence of nonconvexity arising due to noisy images, we derive nonasymptotic error bounds under nonconvex as well as convex grouped penalties, even when the number of voxels increases exponentially with sample size. A projected gradient descent algorithm is used for computation, which is shown to approximate the optimal solution via well-defined nonasymptotic optimization error bounds under noisy images. Extensive simulations and application to a motivating longitudinal Alzheimer’s disease study illustrate significantly improved predictive ability and greater power to detect true signals that are simply missed by existing methods without noise correction, due to the attenuation-to-null phenomenon. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 650-663
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2140052
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2140052
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:650-663
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2115374_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Mingxue Quan
Author-X-Name-First: Mingxue
Author-X-Name-Last: Quan
Author-Name: Zhenhua Lin
Author-X-Name-First: Zhenhua
Author-X-Name-Last: Lin
Title: Optimal One-Pass Nonparametric Estimation Under Memory Constraint
Abstract:
For nonparametric regression in the streaming setting, where data constantly flow in and require real-time analysis, a main challenge is that data are cleared from the computer system once processed due to limited computer memory and storage. We tackle the challenge by proposing a novel one-pass estimator based on penalized orthogonal basis expansions and developing a general framework to study the interplay between statistical efficiency and memory consumption of estimators. We show that the proposed estimator is statistically optimal under the memory constraint and has an asymptotically minimal memory footprint among all one-pass estimators of the same estimation quality. Numerical studies demonstrate that the proposed one-pass estimator is nearly as efficient as its nonstreaming counterpart that has access to all historical data.
Journal: Journal of the American Statistical Association
Pages: 285-296
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2115374
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2115374
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:285-296
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2115918_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Federico Camerlenghi
Author-X-Name-First: Federico
Author-X-Name-Last: Camerlenghi
Author-Name: Stefano Favaro
Author-X-Name-First: Stefano
Author-X-Name-Last: Favaro
Author-Name: Lorenzo Masoero
Author-X-Name-First: Lorenzo
Author-X-Name-Last: Masoero
Author-Name: Tamara Broderick
Author-X-Name-First: Tamara
Author-X-Name-Last: Broderick
Title: Scaled Process Priors for Bayesian Nonparametric Estimation of the Unseen Genetic Variation
Abstract:
There is a growing interest in the estimation of the number of unseen features, mostly driven by biological applications. A recent work brought out a peculiar property of the popular completely random measures (CRMs) as prior models in Bayesian nonparametric (BNP) inference for the unseen-features problem: for fixed prior parameters, they all lead to a Poisson posterior distribution for the number of unseen features, which depends on the sampling information only through the sample size. CRMs are thus not a flexible prior model for the unseen-features problem and, while the Poisson posterior distribution may be appealing for analytical tractability and ease of interpretability, its independence from the sampling information makes the BNP approach a questionable oversimplification, with posterior inferences being completely determined by the estimation of unknown prior parameters. In this article, we introduce the stable-Beta scaled process (SB-SP) prior, and we show that it allows us to enrich the posterior distribution of the number of unseen features arising under CRM priors, while maintaining its analytical tractability and interpretability. That is, the SB-SP prior leads to a negative binomial posterior distribution, which depends on the sampling information through the sample size and the number of distinct features, with corresponding estimates being simple, linear in the sampling information and computationally efficient. We apply our BNP approach to synthetic data and to real cancer genomic data, showing that: (i) it outperforms the most popular parametric and nonparametric competitors in terms of estimation accuracy; (ii) it provides improved coverage for the estimation with respect to a BNP approach under CRM priors. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 320-331
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2115918
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2115918
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:320-331
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2123814_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Anqi Zhao
Author-X-Name-First: Anqi
Author-X-Name-Last: Zhao
Author-Name: Peng Ding
Author-X-Name-First: Peng
Author-X-Name-Last: Ding
Title: To Adjust or not to Adjust? Estimating the Average Treatment Effect in Randomized Experiments with Missing Covariates
Abstract:
Randomized experiments allow for consistent estimation of the average treatment effect based on the difference in mean outcomes without strong modeling assumptions. Appropriate use of pretreatment covariates can further improve the estimation efficiency. Missingness in covariates is nevertheless common in practice, and raises an important question: should we adjust for covariates subject to missingness, and if so, how? The unadjusted difference in means is always unbiased. The complete-covariate analysis adjusts for all completely observed covariates, and is asymptotically more efficient than the difference in means if at least one completely observed covariate is predictive of the outcome. Then what is the additional gain of adjusting for covariates subject to missingness? To reconcile the conflicting recommendations in the literature, we analyze and compare five strategies for handling missing covariates in randomized experiments under the design-based framework, and recommend the missingness-indicator method, as a known but not so popular strategy in the literature, due to its multiple advantages. First, it removes the dependence of the regression-adjusted estimators on the imputed values for the missing covariates. Second, it does not require modeling the missingness mechanism, and yields consistent estimators even when the missingness mechanism is related to the missing covariates and unobservable potential outcomes. Third, it ensures large-sample efficiency over the complete-covariate analysis and the analysis based on only the imputed covariates. Lastly, it is easy to implement via least squares. We also propose modifications to it based on asymptotic and finite sample considerations. Importantly, our theory views randomization as the basis for inference, and does not impose any modeling assumptions on the data-generating process or missingness mechanism. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 450-460
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2123814
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2123814
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:450-460
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2105223_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Naoki Awaya
Author-X-Name-First: Naoki
Author-X-Name-Last: Awaya
Author-Name: Li Ma
Author-X-Name-First: Li
Author-X-Name-Last: Ma
Title: Hidden Markov Pólya Trees for High-Dimensional Distributions
Abstract:
The Pólya tree (PT) process is a general-purpose Bayesian nonparametric model that has found wide application in a range of inference problems. It has a simple analytic form and the posterior computation boils down to beta-binomial conjugate updates along a partition tree over the sample space. Recent development in PT models shows that performance of these models can be substantially improved by (i) allowing the partition tree to adapt to the structure of the underlying distributions and (ii) incorporating latent state variables that characterize local features of the underlying distributions. However, important limitations of the PT remain, including (i) the sensitivity in the posterior inference with respect to the choice of the partition tree, and (ii) the lack of scalability with respect to dimensionality of the sample space. We consider a modeling strategy for PT models that incorporates a flexible prior on the partition tree along with latent states with Markov dependency. We introduce a hybrid algorithm combining sequential Monte Carlo (SMC) and recursive message passing for posterior sampling that can scale up to 100 dimensions. While our description of the algorithm assumes a single computer environment, it has the potential to be implemented on distributed systems to further enhance the scalability. Moreover, we investigate the large sample properties of the tree structures and latent states under the posterior model. We carry out extensive numerical experiments in density estimation and two-group comparison, which show that flexible partitioning can substantially improve the performance of PT models in both inference tasks. We demonstrate an application to a mass cytometry dataset with 19 dimensions and over 200,000 observations. Supplementary Materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 189-201
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2105223
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2105223
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:189-201
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2126363_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Xin Zhang
Author-X-Name-First: Xin
Author-X-Name-Last: Zhang
Author-Name: Jia Liu
Author-X-Name-First: Jia
Author-X-Name-Last: Liu
Author-Name: Zhengyuan Zhu
Author-X-Name-First: Zhengyuan
Author-X-Name-Last: Zhu
Title: Learning Coefficient Heterogeneity over Networks: A Distributed Spanning-Tree-Based Fused-Lasso Regression
Abstract:
Identifying the latent cluster structure based on model heterogeneity is a fundamental but challenging task arising in many machine learning applications. In this article, we study the clustered coefficient regression problem in distributed network systems, where the data are locally collected and held by nodes. Our work aims to improve the regression estimation efficiency by aggregating the neighbors’ information while also identifying the cluster membership for nodes. To achieve efficient estimation and clustering, we develop a distributed spanning-tree-based fused-lasso regression (DTFLR) approach. In particular, we propose an adaptive spanning-tree-based fusion penalty for the low-complexity clustered coefficient regression. We show that our proposed estimator satisfies statistical oracle properties. Additionally, to solve the problem in parallel, we design a distributed generalized alternating direction method of multipliers algorithm, which has a simple node-based implementation scheme and enjoys a linear convergence rate. Collectively, our results in this article contribute to the theories of low-complexity clustered coefficient regression and distributed optimization over networks. Thorough numerical experiments and real-world data analysis are conducted to verify our theoretical results, which show that our approach outperforms existing works in terms of estimation accuracy, computation speed, and communication costs. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 485-497
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2126363
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2126363
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:485-497
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2294527_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Shengbin Ye
Author-X-Name-First: Shengbin
Author-X-Name-Last: Ye
Author-Name: Thomas P. Senftle
Author-X-Name-First: Thomas P.
Author-X-Name-Last: Senftle
Author-Name: Meng Li
Author-X-Name-First: Meng
Author-X-Name-Last: Li
Title: Operator-Induced Structural Variable Selection for Identifying Materials Genes
Abstract:
In the emerging field of materials informatics, a fundamental task is to identify physicochemically meaningful descriptors, or materials genes, which are engineered from primary features and a set of elementary algebraic operators through compositions. Standard practice directly analyzes the high-dimensional candidate predictor space in a linear model; statistical analyses are then substantially hampered by the daunting challenge posed by the astronomically large number of correlated predictors with limited sample size. We formulate this problem as variable selection with operator-induced structure (OIS) and propose a new method to achieve unconventional dimension reduction by using the geometry embedded in OIS. Although the model remains linear, we iterate nonparametric variable selection for effective dimension reduction. This enables variable selection based on ab initio primary features, leading to a method that is orders of magnitude faster than existing methods, with improved accuracy. To select the nonparametric module, we discuss a desired performance criterion that is uniquely induced by variable selection with OIS; in particular, we propose to employ a Bayesian Additive Regression Trees (BART)-based variable selection method. Numerical studies show superiority of the proposed method, which continues to exhibit robust performance when the input dimension is out of reach of existing methods. Our analysis of single-atom catalysis identifies physical descriptors that explain the binding energy of metal-support pairs with high explanatory power, leading to interpretable insights to guide the prevention of a notorious problem called sintering and aid catalysis design. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 81-94
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2023.2294527
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2294527
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:81-94
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2142590_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Bingyuan Liu
Author-X-Name-First: Bingyuan
Author-X-Name-Last: Liu
Author-Name: Qi Zhang
Author-X-Name-First: Qi
Author-X-Name-Last: Zhang
Author-Name: Lingzhou Xue
Author-X-Name-First: Lingzhou
Author-X-Name-Last: Xue
Author-Name: Peter X.-K. Song
Author-X-Name-First: Peter X.-K.
Author-X-Name-Last: Song
Author-Name: Jian Kang
Author-X-Name-First: Jian
Author-X-Name-Last: Kang
Title: Robust High-Dimensional Regression with Coefficient Thresholding and Its Application to Imaging Data Analysis
Abstract:
It is important to develop statistical techniques to analyze high-dimensional data in the presence of both complex dependence and possible heavy tails and outliers in real-world applications such as imaging data analyses. We propose a new robust high-dimensional regression with coefficient thresholding, in which an efficient nonconvex estimation procedure is proposed through a thresholding function and the robust Huber loss. The proposed regularization method accounts for complex dependence structures in predictors and is robust against heavy tails and outliers in outcomes. Theoretically, we rigorously analyze the landscape of the population and empirical risk functions for the proposed method. The fine landscape enables us to establish both statistical consistency and computational convergence under the high-dimensional setting. We also present an extension to incorporate spatial information into the proposed method. Finite-sample properties of the proposed methods are examined by extensive simulation studies. An application concerns a scalar-on-image regression analysis for an association of psychiatric disorder measured by the general factor of psychopathology with features extracted from the task functional MRI data in the Adolescent Brain Cognitive Development (ABCD) study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 715-729
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2142590
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2142590
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:715-729
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2270795_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Andrés F. Barrientos
Author-X-Name-First: Andrés F.
Author-X-Name-Last: Barrientos
Author-Name: Aaron R. Williams
Author-X-Name-First: Aaron R.
Author-X-Name-Last: Williams
Author-Name: Joshua Snoke
Author-X-Name-First: Joshua
Author-X-Name-Last: Snoke
Author-Name: Claire McKay Bowen
Author-X-Name-First: Claire McKay
Author-X-Name-Last: Bowen
Title: A Feasibility Study of Differentially Private Summary Statistics and Regression Analyses with Evaluations on Administrative and Survey Data
Abstract:
Federal administrative data, such as tax data, are invaluable for research, but because of privacy concerns, access to these data is typically limited to select agencies and a few individuals. An alternative to sharing microlevel data is to allow individuals to query statistics without directly accessing the confidential data. This article studies the feasibility of using differentially private (DP) methods to make certain queries while preserving privacy. We also include new methodological adaptations to existing DP regression methods for using new data types and returning standard error estimates. We define feasibility as the impact of DP methods on analyses for making public policy decisions and the queries' accuracy according to several utility metrics. We evaluate the methods using Internal Revenue Service data and public-use Current Population Survey data and identify how specific data features might challenge some of these methods. Our findings show that DP methods are feasible for simple, univariate statistics but struggle to produce accurate regression estimates and confidence intervals. To the best of our knowledge, this is the first comprehensive statistical study of DP regression methodology on real, complex datasets, and the findings have significant implications for the direction of a growing research field and public policy. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 52-65
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2023.2270795
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2270795
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:52-65
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2138760_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Wenzhuo Zhou
Author-X-Name-First: Wenzhuo
Author-X-Name-Last: Zhou
Author-Name: Ruoqing Zhu
Author-X-Name-First: Ruoqing
Author-X-Name-Last: Zhu
Author-Name: Annie Qu
Author-X-Name-First: Annie
Author-X-Name-Last: Qu
Title: Estimating Optimal Infinite Horizon Dynamic Treatment Regimes via pT-Learning
Abstract:
Recent advances in mobile health (mHealth) technology provide an effective way to monitor individuals’ health statuses and deliver just-in-time personalized interventions. However, the practical use of mHealth technology raises unique challenges to existing methodologies on learning an optimal dynamic treatment regime. Many mHealth applications involve decision-making with large numbers of intervention options and under an infinite time horizon setting where the number of decision stages diverges to infinity. In addition, temporary medication shortages may cause optimal treatments to be unavailable, while it is unclear what alternatives can be used. To address these challenges, we propose a Proximal Temporal consistency Learning (pT-Learning) framework to estimate an optimal regime that is adaptively adjusted between deterministic and stochastic sparse policy models. The resulting minimax estimator avoids the double sampling issue in the existing algorithms. It can be further simplified and can easily incorporate off-policy data without mismatched distribution corrections. We study theoretical properties of the sparse policy and establish finite-sample bounds on the excess risk and performance error. The proposed method is provided in our proximalDTR package and is evaluated through extensive simulation studies and the OhioT1DM mHealth dataset. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 625-638
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2138760
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2138760
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:625-638
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2104728_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Jiaming Qiu
Author-X-Name-First: Jiaming
Author-X-Name-Last: Qiu
Author-Name: Xiongtao Dai
Author-X-Name-First: Xiongtao
Author-X-Name-Last: Dai
Author-Name: Zhengyuan Zhu
Author-X-Name-First: Zhengyuan
Author-X-Name-Last: Zhu
Title: Nonparametric Estimation of Repeated Densities with Heterogeneous Sample Sizes
Abstract:
We consider the estimation of densities in multiple subpopulations, where the available sample size in each subpopulation greatly varies. This problem occurs in epidemiology, for example, where different diseases may share similar pathogenic mechanisms but differ in their prevalence. Without specifying a parametric form, our proposed method pools information from the population and estimates the density in each subpopulation in a data-driven fashion. Drawing from functional data analysis, low-dimensional approximating density families in the form of exponential families are constructed from the principal modes of variation in the log-densities. Subpopulation densities are subsequently fitted in the approximating families based on likelihood principles and shrinkage. The approximating families increase in their flexibility as the number of components increases and can approximate arbitrary infinite-dimensional densities. We also derive convergence results of the density estimates formed with discrete observations. The proposed methods are shown to be interpretable and efficient in simulation studies as well as applications to electronic medical record and rainfall data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 176-188
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2104728
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2104728
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:176-188
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2128359_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Jianqing Fan
Author-X-Name-First: Jianqing
Author-X-Name-Last: Fan
Author-Name: Yongyi Guo
Author-X-Name-First: Yongyi
Author-X-Name-Last: Guo
Author-Name: Mengxin Yu
Author-X-Name-First: Mengxin
Author-X-Name-Last: Yu
Title: Policy Optimization Using Semiparametric Models for Dynamic Pricing
Abstract:
In this article, we study the contextual dynamic pricing problem where the market value of a product is linear in its observed features plus some market noise. Products are sold one at a time, and only a binary response indicating success or failure of a sale is observed. Our model setting is similar to the work by? except that we expand the demand curve to a semiparametric model and learn dynamically both parametric and nonparametric components. We propose a dynamic statistical learning and decision making policy that minimizes regret (maximizes revenue) by combining semiparametric estimation for a generalized linear model with unknown link and online decision making. Under mild conditions, for a market noise cdf F(·) with mth order derivative (m ≥ 2), our policy achieves a regret upper bound of Õ_d(T^{(2m+1)/(4m-1)}), where T is the time horizon and Õ_d is the order hiding logarithmic terms and the feature dimension d. The upper bound is further reduced to Õ_d(√T) if F is super smooth. These upper bounds are close to Ω(√T), the lower bound where F belongs to a parametric class. We further generalize these results to the case with dynamic dependent product features under the strong mixing condition. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 552-564
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2128359
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2128359
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:552-564
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2126780_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Matteo Barigozzi
Author-X-Name-First: Matteo
Author-X-Name-Last: Barigozzi
Author-Name: Matteo Farnè
Author-X-Name-First: Matteo
Author-X-Name-Last: Farnè
Title: An Algebraic Estimator for Large Spectral Density Matrices
Abstract:
We propose a new estimator of high-dimensional spectral density matrices, called ALgebraic Spectral Estimator (ALSE), under the assumption of an underlying low rank plus sparse structure, as typically assumed in dynamic factor models. The ALSE is computed by minimizing a quadratic loss under a nuclear norm plus l1 norm constraint to control the latent rank and the residual sparsity pattern. The loss function requires as input the classical smoothed periodogram estimator and two threshold parameters, the choice of which is thoroughly discussed. We prove consistency of ALSE as both the dimension p and the sample size T diverge to infinity, as well as the recovery of latent rank and residual sparsity pattern with probability one. We then propose the UNshrunk ALgebraic Spectral Estimator (UNALSE), which is designed to minimize the Frobenius loss with respect to the pre-estimator while retaining the optimality of the ALSE. When applying UNALSE to a standard U.S. quarterly macroeconomic dataset, we find evidence of two main sources of comovements: a real factor driving the economy at business cycle frequencies, and a nominal factor driving the higher frequency dynamics. The article is also complemented by an extensive simulation exercise. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 498-510
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2126780
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2126780
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:498-510
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2118602_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Alessandro Mastrototaro
Author-X-Name-First: Alessandro
Author-X-Name-Last: Mastrototaro
Author-Name: Jimmy Olsson
Author-X-Name-First: Jimmy
Author-X-Name-Last: Olsson
Author-Name: Johan Alenlöv
Author-X-Name-First: Johan
Author-X-Name-Last: Alenlöv
Title: Fast and Numerically Stable Particle-Based Online Additive Smoothing: The AdaSmooth Algorithm
Abstract:
We present a novel sequential Monte Carlo approach to online smoothing of additive functionals in a very general class of path-space models. Hitherto, the solutions proposed in the literature suffer from either long-term numerical instability due to particle-path degeneracy or, in the case that degeneracy is remedied by particle approximation of the so-called backward kernel, high computational demands. To optimally balance computational speed against numerical stability, we propose to furnish a (fast) naive particle smoother, propagating recursively a sample of particles and associated smoothing statistics, with an adaptive backward-sampling-based updating rule which allows the number of (costly) backward samples to be kept at a minimum. This yields a new, function-specific additive smoothing algorithm, AdaSmooth, which is computationally fast, numerically stable and easy to implement. The algorithm is provided with rigorous theoretical results guaranteeing its consistency, asymptotic normality and long-term stability as well as numerical results demonstrating empirically the clear superiority of AdaSmooth to existing algorithms. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 356-367
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2118602
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2118602
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:356-367
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2141636_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Yunlu Jiang
Author-X-Name-First: Yunlu
Author-X-Name-Last: Jiang
Author-Name: Xueqin Wang
Author-X-Name-First: Xueqin
Author-X-Name-Last: Wang
Author-Name: Canhong Wen
Author-X-Name-First: Canhong
Author-X-Name-Last: Wen
Author-Name: Yukang Jiang
Author-X-Name-First: Yukang
Author-X-Name-Last: Jiang
Author-Name: Heping Zhang
Author-X-Name-First: Heping
Author-X-Name-Last: Zhang
Title: Nonparametric Two-Sample Tests of High Dimensional Mean Vectors via Random Integration
Abstract:
Testing the equality of the means in two samples is a fundamental statistical inferential problem. Most of the existing methods are based on the sum-of-squares or supremum statistics. They are possibly powerful in some situations, but not in others, and they do not work in a unified way. Using random integration of the difference, we develop a framework that includes and extends many existing methods, especially in high-dimensional settings, without requiring equal covariance matrices or sparsity. Under a general multivariate model, we can derive the asymptotic properties of the proposed test statistic without specifying a relationship between the data dimension and sample size explicitly. Specifically, the new framework allows us to better understand the test’s properties and select a powerful procedure accordingly. For example, we prove that our proposed test can achieve the power of 1 when nonzero signals in the true mean differences are weakly dense with nearly the same sign. In addition, we delineate the conditions under which the asymptotic relative Pitman efficiency of our proposed test to its competitor is greater than or equal to 1. Extensive numerical studies and a real data example demonstrate the potential of our proposed test. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 701-714
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2141636
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2141636
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:701-714
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2303300_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: The Editors
Title: The Journal of the American Statistical Association 2023 Associate Editors
Journal: Journal of the American Statistical Association
Pages: 792-793
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2024.2303300
File-URL: http://hdl.handle.net/10.1080/01621459.2024.2303300
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:792-793
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2126781_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Xiufan Yu
Author-X-Name-First: Xiufan
Author-X-Name-Last: Yu
Author-Name: Danning Li
Author-X-Name-First: Danning
Author-X-Name-Last: Li
Author-Name: Lingzhou Xue
Author-X-Name-First: Lingzhou
Author-X-Name-Last: Xue
Title: Fisher’s Combined Probability Test for High-Dimensional Covariance Matrices
Abstract:
Testing large covariance matrices is of fundamental importance in statistical analysis with high-dimensional data. In the past decade, three types of test statistics have been studied in the literature: quadratic form statistics, maximum form statistics, and their weighted combination. It is known that quadratic form statistics would suffer from low power against sparse alternatives and maximum form statistics would suffer from low power against dense alternatives. The weighted combination methods were introduced to enhance the power of quadratic form statistics or maximum form statistics when the weights are appropriately chosen. In this article, we provide a new perspective to exploit the full potential of quadratic form statistics and maximum form statistics for testing high-dimensional covariance matrices. We propose a scale-invariant power-enhanced test based on Fisher’s method to combine the p-values of quadratic form statistics and maximum form statistics. After carefully studying the asymptotic joint distribution of quadratic form statistics and maximum form statistics, we first prove that the proposed combination method retains the correct asymptotic size under the Gaussian assumption, and we also derive a new Lyapunov-type bound for the joint distribution and prove the correct asymptotic size of the proposed method without requiring the Gaussian assumption. Moreover, we show that the proposed method boosts the asymptotic power against more general alternatives. Finally, we demonstrate the finite-sample performance in simulation studies and a real application. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 511-524
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2126781
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2126781
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:511-524
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2141635_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Blair Bilodeau
Author-X-Name-First: Blair
Author-X-Name-Last: Bilodeau
Author-Name: Alex Stringer
Author-X-Name-First: Alex
Author-X-Name-Last: Stringer
Author-Name: Yanbo Tang
Author-X-Name-First: Yanbo
Author-X-Name-Last: Tang
Title: Stochastic Convergence Rates and Applications of Adaptive Quadrature in Bayesian Inference
Abstract:
We provide the first stochastic convergence rates for a family of adaptive quadrature rules used to normalize the posterior distribution in Bayesian models. Our results apply to the uniform relative error in the approximate posterior density, the coverage probabilities of approximate credible sets, and approximate moments and quantiles, therefore, guaranteeing fast asymptotic convergence of approximate summary statistics used in practice. The family of quadrature rules includes adaptive Gauss-Hermite quadrature, and we apply this rule in two challenging low-dimensional examples. Further, we demonstrate how adaptive quadrature can be used as a crucial component of a modern approximate Bayesian inference procedure for high-dimensional additive models. The method is implemented and made publicly available in the aghq package for the R language, available on CRAN. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 690-700
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2141635
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2141635
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:690-700
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2102503_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Mingzhang Yin
Author-X-Name-First: Mingzhang
Author-X-Name-Last: Yin
Author-Name: Claudia Shi
Author-X-Name-First: Claudia
Author-X-Name-Last: Shi
Author-Name: Yixin Wang
Author-X-Name-First: Yixin
Author-X-Name-Last: Wang
Author-Name: David M. Blei
Author-X-Name-First: David M.
Author-X-Name-Last: Blei
Title: Conformal Sensitivity Analysis for Individual Treatment Effects
Abstract:
Estimating an individual treatment effect (ITE) is essential to personalized decision making. However, existing methods for estimating the ITE often rely on unconfoundedness, an assumption that is fundamentally untestable with observed data. To assess the robustness of individual-level causal conclusions to the unconfoundedness assumption, this article proposes a method for sensitivity analysis of the ITE, a way to estimate a range of the ITE under unobserved confounding. The method we develop quantifies unmeasured confounding through a marginal sensitivity model, and adapts the framework of conformal inference to estimate an ITE interval at a given confounding strength. In particular, we formulate this sensitivity analysis as a conformal inference problem under distribution shift, and we extend existing methods of covariate-shifted conformal inference to this more general setting. The resulting predictive interval has guaranteed nominal coverage of the ITE and provides this coverage with distribution-free and nonasymptotic guarantees. We evaluate the method on synthetic data and illustrate its application in an observational study. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 122-135
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2102503
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2102503
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:122-135
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2139265_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Yi Ding
Author-X-Name-First: Yi
Author-X-Name-Last: Ding
Author-Name: Yingying Li
Author-X-Name-First: Yingying
Author-X-Name-Last: Li
Author-Name: Rui Song
Author-X-Name-First: Rui
Author-X-Name-Last: Song
Title: Statistical Learning for Individualized Asset Allocation
Abstract:
We establish a high-dimensional statistical learning framework for individualized asset allocation. Our proposed methodology addresses continuous-action decision-making with a large number of characteristics. We develop a discretization approach to model the effect of continuous actions and allow the discretization frequency to be large and diverge with the number of observations. We estimate the value function of continuous actions using penalized regression with our proposed generalized penalties that are imposed on linear transformations of the model coefficients. We show that our proposed Discretization and Regression with generalized fOlded concaVe penalty on Effect discontinuity (DROVE) approach enjoys desirable theoretical properties and allows for statistical inference of the optimal value associated with optimal decision-making. Empirically, the proposed framework is exercised with the Health and Retirement Study data in finding individualized optimal asset allocation. The results show that our individualized optimal strategy improves the financial well-being of the population. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 639-649
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2139265
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2139265
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:639-649
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2140053_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Diego Morales-Navarrete
Author-X-Name-First: Diego
Author-X-Name-Last: Morales-Navarrete
Author-Name: Moreno Bevilacqua
Author-X-Name-First: Moreno
Author-X-Name-Last: Bevilacqua
Author-Name: Christian Caamaño-Carrillo
Author-X-Name-First: Christian
Author-X-Name-Last: Caamaño-Carrillo
Author-Name: Luis M. Castro
Author-X-Name-First: Luis M.
Author-X-Name-Last: Castro
Title: Modeling Point Referenced Spatial Count Data: A Poisson Process Approach
Abstract:
Random fields are useful mathematical tools for representing natural phenomena with complex dependence structures in space and/or time. In particular, the Gaussian random field is commonly used due to its attractive properties and mathematical tractability. However, this assumption seems to be restrictive when dealing with count data. To deal with this situation, we propose a random field with a Poisson marginal distribution considering a sequence of independent copies of a random field with an exponential marginal distribution as “inter-arrival times” in the counting renewal processes framework. Our proposal can be viewed as a spatial generalization of the Poisson counting process. Unlike the classical hierarchical Poisson Log-Gaussian model, our proposal generates a (non)-stationary random field that is mean square continuous and with Poisson marginal distributions. For the proposed Poisson spatial random field, analytic expressions for the covariance function and the bivariate distribution are provided. In an extensive simulation study, we investigate the weighted pairwise likelihood as a method for estimating the Poisson random field parameters. Finally, the effectiveness of our methodology is illustrated by an analysis of reindeer pellet-group survey data, where a zero-inflated version of the proposed model is compared with zero-inflated Poisson Log-Gaussian and Poisson Gaussian copula models. Supplementary materials for this article, including technical proofs and R code for reproducing the work, are available as an online supplement.
Journal: Journal of the American Statistical Association
Pages: 664-677
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2140053
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2140053
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:664-677
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2123336_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Ben Wu
Author-X-Name-First: Ben
Author-X-Name-Last: Wu
Author-Name: Ying Guo
Author-X-Name-First: Ying
Author-X-Name-Last: Guo
Author-Name: Jian Kang
Author-X-Name-First: Jian
Author-X-Name-Last: Kang
Title: Bayesian Spatial Blind Source Separation via the Thresholded Gaussian Process
Abstract:
Blind source separation (BSS) aims to separate latent source signals from their mixtures. For spatially dependent signals in high-dimensional and large-scale data, such as neuroimaging, most existing BSS methods do not take into account the spatial dependence and the sparsity of the latent source signals. To address these major limitations, we propose a Bayesian spatial blind source separation (BSP-BSS) approach for neuroimaging data analysis. We assume the expectation of the observed images as a linear mixture of multiple sparse and piece-wise smooth latent source signals, for which we construct a new class of Bayesian nonparametric prior models by thresholding Gaussian processes. We assign von Mises–Fisher (vMF) priors to the mixing coefficients in the model. Under some regularity conditions, we show that the proposed method has several desirable theoretical properties including the large support for the priors, the consistency of joint posterior distribution of the latent source intensity functions and the mixing coefficients, and the selection consistency on the number of latent sources. We use extensive simulation studies and an analysis of the resting-state fMRI data in the Autism Brain Imaging Data Exchange (ABIDE) study to demonstrate that BSP-BSS outperforms the existing method for separating latent brain networks and detecting activated brain regions in the latent sources. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 422-433
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2123336
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2123336
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:422-433
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2120401_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Zhen Miao
Author-X-Name-First: Zhen
Author-X-Name-Last: Miao
Author-Name: Weihao Kong
Author-X-Name-First: Weihao
Author-X-Name-Last: Kong
Author-Name: Ramya Korlakai Vinayak
Author-X-Name-First: Ramya Korlakai
Author-X-Name-Last: Vinayak
Author-Name: Wei Sun
Author-X-Name-First: Wei
Author-X-Name-Last: Sun
Author-Name: Fang Han
Author-X-Name-First: Fang
Author-X-Name-Last: Han
Title: Fisher-Pitman Permutation Tests Based on Nonparametric Poisson Mixtures with Application to Single Cell Genomics
Abstract:
This article investigates the theoretical and empirical performance of Fisher-Pitman-type permutation tests for assessing the equality of unknown Poisson mixture distributions. Building on nonparametric maximum likelihood estimators (NPMLEs) of the mixing distribution, these tests are theoretically shown to be able to adapt to complicated unspecified structures of count data and also consistent against their corresponding ANOVA-type alternatives; the latter is a result in parallel to classic claims made by Robinson. The studied methods are then applied to a single-cell RNA-seq data obtained from different cell types from brain samples of autism subjects and healthy controls; empirically, they unveil genes that are differentially expressed between autism and control subjects yet are missed using common tests. For justifying their use, rate optimality of NPMLEs is also established in settings similar to nonparametric Gaussian (Wu and Yang) and binomial mixtures (Tian, Kong, and Valiant; Vinayak et al.). Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 394-406
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2120401
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2120401
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:394-406
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2119983_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Jun Tao
Author-X-Name-First: Jun
Author-X-Name-Last: Tao
Author-Name: Bing Li
Author-X-Name-First: Bing
Author-X-Name-Last: Li
Author-Name: Lingzhou Xue
Author-X-Name-First: Lingzhou
Author-X-Name-Last: Xue
Title: An Additive Graphical Model for Discrete Data
Abstract:
We introduce a nonparametric graphical model for discrete node variables based on additive conditional independence. Additive conditional independence is a three-way statistical relation that shares similar properties with conditional independence by satisfying the semi-graphoid axioms. Based on this relation we build an additive graphical model for discrete variables that does not suffer from the restriction of a parametric model such as the Ising model. We develop an estimator of the new graphical model via the penalized estimation of the discrete version of the additive precision operator and establish the consistency of the estimator under the ultrahigh-dimensional setting. Along with these methodological developments, we also exploit the properties of discrete random variables to uncover a deeper relation between additive conditional independence and conditional independence than previously known. The new graphical model reduces to a conditional independence graphical model under certain sparsity conditions. We conduct simulation experiments and analysis of an HIV antiretroviral therapy dataset to compare the new method with existing ones. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 368-381
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2119983
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2119983
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:368-381
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2142591_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Davide Viviano
Author-X-Name-First: Davide
Author-X-Name-Last: Viviano
Author-Name: Jelena Bradic
Author-X-Name-First: Jelena
Author-X-Name-Last: Bradic
Title: Fair Policy Targeting
Abstract:
One of the major concerns of targeting interventions on individuals in social welfare programs is discrimination: individualized treatments may induce disparities across sensitive attributes such as age, gender, or race. This article addresses the question of the design of fair and efficient treatment allocation rules. We adopt the nonmaleficence perspective of “first do no harm”: we select the fairest allocation within the Pareto frontier. We cast the optimization into a mixed-integer linear program formulation, which can be solved using off-the-shelf algorithms. We derive regret bounds on the unfairness of the estimated policy function and small sample guarantees on the Pareto frontier under general notions of fairness. Finally, we illustrate our method using an application from education economics. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 730-743
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2142591
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2142591
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:730-743
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2123813_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Zheng Tracy Ke
Author-X-Name-First: Zheng Tracy
Author-X-Name-Last: Ke
Author-Name: Minzhe Wang
Author-X-Name-First: Minzhe
Author-X-Name-Last: Wang
Title: Using SVD for Topic Modeling
Abstract:
The probabilistic topic model imposes a low-rank structure on the expectation of the corpus matrix. Therefore, singular value decomposition (SVD) is a natural tool of dimension reduction. We propose an SVD-based method for estimating a topic model. Our method constructs an estimate of the topic matrix from only a few leading singular vectors of the data matrix, and has a great advantage in memory use and computational cost for large-scale corpora. The core ideas behind our method include a pre-SVD normalization to tackle severe word frequency heterogeneity, a post-SVD normalization to create a low-dimensional word embedding that manifests a simplex geometry, and a post-SVD procedure to construct an estimate of the topic matrix directly from the embedded word cloud. We provide the explicit rate of convergence of our method. We show that our method attains the optimal rate in the case of long and moderately long documents, and it improves the rates of existing methods in the case of short documents. The key of our analysis is a sharp row-wise large-deviation bound for empirical singular vectors, which is technically demanding to derive and potentially useful for other problems. We apply our method to a corpus of Associated Press news articles and a corpus of abstracts of statistical papers. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 434-449
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2123813
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2123813
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:434-449
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2110877_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Chun-Hao Yang
Author-X-Name-First: Chun-Hao
Author-X-Name-Last: Yang
Author-Name: Hani Doss
Author-X-Name-First: Hani
Author-X-Name-Last: Doss
Author-Name: Baba C. Vemuri
Author-X-Name-First: Baba C.
Author-X-Name-Last: Vemuri
Title: An Empirical Bayes Approach to Shrinkage Estimation on the Manifold of Symmetric Positive-Definite Matrices
Abstract:
The James–Stein estimator is an estimator of the multivariate normal mean and dominates the maximum likelihood estimator (MLE) under squared error loss. The original work inspired great interest in developing shrinkage estimators for a variety of problems. Nonetheless, research on shrinkage estimation for manifold-valued data is scarce. In this article, we propose shrinkage estimators for the parameters of the Log-Normal distribution defined on the manifold of N × N symmetric positive-definite matrices. For this manifold, we choose the Log-Euclidean metric as its Riemannian metric since it is easy to compute and has been widely used in a variety of applications. By using the Log-Euclidean distance in the loss function, we derive a shrinkage estimator in an analytic form and show that it is asymptotically optimal within a large class of estimators that includes the MLE, which is the sample Fréchet mean of the data. We demonstrate the performance of the proposed shrinkage estimator via several simulated data experiments. Additionally, we apply the shrinkage estimator to perform statistical inference in both diffusion and functional magnetic resonance imaging problems. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 259-272
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2110877
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2110877
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:259-272
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2287599_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Ting Ye
Author-X-Name-First: Ting
Author-X-Name-Last: Ye
Title: Fundamentals of Causal Inference: With R
Journal: Journal of the American Statistical Association
Pages: 790-791
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2023.2287599
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2287599
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:790-791
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2104727_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Surya T. Tokdar
Author-X-Name-First: Surya T.
Author-X-Name-Last: Tokdar
Author-Name: Sheng Jiang
Author-X-Name-First: Sheng
Author-X-Name-Last: Jiang
Author-Name: Erika L. Cunningham
Author-X-Name-First: Erika L.
Author-X-Name-Last: Cunningham
Title: Heavy-Tailed Density Estimation
Abstract:
A novel statistical method is proposed and investigated for estimating a heavy-tailed density under mild smoothness assumptions. Statistical analyses of heavy-tailed distributions are susceptible to the problem of sparse information in the tail of the distribution getting washed away by unrelated features of a hefty bulk. The proposed Bayesian method avoids this problem by incorporating smoothness and tail regularization through a carefully specified semiparametric prior distribution, and is able to consistently estimate both the density function and its tail index at near minimax optimal rates of contraction. A joint, likelihood driven estimation of the bulk and the tail is shown to help improve uncertainty assessment in estimating the tail index parameter and offer more accurate and reliable estimates of the high tail quantiles compared to thresholding methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 163-175
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2104727
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2104727
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:163-175
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2115375_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Emre Demirkaya
Author-X-Name-First: Emre
Author-X-Name-Last: Demirkaya
Author-Name: Yingying Fan
Author-X-Name-First: Yingying
Author-X-Name-Last: Fan
Author-Name: Lan Gao
Author-X-Name-First: Lan
Author-X-Name-Last: Gao
Author-Name: Jinchi Lv
Author-X-Name-First: Jinchi
Author-X-Name-Last: Lv
Author-Name: Patrick Vossler
Author-X-Name-First: Patrick
Author-X-Name-Last: Vossler
Author-Name: Jingbo Wang
Author-X-Name-First: Jingbo
Author-X-Name-Last: Wang
Title: Optimal Nonparametric Inference with Two-Scale Distributional Nearest Neighbors
Abstract:
The weighted nearest neighbors (WNN) estimator has been popularly used as a flexible and easy-to-implement nonparametric tool for mean regression estimation. The bagging technique is an elegant way to form WNN estimators with weights automatically generated to the nearest neighbors (Steele 2009; Biau, Cérou, and Guyader 2010); we name the resulting estimator as the distributional nearest neighbors (DNN) for easy reference. Yet, there is a lack of distributional results for such an estimator, limiting its application to statistical inference. Moreover, when the mean regression function has higher-order smoothness, DNN does not achieve the optimal nonparametric convergence rate, mainly because of the bias issue. In this work, we provide an in-depth technical analysis of the DNN, based on which we suggest a bias reduction approach for the DNN estimator by linearly combining two DNN estimators with different subsampling scales, resulting in the novel two-scale DNN (TDNN) estimator. The two-scale DNN estimator has an equivalent representation of WNN with weights admitting explicit forms and some being negative. We prove that, thanks to the use of negative weights, the two-scale DNN estimator enjoys the optimal nonparametric rate of convergence in estimating the regression function under the fourth-order smoothness condition. We further go beyond estimation and establish that the DNN and two-scale DNN are both asymptotically normal as the subsampling scales and sample size diverge to infinity. For the practical implementation, we also provide variance estimators and a distribution estimator using the jackknife and bootstrap techniques for the two-scale DNN. These estimators can be exploited for constructing valid confidence intervals for nonparametric inference of the regression function. The theoretical results and appealing finite-sample performance of the suggested two-scale DNN method are illustrated with several simulation examples and a real data application.
Journal: Journal of the American Statistical Association
Pages: 297-307
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2115375
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2115375
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:297-307
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2258595_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Daniel Mork
Author-X-Name-First: Daniel
Author-X-Name-Last: Mork
Author-Name: Marianthi-Anna Kioumourtzoglou
Author-X-Name-First: Marianthi-Anna
Author-X-Name-Last: Kioumourtzoglou
Author-Name: Marc Weisskopf
Author-X-Name-First: Marc
Author-X-Name-Last: Weisskopf
Author-Name: Brent A. Coull
Author-X-Name-First: Brent A.
Author-X-Name-Last: Coull
Author-Name: Ander Wilson
Author-X-Name-First: Ander
Author-X-Name-Last: Wilson
Title: Heterogeneous Distributed Lag Models to Estimate Personalized Effects of Maternal Exposures to Air Pollution
Abstract:
Children’s health studies support an association between maternal environmental exposures and children’s birth outcomes. A common goal is to identify critical windows of susceptibility: periods during gestation with increased association between maternal exposures and a future outcome. The timing of the critical windows and magnitude of the associations are likely heterogeneous across different levels of individual, family, and neighborhood characteristics. Using an administrative Colorado birth cohort, we estimate the individualized relationship between weekly exposures to fine particulate matter (PM2.5) during gestation and birth weight. To achieve this goal, we propose a statistical learning method combining distributed lag models and Bayesian additive regression trees to estimate critical windows at the individual level and identify characteristics that induce heterogeneity from a high-dimensional set of potential modifying factors. We find evidence of heterogeneity in the PM2.5–birth weight relationship, with some mother–child dyads showing a three times larger decrease in birth weight for an IQR increase in exposure (5.9–8.5 μg/m3 PM2.5) compared to the population average. Specifically, we find increased vulnerability for non-Hispanic mothers who are younger, have a higher body mass index, or have lower educational attainment. Our case study is the first precision health study of critical windows. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 14-26
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2023.2258595
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2258595
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:14-26
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2133719_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Tucker McElroy
Author-X-Name-First: Tucker
Author-X-Name-Last: McElroy
Author-Name: Dimitris N. Politis
Author-X-Name-First: Dimitris N.
Author-X-Name-Last: Politis
Title: Estimating the Spectral Density at Frequencies Near Zero
Abstract:
Estimating the spectral density function f(ω) for some ω∈[−π,π] has been traditionally performed by kernel smoothing the periodogram and related techniques. Kernel smoothing is tantamount to local averaging, that is, approximating f(ω) by a constant over a window of small width. Although f(ω) is uniformly continuous and periodic with period 2π, in this article we recognize the fact that ω = 0 effectively acts as a boundary point in the underlying kernel smoothing problem, and the same is true for ω=±π. It is well-known that local averaging may be suboptimal in kernel regression at (or near) a boundary point. As an alternative, we propose a local polynomial regression of the periodogram or log-periodogram when ω is at (or near) the points 0 or ±π. The case ω = 0 is of particular importance since f(0) is the large-sample variance of the sample mean; hence, estimating f(0) is crucial in order to conduct any sort of inference on the mean. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 612-624
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2133719
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2133719
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:612-624
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2106234_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Ganggang Xu
Author-X-Name-First: Ganggang
Author-X-Name-Last: Xu
Author-Name: Jingfei Zhang
Author-X-Name-First: Jingfei
Author-X-Name-Last: Zhang
Author-Name: Yehua Li
Author-X-Name-First: Yehua
Author-X-Name-Last: Li
Author-Name: Yongtao Guan
Author-X-Name-First: Yongtao
Author-X-Name-Last: Guan
Title: Bias-Correction and Test for Mark-Point Dependence with Replicated Marked Point Processes
Abstract:
Mark-point dependence plays a critical role in research problems that can be fitted into the general framework of marked point processes. In this work, we focus on adjusting for mark-point dependence when estimating the mean and covariance functions of the mark process, given independent replicates of the marked point process. We assume that the mark process is a Gaussian process and the point process is a log-Gaussian Cox process, where the mark-point dependence is generated through the dependence between two latent Gaussian processes. Under this framework, naive local linear estimators ignoring the mark-point dependence can be severely biased. We show that this bias can be corrected using a local linear estimator of the cross-covariance function and establish uniform convergence rates of the bias-corrected estimators. Furthermore, we propose a test statistic based on local linear estimators for mark-point independence, which is shown to converge to an asymptotic normal distribution at a parametric √n convergence rate. Model diagnostics tools are developed for key model assumptions and a robust functional permutation test is proposed for a more general class of mark-point processes. The effectiveness of the proposed methods is demonstrated using extensive simulations and applications to two real data examples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 217-231
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2106234
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2106234
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:217-231
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2126362_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Stijn Vansteelandt
Author-X-Name-First: Stijn
Author-X-Name-Last: Vansteelandt
Author-Name: Oliver Dukes
Author-X-Name-First: Oliver
Author-X-Name-Last: Dukes
Author-Name: Kelly Van Lancker
Author-X-Name-First: Kelly
Author-X-Name-Last: Van Lancker
Author-Name: Torben Martinussen
Author-X-Name-First: Torben
Author-X-Name-Last: Martinussen
Title: Assumption-Lean Cox Regression
Abstract:
Inference for the conditional association between an exposure and a time-to-event endpoint, given covariates, is routinely based on partial likelihood estimators for hazard ratios indexing Cox proportional hazards models. This approach is flexible and makes testing straightforward, but is nonetheless not entirely satisfactory. First, there is no good understanding of what it infers when the model is misspecified. Second, it is common to employ variable selection procedures when deciding which model to use. However, the bias and uncertainty that imperfect variable selection adds to the analysis is rarely acknowledged, rendering standard inferences biased and overly optimistic. To remedy this, we propose a nonparametric estimand which reduces to the main exposure effect parameter in a (partially linear) Cox model when that model is correct, but continues to capture the (conditional) association of interest in a well understood way, even when this model is misspecified in an arbitrary manner. We achieve an assumption-lean inference for this estimand based on its influence function under the nonparametric model. This has the further advantage that it makes the proposed approach amenable to the use of data-adaptive procedures (e.g., variable selection, machine learning), which we find to work well in simulation studies and a data analysis. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 475-484
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2126362
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2126362
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:475-484
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2118601_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Jing Zeng
Author-X-Name-First: Jing
Author-X-Name-Last: Zeng
Author-Name: Qing Mai
Author-X-Name-First: Qing
Author-X-Name-Last: Mai
Author-Name: Xin Zhang
Author-X-Name-First: Xin
Author-X-Name-Last: Zhang
Title: Subspace Estimation with Automatic Dimension and Variable Selection in Sufficient Dimension Reduction
Abstract:
Sufficient dimension reduction (SDR) methods target finding lower-dimensional representations of a multivariate predictor to preserve all the information about the conditional distribution of the response given the predictor. The reduction is commonly achieved by projecting the predictor onto a low-dimensional subspace. The smallest such subspace is known as the Central Subspace (CS) and is the key parameter of interest for most SDR methods. In this article, we propose a unified and flexible framework for estimating the CS in high dimensions. Our approach generalizes a wide range of model-based and model-free SDR methods to high-dimensional settings, where the CS is assumed to involve only a subset of the predictors. We formulate the problem as a quadratic convex optimization so that the global solution is feasible. The proposed estimation procedure simultaneously achieves the structural dimension selection and coordinate-independent variable selection of the CS. Theoretically, our method achieves dimension selection, variable selection, and subspace estimation consistency at a high convergence rate under mild conditions. We demonstrate the effectiveness and efficiency of our method with extensive simulation studies and real data examples. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 343-355
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2118601
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2118601
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:343-355
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2293811_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Raymond K. W. Wong
Author-X-Name-First: Raymond K. W.
Author-X-Name-Last: Wong
Title: Handbook of Matching and Weighting Adjustments for Causal Inference
Journal: Journal of the American Statistical Association
Pages: 791-791
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2023.2293811
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2293811
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:791-791
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2102502_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Monica Billio
Author-X-Name-First: Monica
Author-X-Name-Last: Billio
Author-Name: Roberto Casarin
Author-X-Name-First: Roberto
Author-X-Name-Last: Casarin
Author-Name: Matteo Iacopini
Author-X-Name-First: Matteo
Author-X-Name-Last: Iacopini
Title: Bayesian Markov-Switching Tensor Regression for Time-Varying Networks
Abstract:
Modeling time series of multilayer network data is challenging due to the peculiar characteristics of real-world networks, such as sparsity and abrupt structural changes. Moreover, the impact of external factors on the network edges is highly heterogeneous due to edge- and time-specific effects. Capturing all these features results in a very high-dimensional inference problem. A novel tensor-on-tensor regression model is proposed, which integrates zero-inflated logistic regression to deal with the sparsity, and Markov-switching coefficients to account for structural changes. A tensor representation and decomposition of the regression coefficients are used to tackle the high-dimensionality and account for the heterogeneous impact of the covariate tensor across the response variables. The inference is performed following a Bayesian approach, and an efficient Gibbs sampler is developed for posterior approximation. Our methodology applied to financial and email networks detects different connectivity regimes and uncovers the role of covariates in the edge-formation process, which are relevant in risk and resource management. Code is available on GitHub. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 109-121
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2102502
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2102502
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:109-121
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2110878_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Chengchun Shi
Author-X-Name-First: Chengchun
Author-X-Name-Last: Shi
Author-Name: Jin Zhu
Author-X-Name-First: Jin
Author-X-Name-Last: Zhu
Author-Name: Shen Ye
Author-X-Name-First: Shen
Author-X-Name-Last: Ye
Author-Name: Shikai Luo
Author-X-Name-First: Shikai
Author-X-Name-Last: Luo
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Author-Name: Rui Song
Author-X-Name-First: Rui
Author-X-Name-Last: Song
Title: Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process
Abstract:
This article is concerned with constructing a confidence interval for a target policy’s value offline based on pre-collected observational data in infinite horizon settings. Most of the existing works assume no unmeasured variables exist that confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technological industries. In this article, we show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy’s value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provide rigorous uncertainty quantification. Our method is justified by theoretical results, simulated and real datasets obtained from ridesharing companies. A Python implementation of the proposed procedure is available at https://github.com/Mamba413/cope.
Journal: Journal of the American Statistical Association
Pages: 273-284
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2110878
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2110878
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:273-284
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2261184_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Xingche Guo
Author-X-Name-First: Xingche
Author-X-Name-Last: Guo
Author-Name: Donglin Zeng
Author-X-Name-First: Donglin
Author-X-Name-Last: Zeng
Author-Name: Yuanjia Wang
Author-X-Name-First: Yuanjia
Author-X-Name-Last: Wang
Title: A Semiparametric Inverse Reinforcement Learning Approach to Characterize Decision Making for Mental Disorders
Abstract:
Major depressive disorder (MDD) is one of the leading causes of disability-adjusted life years. Emerging evidence indicates the presence of reward processing abnormalities in MDD. An important scientific question is whether the abnormalities are due to reduced sensitivity to received rewards or reduced learning ability. Motivated by the probabilistic reward task (PRT) experiment in the EMBARC study, we propose a semiparametric inverse reinforcement learning (RL) approach to characterize the reward-based decision-making of MDD patients. The model assumes that a subject’s decision-making process is updated based on a reward prediction error weighted by the subject-specific learning rate. To account for the fact that one favors a decision leading to a potentially high reward, but this decision process is not necessarily linear, we model reward sensitivity with a nondecreasing and nonlinear function. For inference, we estimate the latter via approximation by I-splines and then maximize the joint conditional log-likelihood. We show that the resulting estimators are consistent and asymptotically normal. Through extensive simulation studies, we demonstrate that under different reward-generating distributions, the semiparametric inverse RL outperforms the parametric inverse RL. We apply the proposed method to EMBARC and find that MDD and control groups have similar learning rates but different reward sensitivity functions. There is strong statistical evidence that reward sensitivity functions have nonlinear forms. Using additional brain imaging data in the same study, we find that both reward sensitivity and learning rate are associated with brain activities in the negative affect circuitry under an emotional conflict task. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 27-38
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2023.2261184
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2261184
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:27-38
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2286293_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Giacomo Bormetti
Author-X-Name-First: Giacomo
Author-X-Name-Last: Bormetti
Title: Stable Lévy Processes via Lamperti-Type Representations
Journal: Journal of the American Statistical Association
Pages: 789-790
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2023.2286293
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2286293
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:789-790
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2126782_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Jason M. Klusowski
Author-X-Name-First: Jason M.
Author-X-Name-Last: Klusowski
Author-Name: Peter M. Tian
Author-X-Name-First: Peter M.
Author-X-Name-Last: Tian
Title: Large Scale Prediction with Decision Trees
Abstract:
This article shows that decision trees constructed with Classification and Regression Trees (CART) and C4.5 methodology are consistent for regression and classification tasks, even when the number of predictor variables grows sub-exponentially with the sample size, under natural ℓ0-norm and ℓ1-norm sparsity constraints. The theory applies to a wide range of models, including (ordinary or logistic) additive regression models with component functions that are continuous, of bounded variation, or, more generally, Borel measurable. Consistency holds for arbitrary joint distributions of the predictor variables, thereby accommodating continuous, discrete, and/or dependent data. Finally, we show that these qualitative properties of individual trees are inherited by Breiman’s random forests. A key step in the analysis is the establishment of an oracle inequality, which allows for a precise characterization of the goodness of fit and complexity tradeoff for a mis-specified model. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 525-537
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2126782
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2126782
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:525-537
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2102985_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Hanzhong Liu
Author-X-Name-First: Hanzhong
Author-X-Name-Last: Liu
Author-Name: Jiyang Ren
Author-X-Name-First: Jiyang
Author-X-Name-Last: Ren
Author-Name: Yuehan Yang
Author-X-Name-First: Yuehan
Author-X-Name-Last: Yang
Title: Randomization-based Joint Central Limit Theorem and Efficient Covariate Adjustment in Randomized Block 2^K Factorial Experiments
Abstract:
Randomized block factorial experiments are widely used in industrial engineering, clinical trials, and social science. Researchers often use a linear model and analysis of covariance to analyze experimental results; however, limited studies have addressed the validity and robustness of the resulting inferences because assumptions for a linear model might not be justified by randomization in randomized block factorial experiments. In this article, we establish a new finite population joint central limit theorem for usual (unadjusted) factorial effect estimators in randomized block 2^K factorial experiments. Our theorem is obtained under a randomization-based inference framework, making use of an extension of the vector form of the Wald–Wolfowitz–Hoeffding theorem for a linear rank statistic. It is robust to model misspecification, numbers of blocks, block sizes, and propensity scores across blocks. To improve the estimation and inference efficiency, we propose four covariate adjustment methods. We show that under mild conditions, the resulting covariate-adjusted factorial effect estimators are consistent, jointly asymptotically normal, and generally more efficient than the unadjusted estimator. In addition, we propose Neyman-type conservative estimators for the asymptotic covariances to facilitate valid inferences. Simulation studies and a clinical trial data analysis demonstrate the benefits of the covariate adjustment methods. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 136-150
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2102985
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2102985
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:136-150
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2128807_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Matteo Barigozzi
Author-X-Name-First: Matteo
Author-X-Name-Last: Barigozzi
Author-Name: Giuseppe Cavaliere
Author-X-Name-First: Giuseppe
Author-X-Name-Last: Cavaliere
Author-Name: Lorenzo Trapani
Author-X-Name-First: Lorenzo
Author-X-Name-Last: Trapani
Title: Inference in Heavy-Tailed Nonstationary Multivariate Time Series
Abstract:
We study inference on the common stochastic trends in a nonstationary, N-variate time series yt, in the possible presence of heavy tails. We propose a novel methodology which does not require any knowledge or estimation of the tail index, or even knowledge as to whether certain moments (such as the variance) exist or not, and develop an estimator of the number of stochastic trends m based on the eigenvalues of the sample second moment matrix of yt. We study the rates of such eigenvalues, showing that the first m ones diverge, as the sample size T passes to infinity, at a rate faster by O(T) than the remaining N – m ones, irrespective of the tail index. We thus exploit this eigen-gap by constructing, for each eigenvalue, a test statistic which diverges to positive infinity or drifts to zero according to whether the relevant eigenvalue belongs to the set of the first m eigenvalues or not. We then construct a randomized statistic based on this, using it as part of a sequential testing procedure, ensuring consistency of the resulting estimator of m. We also discuss an estimator of the common trends based on principal components and show that, up to an invertible linear transformation, such an estimator is consistent in the sense that the estimation error is of smaller order than the trend itself. Importantly, we present the case in which we relax the standard assumption of iid innovations, by allowing for heterogeneity of a very general form in the scale of the innovations. Finally, we develop an extension to the large dimensional case. A Monte Carlo study shows that the proposed estimator for m performs particularly well, even in samples of small size. We complete the article by presenting two illustrative applications covering commodity prices and interest rates data. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 565-581
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2128807
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2128807
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:565-581
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2276742_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Insuk Seo
Author-X-Name-First: Insuk
Author-X-Name-Last: Seo
Title: Martingale Methods in Statistics
Journal: Journal of the American Statistical Association
Pages: 787-789
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2023.2276742
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2276742
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:787-789
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2102986_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Wei Ma
Author-X-Name-First: Wei
Author-X-Name-Last: Ma
Author-Name: Ping Li
Author-X-Name-First: Ping
Author-X-Name-Last: Li
Author-Name: Li-Xin Zhang
Author-X-Name-First: Li-Xin
Author-X-Name-Last: Zhang
Author-Name: Feifang Hu
Author-X-Name-First: Feifang
Author-X-Name-Last: Hu
Title: A New and Unified Family of Covariate Adaptive Randomization Procedures and Their Properties
Abstract:
In clinical trials and other comparative studies, covariate balance is crucial for credible and efficient assessment of treatment effects. Covariate adaptive randomization (CAR) procedures are extensively used to reduce the likelihood of covariate imbalances occurring. In the literature, most studies have focused on balancing of discrete covariates. Applications of CAR with continuous covariates remain rare, especially when the interest goes beyond balancing only the first moment. In this article, we propose a family of CAR procedures that can balance general covariate features, such as quadratic and interaction terms. Our framework not only unifies many existing methods, but also introduces a much broader class of new and useful CAR procedures. We show that the proposed procedures have superior balancing properties; in particular, the convergence rate of imbalance vectors is O_P(n^ϵ) for any ϵ>0 if all of the moments are finite for the covariate features, relative to O_P(√n) under complete randomization, where n is the sample size. Both the resulting convergence rate and its proof are novel. These favorable balancing properties lead to increased precision of treatment effect estimation in the presence of nonlinear covariate effects. The framework is applied to balance covariate means and covariance matrices simultaneously. Simulation and empirical studies demonstrate the excellent and robust performance of the proposed procedures. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 151-162
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2102986
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2102986
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:151-162
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2140054_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Jolien Ponnet
Author-X-Name-First: Jolien
Author-X-Name-Last: Ponnet
Author-Name: Pieter Segaert
Author-X-Name-First: Pieter
Author-X-Name-Last: Segaert
Author-Name: Stefan Van Aelst
Author-X-Name-First: Stefan
Author-X-Name-Last: Van Aelst
Author-Name: Tim Verdonck
Author-X-Name-First: Tim
Author-X-Name-Last: Verdonck
Title: Robust Inference and Modeling of Mean and Dispersion for Generalized Linear Models
Abstract:
Generalized Linear Models (GLMs) are a popular class of regression models when the responses follow a distribution in the exponential family. In real data the variability often deviates from the relation imposed by the exponential family distribution, which results in over- or underdispersion. Dispersion effects may even vary in the data. Such datasets do not follow the traditional GLM distributional assumptions, leading to unreliable inference. Therefore, the family of double exponential distributions has been proposed, which models both the mean and the dispersion as a function of covariates in the GLM framework. Since standard maximum likelihood inference is highly susceptible to the possible presence of outliers, we propose the robust double exponential (RDE) estimator. Asymptotic properties and robustness of the RDE estimator are discussed. A generalized robust quasi-deviance measure is introduced which constitutes the basis for a stable robust test. Simulations for binomial and Poisson models show the excellent performance of the RDE estimator and corresponding robust tests. Penalized versions of the RDE estimator are developed for sparse estimation with high-dimensional data and for flexible estimation via generalized additive models (GAMs). Real data applications illustrate the relevance of robust inference for dispersion effects in GLMs and GAMs. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 678-689
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2140054
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2140054
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:678-689
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2144737_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Xiao Wu
Author-X-Name-First: Xiao
Author-X-Name-Last: Wu
Author-Name: Fabrizia Mealli
Author-X-Name-First: Fabrizia
Author-X-Name-Last: Mealli
Author-Name: Marianthi-Anna Kioumourtzoglou
Author-X-Name-First: Marianthi-Anna
Author-X-Name-Last: Kioumourtzoglou
Author-Name: Francesca Dominici
Author-X-Name-First: Francesca
Author-X-Name-Last: Dominici
Author-Name: Danielle Braun
Author-X-Name-First: Danielle
Author-X-Name-Last: Braun
Title: Matching on Generalized Propensity Scores with Continuous Exposures
Abstract:
In the context of a binary treatment, matching is a well-established approach in causal inference. However, in the context of a continuous treatment or exposure, matching is still underdeveloped. We propose an innovative matching approach to estimate an average causal exposure-response function under the setting of continuous exposures that relies on the generalized propensity score (GPS). Our approach maintains the following attractive features of matching: (a) clear separation between the design and the analysis; (b) robustness to model misspecification or to the presence of extreme values of the estimated GPS; (c) straightforward assessments of covariate balance. We first introduce an assumption of identifiability, called local weak unconfoundedness. Under this assumption and mild smoothness conditions, we provide theoretical guarantees that our proposed matching estimator attains point-wise consistency and asymptotic normality. In simulations, our proposed matching approach outperforms existing methods under settings with model misspecification or in the presence of extreme values of the estimated GPS. We apply our proposed method to estimate the average causal exposure-response function between long-term PM2.5 exposure and all-cause mortality among 68.5 million Medicare enrollees, 2000–2016. We found strong evidence of a harmful effect of long-term PM2.5 exposure on mortality. Code for the proposed matching approach is provided in the CausalGPS R package, which is available on CRAN and provides a computationally efficient implementation. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 757-772
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2144737
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2144737
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:757-772
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2142592_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Wanjun Liu
Author-X-Name-First: Wanjun
Author-X-Name-Last: Liu
Author-Name: Xiufan Yu
Author-X-Name-First: Xiufan
Author-X-Name-Last: Yu
Author-Name: Wei Zhong
Author-X-Name-First: Wei
Author-X-Name-Last: Zhong
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Title: Projection Test for Mean Vector in High Dimensions
Abstract:
This article studies the projection test for high-dimensional mean vectors via optimal projection. The idea of the projection test is to project high-dimensional data onto a space of low dimension so that traditional methods can be applied. We first propose a new estimator of the optimal projection direction by solving a constrained and regularized quadratic programming problem. Then two tests are constructed using the estimated optimal projection direction. The first one is based on a data-splitting procedure, which achieves an exact t-test under the normality assumption. To mitigate the power loss due to data-splitting, we further propose an online framework, which iteratively updates the estimate of the projection direction as new observations arrive. We show that this online-style projection test asymptotically converges to the standard normal distribution. Various simulation studies as well as a real data example show that the proposed online-style projection test retains the Type I error rate well and is more powerful than other existing tests. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 744-756
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2142592
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2142592
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:744-756
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2131557_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Elizabeth L. Ogburn
Author-X-Name-First: Elizabeth L.
Author-X-Name-Last: Ogburn
Author-Name: Oleg Sofrygin
Author-X-Name-First: Oleg
Author-X-Name-Last: Sofrygin
Author-Name: Iván Díaz
Author-X-Name-First: Iván
Author-X-Name-Last: Díaz
Author-Name: Mark J. van der Laan
Author-X-Name-First: Mark J.
Author-X-Name-Last: van der Laan
Title: Causal Inference for Social Network Data
Abstract:
We describe semiparametric estimation and inference for causal effects using observational data from a single social network. Our asymptotic results are the first to allow for dependence of each observation on a growing number of other units as sample size increases. In addition, while previous methods have implicitly permitted only one of two possible sources of dependence among social network observations, we allow for both dependence due to transmission of information across network ties and for dependence due to latent similarities among nodes sharing ties. We propose new causal effects that are specifically of interest in social network settings, such as interventions on network ties and network structure. We use our methods to reanalyze an influential and controversial study that estimated causal peer effects of obesity using social network data from the Framingham Heart Study; after accounting for network structure we find no evidence for causal peer effects. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 597-611
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2131557
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2131557
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:597-611
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2105704_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Seyoung Park
Author-X-Name-First: Seyoung
Author-X-Name-Last: Park
Author-Name: Eun Ryung Lee
Author-X-Name-First: Eun Ryung
Author-X-Name-Last: Lee
Author-Name: Hongyu Zhao
Author-X-Name-First: Hongyu
Author-X-Name-Last: Zhao
Title: Low-Rank Regression Models for Multiple Binary Responses and their Applications to Cancer Cell-Line Encyclopedia Data
Abstract:
In this article, we study high-dimensional multivariate logistic regression models in which a common set of covariates is used to predict multiple binary outcomes simultaneously. Our work is primarily motivated by many biomedical studies with correlated multiple responses, such as the cancer cell-line encyclopedia project. We assume that the underlying regression coefficient matrix is simultaneously low-rank and row-wise sparse. We propose an intuitively appealing selection and estimation framework based on marginal model likelihood, and we develop an efficient computational algorithm for inference. We establish a novel high-dimensional theory for this nonlinear multivariate regression. Our theory is general, allowing for potential correlations between the binary responses. We propose a new type of nuclear norm penalty using the smooth clipped absolute deviation, filling the gap in the related non-convex penalization literature. We theoretically demonstrate that the proposed approach improves estimation accuracy by considering multiple responses jointly through the proposed estimator when the underlying coefficient matrix is low-rank and row-wise sparse. In particular, we establish the non-asymptotic error bounds, and both rank and row support consistency of the proposed method. Moreover, we develop a consistent rule to simultaneously select the rank and row dimension of the coefficient matrix. Furthermore, we extend the proposed methods and theory to a joint Ising model, which accounts for the dependence relationships. In our analysis of both simulated data and the cancer cell line encyclopedia data, the proposed methods outperform existing methods in predicting responses. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 202-216
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2105704
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2105704
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:202-216
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2126361_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Ivo V. Stoepker
Author-X-Name-First: Ivo V.
Author-X-Name-Last: Stoepker
Author-Name: Rui M. Castro
Author-X-Name-First: Rui M.
Author-X-Name-Last: Castro
Author-Name: Ery Arias-Castro
Author-X-Name-First: Ery
Author-X-Name-Last: Arias-Castro
Author-Name: Edwin van den Heuvel
Author-X-Name-First: Edwin
Author-X-Name-Last: van den Heuvel
Title: Anomaly Detection for a Large Number of Streams: A Permutation-Based Higher Criticism Approach
Abstract:
Anomaly detection when observing a large number of data streams is essential in a variety of applications, ranging from epidemiological studies to monitoring of complex systems. High-dimensional scenarios are usually tackled with scan-statistics and related methods, requiring stringent modeling assumptions for proper calibration. In this work we take a nonparametric stance, and propose a permutation-based variant of the higher criticism statistic not requiring knowledge of the null distribution. This results in an exact test in finite samples, which is asymptotically optimal in the wide class of exponential models. We demonstrate that the power loss in finite samples is minimal with respect to the oracle test. Furthermore, since the proposed statistic does not rely on asymptotic approximations, it typically performs better than popular variants of higher criticism that rely on such approximations. We include recommendations such that the test can be readily applied in practice, and demonstrate its applicability in monitoring the content uniformity of an active ingredient for a batch-produced drug product. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 461-474
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2126361
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2126361
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:461-474
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2129059_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Shuang Zhou
Author-X-Name-First: Shuang
Author-X-Name-Last: Zhou
Author-Name: Pallavi Ray
Author-X-Name-First: Pallavi
Author-X-Name-Last: Ray
Author-Name: Debdeep Pati
Author-X-Name-First: Debdeep
Author-X-Name-Last: Pati
Author-Name: Anirban Bhattacharya
Author-X-Name-First: Anirban
Author-X-Name-Last: Bhattacharya
Title: A Mass-Shifting Phenomenon of Truncated Multivariate Normal Priors
Abstract:
We show that lower-dimensional marginal densities of dependent zero-mean normal distributions truncated to the positive orthant exhibit a mass-shifting phenomenon. Despite the truncated multivariate normal density having a mode at the origin, the marginal density assigns increasingly small mass near the origin as the dimension increases. The phenomenon accentuates with stronger correlation between the random variables. This surprising behavior has serious implications toward Bayesian constrained estimation and inference, where the prior, in addition to having a full support, is required to assign a substantial probability near the origin to capture flat parts of the true function of interest. A precise quantification of the mass-shifting phenomenon for both the prior and the posterior, characterizing the role of the dimension as well as the dependence, is provided under a variety of correlation structures. We show that, without further modification, truncated normal priors are not suitable for modeling flat regions, and we propose a novel alternative strategy based on shrinking the coordinates using a multiplicative scale parameter. The proposed shrinkage prior is shown to achieve optimal posterior contraction around true functions with potentially flat regions. Synthetic and real data studies demonstrate how the modification guards against the mass-shifting phenomenon while retaining computational efficiency. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 582-596
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2129059
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2129059
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:582-596
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2238943_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Patrick M. Schnell
Author-X-Name-First: Patrick M.
Author-X-Name-Last: Schnell
Author-Name: Matthew Wascher
Author-X-Name-First: Matthew
Author-X-Name-Last: Wascher
Author-Name: Grzegorz A. Rempala
Author-X-Name-First: Grzegorz A.
Author-X-Name-Last: Rempala
Title: Overcoming Repeated Testing Schedule Bias in Estimates of Disease Prevalence
Abstract:
During the COVID-19 pandemic, many institutions such as universities and workplaces implemented testing regimens with every member of some population tested longitudinally, and those testing positive isolated for some time. Although the primary purpose of such regimens was to suppress disease spread by identifying and isolating infectious individuals, testing results were often also used to obtain prevalence and incidence estimates. Such estimates are helpful in risk assessment and institutional planning, and various estimation procedures have been implemented, ranging from simple test-positive rates to complex dynamical modeling. Unfortunately, the popular test-positive rate is a biased estimator of prevalence under many seemingly innocuous longitudinal testing regimens with isolation. We illustrate how such bias arises and identify conditions under which the test-positive rate is unbiased. Further, we identify weaker conditions under which prevalence is identifiable and propose a new estimator of prevalence under longitudinal testing. We evaluate the proposed estimation procedure via simulation study and illustrate its use on a dataset derived by anonymizing testing data from The Ohio State University. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 1-13
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2023.2238943
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2238943
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:1-13
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2108816_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Yi Chen
Author-X-Name-First: Yi
Author-X-Name-Last: Chen
Author-Name: Yining Wang
Author-X-Name-First: Yining
Author-X-Name-Last: Wang
Author-Name: Ethan X. Fang
Author-X-Name-First: Ethan X.
Author-X-Name-Last: Fang
Author-Name: Zhaoran Wang
Author-X-Name-First: Zhaoran
Author-X-Name-Last: Wang
Author-Name: Runze Li
Author-X-Name-First: Runze
Author-X-Name-Last: Li
Title: Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection
Abstract:
We consider the stochastic contextual bandit problem under the high dimensional linear model. We focus on the case where the action space is finite and random, with each action associated with a randomly generated contextual covariate. This setting finds essential applications such as personalized recommendations, online advertisements, and personalized medicine. However, it is very challenging to balance the exploration and exploitation tradeoff. We modify the LinUCB algorithm in doubly growing epochs and estimate the parameter using the best subset selection method, which is easy to implement in practice. This approach achieves O(s√T) regret with high probability, which is nearly independent of the “ambient” regression model dimension d. We further attain a sharper O(√(sT)) regret by using the SupLinUCB framework and match the minimax lower bound of the low-dimensional linear stochastic bandit problem. Finally, we conduct extensive numerical experiments to empirically demonstrate our algorithms’ applicability and robustness. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 246-258
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2108816
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2108816
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:246-258
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2106868_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Chengchun Shi
Author-X-Name-First: Chengchun
Author-X-Name-Last: Shi
Author-Name: Shikai Luo
Author-X-Name-First: Shikai
Author-X-Name-Last: Luo
Author-Name: Yuan Le
Author-X-Name-First: Yuan
Author-X-Name-Last: Le
Author-Name: Hongtu Zhu
Author-X-Name-First: Hongtu
Author-X-Name-Last: Zhu
Author-Name: Rui Song
Author-X-Name-First: Rui
Author-X-Name-Last: Song
Title: Statistically Efficient Advantage Learning for Offline Reinforcement Learning in Infinite Horizons
Abstract:
We consider reinforcement learning (RL) methods in offline domains without additional online data collection, such as mobile health applications. Most existing policy optimization algorithms in the computer science literature are developed in online settings where data are easy to collect or simulate. Their generalizations to mobile health applications with a pre-collected offline dataset remain less explored. The aim of this article is to develop a novel advantage learning framework in order to efficiently use pre-collected data for policy optimization. The proposed method takes an optimal Q-estimator computed by any existing state-of-the-art RL algorithms as input, and outputs a new policy whose value is guaranteed to converge at a faster rate than the policy derived based on the initial Q-estimator. Extensive numerical experiments are conducted to back up our theoretical findings. A Python implementation of our proposed method is available at https://github.com/leyuanheart/SEAL. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 232-245
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2106868
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2106868
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:232-245
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2123335_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Qian Xiao
Author-X-Name-First: Qian
Author-X-Name-Last: Xiao
Author-Name: Yaping Wang
Author-X-Name-First: Yaping
Author-X-Name-Last: Wang
Author-Name: Abhyuday Mandal
Author-X-Name-First: Abhyuday
Author-X-Name-Last: Mandal
Author-Name: Xinwei Deng
Author-X-Name-First: Xinwei
Author-X-Name-Last: Deng
Title: Modeling and Active Learning for Experiments with Quantitative-Sequence Factors
Abstract:
A new type of experiment that aims to determine the optimal quantities of a sequence of factors is eliciting considerable attention in medical science, bioengineering, and many other disciplines. Such studies require the simultaneous optimization of both quantities and sequence orders of several components which are called quantitative-sequence (QS) factors. Given the large and semi-discrete solution spaces in such experiments, efficiently identifying optimal or near-optimal solutions by using a small number of experimental trials is a nontrivial task. To address this challenge, we propose a novel active learning approach, called QS-learning, to enable effective modeling and efficient optimization for experiments with QS factors. QS-learning consists of three parts: a novel mapping-based additive Gaussian process (MaGP) model, an efficient global optimization scheme (QS-EGO), and a new class of optimal designs (QS-design). The theoretical properties of the proposed method are investigated, and optimization techniques using analytical gradients are developed. The performance of the proposed method is demonstrated via a real drug experiment on lymphoma treatment and several simulation studies. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 407-421
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2123335
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2123335
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:407-421
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2278201_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Anna Menacher
Author-X-Name-First: Anna
Author-X-Name-Last: Menacher
Author-Name: Thomas E. Nichols
Author-X-Name-First: Thomas E.
Author-X-Name-Last: Nichols
Author-Name: Chris Holmes
Author-X-Name-First: Chris
Author-X-Name-Last: Holmes
Author-Name: Habib Ganjgahi
Author-X-Name-First: Habib
Author-X-Name-Last: Ganjgahi
Title: Bayesian Lesion Estimation with a Structured Spike-and-Slab Prior
Abstract:
Neural demyelination and brain damage accumulated in white matter appear as hyperintense areas on T2-weighted MRI scans in the form of lesions. Modeling binary images at the population level, where each voxel represents the existence of a lesion, plays an important role in understanding aging and inflammatory diseases. We propose a scalable hierarchical Bayesian spatial model, called BLESS, capable of handling binary responses by placing continuous spike-and-slab mixture priors on spatially varying parameters and enforcing spatial dependency on the parameter dictating the amount of sparsity within the probability of inclusion. The use of mean-field variational inference with dynamic posterior exploration, which is an annealing-like strategy that improves optimization, allows our method to scale to large sample sizes. Our method also accounts for underestimation of posterior variance due to variational inference by providing an approximate posterior sampling approach based on Bayesian bootstrap ideas and spike-and-slab priors with random shrinkage targets. Besides accurate uncertainty quantification, this approach is capable of producing novel cluster size based imaging statistics, such as credible intervals of cluster size, and measures of reliability of cluster occurrence. Lastly, we validate our results via simulation studies and an application to the UK Biobank, a large-scale lesion mapping study with a sample size of 40,000 subjects. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 66-80
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2023.2278201
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2278201
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:66-80
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2270657_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Lijia Wang
Author-X-Name-First: Lijia
Author-X-Name-Last: Wang
Author-Name: Y. X. Rachel Wang
Author-X-Name-First: Y. X. Rachel
Author-X-Name-Last: Wang
Author-Name: Jingyi Jessica Li
Author-X-Name-First: Jingyi Jessica
Author-X-Name-Last: Li
Author-Name: Xin Tong
Author-X-Name-First: Xin
Author-X-Name-Last: Tong
Title: Hierarchical Neyman-Pearson Classification for Prioritizing Severe Disease Categories in COVID-19 Patient Data
Abstract:
COVID-19 has a spectrum of disease severity, ranging from asymptomatic to requiring hospitalization. Understanding the mechanisms driving disease severity is crucial for developing effective treatments and reducing mortality rates. One way to gain such understanding is using a multi-class classification framework, in which patients’ biological features are used to predict patients’ severity classes. In this severity classification problem, it is beneficial to prioritize the identification of more severe classes and control the “under-classification” errors, in which patients are misclassified into less severe categories. The Neyman-Pearson (NP) classification paradigm has been developed to prioritize the designated type of error. However, current NP procedures are either for binary classification or do not provide high probability controls on the prioritized errors in multi-class classification. Here, we propose a hierarchical NP (H-NP) framework and an umbrella algorithm that generally adapts to popular classification methods and controls the under-classification errors with high probability. On an integrated collection of single-cell RNA-seq (scRNA-seq) datasets for 864 patients, we explore ways of featurization and demonstrate the efficacy of the H-NP algorithm in controlling the under-classification errors regardless of featurization. Beyond COVID-19 severity classification, the H-NP algorithm generally applies to multi-class classification problems, where classes have a priority order. Supplementary materials for this article are available online.
Journal: Journal of the American Statistical Association
Pages: 39-51
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2023.2270657
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2270657
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:39-51
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2115917_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Xiaoyu Hu
Author-X-Name-First: Xiaoyu
Author-X-Name-Last: Hu
Author-Name: Fang Yao
Author-X-Name-First: Fang
Author-X-Name-Last: Yao
Title: Dynamic Principal Component Analysis in High Dimensions
Abstract:
Principal component analysis is a versatile tool to reduce dimensionality, which has wide applications in statistics and machine learning. It is particularly useful for modeling data in high-dimensional scenarios where the number of variables p is comparable to, or much larger than, the sample size n. Despite an extensive literature on this topic, researchers have focused on modeling static principal eigenvectors, which are not suitable for stochastic processes that are dynamic in nature. To characterize the change in the entire course of high-dimensional data collection, we propose a unified framework to directly estimate dynamic eigenvectors of covariance matrices. Specifically, we formulate an optimization problem by combining the local linear smoothing and regularization penalty together with the orthogonality constraint, which can be effectively solved by manifold optimization algorithms. We show that our method is suitable for high-dimensional data observed under both common and irregular designs, and theoretical properties of the estimators are investigated under ℓ_q (0≤q≤1) sparsity. Extensive experiments demonstrate the effectiveness of the proposed method in both simulated and real data examples.
Journal: Journal of the American Statistical Association
Pages: 308-319
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2022.2115917
File-URL: http://hdl.handle.net/10.1080/01621459.2022.2115917
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:308-319
Template-Type: ReDIF-Article 1.0
# input file: UASA_A_2273403_J.xml processed with: repec_from_jats12.xsl darts-xml-transformations-20240209T083504 git hash: db97ba8e3a
Author-Name: Ali Rahnavard
Author-X-Name-First: Ali
Author-X-Name-Last: Rahnavard
Title: Statistical Analytics for Health Data Science with SAS and R
Journal: Journal of the American Statistical Association
Pages: 786-787
Issue: 545
Volume: 119
Year: 2024
Month: 1
X-DOI: 10.1080/01621459.2023.2273403
File-URL: http://hdl.handle.net/10.1080/01621459.2023.2273403
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:taf:jnlasa:v:119:y:2024:i:545:p:786-787