
Statistics Seminar

Repro Sampling Method for Statistical Inference of High Dimensional Linear Models

Abstract: This paper proposes a new and effective simulation-based approach, called the Repro Sampling method, to conduct statistical inference in high-dimensional linear models. The Repro method creates and studies the performance of artificial samples (referred to as Repro samples) that are generated by mimicking the sampling mechanism that produced the true observed sample. In doing so, it provides a new way to quantify model and parameter uncertainty and delivers confidence sets with guaranteed coverage rates for a wide range of problems. A general theoretical framework and an effective Monte Carlo algorithm, with supporting theory, are developed for high-dimensional linear models. The method is used to jointly create confidence sets for selected models and model coefficients, covering both exact and asymptotic inference, and the accompanying theory supports its computational efficiency. Furthermore, this development allows us to handle inference problems involving covariates that are perfectly correlated. A new and intuitive graphical tool for presenting uncertainty in model selection and regression parameter estimation is also developed. We provide numerical studies to demonstrate the utility of the proposed method across a range of problems. Numerical comparisons suggest that the method is far better, in terms of improved coverage rates and significantly reduced sizes of confidence sets, than approaches currently used in the literature. The development provides a simple and effective solution to the difficult problem of post-selection inference.
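
As a rough, hedged illustration of the general repro-samples principle (not the paper's high-dimensional algorithm), the toy sketch below builds a confidence set for a single normal mean by generating artificial samples under each candidate parameter value and keeping the candidates whose artificial samples resemble the observed one; the grid, tolerance, and variable names are illustrative assumptions.

```python
# Toy sketch of the repro-samples idea for Y_i = theta + U_i, U_i ~ N(0, 1).
# This is NOT the paper's high-dimensional algorithm; it only conveys the
# principle of generating artificial (repro) samples from the assumed
# data-generating mechanism and keeping candidate parameter values whose
# artificial samples look like the observed one.
import numpy as np

rng = np.random.default_rng(0)
n, theta_true = 50, 1.3
y_obs = theta_true + rng.standard_normal(n)      # observed sample

def in_confidence_set(theta, y, n_repro=2000, alpha=0.05):
    """Keep theta if the observed mean falls inside the central (1 - alpha)
    band of means computed from repro samples generated under theta."""
    u_star = rng.standard_normal((n_repro, y.size))   # mimic the noise mechanism
    repro_means = theta + u_star.mean(axis=1)         # artificial sample means
    lo, hi = np.quantile(repro_means, [alpha / 2, 1 - alpha / 2])
    return lo <= y.mean() <= hi

grid = np.linspace(0.5, 2.5, 201)
conf_set = [t for t in grid if in_confidence_set(t, y_obs)]
print(f"95% confidence set roughly [{min(conf_set):.2f}, {max(conf_set):.2f}]")
```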

 

A link to the signup sheet for meals and meetings can be found here.

 

Date:
Location:
TBD
Event Series:

The Causal Effect of a Timeout at Stopping an Opposing Run in the NBA

Abstract: In the summer of 2017, the NBA reduced the total number of timeouts, along with other rule changes, to regulate the flow of the game. With these rule changes, it becomes increasingly important for coaches to manage their timeouts effectively. Understanding the utility of a timeout under various game scenarios, e.g., during an opposing team's run, is of the utmost importance. There are two schools of thought when the opposition is on a run: (1) call a timeout and allow your team to rest and regroup, or (2) save a timeout and hope your team can make corrections on the fly. This talk investigates the credence of these tenets by using the Rubin causal model framework to quantify the causal effect of a timeout in the presence of an opposing team's run.
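
As a hedged, generic sketch of the potential-outcomes framework mentioned above (not the speaker's actual model or data), the snippet below estimates the average effect of calling a timeout on subsequent opponent scoring via inverse-probability weighting; the column names and simulated play-by-play data are purely hypothetical.

```python
# Generic Rubin-causal-model sketch: the effect of calling a timeout (T) on
# points scored by the opposing team over the rest of their run (Y), adjusting
# for observed game state (X).  The data and the inverse-probability-weighting
# estimator are illustrative assumptions, not the speaker's analysis.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({                                   # one hypothetical row per run
    "timeout":    rng.binomial(1, 0.4, 500),          # T: timeout called?
    "run_points": rng.poisson(6, 500),                # X: size of the run so far
    "score_diff": rng.normal(0, 8, 500),              # X: score margin
    "y_after":    rng.poisson(3, 500),                # Y: opponent points afterwards
})

# 1. Model the propensity to call a timeout given the game state.
X = df[["run_points", "score_diff"]]
ps = LogisticRegression().fit(X, df["timeout"]).predict_proba(X)[:, 1]

# 2. Inverse-probability-weighted estimate of the average treatment effect.
t, y = df["timeout"].to_numpy(), df["y_after"].to_numpy()
ate = np.mean(t * y / ps) - np.mean((1 - t) * y / (1 - ps))
print(f"Estimated causal effect of a timeout on opponent scoring: {ate:.2f}")
```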

 

Sign-up sheets for meals and meetings are available here.

 

Date:
Location:
MDS 220
Event Series:

Converting High-dimensional Objects to Functional Data: The Stringing Approach

Abstract: There is a close relationship between high-dimensional data and functional data. For instance, densely observed functional data can be viewed as high-dimensional data endowed with a natural ordering. In this talk, we explore the opposite question: can one find a proper ordering of high-dimensional data so that they can be reordered and viewed as functional data?

 
Stringing is such a method: it takes advantage of the high dimensionality by representing the data as discretized and noisy observations originating from a hidden smooth stochastic process. It transforms high-dimensional data into functional data so that established techniques from functional data analysis can be applied for further statistical analysis. We illustrate the advantages of the stringing methodology through several data sets. In one application, stringing leads to the development of a new Cox model that accommodates functional covariates.
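
A minimal sketch of the core stringing step follows, under the assumption that the coordinates are ordered by a one-dimensional multidimensional-scaling embedding of pairwise correlation distances; this is a plausible recipe for illustration, not necessarily the speakers' exact construction.

```python
# Hedged sketch of stringing: reorder the p coordinates of high-dimensional
# observations so that, after reordering, each row can be treated as a noisy
# discretized curve.  The ordering below comes from a 1-D MDS embedding of
# pairwise correlation distances between coordinates.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(1)
n, p = 200, 60
t_true = rng.permutation(p)                   # hidden "positions" of the coordinates
scores = rng.standard_normal((n, 3))
# Latent smooth process evaluated at the hidden positions, observed with noise.
basis = np.column_stack([np.sin((k + 1) * np.pi * t_true / p) for k in range(3)])
X = scores @ basis.T + 0.1 * rng.standard_normal((n, p))

# Stringing step: embed the coordinates (columns) in 1-D using correlation distance.
dist = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
embed = MDS(n_components=1, dissimilarity="precomputed",
            random_state=0).fit_transform(dist).ravel()
order = np.argsort(embed)                     # estimated ordering of coordinates

X_strung = X[:, order]                        # rows now look like discretized curves
```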
 
In the second part of the talk, we extend the stringing approach to align high-dimensional object data. Taking fMRI data as an example, the objects are BOLD time series, and the goal is to align these spatially indexed objects by mapping their spatial locations to a target one-dimensional interval so that similar objects are placed near each other in the new target space. The proposed alignment provides a visualization tool for viewing these complex object data. Moreover, the aligned data often exhibit a certain level of smoothness and can be handled by approaches designed for functional data. We demonstrate how to implement such an alignment for fMRI time series and propose a new concept of path length for studying functional connectivity, along with a new community detection method. The proposed methods are illustrated through simulations and a study of Alzheimer's disease.
 
*Based on joint work with Kun Chen (GE Healthcare), Kehui Chen (U. Pittsburgh), Hans-Georg Mueller (UC Davis), Simeng Qu and Xiao Wang (both from Purdue U.), and Chun-Jui Chen (Lyft)
 
Sign-ups for meals and meetings can be done here.
 
 
 
Date:
Location:
MDS 220
Event Series:

Generalized Fiducial Inference: A Review

Abstract: R. A. Fisher, the father of modern statistics, developed the idea of fiducial inference during the first half of the 20th century. While his proposal led to interesting methods for quantifying uncertainty, other prominent statisticians of the time did not accept Fisher's approach, as it became apparent that some of Fisher's bold claims about the properties of fiducial distributions did not hold up for multi-parameter problems. Beginning around the year 2000, the authors and collaborators started to re-investigate the idea of fiducial inference and discovered that Fisher's approach, when properly generalized, would open doors to solving many important and difficult inference problems. They termed their generalization of Fisher's idea generalized fiducial inference (GFI). The main idea of GFI is to carefully transfer randomness from the data to the parameter space using an inverse of a data-generating equation, without the use of Bayes' theorem. The resulting generalized fiducial distribution (GFD) can then be used for inference. After more than a decade of investigation, the authors and collaborators have developed a unifying theory for GFI and provided GFI solutions to many challenging practical problems in different fields of science and industry. Overall, they have demonstrated that GFI is a valid, useful, and promising approach for conducting statistical inference. In this talk we present the latest developments and some successful applications of generalized fiducial inference.
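
A toy sketch of the inversion idea described above, for the simple location model Y_i = theta + U_i with U_i ~ N(0, 1); real GFI handles far more general data-generating equations, so this is only meant to convey the mechanics of transferring randomness from data to parameter.

```python
# Hedged toy illustration of the core GFI idea: invert a data-generating
# equation to transfer randomness from the data to the parameter, without a
# prior or Bayes' theorem.  Working through the sufficient statistic
# Ybar = theta + Ubar, Ubar ~ N(0, 1/n), the inverse is theta = Ybar - Ubar,
# so fiducial draws are obtained by re-generating Ubar.
import numpy as np

rng = np.random.default_rng(0)
n, theta_true = 40, 2.0
y = theta_true + rng.standard_normal(n)

ybar = y.mean()
u_bar_star = rng.standard_normal(100_000) / np.sqrt(n)  # re-generated noise
theta_fid = ybar - u_bar_star                            # generalized fiducial draws

lo, hi = np.quantile(theta_fid, [0.025, 0.975])
print(f"95% fiducial interval for theta: ({lo:.2f}, {hi:.2f})")
```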

 

Parts of this talk are joint work with T. C. M. Lee and Randy Lai (UC Davis), H. Iyer (NIST), J. Williams (NCSU), and Y. Cui (U. Pennsylvania).

 

Signup sheets for meetings and meals can be found here.

 

Date:
Location:
MDS 220
Event Series:

It’s Not What We Said, It’s Not What They Heard, It’s What They Say They Heard

Abstract:  Statisticians have long known that success in our profession frequently depends on our ability to succinctly explain our results so decision makers may correctly integrate our efforts into their actions.  However, this is no longer enough.  While we still must make sure that we carefully present results and conclusions, the real difficulty is what the recipient thinks we just said.   The situation becomes more challenging in the age of “big data”.  This presentation will discuss what to do, and what not to do.  Examples, including those used in court cases, executive documents, and material presented for the President of the United States, will illustrate the principles.

 

More on the Speaker: Barry D. Nussbaum was the Chief Statistician for the U.S. Environmental Protection Agency from 2007 until his retirement in March 2016.   He started his EPA career in 1975 in mobile sources and was the branch chief for the team that phased lead out of gasoline.  Dr. Nussbaum is the founder of the EPA Statistics Users Group.  In recognition of his notable accomplishments he was awarded the Environmental Protection Agency’s Distinguished Career Service Award.  Dr. Nussbaum has a bachelor’s degree from Rensselaer Polytechnic Institute, and both a master’s and a doctorate from the George Washington University.     In May 2015, he was elected the 112th president of the American Statistical Association.   He has been a fellow of the ASA since 2007 and is an elected member of the International Statistical Institute. He has taught graduate statistics courses for George Washington University and Virginia Tech. In 2019, he was appointed Adjunct Professor of Mathematics and Statistics at the University of Maryland, Baltimore County.  In addition, he has even survived two terms as the treasurer of the Ravensworth Elementary School Parent Teacher Association. 

 

The link for meal and meeting signups can be found here.

 

Date:
Location:
MDS 220
Event Series:

New Approaches to Analyzing Modern Time Series

Abstract: In the big data era, many new forms of data have become available and useful in a variety of important applications. When these data are observed over time, they form new types of time series that require new statistical models and analytical tools to extract useful information. In this talk we present new developments in analyzing matrix time series, dynamic networks, functional time series, and compositional time series, with applications ranging from economics, finance, and international trade to electricity loading. We will also briefly discuss approaches for modeling other forms of time series, including text time series, dynamic social networks, and tensor time series.

 

Meeting and meal signups can be done here.

 

Date:
Location:
MDS 220
Event Series:

Models for Space-Time Data Inspired from Statistical Physics

Abstract: This presentation will focus on statistical models for space-time data which are motivated by ideas from statistical physics. The latter provides a general framework for developing space-time models based on Boltzmann-Gibbs probability density functions and stochastic partial differential equations (SPDEs). In geostatistics, on the other hand, spatial models are typically defined in terms of an explicit covariance function (or a family of covariance functions). In contrast, in the Boltzmann-Gibbs approach the covariance function is intrinsically generated from the underlying joint probability density model. The latter is determined from the respective energy function model which incorporates interactions between different sites. In the SPDE formulation, the covariance function is determined from the “driving equation” of the random field, which leads to a respective partial differential equation for the covariance function.

 

I will briefly discuss the connection between the Boltzmann-Gibbs and SPDE formulations for Gaussian random fields. I will then review some results which are based on Boltzmann-Gibbs densities equipped with an energy function comprising short-range interactions. These results include: (1) A class of flexible spatial covariance functions; (2) a non-separable covariance function with a composite spacetime metric; (3) a family of non-separable covariance functions that are based on linear response theory combined with the space transform; and (4) ongoing efforts to generalize Boltzmann-Gibbs models from continuum and lattice spaces to irregular sampling geometries. The space-time models generated by means of the Boltzmann-Gibbs formulation with short-range energy functions involve sparse precision matrices by construction. This is a significant asset for the processing of big spatial or space-time datasets, since the computationally demanding inversion of large covariance matrices (common in geostatistics and Gaussian process regression) is avoided. I will illustrate these concepts with applications to environmental and energy resources datasets.
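
The sketch below illustrates, in a hedged way, why short-range interactions yield a sparse precision matrix and why that matters computationally; the lattice precision Q = tau (kappa^2 I + L), with L a graph Laplacian, is a standard stand-in for such models and is not taken from the talk itself.

```python
# Sparse precision from nearest-neighbour (short-range) interactions on a
# lattice: conditional (kriging-type) calculations become sparse linear solves
# instead of dense covariance inversions.  The specific precision
# Q = tau * (kappa^2 * I + L) is an assumed illustrative choice.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

m = 100                                  # 100 x 100 lattice -> 10,000 sites
I1 = sp.identity(m, format="csr")
A = sp.diags([np.ones(m - 1), np.ones(m - 1)], [-1, 1], format="csr")  # path adjacency
L1 = sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A   # 1-D graph Laplacian
L = sp.kron(L1, I1) + sp.kron(I1, L1)    # 2-D lattice Laplacian (Kronecker sum)

kappa2, tau = 0.5, 1.0
Q = tau * (kappa2 * sp.identity(m * m) + L)   # sparse precision, ~5 nonzeros per row

# A conditional-mean-type calculation is a sparse solve with Q; no dense
# 10,000 x 10,000 covariance matrix is ever formed or inverted.
b = np.random.default_rng(0).standard_normal(m * m)
x = spsolve(Q.tocsc(), b)
print(Q.nnz, "nonzeros in a", Q.shape, "precision matrix")
```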

 

Meeting and meal signups can be done here.


Date:
Location:
MDS 220
Event Series:

Integrative Data Analytics via Distributed Inference Functions

Abstract: This talk concerns integrative data analytics and distributed inference in data integration. As data sharing from related studies becomes of interest, statistical methods for a joint analysis of all available datasets are needed in practice to achieve better statistical power and to detect signals that cannot otherwise be captured from a single dataset alone. A major challenge in integrative data analytics pertains to principles of information aggregation, learning data heterogeneity, and inference and algorithms for model fusion. Generalizing the classical theoretical foundation of information aggregation, we propose a new framework of distributed inference functions and divide-and-conquer algorithms to handle massive, large-scale correlated data. I will focus on two new approaches: renewable estimation and incremental inference (REII), and the distributed and integrated method of moments (DIMM). I will discuss both the conceptual formulations and the theoretical guarantees of these methods, and illustrate their performance via numerical examples. This is joint work with Emily Hector, Lan Luo and Ling Zhou.
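
As a hedged illustration of the general divide-and-conquer aggregation principle (not the REII or DIMM estimators themselves), the sketch below combines block-wise least-squares estimates by information weighting, which in this simple homogeneous case reproduces the full-data estimator exactly.

```python
# Generic divide-and-conquer aggregation: fit the same model on K data blocks
# and combine the block estimates by inverse-variance (information) weighting.
# Only low-dimensional summaries (information matrices and weighted estimates)
# are aggregated, never the raw data.
import numpy as np

rng = np.random.default_rng(0)
p, K, n_block = 5, 20, 5_000
beta_true = rng.standard_normal(p)

infos, weighted_sums = np.zeros((p, p)), np.zeros(p)
for _ in range(K):                                   # stream over data blocks
    X = rng.standard_normal((n_block, p))
    y = X @ beta_true + rng.standard_normal(n_block)
    info_k = X.T @ X                                 # block information (OLS case)
    beta_k = np.linalg.solve(info_k, X.T @ y)        # block estimate
    infos += info_k
    weighted_sums += info_k @ beta_k

beta_combined = np.linalg.solve(infos, weighted_sums)
print(np.round(beta_combined - beta_true, 3))        # close to zero
```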

 

A link to sign up for meetings and meals with the speaker can be found here.

 

Date:
Location:
MDS 220
Event Series:

Covariate Information Matrix for Dimension Reduction

Abstract: Building upon recent research on applications of the Density Information Matrix (DIM), we develop a tool for Sufficient Dimension Reduction (SDR) in regression problems called the Covariate Information Matrix (CIM). CIM exhaustively identifies the Central Subspace (CS) and provides a rank ordering of the reduced covariates in terms of their regression information. Compared to other popular SDR methods, CIM does not require distributional assumptions on the covariates or estimation of the mean regression function. CIM is implemented via eigen-decomposition of a matrix estimated with a previously developed, efficient nonparametric density estimation technique. We also propose a bootstrap-based diagnostic plot for estimating the dimension of the CS. Results of simulations and real data applications demonstrate superior or competitive performance of CIM compared to some other SDR methods.
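
The CIM estimator is not reproduced here; as a hedged stand-in, the sketch below shows the generic shape of eigen-decomposition-based SDR using sliced inverse regression, where the leading eigenvectors of an estimated candidate matrix give the reduced directions and the eigenvalue drop-off suggests the dimension of the central subspace.

```python
# Generic eigen-decomposition-based SDR, illustrated with sliced inverse
# regression (SIR) as a stand-in for the CIM estimator: build a p x p
# candidate matrix, eigendecompose it, and read off the leading directions.
import numpy as np

rng = np.random.default_rng(0)
n, p = 2_000, 8
X = rng.standard_normal((n, p))
y = X[:, 0] / (0.5 + (X[:, 1] + 1.5) ** 2) + 0.1 * rng.standard_normal(n)

# Standardize X, slice on y, and form the SIR candidate matrix Cov(E[Z | slice]).
Z = (X - X.mean(0)) / X.std(0)
slices = np.array_split(np.argsort(y), 10)
M = sum(len(idx) / n * np.outer(Z[idx].mean(0), Z[idx].mean(0)) for idx in slices)

eigvals, eigvecs = np.linalg.eigh(M)
order = np.argsort(eigvals)[::-1]
print("eigenvalue drop-off:", np.round(eigvals[order], 3))
directions = eigvecs[:, order[:2]]        # estimated basis of the central subspace
```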
 

Meeting and meal signups can be done here.

 

Date:
Location:
MDS 220
Event Series:

Approximate Kernel PCA: Computational vs. Statistical Trade-Off

Abstract: Kernel principal component analysis (KPCA) is a popular non-linear dimensionality reduction technique which generalizes classical linear PCA by finding functions in a reproducing kernel Hilbert space (RKHS) such that the function evaluation at a random variable X has maximum variance. Despite its popularity, kernel PCA suffers from poor scalability in big data scenarios, as it involves solving an n x n eigensystem, leading to a computational complexity of O(n^3), where n is the number of samples. To address this issue, in this work we consider a random feature approximation to kernel PCA which requires solving an m x m eigenvalue problem and therefore has a computational complexity of O(m^3 + nm^2), implying that the approximate method is computationally efficient if m < n, where m is the number of random features. The goal of this work is to investigate the trade-off between the computational and statistical behaviors of approximate KPCA, i.e., whether the computational gain is achieved at the cost of statistical efficiency. We show that approximate KPCA is both computationally and statistically efficient compared to KPCA in terms of the error associated with reconstructing a kernel function based on its projection onto the corresponding eigenspaces. Depending on the eigenvalue decay behavior of the covariance operator, we show that only n^{2/3} features (polynomial decay) or \sqrt{n} features (exponential decay) are needed to match the statistical performance of KPCA. This means that, without any loss of statistical efficiency, approximate KPCA has a computational complexity of O(n^2) or O(n^{3/2}), depending on the eigenvalue decay behavior.
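
A hedged sketch of the random-feature approximation described above, using the standard random Fourier feature construction for the Gaussian kernel (assumed here for concreteness): PCA is performed on an m-dimensional feature map, so the eigenproblem is m x m rather than n x n.

```python
# Approximate KPCA via random Fourier features: map the data to m random
# features that approximate the Gaussian kernel, then run ordinary PCA in that
# m-dimensional space.  The overall cost is O(m^3 + n m^2) instead of O(n^3).
# Constants and the m ~ n^{2/3} choice are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 5_000, 10, 0.5
m = int(n ** (2 / 3))                                    # m ~ n^{2/3} features
X = rng.standard_normal((n, d))

# Random Fourier features for k(x, y) = exp(-gamma * ||x - y||^2).
W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, m))
b = rng.uniform(0, 2 * np.pi, size=m)
Phi = np.sqrt(2.0 / m) * np.cos(X @ W + b)               # n x m feature matrix

# PCA in feature space: an m x m eigenproblem.
Phi_c = Phi - Phi.mean(axis=0)
cov = Phi_c.T @ Phi_c / n
eigvals, eigvecs = np.linalg.eigh(cov)
top = eigvecs[:, np.argsort(eigvals)[::-1][:5]]          # top 5 directions
scores = Phi_c @ top                                     # approximate KPCA scores
```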

Date:
Location:
MDS 220
Event Series: