
Statistics Seminar

Seminar

Title: Robust Unsupervised Multi-task Learning on Mixture Models

Abstract: Unsupervised learning has been widely used in many real-world applications. One of the simplest and most important unsupervised learning models is the Gaussian mixture model (GMM). In this talk, we study the multi-task learning problem on GMMs, which aims to leverage potentially similar GMM parameter structures among tasks to obtain improved learning performance compared to single-task learning. We propose a multi-task GMM learning procedure based on the EM algorithm that not only effectively utilizes unknown similarity between related tasks but is also robust against a fraction of outlier tasks from arbitrary sources. The proposed procedure is shown to achieve the minimax optimal rate of convergence for both the parameter estimation error and the excess mis-clustering error in a wide range of regimes. We then generalize our approach to multi-task learning for general mixture models, for which a general and informative error bound is derived. The effectiveness of our methods will be demonstrated through simulations and real data examples.

Location: MDS 220

Seminar

Title: A Roadmap to Transcriptomic Deconvolution in Cancer

Abstract: Cancer is characterized by vast transcriptional variations in genes and pathways. Cancer tissues are complex heterogeneous mixtures of epithelial, stromal, and immune cells, with each group comprising multiple distinct cell types and states. This heterogeneity has likely led to numerous contradictory findings in the literature over the past 30 years of high-throughput transcriptomic profiling of tumor tissues, thereby impeding a clearer understanding of cancer biology. Two approaches to address this issue are single-cell RNA-seq profiling and bulk RNA-seq deconvolution. Due to the higher cost and sample quality requirements of single-cell profiling, bulk RNA sequencing remains widely used for large numbers of patient tissues. An important analytical challenge is how to integrate information from the two technologies to fully uncover the tumor-microenvironment (TME) landscape of cancer. In this talk, I will present our recent development of DeMixSC for single-cell-based bulk deconvolution, and a pan-cancer biomarker, the tumor-cell-specific total mRNA expression score (TmS), which is calculated through an integrative deconvolution model. Both DeMixSC and TmS have opened up untapped opportunities for understanding the dynamics of the TME in relation to metastasis and resistance to treatment. We envisage that transcriptomic deconvolution will continue to empower cancer researchers, deepening our understanding of tumor heterogeneity and informing clinical decision-making.

Location: MDS 220

Seminar

Title: Generalized Heterogeneous Functional Model with Applications to Mobile Health Data

Abstract: Physical activity plays a pivotal role in human health, and it has been suggested that there is a strong relationship between physical activity and various diseases such as mental disorders and Parkinson's disease (PD). However, the underlying mechanism of this relationship is still unclear. One of the primary challenges in depicting this relationship is the inherent heterogeneity among people. To fill this gap, we propose a generalized heterogeneous functional method (GHFM) within the generalized functional data analysis framework, which estimates coefficient functions and subgroup information simultaneously and accommodates generalized outcomes. Unlike traditional homogeneous methods, the proposed approach distinguishes the relationship between physical activity and diseases within different subgroups, providing a more comprehensive and systematic depiction. Additionally, we propose a pre-clustering method to improve computational efficiency for large samples. Simulation studies demonstrate the superior performance of our method in various settings compared to existing approaches. In applications, we investigate the influence of physical activity on the risk of mental disorders, measured by neuroticism scores, and on the risk of PD in the UK Biobank dataset. In a dataset of 79,246 subjects for neuroticism scores, our method identifies four distinct subgroups and estimates their respective coefficient functions. Similarly, in a dataset of 80,692 subjects for PD, our method identifies three distinct subgroups and estimates their coefficient functions. We present a scientific interpretation for each subgroup, and these findings could contribute to identifying disease risks in future mobile health applications.

Location: MDS 220

Seminar

Title: Statistical Understanding of Deep Learning with Big Data

Speaker: Taps Maiti, Michigan State University & National Science Foundation

Abstract: Deep learning has profoundly impacted science and society through its successful application of data-driven artificial intelligence. One of the key features of deep learning is that its accuracy improves as the size of the model and the amount of training data increase. This property has significantly improved state-of-the-art learning architectures across various fields in the past decade. However, the lack of a mathematical and statistical foundation has limited the development of deep learning to specific applications and has prevented it from being more broadly applied with high confidence. This foundational gap becomes even more apparent when deep learning is applied to statistical estimation and inference under limited training sample regimes. To address this issue, we aim to develop statistically principled reasoning and theory that can validate the application of deep learning and pave the way for interpretable deep learning. Our approach is based on Bayesian statistical theory and methodology and on scalable computation. We demonstrate the methods with a wide range of applications.

Location: MDS 220

Statistical Practice and Research at NASA

Peter A. Parker, Ph.D., P.E.

National Aeronautics and Space Administration

Langley Research Center

Hampton, Virginia, USA

The discipline of statistics has gained recognition within NASA by spurring innovation and efficiency, and it has demonstrated significant impact and value. In aerospace research and development, it accelerates learning, maximizes knowledge, ensures strategic resource investment, and informs rigorous data-driven decisions. In practice, it requires immersive multidisciplinary collaboration to develop solution strategies that integrate statistical methods with subject-matter expertise to address challenging research objectives. This presentation provides an overview of statistical case studies in aeronautics, space exploration, and atmospheric science, and it highlights statistical research motivated by NASA’s challenging applications.

Location: MDS 220

R.L. Anderson Lecture

Statistical Thinking About Home Run Hitting

Jim Albert

Emeritus Distinguished University Professor

Department of Mathematics and Statistics

Bowling Green State University

Abstract:

Baseball is remarkable with respect to the amount of data collected over the seasons of Major League Baseball (MLB) beginning in 1871. These data have provided an opportunity to address many questions of interest among baseball fans and researchers. This talk will review several statistical studies on baseball home run hitting by the speaker over the last 30 years. By modeling career trajectories, one learns about the greatest peak abilities of home run hitters. We know that players exhibit streaky home run performances, but is there evidence that hitters exhibit streaky ability? MLB has been concerned about the abrupt rise in home run hitting in recent seasons. What are the possible causes of the home run explosion, and in particular, is the explosion due to the composition of the baseball?

Location: The 90, Room 203 (Teal Classroom)

Seminar

Title: On multivariate and infinite-dimensional quantiles and statistical depth functions

Abstract: For absolutely continuous real random variables, the cumulative distribution function is known to be a strictly increasing function, and the quantile function is defined as its inverse. A minor adjustment to the definition allows us to define quantile functions for other real random variables that may not have strictly increasing cumulative distribution functions, while retaining all desirable properties. How does one define quantiles in dimensions greater than one? In this overview talk, we will discuss an alternative and equivalent definition of a quantile, and how that definition can generalize to higher dimensions, including many cases where the dimension may be infinite. We will look at some interesting probabilistic and geometric properties of such multivariate quantiles. In one dimension, sample quantiles also allow us to rank and order the observations. A partial equivalent in higher dimensions is the notion of a statistical depth function (or data depth, as it is commonly called), and our overview will also include discussions of properties and uses of the depth function.

Location: MDS 220

Seminar

Title: Tolerance bands for exponential family functional data

Abstract: A tolerance band for a functional response provides a region that is expected to contain a given fraction of observations from the sampled population at each point in the domain. This band is a functional analogue of the tolerance interval for a univariate response. Although the problem of constructing functional tolerance bands has been considered for a Gaussian response, it has not been investigated for non-Gaussian responses, which are common in biomedical applications. We describe a methodology for constructing tolerance bands for two non-Gaussian members of the exponential family: binomial and Poisson. The approach is to first model the data using the framework of generalized functional principal components analysis. Then, a parameter is identified in which the marginal distribution of the response is stochastically monotone. We show that the tolerance limits can be readily obtained from confidence limits for this parameter, which in turn can be computed using large-sample theory and bootstrapping. The proposed methodology works for both dense and sparse functional data. We report the results of simulation studies designed to evaluate its performance and to derive recommendations for practical applications. The methodology is illustrated using two biomedical studies.

Brief Bio: Dr. Pankaj Choudhary is a professor of statistics in the Department of Mathematical Sciences at the University of Texas (UT) at Dallas. He received his bachelor’s and master’s degrees in statistics in India and his PhD in statistics from the Ohio State University in 2002. He has been at UT Dallas since then. His current research interests include development of risk prediction models for contralateral breast cancer and substance use disorders, modeling and analysis of method comparison studies, and construction of tolerance regions. In his free time, he likes to watch TV with his family.

Location: MDS 220

Seminar

Title: CESME: High-Dimensional Clustering via Latent Semiparametric Mixture Models

Abstract: Cluster analysis is a fundamental task in machine learning. Several clustering algorithms have been extended to handle high-dimensional data by incorporating a sparsity constraint in the estimation of a mixture of Gaussian models. Though it makes some neat theoretical analysis possible, this type of approach is arguably restrictive for many applications. In this talk, I will introduce a novel latent variable transformation mixture model for clustering in which a mixture of Gaussians is assumed after some unknown monotone data transformation. A new clustering algorithm named CESME is developed for high-dimensional clustering under the assumption that the optimal clustering admits a sparsity structure. The use of an unspecified transformation makes the model far more flexible than the classical mixture of Gaussians. On the other hand, the transformation also brings quite a few technical challenges to the model estimation as well as the theoretical analysis of CESME. I will present a comprehensive analysis of CESME including identifiability, initialization, algorithmic convergence, and statistical guarantees on clustering. In addition, the convergence analysis has revealed an interesting algorithmic phase transition for CESME, which has also been noted for the EM algorithm in the literature. Leveraging this transition, a data-adaptive procedure is developed that substantially improves the computational efficiency of CESME. Extensive numerical studies and real data analyses show that CESME outperforms existing high-dimensional clustering algorithms including CHIME, sparse spectral clustering, sparse K-means, sparse convex clustering, and IF-PCA.

Location: MDS 220