

R.L. Anderson Lecture

Statistical Thinking About Home Run Hitting

 

Jim Albert

Emeritus Distinguished University Professor

Department of Mathematics and Statistics

Bowling Green State University

 

Abstract

 

Baseball is remarkable for the amount of data collected over the seasons of Major League Baseball (MLB), beginning in 1871. These data have provided an opportunity to address many questions of interest to baseball fans and researchers. This talk will review several of the speaker's statistical studies of home run hitting over the last 30 years. By modeling career trajectories, one learns about the greatest peak abilities of home run hitters. We know that players exhibit streaky home run performances, but is there evidence that hitters exhibit streaky ability? MLB has been concerned about the abrupt rise in home run hitting in recent seasons. What are the possible causes of the home run explosion, and in particular, is the explosion due to the composition of the baseball?

Date:
-
Location:
MDS 220
Tags/Keywords:
Event Series:

Seminar

Title: Tolerance bands for exponential family functional data

Abstract: A tolerance band for a functional response provides a region that is expected to contain a given fraction of observations from the sampled population at each point in the domain. This band is a functional analogue of the tolerance interval for a univariate response. Although the problem of constructing functional tolerance bands has been considered for a Gaussian response, it has not been investigated for non-Gaussian responses, which are common in biomedical applications. We describe a methodology for constructing tolerance bands for two non-Gaussian members of the exponential family: binomial and Poisson. The approach is to first model the data using the framework of generalized functional principal components analysis. Then, a parameter is identified in which the marginal distribution of the response is stochastically monotone. We show that the tolerance limits can be readily obtained from confidence limits for this parameter, which in turn can be computed using large-sample theory and bootstrapping. The proposed methodology works for both dense and sparse functional data. We report the results of simulation studies designed to evaluate its performance and to provide recommendations for practical applications. The methodology is illustrated using two real biomedical studies.
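
To make the stochastic-monotonicity idea concrete, here is a minimal sketch for a Poisson response: if pointwise confidence limits for the mean λ(t) are available (in the talk these come from the generalized FPCA fit and bootstrapping), plugging them into the Poisson quantile function yields pointwise tolerance limits. The confidence limits below are made up for illustration, and the sketch ignores the confidence-level bookkeeping and any multiplicity adjustment across the domain.

```python
import numpy as np
from scipy import stats

def poisson_tolerance_limits(lam_lower, lam_upper, content=0.90):
    """Pointwise tolerance limits for a Poisson response, exploiting
    stochastic monotonicity in the mean: plug confidence limits for
    lambda into the Poisson quantile function."""
    alpha = (1.0 - content) / 2.0
    lower = stats.poisson.ppf(alpha, lam_lower)        # lower limit from the lower confidence bound
    upper = stats.poisson.ppf(1.0 - alpha, lam_upper)  # upper limit from the upper confidence bound
    return lower, upper

# Hypothetical pointwise confidence limits for lambda(t) over a grid of t
t = np.linspace(0, 1, 5)
lam_lo = 3.0 + 2.0 * t
lam_hi = 4.0 + 2.5 * t
print(poisson_tolerance_limits(lam_lo, lam_hi))
```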

Brief Bio: Dr. Pankaj Choudhary is a professor of statistics in the Department of Mathematical Sciences at the University of Texas (UT) at Dallas. He received his bachelor’s and master’s degrees in statistics in India and his PhD in statistics from the Ohio State University in 2002. He has been at UT Dallas since then. His current research interests include development of risk prediction models for contralateral breast cancer and substance use disorders, modeling and analysis of method comparison studies, and construction of tolerance regions. In his free time, he likes to watch TV with his family.

Date:
-
Location:
MDS 220
Tags/Keywords:
Event Series:

Seminar

Title: CESME: High-Dimensional Clustering via Latent Semiparametric Mixture Models

 

Abstract:

Cluster analysis is a fundamental task in machine learning. Several clustering algorithms have been extended to handle high-dimensional data by incorporating a sparsity constraint in the estimation of a mixture of Gaussian models. Though it makes some neat theoretical analysis possible, this type of approach is arguably restrictive for many applications. In this talk, I will introduce a novel latent variable transformation mixture model for clustering in which a mixture of Gaussians is assumed after some unknown monotone data transformation. A new clustering algorithm named CESME is developed for high-dimensional clustering under the assumption that optimal clustering admits a sparsity structure. The use of unspecified transformation makes the model far more flexible than the classical mixture of Gaussians. On the other hand, the transformation also brings quite a few technical challenges to the model estimation as well as the theoretical analysis of CESME. I will present a comprehensive analysis of CESME including identifiability, initialization, algorithmic convergence, and statistical guarantees on clustering. In addition, the convergence analysis has revealed an interesting algorithmic phase transition for CESME, which has also been noted for the EM algorithm in the literature. Leveraging such a transition, a data-adaptive procedure is developed and substantially improves the computational efficiency of CESME. Extensive numerical study and real data analysis show that CESME outperforms the existing high-dimensional clustering algorithms including CHIME, sparse spectral clustering, sparse K-means, sparse convex clustering, and IF-PCA.
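
CESME itself is not reproduced here, but the modeling idea, a Gaussian mixture assumed only after an unknown monotone transformation, can be illustrated with a crude stand-in: transform each coordinate by its marginal normal scores and then fit an ordinary Gaussian mixture. The data, the exp distortion, and the two-cluster setup below are all hypothetical, and no sparsity constraint is imposed.

```python
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

def normal_scores(x):
    """Marginal rank-based transform toward normality, a crude
    stand-in for the unknown monotone transformation."""
    n = len(x)
    ranks = stats.rankdata(x)
    return stats.norm.ppf(ranks / (n + 1.0))

rng = np.random.default_rng(1)
# Two latent Gaussian clusters, observed through a monotone distortion (exp)
z = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(2, 1, (100, 5))])
x = np.exp(z)                                   # observed, non-Gaussian data
x_t = np.apply_along_axis(normal_scores, 0, x)  # transform each coordinate
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(x_t)
print(np.bincount(labels))
```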

Date:
-
Location:
MDS 220
Tags/Keywords:
Event Series:

Seminar

Jonas Beck, Paris Lodron University of Salzburg, Department of Artificial Intelligence and Human Interfaces, Austria, jonas.beck@plus.ac.at

 

Abstract: A fundamental functional in nonparametric statistics is the Mann-Whitney functional θ = P(X < Y), which constitutes the basis for the most popular nonparametric procedures. The functional θ measures a location or stochastic tendency effect between two distributions. A limitation of θ is its inability to capture scale differences. If differences of this nature are to be detected, specific tests for scale or omnibus tests need to be employed. However, the latter often suffer from low power, and they do not yield interpretable effect measures. In this work, we extend θ by additionally incorporating the recently introduced distribution overlap index I₂, a nonparametric dispersion measure that can be expressed in terms of the quantile process. We derive the joint asymptotic distribution of the respective estimators of θ and I₂ and construct confidence regions. Extending the Wilcoxon-Mann-Whitney test, we introduce a new test based on the joint use of these functionals. It results in much larger consistency regions while maintaining power competitive with the rank-sum test in situations where θ alone would suffice. Compared with classical omnibus tests, the simulated power is much improved. Additionally, the newly proposed inference method yields effect measures whose interpretation is surprisingly straightforward.
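
As a point of reference, the plug-in estimator of θ = P(X < Y) (with ties counted as 1/2) is the Wilcoxon-Mann-Whitney statistic rescaled to [0, 1]; a minimal sketch is below. The overlap index I₂, the joint confidence regions, and the combined test from the abstract are not reproduced, and the data are simulated for illustration.

```python
import numpy as np

def mann_whitney_theta(x, y):
    """Plug-in estimate of theta = P(X < Y), counting ties as 1/2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise comparisons: 1 if x_i < y_j, 0.5 if tied, 0 otherwise
    less = (x[:, None] < y[None, :]).astype(float)
    ties = (x[:, None] == y[None, :]).astype(float)
    return (less + 0.5 * ties).mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=50)
y = rng.normal(0.5, 1.0, size=60)
print(mann_whitney_theta(x, y))  # values above 0.5 suggest Y tends to be larger than X
```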

Date:
-
Location:
MDS 220
Tags/Keywords:
Event Series:

Gaussian Process Modeling for Dissolution Curve Comparison

Title: Gaussian Process Modeling for Dissolution Curve Comparison

Abstract: Dissolution studies are an integral part of pharmaceutical drug development, yet standard methods for analyzing dissolution data are inadequate for capturing the true underlying shapes of the dissolution curves. Methods based on similarity factors, such as the f2 statistic, have been developed to demonstrate comparability of dissolution curves; however, this inability to capture the shapes of the dissolution curves can lead to substantial bias in comparability estimators. In this talk, we propose two novel semi-parametric dissolution curve modeling strategies for establishing the comparability of dissolution curves. The first method relies upon hierarchical Gaussian process regression models to construct an f2 statistic based on continuous-time modeling that results in significant bias reduction. The second method uses a Bayesian model selection approach for creating a framework that does not suffer from the limitations of the f2 statistic. Overall, these two methods are shown to be superior to their comparator methods and provide feasible alternatives for similarity assessment under practical limitations. Illustrations highlighting the success of our methods are provided for two motivating real dissolution data sets from the literature, as well as extensive simulation studies.
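
For context, the conventional f2 similarity factor mentioned above has a simple closed form based on the mean squared difference between two profiles at the observed time points; a minimal sketch is below. This is the standard discrete-time statistic, not the continuous-time, Gaussian-process-based version proposed in the talk, and the example profiles are hypothetical.

```python
import numpy as np

def f2_similarity(reference, test):
    """Standard f2 similarity factor for two dissolution profiles
    measured at the same time points (percent dissolved)."""
    r = np.asarray(reference, dtype=float)
    t = np.asarray(test, dtype=float)
    msd = np.mean((r - t) ** 2)  # mean squared difference across time points
    return 50.0 * np.log10(100.0 / np.sqrt(1.0 + msd))

# Hypothetical profiles: percent dissolved at 15, 30, 45, 60 minutes
ref = [35, 60, 80, 92]
tst = [32, 57, 78, 90]
print(f2_similarity(ref, tst))  # f2 >= 50 is the usual similarity criterion
```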

Date:
-
Location:
MDS 220
Tags/Keywords:
Event Series:

Linear Models for matrix-variate data

Abstract: Observations made on p response variables, each measured over n sites or time points, form a matrix-variate response and arise across a wide range of disciplines, including medical, environmental, and agricultural studies. The observations in an (n x p)-dimensional matrix-variate sample are not independent but doubly correlated. The popularity of the classical general linear model (CGLM) is largely due to the ease of modeling and of verifying the appropriateness of the model. However, the CGLM is not appropriate for doubly correlated matrix-variate data. We propose an extension of the CGLM for matrix-variate data with exchangeably distributed errors for multiple observations. Maximum likelihood estimates of the matrix parameters of the intercept, the slope, and the eigenblocks of the exchangeable error matrix are derived, along with the distributions of these estimators. The practical implications of the methodological aspects of the proposed extended model are demonstrated using two medical datasets.
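
To illustrate what "doubly correlated" means for an (n x p) matrix-variate observation, the sketch below generates a single error matrix whose rows and columns are both correlated, using a generic separable (Kronecker) covariance. The exchangeable error structure and its eigenblocks in the proposed model are a particular patterned case not reproduced here, and the covariance matrices shown are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 3  # sites/time points x response variables

# Illustrative row and column covariances (hypothetical choices).
Sigma_rows = 0.5 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))  # AR(1)-like across sites
Sigma_cols = np.array([[1.0, 0.3, 0.2],
                       [0.3, 1.0, 0.4],
                       [0.2, 0.4, 1.0]])  # correlation among the p responses

# One doubly correlated (n x p) error matrix: vec(E) ~ N(0, Sigma_cols ⊗ Sigma_rows)
L_r = np.linalg.cholesky(Sigma_rows)
L_c = np.linalg.cholesky(Sigma_cols)
E = L_r @ rng.normal(size=(n, p)) @ L_c.T
print(E.shape)  # (6, 3): correlated both across rows and across columns
```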

Date:
-
Location:
MDS 220
Tags/Keywords:
Event Series:

Surrogate method for partial association between mixed data with application to well-being survey analysis

Abstract: 

This paper is motivated by the analysis of a survey study focusing on college student well-being before and after the COVID-19 pandemic outbreak. A statistical challenge in well-being studies lies in the multidimensionality of outcome variables, recorded in various scales such as continuous, binary, or ordinal. The presence of mixed data complicates the examination of their relationships when adjusting for important covariates. To address this challenge, we propose a unifying framework for studying partial association between mixed data. We achieve this by defining a unified residual using the surrogate method. The idea is to map the residual randomness to a consistent continuous scale, regardless of the original scales of outcome variables. This framework applies to parametric or semiparametric models for covariate adjustments. We validate the use of such residuals for assessing partial association, introducing a measure that generalizes classical Kendall’s tau to capture both partial and marginal associations. Moreover, our development advances the theory of the surrogate method by demonstrating its applicability without requiring outcome variables to have a latent variable structure. In the analysis of the college student well-being survey, our proposed method unveils the contingency of relationships between multidimensional well-being measures and micro personal risk factors (e.g., physical health, loneliness, and accommodation), as well as the macro disruption caused by COVID-19. 
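
A rough sketch of the surrogate idea for the simplest case, a binary logit outcome, is given below: a continuous surrogate is drawn from the latent logistic distribution restricted to the region implied by the observed response, the linear predictor is subtracted to form a residual, and Kendall's tau between the residuals of two outcomes serves as a crude partial-association measure. The ordinal and mixed-scale machinery, the generalized tau of the paper, and the covariate models are all simplified here, and the data and coefficients are simulated.

```python
import numpy as np
from scipy import stats

def surrogate_residuals(y, eta, rng):
    """Sketch of a surrogate residual for a binary logit model: draw a
    latent logistic variable consistent with the observed outcome, then
    remove the covariate effect (the linear predictor eta)."""
    u = rng.uniform(size=len(y))
    p0 = stats.logistic.cdf(-eta)  # P(latent < 0), i.e. P(y = 0)
    # Inverse-CDF sampling from the latent logistic truncated to the region
    # implied by the observed y (below 0 if y == 0, above 0 if y == 1).
    q = np.where(y == 0, u * p0, p0 + u * (1.0 - p0))
    s = eta + stats.logistic.ppf(q)  # surrogate on a continuous scale
    return s - eta                   # residual after removing the covariate effect

rng = np.random.default_rng(3)
x = rng.normal(size=500)
eta1, eta2 = 0.8 * x, -0.5 * x
y1 = rng.binomial(1, stats.logistic.cdf(eta1))
y2 = rng.binomial(1, stats.logistic.cdf(eta2))
r1 = surrogate_residuals(y1, eta1, rng)
r2 = surrogate_residuals(y2, eta2, rng)
print(stats.kendalltau(r1, r2))  # Kendall's tau between residuals as a partial-association proxy
```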

Date:
-
Location:
MDS 220
Tags/Keywords:
Event Series: