
Statistics Seminar

Generalized Matrix Decomposition: Exploratory Analysis, Prediction, and Inference

Abstract:  Analysis of two-way structured data, i.e., data with structures among both variables and samples, is becoming increasingly common in ecology, biology and neuroscience. For example, a sample-by-taxon abundance data matrix in a microbiome study may have columns structured by the phylogeny of taxa and rows structured by an ecologically defined distance between samples. Classical dimension-reduction tools, such as the singular value decomposition (SVD), may perform poorly for two-way structured data. The generalized matrix decomposition (GMD, Allen et al., 2014) extends the SVD to two-way structured data and thus constructs singular vectors that account for both structures. In the first part of the talk, I will present a graphical visualization tool for two-way structured data, called the GMD-biplot, that can simultaneously display sample clustering and important variables that contribute to the observed sample clustering. In the second part of the talk, I propose the GMD regression (GMDR) as an estimation/prediction tool that seamlessly incorporates two-way structures into high-dimensional linear models. The proposed GMDR directly regresses the outcome on a set of GMD components, selected by a novel procedure that guarantees the best prediction performance. We then propose the GMD inference (GMDI) framework to identify variables that are associated with the outcome for any model in a large family of regression models that includes GMDR. As opposed to most existing tools for high-dimensional inference, GMDI efficiently accounts for pre-specified two-way structures and can provide asymptotically valid inference even for non-sparse coefficient vectors. 
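As a rough illustration of how a decomposition of this kind can be computed, the sketch below obtains a GMD from an ordinary SVD of a transformed matrix. It assumes the row structure H and the column structure Q are symmetric positive-definite matrices, and the function names `gmd` and `sym_power` are hypothetical, chosen for this example only. It is a minimal sketch of the general idea, not the speaker's implementation.

```python
import numpy as np

def sym_power(A, power):
    """Power of a symmetric positive-definite matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * vals**power) @ vecs.T

def gmd(X, H, Q, rank):
    """Rank-`rank` GMD of X (n x p) with respect to H (n x n) and Q (p x p).

    Assumes H and Q are symmetric positive definite (an assumption of this sketch).
    Returns U, d, V with U.T @ H @ U = I and V.T @ Q @ V = I.
    """
    H_half, Q_half = sym_power(H, 0.5), sym_power(Q, 0.5)
    # Ordinary SVD of the transformed matrix H^{1/2} X Q^{1/2}.
    U_t, d, Vt_t = np.linalg.svd(H_half @ X @ Q_half, full_matrices=False)
    # Map the singular vectors back to the original coordinates.
    U = sym_power(H, -0.5) @ U_t[:, :rank]
    V = sym_power(Q, -0.5) @ Vt_t[:rank].T
    return U, d[:rank], V
```

With H and Q both set to identity matrices, this reduces to the usual truncated SVD, which is one way to see the GMD as an extension of the SVD.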
 
Dr. Wang is an assistant professor of Statistics in the School of Mathematical and Natural Sciences at Arizona State University. Prior to joining ASU, he worked as a senior fellow in the Department of Biostatistics, University of Washington. Dr. Wang obtained his PhD degree in Biostatistics at UNC-Chapel Hill in 2018. 
 
 
Date:
Location:
https://uky.zoom.us/j/81804516643
Event Series:

High-dimensional Change-point Detection Using Generalized Homogeneity Metrics

Abstract: Change-point detection has been a classical problem in statistics, finding applications in a wide variety of fields. A nonparametric change-point detection procedure is concerned with detecting abrupt distributional changes in the data generating distribution, rather than only mean changes. We consider the problem of detecting an unknown number of change-points in an independent sequence of high-dimensional observations and testing for the significance of the estimated change-point locations. Our approach essentially rests upon nonparametric tests for the homogeneity of two high-dimensional distributions. We construct a single change-point location estimator via defining a cumulative sum process in an embedded Hilbert space. As the main theoretical innovation, we rigorously derive its limiting distribution under the high-dimension medium sample size framework. Subsequently, we combine our statistic with the idea of wild binary segmentation to recursively estimate and test for multiple change-point locations. The superior performance of our methodology compared to several other existing procedures is illustrated via both simulated and real datasets.
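For readers unfamiliar with cumulative sum (CUSUM) processes, the sketch below shows the classical univariate version, which locates a single change in the mean of a sequence. It is a simplified stand-in and not the procedure from the talk, which defines the CUSUM process in an embedded Hilbert space and targets general distributional changes in high dimensions. The function name `cusum_change_point` is hypothetical.

```python
import numpy as np

def cusum_change_point(x):
    """Estimate a single mean change point in a 1-D sequence.

    Returns (k_hat, stats), where k_hat splits the data into x[:k_hat] and
    x[k_hat:], and stats[k - 1] is the scaled CUSUM statistic at split k.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    k = np.arange(1, n)                  # candidate split points 1, ..., n - 1
    left_sums = np.cumsum(x)[:-1]        # S_k = x[0] + ... + x[k - 1]
    # |S_k - (k / n) S_n|, scaled so that all splits are comparable.
    stats = np.abs(left_sums - k / n * x.sum()) / np.sqrt(k * (n - k) / n)
    return int(k[np.argmax(stats)]), stats
```

Multiple change points are typically handled by applying a single-change estimator of this kind recursively on sub-intervals, which is the role wild binary segmentation plays in the abstract.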

Date:
Location:
https://uky.zoom.us/j/86710704783
Event Series:

Joint Bayesian analysis of multiple response-types using the hierarchical generalized transformation model

Abstract: Consider the situation where an analyst has a Bayesian statistical model that performs well for continuous data. However, suppose the observed dataset consists of multiple response-types (e.g., continuous, count-valued, Bernoulli trials, etc.), which are distributed from more than one class of distributions. We refer to these types of data as "multiple response-type" datasets. The goal of this talk is to introduce a reasonably easy-to-implement, all-purpose method that "converts" a Bayesian statistical model for continuous responses (call this the preferred model) into a Bayesian model for multiple response-type datasets. To do this, we consider a transformation of the multiple response-type data, such that the transformed data can be reasonably modeled using the preferred model. What is unique about our strategy is that we treat the transformations as unknown and use a Bayesian approach to model this uncertainty. The implementation of our Bayesian approach to unknown transformations is straightforward and involves two steps. The first step produces posterior replicates of the transformed multiple response-type data from a latent conjugate multivariate (LCM) model. The second step involves generating values from the posterior distribution implied by the preferred model. We demonstrate the flexibility of our model through an application to Bayesian additive regression trees (BART) and a spatio-temporal mixed effects (SME) model. We provide a thorough joint multiple response-type spatio-temporal analysis of coronavirus disease 2019 (COVID-19) cases, the adjusted closing price of the Dow Jones Industrial, and Google Trends data.

Date:
Location:
https://uky.zoom.us/j/98209994722
Event Series:

Data Augmentation Algorithms for Bayesian Analysis of Directional Data

Abstract: Novel data augmentation algorithms are proposed for Bayesian analysis of directional data in arbitrary dimensions. The approach leads to new classes of distributions, which are constructed in detail. The proposed data augmentation strategies circumvent the need for analytic approximations to integration, numerical integration, or Metropolis-Hastings for the corresponding posterior inference. Simulations and real data examples are presented to demonstrate the applicability and to appraise the performance of the procedure.
 
Having trouble logging in to the session?  Please email cgu254@uky.edu for assistance!
 
Date:
Location:
https://uky.zoom.us/j/95415114536
Event Series:

Penalized likelihood estimation for Pearson's family of distributions, with an application to financial market risk

Abstract: Pearson’s family of distributions consists of all continuous densities f which are solutions to the differential equation:

f'(x) = −g_β(x) f(x), where g_β(x) = (x − β1) / (β2 + β3 x + β4 x^2) for all x in a connected subset of the real line and β = (β1, β2, β3, β4) is a given vector.

It is a rich class of models which includes many classical distributions and which can accommodate both skewness and flexible tail behavior. However, estimation of a Pearson density is challenging because a small variation in β can induce a wild change in the shape of the solution f_β. In this talk, I will show how β and f_β can be estimated effectively through a penalized likelihood procedure incorporating Pearson’s differential equation. The approach relies on a parameter cascading method from the functional data analysis literature. Simulations and an illustration involving the S&P 500 index will show that it leads to estimates of Value-at-Risk and Expected Shortfall that can substantially improve market risk assessment by outperforming the estimates currently used by financial institutions and regulators. This talk is based on joint work with M. Carey (Dublin) and my colleague J.O. Ramsay.
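As a quick check of the claim that the family includes classical distributions, consider the special case β3 = β4 = 0 with β2 > 0 (a choice made here for illustration, not one taken from the talk). Pearson's differential equation then integrates directly to a normal density with mean β1 and variance β2:

```latex
\begin{aligned}
\frac{f'(x)}{f(x)} &= -\frac{x - \beta_1}{\beta_2}
  && \text{(Pearson's equation with } \beta_3 = \beta_4 = 0\text{)} \\
\log f(x) &= -\frac{(x - \beta_1)^2}{2\beta_2} + c
  && \text{(integrating both sides in } x\text{)} \\
f(x) &= \frac{1}{\sqrt{2\pi\beta_2}}
  \exp\!\left(-\frac{(x - \beta_1)^2}{2\beta_2}\right)
  && \text{(normalizing so that } f \text{ integrates to 1)}
\end{aligned}
```

Nonzero β3 and β4 change the denominator of g_β, which is how the family accommodates skewness and heavier or lighter tails.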

Date:
Location:
https://uky.zoom.us/j/97992418835
Event Series:

A Multivariate Spatio-temporal Change Point Model of Opioid Overdose Deaths in Ohio

Abstract: Ohio is one of the states most impacted by the opioid epidemic and experienced the second highest age-adjusted fatal drug overdose rate in 2017. Initially, it was believed that prescription opioids were driving the opioid crisis in Ohio. However, as the epidemic evolved, opioid overdose deaths due to fentanyl have drastically increased. In this work, we develop a Bayesian multivariate spatio-temporal model for Ohio county overdose death rates from 2007 to 2018 due to different types of opioids. The log-odds are assumed to follow a spatially varying change point regression model. By assuming the regression coefficients are a multivariate conditional autoregressive process, we capture spatial dependence within each drug type and also dependence across drug types. The proposed model allows us not only to study spatio-temporal trends in overdose death rates, but also to detect county-level shifts in these trends over time for various types of opioids.

Dr. Staci Hepler is originally from southern Ohio and earned a Bachelor's degree in Mathematics Education from Shawnee State University in 2010 before going on to earn a PhD in Statistics from The Ohio State University. In 2015 Staci joined the faculty in the Department of Mathematics and Statistics at Wake Forest University in Winston-Salem, NC. Her primary research interests are in applied spatio-temporal statistics and Bayesian modeling, and she focuses on problems in public health, ecology, and environmental science. 

 

 

Date:
Location:
https://uky.zoom.us/j/93806079284
Event Series:

Conjugate Bayesian Modeling of High-Dimensional Count Valued Survey Data Under Informative Sampling Designs

Abstract: We introduce a computationally efficient Bayesian model for predicting high-dimensional dependent count-valued data. In this setting, the Poisson data model with a latent Gaussian process model has become the de facto model. However, this model can be difficult to use in high-dimensional settings, where the data may be tabulated over different variables, geographic regions, and times. These computational difficulties are further exacerbated by the fact that count-valued data are naturally non-Gaussian. Thus, many current approaches to Bayesian inference require one to carefully calibrate a Markov chain Monte Carlo (MCMC) technique. We avoid MCMC methods that require tuning by developing a new conjugate multivariate distribution. To incorporate dependence between variables, regions, and time points, a multivariate spatio-temporal mixed effects model (MSTM) is used, resulting in an area-level model. In contrast, unit-level models for survey data offer many advantages over their area-level counterparts, such as potential for more precise estimates and a natural benchmarking property. However, two main challenges occur in this context: accounting for an informative survey design and handling non-Gaussian data types. The pseudo-likelihood approach is one solution to the former, and conjugate multivariate distribution theory offers a solution to the latter. By combining these approaches, we attain a unit-level model for count data that accounts for informative sampling designs and includes fully Bayesian model uncertainty propagation. Importantly, conjugate full conditional distributions hold under the pseudo-likelihood, yielding an extremely computationally efficient approach. Our methods are illustrated using data obtained from the US Census Bureau’s American Community Survey (ACS) and Longitudinal Employer-Household Dynamics (LEHD) program.

A link to the signup sheet for meals and meetings can be found here.

 

Date:
Location:
MDS 220
Event Series: