Statistics Seminar

Bayesian Hierarchical Modeling for Inferring the Causal Relationship Between Human Activities and Climate Change Impacts

Abstract: While the impacts of heat waves, droughts, and floods have been increasing along with rising greenhouse gas concentrations, the complex structure of natural variability in the climate system makes it challenging to precisely quantify the extent to which human activities are responsible for observed changes. The statistical methods used by high-profile scientific bodies to address this connection have recently been found to underestimate the magnitude of variability, resulting in potentially misleading over-confidence. To address this issue, I propose a physically-informed basis function parameterization of the covariance structure within a regularized Bayesian selection method to avoid over-fitting the limited amount of data and to propagate the estimation uncertainty to the final inference. When evaluated on statistically and dynamically simulated data, this method achieves lower RMSE scores and better-calibrated posterior coverage rates than methods that rely on potentially uncertain principal components. Incorporating the physically-informed basis representation into a mixture model allows the error in the dynamical climate simulations informing the natural variability component to be assessed and accounted for in the inference procedure. Motivated by the need for policymakers and the public at large to understand the extent of human responsibility for climate impacts at specific locations, ongoing funded work aims to leverage the global covariance structure to provide robust quantification of causal connections at fine spatial scales. Longer-term extensions include the use of deep learning techniques to understand more complex distributions and non-linear causal relationships within a Bayesian framework.

Date:
-
Location:
MDS 220
Event Series:

Objective Bayesian Model Selection for Spatial Hierarchical Models with Intrinsic Conditional Autoregressive Priors

Abstract: In this talk, I present Bayesian model selection via fractional Bayes factors to simultaneously assess spatial dependence and select regressors in Gaussian hierarchical models with intrinsic conditional autoregressive (ICAR) spatial random effects. Selection of covariates and spatial model structure is difficult, as spatial confounding creates a tension between fixed and spatial random effects. Researchers have commonly performed selection separately for fixed and random effects in spatial hierarchical models. Simultaneous selection methods relieve the researcher from arbitrarily fixing one of these types of effects while selecting the other. Notably, Bayesian approaches to simultaneously select covariates and spatial effects are limited. My use of fractional Bayes factors allows for selection of fixed effects and spatial model structure under automatic reference priors for model parameters, which obviates the need to specify hyperparameters for priors. I also derive the minimal training size for the fractional Bayes factor applied to the ICAR model under the reference prior. I present a simulation study to assess the performance of my approach and compare results to the Deviance Information Criterion and Widely Applicable Information Criterion. I show that my fractional Bayes factor approach assigns low posterior model probability to spatial models when the data are truly independent and reliably selects the correct covariate structure. An imminent software update to my existing R package, ref.ICAR, will include this selection method, making automatic Bayesian model selection and analysis readily available for spatial data. This is important, as a subject matter expert is not required to subjectively specify prior distributions to obtain selection or inference results.
Finally, I demonstrate my Bayesian model selection approach with applications to county-level median household income in the contiguous United States and residential crime rates in the neighborhoods of Columbus, Ohio.

Date:
-
Location:
MDS 220
Event Series:

Model-Free Conditional Feature Screening with FDR Control

Abstract: In this talk, I will present a new model-free conditional feature screening method with false discovery rate (FDR) control for ultra-high dimensional data. The proposed method is built upon a new measure of conditional independence. Thus, the new method does not require a specific functional form of the regression function and is robust to heavy-tailed responses and predictors. The variables to be conditional on are allowed to be multivariate. The proposed method enjoys sure screening and ranking consistency properties under mild regularity conditions. To control the FDR, we apply the Reflection via Data Splitting method and prove its theoretical guarantee using martingale theory and empirical process techniques. Simulated examples and real data analysis show that the proposed method performs well compared with existing works.

Date:
-
Location:
MDS 220
Event Series:

The Promises of Parallel Outcomes

Abstract: A key challenge in causal inference from observational studies is the identification and estimation of causal effects in the presence of unmeasured confounding. In this talk, I will introduce a novel approach for causal inference that leverages information in multiple outcomes to deal with unmeasured confounding. The key assumption in this approach is conditional independence among multiple outcomes. In contrast to existing proposals in the literature, the roles of multiple outcomes in the key identification assumption are symmetric, hence the name parallel outcomes. I will show nonparametric identifiability with at least three parallel outcomes and provide parametric estimation tools under a set of linear structural equation models. The method is applied to a data set from the Alzheimer's Disease Neuroimaging Initiative to study the causal effects of tau protein level on regional brain atrophies.

Date:
-
Location:
MDS 220
Event Series:

High-Dimensional Directed Network Analysis of Human Brains

Abstract: The human brain is a high-dimensional directed network system consisting of many regions as network nodes that exert influence on each other. The directed influence from one region to another is called directed connectivity and corresponds to one directed edge in the directed brain network. To understand how brain regions interact with each other and form different brain network patterns when performing different functions, we develop statistical modeling approaches to reveal high-dimensional directed brain networks using brain data. In this talk, I will present two models. (1) The first model is for studying normal and abnormal directed brain networks of patients with epilepsy using their intracranial electroencephalography (EEG) data. Epilepsy is a directed network disorder, as epileptic activity spreads from a seizure onset zone (SOZ) to many other regions after seizure onset. Intracranial EEG data are multivariate time series recordings of many brain regions. With our proposed model, we revealed the evolution of brain networks from normal to abnormal states and uncovered unique directed connectivity properties of the SOZ during seizure development. (2) The second model characterizes whole-brain directed networks of a population of healthy subjects based on functional magnetic resonance imaging (fMRI) data. We also propose a computationally efficient algorithm to address the challenge of analyzing thousands of subjects’ fMRI data. Using our new model and algorithm, we analyzed the resting-state fMRI data of around one thousand subjects from the Human Connectome Project (HCP). We revealed both population-mean and subject-specific whole-brain directed networks. Finally, I will introduce my future research.

Date:
-
Location:
MDS 220
Event Series:

New Biomarker Evaluation Metrics and their Inferences

Abstract: The development of biomarkers into diagnostic and prognostic tests can be categorized into three broad phases: discovery, performance evaluation, and impact determination when added to existing clinical measures. From a statistician’s perspective, this talk covers some key concepts in performance evaluation, including classification types and classification metrics. The limitations of existing classification metrics and the importance of using appropriate classification metrics are highlighted. Specifically, this talk presents some new efficient biomarker evaluation metrics to address the common pitfall caused by “naïve pooling” in biomarker evaluation and the inefficiency of existing metrics in multiple ordered classification. Related statistical inference methods are also presented. An ovarian cancer data set from the PLCO cancer study is analyzed.

Date:
-
Location:
Zoom
Event Series:

Epsilon-greedy strategy for nonparametric bandits

Abstract: Contextual bandit algorithms are popular for sequential decision-making in several practical applications, ranging from online advertisement recommendations to mobile health. The goal of such problems is to maximize cumulative reward over time for a set of choices/arms while considering covariate (or contextual) information. Epsilon-Greedy is a popular heuristic for the Multi-Armed Bandits problem; however, it has received comparatively little theoretical study in the presence of contextual information. We study the Epsilon-Greedy strategy in nonparametric bandits, i.e., when no parametric form is assumed for the reward functions. In this work, we assume that the similarities between the covariates and expected rewards can be modeled as arbitrary linear functions of the contexts' images in a specific reproducing kernel Hilbert space (RKHS). We propose a kernelized epsilon-greedy algorithm and establish its convergence rates for estimation and cumulative regret, which are closely tied to the intrinsic dimensionality of the RKHS. We show that the rates closely match the optimal rates for linear contextual bandits when restricted to a finite-dimensional RKHS. Lastly, we illustrate our results through simulation studies and real-data analysis.
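
For readers unfamiliar with the heuristic named in the abstract, the following is a minimal sketch of the basic, non-contextual epsilon-greedy rule. It is not the kernelized algorithm from the talk: it ignores covariates entirely, and the function names (`epsilon_greedy`, `bernoulli`), the arm success probabilities, and the choice of epsilon are assumptions made purely for illustration.

```python
import random


def epsilon_greedy(reward_fns, n_rounds, epsilon=0.1, seed=0):
    """Run a basic (non-contextual) epsilon-greedy bandit.

    reward_fns: list of callables; each takes a random.Random and returns a reward.
    Returns (counts, means): number of pulls and running mean reward per arm.
    """
    rng = random.Random(seed)
    n_arms = len(reward_fns)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest estimated mean reward.
            arm = max(range(n_arms), key=lambda a: means[a])
        reward = reward_fns[arm](rng)
        counts[arm] += 1
        # Incremental update of the running mean for the chosen arm.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


def bernoulli(p):
    """Hypothetical arm: pays 1.0 with probability p, else 0.0."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


# Three illustrative arms with assumed success probabilities.
arms = [bernoulli(0.1), bernoulli(0.5), bernoulli(0.9)]
counts, means = epsilon_greedy(arms, n_rounds=5000, epsilon=0.1)
print(counts, [round(m, 3) for m in means])
```

In the contextual, nonparametric setting described in the abstract, the per-arm running means would be replaced by estimates of reward functions of the covariates; that step is not shown here.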

Date:
-
Location:
MDS 220
Event Series:

Individual- and Community-Level Disease Risk Prediction through the Integration of Information across Disparate Data Sources

Abstract: Large-scale epidemiologic studies are rapidly leading to novel findings of risk factors associated with various human diseases. The increasing availability of multi-modal health data provides major opportunities to develop data integration methods for building advanced risk prediction tools that incorporate a rich set of risk factors, which could yield more effective prevention strategies for healthy individuals and treatment strategies for patients. Such data fusion has been an understudied area with many open questions. In this talk, I will present some of my recent work on data integration methods for risk model development with two specific examples. The first example focuses on integrating individual- and summary-level information from studies on different types of risk factors and community-level pandemic dynamics to develop individualized prediction models for COVID-19 mortality risk. Such a methodological framework can be applied to predict and validate the risk of other diseases at both the individual and community levels and can be continuously updated as new datasets or information become available. In the second application, we develop enhanced genome-wide polygenic risk prediction models for underrepresented non-European populations by appropriately borrowing information across ancestries through the integration of ancestry-specific genetic datasets. Both applications demonstrate the future promise of data integration methods for developing comprehensive risk models and informing targeted disease prevention strategies.

Date:
-
Location:
MDS 220
Event Series:

A joint directed acyclic graph estimation model to detect aberrant brain connectivity in schizophrenia

Abstract: Functional connectivity (FC) between brain regions has been widely studied and linked with an individual's cognition and behavior. FC is usually defined as the correlation or partial correlation of fMRI signals between two brain regions. Although FC has been effective for understanding brain organization, it cannot reveal the direction of interactions. Many directed acyclic graph (DAG) based methods have been applied to study the directed interactions, but their performance has been limited by the small sample size and high dimensionality of the available data. By enforcing group regularization and utilizing samples from both case and control groups, we propose a joint DAG model to estimate the directed FC. We first demonstrate that the proposed model is efficient and accurate through a series of simulation studies. We then apply it to a case-control study of schizophrenia (SZ) with data collected from the MIND Clinical Imaging Consortium (MCIC). We have successfully identified decreased functional integration, disrupted hub structures, and characteristic edges in SZ patients. Those findings have been confirmed by previous studies, with some identified as potential markers for SZ patients. A comparison of the results between the directed FC and undirected FC showed substantial differences in the selected features. In addition, we used the identified features based on directed FC for the classification of SZ patients and achieved better accuracy than using undirected FC or raw features, demonstrating the advantage of using directed FC for brain network analysis.
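
To make the distinction between correlation and partial correlation in the abstract concrete, here is a small sketch on simulated signals rather than fMRI data. The helper names (`pearson`, `partial_corr`), the noise levels, and the three-signal setup are assumptions for illustration only; the joint DAG model from the talk is not implemented here. Two signals `x` and `y` are both driven by a third signal `z`, so they are strongly correlated, yet their partial correlation given `z` should be close to zero.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Simulated signals (assumed setup): z drives both x and y.
rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(2000)]
x = [v + rng.gauss(0.0, 0.5) for v in z]
y = [v + rng.gauss(0.0, 0.5) for v in z]

r_xy = pearson(x, y)             # marginal correlation, inflated by the shared driver
r_partial = partial_corr(x, y, z)  # association left after removing z
print(round(r_xy, 3), round(r_partial, 3))
```

Neither quantity carries directional information: swapping `x` and `y` leaves both values unchanged, which is the limitation the abstract points to when motivating directed models.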

 
Date:
-
Location:
MDS 223
Event Series:

Novel Methods for Multi-ancestry Polygenic Prediction and their Evaluations in 3.7 Million Individuals of Diverse Ancestry

Abstract: Polygenic risk scores (PRS) are becoming increasingly predictive of complex traits, but subpar performance in non-European populations raises concerns about their potential clinical applications. We develop a powerful and scalable method to calculate PRS using GWAS summary statistics from multi-ancestry training samples by integrating multiple techniques, including clumping and thresholding, empirical Bayes, and super learning. We evaluate the performance of the proposed method and a variety of alternatives using large-scale simulated GWAS on ~19 million common variants and large 23andMe Inc. datasets, including up to 800K individuals from four non-European populations, across seven complex traits. Results show that the proposed method can substantially improve the performance of PRS in non-European populations relative to simple alternatives and has comparable or superior performance relative to more advanced and less flexible methods that require substantially more computational time. Further, our simulation studies provide novel insights into sample size requirements and the effect of SNP density on multi-ancestry risk prediction.

Date:
Location:
MDS 220
Event Series: