
Statistics Seminar

Nonparametric Finite Mixtures for Overcoming Biomarker-Error Bias


Solomon W. Harrar

 

Personalized medicine research involves investigating the differential effect of treatments in patient groups defined by specific characteristics. In enrichment trials, participants are stratified based on biomarkers to assess the effectiveness of treatments on these groups. However, biomarkers are susceptible to misclassification errors, leading to bias. We propose nonparametric methods to estimate treatment effects and quantify the bias due to biomarker misclassification errors. Our methods are applicable to outcomes measured on ordinal, discrete, or continuous scales, without requiring assumptions such as the existence of moments. Simulation results show significant improvements in bias reduction, coverage probability, and power compared to existing methods. We illustrate the application of our methods using gene expression profiling of bronchial airway brushing in asthmatic and healthy control subjects.

 

I will use the first 10-15 minutes to share a brief account of my fall 2022 sabbatical experience.

Location: MDS 220

Online Estimation and Network Point Processes


Abstract: A common goal in network modeling is to uncover the latent structure present among nodes. For many real-world networks, the true connections consist of events arriving as streams, which are then aggregated to form edges, ignoring the dynamic temporal component. A natural way to account for the temporal dynamics of interactions is to use point processes as the foundation of network models. Computational complexity, however, can hamper the scalability of such methods to the large, sparse networks that occur in modern settings. We consider the use of online variational inference as a way of scaling such methods when the goal is community detection.

Location: MDS 220

Enhancing the Study of Microbiome-Metabolome Interactions: A Transfer-Learning Approach for Precise Identification of Essential Microbes

Abstract: Recent research has revealed the essential role that microbial metabolites play in host-microbiome interactions. Although statistical and machine-learning methods have been employed to explore microbiome-metabolome interactions in multiview microbiome studies, most of these approaches focus solely on the prediction of microbial metabolites, which lacks biological interpretation. Additionally, existing methods face limitations in either prediction or inference due to small sample sizes and highly correlated microbes and metabolites. To overcome these limitations, we present a transfer-learning method that evaluates microbiome-metabolome interactions. Our approach efficiently utilizes information from comparable metabolites obtained through external databases or data-driven methods, resulting in more precise predictions of microbial metabolites and identification of essential microbes involved in each microbial metabolite. Our numerical studies demonstrate that our method enables a deeper understanding of the mechanism of host-microbiome interactions and establishes a statistical basis for potential microbiome-based therapies for various human diseases.

 

Location: MDS 220

Musings on Subdata Selection

Abstract: Data reduction or summarization methods for large datasets (full data) aim to make inferences by replacing the full data with the reduced or summarized data. Data storage and computational costs are among the primary motivations for this. In this presentation, data reduction will mean the selection of a subset (subdata) of the observations in the full data. While data reduction has been around for decades, its impact continues to grow, with approximately 2.5 exabytes (2.5 x 10^18 bytes) of data collected per day. We will begin by discussing an information-based method for subdata selection under the assumption that a linear regression model is adequate. A strength of this method, which is inspired by ideas from optimal design of experiments, is that it is superior to competing methods in terms of statistical performance and computational cost when the model is correct. A weakness of the method, shared with other model-based methods, is that it can give poor results if the model is incorrect. We will therefore conclude with discussions of a method based on a more flexible model and a model-free method.

 

The work discussed has benefited from various collaborators, including Rakhi Singh (Binghamton U), HaiYing Wang (U of Connecticut), and Min Yang (U of Illinois at Chicago).
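As a rough illustration of the information-based idea, the sketch below follows the IBOSS-style rule of keeping, for each covariate, the observations with the most extreme values, which tends to enlarge the information matrix of the selected subdata under a linear model. The function name and the exact tail rule are illustrative assumptions, not the speaker's algorithm.

```python
import numpy as np

def iboss_subdata(X, r):
    """Select at most r rows by taking, for each covariate, the rows with
    the largest and smallest values -- an information-inspired heuristic
    for subdata selection under a linear regression model."""
    n, p = X.shape
    k = max(1, r // (2 * p))          # rows kept per tail, per covariate
    chosen = set()
    for j in range(p):
        order = np.argsort(X[:, j])
        chosen.update(order[:k].tolist())    # smallest values of covariate j
        chosen.update(order[-k:].tolist())   # largest values of covariate j
    return sorted(chosen)

# demo: select roughly 80 of 1000 observations
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
idx = iboss_subdata(X, 80)
```

A single pass of sorting per covariate keeps the cost near O(np log n), which is the kind of computational saving the abstract contrasts with model fitting on the full data.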

Location: MDS 220

Bayesian Edge Regression: Characterizing Observation-Specific Heterogeneity in Estimating Undirected Graphical Models

Abstract: In this talk, I will introduce Bayesian Edge Regression, a novel edge regression model for undirected graphs, which estimates conditional dependencies as a function of subject-level covariates. By doing so, this model allows accounting for observation-specific heterogeneity in estimating networks. I will present two case studies using the proposed model: one is a set of simulation studies focused on comparing tumor and normal networks while adjusting for tumor purity; the other is an application to a dataset of proteomic measurements on plasma samples from patients with hepatocellular carcinoma (HCC), in which we ascertained how blood protein networks vary with disease severity. I will also give a brief introduction to my other research work.

Location: MDS 220

Is a Classification Procedure Good Enough? A Goodness-of-Fit Assessment Tool for Classification Learning

Abstract: In recent years, many nontraditional classification methods, such as random forests, boosting, and neural networks, have been widely used in applications. Their performance is typically measured in terms of classification accuracy. While the classification error rate and the like are important, they do not address a fundamental question: Is the classification method underfitted? To the best of our knowledge, there is no existing method that can assess the goodness of fit of a general classification procedure. Indeed, the lack of a parametric assumption makes it challenging to construct proper tests. To overcome this difficulty, we propose a methodology called BAGofT that splits the data into a training set and a validation set. First, the classification procedure to assess is applied to the training set, which is also used to adaptively find a data grouping that reveals the most severe regions of underfitting. Then, based on this grouping, we calculate a test statistic by comparing the estimated success probabilities and the actual observed responses from the validation set. The data splitting guarantees that the size of the test is controlled under the null hypothesis, and the power of the test goes to one as the sample size increases under the alternative hypothesis. For testing parametric classification models, the BAGofT has a broader scope than the existing methods since it is not restricted to specific parametric models (e.g., logistic regression). Extensive simulation studies show the utility of the BAGofT when assessing general classification procedures and its strengths over some existing methods when testing parametric classification models.
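The split-then-group-then-compare recipe can be caricatured in a few lines of Python. Everything below is an illustrative assumption rather than the actual BAGofT: the fit is a plain gradient-ascent logistic regression, the grouping uses fixed quantile bins of the predicted probability (BAGofT instead learns an adaptive grouping on the training half, which is what drives its power), and the chi-squared reference distribution is a simplification.

```python
import numpy as np
from scipy.stats import chi2

def fit_logistic(X, y, iters=500, lr=0.5):
    # plain gradient-ascent logistic regression with an intercept term
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w += lr * Xb.T @ (y - p) / len(y)
    return w

def split_gof_test(X, y, n_groups=5, seed=0):
    # split: fit the classifier on one half, assess fit on the other half
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    tr, va = perm[: len(y) // 2], perm[len(y) // 2 :]
    w = fit_logistic(X[tr], y[tr])
    Xv = np.column_stack([np.ones(len(va)), X[va]])
    p = 1.0 / (1.0 + np.exp(-Xv @ w))
    # group validation points by predicted success probability
    edges = np.quantile(p, np.linspace(0, 1, n_groups + 1))
    edges[-1] += 1e-9                  # include the right endpoint
    stat = 0.0
    for g in range(n_groups):
        m = (p >= edges[g]) & (p < edges[g + 1])
        if m.any():
            obs, exp = y[va][m].sum(), p[m].sum()
            var = (p[m] * (1.0 - p[m])).sum()
            stat += (obs - exp) ** 2 / max(var, 1e-12)
    return stat, chi2.sf(stat, df=n_groups)

# demo: data generated from a correctly specified logistic model
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-X @ np.array([1.0, -1.0, 0.5])))).astype(float)
stat, pval = split_gof_test(X, y)
```

Because the statistic is computed only on the held-out half, its null distribution does not depend on how the classifier was chosen, which is the point of the data-splitting guarantee described in the abstract.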

Location: MDS 220

Bayesian Hierarchical Modeling for Inferring the Causal Relationship Between Human Activities and Climate Change Impacts

Abstract: While the impacts of heat waves, droughts, and floods have been increasing along with rising greenhouse gas concentrations, the complex structure of natural variability in the climate system makes it challenging to precisely quantify the extent to which human activities are responsible for observed changes. Recent findings indicate that the statistical methods used by high-profile scientific bodies to address this connection underestimate the magnitude of variability, resulting in potentially misleading over-confidence. To address this issue, I propose a physically-informed basis function parameterization of the covariance structure within a regularized Bayesian selection method to avoid over-fitting the limited amount of data and to propagate the estimation uncertainty to the final inference. When evaluated on statistically and dynamically simulated data, this method achieves lower RMSE scores and better-calibrated posterior coverage rates than methods that rely on potentially uncertain principal components. Incorporating the physically-informed basis representation into a mixture model allows for the error in the dynamical climate simulations informing the natural variability component to be assessed and accounted for in the inference procedure. Motivated by the need for policymakers and the public at large to understand the extent of human responsibility for climate impacts at specific locations, ongoing funded work aims to leverage the global covariance structure to provide robust quantification of causal connections at fine spatial scales. Longer-term extensions include the use of deep learning techniques to understand more complex distributions and non-linear causal relationships within a Bayesian framework.

Location: MDS 220

Objective Bayesian Model Selection for Spatial Hierarchical Models with Intrinsic Conditional Autoregressive Priors

 

Abstract: In this talk, I present Bayesian model selection via fractional Bayes factors to simultaneously assess spatial dependence and select regressors in Gaussian hierarchical models with intrinsic conditional autoregressive (ICAR) spatial random effects. Selection of covariates and spatial model structure is difficult, as spatial confounding creates a tension between fixed and spatial random effects. Researchers have commonly performed selection separately for fixed and random effects in spatial hierarchical models. Simultaneous selection methods relieve the researcher from arbitrarily fixing one of these types of effects while selecting the other. Notably, Bayesian approaches to simultaneously select covariates and spatial effects are limited. My use of fractional Bayes factors allows for selection of fixed effects and spatial model structure under automatic reference priors for model parameters, which obviates the need to specify hyperparameters for priors. I also derive the minimal training size for the fractional Bayes factor applied to the ICAR model under the reference prior. I present a simulation study to assess the performance of my approach and compare results to the Deviance Information Criterion and Widely Applicable Information Criterion. I show that my fractional Bayes factor approach assigns low posterior model probability to spatial models when the data are truly independent and reliably selects the correct covariate structure. An imminent software update to my existing R package, ref.ICAR, will include this selection method, making automatic Bayesian model selection and analysis readily available for spatial data. This is important, as a subject matter expert is not required to subjectively specify prior distributions to obtain selection or inference results.
Finally, I demonstrate my Bayesian model selection approach with applications to county-level median household income in the contiguous United States and residential crime rates in the neighborhoods of Columbus, Ohio.

Location: MDS 220

Model-Free Conditional Feature Screening with FDR Control

Abstract: In this talk, I will present a new model-free conditional feature screening method with false discovery rate (FDR) control for ultra-high dimensional data. The proposed method is built upon a new measure of conditional independence. Thus, the new method does not require a specific functional form of the regression function and is robust to heavy-tailed responses and predictors. The variables conditioned on are allowed to be multivariate. The proposed method enjoys sure screening and ranking consistency properties under mild regularity conditions. To control the FDR, we apply the Reflection via Data Splitting method and prove its theoretical guarantee using martingale theory and empirical process techniques. Simulated examples and real data analysis show that the proposed method performs well compared with existing methods.
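Data-splitting approaches to FDR control typically compute a per-feature importance score on each half of the data and combine them into a symmetric "mirror" statistic, whose negative tail estimates the number of false positives. The sketch below is a generic mirror-statistic selection rule, not the specific Reflection via Data Splitting procedure from the talk; the function name, the particular mirror formula, and the simulated scores are illustrative assumptions.

```python
import numpy as np

def mirror_select(W1, W2, q=0.1):
    """Select features from two independent importance scores (e.g. fitted
    on two halves of the data).  Null features have mirror statistics that
    are symmetric about zero, so the negative tail estimates the number of
    false discoveries in the positive tail."""
    M = np.sign(W1 * W2) * (np.abs(W1) + np.abs(W2))
    for t in np.sort(np.abs(M)):
        if t == 0:
            continue
        # estimated false discovery proportion at threshold t
        fdp_hat = (M <= -t).sum() / max((M >= t).sum(), 1)
        if fdp_hat <= q:
            return np.flatnonzero(M >= t)
    return np.array([], dtype=int)

# demo: 10 signal features with large concordant scores, 90 null features
rng = np.random.default_rng(2)
W1 = np.concatenate([5 + 0.5 * rng.normal(size=10), rng.normal(size=90)])
W2 = np.concatenate([5 + 0.5 * rng.normal(size=10), rng.normal(size=90)])
sel = mirror_select(W1, W2, q=0.1)
```

The key design point is symmetry: because null scores from independent halves have no preferred sign, counting mirror statistics below minus t gives a data-driven estimate of the false positives above t, which is what lets the threshold be tuned to the target FDR level.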

Location: MDS 220

The Promises of Parallel Outcomes

Abstract:

A key challenge in causal inference from observational studies is the identification and estimation of causal effects in the presence of unmeasured confounding. In this talk, I will introduce a novel approach for causal inference that leverages information in multiple outcomes to deal with unmeasured confounding. The key assumption in this approach is conditional independence among multiple outcomes. In contrast to existing proposals in the literature, the roles of multiple outcomes in the key identification assumption are symmetric, hence the name parallel outcomes. I will show nonparametric identifiability with at least three parallel outcomes and provide parametric estimation tools under a set of linear structural equation models. The method is applied to a data set from the Alzheimer's Disease Neuroimaging Initiative to study the causal effects of tau protein level on regional brain atrophy.

Location: MDS 220