
Statistics Seminar

Ghost Data

As natural as real data, ghost data is everywhere; it is simply data that you cannot see.  We need to learn how to handle it, how to model with it, and how to put it to work.  Some examples of ghost data are (see Sall, 2017):

  1. Virtual data: it is not there until you look at it;

  2. Missing data: there is a slot to hold a value, but the slot is empty;

  3. Pretend data: data that is made up;

  4. Highly sparse data: data whose absence implies a value near zero; and

  5. Simulation data: data generated to answer “what if” questions.

For example, the absence of evidence (or data) is not evidence of absence; in fact, it can be evidence of something.  The idea of ghost data also extends to other existing areas: hidden Markov models, two-stage least squares estimation, optimization via simulation, partition models, and topological data analysis, just to name a few.
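
For a concrete picture, here is a minimal Python sketch, with all data and names hypothetical, of how the virtual, missing, highly sparse, and simulation kinds of ghost data show up in everyday computing:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(42)

# Virtual data: it is not there until you look at it (lazy evaluation).
virtual = (x ** 2 for x in range(10))      # a generator; no values exist yet
first = next(virtual)                      # looking at it materializes a value

# Missing data: the slot exists, but it is empty.
missing = np.array([1.2, np.nan, 3.4])     # NaN marks the empty slot
print("missing values:", np.isnan(missing).sum())

# Highly sparse data: absent entries implicitly mean (near) zero.
sparse = csr_matrix((np.ones(3), ([0, 1, 2], [2, 0, 1])), shape=(3, 3))
print("stored entries:", sparse.nnz, "of", np.prod(sparse.shape))

# Simulation data: made up to answer "what if" questions.
what_if = rng.normal(loc=0.5, size=10_000) # a hypothetical world with mean 0.5
print("simulated mean:", what_if.mean())
```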

Three movies will be used for illustration in this talk: (1) “The Sixth Sense” (Bruce Willis): I can see things that you cannot see; (2) “Sherlock Holmes” (Robert Downey Jr.): the absence of expected facts; and (3) “Edge of Tomorrow” (Tom Cruise): how to speed up your learning.  It will be helpful if you watch these movies before coming to my talk.  This is an early stage of my research in this area, and any feedback from you is deeply appreciated.  Much of the basic idea is highly influenced by John Sall (JMP-SAS).

 

Dr. Dennis K. J. Lin is a Distinguished Professor in the Department of Statistics at Purdue University.  Prior to his current position, he was a University Distinguished Professor of Supply Chain Management and Statistics at Penn State.  His research interests are quality assurance, industrial statistics, data mining, and response surface methodology.  He has published nearly 300 SCI/SSCI papers in a wide variety of journals.  He currently serves or has served as an associate editor for more than 10 professional journals and was a co-editor for Applied Stochastic Models in Business and Industry.  Dr. Lin is an elected fellow of ASA, IMS, ASQ, and RSS, an elected member of ISI, and a lifetime member of ICSA.  He is an honorary chair professor at various universities, including Renmin University of China (as a Chang-Jiang Scholar), Fudan University, and National Taiwan Normal University.  His recent awards include the Youden Address (ASQ, 2010), the Shewell Award (ASQ, 2010), the Don Owen Award (ASA, 2011), the Loutit Address (SSC, 2011), the Hunter Award (ASQ, 2014), the Shewhart Medal (ASQ, 2015), the SPES Award (ASA, 2016), the Chow Yuan-Shin Award (2019), and the Deming Lecturer Award (JSM, 2020).  His most recent honor is the Outstanding Alumni Award from National Tsing Hua University (Taiwan, 2022).

Location: MDS 220

Bayesian Regression for Group Testing Data

Abstract: Group testing involves pooling individual specimens (e.g., blood, urine, swabs, etc.) and testing the pools for the presence of a disease. When individual covariate information is available (e.g., age, gender, number of sexual partners, etc.), a common goal is to relate an individual's true disease status to the covariates in a regression model. Estimating this relationship is a nonstandard problem in group testing because true individual statuses are not observed and all testing responses (on pools and on individuals) are subject to misclassification arising from assay error. Previous regression methods for group testing data can be inefficient because they are restricted to using only initial pool responses and/or they make potentially unrealistic assumptions regarding the assay accuracy probabilities. To overcome these limitations, we propose a general Bayesian regression framework for modeling group testing data. The novelty of our approach is that it can be easily implemented with data from any group testing protocol. Furthermore, our approach will simultaneously estimate assay accuracy probabilities (along with the covariate effects) and can even be applied in screening situations where multiple assays are used. We apply our methods to group testing data collected in Iowa as part of statewide screening efforts for chlamydia.
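
A minimal sketch, assuming a logistic model and hypothetical assay accuracy values, of how group testing data of this kind arise; it simulates the data structure that the proposed Bayesian framework models, not the framework itself:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate individual-level data under a logistic regression model.
n, pool_size = 600, 5
age = rng.uniform(18, 40, n)
x = np.column_stack([np.ones(n), (age - age.mean()) / age.std()])
beta = np.array([-2.0, 0.8])                    # hypothetical covariate effects
p = 1 / (1 + np.exp(-x @ beta))
status = rng.binomial(1, p)                     # latent true disease statuses

# Pool specimens and test each pool with an imperfect assay.
sens, spec = 0.95, 0.98                         # hypothetical assay accuracies
pool_pos = status.reshape(-1, pool_size).any(axis=1)   # truly positive pools
test = np.where(pool_pos,
                rng.binomial(1, sens, pool_pos.shape),       # true positives
                rng.binomial(1, 1 - spec, pool_pos.shape))   # false positives

print("prevalence:", status.mean(), "| pools testing positive:", test.mean())
```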

Location: MDS 220

Inference for Longitudinal Data After Adaptive Sampling

Adaptive sampling methods, such as reinforcement learning (RL) and bandit algorithms, are increasingly used for the real-time personalization of interventions in digital applications like mobile health and education. As a result, there is a need to be able to use the resulting adaptively collected user data to address a variety of inferential questions, including questions about time-varying causal effects. However, current methods for statistical inference on such data (a) make strong assumptions regarding the environment dynamics, e.g., assume the longitudinal data follows a Markovian process, or (b) require data to be collected with one adaptive sampling algorithm per user, which excludes algorithms that learn to select actions using data collected from multiple users. These are major obstacles preventing the use of adaptive sampling algorithms more widely in practice. In this work, we provide statistical inference for the common Z-estimator based on adaptively sampled data. The inference (a) is valid even when observations are non-stationary and highly dependent over time, and (b) allows the online adaptive sampling algorithm to learn using the data of all users. Furthermore, our inference method is robust to misspecification of the reward models used by the adaptive sampling algorithm. This work is motivated by our work in designing the Oralytics oral health clinical trial, in which an RL adaptive sampling algorithm will be used to select treatments, yet valid statistical inference is essential for conducting primary data analyses after the trial is over.
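
As a simplified sketch of the setup (not the paper's method), the following simulates data collection with an epsilon-greedy bandit and then computes an ordinary least-squares Z-estimator of the arm effect; all parameter values are hypothetical, and the adaptive-weighting and variance machinery of the work is not reproduced:

```python
import numpy as np

rng = np.random.default_rng(0)

# Collect data with a simple epsilon-greedy bandit (two arms).
T, eps = 2000, 0.1
true_means = np.array([0.0, 0.3])               # arm 1 is truly better
counts, sums = np.zeros(2), np.zeros(2)
actions, rewards = np.empty(T, int), np.empty(T)
for t in range(T):
    if rng.random() < eps or counts.min() == 0:
        a = int(rng.integers(2))                # explore
    else:
        a = int(np.argmax(sums / counts))       # exploit current estimates
    r = true_means[a] + rng.normal()
    counts[a] += 1
    sums[a] += r
    actions[t], rewards[t] = a, r

# Z-estimation: solve sum_t psi(theta; A_t, R_t) = 0, where psi is the
# ordinary least-squares estimating equation for the arm effect.
X = np.column_stack([np.ones(T), actions])
theta = np.linalg.lstsq(X, rewards, rcond=None)[0]
print("estimated effect of arm 1 vs arm 0:", theta[1])
```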

Location: Zoom

Sufficient Dimension Reduction on Manifolds

High-dimensional data from modern scientific discoveries introduce unique challenges to statistical modeling. Sufficient dimension reduction is a useful tool for bridging the gap through projection subspace recovery. In this talk, we present a semiparametric framework formulated on Grassmann manifolds for dimension reduction. A gradient descent estimation on Grassmann manifolds will be discussed. The proposed approach preserves the orthogonality of the estimators and improves the estimation efficiency over existing approaches when the features are highly correlated. Simulation studies and a real data application will be presented to demonstrate the efficacy of the proposed approach.
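
A minimal sketch of gradient descent that respects orthogonality, with a toy single-index least-squares objective standing in for the semiparametric criterion of the talk (all settings hypothetical): the Euclidean gradient is projected to the tangent space of the Grassmann manifold, and a QR step retracts the iterate back onto the manifold:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy single-index model: y depends on X only through X @ u_true.
n, p = 500, 6
X = rng.normal(size=(n, p))
u_true = np.zeros(p)
u_true[0] = 1.0
y = np.sin(X @ u_true) + 0.1 * rng.normal(size=n)

def loss_grad(U):
    """Least-squares fit of y on the reduced predictor X @ U (d = 1)."""
    z = X @ U                                   # n x 1 projected feature
    coef = np.linalg.lstsq(np.column_stack([np.ones(n), z]), y, rcond=None)[0]
    resid = y - coef[0] - z[:, 0] * coef[1]
    grad = -2 * coef[1] * X.T @ resid / n       # Euclidean gradient w.r.t. U
    return resid @ resid / n, grad[:, None]

U = np.linalg.qr(rng.normal(size=(p, 1)))[0]    # orthonormal starting value
for _ in range(300):
    _, G = loss_grad(U)
    G_tan = G - U @ (U.T @ G)                   # project to the tangent space
    U, _ = np.linalg.qr(U - 0.2 * G_tan)        # QR retraction keeps U'U = I
print("|cosine| with the true direction:", abs(U[:, 0] @ u_true))
```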

Location: MDS 220

Nonparametric Finite Mixtures for Overcoming Biomarker-Error Bias

Solomon W. Harrar

 

Personalized medicine research involves investigating the differential effect of treatments in patient groups defined by specific characteristics. In enrichment trials, participants are stratified based on biomarkers to assess the effectiveness of treatments on these groups. However, biomarkers are susceptible to misclassification errors, leading to bias. We propose nonparametric methods to estimate treatment effects and quantify the bias due to biomarker misclassification errors. Our methods are applicable to outcomes measured on ordinal, discrete, or continuous scales, without requiring assumptions such as the existence of moments. Simulation results show significant improvements in bias reduction, coverage probability, and power compared to existing methods. We illustrate the application of our methods using gene expression profiling of bronchial airway brushing in asthmatic and healthy control subjects.
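
A toy simulation (not the proposed estimator) of the bias being targeted: a rank-based, Mann-Whitney-type relative treatment effect is attenuated when group labels come from a misclassified biomarker; the accuracy values are hypothetical:

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(7)

# True biomarker-positive patients benefit; negatives do not.
n = 2000
positive = rng.binomial(1, 0.4, n).astype(bool)
outcome = rng.normal(size=n) + 0.8 * positive

# The assay misclassifies: sensitivity 0.85, specificity 0.90 (hypothetical).
observed = np.where(positive,
                    rng.binomial(1, 0.85, n),
                    rng.binomial(1, 0.10, n)).astype(bool)

def relative_effect(y, grp):
    """Nonparametric effect P(Y_pos > Y_neg) + 0.5 P(tie), via midranks."""
    r = rankdata(y)                             # no moment assumptions needed
    n1, n0 = grp.sum(), (~grp).sum()
    return (r[grp].mean() - (n1 + 1) / 2) / n0

print("effect with true labels:     ", relative_effect(outcome, positive))
print("effect with noisy biomarker: ", relative_effect(outcome, observed))
```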

 

I will use the first 10-15 minutes to share a brief account of my fall 2022 sabbatical experience.

Location: MDS 220

Online Estimation and Network Point Processes

Abstract: A common goal in network modeling is to uncover the latent structure present among nodes. For many real-world networks, the true connections consist of events arriving as streams, which are then aggregated to form edges, discarding the dynamic temporal component. A natural way to account for the temporal dynamics of interactions is to use point processes as the foundation of network models. Computational complexity can hamper the scalability of such methods to the large sparse networks that occur in modern settings. We consider the use of online variational inference as a way of scaling such methods when the goal is community detection.
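
A toy sketch of the two ingredients, with block rates treated as known and all settings hypothetical: pairwise event counts are aggregated from a stream, and community memberships are updated by stochastic, online-style variational steps under a Poisson stochastic block model:

```python
import numpy as np

rng = np.random.default_rng(11)

# Aggregate an event stream into pair counts (toy data, two communities).
n, K = 40, 2
z_true = rng.integers(K, size=n)                  # latent community labels
rates = np.array([[3.0, 0.3],
                  [0.3, 2.0]])                    # block-wise event rates
Y = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        Y[i, j] = Y[j, i] = rng.poisson(rates[z_true[i], z_true[j]])

# Online variational updates of soft memberships (rates assumed known here).
tau = rng.dirichlet(np.ones(K), size=n)
log_r = np.log(rates)
for t, i in enumerate(rng.integers(n, size=4000)):
    rho = (t + 10) ** -0.6                        # Robbins-Monro step size
    # Expected Poisson log-likelihood of node i's counts under each label.
    s = Y[i] @ tau @ log_r.T - tau.sum(axis=0) @ rates.T
    new = np.exp(s - s.max())
    new /= new.sum()
    tau[i] = (1 - rho) * tau[i] + rho * new       # stochastic update

est = tau.argmax(axis=1)
acc = max((est == z_true).mean(), (est != z_true).mean())  # label switching
print("community recovery accuracy:", acc)
```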

Location: MDS 220

Enhancing the Study of Microbiome-Metabolome Interactions: A Transfer-Learning Approach for Precise Identification of Essential Microbes

Abstract: Recent research has revealed the essential role that microbial metabolites play in host-microbiome interactions. Although statistical and machine-learning methods have been employed to explore microbiome-metabolome interactions in multiview microbiome studies, most of these approaches focus solely on the prediction of microbial metabolites, which lacks biological interpretation. Additionally, existing methods face limitations in either prediction or inference due to small sample sizes and highly correlated microbes and metabolites. To overcome these limitations, we present a transfer-learning method that evaluates microbiome-metabolome interactions. Our approach efficiently utilizes information from comparable metabolites obtained through external databases or data-driven methods, resulting in more precise predictions of microbial metabolites and identification of essential microbes involved in each microbial metabolite. Our numerical studies demonstrate that our method enables a deeper understanding of the mechanism of host-microbiome interactions and establishes a statistical basis for potential microbiome-based therapies for various human diseases.
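
One common transfer-learning device in this spirit (a stand-in, not the proposed method) shrinks the target regression for a metabolite toward coefficients borrowed from a comparable metabolite with a larger sample; a minimal sketch on simulated, hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(5)

# Microbe abundances (p features); the target metabolite has a small sample.
p, n_src, n_tgt = 50, 400, 40
beta = np.zeros(p)
beta[:3] = [1.0, -0.8, 0.6]                     # three "essential" microbes
Xs, Xt = rng.normal(size=(n_src, p)), rng.normal(size=(n_tgt, p))
ys = Xs @ beta + rng.normal(size=n_src)         # comparable source metabolite
yt = Xt @ (beta + 0.05 * rng.normal(size=p)) + rng.normal(size=n_tgt)

def ridge(X, y, lam, center=None):
    """Ridge estimate shrinking toward `center` (zero if None)."""
    c = np.zeros(X.shape[1]) if center is None else center
    A = X.T @ X + lam * np.eye(X.shape[1])
    return c + np.linalg.solve(A, X.T @ (y - X @ c))

b_src = ridge(Xs, ys, lam=1.0)                  # fit on the source data
b_naive = ridge(Xt, yt, lam=1.0)                # target data only
b_transfer = ridge(Xt, yt, lam=5.0, center=b_src)  # borrow from the source

for name, b in [("target-only", b_naive), ("transfer   ", b_transfer)]:
    print(name, "estimation error:", round(float(np.linalg.norm(b - beta)), 3))
```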

 

Location: MDS 220

Musings on Subdata Selection

Abstract: Data reduction or summarization methods for large datasets (full data) aim at making inferences by replacing the full data with the reduced or summarized data. Data storage and computational costs are among the primary motivations for this. In this presentation, data reduction will mean the selection of a subset (subdata) of the observations in the full data. While data reduction has been around for decades, its impact continues to grow with approximately 2.5 exabytes (2.5 x 10^18 bytes) of data collected per day. We will begin by discussing an information-based method for subdata selection under the assumption that a linear regression model is adequate. A strength of this method, which is inspired by ideas from optimal design of experiments, is that it is superior to competing methods in terms of statistical performance and computational cost when the model is correct. A weakness of the method, shared with other model-based methods, is that it can give poor results if the model is incorrect. We will therefore conclude with discussions of a method based on a more flexible model and of a model-free method.
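
A minimal sketch of one information-based selection rule of this flavor (in the style of the IBOSS literature; all settings hypothetical): for each covariate, keep the observations with the most extreme values, which spreads the design and enlarges the information matrix under the linear model:

```python
import numpy as np

rng = np.random.default_rng(9)

# Full data: n is large and a linear regression model is assumed adequate.
n, p, k = 100_000, 5, 1000                 # select k = 1000 subdata points
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 1.5]) + rng.normal(size=n)

# Per covariate, keep the r smallest and r largest remaining values.
r = k // (2 * p)
chosen = np.zeros(n, dtype=bool)
for j in range(p):
    idx = np.flatnonzero(~chosen)          # rows not yet selected
    order = np.argsort(X[idx, j])
    chosen[idx[order[:r]]] = True
    chosen[idx[order[-r:]]] = True

# Compare the subdata fit with a uniform random subsample of the same size.
def ols(Xs, ys):
    Z = np.column_stack([np.ones(len(ys)), Xs])
    return np.linalg.lstsq(Z, ys, rcond=None)[0]

unif = rng.choice(n, size=chosen.sum(), replace=False)
print("information-based coef:", ols(X[chosen], y[chosen]).round(3))
print("uniform subsample coef:", ols(X[unif], y[unif]).round(3))
```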

 

The work discussed has benefited from various collaborators, including Rakhi Singh (Binghamton U), HaiYing Wang (U of Connecticut), and Min Yang (U of Illinois at Chicago).

Location: MDS 220

Bayesian Edge Regression: Characterizing Observation-Specific Heterogeneity in Estimating Undirected Graphical Models

Abstract: In this talk, I will introduce Bayesian Edge Regression, a novel edge regression model for undirected graphs, which estimates conditional dependencies as a function of subject-level covariates. By doing so, this model accounts for observation-specific heterogeneity in estimating networks. I will present two case studies using the proposed model: one is a set of simulation studies focused on comparing tumor and normal networks while adjusting for tumor purity; the other is an application to a dataset of proteomic measurements on plasma samples from patients with hepatocellular carcinoma (HCC), in which we ascertained how blood protein networks vary with disease severity. I will also give a brief introduction to my other research work.
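
A minimal frequentist stand-in for the core idea (not the Bayesian model itself): the dependence between two nodes is allowed to vary with a subject-level covariate by adding an interaction term to a node-wise regression; the data and settings are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

# Two proteins whose conditional dependence strengthens with disease
# severity w (a subject-level covariate such as tumor purity).
n = 3000
w = rng.uniform(0, 1, n)
x2 = rng.normal(size=n)
edge = 0.2 + 0.7 * w                        # true edge weight varies with w
x1 = edge * x2 + rng.normal(size=n)

# Edge regression stand-in: regress node 1 on node 2 and a w-interaction,
# so the fitted dependence is itself a function of the covariate.
Z = np.column_stack([np.ones(n), x2, x2 * w])
coef = np.linalg.lstsq(Z, x1, rcond=None)[0]
print("estimated edge at severity w:",
      coef[1].round(2), "+", coef[2].round(2), "* w")  # target: 0.2 + 0.7 w
```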

 

Location: MDS 220

Is a Classification Procedure Good Enough?—A Goodness-of-Fit Assessment Tool for Classification Learning

Abstract: In recent years, many nontraditional classification methods, such as random forests, boosting, and neural networks, have been widely used in applications. Their performance is typically measured in terms of classification accuracy. While the classification error rate and the like are important, they do not address a fundamental question: Is the classification method underfitted? To the best of our knowledge, there is no existing method that can assess the goodness of fit of a general classification procedure. Indeed, the lack of a parametric assumption makes it challenging to construct proper tests. To overcome this difficulty, we propose a methodology called BAGofT that splits the data into a training set and a validation set. First, the classification procedure to assess is applied to the training set, which is also used to adaptively find a data grouping that reveals the most severe regions of underfitting. Then, based on this grouping, we calculate a test statistic by comparing the estimated success probabilities and the actual observed responses from the validation set. The data splitting guarantees that the size of the test is controlled under the null hypothesis, and the power of the test goes to one as the sample size increases under the alternative hypothesis. For testing parametric classification models, BAGofT has a broader scope than existing methods since it is not restricted to specific parametric models (e.g., logistic regression). Extensive simulation studies show the utility of BAGofT when assessing general classification procedures and its strengths over some existing methods when testing parametric classification models.
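
A simplified sketch of the testing idea, with the adaptive grouping replaced by fixed deciles of the fitted probability (so this is a stand-in for, not an implementation of, BAGofT):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)

# Data with a quadratic signal; a linear logistic model will underfit.
n = 4000
X = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-(X[:, 0] ** 2 - 1)))
y = rng.binomial(1, p)
Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.5, random_state=0)

# Fit the classification procedure under assessment on the training half.
clf = LogisticRegression().fit(Xtr, ytr)
phat = clf.predict_proba(Xva)[:, 1]

# Fixed-decile grouping (BAGofT instead finds the grouping adaptively on
# the training set); compare predicted and observed successes per group.
groups = np.digitize(phat, np.quantile(phat, np.linspace(0.1, 0.9, 9)))
stat = 0.0
for g in np.unique(groups):
    m = groups == g
    expected = phat[m].sum()
    variance = (phat[m] * (1 - phat[m])).sum()
    stat += (yva[m].sum() - expected) ** 2 / variance
print("chi-square-type statistic over", len(np.unique(groups)), "groups:", stat)
```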

Location: MDS 220