
Statistics Seminar

Surrogate method for partial association between mixed data with application to well-being survey analysis

Abstract: 

This paper is motivated by the analysis of a survey study focusing on college student well-being before and after the COVID-19 pandemic outbreak. A statistical challenge in well-being studies lies in the multidimensionality of outcome variables, recorded on various scales such as continuous, binary, or ordinal. The presence of mixed data complicates the examination of their relationships when adjusting for important covariates. To address this challenge, we propose a unifying framework for studying partial association between mixed data. We achieve this by defining a unified residual using the surrogate method. The idea is to map the residual randomness to a consistent continuous scale, regardless of the original scales of the outcome variables. This framework applies to parametric and semiparametric models for covariate adjustment. We validate the use of such residuals for assessing partial association, introducing a measure that generalizes the classical Kendall's tau to capture both partial and marginal associations. Moreover, our development advances the theory of the surrogate method by demonstrating its applicability without requiring the outcome variables to have a latent variable structure. In the analysis of the college student well-being survey, our proposed method unveils the contingency of the relationships between multidimensional well-being measures and micro-level personal risk factors (e.g., physical health, loneliness, and accommodation), as well as the macro-level disruption caused by COVID-19.
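A minimal sketch of the surrogate idea for a single binary outcome under a probit working model (not the paper's full framework; the simulated data, variable names, and model below are invented for illustration): the discrete outcome's residual randomness is mapped to a continuous scale by drawing a latent surrogate from a truncated normal, and Kendall's tau is then computed between that surrogate residual and an ordinary residual from a second, continuous outcome.

```python
import numpy as np
from scipy.stats import truncnorm, kendalltau
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = sm.add_constant(rng.normal(size=(n, 2)))                        # covariates to adjust for
y1 = (X @ [0.2, 0.8, -0.5] + rng.normal(size=n) > 0).astype(int)    # binary outcome
y2 = X @ [0.1, 0.6, -0.4] + rng.normal(size=n)                      # continuous outcome

def surrogate_residuals(y, X):
    """Map a binary outcome's residual randomness to a continuous scale via a latent-normal surrogate."""
    fit = sm.Probit(y, X).fit(disp=0)
    mu = X @ fit.params
    # draw S | y from N(mu, 1) truncated to (0, inf) if y = 1 and to (-inf, 0] if y = 0
    lower = np.where(y == 1, -mu, -np.inf)
    upper = np.where(y == 1, np.inf, -mu)
    S = truncnorm.rvs(lower, upper, loc=mu, scale=1.0, random_state=0)
    return S - mu

r1 = surrogate_residuals(y1, X)
r2 = y2 - sm.OLS(y2, X).fit().fittedvalues      # ordinary residuals for the continuous outcome
tau, p = kendalltau(r1, r2)
print(f"surrogate-based partial Kendall's tau: {tau:.3f} (p = {p:.3g})")
```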

Date:
-
Location:
MDS 220
Tags/Keywords:
Event Series:

Importance tempering of Markov chain Monte Carlo schemes

Abstract: Informed importance tempering (IIT) is an easy-to-implement MCMC algorithm that can be seen as an extension of the familiar Metropolis-Hastings algorithm with the special feature that informed proposals are always accepted; Zhou and Smith (2022) showed that it converges much more quickly in some common circumstances. This work develops a new, comprehensive guide to the use of IIT in many situations. First, we propose two IIT schemes that run faster than existing informed MCMC methods on discrete spaces by not requiring the posterior evaluation of all neighboring states. Second, we integrate IIT with other MCMC techniques, including simulated tempering, pseudo-marginal, and multiple-try methods (on general state spaces), which have conventionally been implemented as Metropolis-Hastings schemes and can suffer from low acceptance rates. The use of IIT allows us to always accept proposals and brings new opportunities for optimizing the sampler that are not possible under the Metropolis-Hastings framework. Numerical examples illustrating our findings are provided for each proposed algorithm, and a general theory on the complexity of IIT methods is developed. Joint work with G. Li and A. Smith.
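A minimal sketch of the always-accept mechanism on a toy discrete target, in the spirit of basic informed importance tempering rather than the new schemes proposed in this talk (the target, neighborhood structure, and balancing function below are invented): from state x, each neighbor y is weighted by a balancing function h(π(y)/π(x)), the next state is drawn in proportion to these weights, and each visited state receives an importance weight 1/Z_h(x) so that weighted averages estimate expectations under π.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy unnormalized target on {0, ..., K}, with neighbors x - 1 and x + 1
K = 50
log_pi = -0.5 * (np.arange(K + 1) - 20.0) ** 2 / 25.0   # discretized Gaussian shape, mean 20

def neighbors(x):
    return [y for y in (x - 1, x + 1) if 0 <= y <= K]

h = np.sqrt   # balancing function h(r) = sqrt(r), which satisfies h(r) = r * h(1/r)

def iit(n_iter, x0=0):
    states, weights = [], []
    x = x0
    for _ in range(n_iter):
        nbrs = neighbors(x)
        w = np.array([h(np.exp(log_pi[y] - log_pi[x])) for y in nbrs])
        Z = w.sum()
        states.append(x)
        weights.append(1.0 / Z)                     # importance weight corrects the always-accept chain
        x = nbrs[rng.choice(len(nbrs), p=w / Z)]    # informed proposal, always accepted
    return np.array(states), np.array(weights)

s, w = iit(20_000)
est_mean = np.sum(w * s) / np.sum(w)                # self-normalized importance-weighted estimate of E[X]
print(f"estimated mean: {est_mean:.2f} (target mean is about 20)")
```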

Date:
-
Location:
MDS 220
Tags/Keywords:
Event Series:

Balancing Inferential Integrity and Disclosure Risk via a Multiple Imputation Synthesis Strategy

Abstract: Responsible data sharing anchors research reproducibility and promotes the integrity of scientific research. The possibility of identification creates tension between sharing data to facilitate medical treatment or collaborative research and protecting patient privacy. At the same time, information loss due to incorrect specification of imputation models can weaken or even invalidate the inferences obtained from synthetic datasets. In this talk, we focus on privacy protection in the direction of statistical disclosure control. We introduce a synthetic component into the synthesis strategy behind the traditional multiple imputation framework to ease the task of conducting inferences for researchers with limited statistical backgrounds. Tuning the injected synthetic component balances inferential quality against disclosure risk, and its addition has the further advantage of protecting against model misspecification. This framework can be combined with existing missing data methods to produce complete synthetic datasets for public release. We show, using the Canadian Scleroderma Research Group dataset, that the new synthesis strategy achieves better data utility than the direct use of the classical multiple imputation approach while providing similar or better protection against identity disclosure. This is joint work with Bei Jiang, Adrian Raftery, and Russell Steele.
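A minimal sketch of the general mechanics this line of work builds on: multiple synthetic copies of a sensitive variable are drawn from a fitted model and the analyst's estimates are pooled with a combining rule (here the rule for partially synthetic data from Reiter, 2003). The linear model, variable roles, and number of copies are illustrative assumptions, not the authors' synthesis strategy.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, m = 300, 20
X = sm.add_constant(rng.normal(size=(n, 1)))
y = X @ [1.0, 2.0] + rng.normal(size=n)          # y plays the role of the sensitive variable

fit = sm.OLS(y, X).fit()
q, u = [], []
for _ in range(m):
    # draw parameters from an approximate posterior, then replace y with synthetic values
    beta = rng.multivariate_normal(fit.params, fit.cov_params())
    sigma = np.sqrt(fit.scale)
    y_syn = X @ beta + rng.normal(scale=sigma, size=n)
    syn_fit = sm.OLS(y_syn, X).fit()             # analyst's model refit on the released synthetic data
    q.append(syn_fit.params[1])
    u.append(syn_fit.bse[1] ** 2)

q, u = np.array(q), np.array(u)
q_bar = q.mean()
b = q.var(ddof=1)
T = u.mean() + b / m                             # combining rule for partially synthetic data (Reiter, 2003)
print(f"pooled slope: {q_bar:.3f}, std. error: {np.sqrt(T):.3f}")
```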

Date:
-
Location:
MDS 220
Tags/Keywords:
Event Series:

Causal Discovery from Multivariate Functional Data

Abstract: Discovering causal relationships from multivariate functional data has recently received a significant amount of attention. We introduce a functional linear structural equation model for causal structure learning. To enhance interpretability, our model involves a low-dimensional causal embedding space such that all the relevant causal information in the multivariate functional data is preserved in this lower-dimensional subspace. We prove that the proposed model is causally identifiable under standard assumptions that are often made in the causal discovery literature. To carry out inference for our model, we develop a fully Bayesian framework with suitable prior specifications and uncertainty quantification through posterior summaries. We illustrate the superior performance of our method over existing methods in terms of causal graph estimation through extensive simulation studies. We also demonstrate the proposed method using a brain EEG dataset.
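A toy sketch of the kind of data-generating mechanism the abstract describes (not the proposed model or its Bayesian inference): each functional node is represented by a few basis coefficients, a linear SEM acts on those low-dimensional score vectors, and the observed curves are reconstructed on a grid. The graph, basis, and dimensions below are invented.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K, T = 200, 3, 100                       # samples, basis functions per node, grid points
t = np.linspace(0, 1, T)
basis = np.vstack([np.sin((k + 1) * np.pi * t) for k in range(K)])   # K x T Fourier-type basis

# DAG on three functional nodes, X1 -> X2 -> X3, with a linear SEM on K-dimensional scores
A21 = rng.normal(scale=0.7, size=(K, K))    # effect of X1 scores on X2 scores
A32 = rng.normal(scale=0.7, size=(K, K))    # effect of X2 scores on X3 scores

S1 = rng.normal(size=(n, K))
S2 = S1 @ A21.T + 0.3 * rng.normal(size=(n, K))
S3 = S2 @ A32.T + 0.3 * rng.normal(size=(n, K))

# observed multivariate functional data: n curves per node, evaluated on the grid
X1, X2, X3 = S1 @ basis, S2 @ basis, S3 @ basis
print(X1.shape, X2.shape, X3.shape)         # (200, 100) each
```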

Date:
-
Location:
MDS 220
Tags/Keywords:
Event Series:

Recent Advances in Independence Testing of Stochastic Processes

Abstract:

This talk focuses on testing the independence of two stochastic processes. In contrast to i.i.d. data, data originating from a stochastic process typically exhibit strong serial correlation. This inherent feature of stochastic processes poses significant challenges for statistical inference. We will commence by reviewing the historical context of Yule's nonsense correlation, defined as the empirical correlation of two independent random walks; its distribution is known to be heavily dispersed and frequently large in absolute value. This phenomenon demonstrates the difficulty inherent in conducting statistical inference for stochastic processes. The second part of the talk is devoted to AR(1) processes. We investigate the rate at which the distribution of the correlation between two independent AR(1) processes converges to the normal distribution, and show that this rate is of order √(log n / n), where n is the length of the processes. Finally, we will discuss the potential for a new methodology to test the independence of both random walks and AR(1) processes.
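The heavy dispersion of Yule's nonsense correlation is easy to reproduce by simulation (a quick sketch; the walk length and number of replications below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 500, 2000

def corr_of_independent_walks():
    x = np.cumsum(rng.normal(size=n))       # two independent random walks
    y = np.cumsum(rng.normal(size=n))
    return np.corrcoef(x, y)[0, 1]

r = np.array([corr_of_independent_walks() for _ in range(reps)])
# despite independence, the empirical correlation is heavily dispersed and often large in magnitude
print(f"std of correlation: {r.std():.2f}, share with |r| > 0.5: {(np.abs(r) > 0.5).mean():.2%}")
```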

Date:
-
Location:
MDS 220
Tags/Keywords:
Event Series:

Maximum Wilcoxon-Mann-Whitney Test in High Dimensional Applications

Abstract: 

The statistical comparison of two multivariate samples is a frequent task, e.g., in biomarker analysis. Parametric and nonparametric multivariate analysis of variance (MANOVA) procedures are well-established methods for the analysis of such data. Which method to use depends on the scales of the endpoints and on whether the assumption of a parametric multivariate distribution is meaningful. However, in case of a significant outcome, MANOVA methods can only provide the information that the treatments (conditions) differ in at least one of the endpoints; they cannot locate the guilty endpoint(s). Multiple contrast tests, formulated as maximum tests, by contrast provide local test results and thus the information of interest.

The maximum test method controls the error rate by comparing the value of the largest contrast in magnitude to the (1-α)-equicoordinate quantile of the joint distribution of all considered contrasts. The advantage of this approach over existing and commonly used methods that control the multiple type-I error rate, such as Bonferroni, Holm, or Hochberg, is that it is appealingly simple, yet has sufficient power to detect a significant difference in high-dimensional designs, and does not make strong assumptions (such as MTP2) about the joint distribution of test statistics. Furthermore, the computation of simultaneous confidence intervals is possible. The challenge, however, is that the joint distribution of the test statistics used must be known in order to implement the method.

In this talk, we develop a simultaneous maximum Wilcoxon-Mann-Whitney test for the analysis of multivariate data in two independent samples. We consider both low- and high-dimensional designs. We derive the (asymptotic) joint distribution of the test statistics and propose different bootstrap approximations for small sample sizes, whose quality we investigate in extensive simulation studies. It turns out that the methods control the multiple type-I error rate well, even in high-dimensional designs with small sample sizes. A real data set illustrates the application.
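A simplified permutation analogue of the maximum-test idea for two multivariate samples (not the bootstrap procedures developed in the talk; sample sizes, dimension, and the α level below are arbitrary): standardize a Wilcoxon-Mann-Whitney statistic per endpoint, take the maximum in absolute value, and calibrate all endpoint-wise decisions against the permutation distribution of that maximum.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(5)
n1, n2, d, alpha, B = 30, 25, 50, 0.05, 2000
X = rng.normal(size=(n1, d))                      # sample 1
Y = rng.normal(size=(n2, d)); Y[:, 0] += 1.0      # sample 2, shifted in endpoint 0

def wmw_z(data, idx1, idx2):
    """Standardized Wilcoxon-Mann-Whitney statistics for all endpoints at once (no-ties variance)."""
    ranks = np.apply_along_axis(rankdata, 0, data)
    U = ranks[idx1].sum(axis=0) - len(idx1) * (len(idx1) + 1) / 2
    mu = len(idx1) * len(idx2) / 2
    sigma = np.sqrt(len(idx1) * len(idx2) * (len(idx1) + len(idx2) + 1) / 12)
    return (U - mu) / sigma

data = np.vstack([X, Y])
idx1, idx2 = np.arange(n1), np.arange(n1, n1 + n2)
z_obs = wmw_z(data, idx1, idx2)

# permutation distribution of the maximum absolute statistic across endpoints
max_null = np.empty(B)
for b in range(B):
    perm = rng.permutation(n1 + n2)
    max_null[b] = np.abs(wmw_z(data, perm[:n1], perm[n1:])).max()

crit = np.quantile(max_null, 1 - alpha)            # shared (equicoordinate-style) critical value
print("endpoints flagged:", np.where(np.abs(z_obs) > crit)[0])
```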

Date:
-
Location:
MDS 220
Tags/Keywords:
Event Series:

Ghost Data

As natural as real data, ghost data is everywhere—it is just data that you cannot see.  We need to learn how to handle it, how to model with it, and how to put it to work.  Some examples of ghost data are (see Sall, 2017):

  1. Virtual data—it isn’t there until you look at it;

  2. Missing data—there is a slot to hold a value, but the slot is empty;

  3. Pretend data—data that is made up;

  4. Highly sparse data—whose absence implies a near zero; and

  5. Simulation data—data to answer “what if.”

For example, absence of evidence/data is not evidence of absence; in fact, it can be evidence of something. Moreover, ghost data can be extended to other existing areas: hidden Markov chains, two-stage least squares estimation, optimization via simulation, partition models, and topological data, just to name a few.

Three movies will be used for illustration in this talk: (1) “The Sixth Sense” (Bruce Willis)—I can see things that you cannot see; (2) “Sherlock Holmes” (Robert Downey Jr.)—absence of expected facts; and (3) “Edge of Tomorrow” (Tom Cruise)—how to speed up your learning. It will be helpful if you watch these movies before coming to my talk. This is an early stage of my research in this area, and any feedback from you is deeply appreciated. Much of the basic idea is highly influenced by John Sall (JMP/SAS).

 

Dr. Dennis K. J. Lin is a Distinguished Professor in the Department of Statistics at Purdue University. Prior to his current position, he was a University Distinguished Professor of Supply Chain Management and Statistics at Penn State. His research interests include quality assurance, industrial statistics, data mining, and response surface methodology. He has published nearly 300 SCI/SSCI papers in a wide variety of journals. He currently serves or has served as an associate editor for more than 10 professional journals and was a co-editor of Applied Stochastic Models in Business and Industry. Dr. Lin is an elected fellow of ASA, IMS, ASQ, and RSS, an elected member of ISI, and a lifetime member of ICSA. He is an honorary chair professor at various universities, including Renmin University of China (as a Chang-Jiang Scholar), Fudan University, and National Taiwan Normal University. His recent awards include the Youden Address (ASQ, 2010), the Shewell Award (ASQ, 2010), the Don Owen Award (ASA, 2011), the Loutit Address (SSC, 2011), the Hunter Award (ASQ, 2014), the Shewhart Medal (ASQ, 2015), the SPES Award (ASA, 2016), the Chow Yuan-Shin Award (2019), and the Deming Lecturer Award (JSM, 2020). His most recent honor is the Outstanding Alumni Award from National Tsing Hua University (Taiwan, 2022).

Date:
-
Location:
MDS 220
Tags/Keywords:
Event Series:

Bayesian Regression for Group Testing Data

Abstract: Group testing involves pooling individual specimens (e.g., blood, urine, or swabs) and testing the pools for the presence of a disease. When individual covariate information is available (e.g., age, gender, or number of sexual partners), a common goal is to relate an individual's true disease status to the covariates in a regression model. Estimating this relationship is a nonstandard problem in group testing because true individual statuses are not observed and all testing responses (on pools and on individuals) are subject to misclassification arising from assay error. Previous regression methods for group testing data can be inefficient because they are restricted to using only initial pool responses and/or they make potentially unrealistic assumptions regarding the assay accuracy probabilities. To overcome these limitations, we propose a general Bayesian regression framework for modeling group testing data. The novelty of our approach is that it can be easily implemented with data from any group testing protocol. Furthermore, our approach simultaneously estimates assay accuracy probabilities (along with the covariate effects) and can even be applied in screening situations where multiple assays are used. We apply our methods to group testing data collected in Iowa as part of statewide screening efforts for chlamydia.
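For a master-pool-only design with known assay sensitivity and specificity, the pool-level likelihood has a simple closed form. The sketch below simulates such data from a logistic model and maximizes that likelihood; it is a frequentist simplification for illustration only, not the Bayesian framework of the talk, and all parameter values are invented.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(6)
n, pool_size, Se, Sp = 2000, 5, 0.95, 0.98
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-3.0, 1.0])
y = rng.binomial(1, expit(X @ beta_true))          # true (unobserved) individual disease statuses

pools = np.arange(n).reshape(-1, pool_size)
pool_pos = y[pools].max(axis=1)                    # a pool is truly positive if any member is positive
Z = rng.binomial(1, np.where(pool_pos == 1, Se, 1 - Sp))   # observed assay results with misclassification

def neg_loglik(beta):
    p = expit(X @ beta)
    prob_neg_pool = np.prod(1 - p[pools], axis=1)          # P(all pool members are negative)
    prob_test_pos = Se * (1 - prob_neg_pool) + (1 - Sp) * prob_neg_pool
    return -np.sum(Z * np.log(prob_test_pos) + (1 - Z) * np.log(1 - prob_test_pos))

fit = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print("estimated coefficients:", np.round(fit.x, 2), "truth:", beta_true)
```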

Date:
-
Location:
MDS 220
Tags/Keywords:
Event Series:

Inference for Longitudinal Data After Adaptive Sampling

Adaptive sampling methods, such as reinforcement learning (RL) and bandit algorithms, are increasingly used for the real-time personalization of interventions in digital applications like mobile health and education. As a result, there is a need to be able to use the resulting adaptively collected user data to address a variety of inferential questions, including questions about time-varying causal effects. However, current methods for statistical inference on such data (a) make strong assumptions regarding the environment dynamics, e.g., assume the longitudinal data follow a Markovian process, or (b) require data to be collected with one adaptive sampling algorithm per user, which excludes algorithms that learn to select actions using data collected from multiple users. These are major obstacles preventing the wider use of adaptive sampling algorithms in practice. In this work, we prove the validity of statistical inference for the common Z-estimator based on adaptively sampled data. The inference (a) is valid even when observations are non-stationary and highly dependent over time, and (b) allows the online adaptive sampling algorithm to learn using the data of all users. Furthermore, our inference method is robust to misspecification of the reward models used by the adaptive sampling algorithm. This work is motivated by our work designing the Oralytics oral health clinical trial, in which an RL adaptive sampling algorithm will be used to select treatments, yet valid statistical inference is essential for conducting primary data analyses after the trial is over.
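For context, a minimal sketch of a classical Z-estimator with a sandwich variance in the i.i.d. setting, which is the setting this work extends to adaptively collected, dependent data (the logistic score equation and all values below are just a familiar illustrative example, not the method of the talk):

```python
import numpy as np
from scipy.optimize import root
from scipy.special import expit

rng = np.random.default_rng(7)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.binomial(1, expit(X @ np.array([0.5, -1.0])))

def psi(theta):
    """Estimating function: logistic score contributions, one row per observation."""
    return (y - expit(X @ theta))[:, None] * X

# a Z-estimator solves the empirical estimating equation mean_i psi(D_i; theta) = 0
theta_hat = root(lambda t: psi(t).mean(axis=0), x0=np.zeros(2)).x

# sandwich variance: A^{-1} B A^{-T} / n with A the mean derivative and B the mean outer product of psi
p = expit(X @ theta_hat)
bread = np.linalg.inv((X * (p * (1 - p))[:, None]).T @ X / n)
meat = psi(theta_hat).T @ psi(theta_hat) / n
cov = bread @ meat @ bread.T / n
print("estimate:", np.round(theta_hat, 2), "std. errors:", np.round(np.sqrt(np.diag(cov)), 3))
```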

Date:
-
Location:
Zoom
Event Series:

Sufficient dimension reduction on manifolds

High-dimensional data from modern scientific discoveries introduce unique challenges to statistical modeling. Sufficient dimension reduction is a useful tool to bridge this gap through projection subspace recovery. In this talk, we present a semiparametric framework formulated on Grassmann manifolds for dimension reduction. A gradient descent estimation procedure on Grassmann manifolds will be discussed. The proposed approach preserves the orthogonality of the estimators and improves estimation efficiency over existing approaches when the features are highly correlated. Simulation studies and a real data application will be presented to demonstrate the efficacy of the proposed approach.
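A generic sketch of one projected-gradient step on the Grassmann manifold (the standard tangent-space projection followed by a QR retraction), with a toy least-squares working objective standing in for the actual semiparametric criterion of the talk; the model, dimensions, and step size below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
p, d, n = 10, 2, 500
X = rng.normal(size=(n, p))
B_true, _ = np.linalg.qr(rng.normal(size=(p, d)))
y = np.sin(X @ B_true[:, 0]) + 0.5 * (X @ B_true[:, 1]) ** 2 + 0.1 * rng.normal(size=n)

def objective_grad(B):
    """Euclidean gradient of a toy working objective ||y - X B c||^2, with c profiled out by least squares."""
    Z = X @ B
    c, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = Z @ c - y
    return 2 * X.T @ np.outer(resid, c) / n

def grassmann_step(B, step=0.1):
    G = objective_grad(B)
    G_tan = G - B @ (B.T @ G)                       # project the gradient onto the tangent space at B
    B_new, _ = np.linalg.qr(B - step * G_tan)       # retract back to the manifold via a QR decomposition
    return B_new

B = np.linalg.qr(rng.normal(size=(p, d)))[0]
for _ in range(200):
    B = grassmann_step(B)
print("B^T B close to identity:", np.allclose(B.T @ B, np.eye(d), atol=1e-8))
```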

Date:
-
Location:
MDS 220
Event Series: