
Importance tempering of Markov chain Monte Carlo schemes

Abstract: Informed importance tempering (IIT) is an easy-to-implement MCMC algorithm that can be seen as an extension of the familiar Metropolis-Hastings algorithm with the special feature that informed proposals are always accepted; Zhou and Smith (2022) showed that it converges much more quickly in some common circumstances. This work develops a comprehensive guide to the use of IIT in many situations. First, we propose two IIT schemes that run faster than existing informed MCMC methods on discrete spaces by not requiring posterior evaluations at all neighboring states. Second, we integrate IIT with other MCMC techniques, including simulated tempering, pseudo-marginal, and multiple-try methods (on general state spaces), which have conventionally been implemented as Metropolis-Hastings schemes and can suffer from low acceptance rates. The use of IIT allows us to always accept proposals and opens up new opportunities for optimizing the sampler that are not possible under the Metropolis-Hastings framework. Numerical examples illustrating our findings are provided for each proposed algorithm, and a general theory on the complexity of IIT methods is developed. Joint work with G. Li and A. Smith.
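
To make the always-accept mechanism concrete, here is a minimal Python sketch of an IIT-style sampler on a toy discrete target: the informed proposal weights each neighbor by a balancing function of the posterior ratio, every move is accepted, and the inverse of the proposal's normalizing constant is recorded as an importance weight. The cyclic toy target, the balancing function h(r) = sqrt(r), and all numerical choices are illustrative assumptions, not code from the talk.

```python
# Minimal IIT-style sketch on a toy discrete target (a cycle of K states).
# Assumptions for illustration only: balancing function h(r) = sqrt(r),
# uniform base proposal over the two neighbours, Gaussian-shaped target.
import numpy as np

rng = np.random.default_rng(0)
K = 50
log_pi = -0.5 * (np.arange(K) - K / 2) ** 2 / 4.0   # unnormalised log target

def neighbours(i):
    return np.array([(i - 1) % K, (i + 1) % K])      # every state has two neighbours

def iit_sample(n_steps, x0=0):
    xs, ws = [], []
    x = x0
    for _ in range(n_steps):
        nb = neighbours(x)
        h = np.exp(0.5 * (log_pi[nb] - log_pi[x]))    # h(pi(y)/pi(x)) with h(r) = sqrt(r)
        Z = h.sum()
        xs.append(x)
        ws.append(1.0 / Z)                            # importance weight: inverse normaliser
        x = nb[rng.choice(len(nb), p=h / Z)]          # informed move, always accepted
    return np.array(xs), np.array(ws)

xs, ws = iit_sample(20_000)
print(np.sum(ws * xs) / np.sum(ws))                   # self-normalised estimate of E[X], approx K/2
```

The self-normalised weighted average at the end plays the role that the ordinary ergodic average plays in a Metropolis-Hastings sampler.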

Location: MDS 220

Balancing Inferential Integrity and Disclosure Risk via a Multiple Imputation Synthesis Strategy

Abstract: Responsible data sharing anchors research reproducibility and promotes the integrity of scientific research. The possibility of identification, however, creates tension between sharing data to facilitate medical treatment or collaborative research and protecting patient privacy. At the same time, information loss due to incorrect specification of imputation models can weaken or even invalidate inferences obtained from synthetic data sets. In this talk, we focus on privacy protection through statistical disclosure control. We introduce a synthetic component into the synthesis strategy behind the traditional multiple imputation framework to ease the task of conducting inferences for researchers with limited statistical backgrounds. Tuning the injected synthetic component enables balancing inferential quality and disclosure risk, and its addition also has the advantage of protecting against model misspecification. This framework can be combined with existing missing data methods to produce complete synthetic data sets for public release. We show, using the Canadian Scleroderma Research Group data set, that the new synthesis strategy achieves better data utility than the direct use of the classical multiple imputation approach while providing similar or better protection against identity disclosure. This is joint work with Bei Jiang, Adrian Raftery, and Russell Steele.
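
As a point of reference for the classical framework the talk builds on, the sketch below generates m partially synthetic copies of a sensitive outcome from a fitted model and pools the analysts' estimates across copies. The normal linear synthesis model, the variable names, and the particular (Rubin-style) combining rule are illustrative assumptions; they are not the synthesis strategy proposed in the talk.

```python
# Sketch of partially synthetic data release with multiple-imputation-style
# pooling.  Hypothetical setup: a normal linear synthesis model for a sensitive
# outcome y given a covariate x, and Rubin's combining rule
# T = ubar + (1 + 1/m) * b for the pooled variance.
import numpy as np

rng = np.random.default_rng(1)
n, m = 500, 10

# "confidential" data: sensitive outcome y, non-sensitive covariate x
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

# fit the synthesis model for y | x on the confidential data
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid_sd = np.std(y - X @ beta_hat, ddof=2)

estimates, variances = [], []
for _ in range(m):
    # release synthetic y in place of the real y (proper synthesis would also
    # redraw the model parameters from their posterior for each copy)
    y_syn = X @ beta_hat + rng.normal(scale=resid_sd, size=n)
    # analyst's estimand: slope of y on x, computed from each synthetic copy
    b_hat, *_ = np.linalg.lstsq(X, y_syn, rcond=None)
    e = y_syn - X @ b_hat
    estimates.append(b_hat[1])
    variances.append((e @ e / (n - 2)) / np.sum((x - x.mean()) ** 2))

q = np.mean(estimates)                      # pooled point estimate of the slope
ubar, b = np.mean(variances), np.var(estimates, ddof=1)
T = ubar + (1 + 1 / m) * b                  # Rubin-style total variance
print(q, np.sqrt(T))                        # roughly 2.0 with a small standard error
```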

Location: MDS 220

Causal Discovery from Multivariate Functional Data

Abstract: Discovering causal relationships from multivariate functional data has received significant attention recently. We introduce a functional linear structural equation model for causal structure learning. To enhance interpretability, our model involves a low-dimensional causal embedding space such that all of the relevant causal information in the multivariate functional data is preserved in this lower-dimensional subspace. We prove that the proposed model is causally identifiable under standard assumptions that are often made in the causal discovery literature. To carry out inference for our model, we develop a fully Bayesian framework with suitable prior specifications and uncertainty quantification through posterior summaries. We illustrate the superior performance of our method over existing methods in terms of causal graph estimation through extensive simulation studies. We also demonstrate the proposed method using a brain EEG dataset.
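
The following sketch illustrates, under hypothetical choices, the kind of low-dimensional structure such a model exploits: each curve is generated from a handful of basis scores, a causal effect acts linearly on those scores, and projecting the observed curves back onto the embedding recovers the effect. The sine basis, the two-node graph, and the coefficient matrix are made-up assumptions; the talk's method learns the embedding and the graph within a Bayesian framework.

```python
# Hypothetical illustration of a functional linear SEM with a low-dimensional
# embedding: curves are driven by d basis scores, and the causal effect of X1
# on X2 acts linearly on those scores.
import numpy as np

rng = np.random.default_rng(2)
n, T, d = 200, 100, 3                          # curves, grid points, embedding dimension
t = np.linspace(0, 1, T)
basis = np.stack([np.sin((k + 1) * np.pi * t) for k in range(d)])   # (d, T) basis functions

A = np.array([[0.8, 0.0, 0.3],
              [0.2, 0.5, 0.0],
              [0.0, 0.4, 0.6]])                # effect of X1 scores on X2 scores

s1 = rng.normal(size=(n, d))                   # scores of the cause X1
s2 = s1 @ A.T + 0.3 * rng.normal(size=(n, d))  # scores of the effect X2
X1, X2 = s1 @ basis, s2 @ basis                # observed curves, shape (n, T)

# project the curves back onto the embedding and regress the X2 scores on the
# X1 scores; a full method would instead learn the basis and the causal graph
G_inv = np.linalg.inv(basis @ basis.T)
s1_hat, s2_hat = X1 @ basis.T @ G_inv, X2 @ basis.T @ G_inv
A_hat, *_ = np.linalg.lstsq(s1_hat, s2_hat, rcond=None)
print(np.round(A_hat.T, 2))                    # close to the true A
```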

Location: MDS 220

Recent Advances in Independent Testing of Stochastic Processes

Abstract:

This talk focuses on testing the independence of two stochastic processes. In contrast to i.i.d. data, data originating from a stochastic process typically exhibit strong correlations. This inherent feature of stochastic processes poses significant challenges for statistical inference. We will begin by reviewing the historical context of Yule's nonsense correlation, defined as the empirical correlation of two independent random walks; its distribution is known to be heavily dispersed and frequently large in absolute value. This phenomenon illustrates the difficulty inherent in conducting statistical inference for stochastic processes. The second part of the talk is devoted to AR(1) processes. We investigate the rate at which the distribution of the correlation between two independent AR(1) processes converges to the normal distribution, and show that this rate is of order √(log n / n), where n is the length of the processes. Finally, we will discuss the potential for a new methodology to test the independence of both random walks and AR(1) processes.
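
A short simulation, included here only to illustrate the phenomenon described above, shows how dispersed the empirical correlation of two independent random walks is; the walk length, replication count, and threshold are arbitrary choices, not quantities from the talk.

```python
# Simulation of Yule's nonsense correlation: the empirical correlation of two
# independent random walks is heavily dispersed and often large in absolute value.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 1000, 5000
corrs = np.empty(reps)
for r in range(reps):
    x = np.cumsum(rng.normal(size=n))    # two independent random walks
    y = np.cumsum(rng.normal(size=n))
    corrs[r] = np.corrcoef(x, y)[0, 1]

print(np.std(corrs))                     # large spread despite independence
print(np.mean(np.abs(corrs) > 0.5))      # a sizeable fraction exceeds 0.5 in absolute value
```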

Location: MDS 220

Maximum Wilcoxon-Mann-Whitney Test in High Dimensional Applications

Abstract: 

The statistical comparison of two multivariate samples is a frequent task, e.g., in biomarker analysis. Parametric and nonparametric multivariate analysis of variance (MANOVA) procedures are well-established methods for analyzing such data. Which method to use depends on the scales of the endpoints and on whether the assumption of a parametric multivariate distribution is meaningful. In the case of a significant outcome, however, MANOVA methods can only indicate that the treatments (conditions) differ in at least one of the endpoints; they cannot identify which endpoint(s) are responsible. Multiple contrast tests, formulated as maximum tests, by contrast provide local test results and thus the information of interest.

The maximum test method controls the error rate by comparing the value of the largest contrast in magnitude to the (1-α)-equicoordinate quantile of the joint distribution of all considered contrasts. The advantage of this approach over existing and commonly used methods that control the multiple type-I error rate, such as Bonferroni, Holm, or Hochberg, is that it is appealingly simple, yet has sufficient power to detect a significant difference in high-dimensional designs, and does not make strong assumptions (such as MTP2) about the joint distribution of test statistics. Furthermore, the computation of simultaneous confidence intervals is possible. The challenge, however, is that the joint distribution of the test statistics used must be known in order to implement the method.

In this talk, we develop a simultaneous maximum Wilcoxon-Mann-Whitney test for the analysis of multivariate data in two independent samples. We consider both low- and high-dimensional designs. We derive the (asymptotic) joint distribution of the test statistics and propose different bootstrap approximations for small sample sizes. We investigate their quality in extensive simulation studies. It turns out that the methods control the multiple type-I error rate well, even in high-dimensional designs with small sample sizes. A real data set illustrates the application.
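
To fix ideas, here is a minimal Python sketch of a maximum Wilcoxon-Mann-Whitney-type procedure: a standardized rank-sum statistic is computed for each endpoint, the maximum absolute statistic is taken, and its critical value is read off an approximation of the joint null distribution. The permutation calibration and all numerical settings below are illustrative stand-ins; the talk derives the asymptotic joint distribution and proposes bootstrap approximations instead.

```python
# Sketch of a maximum Wilcoxon-Mann-Whitney-type test with a simple
# permutation approximation of the joint null distribution.
import numpy as np
from scipy.stats import rankdata

def wmw_z(x, y):
    """Standardised Wilcoxon-Mann-Whitney statistics, one per endpoint (column)."""
    n1, n2 = len(x), len(y)
    ranks = np.apply_along_axis(rankdata, 0, np.vstack([x, y]))
    u = ranks[:n1].sum(axis=0) - n1 * (n1 + 1) / 2          # Mann-Whitney U per endpoint
    mean, var = n1 * n2 / 2, n1 * n2 * (n1 + n2 + 1) / 12   # null moments (no ties)
    return (u - mean) / np.sqrt(var)

def max_wmw_test(x, y, n_perm=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    t_obs = np.max(np.abs(wmw_z(x, y)))
    pooled, n1 = np.vstack([x, y]), len(x)
    t_perm = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(len(pooled))
        t_perm[b] = np.max(np.abs(wmw_z(pooled[idx[:n1]], pooled[idx[n1:]])))
    crit = np.quantile(t_perm, 1 - alpha)                    # equicoordinate-type critical value
    return t_obs, crit, t_obs > crit

# toy high-dimensional example: 50 endpoints, a location shift in the first one only
rng = np.random.default_rng(4)
x = rng.normal(size=(30, 50))
y = rng.normal(size=(30, 50))
y[:, 0] += 1.0
print(max_wmw_test(x, y))
```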

Location: MDS 220

Ghost Data

As natural as real data, ghost data is everywhere—it is just data that you cannot see. We need to learn how to handle it, how to model with it, and how to put it to work. Some examples of ghost data are (see Sall, 2017):

  1. Virtual data—it isn’t there until you look at it;

  2. Missing data—there is a slot to hold a value, but the slot is empty;

  3. Pretend data—data that is made up;

  4. Highly sparse data—data whose absence implies a near-zero value; and

  5. Simulation data—data to answer “what if.”

For example, absence of evidence/data is not evidence of absence; in fact, it can be evidence of something. Moreover, ghost data can be extended to other existing areas: hidden Markov chains, two-stage least squares estimation, optimization via simulation, partition models, and topological data, just to name a few.

Three movies will be used for illustration in this talk: (1) “The Sixth Sense” (Bruce Willis)—I can see things that you cannot see; (2) “Sherlock Holmes” (Robert Downey Jr.)—absence of expected facts; and (3) “Edge of Tomorrow” (Tom Cruise)—how to speed up your learning. It will be helpful if you watch these movies before coming to my talk. This is an early stage of my research in this area; any feedback from you is deeply appreciated. Much of the basic idea is strongly influenced by John Sall (JMP, SAS).

 

Dr. Dennis K. J. Lin is a Distinguished Professor in the Department of Statistics at Purdue University. Prior to his current position, he was a University Distinguished Professor of Supply Chain Management and Statistics at Penn State. His research interests include quality assurance, industrial statistics, data mining, and response surface methodology. He has published nearly 300 SCI/SSCI papers in a wide variety of journals. He currently serves or has served as an associate editor for more than 10 professional journals and was a co-editor of Applied Stochastic Models in Business and Industry. Dr. Lin is an elected fellow of the ASA, IMS, ASQ, and RSS, an elected member of the ISI, and a lifetime member of the ICSA. He is an honorary chair professor at various universities, including Renmin University of China (as a Chang-Jiang Scholar), Fudan University, and National Taiwan Normal University. His recent awards include the Youden Address (ASQ, 2010), the Shewell Award (ASQ, 2010), the Don Owen Award (ASA, 2011), the Loutit Address (SSC, 2011), the Hunter Award (ASQ, 2014), the Shewhart Medal (ASQ, 2015), the SPES Award (ASA, 2016), the Chow Yuan-Shin Award (2019), and the Deming Lecturer Award (JSM, 2020). His most recent honor is the Outstanding Alumni Award from National Tsing Hua University (Taiwan, 2022).

Location: MDS 220

Bayesian Regression for Group Testing Data

Abstract: Group testing involves pooling individual specimens (e.g., blood, urine, swabs, etc.) and testing the pools for the presence of a disease. When individual covariate information is available (e.g., age, gender, number of sexual partners, etc.), a common goal is to relate an individual's true disease status to the covariates in a regression model. Estimating this relationship is a nonstandard problem in group testing because true individual statuses are not observed and all testing responses (on pools and on individuals) are subject to misclassification arising from assay error. Previous regression methods for group testing data can be inefficient because they are restricted to using only initial pool responses and/or they make potentially unrealistic assumptions regarding the assay accuracy probabilities. To overcome these limitations, we propose a general Bayesian regression framework for modeling group testing data. The novelty of our approach is that it can be easily implemented with data from any group testing protocol. Furthermore, our approach will simultaneously estimate assay accuracy probabilities (along with the covariate effects) and can even be applied in screening situations where multiple assays are used. We apply our methods to group testing data collected in Iowa as part of statewide screening efforts for chlamydia.
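
To illustrate the data structure involved, the sketch below simulates master-pool group testing data from a logistic regression for the individual statuses, with imperfect assay sensitivity and specificity, and then maximizes the resulting pool-level likelihood. The pool size, accuracy values, single-assay protocol, and maximum-likelihood fit are illustrative assumptions; the talk's framework is Bayesian, accommodates general testing protocols, and estimates the accuracy probabilities as well.

```python
# Sketch of the group-testing regression setup: latent individual statuses
# follow a logistic regression, only pooled assay results (with imperfect
# sensitivity Se and specificity Sp) are observed, and the pool-level
# likelihood links the two.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(5)
n, pool_size = 2000, 5
Se, Sp = 0.95, 0.98
beta_true = np.array([-3.0, 1.0])

X = np.column_stack([np.ones(n), rng.normal(size=n)])     # intercept + one covariate
status = rng.binomial(1, expit(X @ beta_true))            # latent true disease statuses
pools = np.arange(n).reshape(-1, pool_size)               # non-overlapping master pools
pool_pos = status[pools].max(axis=1)                      # a pool is truly positive if any member is
Z = rng.binomial(1, np.where(pool_pos == 1, Se, 1 - Sp))  # observed assay results with error

def neg_loglik(beta):
    p = expit(X @ beta)
    prob_all_neg = np.prod(1 - p[pools], axis=1)          # P(all members of the pool are negative)
    prob_test_pos = Se * (1 - prob_all_neg) + (1 - Sp) * prob_all_neg
    return -np.sum(Z * np.log(prob_test_pos) + (1 - Z) * np.log(1 - prob_test_pos))

fit = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print(fit.x)   # rough recovery of beta_true from the pooled responses alone
```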

Location: MDS 220

Statistics Tutoring Center

 

The Statistics Tutoring Center (STC) provides free tutoring for students enrolled in STA 210 and STA 296. The tutors are graduate students in statistics who are currently teaching or assisting in these classes. 

The STC offers both online and in-person hours. The in-person hours are in the Multidisciplinary Science Center (MDS) 327R. The online hours are held in a Canvas conference (BigBlueButton) in a dedicated Canvas shell.
