STATISTICAL INFERENCE FOR CHANGE POINTS IN HIGH-DIMENSIONAL

OFFLINE AND ONLINE DATA

A dissertation submitted

to Kent State University

in partial fulfillment of the requirements

for the degree of Doctor of Philosophy

by

Lingjun Li

May 2020

© Copyright

All rights reserved

Except for previously published materials

Dissertation written by

Lingjun Li

B.S., Chongqing Technology and Business University, 2010

M.S., Kent State University, 2012

M.A., Kent State University, 2014

Ph.D., Kent State University, 2020

Approved by

Jun Li, Chair, Doctoral Dissertation Committee

Mohammad Kazim Khan, Members, Doctoral Dissertation Committee

Jing Li

Cheng-Chang Lu

Ruoming Jin

Accepted by

Andrew M. Tonge, Chair, Department of Mathematical Sciences

James L. Blank, Dean, College of Arts and Sciences

TABLE OF CONTENTS

TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
1 INTRODUCTION
1.1 Offline Change Point Detection
1.2 Online Change Point Detection
2 OFFLINE CHANGE-POINT DETECTION IN HIGH-DIMENSIONAL MEAN
2.1 Test Statistic
2.2 Hypothesis Testing of Change Point
2.3 Estimating One and Multiple Change Points
2.4 Power Enhancement Test for Change Point
2.5 Elbow Method for Dependence
2.6 Numerical Studies
2.6.1 Empirical performance of the proposed testing procedure
2.6.2 Empirical performance of change point estimates
2.6.3 Empirical performance of the power enhancement testing procedure
2.7 Application
2.7.1 Application to fMRI data
2.7.2 Application to environmental data
2.8 Technical Details
2.8.1 Proofs of main theorems
2.8.2 Lemmas and their proofs
3 ONLINE CHANGE-POINT DETECTION IN HIGH-DIMENSIONAL COVARIANCE STRUCTURE
3.1 Modeling Spatial and Temporal Dependence
3.2 Test Statistic
3.3 Stopping Rule
3.4 Asymptotic Results
3.4.1 Average run length
3.4.2 Expected detection delay
3.4.3 Change-point testing in the training sample
3.4.4 Stopping rule with estimated M
3.5 Simulation Studies
3.5.1 Accuracy of the theoretical ARL
3.5.2 Accuracy of upper bound for EDD
3.5.3 Accuracy of the data-driven procedure for M
3.6 Case Study
3.7 Technical Details
3.7.1 Proofs of main theorems
3.7.2 Lemmas and their proofs
4 CONCLUSION
BIBLIOGRAPHY

LIST OF FIGURES

1 Histogram of $\hat L_t/\hat\sigma_{nt,0}$ versus the N(0, 1) curve. The upper row chooses t = 1 at different n and p; the lower row chooses t = n/2 at different n and p.
2 The elbow method for choosing M for dependence under both null and alternative hypotheses. The results were obtained based on 50 replications.
3 Empirical powers of the proposed testing procedure with δ = 1 based on 1000 replicates under different n, p, M and locations of the change point.
4 The probability of detecting a change point at 40 when n = 100 and p = 100, 300 and 600, respectively. The probabilities are obtained based on 1000 iterations for each p. Upper panel: data are dependent with M = 2. Lower panel: data are independent with M = 0.
5 The probability of detecting a change point at 2 when n = 100 and p = 100, 300 and 600, respectively. The probabilities are obtained based on 1000 iterations for each p. Upper panel: data are dependent with M = 2. Lower panel: data are independent with M = 0.
6 Activation map for ROIs. In panel (a): left insula (yellow); right EBA (cyan); left ACC (darkblue); right ACC (blue); right MPFC (darkred); left IPL (orange); right IPL (red); right DLPFC (deepskyblue). In panel (b): right amygdala (darkblue); left MPFC (lightgreen); right MPFC (darkred). In panel (c): right FBA (darkblue); left insula (lightgreen); right insula (darkred). In panel (d): right insula (lightgreen); right MPFC (darkred); left DLPFC (darkblue); right DLPFC (deepskyblue); right IPL (orange).
7 Heatmap of PM2.5 at 36 stations measured in Beijing, China from Jan. 1, 2014 to Dec. 31, 2015.
8 Boxplots of EDDs for “Max-type” and “General” stopping rules in Chen (2019) and Chu and Chen (2018), and the proposed stopping rule. The results are based on 1000 simulations under model (a).
9 Boxplots of EDDs for “Max-type” and “General” stopping rules in Chen (2019) and Chu and Chen (2018), and the proposed stopping rule. The results are based on 1000 simulations under model (b).
10 Boxplots of EDDs for “Max-type” and “General” stopping rules in Chen (2019) and Chu and Chen (2018), and the proposed stopping rule. The results are based on 1000 simulations under model (c).
11 Histograms of selected M by the proposed data-driven procedure when the actual M = 0 and 1. The results are based on 100 simulations.
12 Online change-point detection in the covariance structure of subject 103010 (upper panel) and subject 130417 (lower panel). Each panel illustrates the estimated correlation matrices before and after the estimated change point.

LIST OF TABLES

1 Empirical sizes and powers of the CQ, the E-div and the proposed tests based on 1000 replications with Gaussian $\epsilon_i$ in (2.19).
2 The performance of the proposed binary segmentation and E-div method for estimating multiple change points. The average FP, FN, TP and corresponding standard deviations (subscript numbers) are obtained based on 1000 replications.
3 The performance of the proposed binary segmentation (BS) and wild binary segmentation (WBS) for estimating multiple change points. The average FP, FN, TP and corresponding standard deviations (subscript numbers) are obtained based on 1000 replications. The number of randomly selected intervals for the wild binary segmentation is H = 1000.
4 Empirical sizes of L1 and L based on 1000 replications with Gaussian $\epsilon_i$.
5 Empirical powers of L1 and L with τ = 0.1n and β = 0.8, based on 1000 replications under different M and magnitude δ.
6 Empirical powers of L1 and L with τ = 0.02n and β = 0.3, based on 1000 replications under different M and magnitude δ.
7 ROIs activated by the thin-body images and fat-body images for the normal weight subject and the overweight subject, respectively.
8 The comparison between theoretical ARLs and Monte Carlo ARLs based on 1000 simulations. For each ARL and window size H, the threshold a is obtained by solving the equation in Theorem 3.1.
9 The comparison between theoretical upper bounds for EDDs and Monte Carlo EDDs based on 1000 simulations with the ARL controlled around 5000.

ACKNOWLEDGMENTS

First and with the deepest appreciation, I would like to thank my advisor Dr. Jun Li for his guidance, patience, and support throughout my years of study at Kent State University. He has not only spent tremendous hours proofreading my research papers but also inspired me to conduct research on a broad range of statistical topics. The many skills I have learned from him and the invaluable research experience he shared with me will constantly guide me on my new career path.

I would also like to express my appreciation to my committee members: Dr. Mohammad Kazim Khan,

Dr. Jing Li, Dr. Cheng-Chang Lu, and Dr. Ruoming Jin. I am tremendously indebted to them for their collective time, effort, and direction.

I would like to thank all my friends who accompanied me during my studies at KSU. The years that I spent with them in Kent were so wonderful and will be such a precious memory in my life.

Finally, I express my profound gratitude to my parents for providing me with unfaltering support and encouragement throughout my educational journey as well as throughout the process of researching and writing this thesis. This achievement would not have been possible without them.

CHAPTER 1

INTRODUCTION

High-dimensional offline and online data, characterized by a large number of measurements and complex dependence, are commonly observed in medical, environmental, financial and social network studies. Here offline data refer to data which have already been collected at the time when data analysis is conducted, and online data are real-time observations which continually arrive. Change point detection is the problem of finding unknown locations where abrupt changes have occurred (Aminikhanghahi and Cook 2017). Despite its importance, methods available for detecting change points in high-dimensional offline and online time series data are scarce. This is mainly due to two reasons. First, a large number of parameters for the underlying distribution cannot be estimated accurately with a limited number of observations, and thus cannot be effectively incorporated to develop statistical methods. Second, high-dimensional time series data often involve both spatial and temporal dependence, and multifarious aspects of such complex dependence may not be fully captured by imposed independence or parametric models. In this thesis, we develop new change point detection methods for high-dimensional time series data which may involve complex spatial and temporal dependence. In this chapter, we first briefly review some methods of offline change point detection. We then describe the methods of online change point detection.

1.1 Offline Change Point Detection

Offline change point detection is to test for and estimate the unknown locations at which abrupt changes in given time series data have occurred. There exists abundant research on change point detection for univariate time series. For example, Sen and Srivastava (1975) studied testing and locating a single change point in the mean of a sequence of i.i.d. Gaussian random variables. Inclán and Tiao (1994) proposed a procedure to detect changes in a sequence of independent random variables based on an iterated cumulative sums of squares algorithm. Lavielle and Moulines (2000) considered least-squares estimation of changes in the mean of a random process and applied the method to a class of dependent processes. Davis, Lee, and Rodriguez-Yam (2006) studied a method to detect the number and locations of piecewise autoregressive segments in nonstationary time series. Shao and Zhang (2010) introduced a self-normalization based Kolmogorov-Smirnov test for a change point in a time series.

The problem of detecting change points in classical multivariate time series with fixed dimension has also been extensively studied. For example, based on the maximum Hotelling $T^2$, Srivastava and Worsley (1986) proposed likelihood ratio tests for a change in the mean of a sequence of independent multivariate normal vectors. James, James and Siegmund (1992) introduced a tail approximation for the significance level of the likelihood ratio test of one single change point in the mean of a sequence of independent p-dimensional normally distributed observations. Harchaoui, Moulines and Bach (2009) considered a kernel-based method for testing whether a change in the distribution occurs within a sequence of temporal observations. Motivated by the scientific problem of detecting inherited copy number variants in aligned DNA samples, Siegmund, Yakir, and Zhang (2011) studied the problem of detecting intervals where the mean values of the observations change simultaneously in a subset of aligned sequences of independent noisy observations.

Change point detection in high-dimensional time series data has been receiving more attention in recent years. For testing the existence of a change point in high-dimensional panel data, Horváth and Hušková (2012) and Chan, Horváth and Hušková (2013) proposed the sum of squared CUSUM statistics. In the same vein, Bai (2010) and Horváth et al. (2017) considered estimating the location of the change point assuming a change has occurred a priori. Aston and Kirch (2014) introduced the concept of high-dimensional efficiency that can be used to evaluate and compare the power of different change-point tests. When change points are sparse across the panel, Cho and Fryzlewicz (2015) proposed the Sparsified Binary Segmentation approach, where thresholding is applied to remove those dimensions without any change point. To avoid the difficulty of selecting the level of threshold in Cho and Fryzlewicz (2015), Cho (2016) considered the double CUSUM statistic, which partitions the panel data at each time into subgroups with and without any change. Allowing each component to have its own change point, Jirak (2015) proposed a test statistic which maximizes the largest marginal CUSUM statistic over time. Some other recent developments can be seen in Chen and Zhang (2015), Wang, Zou, and Yin (2017), and Wang and Samworth (2018). Along with methodology development, change point analysis has been recognized as a useful tool in fMRI studies (Lindquist, Waugh and Wager 2007; Robinson, Wager and Lindquist 2010; Aston and Kirch 2012).

Despite the usefulness of those methods, they usually suffer from the following shortcomings. First, they need stringent assumptions on the growth rate of p. For example, Jirak (2012, 2015) assumes logarithmic and polynomial growth rates of p. Horváth and Hušková (2012) requires $ph^2 = o(n^2)$, where h is the bandwidth of the kernel estimator for the long run variance. Cho and Fryzlewicz (2015) assumes $pn^{-\log n} \to 0$. Second, those methods usually require structural assumptions on both spatial and temporal dependence. For instance, Chen and Zhang (2015) and Wang, Zou, and Yin (2017) assume that the sequence of observations is independent. Jirak (2015) requires polynomial decay and logarithmic decay structures on the temporal and spatial dependence, respectively. Finally, those methods may not be able to consistently estimate a change point near the boundary, especially in the high-dimensional setting.

In Chapter 2, we propose a new procedure to detect the change points in the mean of high-dimensional time series data. We establish an asymptotic testing procedure for the existence of any change point. When the null hypothesis is rejected, a regular binary segmentation or a wild binary segmentation method is conducted to estimate multiple change points. We derive the convergence rate of the change point estimator, which demonstrates the impact of sample size, dimensionality and the location of the change point. The estimator is shown to be consistent with a sufficient signal-to-noise ratio even when the change point is near the boundary of the data. Compared with other methods, the proposed procedure allows both sample size and dimensionality to diverge without constraint on the growth rate of dimensionality. It does not assume Gaussianity. Moreover, it incorporates both spatial and temporal dependence without imposing restrictive structural assumptions. Simulation and case studies are provided to demonstrate the performance of the proposed methods.

1.2 Online Change Point Detection

Online change-point detection, or sequential change-point detection, operates in a real-time manner. With observations continually arriving, a stopping rule is chosen to terminate and reset the process as early as possible when an anomaly occurs. In modern applications, there has been a resurgence of interest in detecting abrupt changes from streaming data with a large number of measurements. Examples include real-time monitoring for sensor networks and threat detection from surveillance videos. Further examples arise in studying the dynamic connectivity of resting-state functional magnetic resonance imaging, and in detecting the threat of fake news from groups of fake accounts in social networks (Bara, Fung and Dinh 2015).

Extensive research has been done for online change-point detection of univariate data; see, for example, Page (1954), Shiryayev (1963), Lorden (1971), Wald (1973), Siegmund (1985) and Siegmund and Venkatraman (1995). The proposed stopping rules are based on the CUSUM test or the quasi-Bayesian test, which assume the distributions of data before and after the change point to be known, or on variants proposed to relax the restrictive assumption of known distributions. There also exist many developments in online change-point detection of multivariate data. For example, Tartakovsky and Veeravalli (2008) and Mei (2010) propose stopping rules for detecting a change point common to all dimensions based on the assumption that the distributions of data before and after the change point are known. By relaxing the common change to a change in only a subset of the data, Xie and Siegmund (2013) and Chan and Walther (2015) study stopping rules for multivariate normally distributed data with the identity covariance matrix. By extending and modifying the approach in Xie and Siegmund (2013), Chan (2017) investigates the optimal detection of multiple data streams in detecting mean shifts of independent multivariate normally distributed data with the identity covariance matrix. Despite the preceding developments, very little work has been done for online change-point detection of high-dimensional data. A recent development can be seen in Chen (2019), where the proposed stopping rule utilizes nearest neighbor information to detect a change point in the distribution of independent data. However, the condition of temporal independence is too restrictive in real applications.

In Chapter 3, we develop an online change-point detection procedure in the covariance structure of high-dimensional data. A new stopping rule is proposed to terminate the process as early as possible when a network change occurs. The stopping rule incorporates spatial and temporal dependence and can be applied to non-Gaussian data. An explicit expression for the average run length (ARL) is derived so that the level of threshold in the stopping rule can be easily obtained with no need to run time-consuming Monte Carlo simulations. We also establish an upper bound for the expected detection delay (EDD), the expression of which demonstrates the impact of data dependence and magnitude of change in the covariance structure. Simulation studies are provided to confirm the accuracy of the theoretical results. The practical usefulness of the proposed procedure is illustrated by detecting the brain’s network change in a resting-state fMRI dataset.

CHAPTER 2

OFFLINE CHANGE-POINT DETECTION IN HIGH-DIMENSIONAL MEAN

With the explosive development of high-throughput technologies, high-dimensional time series data are commonly observed in many fields including medical, environmental, financial, engineering and geographical studies. For example, functional magnetic resonance imaging (fMRI) in neuroscience is a modern noninvasive technique which produces massive amounts of complex data by measuring brain activity repeatedly over time (Ashby 2011). These high-dimensional time series data push toward the investigation of new statistical tools aimed at understanding and characterizing the underlying mechanism. In many cases, the dynamic processes involve abrupt changes occurring at unknown locations, and detecting such unknown change points in high-dimensional time series data has rekindled researchers’ interest. For example, fMRI data can be used to study the structure and function of the brain by detecting changes associated with blood flow (Logothetis et al. 2001).

In this chapter, we propose a new nonparametric method to detect the change points in the mean of high-dimensional time series data. More precisely, letting $\{X_i \in \mathbb{R}^p, 1 \le i \le n\}$ be a sequence of observations and $\mu_i$ be the mean of $X_i$ for $i = 1, \cdots, n$, we first establish an asymptotic testing procedure for the existence of any change point:
$$H_0: \mu_1 = \cdots = \mu_n, \quad \text{against} \quad H_1: \mu_1 = \cdots = \mu_{\tau_1} \ne \mu_{\tau_1+1} = \cdots = \mu_{\tau_q} \ne \mu_{\tau_q+1} = \cdots = \mu_n, \qquad (2.1)$$
where $1 \le \tau_1 < \cdots < \tau_q < n$ are some unknown change points. When the null hypothesis is rejected, we further adopt a binary segmentation or wild binary segmentation searching scheme to identify all the change points.

The advantages of the proposed change point detection procedure are multifold. First, it allows the dimension p to be much larger than the number of observations n without constraint on the growth rate of p, whereas previous relevant work needs more stringent assumptions. For example,

Jirak (2012, 2015) assumes logarithmic and polynomial growth rates of p, Horváth and Hušková (2012) requires $ph^2 = o(n^2)$, where h is the bandwidth of the kernel estimator for the long run variance, and Cho and Fryzlewicz (2015) assumes $pn^{-\log n} \to 0$. Second, the proposed method incorporates both spatial and temporal dependence in the sequence of high-dimensional observations, namely spatial dependence among the p components of $X_i$ at each $i$ and temporal dependence between any $X_i$ and $X_j$ for $i \ne j$. Moreover, unlike other existing methods, we do not impose any structural assumptions on either the spatial or the temporal dependence. For instance, Chen and Zhang (2015) and Wang, Zou, and Yin (2017) assume that the sequence of observations is independent, and Jirak (2015) requires polynomial decay and logarithmic decay structures on the temporal and spatial dependence, respectively. Third, the proposed method is nonparametric, making no distributional assumptions on the data beyond the existence of the fourth moment. As a result, it can be applied to a wide range of applications. Finally, we derive the convergence rate of the change point estimator, which demonstrates the impact of n, p and the location of the change point. The change point estimator is shown to be consistent with a sufficient signal-to-noise ratio even when the change point is located on the boundary of a sequence. To the best of our knowledge, an investigation of consistently estimating a change point near the boundary, especially in the high-dimensional setting, is lacking in the literature. Our results thus address and relax the usual concern of unsatisfactory performance of change point detection near the boundaries of data.

This chapter is organized as follows. In Sections 2.1, 2.2 and 2.3, we introduce our methods for testing the existence of a change point and estimating one or multiple change points; in Section 2.4, we introduce a power enhancement statistic for better testing performance. The theoretical properties of the methods are discussed in these sections as well. In Section 2.5, an elbow method is discussed for estimating dependence. In Sections 2.6 and 2.7, we demonstrate the empirical performance of our methods through simulated and real data, respectively. The proofs of the main results are relegated to Section 2.8.

2.1 Test Statistic

We model the sequence of p-dimensional random vectors $\{X_i, 1 \le i \le n\}$ by a linear high-dimensional time series
$$X_i = \mu_i + \Gamma_i Z \quad \text{for } i = 1, \cdots, n, \qquad (2.2)$$

where $\mu_i$ is the p-dimensional population mean, $\Gamma_i$ is a $p \times q$ matrix with $q \ge n \cdot p$, and $Z = (z_1, \cdots, z_q)^T$ such that $\{z_i\}_{i=1}^q$ are mutually independent and satisfy $E(z_i) = 0$, $\mathrm{Var}(z_i) = 1$ and $E(z_i^4) = 3 + \beta$ for some finite constant $\beta$.

By allowing $\Gamma_i$ to depend on $i$, each $X_i$ has its own covariance described by $\Gamma_i\Gamma_i^T$, and each pair of $X_i$ and $X_j$ has its own temporal dependence described by $\Gamma_i\Gamma_j^T$ for $i \ne j$. We require $q \ge np$ to guarantee the positive definiteness of $\Gamma_i\Gamma_i^T$. It also provides flexibility in generating different dependence structures. For example, if all the $X_i$'s are temporally m-dependent, the condition $q \ge np$ guarantees the existence of $\Gamma_i$'s such that $\Gamma_i\Gamma_j^T = 0$ if $|i - j| > m$. Another advantage of (2.2) is that it does not assume the Gaussian distribution of $Z$ beyond the existence of the fourth moment.

Let $C(j - i) = C^T(i - j) \equiv \Gamma_i\Gamma_j^T$, and define a weight function
$$w_t(h) = \frac{t(n-t)}{n}\sum_{i=1}^{n-h}\big\{t^{-1}I(i \le t) - (n-t)^{-1}I(i > t)\big\}\big\{t^{-1}I(i+h \le t) - (n-t)^{-1}I(i+h > t)\big\}.$$
Moreover, for any matrix $A$, we let $A^{\otimes 2} = AA^T$. We consider the following condition for dependence.

(C1). The sequence $\{X_1, \cdots, X_n\}$ is stationary satisfying $C(i-j) = C(h)$ with $h = i - j$. Moreover, as $n \to \infty$, there exists $M_n = o(n^{1/2})$ such that
$$\sum_{h=M_n+1}^{n-1} |\mathrm{tr}\{C(h)\}| = o(n), \quad \text{and} \quad \mathrm{tr}\Big[\Big\{\sum_{h=M_n+1}^{n-1} w_t(h)C(h)\Big\}^{\otimes 2}\Big] = o\Big(\mathrm{tr}\Big[\Big\{\sum_{h=1}^{M_n} w_t(h)C(h)\Big\}^{\otimes 2}\Big]\Big).$$

Some discussions about (C1) are as follows. First, the stationarity assumption has been commonly used in change-point analysis (Aue et al. 2009; Bai 2010; Horváth and Hušková 2012; Wang and Samworth 2018). We impose stationarity of the temporal correlation to simplify notation and the technical proofs of the asymptotic results; it can be relaxed to local stationarity. Second, (C1) is trivially true for a temporally independent or m-dependent sequence. However, (C1) is very general as the sequence need not be m-dependent. Third, (C1) does not impose any structural assumption on dependence within a critical value $M_n = o(n^{1/2})$, but only requires that the temporal dependence beyond the critical value $M_n$ is not too strong, so that the two equations in (C1) are satisfied. Lastly, compared to the usually assumed mixing condition, (C1) is advantageous as a mixing condition is hard to verify for real data and usually requires additional smoothness or restrictive moment assumptions (Carrasco and Chen 2002; Aue et al. 2009). For notational simplicity, we suppress the index n from $M_n$ in the rest of the dissertation.
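To make the construction concrete, the following NumPy sketch (our own illustration, not code from the dissertation) generates data from model (2.2) with an m-dependent temporal structure via a moving-average form $X_i = \mu_i + \sum_{l=0}^{m} A_l z_{i-l}$; the banded choice of the coefficient matrices $A_l$ is purely for demonstration, and (C1) holds trivially here since $C(h) = 0$ for $|h| > m$.

```python
import numpy as np

def simulate_model_2_2(n, p, m, seed=0):
    """Draw X_1,...,X_n from (2.2) with an m-dependent temporal structure:
    X_i = mu_i + sum_{l=0}^m A_l z_{i-l}, so Gamma_i Gamma_j^T = 0 if |i-j| > m."""
    rng = np.random.default_rng(seed)
    # illustrative coefficient matrices A_0,...,A_m with banded spatial dependence
    absdiff = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    A = [0.6 ** absdiff / (l + 1.0) for l in range(m + 1)]
    mu = np.zeros((n, p))                  # H0: constant (zero) mean
    z = rng.standard_normal((n + m, p))    # innovations: E(z)=0, Var(z)=1
    X = np.empty((n, p))
    for i in range(n):
        X[i] = mu[i] + sum(A[l] @ z[i + m - l] for l in range(m + 1))
    return X

X = simulate_model_2_2(n=100, p=50, m=2)   # 100 observations in R^50
```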

For any $t \in \{1, \cdots, n-1\}$, we measure homogeneity in $\{\mu_i, 1 \le i \le n\}$ by
$$L_t = \frac{t(n-t)}{n^2}(\bar\mu_{\le t} - \bar\mu_{>t})^T(\bar\mu_{\le t} - \bar\mu_{>t}), \qquad (2.3)$$
where $\bar\mu_{\le t} = t^{-1}\sum_{i=1}^{t}\mu_i$ and $\bar\mu_{>t} = (n-t)^{-1}\sum_{i=t+1}^{n}\mu_i$. Clearly, $L_t = 0$ under $H_0$ and $L_t \ne 0$ for some $t$ under $H_1$ of (2.1). To estimate the unknown $L_t$, a naive choice is $t(n-t)n^{-2}(\bar X_{\le t} - \bar X_{>t})^T(\bar X_{\le t} - \bar X_{>t})$, where $\bar X_{\le t} = t^{-1}\sum_{i=1}^{t}X_i$ and $\bar X_{>t} = (n-t)^{-1}\sum_{i=t+1}^{n}X_i$. However, as shown in Proposition 2.1 in this section, this estimator is biased due to data dependence. We therefore consider a bias-corrected estimator
$$\hat L_t = \frac{t(n-t)}{n^2}(\bar X_{\le t} - \bar X_{>t})^T(\bar X_{\le t} - \bar X_{>t}) - \frac{f_t^T F_{n,M}^{-1}\hat V}{n}, \qquad (2.4)$$
where, like Ayyala, Park, and Roy (2017), $f_t$ is an $(M+1)$-dimensional vector with $f_t(1) = 1$ and, for $i \in \{2, \cdots, M+1\}$,
$$f_t(i) = 2\Big\{\frac{(n-t)(t-i+1)}{nt}I(t+1 > i) + \frac{t(n-t-i+1)}{n(n-t)}I(n-t+1 > i) - \frac{1}{n}\sum_{l=1}^{i-1}I(t \ge l)I(n-t \ge i-l)\Big\}. \qquad (2.5)$$
The $(M+1) \times (M+1)$ matrix
$$F_{n,M}(i,j) = \Big(1 - \frac{i-1}{n}\Big)I(i,j) + \Big(1 - \frac{i-1}{n}\Big)\Big(1 - \frac{j-1}{n}\Big)\frac{2 - I(j,1)}{n} - \frac{1}{n^2}\sum_{a=1}^{n-i+1}\sum_{b=1}^{n}\Big\{I(|a-b|+1, j) + I(|a+i-1-b|+1, j)\Big\}, \qquad (2.6)$$
where the indicator $I(i,j)$ equals 1 if $i = j$ and 0 otherwise. And for $h = 0, \cdots, M$,
$$\hat V = \big(\mathrm{tr}\{\widehat{C(0)}\}, \cdots, \mathrm{tr}\{\widehat{C(M)}\}\big)^T \quad \text{with} \quad \mathrm{tr}\{\widehat{C(h)}\} = \frac{1}{n}\sum_{i=1}^{n-h}(X_i - \bar X)^T(X_{i+h} - \bar X). \qquad (2.7)$$
As shown in Proposition 2.1, $n^{-1}f_t^T F_{n,M}^{-1}\hat V$ is the bias-correction term that eliminates the leading order bias from the first term of $\hat L_t$ in (2.4). It requires the $(M+1) \times (M+1)$ matrix $F_{n,M}$ to be invertible. From Ayyala, Park and Roy (2017), $F_{n,M}a = 0$ if and only if the $(M+1)$-dimensional column vector $a = 0$, and $F_{n,M}$ is thus of full rank and invertible. In analogy with the kernel estimator in Horváth and Hušková (2012) and Horváth, Rice and Whipple (2016), $n^{-1}f_t^T F_{n,M}^{-1}\hat V$ can be written as
$$\sum_{h=0}^{n-1} K(h)\,\mathrm{tr}\{\widehat{C(h)}\} = \sum_{h=0}^{n-1} K(h)\Big\{\frac{1}{n}\sum_{i=1}^{n-h}(X_i - \bar X)^T(X_{i+h} - \bar X)\Big\},$$

where $K(h)$ is the $h$th element of $n^{-1}f_t^T F_{n,M}^{-1}$ if $h \le M$, and zero otherwise.
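For concreteness, the sketch below codes the bias-corrected statistic directly from (2.4)-(2.7) as reconstructed above. It is a minimal, unoptimized illustration of the displayed formulas (not the authors' implementation), so any transcription uncertainty in (2.5)-(2.6) carries over to the code.

```python
import numpy as np

def f_t(t, n, M):
    # f_t in (2.5): f_t(1) = 1; remaining entries indexed i = 2,...,M+1
    f = np.ones(M + 1)
    for i in range(2, M + 2):
        s = sum((t >= l) and (n - t >= i - l) for l in range(1, i))
        f[i - 1] = 2 * ((n - t) * (t - i + 1) / (n * t) * (t + 1 > i)
                        + t * (n - t - i + 1) / (n * (n - t)) * (n - t + 1 > i)
                        - s / n)
    return f

def F_nM(n, M):
    # F_{n,M} in (2.6); I(i, j) is the Kronecker indicator
    F = np.zeros((M + 1, M + 1))
    for i in range(1, M + 2):
        for j in range(1, M + 2):
            F[i - 1, j - 1] = (1 - (i - 1) / n) * (i == j) \
                + (1 - (i - 1) / n) * (1 - (j - 1) / n) * (2 - (j == 1)) / n
            a = np.arange(1, n - i + 2)[:, None]
            b = np.arange(1, n + 1)[None, :]
            F[i - 1, j - 1] -= (np.sum(np.abs(a - b) + 1 == j)
                                + np.sum(np.abs(a + i - 1 - b) + 1 == j)) / n**2
    return F

def V_hat(X, M):
    # (2.7): tr{C(h)}-hat = n^{-1} sum_i (X_i - Xbar)'(X_{i+h} - Xbar)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    return np.array([np.sum(Xc[:n - h] * Xc[h:]) / n for h in range(M + 1)])

def L_hat(X, t, M):
    # bias-corrected statistic (2.4)
    n = X.shape[0]
    d = X[:t].mean(axis=0) - X[t:].mean(axis=0)   # Xbar_{<=t} - Xbar_{>t}
    naive = t * (n - t) / n**2 * (d @ d)
    correction = f_t(t, n, M) @ np.linalg.solve(F_nM(n, M), V_hat(X, M)) / n
    return naive - correction
```

With X generated by the earlier sketch, `L_hat(X, t=50, M=2)` evaluates the statistic at a candidate split t.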

With a known change point $\tau$, the statistic $\hat L_\tau$ can be used for the two-sample testing problem by requiring the two sample sizes $\tau$ and $n - \tau$ to diverge at the same order. See Bai and Saranadasa (1996) for independent data, and Ayyala, Park, and Roy (2017) for m-dependent Gaussian data. As the number and locations of change points are unknown a priori, the change point detection problem is more challenging. In particular, if the change point $\tau$ is near the boundary, using $\hat L_\tau$ for the two-sample problem is infeasible as the assumption $\tau/(n - \tau) \to k \in (0, 1)$ is violated.

The proposed $\hat L_t$ is closely related to the test statistics in Horváth and Hušková (2012) and Chan, Horváth and Hušková (2013). More precisely, they considered the sum of squared CUSUM statistics, each of which is scaled by its long run variance. While scaling can convert dimensions to homogeneous scales and thus avoid the power loss when the change occurs at components with small scales, all p long-run variances need to be estimated consistently to establish the asymptotic distribution. It thus requires restrictive conditions, especially on the growth rate of dimension p with respect to sample size n. Different from Horváth and Hušková (2012) and Chan, Horváth and Hušková (2013), we only scale $\hat L_t$ by the consistent estimator of its standard deviation, and consequently its asymptotic normality can be established without imposing any explicit restriction on p as long as the condition (C2) is satisfied (see Theorem 2.1 in Section 2.2 for details).

The mean and variance of Lˆt are given by the following proposition.

Proposition 2.1. Under the model (2.2) and (C1), and for $t \in \{1, \cdots, n-1\}$,
$$E(\hat L_t) = L_t - \frac{f_t^T F_{n,M}^{-1} V_B}{n} + o(1), \qquad (2.8)$$
where $L_t$ is defined by (2.3) and $V_B = \big(n^{-1}\sum_{i=1}^{n}(\mu_i - \bar\mu)^T(\mu_i - \bar\mu), \cdots, n^{-1}\sum_{i=1}^{n-M}(\mu_i - \bar\mu)^T(\mu_{i+M} - \bar\mu)\big)^T$ with $\bar\mu = n^{-1}\sum_{i=1}^{n}\mu_i$. Moreover,
$$\begin{aligned}
\mathrm{Var}(\hat L_t) \equiv \sigma_{nt}^2 = \frac{1}{n^4}\Big[&\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{h_1,h_2\in A}\big\{B_t(i,j)B_t(i+h_2, j-h_1) + B_t(i,j)B_t(j-h_1, i+h_2)\big\}\,\mathrm{tr}\{C(h_1)C(h_2)\} \\
&+ \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n}\sum_{h\in A\cup A^c}\big\{B_t(i,j) + B_t(j,i)\big\}\big\{B_t(k, i+h) + B_t(i+h, k)\big\}\,\mu_j^T C(h)\mu_k\Big]\{1 + o(1)\}, \qquad (2.9)
\end{aligned}$$
where the set $A = \{0, \pm 1, \cdots, \pm M\}$ and the $n \times n$ matrix $B_t$ satisfies
$$B_t(i,j) = \frac{n-t}{t}I(i\le t)I(j\le t) - 2I(i\le t)I(j>t) + \frac{t}{n-t}I(i>t)I(j>t) - \sum_{h=0}^{M}\big(f_t^T F_{n,M}^{-1}\big)_{h+1}\Big\{I(i-j, h) - \frac{I(j\ge h+1)}{n} - \frac{I(j\le n-h)}{n} + \frac{n-h}{n^2}\Big\}.$$

Remark 2.1. Under the null hypothesis, $E(\hat L_t) = o(1)$ as $V_B = 0$ in (2.8), and thus $\hat L_t$ is asymptotically unbiased for $L_t$. Under the alternative, though $\hat L_t$ is still biased for $L_t$, the bias effect is negligible as the bias $n^{-1}f_t^T F_{n,M}^{-1}V_B$ is of smaller order than $L_t$ for any $t \in \{1, \cdots, n-1\}$ (see the proof of Lemma 2.1 in Section 2.8).

Remark 2.2. The proposed bias-corrected estimator $\hat L_t$ depends on $M$, which separates the dominant temporal dependence from the remainder. How to choose a proper $M$ is very important in practice and will be addressed in Section 2.5. From here to the end of Section 2.4, we simply assume $M$ to be known in order to present the theoretical results of our method.

2.2 Hypothesis Testing of Change Point

To test (2.1) for the statistical significance of any change point, we first establish the asymptotic normality of Lˆt under the following condition.

(C2). For any $h_1, h_2, h_3, h_4 \in A$ with $A = \{0, \pm 1, \cdots, \pm M\}$, as $p \to \infty$,
$$\mathrm{tr}\big\{C(h_1)C(h_2)C(h_3)C(h_4)\big\} = o\big[\mathrm{tr}\{C(h_1')C(h_2')\}\,\mathrm{tr}\{C(h_3')C(h_4')\}\big],$$
where $\{h_1', h_2', h_3', h_4'\}$ is a permutation of $\{h_1, h_2, h_3, h_4\}$.

The condition (C2) is imposed on the covariance matrix of the entire sequence $X_1, \cdots, X_n$. To see this, let $X = (X_1^T, X_2^T, \cdots, X_n^T)^T$ and $\Gamma = (\Gamma_1^T, \Gamma_2^T, \cdots, \Gamma_n^T)^T$ from (2.2). As a result, the $np \times np$ covariance matrix of $X$ is $\Sigma = \Gamma\Gamma^T$, where each $p \times p$ block diagonal matrix of $\Sigma$ describes the spatial dependence among the p components of each $X_i$, and each block off-diagonal matrix measures the temporal dependence of each pair $X_i$ and $X_j$ for $i \ne j$. To impose a condition on $\Sigma$, we may consider $\mathrm{tr}(\Sigma^4) = o\{\mathrm{tr}^2(\Sigma^2)\}$, which is satisfied if all eigenvalues of $\Sigma$ are bounded. However, it is more desirable to impose the condition on the spatial and temporal dependence through $\Gamma_i$. By the relationship $\Sigma = \Gamma\Gamma^T = (\Gamma_1^T, \Gamma_2^T, \cdots, \Gamma_n^T)^T(\Gamma_1^T, \Gamma_2^T, \cdots, \Gamma_n^T)$, it can be shown that (C2) is a sufficient condition for $\mathrm{tr}(\Sigma^4) = o\{\mathrm{tr}^2(\Sigma^2)\}$.

The advantage of (C2) is that we do not require any explicit relationship between the dimension p and the number of observations n as long as (C2) is satisfied. Under temporal independence, (C2) reduces to $\mathrm{tr}\{C^4(0)\} = o[\mathrm{tr}^2\{C^2(0)\}]$ in Chen and Qin (2010), where $C(0)$ describes the spatial dependence of the p components of $X_i$ at time $i$. However, (C2) is more general as our setup incorporates both spatial and temporal dependence. In particular, (C2) involves $C(h \ne 0)$, whose off-diagonal elements describe the temporal dependence at different time points. It is also worth noting that under strong spatial dependence, such as an equal correlation structure of $\Sigma$, the condition $\mathrm{tr}(\Sigma^4) = o\{\mathrm{tr}^2(\Sigma^2)\}$ does not hold and (C2) is thus violated.

Theorem 2.1. Assume (2.2) and (C1)-(C2). As $n \to \infty$, for any $t \in \{1, \cdots, n-1\}$,
$$\frac{\hat L_t - L_t}{\sigma_{nt}} \stackrel{d}{\longrightarrow} N(0, 1),$$
where $\sigma_{nt}$ is defined by (2.9) in Proposition 2.1.

In order to implement a testing procedure, we need to estimate the unknown variance of $\hat L_t$ under the null hypothesis,
$$\sigma_{nt,0}^2 = \sum_{i,j=1}^{n}\sum_{h_1,h_2\in A}\frac{B_t(i,j)}{n^4}\big\{B_t(i+h_2, j-h_1) + B_t(j-h_1, i+h_2)\big\}\,\mathrm{tr}\{C(h_1)C(h_2)\},$$
which only requires us to estimate the unknown $\mathrm{tr}\{C(h_1)C(h_2)\}$. For any $h_1$ and $h_2$ from the set $A = \{0, \pm 1, \cdots, \pm M\}$, the estimator is
$$\mathrm{tr}\{\widehat{C(h_1)C(h_2)}\} = \frac{1}{n_1^*}\sum_{s,t}^{*}X_{t+h_2}^T X_s X_{s+h_1}^T X_t - \frac{1}{n_2^*}\sum_{r,s,t}^{*}X_r^T X_s X_{s+h_1}^T X_t - \frac{1}{n_3^*}\sum_{r,s,t}^{*}X_r^T X_s X_{s+h_2}^T X_t + \frac{1}{n_4^*}\sum_{q,r,s,t}^{*}X_q^T X_r X_s^T X_t, \qquad (2.10)$$
where $\sum^{*}$ represents the sum over indices that are at least $M$ apart, and $n_i^*$ with $i = 1, 2, 3, 4$ are the corresponding numbers of indices. As a result, the estimator of the variance under the null hypothesis is
$$\hat\sigma_{nt,0}^2 = \sum_{i,j=1}^{n}\sum_{h_1,h_2\in A}\frac{B_t(i,j)}{n^4}\big\{B_t(i+h_2, j-h_1) + B_t(j-h_1, i+h_2)\big\}\,\mathrm{tr}\{\widehat{C(h_1)C(h_2)}\}. \qquad (2.11)$$
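To illustrate (2.10), the following naive sketch enumerates index tuples directly. We read $\sum^{*}$ as a sum over tuples of indices that are pairwise at least $M$ apart (our interpretation of the separation requirement), and the $O(n^4)$ loops are for exposition on small n only.

```python
import numpy as np
from itertools import combinations

def _sep(idx, M):
    # our reading of Sum^*: all involved indices at least M apart
    return all(abs(a - b) >= M for a, b in combinations(idx, 2))

def tr_ChCh_hat(X, h1, h2, M):
    """Naive evaluation of the estimator (2.10) of tr{C(h1)C(h2)};
    feasible only for small n (the last term is an O(n^4) loop)."""
    n = X.shape[0]
    T1, T2, T3, T4 = [], [], [], []
    for s in range(n):
        for t in range(n):
            if 0 <= s + h1 < n and 0 <= t + h2 < n and _sep((s, t), M):
                T1.append((X[t + h2] @ X[s]) * (X[s + h1] @ X[t]))
    for r in range(n):
        for s in range(n):
            for t in range(n):
                if 0 <= s + h1 < n and _sep((r, s, t), M):
                    T2.append((X[r] @ X[s]) * (X[s + h1] @ X[t]))
                if 0 <= s + h2 < n and _sep((r, s, t), M):
                    T3.append((X[r] @ X[s]) * (X[s + h2] @ X[t]))
    for q in range(n):
        for r in range(n):
            for s in range(n):
                for t in range(n):
                    if _sep((q, r, s, t), M):
                        T4.append((X[q] @ X[r]) * (X[s] @ X[t]))
    # np.mean divides each term by its own count n_i^*
    return np.mean(T1) - np.mean(T2) - np.mean(T3) + np.mean(T4)
```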

Theorem 2.2. Assume the conditions in Theorem 2.1 and H0 of (2.1). As n → ∞,

$$\frac{\hat\sigma_{nt,0}}{\sigma_{nt,0}} \stackrel{p}{\longrightarrow} 1, \quad \text{for } t \in \{1, \cdots, n-1\}.$$

Combining Theorems 2.1 and 2.2, we immediately see that under the null hypothesis,

$$\frac{\hat L_t}{\hat\sigma_{nt,0}} \stackrel{d}{\longrightarrow} N(0, 1), \quad \text{for } t \in \{1, \cdots, n-1\}.$$


Figure 1: Histogram of $\hat L_t/\hat\sigma_{nt,0}$ versus the N(0, 1) curve. The upper row chooses t = 1 at different n and p; the lower row chooses t = n/2 at different n and p.

We also conducted some simulations to give the above result a visual inspection. Figure 1 shows the histograms of $\hat L_t/\hat\sigma_{nt,0}$ based on 1000 iterations for t = 1 and t = n/2, respectively. The data were generated based on the setups in Section 2.6. Clearly, as n and p increase, the empirical histograms get closer to the standard normal curve even when the time point t is taken to be 1 on the boundary.

From Theorems 2.1 and 2.2, we reject the null hypothesis if $\hat L_t/\hat\sigma_{nt,0} > z_\alpha$ at a nominal significance level $\alpha$, where $z_\alpha$ is the upper-$\alpha$ quantile of N(0, 1). However, there is a drawback in implementing the testing procedure via $\hat L_t/\hat\sigma_{nt,0}$, as it depends on the choice of t. Although the testing procedure can be applied for any $t \in \{1, \cdots, n-1\}$, the power of the test can vary greatly with t.

From Theorem 2.1, the power of the test at a given t is
$$\Phi\Big(-z_\alpha\frac{\sigma_{nt,0}}{\sigma_{nt}} + \frac{L_t}{\sigma_{nt}}\Big),$$
where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution and $L_t$ is given by (2.3). Since $\sigma_{nt,0}/\sigma_{nt} \le 1$, the power is determined by the signal-to-noise ratio $L_t/\sigma_{nt}$, which can be affected by the choice of t and may render the test conservative. For example, consider a simple case with n = 150 where $\mu_t = 0$ for $1 \le t \le 50$, $\mu_t = 1$ for $51 \le t \le 100$ and $\mu_t = -1$ for $101 \le t \le 150$.

If we choose t = 50 for $\hat L_t$, the signal-to-noise ratio $L_t/\sigma_{nt} = 0$, leading to low power of the test. In order to circumvent the difficulty of choosing t and, most importantly, retain the power of the test, we consider the following test statistic,
$$\hat L = \sum_{t=1}^{n-1}\hat L_t, \qquad (2.12)$$
which is the $L_2$-norm statistic obtained by summing over all the marginal $\hat L_t$. Instead of the $L_2$-norm statistic, one may consider the max-norm statistic $\max_{1\le t\le n-1}\hat L_t/\hat\sigma_{nt,0}$. The advantages and disadvantages of the two norms have been discussed extensively in the literature. Generally speaking, if there are only a few time points t where the difference between $\bar\mu_{\le t}$ and $\bar\mu_{>t}$ is large, the max-norm based test is expected to be more powerful than the $L_2$-norm based test. If, however, small differences occur at many time points, the $L_2$-norm based test will dominate the max-norm based test in power since it can aggregate the strength of all the small differences. Furthermore, to establish a testing procedure, the test statistic (2.12) requires less stringent conditions compared to the max-norm based test.

Let $\mathcal B(i,j) = \sum_{t=1}^{n-1}B_t(i,j)$, where $B_t(i,j)$ is specified in Proposition 2.1, and let $\sigma_n^2$ be the variance of $\hat L$ obtained by replacing $B_t(\cdot,\cdot)$ with the corresponding $\mathcal B(\cdot,\cdot)$.

Theorem 2.3. Assume the same conditions in Theorem 2.1. As $n \to \infty$,
$$\frac{\hat L - \sum_{t=1}^{n-1}L_t}{\sigma_n} \stackrel{d}{\longrightarrow} N(0, 1).$$
In particular, under $H_0$ of (2.1),
$$\frac{\hat L}{\hat\sigma_{n,0}} \stackrel{d}{\longrightarrow} N(0, 1),$$
where $\hat\sigma_{n,0}$ is obtained by replacing $B_t(\cdot,\cdot)$ with the corresponding $\mathcal B(\cdot,\cdot)$ in (2.9).

Based on Theorem 2.3, we reject $H_0$ of (2.1) at a nominal significance level $\alpha$ if $\hat L/\hat\sigma_{n,0} > z_\alpha$. Being free of the tuning parameter t while retaining reasonable power, the testing procedure based on $\hat L$ is thus chosen for testing the existence of any change point.

Remark 2.3. Following Aston and Kirch (2014), we derive the high-dimensional efficiency of the proposed test with one change point $\tau \in \{1, \cdots, n-1\}$ to be $\big(\sum_{l=1}^{p}\delta_l^2\big)^{1/2}[\mathrm{tr}\{C^2(0)\}]^{-1/4}$, where $\delta = (\mu_1 - \mu_n)$ is the population mean difference. Note that the oracle projection-based change-point test in Aston and Kirch (2014) achieves the high-dimensional efficiency $\|C^{-1/2}(0)\delta\|_2$, where $\|\cdot\|_2$ refers to the $L_2$-norm of a vector. With a diagonal $C(0)$ (cross-sectional independence), our high-dimensional efficiency is less than that of Aston and Kirch (2014) by a factor of $p^{1/4}$.

2.3 Estimating One and Multiple Change Points

If the null hypothesis is rejected by the proposed testing procedure, the second interest is to further locate the change points. We first consider the case of one change point; estimating multiple change points will be covered at the end of this section. By using $\hat L_t$ given by (2.4), the location of a change point $\tau \in \{1, \cdots, n-1\}$ is estimated by

$$\hat\tau = \arg\max_{0<t<n}\hat L_t. \qquad (2.13)$$

Without performing any bias correction, Bai (2010) proposed a similar estimator
$$\hat\tau_B = \arg\max_{0<t<n}\frac{t(n-t)}{n^2}(\bar X_{\le t} - \bar X_{>t})^T(\bar X_{\le t} - \bar X_{>t}),$$
and Horváth et al. (2017) considered
$$\hat\tau_H = \arg\max_{0<t<n}\frac{t^2(n-t)^2}{n^2}(\bar X_{\le t} - \bar X_{>t})^T(\bar X_{\le t} - \bar X_{>t}).$$

As shown in Proposition 2.1, the bias in these two estimators can have an adverse effect on the change-point estimate, especially when the actual change point is near the boundary of the sequence. Unlike Bai (2010) and Horváth et al. (2017), we propose the bias-corrected $\hat\tau$, which retains a satisfactory performance as shown in the following lemma.

Lemma 2.1. Under the alternative of (2.1) and $M = o(n^{1/2})$, $L_t - n^{-1}f_t^T F_{n,M}^{-1}V_B$ always attains its maximum at one of the change points $1 \le \tau_1 < \cdots < \tau_q < n$.

Let $\delta^2 = (\mu_1 - \mu_n)^T(\mu_1 - \mu_n)$ and $v_{max} = \max_{0<t<n}\sqrt{n^2\sigma_{nt}^2}$.

Theorem 2.4. Assume that the change point $\tau \in \{1, \cdots, n-1\}$ satisfies $\tau = O(n^\gamma)$ or $n - \tau = O(n^\gamma)$ with $\gamma \in [0, 1]$. Under the same conditions in Theorem 2.1, as $n \to \infty$,
$$\hat\tau - \tau = O_p\Big(\frac{n^{1-\gamma}\sqrt{\log n}\,v_{max}}{\delta^2}\Big).$$

Remark 2.4. To fully understand the rate of convergence, we need to specify the order of $v_{max}$. Similar to Aston and Kirch (2014), we consider cross-sectional dependence but temporal independence to provide quick insight. By using (2.9) in Proposition 2.1,
$$v_{max} = \Big[2\,\mathrm{tr}\{C^2(0)\} + 4n^{-1}\max_{0<t<n}t(n-t)(\bar\mu_{\le t} - \bar\mu_{>t})^T C(0)(\bar\mu_{\le t} - \bar\mu_{>t})\Big]^{1/2}.$$
Under the local alternative that the change in $\mu$ tends to zero, the leading order is $v_{max} = \sqrt{2\,\mathrm{tr}\{C^2(0)\}} \asymp p^{1/2}$ if all the eigenvalues of $C(0)$ are bounded, and thus
$$\hat\tau - \tau = O_p\Big(\frac{n^{1-\gamma}\sqrt{\log n}\,p^{1/2}}{\delta^2}\Big),$$
which allows comparison with the rates of other change-point estimators. For example, when the change point is away from the boundary ($\gamma = 1$), the rate of convergence established in Bai (2010) is $p^{1/2}/\delta^2$, which is faster than ours by the factor $\sqrt{\log n}$. Similarly, the rate of convergence in Horváth et al. (2017) can also be faster than ours. Nevertheless, the result established in Theorem 2.4 is more general as it includes the case that the change point is near the boundary ($\gamma < 1$) (see Remark 2.5 below).

Remark 2.5. In the change point literature, it is commonly assumed that the location of the change point $\tau$ is of the form $\kappa n$ with $\kappa \in (0, 1)$, which is $\tau = O(n)$ with $\gamma = 1$ in terms of our notation. The corresponding convergence rate is $\sqrt{\log n}\,v_{max}\delta^{-2}$. However, Theorem 2.4 considers a more general case as $\gamma$ can vary within $[0, 1]$. For example, when the actual change point $\tau = O(1)$ (near the boundary), we can establish the convergence rate $n\sqrt{\log n}\,v_{max}\delta^{-2}$. From this, we know that in order to attain the same convergence rate as for $\tau = O(n)$, a stronger signal is needed, namely n times the signal $\delta^2$ used for $\tau = O(n)$.

Remark 2.6. With fixed dimension p and $\tau = O(n)$, our result shows $\hat\tau - \tau = O_p(\sqrt{\log n})$. A similar result was also derived in Aue et al. (2009) for change point detection in the covariance structure of fixed-dimensional time series (see their (2.13) for details, where $\hat\theta_n - \theta$ is equivalent to our $(\hat\tau - \tau)/n$). However, the convergence rate can be faster if the signal-to-noise ratio $\delta^2/v_{max}$ diverges as p increases. Our result clearly demonstrates the beneficial effect of p on the change point identification.

The developed single change-point testing and estimation can be readily combined with other algorithms to estimate multiple change points $1 \le \tau_1 < \cdots < \tau_q < n$ in (2.1). The classical algorithm is binary segmentation, which was proposed in Vostrikova (1981) and Venkatraman (1992) in univariate settings. We implement the algorithm in high-dimensional settings as follows; a schematic implementation is sketched after the steps.

• For $s, t \in \{1, \cdots, n\}$ satisfying $s < t$, the test statistic $\hat L[s,t]$ and the estimator of its standard deviation $\hat\sigma_{n,0}[s,t]$ are computed from the data between s and t. If
$$\frac{\hat L[s,t]}{\hat\sigma_{n,0}[s,t]} \le z_{\alpha_n},$$
where $\alpha_n$ is a chosen nominal significance level, the algorithm stops. Otherwise, a change point is estimated as
$$\hat\tau = \arg\max_{l\in[s,t]}\hat L_l[s,t],$$
and $[s,t]$ is partitioned into $[s,\hat\tau]$ and $[\hat\tau+1, t]$.

• The algorithm starts with s = 1 and t = n, and repeats the above steps iteratively until no more change point can be estimated from any segment.
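As referenced above, the recursion can be written generically, with the segment-level test abstracted behind a user-supplied function. The `toy_seg_test` below is a crude stand-in (the naive, bias-uncorrected statistic with an ad hoc standardization), not the calibrated $\hat L/\hat\sigma_{n,0}$ of Section 2.2; it only serves to make the sketch runnable.

```python
import numpy as np

def binary_segmentation(X, seg_test, z_alpha, s=0, t=None, found=None, min_len=4):
    """Generic binary segmentation on rows X[s:t].
    seg_test(segment) -> (z, k): a standardized test value and the size k
    (1 <= k <= len-1) of the left sub-segment at the estimated split."""
    t = X.shape[0] if t is None else t
    found = [] if found is None else found
    if t - s < min_len:
        return found
    z, k = seg_test(X[s:t])
    if z > z_alpha:
        found.append(s + k)                       # estimated change point
        binary_segmentation(X, seg_test, z_alpha, s, s + k, found, min_len)
        binary_segmentation(X, seg_test, z_alpha, s + k, t, found, min_len)
    return sorted(found)

def toy_seg_test(seg):
    # stand-in: naive mean-change statistic over all splits, crudely standardized
    n = seg.shape[0]
    stats = np.array([t * (n - t) / n**2 *
                      np.sum((seg[:t].mean(0) - seg[t:].mean(0)) ** 2)
                      for t in range(1, n)])
    z = (stats.max() - stats.mean()) / (stats.std() + 1e-12)
    return z, int(stats.argmax()) + 1

rng = np.random.default_rng(1)
X = rng.standard_normal((150, 50))
X[40:, :10] += 1.0                                # one sparse mean shift at t = 40
print(binary_segmentation(X, toy_seg_test, z_alpha=3.0))
```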

The binary segmentation becomes inefficient under the configuration that a short segment with piecewise constant signal is buried in the middle of the sequence. The obstacle can be overcome by combining our single change-point testing and estimation with the wild binary segmentation proposed by Fryzlewicz (2014). The algorithm is implemented as follows (see the sketch after these steps).

• For $s, t \in \{1, \cdots, n\}$ satisfying $s < t$, the intervals $[s_m, t_m]$ for $m = 1, \cdots, H$ are randomly generated with replacement from $\{s, \ldots, t\}$, where the choice of H is given in Fryzlewicz (2014).

• Among all $[s_m, t_m]$ for $m = 1, \cdots, H$, let
$$(m^*, \hat\tau) = \arg\max_{m\in\{1,\cdots,H\},\,l\in\{s_m,\cdots,t_m-1\}}\hat L_l[s_m, t_m],$$
and let $[s_{m^*}, t_{m^*}]$ be the corresponding interval. If
$$\frac{\hat L[s_{m^*}, t_{m^*}]}{\hat\sigma_{n,0}[s_{m^*}, t_{m^*}]} > \zeta_n,$$
where, similar to Fryzlewicz (2014), the threshold $\zeta_n = \sqrt{2\log n}$, then $\hat\tau$ is an identified change point and $[s,t]$ is partitioned into $[s,\hat\tau]$ and $[\hat\tau+1, t]$. Otherwise, the algorithm stops.

• The algorithm starts with s = 1 and t = n, and repeats the above steps recursively until no more change point can be identified.
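The wild variant admits the same schematic treatment: draw H random sub-intervals, take the maximizing interval and split, compare against $\zeta_n = \sqrt{2\log n}$, and recurse. `seg_test` is the same placeholder as in the binary segmentation sketch; H and the minimum segment length are illustrative defaults.

```python
import numpy as np

def wild_binary_segmentation(X, seg_test, H=1000, zeta=None, s=0, t=None,
                             found=None, min_len=4, rng=None):
    """Schematic WBS: among H random sub-intervals of [s, t), pick the
    interval/split maximizing the statistic, test it against zeta, recurse."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    t = n if t is None else t
    found = [] if found is None else found
    zeta = np.sqrt(2 * np.log(n)) if zeta is None else zeta
    if t - s < min_len:
        return found
    best = None
    for _ in range(H):
        a, b = sorted(rng.integers(s, t, size=2))   # random interval [a, b)
        if b - a < min_len:
            continue
        z, k = seg_test(X[a:b])
        if best is None or z > best[0]:
            best = (z, a + k)
    if best is not None and best[0] > zeta:
        tau = best[1]
        found.append(tau)
        wild_binary_segmentation(X, seg_test, H, zeta, s, tau, found, min_len, rng)
        wild_binary_segmentation(X, seg_test, H, zeta, tau, t, found, min_len, rng)
    return sorted(found)
```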

2.4 Power Enhancement Test for Change Point

When the change occurs in a small number of components of the population mean, the $L_2$-norm statistic $\hat L_t$ given by (2.4) includes too many components that contribute no signal but noise. Moreover, when there are only a few change points in the sequence of data, $\hat L$ given by (2.12) sums over too many time points at which the corresponding $\hat L_t$'s are not significantly different from zero. As a result, the test based on $\hat L$ may not be powerful for testing the existence of change points.

To tackle such a problem, we consider adding a power enhancement statistic $L_0$ to the original $L_1 \equiv \hat L/\hat\sigma_{n,0}$. The new test statistic $L \equiv L_1 + L_0$ should be more powerful than the original $L_1$ under the alternative hypothesis, but still possess the correct empirical size under the null hypothesis.

The power enhancement statistic $L_0$ is defined as
$$L_0 = \sqrt{p}\sum_{t\in\hat S_\alpha}\sum_{j=1}^{p}\hat\delta_{t,j}^2\,I\Big(\frac{|\hat\delta_{t,j}|}{\sqrt{\hat\sigma_{t,j}^2}} \ge \sqrt{\log(\log n)\log p}\Big), \quad \text{where}$$
$$\hat S_\alpha = \Big\{t : \frac{\hat L_t}{\sqrt{2}\,\hat\sigma_{nt,0}} \ge \sqrt{2\log n - \log(\log n) + \log\log(n^{1/\pi})}\Big\}. \qquad (2.14)$$

In (2.14),

$$\hat\delta_{t,j} = \bar X_{\le t,j} - \bar X_{>t,j} = \frac{1}{t}\sum_{i_1=1}^{t}X_{i_1,j} - \frac{1}{n-t}\sum_{i_2=t+1}^{n}X_{i_2,j}, \qquad (2.15)$$
which is an unbiased estimator of
$$\delta_{t,j} = \bar\mu_{\le t,j} - \bar\mu_{>t,j} = \frac{1}{t}\sum_{i_1=1}^{t}\mu_{i_1,j} - \frac{1}{n-t}\sum_{i_2=t+1}^{n}\mu_{i_2,j}, \qquad (2.16)$$

where $t \in \{1, \ldots, n-1\}$ and $j \in \{1, \ldots, p\}$. Also in (2.14),
$$\hat\sigma_{t,j}^2 = \frac{n}{t(n-t)}f_t^T F_{n,M}^{-1}\hat V_j, \qquad (2.17)$$
where $f_t$ and $F_{n,M}$ are given in (2.5) and (2.6), respectively, and $\hat V_j$ is the $j$th margin of $\hat V$ in (2.7), such that $\hat V = \sum_{j=1}^{p}\hat V_j$, and for $h = 0, \ldots, M$,
$$\hat V_j = \{\widehat{C(0)}_j, \ldots, \widehat{C(M)}_j\}^T \quad \text{with} \quad \widehat{C(h)}_j = \frac{1}{n}\sum_{i=1}^{n-h}(X_{i,j} - \bar X_j)(X_{i+h,j} - \bar X_j).$$

As shown in Proposition 2.1 of Section 2.1, under the null hypothesis, $\hat\sigma_{t,j}^2$ is an unbiased estimator of
$$\sigma_{t,j}^2 \equiv \mathrm{Var}(\hat\delta_{t,j}) = \frac{n}{t(n-t)}f_t^T\{C(0)_j, \ldots, C(M)_j\}^T, \qquad (2.18)$$
where $C(h)_j$ is the $j$th diagonal element of $C(h)$, with $h = 0, \ldots, M$.

In $\hat S_\alpha$, $\hat L_t$ and $\hat\sigma_{nt,0}^2$ are given by (2.4) and (2.11), respectively. Through the threshold level $\sqrt{2\log n - \log(\log n) + \log\log(n^{1/\pi})}$, the set $\hat S_\alpha$ is expected to contain the true change points and the time points where the signal-to-noise ratio $L_t/\sigma_{nt}$ is significantly different from zero. The power can be further improved by imposing another threshold level $\sqrt{\log(\log n)\log p}$ in $L_0$, which excludes those non-signal-bearing dimensions. A similar screening statistic was also considered in Fan et al. (2015) and Yang and Pan (2017).
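Given the screened quantities above, $L_0$ is a direct double sum. The sketch below assumes its inputs (the standardized $\hat L_t/\hat\sigma_{nt,0}$ path and the matrices of $\hat\delta_{t,j}$ and $\hat\sigma_{t,j}$) have already been computed; the function itself just applies the two thresholds in (2.14).

```python
import numpy as np

def power_enhancement_L0(L_std, delta_hat, sigma_hat, n, p):
    """L0 of (2.14). Row t-1 of each array corresponds to time point t:
    L_std[t-1]        = Lhat_t / sigma_hat_{nt,0}
    delta_hat[t-1, j] = (2.15);  sigma_hat[t-1, j] = sqrt of (2.17)."""
    lam1 = np.sqrt(2 * np.log(n) - np.log(np.log(n))
                   + np.log(np.log(n ** (1 / np.pi))))
    lam2 = np.sqrt(np.log(np.log(n)) * np.log(p))
    in_S = (L_std / np.sqrt(2.0)) >= lam1            # screened set S_alpha-hat
    keep = np.abs(delta_hat) / sigma_hat >= lam2     # signal-bearing coordinates
    return np.sqrt(p) * np.sum(delta_hat**2 * keep * in_S[:, None])
```

The final test is then $L = L_1 + L_0$ with $L_1 = \hat L/\hat\sigma_{n,0}$, rejecting when $L > z_\alpha$.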

The power enhancement properties of $L_0$ are shown under the following conditions.

(C3). As $n \to \infty$, $\log p = o(n^{1/3})$.

(C4). For $X_i = \mu_i + W_i$, $i = 1, \ldots, n$, there exists a positive constant $H$ such that for $h_0 \in [-H, H]^2$, $E\{e^{h_0^T[(W_i^{(k)})^2,\,(W_i^{(l)})^2]^T}\} < \infty$ for $k \ne l$.

The condition (C3) specifies the growth rate of dimension p relative to n under which the large deviation results can be applied. The condition (C4) assumes a bivariate sub-Gaussian distribution of $(X_{i,k}, X_{i,l})$, which is more general than the Gaussian distribution. Such conditions are also considered in Chen et al. (2019) to build a thresholding statistic.

Let $\lambda_1 \equiv \sqrt{2\log n - \log(\log n) + \log\log(n^{1/\pi})}$, $\lambda_2 \equiv \sqrt{\log(\log n)\log p}$ and $S_\alpha \equiv \{t : |L_t|/\sigma_{nt} \ge 2\lambda_1\}$. Similar to Fan et al. (2015), the asymptotic properties of $L = L_1 + L_0$ are shown in the following theorem.

Theorem 2.5. Assume (C1)-(C4).

(i) Under $H_0$, as $n \to \infty$, $P(L_0 = 0 \mid H_0) \to 1$, and
$$L = L_1 + L_0 \stackrel{d}{\longrightarrow} N(0, 1).$$

(ii) Under $H_\alpha$, let $\sigma_{t,j}^{*2} \equiv E(\hat\sigma_{t,j}^2)$. If $\max_t |L_t|/\sigma_{nt} \ge 2\lambda_1$ and $\max_{t\in S_\alpha}\max_j |\delta_{t,j}|/\sigma_{t,j}^* \ge 2\lambda_2$, then as $n \to \infty$,
$$P(L > z_\alpha \mid H_\alpha) \to 1,$$
where $z_\alpha$ is the upper-$\alpha$ quantile of N(0, 1).

Based on Theorem 2.5, we reject the null hypothesis of (2.1) at the nominal significance level $\alpha$ if $L > z_\alpha$. The empirical performance of the power enhancement testing procedure and its comparison with $L_1$ are given in Section 2.6.3.

2.5 Elbow Method for Dependence

In previous sections, we have introduced our procedure for testing and estimating one or multiple change points. The procedure relies on the choice of M, which is unknown in practice. According to (C1), M separates dominant temporal dependence from the remainder. As demonstrated in the simulation studies of Section 2.6, if data are dependent (M ≠ 0), wrongly applying the procedure based on the assumption that M = 0 can cause severe type I error and thus produce a lot of false positives when estimating locations of change points. On the other hand, choosing a value that is larger than the actual M will reduce the power of the test and thus generate many false negatives.

In this section, we propose a quite simple way to determine M.

Suppose that $\{X_i, 1 \le i \le n\}$ satisfies the condition (C1) for dependence. Then $\mathrm{Cov}(X_i, X_j) = C(i - j)$ is relatively small if $|i - j| > M$, or equivalently, $\mathrm{tr}\{C(h)C^T(h)\}$ is small if $|h| > M$. The unknown $\mathrm{tr}\{C(h)C^T(h)\}$ can be consistently estimated by (2.10) under the null hypothesis according to the proof of Theorem 2.2. Even under the alternative hypothesis, the effect of heterogeneity of the means $\mu_i$ on the estimation is of small order as long as the heterogeneity is not too strong. Inspired by the automatic bandwidth selection procedure in Horváth, Rice and Whipple (2016), we determine M by calculating (2.10) for each integer starting from 0, and terminate the process once a small value appears. Visually, we can plot $\widehat{\mathrm{tr}\{C(h)C^T(h)\}}$ versus h, and choose the elbow in the plot for M.
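As a quick illustration of the elbow idea, the sketch below replaces the bias-corrected estimator (2.10) with a simple sample-autocovariance plug-in for $\mathrm{tr}\{C(h)C^T(h)\}$ and declares the elbow at the first lag where the curve collapses; the flatness threshold `drop` is an ad hoc device of ours, not part of the procedure.

```python
import numpy as np

def elbow_M(X, h_max=6, drop=0.2):
    """Estimate M by the elbow of h -> tr{C(h) C(h)^T}-hat (plug-in version)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    tr = []
    for h in range(h_max + 1):
        Ch = Xc[:n - h].T @ Xc[h:] / n       # sample autocovariance C(h)-hat
        tr.append(np.sum(Ch * Ch))           # tr{C(h) C(h)^T} = sum of squared entries
    tr = np.array(tr)
    flat = np.where(tr < drop * tr[0])[0]    # lags where the curve has collapsed
    return int(flat[0]) - 1 if flat.size else h_max   # elbow at h = M + 1

# For a temporally independent sequence the elbow appears at h = 1, giving M = 0.
```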

To demonstrate the idea above, we generated the random sample $\{X_i\}_{i=1}^{n}$ from (2.19) in Section 2.6 with n = 150 and p = 600, and considered the actual M = 0 and 2, respectively. Figure 2 illustrates $\widehat{\mathrm{tr}\{C(h)C^T(h)\}}$ versus h under both null and alternative hypotheses based on 50 iterations. When the actual M = 0, the elbow appeared at h = 1, which suggested estimating M by 0. Similarly, when the actual M = 2, the elbow appeared at h = 3, which suggested estimating M by 2.

Figure 2: The elbow method for choosing M for dependence under both null and alternative hypotheses. The results were obtained based on 50 replications.

2.6 Numerical Studies

In this section, we present simulation results about the empirical performance of the proposed methods. The random samples $\{X_i\}_{i=1}^{n}$ were generated from the following multivariate linear process
$$X_i = \mu_i + \sum_{l=0}^{M+2} Q_l\,\epsilon_{i-l}, \qquad (2.19)$$
where $\mu_i$ is the p-dimensional population mean vector at point i, $Q_l$ is a $p \times p$ matrix for $l = 0, \cdots, M+2$, and $\epsilon_i$ is a p-variate random vector with mean 0 and identity covariance $I_p$. In the simulation, we set $Q_l = \{0.6^{|i-j|}(M-l+1)^{-1}\}$ for $i, j = 1, \cdots, p$ and $l = 0, \cdots, M$. For $Q_{M+1}$ and $Q_{M+2}$, we considered two different scenarios. If M = 0, we simply chose $Q_{M+1} = Q_{M+2} = 0$ such that $\{X_i\}_{i=1}^{n}$ became an independent sequence. If $M \ne 0$, we chose $Q_{M+1} = Q_{M+2}$, and each row of them had only 0.05p non-zero elements that were randomly chosen from $\{1, \cdots, p\}$ with magnitude generated by Unif(0, 0.05). By doing so, we modeled the temporal dependence dominated by $Q_l$ for $l = 0, \cdots, M$ plus perturbations contributed by $Q_{M+1}$ and $Q_{M+2}$.
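For reproducibility of the design just described, here is a sketch generating data from (2.19) under the stated choices of $Q_l$; the Gaussian innovations and the 0-based indexing convention (`tau` = number of pre-change observations) are our assumptions.

```python
import numpy as np

def generate_2_19(n, p, M, delta=0.0, tau=None, seed=0):
    """Data from the linear process (2.19): Q_l = {0.6^{|i-j|}(M-l+1)^{-1}}
    for l <= M, sparse random perturbations Q_{M+1} = Q_{M+2} when M != 0,
    and (under the alternative) a mean shift of magnitude delta on
    [p^0.7] random coordinates after tau."""
    rng = np.random.default_rng(seed)
    absdiff = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    Q = [0.6 ** absdiff / (M - l + 1) for l in range(M + 1)]
    if M == 0:
        Q += [np.zeros((p, p)), np.zeros((p, p))]
    else:
        Qp = np.zeros((p, p))
        k = max(1, int(0.05 * p))                # 0.05p non-zeros per row
        for row in range(p):
            cols = rng.choice(p, size=k, replace=False)
            Qp[row, cols] = rng.uniform(0, 0.05, size=k)
        Q += [Qp, Qp]                            # Q_{M+1} = Q_{M+2}
    eps = rng.standard_normal((n + M + 2, p))    # Gaussian innovations (assumed)
    X = np.stack([sum(Q[l] @ eps[i + M + 2 - l] for l in range(M + 3))
                  for i in range(n)])
    if delta and tau is not None:                # alternative: sparse mean shift
        mu = np.zeros(p)
        coords = rng.choice(p, size=int(p ** 0.7), replace=False)
        mu[coords] = delta * rng.choice([-1.0, 1.0], size=coords.size)
        X[tau:] += mu
    return X
```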

2.6.1 Empirical performance of the proposed testing procedure

It is crucial to correctly test the existence of any change point, as it is the first step of the proposed change-point detection and identification procedure. The first part of the simulation studies is to investigate the empirical performance of the test statistic $\hat L$, whose asymptotic normality was established in Theorem 2.3. Without loss of generality, we chose $\mu_i = 0$ for $i = 1, \cdots, n$ under $H_0$ of (2.1). Under the alternative hypothesis, we considered one change point $\tau \in \{1, \cdots, n-1\}$ such that $\mu_i = 0$ for $i \le \tau$ and $\mu_i = \mu$ for $\tau + 1 \le i \le n$. The non-zero mean vector $\mu$ had $[p^{0.7}]$ non-zero components that were uniformly and randomly drawn from the p coordinates $\{1, \cdots, p\}$. Here, $[a]$ denotes the integer part of a. The magnitude of each non-zero entry of $\mu$ was controlled by a constant $\delta$ multiplied by a random sign. The nominal significance level was chosen to be 0.05. All the simulation results were obtained based on 1000 replications.

To show the performance of the proposed testing procedure, we also considered two competitors. One is the E-div test proposed by Matteson and James (2014). The E-div test obtains an approximate p-value to determine the statistical significance of a change point by performing random permutations. The testing procedure assumes independence of $\{X_i\}_{i=1}^{n}$. The other one is the CQ test by Chen and Qin (2010). The CQ test statistic is a linear combination of U-statistics and follows the asymptotic normality under the independence of $\{X_i\}_{i=1}^{n}$. Note that the CQ test was originally designed for the two-sample problem, requiring the change point to be known and, based on it, the whole sequence to be divided into two samples. To implement the CQ test, we used

21 Table 1: Empirical sizes and powers of the CQ, the E-div and the proposed tests based on 1000 replications with Gaussian ϵi in (2.19).

Size

n = 100 150 200

M method p = 200 600 1000 200 600 1000 200 600 1000

CQ 0.066 0.059 0.066 0.055 0.071 0.054 0.051 0.062 0.040

0 E-div 0.069 0.050 0.055 0.042 0.047 0.045 0.038 0.071 0.041

New 0.056 0.055 0.051 0.052 0.067 0.060 0.052 0.062 0.040

CQ 0.859 0.999 1.000 0.875 0.999 1.000 0.881 0.999 1.000

1 E-div 0.989 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000

New 0.039 0.044 0.053 0.059 0.046 0.050 0.050 0.044 0.054

CQ 0.990 1.000 1.000 0.991 1.000 1.000 0.995 1.000 1.000

2 E-div 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000

New 0.054 0.034 0.043 0.048 0.054 0.039 0.047 0.054 0.043

Power

CQ 0.274 0.318 0.401 0.407 0.524 0.601 0.558 0.764 0.826

0 E-div 0.107 0.147 0.164 0.142 0.195 0.224 0.146 0.253 0.326

New 0.190 0.193 0.234 0.273 0.304 0.366 0.327 0.508 0.555

the true change point to divide the sequence into two samples under the alternative hypothesis.

Under the null case, in order to obtain its size, we obtained the two samples by adopting the same time point used for the alternative.

Table 1 demonstrates the empirical sizes and powers of three tests with Gaussian ϵi in (2.19). Under the alternative hypothesis, we chose the location of the change point τ = 0.4n and magnitude

δ = 0.3. Under independent data (i.e., M = 0), the sizes of all three tests were well controlled around the nominal significance level 0.05. However, for both the dependent cases with M = 1 and 2, the CQ and E-div tests suffered severe size distortion. Different from those two tests, the proposed test still had well-controlled sizes around the nominal significance level 0.05. Due to


Figure 3: Empirical powers of the proposed testing procedure with δ = 1 based on 1000 replicates under different n, p, M and locations of the change point.

severe size distortion of the CQ and E-div tests under data dependence, we only showed power comparison for the case M = 0. Empirical powers of all three tests increased as n and p increased.

The reason the CQ test had the best power among the three is that it utilized the information of the known location of the change point. In a real application, such information is unavailable. The proposed test always enjoyed greater powers than the E-div test with respect to different n and p.

Under dependence, we studied the effect of n, p and especially location of the change point on

the power of the proposed test. Figure 3 shows that with δ = 1 and for different locations of the change point (τ = 0.1n, 0.2n, 0.5n), the empirical powers increased as n and p increased. However, the closer the change point was to the center, the greater the power of the test. Moreover, our results also showed that the powers of the test under weaker dependence (e.g., M = 1) were greater than those under stronger dependence (e.g., M = 2), indicating the adverse effect of data dependence on the change point detection.

2.6.2 Empirical performance of change point estimates

The second part of the simulation studies aims to investigate the empirical performance of the change point estimator $\hat\tau$ in (2.13). We first considered the situation with one change point $\tau \in \{1, \cdots, n-1\}$ such that $\mu_i = 0$ for $i \le \tau$ and $\mu_i = \mu$ for $\tau + 1 \le i \le n$. The non-zero mean vector $\mu$ had $[p^{0.7}]$ non-zero components, which were uniformly and randomly drawn from the p coordinates $\{1, \cdots, p\}$. The magnitude of each non-zero entry of $\mu$ was controlled by a constant $\delta$ multiplied by a random sign. Figures 4 and 5 demonstrate the proportion of the 1000 iterations detecting the change point located at the time points 40 and 2, respectively. In both figures, the probability of detecting the change point increased as the dimension p or $\delta$ increased. Comparing the upper panel (M = 2) with the lower panel (M = 0, that is, independence) in each figure, the probability of detecting the change point became lower as the dependence became stronger. Also, a stronger signal was needed in Figure 5 to retain a detection probability similar to that in Figure 4, where the change point was located at the center. The results in Figures 4 and 5 are in excellent agreement with the theoretical analysis of the change point estimator $\hat\tau$ in Theorem 2.4.

To demonstrate the performance of the proposed binary segmentation method for multiple change-point detection, we chose n = 150 and considered three change points at 15, 75 and 105, respectively. In particular, for 1 ≤ i ≤ 15, µi = 0. For 16 ≤ i ≤ 75, the non-zero entry of µi was controlled by a constant δ1. For 76 ≤ i ≤ 105, µi = 0 and for 106 ≤ i ≤ 150, the non-zero entry of

µi was controlled by another constant δ2. We compared our method with E-div method in terms of false positives (FP), false negatives (FN), and true positives (TP). The FP is the number of time points that are wrongly estimated as change points. The FN is the number of change points that are wrongly treated as time points without change. And TP is the total number of identified change points. A procedure is better if it has smaller FP and FN, but TP is close to 3 which is the

24 -i rcdr uee eeeF o l ae lhuhi a agrT.Dffrn rmteE-div the from Different TP. larger had it although cases all for FP severe suffered procedure E-div independence as Under increased points. change two three of the performance estimating the (i.e., when demonstrates iterations 2 Table 1000 on design. based our methods on based points change of number total each for with iterations dependent 1000 are on when data based 40 obtained at are point probabilities change The a respectively. detecting 600, of probability The 4: Figure

M Probability to detect change point Probability to detect change point

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 ) h w ehd a iia efrac ihbt PadF erae u TP but decreased FN and FP both with performance similar had methods two the 0), = 37 37 ● ● p n/r( and/or 38 38 ● ● δ δ 39 39 ● ● = = δ 1 1

1

M δ ,

n n = = 40 40 2 n n 100 100 ● ● nrae.O h te ad ne eedne(e.g., dependence under hand, other the On increased. ) .Lwrpnl aaaeidpnetwith independent are data panel: Lower 2. =

M M = = 41 41 ● ● 2 0 ● 42 42 ● ● p p p = = = 600 300 100 43 43 ● ● 25

Probability to detect change point Probability to detect change point

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 37 37 ● ● 38 38 ● ● δ δ 39 39 = = ● ● n 1.5 1.5

0 and 100 =

n n = = 40 40 n n ● ● 100 100 M

M M 0. = 41 41 = = ● ● 0 2 p p 100 = 42 42 pe panel: Upper . ● ● M ) the 2), = 43 43 , ● ● 0 and 300 epciey npriua,fr1 for particular, In respectively. choose we mentation, rcdr,tepooe ehdawy a PadF ne oto.Ms motnl,similar importantly, Most control. under of FN case and the FP to had always method proposed the procedure, each for with iterations dependent 1000 when are on 2 data based at obtained are point probabilities change The a respectively. detecting 600, of probability The 5: Figure ocmaetepromneo h rpsdwl iaysgetto ihtebnr seg- binary the with segmentation binary wild proposed the of performance the compare To

Probability to detect change point Probability to detect change point

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 1 1 ● ● M 2 2 ● ● ,i noe mle PadF u agrT as TP larger but FN and FP smaller enjoyed it 0, = δ δ 3 3 ● ● = = n 5 5

150, =

M

n n = = n 4 n 4 100 100 ● ● .Lwrpnl aaaeidpnetwith independent are data panel: Lower 2. =

M M = = p 5 5 ● ● 2 0 ≤ 0 n 0,adcnie w hnepit t7 n 80, and 70 at points change two consider and 600, and 200 = i ● ≤ 6 6 ● ● p p p 0ad81 and 70 = = = 600 300 100 7 7 ● ● 26

Probability to detect change point Probability to detect change point ≤ 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 i 1 1 ● ● ≤ 150, 2 2 ● ● µ i δ δ .Fr71 For 0. = 3 3 = = ● ● n 6.5 6.5 0 and 100 =

p

n n = = n/r( and/or n 4 n 4 ● ● 100 100 M

M M 0. = = = 5 5 ● ● 0 2 ≤ δ p 1 p i δ , 100 = pe panel: Upper . 6 6 ● ● ≤ 2 increased. ) 80, , 7 7 ● ● 0 and 300 µ i 0, ̸= Table 2: The performance of the proposed binary segmentation and E-div method for estimating multiple change points. The average FP, FN, TP and corresponding standard deviations (subscript numbers) are obtained based on 1000 replications.

N = 200 600 1000

M New E-div New E-div New E-div

(δ1, δ2) = (1.5, 1.5)

FP 0.2100.454 0.1490.396 0.0780.279 0.0830.300 0.0330.190 0.0570.240

0 FN 0.1530.390 0.0940.318 0.0470.212 0.0220.147 0.0220.147 0.0060.077

TP 2.8470.390 2.9060.318 2.9530.212 2.9780.147 2.9770.146 2.9940.077

FP 0.5020.612 18.6192.864 0.2740.495 20.3361.434 0.2060.436 20.1871.360

2 FN 2.1620.648 1.0940.877 1.8890.641 0.5990.735 1.7060.751 0.4210.647

TP 0.8380.648 1.9060.877 1.1110.641 2.4010.735 1.2940.751 2.5790.647

(δ1, δ2) = (2, 2)

FP 0.0270.168 0.0580.258 0.0040.063 0.0510.254 0.0010.032 0.0520.244

0 FN 0.0080.089 0.0030.055 0.0010.032 0.0010.032 00 00

TP 2.9920.089 2.9970.055 2.9990.032 2.9990.032 30 30

FP 0.1550.373 17.6442.805 0.0430.203 19.7851.215 0.0280.165 19.8241.179

2 FN 1.4010.846 0.2340.471 1.1110.895 0.0770.285 0.9170.888 0.0320.176

TP 1.5990.846 2.7660.471 1.8890.895 2.9230.285 2.0830.888 2.9680.176

the non-zero entry of which is controlled by a constant δ. From the design, we obtain a piecewise short segment with length only 10 buried in the long segment with length 150. As demonstrated in

Table 3, the wild binary segmentation performs much better than the binary segmentation when identifying the two change points.

27 Table 3: The performance of the proposed binary segmentation (BS) and wild binary segmentation

(WBS) for estimating multiple change points. The average FP, FN, TP and corresponding standard deviations (subscript numbers) are obtained based on 1000 replications. The number of randomly selected intervals for the wild binary segmentation is H = 1000.

δ = 2 3

M BS WBS BS WBS

p = 200

FP 0.1750.512 0.2570.528 0.2800.562 0.1210.372

0 FN 1.8380.527 0.2950.569 1.5260.827 00

TP 0.1620.527 1.7050.569 0.4740.827 20

FP 0.0590.271 0.2130.520 0.0330.200 0.1720.430

2 FN 1.9980.063 1.8450.419 1.9870.144 0.5870.799

TP 0.0020.063 0.1550.419 0.0130.144 1.4130.799 p = 600

FP 0.1540.448 0.1390.374 0.3400.550 0.0470.212

0 FN 1.8250.548 0.0610.282 1.2710.949 00

TP 0.1750.548 1.9390.282 0.7290.949 20

FP 0.0400.233 0.1320.362 0.0290.195 0.0370.189

2 FN 1.9960.077 1.6450.616 1.9760.208 0.1610.468

TP 0.0040.077 0.3550.616 0.0240.208 1.8390.468

2.6.3 Empirical performance of the power enhancement testing procedure

This part of the simulation studies is to investigate the empirical performance of the power en- hancement test statistic L = L1 + L0. The random sample {Xi} for i = 1, . . . , n, were generated from the following multivariate linear process

∑M Xi = µi + Qlϵil, l=0 where µi is the p-dimensional population mean vector at point i, Ql is a p × p matrix for l =

0,...,M, and ϵi is a p-variate random vector with mean 0 and identity covariance matrix Ip. The

28 Table 4: Empirical sizes of L1 and L based on 1000 replications with Gaussian ϵi.

M = 0

n = 100 n = 150 n = 200

p = 300 600 1000 300 600 1000 300 600 1000

L1 0.056 0.059 0.050 0.060 0.050 0.071 0.048 0.066 0.046 L 0.087 0.081 0.065 0.091 0.063 0.076 0.070 0.071 0.047

M = 1

n = 100 n = 150 n = 200

p = 300 600 1000 300 600 1000 300 600 1000

L1 0.054 0.046 0.059 0.043 0.055 0.048 0.047 0.061 0.048 L 0.078 0.055 0.065 0.050 0.061 0.053 0.053 0.068 0.051

M = 2

n = 100 n = 150 n = 200

p = 300 600 1000 300 600 1000 300 600 1000

L1 0.051 0.036 0.035 0.051 0.038 0.047 0.050 0.056 0.041 L 0.066 0.036 0.036 0.059 0.042 0.047 0.054 0.057 0.042

model was considered to account for the M-dependent stationary process such that Cov(Xi,Xj) = ∑ M T | − | ≤ l=|i−j| QlQl−|i−j|, if i j M and Cov(Xi,Xj) = 0 otherwise. In the simulation, we set |i−j| −1 Ql = {0.5 I(|i − j| < p/2)/(M − l + 1) }, for i, j = 1, . . . , p and l = 0, 1,...,M.

Without loss of generality, we chose µi = 0 for i = 1, . . . , n under H0. Under the alternative hypothesis, we considered one change-point τ ∈ {1, . . . , n − 1} such as µi = 0 for i ≤ τ and µi = µ for τ + 1 ≤ i ≤ n. The non-zero mean vector µ had [p1−β] non-zero entries which were uniformly and randomly drawn from p coordinates {1, . . . , p}. Here, β ∈ (0, 1), with β closer to 1, there are fewer non-zero entries on µi, and [a] denotes the integer part of a. The magnitude of non-zero entries of µi was controlled by a constant δ multiplied by a random sign. The nominal significance level was chosen to be 0.05. All the simulation results were obtained based on 1000 replications.

Table 4 demonstrates the empirical sizes of L1 and L with Gaussian ϵi. When the dimension

29 Table 5: Empirical powers of L1 and L with τ = 0.1n and β = 0.8, based on 1000 replications under different M and magnitude δ.

M = 0, δ = 1.5

n = 100 n = 150 n = 200

p = 300 600 1000 300 600 1000 300 600 1000

L1 0.098 0.078 0.079 0.143 0.108 0.103 0.163 0.135 0.103 L 0.189 0.149 0.112 0.247 0.171 0.175 0.306 0.256 0.180

M = 1, δ = 2.5

n = 100 n = 150 n = 200

p = 300 600 1000 300 600 1000 300 600 1000

L1 0.087 0.073 0.062 0.128 0.125 0.099 0.182 0.156 0.128 L 0.155 0.132 0.092 0.249 0.211 0.150 0.369 0.325 0.220

M = 2, δ = 3.5

n = 100 n = 150 n = 200

p = 300 600 1000 300 600 1000 300 600 1000

L1 0.101 0.094 0.064 0.162 0.131 0.119 0.221 0.202 0.161 L 0.198 0.153 0.099 0.345 0.293 0.196 0.554 0.473 0.300

is low, the power enhancement test has a little size distortion. However, the size distortion is alleviated as p, n increase. This is reasonable since the power enhancement statistic converges to zero in probability under the null hypothesis as p, n tend to infinity. Under the alternative hypothesis, we considered two different scenarios. In the first scenario, we chose β = 0.8, which is a very sparse situation, and τ = 0.1n. When p = 300, there are only [3000.2] = 3 non-zero elements on the mean vector µi. The magnitude δ = 1.5, 2.5 and 3.5 respectively, for M = 0, 1, and 2. Table

5 shows that, L is always more powerful than L1 for different p, n and M. In the second scenario, we considered that change point τ = 0.02n, which means it is close to the boundary. We chose

β = 0.3, and the magnitude δ = 1.5, 2, and 2.5 respectively for M = 0, 1, and 2. Table 6 shows that the empirical power is enhanced significantly by the power enhancement statistic L0 compared

30 Table 6: Empirical powers of L1 and L with τ = 0.02n and β = 0.3, based on 1000 replications under different M and magnitude δ.

M = 0, δ = 1.5

n = 100 n = 150 n = 200

p = 300 600 1000 300 600 1000 300 600 1000

L1 0.114 0.105 0.132 0.147 0.133 0.167 0.194 0.216 0.242 L 0.566 0.527 0.579 0.818 0.817 0.789 0.948 0.942 0.932

M = 1, δ = 2

n = 100 n = 150 n = 200

p = 300 600 1000 300 600 1000 300 600 1000

L1 0.080 0.084 0.093 0.092 0.127 0.103 0.135 0.152 0.169 L 0.477 0.484 0.482 0.652 0.638 0.555 0.765 0.718 0.690

M = 2, δ = 2.5

n = 100 n = 150 n = 200

p = 300 600 1000 300 600 1000 300 600 1000

L1 0.065 0.056 0.074 0.103 0.108 0.111 0.118 0.131 0.146 L 0.810 0.755 0.780 0.818 0.766 0.720 0.868 0.816 0.775

with L1.

2.7 Application

We consider two datasets for applications: one is fMRI data and another is environmental data.

Before implementing the proposed methods, we need to validate the model (2.2) and associated assumptions. For the fMRI data analysis, the commonly used model is the general linear model

(GLM). Despite its usefulness, the GLM suffers some shortcomings (Robinson, Wager and Lindquist

2010): the is sensitive to the choice of the hemodynamic response function

(hrf) in modeling the BOLD signal; it requires the timing of neural activity onset and duration to be known, but in many psychological processes such information cannot be specified a priori; it can only be applied to a single voxel at a time and thus does not incorporate the dependence

31 Table 7: ROIs activated by the thin-body images and fat-body images for the normal weight subject and the overweight subject, respectively.

Normal weight subject 7 Overweight subject 1

left ACC, right ACC,

right DLPFC, right EBA, right FBA, Thin-body images left insula, left IPL, left insula, right insula.

right IPL, right MPFC.

left DLPFC, right DLPFC, right amygdala, Fat-body images right insula, right IPL, left MPFC, right MPFC. right MPFC.

among all voxels in the brain. The model (2.2) shares a similar structure but overcomes the above shortcomings of the GLM. Moreover, compared with other change-point analysis for fMRI data

(Lindquist, Waugh and Wager 2007; Lindquist 2008; Robinson, Wager and Lindquist 2010; Aston and Kirch 2012), the assumptions in the model (2.2) are very mild because we allow general spatial and temporal dependence and only require the fourth moments of Z are finite. Similar arguments can be applied to validate the implementation of the proposed methods for the second data analysis.

2.7.1 Application to fMRI data

Modern culture emphasizing thinness plays a vigorous role in shaping body image and social com- parison (Schwartz and Brownell 2004). Obesity is becoming a worldwide problem and high body image distress is one of the most common consequences, which could, in turn, lead to additional physical and psychological diseases (Gavin, Simon, and Ludman 2010). However, little is known about its brain mechanisms (Gao et al. 2016).

Southwest University, China conducted an fMRI experiment to exam the differences in brain activation between overweight and normal weight subjects when performing a body image self- reflection task. In the task, participants were instructed to view several fat and thin body images closely, and vividly imagine that someone was comparing her body to the body in the picture. The experiment comprised six blocks of the fat body condition and six blocks of the thin body condition,

32 and each block consisted of seven images. During the experiment, the brain of each participant was scanned every 2 seconds and a total of 280 images were taken, and each of image consisted of 131,072 voxels. Hence, for each subject, the high-dimensional time course data have p = 131, 072 and n =

280. The recorded fMRI data are publicly available at https://openfmri.org/dataset/ds000213/.

Our change point method can be applied to each individual to see how they react. We randomly picked normal weight subject 7 and overweight subject 1. Based on Gao et al.(2016), we used the MNI coordinates to partition all voxels into 16 distinct regions of interest (ROIs). Different

ROIs have different functions. For example, previous studies found that inferior parietal lobule

(IPL), extrastriate body area (EBA, lateral occipitotemporal cortex) and fusiformbody area (FBA) were related to perceptive processing of body image; Dorsolateral prefrontal cortex (DLPFC) and amygdala were related to affective processing of body image and can be activated when viewing body pictures with negative emotional valence; Medial prefrontal cortex (MPFC) was related with self-reflection; and ACC and insula were related to body dissatisfaction (Wagner et al. 2003; Uher et al. 2005; Kurosaki et al. 2006; Friederich et al. 2007; Miyake et al. 2010; Friederich et al. 2010;

Yang et al. 2014).

We applied the proposed method of change point detection to 16 ROIs with the nominal sig- nificance level α = 0.05 and the M dependence estimated by the elbow method. If there exists a change, indicating an ROI was activated. Table 7 lists the names and Figure 6 illustrates the physical locations of the ROIs activated (changes detected) by the thin-body images and fat-body images for the normal weight subject and the overweight subject. More specifically, 5 ROIs were activated for the overweight subjects when viewing fat-body images, while only 3 ROIs were acti- vated when viewing thin-body images. On the other hand, for the normal weight subjects, 8 ROIs were activated when viewing thin-body images, while 3 for fat-body images.

Our results indicate that the overweight subject showed stronger visual processing of fat body images than thin body images, whereas the normal weight subject showed stronger visual processing of thin body images than fat body images. Interestingly, we found that ACC was only activated for the normal weight women when viewing thin body images. Such a result was consistent with the finding of Friederich et al. (2007), that healthy women body dissatisfaction and self-ideal discrepancies can be greatly induced by exposure to attractive slim bodies of other women. This may be one reason that normal weight women are more motivated to watch their weight and keep

33 (a) Normal weight, thin-body images (c) Overweight, thin-body images

(b) Normal weight, fat-body images (d) Overweight, fat-body images

Figure 6: Activation map for ROIs. In panel (a): left insula (yellow); right EBA (cyan); left

ACC (darkblue); right ACC (blue); right MPFC (darkred); left IPL (orange); right IPL (red); right DLPFC (deepskyblue). In panel (b): right amygdala (darkblue); left MPFC (lightgreen); right MPFC (darkred). In panel (c): right FBA (darkblue); left insula (lightgreen); right insula

(darkred). In panel (d): right insula (lightgreen); right MPFC (darkred); left DLPFC (darkblue); right DLPFC (deepskyblue); right IPL (orange). in shape than overweight women.

2.7.2 Application to environmental data

Air pollution poses a significant risk to human health, and we are all strongest ally in the battle against air pollution. In this section, we applied the proposed change point detection method to two-year measurement of PM2.5 at 36 monitoring stations in Beijing, China from Jan. 1st, 2014 to

Dec. 31, 2015. To analyze the data, we used the nominal significance level α = 0.05. As PM2.5 may possess time persistency, we implemented the elbow method and found that M = 1 for dependence.

Under such choice, the proposed change point detection method identified the four change points at Feb. 12, 2014; Feb. 26, 2014; Apr. 11, 2015 and Nov. 26, 2015.

We found that these four changes were corresponding to four big changes in PM2.5 level. More precisely, we looked into these four dates to find the reason why the changes occur. First, to

34 Color Key PM2.5

100 500 Value PM2.5 at 36 Stations

− 2014.02.12− 2014.02.26 − 2015.04.11 − 2015.11.26

Time

Figure 7: Heatmap of PM2.5 at 36 stations measured in Beijing, China from Jan. 1st, 2014 to Dec.

31, 2015.

celebrate the Lantern Festival on Feb. 14, 2014, China suffered heavy air pollution (PM2.5 > 250

µg/m3) because of fireworks since Feb. 13, 2014. Second, thanks to the heavy rain on the morning of Feb. 27, 2014, the heavy haze lasted for more than a week was finally dissipated. Third, the very heavy air pollution was substantially alleviated on Apr. 12, 2015 because of the windy weather.

Forth, due to the extremely adverse weather conditions (high humidity, low wind speed and strong temperature inversion), Beijing encountered “the most serious haze (PM2.5 > 400 µg/m3) in the year 2015” since Nov. 27, 2017. Figure 7 demonstrates the heatmap of PM2.5 with darker color representing a higher value of the measurement. Note that the change detection method is not to

find the locations with a higher value of measurement, but to identify the time points involving abrupt changes. The identified four times points well match the big changes of PM 2.5 level.

35 2.8 Technical Details

2.8.1 Proofs of main theorems

Proof of Proposition 2.1.

We first show that the mean of Lˆt is given by (2.8) in Section 2.1. Note that

¯ − ¯ T ¯ − ¯ ¯ T ¯ ¯ T ¯ − ¯ T ¯ (X≤t X>t) (X≤t X>t) = X≤tX≤t + X>tX>t 2X≤tX>t.

First, ∑t T 1 T T T T E(X¯ X¯≤ ) = (µ µ + µ ϵ + ϵ µ + ϵ ϵ ) ≤t t t2 i j i j i j i j i,j=1 ∑ T 1 2 =µ ¯ µ¯≤ + tr(Σ) + tr{Cov(ϵ , ϵ )} ≤t t t t2 j i i M)tr{C(h)}. (2.20) t2 h=M+1 Second, based on similar derivation, we have

tr(Σ) 2 ∑M E(X¯ T X¯ ) =µ ¯T µ¯ + + (n − t − h)I(h < n − t)tr{C(h)} >t >t >t >t n − t (n − t)2 h=1 − − 2 n∑t 1 + (n − t − h)I(n − t > M)tr{C(h)}. (2.21) (n − t)2 h=M+1 At last,

1 ∑t ∑n E(X¯ T X¯ ) =µ ¯T µ¯ + tr{Cov(ϵ , ϵ )} ≤t >t ≤t >t t(n − t) j i i=1 j=t+1 − 1 n∑1 =µ ¯T µ¯ + min(h, n − h, t, n − t)tr{C(h)} ≤t >t t(n − t) h=1 1 ∑M =µ ¯T µ¯ + min(h, n − h, t, n − t)tr{C(h)} ≤t >t t(n − t) h=1 − 1 n∑1 + min(h, n − h, t, n − t)tr{C(h)} (2.22) t(n − t) h=M+1 Combining (2.20), (2.21) and (2.22), we can derive { } T T E (X¯≤t − X¯>t) (X¯≤t − X¯>t) = (¯µ≤t − µ¯>t) (¯µ≤t − µ¯>t) + {∆b + Rb},

36 (2.23) where, the bias term is ( ) { 1 1 ∑M t − h n − t − h ∆ = + tr(Σ) + 2 I(t > h) + I(n − t > h) b t n − t t2 (n − t)2 }h=1 min(h, n − h, t, n − t) − tr{C(h)}. t(n − t) and the remainder term is

{ − − − 1 ∑t 1 1 n∑t 1 R = 2 (t − h)I(t > M) + (n − t − h)I(n − t > M) b t2 (n − t)2 h=M+1 h=M+1 − } − 1 n∑1 n∑1 − min(h, n − h, t, n − t) tr{C(h)} ≤ c |tr{C(h)}|/n t(n − t) 0 h=M+1 h=M+1 for some constant c0, which does not depend on p and n. T −1 ˆ ˆ Next, we evaluate E(ft Fn,M V ), where ft, Fn,M , and V are specified in (2.4), (2.5) and (2.6). To find E(Vˆ ), we need to evaluate E[tr{C[(h)}], where

− 1 n∑h tr{C[(h)} = (X − X¯)T (X − X¯) for h = 0, ··· ,M. n i i+h i=1

Let Xi = Yi + µi where Yi has mean 0. Then,

− − 1 n∑h 1 n∑h tr{C[(h)} = (Y − Y¯ )T (Y − Y¯ ) + (Y − Y¯ )T (µ − µ¯) n i i+h n i i+h i=1 i=1 − − 1 n∑h 1 n∑h + (µ − µ¯)T (Y − Y¯ ) + (µ − µ¯)T (µ − µ¯), n i i+h n i i+h i=1 i=1 where only the first term has non-zero expectation under the null hypothesis, which is

{ − } 1 n∑h E (Y − Y¯ )T (Y − Y¯ ) n i i+h i=1 [ { − − − }] 1 n∑h n − h ∑n ∑n 1 n∑h ∑n 1 ∑n n∑h = tr E Y Y T + Y Y T − Y Y T − Y Y T n i i+h n3 i j n2 i j n2 i j+h i=1 i=1 j=1 i=1 j=1 i=1 j=1 h h ∑M l 2 − I(l, 0) = (1 − )tr{C(h)} + (1 − ) (1 − ) tr{C(l)} n n n n l=0 − { } 1 ∑M ∑n n∑h − I(|i − j|, l) + I(|i − j + h|, l) tr{C(l)} + R (h) n2 v l=0 i=1 j=1

37 ∑ ∑ ∑ ∑ { n−1 − − { } 3 − n−1 n n−h | − | | − where Rv(h) = 2 l=M+1(n h)(n l)tr C(l) /n l=M+1 i=1 j=1 I( i j , l) + I( i j + } ∑ | { } 2 ≤ n−1 | { }| h , l) tr C(l) /n c0 h=M+1 tr C(h) /n for some constant c0, which does not depend on p and n.

Therefore, for h = 0, ··· ,M,

h h ∑M l 2 − I(l, 0) E[tr{C[(h)}] = (1 − )tr{C(h)} + (1 − ) (1 − ) tr{C(l)} n n n n l=0 − { } 1 ∑M n∑h ∑n − I(|i − j|, l) + I(|i − j + h|, l) tr{C(l)} n2 l=0 i=1 j=1 − 1 n∑h + (µ − µ¯)T (µ − µ¯) + R (h). n i i+h v i=1 As a result, we have

E(Vˆ ) = E(tr{C[(0)}, ··· , tr{C\(M)})T

T = Fn,M (tr{C(0)}, ··· , tr{C(M)}) + Rv + VB, where Fn,M is a (M + 1) × (M + 1) matrix defined in (2.5), Rv = {Rv(0), ··· ,Rv(M)} and VB is defined in Proposition 2.1. Furthermore,

T −1 ˆ T { } ··· { } T T −1 T −1 E(ft Fn,M V ) = ft (tr C(0) , , tr C(M) ) + ft Fn,M Rv + ft Fn,M VB. (2.24)

Combing (2.23) and (2.24), we get

T −1 T −1 ft F VB ft F Rv t(n − t)R E(Lˆ ) = L − n,M − n,M + b . t t n n n2

− T −1 − 2 Therefore, the remainder term is ft Fn,M Rv/n+t(n t)Rb/n , which is negligible under condition (C1).

In the following, we derive the variance of Lˆt. From the definition of Lˆt, we can write Lˆt as

1 ∑n ∑n Lˆ = B (i, j)XT X , (2.25) t n2 t i j i=1 j=1 where the n × n matrix Bt is defined in Proposition 2.1. Based on the definition of Bt(i, j), the leading order term of Bt(i, j) can be written as Bt(i, j) = Bt1(i)Bt1(j){1 + o(1)}, where Bt1(i) = √ n(n − t)[I(i ≤ t)/t − I(i > t)/(n − t)]. Using Xi = Yi + µi, we have ∑ ∑ ∑ 2 ˆ T T T n Lt = Bt1(i)Bt1(j)Yi Yj + 2 Bt1(i)Bt1(j)Yi µj + Bt1(i)Bt1(j)µi µj ij ij ij

38 ≡ Lˆ1t + Lˆ2t + Lˆ3t.

4 Therefore, n Var(Lˆt) = Var(Lˆ1t) + Var(Lˆ2t) + 2Cov(Lˆ1t, Lˆ2t). ∑ ∑ ˆ ˆ T T First, we can write L1t as L1t = i Bt1(i)Bt1(j)Yi Yi + i≠ j Bt1(i)Bt1(j)Yi Yj. The expecta- tion of Lˆ1t is ∑n ∑ E(Lˆ1t) = Bt1(i)Bt1(j)tr{C(0)} + Bt1(i)Bt1(j)tr{C(j − i)}. i=1 i≠ j

The square of Lˆ1t is ∑ ∑ ∑ ˆ2 2 2 T T 2 T T L1t = Bt1(i)Bt1(k)Yi YiYk Yk + 2 Bt1(i)Yi YiBt1(k)Bt1(j)Yk Yj i,k i k≠ j ∑ ∑ T T + Bt1(i)Bt1(j)Yi YjBt1(k)Bt1(l)Yk Yl. i≠ j k≠ l

Then, the variance of Lˆ1t is ∑ ˆ 2 2 { − T − } Var(L1t) = 2 Bt1(i)Bt1(k)tr C(k i)C (k i) i,k ∑ ∑ 2 ∗ { − − } + 4 Bt1(i)Bt (i, i)Bt1(k)Bt1(j)tr C(k i)C(i j) i k≠ j ∑ ∑ + 2 Bt1(i)Bt1(j)Bt1(k)Bt1(l)tr{C(k − j)C(i − l)} i≠ j k≠ l ∑ ∑ = 2 Bt1(i)Bt1(j)Bt1(k)Bt1(l)tr{C(k − j)C(i − l)} i,l k,j ∑ ∑ = 2 Bt1(i)Bt1(j)Bt1(k)Bt1(l)tr{C(k − j)C(i − l)} |i−l|M ∑ ∑ + 2 Bt1(i)Bt1(j)Bt1(k)Bt1(l)tr{C(k − j)C(i − l)} := I1 + I2 + I3. |i−l|>M |k−j|>M ∑ n−h ⊗2 T Let Bh = i=1 Bt1(i)Bt1(i + h) and A = AA . Then, the first term in the above expression could be represented as

∑M n∑−h ∑M n∑−h1 T I1 =4 Bt1(i)Bt1(j)Bt1(i + h)Bt1(j + h1)tr{C(h)C (h1)}

h=1 i=1 h1=1 j=1 ∑M n∑−h ∑M n∑−h1 + 4 Bt1(i)Bt1(k + h1)Bt1(i + h)Bt1(k)tr{C(h)C(h1)}

h=1 i=1 h1=1 k=1

39 ∑M ∑M ⊗2 2 =4tr[{ BhC(h)} ] + 4tr[{ BhC(h)} ]. h=1 h=1 The second term is

∑M n∑−h n∑−1 n∑−h1 T I2 =8 Bt1(i)Bt1(j)Bt1(i + h)Bt1(j + h1)tr{C(h)C (h1)}

h=1 i=1 h1=M+1 j=1 ∑M n∑−h n∑−1 n∑−h1 +8 Bt1(i)Bt1(k + h1)Bt1(i + h)Bt1(k)tr{C(h)C(h1)}

h=1 i=1 h1=M+1 k=1 [ ∑M n∑−1 ] [ ∑M n∑−1 ] T =8tr { BhC(h)}{ BhC (h)} + 8tr { BhC(h)}{ BhC(h)} h=1 h=M+1 h=1 h=M+1 ∑M n∑−1 1 ⊗2 1 ⊗2 ≤16tr 2 [{ BhC(h)} ]tr 2 [{ BhC(h)} ]. h=1 h=M+1 The third term is

n∑−1 n∑−h1 n∑−1 n∑−h1 T I3 =4 Bt1(i)Bt1(j)Bt1(i + h)Bt1(j + h1)tr{C(h)C (h1)}

h=M+1 i=1 h1=M+1 j=1 n∑−1 n∑−h1 n∑−1 n∑−h1 +4 Bt1(i)Bt1(k + h1)Bt1(i + h)Bt1(k)tr{C(h)C(h1)}

h=M+1 i=1 h1=M+1 k=1 n∑−1 n∑−1 ⊗2 2 =4tr[{ BhC(h)} ] + 4tr[{ BhC(h)} ]. h=M+1 h=M+1

Under condition (C1), the I2 and I3 are smaller order terms of I1. T T T ··· T { } { ··· } Let µBt = (Bt1(1)µ1 ,Bt1(2)µ2 , ,Bt1(n)µ1 ) and Diag C(h) = Diag C(1), ,C(n) . Then the variance of Lˆ2t is ∑ ∑ ˆ T − Var(L2t) = 4 Bt1(i)Bt1(j)Bt1(k)Bt1(l)µj C(k i)µl i,j k,l ∑n ∑M n∑−h T = 8 Bt1(i)Bt1(j)Bt1(i + h)Bt1(l)µj C(h)µl j,l h=1 i=1 ∑n n∑−1 n∑−h T + 8 Bt1(i)Bt1(j)Bt1(i + h)Bt1(l)µj C(h)µl j,l h=M+1 i=1 ∑M n∑−1 T { } T { } = 8 BhµBtDiag C(h) µBt + 8 BhµBtDiag C(h) µBt h=1 h=M+1 and Cov(Lˆ1t, Lˆ2t) = 0. Combining all, we can derive the variance of Lˆt, which is given by (2.9). This completes the proof of Proposition 2.1.

40 Proof of Theorem 2.1

We first prove the asymptotic normality of Lˆt under the null hypothesis. From (2.4), we can write

ˆ ˆ(1) ˆ(2) Lt = Lt + Lt ,

ˆ(1) − 2 ¯ − ¯ T ¯ − ¯ ˆ(2) − T −1 ˆ where Lt = t(n t)/n (X≤t X>t) (X≤t X>t) and Lt = ft Fn,M V /n. If we write

n − t t B (i, j) = I(i ≤ t)I(j ≤ t) − 2I(i ≤ t)I(j > t) + I(i > t)I(j > t) t t n − t [ ( )] ∑M I(j ≥ h + 1) I(j ≤ n − h) n − h − (f T F −1 ) I(i − j, h) − − + n,M h+1 n n n2 h=0 ≡ (1) (2) Bt (i, j) + Bt (i, j),

(1) (2) where Bt (i, j) is the sum of the first three terms, and Bt (i, j) is the remainder of Bt(i, j), respectively. Then,

1 ∑n 1 ∑n Lˆ(1) = B(1)(i, j)XT X and Lˆ(2) = B(2)(i, j)XT X . t n2 t i j t n2 t i j i,j i,j

ˆ(2) From the derivations for Proposition 2.1, we have seen that imposing Lt is to eliminate the bias caused by data dependence. Moreover, from the expression of variance (2.9), the variance ˆ(2) ˆ(1) contributed by Lt is always smaller order of the variance contributed by Lt . Therefore, using the Slutsky theorem, we only need to show that

Lˆ(1) − E(Lˆ(1)) t t −→d N(0, 1). σnt

Note that under the null hypothesis and from (2.2), Xi = ΓiZ. Based on that,

q q 1 ∑n ∑ ∑ Lˆ(1) = B(1)(i, j)ZT ΓT Γ Z = G z z , t n2 t i j kl k l i,j k=1 l=1 ∑ −2 n (1) T { }q where Gkl = n ij Bt (i, j)(Γi Γj)kl. Most importantly, from (2.2), all zk k=1 are mutually independent, which allows us to use the Martingale central limit theorem to establish the asymptotic ˆ(1) normality of Lt .

First, we define G˜kl that is equal to Gkl if k ≠ l and Gkk/2 if k = l. Then,

∑q ∑q ∑l ˆ(1) − ˆ(1) ˜ − ˜ Lt E(Lt ) = 2 Vl = 2 ( Gklzkzl Gll). l=2 l=2 k=1

41 Let Fh = {∅, Ω}, if h = 0, and Fh = σ{z1, z2, . . . , zh}, if h ≥ 1. Hence, F1 ⊆ F2 ⊆ ... is an ∑ h F increasing sequence of σ algebras, and Sh = l=2 Vl is h measurable. Then, it is readily to see that Sh is a martingale by showing that for any q > h, E(Sq|Fh) = Sh. Toward this end, as zi are mutually independent and satisfy E(zi) = 0, Var(zi) = 1,

∑h ∑q E(Sq|Fh) = E( Vl|Fh) + E( Vl|Fh) l=2 l=h+1 ∑q ∑l = Sh + E( G˜klzkzl − G˜ll|Fh), l=h+1 k=1 where ∑q ∑l E( G˜klzkzl − G˜ll|Fh) l=h+1 k=1 ∑h = E( G˜k,h+1zkzh+1|Fh) + E(G˜h+1,h+1zh+1zh+1|Fh) k=1 ∑h + E( G˜k,h+2zkzh+2|Fh) + E(G˜h+1,h+2zh+1zh+2|Fh) + E(G˜h+2,h+2zh+2zh+2|Fh) k=1 ∑h ∑q + ··· + E( G˜kqzkzq|F) + E(G˜h+1,qzh+1zq|Fh) + ··· + E(G˜qqzqzq|Fh) − G˜ll k=1 l=h+1 ∑q ∑q = G˜ll − G˜ll = 0. l=h+1 l=h+1 Next, we need to prove the following two results in order to apply the Martingale central limit theorem: ∑ 2 E(V |F − ) l l l 1 −→p 1 2 , and (2.26) σnt,0 4 ∑ 2 E{V I(|V | > ϵσ )|F − } l l l nt,0 l 1 −→p 2 0, (2.27) σnt,0

F { ··· } 2 ˆ(1) where l−1 is the σ algebra generated by z1, , zl−1 , σnt,0 is the variance of Lt under the null hypothesis and ϵ is any small positive number. ∑ ∑ 2 2 → { 2|F } 4 For (2.26), we need to show that l E(Vl )/σnt,0 1/4 and Var l E(Vl l−1) /σnt,0 → 0, respectively. First,

∑l ∑l 2 ˜ ˜ 2 − ˜ ˜ ˜2 Vl = Gk1lGk2lzk1 zk2 zl 2Gll Gklzkzl + Gll. k1k2 k=1

42 Then, it is easy to see that ∑ ∑ ∑l−1 ≡ 2|F { ˜ ˜ ˜2 } φ E(Vl l−1) = Gk1lGk2lzk1 zk2 + (2 + β)Gll . l l k1k2 ∑ ∑ { l−1 ˜2 ˜2 } Using the above expression, we have E(φ) = l k Gkl + (2 + β)Gll . From the definition that ∑ −2 n (1) T ˜ ̸ Gkl = n ij Bt (i, j)(Γi Γj)kl, and Gkl that is equal to Gkl if k = l and Gkk/2 if k = l, we have q 1 ∑ ∑n E(φ) = B (i , j )B (i j )(ΓT Γ ) (ΓT Γ ) {1 + o(1)}. 2n4 t 1 1 t 2 2 i1 j1 kl j2 i2 lk kl i1j1i2j2 ∑ 2 2 2 → By comparing the above expression with that of σnt,0, we have l E(Vl )/σnt,0 1/4. ∑ { 2|F } 4 → 2 − 2 → To show that Var l E(Vl l−1) /σnt,0 0, we only need to show that E(φ ) E (φ) 0. From the expression of φ, we can derive the following result: ∑q ∑q ∑l ∑l 2 − 2 ˜ ˜ ˜ ˜ { } E(φ ) E (φ) = 2 Gk1l1 Gk1l2 Gk2l1 Gk2l2 1 + o(1) l1=2 l2=2 k1=1 k2=1 ∑ ∑ ∑ ∑ 1 { − = 8 Bt(i1, j1)Bt(i1 + h2, j1 h1) 2n c c i1j1 i2j2 h1,h2∈A∪A h3,h4∈A∪A + Bt(i1, j1)Bt(j1 − h1, i1 + h2)}{Bt(i2, j2)Bt(i2 + h2, j2 − h1)

+ Bt(i2, j2)Bt(j2 − h1, i2 + h2)}tr{C(h1)C(h2)C(h3)C(h4},

4 which is the smaller order of σnt,0 by using both (C1) and (C2). To prove (2.27), we only need to show that by Chebyshev Inequality, ∑ q E(V 4) l=2 l → 4 0. σnt,0

From the definition of Vl, we can derive that for some constant C, q ∑ C ∑ ∑ ∑ ∑ E(V 4) = B (i , j )B (i , j )B (i , j )B (i , j ) l n8 t 1 1 t 2 2 t 3 3 t 4 4 l=2 ∑i∑1j1 i2j2 i3j3 i4j4 × T T T T (Γi1 Γj1 )k1l(Γi2 Γj2 )k1l(Γi3 Γj3 )k2l(Γi4 Γj4 )k2l. l k1k2 In the above expression, ∑ ∑ { } T T T T T T ◦ T T (Γi1 Γj1 )k1l(Γi2 Γj2 )k1l(Γi3 Γj3 )k2l(Γi4 Γj4 )k2l = tr (Γj1 Γi1 Γi2 Γj2 ) (Γj3 Γi3 Γi4 Γj4 ) , l k1k2 where the symbol “◦” represents element product of two matrices. Using the inequality that tr(A ◦ A) ≤ tr(AAT ), we have ∑q ∑ ∑ ∑ ∑ 4 C { − E(Vl ) = 8 Bt(i1, j1)Bt(i1 + h2, j1 h1) n c c l=2 i1j1 i2j2 h1,h2∈A∪A h3,h4∈A∪A

43 + Bt(i1, j1)Bt(j1 − h1, i1 + h2)}{Bt(i2, j2)Bt(i2 + h2, j2 − h1)

+ Bt(i2, j2)Bt(j2 − h1, i2 + h2)}tr{C(h1)C(h2)C(h3)C(h4},

4 which again is the smaller order of σnt,0 by (C1) and (C2). Based on (2.26) and (2.27), we can apply the Martingale central limit theorem to establish the asymptotic normality of Lˆt under the null hypothesis.

To prove asymptotic normality of Lˆt under the alternative, we use (2.2) to write 1 ∑n 1 ∑n 1 ∑n Lˆ = B (i, j)µT µ + B (i, j)µT Γ Z + B (i, j)µT Γ Z t n2 t i j n2 t i j n2 t j i ij ij ij 1 ∑n + B (i, j)ZT ΓT Γ Z, n2 t i j ij where only the last term above remains under the null hypothesis and we have already established its asymptotic normality in the first part of the proof. From the derivations of Proposition 2.1, it can be seen that in the variance of Lt given by (2.9), [ ] 1 ∑n ∑n ∑n ∑ {B (i, j) + B (j, i)}{B (k, i + h) + B (i + h, k)}µT C(h)µ (2.28) n4 t t t t j k i=1 j=1 k=1 h∈A∪Ac ∑ ∑ −2 n T −2 n T is contributed by n ij Bt(i, j)µi ΓjZ + n ij Bt(i, j)µj ΓiZ. If (2.28) is the smaller order of ∑ −2 n T T the variance of n ij Bt(i, j)Z Γi ΓjZ, the asymptotic normality can be established by applying Slutsky theorem. On the other hand, we only need to establish the asymptotic normality of ∑ ∑ −2 n T −2 n T ˆ n ij Bt(i, j)µi ΓjZ + n ij Bt(i, j)µj ΓiZ in order to prove the asymptotic normality of Lt ··· T { }q under the alternative hypothesis. By observing that Z = (z1, , zq) and zk k=1 are mutually ∑ ∑ −2 n T −2 n T independent, the asymptotic normality of n ij Bt(i, j)µi ΓjZ + n ij Bt(i, j)µj ΓiZ can be established from the Lyapunov’s condition. We thus establish the asymptotic normality of Lˆt under the alternative hypothesis and complete the proof of Theorem 2.1.

Proof of Theorem 2.2

2 2 First, we show that E(ˆσnt,0) = σnt,0. It is equivalent to showing \ E[tr{C(h1)C(h2)}] = tr{C(h1)C(h2)}, which can be proved using (2.10). For the first term on the right side of equal sign in (2.10), using

(2.2), we have ( ∗ ) 1 ∑ E XT X XT X = µT µµT µ + µT tr{C(h )}µ + µT tr{C(h )}µ + tr{C(h )C(h )}. n∗ t+h2 s s+h1 t 1 2 1 2 1 s,t

44 Similarly, we can show that the expectation of the other three terms on the right side of equal sign

T T T T in (2.10) is −µ µµ µ − µ tr{C(h1)}µ − µ tr{C(h2)}. Combining everything, we have shown that { \ } { } 2 2 E[tr C(h1)C(h2) ] = tr C(h1)C(h2) or E(ˆσnt,0) = σnt,0. In order to prove Theorem 2.2, we need to show that as n → ∞ and for any ϵ > 0, ( ) σˆ2 | nt,0 − | → P 2 1 > ϵ 0. σnt,0 Using Chebyshev inequality, we only need to show that { } σˆ2 nt,0 − 2 → E ( 2 1) 0. σnt,0 2 2 − 2 4 → → ∞ Since E(ˆσnt,0/σnt,0 1) = 0, we only need to show that Var(ˆσnt,0)/σnt,0 0 as n . Further- 4 4 → more, we only need to show that E(ˆσnt,0)/σnt,0 1 or equivalently, [ ] \ \ E tr{C(h1)C(h2)}tr{C(h3)C(h4)} → 1, tr{C(h1)C(h2)}tr{C(h3)C(h4)} which, under (C2), can be shown by following similar derivations for Theorem 2 of Li and Chen

(2012). Theorem 2.2 is then proved by the continuous mapping theorem.

Proof of Theorem 2.3 ∑ ∑ Lˆ n−1 ˆ B n−1 Recall that the L2-norm statistic = t=1 Lt. Using the definition of (i, j) = t=1 Bt(i, j), we see that 1 ∑n ∑n Lˆ = B(i, j)XT X . n2 i j i=1 j=1 The advantage of using the above expression for Lˆ is that all the established results for each marginal Lˆt can be applied to Lˆ by simply replacing Bt(·, ·) by the corresponding B(·, ·). Specially, we can prove Theorem 2.3 by doing so.

Proof of Theorem 2.4

Given a constant C > 0, we define a set

1−γ 1/2 2 K(C) = {t : |t − τ| > Cn log nσmax/δ , 1 ≤ t ≤ n − 1}.

To show Theorem 2.4, we first show that for ϵ > 0, there exists a constant C such that { } 1−γ 1/2 2 P |τˆ − τ| > Cn log nσmax/δ < ϵ. (2.29)

45 { ∈ } { ˆ ˆ } Since( the event τˆ K)(C) implies the event maxt∈K(C) Lt > Lτ , then it is enough to show that ˆ ˆ P maxt∈K(C) Lt > Lτ < ϵ. Toward this end, we first derive the result based on the definition of

Lt: { } n − τ τ L − L = δ2 I(1 ≤ t ≤ τ) + I(τ < t ≤ n − 1) |τ − t|, (2.30) τ t n(n − t) nt 2 which implies that Lt attains its maximum δ at t = τ since 1/(n − t) is an increasing function and 1/t is a decreasing function. As a result, by union sum inequality and letting A(t, τ|1, n − 1) =

(n − τ)/{n(n − t)}I(1 ≤ t ≤ τ) + τ/(nt)I(τ < t ≤ n − 1), ( ) ∑ ( ) P max Lˆt > Lˆτ ≤ P Lˆt − E(Lˆt) + E(Lˆt) − E(Lˆτ ) > Lˆτ − E(Lˆτ ) t∈K(C) t∈K(C) ∑ { } = P (Lˆt − E(Lˆt)) − (Lˆτ − E(Lˆτ )) > E(Lˆτ ) − E(Lˆt) . t∈K(C)

Since the bias of E(Lˆt) to Lt is the smaller order such that E(Lˆτ ) − E(Lˆt) = (Lτ − Lt){1 + o(1)}, then using the result given in (2.30), we have ( ) { } ∑ 2 Lˆt − E(Lˆt) A(t, τ|1, n − 1) δ P max Lˆt > Lˆτ ≤ P | | > |τ − t|{1 + o(1)} t∈K(C) σnt 2 σmax t∈K(C) { } ∑ Lˆ − E(Lˆ ) A(t, τ|1, n − 1) δ2 + P | τ τ | > |τ − t|{1 + o(1)} σnτ 2 σmax t∈K(C) { } ∑ Lˆ − E(L ) √ ≤ P | t t | > Clogn σnt t∈K(C) { } ∑ Lˆ − E(L ) √ + P | τ τ | > Clogn , σnτ t∈K(C) where the result of A(t, τ|1,T ) = O(nγ−2) has been used.

Since {Lˆt − E(Lt)}/σnt ∼ N(0, 1), for a large C, { } ∑ Lˆ − E(L ) √ ∑ P | t t | > Clogn = C(logn)−1/2n−C ≤ C(logn)−1/2n1−C ≤ ϵ. σnt t∈K(C) t∈K(C) Similarly, we can show that { } ∑ Lˆ − E(L ) √ P | τ τ | > Clogn ≤ ϵ. σnτ t∈K(C) { } 1−γ 1/2 2 −1 Hence, (2.29) is true, or equivalently,τ ˆ − τ = Op n log nσmax/δ . Since σmax = vmaxn , it leads to the convergent rate in Theorem 2.4.

46 Proof of Theorem 2.5

Recall that ( ) √ ∑ ∑p |ˆ | √ L ˆ2 √δt,j ≥ 0 = p δt,jI log(log n) log p , with σˆ2 t∈Sˆα j=1 t,j { √ } ˆ Lt 1/π Sˆα = t : √ ≥ 2 log n − log(log n) + log{log(n )} . 2 σˆnt,0

Under H0, we have

P (L = 0) 0 [ ] { ∑ ∑p ( |ˆ | √ ) } ˆ ∅ { ˆ ̸ ∅} ∩ ˆ2 √δt,j ≥ = P (Sα = ) + P Sα = δt,jI log(log n) log p = 0 σˆ2 t∈Sˆα j=1 t,j

= P (Sˆα = ∅) + P (Sˆα ≠ ∅) { ∑ ∑p ( |ˆ | √ ) } × ˆ2 √δt,j ≥ | ˆ ̸ ∅ P δt,jI log(log n) log p = 0 Sα = . σˆ2 t∈Sˆα j=1 t,j By Lemma 2.2, as n → ∞,

P (Sˆα = ∅) { √ } |Lˆ | = P max √ t ≤ 2 log n − log(log n) + log log(n1/π) 1≤t≤n−1 2 σˆnt,0 [ 1 1 ] → exp − √ exp{− log log(n1/π)} π 2 ( 1 ) = exp − √ → 1. log n

Recall

δˆt,j = X¯≤t,j − X¯>t,j 1 ∑t 1 ∑n = X − X . t i1,j n − t i2,j i1=1 i2=t+1

Under H0, follows the results of Lemma 2.3, for any t ∈ {1, . . . , n − 1}, as n → ∞, { } |δˆ | √ P max √t,j < log(log n) log p → 1. 1≤j≤p 2 σˆt,j

Consequently, under H0, as n → ∞, { ∑ ∑p ( |ˆ | √ ) } ˆ2 √δt,j ≥ | ˆ ̸ ∅ P δt,jI log(log n) log p = 0 Sα = σˆ2 t∈Sˆα j=1 t,j

47 ( √ ) |δˆt,j| = P max max √ < log(log n) log p|Sˆα ≠ ∅ → 1. t∈Sˆ 1≤j≤p 2 α σˆt,j

In sum, under H0, as n → ∞,

P (L0 = 0) → 1 + 0 · 1 = 1.

d As Theorem 2.3 already showed that L1 = Lˆ/σˆn,0 −→ N(0, 1), by Slutsky’s Theorem, as n → ∞,

d L1 + L0 −→ N(0, 1).

This proves part (i).

As n → ∞, for any t ∈ {1, . . . , n − 1}, we have

Lˆ − L t t −→d N(0, 1). σnt

Under Hα, for any t ∈ {1, . . . , n − 1}, as n → ∞, { √ } |Lˆ − L | P t t ≥ 2 log n − log(log n) + log{log(n1/π)} √ σnt √ 2 [ 1 ] → exp − log n + log(π 2 ) / 2 log n − log π π √ −1 − 1 = 2n (2 log n − log π) 2 → 0.

√ { } 1/π Let λ1 ≡ 2 log n − log log n + log log(n ), and Sα ≡ t : |Lt|/σnt ≥ 2λ1 . Under Hα, if Sα ≠ ∅, then for any t ∈ Sα,

( |ˆ | ) ( |ˆ | ) √Lt ≥ → Lt ≥ 2 2 −→p → ∞ P λ1 P λ1 , asσ ˆnt,0/σnt,0 1, when n . 2 σnt,0 σˆnt,0 ( ) |Lˆt| ≥ P ≥ λ1 , as σnt > σnt,0. σnt ( ) ( ) |Lt| |Lˆt − Lt| ≥ P − ≥ λ1 ≥ P 2λ1 − λ1 ≥ λ1 = 1. σnt σnt { √ } ˆ |ˆ | 2 ≥ → ∞ → ∞ ̸ ∅ Recall Sα = t : Lt / σˆnt,0 λ1 . Therefore, under Hα, as p , and n , if Sα = , for any t ∈ Sα,

P (t ∈ Sˆα|Sα ≠ ∅) → 1, which is equivalent to

P (Sα ⊂ Sˆα|Sα ≠ ∅) → 1.

48 √ Let λ2 = log(log n) log p. Under Hα, if Sα ≠ ∅, for any t ∈ Sα, by the results of Lemma 2.3, as n → ∞, we have ( ) |δˆt,j − δt,j| P max ≤ λ2 → 1. 1≤j≤p σt,j { } ≡ | | ∗ ≥ ∗2 ≡ 2 ∈ ̸ ∅ Let St,d j : δt,j /σt,j 2λ2 , where σ t,j E(ˆσt,j) under Hα. If there is a t0 Sα = , such ̸ ∅ ∈ that St0,d = , then for any j St0,d, ( ) ( ) |δˆ | |δˆ | √t0,j ≥ → t0,j ≥ 2 ∗2 −→p → ∞ P λ2 P ∗ λ2 , asσ ˆt ,j/σ t ,j 1, when n . 2 σ 0 0 σˆ t0,j t0,j ( ) |δ | |δˆ − δ | ≥ P t0,j − t0,j t0,j ≥ λ , σ∗ ≥ σ . σ∗ σ 2 t0,j t0,j t0,j t0,j ≥ P (2λ2 − λ2 ≥ λ2) = 1. √ Consequently, for t ∈ S ≠ ∅, such that S ≠ ∅, let Sˆ ≡ {j : |δˆ |/ σˆ2 ≥ λ }, under H , 0 α t0,d t0,d t0,j t0,j 2 α → ∞ ∈ as n , for any j St0,d, ∈ ˆ | ̸ ∅ → P (j St0,d St0,d = ) 1, which is equivalent to ⊂ ˆ | ̸ ∅ → P (St0,d St0,d St0,d = ) 1.

̸ ∅ ∈ ̸ ∅ Therefore, under Hα, if Sα = , and if there is a t0 Sα, such that St0,d = , then for a positive constant c, as n → ∞, √ (√ ∑ ∑ √ ) L ˆ2 P ( 0 > c p) = P p δt,j > c p ˆ ˆ t∈Sα j∈St,d (√ ∑ ∑ √ ) ≥ ˆ2 P p δt,j > c p t∈Sα j∈St,d [√ ∑ { √ } √ ] ≥ 2 2 → P p log(log n) log p σt0,j > c p 1. ∈ j St0,d Toward this end, as { } { } |Lt| |δt,j| max ≥ 2λ1 ∩ max max ∗ ≥ λ2 ⇔ {Sα ≠ ∅} ∩ {St ,d ≠ ∅, t0 ∈ Sα}, t ∈ 0 σnt t Sα j σt,j { | | ≥ } ∩ { | | ∗ ≥ } → ∞ hence, under Hα, if maxt Lt /σnt 2λ1 maxt∈Sα maxj δt,j /σt,j 2λ2 , as n √ P (L1 + L0 > zα|Hα) > P (L1 + c p > zα|Hα) → 1.

This completes the proof of Theorem 2.5.

49 2.8.2 Lemmas and their proofs

1/2 − −1 T −1 Lemma 2.1. Under the alternative of (2.1) and M = o(n ), Lt n ft Fn,M VB always attains its maximum at one of the change points 1 ≤ τ1 < ··· < τq < n.

Proof. We first show that Lt always attains its maximum at one of the change points 1 ≤ τ1 < ··· −1 T −1 < τq < n. Then we show that the bias term n ft Fn,M VB is the small order of Lt at those change points.

Toward this end, we first consider the case of one change-point τ1. If t ≤ τ1, we can derive that − 2 − −1 −2 − T − 2 −1 −2 − Lt = (n τ1) t(n t) n (µ1 µ2) (µ1 µ2). On the other hand, if t > τ1, Lt = τ1 t n (n T −1 −1 t)(µ1 − µ2) (µ1 − µ2). Since t(n − t) = −1 + n(n − t) is an increasing function of t and

−1 −1 (n − t)t = nt − 1 is a decreasing function of t, the maximum of Lt is attained at τ1.

Next, we consider more than one change-point case. Suppose that t ≤ τ1, it can be derived that

{ q } { q } t ∑ T ∑ L = (τ − τ )(µ − µ ) (τ − τ )(µ − µ ) , t n2(n − t) i+1 i 1 τi+1 i+1 i 1 τi+1 i=1 i=1 where τq+1 = n and µτq+1 = µn. Clearly, it can be seen that the maximum is attained at τ1. If t > τq, it can be derived that

{ q } { q } n − t ∑ T ∑ L = (τ − τ − )(µ − µ ) (τ − τ − )(µ − µ ) , t n2t i i 1 τi n i i 1 τi n i=1 i=1 where τ0 = 0. Therefore, Lt attains its maximum at t = τ1. At last, if τk < t ≤ τk+1, we have { } { } n − t ∑k T ∑k L = (τ − τ − )(µ − µ ) (τ − τ − )(µ − µ ) t n2t i i 1 τi τk+1 i i 1 τi τk+1 i=1 i=1 { q } { q } t ∑ T ∑ + (τ − τ )(µ − µ ) (τ − τ )(µ − µ ) n2(n − t) i+1 i τk+1 τi+1 i+1 i τk+1 τi+1 i=k+1 i=k+1 { } { q } 2 ∑k T ∑ + (τ − τ − )(µ − µ ) (τ − τ )(µ − µ ) . n2 i i 1 τi τk+1 i+1 i τk+1 τi+1 i=1 i=k+1

Based on the above expression, the first derivative of Lt with respect to t is

BT B AT A L′ = − , t n(n − t)2 nt2 ∑ ∑ k q T − − − − − ≥ where A = i=1(τi τi 1)(µτi µτk+1 ) and B = i=k+1(τi+1 τi)(µτk+1 µτi+1 ). Since B B 0 and AT A ≥ 0 (but they cannot be zero simultaneously under the alternative), and (n − t)−2 is an

−2 increasing function of t, and t is a decreasing function of t, Lt is either monotonically increasing,

50 or monotonically decreasing, or “U”-shaped function of t. For all cases, we have Lt attains its maximum at either τk or τk+1.

−1 T −1 Next, we show that the bias term n ft Fn,M VB is the small order of Lt at those change points.

Again, we start with the case of one change point τ1. From the definitions of ft and Fn,M given by (2.4) and (2.5) in Section 2.1, it can be seen that all the components ft and diagonal elements

−1 of Fn,M have the order of O(1). The off-diagonals of Fn,M have the order of n . Moreover, ∑ ∑ −1 n − T − ··· −1 n−M − T − T VB = (n i=1(µi µ¯) (µi µ¯), , n i=1 (µi µ¯) (µi+M µ¯)) , where we can show that for h = 0, ··· ,M,

− { 1 n∑h 1 (µ − µ¯)T (µ − µ¯) = (τ − h)(µ − µ¯)T (µ − µ¯) + h(µ − µ¯)T (µ − µ¯) n i i+h n 1 1 1 1 2 i=1 } T + (n − τ1 − h)(µ1 − µ¯) (µ1 − µ¯)

n(n − τ )(τ − h) − τ 2h = 1 1 1 (µ − µ )T (µ − µ ). n3 1 2 1 2

−1 T −1 2 −1 1/2 Combining the above results, we have n ft Fn,M VB = O(M n Lt), which is o(Lt) if M = o(n ). Similarly, we can show that the bias term is the smaller order for more than one change points.

This completes the proof of Lemma 2.1.

′ Lemma 2.2. Let (Z1,...,Zn−1) be a zero-mean multivariate normal random vector with covari- ance ΣZ whose diagonal components are 1. If the maximum eigenvalue of ΣZ is bounded, then assume (C1) and (C2), under H0, as n → ∞,

Lˆt d max √ −→ max Zt, 1≤t≤n−1 2 1≤t≤n−1 σˆnt,0 and ( ) |Lˆ | √ P max √ t ≤ 2 log n − log log n + x → exp{−(π)−1exp(−x/2)}. 1≤t≤n−1 2 σˆnt,0

√ ˆ 2 → ∞ Proof. To establish the asymptotic distribution of max1≤t≤n−1 Lt/ σˆnt,0, when n , similar to −1 ˆ the idea of Theorem 3 in Zhong and Li (2016), we need to show that under H0, max1≤t≤n−1 σnt,0Lt converges to max1≤t≤n−1 Zt, where Zt is a Gaussian process, with mean 0 and covariance ΣZ . Toward this end, we need to first show the joint asymptotic normality of (σ−1 Lˆ , . . . , σ−1 Lˆ )′ nt1,0 t1 ntl,0 tl −1 ˆ for t1 < t2, . . . , < tl, and second show the tightness of the random variable max1≤t≤n−1 σnt,0Lt.

51 According to the Cramer-Wold device, we only need to show that for any non-zero constant ∑ ′ tl vector c = (ct , . . . , ct ) , ctLˆt is asymptotically normal under H0, or to say, we need to show 1 √ l t=t1 ∑ ∑ d that tl c Lˆ / Var( tl c Lˆ ) −→ N(0, 1), which can be proved by the martingale central limit t=t1 t t t=t1 t t theorem. Since the proof is very similar to that of Theorem 2.1, we omit it. −1 ˆ To prove the tightness of max1≤t≤n−1 σnt,0Lt, without loss of generality, we assume that µ1 =

µ2 = ··· = µn = 0. Recall that − f T F −1 Vˆ t(n t) T t n,M Lˆ = (X¯≤ − X¯ ) (X¯≤ − X¯ ) − t n2 t >t t >t n ˆ(1) ˆ(2) = Lt + Lt and we can write ˆ(1) ˆ(11) ˆ(12) Lt = Lt + Lt , where n − t ∑t t ∑n Lˆ(11) = X′X + X′ X , t n2t i i n2(n − t) j j i=1 j=t+1 and n − t ∑t t ∑n 2 ∑t ∑n Lˆ(12) = X′ X + X′ X − X′X . t n2t i1 i2 n2(n − t) j1 j2 n2 i j i1≠ i21 j1≠ j2=t+1 i=1 j=t+1 Then we have [ ] (n − t)2 ∑t ∑t Var(Lˆ(11)) ≍ (5 + β) tr{C2(0)} + tr{C(i − i )C(i − i )} t n4t2 2 1 1 2 i=1 i1≠ i2=1 [ ] t2 ∑n ∑t + (5 + β) tr{C2(0)} + tr{C(j − j )C(j − j )} n4(n − t)2 2 1 1 2 j=t+1 j1≠ j2=1 1 ∑t ∑n + (5 + β) tr{C(j − i)C(i − j)} n4 i=1 j=t+1 t2 ∑ (n − t)2 ∑ ≍ · (n − t) · tr{C(h )C(h )} + · t · tr{C(h )C(h )} n4(n − t)2 1 2 n4t2 1 2 h1,h2 h1,h2 1 ∑ ∑ + tr{C(h )C(h )} ≍ n−3 tr{C(h )C(h )}. n4 1 2 1 2 h1,h2 h1,h2 ˆ(12) ˆ(12) ˆ(12) ˆ(12) Let Lt = Lt,1 + Lt,2 + Lt,3 , then

ˆ(12) Var(Lt,1 ) (n − t)2 ∑t [ ] ≍ 2 (4 + β)tr{C2(i − i )} + tr{C2(0)} n4t2 1 2 i1≠ i2

52 (n − t)2 ∑t [ ] + (4 + β)tr{C(i − i )C(i − i )} + tr{C(i − i )C(i − i )} n4t2 1 4 3 2 1 3 4 2 i1≠ i2≠ i3≠ i4 (n − t)2 ∑ (n − t)2 ∑ ≍ · t2 · tr{C(h )C(h )} + · t · tr{C(h )C(h ) n4t2 1 2 n4t2 1 2 h1,h2 h1,h2 (n − t)2 ∑ ≍ tr{C(h )C(h )}. n4 1 2 h1,h2 ˆ(12) ≍ 2 4 { 2 } ˆ(12) For the same calculation, we can have Var(Lt,2 ) (t /n )tr C (0) . Next, we consider Var(Lt,3 ).

ˆ(12) Var(Lt,3 ) 4 ∑t ∑n [ ] = (4 + β)tr{C2(i − j)} + tr{C2(0)} n4 i=1 j=t+1 4 ∑t ∑n [ ] + (4 + β)tr{C(i − j)C(i − l)} + tr{C(l − j)C(0)} n4 i=1 j≠ l=t+1 4 ∑t ∑n [ ] + (4 + β)tr{C(k − j)C(i − j)} + tr{C(0)C(i − k)} n4 i≠ k=1 j=t+1 4 ∑t ∑n [ ] + (4 + β)tr{C(k − j)C(i − l)} + tr{C(l − j)C(i − k)} n4 i≠ k=1 j≠ l=t+1 4 ∑ 4 ∑ ≍ · t(n − t) · tr{C(h )C(h )} + · t · tr{C(h )C(h )} n4 1 2 n4 1 2 h1,h2 h1,h2 4 ∑ 4 ∑ + · (n − t) · tr{C(h )C(h )} + · t(n − t) · tr{C(h )C(h )} n4 1 2 n4 1 2 h1,h2 h1,h2 t(n − t) ∑ ≍ tr{C(h )C(h )}. n4 1 2 h1,h2 ˆ(12) ˆ(12) ˆ(12) It can be shown that the covariance among Lt,1 , Lt,2 and Lt,3 , are all smaller order of ∑ ∑ t(n − t)/n4 tr{C(h )C(h )}. In sum, Var(Lˆ(12)) ≍ n−2 tr{C(h )C(h )}. Therefore, h1,h2 1 2 t h1,h2 1 2 ∑ as Var(Lˆ(11)) = o{Var(Lˆ(12))}, we have Var(Lˆ(1)) ≍ n−2 tr{C(h )C(h )}. As showed in t t t h1,h2 1 2 ˆ(2) ˆ(1) Proposition 2.1, the variance contributed by Lt is the smaller order of Var(Lt ), then we have ∑ σ2 = Var(Lˆ(1)){1 + o(1)} ≍ n−2 tr{C(h )C(h )}. nt,0 t h1,h2 1 2 Consider t = [nν] for ν = j/n ∈ (0, 1) with j = 1, . . . , n − 1. Based on the above results, to −1 ˆ show the tightness of σnt,0Lt is equivalent to show the tightness of G(ν) where [ ] ∑ (−1/2) ˆ G(ν) = n tr{C(h1)C(h2)} L[nν] h ,h [ 1 2 ] ∑ (−1/2) { } ˆ(1) ˆ(2) (1) (2) = n tr C(h1)C(h2) (L[nν] + L[nν]) = G (ν) + G (ν). h1,h2

53 We first show the tightness of G(1)(ν). Note that [ ] ∑ (−1/2) (1) { } ˆ(11) ˆ(12) (11) (12) G (ν) = n tr C(h1)C(h2) (Lt + Lt ) = G (ν) + G (ν). h1,h2

For 0 < ν < η < 1, we have the following for G(11)(ν)

E{|G(11)(ν) − G(11)(η)|2} [nν] n2 n − [nν] ∑ [nν] ∑n ∑ ′ ′ = E 2 XiXi + 2 XjXj tr{C(h1)C(h2)} n [nν] n (n − [nν]) h1,h2 i=1 j=[nν]+1 [nη] n n − [nη] ∑ [nη] ∑ 2 − X′X − X′ X n2[nη] i i n2(n − [nη]) j j i=1 j=[nη]+1 { (n − [nν])2 ([nν])2 (n − [nη])2 ([nη])2 ≤ Cn2 + + + n4[nν] n4(n − [nν]) n4[nη] n4(n − [nη]) } (n − [nν])(n − [nη]) [nν](n − [nη])([nη] − [nν]) [nν][nη] − 2 − 2 − 2 n4[nη] n4[nη](n − [nν]) n4(n − [nν]) ≤ C(η − ν)/n.

Applying the above inequality with ν = k/n and η = m/n for 0 ≤ k ≤ m < n for integers k, m and n and using Chebyshev’s inequality, we have, for any ϵ > 0,

P (|G(11)(k/n) − G(11)(m/n)| ≥ ϵ) ≤ E|G(11)(k/n) − G(11)(m/n)|2/ϵ2

≤ C(m − k)/(ϵn)2 ≤ (C/ϵ2)(m − k)1+α/n2−α,

(11) (11) where 0 < α < 1/2. Now if we define ξi = G (i/n) − G {(i − 1)/n} for i = 1, . . . , n − 1. Then

(11) (11) G (i/n) is equal to the partial sum of ξi, namely Si = ξ1 + ··· + ξi = G (i/n). Here S0 = 0. Then we have

2 1/(1+α) (2−α)/(1+α) 1+α P (|Sm − Sk| ≥ ϵ) ≤ (1/ϵ ){C (m − k)/n } .

Then using Theorem 10.2 in Billingsley (1999), we conclude the following

2 (2−α)/(1+α) 1+α 2 2α−1 P ( max |Si| ≥ ϵ) ≤ (KC/ϵ ){n/n } ≤ (KC/ϵ )n . 1≤i≤n

The right hand side of the above inequality goes to 0 as n → ∞ because α < 1/2. Based on the

(11) (11) relationship between Si and G (i/n), we have shown the tightness of G (ν). Next, we consider the tightness of G(12)(ν). Recall that

(12) −1/2{ 2 } ˆ(12) ˆ(12) ˆ(12) G (ν) = ntr C (0) (Lt,1 + Lt,2 + Lt,3 )

54 (12) (12) (12) = G1 (ν) + G2 (ν) + G3 (ν)

(12) (12) (12) It is enough to show the tightness of G1 (ν), since the tightness of G2 (ν) and G3 (ν) are ∑ similar. Let h (i , i ) = [nν] X′ X . Then we have the following ν 1 2 i1≠ i2=1 i1 i2

(12) − (12) G1 (η) G1 (ν) [ ] { } ∑ (−1/2) n − [nν] n − [nη] = n tr{C(h )C(h )} h (i , i ) − h (j , j ) 1 2 n2[nν] ν 1 2 n2[nη] η 1 2 h1,h2 and

{ (12) − (12) }4 E[ G1 (η) G1 (ν) ] [ ] [ ∑ −1 (n − [nν])4 (n − [nη])4 = n4 tr2{C(h )C(h )} E{h4(i , i )} + E{h4(j , j )} 1 2 n8[nν]4 ν 1 2 n8[nη]4 η 1 2 h1,h2 (n − [nν])2(n − [nη])2 + 6 E{h2(i , i )h2(j , j )} n8[nν]2[nη]2 ν 1 2 η 1 2 (n − [nν])3(n − [nη]) − 4 E{h3(i , i )h (j , j )} n8[nν]3[nη] ν 1 2 η 1 2 ] (n − [nν])(n − [nη])3 − 4 E{h (i , i )h3(j , j )} n8[nν][nη]3 ν 1 2 η 1 2

= I1 + I2 + I3 + I4 + I5.

First note, [ ] ∑ −1 (n − [nν])4 I ≍ n4 tr2{C(h )C(h )} 1 1 2 n8[nν]4 h1,h2 ∑[nν] [ ] × tr{C(i4 − i1)C(i3 − i2)tr{C(i8 − i5)C(i7 − i6)}

i1≠ i2,i3≠ i4,i5≠ i6,i7≠ i8 [ ] ∑ −1 (n − [nν])4 ∑ ≍ Cn4 tr2{C(h )C(h )} [nν]2([nν] − 1)2 tr2{C(h )C(h )} 1 2 n8[nν]4 1 2 h1,h2 h1,h2 ≤ C(η − ν)4 ≤ C(η − ν)2.

2 For the same idea, we can have I2 ≤ C(η − ν) either. Now we check I3, We have the following [ ] ∑ −1 (n − [nν])2(n − [nη])2 I ≍ n4 tr2{C(h )C(h )} 3 1 2 n8[nν]2[nη]2 h1,h2 ∑[nν] ∑[nη] [ ] × tr{C(i4 − i1)C(i3 − i2)tr{C(j4 − j1)C(j3 − j2)} i ≠ i ,i ≠ i j ≠ j ,j ≠ j 1 2[ 3 4 1 2 3 4 ] ∑ −1 4 2 ≍ Cn tr {C(h1)C(h2)}

h1,h2

55 (n − [nν])2(n − [nη])2 ∑ × [nν]([nν] − 1)[nη]([nη] − 1) tr2{C(h )C(h )} n8[nν]2[nη]2 1 2 h1,h2 ≤ C(1 − ν)2(1 − η)2 ≤ C(η − ν)2.

At last, we consider I4. [ ] ∑ −1 (n − [nν])(n − [nη])3 I ≍ n4 tr2{C(h )C(h )} 4 1 2 n8[nν][nη]3 h1,h2 ∑[nν] ∑[nη] [ ] × tr{C(i2 − j5)C(i1 − j6)tr{C(j4 − j1)C(j3 − j2)}

i1≠ i2 j1≠ j2,j3≠ j4,j5≠ j6 [ ] ∑ −1 (n − [nν])(n − [nη])3 ≍ Cn4 tr2{C(h )C(h )} [nη]([nη] − 1) 1 2 n8[nν][nη]3 h1,h2 ∑ 2 × {[nν]([nν] − 1) + [nν]} tr {C(h1)C(h2)}

h1,h2 ≤ C(1 − η)3(1 − ν) ≤ C(η − ν)2.

2 For the same idea, we can also have I5 ≤ C(η − ν) . Let ν = k/n and η = m/n for 0 ≤ k ≤ m < n for integers k, m and n and using above bounds | 12 − 12 | for the fourth moment of G1 (η) G1 (ν) , we have, for any L > 0,

| (12) − (12) | ≥ ≤ | (12) − (12) |4 4 P ( G1 (k/n) G1 (m/n) L) E G1 (k/n) G1 (m/n) /L

≤ (C/L)4{(m − k)/n}2.

Applying Theorem 10.2 in Billingsley (1999) again, we have

| (12) | ≥ ≤ 4 P ( max G1 (i/n) L) (KC/L ). 1≤i≤n

12 If L is large enough, the above probability could be smaller than any ϵ > 0. Therefore, G1 (ν) is (12) (12) tight. Similarly, we can show the tightness of G2 (ν) and G3 (ν). In summary, we have shown the tightness of G(1)(ν). (2) ˆ(2) ˆ(2) 2 For the tightness of G (ν), note that, under H0, as E(Lt ) = 0, and Var(Lt ) = o(σnt,0), E{|G(2)(ν) − G(2)(η)|2} = o(1), which can also be bounded by C(η − ν)/n, for a positive constant

C. Therefore, following the similar procedure to show the tightness of G(11)(ν), we can also have the tightness of G(2)(ν). −1 ˆ Hence, G(ν) is tight. Toward this end, we have shown that σnt,0Lt converges to a Gaussian process with mean 0 and covariance ΣZ . Finally, applying Lemma 4 in Zhong and Li (2016), we

56 −1 ˆ can obtain the Gumbel limiting distribution as the asymptotic distribution of max1≤t≤n−1 σnt,0Lt. This completes the proof of Lemma 2.2.

ˆ 2 1 Lemma 2.3. Let δt,j, δt,j and σt,j be defined in (2.16), (2.15), and (2.18) respectively. If x = o(n 3 ), and assume (C3)-(C4), then for any t ∈ {1, . . . , n − 1} and any j ∈ {1, . . . , p}, as n → ∞,

|δˆ | √ |δ | √ P ( t,j > x) = {1 + o(1)}I( t,j > x) σt,j σt,j √ δ √ δ + [{Φ(¯ x − t,j ) + Φ(¯ x + t,j )} σt,j σt,j |δ | √ × I( t,j < x)]{1 + o(1)}. σt,j

Proof. By condition (C4), for any i ∈ {1, . . . , n − 1} and j ∈ {1, . . . , p}, Xi,j is a sub-Gaussian random variable. By the property of sub-Gaussian random variable, there exist positive constants g and T such that

2 E{exp(τXi,j)} ≤ exp(gτ ), for |τ| ≤ T.

∈ { − } ∈ { 1 − 1 } | | ≤ Therefore, for any t 1, . . . , n 1 , let ht t , n−t , and ht 1, then we have

{ } ≤ 2 2 | | ≤ E exp(τhtXi,j) exp(ght τ ), for htτ T.

Therefore, htXi,j is still a sub-Gaussian random variable. ∈ { 1 − 1 } 1 1 For ht1 , ht2 t , n−t , and for any a, b > 1 such that a + b = 1, using H¨older’sinequality,

{ } ≤ { } 1/a { } 1/b E[exp τ(ht1 X1j + ht2 X2j) ] [E exp(τht1 a) ] [E exp(τht2 b) ] a ≤ exp{gτ 2(ah2 + h2 )}. t1 a − 1 t2

Minimizing over a > 1, we have

{ } ≤ { 2 2} | | ≤ E[exp τ(ht1 X1j + ht2 X2j) ] exp gτ (ht1 + ht2 ) , for τ(ht1 + ht2 ) 2T.

ˆ Therefore, ht1 X1j + ht2 X2j is also a sub-Gaussian random variable. As δt,j is a linear combination of htXi,j, then for any t ∈ {1, . . . , n − 1} and j ∈ {1, . . . , p}, δˆt,j is a sub-Gaussian random variable.

Toward this end, similar to Lemma 1 in Chen et al. (2019), let Yi,j = Xi,j/σt,j, for i ∈ {1, . . . , n − 1} and j ∈ {1, . . . , p}, and follows Theorem 5.23 of Petrov (1995).

57 √ If |δt,j|/σt,j < x, √ √ |δˆt,j| P ( > x) = P (|Y¯≤t,j − Y¯>t,j| > x) σt,j √ δt,j |δt,j| = P {(Y¯≤t,j − Y¯>t,j) − > x − } σt,j σt,j √ δt,j |δt,j| + P {(Y¯≤t,j − Y¯>t,j) − < − x − } σt,j σt,j √ δ √ δ x3/2 = {Φ(¯ x − t,j ) + Φ(¯ x + t,j )}{1 + O( )}. 1/2 σt,j σt,j n √ On the other hand, if |δt,j|/σt,j > x, as n → ∞, √ √ |δˆt,j| δt,j |δt,j| P ( > x) = 1 − P {(Y¯≤t,j − Y¯>t,j) − < x − } σt,j σt,j σt,j √ δt,j |δt,j| + P {(Y¯≤t,j − Y¯>t,j) − < − x − } σt,j σt,j = 1 + o(1).

In summary,

|δˆ | √ |δ | √ P ( t,j > x) = {1 + o(1)}I( t,j > x) σt,j σt,j √ δ √ δ + [{Φ(¯ x − t,j ) + Φ(¯ x + t,j )} σt,j σt,j |δ | √ × I( t,j < x)]{1 + o(1)}. σt,j

This completes the proof of Lemma 2.3.

58 CHAPTER 3

ONLINE CHANGE-POINT DETECTION IN HIGH-DIMENSIONAL

COVARIANCE STRUCTURE

Online change-point detection or sequential change-point detection, originally arises from the prob- lem of . The product quality is monitored based on the observations continually arriving during an industrial process. A stopping rule is chosen to terminate and reset the process as early as possible when an anomaly occurs. In modern applications, there has been a resurgence of interest in detecting abrupt changes from streaming data with a large number of measurements.

Examples include real-time monitoring for sensor networks and threat detection from surveillance videos. More can be found in studying dynamic connectivity of resting state functional magnetic resonance imaging, and in detecting threat of fake news from the group of fake accounts in social networks (Bara, Fung and Dinh 2015).

In this chapter, we consider online change-point detection in the covariance structure of high- dimensional data. More precisely, letting {X1,X2, ···} be a sequence of continually arriving p- dimensional random vectors, each of which has its own covariance matrix Σi, we consider the hypotheses

H0 :Σ1 = Σ2 = ··· , against

H1 :Σ1 = ··· = Στ ≠ Στ+1 = ··· , (3.1) where τ is some unknown change point. We propose a stopping rule for (3.1), which terminates the process as early as possible after Στ changes to Στ+1. Under the null hypothesis, we derive an explicit expression for the average run length (ARL), so that the level of threshold in the stopping rule can be easily obtained with no need to run time-consuming Monte Carlo simulations. Under the alternative hypothesis, we establish an upper bound for the expected detection delay (EDD), which demonstrates the impact of data dependence and magnitude of change in the covariance structure.

59 The proposed stopping rule is readily applied to detecting network change in high-dimensional online data as the network can be modeled by the covariance matrix. In addition to its practical usefulness, the developed method has several theoretical contributions. First, the stopping rule in- corporates spatial and temporal dependence of data. Rather than assume temporal independence, we estimate the temporal dependence consistently through a data-driven procedure and establish the distribution of the stopping time with the correctly specified dependence. Consequently, the

ARL of the proposed stopping rule can be well controlled even in the presence of temporal depen- dence. Second, the stopping rule can be applied to a wide range of data in that it does not assume

Gaussian distribution, but only requires the existence of the fourth moment of data. Third, the stopping rule is implementable when the dimension p diverges and thus suitable for monitoring modern networks whose size varies enormously from thousands to millions. Finally, we identify the key factors and establish their impact on the EDD through an explicitly derived upper bound.

In particular, we reveal that the EDD based on the L2-norm statistic increases as the strength of temporal dependence increases, but decreases as the magnitude of change ||Στ+1 − Στ ||F /||Στ ||F increases. Here || · ||F represents the matrix Frobenius norm. The rest of this chapter is organized as follows. In Section 3.1, 3.2 and 3.3, we introduce the proposed stopping rule. Section 3.4 presents its asymptotic properties. Simulation studies and real data analysis are given in Sections 3.5 and 3.6, respectively. Technical proofs of main theorems are relegated to Section 3.7.

3.1 Modeling Spatial and Temporal Dependence

Let {Xi, 1 ≤ i ≤ n} be a sequence of p-dimensional random vectors with E(Xi) = µ. We model the sequence by

Xi = µ + ΓiZ for i = 1, ··· , n, (3.2)

× ≥ · ··· T { }m where Γi is a p m matrix with m n p, and Z = (z1, , zm) such that zi i=1 are mutually 4 independent and satisfy E(zi) = 0, Var(zi) = 1 and E(zi ) = 3 + β for some finite constant β. There are two advantages to impose the above model. First, it incorporates both spatial { ≤ ≤ } T ··· T T and temporal dependence of the sequence Xi, 1 i n . Let X = (X1 , ,Xn ) and T ··· T T T × Γ = (Γ1 , , Γn ) . From (3.2), the variance-covariance matrix of X is ΓΓ , in which each p p T ≡ × block diagonal sub-matrix ΓiΓi Σi represents the spatial dependence of each Xi and each p p

60 T ≡ − block off-diagonal sub-matrix ΓiΓj C(j i) describes the spatial and temporal dependence be- T tween Xi and Xj at i ≠ j. Here we require m ≥ n × p to ensure the positive definite of ΓΓ and thus the existence of C(j − i). Second, the model does not assume any distribution of data, but only requires the existence of the fourth moment. In particular, Xi is normally distributed if β = 0. Based on (3.2), we accommodate the spatial and temporal dependence by the following two conditions.

(C1). The sequence is M-dependent, such that for some integer M ≥ 0, C(j − i) ≠ 0 if and only if |j − i| ≤ M. Moreover, under H0 of (3.1), C(j − i) = C(h) for all i and j satisfying j − i = h with h ∈ {0, ±1, ··· , ±M}.

Under the null hypothesis, we assume that the sequence is M-dependent, and the spatial and temporal dependence is stationary. Under the alternative hypothesis, the covariance structure changes and consequently, the stationary of the spatial and temporal dependence cannot hold. We thus only assume the M-dependence. We introduce the M-dependence to relax the commonly as- sumed temporal independence in the literature. As shown in Section 3.7.1, the assumption enables us to establish the asymptotic normality of the test statistic (3.3) through the martingale central limit theorem. Moreover, the M-dependence combined with the stationary in the spatial and tem- poral dependence yields that the stopping time (3.6) converges to the Gumbel limiting distribution of a stationary Gaussian process under the null hypothesis. Under the alternative hypothesis, we impose the M-dependence to generalize the Wald’s lemma from a sum of a random number of in- dependent random variables to that of M-dependent random variables. The generalization enables us to study the EDD of the stopping time even in the presence of temporal dependence (see the proof of Theorem 3.2 in Section 3.7).

(C2). For any h1, h2, h3, h4 ∈ {0, ±1, ··· , ±M}, as p → ∞, [ ] { } { ′ ′ } { ′ ′ } tr C(h1)C(h2)C(h3)C(h4) = o tr C(h1)C(h2) tr C(h3)C(h4) ,

{ ′ ′ ′ ′ } { } where h1, h2, h3, h4 is a permutation of h1, h2, h3, h4 .

If there is no temporal dependence, (C2) becomes tr{C4(0)} = o[tr2{C2(0)}]. It holds if all the eigenvalues of C(0) are bounded but violates under strong dependence such as the compound

61 symmetry covariance structure. If the temporal dependence is present (h ≠ 0), (C2) takes into account for both spatial and temporal dependence. It can be shown that (C2) holds if the re- quirement of bounded eigenvalues is extended to the np × np covariance matrix of entire sequence T T ··· T T × X = (X1 ,X2 , ,Xn ) , each p p block diagonal matrix of which measures the spatial depen- dence of each p-dimensional random vector in the sequence, and each p×p block off-diagonal matrix of which describes the spatio-temporal dependence of two random vectors collected at different time points. The condition cannot hold if the spatial and temporal dependence is too strong so that the covariance matrix of X has unbounded eigenvalues. The advantage of (C2) is that it does not impose any decay structures on C(h) as long as the trace condition is satisfied. Moreover, it allows the dimension p to diverge without imposing its growth rate.

3.2 Test Statistic

Suppose that n observations have been collected. We need a test statistic, the expectation of which can measure the heterogeneity of covariance structure from the collected observations. Assuming for the moment that µ = 0 in (3.2), we propose the following test statistic

1 ∑n Jˆ ≡ W (i, j)(XT X )2, (3.3) n,M n2 M i j i,j=1 ∑ ≡ n−M−2 | − | ≥ where the weight function WM (i, j) t=M+2 At,M (i, j)I( i j M + 1) and

n − t − M t − M A (i, j) = I(i ≤ t)I(j ≤ t) + I(t + 1 ≤ i)I(t + 1 ≤ j) t,M t − M − 1 n − t − M − 1 (t − M)(n − t − M) − {I(i ≤ t)I(t + 1 ≤ j) + I(t + 1 ≤ i)I(j ≤ t)}. − − 1 t(n t) 2 M(M + 1) If µ ≠ 0, a centralized version of (3.3) is

1 ∑n Jˆ∗ = W (i, j){(X − µˆ)T (X − µˆ)}2, (3.4) n,M n2 M i j i,j=1 whereµ ˆ is a consistent estimator of µ. As introduced in Section 3.2.3, the proposed stopping rule needs a training sample andµ ˆ thus can be chosen as the sample mean of the training sample.

Remark 3.1 We first assume a known M to present the main results of the proposed methods.

We then provide a data-driven procedure for estimating M and establish the theoretical results based on the estimated M in Section 3.3.4.

62 Remark 3.2 The test statistics are constructed in several steps. At each t from {M +2, ··· , n−

M − 2}, we first partition the entire sequence {Xi, 1 ≤ i ≤ n} into two segments {Xi, 1 ≤ i ≤ t} and {Xi, t + 1 ≤ i ≤ n}. After utilizing the indicator function I(|i − j| ≥ M + 1) to exclude the interference of C(j − i) with 0 < |i − j| ≤ M, we estimate the two covariance structures separately from the two segments. We then compare the two covariance structures through At,M (i, j), so that the expectation of Jˆn,M is zero under the null hypothesis, but it is non-zero with the maximum attained at the change point under the alternative hypothesis. Finally, we choose WM (i, j) to accumulate all the structural comparisons, each of which is obtained through At,M (i, j). Since the main task is to detect the change in the covariance structure, we assume without further notice that µ = 0 in (3.2), and focus on Jˆn,M to facilitate the theoretical investigation. All Jˆ∗ ̸ the established results can be readily extended to n,M with µ = 0.

Proposition 3.1. Assume (3.2) and (C1). Under the null hypothesis, E(Jˆn,M ) = 0. Under the alternative hypothesis, ∑n ˆ 1 µJˆ ≡ E(Jn,M ) = WM (i, j)tr(ΣiΣj). n,M n2 i,j=1

Since the expectation of Jˆn,M under the alternative hypothesis differs from its expectation under the null hypothesis, it can be used to test heterogeneity of the covariance structure after we standardize it. This requires us to further derive the variance of the test statistic.

Proposition 3.2. Under (3.2) and (C1)-(C2), ∑n ∑n 2 ≡ Jˆ 4 2{ − − }{ } σJˆ Var( n,M ) = WM (i, j)WM (k, l)tr C(i k)C(l j) 1 + o(1) . n,M n4 i,j=1 k,l=1 Under the null hypothesis, (C1) assumes that the spatial and temporal dependence is stationary.

The variance can thus be simplified as ∑n ∑ 2 4 − 2{ } σJˆ = WM (i, j)WM (i h1, j + h2)tr C(h1)C(h2) , (3.5) n,M ,0 n4 i,j=1 h1,h2 where h1, h2 ∈ {0, ±1, ··· , ±M}.

3.3 Stopping Rule

∗ ∗ − F ∗ { ∗ } ≥ F ∗ {∅ } Let n = n n0. Let n = σ X1,...,Xn0 ,Xn0+1,...,Xn0+n if n 1 and n = , Ω if ∗ ∞ {F ∗ } F n = 0. Then n 0 is an increasing sequence of σ-fields on a probability space (Ω, ∞,P ).

63 ∞ { ∗ } ∗ Moreover, Xn0+n n =1 is a stationary and M-dependent sequence of random vectors adapted to ∞ {Fn∗ } . Define the proposed stopping rule 1 { } Jˆ − n,M,H TH (a, M) = inf n n0 : > a, n > n0 , (3.6) σˆn0,M,H then TH (a, M) is a stopping time relative to {Fn∗ }. The proposed stopping rule terminates the detection process in a minimal number of new observations, when the absolute value of the stan- dardized test statistic is above a threshold. Some key observations about the stopping rule are as follows.

First, 1 ∑n Jˆ = W (i, j)(XT X )2, n,M,H H2 M i j i,j=n−H+1 which is the test statistic Jˆn,M (3.3) based on past H observations from the current time n. Here H is the window-size and chosen to reduce the computational time. It is not rare to utilize a moving window H for the online change point detection; see, for example, Lai (1995), Cao et al. (2019) and Chen (2019). We display its effect on our stopping rule explicitly through asymptotic results in next section and simulation studies in Section 3.5. Some other guidelines in selecting the window size can be seen in Lai (1995).

Second, similar to Pollak and Siegmund (1991), we consider a training sample of size n0 to provide an estimation of M for dependence and the standard deviation of Jˆn,M,H under the null hypothesis. Estimating M will be covered in Section 3.3.4. To estimate the standard deviation of

Jˆn,M,H under the null hypothesis, we only need to estimate tr{C(h1)C(h2)} because it is the only unknown indicated in (3.5). Based on the training sample, it is estimated by ∗ 1 ∑ tr{C(\h )C(h )} = XT X XT X , (3.7) 1 2 n∗ t+h2 s s+h1 t s,t ∑∗ where represents the sum of indices that are at least M apart in the training sample, and n∗ be the corresponding number of indices. As a result, the estimator of the variance of Jˆn,M,H is ∑H ∑ 4 \ σˆ2 = W (i, j)W (i − h , j + h )tr2{C(h )C(h )}. (3.8) n0,M,H H4 M M 1 2 1 2 i,j=1 h1,h2 The consistency of the estimator will be established in Theorem 3.3 in Section 3.4.3.

At last, a is the threshold and should be chosen to control the ARL at any pre-specified value.

Theorem 3.1 in next the section will provide a result for selecting a for this purpose.

64 3.4 Asymptotic Results

3.4.1 Average run length

Let E∞ and P∞ denote the expectation and probability, respectively, under the null hypothesis.

Let √ √ g(t/H, a) = 2 log(t/H) + 1/2 log log(t/H) + log(4/ π) − a 2 log(t/H).

The ARL is defined to be the expected value of the stopping time under the null hypothesis.

The following theorem establishes the ARL or E∞{TH (a, M)} for the proposed stopping rule (3.6).

Theorem 3.1. Assume (3.2) and (C1)-(C2). As p → ∞, and both H and a → ∞ satisfying

H = o{exp(a2/2)},

( ∫ ∞ [ { }] ) E∞{TH (a, M)} = H + exp −2exp g(t/H, a) dt {1 + o(1)}. H

As shown in the proof of Theorem 3.1, the ARL is readily obtained by establishing the cumula- tive distribution function of TH (a, M) as a → ∞. Since the randomness of TH (a, M) is determined Jˆ by n,M,H /σˆn0,M,H , the cumulative distribution of the former can be derived by establishing the asymptotic distribution of the latter when p and H → ∞. Here the condition H = o{exp(a2/2)} specifies the growth rate of H with respect to a. It is imposed to ensure that the probability the procedure stops within the window H goes to zero exponentially fast.

Theorem 3.1 states that the ARL depends on the threshold a and the window-size H. In particular, it increases as a increases when H is fixed. This can also be seen from the proposed stopping rule (3.6), where raising a makes the standardized test statistic less likely to go beyond the a when there is no change point. The practical usefulness of Theorem 3.1 is that with any pre-specified ARL and H, we can quickly determine the value of a by solving the equation rather than running time-consuming Monte Carlo simulations.

3.4.2 Expected detection delay

When there is a change point τ, the proposed stopping rule is conventionally examined by the expected detection delay (EDD), Eτ {TH (a, M) − (τ − n0)|TH (a, M) > τ − n0} with τ ≥ n0. In the literature, it is customary to consider the EDD for the so-called immediate change point; see, for example, Siegmund and Venkatraman (1995) and Xie and Siegmund (2013). In terms

65 of our configuration, it refers to the change occurs immediately after the training sample n0 and the corresponding EDD is E0{TH (a, M)}. The main reason to consider the EDD of the immediate change point is that for many stopping rules, the supremum of all the EDDs attains at the immediate change point. It is therefore important to see if such property is still held by our proposed stopping rule. We establish the following theorem which confirms this conclusion. More importantly, the theorem provides an upper bound for the EDDs.

Theorem 3.2. Assume (3.2) and (C1)-(C2). Consider the change point τ ≥ n0. As p → ∞, and both H and a → ∞ satisfying a = O(Hr) with 1/2 ≤ r < 1,

sup Eτ {TH (a, M) − (τ − n0)|TH (a, M) > τ − n0} = E0{TH (a, M)}, and √ a · H · σH,M,0 E0{TH (a, M)} ≤ (M + 2) + {1 + o(1)}, ||Στ+1 − Στ ||F where σH,M,0 is obtained by replacing n with H in (3.5), and ||·||F represents the matrix Frobenius Norm.

Theorem 3.2 demonstrates the impact of some key factors on the EDD. First, a larger M could lead to a greater EDD, showing the adverse effect of the dependence on change-point detection.

Second, the impact of the threshold a on the EDD essentially depends on the choice of ARL, because a is obtained by solving the equation in Theorem 3.1 in which the window size H and ARL are pre-specified by the user. Generally speaking, a larger user-chosen ARL leads to a higher value 1/2 || − ||−1 of a and thus a greater EDD. Finally, the impact of σH,M,0 Στ+1 Στ F can be demonstrated 1/2 || || by applying the result σH,M,0 = O( Στ F ) from the proof of Theorem 3.1 to obtain √ ( ) σH,M,0 ||Σ || = O τ F . ||Στ+1 − Στ ||F ||Στ+1 − Στ ||F

The result shows that the EDD can be significantly reduced by increasing the ratio of the change in covariance structure to the original covariance.

Remark 3.2 It requires a minimum change in the covariance structure, for the proposed stop- ping rule to detect the change point. To understand this, we consider the configuration with the immediate change after the training sample. As the window continuously moves to the right, the number of observations with Στ decreases but the number of observations with Στ+1 increases. If the detection procedure has not yet stopped when the last observation with Στ begins to leave the

66 window, it probably won’t be able to stop because the process ends up with all the H observations having the same Στ+1. Theorem 3.2 actually provides a minimum change the proposed stopping rule requires. By noticing that the right-hand side of the inequality in Theorem 3.2 must be no more than H, the change in covariance structure √ ||Στ+1 − Στ ||F ≥ a/H||Στ ||F , √ where a/H||Στ ||F is, therefore, the minimum change in the covariance structure the proposed stopping rule is able to detect. To provide an insight of the result, we consider Στ = Ip where

|i−j| p = 1000, and Στ+1 = (ρ ) where 0 < ρ < 1. Further, we choose H = 100 and obtain a = 3.58 by solving the equation in Theorem 1 so that the ARL is controlled around 5000. Solving √ ||Στ+1 −Στ ||F = a/H||Στ ||F , we obtain the minimum ρ for the stopping rule to detect the change is 0.133.

Remark 3.3 The proposed stopping rule is based on the L2-norm statistic. When Στ+1 differs from Στ in a large number of components, the stopping rule is advantageous as the detection delay can be significantly reduced by accumulating all the differences through ||Στ+1 − Στ ||F . When

Στ+1 differs from Στ only in a sparse number of components, the components without the change do not contribute to ||Στ+1 −Στ ||F but to ||Στ ||F , which may lead to a large ||Στ ||F /||Στ+1 −Στ ||F and thus a long detection delay. To reduce the detection delay under the sparse situation, we can rewrite the test statistic in the stopping rule into

2 ∑ ∑n Jˆ = W (i, j)Y Y , n,M,H H2 M i,kl j,kl 1≤k≤l≤p i,j=n−H+1 where Yi,kl = Xi,kXi,l and Xi,k is the kth component of the p-dimensional random vector Xi, and remove the elements Yi,klYj,kl with no change. Since the screening must be conducted through a data-driven approach, its effect on the ARL and EDD deserves some future research efforts.

3.4.3 Change-point testing in the training sample

To implement the stopping rule, we need a training sample which has not any change points in covariance structure. A training sample can be historical observations from previous experimental runs subject to similar experimental conditions after their stationarity of the covariance structure has been confirmed. To know whether a sample {Xi, 1 ≤ i ≤ n0} is qualified as a training sample,

67 we need to consider the hypotheses

∗ ··· H0 :Σ1 = = Σn0 , against ∗ ··· ̸ ··· ̸ ··· H1 :Σ1 = = Στ1 = Στ1+1 = = Στq = Στq+1 = = Σn0 , (3.9) where 1 ≤ τ1 < ··· < τq < n0 are unknown change points. This is an offline testing problem as the Jˆ sample has been collected. We consider the test statistic n0,M obtained by replacing n with n0 in (3.3) in that its expectation can distinguish the alternative from the null hypothesis. The following Jˆ theorem establishes the asymptotic normality of n0,M .

Theorem 3.3. Assume (3.2) and (C1)-(C2). As n0 → ∞,

Jˆ − n0,M µJˆ n0,M −→d N(0, 1), σJˆ n0,M where µJˆ and σJˆ are given by Propositions 3.1 and 3.2, respectively, with n replaced by n0,M n0,M ∗ n0. In particular, under H0 of (3.9), Jˆ n0,M −→d N(0, 1), σˆJˆ n0,M ,0 whereσ ˆJˆ is defined in (3.8) with n replaced by n0. n0,M ,0 ∗ Jˆ From Theorem 3.3, we reject H0 of (3.9) with a nominal significance level α if n0,M /σˆJˆ > n0,M ,0 zα, where zα is the upper α-quantile of the standard normal distribution. Otherwise, we fail to ∗ reject H0 and hereby obtain a training sample for the proposed stopping rule.

3.4.4 Stopping rule with estimated M ··· The unknown M in stopping rule (3.6) can be estimated through the training sample X1, ,Xn0 .

From (C1), we know that Cov(Xi,Xj) = C(i − j) is zero if and only if |i − j| > M, or equivalently, tr{C(h)CT (h)} is zero if and only if |h| > M. We thus estimate M through the following steps.

• Using (3.7), we compute tr{C(\h)C(−h)}/tr{C\(0)C(0)} with h starting from 0.

• We terminate the process when the first non-negative integer h∗ satisfies

tr{C(h\∗)C(−h∗)} ≤ ϵ, tr{C\(0)C(0)}

where ϵ is a small constant and can be chosen to be 0.05 in practice.

68 • We then estimate M by Mˆ = h∗ − 1.

Let TH (a, Mˆ ) be the stopping rule obtained by replacing M with Mˆ in (3.6). The follow- ing theorem shows that TH (a, Mˆ ) performs as well as TH (a, M) under both null and alternative hypotheses.

Theorem 3.4. Assume the same conditions in Theorems 3.1 and 3.2. As the training sample size n0 → ∞,

E∞{TH (a, Mˆ )} = E∞{TH (a, M)}, E0{TH (a, Mˆ )} = E0{TH (a, M)}.

3.5 Simulation Studies

3.5.1 Accuracy of the theoretical ARL

We first evaluate the performance of the stopping rule under the null hypothesis. The random vectors Xi for i = 1, 2, ··· are generated from

∑M Xi = Γl ϵi−l, (3.10) l=0

|i−j| −1 where the p × p matrix Γl = {0.6 (M − l + 1) } for i, j = 1, ··· , p, and l = 0, ··· ,M. Each

ϵi is a p-variate random vector with mean 0 and identity covariance Ip, and all ϵis are mutually independent. If M = 0, all Xis are mutually independent from (3.10) and each individual Xi has ∑ T ̸ M−(j−i) T − ··· the covariance matrix Γ0Γ0 . If M = 0, Cov(Xi,Xj) = l=0 Γj−iΓl for j i = 0, ,M. We choose the dimension p = 200, 400 and 1000, the size of historical data n0 = 200, the window-size H = 100 and 150, and dependence M = 0, 1, 2, respectively.

To examine the accuracy of the theoretical ARL, we first specify its value and obtain the corresponding a by solving the equation in Theorem 3.1. Based on the a, we obtain the Monte

Carlo ARL by taking the average of the stopping times from 1000 simulations. Table 8 compares the theoretical ARLs with the corresponding Monte Carlo ARLs under different combinations of

H, p and M. All the Monte Carlo ARLs are reasonably close to the theoretical ARLs, subject to some random variations from simulations under different M and p.

69 Table 8: The comparison between theoretical ARLs and Monte Carlo ARLs based on 1000 simu- lations. For each ARL and window-size H, the threshold a is obtained by solving the equation in

Theorem 3.1.

H = 100

Theoretical p = 200 p = 400 p = 1000

(a, ARL) M = 0 1 2 M = 0 1 2 M = 0 1 2

(3.04, 1002) 1178 1151 1194 1245 1284 1317 1302 1295 1335

(3.42, 3008) 3067 3148 2986 3690 3614 3529 3850 3954 3617

(3.58, 5038) 5118 4527 4253 5799 5923 5212 6570 6102 5878

H = 150

p = 200 p = 400 p = 1000

M = 0 1 2 M = 0 1 2 M = 0 1 2

(2.88, 1005) 1044 1127 1149 1069 1173 1308 1145 1198 1270

(3.29, 3033) 3240 3156 3202 3505 3795 3628 3652 3759 3931

(3.46, 5118) 5120 5097 5280 6083 5820 6156 6162 6586 6794

3.5.2 Accuracy of upper bound for EDD

We next evaluate the performance of the stopping rule under the alternative hypothesis. In partic- ular, we examine the accuracy of the upper bound for the EDD in Theorem 3.2. In the simulation studies, we consider an immediate change, namely the change at τ = 201 immediately after the historical data of size n0 = 200. Before the change point τ, the observations Xi for i = 1, ··· , 200

−1 are generated from (3.10) where Γl = I(M − l + 1) with I being the identity matrix. After the

−1 change, Γl = Q(M − l + 1) in (3.10) where the p × p matrix Q is modeled by one of the following patterns.

T |i−j| (a). Q satisfies QQ = Σ, where Σij = ρ for 1 ≤ i, j ≤ p.

(b). Each row of Q has only three non-zero elements that are randomly chosen from {1, ··· , p}

with magnitude ρ multiplied by a random sign.

70 Table 9: The comparison between theoretical upper bounds for EDDs and Monte Carlo EDDs based on 1000 simulations with the ARL controlled around 5000.

ρ = 0.6 0.7 0.8

M = 0 1 2 M = 0 1 2 M = 0 1 2

Model (a)

H = 100 Monte Carlo 16.18 20.14 24.04 11.31 14.34 16.98 8.11 10.31 12.44

Theoretical 20.59 23.63 25.99 16.23 18.79 20.83 12.46 14.61 16.38

H = 150 Monte Carlo 17.49 21.62 25.45 12.34 15.38 18.56 8.90 11.32 13.37

Theoretical 24.36 28.10 31.04 19.11 22.21 24.70 14.59 17.13 19.22

Model (b)

H = 100 Monte Carlo 4.36 5.84 7.16 3.58 4.71 5.87 3.10 4.13 5.06

Theoretical 7.42 9.09 10.58 6.07 7.40 8.76 5.11 6.42 7.60

H = 150 Monte Carlo 4.79 6.38 7.68 3.85 5.13 6.58 3.27 4.50 5.32

Theoretical 8.53 10.38 11.92 6.80 8.45 9.88 5.74 7.19 8.45

Model (c)

H = 100 Monte Carlo 2.84 3.90 4.94 2.68 3.68 4.78 2.63 3.69 4.72

Theoretical 3.04 4.15 6.23 2.89 3.99 5.05 2.78 3.87 4.92

H = 150 Monte Carlo 2.96 3.94 5.09 2.89 3.93 4.91 2.72 3.76 4.84

Theoretical 3.25 4.40 5.51 3.07 4.20 5.30 2.94 4.05 5.13

T (c). Q satisfies QQ = Σ, where Σii = 1 for i = 1, ··· , p, and Σij = ρ for i ≠ j.

Models (a)–(c) specify the bandable, sparse and strong covariance matrices, respectively. We choose ρ = 0.6, 0.7, 0.8 to obtain different magnitudes in the covariance change, and choose the dimension p = 1000, the window-size H = 100 and 150, and dependence M = 0, 1, 2, respectively.

Moreover, the threshold a = 3.58 when H = 100 and a = 3.46 when H = 150 so that the theoretical

ARL is controlled around 5000. Table 9 compares the theoretical bound for the EDD in Theorem

3.2 with the corresponding Monte Carlo EDD based on 1000 simulations. As we can see, each

Monte Carlo EDD is no more than its theoretical upper bound. Furthermore, both Monte Carlo

EDDs and theoretical bounds decrease as ρ increases with the same M and H, but increase as

71 Model (a) H = 100 ρ = 0.6 Model (a) H = 150 ρ = 0.6 100 100 80 80 60 60 40 40 Detection Delay Detection Delay 20 20 0 0

Max−Type General Our Method Max−Type General Our Method Methods Methods

Model (a) H = 100 ρ = 0.8 Model (a) H = 150 ρ = 0.8 50 40 40 30 30 20 20 Detection Delay Detection Delay 10 10 0 0

Max−Type General Our Method Max−Type General Our Method Methods Methods

Figure 8: Boxplots of EDDs for “Max-type” and “General” stopping rules in Chen (2019) and Chu and Chen (2018), and the proposed stopping rule. The results are based on 1000 simulations under model (a).

M increases with the same ρ and H. The simulation results are consistent with the theoretical

findings in Theorem 3.2.

We also compare the proposed stopping rule with some other stopping rules in the literature.

Based on different edge-count statistics, Chen (2019) and Chu and Chen (2018) propose a series of stopping rules, among which the ones based on the generalized edge-count statistic and based

72 Model (b) H = 100 ρ = 0.6 Model (b) H = 150 ρ = 0.6 50 80 40 60 30 40 20 Detection Delay Detection Delay 20 10 0 0

Max−Type General Our Method Max−Type General Our Method Methods Methods

Model (b) H = 100 ρ = 0.8 Model (b) H = 150 ρ = 0.8 50 80 40 60 30 40 20 Detection Delay Detection Delay 20 10 0 0

Max−Type General Our Method Max−Type General Our Method Methods Methods

Figure 9: Boxplots of EDDs for “Max-type” and “General” stopping rules in Chen (2019) and Chu and Chen (2018), and the proposed stopping rule. The results are based on 1000 simulations under model (b).

on the max-type edge-count statistic are more effective in detecting changes. The generalized and max-type stopping rules are based on a non-parameter framework that utilizes nearest neighbor information among observations. The implementation of these two stopping rules is available in the R package gStream. Similar to the authors, we choose a relatively larger nearest neighbors 5 to gain more information. The ARL is specified at 5000. Since they assume the observations are

73 Model (c) H = 100 ρ = 0.6 Model (c) H = 150 ρ = 0.6 20 25 20 15 15 10 10 Detection Delay Detection Delay 5 5 0 0

Max−Type General Our Method Max−Type General Our Method Methods Methods

Model (c) H = 100 ρ = 0.8 Model (c) H = 150 ρ = 0.8 20 25 20 15 15 10 10 Detection Delay Detection Delay 5 5 0 0

Max−Type General Our Method Max−Type General Our Method Methods Methods

Figure 10: Boxplots of EDDs for “Max-type” and “General” stopping rules in Chen (2019) and

Chu and Chen (2018), and the proposed stopping rule. The results are based on 1000 simulations under model (c).

temporally independent, we consider M = 0. Other setups are specified at the beginning of this section. Note that the stopping rules in Chen (2019) and Chu and Chen (2018) are proposed to detect the change point in the distribution. When the change in distribution is indeed caused by the covariance structure, Figures 8-10 show that the proposed stopping rule performs better with much smaller EDDs than the two competitors.

74 Histogram of selected M when M=0 Histogram of selected M when M=1

99 96 100 100 80 80 60 60 Frequency Frequency 40 40 20 20

4 1 0 0 0 0 0 0 0 0 0 0

0 1 2 3 4 5 0 1 2 3 4 5

Figure 11: Histograms of selected M by the proposed data-driven procedure when the actual M = 0 and 1. The results are based on 100 simulations.

3.5.3 Accuracy of the data-driven procedure for M

In the last part of simulation studies, we examine the data-driven procedure proposed in Section

3.4.4 for estimating M. For each simulation, a training sample of 200 observations is generated from

(3.10) with p = 1000. Figure 11 illustrates the histograms of selected M based on 100 simulations when the actual M = 0 and 1. With 99 and 96 successes respectively, the proposed data-driven procedure demonstrates its satisfactory performance for estimating the M.

3.6 Case Study

Resting-state fMRI is a method to explore the brain’s internal dynamic networks. We apply the pro- posed method to a resting-state fMRI dataset obtained from the 2017 Human Connectome Project

(HCP) data release. The data consist of 300 independent component analysis (ICA) component nodes (p = 300) repeatedly observed over 1200 time points, collected for each of 1003 subject- s. The publicly accessible dataset together with details about data acquisition and preprocessing procedures can be found in HCP website (http://www.humanconnectome.org).

We detect the change in a real-time manner, in the sense that we pretend the observations in the dataset continually arrive in time. At each time, we determine whether the process should be terminated through the proposed stopping rule. Note that the proposed stopping rule is designed only for detecting the covariance change. When a detection process involves a change in the mean,

75 Figure 12: Online change-point detection in the covariance structure of subject 103010 (upper panel) and subject 130417 (lower panel). Each panel illustrates the estimated correlation matrices before and after the estimated change point.

it cannot be detected by the proposed stopping rule. Despite such a limitation, we still apply the stopping rule to the dataset for the covariance change as the main interest of using the resting-state fMRI is to study the dynamic nature of brain connectivity (Cribben et al., 2013; Jeong et al., 2016).

While there are 1003 subjects in the dataset, we randomly choose two subjects 103010 and

130417 to demonstrate the practical usefulness of the proposed method. The proposed stopping rule needs a training sample. We pretend that the first 200 observations of each time series are historical, and further justify their stationarity in the covariance structure through the testing procedure in Section 3.3. Here we use a relatively large training sample size 200 to attain precise

76 estimation of nuisance parameters. Based on the training sample, we estimate M by 5 for the subject 103010 and 6 for the subject 130417 using the method in Section 3.4 and obtain the sample meanµ ˆ and the sample standard deviation of the test statistic using (3.8) in Section 3.3. Choosing the threshold a = 3.58 so that the ARL is controlled around 5000, we apply the proposed stopping rule with the window size H = 100 and terminate the process at the time 287 for the subject 103010 and the time 245 for the subject 130417.

With each of the stopping times 287 and 245, we pull out the observations from time 1 to the stopping time and conduct some post analyses. The first analysis is the change-point estimation.

Similar to Bai (2010), the change point is estimated by

Jˆ τˆ = arg max t,M,Hˆ , 1

Jˆ Jˆ where t,M,Hˆ is obtained by replacing WM (i, j) in n,M,H with At,Mˆ (i, j) defined in (3.3). The ˆ rationale of using the above estimator is that the expectation of Jt,M,Hˆ always attains its maximum at the true change point, as mentioned in Remark 3.2 of Section 3.2. The estimated change points are 264 for the subject 103010 and 228 for the subject 130417. With the two stopping times 287 and 245, the corresponding detection delays are 23 for the subject 103010 and 17 for the subject

130417, showing that the proposed stopping rule can quickly terminate the process when the brain’s network change occurs.

The second analysis is illustrating the actual change in the brain’s network. For each subject, we estimate the correlation matrices before and after the estimated change point using the glasso.

The obtained results for the two subjects are summarized in Figure 12, which clearly illustrates the brain’s internal networks become stronger after the estimated change points. The results are consistent with recent studies that during the resting state, the brain’s networks activate when a subject focuses on internal tasks, and exhibit dynamic changes within time scales of seconds to minutes (Allen et al. 2014; Calhoun et al. 2014; Chang and Glover 2010; Cribben et al. 2012;

Handwerker et al. 2012; Hutchison et al. 2013b; Jeong et al. 2016; Monti et al. 2014).

77 3.7 Technical Details

3.7.1 Proofs of main theorems

Proof of Proposition 3.1.

From (3.3), Xi and Xj in Jˆn,M are M apart, because of the indicator function in WM (i, j). Using

(C1), we see that Xi and Xj are independent. As a result, ∑ ∑ Jˆ −2 T T −2 E( n,M ) = n WM (i, j)E(Xi XjXj Xi) = n WM (i, j)tr(ΣiΣj) i,j i,j by the model (3.2), which gives the expectation under the alternative hypothesis. Specially, under the null hypothesis, tr(Σ2) ∑ E(Jˆ ) = W (i, j) = 0, n,M n2 M ∑ i,j because i,j WM (i, j) = 0. This completes the proof of Proposition 3.1.

Proof of Proposition 3.2. ∑ ∑ Jˆ Jˆ2 − 2 Jˆ 2 Jˆ −4 Note that Var( n,M ) = E( n,M ) E ( n,M ), where E ( n,M ) = n i,j k,l WM (i, j) Jˆ2 WM (k, l)tr(ΣiΣj)tr(ΣkΣl) from Proposition 3.1. We thus only need to derive E( n,M ), which, from (3.2) and (C1), is 1 ∑ ∑ E(Jˆ2 ) = W (i, j)W (k, l)E(XT X XT X XT X XT X ) n,M n4 M M i j j i k l l k i,j k,l 1 ∑ ∑ 4 ∑ ∑ = W (i, j)W (k, l)tr(Σ Σ )tr(Σ Σ ) + W (i, j)W (k, l) n4 M M i j k l n4 M M i,j k,l i,j k,l 1 ∑ ∑ × tr2{C(i − k)C(l − j)} + W (i, j)W (k, l) n4 M M [ i,j k,l

16tr{ΣlC(i − k)ΣjC(k − i)} + 4tr{C(k − i)C(j − l)C(k − i)C(j − l)} T T ◦ T T T T ◦ T T + 8βtr(Γi ΓjΓj Γi Γk ΓlΓl Γk) + 8βtr(Γi ΓjΓk Γl Γi ΓjΓk Γl) ∑ ] 2 T T T ◦ T T T + 2β tr(Γj ΓiememΓk Γl Γj ΓiememΓk Γl) , m where for any square matrices A and B, the symbol A ◦ B = (aijbij), and em is the unit vector with

2 the only non-zero element at the mth component. By applying (C2) and subtracting E (Jˆn,M ) in Proposition 3.1, we have 4 ∑ ∑ E(Jˆ2 ) = W (i, j)W (k, l)tr2{C(i − k)C(l − j)}{1 + o(1)}. n,M n4 M M i,j k,l

78 This completes the proof of Proposition 3.2.

Proof of Theorem 3.1

We need to derive the cumulative distribution function of TH (a, M). From (3.6), ( ) Jˆ n0+i,M,H P∞{T (a, M) ≤ t} = P∞ max > a . H ≤ ≤ 0 i t σˆn0,M,H

Jˆ The cumulative distribution function of TH (a, M) thus depends on n0+i,M,H /σˆn0,M,H , which will be shown to converge to a stationary Gaussian process. Jˆ ≡ Jˆ ≡ To simplify notation, we let n0+i,M,H i,M , andσ ˆn0,M,H σˆ0. The Gaussian process can −1Jˆ −1Jˆ ′ be established by showing (i) the joint asymptotic normality of (ˆσ0 i1,M ,..., σˆ0 id,M ) for any ··· −1Jˆ i1 < i2 < < id. (ii) the tightness ofσ ˆ0 i,M . To prove (i), we apply the Cram´er-Wold device ∑ ··· d −1 Jˆ to show that for any non-zero a1, , ad, l=1 σˆ0 al il,M is asymptotic normal. Since the proof is similar to that of Theorem 3.3, we omit it. We thus only need to prove (ii).

Toward this end, we first obtain the leading order of Var(Jˆi,M ), which is

4 ∑H ∑ Var(Jˆ ) = W (i, j)W (i − h , j + h )tr2{C(h )C(h )} i,M H4 M M 1 2 1 2 i,j=1 h1,h2 ( ) 4 ∑ 6π2 − 51 = tr2{C(h )C(h )} H4 . H4 1 2 18 h1,h2 ∈ { } Jˆ Jˆ Let i1, i2 1, . . . , t . Next, we derive the leading order of Cov( i1,M , i2,M ), which equals Jˆ Jˆ ≡ − E( i1,M i2,M ) when there is no any change point. Let id i2 i1. ∈ { − } − Jˆ Jˆ For id 1,...,H 1 and H id = O(H), under (C1), the leading order of Cov( i1,M , i2,M ) depends on id can be derived to be { } 4 ∑ 6π2 − 51 Cov(Jˆ , Jˆ ) = tr2{C(h )C(h )} (H − i )4 . i1,M i2,M H4 1 2 18 d h1,h2 ∈ { − } − ≥ Jˆ Jˆ For id 1,...,H 1 and H id = o(H), or for id H, Cov( i1,M , i2,M ) can be shown is the smaller order of Var(Jˆi,M ), i.e. [ ( )] 4 ∑ 6π2 − 51 Cov(Jˆ , Jˆ ) = o tr2{C(h )C(h )} H4 . i1,M i2,M H4 1 2 18 h1,h2 −1Jˆ −1Jˆ We want to show the tightness of σ0 i,M . Then the tightness ofσ ˆ0 i,M can be established by the Slutsky theorem becauseσ ˆ0 is ratio-consistent to σ0 according to Theorem 3.3. Consider

79 i = q∗ · t, for q∗ = i/t ∈ (0, 1), with i = 1, . . . , t. It is equivalent to show the tightness of G(i/t), ∗ −1Jˆ ∗ ∗ where G(i/t) = G(q ) = σ0 i,M . For 0 < q < r < 1,

| ∗ − ∗ |2 −1 |Jˆ − Jˆ |2 E G(r ) G(q ) = σ0 E i1,M i1,M −1{ Jˆ2 Jˆ2 − Jˆ Jˆ } = σ0 E( i1,M ) + E( i2,M ) 2E( i1,M i2,M ) .

When there is no any change point,

E(Jˆ2 ) = E(Jˆ2 ) = Var(Jˆ ) i1,M i2,M ( i,M ) 4 ∑ 6π2 − 51 = tr2{C(h )C(h )} H4 {1 + o(1)}. H4 1 2 18 h1,h2

For any i1, i2 ∈ {1, . . . , t}, and i2 − i1 = id ∈ {1,...,H − 1}, as H → ∞, ∑ 4 2 4 4 (4/H ) tr {C(h1)C(h2)}2{H − (H − id) } | ∗ − ∗ |2 ≤ h1,h2 ∑ E G(r ) G(q ) C 4 2 4 (4/H ) tr {C(h1)C(h2)}H ( ) h1,h2 i ≤ C d . H

For id ≥ H, ∑ 4 2 4 4 C(4/H ) tr {C(h1)C(h2)}2{H + o(H )} E|G(r∗) − G(q∗)|2 ≤ h1,h∑2 ≤ C. (4/H4) tr2{C(h )C(h )}H4 h1,h2 1 2

Therefore, by Chebyshev’s inequality, if 1 ≤ id ≤ H − 1,

E|G(r∗) − G(q∗)|2 P(|G(r∗) − G(q∗)| ≥ λ) ≤ ≤ (C/λ2)(i /H). λ2 d

Let H/t = d, then ∗ ∗ ∗ ∗ id/H = (i2 − i1)/H = (r − q )t/H = (r − q )/d,

∗ ∗ ∗ ∗ and {(r − q )/d} ∈ (0, 1). If id ≥ H, or equivalently {(r − q )/d} ≥ 1,

E|G(r∗) − G(q∗)|2 P(|G(r∗) − G(q∗)| ≥ λ) ≤ ≤ C/λ2. λ2

Let   (r∗ − q∗)/d, if r∗ − q∗ < d ∗ ∗ fd{(q , r ]} =  1, if r∗ − q∗ ≥ d, then

∗ ∗ 2 ∗ ∗ P(|G(r ) − G(q )| ≥ λ) ≤ (C/λ )fd{(q , r ]}.

80 Let ξi = G(i/t) − G{(i − 1)/m}, for i = 1, . . . , t. Then Si = ξ1 + ··· + ξi = G(i/t) with S0 = 0. Therefore, | − | ≥ ≤ 2 { ∗ ∗ } P( Si2 Si1 λ) (C/λ )fd (q , r ] .

∗ ∗ ∗ ∗ ∗ ∗ For any 0 < p < q < r < 1, G(p ) = Si0 , G(q ) = Si1 and G(r ) = Si2 , respectively. Let m∗ = |G(q∗) − G(p∗)| ∧ |G(r∗) − G(q∗)|. Then [ ] P(m∗ ≥ λ) = P {|G(q∗) − G(p∗)| ≥ λ} ∩ {|G(r∗) − G(q∗)| ≥ λ}

≤ 1/2 | − | ≥ · 1/2 | − | ≥ P ( Si1 Si0 λ) P ( Si2 Si1 λ) ≤ 1/2{ ∗ ∗ } 1/2{ ∗ ∗ } (C/λ)fd (p , q ] (C/λ)fd (q , r ]

2 ∗ ∗ ∗ ∗ ≤ (C/λ )[fd{(p , q ]} + fd{(q , r ]}].

If q∗ − p∗ < d and r∗ − q∗ < d, or equivalently r∗ − p∗ < 2d, { } ( ) q∗ − p∗ r∗ − q∗ r∗ − p∗ P(m∗ ≥ λ) ≤ (C/λ2) + ≤ (C/λ2) . d d d

If q∗ − p∗ < d and r∗ − q∗ ≥ d, or q∗ − p∗ ≥ d and r∗ − q∗ < d, but r∗ − p∗ < 2d, { } q∗ − p∗ P(m∗ ≥ λ) ≤ (C/λ2) + 1 ( d ) ( ) q∗ − p∗ r∗ − q∗ r∗ − p∗ ≤ (C/λ2) + ≤ (C/λ2) . d d d

If q∗ − p∗ < d and r∗ − q∗ ≥ d, or q∗ − p∗ ≥ d and r∗ − q∗ < d, but r∗ − p∗ ≥ 2d, { } q∗ − p∗ P(m∗ ≥ λ) ≤ (C/λ2) + 1 ≤ 2C/λ2. d

If q∗ − p∗ ≥ d and r∗ − q∗ ≥ d, and r∗ − p∗ ≥ 2d,

P(m∗ ≥ λ) ≤ 2C/λ2.

Let    r∗−p∗ 1 ∗ ∗ ( ) 2α , if r − p < 2d { ∗ ∗ } d µα,d (p , r ] =   1 ∗ ∗ 2 2α , if r − p ≥ 2d,

1 { ∗ ∗ } ∗ ∗ ∗ ∈ where α > 2 . Then µα,d (p , r ] is a finite measure on T = (0, 1]. For any ϵ > 0 and p , q , r T = (0, 1], ∗ ≥ ≤ 2 2α { ∗ ∗ } P(m λ) (C/λ )µα,d (p , r ] .

81 Let

∗ | − | ∧ | − | L(G) = sup m = max Si1 Si0 Si2 Si1 . p∗≤q∗≤r∗ i0≤i1≤i2

Using Theorem 10.3 in Billingsley (1999), we conclude

KC P {L(G) ≥ λ} ≤ µ2α {(0, 1]}, λ2 α,d

≫ − 2α { } where K is a constant. As t H, d = H/t is close to zero, and 2d < (1 0). Hence, µα,d (0, 1] = 2, and 2KC P {L(G) ≥ λ} ≤ . λ2 From (10.4) in Billingsley (1999), we obtain

max |Si| ≤ L(G) + |St|. 1≤i≤t

| |2 −2 Jˆ2 Since E St = σ0 E( t,M ) = 1, we have { } ( ) 1 1 P( max |Si| ≥ λ) ≤ P L(G) ≥ λ + P |St| ≥ λ 1≤i≤t 2 2 2KC E|S |2 KC ≤ + t ≤ . 1 2 1 2 2 ( 2 λ) ( 2 λ) λ

If λ goes to infinity, the above probability converges to zero. Therefore, Si is tight or equivalently

Jˆi,M /σ0 is tight.

Let q = i/H and let Y (q) = Y (i/H) ≡ Jˆi,M /σ0. For 0 ≤ p ≤ q, consider |p − q| → 0, then we have, as H → ∞,

|p − q| → 0 ⇒ |i1 − i2|/H → 0 ⇒ id/H → 0 ⇒ id = o(H).

If id = o(H),

Cov{Y (p),Y (q)} ∑ 4 2 4 (4/H ) tr {C(h1)C(h2)}{(H − id) } = σ−2E(Jˆ Jˆ ) = h1,h∑2 {1 + o(1)} 0 i1,M i2,M (4/H4) tr2{C(h )C(h )}H4 h1,h2 1 2 4 4 = {(H − id) /H }{1 + o(1)} = 1 − 4(id/H) + o{(id/H)}

= 1 − 4|p − q| + o{|p − q|}.

On the other hand, if |p − q| → ∞ or id/H → ∞, Cov{Y (p),Y (q)} = 0.

82 As a result, {Y (q), q ≥ 0} converges to {Z(q), q ≥ 0}, which is a stationary Gaussian process with zero mean, unit variance and covariance function of the form

r(|p − q|) = Cov{Z(p),Z(q)} = 1 − 4|p − q| + o(|p − q|), as |p − q| → 0. On the other hand, as |p − q| → ∞, r(|p − q|) log(|p − q|) → 0.

Let Q = t/H. From Finch (2003), as Q → ∞, max0≤q≤Q |Z(q)| has the Gumbel distribution so that { } [ { }]

P∞ max |Z(q)| ≤ a = exp −2exp g(t/H, a) , 0≤q≤Q where √ √ g(t/H, a) = 2 log(t/H) + 1/2 log log(t/H) + log(4/ π) − a 2 log(t/H).

As a result, when t > H, [ { }]

P∞{TH (a, M) ≤ t} = 1 − exp −2exp g(t/H, a) .

When t = H and as H → ∞, { } { } √ −1 2 P∞ max |Z(q)| ≤ a = exp −(2 π) Hexp(−a /2) , 0≤q≤1 √ which has the order of 1 − 1/(2 π)Hexp(−a2/2) because H = o{exp(a2/2)}. As a result, √ 2 P∞{TH (a, M) ≤ H} = 1/(2 π)Hexp(−a /2), which decays to zero as H = o{exp(a2/2)}.

We next derive the expectation of TH (a, M). Since the support of TH (a, M) is non-negative, we have ∫ ∞ ∞{ } { − } E TH (a, M) = 1 FTH (a,M)(t) dt, 0 where FTH (a,M)(t) is the cumulative distribution function of TH (a, M) evaluated at t. From the above results, we have ∫ ∫ H ∞ ∞{ } { − } { − } E TH (a, M) = 1 FTH (a,M)(t) dt + 1 FTH (a,M)(t) dt 0 H ( ∫ ∞ [ { }] ) = H + exp −2exp g(t/H, a) dt {1 + o(1)}. H

This completes the proof of Theorem 3.1.

83 Proof of Theorem 3.2

We first prove that the supremum of the EDDs attains at the immediate change point τ = n0.

Equivalently, we need to show that for any τ > n0,

Eτ {TH (a, M) − (τ − n0)|TH (a, M) > τ − n0} ≤ E0{TH (a, M)}.

∗ ∗ ∗ To simplify notation, we let τ = τ − n0 and T = TH (a, M) − τ . Then to show the above inequality, we only need to show that

∗ ∗ ∗ Eτ {T |T > 0} ≤ E0{T }.

Since ∫ ∞ ∗ ∗ ∗ ∗ Eτ {T |T > 0} = {1 − Pτ (T < t|T > 0)}dt, and 0 ∫ ∞ ∗ ∗ E0{T } = {1 − P0(T < t)}dt, 0 we only need to show that

∗ ∗ ∗ Pτ (T < t|T > 0) ≥ P0(T < t), (3.11)

First, the probability on the left hand side of (3.11) is

∗ − ∗ ∗ | ∗ Pτ (T < t) Pτ (T < 0) Pτ (T < t T > 0) = ∗ , (3.12) 1 − Pτ (T < 0) where ( ) Jˆ P {T ∗ < t} = P max n0+i,M,H > a , and τ τ ≤ ≤ ∗ 0 i t+τ σˆn0,M,H

( ) Jˆ P {T ∗ < 0} = P max n0+i,M,H > a . τ τ ≤ ≤ ∗ 0 i τ σˆn0,M,H From the above two probabilities, we can define two events

Jˆ Jˆ A = { max n0+i,M,H > a}, and B = { max n0+i,M,H > a}. ≤ ≤ ∗ ≤ ≤ ∗ 0 i t+τ σˆn0,M,H 0 i τ σˆn0,M,H

Second, the probability on the right hand side of (3.11) is ( ) Jˆ ∗ n0+i,M,H P0{T < t} = P0 max > a 0≤i≤t σˆ n 0,M,H ( ) Jˆ = P max n0+i,M,H > a . (3.13) τ ∗≤ ≤ ∗ τ i t+τ σˆn0,M,H

84 The last equation holds because both probabilities are based on the observations after the change points 0 and τ − n0, respectively, and the observations have the same distribution. From (3.13), we define the event

Jˆ C = { max n0+i,M,H > a}. ∗≤ ≤ ∗ τ i t+τ σˆn0,M,H From the above defined events A, B and C, we see that A = B ∪ C. Therefore, P(A) =

P(B) + P(C) − P(B ∩ C). From the definition of the events B and C, we see that if B occurs, then ∗ ∗ T < 0 or the stopping time TH (a, M) < τ . Then C cannot occur. Therefore P(A) = P(B)+P(C). Moreover, from the definitions of A, B, C, (3.12) becomes

∗ ∗ ∗ Pτ (A) − Pτ (B) Pτ (C) P0{T < t} Pτ (T < t|T > 0) = = = , 1 − Pτ (B) 1 − Pτ (B) 1 − Pτ (B) where the last equation holds by using (3.13). Then (3.11) can be proved accordingly. This completes the proof that the supremum of the EDDs attains at the immediate change point τ = n0.

We next establish the upper bound for the EDDs. To simplify notation, we let JˆT ≡ Jˆn,M,H , which is the test statistic evaluated at the stopping time TH (a, M). Let yi,r1r2 = xi,r1 xi,r2 , where xi,r1 and xi,r2 are the r1th and r2th component of Xi, respectively. Hence, E(yi,r1r2 ) =

Cov(xi,r1 , xi,r2 ) = σi,r1r2 , which is the r1th row and r2th column of Σi. Using (3.3), we see that

{ p } 1 ∑ ∑H E(Jˆ ) = E W (i, j)y y T H2 M T +i,r1r2 T +j,r1r2 r1,r2=1 i,j=1 { p } 1 ∑ ∑H = E W (i, j)σ σ H2 M T +i,r1r2 T +j,r1r2 r1,r2=1 i,j=1 { p } 1 ∑ ∑H + E W (i, j)(y − σ )(y − σ ) H2 M T +i,r1r2 T +i,r1r2 T +j,r1r2 T +j,r1r2 r1,r2=1 i,j=1 { p } 2 ∑ ∑H + E W (i, j)σ (y − σ ) H2 M T +i,r1r2 T +j,r1r2 T +j,r1r2 r1,r2=1 i,j=1 = E(I) + E(II) + E(III).

By Lemma 3.3, 3.4, and 3.5, we obtain

1 E(I) ≥ E{(T − M)(T − M − 1)}tr{(Σ − Σ )2}{1 + o(1)}, H τ τ+1 [ ] [√ ] 2 2 E(II) = O log(H)tr{(Στ − Στ+1) } = o Htr{(Στ − Στ+1) } ,

85 and [√ ] 2 E(III) = o Htr{(Στ − Στ+1) } .

Hence,

1 E(Jˆ ) ≥ E{(T − M)(T − M − 1)}tr{(Σ − Σ )2}{1 + o(1)} T H τ τ+1 [√ ] 2 + o Htr{(Στ − Στ+1) } . (3.14)

Using (3.14), we have

1 a · σ ≥ E{(T − M)(T − M − 1)}tr{(Σ − Σ )2}{1 + o(1)} H,M,0 H τ τ+1 [√ ] 2 − (|E(JˆT )| − a · σH,M,0) + o Htr{(Στ − Στ+1) } . (3.15)

Let JˆT −1 denote the test statistic evaluated at T − 1. From the stopping rule (3.6), we have

E|JˆT −1| ≤ a · σH,M,0.

By Jensen’s inequality and triangle inequality, we also have

E|JˆT −1| ≥ |E(JˆT )| − |E(JˆT − JˆT −1)|.

Combining the above two inequality, we obtain

|E(JˆT )| − a · σH,M,0 ≤ |E(JˆT − JˆT −1)|. (3.16)

Based on similar derivations,

2 2 |E(Jˆ − Jˆ − )| = E(T − M − 1)tr{(Σ − Σ ) }{1 + o(1)} T T 1 H τ τ+1 [√ ] 2 + o Htr{(Στ − Στ+1) } . (3.17)

Combining (3.15), (3.16) and (3.17), we obtain [ ] 1 √ E{(T − M − 2)2}tr{(Σ − Σ )2}{1 + o(1)} ≤ a · σ + o Htr{(Σ − Σ )2} . H τ τ+1 H,M,0 τ τ+1

r 2 Using a · σH,M,0 = O{H · tr(Στ − Στ+1) } with 1/2 ≤ r < 1 and the Jensen’s inequality, we have [ ] √ a · σ · H 1/2 − − ≤ { − − 2} ≤ H,M,0 { } E(T M 2) E (T M 2) 2 1 + o(1) . tr{(Στ − Στ+1) } This completes the proof of Theorem 3.2.

86 Proof of Theorem 3.3 Jˆ The asymptotic normality of n0,M can be established by the martingale central limit theorem.

Toward this end, we let F0 = {∅, Ω}, Fk = σ{X1, ..., Xk} with k = 1, 2, ..., n0, and Ek(·) denote F − Jˆ the conditional expectation given k. Define Dn0,k = (Ek Ek−1) n0,M and it is easy to see that ∑ Jˆ − n0 n0,M µJˆ = k=1 Dn0,k. n0,M ∑ m Jˆ − ≥ We further define Sn0,m = k=1 Dn0,k = Em n0,M µJˆ . We can show that for q m, n0,M |F Jˆ − Jˆ − E(Sn0,q m) = Sn0,m. To this end, we note that Sn0,q = Eq n0,M µJˆ = Em n0,M µJˆ + n0,M n0,M Jˆ − Jˆ Jˆ − Jˆ Eq n0,M Em n0,M = Sn0,m + (Eq n0,M Em n0,M ). Then

|F { Jˆ |F } − { Jˆ |F } E(Sn0,q m) = Sn0,m + E Eq( n0,M ) m E Em( n0,M ) m { Jˆ } − { Jˆ } = Sn0,m + E Em( n0,M ) E Em( n0,M )

= Sn0,m.

{ F } { ≤ ≤ } As a result, we see that Sn0,k, k is a martingale and accordingly, Dn0,k, 1 k n0 is a martingale difference sequence with respect to the σ-fields {Fk, 1 ≤ k ≤ n0} Based on similar derivations for Lemmas 2 and 3 in Li and Chen (2012), we can show that under (3.2) and (C1)-(C2), as n0 → ∞, ∑ n0 2 Ek−1(D ) k=1 n0,k −→p 2 1. σJˆ n0,M And, ∑ n0 E(D4 ) k=1 n0,k → 4 0. σJˆ n0,M The above two results are sufficient conditions for the martingale central limit theorem. This thus completes the first part of Theorem 3.3.

To show the second part of Theorem 3.3, we only need to show the ratio consistency ofσ ˆJˆ n0,M ,0 defined in (3.8) to σJˆ under the null hypothesis. From the expression (3.7), we apply (3.2) n0,M ,0 such that under the null hypothesis,

( ∗ ) ∗ 1 ∑ 1 ∑ E XT X XT X = E(ZT ΓT Γ ZZT ΓT Γ Z) n∗ t+h2 s s+h1 t n∗ t+h2 s s+h1 t s,t s,t

= tr{C(h1)C(h2)}.

87 \ This shows that E[tr{C(h1)C(h2)}] = tr{C(h1)C(h2)}. Similarly, we can show that under the

\ 2 conditions (C1)-(C2), Var[tr{C(h1)C(h2)}] = o[tr {C(h1)C(h2)}]. This implies that under the null hypothesis, \ p tr{C(h1)C(h2)}/tr{C(h1)C(h2)} −→ 1.

The second part of Theorem 3.3 is then proved by applying the continuous mapping theorem.

Proof of Theorem 3.4

We first show that P(Mˆ = M) = 1 as n0 → ∞. Note that the event that Mˆ > M is equivalent to the event that

tr{C(M +\ 1)C(−M − 1)}/tr{C\(0)C(0)} > ϵ.

Therefore, P(Mˆ > M) is equivalent to [ ] P tr{C(M +\ 1)C(−M − 1)}/tr{C\(0)C(0)} > ϵ . [ ] It is also equivalent to P tr{C(M +\ 1)C(−M − 1)}/tr{C(0)C(0)} > ϵ as

p tr{C\(0)C(0)}/tr{C(0)C(0)} −→ 1 from the proof of Theorem 3.3.

From (3.7), we can show that E[tr{C(M +\ 1)C(−M − 1)}] = 0 and [ ] Var tr{C(M +\ 1)C(−M − 1)}/tr{C(0)C(0)} = O(n−2).

Using Chebyshev’s inequality, we can show that as n0 → ∞, [ ] P tr{C(M +\ 1)C(−M − 1)}/tr{C(0)C(0)} > ϵ = 0, or equivalently, P(Mˆ > M) = 0. Similarly, we can show that P(Mˆ < M) = 0. We then establish the consistency of Mˆ to M.

To prove E∞{TH (a, Mˆ )} = E∞{TH (a, M)}, we only need to show that as n0 → ∞,

P∞{TH (a, Mˆ ) ≤ t} = P∞{TH (a, M) ≤ t}.

Toward this end, we notice that

P∞{TH (a, Mˆ ) ≤ t} = P∞{TH (a, Mˆ ) ≤ t, Mˆ = M} + P∞{TH (a, Mˆ ) ≤ t, Mˆ ≠ M},

88 where the second term converges to zero because P(Mˆ = M) = 1 as n0 → ∞.

To prove E0{TH (a, Mˆ )} = E0{TH (a, M)}, we notice that as n0 → ∞,

E0{TH (a, Mˆ )} = E(E0{TH (a, Mˆ )|Mˆ }) ∑ ∗ ∗ = E0{TH (a, M)}P(Mˆ = M) + E0{TH (a, M )}P(Mˆ = M ) M ∗≠ M

= E0{TH (a, M)}.

This completes the proof of Theorem 3.4.

3.7.2 Lemmas and their proofs

Lemma 3.1 Assume the model (3.1). For any j ≥ 1, we assume that E(y2 ) < ∞ where j,r1r2 yj,r1r2 = xj,r1 xj,r2 , and xj,r1 and xj,r2 are the r1th and r2th element of Xj, respectively. Let τ be the change point, such that Σ ≠ Σ , and Σ = {σ }p . Let n be the size of historical τ τ+1 τ+1 τ+1,r1r2 r1,r2=1 0 data. For the stopping time T , we have

(i) E(|y |2) = O{E(T )σ2 }, for 1 ≤ i ≤ T + M. n0+i,r1r2 τ+1,r1r2 | | { 1/2 } ≤ ≤ (ii) E( yn0+i,r1r2 ) = O E(T )στ+1,r1r2 , for 1 i T + M. ∑ (iii) E(| M y |2) = O{E(T )σ2 }. i=1 n0+T +i,r1r2 τ+1,r1r2 ∑ | M | { }1/2 (iv) E( i=1 yn0+T +i,r1r2 ) = O[ E(T ) στ+1,r1r2 ]. ∑ T { }1/2 (v) E( i=1 yn0+i,r1r2 ) = E(T )στ+1,r1r2 + O[ E(T ) στ+1,r1r2 ]. ∑ { T − 2} 1/2 { }1/2 { }1/2 (vi) [E ( i=1 yn0+i,r1r2 T στ+1,r1r2 ) ] = E(T)γτ+1,r1r2 + O[ E(T) στ+1,r1r2 ]. ∑ (vii) E{( T y − T σ )2} = E(T )γ + O{E(T)σ2 }, where γ = i=1 n0+i,r1r2 τ+1,r1r2 τ+1,r1r2 τ+1,r1r2 τ+1,r1r2 ∑ M ≥ Var(yi,r1r2 ) + 2 q=1 Cov(yi,r1r2 , yi+q,r1r2 ), for i n0 + 1. ∗ ∗ ∗ − F ∗ { ∗ } ≥ F ∗ {∅ } Proof. Let n = n n0. Let n = σ X1,...,Xn0+n if n 1 and n = , Ω if n = 0. ∞ {F ∗ } F Then n 0 is an increasing sequence of σ-fields on a probability space (Ω, ∞,P ). Moreover, ∞ ∞ { ∗ } ∗ {F ∗ } Xn0+n n =1 is a stationary and M-dependent sequence of random vectors adapted to n 1 , ∞ ∗ { ∗ } {F ∗ } {F ∗ } and Xn0+n +i i=M+1 is independent of n for every n . Hence, T is relative to n .

Note that for i ≥ n0 + 1,

E(yi,r1r2 ) = Cov(xi,r1 , xi,r2 ) = στ+1,r1r2 .

∞ { ∗ } ∗ Therefore, yn0+n ,r1r2 n =1 is a sequence of stationary and M-dependent random variables adapt- ∞ ∞ ∗ {F ∗ } ∗ { ∗ } {F ∗ } ∗ ed to n n =1, and yn0+n +i,r1r2 i=M+1 is independent of n for every n . Let Sn =

89 ∑ n∗ ∗ ∗ { ∗ − |F ∗ } i=1 yn0+i,r1r2 , and let Un = E Sn +M (n + M)στ+1,r1r2 n , then we have

E(Un∗ |Fn∗−1)

∗ { ∗ − |F ∗ } = E Sn +M (n + M)στ+1,r1r2 n −1

∗ ∗ − |F ∗ ∗ = Un −1 + E(yn0+n +M,r1r2 στ+1,r1r2 n −1) = Un −1.

∞ { ∗ } Therefore, Un n∗=0 is a martingale. Applying Theorem 1.1 and Theorem 1.2 (iii) in Janson (1983), we obtain

E(ST +M ) = E(T + M)στ+1,r1r2 , and { − }2 − 2 E[ ST +2M (T + 2M)στ+1,r1r2 ] = E(T )γτ+1,r1r2 + E(S2M 2Mστ+1,r1r2 ) .

Let ϵ > 0 and for i ≥ 1, let y′ = y ·I(|y | > A) and y′′ = y − n0+i,r1r2 n0+i,r1r2 n0+i,r1r2 n0+i,r1r2 n0+i,r1r2 y′ where A is so large such that E|y′ |2 < ϵσ2 , where σ2 < ∞, because of n0+i,r1r2 n0+i,r1r2 τ+1,r1r2 τ+1,r1r2 E(y2 ) < ∞, for any j ≥ 1. Similar to the proof of Corollary 1.1 (i) in Janson (1983), we have j,r1r2

| |2 | ′ |2 | ′′ |2 E( yn0+i,r1r2 ) = E( yn0+i,r1r2 ) + E( yn0+i,r1r2 ) ( ) T∑+M ≤ | ′ |2 2 E yn0+i,r1r2 + A i=1 ≤ | ′ |2 2 E(T + M)E yn0+i,r1r2 + A

2 < ϵE(T )στ+1,r1r2 + Cϵ, for 1 ≤ i ≤ T +M. This proves (i). By the similar procedure for (ii)-(vii) of Corollary 1.1 in Janson

(1983), the results (ii)-(vii) can be obtained. This completes the proof of Lemma 3.1.

Lemma 3.2 Under the same conditions in Lemma 3.1, { } ∑T ∑M − − E (yn0+i,r1r2 στ+1,r1r2 )(yn0+i+q,r1r2 στ+1,r1r2 ) i=1 q=1 ∑M { 2 } = E(T ) Cov(yn0+1,r1r2 , yn0+1+q,r1r2 ) + O E(T )στ+1,r1r2 . q=1

∞ { ∗ } ∗ Proof. As shown in Lemma 3.1, yn0+n ,r1r2 n =1 is a sequence of stationary and M-dependent ran- ∑ ∞ M {F ∗ } ∗ { ∗ − ∗ − dom variables adapted to σ-fields n n =1. Thus q=1(yn0+n ,r1r2 στ+1,r1r2 )(yn0+n +q,r1r2

90 ∞ } ∗ στ+1,r1r2 ) n =1 is a sequence of stationary and 2M-dependent random variables and it is adapted ∗ ∑ ∞ (n ) M {F ∗ } ∗ ∗ − ∗ − to n +M n =1. If we let YM = q=1(yn0+n ,r1r2 στ+1,r1r2 )(yn0+n +q,r1r2 στ+1,r1r2 ) and (n∗) ∞ H ∗ F ∗ { } n = n +M , then YM n∗=1 is a sequence of stationary and 2M-dependent random variables ∗ ∑ ∞ ∗ (n ) M ∗ {H ∗ } ∗ { } ∗ ∗ ≥ adapted to n n =1. Let µ = E YM = q=1 Cov(yn0+n , yn0+n +q) for any n 1. As ∗ Fn∗ ⊆ Hn∗ for every n , T is also a stopping time relative to {Hn∗ }. Therefore, we can write

∑T ∑T ∑M S (i) − − T = YM = (yn0+i,r1r2 στ+1,r1r2 )(yn0+i+q,r1r2 στ+1,r1r2 ). i=1 i=1 q=1 ∑ n∗ (i) ∗ ∗ H {∅ } S ∗ U ∗ {S ∗ − |H ∗ } Let 0 = , Ω and define n = i=1 YM and n = E n +2M (n + 2M)µ n , then

E(Un∗ |Hn∗−1)

∗ ∗ = E{Sn∗+2M − (n + 2M)µ |Hn∗−1}

(n∗+2M) ∗ U ∗ − |H ∗ U ∗ = n −1 + E(YM µ n −1) = n −1.

∞ {U ∗ } Therefore, n n∗=0 is a martingale. Applying Theorem 1.1 in Janson (1983), we obtain

∗ E(ST +2M ) = E(T + 2M)µ .

As E(y2 ) < ∞, for any j ≥ 1, by H¨older’sinequality we have j,r1r2

∑M | (j)| − − E YM = E (yn0+j,r1r2 στ+1,r1r2 )(yn0+j+q,r1r2 στ+1,r1r2 ) q=1 ( ) ∑M ≤ 1/2 | − |2 1/2 | − |2 E ( yn0+j,r1r2 στ+1,r1r2 )E M yn0+j+q,r1r2 στ+1,r1r2 q=1 · | − |2 ∞ = M E yn0+j,r1r2 στ+1,r1r2 < , for any j ≥ 1. Therefore, by the similar idea in Lemma 3.1 (ii), we can show

{| (T )|} { }1/2 2 E YM = O[ E(T ) στ+1,r1r2 ].

Towards this end, applying Lemma 3.1 (v), we have

S ∗ { }1/2 2 E( T ) = E(T )µ + O[ E(T ) στ+1,r1r2 ].

This completes the proof of Lemma 3.2.

91 Lemma 3.3 Under the same conditions in Theorem 3.2,

{ p } 1 ∑ ∑H E W (i, j)σ σ H2 M T +i,r1r2 T +j,r1r2 r1,r2=1 i,j=1 1 { } ≥ E (T − M)(T − M − 1) tr{(Σ − Σ )2}{1 + o(1)}. H τ τ+1

Proof. We first write p 1 ∑ ∑H I ≡ W (i, j)σ σ H2 M T +i,r1r2 T +j,r1r2 r1,r2=1 i,j=1 − 1 H∑T 1 ∑H = W (i, j)tr(Σ2) + W (i, j)tr(Σ2 ) H2 M τ H2 M τ+1 i,j=1 i,j=H−T +1 − 2 H∑T ∑H + W (i, j)tr(Σ Σ ) H2 M τ τ+1 i=1 i=H−T +1 = IA + IB + IC .

We start from part IA, − − − { 1 H∑T H ∑M 2 H − t − M I = I(i ≤ t)I(j ≤ t)I(|i − j| ≥ M + 1) A H2 t − M − 1 i,j=1 t=M+2 t − M + I(i ≥ t + 1)I(j ≥ t + 1)I(|i − j| ≥ M + 1) H − t − M − 1 (t − M)(H − t − M) − I(i ≤ t)I(j ≥ t + 1)I(|i − j| ≥ M + 1) t(H − t) − 1 M(M + 1) 2 } (t − M)(H − t − M) − I(j ≤ t)I(i ≥ t + 1)I(|i − j| ≥ M + 1) tr(Σ2) − − 1 τ t(H t) 2 M(M + 1) − − − [ 1 H T∑M 2 H − t − M = (t − M)(t − M − 1) H2 t − M − 1 t=M+2 t − M + (H − T − t − M)(H − T − t − M − 1) H − t − M − 1 { }{ }] M(H − 3 M − 1 ) 1 + 2 − 1 + 2 2 t(H − T − t) − M(M + 1) tr(Σ2) − − 1 τ t(H t) 2 M(M + 1) 2 − ( 1 H∑T H − t − M + (t − M)(t − M − 1) H2 t − M − 1 t=H−T −M−1 { } M(H − 3 M − 1 ) + 2 − 1 + 2 2 t(H − t) − 1 M(M + 1) [ 2 ]) 1 × t(H − T − t) − (H − T − t){2M − (H − T − t) + 1} tr(Σ2) 2 τ − − 1 H ∑M 2 H − t − M + (H − T − M)(H − T − M − 1)tr(Σ2) H2 t − M − 1 τ t=H−T +1

92 (1) (2) (3) = IA + IA + IA , where

1 I(1) = (T − M)(T − M − 1) A H2 { − − − } H T∑M 2 H − 2M − 1 × − (H − T − 2M − 3) tr(Σ2) H − t − M − 1 τ t=M+2 1 + M(M + 1)(H − T − 2M − 3)tr(Σ2) H2 τ { ( ) − − − 1 1 1 H T∑M 2 H − 2M − 1 + 2M T − M − H2 2 2 H − t − M − 1 t=M+2 ( ) − − − } 3 1 H T∑M 2 T · t − 2M H − M − tr(Σ2) 2 2 t(H − t) − 1 M(M + 1) τ t=M+2 2 (11) (12) (13) = IA + IA + IA .

As H → ∞,

− − − ∫ ( ) H T∑M 2 1 H−T −M−2 1 H − 2M − 3 = dt = log . H − t − M − 1 H − t − M − 1 T + 1 t=M+2 M+2 Hence,

1 I(11) = (T − M)(T − M − 1) A H2 { ( ) } H − 2M − 3 × (H − 2M − 1) log − (H − T − 2M − 3) tr(Σ2), T + 1 τ and ( ) − − − { } 1 1 H T∑M 2 1 1 I(13) ≤ 2M(H − 2M − 1) T − M − − tr(Σ2) A 2 2 H − t − M − 1 H − t τ t=M+2 ≤ − − 2 2M(H 2M 1)(T + M + 2)tr(Στ ).

Therefore, we have [ 1 E(I(1)) = E (T − M)(T − M − 1) A H2 { ( ) }] H − 2M − 3 × (H − 2M − 1) log − (H − T − 2M − 3) tr(Σ2) T + 1 τ 1 + M(M + 1)(H − 2M − 1)tr(Σ2) H2 τ 1 } − M(M + 1)E(T + 2) tr(Σ2) H2 τ

93 1 { } + O (H − 2M − 1)E(T + M + 2) tr(Σ2). H2 τ

(2) For IA , − 1 H∑T I(2) = (H − t − M)(t − M)tr(Σ2) A H2 τ t=H−T −M−1 { } H−∑T −1 − 2 − − − 1 − − − 2 2 t(H T t) (H T t)(2M H + T + t + 1) tr(Στ ) H − − − 2 t=H( T M 1 ) 2 3 1 + M H − M − H2 2 2 − − H∑T 1 t(H − T − t) − 1 (H − T − t)(2M − H + T + t + 1) × 2 tr(Σ2) t(H − t) − 1 M(M + 1) τ t=H−T −M−1 2 (21) (22) (23) = IA + IA + IA , where 1 I(21) ≤ M(T + 1)(H − T − M)tr(Σ2), A H2 τ and 2 I(22) ≤ − (M − 1)(H − T + M 2 − 1)tr(Σ2). A H2 τ As H → ∞ and T ≥ 1, and for some constant c, ( ) 2 3 1 I(23) ≤ M(M − 1) H − M − A H2 2 2 (M + 1)(H − T − 1) × tr(Σ2) (H − T − M − 1)(T + 1) − 1 M(M + 1) τ ( 2 ) c 3 1 ≤ 2M(M − 1)(M + 1) H − M − tr(Σ2). H2 2 2 τ

Therefore, [ (2) 1 − − − − { − − − } E(IA ) = O M(H 2M 1)E(T M + 2) ME (T M 1)(T M) H2 ] − − − 2 + 2M(M 1)(M + 1)(H 2M 1) tr(Στ ).

(3) For IA , − − 1 H ∑M 2 1 I(3) = (H − T − M)(H − T − M − 1)(H − 2M − 1) tr(Σ2) A H2 t − M − 1 τ t=H−T +1 1 − (H − T − M)(H − T − M − 1)(T − M − 2)tr(Σ2) H2 τ 1 = (H − T − M)(H − T − M − 1)(H − 2M − 1) H2

94 − − ( ) H ∑M 2 1 1 × − tr(Σ2) t − M − 1 H − T − M − 1 τ t=H−T +1 1 + (T − M)(T − M − 2)(H − T − M)tr(Σ2) H2 τ (31) (32) = IA + IA .

We have

1 |I(31)| ≤ (H − T − M)(H − T − M − 1)(H − 2M − 1) A H2 (H − 2M − 3) − (H − T − M − 1) × (T − M − 2) tr(Σ2) (H − T + 1)(H − T − M − 1) τ 1 ≤ (T − M − 2)2(H − 2M − 1)tr(Σ2). H2 τ

Therefore,

1 { } E(I(3)) = E (T − M)(T − M − 2)(H − T − M) tr(Σ2) A H2 τ 1 { } + O E(T − M − 2)2(H − 2M − 1) tr(Σ2). H2 τ

Combining all the results and using Tmax = o(H) as H → ∞, we have

\begin{align*}
E(I_A) = \frac{1}{H}E\big\{(T-M)(T-M-1)\big\}\,\mathrm{tr}(\Sigma_\tau^2)\{1+o(1)\}.
\end{align*}

Next, we consider $I_B$:
\begin{align*}
I_B &= \frac{1}{H^2}\sum_{i,j=H-T+1}^{H}\sum_{t=M+2}^{H-M-2}
\Big\{\frac{H-t-M}{t-M-1}\,I(i\le t)I(j\le t)
+\frac{t-M}{H-t-M-1}\,I(i\ge t+1)I(j\ge t+1)\\
&\qquad-\frac{(t-M)(H-t-M)}{t(H-t)-\frac{1}{2}M(M+1)}\,
\big\{I(i\le t)I(j\ge t+1)+I(j\le t)I(i\ge t+1)\big\}\Big\}\,I(|i-j|\ge M+1)\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&= \frac{1}{H^2}\sum_{t=M+2}^{H-T}\frac{t-M}{H-t-M-1}\,(T-M)(T-M-1)\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad+\frac{1}{H^2}\sum_{t=H-T+1}^{H-T+M+1}\Big[\frac{t-M}{H-t-M-1}(H-t-M)(H-t-M-1)\\
&\qquad+\Big\{-1+\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}\Big\}
\,2\Big\{(t-H+T)(H-t)-\frac{1}{2}(t-H+T)(2M-t+H-T+1)\Big\}\Big]\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad+\frac{1}{H^2}\sum_{t=H-T+M+2}^{H-M-2}\Big[\frac{H-M-t}{t-M-1}(t-H+T-M)(t-H+T-M-1)
+\frac{t-M}{H-t-M-1}(H-t-M)(H-t-M-1)\\
&\qquad+\Big\{-1+\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}\Big\}
\,2\Big\{(t-H+T)(H-t)-\frac{1}{2}M(M+1)\Big\}\Big]\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&= I_B^{(1)}+I_B^{(2)}+I_B^{(3)}.
\end{align*}
By the same idea as for $I_A^{(11)}$,
\begin{align*}
E(I_B^{(1)}) = \frac{1}{H^2}(H-2M-1)\,E\Big\{(T-M)(T-M-1)\log\Big(\frac{H-2M-3}{T-M-1}\Big)\Big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)
-\frac{1}{H^2}E\big\{(T-M)(T-M-1)(H-T-M-1)\big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2).
\end{align*}
For $I_B^{(2)}$, we have
\begin{align*}
I_B^{(2)} &= \frac{1}{H^2}\sum_{t=H-T+1}^{H-T+M+1}
\big\{(t-M)(H-t-M)-2(t-H+T)(H-t)+(t-H+T)(2M-t+H-T+1)\big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad+\frac{1}{H^2}\sum_{t=H-T+1}^{H-T+M+1}
\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}
\big\{2(t-H+T)(H-t)-(t-H+T)(2M-t+H-T+1)\big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&= I_B^{(21)}+I_B^{(22)}.
\end{align*}

First,

\begin{align*}
E(I_B^{(21)}) &= \frac{1}{H^2}(M+1)(H-2M-1)E(T-M-1)\,\mathrm{tr}(\Sigma_{\tau+1}^2)
-\frac{1}{H^2}\cdot\frac{1}{2}M(M+1)(H-2M-1)\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad-\frac{1}{H^2}(M+1)E\big\{(T-M-1)(T-M)\big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2).
\end{align*}

Next,

\begin{align*}
|I_B^{(22)}| \le \frac{1}{H^2}M(M+1)^2\Big(H-\frac{3}{2}M-\frac{1}{2}\Big)(H-2M-1)\,
\frac{1}{(H-M-2)-\frac{1}{2}M(M+1)}\,\mathrm{tr}(\Sigma_{\tau+1}^2)
= O\Big\{\frac{1}{H^2}M(M+1)^2(H-2M-1)\Big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2).
\end{align*}
Hence,
\begin{align*}
E(I_B^{(2)}) &= \frac{1}{H^2}(M+1)(H-2M-1)E(T-M-1)\,\mathrm{tr}(\Sigma_{\tau+1}^2)
-\frac{1}{H^2}\cdot\frac{1}{2}M(M+1)(H-2M-1)\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad-\frac{1}{H^2}(M+1)E\big\{(T-M-1)(T-M)\big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)
+O\Big\{\frac{1}{H^2}M(M+1)^2(H-2M-1)\Big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2).
\end{align*}

Finally,

\begin{align*}
I_B^{(3)} &= \frac{1}{H^2}\sum_{t=H-T+M+2}^{H-M-2}\Big[\frac{H-M-t}{t-M-1}(t-H+T-M)(t-H+T-M-1)\\
&\qquad+(t-M)(H-t-M)-2\Big\{(t-H+T)(H-t)-\frac{1}{2}M(M+1)\Big\}\Big]\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad+\frac{2}{H^2}\sum_{t=H-T+M+2}^{H-M-2}
\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}
\Big\{(t-H+T)(H-t)-\frac{1}{2}M(M+1)\Big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&= I_B^{(31)}+I_B^{(32)},
\end{align*}
where
\begin{align*}
I_B^{(31)} &= \frac{1}{H^2}\sum_{t=H-T+M+2}^{H-M-2}\frac{H-M-t}{t-M-1}\,(H-T-M)(H-T-M-1)\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad+\frac{1}{H^2}\sum_{t=H-T+M+2}^{H-M-2}\Big\{\frac{H-M-t}{t-M-1}(t-2M-1)(t-2H+2T)
+(t-M)(H-t-M)-2(t-H+T)(H-t)+M(M+1)\Big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&= I_B^{(311)}+I_B^{(312)}.
\end{align*}

Similar to $E(I_A^{(3)})$,
\begin{align*}
E(I_B^{(311)}) = \frac{1}{H^2}E\big\{(T-M)(T-2M-3)(H-T-M)\big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)
+O\Big[\frac{1}{H^2}(H-2M-1)E\{(T-2M-3)(T-M-2)\}\Big]\,\mathrm{tr}(\Sigma_{\tau+1}^2),
\end{align*}
and
\begin{align*}
I_B^{(312)} \le \frac{1}{H^2}\,2M(H-2M-1)(T-2M-3)\,\mathrm{tr}(\Sigma_{\tau+1}^2).
\end{align*}
Therefore,
\begin{align*}
E(I_B^{(31)}) &= \frac{1}{H^2}E\big\{(T-M)(T-2M-3)(H-T-M)\big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)
+O\Big[\frac{1}{H^2}(H-2M-1)E\{(T-2M-3)(T-M-2)\}\Big]\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad+O\Big\{\frac{1}{H^2}(H-2M-1)E(T-2M-3)\Big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2).
\end{align*}
Moreover,
\begin{align*}
E(I_B^{(32)}) = O\Big\{\frac{1}{H^2}\Big(H-\frac{3}{2}M-\frac{1}{2}\Big)E(T-2M-4)\Big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2).
\end{align*}
Combining all the results, we have
\begin{align*}
E(I_B) &= \frac{1}{H^2}(H-2M-1)E\Big\{(T-M)(T-M-1)\log\Big(\frac{H-2M-3}{T-M-1}\Big)\Big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad+O\Big[\frac{1}{H^2}(H-2M-1)E\{(T-2M-3)(T-M-2)\}\Big]\,\mathrm{tr}(\Sigma_{\tau+1}^2)
+O\Big\{\frac{1}{H^2}(H-2M-1)E(T-2M-1)\Big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad+\frac{1}{H^2}O(H-2M-1)\,\mathrm{tr}(\Sigma_{\tau+1}^2)
+O\Big[\frac{1}{H^2}E\{(T-M)(T-M-1)\}\Big]\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&= \frac{1}{H}E\big\{(T-M)(T-M-1)\big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)\{1+o(1)\}.
\end{align*}

Finally, we consider $I_C$:
\begin{align*}
I_C &= \frac{2}{H^2}\sum_{i=1}^{H-T}\sum_{j=H-T+1}^{H}\sum_{t=M+2}^{H-M-2}
\Big\{\frac{H-t-M}{t-M-1}\,I(i\le t)I(j\le t)
+\frac{t-M}{H-t-M-1}\,I(i\ge t+1)I(j\ge t+1)\\
&\qquad-\frac{(t-M)(H-t-M)}{t(H-t)-\frac{1}{2}M(M+1)}\,I(i\le t)I(j\ge t+1)\Big\}\,
I(|i-j|\ge M+1)\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\\
&= \frac{2}{H^2}\sum_{t=M+2}^{H-T-M}\Big[\frac{t-M}{H-t-M-1}\Big\{(H-T-t)T-\frac{1}{2}M(M+1)\Big\}
+\Big\{-1+\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}\Big\}T\,t\Big]\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\\
&\quad+\frac{2}{H^2}\sum_{t=H-T-M+1}^{H-T}\Big[\frac{t-M}{H-t-M-1}
\Big\{(H-T-t)T-\frac{1}{2}(H-T-t)(2M-H+T+t+1)\Big\}\\
&\qquad+\Big\{-1+\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}\Big\}
\Big\{T\,t-\frac{1}{2}(t-H+T+M+1)(t-H+T+M)\Big\}\Big]\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\\
&\quad+\frac{2}{H^2}\sum_{t=H-T+1}^{H-T+M-1}\Big[\frac{H-M-t}{H-t-M-1}
\Big\{(H-T)(t-H+T)-\frac{1}{2}(t-H+T)(2M-t+H-T+1)\Big\}\\
&\qquad+\Big\{-1+\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}\Big\}
\Big\{(H-T)(H-t)-\frac{1}{2}(H-T+M+1-t)(H-T+M-t)\Big\}\Big]\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\\
&\quad+\frac{2}{H^2}\sum_{t=H-T+M}^{H-M-2}\Big[\frac{H-M-t}{t-M-1}
\Big\{(H-T)(t-H+T)-\frac{1}{2}M(M+1)\Big\}
+\Big\{-1+\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}\Big\}(H-T)(H-t)\Big]\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1}).
\end{align*}
Based on derivations similar to those for $E(I_A)$ and $E(I_B)$, we can obtain
\begin{align*}
E(I_C) &= -\frac{2}{H^2}(H-2M-1)E\Big\{(T-M)(T-M-1)\log\Big(\frac{H-2M-3}{T-1}\Big)\Big\}\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\\
&\quad+\frac{1}{H^2}M(M+1)\Big(H-\frac{3}{2}M-\frac{1}{2}\Big)E\Big\{\log\Big(\frac{H-2M-3}{T-1}\Big)\Big\}\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\{1+o(1)\}\\
&\quad+O\Big[\frac{1}{H^2}(H-2M-1)E\{(T-M)(T-M-1)\}\Big]\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})
+O\Big\{\frac{1}{H^2}(H-2M-1)E(T-M)\Big\}\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\\
&\quad+\frac{1}{H^2}O(H-2M-1)\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})
+O\Big[\frac{1}{H^2}E\{(T-M)(T-M-1)\}\Big]\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})
+O\Big\{\frac{1}{H^2}E(T-M-1)\Big\}\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\\
&= -\frac{2}{H^2}(H-2M-1)E\Big[\Big\{(T-M)(T-M-1)-\frac{1}{2}M(M+1)\Big\}\log\Big(\frac{H-2M-3}{T-1}\Big)\Big]\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\{1+o(1)\}\\
&\ge -\frac{2}{H^2}(H-2M-1)E\Big\{(T-M)(T-M-1)\log\Big(\frac{H-2M-3}{T-1}\Big)\Big\}\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\{1+o(1)\}\\
&= -\frac{2}{H}E\big\{(T-M)(T-M-1)\big\}\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\{1+o(1)\}.
\end{align*}

As a result,

\begin{align*}
E(I) \ge \frac{1}{H}E\big\{(T-M)(T-M-1)\big\}\,\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}\{1+o(1)\}.
\end{align*}

This completes the proof of Lemma 3.3.

Lemma 3.4 Under the same conditions as in Theorem 3.2,
\begin{align*}
E\Big\{\frac{1}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i,j=1}^{H} W_M(i,j)
(y_{T+i,r_1r_2}-\sigma_{T+i,r_1r_2})(y_{T+j,r_1r_2}-\sigma_{T+j,r_1r_2})\Big\}
= O\big[\log(H)\,\mathrm{tr}\{(\Sigma_{\tau+1}-\Sigma_\tau)^2\}\big].
\end{align*}

Proof. We first write

\begin{align*}
II &\equiv \frac{1}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i,j=1}^{H} W_M(i,j)
(y_{T+i,r_1r_2}-\sigma_{T+i,r_1r_2})(y_{T+j,r_1r_2}-\sigma_{T+j,r_1r_2})\\
&= \frac{1}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i,j=1}^{H}\sum_{t=M+2}^{H-M-2}
\Big\{\frac{H-t-M}{t-M-1}\,I(i\le t)I(j\le t)
+\frac{t-M}{H-t-M-1}\,I(i\ge t+1)I(j\ge t+1)\\
&\qquad-2\,\frac{(t-M)(H-t-M)}{t(H-t)-\frac{1}{2}M(M+1)}\,I(i\le t)I(j\ge t+1)\Big\}\,
I(|i-j|\ge M+1)\,y^{*}_{i,r_1r_2}y^{*}_{j,r_1r_2}\\
&= \mathrm{I} + \mathrm{II} + \mathrm{III}.
\end{align*}
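For readers who want to sanity-check the book-keeping that follows, the sketch below (our own illustrative transcription, not code from the dissertation) evaluates the weight $W_M(i,j)$ read off from the indicator expansion above; the symmetric handling of the two cross terms is our choice.

```python
# Numerical transcription (ours) of the weight W_M(i, j) implied by the
# indicator expansion above; divide the resulting quadratic form by H^2 to
# match the normalization of the statistic.
def W(i, j, H, M):
    if abs(i - j) < M + 1:
        return 0.0
    total = 0.0
    for t in range(M + 2, H - M - 1):             # t = M+2, ..., H-M-2
        if i <= t and j <= t:                     # both indices left of t
            total += (H - t - M) / (t - M - 1)
        elif i >= t + 1 and j >= t + 1:           # both indices right of t
            total += (t - M) / (H - t - M - 1)
        else:                                     # indices straddle t
            total -= (t - M) * (H - t - M) / (t * (H - t) - M * (M + 1) / 2)
    return total
```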

As $H\to\infty$,
\begin{align*}
\sum_{t=j}^{H-M-2}\frac{1}{t-M-1}=\int_{j}^{H-M-2}\frac{1}{t-M-1}\,dt
=\log\Big(\frac{H-2M-3}{j-M-1}\Big).
\end{align*}
Hence,
\begin{align*}
\mathrm{I} &= \frac{2}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i=1}^{H-2M-3}\sum_{j=i+M+1}^{H-M-2}\sum_{t=j}^{H-M-2}
\frac{H-t-M}{t-M-1}\,y^{*}_{i,r_1r_2}y^{*}_{j,r_1r_2}\\
&= \frac{2}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i=1}^{H-2M-3}\sum_{j=i+M+1}^{H-M-2}
\Big\{(H-2M-1)\log\Big(\frac{H-2M-3}{j-M-1}\Big)-(H-j-M-1)\Big\}\,y^{*}_{i,r_1r_2}y^{*}_{j,r_1r_2}.
\end{align*}
Similarly,
\begin{align*}
\mathrm{II} &= \frac{2}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i=1}^{H-M-1}\sum_{j=i+M+1}^{H}\sum_{t=M+2}^{i-1}
\frac{t-M}{H-t-M-1}\,y^{*}_{i,r_1r_2}y^{*}_{j,r_1r_2}\\
&= \frac{2}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i=1}^{H-2M-3}\sum_{j=i+M+1}^{H-M-2}
\Big\{(H-2M-1)\log\Big(\frac{H-2M-3}{H-i-M}\Big)-(i-M-2)\Big\}\,y^{*}_{i,r_1r_2}y^{*}_{j,r_1r_2}.
\end{align*}

For $\mathrm{III}$, we have
\begin{align*}
\mathrm{III} = \frac{2}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i=1}^{H-M-1}\sum_{j=i+M+1}^{H}\sum_{t=M+2}^{H-M-2}
\Big\{-1+\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}\Big\}\,
I(i\le t)I(t\le j-1)\,y^{*}_{i,r_1r_2}y^{*}_{j,r_1r_2},
\end{align*}
which, after splitting the range of $(i,j)$, consists of six sums of the same form, each weighted by a factor of the type
$-(j-i)+\sum_{t} M(H-\frac{3}{2}M-\frac{1}{2})/\{t(H-t)-\frac{1}{2}M(M+1)\}$ over the corresponding range of $t$.
Note that, as $t\ge M+2$ and $H\to\infty$,
\begin{align*}
\sum_{t=M+2}^{j-1}\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}
\le \sum_{t=M+2}^{j-1}\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}Mt}
\le \sum_{t=M+2}^{j-1}\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t-M)}
= \frac{M(H-\frac{3}{2}M-\frac{1}{2})}{H-M}\log\Big\{\frac{(j-1)(H-2M-2)}{(M+2)(H-M-j+1)}\Big\}.
\end{align*}
Hereby, for $j=M+3,\ldots,H-M-1$,
\begin{align*}
\sum_{t=M+2}^{j-1}\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)} = o(H-2M-3),
\end{align*}
and the same holds for the other terms. Combining $\mathrm{I}$, $\mathrm{II}$, and $\mathrm{III}$ together, we have

(after grouping the five index regions and absorbing the $\{1+o(1)\}$ factors)
\begin{align*}
\mathrm{I}+\mathrm{II}+\mathrm{III}
&= \frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=M+3}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\,y^{*}_{i,r_1r_2}y^{*}_{j,r_1r_2}
+\frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=1}^{H-2M-3}\sum_{j=i+M+1}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\,y^{*}_{i,r_1r_2}y^{*}_{j,r_1r_2}\\
&\quad-\frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=1}^{H-M-1}\sum_{j=i+M+1}^{H}
\frac{H-2M-3}{H-2M-1}\,y^{*}_{i,r_1r_2}y^{*}_{j,r_1r_2}\\
&= II_A + II_B + II_C.
\end{align*}

We first consider IIA, whose expectation is

\begin{align*}
E(II_A) &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+M+3}^{H+T-M-1}\sum_{j=i+M+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big)(y_{i,r_1r_2}-\sigma_{i,r_1r_2})(y_{j,r_1r_2}-\sigma_{j,r_1r_2})\Big\}\\
&= \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+M+3}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
+\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=H+1}^{H+T-M-1}\sum_{j=i+M+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&\quad+\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+M+3}^{H}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{T-l}\Big)\tilde y_{H-M+l}\tilde y_{H-M+l+q}\Big\}\\
&= E(II_A^{(1)})+E(II_A^{(2)})+E(II_A^{(3)}).
\end{align*}
For some constant $c$,
\begin{align*}
E(II_A^{(1)}) &\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+M+3}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(II_A^{(11)})-E(II_A^{(12)}).
\end{align*}
Note that
\begin{align*}
E(II_A^{(11)}) = \frac{2c}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=M+3}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)
E(y_{i,r_1r_2}-\sigma_{\tau,r_1r_2})E(y_{j,r_1r_2}-\sigma_{\tau,r_1r_2}) = 0.
\end{align*}
Meanwhile, we can write
\begin{align*}
E(II_A^{(12)}) &= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=i+M+1}^{H}\log(H-2M-3)\,\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=i+M+1}^{H}\log(H-i-M)\,\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(II_A^{(121)})+E(II_A^{(122)}).
\end{align*}

Consequently,

\begin{align*}
E(II_A^{(121)}) &= \frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i=M+3}^{T+1}\sum_{j=i+M+1}^{T+M+2}\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big)\\
&\quad+\frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i=M+3}^{T+M+2}\sum_{j=T+M+3}^{H}\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}\tilde y_{T+2+l}\tilde y_{T+2+l+q}\Big)\\
&= E(II_A^{(1211)})+E(II_A^{(1212)}),
\end{align*}
where
\begin{align*}
E(II_A^{(1211)}) &= \frac{c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i,j=M+3}^{T+M+2}\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big)
-\frac{c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i=M+3}^{T+M+2}\tilde y_{i,r_1r_2}^2\Big)\\
&\quad-\frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i=M+3}^{T+M+2}\sum_{q=1}^{M}\tilde y_{i,r_1r_2}\tilde y_{i+q,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}\tilde y_{T+2+l}\tilde y_{T+2+l+q}\Big)\\
&= E(II_A^{(12111)})-E(II_A^{(12112)})-E(II_A^{(12113)}),
\end{align*}
and by Theorem 10 of Merlevède et al. (2006), for a positive constant $D$,
\begin{align*}
E(II_A^{(12111)}) \le \frac{c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\max_{M+3\le t\le H}\Big|\sum_{i=M+3}^{t}\tilde y_{i,r_1r_2}\Big|^2\Big)
\le \frac{c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}2D(H-M-2)E(\tilde y_{1,r_1r_2}^2)
= O\big\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\big\},
\end{align*}
where $E(\tilde y_{1,r_1r_2}^2)<\infty$, as $E(y_{i,r_1r_2}^2)<\infty$ for any $i\ge 1$. For the others,
\begin{align*}
E(II_A^{(12112)}) \le \frac{c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i=M+3}^{H}\tilde y_{i,r_1r_2}^2\Big)
= \frac{c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}(H-M-2)E(\tilde y_{1,r_1r_2}^2)
= O\big\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\big\}.
\end{align*}
Based on a similar idea, it can be shown that $E(II_A^{(12113)}) = O\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\}$.
Hence, $E(II_A^{(1211)}) = O\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\}$.
Next, we study $E(II_A^{(1212)})$. Note that

\begin{align*}
E(II_A^{(1212)}) &= \frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i=M+3}^{T+M+2}\sum_{j=T+M+3}^{H}\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big)
-\frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}\tilde y_{T+2+l}\tilde y_{T+2+l+q}\Big)\\
&= E(II_A^{(12121)})-E(II_A^{(12122)}),
\end{align*}
where
\begin{align*}
|E(II_A^{(12121)})| &\le \frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\Big|\sum_{i=M+3}^{T+M+2}\tilde y_{i,r_1r_2}\Big|\cdot\Big|\sum_{j=T+M+3}^{H}\tilde y_{j,r_1r_2}\Big|\Big)\\
&\le \frac{2c}{H}\log(H)\sum_{r_1,r_2=1}^{p}
\sqrt{E\Big(\sum_{i=M+3}^{T+M+2}\tilde y_{i,r_1r_2}\Big)^2}\cdot
\sqrt{E\Big(\sum_{j=T+M+3}^{H}\tilde y_{j,r_1r_2}\Big)^2},
\end{align*}
in which
\begin{align*}
E\Big(\sum_{i=M+3}^{T+M+2}\tilde y_{i,r_1r_2}\Big)^2
\le E\Big(\max_{M+3\le t\le H}\Big|\sum_{i=M+3}^{t}\tilde y_{i,r_1r_2}\Big|^2\Big)
\le 2D(H-M-2)E(\tilde y_{1,r_1r_2}^2),
\end{align*}
and similarly,
\begin{align*}
E\Big(\sum_{j=T+M+3}^{H}\tilde y_{j,r_1r_2}\Big)^2
\le E\Big(\max_{M+3\le t\le H}\Big|\sum_{i=t}^{H}\tilde y_{i,r_1r_2}\Big|^2\Big)
\le 2D(H-M-2)E(\tilde y_{1,r_1r_2}^2),
\end{align*}
and
\begin{align*}
|E(II_A^{(12122)})| \le \frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
|\tilde y_{T+2+l,r_1r_2}\tilde y_{T+2+l+q,r_1r_2}|\Big)
\le \frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i=1}^{H-M}\sum_{q=1}^{M}|\tilde y_{i,r_1r_2}\tilde y_{i+q,r_1r_2}|\Big)
= O\big\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\big\}.
\end{align*}
Hence, $E(II_A^{(1212)}) = O\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\}$. As a result, we have shown $E(II_A^{(121)}) = O\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\}$.
As $H-i-M = H-2M-3,\ldots,H-T-2M-2$ for $i=M+3,\ldots,T+M+2$, and $T_{\max}=o(H)$, we have, for some constant $c_1$,
\begin{align*}
E(II_A^{(122)}) \le \frac{2c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=i+M+1}^{H}\log(H-2M-3)\,\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}.
\end{align*}
Hence, by the same idea as for $E(II_A^{(121)})$, we have $E(II_A^{(122)}) = O\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\}$. Combining the above results, we obtain $E(II_A^{(1)}) = O\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\}$.

As $H-i+T-M = T-M-1,\ldots,1$ for $i=H+1,\ldots,H+T-M-1$, and $T_{\max}=o(H)$, we have
\begin{align*}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big) = \log(H-2M-3)\{1+o(1)\},
\end{align*}
as $H\to\infty$. Hence, for some constant $c_2$,
\begin{align*}
E(II_A^{(2)}) &\le \frac{2c_2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=H+1}^{H+T-M-1}\sum_{j=i+M+1}^{H+T}\log(H-2M-3)\,\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= \frac{c_2}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i,j=H+1}^{H+T}\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big)
-\frac{c_2}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i=H+1}^{H+T}\tilde y_{i,r_1r_2}^2\Big)\\
&\quad-\frac{2c_2}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i=H+1}^{H+T}\sum_{q=1}^{M}\tilde y_{i,r_1r_2}\tilde y_{i+q,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}\tilde y_{H+T-M+l}\tilde y_{H+T-M+l+q}\Big)\\
&= E(II_A^{(21)})+E(II_A^{(22)})+E(II_A^{(23)}).
\end{align*}
By Lemma 3.1 (vii),
\begin{align*}
E(II_A^{(21)}) = \frac{c_2}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
\big[E(T)\gamma_{\tau+1,r_1r_2}+O\{E(T)\sigma^2_{\tau+1,r_1r_2}\}\big],
\end{align*}
where $\gamma_{\tau+1,r_1r_2}=\mathrm{Var}(y_{H+1,r_1r_2})
+2\sum_{q=1}^{M}\mathrm{Cov}(y_{H+1,r_1r_2},y_{H+1+q,r_1r_2})$.
As $E(y_{i,r_1r_2}^2)<\infty$ for any $i\ge 1$, and $E(T)=o(H)$, we have
$E(II_A^{(21)}) = o\{\log(H)\,\mathrm{tr}(\Sigma_{\tau+1}^2)\}$.
By Lemma 3.1 (v) and a similar idea to $E(II_A^{(21)})$, we obtain
\begin{align*}
E(II_A^{(22)}) = \frac{c_2}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
\big[E(T)E(\tilde y_{H+1,r_1r_2}^2)+O\{E(T)\sigma^2_{\tau+1,r_1r_2}\}\big]
= o\big\{\log(H)\,\mathrm{tr}(\Sigma_{\tau+1}^2)\big\}.
\end{align*}
By Lemma 3.2, we have $E(II_A^{(23)}) = o\{\log(H)\,\mathrm{tr}(\Sigma_{\tau+1}^2)\}$.
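The quantity $\gamma_{\tau+1,r_1r_2}$ above is the usual $M$-dependent long-run variance. As an aside, a plug-in estimate of such a quantity can be computed as in the following sketch (ours; the function name and interface are assumptions, not the dissertation's code).

```python
import numpy as np

# Plug-in estimate (our sketch) of gamma = Var(y_1) + 2 * sum_{q=1}^{M} Cov(y_1, y_{1+q})
# for an M-dependent stationary sequence y.
def long_run_variance(y, M):
    y = np.asarray(y, dtype=float)
    n = len(y)
    yc = y - y.mean()
    gamma = yc @ yc / n                               # lag-0 autocovariance
    for q in range(1, M + 1):
        gamma += 2.0 * (yc[:n - q] @ yc[q:]) / n      # doubled lag-q autocovariance
    return gamma
```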

For $E(II_A^{(3)})$,
\begin{align*}
E(II_A^{(3)}) &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+M+3}^{H}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{T-l}\Big)\tilde y_{H-M+l,r_1r_2}\tilde y_{H-M+l+q,r_1r_2}\Big\},
\end{align*}
where
\begin{align*}
\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{T-l}\Big)\tilde y_{H-M+l,r_1r_2}\tilde y_{H-M+l+q,r_1r_2}\Big\}
\le \frac{2c}{H}\log(H)\sum_{r_1,r_2=1}^{p}\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
E|\tilde y_{H-M+l,r_1r_2}\tilde y_{H-M+l+q,r_1r_2}|
= o\Big\{\frac{1}{\sqrt H}\,\mathrm{tr}(\Sigma_{\tau+1}\Sigma_\tau)\Big\}.
\end{align*}
Hence, up to this negligible term,
\begin{align*}
E(II_A^{(3)}) = \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+M+3}^{H}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\},
\end{align*}
and, similar to $E(II_A^{(1)})$, we have
\begin{align*}
E(II_A^{(3)}) &\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+M+3}^{H}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+M+3}^{H-M-1}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
+\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=H-M}^{H}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(II_A^{(31)})+E(II_A^{(32)}),
\end{align*}

where
\begin{align*}
E(II_A^{(31)}) &= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{H-M-1}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(II_A^{(311)})-E(II_A^{(312)}).
\end{align*}

Consequently,

\begin{align*}
|E(II_A^{(311)})| &\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\Big|\sum_{i=M+3}^{H-M-1}\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\Big|
\cdot\Big|\sum_{j=H+1}^{H+T}\tilde y_{j,r_1r_2}\Big|\Big\}\\
&\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
\sqrt{E\Big(\sum_{i=M+3}^{H-M-1}\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\Big)^2}
\cdot\sqrt{E\Big(\sum_{j=H+1}^{H+T}\tilde y_{j,r_1r_2}\Big)^2}\\
&\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
\sqrt{\log^2(H)(H-2M-3)|\gamma_{\tau,r_1r_2}|}
\cdot\big[\sqrt{E(T)|\gamma_{\tau+1,r_1r_2}|}+O\{\sqrt{E(T)}\,\sigma_{\tau+1,r_1r_2}\}\big],
\end{align*}
where $\gamma_{\tau,r_1r_2}=\mathrm{Var}(y_{1,r_1r_2})
+2\sum_{q=1}^{M}\mathrm{Cov}(y_{1,r_1r_2},y_{1+q,r_1r_2})$,
and $\gamma_{\tau,r_1r_2},\gamma_{\tau+1,r_1r_2}<\infty$. As $E(T)=o(H)$, we have
\begin{align*}
E(II_A^{(311)}) = o\big\{\log(H)\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\big\}.
\end{align*}
And we can write $E(II_A^{(312)})$ as
\begin{align*}
E(II_A^{(312)}) &= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=H+1}^{H+T}\log(H-2M-3)\,\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=H+1}^{H+T}\log(H-i-M)\,\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(II_A^{(3121)})-E(II_A^{(3122)}).
\end{align*}
Consequently,
\begin{align*}
|E(II_A^{(3121)})| &\le \frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big\{\Big|\sum_{i=M+3}^{T+M+2}\tilde y_{i,r_1r_2}\Big|\cdot\Big|\sum_{j=H+1}^{H+T}\tilde y_{j,r_1r_2}\Big|\Big\}\\
&\le \frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
\sqrt{E\Big(\max_{M+3\le t\le H}\Big|\sum_{i=M+3}^{t}\tilde y_{i,r_1r_2}\Big|^2\Big)}
\cdot\sqrt{E\Big(\sum_{j=H+1}^{H+T}\tilde y_{j,r_1r_2}\Big)^2}\\
&\le \frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
\sqrt{2D(H-M-2)E(\tilde y_{1,r_1r_2}^2)}
\cdot\big[\sqrt{E(T)|\gamma_{\tau+1,r_1r_2}|}+O\{\sqrt{E(T)}\,\sigma_{\tau+1,r_1r_2}\}\big]\\
&= o\big\{\log(H)\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\big\}.
\end{align*}
As $H-i-M=H-2M-3,\ldots,H-T-2M-2$ for $i=M+3,\ldots,T+M+2$, and $T_{\max}=o(H)$, for some constant $c_1$ we have
\begin{align*}
E(II_A^{(3122)}) \le \frac{2c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=H+1}^{H+T}\log(H-2M-3)\,\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
= o\big\{\log(H)\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\big\}.
\end{align*}
Hence, $E(II_A^{(31)}) = o\{\log(H)\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\}$.
As $H-i+T-M=T,\ldots,T-M$ for $i=H-M,\ldots,H$, and $T_{\max}=o(H)$,
\begin{align*}
E(II_A^{(32)}) \le \frac{2c_2}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=H-M}^{H}\sum_{j=H+1}^{H+T}\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}.
\end{align*}
Using a similar idea to $E(II_A^{(311)})$, we have $E(II_A^{(32)}) = o\{\log(H)\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\}$.
As a result, we have shown that $E(II_A^{(3)}) = o\{\log(H)\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\}$.
Combining all the results, we have
\begin{align*}
E(II_A) = O\big[\log(H)\,\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}\big].
\end{align*}
By similar derivations, we can show
$E(II_B) = O[\log(H)\,\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}]$ and
$E(II_C) = O[\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}]$.
In summary, $E(II) = O[\log(H)\,\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}]$.
This completes the proof of Lemma 3.4.

Lemma 3.5 Under the same conditions as in Theorem 3.2,
\begin{align*}
E\Big\{\frac{2}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i,j=1}^{H}
W_M(i,j)\,\sigma_{T+i,r_1r_2}(y_{T+j,r_1r_2}-\sigma_{T+j,r_1r_2})\Big\}
= o\big[\sqrt H\,\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}\big].
\end{align*}
Proof. We first write
\begin{align*}
III \equiv \frac{2}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i,j=1}^{H}
W_M(i,j)\,\sigma_{T+i,r_1r_2}(y_{T+j,r_1r_2}-\sigma_{T+j,r_1r_2})
= III_1 + III_2,
\end{align*}
where
\begin{align*}
III_1 &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=M+3}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\sigma_{T+i,r_1r_2}(y_{T+j,r_1r_2}-\sigma_{T+j,r_1r_2})\\
&\quad+\frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=1}^{H-2M-3}\sum_{j=i+M+1}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{T+i,r_1r_2}(y_{T+j,r_1r_2}-\sigma_{T+j,r_1r_2})\\
&\quad-\frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=1}^{H-M-1}\sum_{j=i+M+1}^{H}
\frac{H-2M-3}{H-2M-1}\,\sigma_{T+i,r_1r_2}(y_{T+j,r_1r_2}-\sigma_{T+j,r_1r_2}),
\end{align*}
and $III_2$ is defined analogously, with the roles of $\sigma_{T+i,r_1r_2}$ and
$(y_{T+i,r_1r_2}-\sigma_{T+i,r_1r_2})$ interchanged.

We study $III_1$ first, and write
\begin{align*}
III_1 = III_{1,A} + III_{1,B} - III_{1,C},
\end{align*}
corresponding to the three sums above. To simplify the notation, let
$III_A \equiv III_{1,A}$, $III_B \equiv III_{1,B}$, and $III_C \equiv III_{1,C}$. Then
\begin{align*}
III_A &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=T+M+3}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}
+\frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=H+1}^{H+T-M-1}\sum_{j=i+M+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big)\sigma_{\tau+1,r_1r_2}\tilde y_{j,r_1r_2}\\
&\quad+\frac{2}{H}\sum_{r_1,r_2=1}^{p}
\Big\{\sum_{i=T+M+3}^{H}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{T-l}\Big)\sigma_{\tau,r_1r_2}\tilde y_{H-M+l+q,r_1r_2}\Big\}\\
&= III_A^{(1)} + III_A^{(2)} + III_A^{(3)}.
\end{align*}

Note that for some constant c, we have

\begin{align*}
E(III_A^{(1)}) &\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+M+3}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(III_A^{(11)})-E(III_A^{(12)}),
\end{align*}
where $E(III_A^{(11)})=0$. We study $E(III_A^{(12)})$:
\begin{align*}
E(III_A^{(12)}) &= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=i+M+1}^{T+2M+3}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
+\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=T+2M+4}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(III_A^{(121)})+E(III_A^{(122)}),
\end{align*}
where
\begin{align*}
|E(III_A^{(121)})| \le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
E\Big\{\sum_{i=M+3}^{T+M+2}\log\Big(\frac{H-2M-3}{H-i-M}\Big)
\sum_{j=i+M+1}^{T+2M+3}|\tilde y_{j,r_1r_2}|\Big\}.
\end{align*}
As $H\to\infty$ and $T_{\max}=o(H)$,
\begin{align*}
\sum_{i=M+3}^{T+M+2}\log\Big(\frac{H-2M-3}{H-i-M}\Big)
=\int_{H-T-2M-2}^{H-2M-3}\log\Big(\frac{H-2M-3}{x}\Big)dx = O(1).
\end{align*}
Hence, as $E(y_{i,r_1r_2}^2)<\infty$ for any $i\ge 1$, for some constant $c_1$,
\begin{align*}
|E(III_A^{(121)})| \le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
E\Big(\sum_{j=2M+4}^{T+2M+3}|\tilde y_{j,r_1r_2}|\Big)
\le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|
\sum_{j=2M+4}^{H}E|\tilde y_{j,r_1r_2}|
= O\big\{\mathrm{tr}(\Sigma_\tau^2)\big\}.
\end{align*}
Similarly,
\begin{align*}
|E(III_A^{(122)})| \le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
E\Big(\sum_{j=T+2M+4}^{H}|\tilde y_{j,r_1r_2}|\Big)
\le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|
\sum_{j=2M+4}^{H}E|\tilde y_{j,r_1r_2}|
= O\big\{\mathrm{tr}(\Sigma_\tau^2)\big\}.
\end{align*}
Hence, $|E(III_A^{(12)})| = O\{\mathrm{tr}(\Sigma_\tau^2)\}$. As a result, we have shown
\begin{align*}
E(III_A^{(1)}) = O\big\{\mathrm{tr}(\Sigma_\tau^2)\big\}.
\end{align*}

For some constants $c$ and $c_1$, we have
\begin{align*}
E(III_A^{(2)}) &\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=H+1}^{H+T-M-1}\sum_{j=i+M+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T+1}\Big)\sigma_{\tau+1,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&\le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i,j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T+1}\Big)\sigma_{\tau+1,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T+1}\Big)\sigma_{\tau+1,r_1r_2}\tilde y_{i,r_1r_2}\Big\}\\
&\quad-\frac{2c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=H+1}^{H+T}\sum_{q=1}^{M}
\log\Big(\frac{H-2M-3}{H-i+T+1}\Big)\sigma_{\tau+1,r_1r_2}\tilde y_{i+q,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{M-l+1}\Big)\sigma_{\tau+1,r_1r_2}\tilde y_{H+T-M+l+q,r_1r_2}\Big\}\\
&= E(III_A^{(21)})-E(III_A^{(22)})-E(III_A^{(23)}).
\end{align*}
For $i=H+1,\ldots,H+T$, $H-i+T+1=T,\ldots,1$, all of smaller order than $H$. By Lemma 3.1 (v), we have
\begin{align*}
E(III_A^{(22)}) \le \frac{c_2}{H}\sum_{r_1,r_2=1}^{p}\sigma_{\tau+1,r_1r_2}\log(H)\,
E\Big(\sum_{i=H+1}^{H+T}\tilde y_{i,r_1r_2}\Big)
= \frac{c_2}{H}\sum_{r_1,r_2=1}^{p}\sigma_{\tau+1,r_1r_2}\log(H)\,
O\{\sqrt{E(T)}\,\sigma_{\tau+1,r_1r_2}\}
= o\big\{\log(H)\,\mathrm{tr}(\Sigma_{\tau+1}^2)/\sqrt H\big\},
\end{align*}
and by a similar idea and Lemma 3.2,
$E(III_A^{(23)}) = o\{\log(H)\,\mathrm{tr}(\Sigma_{\tau+1}^2)/\sqrt H\}$.

For $E(III_A^{(21)})$, note that
\begin{align*}
E(III_A^{(21)}) &= \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+2M+4}^{H+T}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T+1}\Big)\sigma_{\tau+1,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+2M+4}^{H}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T+1}\Big)\sigma_{\tau+1,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(III_A^{(211)})-E(III_A^{(212)}),
\end{align*}
and
\begin{align*}
|E(III_A^{(211)})| \le \frac{c}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|\,
E\Big[\Big\{\sum_{i=T+2M+4}^{H+T}\log\Big(\frac{H-2M-3}{H-i+T+1}\Big)\Big\}
\cdot\Big|\sum_{j=H+1}^{H+T}\tilde y_{j,r_1r_2}\Big|\Big].
\end{align*}
As $H\to\infty$,
\begin{align*}
\sum_{i=T+2M+4}^{H+T}\log\Big(\frac{H-2M-3}{H-i+T+1}\Big)
=\int_{1}^{H-2M-3}\log\Big(\frac{H-2M-3}{x}\Big)dx
=(H-2M-4)-\log(H-2M-3)=O(H).
\end{align*}
Hence, by Lemma 3.1 (vi),
\begin{align*}
|E(III_A^{(211)})| \le c\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|
\sqrt{E\Big(\sum_{j=H+1}^{H+T}\tilde y_{j,r_1r_2}\Big)^2}
= c\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|
\big[\sqrt{E(T)|\gamma_{\tau+1,r_1r_2}|}+O\{\sqrt{E(T)}\,\sigma_{\tau+1,r_1r_2}\}\big]
= o\big\{\sqrt H\,\mathrm{tr}(\Sigma_{\tau+1}^2)\big\},
\end{align*}
where $\gamma_{\tau+1,r_1r_2}=\mathrm{Var}(y_{H+1,r_1r_2})
+2\sum_{q=1}^{M}\mathrm{Cov}(y_{H+1,r_1r_2},y_{H+1+q,r_1r_2})$,
and $\gamma_{\tau+1,r_1r_2}<\infty$ as $E(y_{i,r_1r_2}^2)<\infty$ for $i\ge 1$.
Similar to $E(III_A^{(211)})$,
\begin{align*}
|E(III_A^{(212)})| \le \frac{c}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|\,
E\Big[\Big\{\sum_{i=T+2M+4}^{H+T}\log\Big(\frac{H-2M-3}{H-i+T+1}\Big)\Big\}
\cdot\Big|\sum_{j=H+1}^{H+T}\tilde y_{j,r_1r_2}\Big|\Big]
= o\big\{\sqrt H\,\mathrm{tr}(\Sigma_{\tau+1}^2)\big\}.
\end{align*}
As a result, we have shown $E(III_A^{(2)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_{\tau+1}^2)\}$.
Similar to $E(III_A^{(212)})$, we can also show that
$E(III_A^{(3)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\}$.
Combining the above results, we have
$E(III_A) = o[\sqrt H\,\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}]$.

Next, we consider
\begin{align*}
III_B &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=T+1}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{j-T-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}
+\frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=H+1}^{H+T-2M-3}\sum_{j=i+M+1}^{H+T-M-2}
\log\Big(\frac{H-2M-3}{j-T-M-1}\Big)\sigma_{\tau+1,r_1r_2}\tilde y_{j,r_1r_2}\\
&\quad+\frac{2}{H}\sum_{r_1,r_2=1}^{p}
\Big\{\sum_{i=T+1}^{H}\sum_{j=H+1}^{H+T-M-2}
\log\Big(\frac{H-2M-3}{j-T-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{H-T-2M-1+l+q}\Big)\sigma_{\tau,r_1r_2}\tilde y_{H-M+l+q,r_1r_2}\Big\}\\
&= III_B^{(1)}+III_B^{(2)}+III_B^{(3)}.
\end{align*}

First,

\begin{align*}
E(III_B^{(1)}) &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+1}^{H-2M-3}\sum_{j=i+M+1}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-T-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}
+\sum_{i=T+1}^{H-2M-3}\sum_{j=H-M-1}^{H}
\log\Big(\frac{H-2M-3}{j-T-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\\
&\qquad+\sum_{i=H-2M-2}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{j-T-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(III_B^{(1a)})+E(III_B^{(1b)})+E(III_B^{(1c)}).
\end{align*}
For some constant $c_0$,
\begin{align*}
|E(III_B^{(1b)})| &= \frac{2c_0}{H}\Big|\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+1}^{H-2M-3}\sum_{j=H-M-1}^{H}
\log\Big(\frac{H-2M-3}{j-2M-3}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\Big|\\
&\le \frac{2c_0}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|(H-2M-3)
\sum_{j=H-M-1}^{H}\log\Big(\frac{H-2M-3}{j-2M-3}\Big)E|\tilde y_{j,r_1r_2}|\\
&\le \frac{2c_0}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|(H-2M-3)(M+2)\,
\frac{M+1}{H-3M-4}\,E|\tilde y_{1,r_1r_2}|
= O\Big\{\frac{1}{H}\,\mathrm{tr}(\Sigma_\tau^2)\Big\}.
\end{align*}
Similarly, we also have $|E(III_B^{(1c)})| = O\{(1/H)\,\mathrm{tr}(\Sigma_\tau^2)\}$. And
\begin{align*}
E(III_B^{(1a)}) &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+1}^{H-2M-3}\sum_{j=i+M+1}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-T-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+2}^{H-2M-3}\sum_{j=i+M+1}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+2}^{T}\sum_{j=i+M+1}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(III_B^{(11)})-E(III_B^{(12)}),
\end{align*}
where $E(III_B^{(11)})=0$.

\begin{align*}
E(III_B^{(12)}) &= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+2}^{T-M-1}\sum_{j=i+M+1}^{T}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&\quad+\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+2}^{T}\sum_{j=T+1}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{T-2M-1+l+q}\Big)\sigma_{\tau,r_1r_2}\tilde y_{T-M+l+q,r_1r_2}\Big\}\\
&= E(III_B^{(121)})+E(III_B^{(122)}),
\end{align*}
where
\begin{align*}
E(III_B^{(121)}) &\le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i,j=M+2}^{T}\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{j=M+2}^{T}\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&\quad-\frac{2c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+2}^{T}\sum_{q=1}^{M}
\log\Big(\frac{H-2M-3}{i+q-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{i+q,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{T-2M-1+l+q}\Big)\sigma_{\tau,r_1r_2}\tilde y_{T-M+l+q,r_1r_2}\Big\}\\
&= E(III_B^{(1211)})-E(III_B^{(1212)})-E(III_B^{(1213)}).
\end{align*}
We study $E(III_B^{(1212)})$ first. As $E(y_{1,r_1r_2}^2)<\infty$, we have
\begin{align*}
|E(III_B^{(1212)})| &\le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
E\Big\{\sum_{j=M+2}^{T}\log\Big(\frac{H-2M-3}{j-M-1}\Big)|\tilde y_{j,r_1r_2}|\Big\}
\le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
E\Big\{\sum_{j=M+2}^{H-M-2}\log\Big(\frac{H-2M-3}{j-M-1}\Big)|\tilde y_{j,r_1r_2}|\Big\}\\
&= \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|
\big\{(H-2M-4)-\log(H-2M-3)\big\}E|\tilde y_{1,r_1r_2}|
= O\big\{\mathrm{tr}(\Sigma_\tau^2)\big\}.
\end{align*}
By a similar idea, we also have $|E(III_B^{(1213)})| = O\{\mathrm{tr}(\Sigma_\tau^2)\}$.
We now turn to $E(III_B^{(1211)})$:
\begin{align*}
|E(III_B^{(1211)})| \le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
E\Big|\sum_{i=M+2}^{T}\log\Big(\frac{H-2M-3}{i-M-1}\Big)\sum_{j=M+2}^{T}\tilde y_{j,r_1r_2}\Big|
\le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
\sqrt{E\Big\{\sum_{i=M+2}^{T}\log\Big(\frac{H-2M-3}{i-M-1}\Big)\Big\}^2}
\cdot\sqrt{E\Big(\sum_{j=M+2}^{T}\tilde y_{j,r_1r_2}\Big)^2},
\end{align*}
where
\begin{align*}
E\Big\{\sum_{i=M+2}^{T}\log\Big(\frac{H-2M-3}{i-M-1}\Big)\Big\}^2
\le \Big\{\sum_{i=M+2}^{H-M-2}\log\Big(\frac{H-2M-3}{i-M-1}\Big)\Big\}^2
= \Big\{\int_{1}^{H-2M-3}\log\Big(\frac{H-2M-3}{x}\Big)dx\Big\}^2
= \big\{(H-2M-4)-\log(H-2M-3)\big\}^2,
\end{align*}
as $H\to\infty$ and $T_{\max}=o(H)$. Next, by Theorem 10 of Merlevède et al. (2006), we have
\begin{align*}
E\Big(\sum_{j=M+2}^{T}\tilde y_{j,r_1r_2}\Big)^2
\le E\Big(\max_{M+2\le t\le T_{\max}}\Big|\sum_{j=M+2}^{t}\tilde y_{j,r_1r_2}\Big|^2\Big)
\le 2D(T_{\max}-M-1)E(\tilde y_{1,r_1r_2}^2),
\end{align*}
where $D$ is a positive constant and $E(\tilde y_{1,r_1r_2}^2)<\infty$. Hence,
\begin{align*}
|E(III_B^{(1211)})| = \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|
\cdot O(H)\cdot o(\sqrt H)
= o\big\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\big\}.
\end{align*}

For $E(III_B^{(122)})$, we have
\begin{align*}
E(III_B^{(122)}) &= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+2}^{T}\sum_{j=T+1}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{T-2M-1+l+q}\Big)\sigma_{\tau,r_1r_2}\tilde y_{T-M+l+q,r_1r_2}\Big\}\\
&= E(III_B^{(1221)})+E(III_B^{(1222)}).
\end{align*}
We study $E(III_B^{(1222)})$ first. Note that
\begin{align*}
|E(III_B^{(1222)})| \le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\log(H)\,M
\sum_{i=1}^{H}E|\tilde y_{i,r_1r_2}| = O\big\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\big\},
\end{align*}
as $E(y_{i,r_1r_2}^2)<\infty$ for $i\ge 1$. Moreover,
\begin{align*}
E(III_B^{(1221)}) &= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+2}^{T}\sum_{j=T+1}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\{1+o(1)\}\\
&= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+2}^{T}\sum_{j=M+2}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\{1+o(1)\}
-\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i,j=M+2}^{T}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\{1+o(1)\}\\
&= E(III_B^{(12211)})-E(III_B^{(12212)}).
\end{align*}
By the same idea as for $E(III_B^{(1211)})$, we have
$E(III_B^{(12212)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\}$. As a result,
\begin{align*}
|E(III_B^{(12211)})| &\le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
E\Big|(T-M-1)\sum_{j=M+2}^{H-M-2}\log\Big(\frac{H-2M-3}{j-M-1}\Big)\tilde y_{j,r_1r_2}\Big|\\
&\le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
\sqrt{E(T-M-1)^2}\cdot
\sqrt{E\Big\{\sum_{j=M+2}^{H-M-2}\log\Big(\frac{H-2M-3}{j-M-1}\Big)\tilde y_{j,r_1r_2}\Big\}^2}.
\end{align*}
Note that
\begin{align*}
E\Big\{\sum_{j=M+2}^{H-M-2}\log\Big(\frac{H-2M-3}{j-M-1}\Big)\tilde y_{j,r_1r_2}\Big\}^2
&\le (2M+1)|\gamma_{\tau,r_1r_2}|\sum_{j=M+2}^{H-M-2}\log^2\Big(\frac{H-2M-3}{j-M-1}\Big)
= (2M+1)|\gamma_{\tau,r_1r_2}|\int_{1}^{H-2M-3}\log^2\Big(\frac{H-2M-3}{x}\Big)dx\\
&= (2M+1)|\gamma_{\tau,r_1r_2}|\Big[2\Big(H-2M-\frac{7}{2}\Big)-\{\log(H-2M-3)+1\}^2\Big]
= O(H),
\end{align*}
where $\gamma_{\tau,r_1r_2}=\mathrm{Var}(y_{1,r_1r_2})
+2\sum_{q=1}^{M}\mathrm{Cov}(y_{1,r_1r_2},y_{1+q,r_1r_2})<\infty$.
As $E(T-M-1)^2\le T_{\max}^2=o(H^2)$, therefore
$|E(III_B^{(12211)})| = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\}$, and hence
$|E(III_B^{(1221)})| = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\}$.
As a result, we have shown $E(III_B^{(1)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\}$.

Next, we study $E(III_B^{(2)})$ and $E(III_B^{(3)})$. For some constant $c$, we have
\begin{align*}
|E(III_B^{(2)})| &\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|\cdot
E\Big|\sum_{i=H+1}^{H+T-2M-3}\sum_{j=i+M+1}^{H+T-M-2}
\log\Big(\frac{H-2M-3}{i-T-M-1}\Big)\tilde y_{j,r_1r_2}\Big|\\
&\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|\cdot
E\Big[\Big\{\sum_{i=H+1}^{H+T-2M-3}\log\Big(\frac{H-2M-3}{i-T-M-1}\Big)\Big\}
\sum_{j=H+1}^{H+T}|\tilde y_{j,r_1r_2}|\Big].
\end{align*}
As $H\to\infty$ and $T_{\max}=o(H)$,
\begin{align*}
\sum_{i=H+1}^{H+T-2M-3}\log\Big(\frac{H-2M-3}{i-T-M-1}\Big)
=\int_{H-T-M-2}^{H-3M-4}\log\Big(\frac{H-2M-3}{x}\Big)dx = O(1).
\end{align*}
Hence, for some constant $c_1$, and by Lemma 3.1 (v),
\begin{align*}
|E(III_B^{(2)})| \le \frac{2c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|\cdot
E\Big(\sum_{j=H+1}^{H+T}|\tilde y_{j,r_1r_2}|\Big)
= \frac{2c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|
\big[E(T)E|\tilde y_{H+1,r_1r_2}|+O\{\sqrt{E(T)}\,\sigma_{\tau+1,r_1r_2}\}\big]
= O\big\{\mathrm{tr}(\Sigma_{\tau+1}^2)\big\}.
\end{align*}

Next, we study $E(III_B^{(3)})$:
\begin{align*}
E(III_B^{(3)}) &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+1}^{H}\sum_{j=H+1}^{H+T-M-2}
\log\Big(\frac{H-2M-3}{j-T-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{H-T-2M-1+l+q}\Big)\sigma_{\tau,r_1r_2}\tilde y_{H-M+l+q,r_1r_2}\Big\}\\
&= E(III_B^{(31)})-E(III_B^{(32)}).
\end{align*}
As $H\to\infty$, for $j=H+1,\ldots,H+T-M-2$,
\begin{align*}
\log\Big(\frac{H-2M-3}{j-T-M-1}\Big) = O(1).
\end{align*}
Hence, for some constant $c$, and by Lemma 3.1,
\begin{align*}
|E(III_B^{(31)})| &\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
E\Big\{(H-T)\Big|\sum_{j=H+1}^{H+T-M-2}\tilde y_{j,r_1r_2}\Big|\Big\}
\le \frac{2c}{H}\,H\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|
\sqrt{E\Big(\sum_{j=H+1}^{H+T-M-2}\tilde y_{j,r_1r_2}\Big)^2}\\
&= \frac{2}{H}\,H\sum_{r_1,r_2=1}^{p}\sigma_{\tau,r_1r_2}
\big[\sqrt{E(T)|\gamma_{\tau+1,r_1r_2}|}+O\{\sqrt{E(T)}\}\big]
= o\big\{\sqrt H\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\big\}.
\end{align*}
By a similar idea, it can be shown that
$E(III_B^{(32)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\}$.
Combining all the results, we have
$E(III_B) = o[\sqrt H\,\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}]$.

Finally, we study part $III_C$:
\begin{align*}
III_C &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=T+1}^{H+T-M-1}\sum_{j=i+M+1}^{H+T}
\sigma_{i,r_1r_2}\tilde y_{j,r_1r_2}\\
&= \frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=T+1}^{H-M-1}\sum_{j=i+M+1}^{H}
\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}
+\frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=H+1}^{H+T-M-1}\sum_{j=i+M+1}^{H+T}
\sigma_{\tau+1,r_1r_2}\tilde y_{j,r_1r_2}
+\frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=T+1}^{H}\sum_{j=H+1}^{H+T}
\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\\
&= III_C^{(1)}+III_C^{(2)}+III_C^{(3)}.
\end{align*}

First,

\begin{align*}
E(III_C^{(1)}) = \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=1}^{H-M-1}\sum_{j=i+M+1}^{H}\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=1}^{T}\sum_{j=i+M+1}^{H}\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
= E(III_C^{(11)})-E(III_C^{(12)}),
\end{align*}
where $E(III_C^{(11)})=0$. We study $E(III_C^{(12)})$:
\begin{align*}
E(III_C^{(12)}) &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=1}^{T-M-1}\sum_{j=i+M+1}^{T}\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
+\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=1}^{T}\sum_{j=T+1}^{H}\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}\sigma_{\tau,r_1r_2}\tilde y_{T-M+l+q}\Big\}\\
&= E(III_C^{(121)})+E(III_C^{(122)}).
\end{align*}
We can write $E(III_C^{(121)})$ as
\begin{align*}
E(III_C^{(121)}) &= \frac{1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i,j=1}^{T}\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=1}^{T}\sigma_{\tau,r_1r_2}\tilde y_{i,r_1r_2}\Big\}\\
&\quad-\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=1}^{T}\sum_{q=1}^{M}\sigma_{\tau,r_1r_2}\tilde y_{i+q,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}\sigma_{\tau,r_1r_2}\tilde y_{T-M+l+q,r_1r_2}\Big\}\\
&= E(III_C^{(1211)})-E(III_C^{(1212)})-E(III_C^{(1213)}),
\end{align*}
where
\begin{align*}
|E(III_C^{(1211)})| \le \frac{1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
\sqrt{E(T^2)}\cdot\sqrt{E\Big(\sum_{j=1}^{T}\tilde y_{j,r_1r_2}\Big)^2}.
\end{align*}
As $E(T^2)\le T_{\max}^2=o(H^2)$, $\sqrt{E(T^2)}=o(H)$. By Theorem 10 of Merlevède et al. (2006),
\begin{align*}
E\Big(\sum_{j=1}^{T}\tilde y_{j,r_1r_2}\Big)^2
\le E\Big(\max_{1\le t\le T_{\max}}\Big|\sum_{j=1}^{t}\tilde y_{j,r_1r_2}\Big|^2\Big)
\le 2DT_{\max}E(\tilde y_{1,r_1r_2}^2) = o(H).
\end{align*}
Hence, $E(III_C^{(1211)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\}$.
Similarly, we can show $E(III_C^{(1212)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\}$ and
$E(III_C^{(1213)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\}$. Moreover,
\begin{align*}
E(III_C^{(122)}) = \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=1}^{T}\sum_{j=T+1}^{H}\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}\sigma_{\tau,r_1r_2}\tilde y_{T-M+l+q}\Big\}
= E(III_C^{(1221)})-E(III_C^{(1222)}).
\end{align*}
It can be shown that $E(III_C^{(1222)}) = o\{\mathrm{tr}(\Sigma_\tau^2)\}$. And for $E(III_C^{(1221)})$,
\begin{align*}
E(III_C^{(1221)}) = \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=1}^{T}\sum_{j=1}^{H}\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i,j=1}^{T}\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
= E(III_C^{(12211)})-E(III_C^{(12212)}).
\end{align*}
Similar to $E(III_C^{(1211)})$, $E(III_C^{(12212)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\}$, and
\begin{align*}
|E(III_C^{(12211)})| \le \frac{2}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
\sqrt{E(T^2)}\cdot\sqrt{E\Big(\sum_{j=1}^{H}\tilde y_{j,r_1r_2}\Big)^2}
\le \frac{2}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,T_{\max}\sqrt{H|\gamma_{\tau,r_1r_2}|}
= o\big\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\big\},
\end{align*}
where $\gamma_{\tau,r_1r_2}=\mathrm{Var}(y_{1,r_1r_2})
+2\sum_{q=1}^{M}\mathrm{Cov}(y_{1,r_1r_2},y_{1+q,r_1r_2})<\infty$.
Next, we study $E(III_C^{(2)})$:
\begin{align*}
E(III_C^{(2)}) &= \frac{1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i,j=H+1}^{H+T}\sigma_{\tau+1,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=H+1}^{H+T}\sigma_{\tau+1,r_1r_2}\tilde y_{i,r_1r_2}\Big\}\\
&\quad-\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=H+1}^{H+T}\sum_{q=1}^{M}\sigma_{\tau+1,r_1r_2}\tilde y_{i+q,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}\sigma_{\tau+1,r_1r_2}\tilde y_{H+T-M+l+q,r_1r_2}\Big\}\\
&= E(III_C^{(21)})-E(III_C^{(22)})-E(III_C^{(23)}),
\end{align*}
where
\begin{align*}
|E(III_C^{(21)})| \le \frac{1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|\,
\sqrt{E(T^2)}\cdot\sqrt{E\Big(\sum_{j=H+1}^{H+T}\tilde y_{j,r_1r_2}\Big)^2}
= \frac{1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|\cdot o(H)\cdot
O\{\sqrt{E(T)}\,\sigma_{\tau+1,r_1r_2}\}
= o\big\{\sqrt H\,\mathrm{tr}(\Sigma_{\tau+1}^2)\big\}.
\end{align*}
By a similar idea, we can show
$E(III_C^{(22)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_{\tau+1}^2)\}$ and
$E(III_C^{(23)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_{\tau+1}^2)\}$. Finally,
\begin{align*}
|E(III_C^{(3)})| \le \frac{2}{H}\,O(H)\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
\sqrt{E\Big(\sum_{j=H+1}^{H+T}\tilde y_{j,r_1r_2}\Big)^2}
= \frac{2}{H}\,O(H)\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\cdot
O\{\sqrt{E(T)|\gamma_{\tau+1,r_1r_2}|}\}
= o\big\{\sqrt H\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\big\}.
\end{align*}
Hence, $E(III_C) = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\}$.
In summary, $E(III_1) = o[\sqrt H\,\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}]$.
By similar derivations, we can also show
$E(III_2) = o[\sqrt H\,\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}]$.
This completes the proof of Lemma 3.5.

CHAPTER 4

CONCLUSION

In Chapter 2, we propose a new approach for testing and estimating one or multiple change points in a sequence of dependent high-dimensional offline data. The method has several advantages that give it a wide range of applications. First, by using a factor model, the proposed method accommodates both spatial and temporal dependence in the data without imposing structural assumptions. Second, the method is nonparametric, assuming no particular distributional form for the data.
Last but not least, our method can be applied to data with large dimensionality p, and it does not require a particular growth rate of the dimension p relative to the sample size n.

We also explicitly derive and discuss the convergence rate of the change-point estimator with respect to n and p, as well as for various locations of the change point, including the boundary. For estimating multiple change points, we consider both binary segmentation and wild binary segmentation; a generic sketch of the latter is given below.
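The following skeleton (ours, for orientation only) shows the overall shape of wild binary segmentation: `stat(x, s, e)` is a placeholder assumed to return the maximal test statistic and its interior location on the segment `x[s:e]`, standing in for the studentized statistic of Chapter 2.

```python
import random

# Wild binary segmentation skeleton (our sketch). `stat(x, s, e)` is assumed
# to return (max_statistic, interior_argmax) for the candidate segment x[s:e].
def wbs(x, s, e, stat, threshold, n_intervals=100, min_len=2, rng=random.Random(0)):
    if e - s < 2 * min_len:
        return []
    candidates = [(s, e)]
    for _ in range(n_intervals):                  # draw random sub-intervals of [s, e)
        a = rng.randrange(s, e - min_len)
        b = rng.randrange(a + min_len, e)
        candidates.append((a, b))
    best_val, best_loc = max(stat(x, a, b) for (a, b) in candidates)
    if best_val <= threshold:
        return []                                 # no significant change point here
    left = wbs(x, s, best_loc, stat, threshold, n_intervals, min_len, rng)
    right = wbs(x, best_loc, e, stat, threshold, n_intervals, min_len, rng)
    return left + [best_loc] + right
```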

The only tuning parameter is the dependence order M. Based on a consistent estimator, we suggest an easy-to-use elbow method to search for it. The proposed test statistic Lˆt in (2.4) may suffer a loss of power if the change occurs in only a small number of components of the population parameter, or if there are only a few change points within the sequence. To overcome this power loss, we also propose a power enhancement statistic Lˆ/σˆn,0 + L0 and study its asymptotic performance.

In Chapter 3, we propose a new procedure to detect anomalies in the covariance structure of high-dimensional online data. The procedure is implementable when the data are non-Gaussian and exhibit both spatial and temporal dependence. We investigate its theoretical properties by deriving an explicit expression for the average run length (ARL) and an upper bound for the expected detection delay (EDD). The established ARL can be used to set the threshold in the stopping rule without running time-consuming Monte Carlo simulations, as illustrated in the sketch below. The derived upper bound quantifies the impact of data dependence and of the magnitude of the change in covariance structure on the EDD. The theoretical properties are examined and supported by both simulation studies and a real application.
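To make the use of the ARL-calibrated threshold concrete, the following minimal monitoring loop (our sketch; `detector` is a placeholder for the covariance-change statistic of Chapter 3, and `b` is the threshold implied by a target ARL) shows the shape of the stopping rule.

```python
from collections import deque

# Minimal online monitoring loop (our sketch). `detector(window)` stands in
# for the statistic computed on the most recent observations; `b` is the
# threshold calibrated from a target average run length.
def monitor(stream, detector, b, window_size):
    window = deque(maxlen=window_size)
    for n, x in enumerate(stream, start=1):
        window.append(x)
        if len(window) == window_size and detector(list(window)) > b:
            return n                              # stopping time: declare a change
    return None                                   # stream ended with no alarm
```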

BIBLIOGRAPHY

[1] Allen, E. A., Damaraju, E., Plis, S. M., Erhardt, E. B., Eichele, T. and Calhoun, V. D. (2014), “Tracking whole-brain connectivity dynamics in the resting state,” Cerebral Cortex, 24(3), 663-676.
[2] Aminikhanghahi, S. and Cook, D. J. (2017), “A survey of methods for time series change point detection,” Knowledge and Information Systems, 51(2), 339-367.
[3] Ashby, F. G. (2011), Statistical Analysis of fMRI Data, MIT Press.
[4] Aston, J. and Kirch, C. (2012), “Evaluating stationarity via change-point alternatives with applications to fMRI data,” The Annals of Applied Statistics, 6, 1906-1948.
[5] Aston, J. and Kirch, C. (2014), “Efficiency of change point tests in high dimensional setting,” arXiv preprint arXiv:1409.1771.
[6] Aue, A., Hörmann, S., Horváth, L. and Reimherr, M. (2009), “Break detection in the covariance structure of multivariate time series models,” The Annals of Statistics, 37, 4046-4087.
[7] Ayyala, D., Park, J. and Roy, A. (2017), “Mean vector testing for high-dimensional dependent observations,” Journal of Multivariate Analysis, 153, 136-155.
[8] Bai, J. (2010), “Common breaks in means and variances for panel data,” Journal of Econometrics, 157, 78-92.
[9] Bai, Z. D. and Saranadasa, H. (1996), “Effect of high dimension: by an example of a two sample problem,” Statistica Sinica, 6, 311-329.
[10] Bara, I. A., Fung, C. J. and Dinh, T. (2015), “Enhancing Twitter spam accounts discovery using cross-account pattern mining,” 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM), 491-496, IEEE.
[11] Billingsley, P. (1999), Convergence of Probability Measures, John Wiley & Sons.
[12] Calhoun, V. D., Miller, R., Pearlson, G. and Adalı, T. (2014), “The chronnectome: time-varying connectivity networks as the next frontier in fMRI data discovery,” Neuron, 84(2), 262-274.
[13] Carrasco, M. and Chen, X. (2002), “Mixing and moment properties of various GARCH and stochastic volatility models,” Econometric Theory, 18, 17-39.
[14] Chan, H. P. (2017), “Optimal sequential detection in multi-stream data,” The Annals of Statistics, 45(6), 2736-2763.
[15] Chan, H. P. and Walther, G. (2015), “Optimal detection of multi-sample aligned sparse signals,” The Annals of Statistics, 43(5), 1865-1895.
[16] Chan, J., Horváth, L. and Hušková, M. (2013), “Darling-Erdős limit results for change-point detection in panel data,” Journal of Statistical Planning and Inference, 143, 955-970.
[17] Chang, C. and Glover, G. H. (2010), “Time-frequency dynamics of resting-state brain connectivity measured with fMRI,” Neuroimage, 50(1), 81-98.

[18] Chen, H. (2019), “Sequential change-point detection based on nearest neighbors,” The Annals of Statistics, 47(3), 1381-1407.
[19] Chen, H. and Zhang, N. (2015), “Graph-based change-point detection,” The Annals of Statistics, 43, 139-176.
[20] Chen, J. and Gupta, A. (1997), “Testing and locating variance change-points with application to stock prices,” Journal of the American Statistical Association, 92, 739-747.
[21] Chen, S. X., Li, J. and Zhong, P. S. (2019), “Supplement to ‘Two-sample and ANOVA tests for high dimensional means,’” DOI:10.1214/18-AOS1720SUPP.
[22] Chen, S. X. and Qin, Y. (2010), “A two-sample test for high-dimensional data with applications to gene-set testing,” The Annals of Statistics, 38, 808-835.
[23] Cho, H. (2016), “Change-point detection in panel data via double CUSUM statistic,” Electronic Journal of Statistics, 10, 2000-2038.
[24] Cho, H. and Fryzlewicz, P. (2015), “Multiple change-point detection for high dimensional time series via sparsified binary segmentation,” Journal of the Royal Statistical Society, Series B, 77, 475-507.
[25] Chu, L. and Chen, H. (2018), “Sequential change-point detection for high-dimensional and non-Euclidean data,” arXiv preprint arXiv:1810.05973.
[26] Cribben, I., Haraldsdottir, R., Atlas, L. Y., Wager, T. D. and Lindquist, M. A. (2012), “Dynamic connectivity regression: determining state-related changes in brain connectivity,” Neuroimage, 61(4), 907-920.
[27] Davis, R. A., Lee, T. and Rodriguez-Yam, G. (2006), “Structural break estimation for nonstationary time series models,” Journal of the American Statistical Association, 101, 223-239.
[28] Fan, J., Liao, Y. and Yao, J. (2015), “Power enhancement in high-dimensional cross-sectional tests,” Econometrica, 83, 1497-1541.
[29] Finch, S. R. (2003), “Extreme value constants,” Mathematical Constants, Cambridge University Press.
[30] Friederich, H. C., Brooks, S., Uher, R., Campbell, I. C., Giampietro, V., Brammer, M., Williams, S. C. R., Herzog, W. and Treasure, J. (2010), “Neural correlates of body dissatisfaction in anorexia nervosa,” Neuropsychologia, 48, 2878-2885.
[31] Friederich, H. C., Uher, R., Brooks, S., Giampietro, V., Brammer, M., Williams, S. C., Herzog, W., Treasure, J. and Campbell, I. (2007), “I’m not as slim as that girl: neural bases of body shape self-comparison to media images,” Neuroimage, 37, 674-681.
[32] Fryzlewicz, P. (2014), “Wild binary segmentation for multiple change-point detection,” The Annals of Statistics, 42, 2243-2281.
[33] Gao, X., Deng, X., Wen, X., She, Y., Vinke, P. and Chen, H. (2016), “My body looks like that girl’s: body mass index modulates brain activity during body image self-reflection among young women,” PLoS ONE, 11, e0164450.
[34] Gavin, A. R., Simon, G. E. and Ludman, E. J. (2010), “The association between obesity, depression, and educational attainment in women: the mediating role of body image dissatisfaction,” Journal of Psychosomatic Research, 69, 573-581.

[35] Handwerker, D. A., Roopchansingh, V., Gonzalez-Castillo, J. and Bandettini, P. A. (2012), “Periodic changes in fMRI connectivity,” Neuroimage, 63(3), 1712-1719.
[36] Harchaoui, Z., Moulines, E. and Bach, F. (2009), “Kernel change-point analysis,” Advances in Neural Information Processing Systems, 609-616.
[37] Horváth, L. and Hušková, M. (2012), “Change-point detection in panel data,” Journal of Time Series Analysis, 33, 631-648.
[38] Horváth, L., Hušková, M., Rice, G. and Wang, J. (2017), “Asymptotic properties of the CUSUM estimator for the time of change in linear panel data models,” Econometric Theory, 33, 366-412.
[39] Horváth, L., Rice, G. and Whipple, S. (2016), “Adaptive bandwidth selection in the long run covariance estimator of functional time series,” Computational Statistics and Data Analysis, 100, 676-693.
[40] Hutchison, R. M., Womelsdorf, T., Allen, E. A., Bandettini, P. A., Calhoun, V. D., Corbetta, M., ... and Handwerker, D. A. (2013), “Dynamic functional connectivity: promise, issues, and interpretations,” Neuroimage, 80, 360-378.
[41] Inclán, C. and Tiao, G. (1994), “Use of cumulative sums of squares for retrospective detection of changes of variance,” Journal of the American Statistical Association, 89, 913-923.
[42] James, B., James, K. L. and Siegmund, D. (1992), “Asymptotic approximations for likelihood ratio tests and confidence regions for a change-point in the mean of a multivariate normal distribution,” Statistica Sinica, 2, 69-90.
[43] Janson, S. (1983), “Renewal theory for M-dependent variables,” The Annals of Probability, 11(3), 558-568.
[44] Jeong, S. O., Pae, C. and Park, H. J. (2016), “Connectivity-based change point detection for large-size functional networks,” NeuroImage, 143, 353-363.
[45] Jirak, M. (2012), “Change-point analysis in increasing dimension,” Journal of Multivariate Analysis, 159, 111-136.
[46] ——— (2015), “Uniform change point tests in high dimension,” The Annals of Statistics, 43, 2451-2483.
[47] Kokoszka, P. and Leipus, R. (2000), “Change-point estimation in ARCH models,” Bernoulli, 6, 513-539.
[48] Kurosaki, M., Shirao, N., Yamashita, H., Okamoto, Y. and Yamawaki, S. (2006), “Distorted images of one’s own body activates the prefrontal cortex and limbic/paralimbic system in young women: a functional magnetic resonance imaging study,” Biological Psychiatry, 59, 380-386.
[49] Lavielle, M. and Moulines, E. (2000), “Least-squares estimation of an unknown number of shifts in a time series,” Journal of Time Series Analysis, 21, 33-59.
[50] Li, J. and Chen, S. X. (2012), “Two sample tests for high dimensional covariance matrices,” The Annals of Statistics, 40, 908-940.
[51] Lindquist, M. (2008), “The statistical analysis of fMRI data,” Statistical Science, 23, 439-464.
[52] Lindquist, M., Waugh, C. and Wager, T. (2007), “Modeling state-related fMRI activity using change-point theory,” Neuroimage, 35, 1125-1141.

[53] Liu, W. and Shao, Q. (2013), “A Cramér moderate deviation theorem for Hotelling T²-statistic with applications to global tests,” The Annals of Statistics, 41, 296-322.
[54] Logothetis, N. K., Pauls, J., Auguth, M., Trinath, T. and Oeltermann, A. (2001), “A neurophysiological investigation of the basis of the BOLD signal in fMRI,” Nature, 412, 150-157.
[55] Lorden, G. (1971), “Procedures for reacting to a change in distribution,” The Annals of Mathematical Statistics, 42(6), 1897-1908.
[56] Matteson, D. and James, N. A. (2014), “A nonparametric approach for multiple change point analysis of multivariate data,” Journal of the American Statistical Association, 109, 334-345.
[57] Mei, Y. (2010), “Efficient scalable schemes for monitoring a large number of data streams,” Biometrika, 97(2), 419-433.
[58] Merlevède, F., Peligrad, M. and Utev, S. (2006), “Recent advances in invariance principles for stationary sequences,” Probability Surveys, 3, 1-36.
[59] Miyake, Y., Okamoto, Y., Onoda, K., Kurosaki, M., Shirao, N. and Yamawaki, S. (2010), “Brain activation during the perception of distorted body images in eating disorders,” Psychiatry Research: Neuroimaging, 181, 183-192.
[60] Monti, R. P., Hellyer, P., Sharp, D., Leech, R., Anagnostopoulos, C. and Montana, G. (2014), “Estimating time-varying brain connectivity networks from functional MRI time series,” NeuroImage, 103, 427-443.
[61] Olshen, A. and Venkatraman, E. (2004), “Circular binary segmentation for the analysis of array-based DNA copy number data,” Biostatistics, 5, 557-572.
[62] Ombao, H. C., Raz, J. A., von Sachs, R. and Malow, B. A. (2001), “Automatic statistical analysis of bivariate nonstationary time series,” Journal of the American Statistical Association, 96, 543-560.
[63] Page, E. S. (1954), “Continuous inspection schemes,” Biometrika, 41(1/2), 100-115.
[64] Roberts, S. W. (1966), “A comparison of some control chart procedures,” Technometrics, 8(3), 411-430.
[65] Robinson, L. F., Wager, T. D. and Lindquist, M. A. (2010), “Change point estimation in multi-subject fMRI studies,” NeuroImage, 49, 1581-1592.
[66] Schwartz, M. B. and Brownell, K. D. (2004), “Obesity and body image,” Body Image, 1, 43-56.
[67] Sen, A. K. and Srivastava, M. S. (1975), “On tests for detecting change in mean,” The Annals of Statistics, 3, 98-108.
[68] Shao, X. and Zhang, X. (2010), “Testing for change points in time series,” Journal of the American Statistical Association, 105, 1228-1240.
[69] Shiryayev, A. N. (1963), “On optimal methods in earliest detection problems,” Theory of Probability and its Applications, 8, 26-51.
[70] Siegmund, D. (1985), Sequential Analysis: Tests and Confidence Intervals, Springer Science & Business Media.
[71] Siegmund, D. and Venkatraman, E. S. (1995), “Using the generalized likelihood ratio statistic for sequential detection of a change-point,” The Annals of Statistics, 255-271.

[72] Siegmund, D., Yakir, B. and Zhang, N. R. (2011), “Detecting simultaneous variant intervals in aligned sequences,” The Annals of Applied Statistics, 5, 645-668.
[73] Srivastava, M. S. and Worsley, K. J. (1986), “Likelihood ratio tests for a change in the multivariate normal mean,” Journal of the American Statistical Association, 81, 199-204.
[74] Tartakovsky, A. G. and Veeravalli, V. V. (2008), “Asymptotically optimal quickest change detection in distributed sensor systems,” Sequential Analysis, 27(4), 441-475.
[75] Uher, R., Murphy, T., Friederich, H. C., Dalgleish, T., Brammer, M. J. and Giampietro, V. (2005), “Functional neuroanatomy of body shape perception in healthy and eating-disordered women,” Biological Psychiatry, 58, 990-997.
[76] Venkatraman, E. (1992), Consistency Results in Multiple Change-Point Problems, Ph.D. Thesis, Stanford University.
[77] Vostrikova, L. (1981), “Detection of disorder in multidimensional random processes,” Soviet Mathematics Doklady, 24, 55-59.
[78] Wagner, A., Ruf, M., Braus, D. F. and Schmidt, M. H. (2003), “Neuronal activity changes and body image distortion in anorexia nervosa,” Neuroreport, 14, 2193-2197.
[79] Wald, A. (1973), Sequential Analysis, Courier Corporation.
[80] Wang, G., Zou, C. and Yin, G. (2017), “Change-point detection in multinomial data with a large number of categories,” to appear in The Annals of Applied Statistics.
[81] Wang, T. and Samworth, R. (2018), “High dimensional change point estimation via sparse projection,” Journal of the Royal Statistical Society, Series B, 80, 57-83.
[82] Xie, Y. and Siegmund, D. (2013), “Sequential multi-sensor change-point detection,” The Annals of Statistics, 41, 670-692.
[83] Yang, J., Dedovic, K., Guan, L., Chen, Y. and Qi, M. (2014), “Self-esteem modulates dorsal medial prefrontal cortical response to self-positivity bias in implicit self-relevant processing,” Social Cognitive and Affective Neuroscience, 9, 1814-1818.
[84] Yang, Q. and Pan, G. (2017), “Weighted statistic in detecting faint and sparse alternatives for high-dimensional covariance matrices,” Journal of the American Statistical Association, 112, 188-200.
[85] Zhong, P. S. and Li, J. (2016), “Test for temporal homogeneity of means in high-dimensional longitudinal data,” arXiv preprint arXiv:1608.07482.
