STATISTICAL INFERENCE FOR CHANGE POINTS IN HIGH-DIMENSIONAL

OFFLINE AND ONLINE DATA

A dissertation submitted

to Kent State University

in partial fulfillment of the requirements

for the degree of Doctor of Philosophy

by

Lingjun Li

May 2020

© Copyright

All rights reserved

Except for previously published materials

Dissertation written by

Lingjun Li

B.S., Chongqing Technology and Business University, 2010

M.S., Kent State University, 2012

M.A., Kent State University, 2014

Ph.D., Kent State University, 2020

Approved by

Jun Li, Chair, Doctoral Dissertation Committee

Mohammad Kazim Khan, Members, Doctoral Dissertation Committee

Jing Li

Cheng-Chang Lu

Ruoming Jin

Accepted by

Andrew M. Tonge, Chair, Department of Mathematical Sciences

James L. Blank, Dean, College of Arts and Sciences

TABLE OF CONTENTS

TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
1 INTRODUCTION
1.1 Offline Change Point Detection
1.2 Online Change Point Detection
2 OFFLINE CHANGE-POINT DETECTION IN HIGH-DIMENSIONAL MEAN
2.1 Test Statistic
2.2 Hypothesis Testing of Change Point
2.3 Estimating One and Multiple Change Points
2.4 Power Enhancement Test for Change Point
2.5 Elbow Method for Dependence
2.6 Numerical Studies
2.6.1 Empirical performance of the proposed testing procedure
2.6.2 Empirical performance of change point estimates
2.6.3 Empirical performance of the power enhancement testing procedure
2.7 Application
2.7.1 Application to fMRI data
2.7.2 Application to environmental data
2.8 Technical Details
2.8.1 Proofs of main theorems
2.8.2 Lemmas and their proofs
3 ONLINE CHANGE-POINT DETECTION IN HIGH-DIMENSIONAL COVARIANCE STRUCTURE
3.1 Modeling Spatial and Temporal Dependence
3.2 Test Statistic
3.3 Stopping Rule
3.4 Asymptotic Results
3.4.1 Average run length
3.4.2 Expected detection delay
3.4.3 Change-point testing in the training sample
3.4.4 Stopping rule with estimated M
3.5 Simulation Studies
3.5.1 Accuracy of the theoretical ARL
3.5.2 Accuracy of upper bound for EDD
3.5.3 Accuracy of the data-driven procedure for M
3.6 Case Study
3.7 Technical Details
3.7.1 Proofs of main theorems
3.7.2 Lemmas and their proofs
4 CONCLUSION
BIBLIOGRAPHY

LIST OF FIGURES

1 Histogram of $\hat L_t/\hat\sigma_{nt,0}$ versus the N(0, 1) curve. The upper row chooses t = 1 at different n and p; the lower row chooses t = n/2 at different n and p.
2 The elbow method for choosing M for dependence under both null and alternative hypotheses. The results were obtained based on 50 replications.
3 Empirical powers of the proposed testing procedure with δ = 1 based on 1000 replicates under different n, p, M and locations of the change point.
4 The probability of detecting a change point at 40 when n = 100 and p = 100, 300 and 600, respectively. The probabilities are obtained based on 1000 iterations for each p. Upper panel: data are dependent with M = 2. Lower panel: data are independent with M = 0.
5 The probability of detecting a change point at 2 when n = 100 and p = 100, 300 and 600, respectively. The probabilities are obtained based on 1000 iterations for each p. Upper panel: data are dependent with M = 2. Lower panel: data are independent with M = 0.
6 Activation map for ROIs. In panel (a): left insula (yellow); right EBA (cyan); left ACC (darkblue); right ACC (blue); right MPFC (darkred); left IPL (orange); right IPL (red); right DLPFC (deepskyblue). In panel (b): right amygdala (darkblue); left MPFC (lightgreen); right MPFC (darkred). In panel (c): right FBA (darkblue); left insula (lightgreen); right insula (darkred). In panel (d): right insula (lightgreen); right MPFC (darkred); left DLPFC (darkblue); right DLPFC (deepskyblue); right IPL (orange).
7 Heatmap of PM2.5 at 36 stations measured in Beijing, China from Jan. 1, 2014 to Dec. 31, 2015.
8 Boxplots of EDDs for “Max-type” and “General” stopping rules in Chen (2019) and Chu and Chen (2018), and the proposed stopping rule. The results are based on 1000 simulations under model (a).
9 Boxplots of EDDs for “Max-type” and “General” stopping rules in Chen (2019) and Chu and Chen (2018), and the proposed stopping rule. The results are based on 1000 simulations under model (b).
10 Boxplots of EDDs for “Max-type” and “General” stopping rules in Chen (2019) and Chu and Chen (2018), and the proposed stopping rule. The results are based on 1000 simulations under model (c).
11 Histograms of selected M by the proposed data-driven procedure when the actual M = 0 and 1. The results are based on 100 simulations.
12 Online change-point detection in the covariance structure of subject 103010 (upper panel) and subject 130417 (lower panel). Each panel illustrates the estimated correlation matrices before and after the estimated change point.

LIST OF TABLES

1 Empirical sizes and powers of the CQ, the E-div and the proposed tests based on 1000 replications with Gaussian $\epsilon_i$ in (2.19).
2 The performance of the proposed binary segmentation and E-div method for estimating multiple change points. The average FP, FN, TP and corresponding standard deviations (subscript numbers) are obtained based on 1000 replications.
3 The performance of the proposed binary segmentation (BS) and wild binary segmentation (WBS) for estimating multiple change points. The average FP, FN, TP and corresponding standard deviations (subscript numbers) are obtained based on 1000 replications. The number of randomly selected intervals for the wild binary segmentation is H = 1000.
4 Empirical sizes of L1 and L based on 1000 replications with Gaussian $\epsilon_i$.
5 Empirical powers of L1 and L with τ = 0.1n and β = 0.8, based on 1000 replications under different M and magnitude δ.
6 Empirical powers of L1 and L with τ = 0.02n and β = 0.3, based on 1000 replications under different M and magnitude δ.
7 ROIs activated by the thin-body images and fat-body images for the normal weight subject and the overweight subject, respectively.
8 The comparison between theoretical ARLs and Monte Carlo ARLs based on 1000 simulations. For each ARL and window size H, the threshold a is obtained by solving the equation in Theorem 3.1.
9 The comparison between theoretical upper bounds for EDDs and Monte Carlo EDDs based on 1000 simulations with the ARL controlled around 5000.

ACKNOWLEDGMENTS

First and with the deepest appreciation, I would like to thank my advisor Dr. Jun Li for his guidance, patience, and support throughout my years of study at Kent State University. He has not only spent tremendous hours proofreading my research papers but also inspired me to conduct research on a broad range of statistical topics. The many skills I have learned from him and the invaluable research experience he shared with me will constantly guide me on my new career path.

I would also like to express my appreciation to my committee members: Dr. Mohammad Kazim Khan,

Dr. Jing Li, Dr. Cheng-Chang Lu, and Dr. Ruoming Jin. I am tremendously indebted to them for their collective time, effort, and direction.

I would like to thank all my friends who accompanied me during my studies at KSU. The years that I spent with them in Kent were so wonderful and will be such a precious memory in my life.

Finally, I express my profound gratitude to my parents for providing me with unfaltering support and encouragement throughout my educational journey as well as throughout the process of researching and writing this thesis. This achievement would not have been possible without them.

CHAPTER 1

INTRODUCTION

High-dimensional offline and online data, characterized by a large number of measurements and complex dependence, are commonly observed in medical, environmental, financial and social network studies. Here offline data refer to data which have already been collected at the time when data analysis is conducted, and online data are real-time observations which continually arrive. Change point detection is the problem of finding unknown locations where abrupt changes have occurred (Aminikhanghahi and Cook 2017). Despite its importance, methods available for detecting change points in high-dimensional offline and online time series data are scarce. This is mainly due to two reasons. First, a large number of parameters for the underlying distribution cannot be estimated accurately with a limited number of observations, and thus cannot be effectively incorporated to develop statistical methods. Second, high-dimensional time series data often involve both spatial and temporal dependence, and multifarious aspects of such complex dependence may not be fully captured by imposed independence or parametric models. In this thesis, we develop new change point detection methods for high-dimensional time series data which may involve complex spatial and temporal dependence. In this chapter, we first briefly review some methods of offline change point detection. We then describe the methods of online change point detection.

1.1 Offline Change Point Detection

Offline change point detection is to test for and estimate the unknown locations at which abrupt changes in given time series data have occurred. There exists abundant research on change point detection for univariate time series. For example, Sen and Srivastava (1975) studied testing and locating a single change point in the mean of a sequence of i.i.d. Gaussian random variables. Inclán and Tiao (1994) proposed a procedure to detect changes in a sequence of independent random variables based on an iterated cumulative sums of squares algorithm. Lavielle and Moulines (2000) considered least-squares estimation of changes in the mean of a random process and applied the method to a class of dependent processes. Davis, Lee, and Rodriguez-Yam (2006) studied a method to detect the number and locations of piecewise autoregressive segments in nonstationary time series. Shao and Zhang (2010) introduced a self-normalization based Kolmogorov-Smirnov test for a change point in a time series.

The problem of detecting change points in classical multivariate time series with fixed dimension has also been extensively studied. For example, based on the maximum Hotelling $T^2$, Srivastava and Worsley (1986) proposed likelihood ratio tests for a change in the mean of a sequence of independent multivariate normal vectors. James, James and Siegmund (1992) introduced a tail approximation for the significance level of the likelihood ratio test of one single change point in the mean of a sequence of independent p-dimensional normally distributed observations. Harchaoui, Moulines and Bach (2009) considered a kernel-based method for testing whether a change in the distribution occurs within a sequence of temporal observations. Motivated by the scientific problem of detecting inherited copy number variants in aligned DNA samples, Siegmund, Yakir, and Zhang (2011) studied the problem of detecting intervals where the mean values of the observations change simultaneously in a subset of aligned sequences of independent noisy observations.

Change point detection in high-dimensional time series data has been receiving more attention in recent years. For testing the existence of a change point in high-dimensional panel data, Horváth and Hušková (2012) and Chan, Horváth and Hušková (2013) proposed the sum of squared CUSUM statistics. In the same vein, Bai (2010) and Horváth et al. (2017) considered estimating the location of the change point assuming a change has occurred a priori. Aston and Kirch (2014) introduced the concept of high-dimensional efficiency that can be used to evaluate and compare the power of different change-point tests. When change points are sparse across the panel, Cho and Fryzlewicz (2015) proposed the Sparsified Binary Segmentation approach, where thresholding is applied to remove those dimensions without any change point. To avoid the difficulty of selecting the level of threshold in Cho and Fryzlewicz (2015), Cho (2016) considered the double CUSUM statistic, which partitions the panel data at each time into subgroups with and without any change. Allowing each component to have its own change point, Jirak (2015) proposed a test statistic which maximizes the largest marginal CUSUM statistic over time. Some other recent developments can be seen in Chen and Zhang (2015), Wang, Zou, and Yin (2017), and Wang and Samworth (2018). Along with methodology development, change point analysis has been recognized as a useful tool in fMRI studies (Lindquist, Waugh and Wager 2007; Robinson, Wager and Lindquist 2010; Aston and Kirch 2012).

Despite the usefulness of those methods, they usually suffer from the following shortcomings. First, they need stringent assumptions on the growth rate of p. For example, Jirak (2012, 2015) assumes logarithmic and polynomial growth rates of p. Horváth and Hušková (2012) requires $ph^2 = o(n^2)$, where h is the bandwidth of the kernel estimator for the long run variance. Cho and Fryzlewicz (2015) assumes $pn^{-\log n} \to 0$. Second, those methods usually require structural assumptions on both spatial and temporal dependence. For instance, Chen and Zhang (2015) and Wang, Zou, and Yin (2017) assume that the sequence of observations is independent. Jirak (2015) requires polynomial decay and logarithmic decay structures on the temporal and spatial dependence, respectively. Finally, those methods may not be able to consistently estimate a change point near the boundary, especially in the high-dimensional setting.

In Chapter 2, we propose a new procedure to detect the change points in the mean of high-dimensional time series data. We establish an asymptotic testing procedure for the existence of any change point. When the null hypothesis is rejected, a regular binary segmentation or a wild binary segmentation method is conducted to estimate multiple change points. We derive the convergence rate of the change point estimator, which demonstrates the impact of sample size, dimensionality and the location of the change point. The estimator is shown to be consistent with a sufficient signal-to-noise ratio even when the change point is near the boundary of the data. Compared with other methods, the proposed procedure allows both sample size and dimensionality to diverge without constraint on the growth rate of dimensionality. It does not assume Gaussianity. Moreover, it incorporates both spatial and temporal dependence without imposing restrictive structural assumptions. Simulation and case studies are provided to demonstrate the performance of the proposed methods.

1.2 Online Change Point Detection

Online change-point detection, or sequential change-point detection, operates in a real-time manner. With observations continually arriving, a stopping rule is chosen to terminate and reset the process as early as possible when an anomaly occurs. In modern applications, there has been a resurgence of interest in detecting abrupt changes from streaming data with a large number of measurements. Examples include real-time monitoring for sensor networks and threat detection from surveillance videos. Further examples arise in studying the dynamic connectivity of resting-state functional magnetic resonance imaging, and in detecting the threat of fake news from groups of fake accounts in social networks (Bara, Fung and Dinh 2015).

Extensive research has been done for online change-point detection of univariate data; see, for example, Page (1954), Shiryayev (1963), Lorden (1971), Wald (1973), Siegmund (1985) and Siegmund and Venkatraman (1995). The proposed stopping rules are based on the CUSUM test or the quasi-Bayesian test, which assume the distributions of data before and after the change point to be known, or on variants proposed to relax the restrictive assumption of known distributions. There also exist many developments in online change-point detection of multivariate data. For example, Tartakovsky and Veeravalli (2008) and Mei (2010) propose stopping rules for detecting a change point common to all dimensions based on the assumption that the distributions of data before and after the change point are known. By relaxing the common change to a change in only a subset of the data, Xie and Siegmund (2013) and Chan and Walther (2015) study stopping rules for multivariate normally distributed data with the identity covariance matrix. By extending and modifying the approach in Xie and Siegmund (2013), Chan (2017) investigates the optimal detection of multiple data streams in detecting mean shifts of independent multivariate normally distributed data with the identity covariance matrix. Despite the preceding developments, very little work has been done for online change-point detection of high-dimensional data. A recent development can be seen in Chen (2019), where the proposed stopping rule utilizes nearest neighbor information to detect a change point in the distribution of independent data. However, the condition of temporal independence is too restrictive in real applications.

In Chapter 3, we develop an online change-point detection procedure in the covariance structure of high-dimensional data. A new stopping rule is proposed to terminate the process as early as possible when a network change occurs. The stopping rule incorporates spatial and temporal dependence and can be applied to non-Gaussian data. An explicit expression for the average run length (ARL) is derived so that the level of threshold in the stopping rule can be easily obtained with no need to run time-consuming Monte Carlo simulations. We also establish an upper bound for the expected detection delay (EDD), the expression of which demonstrates the impact of data dependence and magnitude of change in the covariance structure. Simulation studies are provided to confirm the accuracy of the theoretical results. The practical usefulness of the proposed procedure is illustrated by detecting the brain’s network change in a resting-state fMRI dataset.

CHAPTER 2

OFFLINE CHANGE-POINT DETECTION IN HIGH-DIMENSIONAL MEAN

With the explosive development of high-throughput technologies, high-dimensional time series data are commonly observed in many fields including medical, environmental, financial, engineering and geographical studies. For example, functional magnetic resonance imaging (fMRI) in neuroscience is a modern noninvasive technique which produces massive amounts of complex data by measuring brain activity repeatedly over time (Ashby 2011). These high-dimensional time series data push toward the investigation of new statistical tools aimed at understanding and characterizing the underlying mechanism. In many cases, the dynamic processes involve abrupt changes occurring at unknown locations, and detecting such unknown change points in high-dimensional time series data has rekindled researchers’ interest. For example, fMRI data can be used to study the structure and function of the brain by detecting changes associated with blood flow (Logothetis et al. 2001).

In this chapter, we propose a new nonparametric method to detect the change points in the mean of high-dimensional time series data. More precisely, letting $\{X_i \in \mathbb{R}^p, 1 \le i \le n\}$ be a sequence of observations and $\mu_i$ be the mean of $X_i$ for $i = 1, \cdots, n$, we first establish an asymptotic testing procedure for the existence of any change point:
$$H_0: \mu_1 = \cdots = \mu_n, \quad \text{against} \quad H_1: \mu_1 = \cdots = \mu_{\tau_1} \ne \mu_{\tau_1+1} = \cdots = \mu_{\tau_q} \ne \mu_{\tau_q+1} = \cdots = \mu_n, \qquad (2.1)$$
where $1 \le \tau_1 < \cdots < \tau_q < n$ are some unknown change points. When the null hypothesis is rejected, we further adopt a binary segmentation or wild binary segmentation searching scheme to identify all the change points.

The advantages of the proposed change point detection procedure are multifold. First, it allows the dimension p to be much larger than the number of observations n without constraint on the growth rate of p, whereas previous relevant work needs more stringent assumptions. For example,

Jirak (2012, 2015) assumes logarithmic and polynomial growth rates of p, Horváth and Hušková (2012) requires $ph^2 = o(n^2)$, where h is the bandwidth of the kernel estimator for the long run variance, and Cho and Fryzlewicz (2015) assumes $pn^{-\log n} \to 0$. Second, the proposed method incorporates both spatial and temporal dependence in the sequence of high-dimensional observations, namely spatial dependence among the p components of $X_i$ at each $i$ and temporal dependence between any $X_i$ and $X_j$ for $i \ne j$. Moreover, unlike other existing methods, we do not impose any structural assumptions on either the spatial or the temporal dependence. For instance, Chen and Zhang (2015) and Wang, Zou, and Yin (2017) assume that the sequence of observations is independent, and Jirak (2015) requires polynomial decay and logarithmic decay structures on the temporal and spatial dependence, respectively. Third, the proposed method is nonparametric, making no distributional assumptions on the data beyond the existence of the fourth moment. As a result, it can be applied to a wide range of applications. Finally, we derive the convergence rate of the change point estimator, which demonstrates the impact of n, p and the location of the change point. The change point estimator is shown to be consistent with a sufficient signal-to-noise ratio even when the change point is located on the boundary of a sequence. To the best of our knowledge, an investigation of consistently estimating a change point near the boundary, especially in the high-dimensional setting, is lacking in the literature. Our results thus address and relax the usual concern of unsatisfactory performance of change point detection near the boundaries of data.

This chapter is organized as follows. In Sections 2.1, 2.2 and 2.3, we introduce our methods for testing the existence of a change point and estimating one or multiple change points; in Section 2.4, we introduce a power enhancement statistic for better testing performance. The theoretical properties of the methods are discussed in these sections as well. In Section 2.5, an elbow method is discussed for estimating dependence. In Sections 2.6 and 2.7, we demonstrate the empirical performance of our methods through simulated and real data, respectively. The proofs of the main results are relegated to Section 2.8.

2.1 Test Statistic

We model the sequence of p-dimensional random vectors $\{X_i, 1 \le i \le n\}$ by a linear high-dimensional time series
$$X_i = \mu_i + \Gamma_i Z \quad \text{for } i = 1, \cdots, n, \qquad (2.2)$$

where $\mu_i$ is the p-dimensional population mean, $\Gamma_i$ is a $p \times q$ matrix with $q \ge n \cdot p$, and $Z = (z_1, \cdots, z_q)^T$ such that $\{z_i\}_{i=1}^q$ are mutually independent and satisfy $E(z_i) = 0$, $\mathrm{Var}(z_i) = 1$ and $E(z_i^4) = 3 + \beta$ for some finite constant $\beta$.

By allowing $\Gamma_i$ to depend on $i$, each $X_i$ has its own covariance described by $\Gamma_i\Gamma_i^T$, and each pair of $X_i$ and $X_j$ has its own temporal dependence described by $\Gamma_i\Gamma_j^T$ for $i \ne j$. We require $q \ge np$ to guarantee the positive definiteness of $\Gamma_i\Gamma_i^T$. It also provides flexibility in generating different dependence structures. For example, if all the $X_i$'s are temporally m-dependent, the condition $q \ge np$ guarantees the existence of $\Gamma_i$'s such that $\Gamma_i\Gamma_j^T = 0$ if $|i - j| > m$. Another advantage of (2.2) is that it does not assume the Gaussian distribution of $Z$ beyond the existence of the fourth moment.

Let $C(j - i) = C^T(i - j) \equiv \Gamma_i\Gamma_j^T$, and define a weight function
$$w_t(h) = \frac{t(n-t)}{n}\sum_{i=1}^{n-h}\big\{t^{-1}I(i \le t) - (n-t)^{-1}I(i > t)\big\}\big\{t^{-1}I(i+h \le t) - (n-t)^{-1}I(i+h > t)\big\}.$$
Moreover, for any matrix $A$, we let $A^{\otimes 2} = AA^T$. We consider the following condition for dependence.

(C1). The sequence $\{X_1, \cdots, X_n\}$ is stationary satisfying $C(i-j) = C(h)$ with $h = i - j$. Moreover, as $n \to \infty$, there exists $M_n = o(n^{1/2})$ such that
$$\sum_{h=M_n+1}^{n-1} |\mathrm{tr}\{C(h)\}| = o(n), \quad \text{and} \quad \mathrm{tr}\Big[\Big\{\sum_{h=M_n+1}^{n-1} w_t(h)C(h)\Big\}^{\otimes 2}\Big] = o\Big(\mathrm{tr}\Big[\Big\{\sum_{h=1}^{M_n} w_t(h)C(h)\Big\}^{\otimes 2}\Big]\Big).$$

Some discussions about (C1) are as follows. First, the stationarity assumption has been commonly used in change-point analysis (Aue et al. 2009; Bai 2010; Horváth and Hušková 2012; Wang and Samworth 2018). We impose stationarity of the temporal correlation to simplify notation and the technical proofs of the asymptotic results; it can be relaxed to local stationarity. Second, (C1) is trivially true for a temporally independent or m-dependent sequence. However, (C1) is very general as the sequence need not be m-dependent. Third, (C1) does not impose any structural assumption on dependence within a critical value $M_n = o(n^{1/2})$, but only requires that the temporal dependence beyond the critical value $M_n$ is not too strong, so that the two equations in (C1) are satisfied. Lastly, compared to the usually assumed mixing condition, (C1) is advantageous as a mixing condition is hard to verify for real data and usually requires additional smoothness or restrictive moment assumptions (Carrasco and Chen 2002; Aue et al. 2009). For notational simplicity, we suppress the index n from $M_n$ in the rest of the dissertation.
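To make the construction concrete, the following NumPy sketch (our own illustration, not code from the dissertation) generates data from model (2.2) with an m-dependent temporal structure via a moving-average form $X_i = \mu_i + \sum_{l=0}^{m} A_l z_{i-l}$; the banded choice of the coefficient matrices $A_l$ is purely for demonstration, and (C1) holds trivially here since $C(h) = 0$ for $|h| > m$.

```python
import numpy as np

def simulate_model_2_2(n, p, m, seed=0):
    """Draw X_1,...,X_n from (2.2) with an m-dependent temporal structure:
    X_i = mu_i + sum_{l=0}^m A_l z_{i-l}, so Gamma_i Gamma_j^T = 0 if |i-j| > m."""
    rng = np.random.default_rng(seed)
    # illustrative coefficient matrices A_0,...,A_m with banded spatial dependence
    absdiff = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    A = [0.6 ** absdiff / (l + 1.0) for l in range(m + 1)]
    mu = np.zeros((n, p))                  # H0: constant (zero) mean
    z = rng.standard_normal((n + m, p))    # innovations: E(z)=0, Var(z)=1
    X = np.empty((n, p))
    for i in range(n):
        X[i] = mu[i] + sum(A[l] @ z[i + m - l] for l in range(m + 1))
    return X

X = simulate_model_2_2(n=100, p=50, m=2)   # 100 observations in R^50
```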

For any $t \in \{1, \cdots, n-1\}$, we measure homogeneity in $\{\mu_i, 1 \le i \le n\}$ by
$$L_t = \frac{t(n-t)}{n^2}(\bar\mu_{\le t} - \bar\mu_{>t})^T(\bar\mu_{\le t} - \bar\mu_{>t}), \qquad (2.3)$$
where $\bar\mu_{\le t} = t^{-1}\sum_{i=1}^{t}\mu_i$ and $\bar\mu_{>t} = (n-t)^{-1}\sum_{i=t+1}^{n}\mu_i$. Clearly, $L_t = 0$ under $H_0$ and $L_t \ne 0$ for some $t$ under $H_1$ of (2.1). To estimate the unknown $L_t$, a naive choice is $t(n-t)n^{-2}(\bar X_{\le t} - \bar X_{>t})^T(\bar X_{\le t} - \bar X_{>t})$, where $\bar X_{\le t} = t^{-1}\sum_{i=1}^{t}X_i$ and $\bar X_{>t} = (n-t)^{-1}\sum_{i=t+1}^{n}X_i$. However, as shown in Proposition 2.1 in this section, this estimator is biased due to data dependence. We therefore consider a bias-corrected estimator
$$\hat L_t = \frac{t(n-t)}{n^2}(\bar X_{\le t} - \bar X_{>t})^T(\bar X_{\le t} - \bar X_{>t}) - \frac{f_t^T F_{n,M}^{-1}\hat V}{n}, \qquad (2.4)$$
where, like Ayyala, Park, and Roy (2017), $f_t$ is an $(M+1)$-dimensional vector with $f_t(1) = 1$ and, for $i \in \{2, \cdots, M+1\}$,
$$f_t(i) = 2\Big\{\frac{(n-t)(t-i+1)}{nt}I(t+1 > i) + \frac{t(n-t-i+1)}{n(n-t)}I(n-t+1 > i) - \frac{1}{n}\sum_{l=1}^{i-1}I(t \ge l)I(n-t \ge i-l)\Big\}. \qquad (2.5)$$
The $(M+1) \times (M+1)$ matrix
$$F_{n,M}(i,j) = \Big(1 - \frac{i-1}{n}\Big)I(i,j) + \Big(1 - \frac{i-1}{n}\Big)\Big(1 - \frac{j-1}{n}\Big)\frac{2 - I(j,1)}{n} - \frac{1}{n^2}\sum_{a=1}^{n-i+1}\sum_{b=1}^{n}\Big\{I(|a-b|+1, j) + I(|a+i-1-b|+1, j)\Big\}, \qquad (2.6)$$
where the indicator $I(i,j)$ equals 1 if $i = j$ and 0 otherwise. And for $h = 0, \cdots, M$,
$$\hat V = \big(\mathrm{tr}\{\widehat{C(0)}\}, \cdots, \mathrm{tr}\{\widehat{C(M)}\}\big)^T \quad \text{with} \quad \mathrm{tr}\{\widehat{C(h)}\} = \frac{1}{n}\sum_{i=1}^{n-h}(X_i - \bar X)^T(X_{i+h} - \bar X). \qquad (2.7)$$
As shown in Proposition 2.1, $n^{-1}f_t^T F_{n,M}^{-1}\hat V$ is the bias-correction term that eliminates the leading order bias from the first term of $\hat L_t$ in (2.4). It requires the $(M+1) \times (M+1)$ matrix $F_{n,M}$ to be invertible. From Ayyala, Park and Roy (2017), $F_{n,M}a = 0$ if and only if the $(M+1)$-dimensional column vector $a = 0$, and $F_{n,M}$ is thus of full rank and invertible. In analogy with the kernel estimator in Horváth and Hušková (2012) and Horváth, Rice and Whipple (2016), $n^{-1}f_t^T F_{n,M}^{-1}\hat V$ can be written as
$$\sum_{h=0}^{n-1} K(h)\,\mathrm{tr}\{\widehat{C(h)}\} = \sum_{h=0}^{n-1} K(h)\Big\{\frac{1}{n}\sum_{i=1}^{n-h}(X_i - \bar X)^T(X_{i+h} - \bar X)\Big\},$$

where $K(h)$ is the $h$th element of $n^{-1}f_t^T F_{n,M}^{-1}$ if $h \le M$, and zero otherwise.
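For concreteness, the sketch below codes the bias-corrected statistic directly from (2.4)-(2.7) as reconstructed above. It is a minimal, unoptimized illustration of the displayed formulas (not the authors' implementation), so any transcription uncertainty in (2.5)-(2.6) carries over to the code.

```python
import numpy as np

def f_t(t, n, M):
    # f_t in (2.5): f_t(1) = 1; remaining entries indexed i = 2,...,M+1
    f = np.ones(M + 1)
    for i in range(2, M + 2):
        s = sum((t >= l) and (n - t >= i - l) for l in range(1, i))
        f[i - 1] = 2 * ((n - t) * (t - i + 1) / (n * t) * (t + 1 > i)
                        + t * (n - t - i + 1) / (n * (n - t)) * (n - t + 1 > i)
                        - s / n)
    return f

def F_nM(n, M):
    # F_{n,M} in (2.6); I(i, j) is the Kronecker indicator
    F = np.zeros((M + 1, M + 1))
    for i in range(1, M + 2):
        for j in range(1, M + 2):
            F[i - 1, j - 1] = (1 - (i - 1) / n) * (i == j) \
                + (1 - (i - 1) / n) * (1 - (j - 1) / n) * (2 - (j == 1)) / n
            a = np.arange(1, n - i + 2)[:, None]
            b = np.arange(1, n + 1)[None, :]
            F[i - 1, j - 1] -= (np.sum(np.abs(a - b) + 1 == j)
                                + np.sum(np.abs(a + i - 1 - b) + 1 == j)) / n**2
    return F

def V_hat(X, M):
    # (2.7): tr{C(h)}-hat = n^{-1} sum_i (X_i - Xbar)'(X_{i+h} - Xbar)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    return np.array([np.sum(Xc[:n - h] * Xc[h:]) / n for h in range(M + 1)])

def L_hat(X, t, M):
    # bias-corrected statistic (2.4)
    n = X.shape[0]
    d = X[:t].mean(axis=0) - X[t:].mean(axis=0)   # Xbar_{<=t} - Xbar_{>t}
    naive = t * (n - t) / n**2 * (d @ d)
    correction = f_t(t, n, M) @ np.linalg.solve(F_nM(n, M), V_hat(X, M)) / n
    return naive - correction
```

With X generated by the earlier sketch, `L_hat(X, t=50, M=2)` evaluates the statistic at a candidate split t.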

With a known change point $\tau$, the statistic $\hat L_\tau$ can be used for the two-sample testing problem by requiring the two sample sizes $\tau$ and $n - \tau$ to diverge at the same order. See Bai and Saranadasa (1996) for independent data, and Ayyala, Park, and Roy (2017) for m-dependent Gaussian data. As the number and locations of change points are unknown a priori, the change point detection problem is more challenging. In particular, if the change point $\tau$ is near the boundary, using $\hat L_\tau$ for the two-sample problem is infeasible as the assumption $\tau/(n - \tau) \to k \in (0, 1)$ is violated.

The proposed $\hat L_t$ is closely related to the test statistics in Horváth and Hušková (2012) and Chan, Horváth and Hušková (2013). More precisely, they considered the sum of squared CUSUM statistics, each of which is scaled by its long run variance. While scaling can convert dimensions to homogeneous scales and thus avoid the power loss when the change occurs at components with small scales, all p long-run variances need to be estimated consistently to establish the asymptotic distribution. It thus requires restrictive conditions, especially on the growth rate of dimension p with respect to sample size n. Different from Horváth and Hušková (2012) and Chan, Horváth and Hušková (2013), we only scale $\hat L_t$ by the consistent estimator of its standard deviation, and consequently its asymptotic normality can be established without imposing any explicit restriction on p as long as the condition (C2) is satisfied (see Theorem 2.1 in Section 2.2 for details).

The mean and variance of Lˆt are given by the following proposition.

Proposition 2.1. Under the model (2.2) and (C1), and for $t \in \{1, \cdots, n-1\}$,
$$E(\hat L_t) = L_t - \frac{f_t^T F_{n,M}^{-1} V_B}{n} + o(1), \qquad (2.8)$$
where $L_t$ is defined by (2.3) and $V_B = \big(n^{-1}\sum_{i=1}^{n}(\mu_i - \bar\mu)^T(\mu_i - \bar\mu), \cdots, n^{-1}\sum_{i=1}^{n-M}(\mu_i - \bar\mu)^T(\mu_{i+M} - \bar\mu)\big)^T$ with $\bar\mu = n^{-1}\sum_{i=1}^{n}\mu_i$. Moreover,
$$\begin{aligned}
\mathrm{Var}(\hat L_t) \equiv \sigma_{nt}^2 = \frac{1}{n^4}\Big[&\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{h_1,h_2\in A}\big\{B_t(i,j)B_t(i+h_2, j-h_1) + B_t(i,j)B_t(j-h_1, i+h_2)\big\}\,\mathrm{tr}\{C(h_1)C(h_2)\} \\
&+ \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n}\sum_{h\in A\cup A^c}\big\{B_t(i,j) + B_t(j,i)\big\}\big\{B_t(k, i+h) + B_t(i+h, k)\big\}\,\mu_j^T C(h)\mu_k\Big]\{1 + o(1)\}, \qquad (2.9)
\end{aligned}$$
where the set $A = \{0, \pm 1, \cdots, \pm M\}$ and the $n \times n$ matrix $B_t$ satisfies
$$B_t(i,j) = \frac{n-t}{t}I(i\le t)I(j\le t) - 2I(i\le t)I(j>t) + \frac{t}{n-t}I(i>t)I(j>t) - \sum_{h=0}^{M}\big(f_t^T F_{n,M}^{-1}\big)_{h+1}\Big\{I(i-j, h) - \frac{I(j\ge h+1)}{n} - \frac{I(j\le n-h)}{n} + \frac{n-h}{n^2}\Big\}.$$

Remark 2.1. Under the null hypothesis, $E(\hat L_t) = o(1)$ as $V_B = 0$ in (2.8), and thus $\hat L_t$ is asymptotically unbiased for $L_t$. Under the alternative, though $\hat L_t$ is still biased for $L_t$, the bias effect is negligible as the bias $n^{-1}f_t^T F_{n,M}^{-1}V_B$ is of smaller order than $L_t$ for any $t \in \{1, \cdots, n-1\}$ (see the proof of Lemma 2.1 in Section 2.8).

Remark 2.2. The proposed bias-corrected estimator $\hat L_t$ depends on $M$, which separates the dominant temporal dependence from the remainder. How to choose a proper $M$ is very important in practice and will be addressed in Section 2.5. From here to the end of Section 2.4, we simply assume $M$ to be known in order to present the theoretical results of our method.

2.2 Hypothesis Testing of Change Point

To test (2.1) for the statistical significance of any change point, we first establish the asymptotic normality of Lˆt under the following condition.

(C2). For any $h_1, h_2, h_3, h_4 \in A$ with $A = \{0, \pm 1, \cdots, \pm M\}$, as $p \to \infty$,
$$\mathrm{tr}\big\{C(h_1)C(h_2)C(h_3)C(h_4)\big\} = o\big[\mathrm{tr}\{C(h_1')C(h_2')\}\,\mathrm{tr}\{C(h_3')C(h_4')\}\big],$$
where $\{h_1', h_2', h_3', h_4'\}$ is a permutation of $\{h_1, h_2, h_3, h_4\}$.

The condition (C2) is imposed on the covariance matrix of the entire sequence $X_1, \cdots, X_n$. To see this, let $X = (X_1^T, X_2^T, \cdots, X_n^T)^T$ and $\Gamma = (\Gamma_1^T, \Gamma_2^T, \cdots, \Gamma_n^T)^T$ from (2.2). As a result, the $np \times np$ covariance matrix of $X$ is $\Sigma = \Gamma\Gamma^T$, where each $p \times p$ block diagonal matrix of $\Sigma$ describes the spatial dependence among the p components of each $X_i$, and each block off-diagonal matrix measures the temporal dependence of each pair $X_i$ and $X_j$ for $i \ne j$. To impose a condition on $\Sigma$, we may consider $\mathrm{tr}(\Sigma^4) = o\{\mathrm{tr}^2(\Sigma^2)\}$, which is satisfied if all eigenvalues of $\Sigma$ are bounded. However, it is more desirable to impose the condition on the spatial and temporal dependence through $\Gamma_i$. By the relationship $\Sigma = \Gamma\Gamma^T = (\Gamma_1^T, \Gamma_2^T, \cdots, \Gamma_n^T)^T(\Gamma_1^T, \Gamma_2^T, \cdots, \Gamma_n^T)$, it can be shown that (C2) is a sufficient condition for $\mathrm{tr}(\Sigma^4) = o\{\mathrm{tr}^2(\Sigma^2)\}$.

The advantage of (C2) is that we do not require any explicit relationship between the dimension p and the number of observations n as long as (C2) is satisfied. Under temporal independence, (C2) reduces to $\mathrm{tr}\{C^4(0)\} = o[\mathrm{tr}^2\{C^2(0)\}]$ in Chen and Qin (2010), where $C(0)$ describes the spatial dependence of the p components of $X_i$ at time $i$. However, (C2) is more general as our setup incorporates both spatial and temporal dependence. In particular, (C2) involves $C(h \ne 0)$, whose off-diagonal elements describe the temporal dependence at different time points. It is also worth noting that under strong spatial dependence, such as an equal correlation structure of $\Sigma$, the condition $\mathrm{tr}(\Sigma^4) = o\{\mathrm{tr}^2(\Sigma^2)\}$ does not hold and (C2) is thus violated.

Theorem 2.1. Assume (2.2) and (C1)-(C2). As $n \to \infty$, for any $t \in \{1, \cdots, n-1\}$,
$$\frac{\hat L_t - L_t}{\sigma_{nt}} \stackrel{d}{\longrightarrow} N(0, 1),$$
where $\sigma_{nt}$ is defined by (2.9) in Proposition 2.1.

In order to implement a testing procedure, we need to estimate the unknown variance of $\hat L_t$ under the null hypothesis,
$$\sigma_{nt,0}^2 = \sum_{i,j=1}^{n}\sum_{h_1,h_2\in A}\frac{B_t(i,j)}{n^4}\big\{B_t(i+h_2, j-h_1) + B_t(j-h_1, i+h_2)\big\}\,\mathrm{tr}\{C(h_1)C(h_2)\},$$
which only requires us to estimate the unknown $\mathrm{tr}\{C(h_1)C(h_2)\}$. For any $h_1$ and $h_2$ from the set $A = \{0, \pm 1, \cdots, \pm M\}$, the estimator is
$$\mathrm{tr}\{\widehat{C(h_1)C(h_2)}\} = \frac{1}{n_1^*}\sum_{s,t}^{*}X_{t+h_2}^T X_s X_{s+h_1}^T X_t - \frac{1}{n_2^*}\sum_{r,s,t}^{*}X_r^T X_s X_{s+h_1}^T X_t - \frac{1}{n_3^*}\sum_{r,s,t}^{*}X_r^T X_s X_{s+h_2}^T X_t + \frac{1}{n_4^*}\sum_{q,r,s,t}^{*}X_q^T X_r X_s^T X_t, \qquad (2.10)$$
where $\sum^{*}$ represents the sum over indices that are at least $M$ apart, and $n_i^*$ with $i = 1, 2, 3, 4$ are the corresponding numbers of indices. As a result, the estimator of the variance under the null hypothesis is
$$\hat\sigma_{nt,0}^2 = \sum_{i,j=1}^{n}\sum_{h_1,h_2\in A}\frac{B_t(i,j)}{n^4}\big\{B_t(i+h_2, j-h_1) + B_t(j-h_1, i+h_2)\big\}\,\mathrm{tr}\{\widehat{C(h_1)C(h_2)}\}. \qquad (2.11)$$
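To illustrate (2.10), the following naive sketch enumerates index tuples directly. We read $\sum^{*}$ as a sum over tuples of indices that are pairwise at least $M$ apart (our interpretation of the separation requirement), and the $O(n^4)$ loops are for exposition on small n only.

```python
import numpy as np
from itertools import combinations

def _sep(idx, M):
    # our reading of Sum^*: all involved indices at least M apart
    return all(abs(a - b) >= M for a, b in combinations(idx, 2))

def tr_ChCh_hat(X, h1, h2, M):
    """Naive evaluation of the estimator (2.10) of tr{C(h1)C(h2)};
    feasible only for small n (the last term is an O(n^4) loop)."""
    n = X.shape[0]
    T1, T2, T3, T4 = [], [], [], []
    for s in range(n):
        for t in range(n):
            if 0 <= s + h1 < n and 0 <= t + h2 < n and _sep((s, t), M):
                T1.append((X[t + h2] @ X[s]) * (X[s + h1] @ X[t]))
    for r in range(n):
        for s in range(n):
            for t in range(n):
                if 0 <= s + h1 < n and _sep((r, s, t), M):
                    T2.append((X[r] @ X[s]) * (X[s + h1] @ X[t]))
                if 0 <= s + h2 < n and _sep((r, s, t), M):
                    T3.append((X[r] @ X[s]) * (X[s + h2] @ X[t]))
    for q in range(n):
        for r in range(n):
            for s in range(n):
                for t in range(n):
                    if _sep((q, r, s, t), M):
                        T4.append((X[q] @ X[r]) * (X[s] @ X[t]))
    # np.mean divides each term by its own count n_i^*
    return np.mean(T1) - np.mean(T2) - np.mean(T3) + np.mean(T4)
```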

Theorem 2.2. Assume the conditions in Theorem 2.1 and H0 of (2.1). As n → ∞,

$$\frac{\hat\sigma_{nt,0}}{\sigma_{nt,0}} \stackrel{p}{\longrightarrow} 1, \quad \text{for } t \in \{1, \cdots, n-1\}.$$

Combining Theorems 2.1 and 2.2, we immediately see that under the null hypothesis,

$$\frac{\hat L_t}{\hat\sigma_{nt,0}} \stackrel{d}{\longrightarrow} N(0, 1), \quad \text{for } t \in \{1, \cdots, n-1\}.$$


Figure 1: Histogram of $\hat L_t/\hat\sigma_{nt,0}$ versus the N(0, 1) curve. The upper row chooses t = 1 at different n and p; the lower row chooses t = n/2 at different n and p.

We also conducted some simulations to give the above result a visual inspection. Figure 1 shows the histograms of $\hat L_t/\hat\sigma_{nt,0}$ based on 1000 iterations for t = 1 and t = n/2, respectively. The data were generated based on the setups in Section 2.6. Clearly, as n and p increase, the empirical histograms get closer to the standard normal curve even when the time point t is taken to be 1 on the boundary.

From Theorems 2.1 and 2.2, we reject the null hypothesis if $\hat L_t/\hat\sigma_{nt,0} > z_\alpha$ at a nominal significance level $\alpha$, where $z_\alpha$ is the upper-$\alpha$ quantile of N(0, 1). However, there is a drawback in implementing the testing procedure via $\hat L_t/\hat\sigma_{nt,0}$, as it depends on the choice of t. Although the testing procedure can be applied for any $t \in \{1, \cdots, n-1\}$, the power of the test can vary greatly with t.

From Theorem 2.1, the power of the test at a given t is
$$\Phi\Big(-z_\alpha\frac{\sigma_{nt,0}}{\sigma_{nt}} + \frac{L_t}{\sigma_{nt}}\Big),$$
where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution and $L_t$ is given by (2.3). Since $\sigma_{nt,0}/\sigma_{nt} \le 1$, the power is determined by the signal-to-noise ratio $L_t/\sigma_{nt}$, which can be affected by the choice of t and may render the test conservative. For example, consider a simple case with n = 150 where $\mu_t = 0$ for $1 \le t \le 50$, $\mu_t = 1$ for $51 \le t \le 100$ and $\mu_t = -1$ for $101 \le t \le 150$.

If we choose t = 50 for $\hat L_t$, the signal-to-noise ratio $L_t/\sigma_{nt} = 0$, leading to low power of the test. In order to circumvent the difficulty of choosing t and, most importantly, retain the power of the test, we consider the following test statistic,
$$\hat L = \sum_{t=1}^{n-1}\hat L_t, \qquad (2.12)$$
which is the $L_2$-norm statistic obtained by summing over all the marginal $\hat L_t$. Instead of the $L_2$-norm statistic, one may consider the max-norm statistic $\max_{1\le t\le n-1}\hat L_t/\hat\sigma_{nt,0}$. The advantages and disadvantages of the two norms have been discussed extensively in the literature. Generally speaking, if there are only a few time points t where the difference between $\bar\mu_{\le t}$ and $\bar\mu_{>t}$ is large, the max-norm based test is expected to be more powerful than the $L_2$-norm based test. If, however, small differences occur at many time points, the $L_2$-norm based test will dominate the max-norm based test in power since it can aggregate the strength of all the small differences. Furthermore, to establish a testing procedure, the test statistic (2.12) requires less stringent conditions compared to the max-norm based test.

Let $\mathcal B(i,j) = \sum_{t=1}^{n-1}B_t(i,j)$, where $B_t(i,j)$ is specified in Proposition 2.1, and let $\sigma_n^2$ be the variance of $\hat L$ obtained by replacing $B_t(\cdot,\cdot)$ with the corresponding $\mathcal B(\cdot,\cdot)$.

Theorem 2.3. Assume the same conditions in Theorem 2.1. As $n \to \infty$,
$$\frac{\hat L - \sum_{t=1}^{n-1}L_t}{\sigma_n} \stackrel{d}{\longrightarrow} N(0, 1).$$
In particular, under $H_0$ of (2.1),
$$\frac{\hat L}{\hat\sigma_{n,0}} \stackrel{d}{\longrightarrow} N(0, 1),$$
where $\hat\sigma_{n,0}$ is obtained by replacing $B_t(\cdot,\cdot)$ with the corresponding $\mathcal B(\cdot,\cdot)$ in (2.9).

Based on Theorem 2.3, we reject $H_0$ of (2.1) at a nominal significance level $\alpha$ if $\hat L/\hat\sigma_{n,0} > z_\alpha$. Being free of the tuning parameter t while retaining reasonable power, the testing procedure based on $\hat L$ is thus chosen for testing the existence of any change point.

Remark 2.3. Following Aston and Kirch (2014), we derive the high-dimensional efficiency of the proposed test with one change point $\tau \in \{1, \cdots, n-1\}$ to be $\big(\sum_{l=1}^{p}\delta_l^2\big)^{1/2}[\mathrm{tr}\{C^2(0)\}]^{-1/4}$, where $\delta = (\mu_1 - \mu_n)$ is the population mean difference. Note that the oracle projection-based change-point test in Aston and Kirch (2014) achieves the high-dimensional efficiency $\|C^{-1/2}(0)\delta\|_2$, where $\|\cdot\|_2$ refers to the $L_2$-norm of a vector. With a diagonal $C(0)$ (cross-sectional independence), our high-dimensional efficiency is less than that of Aston and Kirch (2014) by a factor of $p^{1/4}$.

2.3 Estimating One and Multiple Change Points

If the null hypothesis is rejected by the proposed testing procedure, the second interest is to further locate the change points. We first consider the case of one change point; estimating multiple change points will be covered at the end of this section. By using $\hat L_t$ given by (2.4), the location of a change point $\tau \in \{1, \cdots, n-1\}$ is estimated by

$$\hat\tau = \arg\max_{0<t<n}\hat L_t. \qquad (2.13)$$

Without performing any bias correction, Bai (2010) proposed a similar estimator
$$\hat\tau_B = \arg\max_{0<t<n}\frac{t(n-t)}{n^2}(\bar X_{\le t} - \bar X_{>t})^T(\bar X_{\le t} - \bar X_{>t}),$$
and Horváth et al. (2017) considered
$$\hat\tau_H = \arg\max_{0<t<n}\frac{t^2(n-t)^2}{n^2}(\bar X_{\le t} - \bar X_{>t})^T(\bar X_{\le t} - \bar X_{>t}).$$

As shown in Proposition 2.1, the bias in these two estimators can have an adverse effect on the change-point estimate, especially when the actual change point is near the boundary of the sequence. Unlike Bai (2010) and Horváth et al. (2017), we propose the bias-corrected $\hat\tau$, which retains a satisfactory performance as shown in the following lemma.

Lemma 2.1. Under the alternative of (2.1) and $M = o(n^{1/2})$, $L_t - n^{-1}f_t^T F_{n,M}^{-1}V_B$ always attains its maximum at one of the change points $1 \le \tau_1 < \cdots < \tau_q < n$.

Let $\delta^2 = (\mu_1 - \mu_n)^T(\mu_1 - \mu_n)$ and $v_{max} = \max_{0<t<n}\sqrt{n^2\sigma_{nt}^2}$.

Theorem 2.4. Assume that the change point $\tau \in \{1, \cdots, n-1\}$ satisfies $\tau = O(n^\gamma)$ or $n - \tau = O(n^\gamma)$ with $\gamma \in [0, 1]$. Under the same conditions in Theorem 2.1, as $n \to \infty$,
$$\hat\tau - \tau = O_p\Big(\frac{n^{1-\gamma}\sqrt{\log n}\,v_{max}}{\delta^2}\Big).$$

Remark 2.4. To fully understand the rate of convergence, we need to specify the order of $v_{max}$. Similar to Aston and Kirch (2014), we consider cross-sectional dependence but temporal independence to provide quick insight. By using (2.9) in Proposition 2.1,
$$v_{max} = \Big[2\,\mathrm{tr}\{C^2(0)\} + 4n^{-1}\max_{0<t<n}t(n-t)(\bar\mu_{\le t} - \bar\mu_{>t})^T C(0)(\bar\mu_{\le t} - \bar\mu_{>t})\Big]^{1/2}.$$
Under the local alternative that the change in $\mu$ tends to zero, the leading order is $v_{max} = \sqrt{2\,\mathrm{tr}\{C^2(0)\}} \asymp p^{1/2}$ if all the eigenvalues of $C(0)$ are bounded, and thus
$$\hat\tau - \tau = O_p\Big(\frac{n^{1-\gamma}\sqrt{\log n}\,p^{1/2}}{\delta^2}\Big),$$
which allows comparison with the rates of other change-point estimators. For example, when the change point is away from the boundary ($\gamma = 1$), the rate of convergence established in Bai (2010) is $p^{1/2}/\delta^2$, which is faster than ours by the factor $\sqrt{\log n}$. Similarly, the rate of convergence in Horváth et al. (2017) can also be faster than ours. Nevertheless, the result established in Theorem 2.4 is more general as it includes the case that the change point is near the boundary ($\gamma < 1$) (see Remark 2.5 below).

Remark 2.5. In the change point literature, it is commonly assumed that the location of the change point $\tau$ is of the form $\kappa n$ with $\kappa \in (0, 1)$, which is $\tau = O(n)$ with $\gamma = 1$ in terms of our notation. The corresponding convergence rate is $\sqrt{\log n}\,v_{max}\delta^{-2}$. However, Theorem 2.4 considers a more general case as $\gamma$ can vary within $[0, 1]$. For example, when the actual change point $\tau = O(1)$ (near the boundary), we can establish the convergence rate $n\sqrt{\log n}\,v_{max}\delta^{-2}$. From this, we know that in order to attain the same convergence rate as for $\tau = O(n)$, a stronger signal is needed, namely n times the signal $\delta^2$ used for $\tau = O(n)$.

Remark 2.6. With fixed dimension p and $\tau = O(n)$, our result shows $\hat\tau - \tau = O_p(\sqrt{\log n})$. A similar result was also derived in Aue et al. (2009) for change point detection in the covariance structure of fixed-dimensional time series (see their (2.13) for details, where $\hat\theta_n - \theta$ is equivalent to our $(\hat\tau - \tau)/n$). However, the convergence rate can be faster if the signal-to-noise ratio $\delta^2/v_{max}$ diverges as p increases. Our result clearly demonstrates the beneficial effect of p on the change point identification.

The developed single change-point testing and estimation can be readily combined with other algorithms to estimate multiple change points $1 \le \tau_1 < \cdots < \tau_q < n$ in (2.1). The classical algorithm is binary segmentation, which was proposed in Vostrikova (1981) and Venkatraman (1992) in univariate settings. We implement the algorithm in high-dimensional settings as follows; a schematic implementation is sketched after the steps.

• For $s, t \in \{1, \cdots, n\}$ satisfying $s < t$, the test statistic $\hat L[s,t]$ and the estimator of its standard deviation $\hat\sigma_{n,0}[s,t]$ are computed from the data between s and t. If
$$\frac{\hat L[s,t]}{\hat\sigma_{n,0}[s,t]} \le z_{\alpha_n},$$
where $\alpha_n$ is a chosen nominal significance level, the algorithm stops. Otherwise, a change point is estimated as
$$\hat\tau = \arg\max_{l\in[s,t]}\hat L_l[s,t],$$
and $[s,t]$ is partitioned into $[s,\hat\tau]$ and $[\hat\tau+1, t]$.

• The algorithm starts with s = 1 and t = n, and repeats the above steps iteratively until no more change point can be estimated from any segment.
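As referenced above, the recursion can be written generically, with the segment-level test abstracted behind a user-supplied function. The `toy_seg_test` below is a crude stand-in (the naive, bias-uncorrected statistic with an ad hoc standardization), not the calibrated $\hat L/\hat\sigma_{n,0}$ of Section 2.2; it only serves to make the sketch runnable.

```python
import numpy as np

def binary_segmentation(X, seg_test, z_alpha, s=0, t=None, found=None, min_len=4):
    """Generic binary segmentation on rows X[s:t].
    seg_test(segment) -> (z, k): a standardized test value and the size k
    (1 <= k <= len-1) of the left sub-segment at the estimated split."""
    t = X.shape[0] if t is None else t
    found = [] if found is None else found
    if t - s < min_len:
        return found
    z, k = seg_test(X[s:t])
    if z > z_alpha:
        found.append(s + k)                       # estimated change point
        binary_segmentation(X, seg_test, z_alpha, s, s + k, found, min_len)
        binary_segmentation(X, seg_test, z_alpha, s + k, t, found, min_len)
    return sorted(found)

def toy_seg_test(seg):
    # stand-in: naive mean-change statistic over all splits, crudely standardized
    n = seg.shape[0]
    stats = np.array([t * (n - t) / n**2 *
                      np.sum((seg[:t].mean(0) - seg[t:].mean(0)) ** 2)
                      for t in range(1, n)])
    z = (stats.max() - stats.mean()) / (stats.std() + 1e-12)
    return z, int(stats.argmax()) + 1

rng = np.random.default_rng(1)
X = rng.standard_normal((150, 50))
X[40:, :10] += 1.0                                # one sparse mean shift at t = 40
print(binary_segmentation(X, toy_seg_test, z_alpha=3.0))
```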

The binary segmentation becomes inefficient under the configuration that a short segment with piecewise constant signal is buried in the middle of the sequence. The obstacle can be overcome by combining our single change-point testing and estimation with the wild binary segmentation proposed by Fryzlewicz (2014). The algorithm is implemented as follows (see the sketch after these steps).

• For $s, t \in \{1, \cdots, n\}$ satisfying $s < t$, the intervals $[s_m, t_m]$ for $m = 1, \cdots, H$ are randomly generated with replacement from $\{s, \ldots, t\}$, where the choice of H is given in Fryzlewicz (2014).

• Among all $[s_m, t_m]$ for $m = 1, \cdots, H$, let
$$(m^*, \hat\tau) = \arg\max_{m\in\{1,\cdots,H\},\,l\in\{s_m,\cdots,t_m-1\}}\hat L_l[s_m, t_m],$$
and let $[s_{m^*}, t_{m^*}]$ be the corresponding interval. If
$$\frac{\hat L[s_{m^*}, t_{m^*}]}{\hat\sigma_{n,0}[s_{m^*}, t_{m^*}]} > \zeta_n,$$
where, similar to Fryzlewicz (2014), the threshold $\zeta_n = \sqrt{2\log n}$, then $\hat\tau$ is an identified change point and $[s,t]$ is partitioned into $[s,\hat\tau]$ and $[\hat\tau+1, t]$. Otherwise, the algorithm stops.

• The algorithm starts with s = 1 and t = n, and repeats the above steps recursively until no more change point can be identified.
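The wild variant admits the same schematic treatment: draw H random sub-intervals, take the maximizing interval and split, compare against $\zeta_n = \sqrt{2\log n}$, and recurse. `seg_test` is the same placeholder as in the binary segmentation sketch; H and the minimum segment length are illustrative defaults.

```python
import numpy as np

def wild_binary_segmentation(X, seg_test, H=1000, zeta=None, s=0, t=None,
                             found=None, min_len=4, rng=None):
    """Schematic WBS: among H random sub-intervals of [s, t), pick the
    interval/split maximizing the statistic, test it against zeta, recurse."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    t = n if t is None else t
    found = [] if found is None else found
    zeta = np.sqrt(2 * np.log(n)) if zeta is None else zeta
    if t - s < min_len:
        return found
    best = None
    for _ in range(H):
        a, b = sorted(rng.integers(s, t, size=2))   # random interval [a, b)
        if b - a < min_len:
            continue
        z, k = seg_test(X[a:b])
        if best is None or z > best[0]:
            best = (z, a + k)
    if best is not None and best[0] > zeta:
        tau = best[1]
        found.append(tau)
        wild_binary_segmentation(X, seg_test, H, zeta, s, tau, found, min_len, rng)
        wild_binary_segmentation(X, seg_test, H, zeta, tau, t, found, min_len, rng)
    return sorted(found)
```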

2.4 Power Enhancement Test for Change Point

When the change occurs in a small number of components of the population mean, the $L_2$-norm statistic $\hat L_t$ given by (2.4) includes too many components that contribute no signal but noise. Moreover, when there are only a few change points in the sequence of data, $\hat L$ given by (2.12) sums over too many time points at which the corresponding $\hat L_t$'s are not significantly different from zero. As a result, the test based on $\hat L$ may not be powerful for testing the existence of change points.

To tackle such a problem, we consider adding a power enhancement statistic $L_0$ to the original $L_1 \equiv \hat L/\hat\sigma_{n,0}$. The new test statistic $L \equiv L_1 + L_0$ should be more powerful than the original $L_1$ under the alternative hypothesis, but still possess the correct empirical size under the null hypothesis.

The power enhancement statistic $L_0$ is defined as
$$L_0 = \sqrt{p}\sum_{t\in\hat S_\alpha}\sum_{j=1}^{p}\hat\delta_{t,j}^2\,I\Big(\frac{|\hat\delta_{t,j}|}{\sqrt{\hat\sigma_{t,j}^2}} \ge \sqrt{\log(\log n)\log p}\Big), \quad \text{where}$$
$$\hat S_\alpha = \Big\{t : \frac{\hat L_t}{\sqrt{2}\,\hat\sigma_{nt,0}} \ge \sqrt{2\log n - \log(\log n) + \log\log(n^{1/\pi})}\Big\}. \qquad (2.14)$$

In (2.14),

$$\hat\delta_{t,j} = \bar X_{\le t,j} - \bar X_{>t,j} = \frac{1}{t}\sum_{i_1=1}^{t}X_{i_1,j} - \frac{1}{n-t}\sum_{i_2=t+1}^{n}X_{i_2,j}, \qquad (2.15)$$
which is an unbiased estimator of
$$\delta_{t,j} = \bar\mu_{\le t,j} - \bar\mu_{>t,j} = \frac{1}{t}\sum_{i_1=1}^{t}\mu_{i_1,j} - \frac{1}{n-t}\sum_{i_2=t+1}^{n}\mu_{i_2,j}, \qquad (2.16)$$

where $t \in \{1, \ldots, n-1\}$ and $j \in \{1, \ldots, p\}$. Also in (2.14),
$$\hat\sigma_{t,j}^2 = \frac{n}{t(n-t)}f_t^T F_{n,M}^{-1}\hat V_j, \qquad (2.17)$$
where $f_t$ and $F_{n,M}$ are given in (2.5) and (2.6), respectively, and $\hat V_j$ is the $j$th margin of $\hat V$ in (2.7), such that $\hat V = \sum_{j=1}^{p}\hat V_j$, and for $h = 0, \ldots, M$,
$$\hat V_j = \{\widehat{C(0)}_j, \ldots, \widehat{C(M)}_j\}^T \quad \text{with} \quad \widehat{C(h)}_j = \frac{1}{n}\sum_{i=1}^{n-h}(X_{i,j} - \bar X_j)(X_{i+h,j} - \bar X_j).$$

As shown in Proposition 2.1 of Section 2.1, under the null hypothesis, $\hat\sigma_{t,j}^2$ is an unbiased estimator of
$$\sigma_{t,j}^2 \equiv \mathrm{Var}(\hat\delta_{t,j}) = \frac{n}{t(n-t)}f_t^T\{C(0)_j, \ldots, C(M)_j\}^T, \qquad (2.18)$$
where $C(h)_j$ is the $j$th diagonal element of $C(h)$, with $h = 0, \ldots, M$.

In $\hat S_\alpha$, $\hat L_t$ and $\hat\sigma_{nt,0}^2$ are given by (2.4) and (2.11), respectively. Through the threshold level $\sqrt{2\log n - \log(\log n) + \log\log(n^{1/\pi})}$, the set $\hat S_\alpha$ is expected to contain the true change points and the time points where the signal-to-noise ratio $L_t/\sigma_{nt}$ is significantly different from zero. The power can be further improved by imposing another threshold level $\sqrt{\log(\log n)\log p}$ in $L_0$, which excludes those non-signal-bearing dimensions. A similar screening statistic was also considered in Fan et al. (2015) and Yang and Pan (2017).
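Given the screened quantities above, $L_0$ is a direct double sum. The sketch below assumes its inputs (the standardized $\hat L_t/\hat\sigma_{nt,0}$ path and the matrices of $\hat\delta_{t,j}$ and $\hat\sigma_{t,j}$) have already been computed; the function itself just applies the two thresholds in (2.14).

```python
import numpy as np

def power_enhancement_L0(L_std, delta_hat, sigma_hat, n, p):
    """L0 of (2.14). Row t-1 of each array corresponds to time point t:
    L_std[t-1]        = Lhat_t / sigma_hat_{nt,0}
    delta_hat[t-1, j] = (2.15);  sigma_hat[t-1, j] = sqrt of (2.17)."""
    lam1 = np.sqrt(2 * np.log(n) - np.log(np.log(n))
                   + np.log(np.log(n ** (1 / np.pi))))
    lam2 = np.sqrt(np.log(np.log(n)) * np.log(p))
    in_S = (L_std / np.sqrt(2.0)) >= lam1            # screened set S_alpha-hat
    keep = np.abs(delta_hat) / sigma_hat >= lam2     # signal-bearing coordinates
    return np.sqrt(p) * np.sum(delta_hat**2 * keep * in_S[:, None])
```

The final test is then $L = L_1 + L_0$ with $L_1 = \hat L/\hat\sigma_{n,0}$, rejecting when $L > z_\alpha$.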

The power enhancement properties of $L_0$ are shown under the following conditions.

(C3). As $n \to \infty$, $\log p = o(n^{1/3})$.

(C4). For $X_i = \mu_i + W_i$, $i = 1, \ldots, n$, there exists a positive constant $H$ such that for $h_0 \in [-H, H]^2$, $E\{e^{h_0^T[(W_i^{(k)})^2,\,(W_i^{(l)})^2]^T}\} < \infty$ for $k \ne l$.

The condition (C3) specifies the growth rate of dimension p relative to n under which the large deviation results can be applied. The condition (C4) assumes a bivariate sub-Gaussian distribution of $(X_{i,k}, X_{i,l})$, which is more general than the Gaussian distribution. Such conditions are also considered in Chen et al. (2019) to build a thresholding statistic.

Let $\lambda_1 \equiv \sqrt{2\log n - \log(\log n) + \log\log(n^{1/\pi})}$, $\lambda_2 \equiv \sqrt{\log(\log n)\log p}$ and $S_\alpha \equiv \{t : |L_t|/\sigma_{nt} \ge 2\lambda_1\}$. Similar to Fan et al. (2015), the asymptotic properties of $L = L_1 + L_0$ are shown in the following theorem.

Theorem 2.5. Assume (C1)-(C4).

(i) Under $H_0$, as $n \to \infty$, $P(L_0 = 0 \mid H_0) \to 1$, and
$$L = L_1 + L_0 \stackrel{d}{\longrightarrow} N(0, 1).$$

(ii) Under $H_\alpha$, let $\sigma_{t,j}^{*2} \equiv E(\hat\sigma_{t,j}^2)$. If $\max_t |L_t|/\sigma_{nt} \ge 2\lambda_1$ and $\max_{t\in S_\alpha}\max_j |\delta_{t,j}|/\sigma_{t,j}^* \ge 2\lambda_2$, then as $n \to \infty$,
$$P(L > z_\alpha \mid H_\alpha) \to 1,$$
where $z_\alpha$ is the upper-$\alpha$ quantile of N(0, 1).

Based on Theorem 2.5, we reject the null hypothesis of (2.1) at the nominal significance level $\alpha$ if $L > z_\alpha$. The empirical performance of the power enhancement testing procedure and its comparison with $L_1$ are given in Section 2.6.3.

2.5 Elbow Method for Dependence

In previous sections, we have introduced our procedure for testing and estimating one or multiple change points. The procedure relies on the choice of M, which is unknown in practice. According to (C1), M separates dominant temporal dependence from the remainder. As demonstrated in the simulation studies of Section 2.6, if data are dependent (M ≠ 0), wrongly applying the procedure based on the assumption that M = 0 can cause severe type I error and thus produce a lot of false positives when estimating locations of change points. On the other hand, choosing a value that is larger than the actual M will reduce the power of the test and thus generate many false negatives.

In this section, we propose a quite simple way to determine M.

Suppose that $\{X_i, 1 \le i \le n\}$ satisfies the condition (C1) for dependence. Then $\mathrm{Cov}(X_i, X_j) = C(i - j)$ is relatively small if $|i - j| > M$, or equivalently, $\mathrm{tr}\{C(h)C^T(h)\}$ is small if $|h| > M$. The unknown $\mathrm{tr}\{C(h)C^T(h)\}$ can be consistently estimated by (2.10) under the null hypothesis according to the proof of Theorem 2.2. Even under the alternative hypothesis, the effect of heterogeneity of the means $\mu_i$ on the estimation is of small order as long as the heterogeneity is not too strong. Inspired by the automatic bandwidth selection procedure in Horváth, Rice and Whipple (2016), we determine M by calculating (2.10) for each integer starting from 0, and terminate the process once a small value appears. Visually, we can plot $\widehat{\mathrm{tr}\{C(h)C^T(h)\}}$ versus h, and choose the elbow in the plot for M.
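As a quick illustration of the elbow idea, the sketch below replaces the bias-corrected estimator (2.10) with a simple sample-autocovariance plug-in for $\mathrm{tr}\{C(h)C^T(h)\}$ and declares the elbow at the first lag where the curve collapses; the flatness threshold `drop` is an ad hoc device of ours, not part of the procedure.

```python
import numpy as np

def elbow_M(X, h_max=6, drop=0.2):
    """Estimate M by the elbow of h -> tr{C(h) C(h)^T}-hat (plug-in version)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    tr = []
    for h in range(h_max + 1):
        Ch = Xc[:n - h].T @ Xc[h:] / n       # sample autocovariance C(h)-hat
        tr.append(np.sum(Ch * Ch))           # tr{C(h) C(h)^T} = sum of squared entries
    tr = np.array(tr)
    flat = np.where(tr < drop * tr[0])[0]    # lags where the curve has collapsed
    return int(flat[0]) - 1 if flat.size else h_max   # elbow at h = M + 1

# For a temporally independent sequence the elbow appears at h = 1, giving M = 0.
```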

To demonstrate the idea above, we generated the random sample $\{X_i\}_{i=1}^{n}$ from (2.19) in Section 2.6 with n = 150 and p = 600, and considered the actual M = 0 and 2, respectively. Figure 2 illustrates $\widehat{\mathrm{tr}\{C(h)C^T(h)\}}$ versus h under both null and alternative hypotheses based on 50 iterations. When the actual M = 0, the elbow appeared at h = 1, which suggested estimating M by 0. Similarly, when the actual M = 2, the elbow appeared at h = 3, which suggested estimating M by 2.

Figure 2: The elbow method for choosing M for dependence under both null and alternative hypotheses. The results were obtained based on 50 replications.

2.6 Numerical Studies

In this section, we present simulation results about the empirical performance of the proposed methods. The random samples $\{X_i\}_{i=1}^{n}$ were generated from the following multivariate linear process
$$X_i = \mu_i + \sum_{l=0}^{M+2} Q_l\,\epsilon_{i-l}, \qquad (2.19)$$
where $\mu_i$ is the p-dimensional population mean vector at point i, $Q_l$ is a $p \times p$ matrix for $l = 0, \cdots, M+2$, and $\epsilon_i$ is a p-variate random vector with mean 0 and identity covariance $I_p$. In the simulation, we set $Q_l = \{0.6^{|i-j|}(M-l+1)^{-1}\}$ for $i, j = 1, \cdots, p$ and $l = 0, \cdots, M$. For $Q_{M+1}$ and $Q_{M+2}$, we considered two different scenarios. If M = 0, we simply chose $Q_{M+1} = Q_{M+2} = 0$ such that $\{X_i\}_{i=1}^{n}$ became an independent sequence. If $M \ne 0$, we chose $Q_{M+1} = Q_{M+2}$, and each row of them had only 0.05p non-zero elements that were randomly chosen from $\{1, \cdots, p\}$ with magnitude generated by Unif(0, 0.05). By doing so, we modeled the temporal dependence dominated by $Q_l$ for $l = 0, \cdots, M$ plus perturbations contributed by $Q_{M+1}$ and $Q_{M+2}$.
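For reproducibility of the design just described, here is a sketch generating data from (2.19) under the stated choices of $Q_l$; the Gaussian innovations and the 0-based indexing convention (`tau` = number of pre-change observations) are our assumptions.

```python
import numpy as np

def generate_2_19(n, p, M, delta=0.0, tau=None, seed=0):
    """Data from the linear process (2.19): Q_l = {0.6^{|i-j|}(M-l+1)^{-1}}
    for l <= M, sparse random perturbations Q_{M+1} = Q_{M+2} when M != 0,
    and (under the alternative) a mean shift of magnitude delta on
    [p^0.7] random coordinates after tau."""
    rng = np.random.default_rng(seed)
    absdiff = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    Q = [0.6 ** absdiff / (M - l + 1) for l in range(M + 1)]
    if M == 0:
        Q += [np.zeros((p, p)), np.zeros((p, p))]
    else:
        Qp = np.zeros((p, p))
        k = max(1, int(0.05 * p))                # 0.05p non-zeros per row
        for row in range(p):
            cols = rng.choice(p, size=k, replace=False)
            Qp[row, cols] = rng.uniform(0, 0.05, size=k)
        Q += [Qp, Qp]                            # Q_{M+1} = Q_{M+2}
    eps = rng.standard_normal((n + M + 2, p))    # Gaussian innovations (assumed)
    X = np.stack([sum(Q[l] @ eps[i + M + 2 - l] for l in range(M + 3))
                  for i in range(n)])
    if delta and tau is not None:                # alternative: sparse mean shift
        mu = np.zeros(p)
        coords = rng.choice(p, size=int(p ** 0.7), replace=False)
        mu[coords] = delta * rng.choice([-1.0, 1.0], size=coords.size)
        X[tau:] += mu
    return X
```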

2.6.1 Empirical performance of the proposed testing procedure

It is crucial to correctly test the existence of any change point, as it is the first step of the proposed change-point detection and identification procedure. The first part of the simulation studies is to investigate the empirical performance of the test statistic $\hat L$, whose asymptotic normality was established in Theorem 2.3. Without loss of generality, we chose $\mu_i = 0$ for $i = 1, \cdots, n$ under $H_0$ of (2.1). Under the alternative hypothesis, we considered one change point $\tau \in \{1, \cdots, n-1\}$ such that $\mu_i = 0$ for $i \le \tau$ and $\mu_i = \mu$ for $\tau + 1 \le i \le n$. The non-zero mean vector $\mu$ had $[p^{0.7}]$ non-zero components that were uniformly and randomly drawn from the p coordinates $\{1, \cdots, p\}$. Here, $[a]$ denotes the integer part of a. The magnitude of each non-zero entry of $\mu$ was controlled by a constant $\delta$ multiplied by a random sign. The nominal significance level was chosen to be 0.05. All the simulation results were obtained based on 1000 replications.

To show the performance of the proposed testing procedure, we also considered two competitors. One is the E-div test proposed by Matteson and James (2014). The E-div test obtains an approximate p-value to determine the statistical significance of a change point by performing random permutations. The testing procedure assumes independence of $\{X_i\}_{i=1}^{n}$. The other one is the CQ test by Chen and Qin (2010). The CQ test statistic is a linear combination of U-statistics and follows the asymptotic normality under the independence of $\{X_i\}_{i=1}^{n}$. Note that the CQ test was originally designed for the two-sample problem, requiring the change point to be known and, based on it, the whole sequence to be divided into two samples. To implement the CQ test, we used

21 Table 1: Empirical sizes and powers of the CQ, the E-div and the proposed tests based on 1000 replications with Gaussian ϵi in (2.19).

Size

n = 100 150 200

M method p = 200 600 1000 200 600 1000 200 600 1000

CQ 0.066 0.059 0.066 0.055 0.071 0.054 0.051 0.062 0.040

0 E-div 0.069 0.050 0.055 0.042 0.047 0.045 0.038 0.071 0.041

New 0.056 0.055 0.051 0.052 0.067 0.060 0.052 0.062 0.040

CQ 0.859 0.999 1.000 0.875 0.999 1.000 0.881 0.999 1.000

1 E-div 0.989 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000

New 0.039 0.044 0.053 0.059 0.046 0.050 0.050 0.044 0.054

CQ 0.990 1.000 1.000 0.991 1.000 1.000 0.995 1.000 1.000

2 E-div 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000

New 0.054 0.034 0.043 0.048 0.054 0.039 0.047 0.054 0.043

Power

CQ 0.274 0.318 0.401 0.407 0.524 0.601 0.558 0.764 0.826

0 E-div 0.107 0.147 0.164 0.142 0.195 0.224 0.146 0.253 0.326

New 0.190 0.193 0.234 0.273 0.304 0.366 0.327 0.508 0.555

the true change point to divide the sequence into two samples under the alternative hypothesis.

Under the null case, in order to obtain its size, we obtained the two samples by adopting the same time point used for the alternative.

Table 1 demonstrates the empirical sizes and powers of three tests with Gaussian ϵi in (2.19). Under the alternative hypothesis, we chose the location of the change point τ = 0.4n and magnitude

δ = 0.3. Under independent data (i.e., M = 0), the sizes of all three tests were well controlled around the nominal significance level 0.05. However, for both the dependent cases with M = 1 and 2, the CQ and E-div tests suffered severe size distortion. Different from those two tests, the proposed test still had well-controlled sizes around the nominal significance level 0.05. Due to


Figure 3: Empirical powers of the proposed testing procedure with δ = 1 based on 1000 replicates under different n, p, M and locations of the change point.

severe size distortion of the CQ and E-div tests under data dependence, we only showed power comparison for the case M = 0. Empirical powers of all three tests increased as n and p increased.

The reason the CQ test had the best power among the three is that it utilized the information of the known location of the change point. In a real application, such information is unavailable. The proposed test always enjoyed greater powers than the E-div test with respect to different n and p.

Under dependence, we studied the effect of n, p and especially location of the change point on

the power of the proposed test. Figure 3 shows that with δ = 1 and for different locations of the change point (τ = 0.1n, 0.2n, 0.5n), the empirical powers increased as n and p increased. However, the closer the change point was to the center, the greater the power of the test. Moreover, our results also showed that the powers of the test under weaker dependence (e.g., M = 1) were greater than those under stronger dependence (e.g., M = 2), indicating the adverse effect of data dependence on the change point detection.

2.6.2 Empirical performance of change point estimates

The second part of the simulation studies aims to investigate the empirical performance of the change point estimator $\hat\tau$ in (2.13). We first considered the situation with one change point $\tau \in \{1, \cdots, n-1\}$ such that $\mu_i = 0$ for $i \le \tau$ and $\mu_i = \mu$ for $\tau + 1 \le i \le n$. The non-zero mean vector $\mu$ had $[p^{0.7}]$ non-zero components, which were uniformly and randomly drawn from the p coordinates $\{1, \cdots, p\}$. The magnitude of each non-zero entry of $\mu$ was controlled by a constant $\delta$ multiplied by a random sign. Figures 4 and 5 demonstrate the proportion of the 1000 iterations detecting the change point located at the time points 40 and 2, respectively. In both figures, the probability of detecting the change point increased as the dimension p or $\delta$ increased. Comparing the upper panel (M = 2) with the lower panel (M = 0, that is, independence) in each figure, the probability of detecting the change point became lower as the dependence became stronger. Also, a stronger signal was needed in Figure 5 to retain a detection probability similar to that in Figure 4, where the change point was located at the center. The results in Figures 4 and 5 are in excellent agreement with the theoretical analysis of the change point estimator $\hat\tau$ in Theorem 2.4.

To demonstrate the performance of the proposed binary segmentation method for multiple change-point detection, we chose n = 150 and considered three change points at 15, 75 and 105, respectively. In particular, for 1 ≤ i ≤ 15, µi = 0. For 16 ≤ i ≤ 75, the non-zero entry of µi was controlled by a constant δ1. For 76 ≤ i ≤ 105, µi = 0 and for 106 ≤ i ≤ 150, the non-zero entry of

µi was controlled by another constant δ2. We compared our method with E-div method in terms of false positives (FP), false negatives (FN), and true positives (TP). The FP is the number of time points that are wrongly estimated as change points. The FN is the number of change points that are wrongly treated as time points without change. And TP is the total number of identified change points. A procedure is better if it has smaller FP and FN, but TP is close to 3 which is the

24 -i rcdr uee eeeF o l ae lhuhi a agrT.Dffrn rmteE-div the from Different TP. larger had it although cases all for FP severe suffered procedure E-div independence as Under increased points. change two three of the performance estimating the (i.e., when demonstrates iterations 2 Table 1000 on design. based our methods on based points change of number total each for with iterations dependent 1000 are on when data based 40 obtained at are point probabilities change The a respectively. detecting 600, of probability The 4: Figure

M Probability to detect change point Probability to detect change point

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 ) h w ehd a iia efrac ihbt PadF erae u TP but decreased FN and FP both with performance similar had methods two the 0), = 37 37 ● ● p n/r( and/or 38 38 ● ● δ δ 39 39 ● ● = = δ 1 1

1

M δ ,

n n = = 40 40 2 n n 100 100 ● ● nrae.O h te ad ne eedne(e.g., dependence under hand, other the On increased. ) .Lwrpnl aaaeidpnetwith independent are data panel: Lower 2. =

M M = = 41 41 ● ● 2 0 ● 42 42 ● ● p p p = = = 600 300 100 43 43 ● ● 25

Probability to detect change point Probability to detect change point

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 37 37 ● ● 38 38 ● ● δ δ 39 39 = = ● ● n 1.5 1.5

0 and 100 =

n n = = 40 40 n n ● ● 100 100 M

M M 0. = 41 41 = = ● ● 0 2 p p 100 = 42 42 pe panel: Upper . ● ● M ) the 2), = 43 43 , ● ● 0 and 300 epciey npriua,fr1 for particular, In respectively. choose we mentation, rcdr,tepooe ehdawy a PadF ne oto.Ms motnl,similar importantly, Most control. under of FN case and the FP to had always method proposed the procedure, each for with iterations dependent 1000 when are on 2 data based at obtained are point probabilities change The a respectively. detecting 600, of probability The 5: Figure ocmaetepromneo h rpsdwl iaysgetto ihtebnr seg- binary the with segmentation binary wild proposed the of performance the compare To

Probability to detect change point Probability to detect change point

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 1 1 ● ● M 2 2 ● ● ,i noe mle PadF u agrT as TP larger but FN and FP smaller enjoyed it 0, = δ δ 3 3 ● ● = = n 5 5

150, =

M

n n = = n 4 n 4 100 100 ● ● .Lwrpnl aaaeidpnetwith independent are data panel: Lower 2. =

M M = = p 5 5 ● ● 2 0 ≤ 0 n 0,adcnie w hnepit t7 n 80, and 70 at points change two consider and 600, and 200 = i ● ≤ 6 6 ● ● p p p 0ad81 and 70 = = = 600 300 100 7 7 ● ● 26

Probability to detect change point Probability to detect change point ≤ 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 i 1 1 ● ● ≤ 150, 2 2 ● ● µ i δ δ .Fr71 For 0. = 3 3 = = ● ● n 6.5 6.5 0 and 100 =

p

n n = = n/r( and/or n 4 n 4 ● ● 100 100 M

M M 0. = = = 5 5 ● ● 0 2 ≤ δ p 1 p i δ , 100 = pe panel: Upper . 6 6 ● ● ≤ 2 increased. ) 80, , 7 7 ● ● 0 and 300 µ i 0, ̸= Table 2: The performance of the proposed binary segmentation and E-div method for estimating multiple change points. The average FP, FN, TP and corresponding standard deviations (subscript numbers) are obtained based on 1000 replications.

N = 200 600 1000

M New E-div New E-div New E-div

(δ1, δ2) = (1.5, 1.5)

FP 0.2100.454 0.1490.396 0.0780.279 0.0830.300 0.0330.190 0.0570.240

0 FN 0.1530.390 0.0940.318 0.0470.212 0.0220.147 0.0220.147 0.0060.077

TP 2.8470.390 2.9060.318 2.9530.212 2.9780.147 2.9770.146 2.9940.077

FP 0.5020.612 18.6192.864 0.2740.495 20.3361.434 0.2060.436 20.1871.360

2 FN 2.1620.648 1.0940.877 1.8890.641 0.5990.735 1.7060.751 0.4210.647

TP 0.8380.648 1.9060.877 1.1110.641 2.4010.735 1.2940.751 2.5790.647

(δ1, δ2) = (2, 2)

FP 0.0270.168 0.0580.258 0.0040.063 0.0510.254 0.0010.032 0.0520.244

0 FN 0.0080.089 0.0030.055 0.0010.032 0.0010.032 00 00

TP 2.9920.089 2.9970.055 2.9990.032 2.9990.032 30 30

FP 0.1550.373 17.6442.805 0.0430.203 19.7851.215 0.0280.165 19.8241.179

2 FN 1.4010.846 0.2340.471 1.1110.895 0.0770.285 0.9170.888 0.0320.176

TP 1.5990.846 2.7660.471 1.8890.895 2.9230.285 2.0830.888 2.9680.176

the non-zero entry of which is controlled by a constant δ. From the design, we obtain a piecewise short segment with length only 10 buried in the long segment with length 150. As demonstrated in

Table 3, the wild binary segmentation performs much better than the binary segmentation when identifying the two change points.

27 Table 3: The performance of the proposed binary segmentation (BS) and wild binary segmentation

(WBS) for estimating multiple change points. The average FP, FN, TP and corresponding standard deviations (subscript numbers) are obtained based on 1000 replications. The number of randomly selected intervals for the wild binary segmentation is H = 1000.

δ = 2 3

M BS WBS BS WBS

p = 200

FP 0.1750.512 0.2570.528 0.2800.562 0.1210.372

0 FN 1.8380.527 0.2950.569 1.5260.827 00

TP 0.1620.527 1.7050.569 0.4740.827 20

FP 0.0590.271 0.2130.520 0.0330.200 0.1720.430

2 FN 1.9980.063 1.8450.419 1.9870.144 0.5870.799

TP 0.0020.063 0.1550.419 0.0130.144 1.4130.799 p = 600

FP 0.1540.448 0.1390.374 0.3400.550 0.0470.212

0 FN 1.8250.548 0.0610.282 1.2710.949 00

TP 0.1750.548 1.9390.282 0.7290.949 20

FP 0.0400.233 0.1320.362 0.0290.195 0.0370.189

2 FN 1.9960.077 1.6450.616 1.9760.208 0.1610.468

TP 0.0040.077 0.3550.616 0.0240.208 1.8390.468

2.6.3 Empirical performance of the power enhancement testing procedure

This part of the simulation studies is to investigate the empirical performance of the power en- hancement test statistic L = L1 + L0. The random sample {Xi} for i = 1, . . . , n, were generated from the following multivariate linear process

∑M Xi = µi + Qlϵil, l=0 where µi is the p-dimensional population mean vector at point i, Ql is a p × p matrix for l =

0,...,M, and ϵi is a p-variate random vector with mean 0 and identity covariance matrix Ip. The

28 Table 4: Empirical sizes of L1 and L based on 1000 replications with Gaussian ϵi.

M = 0

n = 100 n = 150 n = 200

p = 300 600 1000 300 600 1000 300 600 1000

L1 0.056 0.059 0.050 0.060 0.050 0.071 0.048 0.066 0.046 L 0.087 0.081 0.065 0.091 0.063 0.076 0.070 0.071 0.047

M = 1

n = 100 n = 150 n = 200

p = 300 600 1000 300 600 1000 300 600 1000

L1 0.054 0.046 0.059 0.043 0.055 0.048 0.047 0.061 0.048 L 0.078 0.055 0.065 0.050 0.061 0.053 0.053 0.068 0.051

M = 2

n = 100 n = 150 n = 200

p = 300 600 1000 300 600 1000 300 600 1000

L1 0.051 0.036 0.035 0.051 0.038 0.047 0.050 0.056 0.041 L 0.066 0.036 0.036 0.059 0.042 0.047 0.054 0.057 0.042

model was considered to account for the M-dependent stationary process such that Cov(Xi,Xj) = ∑ M T | − | ≤ l=|i−j| QlQl−|i−j|, if i j M and Cov(Xi,Xj) = 0 otherwise. In the simulation, we set |i−j| −1 Ql = {0.5 I(|i − j| < p/2)/(M − l + 1) }, for i, j = 1, . . . , p and l = 0, 1,...,M.

Without loss of generality, we chose µi = 0 for i = 1, . . . , n under H0. Under the alternative hypothesis, we considered one change-point τ ∈ {1, . . . , n − 1} such as µi = 0 for i ≤ τ and µi = µ for τ + 1 ≤ i ≤ n. The non-zero mean vector µ had [p1−β] non-zero entries which were uniformly and randomly drawn from p coordinates {1, . . . , p}. Here, β ∈ (0, 1), with β closer to 1, there are fewer non-zero entries on µi, and [a] denotes the integer part of a. The magnitude of non-zero entries of µi was controlled by a constant δ multiplied by a random sign. The nominal significance level was chosen to be 0.05. All the simulation results were obtained based on 1000 replications.

Table 4 demonstrates the empirical sizes of L1 and L with Gaussian ϵi. When the dimension

29 Table 5: Empirical powers of L1 and L with τ = 0.1n and β = 0.8, based on 1000 replications under different M and magnitude δ.

M = 0, δ = 1.5

n = 100 n = 150 n = 200

p = 300 600 1000 300 600 1000 300 600 1000

L1 0.098 0.078 0.079 0.143 0.108 0.103 0.163 0.135 0.103 L 0.189 0.149 0.112 0.247 0.171 0.175 0.306 0.256 0.180

M = 1, δ = 2.5

n = 100 n = 150 n = 200

p = 300 600 1000 300 600 1000 300 600 1000

L1 0.087 0.073 0.062 0.128 0.125 0.099 0.182 0.156 0.128 L 0.155 0.132 0.092 0.249 0.211 0.150 0.369 0.325 0.220

M = 2, δ = 3.5

n = 100 n = 150 n = 200

p = 300 600 1000 300 600 1000 300 600 1000

L1 0.101 0.094 0.064 0.162 0.131 0.119 0.221 0.202 0.161 L 0.198 0.153 0.099 0.345 0.293 0.196 0.554 0.473 0.300

is low, the power enhancement test has a little size distortion. However, the size distortion is alleviated as p, n increase. This is reasonable since the power enhancement statistic converges to zero in probability under the null hypothesis as p, n tend to infinity. Under the alternative hypothesis, we considered two different scenarios. In the first scenario, we chose β = 0.8, which is a very sparse situation, and τ = 0.1n. When p = 300, there are only [3000.2] = 3 non-zero elements on the mean vector µi. The magnitude δ = 1.5, 2.5 and 3.5 respectively, for M = 0, 1, and 2. Table

5 shows that, L is always more powerful than L1 for different p, n and M. In the second scenario, we considered that change point τ = 0.02n, which means it is close to the boundary. We chose

β = 0.3, and the magnitude δ = 1.5, 2, and 2.5 respectively for M = 0, 1, and 2. Table 6 shows that the empirical power is enhanced significantly by the power enhancement statistic L0 compared

30 Table 6: Empirical powers of L1 and L with τ = 0.02n and β = 0.3, based on 1000 replications under different M and magnitude δ.

M = 0, δ = 1.5

n = 100 n = 150 n = 200

p = 300 600 1000 300 600 1000 300 600 1000

L1 0.114 0.105 0.132 0.147 0.133 0.167 0.194 0.216 0.242 L 0.566 0.527 0.579 0.818 0.817 0.789 0.948 0.942 0.932

M = 1, δ = 2

n = 100 n = 150 n = 200

p = 300 600 1000 300 600 1000 300 600 1000

L1 0.080 0.084 0.093 0.092 0.127 0.103 0.135 0.152 0.169 L 0.477 0.484 0.482 0.652 0.638 0.555 0.765 0.718 0.690

M = 2, δ = 2.5

n = 100 n = 150 n = 200

p = 300 600 1000 300 600 1000 300 600 1000

L1 0.065 0.056 0.074 0.103 0.108 0.111 0.118 0.131 0.146 L 0.810 0.755 0.780 0.818 0.766 0.720 0.868 0.816 0.775

with L1.

2.7 Application

We consider two datasets for applications: one is fMRI data and another is environmental data.

Before implementing the proposed methods, we need to validate the model (2.2) and associated assumptions. For the fMRI data analysis, the commonly used model is the general linear model

(GLM). Despite its usefulness, the GLM suffers some shortcomings (Robinson, Wager and Lindquist

2010): the is sensitive to the choice of the hemodynamic response function

(hrf) in modeling the BOLD signal; it requires the timing of neural activity onset and duration to be known, but in many psychological processes such information cannot be specified a priori; it can only be applied to a single voxel at a time and thus does not incorporate the dependence

31 Table 7: ROIs activated by the thin-body images and fat-body images for the normal weight subject and the overweight subject, respectively.

Normal weight subject 7 Overweight subject 1

left ACC, right ACC,

right DLPFC, right EBA, right FBA, Thin-body images left insula, left IPL, left insula, right insula.

right IPL, right MPFC.

left DLPFC, right DLPFC, right amygdala, Fat-body images right insula, right IPL, left MPFC, right MPFC. right MPFC.

among all voxels in the brain. The model (2.2) shares a similar structure but overcomes the above shortcomings of the GLM. Moreover, compared with other change-point analysis for fMRI data

(Lindquist, Waugh and Wager 2007; Lindquist 2008; Robinson, Wager and Lindquist 2010; Aston and Kirch 2012), the assumptions in the model (2.2) are very mild because we allow general spatial and temporal dependence and only require the fourth moments of Z are finite. Similar arguments can be applied to validate the implementation of the proposed methods for the second data analysis.

2.7.1 Application to fMRI data

Modern culture emphasizing thinness plays a vigorous role in shaping body image and social com- parison (Schwartz and Brownell 2004). Obesity is becoming a worldwide problem and high body image distress is one of the most common consequences, which could, in turn, lead to additional physical and psychological diseases (Gavin, Simon, and Ludman 2010). However, little is known about its brain mechanisms (Gao et al. 2016).

Southwest University, China conducted an fMRI experiment to exam the differences in brain activation between overweight and normal weight subjects when performing a body image self- reflection task. In the task, participants were instructed to view several fat and thin body images closely, and vividly imagine that someone was comparing her body to the body in the picture. The experiment comprised six blocks of the fat body condition and six blocks of the thin body condition,

32 and each block consisted of seven images. During the experiment, the brain of each participant was scanned every 2 seconds and a total of 280 images were taken, and each of image consisted of 131,072 voxels. Hence, for each subject, the high-dimensional time course data have p = 131, 072 and n =

280. The recorded fMRI data are publicly available at https://openfmri.org/dataset/ds000213/.

Our change point method can be applied to each individual to see how they react. We randomly picked normal weight subject 7 and overweight subject 1. Based on Gao et al.(2016), we used the MNI coordinates to partition all voxels into 16 distinct regions of interest (ROIs). Different

ROIs have different functions. For example, previous studies found that inferior parietal lobule

(IPL), extrastriate body area (EBA, lateral occipitotemporal cortex) and fusiformbody area (FBA) were related to perceptive processing of body image; Dorsolateral prefrontal cortex (DLPFC) and amygdala were related to affective processing of body image and can be activated when viewing body pictures with negative emotional valence; Medial prefrontal cortex (MPFC) was related with self-reflection; and ACC and insula were related to body dissatisfaction (Wagner et al. 2003; Uher et al. 2005; Kurosaki et al. 2006; Friederich et al. 2007; Miyake et al. 2010; Friederich et al. 2010;

Yang et al. 2014).

We applied the proposed method of change point detection to 16 ROIs with the nominal sig- nificance level α = 0.05 and the M dependence estimated by the elbow method. If there exists a change, indicating an ROI was activated. Table 7 lists the names and Figure 6 illustrates the physical locations of the ROIs activated (changes detected) by the thin-body images and fat-body images for the normal weight subject and the overweight subject. More specifically, 5 ROIs were activated for the overweight subjects when viewing fat-body images, while only 3 ROIs were acti- vated when viewing thin-body images. On the other hand, for the normal weight subjects, 8 ROIs were activated when viewing thin-body images, while 3 for fat-body images.

Our results indicate that the overweight subject showed stronger visual processing of fat body images than thin body images, whereas the normal weight subject showed stronger visual processing of thin body images than fat body images. Interestingly, we found that ACC was only activated for the normal weight women when viewing thin body images. Such a result was consistent with the finding of Friederich et al. (2007), that healthy women body dissatisfaction and self-ideal discrepancies can be greatly induced by exposure to attractive slim bodies of other women. This may be one reason that normal weight women are more motivated to watch their weight and keep

33 (a) Normal weight, thin-body images (c) Overweight, thin-body images

(b) Normal weight, fat-body images (d) Overweight, fat-body images

Figure 6: Activation map for ROIs. In panel (a): left insula (yellow); right EBA (cyan); left

ACC (darkblue); right ACC (blue); right MPFC (darkred); left IPL (orange); right IPL (red); right DLPFC (deepskyblue). In panel (b): right amygdala (darkblue); left MPFC (lightgreen); right MPFC (darkred). In panel (c): right FBA (darkblue); left insula (lightgreen); right insula

(darkred). In panel (d): right insula (lightgreen); right MPFC (darkred); left DLPFC (darkblue); right DLPFC (deepskyblue); right IPL (orange). in shape than overweight women.

2.7.2 Application to environmental data

Air pollution poses a significant risk to human health, and we are all strongest ally in the battle against air pollution. In this section, we applied the proposed change point detection method to two-year measurement of PM2.5 at 36 monitoring stations in Beijing, China from Jan. 1st, 2014 to

Dec. 31, 2015. To analyze the data, we used the nominal significance level α = 0.05. As PM2.5 may possess time persistency, we implemented the elbow method and found that M = 1 for dependence.

Under such choice, the proposed change point detection method identified the four change points at Feb. 12, 2014; Feb. 26, 2014; Apr. 11, 2015 and Nov. 26, 2015.

We found that these four changes were corresponding to four big changes in PM2.5 level. More precisely, we looked into these four dates to find the reason why the changes occur. First, to

34 Color Key PM2.5

100 500 Value PM2.5 at 36 Stations

− 2014.02.12− 2014.02.26 − 2015.04.11 − 2015.11.26

Time

Figure 7: Heatmap of PM2.5 at 36 stations measured in Beijing, China from Jan. 1st, 2014 to Dec.

31, 2015.

celebrate the Lantern Festival on Feb. 14, 2014, China suffered heavy air pollution (PM2.5 > 250

µg/m3) because of fireworks since Feb. 13, 2014. Second, thanks to the heavy rain on the morning of Feb. 27, 2014, the heavy haze lasted for more than a week was finally dissipated. Third, the very heavy air pollution was substantially alleviated on Apr. 12, 2015 because of the windy weather.

Forth, due to the extremely adverse weather conditions (high humidity, low wind speed and strong temperature inversion), Beijing encountered “the most serious haze (PM2.5 > 400 µg/m3) in the year 2015” since Nov. 27, 2017. Figure 7 demonstrates the heatmap of PM2.5 with darker color representing a higher value of the measurement. Note that the change detection method is not to

find the locations with a higher value of measurement, but to identify the time points involving abrupt changes. The identified four times points well match the big changes of PM 2.5 level.

35 2.8 Technical Details

2.8.1 Proofs of main theorems

Proof of Proposition 2.1.

We first show that the mean of Lˆt is given by (2.8) in Section 2.1. Note that

¯ − ¯ T ¯ − ¯ ¯ T ¯ ¯ T ¯ − ¯ T ¯ (X≤t X>t) (X≤t X>t) = X≤tX≤t + X>tX>t 2X≤tX>t.

First, ∑t T 1 T T T T E(X¯ X¯≤ ) = (µ µ + µ ϵ + ϵ µ + ϵ ϵ ) ≤t t t2 i j i j i j i j i,j=1 ∑ T 1 2 =µ ¯ µ¯≤ + tr(Σ) + tr{Cov(ϵ , ϵ )} ≤t t t t2 j i i M)tr{C(h)}. (2.20) t2 h=M+1 Second, based on similar derivation, we have

tr(Σ) 2 ∑M E(X¯ T X¯ ) =µ ¯T µ¯ + + (n − t − h)I(h < n − t)tr{C(h)} >t >t >t >t n − t (n − t)2 h=1 − − 2 n∑t 1 + (n − t − h)I(n − t > M)tr{C(h)}. (2.21) (n − t)2 h=M+1 At last,

1 ∑t ∑n E(X¯ T X¯ ) =µ ¯T µ¯ + tr{Cov(ϵ , ϵ )} ≤t >t ≤t >t t(n − t) j i i=1 j=t+1 − 1 n∑1 =µ ¯T µ¯ + min(h, n − h, t, n − t)tr{C(h)} ≤t >t t(n − t) h=1 1 ∑M =µ ¯T µ¯ + min(h, n − h, t, n − t)tr{C(h)} ≤t >t t(n − t) h=1 − 1 n∑1 + min(h, n − h, t, n − t)tr{C(h)} (2.22) t(n − t) h=M+1 Combining (2.20), (2.21) and (2.22), we can derive { } T T E (X¯≤t − X¯>t) (X¯≤t − X¯>t) = (¯µ≤t − µ¯>t) (¯µ≤t − µ¯>t) + {∆b + Rb},

36 (2.23) where, the bias term is ( ) { 1 1 ∑M t − h n − t − h ∆ = + tr(Σ) + 2 I(t > h) + I(n − t > h) b t n − t t2 (n − t)2 }h=1 min(h, n − h, t, n − t) − tr{C(h)}. t(n − t) and the remainder term is

{ − − − 1 ∑t 1 1 n∑t 1 R = 2 (t − h)I(t > M) + (n − t − h)I(n − t > M) b t2 (n − t)2 h=M+1 h=M+1 − } − 1 n∑1 n∑1 − min(h, n − h, t, n − t) tr{C(h)} ≤ c |tr{C(h)}|/n t(n − t) 0 h=M+1 h=M+1 for some constant c0, which does not depend on p and n. T −1 ˆ ˆ Next, we evaluate E(ft Fn,M V ), where ft, Fn,M , and V are specified in (2.4), (2.5) and (2.6). To find E(Vˆ ), we need to evaluate E[tr{C[(h)}], where

− 1 n∑h tr{C[(h)} = (X − X¯)T (X − X¯) for h = 0, ··· ,M. n i i+h i=1

Let Xi = Yi + µi where Yi has mean 0. Then,

− − 1 n∑h 1 n∑h tr{C[(h)} = (Y − Y¯ )T (Y − Y¯ ) + (Y − Y¯ )T (µ − µ¯) n i i+h n i i+h i=1 i=1 − − 1 n∑h 1 n∑h + (µ − µ¯)T (Y − Y¯ ) + (µ − µ¯)T (µ − µ¯), n i i+h n i i+h i=1 i=1 where only the first term has non-zero expectation under the null hypothesis, which is

{ − } 1 n∑h E (Y − Y¯ )T (Y − Y¯ ) n i i+h i=1 [ { − − − }] 1 n∑h n − h ∑n ∑n 1 n∑h ∑n 1 ∑n n∑h = tr E Y Y T + Y Y T − Y Y T − Y Y T n i i+h n3 i j n2 i j n2 i j+h i=1 i=1 j=1 i=1 j=1 i=1 j=1 h h ∑M l 2 − I(l, 0) = (1 − )tr{C(h)} + (1 − ) (1 − ) tr{C(l)} n n n n l=0 − { } 1 ∑M ∑n n∑h − I(|i − j|, l) + I(|i − j + h|, l) tr{C(l)} + R (h) n2 v l=0 i=1 j=1

37 ∑ ∑ ∑ ∑ { n−1 − − { } 3 − n−1 n n−h | − | | − where Rv(h) = 2 l=M+1(n h)(n l)tr C(l) /n l=M+1 i=1 j=1 I( i j , l) + I( i j + } ∑ | { } 2 ≤ n−1 | { }| h , l) tr C(l) /n c0 h=M+1 tr C(h) /n for some constant c0, which does not depend on p and n.

Therefore, for h = 0, ··· ,M,

h h ∑M l 2 − I(l, 0) E[tr{C[(h)}] = (1 − )tr{C(h)} + (1 − ) (1 − ) tr{C(l)} n n n n l=0 − { } 1 ∑M n∑h ∑n − I(|i − j|, l) + I(|i − j + h|, l) tr{C(l)} n2 l=0 i=1 j=1 − 1 n∑h + (µ − µ¯)T (µ − µ¯) + R (h). n i i+h v i=1 As a result, we have

E(Vˆ ) = E(tr{C[(0)}, ··· , tr{C\(M)})T

T = Fn,M (tr{C(0)}, ··· , tr{C(M)}) + Rv + VB, where Fn,M is a (M + 1) × (M + 1) matrix defined in (2.5), Rv = {Rv(0), ··· ,Rv(M)} and VB is defined in Proposition 2.1. Furthermore,

T −1 ˆ T { } ··· { } T T −1 T −1 E(ft Fn,M V ) = ft (tr C(0) , , tr C(M) ) + ft Fn,M Rv + ft Fn,M VB. (2.24)

Combing (2.23) and (2.24), we get

T −1 T −1 ft F VB ft F Rv t(n − t)R E(Lˆ ) = L − n,M − n,M + b . t t n n n2

− T −1 − 2 Therefore, the remainder term is ft Fn,M Rv/n+t(n t)Rb/n , which is negligible under condition (C1).

In the following, we derive the variance of Lˆt. From the definition of Lˆt, we can write Lˆt as

1 ∑n ∑n Lˆ = B (i, j)XT X , (2.25) t n2 t i j i=1 j=1 where the n × n matrix Bt is defined in Proposition 2.1. Based on the definition of Bt(i, j), the leading order term of Bt(i, j) can be written as Bt(i, j) = Bt1(i)Bt1(j){1 + o(1)}, where Bt1(i) = √ n(n − t)[I(i ≤ t)/t − I(i > t)/(n − t)]. Using Xi = Yi + µi, we have ∑ ∑ ∑ 2 ˆ T T T n Lt = Bt1(i)Bt1(j)Yi Yj + 2 Bt1(i)Bt1(j)Yi µj + Bt1(i)Bt1(j)µi µj ij ij ij

38 ≡ Lˆ1t + Lˆ2t + Lˆ3t.

4 Therefore, n Var(Lˆt) = Var(Lˆ1t) + Var(Lˆ2t) + 2Cov(Lˆ1t, Lˆ2t). ∑ ∑ ˆ ˆ T T First, we can write L1t as L1t = i Bt1(i)Bt1(j)Yi Yi + i≠ j Bt1(i)Bt1(j)Yi Yj. The expecta- tion of Lˆ1t is ∑n ∑ E(Lˆ1t) = Bt1(i)Bt1(j)tr{C(0)} + Bt1(i)Bt1(j)tr{C(j − i)}. i=1 i≠ j

The square of Lˆ1t is ∑ ∑ ∑ ˆ2 2 2 T T 2 T T L1t = Bt1(i)Bt1(k)Yi YiYk Yk + 2 Bt1(i)Yi YiBt1(k)Bt1(j)Yk Yj i,k i k≠ j ∑ ∑ T T + Bt1(i)Bt1(j)Yi YjBt1(k)Bt1(l)Yk Yl. i≠ j k≠ l

Then, the variance of Lˆ1t is ∑ ˆ 2 2 { − T − } Var(L1t) = 2 Bt1(i)Bt1(k)tr C(k i)C (k i) i,k ∑ ∑ 2 ∗ { − − } + 4 Bt1(i)Bt (i, i)Bt1(k)Bt1(j)tr C(k i)C(i j) i k≠ j ∑ ∑ + 2 Bt1(i)Bt1(j)Bt1(k)Bt1(l)tr{C(k − j)C(i − l)} i≠ j k≠ l ∑ ∑ = 2 Bt1(i)Bt1(j)Bt1(k)Bt1(l)tr{C(k − j)C(i − l)} i,l k,j ∑ ∑ = 2 Bt1(i)Bt1(j)Bt1(k)Bt1(l)tr{C(k − j)C(i − l)} |i−l|M ∑ ∑ + 2 Bt1(i)Bt1(j)Bt1(k)Bt1(l)tr{C(k − j)C(i − l)} := I1 + I2 + I3. |i−l|>M |k−j|>M ∑ n−h ⊗2 T Let Bh = i=1 Bt1(i)Bt1(i + h) and A = AA . Then, the first term in the above expression could be represented as

∑M n∑−h ∑M n∑−h1 T I1 =4 Bt1(i)Bt1(j)Bt1(i + h)Bt1(j + h1)tr{C(h)C (h1)}

h=1 i=1 h1=1 j=1 ∑M n∑−h ∑M n∑−h1 + 4 Bt1(i)Bt1(k + h1)Bt1(i + h)Bt1(k)tr{C(h)C(h1)}

h=1 i=1 h1=1 k=1

39 ∑M ∑M ⊗2 2 =4tr[{ BhC(h)} ] + 4tr[{ BhC(h)} ]. h=1 h=1 The second term is

∑M n∑−h n∑−1 n∑−h1 T I2 =8 Bt1(i)Bt1(j)Bt1(i + h)Bt1(j + h1)tr{C(h)C (h1)}

h=1 i=1 h1=M+1 j=1 ∑M n∑−h n∑−1 n∑−h1 +8 Bt1(i)Bt1(k + h1)Bt1(i + h)Bt1(k)tr{C(h)C(h1)}

h=1 i=1 h1=M+1 k=1 [ ∑M n∑−1 ] [ ∑M n∑−1 ] T =8tr { BhC(h)}{ BhC (h)} + 8tr { BhC(h)}{ BhC(h)} h=1 h=M+1 h=1 h=M+1 ∑M n∑−1 1 ⊗2 1 ⊗2 ≤16tr 2 [{ BhC(h)} ]tr 2 [{ BhC(h)} ]. h=1 h=M+1 The third term is

n∑−1 n∑−h1 n∑−1 n∑−h1 T I3 =4 Bt1(i)Bt1(j)Bt1(i + h)Bt1(j + h1)tr{C(h)C (h1)}

h=M+1 i=1 h1=M+1 j=1 n∑−1 n∑−h1 n∑−1 n∑−h1 +4 Bt1(i)Bt1(k + h1)Bt1(i + h)Bt1(k)tr{C(h)C(h1)}

h=M+1 i=1 h1=M+1 k=1 n∑−1 n∑−1 ⊗2 2 =4tr[{ BhC(h)} ] + 4tr[{ BhC(h)} ]. h=M+1 h=M+1

Under condition (C1), the I2 and I3 are smaller order terms of I1. T T T ··· T { } { ··· } Let µBt = (Bt1(1)µ1 ,Bt1(2)µ2 , ,Bt1(n)µ1 ) and Diag C(h) = Diag C(1), ,C(n) . Then the variance of Lˆ2t is ∑ ∑ ˆ T − Var(L2t) = 4 Bt1(i)Bt1(j)Bt1(k)Bt1(l)µj C(k i)µl i,j k,l ∑n ∑M n∑−h T = 8 Bt1(i)Bt1(j)Bt1(i + h)Bt1(l)µj C(h)µl j,l h=1 i=1 ∑n n∑−1 n∑−h T + 8 Bt1(i)Bt1(j)Bt1(i + h)Bt1(l)µj C(h)µl j,l h=M+1 i=1 ∑M n∑−1 T { } T { } = 8 BhµBtDiag C(h) µBt + 8 BhµBtDiag C(h) µBt h=1 h=M+1 and Cov(Lˆ1t, Lˆ2t) = 0. Combining all, we can derive the variance of Lˆt, which is given by (2.9). This completes the proof of Proposition 2.1.

40 Proof of Theorem 2.1

We first prove the asymptotic normality of Lˆt under the null hypothesis. From (2.4), we can write

ˆ ˆ(1) ˆ(2) Lt = Lt + Lt ,

ˆ(1) − 2 ¯ − ¯ T ¯ − ¯ ˆ(2) − T −1 ˆ where Lt = t(n t)/n (X≤t X>t) (X≤t X>t) and Lt = ft Fn,M V /n. If we write

n − t t B (i, j) = I(i ≤ t)I(j ≤ t) − 2I(i ≤ t)I(j > t) + I(i > t)I(j > t) t t n − t [ ( )] ∑M I(j ≥ h + 1) I(j ≤ n − h) n − h − (f T F −1 ) I(i − j, h) − − + n,M h+1 n n n2 h=0 ≡ (1) (2) Bt (i, j) + Bt (i, j),

(1) (2) where Bt (i, j) is the sum of the first three terms, and Bt (i, j) is the remainder of Bt(i, j), respectively. Then,

1 ∑n 1 ∑n Lˆ(1) = B(1)(i, j)XT X and Lˆ(2) = B(2)(i, j)XT X . t n2 t i j t n2 t i j i,j i,j

ˆ(2) From the derivations for Proposition 2.1, we have seen that imposing Lt is to eliminate the bias caused by data dependence. Moreover, from the expression of variance (2.9), the variance ˆ(2) ˆ(1) contributed by Lt is always smaller order of the variance contributed by Lt . Therefore, using the Slutsky theorem, we only need to show that

Lˆ(1) − E(Lˆ(1)) t t −→d N(0, 1). σnt

Note that under the null hypothesis and from (2.2), Xi = ΓiZ. Based on that,

q q 1 ∑n ∑ ∑ Lˆ(1) = B(1)(i, j)ZT ΓT Γ Z = G z z , t n2 t i j kl k l i,j k=1 l=1 ∑ −2 n (1) T { }q where Gkl = n ij Bt (i, j)(Γi Γj)kl. Most importantly, from (2.2), all zk k=1 are mutually independent, which allows us to use the Martingale central limit theorem to establish the asymptotic ˆ(1) normality of Lt .

First, we define G˜kl that is equal to Gkl if k ≠ l and Gkk/2 if k = l. Then,

∑q ∑q ∑l ˆ(1) − ˆ(1) ˜ − ˜ Lt E(Lt ) = 2 Vl = 2 ( Gklzkzl Gll). l=2 l=2 k=1

41 Let Fh = {∅, Ω}, if h = 0, and Fh = σ{z1, z2, . . . , zh}, if h ≥ 1. Hence, F1 ⊆ F2 ⊆ ... is an ∑ h F increasing sequence of σ algebras, and Sh = l=2 Vl is h measurable. Then, it is readily to see that Sh is a martingale by showing that for any q > h, E(Sq|Fh) = Sh. Toward this end, as zi are mutually independent and satisfy E(zi) = 0, Var(zi) = 1,

∑h ∑q E(Sq|Fh) = E( Vl|Fh) + E( Vl|Fh) l=2 l=h+1 ∑q ∑l = Sh + E( G˜klzkzl − G˜ll|Fh), l=h+1 k=1 where ∑q ∑l E( G˜klzkzl − G˜ll|Fh) l=h+1 k=1 ∑h = E( G˜k,h+1zkzh+1|Fh) + E(G˜h+1,h+1zh+1zh+1|Fh) k=1 ∑h + E( G˜k,h+2zkzh+2|Fh) + E(G˜h+1,h+2zh+1zh+2|Fh) + E(G˜h+2,h+2zh+2zh+2|Fh) k=1 ∑h ∑q + ··· + E( G˜kqzkzq|F) + E(G˜h+1,qzh+1zq|Fh) + ··· + E(G˜qqzqzq|Fh) − G˜ll k=1 l=h+1 ∑q ∑q = G˜ll − G˜ll = 0. l=h+1 l=h+1 Next, we need to prove the following two results in order to apply the Martingale central limit theorem: ∑ 2 E(V |F − ) l l l 1 −→p 1 2 , and (2.26) σnt,0 4 ∑ 2 E{V I(|V | > ϵσ )|F − } l l l nt,0 l 1 −→p 2 0, (2.27) σnt,0

F { ··· } 2 ˆ(1) where l−1 is the σ algebra generated by z1, , zl−1 , σnt,0 is the variance of Lt under the null hypothesis and ϵ is any small positive number. ∑ ∑ 2 2 → { 2|F } 4 For (2.26), we need to show that l E(Vl )/σnt,0 1/4 and Var l E(Vl l−1) /σnt,0 → 0, respectively. First,

∑l ∑l 2 ˜ ˜ 2 − ˜ ˜ ˜2 Vl = Gk1lGk2lzk1 zk2 zl 2Gll Gklzkzl + Gll. k1k2 k=1

42 Then, it is easy to see that ∑ ∑ ∑l−1 ≡ 2|F { ˜ ˜ ˜2 } φ E(Vl l−1) = Gk1lGk2lzk1 zk2 + (2 + β)Gll . l l k1k2 ∑ ∑ { l−1 ˜2 ˜2 } Using the above expression, we have E(φ) = l k Gkl + (2 + β)Gll . From the definition that ∑ −2 n (1) T ˜ ̸ Gkl = n ij Bt (i, j)(Γi Γj)kl, and Gkl that is equal to Gkl if k = l and Gkk/2 if k = l, we have q 1 ∑ ∑n E(φ) = B (i , j )B (i j )(ΓT Γ ) (ΓT Γ ) {1 + o(1)}. 2n4 t 1 1 t 2 2 i1 j1 kl j2 i2 lk kl i1j1i2j2 ∑ 2 2 2 → By comparing the above expression with that of σnt,0, we have l E(Vl )/σnt,0 1/4. ∑ { 2|F } 4 → 2 − 2 → To show that Var l E(Vl l−1) /σnt,0 0, we only need to show that E(φ ) E (φ) 0. From the expression of φ, we can derive the following result: ∑q ∑q ∑l ∑l 2 − 2 ˜ ˜ ˜ ˜ { } E(φ ) E (φ) = 2 Gk1l1 Gk1l2 Gk2l1 Gk2l2 1 + o(1) l1=2 l2=2 k1=1 k2=1 ∑ ∑ ∑ ∑ 1 { − = 8 Bt(i1, j1)Bt(i1 + h2, j1 h1) 2n c c i1j1 i2j2 h1,h2∈A∪A h3,h4∈A∪A + Bt(i1, j1)Bt(j1 − h1, i1 + h2)}{Bt(i2, j2)Bt(i2 + h2, j2 − h1)

+ Bt(i2, j2)Bt(j2 − h1, i2 + h2)}tr{C(h1)C(h2)C(h3)C(h4},

4 which is the smaller order of σnt,0 by using both (C1) and (C2). To prove (2.27), we only need to show that by Chebyshev Inequality, ∑ q E(V 4) l=2 l → 4 0. σnt,0

From the definition of Vl, we can derive that for some constant C, q ∑ C ∑ ∑ ∑ ∑ E(V 4) = B (i , j )B (i , j )B (i , j )B (i , j ) l n8 t 1 1 t 2 2 t 3 3 t 4 4 l=2 ∑i∑1j1 i2j2 i3j3 i4j4 × T T T T (Γi1 Γj1 )k1l(Γi2 Γj2 )k1l(Γi3 Γj3 )k2l(Γi4 Γj4 )k2l. l k1k2 In the above expression, ∑ ∑ { } T T T T T T ◦ T T (Γi1 Γj1 )k1l(Γi2 Γj2 )k1l(Γi3 Γj3 )k2l(Γi4 Γj4 )k2l = tr (Γj1 Γi1 Γi2 Γj2 ) (Γj3 Γi3 Γi4 Γj4 ) , l k1k2 where the symbol “◦” represents element product of two matrices. Using the inequality that tr(A ◦ A) ≤ tr(AAT ), we have ∑q ∑ ∑ ∑ ∑ 4 C { − E(Vl ) = 8 Bt(i1, j1)Bt(i1 + h2, j1 h1) n c c l=2 i1j1 i2j2 h1,h2∈A∪A h3,h4∈A∪A

43 + Bt(i1, j1)Bt(j1 − h1, i1 + h2)}{Bt(i2, j2)Bt(i2 + h2, j2 − h1)

+ Bt(i2, j2)Bt(j2 − h1, i2 + h2)}tr{C(h1)C(h2)C(h3)C(h4},

4 which again is the smaller order of σnt,0 by (C1) and (C2). Based on (2.26) and (2.27), we can apply the Martingale central limit theorem to establish the asymptotic normality of Lˆt under the null hypothesis.

To prove asymptotic normality of Lˆt under the alternative, we use (2.2) to write 1 ∑n 1 ∑n 1 ∑n Lˆ = B (i, j)µT µ + B (i, j)µT Γ Z + B (i, j)µT Γ Z t n2 t i j n2 t i j n2 t j i ij ij ij 1 ∑n + B (i, j)ZT ΓT Γ Z, n2 t i j ij where only the last term above remains under the null hypothesis and we have already established its asymptotic normality in the first part of the proof. From the derivations of Proposition 2.1, it can be seen that in the variance of Lt given by (2.9), [ ] 1 ∑n ∑n ∑n ∑ {B (i, j) + B (j, i)}{B (k, i + h) + B (i + h, k)}µT C(h)µ (2.28) n4 t t t t j k i=1 j=1 k=1 h∈A∪Ac ∑ ∑ −2 n T −2 n T is contributed by n ij Bt(i, j)µi ΓjZ + n ij Bt(i, j)µj ΓiZ. If (2.28) is the smaller order of ∑ −2 n T T the variance of n ij Bt(i, j)Z Γi ΓjZ, the asymptotic normality can be established by applying Slutsky theorem. On the other hand, we only need to establish the asymptotic normality of ∑ ∑ −2 n T −2 n T ˆ n ij Bt(i, j)µi ΓjZ + n ij Bt(i, j)µj ΓiZ in order to prove the asymptotic normality of Lt ··· T { }q under the alternative hypothesis. By observing that Z = (z1, , zq) and zk k=1 are mutually ∑ ∑ −2 n T −2 n T independent, the asymptotic normality of n ij Bt(i, j)µi ΓjZ + n ij Bt(i, j)µj ΓiZ can be established from the Lyapunov’s condition. We thus establish the asymptotic normality of Lˆt under the alternative hypothesis and complete the proof of Theorem 2.1.

Proof of Theorem 2.2

2 2 First, we show that E(ˆσnt,0) = σnt,0. It is equivalent to showing \ E[tr{C(h1)C(h2)}] = tr{C(h1)C(h2)}, which can be proved using (2.10). For the first term on the right side of equal sign in (2.10), using

(2.2), we have ( ∗ ) 1 ∑ E XT X XT X = µT µµT µ + µT tr{C(h )}µ + µT tr{C(h )}µ + tr{C(h )C(h )}. n∗ t+h2 s s+h1 t 1 2 1 2 1 s,t

44 Similarly, we can show that the expectation of the other three terms on the right side of equal sign

T T T T in (2.10) is −µ µµ µ − µ tr{C(h1)}µ − µ tr{C(h2)}. Combining everything, we have shown that { \ } { } 2 2 E[tr C(h1)C(h2) ] = tr C(h1)C(h2) or E(ˆσnt,0) = σnt,0. In order to prove Theorem 2.2, we need to show that as n → ∞ and for any ϵ > 0, ( ) σˆ2 | nt,0 − | → P 2 1 > ϵ 0. σnt,0 Using Chebyshev inequality, we only need to show that { } σˆ2 nt,0 − 2 → E ( 2 1) 0. σnt,0 2 2 − 2 4 → → ∞ Since E(ˆσnt,0/σnt,0 1) = 0, we only need to show that Var(ˆσnt,0)/σnt,0 0 as n . Further- 4 4 → more, we only need to show that E(ˆσnt,0)/σnt,0 1 or equivalently, [ ] \ \ E tr{C(h1)C(h2)}tr{C(h3)C(h4)} → 1, tr{C(h1)C(h2)}tr{C(h3)C(h4)} which, under (C2), can be shown by following similar derivations for Theorem 2 of Li and Chen

(2012). Theorem 2.2 is then proved by the continuous mapping theorem.

Proof of Theorem 2.3 ∑ ∑ Lˆ n−1 ˆ B n−1 Recall that the L2-norm statistic = t=1 Lt. Using the definition of (i, j) = t=1 Bt(i, j), we see that 1 ∑n ∑n Lˆ = B(i, j)XT X . n2 i j i=1 j=1 The advantage of using the above expression for Lˆ is that all the established results for each marginal Lˆt can be applied to Lˆ by simply replacing Bt(·, ·) by the corresponding B(·, ·). Specially, we can prove Theorem 2.3 by doing so.

Proof of Theorem 2.4

Given a constant C > 0, we define a set

1−γ 1/2 2 K(C) = {t : |t − τ| > Cn log nσmax/δ , 1 ≤ t ≤ n − 1}.

To show Theorem 2.4, we first show that for ϵ > 0, there exists a constant C such that { } 1−γ 1/2 2 P |τˆ − τ| > Cn log nσmax/δ < ϵ. (2.29)

45 { ∈ } { ˆ ˆ } Since( the event τˆ K)(C) implies the event maxt∈K(C) Lt > Lτ , then it is enough to show that ˆ ˆ P maxt∈K(C) Lt > Lτ < ϵ. Toward this end, we first derive the result based on the definition of

Lt: { } n − τ τ L − L = δ2 I(1 ≤ t ≤ τ) + I(τ < t ≤ n − 1) |τ − t|, (2.30) τ t n(n − t) nt 2 which implies that Lt attains its maximum δ at t = τ since 1/(n − t) is an increasing function and 1/t is a decreasing function. As a result, by union sum inequality and letting A(t, τ|1, n − 1) =

(n − τ)/{n(n − t)}I(1 ≤ t ≤ τ) + τ/(nt)I(τ < t ≤ n − 1), ( ) ∑ ( ) P max Lˆt > Lˆτ ≤ P Lˆt − E(Lˆt) + E(Lˆt) − E(Lˆτ ) > Lˆτ − E(Lˆτ ) t∈K(C) t∈K(C) ∑ { } = P (Lˆt − E(Lˆt)) − (Lˆτ − E(Lˆτ )) > E(Lˆτ ) − E(Lˆt) . t∈K(C)

Since the bias of E(Lˆt) to Lt is the smaller order such that E(Lˆτ ) − E(Lˆt) = (Lτ − Lt){1 + o(1)}, then using the result given in (2.30), we have ( ) { } ∑ 2 Lˆt − E(Lˆt) A(t, τ|1, n − 1) δ P max Lˆt > Lˆτ ≤ P | | > |τ − t|{1 + o(1)} t∈K(C) σnt 2 σmax t∈K(C) { } ∑ Lˆ − E(Lˆ ) A(t, τ|1, n − 1) δ2 + P | τ τ | > |τ − t|{1 + o(1)} σnτ 2 σmax t∈K(C) { } ∑ Lˆ − E(L ) √ ≤ P | t t | > Clogn σnt t∈K(C) { } ∑ Lˆ − E(L ) √ + P | τ τ | > Clogn , σnτ t∈K(C) where the result of A(t, τ|1,T ) = O(nγ−2) has been used.

Since {Lˆt − E(Lt)}/σnt ∼ N(0, 1), for a large C, { } ∑ Lˆ − E(L ) √ ∑ P | t t | > Clogn = C(logn)−1/2n−C ≤ C(logn)−1/2n1−C ≤ ϵ. σnt t∈K(C) t∈K(C) Similarly, we can show that { } ∑ Lˆ − E(L ) √ P | τ τ | > Clogn ≤ ϵ. σnτ t∈K(C) { } 1−γ 1/2 2 −1 Hence, (2.29) is true, or equivalently,τ ˆ − τ = Op n log nσmax/δ . Since σmax = vmaxn , it leads to the convergent rate in Theorem 2.4.

46 Proof of Theorem 2.5

Recall that ( ) √ ∑ ∑p |ˆ | √ L ˆ2 √δt,j ≥ 0 = p δt,jI log(log n) log p , with σˆ2 t∈Sˆα j=1 t,j { √ } ˆ Lt 1/π Sˆα = t : √ ≥ 2 log n − log(log n) + log{log(n )} . 2 σˆnt,0

Under H0, we have

P (L = 0) 0 [ ] { ∑ ∑p ( |ˆ | √ ) } ˆ ∅ { ˆ ̸ ∅} ∩ ˆ2 √δt,j ≥ = P (Sα = ) + P Sα = δt,jI log(log n) log p = 0 σˆ2 t∈Sˆα j=1 t,j

= P (Sˆα = ∅) + P (Sˆα ≠ ∅) { ∑ ∑p ( |ˆ | √ ) } × ˆ2 √δt,j ≥ | ˆ ̸ ∅ P δt,jI log(log n) log p = 0 Sα = . σˆ2 t∈Sˆα j=1 t,j By Lemma 2.2, as n → ∞,

P (Sˆα = ∅) { √ } |Lˆ | = P max √ t ≤ 2 log n − log(log n) + log log(n1/π) 1≤t≤n−1 2 σˆnt,0 [ 1 1 ] → exp − √ exp{− log log(n1/π)} π 2 ( 1 ) = exp − √ → 1. log n

Recall

δˆt,j = X¯≤t,j − X¯>t,j 1 ∑t 1 ∑n = X − X . t i1,j n − t i2,j i1=1 i2=t+1

Under H0, follows the results of Lemma 2.3, for any t ∈ {1, . . . , n − 1}, as n → ∞, { } |δˆ | √ P max √t,j < log(log n) log p → 1. 1≤j≤p 2 σˆt,j

Consequently, under H0, as n → ∞, { ∑ ∑p ( |ˆ | √ ) } ˆ2 √δt,j ≥ | ˆ ̸ ∅ P δt,jI log(log n) log p = 0 Sα = σˆ2 t∈Sˆα j=1 t,j

47 ( √ ) |δˆt,j| = P max max √ < log(log n) log p|Sˆα ≠ ∅ → 1. t∈Sˆ 1≤j≤p 2 α σˆt,j

In sum, under H0, as n → ∞,

P (L0 = 0) → 1 + 0 · 1 = 1.

d As Theorem 2.3 already showed that L1 = Lˆ/σˆn,0 −→ N(0, 1), by Slutsky’s Theorem, as n → ∞,

d L1 + L0 −→ N(0, 1).

This proves part (i).

As n → ∞, for any t ∈ {1, . . . , n − 1}, we have

Lˆ − L t t −→d N(0, 1). σnt

Under Hα, for any t ∈ {1, . . . , n − 1}, as n → ∞, { √ } |Lˆ − L | P t t ≥ 2 log n − log(log n) + log{log(n1/π)} √ σnt √ 2 [ 1 ] → exp − log n + log(π 2 ) / 2 log n − log π π √ −1 − 1 = 2n (2 log n − log π) 2 → 0.

√ { } 1/π Let λ1 ≡ 2 log n − log log n + log log(n ), and Sα ≡ t : |Lt|/σnt ≥ 2λ1 . Under Hα, if Sα ≠ ∅, then for any t ∈ Sα,

( |ˆ | ) ( |ˆ | ) √Lt ≥ → Lt ≥ 2 2 −→p → ∞ P λ1 P λ1 , asσ ˆnt,0/σnt,0 1, when n . 2 σnt,0 σˆnt,0 ( ) |Lˆt| ≥ P ≥ λ1 , as σnt > σnt,0. σnt ( ) ( ) |Lt| |Lˆt − Lt| ≥ P − ≥ λ1 ≥ P 2λ1 − λ1 ≥ λ1 = 1. σnt σnt { √ } ˆ |ˆ | 2 ≥ → ∞ → ∞ ̸ ∅ Recall Sα = t : Lt / σˆnt,0 λ1 . Therefore, under Hα, as p , and n , if Sα = , for any t ∈ Sα,

P (t ∈ Sˆα|Sα ≠ ∅) → 1, which is equivalent to

P (Sα ⊂ Sˆα|Sα ≠ ∅) → 1.

48 √ Let λ2 = log(log n) log p. Under Hα, if Sα ≠ ∅, for any t ∈ Sα, by the results of Lemma 2.3, as n → ∞, we have ( ) |δˆt,j − δt,j| P max ≤ λ2 → 1. 1≤j≤p σt,j { } ≡ | | ∗ ≥ ∗2 ≡ 2 ∈ ̸ ∅ Let St,d j : δt,j /σt,j 2λ2 , where σ t,j E(ˆσt,j) under Hα. If there is a t0 Sα = , such ̸ ∅ ∈ that St0,d = , then for any j St0,d, ( ) ( ) |δˆ | |δˆ | √t0,j ≥ → t0,j ≥ 2 ∗2 −→p → ∞ P λ2 P ∗ λ2 , asσ ˆt ,j/σ t ,j 1, when n . 2 σ 0 0 σˆ t0,j t0,j ( ) |δ | |δˆ − δ | ≥ P t0,j − t0,j t0,j ≥ λ , σ∗ ≥ σ . σ∗ σ 2 t0,j t0,j t0,j t0,j ≥ P (2λ2 − λ2 ≥ λ2) = 1. √ Consequently, for t ∈ S ≠ ∅, such that S ≠ ∅, let Sˆ ≡ {j : |δˆ |/ σˆ2 ≥ λ }, under H , 0 α t0,d t0,d t0,j t0,j 2 α → ∞ ∈ as n , for any j St0,d, ∈ ˆ | ̸ ∅ → P (j St0,d St0,d = ) 1, which is equivalent to ⊂ ˆ | ̸ ∅ → P (St0,d St0,d St0,d = ) 1.

̸ ∅ ∈ ̸ ∅ Therefore, under Hα, if Sα = , and if there is a t0 Sα, such that St0,d = , then for a positive constant c, as n → ∞, √ (√ ∑ ∑ √ ) L ˆ2 P ( 0 > c p) = P p δt,j > c p ˆ ˆ t∈Sα j∈St,d (√ ∑ ∑ √ ) ≥ ˆ2 P p δt,j > c p t∈Sα j∈St,d [√ ∑ { √ } √ ] ≥ 2 2 → P p log(log n) log p σt0,j > c p 1. ∈ j St0,d Toward this end, as { } { } |Lt| |δt,j| max ≥ 2λ1 ∩ max max ∗ ≥ λ2 ⇔ {Sα ≠ ∅} ∩ {St ,d ≠ ∅, t0 ∈ Sα}, t ∈ 0 σnt t Sα j σt,j { | | ≥ } ∩ { | | ∗ ≥ } → ∞ hence, under Hα, if maxt Lt /σnt 2λ1 maxt∈Sα maxj δt,j /σt,j 2λ2 , as n √ P (L1 + L0 > zα|Hα) > P (L1 + c p > zα|Hα) → 1.

This completes the proof of Theorem 2.5.

49 2.8.2 Lemmas and their proofs

1/2 − −1 T −1 Lemma 2.1. Under the alternative of (2.1) and M = o(n ), Lt n ft Fn,M VB always attains its maximum at one of the change points 1 ≤ τ1 < ··· < τq < n.

Proof. We first show that Lt always attains its maximum at one of the change points 1 ≤ τ1 < ··· −1 T −1 < τq < n. Then we show that the bias term n ft Fn,M VB is the small order of Lt at those change points.

Toward this end, we first consider the case of one change-point τ1. If t ≤ τ1, we can derive that − 2 − −1 −2 − T − 2 −1 −2 − Lt = (n τ1) t(n t) n (µ1 µ2) (µ1 µ2). On the other hand, if t > τ1, Lt = τ1 t n (n T −1 −1 t)(µ1 − µ2) (µ1 − µ2). Since t(n − t) = −1 + n(n − t) is an increasing function of t and

−1 −1 (n − t)t = nt − 1 is a decreasing function of t, the maximum of Lt is attained at τ1.

Next, we consider more than one change-point case. Suppose that t ≤ τ1, it can be derived that

{ q } { q } t ∑ T ∑ L = (τ − τ )(µ − µ ) (τ − τ )(µ − µ ) , t n2(n − t) i+1 i 1 τi+1 i+1 i 1 τi+1 i=1 i=1 where τq+1 = n and µτq+1 = µn. Clearly, it can be seen that the maximum is attained at τ1. If t > τq, it can be derived that

{ q } { q } n − t ∑ T ∑ L = (τ − τ − )(µ − µ ) (τ − τ − )(µ − µ ) , t n2t i i 1 τi n i i 1 τi n i=1 i=1 where τ0 = 0. Therefore, Lt attains its maximum at t = τ1. At last, if τk < t ≤ τk+1, we have { } { } n − t ∑k T ∑k L = (τ − τ − )(µ − µ ) (τ − τ − )(µ − µ ) t n2t i i 1 τi τk+1 i i 1 τi τk+1 i=1 i=1 { q } { q } t ∑ T ∑ + (τ − τ )(µ − µ ) (τ − τ )(µ − µ ) n2(n − t) i+1 i τk+1 τi+1 i+1 i τk+1 τi+1 i=k+1 i=k+1 { } { q } 2 ∑k T ∑ + (τ − τ − )(µ − µ ) (τ − τ )(µ − µ ) . n2 i i 1 τi τk+1 i+1 i τk+1 τi+1 i=1 i=k+1

Based on the above expression, the first derivative of Lt with respect to t is

BT B AT A L′ = − , t n(n − t)2 nt2 ∑ ∑ k q T − − − − − ≥ where A = i=1(τi τi 1)(µτi µτk+1 ) and B = i=k+1(τi+1 τi)(µτk+1 µτi+1 ). Since B B 0 and AT A ≥ 0 (but they cannot be zero simultaneously under the alternative), and (n − t)−2 is an

−2 increasing function of t, and t is a decreasing function of t, Lt is either monotonically increasing,

50 or monotonically decreasing, or “U”-shaped function of t. For all cases, we have Lt attains its maximum at either τk or τk+1.

−1 T −1 Next, we show that the bias term n ft Fn,M VB is the small order of Lt at those change points.

Again, we start with the case of one change point τ1. From the definitions of ft and Fn,M given by (2.4) and (2.5) in Section 2.1, it can be seen that all the components ft and diagonal elements

−1 of Fn,M have the order of O(1). The off-diagonals of Fn,M have the order of n . Moreover, ∑ ∑ −1 n − T − ··· −1 n−M − T − T VB = (n i=1(µi µ¯) (µi µ¯), , n i=1 (µi µ¯) (µi+M µ¯)) , where we can show that for h = 0, ··· ,M,

− { 1 n∑h 1 (µ − µ¯)T (µ − µ¯) = (τ − h)(µ − µ¯)T (µ − µ¯) + h(µ − µ¯)T (µ − µ¯) n i i+h n 1 1 1 1 2 i=1 } T + (n − τ1 − h)(µ1 − µ¯) (µ1 − µ¯)

n(n − τ )(τ − h) − τ 2h = 1 1 1 (µ − µ )T (µ − µ ). n3 1 2 1 2

−1 T −1 2 −1 1/2 Combining the above results, we have n ft Fn,M VB = O(M n Lt), which is o(Lt) if M = o(n ). Similarly, we can show that the bias term is the smaller order for more than one change points.

This completes the proof of Lemma 2.1.

′ Lemma 2.2. Let (Z1,...,Zn−1) be a zero-mean multivariate normal random vector with covari- ance ΣZ whose diagonal components are 1. If the maximum eigenvalue of ΣZ is bounded, then assume (C1) and (C2), under H0, as n → ∞,

Lˆt d max √ −→ max Zt, 1≤t≤n−1 2 1≤t≤n−1 σˆnt,0 and ( ) |Lˆ | √ P max √ t ≤ 2 log n − log log n + x → exp{−(π)−1exp(−x/2)}. 1≤t≤n−1 2 σˆnt,0

√ ˆ 2 → ∞ Proof. To establish the asymptotic distribution of max1≤t≤n−1 Lt/ σˆnt,0, when n , similar to −1 ˆ the idea of Theorem 3 in Zhong and Li (2016), we need to show that under H0, max1≤t≤n−1 σnt,0Lt converges to max1≤t≤n−1 Zt, where Zt is a Gaussian process, with mean 0 and covariance ΣZ . Toward this end, we need to first show the joint asymptotic normality of (σ−1 Lˆ , . . . , σ−1 Lˆ )′ nt1,0 t1 ntl,0 tl −1 ˆ for t1 < t2, . . . , < tl, and second show the tightness of the random variable max1≤t≤n−1 σnt,0Lt.

51 According to the Cramer-Wold device, we only need to show that for any non-zero constant ∑ ′ tl vector c = (ct , . . . , ct ) , ctLˆt is asymptotically normal under H0, or to say, we need to show 1 √ l t=t1 ∑ ∑ d that tl c Lˆ / Var( tl c Lˆ ) −→ N(0, 1), which can be proved by the martingale central limit t=t1 t t t=t1 t t theorem. Since the proof is very similar to that of Theorem 2.1, we omit it. −1 ˆ To prove the tightness of max1≤t≤n−1 σnt,0Lt, without loss of generality, we assume that µ1 =

µ2 = ··· = µn = 0. Recall that − f T F −1 Vˆ t(n t) T t n,M Lˆ = (X¯≤ − X¯ ) (X¯≤ − X¯ ) − t n2 t >t t >t n ˆ(1) ˆ(2) = Lt + Lt and we can write ˆ(1) ˆ(11) ˆ(12) Lt = Lt + Lt , where n − t ∑t t ∑n Lˆ(11) = X′X + X′ X , t n2t i i n2(n − t) j j i=1 j=t+1 and n − t ∑t t ∑n 2 ∑t ∑n Lˆ(12) = X′ X + X′ X − X′X . t n2t i1 i2 n2(n − t) j1 j2 n2 i j i1≠ i21 j1≠ j2=t+1 i=1 j=t+1 Then we have [ ] (n − t)2 ∑t ∑t Var(Lˆ(11)) ≍ (5 + β) tr{C2(0)} + tr{C(i − i )C(i − i )} t n4t2 2 1 1 2 i=1 i1≠ i2=1 [ ] t2 ∑n ∑t + (5 + β) tr{C2(0)} + tr{C(j − j )C(j − j )} n4(n − t)2 2 1 1 2 j=t+1 j1≠ j2=1 1 ∑t ∑n + (5 + β) tr{C(j − i)C(i − j)} n4 i=1 j=t+1 t2 ∑ (n − t)2 ∑ ≍ · (n − t) · tr{C(h )C(h )} + · t · tr{C(h )C(h )} n4(n − t)2 1 2 n4t2 1 2 h1,h2 h1,h2 1 ∑ ∑ + tr{C(h )C(h )} ≍ n−3 tr{C(h )C(h )}. n4 1 2 1 2 h1,h2 h1,h2 ˆ(12) ˆ(12) ˆ(12) ˆ(12) Let Lt = Lt,1 + Lt,2 + Lt,3 , then

ˆ(12) Var(Lt,1 ) (n − t)2 ∑t [ ] ≍ 2 (4 + β)tr{C2(i − i )} + tr{C2(0)} n4t2 1 2 i1≠ i2

52 (n − t)2 ∑t [ ] + (4 + β)tr{C(i − i )C(i − i )} + tr{C(i − i )C(i − i )} n4t2 1 4 3 2 1 3 4 2 i1≠ i2≠ i3≠ i4 (n − t)2 ∑ (n − t)2 ∑ ≍ · t2 · tr{C(h )C(h )} + · t · tr{C(h )C(h ) n4t2 1 2 n4t2 1 2 h1,h2 h1,h2 (n − t)2 ∑ ≍ tr{C(h )C(h )}. n4 1 2 h1,h2 ˆ(12) ≍ 2 4 { 2 } ˆ(12) For the same calculation, we can have Var(Lt,2 ) (t /n )tr C (0) . Next, we consider Var(Lt,3 ).

ˆ(12) Var(Lt,3 ) 4 ∑t ∑n [ ] = (4 + β)tr{C2(i − j)} + tr{C2(0)} n4 i=1 j=t+1 4 ∑t ∑n [ ] + (4 + β)tr{C(i − j)C(i − l)} + tr{C(l − j)C(0)} n4 i=1 j≠ l=t+1 4 ∑t ∑n [ ] + (4 + β)tr{C(k − j)C(i − j)} + tr{C(0)C(i − k)} n4 i≠ k=1 j=t+1 4 ∑t ∑n [ ] + (4 + β)tr{C(k − j)C(i − l)} + tr{C(l − j)C(i − k)} n4 i≠ k=1 j≠ l=t+1 4 ∑ 4 ∑ ≍ · t(n − t) · tr{C(h )C(h )} + · t · tr{C(h )C(h )} n4 1 2 n4 1 2 h1,h2 h1,h2 4 ∑ 4 ∑ + · (n − t) · tr{C(h )C(h )} + · t(n − t) · tr{C(h )C(h )} n4 1 2 n4 1 2 h1,h2 h1,h2 t(n − t) ∑ ≍ tr{C(h )C(h )}. n4 1 2 h1,h2 ˆ(12) ˆ(12) ˆ(12) It can be shown that the covariance among Lt,1 , Lt,2 and Lt,3 , are all smaller order of ∑ ∑ t(n − t)/n4 tr{C(h )C(h )}. In sum, Var(Lˆ(12)) ≍ n−2 tr{C(h )C(h )}. Therefore, h1,h2 1 2 t h1,h2 1 2 ∑ as Var(Lˆ(11)) = o{Var(Lˆ(12))}, we have Var(Lˆ(1)) ≍ n−2 tr{C(h )C(h )}. As showed in t t t h1,h2 1 2 ˆ(2) ˆ(1) Proposition 2.1, the variance contributed by Lt is the smaller order of Var(Lt ), then we have ∑ σ2 = Var(Lˆ(1)){1 + o(1)} ≍ n−2 tr{C(h )C(h )}. nt,0 t h1,h2 1 2 Consider t = [nν] for ν = j/n ∈ (0, 1) with j = 1, . . . , n − 1. Based on the above results, to −1 ˆ show the tightness of σnt,0Lt is equivalent to show the tightness of G(ν) where [ ] ∑ (−1/2) ˆ G(ν) = n tr{C(h1)C(h2)} L[nν] h ,h [ 1 2 ] ∑ (−1/2) { } ˆ(1) ˆ(2) (1) (2) = n tr C(h1)C(h2) (L[nν] + L[nν]) = G (ν) + G (ν). h1,h2

53 We first show the tightness of G(1)(ν). Note that [ ] ∑ (−1/2) (1) { } ˆ(11) ˆ(12) (11) (12) G (ν) = n tr C(h1)C(h2) (Lt + Lt ) = G (ν) + G (ν). h1,h2

For 0 < ν < η < 1, we have the following for G(11)(ν)

E{|G(11)(ν) − G(11)(η)|2} [nν] n2 n − [nν] ∑ [nν] ∑n ∑ ′ ′ = E 2 XiXi + 2 XjXj tr{C(h1)C(h2)} n [nν] n (n − [nν]) h1,h2 i=1 j=[nν]+1 [nη] n n − [nη] ∑ [nη] ∑ 2 − X′X − X′ X n2[nη] i i n2(n − [nη]) j j i=1 j=[nη]+1 { (n − [nν])2 ([nν])2 (n − [nη])2 ([nη])2 ≤ Cn2 + + + n4[nν] n4(n − [nν]) n4[nη] n4(n − [nη]) } (n − [nν])(n − [nη]) [nν](n − [nη])([nη] − [nν]) [nν][nη] − 2 − 2 − 2 n4[nη] n4[nη](n − [nν]) n4(n − [nν]) ≤ C(η − ν)/n.

Applying the above inequality with ν = k/n and η = m/n for 0 ≤ k ≤ m < n for integers k, m and n and using Chebyshev’s inequality, we have, for any ϵ > 0,

P (|G(11)(k/n) − G(11)(m/n)| ≥ ϵ) ≤ E|G(11)(k/n) − G(11)(m/n)|2/ϵ2

≤ C(m − k)/(ϵn)2 ≤ (C/ϵ2)(m − k)1+α/n2−α,

(11) (11) where 0 < α < 1/2. Now if we define ξi = G (i/n) − G {(i − 1)/n} for i = 1, . . . , n − 1. Then

(11) (11) G (i/n) is equal to the partial sum of ξi, namely Si = ξ1 + ··· + ξi = G (i/n). Here S0 = 0. Then we have

2 1/(1+α) (2−α)/(1+α) 1+α P (|Sm − Sk| ≥ ϵ) ≤ (1/ϵ ){C (m − k)/n } .

Then using Theorem 10.2 in Billingsley (1999), we conclude the following

2 (2−α)/(1+α) 1+α 2 2α−1 P ( max |Si| ≥ ϵ) ≤ (KC/ϵ ){n/n } ≤ (KC/ϵ )n . 1≤i≤n

The right hand side of the above inequality goes to 0 as n → ∞ because α < 1/2. Based on the

(11) (11) relationship between Si and G (i/n), we have shown the tightness of G (ν). Next, we consider the tightness of G(12)(ν). Recall that

(12) −1/2{ 2 } ˆ(12) ˆ(12) ˆ(12) G (ν) = ntr C (0) (Lt,1 + Lt,2 + Lt,3 )

54 (12) (12) (12) = G1 (ν) + G2 (ν) + G3 (ν)

(12) (12) (12) It is enough to show the tightness of G1 (ν), since the tightness of G2 (ν) and G3 (ν) are ∑ similar. Let h (i , i ) = [nν] X′ X . Then we have the following ν 1 2 i1≠ i2=1 i1 i2

(12) − (12) G1 (η) G1 (ν) [ ] { } ∑ (−1/2) n − [nν] n − [nη] = n tr{C(h )C(h )} h (i , i ) − h (j , j ) 1 2 n2[nν] ν 1 2 n2[nη] η 1 2 h1,h2 and

{ (12) − (12) }4 E[ G1 (η) G1 (ν) ] [ ] [ ∑ −1 (n − [nν])4 (n − [nη])4 = n4 tr2{C(h )C(h )} E{h4(i , i )} + E{h4(j , j )} 1 2 n8[nν]4 ν 1 2 n8[nη]4 η 1 2 h1,h2 (n − [nν])2(n − [nη])2 + 6 E{h2(i , i )h2(j , j )} n8[nν]2[nη]2 ν 1 2 η 1 2 (n − [nν])3(n − [nη]) − 4 E{h3(i , i )h (j , j )} n8[nν]3[nη] ν 1 2 η 1 2 ] (n − [nν])(n − [nη])3 − 4 E{h (i , i )h3(j , j )} n8[nν][nη]3 ν 1 2 η 1 2

= I1 + I2 + I3 + I4 + I5.

First note, [ ] ∑ −1 (n − [nν])4 I ≍ n4 tr2{C(h )C(h )} 1 1 2 n8[nν]4 h1,h2 ∑[nν] [ ] × tr{C(i4 − i1)C(i3 − i2)tr{C(i8 − i5)C(i7 − i6)}

i1≠ i2,i3≠ i4,i5≠ i6,i7≠ i8 [ ] ∑ −1 (n − [nν])4 ∑ ≍ Cn4 tr2{C(h )C(h )} [nν]2([nν] − 1)2 tr2{C(h )C(h )} 1 2 n8[nν]4 1 2 h1,h2 h1,h2 ≤ C(η − ν)4 ≤ C(η − ν)2.

2 For the same idea, we can have I2 ≤ C(η − ν) either. Now we check I3, We have the following [ ] ∑ −1 (n − [nν])2(n − [nη])2 I ≍ n4 tr2{C(h )C(h )} 3 1 2 n8[nν]2[nη]2 h1,h2 ∑[nν] ∑[nη] [ ] × tr{C(i4 − i1)C(i3 − i2)tr{C(j4 − j1)C(j3 − j2)} i ≠ i ,i ≠ i j ≠ j ,j ≠ j 1 2[ 3 4 1 2 3 4 ] ∑ −1 4 2 ≍ Cn tr {C(h1)C(h2)}

h1,h2

55 (n − [nν])2(n − [nη])2 ∑ × [nν]([nν] − 1)[nη]([nη] − 1) tr2{C(h )C(h )} n8[nν]2[nη]2 1 2 h1,h2 ≤ C(1 − ν)2(1 − η)2 ≤ C(η − ν)2.

At last, we consider I4. [ ] ∑ −1 (n − [nν])(n − [nη])3 I ≍ n4 tr2{C(h )C(h )} 4 1 2 n8[nν][nη]3 h1,h2 ∑[nν] ∑[nη] [ ] × tr{C(i2 − j5)C(i1 − j6)tr{C(j4 − j1)C(j3 − j2)}

i1≠ i2 j1≠ j2,j3≠ j4,j5≠ j6 [ ] ∑ −1 (n − [nν])(n − [nη])3 ≍ Cn4 tr2{C(h )C(h )} [nη]([nη] − 1) 1 2 n8[nν][nη]3 h1,h2 ∑ 2 × {[nν]([nν] − 1) + [nν]} tr {C(h1)C(h2)}

h1,h2 ≤ C(1 − η)3(1 − ν) ≤ C(η − ν)2.

2 For the same idea, we can also have I5 ≤ C(η − ν) . Let ν = k/n and η = m/n for 0 ≤ k ≤ m < n for integers k, m and n and using above bounds | 12 − 12 | for the fourth moment of G1 (η) G1 (ν) , we have, for any L > 0,

| (12) − (12) | ≥ ≤ | (12) − (12) |4 4 P ( G1 (k/n) G1 (m/n) L) E G1 (k/n) G1 (m/n) /L

≤ (C/L)4{(m − k)/n}2.

Applying Theorem 10.2 in Billingsley (1999) again, we have

| (12) | ≥ ≤ 4 P ( max G1 (i/n) L) (KC/L ). 1≤i≤n

12 If L is large enough, the above probability could be smaller than any ϵ > 0. Therefore, G1 (ν) is (12) (12) tight. Similarly, we can show the tightness of G2 (ν) and G3 (ν). In summary, we have shown the tightness of G(1)(ν). (2) ˆ(2) ˆ(2) 2 For the tightness of G (ν), note that, under H0, as E(Lt ) = 0, and Var(Lt ) = o(σnt,0), E{|G(2)(ν) − G(2)(η)|2} = o(1), which can also be bounded by C(η − ν)/n, for a positive constant

C. Therefore, following the similar procedure to show the tightness of G(11)(ν), we can also have the tightness of G(2)(ν). −1 ˆ Hence, G(ν) is tight. Toward this end, we have shown that σnt,0Lt converges to a Gaussian process with mean 0 and covariance ΣZ . Finally, applying Lemma 4 in Zhong and Li (2016), we

56 −1 ˆ can obtain the Gumbel limiting distribution as the asymptotic distribution of max1≤t≤n−1 σnt,0Lt. This completes the proof of Lemma 2.2.

ˆ 2 1 Lemma 2.3. Let δt,j, δt,j and σt,j be defined in (2.16), (2.15), and (2.18) respectively. If x = o(n 3 ), and assume (C3)-(C4), then for any t ∈ {1, . . . , n − 1} and any j ∈ {1, . . . , p}, as n → ∞,

|δˆ | √ |δ | √ P ( t,j > x) = {1 + o(1)}I( t,j > x) σt,j σt,j √ δ √ δ + [{Φ(¯ x − t,j ) + Φ(¯ x + t,j )} σt,j σt,j |δ | √ × I( t,j < x)]{1 + o(1)}. σt,j

Proof. By condition (C4), for any i ∈ {1, . . . , n − 1} and j ∈ {1, . . . , p}, Xi,j is a sub-Gaussian random variable. By the property of sub-Gaussian random variable, there exist positive constants g and T such that

2 E{exp(τXi,j)} ≤ exp(gτ ), for |τ| ≤ T.

∈ { − } ∈ { 1 − 1 } | | ≤ Therefore, for any t 1, . . . , n 1 , let ht t , n−t , and ht 1, then we have

{ } ≤ 2 2 | | ≤ E exp(τhtXi,j) exp(ght τ ), for htτ T.

Therefore, htXi,j is still a sub-Gaussian random variable. ∈ { 1 − 1 } 1 1 For ht1 , ht2 t , n−t , and for any a, b > 1 such that a + b = 1, using H¨older’sinequality,

{ } ≤ { } 1/a { } 1/b E[exp τ(ht1 X1j + ht2 X2j) ] [E exp(τht1 a) ] [E exp(τht2 b) ] a ≤ exp{gτ 2(ah2 + h2 )}. t1 a − 1 t2

Minimizing over a > 1, we have

{ } ≤ { 2 2} | | ≤ E[exp τ(ht1 X1j + ht2 X2j) ] exp gτ (ht1 + ht2 ) , for τ(ht1 + ht2 ) 2T.

ˆ Therefore, ht1 X1j + ht2 X2j is also a sub-Gaussian random variable. As δt,j is a linear combination of htXi,j, then for any t ∈ {1, . . . , n − 1} and j ∈ {1, . . . , p}, δˆt,j is a sub-Gaussian random variable.

Toward this end, similar to Lemma 1 in Chen et al. (2019), let Yi,j = Xi,j/σt,j, for i ∈ {1, . . . , n − 1} and j ∈ {1, . . . , p}, and follows Theorem 5.23 of Petrov (1995).

57 √ If |δt,j|/σt,j < x, √ √ |δˆt,j| P ( > x) = P (|Y¯≤t,j − Y¯>t,j| > x) σt,j √ δt,j |δt,j| = P {(Y¯≤t,j − Y¯>t,j) − > x − } σt,j σt,j √ δt,j |δt,j| + P {(Y¯≤t,j − Y¯>t,j) − < − x − } σt,j σt,j √ δ √ δ x3/2 = {Φ(¯ x − t,j ) + Φ(¯ x + t,j )}{1 + O( )}. 1/2 σt,j σt,j n √ On the other hand, if |δt,j|/σt,j > x, as n → ∞, √ √ |δˆt,j| δt,j |δt,j| P ( > x) = 1 − P {(Y¯≤t,j − Y¯>t,j) − < x − } σt,j σt,j σt,j √ δt,j |δt,j| + P {(Y¯≤t,j − Y¯>t,j) − < − x − } σt,j σt,j = 1 + o(1).

In summary,

|δˆ | √ |δ | √ P ( t,j > x) = {1 + o(1)}I( t,j > x) σt,j σt,j √ δ √ δ + [{Φ(¯ x − t,j ) + Φ(¯ x + t,j )} σt,j σt,j |δ | √ × I( t,j < x)]{1 + o(1)}. σt,j

This completes the proof of Lemma 2.3.

58 CHAPTER 3

ONLINE CHANGE-POINT DETECTION IN HIGH-DIMENSIONAL

COVARIANCE STRUCTURE

Online change-point detection or sequential change-point detection, originally arises from the prob- lem of . The product quality is monitored based on the observations continually arriving during an industrial process. A stopping rule is chosen to terminate and reset the process as early as possible when an anomaly occurs. In modern applications, there has been a resurgence of interest in detecting abrupt changes from streaming data with a large number of measurements.

Examples include real-time monitoring for sensor networks and threat detection from surveillance videos. More can be found in studying dynamic connectivity of resting state functional magnetic resonance imaging, and in detecting threat of fake news from the group of fake accounts in social networks (Bara, Fung and Dinh 2015).

In this chapter, we consider online change-point detection in the covariance structure of high- dimensional data. More precisely, letting {X1,X2, ···} be a sequence of continually arriving p- dimensional random vectors, each of which has its own covariance matrix Σi, we consider the hypotheses

H0 :Σ1 = Σ2 = ··· , against

H1 :Σ1 = ··· = Στ ≠ Στ+1 = ··· , (3.1) where τ is some unknown change point. We propose a stopping rule for (3.1), which terminates the process as early as possible after Στ changes to Στ+1. Under the null hypothesis, we derive an explicit expression for the average run length (ARL), so that the level of threshold in the stopping rule can be easily obtained with no need to run time-consuming Monte Carlo simulations. Under the alternative hypothesis, we establish an upper bound for the expected detection delay (EDD), which demonstrates the impact of data dependence and magnitude of change in the covariance structure.

59 The proposed stopping rule is readily applied to detecting network change in high-dimensional online data as the network can be modeled by the covariance matrix. In addition to its practical usefulness, the developed method has several theoretical contributions. First, the stopping rule in- corporates spatial and temporal dependence of data. Rather than assume temporal independence, we estimate the temporal dependence consistently through a data-driven procedure and establish the distribution of the stopping time with the correctly specified dependence. Consequently, the

ARL of the proposed stopping rule can be well controlled even in the presence of temporal depen- dence. Second, the stopping rule can be applied to a wide range of data in that it does not assume

Gaussian distribution, but only requires the existence of the fourth moment of data. Third, the stopping rule is implementable when the dimension p diverges and thus suitable for monitoring modern networks whose size varies enormously from thousands to millions. Finally, we identify the key factors and establish their impact on the EDD through an explicitly derived upper bound.

In particular, we reveal that the EDD based on the L2-norm statistic increases as the strength of temporal dependence increases, but decreases as the magnitude of change ||Στ+1 − Στ ||F /||Στ ||F increases. Here || · ||F represents the matrix Frobenius norm. The rest of this chapter is organized as follows. In Section 3.1, 3.2 and 3.3, we introduce the proposed stopping rule. Section 3.4 presents its asymptotic properties. Simulation studies and real data analysis are given in Sections 3.5 and 3.6, respectively. Technical proofs of main theorems are relegated to Section 3.7.

3.1 Modeling Spatial and Temporal Dependence

Let {Xi, 1 ≤ i ≤ n} be a sequence of p-dimensional random vectors with E(Xi) = µ. We model the sequence by

Xi = µ + ΓiZ for i = 1, ··· , n, (3.2)

× ≥ · ··· T { }m where Γi is a p m matrix with m n p, and Z = (z1, , zm) such that zi i=1 are mutually 4 independent and satisfy E(zi) = 0, Var(zi) = 1 and E(zi ) = 3 + β for some finite constant β. There are two advantages to impose the above model. First, it incorporates both spatial { ≤ ≤ } T ··· T T and temporal dependence of the sequence Xi, 1 i n . Let X = (X1 , ,Xn ) and T ··· T T T × Γ = (Γ1 , , Γn ) . From (3.2), the variance-covariance matrix of X is ΓΓ , in which each p p T ≡ × block diagonal sub-matrix ΓiΓi Σi represents the spatial dependence of each Xi and each p p

60 T ≡ − block off-diagonal sub-matrix ΓiΓj C(j i) describes the spatial and temporal dependence be- T tween Xi and Xj at i ≠ j. Here we require m ≥ n × p to ensure the positive definite of ΓΓ and thus the existence of C(j − i). Second, the model does not assume any distribution of data, but only requires the existence of the fourth moment. In particular, Xi is normally distributed if β = 0. Based on (3.2), we accommodate the spatial and temporal dependence by the following two conditions.

(C1). The sequence is M-dependent, such that for some integer M ≥ 0, C(j − i) ≠ 0 if and only if |j − i| ≤ M. Moreover, under H0 of (3.1), C(j − i) = C(h) for all i and j satisfying j − i = h with h ∈ {0, ±1, ··· , ±M}.

Under the null hypothesis, we assume that the sequence is M-dependent, and the spatial and temporal dependence is stationary. Under the alternative hypothesis, the covariance structure changes and consequently, the stationary of the spatial and temporal dependence cannot hold. We thus only assume the M-dependence. We introduce the M-dependence to relax the commonly as- sumed temporal independence in the literature. As shown in Section 3.7.1, the assumption enables us to establish the asymptotic normality of the test statistic (3.3) through the martingale central limit theorem. Moreover, the M-dependence combined with the stationary in the spatial and tem- poral dependence yields that the stopping time (3.6) converges to the Gumbel limiting distribution of a stationary Gaussian process under the null hypothesis. Under the alternative hypothesis, we impose the M-dependence to generalize the Wald’s lemma from a sum of a random number of in- dependent random variables to that of M-dependent random variables. The generalization enables us to study the EDD of the stopping time even in the presence of temporal dependence (see the proof of Theorem 3.2 in Section 3.7).

(C2). For any h1, h2, h3, h4 ∈ {0, ±1, ··· , ±M}, as p → ∞, [ ] { } { ′ ′ } { ′ ′ } tr C(h1)C(h2)C(h3)C(h4) = o tr C(h1)C(h2) tr C(h3)C(h4) ,

{ ′ ′ ′ ′ } { } where h1, h2, h3, h4 is a permutation of h1, h2, h3, h4 .

If there is no temporal dependence, (C2) becomes tr{C4(0)} = o[tr2{C2(0)}]. It holds if all the eigenvalues of C(0) are bounded but violates under strong dependence such as the compound

61 symmetry covariance structure. If the temporal dependence is present (h ≠ 0), (C2) takes into account for both spatial and temporal dependence. It can be shown that (C2) holds if the re- quirement of bounded eigenvalues is extended to the np × np covariance matrix of entire sequence T T ··· T T × X = (X1 ,X2 , ,Xn ) , each p p block diagonal matrix of which measures the spatial depen- dence of each p-dimensional random vector in the sequence, and each p×p block off-diagonal matrix of which describes the spatio-temporal dependence of two random vectors collected at different time points. The condition cannot hold if the spatial and temporal dependence is too strong so that the covariance matrix of X has unbounded eigenvalues. The advantage of (C2) is that it does not impose any decay structures on C(h) as long as the trace condition is satisfied. Moreover, it allows the dimension p to diverge without imposing its growth rate.

3.2 Test Statistic

Suppose that n observations have been collected. We need a test statistic, the expectation of which can measure the heterogeneity of covariance structure from the collected observations. Assuming for the moment that µ = 0 in (3.2), we propose the following test statistic

1 ∑n Jˆ ≡ W (i, j)(XT X )2, (3.3) n,M n2 M i j i,j=1 ∑ ≡ n−M−2 | − | ≥ where the weight function WM (i, j) t=M+2 At,M (i, j)I( i j M + 1) and

n − t − M t − M A (i, j) = I(i ≤ t)I(j ≤ t) + I(t + 1 ≤ i)I(t + 1 ≤ j) t,M t − M − 1 n − t − M − 1 (t − M)(n − t − M) − {I(i ≤ t)I(t + 1 ≤ j) + I(t + 1 ≤ i)I(j ≤ t)}. − − 1 t(n t) 2 M(M + 1) If µ ≠ 0, a centralized version of (3.3) is

1 ∑n Jˆ∗ = W (i, j){(X − µˆ)T (X − µˆ)}2, (3.4) n,M n2 M i j i,j=1 whereµ ˆ is a consistent estimator of µ. As introduced in Section 3.2.3, the proposed stopping rule needs a training sample andµ ˆ thus can be chosen as the sample mean of the training sample.

Remark 3.1 We first assume a known M to present the main results of the proposed methods.

We then provide a data-driven procedure for estimating M and establish the theoretical results based on the estimated M in Section 3.3.4.

62 Remark 3.2 The test statistics are constructed in several steps. At each t from {M +2, ··· , n−

M − 2}, we first partition the entire sequence {Xi, 1 ≤ i ≤ n} into two segments {Xi, 1 ≤ i ≤ t} and {Xi, t + 1 ≤ i ≤ n}. After utilizing the indicator function I(|i − j| ≥ M + 1) to exclude the interference of C(j − i) with 0 < |i − j| ≤ M, we estimate the two covariance structures separately from the two segments. We then compare the two covariance structures through At,M (i, j), so that the expectation of Jˆn,M is zero under the null hypothesis, but it is non-zero with the maximum attained at the change point under the alternative hypothesis. Finally, we choose WM (i, j) to accumulate all the structural comparisons, each of which is obtained through At,M (i, j). Since the main task is to detect the change in the covariance structure, we assume without further notice that µ = 0 in (3.2), and focus on Jˆn,M to facilitate the theoretical investigation. All Jˆ∗ ̸ the established results can be readily extended to n,M with µ = 0.

Proposition 3.1. Assume (3.2) and (C1). Under the null hypothesis, E(Jˆn,M ) = 0. Under the alternative hypothesis, ∑n ˆ 1 µJˆ ≡ E(Jn,M ) = WM (i, j)tr(ΣiΣj). n,M n2 i,j=1

Since the expectation of Jˆn,M under the alternative hypothesis differs from its expectation under the null hypothesis, it can be used to test heterogeneity of the covariance structure after we standardize it. This requires us to further derive the variance of the test statistic.

Proposition 3.2. Under (3.2) and (C1)-(C2), ∑n ∑n 2 ≡ Jˆ 4 2{ − − }{ } σJˆ Var( n,M ) = WM (i, j)WM (k, l)tr C(i k)C(l j) 1 + o(1) . n,M n4 i,j=1 k,l=1 Under the null hypothesis, (C1) assumes that the spatial and temporal dependence is stationary.

The variance can thus be simplified as ∑n ∑ 2 4 − 2{ } σJˆ = WM (i, j)WM (i h1, j + h2)tr C(h1)C(h2) , (3.5) n,M ,0 n4 i,j=1 h1,h2 where h1, h2 ∈ {0, ±1, ··· , ±M}.

3.3 Stopping Rule

∗ ∗ − F ∗ { ∗ } ≥ F ∗ {∅ } Let n = n n0. Let n = σ X1,...,Xn0 ,Xn0+1,...,Xn0+n if n 1 and n = , Ω if ∗ ∞ {F ∗ } F n = 0. Then n 0 is an increasing sequence of σ-fields on a probability space (Ω, ∞,P ).

63 ∞ { ∗ } ∗ Moreover, Xn0+n n =1 is a stationary and M-dependent sequence of random vectors adapted to ∞ {Fn∗ } . Define the proposed stopping rule 1 { } Jˆ − n,M,H TH (a, M) = inf n n0 : > a, n > n0 , (3.6) σˆn0,M,H then TH (a, M) is a stopping time relative to {Fn∗ }. The proposed stopping rule terminates the detection process in a minimal number of new observations, when the absolute value of the stan- dardized test statistic is above a threshold. Some key observations about the stopping rule are as follows.

First, 1 ∑n Jˆ = W (i, j)(XT X )2, n,M,H H2 M i j i,j=n−H+1 which is the test statistic Jˆn,M (3.3) based on past H observations from the current time n. Here H is the window-size and chosen to reduce the computational time. It is not rare to utilize a moving window H for the online change point detection; see, for example, Lai (1995), Cao et al. (2019) and Chen (2019). We display its effect on our stopping rule explicitly through asymptotic results in next section and simulation studies in Section 3.5. Some other guidelines in selecting the window size can be seen in Lai (1995).

Second, similar to Pollak and Siegmund (1991), we consider a training sample of size n0 to provide an estimation of M for dependence and the standard deviation of Jˆn,M,H under the null hypothesis. Estimating M will be covered in Section 3.3.4. To estimate the standard deviation of

Jˆn,M,H under the null hypothesis, we only need to estimate tr{C(h1)C(h2)} because it is the only unknown indicated in (3.5). Based on the training sample, it is estimated by ∗ 1 ∑ tr{C(\h )C(h )} = XT X XT X , (3.7) 1 2 n∗ t+h2 s s+h1 t s,t ∑∗ where represents the sum of indices that are at least M apart in the training sample, and n∗ be the corresponding number of indices. As a result, the estimator of the variance of Jˆn,M,H is ∑H ∑ 4 \ σˆ2 = W (i, j)W (i − h , j + h )tr2{C(h )C(h )}. (3.8) n0,M,H H4 M M 1 2 1 2 i,j=1 h1,h2 The consistency of the estimator will be established in Theorem 3.3 in Section 3.4.3.

At last, a is the threshold and should be chosen to control the ARL at any pre-specified value.

Theorem 3.1 in next the section will provide a result for selecting a for this purpose.

64 3.4 Asymptotic Results

3.4.1 Average run length

Let E∞ and P∞ denote the expectation and probability, respectively, under the null hypothesis.

Let √ √ g(t/H, a) = 2 log(t/H) + 1/2 log log(t/H) + log(4/ π) − a 2 log(t/H).

The ARL is defined to be the expected value of the stopping time under the null hypothesis.

The following theorem establishes the ARL or E∞{TH (a, M)} for the proposed stopping rule (3.6).

Theorem 3.1. Assume (3.2) and (C1)-(C2). As p → ∞, and both H and a → ∞ satisfying

H = o{exp(a2/2)},

( ∫ ∞ [ { }] ) E∞{TH (a, M)} = H + exp −2exp g(t/H, a) dt {1 + o(1)}. H

As shown in the proof of Theorem 3.1, the ARL is readily obtained by establishing the cumula- tive distribution function of TH (a, M) as a → ∞. Since the randomness of TH (a, M) is determined Jˆ by n,M,H /σˆn0,M,H , the cumulative distribution of the former can be derived by establishing the asymptotic distribution of the latter when p and H → ∞. Here the condition H = o{exp(a2/2)} specifies the growth rate of H with respect to a. It is imposed to ensure that the probability the procedure stops within the window H goes to zero exponentially fast.

Theorem 3.1 states that the ARL depends on the threshold a and the window-size H. In particular, it increases as a increases when H is fixed. This can also be seen from the proposed stopping rule (3.6), where raising a makes the standardized test statistic less likely to go beyond the a when there is no change point. The practical usefulness of Theorem 3.1 is that with any pre-specified ARL and H, we can quickly determine the value of a by solving the equation rather than running time-consuming Monte Carlo simulations.

3.4.2 Expected detection delay

When there is a change point τ, the proposed stopping rule is conventionally examined by the expected detection delay (EDD), Eτ {TH (a, M) − (τ − n0)|TH (a, M) > τ − n0} with τ ≥ n0. In the literature, it is customary to consider the EDD for the so-called immediate change point; see, for example, Siegmund and Venkatraman (1995) and Xie and Siegmund (2013). In terms

65 of our configuration, it refers to the change occurs immediately after the training sample n0 and the corresponding EDD is E0{TH (a, M)}. The main reason to consider the EDD of the immediate change point is that for many stopping rules, the supremum of all the EDDs attains at the immediate change point. It is therefore important to see if such property is still held by our proposed stopping rule. We establish the following theorem which confirms this conclusion. More importantly, the theorem provides an upper bound for the EDDs.

Theorem 3.2. Assume (3.2) and (C1)-(C2). Consider the change point τ ≥ n0. As p → ∞, and both H and a → ∞ satisfying a = O(Hr) with 1/2 ≤ r < 1,

sup Eτ {TH (a, M) − (τ − n0)|TH (a, M) > τ − n0} = E0{TH (a, M)}, and √ a · H · σH,M,0 E0{TH (a, M)} ≤ (M + 2) + {1 + o(1)}, ||Στ+1 − Στ ||F where σH,M,0 is obtained by replacing n with H in (3.5), and ||·||F represents the matrix Frobenius Norm.

Theorem 3.2 demonstrates the impact of some key factors on the EDD. First, a larger M could lead to a greater EDD, showing the adverse effect of the dependence on change-point detection.

Second, the impact of the threshold a on the EDD essentially depends on the choice of ARL, because a is obtained by solving the equation in Theorem 3.1 in which the window size H and ARL are pre-specified by the user. Generally speaking, a larger user-chosen ARL leads to a higher value 1/2 || − ||−1 of a and thus a greater EDD. Finally, the impact of σH,M,0 Στ+1 Στ F can be demonstrated 1/2 || || by applying the result σH,M,0 = O( Στ F ) from the proof of Theorem 3.1 to obtain √ ( ) σH,M,0 ||Σ || = O τ F . ||Στ+1 − Στ ||F ||Στ+1 − Στ ||F

The result shows that the EDD can be significantly reduced by increasing the ratio of the change in covariance structure to the original covariance.

Remark 3.2 It requires a minimum change in the covariance structure, for the proposed stop- ping rule to detect the change point. To understand this, we consider the configuration with the immediate change after the training sample. As the window continuously moves to the right, the number of observations with Στ decreases but the number of observations with Στ+1 increases. If the detection procedure has not yet stopped when the last observation with Στ begins to leave the

66 window, it probably won’t be able to stop because the process ends up with all the H observations having the same Στ+1. Theorem 3.2 actually provides a minimum change the proposed stopping rule requires. By noticing that the right-hand side of the inequality in Theorem 3.2 must be no more than H, the change in covariance structure √ ||Στ+1 − Στ ||F ≥ a/H||Στ ||F , √ where a/H||Στ ||F is, therefore, the minimum change in the covariance structure the proposed stopping rule is able to detect. To provide an insight of the result, we consider Στ = Ip where

|i−j| p = 1000, and Στ+1 = (ρ ) where 0 < ρ < 1. Further, we choose H = 100 and obtain a = 3.58 by solving the equation in Theorem 1 so that the ARL is controlled around 5000. Solving √ ||Στ+1 −Στ ||F = a/H||Στ ||F , we obtain the minimum ρ for the stopping rule to detect the change is 0.133.

Remark 3.3 The proposed stopping rule is based on the L2-norm statistic. When Στ+1 differs from Στ in a large number of components, the stopping rule is advantageous as the detection delay can be significantly reduced by accumulating all the differences through ||Στ+1 − Στ ||F . When

Στ+1 differs from Στ only in a sparse number of components, the components without the change do not contribute to ||Στ+1 −Στ ||F but to ||Στ ||F , which may lead to a large ||Στ ||F /||Στ+1 −Στ ||F and thus a long detection delay. To reduce the detection delay under the sparse situation, we can rewrite the test statistic in the stopping rule into

2 ∑ ∑n Jˆ = W (i, j)Y Y , n,M,H H2 M i,kl j,kl 1≤k≤l≤p i,j=n−H+1 where Yi,kl = Xi,kXi,l and Xi,k is the kth component of the p-dimensional random vector Xi, and remove the elements Yi,klYj,kl with no change. Since the screening must be conducted through a data-driven approach, its effect on the ARL and EDD deserves some future research efforts.

3.4.3 Change-point testing in the training sample

To implement the stopping rule, we need a training sample which has not any change points in covariance structure. A training sample can be historical observations from previous experimental runs subject to similar experimental conditions after their stationarity of the covariance structure has been confirmed. To know whether a sample {Xi, 1 ≤ i ≤ n0} is qualified as a training sample,

67 we need to consider the hypotheses

∗ ··· H0 :Σ1 = = Σn0 , against ∗ ··· ̸ ··· ̸ ··· H1 :Σ1 = = Στ1 = Στ1+1 = = Στq = Στq+1 = = Σn0 , (3.9) where 1 ≤ τ1 < ··· < τq < n0 are unknown change points. This is an offline testing problem as the Jˆ sample has been collected. We consider the test statistic n0,M obtained by replacing n with n0 in (3.3) in that its expectation can distinguish the alternative from the null hypothesis. The following Jˆ theorem establishes the asymptotic normality of n0,M .

Theorem 3.3. Assume (3.2) and (C1)-(C2). As n0 → ∞,

Jˆ − n0,M µJˆ n0,M −→d N(0, 1), σJˆ n0,M where µJˆ and σJˆ are given by Propositions 3.1 and 3.2, respectively, with n replaced by n0,M n0,M ∗ n0. In particular, under H0 of (3.9), Jˆ n0,M −→d N(0, 1), σˆJˆ n0,M ,0 whereσ ˆJˆ is defined in (3.8) with n replaced by n0. n0,M ,0 ∗ Jˆ From Theorem 3.3, we reject H0 of (3.9) with a nominal significance level α if n0,M /σˆJˆ > n0,M ,0 zα, where zα is the upper α-quantile of the standard normal distribution. Otherwise, we fail to ∗ reject H0 and hereby obtain a training sample for the proposed stopping rule.

3.4.4 Stopping rule with estimated M ··· The unknown M in stopping rule (3.6) can be estimated through the training sample X1, ,Xn0 .

From (C1), we know that Cov(Xi,Xj) = C(i − j) is zero if and only if |i − j| > M, or equivalently, tr{C(h)CT (h)} is zero if and only if |h| > M. We thus estimate M through the following steps.

• Using (3.7), we compute tr{C(\h)C(−h)}/tr{C\(0)C(0)} with h starting from 0.

• We terminate the process when the first non-negative integer h∗ satisfies

tr{C(h\∗)C(−h∗)} ≤ ϵ, tr{C\(0)C(0)}

where ϵ is a small constant and can be chosen to be 0.05 in practice.

68 • We then estimate M by Mˆ = h∗ − 1.

Let TH (a, Mˆ ) be the stopping rule obtained by replacing M with Mˆ in (3.6). The follow- ing theorem shows that TH (a, Mˆ ) performs as well as TH (a, M) under both null and alternative hypotheses.

Theorem 3.4. Assume the same conditions in Theorems 3.1 and 3.2. As the training sample size n0 → ∞,

E∞{TH (a, Mˆ )} = E∞{TH (a, M)}, E0{TH (a, Mˆ )} = E0{TH (a, M)}.

3.5 Simulation Studies

3.5.1 Accuracy of the theoretical ARL

We first evaluate the performance of the stopping rule under the null hypothesis. The random vectors Xi for i = 1, 2, ··· are generated from

∑M Xi = Γl ϵi−l, (3.10) l=0

|i−j| −1 where the p × p matrix Γl = {0.6 (M − l + 1) } for i, j = 1, ··· , p, and l = 0, ··· ,M. Each

ϵi is a p-variate random vector with mean 0 and identity covariance Ip, and all ϵis are mutually independent. If M = 0, all Xis are mutually independent from (3.10) and each individual Xi has ∑ T ̸ M−(j−i) T − ··· the covariance matrix Γ0Γ0 . If M = 0, Cov(Xi,Xj) = l=0 Γj−iΓl for j i = 0, ,M. We choose the dimension p = 200, 400 and 1000, the size of historical data n0 = 200, the window-size H = 100 and 150, and dependence M = 0, 1, 2, respectively.

To examine the accuracy of the theoretical ARL, we first specify its value and obtain the corresponding a by solving the equation in Theorem 3.1. Based on the a, we obtain the Monte

Carlo ARL by taking the average of the stopping times from 1000 simulations. Table 8 compares the theoretical ARLs with the corresponding Monte Carlo ARLs under different combinations of

H, p and M. All the Monte Carlo ARLs are reasonably close to the theoretical ARLs, subject to some random variations from simulations under different M and p.

69 Table 8: The comparison between theoretical ARLs and Monte Carlo ARLs based on 1000 simu- lations. For each ARL and window-size H, the threshold a is obtained by solving the equation in

Theorem 3.1.

H = 100

Theoretical p = 200 p = 400 p = 1000

(a, ARL) M = 0 1 2 M = 0 1 2 M = 0 1 2

(3.04, 1002) 1178 1151 1194 1245 1284 1317 1302 1295 1335

(3.42, 3008) 3067 3148 2986 3690 3614 3529 3850 3954 3617

(3.58, 5038) 5118 4527 4253 5799 5923 5212 6570 6102 5878

H = 150

p = 200 p = 400 p = 1000

M = 0 1 2 M = 0 1 2 M = 0 1 2

(2.88, 1005) 1044 1127 1149 1069 1173 1308 1145 1198 1270

(3.29, 3033) 3240 3156 3202 3505 3795 3628 3652 3759 3931

(3.46, 5118) 5120 5097 5280 6083 5820 6156 6162 6586 6794

3.5.2 Accuracy of upper bound for EDD

We next evaluate the performance of the stopping rule under the alternative hypothesis. In partic- ular, we examine the accuracy of the upper bound for the EDD in Theorem 3.2. In the simulation studies, we consider an immediate change, namely the change at τ = 201 immediately after the historical data of size n0 = 200. Before the change point τ, the observations Xi for i = 1, ··· , 200

−1 are generated from (3.10) where Γl = I(M − l + 1) with I being the identity matrix. After the

−1 change, Γl = Q(M − l + 1) in (3.10) where the p × p matrix Q is modeled by one of the following patterns.

T |i−j| (a). Q satisfies QQ = Σ, where Σij = ρ for 1 ≤ i, j ≤ p.

(b). Each row of Q has only three non-zero elements that are randomly chosen from {1, ··· , p}

with magnitude ρ multiplied by a random sign.

70 Table 9: The comparison between theoretical upper bounds for EDDs and Monte Carlo EDDs based on 1000 simulations with the ARL controlled around 5000.

ρ = 0.6 0.7 0.8

M = 0 1 2 M = 0 1 2 M = 0 1 2

Model (a)

H = 100 Monte Carlo 16.18 20.14 24.04 11.31 14.34 16.98 8.11 10.31 12.44

Theoretical 20.59 23.63 25.99 16.23 18.79 20.83 12.46 14.61 16.38

H = 150 Monte Carlo 17.49 21.62 25.45 12.34 15.38 18.56 8.90 11.32 13.37

Theoretical 24.36 28.10 31.04 19.11 22.21 24.70 14.59 17.13 19.22

Model (b)

H = 100 Monte Carlo 4.36 5.84 7.16 3.58 4.71 5.87 3.10 4.13 5.06

Theoretical 7.42 9.09 10.58 6.07 7.40 8.76 5.11 6.42 7.60

H = 150 Monte Carlo 4.79 6.38 7.68 3.85 5.13 6.58 3.27 4.50 5.32

Theoretical 8.53 10.38 11.92 6.80 8.45 9.88 5.74 7.19 8.45

Model (c)

H = 100 Monte Carlo 2.84 3.90 4.94 2.68 3.68 4.78 2.63 3.69 4.72

Theoretical 3.04 4.15 6.23 2.89 3.99 5.05 2.78 3.87 4.92

H = 150 Monte Carlo 2.96 3.94 5.09 2.89 3.93 4.91 2.72 3.76 4.84

Theoretical 3.25 4.40 5.51 3.07 4.20 5.30 2.94 4.05 5.13

T (c). Q satisfies QQ = Σ, where Σii = 1 for i = 1, ··· , p, and Σij = ρ for i ≠ j.

Models (a)–(c) specify the bandable, sparse and strong covariance matrices, respectively. We choose ρ = 0.6, 0.7, 0.8 to obtain different magnitudes in the covariance change, and choose the dimension p = 1000, the window-size H = 100 and 150, and dependence M = 0, 1, 2, respectively.

Moreover, the threshold a = 3.58 when H = 100 and a = 3.46 when H = 150 so that the theoretical

ARL is controlled around 5000. Table 9 compares the theoretical bound for the EDD in Theorem

3.2 with the corresponding Monte Carlo EDD based on 1000 simulations. As we can see, each

Monte Carlo EDD is no more than its theoretical upper bound. Furthermore, both Monte Carlo

EDDs and theoretical bounds decrease as ρ increases with the same M and H, but increase as

71 Model (a) H = 100 ρ = 0.6 Model (a) H = 150 ρ = 0.6 100 100 80 80 60 60 40 40 Detection Delay Detection Delay 20 20 0 0

Max−Type General Our Method Max−Type General Our Method Methods Methods

Model (a) H = 100 ρ = 0.8 Model (a) H = 150 ρ = 0.8 50 40 40 30 30 20 20 Detection Delay Detection Delay 10 10 0 0

Max−Type General Our Method Max−Type General Our Method Methods Methods

Figure 8: Boxplots of EDDs for “Max-type” and “General” stopping rules in Chen (2019) and Chu and Chen (2018), and the proposed stopping rule. The results are based on 1000 simulations under model (a).

M increases with the same ρ and H. The simulation results are consistent with the theoretical

findings in Theorem 3.2.

We also compare the proposed stopping rule with some other stopping rules in the literature.

Based on different edge-count statistics, Chen (2019) and Chu and Chen (2018) propose a series of stopping rules, among which the ones based on the generalized edge-count statistic and based

72 Model (b) H = 100 ρ = 0.6 Model (b) H = 150 ρ = 0.6 50 80 40 60 30 40 20 Detection Delay Detection Delay 20 10 0 0

Max−Type General Our Method Max−Type General Our Method Methods Methods

Model (b) H = 100 ρ = 0.8 Model (b) H = 150 ρ = 0.8 50 80 40 60 30 40 20 Detection Delay Detection Delay 20 10 0 0

Max−Type General Our Method Max−Type General Our Method Methods Methods

Figure 9: Boxplots of EDDs for “Max-type” and “General” stopping rules in Chen (2019) and Chu and Chen (2018), and the proposed stopping rule. The results are based on 1000 simulations under model (b).

on the max-type edge-count statistic are more effective in detecting changes. The generalized and max-type stopping rules are based on a non-parameter framework that utilizes nearest neighbor information among observations. The implementation of these two stopping rules is available in the R package gStream. Similar to the authors, we choose a relatively larger nearest neighbors 5 to gain more information. The ARL is specified at 5000. Since they assume the observations are

73 Model (c) H = 100 ρ = 0.6 Model (c) H = 150 ρ = 0.6 20 25 20 15 15 10 10 Detection Delay Detection Delay 5 5 0 0

Max−Type General Our Method Max−Type General Our Method Methods Methods

Model (c) H = 100 ρ = 0.8 Model (c) H = 150 ρ = 0.8 20 25 20 15 15 10 10 Detection Delay Detection Delay 5 5 0 0

Max−Type General Our Method Max−Type General Our Method Methods Methods

Figure 10: Boxplots of EDDs for “Max-type” and “General” stopping rules in Chen (2019) and

Chu and Chen (2018), and the proposed stopping rule. The results are based on 1000 simulations under model (c).

temporally independent, we consider M = 0. Other setups are specified at the beginning of this section. Note that the stopping rules in Chen (2019) and Chu and Chen (2018) are proposed to detect the change point in the distribution. When the change in distribution is indeed caused by the covariance structure, Figures 8-10 show that the proposed stopping rule performs better with much smaller EDDs than the two competitors.

74 Histogram of selected M when M=0 Histogram of selected M when M=1

99 96 100 100 80 80 60 60 Frequency Frequency 40 40 20 20

4 1 0 0 0 0 0 0 0 0 0 0

0 1 2 3 4 5 0 1 2 3 4 5

Figure 11: Histograms of selected M by the proposed data-driven procedure when the actual M = 0 and 1. The results are based on 100 simulations.

3.5.3 Accuracy of the data-driven procedure for M

In the last part of simulation studies, we examine the data-driven procedure proposed in Section

3.4.4 for estimating M. For each simulation, a training sample of 200 observations is generated from

(3.10) with p = 1000. Figure 11 illustrates the histograms of selected M based on 100 simulations when the actual M = 0 and 1. With 99 and 96 successes respectively, the proposed data-driven procedure demonstrates its satisfactory performance for estimating the M.

3.6 Case Study

Resting-state fMRI is a method to explore the brain’s internal dynamic networks. We apply the pro- posed method to a resting-state fMRI dataset obtained from the 2017 Human Connectome Project

(HCP) data release. The data consist of 300 independent component analysis (ICA) component nodes (p = 300) repeatedly observed over 1200 time points, collected for each of 1003 subject- s. The publicly accessible dataset together with details about data acquisition and preprocessing procedures can be found in HCP website (http://www.humanconnectome.org).

We detect the change in a real-time manner, in the sense that we pretend the observations in the dataset continually arrive in time. At each time, we determine whether the process should be terminated through the proposed stopping rule. Note that the proposed stopping rule is designed only for detecting the covariance change. When a detection process involves a change in the mean,

75 Figure 12: Online change-point detection in the covariance structure of subject 103010 (upper panel) and subject 130417 (lower panel). Each panel illustrates the estimated correlation matrices before and after the estimated change point.

it cannot be detected by the proposed stopping rule. Despite such a limitation, we still apply the stopping rule to the dataset for the covariance change as the main interest of using the resting-state fMRI is to study the dynamic nature of brain connectivity (Cribben et al., 2013; Jeong et al., 2016).

While there are 1003 subjects in the dataset, we randomly choose two subjects 103010 and

130417 to demonstrate the practical usefulness of the proposed method. The proposed stopping rule needs a training sample. We pretend that the first 200 observations of each time series are historical, and further justify their stationarity in the covariance structure through the testing procedure in Section 3.3. Here we use a relatively large training sample size 200 to attain precise

76 estimation of nuisance parameters. Based on the training sample, we estimate M by 5 for the subject 103010 and 6 for the subject 130417 using the method in Section 3.4 and obtain the sample meanµ ˆ and the sample standard deviation of the test statistic using (3.8) in Section 3.3. Choosing the threshold a = 3.58 so that the ARL is controlled around 5000, we apply the proposed stopping rule with the window size H = 100 and terminate the process at the time 287 for the subject 103010 and the time 245 for the subject 130417.

With each of the stopping times 287 and 245, we pull out the observations from time 1 to the stopping time and conduct some post analyses. The first analysis is the change-point estimation.

Similar to Bai (2010), the change point is estimated by

Jˆ τˆ = arg max t,M,Hˆ , 1

Jˆ Jˆ where t,M,Hˆ is obtained by replacing WM (i, j) in n,M,H with At,Mˆ (i, j) defined in (3.3). The ˆ rationale of using the above estimator is that the expectation of Jt,M,Hˆ always attains its maximum at the true change point, as mentioned in Remark 3.2 of Section 3.2. The estimated change points are 264 for the subject 103010 and 228 for the subject 130417. With the two stopping times 287 and 245, the corresponding detection delays are 23 for the subject 103010 and 17 for the subject

130417, showing that the proposed stopping rule can quickly terminate the process when the brain’s network change occurs.

The second analysis is illustrating the actual change in the brain’s network. For each subject, we estimate the correlation matrices before and after the estimated change point using the glasso.

The obtained results for the two subjects are summarized in Figure 12, which clearly illustrates the brain’s internal networks become stronger after the estimated change points. The results are consistent with recent studies that during the resting state, the brain’s networks activate when a subject focuses on internal tasks, and exhibit dynamic changes within time scales of seconds to minutes (Allen et al. 2014; Calhoun et al. 2014; Chang and Glover 2010; Cribben et al. 2012;

Handwerker et al. 2012; Hutchison et al. 2013b; Jeong et al. 2016; Monti et al. 2014).

77 3.7 Technical Details

3.7.1 Proofs of main theorems

Proof of Proposition 3.1.

From (3.3), Xi and Xj in Jˆn,M are M apart, because of the indicator function in WM (i, j). Using

(C1), we see that Xi and Xj are independent. As a result, ∑ ∑ Jˆ −2 T T −2 E( n,M ) = n WM (i, j)E(Xi XjXj Xi) = n WM (i, j)tr(ΣiΣj) i,j i,j by the model (3.2), which gives the expectation under the alternative hypothesis. Specially, under the null hypothesis, tr(Σ2) ∑ E(Jˆ ) = W (i, j) = 0, n,M n2 M ∑ i,j because i,j WM (i, j) = 0. This completes the proof of Proposition 3.1.

Proof of Proposition 3.2. ∑ ∑ Jˆ Jˆ2 − 2 Jˆ 2 Jˆ −4 Note that Var( n,M ) = E( n,M ) E ( n,M ), where E ( n,M ) = n i,j k,l WM (i, j) Jˆ2 WM (k, l)tr(ΣiΣj)tr(ΣkΣl) from Proposition 3.1. We thus only need to derive E( n,M ), which, from (3.2) and (C1), is 1 ∑ ∑ E(Jˆ2 ) = W (i, j)W (k, l)E(XT X XT X XT X XT X ) n,M n4 M M i j j i k l l k i,j k,l 1 ∑ ∑ 4 ∑ ∑ = W (i, j)W (k, l)tr(Σ Σ )tr(Σ Σ ) + W (i, j)W (k, l) n4 M M i j k l n4 M M i,j k,l i,j k,l 1 ∑ ∑ × tr2{C(i − k)C(l − j)} + W (i, j)W (k, l) n4 M M [ i,j k,l

16tr{ΣlC(i − k)ΣjC(k − i)} + 4tr{C(k − i)C(j − l)C(k − i)C(j − l)} T T ◦ T T T T ◦ T T + 8βtr(Γi ΓjΓj Γi Γk ΓlΓl Γk) + 8βtr(Γi ΓjΓk Γl Γi ΓjΓk Γl) ∑ ] 2 T T T ◦ T T T + 2β tr(Γj ΓiememΓk Γl Γj ΓiememΓk Γl) , m where for any square matrices A and B, the symbol A ◦ B = (aijbij), and em is the unit vector with

2 the only non-zero element at the mth component. By applying (C2) and subtracting E (Jˆn,M ) in Proposition 3.1, we have 4 ∑ ∑ E(Jˆ2 ) = W (i, j)W (k, l)tr2{C(i − k)C(l − j)}{1 + o(1)}. n,M n4 M M i,j k,l

78 This completes the proof of Proposition 3.2.

Proof of Theorem 3.1

We need to derive the cumulative distribution function of TH (a, M). From (3.6), ( ) Jˆ n0+i,M,H P∞{T (a, M) ≤ t} = P∞ max > a . H ≤ ≤ 0 i t σˆn0,M,H

Jˆ The cumulative distribution function of TH (a, M) thus depends on n0+i,M,H /σˆn0,M,H , which will be shown to converge to a stationary Gaussian process. Jˆ ≡ Jˆ ≡ To simplify notation, we let n0+i,M,H i,M , andσ ˆn0,M,H σˆ0. The Gaussian process can −1Jˆ −1Jˆ ′ be established by showing (i) the joint asymptotic normality of (ˆσ0 i1,M ,..., σˆ0 id,M ) for any ··· −1Jˆ i1 < i2 < < id. (ii) the tightness ofσ ˆ0 i,M . To prove (i), we apply the Cram´er-Wold device ∑ ··· d −1 Jˆ to show that for any non-zero a1, , ad, l=1 σˆ0 al il,M is asymptotic normal. Since the proof is similar to that of Theorem 3.3, we omit it. We thus only need to prove (ii).

Toward this end, we first obtain the leading order of Var(Jˆi,M ), which is

4 ∑H ∑ Var(Jˆ ) = W (i, j)W (i − h , j + h )tr2{C(h )C(h )} i,M H4 M M 1 2 1 2 i,j=1 h1,h2 ( ) 4 ∑ 6π2 − 51 = tr2{C(h )C(h )} H4 . H4 1 2 18 h1,h2 ∈ { } Jˆ Jˆ Let i1, i2 1, . . . , t . Next, we derive the leading order of Cov( i1,M , i2,M ), which equals Jˆ Jˆ ≡ − E( i1,M i2,M ) when there is no any change point. Let id i2 i1. ∈ { − } − Jˆ Jˆ For id 1,...,H 1 and H id = O(H), under (C1), the leading order of Cov( i1,M , i2,M ) depends on id can be derived to be { } 4 ∑ 6π2 − 51 Cov(Jˆ , Jˆ ) = tr2{C(h )C(h )} (H − i )4 . i1,M i2,M H4 1 2 18 d h1,h2 ∈ { − } − ≥ Jˆ Jˆ For id 1,...,H 1 and H id = o(H), or for id H, Cov( i1,M , i2,M ) can be shown is the smaller order of Var(Jˆi,M ), i.e. [ ( )] 4 ∑ 6π2 − 51 Cov(Jˆ , Jˆ ) = o tr2{C(h )C(h )} H4 . i1,M i2,M H4 1 2 18 h1,h2 −1Jˆ −1Jˆ We want to show the tightness of σ0 i,M . Then the tightness ofσ ˆ0 i,M can be established by the Slutsky theorem becauseσ ˆ0 is ratio-consistent to σ0 according to Theorem 3.3. Consider

79 i = q∗ · t, for q∗ = i/t ∈ (0, 1), with i = 1, . . . , t. It is equivalent to show the tightness of G(i/t), ∗ −1Jˆ ∗ ∗ where G(i/t) = G(q ) = σ0 i,M . For 0 < q < r < 1,

| ∗ − ∗ |2 −1 |Jˆ − Jˆ |2 E G(r ) G(q ) = σ0 E i1,M i1,M −1{ Jˆ2 Jˆ2 − Jˆ Jˆ } = σ0 E( i1,M ) + E( i2,M ) 2E( i1,M i2,M ) .

When there is no any change point,

E(Jˆ2 ) = E(Jˆ2 ) = Var(Jˆ ) i1,M i2,M ( i,M ) 4 ∑ 6π2 − 51 = tr2{C(h )C(h )} H4 {1 + o(1)}. H4 1 2 18 h1,h2

For any i1, i2 ∈ {1, . . . , t}, and i2 − i1 = id ∈ {1,...,H − 1}, as H → ∞, ∑ 4 2 4 4 (4/H ) tr {C(h1)C(h2)}2{H − (H − id) } | ∗ − ∗ |2 ≤ h1,h2 ∑ E G(r ) G(q ) C 4 2 4 (4/H ) tr {C(h1)C(h2)}H ( ) h1,h2 i ≤ C d . H

For id ≥ H, ∑ 4 2 4 4 C(4/H ) tr {C(h1)C(h2)}2{H + o(H )} E|G(r∗) − G(q∗)|2 ≤ h1,h∑2 ≤ C. (4/H4) tr2{C(h )C(h )}H4 h1,h2 1 2

Therefore, by Chebyshev’s inequality, if 1 ≤ id ≤ H − 1,

E|G(r∗) − G(q∗)|2 P(|G(r∗) − G(q∗)| ≥ λ) ≤ ≤ (C/λ2)(i /H). λ2 d

Let H/t = d, then ∗ ∗ ∗ ∗ id/H = (i2 − i1)/H = (r − q )t/H = (r − q )/d,

∗ ∗ ∗ ∗ and {(r − q )/d} ∈ (0, 1). If id ≥ H, or equivalently {(r − q )/d} ≥ 1,

E|G(r∗) − G(q∗)|2 P(|G(r∗) − G(q∗)| ≥ λ) ≤ ≤ C/λ2. λ2

Let   (r∗ − q∗)/d, if r∗ − q∗ < d ∗ ∗ fd{(q , r ]} =  1, if r∗ − q∗ ≥ d, then

∗ ∗ 2 ∗ ∗ P(|G(r ) − G(q )| ≥ λ) ≤ (C/λ )fd{(q , r ]}.

80 Let ξi = G(i/t) − G{(i − 1)/m}, for i = 1, . . . , t. Then Si = ξ1 + ··· + ξi = G(i/t) with S0 = 0. Therefore, | − | ≥ ≤ 2 { ∗ ∗ } P( Si2 Si1 λ) (C/λ )fd (q , r ] .

∗ ∗ ∗ ∗ ∗ ∗ For any 0 < p < q < r < 1, G(p ) = Si0 , G(q ) = Si1 and G(r ) = Si2 , respectively. Let m∗ = |G(q∗) − G(p∗)| ∧ |G(r∗) − G(q∗)|. Then [ ] P(m∗ ≥ λ) = P {|G(q∗) − G(p∗)| ≥ λ} ∩ {|G(r∗) − G(q∗)| ≥ λ}

≤ 1/2 | − | ≥ · 1/2 | − | ≥ P ( Si1 Si0 λ) P ( Si2 Si1 λ) ≤ 1/2{ ∗ ∗ } 1/2{ ∗ ∗ } (C/λ)fd (p , q ] (C/λ)fd (q , r ]

2 ∗ ∗ ∗ ∗ ≤ (C/λ )[fd{(p , q ]} + fd{(q , r ]}].

If q∗ − p∗ < d and r∗ − q∗ < d, or equivalently r∗ − p∗ < 2d, { } ( ) q∗ − p∗ r∗ − q∗ r∗ − p∗ P(m∗ ≥ λ) ≤ (C/λ2) + ≤ (C/λ2) . d d d

If q∗ − p∗ < d and r∗ − q∗ ≥ d, or q∗ − p∗ ≥ d and r∗ − q∗ < d, but r∗ − p∗ < 2d, { } q∗ − p∗ P(m∗ ≥ λ) ≤ (C/λ2) + 1 ( d ) ( ) q∗ − p∗ r∗ − q∗ r∗ − p∗ ≤ (C/λ2) + ≤ (C/λ2) . d d d

If q∗ − p∗ < d and r∗ − q∗ ≥ d, or q∗ − p∗ ≥ d and r∗ − q∗ < d, but r∗ − p∗ ≥ 2d, { } q∗ − p∗ P(m∗ ≥ λ) ≤ (C/λ2) + 1 ≤ 2C/λ2. d

If q∗ − p∗ ≥ d and r∗ − q∗ ≥ d, and r∗ − p∗ ≥ 2d,

P(m∗ ≥ λ) ≤ 2C/λ2.

Let    r∗−p∗ 1 ∗ ∗ ( ) 2α , if r − p < 2d { ∗ ∗ } d µα,d (p , r ] =   1 ∗ ∗ 2 2α , if r − p ≥ 2d,

1 { ∗ ∗ } ∗ ∗ ∗ ∈ where α > 2 . Then µα,d (p , r ] is a finite measure on T = (0, 1]. For any ϵ > 0 and p , q , r T = (0, 1], ∗ ≥ ≤ 2 2α { ∗ ∗ } P(m λ) (C/λ )µα,d (p , r ] .

81 Let

∗ | − | ∧ | − | L(G) = sup m = max Si1 Si0 Si2 Si1 . p∗≤q∗≤r∗ i0≤i1≤i2

Using Theorem 10.3 in Billingsley (1999), we conclude

KC P {L(G) ≥ λ} ≤ µ2α {(0, 1]}, λ2 α,d

≫ − 2α { } where K is a constant. As t H, d = H/t is close to zero, and 2d < (1 0). Hence, µα,d (0, 1] = 2, and 2KC P {L(G) ≥ λ} ≤ . λ2 From (10.4) in Billingsley (1999), we obtain

max |Si| ≤ L(G) + |St|. 1≤i≤t

| |2 −2 Jˆ2 Since E St = σ0 E( t,M ) = 1, we have { } ( ) 1 1 P( max |Si| ≥ λ) ≤ P L(G) ≥ λ + P |St| ≥ λ 1≤i≤t 2 2 2KC E|S |2 KC ≤ + t ≤ . 1 2 1 2 2 ( 2 λ) ( 2 λ) λ

If λ goes to infinity, the above probability converges to zero. Therefore, Si is tight or equivalently

Jˆi,M /σ0 is tight.

Let q = i/H and let Y (q) = Y (i/H) ≡ Jˆi,M /σ0. For 0 ≤ p ≤ q, consider |p − q| → 0, then we have, as H → ∞,

|p − q| → 0 ⇒ |i1 − i2|/H → 0 ⇒ id/H → 0 ⇒ id = o(H).

If id = o(H),

Cov{Y (p),Y (q)} ∑ 4 2 4 (4/H ) tr {C(h1)C(h2)}{(H − id) } = σ−2E(Jˆ Jˆ ) = h1,h∑2 {1 + o(1)} 0 i1,M i2,M (4/H4) tr2{C(h )C(h )}H4 h1,h2 1 2 4 4 = {(H − id) /H }{1 + o(1)} = 1 − 4(id/H) + o{(id/H)}

= 1 − 4|p − q| + o{|p − q|}.

On the other hand, if |p − q| → ∞ or id/H → ∞, Cov{Y (p),Y (q)} = 0.

82 As a result, {Y (q), q ≥ 0} converges to {Z(q), q ≥ 0}, which is a stationary Gaussian process with zero mean, unit variance and covariance function of the form

r(|p − q|) = Cov{Z(p),Z(q)} = 1 − 4|p − q| + o(|p − q|), as |p − q| → 0. On the other hand, as |p − q| → ∞, r(|p − q|) log(|p − q|) → 0.

Let Q = t/H. From Finch (2003), as Q → ∞, max0≤q≤Q |Z(q)| has the Gumbel distribution so that { } [ { }]

P∞ max |Z(q)| ≤ a = exp −2exp g(t/H, a) , 0≤q≤Q where √ √ g(t/H, a) = 2 log(t/H) + 1/2 log log(t/H) + log(4/ π) − a 2 log(t/H).

As a result, when t > H, [ { }]

P∞{TH (a, M) ≤ t} = 1 − exp −2exp g(t/H, a) .

When t = H and as H → ∞, { } { } √ −1 2 P∞ max |Z(q)| ≤ a = exp −(2 π) Hexp(−a /2) , 0≤q≤1 √ which has the order of 1 − 1/(2 π)Hexp(−a2/2) because H = o{exp(a2/2)}. As a result, √ 2 P∞{TH (a, M) ≤ H} = 1/(2 π)Hexp(−a /2), which decays to zero as H = o{exp(a2/2)}.

We next derive the expectation of TH (a, M). Since the support of TH (a, M) is non-negative, we have ∫ ∞ ∞{ } { − } E TH (a, M) = 1 FTH (a,M)(t) dt, 0 where FTH (a,M)(t) is the cumulative distribution function of TH (a, M) evaluated at t. From the above results, we have ∫ ∫ H ∞ ∞{ } { − } { − } E TH (a, M) = 1 FTH (a,M)(t) dt + 1 FTH (a,M)(t) dt 0 H ( ∫ ∞ [ { }] ) = H + exp −2exp g(t/H, a) dt {1 + o(1)}. H

This completes the proof of Theorem 3.1.

83 Proof of Theorem 3.2

We first prove that the supremum of the EDDs attains at the immediate change point τ = n0.

Equivalently, we need to show that for any τ > n0,

Eτ {TH (a, M) − (τ − n0)|TH (a, M) > τ − n0} ≤ E0{TH (a, M)}.

∗ ∗ ∗ To simplify notation, we let τ = τ − n0 and T = TH (a, M) − τ . Then to show the above inequality, we only need to show that

∗ ∗ ∗ Eτ {T |T > 0} ≤ E0{T }.

Since ∫ ∞ ∗ ∗ ∗ ∗ Eτ {T |T > 0} = {1 − Pτ (T < t|T > 0)}dt, and 0 ∫ ∞ ∗ ∗ E0{T } = {1 − P0(T < t)}dt, 0 we only need to show that

∗ ∗ ∗ Pτ (T < t|T > 0) ≥ P0(T < t), (3.11)

First, the probability on the left hand side of (3.11) is

∗ − ∗ ∗ | ∗ Pτ (T < t) Pτ (T < 0) Pτ (T < t T > 0) = ∗ , (3.12) 1 − Pτ (T < 0) where ( ) Jˆ P {T ∗ < t} = P max n0+i,M,H > a , and τ τ ≤ ≤ ∗ 0 i t+τ σˆn0,M,H

( ) Jˆ P {T ∗ < 0} = P max n0+i,M,H > a . τ τ ≤ ≤ ∗ 0 i τ σˆn0,M,H From the above two probabilities, we can define two events

Jˆ Jˆ A = { max n0+i,M,H > a}, and B = { max n0+i,M,H > a}. ≤ ≤ ∗ ≤ ≤ ∗ 0 i t+τ σˆn0,M,H 0 i τ σˆn0,M,H

Second, the probability on the right hand side of (3.11) is ( ) Jˆ ∗ n0+i,M,H P0{T < t} = P0 max > a 0≤i≤t σˆ n 0,M,H ( ) Jˆ = P max n0+i,M,H > a . (3.13) τ ∗≤ ≤ ∗ τ i t+τ σˆn0,M,H

84 The last equation holds because both probabilities are based on the observations after the change points 0 and τ − n0, respectively, and the observations have the same distribution. From (3.13), we define the event

Jˆ C = { max n0+i,M,H > a}. ∗≤ ≤ ∗ τ i t+τ σˆn0,M,H From the above defined events A, B and C, we see that A = B ∪ C. Therefore, P(A) =

P(B) + P(C) − P(B ∩ C). From the definition of the events B and C, we see that if B occurs, then ∗ ∗ T < 0 or the stopping time TH (a, M) < τ . Then C cannot occur. Therefore P(A) = P(B)+P(C). Moreover, from the definitions of A, B, C, (3.12) becomes

∗ ∗ ∗ Pτ (A) − Pτ (B) Pτ (C) P0{T < t} Pτ (T < t|T > 0) = = = , 1 − Pτ (B) 1 − Pτ (B) 1 − Pτ (B) where the last equation holds by using (3.13). Then (3.11) can be proved accordingly. This completes the proof that the supremum of the EDDs attains at the immediate change point τ = n0.

We next establish the upper bound for the EDDs. To simplify notation, we let JˆT ≡ Jˆn,M,H , which is the test statistic evaluated at the stopping time TH (a, M). Let yi,r1r2 = xi,r1 xi,r2 , where xi,r1 and xi,r2 are the r1th and r2th component of Xi, respectively. Hence, E(yi,r1r2 ) =

Cov(xi,r1 , xi,r2 ) = σi,r1r2 , which is the r1th row and r2th column of Σi. Using (3.3), we see that

{ p } 1 ∑ ∑H E(Jˆ ) = E W (i, j)y y T H2 M T +i,r1r2 T +j,r1r2 r1,r2=1 i,j=1 { p } 1 ∑ ∑H = E W (i, j)σ σ H2 M T +i,r1r2 T +j,r1r2 r1,r2=1 i,j=1 { p } 1 ∑ ∑H + E W (i, j)(y − σ )(y − σ ) H2 M T +i,r1r2 T +i,r1r2 T +j,r1r2 T +j,r1r2 r1,r2=1 i,j=1 { p } 2 ∑ ∑H + E W (i, j)σ (y − σ ) H2 M T +i,r1r2 T +j,r1r2 T +j,r1r2 r1,r2=1 i,j=1 = E(I) + E(II) + E(III).

By Lemma 3.3, 3.4, and 3.5, we obtain

1 E(I) ≥ E{(T − M)(T − M − 1)}tr{(Σ − Σ )2}{1 + o(1)}, H τ τ+1 [ ] [√ ] 2 2 E(II) = O log(H)tr{(Στ − Στ+1) } = o Htr{(Στ − Στ+1) } ,

85 and [√ ] 2 E(III) = o Htr{(Στ − Στ+1) } .

Hence,

1 E(Jˆ ) ≥ E{(T − M)(T − M − 1)}tr{(Σ − Σ )2}{1 + o(1)} T H τ τ+1 [√ ] 2 + o Htr{(Στ − Στ+1) } . (3.14)

Using (3.14), we have

1 a · σ ≥ E{(T − M)(T − M − 1)}tr{(Σ − Σ )2}{1 + o(1)} H,M,0 H τ τ+1 [√ ] 2 − (|E(JˆT )| − a · σH,M,0) + o Htr{(Στ − Στ+1) } . (3.15)

Let JˆT −1 denote the test statistic evaluated at T − 1. From the stopping rule (3.6), we have

E|JˆT −1| ≤ a · σH,M,0.

By Jensen’s inequality and triangle inequality, we also have

E|JˆT −1| ≥ |E(JˆT )| − |E(JˆT − JˆT −1)|.

Combining the above two inequality, we obtain

|E(JˆT )| − a · σH,M,0 ≤ |E(JˆT − JˆT −1)|. (3.16)

Based on similar derivations,

2 2 |E(Jˆ − Jˆ − )| = E(T − M − 1)tr{(Σ − Σ ) }{1 + o(1)} T T 1 H τ τ+1 [√ ] 2 + o Htr{(Στ − Στ+1) } . (3.17)

Combining (3.15), (3.16) and (3.17), we obtain [ ] 1 √ E{(T − M − 2)2}tr{(Σ − Σ )2}{1 + o(1)} ≤ a · σ + o Htr{(Σ − Σ )2} . H τ τ+1 H,M,0 τ τ+1

r 2 Using a · σH,M,0 = O{H · tr(Στ − Στ+1) } with 1/2 ≤ r < 1 and the Jensen’s inequality, we have [ ] √ a · σ · H 1/2 − − ≤ { − − 2} ≤ H,M,0 { } E(T M 2) E (T M 2) 2 1 + o(1) . tr{(Στ − Στ+1) } This completes the proof of Theorem 3.2.

86 Proof of Theorem 3.3 Jˆ The asymptotic normality of n0,M can be established by the martingale central limit theorem.

Toward this end, we let F0 = {∅, Ω}, Fk = σ{X1, ..., Xk} with k = 1, 2, ..., n0, and Ek(·) denote F − Jˆ the conditional expectation given k. Define Dn0,k = (Ek Ek−1) n0,M and it is easy to see that ∑ Jˆ − n0 n0,M µJˆ = k=1 Dn0,k. n0,M ∑ m Jˆ − ≥ We further define Sn0,m = k=1 Dn0,k = Em n0,M µJˆ . We can show that for q m, n0,M |F Jˆ − Jˆ − E(Sn0,q m) = Sn0,m. To this end, we note that Sn0,q = Eq n0,M µJˆ = Em n0,M µJˆ + n0,M n0,M Jˆ − Jˆ Jˆ − Jˆ Eq n0,M Em n0,M = Sn0,m + (Eq n0,M Em n0,M ). Then

|F { Jˆ |F } − { Jˆ |F } E(Sn0,q m) = Sn0,m + E Eq( n0,M ) m E Em( n0,M ) m { Jˆ } − { Jˆ } = Sn0,m + E Em( n0,M ) E Em( n0,M )

= Sn0,m.

{ F } { ≤ ≤ } As a result, we see that Sn0,k, k is a martingale and accordingly, Dn0,k, 1 k n0 is a martingale difference sequence with respect to the σ-fields {Fk, 1 ≤ k ≤ n0} Based on similar derivations for Lemmas 2 and 3 in Li and Chen (2012), we can show that under (3.2) and (C1)-(C2), as n0 → ∞, ∑ n0 2 Ek−1(D ) k=1 n0,k −→p 2 1. σJˆ n0,M And, ∑ n0 E(D4 ) k=1 n0,k → 4 0. σJˆ n0,M The above two results are sufficient conditions for the martingale central limit theorem. This thus completes the first part of Theorem 3.3.

To show the second part of Theorem 3.3, we only need to show the ratio consistency ofσ ˆJˆ n0,M ,0 defined in (3.8) to σJˆ under the null hypothesis. From the expression (3.7), we apply (3.2) n0,M ,0 such that under the null hypothesis,

( ∗ ) ∗ 1 ∑ 1 ∑ E XT X XT X = E(ZT ΓT Γ ZZT ΓT Γ Z) n∗ t+h2 s s+h1 t n∗ t+h2 s s+h1 t s,t s,t

= tr{C(h1)C(h2)}.

87 \ This shows that E[tr{C(h1)C(h2)}] = tr{C(h1)C(h2)}. Similarly, we can show that under the

\ 2 conditions (C1)-(C2), Var[tr{C(h1)C(h2)}] = o[tr {C(h1)C(h2)}]. This implies that under the null hypothesis, \ p tr{C(h1)C(h2)}/tr{C(h1)C(h2)} −→ 1.

The second part of Theorem 3.3 is then proved by applying the continuous mapping theorem.

Proof of Theorem 3.4

We first show that P(Mˆ = M) = 1 as n0 → ∞. Note that the event that Mˆ > M is equivalent to the event that

tr{C(M +\ 1)C(−M − 1)}/tr{C\(0)C(0)} > ϵ.

Therefore, P(Mˆ > M) is equivalent to [ ] P tr{C(M +\ 1)C(−M − 1)}/tr{C\(0)C(0)} > ϵ . [ ] It is also equivalent to P tr{C(M +\ 1)C(−M − 1)}/tr{C(0)C(0)} > ϵ as

p tr{C\(0)C(0)}/tr{C(0)C(0)} −→ 1 from the proof of Theorem 3.3.

From (3.7), we can show that E[tr{C(M +\ 1)C(−M − 1)}] = 0 and [ ] Var tr{C(M +\ 1)C(−M − 1)}/tr{C(0)C(0)} = O(n−2).

Using Chebyshev’s inequality, we can show that as n0 → ∞, [ ] P tr{C(M +\ 1)C(−M − 1)}/tr{C(0)C(0)} > ϵ = 0, or equivalently, P(Mˆ > M) = 0. Similarly, we can show that P(Mˆ < M) = 0. We then establish the consistency of Mˆ to M.

To prove E∞{TH (a, Mˆ )} = E∞{TH (a, M)}, we only need to show that as n0 → ∞,

P∞{TH (a, Mˆ ) ≤ t} = P∞{TH (a, M) ≤ t}.

Toward this end, we notice that

P∞{TH (a, Mˆ ) ≤ t} = P∞{TH (a, Mˆ ) ≤ t, Mˆ = M} + P∞{TH (a, Mˆ ) ≤ t, Mˆ ≠ M},

88 where the second term converges to zero because P(Mˆ = M) = 1 as n0 → ∞.

To prove E0{TH (a, Mˆ )} = E0{TH (a, M)}, we notice that as n0 → ∞,

E0{TH (a, Mˆ )} = E(E0{TH (a, Mˆ )|Mˆ }) ∑ ∗ ∗ = E0{TH (a, M)}P(Mˆ = M) + E0{TH (a, M )}P(Mˆ = M ) M ∗≠ M

= E0{TH (a, M)}.

This completes the proof of Theorem 3.4.

3.7.2 Lemmas and their proofs

Lemma 3.1 Assume the model (3.1). For any j ≥ 1, we assume that E(y2 ) < ∞ where j,r1r2 yj,r1r2 = xj,r1 xj,r2 , and xj,r1 and xj,r2 are the r1th and r2th element of Xj, respectively. Let τ be the change point, such that Σ ≠ Σ , and Σ = {σ }p . Let n be the size of historical τ τ+1 τ+1 τ+1,r1r2 r1,r2=1 0 data. For the stopping time T , we have

(i) E(|y |2) = O{E(T )σ2 }, for 1 ≤ i ≤ T + M. n0+i,r1r2 τ+1,r1r2 | | { 1/2 } ≤ ≤ (ii) E( yn0+i,r1r2 ) = O E(T )στ+1,r1r2 , for 1 i T + M. ∑ (iii) E(| M y |2) = O{E(T )σ2 }. i=1 n0+T +i,r1r2 τ+1,r1r2 ∑ | M | { }1/2 (iv) E( i=1 yn0+T +i,r1r2 ) = O[ E(T ) στ+1,r1r2 ]. ∑ T { }1/2 (v) E( i=1 yn0+i,r1r2 ) = E(T )στ+1,r1r2 + O[ E(T ) στ+1,r1r2 ]. ∑ { T − 2} 1/2 { }1/2 { }1/2 (vi) [E ( i=1 yn0+i,r1r2 T στ+1,r1r2 ) ] = E(T)γτ+1,r1r2 + O[ E(T) στ+1,r1r2 ]. ∑ (vii) E{( T y − T σ )2} = E(T )γ + O{E(T)σ2 }, where γ = i=1 n0+i,r1r2 τ+1,r1r2 τ+1,r1r2 τ+1,r1r2 τ+1,r1r2 ∑ M ≥ Var(yi,r1r2 ) + 2 q=1 Cov(yi,r1r2 , yi+q,r1r2 ), for i n0 + 1. ∗ ∗ ∗ − F ∗ { ∗ } ≥ F ∗ {∅ } Proof. Let n = n n0. Let n = σ X1,...,Xn0+n if n 1 and n = , Ω if n = 0. ∞ {F ∗ } F Then n 0 is an increasing sequence of σ-fields on a probability space (Ω, ∞,P ). Moreover, ∞ ∞ { ∗ } ∗ {F ∗ } Xn0+n n =1 is a stationary and M-dependent sequence of random vectors adapted to n 1 , ∞ ∗ { ∗ } {F ∗ } {F ∗ } and Xn0+n +i i=M+1 is independent of n for every n . Hence, T is relative to n .

Note that for i ≥ n0 + 1,

E(yi,r1r2 ) = Cov(xi,r1 , xi,r2 ) = στ+1,r1r2 .

∞ { ∗ } ∗ Therefore, yn0+n ,r1r2 n =1 is a sequence of stationary and M-dependent random variables adapt- ∞ ∞ ∗ {F ∗ } ∗ { ∗ } {F ∗ } ∗ ed to n n =1, and yn0+n +i,r1r2 i=M+1 is independent of n for every n . Let Sn =

89 ∑ n∗ ∗ ∗ { ∗ − |F ∗ } i=1 yn0+i,r1r2 , and let Un = E Sn +M (n + M)στ+1,r1r2 n , then we have

E(Un∗ |Fn∗−1)

∗ { ∗ − |F ∗ } = E Sn +M (n + M)στ+1,r1r2 n −1

∗ ∗ − |F ∗ ∗ = Un −1 + E(yn0+n +M,r1r2 στ+1,r1r2 n −1) = Un −1.

∞ { ∗ } Therefore, Un n∗=0 is a martingale. Applying Theorem 1.1 and Theorem 1.2 (iii) in Janson (1983), we obtain

E(ST +M ) = E(T + M)στ+1,r1r2 , and { − }2 − 2 E[ ST +2M (T + 2M)στ+1,r1r2 ] = E(T )γτ+1,r1r2 + E(S2M 2Mστ+1,r1r2 ) .

Let ϵ > 0 and for i ≥ 1, let y′ = y ·I(|y | > A) and y′′ = y − n0+i,r1r2 n0+i,r1r2 n0+i,r1r2 n0+i,r1r2 n0+i,r1r2 y′ where A is so large such that E|y′ |2 < ϵσ2 , where σ2 < ∞, because of n0+i,r1r2 n0+i,r1r2 τ+1,r1r2 τ+1,r1r2 E(y2 ) < ∞, for any j ≥ 1. Similar to the proof of Corollary 1.1 (i) in Janson (1983), we have j,r1r2

| |2 | ′ |2 | ′′ |2 E( yn0+i,r1r2 ) = E( yn0+i,r1r2 ) + E( yn0+i,r1r2 ) ( ) T∑+M ≤ | ′ |2 2 E yn0+i,r1r2 + A i=1 ≤ | ′ |2 2 E(T + M)E yn0+i,r1r2 + A

2 < ϵE(T )στ+1,r1r2 + Cϵ, for 1 ≤ i ≤ T +M. This proves (i). By the similar procedure for (ii)-(vii) of Corollary 1.1 in Janson

(1983), the results (ii)-(vii) can be obtained. This completes the proof of Lemma 3.1.

Lemma 3.2 Under the same conditions in Lemma 3.1, { } ∑T ∑M − − E (yn0+i,r1r2 στ+1,r1r2 )(yn0+i+q,r1r2 στ+1,r1r2 ) i=1 q=1 ∑M { 2 } = E(T ) Cov(yn0+1,r1r2 , yn0+1+q,r1r2 ) + O E(T )στ+1,r1r2 . q=1

∞ { ∗ } ∗ Proof. As shown in Lemma 3.1, yn0+n ,r1r2 n =1 is a sequence of stationary and M-dependent ran- ∑ ∞ M {F ∗ } ∗ { ∗ − ∗ − dom variables adapted to σ-fields n n =1. Thus q=1(yn0+n ,r1r2 στ+1,r1r2 )(yn0+n +q,r1r2

90 ∞ } ∗ στ+1,r1r2 ) n =1 is a sequence of stationary and 2M-dependent random variables and it is adapted ∗ ∑ ∞ (n ) M {F ∗ } ∗ ∗ − ∗ − to n +M n =1. If we let YM = q=1(yn0+n ,r1r2 στ+1,r1r2 )(yn0+n +q,r1r2 στ+1,r1r2 ) and (n∗) ∞ H ∗ F ∗ { } n = n +M , then YM n∗=1 is a sequence of stationary and 2M-dependent random variables ∗ ∑ ∞ ∗ (n ) M ∗ {H ∗ } ∗ { } ∗ ∗ ≥ adapted to n n =1. Let µ = E YM = q=1 Cov(yn0+n , yn0+n +q) for any n 1. As ∗ Fn∗ ⊆ Hn∗ for every n , T is also a stopping time relative to {Hn∗ }. Therefore, we can write

∑T ∑T ∑M S (i) − − T = YM = (yn0+i,r1r2 στ+1,r1r2 )(yn0+i+q,r1r2 στ+1,r1r2 ). i=1 i=1 q=1 ∑ n∗ (i) ∗ ∗ H {∅ } S ∗ U ∗ {S ∗ − |H ∗ } Let 0 = , Ω and define n = i=1 YM and n = E n +2M (n + 2M)µ n , then

E(Un∗ |Hn∗−1)

∗ ∗ = E{Sn∗+2M − (n + 2M)µ |Hn∗−1}

(n∗+2M) ∗ U ∗ − |H ∗ U ∗ = n −1 + E(YM µ n −1) = n −1.

∞ {U ∗ } Therefore, n n∗=0 is a martingale. Applying Theorem 1.1 in Janson (1983), we obtain

∗ E(ST +2M ) = E(T + 2M)µ .

As E(y2 ) < ∞, for any j ≥ 1, by H¨older’sinequality we have j,r1r2

∑M | (j)| − − E YM = E (yn0+j,r1r2 στ+1,r1r2 )(yn0+j+q,r1r2 στ+1,r1r2 ) q=1 ( ) ∑M ≤ 1/2 | − |2 1/2 | − |2 E ( yn0+j,r1r2 στ+1,r1r2 )E M yn0+j+q,r1r2 στ+1,r1r2 q=1 · | − |2 ∞ = M E yn0+j,r1r2 στ+1,r1r2 < , for any j ≥ 1. Therefore, by the similar idea in Lemma 3.1 (ii), we can show

{| (T )|} { }1/2 2 E YM = O[ E(T ) στ+1,r1r2 ].

Towards this end, applying Lemma 3.1 (v), we have

S ∗ { }1/2 2 E( T ) = E(T )µ + O[ E(T ) στ+1,r1r2 ].

This completes the proof of Lemma 3.2.

91 Lemma 3.3 Under the same conditions in Theorem 3.2,

{ p } 1 ∑ ∑H E W (i, j)σ σ H2 M T +i,r1r2 T +j,r1r2 r1,r2=1 i,j=1 1 { } ≥ E (T − M)(T − M − 1) tr{(Σ − Σ )2}{1 + o(1)}. H τ τ+1

Proof. We first write p 1 ∑ ∑H I ≡ W (i, j)σ σ H2 M T +i,r1r2 T +j,r1r2 r1,r2=1 i,j=1 − 1 H∑T 1 ∑H = W (i, j)tr(Σ2) + W (i, j)tr(Σ2 ) H2 M τ H2 M τ+1 i,j=1 i,j=H−T +1 − 2 H∑T ∑H + W (i, j)tr(Σ Σ ) H2 M τ τ+1 i=1 i=H−T +1 = IA + IB + IC .

We start from part IA, − − − { 1 H∑T H ∑M 2 H − t − M I = I(i ≤ t)I(j ≤ t)I(|i − j| ≥ M + 1) A H2 t − M − 1 i,j=1 t=M+2 t − M + I(i ≥ t + 1)I(j ≥ t + 1)I(|i − j| ≥ M + 1) H − t − M − 1 (t − M)(H − t − M) − I(i ≤ t)I(j ≥ t + 1)I(|i − j| ≥ M + 1) t(H − t) − 1 M(M + 1) 2 } (t − M)(H − t − M) − I(j ≤ t)I(i ≥ t + 1)I(|i − j| ≥ M + 1) tr(Σ2) − − 1 τ t(H t) 2 M(M + 1) − − − [ 1 H T∑M 2 H − t − M = (t − M)(t − M − 1) H2 t − M − 1 t=M+2 t − M + (H − T − t − M)(H − T − t − M − 1) H − t − M − 1 { }{ }] M(H − 3 M − 1 ) 1 + 2 − 1 + 2 2 t(H − T − t) − M(M + 1) tr(Σ2) − − 1 τ t(H t) 2 M(M + 1) 2 − ( 1 H∑T H − t − M + (t − M)(t − M − 1) H2 t − M − 1 t=H−T −M−1 { } M(H − 3 M − 1 ) + 2 − 1 + 2 2 t(H − t) − 1 M(M + 1) [ 2 ]) 1 × t(H − T − t) − (H − T − t){2M − (H − T − t) + 1} tr(Σ2) 2 τ − − 1 H ∑M 2 H − t − M + (H − T − M)(H − T − M − 1)tr(Σ2) H2 t − M − 1 τ t=H−T +1

92 (1) (2) (3) = IA + IA + IA , where

1 I(1) = (T − M)(T − M − 1) A H2 { − − − } H T∑M 2 H − 2M − 1 × − (H − T − 2M − 3) tr(Σ2) H − t − M − 1 τ t=M+2 1 + M(M + 1)(H − T − 2M − 3)tr(Σ2) H2 τ { ( ) − − − 1 1 1 H T∑M 2 H − 2M − 1 + 2M T − M − H2 2 2 H − t − M − 1 t=M+2 ( ) − − − } 3 1 H T∑M 2 T · t − 2M H − M − tr(Σ2) 2 2 t(H − t) − 1 M(M + 1) τ t=M+2 2 (11) (12) (13) = IA + IA + IA .

As H → ∞,

− − − ∫ ( ) H T∑M 2 1 H−T −M−2 1 H − 2M − 3 = dt = log . H − t − M − 1 H − t − M − 1 T + 1 t=M+2 M+2 Hence,

1 I(11) = (T − M)(T − M − 1) A H2 { ( ) } H − 2M − 3 × (H − 2M − 1) log − (H − T − 2M − 3) tr(Σ2), T + 1 τ and ( ) − − − { } 1 1 H T∑M 2 1 1 I(13) ≤ 2M(H − 2M − 1) T − M − − tr(Σ2) A 2 2 H − t − M − 1 H − t τ t=M+2 ≤ − − 2 2M(H 2M 1)(T + M + 2)tr(Στ ).

Therefore, we have [ 1 E(I(1)) = E (T − M)(T − M − 1) A H2 { ( ) }] H − 2M − 3 × (H − 2M − 1) log − (H − T − 2M − 3) tr(Σ2) T + 1 τ 1 + M(M + 1)(H − 2M − 1)tr(Σ2) H2 τ 1 } − M(M + 1)E(T + 2) tr(Σ2) H2 τ

93 1 { } + O (H − 2M − 1)E(T + M + 2) tr(Σ2). H2 τ

(2) For IA , − 1 H∑T I(2) = (H − t − M)(t − M)tr(Σ2) A H2 τ t=H−T −M−1 { } H−∑T −1 − 2 − − − 1 − − − 2 2 t(H T t) (H T t)(2M H + T + t + 1) tr(Στ ) H − − − 2 t=H( T M 1 ) 2 3 1 + M H − M − H2 2 2 − − H∑T 1 t(H − T − t) − 1 (H − T − t)(2M − H + T + t + 1) × 2 tr(Σ2) t(H − t) − 1 M(M + 1) τ t=H−T −M−1 2 (21) (22) (23) = IA + IA + IA , where 1 I(21) ≤ M(T + 1)(H − T − M)tr(Σ2), A H2 τ and 2 I(22) ≤ − (M − 1)(H − T + M 2 − 1)tr(Σ2). A H2 τ As H → ∞ and T ≥ 1, and for some constant c, ( ) 2 3 1 I(23) ≤ M(M − 1) H − M − A H2 2 2 (M + 1)(H − T − 1) × tr(Σ2) (H − T − M − 1)(T + 1) − 1 M(M + 1) τ ( 2 ) c 3 1 ≤ 2M(M − 1)(M + 1) H − M − tr(Σ2). H2 2 2 τ

Therefore, [ (2) 1 − − − − { − − − } E(IA ) = O M(H 2M 1)E(T M + 2) ME (T M 1)(T M) H2 ] − − − 2 + 2M(M 1)(M + 1)(H 2M 1) tr(Στ ).

(3) For IA , − − 1 H ∑M 2 1 I(3) = (H − T − M)(H − T − M − 1)(H − 2M − 1) tr(Σ2) A H2 t − M − 1 τ t=H−T +1 1 − (H − T − M)(H − T − M − 1)(T − M − 2)tr(Σ2) H2 τ 1 = (H − T − M)(H − T − M − 1)(H − 2M − 1) H2

94 − − ( ) H ∑M 2 1 1 × − tr(Σ2) t − M − 1 H − T − M − 1 τ t=H−T +1 1 + (T − M)(T − M − 2)(H − T − M)tr(Σ2) H2 τ (31) (32) = IA + IA .

We have

1 |I(31)| ≤ (H − T − M)(H − T − M − 1)(H − 2M − 1) A H2 (H − 2M − 3) − (H − T − M − 1) × (T − M − 2) tr(Σ2) (H − T + 1)(H − T − M − 1) τ 1 ≤ (T − M − 2)2(H − 2M − 1)tr(Σ2). H2 τ

Therefore,

1 { } E(I(3)) = E (T − M)(T − M − 2)(H − T − M) tr(Σ2) A H2 τ 1 { } + O E(T − M − 2)2(H − 2M − 1) tr(Σ2). H2 τ

Combining all the results and using Tmax = o(H) as H → ∞, we have

\begin{align*}
E(I_A) = \frac{1}{H}E\big\{(T-M)(T-M-1)\big\}\,\mathrm{tr}(\Sigma_\tau^2)\{1+o(1)\}.
\end{align*}

Next, we consider $I_B$:
\begin{align*}
I_B &= \frac{1}{H^2}\sum_{i,j=H-T+1}^{H}\sum_{t=M+2}^{H-M-2}
\Big\{\frac{H-t-M}{t-M-1}\,I(i\le t)I(j\le t)
+\frac{t-M}{H-t-M-1}\,I(i\ge t+1)I(j\ge t+1)\\
&\qquad-\frac{(t-M)(H-t-M)}{t(H-t)-\frac{1}{2}M(M+1)}\,
\big\{I(i\le t)I(j\ge t+1)+I(j\le t)I(i\ge t+1)\big\}\Big\}\,I(|i-j|\ge M+1)\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&= \frac{1}{H^2}\sum_{t=M+2}^{H-T}\frac{t-M}{H-t-M-1}\,(T-M)(T-M-1)\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad+\frac{1}{H^2}\sum_{t=H-T+1}^{H-T+M+1}\Big[\frac{t-M}{H-t-M-1}(H-t-M)(H-t-M-1)\\
&\qquad+\Big\{-1+\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}\Big\}
\,2\Big\{(t-H+T)(H-t)-\frac{1}{2}(t-H+T)(2M-t+H-T+1)\Big\}\Big]\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad+\frac{1}{H^2}\sum_{t=H-T+M+2}^{H-M-2}\Big[\frac{H-M-t}{t-M-1}(t-H+T-M)(t-H+T-M-1)
+\frac{t-M}{H-t-M-1}(H-t-M)(H-t-M-1)\\
&\qquad+\Big\{-1+\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}\Big\}
\,2\Big\{(t-H+T)(H-t)-\frac{1}{2}M(M+1)\Big\}\Big]\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&= I_B^{(1)}+I_B^{(2)}+I_B^{(3)}.
\end{align*}
By the same idea as for $I_A^{(11)}$,
\begin{align*}
E(I_B^{(1)}) = \frac{1}{H^2}(H-2M-1)\,E\Big\{(T-M)(T-M-1)\log\Big(\frac{H-2M-3}{T-M-1}\Big)\Big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)
-\frac{1}{H^2}E\big\{(T-M)(T-M-1)(H-T-M-1)\big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2).
\end{align*}
For $I_B^{(2)}$, we have
\begin{align*}
I_B^{(2)} &= \frac{1}{H^2}\sum_{t=H-T+1}^{H-T+M+1}
\big\{(t-M)(H-t-M)-2(t-H+T)(H-t)+(t-H+T)(2M-t+H-T+1)\big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad+\frac{1}{H^2}\sum_{t=H-T+1}^{H-T+M+1}
\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}
\big\{2(t-H+T)(H-t)-(t-H+T)(2M-t+H-T+1)\big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&= I_B^{(21)}+I_B^{(22)}.
\end{align*}

First,

\begin{align*}
E(I_B^{(21)}) &= \frac{1}{H^2}(M+1)(H-2M-1)E(T-M-1)\,\mathrm{tr}(\Sigma_{\tau+1}^2)
-\frac{1}{H^2}\cdot\frac{1}{2}M(M+1)(H-2M-1)\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad-\frac{1}{H^2}(M+1)E\big\{(T-M-1)(T-M)\big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2).
\end{align*}

Next,

\begin{align*}
|I_B^{(22)}| \le \frac{1}{H^2}M(M+1)^2\Big(H-\frac{3}{2}M-\frac{1}{2}\Big)(H-2M-1)\,
\frac{1}{(H-M-2)-\frac{1}{2}M(M+1)}\,\mathrm{tr}(\Sigma_{\tau+1}^2)
= O\Big\{\frac{1}{H^2}M(M+1)^2(H-2M-1)\Big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2).
\end{align*}
Hence,
\begin{align*}
E(I_B^{(2)}) &= \frac{1}{H^2}(M+1)(H-2M-1)E(T-M-1)\,\mathrm{tr}(\Sigma_{\tau+1}^2)
-\frac{1}{H^2}\cdot\frac{1}{2}M(M+1)(H-2M-1)\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad-\frac{1}{H^2}(M+1)E\big\{(T-M-1)(T-M)\big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)
+O\Big\{\frac{1}{H^2}M(M+1)^2(H-2M-1)\Big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2).
\end{align*}

Finally,

\begin{align*}
I_B^{(3)} &= \frac{1}{H^2}\sum_{t=H-T+M+2}^{H-M-2}\Big[\frac{H-M-t}{t-M-1}(t-H+T-M)(t-H+T-M-1)\\
&\qquad+(t-M)(H-t-M)-2\Big\{(t-H+T)(H-t)-\frac{1}{2}M(M+1)\Big\}\Big]\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad+\frac{2}{H^2}\sum_{t=H-T+M+2}^{H-M-2}
\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}
\Big\{(t-H+T)(H-t)-\frac{1}{2}M(M+1)\Big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&= I_B^{(31)}+I_B^{(32)},
\end{align*}
where
\begin{align*}
I_B^{(31)} &= \frac{1}{H^2}\sum_{t=H-T+M+2}^{H-M-2}\frac{H-M-t}{t-M-1}\,(H-T-M)(H-T-M-1)\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad+\frac{1}{H^2}\sum_{t=H-T+M+2}^{H-M-2}\Big\{\frac{H-M-t}{t-M-1}(t-2M-1)(t-2H+2T)
+(t-M)(H-t-M)-2(t-H+T)(H-t)+M(M+1)\Big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&= I_B^{(311)}+I_B^{(312)}.
\end{align*}

Similar to $E(I_A^{(3)})$,
\begin{align*}
E(I_B^{(311)}) = \frac{1}{H^2}E\big\{(T-M)(T-2M-3)(H-T-M)\big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)
+O\Big[\frac{1}{H^2}(H-2M-1)E\{(T-2M-3)(T-M-2)\}\Big]\,\mathrm{tr}(\Sigma_{\tau+1}^2),
\end{align*}
and
\begin{align*}
I_B^{(312)} \le \frac{1}{H^2}\,2M(H-2M-1)(T-2M-3)\,\mathrm{tr}(\Sigma_{\tau+1}^2).
\end{align*}
Therefore,
\begin{align*}
E(I_B^{(31)}) &= \frac{1}{H^2}E\big\{(T-M)(T-2M-3)(H-T-M)\big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)
+O\Big[\frac{1}{H^2}(H-2M-1)E\{(T-2M-3)(T-M-2)\}\Big]\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad+O\Big\{\frac{1}{H^2}(H-2M-1)E(T-2M-3)\Big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2).
\end{align*}
Moreover,
\begin{align*}
E(I_B^{(32)}) = O\Big\{\frac{1}{H^2}\Big(H-\frac{3}{2}M-\frac{1}{2}\Big)E(T-2M-4)\Big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2).
\end{align*}
Combining all the results, we have
\begin{align*}
E(I_B) &= \frac{1}{H^2}(H-2M-1)E\Big\{(T-M)(T-M-1)\log\Big(\frac{H-2M-3}{T-M-1}\Big)\Big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad+O\Big[\frac{1}{H^2}(H-2M-1)E\{(T-2M-3)(T-M-2)\}\Big]\,\mathrm{tr}(\Sigma_{\tau+1}^2)
+O\Big\{\frac{1}{H^2}(H-2M-1)E(T-2M-1)\Big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&\quad+\frac{1}{H^2}O(H-2M-1)\,\mathrm{tr}(\Sigma_{\tau+1}^2)
+O\Big[\frac{1}{H^2}E\{(T-M)(T-M-1)\}\Big]\,\mathrm{tr}(\Sigma_{\tau+1}^2)\\
&= \frac{1}{H}E\big\{(T-M)(T-M-1)\big\}\,\mathrm{tr}(\Sigma_{\tau+1}^2)\{1+o(1)\}.
\end{align*}

Finally, we consider $I_C$:
\begin{align*}
I_C &= \frac{2}{H^2}\sum_{i=1}^{H-T}\sum_{j=H-T+1}^{H}\sum_{t=M+2}^{H-M-2}
\Big\{\frac{H-t-M}{t-M-1}\,I(i\le t)I(j\le t)
+\frac{t-M}{H-t-M-1}\,I(i\ge t+1)I(j\ge t+1)\\
&\qquad-\frac{(t-M)(H-t-M)}{t(H-t)-\frac{1}{2}M(M+1)}\,I(i\le t)I(j\ge t+1)\Big\}\,
I(|i-j|\ge M+1)\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\\
&= \frac{2}{H^2}\sum_{t=M+2}^{H-T-M}\Big[\frac{t-M}{H-t-M-1}\Big\{(H-T-t)T-\frac{1}{2}M(M+1)\Big\}
+\Big\{-1+\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}\Big\}T\,t\Big]\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\\
&\quad+\frac{2}{H^2}\sum_{t=H-T-M+1}^{H-T}\Big[\frac{t-M}{H-t-M-1}
\Big\{(H-T-t)T-\frac{1}{2}(H-T-t)(2M-H+T+t+1)\Big\}\\
&\qquad+\Big\{-1+\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}\Big\}
\Big\{T\,t-\frac{1}{2}(t-H+T+M+1)(t-H+T+M)\Big\}\Big]\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\\
&\quad+\frac{2}{H^2}\sum_{t=H-T+1}^{H-T+M-1}\Big[\frac{H-M-t}{H-t-M-1}
\Big\{(H-T)(t-H+T)-\frac{1}{2}(t-H+T)(2M-t+H-T+1)\Big\}\\
&\qquad+\Big\{-1+\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}\Big\}
\Big\{(H-T)(H-t)-\frac{1}{2}(H-T+M+1-t)(H-T+M-t)\Big\}\Big]\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\\
&\quad+\frac{2}{H^2}\sum_{t=H-T+M}^{H-M-2}\Big[\frac{H-M-t}{t-M-1}
\Big\{(H-T)(t-H+T)-\frac{1}{2}M(M+1)\Big\}
+\Big\{-1+\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}\Big\}(H-T)(H-t)\Big]\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1}).
\end{align*}
Based on derivations similar to those for $E(I_A)$ and $E(I_B)$, we can obtain
\begin{align*}
E(I_C) &= -\frac{2}{H^2}(H-2M-1)E\Big\{(T-M)(T-M-1)\log\Big(\frac{H-2M-3}{T-1}\Big)\Big\}\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\\
&\quad+\frac{1}{H^2}M(M+1)\Big(H-\frac{3}{2}M-\frac{1}{2}\Big)E\Big\{\log\Big(\frac{H-2M-3}{T-1}\Big)\Big\}\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\{1+o(1)\}\\
&\quad+O\Big[\frac{1}{H^2}(H-2M-1)E\{(T-M)(T-M-1)\}\Big]\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})
+O\Big\{\frac{1}{H^2}(H-2M-1)E(T-M)\Big\}\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\\
&\quad+\frac{1}{H^2}O(H-2M-1)\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})
+O\Big[\frac{1}{H^2}E\{(T-M)(T-M-1)\}\Big]\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})
+O\Big\{\frac{1}{H^2}E(T-M-1)\Big\}\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\\
&= -\frac{2}{H^2}(H-2M-1)E\Big[\Big\{(T-M)(T-M-1)-\frac{1}{2}M(M+1)\Big\}\log\Big(\frac{H-2M-3}{T-1}\Big)\Big]\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\{1+o(1)\}\\
&\ge -\frac{2}{H^2}(H-2M-1)E\Big\{(T-M)(T-M-1)\log\Big(\frac{H-2M-3}{T-1}\Big)\Big\}\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\{1+o(1)\}\\
&= -\frac{2}{H}E\big\{(T-M)(T-M-1)\big\}\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\{1+o(1)\}.
\end{align*}

As a result,

\begin{align*}
E(I) \ge \frac{1}{H}E\big\{(T-M)(T-M-1)\big\}\,\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}\{1+o(1)\}.
\end{align*}

This completes the proof of Lemma 3.3.

Lemma 3.4 Under the same conditions as in Theorem 3.2,
\begin{align*}
E\Big\{\frac{1}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i,j=1}^{H} W_M(i,j)
(y_{T+i,r_1r_2}-\sigma_{T+i,r_1r_2})(y_{T+j,r_1r_2}-\sigma_{T+j,r_1r_2})\Big\}
= O\big[\log(H)\,\mathrm{tr}\{(\Sigma_{\tau+1}-\Sigma_\tau)^2\}\big].
\end{align*}

Proof. We first write

\begin{align*}
II &\equiv \frac{1}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i,j=1}^{H} W_M(i,j)
(y_{T+i,r_1r_2}-\sigma_{T+i,r_1r_2})(y_{T+j,r_1r_2}-\sigma_{T+j,r_1r_2})\\
&= \frac{1}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i,j=1}^{H}\sum_{t=M+2}^{H-M-2}
\Big\{\frac{H-t-M}{t-M-1}\,I(i\le t)I(j\le t)
+\frac{t-M}{H-t-M-1}\,I(i\ge t+1)I(j\ge t+1)\\
&\qquad-2\,\frac{(t-M)(H-t-M)}{t(H-t)-\frac{1}{2}M(M+1)}\,I(i\le t)I(j\ge t+1)\Big\}\,
I(|i-j|\ge M+1)\,y^{*}_{i,r_1r_2}y^{*}_{j,r_1r_2}\\
&= \mathrm{I} + \mathrm{II} + \mathrm{III}.
\end{align*}
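For readers who want to sanity-check the book-keeping that follows, the sketch below (our own illustrative transcription, not code from the dissertation) evaluates the weight $W_M(i,j)$ read off from the indicator expansion above; the symmetric handling of the two cross terms is our choice.

```python
# Numerical transcription (ours) of the weight W_M(i, j) implied by the
# indicator expansion above; divide the resulting quadratic form by H^2 to
# match the normalization of the statistic.
def W(i, j, H, M):
    if abs(i - j) < M + 1:
        return 0.0
    total = 0.0
    for t in range(M + 2, H - M - 1):             # t = M+2, ..., H-M-2
        if i <= t and j <= t:                     # both indices left of t
            total += (H - t - M) / (t - M - 1)
        elif i >= t + 1 and j >= t + 1:           # both indices right of t
            total += (t - M) / (H - t - M - 1)
        else:                                     # indices straddle t
            total -= (t - M) * (H - t - M) / (t * (H - t) - M * (M + 1) / 2)
    return total
```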

As $H\to\infty$,
\begin{align*}
\sum_{t=j}^{H-M-2}\frac{1}{t-M-1}=\int_{j}^{H-M-2}\frac{1}{t-M-1}\,dt
=\log\Big(\frac{H-2M-3}{j-M-1}\Big).
\end{align*}
Hence,
\begin{align*}
\mathrm{I} &= \frac{2}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i=1}^{H-2M-3}\sum_{j=i+M+1}^{H-M-2}\sum_{t=j}^{H-M-2}
\frac{H-t-M}{t-M-1}\,y^{*}_{i,r_1r_2}y^{*}_{j,r_1r_2}\\
&= \frac{2}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i=1}^{H-2M-3}\sum_{j=i+M+1}^{H-M-2}
\Big\{(H-2M-1)\log\Big(\frac{H-2M-3}{j-M-1}\Big)-(H-j-M-1)\Big\}\,y^{*}_{i,r_1r_2}y^{*}_{j,r_1r_2}.
\end{align*}
Similarly,
\begin{align*}
\mathrm{II} &= \frac{2}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i=1}^{H-M-1}\sum_{j=i+M+1}^{H}\sum_{t=M+2}^{i-1}
\frac{t-M}{H-t-M-1}\,y^{*}_{i,r_1r_2}y^{*}_{j,r_1r_2}\\
&= \frac{2}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i=1}^{H-2M-3}\sum_{j=i+M+1}^{H-M-2}
\Big\{(H-2M-1)\log\Big(\frac{H-2M-3}{H-i-M}\Big)-(i-M-2)\Big\}\,y^{*}_{i,r_1r_2}y^{*}_{j,r_1r_2}.
\end{align*}

For $\mathrm{III}$, we have
\begin{align*}
\mathrm{III} = \frac{2}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i=1}^{H-M-1}\sum_{j=i+M+1}^{H}\sum_{t=M+2}^{H-M-2}
\Big\{-1+\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}\Big\}\,
I(i\le t)I(t\le j-1)\,y^{*}_{i,r_1r_2}y^{*}_{j,r_1r_2},
\end{align*}
which, after splitting the range of $(i,j)$, consists of six sums of the same form, each weighted by a factor of the type
$-(j-i)+\sum_{t} M(H-\frac{3}{2}M-\frac{1}{2})/\{t(H-t)-\frac{1}{2}M(M+1)\}$ over the corresponding range of $t$.
Note that, as $t\ge M+2$ and $H\to\infty$,
\begin{align*}
\sum_{t=M+2}^{j-1}\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)}
\le \sum_{t=M+2}^{j-1}\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}Mt}
\le \sum_{t=M+2}^{j-1}\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t-M)}
= \frac{M(H-\frac{3}{2}M-\frac{1}{2})}{H-M}\log\Big\{\frac{(j-1)(H-2M-2)}{(M+2)(H-M-j+1)}\Big\}.
\end{align*}
Hereby, for $j=M+3,\ldots,H-M-1$,
\begin{align*}
\sum_{t=M+2}^{j-1}\frac{M(H-\frac{3}{2}M-\frac{1}{2})}{t(H-t)-\frac{1}{2}M(M+1)} = o(H-2M-3),
\end{align*}
and the same holds for the other terms. Combining $\mathrm{I}$, $\mathrm{II}$, and $\mathrm{III}$ together, we have

(after grouping the five index regions and absorbing the $\{1+o(1)\}$ factors)
\begin{align*}
\mathrm{I}+\mathrm{II}+\mathrm{III}
&= \frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=M+3}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\,y^{*}_{i,r_1r_2}y^{*}_{j,r_1r_2}
+\frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=1}^{H-2M-3}\sum_{j=i+M+1}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\,y^{*}_{i,r_1r_2}y^{*}_{j,r_1r_2}\\
&\quad-\frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=1}^{H-M-1}\sum_{j=i+M+1}^{H}
\frac{H-2M-3}{H-2M-1}\,y^{*}_{i,r_1r_2}y^{*}_{j,r_1r_2}\\
&= II_A + II_B + II_C.
\end{align*}

We first consider IIA, whose expectation is

\begin{align*}
E(II_A) &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+M+3}^{H+T-M-1}\sum_{j=i+M+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big)(y_{i,r_1r_2}-\sigma_{i,r_1r_2})(y_{j,r_1r_2}-\sigma_{j,r_1r_2})\Big\}\\
&= \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+M+3}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
+\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=H+1}^{H+T-M-1}\sum_{j=i+M+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&\quad+\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+M+3}^{H}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{T-l}\Big)\tilde y_{H-M+l}\tilde y_{H-M+l+q}\Big\}\\
&= E(II_A^{(1)})+E(II_A^{(2)})+E(II_A^{(3)}).
\end{align*}
For some constant $c$,
\begin{align*}
E(II_A^{(1)}) &\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+M+3}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(II_A^{(11)})-E(II_A^{(12)}).
\end{align*}
Note that
\begin{align*}
E(II_A^{(11)}) = \frac{2c}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=M+3}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)
E(y_{i,r_1r_2}-\sigma_{\tau,r_1r_2})E(y_{j,r_1r_2}-\sigma_{\tau,r_1r_2}) = 0.
\end{align*}
Meanwhile, we can write
\begin{align*}
E(II_A^{(12)}) &= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=i+M+1}^{H}\log(H-2M-3)\,\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=i+M+1}^{H}\log(H-i-M)\,\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(II_A^{(121)})+E(II_A^{(122)}).
\end{align*}

Consequently,

\begin{align*}
E(II_A^{(121)}) &= \frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i=M+3}^{T+1}\sum_{j=i+M+1}^{T+M+2}\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big)\\
&\quad+\frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i=M+3}^{T+M+2}\sum_{j=T+M+3}^{H}\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}\tilde y_{T+2+l}\tilde y_{T+2+l+q}\Big)\\
&= E(II_A^{(1211)})+E(II_A^{(1212)}),
\end{align*}
where
\begin{align*}
E(II_A^{(1211)}) &= \frac{c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i,j=M+3}^{T+M+2}\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big)
-\frac{c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i=M+3}^{T+M+2}\tilde y_{i,r_1r_2}^2\Big)\\
&\quad-\frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i=M+3}^{T+M+2}\sum_{q=1}^{M}\tilde y_{i,r_1r_2}\tilde y_{i+q,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}\tilde y_{T+2+l}\tilde y_{T+2+l+q}\Big)\\
&= E(II_A^{(12111)})-E(II_A^{(12112)})-E(II_A^{(12113)}),
\end{align*}
and by Theorem 10 of Merlevède et al. (2006), for a positive constant $D$,
\begin{align*}
E(II_A^{(12111)}) \le \frac{c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\max_{M+3\le t\le H}\Big|\sum_{i=M+3}^{t}\tilde y_{i,r_1r_2}\Big|^2\Big)
\le \frac{c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}2D(H-M-2)E(\tilde y_{1,r_1r_2}^2)
= O\big\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\big\},
\end{align*}
where $E(\tilde y_{1,r_1r_2}^2)<\infty$, as $E(y_{i,r_1r_2}^2)<\infty$ for any $i\ge 1$. For the others,
\begin{align*}
E(II_A^{(12112)}) \le \frac{c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i=M+3}^{H}\tilde y_{i,r_1r_2}^2\Big)
= \frac{c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}(H-M-2)E(\tilde y_{1,r_1r_2}^2)
= O\big\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\big\}.
\end{align*}
Based on a similar idea, it can be shown that $E(II_A^{(12113)}) = O\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\}$.
Hence, $E(II_A^{(1211)}) = O\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\}$.
Next, we study $E(II_A^{(1212)})$. Note that

\begin{align*}
E(II_A^{(1212)}) &= \frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i=M+3}^{T+M+2}\sum_{j=T+M+3}^{H}\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big)
-\frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}\tilde y_{T+2+l}\tilde y_{T+2+l+q}\Big)\\
&= E(II_A^{(12121)})-E(II_A^{(12122)}),
\end{align*}
where
\begin{align*}
|E(II_A^{(12121)})| &\le \frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\Big|\sum_{i=M+3}^{T+M+2}\tilde y_{i,r_1r_2}\Big|\cdot\Big|\sum_{j=T+M+3}^{H}\tilde y_{j,r_1r_2}\Big|\Big)\\
&\le \frac{2c}{H}\log(H)\sum_{r_1,r_2=1}^{p}
\sqrt{E\Big(\sum_{i=M+3}^{T+M+2}\tilde y_{i,r_1r_2}\Big)^2}\cdot
\sqrt{E\Big(\sum_{j=T+M+3}^{H}\tilde y_{j,r_1r_2}\Big)^2},
\end{align*}
in which
\begin{align*}
E\Big(\sum_{i=M+3}^{T+M+2}\tilde y_{i,r_1r_2}\Big)^2
\le E\Big(\max_{M+3\le t\le H}\Big|\sum_{i=M+3}^{t}\tilde y_{i,r_1r_2}\Big|^2\Big)
\le 2D(H-M-2)E(\tilde y_{1,r_1r_2}^2),
\end{align*}
and similarly,
\begin{align*}
E\Big(\sum_{j=T+M+3}^{H}\tilde y_{j,r_1r_2}\Big)^2
\le E\Big(\max_{M+3\le t\le H}\Big|\sum_{i=t}^{H}\tilde y_{i,r_1r_2}\Big|^2\Big)
\le 2D(H-M-2)E(\tilde y_{1,r_1r_2}^2),
\end{align*}
and
\begin{align*}
|E(II_A^{(12122)})| \le \frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
|\tilde y_{T+2+l,r_1r_2}\tilde y_{T+2+l+q,r_1r_2}|\Big)
\le \frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i=1}^{H-M}\sum_{q=1}^{M}|\tilde y_{i,r_1r_2}\tilde y_{i+q,r_1r_2}|\Big)
= O\big\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\big\}.
\end{align*}
Hence, $E(II_A^{(1212)}) = O\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\}$. As a result, we have shown $E(II_A^{(121)}) = O\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\}$.
As $H-i-M = H-2M-3,\ldots,H-T-2M-2$ for $i=M+3,\ldots,T+M+2$, and $T_{\max}=o(H)$, we have, for some constant $c_1$,
\begin{align*}
E(II_A^{(122)}) \le \frac{2c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=i+M+1}^{H}\log(H-2M-3)\,\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}.
\end{align*}
Hence, by the same idea as for $E(II_A^{(121)})$, we have $E(II_A^{(122)}) = O\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\}$. Combining the above results, we obtain $E(II_A^{(1)}) = O\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\}$.

As $H-i+T-M = T-M-1,\ldots,1$ for $i=H+1,\ldots,H+T-M-1$, and $T_{\max}=o(H)$, we have
\begin{align*}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big) = \log(H-2M-3)\{1+o(1)\},
\end{align*}
as $H\to\infty$. Hence, for some constant $c_2$,
\begin{align*}
E(II_A^{(2)}) &\le \frac{2c_2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=H+1}^{H+T-M-1}\sum_{j=i+M+1}^{H+T}\log(H-2M-3)\,\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= \frac{c_2}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i,j=H+1}^{H+T}\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big)
-\frac{c_2}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i=H+1}^{H+T}\tilde y_{i,r_1r_2}^2\Big)\\
&\quad-\frac{2c_2}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big(\sum_{i=H+1}^{H+T}\sum_{q=1}^{M}\tilde y_{i,r_1r_2}\tilde y_{i+q,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}\tilde y_{H+T-M+l}\tilde y_{H+T-M+l+q}\Big)\\
&= E(II_A^{(21)})+E(II_A^{(22)})+E(II_A^{(23)}).
\end{align*}
By Lemma 3.1 (vii),
\begin{align*}
E(II_A^{(21)}) = \frac{c_2}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
\big[E(T)\gamma_{\tau+1,r_1r_2}+O\{E(T)\sigma^2_{\tau+1,r_1r_2}\}\big],
\end{align*}
where $\gamma_{\tau+1,r_1r_2}=\mathrm{Var}(y_{H+1,r_1r_2})
+2\sum_{q=1}^{M}\mathrm{Cov}(y_{H+1,r_1r_2},y_{H+1+q,r_1r_2})$.
As $E(y_{i,r_1r_2}^2)<\infty$ for any $i\ge 1$, and $E(T)=o(H)$, we have
$E(II_A^{(21)}) = o\{\log(H)\,\mathrm{tr}(\Sigma_{\tau+1}^2)\}$.
By Lemma 3.1 (v) and a similar idea to $E(II_A^{(21)})$, we obtain
\begin{align*}
E(II_A^{(22)}) = \frac{c_2}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
\big[E(T)E(\tilde y_{H+1,r_1r_2}^2)+O\{E(T)\sigma^2_{\tau+1,r_1r_2}\}\big]
= o\big\{\log(H)\,\mathrm{tr}(\Sigma_{\tau+1}^2)\big\}.
\end{align*}
By Lemma 3.2, we have $E(II_A^{(23)}) = o\{\log(H)\,\mathrm{tr}(\Sigma_{\tau+1}^2)\}$.
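The quantity $\gamma_{\tau+1,r_1r_2}$ above is the usual $M$-dependent long-run variance. As an aside, a plug-in estimate of such a quantity can be computed as in the following sketch (ours; the function name and interface are assumptions, not the dissertation's code).

```python
import numpy as np

# Plug-in estimate (our sketch) of gamma = Var(y_1) + 2 * sum_{q=1}^{M} Cov(y_1, y_{1+q})
# for an M-dependent stationary sequence y.
def long_run_variance(y, M):
    y = np.asarray(y, dtype=float)
    n = len(y)
    yc = y - y.mean()
    gamma = yc @ yc / n                               # lag-0 autocovariance
    for q in range(1, M + 1):
        gamma += 2.0 * (yc[:n - q] @ yc[q:]) / n      # doubled lag-q autocovariance
    return gamma
```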

For $E(II_A^{(3)})$,
\begin{align*}
E(II_A^{(3)}) &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+M+3}^{H}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{T-l}\Big)\tilde y_{H-M+l,r_1r_2}\tilde y_{H-M+l+q,r_1r_2}\Big\},
\end{align*}
where
\begin{align*}
\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{T-l}\Big)\tilde y_{H-M+l,r_1r_2}\tilde y_{H-M+l+q,r_1r_2}\Big\}
\le \frac{2c}{H}\log(H)\sum_{r_1,r_2=1}^{p}\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
E|\tilde y_{H-M+l,r_1r_2}\tilde y_{H-M+l+q,r_1r_2}|
= o\Big\{\frac{1}{\sqrt H}\,\mathrm{tr}(\Sigma_{\tau+1}\Sigma_\tau)\Big\}.
\end{align*}
Hence, up to this negligible term,
\begin{align*}
E(II_A^{(3)}) = \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+M+3}^{H}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\},
\end{align*}
and, similar to $E(II_A^{(1)})$, we have
\begin{align*}
E(II_A^{(3)}) &\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+M+3}^{H}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+M+3}^{H-M-1}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
+\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=H-M}^{H}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(II_A^{(31)})+E(II_A^{(32)}),
\end{align*}

where
\begin{align*}
E(II_A^{(31)}) &= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{H-M-1}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(II_A^{(311)})-E(II_A^{(312)}).
\end{align*}

Consequently,

\begin{align*}
|E(II_A^{(311)})| &\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\Big|\sum_{i=M+3}^{H-M-1}\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\Big|
\cdot\Big|\sum_{j=H+1}^{H+T}\tilde y_{j,r_1r_2}\Big|\Big\}\\
&\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
\sqrt{E\Big(\sum_{i=M+3}^{H-M-1}\log\Big(\frac{H-2M-3}{H-i-M}\Big)\tilde y_{i,r_1r_2}\Big)^2}
\cdot\sqrt{E\Big(\sum_{j=H+1}^{H+T}\tilde y_{j,r_1r_2}\Big)^2}\\
&\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
\sqrt{\log^2(H)(H-2M-3)|\gamma_{\tau,r_1r_2}|}
\cdot\big[\sqrt{E(T)|\gamma_{\tau+1,r_1r_2}|}+O\{\sqrt{E(T)}\,\sigma_{\tau+1,r_1r_2}\}\big],
\end{align*}
where $\gamma_{\tau,r_1r_2}=\mathrm{Var}(y_{1,r_1r_2})
+2\sum_{q=1}^{M}\mathrm{Cov}(y_{1,r_1r_2},y_{1+q,r_1r_2})$,
and $\gamma_{\tau,r_1r_2},\gamma_{\tau+1,r_1r_2}<\infty$. As $E(T)=o(H)$, we have
\begin{align*}
E(II_A^{(311)}) = o\big\{\log(H)\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\big\}.
\end{align*}
And we can write $E(II_A^{(312)})$ as
\begin{align*}
E(II_A^{(312)}) &= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=H+1}^{H+T}\log(H-2M-3)\,\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=H+1}^{H+T}\log(H-i-M)\,\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(II_A^{(3121)})-E(II_A^{(3122)}).
\end{align*}
Consequently,
\begin{align*}
|E(II_A^{(3121)})| &\le \frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big\{\Big|\sum_{i=M+3}^{T+M+2}\tilde y_{i,r_1r_2}\Big|\cdot\Big|\sum_{j=H+1}^{H+T}\tilde y_{j,r_1r_2}\Big|\Big\}\\
&\le \frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
\sqrt{E\Big(\max_{M+3\le t\le H}\Big|\sum_{i=M+3}^{t}\tilde y_{i,r_1r_2}\Big|^2\Big)}
\cdot\sqrt{E\Big(\sum_{j=H+1}^{H+T}\tilde y_{j,r_1r_2}\Big)^2}\\
&\le \frac{2c}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
\sqrt{2D(H-M-2)E(\tilde y_{1,r_1r_2}^2)}
\cdot\big[\sqrt{E(T)|\gamma_{\tau+1,r_1r_2}|}+O\{\sqrt{E(T)}\,\sigma_{\tau+1,r_1r_2}\}\big]\\
&= o\big\{\log(H)\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\big\}.
\end{align*}
As $H-i-M=H-2M-3,\ldots,H-T-2M-2$ for $i=M+3,\ldots,T+M+2$, and $T_{\max}=o(H)$, for some constant $c_1$ we have
\begin{align*}
E(II_A^{(3122)}) \le \frac{2c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=H+1}^{H+T}\log(H-2M-3)\,\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
= o\big\{\log(H)\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\big\}.
\end{align*}
Hence, $E(II_A^{(31)}) = o\{\log(H)\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\}$.
As $H-i+T-M=T,\ldots,T-M$ for $i=H-M,\ldots,H$, and $T_{\max}=o(H)$,
\begin{align*}
E(II_A^{(32)}) \le \frac{2c_2}{H}\log(H-2M-3)\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=H-M}^{H}\sum_{j=H+1}^{H+T}\tilde y_{i,r_1r_2}\tilde y_{j,r_1r_2}\Big\}.
\end{align*}
Using a similar idea to $E(II_A^{(311)})$, we have $E(II_A^{(32)}) = o\{\log(H)\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\}$.
As a result, we have shown that $E(II_A^{(3)}) = o\{\log(H)\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\}$.
Combining all the results, we have
\begin{align*}
E(II_A) = O\big[\log(H)\,\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}\big].
\end{align*}
By similar derivations, we can show
$E(II_B) = O[\log(H)\,\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}]$ and
$E(II_C) = O[\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}]$.
In summary, $E(II) = O[\log(H)\,\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}]$.
This completes the proof of Lemma 3.4.

Lemma 3.5 Under the same conditions as in Theorem 3.2,
\begin{align*}
E\Big\{\frac{2}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i,j=1}^{H}
W_M(i,j)\,\sigma_{T+i,r_1r_2}(y_{T+j,r_1r_2}-\sigma_{T+j,r_1r_2})\Big\}
= o\big[\sqrt H\,\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}\big].
\end{align*}
Proof. We first write
\begin{align*}
III \equiv \frac{2}{H^2}\sum_{r_1,r_2=1}^{p}\sum_{i,j=1}^{H}
W_M(i,j)\,\sigma_{T+i,r_1r_2}(y_{T+j,r_1r_2}-\sigma_{T+j,r_1r_2})
= III_1 + III_2,
\end{align*}
where
\begin{align*}
III_1 &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=M+3}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\sigma_{T+i,r_1r_2}(y_{T+j,r_1r_2}-\sigma_{T+j,r_1r_2})\\
&\quad+\frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=1}^{H-2M-3}\sum_{j=i+M+1}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{T+i,r_1r_2}(y_{T+j,r_1r_2}-\sigma_{T+j,r_1r_2})\\
&\quad-\frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=1}^{H-M-1}\sum_{j=i+M+1}^{H}
\frac{H-2M-3}{H-2M-1}\,\sigma_{T+i,r_1r_2}(y_{T+j,r_1r_2}-\sigma_{T+j,r_1r_2}),
\end{align*}
and $III_2$ is defined analogously, with the roles of $\sigma_{T+i,r_1r_2}$ and
$(y_{T+i,r_1r_2}-\sigma_{T+i,r_1r_2})$ interchanged.

We study $III_1$ first, and write
\begin{align*}
III_1 = III_{1,A} + III_{1,B} - III_{1,C},
\end{align*}
corresponding to the three sums above. To simplify the notation, let
$III_A \equiv III_{1,A}$, $III_B \equiv III_{1,B}$, and $III_C \equiv III_{1,C}$. Then
\begin{align*}
III_A &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=T+M+3}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}
+\frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=H+1}^{H+T-M-1}\sum_{j=i+M+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big)\sigma_{\tau+1,r_1r_2}\tilde y_{j,r_1r_2}\\
&\quad+\frac{2}{H}\sum_{r_1,r_2=1}^{p}
\Big\{\sum_{i=T+M+3}^{H}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T-M}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{T-l}\Big)\sigma_{\tau,r_1r_2}\tilde y_{H-M+l+q,r_1r_2}\Big\}\\
&= III_A^{(1)} + III_A^{(2)} + III_A^{(3)}.
\end{align*}

Note that for some constant c, we have

\begin{align*}
E(III_A^{(1)}) &\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+M+3}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(III_A^{(11)})-E(III_A^{(12)}),
\end{align*}
where $E(III_A^{(11)})=0$. We study $E(III_A^{(12)})$:
\begin{align*}
E(III_A^{(12)}) &= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=i+M+1}^{T+2M+3}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
+\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+3}^{T+M+2}\sum_{j=T+2M+4}^{H}
\log\Big(\frac{H-2M-3}{H-i-M}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(III_A^{(121)})+E(III_A^{(122)}),
\end{align*}
where
\begin{align*}
|E(III_A^{(121)})| \le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
E\Big\{\sum_{i=M+3}^{T+M+2}\log\Big(\frac{H-2M-3}{H-i-M}\Big)
\sum_{j=i+M+1}^{T+2M+3}|\tilde y_{j,r_1r_2}|\Big\}.
\end{align*}
As $H\to\infty$ and $T_{\max}=o(H)$,
\begin{align*}
\sum_{i=M+3}^{T+M+2}\log\Big(\frac{H-2M-3}{H-i-M}\Big)
=\int_{H-T-2M-2}^{H-2M-3}\log\Big(\frac{H-2M-3}{x}\Big)dx = O(1).
\end{align*}
Hence, as $E(y_{i,r_1r_2}^2)<\infty$ for any $i\ge 1$, for some constant $c_1$,
\begin{align*}
|E(III_A^{(121)})| \le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
E\Big(\sum_{j=2M+4}^{T+2M+3}|\tilde y_{j,r_1r_2}|\Big)
\le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|
\sum_{j=2M+4}^{H}E|\tilde y_{j,r_1r_2}|
= O\big\{\mathrm{tr}(\Sigma_\tau^2)\big\}.
\end{align*}
Similarly,
\begin{align*}
|E(III_A^{(122)})| \le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
E\Big(\sum_{j=T+2M+4}^{H}|\tilde y_{j,r_1r_2}|\Big)
\le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|
\sum_{j=2M+4}^{H}E|\tilde y_{j,r_1r_2}|
= O\big\{\mathrm{tr}(\Sigma_\tau^2)\big\}.
\end{align*}
Hence, $|E(III_A^{(12)})| = O\{\mathrm{tr}(\Sigma_\tau^2)\}$. As a result, we have shown
\begin{align*}
E(III_A^{(1)}) = O\big\{\mathrm{tr}(\Sigma_\tau^2)\big\}.
\end{align*}

For some constants $c$ and $c_1$, we have
\begin{align*}
E(III_A^{(2)}) &\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=H+1}^{H+T-M-1}\sum_{j=i+M+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T+1}\Big)\sigma_{\tau+1,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&\le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i,j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T+1}\Big)\sigma_{\tau+1,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T+1}\Big)\sigma_{\tau+1,r_1r_2}\tilde y_{i,r_1r_2}\Big\}\\
&\quad-\frac{2c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=H+1}^{H+T}\sum_{q=1}^{M}
\log\Big(\frac{H-2M-3}{H-i+T+1}\Big)\sigma_{\tau+1,r_1r_2}\tilde y_{i+q,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{M-l+1}\Big)\sigma_{\tau+1,r_1r_2}\tilde y_{H+T-M+l+q,r_1r_2}\Big\}\\
&= E(III_A^{(21)})-E(III_A^{(22)})-E(III_A^{(23)}).
\end{align*}
For $i=H+1,\ldots,H+T$, $H-i+T+1=T,\ldots,1$, all of smaller order than $H$. By Lemma 3.1 (v), we have
\begin{align*}
E(III_A^{(22)}) \le \frac{c_2}{H}\sum_{r_1,r_2=1}^{p}\sigma_{\tau+1,r_1r_2}\log(H)\,
E\Big(\sum_{i=H+1}^{H+T}\tilde y_{i,r_1r_2}\Big)
= \frac{c_2}{H}\sum_{r_1,r_2=1}^{p}\sigma_{\tau+1,r_1r_2}\log(H)\,
O\{\sqrt{E(T)}\,\sigma_{\tau+1,r_1r_2}\}
= o\big\{\log(H)\,\mathrm{tr}(\Sigma_{\tau+1}^2)/\sqrt H\big\},
\end{align*}
and by a similar idea and Lemma 3.2,
$E(III_A^{(23)}) = o\{\log(H)\,\mathrm{tr}(\Sigma_{\tau+1}^2)/\sqrt H\}$.

For $E(III_A^{(21)})$, note that
\begin{align*}
E(III_A^{(21)}) &= \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+2M+4}^{H+T}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T+1}\Big)\sigma_{\tau+1,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+2M+4}^{H}\sum_{j=H+1}^{H+T}
\log\Big(\frac{H-2M-3}{H-i+T+1}\Big)\sigma_{\tau+1,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(III_A^{(211)})-E(III_A^{(212)}),
\end{align*}
and
\begin{align*}
|E(III_A^{(211)})| \le \frac{c}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|\,
E\Big[\Big\{\sum_{i=T+2M+4}^{H+T}\log\Big(\frac{H-2M-3}{H-i+T+1}\Big)\Big\}
\cdot\Big|\sum_{j=H+1}^{H+T}\tilde y_{j,r_1r_2}\Big|\Big].
\end{align*}
As $H\to\infty$,
\begin{align*}
\sum_{i=T+2M+4}^{H+T}\log\Big(\frac{H-2M-3}{H-i+T+1}\Big)
=\int_{1}^{H-2M-3}\log\Big(\frac{H-2M-3}{x}\Big)dx
=(H-2M-4)-\log(H-2M-3)=O(H).
\end{align*}
Hence, by Lemma 3.1 (vi),
\begin{align*}
|E(III_A^{(211)})| \le c\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|
\sqrt{E\Big(\sum_{j=H+1}^{H+T}\tilde y_{j,r_1r_2}\Big)^2}
= c\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|
\big[\sqrt{E(T)|\gamma_{\tau+1,r_1r_2}|}+O\{\sqrt{E(T)}\,\sigma_{\tau+1,r_1r_2}\}\big]
= o\big\{\sqrt H\,\mathrm{tr}(\Sigma_{\tau+1}^2)\big\},
\end{align*}
where $\gamma_{\tau+1,r_1r_2}=\mathrm{Var}(y_{H+1,r_1r_2})
+2\sum_{q=1}^{M}\mathrm{Cov}(y_{H+1,r_1r_2},y_{H+1+q,r_1r_2})$,
and $\gamma_{\tau+1,r_1r_2}<\infty$ as $E(y_{i,r_1r_2}^2)<\infty$ for $i\ge 1$.
Similar to $E(III_A^{(211)})$,
\begin{align*}
|E(III_A^{(212)})| \le \frac{c}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|\,
E\Big[\Big\{\sum_{i=T+2M+4}^{H+T}\log\Big(\frac{H-2M-3}{H-i+T+1}\Big)\Big\}
\cdot\Big|\sum_{j=H+1}^{H+T}\tilde y_{j,r_1r_2}\Big|\Big]
= o\big\{\sqrt H\,\mathrm{tr}(\Sigma_{\tau+1}^2)\big\}.
\end{align*}
As a result, we have shown $E(III_A^{(2)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_{\tau+1}^2)\}$.
Similar to $E(III_A^{(212)})$, we can also show that
$E(III_A^{(3)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\}$.
Combining the above results, we have
$E(III_A) = o[\sqrt H\,\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}]$.

Next, we consider
\begin{align*}
III_B &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=T+1}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{j-T-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}
+\frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=H+1}^{H+T-2M-3}\sum_{j=i+M+1}^{H+T-M-2}
\log\Big(\frac{H-2M-3}{j-T-M-1}\Big)\sigma_{\tau+1,r_1r_2}\tilde y_{j,r_1r_2}\\
&\quad+\frac{2}{H}\sum_{r_1,r_2=1}^{p}
\Big\{\sum_{i=T+1}^{H}\sum_{j=H+1}^{H+T-M-2}
\log\Big(\frac{H-2M-3}{j-T-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{H-T-2M-1+l+q}\Big)\sigma_{\tau,r_1r_2}\tilde y_{H-M+l+q,r_1r_2}\Big\}\\
&= III_B^{(1)}+III_B^{(2)}+III_B^{(3)}.
\end{align*}

First,

\begin{align*}
E(III_B^{(1)}) &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+1}^{H-2M-3}\sum_{j=i+M+1}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-T-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}
+\sum_{i=T+1}^{H-2M-3}\sum_{j=H-M-1}^{H}
\log\Big(\frac{H-2M-3}{j-T-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\\
&\qquad+\sum_{i=H-2M-2}^{H-M-1}\sum_{j=i+M+1}^{H}
\log\Big(\frac{H-2M-3}{j-T-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(III_B^{(1a)})+E(III_B^{(1b)})+E(III_B^{(1c)}).
\end{align*}
For some constant $c_0$,
\begin{align*}
|E(III_B^{(1b)})| &= \frac{2c_0}{H}\Big|\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+1}^{H-2M-3}\sum_{j=H-M-1}^{H}
\log\Big(\frac{H-2M-3}{j-2M-3}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\Big|\\
&\le \frac{2c_0}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|(H-2M-3)
\sum_{j=H-M-1}^{H}\log\Big(\frac{H-2M-3}{j-2M-3}\Big)E|\tilde y_{j,r_1r_2}|\\
&\le \frac{2c_0}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|(H-2M-3)(M+2)\,
\frac{M+1}{H-3M-4}\,E|\tilde y_{1,r_1r_2}|
= O\Big\{\frac{1}{H}\,\mathrm{tr}(\Sigma_\tau^2)\Big\}.
\end{align*}
Similarly, we also have $|E(III_B^{(1c)})| = O\{(1/H)\,\mathrm{tr}(\Sigma_\tau^2)\}$. And
\begin{align*}
E(III_B^{(1a)}) &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+1}^{H-2M-3}\sum_{j=i+M+1}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-T-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+2}^{H-2M-3}\sum_{j=i+M+1}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+2}^{T}\sum_{j=i+M+1}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&= E(III_B^{(11)})-E(III_B^{(12)}),
\end{align*}
where $E(III_B^{(11)})=0$.

\begin{align*}
E(III_B^{(12)}) &= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+2}^{T-M-1}\sum_{j=i+M+1}^{T}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&\quad+\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+2}^{T}\sum_{j=T+1}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{T-2M-1+l+q}\Big)\sigma_{\tau,r_1r_2}\tilde y_{T-M+l+q,r_1r_2}\Big\}\\
&= E(III_B^{(121)})+E(III_B^{(122)}),
\end{align*}
where
\begin{align*}
E(III_B^{(121)}) &\le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i,j=M+2}^{T}\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{j=M+2}^{T}\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\\
&\quad-\frac{2c_1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+2}^{T}\sum_{q=1}^{M}
\log\Big(\frac{H-2M-3}{i+q-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{i+q,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{T-2M-1+l+q}\Big)\sigma_{\tau,r_1r_2}\tilde y_{T-M+l+q,r_1r_2}\Big\}\\
&= E(III_B^{(1211)})-E(III_B^{(1212)})-E(III_B^{(1213)}).
\end{align*}
We study $E(III_B^{(1212)})$ first. As $E(y_{1,r_1r_2}^2)<\infty$, we have
\begin{align*}
|E(III_B^{(1212)})| &\le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
E\Big\{\sum_{j=M+2}^{T}\log\Big(\frac{H-2M-3}{j-M-1}\Big)|\tilde y_{j,r_1r_2}|\Big\}
\le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
E\Big\{\sum_{j=M+2}^{H-M-2}\log\Big(\frac{H-2M-3}{j-M-1}\Big)|\tilde y_{j,r_1r_2}|\Big\}\\
&= \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|
\big\{(H-2M-4)-\log(H-2M-3)\big\}E|\tilde y_{1,r_1r_2}|
= O\big\{\mathrm{tr}(\Sigma_\tau^2)\big\}.
\end{align*}
By a similar idea, we also have $|E(III_B^{(1213)})| = O\{\mathrm{tr}(\Sigma_\tau^2)\}$.
We now turn to $E(III_B^{(1211)})$:
\begin{align*}
|E(III_B^{(1211)})| \le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
E\Big|\sum_{i=M+2}^{T}\log\Big(\frac{H-2M-3}{i-M-1}\Big)\sum_{j=M+2}^{T}\tilde y_{j,r_1r_2}\Big|
\le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
\sqrt{E\Big\{\sum_{i=M+2}^{T}\log\Big(\frac{H-2M-3}{i-M-1}\Big)\Big\}^2}
\cdot\sqrt{E\Big(\sum_{j=M+2}^{T}\tilde y_{j,r_1r_2}\Big)^2},
\end{align*}
where
\begin{align*}
E\Big\{\sum_{i=M+2}^{T}\log\Big(\frac{H-2M-3}{i-M-1}\Big)\Big\}^2
\le \Big\{\sum_{i=M+2}^{H-M-2}\log\Big(\frac{H-2M-3}{i-M-1}\Big)\Big\}^2
= \Big\{\int_{1}^{H-2M-3}\log\Big(\frac{H-2M-3}{x}\Big)dx\Big\}^2
= \big\{(H-2M-4)-\log(H-2M-3)\big\}^2,
\end{align*}
as $H\to\infty$ and $T_{\max}=o(H)$. Next, by Theorem 10 of Merlevède et al. (2006), we have
\begin{align*}
E\Big(\sum_{j=M+2}^{T}\tilde y_{j,r_1r_2}\Big)^2
\le E\Big(\max_{M+2\le t\le T_{\max}}\Big|\sum_{j=M+2}^{t}\tilde y_{j,r_1r_2}\Big|^2\Big)
\le 2D(T_{\max}-M-1)E(\tilde y_{1,r_1r_2}^2),
\end{align*}
where $D$ is a positive constant and $E(\tilde y_{1,r_1r_2}^2)<\infty$. Hence,
\begin{align*}
|E(III_B^{(1211)})| = \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|
\cdot O(H)\cdot o(\sqrt H)
= o\big\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\big\}.
\end{align*}

For $E(III_B^{(122)})$, we have
\begin{align*}
E(III_B^{(122)}) &= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+2}^{T}\sum_{j=T+1}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{T-2M-1+l+q}\Big)\sigma_{\tau,r_1r_2}\tilde y_{T-M+l+q,r_1r_2}\Big\}\\
&= E(III_B^{(1221)})+E(III_B^{(1222)}).
\end{align*}
We study $E(III_B^{(1222)})$ first. Note that
\begin{align*}
|E(III_B^{(1222)})| \le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\log(H)\,M
\sum_{i=1}^{H}E|\tilde y_{i,r_1r_2}| = O\big\{\log(H)\,\mathrm{tr}(\Sigma_\tau^2)\big\},
\end{align*}
as $E(y_{i,r_1r_2}^2)<\infty$ for $i\ge 1$. Moreover,
\begin{align*}
E(III_B^{(1221)}) &= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+2}^{T}\sum_{j=T+1}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\{1+o(1)\}\\
&= \frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=M+2}^{T}\sum_{j=M+2}^{H-M-2}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\{1+o(1)\}
-\frac{2c}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i,j=M+2}^{T}
\log\Big(\frac{H-2M-3}{j-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}\{1+o(1)\}\\
&= E(III_B^{(12211)})-E(III_B^{(12212)}).
\end{align*}
By the same idea as for $E(III_B^{(1211)})$, we have
$E(III_B^{(12212)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\}$. As a result,
\begin{align*}
|E(III_B^{(12211)})| &\le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
E\Big|(T-M-1)\sum_{j=M+2}^{H-M-2}\log\Big(\frac{H-2M-3}{j-M-1}\Big)\tilde y_{j,r_1r_2}\Big|\\
&\le \frac{c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
\sqrt{E(T-M-1)^2}\cdot
\sqrt{E\Big\{\sum_{j=M+2}^{H-M-2}\log\Big(\frac{H-2M-3}{j-M-1}\Big)\tilde y_{j,r_1r_2}\Big\}^2}.
\end{align*}
Note that
\begin{align*}
E\Big\{\sum_{j=M+2}^{H-M-2}\log\Big(\frac{H-2M-3}{j-M-1}\Big)\tilde y_{j,r_1r_2}\Big\}^2
&\le (2M+1)|\gamma_{\tau,r_1r_2}|\sum_{j=M+2}^{H-M-2}\log^2\Big(\frac{H-2M-3}{j-M-1}\Big)
= (2M+1)|\gamma_{\tau,r_1r_2}|\int_{1}^{H-2M-3}\log^2\Big(\frac{H-2M-3}{x}\Big)dx\\
&= (2M+1)|\gamma_{\tau,r_1r_2}|\Big[2\Big(H-2M-\frac{7}{2}\Big)-\{\log(H-2M-3)+1\}^2\Big]
= O(H),
\end{align*}
where $\gamma_{\tau,r_1r_2}=\mathrm{Var}(y_{1,r_1r_2})
+2\sum_{q=1}^{M}\mathrm{Cov}(y_{1,r_1r_2},y_{1+q,r_1r_2})<\infty$.
As $E(T-M-1)^2\le T_{\max}^2=o(H^2)$, therefore
$|E(III_B^{(12211)})| = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\}$, and hence
$|E(III_B^{(1221)})| = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\}$.
As a result, we have shown $E(III_B^{(1)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\}$.

Next, we study $E(III_B^{(2)})$ and $E(III_B^{(3)})$. For some constant $c$, we have
\begin{align*}
|E(III_B^{(2)})| &\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|\cdot
E\Big|\sum_{i=H+1}^{H+T-2M-3}\sum_{j=i+M+1}^{H+T-M-2}
\log\Big(\frac{H-2M-3}{i-T-M-1}\Big)\tilde y_{j,r_1r_2}\Big|\\
&\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|\cdot
E\Big[\Big\{\sum_{i=H+1}^{H+T-2M-3}\log\Big(\frac{H-2M-3}{i-T-M-1}\Big)\Big\}
\sum_{j=H+1}^{H+T}|\tilde y_{j,r_1r_2}|\Big].
\end{align*}
As $H\to\infty$ and $T_{\max}=o(H)$,
\begin{align*}
\sum_{i=H+1}^{H+T-2M-3}\log\Big(\frac{H-2M-3}{i-T-M-1}\Big)
=\int_{H-T-M-2}^{H-3M-4}\log\Big(\frac{H-2M-3}{x}\Big)dx = O(1).
\end{align*}
Hence, for some constant $c_1$, and by Lemma 3.1 (v),
\begin{align*}
|E(III_B^{(2)})| \le \frac{2c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|\cdot
E\Big(\sum_{j=H+1}^{H+T}|\tilde y_{j,r_1r_2}|\Big)
= \frac{2c_1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|
\big[E(T)E|\tilde y_{H+1,r_1r_2}|+O\{\sqrt{E(T)}\,\sigma_{\tau+1,r_1r_2}\}\big]
= O\big\{\mathrm{tr}(\Sigma_{\tau+1}^2)\big\}.
\end{align*}

Next, we study $E(III_B^{(3)})$:
\begin{align*}
E(III_B^{(3)}) &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=T+1}^{H}\sum_{j=H+1}^{H+T-M-2}
\log\Big(\frac{H-2M-3}{j-T-M-1}\Big)\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}
\log\Big(\frac{H-2M-3}{H-T-2M-1+l+q}\Big)\sigma_{\tau,r_1r_2}\tilde y_{H-M+l+q,r_1r_2}\Big\}\\
&= E(III_B^{(31)})-E(III_B^{(32)}).
\end{align*}
As $H\to\infty$, for $j=H+1,\ldots,H+T-M-2$,
\begin{align*}
\log\Big(\frac{H-2M-3}{j-T-M-1}\Big) = O(1).
\end{align*}
Hence, for some constant $c$, and by Lemma 3.1,
\begin{align*}
|E(III_B^{(31)})| &\le \frac{2c}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
E\Big\{(H-T)\Big|\sum_{j=H+1}^{H+T-M-2}\tilde y_{j,r_1r_2}\Big|\Big\}
\le \frac{2c}{H}\,H\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|
\sqrt{E\Big(\sum_{j=H+1}^{H+T-M-2}\tilde y_{j,r_1r_2}\Big)^2}\\
&= \frac{2}{H}\,H\sum_{r_1,r_2=1}^{p}\sigma_{\tau,r_1r_2}
\big[\sqrt{E(T)|\gamma_{\tau+1,r_1r_2}|}+O\{\sqrt{E(T)}\}\big]
= o\big\{\sqrt H\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\big\}.
\end{align*}
By a similar idea, it can be shown that
$E(III_B^{(32)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\}$.
Combining all the results, we have
$E(III_B) = o[\sqrt H\,\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}]$.

Finally, we study part $III_C$:
\begin{align*}
III_C &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=T+1}^{H+T-M-1}\sum_{j=i+M+1}^{H+T}
\sigma_{i,r_1r_2}\tilde y_{j,r_1r_2}\\
&= \frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=T+1}^{H-M-1}\sum_{j=i+M+1}^{H}
\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}
+\frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=H+1}^{H+T-M-1}\sum_{j=i+M+1}^{H+T}
\sigma_{\tau+1,r_1r_2}\tilde y_{j,r_1r_2}
+\frac{2}{H}\sum_{r_1,r_2=1}^{p}\sum_{i=T+1}^{H}\sum_{j=H+1}^{H+T}
\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\\
&= III_C^{(1)}+III_C^{(2)}+III_C^{(3)}.
\end{align*}

First,

\begin{align*}
E(III_C^{(1)}) = \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=1}^{H-M-1}\sum_{j=i+M+1}^{H}\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=1}^{T}\sum_{j=i+M+1}^{H}\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
= E(III_C^{(11)})-E(III_C^{(12)}),
\end{align*}
where $E(III_C^{(11)})=0$. We study $E(III_C^{(12)})$:
\begin{align*}
E(III_C^{(12)}) &= \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=1}^{T-M-1}\sum_{j=i+M+1}^{T}\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
+\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=1}^{T}\sum_{j=T+1}^{H}\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}\sigma_{\tau,r_1r_2}\tilde y_{T-M+l+q}\Big\}\\
&= E(III_C^{(121)})+E(III_C^{(122)}).
\end{align*}
We can write $E(III_C^{(121)})$ as
\begin{align*}
E(III_C^{(121)}) &= \frac{1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i,j=1}^{T}\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=1}^{T}\sigma_{\tau,r_1r_2}\tilde y_{i,r_1r_2}\Big\}\\
&\quad-\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=1}^{T}\sum_{q=1}^{M}\sigma_{\tau,r_1r_2}\tilde y_{i+q,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}\sigma_{\tau,r_1r_2}\tilde y_{T-M+l+q,r_1r_2}\Big\}\\
&= E(III_C^{(1211)})-E(III_C^{(1212)})-E(III_C^{(1213)}),
\end{align*}
where
\begin{align*}
|E(III_C^{(1211)})| \le \frac{1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
\sqrt{E(T^2)}\cdot\sqrt{E\Big(\sum_{j=1}^{T}\tilde y_{j,r_1r_2}\Big)^2}.
\end{align*}
As $E(T^2)\le T_{\max}^2=o(H^2)$, $\sqrt{E(T^2)}=o(H)$. By Theorem 10 of Merlevède et al. (2006),
\begin{align*}
E\Big(\sum_{j=1}^{T}\tilde y_{j,r_1r_2}\Big)^2
\le E\Big(\max_{1\le t\le T_{\max}}\Big|\sum_{j=1}^{t}\tilde y_{j,r_1r_2}\Big|^2\Big)
\le 2DT_{\max}E(\tilde y_{1,r_1r_2}^2) = o(H).
\end{align*}
Hence, $E(III_C^{(1211)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\}$.
Similarly, we can show $E(III_C^{(1212)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\}$ and
$E(III_C^{(1213)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\}$. Moreover,
\begin{align*}
E(III_C^{(122)}) = \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=1}^{T}\sum_{j=T+1}^{H}\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}\sigma_{\tau,r_1r_2}\tilde y_{T-M+l+q}\Big\}
= E(III_C^{(1221)})-E(III_C^{(1222)}).
\end{align*}
It can be shown that $E(III_C^{(1222)}) = o\{\mathrm{tr}(\Sigma_\tau^2)\}$. And for $E(III_C^{(1221)})$,
\begin{align*}
E(III_C^{(1221)}) = \frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=1}^{T}\sum_{j=1}^{H}\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i,j=1}^{T}\sigma_{\tau,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
= E(III_C^{(12211)})-E(III_C^{(12212)}).
\end{align*}
Similar to $E(III_C^{(1211)})$, $E(III_C^{(12212)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\}$, and
\begin{align*}
|E(III_C^{(12211)})| \le \frac{2}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
\sqrt{E(T^2)}\cdot\sqrt{E\Big(\sum_{j=1}^{H}\tilde y_{j,r_1r_2}\Big)^2}
\le \frac{2}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,T_{\max}\sqrt{H|\gamma_{\tau,r_1r_2}|}
= o\big\{\sqrt H\,\mathrm{tr}(\Sigma_\tau^2)\big\},
\end{align*}
where $\gamma_{\tau,r_1r_2}=\mathrm{Var}(y_{1,r_1r_2})
+2\sum_{q=1}^{M}\mathrm{Cov}(y_{1,r_1r_2},y_{1+q,r_1r_2})<\infty$.
Next, we study $E(III_C^{(2)})$:
\begin{align*}
E(III_C^{(2)}) &= \frac{1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i,j=H+1}^{H+T}\sigma_{\tau+1,r_1r_2}\tilde y_{j,r_1r_2}\Big\}
-\frac{1}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=H+1}^{H+T}\sigma_{\tau+1,r_1r_2}\tilde y_{i,r_1r_2}\Big\}\\
&\quad-\frac{2}{H}\sum_{r_1,r_2=1}^{p}
E\Big\{\sum_{i=H+1}^{H+T}\sum_{q=1}^{M}\sigma_{\tau+1,r_1r_2}\tilde y_{i+q,r_1r_2}
-\sum_{l=1}^{M}\sum_{q=M-l+1}^{M}\sigma_{\tau+1,r_1r_2}\tilde y_{H+T-M+l+q,r_1r_2}\Big\}\\
&= E(III_C^{(21)})-E(III_C^{(22)})-E(III_C^{(23)}),
\end{align*}
where
\begin{align*}
|E(III_C^{(21)})| \le \frac{1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|\,
\sqrt{E(T^2)}\cdot\sqrt{E\Big(\sum_{j=H+1}^{H+T}\tilde y_{j,r_1r_2}\Big)^2}
= \frac{1}{H}\sum_{r_1,r_2=1}^{p}|\sigma_{\tau+1,r_1r_2}|\cdot o(H)\cdot
O\{\sqrt{E(T)}\,\sigma_{\tau+1,r_1r_2}\}
= o\big\{\sqrt H\,\mathrm{tr}(\Sigma_{\tau+1}^2)\big\}.
\end{align*}
By a similar idea, we can show
$E(III_C^{(22)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_{\tau+1}^2)\}$ and
$E(III_C^{(23)}) = o\{\sqrt H\,\mathrm{tr}(\Sigma_{\tau+1}^2)\}$. Finally,
\begin{align*}
|E(III_C^{(3)})| \le \frac{2}{H}\,O(H)\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\,
\sqrt{E\Big(\sum_{j=H+1}^{H+T}\tilde y_{j,r_1r_2}\Big)^2}
= \frac{2}{H}\,O(H)\sum_{r_1,r_2=1}^{p}|\sigma_{\tau,r_1r_2}|\cdot
O\{\sqrt{E(T)|\gamma_{\tau+1,r_1r_2}|}\}
= o\big\{\sqrt H\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\big\}.
\end{align*}
Hence, $E(III_C) = o\{\sqrt H\,\mathrm{tr}(\Sigma_\tau\Sigma_{\tau+1})\}$.
In summary, $E(III_1) = o[\sqrt H\,\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}]$.
By similar derivations, we can also show
$E(III_2) = o[\sqrt H\,\mathrm{tr}\{(\Sigma_\tau-\Sigma_{\tau+1})^2\}]$.
This completes the proof of Lemma 3.5.

CHAPTER 4

CONCLUSION

In Chapter 2, we propose a new approach for testing and estimating one or multiple change points in a sequence of dependent high-dimensional offline data. The method has several advantages that give it a wide range of applications. First, by using a factor model, the proposed method accommodates both spatial and temporal dependence in the data without imposing structural assumptions. Second, the method is nonparametric, assuming no particular distributional form for the data.
Last but not least, our method can be applied to data with large dimensionality p, and it does not require a particular growth rate of the dimension p relative to the sample size n.

We also explicitly derive and discuss the convergence rate of the change-point estimator with respect to n and p, as well as for various locations of the change point, including the boundary. For estimating multiple change points, we consider both binary segmentation and wild binary segmentation; a generic sketch of the latter is given below.
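The following skeleton (ours, for orientation only) shows the overall shape of wild binary segmentation: `stat(x, s, e)` is a placeholder assumed to return the maximal test statistic and its interior location on the segment `x[s:e]`, standing in for the studentized statistic of Chapter 2.

```python
import random

# Wild binary segmentation skeleton (our sketch). `stat(x, s, e)` is assumed
# to return (max_statistic, interior_argmax) for the candidate segment x[s:e].
def wbs(x, s, e, stat, threshold, n_intervals=100, min_len=2, rng=random.Random(0)):
    if e - s < 2 * min_len:
        return []
    candidates = [(s, e)]
    for _ in range(n_intervals):                  # draw random sub-intervals of [s, e)
        a = rng.randrange(s, e - min_len)
        b = rng.randrange(a + min_len, e)
        candidates.append((a, b))
    best_val, best_loc = max(stat(x, a, b) for (a, b) in candidates)
    if best_val <= threshold:
        return []                                 # no significant change point here
    left = wbs(x, s, best_loc, stat, threshold, n_intervals, min_len, rng)
    right = wbs(x, best_loc, e, stat, threshold, n_intervals, min_len, rng)
    return left + [best_loc] + right
```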

The only tuning parameter is the dependence order M. Based on a consistent estimator, we suggest an easy-to-use elbow method to search for it. The proposed test statistic Lˆt in (2.4) may suffer a loss of power if the change occurs in only a small number of components of the population parameter, or if there are only a few change points within the sequence. To overcome this power loss, we also propose a power enhancement statistic Lˆ/σˆn,0 + L0 and study its asymptotic performance.

In Chapter 3, we propose a new procedure to detect anomalies in the covariance structure of high-dimensional online data. The procedure is implementable when the data are non-Gaussian and exhibit both spatial and temporal dependence. We investigate its theoretical properties by deriving an explicit expression for the average run length (ARL) and an upper bound for the expected detection delay (EDD). The established ARL can be used to set the threshold in the stopping rule without running time-consuming Monte Carlo simulations, as illustrated in the sketch below. The derived upper bound quantifies the impact of data dependence and of the magnitude of the change in covariance structure on the EDD. The theoretical properties are examined and supported by both simulation studies and a real application.
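To make the use of the ARL-calibrated threshold concrete, the following minimal monitoring loop (our sketch; `detector` is a placeholder for the covariance-change statistic of Chapter 3, and `b` is the threshold implied by a target ARL) shows the shape of the stopping rule.

```python
from collections import deque

# Minimal online monitoring loop (our sketch). `detector(window)` stands in
# for the statistic computed on the most recent observations; `b` is the
# threshold calibrated from a target average run length.
def monitor(stream, detector, b, window_size):
    window = deque(maxlen=window_size)
    for n, x in enumerate(stream, start=1):
        window.append(x)
        if len(window) == window_size and detector(list(window)) > b:
            return n                              # stopping time: declare a change
    return None                                   # stream ended with no alarm
```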

BIBLIOGRAPHY

[1] Allen, E. A., Damaraju, E., Plis, S. M., Erhardt, E. B., Eichele, T. and Calhoun, V. D. (2014), “Tracking whole-brain connectivity dynamics in the resting state,” Cerebral Cortex, 24(3), 663-676.
[2] Aminikhanghahi, S. and Cook, D. J. (2017), “A survey of methods for time series change point detection,” Knowledge and Information Systems, 51(2), 339-367.
[3] Ashby, F. G. (2011), Statistical Analysis of fMRI Data, MIT Press.
[4] Aston, J. and Kirch, C. (2012), “Evaluating stationarity via change-point alternatives with applications to fMRI data,” The Annals of Applied Statistics, 6, 1906-1948.
[5] Aston, J. and Kirch, C. (2014), “Efficiency of change point tests in high dimensional setting,” arXiv preprint arXiv:1409.1771.
[6] Aue, A., Hörmann, S., Horváth, L. and Reimherr, M. (2009), “Break detection in the covariance structure of multivariate time series models,” The Annals of Statistics, 37, 4046-4087.
[7] Ayyala, D., Park, J. and Roy, A. (2017), “Mean vector testing for high-dimensional dependent observations,” Journal of Multivariate Analysis, 153, 136-155.
[8] Bai, J. (2010), “Common breaks in means and variances for panel data,” Journal of Econometrics, 157, 78-92.
[9] Bai, Z. D. and Saranadasa, H. (1996), “Effect of high dimension: by an example of a two sample problem,” Statistica Sinica, 6, 311-329.
[10] Bara, I. A., Fung, C. J. and Dinh, T. (2015), “Enhancing Twitter spam accounts discovery using cross-account pattern mining,” 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM), 491-496, IEEE.
[11] Billingsley, P. (1999), Convergence of Probability Measures, John Wiley & Sons.
[12] Calhoun, V. D., Miller, R., Pearlson, G. and Adalı, T. (2014), “The chronnectome: time-varying connectivity networks as the next frontier in fMRI data discovery,” Neuron, 84(2), 262-274.
[13] Carrasco, M. and Chen, X. (2002), “Mixing and moment properties of various GARCH and stochastic volatility models,” Econometric Theory, 18, 17-39.
[14] Chan, H. P. (2017), “Optimal sequential detection in multi-stream data,” The Annals of Statistics, 45(6), 2736-2763.
[15] Chan, H. P. and Walther, G. (2015), “Optimal detection of multi-sample aligned sparse signals,” The Annals of Statistics, 43(5), 1865-1895.
[16] Chan, J., Horváth, L. and Hušková, M. (2013), “Darling-Erdős limit results for change-point detection in panel data,” Journal of Statistical Planning and Inference, 143, 955-970.
[17] Chang, C. and Glover, G. H. (2010), “Time-frequency dynamics of resting-state brain connectivity measured with fMRI,” Neuroimage, 50(1), 81-98.

[18] Chen, H. (2019), “Sequential change-point detection based on nearest neighbors,” The Annals of Statistics, 47(3), 1381-1407.
[19] Chen, H. and Zhang, N. (2015), “Graph-based change-point detection,” The Annals of Statistics, 43, 139-176.
[20] Chen, J. and Gupta, A. (1997), “Testing and locating variance change-points with application to stock prices,” Journal of the American Statistical Association, 92, 739-747.
[21] Chen, S. X., Li, J. and Zhong, P. S. (2019), “Supplement to ‘Two-sample and ANOVA tests for high dimensional means,’” DOI:10.1214/18-AOS1720SUPP.
[22] Chen, S. X. and Qin, Y. (2010), “A two-sample test for high-dimensional data with applications to gene-set testing,” The Annals of Statistics, 38, 808-835.
[23] Cho, H. (2016), “Change-point detection in panel data via double CUSUM statistic,” Electronic Journal of Statistics, 10, 2000-2038.
[24] Cho, H. and Fryzlewicz, P. (2015), “Multiple change-point detection for high dimensional time series via sparsified binary segmentation,” Journal of the Royal Statistical Society, Series B, 77, 475-507.
[25] Chu, L. and Chen, H. (2018), “Sequential change-point detection for high-dimensional and non-Euclidean data,” arXiv preprint arXiv:1810.05973.
[26] Cribben, I., Haraldsdottir, R., Atlas, L. Y., Wager, T. D. and Lindquist, M. A. (2012), “Dynamic connectivity regression: determining state-related changes in brain connectivity,” Neuroimage, 61(4), 907-920.
[27] Davis, R. A., Lee, T. and Rodriguez-Yam, G. (2006), “Structural break estimation for nonstationary time series models,” Journal of the American Statistical Association, 101, 223-239.
[28] Fan, J., Liao, Y. and Yao, J. (2015), “Power enhancement in high-dimensional cross-sectional tests,” Econometrica, 83, 1497-1541.
[29] Finch, S. R. (2003), “Extreme value constants,” Mathematical Constants, Cambridge University Press.
[30] Friederich, H. C., Brooks, S., Uher, R., Campbell, I. C., Giampietro, V., Brammer, M., Williams, S. C. R., Herzog, W. and Treasure, J. (2010), “Neural correlates of body dissatisfaction in anorexia nervosa,” Neuropsychologia, 48, 2878-2885.
[31] Friederich, H. C., Uher, R., Brooks, S., Giampietro, V., Brammer, M., Williams, S. C., Herzog, W., Treasure, J. and Campbell, I. (2007), “I’m not as slim as that girl: neural bases of body shape self-comparison to media images,” Neuroimage, 37, 674-681.
[32] Fryzlewicz, P. (2014), “Wild binary segmentation for multiple change-point detection,” The Annals of Statistics, 42, 2243-2281.
[33] Gao, X., Deng, X., Wen, X., She, Y., Vinke, P. and Chen, H. (2016), “My body looks like that girl’s: body mass index modulates brain activity during body image self-reflection among young women,” PLoS ONE, 11, e0164450.
[34] Gavin, A. R., Simon, G. E. and Ludman, E. J. (2010), “The association between obesity, depression, and educational attainment in women: the mediating role of body image dissatisfaction,” Journal of Psychosomatic Research, 69, 573-581.

[35] Handwerker, D. A., Roopchansingh, V., Gonzalez-Castillo, J. and Bandettini, P. A. (2012), “Periodic changes in fMRI connectivity,” Neuroimage, 63(3), 1712-1719.
[36] Harchaoui, Z., Moulines, E. and Bach, F. (2009), “Kernel change-point analysis,” Advances in Neural Information Processing Systems, 609-616.
[37] Horváth, L. and Hušková, M. (2012), “Change-point detection in panel data,” Journal of Time Series Analysis, 33, 631-648.
[38] Horváth, L., Hušková, M., Rice, G. and Wang, J. (2017), “Asymptotic properties of the CUSUM estimator for the time of change in linear panel data models,” Econometric Theory, 33, 366-412.
[39] Horváth, L., Rice, G. and Whipple, S. (2016), “Adaptive bandwidth selection in the long run covariance estimator of functional time series,” Computational Statistics and Data Analysis, 100, 676-693.
[40] Hutchison, R. M., Womelsdorf, T., Allen, E. A., Bandettini, P. A., Calhoun, V. D., Corbetta, M., ... and Handwerker, D. A. (2013), “Dynamic functional connectivity: promise, issues, and interpretations,” Neuroimage, 80, 360-378.
[41] Inclán, C. and Tiao, G. (1994), “Use of cumulative sums of squares for retrospective detection of changes of variance,” Journal of the American Statistical Association, 89, 913-923.
[42] James, B., James, K. L. and Siegmund, D. (1992), “Asymptotic approximations for likelihood ratio tests and confidence regions for a change-point in the mean of a multivariate normal distribution,” Statistica Sinica, 2, 69-90.
[43] Janson, S. (1983), “Renewal theory for M-dependent variables,” The Annals of Probability, 11(3), 558-568.
[44] Jeong, S. O., Pae, C. and Park, H. J. (2016), “Connectivity-based change point detection for large-size functional networks,” NeuroImage, 143, 353-363.
[45] Jirak, M. (2012), “Change-point analysis in increasing dimension,” Journal of Multivariate Analysis, 159, 111-136.
[46] ——— (2015), “Uniform change point tests in high dimension,” The Annals of Statistics, 43, 2451-2483.
[47] Kokoszka, P. and Leipus, R. (2000), “Change-point estimation in ARCH models,” Bernoulli, 6, 513-539.
[48] Kurosaki, M., Shirao, N., Yamashita, H., Okamoto, Y. and Yamawaki, S. (2006), “Distorted images of one’s own body activates the prefrontal cortex and limbic/paralimbic system in young women: a functional magnetic resonance imaging study,” Biological Psychiatry, 59, 380-386.
[49] Lavielle, M. and Moulines, E. (2000), “Least-squares estimation of an unknown number of shifts in a time series,” Journal of Time Series Analysis, 21, 33-59.
[50] Li, J. and Chen, S. X. (2012), “Two sample tests for high dimensional covariance matrices,” The Annals of Statistics, 40, 908-940.
[51] Lindquist, M. (2008), “The statistical analysis of fMRI data,” Statistical Science, 23, 439-464.
[52] Lindquist, M., Waugh, C. and Wager, T. (2007), “Modeling state-related fMRI activity using change-point theory,” Neuroimage, 35, 1125-1141.

[53] Liu, W. and Shao, Q. (2013), “A Cramér moderate deviation theorem for Hotelling T²-statistic with applications to global tests,” The Annals of Statistics, 41, 296-322.
[54] Logothetis, N. K., Pauls, J., Auguth, M., Trinath, T. and Oeltermann, A. (2001), “A neurophysiological investigation of the basis of the BOLD signal in fMRI,” Nature, 412, 150-157.
[55] Lorden, G. (1971), “Procedures for reacting to a change in distribution,” The Annals of Mathematical Statistics, 42(6), 1897-1908.
[56] Matteson, D. and James, N. A. (2014), “A nonparametric approach for multiple change point analysis of multivariate data,” Journal of the American Statistical Association, 109, 334-345.
[57] Mei, Y. (2010), “Efficient scalable schemes for monitoring a large number of data streams,” Biometrika, 97(2), 419-433.
[58] Merlevède, F., Peligrad, M. and Utev, S. (2006), “Recent advances in invariance principles for stationary sequences,” Probability Surveys, 3, 1-36.
[59] Miyake, Y., Okamoto, Y., Onoda, K., Kurosaki, M., Shirao, N. and Yamawaki, S. (2010), “Brain activation during the perception of distorted body images in eating disorders,” Psychiatry Research: Neuroimaging, 181, 183-192.
[60] Monti, R. P., Hellyer, P., Sharp, D., Leech, R., Anagnostopoulos, C. and Montana, G. (2014), “Estimating time-varying brain connectivity networks from functional MRI time series,” NeuroImage, 103, 427-443.
[61] Olshen, A. and Venkatraman, E. (2004), “Circular binary segmentation for the analysis of array-based DNA copy number data,” Biostatistics, 5, 557-572.
[62] Ombao, H. C., Raz, J. A., von Sachs, R. and Malow, B. A. (2001), “Automatic statistical analysis of bivariate nonstationary time series,” Journal of the American Statistical Association, 96, 543-560.
[63] Page, E. S. (1954), “Continuous inspection schemes,” Biometrika, 41(1/2), 100-115.
[64] Roberts, S. W. (1966), “A comparison of some control chart procedures,” Technometrics, 8(3), 411-430.
[65] Robinson, L. F., Wager, T. D. and Lindquist, M. A. (2010), “Change point estimation in multi-subject fMRI studies,” NeuroImage, 49, 1581-1592.
[66] Schwartz, M. B. and Brownell, K. D. (2004), “Obesity and body image,” Body Image, 1, 43-56.
[67] Sen, A. K. and Srivastava, M. S. (1975), “On tests for detecting change in mean,” The Annals of Statistics, 3, 98-108.
[68] Shao, X. and Zhang, X. (2010), “Testing for change points in time series,” Journal of the American Statistical Association, 105, 1228-1240.
[69] Shiryayev, A. N. (1963), “On optimal methods in earliest detection problems,” Theory of Probability and its Applications, 8, 26-51.
[70] Siegmund, D. (1985), Sequential Analysis: Tests and Confidence Intervals, Springer Science & Business Media.
[71] Siegmund, D. and Venkatraman, E. S. (1995), “Using the generalized likelihood ratio statistic for sequential detection of a change-point,” The Annals of Statistics, 255-271.

[72] Siegmund, D., Yakir, B. and Zhang, N. R. (2011), “Detecting simultaneous variant intervals in aligned sequences,” The Annals of Applied Statistics, 5, 645-668.
[73] Srivastava, M. S. and Worsley, K. J. (1986), “Likelihood ratio tests for a change in the multivariate normal mean,” Journal of the American Statistical Association, 81, 199-204.
[74] Tartakovsky, A. G. and Veeravalli, V. V. (2008), “Asymptotically optimal quickest change detection in distributed sensor systems,” Sequential Analysis, 27(4), 441-475.
[75] Uher, R., Murphy, T., Friederich, H. C., Dalgleish, T., Brammer, M. J. and Giampietro, V. (2005), “Functional neuroanatomy of body shape perception in healthy and eating-disordered women,” Biological Psychiatry, 58, 990-997.
[76] Venkatraman, E. (1992), Consistency Results in Multiple Change-Point Problems, Ph.D. Thesis, Stanford University.
[77] Vostrikova, L. (1981), “Detection of disorder in multidimensional random processes,” Soviet Mathematics Doklady, 24, 55-59.
[78] Wagner, A., Ruf, M., Braus, D. F. and Schmidt, M. H. (2003), “Neuronal activity changes and body image distortion in anorexia nervosa,” Neuroreport, 14, 2193-2197.
[79] Wald, A. (1973), Sequential Analysis, Courier Corporation.
[80] Wang, G., Zou, C. and Yin, G. (2017), “Change-point detection in multinomial data with a large number of categories,” to appear in The Annals of Applied Statistics.
[81] Wang, T. and Samworth, R. (2018), “High dimensional change point estimation via sparse projection,” Journal of the Royal Statistical Society, Series B, 80, 57-83.
[82] Xie, Y. and Siegmund, D. (2013), “Sequential multi-sensor change-point detection,” The Annals of Statistics, 41, 670-692.
[83] Yang, J., Dedovic, K., Guan, L., Chen, Y. and Qi, M. (2014), “Self-esteem modulates dorsal medial prefrontal cortical response to self-positivity bias in implicit self-relevant processing,” Social Cognitive and Affective Neuroscience, 9, 1814-1818.
[84] Yang, Q. and Pan, G. (2017), “Weighted statistic in detecting faint and sparse alternatives for high-dimensional covariance matrices,” Journal of the American Statistical Association, 112, 188-200.
[85] Zhong, P. S. and Li, J. (2016), “Test for temporal homogeneity of means in high-dimensional longitudinal data,” arXiv preprint arXiv:1608.07482.
