
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)

Change Detection in Multivariate Datastreams: Likelihood and Detectability Loss

Cesare Alippi,1,2 Giacomo Boracchi,1 Diego Carrera,1 Manuel Roveri1
1 Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy
2 Università della Svizzera Italiana, Lugano, Switzerland
[email protected]

Abstract

We address the problem of detecting changes in multivariate datastreams, and we investigate the intrinsic difficulty that change-detection methods have to face when the data dimension scales. In particular, we consider a general approach where changes are detected by comparing the distribution of the log-likelihood of the datastream over different time windows. Despite the fact that this approach constitutes the frame of several change-detection methods, its effectiveness when the data dimension scales has never been investigated, which is indeed the goal of our paper. We show that the magnitude of the change can be naturally measured by the symmetric Kullback-Leibler divergence between the pre- and post-change distributions, and that the detectability of a change of a given magnitude worsens when the data dimension increases. This problem, which we refer to as detectability loss, is due to the linear relationship between the variance of the log-likelihood and the data dimension. We analytically derive the detectability loss on Gaussian-distributed datastreams, and empirically demonstrate that this problem holds also on real-world datasets and that it can be harmful even at low data dimensions (say, 10).

1 Introduction

Change detection, namely the problem of detecting changes in the distribution of a process generating a datastream, has been widely investigated on scalar (i.e., univariate) data.
Perhaps, the reason behind the univariate assumption is that change-detection tests (CDTs) were originally developed for quality-control applications [Basseville et al., 1993], and much fewer works address the problem of detecting changes in multivariate datastreams.

A straightforward extension to the multivariate case would be to independently inspect each component of the datastream with a scalar CDT [Tartakovsky et al., 2006], but this clearly does not provide a truly multivariate solution, e.g., it is unable to detect changes affecting the correlation among the data components. A common, truly multivariate approach consists in computing the log-likelihood of the datastream and comparing the distribution of the log-likelihood over different time windows (Section 2). In practice, computing the log-likelihood is an effective way to reduce the multivariate change-detection problem to a univariate one, thus easily addressable by any scalar CDT. Several CDTs for multivariate datastreams pursue this approach, and compute the log-likelihood with respect to a model fitted to a training set of stationary data: [Kuncheva, 2013] uses Gaussian mixtures, while [Krempl, 2011; Dyer et al., 2014] use nonparametric density models. Other CDTs have been designed upon specific multivariate statistics [Schilling, 1986; Agarwal, 2005; Lung-Yut-Fong et al., 2011; Wang and Chen, 2002; Ditzler and Polikar, 2011; Nguyen et al., 2014]. In the classification literature, where changes in the distribution are referred to as concept drift [Gama et al., 2014], changes are typically detected by monitoring the scalar sequence of classification errors over time [Gama et al., 2004; Alippi et al., 2013; Bifet and Gavalda, 2007; Ross et al., 2012].

Even though this problem is of utmost relevance in datastream mining, no theoretical or experimental study investigates how the data dimension $d$ impacts the change detectability. In Section 3, we consider change-detection problems in $\mathbb{R}^d$ and investigate how $d$ affects the detectability of a change when monitoring the log-likelihood of the datastream. In this respect, we show that the symmetric Kullback-Leibler divergence (sKL) between the pre-change and post-change distributions is an appropriate measure of the change magnitude, and we introduce the signal-to-noise ratio of the change (SNR) to quantitatively assess the change detectability when monitoring the log-likelihood.

Then, we show that the detectability of changes having a given magnitude progressively reduces when $d$ increases. We refer to this phenomenon as detectability loss, and we analytically demonstrate that, in the case of Gaussian random variables, the change detectability is upper-bounded by a function that decays as $1/d$. We demonstrate that detectability loss occurs also in non-Gaussian cases, as far as the data components are independent, and we show that it affects also real-world datasets, which we approximate by Gaussian mixtures in our empirical analysis (Section 4). Most importantly, detectability loss is not a consequence of density-estimation problems, as it holds either when the data distribution is estimated from training samples or known. Our results indicate that detectability loss is potentially harmful also at reasonably low dimensions (e.g., 10), and not only in Big-Data scenarios.

2 Monitoring the Log-Likelihood

2.1 The Change Model

We assume that, in stationary conditions, the datastream $\{x(t), t = 1, \dots\}$ contains independent and identically distributed (i.i.d.) random vectors $x(t) \in \mathbb{R}^d$, drawn from a random variable $\mathcal{X}$ having probability density function (pdf) $\phi_0$, which for simplicity we assume continuous, strictly positive and bounded. Here, $t$ denotes the time instant, bold letters indicate column vectors, and $'$ is the matrix-transpose operator. For the sake of simplicity, we consider permanent changes $\phi_0 \to \phi_1$ affecting the expectation and/or correlation of $\mathcal{X}$:

$$x(t) \sim \begin{cases} \phi_0 & t < \tau \\ \phi_1 & t \geq \tau \end{cases}, \quad \text{where } \phi_1(x) = \phi_0(Qx + v), \qquad (1)$$

where $\tau$ is the unknown change point, $v \in \mathbb{R}^d$ changes the location of $\phi_0$, and $Q \in O(d) \subset \mathbb{R}^{d \times d}$ is an orthogonal matrix that modifies the correlation among the components of $x$. This rather general change model requires a truly multivariate monitoring scheme: changes affecting only the correlation among the components of $x$ cannot be perceived by analyzing each component individually, or by extracting straightforward features (such as the norm) out of the vectors $x(t)$.¹

¹ We do not consider changes affecting the data dispersion, as these can be detected by monitoring the Euclidean norm of $x(t)$.

2.2 The Considered Change-Detection Approach

We consider the popular change-detection approach that consists in monitoring the log-likelihood of $x(t)$ with respect to $\phi_0$ [Kuncheva, 2013; Song et al., 2007; Sullivan and Woodall, 2000]:

$$\mathcal{L}(x(t)) = \log(\phi_0(x(t))), \quad \forall t. \qquad (2)$$

We denote by $L = \{\mathcal{L}(x(t)), t = 1, \dots\}$ the sequence of log-likelihood values, and observe that in stationary conditions $L$ contains i.i.d. data drawn from a scalar random variable. When $\mathcal{X}$ undergoes a change, the distribution of $\mathcal{L}(\cdot)$ is also expected to change. Thus, changes $\phi_0 \to \phi_1$ can be detected by comparing the distribution of $\mathcal{L}(\cdot)$ over $W_P$ and $W_R$, two non-overlapping windows of $L$, where $W_P$ refers to past data (that we assume are generated from $\phi_0$), and $W_R$ refers to the most recent ones (that are possibly generated from $\phi_1$). In practice, a suitable test statistic $\mathcal{T}(W_P, W_R)$, such as the t-statistic, Kolmogorov-Smirnov or Lepage [Lepage, 1971], is computed to compare $W_P$ and $W_R$. In a hypothesis-testing framework, this corresponds to formulating a test having as null hypothesis "samples in $W_P$ and $W_R$ are from the same distribution". When $\mathcal{T}(W_P, W_R) > h$ we can safely consider that the log-likelihood values over $W_P$ and $W_R$ are from two different distributions, indicating indeed a change in $\mathcal{X}$. The threshold $h > 0$ controls the test significance.

There are two important aspects to be considered about this change-detection approach. First, comparing data on different windows is not a genuine sequential monitoring scheme. However, this mechanism is at the core of several online change-detection methods [Kuncheva, 2013; Song et al., 2007; Bifet and Gavalda, 2007; Ross et al., 2011]. Moreover, the power of the test $\mathcal{T}(W_P, W_R) > h$, namely the probability of rejecting the null hypothesis when the alternative holds, indicates the effectiveness of the test statistic $\mathcal{T}$ when the same is used in sequential-monitoring techniques. Second, $\phi_0$ in (2) is often unknown and has to be preliminarily estimated from a training set of stationary data. Then, $\phi_0$ is simply replaced by its estimate $\hat{\phi}_0$. In practice, it is fairly reasonable to assume a training set of stationary data is given, while it is often unrealistic to assume $\phi_1$ is known, since the datastream might change unpredictably.
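As an illustration of this monitoring scheme, the following sketch computes the log-likelihood sequence $L$ for a stream with a known Gaussian $\phi_0$ and compares $W_P$ and $W_R$ with a Welch t-test. It is a minimal sketch, not the authors' code: the dimension, the change, and the window size are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d = 8                                     # data dimension (assumed for illustration)

# Stationary pdf phi_0 = N(mu0, Sigma0); the parameters are placeholders.
mu0 = np.zeros(d)
Sigma0 = np.eye(d)
phi0 = stats.multivariate_normal(mean=mu0, cov=Sigma0)

# Datastream: 500 stationary samples, then 500 post-change samples
# (here the change is a plain location shift, for illustration only).
x_pre = rng.multivariate_normal(mu0, Sigma0, size=500)
x_post = rng.multivariate_normal(mu0 + 0.5, Sigma0, size=500)

# Log-likelihood sequence L(x(t)) = log(phi_0(x(t))), as in (2).
L = phi0.logpdf(np.vstack([x_pre, x_post]))

# Non-overlapping windows of L: past (W_P) and most recent (W_R) data.
W_P, W_R = L[:500], L[500:]

# Welch t-test as the statistic T(W_P, W_R); stationarity is rejected
# when the statistic exceeds the threshold h set by the significance level.
t_stat, p_value = stats.ttest_ind(W_P, W_R, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```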
3 Theoretical Analysis

This section sheds light on the relationship between change detectability and $d$. To this purpose, we introduce: i) a measure of the change magnitude, and ii) an indicator that quantitatively assesses change detectability, namely how difficult it is to detect a change when monitoring $\mathcal{L}(\cdot)$ as described in Section 2.2. Afterward, we can study the influence of $d$ on the change detectability, provided that changes $\phi_0 \to \phi_1$ have a constant magnitude.

3.1 Change Magnitude

The magnitude of $\phi_0 \to \phi_1$ can be naturally measured by the symmetric Kullback-Leibler divergence between $\phi_0$ and $\phi_1$ (also known as the Jeffreys divergence):

$$\mathrm{sKL}(\phi_0, \phi_1) := \mathrm{KL}(\phi_0, \phi_1) + \mathrm{KL}(\phi_1, \phi_0) = \int_{\mathbb{R}^d} \log\!\left(\frac{\phi_0(x)}{\phi_1(x)}\right)\phi_0(x)\,dx + \int_{\mathbb{R}^d} \log\!\left(\frac{\phi_1(x)}{\phi_0(x)}\right)\phi_1(x)\,dx. \qquad (3)$$

This choice is supported by Stein's Lemma [Cover and Thomas, 2012], which states that $\mathrm{KL}(\phi_0, \phi_1)$ yields an upper bound for the power of parametric hypothesis tests that determine whether a given sample population is generated from $\phi_0$ (null hypothesis) or $\phi_1$ (alternative hypothesis). In practice, large values of $\mathrm{sKL}(\phi_0, \phi_1)$ indicate changes $\phi_0 \to \phi_1$ that are very apparent, since hypothesis tests designed to detect either $\phi_0 \to \phi_1$ or $\phi_1 \to \phi_0$ can be very powerful.

3.2 Change Detectability

We define the following indicator to quantitatively assess the detectability of a change when monitoring $\mathcal{L}(\cdot)$.

Definition 1. The signal-to-noise ratio (SNR) of the change $\phi_0 \to \phi_1$ is defined as:

$$\mathrm{SNR}(\phi_0 \to \phi_1) := \frac{\left(\mathbb{E}_{x \sim \phi_0}[\mathcal{L}(x)] - \mathbb{E}_{x \sim \phi_1}[\mathcal{L}(x)]\right)^2}{\mathrm{var}_{x \sim \phi_0}[\mathcal{L}(x)] + \mathrm{var}_{x \sim \phi_1}[\mathcal{L}(x)]}, \qquad (4)$$

where $\mathrm{var}[\cdot]$ denotes the variance of a random variable.

In particular, $\mathrm{SNR}(\phi_0 \to \phi_1)$ measures the extent to which $\phi_0 \to \phi_1$ is detectable by monitoring the expectation of $\mathcal{L}(\cdot)$. In fact, the numerator of (4) corresponds to the shift introduced by $\phi_0 \to \phi_1$ in the expectation of $\mathcal{L}(\cdot)$ (i.e., the relevant information, the signal), which is easy or difficult to detect relative to its random fluctuations (i.e., the noise), which are assessed in the denominator of (4). Note that, if we replace the expectations and the variances in (4) by their sample estimators, we obtain that $\mathrm{SNR}(\phi_0 \to \phi_1)$ corresponds, up to a scaling factor, to the squared statistic of a Welch's t-test [Welch, 1947], which detects changes in the expectation of two sample populations. This is another argument supporting the use of $\mathrm{SNR}(\phi_0 \to \phi_1)$ as a measure of change detectability.
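The SNR of Definition 1 can be estimated by plugging sample means and variances into (4). The sketch below is a minimal illustration under assumptions of ours: an isotropic Gaussian $\phi_0$ and a pure location shift $v$ chosen so that $\mathrm{sKL}(\phi_0, \phi_1) = \|v\|^2 = 1$.

```python
import numpy as np
from scipy import stats

def snr_change(phi0, samples0, samples1):
    # Sample estimate of SNR(phi0 -> phi1), eq. (4): squared shift of the
    # mean of L across the change, over the summed variances of L.
    L0 = phi0.logpdf(samples0)   # L(x) for x ~ phi_0
    L1 = phi0.logpdf(samples1)   # L(x) for x ~ phi_1
    return (L0.mean() - L1.mean()) ** 2 / (L0.var(ddof=1) + L1.var(ddof=1))

rng = np.random.default_rng(1)
d = 16
phi0 = stats.multivariate_normal(np.zeros(d), np.eye(d))
v = np.full(d, 1 / np.sqrt(d))                 # ||v||^2 = 1, i.e., sKL = 1
x0 = rng.standard_normal((100_000, d))         # stationary samples, x ~ phi_0
x1 = rng.standard_normal((100_000, d)) - v     # post-change samples, x ~ phi_1
print(snr_change(phi0, x0, x1))
```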

The following proposition relates the change magnitude $\mathrm{sKL}(\phi_0, \phi_1)$ to the numerator of (4).

Proposition 1. Let us consider a change $\phi_0 \to \phi_1$ such that

$$\phi_1(x) = \phi_0(Qx + v), \qquad (5)$$

where $Q \in \mathbb{R}^{d \times d}$ is orthogonal and $v \in \mathbb{R}^d$. Then, it holds:

$$\mathrm{sKL}(\phi_0, \phi_1) \geq \mathbb{E}_{x \sim \phi_0}[\mathcal{L}(x)] - \mathbb{E}_{x \sim \phi_1}[\mathcal{L}(x)]. \qquad (6)$$

Proof. From the definition of $\mathrm{sKL}(\phi_0, \phi_1)$ in (3) it follows

$$\mathrm{sKL}(\phi_0, \phi_1) = \mathbb{E}_{x \sim \phi_0}[\log(\phi_0(x))] - \mathbb{E}_{x \sim \phi_0}[\log(\phi_1(x))] + \mathbb{E}_{x \sim \phi_1}[\log(\phi_1(x))] - \mathbb{E}_{x \sim \phi_1}[\log(\phi_0(x))].$$

Since $\mathcal{L}(\cdot) = \log(\phi_0(\cdot))$, (6) holds if and only if

$$\mathbb{E}_{x \sim \phi_1}[\log(\phi_1(x))] \geq \mathbb{E}_{x \sim \phi_0}[\log(\phi_1(x))]. \qquad (7)$$

From (5) it follows that $\phi_0(x) = \phi_1(Q'(x - v))$; thus, by replacing the mathematical expectations with their integral expressions, (7) becomes

$$\int \log(\phi_1(x))\,\phi_1(x)\,dx \geq \int \log(\phi_1(x))\,\phi_1(Q'(x - v))\,dx. \qquad (8)$$

Let us define $y = Q'(x - v)$; then $x = Qy + v$ and $dx = |\det(Q)|\,dy = dy$, since $Q$ is orthogonal. Using this change of variables in the second summand of (8) we obtain

$$\int \log(\phi_1(x))\,\phi_1(x)\,dx \geq \int \log(\phi_1(Qy + v))\,\phi_1(y)\,dy. \qquad (9)$$

Finally, defining $\phi_2(y) := \phi_1(Qy + v)$ turns (9) into

$$\int \log(\phi_1(x))\,\phi_1(x)\,dx - \int \log(\phi_2(y))\,\phi_1(y)\,dy \geq 0, \qquad (10)$$

which holds since the left-hand side of (10) is $\mathrm{KL}(\phi_1, \phi_2)$.

3.3 Detectability Loss

It is now possible to investigate the intrinsic challenge of change-detection problems when the data dimension increases. In particular, we study how the change detectability (i.e., $\mathrm{SNR}(\phi_0 \to \phi_1)$) varies when $d$ increases and the changes $\phi_0 \to \phi_1$ preserve a constant magnitude (i.e., $\mathrm{sKL}(\phi_0, \phi_1) = \mathrm{const}$). Unfortunately, since there are no general expressions for the variance of $\mathcal{L}(\cdot)$, we have to assume a specific distribution for $\phi_0$ to carry out any analytical development. As a relevant example, we consider Gaussian random variables, which enable a simple expression of $\mathcal{L}(\cdot)$. The following theorem demonstrates the detectability loss for Gaussian distributions, namely that $\mathrm{SNR}(\phi_0 \to \phi_1)$ decays as $d$ increases.

Theorem 1. Let $\phi_0 = \mathcal{N}(\mu_0, \Sigma_0)$ be a $d$-dimensional Gaussian pdf and $\phi_1(x) = \phi_0(Qx + v)$, where $Q \in \mathbb{R}^{d \times d}$ is orthogonal and $v \in \mathbb{R}^d$. Then, it holds

$$\mathrm{SNR}(\phi_0 \to \phi_1) \leq \frac{C}{d}, \qquad (11)$$

where the constant $C$ depends only on $\mathrm{sKL}(\phi_0, \phi_1)$.

Proof. Basic algebra leads to the following expression for $\mathcal{L}(x)$ when $\phi_0 = \mathcal{N}(\mu_0, \Sigma_0)$:

$$\mathcal{L}(x) = -\frac{1}{2}\log\!\left((2\pi)^d \det(\Sigma_0)\right) - \frac{1}{2}(x - \mu_0)'\Sigma_0^{-1}(x - \mu_0). \qquad (12)$$

The first term in the right-hand side of (12) is constant, while the quadratic form in the second term is distributed as a chi-squared with $d$ degrees of freedom. Therefore,

$$\mathrm{var}_{x \sim \phi_0}[\mathcal{L}(x)] = \mathrm{var}\!\left[\frac{1}{2}\chi^2(d)\right] = \frac{d}{2}. \qquad (13)$$

Then, from the definition of $\mathrm{SNR}(\phi_0 \to \phi_1)$ in (4) and Proposition 1, it follows that

$$\mathrm{SNR}(\phi_0 \to \phi_1) \leq \frac{\mathrm{sKL}(\phi_0, \phi_1)^2}{\mathrm{var}_{x \sim \phi_0}[\mathcal{L}(x)] + \mathrm{var}_{x \sim \phi_1}[\mathcal{L}(x)]} \leq \frac{\mathrm{sKL}(\phi_0, \phi_1)^2}{\mathrm{var}_{x \sim \phi_0}[\mathcal{L}(x)]} = \frac{\mathrm{sKL}(\phi_0, \phi_1)^2}{d/2} = \frac{C}{d}.$$
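A quick numerical check of (13): under any Gaussian $\phi_0$ the variance of the log-likelihood equals $d/2$, independently of the parameters. The sketch below assumes nothing beyond the statement of the theorem; the covariances are randomly chosen for the example.

```python
import numpy as np
from scipy import stats

# Numerical check of (13): var[L(x)] = d/2 under phi_0 = N(mu_0, Sigma_0),
# whatever the (here randomly generated) mean and covariance are.
rng = np.random.default_rng(2)
for d in (2, 8, 32):
    A = rng.standard_normal((d, d))
    Sigma0 = A @ A.T + d * np.eye(d)          # random SPD covariance
    mu0 = rng.standard_normal(d)
    phi0 = stats.multivariate_normal(mu0, Sigma0)
    x = rng.multivariate_normal(mu0, Sigma0, size=200_000)
    print(d, phi0.logpdf(x).var())            # approximately d/2
```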

Theorem 1 shows the detectability loss for Gaussian distributions. In fact, when $d$ increases and $\mathrm{sKL}(\phi_0, \phi_1)$ remains constant, $\mathrm{SNR}(\phi_0 \to \phi_1)$ is upper-bounded by a function that monotonically decays as $1/d$. The decaying trend of $\mathrm{SNR}(\phi_0 \to \phi_1)$ indicates that detecting changes becomes more difficult when $d$ increases. Moreover, the decay rate does not depend on $\mathrm{sKL}(\phi_0, \phi_1)$; thus this problem equally affects all possible changes $\phi_0 \to \phi_1$ defined as in (1), disregarding their magnitude.

3.4 Discussion

First of all, let us remark that Theorem 1 implies detectability loss only when $\mathrm{sKL}(\phi_0, \phi_1)$ is kept constant. Assuming a constant change magnitude is necessary to correctly investigate the influence of the sole data dimension $d$ on the change detectability. In fact, when the change magnitude increases with $d$, changes might become even easier to detect as $d$ grows. This is what the experiments in [Zimek et al., 2012] (Section 2.1) show,² where outliers become easier to detect when $d$ increases. However, in that experiment, the change-detection problem becomes easier as $d$ increases, since each component of $x$ carries additional information about the change, which increases $\mathrm{sKL}(\phi_0, \phi_1)$.

² Even though similar techniques can sometimes be used for both change detection and anomaly detection, the two problems are intrinsically different, since the former aims at recognizing process changes, while the latter aims at identifying spurious data.
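The decay predicted by Theorem 1 is easy to reproduce numerically. The following sketch keeps $\mathrm{sKL}(\phi_0, \phi_1) = 1$ by using a location shift with $\|v\|^2 = 1$ under $\Sigma_0 = I$ (a convenient special case of ours, not the general procedure of the Appendix), and estimates $\mathrm{SNR}(\phi_0 \to \phi_1)$ as $d$ grows, together with the bound $\mathrm{sKL}^2/(d/2) = 2/d$ from the proof.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 200_000
for d in (1, 2, 4, 8, 16, 32, 64, 128):
    phi0 = stats.multivariate_normal(np.zeros(d), np.eye(d))
    v = np.full(d, 1 / np.sqrt(d))        # ||v||^2 = 1 => sKL(phi0, phi1) = 1
    L0 = phi0.logpdf(rng.standard_normal((n, d)))        # x ~ phi_0
    L1 = phi0.logpdf(rng.standard_normal((n, d)) - v)    # x ~ phi_1
    snr = (L0.mean() - L1.mean()) ** 2 / (L0.var() + L1.var())
    print(f"d={d:4d}  SNR={snr:.4f}  bound 2/d={2 / d:.4f}")
```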

Detectability loss can also be proved when $\phi_0$ is non-Gaussian, as far as its components are independent. In fact, if $\phi_0(x) = \prod_{i=1}^{d} \phi_0^{(i)}(x^{(i)})$, where $(\cdot)^{(i)}$ denotes either the marginal of a pdf or the component of a vector, it follows that

$$\mathrm{var}_{x \sim \phi_0}[\mathcal{L}(x)] = \sum_{i=1}^{d} \mathrm{var}_{x \sim \phi_0}\!\left[\log\!\left(\phi_0^{(i)}(x^{(i)})\right)\right], \qquad (14)$$

since the $\log(\phi_0^{(i)}(x^{(i)}))$ are independent. Clearly, (14) increases with $d$, since its summands are positive. Thus, also in this case, the upper bound of $\mathrm{SNR}(\phi_0 \to \phi_1)$ decays with $d$ when $\mathrm{sKL}(\phi_0, \phi_1)$ is kept constant.

Remarkably, detectability loss does not depend on how the change $\phi_0 \to \phi_1$ affects $\mathcal{X}$. Our results hold, for instance, both when $\phi_0 \to \phi_1$ affects all the components of $\mathcal{X}$ and when some of them remain irrelevant for change-detection purposes. Moreover, detectability loss occurs independently of the specific change-detection method used on the log-likelihood (e.g., sequential analysis or window comparison), as our results concern $\mathrm{SNR}(\phi_0 \to \phi_1)$ only.

In the next section we show that detectability loss affects also real-world change-detection problems. To this purpose, we design a rigorous empirical analysis to show that the power of customary hypothesis tests actually decreases with $d$ when data are non-Gaussian and possibly dependent.

4 Empirical Analysis

Our empirical analysis has been designed to address the following goals: i) showing that $\mathrm{SNR}(\phi_0 \to \phi_1)$, which is the underpinning element of our theoretical result, is a suitable measure of change detectability; in particular, we show that the power of hypothesis tests able to detect both changes in the mean and in the variance of $\mathcal{L}(\cdot)$ also decays. ii) Showing that detectability loss is not due to density-estimation problems, but that it becomes a more serious issue when $\phi_0$ is estimated from training data. iii) Showing that detectability loss occurs also in Gaussian mixtures, and iv) showing that detectability loss occurs also on high-dimensional real-world datasets, which are far from being Gaussian or having independent components. We address the first two points in Section 4.1, and the third and fourth ones in Sections 4.2 and 4.3, respectively.

In our experiments, the change-detection performance is assessed by numerically computing the power of two customary hypothesis tests, namely the Lepage [Lepage, 1971] and the one-sided t-test,³ on data windows $W_P$ and $W_R$ that contain 500 data each. As we discussed in Section 3.2, the t-statistic on the log-likelihood is closely related to $\mathrm{SNR}(\phi_0 \to \phi_1)$, while the Lepage is a nonparametric statistic that detects both location and scale changes.⁴ To compute the power, we set $h$ to guarantee a significance level⁵ $\alpha = 0.05$. Following the procedure in the Appendix, we synthetically introduce changes $\phi_0 \to \phi_1$ having $\mathrm{sKL}(\phi_0, \phi_1) = 1$, which, in the univariate Gaussian case, corresponds to $v$ equal to the standard deviation of $\phi_0$.

³ We can assume that $\phi_0 \to \phi_1$ decreases the expectation of $\mathcal{L}(x)$, since $\mathbb{E}_{x \sim \phi_0}[\log(\phi_0(x))] - \mathbb{E}_{x \sim \phi_1}[\log(\phi_0(x))] \geq 0$ follows from (7).

⁴ The Lepage statistic is defined as the sum of the squares of the Mann-Whitney and Mood statistics; see also [Ross et al., 2011].

⁵ The value of $h$ for the Lepage test is given by the asymptotic approximation of the statistic in [Lepage, 1971].

4.1 Gaussian Datastreams

We generate Gaussian datastreams having dimension $d \in \{1, 2, 4, 8, 16, 32, 64, 128\}$ and, for each value of $d$, we prepare 10000 runs, with $\phi_0 = \mathcal{N}(\mu_0, \Sigma_0)$ and $\phi_1 = \mathcal{N}(\mu_1, \Sigma_1)$. The parameters $\mu_0 \in \mathbb{R}^d$ and $\Sigma_0 \in \mathbb{R}^{d \times d}$ have been randomly generated, while $\mu_1 \in \mathbb{R}^d$ and $\Sigma_1 \in \mathbb{R}^{d \times d}$ have been set to yield $\mathrm{sKL}(\phi_0, \phi_1) = 1$ (see the Appendix). In each run we generate 1000 samples: $\{x(t), t = 1, \dots, 500\}$ from $\phi_0$, and $\{x(t), t = 501, \dots, 1000\}$ from $\phi_1$. Then, we compute the datastream $L = \{\mathcal{L}(x(t)), t = 1, \dots, 1000\}$, and define $W_P = \{\mathcal{L}(x(t)), t = 1, \dots, 500\}$ and $W_R = \{\mathcal{L}(x(t)), t = 501, \dots, 1000\}$.

We repeat the same experiment replacing $\phi_0$ with its estimate $\hat{\phi}_0$, where $\hat{\mu}_0$ and $\hat{\Sigma}_0$ are computed using the sample estimators over an additional training set $TR$ whose size grows linearly with $d$, i.e., $\#TR = 100 \cdot d$. We denote by $\hat{L} = \{\hat{\mathcal{L}}(x(t)), t = 1, \dots, 1000\}$ the sequence of estimated log-likelihood values. Finally, we repeat the whole experiment keeping $\#TR = 100$ for every value of $d$, and we denote by $\hat{L}_{100}$ the corresponding sequence of log-likelihood values.

Figure 1(a) shows that the power of both the Lepage and the one-sided t-test substantially decreases when $d$ increases. This result is coherent with our theoretical analysis of Section 3, and confirms that $\mathrm{SNR}(\phi_0 \to \phi_1)$ is a suitable measure of change detectability. While it is not surprising that the power of the t-test decays, given its connection with $\mathrm{SNR}(\phi_0 \to \phi_1)$, it is remarkable that the power of the Lepage test also decays, as this fact indicates that it becomes more difficult to detect both changes in the mean and in the dispersion of $L$. The decaying power of both tests indicates that the corresponding test statistics decrease with $d$, which implies larger detection delays when using these statistics in sequential monitoring schemes.
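The following sketch reproduces this protocol at a reduced scale: fewer runs, identity covariance, and a plain location shift with $\mathrm{sKL} = 1$ instead of the Appendix procedure, all assumptions made for brevity. The empirical power of the one-sided Welch t-test decays with $d$ as in Figure 1(a); the `alternative` argument requires SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Reduced-scale sketch of the Section 4.1 protocol (200 runs instead of
# the paper's 10000; the change is simplified to a location shift, sKL = 1).
rng = np.random.default_rng(4)
alpha, runs, n = 0.05, 200, 500
for d in (1, 4, 16, 64):
    phi0 = stats.multivariate_normal(np.zeros(d), np.eye(d))
    v = np.full(d, 1 / np.sqrt(d))            # sKL(phi0, phi1) = 1
    rejections = 0
    for _ in range(runs):
        W_P = phi0.logpdf(rng.standard_normal((n, d)))        # x ~ phi_0
        W_R = phi0.logpdf(rng.standard_normal((n, d)) - v)    # x ~ phi_1
        # One-sided Welch t-test: the change decreases E[L] (footnote 3),
        # so the alternative is mean(W_P) > mean(W_R).
        t, p = stats.ttest_ind(W_P, W_R, equal_var=False,
                               alternative='greater')
        rejections += p < alpha
    print(f"d={d:3d}  empirical power={rejections / runs:.2f}")
```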
Note that detectability loss is not due to density-estimation issues, but rather to the fact that the change-detection problem becomes intrinsically more challenging, as it occurs in the ideal case where $\phi_0$ is known (solid lines). When $\hat{L}$ is computed from an estimated $\hat{\phi}_0$ (dashed and dotted lines), the problem becomes even more severe, and it worsens when the number of training data does not grow with $d$ (dotted lines).

4.2 Gaussian Mixtures

We now consider $\phi_0$ and $\phi_1$ as Gaussian mixtures, to prove that detectability loss occurs also when datastreams are generated/approximated by more general distribution models. Mimicking the proof of Theorem 1, we show that when $d$ increases and $\mathrm{sKL}(\phi_0, \phi_1)$ is kept constant, the upper bound of $\mathrm{SNR}(\phi_0 \to \phi_1)$ decreases. To this purpose, it is enough to show that $\mathrm{var}_{x \sim \phi_0}[\mathcal{L}(x)]$ increases with $d$.

The pdf of a mixture of $k$ Gaussians is

$$\phi_0(x) = \sum_{i=1}^{k} \alpha_{0,i}\,\mathcal{N}(\mu_{0,i}, \Sigma_{0,i})(x) = \sum_{i=1}^{k} \frac{\alpha_{0,i}}{(2\pi)^{d/2}\det(\Sigma_{0,i})^{1/2}}\,e^{-\frac{1}{2}(x - \mu_{0,i})'\Sigma_{0,i}^{-1}(x - \mu_{0,i})}, \qquad (15)$$

1371 1 1 3 10 u L t-test on u l L 0.8 L 0.8 t-test on l Lb 2 u 10 L Lepage test on u t L L 0.6 -test on l 0.6 b L Lepage test on l b L wer t-test on L wer b 1 ariance Po b Po t L 10 0.4 -test on 100 V 0.4 b Lepage testb on L b 0.2 Lepage test on L 100 0.2

Lepage test on L100 0 b 0 0 1 2 0 1 2 10 20 30 40 50 10 10 b 10 10 10 10 d d d (a) (b) (c)

Figure 1: (a) Power of the Lepage and one-sided t-test empirically computed on sequences generated as in Section 4.1. Detectability loss clearly emerges when the log-likelihood is computed using $\phi_0$ (denoted by $L$) or its estimates fitted on $100 \cdot d$ samples ($\hat{L}$) or on 100 samples ($\hat{L}_{100}$). (b) The sample variance of $\mathcal{L}_u(\cdot)$ (16) and $\mathcal{L}_l(\cdot)$ (17) computed as in Section 4.2. As in the Gaussian case, both these variances grow linearly with $d$, and similar results hold when using $\hat{\phi}_0$, which is estimated from $200 \cdot d$ training data. (c) The power of both the Lepage and one-sided t-test indicates detectability loss on the Particle dataset, which is approximated by a mixture of 2 Gaussians, using both (16) and (17). Using $\hat{\mathcal{L}}_u$ guarantees better performance than $\hat{\mathcal{L}}_l$ because the latter yields a larger variance, as shown in (b). We achieve similar results on the Wine dataset, which was approximated by a mixture of 4 Gaussians; these results have not been reported due to space limitations.

where $\alpha_{0,i} > 0$ is the weight of the $i$-th Gaussian $\mathcal{N}(\mu_{0,i}, \Sigma_{0,i})$. Unfortunately, the log-likelihood $\mathcal{L}(x)$ of a Gaussian mixture does not admit an expression similar to (12), and two approximations are typically used to avoid severe numerical issues when $d \gg 1$.

The first approximation consists in considering only the Gaussian of the mixture yielding the largest likelihood, as in [Kuncheva, 2013], i.e.,

$$\mathcal{L}_u(x) = \log(\alpha_{0,i^*}) - \frac{1}{2}\left(\log\!\left((2\pi)^d \det(\Sigma_{0,i^*})\right) + (x - \mu_{0,i^*})'\Sigma_{0,i^*}^{-1}(x - \mu_{0,i^*})\right), \qquad (16)$$

where $i^*$ is defined as

$$i^* = \operatorname*{argmax}_{i=1,\dots,k} \left(\frac{\alpha_{0,i}}{\det(\Sigma_{0,i})^{1/2}}\,e^{-\frac{1}{2}(x - \mu_{0,i})'\Sigma_{0,i}^{-1}(x - \mu_{0,i})}\right).$$

The second approximation we consider is:

$$\mathcal{L}_l(x) = -\sum_{i=1}^{k} \frac{\alpha_{0,i}}{2}\left(\log\!\left((2\pi)^d \det(\Sigma_{0,i})\right) + (x - \mu_{0,i})'\Sigma_{0,i}^{-1}(x - \mu_{0,i})\right), \qquad (17)$$

which is a lower bound of $\mathcal{L}(\cdot)$ due to the Jensen inequality.

We consider the same values of $d$ as in Section 4.1 and, for each of these, we generate 1000 datastreams of 500 data drawn from a Gaussian mixture $\phi_0$. We set $k = 2$ and $\alpha_{0,1} = \alpha_{0,2} = 0.5$, while the parameters $\mu_{0,1}$, $\mu_{0,2}$, $\Sigma_{0,1}$, $\Sigma_{0,2}$ are randomly generated. We then compute the sample variance of both $\mathcal{L}_u$ and $\mathcal{L}_l$ over each datastream and report their average in Figure 1(b). As in Section 4.1, we repeat this experiment by preliminarily estimating $\hat{\phi}_0$ from a training set containing $200 \cdot d$ additional samples, and then we compute $\hat{\mathcal{L}}_u$ and $\hat{\mathcal{L}}_l$.

Figure 1(b) shows that the variances of $\mathcal{L}_u$ and $\mathcal{L}_l$ grow linearly with respect to $d$, as in the Gaussian case (13). This indicates that detectability loss occurs also when $\mathcal{X}$ is generated by a simple bimodal distribution and, most importantly, also when using $\mathcal{L}_u$ or $\mathcal{L}_l$, which are traditionally adopted when fitting Gaussian mixtures. As in Section 4.1, we see that the log-likelihoods $\hat{\mathcal{L}}_u$ and $\hat{\mathcal{L}}_l$ computed with respect to fitted models follow the same trend. We further observe that $\hat{\mathcal{L}}_l$ exhibits a much larger variance than $\hat{\mathcal{L}}_u$; thus we expect the former to achieve lower change-detection performance than $\hat{\mathcal{L}}_u$.
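The two approximations (16) and (17) can be sketched as follows for a given mixture; the helper and the two-component test mixture are illustrations of ours, with $\mathcal{L}_u$ taken as the log of the best-scoring weighted component and $\mathcal{L}_l$ as the Jensen lower bound.

```python
import numpy as np
from scipy import stats

def gmm_loglik_approx(x, weights, means, covs):
    """Approximate log-likelihoods of a Gaussian mixture:
    L_u, eq. (16): log of the best-scoring weighted component;
    L_l, eq. (17): weighted sum of component log-densities (Jensen bound)."""
    w = np.asarray(weights, dtype=float)
    logN = np.array([stats.multivariate_normal(m, c).logpdf(x)
                     for m, c in zip(means, covs)])      # shape (k, n)
    L_u = (np.log(w)[:, None] + logN).max(axis=0)
    L_l = (w[:, None] * logN).sum(axis=0)
    return L_u, L_l

# Bimodal example as in this section (k = 2, equal weights); the parameters
# are placeholders rather than the randomly generated ones of the paper.
rng = np.random.default_rng(5)
d = 8
means = [rng.standard_normal(d), rng.standard_normal(d) + 3.0]
covs = [np.eye(d), 2.0 * np.eye(d)]
x = rng.multivariate_normal(means[0], covs[0], size=1000)
L_u, L_l = gmm_loglik_approx(x, [0.5, 0.5], means, covs)
print(L_u.var(), L_l.var())   # L_l typically exhibits the larger variance
```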

4.3 Real-World Data

To investigate detectability loss in real-world datasets, we design a change-detection problem on the Wine Quality dataset [Cortez et al., 2009] and the MiniBooNE Particle dataset [Roe et al., 2005] from the UCI repository [Lichman, 2013]. The Wine dataset has 12 dimensions: 11 corresponding to numerical results of laboratory analyses (such as density, pH, residual sugar), and one corresponding to a final grade (from 0 to 10) for each wine. We consider the vectors of laboratory analyses of all white wines having a grade above 6, resulting in an 11-dimensional dataset containing 3258 data. The Particle dataset contains numerical measurements from a physical experiment designed to distinguish electron from muon neutrinos. Each sample has 50 dimensions, and we consider only data from the muon class, yielding 93108 data.

In both datasets we have to estimate $\phi_0$, both for introducing changes having constant magnitude and for computing the log-likelihood. We adopt Gaussian mixtures and estimate $k$ by 5-fold cross-validation over the whole datasets, obtaining $k = 4$ and $k = 2$ for the Wine and Particle datasets, respectively.

We process each dataset as follows. Let us denote by $D$ the dataset dimension; for each value of $d = 1, \dots, D$ we consider only $d$ components of our dataset, which are randomly selected. We then generate a $d$-dimensional training set of $200 \cdot d$ samples and a test set of 1000 samples (the datastream), which are extracted by a bootstrap procedure without replacement. The second half of the datastream is perturbed by the change $\tilde{\phi}_0 \to \tilde{\phi}_1$, which is defined by fitting $\tilde{\phi}_0$ at first on the whole $d$-dimensional dataset, and then computing $\tilde{\phi}_1$ according to the procedure in the Appendix. Then, we estimate $\hat{\phi}_0$ from the training set and we compute $\mathcal{T}(\hat{W}_P, \hat{W}_R)$, where $\hat{W}_P$, $\hat{W}_R$ are defined as in Section 4.1. This procedure is repeated 5000 times to estimate the test power numerically. Note that the number of Gaussians in both $\tilde{\phi}_0$ and $\hat{\phi}_0$ is the value of $k$ estimated from the whole $D$-dimensional dataset, and that $\tilde{\phi}_0$ is by no means used for change-detection purposes.

Figure 1(c) reports the power of both the Lepage and one-sided t-tests on the Particle dataset, considering $\mathcal{L}_u$ (16) and $\mathcal{L}_l$ (17) as approximated expressions of the likelihood. The power of both tests is monotonically decreasing, indicating an increasing difficulty in detecting a change between $\hat{W}_P$ and $\hat{W}_R$ when $d$ grows. This result is in agreement with the claim of Theorem 1 and the results in the previous sections. The Lepage test here turns out to be more powerful than the t-test, and this indicates that it is important to monitor also the dispersion of $\mathcal{L}(\cdot)$ when using Gaussian mixtures, where $\mathcal{L}(\cdot)$ can be multimodal. Moreover, the decaying power of the Lepage test indicates that, as in Section 4.1, monitoring both the expectation and the dispersion of $\mathcal{L}(\cdot)$ does not prevent the detectability loss. Figure 1(c) also indicates that $\hat{\mathcal{L}}_u(\cdot)$ guarantees superior performance to $\hat{\mathcal{L}}_l(\cdot)$, since the former has lower variance than the latter. This fact also underlines the importance of considering the variance of $\mathcal{L}(\cdot)$ in measures of change detectability, as in (4). Experiments on the Wine dataset, which is approximated by a more sophisticated distribution ($k = 4$), confirm detectability loss, but the results have not been reported due to space limitations.

We finally remark that we have set a change magnitude ($\mathrm{sKL}(\phi_0, \phi_1) = 1$) that is quite customary in change-detection experiments, as in the univariate Gaussian case this corresponds to setting $v$ equal to the standard deviation of $\phi_0$. Therefore, since in our experiments the power of both tests is almost halved when $d \approx 10$, we can conclude that detectability loss is not only a Big-Data issue.

5 Conclusions

We provide the first rigorous study of the challenges that change-detection methods have to face when the data dimension scales. Our theoretical and empirical analyses reveal that the popular approach of monitoring the log-likelihood of a multivariate datastream suffers detectability loss when the data dimension increases. Remarkably, detectability loss is not a consequence of density-estimation errors (even though these further reduce detectability), but it rather refers to an intrinsic limitation of this change-detection approach. Our theoretical results demonstrate that detectability loss occurs independently of the specific statistical tool used to monitor the log-likelihood, and does not depend on the number of input components affected by the change. Our empirical analysis, which is rigorously performed by keeping the change magnitude constant when scaling the data dimension, confirms detectability loss also on real-world datastreams. Ongoing works concern extending this study to other change-detection approaches and to other families of distributions.

Appendix: Generating Changes of Constant Magnitude

Here we describe a procedure to select, given $\phi_0$, an orthogonal matrix $Q \in \mathbb{R}^{d \times d}$ and a vector $v \in \mathbb{R}^d$ such that $\phi_1(x) = \phi_0(Qx + v)$ guarantees $\mathrm{sKL}(\phi_0, \phi_1) = 1$ in the case of Gaussian pdfs, and $\mathrm{sKL}(\phi_0, \phi_1) \approx 1$ for arbitrary distributions. Extensions to different values of $\mathrm{sKL}(\phi_0, \phi_1)$ are straightforward. Since $\phi_1(x) = \phi_0(Qx + v)$, we formulate the problem as generating at first a rotation matrix $Q$ such that

$$0 < \mathrm{sKL}(\phi_0(\cdot), \phi_0(Q\,\cdot)) < 1,$$

and then defining the translation vector $v$ to adjust $\phi_1$ such that $\mathrm{sKL}(\phi_0, \phi_1)$ reaches (or approaches) 1.

We proceed as follows: we randomly define a rotation axis $r$, and a sequence of rotation matrices $\{Q_j\}_j$ around $r$, where the rotation angles monotonically decrease toward 0 (thus $Q_j$ tends to the identity matrix as $j \to \infty$). Then, we set $Q$ as the largest rotation $Q_{j^*}$ yielding an sKL smaller than 1, namely

$$j^* = \min\{j : \mathrm{sKL}(\phi_0(\cdot), \phi_0(Q_j\,\cdot)) < 1\}. \qquad (18)$$

When $\phi_0$ is continuous and bounded (as in the case of Gaussian mixtures), it can be easily proved that such a $j^*$ exists.
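A sketch of the rotation step (18), assuming a Gaussian $\phi_0$ so that the sKL can be evaluated in closed form. Here the "rotation around an axis" is realized as a rotation in a random 2-plane, one convenient choice among many, and all parameters are illustrative.

```python
import numpy as np

def gaussian_skl(mu0, S0, mu1, S1):
    """Symmetric KL divergence (3) between two Gaussians (closed form)."""
    d = len(mu0)
    P0, P1 = np.linalg.inv(S0), np.linalg.inv(S1)
    dm = mu1 - mu0
    kl01 = 0.5 * (np.trace(P1 @ S0) + dm @ P1 @ dm - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))
    kl10 = 0.5 * (np.trace(P0 @ S1) + dm @ P0 @ dm - d
                  + np.log(np.linalg.det(S0) / np.linalg.det(S1)))
    return kl01 + kl10

def plane_rotation(d, theta, e1, e2):
    """Rotation by angle theta in the plane spanned by orthonormal e1, e2."""
    Q = np.eye(d)
    Q += (np.cos(theta) - 1) * (np.outer(e1, e1) + np.outer(e2, e2))
    Q += np.sin(theta) * (np.outer(e2, e1) - np.outer(e1, e2))
    return Q

# Pick Q as the largest rotation with sKL(phi0(.), phi0(Q.)) < 1, as in (18).
rng = np.random.default_rng(6)
d = 8
mu0, A = rng.standard_normal(d), rng.standard_normal((d, d))
S0 = A @ A.T + d * np.eye(d)                               # random SPD covariance
e1, e2 = np.linalg.qr(rng.standard_normal((d, 2)))[0].T    # random 2-plane
for theta in np.pi * 2.0 ** -np.arange(1, 20):             # angles decreasing to 0
    Q = plane_rotation(d, theta, e1, e2)
    # phi1(x) = phi0(Qx) is Gaussian with mean Q'mu0 and covariance Q'S0Q
    if gaussian_skl(mu0, S0, Q.T @ mu0, Q.T @ S0 @ Q) < 1:
        break
print(theta)
```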
In the case of Gaussian pdfs, when $\phi_0 = \mathcal{N}(\mu_0, \Sigma_0)$, $\mathrm{sKL}(\phi_0, \phi_1)$ admits a closed-form expression:

$$\begin{aligned} \mathrm{sKL}(\phi_0, \phi_1) = \frac{1}{2}\Big[ & v'\Sigma_0^{-1}v + v'Q\Sigma_0^{-1}Q'v + 2v'\Sigma_0^{-1}(I - Q)\mu_0 + 2v'Q\Sigma_0^{-1}(Q' - I)\mu_0 \\ & + \mathrm{Tr}(Q'\Sigma_0^{-1}Q\Sigma_0) + \mathrm{Tr}(\Sigma_0^{-1}Q'\Sigma_0 Q) - 2d + 2\mu_0'(I - Q')\Sigma_0^{-1}(I - Q)\mu_0 \Big], \end{aligned} \qquad (19)$$

and $\mathrm{sKL}(\phi_0(\cdot), \phi_0(Q_j\,\cdot))$ can be exactly computed to solve (18). When there are no similar expressions for $\mathrm{sKL}(\phi_0, \phi_1)$, it has to be computed via Monte Carlo simulations.

After having set the rotation matrix $Q$, we randomly generate a unit vector $u$ as in [Alippi, 2014] and determine a suitable translation along the line $v = \rho u$, where $\rho > 0$, to achieve $\mathrm{sKL}(\phi_0, \phi_1) = 1$. Again, the closed-form expression (19) allows us to directly compute the exact value of $\rho$ by substituting $v = \rho u$ into (19). This yields a quadratic equation in $\rho$, whose positive solution $\rho^*$ provides the $v = \rho^* u$ that leads to $\mathrm{sKL}(\phi_0, \phi_1) = 1$. When there are no analytical expressions for $\mathrm{sKL}(\phi_0, \phi_1)$, we generate an increasing sequence $\{\rho_n\}_n$ such that $\rho_0 = 0$ and $\rho_n \to \infty$ as $n \to \infty$, and set

$$n^* = \max\{n : \mathrm{sKL}(\phi_0(\cdot), \phi_0(Q\,\cdot + \rho_n u)) < 1\}, \qquad (20)$$

where $\mathrm{sKL}(\phi_0(\cdot), \phi_0(Q\,\cdot + \rho_n u))$ is computed by Monte Carlo simulations. After having solved (20), we determine $\rho^*$ via linear interpolation of $[\rho_{n^*}, \rho_{n^*+1}]$ on the corresponding values of the sKL. In this case, we can only guarantee $\mathrm{sKL}(\phi_0, \phi_1) \approx 1$, with an accuracy that can be improved by increasing the resolution of $\{\rho_n\}_n$.
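For the Gaussian case, the quadratic dependence of (19) on $\rho$ can be exploited without transcribing the full formula: three evaluations of the closed-form Gaussian sKL determine the quadratic exactly, and its positive root gives $\rho^*$. The sketch below reuses `gaussian_skl` from the previous listing; the three-point fit is our shortcut, not the paper's derivation.

```python
import numpy as np

def solve_rho(mu0, S0, Q, u, target=1.0):
    # sKL(phi0, phi0(Q. + rho*u)) is exactly quadratic in rho (cf. (19)):
    # fit its coefficients from three evaluations and solve sKL(rho) = target.
    def skl(rho):
        v = rho * u
        # phi1(x) = phi0(Qx + v) is Gaussian N(Q'(mu0 - v), Q'S0Q)
        return gaussian_skl(mu0, S0, Q.T @ (mu0 - v), Q.T @ S0 @ Q)
    a, b, c = np.polyfit([0.0, 1.0, 2.0],
                         [skl(0.0), skl(1.0), skl(2.0)], deg=2)
    roots = np.roots([a, b, c - target])
    return max(r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 0)
```

With $Q$ from the previous sketch and a random unit vector $u$, `v = solve_rho(mu0, S0, Q, u) * u` yields $\mathrm{sKL}(\phi_0, \phi_1) = 1$ up to floating-point error.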

References

[Agarwal, 2005] Deepak Agarwal. An empirical Bayes approach to detect anomalies in dynamic multidimensional arrays. In Proceedings of the IEEE International Conference on Data Mining (ICDM), 2005.

[Alippi et al., 2013] Cesare Alippi, Giacomo Boracchi, and Manuel Roveri. Just-in-time classifiers for recurrent concepts. IEEE Transactions on Neural Networks and Learning Systems, 24(4), 2013.

[Alippi, 2014] Cesare Alippi. Intelligence for Embedded Systems, a Methodological Approach. Springer, 2014.

[Basseville et al., 1993] Michèle Basseville, Igor V. Nikiforov, et al. Detection of Abrupt Changes: Theory and Application. Prentice Hall, Englewood Cliffs, 1993.

[Bifet and Gavalda, 2007] Albert Bifet and Ricard Gavalda. Learning from time-changing data with adaptive windowing. In Proceedings of the SIAM International Conference on Data Mining (SDM), 2007.

[Cortez et al., 2009] Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 2009.

[Cover and Thomas, 2012] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

[Ditzler and Polikar, 2011] Gregory Ditzler and Robi Polikar. Hellinger distance based drift detection for nonstationary environments. In Proceedings of the IEEE Symposium on Computational Intelligence in Dynamic and Uncertain Environments (CIDUE), 2011.

[Dyer et al., 2014] Karl B. Dyer, Robert Capo, and Robi Polikar. COMPOSE: A semisupervised learning framework for initially labeled nonstationary streaming data. IEEE Transactions on Neural Networks and Learning Systems, 25(1), 2014.

[Gama et al., 2004] Joao Gama, Pedro Medas, Gladys Castillo, and Pedro Rodrigues. Learning with drift detection. In Proceedings of the Brazilian Symposium on Artificial Intelligence (SBIA), 2004.

[Gama et al., 2014] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4), 2014.

[Krempl, 2011] Georg Krempl. The algorithm APT to classify in concurrence of latency and drift. In Advances in Intelligent Data Analysis X (IDA), pages 222-233, 2011.

[Kuncheva, 2013] Ludmila I. Kuncheva. Change detection in streaming multivariate data using likelihood detectors. IEEE Transactions on Knowledge and Data Engineering, 25(5), 2013.

[Lepage, 1971] Yves Lepage. A combination of Wilcoxon's and Ansari-Bradley's statistics. Biometrika, 58(1), 1971.

[Lichman, 2013] M. Lichman. UCI machine learning repository, 2013.

[Lung-Yut-Fong et al., 2011] Alexandre Lung-Yut-Fong, Céline Lévy-Leduc, and Olivier Cappé. Robust changepoint detection based on multivariate rank statistics. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.

[Nguyen et al., 2014] Tuan Duong Nguyen, Marthinus Christoffel Du Plessis, Takafumi Kanamori, and Masashi Sugiyama. Constrained least-squares density-difference estimation. IEICE Transactions on Information and Systems, 97(7), 2014.

[Roe et al., 2005] Byron P. Roe, Hai-Jun Yang, Ji Zhu, Yong Liu, Ion Stancu, and Gordon McGregor. Boosted decision trees as an alternative to artificial neural networks for particle identification. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 543(2), 2005.

[Ross et al., 2011] Gordon J. Ross, Dimitris K. Tasoulis, and Niall M. Adams. Nonparametric monitoring of data streams for changes in location and scale. Technometrics, 53(4), 2011.

[Ross et al., 2012] Gordon J. Ross, Niall M. Adams, Dimitris K. Tasoulis, and David J. Hand. Exponentially weighted moving average charts for detecting concept drift. Pattern Recognition Letters, 33(2), 2012.

[Schilling, 1986] Mark F. Schilling. Multivariate two-sample tests based on nearest neighbors. Journal of the American Statistical Association, 81(395), 1986.

[Song et al., 2007] Xiuyao Song, Mingxi Wu, Christopher Jermaine, and Sanjay Ranka. Statistical change detection for multi-dimensional data. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), 2007.

[Sullivan and Woodall, 2000] Joe H. Sullivan and William H. Woodall. Change-point detection of mean vector or covariance matrix shifts using multivariate individual observations. IIE Transactions, 32(6), 2000.

[Tartakovsky et al., 2006] Alexander G. Tartakovsky, Boris L. Rozovskii, Rudolf B. Blazek, and Hongjoong Kim. A novel approach to detection of intrusions in computer networks via adaptive sequential and batch-sequential change-point detection methods. IEEE Transactions on Signal Processing, 54(9), 2006.

[Wang and Chen, 2002] Tai-Yue Wang and Long-Hui Chen. Mean shifts detection and classification in multivariate process: A neural-fuzzy approach. Journal of Intelligent Manufacturing, 13(3), 2002.

[Welch, 1947] Bernard L. Welch. The generalization of "Student's" problem when several different population variances are involved. Biometrika, 34(1/2), 1947.

[Zimek et al., 2012] Arthur Zimek, Erich Schubert, and Hans-Peter Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 5(5), 2012.
