Detecting Change in Data Streams
Daniel Kifer, Shai Ben-David, Johannes Gehrke
Department of Computer Science, Cornell University

Abstract

Detecting changes in a data stream is an important area of research with many applications. In this paper, we present a novel method for the detection and estimation of change. In addition to providing statistical guarantees on the reliability of detected changes, our method also provides meaningful descriptions and quantification of these changes. Our approach assumes that the points in the stream are independently generated, but otherwise makes no assumptions on the nature of the generating distribution. Thus our techniques work for both continuous and discrete data. In an experimental study we demonstrate the power of our techniques.

1 Introduction

In many applications, data is not static but arrives in data streams. Besides the algorithmic difference between processing data streams and static data, there is another significant difference. For static datasets, it is reasonable to assume that the data was generated by a fixed process; for example, the data is a sample from a static distribution. But a data stream necessarily has a temporal dimension, and the underlying process that generates the data stream can change over time [17, 1, 23]. The quantification and detection of such change is one of the fundamental challenges in data stream settings.

Change has far-reaching impact on any data processing algorithm. For example, when constructing data stream mining models [17, 1], data that arrived before a change can bias the models towards characteristics that no longer hold. If we process queries over data streams, we may want to give separate answers for each time interval where the underlying data distribution is stable. Most existing work has concentrated on algorithms that adapt to changing distributions, either by discarding old data or by giving it less weight [17]. However, to the best of our knowledge, previous work does not contain a formal definition of change, and thus existing algorithms cannot specify precisely when and how the underlying distribution changes.

In this paper we make a first step towards formalizing the detection and quantification of change in a data stream. We assume that the data points are generated sequentially and independently by some underlying probability distribution. Our goal is to detect when this distribution changes, and to quantify and describe this change.

It is unrealistic to allow data stream processing algorithms enough memory capacity to store the full history of the stream. Therefore we base our change-detection algorithm on a two-window paradigm. The algorithm compares the data in some "reference window" to the data in a current window. Both windows contain a fixed number of successive data points. The current window slides forward with each incoming data point, and the reference window is updated whenever a change is detected. We analyze this paradigm and develop algorithms (or tests) that are supported by proven guarantees on their sensitivity to change, their robustness against raising false alarms, and their running time. Furthermore, we aim to obtain not only reliable change detection, but also a comprehensible description of the nature of the detected change.
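To make the two-window paradigm concrete, the following is a minimal sketch (in Python) of how such a detector could be organized. The function and parameter names are illustrative assumptions, and the plugged-in two_sample_test is a placeholder for any two-sample test with a controllable false-alarm rate, not the specific tests developed in this paper.

    from collections import deque

    def stream_change_detector(stream, window_size, two_sample_test):
        # Sketch of the two-window paradigm: a fixed "reference window" is
        # compared against a sliding "current window"; when the plugged-in
        # two-sample test reports a change, the current window becomes the
        # new reference window.
        reference = []                        # fixed until a change is reported
        current = deque(maxlen=window_size)   # slides forward with each point

        for i, x in enumerate(stream):
            if len(reference) < window_size:
                reference.append(x)           # fill the initial reference window
                continue
            current.append(x)
            full = len(current) == window_size
            if full and two_sample_test(list(reference), list(current)):
                yield i                       # position at which change is declared
                reference = list(current)     # restart from the new distribution
                current.clear()

For example, stream_change_detector(data, 100, my_test) yields the indices at which changes are declared, where my_test stands in for a two-sample test of the kind discussed in Sections 1.2 and 1.3.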
1.1 Applications

A change detection test with the above properties has many interesting applications:

Quality Control. A factory manufactures beams made from a metal alloy. The strengths of a sample of the beams can be measured during quality testing. Estimating the number of defective beams and determining the statistical significance of this number are well-studied problems in the statistics community [5]. However, even if the number of defective beams does not change, the factory can still benefit by analyzing the distribution of beam strengths over time. Changes in this distribution can signal the development of a problem or give evidence that a new manufacturing technique is creating an improvement. Information that describes the change could also help in the analysis of the technique.

Data Mining. Suppose a data stream mining algorithm is creating a data mining model that describes certain aspects of the data stream. The data mining model does not need to change as long as the underlying data distribution is stable. However, for sufficiently large changes in the distribution that generates the data stream, the model will become inaccurate. In this case it is better to completely remove the contributions of old data (which arrived before the change) from the model rather than to wait for enough new data to come in and outweigh the stale data. Further information about where the change has occurred could help the user avoid rebuilding the entire model: if the change is localized, it may only be necessary to rebuild part of the model.

1.2 Statistical Requirements

Recall that our basic approach to change detection in data streams uses two sliding windows over the data stream. This reduces the problem of detecting change over a data stream to the problem of testing whether the two samples in the windows were generated by different distributions. Consequently, we start by considering the case of detecting a difference in distribution between two input samples. Assume that two datasets S1 and S2 were generated by two probability distributions, P1 and P2. A natural question to ask is: can we infer from S1 and S2 whether they were generated by the same distribution, P1 = P2, or is it the case that P1 ≠ P2? To answer this question, we need to design a "test" that can tell us whether P1 and P2 are different. Assuming that we take this approach, what are our requirements on such a test?

Our first requirement is that the test should have two-way guarantees: we want the guarantee that our test detects true changes with high probability, i.e., it should have few false negatives. But we also want the test to notify us only if a true change has occurred, i.e., it should have few false positives. Furthermore, we need to be able to extend those guarantees from the two-sample problem to the data stream.

There are also practical considerations that affect our choice of test. If we know that the distribution of the data has a certain parametric form (for example, it is a normal distribution), then we can draw upon decades of research in the statistics community, where powerful tests have been developed. Unfortunately, real data is rarely that well-behaved in practice; it does not follow "nice" parametric distributions.

Our last requirement is motivated by the users of our test. A user not only wants to be notified that the underlying distribution has changed, but she also wants to know how it has changed. Thus we want a test that not only detects changes, but also describes the change in a user-understandable way.

In summary, we want a change detection test with these four properties: the rate of spurious detections (false positives) and the rate of missed detections (false negatives) can be controlled, it is non-parametric, and it provides a description of the detected change.

1.3 An Informal Motivation of our Approach

Let us start by discussing briefly how our work relates to the most common non-parametric statistical test for change detection, the Wilcoxon test [24]. (Recall that parametric tests are unsuitable as we do not want to restrict the applicability of our work to data that follows certain parametric distributions.)

The Wilcoxon test is a statistical test that measures the tendency of one sample to contain values that are larger than those in the other sample. We can control the probability that the Wilcoxon test raises a false alarm. However, this test only permits the detection of certain limited kinds of changes, such as a change in the mean of a normal distribution. Furthermore, it does not give a meaningful description of the changes it detects.
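As a small illustration of this limitation, the snippet below runs the Wilcoxon rank-sum test (via SciPy) on synthetic samples; the particular distributions, sample sizes, and seed are arbitrary choices for illustration, not an experiment from this paper. A shift in the mean is flagged with a very small p-value, while a change in variance alone leaves the ranks roughly balanced and is typically not detected.

    import numpy as np
    from scipy.stats import ranksums

    rng = np.random.default_rng(0)
    reference = rng.normal(loc=0.0, scale=1.0, size=500)

    # A shift in the mean changes the rank ordering and is easily detected.
    shifted = rng.normal(loc=0.5, scale=1.0, size=500)
    print(ranksums(reference, shifted).pvalue)    # very small p-value

    # A pure change in spread leaves the ranks balanced and is usually missed.
    widened = rng.normal(loc=0.0, scale=3.0, size=500)
    print(ranksums(reference, widened).pvalue)    # typically not significant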
We address the issue of two-way guarantees by using a distance function (between distributions) to help describe the change. Given a distance function, our test provides guarantees of the form: "to detect a distance > ε between two distributions P1 and P2, we will need samples of at most n points from each of P1 and P2."

Let us start by considering possible distance functions. One possibility is to use powerful information-theoretic distances, such as the Jensen-Shannon Divergence (JSD) [19]. However, to use these measures we need discrete distributions, and it may also be hard to explain the idea of entropy to the average end-user. There exist many other common measures of distance between distributions, but they are either too insensitive or too sensitive. For example, the commonly used L1 distance is too sensitive and can require arbitrarily large samples to determine if two distributions have L1 distance > ε [4]. At the other extreme, Lp norms (for p > 1) are far too insensitive: two distributions D1 and D2 can be close in the Lp norm and yet have the undesirable property that all events with nonzero probability under D1 have zero probability under D2. Based on these shortcomings of existing work, we introduce in Section 3 a new distance metric that is specifically tailored to find distribution changes while providing