Hellinger Distance Based Drift Detection for Nonstationary Environments
Gregory Ditzler and Robi Polikar
Dept. of Electrical & Computer Engineering, Rowan University, Glassboro, NJ, USA
[email protected], [email protected]

Abstract—Most machine learning algorithms, including many online learners, assume that the data distribution to be learned is fixed. There are many real-world problems where the distribution of the data changes as a function of time. Changes in nonstationary data distributions can significantly reduce the generalization ability of the learning algorithm on new or field data, if the algorithm is not equipped to track such changes. When the stationary data distribution assumption does not hold, the learner must take appropriate actions to ensure that the new/relevant information is learned. On the other hand, data distributions do not necessarily change continuously, necessitating the ability to monitor the distribution and detect when a significant change in distribution has occurred. In this work, we propose and analyze a feature based drift detection method using the Hellinger distance to detect gradual or abrupt changes in the distribution.

Keywords—concept drift; nonstationary environments; drift detection

(This material is based on work supported by the National Science Foundation (NSF) under Grant No. ECCS-0926159.)

I. INTRODUCTION

Detecting change in data distributions is an important problem in data mining and machine learning, due to the increasing number of applications that are governed by such data. Any algorithm that does not make the necessary adjustments to changes in the data distribution will fail to provide satisfactory generalization on future data, if those data do not come from the distribution on which the algorithm was originally trained. For example, consider an application that tracks a user's web browsing habits to determine which ads are most relevant to that user's interests. User interests are known to change – or drift – over time.
In such cases, certain ads in which the customer used to express interest may no longer be relevant. Thus, an algorithm designed to determine the relevant ads must be able to monitor the customer's browsing habits and determine when there is a change in the customer's interests. The list of applications that call for an effective change or drift detection algorithm can easily be expanded: analyses of electricity demand or pricing, and of financial or climate data, are all examples of applications with nonstationary data distributions, where change or drift detection is needed so that the learner can take an appropriate action.

For the purposes of this paper, we define a sudden or abrupt change in the underlying data distribution that alters the decision boundaries as a concept change, and a gradual change in the data distribution as a concept drift. However, when the context does not require us to distinguish between the two, we use the term concept drift to encompass both scenarios, as it is usually the more difficult one to detect.

Learning from drifting environments is usually associated with a stream of incoming data, either one instance or one batch at a time. There are two types of approaches for drift detection in such streaming data. In passive drift detection, the learner assumes – every time new data become available – that some drift may have occurred, and updates the classifier according to the current data distribution, regardless of whether drift actually did occur. In active drift detection, the algorithm continuously and explicitly monitors the data to detect if and when drift occurs; if – and only if – drift is detected, the algorithm takes an appropriate action, such as updating the classifier with the most recent data or simply creating a new classifier to learn the current data. The drift detection method presented in this work is intended for an active drift detection framework.

This paper is organized as follows: Section II provides an overview of drift detection algorithms, followed by our motivation for using the Hellinger distance as a metric and a detailed explanation of the proposed algorithm in Section III. Section IV presents results obtained by the proposed approach on several real-world and synthetic datasets. Finally, Section V provides concluding remarks and future work.

II. BACKGROUND

Concept drift algorithms are usually associated with incremental learning of streaming data, where new datasets become available in batches or on an instance-by-instance basis, resulting in batch or online learning, respectively [1]. Given such data, the (active) drift detection itself can be parametric or non-parametric, depending on whether a specific underlying distribution is assumed. Many parametric algorithms use a CUSUM (cumulative sum) based mechanism, which is traditionally used with control charts for detecting nonstationary changes in process data [2-4]. A series of successful implementations of this approach have been proposed by Alippi & Roveri, including CI-CUSUM [5;6], a pdf-free extension of the traditional CUSUM. Recently, Alippi et al. also introduced the intersection of confidence intervals (ICI) rule for drift detection [7]. Some drift detection approaches, such as the Early Drift Detection Method (EDDM) [8] and similar approaches [9], make no assumptions on the feature distribution, but rather monitor a classifier's accuracy or some distance metric to detect drift, i.e., a change between the distributions of the data at two subsequent time stamps.

Cieslak & Chawla suggest the Hellinger distance, not for detecting concept drift in an incremental learning setting, but rather to detect bias between training and test data distributions [10]. Using a non-parametric statistical test, the authors measure the significance between the probability estimates of the classifier on a validation set (carved out from the training data) and the corresponding test dataset [10]. To do so, a baseline comparison is first made by calculating the Hellinger distance between the original training and test datasets; bias is then injected into the test set, and the baseline Hellinger distance (i.e., the distance when no bias is present between the training and test sets) is compared against the distance after the bias is injected.

We extend this approach to concept drift in an incremental learning setting, where new datasets are presented in batches over time. We do not use the original dataset as the baseline distribution against which other data distributions are compared, nor do we inject any bias into future distributions, as done in [10]; rather, we monitor the magnitude of the change in Hellinger distance between a new distribution and a baseline distribution, which is updated every time drift is detected.

III. APPROACH

We begin this section by introducing the motivation for using the Hellinger distance as a measure that can be applied to drift detection in an incremental learning setting, and then present the proposed Hellinger Distance Drift Detection Method (HDDDM). According to the four criteria suggested by Kuncheva [3], HDDDM can be categorized as follows:

- Data chunks: batch based
- Information used: raw features (not based on classifier performance)
- Change detection mode: explicit
- Classifier-specific vs. classifier-free: classifier free

The algorithm uses a hypothesis-testing-like statistical approach to determine whether there is enough evidence to suggest that the data instances are being drawn from different distributions.

In contrast to EDDM and other similar approaches that rely on classifier error [8;9], the proposed HDDDM is a feature based drift detection method, using the Hellinger distance between the current data distribution and a reference distribution that is updated as new data are received. The Hellinger distance is an example of an f-divergence, similar to the Kullback-Leibler (KL) divergence. However, unlike the KL-divergence, the Hellinger distance is a symmetric metric. Furthermore, unlike most other distance metrics, the Hellinger distance is bounded: for two distributions with probability mass functions (or histograms representing these distributions) $P$ and $Q$, the Hellinger distance satisfies $\delta_H(P, Q) \in [0, \sqrt{2}]$. If $\delta_H(P, Q) = 0$, the two probability mass functions are completely overlapping, and hence identical; if $\delta_H(P, Q) = \sqrt{2}$, the two probability mass functions are completely divergent (i.e., there is no overlap).
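For illustration, the following is a minimal sketch, not taken from the paper, of the bounded Hellinger distance between two histograms over shared bins; the function and variable names are our own:

```python
import numpy as np

def hellinger(p_counts, q_counts):
    """Hellinger distance between two histograms given as bin counts."""
    p = np.asarray(p_counts, dtype=float)
    q = np.asarray(q_counts, dtype=float)
    p /= p.sum()  # normalize counts to probability mass functions
    q /= q.sum()
    # sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2) always lies in [0, sqrt(2)]
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Identical histograms give 0; disjoint supports give sqrt(2) ~ 1.414
print(hellinger([10, 20, 30], [10, 20, 30]))  # 0.0
print(hellinger([1, 0, 0], [0, 0, 1]))        # 1.4142...
```

The two boundary cases printed at the end correspond exactly to the completely overlapping and completely divergent distributions described above.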
As an example, consider a two-class rotating mixture of Gaussians with the class centers moving in a circular pattern (Fig. 1), with each distribution in the mixture corresponding to a different class label. The class means are given by the parametric equations

$$\mu_1(t) = [\cos \theta_t,\ \sin \theta_t], \qquad \mu_2(t) = -\mu_1(t), \qquad \theta_t = \frac{2\pi c t}{N},$$

with fixed class covariance matrices $\Sigma_1 = \Sigma_2 = 0.5 \cdot \mathbf{I}$, where $c$ is the number of cycles, $t$ is the (integer-valued) time stamp that iterates from zero to $N-1$, and $\mathbf{I}$ is the $2 \times 2$ identity matrix. Fig. 2 shows the evolution of the Hellinger distance computed between the baseline dataset $\mathcal{D}_1$ and the datasets $\mathcal{D}_k$ generated at the subsequent time stamps, where $k = 2, 3, \ldots, N-1$. The Hellinger distance is capable of conveying the closeness of a new dataset $\mathcal{D}_k$ to the baseline dataset $\mathcal{D}_1$, as shown in Fig. 2, where we plot the Hellinger distance separately for class $\omega_1$, for class $\omega_2$, and for the entire data. The Hellinger distance varies as the underlying distribution begins to evolve. We observe that when $\mathcal{D}_1$ and $\mathcal{D}_k$ are the same (or very similar), the Hellinger distance is small, as observed at $t = \{0, 200, 400, 600\}$; this is when $\mu_1(t) = \mu_1(1)$ and $\mu_2(t) = -\mu_1(1)$. We also observe that the Hellinger distance computed for all data (the entire dataset) repeats every 100 time stamps, twice as often as the class-specific distances: since $\mu_2(t) = -\mu_1(t)$, the two class means swap positions after every half cycle, leaving the overall (unlabeled) mixture unchanged.
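To make the setup concrete, here is a hedged sketch of this rotating-Gaussian experiment under stated assumptions: we choose $N = 800$ and $c = 4$ so that the means return to their starting positions every $N/c = 200$ stamps, consistent with the dips at $t = \{0, 200, 400, 600\}$ noted above; the batch size, bin count, and the per-feature averaging of the distance are our choices for illustration, not values taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, c, n_per_class, n_bins = 800, 4, 500, 20   # assumed values; N/c = 200
cov = 0.5 * np.eye(2)                          # Sigma_1 = Sigma_2 = 0.5*I

def batch(t):
    """Draw one two-class batch at (integer) time stamp t."""
    theta = 2 * np.pi * c * t / N
    mu1 = np.array([np.cos(theta), np.sin(theta)])
    X1 = rng.multivariate_normal(mu1, cov, n_per_class)    # class w1
    X2 = rng.multivariate_normal(-mu1, cov, n_per_class)   # class w2
    return np.vstack([X1, X2])

def hellinger_features(A, B, bins=n_bins):
    """Hellinger distance per feature over shared bin edges, averaged."""
    total = 0.0
    for j in range(A.shape[1]):
        edges = np.linspace(min(A[:, j].min(), B[:, j].min()),
                            max(A[:, j].max(), B[:, j].max()), bins + 1)
        p = np.histogram(A[:, j], bins=edges)[0] / A.shape[0]
        q = np.histogram(B[:, j], bins=edges)[0] / B.shape[0]
        total += np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
    return total / A.shape[1]

D1 = batch(0)  # baseline dataset
dists = [hellinger_features(D1, batch(k)) for k in range(1, N)]
# dists dips toward 0 whenever the means rotate back onto their starting
# positions (every N/c = 200 stamps here), mirroring the pattern in Fig. 2.
```

Binning each feature independently keeps the histograms well populated in higher dimensions, at the cost of ignoring interactions between features; this trade-off is one reason a per-feature formulation is attractive for streaming data.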
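Finally, for orientation, a heavily hedged skeleton of how such a hypothesis-testing-like batch detection loop might look. The paper's actual threshold and baseline-update rules are developed in the remainder of Section III, beyond this excerpt; the specific rule below, flagging drift when the change in distance exceeds its running mean by $\gamma$ standard deviations, together with the reset/merge policy and the name `hdddm`, are illustrative assumptions only:

```python
import numpy as np

def hdddm(batches, dist_fn, gamma=1.0):
    """Yield a drift decision (True/False) for each batch after the first.

    dist_fn(A, B) -> bounded divergence between two datasets, e.g. the
    hellinger_features() helper from the previous sketch.
    """
    baseline = batches[0]
    prev_delta = None
    eps_history = []                    # |changes in distance| since last drift
    for D in batches[1:]:
        delta = dist_fn(baseline, D)
        if prev_delta is None:          # need two distances to measure a change
            prev_delta = delta
            baseline = np.vstack([baseline, D])
            yield False
            continue
        eps = abs(delta - prev_delta)   # magnitude of the change in divergence
        # hypothesis-test-like rule: is this change larger than expected
        # under the "no drift" history collected so far?
        drift = (len(eps_history) >= 2 and
                 eps > np.mean(eps_history) + gamma * np.std(eps_history))
        if drift:
            baseline = D                # rebuild the baseline after a detection
            eps_history.clear()
            prev_delta = None
        else:
            baseline = np.vstack([baseline, D])   # fold the batch into baseline
            eps_history.append(eps)
            prev_delta = delta
        yield drift
```

With the helpers from the previous sketch, `list(hdddm([batch(25 * k) for k in range(32)], hellinger_features))` would run the loop over batches sampled every 25 time stamps of the rotating mixture; note that the baseline here is updated exactly as described above, only when drift is declared.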