Hellinger Distance Based Drift Detection for Nonstationary Environments

Gregory Ditzler and Robi Polikar Dept. of Electrical & Computer Engineering Rowan University Glassboro, NJ, USA [email protected], [email protected]

Abstract—Most machine learning algorithms, including many online learners, assume that the data distribution to be learned is fixed. There are many real-world problems where the distribution of the data changes as a function of time. Changes in nonstationary data distributions can significantly reduce the generalization ability of the learning algorithm on new or field data, if the algorithm is not equipped to track such changes. When the stationary data distribution assumption does not hold, the learner must take appropriate actions to ensure that the new/relevant information is learned. On the other hand, data distributions do not necessarily change continuously, necessitating the ability to monitor the distribution and detect when a significant change in distribution has occurred. In this work, we propose and analyze a feature based drift detection method using the Hellinger distance to detect gradual or abrupt changes in the distribution.

Keywords-concept drift; nonstationary environments; drift detection

(This material is based on work supported by the National Science Foundation (NSF) under Grant No. ECCS-0926159.)

I. INTRODUCTION

Detecting change in data distributions is an important problem in data mining and machine learning, due to the increasing number of applications that are governed by such data. Any algorithm that does not make the necessary adjustments to changes in the data distribution will necessarily fail to provide satisfactory generalization on future data, if those data do not come from the distribution on which the algorithm was originally trained. For example, consider an application that tracks a user's web browsing habits to determine which ads are most relevant for that user's interests. User interests are known to change – or drift – over time. In such cases, certain ads in which the customer used to express interest may no longer be relevant. Thus, an algorithm designed to determine the relevant ads must be able to monitor the customer's browsing habits and determine when there is a change in the customer's interests. Applications that call for an effective change or drift detection algorithm can be expanded: analyses of electricity demand or pricing, and of financial or climate data, are all examples of applications with nonstationary data distributions, where change or drift detection is needed so that the learner can take an appropriate action.

For the purposes of this paper, we define a sudden or abrupt change in the underlying data distribution that alters the decision boundaries as a concept change, whereas we refer to gradual changes in the data distribution as a concept drift. However, when the context does not require us to distinguish between the two, we use the term concept drift to encompass both scenarios, as it is usually the more difficult one to detect.

Learning from drifting environments is usually associated with a stream of incoming data, either one instance or one batch at a time. There are two types of approaches for drift detection in such streaming data: in passive drift detection, the learner assumes – every time new data become available – that some drift may have occurred, and updates the classifier according to the current data distribution, regardless of whether drift actually did occur. In active drift detection, the algorithm continuously and explicitly monitors the data to detect if and when drift occurs. If – and only if – drift is detected, the algorithm takes an appropriate action, such as updating the classifier with the most recent data or simply creating a new classifier to learn the current data. The drift detection method presented in this work is appropriate for an active drift detection framework.

This paper is organized as follows: Section II provides an overview of drift detection algorithms, followed by our motivation for using the Hellinger distance as a metric, as well as a detailed explanation of the proposed algorithm in Section III. Section IV presents results obtained by the proposed approach on several real-world and synthetic datasets. Finally, Section V provides concluding remarks and future work.

II. BACKGROUND

Concept drift algorithms are usually associated with incremental learning of streaming data, where new datasets become available in batches or on an instance-by-instance basis, resulting in batch or online learning, respectively [1]. Given such data, the (active) drift detection itself can be parametric or non-parametric, depending on whether a specific underlying distribution is assumed. Many parametric algorithms use a CUSUM (cumulative sum) based mechanism, traditionally used in control charts for detecting nonstationary changes in process data [2-4]. A series of successful implementations of this approach has been proposed by Alippi & Roveri, including CI-CUSUM [5;6], a pdf-free extension of the traditional CUSUM. Recently, Alippi et al. also introduced the intersection of confidence intervals (ICI) rule for drift detection [7].

Some drift detection approaches, such as the Early Drift Detection Method (EDDM) [8] and similar approaches [9], do not make any assumptions about the feature distribution, but rather monitor a classifier's accuracy or some distance metric to detect drift. Cieslak & Chawla suggest the Hellinger distance, not for detecting concept drift in an incremental learning setting, but rather to detect bias between training and test data distributions [10]. Using a non-parametric statistical test, the authors measure the significance between the probability estimates of the classifier on a validation set (carved out from the training data) and the corresponding test dataset [10]. To do so, a baseline comparison is first made by calculating the Hellinger distance between the original training and test datasets. Bias is then injected into the testing set. A baseline Hellinger distance (i.e., when no bias is present between the training/testing sets) and the distance after bias is injected into the dataset are observed.

We extend this approach to concept drift in an incremental learning setting, where new datasets are presented in batches over time. We do not use the original dataset as the baseline distribution against which other data distributions are compared, nor do we inject any bias into future distributions, as done in [10]; rather, we monitor the magnitude of the change in Hellinger distance between a new distribution and a baseline distribution, which is updated every time drift is detected.

III. APPROACH

We begin this section by introducing the motivation for using the Hellinger distance as a measure that can be applied to drift detection in an incremental learning setting. Next, we present the proposed Hellinger Distance Drift Detection Method (HDDDM). According to the four criteria suggested by Kuncheva [3], HDDDM can be categorized as follows:

- Data chunks: batch based
- Information used: raw features (not based on classifier performance)
- Change detection mode: explicit
- Classifier-specific vs. classifier-free: classifier free

The algorithm uses a hypothesis-testing-like statistical approach to determine whether there is enough evidence to suggest that the data instances are being drawn from different distributions.

A. Motivation

The primary motivation for this detection method is to determine whether drift is present in a sequence of batch data. While prior work used the Hellinger distance only as a measure between a single training and testing set, we extend the approach to sequential batch learning, where the goal is to detect drift among different datasets. We select the Hellinger distance over other candidate measures because no assumptions need to be made about the distribution of the data. Also, the Hellinger distance is a measure of distributional divergence, which allows HDDDM to measure the change between the distributions of data at two subsequent time stamps.

In contrast to EDDM and other similar approaches that rely on classifier error [8;9], the proposed Hellinger distance drift detection method (HDDDM) is a feature based drift detection method, using the Hellinger distance between the current data distribution and a reference distribution that is updated as new data are received. The Hellinger distance is a divergence measure, similar to the Kullback-Leibler (KL) divergence. However, unlike the KL divergence, the Hellinger distance is symmetric. Furthermore, unlike most other distance metrics, the Hellinger distance is a bounded distance measure: for two distributions with probability mass functions (or histograms representing these distributions) P and Q, the Hellinger distance satisfies H(P, Q) ∈ [0, √2]. If H(P, Q) = 0, the two probability mass functions are completely overlapping and hence identical. If H(P, Q) = √2, the two probability mass functions are completely divergent (i.e., there is no overlap).
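For concreteness, the bounded-distance property just described can be verified with a short computation. The sketch below (ours, not from the original paper) computes the Hellinger distance between two histograms with NumPy, normalizing the frequency counts to probability mass functions first:

```python
import numpy as np

def hellinger(p_counts, q_counts):
    """Hellinger distance between two histograms (frequency counts).

    Both inputs are counts over the same b bins; each histogram is
    normalized to a probability mass function before the distance is
    computed, so the result lies in [0, sqrt(2)].
    """
    p = np.asarray(p_counts, dtype=float)
    q = np.asarray(q_counts, dtype=float)
    p /= p.sum()
    q /= q.sum()
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Identical histograms give 0; fully disjoint histograms give sqrt(2).
print(hellinger([5, 5, 0], [5, 5, 0]))    # 0.0
print(hellinger([10, 0, 0], [0, 0, 10]))  # 1.4142...
```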
As an example, consider a two-class rotating mixture of Gaussians with class centers moving in a circular pattern (Fig. 1), with each distribution in the mixture corresponding to a different class label. The class means are given by the parametric equations μ₁(t) = [cos θ_t, sin θ_t], μ₂(t) = −μ₁(t), θ_t = 2πct/N, with fixed class covariance matrices Σ₁ = Σ₂ = 0.5·I, where c is the number of cycles, t is the (integer valued) time stamp that iterates from zero to N − 1, and I is the 2x2 identity matrix.

Fig. 1. Evolution of a binary classification task with two drifting Gaussian distributions (snapshots at t = 0, 25, 75, 100).

Fig. 2 shows the evolution of the Hellinger distance computed between the datasets D₁ and D_k, generated with respect to μ₁ and μ_k, where k = 2, 3, …, N − 1. The Hellinger distance is capable of displaying the relevance, or closeness, of a new dataset D_k to a baseline dataset D₁, as shown in Fig. 2. We plot the Hellinger distance for the divergence of the datasets for class ω₁, class ω₂, and the entire data separately. The Hellinger distance varies as μ begins to evolve. We observe that when μ₁ and μ_k are the same (or very similar), the Hellinger distance is small, as observed at t = {0, 200, 400, 600}; this is when μ₁(t) = μ₁(0) and μ₂(t) = −μ₁(t) = −μ₁(0). We also observe that the Hellinger distance computed for all data (the entire dataset) repeats every 100 time stamps, twice as often as the class-specific distances, which repeat every 200 time steps. This is because, as seen in Fig. 1, every 100 time steps the data consist of the exact same instances at the exact same locations, but with their labels flipped, compared to the distribution at t = 0.

As the class centers evolve from t = 0 to t = 200, we observe changes in the Hellinger distances that follow the expected trends based on Fig. 1, reaching the maximum value at t = 75 and t = 125, with a slight dip at t = 100, for the class-specific datasets. The Hellinger distance then decreases as the class distributions return to their initial state at t = 200. The entire scenario then repeats two more times.

Fig. 2. Hellinger distance (vertical axis, plotted against time) computed on the rotating Gaussian centers problem. The Hellinger distance is computed between datasets D₁ and D_t, where t = 2, 3, …, 600, for ω₁, ω₂, and all classes. Recurring environments occur when the Hellinger distance is at a minimum.

If the distribution governing the data at different time stamps is static (i.e., μ = constant), then the Hellinger distance between the first batch and the subsequent batches remains constant, as shown in Fig. 3. We note that a static distribution does not mean zero Hellinger distance. In fact, the Hellinger distance will be non-zero due to differences in class means as well as the random nature of the data drawn for each sample. The Hellinger distance remains nearly constant, however, since the distribution itself does not change.

Fig. 3. Hellinger distance computed on a static Gaussian problem. The Hellinger distance is computed between D₁ and D_t, where t = 2, 3, …, 600, for ω₁, ω₂, and all classes.

Having observed that the Hellinger distance changes as two distributions diverge from each other, and remains constant if they do not, we can use this information to determine when change is present in an incremental learning problem. The process of tracking the data distributions for drift is described below.
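To make the Fig. 2 experiment concrete, the sketch below regenerates the rotating-mixture batches and tracks their distance to the t = 0 baseline for class ω₁, reusing the `hellinger` helper above and averaging per-feature distances in the way Eq. (1) of Section III-C will formalize. The batch size, seed, and sampled time stamps are our arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, c, T = 100, 3, 600  # instances per class, cycles, total time stamps

def batch(t):
    """One batch of the rotating mixture: mu1 on the unit circle, mu2 = -mu1."""
    theta = 2 * np.pi * c * t / T
    mu1 = np.array([np.cos(theta), np.sin(theta)])
    cov = 0.5 * np.eye(2)
    return (rng.multivariate_normal(mu1, cov, N),    # class w1
            rng.multivariate_normal(-mu1, cov, N))   # class w2

def avg_feature_hellinger(a, b):
    """Average the per-feature Hellinger distances between batches a and b."""
    bins = int(np.sqrt(len(a)))
    dist = 0.0
    for k in range(a.shape[1]):
        lo = min(a[:, k].min(), b[:, k].min())
        hi = max(a[:, k].max(), b[:, k].max())
        p, _ = np.histogram(a[:, k], bins=bins, range=(lo, hi))
        q, _ = np.histogram(b[:, k], bins=bins, range=(lo, hi))
        dist += hellinger(p, q)  # helper from the previous sketch
    return dist / a.shape[1]

base_w1, _ = batch(0)
for t in (50, 100, 200):  # mid-rotation, flipped labels, recurring environment
    x_w1, _ = batch(t)
    print(t, round(avg_feature_hellinger(base_w1, x_w1), 3))
```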

B. Assumptions

The proposed approach makes three assumptions: i) labeled training datasets are presented in batches to the drift detection algorithm, as the Hellinger distance is computed between two histograms (of distributions) of the data. If only instance-based streaming data is available, one may accumulate instances to form a batch on which to compute the histogram; ii) data distributions have finite support (range): P(X ≤ x) = 0 for x ≤ T₁ and P(X ≥ x) = 0 for x ≥ T₂, where T₁ < T₂ are finite real numbers. We fix the number of bins in the histogram required to compute the Hellinger distance at ⌊√N⌋, where N is the number of instances presented to the drift detection algorithm at each time stamp. This can be manually adjusted if one has prior knowledge to justify otherwise; under minimal or no prior knowledge, ⌊√N⌋ works well. iii) Finally, in order to follow a true incremental learning setting, we assume that we do not have access to old data [11]. Each instance is seen only once by the algorithm.
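As a small illustration of assumption ii) and the binning rule (ours, not from the paper; the support bounds below are placeholder values standing in for [T₁, T₂]):

```python
import numpy as np

def feature_histogram(x, support=(-10.0, 10.0)):
    """Histogram one feature of an N-instance batch with floor(sqrt(N)) bins.

    `support` plays the role of the finite range [T1, T2] assumed by the
    method; the default bounds here are placeholders.
    """
    b = int(np.floor(np.sqrt(len(x))))
    counts, edges = np.histogram(x, bins=b, range=support)
    return counts, edges

counts, _ = feature_histogram(np.random.default_rng(1).normal(size=400))
print(len(counts))  # floor(sqrt(400)) = 20 bins
```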

C. Drift Detection Algorithm

The pseudocode of the proposed Hellinger distance drift detection method (HDDDM) is shown in Fig. 4. As mentioned above, we assume that the data arrive in batches, with dataset D_t becoming available at time stamp t. The algorithm initializes λ = 1 and D_λ = D₁, where λ indicates the last time stamp at which change was detected. D₁ is established as the first baseline reference dataset to which we compare future datasets for possible drift. This baseline distribution, D_λ, is updated incrementally as described below.

The algorithm begins by constructing histogram P from D_t and histogram Q from D_λ, each with b = ⌊√N⌋ bins, where N is the cardinality of the current data batch presented to the algorithm. The Hellinger distance between the two histograms P and Q is then computed using Eq. (1). The (intermediate) Hellinger distance is calculated first for each feature, and the average of the distances over all features is then taken as the final Hellinger distance:

  δ_H(t) = (1/d) Σ_{k=1}^{d} √[ Σ_{i=1}^{b} ( √(P_{i,k} / Σ_{j=1}^{b} P_{j,k}) − √(Q_{i,k} / Σ_{j=1}^{b} Q_{j,k}) )² ]    (1)

where d is the dimensionality of the data, and P_{i,k} (Q_{i,k}) is the frequency count in bin i of histogram P (Q) for feature k.

We then compute ε(t), the difference in divergence between the most recent measure of the Hellinger distance, δ_H(t), and the Hellinger distance measured at the previous time stamp, δ_H(t − 1). This differential between the current and prior distances is compared to a threshold to determine whether the change is large enough to claim a drift. Rather than heuristically or arbitrarily selecting a threshold, we use an adaptive threshold that is automatically adjusted at each time stamp. To do this, we first compute ε̂, the mean of the differences ε(i), and σ̂, the standard deviation of the differences in divergence, where i = λ + 1, …, t − 1. Note that the current time stamp and all time stamps before λ (the last time a change was detected) are not included in the mean difference calculation. The actual threshold β(t) at time step t is then computed based on the mean and standard deviation of the differences in divergence. We propose two alternatives to compute this threshold: one based on the standard deviation, and one based on a confidence level. The first is computed simply by

  β(t) = ε̂ + γσ̂    (2)

where γ is some positive real constant, indicating how many standard deviations of change around the mean we accept as "different enough". Note that we use ε̂ + γσ̂ and not ε̂ ± γσ̂, as we flag a drift only when the magnitude of the change is significantly greater than the average of the change in recent time (since the last detected change), with the significance controlled by the term γ and the standard deviation of the divergence differences.

The second implementation of HDDDM uses the t-statistic, scaling σ̂ by the square root of t − λ − 1:

  β(t) = ε̂ + t_{α/2} · σ̂ / √(t − λ − 1)    (3)

Again, an upper-tailed test is used, ε̂ + t_{α/2}·σ̂/√(t − λ − 1), and not the interval ε̂ ± t_{α/2}·σ̂/√(t − λ − 1) corresponding to a joint confidence interval.

Once the adaptive threshold is computed, it is applied to the observed current difference in divergence to determine whether drift is present in the most recent data. If the magnitude |ε(t)| > β(t), then we signal that change has been detected in the most recent batch of data. As soon as change is detected, the algorithm resets λ = t and D_λ = D_t. Note that resetting the baseline distribution is essential, as the old distribution is no longer an accurate reference for determining how much the data distribution has changed (and whether that change is significant), as indicated by the large difference in Hellinger distance between time stamps. If drift is not detected, then we update, rather than reset, the distribution D_λ with the new data D_t. Thus, D_λ continues to be updated with the most recent data as long as drift is not detected. The histogram Q can be updated or reset using the following rule:

  Q_{i,k} ← Q_{i,k} + P_{i,k}   if drift is not detected
  Q_{i,k} ← P_{i,k}             if drift is detected

Note that the normalization in Eq. (1) ensures that the correct density is obtained from these histograms.

The algorithm naturally lends itself to an incremental drift detection scenario, since it does not need access to previously seen data. Rather, only the histograms of the current and reference distributions are needed. Therefore, the only memory required for this algorithm is the previous values of ε, δ_H(t − 1), and the histogram Q of D_λ.

Input: Training data D_t = {x_i(t) ∈ X, y_i(t) ∈ Ω}, presented in batches drawn from the joint distribution p_t(x, ω), where t = 1, 2, 3, …
Initialize: λ = 1, D_λ = D₁
for t = 2, 3, … do
  1. Generate a histogram P from D_t and a histogram Q from D_λ. Each histogram has b = ⌊√N⌋ bins, where N is the cardinality of D_t.
  2. Calculate the Hellinger distance between P and Q using Eq. (1). Call this δ_H(t).
  3. Compute the difference in Hellinger distance: ε(t) = δ_H(t) − δ_H(t − 1)
  4. Update the adaptive threshold:
       ε̂ = (1 / (t − λ − 1)) Σ_{i=λ+1}^{t−1} ε(i)
       σ̂ = √( Σ_{i=λ+1}^{t−1} (ε(i) − ε̂)² / (t − λ − 1) )
     Compute β(t) using the standard deviation method in Eq. (2) or the confidence interval method in Eq. (3).
  5. Determine if drift is present:
     if |ε(t)| > β(t) then
       λ = t
       Reset D_λ by setting D_λ = D_t
       Indicate change was detected
     else
       Update D_λ with D_t: D_λ ← {D_λ, D_t}
     end if
end for

Fig. 4. Algorithm pseudocode for the Hellinger distance drift detection method (HDDDM).
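The update cycle of Fig. 4 can be condensed into a few dozen lines. The sketch below is our minimal interpretation, not a reference implementation: it uses the standard-deviation threshold of Eq. (2), keeps the running statistics over the magnitudes of the differences (matching the |ε(t)| test), and the warm-up guard and the restart of the statistics after a detection are our implementation choices.

```python
import numpy as np

class HDDDM:
    """Minimal sketch of the HDDDM update cycle in Fig. 4 (Eq. (2) threshold)."""

    def __init__(self, first_batch, gamma=1.0):
        self.ref = np.asarray(first_batch, dtype=float)  # baseline D_lambda
        self.gamma = gamma
        self.prev_dist = None   # delta_H(t-1)
        self.eps_hist = []      # |eps(i)| since the last detection

    def _distance(self, batch):
        """Feature-averaged Hellinger distance of Eq. (1)."""
        b = int(np.sqrt(len(batch)))  # floor(sqrt(N)) bins
        total = 0.0
        for k in range(batch.shape[1]):
            lo = min(batch[:, k].min(), self.ref[:, k].min())
            hi = max(batch[:, k].max(), self.ref[:, k].max())
            p, _ = np.histogram(batch[:, k], bins=b, range=(lo, hi))
            q, _ = np.histogram(self.ref[:, k], bins=b, range=(lo, hi))
            p = p / p.sum()   # normalize counts to densities
            q = q / q.sum()
            total += np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
        return total / batch.shape[1]

    def update(self, batch):
        """Process batch D_t; return True if drift is signaled."""
        batch = np.asarray(batch, dtype=float)
        dist = self._distance(batch)            # delta_H(t)
        drift = False
        if self.prev_dist is not None:
            eps = abs(dist - self.prev_dist)    # |eps(t)|, Fig. 4 step 3
            if len(self.eps_hist) >= 2:         # warm-up guard (our choice)
                beta = np.mean(self.eps_hist) + self.gamma * np.std(self.eps_hist)
                drift = eps > beta              # Fig. 4 step 5
            if drift:
                self.ref = batch                # reset: D_lambda = D_t
                self.eps_hist = []
                self.prev_dist = None           # restart statistics (our choice)
                return True
            self.eps_hist.append(eps)
        self.ref = np.vstack([self.ref, batch]) # update: D_lambda = {D_lambda, D_t}
        self.prev_dist = dist
        return False

# Stationary batches, then an abrupt mean shift that should be flagged.
rng = np.random.default_rng(0)
det = HDDDM(rng.normal(0.0, 1.0, (200, 2)), gamma=1.0)
for t in range(2, 13):
    mean = 3.0 if t >= 9 else 0.0               # abrupt change at t = 9
    print(t, det.update(rng.normal(mean, 1.0, (200, 2))))
```

For simplicity this sketch pools all features of a batch; the paper's experiments also build class-conditional histograms and declare drift if either class triggers the test.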

D. Algorithm Performance Assessment

The effectiveness of a drift detection algorithm can best be tested on carefully designed synthetic datasets covering a variety of drift scenarios, such as abrupt or gradual drift. Since the drift is deliberately inserted into the dataset, the ability of the algorithm to detect a drift when one is known to exist (measured by sensitivity), as well as its ability to not signal one when there is in fact no drift (measured by specificity), can be easily determined for such datasets.

Detecting drift is only useful to the extent that such information can be used to help a learner track a nonstationary environment and learn the drift in the data. To do so, we use a naïve Bayes classifier, which can be easily updated as new data arrive, to determine whether the detection of drift can be used to improve the classifier's performance on a drifting classification problem. We use the following procedure to determine the effectiveness of the drift detection algorithm:

- Generate two naïve Bayes classifiers with the 1st database. Call them C₁ and C₂.
- FOR t = 2, 3, …
  - Begin processing new databases for the presence of concept drift using HDDDM.
  - C₁ is our target classifier, which is updated or reset based on the HDDDM decision. If drift is detected using HDDDM, we reset C₁ and train C₁ only on the new data; otherwise, we incrementally update C₁ with the new data.
  - Regardless of whether or not drift is detected, C₂ is incrementally trained with the new data. C₂ is therefore the control classifier that is not subject to HDDDM intervention.
  - Compute the error of C₁ and C₂ on new test data.
- ENDFOR

We expect C₁ to outperform C₂ in drifting environments. The online implementation of the naïve Bayes classifier is straightforward, as the features are assumed to be class-conditionally independent. The calculation of p(x_i | ω_j), required to implement the naïve Bayes classifier, can be further simplified by assuming that the data for the ith feature are normally distributed (though this is not necessary for the algorithm to work). Thus, the parameters of naïve Bayes are computed as:

  p(x_i | ω_j) = (1 / (σ_ij √(2π))) · exp( −(x_i − μ_ij)² / (2σ_ij²) )    (4)

  P(ω_j | x) ∝ P(ω_j) · Π_{i=1}^{d} p(x_i | ω_j)    (5)

where x_i is the ith feature, σ²_ij is the variance of the ith feature of the jth class, μ_ij is the mean of the ith feature of the jth class, and ω_j is the label of the jth class.
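A minimal online Gaussian naïve Bayes consistent with Eqs. (4)-(5) is sketched below. The class name, the use of Welford's running-moment update, and the small variance floor are our implementation choices, not details from the paper.

```python
import numpy as np

class OnlineGaussianNB:
    """Naive Bayes with Gaussian class-conditionals, Eqs. (4)-(5),
    updated incrementally one instance or batch at a time."""

    def __init__(self, n_features, n_classes):
        self.n = np.zeros(n_classes)                 # instances seen per class
        self.mu = np.zeros((n_classes, n_features))  # mu_ij
        self.m2 = np.zeros((n_classes, n_features))  # sums of squared deviations

    def partial_fit(self, X, y):
        for x, j in zip(np.asarray(X, dtype=float), y):
            self.n[j] += 1
            delta = x - self.mu[j]
            self.mu[j] += delta / self.n[j]          # Welford running mean
            self.m2[j] += delta * (x - self.mu[j])   # running squared deviation

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        var = self.m2 / np.maximum(self.n, 1)[:, None] + 1e-9  # sigma_ij^2, floored
        log_prior = np.log(self.n / self.n.sum())               # log P(w_j)
        # Log of Eq. (5): log P(w_j) + sum_i log p(x_i | w_j) from Eq. (4)
        scores = np.stack([
            log_prior[j] - 0.5 * np.sum(
                np.log(2 * np.pi * var[j]) + (X - self.mu[j]) ** 2 / var[j],
                axis=1)
            for j in range(len(self.n))], axis=1)
        return np.argmax(scores, axis=1)

# Quick check on a linearly separable toy batch.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.r_[np.zeros(50, int), np.ones(50, int)]
nb = OnlineGaussianNB(n_features=2, n_classes=2)
nb.partial_fit(X, y)
print(np.mean(nb.predict(X) == y))  # should be close to 1.0
```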

Fig. 5. Evolution of the rotating checkerboard dataset.

Fig. 6. Evolution of the circular Gaussian drift, observed through the posterior probability. The drift is continuous from one batch to the next, not abruptly changing. There are 100 time stamps for each cycle.

IV. EXPERIMENTS

Several synthetic and real-world datasets, described below, were selected to evaluate the HDDDM algorithm.

A. Description of Datasets

As summarized in Table I, seven datasets were used.

The rotating checkerboard dataset is generated using the Matlab source found in [12], and is used in two scenarios controlled by changing the rotation of the checkerboard: i) continuous rotation over 400 time stamps, for gradual drift, and ii) discrete rotations every 20 of 300 time steps (a total of 15 changes), to simulate abrupt changes.

TABLE I. DESCRIPTION OF DATASETS AND THEIR DRIFT PROPERTIES USED FOR THE EXPERIMENTATION OF THE HDDDM ALGORITHM.

Dataset        | Instances | Source                   | Drift type
Checkerboard   | 800,000   | Synthetically generated  | Synthetic
Elec^4         | 27,549    | Web source^1             | Natural
SEA [5%]       | 400,000   | Synthetically generated  | Synthetic
RandGauss      | 812,500   | Synthetically generated  | Synthetic
Magic          | 13,065    | UCI^2                    | Synthetic
NOAA           | 18,159    | NOAA^3                   | Natural
GaussCir       | 950,000   | Synthetically generated  | Synthetic

1. http://www.liaad.up.pt/~jgama/ales/ales_5.html
2. UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/)
3. National Oceanic and Atmospheric Administration (www.noaa.gov)
4. All instances with missing features have been removed.

Figure 5 illustrates a few snapshots of the checkerboard appearance over time. Electricity pricing (Elec) is a real-world dataset that contains natural drift within the database. The SEA dataset comes from the original shifting hyperplane problem presented by Street & Kim [13]. The Magic dataset is from the UCI machine learning repository [14]; the original Magic dataset includes little, if any, drift, so it has been modified by sorting a feature in ascending order and then generating incremental batches from the sorted data, which introduces meaningful drift as an end effect.

GaussCir and RandGauss are two synthetic datasets generated with a controlled drift scenario (whose decision boundaries are shown in Fig. 6 and 7, respectively). GaussCir is the example previously described in Section III. The NOAA dataset contains approximately 50 years of weather data obtained from a post at Offutt Air Force Base in Bellevue, Nebraska. Daily measurements were taken for a variety of features, such as temperature, pressure, visibility, and wind speed. The number of features was reduced to eight, and the classification task was to predict whether or not there was rain on a particular day.

Fig. 7. Evolution of the RandGauss dataset, observed through the posterior. The dataset begins with two modes (one for each class) and begins slowly drifting. The drift stops at the third evolution point and remains static for 25 time stamps, followed by the introduction of a new mode for the magenta class. The dataset continues to evolve with slow change followed by abrupt changes.

B. Preliminary Results

The preliminary results are summarized in Fig. 8. Each of the plots presents the error of a naïve Bayes classifier with no update or intervention from HDDDM, with various values of γ (0.5, 1.0, 1.5, 2.0) for the standard deviation implementation of HDDDM, and with α = 0.1 for the confidence interval based implementation of HDDDM. The primary observation we make is that the classifier that is continuously updated while disregarding the concept change (red curve, corresponding to C₂, i.e., no drift detection being used) has an overall error that is typically greater than the error of the classifiers that are reset based on the intervention of HDDDM (other colors, corresponding to different implementations of C₁).
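Error curves such as those in Fig. 8 come from the protocol of Section III-D. As an illustration only, the sketch below runs that protocol on a toy abruptly-drifting stream (our stand-in, not one of the paper's datasets), using the `HDDDM` and `OnlineGaussianNB` sketches given earlier; errors are measured test-then-train on each incoming batch.

```python
import numpy as np

rng = np.random.default_rng(7)

def toy_stream(t):
    """Two Gaussian classes that swap sides and move outward at t = 30."""
    mu0, mu1 = (-1.0, 1.0) if t < 30 else (3.0, -3.0)
    X = np.vstack([rng.normal(mu0, 1.0, (100, 2)),
                   rng.normal(mu1, 1.0, (100, 2))])
    y = np.r_[np.zeros(100, int), np.ones(100, int)]
    return X, y

X, y = toy_stream(1)
c1 = OnlineGaussianNB(2, 2); c1.partial_fit(X, y)  # target: reset on drift
c2 = OnlineGaussianNB(2, 2); c2.partial_fit(X, y)  # control: always updated
detector = HDDDM(X, gamma=1.0)

for t in range(2, 61):
    X, y = toy_stream(t)
    e1 = np.mean(c1.predict(X) != y)   # evaluate before training (test-then-train)
    e2 = np.mean(c2.predict(X) != y)
    if detector.update(X):             # drift signaled: discard c1, start over
        c1 = OnlineGaussianNB(2, 2)
    c1.partial_fit(X, y)
    c2.partial_fit(X, y)
    if t % 10 == 0:
        print(t, round(e1, 3), round(e2, 3))
```

After the change point, c2 keeps averaging the old and new concepts together, while c1 is rebuilt on post-change data, mirroring the behavior reported for Fig. 8.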

Fig. 8. Error evaluation of the online naïve Bayes classifiers (updated and dynamically reset) with a variation in the parameters of the Hellinger distance drift detection method (HDDDM): (a) checkerboard with abrupt changes in rotation, (b) NOAA (120 instances), (c) RandGauss, (d) Magic, (e) Elec, (f) checkerboard with continuous drift, (g) SEA (5% noise), and (h) GaussCir. Each panel plots classification error against time stamp for the no-update classifier and the HDDDM variants (γ = 0.5, 1.0, 1.5, 2.0; α = 0.1).

We would like to note that resetting an online classifier is not necessarily the ideal way to handle concept drift (as it causes catastrophic forgetting [15]); there are several algorithms that can forget only what is no longer relevant and retain the still-relevant information, such as the Learn++.NSE algorithm [16;17]. However, since our goal is to evaluate the drift detection mechanism, we use the resetting approach, so that any improvement observed is not due to the innate ability of the learner to learn concept drift, but simply due to the impact of the intervention based on drift detection.

Of the seven datasets, the rotating checkerboard provides the best scenario for determining the effectiveness of HDDDM, since the drift occurs at evenly spaced intervals: approximately every 20 time steps, out of 300, the board rotates 0.449 radians and remains static during the intermediate time stamps. This leads to a total of 15 concepts (including time step 1) throughout the experiment. Table II presents the F-measure, sensitivity, and specificity of HDDDM's detections, averaged over 10 independent trials. We can compute these quantities precisely because the locations of the drift points are known for this dataset. In Table II, sensitivity (ω₁) represents the ratio of the number of drifts detected to the total number of real drifts actually present for ω₁, whereas specificity (ω₂) is the ratio of the number of correctly identified no-drift time steps to the total number of no-drift steps for class ω₂. The performance measure in Table II indicates the detection rate across all data. Note that with class labels removed, there is no change in the checkerboard distribution, and hence no change should be detected; the numbers are therefore de facto specificity figures for the entire data. Sensitivity cannot be measured on the entire data (when class labels are removed) since there are no actual drifts in such data.
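Since the checkerboard's change points are known, these figures of merit reduce to set comparisons between true and detected change points, as the short sketch below illustrates (the detected set is made up for illustration, not taken from Table II):

```python
# True change points: a new concept roughly every 20 steps over 300 stamps.
true_drifts = set(range(22, 301, 20))                 # 14 changes after t = 1
detected = set(range(22, 301, 20)) - {62} | {150}     # one miss, one false alarm

steps = set(range(2, 301))
tp = len(true_drifts & detected)
fp = len(detected - true_drifts)
fn = len(true_drifts - detected)
tn = len(steps - true_drifts - detected)

sensitivity = tp / (tp + fn)          # detected drifts / actual drifts
specificity = tn / (tn + fp)          # quiet steps correctly left quiet
f_measure = 2 * tp / (2 * tp + fp + fn)
print(round(sensitivity, 2), round(specificity, 2), round(f_measure, 2))
```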

TABLE II. F-MEASURE, SENSITIVITY AND SPECIFICITY ON THE ROTATING CHECKERBOARD DATASET, AVERAGED OVER 10 INDEPENDENT TRIALS. TRUE POSITIVES CORRESPOND TO THE TRUE POINTS OF CHANGE IN THE DATA.

Measure           | γ=0.5 | γ=1.0 | γ=1.5 | γ=2.0 | α=0.05 | α=0.1
F-measure (ω₁)    | 0.80  | 0.84  | 0.78  | 0.64  | 0.79   | 0.82
Sensitivity (ω₁)  | 1.0   | 0.97  | 0.81  | 0.61  | 0.98   | 1.0
Specificity (ω₁)  | 0.97  | 0.98  | 0.98  | 0.99  | 0.97   | 0.97
F-measure (ω₂)    | 0.79  | 0.81  | 0.78  | 0.64  | 0.82   | 0.80
Sensitivity (ω₂)  | 0.99  | 0.94  | 0.86  | 0.58  | 0.97   | 0.98
Specificity (ω₂)  | 0.97  | 0.98  | 0.98  | 0.98  | 0.89   | 0.97
Performance       | 0.81  | 0.87  | 0.91  | 0.92  | 0.85   | 0.82

Fig. 9. Location of drift points for ω₁ as detected by HDDDM on the rotating checkerboard problem with abrupt drift; the x-axis indicates the location of the change and the y-axis the setting of the free parameter (γ = 0.5, 1.0, 1.5, 2.0; α = 0.05, 0.1).

Fig. 10. Location of drift points for ω₂ as detected by HDDDM on the rotating checkerboard problem with abrupt drift; the x-axis indicates the location of the change and the y-axis the setting of the free parameter.

Table II also shows the effect of varying γ and α on the checkerboard dataset. We observe that a smaller γ is better for sensitivity (and F-measure), whereas a larger γ is better for specificity (not a surprising outcome), with γ = 1 providing a good compromise. The standard deviation implementation of HDDDM generally maintains a higher performance than the t-statistic implementation; however, the latter is more tolerant to changes in its parameter values. We should note that by performance we are referring to the performance of the HDDDM algorithm, and not the performance of the naïve Bayes classifier.

Fig. 9 and 10 display the locations of the drift detections for ω₁ and ω₂ on the rotating checkerboard dataset, as an additional figure of merit. Recall that this dataset experiences a change approximately every 20 time steps. There are 15 distinct points of drift (including the first time step) in the checkerboard dataset, each indicated by a vertical gridline, whereas each horizontal gridline indicates a different selection of the free parameters of the HDDDM algorithm.

These plots provide a graphical display of the algorithm's ability to detect the changes. Every marker that coincides with a vertical line is a correct detection of drift. Every marker that falls off a vertical gridline is a false alarm on a non-existing drift, whereas every missing marker on a vertical gridline is a missed detection of an actual drift. We observe from these plots that the algorithm was able to make the correct call in the vast majority of the cases. However, a few cases are worth further discussion: for example, there are certain times when drift is detected in one of the classes but not the other, even though the drift existed in both classes. Consider the situation for γ = 1.5, where a change was detected in ω₁ at time stamp 102, while drift was not detected in ω₂. The HDDDM algorithm will still correctly detect the change in the data in this case, because our implementation declares a change when drift is detected in either one of the classes. This approach accommodates all cases where the drift is detected in at least one of the classes. The drawback, however, is a potential increase in the false alarm rate: if the algorithm incorrectly detects drift for either of the classes, it will indicate that drift exists, even though it actually may not. We plan to address this issue in our future work. We also observe that HDDDM implemented with α = 0.05 is quite sensitive and detects change rather often compared to the other implementations of HDDDM.

We should also note that a table similar to Table II, or figures similar to Fig. 9 and 10, cannot be generated for the other databases, either because the drift is continuous, or because the exact locations of the drift are not known.

V. CONCLUSIONS

We have presented a drift detection algorithm, inspired in part by [10], that relies only on the raw data features to estimate whether drift is present in a supervised incremental learning scenario. The approach utilizes the Hellinger distance, along with an adaptive threshold, as a measure to infer whether drift is present between two batches of training data. The adaptive threshold was analyzed using a standard deviation approach and a t-statistic approach; it is computed based on information obtained from the divergence between the training distributions. Preliminary results show that the Hellinger distance drift detection method (HDDDM) presented in this paper can improve the performance of an incremental learning algorithm by resetting the classifier when a change has been detected. The primary goal of this preliminary effort was to see whether the algorithm can detect drift accurately, and whether an intervention based on such detection would benefit the overall classifier performance. The answers to both questions appear to be affirmative. Since our goal was not necessarily to obtain the best concept drift classifier for a given dataset, the actual classifier used in the algorithm was not optimized. The drift detection method proposed in this paper can therefore be employed with any active concept drift algorithm.

Note that the drift detection in HDDDM is not based on classifier error; hence, HDDDM is a classifier-independent approach and can be used with other supervised learning algorithms.

Future work will include the integration of other (non-parametric) statistical tests along with the Hellinger distance, as well as other distance measures. Comparison to other drift detection algorithms, as well as addressing the false alarm issue in cases where the algorithm detects a (non-existing) drift in only one of the classes, are also within the scope of our future work.

REFERENCES

[1] A. Tsymbal, "The problem of concept drift: definitions and related work," Trinity College, Dublin, Ireland, TCD-CS-2004-15, 2004.
[2] C. Alippi, G. Boracchi, and M. Roveri, "Just in time classifiers: managing the slow drift case," International Joint Conference on Neural Networks, pp. 114-120, 2009.
[3] L. I. Kuncheva, "Using control charts for detecting concept change in streaming data," School of Computer Science, Bangor University, BCS-TR-001-2009, 2009.
[4] B. Manly and D. MacKenzie, "A cumulative sum type of method for environmental monitoring," Environmetrics, vol. 11, pp. 151-166, 2000.
[5] C. Alippi and M. Roveri, "Just-in-time adaptive classifiers - part I: detecting nonstationary changes," IEEE Transactions on Neural Networks, vol. 19, no. 7, pp. 1145-1153, 2008.
[6] C. Alippi and M. Roveri, "Just-in-time adaptive classifiers - part II: designing the classifier," IEEE Transactions on Neural Networks, vol. 19, no. 12, pp. 2053-2064, 2008.
[7] C. Alippi, G. Boracchi, and M. Roveri, "Change detection tests using the ICI rule," World Congress on Computational Intelligence, pp. 1190-1196, 2010.
[8] M. Baena-Garcia, J. del Campo-Avila, R. Fidalgo, and A. Bifet, "Early drift detection method," 4th ECML PKDD International Workshop on Knowledge Discovery from Data Streams, pp. 77-86, 2006.
[9] J. Gama, P. Medas, G. Castillo, and P. Rodrigues, "Learning with drift detection," SBIA Brazilian Symposium on Artificial Intelligence, vol. 3741, pp. 286-295, 2004.
[10] D. A. Cieslak and N. V. Chawla, "A framework for monitoring classifiers' performance: when and why failure occurs?," Knowledge and Information Systems, vol. 18, no. 1, pp. 83-108, 2009.
[11] R. Polikar, L. Upda, S. S. Upda, and V. Honavar, "Learn++: an incremental learning algorithm for supervised neural networks," IEEE Transactions on Systems, Man and Cybernetics, Part C, vol. 31, no. 4, pp. 497-508, 2001.
[12] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, John Wiley & Sons, Inc., 2004.
[13] W. N. Street and Y. Kim, "A streaming ensemble algorithm (SEA) for large-scale classification," Seventh ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 377-382, 2001.
[14] C. L. Blake and C. J. Merz, "UCI repository of machine learning databases," 1998.
[15] F. H. Hamker, "Life-long learning cell structures - continuously learning without catastrophic interference," Neural Networks, vol. 14, no. 5, pp. 551-573, 2001.
[16] M. Muhlbaier and R. Polikar, "An ensemble approach for incremental learning in nonstationary environments," 7th International Workshop on Multiple Classifier Systems, Lecture Notes in Computer Science, vol. 4472, Springer, pp. 490-500, 2007.
[17] R. Elwell and R. Polikar, "Incremental learning in nonstationary environments with controlled forgetting," International Joint Conference on Neural Networks, pp. 771-778, 2009.