IEEE Transactions on Knowledge and Data Engineering, DOI 10.1109/TKDE.2020.2990196

ESA-Stream: Efficient Self-Adaptive Online Data Stream Clustering

Yanni Li, Hui Li, Zhi Wang, Bing Liu, Jiangtao Cui, Hang Fei

Abstract—Many big data applications produce a massive amount of high-dimensional, real-time, and evolving streaming data. Clustering such data streams with both effectiveness and efficiency is critical for these applications. Although there are well-known data stream clustering algorithms based on the popular online-offline framework, these algorithms still face some major challenges. Several critical questions are still not answered satisfactorily: How to perform dimensionality reduction effectively and efficiently in the online dynamic environment? How to enable the clustering algorithm to achieve complete real-time online processing? How to make algorithm parameters learn in a self-supervised or self-adaptive manner to cope with high-speed evolving streams? In this paper, we tackle these challenges by proposing a fully online stream clustering algorithm (called ESA-Stream) that can learn parameters online dynamically in a self-adaptive manner, perform dimensionality reduction, and cluster data streams effectively and efficiently in an online and dynamic environment. Experiments on a wide range of synthetic and real-world data streams show that ESA-Stream outperforms state-of-the-art baselines considerably in both effectiveness and efficiency.

Index Terms—Self-Adaptive, Data Stream, Online Clustering.

1 INTRODUCTION

WITH an ever-increasing number of applications involving data streams, e.g., data from smart phones, network monitoring, stock quotes, and the Internet of Things (IoT), unsupervised clustering of data streams has become an important problem for machine learning and big data analysis [1], [2], [3], [4]. As data streams are data-intensive, temporally ordered, and rapidly evolving, clustering them efficiently and effectively presents a major challenge for storage capacity, processing time, computational modeling, etc. [5], [6], [7].

Most existing data stream clustering solutions typically follow a two-stage (online and offline) framework [8]. In the online stage, the system receives and summarizes the data into micro-clusters or grid cells. In the offline stage, these micro-clusters are merged into the final macro-clusters. Since most existing algorithms consider offline processing to be less time sensitive, traditional clustering algorithms such as k-means, DBSCAN [9], etc., are commonly used. Although a dominant framework, this two-stage online-offline approach has some key weaknesses. It is incapable of responding to high-speed and high-dimensional evolving data streams effectively because (1) it can miss drifting concepts, since the two-phase processing model cannot truly achieve real-time and online processing, (2) its manually fixed parameters cannot keep up with dynamic changes in the data stream [7], [10], [11], [12], [13], and (3) it often has no mechanism to speed up dimensionality reduction in an online and dynamic environment.

To overcome these weaknesses, this paper proposes a fully online and light-weight stream clustering method, called ESA-Stream (Efficient Self-Adaptive Stream; code available at https://github.com/XDU-CS/ESA-Stream). Its key parameters are learned/generated online in a self-adaptive manner. It also has a grid density centroid-based method to help speed up dimensionality reduction and to cluster more efficiently and accurately. Experimental results on both synthetic and real-world data streams show that ESA-Stream can efficiently and effectively tackle those weaknesses, exhibiting significantly better performance in both running time and result quality compared to state-of-the-art baseline algorithms. In summary, the main contributions of this paper are:

• It presents a fully online data stream clustering method, ESA-Stream, which does not need the traditional time-consuming offline stage and can detect arbitrarily shaped clusters efficiently and effectively.
• It proposes an efficient self-adaptive technique to learn parameter settings dynamically during clustering to keep up with drifting or evolving data streams. We are not aware of any existing work that is able to do this.
• It introduces the notions of the grid density centroid and the grid feature vector to speed up dimensionality reduction and to cluster data streams in an efficient and fully online manner.

The rest of this paper is organized as follows. The literature review is given in Sec. 2. Preliminary concepts and definitions are presented in Sec. 3. Sec. 4 gives a high-level overview of the proposed solution. The proposed methods in ESA-Stream are detailed in Sec. 5. Empirical evaluation is reported in Sec. 6. Sec. 7 concludes the paper.

• Y. Li, Z. Wang, J. Cui and H. Fei are with the School of Computer Science and Technology, Xidian University, Xi'an, China, 710071. E-mails: [email protected], xd [email protected], [email protected], avery [email protected].
• H. Li is with the State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an, China, 710071. E-mail: [email protected].
• B. Liu is with the Wangxuan Institute of Computer Technology, Peking University, and the Department of Computer Science, University of Illinois at Chicago. E-mail: [email protected].
• Both Hui Li and Zhi Wang contributed equally to this work.
• Both J. Cui and B. Liu are co-corresponding authors.



TABLE 1: Main characteristics of recent data stream clustering algorithms.

| Algorithm name | Computing framework | Parallel processing | Processing high-dimensional data | Self-adaptive parameters | Handling cluster shapes |
|---|---|---|---|---|---|
| D-Stream I [14], D-Stream II [15] and GD-Stream [16] | Online-offline | No | No | No | Arbitrary |
| DS [17] and StreamDM [18] | Online-offline | Yes | No | No | Arbitrary and spherical |
| SNCStream [2], CEDAS [19] and DBSTREAM [20] | Online | No | No | No | Arbitrary |
| HDDStream [21] and PKS-Stream [22] | Online-offline | No | Yes | No | Arbitrary |
| ASKM [23], HDDS [24] and Spark-CluStream [25] | Online-offline (Apache Spark) | No | Yes | No | Spherical |

2 RELATED WORK

As mentioned above, most existing data stream clustering algorithms adopt the two-stage online-offline framework. According to their clustering methods in the offline stage, existing algorithms can be divided into the following categories: partition-based clustering algorithms [26], [27], [28], [29], hierarchical clustering algorithms [30], density-based/density grid-based algorithms [15], [16], [19], [21], [31], etc. Since the density-based/density grid-based algorithms can produce arbitrarily shaped clusters with high efficiency, they are widely favored. Our ESA-Stream algorithm also uses the density grid-based clustering method. Hence, our discussion below focuses on the latest two-stage density-based/density grid-based data stream clustering algorithms.

Most existing data stream clustering algorithms are designed to cluster domain-specific data streams [16], [23], [32], [33] or to detect changes in data streams [34], [35]. Due to the inherently complex nature of data streams and the stringent clustering requirements (Sec. 3.1) of real-time and high-dimensional evolving streams, several representative works have attempted to deal with the weaknesses of the existing algorithms discussed in the introduction.

Improving the efficiency of the offline stage: To improve the efficiency of the offline stage, many density grid-based clustering solutions [14], [15], [22] following the online-offline clustering framework have been proposed. These algorithms combine the advantages of density-based and grid-based clustering approaches and are often referred to as density grid-based hybrid clustering algorithms. These methods can discover arbitrarily shaped clusters and detect outliers. They also work fairly efficiently, as they depend only on the number of grids containing data points, which is much smaller than the total number of data points in the input stream. In these approaches, the data space is partitioned into small segments called grids. Each data point from the data stream is mapped into a grid, which is then clustered with others based on their densities. The representative work following this strategy is D-Stream I [14], which was improved in D-Stream II [15] with a grid attraction mechanism, resulting in higher clustering quality. Although D-Stream I and II perform well on some streams, they do not work well on high-dimensional evolving data streams.

In recent years, several researchers have also exploited powerful computing platforms such as GPU-based or Storm-based (Apache Storm is a distributed data stream computing framework) approaches [17], [36] to improve clustering efficiency for very fast streams. Moreover, there are also several platforms, e.g., HadoopStreaming (https://CRAN.R-project.org/package=HadoopStreaming) and RStorm (https://CRAN.R-project.org/package=RStorm), for efficiently processing data streams [37]. Our work is not to build a platform, but to propose a fully online and light-weight data stream clustering method/framework.

On developing one-stage algorithms, [2] introduced a one-stage online clustering algorithm, SNCStream, based on Social Network Theory; it used homophily to find non-hyperspherical clusters. [20] presented a micro-cluster-based online clustering component that explicitly captures the density between micro-clusters via a shared density graph. Similar to the method in [2], [19] proposed a fully online clustering algorithm, CEDAS, for evolving streams with arbitrarily shaped clusters. Although these online methods have improved clustering efficiency to some extent, none of them can handle high-dimensional data streams.

High-dimensional data stream clustering: Due to the curse-of-dimensionality [38], the distances between pairs of data points in a high-dimensional data space are numerically quite similar, resulting in no cluster or group. To address this issue, clustering algorithms based on subspace projection [21], [39] and on an index tree (PKS-Stream [22]) were proposed. However, the dimensionality reduction methods in these techniques are computationally very expensive and not suitable for real-time and dynamically evolving data streams. Some recent works [23], [24], [25] also addressed high-dimensional clustering over dynamic streams based on k-median or k-means. However, they require a user-specified number of clusters k and are suitable only for finding hyperspherical clusters.

The main characteristics of the above algorithms are summarized in Table 1, from which we can make the following observations.

• To improve the efficiency of clustering and dimensionality reduction for data streams, existing algorithms have made some progress, but there is still room for major improvements. More importantly, none of the existing algorithms can fully meet the basic requirements of data stream clustering (see Sec. 3.1).



• We found no work that learns clustering parameter settings adaptively and automatically in order to keep up with real-time, ever-changing data streams.

3 PRELIMINARY AND DEFINITIONS

3.1 Data Stream Clustering and Its Basic Requirements

We refer to the massive, unbounded sequences of data points continuously generated by applications at rapid rates as data streams [8], [11]. More formally, a data stream S is a massive sequence of data points x1, x2, ..., xn, each of which is associated with a particular time stamp, i.e., X = {xi}, where xi ∈ R^d, 1 ≤ d ∈ Z+, 1 ≤ i ≤ n, and n → ∞.

The intrinsic nature of data streams calls for the following requirements for data stream clustering algorithms [10], [40], [41], [42]:

• Since a data stream is a massive and unbounded sequence of data points, it is impossible to store the entire stream. There must be a compact representation for it.
• A stream is generally high-speed and continually evolving, which requires real-time processing. As a result, traditional clustering methods that follow multi-scan procedures become infeasible. Data points from the stream can only be read sequentially once, and new data points must be processed quickly (i.e., online) and incrementally [43].
• The algorithms should quickly identify outliers and cope with concept drift [32], [35], [44].
• Data streams are often inherently high-dimensional, so stream algorithms should be able to process high-dimensional data efficiently [24].

From our literature survey, we can see that no existing algorithm satisfies all these requirements simultaneously. In this paper, we aim to propose a solution that attempts to meet all these requirements at the same time.

3.2 Grids and Related Definitions

We define the data points from a stream with d dimensions to belong to the space S = S1 × ... × Sd, where Si is the space for the i-th dimension, which is evenly divided into pi segments. The space S is thus partitioned into N = Π_{i=1}^{d} pi grids. We denote a grid g as g = (j1, ..., jd), where ji ∈ {1, ..., pi}, and (j1, ..., jd) is used as the ID of g.

To capture the dynamic and evolving nature of data streams, we assign each data point xi in the stream a density coefficient, which decays as time elapses. If data point xi arrives at time stamp tc, the density coefficient of xi at time t (t > tc), denoted as D(xi, t), is defined as

    D(xi, t) = λ^(t−tc)    (1)

where λ ∈ (0, 1) is a constant called the decay factor.

For a grid g at a given time tc, let E(g, tc) be the set of data points that are mapped to g at or before tc. The grid density D(g, tc) of g is defined as the sum of the density coefficients of all data points that are mapped to g at or before tc:

    D(g, tc) = Σ_{xi ∈ E(g,tc)} D(xi, tc)    (2)

Since data continuously arrive and keep evolving, the density of a grid changes constantly. However, it has been proven that it is unnecessary to update the density of a grid at every timestamp [14], since it can be computed quickly on demand. If a new data point is mapped into a grid g at time t, the density of g can be updated as follows [15]:

    D(g, t) = λ^(t−tc) D(g, tc) + 1    (t > tc)    (3)

where D(g, tc) is the density of g at the last update moment tc before time t.

In this paper, two self-learning and self-adaptive grid density threshold parameters, namely Dl (lower limit) and Du (upper limit), are used to divide grids into three categories [14], [15] for more accurate clustering and outlier detection. At time t, we call a grid g a dense grid if its grid density D(g, t) (see Eqs. (2) and (3)) is greater than or equal to Du. If its grid density is less than or equal to Dl but greater than 0, we call it a sparse grid (empty grids are excluded; sparse grids are treated as noise). Otherwise, it is called a transitional grid. Note that Dl and Du are two key parameters; we will show in Sec. 5.1 that both of them can be computed/learned in an unsupervised, self-adaptive manner.
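To make the grid bookkeeping of Eqs. (1)-(3) concrete, the following minimal Scala sketch maintains a grid's decayed density and classifies grids against the thresholds Dl and Du. This is our illustration of the definitions above, not the authors' released code; all names are hypothetical.

```scala
object GridDensitySketch {

  final case class GridState(density: Double, lastUpdate: Long)

  // Eq. (3): when a new point maps into grid g at time t, decay the stored
  // density from the last update time to t and add the new point's weight of 1.
  def updateOnArrival(g: GridState, t: Long, lambda: Double): GridState =
    GridState(math.pow(lambda, (t - g.lastUpdate).toDouble) * g.density + 1.0, t)

  // Eq. (1): density coefficient of a single point that arrived at time tc.
  def pointDensity(tc: Long, t: Long, lambda: Double): Double =
    math.pow(lambda, (t - tc).toDouble)

  sealed trait Category
  case object Dense extends Category
  case object Transitional extends Category
  case object Sparse extends Category

  // Grid categories of Sec. 3.2, relative to the learned thresholds Dl and Du.
  def classify(density: Double, dl: Double, du: Double): Category =
    if (density >= du) Dense
    else if (density > 0 && density <= dl) Sparse
    else Transitional

  def main(args: Array[String]): Unit = {
    val lambda = 0.998
    var g = GridState(density = 0.0, lastUpdate = 0L)
    for (t <- Seq(0L, 5L, 12L)) g = updateOnArrival(g, t, lambda) // three arrivals
    println(f"density at t=12: ${g.density}%.4f") // ~2.9623 for lambda = 0.998
  }
}
```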
4 OVERVIEW OF THE PROPOSED ESA-STREAM

The proposed ESA-Stream framework is outlined in Fig. 1. With five components cooperating with each other, ESA-Stream can cluster real-time, high-dimensional, evolving data streams online, efficiently and effectively. The functions of the five components are described as follows.

• Data Stream Acceptor: It accepts each data point from a stream in real-time and performs a min-max normalization for each dimension of the data point.
• Grid Manager: It first maps or projects the normalized data point to a grid, then calculates and updates the density and the feature vector (see Sec. 5.2 for details) of the grid online, at a time interval gap between two consecutive clustering updates (see Sec. 5.1.3). It then calculates and produces a grid density centroid data stream. If the grid density centroid data stream is high-dimensional, its dimensionality is reduced efficiently with our proposed Alg. 1, GCB-PCA (see Sec. 5.2.2 for details).
• Parameters Online Learner: It keeps learning and updating the parameters (i.e., Dl, Du and gap) online as new data points in the stream arrive.
• Parallel Clustering Engine: After each gap, it intelligently clusters the grid density centroid data (not the original data points) from the Grid Manager in parallel with the strategy ECS described in Sec. 5.2.4. It then generates the intermediate clustering results.
• Result Integrator: It integrates the intermediate clustering results and outputs the final clusters for the gap.

The five components of ESA-Stream shown in Fig. 1 are pipelined. With collaboration among the components, ESA-Stream achieves efficient self-adaptive online data stream clustering.
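As a rough architectural illustration only, the Scala interfaces below mirror the five components of Fig. 1; every name and signature here is our own assumption rather than the paper's published API.

```scala
object PipelineSketch {
  type GridId = Vector[Int]

  trait DataStreamAcceptor {       // accepts a raw point, returns it min-max normalized
    def accept(x: Array[Double]): Array[Double]
  }
  trait GridManager {              // maps a normalized point (values in [0,1]) to its grid ID
    def mapToGrid(x: Array[Double], len: Double): GridId =
      x.map(v => math.min((v / len).toInt, math.ceil(1.0 / len).toInt - 1)).toVector
  }
  trait ParametersOnlineLearner {  // learns (Dl, Du, gap) online from grid densities
    def learn(densities: Seq[Double], lambda: Double): (Double, Double, Long)
  }
  trait ParallelClusteringEngine { // clusters grid density centroids in parallel
    def cluster(centroids: Map[GridId, Array[Double]]): Seq[Set[GridId]]
  }
  trait ResultIntegrator {         // merges intermediate clusters that share grids
    def integrate(parts: Seq[Set[GridId]]): Seq[Set[GridId]]
  }
}
```

A concrete deployment would wire these interfaces together as in Fig. 1 and Alg. 3 below.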



[Fig. 1. Stream clustering of ESA-Stream. The figure shows a two-phase pipeline: in the data preparation phase, the Data Stream Acceptor feeds normalized data points to the Grid Manager, which produces grid density centroid data; in the clustering analysis phase, the Parallel Clustering Engine produces intermediate clustering results that the Result Integrator merges into the final clustering results, while the Parameters Online Learner feeds updated parameters back to the other components.]

ESA-Stream Framework

Step 1: Accept each data point from the stream in real-time and perform a min-max normalization for each dimension;
Step 2: Map the normalized data point to a grid and update the density and feature vector Vg of the grid;
Step 3: Read the recently updated algorithm parameters Dl, Du and gap from the Parameters Online Learner of ESA-Stream;
Step 4: Input and save the Vg data of all non-sparse grids from the Grid Manager of ESA-Stream at a specific gap;
Step 5: For a high-dimensional data stream, reduce the dimensionality of the grid density centroids in Vg by PCA and then go to Step 6; otherwise, go to Step 6 directly;
Step 6: Intelligently divide the set SVg of Vg data into m (an integer) equal parts, SVg1, SVg2, ..., SVgm;
Step 7: Cluster SVg1, SVg2, ..., SVgm in parallel with the proposed density grid-based clustering strategy, resulting in the intermediate clustering results Ct1, Ct2, ..., Ctn;
Step 8: Have all of Ct1, Ct2, ..., Ctn been output to the Result Integrator of ESA-Stream? If yes, go to Step 9; otherwise, repeat Step 8;
Step 9: Integrate Ct1, Ct2, ..., Ctn and output the final clustering results C1, C2, ..., Ck;
Step 10: Is it time for the next gap? If yes, go to Step 3; otherwise, wait until the next gap.

Fig. 2. The framework of the ESA-Stream algorithm.

According to the above ESA-Stream component functions and Figs. 1 and 2, we can identify the following unique features, which are further supported by our theoretical analysis and extensive experiments.

• Fully real-time online: The data stream is processed fully online in real time, without any offline stage.
• Online-learning parameters: With the component Parameters Online Learner and the interactions between components, ESA-Stream is an intelligent framework whose parameters are learned online and adapted online to cope with concept drift during the streaming process.
• Light-weight but high-efficiency: Thanks to the strategies of grid density centroid-based clustering with dimensionality reduction (see Sec. 5.2.2 for details) and divide-and-conquer, ESA-Stream is a light-weight yet efficient stream clustering framework.

5 KEY STRATEGIES EMBEDDED IN ESA-STREAM

5.1 Self-adaptive Parameter Learning

As indicated above, ESA-Stream has three key parameters: the grid density lower and upper thresholds Dl and Du, and the optimal time interval gap at which the grids are examined in order to update the clusters. Since fixed, user-supplied parameters in all existing algorithms are unsuitable for dynamic and evolving data streams, we propose the following methods to learn and adapt these parameters automatically.

5.1.1 Grid Density Lower Threshold Dl

[14] and [15] proved that, as t → ∞, the sum of all grid densities from time 0 to t approaches 1/(1 − λ). With this knowledge, we propose a method to compute Dl. Suppose the total number of non-empty grids (i.e., dense, transitional, and sparse grids) is Mt at time t. In general, Mt is much less than the total number of grids N (Mt ≪ N) due to the large number of empty grids. The average density of these three types of grids is then 1/(Mt(1 − λ)). We assume that the densities of the three types of grids follow a normal distribution (which is also verified in our initial experiments). Since dense, transitional and sparse are all statistical concepts relative to the mean value, and based on extensive experimentation and verification, we define non-empty grids whose density is less than two-thirds of the average grid density as sparse grids. Thus, the grid density lower threshold Dl at time t is defined as

    Dl(t) = 2 / (3 Mt (1 − λ))    (4)

5.1.2 Grid Density Upper Threshold Du

Similarly, grids whose density is higher than the average density of the non-sparse (and non-empty) grids can be regarded as dense grids. Based on this idea and experimental verification,



we define Du as follows, where gi ranges over the grids whose density D(gi, t) is greater than Dl at time t:

    Du(t) = avg(D(gi, t))    (5)

5.1.3 gap: Time Interval between Two Consecutive Cluster Updates

As the data stream constantly evolves, the densities of the grids change [34] from time to time. The density of each grid should be examined and the clusters should be updated as the data evolves. We define the optimal length of the time interval between two consecutive grid examinations and cluster updates/adjustments as the gap. The question is how to compute the gap.

Generally, the value of gap can be neither too large nor too small. If it is too large, dynamic changes of the data stream will not be captured. If it is too small, it will result in very frequent and needless computations in the clustering phase. We propose the following technique to determine the value of gap between two consecutive clustering updates. In particular, we consider the minimum time needed for a dense grid to degenerate into a sparse grid, as well as that needed for a sparse grid to become a dense grid, and set gap to the smaller of the two values.

Lemma 5.1. For any dense grid g, the minimum time needed for g to become a sparse grid is

    δ1 = ⌊log_λ(Dl/Du)⌋    (6)

Proof: Suppose that at time t the grid g is a dense grid; then

    D(g, t) ≥ Du    (7)

Suppose that after δt1 time, g becomes a sparse grid; then

    D(g, t + δt1) ≤ Dl    (8)

Let E(g, t) be the set of data points mapped into g at time t. Since some data points may be mapped into g between time t and t + δt1, E(g, t) ⊆ E(g, t + δt1), and we have:

    D(g, t + δt1) = Σ_{x ∈ E(g,t+δt1)} D(x, t + δt1)
                  ≥ Σ_{x ∈ E(g,t)} D(x, t + δt1)
                  = λ^(δt1) Σ_{x ∈ E(g,t)} D(x, t)
                  = λ^(δt1) D(g, t)    (9)

Combining Eqs. (7)-(9), we get:

    Dl ≥ D(g, t + δt1) ≥ λ^(δt1) D(g, t) ≥ λ^(δt1) Du    (10)

which yields:

    δt1 ≥ log_λ(Dl/Du)    (11)

Hence, for any dense grid g, the minimum time needed for g to become a sparse grid is δ1 = ⌊δt1⌋.

Lemma 5.2. For any sparse grid g, the minimum time needed for g to become a dense grid is

    δ2 = ⌊log_λ((1 − Du Mt(1 − λ)) / (1 − Dl Mt(1 − λ)))⌋    (12)

Proof: According to the definition of grid types in Sec. 3.2, if a grid g is a sparse grid at time t, we have:

    D(g, t) ≤ Dl    (13)

Suppose that after δt2 time, g becomes a dense grid; then we have:

    D(g, t + δt2) ≥ Du    (14)

Note that E(g, t + δt2) can be divided into the points in E(g, t) and the points that arrive after t. In the extreme case, new data points arrive at every time step from t + 1 until t + δt2, and the average density increment of each grid is δ = Σ_{i=0}^{δt2−1} λ^i / Mt. Therefore, we have:

    D(g, t + δt2) = Σ_{x ∈ E(g,t+δt2)} D(x, t + δt2)
                  ≤ Σ_{x ∈ E(g,t)} D(x, t + δt2) + Σ_{i=0}^{δt2−1} λ^i / Mt
                  = λ^(δt2) D(g, t) + (1 − λ^(δt2)) / (Mt(1 − λ))    (15)

Since the sparse grid g at time t becomes a dense grid at time t + δt2, based on Eqs. (13)-(15) we have:

    Du ≤ D(g, t + δt2)
       ≤ λ^(δt2) D(g, t) + (1 − λ^(δt2)) / (Mt(1 − λ))
       ≤ λ^(δt2) Dl + (1 − λ^(δt2)) / (Mt(1 − λ))    (16)

Solving Eq. (16) yields:

    δt2 ≥ log_λ((1 − Du Mt(1 − λ)) / (1 − Dl Mt(1 − λ)))    (17)

Thus, δ2 = ⌊δt2⌋.

Based on δ1 and δ2, we set gap as in Eq. (18), so that any change of a grid from dense to sparse or vice versa can be recognized in real time. Since log_λ is monotonically decreasing for λ ∈ (0, 1), the smaller of δ1 and δ2 corresponds to the larger of the two logarithm arguments:

    gap = min(δ1, δ2)
        = ⌊log_λ max(Dl/Du, (1 − Du Mt(1 − λ)) / (1 − Dl Mt(1 − λ)))⌋    (18)

Note that each new gap (resp., δ1, δ2) is computed and updated based on the changes over all the grids during the last gap. That is, after a period of gap^(i), we check the density changes over all the grids and compute the next gap, gap^(i+1) = min(δ1^(i), δ2^(i)).

Clearly, since Mt and the D(gi, t) used in the formulation of Dl, Du and gap are all captured online by the component Parameters Online Learner, the three parameters Dl, Du and gap of ESA-Stream are learned online in an unsupervised, self-adaptive manner for evolving data streams.
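The learning rules above reduce to a few lines of arithmetic. The sketch below is our illustration of Eqs. (4), (5) and (18); the names, the toy data, and the fallback used when the second logarithm argument is not positive are assumptions of this sketch, not rules from the paper.

```scala
object ParameterLearningSketch {

  // Eq. (4): Dl(t) = 2 / (3 * Mt * (1 - lambda)), with Mt non-empty grids.
  def lowerThreshold(mt: Int, lambda: Double): Double =
    2.0 / (3.0 * mt * (1.0 - lambda))

  // Eq. (5): Du(t) = average density of the grids whose density exceeds Dl.
  def upperThreshold(densities: Seq[Double], dl: Double): Double = {
    val nonSparse = densities.filter(_ > dl)
    if (nonSparse.isEmpty) dl else nonSparse.sum / nonSparse.size
  }

  // Eq. (18): because log_lambda is decreasing for lambda in (0,1),
  // min(delta1, delta2) corresponds to the LARGER of the two ratios.
  // A non-positive second ratio means the sparse-to-dense transition cannot
  // occur, so this sketch falls back to the first ratio (our assumption).
  def gap(dl: Double, du: Double, mt: Int, lambda: Double): Long = {
    val r1 = dl / du
    val r2 = (1.0 - du * mt * (1.0 - lambda)) / (1.0 - dl * mt * (1.0 - lambda))
    val arg = if (r2 > 0.0) math.max(r1, r2) else r1
    (math.log(arg) / math.log(lambda)).floor.toLong
  }

  def main(args: Array[String]): Unit = {
    val lambda = 0.998
    val densities = Seq.fill(900)(0.2) ++ Seq.fill(100)(3.0) // toy non-empty grids
    val dl = lowerThreshold(densities.size, lambda)           // 2/6 = 0.3333
    val du = upperThreshold(densities, dl)                    // 3.0
    println(f"Dl=$dl%.4f Du=$du%.4f gap=${gap(dl, du, densities.size, lambda)}")
  }
}
```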

5.2 Parallel Clustering Engine

Most density grid-based clustering algorithms simply map data points to grids and compute each grid's density for clustering. This approach can lose the position information of the data points within a grid and lead to low clustering quality. As illustrated in Fig. 3, grids 1 to 6 are dense grids, but grids 3 and 4 are separated by small areas of much lower density. Grids 7 to 10 are all transitional grids, but they are close to each other.



[Fig. 3. A schematic diagram of the grid data distribution and the Chebyshev distances (d1, d2, d3) between grid density centroids; a large red dot represents the grid density centroid of a grid. A marks the 2d directly adjacent grids of a grid g; B marks the (3^d − 1) − 2d indirectly adjacent grids of g.]

In the existing density grid-based clustering algorithms, grids 1 to 6 would improperly form one cluster, while grids 7 to 10 would produce no cluster. These problems can be alleviated with a finer grid resolution, or by the method of [15] based on the attractions of all data points between adjacent grids. However, our analysis and experiments show that these methods are unsuitable for high-speed streams due to their high computation cost.

5.2.1 Grid Density Centroid and Grid Feature Vector

For efficient and effective clustering, inspired by the centroidal equation in physics [45], we first introduce the concept of the grid density centroid Dcen(g, t), which represents all the data points mapped into a grid g. It is defined as:

    Dcen(g, t) = (xj + Σ_{xi ∈ E(g,t)} D(xi, t) · xi) / (1 + Σ_{xi ∈ E(g,t)} D(xi, t))
               = (xj + Σ_{xi ∈ E(g,t)} D(xi, t) · xi) / (1 + D(g, t)),  i ≠ j    (19)

where xj = (xj,1, ..., xj,d) is a new d-dimensional data point mapped into grid g at time t, and xi = (xi,1, ..., xi,d) is a d-dimensional data point mapped into grid g before t. From Eq. (19) we can see that, compared with the traditional centroid, the grid density centroid not only preserves the coordinate information of the data points in the grid, but also takes both the positional distribution and the density of the points into account.

In order to record the necessary information about a grid g, we introduce another concept, the grid feature vector Vg, which is a four-element tuple defined as

    Vg = (ID, D(g, t), Dcen(g, t), tu)    (20)

where ID is the serial number of grid g and tu is the time of the last update to D(g, t) and Dcen(g, t). Note that Dcen(g, t) and Vg are updated in each gap by Eqs. (3), (19) and (20).

We will see later that the grid density centroid Dcen(g, t) plays a central role in substantially improving clustering efficiency and effectiveness. First, dimensionality reduction based on the grid density centroids, rather than on the original data points, is much more efficient, because a centroid summarizes all the data points in its grid while preserving their positional distribution and density; for high-dimensional data streams, dimensionality reduction needs to be carried out efficiently. Second, merging grids based on the distance between grid density centroids substantially improves the clustering result quality, using the merge method proposed below. Note that, in grid merging, Dcen(g, t) in Vg should be replaced with D′cen(g′, t) (the grid density centroid of g after dimensionality reduction).
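The centroid can be maintained incrementally by keeping the density-weighted coordinate sums alongside the density itself. The Scala sketch below is our reading of Eqs. (3), (19) and (20), with illustrative field names rather than the paper's code.

```scala
object CentroidSketch {

  final case class GridFeature(id: Vector[Int],             // grid ID (j1,...,jd)
                               var density: Double,          // D(g,t)
                               var weighted: Array[Double],  // sum of D(xi,t)*xi
                               var lastUpdate: Long) {       // tu

    // Fold a new point xj arriving at time t into the feature vector:
    // decay the stored sums to time t (Eq. (3)), then add the new point.
    def add(xj: Array[Double], t: Long, lambda: Double): Unit = {
      val decay = math.pow(lambda, (t - lastUpdate).toDouble)
      density *= decay
      var i = 0
      while (i < weighted.length) { weighted(i) = weighted(i) * decay + xj(i); i += 1 }
      density += 1.0
      lastUpdate = t
    }

    // Dcen(g,t) of Eq. (19): (xj + sum D(xi,t)*xi) / (1 + sum D(xi,t));
    // with the running sums kept above this is simply weighted/density.
    def centroid: Array[Double] = weighted.map(_ / density)
  }

  def main(args: Array[String]): Unit = {
    val g = GridFeature(Vector(3, 7), 0.0, Array(0.0, 0.0), 0L)
    g.add(Array(0.30, 0.52), t = 0L, lambda = 0.998)
    g.add(Array(0.34, 0.50), t = 8L, lambda = 0.998)
    println(g.centroid.map(v => f"$v%.4f").mkString("(", ", ", ")"))
  }
}
```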



5.2.2 Grid Centroid-based Dimensionality Reduction

As we know, for high-dimensional data streams the curse-of-dimensionality is bound to occur. On the one hand, data points from a high-dimensional data stream are sparsely distributed in the space and almost equidistant from one another, resulting in no cluster or group. On the other hand, as the dimensionality increases, the total number of grids grows exponentially, which leads to a drastic drop in the computational performance of density grid-based clustering algorithms. Although some dimensionality reduction techniques, such as subspace projection and index trees, are used in some high-dimensional data stream clustering algorithms, these dimensionality reduction methods [39], [46] use all the data points of a grid, and are therefore computationally very expensive and not suitable for real-time evolving data streams. To overcome these weaknesses, we propose an efficient grid centroid-based dimensionality reduction method, called GCB-PCA (Grid Centroid-Based Principal Component Analysis), shown in Alg. 1.

In each time interval gap, GCB-PCA performs dimensionality reduction on the grid density centroids. A grid density centroid embeds the density distribution, along with the locations, of all data points mapped into the grid, and is a representative point of all points in the grid. Even without parallel computation, our GCB-PCA achieves very high efficiency, which is further verified by our experiments in Sec. 6.

Algorithm 1 GCB-PCA
Input: the non-sparse grid feature vector set SVg in the original d-dimensional data space S;
Output: the non-empty grid feature vector set SVg′ and the grid centroid set SD′cen(g′,t) in the reduced d′-dimensional data space S′ (d′ ≪ d);
1: SVg′ ← ∅; SD′cen(g′,t) ← ∅; // initialization
2: SDcen(g,t) ← extract Dcen(g, t) of each Vg in SVg to construct the grid centroid set SDcen(g,t);
3: for each Dcen(g, t) ∈ SDcen(g,t) do
4:   D′cen(g′, t) ← reduce the dimensionality of the grid centroid Dcen(g, t) from d to d′ using PCA;
5:   SD′cen(g′,t) ← SD′cen(g′,t) ∪ {D′cen(g′, t)};
6: end for
7: for each D′cen(g′, t) ∈ SD′cen(g′,t) do
8:   ⟨g, g′⟩ ← record the correspondence between a grid g in the original space S and a grid g′ in the reduced-dimension space S′ based on the results of Line 4;
9:   calculate D(g′, t) based on D(g, t) and ⟨g, g′⟩;
10:  calculate Vg′;
11:  if Vg′ ∉ SVg′ then
12:    SVg′ ← SVg′ ∪ {Vg′};
13:  else
14:    update D(g′, t), D′cen(g′, t), and Vg′ by Eqs. (3), (19) and (20);
15:  end if
16: end for
17: return SVg′ and SD′cen(g′,t).
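To illustrate why operating on centroids is cheap, the self-contained sketch below runs a tiny PCA over the Mt grid density centroids (one vector per non-sparse grid) and projects them to d′ dimensions. The power-iteration eigen-solver here is a stand-in for any PCA routine under our own assumptions; it is not the paper's implementation.

```scala
object GcbPcaSketch {

  def mean(xs: Array[Array[Double]]): Array[Double] = {
    val m = new Array[Double](xs(0).length)
    for (x <- xs; i <- m.indices) m(i) += x(i) / xs.length
    m
  }

  def covariance(xs: Array[Array[Double]], mu: Array[Double]): Array[Array[Double]] = {
    val d = mu.length
    val c = Array.ofDim[Double](d, d)
    for (x <- xs; i <- 0 until d; j <- 0 until d)
      c(i)(j) += (x(i) - mu(i)) * (x(j) - mu(j)) / xs.length
    c
  }

  // Leading eigenvector/eigenvalue of a symmetric matrix by power iteration.
  def powerIteration(c: Array[Array[Double]], iters: Int = 200): (Array[Double], Double) = {
    val d = c.length
    var v = Array.fill(d)(1.0 / math.sqrt(d))
    for (_ <- 0 until iters) {
      val w = Array.tabulate(d)(i => (0 until d).map(j => c(i)(j) * v(j)).sum)
      val n = math.sqrt(w.map(x => x * x).sum)
      v = w.map(_ / n)
    }
    val l = (0 until d).map(i => (0 until d).map(j => v(i) * c(i)(j) * v(j)).sum).sum
    (v, l)
  }

  // Project the centroids onto the top dPrime principal axes (deflating between axes).
  def reduce(centroids: Array[Array[Double]], dPrime: Int): Array[Array[Double]] = {
    val mu = mean(centroids)
    var c = covariance(centroids, mu)
    val axes = (0 until dPrime).map { _ =>
      val (v, l) = powerIteration(c)
      c = Array.tabulate(c.length, c.length)((i, j) => c(i)(j) - l * v(i) * v(j))
      v
    }
    centroids.map(x => axes.map(v => v.indices.map(i => v(i) * (x(i) - mu(i))).sum).toArray)
  }

  def main(args: Array[String]): Unit = {
    val centroids = Array(Array(0.1, 0.2, 0.1), Array(0.8, 0.9, 0.8),
                          Array(0.2, 0.3, 0.2), Array(0.9, 0.8, 0.9))
    reduce(centroids, 2).foreach(r => println(r.map(v => f"$v%.3f").mkString(" ")))
  }
}
```

The key design point is that the PCA runs over only Mt centroid vectors rather than over every raw point, which is what makes repeating it at every gap affordable.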



5.2.3 Grid Merging Strategy

In order to avoid the issue of spatial distance measurement in high-dimensional data streams, we use the Chebyshev distance for grid merging. Let Dis(gi, gj) (i ≠ j) be the Chebyshev distance between the density centroids of two adjacent grids gi and gj. Intuitively, if one of the following conditions is satisfied (designed empirically), the two adjacent grids gi and gj can be merged into an intermediate micro-cluster.

(1) Dis(gi, gj) ≤ 4/3 · len, if both gi and gj are dense grids (i.e., the density centroids of adjacent dense grids should be at most 4/3 · len apart, such as the distance between grids 1 and 2 in Fig. 3). Here len is the segment length of each dimension of a grid. Note that, as in all other grid-based approaches, len is set manually.

(2) Dis(gi, gj) ≤ 1.0 · len, if gi is a dense grid and gj is a transitional grid, or vice versa (such as the distance between grids 2 and 3 in Fig. 3);

(3) Dis(gi, gj) ≤ 2/3 · len, if both gi and gj are transitional grids and D(gi, t) + D(gj, t) ≥ Du (such as the distance between grids 7 and 8 in Fig. 3).

Note that with this grid merging strategy, GMS for short, the three clusters shown in Fig. 3 can be recognized. As we will see in the experiment section, the strategy works effectively on real-life data.

5.2.4 Efficient Clustering

In density grid-based clustering algorithms, for a non-sparse grid g, all of its 2d directly adjacent grids, or all of its (3^d − 1) − 2d indirectly adjacent (diagonal neighbor) grids (shown in Fig. 3), must be traversed to form the temporary macro-clusters based on a grid merging strategy. Considering only the 2d directly adjacent grids results in low clustering accuracy. Moreover, for a high-dimensional data stream, the number of adjacent grids to be traversed (i.e., 3^d − 1) increases exponentially with d, which results in a huge computational cost. To achieve a good balance between computational efficiency and effectiveness for high-dimensional data streams, we developed a new Efficient Clustering Strategy, namely ECS, whose four steps are shown in Fig. 4.

ECS Framework
Step 1: According to the data volume |SVg| and the current computational resources, evenly split the set SVg into m subsets SVg1, SVg2, ..., SVgm;
Step 2: Each element of SVg1 starts merging its adjacent grids in parallel using the Grid Merging Strategy. The process iterates over SVg2, ..., SVgm, which results in the intermediate macro-clusters Ct1, Ct2, ..., Ctn;
Step 3: Integrate the temporary macro-clusters Ct1, Ct2, ..., Ctn in parallel through the component Result Integrator. If Cti ∩ Ctj ≠ ∅ (i ≠ j), Cti and Ctj are merged into a larger macro-cluster. The process continues until all pairs of Cti and Ctj from the intermediate macro-clusters Ct1, Ct2, ..., Ctn have been searched and merged;
Step 4: Output the clustering result set {Ci}.

Fig. 4. The framework of ECS.

Fig. 5 shows an example of ECS. Fig. 5(a) shows a 2D space evenly divided into grids, including both dense and transitional ones. Assume we evenly split the 24 grids into 3 groups (following Step 1 in Fig. 4), namely SVg1, SVg2 and SVg3 (they are not clusters). We then apply the GMS strategy from each grid in SVg1 following BFS in parallel, resulting in an MST (Minimum Spanning Tree), i.e., the temporary micro-cluster Ct1 in Fig. 5(b). Note that, in Step 2 of ECS, while traversing all 3^d − 1 neighbors of each grid, we require that each grid is traversed at most once. In this way, the search space gets smaller and smaller during traversal. In most cases, the search space may become empty before SVgm is reached, in which case this step intelligently performs an early stop. All these strategies save a significant amount of time in practice. We then immediately update the search space from SS1 to SS2 by eliminating the grids that have been merged. Afterwards, we carry on the merging process from each grid in SVg2, which leads to Ct2 and Ct3. Note that, as the grids {2, 3, 8, 13} have already been eliminated from the search space, there are only 2 grids in Ct2. After generating Ct2 and Ct3, all the grids have been merged, so the search space is updated to SS3 = ∅ and we do not need to perform merging from SVg3 anymore; the process stops here. Finally, we merge Ct1 and Ct2 into C1, as they share the common grid 9, following Step 3. The pipeline is shown in Fig. 5(c). The pseudo code of ECS is given in Alg. 2, corresponding to the ECS framework in Fig. 4.

[Fig. 5. An example of clustering grids efficiently: (a) a 2D space of 24 dense/transitional grids split into SVg1, SVg2 and SVg3; (b) parallel merging produces the temporary clusters Ct1, Ct2 and Ct3 while the search space shrinks from SS1 = {1, 2, 3, ..., 24} to SS3 = ∅; (c) integration yields the final clusters C1 = {1, 2, ..., 13} and C2 = {14, 15, ..., 24}.]
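A minimal Scala sketch of the GMS conditions (1)-(3) above follows; it assumes the pair of grids passed in is already known to be adjacent, and all names are illustrative.

```scala
object GmsSketch {

  // Chebyshev distance between two grid density centroids.
  def chebyshev(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => math.abs(x - y) }.max

  sealed trait Kind                  // sparse grids never participate in merging
  case object Dense extends Kind
  case object Transitional extends Kind

  final case class Grid(kind: Kind, density: Double, centroid: Array[Double])

  // True if two ADJACENT grids may be merged into one micro-cluster,
  // using the empirical thresholds of conditions (1)-(3).
  def mergeable(gi: Grid, gj: Grid, len: Double, du: Double): Boolean = {
    val d = chebyshev(gi.centroid, gj.centroid)
    (gi.kind, gj.kind) match {
      case (Dense, Dense)               => d <= 4.0 / 3.0 * len
      case (Dense, Transitional)
         | (Transitional, Dense)        => d <= 1.0 * len
      case (Transitional, Transitional) => d <= 2.0 / 3.0 * len &&
                                           gi.density + gj.density >= du
    }
  }
}
```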

5.3 The Complete ESA-Stream Algorithm

5.3.1 Algorithm Details

The pseudo code of our ESA-Stream algorithm is shown as Alg. 3.

Algorithm 2 ECS
Input: the non-sparse grid feature tuple set SVg;
Output: the clustering result set SC;
1: SVg1, ..., SVgm ← evenly divide SVg into m parts;
2: for each clustering thread Proc_i, 1 ≤ i ≤ m,
3: start
4:   SCti ← ∅;
5:   for each Vg ∈ SVgi do
6:     if Vg has been traversed then
7:       continue;
8:     else
9:       Ct ← cluster Vg with the GMS strategy;
10:    end if
11:    SCti ← SCti ∪ {Ct};
12:  end for
13:  send SCti to the main thread Proc_0;
14: end
15: the thread Proc_0 performs:
16: start
17:   SC ← SCt1 ∪ ... ∪ SCtm;
18:   for i ← 1 to |SC| do
19:     for j ← i to |SC| do
20:       if Cti ∩ Ctj ≠ ∅ then
21:         Cti ← Cti ∪ Ctj; Ctj ← ∅;
22:       end if
23:     end for
24:   end for
25: end
26: return SC;

Algorithm 3 ESA-Stream
Input: data stream X, decay factor λ, grid length len;
Output: the clustering result set SC;
1: t ← 0; Dl ← 0; Du ← 0; gap ← 0;
2: tg ← 10000; // tg is the next clustering analysis time
3: gridList ← ∅;
4: for each xi ∈ X do
5:   ID ← map xi into the grid g with the grid length len;
6:   calculate Vg;
7:   if ⟨ID, Vg⟩ ∉ gridList then
8:     D(g, t) ← 1; Dcen(g, t) ← xi;
9:     Vg ← (ID, D(g, t), Dcen(g, t), t);
10:    gridList ← gridList ∪ {⟨ID, Vg⟩};
11:  else
12:    update D(g, t) and Dcen(g, t) in Vg by Eqs. (3), (19);
13:  end if
14:  if t == tg then
15:    update Dl, Du and gap by Eqs. (4), (5) and (18);
16:    detect and remove sporadic grids from gridList;
17:    SVg ← extract all non-sparse grid feature tuples Vg in gridList to construct the grid feature tuple set;
18:    if X is high-dimensional (dimensionality ≥ 4), as judged by ESA-Stream's Data Stream Acceptor, then
19:      SVg′ ← call function GCB-PCA(SVg);
20:      SC ← call function ECS(SVg′);
21:    else
22:      SC ← call function ECS(SVg);
23:    end if
24:    tg ← tg + gap;
25:    output SC;
26:  end if
27:  t ← t + 1;
28: end for

Before processing the data stream X, ESA-Stream initializes the algorithm parameters (Lines 1-3). Because ESA-Stream does not accept any data points during the initialization phase, the total number of non-empty grids Mt is 0 and Eqs. (4), (5) and (18) are not yet applicable. Therefore, Dl, Du and gap are initialized to 0. It should be noted that the initial values of Dl, Du and gap do not affect the subsequent clustering results, because these parameters are self-adapted later (Line 15). tg is the clustering analysis moment and is initialized to 10,000. This means that ESA-Stream first clusters after 10,000 points, after which tg is advanced by the updated value of gap (Line 24). Moreover, in order to store the grid feature tuples of non-empty grids, ESA-Stream uses the HashMap object gridList, with ID as the key and Vg as the value.

After the initialization phase, ESA-Stream accepts each data point in the order of its arrival. A data point xi in the stream X is mapped into a grid g, and ESA-Stream determines the ID of the grid g using the grid length len (Lines 5-6, corresponding to Step 2 of Fig. 2). If ⟨ID, Vg⟩ exists in gridList, ESA-Stream updates the grid feature tuple; otherwise it initializes the grid feature tuple and inserts it into the grid list (Lines 7-13, corresponding to Step 2 of Fig. 2).

When time tg elapses, ESA-Stream updates Dl, Du and gap (Line 15, corresponding to Step 3 of Fig. 2). Then, ESA-Stream removes the sporadic grids and extracts the Vg data of all non-sparse grids in gridList to construct the grid feature tuple set SVg (Lines 16-17, corresponding to Step 4 of Fig. 2). If X is high-dimensional (dimensionality ≥ 4), the grid feature tuple set SVg′ is constructed after dimensionality reduction by calling the function GCB-PCA(SVg) (Line 19, corresponding to Step 5 of Fig. 2).

In the clustering analysis phase, ESA-Stream calls the function ECS(SVg′) to cluster the reduced-dimensional SVg′.



If X is not high-dimensional, ESA-Stream calls the function ECS(SVg) to cluster the original-dimensional SVg (Lines 20-22, corresponding to Steps 6-8 of Fig. 2). Finally, ESA-Stream outputs the clustering result set SC (Line 25, corresponding to Step 9 of Fig. 2).
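Putting the pieces together, the compact Scala skeleton below follows the control flow of Alg. 3. The helper functions are stubs standing in for the components sketched earlier, so this is an assumed outline rather than the authors' implementation.

```scala
object EsaStreamLoopSketch {

  type Point = Array[Double]
  type GridId = Vector[Int]
  final case class Feature(var density: Double, var centroid: Point, var tu: Long)

  def run(stream: Iterator[Point], lambda: Double, len: Double): Unit = {
    val gridList = scala.collection.mutable.HashMap.empty[GridId, Feature]
    var t = 0L
    var tg = 10000L                        // first clustering update after 10,000 points
    for (x <- stream) {
      val id = x.map(v => (v / len).toInt).toVector
      gridList.get(id) match {             // Lines 5-13 of Alg. 3
        case Some(f) => updateFeature(f, x, t, lambda)
        case None    => gridList(id) = Feature(1.0, x.clone, t)
      }
      if (t == tg) {                       // Lines 14-26 of Alg. 3
        val gap = learnParametersAndPrune(gridList, lambda)  // Eqs. (4), (5), (18)
        val centroids = gridList.values.map(_.centroid).toArray
        val input = if (x.length >= 4) gcbPca(centroids) else centroids
        output(ecs(input))
        tg += gap
      }
      t += 1
    }
  }

  // --- stubs standing in for the components described in Secs. 5.1-5.2 ---
  def updateFeature(f: Feature, x: Point, t: Long, lambda: Double): Unit = ()
  def learnParametersAndPrune(g: scala.collection.mutable.Map[GridId, Feature],
                              lambda: Double): Long = 10000L
  def gcbPca(c: Array[Point]): Array[Point] = c
  def ecs(c: Array[Point]): Seq[Set[Int]] = Seq.empty
  def output(clusters: Seq[Set[Int]]): Unit = println(s"${clusters.size} clusters")

  def main(args: Array[String]): Unit = {
    val rnd = new scala.util.Random(7)
    run(Iterator.fill(20000)(Array(rnd.nextDouble(), rnd.nextDouble())), 0.998, 0.075)
  }
}
```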

5.3.2 Time and Space Complexity

For each data point of the data stream X in the original d-dimensional data space, O(d) is needed to map the data point into its grid and to calculate/update the grid feature tuple. Therefore, the time complexity of reading the data stream is O(d|X|).

Before clustering the grid feature tuple data, ESA-Stream removes the noise grids, i.e., the sporadic grid feature tuples, and constructs the non-sparse grid feature tuple set SVg by traversing all non-empty grid feature tuples. Suppose the number of non-empty grids is Mti at time gap^(i); then the time complexity of constructing SVg is O(Mti).

If X is a high-dimensional data stream, in the time interval gap^(i) ESA-Stream calls the function GCB-PCA(SVg) to reduce its dimensionality. The main operations of GCB-PCA consist of using PCA to reduce the dimension of the grid density centroid of each grid feature tuple in SVg, with time O(d′Mti²), and constructing the grid feature tuple set SVg′ in the reduced d′-dimensional data space, with time O(Mti). Therefore, the time complexity of GCB-PCA is O(d′Mti²) + O(Mti). Note that several grid density centroids from the original d-dimensional space S may be mapped into the same grid in the reduced d′-dimensional data space S′ (d′ ≪ d). Thus, the size M′ti of SVg′ must be much less than the size Mti of SVg, i.e., M′ti ≪ Mti.

When gap^(i) arrives, ESA-Stream calls Alg. 2 to carry out stream clustering. As grid density centroid-based clustering is adopted, the clustering time complexity is O(dMti log(Mti)). In the reduced-dimensional space S′, the time complexity of Alg. 2 is O(d′M′ti log(M′ti)).

Overall, suppose the total number of gaps is Ngap (1 ≤ i ≤ Ngap). If the data stream X is high-dimensional, the total time complexity of ESA-Stream is O(d|X|) + Σ_{i=1}^{Ngap} O(Mti) + Σ_{i=1}^{Ngap} O(d′Mti²) + Σ_{i=1}^{Ngap} O(d′M′ti log(M′ti)). Otherwise, the total time complexity is O(d|X|) + Σ_{i=1}^{Ngap} O(Mti) + Σ_{i=1}^{Ngap} O(dMti log(Mti)).

Since theoretical analysis and experimental results show that M′ti ≪ Mti, d′ ≪ d, and Σ_{i=1}^{Ngap} Mti² ≪ |X|, both expressions reduce to approximately O(d|X|). We therefore conclude that the time complexity of ESA-Stream is O(d|X|), regardless of whether the dimensionality is reduced.

Similarly, the storage space for the grid feature tuples is O(Mt) at time gap^(i), the storage space of GCB-PCA is O(Mt) + O(M′t) = O(Mt) (M′t ≪ Mt), and the storage space of ECS is at most O(Mt). Therefore, the space complexity of ESA-Stream is O(Mt), whether the dimensionality has been reduced or not.

6 EXPERIMENTAL RESULTS

To evaluate the performance of ESA-Stream, we use the synthetic and real-world data streams shown in Table 2. All experiments are conducted on an HP ProDesk 480 G1 MT Mini Tower (Intel i5-4570, 4 cores/chip, 1 thread/core, 3.20 GHz, 8 GB RAM) running Windows 7. ESA-Stream (with 4 threads) and all baselines are implemented in Scala 2.11.

TABLE 2: The experimental datasets.

| | Data Streams | Instances | Attributes | Clusters |
|---|---|---|---|---|
| Synthetic | DB2 | 1,400,000 | 2 | 7 |
| | RBF5 | 1,000,000 | 5 | 8 |
| | RBF10 | 1,000,000 | 10 | 8 |
| | RBF20 | 1,000,000 | 20 | 8 |
| | RBF40 | 1,000,000 | 40 | 8 |
| Real-world | Traffics | 1,850,000 | 5 | 5 |
| | Forest Covertype | 581,012 | 54 | 7 |
| | KDD Cup 99 | 4,898,431 | 38 | 5 |

As shown in Table 2, we created 5 synthetic datasets. To verify the ability to recognize arbitrarily shaped clusters, filter outliers, and cope with concept drift, we created the synthetic dataset DB2, shown in Fig. 6. As in the evaluation of other high-dimensional data stream clustering algorithms, the synthetic datasets RBF5∼RBF40 are generated using the Radial Basis Function (RBF) generator of MOA (http://www.cs.waikato.ac.nz/abifet/MOA/API/) to ensure that clusters/groups in high-dimensional space can be generated. The RBF generator creates a user-given number of drifting concepts (or clusters) and the data points in each cluster according to user-specified standard deviations of a Gaussian distribution. The real-world traffic dataset records traffic in the city of Aarhus (http://www.odaa.dk/dataset/realtids-trafikdata): 449 traffic sensors are deployed in the city, producing new values every 5 minutes. This dataset covers a period of 2 months and contains several attributes, e.g., the number of cars, average speed, timestamp, and the 2-dimensional locations of cars. The datasets Forest Covertype and KDD Cup 99 (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) are commonly used for supervised learning; their class labels are adopted as the ground truth for evaluating our clustering results. All the datasets are normalized in each dimension.

All baselines are parameterized following their original papers, as shown in Table 3. GD-Stream [16], HDDStream [21] and D-Stream II [15] follow the online-offline framework with density grid-based clustering, while CEDAS [19] is an online clustering algorithm using DBSCAN. All four are state-of-the-art clustering algorithms for data streams [12]. Unlike the other baselines, HDDStream was designed to cluster high-dimensional streams. GD-Stream and D-Stream II parameterize gap as follows:

    gap = ⌊log_λ max(Cl/Cm, (N − Cm)/(N − Cl))⌋    (21)

where N is the total number of grids (see Table 3 for the other parameters). Note that (1) although this gap is computed, it is fixed and cannot adapt to evolving streams, and (2) for high-dimensional streams, N is huge, so gap = ⌊log_λ(1)⌋ = 0 according to Eq. (21). Therefore, for datasets with more than 5 dimensions, Eq. (21) cannot be applied, and we instead set gap = 10000 in GD-Stream and D-Stream II for these cases. In our ESA-Stream system, gap is computed as min(δ1, δ2) (see Eq. (18)), and the parameters Dl, Du and gap are all learned automatically from the data streams in an unsupervised, self-adaptive manner.



TABLE 3: Baseline algorithms and their parameters.

| Algorithm | #Param. | Values |
|---|---|---|
| CEDAS | 4 | Decay = 1000, Eps = 0.05, MinPts = 15, gap = 10000 |
| HDDStream | 7 | λ = 0.5, Eps = 0.2, β = 0.5, MinPts = 10, δ = 0.001, π = 10, gap: user-given |
| GD-Stream | 8 | λ = 0.998, len ∈ [0.02∼0.5], Cl = 0.8, Cm = 3.0, β = 0.3, Eps = 1.0, MinPts = 3, gap (Eq. (21)) |
| D-Stream II | 6 | λ = 0.998, len ∈ [0.02∼0.5], Cl = 0.8, Cm = 3.0, β = 0.3, gap (Eq. (21)) |
| ESA-Stream | 5 | λ = 0.998, len ∈ [0.02∼0.5], Dl (Eq. (4)), Du (Eq. (5)), gap (Eq. (18)) |

Evaluation metrics: Effectiveness is evaluated using purity, except for the Traffic dataset. Purity is defined as

    purity = (1/N) Σ_k max_j |Ck ∩ Lj|    (22)

where Lj is the ground-truth class assignment of the samples. As the Traffic dataset has no ground-truth class labels and no prior probability density, purity and the measure in [47] do not apply. Therefore, we use the popular SC (Silhouette Coefficient):

    SC = (1/N) Σ_{i=1}^{N} (b(xi) − a(xi)) / max{a(xi), b(xi)}    (23)

where a(xi) (resp., b(xi)) is the mean distance between xi and all other points in the same cluster (resp., in the next nearest cluster). Running time efficiency is measured in seconds.

6.1 Ability to Handle Evolving Data Streams

In the first set of experiments, on dataset DB2 with 70K outliers, we test all algorithms on their ability to capture the dynamic evolution of data clusters and to remove outliers. For this purpose, we generated 7 clusters (with their data points) of arbitrary shapes. The clusters randomly appear and disappear over time, which simulates concept drift [44], i.e., the dynamic evolution of data clusters. The speed of the data stream is 100K data points per second, with equal arrival intervals.

First, we examine ESA-Stream's clustering results in Fig. 7 at three different timestamps t1, t2 and t3, as well as the final clustering result shown in Fig. 6. The results clearly show that ESA-Stream responds and adapts in a timely manner to the dynamically evolving high-speed data stream in Fig. 7, that it achieves a high purity of 96.1% as shown in Table 4, and that it is effectively immune to outliers.

Second, we evaluate all baselines. The results are shown in Table 4 and Fig. 8. As the drifting concepts (or clusters) in the data stream occur over time, some clusters appear and others disappear. Since the baselines cannot learn their clustering parameters adaptively, they have a very weak ability to handle evolving data streams. The experiments in Table 4 and Fig. 8 clearly demonstrate that (1) with a constant gap (10000, 30000 or 50000), all baselines have low clustering purity, (2) the larger the constant gap, the lower the average clustering purity, and (3) using the computed gap, ESA-Stream achieves much higher purity thanks to its online parameter learning. Since ESA-Stream detects drifting concepts in a timely, dynamic fashion 83 times, its purity is 15%, 19% and 24% higher than the baselines with constant gaps of 10000, 30000 and 50000, respectively. These observations show that a gap with online-learning and self-adaptive ability is the key to detecting/capturing drifting concepts in data streams over time.

6.2 Effectiveness and Efficiency

For the second set of experiments, we report the clustering results of all algorithms with regard to both running time and effectiveness (measured in purity, or in SC for the Traffic dataset). In particular, we carried out clustering over all datasets in Table 2, where the speed of the data streams is set to 100K points/second and len = 0.075 for each dimension, except for len = 0.02 on DB2. From the experiments shown in Table 4, compared with the baselines, the purity is improved on average by 15.4 and 8.9 percentage points on the synthetic and real-world datasets, respectively, while the running time is reduced to roughly one ninth and one sixth of the baselines' on the synthetic and real-world datasets, respectively. Moreover, as the dimensionality and data volume increase, the running times of all algorithms except ESA-Stream increase dramatically, and their clustering purity or SC drops drastically. Although CEDAS was designed as a fully online clustering algorithm, the experiments show that it is incapable of handling high-speed and high-dimensional evolving data streams due to its fixed parameter settings and lack of dimensionality reduction.

To further compare the precision (purity) of ESA-Stream with the baselines, we conducted statistical tests (t-test, α = 0.05). The p-values between ESA-Stream and the four baselines CEDAS, HDDStream, GD-Stream and D-Stream II are 0.0002, 0.0084, 0.0015 and 0.0013, respectively, from which it can be clearly seen that ESA-Stream is significantly better than the baselines in effectiveness.



(a) Data distribution (b) Clustering result

Fig. 6. Data distribution of DB2 and the result of ESA-Stream.

(a) t1 = 25s (b) t2 = 50s (c) t3 = 75s

Fig. 7. The results of ESA-Stream for evolving data stream DB2 at three timestamps (each panel plots the 2-d data over the range [0, 50] on both axes).
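To make the DB2 setup above concrete, here is a hedged sketch of how such an evolving stream can be simulated. The 7 clusters, the 100K points/second rate, and the injected outliers follow Sec. 6.1; the centers, spreads, and the appear/disappear schedule are invented for illustration.

```python
# Speculative simulator for a DB2-like evolving stream; not the authors' generator.
import numpy as np

rng = np.random.default_rng(0)
centers = rng.uniform(0, 50, size=(7, 2))        # 7 cluster centers in [0, 50]^2

def chunk_at(t_sec: int, n: int = 100_000) -> np.ndarray:
    """One second of the stream; the set of 'alive' clusters changes with
    time, simulating clusters that appear and disappear (concept drift)."""
    alive = [c for i, c in enumerate(centers)
             if (t_sec // 25 + i) % 2 == 0]       # toy appear/disappear schedule
    idx = rng.integers(0, len(alive), size=n)
    pts = np.stack(alive)[idx] + rng.normal(0, 1.2, size=(n, 2))
    pts[: n // 100] = rng.uniform(0, 50, size=(n // 100, 2))  # ~1% outliers
    return pts

stream = (chunk_at(t) for t in range(100))        # 100 seconds of stream
```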

TABLE 4
Running time (T) and purity/SC (P) of algorithms. For the Traffic dataset, P is the SC value instead of purity.

Data Sets        | CEDAS           | HDDStream       | GD-Stream       | D-Stream II     | ESA-Stream
                 | T(s)     P(%)   | T(s)     P(%)   | T(s)     P(%)   | T(s)     P(%)   | T(s)     P(%)
DB2              |   26.01  78.4   |  429.50  79.3   |   19.35  82.7   |   21.47  83.5   |   16.02  96.1
RBF5             |  284.54  87.6   |  549.81  87.3   |  209.43  95.1   |  137.84  94.8   |   61.11  97.3
RBF10            |  369.21  86.9   |  943.60  59.9   |  996.45  88.6   |  271.56  86.7   |   92.23  96.4
RBF20            |  502.24  83.5   | 1942.50  57.2   | 2889.31  83.8   |  835.44  82.3   |  138.32  95.4
RBF40            | 1962.95  81.9   | 4350.00  53.6   | 5862.41  80.1   | 2771.56  79.2   |  215.23  94.8
Traffic (SC)     |  187.41  0.41   | 1431.21  0.42   |  155.52  0.43   |  133.87  0.43   |   86.54  0.49
Forest Covertype |  155.00  37.7   |  656.54  58.0   |  149.36  56.3   |   80.45  57.9   |   33.46  62.5
KDD Cup 99       |  247.00  77.5   |  297.15  96.5   |  336.00  93.3   |  350.37  91.5   |   53.62  97.5
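The significance test of Sec. 6.2 can be approximated directly from Table 4. The sketch below assumes a paired two-sided t-test over the per-dataset purity columns (Traffic excluded, since it is scored by SC), so the resulting p-values need not match the reported ones exactly.

```python
# Paired t-test over Table 4's purity columns; an approximation of Sec. 6.2's test.
from scipy.stats import ttest_rel

esa = [96.1, 97.3, 96.4, 95.4, 94.8, 62.5, 97.5]          # ESA-Stream purities
baselines = {
    "CEDAS":       [78.4, 87.6, 86.9, 83.5, 81.9, 37.7, 77.5],
    "HDDStream":   [79.3, 87.3, 59.9, 57.2, 53.6, 58.0, 96.5],
    "GD-Stream":   [82.7, 95.1, 88.6, 83.8, 80.1, 56.3, 93.3],
    "D-Stream II": [83.5, 94.8, 86.7, 82.3, 79.2, 57.9, 91.5],
}
for name, purities in baselines.items():
    stat, p = ttest_rel(esa, purities)                    # paired, two-sided
    print(f"{name}: p = {p:.4f} ({'significant' if p < 0.05 else 'n.s.'})")
```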

(a) ESA-Stream’s purities (b) Baselines’ purities

Fig. 8. Purities of ESA-Stream and baselines with different gap models for evolving data stream DB2.



(a) Running time (b) Purity

Fig. 9. Running time and purity of algorithms with various dimensionalities, on the 5-d to 40-d data streams RBF5 to RBF40.

(a) Running time (b) Purity

Fig. 10. Running time and purity of algorithms with various values of the grid granularity len on the KDD Cup 99 data stream.
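Returning to the dimensionality discussion: the per-gap PCA step that Sec. 6.3 credits for ESA-Stream's scalability might look like the sketch below, which is speculative and not the authors' implementation. PCA is fit on the small matrix of grid density centroids at the end of a gap, and the learned projection is applied to points arriving during the next gap.

```python
# Speculative sketch of per-gap dimensionality reduction on grid centroids.
import numpy as np
from sklearn.decomposition import PCA

def end_of_gap_projection(centroids: np.ndarray, target_dim: int) -> PCA:
    """Fit PCA on grid density centroids (n_grids x d); cheap because the
    number of non-empty grids is far smaller than the number of points."""
    return PCA(n_components=target_dim).fit(centroids)

centroids = np.random.rand(500, 40)        # e.g., 500 non-empty grids in 40-d
pca = end_of_gap_projection(centroids, target_dim=5)

batch = np.random.rand(100_000, 40)        # one second of a 40-d stream
reduced = pca.transform(batch)             # 100K x 5, gridded/clustered next
```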

However, the results in Fig. 9 show that, without dimensionality reduction, these baselines cannot cope with the "curse-of-dimensionality", leading to poor purity. Intuitively, one trivial approach to the problem is to simply apply dimensionality reduction, e.g., PCA, before running CEDAS, GD-Stream and D-Stream II. We argue, however, that this strategy is impractical for the following reasons. Firstly, PCA can learn the principal components only when all the data samples are present; it cannot work in a streaming scenario where new data samples keep arriving dynamically. Secondly, as a consequence, the only way to apply PCA would be to rerun it over all the samples in each gap, which introduces a large cost and cannot satisfy the requirements of stream processing. Although HDDStream is designed to perform dimensionality reduction with its subspace projection method, that method is very time-consuming and ill-suited to high-speed streams. Our grid-density-centroid-based method enables PCA to reduce the data stream's dimensionality effectively and efficiently even on rapidly evolving streams.

6.4 Parametric Sensitivity

Since ESA-Stream adopts the commonly used density-grid-based clustering idea, the parameter len (grid granularity) needs to be preset, as in other algorithms. Existing algorithms are quite sensitive to len: their running time on a high-dimensional stream increases dramatically as len decreases. However, thanks to our dimensionality reduction and grid merging methods, ESA-Stream is not very sensitive to variations of len when len is not too large (e.g., from 0.05 to 0.25), as shown in Fig. 10. Note that CEDAS and HDDStream have no len parameter, as they are not density-grid-based algorithms, and are therefore not compared here.

All the experiments showed that ESA-Stream is far superior to all baselines in clustering efficiency and quality.

7 CONCLUSION

Although many stream clustering algorithms have been proposed, there are still major gaps between existing solutions and practical requirements. To bridge these gaps, we presented a real-time, online, and light-weight clustering framework, ESA-Stream, for high-speed and high-dimensional evolving data streams. The framework is built upon a series of new ideas designed to address specific challenges: self-learning or self-adaptation of parameter settings in the online dynamic environment, and efficient dimensionality reduction and clustering based on grid density centroids. ESA-Stream has several desirable features: it is fully real-time and online, it learns its parameters online in a dynamic and self-adaptive manner, and it is light-weight and highly efficient. Empirical evaluations showed that ESA-Stream outperforms state-of-the-art data stream clustering algorithms in both efficiency and result quality.

Although ESA-Stream has demonstrated its effectiveness in clustering real-time and high-dimensional evolving data streams, there are still some gaps for ESA-Stream to become a truly intelligent machine learning framework. To close these gaps, embedding a lifelong learning mechanism [48] into ESA-Stream is a promising future research direction.



ACKNOWLEDGEMENTS

The work of Y. Li, H. Li, Z. Wang, J. Cui and H. Fei was partly supported by the National Natural Science Foundation of China (Grant No. 61472296, 61672408), the Fundamental Research Funds for the Central Universities (No. JB181505), the Natural Science Basic Research Plan in Shaanxi Province of China (No. 2018JM6073) and the China 111 Project (No. B16037).

REFERENCES

[1] A. Amini, Y. W. Teh, and H. Saboohi, "On density-based data streams clustering algorithms: A survey," Journal of Computer Science and Technology, vol. 29, no. 1, pp. 116–141, 2014.
[2] J. P. Barddal, H. M. Gomes, and F. Enembreck, "SNCStream: A social network-based data stream clustering algorithm," in Proceedings of the 30th Annual ACM Symposium on Applied Computing, Salamanca, Spain, 2015, pp. 935–940.
[3] W. Fan and A. Bifet, "Mining big data: Current status, and forecast to the future," SIGKDD Explorations, vol. 14, no. 2, pp. 1–5, 2012.
[4] H. Liu, J. Han, F. Nie, and X. Li, "Balanced clustering with least square regression," in Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, California, USA, 2017, pp. 2231–2237.
[5] A. K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.
[6] K. J. Ahn, G. Cormode, S. Guha, A. McGregor, and A. Wirth, "Correlation clustering in data streams," in Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Lille, France, 2015, pp. 2237–2246.
[7] L. Zheng, H. Huo, Y. Guo, and T. Fang, "Supervised adaptive incremental clustering for data stream of chunks," Neurocomputing, vol. 219, pp. 502–517, 2017.
[8] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A framework for clustering evolving data streams," in Proceedings of the 29th International Conference on Very Large Data Bases (VLDB 2003), Berlin, Germany, 2003, pp. 81–92.
[9] M. Ester, H. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, USA, 1996, pp. 226–231.
[10] G. Krempl, I. Zliobaite, D. Brzezinski, E. Hüllermeier, M. Last, V. Lemaire, T. Noack, A. Shaker, S. Sievi, M. Spiliopoulou, and J. Stefanowski, "Open challenges for data stream mining research," SIGKDD Explorations, vol. 16, no. 1, pp. 1–10, 2014.
[11] J. D. A. Silva, E. R. Faria, R. C. Barros, E. R. Hruschka, A. C. P. L. F. D. Carvalho, and J. Gama, "Data stream clustering: A survey," ACM Computing Surveys, vol. 46, no. 1, pp. 13:1–13:31, 2013.
[12] M. Ghesmoune, M. Lebbah, and H. Azzag, "State-of-the-art on clustering data streams," Big Data Analytics, vol. 1, no. 1, pp. 1–13, 2016.
[13] M. Hahsler, M. Bolaños, J. Forrest et al., "Introduction to stream: An extensible framework for data stream clustering research with R," Journal of Statistical Software, vol. 76, no. 14, pp. 1–50, 2017.
[14] Y. Chen and L. Tu, "Density-based clustering for real-time stream data," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, 2007, pp. 133–142.
[15] L. Tu and Y. Chen, "Stream data clustering based on grid density and attraction," ACM Transactions on Knowledge Discovery from Data, vol. 3, no. 3, pp. 12:1–12:27, 2009.
[16] Q. Qian, C. Xiao, and R. Zhang, "Grid-based data stream clustering for intrusion detection," International Journal of Network Security, vol. 15, no. 1, pp. 1–8, 2013.
[17] P. Karunaratne, S. Karunasekera, and A. Harwood, "Distributed stream clustering using micro-clusters on Apache Storm," Journal of Parallel and Distributed Computing, vol. 108, pp. 74–84, 2017.
[18] A. Bifet, S. Maniu, J. Qian, G. Tian, C. He, and W. Fan, "StreamDM: Advanced data mining in Spark Streaming," in 2015 IEEE International Conference on Data Mining Workshop (ICDMW), 2015, pp. 1608–1611.
[19] R. Hyde, P. Angelov, and A. R. MacKenzie, "Fully online clustering of evolving data streams into arbitrarily shaped clusters," Information Sciences, vol. 382-383, pp. 96–114, 2017.
[20] M. Hahsler and M. Bolaños, "Clustering data streams based on shared density between micro-clusters," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 6, pp. 1449–1461, 2016.
[21] I. Ntoutsi, A. Zimek, T. Palpanas, P. Kröger, and H. Kriegel, "Density-based projected clustering over high dimensional data streams," in Proceedings of the 12th SIAM International Conference on Data Mining, Anaheim, California, USA, 2012, pp. 987–998.
[22] J. Ren, B. Cai, and C. Hu, "Clustering over data streams based on grid density and index tree," Journal of Convergence, vol. 6, no. 1, pp. 83–93, 2011.
[23] D. Puschmann, P. M. Barnaghi, and R. Tafazolli, "Adaptive clustering for dynamic IoT data streams," IEEE Internet of Things Journal, vol. 4, no. 1, pp. 64–74, 2017.
[24] V. Braverman, G. Frahling, H. Lang, C. Sohler, and L. F. Yang, "Clustering high dimensional dynamic data streams," in Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, NSW, Australia, 2017, pp. 576–585.
[25] O. Backhoff and E. Ntoutsi, "Scalable online-offline stream clustering in Apache Spark," in 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), 2016, pp. 37–44.
[26] Z. Song, L. F. Yang, and P. Zhong, "Sensitivity sampling over dynamic geometric data streams with applications to k-clustering," CoRR, vol. abs/1802.00459, 2018.
[27] J. Sui, Z. Liu, A. Jung, L. Liu, and X. Li, "Dynamic clustering scheme for evolving data streams based on improved STRAP," IEEE Access, vol. 6, pp. 46157–46166, 2018.
[28] M. Carnein and H. Trautmann, "evoStream - evolutionary stream clustering utilizing idle times," Big Data Research, vol. 14, pp. 101–111, 2018.
[29] P. L. Cândido, M. C. Naldi, J. A. Silva, and E. R. Faria, "Scalable data stream clustering with k estimation," in 2017 Brazilian Conference on Intelligent Systems (BRACIS), 2017, pp. 336–341.
[30] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A framework for clustering evolving data streams," in Proceedings of the 29th International Conference on Very Large Data Bases (VLDB 2003), Berlin, Germany, 2003, pp. 81–92.
[31] Y. Chen and L. Tu, "Density-based clustering for real-time stream data," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, 2007, pp. 133–142.
[32] J. Costa, C. Silva, M. Antunes, and B. Ribeiro, "Concept drift awareness in Twitter streams," in 13th International Conference on Machine Learning and Applications (ICMLA 2014), Detroit, MI, USA, 2014, pp. 294–299.
[33] Z. Miller, B. Dickinson, W. Deitrick, W. Hu, and A. H. Wang, "Twitter spammer detection using data stream clustering," Information Sciences, vol. 260, pp. 64–73, 2014.
[34] D. Kifer, S. Ben-David, and J. Gehrke, "Detecting change in data streams," in Proceedings of the 30th International Conference on Very Large Data Bases, 2004, pp. 180–191.
[35] G. Boracchi, D. Carrera, C. Cervellera, and D. Macciò, "QuantTree: Histograms for change detection in multivariate data streams," in Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018, pp. 638–647.
[36] P. Huang, X. Li, and B. Yuan, "A parallel GPU-based approach to clustering very fast data streams," in Proceedings of the 24th ACM International Conference on Information and Knowledge Management (CIKM 2015), Melbourne, VIC, Australia, 2015, pp. 23–32.
[37] P. P. Rodrigues and J. Gama, "Distributed clustering of ubiquitous data streams," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 1, pp. 38–54, 2014.
[38] A. Hinneburg and D. A. Keim, "Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering," in Proceedings of the International Conference on Very Large Data Bases, 1999, pp. 506–517.
[39] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A framework for projected clustering of high dimensional data streams," in Proceedings of the 30th International Conference on Very Large Data Bases, Toronto, Canada, 2004, pp. 852–863.
[40] D. Barbará, "Requirements for clustering data streams," ACM SIGKDD Explorations Newsletter, vol. 3, no. 2, pp. 23–27, 2002.



[41] M. Khalilian and N. Mustapha, "Data stream clustering: Challenges and issues," Lecture Notes in Engineering and Computer Science, vol. 2180, no. 1, 2010.
[42] S. Ding, F. Wu, J. Qian, H. Jia, and F. Jin, "Research on data stream clustering algorithms," Artificial Intelligence Review, vol. 43, no. 4, pp. 593–600, 2015.
[43] T. V. Marinov, P. Mianjy, and R. Arora, "Streaming principal component analysis in noisy settings," in Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018, pp. 3410–3419.
[44] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, "A survey on concept drift adaptation," ACM Computing Surveys, vol. 46, no. 4, pp. 1–37, 2014.
[45] D. Halliday, R. Resnick, and J. Walker, Fundamentals of Physics. John Wiley & Sons, 2013.
[46] H. Jin, B. C. Ooi, H. T. Shen, and C. Yu, "An adaptive and efficient dimensionality reduction algorithm for high-dimensional indexing," in Proceedings of the International Conference on Data Engineering, 2003, pp. 87–98.
[47] G. Menardi, "Density-based silhouette diagnostics for clustering methods," Statistics and Computing, vol. 21, no. 3, pp. 295–308, 2011.
[48] Z. Chen and B. Liu, Lifelong Machine Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2018.

Bing Liu is a chair professor at Peking University (on leave from the University of Illinois at Chicago). He received his Ph.D. in Artificial Intelligence from the University of Edinburgh. His research interests include lifelong machine learning, sentiment analysis and opinion mining, data mining, machine learning, and natural language processing. He has published extensively in top conferences and journals in these areas. Three of his papers have received 10-year Test-of-Time awards from KDD, the premier conference of data mining and data science. He has also authored three books: one on Web data mining and two on sentiment analysis. Some of his work has been widely reported in the press, including a front-page article in the New York Times. In professional service, he is the current Chair of ACM SIGKDD. He has served as program chair of many leading data mining conferences, including KDD, ICDM, CIKM, WSDM, SDM, and PAKDD, as associate editor of leading journals such as TKDE, TWEB, and DMKD, and as an area chair or senior PC member of numerous natural language processing, AI, Web research, and data mining conferences. He is an ACM Fellow, an AAAI Fellow, and an IEEE Fellow.

Yanni Li is a professor in the School of Computer Science and Technology of Xidian University. She received her M.S. and Ph.D. degrees in Computer Science and Technology from Xidian University, Xi'an, China. Her current research interests are big data analysis, data mining, machine learning, large-scale combinatorial optimization, etc. In the past five years, she has published more than 10 first-author papers in top academic conferences and journals.

Jiangtao Cui received the M.S. and Ph.D. degrees, both in Computer Science, from Xidian University, Xi'an, China, in 2001 and 2005, respectively. Between 2007 and 2008, he was with the Data and Knowledge Engineering group at the University of Queensland, Australia, working on high-dimensional indexing for large-scale image retrieval. He is currently a professor in the School of Computer Science and Technology, Xidian University, China. His current research interests include data and knowledge engineering, data security, and high-dimensional indexing.

Hui Li received the B.Eng. degree from the Harbin Institute of Technology in 2005 and the Ph.D. degree from Nanyang Technological University, Singapore, in 2012. He is currently a Professor with the School of Cyber Engineering, Xidian University, China. His research interests include data mining, knowledge management and discovery, privacy-preserving query, and big data analysis. He was nominated for the Best Paper Award at SIGMOD 2015.

Hang Fei received the Bachelor of Science degree in Computer Science from the School of Computer Science and Technology, Xidian University, Xi'an, China, in 2018. He is currently a postgraduate student in the School of Computer Science and Technology, Xidian University. His current research interests include big data analysis and machine learning.

Zhi Wang received the M.E. degree in Computer Technology from Xidian University, Xi'an, China, in 2018. He is currently a Ph.D. student in the School of Computer Science and Technology, Xidian University, China. His current research interests include data mining, machine learning, and large-scale combinatorial optimization.
