2020 International Conference on Computational Science and Computational Intelligence (CSCI)

An Efficient Local Outlier Factor for Data Stream Processing: A Case Study

Omar Alghushairy
Department of Computer Science, University of Idaho, Moscow, ID, USA
College of Computer Science and Engineering, University of Jeddah, Jeddah, 23890, Saudi Arabia
[email protected]

Raed Alsini
Department of Computer Science, University of Idaho, Moscow, ID, USA
Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, 21589, Saudi Arabia
[email protected]

Xiaogang Ma
Department of Computer Science, University of Idaho, Moscow, ID, USA
[email protected]

Abstract— In the field of machine learning and data mining, outlier detection is considered a significant procedure for many applications, such as fraud detection in bank transactions and decision support systems. Data streams are a major player in the big data era. Currently, data streams are generated from various sources with huge amounts of data. This has led to difficulty when using older algorithms, which are designed for static data. The Local Outlier Factor (LOF) is one of these algorithms. The most challenging issue of the LOF is that it needs to preserve the whole dataset in computer memory. A new LOF that can deal with a data stream in limited memory is needed. This paper is a case study of several benchmark datasets for outlier detection, aiming to increase the efficiency and accuracy of local outlier detection in data streams.

Keywords—local outlier factor, stream data mining, genetic algorithm, outlier detection (short paper)
Symposium on Computational Intelligence (CSCI-ISCI)

I. INTRODUCTION

Outlier detection attempts to distinguish a data point that is distinct from the rest of the given data. Outliers occur during a procedure or as a consequence of an error of measurement [1]. By detecting outliers, essential information can be obtained to make better decisions in various applications, such as fraudulent credit card transactions and intrusion detection [3]. Outlier detection techniques have been widely used in machine learning and data mining to extract information and to clean data, for example, in various domains for the purposes of decision-making, clustering, classification, and identifying frequent patterns [2, 4]. One of the popular algorithms used for outlier detection is the Local Outlier Factor (LOF). The LOF algorithm is a density-based outlier detection technique used to evaluate outliers among multi-density data points. Despite the LOF's success in identifying local outliers, it cannot work in stream environments because it requires retaining the whole dataset in memory.

This paper describes the main challenges and methodologies of processing the LOF in the stream environment. Data streams produce massive data that continually expand at great velocity. The data cannot be processed entirely in computer memory because the data keeps increasing [1]. The paper has five remaining parts: the second is Background, the third is Problem Definition, the fourth is A Proposed New Development, the fifth is Datasets, and the sixth is Discussion and Conclusion.

II. BACKGROUND

A. Big Data

Big data is a set of data of a size that exceeds the ability of regular databases to process, store, transfer, manage, share, and analyze within an acceptable period of time [5]. Big data comes in three different modes: structured data, semi-structured data, and unstructured data [6]. The most important type of big data is the data stream, which has the characteristics of volume, velocity, variety, value, and variability. Therefore, it is not possible to process big data by traditional methods, and there is currently an urgent need to develop new algorithms for processing and managing big data.

B. Data Stream

A data stream is a collection of continuous data processed to collect knowledge and extract information [7]. Data streams represent big data as primary sources, with various applications and different properties such as volume, velocity, variety, value, and veracity [8, 9]. Volume refers to the large volume of data assembled and analyzed. Velocity involves the pace at which data are generated and transported between various systems and devices. Variety applies to the multiple types of data that may be used to obtain the required knowledge or performance; it involves the data modes of structured, unstructured, and semi-structured data [6, 9]. Value refers to the advantages of extracting information from the big data. Lastly, veracity involves the quality of the data in terms of precision, integrity, confidence, protection, and reliability. Due to the nature of the data stream with regard to these five significant data properties, data stream processing requires various methods to evaluate the data points in the data stream environment.

C. The Local Outlier Factor (LOF)

The LOF is a popular algorithm for outlier detection and is considered the foremost algorithm for local outlier detection in static environments. The LOF aims to calculate a score for all data points to determine how close a data point is to being an outlier. Figure 1 illustrates the key definitions for calculating the LOF score for a data point p [10].

D. The Incremental Local Outlier Factor (ILOF)

The ILOF solves the issue of the LOF in a data stream and is able to calculate the LOF score in a stream environment. The goal of the ILOF algorithm is to detect local outliers in data streams [11].
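The density-ratio score that Figure 1's definitions (k-distance, reachability distance, local reachability density) produce can be sketched in plain Python. This is a toy, static implementation for illustration only, not the paper's code; the cluster-plus-outlier data at the end is a made-up example.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i], excluding itself."""
    order = sorted(range(len(points)), key=lambda j: math.dist(points[i], points[j]))
    return [j for j in order if j != i][:k]

def k_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor."""
    return math.dist(points[i], points[knn(points, i, k)[-1]])

def reach_dist(points, i, j, k):
    """Reachability distance of i w.r.t. j: max(k-distance(j), d(i, j))."""
    return max(k_distance(points, j, k), math.dist(points[i], points[j]))

def lrd(points, i, k):
    """Local reachability density: inverse mean reachability distance to neighbors."""
    nn = knn(points, i, k)
    return len(nn) / sum(reach_dist(points, i, j, k) for j in nn)

def lof(points, i, k):
    """LOF score: mean ratio of the neighbors' lrd to the point's own lrd."""
    nn = knn(points, i, k)
    return sum(lrd(points, j, k) for j in nn) / (len(nn) * lrd(points, i, k))

# A tight cluster plus one distant point: cluster scores sit near 1,
# while the distant point receives a much larger LOF score.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = [lof(data, i, k=3) for i in range(len(data))]
```

A score near 1 means the point is as dense as its neighborhood; scores well above 1 mark local outliers.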

978-1-7281-7624-6/20/$31.00 ©2020 IEEE  1525  DOI 10.1109/CSCI51800.2020.00282

The main task in the ILOF is to calculate and update the LOF score when a new incoming data point (np) is inserted. However, the major issue for the ILOF is that it needs to retain all data points in memory, which leads to large memory usage and long computational time.

E. The Density Summarization Incremental Local Outlier Factor (DILOF)

The DILOF is an algorithm that was developed to overcome the vulnerability of the ILOF. The DILOF algorithm has two steps: the summarization step and the detection step. The task of the summarization step is to summarize the old half of the data points in the current window by using the gradient-descent method. An issue with gradient descent is that it might get stuck in local minima. The task of the detection step is to detect the outliers and update the data points [12].

Fig. 1. The key definitions of LOF.

III. PROBLEM DEFINITION

We are currently in the big data era, and the most significant type of big data is the data stream. With the increasing need to analyze and process high-velocity streaming data, it has become difficult to use traditional local outlier detection algorithms effectively. The main challenge of the LOF is that it needs the whole dataset, and the distance values between all data points, to be stored in memory. In addition, the LOF must be recalculated from the beginning whenever any alteration occurs in the dataset. The ILOF and DILOF algorithms address the LOF issues in data streams, but they also have some issues that limit their performance, as mentioned above.

IV. A PROPOSED NEW DEVELOPMENT

To improve the efficiency of the local outlier factor in data streams and overcome the limitations of the DILOF algorithm, we propose a new algorithm called the Genetic-based Incremental Local Outlier Factor (GILOF). The GILOF is based on the Genetic Algorithm (GA). The Genetic Algorithm is a well-known heuristic search algorithm in evolutionary computation [15]. GA was designed to solve complex problems by using populations that contain a set of chromosomes and evaluating them with a fitness function [16]. After this step, it applies its operators, such as crossover and mutation, to find optimal solutions [17]. Therefore, GA performs better than the simple gradient-descent method when searching complex spaces that include many local minima. This is because the genetic algorithm is a population-based search technique whose crossover and mutation operators search the space more widely.

The GILOF algorithm finds the LOF score in the data stream in two steps: the detection step and the summarization step [13]. In the detection step, both the LOF and the ILOF are applied with a skipping scheme. In the summarization step, the genetic density summarization (GDS) is applied; the next subsection describes the GDS method. The GILOF algorithm begins by specifying the window size W for data points. The threshold value θ is employed to distinguish outliers according to the LOF threshold. The GILOF keeps detecting outliers and measuring LOF scores until the current window reaches the determined window size. Then the GDS function is used in the summarization step: the GDS summarizes 50%W of the old data points in the window, and the GILOF selects 25%W of data points to represent that 50%W of old data points, which are then deleted from the window; the selected 25%W of data points are combined with the remaining 50%W in the window. Figure 2 illustrates how the GILOF algorithm works in a data stream¹. For more details, refer to [13].

Fig. 2. GILOF process for a data stream in two dimensions from time T0 to Tcurrent [13].

A. Genetic Density Summarization (GDS)

The GDS algorithm aims to summarize the old data points in the window. The GDS performs this process by applying the genetic algorithm (GA) to minimize the difference in density between the 50% of old data points and the selected 25% of data points. GA is hypothesized to be better than the gradient-descent method because it is able to discover near-optimal solutions while skipping local minima.

The GDS operates as follows: first, a population is created, which includes individuals that have chromosomes; then, the fitness (objective) function evaluates the chromosomes; next, the GDS applies selection for each generation; after that, the crossover process is used; following this, the GDS applies the mutation process (the GDS retains the optimal results after finishing all the GA operations); and, lastly, the selected chromosomes are converted into a binary domain, where the selected 25% of data points are set to 1 and the rest to 0. For more detail, refer to [13]. Figure 3 illustrates the steps of summarization in the GDS.

1 https://www.youtube.com/watch?v=YY-lHhhe2Ew
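The GDS loop described above (population, fitness evaluation, selection, crossover, mutation, binary encoding of the kept points) can be sketched as follows. This is a toy reconstruction under stated assumptions: the fitness here is a simple one-dimensional density-difference proxy rather than the paper's density measure, and the operator choices (elitism, one-point crossover with repair, swap mutation) are illustrative, not the authors' implementation.

```python
import random

def density(points, subset, k=3):
    """Mean distance from each subset point to its k nearest points
    (a simple density proxy, not the paper's measure)."""
    total = 0.0
    for p in subset:
        dists = sorted(abs(p - q) for q in points if q != p)
        total += sum(dists[:k]) / k
    return total / len(subset)

def fitness(old_points, chrom):
    """Density difference between the kept subset and all old points (lower is better)."""
    subset = [p for p, bit in zip(old_points, chrom) if bit]
    return abs(density(old_points, subset) - density(old_points, old_points))

def gds_like(old_points, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    n = len(old_points)
    target = n // 2  # keep 25%W out of the 50%W old points

    def random_chrom():
        bits = [1] * target + [0] * (n - target)
        rng.shuffle(bits)
        return bits

    def repair(chrom):
        # Restore the fixed subset size after crossover.
        ones = [i for i, b in enumerate(chrom) if b]
        zeros = [i for i, b in enumerate(chrom) if not b]
        while len(ones) > target:
            chrom[ones.pop(rng.randrange(len(ones)))] = 0
        while len(ones) < target:
            i = zeros.pop(rng.randrange(len(zeros)))
            chrom[i] = 1
            ones.append(i)
        return chrom

    pop = [random_chrom() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(old_points, c))
        nxt = pop[:2]                                   # elitism: keep the two fittest
        while len(nxt) < pop_size:
            a, b = rng.sample(pop[: pop_size // 2], 2)  # select from the fitter half
            cut = rng.randrange(1, n)
            child = repair(a[:cut] + b[cut:])           # one-point crossover + repair
            if rng.random() < 0.3:                      # mutation: swap a kept/dropped pair
                i = rng.choice([x for x, bit in enumerate(child) if bit])
                j = rng.choice([x for x, bit in enumerate(child) if not bit])
                child[i], child[j] = 0, 1
            nxt.append(child)
        pop = nxt
    best = min(pop, key=lambda c: fitness(old_points, c))
    return [p for p, bit in zip(old_points, best) if bit]

old = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0, 9.3]
kept = gds_like(old)  # half of the old points, chosen to match their density
```

The binary chromosome directly mirrors the paper's encoding: bit 1 marks a data point selected into the 25%W summary, bit 0 marks a deleted point.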

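Putting the pieces together, the windowed GILOF flow (score each arriving point against the current window, flag it when the score exceeds the threshold θ, and summarize the oldest 50%W down to 25%W once the window fills) can be sketched as below. The `score` and `summarize` helpers are simplified stand-ins, assumed for illustration: they replace the LOF/ILOF scoring and the GDS, and the threshold value is arbitrary.

```python
import math
import random

def score(window, p, k=3):
    """Stand-in outlier score: mean distance to the k nearest window points."""
    d = sorted(math.dist(p, q) for q in window if q != p)
    if not d:
        return 0.0
    d = d[:k]
    return sum(d) / len(d)

def summarize(old_half, rng):
    """Stand-in for the GDS: keep a random half of the old points (25%W)."""
    return rng.sample(old_half, len(old_half) // 2)

def gilof_like_stream(stream, W=8, theta=2.0, seed=0):
    rng = random.Random(seed)
    window, outliers = [], []
    for p in stream:
        window.append(p)
        if score(window, p) > theta:            # detection step
            outliers.append(p)
        if len(window) >= W:                    # summarization once the window fills
            old, recent = window[: W // 2], window[W // 2 :]
            # 25%W summary of the old half rejoins the remaining 50%W
            window = summarize(old, rng) + recent
    return outliers

points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.1, 0.0),
          (9.0, 9.0),                           # an obvious outlier
          (0.0, 0.1), (0.2, 0.2), (0.1, 0.2), (0.0, 0.2)]
found = gilof_like_stream(points)
```

The point of the sketch is the memory bound: the window never holds more than W points, which is exactly the property the LOF lacks in stream environments.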
B. Local Outlier Factor by Reachability Distance (LOFR)

The LOFR is similar to the LOF, except that the LOF uses the local reachability density while the LOFR does not [14]. The goal of the LOFR calculation method is to reduce the outlierness score in order to provide better accuracy of outlier detection. The score of the LOFR depends on the reachability distance (rd) of the data point p and its nearest neighbors: the rd of data point p is divided by the average rd of its neighbors. The LOFR is calculated by the following equation:

LOFR_k(p) = rd_k(p) / ( (1 / |N_k(p)|) * Σ_{o ∈ N_k(p)} rd_k(o) )

where N_k(p) denotes the k nearest neighbors of p and rd_k is the reachability distance.

Fig. 3. GDS framework [14].

V. DATASETS

In unsupervised outlier detection, label information is not used during processing and analysis; it is used only for comparison and evaluation. When a new outlier detection algorithm is developed, it is usual to apply it to publicly available datasets and compare its results with those of common unsupervised outlier detection algorithms, such as the LOF. Many classification datasets are fully available in the UCI machine learning repository [18], and some outlier detection datasets are provided in [19]. The datasets below are real-world datasets that contain outlier data points. These benchmark datasets are used in this study; they were used to analyze the new GILOF and GILOFR algorithms and to compare them with existing algorithms. Table I summarizes the features of these datasets.

A. UCI Vowels Dataset

The Vowels dataset is considered a multivariate time series dataset as well as a classification dataset, which classifies speakers. In one particular case, nine speakers spoke two Japanese vowels, respectively. One speech by a speaker forms a time series of length 7 to 29, and each point in the time series consists of twelve characteristics. In outlier (anomaly) detection, any frame in the training dataset is treated as a single data point, although the UCI machine learning repository deems a block of frames (talk) a single point. Furthermore, classes six, seven, and eight are considered inliers. The dataset contains 12 dimensions with 1,456 data points, and 3.4% of these data points are outliers [18, 19].

B. UCI Pendigits Dataset

The Pendigits dataset is originally from the UCI machine learning repository [18]. This dataset is a multiclass classification dataset that has 16 dimensions with 10 classes. The Pendigits dataset consists of 250 samples written by each of 44 writers. Thirty of the writers' samples are used for training, while the other 14 writers' samples are used for testing. The original training set contains 7,494 data points, and the testing set has 3,498 data points [19].

TABLE I. THE REAL-WORLD DATASETS' FEATURES

Dataset          | Data points | Dimensions | Classes
UCI Vowels       | 1,456       | 12         | 11
UCI Pendigits    | 3,498       | 16         | 10
KDD CUP99 SMTP   | 95,156      | 3          | Unknown
KDD CUP99 HTTP   | 567,479     | 3          | Unknown

C. KDD CUP99 HTTP Dataset

KDD CUP 1999 is the original dataset from the UCI machine learning repository [18]. This dataset contains 41 attributes, but it is reduced to 4 (service, dst_bytes, src_bytes, duration), where only service is categorical. Using the service attribute, the data is split into HTTP, FTP, FTP_data, and SMTP subsets. The original KDDCUP99 dataset contains 4,898,431 data points, of which 3,925,651 (80.1%) are attack data points. A smaller set of 976,157 data points, including 3,377 (0.35%) attacks, is created from it. The HTTP service data is used to create the HTTP KDDCUP99 dataset from that smaller dataset, which simulates normal data with attack traffic on an IP network. The HTTP KDDCUP99 dataset contains 3 dimensions and 567,497 data points, of which 0.4% are outliers [19].

D. KDD CUP99 SMTP Dataset

In this instance, the SMTP service is used from the KDD CUP 1999 dataset, which is from the UCI machine learning repository [18].

The original KDDCUP99 dataset contains 4,898,431 data points, of which 3,925,651 (80.1%) are attack data points. A smaller set of 976,157 data points, including 3,377 (0.35%) attacks, is created from it. The SMTP service data is used to create the SMTP KDDCUP99 dataset from this smaller dataset. The SMTP KDDCUP99 dataset contains 3 dimensions and 95,156 data points, of which 0.03% are outliers [19].

VI. DISCUSSION AND CONCLUSION

In the big data era, outlier detection is a very important step in many applications, such as network intrusion detection systems and decision support systems. The objective of outlier detection is to detect suspicious items and unusual activities. In practice, analyzing a dataset to extract information without removing the outlier data points will lead to inaccurate information, which will result in wrong decisions. Recently, outlier detection has gained a lot of attention from researchers, especially regarding data streams. This paper proposed a new possibility for local outlier detection in data streams by developing two methods, called the Genetic-based Incremental Local Outlier Factor (GILOF) and the Local Outlier Factor by Reachability Distance (LOFR). As mentioned above, the GA is assumed to be better than gradient descent because the genetic algorithm is a population-based search technique and can jump out of local minima with the aid of its crossover and mutation operators, which search the space more widely. The GILOF algorithm has already been compared with the DILOF algorithm [12], and the results are extensively discussed in [13]. To further improve the efficiency of the GILOF algorithm, we developed another calculation method for the LOF, called the LOFR; the resulting algorithm is named GILOFR. The GILOFR algorithm was compared with the GILOF algorithm, and the outcomes are extensively discussed in [14]. The LOFR was also applied to another algorithm, called the Grid Partition-based Local Outlier Factor, where it showed slight improvement on some datasets [20]. For future work, the new LOFR calculation method can be applied in the DILOF algorithm instead of the LOF, which may lead to more accurate results. In summary, this paper addressed specific issues and challenges of the LOF in stream environments, provided new methods to improve the efficiency of local outlier detection in data streams, and proposed a new local outlier detection algorithm for data streams.

REFERENCES

[1] S. Sadik and L. Gruenwald, "Research issues in outlier detection for data streams," ACM SIGKDD Explorations Newsletter, vol. 15, no. 1, pp. 33–40, 2014.
[2] T. Pooja, V. Jay, and P. Vishal, "Survey on Outlier Detection in Data Stream," International Journal of Computer Applications, vol. 136, no. 2, pp. 13–16, 2016.
[3] V. M. Tellis and D. J. D'souza, "Detecting Anomalies in Data Stream Using Efficient Techniques: A Review," 2018 International Conference on Control, Power, Communication and Computing Technologies (ICCPCCT), 2018.
[4] I. Souiden, Z. Brahmi, and H. Toumi, "A Survey on Outlier Detection in the Context of Stream Mining: Review of Existing Approaches and Recommendations," Advances in Intelligent Systems and Computing: Intelligent Systems Design and Applications, pp. 372–383, 2017.
[5] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, vol. 1, no. 1, 2016.
[6] O. Alghushairy and X. Ma, "Data Storage," in Encyclopedia of Big Data, L. Schintler and C. McNeely, Eds. Cham: Springer, 2019.
[7] A. Margara and T. Rabl, "Definition of Data Streams," Encyclopedia of Big Data Technologies, pp. 1–4, 2018.
[8] G. Krempl, I. Zliobaite, D. Brzezinski, E. Hüllermeier, M. Last, V. Lemaire, T. Noack, A. Shaker, S. Sievi, M. Spiliopoulou, and J. Stefanowski, "Open challenges for data stream mining research," ACM SIGKDD Explorations Newsletter, vol. 16, no. 1, pp. 1–10, 2014.
[9] M. Younas, "Research challenges of big data," Service Oriented Computing and Applications, vol. 13, pp. 105–107, Jun. 2019.
[10] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: identifying density-based local outliers," Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD '00), 2000.
[11] D. Pokrajac, A. Lazarevic, and L. J. Latecki, "Incremental Local Outlier Detection for Data Streams," 2007 IEEE Symposium on Computational Intelligence and Data Mining, 2007.
[12] G. S. Na, D. Kim, and H. Yu, "DILOF: Effective and memory efficient local outlier detection in data streams," Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.
[13] O. Alghushairy, R. Alsini, X. Ma, and T. Soule, "A Genetic-Based Incremental Local Outlier Factor Algorithm for Efficient Data Stream Processing," Proceedings of the 2020 4th International Conference on Compute and Data Analysis, 2020.
[14] O. Alghushairy, R. Alsini, X. Ma, and T. Soule, "Improving the Efficiency of Genetic based Incremental Local Outlier Factor Algorithm for Network Intrusion Detection," in Proceedings of the 4th International Conference on Applied Cognitive Computing, Springer, 2020.
[15] A. E. Eiben and J. E. Smith, Introduction to Evolutionary Computing, Natural Computing Series, 2003.
[16] M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, 1998.
[17] K. F. Man, K. Tang, and S. Kwong, Genetic Algorithms: Concepts and Designs, Springer Science & Business Media, 2001.
[18] K. Bache and M. Lichman, UCI Machine Learning Repository, 2013. Available: http://archive.ics.uci.edu/ml
[19] S. Rayana, ODDS Library [http://odds.cs.stonybrook.edu]. Stony Brook, NY: Stony Brook University, Department of Computer Science, 2016.
[20] R. Alsini, O. Alghushairy, X. Ma, and T. Soule, "A Grid Partition-based Local Outlier Factor for Data Stream Processing," in Proceedings of the 4th International Conference on Applied Cognitive Computing, Springer, 2020.
