2020 International Conference on Computational Science and Computational Intelligence (CSCI)

An Efficient Local Outlier Factor for Data Stream Processing: A Case Study

Omar Alghushairy
Department of Computer Science, University of Idaho, Moscow, ID, USA
College of Computer Science and Engineering, University of Jeddah, Jeddah, 23890, Saudi Arabia
[email protected]

Raed Alsini
Department of Computer Science, University of Idaho, Moscow, ID, USA
Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, 21589, Saudi Arabia
[email protected]

Xiaogang Ma
Department of Computer Science, University of Idaho, Moscow, ID, USA
[email protected]

Abstract— In the field of machine learning and data mining, outlier detection is considered a significant procedure for many applications, such as fraud detection in bank transactions and decision support systems. Data streams are a major player in the big data era. Currently, data streams are generated from various sources with huge amounts of data. This has led to difficulty when using older algorithms, which are designed for static data. The Local Outlier Factor (LOF) is one of these algorithms. The most challenging issue of the LOF is that it needs to preserve the whole dataset in computer memory. A new LOF that can deal with a data stream in limited memory is needed. This paper is a case study of several benchmark datasets for outlier detection, aiming to increase the efficiency and accuracy of local outlier detection in data streams.

Keywords—local outlier factor, stream data mining, genetic algorithm, outlier detection (short paper)
Symposium on Computational Intelligence (CSCI-ISCI)

I. INTRODUCTION

Outlier detection attempts to distinguish a data point that is distinct from the rest of the given data. Outliers occur during a procedure or as a consequence of an error of measurement [1]. By detecting outliers, essential information can be obtained to make better decisions in various applications, such as fraudulent credit card transactions and intrusion detection [3]. Outlier detection techniques have been widely used in machine learning and data mining to extract information and to clean data, for example, in various domains for the purposes of decision-making, clustering, classification, and identifying frequent patterns [2, 4]. One of the popular algorithms used for outlier detection is the Local Outlier Factor (LOF). The LOF algorithm is a density-based outlier detection technique used to evaluate outliers among multi-density data points. Despite the LOF's success in identifying local outliers, it cannot work in stream environments because it requires retaining the whole dataset in memory.

This paper describes the main challenges and methodologies of processing the LOF in the stream environment. Data streams produce massive data that continually expand at great velocity. The data cannot be processed entirely in computer memory because the data keeps increasing [1]. The paper has five remaining parts: the second is Background, the third is Problem Definition, the fourth is A Proposed New Development, the fifth is Datasets, and the sixth is Discussion and Conclusion.

II. BACKGROUND

A. Big Data

Big data is a set of data of a size that exceeds the ability of regular databases to process, store, transfer, manage, share, and analyze within an acceptable period of time [5]. Big data comes in three different modes: structured data, semi-structured data, and unstructured data [6]. The most important type of big data is the data stream, which has the characteristics of volume, velocity, variety, value, and variability. Therefore, it is not possible to process big data by traditional methods, and there is currently an urgent need to develop new algorithms for processing and managing big data.

B. Data Stream

A data stream is a collection of continuous data processed to collect knowledge and extract information [7]. Data streams represent big data as primary sources, with various applications and different properties such as volume, velocity, variety, value, and veracity [8, 9]. Volume refers to the large volume of data assembled and analyzed. Velocity involves the pace at which data are generated and transported between various systems and devices. Variety applies to the multiple types of data that may be used to obtain the required knowledge or performance; it involves the data modes of structured, unstructured, and semi-structured data [6, 9]. Value refers to the advantages of extracting information from the big data. Lastly, veracity involves the quality of the data in terms of precision, integrity, confidence, protection, and reliability. Due to the nature of the data stream with regard to these five significant data properties, data stream processing requires various methods to evaluate the data points in the data stream environment.

C. The Local Outlier Factor (LOF)

The LOF is a popular algorithm for outlier detection and is considered the foremost algorithm for local outlier detection in static environments. The LOF aims to calculate a score for all data points to determine how close a data point is to being an outlier. Figure 1 illustrates the key definitions for calculating the LOF score for a data point p [10].

D. The Incremental Local Outlier Factor (ILOF)

The ILOF solves the issue of the LOF in a data stream and is able to calculate the LOF score in a stream environment. The goal of the ILOF algorithm is to detect local outliers in data streams [11].
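The density-ratio score that Figure 1's definitions (k-distance, reachability distance, local reachability density) produce can be sketched in plain Python. This is a toy, static implementation for illustration only, not the paper's code; the cluster-plus-outlier data at the end is a made-up example.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i], excluding itself."""
    order = sorted(range(len(points)), key=lambda j: math.dist(points[i], points[j]))
    return [j for j in order if j != i][:k]

def k_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor."""
    return math.dist(points[i], points[knn(points, i, k)[-1]])

def reach_dist(points, i, j, k):
    """Reachability distance of i w.r.t. j: max(k-distance(j), d(i, j))."""
    return max(k_distance(points, j, k), math.dist(points[i], points[j]))

def lrd(points, i, k):
    """Local reachability density: inverse mean reachability distance to neighbors."""
    nn = knn(points, i, k)
    return len(nn) / sum(reach_dist(points, i, j, k) for j in nn)

def lof(points, i, k):
    """LOF score: mean ratio of the neighbors' lrd to the point's own lrd."""
    nn = knn(points, i, k)
    return sum(lrd(points, j, k) for j in nn) / (len(nn) * lrd(points, i, k))

# A tight cluster plus one distant point: cluster scores sit near 1,
# while the distant point receives a much larger LOF score.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = [lof(data, i, k=3) for i in range(len(data))]
```

A score near 1 means the point is as dense as its neighborhood; scores well above 1 mark local outliers.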

978-1-7281-7624-6/20/$31.00 ©2020 IEEE  1525  DOI 10.1109/CSCI51800.2020.00282

The main task in the ILOF is to calculate and update the LOF score when a new incoming data point (np) is inserted. However, the major issue for the ILOF is that it needs to retain all data points in memory, which leads to large memory usage and long computational time.

E. The Density Summarization Incremental Local Outlier Factor (DILOF)

The DILOF is an algorithm that was developed to overcome the vulnerability of the ILOF. The DILOF algorithm has two steps: the summarization step and the detection step. The task of the summarization step is to summarize the old half of the data points in the current window by using the gradient-descent method. An issue with gradient descent is that it might get stuck in local minima. The task of the detection step is to detect the outliers and update the data points [12].

Fig. 1. The key definitions of LOF.

III. PROBLEM DEFINITION

We are currently in the big data era, and the most significant type of big data is the data stream. With the increasing need to analyze and process high-velocity streaming data, it has become difficult to use traditional local outlier detection algorithms effectively. The main challenge of the LOF is that it needs the whole dataset, and the distance values between all data points, to be stored in memory. In addition, the LOF must be recalculated from the beginning whenever any alteration occurs in the dataset. The ILOF and DILOF algorithms address the LOF issues in data streams, but they also have some issues that limit their performance, as mentioned above.

IV. A PROPOSED NEW DEVELOPMENT

To improve the efficiency of the local outlier factor in data streams and overcome the limitations of the DILOF algorithm, we propose a new algorithm called the Genetic-based Incremental Local Outlier Factor (GILOF). The GILOF is based on the Genetic Algorithm (GA). The Genetic Algorithm is a well-known heuristic search algorithm in evolutionary computation [15]. GA was designed to solve complex problems by using populations that contain a set of chromosomes and evaluating them with a fitness function [16]. After this step, it applies its operators, such as crossover and mutation, to find optimal solutions [17]. Therefore, GA performs better than the simple gradient-descent method when searching complex spaces that include many local minima. This is because the genetic algorithm is a population-based search technique whose crossover and mutation operators search the space more widely.

The GILOF algorithm finds the LOF score in the data stream in two steps: the detection step and the summarization step [13]. In the detection step, both the LOF and the ILOF are applied with a skipping scheme. In the summarization step, the genetic density summarization (GDS) is applied; the next subsection describes the GDS method. The GILOF algorithm begins by specifying the window size W for data points. The threshold value θ is employed to distinguish outliers according to the LOF threshold. The GILOF keeps detecting outliers and measuring LOF scores until the current window reaches the determined window size. Then the GDS function is used in the summarization step: the GDS summarizes 50%W of the old data points in the window, and the GILOF selects 25%W of data points to represent that 50%W of old data points, which are then deleted from the window; the selected 25%W of data points are combined with the remaining 50%W in the window. Figure 2 illustrates how the GILOF algorithm works in a data stream¹. For more details, refer to [13].

Fig. 2. GILOF process for a data stream in two dimensions from time T0 to Tcurrent [13].

A. Genetic Density Summarization (GDS)

The GDS algorithm aims to summarize the old data points in the window. The GDS performs this process by applying the genetic algorithm (GA) to minimize the difference in density between the 50% of old data points and the selected 25% of data points. GA is hypothesized to be better than the gradient-descent method because it is able to discover near-optimal solutions while skipping local minima.

The GDS operates as follows: first, a population is created, which includes individuals that have chromosomes; then, the fitness (objective) function evaluates the chromosomes; next, the GDS applies selection for each generation; after that, the crossover process is used; following this, the GDS applies the mutation process (the GDS retains the optimal results after finishing all the GA operations); and, lastly, the selected chromosomes are converted into a binary domain, where the selected 25% of data points are set to 1 and the rest to 0. For more detail, refer to [13]. Figure 3 illustrates the steps of summarization in the GDS.

1 https://www.youtube.com/watch?v=YY-lHhhe2Ew
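The GDS loop described above (population, fitness evaluation, selection, crossover, mutation, binary encoding of the kept points) can be sketched as follows. This is a toy reconstruction under stated assumptions: the fitness here is a simple one-dimensional density-difference proxy rather than the paper's density measure, and the operator choices (elitism, one-point crossover with repair, swap mutation) are illustrative, not the authors' implementation.

```python
import random

def density(points, subset, k=3):
    """Mean distance from each subset point to its k nearest points
    (a simple density proxy, not the paper's measure)."""
    total = 0.0
    for p in subset:
        dists = sorted(abs(p - q) for q in points if q != p)
        total += sum(dists[:k]) / k
    return total / len(subset)

def fitness(old_points, chrom):
    """Density difference between the kept subset and all old points (lower is better)."""
    subset = [p for p, bit in zip(old_points, chrom) if bit]
    return abs(density(old_points, subset) - density(old_points, old_points))

def gds_like(old_points, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    n = len(old_points)
    target = n // 2  # keep 25%W out of the 50%W old points

    def random_chrom():
        bits = [1] * target + [0] * (n - target)
        rng.shuffle(bits)
        return bits

    def repair(chrom):
        # Restore the fixed subset size after crossover.
        ones = [i for i, b in enumerate(chrom) if b]
        zeros = [i for i, b in enumerate(chrom) if not b]
        while len(ones) > target:
            chrom[ones.pop(rng.randrange(len(ones)))] = 0
        while len(ones) < target:
            i = zeros.pop(rng.randrange(len(zeros)))
            chrom[i] = 1
            ones.append(i)
        return chrom

    pop = [random_chrom() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(old_points, c))
        nxt = pop[:2]                                   # elitism: keep the two fittest
        while len(nxt) < pop_size:
            a, b = rng.sample(pop[: pop_size // 2], 2)  # select from the fitter half
            cut = rng.randrange(1, n)
            child = repair(a[:cut] + b[cut:])           # one-point crossover + repair
            if rng.random() < 0.3:                      # mutation: swap a kept/dropped pair
                i = rng.choice([x for x, bit in enumerate(child) if bit])
                j = rng.choice([x for x, bit in enumerate(child) if not bit])
                child[i], child[j] = 0, 1
            nxt.append(child)
        pop = nxt
    best = min(pop, key=lambda c: fitness(old_points, c))
    return [p for p, bit in zip(old_points, best) if bit]

old = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0, 9.3]
kept = gds_like(old)  # half of the old points, chosen to match their density
```

The binary chromosome directly mirrors the paper's encoding: bit 1 marks a data point selected into the 25%W summary, bit 0 marks a deleted point.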

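Putting the pieces together, the windowed GILOF flow (score each arriving point against the current window, flag it when the score exceeds the threshold θ, and summarize the oldest 50%W down to 25%W once the window fills) can be sketched as below. The `score` and `summarize` helpers are simplified stand-ins, assumed for illustration: they replace the LOF/ILOF scoring and the GDS, and the threshold value is arbitrary.

```python
import math
import random

def score(window, p, k=3):
    """Stand-in outlier score: mean distance to the k nearest window points."""
    d = sorted(math.dist(p, q) for q in window if q != p)
    if not d:
        return 0.0
    d = d[:k]
    return sum(d) / len(d)

def summarize(old_half, rng):
    """Stand-in for the GDS: keep a random half of the old points (25%W)."""
    return rng.sample(old_half, len(old_half) // 2)

def gilof_like_stream(stream, W=8, theta=2.0, seed=0):
    rng = random.Random(seed)
    window, outliers = [], []
    for p in stream:
        window.append(p)
        if score(window, p) > theta:            # detection step
            outliers.append(p)
        if len(window) >= W:                    # summarization once the window fills
            old, recent = window[: W // 2], window[W // 2 :]
            # 25%W summary of the old half rejoins the remaining 50%W
            window = summarize(old, rng) + recent
    return outliers

points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.1, 0.0),
          (9.0, 9.0),                           # an obvious outlier
          (0.0, 0.1), (0.2, 0.2), (0.1, 0.2), (0.0, 0.2)]
found = gilof_like_stream(points)
```

The point of the sketch is the memory bound: the window never holds more than W points, which is exactly the property the LOF lacks in stream environments.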
B. Local Outlier Factor by Reachability Distance (LOFR)

The LOFR is similar to the LOF, except that the LOF uses the local reachability density while the LOFR does not [14]. The goal of the LOFR calculation method is to reduce the outlierness score in order to provide better accuracy of outlier detection. The score of the LOFR depends on the reachability distance (rd) of the data point p and its nearest neighbors: the rd of data point p is divided by the average rd of its neighbors. The LOFR is calculated by the following equation:

LOFR_k(p) = rd_k(p) / ( (1 / |N_k(p)|) * Σ_{o ∈ N_k(p)} rd_k(o) )

where N_k(p) denotes the k nearest neighbors of p and rd_k is the reachability distance.

Fig. 3. GDS framework [14].

V. DATASETS

In unsupervised outlier detection, label information is not used during processing and analysis; it is used only for comparison and evaluation. When a new outlier detection algorithm is developed, it is usual to apply it to publicly available datasets and compare its results with those of common unsupervised outlier detection algorithms, such as the LOF. Many classification datasets are fully available in the UCI machine learning repository [18], and some outlier detection datasets are provided in [19]. The datasets below are real-world datasets that contain outlier data points. These benchmark datasets are used in this study; they were used to analyze the new GILOF and GILOFR algorithms and to compare them with existing algorithms. Table I summarizes the features of these datasets.

A. UCI Vowels Dataset

The Vowels dataset is considered a multivariate time series dataset as well as a classification dataset, which classifies speakers. In one particular case, nine speakers spoke two Japanese vowels, respectively. One speech by a speaker forms a time series of length 7 to 29, and each point in the time series consists of twelve characteristics. In outlier (anomaly) detection, any frame in the training dataset is treated as a single data point, although the UCI machine learning repository deems a block of frames (talk) a single point. Furthermore, classes six, seven, and eight are considered inliers. The dataset contains 12 dimensions with 1,456 data points, and 3.4% of these data points are outliers [18, 19].

B. UCI Pendigits Dataset

The Pendigits dataset is originally from the UCI machine learning repository [18]. This dataset is a multiclass classification dataset that has 16 dimensions with 10 classes. The Pendigits dataset consists of 250 samples written by each of 44 writers. Thirty of the writers' samples are used for training, while the other 14 writers' samples are used for testing. The original training set contains 7,494 data points, and the testing set has 3,498 data points [19].

TABLE I. THE REAL-WORLD DATASETS' FEATURES

Dataset          | Data points | Dimensions | Classes
UCI Vowels       | 1,456       | 12         | 11
UCI Pendigits    | 3,498       | 16         | 10
KDD CUP99 SMTP   | 95,156      | 3          | Unknown
KDD CUP99 HTTP   | 567,479     | 3          | Unknown

C. KDD CUP99 HTTP Dataset

KDD CUP 1999 is the original dataset from the UCI machine learning repository [18]. This dataset contains 41 attributes, but it is reduced to 4 (service, dst_bytes, src_bytes, duration), where only service is categorical. Using the service attribute, the data is split into HTTP, FTP, FTP_data, and SMTP subsets. The original KDDCUP99 dataset contains 4,898,431 data points, of which 3,925,651 (80.1%) are attack data points. A smaller set of 976,157 data points, including 3,377 (0.35%) attacks, is created from it. The HTTP service data is used to create the HTTP KDDCUP99 dataset from that smaller dataset, which simulates normal data with attack traffic on an IP network. The HTTP KDDCUP99 dataset contains 3 dimensions and 567,497 data points, of which 0.4% are outliers [19].

D. KDD CUP99 SMTP Dataset

In this instance, the SMTP service is used from the KDD CUP 1999 dataset, which is from the UCI machine learning repository [18].

The original KDDCUP99 dataset contains 4,898,431 data points, of which 3,925,651 (80.1%) are attack data points. A smaller set of 976,157 data points, including 3,377 (0.35%) attacks, is created from it. The SMTP service data is used to create the SMTP KDDCUP99 dataset from this smaller dataset. The SMTP KDDCUP99 dataset contains 3 dimensions and 95,156 data points, of which 0.03% are outliers [19].

VI. DISCUSSION AND CONCLUSION

In the big data era, outlier detection is a very important step in many applications, such as network intrusion detection systems and decision support systems. The objective of outlier detection is to detect suspicious items and unusual activities. In practice, analyzing a dataset to extract information without removing the outlier data points will lead to inaccurate information, which will result in wrong decisions. Recently, outlier detection has gained a lot of attention from researchers, especially regarding data streams. This paper proposed a new possibility for local outlier detection in data streams by developing two methods, called the Genetic-based Incremental Local Outlier Factor (GILOF) and the Local Outlier Factor by Reachability Distance (LOFR). As mentioned above, the GA is assumed to be better than gradient descent because the genetic algorithm is a population-based search technique and can jump out of local minima with the aid of its crossover and mutation operators, which search the space more widely. The GILOF algorithm has already been compared with the DILOF algorithm [12], and the results are extensively discussed in [13]. To further improve the efficiency of the GILOF algorithm, we developed another calculation method for the LOF, called the LOFR; the resulting algorithm is named GILOFR. The GILOFR algorithm was compared with the GILOF algorithm, and the outcomes are extensively discussed in [14]. The LOFR was also applied to another algorithm, called the Grid Partition-based Local Outlier Factor, where it showed slight improvement on some datasets [20]. For future work, the new LOFR calculation method can be applied in the DILOF algorithm instead of the LOF, which may lead to more accurate results. In summary, this paper addressed specific issues and challenges of the LOF in stream environments, provided new methods to improve the efficiency of local outlier detection in data streams, and proposed a new local outlier detection algorithm for data streams.

REFERENCES

[1] S. Sadik and L. Gruenwald, "Research issues in outlier detection for data streams," ACM SIGKDD Explorations Newsletter, vol. 15, no. 1, pp. 33–40, 2014.
[2] T. Pooja, V. Jay, and P. Vishal, "Survey on Outlier Detection in Data Stream," International Journal of Computer Applications, vol. 136, no. 2, pp. 13–16, 2016.
[3] V. M. Tellis and D. J. D'souza, "Detecting Anomalies in Data Stream Using Efficient Techniques: A Review," 2018 International Conference on Control, Power, Communication and Computing Technologies (ICCPCCT), 2018.
[4] I. Souiden, Z. Brahmi, and H. Toumi, "A Survey on Outlier Detection in the Context of Stream Mining: Review of Existing Approaches and Recommendations," Advances in Intelligent Systems and Computing: Intelligent Systems Design and Applications, pp. 372–383, 2017.
[5] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, vol. 1, no. 1, 2016.
[6] O. Alghushairy and X. Ma, "Data Storage," in Encyclopedia of Big Data, L. Schintler and C. McNeely, Eds. Cham: Springer, 2019.
[7] A. Margara and T. Rabl, "Definition of Data Streams," Encyclopedia of Big Data Technologies, pp. 1–4, 2018.
[8] G. Krempl, I. Zliobaite, D. Brzezinski, E. Hüllermeier, M. Last, V. Lemaire, T. Noack, A. Shaker, S. Sievi, M. Spiliopoulou, and J. Stefanowski, "Open challenges for data stream mining research," ACM SIGKDD Explorations Newsletter, vol. 16, no. 1, pp. 1–10, 2014.
[9] M. Younas, "Research challenges of big data," Service Oriented Computing and Applications, vol. 13, pp. 105–107, Jun. 2019.
[10] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: identifying density-based local outliers," Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD '00), 2000.
[11] D. Pokrajac, A. Lazarevic, and L. J. Latecki, "Incremental Local Outlier Detection for Data Streams," 2007 IEEE Symposium on Computational Intelligence and Data Mining, 2007.
[12] G. S. Na, D. Kim, and H. Yu, "DILOF: Effective and memory efficient local outlier detection in data streams," Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.
[13] O. Alghushairy, R. Alsini, X. Ma, and T. Soule, "A Genetic-Based Incremental Local Outlier Factor Algorithm for Efficient Data Stream Processing," Proceedings of the 2020 4th International Conference on Compute and Data Analysis, 2020.
[14] O. Alghushairy, R. Alsini, X. Ma, and T. Soule, "Improving the Efficiency of Genetic based Incremental Local Outlier Factor Algorithm for Network Intrusion Detection," in Proceedings of the 4th International Conference on Applied Cognitive Computing, Springer, 2020.
[15] A. E. Eiben and J. E. Smith, Introduction to Evolutionary Computing, Natural Computing Series, 2003.
[16] M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, 1998.
[17] K. F. Man, K. Tang, and S. Kwong, Genetic Algorithms: Concepts and Designs, Springer Science & Business Media, 2001.
[18] K. Bache and M. Lichman, UCI Machine Learning Repository, 2013. Available: http://archive.ics.uci.edu/ml
[19] S. Rayana, ODDS Library [http://odds.cs.stonybrook.edu]. Stony Brook, NY: Stony Brook University, Department of Computer Science, 2016.
[20] R. Alsini, O. Alghushairy, X. Ma, and T. Soule, "A Grid Partition-based Local Outlier Factor for Data Stream Processing," in Proceedings of the 4th International Conference on Applied Cognitive Computing, Springer, 2020.
