A Review of Local Outlier Factor Algorithms for Outlier Detection in Big Data Streams
Total Page:16
File Type:pdf, Size:1020Kb
big data and cognitive computing Review A Review of Local Outlier Factor Algorithms for Outlier Detection in Big Data Streams Omar Alghushairy 1,2,* , Raed Alsini 1,3 , Terence Soule 1 and Xiaogang Ma 1,* 1 Department of Computer Science, University of Idaho, Moscow, ID 83844, USA; [email protected] (R.A.); [email protected] (T.S.) 2 College of Computer Science and Engineering, University of Jeddah, Jeddah 23890, Saudi Arabia 3 Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia * Correspondence: [email protected] (O.A.); [email protected] (X.M.) Abstract: Outlier detection is a statistical procedure that aims to find suspicious events or items that are different from the normal form of a dataset. It has drawn considerable interest in the field of data mining and machine learning. Outlier detection is important in many applications, including fraud detection in credit card transactions and network intrusion detection. There are two general types of outlier detection: global and local. Global outliers fall outside the normal range for an entire dataset, whereas local outliers may fall within the normal range for the entire dataset, but outside the normal range for the surrounding data points. This paper addresses local outlier detection. The best-known technique for local outlier detection is the Local Outlier Factor (LOF), a density-based technique. There are many LOF algorithms for a static data environment; however, these algorithms cannot be applied directly to data streams, which are an important type of big data. In general, local outlier detection algorithms for data streams are still deficient and better algorithms need to be developed that can effectively analyze the high velocity of data streams to detect local outliers. This paper presents a literature review of local outlier detection algorithms in static and stream environments, with an emphasis on LOF algorithms. It collects and categorizes existing local outlier detection algorithms and analyzes their characteristics. Furthermore, the paper discusses the advantages and Citation: Alghushairy, O.; Alsini, R.; limitations of those algorithms and proposes several promising directions for developing improved Soule, T.; Ma, X. A Review of Local local outlier detection methods for data streams. Outlier Factor Algorithms for Outlier Detection in Big Data Streams. Big Keywords: outlier detection; data science; local outlier factor; genetic algorithm; stream data mining Data Cogn. Comput. 2021, 5, 1. https://doi.org/10.3390/bdcc 5010001 1. Introduction Received: 5 November 2020 Accepted: 23 December 2020 Outlier detection is the term used for anomaly detection, fraud detection, and novelty Published: 29 December 2020 detection. The purpose of outlier detection is to detect rare events or unusual activities that differ from the majority of data points in a dataset [1]. Recently, outlier detection has Publisher’s Note: MDPI stays neu- become an important problem in many applications such as in health care, fraud transaction tral with regard to jurisdictional clai- detection for credit cards, and intrusion detection in computer networks. Therefore, many ms in published maps and institutio- algorithms have been proposed and developed to detect outliers. However, most of these nal affiliations. algorithms are designed for a static data environment and are difficult to apply to streams of data, which represent a significant number of important application areas. Outlier detection can be viewed as a specific application of generalized data mining. The process of data mining involves two steps: data processing and data mining. The aim Copyright: © 2020 by the authors. Li- censee MDPI, Basel, Switzerland. of data processing is to create data in good form, i.e., the outliers are removed. Data mining This article is an open access article is the next step, which involves the discovery and understanding of the behavior of the distributed under the terms and con- data and extracting this information by machine learning [2]. Outlier detection (or data ditions of the Creative Commons At- cleaning) is an important step in data processing because, if an outlier data point is used tribution (CC BY) license (https:// during data mining, it is likely to lead to inaccurate outputs [3]. For example, a traffic creativecommons.org/licenses/by/ pattern of outliers in the network might mean the data has been sent out by a hacked 4.0/). computer to an unauthorized device [4]; if the outliers cannot be detected and removed, Big Data Cogn. Comput. 2021, 5, 1. https://doi.org/10.3390/bdcc5010001 https://www.mdpi.com/journal/bdcc Big Data Cogn. Comput. 2021, 5, x FOR PEER REVIEW 2 of 25 Big Data Cogn. Comput. 2021, 5, 1 2 of 24 computer to an unauthorized device [4]; if the outliers cannot be detected and removed, a machine learning technique applied to the data is likely to be misled. Additionally, in manya machine cases, learning the goal technique of data mining applied is to find the data the outliers. is likely For to be example, misled. an Additionally, outlier image in inmany magnetic cases, resonance the goal of imaging data mining could is mean to find the the presence outliers. of For malignant example, tumors an outlier [5]. image An out- in liermagnetic reading resonance from an imagingaircraft sensor could may mean indicate the presence a fault of in malignant some components tumors [ 5of]. the An aircraft outlier [6].reading from an aircraft sensor may indicate a fault in some components of the aircraft [6]. Outlier detection (also known as anomaly detection) is split into twotwo types,types, globalglobal and local detection. For a global outlier, outlieroutlier detection considers all data points, and the data point pt is consideredconsidered an outlieroutlier if itit isis farfar awayaway fromfrom allall otherother datadata pointspoints [[7].7]. TheThe local outlier detection covers a small subset ofof datadata pointspoints atat aa timetime (Figure(Figure1 ).1). AA local local outlier is based on the probability of data point pt being an outlier as compared to its local neighborhood, which is measured by the kk-Nearest-Nearest NeighborsNeighbors ((kkNN)NN) algorithmalgorithm [[8].8]. Figure 1. The types of outliers, where grey points are global outliers, and the red point is a local outlier. Figure 1. The types of outliers, where grey points are global outliers, and the red point is a local outlier. Recently, many studies have been published in the area of outlier detection, and thereRecently, is a need many for these studies studies have to been be reviewedpublished to in understand the area of outlier the depth detection, of the and field there and isthe a need most for promising these studies areas to for be additional reviewed to research. understand This the review depth is of focused the field on and local the outlier most promisingdetection algorithmsareas for additional in data stream research. environments. This review is This focused review on islocal distinguished outlier detection from previousalgorithms reviews in data as stream it specifically environments. covers This the Localreview Outlier is distinguished Factor (LOF) from algorithms previous re- in viewsdata stream as it specifically environments covers and reviewsthe Local the Outlier most recent,Factor state-of-the-art(LOF) algorithms techniques. in data stream In this paper,environments different and methods reviews and the algorithms most recent, for stat locale-of-the-art outlier detection techniques. are summarized.In this paper, Thedif- ferentlimitations methods and and challenges algorithms of the for LOFlocal andoutlier other detection local outlier are summarized. detection algorithms The limitations in a andstream challenges environment of the are LOF analyzed. and other Moreover, local outlier new detection methods algorithms for the detection in a stream of an envi- LOF ronmentscore in a are data analyzed. stream are Moreover, proposed. new This methods literature for reviewthe detection will greatly of an LOF benefit score researchers in a data streamin the field are proposed. of local outlier This detectionliterature becausereview wi itll provides greatly benefit a comprehensive researchers explanation in the field ofof localthe features outlier ofdetection each algorithm because and it provides shows their a comprehensive strengths and explanation weaknesses, of particularly the features in of a eachdata streamalgorithm environment. and shows The their paper strengths has five and remaining weaknesses, parts: particularly the second sectionin a data is Relatedstream Worksenvironment. and Background; The paper has the five third remaining section is part Literatures: the second Review section Methodology; is Related Works the fourth and section is Literature Review Results; the fifth section is Analysis and Discussion; and the sixth section is the Conclusion. Big Data Cogn. Comput. 2021, 5, 1 3 of 24 2. Related Works and Background Many methods and algorithms have been developed to detect outliers in datasets. Most of these methods and algorithms have been developed for specific applications and domains. Outlier detection has been the focus of many review and survey papers, as well as several books. Patcha et al. and Snyder [9,10] published a survey of outlier detection techniques for intrusion detection. Markou and Singh [11,12] provided a review of outlier detection methods that use statistical approaches and neural networks. A comprehensive