Accelerating the Local Outlier Factor Algorithm on a GPU for Intrusion Detection Systems
Total Page:16
File Type:pdf, Size:1020Kb
Accelerating the Local Outlier Factor Algorithm on a GPU for Intrusion Detection Systems Malak Alshawabkeh Byunghyun Jang David Kaeli Dept of Electrical and Dept. of Electrical and Dept. of Electrical and Computer Engineering Computer Engineering Computer Engineering Northeastern University Northeastern University Northeastern University Boston, MA Boston, MA Boston, MA [email protected] [email protected] [email protected] ABSTRACT 1. INTRODUCTION The Local Outlier Factor (LOF) is a very powerful anomaly The Local Outlier Factor (LOF) [3] algorithm is a powerful detection method available in machine learning and classifi- outlier detection technique that has been widely applied to cation. The algorithm defines the notion of local outlier in anomaly detection and intrusion detection systems. LOF which the degree to which an object is outlying is dependent has been applied in a number of practical applications such on the density of its local neighborhood, and each object can as credit card fraud detection [5], product marketing [16], be assigned an LOF which represents the likelihood of that and wireless sensor network security [6]. object being an outlier. Although this concept of a local out- lier is a useful one, the computation of LOF values for every data object requires a large number of k-nearest neighbor The LOF algorithm utilizes the concept of a local outlier that queries – this overhead can limit the use of LOF due to the captures the degree to which an object is an outlier based computational overhead involved. on the density of its local neighborhood. Each object can be assigned an LOF value which represents the likelihood of that object being an outlier. High LOF value are used Due to the growing popularity of Graphics Processing Units to identify data objects that are potential outliers, whereas (GPU) in general-purpose computing domains, and equipped low LOF values indicate a normal data object. with a high-level programming language designed specifi- cally for general-purpose applications (e.g., CUDA), we look to apply this parallel computing approach to accelerate LOF. 1.1 Applying LOF to Intrusion Detection In this paper we explore how to utilize a CUDA-based GPU Intrusion Detection Systems (IDSs) have become one of the implementation of the k-nearest neighbor algorithm to ac- fastest growing technologies within the security arena. As celerate LOF classification. We achieve more than a 100X the ubiquity of computers and computer networks has grown, speedup over a multi-threaded dual-core CPU implementa- the threat posed and damage caused by cyber attacks has tion. We also consider the impact of input data set size, the multiplied. Therefore, the need to develop robust and re- neighborhood size (i.e., the value of k) and the feature space liable IDSs is becoming increasingly important. There are dimension, and report on their impact on execution time. two major intrusion detection techniques: signature-based detection and anomaly detection. Signature-based detec- tion discovers attacks based on the patterns extracted from Categories and Subject Descriptors known intrusions. Such systems have low false positive rates, D.1.3 [Programming Techniques]: Concurrent Program- but cannot detect new attacks absent in the signature file. ming—parallel programming Anomaly detection, on the other hand, identifies attacks based on the deviations from the established profiles of nor- General Terms mal activities, and the assumption is that attacks deviate from normal behavior. Most existing anomaly detection sys- Performance tems suffer from high false positive rates. This occurs pri- marily because previously unseen (yet legitimate) system be- Keywords haviors are also recognized as anomalies, and hence flagged LOF, intrusion detection system, GPU, parallelization as potential intrusions. To overcome this issue, there has been much interest in Permission to make digital or hard copies of all or part of this work for employing outlier detection methods for anomaly detection. personal or classroom use is granted without fee provided that copies are Outlier detection methods aim to find the unusual activities not made or distributed for profit or commercial advantage and that copies that are different, dissimilar and inconsistent with respect bear this notice and the full citation on the first page. To copy otherwise, to to the remainder of the data set, based on some measure [2]. republish, to post on servers or to redistribute to lists, requires prior specific It has been shown that outlier detection techniques are ef- permission and/or a fee. GPGPU-3 March 14, 2010, Pittsburg, PA, USA fective in reducing the false positive rate with the ability Copyright 2010 ACM 978-1-60558-935-0/10/03 ...$10.00. to detect unseen attacks. LOF has been shown to perform 104 well in detecting abnormal behavior in a network Intrusion 2. THE LOCAL OUTLIER FACTOR (LOF) Detection System (IDS) [10]. LOF has been used to iden- METHOD tify several novel and previously unseen intrusions in real 2.1 Algorithm Overview network data that could not be detected using other state- The main goal of the LOF method is to assign to each ob- of-the-art intrusion detection systems such as SNORT [14]. ject a degree of being an outlier [3]. This degree is called the local outlier factor (LOF) of an object. It is local in the sense that the degree depends on how isolated the object 1.2 LOF Computational Overhead is when compared to its surrounding local neighborhood. Unfortunately, the LOF algorithm’s time complexity is O(n2), The concept of a local outlier is an important one since in where n is the data size. It is designed to compute the LOF many applications, different portions of a data set can ex- for all objects in the data set, which results in a computa- hibit very different characteristics, and it is more meaning- tionally intensive process, since it requires a large number of ful to decide on the possibility of an object being an outlier k-nearest neighbor queries. Because of this issue, designing basedonotherobjectsinitsneighborhood.IntheLOFal- efficient and reliable intrusion detection systems based on gorithm, the difference in density between a data object and the LOF method is challenging. its neighborhood is the degree of being an outlier, known as its local outlier factor. Intuitively, outliers are the data ob- jects with high LOF values, whereas data objects with low Several methods have been proposed to reduce the compu- LOF values are likely to be normal with respect to their tation time of the LOF method. The general idea of these neighborhood. Possessing a high LOF value is an indication approaches is to reduce the number of distances that need of a low-density neighborhood, and hence, a higher poten- to be computed [1, 11]. However, these methods are very tial of being an outlier. The computation of LOF values slow when working with high dimensional spaces (i.e., when requires finding the k-nearest neighbors. To better under- the number of classification features is large). stand the LOF method, we next provide a brief overview of the k-nearest neighbor (KNN) approach. Given the recent popularization of Graphics Processing Units 2.2 Review of the k-nearest neighbor method (GPU), and the increased flexibility of the most recent gen- The definition of the KNN search problem is: given a set S eration of GPU hardware, and combined with high-level of reference points in a metric space M and a query point GPU programming languages such as CUDA, we would like q ∈ M, find the the k nearest (closest) reference points in S to to consider if LOF classification can be effectively acceler- q. In many cases, M is taken to be d-dimensional space and ated using a GPU platform. The key to obtain good speedup the distance (closeness) is measured by either the Euclidean with a GPU is to carefully map the data-parallel algorithm distance or Manhattan distance. Figure 1 shows an exam- on to the available hundreds of processing cores. Many at- ple of the KNN search problem with k = 4. A ”brute force” tempts have been made to use graphics processors for several approach is usually used to implement the KNN search. Fol- traditional data mining techniques that include neural net- lowing this approach, all distances between the query point works [15], support vector machine (SVM) [4] and recently q and all reference points in S are computed and then sorted for k-nearest neighbor (KNN) [7]. Since LOF shares a lot of in ascending order. The top k smallest reference points are similarities to KNN, this has motivated us to explore accel- then selected. This process should be repeated for all query eration of LOF on a GPU platform. points. The time complexity for this approach is O(n2). In this paper we propose a CUDA implementation for the LOF algorithm to accelerate the processing throughput of an intrusion detection system. We compare runtimes of LOF run on an NVIDIA G200 class of GPUs. We compare this execution to runtimes on an X86 CPU. Our results show that we can accelerate LOF by more than 100X when using the GPU. We also study the impact of three different classi- fication parameters that can impact the scalability of LOF GPU implementation in terms of performance: 1) the size of the input data set, 2) the number of neighbors, and 3) the feature space dimension. Our results show that changes to these parameters have only a small impact on the com- putation time on a GPU, whereas they have a much more dramatic effect on a CPU. Figure 1: Example of KNN search problem, with k The rest of this paper is organized as follows.