A Comparative Evaluation of Semi-Supervised Anomaly Detection Techniques
DEGREE PROJECT IN COMPUTER ENGINEERING, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020

REBWAR BAJALLAN
BURHAN HASHI

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Degree Project in Computer Science
Date: June 9, 2020
Supervisor: Pawel Herman
Examiner: Pawel Herman
Swedish title: En jämförande utvärdering av semi-övervakade tekniker för identifiering av uteliggande datapunkter

Abstract

As we enter the information age and the amount of data rapidly increases, detecting anomalies has become a necessity in many organizations, since anomalies often reveal useful information that in many cases can be critical for saving lives or catching impostors. The semi-supervised approach to anomaly detection, which is based on the premise that the user has no information about anomalies, has become widely popular since it is easier to model the normal state of a system than to obtain information about every anomalous behavior. Therefore, in this study we conduct a comparative evaluation of three semi-supervised anomaly detection techniques: the autoencoder, the local outlier factor algorithm, and the one-class support vector machine. The aim is to simplify the process of selecting the right technique when faced with similar anomaly detection problems of a semi-supervised nature.

We found that the local outlier factor algorithm was superior in performance on the electrocardiogram dataset (ECG5000), achieving a high precision and perfect recall. The autoencoder achieved the best performance on the credit card fraud dataset, although the remaining models also achieved a relatively high performance that did not differ much from that of the autoencoder.
However, it should be noted that the definition of performance differs because the characteristics of anomaly detection problems differ: specific problems might put a higher weight on detecting all anomalies, at the cost of an increase in normal data points falsely identified as anomalies.

Summary (Sammanfattning)

We are in the information age, and as the amount of data rapidly increases, the problem of detecting anomalies has become increasingly pressing in many organizations, since anomalies often reveal important information that in many cases can be decisive for saving lives or catching fraudsters. The semi-supervised method for anomaly detection is based on the premise that the user has no information about anomalies. This method has become widely popular because it is easier to model the normal state of a system than to obtain information about every anomalous state. Therefore, in this study we choose to conduct a comparative evaluation of the semi-supervised anomaly detection methods Autoencoder, Local outlier factor algorithm, and One-class support vector machine, to simplify the process of choosing the right algorithm when facing similar anomaly detection problems of a semi-supervised nature.

We found that the Local outlier factor algorithm performed best on the Electrocardiograms dataset (ECG5000), achieving a high precision and perfect recall. The Autoencoder was best on the credit card fraud dataset, although the other models performed relatively close to the Autoencoder. It should also be noted that the definition of performance can differ because the requirements of anomaly detection problems differ: specific problems may place a higher weight on detecting all anomalies, at the cost of an increase in falsely identified normal data points.

Contents

1 Introduction
  1.1 Problem Definition
    1.1.1 Scope
    1.1.2 Thesis Outline
2 Background
  2.1 Anomalies
    2.1.1 Anomaly Detection
    2.1.2 Anomaly Score
    2.1.3 Semi-supervised anomaly detection
  2.2 Support Vector Machine
    2.2.1 One-Class Support Vector Machine
  2.3 Local Outlier Factor Algorithm
  2.4 Artificial Neural Networks
    2.4.1 Feed-forward neural networks
  2.5 Autoencoders
    2.5.1 Reconstruction Error
    2.5.2 Training Autoencoders
    2.5.3 Autoencoders and Anomaly detection
  2.6 Related Work
3 Method
  3.1 Datasets
    3.1.1 ECG5000
    3.1.2 Credit Card dataset
  3.2 Models
    3.2.1 OC-SVM
    3.2.2 LOF algorithm
    3.2.3 Autoencoder
  3.3 Evaluation Metrics
4 Results
  4.1 Autoencoder
    4.1.1 Training reconstruction error
    4.1.2 Error threshold selection
    4.1.3 Testing reconstruction errors
    4.1.4 Autoencoder testing results
  4.2 OC-SVM
    4.2.1 OC-SVM testing results
  4.3 LOF algorithm
    4.3.1 Testing outlier factors
    4.3.2 LOF algorithm testing results
  4.4 Results summary
5 Discussion
  5.1 Limitations
6 Conclusions
  6.1 Further research
Bibliography

Chapter 1 Introduction

As the world becomes increasingly data-driven, the field of anomaly detection is on the rise and is starting to play a big role in many organizations. Anomaly detection is the process of detecting anomalies, which are observations that deviate from the other observations so much that they arouse suspicion of being the result of errors or fraud [1]. Detecting and analyzing anomalies is therefore an important task because it reveals useful information, which in many cases can be critical for catching impostors or saving lives. In the banking sector, anomaly detection has become an extremely important means of detecting and analyzing fraudulent credit card transactions, which assists in exposing impostors [1].
Anomaly detection has also become frequently used in hospitals for medical diagnosis, where anomalies could, for example, help to detect cancerous cells in medical images [1].

There are many techniques used in anomaly detection tasks, but one that has been on the rise in recent years is the use of artificial neural networks. These techniques are advantageous because they have proven to be good at modeling the complexity of real-world data [1]. Artificial neural network anomaly detection techniques have also proven to outperform traditional machine learning techniques as the scale of data increases [1].

The semi-supervised anomaly detection approach, which is based on the premise that the user has no information about possible anomalies, has become widely applicable. This approach has risen in popularity since it is easier to model the normal state of a system than to obtain information about every possible anomalous behavior that can occur [2]. An example is the detection of spacecraft faults [3], where faults may result in spacecraft incidents. Because such incidents rarely occur, it is easier to obtain data that reflects how the spacecraft operates in its normal state and to use the semi-supervised approach to detect abnormal spacecraft activity.

1.1 Problem Definition

This study aims to present a comparative evaluation of different semi-supervised anomaly detection techniques. The evaluation investigates which model performs best in detecting anomalies across a set of different datasets, in order to draw a generalizable conclusion about which model is preferable for problems involving similar datasets.

1.1.1 Scope

The question is investigated by evaluating the performance of the semi-supervised anomaly detection techniques Autoencoder, Local outlier factor, and One-class support vector machine.
The datasets used in the evaluation are the ECG5000 (electrocardiogram) dataset and the credit card dataset. The purpose of this study is to simplify the choice of technique for practitioners in credit card fraud analytics and healthcare who face similar anomaly detection problems of a semi-supervised nature.

1.1.2 Thesis Outline

The first chapter contains a basic introduction to the subject of this study together with the problem definition and the scope. The second chapter explores the field of anomalies and anomaly detection. It also explains the methods Autoencoder, Local outlier factor, and One-class support vector machine, and ends with a presentation of the relevant literature. Chapter three covers the method of the study: the datasets used, the implementation of the models, and the evaluation. The fourth chapter presents the results together with relevant figures, and chapter five contains a critical discussion of the results and the method from different perspectives. The study ends with chapter six, which gives conclusions and remarks about potential future work.

Chapter 2 Background

2.1 Anomalies

Anomalies, or outliers, are observations that deviate so significantly from the other observations that they arouse suspicion of having been generated by a different mechanism, sometimes as the result of errors or fraud [1]. An example is a fraudulent credit card transaction made with a stolen credit card [1]. Two important characteristics of anomalies are that they differ from the norm with respect to their features and that they are rare in the dataset compared with the normal instances [4].

Anomalies arise for different reasons depending on the context. Some reasons are system failures, fraudulent behavior, or malicious actions [1].
It is important to remember that not every anomaly is the result of error or fraud, because there are situations where data points are naturally different. For example, if you take the height of every student in a school, there will almost certainly be anomalies in that dataset that are not the result of any error. Nevertheless, the study of anomalies usually reveals exciting insights and conveys valuable information about the data [1].
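
The height example can be made concrete with a small sketch. It is only an illustration and not part of the method used in this study: it flags values that lie more than a chosen number of standard deviations from the mean (a simple z-score rule). The function name flag_unusual, the threshold of 2.5, and the height values are all hypothetical and chosen purely for illustration.

```python
from statistics import mean, stdev


def flag_unusual(values, threshold=2.5):
    """Return the values whose absolute z-score exceeds the threshold.

    The z-score rule is an assumed, illustrative criterion; it is not
    one of the techniques evaluated in this study.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical student heights in centimetres. The value 210 is a
# correct measurement of an unusually tall student, not an error.
heights_cm = [165, 170, 172, 168, 175, 171, 169, 174, 166, 173, 210]

print(flag_unusual(heights_cm))  # prints [210]
```

The flagged value is rare and different from the rest, which matches the two characteristics of anomalies described above, yet it is a genuine measurement. A rule like this can indicate that an observation is unusual, but deciding whether it stems from an error, fraud, or natural variation still requires knowledge of the context.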