VYTAUTAS MAGNUS UNIVERSITY FACULTY OF INFORMATICS DEPARTMENT OF APPLIED INFORMATICS

Aleksas Pantechovskis

Determining Criteria for Choosing Anomaly Detection Algorithm

Master final thesis

Applied informatics study programme, state code 6211BX012 Study field Informatics

Supervisor prof. dr. Tomas Krilavičius______(degree, name, surname) (signature) (date)

Defended prof. dr. Daiva Vitkutė-Adžgauskienė ______(Dean of Faculty) (signature) (date)

Kaunas, 2019

CONTENTS

ABBREVIATIONS AND TERMS ...... 1

ABSTRACT ...... 2

1 INTRODUCTION ...... 3

1.1 Concepts ...... 4

1.1.1 Unbounded and bounded data ...... 4

1.1.2 Windowing...... 4

1.1.3 Anomalies ...... 5

1.1.4 MacroBase terminology ...... 6

2 LITERATURE AND TOOLS REVIEW ...... 7

2.1 MacroBase ...... 7

2.1.1 Source code ...... 8

2.1.2 Architecture ...... 10

2.1.3 MacroBase SQL ...... 13

2.2 Outlier detection algorithms ...... 15

2.2.1 Percentile ...... 16

2.2.2 MAD ...... 17

2.2.3 FastMCD...... 17

2.2.4 LOF ...... 17

2.2.5 MCOD ...... 18

2.2.6 Isolation forest...... 18

2.3 Evaluation of anomaly detection quality ...... 20

2.4 Datasets ...... 21

2.5 Numenta Anomaly Benchmark ...... 24

2.5.1 NAB scoring method ...... 24

3 BENCHMARKING PLATFORM IMPLEMENTATION ...... 27

3.1 Anomaly detection algorithms ...... 30

4 EXPERIMENTS ...... 33

4.1 Time and memory performance ...... 34

4.2 Anomaly detection quality ...... 38

4.3 NAB results ...... 43

4.3.1 Hyperparameters tuning (using labels) ...... 47

4.3.2 MCOD hyperparameters tuning ...... 49

4.4 Performance and anomaly detection quality conclusions ...... 51

4.4.1 Percentile ...... 51

4.4.2 MAD ...... 51

4.4.3 FastMCD ...... 51

4.4.4 LOF ...... 51

4.4.5 MCOD ...... 52

4.4.6 iForest...... 52

5 RESULTS AND CONCLUSIONS ...... 53

5.1 Future works ...... 53

6 REFERENCES...... 54

ABBREVIATIONS AND TERMS

ADR Adaptive Damping Reservoir

AMC Amortized Maintenance Counter

AUC Area under curve

CSV Comma-separated values

FN False negatives

FP False positives

LOCI Local Correlation Integral

LOF Local Outlier Factor

MAD Median Absolute Deviation from median

MCD Minimum Covariance Determinant

MCOD Micro-cluster based Continuous Outlier Detection

ML Machine Learning

NAB Numenta Anomaly Benchmark

OS Operating system

TN True negatives

TP True positives

UDF User-Defined Function

UI User interface

ABSTRACT

Author Aleksas Pantechovskis

Title Determining Criteria for Choosing Anomaly Detection Algorithm

Supervisor prof. dr. Tomas Krilavičius

Number of pages 61

In today’s world there are vast amounts of data requiring automated processing: nobody can analyze them and extract useful information manually. One of the existing processing modes is anomaly detection: detecting failures, high traffic, dangerous states and so on. However, it often requires the developer or the user of such analysis systems to have a lot of knowledge on the subject, which makes it less accessible. One of the difficulties is the choice of a suitable algorithm and its parameters. The main goal of this work is to start creating guidelines or a decision tree that simplify choosing the most suitable anomaly detection algorithm depending on the dataset characteristics and other requirements. The project was proposed by SAP and inspired by the work of the Dawn research team from Stanford and their MacroBase system. In this work we review MacroBase architecture and functionality, describe commonly used real datasets for anomaly detection benchmarking, synthetic dataset generation methods and anomaly detection quality metrics, develop a benchmarking platform, and evaluate anomaly detection algorithms of different types: distance-based (MCOD), density-based (LOF), statistical (MAD, FastMCD, Percentile) and isolation-based (iForest).

1 INTRODUCTION

Data volumes generated by machines are constantly increasing with the rise of automation. Modern hardware is powerful enough to handle and generate a lot of data (for example, recordings from sensors or system events), network speeds allow collecting data from many devices around the world (such as web and mobile applications or IoT), and storage is cheap enough to keep terabytes of data [1]. However, this data is useless without proper automated processing: nobody can analyze it and extract useful information manually, because the data arrives faster than humans can read it. One of the existing processing modes is anomaly detection, for example, to detect failures, high traffic, dangerous states and so on. However, it often requires the developer or the user of such analysis systems to have a lot of knowledge on the subject, which makes it less accessible. One of the difficulties is the choice of a suitable algorithm and its parameters, which may be quite hard and so far is not fully researched. The main goal of this work is to start creating guidelines or a decision tree that simplify choosing the most suitable anomaly detection algorithm depending on the dataset characteristics and other requirements, such as whether it needs to be fast or how much memory is available. This is of course a very large space, so for now we start with several popular algorithms and common freely available datasets. The project was proposed by SAP and inspired by the work of the Dawn research team from Stanford and their MacroBase system [2], which aims to make fast data analysis more accessible. Originally the goal of our project was to develop a system that can prepare large data streams for user consumption (extract “summaries”) and visualize them in different ways, but later we decided to focus more deeply on a single, more specific topic for now. This work focuses on the following tasks:
1. Analysis and experiments with different outlier detection algorithms, and development of a benchmarking platform based on MacroBase (MacroBase is not strictly needed for analyzing outlier detection algorithms, but we already developed some tools there and it can become more useful in future work on explanations).
2. Implementation of popular outlier detection algorithms not provided with MacroBase.
3. Definition and evaluation of performance and anomaly detection quality characteristics for these algorithms.

1.1 Concepts

Some terms, such as “streaming”, are very ambiguous and used by different people to describe different things, which leads to misunderstandings. Therefore, we decided to adopt the data-related terminology from the Google Dataflow papers [3] [4].

1.1.1 Unbounded and bounded data

We use the terms unbounded and bounded data when describing infinite and finite data. Unbounded data is continuously growing data, which is never complete and cannot be loaded all at once, so we must process it incrementally, for example by splitting it into chunks or windows and possibly storing some state from the previously processed data. However, this type of processing is not specific to unbounded data: it can be used for bounded data too, depending on the size, task, algorithm, etc., for example if the data is too big to load into memory all at once.

1.1.2 Windowing

Figure 1 Example of fixed and sliding windows.

Windowing [3] [4] [5] refers to splitting data into chunks, usually based on the time dimension. Some systems also have tuple-based windowing, but the only difference is that they use the ordering of the elements instead of time. There are several types of windows:
• Fixed windows – windows with a fixed size, such as an hour.
• Sliding windows – windows with a fixed size and a slide period, such as hourly windows starting every minute (overlapping).
• Sessions – periods of activity terminated by a timeout or in some other data-specific way.
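
As an illustration of fixed and sliding windows, below is a minimal sketch computing which windows a timestamp falls into. The window size and slide period are arbitrary example values, and real systems also handle late data, watermarks and sessions in more elaborate ways:

import java.util.ArrayList;
import java.util.List;

public class WindowingExample {
    // Fixed windows: every timestamp belongs to exactly one window.
    static long fixedWindowStart(long timestamp, long windowSize) {
        return (timestamp / windowSize) * windowSize;
    }

    // Sliding windows: a timestamp belongs to every window of length `windowSize`
    // that starts at a multiple of `slide` and still covers the timestamp.
    static List<Long> slidingWindowStarts(long timestamp, long windowSize, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = (timestamp / slide) * slide;
        for (long start = lastStart; start > timestamp - windowSize; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long ts = 125; // e.g. seconds since some origin
        System.out.println(fixedWindowStart(ts, 60));        // 120
        System.out.println(slidingWindowStarts(ts, 60, 15)); // [120, 105, 90, 75]
    }
}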

1.1.3 Anomalies

An anomaly is any rare, unusual behavior in data that differs from the norm [6] [7]. The simplest example could be an extremely high temperature of a single device. An anomaly can include multiple data points, such as unusually low or high network traffic or CPU load during some long period (short spikes may happen regularly and are not considered anomalies). A similar term is outlier, an abnormal data point, often used interchangeably. There are three common types of anomalies:
• Point anomalies – an individual data instance is anomalous with respect to the rest of the data (for example, one point is far away from the rest). This is the simplest type of anomaly and the focus of the majority of research on anomaly detection. [7]

Figure 2 A simple example of point anomalies in 2-dimensional data. [8]

• Contextual/conditional anomalies – a data instance is anomalous only in a specific context. [7]

Figure 3 Example of a contextual anomaly in a temperature time series. [8]

• Collective anomalies – a collection of related data instances is anomalous, while individual instances from this collection may not be anomalous by themselves. [7]

Figure 4 Example of a collective anomaly. [8]

In some cases, algorithms designed for point anomalies can be used for contextual anomalies if we include the context as new features, such as the month number in Figure 3, as well as for collective anomalies after some data preprocessing such as correlation, aggregation or grouping. [6]

1.1.4 MacroBase terminology

MacroBase uses the term metric for numerical measurements (such as temperature, power drain, load) used for outlier detection, and explanatory attribute for metadata like device ID, model or location used for explaining the results (based on the detected outliers) [2]. The latter can be confusing because sometimes they simply call it “attribute”, while in other areas (such as databases) that word is often synonymous with any field/column/property, not just metadata. We try to avoid such confusion by not omitting the word “explanatory”.

2 LITERATURE AND TOOLS REVIEW

2.1 MacroBase

MacroBase1 is a data analysis engine for large datasets and unbounded data streams, specialized in finding and explaining unusual or interesting trends in data, such as that devices of some specific model are likely to have higher power drain. [2] One of the main motivations for creating MacroBase was to make fast data analysis simpler and accessible to more people. MacroBase provides a simple architecture/infrastructure designed with extensibility in mind, allowing easy customization and extension for different tasks. Figure 5 shows the standard pipeline suggested by MacroBase: input ingestion (reading from the data source) and any needed input transformation, classification (finding outliers), explanation/summarization of the outliers, and output of the results to the user. All components in the pipeline use the same interfaces for communication, so it is easy to add or remove them.

Figure 5 MacroBase default pipeline. [2]

In the simplest mode MacroBase lets the user choose target metrics, such as power drain or temperature, and “explanatory” attributes, such as device model and firmware, and uses unsupervised methods like MAD (Median Absolute Deviation from median) and MCD (Minimum Covariance Determinant) to classify the data points into inliers and outliers; after that it tries to explain the outliers using pattern mining algorithms like Apriori [9] and FP-Growth [10]. For more advanced usage it is possible to replace or customize any component in the source code. The modification is done by simply implementing a class with a suitable interface and a new pipeline that instantiates and glues together these classes (or the default ones). The paper claims that this is not difficult even for non-experts: students working part-time can implement and test a new operator in less than a week, and MacroBase maintainers need less than a day for a new pipeline. MacroBase focuses on outlier explanations, and in this project so far we were working mostly on outlier detection, so MacroBase probably did not help us much with our benchmark implementation (described in section 3) except for providing several simple components like config and CSV readers/writers, some ideas for a simple and extensible architecture (interfaces, usage of MacroBase DataFrames to pass data between components, etc.) and the MAD, MCD and Percentile classifier implementations. However, it can become more useful in future work on explanations/summarizers because it provides some visualizations for that and several different summarizers with some optimizations.

1 https://macrobase.stanford.edu

2.1.1 Source code

MacroBase is written mostly in Java and its source code is available in the GitHub repository (https://github.com/stanford-futuredata/macrobase) under the Apache 2.0 license [11], that is, modifications and commercial use are allowed if the original copyright and license notices are preserved. The project is somewhat lacking in documentation: there are only short instructions about building the project and a slightly outdated2 tutorial showing how to run the demo, as well as some notes about running benchmarks and tests (possibly outdated too, judging by the last update date – middle of 2016). However, the source code quality seems good: most of the important classes are documented using comments and Javadoc3, and classes are organized into packages. The Maven4 build system is used for building the project and running automated tests [12], so the build can be performed without any configuration by executing the Maven mvn package command. They also provide a Docker image5 with all necessary tools and a MacroBase build. Overall, combined with the detailed explanation of the main concepts and architecture in the paper [2], it is not difficult to get started working on this project. After building, the system can be used via a simple console runner (taking configuration from a YAML file) or MacroBase SQL. Also, it is possible to start an HTTP server and use the web interface, which allows selecting attributes and shows a list of results with plots. By default, it loads data from a PostgreSQL6 database, but it also supports MySQL7 and CSV, and there are contributed ingestors for some other data sources. In this mode, however, it uses older components which may lack some features or have lower performance. Nevertheless, it could be a good starting point to understand the basic ideas of MacroBase. Figure 6 shows the UI for selecting the data source (CSV file path or database address and query), metrics and explanatory attributes on the left side, and part of the results on the right side: some statistics (number of outliers, elapsed time), outlier groups that were created by the explanation algorithms and a plot showing the distribution of these groups.

2 A bug report with fix suggestions was submitted in the GitHub repository, https://github.com/stanford-futuredata/macrobase/issues/218, but nobody updated it yet.
3 https://en.wikipedia.org/wiki/Javadoc
4 https://maven.apache.org/index.html
5 https://macrobase.stanford.edu/docs/sql/setup/#docker
6 https://www.postgresql.org
7 https://www.mysql.com

Figure 6 MacroBase web UI

2.1.2 Architecture

Currently MacroBase has two main modules: macrobase-lib and macrobase-legacy. The latter is used by the default web UI but, as the name suggests, all new development seems to be focused on lib; it has a clearer API and better documentation, so we decided that it would be better to base our project on it instead of legacy. It still lacks some interesting legacy components though, such as the MAD and MCD classifiers and the Adaptable Damped Reservoir mentioned in the paper [2], but it should not be difficult to adapt them – most of the components are quite modular and do not have many dependencies. macrobase-lib has some data structures, such as DataFrame (rows with values, description of columns), which is used to pass data between all pipeline components, as well as interfaces or base classes for common components like the input ingestors, classifiers and summarizers mentioned in 2.1. It also has some implementations of input ingestors (only CSV), classifiers (only simple percentile and predicate classifiers, and cubed classifiers for aggregated data) and summarizers. The most important summarizer seems to be the one based on the FP-Growth [10] algorithm (with some optimizations). In the MacroBase paper [2] the authors said that they tried different algorithms and concluded that FP-Growth was fast and suitable for extensions. There are also summarizers based on the Apriori [9] algorithm, and recently there were some improvements and optimizations, so they are probably good for some cases too. For both summarizer groups there are classes with a generic implementation of the algorithm (FPGrowth and APrioriLinear) which are used by the summarizers. Most of the summarizers work on bounded data (“batch” in MacroBase terminology); the only summarizer for unbounded (“streaming”) data is the FPGrowth-based IncrementalSummarizer, which can be wrapped into a WindowedOperator to use “time”-based windows (“time” can be any increasing attribute, such as an ID). Of course, it is also possible to use batch summarizers for unbounded data, treating each batch separately without keeping “summarization” state between them.

Figure 7 MacroBase summarizers class diagram

Figure 8 MacroBase summarizers extended class diagram with dependencies

Some of the main components from the paper [2] are available only in the macrobase-legacy module. Among them are the MAD and MCD classifiers from the paper [2] – the MAD and MinCovDet classes in Figure 9. MCD seems to be the only MacroBase classifier that can work with multiple metrics.

Figure 9 macrobase-legacy classifiers class diagram

All classifiers here extend abstract class BatchTrainScore ("scorer") and get instantiated in MacroBaseConf constructTransform method (called inside BatchScoreFeatureTransform and EWFeatureTransform constructors) according to the provided config. The scorer is used in BatchScoreFeatureTransform and EWFeatureTransform classes, which are used in pipelines.

FeatureTransform ft = new BatchScoreFeatureTransform(conf);
ft.consume(data);

OutlierClassifier oc = new BatchingPercentileClassifier(conf);
oc.consume(ft.getStream().drain());

Summarizer bs = new BatchSummarizer(conf);
bs.consume(oc.getStream().drain());
Summary result = bs.summarize().getStream().drain().get(0);

Figure 10 Code from legacy batch pipeline

As we can see from Figure 10, the pipeline passes scores to an OutlierClassifier, which determines outliers using either a static threshold or a percentile. BatchScoreFeatureTransform simply passes input to the scorer and returns the output (score values). EWFeatureTransform is for unbounded data and uses FlexibleDampedReservoir (the ADR described in the paper [2], Algorithm 1, based on A-Chao [13]) to sample/decay the training data used by the scorer. When receiving a new data portion, EWFeatureTransform inserts all data records into the reservoir, in addition to passing the input to the scorer and returning the output. Training of the scorer happens periodically (as specified in the config) by retrieving a data sample from the reservoir. The reservoir is also decayed periodically. Another interesting component in macrobase-legacy is AMC (Amortized Maintenance Counter, Algorithm 3 in [2]). It is a probabilistic data structure for maintaining a list of the most frequent items (integer numbers) in a stream, without storing a separate counter for each distinct number. It is used in the legacy MacroBase streaming summarizer to maintain the most frequent explanatory attributes among inliers and outliers (the attributes are strings, but they are encoded as integers during ingestion). It is similar to the Space-Saving algorithm from [14] but has better performance in exchange for bigger memory consumption. The Dawn team was also working on another similar data structure for heavy-hitter sketching – the Weight-Median Sketch [15] – but it is not integrated into MacroBase yet, only a C++ implementation was published so far (https://github.com/stanford-futuredata/wmsketch).
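
For intuition about the reservoir idea, below is a minimal sketch of plain reservoir sampling. This is the classic unweighted Algorithm R, not the exponentially damped ADR/A-Chao variant that MacroBase actually uses, which additionally weights and periodically decays items:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Classic unweighted reservoir sampling (Algorithm R):
// keeps a uniform random sample of fixed size from a stream of unknown length.
public class Reservoir<T> {
    private final List<T> sample = new ArrayList<>();
    private final int capacity;
    private final Random random = new Random();
    private long seen = 0;

    public Reservoir(int capacity) {
        this.capacity = capacity;
    }

    public void insert(T item) {
        seen++;
        if (sample.size() < capacity) {
            sample.add(item);
        } else {
            // Replace an existing element with probability capacity / seen.
            long j = (long) (random.nextDouble() * seen);
            if (j < capacity) {
                sample.set((int) j, item);
            }
        }
    }

    public List<T> getSample() {
        return sample;
    }
}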

2.1.3 MacroBase SQL

One of the biggest features recently added to MacroBase is MacroBase SQL. It allows querying the data source using standard SQL syntax extended with some MacroBase-specific operators. Currently it does not add any new features compared to the web UI and console runners, but it can make dataset exploration more convenient. For example, it allows automatically using all suitable explanatory attributes or experimenting with different parameters (classification, min support, ...) without editing the configs and restarting the program. It is also possible to use MacroBase SQL with an Apache Spark cluster (https://macrobase.stanford.edu/docs/sql/spark/). Right now it is available only via a console application that works similarly to other SQL shells, but they are also working on a graphical UI which should be released soon. Also, currently it supports input only from CSV and only one APrioriLinear batch summarizer, but it should be easy to extend because it uses the same interfaces as other MacroBase pipelines. The two main non-standard operators are DIFF and SPLIT. DIFF takes two sets, outliers and inliers (that is, the result of the MacroBase classification step), and outputs explanations using the specified attributes (or * to use all suitable attributes); it is also possible to specify minimum support, minimum ratio and the ratio metric, the same as in the configs for all other pipelines. SPLIT is just a shortcut that allows writing

SELECT * FROM DIFF (SPLIT flights WHERE DEPARTURE_DELAY > 10.0) ON *;

instead of

SELECT * FROM DIFF
  (SELECT * FROM flights WHERE DEPARTURE_DELAY > 10.0) outliers,
  (SELECT * FROM flights WHERE DEPARTURE_DELAY <= 10.0) inliers,
ON *;

All commands can be composed like in standard SQL, so we can easily filter, join, use subqueries everywhere, etc. For example, the output of DIFF is a set of rows (one row for each explanation) with columns for all attributes specified in ON (NULL if an attribute is not used in the explanation), as well as support and ratio columns. This is the reason why we need to use "SELECT * FROM DIFF ..." to see the result. In the current implementation there is no explicit support for classifiers like in other MacroBase pipelines, but it seems they can be easily added via User-Defined Functions (UDF), except for classifiers working on multiple metrics. A UDF can be used to convert column values, such as "SELECT normalize(someColumn), ...". UDFs need to implement the MBFunction interface from macrobase-lib, which contains a method receiving an array with all values of the column. However, despite being called "user", currently all possible UDFs (normalize and percentile) are hard-coded in MBFunction’s getFunction static method in macrobase-lib. Hopefully this will be improved in the future.

2.2 Outlier detection algorithms

An outlier/anomaly detection algorithm is an algorithm that finds anomalies in data. Depending on the algorithm and data, it can process points one by one or collected into bigger groups. Often it must process data quickly, to detect issues as early as possible (to prevent damage, fraud, etc.) and to keep up with new data (for unbounded data). Different algorithms have different properties, such as whether they need data for training, whether it needs to be labeled (supervised) and how they work for different anomaly and data types. There are many different approaches to outlier detection, such as:
• Statistical methods, for example, trying to fit data into some standard distribution.
• Distance-based. An object is considered an outlier if there are fewer than k objects in radius R around the object (Figure 11); a brute-force sketch of this definition is shown after this list.

Figure 11 Distance-based outlier detection example

• Density-based, like the LOF (Local Outlier Factor) algorithm. Distances to the k nearest neighbors are used to estimate an object’s density, and objects with relatively low density are considered outliers.
• Isolation methods, like iForest, separating outliers from the rest of the data.
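
For illustration, here is a minimal brute-force sketch of the distance-based definition above (O(n²) distance computations; the R and k values a caller would pass are hypothetical and dataset-dependent):

// Brute-force distance-based outlier detection:
// a point is an outlier if fewer than k other points lie within radius R.
public class DistanceBasedOutliers {
    static boolean[] findOutliers(double[][] points, double r, int k) {
        boolean[] outlier = new boolean[points.length];
        for (int i = 0; i < points.length; i++) {
            int neighbors = 0;
            for (int j = 0; j < points.length && neighbors < k; j++) {
                if (i != j && euclidean(points[i], points[j]) <= r) {
                    neighbors++;
                }
            }
            outlier[i] = neighbors < k;
        }
        return outlier;
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}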

In this work so far, we analyzed and tested the outlier detection algorithms provided with MacroBase, as well as some other algorithms that we added to MacroBase (section 3.1): MCOD, iForest and LOF.

Algorithm | Speed | Memory | Notes
Percentile | O(n) | O(1) | Works only on univariate data.
MAD | Training: O(n*log(n)) (sorts data); Scoring: O(n) | Training: O(1)8; Scoring: O(1) | Works only on univariate data.
MCD (MinCovDet, FastMCD) | Training: O(h + nmetrics^2*n + nmetrics^3 + nmetrics^3*n + n*log(n)), where h = a(n + nmetrics + 1), 0 < a < 1; Scoring: O(n*nmetrics^2) | Training: O(n + nmetrics); Scoring: O(nmetrics) | Works only on multivariate data. Not deterministic, runs training multiple times.
LOF | Training: O(n^2) or O(n*log(n)) [16]; Scoring: O(n) | Training: depends on implementation, but high; Scoring: O(1) | Has some hyperparameters, but they do not affect speed and quality as much as in MCOD.
MCOD | ? Depends on data and hyperparameters. Slow, but usually faster than LOF. | ? Can be very high, depending on hyperparameters. | Depends on the R, k hyperparameters very much; they must be adjusted for each dataset, no good defaults.
Isolation forest (iForest) | Training: O(t*ψ^2); Scoring: O(n*t*ψ) | Training: O(t*ψ); Scoring: O(1) | Not deterministic. t – number of trees, ψ – subsample size.

Table 1 Summary of the outlier detection algorithms analyzed in this paper.

2.2.1 Percentile

The Percentile classifier is one of the simplest outlier detection algorithms available in MacroBase. It simply reports too low and/or too high values, for example values that lie in the lowest 1% (below the 1st percentile) of the given data portion. There are selection algorithms like Quickselect with O(n) average time complexity [17], so it should be fast even for very big datasets.
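
A minimal sketch of such a classifier, assuming we flag the lowest percentile of a batch (sorting is used for clarity instead of Quickselect, so it is O(n*log(n)) rather than O(n)):

import java.util.Arrays;

// Flags values at or below the given low percentile as outliers.
public class PercentileClassifierSketch {
    static boolean[] classifyLow(double[] values, double percentile) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        // Index of the cutoff value, e.g. percentile = 1.0 -> lowest 1%.
        int cutIndex = (int) Math.floor(values.length * percentile / 100.0);
        double threshold = sorted[Math.min(cutIndex, sorted.length - 1)];
        boolean[] outlier = new boolean[values.length];
        for (int i = 0; i < values.length; i++) {
            outlier[i] = values[i] <= threshold;
        }
        return outlier;
    }
}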

8 Actually, it is O(n) in MacroBase because it copies the data to avoid modifications.

2.2.2 MAD

Another simple way to find outliers in univariate data is the Median Absolute Deviation from the median. It is more robust than some other statistical measures (such as the standard deviation) [18] and very fast. To calculate the MAD of a dataset (or some part of it) we need to:
1. Find the median.
2. Calculate the deviation from the median for each element: |x − median(x)|.
3. Find the median of these deviations: mad = median(|x − median(x)|).
In the MacroBase implementation the median and MAD values are calculated on the training set and then the score for each element is evaluated as

score = |x − median(x_train)| / mad_train
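
A minimal sketch of this scoring scheme (the degenerate case mad = 0 is ignored for brevity):

import java.util.Arrays;

// MAD-based scoring: score = |x - median(train)| / mad(train).
public class MadScorer {
    private double median;
    private double mad;

    public void train(double[] trainData) {
        median = medianOf(trainData.clone());
        double[] deviations = new double[trainData.length];
        for (int i = 0; i < trainData.length; i++) {
            deviations[i] = Math.abs(trainData[i] - median);
        }
        mad = medianOf(deviations);
    }

    public double score(double x) {
        return Math.abs(x - median) / mad;
    }

    private static double medianOf(double[] values) {
        Arrays.sort(values);
        int mid = values.length / 2;
        return values.length % 2 == 1 ? values[mid] : (values[mid - 1] + values[mid]) / 2.0;
    }
}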

2.2.3 FastMCD

MCD (Minimum Covariance Determinant) is a more complex outlier detection algorithm for multivariate data. According to the MacroBase paper [2], this implementation is based on FastMCD. It is approximate and runs the training stage multiple times on random subsets until it achieves the specified delta. It works only with multivariate data, probably because it calculates the Mahalanobis distance. It is also very fast unless the dataset has a lot of metric columns: in Table 1 there are some quadratic and cubic components in the training time complexity (finding the covariance matrix, matrix inversion, calculating the determinant, etc.), but only in the number of metrics, so in practice with a small number of metrics it grows almost linearly with the training set size (as we show in section 4.1).

2.2.4 LOF

LOF (Local Outlier Factor) is a density-based outlier detection algorithm [16]. LOF is an old algorithm (2000) and there are many modifications/extensions, such as [19], [20], possibly achieving better results and performance. So far, we looked only at the standard LOF. It has quadratic time and memory complexity for the training step in the implementation we used (with a grid data structure for distances), so it should not be used with big training sets. According to [16], other approaches can be used to achieve better complexity for this step (but worse for the scoring step), such as indexing with O(n*log(n)). Also, it is possible to parallelize it.

2.2.5 MCOD

MCOD (Micro-cluster based Continuous Outlier Detection) is a recently developed (2015) distance-based outlier detection algorithm designed for unbounded data streams. It uses micro-clusters (of at least k + 1 points) with radius R / 2 to achieve better performance (fewer distance computations, because the micro-cluster centers can be used) and to save some memory. [21]

Figure 12 Example of micro-clusters for k = 4

In the evaluation in [22] it achieved the best performance among similar algorithms. In our experiments (section 4) it was faster than LOF, but still quite slow. Besides speed, another issue with this algorithm is that we need to adjust the distance-based R and k hyperparameters for each dataset; there are no good default hyperparameters that work well on all datasets. In section 4.3.2 we investigated some ways of setting them automatically.

2.2.6 Isolation forest

Isolation forest (iForest) is another recently developed anomaly detection algorithm (2008-2012); it uses a novel isolation-based approach without distance or density measurements. [23] It is designed for high-dimensional data and works by partitioning elements using multiple random decision trees. The elements that require fewer splits to become isolated are more likely to be outliers (Figure 13). In our experiments (section 4) it is faster than LOF and MCOD, and usually shows better quality on high-dimensional data.
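
To make the scoring step concrete, the sketch below converts an average path length over the trees into the anomaly score s(x) = 2^(−E(h(x)) / c(ψ)) described in the iForest paper [23], where c(ψ) is the average path length of an unsuccessful BST search; the tree construction itself is omitted, so this is only an illustrative fragment:

// Converts the average path length of a point across all isolation trees
// into an anomaly score in (0, 1); values close to 1 indicate anomalies.
public class IsolationForestScore {
    private static final double EULER_MASCHERONI = 0.5772156649;

    // c(psi): average path length of an unsuccessful search in a BST of size psi.
    static double averagePathLength(int psi) {
        if (psi <= 1) return 0;
        if (psi == 2) return 1;
        double harmonic = Math.log(psi - 1) + EULER_MASCHERONI;
        return 2 * harmonic - 2.0 * (psi - 1) / psi;
    }

    static double anomalyScore(double avgPathLength, int subsampleSize) {
        return Math.pow(2, -avgPathLength / averagePathLength(subsampleSize));
    }
}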

Figure 13 Isolation forest example. [24]

2.3 Evaluation of anomaly detection quality

There are many ways to measure anomaly detection quality (how good the results are) if we have a dataset with known (labeled) anomalies. The most obvious way is simple accuracy:

accuracy = (TP + TN) / count

But it is not a good measure for anomaly detection because of the class imbalance: the number of anomalous elements is very small (otherwise they would not be anomalous) and the accuracy will be high even if we skip all anomalies and report all elements as inliers. A better metric is the F1-score [25]:

precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 * precision * recall / (precision + recall)

However, one issue with metrics like this is that most of the algorithms output a score and we need to choose a threshold to convert the scores into binary results. One popular metric that uses all thresholds is the AUC (Area Under Curve) of the ROC or PR (Precision-Recall) curves [26].
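
A minimal sketch computing these confusion-matrix metrics from binary ground-truth and predicted labels (score thresholding and AUC computation are omitted):

// Precision, recall and F1 from binary ground truth labels and predictions.
public class BinaryMetrics {
    static double[] precisionRecallF1(boolean[] actualOutlier, boolean[] predictedOutlier) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < actualOutlier.length; i++) {
            if (predictedOutlier[i] && actualOutlier[i]) tp++;
            else if (predictedOutlier[i] && !actualOutlier[i]) fp++;
            else if (!predictedOutlier[i] && actualOutlier[i]) fn++;
        }
        double precision = tp + fp == 0 ? 0 : (double) tp / (tp + fp);
        double recall = tp + fn == 0 ? 0 : (double) tp / (tp + fn);
        double f1 = precision + recall == 0 ? 0 : 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f1 };
    }
}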

The ROC curve consists of 2D points for each distinct threshold (score), with TP Rate = TP / (TP + FN) on the Y axis and FP Rate = FP / (FP + TN) on the X axis; the PR curve is the same but with Precision and Recall on the axes. The scores do not have to be normalized to some specific range (like [0; 1]), because we do not need to do anything with them except sorting. An algorithm that does not produce any useful results (such as outputting random scores without analyzing the data) will have ROC AUC close to 0.5 (a diagonal line) and PR AUC close to 0. PR curves are sometimes considered a better metric for imbalanced classes [27, 28]; however, according to Davis and Goadrich [28], “a curve dominates in ROC space if and only if it dominates in PR space”. Another issue is how to decide which parameters (like k and R for distance-based algorithms) to use for each algorithm on each dataset. In the Goldstein paper [6] they used mostly kNN algorithms and averaged results for different values of k between 10 and 50. In our case, however, each algorithm has different parameters, not just k, so it looks like the only option for comparing all algorithms is to simply choose the best parameters for each algorithm on each dataset. We used Grid Search [29] to find good parameters automatically.

2.4 Datasets

We used most of the freely available UCI9 and OpenML10 datasets from [30], using the same subsets and classes for outliers and inliers as in [30], as well as the modified Shuttle dataset from [6] and the Yahoo S5 dataset for univariate data. Several datasets (German Credit, Magic Gamma, Mushroom) were not included in this work because they require additional processing (section 3.3.2 of [31]) to decrease the number of anomalies, which we have not implemented yet (originally these datasets contain 30-48% samples of the chosen “anomalous” classes, making them not suitable for anomaly detection). According to [31], the Yeast dataset may not be a good choice for anomaly detection benchmarking because it is too difficult (all algorithms fail on it).

Dataset | Samples | Numeric dims | Categ. dims | Anomalies %
Abalone | 2K | 7 | 1 (3 one-hot) | 1.51%
Car | 2K | 0 | 6 (21 one-hot) | 3.76%
CovType | 286K | 54 (44 binary) | 0 | 0.96%
Mammography | 11K | 6 | 0 | 2.32%
Shuttle | 12K | 9 | 0 | 7.02%
Shuttle-goldstein11 | 46K | 9 | 0 | 1.89%
Thyroid | 3K | 21 (15 binary) | 0 | 2.25%
Wine | 5K | 11 | 0 | 0.51%
Yahoo S5 A112 | 1.4K, 67 files (two 741) | 1 | 0 | 0-15%
Yeast | 1K | 8 | 0 | 4.62%
Table 2 Real datasets used in this work.

Some datasets contain categorical attributes; we converted them to numerical using one-hot encoding like in [30]. The results on the Car dataset, which contains only categorical features, were higher than the baseline (random choice without any data analysis) after such conversion, but on Abalone the results became worse after adding its categorical feature (Table 3), so we did not include this attribute during the experiments.

9 https://archive.ics.uci.edu/ml
10 https://www.openml.org
11 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OPQMVF
12 https://webscope.sandbox.yahoo.com/catalog.php?datatype=s&did=70

Algorithm | PR AUC on Abalone without categ. | PR AUC on Abalone with categ.
MCOD | 0.16 | 0.14
LOF | 0.20 | 0.06
FastMCD | 0.18 | 0.18
Table 3 Abalone dataset results with the categorical attribute converted via one-hot encoding.

Most of these datasets contain a very small number of samples or features, so for some experiments (such as time and memory performance scalability) synthetically generated datasets are needed. There are many ways to generate synthetic data. The simplest way is to use synthetic data generators like Mulcross13; however, they can be too simple and not a good choice for anomaly detection quality evaluation [31]. Another way is presented in the Emmott et al. meta-analysis paper [31]: the datasets are generated by modifications of real datasets affecting some of the characteristics like frequency of anomalies, presence of irrelevant features, swamping and clusteredness. In this work we currently used only Mulcross datasets for scalability experiments because of their simplicity. The Mulcross generator allows specifying the desired percentage of anomalies, the number of samples and dimensions, the distance between clusters and the number of anomalous clusters. It generates datasets with one big cluster of inliers and dense anomaly clusters (Figure 14). We were also looking into generating datasets similar to [30] (producing more scattered anomalies like in Figure 15 using Student's t distributions) and the dataset modification methods from [31], but have not implemented them yet.

13 https://github.com/icaoberg/multout

Figure 14 Examples of datasets generated by Mulcross generator using different distance and number of clusters parameters. The maximum number of clusters is limited (like in the bottom-right example) depending on the number of dimensions.

Figure 15 Example of a dataset generated based on Student's T distributions using the method described in [30].

At the beginning of our work we were also trying to find suitable datasets for MacroBase explanation, and it turned out to be a much more difficult task because most anomaly datasets contain only numerical measurements (such as the amount of traffic), sometimes even without attribute descriptions, while for MacroBase explanations we need some categorical attributes with values shared among elements (usually strings like names, models, IDs).

2.5 Numenta Anomaly Benchmark

Numenta Anomaly Benchmark (NAB) is a benchmark for anomaly detection in unbounded data, originally developed for the evaluation of the Numenta HTM algorithm. It consists of the benchmark itself (algorithm output scoring methods designed for unbounded data and implementations of several algorithms) and datasets, mostly real datasets but also some artificially generated ones for anomalous behaviors missing there. [32] [33] [34] All NAB datasets are univariate (one numerical metric and a time column) and have 1-22K rows. NAB artificial datasets include mostly collective/contextual anomalies, for example, periodic spikes and then suddenly a flat line instead of a spike (the second dataset in Figure 16). Also, there are several datasets without anomalies.

Figure 16 Example of NAB artificial datasets with collective/contextual anomalies.

Real NAB datasets contain many different anomalous behaviors, from simple point anomalies to collective/contextual anomalies. NAB may be useful for us: we can use it as a standalone benchmark or select some datasets that we find interesting (unique behavior not encountered in other datasets we have, etc.) to use with other evaluation methods. However, it includes only univariate data and focuses on time series, while all the algorithms we used in this work are not designed for time series, so it would be interesting to compare them with other methods that were specifically designed for time series, such as ones based on ARIMA [35] or neural networks [36]. NAB is published under the AGPL 3.0 license [37], so it could be difficult to use its source code or data as part of some closed-source product or a project with a different license.

2.5.1 NAB scoring method

NAB uses its own method for evaluating anomaly detection algorithms. It is supposed to be more suitable for the evaluation of algorithms working on unbounded data than standard methods like AUC and F1, because those do not incorporate time in the calculations. [38] Features of this scoring system:
• Rewards early predictions by using a bigger anomaly window instead of labeling only anomaly points.
• Uses a sigmoid function allowing a higher score for earlier detection within the anomaly window (only the earliest detection is counted, the rest are ignored).
• Allows setting the weights of TP/TN/FN/FP, for example to reward a low amount of FP.
• Chooses the optimal threshold automatically, but the same one for all data files.

2.5.1.1 Labels

During manual labeling the Numenta team marks only the anomaly start point. Then they create an anomaly window by including elements around that point (the total window length of all anomalies in a file is 10% of the file length). So, in the labels file there are two timestamps for each anomaly: Start, a bit before the starting point of the anomaly, and End, a bit after that point.

2.5.1.2 Score calculation

Raw score calculation for each data file14:

score_raw = TP_score * TP_weight + FP_score * FP_weight + FN_score * FN_weight

All raw score components (TP_score, FP_score, FN_score) are set to 0 initially and changed as follows during the traversal of data points:
TP_score – increased once for the first TP in each anomaly window. Depends on the position: earlier detections result in a higher score.
FP_score – decreased for each FP point. Depends on the position: false detections far away from the anomaly window receive a bigger penalty.
FN_score – decreased once for each anomaly window without a TP.
From this we can see that FP can affect the score much more than FN, because each FP point is penalized while an FN is penalized only once per window (even though the FP weight is lower, 0.11 in the standard profile). An algorithm that reports many FP (even if only on several data files) will get a much worse score than the Null “detector” (which has only TN and FN). All scores involving positions are calculated using a scaled sigmoid function, rewarding TPs within the anomaly window as follows [33]:

14 Based on source code from https://github.com/numenta/NAB/blob/4466b6fd5ee6a172abccf280e783cc42632c3e49/nab/scorer.py#L167

• A relative position of −3.0 is the far-left edge of the anomaly window and corresponds to a score of 2 * sigmoid(15) − 1.0 ≈ 0.999999. This is the earliest TP possible for a given window; an earlier detection is an FP.
• A relative position of −0.5 reflects a slightly later detection and corresponds to a score of 2 * sigmoid(0.5 * 5) − 1.0 ≈ 0.84828.
• A relative position of 0.0 is the right edge of the window and corresponds to a score of 2 * sigmoid(0) − 1 = 0.0. Any detection beyond this point is scored as an FP.
• Relative positions > 0 correspond to FPs increasingly far away from the right edge of the window. A relative position of 1.0 is past the back edge of the window and corresponds to a score of 2 * sigmoid(−5) − 1.0 ≈ −0.98661.
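
These values come from a scaled sigmoid of the relative position; a minimal sketch of that weighting function is below (reconstructed from the listed examples, so the constant 5 and the sign convention are assumptions rather than quoted from the NAB source):

// Scaled sigmoid used to weight detections by their relative position
// within (or after) the anomaly window: close to 1 near the left edge,
// 0 at the right edge, negative for false positives past the window.
public class NabSigmoid {
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    static double positionScore(double relativePosition) {
        return 2.0 * sigmoid(-5.0 * relativePosition) - 1.0;
    }

    public static void main(String[] args) {
        System.out.println(positionScore(-3.0)); // ~0.999999
        System.out.println(positionScore(-0.5)); // ~0.84828
        System.out.println(positionScore(0.0));  //  0.0
        System.out.println(positionScore(1.0));  // ~-0.98661
    }
}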

Profile | TP weight | FP weight | FN weight
Standard profile | 1 | 0.11 | 1
Reward low FP | 1 | 0.22 | 1
Reward low FN | 1 | 0.11 | 2
Table 4 NAB scoring profiles.

Raw scores are normalized using the Null detector as a baseline (0), while still allowing negative scores (for example, if there are a lot of FP). Score normalization formulas15:

PerfectScore = TotalAnomaliesCount * TP_weight = 116 * TP_weight
BaseScore = NullDetectorScore_raw (−116 for the standard profile)
score_final = 100 * (TotalScore_raw − BaseScore) / (PerfectScore − BaseScore)
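
As a worked example under the standard profile (TP weight 1, so PerfectScore = 116 and BaseScore = −116): a detector with TotalScore_raw = 0 gets score_final = 100 * (0 − (−116)) / (116 − (−116)) = 100 * 116 / 232 = 50, the Null detector gets 100 * (−116 − (−116)) / 232 = 0, and a perfect detector gets 100 * (116 − (−116)) / 232 = 100.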

15 Based on source code from https://github.com/numenta/NAB/blob/4466b6fd5ee6a172abccf280e783cc42632c3e49/nab/runner.py#L223

3 BENCHMARKING PLATFORM IMPLEMENTATION

Our benchmarking platform consists of the three main components (steps) shown in Figure 17:
1. Preparation of datasets and algorithm configurations.
2. Execution of the algorithm with the specified configuration, with time and memory measurements.
3. Evaluation of the execution results, generation of plots, reports, etc., depending on the experiment.

Figure 17 Benchmarking platform architecture. https://bit.ly/2IH55N8

The execution is performed by our Java benchmark based on MacroBase. We forked the MacroBase project on GitHub and created a separate Git branch with a new Maven module for the benchmark and other extensions; this allows us to easily merge changes from the original project. The benchmark receives the anomaly detection algorithm and dataset configuration, runs the algorithm and saves the algorithm output, elapsed time and peak memory usage. Also, it is possible to run Grid Search to find the best combination of algorithm hyperparameters from lists of possible hyperparameter values. We tried to keep this component as simple as possible, producing only the necessary output and performing all quality evaluations, normalizations, etc. after the execution using separate scripts. Otherwise we would need to re-execute the benchmark for all affected algorithms and datasets each time we want to create or update some plot, use another quality metric and so on; it would also make the implementation more complicated and difficult to maintain and extend. We use Python for the data preparation and evaluation scripts because it is an easy to use high-level programming language, it is popular in ML and there are many libraries for data processing, classification evaluation and plots. R or Matlab/Octave could be a good choice too because they contain many packages for statistical operations. The architecture of our platform allows using any tools for the preparation and evaluation steps, not limiting us to any single programming language or framework. Of course, sometimes it can be easier to use the same framework to be able to share common code, but most of the scripts so far are quite small and rely on easy to use libraries like scikit-learn, so currently it is not an issue for these tasks.
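
Conceptually, the Grid Search step enumerates every combination of the listed hyperparameter values and keeps the best-scoring one. A minimal sketch of that idea follows; the evaluate function and the score-maximization criterion are illustrative assumptions, not our benchmark's actual API:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Exhaustive grid search over lists of candidate hyperparameter values.
public class GridSearchSketch {
    static Map<String, Object> findBest(Map<String, List<Object>> grid,
                                        ToDoubleFunction<Map<String, Object>> evaluate) {
        // Build all parameter combinations.
        List<Map<String, Object>> combinations = new ArrayList<>();
        combinations.add(new HashMap<>());
        for (Map.Entry<String, List<Object>> param : grid.entrySet()) {
            List<Map<String, Object>> expanded = new ArrayList<>();
            for (Map<String, Object> partial : combinations) {
                for (Object value : param.getValue()) {
                    Map<String, Object> next = new HashMap<>(partial);
                    next.put(param.getKey(), value);
                    expanded.add(next);
                }
            }
            combinations = expanded;
        }
        // Evaluate each combination and keep the best one.
        Map<String, Object> best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map<String, Object> candidate : combinations) {
            double score = evaluate.applyAsDouble(candidate); // e.g. PR AUC on the dataset
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }
}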

Figure 18 UML class diagram of configuration and result data classes. https://bit.ly/2IH55N8

The input and output data of the benchmark (step 2) are shown in Figure 18. It receives the configurations of the dataset (only the required information like the file or database path and column names) and the algorithm (ID, parameters and optionally a Grid Search configuration), and outputs time, memory usage and the algorithm output (as a CSV file), as well as the initial configuration and the final algorithm parameters (the initial parameters merged with the Grid Search result, if it was used), to be able to refer to them during the results evaluation. Grid Search should not be used in the same execution (process) for experiments where we care about memory measurements, because it runs the algorithm many times, possibly allocating lots of memory, and in the current implementation the peak memory measurement is not reset; time measurements can also be affected because of Java Garbage Collection. So, the simplest way for such experiments is to run Grid Search and then create configuration files without Grid Search using the parameters that were found. We are also looking into ways to make this more convenient, such as adding an option to rewrite the configuration files automatically. Currently we use Java MemoryPoolMXBeans to measure the heap peak memory usage during the classification. In general, reliable memory measurement can be quite difficult in Java because of the GC [39]. But in our case it may be easier (at least for batch mode) because we only have a benchmarking program that reads data and executes an algorithm, it is not a small part of some bigger program (micro-benchmarking), so even such a simple implementation should work well enough. One possible improvement is to also measure the memory usage before the classification and then use the difference between these two values (this was done in [30]), however in our case it probably would not change much, because in [30] it was done mainly to minimize the difference between the platforms (Python, R, Matlab).
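
A minimal sketch of heap peak measurement with MemoryPoolMXBeans is shown below; the pools summed and the reset points may differ from what our benchmark actually does:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class PeakHeapMeter {
    // Call before the classification step to clear previous peaks.
    static void resetPeaks() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            pool.resetPeakUsage();
        }
    }

    // Sum of peak used bytes across all heap memory pools since the last reset.
    static long peakHeapBytes() {
        long peak = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                peak += pool.getPeakUsage().getUsed();
            }
        }
        return peak;
    }
}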

3.1 Anomaly detection algorithms

In order to make the performance comparison fair, the algorithms should be implemented efficiently (ideally with the same level of optimization) using the same programming language and framework [40]. Unfortunately, not all our algorithms are currently well optimized, so in the future this should be resolved, for example by porting implementations from other well-known libraries. Replacing the benchmark (step 2) with some other platform is also possible; however, it can be more difficult when we extend it for streaming and explanations, which are among the future goals, because most of the available frameworks for anomaly detection do not provide this functionality. For now, we tried to focus more on quality evaluation than on time and memory performance. We added these outlier detection algorithms to our platform:
• LOF (Local Outlier Factor)
• MCOD (Micro-cluster based Continuous Outlier Detection)
• Isolation forest (iForest)
• LOCI (Local Correlation Integral) – not suitable for practice, O(n^2) memory (for the whole batch, it is not possible to use a small training set) and O(n^3) time, so it was not used in this work. The approximate aLOCI version may be better.

We also used the algorithms that are provided with MacroBase:
• MAD (Median Absolute Deviation from median)
• MCD (Minimum Covariance Determinant)
• Percentile

Some algorithms provided with MacroBase were not used:
• Some simple algorithms (using quantiles or specified predicates) for “cubed” data/grouped attributes (count, mean, std, …). We have not tested them because we do not have suitable datasets.
• – added in May 2018, we have not tested it yet. It looks like it is supposed to be used only with a special summarizer they added16.

16 https://github.com/stanford-futuredata/macrobase/commit/9af7c8f7c510a9300a6bcf3ff6ae7095dbca58fb

We created an MCOD classifier for MacroBase based on the source code provided with [22]. The classifier implementation consists mostly of instantiating the provided implementation with the specified parameters (k, R, window and slide sizes), passing data to it and (similarly to other MacroBase classifiers) adding a result column to the DataFrame with the value 1.0 for outliers and 0.0 otherwise. We investigated the possibility of outputting scores/probabilities instead of boolean results but have not found any way to do it with this algorithm. We also added some optimizations17 improving performance, such as an option to use a hash set instead of an array-based list for faster search and duplicate removal in the collection of outliers, and better performance in “batch” mode (when the slide size is equal to the window size) by removing expensive queue operations that are not needed in this mode. The LOF implementation was adapted (and optimized18, removing unnecessary temporary array creation and boxing of Java objects) from one of the popular Java implementations on GitHub19. It is not very efficient (it uses the simplest grid-based implementation) and we were looking at other options, such as porting implementations from Weka20, ELKI21 or sklearn22, but they are much more difficult to port because they have many dependencies on their frameworks, so this was not done because of the lack of time. The Isolation forest implementation was ported from Weka. It seems to be an efficient implementation and the source code is quite short without many Weka dependencies, unlike LOF, so it was not difficult to port.

17 https://github.com/anomaly-detection-macrobase-benchmark/macrobase/commits/alexp-vmu/alexp/src/main/java/alexp/macrobase/outlier/mcod/MicroCluster_New.java
18 https://github.com/anomaly-detection-macrobase-benchmark/macrobase/commits/alexp-vmu/alexp/src/main/java/alexp/macrobase/outlier/lof/bkaluza/LOF.java
19 https://github.com/bkaluza/jlof
20 https://www.cs.waikato.ac.nz/ml/weka/
21 https://elki-project.github.io/
22 https://scikit-learn.org/stable/modules/outlier_detection.html

Figure 19 shows our hierarchy of anomaly detection algorithm implementations and interfaces. We kept the Classifier base class for all algorithms from the original MacroBase and added an additional MultiMetricClassifier subclass for algorithms working on multivariate data (LOF, FastMCD, iForest, MCOD). We also added a Trainable interface for algorithms with a training step (MAD, LOF, FastMCD, iForest) to make it easier to measure the elapsed time separately for training and scoring.
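
The sketch below shows the assumed shape of these additions and how they allow timing training and scoring separately; the types are simplified stand-ins and the method names may not match the declarations in our repository exactly:

// Illustrative stand-ins; the real MacroBase DataFrame and Classifier are richer.
class DataFrame { /* rows and typed columns */ }

abstract class Classifier {
    public abstract void process(DataFrame input) throws Exception;
}

// Assumed shape of the added abstractions.
interface Trainable {
    void train(DataFrame trainingData) throws Exception;
}

abstract class MultiMetricClassifier extends Classifier {
    protected final String[] metricColumns;

    protected MultiMetricClassifier(String[] metricColumns) {
        this.metricColumns = metricColumns;
    }
}

class TimedRun {
    static void run(Classifier classifier, DataFrame train, DataFrame test) throws Exception {
        long start = System.nanoTime();
        if (classifier instanceof Trainable) {
            ((Trainable) classifier).train(train);
        }
        long trainingNanos = System.nanoTime() - start;

        start = System.nanoTime();
        classifier.process(test); // scoring / classification step
        long scoringNanos = System.nanoTime() - start;

        System.out.println("training: " + trainingNanos / 1_000_000 + " ms, "
                + "scoring: " + scoringNanos / 1_000_000 + " ms");
    }
}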

Figure 19 UML class diagram of the outlier detection algorithms hierarchy in our platform. https://bit.ly/2IH55N8

4 EXPERIMENTS

The main goal of the experiments in this work was to evaluate and compare the algorithms that were provided in MacroBase (Percentile, MAD, MCD) and the additional algorithms that we integrated (LOF, MCOD, iForest) in order to understand the differences between them and identify important characteristics that can be used for algorithm selection. We started by reproducing experiments from [30] on our algorithms, evaluating scalability when increasing the number of samples, as well as simple quality evaluations on different real datasets. One of the goals of these experiments was to test our benchmarking platform and algorithm implementations. We were planning to perform other experiments, starting with an increasing number of features (adding irrelevant features like in [31] or generating synthetic datasets), a varying number of anomalies, and a detailed evaluation of each algorithm, such as evaluating the effects of changing hyperparameters (slide and window sizes for MCOD similarly to [22], number of trees and subsamples for iForest, etc.), but we did not have time to implement these experiments yet. We also used the Numenta Anomaly Benchmark (section 4.3), and in section 4.3.2 we created a simple and crude algorithm for MCOD hyperparameters adjustment without using labels. In section 4.4 we summarize the results of our experiments.

4.1 Time and memory performance

We used synthetic Mulcross datasets with 2 features and 5% anomalies for the time and memory scalability experiments. We generated a dataset with 10M samples and extracted smaller (starting with 5K) random subsets having the same characteristics. We used the same default hyperparameters for all algorithms (Table 5), which, as we see in Figure 20, resulted in good detection quality for FastMCD and LOF, except for several spikes for LOF (Figure 20, Figure 23), which may be caused by the very small training set: we did not use a bigger training set for LOF because of the speed of our implementation. Results for iForest were not as good, possibly because of the low dimensionality, and MCOD with these parameters was not better than the baseline outputting random scores without any data analysis (Figure 20 left). After setting suitable MCOD parameters (R = 3.0, k = 0.1 * n) its quality improved (Figure 20 right) but time and memory usage greatly increased (Figure 22).

Algorithm | Parameters
iForest | Number of trees = 100, subsample size = 256, training set size = min(30K, n)
LOF | k = 15, training set size = 400
MCOD | R = 13, k = 150, window size = slide size = n (batch mode)
FastMCD | a = 0.9, delta = 0.00001, training set size = min(4K, n)
MAD | Training set size = min(4K, n)
Percentile | 5%
Table 5 Default algorithm parameters used in our experiments. n – dataset size

Figure 20 Anomaly detection quality (PR AUC) on two-dimensional Mulcross datasets of different sizes, before (left) and after (right) MCOD hyperparameters adjustment.

Figure 21 Time and peak memory usage results on two-dimensional Mulcross datasets of different sizes, before MCOD hyperparameters adjustment.

Figure 22 Time and peak memory usage results on two-dimensional Mulcross datasets of different sizes, after MCOD hyperparameters adjustment.

There are some strange spikes in the time plots for MCOD (Figure 21) and FastMCD, especially at the end for FastMCD in both Figure 21 and Figure 24, as well as some spikes in the memory plots for all algorithms followed by several points without a significant peak memory usage increase. It is not caused by differences in characteristics between the subsets: we checked that they are the same and re-generated the datasets. It is also not some random interference: we ran the benchmark several times and on different clean OSes (Windows and Ubuntu Linux) and the results were very similar. It could be related to Java GC (Garbage Collection) and some optimizations for memory allocation inside the JVM, but we did not have time to investigate this issue deeper yet (we may start by monitoring GC activity).

For the univariate algorithms from MacroBase (MAD, Percentile) we generated similar datasets using Mulcross but with 1 feature. MAD here produces perfect anomaly detection results (Figure 23), like FastMCD. The behavior of the other algorithms is similar to before. One interesting thing is that MAD is faster than our random baseline (Figure 24): this is because MAD uses a very simple formula for scoring (section 2.2.2) which is faster than the Java implementation of random number generation. During training the most expensive operation in MAD is sorting of the training set, but we do not include training plots here because we used only a fixed training set size for now.

Figure 23 Anomaly detection quality (PR AUC) on one-dimensional Mulcross datasets of different sizes.

Figure 24 Time and peak memory usage results on one-dimensional Mulcross datasets of different sizes.

Overall, simple statistical algorithms like FastMCD, MAD and Percentile are much faster and use less memory than the other algorithms. LOF and MCOD are the slowest algorithms here. For LOF this is partially because of our implementation, but in [30], using a much more optimized LOF implementation (with a different time-memory tradeoff), LOF memory usage was also very high, so it looks like LOF is not a good choice for big datasets/batches. MCOD time and memory performance depends on the hyperparameters, so it can be much better on other datasets; also, MCOD is designed for streaming, so this issue would be less important when it is used with smaller window sizes.

4.2 Anomaly detection quality

We used the real datasets that were used in [30] to evaluate anomaly detection quality for most of our algorithms (the multivariate ones). We also used Yahoo S5 to evaluate the univariate algorithms, and we tried to better understand the effect of algorithm hyperparameter changes using the Shuttle dataset from [6]. For non-deterministic algorithms (iForest, FastMCD) we used the average of 5 executions. We used the algorithm hyperparameters from Table 5, but optimized the LOF, MCOD and iForest hyperparameters (except training set sizes and windowing) using Grid Search on each dataset.

Figure 25 Anomaly detection quality (average PR AUC of 5 executions) on multivariate datasets described in 2.4.

In Figure 25 we see that most of these datasets are very difficult for all our algorithms; only on the Shuttle datasets are the results close to perfect. iForest has the best average results here, but on some datasets LOF and FastMCD are better. We noticed a lot of variation in iForest results on the Abalone (0.25-0.50 PR AUC) and CovType (0.05-0.25 PR AUC) datasets with all parameters we tried. On other datasets the iForest variation was much smaller (up to ±0.05 PR AUC). FastMCD did not have significant variation. The iForest results match the results from [30]. The LOF results are also similar, but not as close as iForest, most likely because a different LOF implementation was used, allowing different hyperparameters (such as much higher training set sizes).

Yahoo S5 consists of small files, about 1400 elements each, with different behaviors, so we did not use windowing here; all data for each file was read and processed in one portion. It contains mostly simple point anomalies (some points far away from the rest), so our algorithms worked quite well here, as we can see in Table 6. LOF and iForest produced the best results according to these measurements. It looks like LOF does not depend much on hyperparameters and works well even with small training sets. MAD here is almost as good as LOF, and the training set size did not affect it much either. Percentile has some of the worst results here, but this seems to be caused mostly by the way we measure the results: in many cases it reports only several points of a bigger group, as shown in Figure 26, which should be enough in practice to notice an unusual event. MCOD clearly has a strong dependence on its hyperparameters: the average results were not good when we used fixed R and k for all files but got much better when we set them separately for each file (using Grid Search).

Algorithm  | Parameters                                          | Average ROC AUC | Average PR AUC
LOF        | k 15, training set 400                              | 0.93            | 0.72
LOF        | k 15, training set 200                              | 0.92            | 0.72
iForest    | trees 100, subsample 256                            | 0.93            | 0.70
MAD        | training set 10K (all)                              | 0.93            | 0.69
LOF        | k 60, training set 200                              | 0.91            | 0.69
LOF        | k 15, training set 50                               | 0.89            | 0.68
MAD        | training set 500                                    | 0.93            | 0.67
MAD        | training set 100                                    | 0.92            | 0.67
MCOD       | Grid Search tuned R, k for each file (on whole file) | 0.83           | 0.59
Percentile | 0.5%                                                | 0.75            | 0.22
Percentile | 1%                                                  | 0.80            | 0.20
Percentile | 1.5%                                                | 0.84            | 0.18
MCOD       | R 60, k 20                                          | 0.66            | 0.16
MCOD       | R 20, k 10                                          | 0.66            | 0.14
Random     |                                                     | 0.50            | 0.02

Table 6 Results for the Yahoo S5 dataset, sorted by PR AUC.

Figure 26 Examples of anomaly detections by Percentile algorithm.

There are several more difficult anomalies in Yahoo S5, like the ones shown in Figure 27, on which our algorithms fail. The top-left example looks like a denser group of points, so it is not surprising that our algorithms, designed for point anomalies, fail on it; the bottom-left example also seems to have an anomalous group of points that is too close to the rest of the points, which makes it difficult for our algorithms. On the right side there are two examples in which the data changes significantly (all values after some point become much higher than before, or vice versa); these could be detected if we read the data in smaller portions instead of all at once.

Figure 27 Examples of datasets on which these algorithms fail.

Algorithm | Parameters                                  | Average ROC AUC | Average PR AUC
iForest   | trees 100, subs. 256, training set 30K      | 0.998           | 0.977
iForest   | window 20K, trees 100, subs. 256, training set 4K | 0.997     | 0.975
iForest   | trees 100, subs. 256, training set 4K       | 0.995           | 0.974
iForest   | trees 100, subs. 50, training set 30K       | 0.993           | 0.954
iForest   | trees 10, subs. 256, training set 30K       | 0.995           | 0.949
MCOD      | window 20K, slide 10K, R 30, k 200          | 0.994           | 0.651
LOF       | window 40K, k 60, training set 150          | 0.996           | 0.646
LOF       | window 20K, k 15, training set 50           | 0.996           | 0.639
LOF       | window 20K, k 60, training set 150          | 0.996           | 0.638
MCOD      | window 10K, slide 5K, R 30, k 200           | 0.993           | 0.623
MCOD      | window 20K, slide 2K, R 30, k 200           | 0.990           | 0.616
LOF       | window 20K, k 15, training set 100          | 0.995           | 0.614
LOF       | window 40K, k 15, training set 150          | 0.995           | 0.612
FastMCD   | training set 1K                             | 0.983           | 0.611
FastMCD   | window 20K, training set 200                | 0.985           | 0.609
LOF       | k 15, training set 150                      | 0.99            | 0.600
MCOD      | window 20K, slide 20K, R 30, k 200          | 0.993           | 0.597
LOF       | window 5K, k 15, training set 150           | 0.994           | 0.596
LOF       | window 10K, k 15, training set 150          | 0.993           | 0.569
MCOD      | window 10K, slide 10K, R 30, k 200          | 0.992           | 0.562
FastMCD   | window 20K, training set 1K                 | 0.953           | 0.559
LOF       | window 20K, k 15, training set 150          | 0.993           | 0.556
LOF       | window 20K, k 15, training set 200          | 0.995           | 0.551
MCOD      | window 10K, slide 1K, R 30, k 200           | 0.987           | 0.551
MCOD      | window 40K, slide 10K, R 30, k 200          | 0.907           | 0.523
FastMCD   | window 20K, training set 2K                 | 0.913           | 0.489
MCOD      | window 40K, slide 20K, R 30, k 200          | 0.854           | 0.455
FastMCD   | training set 2K                             | 0.899           | 0.431
MCOD      | window 40K, slide 40K, R 30, k 200          | 0.879           | 0.394
FastMCD   | window 20K, training set 10K                | 0.664           | 0.134
FastMCD   | window 20K, training set 20K                | 0.500           | 0.019
Random    |                                             | 0.491           | 0.019

Table 7 Results for the Shuttle dataset [6], sorted by PR AUC.

From the algorithms we have, only MCOD was designed for unbounded data and uses sliding windows internally, so for the other algorithms "window size" in our results means the size of the data slices loaded from the dataset. Each slice is processed independently because these algorithms/implementations (except MCOD) do not accumulate any state between slices; for iForest, MAD, FastMCD and LOF we retrain on a subset of each slice (all previous training state gets cleared). We can notice that the LOF k hyperparameter can affect the results (although not very significantly), but the best value depends on the dataset: on Yahoo S5 (Table 6) a higher k produced worse results, while on the Shuttle dataset (Table 7) a higher k improves the result. Changing LOF window size and training set size does not seem to produce any clear pattern; the results fluctuate somewhat randomly, probably because we use quite small training sets. Overall, LOF results on the Shuttle dataset are good with all parameters we tried. FastMCD has much higher result variance when changing window and training set sizes. It seems to work well when the training set is much smaller than the window and does not produce any useful results when we use all data for training. MCOD results on this dataset strongly depend on R and k as before, so we chose good R and k using Grid Search (which produced very good results here, even better than LOF in some cases) and then tried different combinations of window and slide sizes. It is difficult to draw conclusions from these results, but it looks like MCOD usually produces worse results when the slide is too big (such as equal to the window) or too small. iForest has the best results on this dataset and was not very sensitive to hyperparameters.

4.3 NAB results

There are 3 ways to use NAB for evaluating the performance of our algorithms:

1. Implement the anomaly detection algorithm in Python (using the NAB base class/conventions). [41]
2. Implement the NAB scoring method in MacroBase.
3. Run the algorithms on the NAB dataset, output the results in NAB format (CSV files with a score from 0 to 1 for each data point) and pass them to NAB. [41]

The first 2 options seemed time-consuming and could introduce implementation mistakes/differences, so we decided to try the 3rd option. Later we may still want to implement this scoring method in MacroBase, for example if we want to modify it. It is not difficult to produce output in the NAB format [33]; the only issue was that some of our algorithms (LOF, MAD, FastMCD) output scores greater than 1, so we used simple Min-Max normalization to bring the scores into the 0-1 range: x_i' = (x_i − min(x)) / (max(x) − min(x)). A small sketch of this normalization is given below. Another difference from the MacroBase approach is that in NAB algorithms are supposed to output a score immediately when receiving a point, without looking ahead; in MacroBase most of the algorithms can read the whole data portion (such as a window slide) first and then output the results. This should not be a problem as long as we simply want to compare our algorithms with each other. For all experiments on NAB we did not use any data slicing because the NAB datasets are quite small (1-20K rows). LOF, iForest and MAD detection quality can be affected only by the training set. We use the first N elements (such as 200 or 600) of each dataset for training because in NAB the first 750 elements (or 15% for datasets smaller than 5000 elements) do not contain anomalies, which is good for these algorithms, and should be the best case when the normal behavior does not change later.
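A minimal sketch of this normalization step, assuming the scores of one data file are collected into an array before writing the NAB output CSV (the class and method names are illustrative):

// Rescales raw anomaly scores into the 0-1 range expected by NAB.
public final class ScoreNormalizer {
    public static double[] minMaxNormalize(double[] scores) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        double range = max - min;
        double[] normalized = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            // If all scores are equal, map everything to 0 to avoid division by zero.
            normalized[i] = range == 0 ? 0.0 : (scores[i] - min) / range;
        }
        return normalized;
    }
}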

Algorithm / parameters         | Standard Profile | Reward Low FP | Reward Low FN
Percentile, 0.2%               | 46.6             | 25.3          | 55.5
Percentile, 0.1%               | 45.1             | 36.1          | 49.9
iForest (100 trees, 256 subs.) | 23.2             | 14.0          | 27.8
LOF (knn 60, train set 200)    | 16.6             | 0.8           | 23.3
LOF (knn 45, train set 200)    | 16.4             | 0.1           | 22.9
Percentile, 0.5%               | 12.4             | -53.4         | 35.8
LOF (knn 60, train set 300)    | 3.1              | 0.04          | 17.3
LOF (knn 15, train set 200)    | -15.6            | -68.8         | 5.4
LOF (knn 15, train set 400)    | -43.3            | -114.7        | -20.9
MAD (train set 20K)            | -102.0           | -193.9        | -55.9
MAD (train set 600)            | -102.0           | -193.8        | -55.9
LOF (knn 100, train set 200)   | -108.2           | -249.7        | -59.5
MCOD (R 10, k 5)               | -261.9           | -562.7        | -160.2
MCOD (R 80, k 60)              | -353.9           | -737.9        | -225.6

Table 8 NAB results for our algorithms.

As before, LOF takes a lot of time: 30 seconds per file with a training set size of 200 and 70 seconds with 400, while MCOD and MAD take less than 1 second. It is strange that increasing the LOF training set resulted in a worse score, but this may simply be because its results are bad overall: even the best scores we got here are very similar to Random in Table 9. Surprisingly, the best result here was achieved by the simplest algorithm, Percentile. MCOD achieved the worst results, but as explained in the next sections, this is because MCOD requires different distance-based hyperparameters for each dataset.

Detector             | Standard Profile | Reward Low FP | Reward Low FN
Perfect              | 100.0            | 100.0         | 100.0
Numenta HTM          | 70.5-69.7        | 62.6-61.7     | 75.2-74.2
CAD OSE              | 69.9             | 67.0          | 73.2
KNN CAD              | 58.0             | 43.4          | 64.8
Relative Entropy     | 54.6             | 47.6          | 58.8
Random Cut Forest    | 51.7             | 38.4          | 59.7
Twitter ADVec v1.0.0 | 47.1             | 33.6          | 53.5
Windowed Gaussian    | 39.6             | 20.9          | 47.4
Etsy Skyline         | 35.7             | 27.1          | 44.5
Bayesian Changepoint | 17.7             | 3.2           | 32.2
EXPoSE               | 16.4             | 3.2           | 26.9
Random               | 11.0             | 1.2           | 19.5
Null                 | 0.0              | 0.0           | 0.0

Table 9 Results published by Numenta [42].

When reviewing separate NAB data files, we confirmed that our algorithms can work well on simple point anomalies but usually fail on more complex (collective, contextual) anomalies. Below we present some of the results for several NAB datasets using plots generated by our benchmarking tool described earlier in section 3. Normal values (TN) are marked as blue points, undetected anomalies (FN) as red, correctly detected anomalies (TP) as green and incorrectly reported anomalies (FP) as yellow. The plots show many FNs, but this is often due to the NAB scoring/labeling method described in section 2.5.1: one TP surrounded by many FNs (along the time axis) is still a good result.

Figure 28 Examples of successful anomaly detection in NAB by MCOD and LOF algorithms.

Figure 29 Examples of anomaly detection failures in NAB by MCOD and LOF algorithms.

Figure 28 shows examples of simple point anomalies, such as big and rare spikes, which are usually detected correctly by our algorithms. Figure 29 shows more difficult anomalies on which our algorithms fail. On the left side there are two examples of datasets where the lack of a spike should be considered an anomaly, but our algorithms do not find anomalies here because there are still many other elements around the areas with low values. The top-right example shows a dataset where an area with much denser spikes is supposed to be an anomaly; our algorithms here either incorrectly report all spikes as anomalies or do not report any anomalies at all.

4.3.1 Hyperparameters tuning (using labels)

We noticed that the main issue with our algorithms (especially MCOD) is that they need different hyperparameters for different datasets. For example, in the dataset from Figure 30 the value range is 0.5-5.5, so MCOD with R = 20 will classify all elements as inliers, while other datasets have a much bigger value range (thousands or millions) where such a small R will not work well, resulting in too many outliers.

Figure 30 Left – MCOD with R = 20, right – MCOD with R = 3000000 on the same dataset from NAB.

We tried to automatically choose good parameters for each data file using simple grid search, with a crude version of the NAB scoring algorithm as the search measure (F1 and ROC/PR AUC did not work well here, resulting in too many FP).

Algorithm | Standard Profile | Reward Low FP | Reward Low FN
MCOD      | 63.73            | 57.13         | 68.35
MAD       | 56.33            | 46.46         | 61.98
LOF       | 33.17            | 19.62         | 38.81

Table 10 NAB results with perfect hyperparameters.

MCOD and MAD results (Table 10) were much better than before (Table 8). The results for LOF are also better but did not improve as much as for MCOD and MAD; this may be because we did not include some possible parameter values, since LOF is very slow and it takes too much time to try many different combinations. It is also possible that the threshold we used was not optimal: we used 0.85 (on LOF scores Min-Max normalized to 0-1) during the search, the runs and the NAB scoring. Of course, we cannot compare this with the results published by Numenta, because they did not use labels to tune parameters, but it shows that our algorithms can work well on most of these datasets if suitable hyperparameters are chosen.

4.3.2 MCOD hyperparameters tuning

We tried to tune MCOD parameters for each data file without using labels, because tuning parameters based on already labeled anomalies may not be feasible for many real applications. We used a simple, empirically created algorithm working on a small training set (600 elements) from the beginning of each data file (where NAB datasets contain only inliers [33]): find the closest-neighbor distance for each element, set R to 4 times the maximum of these distances, then count how many neighbors within radius R each element has, and finally set k to the minimum of these neighbor counts (a sketch is given below). Most likely this can be improved; the idea was simply to put R and k somewhere in a sensible range and avoid mistakes like the one described earlier, where R is too small for the range of the dataset values. We tested this tuning algorithm on the NAB and Yahoo S5 datasets (Table 11, Table 12).
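A minimal sketch of this heuristic for univariate data, directly following the description above (the code mirrors the description rather than the exact platform implementation, and assumes a training set of at least two elements):

// Derives MCOD's R and k from an unlabeled, inlier-only training set.
public final class McodParamHeuristic {
    public static double[] tune(double[] trainingSet) {   // e.g. the first 600 values of a file
        int n = trainingSet.length;

        // R = 4 * (largest nearest-neighbor distance in the training set)
        double maxNearest = 0.0;
        for (int i = 0; i < n; i++) {
            double nearest = Double.POSITIVE_INFINITY;
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                nearest = Math.min(nearest, Math.abs(trainingSet[i] - trainingSet[j]));
            }
            maxNearest = Math.max(maxNearest, nearest);
        }
        double r = 4 * maxNearest;

        // k = smallest neighbor count within radius R over all training elements
        int k = Integer.MAX_VALUE;
        for (int i = 0; i < n; i++) {
            int neighbors = 0;
            for (int j = 0; j < n; j++) {
                if (i != j && Math.abs(trainingSet[i] - trainingSet[j]) <= r) {
                    neighbors++;
                }
            }
            k = Math.min(k, neighbors);
        }
        return new double[] { r, k };
    }
}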

Algorithm        | Parameters / notes                                             | Standard Profile | Reward Low FP | Reward Low FN
Numenta HTM      |                                                                | 70.5-69.7        | 62.6-61.7     | 75.2-74.2
MCOD             | hyperparameters tuned for each file with labels                | 63.73            | 57.13         | 68.35
Percentile, 0.2% |                                                                | 46.6             | 25.3          | 55.5
LOF              | hyperparameters tuned for each file with labels                | 33.17            | 19.62         | 38.81
MCOD             | hyperparameters tuned for each file using our tuning algorithm | 26.47            | 11.93         | 32.30
LOF              | knn 60, train set 200                                          | 16.6             | 0.8           | 23.3
Random           |                                                                | 11.0             | 1.2           | 19.5
MCOD             | best with fixed hyperparameters                                | -261.9           | -562.7        | -160.2

Table 11 NAB results of MCOD hyperparameters tuning.

Algorithm  | Parameters                                           | Average ROC AUC | Average PR AUC
LOF        | k 15, training set 400                               | 0.93            | 0.72
MAD        | training set 500                                     | 0.93            | 0.67
MCOD       | Grid Search tuned R, k for each file (on whole file) | 0.83            | 0.59
MCOD       | tuned for each file using our tuning algorithm       | 0.70            | 0.32
Percentile | 1%                                                   | 0.80            | 0.20
MCOD       | R 60, k 20                                           | 0.66            | 0.16
Random     |                                                      | 0.50            | 0.02

Table 12 Results of MCOD hyperparameters tuning for the Yahoo S5 dataset, sorted by PR AUC.

The results seem quite good: not as good as with Grid Search, but better than with fixed parameters. For NAB we also looked at the results for each data file and noticed that there is only one file with many FPs contributing a lot of penalty, and in about half of the files containing anomalies (28 of 53) MCOD successfully detected some anomalies without making too many FPs. In conclusion, it looks like it is possible to achieve good results tuning MCOD parameters even with such a basic tuning algorithm. This was just a very crude ad hoc attempt, and most likely a better tuning algorithm could produce results much closer to those we got when tuning with labels using Grid Search. In practice MCOD should achieve good results on point anomalies if the hyperparameters are at least adjusted according to the possible data range and the density of inliers.

4.4 Performance and anomaly detection quality conclusions

4.4.1 Percentile

Despite its simplicity, Percentile can work well enough on many univariate datasets with point anomalies; it even got a good score in NAB (section 4.3). It is simple to use because it has only two parameters: the percentage and whether to detect only low values, only high values, or both. In our experiments it worked well with any sufficiently low percentage, such as below 0.5-1% (a minimal sketch follows).
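A minimal sketch of such a two-sided percentile detector, assuming the whole batch is scored at once; the cutoff handling is illustrative and may differ from the MacroBase implementation.

import java.util.Arrays;

// Flags the lowest and/or highest 'percent' of values in a batch as outliers.
public final class PercentileSketch {
    public static boolean[] detect(double[] values, double percent,
                                   boolean flagLow, boolean flagHigh) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int cut = (int) Math.ceil(values.length * percent / 100.0);
        double lowCutoff = sorted[Math.max(cut - 1, 0)];
        double highCutoff = sorted[Math.min(values.length - cut, values.length - 1)];

        boolean[] outlier = new boolean[values.length];
        for (int i = 0; i < values.length; i++) {
            outlier[i] = (flagLow && values[i] <= lowCutoff)
                      || (flagHigh && values[i] >= highCutoff);
        }
        return outlier;
    }
}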

4.4.2 MAD

MAD result quality does not seem to be significantly better than Percentile, but MAD is more difficult to use because a suitable threshold has to be chosen to get good results, and it differs from dataset to dataset. It does not seem possible to simply normalize the score to [0; 1] and use a high threshold like 0.99: we tried this on NAB and got a bad score (MAD worked well only when the threshold was adjusted using Grid Search for each file). It is also possible to adjust the training set size, but it does not seem to make much difference in quality on our datasets, and performance is not an issue either, because all MAD does during training is sorting (to find medians), and a good sorting algorithm on a modern computer can sort very big datasets in reasonable time. Ideally the training set should contain only inliers (this probably applies to all the other algorithms we have as well, except for hyperparameter optimization using Grid/Random Search on labeled data).

4.4.3 FastMCD

FastMCD works on multivariate data (but not on univariate data) and is also fast unless the data is very high-dimensional. In our experiments on the Shuttle dataset it showed good results, but a bit worse than LOF and MCOD. Adjusting the FastMCD training set size can affect the results significantly; it should be neither too high nor too low.

4.4.4 LOF

LOF (at least in our implementation, although it corresponds to [30]) is the slowest algorithm we have here, but its results are usually better than Percentile, MAD and FastMCD, similar to the best results of MCOD, and it can work quite well without hyperparameter adjustments. The main disadvantage is that, at least in this implementation, it is very slow and uses a lot of memory when used with a big training set. We usually limited the training set to 200-400 elements, which seems quite small but was enough for our datasets. There are many modifications/extensions of LOF, such as [19], [20], which possibly achieve better results and performance, so they should be investigated.

4.4.5 MCOD

MCOD is faster than LOF, but still quite slow (especially with big window/slide sizes, though other hyperparameters and the dataset itself affect performance too), and much slower than Percentile, MAD and FastMCD. However, this MCOD implementation may not be optimal and other implementations may perform better; for example, it does not seem to use an M-Tree for range queries, which was recommended in the paper [21]. Detection quality depends very much on the R and k hyperparameters (an object is considered an outlier if there are fewer than k objects within radius R around it); there are no good default values that work on all datasets, they should correspond to the dataset value range and inlier density. For example, if a dataset has moderately dense values between 10000 and 20000, then MCOD with R = 10 will mark all elements as outliers. It is possible to adjust R and k automatically using Grid/Random Search, but labeled data is needed. Another workaround for univariate data is our simple algorithm described in 4.3.2. The snippet below illustrates the R/k outlier definition.
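A naive illustration of the R/k rule, assuming `window` holds the other objects of the current window; MCOD itself maintains this decision incrementally with micro-clusters over a sliding window, which this sketch omits.

// Returns true if 'point' has fewer than k neighbors within radius R in 'window'.
public final class DistanceOutlierCheck {
    public static boolean isOutlier(double[] point, double[][] window, double r, int k) {
        int neighbors = 0;
        for (double[] other : window) {
            double sq = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - other[d];
                sq += diff * diff;
            }
            if (Math.sqrt(sq) <= r) {
                neighbors++;
                if (neighbors >= k) {
                    return false;   // enough neighbors within R: inlier
                }
            }
        }
        return true;                // fewer than k neighbors within R: outlier
    }
}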

4.4.6 iForest

iForest is often faster than LOF and MCOD and usually shows better quality on high-dimensional data. It may not be a good choice for low-dimensional/univariate data. The default hyperparameters (recommended in [23]) worked well on our datasets, and tuning iForest hyperparameters did not make much difference in most cases.

5 RESULTS AND CONCLUSIONS

We created a flexible and easy-to-use benchmarking platform for the evaluation of anomaly detection algorithms. The source code is available at https://github.com/anomaly-detection-macrobase-benchmark (use the Git 1.0-batch-classification tag for the version used in this paper). We found some freely available datasets that can be used for testing and benchmarking anomaly detection algorithms. Most of these datasets contain a very small number of samples or dimensions, so for some experiments (such as time and memory performance scalability) synthetically generated datasets are needed; however, these can be too simple and are not a good choice for anomaly detection quality evaluation [31]. Some datasets contain categorical attributes, which we converted to numerical attributes using one-hot encoding: this may help to cover more datasets, although on some datasets, such as Abalone, using these additional categorical attributes decreases result quality. We performed some experiments using our benchmarking platform; unfortunately we did not have time for many of the planned experiments, so currently we have only a small list of basic conclusions (4.4).

5.1 Future works

This work focused on researching the area (anomaly detection) and implementing a benchmarking platform for the evaluation of anomaly detection algorithms, primarily covering bounded data. It can be continued in many directions, such as evaluation of methods designed for unbounded data streams, time series, explanation algorithms, other types of anomaly detection algorithms (probabilistic methods, neural networks, ...), and more advanced evaluation methodology (other metrics, cross-validation, ...). Currently another group of SAP internship students is working on some of these topics using and extending this platform.

6 REFERENCES

[1] "Dell EMC Digital Universe Survey: The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things," 2014. [Online]. Available: https://www.emc.com/leadership/digital- universe/index.htm. [2] E. Gan, P. Bailis, S. Madden, D. Narayanan, K. Rong and S. Suri, "MacroBase: Prioritizing Attention in Fast Data," 2017. [Online]. Available: http://www.bailis.org/papers/macrobase-sigmod2017.pdf. [3] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. Fernandez- Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt and S. Whittle, "The Dataflow Model: A Practical Approach to Balancing," 2015. [4] T. Akidau, "The world beyond batch: Streaming 101," 05 08 2015. [Online]. Available: https://www.oreilly.com/ideas/the-world-beyond-batch- streaming-101. [Accessed 30 10 2018]. [5] J. Li, D. Maier, K. Tufte, V. Papadimos and P. Tucker, "Semantics and evaluation techniques for window aggregates in data streams," 2005. [6] M. Goldstein and S. Uchida, "A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data," 2015. [7] V. Chandola, A. Banerjee and V. Kumar, "Anomaly Detection : A Survey," 09 2009. [Online]. Available: http://cucis.ece.northwestern.edu/projects/DMS/publications/AnomalyDetection .pdf. [Accessed 10 11 2018]. [8] V. Christophides, "IoT Data Analytics - Inria," 22 06 2018. [Online]. Available: https://who.rocq.inria.fr/Vassilis.Christophides/IoT/IoTDataAnalytics.pptx. [Accessed 10 11 2018]. [9] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," 09 1994. [Online]. Available: http://www.vldb.org/conf/1994/P487.PDF. [Accessed 15 12 2018]. [10] J. Han, J. Pen and Y. Yin, "Mining Frequent Patterns without Candidate Generation," 2000. [Online]. Available: https://www.cs.sfu.ca/~jpei/publications/sigmod00.pdf. [Accessed 15 12 2018]. [11] "MacroBase license," 11 08 2016. [Online]. Available: https://github.com/stanford-futuredata/macrobase/blob/master/LICENSE. [Accessed 12 01 2018]. [12] "Tutorial - MacroBase wiki," 15 08 2017. [Online]. Available: https://github.com/stanford-futuredata/macrobase/wiki/Tutorial. [Accessed 12 01 2018]. [13] P. S. Efraimidis, "Weighted Random Sampling over Data Streams," 2015. [Online]. Available: https://arxiv.org/pdf/1012.0256.pdf. [14] A. Metwally, D. Agrawal and A. El Abbadi, "Efficient Computation of Frequent and Top-k Elements in Data Streams," 2005. [Online]. Available: https://arxiv.org/abs/1610.06376. [15] K. S. Tai, V. Sharan, P. Bailis and G. Valiant, "Sketching Linear Classifiers over Data Streams," 2018. [Online]. Available: https://arxiv.org/pdf/1711.02305.pdf. [16] M. M. Breunig, H.-P. Kriegel, R. T. Ng and J. Sander, "LOF: Identifying Density-Based Local Outliers," 2000. [Online]. Available: www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf. [Accessed 15 12 2018]. [17] "Quickselect - Wikipedia," 10 9 2018. [Online]. Available: https://en.wikipedia.org/wiki/Quickselect. [Accessed 26 11 2018]. [18] C. Leys, O. Klein, B. Philippe and L. Licata, "Detecting outliers: Do not use standard deviations around the mean, do use the median absolute deviation around the median," 2013. [Online]. Available: http://www.academia.edu/3448313/Detecting_outliers_Do_not_use_standard_d eviations_around_the_mean_do_use_the_median_absolute_deviation_around_t he_median. [19] D. Pokrajac, A. Lazarevic and L. J. Latecki, "Incremental Local Outlier Detection for Data Streams," 01 2007. [Online]. 
Available: https://www.researchgate.net/publication/4250603_Incremental_Local_Outlier_ Detection_for_Data_Streams. [Accessed 15 12 2018]. [20] "Local outlier factor - Wikipedia," 25 11 2018. [Online]. Available: https://en.wikipedia.org/wiki/Local_outlier_factor. [Accessed 15 12 2018]. [21] M. Kontaki, A. Gounaris, A. N. Papadopoulos, K. Tsichlas and Y. Manolopoulos, "Efficient and flexible algorithms for monitoring distance-based outliers over data streams," 2015. [22] L. Tran, L. Fan and C. Shahabi, "Distance-based Outlier Detection in Data Streams [Experiments and Analyses]," 2016. [Online]. Available: https://infolab.usc.edu/Luan/Outlier/. [23] F. T. Liu and K. M. Ting, "Isolation-based Anomaly Detection," 2012. [Online]. Available: https://dl.acm.org/citation.cfm?id=2133363. [Accessed 02 05 2019]. [24] W.-R. Chen, Y.-H. Yun, M. Wen, H.-M. Lu, Z.-M. Zhang and Y.-Z. Liang, "Representative subset selection and outlier detection via isolation forest," 2016. [Online]. Available: https://pubs.rsc.org/en/content/articlelanding/2016/ay/c6ay01574c. [Accessed 02 05 2019]. [25] "F1 score - Wikipedia," 2018. [Online]. Available: https://en.wikipedia.org/wiki/F1_score. [Accessed 21 06 2018]. [26] T. Fawcett, "ROC Graphs: Notes and Practical Considerations for Researchers," 2004. [27] "Differences between Receiver Operating Characteristic AUC (ROC AUC) and Precision Recall AUC (PR AUC)," 2014. [Online]. Available: http://www.chioka.in/differences-between-roc-auc-and-pr-auc/. [28] J. Davis and M. Goadrich, "The Relationship Between Precision-Recall and ROC Curves," 2006. [Online]. Available: https://www.biostat.wisc.edu/~page/rocpr.pdf. [Accessed 02 05 2019]. [29] "Hyperparameter optimization - Wikipedia," 2018. [Online]. Available: https://en.wikipedia.org/wiki/Hyperparameter_optimization. [Accessed 21 06 2018]. [30] R. Domingues, M. Filippone, P. Michiardi and J. Zouaoui, "A comparative evaluation of outlier detection algorithms: experiments and analyses," 09 2017. [Online]. Available: http://www.eurecom.fr/en/publication/5334/detail. [Accessed 02 05 2019]. [31] A. Emmott, S. Das, T. Dietterich, A. Fern and W.-K. Wong, "A Meta- Analysis of the Anomaly Detection Problem," 2016. [Online]. Available: https://arxiv.org/abs/1503.01158. [Accessed 02 05 2019]. [32] S. Ahmad, A. Lavin, S. Purdy and Z. Agha, "Unsupervised real-time anomaly detection for streaming data," 1 11 2017. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0925231217309864. [Accessed 30 11 2018]. [33] "NAB whitepaper," [Online]. Available: https://github.com/numenta/NAB/wiki#nab-whitepaper. [Accessed 30 11 2018]. [34] A. Lavin and S. Ahmad, "Evaluating Real-time Anomaly Detection Algorithms - the Numenta Anomaly Benchmark," 17 11 2015. [Online]. Available: https://arxiv.org/abs/1510.03336. [Accessed 30 11 2018]. [35] Q. Yu, L. Jibin and L. Jiang, "An Improved ARIMA-Based Traffic Anomaly Detection Algorithm for Wireless Sensor Networks," 18 01 2016. [Online]. Available: https://journals.sagepub.com/doi/10.1155/2016/9653230. [Accessed 02 05 2019]. [36] D. T. Shipmon, J. M. Gurevitch, P. M. Piselli and S. Edwards, "Time Series Anomaly Detection: Detection of Anomalous Drops with Limited Features and Sparse Examples in Noisy," 2017. [Online]. Available: https://arxiv.org/pdf/1708.03665. [Accessed 02 05 2019]. [37] "NAB license," 10 8 2015. [Online]. Available: https://github.com/numenta/NAB/blob/master/LICENSE.txt. [Accessed 30 11 2018]. [38] "NAB FAQ," 16 11 2015. [Online]. 
Available: https://github.com/numenta/NAB/wiki/FAQ. [Accessed 15 12 2018]. [39] J. Wilke, "The 6 Memory Metrics You Should Track in Your Java Benchmarks," 28 03 2017. [Online]. Available: https://cruftex.net/2017/03/28/The-6-Memory-Metrics-You-Should-Track-in- Your-Java-Benchmarks.html. [Accessed 02 05 2019]. [40] "Benchmarking with ELKI," [Online]. Available: https://elki- project.github.io/benchmarking. [Accessed 02 05 2019]. [41] "NAB Entry Points," 27 04 2017. [Online]. Available: https://github.com/numenta/NAB/wiki/NAB-Entry-Points. [Accessed 30 11 2018]. [42] "The Numenta Anomaly Benchmark," [Online]. Available: https://github.com/numenta/NAB. [Accessed 30 11 2018].