VYTAUTAS MAGNUS UNIVERSITY FACULTY OF INFORMATICS DEPARTMENT OF APPLIED INFORMATICS

Aleksas Pantechovskis

Determining Criteria for Choosing Anomaly Detection Algorithm

Master final thesis

Applied informatics study programme, state code 6211BX012 Study field Informatics

Supervisor prof. dr. Tomas Krilavičius______(degree, name, surname) (signature) (date)

Defended prof. dr. Daiva Vitkutė-Adžgauskienė ______(Dean of Faculty) (signature) (date)

Kaunas, 2019

CONTENTS

ABBREVIATIONS AND TERMS ...... 1

ABSTRACT ...... 2

1 INTRODUCTION ...... 3

1.1 Concepts ...... 4

1.1.1 Unbounded and bounded data ...... 4

1.1.2 Windowing...... 4

1.1.3 Anomalies ...... 5

1.1.4 MacroBase terminology ...... 6

2 LITERATURE AND TOOLS REVIEW ...... 7

2.1 MacroBase ...... 7

2.1.1 Source code ...... 8

2.1.2 Architecture ...... 10

2.1.3 MacroBase SQL ...... 13

2.2 Outlier detection algorithms ...... 15

2.2.1 Percentile ...... 16

2.2.2 MAD ...... 17

2.2.3 FastMCD...... 17

2.2.4 LOF ...... 17

2.2.5 MCOD ...... 18

2.2.6 Isolation forest...... 18

2.3 Evaluation of anomaly detection quality ...... 20

2.4 Datasets ...... 21

2.5 Numenta Anomaly Benchmark ...... 24

2.5.1 NAB scoring method ...... 24

3 BENCHMARKING PLATFORM IMPLEMENTATION ...... 27

3.1 Anomaly detection algorithms ...... 30

4 EXPERIMENTS ...... 33

4.1 Time and memory performance ...... 34

4.2 Anomaly detection quality ...... 38

4.3 NAB results ...... 43

4.3.1 Hyperparameters tuning (using labels) ...... 47

4.3.2 MCOD hyperparameters tuning ...... 49

4.4 Performance and anomaly detection quality conclusions ...... 51

4.4.1 Percentile ...... 51

4.4.2 MAD ...... 51

4.4.3 FastMCD ...... 51

4.4.4 LOF ...... 51

4.4.5 MCOD ...... 52

4.4.6 iForest...... 52

5 RESULTS AND CONCLUSIONS ...... 53

5.1 Future works ...... 53

6 REFERENCES...... 54

ABBREVIATIONS AND TERMS

ADR Adaptive Damping Reservoir

AMC Amortized Maintenance Counter

AUC Area under curve

CSV Comma-separated values

FN False negatives

FP False positives

LOCI Local Correlation Integral

LOF Local Outlier Factor

MAD Median Absolute Deviation from median

MCD Minimum Covariance Determinant

MCOD Micro-cluster based Continuous Outlier Detection

ML Machine Learning

NAB Numenta Anomaly Benchmark

OS Operating system

TN True negatives

TP True positives

UDF User-Defined Function

UI User interface

ABSTRACT

Author Aleksas Pantechovskis

Title Determining Criteria for Choosing Anomaly Detection Algorithm

Supervisor prof. dr. Tomas Krilavičius

Number of pages 61

In today’s world there are vast amounts of data requiring automated processing: nobody can analyze them and extract useful information manually. One of the existing processing modes is anomaly detection: detecting failures, high traffic, dangerous states and so on. However, it often requires the developer or the user of such analysis systems to have a lot of knowledge on the subject, which makes it less accessible. One of the difficulties is the choice of a suitable algorithm and its parameters. The main goal of this work is to start creating guidelines or a decision tree that simplify choosing the most suitable anomaly detection algorithm depending on the dataset characteristics and other requirements. The project was proposed by SAP and inspired by the work of the Dawn research team from Stanford and their MacroBase system. In this work we review MacroBase architecture and functionality, describe commonly used real datasets for anomaly detection benchmarking, synthetic dataset generation methods and anomaly detection quality metrics, develop a benchmarking platform, and evaluate anomaly detection algorithms of different types: distance-based (MCOD), density-based (LOF), statistical (MAD, FastMCD, Percentile) and isolation-based (iForest).

1 INTRODUCTION

Data volumes generated by machines are constantly increasing with the rise of automation. Modern hardware is powerful enough to handle and generate a lot of data (for example, recordings from sensors or system events), network speeds allow collecting data from many devices around the world (such as web and mobile applications or IoT), and storage is cheap enough to keep terabytes of data [1]. However, this data is useless without proper automated processing: nobody can analyze it and extract useful information manually, because the data arrives faster than humans can read it. One of the existing processing modes is anomaly detection, for example, to detect failures, high traffic, dangerous states and so on. However, it often requires the developer or the user of such analysis systems to have a lot of knowledge on the subject, which makes it less accessible. One of the difficulties is the choice of a suitable algorithm and its parameters, which may be quite hard and so far is not fully researched. The main goal of this work is to start creating guidelines or a decision tree that simplify choosing the most suitable anomaly detection algorithm depending on the dataset characteristics and other requirements, such as whether it needs to be fast or how much memory is available. This is of course a very large space, so for now we start with several popular algorithms and common freely available datasets. The project was proposed by SAP and inspired by the work of the Dawn research team from Stanford and their MacroBase system [2], which aims to make fast data analysis more accessible. Originally the goal of our project was to develop a system that can prepare large data streams for user consumption (extract “summaries”) and visualize them in different ways, but later we decided to focus more deeply on a single, more specific topic for now. This work focuses on the following tasks:
1. Analysis and experiments with different outlier detection algorithms, and development of a benchmarking platform based on MacroBase (MacroBase is not strictly needed for analyzing outlier detection algorithms, but we already developed some tools there and it can become more useful in future work on explanations).
2. Implementation of popular outlier detection algorithms not provided with MacroBase.
3. Definition and evaluation of performance and anomaly detection quality characteristics for these algorithms.

1.1 Concepts

Some terms, such as “streaming”, are very ambiguous and used by different people to describe different things, which leads to misunderstandings. Therefore, we decided to adopt the data-related terminology from the Google Dataflow papers [3] [4].

1.1.1 Unbounded and bounded data

We use the terms unbounded and bounded data when describing infinite and finite data. Unbounded data is continuously growing data, which is never complete and cannot be loaded all at once, so we must process it incrementally, for example by splitting it into chunks or windows and possibly storing some state from the previously processed data. However, this type of processing is not specific to unbounded data: it can be used for bounded data too, depending on the size, task, algorithm, etc., for example if the data is too big to load into memory all at once.

1.1.2 Windowing

Figure 1 Example of fixed and sliding windows.

Windowing [3] [4] [5] refers to splitting data into chunks, usually based on the time dimension. Some systems also have tuple-based windowing, but the only difference is that they use the ordering of the elements instead of time. There are several types of windows:
• Fixed windows – windows with a fixed size, such as an hour.
• Sliding windows – windows with a fixed size and a slide period, such as hourly windows starting every minute (overlapping).
• Sessions – periods of activity terminated by a timeout or in some other data-specific way.
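
As an illustration of fixed and sliding windows, below is a minimal sketch computing which windows a timestamp falls into. The window size and slide period are arbitrary example values, and real systems also handle late data, watermarks and sessions in more elaborate ways:

import java.util.ArrayList;
import java.util.List;

public class WindowingExample {
    // Fixed windows: every timestamp belongs to exactly one window.
    static long fixedWindowStart(long timestamp, long windowSize) {
        return (timestamp / windowSize) * windowSize;
    }

    // Sliding windows: a timestamp belongs to every window of length `windowSize`
    // that starts at a multiple of `slide` and still covers the timestamp.
    static List<Long> slidingWindowStarts(long timestamp, long windowSize, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = (timestamp / slide) * slide;
        for (long start = lastStart; start > timestamp - windowSize; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long ts = 125; // e.g. seconds since some origin
        System.out.println(fixedWindowStart(ts, 60));        // 120
        System.out.println(slidingWindowStarts(ts, 60, 15)); // [120, 105, 90, 75]
    }
}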

1.1.3 Anomalies

An anomaly is any rare, unusual behavior in data that differs from the norm [6] [7]. The simplest example could be an extremely high temperature of a single device. An anomaly can include multiple data points, such as unusually low or high network traffic or CPU load during some long period (short spikes may happen regularly and are not considered anomalies). A similar term is outlier, an abnormal data point, often used interchangeably. There are three common types of anomalies:
• Point anomalies – an individual data instance is anomalous with respect to the rest of the data (for example, one point is far away from the rest). This is the simplest type of anomaly and the focus of the majority of research on anomaly detection. [7]

Figure 2 A simple example of point anomalies in 2-dimensional data. [8]

• Contextual/conditional anomalies – a data instance is anomalous only in a specific context. [7]

Figure 3 Example of a contextual anomaly in a temperature time series. [8]

• Collective anomalies – a collection of related data instances is anomalous, while individual instances from this collection may not be anomalous by themselves. [7]

Figure 4 Example of a collective anomaly. [8]

In some cases, algorithms designed for point anomalies can be used for contextual anomalies if we include the context as new features, such as the month number in Figure 3, as well as for collective anomalies after some data preprocessing such as correlation, aggregation or grouping. [6]

1.1.4 MacroBase terminology

MacroBase uses the term metric for numerical measurements (such as temperature, power drain, load) used for outlier detection, and explanatory attribute for metadata like device ID, model or location used for explaining the results (based on the detected outliers) [2]. The latter can be confusing because sometimes they simply call it “attribute”, while in other areas (such as databases) that word is often synonymous with any field/column/property, not just metadata. We try to avoid such confusion by not omitting the word “explanatory”.

2 LITERATURE AND TOOLS REVIEW

2.1 MacroBase

MacroBase1 is a data analysis engine for large datasets and unbounded data streams, specialized in finding and explaining unusual or interesting trends in data, such as that devices of some specific model are likely to have higher power drain. [2] One of the main motivations for creating MacroBase was to make fast data analysis simpler and accessible to more people. MacroBase provides a simple architecture/infrastructure designed with extensibility in mind, allowing easy customization and extension for different tasks. Figure 5 shows the standard pipeline suggested by MacroBase: input ingestion (reading from the data source) and any needed input transformation, classification (finding outliers), explanation/summarization of the outliers, and output of the results to the user. All components in the pipeline use the same interfaces for communication, so it is easy to add or remove them.

Figure 5 MacroBase default pipeline. [2]

In the simplest mode MacroBase lets the user choose target metrics, such as power drain or temperature, and “explanatory” attributes, such as device model and firmware, and uses unsupervised methods like MAD (Median Absolute Deviation from median) and MCD (Minimum Covariance Determinant) to classify the data points into inliers and outliers; after that it tries to explain the outliers using pattern mining algorithms like Apriori [9] and FP-Growth [10]. For more advanced usage it is possible to replace or customize any component in the source code. The modification is done by simply implementing a class with a suitable interface and a new pipeline that instantiates and glues together these classes (or the default ones). The paper claims that this is not difficult even for non-experts: students working part-time can implement and test a new operator in less than a week, and MacroBase maintainers need less than a day for a new pipeline. MacroBase focuses on outlier explanations, and in this project so far we were working mostly on outlier detection, so MacroBase probably did not help us much with our benchmark implementation (described in section 3) except for providing several simple components like config and CSV readers/writers, some ideas for a simple and extensible architecture (interfaces, usage of MacroBase DataFrames to pass data between components, etc.) and the MAD, MCD and Percentile classifier implementations. However, it can become more useful in future work on explanations/summarizers because it provides some visualizations for that and several different summarizers with some optimizations.

1 https://macrobase.stanford.edu

2.1.1 Source code

MacroBase is written mostly in Java and its source code is available in the GitHub repository (https://github.com/stanford-futuredata/macrobase) under the Apache 2.0 license [11], that is, modifications and commercial use are allowed if the original copyright and license notices are preserved. The project is somewhat lacking in documentation: there are only short instructions about building the project and a slightly outdated2 tutorial showing how to run the demo, as well as some notes about running benchmarks and tests (possibly outdated too, judging by the last update date – middle of 2016). However, the source code quality seems good: most of the important classes are documented using comments and Javadoc3, and classes are organized into packages. The Maven4 build system is used for building the project and running automated tests [12], so the build can be performed without any configuration by executing the Maven mvn package command. They also provide a Docker image5 with all necessary tools and a MacroBase build. Overall, combined with the detailed explanation of the main concepts and architecture in the paper [2], it is not difficult to get started working on this project. After building, the system can be used via a simple console runner (taking configuration from a YAML file) or MacroBase SQL. Also, it is possible to start an HTTP server and use the web interface, which allows selecting attributes and shows a list of results with plots. By default, it loads data from a PostgreSQL6 database, but it also supports MySQL7 and CSV, and there are contributed ingestors for some other data sources. In this mode, however, it uses older components which may lack some features or have lower performance. Nevertheless, it could be a good starting point to understand the basic ideas of MacroBase. Figure 6 shows the UI for selecting the data source (CSV file path or database address and query), metrics and explanatory attributes on the left side, and part of the results on the right side: some statistics (number of outliers, elapsed time), outlier groups that were created by the explanation algorithms and a plot showing the distribution of these groups.

2 A bug report with fix suggestions was submitted in the GitHub repository, https://github.com/stanford-futuredata/macrobase/issues/218, but nobody updated it yet.
3 https://en.wikipedia.org/wiki/Javadoc
4 https://maven.apache.org/index.html
5 https://macrobase.stanford.edu/docs/sql/setup/#docker
6 https://www.postgresql.org
7 https://www.mysql.com

Figure 6 MacroBase web UI

2.1.2 Architecture

Currently MacroBase has two main modules: macrobase-lib and macrobase-legacy. The latter is used by the default web UI but, as the name suggests, all new development seems to be focused on lib; it has a clearer API and better documentation, so we decided that it would be better to base our project on it instead of legacy. It still lacks some interesting legacy components though, such as the MAD and MCD classifiers and the Adaptable Damped Reservoir mentioned in the paper [2], but it should not be difficult to adapt them – most of the components are quite modular and do not have many dependencies. macrobase-lib has some data structures, such as DataFrame (rows with values, description of columns), which is used to pass data between all pipeline components, as well as interfaces or base classes for common components like the input ingestors, classifiers and summarizers mentioned in 2.1. It also has some implementations of input ingestors (only CSV), classifiers (only simple percentile and predicate classifiers, and cubed classifiers for aggregated data) and summarizers. The most important summarizer seems to be the one based on the FP-Growth [10] algorithm (with some optimizations). In the MacroBase paper [2] the authors said that they tried different algorithms and concluded that FP-Growth was fast and suitable for extensions. There are also summarizers based on the Apriori [9] algorithm, and recently there were some improvements and optimizations, so they are probably good for some cases too. For both summarizer groups there are classes with a generic implementation of the algorithm (FPGrowth and APrioriLinear) which are used by the summarizers. Most of the summarizers work on bounded data (“batch” in MacroBase terminology); the only summarizer for unbounded (“streaming”) data is the FPGrowth-based IncrementalSummarizer, which can be wrapped into a WindowedOperator to use “time”-based windows (“time” can be any increasing attribute, such as an ID). Of course, it is also possible to use batch summarizers for unbounded data, treating each batch separately without keeping “summarization” state between them.

Figure 7 MacroBase summarizers class diagram

Figure 8 MacroBase summarizers extended class diagram with dependencies

Some of the main components from the paper [2] are available only in the macrobase-legacy module. Among them are the MAD and MCD classifiers from the paper [2] – the MAD and MinCovDet classes in Figure 9. MCD seems to be the only MacroBase classifier that can work with multiple metrics.

Figure 9 macrobase-legacy classifiers class diagram

All classifiers here extend abstract class BatchTrainScore ("scorer") and get instantiated in MacroBaseConf constructTransform method (called inside BatchScoreFeatureTransform and EWFeatureTransform constructors) according to the provided config. The scorer is used in BatchScoreFeatureTransform and EWFeatureTransform classes, which are used in pipelines.

FeatureTransform ft = new BatchScoreFeatureTransform(conf);
ft.consume(data);

OutlierClassifier oc = new BatchingPercentileClassifier(conf);
oc.consume(ft.getStream().drain());

Summarizer bs = new BatchSummarizer(conf);
bs.consume(oc.getStream().drain());
Summary result = bs.summarize().getStream().drain().get(0);

Figure 10 Code from legacy batch pipeline

As we can see from Figure 10, the pipeline passes scores to an OutlierClassifier, which determines outliers using either a static threshold or a percentile. BatchScoreFeatureTransform simply passes input to the scorer and returns the output (score values). EWFeatureTransform is for unbounded data and uses FlexibleDampedReservoir (the ADR described in the paper [2], Algorithm 1, based on A-Chao [13]) to sample/decay the training data used by the scorer. When receiving a new data portion, EWFeatureTransform inserts all data records into the reservoir, in addition to passing the input to the scorer and returning the output. Training of the scorer happens periodically (as specified in the config) by retrieving a data sample from the reservoir. The reservoir is also decayed periodically. Another interesting component in macrobase-legacy is AMC (Amortized Maintenance Counter, Algorithm 3 in [2]). It is a probabilistic data structure for maintaining a list of the most frequent items (integer numbers) in a stream, without storing a separate counter for each distinct number. It is used in the legacy MacroBase streaming summarizer to maintain the most frequent explanatory attributes among inliers and outliers (the attributes are strings, but they are encoded as integers during ingestion). It is similar to the Space-Saving algorithm from [14] but has better performance in exchange for bigger memory consumption. The Dawn team was also working on another similar data structure for heavy-hitter sketching – the Weight-Median Sketch [15] – but it is not integrated into MacroBase yet, only a C++ implementation was published so far (https://github.com/stanford-futuredata/wmsketch).
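
For intuition about the reservoir idea, below is a minimal sketch of plain reservoir sampling. This is the classic unweighted Algorithm R, not the exponentially damped ADR/A-Chao variant that MacroBase actually uses, which additionally weights and periodically decays items:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Classic unweighted reservoir sampling (Algorithm R):
// keeps a uniform random sample of fixed size from a stream of unknown length.
public class Reservoir<T> {
    private final List<T> sample = new ArrayList<>();
    private final int capacity;
    private final Random random = new Random();
    private long seen = 0;

    public Reservoir(int capacity) {
        this.capacity = capacity;
    }

    public void insert(T item) {
        seen++;
        if (sample.size() < capacity) {
            sample.add(item);
        } else {
            // Replace an existing element with probability capacity / seen.
            long j = (long) (random.nextDouble() * seen);
            if (j < capacity) {
                sample.set((int) j, item);
            }
        }
    }

    public List<T> getSample() {
        return sample;
    }
}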

2.1.3 MacroBase SQL

One of the biggest features recently added to MacroBase is MacroBase SQL. It allows querying the data source using standard SQL syntax extended with some MacroBase-specific operators. Currently it does not add any new features compared to the web UI and console runners, but it can make dataset exploration more convenient. For example, it allows automatically using all suitable explanatory attributes or experimenting with different parameters (classification, min support, ...) without editing the configs and restarting the program. It is also possible to use MacroBase SQL with an Apache Spark cluster (https://macrobase.stanford.edu/docs/sql/spark/). Right now it is available only via a console application that works similarly to other SQL shells, but they are also working on a graphical UI which should be released soon. Also, currently it supports input only from CSV and only one APrioriLinear batch summarizer, but it should be easy to extend because it uses the same interfaces as other MacroBase pipelines. The two main non-standard operators are DIFF and SPLIT. DIFF takes two sets, outliers and inliers (that is, the result of the MacroBase classification step), and outputs explanations using the specified attributes (or * to use all suitable attributes); it is also possible to specify minimum support, minimum ratio and the ratio metric, the same as in the configs for all other pipelines. SPLIT is just a shortcut that allows writing

SELECT * FROM DIFF (SPLIT flights WHERE DEPARTURE_DELAY > 10.0) ON *;

instead of

SELECT * FROM DIFF
  (SELECT * FROM flights WHERE DEPARTURE_DELAY > 10.0) outliers,
  (SELECT * FROM flights WHERE DEPARTURE_DELAY <= 10.0) inliers,
ON *;

All commands can be composed like in standard SQL, so we can easily filter, join, use subqueries everywhere, etc. For example, the output of DIFF is a set of rows (one row for each explanation) with columns for all attributes specified in ON (NULL if an attribute is not used in the explanation), as well as support and ratio columns. This is the reason why we need to use "SELECT * FROM DIFF ..." to see the result. In the current implementation there is no explicit support for classifiers like in other MacroBase pipelines, but it seems they can be easily added via User-Defined Functions (UDF), except for classifiers working on multiple metrics. A UDF can be used to convert column values, such as "SELECT normalize(someColumn), ...". UDFs need to implement the MBFunction interface from macrobase-lib, which contains a method receiving an array with all values of the column. However, despite being called "user", currently all possible UDFs (normalize and percentile) are hard-coded in MBFunction’s getFunction static method in macrobase-lib. Hopefully this will be improved in the future.

2.2 Outlier detection algorithms

An outlier/anomaly detection algorithm is an algorithm that finds anomalies in data. Depending on the algorithm and data, it can process points one by one or collected into bigger groups. Often it must process data quickly, to detect issues as early as possible (to prevent damage, fraud, etc.) and to keep up with new data (for unbounded data). Different algorithms have different properties, such as whether they need data for training, whether it needs to be labeled (supervised) and how they work for different anomaly and data types. There are many different approaches to outlier detection, such as:
• Statistical methods, for example, trying to fit data into some standard distribution.
• Distance-based. An object is considered an outlier if there are fewer than k objects in radius R around the object (Figure 11); a brute-force sketch of this definition is shown after this list.

Figure 11 Distance-based outlier detection example

• Density-based, like the LOF (Local Outlier Factor) algorithm. Distances to the k nearest neighbors are used to estimate an object’s density, and objects with relatively low density are considered outliers.
• Isolation methods, like iForest, separating outliers from the rest of the data.
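
For illustration, here is a minimal brute-force sketch of the distance-based definition above (O(n²) distance computations; the R and k values a caller would pass are hypothetical and dataset-dependent):

// Brute-force distance-based outlier detection:
// a point is an outlier if fewer than k other points lie within radius R.
public class DistanceBasedOutliers {
    static boolean[] findOutliers(double[][] points, double r, int k) {
        boolean[] outlier = new boolean[points.length];
        for (int i = 0; i < points.length; i++) {
            int neighbors = 0;
            for (int j = 0; j < points.length && neighbors < k; j++) {
                if (i != j && euclidean(points[i], points[j]) <= r) {
                    neighbors++;
                }
            }
            outlier[i] = neighbors < k;
        }
        return outlier;
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}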

In this work so far, we analyzed and tested the outlier detection algorithms provided with MacroBase, as well as some other algorithms that we added to MacroBase (section 3.1): MCOD, iForest and LOF.

Algorithm | Speed | Memory | Notes
Percentile | O(n) | O(1) | Works only on univariate data.
MAD | Training: O(n*log(n)) (sorts data); Scoring: O(n) | Training: O(1)8; Scoring: O(1) | Works only on univariate data.
MCD (MinCovDet, FastMCD) | Training: O(h + nmetrics^2*n + nmetrics^3 + nmetrics^3*n + n*log(n)), where h = a(n + nmetrics + 1), 0 < a < 1; Scoring: O(n*nmetrics^2) | Training: O(n + nmetrics); Scoring: O(nmetrics) | Works only on multivariate data. Not deterministic, runs training multiple times.
LOF | Training: O(n^2) or O(n*log(n)) [16]; Scoring: O(n) | Training: depends on implementation, but high; Scoring: O(1) | Has some hyperparameters, but they do not affect speed and quality as much as in MCOD.
MCOD | ? Depends on data and hyperparameters. Slow, but usually faster than LOF. | ? Can be very high, depending on hyperparameters. | Depends on the R, k hyperparameters very much; they must be adjusted for each dataset, no good defaults.
Isolation forest (iForest) | Training: O(t*ψ^2); Scoring: O(n*t*ψ) | Training: O(t*ψ); Scoring: O(1) | Not deterministic. t – number of trees, ψ – subsample size.

Table 1 Summary of the outlier detection algorithms analyzed in this paper.

2.2.1 Percentile

The Percentile classifier is one of the simplest outlier detection algorithms available in MacroBase. It simply reports too low and/or too high values, for example values that lie in the lowest 1% (below the 1st percentile) of the given data portion. There are selection algorithms like Quickselect with O(n) average time complexity [17], so it should be fast even for very big datasets.
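
A minimal sketch of such a classifier, assuming we flag the lowest percentile of a batch (sorting is used for clarity instead of Quickselect, so it is O(n*log(n)) rather than O(n)):

import java.util.Arrays;

// Flags values at or below the given low percentile as outliers.
public class PercentileClassifierSketch {
    static boolean[] classifyLow(double[] values, double percentile) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        // Index of the cutoff value, e.g. percentile = 1.0 -> lowest 1%.
        int cutIndex = (int) Math.floor(values.length * percentile / 100.0);
        double threshold = sorted[Math.min(cutIndex, sorted.length - 1)];
        boolean[] outlier = new boolean[values.length];
        for (int i = 0; i < values.length; i++) {
            outlier[i] = values[i] <= threshold;
        }
        return outlier;
    }
}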

8 Actually, it is O(n) in MacroBase because it copies the data to avoid modifications.

2.2.2 MAD

Another simple way to find outliers in univariate data is the Median Absolute Deviation from the median. It is more robust than some other statistical measures (such as the standard deviation) [18] and very fast. To calculate the MAD of a dataset (or some part of it) we need to:
1. Find the median.
2. Calculate the deviation from the median for each element: |x − median(x)|.
3. Find the median of these deviations: mad = median(|x − median(x)|).
In the MacroBase implementation the median and MAD values are calculated on the training set and then the score for each element is evaluated as

score = |x − median(x_train)| / mad_train
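
A minimal sketch of this scoring scheme (the degenerate case mad = 0 is ignored for brevity):

import java.util.Arrays;

// MAD-based scoring: score = |x - median(train)| / mad(train).
public class MadScorer {
    private double median;
    private double mad;

    public void train(double[] trainData) {
        median = medianOf(trainData.clone());
        double[] deviations = new double[trainData.length];
        for (int i = 0; i < trainData.length; i++) {
            deviations[i] = Math.abs(trainData[i] - median);
        }
        mad = medianOf(deviations);
    }

    public double score(double x) {
        return Math.abs(x - median) / mad;
    }

    private static double medianOf(double[] values) {
        Arrays.sort(values);
        int mid = values.length / 2;
        return values.length % 2 == 1 ? values[mid] : (values[mid - 1] + values[mid]) / 2.0;
    }
}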

2.2.3 FastMCD

MCD (Minimum Covariance Determinant) is a more complex outlier detection algorithm for multivariate data. According to the MacroBase paper [2], this implementation is based on FastMCD. It is approximate and runs the training stage multiple times on random subsets until it achieves the specified delta. It works only with multivariate data, probably because it calculates the Mahalanobis distance. It is also very fast unless the dataset has a lot of metric columns: in Table 1 there are some quadratic and cubic components in the training time complexity (finding the covariance matrix, matrix inversion, calculating the determinant, etc.), but only in the number of metrics, so in practice with a small number of metrics it grows almost linearly with the training set size (as we show in section 4.1).

2.2.4 LOF

LOF (Local Outlier Factor) is a density-based outlier detection algorithm [16]. LOF is an old algorithm (2000) and there are many modifications/extensions, such as [19], [20], possibly achieving better results and performance. So far, we looked only at the standard LOF. It has quadratic time and memory complexity for the training step in the implementation we used (with a grid data structure for distances), so it should not be used with big training sets. According to [16], other approaches can be used to achieve better complexity for this step (but worse for the scoring step), such as indexing with O(n*log(n)). Also, it is possible to parallelize it.

2.2.5 MCOD

MCOD (Micro-cluster based Continuous Outlier Detection) is a recently developed (2015) distance-based outlier detection algorithm designed for unbounded data streams. It uses micro-clusters (of at least k + 1 points) with radius R / 2 to achieve better performance (fewer distance computations, because the micro-cluster centers can be used) and to save some memory. [21]

Figure 12 Example of micro-clusters for k = 4

In the evaluation in [22] it achieved the best performance among similar algorithms. In our experiments (section 4) it was faster than LOF, but still quite slow. Besides speed, another issue with this algorithm is that we need to adjust the distance-based R and k hyperparameters for each dataset; there are no good default hyperparameters that work well on all datasets. In section 4.3.2 we investigated some ways of setting them automatically.

2.2.6 Isolation forest

Isolation forest (iForest) is another recently developed anomaly detection algorithm (2008-2012); it uses a novel isolation-based approach without distance or density measurements. [23] It is designed for high-dimensional data and works by partitioning elements using multiple random decision trees. The elements that require fewer splits to become isolated are more likely to be outliers (Figure 13). In our experiments (section 4) it is faster than LOF and MCOD, and usually shows better quality on high-dimensional data.
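
To make the scoring step concrete, the sketch below converts an average path length over the trees into the anomaly score s(x) = 2^(−E(h(x)) / c(ψ)) described in the iForest paper [23], where c(ψ) is the average path length of an unsuccessful BST search; the tree construction itself is omitted, so this is only an illustrative fragment:

// Converts the average path length of a point across all isolation trees
// into an anomaly score in (0, 1); values close to 1 indicate anomalies.
public class IsolationForestScore {
    private static final double EULER_MASCHERONI = 0.5772156649;

    // c(psi): average path length of an unsuccessful search in a BST of size psi.
    static double averagePathLength(int psi) {
        if (psi <= 1) return 0;
        if (psi == 2) return 1;
        double harmonic = Math.log(psi - 1) + EULER_MASCHERONI;
        return 2 * harmonic - 2.0 * (psi - 1) / psi;
    }

    static double anomalyScore(double avgPathLength, int subsampleSize) {
        return Math.pow(2, -avgPathLength / averagePathLength(subsampleSize));
    }
}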

Figure 13 Isolation forest example. [24]

2.3 Evaluation of anomaly detection quality

There are many ways to measure anomaly detection quality (how good the results are) if we have a dataset with known (labeled) anomalies. The most obvious way is simple accuracy:

accuracy = (TP + TN) / count

But it is not a good measure for anomaly detection because of the class imbalance: the number of anomalous elements is very small (otherwise they would not be anomalous) and the accuracy will be high even if we skip all anomalies and report all elements as inliers. A better metric is the F1-score [25]:

precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 * precision * recall / (precision + recall)

However, one issue with metrics like this is that most of the algorithms output a score and we need to choose a threshold to convert the scores into binary results. One popular metric that uses all thresholds is the AUC (Area Under Curve) of the ROC or PR (Precision-Recall) curves [26].
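
A minimal sketch computing these confusion-matrix metrics from binary ground-truth and predicted labels (score thresholding and AUC computation are omitted):

// Precision, recall and F1 from binary ground truth labels and predictions.
public class BinaryMetrics {
    static double[] precisionRecallF1(boolean[] actualOutlier, boolean[] predictedOutlier) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < actualOutlier.length; i++) {
            if (predictedOutlier[i] && actualOutlier[i]) tp++;
            else if (predictedOutlier[i] && !actualOutlier[i]) fp++;
            else if (!predictedOutlier[i] && actualOutlier[i]) fn++;
        }
        double precision = tp + fp == 0 ? 0 : (double) tp / (tp + fp);
        double recall = tp + fn == 0 ? 0 : (double) tp / (tp + fn);
        double f1 = precision + recall == 0 ? 0 : 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f1 };
    }
}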

The ROC curve consists of 2D points for each distinct threshold (score), with TP Rate = TP / (TP + FN) on the Y axis and FP Rate = FP / (FP + TN) on the X axis; the PR curve is the same but with Precision and Recall on the axes. The scores do not have to be normalized to some specific range (like [0; 1]), because we do not need to do anything with them except sorting. An algorithm that does not produce any useful results (such as outputting random scores without analyzing the data) will have ROC AUC close to 0.5 (a diagonal line) and PR AUC close to 0. PR curves are sometimes considered a better metric for imbalanced classes [27, 28]; however, according to Davis and Goadrich [28], “a curve dominates in ROC space if and only if it dominates in PR space”. Another issue is how to decide which parameters (like k and R for distance-based algorithms) to use for each algorithm on each dataset. In the Goldstein paper [6] they used mostly kNN algorithms and averaged results for different values of k between 10 and 50. In our case, however, each algorithm has different parameters, not just k, so it looks like the only option for comparing all algorithms is to simply choose the best parameters for each algorithm on each dataset. We used Grid Search [29] to find good parameters automatically.

2.4 Datasets

We used most of the freely available UCI9 and OpenML10 datasets from [30], using the same subsets and classes for outliers and inliers as in [30], as well as the modified Shuttle dataset from [6] and the Yahoo S5 dataset for univariate data. Several datasets (German Credit, Magic Gamma, Mushroom) were not included in this work because they require additional processing (section 3.3.2 of [31]) to decrease the number of anomalies, which we have not implemented yet (originally these datasets contain 30-48% samples of the chosen “anomalous” classes, making them not suitable for anomaly detection). According to [31], the Yeast dataset may not be a good choice for anomaly detection benchmarking because it is too difficult (all algorithms fail on it).

Dataset | Samples | Numeric dims | Categ. dims | Anomalies %
Abalone | 2K | 7 | 1 (3 one-hot) | 1.51%
Car | 2K | 0 | 6 (21 one-hot) | 3.76%
CovType | 286K | 54 (44 binary) | 0 | 0.96%
Mammography | 11K | 6 | 0 | 2.32%
Shuttle | 12K | 9 | 0 | 7.02%
Shuttle-goldstein11 | 46K | 9 | 0 | 1.89%
Thyroid | 3K | 21 (15 binary) | 0 | 2.25%
Wine | 5K | 11 | 0 | 0.51%
Yahoo S5 A112 | 1.4K, 67 files (two 741) | 1 | 0 | 0-15%
Yeast | 1K | 8 | 0 | 4.62%
Table 2 Real datasets used in this work.

Some datasets contain categorical attributes; we converted them to numerical using one-hot encoding like in [30]. The results on the Car dataset, which contains only categorical features, were higher than the baseline (random choice without any data analysis) after such conversion, but on Abalone the results became worse after adding its categorical feature (Table 3), so we did not include this attribute during the experiments.

9 https://archive.ics.uci.edu/ml
10 https://www.openml.org
11 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OPQMVF
12 https://webscope.sandbox.yahoo.com/catalog.php?datatype=s&did=70

Algorithm | PR AUC on Abalone without categ. | PR AUC on Abalone with categ.
MCOD | 0.16 | 0.14
LOF | 0.20 | 0.06
FastMCD | 0.18 | 0.18
Table 3 Abalone dataset results with the categorical attribute converted via one-hot encoding.

Most of these datasets contain a very small number of samples or features, so for some experiments (such as time and memory performance scalability) synthetically generated datasets are needed. There are many ways to generate synthetic data. The simplest way is to use synthetic data generators like Mulcross13; however, they can be too simple and not a good choice for anomaly detection quality evaluation [31]. Another way is presented in the Emmott et al. meta-analysis paper [31]: the datasets are generated by modifications of real datasets affecting some of the characteristics like frequency of anomalies, presence of irrelevant features, swamping and clusteredness. In this work we currently used only Mulcross datasets for scalability experiments because of their simplicity. The Mulcross generator allows specifying the desired percentage of anomalies, the number of samples and dimensions, the distance between clusters and the number of anomalous clusters. It generates datasets with one big cluster of inliers and dense anomaly clusters (Figure 14). We were also looking into generating datasets similar to [30] (producing more scattered anomalies like in Figure 15 using Student's t distributions) and the dataset modification methods from [31], but have not implemented them yet.

13 https://github.com/icaoberg/multout

Figure 14 Examples of datasets generated by Mulcross generator using different distance and number of clusters parameters. The maximum number of clusters is limited (like in the bottom-right example) depending on the number of dimensions.

Figure 15 Example of a dataset generated based on Student's T distributions using the method described in [30].

At the beginning of our work we were also trying to find suitable datasets for MacroBase explanation, and it turned out to be a much more difficult task because most anomaly datasets contain only numerical measurements (such as the amount of traffic), sometimes even without attribute descriptions, while for MacroBase explanations we need some categorical attributes with values shared among elements (usually strings like names, models, IDs).

2.5 Numenta Anomaly Benchmark

Numenta Anomaly Benchmark (NAB) is a benchmark for anomaly detection in unbounded data, originally developed for the evaluation of the Numenta HTM algorithm. It consists of the benchmark itself (algorithm output scoring methods designed for unbounded data and implementations of several algorithms) and datasets, mostly real datasets but also some artificially generated ones for anomalous behaviors missing there. [32] [33] [34] All NAB datasets are univariate (one numerical metric and a time column) and have 1-22K rows. NAB artificial datasets include mostly collective/contextual anomalies, for example, periodic spikes and then suddenly a flat line instead of a spike (the second dataset in Figure 16). Also, there are several datasets without anomalies.

Figure 16 Example of NAB artificial datasets with collective/contextual anomalies.

Real NAB datasets contain many different anomalous behaviors, from simple point anomalies to collective/contextual anomalies. NAB may be useful for us: we can use it as a standalone benchmark or select some datasets that we find interesting (unique behavior not encountered in other datasets we have, etc.) to use with other evaluation methods. However, it includes only univariate data and focuses on time series, while all the algorithms we used in this work are not designed for time series, so it would be interesting to compare them with other methods that were specifically designed for time series, such as ones based on ARIMA [35] or neural networks [36]. NAB is published under the AGPL 3.0 license [37], so it could be difficult to use its source code or data as part of some closed-source product or a project with a different license.

2.5.1 NAB scoring method

NAB uses its own method for evaluating anomaly detection algorithms. It is supposed to be more suitable for the evaluation of algorithms working on unbounded data than standard methods like AUC and F1, because those do not incorporate time in the calculations. [38] Features of this scoring system:
• Rewards early predictions by using a bigger anomaly window instead of labeling only anomaly points.
• Uses a sigmoid function allowing a higher score for earlier detection within the anomaly window (only the earliest detection is counted, the rest are ignored).
• Allows setting the weights of TP/TN/FN/FP, for example to reward a low amount of FP.
• Chooses the optimal threshold automatically, but the same one for all data files.

2.5.1.1 Labels

During manual labeling the Numenta team marks only the anomaly start point. Then they create an anomaly window by including elements around that point (the total window length of all anomalies in a file is 10% of the file length). So, in the labels file there are two timestamps for each anomaly: Start, a bit before the starting point of the anomaly, and End, a bit after that point.

2.5.1.2 Score calculation

Raw score calculation for each data file14:

score_raw = TP_score * TP_weight + FP_score * FP_weight + FN_score * FN_weight

All raw score components (TP_score, FP_score, FN_score) are set to 0 initially and changed as follows during the traversal of data points:
TP_score – increased once for the first TP in each anomaly window. Depends on the position: earlier detections result in a higher score.
FP_score – decreased for each FP point. Depends on the position: false detections far away from the anomaly window receive a bigger penalty.
FN_score – decreased once for each anomaly window without a TP.
From this we can see that FP can affect the score much more than FN, because each FP point is penalized while an FN is penalized only once per window (even though the FP weight is lower, 0.11 in the standard profile). An algorithm that reports many FP (even if only on several data files) will get a much worse score than the Null “detector” (which has only TN and FN). All scores involving positions are calculated using a scaled sigmoid function, rewarding TPs within the anomaly window as follows [33]:

14 Based on source code from https://github.com/numenta/NAB/blob/4466b6fd5ee6a172abccf280e783cc42632c3e49/nab/scorer.py#L167

• A relative position of −3.0 is the far-left edge of the anomaly window and corresponds to a score of 2 * sigmoid(15) − 1.0 ≈ 0.999999. This is the earliest TP possible for a given window; an earlier detection is an FP.
• A relative position of −0.5 reflects a slightly later detection and corresponds to a score of 2 * sigmoid(0.5 * 5) − 1.0 ≈ 0.84828.
• A relative position of 0.0 is the right edge of the window and corresponds to a score of 2 * sigmoid(0) − 1 = 0.0. Any detection beyond this point is scored as an FP.
• Relative positions > 0 correspond to FPs increasingly far away from the right edge of the window. A relative position of 1.0 is past the back edge of the window and corresponds to a score of 2 * sigmoid(−5) − 1.0 ≈ −0.98661.
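
These values come from a scaled sigmoid of the relative position; a minimal sketch of that weighting function is below (reconstructed from the listed examples, so the constant 5 and the sign convention are assumptions rather than quoted from the NAB source):

// Scaled sigmoid used to weight detections by their relative position
// within (or after) the anomaly window: close to 1 near the left edge,
// 0 at the right edge, negative for false positives past the window.
public class NabSigmoid {
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    static double positionScore(double relativePosition) {
        return 2.0 * sigmoid(-5.0 * relativePosition) - 1.0;
    }

    public static void main(String[] args) {
        System.out.println(positionScore(-3.0)); // ~0.999999
        System.out.println(positionScore(-0.5)); // ~0.84828
        System.out.println(positionScore(0.0));  //  0.0
        System.out.println(positionScore(1.0));  // ~-0.98661
    }
}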

Profile | TP weight | FP weight | FN weight
Standard profile | 1 | 0.11 | 1
Reward low FP | 1 | 0.22 | 1
Reward low FN | 1 | 0.11 | 2
Table 4 NAB scoring profiles.

Raw scores are normalized using the Null detector as a baseline (0), while still allowing negative scores (for example, if there are a lot of FP). Score normalization formulas15:

PerfectScore = TotalAnomaliesCount * TP_weight = 116 * TP_weight
BaseScore = NullDetectorScore_raw (−116 for the standard profile)
score_final = 100 * (TotalScore_raw − BaseScore) / (PerfectScore − BaseScore)
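
As a worked example under the standard profile (TP weight 1, so PerfectScore = 116 and BaseScore = −116): a detector with TotalScore_raw = 0 gets score_final = 100 * (0 − (−116)) / (116 − (−116)) = 100 * 116 / 232 = 50, the Null detector gets 100 * (−116 − (−116)) / 232 = 0, and a perfect detector gets 100 * (116 − (−116)) / 232 = 100.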

15 Based on source code from https://github.com/numenta/NAB/blob/4466b6fd5ee6a172abccf280e783cc42632c3e49/nab/runner.py#L223

3 BENCHMARKING PLATFORM IMPLEMENTATION

Our benchmarking platform consists of the three main components (steps) shown in Figure 17:
1. Preparation of datasets and algorithm configurations.
2. Execution of the algorithm with the specified configuration, with time and memory measurements.
3. Evaluation of the execution results, generation of plots, reports, etc., depending on the experiment.

Figure 17 Benchmarking platform architecture. https://bit.ly/2IH55N8

The execution is performed by our Java benchmark based on MacroBase. We forked the MacroBase project on GitHub and created a separate Git branch with a new Maven module for the benchmark and other extensions; this allows us to easily merge changes from the original project. The benchmark receives the anomaly detection algorithm and dataset configuration, runs the algorithm and saves the algorithm output, elapsed time and peak memory usage. Also, it is possible to run Grid Search to find the best combination of algorithm hyperparameters from lists of possible hyperparameter values. We tried to keep this component as simple as possible, producing only the necessary output and performing all quality evaluations, normalizations, etc. after the execution using separate scripts. Otherwise we would need to re-execute the benchmark for all affected algorithms and datasets each time we want to create or update some plot, use another quality metric and so on; it would also make the implementation more complicated and difficult to maintain and extend. We use Python for the data preparation and evaluation scripts because it is an easy to use high-level programming language, it is popular in ML and there are many libraries for data processing, classification evaluation and plots. R or Matlab/Octave could be a good choice too because they contain many packages for statistical operations. The architecture of our platform allows using any tools for the preparation and evaluation steps, not limiting us to any single programming language or framework. Of course, sometimes it can be easier to use the same framework to be able to share common code, but most of the scripts so far are quite small and rely on easy to use libraries like scikit-learn, so currently it is not an issue for these tasks.
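
Conceptually, the Grid Search step enumerates every combination of the listed hyperparameter values and keeps the best-scoring one. A minimal sketch of that idea follows; the evaluate function and the score-maximization criterion are illustrative assumptions, not our benchmark's actual API:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Exhaustive grid search over lists of candidate hyperparameter values.
public class GridSearchSketch {
    static Map<String, Object> findBest(Map<String, List<Object>> grid,
                                        ToDoubleFunction<Map<String, Object>> evaluate) {
        // Build all parameter combinations.
        List<Map<String, Object>> combinations = new ArrayList<>();
        combinations.add(new HashMap<>());
        for (Map.Entry<String, List<Object>> param : grid.entrySet()) {
            List<Map<String, Object>> expanded = new ArrayList<>();
            for (Map<String, Object> partial : combinations) {
                for (Object value : param.getValue()) {
                    Map<String, Object> next = new HashMap<>(partial);
                    next.put(param.getKey(), value);
                    expanded.add(next);
                }
            }
            combinations = expanded;
        }
        // Evaluate each combination and keep the best one.
        Map<String, Object> best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map<String, Object> candidate : combinations) {
            double score = evaluate.applyAsDouble(candidate); // e.g. PR AUC on the dataset
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }
}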

Figure 18 UML class diagram of configuration and result data classes. https://bit.ly/2IH55N8

The input and output data of the benchmark (step 2) are shown in Figure 18. It receives the configurations of the dataset (only the required information like the file or database path and column names) and the algorithm (ID, parameters and optionally a Grid Search configuration), and outputs time, memory usage and the algorithm output (as a CSV file), as well as the initial configuration and the final algorithm parameters (the initial parameters merged with the Grid Search result, if it was used), to be able to refer to them during the results evaluation. Grid Search should not be used in the same execution (process) for experiments where we care about memory measurements, because it runs the algorithm many times, possibly allocating lots of memory, and in the current implementation the peak memory measurement is not reset; time measurements can also be affected because of Java Garbage Collection. So, the simplest way for such experiments is to run Grid Search and then create configuration files without Grid Search using the parameters that were found. We are also looking into ways to make this more convenient, such as adding an option to rewrite the configuration files automatically. Currently we use Java MemoryPoolMXBeans to measure the heap peak memory usage during the classification. In general, reliable memory measurement can be quite difficult in Java because of the GC [39]. But in our case it may be easier (at least for batch mode) because we only have a benchmarking program that reads data and executes an algorithm, it is not a small part of some bigger program (micro-benchmarking), so even such a simple implementation should work well enough. One possible improvement is to also measure the memory usage before the classification and then use the difference between these two values (this was done in [30]), however in our case it probably would not change much, because in [30] it was done mainly to minimize the difference between the platforms (Python, R, Matlab).
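
A minimal sketch of heap peak measurement with MemoryPoolMXBeans is shown below; the pools summed and the reset points may differ from what our benchmark actually does:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class PeakHeapMeter {
    // Call before the classification step to clear previous peaks.
    static void resetPeaks() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            pool.resetPeakUsage();
        }
    }

    // Sum of peak used bytes across all heap memory pools since the last reset.
    static long peakHeapBytes() {
        long peak = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                peak += pool.getPeakUsage().getUsed();
            }
        }
        return peak;
    }
}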

3.1 Anomaly detection algorithms

In order to make the performance comparison fair, the algorithms should be implemented efficiently (ideally with the same level of optimization) using the same programming language and framework [40]. Unfortunately, not all our algorithms are currently well optimized, so in the future this should be resolved, for example by porting implementations from other well-known libraries. Replacing the benchmark (step 2) with some other platform is also possible; however, it can be more difficult when we extend it for streaming and explanations, which are among the future goals, because most of the available frameworks for anomaly detection do not provide this functionality. For now, we tried to focus more on quality evaluation than on time and memory performance. We added these outlier detection algorithms to our platform:
• LOF (Local Outlier Factor)
• MCOD (Micro-cluster based Continuous Outlier Detection)
• Isolation forest (iForest)
• LOCI (Local Correlation Integral) – not suitable for practice, O(n^2) memory (for the whole batch, it is not possible to use a small training set) and O(n^3) time, so it was not used in this work. The approximate aLOCI version may be better.

We also used the algorithms that are provided with MacroBase:
• MAD (Median Absolute Deviation from median)
• MCD (Minimum Covariance Determinant)
• Percentile

Some algorithms provided with MacroBase were not used:
• Some simple algorithms (using quantiles or specified predicates) for “cubed” data/grouped attributes (count, mean, std, …). We have not tested them because we do not have suitable datasets.
• – added in May 2018, we have not tested it yet. It looks like it is supposed to be used only with a special summarizer they added16.

16 https://github.com/stanford-futuredata/macrobase/commit/9af7c8f7c510a9300a6bcf3ff6ae7095dbca58fb

We created an MCOD classifier for MacroBase based on the source code provided with [22]. The classifier implementation consists mostly of instantiating the provided implementation with the specified parameters (k, R, window and slide sizes), passing data to it and (similarly to other MacroBase classifiers) adding a result column to the DataFrame with the value 1.0 for outliers and 0.0 otherwise. We investigated the possibility of outputting scores/probabilities instead of boolean results but have not found any way to do it with this algorithm. We also added some optimizations17 improving performance, such as an option to use a hash set instead of an array-based list for faster search and duplicate removal in the collection of outliers, and better performance in “batch” mode (when the slide size is equal to the window size) by removing expensive queue operations that are not needed in this mode. The LOF implementation was adapted (and optimized18, removing unnecessary temporary array creation and boxing of Java objects) from one of the popular Java implementations on GitHub19. It is not very efficient (it uses the simplest grid-based implementation) and we were looking at other options, such as porting implementations from Weka20, ELKI21 or sklearn22, but they are much more difficult to port because they have many dependencies on their frameworks, so this was not done because of the lack of time. The Isolation forest implementation was ported from Weka. It seems to be an efficient implementation and the source code is quite short without many Weka dependencies, unlike LOF, so it was not difficult to port.

17 https://github.com/anomaly-detection-macrobase-benchmark/macrobase/commits/alexp-vmu/alexp/src/main/java/alexp/macrobase/outlier/mcod/MicroCluster_New.java
18 https://github.com/anomaly-detection-macrobase-benchmark/macrobase/commits/alexp-vmu/alexp/src/main/java/alexp/macrobase/outlier/lof/bkaluza/LOF.java
19 https://github.com/bkaluza/jlof
20 https://www.cs.waikato.ac.nz/ml/weka/
21 https://elki-project.github.io/
22 https://scikit-learn.org/stable/modules/outlier_detection.html

Figure 19 shows our hierarchy of anomaly detection algorithm implementations and interfaces. We kept the Classifier base class for all algorithms from the original MacroBase and added an additional MultiMetricClassifier subclass for algorithms working on multivariate data (LOF, FastMCD, iForest, MCOD). We also added a Trainable interface for algorithms with a training step (MAD, LOF, FastMCD, iForest) to make it easier to measure the elapsed time separately for training and scoring.
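
The sketch below shows the assumed shape of these additions and how they allow timing training and scoring separately; the types are simplified stand-ins and the method names may not match the declarations in our repository exactly:

// Illustrative stand-ins; the real MacroBase DataFrame and Classifier are richer.
class DataFrame { /* rows and typed columns */ }

abstract class Classifier {
    public abstract void process(DataFrame input) throws Exception;
}

// Assumed shape of the added abstractions.
interface Trainable {
    void train(DataFrame trainingData) throws Exception;
}

abstract class MultiMetricClassifier extends Classifier {
    protected final String[] metricColumns;

    protected MultiMetricClassifier(String[] metricColumns) {
        this.metricColumns = metricColumns;
    }
}

class TimedRun {
    static void run(Classifier classifier, DataFrame train, DataFrame test) throws Exception {
        long start = System.nanoTime();
        if (classifier instanceof Trainable) {
            ((Trainable) classifier).train(train);
        }
        long trainingNanos = System.nanoTime() - start;

        start = System.nanoTime();
        classifier.process(test); // scoring / classification step
        long scoringNanos = System.nanoTime() - start;

        System.out.println("training: " + trainingNanos / 1_000_000 + " ms, "
                + "scoring: " + scoringNanos / 1_000_000 + " ms");
    }
}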

Figure 19 UML class diagram of the outlier detection algorithms hierarchy in our platform. https://bit.ly/2IH55N8

4 EXPERIMENTS

The main goal of the experiments in this work was to evaluate and compare the algorithms that were provided in MacroBase (Percentile, MAD, MCD) and the additional algorithms that we integrated (LOF, MCOD, iForest) in order to understand the differences between them and identify important characteristics that can be used for algorithm selection. We started by reproducing experiments from [30] on our algorithms, evaluating scalability when increasing the number of samples, as well as simple quality evaluations on different real datasets. One of the goals of these experiments was to test our benchmarking platform and algorithm implementations. We were planning to perform other experiments, starting with an increasing number of features (adding irrelevant features like in [31] or generating synthetic datasets), a varying number of anomalies, and a detailed evaluation of each algorithm, such as evaluating the effects of changing hyperparameters (slide and window sizes for MCOD similarly to [22], number of trees and subsamples for iForest, etc.), but we did not have time to implement these experiments yet. We also used the Numenta Anomaly Benchmark (section 4.3), and in section 4.3.2 we created a simple and crude algorithm for MCOD hyperparameters adjustment without using labels. In section 4.4 we summarize the results of our experiments.

4.1 Time and memory performance

We used synthetic Mulcross datasets with 2 features and 5% anomalies for the time and memory scalability experiments. We generated a dataset with 10M samples and extracted smaller (starting with 5K) random subsets having the same characteristics. We used the same default hyperparameters for all algorithms (Table 5), which, as we see in Figure 20, resulted in good detection quality for FastMCD and LOF, except for several spikes for LOF (Figure 20, Figure 23), which may be caused by the very small training set: we did not use a bigger training set for LOF because of the speed of our implementation. Results for iForest were not as good, possibly because of the low dimensionality, and MCOD with these parameters was not better than the baseline outputting random scores without any data analysis (Figure 20 left). After setting suitable MCOD parameters (R = 3.0, k = 0.1 * n) its quality improved (Figure 20 right) but time and memory usage greatly increased (Figure 22).

Algorithm | Parameters
iForest | Number of trees = 100, subsample size = 256, training set size = min(30K, n)
LOF | k = 15, training set size = 400
MCOD | R = 13, k = 150, window size = slide size = n (batch mode)
FastMCD | a = 0.9, delta = 0.00001, training set size = min(4K, n)
MAD | Training set size = min(4K, n)
Percentile | 5%
Table 5 Default algorithm parameters used in our experiments. n – dataset size

Figure 20 Anomaly detection quality (PR AUC) on two-dimensional Mulcross datasets of different sizes, before (left) and after (right) MCOD hyperparameters adjustment.

Figure 21 Time and peak memory usage results on two-dimensional Mulcross datasets of different sizes, before MCOD hyperparameters adjustment.

Figure 22 Time and peak memory usage results on two-dimensional Mulcross datasets of different sizes, after MCOD hyperparameters adjustment.

There are some strange spikes in the time plots for MCOD (Figure 21) and FastMCD, especially at the end for FastMCD in both Figure 21 and Figure 24, as well as some spikes in the memory plots for all algorithms followed by several points without a significant peak memory usage increase. It is not caused by differences in characteristics between the subsets: we checked that they are the same and re-generated the datasets. It is also not some random interference: we ran the benchmark several times and on different clean OSes (Windows and Ubuntu Linux) and the results were very similar. It could be related to Java GC (Garbage Collection) and some optimizations for memory allocation inside the JVM, but we did not have time to investigate this issue deeper yet (we may start by monitoring GC activity).

For the univariate algorithms from MacroBase (MAD, Percentile) we generated similar datasets using Mulcross but with 1 feature. MAD here produces perfect anomaly detection results (Figure 23), like FastMCD. The behavior of the other algorithms is similar to before. One interesting thing is that MAD is faster than our random baseline (Figure 24): this is because MAD uses a very simple formula for scoring (section 2.2.2) which is faster than the Java implementation of random number generation. During training the most expensive operation in MAD is sorting of the training set, but we do not include training plots here because we used only a fixed training set size for now.

Figure 23 Anomaly detection quality (PR AUC) on one-dimensional Mulcross datasets of different sizes.

Figure 24 Time and peak memory usage results on one-dimensional Mulcross datasets of different sizes.

Overall, simple statistical algorithms like FastMCD, MAD and Percentile are much faster and use less memory than the other algorithms. LOF and MCOD are the slowest algorithms here. For LOF this is partially because of our implementation, but in [30], using a much more optimized LOF implementation (with a different time-memory tradeoff), LOF memory usage was also very high, so it looks like LOF is not a good choice for big datasets/batches. MCOD time and memory performance depends on the hyperparameters, so it can be much better on other datasets; also, MCOD is designed for streaming, so this issue would be less important when it is used with smaller window sizes.

4.2 Anomaly detection quality

We used the real datasets that were used in [30] to evaluate anomaly detection quality for most of our algorithms (the multivariate ones). We also used Yahoo S5 to evaluate the univariate algorithms, and we tried to better understand the effect of algorithm hyperparameter changes using the Shuttle dataset from [6]. For non-deterministic algorithms (iForest, FastMCD) we used the average of 5 executions. We used the algorithm hyperparameters from Table 5, but optimized the LOF, MCOD and iForest hyperparameters (except training set sizes and windowing) using Grid Search on each dataset.

Figure 25 Anomaly detection quality (average PR AUC of 5 executions) on multivariate datasets described in 2.4.

In Figure 25 we see that most of these datasets are very difficult for all our algorithms; only on the Shuttle datasets are the results close to perfect. iForest has the best average results here, but on some datasets LOF and FastMCD are better. We noticed a lot of variation in iForest results on the Abalone (0.25-0.50 PR AUC) and CovType (0.05-0.25 PR AUC) datasets with all parameters we tried. On other datasets the iForest variation was much smaller (up to ±0.05 PR AUC). FastMCD did not have significant variation. The iForest results match the results from [30]. The LOF results are also similar, but not as close as iForest, most likely because a different LOF implementation was used, allowing different hyperparameters (such as much higher training set sizes).

Yahoo S5 consists of small files, about 1400 elements each, with different behaviors, so we did not use windowing here; all data for each file was read and processed in one portion. It contains mostly simple point anomalies (some points far away from the rest), so our algorithms worked quite well here, as we can see in Table 6. LOF and iForest produced the best results according to these measurements. It looks like LOF does not depend much on hyperparameters and works well even with small training sets. MAD here is almost as good as LOF, and the training set size did not affect it much either. Percentile has some of the worst results here, but this seems to be caused mostly by the way we measure the results: in many cases it reports only several points of a bigger group, as shown in Figure 26, which should be enough in practice to notice an unusual event. MCOD clearly has a strong dependence on its hyperparameters: the average results were not good when we used fixed R and k for all files but got much better when we set them separately for each file (using Grid Search).

Algorithm  | Parameters                                          | Average ROC AUC | Average PR AUC
LOF        | k 15, training set 400                              | 0.93            | 0.72
LOF        | k 15, training set 200                              | 0.92            | 0.72
iForest    | trees 100, subsample 256                            | 0.93            | 0.70
MAD        | training set 10K (all)                              | 0.93            | 0.69
LOF        | k 60, training set 200                              | 0.91            | 0.69
LOF        | k 15, training set 50                               | 0.89            | 0.68
MAD        | training set 500                                    | 0.93            | 0.67
MAD        | training set 100                                    | 0.92            | 0.67
MCOD       | Grid Search tuned R, k for each file (on whole file) | 0.83           | 0.59
Percentile | 0.5%                                                | 0.75            | 0.22
Percentile | 1%                                                  | 0.80            | 0.20
Percentile | 1.5%                                                | 0.84            | 0.18
MCOD       | R 60, k 20                                          | 0.66            | 0.16
MCOD       | R 20, k 10                                          | 0.66            | 0.14
Random     |                                                     | 0.50            | 0.02

Table 6 Results for the Yahoo S5 dataset, sorted by PR AUC.

Figure 26 Examples of anomaly detections by Percentile algorithm.

There are several more difficult anomalies in Yahoo S5, like the ones shown in Figure 27, on which our algorithms fail. The top-left example looks like a denser group of points, so it is not surprising that our algorithms, designed for point anomalies, fail on it; the bottom-left example also seems to have an anomalous group of points that is too close to the rest of the points, which makes it difficult for our algorithms. On the right side there are two examples in which the data changes significantly (all values after some point become much higher than before, or vice versa); these could be detected if we read the data in smaller portions instead of all at once.

Figure 27 Examples of datasets on which these algorithms fail.

Algorithm | Parameters                                  | Average ROC AUC | Average PR AUC
iForest   | trees 100, subs. 256, training set 30K      | 0.998           | 0.977
iForest   | window 20K, trees 100, subs. 256, training set 4K | 0.997     | 0.975
iForest   | trees 100, subs. 256, training set 4K       | 0.995           | 0.974
iForest   | trees 100, subs. 50, training set 30K       | 0.993           | 0.954
iForest   | trees 10, subs. 256, training set 30K       | 0.995           | 0.949
MCOD      | window 20K, slide 10K, R 30, k 200          | 0.994           | 0.651
LOF       | window 40K, k 60, training set 150          | 0.996           | 0.646
LOF       | window 20K, k 15, training set 50           | 0.996           | 0.639
LOF       | window 20K, k 60, training set 150          | 0.996           | 0.638
MCOD      | window 10K, slide 5K, R 30, k 200           | 0.993           | 0.623
MCOD      | window 20K, slide 2K, R 30, k 200           | 0.990           | 0.616
LOF       | window 20K, k 15, training set 100          | 0.995           | 0.614
LOF       | window 40K, k 15, training set 150          | 0.995           | 0.612
FastMCD   | training set 1K                             | 0.983           | 0.611
FastMCD   | window 20K, training set 200                | 0.985           | 0.609
LOF       | k 15, training set 150                      | 0.99            | 0.600
MCOD      | window 20K, slide 20K, R 30, k 200          | 0.993           | 0.597
LOF       | window 5K, k 15, training set 150           | 0.994           | 0.596
LOF       | window 10K, k 15, training set 150          | 0.993           | 0.569
MCOD      | window 10K, slide 10K, R 30, k 200          | 0.992           | 0.562
FastMCD   | window 20K, training set 1K                 | 0.953           | 0.559
LOF       | window 20K, k 15, training set 150          | 0.993           | 0.556
LOF       | window 20K, k 15, training set 200          | 0.995           | 0.551
MCOD      | window 10K, slide 1K, R 30, k 200           | 0.987           | 0.551
MCOD      | window 40K, slide 10K, R 30, k 200          | 0.907           | 0.523
FastMCD   | window 20K, training set 2K                 | 0.913           | 0.489
MCOD      | window 40K, slide 20K, R 30, k 200          | 0.854           | 0.455
FastMCD   | training set 2K                             | 0.899           | 0.431
MCOD      | window 40K, slide 40K, R 30, k 200          | 0.879           | 0.394
FastMCD   | window 20K, training set 10K                | 0.664           | 0.134
FastMCD   | window 20K, training set 20K                | 0.500           | 0.019
Random    |                                             | 0.491           | 0.019

Table 7 Results for the Shuttle dataset [6], sorted by PR AUC.

From the algorithms we have, only MCOD was designed for unbounded data and uses sliding windows internally, so for the other algorithms "window size" in our results means the size of the data slices loaded from the dataset. Each slice is processed independently because these algorithms/implementations (except MCOD) do not accumulate any state between slices; for iForest, MAD, FastMCD and LOF we retrain on a subset of each slice (all previous training state gets cleared). We can notice that the LOF k hyperparameter can affect the results (although not very significantly), but the best value depends on the dataset: on Yahoo S5 (Table 6) a higher k produced worse results, while on the Shuttle dataset (Table 7) a higher k improves the result. Changing LOF window size and training set size does not seem to produce any clear pattern; the results fluctuate somewhat randomly, probably because we use quite small training sets. Overall, LOF results on the Shuttle dataset are good with all parameters we tried. FastMCD has much higher result variance when changing window and training set sizes. It seems to work well when the training set is much smaller than the window and does not produce any useful results when we use all data for training. MCOD results on this dataset strongly depend on R and k as before, so we chose good R and k using Grid Search (which produced very good results here, even better than LOF in some cases) and then tried different combinations of window and slide sizes. It is difficult to draw conclusions from these results, but it looks like MCOD usually produces worse results when the slide is too big (such as equal to the window) or too small. iForest has the best results on this dataset and was not very sensitive to hyperparameters.

4.3 NAB results

There are 3 ways to use NAB for evaluating the performance of our algorithms:

1. Implement the anomaly detection algorithm in Python (using the NAB base class/conventions). [41]
2. Implement the NAB scoring method in MacroBase.
3. Run the algorithms on the NAB dataset, output the results in NAB format (CSV files with a score from 0 to 1 for each data point) and pass them to NAB. [41]

The first 2 options seemed time-consuming and could introduce implementation mistakes/differences, so we decided to try the 3rd option. Later we may still want to implement this scoring method in MacroBase, for example if we want to modify it. It is not difficult to produce output in the NAB format [33]; the only issue was that some of our algorithms (LOF, MAD, FastMCD) output scores greater than 1, so we used simple Min-Max normalization to bring the scores into the 0-1 range: x_i' = (x_i − min(x)) / (max(x) − min(x)). A small sketch of this normalization is given below. Another difference from the MacroBase approach is that in NAB algorithms are supposed to output a score immediately when receiving a point, without looking ahead; in MacroBase most of the algorithms can read the whole data portion (such as a window slide) first and then output the results. This should not be a problem as long as we simply want to compare our algorithms with each other. For all experiments on NAB we did not use any data slicing because the NAB datasets are quite small (1-20K rows). LOF, iForest and MAD detection quality can be affected only by the training set. We use the first N elements (such as 200 or 600) of each dataset for training because in NAB the first 750 elements (or 15% for datasets smaller than 5000 elements) do not contain anomalies, which is good for these algorithms, and should be the best case when the normal behavior does not change later.
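A minimal sketch of this normalization step, assuming the scores of one data file are collected into an array before writing the NAB output CSV (the class and method names are illustrative):

// Rescales raw anomaly scores into the 0-1 range expected by NAB.
public final class ScoreNormalizer {
    public static double[] minMaxNormalize(double[] scores) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        double range = max - min;
        double[] normalized = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            // If all scores are equal, map everything to 0 to avoid division by zero.
            normalized[i] = range == 0 ? 0.0 : (scores[i] - min) / range;
        }
        return normalized;
    }
}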

Algorithm / parameters         | Standard Profile | Reward Low FP | Reward Low FN
Percentile, 0.2%               | 46.6             | 25.3          | 55.5
Percentile, 0.1%               | 45.1             | 36.1          | 49.9
iForest (100 trees, 256 subs.) | 23.2             | 14.0          | 27.8
LOF (knn 60, train set 200)    | 16.6             | 0.8           | 23.3
LOF (knn 45, train set 200)    | 16.4             | 0.1           | 22.9
Percentile, 0.5%               | 12.4             | -53.4         | 35.8
LOF (knn 60, train set 300)    | 3.1              | 0.04          | 17.3
LOF (knn 15, train set 200)    | -15.6            | -68.8         | 5.4
LOF (knn 15, train set 400)    | -43.3            | -114.7        | -20.9
MAD (train set 20K)            | -102.0           | -193.9        | -55.9
MAD (train set 600)            | -102.0           | -193.8        | -55.9
LOF (knn 100, train set 200)   | -108.2           | -249.7        | -59.5
MCOD (R 10, k 5)               | -261.9           | -562.7        | -160.2
MCOD (R 80, k 60)              | -353.9           | -737.9        | -225.6

Table 8 NAB results for our algorithms.

As before, LOF takes a lot of time: 30 seconds per file with a training set size of 200 and 70 seconds with 400, while MCOD and MAD take less than 1 second. It is strange that increasing the LOF training set resulted in a worse score, but this may simply be because its results are bad overall: even the best scores we got here are very similar to Random in Table 9. Surprisingly, the best result here was achieved by the simplest algorithm, Percentile. MCOD achieved the worst results, but as explained in the next sections, this is because MCOD requires different distance-based hyperparameters for each dataset.

Detector             | Standard Profile | Reward Low FP | Reward Low FN
Perfect              | 100.0            | 100.0         | 100.0
Numenta HTM          | 70.5-69.7        | 62.6-61.7     | 75.2-74.2
CAD OSE              | 69.9             | 67.0          | 73.2
KNN CAD              | 58.0             | 43.4          | 64.8
Relative Entropy     | 54.6             | 47.6          | 58.8
Random Cut Forest    | 51.7             | 38.4          | 59.7
Twitter ADVec v1.0.0 | 47.1             | 33.6          | 53.5
Windowed Gaussian    | 39.6             | 20.9          | 47.4
Etsy Skyline         | 35.7             | 27.1          | 44.5
Bayesian Changepoint | 17.7             | 3.2           | 32.2
EXPoSE               | 16.4             | 3.2           | 26.9
Random               | 11.0             | 1.2           | 19.5
Null                 | 0.0              | 0.0           | 0.0

Table 9 Results published by Numenta [42].

When reviewing separate NAB data files, we confirmed that our algorithms can work well on simple point anomalies but usually fail on more complex (collective, contextual) anomalies. Below we present some of the results for several NAB datasets using plots generated by our benchmarking tool described earlier in section 3. Normal values (TN) are marked as blue points, undetected anomalies (FN) as red, correctly detected anomalies (TP) as green and incorrectly reported anomalies (FP) as yellow. The plots show many FNs, but this is often due to the NAB scoring/labeling method described in section 2.5.1: one TP surrounded by many FNs (along the time axis) is still a good result.

Figure 28 Examples of successful anomaly detection in NAB by MCOD and LOF algorithms.

Figure 29 Examples of anomaly detection failures in NAB by MCOD and LOF algorithms.

Figure 28 shows examples of simple point anomalies, such as big and rare spikes, which are usually detected correctly by our algorithms. Figure 29 shows more difficult anomalies on which our algorithms fail. On the left side there are two examples of datasets where the lack of a spike should be considered an anomaly, but our algorithms do not find anomalies here because there are still many other elements around the areas with low values. The top-right example shows a dataset where an area with much denser spikes is supposed to be an anomaly; our algorithms here either incorrectly report all spikes as anomalies or do not report any anomalies at all.

4.3.1 Hyperparameters tuning (using labels)

We noticed that the main issue with our algorithms (especially MCOD) is that they need different hyperparameters for different datasets. For example, in the dataset from Figure 30 the value range is 0.5-5.5, so MCOD with R = 20 will classify all elements as inliers, while other datasets have a much bigger value range (thousands or millions) where such a small R will not work well, resulting in too many outliers.

Figure 30 Left – MCOD with R = 20, right – MCOD with R = 3000000 on the same dataset from NAB.

We tried to automatically choose good parameters for each data file using simple grid search, with a crude version of the NAB scoring algorithm as the search measure (F1 and ROC/PR AUC did not work well here, resulting in too many FP).

Algorithm | Standard Profile | Reward Low FP | Reward Low FN
MCOD      | 63.73            | 57.13         | 68.35
MAD       | 56.33            | 46.46         | 61.98
LOF       | 33.17            | 19.62         | 38.81

Table 10 NAB results with perfect hyperparameters.

MCOD and MAD results (Table 10) were much better than before (Table 8). The results for LOF are also better but did not improve as much as for MCOD and MAD; this may be because we did not include some possible parameter values, since LOF is very slow and it takes too much time to try many different combinations. It is also possible that the threshold we used was not optimal: we used 0.85 (on LOF scores Min-Max normalized to 0-1) during the search, the runs and the NAB scoring. Of course, we cannot compare this with the results published by Numenta, because they did not use labels to tune parameters, but it shows that our algorithms can work well on most of these datasets if suitable hyperparameters are chosen.

4.3.2 MCOD hyperparameters tuning

We tried to tune MCOD parameters for each data file without using labels, because tuning parameters based on already labeled anomalies may not be feasible for many real applications. We used a simple, empirically created algorithm working on a small training set (600 elements) from the beginning of each data file (where NAB datasets contain only inliers [33]): find the closest-neighbor distance for each element, set R to 4 times the maximum of these distances, then count how many neighbors within radius R each element has, and finally set k to the minimum of these neighbor counts (a sketch is given below). Most likely this can be improved; the idea was simply to put R and k somewhere in a sensible range and avoid mistakes like the one described earlier, where R is too small for the range of the dataset values. We tested this tuning algorithm on the NAB and Yahoo S5 datasets (Table 11, Table 12).
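A minimal sketch of this heuristic for univariate data, directly following the description above (the code mirrors the description rather than the exact platform implementation, and assumes a training set of at least two elements):

// Derives MCOD's R and k from an unlabeled, inlier-only training set.
public final class McodParamHeuristic {
    public static double[] tune(double[] trainingSet) {   // e.g. the first 600 values of a file
        int n = trainingSet.length;

        // R = 4 * (largest nearest-neighbor distance in the training set)
        double maxNearest = 0.0;
        for (int i = 0; i < n; i++) {
            double nearest = Double.POSITIVE_INFINITY;
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                nearest = Math.min(nearest, Math.abs(trainingSet[i] - trainingSet[j]));
            }
            maxNearest = Math.max(maxNearest, nearest);
        }
        double r = 4 * maxNearest;

        // k = smallest neighbor count within radius R over all training elements
        int k = Integer.MAX_VALUE;
        for (int i = 0; i < n; i++) {
            int neighbors = 0;
            for (int j = 0; j < n; j++) {
                if (i != j && Math.abs(trainingSet[i] - trainingSet[j]) <= r) {
                    neighbors++;
                }
            }
            k = Math.min(k, neighbors);
        }
        return new double[] { r, k };
    }
}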

Algorithm        | Parameters / notes                                             | Standard Profile | Reward Low FP | Reward Low FN
Numenta HTM      |                                                                | 70.5-69.7        | 62.6-61.7     | 75.2-74.2
MCOD             | hyperparameters tuned for each file with labels                | 63.73            | 57.13         | 68.35
Percentile, 0.2% |                                                                | 46.6             | 25.3          | 55.5
LOF              | hyperparameters tuned for each file with labels                | 33.17            | 19.62         | 38.81
MCOD             | hyperparameters tuned for each file using our tuning algorithm | 26.47            | 11.93         | 32.30
LOF              | knn 60, train set 200                                          | 16.6             | 0.8           | 23.3
Random           |                                                                | 11.0             | 1.2           | 19.5
MCOD             | best with fixed hyperparameters                                | -261.9           | -562.7        | -160.2

Table 11 NAB results of MCOD hyperparameters tuning.

Algorithm  | Parameters                                           | Average ROC AUC | Average PR AUC
LOF        | k 15, training set 400                               | 0.93            | 0.72
MAD        | training set 500                                     | 0.93            | 0.67
MCOD       | Grid Search tuned R, k for each file (on whole file) | 0.83            | 0.59
MCOD       | tuned for each file using our tuning algorithm       | 0.70            | 0.32
Percentile | 1%                                                   | 0.80            | 0.20
MCOD       | R 60, k 20                                           | 0.66            | 0.16
Random     |                                                      | 0.50            | 0.02

Table 12 Results of MCOD hyperparameters tuning for the Yahoo S5 dataset, sorted by PR AUC.

The results seem quite good: not as good as with Grid Search, but better than with fixed parameters. For NAB we also looked at the results for each data file and noticed that there is only one file with many FPs contributing a lot of penalty, and in about half of the files containing anomalies (28 of 53) MCOD successfully detected some anomalies without making too many FPs. In conclusion, it looks like it is possible to achieve good results tuning MCOD parameters even with such a basic tuning algorithm. This was just a very crude ad hoc attempt, and most likely a better tuning algorithm could produce results much closer to those we got when tuning with labels using Grid Search. In practice MCOD should achieve good results on point anomalies if the hyperparameters are at least adjusted according to the possible data range and the density of inliers.

4.4 Performance and anomaly detection quality conclusions

4.4.1 Percentile

Despite its simplicity, Percentile can work well enough on many univariate datasets with point anomalies; it even got a good score in NAB (section 4.3). It is simple to use because it has only two parameters: the percentage and whether to detect only low values, only high values, or both. In our experiments it worked well with any sufficiently low percentage, such as below 0.5-1% (a minimal sketch follows).
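A minimal sketch of such a two-sided percentile detector, assuming the whole batch is scored at once; the cutoff handling is illustrative and may differ from the MacroBase implementation.

import java.util.Arrays;

// Flags the lowest and/or highest 'percent' of values in a batch as outliers.
public final class PercentileSketch {
    public static boolean[] detect(double[] values, double percent,
                                   boolean flagLow, boolean flagHigh) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int cut = (int) Math.ceil(values.length * percent / 100.0);
        double lowCutoff = sorted[Math.max(cut - 1, 0)];
        double highCutoff = sorted[Math.min(values.length - cut, values.length - 1)];

        boolean[] outlier = new boolean[values.length];
        for (int i = 0; i < values.length; i++) {
            outlier[i] = (flagLow && values[i] <= lowCutoff)
                      || (flagHigh && values[i] >= highCutoff);
        }
        return outlier;
    }
}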

4.4.2 MAD

MAD result quality does not seem to be significantly better than Percentile, but MAD is more difficult to use because a suitable threshold has to be chosen to get good results, and it differs from dataset to dataset. It does not seem possible to simply normalize the score to [0; 1] and use a high threshold like 0.99: we tried this on NAB and got a bad score (MAD worked well only when the threshold was adjusted using Grid Search for each file). It is also possible to adjust the training set size, but it does not seem to make much difference in quality on our datasets, and performance is not an issue either, because all MAD does during training is sorting (to find medians), and a good sorting algorithm on a modern computer can sort very big datasets in reasonable time. Ideally the training set should contain only inliers (this probably applies to all the other algorithms we have as well, except for hyperparameter optimization using Grid/Random Search on labeled data).

4.4.3 FastMCD

FastMCD works on multivariate data (but not on univariate data) and is also fast unless the data is very high-dimensional. In our experiments on the Shuttle dataset it showed good results, but a bit worse than LOF and MCOD. Adjusting the FastMCD training set size can affect the results significantly; it should be neither too high nor too low.

4.4.4 LOF

LOF (at least in our implementation, although it corresponds to [30]) is the slowest algorithm we have here, but its results are usually better than Percentile, MAD and FastMCD, similar to the best results of MCOD, and it can work quite well without hyperparameter adjustments. The main disadvantage is that, at least in this implementation, it is very slow and uses a lot of memory when used with a big training set. We usually limited the training set to 200-400 elements, which seems quite small but was enough for our datasets. There are many modifications/extensions of LOF, such as [19], [20], which possibly achieve better results and performance, so they should be investigated.

4.4.5 MCOD

MCOD is faster than LOF, but still quite slow (especially with big window/slide sizes, though other hyperparameters and the dataset itself affect performance too), and much slower than Percentile, MAD and FastMCD. However, this MCOD implementation may not be optimal and other implementations may perform better; for example, it does not seem to use an M-Tree for range queries, which was recommended in the paper [21]. Detection quality depends very much on the R and k hyperparameters (an object is considered an outlier if there are fewer than k objects within radius R around it); there are no good default values that work on all datasets, they should correspond to the dataset value range and inlier density. For example, if a dataset has moderately dense values between 10000 and 20000, then MCOD with R = 10 will mark all elements as outliers. It is possible to adjust R and k automatically using Grid/Random Search, but labeled data is needed. Another workaround for univariate data is our simple algorithm described in 4.3.2. The snippet below illustrates the R/k outlier definition.
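A naive illustration of the R/k rule, assuming `window` holds the other objects of the current window; MCOD itself maintains this decision incrementally with micro-clusters over a sliding window, which this sketch omits.

// Returns true if 'point' has fewer than k neighbors within radius R in 'window'.
public final class DistanceOutlierCheck {
    public static boolean isOutlier(double[] point, double[][] window, double r, int k) {
        int neighbors = 0;
        for (double[] other : window) {
            double sq = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - other[d];
                sq += diff * diff;
            }
            if (Math.sqrt(sq) <= r) {
                neighbors++;
                if (neighbors >= k) {
                    return false;   // enough neighbors within R: inlier
                }
            }
        }
        return true;                // fewer than k neighbors within R: outlier
    }
}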

4.4.6 iForest

iForest is often faster than LOF and MCOD and usually shows better quality on high-dimensional data. It may not be a good choice for low-dimensional/univariate data. The default hyperparameters (recommended in [23]) worked well on our datasets, and tuning iForest hyperparameters did not make much difference in most cases.

5 RESULTS AND CONCLUSIONS

We created a flexible and easy-to-use benchmarking platform for the evaluation of anomaly detection algorithms. The source code is available at https://github.com/anomaly-detection-macrobase-benchmark (use the Git 1.0-batch-classification tag for the version used in this paper). We found some freely available datasets that can be used for testing and benchmarking anomaly detection algorithms. Most of these datasets contain a very small number of samples or dimensions, so for some experiments (such as time and memory performance scalability) synthetically generated datasets are needed; however, these can be too simple and are not a good choice for anomaly detection quality evaluation [31]. Some datasets contain categorical attributes, which we converted to numerical attributes using one-hot encoding: this may help to cover more datasets, although on some datasets, such as Abalone, using these additional categorical attributes decreases result quality. We performed some experiments using our benchmarking platform; unfortunately we did not have time for many of the planned experiments, so currently we have only a small list of basic conclusions (4.4).

5.1 Future works

This work focused on researching the area (anomaly detection) and implementing a benchmarking platform for the evaluation of anomaly detection algorithms, primarily covering bounded data. It can be continued in many directions, such as evaluation of methods designed for unbounded data streams, time series, explanation algorithms, other types of anomaly detection algorithms (probabilistic methods, neural networks, ...), and more advanced evaluation methodology (other metrics, cross-validation, ...). Currently another group of SAP internship students is working on some of these topics using and extending this platform.

6 REFERENCES

[1] "Dell EMC Digital Universe Survey: The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things," 2014. [Online]. Available: https://www.emc.com/leadership/digital- universe/index.htm. [2] E. Gan, P. Bailis, S. Madden, D. Narayanan, K. Rong and S. Suri, "MacroBase: Prioritizing Attention in Fast Data," 2017. [Online]. Available: http://www.bailis.org/papers/macrobase-sigmod2017.pdf. [3] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. Fernandez- Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt and S. Whittle, "The Dataflow Model: A Practical Approach to Balancing," 2015. [4] T. Akidau, "The world beyond batch: Streaming 101," 05 08 2015. [Online]. Available: https://www.oreilly.com/ideas/the-world-beyond-batch- streaming-101. [Accessed 30 10 2018]. [5] J. Li, D. Maier, K. Tufte, V. Papadimos and P. Tucker, "Semantics and evaluation techniques for window aggregates in data streams," 2005. [6] M. Goldstein and S. Uchida, "A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data," 2015. [7] V. Chandola, A. Banerjee and V. Kumar, "Anomaly Detection : A Survey," 09 2009. [Online]. Available: http://cucis.ece.northwestern.edu/projects/DMS/publications/AnomalyDetection .pdf. [Accessed 10 11 2018]. [8] V. Christophides, "IoT Data Analytics - Inria," 22 06 2018. [Online]. Available: https://who.rocq.inria.fr/Vassilis.Christophides/IoT/IoTDataAnalytics.pptx. [Accessed 10 11 2018]. [9] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," 09 1994. [Online]. Available: http://www.vldb.org/conf/1994/P487.PDF. [Accessed 15 12 2018]. [10] J. Han, J. Pen and Y. Yin, "Mining Frequent Patterns without Candidate Generation," 2000. [Online]. Available: https://www.cs.sfu.ca/~jpei/publications/sigmod00.pdf. [Accessed 15 12 2018]. [11] "MacroBase license," 11 08 2016. [Online]. Available: https://github.com/stanford-futuredata/macrobase/blob/master/LICENSE. [Accessed 12 01 2018]. [12] "Tutorial - MacroBase wiki," 15 08 2017. [Online]. Available: https://github.com/stanford-futuredata/macrobase/wiki/Tutorial. [Accessed 12 01 2018]. [13] P. S. Efraimidis, "Weighted Random Sampling over Data Streams," 2015. [Online]. Available: https://arxiv.org/pdf/1012.0256.pdf. [14] A. Metwally, D. Agrawal and A. El Abbadi, "Efficient Computation of Frequent and Top-k Elements in Data Streams," 2005. [Online]. Available: https://arxiv.org/abs/1610.06376. [15] K. S. Tai, V. Sharan, P. Bailis and G. Valiant, "Sketching Linear Classifiers over Data Streams," 2018. [Online]. Available: https://arxiv.org/pdf/1711.02305.pdf. [16] M. M. Breunig, H.-P. Kriegel, R. T. Ng and J. Sander, "LOF: Identifying Density-Based Local Outliers," 2000. [Online]. Available: www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf. [Accessed 15 12 2018]. [17] "Quickselect - Wikipedia," 10 9 2018. [Online]. Available: https://en.wikipedia.org/wiki/Quickselect. [Accessed 26 11 2018]. [18] C. Leys, O. Klein, B. Philippe and L. Licata, "Detecting outliers: Do not use standard deviations around the mean, do use the median absolute deviation around the median," 2013. [Online]. Available: http://www.academia.edu/3448313/Detecting_outliers_Do_not_use_standard_d eviations_around_the_mean_do_use_the_median_absolute_deviation_around_t he_median. [19] D. Pokrajac, A. Lazarevic and L. J. Latecki, "Incremental Local Outlier Detection for Data Streams," 01 2007. [Online]. 
Available: https://www.researchgate.net/publication/4250603_Incremental_Local_Outlier_ Detection_for_Data_Streams. [Accessed 15 12 2018]. [20] "Local outlier factor - Wikipedia," 25 11 2018. [Online]. Available: https://en.wikipedia.org/wiki/Local_outlier_factor. [Accessed 15 12 2018]. [21] M. Kontaki, A. Gounaris, A. N. Papadopoulos, K. Tsichlas and Y. Manolopoulos, "Efficient and flexible algorithms for monitoring distance-based outliers over data streams," 2015. [22] L. Tran, L. Fan and C. Shahabi, "Distance-based Outlier Detection in Data Streams [Experiments and Analyses]," 2016. [Online]. Available: https://infolab.usc.edu/Luan/Outlier/. [23] F. T. Liu and K. M. Ting, "Isolation-based Anomaly Detection," 2012. [Online]. Available: https://dl.acm.org/citation.cfm?id=2133363. [Accessed 02 05 2019]. [24] W.-R. Chen, Y.-H. Yun, M. Wen, H.-M. Lu, Z.-M. Zhang and Y.-Z. Liang, "Representative subset selection and outlier detection via isolation forest," 2016. [Online]. Available: https://pubs.rsc.org/en/content/articlelanding/2016/ay/c6ay01574c. [Accessed 02 05 2019]. [25] "F1 score - Wikipedia," 2018. [Online]. Available: https://en.wikipedia.org/wiki/F1_score. [Accessed 21 06 2018]. [26] T. Fawcett, "ROC Graphs: Notes and Practical Considerations for Researchers," 2004. [27] "Differences between Receiver Operating Characteristic AUC (ROC AUC) and Precision Recall AUC (PR AUC)," 2014. [Online]. Available: http://www.chioka.in/differences-between-roc-auc-and-pr-auc/. [28] J. Davis and M. Goadrich, "The Relationship Between Precision-Recall and ROC Curves," 2006. [Online]. Available: https://www.biostat.wisc.edu/~page/rocpr.pdf. [Accessed 02 05 2019]. [29] "Hyperparameter optimization - Wikipedia," 2018. [Online]. Available: https://en.wikipedia.org/wiki/Hyperparameter_optimization. [Accessed 21 06 2018]. [30] R. Domingues, M. Filippone, P. Michiardi and J. Zouaoui, "A comparative evaluation of outlier detection algorithms: experiments and analyses," 09 2017. [Online]. Available: http://www.eurecom.fr/en/publication/5334/detail. [Accessed 02 05 2019]. [31] A. Emmott, S. Das, T. Dietterich, A. Fern and W.-K. Wong, "A Meta- Analysis of the Anomaly Detection Problem," 2016. [Online]. Available: https://arxiv.org/abs/1503.01158. [Accessed 02 05 2019]. [32] S. Ahmad, A. Lavin, S. Purdy and Z. Agha, "Unsupervised real-time anomaly detection for streaming data," 1 11 2017. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0925231217309864. [Accessed 30 11 2018]. [33] "NAB whitepaper," [Online]. Available: https://github.com/numenta/NAB/wiki#nab-whitepaper. [Accessed 30 11 2018]. [34] A. Lavin and S. Ahmad, "Evaluating Real-time Anomaly Detection Algorithms - the Numenta Anomaly Benchmark," 17 11 2015. [Online]. Available: https://arxiv.org/abs/1510.03336. [Accessed 30 11 2018]. [35] Q. Yu, L. Jibin and L. Jiang, "An Improved ARIMA-Based Traffic Anomaly Detection Algorithm for Wireless Sensor Networks," 18 01 2016. [Online]. Available: https://journals.sagepub.com/doi/10.1155/2016/9653230. [Accessed 02 05 2019]. [36] D. T. Shipmon, J. M. Gurevitch, P. M. Piselli and S. Edwards, "Time Series Anomaly Detection: Detection of Anomalous Drops with Limited Features and Sparse Examples in Noisy," 2017. [Online]. Available: https://arxiv.org/pdf/1708.03665. [Accessed 02 05 2019]. [37] "NAB license," 10 8 2015. [Online]. Available: https://github.com/numenta/NAB/blob/master/LICENSE.txt. [Accessed 30 11 2018]. [38] "NAB FAQ," 16 11 2015. [Online]. 
Available: https://github.com/numenta/NAB/wiki/FAQ. [Accessed 15 12 2018]. [39] J. Wilke, "The 6 Memory Metrics You Should Track in Your Java Benchmarks," 28 03 2017. [Online]. Available: https://cruftex.net/2017/03/28/The-6-Memory-Metrics-You-Should-Track-in- Your-Java-Benchmarks.html. [Accessed 02 05 2019]. [40] "Benchmarking with ELKI," [Online]. Available: https://elki- project.github.io/benchmarking. [Accessed 02 05 2019]. [41] "NAB Entry Points," 27 04 2017. [Online]. Available: https://github.com/numenta/NAB/wiki/NAB-Entry-Points. [Accessed 30 11 2018]. [42] "The Numenta Anomaly Benchmark," [Online]. Available: https://github.com/numenta/NAB. [Accessed 30 11 2018].