DEGREE PROJECT IN COMPUTER ENGINEERING, FIRST CYCLE, 15 CREDITS STOCKHOLM, SWEDEN 2020

A Comparative Evaluation Of Semi-supervised Anomaly Detection Techniques

REBWAR BAJALLAN

BURHAN HASHI

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

A Comparative Evaluation Of Semi-supervised Anomaly Detection Techniques


Degree Project in Computer Science Date: June 9, 2020 Supervisor: Pawel Herman Examiner: Pawel Herman School of Electrical Engineering and Computer Science Swedish title: En jämförande utvärdering av semi-övervakade tekniker för identifiering av uteliggande datapunkter


Abstract

As we enter the information age and the amount of data rapidly increases, detecting anomalies has become a necessity in many organizations, since anomalies often reveal useful information that can be critical for saving lives or catching impostors. The semi-supervised approach to anomaly detection, which assumes that the user has no information about the anomalies, has become widely popular because it is easier to model the normal state of a system than to obtain information about every anomalous behavior. In this study we therefore conduct a comparative evaluation of three semi-supervised anomaly detection techniques: the autoencoder, the local outlier factor algorithm, and the one-class support vector machine. The aim is to simplify the process of selecting the right technique when faced with similar anomaly detection problems of a semi-supervised nature.

We found that the local outlier factor algorithm performed best on the electrocardiogram dataset (ECG5000), achieving high precision and perfect recall. The autoencoder achieved the best performance on the credit card fraud dataset, although the remaining models also achieved a relatively high performance that did not differ much from that of the autoencoder. It should be noted, however, that the definition of performance depends on the characteristics of the anomaly detection problem: some problems put a higher weight on detecting all anomalies, at the cost of an increase in falsely flagged normal data points.

Sammanfattning (Swedish Summary)

We are in the information age, and as the amount of data rapidly grows, the problem of detecting anomalies has become increasingly important in many organizations, since anomalies often reveal important information that in many cases can be crucial for saving lives or detecting fraudsters. The semi-supervised method for anomaly detection is based on the fact that the user has no information about the anomalies. This method has become widely popular because it is easier to model the normal state of a system than to obtain information about all anomalous states. In this study we therefore conduct a comparative evaluation of the semi-supervised anomaly detection methods Autoencoder, Local outlier factor algorithm, and One-class support vector machine, in order to simplify the process of choosing the right algorithm when facing similar anomaly detection problems of a semi-supervised nature.

We found that the Local outlier factor algorithm performed best on the Electrocardiograms dataset (ECG5000), achieving high precision and perfect recall. The Autoencoder was best on the credit card fraud dataset, although the other models performed relatively close to the Autoencoder. It should also be noted that the definition of performance may differ because the requirements of anomaly detection problems differ: specific problems may put a higher weight on detecting all anomalies at the cost of an increase in falsely identified normal data points.

Contents

1 Introduction
  1.1 Problem Definition
    1.1.1 Scope
    1.1.2 Thesis Outline

2 Background
  2.1 Anomalies
    2.1.1 Anomaly Detection
    2.1.2 Anomaly Score
    2.1.3 Semi-supervised anomaly detection
  2.2 Support Vector Machine
    2.2.1 One-Class Support Vector Machine
  2.3 Local Outlier Factor Algorithm
  2.4 Artificial Neural Networks
    2.4.1 Feed-forward neural networks
  2.5 Autoencoders
    2.5.1 Reconstruction Error
    2.5.2 Training Autoencoders
    2.5.3 Autoencoders and Anomaly detection
  2.6 Related Work

3 Method
  3.1 Datasets
    3.1.1 ECG5000
    3.1.2 Credit Card dataset
  3.2 Models
    3.2.1 OC-SVM
    3.2.2 LOF algorithm
    3.2.3 Autoencoder
  3.3 Evaluation Metrics

4 Results
  4.1 Autoencoder
    4.1.1 Training reconstruction error
    4.1.2 Error threshold selection
    4.1.3 Testing reconstruction errors
    4.1.4 Autoencoder testing results
  4.2 OC-SVM
    4.2.1 OC-SVM testing results
  4.3 LOF algorithm
    4.3.1 Testing outlier factors
    4.3.2 LOF algorithm testing results
  4.4 Results summary

5 Discussion
  5.1 Limitations

6 Conclusions
  6.1 Further research

Bibliography

Chapter 1

Introduction

As the world becomes increasingly data-driven, the field of anomaly detection is on the rise and is starting to play a big role in many organizations. Anomaly detection is the process of detecting anomalies, which are observations that deviate from the other observations to such an extent that they are suspected of being the result of errors or fraud [1]. Detecting and analysing anomalies is therefore an important task because it reveals useful information which in many cases can be critical for catching impostors or saving lives. In the banking sector, anomaly detection has become an extremely important tool for detecting and analysing fraudulent credit card transactions, which helps expose impostors [1]. Anomaly detection is also frequently used in hospitals for medical diagnosis, where anomalies could, for example, help detect cancerous cells in medical images [1].

Many techniques are used for anomaly detection, but one family of techniques that has been on the rise in recent years is artificial neural networks. These techniques are advantageous because they have proven to be good at modeling the complexity of real-world data [1], and they have been shown to outperform traditional techniques as the scale of data increases [1].

The semi-supervised approach to anomaly detection, which assumes that the user has no information about possible anomalies, has become widely applicable. This approach has risen in popularity because it is easier to model the normal state of a system than to obtain information about every anomalous behavior that can occur [2]. An example is the detection of spacecraft faults [3], where faults may result in spacecraft incidents. Because such incidents rarely occur, it is easier to obtain data that reflects how the spacecraft operates in its normal state and then use the semi-supervised approach to detect abnormal spacecraft activity.

1.1 Problem Definition

This study aims to present a comparative evaluation of different semi-supervised anomaly detection techniques. The evaluation investigates which model performs best in predicting anomalies across a set of different datasets, in order to draw a generalizable conclusion about which model is preferable for problems that use similar datasets.

1.1.1 Scope

The question is investigated by evaluating the performance of three semi-supervised anomaly detection techniques: Autoencoder, Local outlier factor, and One-class support vector machine. The datasets used in the evaluation are the ECG5000 (electrocardiogram) dataset and the credit card dataset. The purpose of this study is to simplify the process of choosing the right technique when people in the credit card fraud analytics and healthcare industries are faced with similar anomaly detection problems that are of a semi-supervised nature.

1.1.2 Thesis Outline

The first chapter contains a basic introduction to the subject of this study together with the problem definition and the scope. The second chapter explores the field of anomalies and anomaly detection, and explains the methods Autoencoder, Local outlier factor, and One-class support vector machine. It ends with a presentation of the relevant literature. Chapter three covers the method of the study: the datasets used, the implementation of the models, and the evaluation. Chapter four presents the results together with relevant figures, and chapter five contains a critical discussion of the results and method from different perspectives. The study ends in chapter six with a conclusion and remarks about potential future work.

Chapter 2

Background

2.1 Anomalies

Anomalies, or outliers, are observations that deviate so significantly from the other observations that they are suspected of being generated by a different mechanism, sometimes as the result of errors or fraud [1]. An example is a fraudulent credit card transaction that results from a stolen credit card [1]. Two important characteristics of anomalies are that they differ from the norm with respect to their features and that they are rare in the dataset compared to the normal instances [4].

Anomalies arise for different reasons depending on the context. Some reasons are failures in systems, fraudulent behavior or malicious actions [1]. It is important to remember that not every anomaly is the result of error or fraud, because there are situations where data points naturally differ. For example, if you take the height of every student in a school, there will most likely be outliers in that dataset that are not the result of any error. Still, the study of anomalies usually reveals exciting insights and conveys valuable information about the data [1].

2.1.1 Anomaly Detection

The process of identifying anomalies is called anomaly detection, a technique that is used to reveal useful information about the characteristics of the data generation process [5]. Anomaly detection is widely used and is considered an essential step in various decision-making processes, especially within fields where anomalous data may indicate abnormal conditions [1]. Examples include the banking sector, where a fraudulent credit card transaction might indicate a stolen credit card, and healthcare, where an unusual pattern in a patient's medical record might indicate heart disease [1].

2.1.2 Anomaly Score

Anomalous data points differ in their degree of outlierness depending on how different they are from the normal data points. A measurement used in anomaly detection tasks is the anomaly score, which describes the level of outlierness of each data point [1]. The anomaly score is a domain-specific measurement because the definition of similarity differs between contexts [1].

2.1.3 Semi-supervised anomaly detection

As the name implies, semi-supervised anomaly detection lies somewhere in between supervised and unsupervised anomaly detection. Supervised anomaly detection uses a fully labeled dataset for training and testing [4]. Unsupervised anomaly detection, on the other hand, does not require any labels. This approach relies on the algorithm to find the fundamental properties of the dataset and then scores the data solely on these. Usually, this is done by some kind of density or distance estimation [6].

In many real-life scenarios, anomalies are not known in advance or may occur spontaneously after the model has been trained. Semi-supervised anomaly detection addresses this, since only normal samples are needed to train the model. The idea is that a model of the normal class is learned during training, and that anomalies can be recognized during testing because they deviate from the learned model [7].

2.2 Support Vector Machine

A support vector machine (SVM) is a machine learning technique that is trained on input-output examples to classify new data points [8]. The SVM is trained by finding a hyperplane that separates the example classes by maximizing the distance between the hyperplane and the data points of each class. The hyperplane is then used as a decision boundary that classifies data points based on which side of the hyperplane they lie [9]. In order to work with high-dimensional data, the SVM utilizes the so-called kernel trick. The kernel trick transforms the data, which makes it possible to fit the hyperplane in a higher-dimensional feature space [9]. The data is transformed by passing it to some kernel function K.

$$K(x_j, x_i) = \Phi(x_j)^{T} \Phi(x_i) \tag{2.1}$$

2.2.1 One-Class Support Vector Machine

The one-class support vector machine (OC-SVM) is often used as a semi-supervised anomaly detection technique [4]. In contrast to the traditional SVM, the OC-SVM is trained on a single class: it fits a hyperplane that separates the training data points from the origin while maximizing the distance between the hyperplane and the origin. This results in a binary function which captures the region of the input space where the training data lives [10].

Anomaly detection is achieved by training the OC-SVM solely on normal samples (semi-supervised), which results in a hyperplane separating the normal data points from the origin. This creates a decision function which classifies the data points in the region capturing the normal points as normal and data points elsewhere as anomalous [10].

Figure 2.1: OC-SVM, separating the normal observations (+1) from the anomalous ones (-1), image courtesy of [11]

2.3 Local Outlier Factor Algorithm

The Local Outlier Factor algorithm (LOF) is a density-based method used for anomaly detection. It specifies the degree of outlierness as a score (the local outlier factor) instead of defining it as a binary property. The local outlier factor describes how isolated a data point is with respect to its surrounding neighborhood. To calculate this factor, we first introduce the parameter k (or MinPts), the number of neighbouring points the LOF considers. Small values of k give a local focus, while larger values of k can cause local outliers to be missed. The factor is then computed in four steps [26].

1. The k-distance is the distance between a point and its kth nearest neigh- bour. This k-distance is used to calculate the reachability distance.

2. The reachability distance from a point p to a point o is the maximum of the distance between the two points and the k-distance of o.

3. The local reachability density of a point p is obtained by computing the reachability distance between p and each of its k nearest neighbours and averaging these values. The local reachability density is the inverse of this average.

$$\mathrm{lrd}_{MinPts}(p) = 1 \Big/ \left( \frac{\sum_{o \in N_{MinPts}(p)} \text{reach-dist}_{MinPts}(p, o)}{|N_{MinPts}(p)|} \right) \tag{2.2}$$

4. The local outlier factor of a point p is the average of the ratios between the local reachability densities of p's k nearest neighbours and the local reachability density of p itself. It is easy to see that the lower p's local reachability density is, and the higher the local reachability densities of p's nearest neighbours are, the higher the local outlier factor becomes.

$$\mathrm{LOF}_{MinPts}(p) = \frac{\sum_{o \in N_{MinPts}(p)} \frac{\mathrm{lrd}_{MinPts}(o)}{\mathrm{lrd}_{MinPts}(p)}}{|N_{MinPts}(p)|} \tag{2.3}$$

The local outlier factor of a point is thus the ratio of the average local density of the point's k nearest neighbours to the point's own local density. An anomalous point is expected to have a lower local density than its neighbours, and hence a factor well above 1, whereas a normal sample is expected to have a local density similar to that of its neighbours, and hence a factor close to 1 [11].
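The four steps above can be sketched directly in NumPy. This is a minimal illustration and not the implementation used in the study: the function name `local_outlier_factor` is hypothetical, Euclidean distance is assumed, and ties at the k-th neighbour are ignored so that every neighbourhood contains exactly k points.

```python
import numpy as np

def local_outlier_factor(X, k):
    """Return the LOF score of every row of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances; a point is never its own neighbour.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    # Indices of the k nearest neighbours of each point.
    nn = np.argsort(d, axis=1)[:, :k]
    rows = np.arange(n)
    # Step 1: k-distance = distance to the k-th nearest neighbour.
    k_dist = d[rows, nn[:, -1]]
    # Step 2: reach-dist(p, o) = max(k-distance(o), d(p, o)).
    reach = np.maximum(k_dist[nn], d[rows[:, None], nn])
    # Step 3: local reachability density = inverse of the mean reach-dist.
    lrd = 1.0 / reach.mean(axis=1)
    # Step 4: LOF = mean ratio of the neighbours' lrd to the point's own lrd.
    return lrd[nn].mean(axis=1) / lrd
```

On a regular grid of points with one distant extra point, the grid points receive scores close to 1 while the distant point receives a score far above 1, which matches the interpretation of the factor given above.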

2.4 Artificial Neural Networks

Artificial neural networks, or ANNs, are computing systems inspired by the biological neural networks in the brain [12]. ANNs have the ability to model non-linear and complex relationships in the data, which often makes them favorable over other machine learning methods [1]. The network consists of multiple information-passing units called neurons, which are divided into three main kinds of layers: the input layer, the hidden layer(s), and the output layer. The connections between neurons are each associated with a weight parameter that describes how much impact each neuron has on the output. The objective of an ANN is to create a mapping between the input and the output layers [13].

The basic neural network model can be described by a series of functional transformations, where the transformations occur when information is passed between the layers of the network. The transformation functions, or activation functions, are often chosen to be non-linear in order to utilize the non-linear properties of the neural network [13]. The field of deep learning is based on neural networks with numerous hidden layers, which makes the method capable of finding patterns in very complex data.

2.4.1 Feed-forward neural networks

One of the most successful networks in the context of pattern recognition is the feed-forward neural network, which is an ANN where the information passes through the network in one direction. That is, the connections between the neurons in the network do not form cycles [13]. The ANNs mentioned in this thesis are of this type.

Figure 2.2: Feed forward neural network, image courtesy of Bishop [13].

2.5 Autoencoders

An autoencoder is an artificial neural network based method whose objective is to reproduce the input vectors {x1, x2, ..., xm} as outputs {x̂1, x̂2, ..., x̂m} [14]. The autoencoder is unsupervised in the sense that the method is trained by setting the input data as the target output, so there is no need for example input-output pairs.

Autoencoders are built by connecting two separate artificial neural networks. The first network is called the encoder and is responsible for compressing the input data into a lower-dimensional space. The compression is achieved by decreasing the number of neurons as the information is passed through each layer in the encoder [15]. The final representation of the data in the lower-dimensional space is available in the output layer of the encoder, also called the encoded layer.

The second network is called the decoder. It takes the compressed data as input and reconstructs the initial input data point. As information is passed through the decoder, the dimension of the data increases because the number of neurons increases with each layer. The output layer has the same number of neurons as the input layer, and therefore the input and output data points have the same dimensionality [15].

Figure 2.3: Autoencoder with an input layer, three encoder layers, three decoder layers and an output layer. Image courtesy of Prakash, Krishna Rao [16].

2.5.1 Reconstruction Error

The difference between the initial input data and the reconstructed data is described by the reconstruction error [14]. The reconstruction error function is essentially the sum-of-squares error function in which the target vectors are the initial data points {x1, x2, ..., xm} and the predictions are their reconstructions {x̂1, x̂2, ..., x̂m}. The reconstruction error is a function of the network weights w, which describe the mapping between the input and output data.

$$RE(w) = \sum_{i=1}^{m} \lVert x_i - \hat{x}_i \rVert^{2} \tag{2.4}$$

2.5.2 Training Autoencoders

Training an autoencoder is the process of updating the network weights to create the best mapping between inputs and outputs. This is done by minimizing the reconstruction error function, which is the same as minimizing the difference between the inputs and outputs [14]. The objective is therefore to find the network weights that minimize the reconstruction error function.

2.5.3 Autoencoders and Anomaly detection

Anomalies are detected by training the autoencoder solely on normal data samples, thereby teaching the network to reconstruct data points with similar characteristic patterns. The network weights are chosen to be those that minimize the reconstruction error for the normal samples. The autoencoder will therefore fail to reconstruct the anomalous data samples and will produce a large reconstruction error for these [14]. The reconstruction error represents the anomaly score in the sense that data points with a higher reconstruction error have a higher degree of outlierness. The data points that produce a high reconstruction error are labeled as anomalies [1]. The labeling is done by selecting an error threshold for the reconstruction error values. The threshold separates the reconstruction errors of the normal data samples from the reconstruction errors of the anomalous data samples [34].
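The idea that a model fitted only to normal data reconstructs anomalies poorly can be illustrated without a neural network. The sketch below deliberately swaps the autoencoder for a linear stand-in (a one-dimensional PCA projection, which behaves like a linear autoencoder with a single encoded neuron); the synthetic data and the helper name `reconstruction_error` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" samples lying close to a line in 3-D space.
t = rng.normal(size=(200, 1))
normal = t * np.array([1.0, 1.0, 1.0]) + 0.05 * rng.normal(size=(200, 3))
# Synthetic anomalies that do not follow the line.
anomalies = 2.0 * rng.normal(size=(10, 3))

# "Training": fit the 1-D linear code on normal samples only.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
W = vt[:1]  # encoder weights; the decoder is the transpose

def reconstruction_error(x):
    """Per-sample squared reconstruction error, one term of Eq. 2.4."""
    code = (x - mean) @ W.T      # encode to 1 dimension
    x_hat = code @ W + mean      # decode back to 3 dimensions
    return np.sum((x - x_hat) ** 2, axis=1)

normal_err = reconstruction_error(normal)
anomaly_err = reconstruction_error(anomalies)
```

With this data the anomalous samples have a much larger average reconstruction error than the normal samples, so a threshold placed between the two groups of errors separates them.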

2.6 Related Work

Anomaly detection is widely used in many different fields with various methods and has been the topic of several surveys, articles, and books. A broad overview of the field can be found in the work of Chandola et al. [17]. Goldstein and Uchida (2016) presented a comparative evaluation of a large set of unsupervised anomaly detection techniques. They outlined the strengths and weaknesses of the algorithms concerning their usefulness for specific applications, in order to serve as a guideline for selecting an appropriate unsupervised anomaly detection algorithm for a given task. Two of the algorithms reviewed are the one-class SVM and the Local outlier factor [4]. Chalapathy and Chawla provided a comprehensive outline of state-of-the-art research in deep anomaly detection techniques together with several real-world applications of the presented techniques, one being the autoencoder [1]. Extensive reviews of deep anomaly detection techniques in the medical domain and the fraud detection domain are presented by Adewumi and Akinyelu [18] and Geert et al. [19].

A lot of work has been done in the field of anomaly detection. However, to the best of our knowledge, most of the literature has focused on unsupervised methods. Consequently, it is of interest to evaluate the semi-supervised methods.

Chapter 3

Method

3.1 Datasets

The datasets were split so that 70% of the data was used for the training of the models, 15% of the data was used as the validation dataset and 15% of the data was used as the testing dataset which was utilized when evaluating the accuracy of the models. All the data points that were labeled as anomalous samples were removed from the training dataset in order to train the models solely on normal samples. The validation and testing datasets contained both normal and outlying samples.
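A minimal sketch of such a split is shown below. The thesis does not state whether the split was random or stratified, so the random shuffling, the seed, the label convention (0 = normal, 1 = anomalous) and the function name `split_semi_supervised` are all assumptions.

```python
import numpy as np

def split_semi_supervised(X, y, seed=0):
    """70/15/15 split; anomalies (y == 1) are dropped from the training part."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = (70 * len(X)) // 100
    n_val = (15 * len(X)) // 100
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    # Train solely on normal samples; validation and test keep both classes.
    train = train[y[train] == 0]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```

Note that removing the anomalies after splitting makes the training set slightly smaller than 70% of the data.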

3.1.1 ECG5000

The ECG5000 dataset represents an electrocardiogram, a graph of voltage versus time of the electrical activity in a heart [20]. It contains 5000 samples, each corresponding to a heartbeat. The samples are divided into five classes and have been pre-processed by interpolating the heartbeats to the same length. All samples are therefore one-dimensional sequences of length 140. One class corresponds to normal heartbeats and the rest are anomalous [21]. The dataset was further processed by removing one of the anomalous classes (class 2), in order to give the dataset the characteristics of an anomaly dataset. The final dataset had 3009 samples, with the anomalies representing 2.9% of the dataset.

3.1.2 Credit Card dataset

The credit card dataset contains 284,807 credit card transactions in two classes, normal and fraud. All samples are numerical and have 30 features. All of the features except “Time” and “Amount” are principal components obtained with PCA. The feature “Time” is given in seconds elapsed since the first transaction in the dataset [22]. The dataset was downsampled, and the resulting dataset contained 28,431 normal samples and 492 anomalous ones.

3.2 Models

3.2.1 OC-SVM

The one-class support vector machine is implemented with scikit-learn's one-class SVM class (OneClassSVM). The kernel function used for the kernel trick is the radial basis function (RBF), with the scaling parameter σ = 2.

$$K(x_j, x_i) = \exp\left(-\frac{\lVert x_j - x_i \rVert^{2}}{2\sigma^{2}}\right) \tag{3.1}$$

Training is done solely on the normal samples in order for the OC-SVM to fit a hyperplane that separates the normal data points from the origin.
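As a small worked check of Eq. 3.1, the kernel can be evaluated by hand and compared with scikit-learn's `rbf_kernel`. scikit-learn parametrizes the RBF kernel as exp(-gamma * ||x - y||^2), so σ = 2 corresponds to gamma = 1 / (2σ²) = 0.125. How the study actually passed σ to the library is not stated, so this mapping is an assumption, and the three sample points are made up.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

SIGMA = 2.0

def rbf(xj, xi, sigma=SIGMA):
    """RBF kernel of Eq. 3.1 for two vectors."""
    return np.exp(-np.sum((xj - xi) ** 2) / (2.0 * sigma ** 2))

# scikit-learn uses exp(-gamma * ||x - y||^2), hence gamma = 1 / (2 sigma^2).
gamma = 1.0 / (2.0 * SIGMA ** 2)

X = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, -1.0]])  # illustrative points
K_manual = np.array([[rbf(a, b) for b in X] for a in X])
K_sklearn = rbf_kernel(X, gamma=gamma)
```

Both matrices agree, and every point has kernel value 1 with itself.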

The position of the hyperplane relative to the normal data points is controlled by the parameter nu, which in scikit-learn is an upper bound on the fraction of training points allowed to fall outside the learned region [23]. The model's accuracy is optimized against this parameter using the validation dataset, and the value of nu that maximizes the accuracy is later used when evaluating the model on the testing dataset. The hyperplane works as a decision function that classifies testing data points in the region capturing the normal points as normal and data points elsewhere as anomalies.
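The selection of nu on the validation set could look roughly like the sketch below. The candidate values of nu, the synthetic two-dimensional data, the choice gamma = 0.125 (derived from σ = 2 under the assumption gamma = 1 / (2σ²)) and the helper name `select_nu` are illustrative assumptions; only `OneClassSVM` and `f1_score` are actual scikit-learn APIs.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.svm import OneClassSVM

def select_nu(X_train, X_val, y_val, nus, gamma=0.125):
    """Return the nu (and its F1 score) that maximizes F1 on the validation set.

    y_val uses 1 for anomalies and 0 for normal samples.
    """
    best_nu, best_f1 = None, -1.0
    for nu in nus:
        model = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu).fit(X_train)
        # OneClassSVM.predict returns -1 for outliers and +1 for inliers.
        pred = (model.predict(X_val) == -1).astype(int)
        score = f1_score(y_val, pred, zero_division=0)
        if score > best_f1:
            best_nu, best_f1 = nu, score
    return best_nu, best_f1

# Synthetic stand-in data: normal points around the origin, anomalies far away.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 2))
X_val = np.vstack([rng.normal(size=(100, 2)),
                   rng.normal(loc=6.0, scale=0.5, size=(10, 2))])
y_val = np.array([0] * 100 + [1] * 10)
best_nu, best_f1 = select_nu(X_train, X_val, y_val, nus=[0.01, 0.05, 0.1, 0.2])
```

The selected nu would then be reused unchanged when scoring the held-out testing dataset.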

3.2.2 LOF algorithm

The Local Outlier Factor algorithm is implemented with scikit-learn's LOF class (LocalOutlierFactor). The number of neighbors (k) to be considered when calculating the local outlier factor was selected by evaluating the model's accuracy on the validation dataset, training the model with k ranging from 20 to 200.

The error offset separating the local outlier factors of the normal data points from those of the anomalous data points was determined by the contamination parameter, which also reflects the proportion of outliers in the dataset [23]. Since the proportion of outliers is unknown, this parameter was selected by iteratively training the model and evaluating its accuracy against a set of different outlier proportions. The parameter pair (k, contamination) that maximized the model's accuracy on the validation dataset was selected and used when testing the model.

The model is trained solely on normal samples in order to create a cluster of these data points. When testing the model, each data point in the testing dataset is assigned a local outlier factor. The error offset is then used to separate the normal data points from the anomalous ones, based on their assigned local outlier factors [26].
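A hedged sketch of the (k, contamination) search follows. In scikit-learn, scoring unseen data with a model fitted on a separate training set requires `novelty=True`; whether the study used this setting is not stated, so it is an assumption here, as are the small parameter grids, the synthetic data and the helper name `select_lof_params`.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.neighbors import LocalOutlierFactor

def select_lof_params(X_train, X_val, y_val, ks, contaminations):
    """Grid-search (k, contamination) by F1 on the validation set.

    y_val uses 1 for anomalies and 0 for normal samples.
    """
    best = (None, None, -1.0)
    for k in ks:
        for c in contaminations:
            # novelty=True allows predict() on data not seen during fit().
            model = LocalOutlierFactor(n_neighbors=k, contamination=c, novelty=True)
            model.fit(X_train)
            pred = (model.predict(X_val) == -1).astype(int)  # -1 means outlier
            score = f1_score(y_val, pred, zero_division=0)
            if score > best[2]:
                best = (k, c, score)
    return best

# Synthetic stand-in data: normal points around the origin, anomalies far away.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 2))
X_val = np.vstack([rng.normal(size=(100, 2)),
                   rng.normal(loc=6.0, scale=0.5, size=(10, 2))])
y_val = np.array([0] * 100 + [1] * 10)
best_k, best_c, best_f1 = select_lof_params(
    X_train, X_val, y_val, ks=[20, 50], contaminations=[0.01, 0.05])
```

The contamination value only moves the decision offset; the local outlier factors themselves depend on k alone.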

3.2.3 Autoencoder

The autoencoder was implemented with the Keras Python deep learning library, which is a high-level neural network programming interface.

The autoencoder was modeled by connecting a multi-layer encoder and decoder, along with an input and an output layer. Both the encoder and the decoder had five layers of neurons. The activation function for each layer in both the encoder and the decoder was chosen to be the hyperbolic tangent.

$$h(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{3.2}$$

The number of neurons decreased as information moved forward through the encoder. The number of neurons in the first layer of the encoder (and the last layer of the decoder) was strictly lower than the initial dimensionality of the data: it was chosen to be 75% of the data dimensionality. The number of neurons then decreased by a factor of two per layer until the information reached the encoded layer, and thereafter increased by the same factor. The network was therefore symmetric about the encoded layer.
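The layer-width rule can be made concrete with a few lines of Python. The thesis does not state how fractional widths were rounded or whether the encoded layer counts among the five encoder layers, so this sketch assumes rounding down and treats the fifth encoder layer as the encoded layer; the function name `autoencoder_layer_sizes` is hypothetical.

```python
def autoencoder_layer_sizes(input_dim, n_layers=5):
    """Return (encoder_widths, decoder_widths) for the hidden layers.

    Assumptions: widths are rounded down, and the last encoder layer is the
    encoded layer, so the decoder mirrors the encoder without repeating it.
    """
    first = int(0.75 * input_dim)                      # 75% of the input size
    encoder = [max(1, first // (2 ** i)) for i in range(n_layers)]  # halve per layer
    decoder = encoder[-2::-1]                          # mirror image of the encoder
    return encoder, decoder

ecg_encoder, ecg_decoder = autoencoder_layer_sizes(140)   # ECG5000: 140 features
card_encoder, card_decoder = autoencoder_layer_sizes(30)  # credit card: 30 features
```

Under these assumptions the ECG5000 encoder has widths 105, 52, 26, 13 and 6, followed by a decoder of widths 13, 26, 52 and 105 and an output layer of 140 neurons.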

The autoencoder was trained by calculating the network weights that minimized the reconstruction error on the normal samples in the training dataset. The same network weights were then used when selecting the error threshold and when testing the autoencoder. The minimization problem was solved using the stochastic gradient descent algorithm with a learning rate of 0.01. The autoencoder was trained over multiple epochs, which means that the model went through the entire training dataset multiple times while updating the model weights. This was done to ensure that the reconstruction error converged to a minimum. The training dataset was divided into batches, which are smaller subsets of the dataset that are each used in one forward pass through the network. The batch size was chosen separately for each dataset in order to ensure that the autoencoder learned the dataset patterns and that the training phase did not take too long.

The autoencoder was evaluated by calculating the reconstruction error of each data point in the testing dataset. An error threshold was then selected in order to discriminate between the normal and anomalous data points. If the reconstruction error of a data point was greater than the selected error threshold, that data point was labeled as anomalous. In the same way, a data point that generated a reconstruction error below the error threshold was labeled as normal.

In order to select the error threshold, the accuracy was evaluated as a function of different thresholds on the validation dataset. The threshold value that maximized the accuracy was then used in the testing phase. The candidate threshold values were spaced by a fixed step size, which was chosen to ensure that the optimization process did not take too long but still covered a wide range of threshold values.
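The threshold search described above can be sketched as a simple scan. The reconstruction errors, labels and step size below are made-up values for illustration, and the helper name `select_threshold` is hypothetical; the scan range (from the smallest to the largest validation error) is also an assumption, since the thesis does not state the range used.

```python
import numpy as np
from sklearn.metrics import f1_score

def select_threshold(errors, labels, step):
    """Scan thresholds with a fixed step and return (threshold, F1) maximizing F1.

    labels use 1 for anomalies and 0 for normal samples.
    """
    thresholds = np.arange(errors.min(), errors.max() + step, step)
    scores = [f1_score(labels, (errors > t).astype(int), zero_division=0)
              for t in thresholds]
    best = int(np.argmax(scores))  # first threshold reaching the maximum
    return float(thresholds[best]), float(scores[best])

# Illustrative validation errors: five normal samples and two anomalies.
errors = np.array([0.002, 0.003, 0.004, 0.005, 0.006, 0.020, 0.030])
labels = np.array([0, 0, 0, 0, 0, 1, 1])
threshold, best_f1 = select_threshold(errors, labels, step=0.001)
```

A smaller step gives a finer search at the cost of more F1 evaluations, which is the trade-off mentioned in the text.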

3.3 Evaluation Metrics

The evaluation metrics are the metrics that were used to measure the performance of the models. The accuracy of the models was measured using the F1 score, which is defined as the harmonic mean of precision and recall [24].

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{3.3}$$

$$\text{Precision} = \frac{T_p}{T_p + F_p} \tag{3.4}$$

$$\text{Recall} = \frac{T_p}{T_p + F_n} \tag{3.5}$$

Here T_p is the number of true positives, F_p the number of false positives, and F_n the number of false negatives, with anomalies as the positive class.

Precision is defined as the fraction of all positive predictions that are actual positives [24]. In this case it refers to the number of correctly identified anomalies divided by the number of data points identified as anomalies.

Recall is defined as the fraction of all actual positives that are predicted to be positive [24]. In this case it refers to the number of correctly identified anomalies divided by the total number of actual anomalies, whether or not they were detected.

Chapter 4

Results

4.1 Autoencoder

The autoencoder was trained until the training reconstruction error converged on both datasets. The batch sizes were chosen separately for each dataset according to the size of the corresponding training dataset. Each batch size was chosen so that the training time of the autoencoder was reasonable while still ensuring that the training reconstruction error converged. The batch size for the credit card dataset was chosen to be 20 data points and the batch size for the ECG5000 dataset was chosen to be 2 data points.

4.1.1 Training reconstruction error

The training reconstruction error converged when the autoencoder was trained on each of the datasets. When training the autoencoder on the ECG5000 dataset, the reconstruction error ranged between the initial value 0.100, obtained at the first training epoch, and 0.003, obtained at the last training epoch (Fig 4.1a). When trained on the credit card dataset, the reconstruction error obtained at the first epoch was 0.0435 and the reconstruction error obtained at the last training epoch was 0.0043 (Fig 4.1b). Both reconstruction error curves decreased swiftly during the initial training epochs, after which the rate of decrease slowed down and the error converged.

(a) ECG5000 dataset (b) Credit card dataset

Figure 4.1: The training reconstruction errors achieved on the different epochs when training the autoencoder

4.1.2 Error threshold selection

The error threshold was selected by evaluating the autoencoder's accuracy as a function of different error thresholds on the validation dataset. The error threshold that maximized the function for the ECG5000 dataset was 0.01 (Fig 4.2a) and the maximizing error threshold for the credit card dataset was 0.019 (Fig 4.2b).

(a) ECG5000 dataset (b) Credit card dataset

Figure 4.2: The F1 score as a function of different error thresholds
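The selection procedure described in Section 4.1.2 can be sketched as a simple sweep over candidate thresholds. The snippet below is a minimal illustration, not the code used in the study: the function names `f1_score` and `select_threshold`, the validation errors, the labels and the candidate thresholds are all made-up assumptions, and the F1 score is taken as the selection criterion because Fig. 4.2 plots it.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive (anomaly = 1) class; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_threshold(errors, labels, candidates):
    """Return (best_threshold, best_f1) over the candidate thresholds.

    A point is predicted anomalous when its reconstruction error
    exceeds the threshold. Ties keep the first candidate tried.
    """
    best_t, best_score = None, -1.0
    for t in candidates:
        preds = [1 if e > t else 0 for e in errors]
        score = f1_score(labels, preds)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Hypothetical validation data: reconstruction errors and true labels
# (1 = anomaly, 0 = normal). These numbers are illustrative only.
val_errors = [0.002, 0.004, 0.006, 0.008, 0.012, 0.015, 0.030, 0.045]
val_labels = [0, 0, 0, 0, 1, 0, 1, 1]
candidates = [0.005, 0.010, 0.020, 0.040]

best_threshold, best_f1 = select_threshold(val_errors, val_labels, candidates)
print(best_threshold, round(best_f1, 3))  # prints: 0.01 0.857
```

In this toy example the threshold 0.010 wins because it still catches all three anomalies while flagging only one normal point.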

4.1.3 Testing reconstruction errors

Figs. 4.3a-b show the reconstruction error of each data point in the testing dataset, along with the error threshold as a horizontal line that separates the data points. Data points with a reconstruction error above the threshold were labeled as outliers and those below the threshold were labeled as normal samples. The majority of the data points obtained a reconstruction error below the threshold.

Figs. 4.4a-b show the reconstruction errors of the data points that were incorrectly labeled, given the same data points and error thresholds as in Figs. 4.3a-b. The data points above the threshold were wrongly labeled as outliers and those below the threshold were wrongly labeled as normal samples.

(a) ECG5000 dataset (b) Credit card dataset

Figure 4.3: Testing reconstruction errors of the testing data points along with the error threshold

(a) ECG5000 dataset (b) Credit card dataset

Figure 4.4: The wrongly labeled data points

4.1.4 Autoencoder testing results

The results obtained when evaluating the autoencoder on the testing datasets are shown in Table 4.1.

Autoencoder testing results

Dataset       F1 Score   Precision   Recall
ECG5000       0.621      0.692       0.563
Credit card   0.735      0.833       0.658

Table 4.1: Autoencoder test results
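As a consistency check, the F1 scores in Table 4.1 can be recomputed from the reported precision and recall, since the F1 score is the harmonic mean of the two. The helper name `f1_from` is introduced here for illustration only.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in Table 4.1.
table_4_1 = {
    "ECG5000": (0.692, 0.563),
    "Credit card": (0.833, 0.658),
}

for dataset, (precision, recall) in table_4_1.items():
    print(dataset, round(f1_from(precision, recall), 3))
```

Both values round to the F1 scores reported in Table 4.1 (0.621 and 0.735).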

4.2 OC-SVM

The hyperparameter that determines the distance between the hyperplane and the closest normal data point (the margin) was selected by evaluating the F1 score of the model over a set of candidate distances on the validation datasets. The maximizing distance was 0.034 for the ECG5000 dataset (Fig. 4.5a) and 0.011 for the credit card dataset (Fig. 4.5b).

(a) ECG5000 dataset (b) Credit card dataset

Figure 4.5: The F1 score as a function of different margins

4.2.1 OC-SVM testing results

The results obtained when evaluating the OC-SVM on the testing datasets are shown in Table 4.2.

OC-SVM testing results

Dataset       F1 Score   Precision   Recall
ECG5000       0.737      0.636       0.875
Credit card   0.62       0.517       0.775

Table 4.2: OC-SVM test results

4.3 LOF algorithm

The number of neighbors (k) to consider when calculating the local outlier factor and the contamination parameter were selected by evaluating the model's accuracy on the validation set and choosing the parameter pair that resulted in the maximum accuracy. The model was iteratively trained and validated for k values in the range 20 to 200. For each k value, the contamination ranged from 0.01 to 0.05. The maximizing parameters were (k = 20, contamination = 0.02) for the ECG5000 dataset and (k = 101, contamination = 0.01) for the credit card dataset.
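The parameter search described above amounts to a grid search over (k, contamination) pairs. The following sketch shows the structure of such a search under stated assumptions: it scores points with a plain k-nearest-neighbor mean distance as a lightweight stand-in for the local outlier factor, uses the F1 score on the validation set as the selection criterion, and runs on synthetic two-dimensional data with a much smaller grid than the one used in the study. All function names are hypothetical.

```python
import itertools
import math
import random


def knn_score(point, train, k):
    """Mean Euclidean distance from `point` to its k nearest training points."""
    dists = sorted(math.dist(point, q) for q in train)
    return sum(dists[:k]) / k


def fit_offset(train, k, contamination):
    """Score cut-off that flags about `contamination` of the training points."""
    # Score each training point against all the other training points.
    scores = sorted(
        knn_score(p, train[:i] + train[i + 1:], k) for i, p in enumerate(train)
    )
    index = math.ceil((1 - contamination) * len(scores)) - 1
    return scores[min(index, len(scores) - 1)]


def f1(y_true, y_pred):
    """F1 score of the anomaly class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def grid_search(train, val_x, val_y, ks, contaminations):
    """Return (best_f1, k, contamination) with the highest validation F1."""
    best = (-1.0, None, None)
    for k, c in itertools.product(ks, contaminations):
        offset = fit_offset(train, k, c)
        preds = [1 if knn_score(x, train, k) > offset else 0 for x in val_x]
        score = f1(val_y, preds)
        if score > best[0]:
            best = (score, k, c)
    return best


def sample(rng, center, spread):
    """Draw one 2-D point from an isotropic Gaussian."""
    return (rng.gauss(center, spread), rng.gauss(center, spread))


# Synthetic data (illustrative only): normal points around the origin,
# anomalies around (6, 6). Training uses normal points only.
rng = random.Random(0)
train = [sample(rng, 0, 1) for _ in range(60)]
val_x = [sample(rng, 0, 1) for _ in range(30)] + [sample(rng, 6, 0.5) for _ in range(5)]
val_y = [0] * 30 + [1] * 5

ks = [3, 5, 10]
contaminations = [0.01, 0.02, 0.05]
best_f1, best_k, best_c = grid_search(train, val_x, val_y, ks, contaminations)
print(best_f1, best_k, best_c)
```

Because the model is fitted on normal points only and the labeled validation set is used solely to pick the parameter pair, the sketch follows the semi-supervised setting of the study; replacing `knn_score` with an actual local outlier factor would leave the search loop unchanged.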

4.3.1 Testing outlier factors

Figs. 4.6a-b show the local outlier factors of the data points in the testing dataset. The error offset is shown as a horizontal line that separates the data points. Data points with an outlier factor greater than the error offset were labeled as anomalies and data points with an outlier factor smaller than the error offset were labeled as normal. As with the autoencoder, the majority of the data points obtained an outlier factor smaller than the error offset, meaning that most data points were labeled as normal.

Figs. 4.7a-b show the outlier factors of the data points that were incorrectly labeled. The data points with an outlier factor greater than the error offset were wrongly labeled as anomalies and those with an outlier factor smaller than the error offset were wrongly labeled as normal samples.

(a) ECG5000 dataset (b) Credit card dataset

Figure 4.6: Testing local outlier factors of the testing data points along with the error threshold

(a) ECG5000 dataset (b) Credit card dataset

Figure 4.7: The wrongly labeled data points

4.3.2 LOF algorithm testing results

The results obtained when evaluating the LOF algorithm on the testing datasets are shown in Table 4.3.

LOF algorithm testing results

Dataset       F1 Score   Precision   Recall
ECG5000       0.889      0.800       1.00
Credit card   0.683      0.651       0.718

Table 4.3: LOF algorithm test results

4.4 Results summary

The tables below show the F1 score, precision and recall that the models obtained on both datasets.

Testing results - F1 Score

Dataset/Model   Autoencoder   OC-SVM   LOF
ECG5000         0.621         0.737    0.889
Credit card     0.735         0.62     0.683

Table 4.4: Testing F1 scores

Testing results - Precision

Dataset/Model   Autoencoder   OC-SVM   LOF
ECG5000         0.692         0.636    0.800
Credit card     0.833         0.517    0.651

Table 4.5: Testing precisions

Testing results - Recall

Dataset/Model   Autoencoder   OC-SVM   LOF
ECG5000         0.563         0.875    1.00
Credit card     0.658         0.775    0.718

Table 4.6: Testing recalls

Chapter 5

Discussion

This thesis aimed to compare different semi-supervised anomaly detection techniques in order to draw a generalizable conclusion about which technique is preferable for problems with similar datasets. The idea is that the comparison will simplify the choice of technique when practitioners in credit card fraud analytics and healthcare are faced with similar anomaly detection problems of a semi-supervised nature. The comparison was carried out by comparing the accuracy of three anomaly detection models, namely the autoencoder, the one-class support vector machine and the local outlier factor algorithm, on the ECG5000 and credit card datasets.

The OC-SVM achieved a mediocre accuracy: its F1 score ranked second on the ECG5000 dataset and lowest on the credit card dataset. Even though its F1 score was not the highest, the OC-SVM obtained a relatively high recall, which is a desirable property in anomaly detection tasks where great weight is put on detecting all the anomalies. The resulting precision was lower, which indicates that a number of normal data points were labeled as outliers.

The autoencoder performed relatively well on the credit card dataset, with a high F1 score. The drawback was that the recall was low on both datasets, indicating that many outliers were wrongly labeled as normal. The resulting accuracy of the autoencoder on the ECG5000 dataset was poor. This could possibly be explained by the fact that the autoencoder is a neural network-based method, which in general works better with larger datasets. Training neural networks on smaller datasets often results in convergence to a local minimum instead of a global minimum, which in turn might prevent the autoencoder from achieving the desired low reconstruction errors on the normal points in the testing dataset [20].

The LOF algorithm was clearly a good model to use. Its F1 score on the ECG5000 dataset was the best of the three models, with a perfect recall, which indicates that all the anomalies were correctly classified. The precision it achieved on the ECG5000 dataset was also the highest of the three models. The accuracy obtained on the credit card dataset was not equally good, since there were many mislabeled data points. A good property of the LOF algorithm is that it has been shown to be good at detecting anomalies in datasets where the anomalies may have different characteristics across the data regions [21]. This matches the ECG5000 dataset, since its anomalous data points are divided into four different classes, which might indicate that each class has its own specific feature characteristics.

From an ethical standpoint, the trade-off between the number of true positives and false positives in anomaly detection problems has to be considered. Depending on the context of the problem, one might want to set a lower threshold that allows for an increase in false positives in exchange for a small increase in true positives. In the example of electrocardiograms, where an anomaly might indicate that a patient is having a heart failure, one might argue that from an ethical standpoint it is better to allow many false alarms in order to detect just one actual case. But at the same time, how much time should the medical staff spend running back and forth to a patient just to find out that it was a false alarm? Depending on how costly it is to miss an anomaly, it may be better to examine the recall metric instead of the harmonic mean of recall and precision (F1 score). This is because the recall metric measures the proportion of the actual positives that were correctly identified, which in our case could be translated to ’the percentage of people having a heart failure that were correctly identified’. The OC-SVM and the LOF algorithm were clear winners when it comes to maximizing the recall metric.
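The trade-off can be made concrete with a small, made-up example. The scores, labels and thresholds below are hypothetical; the sketch only illustrates how lowering the threshold raises recall while adding false alarms.

```python
def confusion(scores, labels, threshold):
    """Return (tp, fp, fn) when a score above the threshold means 'anomaly'."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        flagged = score > threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 1:
            fn += 1
    return tp, fp, fn


# Hypothetical anomaly scores; 1 = anomaly, 0 = normal.
scores = [0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]

for threshold in (0.50, 0.28):
    tp, fp, fn = confusion(scores, labels, threshold)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    print(
        f"threshold={threshold:.2f} recall={recall:.2f} "
        f"precision={precision:.2f} false_alarms={fp}"
    )
```

In this toy example, lowering the threshold from 0.50 to 0.28 finds one additional anomaly (recall rises from 0.75 to 1.00) at the cost of two additional false alarms (precision falls from 0.75 to 0.57).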

The critical model parameters were selected by optimizing the model's accuracy on the validation datasets. These parameters were critical to the performance of the models when they were evaluated on the testing datasets. A problem with this way of choosing the parameters is the risk of overfitting: the parameter values are chosen to achieve the best performance on that particular validation dataset and might therefore not generalize. This is a problem because the whole purpose of an anomaly detection system is to classify previously unseen data points, which makes generalization important.

A possible solution to this problem would be to split the initial dataset multiple times and thereby evaluate the accuracy on multiple different validation datasets. The parameters would then be chosen as the average of the maximizing parameter values. This was, however, not considered feasible, as it would be too computationally heavy and would make the execution time too long.
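A rough sketch of this averaging idea is given below. It is deliberately simplified and rests on assumptions that are not part of the study: the anomaly scores come from a single, already trained model, only the validation subset is resampled, and the data are synthetic. Retraining the model for every split, which is the computationally heavy part, is left out. All names are hypothetical.

```python
import random
import statistics


def f1(y_true, y_pred):
    """F1 score of the anomaly class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def best_threshold(errors, labels, candidates):
    """Candidate threshold with the highest F1 (the first one wins ties)."""
    return max(
        candidates,
        key=lambda t: f1(labels, [1 if e > t else 0 for e in errors]),
    )


def averaged_threshold(errors, labels, candidates,
                       n_splits=5, val_fraction=0.5, seed=0):
    """Average the best threshold over several random validation subsets."""
    rng = random.Random(seed)
    indices = list(range(len(errors)))
    size = max(1, int(val_fraction * len(indices)))
    chosen = []
    for _ in range(n_splits):
        subset = rng.sample(indices, size)
        chosen.append(
            best_threshold(
                [errors[i] for i in subset],
                [labels[i] for i in subset],
                candidates,
            )
        )
    return statistics.mean(chosen), chosen


# Hypothetical per-point scores from an already trained model:
# normal points score low, anomalies score high (values are made up).
data_rng = random.Random(1)
errors = [data_rng.uniform(0.000, 0.020) for _ in range(80)]
errors += [data_rng.uniform(0.015, 0.050) for _ in range(20)]
labels = [0] * 80 + [1] * 20
candidates = [0.005, 0.010, 0.015, 0.020, 0.025]

mean_threshold, per_split = averaged_threshold(errors, labels, candidates)
print(per_split, mean_threshold)
```

The spread of the per-split thresholds gives a rough impression of how sensitive the selected value is to the particular validation set.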

5.1 Limitations

The main limitation of this study has been limited computational power. With increased computational power it would be possible to further investigate how the different parameters of the models affect their accuracy. Another limitation has been the difficulty of finding appropriate datasets that qualify as anomaly detection datasets. This is mainly because in many datasets the minority class does not have the characteristics of anomalies, that is, its instances are either too similar to the normal instances in the dataset, not rare enough, or both.

Chapter 6

Conclusions

The results of this study show that the LOF algorithm was clearly the best-performing model on the ECG5000 dataset, as it achieved both high recall and high precision. Although the performance of the models did not differ much on the credit card dataset, the autoencoder performed best. This does not mean that the other models are bad at anomaly detection, as they all achieved relatively high accuracy. As the characteristics of anomaly detection tasks differ, the definition of performance might differ as well, since specific tasks might put a higher weight on either the precision or the recall metric.

It is important to note that this comparison does not indicate which of these models is the best in general. The accuracy of the models is the result of the specific parameters and datasets used. Hopefully the results will generalize to datasets with similar characteristics. To draw a wider conclusion, further research has to be conducted in which the models are tested on a wider set of different datasets.

6.1 Further research

Anomaly detection is an active field, and possible future work would be to expand and update the evaluation with new algorithms and new datasets. It would also be interesting to focus on a more specific problem in a specific field so that a more accurate evaluation can be made.

Bibliography

[1] R. Chalapathy, S. Chawla. Deep Learning for Anomaly Detection: A Survey. 2019.
[2] V. Chandola, A. Banerjee, V. Kumar. Anomaly Detection: A Survey. 2009.
[3] R. Fujimaki, T. Yairi, K. Machida. An Approach to Spacecraft Anomaly Detection Problem Using Kernel Feature Space. 2005.
[4] M. Goldstein, S. Uchida. A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data. 2016.
[5] J. An, S. Cho. Variational Autoencoder based Anomaly Detection using Reconstruction Probability. 2015.
[6] Z. Jiang, A. Men, B. Yang. “A Hybrid Semi-Supervised Anomaly Detection Model for High-Dimensional Data”. In: (2017).
[7] D. Wulsin, J. Blanco, R. Mani, B. Litt. “Semi-supervised anomaly detection for EEG waveforms using deep belief nets”. In: (2010).
[8] W. Noble. What is a support vector machine? 2006.
[9] O. Maimon, L. Rokach. Data Mining and Knowledge Discovery Handbook, Second Edition. Springer New York Dordrecht Heidelberg London, 2010.
[10] Bouchra, et al. Anomaly Detection Using Similarity-based One-Class SVM for Network Traffic Characterization. 2018.
[11] H. Alashwal, R. Othman, S. Deris. “One-Class Support Vector Machines for Protein Protein Interaction Prediction”. In: International Journal of Biological and Medical Sciences 1:2 2006 (2014).
[12] Yung-Yao, et al. “Design and Implementation of Cloud Analytics-Assisted Smart Power Meters Considering Advanced Artificial Intelligence as Edge Analytics in Demand-Side Management for Smart Homes”. In: (2019).
[13] C. Bishop. Pattern Recognition and Machine Learning. Springer International Publishing, 2006, pp. 225–241.
[14] S. Mayu, Y. Takehisa. Anomaly Detection Using Autoencoders with Nonlinear Dimensionality Reduction. 2014.
[15] M. Kramer. Nonlinear Principal Component Analysis Using Autoassociative Neural Networks. 1991.
[16] Prakash, A. Rao. R Deep Learning Cookbook: Solve complex neural net problems with TensorFlow, H2O and MXNet. Packt Publishing, 2017.
[17] V. Chandola, A. Banerjee, V. Kumar. “Anomaly Detection: A Survey”. In: (2009).
[18] A. Adewumi, A. Akinyelu. “A survey of machine-learning and nature-inspired based credit card fraud detection techniques”. In: International Journal of System Assurance Engineering and Management (2017).
[19] Geert, et al. “A survey on deep learning in medical image analysis”. In: (2017).
[20] S. Leonard. Pathophysiology of Heart Disease: A Collaborative Project of Medical Students and Faculty (sixth ed.). Lippincott Williams Wilkins, 2016, p. 74.
[21] Y. Chen, E. Keogh. Dataset: ECG5000. url: http://www.timeseriesclassification.com/description.php?Dataset=ECG5000&fbclid=IwAR158jsinyl5lQ9KeeYkUx84oKjnkZNGrrvUMXNZrYRReJ7tawo4LtBNkWA
[22] Credit Card Fraud Detection. url: https://www.kaggle.com/mlg-ulb/creditcardfraud?fbclid=IwAR2jw2pxfuoW4mwW_w_mf8kz513EEt_-R4wayi37by_1WELqVqRnzM-fpWQ
[23] Pedregosa, F. et al. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[24] Z. Lipton, C. Elkan, B. Naryanaswamy. Optimal Thresholding of Classifiers to Maximize F1 Measure. 2014.

TRITA-EECS-EX-2020:428

www.kth.se