OUTLIER DETECTION FOR OVERNIGHT INDEX SWAPS Master Thesis

Johnny Kuo

Master Thesis, 30 credits Department of Mathematics and Mathematical Statistics Spring Term 2020

Abstract

In this thesis, methods for outlier detection in time series data are investigated. Given data for overnight index swaps (SEK), synthetic data has been created with different types of anomalies. A comparison between the Isolation forest and Local outlier factor algorithms is made by measuring their respective performance on the synthetic datasets with respect to Accuracy, Precision, Recall, F-Measure and Matthews correlation coefficient.

Keywords: Outlier detection, Overnight index swaps, Machine learning, Isolation forest, Local outlier factor

Sammanfattning (Swedish abstract)

In this thesis, methods for anomaly detection in time series data are investigated. Given data for overnight index swaps (SEK), synthetic data has been created with different types of anomalies. A comparison between the Isolation forest and Local outlier factor algorithms is made by measuring their respective performance on the synthetic datasets with respect to Accuracy, Precision, Recall, F-measure and Matthews correlation coefficient.

Nyckelord: Outlier detection, Overnight index swaps, Machine learning, Isolation forest, Local outlier factor

Acknowledgement

I would like to acknowledge and express my gratitude for the support given by Fredrik Bohlin and Richard Henriksson from the department of Model Validation and Quantitative Analysis.

I would also like to acknowledge the support given by my supervisors within the department of Mathematics and Mathematical Statistics, Oleg Seleznjev and Leif Nilsson.

Finally, I would like to extend my gratitude to friends and family who have been by my side and supported me throughout the work.

Thank you!

Stockholm 2020-06-08

Johnny Kuo

List of Figures

1. The spectrum from normal data to outliers. Increasing outlierness score from left to right. Noise and anomalies can be considered as weak or strong outliers.
2. Visualization of the three time series used as a base when generating synthetic datasets. TS 1 is yields with 1 year to maturity, TS 2 is yields with 5 years to maturity and TS 3 is yields with 10 years to maturity.
3. Visualization of global and collective outliers inserted; red points are data points moved with 3 standard deviations for TS1.
4. Visualization of global and collective outliers inserted; red points are data points moved for TS1.
5. Visualization of global and collective outliers inserted; red points are data points moved for TS1.
6. Illustration of the workflow, from start to finish of the project.
7. Performance metrics for Isolation forest.
8. Performance metrics for Local outlier factor.
9. Average performance score for the algorithms.
10. The percentage of similarity between outliers detected in the dataset before and after generation of synthetic outliers.
11. The percentage of similarity between outliers detected in the dataset before and after generation of synthetic outliers.

List of Tables

1. Confusion matrix showing the possible combinations of correct and wrong classifications.
2. Summary of synthetic datasets.
3. Results of Isolation forest.
4. Results of Local outlier factor.
5. Amount of outliers from original datasets in synthetic datasets.

Contents

List of Figures

List of Tables

1 Introduction
  1.1 Svenska Handelsbanken
  1.2 Overnight index swaps
  1.3 Outliers
  1.4 Synthetic data
  1.5 Problem Statement
  1.6 Main objective
  1.7 Delimitation
  1.8 Outline

2 Theory
  2.1 Outlier
  2.2 Types of outliers
    2.2.1 Global outliers
    2.2.2 Contextual outliers
    2.2.3 Collective outliers
  2.3 Time series
  2.4 Machine learning
  2.5 Isolation forest
  2.6 Local outlier factor
  2.7 Metrics for model assessment
    2.7.1 Accuracy
    2.7.2 Precision
    2.7.3 Recall
    2.7.4 F-Measure
    2.7.5 Matthews Correlation Coefficient (MCC)

3 Objective of the project
  3.1 Main objective
  3.2 Delimitation

4 Methodology
  4.1 Programs
  4.2 Description of the data
  4.3 Synthetic datasets
    4.3.1 Synthetic datasets 1 and 2
    4.3.2 Synthetic datasets 3 and 4
    4.3.3 Synthetic datasets 5 and 6
    4.3.4 Synthetic datasets 7 and 8
    4.3.5 Synthetic datasets 9 and 10
    4.3.6 Summary of synthetic datasets
  4.4 Implementation of algorithms
  4.5 Model performance assessment

5 Results
  5.1 Isolation forest
  5.2 Local outlier factor
  5.3 Outliers from original datasets

6 Discussion
  6.1 Isolation forest and Local outlier factor
  6.2 Quality limitations of synthetic data
  6.3 Synthetic datasets

7 Conclusion
  7.1 Best model for anomaly detection
  7.2 Unsupervised anomaly detection

8 Suggestions for further studies

References

Appendices

Author: Johnny Kuo, Svenska Handelsbanken, July 7, 2020

1 Introduction

1.1 Svenska Handelsbanken

Svenska Handelsbanken is one of the oldest listed companies on the Swedish stock exchange. The bank was formed in 1871 with the goal of pursuing "true banking activities", with customers mainly in the Stockholm area. This became the base for the local banking spirit that the bank continues to build on today. Svenska Handelsbanken will be referred to as Handelsbanken in the remainder of the report (1).

Handelsbanken has been an international bank since the late 1980s. Local banking relationships were established across Scandinavia as the bank expanded to Norway, Finland and Denmark. From 2000 the bank further expanded throughout the UK and the Netherlands. In addition, Handelsbanken is also present in other markets to support customers from its home market of Sweden (1).

1.2 Overnight index swaps

An overnight index swap is an index swap, often used for hedging, in which a party exchanges a predetermined cash flow with a counterparty on a specified date. This financial instrument is a specialized type of fixed rate swap and can be set over different time spans; commonly, overnight index swaps run from three months to more than a year (2).

1.3 Outliers

Outliers, also known as abnormalities, discordants, deviants or anomalies in the machine learning and statistical literature, are observations that lie at an abnormal distance from the other values in the population (3). Hawkins (4) defined an outlier as follows:

“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”

In most cases when the generating process behaves in an unusual way, it results in the creation of outliers. Therefore outliers have the potential to provide meaningful insights about the process. Outlier detection is a broad field within statistics. Credit-card fraud is an example from the banking industry where outlier detection has been widely used: patterns in credit-card transaction data are hard to detect by human observation, and anomalies are found more efficiently by outlier detection algorithms (3).

1.4 Synthetic data

The usage of synthetic data has become increasingly important in many fields, including economics, urban planning, transportation planning, cyber security and weather forecasting. The usage of synthetic data can help the development of data analytics applications,

as well as performance testing of models and algorithms (5).

Synthetic data can be specified to meet certain conditions or specifications that cannot be found in the real data. In machine learning, synthetic data generation has been used increasingly, with benefits such as making datasets less expensive and more accessible for AI projects (6). Further benefits of synthetic data are that it can be designed to demonstrate certain key properties of the data, and that it gives a high degree of freedom for testing and training scenarios (7).

Even though the benefits of synthetic data are evident, challenges exist and should be taken into account when creating the data. In many cases, the process of generating synthetic data also requires some minimum of realistic data (5).

One difficulty with synthetic data is assessing the quality of the generated data. Depending on the complexity of the input data, the output data should be evaluated accordingly. If the original data is diverse, this should be taken into consideration in the assessment process. This entails that the people involved in generating the synthetic data must have knowledge about how the synthetic dataset will be used in further analysis (8). As of today, there exists no comprehensive framework for how to create good synthetic data (9).

1.5 Problem Statement

Overnight index swaps (OIS) are recorded daily, and detection of outliers is critical for better understanding the variation of the OIS. The records of OIS may fluctuate from one day to another, and it is essential to assess whether a value is a normal deviation or an extreme value relative to the rest. The goal of this project is therefore to investigate various outlier detection techniques and find the most suitable method. To assess the performance of the algorithms, synthetic data will be generated for evaluation purposes.

1.6 Main objective

The main objective of the project is to evaluate various outlier detection techniques and find the optimal method to detect outliers. To evaluate the effectiveness of Isolation forest and Local outlier factor, synthetic datasets will be created to evaluate the performance of the algorithms. The assessment of the performance will be done with the metrics Accuracy, Precision, Recall, F-Measure and MCC.

1.7 Delimitation

The scope of this project is to assess two different unsupervised outlier detection techniques with synthetic datasets. Real data will be used as a base for the synthetic datasets. Performance limitations will exist without comparison to non-synthetic data.


1.8 Outline

In Chapter 2, an introduction is made to outliers, time series, algorithms used for outlier detection and performance metrics.

The methodology of the project is described in Chapter 4, which presents the datasets involved, the process of generating synthetic datasets, the implementation of the algorithms and the performance assessments.

Finally, the results are given in Chapter 5, followed by discussion in Chapter 6 and conclusions from the project in Chapter 7. Suggestions for further studies are presented in Chapter 8.

2 Theory

This section covers theory related to the topic of outlier detection in this project. The chapter begins with theory regarding outliers, time series, outlier detection techniques and assessment metrics.

2.1 Outlier

In the literature, inliers refer to data points which are normal. In some contexts, such as fraud detection, the data points which do not correspond to the normal sequences of data points are called outliers. In the example of fraud detection, the event of fraud may reflect the actions of an individual in a particular sequence, and this specific sequence is relevant for finding anomalous events. Such anomalies are referred to as collective anomalies, considering that they are inferred collectively from a set or sequence of data points. These collective anomalies are often a result of unusual events that generate anomalous patterns of activity.

The output of an outlier algorithm can be one of two types (3):

• Outlier scores: outlier detection algorithms often output a score indicating the degree to which a data point should be regarded as an anomaly. The score enables ranking of data points in order of their tendency to be considered outliers.

• Binary labels: another output of outlier detection algorithms is a binary label which indicates whether a data point is an outlier or not. This is typically achieved by applying a threshold to the outlier scores, where the threshold is chosen based on the statistical distribution of the scores. Binary labeling holds less information than a scoring mechanism, but is needed for decision making in many applications.
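The thresholding step above can be sketched as follows. This is an illustrative sketch, not thesis code: the function name and the fixed threshold are invented for the example; in practice the threshold is derived from the score distribution.

```python
def scores_to_labels(scores, threshold):
    """Turn continuous outlier scores into binary labels (1 = outlier)."""
    return [1 if s > threshold else 0 for s in scores]

# In practice the threshold is chosen from the statistical distribution of
# the scores (e.g. a high percentile); here it is fixed for illustration.
labels = scores_to_labels([0.1, 0.2, 0.15, 0.9, 0.05], threshold=0.5)
```

Only the clearly separated score of 0.9 is flagged, illustrating how a binary label discards the ranking information that the raw scores carry.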

Since outliers are subjective to the context and dependent on the dataset, the distinction between noise and significant, interesting deviations is of interest.


Figure 1 – The spectrum from normal data to outliers. Increasing outlierness score from left to right. Noise and anomalies can be considered as weak or strong outliers.

Figure 1 illustrates that noise and anomalies can be classified as weak and strong outliers. Noise represents the boundary between normal data and true outliers. Noise is therefore usually modeled as a weak form of outliers, which do not meet the criteria for a point to be classified as sufficiently anomalous (3). This project will focus on how well the outlier detection methods perform at detecting strong outliers.

2.2 Types of outliers

The definition of an outlier is, as previously mentioned, subjective to the context, but there are three generally accepted categories into which outliers can be divided.

2.2.1 Global outliers

In a given dataset, a global outlier is a data object which deviates significantly from the rest of the data objects. Global outliers are also referred to as point anomalies. Global outlier detection is important in many scenarios, for example in trading transaction auditing systems, where transactions that deviate from the regulations are considered global outliers and should be investigated further (10). Global outliers are one of the most common definitions of outliers (11).

2.2.2 Contextual outliers

A contextual outlier exists if the value significantly deviates from the rest of the data in the same context. The same value may not be considered an outlier in a different context. This type of outlier is common in time series data considering that the context in time series is often temporal (11).

2.2.3 Collective outliers

In a given dataset, a subset of data objects forms a collective outlier if the objects as a group deviate significantly from the rest of the dataset. Notably, the individual data objects may not be outliers on their own, but their shared deviation as a group makes them an outlier (10).


2.3 Time series

Time series data are a set of values generated by continuous measurements over time. Values at consecutive time stamps typically do not change very significantly, so sudden changes in the underlying data can be considered anomalous events. Outlier detection is often related to anomalous event detection; the occurrences of such events are regularly contextual or collective outliers tied to certain time stamps (3).

2.4 Machine learning

Machine learning methods are efficient and widely used for outlier analysis. In anomaly detection there exist three broad categories (12): supervised anomaly detection, unsupervised anomaly detection and semi-supervised anomaly detection. Unsupervised anomaly detection is interesting since it has the ability to detect patterns with no existing labels and minimal human supervision (13). In this project, the following unsupervised anomaly detection techniques will be used:

• Isolation forest,

• Local outlier factor.

Isolation forest is fundamentally different from existing methods. The algorithm uses isolation as an effective and efficient way of detecting anomalies, compared to the more commonly used distance and density measures. Furthermore, the algorithm emphasizes low time complexity and a small memory requirement (14). Local outlier factor is a more typical outlier detection technique: it is a density-based method that uses the distances to the k-nearest neighbors to estimate local densities, which in turn are used to determine outliers (15). These methods will be used to investigate which of the techniques is suitable for performing anomaly detection on the OIS time series data.

2.5 Isolation forest

Isolation forest is an isolation-based method which measures individual instances' susceptibility to being isolated (16). Anomalies in this context are the data points which are few and different from the rest of the data, making them more susceptible to isolation. In a data-induced random tree, partitioning of instances is recursively repeated until all points are isolated. Isolation forest uses an anomaly score to classify each point in the dataset. Given a dataset of n instances, the average path length of an unsuccessful search in a binary search tree is defined as (16):

c(n) = 2H(n − 1) − 2(n − 1)/n,   (1)

where H(i) is the harmonic number, which can be estimated by ln(i) + 0.5772156649 (Euler's constant). Here c(n) is the average of h(x) given n, where h(x) is the path length of a point x, measured by the number of edges x traverses in an iTree from the root node until the traversal terminates at an external node. c(n) is used to normalize h(x). The anomaly score s of an instance x is defined as:


s(x, n) = 2^(−E(h(x))/c(n)),   (2)

where E(h(x)) is the average of h(x) over a collection of isolation trees. In Equation 2:

• When E(h(x)) → c(n), then s → 0.5;

• When E(h(x)) → 0 then s → 1;

• When E(h(x)) → n − 1, then s → 0,

where s is monotonically decreasing in E(h(x)). The anomaly score s enables the following assessments:

• if instances return s very close to 1, then they are definitely anomalies,

• if instances have s much smaller than 0.5, then they are quite safe to be regarded as normal instances, and

• if all instances return s ≈ 0.5, then the entire sample does not really have any distinct anomaly.
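Equations 1 and 2 can be computed directly. A minimal sketch, assuming Python rather than the thesis's R implementation; the function names are invented for the example:

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant

def c(n):
    """Average path length of an unsuccessful BST search for n instances (Eq. 1)."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # H(n-1) ~ ln(n-1) + gamma
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """Isolation forest anomaly score s(x, n) = 2^(-E(h(x))/c(n)) (Eq. 2)."""
    return 2.0 ** (-avg_path_length / c(n))

# A point whose average path length equals c(n) scores exactly 0.5,
# and a path length of 0 gives the maximal score 1.
s_mid = anomaly_score(c(256), 256)
s_max = anomaly_score(0.0, 256)
```

This reproduces the limiting behavior listed above: short average path lengths (easily isolated points) push the score toward 1, while path lengths near c(n) give a score of 0.5.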

2.6 Local outlier factor Local outlier factor is a density based outlier technique (17). The algorithm compares the local density of a point to the local densities of its neighbors. The local density is estimated by the distance a point has from its neighbors.

The algorithm emphasizes a local approach to outlier detection. This enables Local outlier factor to find outliers that would not be considered outliers relative to another area of the dataset.

The following steps are performed to calculate the Local outlier factor score, LOF (15). First the reachability distance (rd) and local reachability density (lrd) are defined. The reachability distance of A with respect to B is:

rd_k(A, B) = max(kd(B), dist(A, B)),   (3)

which is the true distance between A and B, but at least the k-distance of B, where kd(B) is the distance from B to its k-th nearest neighbor. The local reachability density of an object A is defined as:

lrd_k(A) = k / Σ_{B ∈ N_k(A)} rd_k(A, B),   (4)

where N_k(A) is the set of k-nearest neighbors of A. Finally the LOF is defined as

LOF_k(A) = ( Σ_{B ∈ N_k(A)} lrd(B) / lrd(A) ) / |N_k(A)|.   (5)

The LOF measures the density around A relative to the densities around its neighbors. The LOF score is interpreted as follows (17):

• LOF ≈ 1 means comparable density to its neighbors,

• LOF < 1 means higher density than neighbors (inlier),

• LOF > 1 means lower density than neighbors (outlier).
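Equations 3 to 5 can be implemented directly for small datasets. A naive sketch (O(n²) pairwise distances), assuming numpy rather than the thesis's Rlof implementation; the function name and toy data are invented for the example:

```python
import numpy as np

def lof_scores(X, k):
    """Naive LOF (Eqs. 3-5) for small datasets. X has shape (n, d)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    nbrs = np.argsort(D, axis=1)[:, 1:k + 1]       # k nearest neighbors (self excluded)
    k_dist = D[np.arange(n), nbrs[:, -1]]          # kd(B): distance to k-th neighbor
    # reachability distance rd_k(A, B) = max(kd(B), dist(A, B))   (Eq. 3)
    rd = np.maximum(k_dist[nbrs], D[np.arange(n)[:, None], nbrs])
    lrd = k / rd.sum(axis=1)                       # local reachability density (Eq. 4)
    return (lrd[nbrs] / lrd[:, None]).mean(axis=1)  # Eq. 5

# Four points in a tight line plus one isolated point:
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
scores = lof_scores(X, k=2)
```

The points inside the line score approximately 1 (comparable density to their neighbors), while the isolated point at 10 scores well above 1, matching the interpretation listed above.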

2.7 Metrics for model assessment The synthetic datasets are labeled with generated outliers. This makes it feasible to evaluate the performance of the methods. The metrics (18) used for evaluation of the performance are Accuracy, Precision, Recall, F-Measure, and MCC (19).

A confusion matrix (Table1) is also used to evaluate the classification performance of the methods (20).

Table 1 – Confusion matrix showing the possible combinations of correct and wrong classifications.

Actual classification   Predicted normal data                  Predicted outlier
Normal data             True non-match: True negative (TN)     False match: False positive (FP)
Outlier                 False non-match: False negative (FN)   True match: True positive (TP)

2.7.1 Accuracy

Accuracy is the total proportion of all correct predictions, which can be expressed as:

Accuracy = (TP + TN) / (TP + TN + FP + FN).   (6)

2.7.2 Precision

Precision is the percentage of the reported anomalies that are correctly identified, denoted by:

Precision = TP / (TP + FP),   (7)

and equals 1.0 if all the points identified by the algorithm are true outliers.


2.7.3 Recall

Recall is the percentage of the real anomalies which are detected, expressed by:

Recall = TP / (TP + FN).   (8)

2.7.4 F-Measure

F-Measure is the weighted harmonic mean of Precision and Recall, which can be given by:

F-Measure = 2TP / (2TP + FP + FN).   (9)

The F-Measure gives a measurement of a test's accuracy by using both Precision and Recall.

2.7.5 Matthews Correlation Coefficient (MCC)

The Matthews Correlation Coefficient (MCC) is used as a measure of the quality of binary classifications (19). The MCC summarizes the confusion matrix and is expressed by:

MCC = (TP · TN − FP · FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)).   (10)

The MCC has a range of -1 to 1, where wrong classifications are indicated by -1 and correct classifications are indicated by 1.
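All five metrics follow directly from the confusion-matrix counts. A minimal sketch in Python; the function name and the example counts are invented for illustration:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the five metrics (Eqs. 6-10) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / (
        ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure, "mcc": mcc}

# Hypothetical counts for a detector that found 8 of 16 true outliers
# among 200 points while raising 4 false alarms:
m = classification_metrics(tp=8, tn=180, fp=4, fn=8)
```

Note how Accuracy stays high (0.94) even though Recall is only 0.5, which is why the thesis also reports Precision, Recall, F-Measure and MCC for the heavily imbalanced outlier-detection setting.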

3 Objective of the project

3.1 Main objective

The main objective of the project is to evaluate various outlier detection techniques and find the best method to detect outliers. To evaluate the effectiveness of Isolation forest and Local outlier factor, synthetic datasets will be created to evaluate the performance of the algorithms. The assessment of the performance will be done with the metrics Accuracy, Precision, Recall, F-Measure and MCC.

3.2 Delimitation

The scope of this project is to assess two different unsupervised outlier detection techniques with synthetic datasets. Real data will be used as a base for the synthetic datasets. Performance limitations will exist without comparison to non-synthetic data.


4 Methodology

This section describes the programs used and the procedure for creating synthetic datasets with global and collective outliers. The implementations of the algorithms are described, and the assessment of the methods is done with the performance metrics Accuracy, Precision, Recall, F-Measure and MCC.

4.1 Programs

R is a programming language created for statistical computing, data processing and graphing (21). The implementations of Isolation forest and Local outlier factor are taken from the packages solitude (22) and Rlof (23). Excel was used for graphics.

4.2 Description of the data

The data used in the project is provided from Handelsbanken's database for overnight index swaps in SEK. The interest of Handelsbanken is to see how different outlier detection techniques perform with different kinds of outliers in the time series.


Figure 2 – Visualization of the three time series used as a base when generating synthetic datasets. TS 1 is yields with 1 year to maturity, TS 2 is yields with 5 years to maturity and TS 3 is yields with 10 years to maturity.

An overview of the three time series can be observed in Figure 2. Three different time series were selected: yields with 1 year to maturity (TS 1), yields with 5 years to maturity (TS 2) and yields with 10 years to maturity (TS 3). The yield data range from 2012 to late 2019.

4.3 Synthetic datasets

In order to evaluate the performance of the outlier detection algorithms, synthetic datasets were generated. This is needed, considering that for unsupervised

outlier detection there is no accessible way of measuring the results. The synthetic datasets provide labels for the data, so that performance metrics can be used. This section describes each synthetic dataset (SD) and the procedure for generating outliers. Different amounts of outliers were tested, and the final amount of outliers was set to 3.9% (18). The following procedures were applied to all three time series for every generated dataset.

4.3.1 Synthetic datasets 1 and 2

SD1 simulates global outliers and SD2 simulates collective outliers. The time series consist of seven periods, each period representing one year from 2012 to 2019. Global outliers were simulated by randomly selecting data points and moving them by 3 standard deviations (based on the local standard deviation of each year) (18). Each selected point was randomly moved either up or down by 3 standard deviations, with probability p = 1/2. When simulating collective outliers, eight random sequences of seven data points each were selected and moved by 3 standard deviations (based on the local standard deviation of each year). Eight sequences of seven data points gives 3.9% of the data points moved, which is the desired amount of outliers to be generated.
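The injection procedure can be sketched as follows. This is an illustrative sketch, not the thesis code: it uses Python with numpy rather than R, a fixed random seed, and the standard deviation of the whole series, whereas SD1 and SD2 use the per-year (local) standard deviation; the function names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_global_outliers(series, fraction=0.039, n_std=3):
    """Move a random fraction of points up or down by n_std standard deviations."""
    y = series.copy()
    n_out = max(1, int(round(fraction * len(y))))
    idx = rng.choice(len(y), size=n_out, replace=False)
    signs = rng.choice([-1.0, 1.0], size=n_out)   # up or down, p = 1/2
    y[idx] += signs * n_std * y.std()
    labels = np.zeros(len(y), dtype=int)
    labels[idx] = 1                               # ground-truth outlier labels
    return y, labels

def inject_collective_outliers(series, n_sequences=8, seq_len=7, n_std=3):
    """Move n_sequences random runs of seq_len consecutive points by n_std stds."""
    y = series.copy()
    labels = np.zeros(len(y), dtype=int)
    starts = rng.choice(len(y) - seq_len, size=n_sequences, replace=False)
    for s in starts:
        sign = rng.choice([-1.0, 1.0])            # the whole sequence moves together
        y[s:s + seq_len] += sign * n_std * y.std()
        labels[s:s + seq_len] = 1
    return y, labels

series = np.sin(np.linspace(0.0, 20.0, 1000))
y_glob, lab_glob = inject_global_outliers(series)
y_coll, lab_coll = inject_collective_outliers(series)
```

The labels returned alongside the perturbed series are what makes the supervised performance metrics of Section 2.7 applicable to an otherwise unsupervised problem. Note that in this sketch the random sequences may overlap, which the thesis does not specify.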


Figure 3 – Visualization of global and collective outliers inserted, red points are data points moved with 3 standard deviations for TS1.

An overview of the SD1 and SD2 simulations can be observed in Figure 3 for time series 1.

4.3.2 Synthetic datasets 3 and 4

SD3 simulates global outliers by randomly moving data points by either 1, 2 or 3 standard deviations (based on the local standard deviation of each year) with probability

p = 1/3. SD4 simulates collective outliers, in which eight random sequences of seven data points each were selected and moved up or down with probability p = 1/2. How far a point is moved is also randomized between 1, 2 and 3 standard deviations (based on the local standard deviation of each year) with probability p = 1/3.

Figure 4 – Visualization of global and collective outliers inserted, red points are data points moved for TS1.

An overview of the SD3 and SD4 simulations can be observed in Figure 4 for time series 1.


4.3.3 Synthetic datasets 5 and 6

The same procedure for generating outliers as in SD3 and SD4 was implemented for SD5 and SD6. In these simulations, however, the points were moved by the global standard deviation of each time series instead of the local one. SD5 simulates global outliers by randomly moving data points by either 1, 2 or 3 standard deviations (p = 1/3), up or down (p = 1/2). SD6 simulates collective outliers, in which eight random sequences of seven data points each were selected and moved up or down (p = 1/2). How far a point is moved is also randomized between 1, 2 and 3 standard deviations (p = 1/3).


Figure 5 – Visualization of global and collective outliers inserted, red points are data points moved for TS1.

An overview of the SD5 and SD6 simulations can be observed in Figure 5 for time series 1.

4.3.4 Synthetic datasets 7 and 8

SD7 and SD8 have random points moved by 2 standard deviations (based on the global standard deviation). SD7 simulates global outliers by randomly moving data points with


2 standard deviations, up or down (p = 1/2). SD8 simulates collective outliers, in which eight random sequences of seven data points each were selected and moved up or down (p = 1/2).

4.3.5 Synthetic datasets 9 and 10

SD9 and SD10 have random points moved by 3 standard deviations (based on the global standard deviation) (18). SD9 simulates global outliers by randomly moving data points by 3 standard deviations, up or down. SD10 simulates collective outliers, in which eight random sequences of seven data points each were selected and moved up or down (p = 1/2).

4.3.6 Summary of synthetic datasets

Table 2 shows a recap of the generated synthetic datasets. The type of outlier is either global or collective, and the movement of the values differs in each case.

Table 2 – Summary of synthetic datasets

Synthetic dataset   Type of outliers   Movement of points
SD1                 Global             3*std (local based)
SD2                 Collective         3*std (local based)
SD3                 Global             1*std, 2*std or 3*std (local based)
SD4                 Collective         1*std, 2*std or 3*std (local based)
SD5                 Global             1*std, 2*std or 3*std (global based)
SD6                 Collective         1*std, 2*std or 3*std (global based)
SD7                 Global             2*std (global based)
SD8                 Collective         2*std (global based)
SD9                 Global             3*std (global based)
SD10                Collective         3*std (global based)

4.4 Implementation of algorithms

Isolation forest and Local outlier factor create decision boundaries to separate normal data from outliers. Isolation forest calculates an anomaly score for each data point; a score above 0.5 is considered an outlier, while points scoring below 0.5 are considered normal. For Isolation forest, the parameter values used are those from the original publication, with the number of decision trees set to t = 100 (16).

The Local outlier factor algorithm uses k as the size of the neighborhood. The value of k is used to calculate the density score of each point in the data. Different values of k were tested, and k = 50 was finally chosen as it gave the best results according to MCC.
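As an illustration of this configuration, the following sketch uses scikit-learn's implementations with the same parameter choices (t = 100 trees, k = 50 neighbors); the thesis itself uses the R packages solitude and Rlof, and the toy data and the LOF cut-off of 1.5 are assumptions made only for this example. Note that scikit-learn's score_samples returns the negative of the paper's anomaly score.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Toy data: a tight cluster plus two obvious outliers (illustrative only).
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0.0, 0.1, 500), [2.0, -2.0]]).reshape(-1, 1)

# Isolation forest with t = 100 trees; the paper's score s is the negative
# of sklearn's score_samples, and s > 0.5 flags an outlier.
iforest = IsolationForest(n_estimators=100, random_state=0).fit(X)
if_scores = -iforest.score_samples(X)
iforest_outliers = if_scores > 0.5

# Local outlier factor with neighborhood size k = 50; sklearn stores -LOF,
# and LOF > 1 indicates lower density than the neighbors (outlier).
lof = LocalOutlierFactor(n_neighbors=50)
lof.fit(X)
lof_vals = -lof.negative_outlier_factor_
lof_outliers = lof_vals > 1.5  # cut-off is an assumption for this sketch
```

Both decision rules then reduce each point to a binary outlier label, which is what the confusion-matrix-based metrics of Section 4.5 consume.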


Figure 6 – Illustration of the workflow, from start to finish of the project.

The workflow of the project is illustrated in Figure 6. First, the data for the time series are extracted. In the second step, the synthetic datasets are generated with the extracted time series as a base, and labels are created for the anomalies in each dataset. In step three, the algorithms are applied to each dataset, and their performance is assessed in step four. In step five, discussion and conclusions from the project are drawn.

4.5 Model performance assessment

The performance of the algorithms is calculated with a confusion matrix. Further evaluation of the performance is done with the metrics Accuracy, Precision, Recall, F-measure and MCC.


5 Results

This section presents the results from the ten synthetic datasets. The similarity of outliers detected before and after simulation of the synthetic sets is also displayed.

5.1 Isolation forest

Figure 7 – Performance metrics for Isolation forest.

From Figure 7, we can observe that the overall Accuracy is very high. The Accuracy in all cases is above 96% (see Table 3).

The four datasets SD1, SD2, SD3 and SD4 are all based on local standard deviations. Comparing them, Isolation forest is observed to have the best performance on SD4, with Precision above 0.65 and MCC above 0.5.

SD5 and SD6 are observed to show a difference in performance, with the algorithm performing better for SD6. This could imply that collective outliers are easier to detect.

The last four datasets SD7, SD8, SD9 and SD10 are all based on the global standard deviation for the outlier generation. The algorithm performed best at detecting anomalies in these cases, which was expected considering that they contain the most distinctive generated outliers. Of these four sets, the algorithm performed worse for SD7 and SD8 compared to SD9 and SD10, on which the algorithm showed its best performance.


5.2 Local outlier factor

Figure 8 – Performance metrics for Local outlier factor.

From Figure 8, we can observe that Accuracy is high for all the datasets, with Accuracy above 95% (Table 4).

SD1, SD2, SD3 and SD4 are all based on local standard deviations. The performance of the algorithm was best on SD4 among them.

The algorithm is observed to show slight difference in performance between SD5 and SD6, with the algorithm performing better for SD6. This could imply that the collective outliers are easier to detect.

In the last four datasets SD7, SD8, SD9 and SD10, the global standard deviation is used for the outlier generation. The best performances are for SD9 and SD10.


Figure 9 – Average performance score for the algorithms.

Figure 9 shows the average score of each performance metric. The difference between the algorithms is marginal.

5.3 Outliers from original datasets

The outliers detected by the algorithms before the simulation of the synthetic datasets are also of interest, since they indicate how much the synthetic outliers differ from the original outliers in the datasets, which could affect the performance of the algorithms.


Figure 10 – The percentage of similarity between outliers detected in dataset before and after generation of synthetic outliers.

From Figure 10, we can observe that the similarity of outliers detected in each time series before and after the simulation of synthetic outliers is below 4.5% in all cases. This could imply that the simulated outliers were significantly different from the original outliers, which should therefore have little impact on the results for the synthetic outliers.

Figure 11 – The percentage of similarity between outliers detected in dataset before and after generation of synthetic outliers.

From Figure 11, we can observe that the share of outliers detected before the simulation of synthetic outliers was below 5% in all cases. The initial outliers in the datasets should therefore not affect the performance of the experiment in a considerable way.

6 Discussion

In this section, the synthetic datasets are discussed, followed by an assessment of the performance of the algorithms.

6.1 Isolation forest and Local outlier factor

Comparing Isolation forest and Local outlier factor, the overall performance is slightly higher for Isolation forest.

Accuracy gives the ratio of correctly predicted observations to the total number of observations. It is high for both algorithms, which is always desirable.

Precision gives the ratio of correct positive predictions to the total number of predicted positives; in other words, the share of predicted outliers that are true outliers. Isolation forest showed Precision above 0.60 for 16 out of 30 synthetic datasets, whereas Local outlier factor did so for 9 out of 30.

Recall gives the ratio of correct positive predictions to all true outliers; it tells us how many of the true outliers the algorithms actually found. Here Local outlier factor performed slightly better than Isolation forest in most cases.

The F-Measure is the harmonic mean of Precision and Recall, which means it takes both FP and FN into account. When both FN and FP are costly, the F-Measure is an important metric to look at. The average performance of the two algorithms was equally good.

The MCC score is argued to have an advantage over Accuracy and F-Measure, because MCC balances all four categories of the confusion matrix (TP, TN, FN and FP). A value of 1 represents a perfect prediction, 0 a prediction no better than random, and -1 total disagreement between predictions and observations. The average MCC score was marginally better for Isolation forest.

As expected, performance improved the more distinctive the generated outliers were. Looking at the average scores for each performance metric, the algorithms give very similar results; there is thus no clear winner between them.


6.2 Quality limitations of synthetic data

Synthetic data generation has many advantages and is useful for testing the performance and robustness of algorithms. However, it comes at the cost of quality limitations. One pitfall is that the generation process influences the properties of the simulated anomalies. Since no established framework exists for how to create good synthetic data, datasets need to be created through numerous trials to mimic the nature of real anomalies. Simulating outliers that differ too much from the data makes detection easier for the algorithms, but at the cost of being unrealistic in a real environment. The amount of anomalies in a dataset is another parameter for adjusting how difficult the points are for the algorithms to detect.

When creating synthetic data the way it was done in this project, it is also important to consider the amount of outliers detected by the algorithms before the simulation of synthetic outliers. The results showed that at most 5% of the simulated points coincided with initial outliers. This indicates that the simulated points were significantly different, which is desirable in this study. If the simulated outliers had high similarity with the initial outliers, the performance metrics would not be as reliable.
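This check reduces to a small set operation. The thesis does not spell out the exact similarity formula behind Figures 10 and 11, so the function below is one hypothetical reading: the share of points flagged after the injection that were already flagged before it.

```python
def outlier_overlap(before_idx, after_idx):
    # Share (in %) of outliers flagged after the simulation that were
    # already flagged before it. A hypothetical reading of the
    # "similarity" reported in Figures 10-11, not the thesis's formula.
    before, after = set(before_idx), set(after_idx)
    return 100.0 * len(before & after) / len(after) if after else 0.0
```

For instance, if one of four post-injection detections was already an outlier in the original series, the overlap is 25%, comfortably below the 5% observed in the experiment only when the flagged sets are nearly disjoint.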

6.3 Synthetic datasets

Outlier detection is a wide field within statistics, and effective outlier detection methods are applied in many real-life environments. The difficulty with outlier detection is the rarity of outliers, which makes them hard to predict, while collecting extensive real data is generally expensive and time-consuming. A better understanding of how to simulate anomalies well would thus provide datasets for algorithm assessment at lower cost and in less time. The main challenge in this project was to simulate realistic outliers. There are many ways of creating outliers, and the project could easily be extended by testing more combinations of outlier-generating techniques.

7 Conclusion

In this section the conclusions of the project are presented.

7.1 Best model for anomaly detection

There is no clear winner in performance in this experiment. The average performance metrics of the algorithms differ only marginally.

7.2 Unsupervised anomaly detection

The project used two unsupervised anomaly detection methods on synthetic data. With labelled synthetic data, the evaluation of the methods' performance was feasible; without the labels, the results would be rather subjective and difficult to evaluate. Considering the results, Isolation forest shows promise for anomaly detection in time series of overnight index swaps. However, there are quality limitations in using synthetic datasets, which should also be taken into consideration when interpreting the performance results.

8 Suggestions for further studies

The project tested Isolation forest, which is an ensemble-based technique, and Local outlier factor, which is a density-based technique. There are many other anomaly detection techniques that could be interesting to evaluate.
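The core idea behind the ensemble-based Isolation forest [16] — anomalies are isolated after fewer random splits than normal points — can be sketched in a few lines. This is a simplified one-dimensional illustration, not the `solitude` implementation [22] used in the project:

```python
import math
import random

def c(n):
    # Average path length of an unsuccessful BST search; used to
    # normalise the depth of unresolved leaves (Liu et al., 2008).
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def path_length(x, data, rng, depth=0, limit=8):
    # Recursively split the data at a uniformly random value and
    # follow the side containing x, until x is isolated or the
    # depth limit is reached.
    if len(data) <= 1 or depth >= limit:
        return depth + c(len(data))
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth + c(len(data))
    split = rng.uniform(lo, hi)
    side = ([v for v in data if v < split] if x < split
            else [v for v in data if v >= split])
    return path_length(x, side, rng, depth + 1, limit)

def avg_depth(x, data, n_trees=200, seed=0):
    # Average isolation depth over many random trees; shorter
    # average depth means "more anomalous".
    rng = random.Random(seed)
    return sum(path_length(x, data, rng) for _ in range(n_trees)) / n_trees
```

On a toy series such as five clustered yields and one distant value, the distant value is isolated after roughly one split on average, while the clustered points need several, which is exactly the signal the score of the full algorithm is built from.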

The settings for Isolation forest showed good results; for further optimization of the predictions, additional parameter tuning could be investigated. For Local outlier factor, the choice of k is essential for the performance of the algorithm. During the project, k was selected by testing for the best MCC. Selecting k based on other metrics could yield better results.
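The k-selection procedure described above can be sketched as a small search loop. For brevity, the sketch scores points by the distance to their k-th nearest neighbour — a crude density proxy standing in for the actual LOF score [15] — and keeps the k that maximises MCC against the known synthetic labels:

```python
import math

def kth_nn_distance(data, k):
    # Distance to the k-th nearest neighbour: a stand-in outlier
    # score for this sketch, not the full LOF computation.
    scores = []
    for i, x in enumerate(data):
        dists = sorted(abs(x - y) for j, y in enumerate(data) if j != i)
        scores.append(dists[k - 1])
    return scores

def mcc(tp, fp, fn, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def select_k(data, labels, ks, n_outliers):
    # Flag the n_outliers highest-scoring points for each candidate k,
    # then keep the k with the best MCC against the synthetic labels.
    best_k, best_mcc = None, -2.0
    for k in ks:
        scores = kth_nn_distance(data, k)
        cut = sorted(scores, reverse=True)[n_outliers - 1]
        pred = [1 if s >= cut else 0 for s in scores]
        tp = sum(p == 1 and l == 1 for p, l in zip(pred, labels))
        fp = sum(p == 1 and l == 0 for p, l in zip(pred, labels))
        fn = sum(p == 0 and l == 1 for p, l in zip(pred, labels))
        tn = sum(p == 0 and l == 0 for p, l in zip(pred, labels))
        m = mcc(tp, fp, fn, tn)
        if m > best_mcc:
            best_k, best_mcc = k, m
    return best_k, best_mcc
```

Swapping the selection criterion (MCC) for Recall or F-Measure, as suggested above, only changes the scoring line inside the loop.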

The creation of synthetic data is rather challenging, and the variations of outlier creation are extensive. To develop the synthetic data creation further, more techniques and parameter variations could be explored. For instance, generating outliers from different kinds of distributions could be interesting.

The amount of outliers is another parameter that could be changed to further test the algorithms. The proportion of outliers generated in this project, 3.9%, and the shift of 3 standard deviations were inspired by previous work on simulation of synthetic datasets. It could be interesting to examine even smaller shifts than those implemented here, to see how small a value change can be and still be flagged as an outlier by the algorithms.
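As a hedged illustration of this kind of generation step — not the thesis's actual R code — injecting a fixed fraction of outliers by shifting randomly chosen points by a multiple of a local standard deviation could look like the following (the `window` parameter and the symmetric ± shift are assumptions of this sketch):

```python
import random
import statistics

def inject_outliers(series, frac=0.039, n_std=3.0, window=20, seed=1):
    # Shift a random fraction `frac` of points by +/- n_std local
    # (rolling-window) standard deviations. Returns the modified
    # series and the sorted indices of the injected outliers.
    rng = random.Random(seed)
    out = list(series)
    n = max(1, round(frac * len(series)))
    idx = rng.sample(range(len(series)), n)
    for i in idx:
        lo, hi = max(0, i - window), min(len(series), i + window)
        sd = statistics.pstdev(series[lo:hi])
        out[i] += rng.choice((-1.0, 1.0)) * n_std * sd
    return out, sorted(idx)
```

Lowering `n_std` below 3 directly implements the suggestion above of testing how small a shift the algorithms can still detect.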


References

[1] Handelsbanken, “Om handelsbanken,” 2020. https://www.handelsbanken.se/sv/om- oss (Accessed: 2020-05-06).

[2] J. Chen, “Overnight index swap definition,” 2019. https://www.investopedia.com/terms/o/overnightindexswap.asp (Accessed: 2020-05-16).

[3] C. C. Aggarwal, Outlier Analysis, second edition, 2016.

[4] D. M. Hawkins, Identification of outliers, vol. 11. Springer, 1980.

[5] D. Libes, D. Lechevalier, and S. Jain, “Issues in synthetic data generation for advanced manufacturing,” in 2017 IEEE International Conference on Big Data (Big Data), pp. 1746–1754, IEEE, 2017.

[6] A. Gonfalonieri, “Do you need synthetic data for your ai project?,” 2019. https://towardsdatascience.com/do-you-need-synthetic-data-for-your-ai-project-e7ecc2072d6b (Accessed: 2020-06-01).

[7] E. L. Barse, H. Kvarnstrom, and E. Jonsson, “Synthesizing test data for fraud detection systems,” in 19th Annual Computer Security Applications Conference, 2003. Proceedings., pp. 384–394, IEEE, 2003.

[8] R. Röhm, “The prospects and limitations of synthetic data,” 2019. https://www.linkedin.com/pulse/prospects-limitations-synthetic-data-robin-r%C3%B6hm/ (Accessed: 2020-06-02).

[9] C. Joshi, “Generative adversarial networks (gans) for synthetic dataset generation with binary classes,” 2019. https://datasciencecampus.ons.gov.uk/projects/generative-adversarial-networks-gans-for-synthetic-dataset-generation-with-binary-classes/ (Accessed: 2020-06-03).

[10] J. Han, J. Pei, and M. Kamber, Data mining: concepts and techniques. Elsevier, 2011.

[11] I. Cohen, “A quick guide to the different types of outliers,” 2018. https://www.anodot.com/blog/quick-guide-different-types-outliers/ (Accessed: 2020-05-10).

[12] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM computing surveys (CSUR), vol. 41, no. 3, pp. 1–58, 2009.


[13] G. E. Hinton, T. J. Sejnowski, T. A. Poggio, et al., Unsupervised learning: foundations of neural computation. MIT press, 1999.

[14] A. C. Bahnsen, “Benefits of anomaly detection using isolation forests,” 2016. https://blog.easysol.net/using-isolation-forests-anamoly-detection/ (Accessed: 2020-05-20).

[15] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “Lof: identifying density-based local outliers,” in Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pp. 93–104, 2000.

[16] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422, IEEE, 2008.

[17] M. Hahsler, “Local outlier factor score,” 2020. https://www.rdocumentation.org/packages/dbscan/versions/1.1-5/topics/lof (Accessed: 2020-05-10).

[18] Z. Cheng, C. Zou, and J. Dong, “Outlier detection using isolation forest and local outlier factor,” in Proceedings of the Conference on Research in Adaptive and Convergent Systems, pp. 161–168, 2019.

[19] D. Lettier, “You need to know about the matthews correlation coefficient,” 2017. https://lettier.github.io/posts/2016-08-05-matthews-correlation-coefficient.html (Accessed: 2020-05-01).

[20] S. Narkhede, “Understanding confusion matrix,” 2018. https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62 (Accessed: 2020-04-13).

[21] R Core Team, “The r project for statistical computing,” 2020. https://www.r-project.org/ (Accessed: 2020-05-06).

[22] K. Srikanth, “An implementation of isolation forest,” 2019. https://www.rdocumentation.org/packages/solitude/versions/0.2.1 (Accessed: 2020-05-06).

[23] Y. Hu, W. Murray, and Y. Shan, “Rlof: R parallel implementation of local outlier factor (lof),” 2015. https://cran.r-project.org/web/packages/Rlof/index.html (Accessed: 2020-04-20).


Appendices

Performance metrics of Isolation forest

Table 3 – Results of Isolation forest.

Synthetic dataset   Time series   Accuracy   Precision   Recall    F-Measure    MCC
SD1                 TS 1          0.9639     0.3833      0.4339    0.407        0.3893
                    TS 2          0.9682     0.4166      0.283     0.337        0.3277
                    TS 3          0.9693     0.481415    0.2363    0.31070732   0.32377
SD2                 TS 1          0.9725     0.2745      0.5357    0.3658       0.3577
                    TS 2          0.98       0.3783      0.5       0.4307       0.425
                    TS 3          0.9768     0.2413      0.25      0.2456       0.2339
SD3                 TS 1          0.9591     0.3018      0.2909    0.2962       0.2753
                    TS 2          0.9521     0.2631      0.17785   0.2127       0.1969
                    TS 3          0.9618     0.2857      0.1428    0.19047      0.1849
SD4                 TS 1          0.98       0.7272      0.5714    0.63         0.635
                    TS 2          0.9784     0.7096      0.3928    0.5057       0.5177
                    TS 3          0.9774     0.7666      0.4107    0.5348       0.5518
SD5                 TS 1          0.9704     0.2666      0.2142    0.2376       0.2179
                    TS 2          0.9644     0.3611      0.2321    0.2826       0.2721
                    TS 3          0.9714     0.4375      0.25      0.3181       0.3154
SD6                 TS 1          0.9704     0.875       0.5       0.6363       0.6541
                    TS 2          0.9644     0.7187      0.5       0.5227       0.5331
                    TS 3          0.9714     0.6875      0.5       0.5          0.5089
SD7                 TS 1          0.97       0.47        0.4181    0.4523       0.4647
                    TS 2          0.972      0.655       0.425     0.45         0.464
                    TS 3          0.9752     0.6551      0.45      0.4523       0.4647
SD8                 TS 1          0.965      0.614       0.625     0.6194       0.6075
                    TS 2          0.9731     0.5833      0.625     0.56         0.54
                    TS 3          0.9757     0.6086      0.625     0.549        0.5391
SD9                 TS 1          0.9913     0.9347      0.7678    0.8431       0.843
                    TS 2          0.98       0.8367      0.7321    0.7809       0.7764
                    TS 3          0.99       0.9166      0.7857    0.8461       0.844
SD10                TS 1          0.9854     0.90625     0.5178    0.659        0.67
                    TS 2          0.9887     0.8285      0.5178    0.6373       0.6373
                    TS 3          0.9866     0.9487      0.6607    0.7789       0.7866


Performance metrics of Local outlier factor

Table 4 – Results of Local outlier factor.

Synthetic dataset   Time series   Accuracy   Precision   Recall   F-Measure   MCC
SD1                 TS 1          0.9564     0.321428    0.3272   0.3243      0.3035
                    TS 2          0.9564     0.2142      0.2181   0.2162      0.1921
                    TS 3          0.9564     0.2678571   0.2727   0.27027     0.2478
SD2                 TS 1          0.9634     0.1428      0.2857   0.1904      0.1849
                    TS 2          0.9623     0.125       0.25     0.1666      0.159
                    TS 3          0.9623     0.125       0.25     0.1666      0.159
SD3                 TS 1          0.9628     0.375       0.3818   0.3783      0.3592
                    TS 2          0.9521     0.16        0.16     0.16        0.13
                    TS 3          0.9553     0.1964      0.2      0.1981      0.1735
SD4                 TS 1          0.9779     0.5         0.509    0.4045      0.4892
                    TS 2          0.9671     0.4464      0.4545   0.4545      0.4335
                    TS 3          0.9661     0.4285      0.4363   0.4324      0.4149
SD5                 TS 1          0.9655     0.625       0.648    0.6363      0.6253
                    TS 2          0.9623     0.5         0.5185   0.509       0.4942
                    TS 3          0.9634     0.4464      0.4629   0.4545      0.4379
SD6                 TS 1          0.9806     0.6785      0.7307   0.7037      0.6953
                    TS 2          0.986      0.5535      0.5961   0.574       0.5617
                    TS 3          0.9634     0.5357      0.5769   0.5555      0.5426
SD7                 TS 1          0.9741     0.5535      0.574    0.5636      0.5504
                    TS 2          0.9618     0.481       0.5      0.4909      0.4754
                    TS 3          0.9698     0.4821      0.5      0.4909      0.4754
SD8                 TS 1          0.9752     0.6071      0.6538   0.6296      0.619
                    TS 2          0.972      0.5535      0.5961   0.574       0.5617
                    TS 3          0.9833     0.5         0.549    0.5233      0.5098
SD9                 TS 1          0.9838     0.7442      0.7407   0.7272      0.719
                    TS 2          0.9784     0.625       0.648    0.6363      0.6253
                    TS 3          0.9935     0.875       0.9074   0.8909      0.8877
SD10                TS 1          0.9849     0.8214      0.8214   0.8214      0.8158
                    TS 2          0.9698     0.75        0.75     0.75        0.7422
                    TS 3          0.9849     0.875       0.875    0.875       0.8711


Similarity of outliers

Table 5 – Amount of outliers from original datasets in synthetic datasets.

Synthetic dataset   Time series   Isolation forest   Local outlier factor
SD1                 TS 1          1.50%              1.70%
                    TS 2          2.47%              1.60%
                    TS 3          0.94%              1.23%
SD2                 TS 1          3.80%              0.75%
                    TS 2          1.18%              1.50%
                    TS 3          1.50%              0.83%
SD3                 TS 1          0.90%              1.72%
                    TS 2          0.60%              1.50%
                    TS 3          0.90%              1.39%
SD4                 TS 1          3.20%              4.30%
                    TS 2          3.10%              3.01%
                    TS 3          3.30%              3.56%
SD5                 TS 1          0.60%              1.50%
                    TS 2          0.50%              0.96%
                    TS 3          0.61%              1.20%
SD6                 TS 1          4.10%              3.76%
                    TS 2          3.06%              3.01%
                    TS 3          3.08%              3.25%
SD7                 TS 1          1.80%              3.11%
                    TS 2          2.40%              3.01%
                    TS 3          1.50%              2.58%
SD8                 TS 1          4.20%              4.08%
                    TS 2          3.60%              3.55%
                    TS 3          3.30%              3.85%
SD9                 TS 1          1.82%              2.68%
                    TS 2          2.58%              2.90%
                    TS 3          0.90%              2.78%
SD10                TS 1          3.71%              4.84%
                    TS 2          3.50%              4.30%
                    TS 3          0.90%              4.20%
