OUTLIER DETECTION FOR OVERNIGHT INDEX SWAPS Master Thesis

Johnny Kuo

Master Thesis, 30 credits Department of Mathematics and Mathematical Statistics Spring Term 2020

Abstract

In this thesis, methods for outlier detection in time series data are investigated. Given data for overnight index swaps (SEK), synthetic data has been created with different types of anomalies. A comparison between the Isolation forest and Local outlier factor algorithms is made by measuring their respective performance on the synthetic datasets with respect to Accuracy, Precision, Recall, F-Measure and Matthews correlation coefficient.

Keywords: Outlier detection, Overnight index swaps, Machine learning, Isolation forest, Local outlier factor

Sammanfattning (Swedish abstract)

In this thesis, methods for anomaly detection in time series data are investigated. Given data for overnight index swaps (SEK), synthetic data has been created with different types of anomalies. A comparison between the Isolation forest and Local outlier factor algorithms is made by measuring their respective performance on the synthetic datasets with respect to Accuracy, Precision, Recall, F-measure and Matthews correlation coefficient.

Nyckelord: Outlier detection, Overnight index swaps, Machine learning, Isolation forest, Local outlier factor

Acknowledgement

I would like to acknowledge and express my gratitude for the support given by Fredrik Bohlin and Richard Henriksson from the department of Model Validation and Quantitative Analysis.

I would also like to acknowledge the support given by my supervisors within the department of Mathematics and Mathematical Statistics, Oleg Seleznjev and Leif Nilsson.

Finally, I would like to extend my gratitude to friends and family who have been by my side and supported me throughout the work.

Thank you!

Stockholm 2020-06-08

Johnny Kuo

List of Figures

1. The spectrum from normal data to outliers. Increasing outlierness score from left to right. Noise and anomalies can be considered as weak or strong outliers.
2. Visualization of the three time series used as a base when generating synthetic datasets. TS 1 is yields with 1 year to maturity, TS 2 is yields with 5 years to maturity and TS 3 is yields with 10 years to maturity.
3. Visualization of global and collective outliers inserted; red points are data points moved with 3 standard deviations for TS1.
4. Visualization of global and collective outliers inserted; red points are data points moved for TS1.
5. Visualization of global and collective outliers inserted; red points are data points moved for TS1.
6. Illustration of the workflow, from start to finish of the project.
7. Performance metrics for Isolation forest.
8. Performance metrics for Local outlier factor.
9. Average performance score for the algorithms.
10. The percentage of similarity between outliers detected in the dataset before and after generation of synthetic outliers.
11. The percentage of similarity between outliers detected in the dataset before and after generation of synthetic outliers.

List of Tables

1. Confusion matrix showing the possible combinations of correct and wrong classifications.
2. Summary of synthetic datasets.
3. Results of Isolation forest.
4. Results of Local outlier factor.
5. Amount of outliers from original datasets in synthetic datasets.

Contents

List of Figures

List of Tables

1 Introduction
  1.1 Svenska Handelsbanken
  1.2 Overnight index swaps
  1.3 Outliers
  1.4 Synthetic data
  1.5 Problem Statement
  1.6 Main objective
  1.7 Delimitation
  1.8 Outline

2 Theory
  2.1 Outlier
  2.2 Types of outliers
    2.2.1 Global outliers
    2.2.2 Contextual outliers
    2.2.3 Collective outliers
  2.3 Time series
  2.4 Machine learning
  2.5 Isolation forest
  2.6 Local outlier factor
  2.7 Metrics for model assessment
    2.7.1 Accuracy
    2.7.2 Precision
    2.7.3 Recall
    2.7.4 F-Measure
    2.7.5 Matthews Correlation Coefficient (MCC)

3 Objective of the project
  3.1 Main objective
  3.2 Delimitation

4 Methodology
  4.1 Programs
  4.2 Description of the data
  4.3 Synthetic datasets
    4.3.1 Synthetic datasets 1 and 2
    4.3.2 Synthetic datasets 3 and 4
    4.3.3 Synthetic datasets 5 and 6
    4.3.4 Synthetic datasets 7 and 8
    4.3.5 Synthetic datasets 9 and 10
    4.3.6 Summary of synthetic datasets
  4.4 Implementation of algorithms
  4.5 Model performance assessment

5 Results
  5.1 Isolation forest
  5.2 Local outlier factor
  5.3 Outliers from original datasets

6 Discussion
  6.1 Isolation forest and Local outlier factor
  6.2 Quality limitations of synthetic data
  6.3 Synthetic datasets

7 Conclusion
  7.1 Best model for anomaly detection
  7.2 Unsupervised anomaly detection

8 Suggestions for further studies

References

Appendices

Author: Johnny Kuo, Svenska Handelsbanken, July 7, 2020

1 Introduction

1.1 Svenska Handelsbanken

Svenska Handelsbanken is one of the oldest listed companies on the Swedish stock exchange. The bank was formed in 1871 with the goal of pursuing "true banking activities", with customers mainly in the Stockholm area. This became the base for the local banking spirit that the bank continues to build on today. Svenska Handelsbanken will be referred to as Handelsbanken in the remainder of the report (1).

Handelsbanken has been an international bank since the late 1980s. Local banking relationships were established across Scandinavia as the bank expanded to Norway, Finland and Denmark. From 2000 the bank further expanded throughout the UK and the Netherlands. In addition, Handelsbanken is also present in other markets to support customers from its home market of Sweden (1).

1.2 Overnight index swaps

An overnight index swap is an index swap, often used for hedging, in which a party exchanges a predetermined cash flow with a counterparty on a specified date. This financial instrument is a specialized type of fixed rate swap and can be set over different time spans; commonly, overnight index swaps run from three months to more than a year (2).

1.3 Outliers

Outliers, also known as abnormalities, discordants, deviants or anomalies in the machine learning and statistical literature, are observations that lie at an abnormal distance from the other values in the population (3). Hawkins (4) defined an outlier as follows:

“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”

In most cases when the generating process behaves in an unusual way, it results in the creation of outliers. Therefore outliers have the potential to provide meaningful insights about the process. Outlier detection is a broad field within statistics. Credit-card fraud is an example from the banking industry where outlier detection has been widely used: patterns in credit-card transaction data are hard to detect by human observation, and anomalies are found more efficiently by outlier detection algorithms (3).

1.4 Synthetic data

The usage of synthetic data has become increasingly important in many fields, including economics, urban planning, transportation planning, cyber security and weather forecasting. The usage of synthetic data can help the development of data analytics applications,

as well as performance testing of models and algorithms (5).

Synthetic data can be specified to meet certain conditions or specifications that cannot be found in the real data. In machine learning, synthetic data generation has been used increasingly, with benefits such as making datasets less expensive and more accessible for AI projects (6). Further benefits of synthetic data are that it can be designed to demonstrate certain key properties of the data, and that it gives a high degree of freedom for testing and training scenarios (7).

Even though the benefits of synthetic data are evident, challenges exist and should be taken into account when creating the data. In many cases, the process of generating synthetic data also requires some minimum of realistic data (5).

One difficulty with synthetic data is assessing the quality of the generated data. Depending on the complexity of the input data, the output data should be evaluated accordingly. If the original data is diverse, this should be taken into consideration in the assessment process. This entails that the people involved in generating the synthetic data must have knowledge about how the synthetic dataset will be used in further analysis (8). As of today, there exists no comprehensive framework for how to create good synthetic data (9).

1.5 Problem Statement

Overnight index swaps (OIS) are recorded daily, and detection of outliers is critical for better understanding the variation of the OIS. The records of OIS may fluctuate from one day to another, and it is essential to assess whether a value is a normal deviation or an extreme value relative to the rest. The goal of this project is therefore to investigate various outlier detection techniques and find the most suitable method. To assess the performance of the algorithms, synthetic data will be generated for evaluation purposes.

1.6 Main objective

The main objective of the project is to evaluate various outlier detection techniques and find the optimal method to detect outliers. To evaluate the effectiveness of Isolation forest and Local outlier factor, synthetic datasets will be created to evaluate the performance of the algorithms. The assessment of the performance will be done with the metrics Accuracy, Precision, Recall, F-Measure and MCC.

1.7 Delimitation

The scope of this project is to assess two different unsupervised outlier detection techniques with synthetic datasets. Real data will be used as a base for the synthetic datasets. Performance limitations will exist without comparison to non-synthetic data.


1.8 Outline

In Chapter 2, an introduction is made to outliers, time series, algorithms used for outlier detection and performance metrics.

The methodology of the project is described in Chapter 4, which presents the datasets involved, the process of generating synthetic datasets, the implementation of the algorithms and the performance assessments.

Finally, the results are given in Chapter 5, followed by discussion in Chapter 6 and conclusions from the project in Chapter 7. Suggestions for further studies are presented in Chapter 8.

2 Theory

This section covers theory related to the topic of outlier detection in this project. The chapter begins with theory regarding outliers, time series, outlier detection techniques and assessment metrics.

2.1 Outlier

In the literature, inliers refer to data points which are normal. In some contexts, such as fraud detection, the data points which do not correspond to the normal sequences of data points are called outliers. In the example of fraud detection, the event of fraud may reflect the actions of an individual in a particular sequence, and this specific sequence is relevant for finding anomalous events. Such anomalies are referred to as collective anomalies, considering that they are inferred collectively from a set or sequence of data points. These collective anomalies are often a result of unusual events that generate anomalous patterns of activity.

The output of an outlier algorithm can be one of two types (3):

• Outlier scores: outlier detection algorithms often output a score indicating the degree to which a data point should be regarded as an anomaly. The score enables ranking of data points in order of their tendency to be considered outliers.

• Binary labels: another output of outlier detection algorithms is a binary label which indicates whether a data point is an outlier or not. This is typically achieved by applying a threshold to the outlier scores, where the threshold is chosen based on the statistical distribution of the scores. Binary labeling holds less information than a scoring mechanism, but is needed for decision making in many applications.
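The thresholding step above can be sketched as follows. This is an illustrative sketch, not thesis code: the function name and the fixed threshold are invented for the example; in practice the threshold is derived from the score distribution.

```python
def scores_to_labels(scores, threshold):
    """Turn continuous outlier scores into binary labels (1 = outlier)."""
    return [1 if s > threshold else 0 for s in scores]

# In practice the threshold is chosen from the statistical distribution of
# the scores (e.g. a high percentile); here it is fixed for illustration.
labels = scores_to_labels([0.1, 0.2, 0.15, 0.9, 0.05], threshold=0.5)
```

Only the clearly separated score of 0.9 is flagged, illustrating how a binary label discards the ranking information that the raw scores carry.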

Since outliers are subjective to the context and dependent on the dataset, the distinction between noise and significant, interesting deviations is of interest.


Figure 1 – The spectrum from normal data to outliers. Increasing outlierness score from left to right. Noise and anomalies can be considered as weak or strong outliers.

Figure 1 illustrates that noise and anomalies can be classified as weak and strong outliers. Noise represents the boundary between normal data and true outliers. Noise is therefore usually modeled as a weak form of outliers, which do not meet the criteria for a point to be classified as sufficiently anomalous (3). This project will focus on how well the outlier detection methods perform at detecting strong outliers.

2.2 Types of outliers

The definition of an outlier is, as previously mentioned, subjective to the context, but there are three generally accepted categories into which outliers can be divided.

2.2.1 Global outliers

In a given dataset, a global outlier is a data object which deviates significantly from the rest of the data objects. Global outliers are also referred to as point anomalies. Global outlier detection is important in many scenarios, for example in trading transaction auditing systems, where transactions that deviate from the regulations are considered global outliers and should be investigated further (10). Global outliers are one of the most common definitions of outliers (11).

2.2.2 Contextual outliers

A contextual outlier exists if the value significantly deviates from the rest of the data in the same context. The same value may not be considered an outlier in a different context. This type of outlier is common in time series data considering that the context in time series is often temporal (11).

2.2.3 Collective outliers

In a given dataset, a subset of data objects forms a collective outlier if the objects as a group deviate significantly from the rest of the dataset. Notably, the individual data objects may not be outliers on their own, but their shared deviation as a group makes them an outlier (10).


2.3 Time series

Time series data are a set of values generated by continuous measurements over time. Values at consecutive time stamps typically do not change very significantly, so sudden changes in the underlying data can be considered anomalous events. Outlier detection is often related to anomalous event detection; the occurrences of such events are regularly contextual or collective outliers tied to certain time stamps (3).

2.4 Machine learning

Machine learning methods are efficient and widely used for outlier analysis. In anomaly detection there exist three broad categories (12): supervised anomaly detection, unsupervised anomaly detection and semi-supervised anomaly detection. Unsupervised anomaly detection is interesting since it has the ability to detect patterns with no existing labels and minimal human supervision (13). In this project, the following unsupervised anomaly detection techniques will be used:

• Isolation forest,

• Local outlier factor.

Isolation forest is fundamentally different from existing methods. The algorithm uses isolation as an effective and efficient way of detecting anomalies, compared to the more commonly used distance and density measures. Furthermore, the algorithm emphasizes low time complexity and a small memory requirement (14). Local outlier factor is a more typical outlier detection technique: it is a density-based method that uses the distances to the k-nearest neighbors to estimate local densities, which in turn are used to determine outliers (15). These methods will be used to investigate which of the techniques is suitable for performing anomaly detection on the OIS time series data.

2.5 Isolation forest

Isolation forest is an isolation-based method which measures individual instances' susceptibility to being isolated (16). Anomalies in this context are the data points which are few and different from the rest of the data, making them more susceptible to isolation. In a data-induced random tree, partitioning of instances is recursively repeated until all points are isolated. Isolation forest uses an anomaly score to classify each point in the dataset. Given a dataset of n instances, the average path length of an unsuccessful search in a binary search tree is defined as (16):

c(n) = 2H(n − 1) − 2(n − 1)/n,   (1)

where H(i) is the harmonic number, which can be estimated by ln(i) + 0.5772156649 (Euler's constant). Here c(n) is the average of h(x) given n, where h(x) is the path length of a point x, measured by the number of edges x traverses in an iTree from the root node until the traversal terminates at an external node. c(n) is used to normalize h(x). The anomaly score s of an instance x is defined as:


s(x, n) = 2^(−E(h(x))/c(n)),   (2)

where E(h(x)) is the average of h(x) over a collection of isolation trees. In Equation 2:

• When E(h(x)) → c(n), then s → 0.5;

• When E(h(x)) → 0 then s → 1;

• When E(h(x)) → n − 1, then s → 0,

where s is monotonically decreasing in E(h(x)). The anomaly score s enables the following assessments:

• if instances return s very close to 1, then they are definitely anomalies,

• if instances have s much smaller than 0.5, then they are quite safe to be regarded as normal instances, and

• if all instances return s ≈ 0.5, then the entire sample does not really have any distinct anomaly.
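Equations 1 and 2 can be computed directly. A minimal sketch, assuming Python rather than the thesis's R implementation; the function names are invented for the example:

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant

def c(n):
    """Average path length of an unsuccessful BST search for n instances (Eq. 1)."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # H(n-1) ~ ln(n-1) + gamma
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """Isolation forest anomaly score s(x, n) = 2^(-E(h(x))/c(n)) (Eq. 2)."""
    return 2.0 ** (-avg_path_length / c(n))

# A point whose average path length equals c(n) scores exactly 0.5,
# and a path length of 0 gives the maximal score 1.
s_mid = anomaly_score(c(256), 256)
s_max = anomaly_score(0.0, 256)
```

This reproduces the limiting behavior listed above: short average path lengths (easily isolated points) push the score toward 1, while path lengths near c(n) give a score of 0.5.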

2.6 Local outlier factor Local outlier factor is a density based outlier technique (17). The algorithm compares the local density of a point to the local densities of its neighbors. The local density is estimated by the distance a point has from its neighbors.

The algorithm emphasizes a local approach to outlier detection. This enables Local outlier factor to find outliers that would not be considered outliers relative to another area of the dataset.

The following steps are performed to calculate the Local outlier factor score, LOF (15). First the reachability distance (rd) and local reachability density (lrd) are defined. The reachability distance of A with respect to B is:

rd_k(A, B) = max(kd(B), dist(A, B)),   (3)

which is the true distance between A and B, but at least the k-distance of B, where kd(B) is the distance from B to its k-th nearest neighbor. The local reachability density of an object A is defined as:

lrd_k(A) = k / Σ_{B ∈ N_k(A)} rd_k(A, B),   (4)

where N_k(A) is the set of k-nearest neighbors of A. Finally the LOF is defined as

LOF_k(A) = ( Σ_{B ∈ N_k(A)} lrd(B) / lrd(A) ) / |N_k(A)|.   (5)

The LOF measures the density around A relative to the densities around its neighbors. The LOF score is interpreted as follows (17):

• LOF ≈ 1 means comparable density to its neighbors,

• LOF < 1 means higher density than neighbors (inlier),

• LOF > 1 means lower density than neighbors (outlier).
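Equations 3 to 5 can be implemented directly for small datasets. A naive sketch (O(n²) pairwise distances), assuming numpy rather than the thesis's Rlof implementation; the function name and toy data are invented for the example:

```python
import numpy as np

def lof_scores(X, k):
    """Naive LOF (Eqs. 3-5) for small datasets. X has shape (n, d)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    nbrs = np.argsort(D, axis=1)[:, 1:k + 1]       # k nearest neighbors (self excluded)
    k_dist = D[np.arange(n), nbrs[:, -1]]          # kd(B): distance to k-th neighbor
    # reachability distance rd_k(A, B) = max(kd(B), dist(A, B))   (Eq. 3)
    rd = np.maximum(k_dist[nbrs], D[np.arange(n)[:, None], nbrs])
    lrd = k / rd.sum(axis=1)                       # local reachability density (Eq. 4)
    return (lrd[nbrs] / lrd[:, None]).mean(axis=1)  # Eq. 5

# Four points in a tight line plus one isolated point:
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
scores = lof_scores(X, k=2)
```

The points inside the line score approximately 1 (comparable density to their neighbors), while the isolated point at 10 scores well above 1, matching the interpretation listed above.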

2.7 Metrics for model assessment The synthetic datasets are labeled with generated outliers. This makes it feasible to evaluate the performance of the methods. The metrics (18) used for evaluation of the performance are Accuracy, Precision, Recall, F-Measure, and MCC (19).

A confusion matrix (Table1) is also used to evaluate the classification performance of the methods (20).

Table 1 – Confusion matrix showing the possible combinations of correct and wrong classifications.

Actual classification   Predicted normal data                  Predicted outlier
Normal data             True non-match: True negative (TN)     False match: False positive (FP)
Outlier                 False non-match: False negative (FN)   True match: True positive (TP)

2.7.1 Accuracy

Accuracy is the total proportion of all correct predictions, which can be expressed as:

Accuracy = (TP + TN) / (TP + TN + FP + FN).   (6)

2.7.2 Precision

Precision is the percentage of the reported anomalies that are correctly identified, denoted by:

Precision = TP / (TP + FP),   (7)

and equals 1.0 if all the points identified by the algorithm are true outliers.


2.7.3 Recall

Recall is the percentage of the real anomalies which are detected, expressed by:

Recall = TP / (TP + FN).   (8)

2.7.4 F-Measure

F-Measure is the weighted harmonic mean of Precision and Recall, which can be given by:

F-Measure = 2TP / (2TP + FP + FN).   (9)

The F-Measure gives a measurement of a test's accuracy by using both Precision and Recall.

2.7.5 Matthews Correlation Coefficient (MCC)

The Matthews Correlation Coefficient (MCC) is used as a measure of the quality of binary classifications (19). The MCC summarizes the confusion matrix and is expressed by:

MCC = (TP · TN − FP · FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)).   (10)

The MCC has a range of -1 to 1, where wrong classifications are indicated by -1 and correct classifications are indicated by 1.
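All five metrics follow directly from the confusion-matrix counts. A minimal sketch in Python; the function name and the example counts are invented for illustration:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the five metrics (Eqs. 6-10) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / (
        ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure, "mcc": mcc}

# Hypothetical counts for a detector that found 8 of 16 true outliers
# among 200 points while raising 4 false alarms:
m = classification_metrics(tp=8, tn=180, fp=4, fn=8)
```

Note how Accuracy stays high (0.94) even though Recall is only 0.5, which is why the thesis also reports Precision, Recall, F-Measure and MCC for the heavily imbalanced outlier-detection setting.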

3 Objective of the project

3.1 Main objective

The main objective of the project is to evaluate various outlier detection techniques and find the best method to detect outliers. To evaluate the effectiveness of Isolation forest and Local outlier factor, synthetic datasets will be created to evaluate the performance of the algorithms. The assessment of the performance will be done with the metrics Accuracy, Precision, Recall, F-Measure and MCC.

3.2 Delimitation

The scope of this project is to assess two different unsupervised outlier detection techniques with synthetic datasets. Real data will be used as a base for the synthetic datasets. Performance limitations will exist without comparison to non-synthetic data.


4 Methodology

This section describes the programs used and the procedure for creating synthetic datasets with global and collective outliers. The implementations of the algorithms are described, and the assessment of the methods is done with the performance metrics Accuracy, Precision, Recall, F-Measure and MCC.

4.1 Programs

R is a programming language created for statistical computing, data processing and graphing (21). The implementations of Isolation forest and Local outlier factor are taken from the packages solitude (22) and Rlof (23). Excel was used for graphics.

4.2 Description of the data

The data used in the project is provided from Handelsbanken's database for overnight index swaps in SEK. The interest of Handelsbanken is to see how different outlier detection techniques perform with different kinds of outliers in the time series.


Figure 2 – Visualization of the three time series used as a base when generating synthetic datasets. TS 1 is yields with 1 year to maturity, TS 2 is yields with 5 years to maturity and TS 3 is yields with 10 years to maturity.

An overview of the three time series can be observed in Figure 2. Three different time series were selected: yields with 1 year to maturity (TS 1), yields with 5 years to maturity (TS 2) and yields with 10 years to maturity (TS 3). The yield data range from 2012 to late 2019.

4.3 Synthetic datasets

In order to evaluate the performance of the outlier detection algorithms, synthetic datasets were generated. This is needed, considering that for unsupervised

outlier detection there is no accessible way of measuring the results. The synthetic datasets provide labels for the data, so that performance metrics can be used. This section describes each synthetic dataset (SD) and the procedure for generating outliers. Different amounts of outliers were tested, and the final amount of outliers was set to 3.9% (18). The following procedures were applied to all three time series for every generated dataset.

4.3.1 Synthetic datasets 1 and 2

SD1 simulates global outliers and SD2 simulates collective outliers. The time series consist of seven periods, each period representing one year from 2012 to 2019. Global outliers were simulated by randomly selecting data points and moving them by 3 standard deviations (based on the local standard deviation of each year) (18). Each selected point was randomly moved either up or down by 3 standard deviations, with probability p = 1/2. When simulating collective outliers, eight random sequences of seven data points each were selected and moved by 3 standard deviations (based on the local standard deviation of each year). Eight sequences of seven data points gives 3.9% of the data points moved, which is the desired amount of outliers to be generated.
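The injection procedure can be sketched as follows. This is an illustrative sketch, not the thesis code: it uses Python with numpy rather than R, a fixed random seed, and the standard deviation of the whole series, whereas SD1 and SD2 use the per-year (local) standard deviation; the function names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_global_outliers(series, fraction=0.039, n_std=3):
    """Move a random fraction of points up or down by n_std standard deviations."""
    y = series.copy()
    n_out = max(1, int(round(fraction * len(y))))
    idx = rng.choice(len(y), size=n_out, replace=False)
    signs = rng.choice([-1.0, 1.0], size=n_out)   # up or down, p = 1/2
    y[idx] += signs * n_std * y.std()
    labels = np.zeros(len(y), dtype=int)
    labels[idx] = 1                               # ground-truth outlier labels
    return y, labels

def inject_collective_outliers(series, n_sequences=8, seq_len=7, n_std=3):
    """Move n_sequences random runs of seq_len consecutive points by n_std stds."""
    y = series.copy()
    labels = np.zeros(len(y), dtype=int)
    starts = rng.choice(len(y) - seq_len, size=n_sequences, replace=False)
    for s in starts:
        sign = rng.choice([-1.0, 1.0])            # the whole sequence moves together
        y[s:s + seq_len] += sign * n_std * y.std()
        labels[s:s + seq_len] = 1
    return y, labels

series = np.sin(np.linspace(0.0, 20.0, 1000))
y_glob, lab_glob = inject_global_outliers(series)
y_coll, lab_coll = inject_collective_outliers(series)
```

The labels returned alongside the perturbed series are what makes the supervised performance metrics of Section 2.7 applicable to an otherwise unsupervised problem. Note that in this sketch the random sequences may overlap, which the thesis does not specify.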


Figure 3 – Visualization of global and collective outliers inserted, red points are data points moved with 3 standard deviations for TS1.

An overview of the SD1 and SD2 simulations can be observed in Figure 3 for time series 1.

4.3.2 Synthetic datasets 3 and 4

SD3 simulates global outliers by randomly moving data points by either 1, 2 or 3 standard deviations (based on the local standard deviation of each year) with probability

p = 1/3. SD4 simulates collective outliers, in which eight random sequences of seven data points each were selected and moved up or down with probability p = 1/2. How far a point is moved is also randomized between 1, 2 and 3 standard deviations (based on the local standard deviation of each year) with probability p = 1/3.

Figure 4 – Visualization of global and collective outliers inserted, red points are data points moved for TS1.

An overview of the SD3 and SD4 simulations can be observed in Figure 4 for time series 1.


4.3.3 Synthetic datasets 5 and 6

The same procedure for generating outliers as in SD3 and SD4 was implemented for SD5 and SD6. In these simulations, however, the points were moved by the global standard deviation of each time series instead of the local one. SD5 simulates global outliers by randomly moving data points by either 1, 2 or 3 standard deviations (p = 1/3), up or down (p = 1/2). SD6 simulates collective outliers, in which eight random sequences of seven data points each were selected and moved up or down (p = 1/2). How far a point is moved is also randomized between 1, 2 and 3 standard deviations (p = 1/3).


Figure 5 – Visualization of global and collective outliers inserted, red points are data points moved for TS1.

An overview of the SD5 and SD6 simulations can be observed in Figure 5 for time series 1.

4.3.4 Synthetic datasets 7 and 8

SD7 and SD8 have random points moved by 2 standard deviations (based on the global standard deviation). SD7 simulates global outliers by randomly moving data points with


2 standard deviations, up or down (p = 1/2). SD8 simulates collective outliers, in which eight random sequences of seven data points each were selected and moved up or down (p = 1/2).

4.3.5 Synthetic datasets 9 and 10

SD9 and SD10 have random points moved by 3 standard deviations (based on the global standard deviation) (18). SD9 simulates global outliers by randomly moving data points by 3 standard deviations, up or down. SD10 simulates collective outliers, in which eight random sequences of seven data points each were selected and moved up or down (p = 1/2).

4.3.6 Summary of synthetic datasets

Table 2 shows a recap of the generated synthetic datasets. The type of outlier is either global or collective, and the movement of the values differs in each case.

Table 2 – Summary of synthetic datasets

Synthetic dataset   Type of outliers   Movement of points
SD1                 Global             3*std (local based)
SD2                 Collective         3*std (local based)
SD3                 Global             1*std, 2*std or 3*std (local based)
SD4                 Collective         1*std, 2*std or 3*std (local based)
SD5                 Global             1*std, 2*std or 3*std (global based)
SD6                 Collective         1*std, 2*std or 3*std (global based)
SD7                 Global             2*std (global based)
SD8                 Collective         2*std (global based)
SD9                 Global             3*std (global based)
SD10                Collective         3*std (global based)

4.4 Implementation of algorithms

Isolation forest and Local outlier factor create decision boundaries to separate normal data from outliers. Isolation forest calculates an anomaly score for each data point; a score above 0.5 is considered an outlier, while points scoring below 0.5 are considered normal. For Isolation forest, the parameter values used are those from the original publication, with the number of decision trees set to t = 100 (16).

The Local outlier factor algorithm uses k as the size of the neighborhood. The value of k is used to calculate the density score of each point in the data. Different values of k were tested, and k = 50 was finally chosen as it gave the best results according to MCC.
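As an illustration of this configuration, the following sketch uses scikit-learn's implementations with the same parameter choices (t = 100 trees, k = 50 neighbors); the thesis itself uses the R packages solitude and Rlof, and the toy data and the LOF cut-off of 1.5 are assumptions made only for this example. Note that scikit-learn's score_samples returns the negative of the paper's anomaly score.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Toy data: a tight cluster plus two obvious outliers (illustrative only).
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0.0, 0.1, 500), [2.0, -2.0]]).reshape(-1, 1)

# Isolation forest with t = 100 trees; the paper's score s is the negative
# of sklearn's score_samples, and s > 0.5 flags an outlier.
iforest = IsolationForest(n_estimators=100, random_state=0).fit(X)
if_scores = -iforest.score_samples(X)
iforest_outliers = if_scores > 0.5

# Local outlier factor with neighborhood size k = 50; sklearn stores -LOF,
# and LOF > 1 indicates lower density than the neighbors (outlier).
lof = LocalOutlierFactor(n_neighbors=50)
lof.fit(X)
lof_vals = -lof.negative_outlier_factor_
lof_outliers = lof_vals > 1.5  # cut-off is an assumption for this sketch
```

Both decision rules then reduce each point to a binary outlier label, which is what the confusion-matrix-based metrics of Section 4.5 consume.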


Figure 6 – Illustration of the workflow, from start to finish of the project.

The workflow of the project is illustrated in Figure 6. First, the data for the time series are extracted. In the second step, the synthetic datasets are generated with the extracted time series as a base, and labels are created for the anomalies in each dataset. In step three, the algorithms are applied to each dataset, and their performance is assessed in step four. In step five, discussion and conclusions from the project are drawn.

4.5 Model performance assessment

The performance of the algorithms is calculated with a confusion matrix. Further evaluation of the performance is done with the metrics Accuracy, Precision, Recall, F-measure and MCC.


5 Results

This section presents the results from the ten synthetic datasets. The similarity of outliers detected before and after simulation of the synthetic sets is also displayed.

5.1 Isolation forest

Figure 7 – Performance metrics for Isolation forest.

From Figure 7, we can observe that the overall Accuracy is very high. The Accuracy in all cases is above 96% (see Table 3).

The four datasets SD1, SD2, SD3 and SD4 are all based on local standard deviations. Comparing them, Isolation forest is observed to have the best performance on SD4, with Precision above 0.65 and MCC above 0.5.

SD5 and SD6 are observed to show a difference in performance, with the algorithm performing better for SD6. This could imply that collective outliers are easier to detect.

The last four datasets SD7, SD8, SD9 and SD10 are all based on the global standard deviation for the outlier generation. The algorithm performed best at detecting anomalies in these cases, which was expected considering that they contain the most distinctive generated outliers. Of these four sets, the algorithm performed worse for SD7 and SD8 compared to SD9 and SD10, on which the algorithm showed its best performance.


5.2 Local outlier factor

Figure 8 – Performance metrics for Local outlier factor.

From Figure 8, we can observe that Accuracy is high for all the datasets, with Accuracy above 95% (Table 4).

SD1, SD2, SD3 and SD4 are all based on local standard deviations. The performance of the algorithm was best on SD4 among them.

The algorithm is observed to show slight difference in performance between SD5 and SD6, with the algorithm performing better for SD6. This could imply that the collective outliers are easier to detect.

In the last four datasets SD7, SD8, SD9 and SD10, the global standard deviation is used for the outlier generation. The best performances are for SD9 and SD10.


Figure 9 – Average performance score for the algorithms.

Figure 9 shows the average score of each performance metric. The difference between the algorithms is marginal.

5.3 Outliers from original datasets

The outliers detected by the algorithms before the simulation of the synthetic datasets are also of interest, since they indicate how much the synthetic outliers differ from the original outliers in the datasets, which could affect the performance of the algorithms.


Figure 10 – The percentage of similarity between outliers detected in dataset before and after generation of synthetic outliers.

From Figure 10, we can observe that the similarity of outliers detected in each time series before and after the simulation of synthetic outliers is below 4.5% in all cases. This could imply that the simulated outliers were significantly different from the original outliers, which should therefore have little impact on the results for the synthetic outliers.

Figure 11 – The percentage of similarity between outliers detected in dataset before and after generation of synthetic outliers.

From Figure 11, we can observe that the share of outliers detected before the simulation of synthetic outliers was below 5% in all cases. The initial outliers in the datasets should therefore not affect the performance of the experiment in a considerable way.

6 Discussion

In this section, the synthetic datasets are discussed, followed by an assessment of the performance of the algorithms.

6.1 Isolation forest and Local outlier factor

Comparing Isolation forest and Local outlier factor, the overall performance is slightly higher for Isolation forest.

Accuracy gives the ratio of correctly predicted observations to the total number of observations. It is high for both algorithms, which is always desirable.

Precision gives the ratio of correct positive predictions to the total number of predicted positives; in other words, the share of predicted outliers that are true outliers. Isolation forest showed Precision above 0.60 for 16 out of 30 synthetic datasets, whereas Local outlier factor did so for 9 out of 30.

Recall gives the ratio of correct positive predictions to all true outliers; it tells us how many of the true outliers the algorithms actually found. Here Local outlier factor performed slightly better than Isolation forest in most cases.

The F-Measure is the harmonic mean of Precision and Recall, which means it takes both FP and FN into account. When both FN and FP are costly, the F-Measure is an important metric to look at. The average performance of the two algorithms was equally good.

The MCC score is argued to have an advantage over Accuracy and F-Measure, because MCC balances all four categories of the confusion matrix (TP, TN, FN and FP). A value of 1 represents a perfect prediction, 0 a prediction no better than random, and -1 total disagreement between predictions and observations. The average MCC score was marginally better for Isolation forest.

As expected, performance improved the more distinctive the generated outliers were. Looking at the average scores for each performance metric, the algorithms give very similar results; there is thus no clear winner between them.


6.2 Quality limitations of synthetic data

Synthetic data generation has many advantages and is useful for testing the performance and robustness of algorithms. However, it comes at the cost of quality limitations. One pitfall is that the generation process influences the properties of the simulated anomalies. Since no established framework exists for how to create good synthetic data, datasets need to be created through numerous trials to mimic the nature of real anomalies. Simulating outliers that differ too much from the data makes detection easier for the algorithms, but at the cost of being unrealistic in a real environment. The amount of anomalies in a dataset is another parameter for adjusting how difficult the points are for the algorithms to detect.

When creating synthetic data the way it was done in this project, it is also important to consider the amount of outliers detected by the algorithms before the simulation of synthetic outliers. The results showed that at most 5% of the simulated points coincided with initial outliers. This indicates that the simulated points were significantly different, which is desirable in this study. If the simulated outliers had high similarity with the initial outliers, the performance metrics would not be as reliable.
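This check reduces to a small set operation. The thesis does not spell out the exact similarity formula behind Figures 10 and 11, so the function below is one hypothetical reading: the share of points flagged after the injection that were already flagged before it.

```python
def outlier_overlap(before_idx, after_idx):
    # Share (in %) of outliers flagged after the simulation that were
    # already flagged before it. A hypothetical reading of the
    # "similarity" reported in Figures 10-11, not the thesis's formula.
    before, after = set(before_idx), set(after_idx)
    return 100.0 * len(before & after) / len(after) if after else 0.0
```

For instance, if one of four post-injection detections was already an outlier in the original series, the overlap is 25%, comfortably below the 5% observed in the experiment only when the flagged sets are nearly disjoint.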

6.3 Synthetic datasets

Outlier detection is a wide field within statistics, and effective outlier detection methods are applied in many real-life environments. The difficulty with outlier detection is the rarity of outliers, which makes them hard to predict, while collecting extensive real data is generally expensive and time-consuming. A better understanding of how to simulate anomalies well would thus provide datasets for algorithm assessment at lower cost and in less time. The main challenge in this project was to simulate realistic outliers. There are many ways of creating outliers, and the project could easily be extended by testing more combinations of outlier-generating techniques.

7 Conclusion

In this section the conclusions of the project are presented.

7.1 Best model for anomaly detection

There is no clear winner in performance in this experiment. The average performance metrics of the algorithms differ only marginally.

7.2 Unsupervised anomaly detection

The project used two unsupervised anomaly detection methods on synthetic data. With labelled synthetic data, the evaluation of the methods' performance was feasible; without the labels, the results would be rather subjective and difficult to evaluate. Considering the results, Isolation forest shows promise for anomaly detection in time series of overnight index swaps. However, there are quality limitations in using synthetic datasets, which should also be taken into consideration when interpreting the performance results.

8 Suggestions for further studies

The project tested Isolation forest, which is an ensemble-based technique, and Local outlier factor, which is a density-based technique. There are many other anomaly detection techniques that could be interesting to evaluate.
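The core idea behind the ensemble-based Isolation forest [16] — anomalies are isolated after fewer random splits than normal points — can be sketched in a few lines. This is a simplified one-dimensional illustration, not the `solitude` implementation [22] used in the project:

```python
import math
import random

def c(n):
    # Average path length of an unsuccessful BST search; used to
    # normalise the depth of unresolved leaves (Liu et al., 2008).
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def path_length(x, data, rng, depth=0, limit=8):
    # Recursively split the data at a uniformly random value and
    # follow the side containing x, until x is isolated or the
    # depth limit is reached.
    if len(data) <= 1 or depth >= limit:
        return depth + c(len(data))
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth + c(len(data))
    split = rng.uniform(lo, hi)
    side = ([v for v in data if v < split] if x < split
            else [v for v in data if v >= split])
    return path_length(x, side, rng, depth + 1, limit)

def avg_depth(x, data, n_trees=200, seed=0):
    # Average isolation depth over many random trees; shorter
    # average depth means "more anomalous".
    rng = random.Random(seed)
    return sum(path_length(x, data, rng) for _ in range(n_trees)) / n_trees
```

On a toy series such as five clustered yields and one distant value, the distant value is isolated after roughly one split on average, while the clustered points need several, which is exactly the signal the score of the full algorithm is built from.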

The settings for Isolation forest showed good results; for further optimization of the predictions, additional parameter tuning could be investigated. For Local outlier factor, the choice of k is essential for the performance of the algorithm. During the project, k was selected by testing for the best MCC. Selecting k based on other metrics could yield better results.
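The k-selection procedure described above can be sketched as a small search loop. For brevity, the sketch scores points by the distance to their k-th nearest neighbour — a crude density proxy standing in for the actual LOF score [15] — and keeps the k that maximises MCC against the known synthetic labels:

```python
import math

def kth_nn_distance(data, k):
    # Distance to the k-th nearest neighbour: a stand-in outlier
    # score for this sketch, not the full LOF computation.
    scores = []
    for i, x in enumerate(data):
        dists = sorted(abs(x - y) for j, y in enumerate(data) if j != i)
        scores.append(dists[k - 1])
    return scores

def mcc(tp, fp, fn, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def select_k(data, labels, ks, n_outliers):
    # Flag the n_outliers highest-scoring points for each candidate k,
    # then keep the k with the best MCC against the synthetic labels.
    best_k, best_mcc = None, -2.0
    for k in ks:
        scores = kth_nn_distance(data, k)
        cut = sorted(scores, reverse=True)[n_outliers - 1]
        pred = [1 if s >= cut else 0 for s in scores]
        tp = sum(p == 1 and l == 1 for p, l in zip(pred, labels))
        fp = sum(p == 1 and l == 0 for p, l in zip(pred, labels))
        fn = sum(p == 0 and l == 1 for p, l in zip(pred, labels))
        tn = sum(p == 0 and l == 0 for p, l in zip(pred, labels))
        m = mcc(tp, fp, fn, tn)
        if m > best_mcc:
            best_k, best_mcc = k, m
    return best_k, best_mcc
```

Swapping the selection criterion (MCC) for Recall or F-Measure, as suggested above, only changes the scoring line inside the loop.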

The creation of synthetic data is rather challenging, and the variations of outlier creation are extensive. To develop the synthetic data creation further, more techniques and parameter variations could be explored. For instance, generating outliers from different kinds of distributions could be interesting.

The amount of outliers is another parameter that could be changed to further test the algorithms. The proportion of outliers generated in this project, 3.9%, and the shift of 3 standard deviations were inspired by previous work on simulation of synthetic datasets. It could be interesting to examine even smaller shifts than those implemented here, to see how small a value change can be and still be flagged as an outlier by the algorithms.
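As a hedged illustration of this kind of generation step — not the thesis's actual R code — injecting a fixed fraction of outliers by shifting randomly chosen points by a multiple of a local standard deviation could look like the following (the `window` parameter and the symmetric ± shift are assumptions of this sketch):

```python
import random
import statistics

def inject_outliers(series, frac=0.039, n_std=3.0, window=20, seed=1):
    # Shift a random fraction `frac` of points by +/- n_std local
    # (rolling-window) standard deviations. Returns the modified
    # series and the sorted indices of the injected outliers.
    rng = random.Random(seed)
    out = list(series)
    n = max(1, round(frac * len(series)))
    idx = rng.sample(range(len(series)), n)
    for i in idx:
        lo, hi = max(0, i - window), min(len(series), i + window)
        sd = statistics.pstdev(series[lo:hi])
        out[i] += rng.choice((-1.0, 1.0)) * n_std * sd
    return out, sorted(idx)
```

Lowering `n_std` below 3 directly implements the suggestion above of testing how small a shift the algorithms can still detect.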


References

[1] Handelsbanken, “Om handelsbanken,” 2020. https://www.handelsbanken.se/sv/om- oss (Accessed: 2020-05-06).

[2] J. Chen, “Overnight index swap definition,” 2019. https://www.investopedia.com/terms/o/overnightindexswap.asp (Accessed: 2020-05-16).

[3] C. C. Aggarwal, Outlier Analysis, second edition, 2016.

[4] D. M. Hawkins, Identification of outliers, vol. 11. Springer, 1980.

[5] D. Libes, D. Lechevalier, and S. Jain, “Issues in synthetic data generation for advanced manufacturing,” in 2017 IEEE International Conference on Big Data (Big Data), pp. 1746–1754, IEEE, 2017.

[6] A. Gonfalonieri, “Do you need synthetic data for your ai project?,” 2019. https://towardsdatascience.com/do-you-need-synthetic-data-for-your-ai-project-e7ecc2072d6b (Accessed: 2020-06-01).

[7] E. L. Barse, H. Kvarnstrom, and E. Jonsson, “Synthesizing test data for fraud detection systems,” in 19th Annual Computer Security Applications Conference, 2003. Proceedings., pp. 384–394, IEEE, 2003.

[8] R. Röhm, “The prospects and limitations of synthetic data,” 2019. https://www.linkedin.com/pulse/prospects-limitations-synthetic-data-robin-r%C3%B6hm/ (Accessed: 2020-06-02).

[9] C. Joshi, “Generative adversarial networks (gans) for synthetic dataset generation with binary classes,” 2019. https://datasciencecampus.ons.gov.uk/projects/generative-adversarial-networks-gans-for-synthetic-dataset-generation-with-binary-classes/ (Accessed: 2020-06-03).

[10] J. Han, J. Pei, and M. Kamber, Data mining: concepts and techniques. Elsevier, 2011.

[11] I. Cohen, “A quick guide to the different types of outliers,” 2018. https://www.anodot.com/blog/quick-guide-different-types-outliers/ (Accessed: 2020-05-10).

[12] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM computing surveys (CSUR), vol. 41, no. 3, pp. 1–58, 2009.


[13] G. E. Hinton, T. J. Sejnowski, T. A. Poggio, et al., Unsupervised learning: foundations of neural computation. MIT press, 1999.

[14] A. C. Bahnsen, “Benefits of anomaly detection using isolation forests,” 2016. https://blog.easysol.net/using-isolation-forests-anamoly-detection/ (Accessed: 2020-05-20).

[15] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “Lof: identifying density-based local outliers,” in Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pp. 93–104, 2000.

[16] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422, IEEE, 2008.

[17] M. Hahsler, “Local outlier factor score,” 2020. https://www.rdocumentation.org/packages/dbscan/versions/1.1-5/topics/lof (Accessed: 2020-05-10).

[18] Z. Cheng, C. Zou, and J. Dong, “Outlier detection using isolation forest and local outlier factor,” in Proceedings of the Conference on Research in Adaptive and Convergent Systems, pp. 161–168, 2019.

[19] D. Lettier, “You need to know about the matthews correlation coefficient,” 2017. https://lettier.github.io/posts/2016-08-05-matthews-correlation-coefficient.html (Accessed: 2020-05-01).

[20] S. Narkhede, “Understanding confusion matrix,” 2018. https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62 (Accessed: 2020-04-13).

[21] R Core Team, “The r project for statistical computing,” 2020. https://www.r-project.org/ (Accessed: 2020-05-06).

[22] K. Srikanth, “An implementation of isolation forest,” 2019. https://www.rdocumentation.org/packages/solitude/versions/0.2.1 (Accessed: 2020-05-06).

[23] Y. Hu, W. Murray, and Y. Shan, “Rlof: R parallel implementation of local outlier factor (lof),” 2015. https://cran.r-project.org/web/packages/Rlof/index.html (Accessed: 2020-04-20).


Appendices

Performance metrics of Isolation forest

Table 3 – Results of Isolation forest.

Synthetic dataset   Time series   Accuracy   Precision   Recall    F-Measure    MCC
SD1                 TS 1          0.9639     0.3833      0.4339    0.407        0.3893
                    TS 2          0.9682     0.4166      0.283     0.337        0.3277
                    TS 3          0.9693     0.481415    0.2363    0.31070732   0.32377
SD2                 TS 1          0.9725     0.2745      0.5357    0.3658       0.3577
                    TS 2          0.98       0.3783      0.5       0.4307       0.425
                    TS 3          0.9768     0.2413      0.25      0.2456       0.2339
SD3                 TS 1          0.9591     0.3018      0.2909    0.2962       0.2753
                    TS 2          0.9521     0.2631      0.17785   0.2127       0.1969
                    TS 3          0.9618     0.2857      0.1428    0.19047      0.1849
SD4                 TS 1          0.98       0.7272      0.5714    0.63         0.635
                    TS 2          0.9784     0.7096      0.3928    0.5057       0.5177
                    TS 3          0.9774     0.7666      0.4107    0.5348       0.5518
SD5                 TS 1          0.9704     0.2666      0.2142    0.2376       0.2179
                    TS 2          0.9644     0.3611      0.2321    0.2826       0.2721
                    TS 3          0.9714     0.4375      0.25      0.3181       0.3154
SD6                 TS 1          0.9704     0.875       0.5       0.6363       0.6541
                    TS 2          0.9644     0.7187      0.5       0.5227       0.5331
                    TS 3          0.9714     0.6875      0.5       0.5          0.5089
SD7                 TS 1          0.97       0.47        0.4181    0.4523       0.4647
                    TS 2          0.972      0.655       0.425     0.45         0.464
                    TS 3          0.9752     0.6551      0.45      0.4523       0.4647
SD8                 TS 1          0.965      0.614       0.625     0.6194       0.6075
                    TS 2          0.9731     0.5833      0.625     0.56         0.54
                    TS 3          0.9757     0.6086      0.625     0.549        0.5391
SD9                 TS 1          0.9913     0.9347      0.7678    0.8431       0.843
                    TS 2          0.98       0.8367      0.7321    0.7809       0.7764
                    TS 3          0.99       0.9166      0.7857    0.8461       0.844
SD10                TS 1          0.9854     0.90625     0.5178    0.659        0.67
                    TS 2          0.9887     0.8285      0.5178    0.6373       0.6373
                    TS 3          0.9866     0.9487      0.6607    0.7789       0.7866


Performance metrics of Local outlier factor

Table 4 – Results of Local outlier factor.

Synthetic dataset   Time series   Accuracy   Precision   Recall   F-Measure   MCC
SD1                 TS 1          0.9564     0.321428    0.3272   0.3243      0.3035
                    TS 2          0.9564     0.2142      0.2181   0.2162      0.1921
                    TS 3          0.9564     0.2678571   0.2727   0.27027     0.2478
SD2                 TS 1          0.9634     0.1428      0.2857   0.1904      0.1849
                    TS 2          0.9623     0.125       0.25     0.1666      0.159
                    TS 3          0.9623     0.125       0.25     0.1666      0.159
SD3                 TS 1          0.9628     0.375       0.3818   0.3783      0.3592
                    TS 2          0.9521     0.16        0.16     0.16        0.13
                    TS 3          0.9553     0.1964      0.2      0.1981      0.1735
SD4                 TS 1          0.9779     0.5         0.509    0.4045      0.4892
                    TS 2          0.9671     0.4464      0.4545   0.4545      0.4335
                    TS 3          0.9661     0.4285      0.4363   0.4324      0.4149
SD5                 TS 1          0.9655     0.625       0.648    0.6363      0.6253
                    TS 2          0.9623     0.5         0.5185   0.509       0.4942
                    TS 3          0.9634     0.4464      0.4629   0.4545      0.4379
SD6                 TS 1          0.9806     0.6785      0.7307   0.7037      0.6953
                    TS 2          0.986      0.5535      0.5961   0.574       0.5617
                    TS 3          0.9634     0.5357      0.5769   0.5555      0.5426
SD7                 TS 1          0.9741     0.5535      0.574    0.5636      0.5504
                    TS 2          0.9618     0.481       0.5      0.4909      0.4754
                    TS 3          0.9698     0.4821      0.5      0.4909      0.4754
SD8                 TS 1          0.9752     0.6071      0.6538   0.6296      0.619
                    TS 2          0.972      0.5535      0.5961   0.574       0.5617
                    TS 3          0.9833     0.5         0.549    0.5233      0.5098
SD9                 TS 1          0.9838     0.7442      0.7407   0.7272      0.719
                    TS 2          0.9784     0.625       0.648    0.6363      0.6253
                    TS 3          0.9935     0.875       0.9074   0.8909      0.8877
SD10                TS 1          0.9849     0.8214      0.8214   0.8214      0.8158
                    TS 2          0.9698     0.75        0.75     0.75        0.7422
                    TS 3          0.9849     0.875       0.875    0.875       0.8711


Similarity of outliers

Table 5 – Amount of outliers from original datasets in synthetic datasets.

Synthetic dataset   Time series   Isolation forest   Local outlier factor
SD1                 TS 1          1.50%              1.70%
                    TS 2          2.47%              1.60%
                    TS 3          0.94%              1.23%
SD2                 TS 1          3.80%              0.75%
                    TS 2          1.18%              1.50%
                    TS 3          1.50%              0.83%
SD3                 TS 1          0.90%              1.72%
                    TS 2          0.60%              1.50%
                    TS 3          0.90%              1.39%
SD4                 TS 1          3.20%              4.30%
                    TS 2          3.10%              3.01%
                    TS 3          3.30%              3.56%
SD5                 TS 1          0.60%              1.50%
                    TS 2          0.50%              0.96%
                    TS 3          0.61%              1.20%
SD6                 TS 1          4.10%              3.76%
                    TS 2          3.06%              3.01%
                    TS 3          3.08%              3.25%
SD7                 TS 1          1.80%              3.11%
                    TS 2          2.40%              3.01%
                    TS 3          1.50%              2.58%
SD8                 TS 1          4.20%              4.08%
                    TS 2          3.60%              3.55%
                    TS 3          3.30%              3.85%
SD9                 TS 1          1.82%              2.68%
                    TS 2          2.58%              2.90%
                    TS 3          0.90%              2.78%
SD10                TS 1          3.71%              4.84%
                    TS 2          3.50%              4.30%
                    TS 3          0.90%              4.20%
