
IMPUTATION AND GENERATION OF MULTIDIMENSIONAL MARKET DATA

Master Thesis

Tobias Wall & Jacob Titus

Master thesis, 30 credits. Department of Mathematics and Mathematical Statistics. Spring Term 2021

Imputation and Generation of Multidimensional Market Data
Tobias Wall†, [email protected]
Jacob Titus†, [email protected]

Copyright © by Tobias Wall and Jacob Titus, 2021. All rights reserved.

Supervisors: Jonas Nylén, Nasdaq Inc.; Armin Eftekhari, Umeå University

Examiner: Jianfeng Wang, Umeå University

Master of Science Thesis in Industrial Engineering and Management, 30 ECTS Department of Mathematics and Mathematical Statistics Umeå University SE-901 87 Umeå, Sweden

†Equal contribution. The order of the contributors' names was chosen based on a bootstrapping procedure in which the names were drawn 100 times.

Abstract

Market risk is one of the most prevailing risks to which financial institutions are exposed. The most popular approach in quantifying market risk is through Value at Risk. Organisations and regulators often require a long historical horizon of the affecting financial variables to estimate the risk exposures. A long horizon stresses the completeness of the available data; something risk applications need to handle.

The goal of this thesis is to evaluate and propose methods to impute financial time series. The performance of the methods is measured with respect to both price and risk metric replication. Two different use cases are evaluated: missing values randomly placed in the time series, and consecutive missing values at the end-point of a time series. In total, five models are applied to each use case.

For the first use case, the results show that all models perform better than the naive approach. The Lasso model lowered the price replication error by 35% compared to the naive model. The result from use case two is ambiguous. Still, we can conclude that all models performed better than the naive model with respect to risk metric replication. In general, all models systematically underestimated the downstream risk metrics, implying that they failed to replicate the fat-tailed property of the price movements.

Keywords: Time Series Imputation, Financial Time Series, Machine Learning, Deep Learning, Value at Risk, Expected Shortfall

Sammanfattning

Market risk is one of the most significant risks to which financial institutions are exposed. The most popular way to quantify market risk is through Value at Risk. Organisations and regulators often require a long historical horizon for the relevant market variables in these calculations. A long horizon increases the risk of incompleteness in the available data, something risk applications need to handle.

The goal of this thesis is to evaluate and propose methods to impute financial time series. The performance of the methods is measured with respect to both price and risk metric replication. Two different scenarios are evaluated: values missing at random in the time series, and consecutive missing values at the end of a time series. In total, five models are applied to each scenario.

For the first scenario, the results show that all models perform better than the naive approach. The Lasso model lowered the price replication error by 35% compared to the naive model. The result from the second scenario is ambiguous. Still, we can conclude that all models performed better than the naive model with respect to risk metric replication. In general, all models systematically underestimated the risk metrics, which suggests that they failed to replicate the fat-tailed property of the distribution of price movements.

Keywords: Time Series Imputation, Financial Time Series, Machine Learning, Deep Learning, Value at Risk, Expected Shortfall

Acknowledgement

We would like to extend our gratitude to Jonas Nylén, Anders Stäring, Markus Nyberg, and Oskar Janson at Nasdaq Inc., who have given us the opportunity to do this thesis work and provided supervision and support throughout the entire project.

We would also like to thank our supervisor at the Department of Mathemat- ics and Mathematical Statistics, Assistant Professor Armin Eftekhari, for guidance and valuable advice during the project.

Finally, we would like to thank our families and friends for their support and words of encouragement throughout our time at Umeå University, which comes to an end with the completion of this thesis.

Tobias Wall Jacob Titus Umeå, May 26, 2021

Contents

1 Introduction
  1.1 Problem Definition
  1.2 Dataset

2 Background
  2.1 Market Risk
    2.1.1 Value at Risk
    2.1.2 Expected Shortfall
  2.2 Financial Variables
    2.2.1 Futures
    2.2.2 Discount Rates
    2.2.3 Foreign Exchange Rates
    2.2.4 Options
    2.2.5 Volatility
  2.3 Financial Time Series
    2.3.1 Stylised Facts
  2.4 Related Work
    2.4.1 Autoregressive Models
    2.4.2 State-Space Models
    2.4.3 Expectation Maximisation
    2.4.4 Key Points

3 Theory
  3.1 Nearest Neighbour Imputation
  3.2 Linear Interpolation
  3.3 Lasso
  3.4 Random Forest
  3.5 Bayesian Inference
    3.5.1 Bayes' Rule
    3.5.2 Multivariate Normal Distribution
    3.5.3 Conditional Distribution
    3.5.4 Bayesian Linear Regression
    3.5.5 Feature Space Projection
    3.5.6 The Kernel Trick
  3.6 Gaussian Processes
    3.6.1 Choice of Covariance Function
    3.6.2 Optimising the Hyperparameters
  3.7 Artificial Neural Networks
    3.7.1 Multilayer Perceptron
    3.7.2 Training Neural Networks
  3.8 Recurrent Neural Networks
    3.8.1 Long-Short Term Memory
  3.9 Convolutional Neural Networks
  3.10 WaveNet
  3.11 Batch Normalisation

4 Method
  4.1 Notation
  4.2 Problem Framing
    4.2.1 Use Case One
    4.2.2 Use Case Two
  4.3 Dataset
  4.4 Data Preparation
    4.4.1 Handling of Missing Values
    4.4.2 Converting to Prices
    4.4.3 Training and Test Split
    4.4.4 Sliding Windows and Forward Validation
  4.5 Data Post-Processing
  4.6 Experiment Design
  4.7 Evaluation
    4.7.1 Mean Absolute Scaled Error
    4.7.2 Relative Deviation of VaR
    4.7.3 Relative Deviation of ES
  4.8 Models
    4.8.1 Nearest Neighbour Imputation
    4.8.2 Linear Interpolation
    4.8.3 Lasso
    4.8.4 Random Forest
    4.8.5 Gaussian Process
    4.8.6 LSTM
    4.8.7 WaveNet
    4.8.8 SeriesNet

5 Results
  5.1 Use Case One
  5.2 Use Case Two

6 Discussion and Reflection
  6.1 Risk Underestimation
  6.2 Time Component
  6.3 Fallback Logic
  6.4 Error Measures
  6.5 Complexity
  6.6 Use Case Framing
  6.7 Excluded Models
  6.8 Improvements and Extensions

7 Conclusion

Appendices
  Appendix A Removed Holidays
  Appendix B Dataset
  Appendix C Stylised Facts
  Appendix D Example of a WaveNet-architecture
  Appendix E Explanatory Data Analysis
  Appendix F Asset Class Results Use Case One
  Appendix G Asset Class Results Use Case Two
  Appendix H Example of Imputation

Chapter 1 Introduction

Market risk is one of the most prevailing risks to which financial institutions are subjected. It is the potential loss that investments incur due to uncertainties in market variables [24]. Risk management is about identifying, quantifying, and analysing these risks to decide whether market risk exposures should be avoided, accepted, or hedged. The most common approach to quantifying market risk is to look at how the relevant market variables, e.g. prices, have moved historically and use that knowledge to conclude how large losses could become in the future.

Value at Risk, henceforth VaR, is one of the most widely used market risk metrics. There are several ways to calculate VaR, but we will focus on a non-parametric approach using historical simulations of observed market data. VaR aims to make the following statement about an investment: "We are X percent certain that we will not lose more than V dollars in time T." Suppose we would like to calculate the 1-day 99% VaR of a USD 1 000 000 investment in the American stock index S&P500¹, using seven years of historical prices from 2014 to the end of 2020. We start by computing the daily price returns over the given period, find the return at the 1st percentile, and multiply that return by the current value of the investment. This yields a 1-day 99% VaR of USD 32 677. But what if the price series were incomplete over the specific period, with several days of missing price data?
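The calculation described above can be sketched in a few lines. The sketch below is illustrative only: the function name is our own, and the price series is synthetic rather than the actual S&P500 data used in the thesis.

```python
import numpy as np

def historical_var(prices, investment, alpha=0.99):
    """1-day VaR by historical simulation: the loss implied by the
    return at the (1 - alpha) percentile, scaled by the investment."""
    prices = np.asarray(prices, dtype=float)
    returns = np.diff(prices) / prices[:-1]            # daily price returns
    worst = np.percentile(returns, (1 - alpha) * 100)  # e.g. the 1st percentile
    return -worst * investment                         # positive number = loss

# Synthetic random-walk prices standing in for ~7 years of daily data.
rng = np.random.default_rng(0)
prices = 100 * np.cumprod(1 + rng.normal(0.0003, 0.01, 1750))
var_1d = historical_var(prices, 1_000_000)             # a loss in USD
```

The same routine applied to the incomplete five-year window would reproduce the discrepancy discussed next.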

Assume our dataset lacked the desired long-term data and that only five years of history were available, i.e., from the beginning of 2016, as illustrated by the dashed line in Figure 1.1. Calculating the 1-day 99% VaR from 2016 onwards results in a value of USD 35 675, which is 9.17% higher than for the complete dataset. This discrepancy is intuitive when analysing the price and logarithmic return processes presented in Figure 1.1. The period from February 19th to March 23rd of 2020 was turbulent in many ways, but mainly, it was the start of the COVID-19 pandemic. The financial markets fell, with S&P500 dropping 34% and the Swedish stock index OMX30 dropping 31%, leaving no markets unaffected. The period contains two "Black Mondays", the 9th and 16th of March, where markets fell 8% and 13% respectively, and one "Black Thursday" on the 12th of March, where markets fell 10%. This stressed market period has a great impact on the VaR metric: leaving a period of normal market conditions out of the calculation will increase VaR due to the enlarged weighted contribution of the stressed period. As of 2021, 18 of the companies included in the S&P500 index were not founded before 2016 [44] and were thereby not publicly listed. They would all lack the desired long-term market data specified for our VaR calculation.

Figure 1.1: S&P 500's price process (a) and corresponding one-day logarithmic return (b) from January 2nd, 2014 to December 31st, 2020. The black dashed line marks January 1st, 2016.

¹S&P500 is a stock index containing 500 large companies listed on the Nasdaq Stock Exchange and New York Stock Exchange that represent American industry.

The absence of long-term market data is one frequent issue that needs to be dealt with when assessing market risk metrics from historical simulations. Another common situation when dealing with multiple instruments is a sparse dataset where single or a few consecutive data points are missing. This could happen due to operational failure at the market, caused by broken sensors, bugs, or data collection failure, but the main cause is varying business days between different markets. Suppose we have a portfolio with exposures to both the S&P500 index and the Hong Kong stock index Hang Seng, HSI². The Hong Kong market is subject to the Chinese public holidays, which are not aligned with the American holidays that affect the S&P500. For example, during the Chinese New Year, occurring every year in February, the Hong Kong market is closed for three days³; the same holds for the Chinese National Day on the 1st of October and Buddha's Birthday on the 30th of April [22]. All of these imply a missing HSI price in our portfolio. Figure 1.2 displays the implicit missing prices for the HSI during the Chinese New Year in 2019. There are several approaches to tackling this problem. The simplest one would be to exclude all dates where the market data is incomplete and then feed the remainder to the downstream risk application. The drawback of such a naive approach is that it ignores all known price movements of the other observed market variables. To avoid losing this information, we need a method to fill in the missing prices.

There are several ways to fill the missing window of the HSI price series, but what effect will they have on the VaR metric? The nearest reference points to the price process are the 4th and 8th of February; given the high autocorrelation between adjacent prices, would not a straight line between the references make a good prediction? This would be sound reasoning if one were only interested in a fair estimate of the missing prices. Still, such an approach will minimise the largest relative price movement, which will most likely lead to a systematic underestimation of the VaR metric. Another naive method is to fill the values with the nearest known price observation; hence, the price on the 4th becomes the estimate for the 5th and 6th, and the price on the 8th estimates that on the 7th⁴. Contrary to the straight-line approach, this method will maximise the price movement between two days while being constrained to estimates bounded between the references, flattening all other movements. This approach will perhaps not imply a biased VaR metric, but it will surely overestimate the number of horizontal movements.

Figure 1.2: S&P 500's and HSI's price process during the Chinese New Year, 2019. The red area marks the missing price period of HSI.

²The HSI index contains the largest companies of the Hong Kong stock market and is an indicator of the overall performance of the Hong Kong stock market.
³Depending on the day of the week New Year's Day occurs.
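The two naive fills can be illustrated on a toy gap; the four-day series below is made up and only mimics a holiday closure like the one in Figure 1.2.

```python
import numpy as np

# Toy price series with a two-day gap (np.nan), mimicking a holiday closure.
prices = np.array([100.0, np.nan, np.nan, 104.0])
idx = np.arange(len(prices))
known = ~np.isnan(prices)

# Straight line between the two reference points.
linear = np.interp(idx, idx[known], prices[known])

# Nearest-observation fill; ties go to the earlier reference point
# (argmin picks the first minimum), matching the convention in the text.
dist = np.abs(idx[:, None] - idx[known][None, :])
nearest = prices[known][np.argmin(dist, axis=1)]
```

The linear fill spreads the 4-unit move evenly over the gap, while the nearest fill concentrates it in a single day and flattens the rest, which is exactly the behaviour discussed above.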

A common drawback of the two methods described above is that they only use information from the two reference prices and are further restricted to values that fall between these references. In reality, the price process is not restricted in this way, as observed in S&P 500's process in the red marked area in Figure 1.2, where values fall both higher and lower than either of the two bordering references. Thus far, we have only mentioned problems that affect a single missing price process. Most of the time, investors are interested in their risk exposure given a set of positions, to account for the netting effects that joint movements may cause. For example, assume we filled the missing prices of the HSI series with the exact opposite movement of the S&P 500 for each time step. Given equal weighted contributions in our portfolio, such an approach would yield zero-return scenarios for the portfolio. Such a strong correlation between price movements is seldom seen over longer horizons. Still, market variables often possess both long-term and temporal correlations that investors exploit to hedge their exposures.

⁴It is customary to use the value corresponding to the prior time point when the missing value is equally distant between two reference points.

1.1 Problem Definition

This thesis examines methods to fill missing values in multidimensional financial time series in the context of risk metric applications. The aim is to provide reasoning about, and suggest, which models to apply when imputing an incomplete multidimensional time series. In a broader context, the project aims to improve client risk metrics to support risk management decisions. The performance of the methods is evaluated both from a price replication point of view and by their effect on the downstream risk application, which puts more attention on the estimated price movements.

Specifically, we will investigate two distinct use cases that contextualise different situations of an incomplete dataset when calculating VaR:

1. Single or a few missing data points, causing a sparse time series. This makes it an interpolation problem, with reference points before and after the missing data points and with information from reference channels, e.g., other market variables, for the time of interest. This use case may arise when a portfolio contains assets traded on different exchanges with differing business days, or simply due to data loss.

2. Consecutive missing data points over a longer horizon at the endpoint of a series. This makes it an extrapolation problem for that particular time series. Still, there may be reference channel data, e.g., other market variables, for the period of interest. This use case may arise when a portfolio contains assets that have not been on the market during the full period, e.g., initial public offerings (IPOs) for corporate stocks or newly created derivative instruments.
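The two missingness patterns can be simulated with boolean masks over a series of daily observations; the sketch below is our own illustration, with arbitrary mask sizes, and is not part of the thesis method.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 250  # roughly one year of daily observations (illustrative)

# Use case one: isolated points missing at random -> a sparse series.
sparse_mask = rng.random(n) < 0.05      # True marks a missing day

# Use case two: a consecutive block missing at one end of the series,
# e.g. an asset listed only for the most recent 200 days.
endpoint_mask = np.zeros(n, dtype=bool)
endpoint_mask[:50] = True               # first 50 days unobserved
```

Applying `sparse_mask` to a complete series gives an interpolation task, while `endpoint_mask` gives an extrapolation task with reference channels still fully observed.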

1.2 Dataset

The analysis in this thesis is based on a dataset provided by Nasdaq but sourced from Refinitiv⁵. The dataset contains historical market data depicting 35 different financial variables, given on a daily frequency, and stretches from 2014 to the beginning of 2021. Four different types of financial variables are included: futures, discount rates, foreign exchange rates, and implied volatilities. Due to confidentiality, the dataset will not be presented in full.

⁵Refinitiv is a provider of market data and infrastructure for financial institutions.

Chapter 2 Background

This chapter explains financial markets, risk measures, and related work on financial time series imputation. It briefly introduces financial markets and market risk, especially the risk measures Value at Risk and Expected Shortfall. Moving on to the financial variables covered in this thesis, we describe how instruments are traded in the market and explain the volatility risk measure and its connection to options and the implied volatility surface. We continue with the stylised facts, where fundamental properties of financial price processes are presented and exemplified, finishing off with an introduction to imputation for time series data in general and an explanation of why these traditional approaches may fail in our financial setting.

2.1 Market Risk

Financial markets play a key role in modern society. They enable the interaction between buyers and sellers of financial instruments and serve fundamental functions in the global economy, such as providing liquidity, managing risk, and pricing assets. All entities operating in the financial markets are also exposed to risks. Financial portfolios depend on several financial variables that affect the value of their assets. This risk is called market risk and is one of the main risks to which financial institutions are subjected [24].

Traders often quantify and manage market risk using the Greek metrics for a smaller set of investments. However, a financial institution's portfolio generally depends on hundreds or thousands of financial variables. Presenting many Greek metrics will not give senior management or regulating authorities a holistic view of the current risk exposure. As a response, Value at Risk and Expected Shortfall were developed to give a single value indicative of the total risk of a portfolio [24].

Figure 2.1: Value at Risk equals the profit and loss at the α-percentile, whereas Expected Shortfall equals the mean of all profits and losses greater than or equal to the α-percentile.

2.1.1 Value at Risk

Value at Risk is a risk measure that aims to make the following statement for a portfolio [24]: "We are α% certain that we will not lose more than $V in time T." As previously mentioned, the VaR calculations in this thesis are based on historical simulations: a non-parametric method that uses historical market data to calculate the T-period profit and loss scenarios a financial holding would have incurred over a fixed historical period.

To calculate the α% $T$-day VaR, let $t \in \{t_0 - h, \dots, t_0\}$ denote a specific day, where day $t_0$ is today and $h$ is the specified historical horizon. Assume the portfolio consists of $d$ assets with corresponding position quantities $w^i$, $i \in \{1, \dots, d\}$. Further assume that all asset prices depend on the market variables $x$ and have their individual pricing functions $F^i(x)$¹. For any given day, the $T$-period market variable scenario can be calculated as

$$ s_t = \frac{x_t}{x_{t-T}}, \qquad t \in \{t_0 - h + T, \dots, t_0\}. \qquad (2.1) $$

For every market variable scenario, one can now calculate its incurred Profit-and-Loss, PnL, on today's portfolio value as² [24]

$$ \mathrm{PnL}_t = \sum_{i=1}^{d} w^i F^i(x_{t_0}) - \sum_{i=1}^{d} w^i F^i(s_t x_{t_0}), \qquad (2.2) $$

where the first term denotes today's portfolio value and the second term the portfolio value under the $t$-th day's market variable scenario. Compute the PnLs for all $t \in \{t_0 - h + T, \dots, t_0\}$, sort them in ascending order, and pick the PnL at the α percentile to be the α% $T$-day VaR [24]. In Figure 2.1, the red coloured bar illustrates the α% VaR in the PnL distribution.

¹E.g. the Black-Scholes formula, which depends on the interest rate, the price of the underlying, etc. Including a pricing function allows flexibility when calculating different scenarios.
²Note that losses are represented as positive PnLs; conversely, profits are negative.

There are three parameters in the VaR model: the confidence level α, the scenario period T, and the historical price period h. Organisational or regulatory standards set the values of these parameters. The confidence level α is usually between 95% and 99.9%, and the historical period h is usually between 1 and 7 years. The scenario period T is typically set to one day but depends on the liquidity of the investment [24].

2.1.2 Expected Shortfall

Expected shortfall, henceforth ES, is similar to VaR but aims to quantify the expected loss given a scenario that violates the α percentile threshold. It thus tries to make the following statement for a portfolio:

"If things get bad, how bad does it get?"

This thesis focuses on the non-parametric, historical simulation approach when calculating the ES metric. The parameters are the same as for VaR and are usually set within the same intervals. ES is assessed by calculating the PnLs for a fixed historical horizon and sorting them in ascending order. ES then equals the mean of all PnLs greater than or equal to the PnL at the α percentile. Figure 2.1 illustrates how VaR and ES differ.
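Given a vector of historically simulated PnLs, both metrics can be read off directly. The sketch below follows the sign convention above (losses positive); the function name and the toy scenario values are our own.

```python
import numpy as np

def var_es(pnls, alpha=0.99):
    """Historical-simulation VaR and ES. Losses are positive numbers:
    VaR is the PnL at the alpha percentile, ES the mean of all PnLs
    at or above it."""
    pnls = np.sort(np.asarray(pnls, dtype=float))   # ascending order
    var = np.percentile(pnls, alpha * 100)
    es = pnls[pnls >= var].mean()
    return var, es

# Ten toy PnL scenarios (losses positive). At alpha = 0.9 the interpolated
# 90th-percentile PnL is 8.4, and ES averages the tail beyond it: 12.0.
var, es = var_es([-5, -3, -1, 0, 1, 2, 3, 4, 8, 12], alpha=0.9)
```

Note that ES is always at least as large as VaR, since it averages only the scenarios in the tail.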

2.2 Financial Variables

Measures depicting the financial markets are often referred to as financial variables. They are, in general, sourced from the trade information of marketed financial instruments. Below is a description of the financial variables relevant to this thesis.

2.2.1 Futures

A futures contract is an exchange-traded derivative: an agreement to buy or sell an asset, called the underlying, at a future time point T for a specific price K. A futures contract is the standardised version of a forward contract, which, instead of being publicly traded on an exchange, is agreed upon between two parties outside of an exchange, so-called Over-The-Counter trading. Since the futures contract needs to be standardised, a contract includes [23]:

i) An underlying asset.

ii) A contract size.

iii) How the asset will be delivered.

iv) When the asset will be delivered.

2.2.2 Discount Rates

The common denominator for valuation in financial markets is the interest rate. An interest rate defines how much a borrower of funds must pay back to the lender of those funds. There are several different interest rates quoted in the market, regardless of currency. The most important interest rate in the pricing of derivatives is the interest rate used for discounting the expected cash flows, called the discount rate. The so-called risk-free rate is the most used discount rate when pricing derivatives and is usually assumed to be an interbank offered rate, which is the interest rate that banks are charged when taking short-term loans. It is important to note that these rates are used as the risk-free rate even though they are not risk-free [23].

2.2.3 Foreign Exchange Rates

Foreign exchange rates, commonly referred to as FX rates, denote the value of the currencies in a currency pair relative to each other. E.g., a 0.8 exchange rate for EUR to USD means that EUR 0.8 can be exchanged for USD 1.0, or equivalently that USD 1.0 can be exchanged for EUR 0.8. In this case, the price relation of USD to EUR is 1.25. An FX rate can be traded as it is, called spot, with delivery in the coming days, or as the underlying in instruments like futures and options. FX rates are commonly used to hedge cash flows denominated in another currency, which can be done through both futures and forwards [5][43].
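The quoting arithmetic from the example can be made explicit; the amounts below are arbitrary illustrations.

```python
# A 0.8 exchange rate for EUR to USD: EUR 0.8 exchanges to USD 1.0.
eur_per_usd = 0.8
usd_per_eur = 1 / eur_per_usd     # the inverse quote: 1.25 USD per EUR

# Converting a USD cash flow into EUR at this rate:
usd_amount = 100.0
eur_amount = usd_amount * eur_per_usd   # USD 100 -> EUR 80
```

The inverse quote is simply the reciprocal of the original rate, which is why 0.8 EUR/USD and 1.25 USD/EUR describe the same market price.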

2.2.4 Options

An options contract is a derivative that gives the holder the option to buy or sell an asset, called the underlying, for a specific price, K. This differs from a futures contract, where the holder of the contract is obliged to act. The two most common forms of options are American and European options. European options can only be exercised at a future time point T, called the maturity, whereas the American counterpart can be exercised at any time up until T. The two most basic options are call and put options. The call option gives the holder the right to buy the underlying asset for a price K at or before maturity T, whereas the put option gives the holder the right to sell the underlying asset for a price K at or before maturity T. This leads to the following pay-offs for the holder of a call option C and a put option P:

$$ C = \max(s - K, 0), \qquad P = \max(K - s, 0), \qquad (2.3) $$

where s is the price of the underlying. The price of a European option is usually estimated by the Black-Scholes formula, which is formally defined as:

Definition 2.2.1. The price C of a European call option with strike price K and time to maturity T is given by the Black-Scholes formula:

$$ C = sN(d_1) - Ke^{-rT}N(d_2), \qquad (2.4) $$

where s is the price of the underlying asset, $N(\cdot)$ is the cumulative distribution function of the standard normal distribution, r is the risk-free rate, and

$$ d_1 = \frac{\ln(s/K) + (r + \sigma^2/2)T}{\sigma\sqrt{T}}, \qquad d_2 = d_1 - \sigma\sqrt{T}, \qquad (2.5) $$

where σ is the volatility of the underlying asset.

A European put option is then priced through the put-call parity. American options have no known closed-form solution but can be priced through binomial trees or simulation methods. Options traded in the market are often American options, but European options are easier to analyse due to the Black-Scholes formula used for European option pricing [23].
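Definition 2.2.1 and the put-call parity translate directly into code. This is a minimal sketch of the standard formulas, not the thesis implementation; the function names are our own.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def bs_call(s, K, T, r, sigma):
    """European call price by the Black-Scholes formula (Definition 2.2.1)."""
    N = NormalDist().cdf
    d1 = (log(s / K) + (r + sigma**2 / 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return s * N(d1) - K * exp(-r * T) * N(d2)

def bs_put(s, K, T, r, sigma):
    """European put price via the put-call parity P = C - s + K e^{-rT}."""
    return bs_call(s, K, T, r, sigma) - s + K * exp(-r * T)

# An at-the-money one-year call at 20% volatility and a 5% rate:
price = bs_call(s=100, K=100, T=1.0, r=0.05, sigma=0.2)   # about 10.45
```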

There is a certain lingo concerning options. Out-of-the-money (OTM), at-the-money (ATM), and in-the-money (ITM) are terms that refer to the intrinsic value of the option, i.e., how much the option would be worth if it were exercised today. For a call option, the terminology is as follows [23]:

i) A call option is OTM if s < K which makes its intrinsic value 0.

ii) A call option is ATM if s ≈ K which makes its intrinsic value ≈ 0.

iii) A call option is ITM if s > K which makes its intrinsic value s − K.
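The three cases above amount to comparing spot to strike; a small sketch (our own helper names, with an arbitrary tolerance for the ATM case):

```python
def call_intrinsic(s, K):
    """Intrinsic value of a call: what exercising it today would be worth."""
    return max(s - K, 0.0)

def call_moneyness(s, K, tol=1e-9):
    """Classify a call option as OTM, ATM or ITM from spot s and strike K."""
    if abs(s - K) < tol:
        return "ATM"
    return "ITM" if s > K else "OTM"
```

So a call with spot 105 and strike 100 is ITM with intrinsic value 5, while the same call at spot 95 is OTM and worthless if exercised today.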

2.2.5 Volatility

The volatility of an asset price, usually denoted σ, is the variability of the return of that asset. Since volatility is the only parameter in the Black-Scholes formula that is not observed in the market, it is often the centre of attention when dealing with options and other derivatives. There is no single concept of volatility, but the ones that are the subject of this thesis are historical volatility and implied volatility. The historical volatility is simply the standard deviation of the log returns of a time series. The implied volatility is the volatility implied by the market price of an option priced by the Black-Scholes formula. E.g., if C* is the price of a European call option observed in the market, the implied volatility, σimp, is the volatility that solves the implicit equation

$$ C^* = C^{BS}(s, K, T, r, \sigma_{imp}). $$

By plotting the implied volatilities of options with the same maturity, T, against different strike prices, K, one obtains the volatility smile. By plotting the implied volatilities of options with the same strike price, K, against their different maturities, T, one obtains the volatility term structure. One way of creating an implied volatility surface is to combine the volatility smiles with the volatility term structure [23][20]. However, in our dataset, the implied volatility surface is created by combining the volatility term structure with the deltas of the existing options. The delta, Δ, is one of the Greeks and denotes how the option's value changes with respect to the price of the underlying asset, i.e., for a European call option, it is the derivative of C with respect to s, Δ = ∂C/∂s.
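The implicit equation for σimp has no closed-form solution, but since the Black-Scholes call price is increasing in volatility, a simple root-finding scheme suffices. The sketch below uses bisection, which is our choice of solver, not one stated in the thesis; the call pricer is repeated so the block is self-contained.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def bs_call(s, K, T, r, sigma):
    """European call price under Black-Scholes."""
    N = NormalDist().cdf
    d1 = (log(s / K) + (r + sigma**2 / 2) * T) / (sigma * sqrt(T))
    return s * N(d1) - K * exp(-r * T) * N(d1 - sigma * sqrt(T))

def implied_vol(c_star, s, K, T, r, lo=1e-6, hi=5.0, tol=1e-10):
    """Solve C* = C^BS(s, K, T, r, sigma) for sigma by bisection,
    exploiting that the call price is monotonically increasing in sigma."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if bs_call(s, K, T, r, mid) < c_star:
            lo = mid        # price too low -> volatility must be higher
        else:
            hi = mid        # price too high -> volatility must be lower
        if hi - lo < tol:
            break
    return (lo + hi) / 2
```

Pricing an option at a known volatility and inverting the price recovers that volatility, which is a convenient sanity check for any implied-volatility routine.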

Using the volatility surface created from existing contracts in the market, it is possible to price any strike price, K, and maturity, T, using interpolation and extrapolation techniques. One needs to be careful, though, not to introduce arbitrage opportunities [15].


Figure 2.2: (a) A stationary time process with the same distribution over any given time interval. (b) A mean-varying process where the expected value depends on time and hence is not stationary. (c) A variance-varying process where the variability depends on time and hence is not stationary.

2.3 Financial Time Series

This thesis covers historical market data, which is structured as time series data. A time series is defined by sequential data points given in successive order. The time series x observed at time points t = 1, …, n is usually written as $\{x_t\}_{t=1}^{n}$. In our case, the data is considered in discrete time on a once-per-day basis. A common approach is to consider the observed time series data as a realisation of a stochastic process of a random variable X [10]. In time series analysis, an important question is whether or not the process of X is strictly and/or weakly stationary. Intuitively, stationarity means that the statistical properties of X do not change over time [9]. For a strictly stationary series, the joint probability distribution function of the sequence $X = \{X_{t-i}, \dots, X_t, \dots, X_{t+i}\}$ is independent of t and i, which implies

$$ E(X_t) = \mu, \qquad Var(X_t) = \sigma^2, \qquad \forall t, \qquad (2.6) $$

with the autocorrelation only dependent on i,

$$ \rho_i = \frac{Cov(X_{t-i}, X_t)}{\sqrt{Var(X_t)\,Var(X_{t-i})}} = \frac{\zeta_i}{\zeta_0}, \qquad (2.7) $$

where $Cov(X_{t-i}, X_t)$ is the autocovariance. A time series is said to be weakly stationary, or covariance stationary, if its mean and autocovariances are time independent [9], i.e.,

$$ E(X_t) = \mu < \infty, \quad \forall t, \qquad Var(X_t) = \sigma^2 < \infty, \quad \forall t, \qquad Cov(X_t, X_{t-i}) = \zeta_i < \infty, \quad \forall t, i. \qquad (2.8) $$

This means that the autocovariances depend only on the time interval between time points and not on the observation time [10]. Figure 2.2 presents an example with one stationary and two non-stationary processes.
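A sample estimate of the autocorrelation in Equation (2.7) can be sketched as follows; the alternating toy series is our own illustration of a process with exact anti-correlation at lag 1.

```python
import numpy as np

def autocorr(x, lag):
    """Sample version of Equation (2.7): rho_i = zeta_i / zeta_0."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    zeta0 = np.mean(x * x)                 # sample autocovariance at lag 0
    if lag == 0:
        return 1.0
    zeta = np.mean(x[lag:] * x[:-lag])     # sample autocovariance at the lag
    return zeta / zeta0

# A perfectly alternating series is exactly anti-correlated at lag 1.
rho = autocorr([1.0, -1.0] * 50, lag=1)    # -1.0
```

Applied to a price series and its log returns, this estimator reproduces the contrast shown in Figure 2.3.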

As previously mentioned, a central question in time series analysis is whether the time series is stationary or not. Analysing the standard price process of a financial instrument on the market, say the S&P500 in Figure 1.1, it is clear that the statistical properties change over time: the series is neither strictly nor weakly stationary. For example, the mean, μ, of the price process is time-dependent, which means that the time series has a trend component that violates any assumption of stationarity. A standard way to deal with this is to instead operate on the differences between time points, $\Delta x_t = x_t - x_{t-1}$ [10]. In finance, the usual way of overcoming this problem is to operate on the movements of the price process. Let $x_t$ denote the price of an asset at time t; then

$$ r_t = \log\!\left(\frac{x_t}{x_{t-1}}\right), \qquad (2.9) $$

is the logarithmic return of the price process, henceforth log return. Log returns are the most common way to work with financial time series due to their nice properties, e.g., additivity [9], and will be the returns used in this thesis.

Figure 2.3: (a) Autocorrelation function of S&P500's log returns from lag 1 to 40 with a 95% confidence interval. (b) Autocorrelation function of S&P500's prices from lag 1 to 40 with a 95% confidence interval.

Figure 2.3 shows the autocorrelation functions of a price process and a log return process of the S&P500 with 95% confidence intervals. As depicted in Figure 2.3a, the price process has a very high autocorrelation component, which makes it more difficult for some model architectures to parse the crucial signals from the underlying process due to the low signal-to-noise ratio [3]. This further motivates using log returns, as their autocorrelation is smaller, as depicted in Figure 2.3b, and a lot of "noise" is removed. However, removing a large part of the autocorrelation comes at the expense of removing the internal memory of the price process; see [36] for a discussion on the trade-off between stationarity and memory.
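The log-return transformation of Equation (2.9) and its additivity property can be checked on a toy series; the prices below are arbitrary.

```python
import numpy as np

# Log returns as in Equation (2.9): r_t = log(x_t / x_{t-1}).
prices = np.array([100.0, 102.0, 101.0, 103.0])
log_returns = np.log(prices[1:] / prices[:-1])

# Additivity: the daily log returns sum to the multi-day log return.
total_return = np.log(prices[-1] / prices[0])
```

This additivity is one reason log returns are preferred over simple returns, which do not aggregate across days by summation.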

2.3.1 Stylised Facts

The presence of autocorrelation in the price process is one of the statistical properties of a financial time series. Beyond that, the statistical properties that asset returns share over a wide range of assets, markets, and time periods are the so-called stylised facts. Independent studies have observed the stylised facts within finance over various instruments, markets, and periods [7]. Overall, it is well known that asset returns exhibit behaviour belonging to an ever-changing probability distribution. Regardless of asset type, frequency, market, and period, the stylised facts can be summarised as: volatility clustering, fat tails, and non-linear dependence [9][7].


Figure 2.4: (a) The distribution of S&P500’s log returns vs a standard Normal distribution as a histogram with 100 bins. Numbers are shown for Fisher’s kurtosis and skewness of the distribution. (b) A QQ-plot of S&P500’s log returns vs a standard normal distribution.

Fat tails refer to the property that the returns' probability distribution exhibits larger positive and negative values than a normal distribution. An example of fat tails for S&P500's log returns is shown in Figure 2.4. Figure 2.4a shows that the log returns on S&P500 have significantly greater kurtosis than the normal distribution, 19.08 vs 0. Also, when plotting the quantiles of the log returns against a standard normal, as in Figure 2.4b, the log returns show properties of fat tails. If the distribution were normal, the plot would depict a straight line.
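The excess kurtosis and skewness reported in Figure 2.4a can be computed with scipy.stats; here a Student's t sample serves as a fat-tailed stand-in for real returns (the degrees of freedom and sample size are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Student's t with 10 degrees of freedom: a mildly fat-tailed stand-in for returns
fat_tailed = stats.t.rvs(df=10, size=100_000, random_state=rng)
normal = rng.standard_normal(100_000)

# Fisher's (excess) kurtosis is 0 for a normal distribution; fat tails push it above 0
print(stats.kurtosis(fat_tailed))   # positive (theoretical value 6/(df-4) = 1 here)
print(stats.kurtosis(normal))       # close to 0
print(stats.skew(fat_tailed))       # close to 0: fat tails need not imply skewness
```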

Volatility clustering refers to the property that volatility in the market tends to come in clusters, i.e., there is a positive autocorrelation between volatility measures: if a financial variable is volatile today, there is an increased probability of it being volatile tomorrow. An example of this can be found in Figure 2.5b, depicting the rolling window volatility of the log returns of S&P500 and HSI.

The stylised fact of non-linear dependence regards the dependence between financial return processes and how it changes according to current market conditions. Two return processes that move somewhat independently in normal market conditions can show a high temporal correlation during financially stressed periods, i.e., the prices drop together. As an example of the ever-changing dependence between financial returns, the rolling window correlation between the log returns of S&P500 and HSI is shown in Figure 2.5a.

To conclude this section, many theories in finance, such as portfolio theory and derivative pricing, are built on the assumption that returns are normally distributed, and they break down if the normality assumption is violated. In risk management and risk calculations in particular, an assumption of normally distributed asset returns leads to a substantial underestimation of risk [9]. The stylised facts should thus be carefully considered when modelling a financial time series. Implied volatility surfaces and discount rates have their own specific stylised facts, which can be found in Appendix C.


Figure 2.5: (a) Estimated rolling daily correlation between S&P500 and HSI log returns from 1st of January 2018 to the 1st of January 2021 through the multivariate EWMA model proposed in [9] with λ = 0.94. (b) Estimated rolling daily volatility of S&P500 and HSI log returns from 1st of January 2018 to the 1st of January 2021 through the multivariate EWMA model proposed in [9] with λ = 0.94.
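The EWMA recursion behind estimates like those in Figure 2.5 can be sketched in a few lines. The recursion Σ_t = λ Σ_{t−1} + (1 − λ) r_t r_t^T uses λ = 0.94 as in [9]; the initialisation with the full-sample covariance and the synthetic return series are illustrative assumptions:

```python
import numpy as np

def ewma_cov(returns, lam=0.94):
    """Rolling covariance matrix estimates via the EWMA recursion
    Sigma_t = lam * Sigma_{t-1} + (1 - lam) * r_t r_t^T."""
    returns = np.asarray(returns, dtype=float)
    n, d = returns.shape
    sigma = np.cov(returns.T)          # illustrative initialisation: sample covariance
    out = np.empty((n, d, d))
    for t in range(n):
        r = returns[t][:, None]
        sigma = lam * sigma + (1.0 - lam) * (r @ r.T)
        out[t] = sigma
    return out

rng = np.random.default_rng(1)
rets = 0.01 * rng.standard_normal((500, 2))        # two synthetic return series
covs = ewma_cov(rets)
vols = np.sqrt(covs[:, 0, 0])                      # rolling volatility of series 1
corr = covs[:, 0, 1] / np.sqrt(covs[:, 0, 0] * covs[:, 1, 1])
```

Each EWMA estimate is a convex combination of positive semi-definite matrices, so the implied correlations always stay in [−1, 1].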

2.4 Related Work

The problem of missing data in time series is not limited to finance but is a broad problem in many application domains such as healthcare, meteorology, and traffic engineering [6]. Instead of originating from damaged equipment, unexpected accidents, or human error, the missing values in financial time series most commonly depend on whether the market is open or whether the asset exists in the market.

The literature on imputation of pure financial time series data is sparse, but there has been work on autoregressive models [29][2], agent-based modelling [37][11], and Gaussian Processes [45]. However, there is no reason to believe that the methods used differ from other domains. Much previous work on imputing missing values in time series has been done using statistical approaches. The most common ones seem to be autoregressive, state-space, and expectation-maximisation models³ [12].

2.4.1 Autoregressive Models

An autoregressive model is a model where the output depends, often assumed linearly, on its own history and an error term. An autoregressive model can, e.g., try to predict a stock's price tomorrow based on its price today. A common model in the univariate case with a linear relationship is the Autoregressive-Moving-Average model, ARMA(p, q), which has a pth order autoregressive part, AR(p), and a qth order moving average part, MA(q). The assumptions behind the ARMA model hold if the AR(p) part is stationary [30]. When it comes to distributional assumptions, one usually assumes that the process is normally distributed by assuming that the error term, ε_t, is an independently and identically distributed (i.i.d.) random variable, ε_t ∼ N(0, 1). However, this is flexible, and, e.g., a Student's t-distribution could be assumed instead. The ARMA framework is flexible towards extending it with

³We have deliberately disregarded methods like median and mean imputation, since financial time series are often non-stationary.

exogenous variables and in a multivariate setting [8]. However, in finance, if the assumptions behind the ARMA model held, meaning that tomorrow's log return could be written as a linear combination of the p previous log returns and q previous error terms, it would quickly be noticed and exploited.⁴

The Generalised Autoregressive Conditional Heteroskedasticity (GARCH) model was introduced to overcome some of these problems. The GARCH(p, q) process models the conditional variance as if it were given by an ARMA process. From this model, one can show that subsequent log returns are uncorrelated but dependent, have fat tails, form volatility clusters, and have an unconditional long-term variance; it can thus recreate some of the essential stylised facts. The GARCH model removes some of the crucial assumptions on the log return process that ruined the ARMA model, but it still assumes that the volatility of the process is stationary. Extending GARCH to the multivariate case can be very hard and troublesome [9].
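The qualitative claims above can be checked by simulating a GARCH(1,1) process; the parameter values below are illustrative assumptions, not estimates from data:

```python
import numpy as np

def simulate_garch11(n, omega=1e-6, alpha=0.1, beta=0.85, seed=0):
    """Simulate r_t = sigma_t * z_t with conditional variance
    sigma_t^2 = omega + alpha * r_{t-1}^2 + beta * sigma_{t-1}^2 (GARCH(1,1))."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    r = np.empty(n)
    var = omega / (1.0 - alpha - beta)   # start at the unconditional long-term variance
    for t in range(n):
        r[t] = np.sqrt(var) * z[t]
        var = omega + alpha * r[t] ** 2 + beta * var
    return r

r = simulate_garch11(50_000)
# Even with Gaussian innovations, the unconditional returns are fat-tailed:
excess_kurtosis = np.mean(r**4) / np.mean(r**2) ** 2 - 3.0
print(excess_kurtosis)
```

The simulated excess kurtosis is positive although each conditional return is normal, which is exactly the mixture effect the text describes.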

2.4.2 State-Space Models

Continuing with the regression-based approaches to the imputation of time series data: simply put, a state-space model assumes a latent process, call it z_t, which evolves over time. This process z_t is not observable but drives another process, x_t, that is observable. Random factors may drive the evolution of z_t and its dependence on x_t; thus, this is a probabilistic model. The state-space model consists of describing the latent state over time and its dependence on the observable process. These models overcome some of the problems with stationarity of the ARMA models. An example of a state-space model is the family of models called Kalman filters. The disadvantage of these models is that one needs to make assumptions about the dynamical system being modelled and about the noise affecting the system [10][8].

2.4.3 Expectation Maximisation

Unlike the previously described methods, Expectation Maximisation (EM) methods are not necessarily regression-based. The EM method consists of two steps, an Expectation step and a Maximisation step. First, one assumes a statistical model and distribution of the data; the statistical model could, e.g., be an AR(p) model [29] or simply a normal distribution. Then the two steps are performed iteratively, imputing the missing time points with the statistical model so as to maximise the probability of the missing time points belonging to the time series [12]. In these methods, the data do not have to be treated as time-dependent.
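As a deliberately simplified sketch of the two steps, the following treats the series as i.i.d. normal, ignoring time dependence as noted above; that the missing entries converge to the observed mean is a property of this toy model, not of EM in general:

```python
import numpy as np

def em_impute_normal(x, n_iter=50):
    """EM-style imputation for a univariate series under an i.i.d. normal model:
    E-step fills missing points with the current mean; M-step re-estimates
    the mean and variance from the completed series."""
    x = np.array(x, dtype=float)
    missing = np.isnan(x)
    mu, sigma2 = np.nanmean(x), np.nanvar(x)
    for _ in range(n_iter):
        x[missing] = mu                  # E-step: expected value of the missing points
        mu, sigma2 = x.mean(), x.var()   # M-step: maximum likelihood estimates
    return x, mu, sigma2

series = [1.0, 2.0, np.nan, 4.0, np.nan, 6.0]
filled, mu, _ = em_impute_normal(series)
print(filled)   # the two NaNs are replaced by the mean of the observed values
```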

2.4.4 Key Points

A common drawback of the approaches described above is that they often make strong assumptions on the missing values and may not take temporal relationships in the data into account. They instead treat the time series as non-time-dependent structured data, which may not suit financial time series, known to have low signal to

⁴At least according to the Efficient Market Hypothesis.

noise ratio with temporal correlation to other processes in time [3][12]. However, several deep learning approaches have become successful in imputing time series data, with Generative Adversarial Networks (GANs) and Recurrent Neural Networks (RNNs) at the forefront [12][6]. GANs, at least, have been shown to work well with financial time series [13].

Imputation of financial time series is difficult due to the complex dynamics described in Section 2.3.1, the low signal-to-noise ratio, and the fact that the same signal can come from several different sources with a temporal effect and with varying strength [3]. So, when imputing financial time series, a model should preferably be non-parametric, take temporal relationships within its own channel and with reference channels into account, and recreate the stylised facts.

Chapter 3

Theory

In this section, we present the theory needed to understand the models that are used later. The outline is intended to follow the complexity of the models introduced, i.e., starting with less complex linear models and extending to more complex and highly non-linear models.

First, two standard imputation methods, nearest neighbour imputation and linear interpolation, are introduced. Then, the standard linear model and its regularised version, the Lasso, are described with their fundamental properties. Moving from linearity, Random Forests are presented by first describing decision trees and their weakness of easily overfitting the data, and continuing with how using multiple trees in a Random Forest can mitigate this. Later, laying the foundation for Gaussian Processes, Bayesian inference is introduced and exemplified as the weight-space view of regression through Bayesian linear regression, finishing with the move to non-linear modelling by introducing projections into feature space and the kernel trick. Connecting the previous work, Gaussian Processes are then introduced as the function-space view of regression. Different covariance functions are presented, and it is described how to choose the corresponding hyperparameters.

Then artificial neural networks are introduced, starting with the vital building blocks and how they learn from data. The notion of artificial neural networks is then expanded to recurrent neural networks, which specialise in processing sequences and learning long-term dependencies, and continues with convolutional neural networks, which specialise in data with grid-like topology, and how they differ from regular neural networks. An example of the successful adoption of convolutional neural networks on non-stationary time series is the WaveNet architecture, developed by DeepMind in 2016. Lastly, the section finishes with how one might speed up convergence when training deep neural networks.

3.1 Nearest Neighbour Imputation

A simple approach to filling the missing values is to estimate them to equal the closest observed data point. This method is referred to as Nearest Neighbour Imputation (NNI) and is illustrated in Figure 3.1a. To formulate it in a mathematical setting, assume that y_t is missing and let T denote the set of all time points with observed values. The


Figure 3.1: (a) Illustration of how the Nearest Neighbour Imputation method works. The blue dots are observed data points, and the red dots are nearest neighbour predictions. (b) Illustration of how the linear interpolation method works. The blue dots are observed data points and the red dots are the linear interpolation prediction.

NNI method, also referred to as the naive method, then estimates y_t as,

ŷ_t = y_{t*},  where t* = argmin_{τ ∈ T} |τ − t|.    (3.1)

The implication of NNI in an extrapolation setting is a flat line from the last observed value.
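A direct implementation of Equation 3.1 (breaking distance ties towards the earlier neighbour is an implementation choice):

```python
import numpy as np

def nearest_neighbour_impute(y):
    """Fill NaNs with the value at the nearest observed time point (Equation 3.1).
    Ties are broken towards the earlier neighbour."""
    y = np.array(y, dtype=float)
    obs = np.flatnonzero(~np.isnan(y))          # indices of observed values
    for t in np.flatnonzero(np.isnan(y)):
        t_star = obs[np.argmin(np.abs(obs - t))]
        y[t] = y[t_star]
    return y

imputed = nearest_neighbour_impute([1.0, np.nan, np.nan, 4.0, np.nan])
print(imputed)   # [1. 1. 4. 4. 4.] - the trailing NaN extrapolates as a flat line
```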

3.2 Linear Interpolation

The Linear Interpolation (LI) method is another simple method to fill missing values in a sequence. Like the NNI method, the missing data points are imputed based on the nearest observed values. The LI method predicts missing values based on the straight line between the closest observed points in time; see Figure 3.1b for an example. Assume y_t is missing, where y_{t−a} and y_{t+b} are the nearest previous and next observed data points. This is thus an interpolation problem where t − a < t < t + b. The LI method estimates y_t as,

ŷ_t = y_{t−a} + a (y_{t+b} − y_{t−a}) / (a + b).    (3.2)
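A direct implementation of Equation 3.2; boundary points without both neighbours are left untouched here, since LI is an interpolation method:

```python
import numpy as np

def linear_interpolate(y):
    """Impute interior NaNs by the straight line between the nearest
    observed neighbours (Equation 3.2)."""
    y = np.array(y, dtype=float)
    obs = np.flatnonzero(~np.isnan(y))
    for t in np.flatnonzero(np.isnan(y)):
        prev = obs[obs < t]
        nxt = obs[obs > t]
        if len(prev) and len(nxt):               # interior point: interpolate
            lo, hi = prev[-1], nxt[0]
            a, b = t - lo, hi - t
            y[t] = y[lo] + a * (y[hi] - y[lo]) / (a + b)
    return y

filled = linear_interpolate([1.0, np.nan, np.nan, 4.0])
print(filled)   # [1. 2. 3. 4.]
```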

3.3 Lasso

Consider the standard linear model,

y = w_0 + X^T w + ε,    (3.3)

where y is the response variable, X ∈ R^{p×n} is a set of explanatory variables, p is the number of variables, n is the number of observations, w ∈ R^{p×1} are the regression coefficients, and ε is the error term containing the noise of the linear model. Given an estimate of the regression coefficients, the predicted response is given by,

Ŷ = w_0 + X^T w.    (3.4)

In practice, w is chosen as the regression coefficients that minimise the residual sum of squares (RSS) in the least-squares fitting procedure [14],

Σ_{i=1}^{n} (y_i − w_0 − x_i^T w)².    (3.5)

The Least Absolute Shrinkage and Selection Operator, henceforth Lasso, is a shrinkage method very similar to simple linear regression but, instead of only penalising the RSS term, Lasso adds a regularisation of the regression coefficients. The Lasso coefficients, w_L, are the ones that minimise,

Σ_{i=1}^{n} (y_i − w_0 − x_i^T w)² + λ Σ_{j=1}^{p} |w_j|,    (3.6)

for any λ ∈ R₊. The last term is the coefficient penalty, which shrinks coefficient values towards zero. If λ = 0, the resulting estimates equal those obtained from Equation 3.5. The larger λ becomes, the higher the coefficient penalty and the more the coefficients shrink towards zero. The term Σ_{j=1}^{p} |w_j| is also called the ℓ1-norm of w. Lasso has three main advantages over the simple RSS-minimising model [14]:

i) Can significantly lower the coefficient variance and thus be less prone to overfitting.

ii) Is not restricted to the setting n ≥ p, i.e., when the number of observations is greater than the number of features.

iii) Shrinks many of the coefficients to zero, causing a sparse set of explanatory variables, thus, increasing the model’s interpretability.
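With scikit-learn, the minimisation in Equation 3.6 and the sparsity in point iii) can be illustrated on synthetic data; the data-generating process and the value of alpha (scikit-learn's name for λ) are assumptions for the example:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
# Only two of the ten features carry signal; the rest are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(n)

model = Lasso(alpha=0.1).fit(X, y)   # alpha plays the role of lambda in Eq. 3.6
print(model.coef_)                   # most noise coefficients are shrunk exactly to zero
print(np.count_nonzero(model.coef_))
```

Note that scikit-learn expects the n × p design matrix, i.e., the transpose of the p × n convention used in the text.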

3.4 Random Forest

Random Forest (RF) is a popular machine learning method for both classification and regression tasks. In contrast to the Lasso, an RF does not assume a linear relationship between the response and explanatory variables and can thus learn more complex, non-linear patterns.

The building block of an RF is the decision tree. Building a decision tree for regression tasks is performed in a two-step procedure [14],

i) Sequentially segment the predictor space X ∈ {X_1, X_2, ..., X_d} into J disjoint, non-overlapping regions R ∈ {R_1, R_2, ..., R_J}.

ii) Associate each R_j with a prediction value; explicitly, the average response value of all training observations that fall in R_j.

The goal of step i) is to find the regions R that minimise the RSS given by

Σ_{j=1}^{J} Σ_{i ∈ R_j} (y_i − ŷ_{R_j})²,    (3.7)

Figure 3.2: Schematic overview of a Random Forest model with m decision trees. Red dots indicate a particular decision path within a tree. The final prediction is the average of all individual tree estimates.

where ŷ_{R_j} is the average response for the training data within the jth region. In reality, there is an infinite number of ways to segment the data, and evaluating them all is not feasible. Therefore, Recursive Binary Splitting is applied: a top-down, greedy approach to segmenting the feature space. It starts at the top of the tree and successively splits the data, where each split results in two new branches. The split is greedy since it only regards the partition that yields the greatest reduction in RSS at that particular step [14].

Decision trees are easily fitted and interpreted but have the disadvantage of high variance; that is, they are susceptible to overfitting the training data. RF is an approach that compromises on the bias-variance trade-off to gain better model performance. In short, an RF contains multiple decision trees and uniformly aggregates their individual predictions into a final prediction. When building an individual decision tree, one starts by extracting a bootstrap sample from the training set. Then, at each split, a random sample of m predictors is chosen as split candidates, so the split is restricted to use only the m predictors sampled at that step. This creates greater variety amongst the decision trees in the random forest, which has proven to yield better performance [14]. An illustration of an RF model is presented in Figure 3.2, where the arrows show how the data flows through the model. The hyperparameters of a random forest include:

i) The total number of decision trees. Typically, performance converges when the number of trees grows beyond a certain value.

ii) The number of randomly drawn features considered at each split of a decision tree.

iii) The maximum depth per decision tree.
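The three hyperparameters map directly onto arguments of scikit-learn's RandomForestRegressor; the synthetic regression problem and the particular parameter values below are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.standard_normal(400)

rf = RandomForestRegressor(
    n_estimators=200,   # i) total number of trees
    max_features=1,     # ii) features considered at each split
    max_depth=8,        # iii) maximum depth per tree
    random_state=0,
).fit(X, y)

pred = rf.predict(np.array([[0.0, 1.0]]))   # averaged over all 200 tree predictions
print(pred)
```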

3.5 Bayesian Inference

Performing statistical inference using Bayes' rule to update the probability of a hypothesis as more information becomes available is called Bayesian inference. More

specifically, the variable θ is treated as a random variable, and one assumes an initial guess about the distribution of θ, called the prior distribution. When more information becomes available, the distribution of θ is updated by Bayes' rule into the posterior distribution.

3.5.1 Bayes' Rule

Definition 3.5.1. For data X and variable θ, Bayes' rule tells one how to update one's prior beliefs about the variable θ given the data X to a posterior belief, according to,

p(θ | X) = p(X | θ) p(θ) / p(X),    (3.8)

where p(θ | X) is the posterior probability, p(θ) is the prior probability, p(X | θ) is called the likelihood, and p(X) is called the evidence. The evidence is also called the marginal likelihood. The term likelihood is used for the probability that a model generates the data. The maximum a posteriori (MAP) estimate is the estimate that maximises the posterior probability,

θ_MAP = argmax_θ p(θ | X).    (3.9)

3.5.2 Multivariate Normal Distribution

A p-dimensional normal random vector x = [x_1, x_2, ..., x_p]^T with mean µ and covariance Σ has the distribution x ∼ N(µ, Σ). The corresponding probability density function is,

f(x) = 1 / √((2π)^p |Σ|) · exp(−(1/2) (x − µ)^T Σ^{−1} (x − µ)).    (3.10)

The following is true for a multivariate normally distributed random vector x [27]:

i) Linear combinations of the components of x are normally distributed.

ii) All subsets of the components of x are normally distributed.

iii) Zero covariance implies that the components are independently distributed.

iv) The conditional distributions of the components are normally distributed.

3.5.3 Conditional Distribution

Lemma 3.5.1. Let x = [x_1, x_2]^T be distributed as N(µ, Σ) with µ = [µ_1, µ_2]^T,

Σ = [[Σ_11, Σ_12], [Σ_21, Σ_22]],

and |Σ_22| > 0. Then the conditional distribution of x_1 given that x_2 = x_2 is normal, with,

x_1 | x_2 ∼ N(µ_1 + Σ_12 Σ_22^{−1} (x_2 − µ_2), Σ_11 − Σ_12 Σ_22^{−1} Σ_21).    (3.11)

For an example of how to obtain this result, see [27]. Note that the conditional covariance Σ_11 − Σ_12 Σ_22^{−1} Σ_21 does not depend on the conditioned variable.
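Lemma 3.5.1 translates directly into NumPy; the helper below and the two-dimensional example are illustrative:

```python
import numpy as np

def condition_mvn(mu, Sigma, idx1, idx2, x2):
    """Conditional distribution of x1 | x2 for a joint normal (Lemma 3.5.1)."""
    mu = np.asarray(mu, float)
    Sigma = np.asarray(Sigma, float)
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    S22_inv = np.linalg.inv(S22)
    mean = mu[idx1] + S12 @ S22_inv @ (np.asarray(x2, float) - mu[idx2])
    cov = S11 - S12 @ S22_inv @ S12.T   # does not depend on the observed value x2
    return mean, cov

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
mean, cov = condition_mvn(mu, Sigma, [0], [1], [1.0])
print(mean, cov)   # mean 0.8, variance 1 - 0.8^2 = 0.36
```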

3.5.4 Bayesian Linear Regression

In a Bayesian setting, following the notation in [39], the standard linear regression model is,

y = X^T w + ε,    (3.12)

where w is the vector of weights, or parameters, of the regression model, X the input matrix, and ε is assumed to be i.i.d. normally distributed noise, ε ∼ N(0, σ_n²). Note that X has the dimension p × n, where p is the number of features and n the number of observations. The probability density of the observations given the weights can be written as,

p(y | w, X) = Π_{i=1}^{n} p(y_i | x_i^T w) = Π_{i=1}^{n} 1/(√(2π) σ_n) exp(−(y_i − x_i^T w)² / (2σ_n²))
            = 1/(2πσ_n²)^{n/2} exp(−‖y − X^T w‖² / (2σ_n²)),    (3.13)

i.e., y | w, X ∼ N(X^T w, σ_n² I), where ‖z‖ denotes the ℓ2-norm of the vector z. Given a prior distribution of the weights, w ∼ N(0, Σ_p), using Bayes' rule and writing out the likelihood and the prior distribution, the posterior is,

p(w | X, y) ∝ exp(−(1/(2σ_n²)) (y − X^T w)^T (y − X^T w)) exp(−(1/2) w^T Σ_p^{−1} w)
           ∝ exp(−(1/2) (w − w̄)^T (σ_n^{−2} X X^T + Σ_p^{−1}) (w − w̄)),    (3.14)

where w̄ = σ_n^{−2} (σ_n^{−2} X X^T + Σ_p^{−1})^{−1} X y. With A = σ_n^{−2} X X^T + Σ_p^{−1}, the distribution can be written as,

p(w | X, y) ∼ N(σ_n^{−2} A^{−1} X y, A^{−1}).    (3.15)

The mean is the maximum a posteriori (MAP) estimate of w and thus the most probable weights of the underlying function.

When making predictions with the model, the average over the possible values of the weights, weighted by their respective posterior probability, is calculated. Namely, to get the predictive distribution of the function value, f_*, at x_*, one computes the average of the output from all possible linear models created by the weights w.r.t. the posterior given in Equation 3.15,

p(f_* | x_*, X, y) = ∫ p(f_* | x_*, w) p(w | X, y) dw = N(σ_n^{−2} x_*^T A^{−1} X y, x_*^T A^{−1} x_*).    (3.16)

In Figure 3.3 there is an example of Bayesian linear regression where the weights of the regression models are drawn from the prior and posterior distributions. This view of regression can be described as the weight-space view of regression and allows for limited flexibility if the output cannot be correctly described by a linear function.
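Equations 3.15 and 3.16 can be computed directly; the one-feature synthetic data, the prior Σ_p = I, and the noise level are assumptions for the example, with X stored as p × n as in the text:

```python
import numpy as np

def blr_posterior(X, y, sigma_n, Sigma_p):
    """Posterior N(sigma_n^-2 A^-1 X y, A^-1) over the weights (Equation 3.15).
    X has shape (p, n): one column per observation, as in the text."""
    A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
    A_inv = np.linalg.inv(A)
    w_bar = A_inv @ X @ y / sigma_n**2
    return w_bar, A_inv

def blr_predict(x_star, w_bar, A_inv):
    """Predictive mean and variance of f* at x* (Equation 3.16)."""
    return x_star @ w_bar, x_star @ A_inv @ x_star

# Synthetic one-feature data from y = 2x + noise (illustrative assumption)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2.0 * x + 0.1 * rng.standard_normal(100)
X = x[None, :]                               # shape (1, 100)

w_bar, A_inv = blr_posterior(X, y, sigma_n=0.1, Sigma_p=np.eye(1))
f_mean, f_var = blr_predict(np.array([0.5]), w_bar, A_inv)
print(w_bar)   # posterior mean close to the true weight 2
```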



Figure 3.3: (a) Joint density plot of values drawn from the prior distribution of the weights, w ∼ N(0, σ²I) with σ = 2. (b) 200 lines drawn with weights sampled from the prior distribution of the weights, w ∼ N(0, σ²I) with σ = 2. (c) Joint density plot of values drawn from the posterior distribution of the weights, Equation 3.15. (d) 200 lines drawn with weights sampled from the posterior distribution of the weights, Equation 3.15.

3.5.5 Feature Space Projection

A linear model is restricted to linear relationships between the response and feature variables and will perform poorly on non-linear data. To account for non-linear relationships, one can make projections into a feature space using some basis functions. For example, a scalar projection into the space of powers would lead to a polynomial regression model. A vital advantage of this is that if the projections are made onto fixed functions, the model is still linear in its parameters and therefore analytically tractable.

Let φ(x_i) = (φ_1, ..., φ_N) be a basis function that maps a p-dimensional vector into an N-dimensional feature space, and let the matrix Φ(X) be the collection of the columns φ(x_i) for all instances in the training set. The model is,

f(x_i) = φ(x_i)^T w,    (3.17)

where w now is an N × 1 vector. Following the same method as described before, it can be shown that the expression for the predictive distribution is the same as in Equation 3.16, with the exception that all X are replaced by Φ(X), i.e.,

f_* | x_*, X, y ∼ N(σ_n^{−2} φ(x_*)^T A^{−1} Φ y, φ(x_*)^T A^{−1} φ(x_*)),    (3.18)

where Φ = Φ(X) and A = σ_n^{−2} Φ Φ^T + Σ_p^{−1}. To make it more computationally efficient, the implementation is often rewritten as,

f_* | x_*, X, y ∼ N(φ_*^T Σ_p Φ (K + σ_n² I)^{−1} y,
                    φ_*^T Σ_p φ_* − φ_*^T Σ_p Φ (K + σ_n² I)^{−1} Φ^T Σ_p φ_*),    (3.19)

where φ_* = φ(x_*) and K = Φ^T Σ_p Φ are used for shorter notation. Note that this expression is equivalent to the expression for the conditional distribution presented in Equation 3.11. For a detailed explanation of the derivation, see [39].

3.5.6 The Kernel Trick

In Equation 3.19, one can see that the feature space always enters in the form φ(x)^T Σ_p φ(x′), regardless of whether x and x′ originate from the training or test data. One can also see that φ(x)^T Σ_p φ(x′) is an inner product with respect to Σ_p. Since Σ_p is positive definite, the square root Σ_p^{1/2} is well defined, and using the singular value decomposition Σ_p = U D U^T one can write Σ_p^{1/2} = U D^{1/2} U^T. By defining ψ(x) = Σ_p^{1/2} φ(x), the kernel can be written as,

k(x, x′) = ψ(x) · ψ(x′).    (3.20)

If an expression is defined only in terms of inner products in the input space, one can use the kernel trick and lift the inputs into the feature space, using the kernel k(x, x′) to replace the inner products. This kernel trick is convenient when it is cheaper to compute the kernel than the feature vectors. In Gaussian Processes, the kernel is the centre of interest rather than its corresponding feature space.

3.6 Gaussian Processes

Definition 3.6.1. A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

A Gaussian Process (GP) is determined by its mean and covariance function. Let m(x) denote the mean function and k(x, x′) the covariance function of a GP such that,

m(x) = E[f(x)],
k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))],    (3.21)

then the GP is written as,

f(x) ∼ GP(m(x), k(x, x′)).    (3.22)

The mean function, m(x), can be any real-valued function and is often set to 0 by demeaning the observations. The covariance function k(x, x′), also known as the kernel function, can be any function that satisfies Mercer's condition [39]. With a specified mean and covariance function, an implied distribution over functions is created. To sample from this distribution, let x_* be a number of input points; then a random Gaussian vector can be drawn from the distribution,

f_* ∼ N(m(x_*), k(x_*, x_*)),    (3.23)

and the generated values can be understood as functions of the inputs [39]. The covariance function, k, models the joint variability of the GP random variables, i.e., the function values, and returns the covariance between pairs of inputs. Thus, the joint distribution of the training data, f, and the test data, f_*, according to the prior distribution is,

[f, f_*]^T ∼ N(0, [[k(x, x), k(x, x_*)], [k(x_*, x), k(x_*, x_*)]]).    (3.24)

From the conditioning property of the Gaussian distribution described in Equation 3.11, the posterior distribution of the test data, f_*, is,

f_* | x_*, x, f ∼ N(k(x_*, x) k(x, x)^{−1} f,
                    k(x_*, x_*) − k(x_*, x) k(x, x)^{−1} k(x, x_*)).    (3.25)

This holds when assuming no noise in the underlying process and its distribution. To obtain a similar result with noise, as for the linear model in Section 3.5.4, one can add a noise parameter to the covariance function and instead get the prior distribution,

[y, f_*]^T ∼ N(0, [[k(x, x) + σ_n² I_n, k(x, x_*)], [k(x_*, x), k(x_*, x_*)]]).    (3.26)

When making inference, one uses the following posterior distribution,

f_* | x_*, x, y ∼ N(k(x_*, x) (k(x, x) + σ_n² I_n)^{−1} y,
                    k(x_*, x_*) − k(x_*, x) (k(x, x) + σ_n² I_n)^{−1} k(x, x_*)).    (3.27)
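Equation 3.27 with a zero mean function fits in a few lines of NumPy. The kernel below is the RBF-kernel discussed in Section 3.6.1; the five training points reuse the example function from Figure 3.4, while the noise level σ_n = 0.1 and the hyperparameter values are assumptions:

```python
import numpy as np

def rbf(a, b, sigma=1.0, length=1.0):
    """RBF covariance between two sets of one-dimensional inputs (Equation 3.28)."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return sigma**2 * np.exp(-0.5 * d2 / length**2)

def gp_posterior(x, y, x_star, sigma_n=0.1):
    """Posterior mean and covariance of f* from Equation 3.27 (zero mean function)."""
    K = rbf(x, x) + sigma_n**2 * np.eye(len(x))
    K_s = rbf(x_star, x)
    K_inv = np.linalg.inv(K)
    mean = K_s @ K_inv @ y
    cov = rbf(x_star, x_star) - K_s @ K_inv @ K_s.T
    return mean, cov

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x**2 + 5 * np.sin(x)                  # the example function from Figure 3.4
x_star = np.array([-2.0, 0.5, 2.0])
mean, cov = gp_posterior(x, y, x_star)
# The posterior mean passes close to the training points; the posterior
# variance (diagonal of cov) is largest between and outside them.
```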

In contrast to the weight-space view in Section 3.5.4, one can obtain the same results by making inference directly in function space, and the GP is therefore used to describe a distribution over functions. GPs can, for certain types of kernels, be seen as a Bayesian nonparametric generalisation of autoregressive models such as AR(p) processes, making them suitable for time series applications [10].

Figure 3.4: (a) Prior distribution of function values of a GP with a mean function that is 0 and an RBF-kernel with parameters ℓ = 1 and σ = 1. (b) Posterior distribution of function values with a mean function that is 0 and an RBF-kernel with parameters ℓ = 1 and σ = 1, conditioned on five function values drawn from f(x) = x² + 5 sin(x).

In Figure 3.4 there is an example of fitting a GP with mean function 0 and an RBF-kernel with parameters σ = 1 and ℓ = 1 on five samples from the function f(x) = x² + 5 sin(x). Figure 3.4a shows the prior distribution of the GP with a 95% confidence interval of the function values for the specified process. Figure 3.4b shows the posterior distribution, computed by Equation 3.25, of the function values with a 95% confidence interval.

3.6.1 Choice of Covariance Function

The notion of similarity is crucial, and likewise, the choice of covariance function when using a GP is essential. The choice of covariance function encodes the assumptions about the underlying function that is estimated. The most commonly used kernel is the RBF-kernel, also called squared exponential or Gaussian,

k(x, x′) = σ² exp(−‖x − x′‖² / (2ℓ²)),    (3.28)

where σ and ℓ are hyperparameters. σ can be said to control the overall variance of the random functions, and ℓ can be thought of as controlling "the distance one has to move in input space before the function value can change significantly" [39]. One can show that using infinitely many basis functions of the form,

φ_c(x) = exp(−(x − c)² / (2ℓ²)),    (3.29)

in a Bayesian linear regression with prior distribution w ∼ N(0, σ_p² I) gives rise to a GP with an RBF-kernel.



Figure 3.5: (a) Three samples from the prior distribution of different values of ` with a RBF-kernel. (b) Three samples from the prior distribution of different values of ν with a Matérn-kernel. (c) Heat map of the covariance matrix produced with a RBF-kernel for different values of `. (d) Heat map of the covariance matrix produced with a Matérn-kernel for different values of ν.

Another type of covariance function is the Matérn class, given by,

k(r) = σ² (2^{1−ν} / Γ(ν)) (√(2ν) r / ℓ)^ν K_ν(√(2ν) r / ℓ),    (3.30)

where r = ‖x − x′‖, σ, ν and ℓ are the hyperparameters, K_ν is the modified Bessel function, and Γ is the Gamma function. The interpretation of the parameters σ and ℓ is the same as for the RBF-kernel. When ν = p + 1/2, where p is a non-negative integer, the Matérn kernel is a product of an exponential kernel, i.e., an RBF-kernel with absolute distance, and a polynomial kernel of order p. For ν = 1/2 in one dimension, the Matérn kernel gives rise to an Ornstein-Uhlenbeck process, often used in financial mathematics to model, e.g., interest rates or commodity prices.

Figure 3.5 shows examples of drawing functions from the RBF-kernel and the Matérn-kernel for different values of some of the crucial hyperparameters, ℓ for the RBF-kernel and ν for the Matérn-kernel, together with the covariances produced by these prior distributions. Figures 3.5a and 3.5c show that a higher value of ℓ means a higher correlation between function values, and function values drawn from these distributions vary at different rates: a low ℓ gives rise to a very sharp curve, while a large ℓ gives a more straight curve. Figures 3.5b and 3.5d show that a higher value of ν means a higher correlation between function values, and function values drawn from these distributions have different degrees of smoothness: a low ν gives rise to a very sharp curve, while a high ν gives a smooth curve. Note that, as ν → ∞, the Matérn-kernel converges to the RBF-kernel.

3.6.2 Optimising the Hyperparameters

To find the covariance function most suitable for the problem, one can find the hyperparameters that best fit the observed data. This hyperparameter optimisation can be performed by maximising the log-likelihood of the marginal distribution. Given the p-dimensional marginal normal distribution,

p(y | µ, Σ) = 1 / √((2π)^p |Σ|) · exp(−(1/2) (y − µ)^T Σ^{−1} (y − µ)),    (3.31)

one gets the following log-likelihood,

log p(y | µ, Σ) = −(1/2) (y − µ)^T Σ^{−1} (y − µ) − (1/2) log |Σ| − (p/2) log 2π.    (3.32)

Thus, in the GP setting where x is a p × n matrix, the log-likelihood is,

log p(y | x, θ) = −(1/2) y^T K^{−1} y − (1/2) log |K| − (n/2) log 2π,    (3.33)

where K is the covariance matrix for the targets y and θ are the hyperparameters that determine the structure of the covariance function. This can be maximised with a standard numerical optimisation algorithm.
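A sketch of maximising Equation 3.33 for an RBF-kernel with an added noise term: the log-parameterisation (which keeps the hyperparameters positive), the choice of L-BFGS-B, and the synthetic data are assumptions of the example:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, x, y):
    """Negative of Equation 3.33 for an RBF-kernel plus noise;
    parameters are optimised on log scale."""
    sigma, length, sigma_n = np.exp(log_params)
    d2 = (x[:, None] - x[None, :]) ** 2
    K = sigma**2 * np.exp(-0.5 * d2 / length**2) + sigma_n**2 * np.eye(len(x))
    _, logdet = np.linalg.slogdet(K)
    alpha = np.linalg.solve(K, y)       # K^{-1} y without forming the inverse
    return 0.5 * y @ alpha + 0.5 * logdet + 0.5 * len(x) * np.log(2 * np.pi)

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 40)
y = np.sin(2 * x) + 0.05 * rng.standard_normal(40)

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]),
               args=(x, y), method="L-BFGS-B")
sigma, length, sigma_n = np.exp(res.x)
print(length)   # the learned length scale adapts to the wiggliness of sin(2x)
```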

3.7 Artificial Neural Networks

The fundamental idea behind artificial neural networks (ANN) is the artificial neuron, commonly referred to as the perceptron. A perceptron has three building blocks: the weights w = [w_1, w_2, ..., w_n], the bias b ∈ R, and a differentiable activation function f(·). If x is an n × 1 dimensional input vector, the perceptron performs the following calculation,

a = f(wx + b).    (3.34)

The main idea of the perceptron is to,

i) receive information in the form of inputs,

ii) sum the inputs,

iii) process the sum through an activation function,

iv) output the processed information, also called the activation.

Thus, the perceptron is a function that maps an n-dimensional input to a 1-dimensional output, f : R^n → R. In Figure 3.6, the three most common activation functions, σ, ReLU, and tanh, are presented.

One or multiple perceptrons can form a so-called layer. For a layer with m perceptrons/neurons, this can be expressed with the compact notation,

a = f(Wx + b)   (3.35)
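As a minimal sketch, the layer mapping of Equation 3.35 is an activation applied to an affine transformation of the input; the sizes and the tanh activation below are illustrative assumptions:

```python
import numpy as np

def dense_layer(x, W, b, f=np.tanh):
    # Eq. 3.35: apply an activation f to an affine transformation of the input.
    return f(W @ x + b)

rng = np.random.default_rng(0)
n, m = 8, 3                        # 8 inputs, 3 neurons
W = rng.normal(size=(m, n))        # one weight row per neuron
b = rng.normal(size=m)
x = rng.normal(size=n)
a = dense_layer(x, W, b)           # shape (3,), i.e. the mapping f: R^8 -> R^3
```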


Figure 3.6: (a) The sigmoid activation function, σ, takes values ∈ (0, 1). (b) The ReLU activation function takes values ∈ [0, ∞). (c) The tanh activation function takes values ∈ (−1, 1).

where a = [a1, a2, . . . , am]ᵀ, W = [w1, w2, . . . , wm]ᵀ and b = [b1, b2, . . . , bm]. Thus, a layer can be seen as the function mapping f : R^n → R^m, where one applies a non-linear activation function to an affine transformation of the input. In Figure 3.7a, the output layer is an example of a perceptron taking eight values as input and outputting a single value.

3.7.1 Multilayer Perceptron

Combining one or multiple perceptrons creates a neuron layer as described in Equation 3.35. If layers are connected, they create a multilayer perceptron network (MLP). An MLP comprises an input layer, one or more hidden layers, and an output layer. All layers can be composed of single or multiple perceptrons. All hidden layers and the output layer have weights and biases, which are the parameters one would like to optimise.

A basic MLP is depicted in Figure 3.7b. It has an input layer, one hidden layer with two neurons, and an output layer with one output neuron. Passing the data forward in the network can be written as,

a1 = f0 (W0x + b0)

a2 = f1 (W1a1 + b1) (3.36)

yˆ = f2 (W2a2 + b2) , which is equivalent to,

yˆ = f2 (W2f1 (W1f0 (W0x + b0) + b1) + b2) . (3.37)

In a general setting, this can then be extended to an MLP with L layers following the notation in Equation 3.37,

yˆ = fL (WLfL−1 (··· (W2f1 (W1f0 (W0x + b0) + b1) + b2) ··· ) + bL) . (3.38)
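The general forward pass of Equation 3.38 can be sketched as a loop over layers; the layer sizes (matching the MLP in Figure 3.7b) and the choice of activations are illustrative assumptions:

```python
import numpy as np

def mlp_forward(x, weights, biases, activations):
    # Eq. 3.38: repeatedly apply a = f_l(W_l a + b_l), layer by layer.
    a = x
    for W, b, f in zip(weights, biases, activations):
        a = f(W @ a + b)
    return a

relu = lambda z: np.maximum(z, 0.0)
identity = lambda z: z

rng = np.random.default_rng(1)
sizes = [8, 2, 1]                   # input, one hidden layer, single output
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y_hat = mlp_forward(rng.normal(size=8), weights, biases, [relu, identity])
```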


Figure 3.7: (a) Example of a perceptron as a function f : R^8 → R. (b) Example of a two-layer MLP with one input layer, one hidden layer with two neurons, and an output layer with a single neuron.

3.7.2 Training Neural Networks

Performing one iteration of Equation 3.38 is called a forward pass, because each layer "passes" the information to the next layer, from the input layer to the output layer. After the forward pass, one is interested in comparing the prediction to the true value, y. This is achieved by calculating a loss metric through a specified cost function that measures the error of the prediction. For k observations, this is calculated as,

J(W0, b0, . . . , WL, bL) = (1/k) Σ_{i=1}^{k} L(y, ŷ).   (3.39)

A typical loss metric in regression is the mean squared error loss. As presented in Equation 3.39, the cost function depends on all training examples, which by Equation 3.38 implies that it also depends on all layers. The gradient of the loss function is calculated through backpropagation, that is, the chain rule of the partial derivatives,

∂J/∂Wl, ∂J/∂bl,   (3.40)

where l denotes the lth layer. The parameters are then updated using some update rule. For gradient descent, the update rule is,

Wl = Wl − α ∂J/∂Wl,
bl = bl − α ∂J/∂bl,   (3.41)

where α is a hyperparameter called the learning rate. This update is called the backward pass.

This training algorithm can, in just two passes, the forward and backward pass, compute the gradient of the loss function with respect to every parameter in the model, thus computing how the weights and biases should be changed in the network to reduce the error given by the loss function [41]. Calculating the gradient for the full training set is not computationally efficient, especially not when the number of observations is large. Instead, the gradient is calculated for a mini-batch. Even though the gradient calculated through a mini-batch is only an estimate of the true gradient, it speeds up the learning convergence when backpropagating the errors. When all mini-batches have had a forward and backward pass, one epoch has been performed. Usually, several epochs need to be executed before the error of the loss function has converged.
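The forward pass, backward pass, mini-batches, and epochs described above can be sketched for a simple linear model with a mean-squared-error loss; the data, learning rate, and batch size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(256, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=256)

w, b = np.zeros(3), 0.0
alpha, batch_size = 0.1, 32            # learning rate and mini-batch size

for epoch in range(50):                # one epoch = one pass over all mini-batches
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        err = Xb @ w + b - yb                 # forward pass
        grad_w = 2 * Xb.T @ err / len(batch)  # backward pass: MSE gradients
        grad_b = 2 * err.mean()
        w -= alpha * grad_w                   # gradient-descent update, Eq. 3.41
        b -= alpha * grad_b
```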

3.8 Recurrent Neural Networks

Recurrent neural networks (RNNs) are a family of neural networks that specialise in processing sequences of data, e.g., audio recordings, text sequences, or a time series of stock prices. The difference compared to an MLP is that RNNs share parameters across different parts of the network. This parameter sharing makes it possible to generalise to different sequence lengths and to detect important information in multiple sequence locations. RNNs share parameters by letting each output be a function of previous outputs, with each output produced using the same update rule applied to the previous outputs. Hence the name recurrent neural networks, and hence their ability to share parameters through a deep network [16].

With the description above in mind, one can model a simple dynamic system driven by an exogenous variable xt that outputs its state ht with the parameters θ as a single cell,

ht = f(ht−1, xt; θ).   (3.42)

Unfolded through time, one can write this as,

ht = f(ht−1, xt; θ)
   = f(f(ht−2, xt−1; θ), xt; θ)
   = . . .   (3.43)
   = f(f(· · · f(h0, x1; θ) · · · , xt−1; θ), xt; θ)
   = gt(xt, xt−1, . . . , x2, x1).

The function gt takes the whole sequence as input and produces its state, but by unfolding the structure, one factorises gt into the repeated application of the

Figure 3.8: Unfolding of an RNN through time, where xt is the input, A is the function that is repeated, and ht is the output state.

function f. This unfolding can be visualised intuitively and is depicted in Figure 3.8. The unfolding procedure allows us to disregard the length of the sequences. The input will always have the same dimension, and since each step is a transition from one state to another, it is possible to use the same transition function and its parameters at every time step. Because of this, a single model f can be learned, and it can be estimated with fewer training examples than without parameter sharing.
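A minimal sketch of the recurrence in Equations 3.42–3.43, with the same parameters θ = (Wh, Wx, b) reused at every time step; the sizes and the tanh transition are illustrative assumptions:

```python
import numpy as np

def rnn_cell(h_prev, x_t, Wh, Wx, b):
    # Eq. 3.42: the same parameters theta = (Wh, Wx, b) at every time step.
    return np.tanh(Wh @ h_prev + Wx @ x_t + b)

rng = np.random.default_rng(3)
hidden, features, steps = 4, 2, 10
Wh = rng.normal(scale=0.5, size=(hidden, hidden))
Wx = rng.normal(scale=0.5, size=(hidden, features))
b = np.zeros(hidden)

h = np.zeros(hidden)
for t in range(steps):              # the unfolding of Eq. 3.43 through time
    h = rnn_cell(h, rng.normal(size=features), Wh, Wx, b)
```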

A common drawback of vanilla RNNs is that they have a hard time learning long-term dependencies. These are usually the very dependencies one would like an RNN to learn in the first place. The reason is that when training an RNN, backpropagation is made through time; when the gradients are propagated over several stages, they tend to vanish or explode1 [35]. Even if these problems do not occur and the model can learn and represent long-term dependencies, the gradient of long-term dependencies will be exponentially smaller than that of short-term dependencies. This means that signals of long-term dependencies will be hidden in varying short-term dependencies, and because of this, gradient-based optimisation becomes difficult [16]. To address this, gated RNN cells such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) cells have been introduced.

3.8.1 Long-Short Term Memory

Several different architectures are named LSTM, but the most common one is the one described here. Consider the situation of Equation 3.42 in a vanilla RNN, where there is a chain of repeating operations; here, the repeating operation is often of a very simple structure. Figure 3.8 shows an example of a usual and straightforward structure with the following operations:

i) a concatenation between the previous hidden state ht−1 and the input xt, [ht−1, xt].

Figure 3.9: A schematic overview of an LSTM-cell, where circles represent an elementwise operation, a box represents a layer of a neural network with the specified activation function, converging arrows represent a concatenation of matrices, and diverging arrows represent copying a matrix.

1Each derivative w.r.t. a parameter at a certain time will be connected to the same parameter at all other time points in the RNN; so, imagine multiplying many values that are < 1, and vice versa.

ii) the concatenated parts are fed into an FNN with a tanh activation,

ht = tanh (W [ht−1, xt] + b)

iii) and the new hidden state is now fed to the next cell and given as the output of the cell.

This simple structure gives rise to the previously described problems of training RNNs.

Instead of applying an elementwise nonlinearity to an affine transformation of the inputs recurrently, the LSTM-cell has an internal recurrence in addition to the outer recurrence. The key to the LSTM-cell is the cell state, Ct, and how it is updated. The architecture has the same repeating nature as a vanilla RNN, but the cell's internal operations are different. An LSTM-cell consists of a forget gate, an input gate, and an output gate, and its goal is to keep track of the hidden state ht and the cell state Ct. Figure 3.9 shows a schematic overview of the LSTM-cell. The step-by-step calculations will now be described.

First, the LSTM-cell decides what information is important through the forget gate. The cell looks at the previous hidden state ht−1 and the input xt and processes these through a σ-layer, so that unimportant things will have values close to 0 and important things values close to 1,

ft = σ (Wf [ht−1, xt] + bf ) . (3.44)

Next, the cell decides what new information should be stored in the cell state, Ct, and thus candidate values, C̃t, are calculated. This is done in two steps: first, the input gate decides which values will be updated through a single σ-layer, and then a tanh-layer creates a vector of the candidate values.

it = σ(Wi[ht−1, xt] + bi)
C̃t = tanh(WC[ht−1, xt] + bC)   (3.45)

The next step is to update the old cell state, Ct−1, into the new cell state, Ct, using the results from the previous steps,

Ct = ft ⊙ Ct−1 + it ⊙ C̃t,   (3.46)

where ⊙ is the Hadamard product, i.e., element-wise multiplication. Now that the cell state is updated, it is time to decide what to output. This calculation is based on a filtered version of the cell state: first, a σ-layer determines which parts of the cell state will be part of the output. Then a tanh-layer processes the cell state, and the result is multiplied by the output of the σ-gate.

ot = σ(Wo[ht−1, xt] + bo)
ht = ot ⊙ tanh(Ct)   (3.47)

By doing this, only the parts that were decided on will be output [32]. One can show that this set-up mitigates the vanishing gradient and ensures that at least one "path" does not vanish. Figure 3.10 shows a schematic overview of these steps.
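The gate equations 3.44–3.47 can be sketched directly; the weight layout (one weight block per gate acting on the concatenation [ht−1, xt]) and the sizes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(h_prev, c_prev, x_t, W, b):
    # One weight block per gate, each acting on the concatenation [h_{t-1}, x_t].
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate, Eq. 3.44
    i = sigmoid(W["i"] @ z + b["i"])          # input gate, Eq. 3.45
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate values, Eq. 3.45
    c = f * c_prev + i * c_tilde              # cell-state update, Eq. 3.46
    o = sigmoid(W["o"] @ z + b["o"])          # output gate, Eq. 3.47
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(4)
hidden, features = 3, 2
W = {k: rng.normal(scale=0.5, size=(hidden, hidden + features)) for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}
h, c = np.zeros(hidden), np.zeros(hidden)
for t in range(5):
    h, c = lstm_cell(h, c, rng.normal(size=features), W, b)
```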



Figure 3.10: Circles represent an elementwise operation, a box represents a layer of a neural network with the specified activation function, converging arrows represent a concatenation of matrices and diverging arrows represent copying a matrix. (a) Depiction of the flow of the input to the forget gate of the LSTM-cell. (b) Depiction of the flow of the input to the input gate and production of candidate values of the LSTM-cell. (c) Depiction of the flow to update the cell state of an LSTM-cell. (d) Depiction of the flow of the input to the output gate of the LSTM-cell.

3.9 Convolutional Neural Networks

Convolutional neural networks (CNNs) are neural networks that use convolution instead of matrix multiplication in at least one of their layers. Thus, they are also constructed of neurons with weights and biases that are updated through backpropagation. The CNN has been inspired by research on the visual cortex and specialises in processing data with a grid-like topology. Data with grid-like topology is, e.g., time series data, which can be seen as a one-dimensional grid with samples at specific time points, or image data, which can be seen as a two-dimensional grid of pixels [16].

Figure 3.11: Example of a one-dimensional convolution on input data of size 4 × 3 that is convolved with a filter of size 2, getting the dimension 2 × 3, yielding a feature map of size 3 × 1.

The convolutional operation used in CNNs does not exactly equal the mathematical operation convolution used in other fields; nonetheless, the intuition remains the same. The convolution layer is a layer where each neuron only processes a subset of the input data, called the receptive field. The processing is done by sliding a filter, called a kernel, over the layer's input, producing a so-called feature map or activation map from each filter. When the filter slides over the input, the element-wise product between the filter and the overlapping values is computed and summed. A single filter might not be enough for detecting all important features; therefore, several filters are usually used to detect more features of the data. An increasing number of filters increases the depth of the output [16][1].

Sliding the filter over the dataset can be done in many ways, yielding different sizes of the activation map. The step size of the sliding filter is called the stride. Now, how large will the activation map be, given that a filter is slid over the dataset with a stride of 1? Let Fq × Fq be the size of the filter, and Hq × Wq the size of the input data at the qth layer. Applying a convolution in this layer amounts to aligning the filter at Hq+1 = (Hq − Fq + 1) positions along the height, and Wq+1 = (Wq − Fq + 1) positions along the width of the input data, resulting in an output with the dimension Hq+1 × Wq+1. This leads us to a more formal definition of the convolution. The pth filter in the qth layer is denoted as a 3-dimensional tensor W^(p,q) = [w_{ijk}^{(p,q)}], where i, j and k indicate positions along the height, width and depth of the filter. The feature maps in the qth layer are represented by the 3-dimensional tensor A^(q) = [a_{ijk}^{(q)}]. Then,

a_{ijp}^{(q+1)} = Σ_{r=1}^{Fq} Σ_{s=1}^{Fq} Σ_{k=1}^{dq} w_{rsk}^{(p,q)} a_{i+r−1, j+s−1, k}^{(q)},

∀i ∈ {1, . . . , Hq − Fq + 1},
∀j ∈ {1, . . . , Wq − Fq + 1},   (3.48)
∀p ∈ {1, . . . , dq+1},

is the convolutional operation, also denoted ∗, between layer q and q + 1

[1]. In this thesis, we will only use what are called one-dimensional convolutions, which means that the filter is slid over one dimension. The filter will thus have the size Fq × Wq and only move in the direction of time. Figure 3.11 depicts an example of input data of size 4 × 3 that is convolved with a filter of size 2, getting the dimension 2 × 3, and outputs a feature map of size 3 × 1.
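A one-dimensional convolution of this kind can be sketched in a few lines, reproducing the Figure 3.11 example (4 × 3 input, 2 × 3 filter, 3 × 1 feature map); the all-ones filter is an illustrative assumption:

```python
import numpy as np

def conv1d_valid(X, F):
    # One-dimensional convolution in the CNN sense (no filter flipping):
    # slide the filter F (size f x d) along the time axis of X (size T x d).
    T, d = X.shape
    f = F.shape[0]
    return np.array([np.sum(X[t:t + f] * F) for t in range(T - f + 1)])

X = np.arange(12, dtype=float).reshape(4, 3)   # input of size 4 x 3, as in Figure 3.11
F = np.ones((2, 3))                            # filter of size 2 x 3
out = conv1d_valid(X, F)                       # feature map of size 3 x 1
```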

The convolution operation reduces the size of the data going through the layer. A common method to keep the size of the input data is to use the so-called padding. Using padding means that an appropriate number of zeros are added around the borders of the feature map to maintain the dimensions of the data. One type of padding is called same-padding, which adds zeros so that the input and output have the same size [1].

By using convolution instead of the usual matrix multiplication, one gets sparser interactions between neurons. Parameters in convolutions are shared across the input, which creates equivariant representations. Equivariant representations mean that, when processing time series data, the convolution creates a timeline showing when different features appear; if one of the inputs is shifted later in time, the convolution will output the same representation, but later in the timeline [16][31][1].

3.10 WaveNet

WaveNet is a fully probabilistic and autoregressive deep neural network developed for generating audio waves by DeepMind in 2016 [34]. The joint probability of the waveform x = {x1, . . . , xT} is factorised into a product of conditional probabilities,

p(x) = ∏_{t=1}^{T} p(xt | x1, . . . , xt−1).   (3.49)

This means that each audio sample xt is conditioned on the previous samples. The conditional distribution is modelled by a stack of several one-dimensional convolutional layers, and the model outputs a categorical distribution over the next value xt. The model is optimised to maximise the log-likelihood of the data with respect to the model parameters.

The standard WaveNet can easily be extended to what is called a conditional WaveNet, which can model the distribution p(x | h) of the audio given an additional input h. Equation 3.49 in the conditional WaveNet setting becomes,

p(x | h) = ∏_{t=1}^{T} p(xt | x1, . . . , xt−1, h).   (3.50)

By adding h as an input variable, one can guide the model's generation to produce audio with desired properties, e.g., speaker identity or text2 [34].

2Commonly done when creating audio from text, called text-to-speech.


Figure 3.12: (a) Visualisation of stacked causal convolutions with filter width 2. (b) Visualisation of stacked dilated causal convolutions with filter width 2 and an exponentially increased dilation rate [34].

The key aspect of the WaveNet is the use of causal convolutions; by using these, one can make sure the model does not violate the temporal ordering of the data. This means that the prediction at time step t, p(xt+1 | x1, . . . , xt), only depends on previous time steps. Due to the non-recurrence of the causal convolution, they are faster to train than an RNN when applied to long sequences. A drawback of causal convolutions is that they require many layers, or huge filters, to increase the receptive field. The WaveNet tackles this problem by using dilated causal convolutions. In dilated convolutions, also called convolutions with holes, the filter is applied to an area larger than its length by skipping input values with a certain step size, which lets the network work on a coarser scale. Because of this, stacked dilated convolutional layers allow the network to have a huge receptive field with few layers while still possessing low computational complexity.

In the WaveNet, the dilation rate is doubled for every layer until a specific limit and then repeated, e.g., 1, 2, 4,..., 64, 1, 2, 4,..., 64. This pattern lets the network have a receptive field of 128 for each block and can be seen as a non-linear discriminative counterpart of a single 1 × 128 convolution [34]. For an example of causal convolutions and dilated causal convolutions with exponentially increasing dilation rate, see Figure 3.12.
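The receptive-field arithmetic can be checked with a small helper (the function name is ours): each causal layer with filter width w and dilation d adds (w − 1)·d time steps, so the block 1, 2, 4, . . . , 64 with width 2 gives 1 + 127 = 128:

```python
def receptive_field(dilations, filter_width=2):
    # Each causal layer with dilation d and width w adds (w - 1) * d
    # time steps to the receptive field; the input itself contributes 1.
    return 1 + sum((filter_width - 1) * d for d in dilations)

block = [1, 2, 4, 8, 16, 32, 64]       # doubled dilation rates, as in the text
rf = receptive_field(block)            # 128 for one block
```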

Another contribution of the WaveNet model is the use of a gated activation unit that is inspired by the gating mechanisms in LSTMs [33],

z = tanh(Wf,k ∗ x) ⊙ σ(Wg,k ∗ x),   (3.51)

where ∗ is the convolution operator, ⊙ is the Hadamard product, k is the layer index, f and g denote the filter and the gate, and W is a learnable convolutional filter [34].

The WaveNet also uses both residual blocks and parameterised skip connections in the network to speed up training and make it possible to train a much deeper model. A residual connection tries to learn the residual between the input to a layer and the mapping in the subsequent layers. Let H(x) be the underlying mapping to be fit by subsequent layers, where x is the input to the first layer. Then, the residual connection tries to learn the residual function F(x) = H(x) − x, so the original function is F(x) + x. The motivation is that if the subsequent layers do not affect the output, i.e., they form the identity mapping, a deeper model should still not have a greater training error than a shallower network. In practice, however, the identity mapping may be sub-optimal and hard to learn. The residual learning approach makes it easier to recover the identity mapping if it is optimal, meaning that the subsequent layer is unnecessary [18].

Figure 3.13: Overview of the WaveNet architecture, the residual block, and the skip connections [34].

The terms residual block and skip connection are often used interchangeably, but in the WaveNet architecture they are distinguished: the residual connection is placed between each block of a dilated causal convolution, whereas the parameterised skip connections go from each layer to the layer before the output [34]. Figure 3.13 shows how this looks in the WaveNet architecture.

3.11 Batch Normalisation

Making data normalisation a part of the model architecture and performing normalisation for each mini-batch during training is called batch normalisation. In some cases, this allows increased learning rates and can speed up the convergence of the network. Batch normalisation is performed on each dimension independently. For a p-dimensional X = [x1, . . . , xp], each dimension is normalised according to,

x̂k = (xk − E[xk]) / √(Var[xk]).   (3.52)

Simply normalising the input to a layer can change the representation of the layer completely. The authors in [26] note that one needs to make sure that the transformation in the network can represent the identity mapping. For each activation, xk, a pair of parameters γk, βk is therefore introduced. These values scale and shift the normalised value, yi = γk x̂ki + βk, and are learned jointly with the other model parameters. Because of this, if optimal, one has γk = √(Var[xk]) and βk = E[xk], and the original activation is recovered.

The entire procedure of batch normalisation for dimension k, in a mini-batch consisting of m values, is described in Algorithm 1. ε is a constant added for numerical stability, and the scaled and shifted values {yi} are the output from the BN layer [26].

Input: Values of xk from a mini-batch. Parameters: γ, β. Output: {yi}.

i) Compute the mini-batch mean: µ_B = (1/m) Σ_{i=1}^{m} xi.

ii) Compute the mini-batch variance: σ²_B = (1/m) Σ_{i=1}^{m} (xi − µ_B)².

iii) Normalise: x̂i = (xi − µ_B) / √(σ²_B + ε).

iv) Scale and shift: yi = γ x̂i + β.

Algorithm 1: Batch normalisation applied to x over a mini-batch.
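Algorithm 1 translates almost line by line into NumPy; the batch size, feature count, and ε below are illustrative assumptions:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Algorithm 1: normalise each dimension over the mini-batch, then scale and shift.
    mu = x.mean(axis=0)                      # i) mini-batch mean
    var = x.var(axis=0)                      # ii) mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # iii) normalise
    return gamma * x_hat + beta              # iv) scale and shift

rng = np.random.default_rng(5)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))   # mini-batch of 64, 4 features
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```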

Chapter 4

Method

In this section, we will present the execution procedure. The outline is intended to follow the chronological sequence in which we carried out the implementation, following the order described in the previous section.

We start by introducing the mathematical notation used throughout the section before explicitly framing the two use cases. We continue by presenting the dataset and the data preparation performed before any analysis takes place. Later, we describe the general experimental design and the data post-processing tasks. Finally, the evaluation procedure and the models are described, leading us to the results presented in the upcoming section. In addition, an exploratory data analysis has been performed; the results from that analysis can be found in Appendix E.

4.1 Notation

Let X ∈ R^{T×d} denote the full dataset with T data points x_t = [x_t^1, x_t^2, . . . , x_t^d] ∈ R^d. Assume that all data points were measured at T consecutive time points τ = [t_1, t_2, . . . , t_T], where t_i < t_{i+1} holds for all i. With this setting, X can be viewed as a multivariate time series of length T in time and with d dimensions. Also, the d dimensions can be divided into C channels, where c_1, . . . , c_C denote a specific channel ranging over the dimensions corresponding to one instrument.

The missing value problem can now be stated as: any number of these data points x_i^j can be missing, i.e., their true value is unknown. Each data point can therefore be partitioned into observed and unobserved features. For the data point x_t, the observed features are x_t^o = [x_t^j | x_t^j is observed] and the unobserved/missing features are x_t^m = [x_t^j | x_t^j is missing]. With this setting, one has x_t^o ∪ x_t^m = x_t.

The problem of imputing the missing values is to estimate the true values of the missing features X^m = [x_{1:T}^m] given the observed features X^o = [x_{1:T}^o]. If one assumes the T observations to be independent, the problem can be divided into T estimation problems of p(x_t^m | x_t^o). However, in the time series setting, the observations are seldom independent in time, which makes the estimation problem of p(x_t^m | X^o) more complex.


Figure 4.1: (a) An illustration of use case one for a 120-day period of the S&P 500's price process. (b) An illustration of use case two on the S&P 500's price process. The blue line indicates observed prices, whereas the red dashed line indicates missing price points.

4.2 Problem Framing

This project focuses on two specific use cases where one would see missing data in financial time series; see Section 1.1 for the use case motivation. The use cases are further framed here to simplify model selection and evaluation. The dataset is multi-dimensional, where a specific dimension corresponds to one time series. The dataset contains daily observations of multi-dimensional financial instruments, i.e., curves and surfaces. Three points represent each observation of a curve, and nine points represent each observation of a surface. In the coming segment, we will denote the gathered dimensions corresponding to one instrument as a channel, k. Note that |k| = 3 if the channel corresponds to a curve, and |k| = 9 if it corresponds to a surface. The dataset has in total 141 dimensions over 34 channels.

4.2.1 Use Case One

The first use case concerns single or a few missing data points. The multivariate time series X ∈ R^{T×d} is fully observed for all but one channel k, meaning that all time points in the reference channels will have complete data. Thus, x_i^j ∈ X^o for all i and all j ∉ k. Further, assume that the endpoints of the incomplete time series have observed values. With such an assumption, the problem is constricted to an interpolation problem for all missing values, i.e., there will always be a known prior and posterior reference point. Figure 4.1a illustrates how the missing values are located. For use case one,

i) Missing values are randomly placed.

ii) Missing values appear in clusters of 1–3 time steps, i.e., x_{t+3}^k ∈ X^o if {x_t^k, x_{t+1}^k, x_{t+2}^k} ∈ X^m.

iii) Missing values constitute about 20% of the time series, |X^k ∈ X^m| / |X^k| ≈ 20%.

iv) Endpoints are fully observed, {x_1^k, x_T^k} ∈ X^o.

4.2.2 Use Case Two

The second use case concerns consecutive missing data points over a longer horizon at the endpoint of the time series. Again, let us assume that a single channel is missing, implying the same assumption of parallel series as in the previous use case. In this use case, assume that the time series is complete up to a certain point τp, after which there is no data for that series. With this assumption, the problem is constricted to an extrapolation problem with parallel channels for reference. Figure 4.1b illustrates how the missing values are located. For use case two,

i) Missing values are consecutively placed at the end of the series, [x_{1:τp}^k] ∈ X^o, and [x_{τp+1:T}^k] ∈ X^m.

ii) Missing values constitute about 20% of the time series, |X^k ∈ X^m| / |X^k| ≈ 20%.

Note that we have chosen the last 20% of the time series to be unobserved. This is to see whether the imputation procedure works in a highly stressed environment, which this period indeed was, as described in Section 1. If a method succeeds in a stressed environment, we argue that it should perform well under normal conditions as well.

4.3 Dataset

The analysis will be based on a dataset sourced from Refinitiv, provided by Nasdaq. The dataset stretches from January 2nd, 2014, to January 15th, 2021, and contains daily market data for several market variables. However, in this thesis, only a subset of these market variables is included. The dataset is reduced because some variables in the original dataset are almost perfectly correlated, e.g., derivatives with the same underlying asset trading on different venues. These instruments can intuitively be imputed using the highly correlated series. Although this would lead to better imputations, it is not interesting given how the problem is framed.

There are four types of market variables in the dataset: futures, FX rates, discount factors, and implied volatilities. All but the implied volatilities are given as two-dimensional data observations where the x-axis denotes time-to-maturity. The volatilities are given as three-dimensional observations with time-to-maturity on the x-axis, option delta on the y-axis, and the corresponding implied volatility value on the z-axis. The asset prices are rolled over to have constant maturity and option delta. Further, the values are interpolated and extrapolated by Refinitiv to consistently represent the data from existing quotes.

The different market variables are also stated for a wide range of maturities and option deltas. To further limit the data, we have decided to use only a restricted set of points. For the two-dimensional data, we extract the 30, 90, and 360 days-to-maturity points. For the three-dimensional data, we extract the 30, 90, and 360 days-to-maturity points and the option deltas 0.25, 0.50, and 0.75.

Thus, three points represent a curve and nine points represent a surface, as proposed by [17]. For some observations, these points do not exist. In such cases, we applied the extrapolation/interpolation technique specified by Refinitiv for that particular variable. The maturities and option deltas were chosen both with respect to information capacity and such that few points needed to be interpolated or extrapolated. Also, the option deltas generally represent an OTM, an ATM, and an ITM option.

The constricted dataset contains 35 market variables filtered into 18 futures, five FX rates, six discount factors, and six volatility surfaces. The feature dimension of the data is 141. After all data preparation tasks have been completed, the dataset consists of 1 766 point observations for each asset.

4.4 Data Preparation

4.4.1 Handling of Missing Values

The dataset needs to be complete to enable a supervised learning procedure where a ground-truth value is available at every time point. That is, each market variable needs to have a daily observation. The raw dataset contains several missing values. Most of them can be derived from the varying operating days between markets, as in our motivation for use case one1. The majority of instruments are traded at the American exchange Chicago Mercantile Exchange (CME) and are thus affected by US public holidays. We have therefore decided to remove all US holidays from the dataset; see Appendix A for a detailed view of which dates this concerns. Further, all weekend days (Saturdays and Sundays) are also removed.

After all holidays and weekend days have been removed, only a tiny fraction (0.85%) of the total dataset is missing. These data points are filled using linear interpolation or flat extrapolation on the time axis, depending on the location of the missing values in the time series. Although it concerns only a small fraction, this approach may favour the Linear Interpolation and Nearest Neighbour Imputation models explained in Sections 4.8.2 and 4.8.1.

4.4.2 Converting to Prices

The initial data has different units. The future curves are given in prices; the FX rates are given as a fraction between two currencies, the discount factors as interest rates, and the volatility surfaces as the implied volatility from the Black-Scholes formula. The final step in the data preparation is to convert all data points to prices, which makes the data easier to handle in our models and simplifies the evaluation, since it then concerns data with only one unit2. The conversion to prices is done as follows:

i) FX rates are converted to how much 1 000 in the quotation currency is worth in the base currency, i.e., each currency pair is multiplied by 1 000.

ii) Discount factors are converted to the price of a zero-coupon bond with three months to maturity and a face value of 1 000.

1Instruments with a limited historical horizon have already been removed at an earlier stage.
2Even though some prices are stated in different currencies, e.g., US dollar or Japanese yen.

iii) Implied volatilities are converted to the price of a European call option using the Black-Scholes formula, with a strike price of 1 000, the price of the underlying being 1 000, a risk-free rate of 1%, and three months to maturity.
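The conversion in iii) is the standard Black-Scholes call price with the stated parameters; the helper below is an illustrative sketch (σ = 20% is an assumed example volatility, not from the dataset):

```python
from math import log, sqrt, exp
from statistics import NormalDist

def bs_call_price(sigma, S=1000.0, K=1000.0, r=0.01, T=0.25):
    # Black-Scholes price of a European call with the conversion parameters
    # stated in the text: S = K = 1000, r = 1%, T = 3 months.
    N = NormalDist().cdf
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * N(d1) - K * exp(-r * T) * N(d2)

price = bs_call_price(sigma=0.20)   # price for a 20% implied volatility
```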

4.4.3 Training and Test Split

Most of the models require a complete dataset to train on. Therefore, the dataset X is split into two sets, the training set Xtrain and the test set Xtest, where,

Xtrain = [x_t | x_t ∈ X^o], ∀t ∈ τ,
Xtest = [x_t | x_t ∈ X^m], ∀t ∈ τ.   (4.1)

The task is later to make predictions on Xtest after a model has been trained using Xtrain. As previously described, many models operate on log returns. The training and test sets are then expressed as,

Rtrain = [r_t | x_t ∈ X^o and x_{t−1} ∈ X^o], ∀t ∈ τ,
Rtest = [r_t | x_t ∈ X^m or x_{t−1} ∈ X^m], ∀t ∈ τ.   (4.2)

4.4.4 Sliding Windows and Forward Validation

For some of the models, the dataset will be partitioned into sliding windows. E.g., if the window size is 3 and the sliding step size is 1, the sequence {1, 2, 3, 4, 5, 6} would be partitioned as {{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6}}. To do this for a dataset with p dimensions, the sliding windows are stored in a three-dimensional matrix with dimensions #windows × window size × p, where each window is a window size × p matrix.

When working with sliding windows, one usually trains and validates models using forward validation. This means that the sequence {x_1, x_2, ..., x_n} is used as explanatory variables for the coming values, {x_{n+1}, ..., x_{n+k}}. For the sequence {1, 2, 3, 4, 5, 6}, with window size 3, step size 1, and a predicted-sequence window size of 1, one would get the following,

{1, 2, 3} → {4}, {2, 3, 4} → {5}, {3, 4, 5} → {6}.
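The partitioning and forward-validation pairing above can be sketched in a few lines of NumPy (a minimal illustration; the function names are ours, not from the thesis):

```python
import numpy as np

def sliding_windows(x, window, step=1):
    """Partition a sequence into overlapping windows of fixed size."""
    return np.array([x[i:i + window] for i in range(0, len(x) - window + 1, step)])

def forward_pairs(x, window, horizon=1):
    """Forward validation: each window explains the `horizon` values after it."""
    X, y = [], []
    for i in range(len(x) - window - horizon + 1):
        X.append(x[i:i + window])
        y.append(x[i + window:i + window + horizon])
    return np.array(X), np.array(y)

seq = np.array([1, 2, 3, 4, 5, 6])
W = sliding_windows(seq, window=3)     # [[1 2 3] [2 3 4] [3 4 5] [4 5 6]]
Xw, yw = forward_pairs(seq, window=3)  # {1,2,3}->{4}, {2,3,4}->{5}, {3,4,5}->{6}
```

For a p-dimensional dataset, the same loop over a length-T × p matrix yields the #windows × window size × p array described above.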

4.5 Data Post-Processing

The problem is restricted to filling the missing values in the price process, and thus all models operating on the log return data need a post-processing step to convert the data back to the original scale. This is done by exploiting the values of the nearest observed points in the price process. Assume that x_t^d ∈ X^m and x_{t−1}^d ∈ X^o, where r_t^d and r_{t−1}^d are the corresponding log returns. Then, a forward prediction x̂_t^dF is calculated by,

x̂_t^dF = x_{t−1}^d exp(r̂_t^d).    (4.3)

If x_{t−1}^d ∈ X^m, then x̂_{t−1}^dF is used instead. For use case two, there will only be reference points at preceding time steps, and thus the predicted price at time t will be the forward prediction, x̂_t^d = x̂_t^dF. In contrast, use case one will have both

Figure 4.2: Example of the aggregation technique used when there is a prior and a succeeding reference point. Blue dots are observed data and red dots are missing.

preceding and succeeding reference prices to a missing value. Since autocorrelation in the price process applies in both directions, we introduce a more sophisticated re-scaling procedure for the interpolation setting. In addition to the forward prediction, compute the backward prediction x̂_t^dB by,

x̂_t^dB = x_{t+1}^d exp(r̂_{t+1}^d)^{−1}.    (4.4)

Similar to the forward prediction, if x_{t+1}^d ∈ X^m, then replace it by the backward prediction x̂_{t+1}^dB. If the nearest prior and succeeding reference prices are equally distant from the prediction, the final prediction is the average of the two. However, if they are not equally distant, we would like to weigh the predictions accordingly. Let us introduce a time gap matrix δ as in [6]. The time gap matrix is used to weigh predictions from opposing directions based on the duration since the last observed point. δ is defined with respect to ascending (δ^F) and descending time (δ^B) as,

δ_t^dF = { 0 if t = 1;  1 + δ_{t−1}^dF if x_{t−1}^d ∈ X^m;  1 if x_{t−1}^d ∈ X^o },
δ_t^dB = { 0 if t = T;  1 + δ_{t+1}^dB if x_{t+1}^d ∈ X^m;  1 if x_{t+1}^d ∈ X^o }.    (4.5)

Finally, linearly weight the forward and backward predictions according to the time gap matrix as,

x̂_t^d = ( (δ_t^dF)^{−1} x̂_t^dF + (δ_t^dB)^{−1} x̂_t^dB ) / ( (δ_t^dF)^{−1} + (δ_t^dB)^{−1} ).    (4.6)
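The forward pass, backward pass, and time-gap weighting of Equations 4.3–4.6 can be sketched as follows. This is a simplified illustration that assumes every missing point has an observed price on both sides (the interpolation setting of use case one); all names are ours:

```python
import numpy as np

def impute_bidirectional(x, r_hat, missing):
    """Fill missing prices from predicted log returns.
    x: prices with np.nan at missing points; r_hat[t] estimates log(x[t]/x[t-1]);
    missing: boolean mask. Assumes observed prices on both sides of each gap."""
    T = len(x)
    # forward pass (Eq. 4.3): chain predictions from the last observed price,
    # tracking the time gap d_f (Eq. 4.5, ascending time)
    fwd, d_f = x.copy(), np.zeros(T)
    for t in range(1, T):
        if missing[t]:
            fwd[t] = fwd[t - 1] * np.exp(r_hat[t])
            d_f[t] = d_f[t - 1] + 1 if missing[t - 1] else 1
    # backward pass (Eq. 4.4): chain predictions from the next observed price
    bwd, d_b = x.copy(), np.zeros(T)
    for t in range(T - 2, -1, -1):
        if missing[t]:
            bwd[t] = bwd[t + 1] * np.exp(-r_hat[t + 1])
            d_b[t] = d_b[t + 1] + 1 if missing[t + 1] else 1
    # inverse-gap weighting (Eq. 4.6)
    out = x.copy()
    for t in range(T):
        if missing[t]:
            wf, wb = 1 / d_f[t], 1 / d_b[t]
            out[t] = (wf * fwd[t] + wb * bwd[t]) / (wf + wb)
    return out

x = np.array([100.0, np.nan, np.nan, 110.0])
filled = impute_bidirectional(x, r_hat=np.zeros(4), missing=np.isnan(x))
```

With zero predicted returns, the point next to the left reference leans towards 100 and the point next to the right reference leans towards 110, as intended.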

4.6 Experiment Design

Through the data preparation task, we have obtained a complete dataset of 1766 observations. The problem is now framed as a supervised learning task, and we can withhold data points from the dataset to synthetically create datasets that fit the two use cases. The time points with withheld data points naturally form the test set.

Input: the model f(·), the dataset X. Parameters: the hyperparameters θ. Output: X̂.

i) Preprocess the data: X*.

ii) Split the data: {Xtrain, Xval, Xtest}.

iii) Standardise/normalise the data: {X*train, X*val, X*test}.

iv) Optimise the hyperparameters θ according to a criterion on X*val: θ*.

v) Fit the model with θ* using {X*train, X*val}.

vi) Predict the missing values using X*: X̂*test.

vii) Post-process and return the predictions: X̂test.

Algorithm 2: Schematic view of the imputation procedure.

Each model predicts the missing values in the test set, which are later evaluated against the withheld ground truth values, as described in Section 4.4.3. In total, 70 datasets are created, one per combination of channel (35) and use case (2). Algorithm 2 presents the general scheme of the execution procedure for a particular model.

4.7 Evaluation

The performance metrics can be divided into two categories. The first category assesses the deviation from the actual price data and will be measured by the Mean Absolute Scaled Error (MASE). The second category focuses on cross-day price movements and how well their distribution is preserved, measured by the Relative Deviation of Value at Risk (RDVaR) and the Relative Deviation of Expected Shortfall (RDES). It is essential to understand that a model that performs well in one category does not implicitly perform well in the other. Therefore, it is up to the business case to determine which model suits it best. The aim is to make a general statement of model performance. Therefore, the imputation procedure is performed for all channels, and the model performance is aggregated to a general metric. Since the channels are represented by price series of four different asset types, the metrics will also be aggregated on an asset-type level, acknowledging that different asset types have different movement characteristics affecting model performance. The two use cases are evaluated separately, and cross-use-case performance comparisons are left for the discussion part of this report.

4.7.1 Mean Absolute Scaled Error

It is crucial to account for the variety in scale and variance of the price series when aggregating metrics. Otherwise, there is a considerable risk of the aggregated metric being

heavily biased towards the channels with large scale and variance. Therefore, to assess the estimated price deviation, we use the unit-less, scale-free metric Mean Absolute Scaled Error (MASE) as proposed by [25]. The MASE metric compares the Mean Absolute Error (MAE) of a model with the MAE of the naive model. In our case, the naive model is chosen to be the Nearest Neighbour Imputation (NNI). Assume we have estimated the missing values of channel k with model m by X̂_m^k, and the corresponding prediction by the naive model is X̂_NNI^k. Then, MASE is calculated by,

MASE_m^k = ( Σ_{t=1}^T |x̂_{t,m}^k − x_t^k| ) / ( Σ_{t=1}^T |x̂_{t,NNI}^k − x_t^k| ) = MAE_m^k / MAE_NNI^k.    (4.7)

MASE is calculated for all channels k ∈ {c_1, ..., c_C} and will be analysed on an overall and an asset-specific level. MASE can only take non-negative values, and the smaller the value, the better the model performs relative to the naive one.
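Equation 4.7 is a one-liner in practice; the sketch below uses hypothetical numbers in which the model halves the naive model's absolute error:

```python
import numpy as np

def mase(x_true, x_model, x_naive):
    """Eq. 4.7: MAE of model m scaled by the MAE of the naive NNI model,
    evaluated on the withheld (imputed) points of a channel."""
    return np.mean(np.abs(x_model - x_true)) / np.mean(np.abs(x_naive - x_true))

score = mase(np.array([100.0, 101.0, 102.0]),
             np.array([100.5, 101.5, 102.5]),   # model: error 0.5 per point
             np.array([101.0, 102.0, 103.0]))   # naive: error 1.0 per point
print(score)  # 0.5
```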

4.7.2 Relative Deviation of VaR

The Relative Deviation of VaR (RDVaR) is used as an evaluation metric to assess model-generated price movements. RDVaR tells how large the relative deviation between the estimated and the actual VaR is. VaR is calculated following the procedure explained in Section 2.1.1. Since our data is already given at the price scale, one can interpret it as point observations of the price function. The RDVaR is calculated for a portfolio with positions in the asset with missing values, i.e. the specific channel being imputed. In the VaR calculation we assume that the current total portfolio value is 1 million (v_{t0} = 1 000 000) and the quantity of each "asset" in a channel is set to yield an equal contribution to the total portfolio value (w^i = v_{t0} / (d x_{t0}^i)). In the VaR calculation, we set the confidence level to α = 99%, the time horizon of a scenario to T = 1, and the historical horizon to the full data set, which is approximately h = 7 years.

Assume X̂_m^k is the estimated values of channel k with model m. To calculate RDVaR, start by calculating the actual VaR and the estimated VaR̂_m. Then compute,

RDVaR = (VaR̂_m − VaR) / VaR.    (4.8)

4.7.3 Relative Deviation of ES

The Relative Deviation of ES (RDES) is similar to RDVaR but yields a better evaluation of extreme scenarios. ES is calculated following the procedure explained in Section 2.1.2. The preliminary assumptions made for RDVaR hold for RDES, and the ES parameters are set equal to those of the VaR. To calculate RDES, start by calculating the actual ES and the estimated ÊS_m. Then compute,

RDES = (ÊS_m − ES) / ES.    (4.9)
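Assuming a plain historical-simulation VaR and ES (the thesis computes them per Sections 2.1.1–2.1.2, not shown here), the two relative-deviation metrics could be sketched as follows, on hypothetical P&L scenarios:

```python
import numpy as np

def var_es(pnl, alpha=0.99):
    """Historical-simulation VaR and ES: VaR is the alpha-quantile of the
    loss distribution, ES the mean loss beyond VaR (losses reported positive)."""
    losses = -np.asarray(pnl)            # P&L scenarios -> losses
    var = np.quantile(losses, alpha)
    es = losses[losses >= var].mean()
    return var, es

def relative_deviation(estimated, actual):
    """RDVaR / RDES as in Eqs. 4.8 and 4.9; negative = underestimation."""
    return (estimated - actual) / actual

pnl = -np.arange(1.0, 101.0)             # 100 hypothetical loss scenarios
var99, es99 = var_es(pnl)
rd = relative_deviation(0.9 * var99, var99)  # a 10% underestimation -> -0.1
```

A negative RDVaR/RDES thus flags the systematic risk underestimation discussed in the results.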

4.8 Models

4.8.1 Nearest Neighbour Imputation

The Nearest Neighbour Imputation (NNI) method is applied to both use case one and use case two. It is also referred to as the naive model and serves as a baseline to compare the results with. Assume that x_t^j is missing and τ^{j,o} are the time points of the observed values for dimension j. NNI then estimates x_t^j to be,

x̂_t^j = x_{t*}^j, where t* = argmin_{τ_i ∈ τ^{j,o}} |t − τ_i|.    (4.10)

If {x_{t−1}^j, x_{t+1}^j} ∈ X^o, i.e. both the previous and the next data point in time are observed, then the estimate becomes x̂_t^j = x_{t−1}^j. Thus, the naive model favours the prior reference point when the prior and succeeding points are equally distant.
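A minimal NumPy sketch of Equation 4.10 (function name ours); `np.argmin` returns the first index on ties, which matches the prior-point preference:

```python
import numpy as np

def nni(x, missing):
    """Nearest Neighbour Imputation (Eq. 4.10): copy the observed value
    closest in time; ties resolve to the earlier (prior) point."""
    obs = np.flatnonzero(~missing)
    out = x.copy()
    for t in np.flatnonzero(missing):
        out[t] = x[obs[np.argmin(np.abs(obs - t))]]
    return out

x = np.array([1.0, np.nan, np.nan, 4.0])
print(nni(x, np.isnan(x)))  # [1. 1. 4. 4.]
```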

4.8.2 Linear Interpolation

The Linear Interpolation (LI) method is applied to the use case one setting, where all missing values are bound to the interpolation problem. Assume x_t^j is missing, where {x_{t−a}^j, x_{t+b}^j} ∈ X^o are the nearest previous and next observed data points in time. This gives an interpolation problem where t − a < t < t + b. The LI method estimates x_t^j to be,

x̂_t^j = x_{t−a}^j + a (x_{t+b}^j − x_{t−a}^j) / (a + b).    (4.11)
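Equation 4.11 is exactly what `np.interp` computes on the time axis (a minimal sketch; the function name is ours):

```python
import numpy as np

def linear_interp(x, missing):
    """Linear interpolation (Eq. 4.11): np.interp weighs the nearest observed
    neighbours by their time distances a and b."""
    t = np.arange(len(x))
    out = x.copy()
    out[missing] = np.interp(t[missing], t[~missing], x[~missing])
    return out

x = np.array([1.0, np.nan, np.nan, 4.0])
print(linear_interp(x, np.isnan(x)))  # [1. 2. 3. 4.]
```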

4.8.3 Lasso

The Lasso method aims to exploit any long-term linear relationship between the target channel and the reference channels. By long-term relationship, we assume constant co-movements with other channels over time and disregard any potential temporal correlation changes. Lasso is applied to both use cases but with some differences in how it operates. Thus, we are trying to find a linear function such that,

r_t^k = f(X_t^o) | λ.

For use case one, the Lasso operates on the log returns. Now assume that channel k is incomplete. Following the outline in Algorithm 2, the data is split into {Rtrain, Rtest}, with the training data additionally partitioned into 10 equally sized subsets, R_val^s, ∀s ∈ {1, ..., 10}. Each value in λ ∈ {0, 4^{−19}, 4^{−18}, ..., 4^0, 4^1, 4^2} is 10-fold cross-validated using the partition R_val^s, where R^k is set as the response variable and R^o as the explanatory variables. The optimised hyperparameter λ* is chosen as the one that yields the highest average coefficient of determination, R², which is equivalent to minimising the RSS. A complete Lasso model is then fitted on Rtrain with λ*. Finally, predict the log returns r̂_t^k for all r_t ∈ Rtest, and compute the final prediction x̂_t^k, in the original price unit, by aggregating the forward and backward pass predictions as described in Section 4.5.

The main difference in the use case two Lasso algorithm is that it also has an additional model operating on the price process. Since use case two concerns

consecutive predictions of the return process, the errors will be aggregated through time. Even if a model f(·) is an unbiased estimator of r_t^k, the error term behaves like a random walk when aggregated to the price process and can cause it to take unrealistic values. For use case one, we apply an approach sourcing information from both the previous and future values due to the high autocorrelation in the price process, see Section 4.5. In use case two, there is no succeeding reference value to steer the price process towards. Therefore, a model is built on the prices, creating a reference point at the end of the time series.

For use case two, all the steps of use case one are performed. But in parallel, a model is built on the prices, {Xtrain, Xtest}. As before, the training data is partitioned into 10 equally sized subsets, X_val^s, ∀s ∈ {1, ..., 10}. Each value in λ ∈ {0, 4^{−19}, 4^{−18}, ..., 4^0, 4^1, 4^2} is 10-fold cross-validated using the partition X_val^s, where x^k is set as the response variable and X^{∀j≠k} as the explanatory variables. The optimised hyperparameter λ* is chosen as before. Using the final model, estimate the last price of the series, x̂_T^k, using X_T^o. Set x̂_T^k as the end-point reference, make forward and backward pass predictions using r̂_t^k, and aggregate them to a final prediction per the aggregation formula presented for use case one in Section 4.5.
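A scikit-learn sketch of the use case one Lasso step, on synthetic stand-in data (the grid point λ = 0 is omitted here because scikit-learn's coordinate-descent Lasso expects a positive penalty; exact data handling in the thesis differs):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-ins: R_other plays the role of the reference-channel log
# returns R^o, r_k the incomplete target channel R^k.
rng = np.random.default_rng(0)
R_other = rng.normal(size=(200, 5))
r_k = R_other @ np.array([0.5, -0.3, 0.0, 0.0, 0.2]) + 0.01 * rng.normal(size=200)

# lambda grid {4^-19, ..., 4^2} with 10-fold CV, R^2 as the criterion
grid = {"alpha": [4.0 ** p for p in range(-19, 3)]}
search = GridSearchCV(Lasso(max_iter=10_000), grid, cv=10, scoring="r2")
search.fit(R_other, r_k)
r_hat = search.best_estimator_.predict(R_other)  # predicted log returns
```

The predicted returns `r_hat` would then be rescaled to prices via the forward/backward aggregation of Section 4.5.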

4.8.4 Random Forest

One weakness of the Lasso method is that it can only learn linear relationships between the response and the reference channels. To account for any potential non-linear relationship, a Random Forest (RF) model is applied to use case one. It is only applied to use case one due to its inability to extrapolate from the information in the training data, i.e., an RF cannot predict beyond the training data range, which intuitively does not suit use case two.

In our experiment, the number of trees is set to 500, which has shown to be sufficient with respect to overfitting. The number of randomly drawn features considered at each split is set to 11, which is approximately the square root of the number of features [14]. The RF operates on the log returns and follows the execution procedure presented in Algorithm 2. First, the data is split into {Rtrain, Rtest}. Then, Rtrain is further partitioned into 10 equally sized subsets, R_val^s, ∀s ∈ {1, ..., 10}. The subsets are used to tune the maximum tree depth, l, of the decision trees. Each l ∈ {1, 2, 4, ..., 32} is 10-fold cross-validated using the partition R_val^s, where R^k is set as the response variable and R^o as the explanatory variables. The optimised hyperparameter l* is chosen as the one that yields the highest average coefficient of determination, R², which is equivalent to minimising the MSE. A final RF model is then fitted on Rtrain with l*. With the final model, predict the log returns r̂_t^k for all r_t ∈ Rtest, and compute the final prediction x̂_t^k, in the original price unit, by aggregating the forward and backward pass predictions described in Section 4.5.
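The depth tuning can be sketched with scikit-learn on a synthetic non-linear target. This is a down-scaled illustration: the thesis uses 500 trees, 11 features per split and 10-fold CV, whereas the sketch uses 100 trees, sqrt(p) features and 5 folds to stay fast:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic non-linear target standing in for the log-return data
rng = np.random.default_rng(1)
R_other = rng.normal(size=(300, 8))
r_k = np.sin(R_other[:, 0]) + 0.1 * rng.normal(size=300)

rf = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
search = GridSearchCV(rf, {"max_depth": [1, 2, 4, 8, 16, 32]}, cv=5, scoring="r2")
search.fit(R_other, r_k)
l_star = search.best_params_["max_depth"]  # tuned maximum tree depth
```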

4.8.5 Gaussian Process

A Gaussian Process (GP) was chosen because of its convenience in switching between inter- and extrapolation, its flexible model structure, and its probabilistic and non-parametric approach. When modelling a GP, one needs to specify the mean function, m(x), and the covariance function, k(x, x′), of the process. These determine how the function values, e.g., the log returns, modelled by the GP depend on each other through time. Here, the mean function is set to 0 and the covariance function is chosen as the Matérn kernel with ν = 0.2 for use case one and ν = 1.5 for use case two. For both use cases, the input data was assumed to be noise-free.

For use case one, this structure on the mean and covariance function means that function values close in time and similarity affect each other less, and the shape of the implied process is rougher than for higher values of ν. For use case two, values close in time and similarity affect each other more than under the use case one structure.³

Assume that channel k is incomplete, following the outline in Algorithm 2, for use case one. The GP is used on the log returns, partitioned only into {Rtrain, Rtest}. The data is then standardised according to,

R*_(·) = (R_(·) − µtrain) / σtrain.

Here, µtrain is the sample mean and σtrain is the sample standard deviation of the training data. Following the proposed procedure in [39], optimise the hyperparameters ℓ and σ of the Matérn kernel by maximising the log-likelihood of the marginal distribution of the training data R*train, as described in Section 3.6.2. This problem is not always convex and may result in local optima; to overcome this, the optimisation is restarted 10 times. Fit the GP on the training data R*train and predict the function values r̂_t^k for all r_t ∈ R*test, with the optimised parameters ℓ*, σ*, by computing the posterior distribution, Equation 3.25, according to Algorithm 2.1 in [39]. Compute the final prediction x̂_t^k, in the original price unit, by aggregating the forward and backward pass predictions described in Section 4.5.

For use case two, the algorithm for the GP is slightly different, modelling the prices directly. Thus, assuming channel k is incomplete, split the data into {Xtrain, Xtest} and standardise it according to,

X*_(·) = (X_(·) − µtrain) / σtrain.

Here, µtrain is the sample mean and σtrain is the sample standard deviation of the training data. Optimise the hyperparameters ℓ and σ as for use case one. Then, fit the GP on the training data X*train and predict the function values x̂_t^k for all x_t ∈ X*test, with the optimised parameters ℓ*, σ*, by computing the posterior distribution, Equation 3.25, according to Algorithm 2.1 in [39].
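The use case two GP (zero mean, Matérn ν = 1.5, noise-free assumption, 10 optimisation restarts) can be sketched with scikit-learn on a synthetic standardised price series; data and variable names are illustrative only:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

# Synthetic training series on a time grid, standing in for X*_train
t_train = np.arange(40, dtype=float).reshape(-1, 1)
x_train = np.sin(t_train.ravel() / 5.0)
mu, sigma = x_train.mean(), x_train.std()

# zero mean, sigma^2 * Matern(l) kernel; l and the output scale are tuned by
# maximising the marginal log-likelihood, restarted 10 times
kernel = ConstantKernel(1.0) * Matern(length_scale=1.0, nu=1.5)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10,
                              alpha=1e-10)  # (near) noise-free observations
gp.fit(t_train, (x_train - mu) / sigma)

t_test = np.array([[40.0], [41.0]])      # extrapolation, as in use case two
x_hat = gp.predict(t_test) * sigma + mu  # posterior mean, back on price scale
```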

³For an example of function values drawn from a Matérn kernel with σ = 1, ℓ = 1 and ν = 1.5, and its corresponding covariance matrix, see Figures 3.5c and 3.5d.

4.8.6 Multilayer Perceptron

The Multilayer Perceptron (MLP) is a fully connected artificial neural network model applied to use cases one and two. The MLP assumes a relationship between the

target and the reference channels that is independent of time. Nevertheless, the MLP does not make any further assumptions about the mapping function. Thus, the MLP aims to find a non-linear function such that,

r_t^k = f(X_t^o).    (4.12)

The model execution procedure follows the outline in Algorithm 2. The data is initially converted to log returns and split into {Rtrain, Rtest}. The training set is further partitioned into 5 equally sized subsets, R_val^s, ∀s ∈ {1, ..., 5}. The input to the model is the reference channel data, R^o, and the output is an estimate of R^k. The specific configuration of the MLP model is,

i) Linear activation function at the input and output layer.

ii) ReLU activation function in all the hidden layers.

iii) 2 hidden layers with 128 neurons each, implying approximately 34 500 trainable model parameters in total.

iv) A dropout layer added between the input layer and all hidden layers. The dropout layers are added to lower the risk of overfitting and improve the learning of a large network [21].

v) The Adam optimisation algorithm [28] used as the optimiser, with mean squared error as the loss function of the model.

The subsets R_val^s are used to tune the hyperparameters. A grid search with 5-fold cross-validation is applied to estimate their optimal values. The hyperparameters, θ, and their corresponding restricted sets of values are,

i) Learning rate ∈ {10^{−3}, 10^{−4}}

ii) Hidden layer dropout rate ∈ {0.2, 0.5}

iii) Input layer dropout rate ∈ {0.1, 0.2}

iv) Epochs ∈ {100, 250, 500}

v) Batch size ∈ {32, 64}

The optimised hyperparameter θ* is chosen as the one yielding the highest average coefficient of determination, R², which is equivalent to minimising the RSS. A final MLP model is then fitted on Rtrain with θ*. With the final model, predict the log returns r̂_t^k for all r_t ∈ Rtest, and compute the final prediction x̂_t^k, in the original price unit, by aggregating the forward and backward pass predictions described in Section 4.5.
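An approximate scikit-learn sketch of this setup on synthetic data. Note the substitution: `MLPRegressor` has no dropout layers, so only the learning rate and batch size are grid searched here; the two 128-neuron ReLU layers, Adam optimiser and MSE loss do match the configuration above:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-ins for the reference (R^o) and target (R^k) log returns
rng = np.random.default_rng(2)
R_other = rng.normal(size=(400, 10))
r_k = np.tanh(R_other[:, 0]) - 0.5 * R_other[:, 1] + 0.05 * rng.normal(size=400)

mlp = MLPRegressor(hidden_layer_sizes=(128, 128), activation="relu",
                   solver="adam", max_iter=300, random_state=0)
grid = {"learning_rate_init": [1e-3, 1e-4], "batch_size": [32, 64]}
search = GridSearchCV(mlp, grid, cv=5, scoring="r2")  # R^2 as the criterion
search.fit(R_other, r_k)
```

A full reproduction with dropout and epoch tuning would instead use a framework such as Keras or PyTorch.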

The procedure described above holds for use case one. For use case two, there is an additional MLP model trained on the price process. That model is used to estimate the last price, motivated in the same way as in Section 4.8.3. The MLP for the price process follows the same procedure as above but uses the prices X instead of R. When an estimate of x_T^k is obtained, it is used as a reference for the backward prediction, which later is used to obtain a final prediction using the aggregation technique explained in Section 4.5.

4.8.7 WaveNet

Financial time series are known to be noisy. At the same time, strong signals have a short duration, and a long history of data can increase this difficulty due to the ever-changing financial environment described in Section 2.3.1. Even so, there exist other financial time series that can be strongly correlated. By implementing an architecture like the WaveNet, the model tries to exploit the internal autoregressive properties of the time series and uses conditioning to reduce the noise in these short-duration signals. Here, the model conditions a forecast of a time series on its endogenous history and multiple exogenous time series, in an attempt to improve the quality of the predictions and learn long-term temporal dependencies between the different channels.

The WaveNet model here is built like a sequence-to-sequence model, standard in, e.g., natural language processing. This means that the model takes a sequence of time steps as its input, encodes this input into a format that the model can use, and then decodes the encoded input and outputs another sequence. Here, the model encodes and decodes sequences of the same size, i.e. if the input has five time steps, the output will too. The idea is to capture the temporal dynamics that might exist and condition the next time steps on the earlier time steps in an autoregressive fashion. However, since a critical signal can affect different time series at different time points, the model is allowed to predict the same value several times. Furthermore, since financial time series can be strongly correlated with other time series, the model's output is also conditioned on the exogenous time series in our data set, i.e., they are assumed observable. This means that the following sequence is conditioned such that, x_{t+n}^k, ..., x_{t+1}^k | x_t^k, ..., x_{t−n−1}^k, X_{t+n}^o, ..., X_{t+1}^o. We are thus trying to find a function f such that,

x_{t+n}^k, ..., x_{t+1}^k = f(x_t^k, ..., x_{t−n−1}^k, X_{t+n}^o, ..., X_{t+1}^o).

To achieve this, the model is centred around dilated causal convolutions whose dilation rates are repeated as 1, 2, ..., time-steps, 1, 2, ..., time-steps, where time-steps is 2^k for some k. Each of these dilated causal convolutions makes up a block, where each block is structured as follows:

i) A pre-processing layer with a one-dimensional convolution with "same"-padding and a kernel size of 1.

ii) A batch normalisation layer.

iii) A gated activation with gating and filtering layers consisting of one-dimensional convolutions, as described in Equation 3.51, with "causal"-padding, kernel size 2, and a dilation rate depending on which block it is.

iv) A post-processing layer with one-dimensional convolution with ”same”-padding and a kernel size of 1.

v) A residual connection between the outputs of step i) and iv) through addition.

vi) A concatenation of the output of step iv) to a list consisting of the equivalent outputs of the other blocks, i.e., the skip connection.

After all blocks, the outputs of all skip connections are added together, followed by a processing layer consisting of a one-dimensional convolution with "same"-padding and a kernel size of 1, a dropout layer [21] with 20% dropout, and finally the output layer: a one-dimensional convolution with "valid"-padding, a kernel size of 1, and a linear activation function. For all but the gated activation and the output layer, the ReLU activation function is used. Tests were performed with other activation functions and another gated activation that was supposed to work better for financial data, as described in [4], without improving the performance. See Appendix D for an example of the network architecture with 8 time steps and 32 filters.
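To make the central building block concrete, a dilated causal convolution with kernel size 2 can be written out in plain NumPy (an illustrative single-channel sketch, not the thesis implementation):

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """Dilated causal 1-D convolution, kernel size 2 ('causal' padding):
    y[t] = w[0]*x[t - dilation] + w[1]*x[t], with zero-padding before the
    series start, so y[t] never depends on future values."""
    x_shift = np.concatenate([np.zeros(dilation), x[:-dilation]])
    return w[0] * x_shift + w[1] * x

x = np.arange(1.0, 9.0)  # 8 time steps
y = dilated_causal_conv1d(x, w=np.array([1.0, 1.0]), dilation=2)
# y[t] = x[t-2] + x[t]; the first two outputs only see the zero-padding
print(y)  # [ 1.  2.  4.  6.  8. 10. 12. 14.]
```

Stacking such layers with dilation rates 1, 2, 4, ... is what gives the WaveNet its exponentially growing receptive field while preserving causality.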

The loss function is chosen as the mean squared error (MSE), the stochastic optimisation method Adam [28] is chosen as the optimiser, the weights are initialised with He-initialisation [19], ℓ2-regularisation⁴ is used as weight regularisation, and the model is trained with early stopping [38], meaning that training is aborted when the validation loss stops decreasing. The weights corresponding to the best validation loss are restored.

All values are assumed observed during training, meaning that the model will always be trained with actual values; this is not true in the prediction stage. Since the predictions of this model are fed back into the model and the same time step can be included several times, an aggregation algorithm for the predictions has been created. This procedure is described in Algorithm 3.

The model has several hyperparameters, θ, and it would not be feasible to search for the best ones for each missing channel. Thus, the hyperparameters optimised for each channel are the number of time steps, the number of filters in each one-dimensional convolution in the blocks, the λ in the ℓ2-regularisation for each layer in the blocks, and the learning rate. After initial searches for suitable candidates, the following were chosen:

i) time-steps ∈ {8, 16}

ii) number of filters ∈ {32, 64}

iii) λ ∈ {10^{−3}, 10^{−4}}

iv) learning rate ∈ {10^{−3}, 10^{−4}}

The optimised hyperparameters for each channel are chosen as the θ that minimises the MASE, as described in Equation 4.7, on the validation set, and the chosen ones, θ*, are saved for later use.

The model is only implemented for use case two and operates solely on the price process.⁵ Assume that channel k is incomplete. Following the outline in Algorithm 2, the data is processed as described in Section 4.4.4 with a step size of 1, partitioned into {Xtrain, Xval, Xtest} with the proportions 70/10/20, and then normalised

⁴ℓ2-regularisation is the corresponding regularisation as described in Section 3.3, but with the ℓ2-norm instead of the ℓ1-norm.
⁵There was testing on the log returns as well, but with bad results.

Input: the model f(·), the last known observation X_{τp}^o, and the later sequences that are observed, X_{τp+1:T}^o. Parameters: the sequence length s and the number of sequences n. Output: x̂_{τp+1:T}^k.

i) Create a temporary matrix X̂ of size n × |τp + 1 : T|.

ii) Predict the first sequence, x̂_{τp+1:τp+s+1}^k = f(X_{τp}^o).

iii) Add x̂_{τp+1:τp+s+1}^k to X̂ at the corresponding row and columns.

iv) Then, for all n − 1 remaining sequences:

(a) Average the previous predictions in X̂ to be used for the next prediction, x̂_{t−1:t+s−1}^k.

(b) Predict the next sequence, x̂_{t:t+s}^k = f(x̂_{t−1:t+s−1}^k, X_{t:t+s}^o).

(c) Add x̂_{t:t+s}^k to X̂ at the corresponding row and columns.

v) Compute x̂_{τp+1:T}^k by taking the average of the predictions per time point, i.e., the average of each column in X̂.

Algorithm 3: Predictions of autoregressive sequences with exogenous sequences.

according to,

X*_(·) = (X_(·) − X_train^min) / (X_train^max − X_train^min).

The model is trained with θ* according to the previously described procedure. When the model has finished training, predict the missing values of the test set using Algorithm 3. Since the model performs all predictions over τp + 1, ..., T by itself, there is a chance that the price process is offset; thus, the post-processing here adjusts the predictions to start at the last observed value. Lastly, create the new data set, X̂, including x̂_t^k for all t ∈ τp + 1 : T.

4.8.8 SeriesNet

Continuing the motivation for the WaveNet model, this LSTM-enhanced WaveNet model, called SeriesNet, tries to further exploit the temporal correlations between financial time series and is inspired by [42]. In the WaveNet model, it was assumed that the best way of sourcing information for the predicted time points was to condition on exogenous data at the same time points as the predictions. This approach may be overly optimistic due to the complex behaviour of financial time series described in Section 2.3.1. To deal with this, an RNN with LSTM cells is included in the model, where the WaveNet part now tries to model,

x_{t+n}^k, ..., x_{t+1}^k | x_t^k, ..., x_{t−n−1}^k, X_t^o, ..., X_{t−n−1}^o

to fully leverage the autoregressive properties of the dilated causal convolutions, while the RNN tries to encode a state of the exogenous data for the missing data, i.e.,

h_{t+n}^k, ..., h_{t+1}^k | X_{t+n}^o, ..., X_{t+1}^o.

These parts are then either added or concatenated and fed into an MLP. We are thus trying to find a function f such that,

x_{t+n}^k, ..., x_{t+1}^k = f(x_t^k, ..., x_{t−n−1}^k, X_{t+n}^o, ..., X_{t−n−1}^o).

The WaveNet part is built as before, and the previously optimised hyperparameters are assumed optimal for this task. However, the RNN is built with two layers, both with tanh-activation, where the number of cells in the first layer equals the number of time steps and the second layer has the same number of cells as there are time series belonging to channel k. The RNN was deliberately kept relatively "small" due to the risk of severely overfitting the training data with a large network. Another regularisation of the RNN was dropout [21] when updating the recurrent state, with a dropout ratio of 20%.

The loss function is chosen as the mean squared error (MSE), the stochastic optimisation method Adam [28] is chosen as the optimiser, the weights are initialised with He-initialisation [19], ℓ2-regularisation is used as weight regularisation for the WaveNet part, and the model is trained with early stopping [38], meaning that training is aborted when the validation loss stops decreasing. The weights corresponding to the best validation loss are restored.

Like the standalone WaveNet model, this is an autoregressive model. During training, the values are always assumed observed, meaning that the model will always be trained with the true values; this is not true in the prediction stage. The previous algorithm, Algorithm 3, has been updated accordingly for the new model, but the main steps are the same.

Again, the model has several hyperparameters, θ, and after initial testing, the hyperparameters that needed tuning for each channel were the learning rate, the ℓ2-regularisation for each layer in the WaveNet model, and whether the outputs of both parts should be concatenated or added. If they are concatenated, there is an MLP with one layer, with the number of neurons equal to the number of time steps, that tries to learn how to aggregate the state, h_t^k, from the RNN and the predictions, x_t^k, of the WaveNet. If they are added, the MLP tries to weigh this aggregate into a useful prediction. Note that both approaches can be seen as residual learning, where the RNN tries to learn changes in the market in τp + 1 : T and adjust the predictions of the WaveNet.

The candidates for the learning rate and the regularisation parameter λ are the same as before. The optimised hyperparameters for each channel are chosen as the θ that minimises the MASE, as described in Equation 4.7, on the validation set, and the chosen ones, θ*, are saved for later use.

The model is only implemented for use case two and operates solely on the original price process. Assume that channel k is incomplete. Following the outline

in Algorithm 2, the data is processed as described in Section 4.4.4 with a step size of 1, partitioned into {Xtrain, Xval, Xtest} with the proportions 70/10/20, and then normalised according to,

X*_(·) = (X_(·) − X_train^min) / (X_train^max − X_train^min).

The model is trained with θ* according to the previously described procedure. When the model has finished training, predict the missing values of the test set using an updated version of Algorithm 3. Since the model performs all predictions over τp + 1, ..., T by itself, there is a chance that the price process is offset; thus, the post-processing here adjusts the predictions to start at the last observed value. Lastly, create the new data set, X̂, including x̂_t^k for all t ∈ τp + 1 : T.

Chapter 5

Results

In this section, the results from the project are presented. There are three performance measures, MASE, RDVaR, and RDES, and their interpretations are as follows.

i) MASE < 1: better than the naive model. RDVaR | RDES < 0: risk metric underestimation.

ii) MASE = 1: equal to the naive model. RDVaR | RDES = 0: perfect risk metric reconstruction.

iii) MASE > 1: worse than the naive model. RDVaR | RDES > 0: risk metric overestimation.

Recall that the naive model is chosen as the Nearest Neighbour Imputation method. The results are presented separately per use case, and the filtered, table-view results per asset class can be found in Appendices F and G.
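As a rough illustration of how the first measure is read, the sketch below computes a MASE-style ratio in which the model's mean absolute error is scaled by that of the naive imputation; the exact definition used in the thesis is the one in Equation 4.7, and the function name and toy data here are our own:

```python
import numpy as np

def mase(y_true, y_model, y_naive):
    """Mean absolute model error scaled by the naive imputation's error.
    A value below 1 means the model beats the naive approach."""
    return np.mean(np.abs(y_true - y_model)) / np.mean(np.abs(y_true - y_naive))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_naive = np.array([1.0, 1.0, 1.0, 1.0])  # e.g. last observed value carried forward
y_model = np.array([1.1, 1.9, 2.8, 4.3])

print(mase(y_true, y_model, y_naive))  # < 1: better than the naive imputation
```

By construction, plugging the naive imputation itself into the model slot gives exactly 1, which is why the thresholds above are read against 1.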

5.1 Use Case One

For use case one, the distribution of MASE for all imputed channels is visualised in Figure 5.1, and its corresponding descriptive statistics are presented in Table 5.1. From a price replication point of view, all models have, on average, a lower deviation from the truth than the naive approach. The min and max columns in Table 5.1 present the best- and worst-case performance, respectively. Noteworthy, all methods except Random Forest achieve results equivalent to the naive approach in their worst case. The Lasso model achieves the lowest mean, min, and max MASE, meaning that Lasso is, on average, the best-performing method while simultaneously showing the best worst- and best-case scenarios. Though the Random Forest achieves a lower median MASE than the Lasso, it has a higher standard deviation and the worst worst-case value.

Figure 5.1: Distribution of the MASE for all imputed channels for each model on use case one. The distribution is estimated through KDE with an RBF-kernel and bandwidth 0.7.

Table 5.1: Descriptive statistics of the MASE for all models on use case one. Numbers depicted in bold are the best for that column, while numbers with underscore are the worst.

Model                   mean    std (%)   median   min     max
Linear Interpolation    0.771    9.082    0.778    0.588   1.047
Lasso                   0.651   17.144    0.636    0.225   1.047
Random Forest           0.664   18.601    0.625    0.283   1.366
Multilayer Perceptron   0.693   16.446    0.667    0.335   1.079
Gaussian Process        0.714   10.815    0.690    0.534   1.070

Figure 5.2 presents a ridgeline plot of the MASE distribution for each model and its performance on each asset class. It shows that there are substantial differences in performance between the asset classes. Specifically, all models perform worse on the discount factors compared to the other asset classes. The MASE distribution for discount factors is also very similar across the models. This most likely implies that none of the models have successfully parsed any vital information; they simply predict the price to lie between the closest reference points, as the linear interpolation method does.

Figure 5.2: Distribution of the MASE for all imputed channels for each model and the specific asset class on use case one. The distribution is estimated through KDE with an RBF-kernel and bandwidth 0.7.

In Tables 5.2 and 5.3, the descriptive statistics of RDVaR and RDES are presented for use case one. Numbers depicted in bold are the best for that column, while numbers with an underscore are the worst. All methods underestimate both risk metrics on average. Only Random Forest and Multilayer Perceptron have relative changes that overestimate the risk for some channels. Linear Interpolation is the worst model in terms of risk metric replication; on average, it underestimates the VaR and ES metrics by 8.5% and 9.5%, respectively. One can also conclude that the models underestimate the ES metric more than VaR. Although the models underestimate the VaR metric, the even larger underestimation of ES indicates that the models fall short in predicting extreme movements; recall that ES takes the mean of the largest deviations. Random Forest is the method that produces the best imputations in terms of risk, keeping some of the asset returns' fat-tailed properties, although it still underestimated the tail risk, i.e., ES, for all assets. As for MASE, there is RDVaR and RDES performance variability between the asset classes; see Appendix F. All models perform best at imputing the movements of the options derived from the implied volatilities.

Table 5.2: The relative deviation in Value at Risk (RDVaR) for all models on use case one. Numbers depicted in bold are the best for that column, while numbers with underscore are the worst.

Model                   mean (%)   std (‰)   median (%)   min (%)    max (%)
Linear Interpolation     -8.496     39.347    -8.389      -16.644    -1.937
Lasso                    -5.669     38.690    -4.618      -16.644    -0.376
Random Forest            -5.061     37.592    -4.689      -16.240     0.614
Multilayer Perceptron    -6.030     36.539    -5.590      -16.630     0.614
Gaussian Process         -7.767     33.902    -7.737      -16.601    -1.980

Table 5.3: The relative deviation in Expected Shortfall (RDES) for all models on use case one. Numbers depicted in bold are the best for that column, while numbers with underscore are the worst.

Model                   mean (%)   std (‰)   median (%)   min (%)    max (%)
Linear Interpolation     -9.505     21.845    -9.631      -13.982    -5.816
Lasso                    -6.655     33.103    -6.743      -13.982     0.146
Random Forest            -6.119     28.880    -6.267      -11.576    -0.096
Multilayer Perceptron    -7.238     26.904    -7.233      -13.982    -3.175
Gaussian Process         -8.879     20.658    -8.965      -13.653    -5.441
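To make the risk-metric deviations concrete, the following is a minimal historical-simulation sketch of VaR, ES, and a relative-deviation ratio of the RDVaR/RDES kind. The function names and the synthetic fat-tailed "true" series are our own illustration, not the thesis' exact risk model:

```python
import numpy as np

def var_es(returns, alpha=0.99):
    """Historical-simulation VaR and ES at level alpha, as positive losses."""
    losses = -np.asarray(returns)
    var = np.quantile(losses, alpha)
    es = losses[losses >= var].mean()  # mean of the losses beyond the quantile
    return var, es

def relative_deviation(imputed, true):
    """RDVaR/RDES-style ratio; negative values mean risk is understated."""
    return (imputed - true) / true

rng = np.random.default_rng(0)
true_returns = rng.standard_t(df=3, size=5000) * 0.01   # fat-tailed "true" returns
imputed_returns = rng.normal(scale=0.012, size=5000)    # thinner-tailed imputation

var_t, es_t = var_es(true_returns)
var_i, es_i = var_es(imputed_returns)
rd_var = relative_deviation(var_i, var_t)
rd_es = relative_deviation(es_i, es_t)
print(rd_var, rd_es)  # both negative: the imputed series understates risk
```

The toy setup mimics the pattern reported above: an imputation with too-thin tails produces negative relative deviations, and ES, being a tail average, is hit at least as hard as VaR.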

5.2 Use Case Two

For use case two, the distribution of MASE for all imputed channels is visualised in Figure 5.3, and its corresponding descriptive statistics are presented in Table 5.4. From a price replication point of view, Lasso is the only model that, on average, has a lower deviation from the truth than the naive model. It is also the model with the highest MASE variability. Compared to use case one, there is substantial growth in the variability of the MASE metric, which makes it more difficult to draw distinct conclusions. The neural network-based models all have less variability than the others; however, it is hard to draw any general conclusion from this alone. As depicted in Figure 5.3, although many of the channel performances fall around MASE ≈ 1, some points lie far to the right, which both affects the average MASE and enlarges the variability.

Figure 5.3: Distribution of the MASE for all imputed channels for each model on use case two. The distribution is estimated through kernel density estimation with an RBF-kernel and bandwidth 0.5.

Table 5.4: Descriptive statistics of the MASE for all models on use case two. Numbers depicted in bold are the best for that column, while numbers with underscore are the worst.

Model                   mean    std (%)   median   min     max
Lasso                   0.970   109.880   0.671    0.236   6.895
Gaussian Process        1.331    97.109   1.104    0.369   7.023
Multilayer Perceptron   1.021    64.463   0.817    0.143   3.546
WaveNet                 1.250    85.944   0.998    0.213   6.408
SeriesNet               1.185    62.994   1.007    0.405   3.514

There is a performance difference between the asset classes, which is depicted by the ridgeline plot in Figure 5.4. Lasso has, on average, the smallest MASE on futures and discount factors, while at the same time having the lowest median MASE on all asset classes but FX rates. Still, Lasso also has the worst worst-case MASE on both FX rates and the option prices derived from the implied volatilities.

Figure 5.4: Distribution of the MASE for all imputed channels, per model and asset class on use case two. The distribution is estimated through kernel density estimation with an RBF-kernel and bandwidth 0.5. The distribution is cut at MASE equals 4 for illustrative purposes.

In Tables 5.5 and 5.6, the descriptive statistics of RDVaR and RDES for all models on use case two are presented. Numbers depicted in bold are the best for that column, while numbers with an underscore are the worst. Similar to use case one, all methods underestimate the risk metrics on average. However, now the WaveNet and SeriesNet sometimes overestimate the risk by as much as, or more than, it is underestimated. SeriesNet seems to be the best, on average, at retaining the properties of the return distribution but has high variability in its results. In Tables G.2 and G.3, one can see that SeriesNet has high variability even within specific asset classes but is the best method in terms of the median relative deviation of ES for futures, discount factors, and the option prices derived from the implied volatilities. As expected, the naive model underestimates VaR and ES the most. Also, the models underestimate the ES metric more than VaR, which indicates failure in predicting larger price movements.

Table 5.5: The relative deviation in Value at Risk (RDVaR) for all models on use case two. Numbers depicted in bold are the best for that column, while numbers with underscore are the worst.

Model                   mean (%)   std (‰)   median (%)   min (%)    max (%)
NNI                     -10.800     63.515   -10.015      -25.763    -0.017
Lasso                    -6.632     57.686    -5.936      -22.931     2.300
Gaussian Process        -10.169     58.319    -9.467      -25.036    -0.017
Multilayer Perceptron    -7.163     55.023    -5.786      -22.931    -0.017
WaveNet                  -6.240     85.669    -6.219      -24.165    26.781
SeriesNet                -3.792    101.111    -3.161      -24.338    27.833

Table 5.6: The relative deviation in Expected Shortfall (RDES) for all models on use case two. Numbers depicted in bold are the best for that column, while numbers with underscore are the worst.

Model                   mean (%)   std (‰)   median (%)   min (%)    max (%)
NNI                     -11.529     73.603    -9.634      -28.649    -3.390
Lasso                    -7.753     66.912    -5.947      -28.262     3.264
Gaussian Process        -11.331     72.550    -9.634      -28.649    -2.048
Multilayer Perceptron    -8.665     64.069    -7.060      -28.268    -0.425
WaveNet                  -6.993    103.727    -8.205      -26.955    34.079
SeriesNet                -5.030    116.830    -4.801      -26.470    43.696

Chapter 6 Discussion and Reflection

In this section, we discuss the results of this thesis and reflect upon the approaches leading to them. The discussion starts by presenting some of the identified shortcomings of our models and elaborating on why these may arise. We continue by presenting models that were considered during the thesis but not included in the report for various reasons. Lastly, we summarise some possible improvements and extensions of this study.

6.1 Risk Underestimation

All models underestimated the downstream risk metrics Value at Risk and Expected Shortfall. Still, they performed better than the naive model from a risk metric replication point of view, making it fair to say that they pick up a signal. The issue seems to be that the models failed to replicate the extreme scenarios that strongly influence the risk metrics. This became even more apparent when comparing the deviations of the Value at Risk and Expected Shortfall metrics. For all models, the predicted Expected Shortfall was more significantly underestimated than the Value at Risk, i.e., not only was the threshold quantile in the Expected Shortfall calculation underestimated, but all exceeding observations were further underestimated.

One plausible reason for this lies in the loss function. All models are fitted to the data to minimise the deviation between the predicted and actual values. Extreme movements are in the minority and, if not predicted at the correct time point, the loss function will favour a cautious model over a reckless one. Another potential reason is that many of the models are fitted on the log returns. When converting the predictions to prices, an aggregation technique is applied that utilises the high autocorrelation of the price process. The final prediction is a weighted average of two predictions from opposing reference points, which yielded a better result from a price replication point of view. However, when applying the price predictions in a Value at Risk model, the implied returns will deviate from those obtained by the initial model prediction. Thus, the way return predictions are processed into prices will affect the downstream risk measure result. If the purpose of filling missing data points is to obtain a complete return series, it is reasonable to skip converting the predicted returns to prices.
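The two-sided aggregation idea can be sketched as follows. The helper `fill_gap` and its linear blending weights are our own simplification, assuming n + 1 predicted log returns spanning the gap between two observed anchor prices; the thesis' exact weighting may differ:

```python
import numpy as np

def fill_gap(p_left, p_right, r):
    """Blend a forward-compounded and a backward-discounted price path.

    r holds the n + 1 predicted log returns covering the steps from the
    last observed price p_left to the next observed price p_right, so the
    gap itself contains n missing prices."""
    n = len(r) - 1
    csum = np.cumsum(r)[:n]                  # cumulative return up to each gap point
    fwd = p_left * np.exp(csum)              # compound forward from the left anchor
    bwd = p_right * np.exp(csum - r.sum())   # discount backward from the right anchor
    w = 1.0 - np.arange(1, n + 1) / (n + 1)  # weight the nearer anchor higher
    return w * fwd + (1.0 - w) * bwd

filled = fill_gap(100.0, 103.0, np.array([0.01, 0.00, 0.02]))
print(filled)  # two interior prices, pulled towards both anchors
```

Note that when the predicted returns are not perfectly consistent with the right anchor, as in this toy call, the blended prices imply returns that deviate from the raw predictions, which is exactly the effect on downstream risk metrics discussed above.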

6.2 Time Component

As stated by the evaluation metrics, the best performing models ignored the time component of the predictions, which means that the models considering the data as a sequence did not succeed in sourcing any further information. However, what information did we expect to find? According to the stylised facts of financial variables, volatility clustering and temporal correlation are common. We did not expect the historical sequence on its own to give information about future values. The ambition was that sequence-handling models would parse the context of a missing value, which, together with information from referencing channels, would increase performance. As we see it, there are two potential sources of failure: the time-concerned models may have failed to parse that knowledge, or such knowledge cannot help in making better predictions. The first case could be explained by too little data being fed into the model, such that patterns do not repeat. It could also be that the model design was wrongly chosen or calibrated.

6.3 Fallback Logic

The performance metrics are aggregated to allow a general statement. Still, one cannot ignore that there is great variety between the channels in the dataset. Some channels have a high correlation with others, whereas some do not. It also seems that the importance of the level of the price process shifts considerably between the channels; see Appendix E for details. The variability in the characteristic nature of the data being imputed puts high demands on the generalisation capabilities of the models.

During the model development phase, we have seen that a fair "fallback" strategy significantly impacts the overall result. By fallback strategy, we refer to the estimates of a model in situations where it could not find any significant patterns. Put less formally, a fair fallback strategy is when a model has a reasonable guesstimate, which explains why the Lasso model shows successful results. The fallback of Lasso is to predict the average return of the training data; applied to use case one, that approach is equivalent to the linear interpolation technique. The other models do not have the same reasonable fallback, as their regularisation is less effective.

6.4 Error Measures

Even though the error measures were carefully chosen, they do not capture all aspects of performance. An example is given in Figure 6.1, where the imputation result is visualised for the Lasso and WaveNet models. Both models perform better than the naive model in terms of MASE, 0.335 and 0.458 for Lasso and WaveNet, respectively. Although Lasso has a lower MASE, it is arguable that WaveNet copes better with the actual structure of the time series. RDVaR aimed to complement MASE, focusing on the daily price movements, and to pinpoint situations like this. However, the RDVaR for Lasso and WaveNet equals −7.7×10−4 and −6.9×10−4 respectively, which does not indicate a significant difference. Perhaps a more sophisticated way of measuring similarity between time series, e.g., dynamic time warping, could give additional performance information that is currently ignored.

Figure 6.1: Imputation result for Nearest Neighbour Imputation, Lasso, and the WaveNet model. The channel being imputed is the British pound discount factor, represented as a bond instrument, on the use case two dataset.
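As a sketch of the suggested alternative, a plain dynamic time warping distance can be written in a dozen lines. This is the textbook O(nm) algorithm with absolute-difference cost, not something used in the thesis; the toy series are our own:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance: lower means the two series are more
    similar up to local time shifts, unlike a pointwise error measure."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

grid = np.linspace(0, np.pi, 8)
wave = np.sin(grid)
shifted_wave = np.sin(grid + 0.3)  # same shape, slightly out of phase
flat = np.zeros(8)

print(dtw_distance(wave, shifted_wave) < dtw_distance(wave, flat))  # True
```

A phase-shifted copy of a wave scores much closer than a flat line of comparable pointwise error, which is exactly the structural similarity that MASE and RDVaR fail to separate in the example above.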

6.5 Complexity

Something not yet emphasised is the trade-off between complexity and readability. A complex model allows flexibility in the types of patterns it can learn but comes at the cost of having more model parameters, thus being more challenging to interpret. The application this study focuses on has a downstream effect on risk management decisions. Therefore, the readability of a model can be an essential consideration, since it affects clients or needs to be approved by regulating authorities.

Nearest Neighbour Imputation and Linear Interpolation are both easily interpreted but have almost no flexibility. The Lasso possesses an explicit model of its predictions, and due to the ℓ1-norm regularisation, it usually ends up with a sparse set of regression coefficients. Although more flexible, it is still possible to interpret and backtrack its predictions. The foundation of the Random Forest is the decision tree. At every step, there is a binary decision that is easily understood. However, the Random Forest contains a large number of decision trees whose predictions are aggregated; the readability of a prediction is therefore low. Still, techniques exist to measure feature importance in a Random Forest, which gives some intuition about the model's attention. The Gaussian Process model is considered a non-parametric model in practice but theoretically has an infinite number of parameters.
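The sparsity argument for the Lasso can be illustrated with a bare-bones coordinate-descent implementation; this is a generic textbook sketch on synthetic data, not the model configuration used in the thesis. With an ℓ1 penalty, only the genuinely informative "reference channels" keep non-zero weights, which is what makes the predictions easy to backtrack:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent Lasso: minimises 0.5*||y - X b||^2 + lam*||b||_1
    via per-coordinate soft-thresholding."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]        # residual excluding feature j
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))                    # 10 candidate reference channels
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=300)

beta = lasso_cd(X, y, lam=20.0)
print(np.flatnonzero(np.abs(beta) > 1e-8))        # only the informative channels survive
```

The soft-thresholding step sets uninformative coefficients to exactly zero, so the fitted model names its reference channels explicitly, at the cost of a small shrinkage bias on the surviving coefficients.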

In contrast to the other models, the Gaussian Process has a probabilistic approach and can express its posterior distribution at every point in the feature space. The posterior distribution shows what the model has learned; still, it cannot be used to understand how the model came up with a specific estimate. The last group of models all belong to the Artificial Neural Networks (ANNs). Deriving an understanding of a prediction made by an ANN is very difficult; they are often referred to as black boxes that parse the input and spit out an output. If the network is large enough, it can learn any mapping function between the input and output, and it has the highest flexibility. What has been evident during the development phase is that large ANNs are sensitive to their hyperparameters and need some regularisation to generalise their knowledge. Hence, they are regularised to gain performance through the bias-variance trade-off.

6.6 Use Case Framing

The use cases were framed to simplify evaluation and problem modelling. Two use cases cannot fully capture all situations in which financial time series are incomplete. The assumption of complete reference channels at all times is perhaps not valid in many real-world scenarios. However, these two use cases cover the potential scenarios for the missing channel reasonably well. Any arbitrary dataset could, through some manipulation, e.g., removal of channels or an adjusted time horizon, be framed to fit either of the two use cases.

We also believe it is important to note that the time horizon considered has one notable feature: it covers the Covid-19 pandemic, one of the most significant shocks observed in the financial markets. The evaluation, primarily use case two, is thus based on abnormal market conditions. Since the historical horizon does not capture any equivalent stressed period, it is interesting to include, as it indeed stress-tests the models. Still, it would also be interesting to see how this affected the result of use case two.

In the data preparation process, we removed channels from the dataset that had an almost perfect correlation with other channels. This was to frame the problem in a setting where there is no obvious way to fill the missing values. In an application, however, this action will lower the imputation performance and is hence not preferred. In a real-world setting, we recommend the opposite strategy: extending the dataset with as many correlated channels as possible.

6.7 Excluded Models

This report evaluates several models, but some additional models were investigated and not included in the report due to poor performance or time restrictions. Early on, several autoregressive models were applied, such as AR, ARMA, and ARIMA. They assume that a value can be modelled through linear combinations of previous values; these approaches performed poorly, which is not too surprising if one believes in the theory of efficient markets. Below is a summary of three models considered state of the art in data imputation, together with a short note concerning this thesis.

i) Variational Autoencoder (VAE); a probabilistic generative approach that aims to represent the data in a small, regularised latent space that can be decoded to yield predictions. We implemented a VAE and applied it to the use case one data. Unfortunately, it failed to learn a regularised latent space on our data and showed poor performance. An issue with this model was how we should represent our data; we applied the data both as sequences and as point observations. Since we failed to get the desired performance, we did not extend the VAE towards a Gaussian Process VAE (GP-VAE), which is considered a state-of-the-art model for imputation of multivariate time series data. The VAE approach would perhaps be more tempting if our use cases were not framed to have only a single channel with missing values.

ii) Generative Adversarial Network (GAN); a generative model containing a generative and a discriminative network learned through competition. The GAN is also considered state of the art in data imputation. Although a different architecture compared to the VAE, the GAN also tries to represent the input data in a regularised space. Since the VAE failed in its purpose, and due to time restrictions, we did not implement the GAN on our dataset.

iii) Bidirectional Recurrent Imputation for Time Series (BRITS); two recurrent networks applied in opposing directions that, together with a feature-concerning network, aggregate the result into a final prediction. This model seemed very promising, and we have taken inspiration from the modelling technique presented in [6]. The unidirectional version, RITS, was implemented on our data, but it did not perform better than a Multilayer Perceptron (MLP) model trained on the log returns. As described in Section 3.8, we believe that one issue was that the gradients of long-term dependencies were ignored due to vanishing gradients in the recurrent component. Since we could not obtain a better result than a single MLP model, we did not extend our implementation to BRITS. As the authors state in [6], the BRITS model can be extended to consider the loss of downstream applications.

6.8 Improvements and Extensions

As the subject of financial time series imputation is relatively unexplored or, at least, poorly publicly documented, much time has been spent on understanding and modelling the problem. The trial-and-error approach in the model selection procedure has given insights into potential extensions of this field. Below follows a summary of proposed improvements and extensions.

i) Loss function adjustments. It would be interesting to evaluate the effect the loss function has during training. Maybe we would have obtained more realistic movements and increased performance if the loss function did not only consider the reconstruction loss of the prediction. What would be the effect of adding the loss of, e.g., VaR deviation, arbitrage-free price deviation, or price series consistency, when operating on the log returns?

ii) Improved regularisation and fallback strategies. As previously explained, the fallback strategy can be crucial for a model to reach a higher overall performance. As an improvement, the regularisation and fallback design (especially for the ANN models) could be further investigated.

iii) Investigate the performance of the use case definition. The use cases have been framed to allow easy evaluation and comparison. However, how would the models perform in a setting where multiple channels have incomplete data? Such a scenario would perhaps suggest other model designs and techniques.

iv) Extend the evaluation metrics. The evaluation metrics do not capture all aspects of performance. Adding, e.g., a dynamic time warping metric could increase the evaluation coverage.

v) In this study, single methods have been evaluated. Still, the different channels possess different characteristics, and finding a "one fits all" method is perhaps complicated. Instead, how should a financial time series imputation system operate where models can be chosen with respect to some measurable property of the missing channel?

Chapter 7 Conclusion

The goal of this thesis was to evaluate and propose methods to impute financial time series in the context of downstream risk applications. We have found techniques and methods yielding higher performance for both use cases than the naive method, both from a price- and risk metric replication point of view.

Even though the result is ambiguous, the Lasso model has shown the best holistic performance. Lasso lowered the price replication error by 35 % compared to the naive model for use case one. Lasso generally underestimated both Value at Risk and Expected Shortfall. Still, it was one of the models showing the smallest average underestimation (-5% and -6% for Value at Risk and Expected Shortfall, respectively). For use case two, Lasso was the only model, on average, that had a lower price replication error than the naive model. The average risk metric replication error for use case two was somewhat higher (-6.6% and -7.8% for Value at Risk and Expected Shortfall, respectively). Another advantage of Lasso is the interpretability of the model, where all predictions can intuitively be derived from the sparse set of regression coefficients.

All models systemically underestimated the downstream risk metrics. Even though some models, e.g., Random Forest, WaveNet, and SeriesNet, occasionally overestimated the risk metrics, such behaviour was penalised by the price replication error. The problem is twofold: the predicted values should align with the price process as well as with the price movements implied by neighbouring prices. Since all models are specialised in estimating the price process, the implied return process is a secondary focus. So, there is a trade-off between a price-attentive and a return-attentive model. For use case two, it became evident that increasing the attention on the price process, e.g., with a separate model predicting the last price, led to a substantial decrease in price replication error.

Financial time series possess stylised facts and a high noise-to-signal ratio, making them exciting but challenging to work with. The time-aware models designed to utilise their complex structure failed to deliver consistent performance. Still, modelling improvements and architectural enhancements might successfully parse and incorporate these complex structures.

Bibliography

[1] Charu C. Aggarwal. Neural Networks and Deep Learning: A Textbook. Springer International Publishing AG, 2018, pp. 315–416.
[2] Faraj Bashir and Hua-Liang Wei. "Handling missing data in multivariate time series using a vector autoregressive model-imputation (VAR-IM) algorithm". In: Neurocomputing (2018), pp. 23–30.
[3] Mikolaj Binkowski, Gautier Marti, and Philippe Donnat. "Autoregressive Convolutional Neural Networks for Asynchronous Time Series". In: (2018). url: https://arxiv.org/pdf/1703.04122.pdf.
[4] Anastasia Borovykh, Sander Bohte, and Cornelis W. Oosterlee. "Dilated Convolutional Neural Networks for Time Series Forecasting". In: (2018).
[5] Richard A. Brealey, Stewart C. Myers, and Franklin Allen. Principles of Corporate Finance. McGraw-Hill/Irwin, 2014, pp. 693–714.
[6] Wei Cao et al. "BRITS: Bidirectional Recurrent Imputation for Time Series". In: (2018). url: https://arxiv.org/pdf/1805.10572v1.pdf.
[7] Rama Cont. "Empirical Properties of Asset Returns: Stylized Facts and Statistical Issues". In: 1 (2001), pp. 223–226.
[8] Paul S.P. Cowpertwait and Andrew V. Metcalfe. Introductory Time Series with R. Springer, 2009.
[9] Jon Danielsson. Financial Risk Forecasting. West Sussex, United Kingdom: John Wiley & Sons Ltd, 2011, pp. 208–212.
[10] Matthew F. Dixon, Igor Halperin, and Paul Bilokon. Machine Learning in Finance: From Theory to Practice. Springer Nature Switzerland, 2020, pp. 91–108.
[11] Douglas Hamilton, Senior Director, Machine Intelligence Lab, Nasdaq Inc. Interview. Consulting interview at project start. Feb. 2021.
[12] Chenguang Fang and Chen Wang. "Time Series Data Imputation: A Survey on Deep Learning Approaches". In: (2020). url: https://arxiv.org/pdf/2011.11347.pdf.
[13] Rao Fui et al. "Time Series Simulation by Conditional Generative Adversarial Net". In: (2019). url: https://arxiv.org/pdf/1904.11419.pdf.
[14] Gareth James et al. An Introduction to Statistical Learning. Springer, 2013, pp. 71–82, 203–230, 303–320.
[15] Jim Gatheral. The Volatility Surface: A Practitioner's Guide. John Wiley & Sons Inc, 2006, pp. 1–13.
[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016, pp. 373–416.
[17] Fredrik Gunnarsson. "Filtered Historical Simulation Value at Risk for Options: A Dimension Reduction Approach to Model the Volatility Surface Shifts". In: (2019). url: https://www.diva-portal.org/smash/get/diva2:1326070/FULLTEXT01.pdf.
[18] Kaiming He et al. "Deep Residual Learning for Image Recognition". In: (2015). url: https://arxiv.org/pdf/1512.03385.pdf.
[19] Kaiming He et al. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". In: (2015). url: https://arxiv.org/pdf/1502.01852v1.pdf.
[20] Yves J. Hilpisch. Derivatives Analytics with Python: Data Analysis, Models, Simulation, Calibration and Hedging. John Wiley & Sons Inc, 2015, pp. 19–36.
[21] G. E. Hinton et al. "Improving neural networks by preventing co-adaptation of feature detectors". In: (2012). url: https://arxiv.org/pdf/1207.0580.pdf.
[22] HKEX. Trading Calendar and Holiday Schedule. 2021. url: https://www.hkex.com.hk/Services/Trading/Derivatives/Overview/Trading-Calendar-and-Holiday-Schedule?sc_lang=en (visited on 04/16/2021).
[23] John C. Hull. Options, Futures, and Other Derivatives. 8th ed. Pearson Education, Inc., 2012.
[24] John C. Hull. Risk Management and Financial Institutions. 5th ed. Hoboken, New Jersey: John Wiley & Sons Inc, 2018, pp. 17, 277–295.
[25] Rob Hyndman. "Another look at forecast-accuracy metrics for intermittent demand". In: 4 (2006).
[26] Sergey Ioffe and Christian Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". In: (2015). url: https://arxiv.org/pdf/1502.03167.pdf.
[27] Richard Johnson and Dean Wichern. Applied Multivariate Statistical Analysis. 6th ed. Pearson Education Limited, 2014, pp. 149–209.
[28] D. P. Kingma and J. L. Ba. "Adam: A Method for Stochastic Optimization". In: (2015). url: https://arxiv.org/pdf/1412.6980.pdf.
[29] Junyan Liu, Sandeep Kumar, and Daniel P. Palomar. "Parameter Estimation of Heavy-Tailed AR Model with Missing Data via Stochastic EM". In: (2019). url: https://arxiv.org/pdf/1809.07203.pdf.
[30] Alexander J. McNeil, Rüdiger Frey, and Paul Embrechts. Quantitative Risk Management: Concepts, Techniques and Tools. Princeton University Press, 2005, pp. 116–182.
[31] Christopher Olah. Conv Nets: A Modular Perspective. 2014. url: https://colah.github.io/posts/2014-07-Conv-Nets-Modular/ (visited on 03/03/2021).
[32] Christopher Olah. Understanding LSTM Networks. 2015. url: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ (visited on 03/03/2021).
[33] Aäron van den Oord et al. "Conditional Image Generation with PixelCNN Decoders". In: (2016). url: https://arxiv.org/abs/1606.05328.
[34] Aäron van den Oord et al. "WaveNet: A Generative Model for Raw Audio". In: (2016). url: https://arxiv.org/pdf/1609.03499.pdf.
[35] Razvan Pascanu, Tomás Mikolov, and Yoshua Bengio. "On the difficulty of training Recurrent Neural Networks". In: (2013). url: http://arxiv.org/abs/1211.5063.
[36] Marcos López de Prado. Advances in Financial Machine Learning. John Wiley & Sons, Inc., 2018, pp. 315–416.
[37] Natraj Raman et al. "Synthetic Reality: Synthetic market data generation at scale using agent based modeling". In: (2020).
[38] Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. "Early stopping and non-parametric regression: An optimal data-dependent stopping rule". In: (2013). url: https://arxiv.org/pdf/1306.3574v1.pdf.
[39] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006, pp. 7–107.
[40] Riccardo Rebonato. Volatility and Correlation: The Perfect Hedger and the Fox. 2nd ed. John Wiley & Sons Inc, 2004, pp. 201–235.
[41] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. "Learning Internal Representations by Error-Propagation". In: (1986). Ed. by D. E. Rumelhart and J. L. McClelland.
[42] Zhipeng Shen et al. "A novel time series forecasting model with deep learning". In: Neurocomputing (2020), pp. 302–313.
[43] Wikipedia. Foreign exchange market. 2021. url: https://en.wikipedia.org/wiki/Foreign_exchange_market (visited on 04/20/2021).
[44] Wikipedia. List of S&P 500 companies. 2021. url: https://en.wikipedia.org/wiki/List_of_S%26P_500_companies (visited on 04/16/2021).
[45] Taco de Wolff, Alejandro Cuevas, and Felipe Tobar. "Gaussian Process Imputation of Multiple Financial Series". In: (2020), pp. 8444–8448.

Appendices

Chapter A Removed Holidays

The following dates have been removed from the dataset since they are public or unofficial holidays in most markets where the instruments in our dataset are traded. These days would therefore cause missing values, and they are removed in the data preparation part of the thesis.

2014-01-01 New Year’s Day
2014-01-20 Martin L. K. Jr. Day
2014-02-17 President’s Day
2014-04-18 Good Friday
2014-04-21 Easter Monday
2014-05-26 Memorial Day
2014-07-04 Fourth of July
2014-09-01 Labour Day
2014-11-27 Thanksgiving
2014-12-25 Christmas Day
2014-12-26 Boxing Day
2015-01-01 New Year’s Day
2015-01-19 Martin L. K. Jr. Day
2015-02-16 President’s Day
2015-04-03 Good Friday
2015-04-06 Easter Monday
2015-05-25 Memorial Day
2015-07-03 Independence Day
2015-09-07 Labour Day
2015-11-26 Thanksgiving
2015-12-25 Christmas Day
2016-01-01 New Year’s Day
2016-01-18 Martin L. K. Jr. Day
2016-02-15 President’s Day
2016-03-25 Good Friday
2016-03-28 Easter Monday
2016-05-30 Memorial Day
2016-07-04 Fourth of July
2016-09-05 Labour Day
2016-11-24 Thanksgiving
2016-12-26 Boxing Day
2017-01-02 Day After New Year’s
2017-01-16 Martin L. K. Jr. Day
2017-02-20 President’s Day
2017-04-14 Good Friday
2017-05-01 Labour Day
2017-05-29 Memorial Day
2017-07-04 Fourth of July
2017-09-04 Labour Day
2017-11-23 Thanksgiving
2017-12-25 Christmas Day
2017-12-26 Boxing Day
2018-01-01 New Year’s Day
2018-01-15 Martin L. K. Jr. Day
2018-02-19 President’s Day
2018-03-30 Good Friday
2018-04-02 Easter Monday
2018-05-28 Memorial Day
2018-07-04 Fourth of July
2018-09-03 Labour Day
2018-11-22 Thanksgiving
2018-12-25 Christmas Day
2018-12-26 Boxing Day
2019-01-01 New Year’s Day
2019-01-21 Martin L. K. Jr. Day
2019-02-18 President’s Day
2019-04-19 Good Friday
2019-05-27 Memorial Day
2019-07-04 Fourth of July
2019-09-02 Labour Day
2019-11-28 Thanksgiving
2019-12-25 Christmas Day
2020-01-01 New Year’s Day
2020-01-20 Martin L. K. Jr. Day
2020-02-17 President’s Day
2020-04-10 Good Friday
2020-05-25 Memorial Day
2020-07-03 Fourth of July
2020-09-07 Labour Day
2020-11-26 Thanksgiving
2020-12-25 Christmas Day
2021-01-01 New Year’s Day
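As an illustration of this preparation step (the price dictionary and the short holiday list below are hypothetical stand-ins, not the thesis code or dataset), dropping the holiday dates from a daily series can be sketched as:

```python
from datetime import date

# Hypothetical sketch of the data-preparation step: drop observations that
# fall on a known holiday before building the dataset.
HOLIDAYS = {date(2014, 1, 1), date(2014, 1, 20), date(2014, 2, 17)}

prices = {
    date(2013, 12, 31): 100.0,
    date(2014, 1, 1): None,    # holiday -> missing quote
    date(2014, 1, 2): 99.8,
    date(2014, 1, 20): None,   # holiday -> missing quote
}

# Keep only the rows whose date is not in the holiday set.
cleaned = {d: p for d, p in prices.items() if d not in HOLIDAYS}
print(sorted(cleaned))  # only the two trading days remain
```

The same filter applied to the full list above removes every systematically missing day in one pass, so the remaining gaps are genuine missing values rather than market closures.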

Chapter B Dataset

Tables B.1–B.4 present the futures, FX rates, discount factors and volatility surfaces included in the dataset.

Table B.1: List of all futures in the dataset.

Underlying                      Asset code   Venue code   Settlement   Currency   Maturity
Copper (C4)                     CME_CX-HG    CMX          Physically   USD        Monthly
Silver (C5)                     CME_CX-SI    CMX          Physically   USD        Monthly
Gold (C6)                       CME_CX-GC    CMX          Physically   USD        Monthly
HSCE Index (C11)                HSCEI        HKEX         Physically   HKD        Monthly
S&P 500 (C16)                   CME_SP       CME          Cash         USD        Monthly
USD Index (C17)                 ICUS_DX      ICUS         Physically   USD        Monthly
Palladium (C18)                 CME_NY-PA    NYM          Physically   USD        Monthly
Platinum (C19)                  CME_NY-PL    NYM          Physically   USD        Monthly
NIKKEI225 (C21)                 CME_NK       CME          Cash         USD        Monthly
Iron Ore (C25)                  SGX_FEF      SGX          Cash         USD        Monthly
EURO-Buxl(r) (C26)              EUREX_FGBX   EUREX        Physically   EUR        Monthly
Coal (API 2) (C27)              CME_NY-MTF   NYM          Physically   USD        Monthly
Hang Seng Index (C29)           HSI          HKEX         Cash         HKD        Monthly
Aluminium Alloy (C31)           LME_AA       LME          Physically   USD        Daily
Nickel (C33)                    LME_NI       LME          Physically   USD        Daily
Special High Grade Zinc (C36)   LME_ZS       LME          Physically   USD        Daily
Standard Lead (C37)             LME_PB       LME          Physically   USD        Daily
Tin (C38)                       LME_SN       LME          Physically   USD        Daily

Table B.2: List of all FX rates in the dataset.

From Currency                  To Currency
Euro (EUR) (C47)               US Dollar (USD)
Pound Sterling (GBP) (C48)     US Dollar (USD)
Canadian Dollar (CAD) (C49)    US Dollar (USD)
Hong Kong Dollar (HKD) (C54)   US Dollar (USD)
Japanese Yen (JPY) (C55)       US Dollar (USD)

Table B.3: List of all discount factors in the dataset.

Currency
US Dollar (USD) (C60)
Euro (EUR) (C62)
Pound Sterling (GBP) (C63)
Canadian Dollar (CAD) (C64)
Hong Kong Dollar (HKD) (C68)
Japanese Yen (JPY) (C70)

Table B.4: List of all volatility surfaces in the dataset.

Underlying                           Option type   Asset code   Venue code   Currency
Canadian Dollar (S74)                American      LME_CA       LME          USD
Primary High Grade Aluminium (S75)   American      LME_AH       LME          USD
Standard Lead (S76)                  American      LME_PB       LME          USD
Nickel (S79)                         American      LME_NI       LME          USD
Gold (S80)                           American      CME_CX-GC    CMX          USD
Special High Grade Zinc (S81)        American      LME_ZS       LME          USD

Chapter C Stylised Facts

Stylised Facts of Volatility

One of the key stylised facts about asset returns is volatility clustering. In addition, the volatility of asset returns has some stylised facts of its own that have been observed over time [20][7]. These are:

i) Stochasticity, volatility is not deterministic or constant, and one cannot forecast volatility with high confidence.

ii) Mean reversion, volatility seems to be mean-reverting, but the mean can change over time.

iii) Leverage effect, volatility tends to be negatively correlated with asset returns, i.e., when an asset's returns are high, its volatility tends to be lower, and vice versa.
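The first two facts can be illustrated with a toy GARCH(1,1) simulation (the parameters below are illustrative choices, not fitted to the thesis dataset): variance is stochastic but pulled back toward a long-run level, and the resulting clustering makes absolute returns autocorrelated even though the raw returns are not.

```python
import math
import random

random.seed(0)

# Toy GARCH(1,1): var_{t+1} = omega + alpha * r_t^2 + beta * var_t.
# Parameters are illustrative; the process mean-reverts to omega/(1-alpha-beta).
omega, alpha, beta = 0.05, 0.10, 0.85
long_run_var = omega / (1 - alpha - beta)

var = long_run_var
returns = []
for _ in range(20_000):
    r = math.sqrt(var) * random.gauss(0.0, 1.0)
    returns.append(r)
    var = omega + alpha * r * r + beta * var  # shocks decay back toward the mean

def lag1_corr(xs):
    n = len(xs) - 1
    mean = sum(xs) / len(xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n)) / n
    varx = sum((x - mean) ** 2 for x in xs) / len(xs)
    return cov / varx

# Clustering shows up in |r| (clearly positive lag-1 ACF), not in r itself.
print(lag1_corr([abs(r) for r in returns]) > lag1_corr(returns))
```

This mirrors why the thesis inspects absolute or squared returns when checking for volatility clustering: the sign-free series carries the persistence that raw returns hide.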

Furthermore, the implied volatility surface has the following stylised facts [40][20]:

i) Smiles, implied volatilities tend to have a smile shape, meaning that OTM and ITM implied volatilities are higher than ATM ones.

ii) Term structure, volatility smiles are more noticeable for options with a shorter time to maturity, implying that future volatility should be higher than today's.

Stylised Facts of Interest Rates

Short rates and their associated discount factors are involved in asset pricing for all asset types. The most important stylised facts for modelling are the following [20][23]:

i) Positivity, interest rates are in general non-negative.

ii) Stochasticity, interest rates in general, and short rates especially, behave randomly.

iii) Mean reversion, interest rates cannot trend nor go to infinity, so they must be mean-reverting.

iv) Term structure, interest rates vary with time to maturity and imply different forward rates, i.e., the yield of a five-year bond tends to be higher than the yield of a three-year bond.
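Facts ii) and iii) can be sketched with a toy Vasicek short-rate path, dr = κ(θ − r)dt + σ dW (the parameters below are illustrative, not calibrated): the rate moves randomly but is pulled back toward the long-run level θ instead of trending away.

```python
import math
import random

random.seed(1)

# Toy Vasicek simulation with illustrative parameters:
# kappa = speed of mean reversion, theta = long-run level, sigma = vol.
kappa, theta, sigma, dt = 2.0, 0.03, 0.01, 1.0 / 252.0

r = 0.10  # start well above the long-run mean
path = []
for _ in range(5 * 252):  # five years of daily steps
    r += kappa * (theta - r) * dt + sigma * math.sqrt(dt) * random.gauss(0.0, 1.0)
    path.append(r)

# After a few years the rate hovers near theta rather than staying at 0.10.
last_year_mean = sum(path[-252:]) / 252
print(abs(last_year_mean - theta) < 0.02)
```

Note that the plain Vasicek model can produce slightly negative rates, which is why fact i) (positivity) is usually hedged with "in general" as above.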

Chapter D Example of a WaveNet-architecture

Figure D.1: The WaveNet-architecture for 8 time-steps and 32 filters.

Chapter E Exploratory Data Analysis

Figure E.1: Prices of six different assets in the dataset from January 2nd 2014 to January 15th 2021. For details about the assets see Appendix B.

Figure E.2: Histogram of the log returns from six different assets in the dataset from January 2nd 2014 to January 15th 2021, displayed with 30 bins. The assets are named in the form CAXB or SAXBYC, where A denotes the specific curve or surface, B the time to maturity in days and C the option delta.

Figure E.3: ACF of six different assets' log returns in the dataset from January 2nd 2014 to January 15th 2021 for lags 1 to 40.

Before the actual modelling began, an exploratory data analysis (EDA) was performed to get to know the data and, possibly, uncover any difficulties. Beforehand, it was known that the different asset classes would have different characteristics and properties, but the differences within each class were not clear. To keep the report from containing thousands of figures, the EDA is explained here with only a subset of the instruments. The following assets were chosen:

i) A future on gold (C6).
ii) A future on S&P 500 (C16).
iii) A future on coal (C27).
iv) The euro-to-dollar FX rate (C47).
v) The discount rate for the dollar (C60).
vi) An option on gold (S80).

All instruments have a 90-day maturity, and the option price is derived with a delta of 0.5. In Figure E.1, the price process of each asset is depicted. For the futures, one can see that two of the assets, C6 and C16, have had a clear price trend over the last couple of years, while the last future, C27, shows a more seasonal behaviour. Seasonal price patterns are not uncommon among commodities. For the assets that are not futures, one cannot draw any conclusions about trend or seasonality, but the price of the zero-coupon bond derived with C60 and the option price derived by S80 seem to be mean-reverting, as described in the stylised facts of rates and volatility in Appendix C.

Moving on to the log returns of the price processes: in Figure E.2, a histogram of the log returns of each asset is depicted with 30 bins. One can clearly see that each asset has its own range of "viable" log returns; the zero-coupon bond derived with C60 has a narrow range, whereas the option price derived by S80 has a wide one. In addition, the results in Table E.1 show that all log-return distributions exhibit fat tails, but only some are also skewed.
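The fat-tail diagnostic used in Table E.1 can be sketched as follows (synthetic data, not the thesis assets): Fisher's excess kurtosis is zero for a Normal distribution, so a clearly positive sample value signals heavier-than-Normal tails.

```python
import random

random.seed(42)

# Fisher's (excess) kurtosis: fourth standardised moment minus 3,
# so a Normal distribution scores ~0 and fat tails score > 0.
def fisher_kurtosis(xs):
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / var**2 - 3.0

normal = [random.gauss(0.0, 1.0) for _ in range(50_000)]
# A Laplace sample has excess kurtosis 3 in theory -- visibly fatter tails.
laplace = [random.choice((-1, 1)) * random.expovariate(1.0) for _ in range(50_000)]

print(fisher_kurtosis(normal) < 1.0 < fisher_kurtosis(laplace))
```

The same statistic applied to the C16 or C60 log returns would reproduce the large positive values reported in Table E.1.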

In Figure E.3, one can see the autocorrelation function with a 95% confidence interval for the log returns at lags 1 to 40. Even where the correlation falls outside the bounds, i.e., is statistically significant, it is still minimal; e.g., for C16, the first significant correlation, between t and t − 1, is ≈ −0.15. The asset with the most autocorrelation in its log returns is C60, which seems to have significant correlation over five lags.
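The sample ACF and the white-noise band behind those 95% intervals can be sketched as below (the AR(1) series is illustrative, not one of the thesis assets); a correlation outside ±1.96/√n is what the figure flags as significant.

```python
import math
import random

random.seed(7)

# Sample autocorrelation at a given lag, and the +/-1.96/sqrt(n) band
# used to judge significance against white noise.
def acf(xs, lag):
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[t] - mean) * (xs[t - lag] - mean) for t in range(lag, n))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

n = 4000
band = 1.96 / math.sqrt(n)

# AR(1) series with coefficient 0.5: its lag-1 ACF sits near 0.5,
# far outside the band, unlike the tiny significant correlations in the EDA.
ar = [0.0]
for _ in range(n - 1):
    ar.append(0.5 * ar[-1] + random.gauss(0.0, 1.0))

print(acf(ar, 1) > band)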

Leaving the subset of assets: one of the key properties of financial time series is their ever-changing distributions and correlations. It is interesting to see whether some assets have been strongly correlated, both in prices and in log returns. Thus, Figure E.4 shows the absolute values of the correlation matrices as heatmaps for prices and log returns between all assets. For the prices in Figure E.4a, one can see a strong correlation between some of the assets. In Figure E.4b, which shows the correlations of log returns, almost all signs of a highly correlated dataset disappear. Compared to Figure E.4a, where it could be hard to distinguish between asset classes, the classes are now much easier to tell apart.
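The contrast between the two heatmaps is partly mechanical: levels of trending series correlate spuriously, while their increments need not. A minimal sketch with two independent random walks (synthetic data, unrelated to the thesis assets):

```python
import random

random.seed(3)

# Two independent random walks: their *prices* can correlate strongly by pure
# chance (both wander in trend-like fashion), while their *returns* stay close
# to zero correlation -- the same contrast as between Figures E.4a and E.4b.
def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

steps1 = [random.gauss(0.0, 1.0) for _ in range(2000)]  # i.i.d. "log returns"
steps2 = [random.gauss(0.0, 1.0) for _ in range(2000)]

p1, p2 = [0.0], [0.0]  # cumulate the steps into "price" paths
for a, b in zip(steps1, steps2):
    p1.append(p1[-1] + a)
    p2.append(p2[-1] + b)

print("price corr:", round(corr(p1, p2), 2))        # often large in magnitude
print("return corr:", round(corr(steps1, steps2), 2))  # close to zero
```

This is why correlations that survive the move from prices to log returns, such as C17 versus C47 below, are the economically meaningful ones.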

The conclusion one can draw is that there has been some correlation between assets and their log returns over time. For example, between C17 and C47 the absolute correlation is almost 1; this is because C17 is a future with USD as underlying, while C47 represents the EUR/USD FX curve. Other correlations that tend to be high are those between futures on different metals, e.g. gold and silver, and between futures on different indices. The futures and FX rates correlate among themselves, but neither correlates with the zero-coupon bonds or the options. It is interesting that the zero-coupon bonds seem to be uncorrelated with everything in terms of their log returns. The options are only correlated within their asset class, which is expected since they share the same underlying and the only difference is the implied volatilities.

Figure E.4: The assets are named in the form CAXB or SAXBYC, where A denotes the specific curve or surface, B the time to maturity in days and C the option delta. For a detailed description of the assets, see Appendix B. (a) The matrix of absolute correlations between all asset prices with 90 days maturity and delta 0.5. (b) The matrix of absolute correlations between all assets' log returns with 90 days maturity and delta 0.5.

Table E.1: Summary statistics including the mean, median, standard deviation, min, max, Fisher's kurtosis and skewness of the six assets' log returns. Fisher's kurtosis and skewness are measured relative to a Normal distribution (both are zero for a Normal).

Asset   mean       median     std       min        max       Fisher's Kurtosis   Skewness
C6       0.00023    0.00016   0.00924   -0.04991   0.05802    4.71                0.06
C16      0.00041    0.00063   0.01118   -0.11026   0.09133   19.08               -0.90
C27     -0.00009    0.00000   0.01512   -0.11496   0.07632    3.82               -0.05
C47     -0.00007    0.00000   0.00502   -0.02383   0.02989    2.52                0.04
C60     -0.00000    0.00000   0.00005   -0.00027   0.00059   21.78                2.18
S80     -0.00005   -0.00196   0.03586   -0.27361   0.35080   11.43                1.13

Chapter F Asset Class Results Use Case One

Table F.1: Descriptive statistics of the MASE for all models and the specific asset class on use case one. Numbers depicted in bold are the best for that statistic, while numbers with underscore are the worst.

Model                   mean    std (%)   median   min     max

Futures
Linear Interpolation    0.801   4.521     0.792    0.738   0.925
Lasso                   0.592   18.595    0.582    0.225   0.929
Random Forest           0.600   17.155    0.597    0.283   0.935
Multilayer Perceptron   0.645   18.769    0.630    0.335   1.079
Gaussian Process        0.705   7.951     0.681    0.610   0.925

FX Rate
Linear Interpolation    0.833   4.504     0.818    0.785   0.917
Lasso                   0.578   19.608    0.573    0.295   0.912
Random Forest           0.590   18.351    0.572    0.334   0.918
Multilayer Perceptron   0.633   16.227    0.629    0.405   0.917
Gaussian Process        0.731   8.939     0.693    0.668   0.917

Discount Factors
Linear Interpolation    0.889   6.658     0.907    0.782   1.047
Lasso                   0.885   7.130     0.907    0.750   1.047
Random Forest           0.941   14.482    0.925    0.766   1.366
Multilayer Perceptron   0.889   6.657     0.907    0.782   1.047
Gaussian Process        0.891   7.524     0.905    0.772   1.070

Volatilities
Linear Interpolation    0.686   5.987     0.686    0.588   0.806
Lasso                   0.653   8.068     0.625    0.536   0.797
Random Forest           0.658   11.562    0.621    0.514   1.030
Multilayer Perceptron   0.691   10.226    0.663    0.524   0.923
Gaussian Process        0.659   8.050     0.634    0.534   0.806
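The MASE reported in these tables can be sketched as follows (the numbers below are made up for illustration; the thesis' exact scaling series is assumed to be the in-sample naive forecast): the model's MAE is divided by the MAE of the naive last-observation forecast, so values below 1 beat the naive benchmark.

```python
# Minimal sketch of the Mean Absolute Scaled Error (MASE): the model's MAE
# scaled by the in-sample MAE of the naive (last-observation) forecast.
def mase(y_true, y_pred, y_train):
    naive_mae = sum(
        abs(b - a) for a, b in zip(y_train, y_train[1:])
    ) / (len(y_train) - 1)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return mae / naive_mae

y_train = [10.0, 12.0, 11.0, 13.0]  # naive in-sample MAE = (2 + 1 + 2) / 3 = 5/3
print(round(mase([14.0, 15.0], [13.5, 15.5], y_train), 2))  # 0.5 / (5/3) = 0.3
```

Read against Table F.1: every model's mean MASE below 1 is precisely this "better than naive" statement.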

Table F.2: The relative change in VaR (RDVaR) for all models and asset classes on use case one. Numbers depicted in bold are the best for that statistic, while numbers with underscore are the worst.

Model                   mean (%)   std (‰)   median (%)   min (%)   max (%)

Futures
Linear Interpolation    -9.729     28.685    -10.501      -15.803   -4.104
Lasso                   -5.177     27.616    -4.133       -10.322   -1.307
Random Forest           -4.832     27.342    -4.883       -9.242    -0.561
Multilayer Perceptron   -5.994     18.744    -6.264       -9.920    -2.721
Gaussian Process        -8.385     19.777    -8.104       -11.130   -4.104

FX Rates
Linear Interpolation    -9.244     14.940    -9.007       -11.485   -7.594
Lasso                   -5.508     33.758    -7.887       -8.445    -0.376
Random Forest           -5.432     29.408    -7.006       -8.275    -1.398
Multilayer Perceptron   -7.560     18.494    -8.221       -9.811    -5.354
Gaussian Process        -8.675     17.370    -9.007       -11.485   -6.760

Discount Factors
Linear Interpolation    -8.862     58.592    -8.117       -16.644   -1.937
Lasso                   -8.582     61.725    -8.117       -16.644   -1.155
Random Forest           -7.820     57.046    -5.578       -16.240   -1.865
Multilayer Perceptron   -8.860     58.561    -8.117       -16.630   -1.937
Gaussian Process        -8.861     58.397    -8.117       -16.601   -1.980

Volatilities
Linear Interpolation    -3.807     15.125    -3.682       -6.522    -2.196
Lasso                   -4.368     24.326    -4.202       -7.949    -0.985
Random Forest           -2.680     25.895    -2.615       -6.101    0.614
Multilayer Perceptron   -2.032     20.626    -1.769       -5.590    0.614
Gaussian Process        -4.061     13.905    -4.106       -6.525    -2.275

Table F.3: The relative change in ES (RDES) for all models and asset classes on use case one. Numbers depicted in bold are the best for that statistic, while numbers with underscore are the worst.

Model                   mean (%)   std (‰)   median (%)   min (%)   max (%)

Futures
Linear Interpolation    -9.750     18.803    -9.811       -12.841   -6.287
Lasso                   -5.513     27.249    -5.598       -9.990    0.146
Random Forest           -5.446     21.944    -6.058       -8.882    -1.220
Multilayer Perceptron   -6.517     18.009    -7.008       -8.886    -3.175
Gaussian Process        -8.855     15.512    -9.000       -12.110   -5.819

FX Rates
Linear Interpolation    -10.078    6.973     -9.885       -11.409   -9.446
Lasso                   -6.226     36.315    -7.779       -11.386   -1.973
Random Forest           -6.580     29.820    -7.203       -11.396   -3.250
Multilayer Perceptron   -7.674     23.474    -7.843       -11.409   -5.085
Gaussian Process        -9.352     11.103    -9.196       -11.409   -8.155

Discount Factors
Linear Interpolation    -11.007    22.933    -11.235      -13.982   -6.525
Lasso                   -10.965    22.958    -11.108      -13.982   -6.525
Random Forest           -8.378     40.862    -10.433      -11.576   -0.096
Multilayer Perceptron   -11.007    22.933    -11.235      -13.982   -6.525
Gaussian Process        -10.925    22.710    -11.224      -13.653   -6.382

Volatilities
Linear Interpolation    -6.790     12.189    -6.101       -9.052    -5.816
Lasso                   -6.129     15.697    -5.850       -8.957    -3.777
Random Forest           -5.497     17.532    -5.393       -8.590    -2.858
Multilayer Perceptron   -5.268     16.843    -4.849       -8.502    -3.590
Gaussian Process        -6.509     12.238    -5.828       -8.801    -5.441
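The VaR and ES that underlie the RDVaR and RDES metrics in Tables F.2 and F.3 can be sketched with a minimal historical estimator (the thesis' exact estimator and confidence level are assumed here): VaR is the loss at the empirical tail quantile, and ES is the mean loss at or beyond it, so ES ≥ VaR always holds.

```python
# Minimal historical VaR and ES at the 95% level (illustrative estimator).
def var_es(returns, alpha=0.95):
    r = sorted(returns)
    k = int((1 - alpha) * len(r))      # index of the tail quantile
    var = -r[k]                        # loss at the quantile
    es = -sum(r[: k + 1]) / (k + 1)    # mean loss at or beyond the quantile
    return var, es

# Hypothetical daily log returns, not thesis data.
rets = [-0.05, -0.03, -0.01, 0.0, 0.0, 0.01, 0.01, 0.02, 0.02, 0.03,
        0.01, 0.0, -0.02, 0.02, 0.01, 0.0, -0.01, 0.01, 0.03, 0.02]
var, es = var_es(rets)
print(round(var, 3), round(es, 3))  # 0.03 0.04
```

A negative RDVaR then simply means the imputed series produced a smaller `var` than the true series did, i.e. the systematic underestimation of tail risk discussed in the thesis.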

Chapter G Asset Class Results Use Case Two

Table G.1: Descriptive statistics of the MASE for all models and the specific asset class on use case two. Numbers depicted in bold are the best for that statistic, while numbers with underscore are the worst.

Model                   mean    std (%)   median   min     max

Futures
Lasso                   0.892   61.142    0.730    0.236   2.908
Gaussian Process        1.734   134.606   1.482    0.575   7.023
Multilayer Perceptron   1.136   76.701    0.908    0.143   3.546
WaveNet                 1.102   46.243    0.997    0.553   3.016
SeriesNet               1.188   67.570    1.005    0.405   3.514

FX Rate
Lasso                   1.079   95.208    0.757    0.248   2.958
Gaussian Process        1.621   63.796    1.790    0.369   2.280
Multilayer Perceptron   0.789   57.889    0.522    0.389   1.940
WaveNet                 0.969   46.257    0.841    0.501   1.770
SeriesNet               1.148   37.602    0.997    0.708   1.946

Discount Factors
Lasso                   0.528   26.079    0.390    0.242   1.082
Gaussian Process        0.776   25.090    0.753    0.472   1.382
Multilayer Perceptron   0.706   17.492    0.679    0.479   1.174
WaveNet                 0.715   36.115    0.647    0.213   1.514
SeriesNet               0.895   43.517    0.752    0.449   2.165

Volatilities
Lasso                   1.164   154.655   0.651    0.336   6.895
Gaussian Process        1.032   37.210    1.008    0.391   1.902
Multilayer Perceptron   1.076   57.371    0.914    0.412   3.139
WaveNet                 1.653   114.454   1.200    0.752   6.408
SeriesNet               1.289   66.177    1.045    0.615   3.279

Table G.2: The relative change in VaR (RDVaR) for all models and asset classes on use case two. Numbers depicted in bold are the best for that statistic, while numbers with underscore are the worst.

Model                   mean (%)   std (‰)   median (%)   min (%)   max (%)

Futures
NNI                     -12.993    61.400    -12.221      -25.763   -4.670
Lasso                   -6.743     52.019    -5.809       -14.629   2.300
Gaussian Process        -12.067    56.853    -10.616      -25.036   -4.670
Multilayer Perceptron   -7.295     46.660    -5.774       -17.780   -0.304
WaveNet                 -10.540    58.910    -9.662       -24.165   -2.604
SeriesNet               -8.323     70.641    -6.621       -24.338   3.026

FX Rates
NNI                     -9.982     71.777    -6.907       -22.931   -1.813
Lasso                   -6.253     89.189    -1.764       -22.931   1.434
Gaussian Process        -9.000     61.341    -6.907       -19.130   -0.704
Multilayer Perceptron   -6.969     85.939    -2.085       -22.931   -0.115
WaveNet                 -4.748     58.300    -2.338       -13.932   1.245
SeriesNet               -6.326     68.444    -4.437       -18.071   1.770

Discount Factors
NNI                     -6.593     60.483    -5.179       -15.517   -0.017
Lasso                   -6.592     60.479    -5.179       -15.516   -0.017
Gaussian Process        -6.508     60.362    -4.925       -15.517   -0.017
Multilayer Perceptron   -6.593     60.483    -5.179       -15.517   -0.017
WaveNet                 1.774      116.643   -1.344       -7.670    26.781
SeriesNet               7.271      133.567   7.012        -12.735   27.833

Volatilities
NNI                     -9.111     26.789    -9.080       -13.220   -4.667
Lasso                   -6.653     31.473    -7.476       -10.479   -2.046
Gaussian Process        -9.111     26.789    -9.080       -13.220   -4.667
Multilayer Perceptron   -7.496     35.554    -6.986       -13.050   -2.246
WaveNet                 -2.597     49.823    -2.638       -10.522   4.826
SeriesNet               0.849      42.854    -0.004       -5.120    7.160

Table G.3: The relative change in ES (RDES) for all models and asset classes on use case two. Numbers depicted in bold are the best for that statistic, while numbers with underscore are the worst.

Model                   mean (%)   std (‰)   median (%)   min (%)   max (%)

Futures
NNI                     -13.347    81.964    -11.275      -28.649   -4.598
Lasso                   -7.319     65.764    -6.127       -19.357   3.264
Gaussian Process        -13.069    80.063    -11.239      -28.649   -4.598
Multilayer Perceptron   -8.797     62.263    -6.681       -21.288   -1.717
WaveNet                 -11.097    80.060    -9.699       -26.955   0.537
SeriesNet               -8.557     88.679    -5.796       -26.470   5.614

FX Rates
NNI                     -8.279     44.403    -6.153       -15.810   -3.390
Lasso                   -4.939     64.678    -2.149       -15.810   1.209
Gaussian Process        -7.909     45.924    -6.153       -15.304   -2.048
Multilayer Perceptron   -5.467     63.008    -0.727       -15.810   -0.425
WaveNet                 -2.441     52.862    -0.953       -9.165    4.588
SeriesNet               -5.642     46.745    -3.431       -12.357   0.522

Discount Factors
NNI                     -11.754    79.398    -9.322       -28.268   -4.801
Lasso                   -11.732    79.557    -9.322       -28.262   -4.676
Gaussian Process        -11.741    79.420    -9.283       -28.268   -4.801
Multilayer Perceptron   -11.754    79.398    -9.322       -28.268   -4.801
WaveNet                 -1.763     165.702   -7.301       -14.021   34.079
SeriesNet               1.169      201.050   -6.240       -16.046   43.696

Volatilities
NNI                     -8.561     27.256    -7.625       -12.640   -5.828
Lasso                   -7.418     31.883    -5.889       -12.514   -3.811
Gaussian Process        -8.561     27.256    -7.625       -12.640   -5.828
Multilayer Perceptron   -7.842     28.683    -6.722       -12.242   -5.145
WaveNet                 -3.703     56.121    -4.631       -11.064   5.817
SeriesNet               -0.137     61.526    -1.126       -8.219    10.810

Chapter H Example of Imputation

Lasso

Figure H.1: Example of the price process with the lowest and highest MASE on use case two and the corresponding distribution of the generated log returns for the Lasso. The distribution is estimated through KDE with a bandwidth of 0.5.
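The KDE smoothing used in these figures can be sketched as follows (the exact estimator in the thesis is assumed; the sample points below are illustrative): the density is an average of Gaussian bumps of width 0.5 centred on the observed log returns.

```python
import math

# Minimal Gaussian kernel density estimate with a fixed bandwidth of 0.5.
def kde(x, samples, bandwidth=0.5):
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

samples = [-1.0, 0.0, 0.0, 1.0]  # hypothetical generated log returns
grid = [-4.0 + 0.02 * i for i in range(401)]
density = [kde(x, samples) for x in grid]

# Sanity check: the estimate integrates to (approximately) one.
print(round(sum(d * 0.02 for d in density), 2))  # 1.0
```

A bandwidth of 0.5 is fairly wide, which is worth keeping in mind when comparing the smoothed densities below: sharp features of the true return distribution are deliberately blurred.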

Gaussian Process

Figure H.2: Example of the price process with the lowest and highest MASE on use case two and the corresponding distribution of the generated log returns for the GP. The distribution is estimated through KDE with a bandwidth of 0.5.

Multilayer Perceptron

Figure H.3: Example of the price process with the lowest and highest MASE on use case two and the corresponding distribution of the generated log returns for the MLP. The distribution is estimated through KDE with a bandwidth of 0.5.

WaveNet

Figure H.4: Example of the price process with the lowest and highest MASE on use case two and the corresponding distribution of the generated log returns for the WaveNet. The distribution is estimated through KDE with a bandwidth of 0.5.

SeriesNet

Figure H.5: Example of the price process with the lowest and highest MASE on use case two and the corresponding distribution of the generated log returns for the SeriesNet. The distribution is estimated through KDE with a bandwidth of 0.5.
