
IMPUTATION AND GENERATION OF MULTIDIMENSIONAL MARKET DATA

Master Thesis

Tobias Wall & Jacob Titus

Master thesis, 30 credits. Department of Mathematics and Mathematical Statistics. Spring Term 2021

Imputation and Generation of Multidimensional Market Data
Tobias Wall†, [email protected]
Jacob Titus†, [email protected]

Copyright © by Tobias Wall and Jacob Titus, 2021. All rights reserved.

Supervisors: Jonas Nylén, Nasdaq Inc.; Armin Eftekhari, Umeå University

Examiner: Jianfeng Wang, Umeå University

Master of Science Thesis in Industrial Engineering and Management, 30 ECTS Department of Mathematics and Mathematical Statistics Umeå University SE-901 87 Umeå, Sweden

†Equal contribution. The order of the contributors' names was chosen based on a bootstrapping procedure in which the names were drawn 100 times.

Abstract

Market risk is one of the most prevailing risks to which financial institutions are exposed. The most popular approach in quantifying market risk is through Value at Risk. Organisations and regulators often require a long historical horizon of the affecting financial variables to estimate the risk exposures. A long horizon stresses the completeness of the available data; something risk applications need to handle.

The goal of this thesis is to evaluate and propose methods to impute financial time series. The performance of the methods is measured with respect to both price and risk metric replication. Two different use cases are evaluated: missing values randomly placed in the time series, and consecutive missing values at the end-point of a time series. In total, five models are applied to each use case.

For the first use case, the results show that all models perform better than the naive approach. The Lasso model lowered the price replication error by 35% compared to the naive model. The result from use case two is ambiguous. Still, we can conclude that all models performed better than the naive model with respect to risk metric replication. In general, all models systematically underestimated the downstream risk metrics, implying that they failed to replicate the fat-tailed property of the price movements.

Keywords: Time Series Imputation, Financial Time Series, Machine Learning, Deep Learning, Value at Risk, Expected Shortfall

Sammanfattning

Market risk is one of the most significant risks to which financial institutions are exposed. The most popular way to quantify market risk is through Value at Risk. Organisations and regulators often require a long historical horizon for the relevant market variables in these calculations. A long horizon increases the risk of incompleteness in the available data, something risk applications need to handle.

The goal of this thesis is to evaluate and propose methods to impute financial time series. The performance of the methods is measured with respect to both price and risk metric replication. Two different scenarios are evaluated: values missing at random in the time series, and consecutive missing values at the end of a time series. In total, five models are applied to each scenario.

For the first scenario, the results show that all models perform better than the naive approach. The Lasso model lowered the price replication error by 35% compared to the naive model. The result from the second scenario is ambiguous. Still, we can conclude that all models performed better than the naive model with respect to risk metric replication. In general, all models systematically underestimated the risk metrics, which suggests that they failed to replicate the fat-tailed property of the distribution of price movements.

Keywords: Time Series Imputation, Financial Time Series, Machine Learning, Deep Learning, Value at Risk, Expected Shortfall

Acknowledgement

We would like to extend our gratitude to Jonas Nylén, Anders Stäring, Markus Nyberg, and Oskar Janson at Nasdaq Inc., who have given us the opportunity to do this thesis work and provided supervision and support throughout the entire project.

We would also like to thank our supervisor at the Department of Mathemat- ics and Mathematical Statistics, Assistant Professor Armin Eftekhari, for guidance and valuable advice during the project.

Finally, we would like to thank our families and friends for their support and words of encouragement throughout our time at Umeå University, which comes to an end with the completion of this thesis.

Tobias Wall Jacob Titus Umeå, May 26, 2021

Contents

1 Introduction
  1.1 Problem Definition
  1.2 Dataset

2 Background
  2.1 Market Risk
    2.1.1 Value at Risk
    2.1.2 Expected Shortfall
  2.2 Financial Variables
    2.2.1 Futures
    2.2.2 Discount Rates
    2.2.3 Foreign Exchange Rates
    2.2.4 Options
    2.2.5 Volatility
  2.3 Financial Time Series
    2.3.1 Stylised Facts
  2.4 Related Work
    2.4.1 Autoregressive Models
    2.4.2 State-Space Models
    2.4.3 Expectation Maximisation
    2.4.4 Key Points

3 Theory
  3.1 Nearest Neighbour Imputation
  3.2 Linear Interpolation
  3.3 Lasso
  3.4 Random Forest
  3.5 Bayesian Inference
    3.5.1 Bayes' Rule
    3.5.2 Multivariate Normal Distribution
    3.5.3 Conditional Distribution
    3.5.4 Bayesian Linear Regression
    3.5.5 Feature Space Projection
    3.5.6 The Kernel Trick
  3.6 Gaussian Processes
    3.6.1 Choice of Covariance Function
    3.6.2 Optimising the Hyperparameters
  3.7 Artificial Neural Networks
    3.7.1 Multilayer Perceptron
    3.7.2 Training Neural Networks
  3.8 Recurrent Neural Networks
    3.8.1 Long-Short Term Memory
  3.9 Convolutional Neural Networks
  3.10 WaveNet
  3.11 Batch Normalisation

4 Method
  4.1 Notation
  4.2 Problem Framing
    4.2.1 Use Case One
    4.2.2 Use Case Two
  4.3 Dataset
  4.4 Data Preparation
    4.4.1 Handling of Missing Values
    4.4.2 Converting to Prices
    4.4.3 Training and Test Split
    4.4.4 Sliding Windows and Forward Validation
  4.5 Data Post-Processing
  4.6 Experiment Design
  4.7 Evaluation
    4.7.1 Mean Absolute Scaled Error
    4.7.2 Relative Deviation of VaR
    4.7.3 Relative Deviation of ES
  4.8 Models
    4.8.1 Nearest Neighbour Imputation
    4.8.2 Linear Interpolation
    4.8.3 Lasso
    4.8.4 Random Forest
    4.8.5 Gaussian Process
    4.8.6 LSTM
    4.8.7 WaveNet
    4.8.8 SeriesNet

5 Results
  5.1 Use Case One
  5.2 Use Case Two

6 Discussion and Reflection
  6.1 Risk Underestimation
  6.2 Time Component
  6.3 Fallback Logic
  6.4 Error Measures
  6.5 Complexity
  6.6 Use Case Framing
  6.7 Excluded Models
  6.8 Improvements and Extensions

7 Conclusion

Appendices
  Appendix A Removed Holidays
  Appendix B Dataset
  Appendix C Stylised Facts
  Appendix D Example of a WaveNet-architecture
  Appendix E Explanatory Data Analysis
  Appendix F Asset Class Results Use Case One
  Appendix G Asset Class Results Use Case Two
  Appendix H Example of Imputation

Chapter 1 Introduction

Market risk is one of the most prevailing risks to which financial institutions are subjected. It is the potential loss that investments incur due to uncertainties in market variables [24]. Risk management is about identifying, quantifying, and analysing these risks to decide whether market risk exposures should be avoided, accepted, or hedged. The most common approach to quantifying market risk is to look at how the relevant market variables, e.g. prices, have moved historically and use that knowledge to conclude how large losses could become in the future.

Value at Risk, henceforth VaR, is one of the most widely used market risk metrics. There are several ways to calculate VaR, but we will focus on a non-parametric approach using historical simulations of observed market data. VaR aims to make the following statement about an investment: "We are X percent certain that we will not lose more than V dollars in time T." Suppose we would like to calculate the 1-day 99% VaR of a USD 1 000 000 investment in the American stock index S&P500¹, using seven years of historical prices from 2014 to the end of 2020. We start by computing the daily price returns over the given period, find the return at the 1st percentile, and multiply that return by the current value of the investment. This yields a 1-day 99% VaR of USD 32 677. But what if the price series were incomplete over the specific period, with several days of missing price data?
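The calculation described above can be sketched in a few lines. The sketch below is illustrative only: the function name is our own, and the price series is synthetic rather than the actual S&P500 data used in the thesis.

```python
import numpy as np

def historical_var(prices, investment, alpha=0.99):
    """1-day VaR by historical simulation: the loss implied by the
    return at the (1 - alpha) percentile, scaled by the investment."""
    prices = np.asarray(prices, dtype=float)
    returns = np.diff(prices) / prices[:-1]            # daily price returns
    worst = np.percentile(returns, (1 - alpha) * 100)  # e.g. the 1st percentile
    return -worst * investment                         # positive number = loss

# Synthetic random-walk prices standing in for ~7 years of daily data.
rng = np.random.default_rng(0)
prices = 100 * np.cumprod(1 + rng.normal(0.0003, 0.01, 1750))
var_1d = historical_var(prices, 1_000_000)             # a loss in USD
```

The same routine applied to the incomplete five-year window would reproduce the discrepancy discussed next.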

Assume our dataset lacked the desired long-term data and that only five years of history were available, i.e., from the beginning of 2016, as illustrated by the dashed line in Figure 1.1. Calculating the 1-day 99% VaR from 2016 onwards results in a value of USD 35 675, which is 9.17% higher than for the complete dataset. This discrepancy is intuitive when analysing the price and logarithmic return processes presented in Figure 1.1. The period from February 19th to March 23rd of 2020 was turbulent in many ways, but mainly, it was the start of the COVID-19 pandemic. The financial markets fell, with S&P500 dropping 34% and the Swedish stock index OMX30 dropping 31%, leaving no markets unaffected. The period contains two "Black Mondays", the 9th and 16th of March, where markets fell 8% and 13% respectively, and one "Black Thursday" on the 12th of March, where markets fell 10%. This stressed market period has a great impact on the VaR metric: leaving a period of normal market conditions out of the calculation will increase VaR due to the enlarged weighted contribution of the stressed period. As of 2021, 18 of the companies included in the S&P500 index were not founded before 2016 [44] and were thereby not publicly listed. They would all lack the desired long-term market data specified for our VaR calculation.

Figure 1.1: S&P 500's price process (a) and corresponding one-day logarithmic return (b) from January 2nd, 2014 to December 31st, 2020. The black dashed line marks January 1st, 2016.

¹S&P500 is a stock index containing 500 large companies listed on the Nasdaq Stock Exchange and New York Stock Exchange that represent American industry.

The absence of long-term market data is one frequent issue that needs to be dealt with when assessing market risk metrics from historical simulations. Another common situation when dealing with multiple instruments is a sparse dataset where single or a few consecutive data points are missing. This could happen due to operational failure at the market, caused by broken sensors, bugs, or data collection failure, but the main cause is varying business days between different markets. Suppose we have a portfolio with exposures to both the S&P500 index and the Hong Kong stock index Hang Seng, HSI². The Hong Kong market is subject to the Chinese public holidays, which are not aligned with the American holidays that affect the S&P500. For example, during the Chinese New Year, occurring every year in February, the Hong Kong market is closed for three days³; the same holds for the Chinese National Day on the 1st of October and Buddha's Birthday on the 30th of April [22]. All of these imply a missing HSI price in our portfolio. Figure 1.2 displays the implicit missing prices for the HSI during the Chinese New Year in 2019. There are several approaches to tackling this problem. The simplest one would be to exclude all dates where the market data is incomplete and then feed the remainder to the downstream risk application. The drawback of such a naive approach is that it ignores all known price movements of the other observed market variables. To avoid losing this information, we need a method to fill in the missing prices.

There are several ways to fill the missing window of the HSI price series, but what effect will they have on the VaR metric? The nearest reference points to the price process are the 4th and 8th of February; given the high autocorrelation between adjacent prices, would not a straight line between the references make a good prediction? This would be sound reasoning if one were only interested in a fair estimate of the missing prices. Still, such an approach will minimise the largest relative price movement, which will most likely lead to a systematic underestimation of the VaR metric. Another naive method is to fill the values with the nearest known price observation; hence, the price on the 4th becomes the estimate for the 5th and 6th, and the price on the 8th estimates that on the 7th⁴. Contrary to the straight-line approach, this method will maximise the price movement between two days while being constrained to estimates bounded between the references, flattening all other movements. This approach will perhaps not imply a biased VaR metric, but it will surely overestimate the number of horizontal movements.

Figure 1.2: S&P 500's and HSI's price process during the Chinese New Year, 2019. The red area marks the missing price period of HSI.

²The HSI index contains the largest companies of the Hong Kong stock market and is an indicator of the overall performance of the Hong Kong stock market.
³Depending on the day of the week New Year's Day occurs.
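The two naive fills can be illustrated on a toy gap; the four-day series below is made up and only mimics a holiday closure like the one in Figure 1.2.

```python
import numpy as np

# Toy price series with a two-day gap (np.nan), mimicking a holiday closure.
prices = np.array([100.0, np.nan, np.nan, 104.0])
idx = np.arange(len(prices))
known = ~np.isnan(prices)

# Straight line between the two reference points.
linear = np.interp(idx, idx[known], prices[known])

# Nearest-observation fill; ties go to the earlier reference point
# (argmin picks the first minimum), matching the convention in the text.
dist = np.abs(idx[:, None] - idx[known][None, :])
nearest = prices[known][np.argmin(dist, axis=1)]
```

The linear fill spreads the 4-unit move evenly over the gap, while the nearest fill concentrates it in a single day and flattens the rest, which is exactly the behaviour discussed above.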

A common drawback of the two methods described above is that they only use information from the two reference prices and are further restricted to values that fall between these references. In reality, the price process is not restricted in this way, as observed in S&P 500's process in the red marked area in Figure 1.2, where values fall both higher and lower than either of the two bordering references. Thus far, we have only mentioned problems that affect a single missing price process. Most of the time, investors are interested in their risk exposure given a set of positions, to account for the netting effects that joint movements may cause. For example, assume we filled the missing prices of the HSI series with the exact opposite movement of the S&P 500 for each time step. Given equal weighted contributions in our portfolio, such an approach would yield zero-return scenarios for the portfolio. Such a strong correlation between price movements is seldom seen over longer horizons. Still, market variables often possess both long-term and temporal correlations that investors exploit to hedge their exposures.

⁴It is customary to use the value corresponding to the prior time point when the missing value is equally distant between two reference points.

1.1 Problem Definition

This thesis examines methods to fill missing values in multidimensional financial time series in the context of risk metric applications. The aim is to provide reasoning about, and suggest, which models to apply when imputing an incomplete multidimensional time series. In a broader context, the project aims to improve client risk metrics to support risk management decisions. The performance of the methods is evaluated both from a price replication point of view and by their effect on the downstream risk application, which puts more attention on the estimated price movements.

Specifically, we will investigate two distinct use cases that contextualise different situations of an incomplete dataset when calculating VaR:

1. Single or a few missing data points, causing a sparse time series. This makes it an interpolation problem, with reference points before and after the missing data points and with information from reference channels, e.g., other market variables, for the time of interest. This use case may arise when a portfolio contains assets traded on different exchanges with differing business days, or simply due to data loss.

2. Consecutive missing data points over a longer horizon at the endpoint of a series. This makes it an extrapolation problem for that particular time series. Still, there may be reference channel data, e.g., other market variables, for the period of interest. This use case may arise when a portfolio contains assets that have not been on the market during the full period, e.g., initial public offerings (IPOs) for corporate stocks or newly created derivative instruments.
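The two missingness patterns can be simulated with boolean masks over a series of daily observations; the sketch below is our own illustration, with arbitrary mask sizes, and is not part of the thesis method.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 250  # roughly one year of daily observations (illustrative)

# Use case one: isolated points missing at random -> a sparse series.
sparse_mask = rng.random(n) < 0.05      # True marks a missing day

# Use case two: a consecutive block missing at one end of the series,
# e.g. an asset listed only for the most recent 200 days.
endpoint_mask = np.zeros(n, dtype=bool)
endpoint_mask[:50] = True               # first 50 days unobserved
```

Applying `sparse_mask` to a complete series gives an interpolation task, while `endpoint_mask` gives an extrapolation task with reference channels still fully observed.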

1.2 Dataset

The analysis in this thesis is based on a dataset provided by Nasdaq but sourced from Refinitiv⁵. The dataset contains historical market data depicting 35 different financial variables, given on a daily frequency, and stretches from 2014 to the beginning of 2021. Four different types of financial variables are included: futures, discount rates, foreign exchange rates, and implied volatilities. Due to confidentiality, the dataset will not be presented in full.

⁵Refinitiv is a provider of market data and infrastructure for financial institutions.

Chapter 2 Background

This chapter explains financial markets, risk measures, and related work on financial time series imputation. It briefly introduces financial markets and market risk, especially the risk measures Value at Risk and Expected Shortfall. Moving on to the financial variables covered in this thesis, we describe how instruments are traded in the market and explain the volatility risk measure and its connection to options and the implied volatility surface. We continue with the stylised facts, where fundamental properties of financial price processes are presented and exemplified, finishing off with an introduction to imputation for time series data in general and an explanation of why these traditional approaches may fail in our financial setting.

2.1 Market Risk

Financial markets play a key role in modern society. They enable the interaction between buyers and sellers of financial instruments and serve fundamental functions in the global economy, such as providing liquidity, managing risk, and pricing assets. All entities operating in the financial markets are also exposed to risks. Financial portfolios depend on several financial variables that affect the value of their assets. This risk is called market risk and is one of the main risks to which financial institutions are subjected [24].

Traders often quantify and manage market risk using the Greek metrics for a smaller set of investments. However, a financial institution's portfolio generally depends on hundreds or thousands of financial variables. Presenting many Greek metrics will not give senior management or regulating authorities a holistic view of the current risk exposure. As a response, Value at Risk and Expected Shortfall were developed to give a single value indicative of the total risk of a portfolio [24].

Figure 2.1: Value at Risk equals the profit and loss at the α-percentile, whereas Expected Shortfall equals the mean of all profits and losses greater than or equal to the α-percentile.

2.1.1 Value at Risk

Value at Risk is a risk measure that aims to make the following statement for a portfolio [24]: "We are α% certain that we will not lose more than $V in time T." As previously mentioned, the VaR calculations in this thesis are based on historical simulations: a non-parametric method that uses historical market data to calculate the T-period profit and loss scenarios a financial holding would have incurred over a fixed historical period.

To calculate the α% $T$-day VaR, let $t \in \{t_0 - h, \dots, t_0\}$ denote a specific day, where day $t_0$ is today and $h$ is the specified historical horizon. Assume the portfolio consists of $d$ assets with corresponding position quantities $w^i$, $i \in \{1, \dots, d\}$. Further assume that all asset prices depend on the market variables $x$ and have their individual pricing functions $F^i(x)$¹. For any given day, the $T$-period market variable scenario can be calculated as

$$ s_t = \frac{x_t}{x_{t-T}}, \qquad t \in \{t_0 - h + T, \dots, t_0\}. \qquad (2.1) $$

For every market variable scenario, one can now calculate its incurred Profit-and-Loss, PnL, on today's portfolio value as² [24]

$$ \mathrm{PnL}_t = \sum_{i=1}^{d} w^i F^i(x_{t_0}) - \sum_{i=1}^{d} w^i F^i(s_t x_{t_0}), \qquad (2.2) $$

where the first term denotes today's portfolio value and the second term the portfolio value under the $t$-th day's market variable scenario. Compute the PnLs for all $t \in \{t_0 - h + T, \dots, t_0\}$, sort them in ascending order, and pick the PnL at the α percentile to be the α% $T$-day VaR [24]. In Figure 2.1, the red coloured bar illustrates the α% VaR in the PnL distribution.

¹E.g. the Black-Scholes formula, which depends on the interest rate, the price of the underlying, etc. Including a pricing function allows flexibility when calculating different scenarios.
²Note that losses are represented as positive PnLs; conversely, profits are negative.

There are three parameters in the VaR model: the confidence level α, the scenario period T, and the historical price period h. Organisational or regulatory standards set the values of these parameters. The confidence level α is usually between 95% and 99.9%, and the historical period h is usually between 1 and 7 years. The scenario period T is typically set to one day but depends on the liquidity of the investment [24].

2.1.2 Expected Shortfall

Expected shortfall, henceforth ES, is similar to VaR but aims to quantify the expected loss given a scenario that violates the α percentile threshold. It thus tries to make the following statement for a portfolio:

"If things get bad, how bad does it get?"

This thesis focuses on the non-parametric, historical simulation approach when calculating the ES metric. The parameters are the same as for VaR and are usually set within the same intervals. ES is assessed by calculating the PnLs for a fixed historical horizon and sorting them in ascending order. ES then equals the mean of all PnLs greater than or equal to the PnL at the α percentile. Figure 2.1 illustrates how VaR and ES differ.
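Given a vector of historically simulated PnLs, both metrics can be read off directly. The sketch below follows the sign convention above (losses positive); the function name and the toy scenario values are our own.

```python
import numpy as np

def var_es(pnls, alpha=0.99):
    """Historical-simulation VaR and ES. Losses are positive numbers:
    VaR is the PnL at the alpha percentile, ES the mean of all PnLs
    at or above it."""
    pnls = np.sort(np.asarray(pnls, dtype=float))   # ascending order
    var = np.percentile(pnls, alpha * 100)
    es = pnls[pnls >= var].mean()
    return var, es

# Ten toy PnL scenarios (losses positive). At alpha = 0.9 the interpolated
# 90th-percentile PnL is 8.4, and ES averages the tail beyond it: 12.0.
var, es = var_es([-5, -3, -1, 0, 1, 2, 3, 4, 8, 12], alpha=0.9)
```

Note that ES is always at least as large as VaR, since it averages only the scenarios in the tail.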

2.2 Financial Variables

Measures depicting the financial markets are often referred to as financial variables. They are, in general, sourced from the trade information of marketed financial instruments. Below is a description of the financial variables relevant to this thesis.

2.2.1 Futures

A futures contract is an exchange-traded derivative: an agreement to buy or sell an asset, called the underlying, at a future time point T for a specific price K. A futures contract is the standardised version of a forward contract, which, instead of being publicly traded on an exchange, is agreed upon between two parties outside of an exchange, so-called Over-The-Counter trading. Since the futures contract needs to be standardised, a contract includes [23]:

i) An underlying asset.

ii) A contract size.

iii) How the asset will be delivered.

iv) When the asset will be delivered.

2.2.2 Discount Rates

The common denominator for valuation in financial markets is the interest rate. An interest rate defines how much a borrower of funds must pay back to the lender of those funds. There are several different interest rates quoted in the market, regardless of currency. The most important interest rate in the pricing of derivatives is the interest rate used for discounting the expected cash flows, called the discount rate. The so-called risk-free rate is the most used discount rate when pricing derivatives and is usually assumed to be an interbank offered rate, which is the interest rate that banks are charged when taking short-term loans. It is important to note that these rates are used as the risk-free rate even though they are not risk-free [23].

2.2.3 Foreign Exchange Rates

Foreign exchange rates, commonly referred to as FX rates, denote the value of the currencies in a currency pair relative to each other. E.g., a 0.8 exchange rate for EUR to USD means that EUR 0.8 can be exchanged for USD 1.0, or equivalently that USD 1.0 can be exchanged for EUR 0.8. In this case, the price relation of USD to EUR is 1.25. An FX rate can be traded as it is, called spot, with delivery in the coming days, or as the underlying in instruments like futures and options. FX rates are commonly used to hedge cash flows denominated in another currency, which can be done through both futures and forwards [5][43].
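The quoting arithmetic from the example can be made explicit; the amounts below are arbitrary illustrations.

```python
# A 0.8 exchange rate for EUR to USD: EUR 0.8 exchanges to USD 1.0.
eur_per_usd = 0.8
usd_per_eur = 1 / eur_per_usd     # the inverse quote: 1.25 USD per EUR

# Converting a USD cash flow into EUR at this rate:
usd_amount = 100.0
eur_amount = usd_amount * eur_per_usd   # USD 100 -> EUR 80
```

The inverse quote is simply the reciprocal of the original rate, which is why 0.8 EUR/USD and 1.25 USD/EUR describe the same market price.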

2.2.4 Options

An options contract is a derivative that gives the holder the option to buy or sell an asset, called the underlying, for a specific price, K. This differs from a futures contract, where the holder of the contract is obliged to act. The two most common forms of options are American and European options. European options can only be exercised at a future time point T, called the maturity, whereas the American counterpart can be exercised at any time up until T. The two most basic options are call and put options. The call option gives the holder the right to buy the underlying asset for a price K at or before maturity T, whereas the put option gives the holder the right to sell the underlying asset for a price K at or before maturity T. This leads to the following pay-offs for the holder of a call option C and a put option P:

$$ C = \max(s - K, 0), \qquad P = \max(K - s, 0), \qquad (2.3) $$

where s is the price of the underlying. The price of a European option is usually estimated by the Black-Scholes formula, which is formally defined as:

Definition 2.2.1. The price C of a European call option with strike price K and time to maturity T is given by the Black-Scholes formula:

$$ C = sN(d_1) - Ke^{-rT}N(d_2), \qquad (2.4) $$

where s is the price of the underlying asset, $N(\cdot)$ is the cumulative distribution function of the standard normal distribution, r is the risk-free rate, and

$$ d_1 = \frac{\ln(s/K) + (r + \sigma^2/2)T}{\sigma\sqrt{T}}, \qquad d_2 = d_1 - \sigma\sqrt{T}, \qquad (2.5) $$

where σ is the volatility of the underlying asset.

A European put option is then priced through the put-call parity. American options have no known closed-form solution but can be priced through binomial trees or simulation methods. Options traded in the market are often American options, but European options are easier to analyse due to the Black-Scholes formula used for European option pricing [23].
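Definition 2.2.1 and the put-call parity translate directly into code. This is a minimal sketch of the standard formulas, not the thesis implementation; the function names are our own.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def bs_call(s, K, T, r, sigma):
    """European call price by the Black-Scholes formula (Definition 2.2.1)."""
    N = NormalDist().cdf
    d1 = (log(s / K) + (r + sigma**2 / 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return s * N(d1) - K * exp(-r * T) * N(d2)

def bs_put(s, K, T, r, sigma):
    """European put price via the put-call parity P = C - s + K e^{-rT}."""
    return bs_call(s, K, T, r, sigma) - s + K * exp(-r * T)

# An at-the-money one-year call at 20% volatility and a 5% rate:
price = bs_call(s=100, K=100, T=1.0, r=0.05, sigma=0.2)   # about 10.45
```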

There is a certain lingo concerning options. Out-of-the-money (OTM), at-the-money (ATM), and in-the-money (ITM) are terms that refer to the intrinsic value of the option, i.e., how much the option would be worth if it were exercised today. For a call option, the terminology is as follows [23]:

i) A call option is OTM if s < K which makes its intrinsic value 0.

ii) A call option is ATM if s ≈ K which makes its intrinsic value ≈ 0.

iii) A call option is ITM if s > K which makes its intrinsic value s − K.
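The three cases above amount to comparing spot to strike; a small sketch (our own helper names, with an arbitrary tolerance for the ATM case):

```python
def call_intrinsic(s, K):
    """Intrinsic value of a call: what exercising it today would be worth."""
    return max(s - K, 0.0)

def call_moneyness(s, K, tol=1e-9):
    """Classify a call option as OTM, ATM or ITM from spot s and strike K."""
    if abs(s - K) < tol:
        return "ATM"
    return "ITM" if s > K else "OTM"
```

So a call with spot 105 and strike 100 is ITM with intrinsic value 5, while the same call at spot 95 is OTM and worthless if exercised today.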

2.2.5 Volatility

The volatility of an asset price, usually denoted σ, is the variability of the return of that asset. Since volatility is the only parameter in the Black-Scholes formula that is not observed in the market, it is often the centre of attention when dealing with options and other derivatives. There is no single concept of volatility, but the ones that are the subject of this thesis are historical volatility and implied volatility. The historical volatility is simply the standard deviation of the log returns of a time series. The implied volatility is the volatility implied by the market price of an option priced by the Black-Scholes formula. E.g., if C* is the price of a European call option observed in the market, the implied volatility, σimp, is the volatility that solves the implicit equation

$$ C^* = C^{BS}(s, K, T, r, \sigma_{imp}). $$

By plotting the implied volatilities of options with the same maturity, T, against different strike prices, K, one obtains the volatility smile. By plotting the implied volatilities of options with the same strike price, K, against their different maturities, T, one obtains the volatility term structure. One way of creating an implied volatility surface is to combine the volatility smiles with the volatility term structure [23][20]. However, in our dataset, the implied volatility surface is created by combining the volatility term structure with the deltas of the existing options. The delta, Δ, is one of the Greeks and denotes how the option's value changes with respect to the price of the underlying asset, i.e., for a European call option, it is the derivative of C with respect to s, Δ = ∂C/∂s.
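The implicit equation for σimp has no closed-form solution, but since the Black-Scholes call price is increasing in volatility, a simple root-finding scheme suffices. The sketch below uses bisection, which is our choice of solver, not one stated in the thesis; the call pricer is repeated so the block is self-contained.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def bs_call(s, K, T, r, sigma):
    """European call price under Black-Scholes."""
    N = NormalDist().cdf
    d1 = (log(s / K) + (r + sigma**2 / 2) * T) / (sigma * sqrt(T))
    return s * N(d1) - K * exp(-r * T) * N(d1 - sigma * sqrt(T))

def implied_vol(c_star, s, K, T, r, lo=1e-6, hi=5.0, tol=1e-10):
    """Solve C* = C^BS(s, K, T, r, sigma) for sigma by bisection,
    exploiting that the call price is monotonically increasing in sigma."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if bs_call(s, K, T, r, mid) < c_star:
            lo = mid        # price too low -> volatility must be higher
        else:
            hi = mid        # price too high -> volatility must be lower
        if hi - lo < tol:
            break
    return (lo + hi) / 2
```

Pricing an option at a known volatility and inverting the price recovers that volatility, which is a convenient sanity check for any implied-volatility routine.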

Using the volatility surface created from existing contracts in the market, it is possible to price any strike price, K, and maturity, T, using interpolation and extrapolation techniques. One needs to be careful, though, not to introduce arbitrage opportunities [15].


Figure 2.2: (a) A stationary time process with the same distribution over any given time interval. (b) A mean-varying process where the expected value depends on time and hence is not stationary. (c) A variance-varying process where the variability depends on time and hence is not stationary.

2.3 Financial Time Series

This thesis covers historical market data, which is structured as time series data. A time series is defined by sequential data points given in successive order. The time series x observed at time points t = 1, …, n is usually written as $\{x_t\}_{t=1}^{n}$. In our case, the data is considered in discrete time on a once-per-day basis. A common approach is to consider the observed time series data as a realisation of a stochastic process of a random variable X [10]. In time series analysis, an important question is whether or not the process of X is strictly and/or weakly stationary. Intuitively, stationarity means that the statistical properties of X do not change over time [9]. For a strictly stationary series, the joint probability distribution function of the sequence $X = \{X_{t-i}, \dots, X_t, \dots, X_{t+i}\}$ is independent of t and i, which implies

$$ E(X_t) = \mu, \qquad Var(X_t) = \sigma^2, \qquad \forall t, \qquad (2.6) $$

with the autocorrelation only dependent on i,

$$ \rho_i = \frac{Cov(X_{t-i}, X_t)}{\sqrt{Var(X_t)\,Var(X_{t-i})}} = \frac{\zeta_i}{\zeta_0}, \qquad (2.7) $$

where $Cov(X_{t-i}, X_t)$ is the autocovariance. A time series is said to be weakly stationary, or covariance stationary, if its mean and autocovariances are time independent [9], i.e.,

$$ E(X_t) = \mu < \infty, \quad \forall t, \qquad Var(X_t) = \sigma^2 < \infty, \quad \forall t, \qquad Cov(X_t, X_{t-i}) = \zeta_i < \infty, \quad \forall t, i. \qquad (2.8) $$

This means that the autocovariances depend only on the time interval between time points and not on the observation time [10]. Figure 2.2 presents an example with one stationary and two non-stationary processes.
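A sample estimate of the autocorrelation in Equation (2.7) can be sketched as follows; the alternating toy series is our own illustration of a process with exact anti-correlation at lag 1.

```python
import numpy as np

def autocorr(x, lag):
    """Sample version of Equation (2.7): rho_i = zeta_i / zeta_0."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    zeta0 = np.mean(x * x)                 # sample autocovariance at lag 0
    if lag == 0:
        return 1.0
    zeta = np.mean(x[lag:] * x[:-lag])     # sample autocovariance at the lag
    return zeta / zeta0

# A perfectly alternating series is exactly anti-correlated at lag 1.
rho = autocorr([1.0, -1.0] * 50, lag=1)    # -1.0
```

Applied to a price series and its log returns, this estimator reproduces the contrast shown in Figure 2.3.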

As previously mentioned, a central question in time series analysis is whether the time series is stationary or not. Analysing the standard price process of a financial instrument on the market, say the S&P500 in Figure 1.1, it is clear that the statistical properties change over time: the series is neither strictly nor weakly stationary. For example, the mean, μ, of the price process is time-dependent, which means that the time series has a trend component that violates any assumption of stationarity. A standard way to deal with this is to instead operate on the differences between time points, $\Delta x_t = x_t - x_{t-1}$ [10]. In finance, the usual way of overcoming this problem is to operate on the movements of the price process. Let $x_t$ denote the price of an asset at time t; then

$$ r_t = \log\!\left(\frac{x_t}{x_{t-1}}\right), \qquad (2.9) $$

is the logarithmic return of the price process, henceforth log return. Log returns are the most common way to work with financial time series due to their nice properties, e.g., additivity [9], and will be the returns used in this thesis.

Figure 2.3: (a) Autocorrelation function of S&P500's log returns from lag 1 to 40 with a 95% confidence interval. (b) Autocorrelation function of S&P500's prices from lag 1 to 40 with a 95% confidence interval.

Figure 2.3 shows the autocorrelation functions of a price process and a log return process of the S&P500 with 95% confidence intervals. As depicted in Figure 2.3a, the price process has a very high autocorrelation component, which makes it more difficult for some model architectures to parse the crucial signals from the underlying process due to the low signal-to-noise ratio [3]. This further motivates using log returns, as their autocorrelation is smaller, as depicted in Figure 2.3b, and a lot of "noise" is removed. However, removing a large part of the autocorrelation comes at the expense of removing the internal memory of the price process; see [36] for a discussion on the trade-off between stationarity and memory.
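The log-return transformation of Equation (2.9) and its additivity property can be checked on a toy series; the prices below are arbitrary.

```python
import numpy as np

# Log returns as in Equation (2.9): r_t = log(x_t / x_{t-1}).
prices = np.array([100.0, 102.0, 101.0, 103.0])
log_returns = np.log(prices[1:] / prices[:-1])

# Additivity: the daily log returns sum to the multi-day log return.
total_return = np.log(prices[-1] / prices[0])
```

This additivity is one reason log returns are preferred over simple returns, which do not aggregate across days by summation.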

2.3.1 Stylised Facts

The presence of autocorrelation in the price process is one of the statistical properties of a financial time series. Beyond that, the statistical properties that asset returns share over a wide range of assets, markets, and time periods are the so-called stylised facts. Independent studies have observed the stylised facts within finance over various instruments, markets, and periods [7]. Overall, it is well known that asset returns exhibit behaviour belonging to an ever-changing probability distribution. Regardless of asset type, frequency, market, and period, the stylised facts can be summarised as: volatility clustering, fat tails, and non-linear dependence [9][7].


Figure 2.4: (a) The distribution of S&P500’s log returns vs a standard Normal distribution as a histogram with 100 bins. Numbers are shown for Fisher’s kurtosis and skewness of the distribution. (b) A QQ-plot of S&P500’s log returns vs a standard normal distribution.

Fat tails refer to the property that the returns' probability distribution exhibits larger positive and negative values than a normal distribution. An example of fat tails for S&P500's log returns is shown in Figure 2.4. Figure 2.4a shows that the log returns on S&P500 have significantly greater kurtosis than the normal distribution, 19.08 vs 0. Also, when plotting the quantiles of the log returns against a standard normal, as in Figure 2.4b, the log returns show properties of fat tails. If the distribution were normal, the plot would depict a straight line.
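The excess kurtosis and skewness reported in Figure 2.4a can be computed with scipy.stats; here a Student's t sample serves as a fat-tailed stand-in for real returns (the degrees of freedom and sample size are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Student's t with 10 degrees of freedom: a mildly fat-tailed stand-in for returns
fat_tailed = stats.t.rvs(df=10, size=100_000, random_state=rng)
normal = rng.standard_normal(100_000)

# Fisher's (excess) kurtosis is 0 for a normal distribution; fat tails push it above 0
print(stats.kurtosis(fat_tailed))   # positive (theoretical value 6/(df-4) = 1 here)
print(stats.kurtosis(normal))       # close to 0
print(stats.skew(fat_tailed))       # close to 0: fat tails need not imply skewness
```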

Volatility clustering refers to the property that volatility in the market tends to come in clusters, i.e., there is a positive autocorrelation between volatility measures: if a financial variable is volatile today, there is an increased probability of it being volatile tomorrow. An example of this can be found in Figure 2.5b, depicting the rolling window volatility of the log returns of S&P500 and HSI.

The stylised fact of non-linear dependence regards the dependence between financial return processes and how it changes according to current market conditions. Two return processes that move somewhat independently in normal market conditions can show a high temporal correlation during financially stressed periods, i.e., the prices drop together. As an example of the ever-changing dependence between financial returns, the rolling window correlation between the log returns of S&P500 and HSI is shown in Figure 2.5a.

To conclude this section, many theories in finance, such as portfolio theory and derivative pricing, are built on the assumption that returns are normally distributed, and they break down if the normality assumption is violated. In risk management and risk calculations in particular, an assumption of normally distributed asset returns leads to a substantial underestimation of risk [9]. The stylised facts should thus be carefully considered when modelling a financial time series. Implied volatility surfaces and discount rates have their own specific stylised facts, which can be found in Appendix C.


Figure 2.5: (a) Estimated rolling daily correlation between S&P500 and HSI log returns from 1st of January 2018 to the 1st of January 2021 through the multivariate EWMA model proposed in [9] with λ = 0.94. (b) Estimated rolling daily volatility of S&P500 and HSI log returns from 1st of January 2018 to the 1st of January 2021 through the multivariate EWMA model proposed in [9] with λ = 0.94.
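The EWMA recursion behind estimates like those in Figure 2.5 can be sketched in a few lines. The recursion Σ_t = λ Σ_{t−1} + (1 − λ) r_t r_t^T uses λ = 0.94 as in [9]; the initialisation with the full-sample covariance and the synthetic return series are illustrative assumptions:

```python
import numpy as np

def ewma_cov(returns, lam=0.94):
    """Rolling covariance matrix estimates via the EWMA recursion
    Sigma_t = lam * Sigma_{t-1} + (1 - lam) * r_t r_t^T."""
    returns = np.asarray(returns, dtype=float)
    n, d = returns.shape
    sigma = np.cov(returns.T)          # illustrative initialisation: sample covariance
    out = np.empty((n, d, d))
    for t in range(n):
        r = returns[t][:, None]
        sigma = lam * sigma + (1.0 - lam) * (r @ r.T)
        out[t] = sigma
    return out

rng = np.random.default_rng(1)
rets = 0.01 * rng.standard_normal((500, 2))        # two synthetic return series
covs = ewma_cov(rets)
vols = np.sqrt(covs[:, 0, 0])                      # rolling volatility of series 1
corr = covs[:, 0, 1] / np.sqrt(covs[:, 0, 0] * covs[:, 1, 1])
```

Each EWMA estimate is a convex combination of positive semi-definite matrices, so the implied correlations always stay in [−1, 1].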

2.4 Related Work

The problem of missing data in time series is not limited to finance but is a broad problem in many application domains such as healthcare, meteorology, and traffic engineering [6]. Instead of originating from damaged equipment, unexpected accidents, or human error, the missing values in financial time series most commonly depend on whether the market is open or whether the asset exists in the market.

The literature on imputation of pure financial time series data is sparse, but there has been work on autoregressive models [29][2], agent-based modelling [37][11], and Gaussian Processes [45]. However, there is no reason to believe that the methods used differ from other domains. Much previous work on imputing missing values in time series has been done using statistical approaches. The most common ones seem to be autoregressive, state-space, and expectation-maximisation models³ [12].

2.4.1 Autoregressive Models

An autoregressive model is a model where the output depends, often assumed linearly, on its own history and an error term. An autoregressive model can, e.g., try to predict a stock's price tomorrow based on its price today. A common model in the univariate case with a linear relationship is the Autoregressive-Moving-Average model, ARMA(p, q), which has a pth order autoregressive part, AR(p), and a qth order moving average part, MA(q). The assumptions behind the ARMA model hold if the AR(p) part is stationary [30]. When it comes to distributional assumptions, one usually assumes that the process is normally distributed by assuming that the error term, ε_t, is an independently and identically distributed (i.i.d.) random variable, ε_t ∼ N(0, 1). However, this is flexible, and, e.g., a Student's t-distribution could be assumed instead. The ARMA framework is flexible towards extending it with

³We have deliberately disregarded methods like median and mean imputation, since financial time series are often non-stationary.

exogenous variables and in a multivariate setting [8]. However, in finance, if the assumptions behind the ARMA model held, meaning that tomorrow's log return could be written as a linear combination of the p previous log returns and q previous error terms, it would quickly be noticed and exploited.⁴

The Generalised Autoregressive Conditional Heteroskedasticity (GARCH) model was introduced to overcome some of these problems. The GARCH(p, q) process models the conditional variance as if it were given by an ARMA process. From this model, one can show that subsequent log returns are uncorrelated but dependent, have fat tails, form volatility clusters, and have an unconditional long-term variance; it can thus recreate some of the essential stylised facts. The GARCH model removes some of the crucial assumptions on the log return process that ruined the ARMA model, but it still assumes that the volatility of the process is stationary. Extending GARCH to the multivariate case can be very hard and troublesome [9].
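The qualitative claims above can be checked by simulating a GARCH(1,1) process; the parameter values below are illustrative assumptions, not estimates from data:

```python
import numpy as np

def simulate_garch11(n, omega=1e-6, alpha=0.1, beta=0.85, seed=0):
    """Simulate r_t = sigma_t * z_t with conditional variance
    sigma_t^2 = omega + alpha * r_{t-1}^2 + beta * sigma_{t-1}^2 (GARCH(1,1))."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    r = np.empty(n)
    var = omega / (1.0 - alpha - beta)   # start at the unconditional long-term variance
    for t in range(n):
        r[t] = np.sqrt(var) * z[t]
        var = omega + alpha * r[t] ** 2 + beta * var
    return r

r = simulate_garch11(50_000)
# Even with Gaussian innovations, the unconditional returns are fat-tailed:
excess_kurtosis = np.mean(r**4) / np.mean(r**2) ** 2 - 3.0
print(excess_kurtosis)
```

The simulated excess kurtosis is positive although each conditional return is normal, which is exactly the mixture effect the text describes.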

2.4.2 State-Space Models

Continuing with the regression-based approaches to the imputation of time series data: simply put, a state-space model assumes a latent process, call it z_t, which evolves over time. This process z_t is not observable but drives another process, x_t, that is observable. Random factors may drive the evolution of z_t and its dependence on x_t; thus, this is a probabilistic model. The state-space model consists of describing the latent state over time and its dependence on the observable process. These models overcome some of the problems with stationarity of the ARMA models. An example of a state-space model is the family of models called Kalman filters. The disadvantage of these models is that one needs to make assumptions about the dynamical system being modelled and about the noise affecting the system [10][8].

2.4.3 Expectation Maximisation

Unlike the previously described methods, Expectation Maximisation (EM) methods are not necessarily regression-based. The EM method consists of two steps, an Expectation step and a Maximisation step. First, one assumes a statistical model and distribution of the data; the statistical model could, e.g., be an AR(p) model [29] or simply a normal distribution. Then the two steps are performed iteratively, imputing the missing time points with the statistical model so as to maximise the probability of the missing time points belonging to the time series [12]. In these methods, the data do not have to be treated as time-dependent.
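As a deliberately simplified sketch of the two steps, the following treats the series as i.i.d. normal, ignoring time dependence as noted above; that the missing entries converge to the observed mean is a property of this toy model, not of EM in general:

```python
import numpy as np

def em_impute_normal(x, n_iter=50):
    """EM-style imputation for a univariate series under an i.i.d. normal model:
    E-step fills missing points with the current mean; M-step re-estimates
    the mean and variance from the completed series."""
    x = np.array(x, dtype=float)
    missing = np.isnan(x)
    mu, sigma2 = np.nanmean(x), np.nanvar(x)
    for _ in range(n_iter):
        x[missing] = mu                  # E-step: expected value of the missing points
        mu, sigma2 = x.mean(), x.var()   # M-step: maximum likelihood estimates
    return x, mu, sigma2

series = [1.0, 2.0, np.nan, 4.0, np.nan, 6.0]
filled, mu, _ = em_impute_normal(series)
print(filled)   # the two NaNs are replaced by the mean of the observed values
```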

2.4.4 Key Points

A common drawback of the approaches described above is that they often make strong assumptions on the missing values and may not take temporal relationships in the data into account. They instead treat the time series as non-time-dependent structured data, which may not suit financial time series, known to have low signal to

⁴At least according to the Efficient Market Hypothesis.

noise ratio with temporal correlation to other processes in time [3][12]. However, several deep learning approaches have become successful in imputing time series data, with Generative Adversarial Networks (GANs) and Recurrent Neural Networks (RNNs) at the forefront [12][6]. GANs, at least, have been shown to work well with financial time series [13].

Imputation of financial time series is difficult due to the complex dynamics described in Section 2.3.1, the low signal-to-noise ratio, and the fact that the same signal can come from several different sources with a temporal effect and with varying strength [3]. So, when imputing financial time series, a model should preferably be non-parametric, take temporal relationships within its own channel and with reference channels into account, and recreate the stylised facts.

Chapter 3

Theory

In this section, we present the theory needed to understand the models that are used later. The outline is intended to follow the complexity of the models introduced, i.e., starting with less complex linear models and extending to more complex and highly non-linear models.

First, two standard imputation methods, nearest neighbour imputation and linear interpolation, are introduced. Then, the standard linear model and its regularised version, the Lasso, are described with their fundamental properties. Moving from linearity, Random Forests are presented by first describing decision trees and their weakness of easily overfitting the data, and continuing with how using multiple trees in a Random Forest can mitigate this. Later, laying the foundation for Gaussian Processes, Bayesian inference is introduced and exemplified as the weight-space view of regression through Bayesian linear regression, finishing with the move to non-linear modelling by introducing projections into feature space and the kernel trick. Connecting the previous work, Gaussian Processes are then introduced as the function-space view of regression. Different covariance functions are presented, and it is described how to choose the corresponding hyperparameters.

Then artificial neural networks are introduced, starting with the vital building blocks and how they learn from data. The notion of artificial neural networks is then expanded to recurrent neural networks, which specialise in processing sequences and learning long-term dependencies, and continues with convolutional neural networks, which specialise in data with grid-like topology, and how they differ from regular neural networks. An example of the successful adoption of convolutional neural networks on non-stationary time series is the WaveNet architecture, developed by DeepMind in 2016. Lastly, the section finishes with how one might speed up convergence when training deep neural networks.

3.1 Nearest Neighbour Imputation

A simple approach to filling the missing values is to estimate them to equal the closest observed data point. This method is referred to as Nearest Neighbour Imputation (NNI) and is illustrated in Figure 3.1a. To formulate it in a mathematical setting, assume that y_t is missing and let T denote the set of all time points with observed values. The


Figure 3.1: (a) Illustration of how the Nearest Neighbour Imputation method works. The blue dots are observed data points, and the red dots are nearest neighbour predictions. (b) Illustration of how the linear interpolation method works. The blue dots are observed data points and the red dots are the linear interpolation prediction.

NNI method, also referred to as the naive method, then estimates y_t as,

ŷ_t = y_{t*},  where t* = argmin_{τ ∈ T} |τ − t|.    (3.1)

The implication of NNI in an extrapolation setting is a flat line from the last observed value.
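A direct implementation of Equation 3.1 (breaking distance ties towards the earlier neighbour is an implementation choice):

```python
import numpy as np

def nearest_neighbour_impute(y):
    """Fill NaNs with the value at the nearest observed time point (Equation 3.1).
    Ties are broken towards the earlier neighbour."""
    y = np.array(y, dtype=float)
    obs = np.flatnonzero(~np.isnan(y))          # indices of observed values
    for t in np.flatnonzero(np.isnan(y)):
        t_star = obs[np.argmin(np.abs(obs - t))]
        y[t] = y[t_star]
    return y

imputed = nearest_neighbour_impute([1.0, np.nan, np.nan, 4.0, np.nan])
print(imputed)   # [1. 1. 4. 4. 4.] - the trailing NaN extrapolates as a flat line
```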

3.2 Linear Interpolation

The Linear Interpolation (LI) method is another simple method to fill missing values in a sequence. Like the NNI method, the missing data points are imputed based on the nearest observed values. The LI method predicts missing values based on the straight line between the closest observed points in time; see Figure 3.1b for an example. Assume y_t is missing, where y_{t−a} and y_{t+b} are the nearest previous and next observed data points. This is thus an interpolation problem where t − a < t < t + b. The LI method estimates y_t as,

ŷ_t = y_{t−a} + a (y_{t+b} − y_{t−a}) / (a + b).    (3.2)
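A direct implementation of Equation 3.2; boundary points without both neighbours are left untouched here, since LI is an interpolation method:

```python
import numpy as np

def linear_interpolate(y):
    """Impute interior NaNs by the straight line between the nearest
    observed neighbours (Equation 3.2)."""
    y = np.array(y, dtype=float)
    obs = np.flatnonzero(~np.isnan(y))
    for t in np.flatnonzero(np.isnan(y)):
        prev = obs[obs < t]
        nxt = obs[obs > t]
        if len(prev) and len(nxt):               # interior point: interpolate
            lo, hi = prev[-1], nxt[0]
            a, b = t - lo, hi - t
            y[t] = y[lo] + a * (y[hi] - y[lo]) / (a + b)
    return y

filled = linear_interpolate([1.0, np.nan, np.nan, 4.0])
print(filled)   # [1. 2. 3. 4.]
```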

3.3 Lasso

Consider the standard linear model,

y = w_0 + X^T w + ε,    (3.3)

where y is the response variable, X ∈ R^{p×n} is a set of explanatory variables, p is the number of variables, n is the number of observations, w ∈ R^{p×1} are the regression coefficients, and ε is the error term containing the noise of the linear model. Given an estimate of the regression coefficients, the predicted response is given by,

Ŷ = w_0 + X^T w.    (3.4)

In practice, w is chosen as the regression coefficients that minimise the residual sum of squares (RSS) in the least-squares fitting procedure [14],

Σ_{i=1}^{n} (y_i − w_0 − x_i^T w)².    (3.5)

The Least Absolute Shrinkage and Selection Operator, henceforth Lasso, is a shrinkage method very similar to simple linear regression but, instead of only penalising the RSS term, Lasso adds a regularisation of the regression coefficients. The Lasso coefficients, w_L, are the ones that minimise,

Σ_{i=1}^{n} (y_i − w_0 − x_i^T w)² + λ Σ_{j=1}^{p} |w_j|,    (3.6)

for any λ ∈ R₊. The last term is the coefficient penalty, which shrinks coefficient values towards zero. If λ = 0, the resulting estimates equal those obtained from Equation 3.5. The larger λ becomes, the higher the coefficient penalty and the more the coefficients shrink towards zero. The term Σ_{j=1}^{p} |w_j| is also called the ℓ1-norm of w. Lasso has three main advantages over the simple RSS-minimising model [14]:

i) Can significantly lower the coefficient variance and thus be less prone to overfitting.

ii) Is not restricted to the setting n ≥ p, i.e., when the number of observations is greater than the number of features.

iii) Shrinks many of the coefficients to zero, causing a sparse set of explanatory variables, thus, increasing the model’s interpretability.
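With scikit-learn, the minimisation in Equation 3.6 and the sparsity in point iii) can be illustrated on synthetic data; the data-generating process and the value of alpha (scikit-learn's name for λ) are assumptions for the example:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
# Only two of the ten features carry signal; the rest are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(n)

model = Lasso(alpha=0.1).fit(X, y)   # alpha plays the role of lambda in Eq. 3.6
print(model.coef_)                   # most noise coefficients are shrunk exactly to zero
print(np.count_nonzero(model.coef_))
```

Note that scikit-learn expects the n × p design matrix, i.e., the transpose of the p × n convention used in the text.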

3.4 Random Forest

Random Forest (RF) is a popular machine learning method for both classification and regression tasks. In contrast to the Lasso, an RF does not assume a linear relationship between the response and explanatory variables and can thus learn more complex, non-linear patterns.

The building block of an RF is the decision tree. Building a decision tree for regression tasks is performed in a two-step procedure [14],

i) Sequentially segment the predictor space X ∈ {X_1, X_2, ..., X_d} into J disjoint, non-overlapping regions R ∈ {R_1, R_2, ..., R_J}.

ii) Associate each R_j with a prediction value; explicitly, the average response value of all training observations that fall in R_j.

The goal of step i) is to find the regions R that minimise the RSS given by

Σ_{j=1}^{J} Σ_{i ∈ R_j} (y_i − ŷ_{R_j})²,    (3.7)

Figure 3.2: Schematic overview of a Random Forest model with m decision trees. Red dots indicate a particular decision path within a tree. The final prediction is the average of all individual tree estimates.

where ŷ_{R_j} is the average response for the training data within the jth region. In reality, there is an infinite number of ways to segment the data, and evaluating them all is not feasible. Therefore, Recursive Binary Splitting is applied: a top-down, greedy approach to segmenting the feature space. It starts at the top of the tree and successively splits the data, where each split results in two new branches. The split is greedy since it only regards the partition that yields the greatest reduction in RSS at that particular step [14].

Decision trees are easily fitted and interpreted but have the disadvantage of high variance; that is, they are susceptible to overfitting the training data. RF is an approach that compromises on the bias-variance trade-off to gain better model performance. In short, an RF contains multiple decision trees and uniformly aggregates their individual predictions into a final prediction. When building an individual decision tree, one starts by extracting a bootstrap sample from the training set. Then, at each split, a random sample of m predictors is chosen as split candidates, so the split is restricted to use only the m predictors sampled at that step. This creates greater variety amongst the decision trees in the random forest, which has proven to yield better performance [14]. An illustration of an RF model is presented in Figure 3.2, where the arrows show how the data flows through the model. The hyperparameters of a random forest include:

i) The total number of decision trees. Typically, performance converges when the number of trees grows beyond a certain value.

ii) The number of randomly drawn features considered at each split of a decision tree.

iii) The maximum depth per decision tree.
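The three hyperparameters map directly onto arguments of scikit-learn's RandomForestRegressor; the synthetic regression problem and the particular parameter values below are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.standard_normal(400)

rf = RandomForestRegressor(
    n_estimators=200,   # i) total number of trees
    max_features=1,     # ii) features considered at each split
    max_depth=8,        # iii) maximum depth per tree
    random_state=0,
).fit(X, y)

pred = rf.predict(np.array([[0.0, 1.0]]))   # averaged over all 200 tree predictions
print(pred)
```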

3.5 Bayesian Inference

Performing statistical inference using Bayes' rule to update the probability of a hypothesis as more information becomes available is called Bayesian inference. More

specifically, the variable θ is treated as a random variable, and one assumes an initial guess about the distribution of θ, called the prior distribution. When more information becomes available, the distribution of θ is updated by Bayes' rule into the posterior distribution.

3.5.1 Bayes' Rule

Definition 3.5.1. For data X and variable θ, Bayes' rule tells one how to update one's prior beliefs about the variable θ given the data X to a posterior belief, according to,

p(θ | X) = p(X | θ) p(θ) / p(X),    (3.8)

where p(θ | X) is the posterior probability, p(θ) is the prior probability, p(X | θ) is called the likelihood, and p(X) is called the evidence. The evidence is also called the marginal likelihood. The term likelihood is used for the probability that a model generates the data. The maximum a posteriori (MAP) estimate is the estimate that maximises the posterior probability,

θ_MAP = argmax_θ p(θ | X).    (3.9)

3.5.2 Multivariate Normal Distribution

A p-dimensional normal random vector x = [x_1, x_2, ..., x_p]^T with mean µ and covariance Σ has the distribution x ∼ N(µ, Σ). The corresponding probability density function is,

f(x) = 1 / √((2π)^p |Σ|) · exp(−(1/2) (x − µ)^T Σ^{−1} (x − µ)).    (3.10)

The following is true for a multivariate normally distributed random vector x [27]:

i) Linear combinations of the components of x are normally distributed.

ii) All subsets of the components of x are normally distributed.

iii) Zero covariance implies that the components are independently distributed.

iv) The conditional distributions of the components are normally distributed.

3.5.3 Conditional Distribution

Lemma 3.5.1. Let x = [x_1, x_2]^T be distributed as N(µ, Σ) with µ = [µ_1, µ_2]^T,

Σ = [[Σ_11, Σ_12], [Σ_21, Σ_22]],

and |Σ_22| > 0. Then the conditional distribution of x_1 given that x_2 = x_2 is normal, with,

x_1 | x_2 ∼ N(µ_1 + Σ_12 Σ_22^{−1} (x_2 − µ_2), Σ_11 − Σ_12 Σ_22^{−1} Σ_21).    (3.11)

For an example of how to obtain this result, see [27]. Note that the conditional covariance Σ_11 − Σ_12 Σ_22^{−1} Σ_21 does not depend on the conditioned variable.
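Lemma 3.5.1 translates directly into NumPy; the helper below and the two-dimensional example are illustrative:

```python
import numpy as np

def condition_mvn(mu, Sigma, idx1, idx2, x2):
    """Conditional distribution of x1 | x2 for a joint normal (Lemma 3.5.1)."""
    mu = np.asarray(mu, float)
    Sigma = np.asarray(Sigma, float)
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    S22_inv = np.linalg.inv(S22)
    mean = mu[idx1] + S12 @ S22_inv @ (np.asarray(x2, float) - mu[idx2])
    cov = S11 - S12 @ S22_inv @ S12.T   # does not depend on the observed value x2
    return mean, cov

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
mean, cov = condition_mvn(mu, Sigma, [0], [1], [1.0])
print(mean, cov)   # mean 0.8, variance 1 - 0.8^2 = 0.36
```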

3.5.4 Bayesian Linear Regression

In a Bayesian setting, following the notation in [39], the standard linear regression model is,

y = X^T w + ε,    (3.12)

where w is the vector of weights, or parameters, of the regression model, X the input matrix, and ε is assumed to be i.i.d. normally distributed noise, ε ∼ N(0, σ_n²). Note that X has the dimension p × n, where p is the number of features and n the number of observations. The probability density of the observations given the weights can be written as,

p(y | w, X) = Π_{i=1}^{n} p(y_i | x_i^T w) = Π_{i=1}^{n} 1/(√(2π) σ_n) exp(−(y_i − x_i^T w)² / (2σ_n²))
            = 1/(2πσ_n²)^{n/2} exp(−‖y − X^T w‖² / (2σ_n²)),    (3.13)

i.e., y | w, X ∼ N(X^T w, σ_n² I), where ‖z‖ denotes the ℓ2-norm of the vector z. Given a prior distribution of the weights, w ∼ N(0, Σ_p), using Bayes' rule and writing out the likelihood and the prior distribution, the posterior is,

p(w | X, y) ∝ exp(−(1/(2σ_n²)) (y − X^T w)^T (y − X^T w)) exp(−(1/2) w^T Σ_p^{−1} w)
           ∝ exp(−(1/2) (w − w̄)^T (σ_n^{−2} X X^T + Σ_p^{−1}) (w − w̄)),    (3.14)

where w̄ = σ_n^{−2} (σ_n^{−2} X X^T + Σ_p^{−1})^{−1} X y. With A = σ_n^{−2} X X^T + Σ_p^{−1}, the distribution can be written as,

p(w | X, y) ∼ N(σ_n^{−2} A^{−1} X y, A^{−1}).    (3.15)

The mean is the maximum a posteriori (MAP) estimate of w and thus the most probable weights of the underlying function.

When making predictions with the model, the average over the possible values of the weights, weighted by their respective posterior probability, is calculated. Namely, to get the predictive distribution of the function value, f_*, at x_*, one computes the average of the output from all possible linear models created by the weights w.r.t. the posterior given in Equation 3.15,

p(f_* | x_*, X, y) = ∫ p(f_* | x_*, w) p(w | X, y) dw = N(σ_n^{−2} x_*^T A^{−1} X y, x_*^T A^{−1} x_*).    (3.16)

In Figure 3.3 there is an example of Bayesian linear regression where the weights of the regression models are drawn from the prior and posterior distributions. This view of regression can be described as the weight-space view of regression and allows for limited flexibility if the output cannot be correctly described by a linear function.
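Equations 3.15 and 3.16 can be computed directly; the one-feature synthetic data, the prior Σ_p = I, and the noise level are assumptions for the example, with X stored as p × n as in the text:

```python
import numpy as np

def blr_posterior(X, y, sigma_n, Sigma_p):
    """Posterior N(sigma_n^-2 A^-1 X y, A^-1) over the weights (Equation 3.15).
    X has shape (p, n): one column per observation, as in the text."""
    A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
    A_inv = np.linalg.inv(A)
    w_bar = A_inv @ X @ y / sigma_n**2
    return w_bar, A_inv

def blr_predict(x_star, w_bar, A_inv):
    """Predictive mean and variance of f* at x* (Equation 3.16)."""
    return x_star @ w_bar, x_star @ A_inv @ x_star

# Synthetic one-feature data from y = 2x + noise (illustrative assumption)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2.0 * x + 0.1 * rng.standard_normal(100)
X = x[None, :]                               # shape (1, 100)

w_bar, A_inv = blr_posterior(X, y, sigma_n=0.1, Sigma_p=np.eye(1))
f_mean, f_var = blr_predict(np.array([0.5]), w_bar, A_inv)
print(w_bar)   # posterior mean close to the true weight 2
```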



Figure 3.3: (a) Joint density plot of values drawn from the prior distribution of the weights, w ∼ N(0, σ²I) with σ = 2. (b) 200 lines drawn with weights sampled from the prior distribution of the weights, w ∼ N(0, σ²I) with σ = 2. (c) Joint density plot of values drawn from the posterior distribution of the weights, Equation 3.15. (d) 200 lines drawn with weights sampled from the posterior distribution of the weights, Equation 3.15.

3.5.5 Feature Space Projection

A linear model is restricted to linear relationships between the response and feature variables and will perform poorly on non-linear data. To account for non-linear relationships, one can make projections into a feature space using some basis functions. For example, a scalar projection into the space of powers would lead to a polynomial regression model. A vital advantage of this is that if the projections are made onto fixed functions, the model is still linear in its parameters and therefore analytically tractable.

Let φ(x_i) = (φ_1, ..., φ_N) be a basis function that maps a p-dimensional vector into an N-dimensional feature space, and let the matrix Φ(X) be the collection of the columns φ(x_i) for all instances in the training set. The model is,

f(x_i) = φ(x_i)^T w,    (3.17)

where w now is an N × 1 vector. Following the same method as described before, it can be shown that the expression for the predictive distribution is the same as in Equation 3.16, with the exception that all X are replaced by Φ(X), i.e.,

f_* | x_*, X, y ∼ N(σ_n^{−2} φ(x_*)^T A^{−1} Φ y, φ(x_*)^T A^{−1} φ(x_*)),    (3.18)

where Φ = Φ(X) and A = σ_n^{−2} Φ Φ^T + Σ_p^{−1}. To make it more computationally efficient, the implementation is often rewritten as,

f_* | x_*, X, y ∼ N(φ_*^T Σ_p Φ (K + σ_n² I)^{−1} y,
                    φ_*^T Σ_p φ_* − φ_*^T Σ_p Φ (K + σ_n² I)^{−1} Φ^T Σ_p φ_*),    (3.19)

where φ_* = φ(x_*) and K = Φ^T Σ_p Φ are used for shorter notation. Note that this expression is equivalent to the expression for the conditional distribution presented in Equation 3.11. For a detailed explanation of the derivation, see [39].

3.5.6 The Kernel Trick

In Equation 3.19, one can see that the feature space always enters in the form φ(x)^T Σ_p φ(x′), regardless of whether x and x′ originate from the training or test data. One can also see that φ(x)^T Σ_p φ(x′) is an inner product with respect to Σ_p. Since Σ_p is positive definite, the square root Σ_p^{1/2} is well defined, and using the singular value decomposition Σ_p = U D U^T one can write Σ_p^{1/2} = U D^{1/2} U^T. By defining ψ(x) = Σ_p^{1/2} φ(x), the kernel can be written as,

k(x, x′) = ψ(x) · ψ(x′).    (3.20)

If an expression is defined only in terms of inner products in the input space, one can use the kernel trick and lift the inputs into the feature space, using the kernel k(x, x′) to replace the inner products. This kernel trick is convenient when it is cheaper to compute the kernel than the feature vectors. In Gaussian Processes, the kernel is the centre of interest rather than its corresponding feature space.

3.6 Gaussian Processes

Definition 3.6.1. A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

A Gaussian Process (GP) is determined by its mean and covariance function. Let m(x) denote the mean function and k(x, x′) the covariance function of a GP such that,

m(x) = E[f(x)],
k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))],    (3.21)

then the GP is written as,

f(x) ∼ GP(m(x), k(x, x′)).    (3.22)

The mean function, m(x), can be any real-valued function and is often set to 0 by demeaning the observations. The covariance function k(x, x′), also known as the kernel function, can be any function that satisfies Mercer's condition [39]. With a specified mean and covariance function, an implied distribution over functions is created. To sample from this distribution, let x_* be a number of input points; then a random Gaussian vector can be drawn from the distribution,

f_* ∼ N(m(x_*), k(x_*, x_*)),    (3.23)

and the generated values can be understood as functions of the inputs [39]. The covariance function, k, models the joint variability of the GP random variables, i.e., the function values, and returns the covariance between pairs of inputs. Thus, the joint distribution of the training data, f, and the test data, f_*, according to the prior distribution is,

[f, f_*]^T ∼ N(0, [[k(x, x), k(x, x_*)], [k(x_*, x), k(x_*, x_*)]]).    (3.24)

From the conditioning property of the Gaussian distribution described in Equation 3.11, the posterior distribution of the test data, f_*, is,

f_* | x_*, x, f ∼ N(k(x_*, x) k(x, x)^{−1} f,
                    k(x_*, x_*) − k(x_*, x) k(x, x)^{−1} k(x, x_*)).    (3.25)

This holds when assuming no noise in the underlying process and its distribution. To obtain a similar result with noise, as for the linear model in Section 3.5.4, one can add a noise parameter to the covariance function and instead get the prior distribution,

[y, f_*]^T ∼ N(0, [[k(x, x) + σ_n² I_n, k(x, x_*)], [k(x_*, x), k(x_*, x_*)]]).    (3.26)

When making inference, one uses the following posterior distribution,

f_* | x_*, x, y ∼ N(k(x_*, x) (k(x, x) + σ_n² I_n)^{−1} y,
                    k(x_*, x_*) − k(x_*, x) (k(x, x) + σ_n² I_n)^{−1} k(x, x_*)).    (3.27)
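Equation 3.27 with a zero mean function fits in a few lines of NumPy. The kernel below is the RBF-kernel discussed in Section 3.6.1; the five training points reuse the example function from Figure 3.4, while the noise level σ_n = 0.1 and the hyperparameter values are assumptions:

```python
import numpy as np

def rbf(a, b, sigma=1.0, length=1.0):
    """RBF covariance between two sets of one-dimensional inputs (Equation 3.28)."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return sigma**2 * np.exp(-0.5 * d2 / length**2)

def gp_posterior(x, y, x_star, sigma_n=0.1):
    """Posterior mean and covariance of f* from Equation 3.27 (zero mean function)."""
    K = rbf(x, x) + sigma_n**2 * np.eye(len(x))
    K_s = rbf(x_star, x)
    K_inv = np.linalg.inv(K)
    mean = K_s @ K_inv @ y
    cov = rbf(x_star, x_star) - K_s @ K_inv @ K_s.T
    return mean, cov

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x**2 + 5 * np.sin(x)                  # the example function from Figure 3.4
x_star = np.array([-2.0, 0.5, 2.0])
mean, cov = gp_posterior(x, y, x_star)
# The posterior mean passes close to the training points; the posterior
# variance (diagonal of cov) is largest between and outside them.
```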

In contrast to the weight-space view in Section 3.5.4, one can obtain the same results by making inference directly in function space, and the GP is therefore used to describe a distribution over functions. GPs can, for certain types of kernels, be seen as a Bayesian nonparametric generalisation of autoregressive models such as AR(p) processes, making them suitable for time series applications [10].

Figure 3.4: (a) Prior distribution of function values of a GP with a mean function that is 0 and an RBF-kernel with parameters ℓ = 1 and σ = 1. (b) Posterior distribution of function values with a mean function that is 0 and an RBF-kernel with parameters ℓ = 1 and σ = 1, conditioned on five function values drawn from f(x) = x² + 5 sin(x).

In Figure 3.4 there is an example of fitting a GP with mean function 0 and an RBF-kernel with parameters σ = 1 and ℓ = 1 on five samples from the function f(x) = x² + 5 sin(x). Figure 3.4a shows the prior distribution of the GP with a 95% confidence interval of the function values for the specified process. Figure 3.4b shows the posterior distribution, computed by Equation 3.25, of the function values with a 95% confidence interval.

3.6.1 Choice of Covariance Function

The notion of similarity is crucial, and likewise, the choice of covariance function when using a GP is essential. The choice of covariance function encodes the assumptions about the underlying function that is estimated. The most commonly used kernel is the RBF-kernel, also called squared exponential or Gaussian,

k(x, x′) = σ² exp(−‖x − x′‖² / (2ℓ²)),    (3.28)

where σ and ℓ are hyperparameters. σ can be said to control the overall variance of the random functions, and ℓ can be thought of as controlling "the distance one has to move in input space before the function value can change significantly" [39]. One can show that using infinitely many basis functions of the form,

φ_c(x) = exp(−(x − c)² / (2ℓ²)),    (3.29)

in a Bayesian linear regression with prior distribution w ∼ N(0, σ_p² I) gives rise to a GP with an RBF-kernel.



Figure 3.5: (a) Three samples from the prior distribution of different values of ` with a RBF-kernel. (b) Three samples from the prior distribution of different values of ν with a Matérn-kernel. (c) Heat map of the covariance matrix produced with a RBF-kernel for different values of `. (d) Heat map of the covariance matrix produced with a Matérn-kernel for different values of ν.

Another type of covariance function is the Matérn class, given by,

k(r) = σ² (2^{1−ν} / Γ(ν)) (√(2ν) r / ℓ)^ν K_ν(√(2ν) r / ℓ),    (3.30)

where r = ‖x − x′‖, σ, ν and ℓ are the hyperparameters, K_ν is the modified Bessel function, and Γ is the Gamma function. The interpretation of the parameters σ and ℓ is the same as for the RBF-kernel. When ν = p + 1/2, where p is a non-negative integer, the Matérn kernel is a product of an exponential kernel, i.e., an RBF-kernel with absolute distance, and a polynomial kernel of order p. For ν = 1/2 in one dimension, the Matérn kernel gives rise to an Ornstein-Uhlenbeck process, often used in financial mathematics to model, e.g., interest rates or commodity prices.

Figure 3.5 shows examples of drawing functions from the RBF-kernel and the Matérn-kernel for different values of some of the crucial hyperparameters, ℓ for the RBF-kernel and ν for the Matérn-kernel, together with the covariances produced by these prior distributions. Figures 3.5a and 3.5c show that a higher value of ℓ means a higher correlation between function values, and function values drawn from these distributions vary at different rates: a low ℓ gives rise to a very sharp curve, while a large ℓ gives a more straight curve. Figures 3.5b and 3.5d show that a higher value of ν means a higher correlation between function values, and function values drawn from these distributions have different degrees of smoothness: a low ν gives rise to a very sharp curve, while a high ν gives a smooth curve. Note that, as ν → ∞, the Matérn-kernel converges to the RBF-kernel.

3.6.2 Optimising the Hyperparameters

To find the covariance function most suitable for the problem, one can find the hyperparameters that best fit the observed data. This hyperparameter optimisation can be performed by maximising the log-likelihood of the marginal distribution. Given the p-dimensional marginal normal distribution,

p(y | µ, Σ) = 1 / √((2π)^p |Σ|) · exp(−(1/2) (y − µ)^T Σ^{−1} (y − µ)),    (3.31)

one gets the following log-likelihood,

log p(y | µ, Σ) = −(1/2) (y − µ)^T Σ^{−1} (y − µ) − (1/2) log |Σ| − (p/2) log 2π.    (3.32)

Thus, in the GP setting where x is a p × n matrix, the log-likelihood is,

log p(y | x, θ) = −(1/2) y^T K^{−1} y − (1/2) log |K| − (n/2) log 2π,    (3.33)

where K is the covariance matrix for the targets y and θ are the hyperparameters that determine the structure of the covariance function. This can be maximised with a standard numerical optimisation algorithm.
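A sketch of maximising Equation 3.33 for an RBF-kernel with an added noise term: the log-parameterisation (which keeps the hyperparameters positive), the choice of L-BFGS-B, and the synthetic data are assumptions of the example:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, x, y):
    """Negative of Equation 3.33 for an RBF-kernel plus noise;
    parameters are optimised on log scale."""
    sigma, length, sigma_n = np.exp(log_params)
    d2 = (x[:, None] - x[None, :]) ** 2
    K = sigma**2 * np.exp(-0.5 * d2 / length**2) + sigma_n**2 * np.eye(len(x))
    _, logdet = np.linalg.slogdet(K)
    alpha = np.linalg.solve(K, y)       # K^{-1} y without forming the inverse
    return 0.5 * y @ alpha + 0.5 * logdet + 0.5 * len(x) * np.log(2 * np.pi)

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 40)
y = np.sin(2 * x) + 0.05 * rng.standard_normal(40)

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]),
               args=(x, y), method="L-BFGS-B")
sigma, length, sigma_n = np.exp(res.x)
print(length)   # the learned length scale adapts to the wiggliness of sin(2x)
```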

3.7 Artificial Neural Networks

The fundamental idea behind artificial neural networks (ANN) is the artificial neuron, commonly referred to as the perceptron. A perceptron has three building blocks: the weights w = [w_1, w_2, ..., w_n], the bias b ∈ R, and a differentiable activation function f(·). If x is an n × 1 dimensional input vector, the perceptron performs the following calculation,

a = f(wx + b).    (3.34)

The main idea of the perceptron is to,

i) receive information in the form of inputs,

ii) sum the inputs,

iii) process the sum through an activation function,

iv) output the processed information, also called the activation.

Thus, the perceptron is a function that maps an n-dimensional input to a 1-dimensional output, f : R^n → R. In Figure 3.6, the three most common activation functions, σ, ReLU, and tanh, are presented.

One or multiple perceptrons can form a so-called layer. For a layer with m perceptrons/neurons, this can be expressed with the compact notation,

a = f(Wx + b)   (3.35)
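As a minimal sketch, the layer mapping of Equation 3.35 is an activation applied to an affine transformation of the input; the sizes and the tanh activation below are illustrative assumptions:

```python
import numpy as np

def dense_layer(x, W, b, f=np.tanh):
    # Eq. 3.35: apply an activation f to an affine transformation of the input.
    return f(W @ x + b)

rng = np.random.default_rng(0)
n, m = 8, 3                        # 8 inputs, 3 neurons
W = rng.normal(size=(m, n))        # one weight row per neuron
b = rng.normal(size=m)
x = rng.normal(size=n)
a = dense_layer(x, W, b)           # shape (3,), i.e. the mapping f: R^8 -> R^3
```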


Figure 3.6: (a) The sigmoid activation function, σ, takes values ∈ (0, 1). (b) The ReLU activation function takes values ∈ [0, ∞). (c) The tanh activation function takes values ∈ (−1, 1).

where a = [a1, a2, . . . , am]ᵀ, W = [w1, w2, . . . , wm]ᵀ and b = [b1, b2, . . . , bm]. Thus, a layer can be seen as the function mapping f : R^n → R^m, where one applies a non-linear activation function to an affine transformation of the input. In Figure 3.7a, the output layer is an example of a perceptron taking eight values as input and outputting a single value.

3.7.1 Multilayer Perceptron

Combining one or multiple perceptrons creates a neuron layer as described in Equation 3.35. If layers are connected, they create a multilayer perceptron network (MLP). An MLP comprises an input layer, one or more hidden layers, and an output layer. All layers can be composed of single or multiple perceptrons. All hidden layers and the output layer have weights and biases, which are the parameters one would like to optimise.

A basic MLP is depicted in Figure 3.7b. It has an input layer, one hidden layer with two neurons, and an output layer with one output neuron. Passing the data forward in the network can be written as,

a1 = f0 (W0x + b0)

a2 = f1 (W1a1 + b1) (3.36)

yˆ = f2 (W2a2 + b2) , which is equivalent to,

yˆ = f2 (W2f1 (W1f0 (W0x + b0) + b1) + b2) . (3.37)

In a general setting, this can then be extended to an MLP with L layers following the notation in Equation 3.37,

yˆ = fL (WLfL−1 (··· (W2f1 (W1f0 (W0x + b0) + b1) + b2) ··· ) + bL) . (3.38)
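The general forward pass of Equation 3.38 can be sketched as a loop over layers; the layer sizes (matching the MLP in Figure 3.7b) and the choice of activations are illustrative assumptions:

```python
import numpy as np

def mlp_forward(x, weights, biases, activations):
    # Eq. 3.38: repeatedly apply a = f_l(W_l a + b_l), layer by layer.
    a = x
    for W, b, f in zip(weights, biases, activations):
        a = f(W @ a + b)
    return a

relu = lambda z: np.maximum(z, 0.0)
identity = lambda z: z

rng = np.random.default_rng(1)
sizes = [8, 2, 1]                   # input, one hidden layer, single output
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y_hat = mlp_forward(rng.normal(size=8), weights, biases, [relu, identity])
```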


Figure 3.7: (a) Example of a perceptron as a function f : R^8 → R. (b) Example of a two-layer MLP with one input layer, one hidden layer with two neurons, and an output layer with a single neuron.

3.7.2 Training Neural Networks

Performing one iteration of Equation 3.38 is called a forward pass, because each layer "passes" the information to the next layer, from the input layer to the output layer. After the forward pass, one is interested in comparing the prediction to the true value, y. This is achieved by calculating a loss metric through a specified cost function that measures the error of the prediction. For k observations, this is calculated as,

J(W0, b0, . . . , WL, bL) = (1/k) Σ_{i=1}^{k} L(y, ŷ).   (3.39)

A typical loss metric in regression is the mean squared error loss. As presented in Equation 3.39, the cost function depends on all training examples, which by Equation 3.38 implies that it also depends on all layers. The gradient of the loss function is calculated through backpropagation, that is, the chain rule of the partial derivatives,

∂J/∂Wl, ∂J/∂bl,   (3.40)

where l denotes the lth layer. The parameters are then updated using some update rule. For gradient descent, the update rule is,

Wl = Wl − α ∂J/∂Wl,
bl = bl − α ∂J/∂bl,   (3.41)

where α is a hyperparameter called the learning rate. This update is called the backward pass.

This training algorithm can, in just two passes, the forward and backward pass, compute the gradient of the loss function with respect to every parameter in the model, thus computing how the weights and biases should be changed in the network to reduce the error given by the loss function [41]. Calculating the gradient for the full training set is not computationally efficient, especially not when the number of observations is large. Instead, the gradient is calculated for a mini-batch. Even though the gradient calculated through a mini-batch is only an estimate of the true gradient, it speeds up the learning convergence when backpropagating the errors. When all mini-batches have had a forward and backward pass, one epoch has been performed. Usually, several epochs need to be executed before the error of the loss function has converged.
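The forward pass, backward pass, mini-batches, and epochs described above can be sketched for a simple linear model with a mean-squared-error loss; the data, learning rate, and batch size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(256, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=256)

w, b = np.zeros(3), 0.0
alpha, batch_size = 0.1, 32            # learning rate and mini-batch size

for epoch in range(50):                # one epoch = one pass over all mini-batches
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        err = Xb @ w + b - yb                 # forward pass
        grad_w = 2 * Xb.T @ err / len(batch)  # backward pass: MSE gradients
        grad_b = 2 * err.mean()
        w -= alpha * grad_w                   # gradient-descent update, Eq. 3.41
        b -= alpha * grad_b
```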

3.8 Recurrent Neural Networks

Recurrent neural networks (RNNs) are a family of neural networks that specialise in processing sequences of data, e.g., audio recordings, text sequences, or a time series of stock prices. The difference compared to an MLP is that RNNs share parameters across different parts of the network. This parameter sharing makes it possible to generalise to different sequence lengths and to detect important information in multiple sequence locations. RNNs share parameters by letting each output be a function of previous outputs, with each output produced using the same update rule applied to the previous outputs. Hence the name recurrent neural networks, and hence their ability to share parameters through a deep network [16].

With the description above in mind, one can model a simple dynamic system driven by an exogenous variable xt that outputs its state ht with the parameters θ as a single cell,

ht = f(ht−1, xt; θ).   (3.42)

Unfolded through time, one can write this as,

ht = f(ht−1, xt; θ)
   = f(f(ht−2, xt−1; θ), xt; θ)
   = . . .   (3.43)
   = f(f(· · · f(h0, x1; θ) · · · , xt−1; θ), xt; θ)
   = gt(xt, xt−1, . . . , x2, x1).

The function gt takes the whole sequence as input and produces its state, but by unfolding the structure, one factorises gt into the repeated application of the

Figure 3.8: Unfolding of an RNN through time, where xt is the input, A is the function that is repeated, and ht is the output state.

function f. This unfolding can be visualised intuitively and is depicted in Figure 3.8. The unfolding procedure allows us to disregard the length of the sequences. The input will always have the same dimension, and since each step is a transition from one state to another, it is possible to use the same transition function and its parameters at every time step. Because of this, a single model f can be learned, and it can be estimated with fewer training examples than without parameter sharing.
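A minimal sketch of the recurrence in Equations 3.42–3.43, with the same parameters θ = (Wh, Wx, b) reused at every time step; the sizes and the tanh transition are illustrative assumptions:

```python
import numpy as np

def rnn_cell(h_prev, x_t, Wh, Wx, b):
    # Eq. 3.42: the same parameters theta = (Wh, Wx, b) at every time step.
    return np.tanh(Wh @ h_prev + Wx @ x_t + b)

rng = np.random.default_rng(3)
hidden, features, steps = 4, 2, 10
Wh = rng.normal(scale=0.5, size=(hidden, hidden))
Wx = rng.normal(scale=0.5, size=(hidden, features))
b = np.zeros(hidden)

h = np.zeros(hidden)
for t in range(steps):              # the unfolding of Eq. 3.43 through time
    h = rnn_cell(h, rng.normal(size=features), Wh, Wx, b)
```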

A common drawback of vanilla RNNs is that they have a hard time learning long-term dependencies. These are usually the very dependencies one would like an RNN to learn in the first place. The reason is that when training an RNN, backpropagation is made through time; when the gradients are propagated over several stages, they tend to vanish or explode1 [35]. Even if these problems do not occur and the model can learn and represent long-term dependencies, the gradient of long-term dependencies will be exponentially smaller than that of short-term dependencies. This means that signals of long-term dependencies will be hidden in varying short-term dependencies, and because of this, gradient-based optimisation becomes difficult [16]. To address this, gated RNN cells such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) cells have been introduced.

3.8.1 Long-Short Term Memory

Several different architectures are named LSTM, but the most common one is the one described here. Consider the situation of Equation 3.42 in a vanilla RNN, where there is a chain of repeating operations; here, the repeating operation is often of a very simple structure. Figure 3.8 shows an example of a usual and straightforward structure with the following operations:

i) a concatenation between the previous hidden state ht−1 and the input xt, [ht−1, xt].

Figure 3.9: A schematic overview of an LSTM-cell, where circles represent an elementwise operation, a box represents a layer of a neural network with the specified activation function, converging arrows represent a concatenation of matrices, and diverging arrows represent copying a matrix.

1Each derivative w.r.t. a parameter at a certain time will be connected to the same parameter at all other time points in the RNN; so, imagine multiplying many values that are < 1, and vice versa.

ii) the concatenated parts are fed into an FNN with a tanh activation,

ht = tanh (W [ht−1, xt] + b)

iii) and the new hidden state is now fed to the next cell and given as the output of the cell.

This simple structure gives rise to the previously described problems of training RNNs.

Instead of applying an elementwise nonlinearity to an affine transformation of the inputs recurrently, the LSTM-cell has an internal recurrence in addition to the outer recurrence. The key to the LSTM-cell is the cell state, Ct, and how it is updated. The architecture has the same repeating nature as a vanilla RNN, but the cell's internal operations are different. An LSTM-cell consists of a forget gate, an input gate, and an output gate, and its goal is to keep track of the hidden state ht and the cell state Ct. Figure 3.9 shows a schematic overview of the LSTM-cell. The step-by-step calculations will now be described.

First, the LSTM-cell decides what information is important through the forget gate. The cell looks at the previous hidden state ht−1 and the input xt and processes these through a σ-layer, so that unimportant things will have values close to 0 and important things values close to 1,

ft = σ (Wf [ht−1, xt] + bf ) . (3.44)

Next, the cell decides what new information should be stored in the cell state, Ct, and thus candidate values, C̃t, are calculated. This is done in two steps: first, the input gate decides which values will be updated through a single σ-layer, and then a tanh-layer creates a vector of the candidate values.

it = σ(Wi[ht−1, xt] + bi)
C̃t = tanh(WC[ht−1, xt] + bC)   (3.45)

The next step is to update the old cell state, Ct−1, into the new cell state, Ct, using the results from the previous steps,

Ct = ft ⊙ Ct−1 + it ⊙ C̃t,   (3.46)

where ⊙ is the Hadamard product, i.e., element-wise multiplication. Now that the cell state is updated, it is time to decide what to output. This calculation is based on a filtered version of the cell state: first, a σ-layer determines which parts of the cell state will be part of the output. Then a tanh-layer processes the cell state, and the result is multiplied by the output of the σ-gate.

ot = σ(Wo[ht−1, xt] + bo)
ht = ot ⊙ tanh(Ct)   (3.47)

By doing this, only the parts that were decided on will be output [32]. One can show that this set-up mitigates the vanishing gradient and ensures that at least one "path" does not vanish. Figure 3.10 shows a schematic overview of these steps.
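The gate equations 3.44–3.47 can be sketched directly; the weight layout (one weight block per gate acting on the concatenation [ht−1, xt]) and the sizes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(h_prev, c_prev, x_t, W, b):
    # One weight block per gate, each acting on the concatenation [h_{t-1}, x_t].
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate, Eq. 3.44
    i = sigmoid(W["i"] @ z + b["i"])          # input gate, Eq. 3.45
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate values, Eq. 3.45
    c = f * c_prev + i * c_tilde              # cell-state update, Eq. 3.46
    o = sigmoid(W["o"] @ z + b["o"])          # output gate, Eq. 3.47
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(4)
hidden, features = 3, 2
W = {k: rng.normal(scale=0.5, size=(hidden, hidden + features)) for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}
h, c = np.zeros(hidden), np.zeros(hidden)
for t in range(5):
    h, c = lstm_cell(h, c, rng.normal(size=features), W, b)
```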



Figure 3.10: Circles represent an elementwise operation, a box represents a layer of a neural network with the specified activation function, converging arrows represent a concatenation of matrices and diverging arrows represent copying a matrix. (a) Depiction of the flow of the input to the forget gate of the LSTM-cell. (b) Depiction of the flow of the input to the input gate and production of candidate values of the LSTM-cell. (c) Depiction of the flow to update the cell state of an LSTM-cell. (d) Depiction of the flow of the input to the output gate of the LSTM-cell.

3.9 Convolutional Neural Networks

Convolutional neural networks (CNNs) are neural networks that use convolution instead of matrix multiplication in at least one of their layers. Thus, they are also constructed of neurons with weights and biases that are updated through backpropagation. The CNN has been inspired by research on the visual cortex and specialises in processing data with a grid-like topology. Data with grid-like topology is, e.g., time series data, which can be seen as a one-dimensional grid with samples at specific time points, or image data, which can be seen as a two-dimensional grid of pixels [16].

Figure 3.11: Example of a one-dimensional convolution on input data of size 4 × 3 that is convolved with a filter of size 2, getting the dimension 2 × 3, yielding a feature map of size 3 × 1.

The convolutional operation used in CNNs does not exactly equal the mathematical operation convolution used in other fields; nonetheless, the intuition remains the same. The convolution layer is a layer where each neuron only processes a subset of the input data, called the receptive field. The processing is done by sliding a filter, called a kernel, over the layer's input, producing a so-called feature map or activation map from each filter. When the filter slides over the input, the element-wise product between the filter and the overlapping values is computed and summed. A single filter might not be enough for detecting all important features; therefore, several filters are usually used to detect more features of the data. An increasing number of filters increases the depth of the output [16][1].

Sliding the filter over the dataset can be done in many ways, yielding different sizes of the activation map. The step size of the sliding filter is called the stride. Now, how large will the activation map be, given that a filter is slid over the dataset with a stride of 1? Let Fq × Fq be the size of the filter, and Hq × Wq the size of the input data at the qth layer. Applying a convolution in this layer amounts to aligning the filter at Hq+1 = (Hq − Fq + 1) positions along the height, and Wq+1 = (Wq − Fq + 1) positions along the width of the input data, resulting in an output with the dimension Hq+1 × Wq+1. This leads us to a more formal definition of the convolution. The pth filter in the qth layer is denoted as a 3-dimensional tensor W^(p,q) = [w_{ijk}^{(p,q)}], where i, j and k indicate positions along the height, width and depth of the filter. The feature maps in the qth layer are represented by the 3-dimensional tensor A^(q) = [a_{ijk}^{(q)}]. Then,

a_{ijp}^{(q+1)} = Σ_{r=1}^{Fq} Σ_{s=1}^{Fq} Σ_{k=1}^{dq} w_{rsk}^{(p,q)} a_{i+r−1, j+s−1, k}^{(q)},

∀i ∈ {1, . . . , Hq − Fq + 1},
∀j ∈ {1, . . . , Wq − Fq + 1},   (3.48)
∀p ∈ {1, . . . , dq+1},

is the convolutional operation, also denoted ∗, between layer q and q + 1

[1]. In this thesis, we will only use what are called one-dimensional convolutions, which means that the filter is slid over one dimension. The filter will thus have the size Fq × Wq and only move in the direction of time. Figure 3.11 depicts an example of input data of size 4 × 3 that is convolved with a filter of size 2, getting the dimension 2 × 3, and outputs a feature map of size 3 × 1.
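A one-dimensional convolution of this kind can be sketched in a few lines, reproducing the Figure 3.11 example (4 × 3 input, 2 × 3 filter, 3 × 1 feature map); the all-ones filter is an illustrative assumption:

```python
import numpy as np

def conv1d_valid(X, F):
    # One-dimensional convolution in the CNN sense (no filter flipping):
    # slide the filter F (size f x d) along the time axis of X (size T x d).
    T, d = X.shape
    f = F.shape[0]
    return np.array([np.sum(X[t:t + f] * F) for t in range(T - f + 1)])

X = np.arange(12, dtype=float).reshape(4, 3)   # input of size 4 x 3, as in Figure 3.11
F = np.ones((2, 3))                            # filter of size 2 x 3
out = conv1d_valid(X, F)                       # feature map of size 3 x 1
```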

The convolution operation reduces the size of the data going through the layer. A common method to keep the size of the input data is to use the so-called padding. Using padding means that an appropriate number of zeros are added around the borders of the feature map to maintain the dimensions of the data. One type of padding is called same-padding, which adds zeros so that the input and output have the same size [1].

By using convolution instead of the usual matrix multiplication, one gets sparser interactions between neurons. Parameters in convolutions are shared across the input, which creates equivariant representations. Equivariant representations mean that, when processing time series data, the convolution creates a timeline showing when different features appear; if one of the inputs is shifted later in time, the convolution will output the same representation, but later in the timeline [16][31][1].

3.10 WaveNet

WaveNet is a fully probabilistic and autoregressive deep neural network developed for generating audio waves by DeepMind in 2016 [34]. The joint probability of the waveform x = {x1, . . . , xT} is factorised into a product of conditional probabilities,

p(x) = ∏_{t=1}^{T} p(xt | x1, . . . , xt−1).   (3.49)

This means that each audio sample xt is conditioned on the previous samples. The conditional distribution is modelled by a stack of several one-dimensional convolutional layers, and the model outputs a categorical distribution over the next value xt. The model is optimised to maximise the log-likelihood of the data with respect to the model parameters.

The standard WaveNet can easily be extended to what is called a conditional WaveNet, which can model the distribution p(x | h) of the audio given an additional input h. Equation 3.49 in the conditional WaveNet setting becomes,

p(x | h) = ∏_{t=1}^{T} p(xt | x1, . . . , xt−1, h).   (3.50)

By adding h as an input variable, one can guide the model's generation to produce audio with desired properties, e.g., speaker identity or text2 [34].

2Commonly done when creating audio from text, called text-to-speech.


Figure 3.12: (a) Visualisation of stacked causal convolutions with filter width 2. (b) Visualisation of stacked dilated causal convolutions with filter width 2 and an exponentially increased dilation rate [34].

The key aspect of the WaveNet is the use of causal convolutions; by using these, one can make sure the model does not violate the temporal ordering of the data. This means that the prediction at time step t, p(xt+1 | x1, . . . , xt), only depends on previous time steps. Due to the non-recurrence of the causal convolution, they are faster to train than an RNN when applied to long sequences. A drawback of causal convolutions is that they require many layers, or huge filters, to increase the receptive field. The WaveNet tackles this problem by using dilated causal convolutions. In dilated convolutions, also called convolutions with holes, the filter is applied to an area larger than its length by skipping input values with a certain step size, which lets the network work on a coarser scale. Because of this, stacked dilated convolutional layers allow the network to have a huge receptive field with few layers while still possessing low computational complexity.

In the WaveNet, the dilation rate is doubled for every layer until a specific limit and then repeated, e.g., 1, 2, 4,..., 64, 1, 2, 4,..., 64. This pattern lets the network have a receptive field of 128 for each block and can be seen as a non-linear discriminative counterpart of a single 1 × 128 convolution [34]. For an example of causal convolutions and dilated causal convolutions with exponentially increasing dilation rate, see Figure 3.12.
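The receptive-field arithmetic can be checked with a small helper (the function name is ours): each causal layer with filter width w and dilation d adds (w − 1)·d time steps, so the block 1, 2, 4, . . . , 64 with width 2 gives 1 + 127 = 128:

```python
def receptive_field(dilations, filter_width=2):
    # Each causal layer with dilation d and width w adds (w - 1) * d
    # time steps to the receptive field; the input itself contributes 1.
    return 1 + sum((filter_width - 1) * d for d in dilations)

block = [1, 2, 4, 8, 16, 32, 64]       # doubled dilation rates, as in the text
rf = receptive_field(block)            # 128 for one block
```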

Another contribution of the WaveNet model is the use of a gated activation unit that is inspired by the gating mechanisms in LSTMs [33],

z = tanh(Wf,k ∗ x) ⊙ σ(Wg,k ∗ x),   (3.51)

where ∗ is the convolution operator, ⊙ is the Hadamard product, k is the layer index, f and g denote the filter and the gate, and W is a learnable convolutional filter [34].

The WaveNet also uses both residual blocks and parameterised skip connections in the network to speed up training and make it possible to train a much deeper model. A residual connection tries to learn the residual between the input to a layer and the mapping in the subsequent layers. Let H(x) be the underlying mapping to be fit by subsequent layers, where x is the input to the first layer. Then, the residual connection tries to learn the residual function F(x) = H(x) − x, so the original function is F(x) + x. The motivation is that if the subsequent layers do not affect the output, i.e., they form the identity mapping, a deeper model should still not have a greater training error than a shallower network. In practice, however, the identity mapping may be sub-optimal and hard to learn. The residual learning approach makes it easier to recover the identity mapping if it is optimal, meaning that the subsequent layer is unnecessary [18].

Figure 3.13: Overview of the WaveNet architecture, the residual block, and the skip connections [34].

The terms residual block and skip connection are often used interchangeably, but in the WaveNet architecture they are distinguished: the residual connection is placed between each block of a dilated causal convolution, whereas the parameterised skip connections go from each layer to the layer before the output [34]. Figure 3.13 shows how this looks in the WaveNet architecture.

3.11 Batch Normalisation

Making data normalisation a part of the model architecture and performing normalisation for each mini-batch during training is called batch normalisation. In some cases, this allows increased learning rates and can speed up the convergence of the network. Batch normalisation is performed on each dimension independently. For a p-dimensional X = [x1, . . . , xp], each dimension is normalised according to,

x̂k = (xk − E[xk]) / √(Var[xk]).   (3.52)

Simply normalising the input to a layer can change the representation of the layer completely. The authors in [26] note that one needs to make sure that the transformation in the network can represent the identity mapping. For each activation, xk, a pair of parameters γk, βk is therefore introduced. These values scale and shift the normalised value, yi = γk x̂ki + βk, and are learned jointly with the other model parameters. Because of this, if optimal, one has γk = √(Var[xk]) and βk = E[xk], and the original activation is recovered.

The entire procedure of batch normalisation for dimension k, in a mini-batch consisting of m values, is described in Algorithm 1. ε is a constant added for numerical stability, and the scaled and shifted values {yi} are the output from the BN layer [26].

Input: Values of xk from a mini-batch. Parameters: γ, β. Output: {yi}.

i) Compute the mini-batch mean: µ_B = (1/m) Σ_{i=1}^{m} xi.

ii) Compute the mini-batch variance: σ²_B = (1/m) Σ_{i=1}^{m} (xi − µ_B)².

iii) Normalise: x̂i = (xi − µ_B) / √(σ²_B + ε).

iv) Scale and shift: yi = γ x̂i + β.

Algorithm 1: Batch normalisation applied to x over a mini-batch.
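Algorithm 1 translates almost line by line into NumPy; the batch size, feature count, and ε below are illustrative assumptions:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Algorithm 1: normalise each dimension over the mini-batch, then scale and shift.
    mu = x.mean(axis=0)                      # i) mini-batch mean
    var = x.var(axis=0)                      # ii) mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # iii) normalise
    return gamma * x_hat + beta              # iv) scale and shift

rng = np.random.default_rng(5)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))   # mini-batch of 64, 4 features
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```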

Chapter 4

Method

In this section, we will present the execution procedure. The outline is intended to follow the chronological sequence in which we carried out the implementation, following the order described in the previous section.

We start by introducing the mathematical notation used throughout the section before explicitly framing the two use cases. We continue by presenting the dataset and the data preparation performed before any analysis takes place. Later, we describe the general experimental design and the data post-processing tasks. Finally, the evaluation procedure and the models are described, leading us to the results presented in the upcoming section. In addition, an exploratory data analysis has been performed; the results from that analysis can be found in Appendix E.

4.1 Notation

Let X ∈ R^{T×d} denote the full dataset with T data points x_t = [x_t^1, x_t^2, . . . , x_t^d] ∈ R^d. Assume that all data points were measured at T consecutive time points τ = [t_1, t_2, . . . , t_T], where t_i < t_{i+1} holds for all i. With this setting, X can be viewed as a multivariate time series of length T in time and with d dimensions. Also, the d dimensions can be divided into C channels, where c_1, . . . , c_C denote a specific channel ranging over the dimensions corresponding to one instrument.

The missing value problem can now be stated as: any number of these data points x_i^j can be missing, i.e., their true value is unknown. Each data point can therefore be partitioned into observed and unobserved features. For the data point x_t, the observed features are x_t^o = [x_t^j | x_t^j is observed] and the unobserved/missing features are x_t^m = [x_t^j | x_t^j is missing]. With this setting, one has x_t^o ∪ x_t^m = x_t.

The problem of imputing the missing values is to estimate the true values of the missing features X^m = [x_{1:T}^m] given the observed features X^o = [x_{1:T}^o]. If one assumes the T observations to be independent, the problem can be divided into T estimation problems of p(x_t^m | x_t^o). However, in the time series setting, the observations are seldom independent in time, which makes the estimation problem of p(x_t^m | X^o) more complex.


Figure 4.1: (a) An illustration of use case one for a 120-day period of the S&P 500's price process. (b) An illustration of use case two on the S&P 500's price process. The blue line indicates observed prices, whereas the red dashed line indicates missing price points.

4.2 Problem Framing

This project focuses on two specific use cases where one would see missing data in financial time series; see Section 1.1 for the use case motivation. The use cases are further framed here to simplify model selection and evaluation. The dataset is multi-dimensional, where a specific dimension corresponds to one time series. The dataset contains daily observations of multi-dimensional financial instruments, i.e., curves and surfaces. Three points represent each observation of a curve, and nine points represent each observation of a surface. In the coming segment, we will denote the gathered dimensions corresponding to one instrument as a channel, k. Note that |k| = 3 if the channel corresponds to a curve, and |k| = 9 if it corresponds to a surface. The dataset has in total 141 dimensions over 34 channels.

4.2.1 Use Case One

The first use case concerns single or a few missing data points. The multivariate time series X ∈ R^{T×d} is fully observed for all but one channel k, meaning that all time points in the reference channels will have complete data. Thus, x_i^j ∈ X^o for all i and all j ∉ k. Further, assume that the endpoints of the incomplete time series have observed values. With such an assumption, the problem is constricted to an interpolation problem for all missing values, i.e., there will always be a known prior and posterior reference point. Figure 4.1a illustrates how the missing values are located. For use case one,

i) Missing values are randomly placed.

ii) Missing values appear in clusters of 1–3 time steps, i.e., x_{t+3}^k ∈ X^o if {x_t^k, x_{t+1}^k, x_{t+2}^k} ∈ X^m.

iii) Missing values constitute about 20% of the time series, |X^k ∈ X^m| / |X^k| ≈ 20%.

iv) Endpoints are fully observed, {x_1^k, x_T^k} ∈ X^o.

4.2.2 Use Case Two

The second use case concerns consecutive missing data points over a longer horizon at the endpoint of the time series. Again, let us assume that a single channel is missing, implying the same assumption of parallel series as in the previous use case. In this use case, assume that the time series is complete up to a certain point τp, after which there is no data for that series. With this assumption, the problem is constricted to an extrapolation problem with parallel channels for reference. Figure 4.1b illustrates how the missing values are located. For use case two,

i) Missing values are consecutively placed at the end of the series, [x_{1:τp}^k] ∈ X^o, and [x_{τp+1:T}^k] ∈ X^m.

ii) Missing values constitute about 20% of the time series, |X^k ∈ X^m| / |X^k| ≈ 20%.

Note that we have chosen the last 20% of the time series to be unobserved. This is to see whether the imputation procedure works in a highly stressed environment, which this period indeed was, as described in Section 1. If a method succeeds in a stressed environment, we argue that it should perform well under normal conditions as well.

4.3 Dataset

The analysis will be based on a dataset sourced from Refinitiv, provided by Nasdaq. The dataset stretches from January 2nd, 2014, to January 15th, 2021, and contains daily market data for several market variables. However, in this thesis, only a subset of these market variables is included. The dataset is reduced because some variables in the original dataset are almost perfectly correlated, e.g., derivatives with the same underlying asset trading on different venues. These instruments can intuitively be imputed using the highly correlated series. Although this would lead to better imputations, it is not interesting given how the problem is framed.

There are four types of market variables in the dataset: futures, FX rates, discount factors, and implied volatilities. All but the implied volatilities are given as two-dimensional data observations where the x-axis denotes time-to-maturity. The volatilities are given as three-dimensional observations with time-to-maturity on the x-axis, option delta on the y-axis, and the corresponding implied volatility value on the z-axis. The asset prices are rolled over to have constant maturity and option delta. Further, the values are interpolated and extrapolated by Refinitiv to consistently represent the data from existing quotes.

The different market variables are also stated for a wide range of maturities and option deltas. To further limit the data, we have decided to use only a restricted set of points. For the two-dimensional data, we extract the 30, 90, and 360 days-to-maturity points. For the three-dimensional data, we extract the 30, 90, and 360 days-to-maturity points and the option deltas 0.25, 0.50, and 0.75.

Thus, three points represent a curve and nine points represent a surface, as proposed by [17]. For some observations, these points do not exist. In such cases, we applied the extrapolation/interpolation technique specified by Refinitiv for that particular variable. The maturities and option deltas were chosen both with respect to information capacity and such that few points needed to be interpolated or extrapolated. Also, the option deltas generally represent an OTM, an ATM, and an ITM option.

The constricted dataset contains 35 market variables filtered into 18 futures, five FX rates, six discount factors, and six volatility surfaces. The feature dimension of the data is 141. After all data preparation tasks have been completed, the dataset consists of 1 766 point observations for each asset.

4.4 Data Preparation

4.4.1 Handling of Missing Values

The dataset needs to be complete to enable a supervised learning procedure where a ground-truth value is available at every time point. That is, each market variable needs to have a daily observation. The raw dataset contains several missing values. Most of them can be derived from the varying operating days between markets, as in our motivation for use case one1. The majority of instruments are traded at the American exchange Chicago Mercantile Exchange (CME) and are thus affected by US public holidays. We have therefore decided to remove all US holidays from the dataset; see Appendix A for a detailed view of which dates this concerns. Further, all weekend days (Saturdays and Sundays) are also removed.

After all holidays and weekend days have been removed, only a tiny fraction (0.85%) of the total dataset is missing. These data points are filled using linear interpolation or flat extrapolation on the time axis, depending on the location of the missing values in the time series. Although it concerns only a small fraction, this approach may favour the Linear Interpolation and Nearest Neighbour Imputation models explained in Sections 4.8.2 and 4.8.1.

4.4.2 Converting to Prices

The initial data has different units. The future curves are given in prices; the FX rates are given as a fraction between two currencies, the discount factors as interest rates, and the volatility surfaces as the implied volatility from the Black-Scholes formula. The final step in the data preparation is to convert all data points to prices, which makes the data easier to handle in our models and simplifies the evaluation, since it then concerns data with only one unit2. The conversion to prices is done as follows:

i) FX rates are converted to how much 1 000 in the quotation currency is worth in the base currency, i.e., each currency pair is multiplied by 1 000.

ii) Discount factors are converted to the price of a zero-coupon bond with three months to maturity and a face value of 1 000.

1Instruments with a limited historical horizon have already been removed at an earlier stage.
2Even though some prices are stated in different currencies, e.g., US dollar or Japanese yen.

iii) Implied volatilities are converted to the price of a European call option using the Black-Scholes formula, with a strike price of 1 000, the price of the underlying being 1 000, a risk-free rate of 1%, and three months to maturity.
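The conversion in iii) is the standard Black-Scholes call price with the stated parameters; the helper below is an illustrative sketch (σ = 20% is an assumed example volatility, not from the dataset):

```python
from math import log, sqrt, exp
from statistics import NormalDist

def bs_call_price(sigma, S=1000.0, K=1000.0, r=0.01, T=0.25):
    # Black-Scholes price of a European call with the conversion parameters
    # stated in the text: S = K = 1000, r = 1%, T = 3 months.
    N = NormalDist().cdf
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * N(d1) - K * exp(-r * T) * N(d2)

price = bs_call_price(sigma=0.20)   # price for a 20% implied volatility
```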

4.4.3 Training and Test Split

Most of the models require a complete dataset to train on. Therefore, the dataset X is split into two sets, the training set Xtrain and the test set Xtest, where,

Xtrain = [x_t | x_t ∈ X^o], ∀t ∈ τ,
Xtest = [x_t | x_t ∈ X^m], ∀t ∈ τ.   (4.1)

The task is later to make predictions on Xtest after a model has been trained using Xtrain. As previously described, many models operate on log returns. The training and test sets are then expressed as,

Rtrain = [r_t | x_t ∈ X^o and x_{t−1} ∈ X^o], ∀t ∈ τ,
Rtest = [r_t | x_t ∈ X^m or x_{t−1} ∈ X^m], ∀t ∈ τ.   (4.2)

4.4.4 Sliding Windows and Forward Validation

For some of the models, the dataset will be partitioned into sliding windows. E.g., if the window size is 3 and the sliding step size is 1, the sequence {1, 2, 3, 4, 5, 6} would be partitioned as {{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6}}. To do this for a dataset with p dimensions, the sliding windows are stored in a three-dimensional matrix with dimensions #windows × window size × p, where each window is a window size × p matrix.

When working with sliding windows, one usually trains and validates models using forward validation. This means that the sequence {x_1, x_2, ..., x_n} is used as explanatory variables for the coming values, {x_{n+1}, ..., x_{n+k}}. For the sequence {1, 2, 3, 4, 5, 6}, with window size 3, step size 1, and a predicted-sequence window size of 1, one would get the following,

{1, 2, 3} → {4}, {2, 3, 4} → {5}, {3, 4, 5} → {6}.
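The partitioning and forward-validation pairing above can be sketched in a few lines of NumPy (a minimal illustration; the function names are ours, not from the thesis):

```python
import numpy as np

def sliding_windows(x, window, step=1):
    """Partition a sequence into overlapping windows of fixed size."""
    return np.array([x[i:i + window] for i in range(0, len(x) - window + 1, step)])

def forward_pairs(x, window, horizon=1):
    """Forward validation: each window explains the `horizon` values after it."""
    X, y = [], []
    for i in range(len(x) - window - horizon + 1):
        X.append(x[i:i + window])
        y.append(x[i + window:i + window + horizon])
    return np.array(X), np.array(y)

seq = np.array([1, 2, 3, 4, 5, 6])
W = sliding_windows(seq, window=3)     # [[1 2 3] [2 3 4] [3 4 5] [4 5 6]]
Xw, yw = forward_pairs(seq, window=3)  # {1,2,3}->{4}, {2,3,4}->{5}, {3,4,5}->{6}
```

For a p-dimensional dataset, the same loop over a length-T × p matrix yields the #windows × window size × p array described above.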

4.5 Data Post-Processing

The problem is restricted to filling the missing values in the price process, and thus all models operating on the log return data need a post-processing step to convert the data back to the original scale. This is done by exploiting the values of the nearest observed points in the price process. Assume that x_t^d ∈ X^m and x_{t−1}^d ∈ X^o, where r_t^d and r_{t−1}^d are the corresponding log returns. Then, a forward prediction x̂_t^dF is calculated by,

x̂_t^dF = x_{t−1}^d exp(r̂_t^d).    (4.3)

If x_{t−1}^d ∈ X^m, then x̂_{t−1}^dF is used instead. For use case two, there will only be reference points at preceding time steps, and thus the predicted price at time t will be the forward prediction, x̂_t^d = x̂_t^dF. In contrast, use case one will have both

Figure 4.2: Example of the aggregation technique used when there is a prior and a succeeding reference point. Blue dots are observed data and red dots are missing.

preceding and succeeding reference prices to a missing value. Since autocorrelation in the price process applies in both directions, we introduce a more sophisticated re-scaling procedure for the interpolation setting. In addition to the forward prediction, compute the backward prediction x̂_t^dB by,

x̂_t^dB = x_{t+1}^d exp(r̂_{t+1}^d)^{−1}.    (4.4)

Similar to the forward prediction, if x_{t+1}^d ∈ X^m, then replace it by the backward prediction x̂_{t+1}^dB. If the nearest prior and succeeding reference prices are equally distant from the prediction, the final prediction is the average of the two. However, if they are not equally distant, we would like to weigh the predictions accordingly. Let us introduce a time gap matrix δ as in [6]. The time gap matrix is used to weigh predictions from opposing directions based on the duration since the last observed point. δ is defined with respect to ascending (δ^F) and descending time (δ^B) as,

δ_t^dF = { 0 if t = 1;  1 + δ_{t−1}^dF if x_{t−1}^d ∈ X^m;  1 if x_{t−1}^d ∈ X^o },
δ_t^dB = { 0 if t = T;  1 + δ_{t+1}^dB if x_{t+1}^d ∈ X^m;  1 if x_{t+1}^d ∈ X^o }.    (4.5)

Finally, linearly weight the forward and backward predictions according to the time gap matrix as,

x̂_t^d = ( (δ_t^dF)^{−1} x̂_t^dF + (δ_t^dB)^{−1} x̂_t^dB ) / ( (δ_t^dF)^{−1} + (δ_t^dB)^{−1} ).    (4.6)
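The forward pass, backward pass, and time-gap weighting of Equations 4.3–4.6 can be sketched as follows. This is a simplified illustration that assumes every missing point has an observed price on both sides (the interpolation setting of use case one); all names are ours:

```python
import numpy as np

def impute_bidirectional(x, r_hat, missing):
    """Fill missing prices from predicted log returns.
    x: prices with np.nan at missing points; r_hat[t] estimates log(x[t]/x[t-1]);
    missing: boolean mask. Assumes observed prices on both sides of each gap."""
    T = len(x)
    # forward pass (Eq. 4.3): chain predictions from the last observed price,
    # tracking the time gap d_f (Eq. 4.5, ascending time)
    fwd, d_f = x.copy(), np.zeros(T)
    for t in range(1, T):
        if missing[t]:
            fwd[t] = fwd[t - 1] * np.exp(r_hat[t])
            d_f[t] = d_f[t - 1] + 1 if missing[t - 1] else 1
    # backward pass (Eq. 4.4): chain predictions from the next observed price
    bwd, d_b = x.copy(), np.zeros(T)
    for t in range(T - 2, -1, -1):
        if missing[t]:
            bwd[t] = bwd[t + 1] * np.exp(-r_hat[t + 1])
            d_b[t] = d_b[t + 1] + 1 if missing[t + 1] else 1
    # inverse-gap weighting (Eq. 4.6)
    out = x.copy()
    for t in range(T):
        if missing[t]:
            wf, wb = 1 / d_f[t], 1 / d_b[t]
            out[t] = (wf * fwd[t] + wb * bwd[t]) / (wf + wb)
    return out

x = np.array([100.0, np.nan, np.nan, 110.0])
filled = impute_bidirectional(x, r_hat=np.zeros(4), missing=np.isnan(x))
```

With zero predicted returns, the point next to the left reference leans towards 100 and the point next to the right reference leans towards 110, as intended.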

4.6 Experiment Design

Through the data preparation task, we have obtained a complete dataset of 1766 observations. The problem is now framed as a supervised learning task, and we can withhold data points from the dataset to synthetically create datasets that fit the two use cases. The time points with withheld data points naturally form the test set.

Input: the model f(·), the dataset X. Parameters: the hyperparameters θ. Output: X̂.

i) Preprocess the data: X*.

ii) Split the data: {Xtrain, Xval, Xtest}.

iii) Standardise/normalise the data: {X*train, X*val, X*test}.

iv) Optimise the hyperparameters θ according to a criterion on X*val: θ*.

v) Fit the model with θ* using {X*train, X*val}.

vi) Predict the missing values using X*: X̂*test.

vii) Post-process and return the predictions: X̂test.

Algorithm 2: Schematic view of the imputation procedure.

Each model predicts the missing values in the test set, which are later evaluated against the withheld ground truth values, as described in Section 4.4.3. In total, 70 datasets are created, one per combination of channel (35) and use case (2). Algorithm 2 presents the general scheme of the execution procedure for a particular model.

4.7 Evaluation

The performance metrics can be divided into two categories. The first category assesses the deviation from the actual price data and will be measured by the Mean Absolute Scaled Error (MASE). The second category focuses on cross-day price movements and how well their distribution is preserved, measured by the Relative Deviation of Value at Risk (RDVaR) and the Relative Deviation of Expected Shortfall (RDES). It is essential to understand that a model that performs well in one category does not implicitly perform well in the other. Therefore, it is up to the business case to determine which model suits it best. The aim is to make a general statement of model performance. Therefore, the imputation procedure is performed for all channels, and the model performance is aggregated to a general metric. Since the channels are represented by price series of four different asset types, the metrics will also be aggregated on an asset-type level, acknowledging that different asset types have different movement characteristics affecting model performance. The two use cases are evaluated separately, and cross-use-case performance comparisons are left for the discussion part of this report.

4.7.1 Mean Absolute Scaled Error

It is crucial to account for the variety in scale and variance of the price series when aggregating metrics. Otherwise, there is a considerable risk of the aggregated metric being

heavily biased towards the channels with large scale and variance. Therefore, to assess the estimated price deviation, we use the unit-less, scale-free metric Mean Absolute Scaled Error (MASE) as proposed by [25]. The MASE metric compares the Mean Absolute Error (MAE) of a model with the MAE of the naive model. In our case, the naive model is chosen to be the Nearest Neighbour Imputation (NNI). Assume we have estimated the missing values of channel k with model m by X̂_m^k, and the corresponding prediction by the naive model is X̂_NNI^k. Then, MASE is calculated by,

MASE_m^k = ( Σ_{t=1}^T |x̂_{t,m}^k − x_t^k| ) / ( Σ_{t=1}^T |x̂_{t,NNI}^k − x_t^k| ) = MAE_m^k / MAE_NNI^k.    (4.7)

MASE is calculated for all channels k ∈ {c_1, ..., c_C} and will be analysed on an overall and an asset-specific level. MASE can only take non-negative values, and the smaller the value, the better the model performs relative to the naive one.
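Equation 4.7 is a one-liner in practice; the sketch below uses hypothetical numbers in which the model halves the naive model's absolute error:

```python
import numpy as np

def mase(x_true, x_model, x_naive):
    """Eq. 4.7: MAE of model m scaled by the MAE of the naive NNI model,
    evaluated on the withheld (imputed) points of a channel."""
    return np.mean(np.abs(x_model - x_true)) / np.mean(np.abs(x_naive - x_true))

score = mase(np.array([100.0, 101.0, 102.0]),
             np.array([100.5, 101.5, 102.5]),   # model: error 0.5 per point
             np.array([101.0, 102.0, 103.0]))   # naive: error 1.0 per point
print(score)  # 0.5
```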

4.7.2 Relative Deviation of VaR

The Relative Deviation of VaR (RDVaR) is used as an evaluation metric to assess model-generated price movements. RDVaR tells how large the relative deviation between the estimated and the actual VaR is. VaR is calculated following the procedure explained in Section 2.1.1. Since our data is already given at the price scale, one can interpret it as point observations of the price function. The RDVaR is calculated for a portfolio with positions in the asset with missing values, i.e. the specific channel being imputed. In the VaR calculation we assume that the current total portfolio value is 1 million (v_{t0} = 1 000 000) and the quantity of each "asset" in a channel is set to yield an equal contribution to the total portfolio value (w^i = v_{t0} / (d x_{t0}^i)). In the VaR calculation, we set the confidence level to α = 99%, the time horizon of a scenario to T = 1, and the historical horizon to the full data set, which is approximately h = 7 years.

Assume X̂_m^k is the estimated values of channel k with model m. To calculate RDVaR, start by calculating the actual VaR and the estimated VaR̂_m. Then compute,

RDVaR = (VaR̂_m − VaR) / VaR.    (4.8)

4.7.3 Relative Deviation of ES

The Relative Deviation of ES (RDES) is similar to RDVaR but yields a better evaluation of extreme scenarios. ES is calculated following the procedure explained in Section 2.1.2. The preliminary assumptions made for RDVaR hold for RDES, and the ES parameters are set equal to those of the VaR. To calculate RDES, start by calculating the actual ES and the estimated ÊS_m. Then compute,

RDES = (ÊS_m − ES) / ES.    (4.9)
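Assuming a plain historical-simulation VaR and ES (the thesis computes them per Sections 2.1.1–2.1.2, not shown here), the two relative-deviation metrics could be sketched as follows, on hypothetical P&L scenarios:

```python
import numpy as np

def var_es(pnl, alpha=0.99):
    """Historical-simulation VaR and ES: VaR is the alpha-quantile of the
    loss distribution, ES the mean loss beyond VaR (losses reported positive)."""
    losses = -np.asarray(pnl)            # P&L scenarios -> losses
    var = np.quantile(losses, alpha)
    es = losses[losses >= var].mean()
    return var, es

def relative_deviation(estimated, actual):
    """RDVaR / RDES as in Eqs. 4.8 and 4.9; negative = underestimation."""
    return (estimated - actual) / actual

pnl = -np.arange(1.0, 101.0)             # 100 hypothetical loss scenarios
var99, es99 = var_es(pnl)
rd = relative_deviation(0.9 * var99, var99)  # a 10% underestimation -> -0.1
```

A negative RDVaR/RDES thus flags the systematic risk underestimation discussed in the results.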

4.8 Models

4.8.1 Nearest Neighbour Imputation

The Nearest Neighbour Imputation (NNI) method is applied to both use case one and use case two. It is also referred to as the naive model and serves as a baseline to compare the results with. Assume that x_t^j is missing and τ^{j,o} are the time points of the observed values for dimension j. NNI then estimates x_t^j to be,

x̂_t^j = x_{t*}^j, where t* = argmin_{τ_i ∈ τ^{j,o}} |t − τ_i|.    (4.10)

If {x_{t−1}^j, x_{t+1}^j} ∈ X^o, i.e. both the previous and the next data point in time are observed, then the estimate becomes x̂_t^j = x_{t−1}^j. Thus, the naive model favours the prior reference point when the prior and succeeding points are equally distant.
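A minimal NumPy sketch of Equation 4.10 (function name ours); `np.argmin` returns the first index on ties, which matches the prior-point preference:

```python
import numpy as np

def nni(x, missing):
    """Nearest Neighbour Imputation (Eq. 4.10): copy the observed value
    closest in time; ties resolve to the earlier (prior) point."""
    obs = np.flatnonzero(~missing)
    out = x.copy()
    for t in np.flatnonzero(missing):
        out[t] = x[obs[np.argmin(np.abs(obs - t))]]
    return out

x = np.array([1.0, np.nan, np.nan, 4.0])
print(nni(x, np.isnan(x)))  # [1. 1. 4. 4.]
```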

4.8.2 Linear Interpolation

The Linear Interpolation (LI) method is applied to the use case one setting, where all missing values are bound to the interpolation problem. Assume x_t^j is missing, where {x_{t−a}^j, x_{t+b}^j} ∈ X^o are the nearest previous and next observed data points in time. This gives an interpolation problem where t − a < t < t + b. The LI method estimates x_t^j to be,

x̂_t^j = x_{t−a}^j + a (x_{t+b}^j − x_{t−a}^j) / (a + b).    (4.11)
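Equation 4.11 is exactly what `np.interp` computes on the time axis (a minimal sketch; the function name is ours):

```python
import numpy as np

def linear_interp(x, missing):
    """Linear interpolation (Eq. 4.11): np.interp weighs the nearest observed
    neighbours by their time distances a and b."""
    t = np.arange(len(x))
    out = x.copy()
    out[missing] = np.interp(t[missing], t[~missing], x[~missing])
    return out

x = np.array([1.0, np.nan, np.nan, 4.0])
print(linear_interp(x, np.isnan(x)))  # [1. 2. 3. 4.]
```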

4.8.3 Lasso

The Lasso method aims to exploit any long-term linear relationship between the target channel and the reference channels. By long-term relationship, we assume constant co-movements with other channels over time and disregard any potential temporal correlation changes. Lasso is applied to both use cases but with some differences in how it operates. Thus, we are trying to find a linear function such that,

r_t^k = f(X_t^o) | λ.

For use case one, the Lasso operates on the log returns. Now assume that channel k is incomplete. Following the outline in Algorithm 2, the data is split into {Rtrain, Rtest}, with the training data additionally partitioned into 10 equally sized subsets, R_val^s, ∀s ∈ {1, ..., 10}. Each value in λ ∈ {0, 4^{−19}, 4^{−18}, ..., 4^0, 4^1, 4^2} is 10-fold cross-validated using the partition R_val^s, where R^k is set as the response variable and R^o as the explanatory variables. The optimised hyperparameter λ* is chosen as the one that yields the highest average coefficient of determination, R², which is equivalent to minimising the RSS. A complete Lasso model is then fitted on Rtrain with λ*. Finally, predict the log returns r̂_t^k for all r_t ∈ Rtest, and compute the final prediction x̂_t^k, in the original price unit, by aggregating the forward and backward pass predictions as described in Section 4.5.

The main difference in the use case two Lasso algorithm is that it also has an additional model operating on the price process. Since use case two concerns

consecutive predictions of the return process, the errors will be aggregated through time. Even if a model f(·) is an unbiased estimator of r_t^k, the error term behaves like a random walk when aggregated to the price process and can cause it to take unrealistic values. For use case one, we apply an approach sourcing information from both the previous and future values due to the high autocorrelation in the price process, see Section 4.5. In use case two, there is no succeeding reference value to steer the price process towards. Therefore, a model is built on the prices, creating a reference point at the end of the time series.

For use case two, all the steps of use case one are performed. But in parallel, a model is built on the prices, {Xtrain, Xtest}. As before, the training data is partitioned into 10 equally sized subsets, X_val^s, ∀s ∈ {1, ..., 10}. Each value in λ ∈ {0, 4^{−19}, 4^{−18}, ..., 4^0, 4^1, 4^2} is 10-fold cross-validated using the partition X_val^s, where x^k is set as the response variable and X^{∀j≠k} as the explanatory variables. The optimised hyperparameter λ* is chosen as before. Using the final model, estimate the last price of the series, x̂_T^k, using X_T^o. Set x̂_T^k as the end-point reference, make forward and backward pass predictions using r̂_t^k, and aggregate them to a final prediction per the aggregation formula presented for use case one in Section 4.5.
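A scikit-learn sketch of the use case one Lasso step, on synthetic stand-in data (the grid point λ = 0 is omitted here because scikit-learn's coordinate-descent Lasso expects a positive penalty; exact data handling in the thesis differs):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-ins: R_other plays the role of the reference-channel log
# returns R^o, r_k the incomplete target channel R^k.
rng = np.random.default_rng(0)
R_other = rng.normal(size=(200, 5))
r_k = R_other @ np.array([0.5, -0.3, 0.0, 0.0, 0.2]) + 0.01 * rng.normal(size=200)

# lambda grid {4^-19, ..., 4^2} with 10-fold CV, R^2 as the criterion
grid = {"alpha": [4.0 ** p for p in range(-19, 3)]}
search = GridSearchCV(Lasso(max_iter=10_000), grid, cv=10, scoring="r2")
search.fit(R_other, r_k)
r_hat = search.best_estimator_.predict(R_other)  # predicted log returns
```

The predicted returns `r_hat` would then be rescaled to prices via the forward/backward aggregation of Section 4.5.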

4.8.4 Random Forest

One weakness of the Lasso method is that it can only learn linear relationships between the response and the reference channels. To account for any potential non-linear relationship, a Random Forest (RF) model is applied to use case one. It is only applied to use case one due to its inability to extrapolate from the information in the training data, i.e., an RF cannot predict beyond the training data range, which intuitively does not suit use case two.

In our experiment, the number of trees is set to 500, which has shown to be sufficient with respect to overfitting. The number of randomly drawn features considered at each split is set to 11, which is approximately the square root of the number of features [14]. The RF operates on the log returns and follows the execution procedure presented in Algorithm 2. First, the data is split into {Rtrain, Rtest}. Then, Rtrain is further partitioned into 10 equally sized subsets, R_val^s, ∀s ∈ {1, ..., 10}. The subsets are used to tune the maximum tree depth, l, of the decision trees. Each l ∈ {1, 2, 4, ..., 32} is 10-fold cross-validated using the partition R_val^s, where R^k is set as the response variable and R^o as the explanatory variables. The optimised hyperparameter l* is chosen as the one that yields the highest average coefficient of determination, R², which is equivalent to minimising the MSE. A final RF model is then fitted on Rtrain with l*. With the final model, predict the log returns r̂_t^k for all r_t ∈ Rtest, and compute the final prediction x̂_t^k, in the original price unit, by aggregating the forward and backward pass predictions described in Section 4.5.
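The depth tuning can be sketched with scikit-learn on a synthetic non-linear target. This is a down-scaled illustration: the thesis uses 500 trees, 11 features per split and 10-fold CV, whereas the sketch uses 100 trees, sqrt(p) features and 5 folds to stay fast:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic non-linear target standing in for the log-return data
rng = np.random.default_rng(1)
R_other = rng.normal(size=(300, 8))
r_k = np.sin(R_other[:, 0]) + 0.1 * rng.normal(size=300)

rf = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
search = GridSearchCV(rf, {"max_depth": [1, 2, 4, 8, 16, 32]}, cv=5, scoring="r2")
search.fit(R_other, r_k)
l_star = search.best_params_["max_depth"]  # tuned maximum tree depth
```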

4.8.5 Gaussian Process

A Gaussian Process (GP) was chosen because of its convenience in switching between inter- and extrapolation, its flexible model structure, and its probabilistic and non-parametric approach. When modelling a GP, one needs to specify the mean function, m(x), and the covariance function, k(x, x′), of the process. These determine how the function values, e.g., the log returns, modelled by the GP depend on each other through time. Here, the mean function is set to 0 and the covariance function is chosen as the Matérn kernel with ν = 0.2 for use case one and ν = 1.5 for use case two. For both use cases, the input data was assumed to be noise-free.

For use case one, this structure on the mean and covariance function means that function values close in time and similarity affect each other less, and the shape of the implied process is rougher than for higher values of ν. For use case two, values close in time and similarity affect each other more than under the use case one structure.³

Assume that channel k is incomplete, following the outline in Algorithm 2, for use case one. The GP is used on the log returns, partitioned only into {Rtrain, Rtest}. The data is then standardised according to,

R*_(·) = (R_(·) − µtrain) / σtrain.

Here, µtrain is the sample mean and σtrain is the sample standard deviation of the training data. Following the proposed procedure in [39], optimise the hyperparameters ℓ and σ of the Matérn kernel by maximising the log-likelihood of the marginal distribution of the training data R*train, as described in Section 3.6.2. This problem is not always convex and may result in local optima; to overcome this, the optimisation is restarted 10 times. Fit the GP on the training data R*train and predict the function values r̂_t^k for all r_t ∈ R*test, with the optimised parameters ℓ*, σ*, by computing the posterior distribution, Equation 3.25, according to Algorithm 2.1 in [39]. Compute the final prediction x̂_t^k, in the original price unit, by aggregating the forward and backward pass predictions described in Section 4.5.

For use case two, the algorithm for the GP is slightly different, modelling the prices directly. Thus, assuming channel k is incomplete, split the data into {Xtrain, Xtest} and standardise it according to,

X*_(·) = (X_(·) − µtrain) / σtrain.

Here, µtrain is the sample mean and σtrain is the sample standard deviation of the training data. Optimise the hyperparameters ℓ and σ as for use case one. Then, fit the GP on the training data X*train and predict the function values x̂_t^k for all x_t ∈ X*test, with the optimised parameters ℓ*, σ*, by computing the posterior distribution, Equation 3.25, according to Algorithm 2.1 in [39].
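The use case two GP (zero mean, Matérn ν = 1.5, noise-free assumption, 10 optimisation restarts) can be sketched with scikit-learn on a synthetic standardised price series; data and variable names are illustrative only:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

# Synthetic training series on a time grid, standing in for X*_train
t_train = np.arange(40, dtype=float).reshape(-1, 1)
x_train = np.sin(t_train.ravel() / 5.0)
mu, sigma = x_train.mean(), x_train.std()

# zero mean, sigma^2 * Matern(l) kernel; l and the output scale are tuned by
# maximising the marginal log-likelihood, restarted 10 times
kernel = ConstantKernel(1.0) * Matern(length_scale=1.0, nu=1.5)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10,
                              alpha=1e-10)  # (near) noise-free observations
gp.fit(t_train, (x_train - mu) / sigma)

t_test = np.array([[40.0], [41.0]])      # extrapolation, as in use case two
x_hat = gp.predict(t_test) * sigma + mu  # posterior mean, back on price scale
```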

³For an example of function values drawn from a Matérn kernel with σ = 1, ℓ = 1 and ν = 1.5, and its corresponding covariance matrix, see Figures 3.5c and 3.5d.

4.8.6 Multilayer Perceptron

The Multilayer Perceptron (MLP) is a fully connected artificial neural network model applied to use cases one and two. The MLP assumes a relationship between the

target and the reference channels that is independent of time. Nevertheless, the MLP does not make any further assumptions about the mapping function. Thus, the MLP aims to find a non-linear function such that,

r_t^k = f(X_t^o).    (4.12)

The model execution procedure follows the outline in Algorithm 2. The data is initially converted to log returns and split into {Rtrain, Rtest}. The training set is further partitioned into 5 equally sized subsets, R_val^s, ∀s ∈ {1, ..., 5}. The input to the model is the reference channel data, R^o, and the output is an estimate of R^k. The specific configuration of the MLP model is,

i) Linear activation function at the input and output layer.

ii) ReLU activation function in all the hidden layers.

iii) 2 hidden layers with 128 neurons each, implying approximately 34 500 trainable model parameters in total.

iv) A dropout layer added between the input layer and all hidden layers. The dropout layers are added to lower the risk of overfitting and improve the learning of a large network [21].

v) The Adam optimisation algorithm [28] used as the optimiser, with mean squared error as the loss function of the model.

The subsets R_val^s are used to tune the hyperparameters. A grid search with 5-fold cross-validation is applied to estimate their optimal values. The hyperparameters, θ, and their corresponding restricted sets of values are,

i) Learning rate ∈ {10^{−3}, 10^{−4}}

ii) Hidden layer dropout rate ∈ {0.2, 0.5}

iii) Input layer dropout rate ∈ {0.1, 0.2}

iv) Epochs ∈ {100, 250, 500}

v) Batch size ∈ {32, 64}

The optimised hyperparameter θ* is chosen as the one yielding the highest average coefficient of determination, R², which is equivalent to minimising the RSS. A final MLP model is then fitted on Rtrain with θ*. With the final model, predict the log returns r̂_t^k for all r_t ∈ Rtest, and compute the final prediction x̂_t^k, in the original price unit, by aggregating the forward and backward pass predictions described in Section 4.5.
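An approximate scikit-learn sketch of this setup on synthetic data. Note the substitution: `MLPRegressor` has no dropout layers, so only the learning rate and batch size are grid searched here; the two 128-neuron ReLU layers, Adam optimiser and MSE loss do match the configuration above:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-ins for the reference (R^o) and target (R^k) log returns
rng = np.random.default_rng(2)
R_other = rng.normal(size=(400, 10))
r_k = np.tanh(R_other[:, 0]) - 0.5 * R_other[:, 1] + 0.05 * rng.normal(size=400)

mlp = MLPRegressor(hidden_layer_sizes=(128, 128), activation="relu",
                   solver="adam", max_iter=300, random_state=0)
grid = {"learning_rate_init": [1e-3, 1e-4], "batch_size": [32, 64]}
search = GridSearchCV(mlp, grid, cv=5, scoring="r2")  # R^2 as the criterion
search.fit(R_other, r_k)
```

A full reproduction with dropout and epoch tuning would instead use a framework such as Keras or PyTorch.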

The procedure described above holds for use case one. For use case two, there is an additional MLP model trained on the price process. That model is used to estimate the last price, motivated in the same way as in Section 4.8.3. The MLP for the price process follows the same procedure as above but uses the prices X instead of R. When an estimate of x_T^k is obtained, it is used as a reference for the backward prediction, which later is used to obtain a final prediction using the aggregation technique explained in Section 4.5.

4.8.7 WaveNet

Financial time series are known to be noisy. At the same time, strong signals have a short duration, and a long history of data can increase this difficulty due to the ever-changing financial environment described in Section 2.3.1. Even so, there exist other financial time series that can be strongly correlated. By implementing an architecture like the WaveNet, the model tries to exploit the internal autoregressive properties of the time series and uses conditioning to reduce the noise in these short-duration signals. Here, the model conditions a forecast of a time series on its endogenous history and multiple exogenous time series, in an attempt to improve the quality of the predictions and learn long-term temporal dependencies between the different channels.

The WaveNet model here is built like a sequence-to-sequence model, standard in, e.g., natural language processing. This means that the model takes a sequence of time steps as its input, encodes this input into a format that the model can use, and then decodes the encoded input and outputs another sequence. Here, the model encodes and decodes sequences of the same size, i.e. if the input has five time steps, the output will too. The idea is to capture the temporal dynamics that might exist and condition the next time steps on the earlier time steps in an autoregressive fashion. However, since a critical signal can affect different time series at different time points, the model is allowed to predict the same value several times. Furthermore, since financial time series can be strongly correlated with other time series, the model's output is also conditioned on the exogenous time series in our data set, i.e., they are assumed observable. This means that the following sequence is conditioned such that, x_{t+n}^k, ..., x_{t+1}^k | x_t^k, ..., x_{t−n−1}^k, X_{t+n}^o, ..., X_{t+1}^o. We are thus trying to find a function f such that,

x_{t+n}^k, ..., x_{t+1}^k = f(x_t^k, ..., x_{t−n−1}^k, X_{t+n}^o, ..., X_{t+1}^o).

To achieve this, the model is centred around dilated causal convolutions whose dilation rates are repeated as 1, 2, ..., time-steps, 1, 2, ..., time-steps, where time-steps is 2^k for some k. Each of these dilated causal convolutions makes up a block, where each block is structured as follows:

i) A pre-processing layer with a one-dimensional convolution with "same"-padding and a kernel size of 1.

ii) A batch normalisation layer.

iii) A gated activation with gating and filtering layers consisting of one-dimensional convolutions, as described in Equation 3.51, with "causal"-padding, kernel size 2, and a dilation rate depending on which block it is.

iv) A post-processing layer with one-dimensional convolution with ”same”-padding and a kernel size of 1.

v) A residual connection between the outputs of step i) and iv) through addition.

vi) A concatenation of the output of step iv) to a list consisting of the equivalent outputs of the other blocks, i.e., the skip connection.

After all blocks, the outputs of all skip connections are added together, followed by a processing layer consisting of a one-dimensional convolution with "same"-padding and a kernel size of 1, a dropout layer [21] with 20% dropout, and finally the output layer: a one-dimensional convolution with "valid"-padding, a kernel size of 1, and a linear activation function. For all but the gated activation and the output layer, the ReLU activation function is used. Tests were performed with other activation functions and another gated activation that was supposed to work better for financial data, as described in [4], without improving the performance. See Appendix D for an example of the network architecture with 8 time steps and 32 filters.
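To make the central building block concrete, a dilated causal convolution with kernel size 2 can be written out in plain NumPy (an illustrative single-channel sketch, not the thesis implementation):

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """Dilated causal 1-D convolution, kernel size 2 ('causal' padding):
    y[t] = w[0]*x[t - dilation] + w[1]*x[t], with zero-padding before the
    series start, so y[t] never depends on future values."""
    x_shift = np.concatenate([np.zeros(dilation), x[:-dilation]])
    return w[0] * x_shift + w[1] * x

x = np.arange(1.0, 9.0)  # 8 time steps
y = dilated_causal_conv1d(x, w=np.array([1.0, 1.0]), dilation=2)
# y[t] = x[t-2] + x[t]; the first two outputs only see the zero-padding
print(y)  # [ 1.  2.  4.  6.  8. 10. 12. 14.]
```

Stacking such layers with dilation rates 1, 2, 4, ... is what gives the WaveNet its exponentially growing receptive field while preserving causality.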

The loss function is chosen as the mean squared error (MSE), the stochastic optimisation method Adam [28] is chosen as the optimiser, the weights are initialised with He-initialisation [19], ℓ2-regularisation⁴ is used as weight regularisation, and the model is trained with early stopping [38], meaning that training is aborted when the validation loss stops decreasing. The weights corresponding to the best validation loss are restored.

All values are assumed observed during training, meaning that the model will always be trained with actual values; this is not true in the prediction stage. Since the predictions of this model are fed back into the model and the same time step can be included several times, an aggregation algorithm for the predictions has been created. This procedure is described in Algorithm 3.

The model has several hyperparameters, θ, and it would not be feasible to search for the best ones for each missing channel. Thus, the hyperparameters optimised for each channel are the number of time steps, the number of filters in each one-dimensional convolution in the blocks, the λ in the ℓ2-regularisation for each layer in the blocks, and the learning rate. After initial searches for suitable candidates, the following were chosen:

i) time-steps ∈ {8, 16}

ii) number of filters ∈ {32, 64}

iii) λ ∈ {10^{−3}, 10^{−4}}

iv) learning rate ∈ {10^{−3}, 10^{−4}}

The optimised hyperparameters for each channel are chosen as the θ that minimises the MASE, as described in Equation 4.7, on the validation set, and the chosen ones, θ*, are saved for later use.

The model is only implemented for use case two and operates solely on the price process.⁵ Assume that channel k is incomplete. Following the outline in Algorithm 2, the data is processed as described in Section 4.4.4 with a step size of 1, partitioned into {Xtrain, Xval, Xtest} with the proportions 70/10/20, and then normalised

⁴ℓ2-regularisation is the corresponding regularisation as described in Section 3.3, but with the ℓ2-norm instead of the ℓ1-norm.
⁵There was testing on the log returns as well, but with bad results.

Input: the model f(·), the last known observation X_{τp}^o, and the later sequences that are observed, X_{τp+1:T}^o. Parameters: the sequence length s and the number of sequences n. Output: x̂_{τp+1:T}^k.

i) Create a temporary matrix X̂ of size n × |τp + 1 : T|.

ii) Predict the first sequence, x̂_{τp+1:τp+s+1}^k = f(X_{τp}^o).

iii) Add x̂_{τp+1:τp+s+1}^k to X̂ at the corresponding row and columns.

iv) Then, for all n − 1 remaining sequences:

(a) Average the previous predictions in X̂ to be used for the next prediction, x̂_{t−1:t+s−1}^k.

(b) Predict the next sequence, x̂_{t:t+s}^k = f(x̂_{t−1:t+s−1}^k, X_{t:t+s}^o).

(c) Add x̂_{t:t+s}^k to X̂ at the corresponding row and columns.

v) Compute x̂_{τp+1:T}^k by taking the average of the predictions per time point, i.e., the average of each column in X̂.

Algorithm 3: Predictions of autoregressive sequences with exogenous sequences.

according to,

X*_(·) = (X_(·) − X_train^min) / (X_train^max − X_train^min).

The model is trained with θ* according to the previously described procedure. When the model has finished training, predict the missing values of the test set using Algorithm 3. Since the model performs all predictions over τp + 1, ..., T by itself, there is a chance that the price process is offset; thus, the post-processing here adjusts the predictions to start at the last observed value. Lastly, create the new data set, X̂, including x̂_t^k for all t ∈ τp + 1 : T.

4.8.8 SeriesNet

Continuing the motivation for the WaveNet model, this LSTM-enhanced WaveNet model, called SeriesNet, tries to further exploit the temporal correlations between financial time series and is inspired by [42]. In the WaveNet model, it was assumed that the best way of sourcing information for the predicted time points was to condition on exogenous data at the same time points as the predictions. This approach may be overly optimistic due to the complex behaviour of financial time series described in Section 2.3.1. To deal with this, an RNN with LSTM cells is included in the model, where the WaveNet part now tries to model,

x_{t+n}^k, ..., x_{t+1}^k | x_t^k, ..., x_{t−n−1}^k, X_t^o, ..., X_{t−n−1}^o

to fully leverage the autoregressive properties of the dilated causal convolutions, while the RNN tries to encode a state of the exogenous data for the missing data, i.e.,

h_{t+n}^k, ..., h_{t+1}^k | X_{t+n}^o, ..., X_{t+1}^o.

These parts are then either added or concatenated and fed into an MLP. We are thus trying to find a function f such that,

x_{t+n}^k, ..., x_{t+1}^k = f(x_t^k, ..., x_{t−n−1}^k, X_{t+n}^o, ..., X_{t−n−1}^o).

The WaveNet part is built as before, and the previously optimised hyperparameters are assumed optimal for this task. However, the RNN is built with two layers, both with tanh-activation, where the number of cells in the first layer equals the number of time steps and the second layer has the same number of cells as there are time series belonging to channel k. The RNN was deliberately kept relatively "small" due to the risk of severely overfitting the training data with a large network. Another regularisation of the RNN was dropout [21] when updating the recurrent state, with a dropout ratio of 20%.

The loss function is chosen as the mean squared error (MSE), the stochastic optimisation method Adam [28] is chosen as the optimiser, the weights are initialised with He-initialisation [19], ℓ2-regularisation is used as weight regularisation for the WaveNet part, and the model is trained with early stopping [38], meaning that training is aborted when the validation loss stops decreasing. The weights corresponding to the best validation loss are restored.

Like the standalone WaveNet model, this is an autoregressive model. During training, the values are always assumed observed, meaning that the model will always be trained with the true values; this is not true in the prediction stage. The previous algorithm, Algorithm 3, has been updated accordingly for the new model, but the main steps are the same.

Again, the model has several hyperparameters, θ, and after initial testing, the hyperparameters that needed tuning for each channel were the learning rate, the ℓ2-regularisation for each layer in the WaveNet model, and whether the outputs of both parts should be concatenated or added. If they are concatenated, there is an MLP with one layer, with the number of neurons equal to the number of time steps, that tries to learn how to aggregate the state, h_t^k, from the RNN and the predictions, x_t^k, of the WaveNet. If they are added, the MLP tries to weigh this aggregate into a useful prediction. Note that both approaches can be seen as residual learning, where the RNN tries to learn changes in the market in τp + 1 : T and adjust the predictions of the WaveNet.

The candidates for the learning rate and the regularisation parameter λ are the same as before. The optimised hyperparameters for each channel are chosen as the θ that minimises the MASE, as described in Equation 4.7, on the validation set, and the chosen ones, θ*, are saved for later use.

The model is only implemented for use case two and operates solely on the original price process. Assume that channel k is incomplete. Following the outline

in Algorithm 2, the data is processed as described in Section 4.4.4 with a step size of 1, partitioned into {Xtrain, Xval, Xtest} with the proportions 70/10/20, and then normalised according to,

X*_(·) = (X_(·) − X_train^min) / (X_train^max − X_train^min).

The model is trained with θ* according to the previously described procedure. When the model has finished training, predict the missing values of the test set using an updated version of Algorithm 3. Since the model performs all predictions over τp + 1, ..., T by itself, there is a chance that the price process is offset; thus, the post-processing here adjusts the predictions to start at the last observed value. Lastly, create the new data set, X̂, including x̂_t^k for all t ∈ τp + 1 : T.

Chapter 5

Results

In this section, the results from the project are presented. There are three performance measures, MASE, RDVaR, and RDES, and their interpretations are as follows.

i) MASE < 1: better than the naive model. RDVaR | RDES < 0: risk metric underestimation.

ii) MASE = 1: equal to the naive model. RDVaR | RDES = 0: perfect risk metric reconstruction.

iii) MASE > 1: worse than the naive model. RDVaR | RDES > 0: risk metric overestimation.

Recall that the naive model is chosen as the Nearest Neighbour Imputation method. The results are presented separately per use case, and the filtered, table-view results per asset class can be found in Appendices F and G.
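As a rough illustration of how the first measure is read, the sketch below computes a MASE-style ratio in which the model's mean absolute error is scaled by that of the naive imputation; the exact definition used in the thesis is the one in Equation 4.7, and the function name and toy data here are our own:

```python
import numpy as np

def mase(y_true, y_model, y_naive):
    """Mean absolute model error scaled by the naive imputation's error.
    A value below 1 means the model beats the naive approach."""
    return np.mean(np.abs(y_true - y_model)) / np.mean(np.abs(y_true - y_naive))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_naive = np.array([1.0, 1.0, 1.0, 1.0])  # e.g. last observed value carried forward
y_model = np.array([1.1, 1.9, 2.8, 4.3])

print(mase(y_true, y_model, y_naive))  # < 1: better than the naive imputation
```

By construction, plugging the naive imputation itself into the model slot gives exactly 1, which is why the thresholds above are read against 1.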

5.1 Use Case One

For use case one, the distribution of MASE for all imputed channels is visualised in Figure 5.1, and its corresponding descriptive statistics are presented in Table 5.1. From a price replication point of view, all models have, on average, a lower deviation from the truth than the naive approach. The min and max columns in Table 5.1 present the best- and worst-case performance, respectively. Noteworthy, all methods except Random Forest achieve results equivalent to the naive approach in their worst case. The Lasso model achieves the lowest mean, min, and max MASE, meaning that Lasso is, on average, the best-performing method while simultaneously showing the best worst- and best-case scenarios. Though the Random Forest achieves a lower median MASE than the Lasso, it has a higher standard deviation and the worst worst-case value.

Figure 5.1: Distribution of the MASE for all imputed channels for each model on use case one. The distribution is estimated through KDE with an RBF-kernel and bandwidth 0.7.

Table 5.1: Descriptive statistics of the MASE for all models on use case one. Numbers depicted in bold are the best for that column, while numbers with underscore are the worst.

Model                   mean    std (%)   median   min     max
Linear Interpolation    0.771    9.082    0.778    0.588   1.047
Lasso                   0.651   17.144    0.636    0.225   1.047
Random Forest           0.664   18.601    0.625    0.283   1.366
Multilayer Perceptron   0.693   16.446    0.667    0.335   1.079
Gaussian Process        0.714   10.815    0.690    0.534   1.070

Figure 5.2 presents a ridgeline plot of the MASE distribution for each model and its performance on each asset class. It shows that there are substantial differences in performance between the asset classes. Specifically, all models perform worse on the discount factors compared to the other asset classes. The MASE distribution for discount factors is also very similar across the models. This most likely implies that none of the models have successfully parsed any vital information; they simply predict the price to lie between the closest reference points, as the linear interpolation method does.

Figure 5.2: Distribution of the MASE for all imputed channels for each model and the specific asset class on use case one. The distribution is estimated through KDE with an RBF-kernel and bandwidth 0.7.

In Tables 5.2 and 5.3, the descriptive statistics of RDVaR and RDES are presented for use case one. Numbers depicted in bold are the best for that column, while numbers with an underscore are the worst. All methods underestimate both risk metrics on average. Only Random Forest and Multilayer Perceptron have relative changes that overestimate the risk for some channels. Linear Interpolation is the worst model in terms of risk metric replication; on average, it underestimates the VaR and ES metrics by 8.5% and 9.5%, respectively. One can also conclude that the models underestimate the ES metric more than VaR. Although the models underestimate the VaR metric, the even larger underestimation of ES indicates that the models fall short in predicting extreme movements; recall that ES takes the mean of the largest deviations. Random Forest is the method that produces the best imputations in terms of risk, keeping some of the asset returns' fat-tailed properties, although it still underestimated the tail risk, i.e., ES, for all assets. As for MASE, there is RDVaR and RDES performance variability between the asset classes; see Appendix F. All models perform best at imputing the movements of the options derived from the implied volatilities.

Table 5.2: The relative deviation in Value at Risk (RDVaR) for all models on use case one. Numbers depicted in bold are the best for that column, while numbers with underscore are the worst.

Model                   mean (%)   std (‰)   median (%)   min (%)    max (%)
Linear Interpolation     -8.496     39.347    -8.389      -16.644    -1.937
Lasso                    -5.669     38.690    -4.618      -16.644    -0.376
Random Forest            -5.061     37.592    -4.689      -16.240     0.614
Multilayer Perceptron    -6.030     36.539    -5.590      -16.630     0.614
Gaussian Process         -7.767     33.902    -7.737      -16.601    -1.980

Table 5.3: The relative deviation in Expected Shortfall (RDES) for all models on use case one. Numbers depicted in bold are the best for that column, while numbers with underscore are the worst.

Model                   mean (%)   std (‰)   median (%)   min (%)    max (%)
Linear Interpolation     -9.505     21.845    -9.631      -13.982    -5.816
Lasso                    -6.655     33.103    -6.743      -13.982     0.146
Random Forest            -6.119     28.880    -6.267      -11.576    -0.096
Multilayer Perceptron    -7.238     26.904    -7.233      -13.982    -3.175
Gaussian Process         -8.879     20.658    -8.965      -13.653    -5.441
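To make the risk-metric deviations concrete, the following is a minimal historical-simulation sketch of VaR, ES, and a relative-deviation ratio of the RDVaR/RDES kind. The function names and the synthetic fat-tailed "true" series are our own illustration, not the thesis' exact risk model:

```python
import numpy as np

def var_es(returns, alpha=0.99):
    """Historical-simulation VaR and ES at level alpha, as positive losses."""
    losses = -np.asarray(returns)
    var = np.quantile(losses, alpha)
    es = losses[losses >= var].mean()  # mean of the losses beyond the quantile
    return var, es

def relative_deviation(imputed, true):
    """RDVaR/RDES-style ratio; negative values mean risk is understated."""
    return (imputed - true) / true

rng = np.random.default_rng(0)
true_returns = rng.standard_t(df=3, size=5000) * 0.01   # fat-tailed "true" returns
imputed_returns = rng.normal(scale=0.012, size=5000)    # thinner-tailed imputation

var_t, es_t = var_es(true_returns)
var_i, es_i = var_es(imputed_returns)
rd_var = relative_deviation(var_i, var_t)
rd_es = relative_deviation(es_i, es_t)
print(rd_var, rd_es)  # both negative: the imputed series understates risk
```

The toy setup mimics the pattern reported above: an imputation with too-thin tails produces negative relative deviations, and ES, being a tail average, is hit at least as hard as VaR.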

5.2 Use Case Two

For use case two, the distribution of MASE for all imputed channels is visualised in Figure 5.3, and its corresponding descriptive statistics are presented in Table 5.4. From a price replication point of view, Lasso is the only model that, on average, has a lower deviation from the truth than the naive model. It is also the model with the highest MASE variability. Compared to use case one, there is substantial growth in the variability of the MASE metric, which makes it more difficult to draw distinct conclusions. The neural network-based models all have less variability than the others; however, it is hard to draw any general conclusion from this alone. As depicted in Figure 5.3, although many of the channel performances fall around MASE ≈ 1, some points lie far to the right, which both affects the average MASE and enlarges the variability.

Figure 5.3: Distribution of the MASE for all imputed channels for each model on use case two. The distribution is estimated through kernel density estimation with an RBF-kernel and bandwidth 0.5.

Table 5.4: Descriptive statistics of the MASE for all models on use case two. Numbers depicted in bold are the best for that column, while numbers with underscore are the worst.

Model                   mean    std (%)   median   min     max
Lasso                   0.970   109.880   0.671    0.236   6.895
Gaussian Process        1.331    97.109   1.104    0.369   7.023
Multilayer Perceptron   1.021    64.463   0.817    0.143   3.546
WaveNet                 1.250    85.944   0.998    0.213   6.408
SeriesNet               1.185    62.994   1.007    0.405   3.514

There is a performance difference between the asset classes, which is depicted by the ridgeline plot in Figure 5.4. Lasso has, on average, the smallest MASE on futures and discount factors, while at the same time having the lowest median MASE on all asset classes but FX rates. Still, Lasso also has the worst worst-case MASE on both FX rates and the option prices derived from the implied volatilities.

Figure 5.4: Distribution of the MASE for all imputed channels, per model and asset class on use case two. The distribution is estimated through kernel density estimation with an RBF-kernel and bandwidth 0.5. The distribution is cut at MASE equals 4 for illustrative purposes.

In Tables 5.5 and 5.6, the descriptive statistics of RDVaR and RDES for all models on use case two are presented. Numbers depicted in bold are the best for that column, while numbers with an underscore are the worst. Similar to use case one, all methods underestimate the risk metrics on average. However, now the WaveNet and SeriesNet sometimes overestimate the risk by as much as, or more than, it is underestimated. SeriesNet seems to be the best, on average, at retaining the properties of the return distribution but has high variability in its results. In Tables G.2 and G.3, one can see that SeriesNet has high variability even within specific asset classes but is the best method in terms of the median relative deviation of ES for futures, discount factors, and the option prices derived from the implied volatilities. As expected, the naive model underestimates VaR and ES the most. Also, the models underestimate the ES metric more than VaR, which indicates failure in predicting larger price movements.

Table 5.5: The relative deviation in Value at Risk (RDVaR) for all models on use case two. Numbers depicted in bold are the best for that column, while numbers with underscore are the worst.

Model                   mean (%)   std (‰)   median (%)   min (%)    max (%)
NNI                     -10.800     63.515   -10.015      -25.763    -0.017
Lasso                    -6.632     57.686    -5.936      -22.931     2.300
Gaussian Process        -10.169     58.319    -9.467      -25.036    -0.017
Multilayer Perceptron    -7.163     55.023    -5.786      -22.931    -0.017
WaveNet                  -6.240     85.669    -6.219      -24.165    26.781
SeriesNet                -3.792    101.111    -3.161      -24.338    27.833

Table 5.6: The relative deviation in Expected Shortfall (RDES) for all models on use case two. Numbers depicted in bold are the best for that column, while numbers with underscore are the worst.

Model                   mean (%)   std (‰)   median (%)   min (%)    max (%)
NNI                     -11.529     73.603    -9.634      -28.649    -3.390
Lasso                    -7.753     66.912    -5.947      -28.262     3.264
Gaussian Process        -11.331     72.550    -9.634      -28.649    -2.048
Multilayer Perceptron    -8.665     64.069    -7.060      -28.268    -0.425
WaveNet                  -6.993    103.727    -8.205      -26.955    34.079
SeriesNet                -5.030    116.830    -4.801      -26.470    43.696

Chapter 6 Discussion and Reflection

In this section, we discuss the results of this thesis and reflect upon the approaches leading to them. The discussion starts by presenting some of the identified shortcomings of our models and elaborating on why these may arise. We continue by presenting models that were considered during the thesis but not included in the report for various reasons. Lastly, we summarise some possible improvements and extensions of this study.

6.1 Risk Underestimation

All models underestimated the downstream risk metrics Value at Risk and Expected Shortfall. Still, they performed better than the naive model from a risk metric replication point of view, making it fair to say that they pick up a signal. The issue seems to be that the models failed to replicate the extreme scenarios that strongly influence the risk metrics. This became even more apparent when comparing the deviations of the Value at Risk and Expected Shortfall metrics. For all models, the predicted Expected Shortfall was more significantly underestimated than the Value at Risk, i.e., not only was the threshold quantile in the Expected Shortfall calculation underestimated, but all exceeding observations were further underestimated.

One plausible reason for this lies in the loss function. All models are fitted to the data to minimise the deviation between the predicted and actual values. Extreme movements are in the minority and, if not predicted at the correct time point, the loss function will favour a cautious model over a reckless one. Another potential reason is that many of the models are fitted on the log returns. When converting the predictions to prices, an aggregation technique is applied that utilises the high autocorrelation of the price process. The final prediction is a weighted average of two predictions from opposing reference points, which yielded a better result from a price replication point of view. However, when applying the price predictions in a Value at Risk model, the implied returns will deviate from those obtained by the initial model prediction. Thus, the way return predictions are processed into prices will affect the downstream risk measure result. If the purpose of filling missing data points is to obtain a complete return series, it is reasonable to skip converting the predicted returns to prices.
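The two-sided aggregation idea can be sketched as follows. The helper `fill_gap` and its linear blending weights are our own simplification, assuming n + 1 predicted log returns spanning the gap between two observed anchor prices; the thesis' exact weighting may differ:

```python
import numpy as np

def fill_gap(p_left, p_right, r):
    """Blend a forward-compounded and a backward-discounted price path.

    r holds the n + 1 predicted log returns covering the steps from the
    last observed price p_left to the next observed price p_right, so the
    gap itself contains n missing prices."""
    n = len(r) - 1
    csum = np.cumsum(r)[:n]                  # cumulative return up to each gap point
    fwd = p_left * np.exp(csum)              # compound forward from the left anchor
    bwd = p_right * np.exp(csum - r.sum())   # discount backward from the right anchor
    w = 1.0 - np.arange(1, n + 1) / (n + 1)  # weight the nearer anchor higher
    return w * fwd + (1.0 - w) * bwd

filled = fill_gap(100.0, 103.0, np.array([0.01, 0.00, 0.02]))
print(filled)  # two interior prices, pulled towards both anchors
```

Note that when the predicted returns are not perfectly consistent with the right anchor, as in this toy call, the blended prices imply returns that deviate from the raw predictions, which is exactly the effect on downstream risk metrics discussed above.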

6.2 Time Component

As stated by the evaluation metrics, the best performing models ignored the time component of the predictions, which means that the models considering the data as a sequence did not succeed in sourcing any further information. However, what information did we expect to find? According to the stylised facts of financial variables, volatility clustering and temporal correlation are common. We did not expect the historical sequence on its own to give information about future values. The ambition was that sequence-handling models would parse the context of a missing value, which, together with information from referencing channels, would increase performance. As we see it, there are two potential sources of failure: the time-concerned models may have failed to parse that knowledge, or such knowledge cannot help in making better predictions. The first case could be explained by too little data being fed into the model, such that patterns do not repeat. It could also be that the model design was wrongly chosen or calibrated.

6.3 Fallback Logic

The performance metrics are aggregated to allow a general statement. Still, one cannot ignore that there is great variety between the channels in the dataset. Some channels have a high correlation with others, whereas some do not. It also seems that the importance of the level of the price process shifts considerably between the channels; see Appendix E for details. The variability in the characteristic nature of the data being imputed puts high demands on the generalisation capabilities of the models.

During the model development phase, we have seen that a fair "fallback" strategy significantly impacts the overall result. By fallback strategy, we refer to the estimates of a model in situations where it could not find any significant patterns. Put less formally, a fair fallback strategy is when a model has a reasonable guesstimate, which explains why the Lasso model shows successful results. The fallback of Lasso is to predict the average return of the training data; applied to use case one, that approach is equivalent to the linear interpolation technique. The other models do not have the same reasonable fallback, as their regularisation is less effective.

6.4 Error Measures

Even though the error measures were carefully chosen, they do not capture all aspects of performance. An example is given in Figure 6.1, where the imputation result is visualised for the Lasso and WaveNet models. Both models perform better than the naive model in terms of MASE, 0.335 and 0.458 for Lasso and WaveNet, respectively. Although Lasso has a lower MASE, it is arguable that WaveNet copes better with the actual structure of the time series. RDVaR aimed to complement MASE, focusing on the daily price movements, and to pinpoint situations like this. However, the RDVaR for Lasso and WaveNet equals −7.7×10−4 and −6.9×10−4 respectively, which does not indicate a significant difference. Perhaps a more sophisticated way of measuring similarity between time series, e.g., dynamic time warping, could give additional performance information that is currently ignored.

Figure 6.1: Imputation result for Nearest Neighbour Imputation, Lasso, and the WaveNet model. The channel being imputed is the British pound discount factor, represented as a bond instrument, on the use case two dataset.
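As a sketch of the suggested alternative, a plain dynamic time warping distance can be written in a dozen lines. This is the textbook O(nm) algorithm with absolute-difference cost, not something used in the thesis; the toy series are our own:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance: lower means the two series are more
    similar up to local time shifts, unlike a pointwise error measure."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

grid = np.linspace(0, np.pi, 8)
wave = np.sin(grid)
shifted_wave = np.sin(grid + 0.3)  # same shape, slightly out of phase
flat = np.zeros(8)

print(dtw_distance(wave, shifted_wave) < dtw_distance(wave, flat))  # True
```

A phase-shifted copy of a wave scores much closer than a flat line of comparable pointwise error, which is exactly the structural similarity that MASE and RDVaR fail to separate in the example above.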

6.5 Complexity

Something not yet emphasised is the trade-off between complexity and readability. A complex model allows flexibility in the types of patterns it can learn but comes at the cost of having more model parameters, thus being more challenging to interpret. The application this study focuses on has a downstream effect on risk management decisions. Therefore, the readability of a model can be an essential consideration, since it affects clients or needs to be approved by regulating authorities.

Nearest Neighbour Imputation and Linear Interpolation are both easily interpreted but have almost no flexibility. The Lasso possesses an explicit model of its predictions, and due to the ℓ1-norm regularisation, it usually ends up with a sparse set of regression coefficients. Although more flexible, it is still possible to interpret and backtrack its predictions. The foundation of the Random Forest is the decision tree. At every step, there is a binary decision that is easily understood. However, the Random Forest contains a large number of decision trees whose predictions are aggregated; the readability of a prediction is therefore low. Still, techniques exist to measure feature importance in a Random Forest, which gives some intuition about the model's attention. The Gaussian Process model is considered a non-parametric model in practice but theoretically has an infinite number of parameters.
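The sparsity argument for the Lasso can be illustrated with a bare-bones coordinate-descent implementation; this is a generic textbook sketch on synthetic data, not the model configuration used in the thesis. With an ℓ1 penalty, only the genuinely informative "reference channels" keep non-zero weights, which is what makes the predictions easy to backtrack:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent Lasso: minimises 0.5*||y - X b||^2 + lam*||b||_1
    via per-coordinate soft-thresholding."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]        # residual excluding feature j
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))                    # 10 candidate reference channels
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=300)

beta = lasso_cd(X, y, lam=20.0)
print(np.flatnonzero(np.abs(beta) > 1e-8))        # only the informative channels survive
```

The soft-thresholding step sets uninformative coefficients to exactly zero, so the fitted model names its reference channels explicitly, at the cost of a small shrinkage bias on the surviving coefficients.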

In contrast to the other models, the Gaussian Process has a probabilistic approach and can express its posterior distribution at every point in the feature space. The posterior distribution shows what the model has learned; still, it cannot be used to understand how the model came up with a specific estimate. The last group of models all belong to the Artificial Neural Networks (ANNs). Deriving an understanding of a prediction made by an ANN is very difficult; they are often referred to as black boxes that parse the input and spit out an output. If the network is large enough, it can learn any mapping function between the input and output, and it has the highest flexibility. What has been evident during the development phase is that large ANNs are sensitive to their hyperparameters and need some regularisation to generalise their knowledge. Hence, they are regularised to gain performance through the bias-variance trade-off.

6.6 Use Case Framing

The use cases were framed to simplify evaluation and problem modelling. Two use cases cannot fully capture all situations in which financial time series are incomplete. The assumption of complete reference channels at all times is perhaps not valid in many real-world scenarios. However, these two use cases cover the potential scenarios for the missing channel reasonably well. Any arbitrary dataset could, through some manipulation, e.g., removal of channels or an adjusted time horizon, be framed to fit either of the two use cases.

We also believe it is important to note that the time horizon considered has one notable feature: it covers the Covid-19 pandemic, one of the most significant shocks observed in the financial markets. The evaluation, primarily use case two, is thus based on abnormal market conditions. Since the historical horizon does not capture any equivalent stressed period, it is interesting to include, as it indeed stress-tests the models. Still, it would also be interesting to see how this affected the result of use case two.

In the data preparation process, we removed channels from the dataset that had an almost perfect correlation with other channels. This was to frame the problem in a setting where there is no obvious way to fill the missing values. In an application, however, this action will lower the imputation performance and is hence not preferred. In a real-world setting, we recommend the opposite strategy: extending the dataset with as many correlated channels as possible.

6.7 Excluded Models

This report evaluates several models, but some additional models were investigated and not included in the report due to poor performance or time restrictions. Early on, several autoregressive models were applied, such as AR, ARMA, and ARIMA. They assume that a value can be modelled through linear combinations of previous values; these approaches performed poorly, which is not too surprising if one believes in the theory of efficient markets. Below is a summary of three models considered state of the art in data imputation, together with a short note concerning this thesis.

i) Variational Autoencoder (VAE); a probabilistic generative approach that aims to represent the data in a small, regularised latent space that can be decoded to yield predictions. We implemented a VAE and applied it to the use case one data. Unfortunately, it failed to learn a regularised latent space on our data and showed poor performance. An issue with this model was how we should represent our data; we applied the data both as sequences and as point observations. Since we failed to get the desired performance, we did not extend the VAE towards a Gaussian Process VAE (GP-VAE), which is considered a state-of-the-art model for imputation of multivariate time series data. The VAE approach would perhaps be more tempting if our use cases were not framed to have only a single channel with missing values.

ii) Generative Adversarial Network (GAN); a generative model containing a generative and a discriminative network learned through competition. The GAN is also considered state of the art in data imputation. Although a different architecture compared to the VAE, the GAN also tries to represent the input data in a regularised space. Since the VAE failed in its purpose, and due to time restrictions, we did not implement the GAN on our dataset.

iii) Bidirectional Recurrent Imputation for Time Series (BRITS); two recurrent networks applied in opposing directions that, together with a feature-concerning network, aggregate the result into a final prediction. This model seemed very promising, and we have taken inspiration from the modelling technique presented in [6]. The unidirectional version, RITS, was implemented on our data, but it did not perform better than a Multilayer Perceptron (MLP) model trained on the log returns. As described in Section 3.8, we believe that one issue was that the gradients of long-term dependencies were ignored due to vanishing gradients in the recurrent component. Since we could not obtain a better result than a single MLP model, we did not extend our implementation to BRITS. As the authors state in [6], the BRITS model can be extended to consider the loss of downstream applications.

6.8 Improvements and Extensions

As the subject of financial time series imputation is relatively unexplored or, at least, poorly publicly documented, much time has been spent on understanding and modelling the problem. The trial-and-error approach in the model selection procedure has given insights into potential extensions of this field. Below follows a summary of proposed improvements and extensions.

i) Loss function adjustments. It would be interesting to evaluate the effect the loss function has during training. Maybe we would have obtained more realistic movements and increased performance if the loss function did not only consider the reconstruction loss of the prediction. What would be the effect of adding the loss of, e.g., VaR deviation, arbitrage-free price deviation, or price series consistency, when operating on the log returns?

ii) Improved regularisation and fallback strategies. As previously explained, the fallback strategy can be crucial for a model to reach a higher overall performance. As an improvement, the regularisation and fallback design (especially for the ANN models) could be further investigated.

iii) Investigate the performance of the use case definition. The use cases have been framed to allow easy evaluation and comparison. However, how would the models perform in a setting where multiple channels have incomplete data? Such a scenario would perhaps suggest other model designs and techniques.

iv) Extend the evaluation metrics. The evaluation metrics do not capture all aspects of performance. Adding, e.g., a dynamic time warping metric could increase the evaluation coverage.

v) In this study, single methods have been evaluated. Still, the different channels possess different characteristics, and finding a "one fits all" method is perhaps complicated. Instead, how should a financial time series imputation system operate where models can be chosen with respect to some measurable property of the missing channel?

Chapter 7 Conclusion

The goal of this thesis was to evaluate and propose methods to impute financial time series in the context of downstream risk applications. We have found techniques and methods yielding higher performance for both use cases than the naive method, both from a price- and risk metric replication point of view.

Even though the result is ambiguous, the Lasso model has shown the best holistic performance. Lasso lowered the price replication error by 35 % compared to the naive model for use case one. Lasso generally underestimated both Value at Risk and Expected Shortfall. Still, it was one of the models showing the smallest average underestimation (-5% and -6% for Value at Risk and Expected Shortfall, respectively). For use case two, Lasso was the only model, on average, that had a lower price replication error than the naive model. The average risk metric replication error for use case two was somewhat higher (-6.6% and -7.8% for Value at Risk and Expected Shortfall, respectively). Another advantage of Lasso is the interpretability of the model, where all predictions can intuitively be derived from the sparse set of regression coefficients.

All models systemically underestimated the downstream risk metrics. Even though some models, e.g., Random Forest, WaveNet, and SeriesNet, occasionally overestimated the risk metrics, such behaviour was penalised by the price replication error. The problem is twofold: the predicted values should align with the price process as well as with the price movements implied by neighbouring prices. Since all models are specialised in estimating the price process, the implied return process is a secondary focus. So, there is a trade-off between a price-attentive and a return-attentive model. For use case two, it became evident that increasing the attention on the price process, e.g., with a separate model predicting the last price, led to a substantial decrease in price replication error.

Financial time series possess stylised facts and a high noise-to-signal ratio, making them exciting but challenging to work with. The time-aware models designed to utilise their complex structure failed to deliver consistent performance. Still, modelling improvements and architectural enhancements might successfully parse and incorporate these complex structures.

Bibliography

[1] Charu C. Aggarwal. Neural Networks and Deep Learning: A Textbook. Springer International Publishing AG, 2018, pp. 315–416.
[2] Faraj Bashir and Hua-Liang Wei. "Handling missing data in multivariate time series using a vector autoregressive model-imputation (VAR-IM) algorithm". In: Neurocomputing (2018), pp. 23–30.
[3] Mikolaj Binkowski, Gautier Marti, and Philippe Donnat. "Autoregressive Convolutional Neural Networks for Asynchronous Time Series". In: (2018). url: https://arxiv.org/pdf/1703.04122.pdf.
[4] Anastasia Borovykh, Sander Bohte, and Cornelis W. Oosterlee. "Dilated Convolutional Neural Networks for Time Series Forecasting". In: (2018).
[5] Richard A. Brealey, Stewart C. Myers, and Franklin Allen. Principles of Corporate Finance. McGraw-Hill/Irwin, 2014, pp. 693–714.
[6] Wei Cao et al. "BRITS: Bidirectional Recurrent Imputation for Time Series". In: (2018). url: https://arxiv.org/pdf/1805.10572v1.pdf.
[7] Rama Cont. "Empirical Properties of Asset Returns: Stylized Facts and Statistical Issues". In: 1 (2001), pp. 223–226.
[8] Paul S.P. Cowpertwait and Andrew V. Metcalfe. Introductory Time Series with R. Springer, 2009.
[9] Jon Danielsson. Financial Risk Forecasting. West Sussex, United Kingdom: John Wiley & Sons Ltd, 2011, pp. 208–212.
[10] Matthew F. Dixon, Igor Halperin, and Paul Bilokon. Machine Learning in Finance: From Theory to Practice. Springer Nature Switzerland, 2020, pp. 91–108.
[11] Douglas Hamilton, Senior Director, Machine Intelligence Lab, Nasdaq Inc. Interview. Consulting interview at project start. Feb. 2021.
[12] Chenguang Fang and Chen Wang. "Time Series Data Imputation: A Survey on Deep Learning Approaches". In: (2020). url: https://arxiv.org/pdf/2011.11347.pdf.
[13] Rao Fui et al. "Time Series Simulation by Conditional Generative Adversarial Net". In: (2019). url: https://arxiv.org/pdf/1904.11419.pdf.
[14] Gareth James et al. An Introduction to Statistical Learning. Springer, 2013, pp. 71–82, 203–230, 303–320.
[15] Jim Gatheral. The Volatility Surface: A Practitioner's Guide. John Wiley & Sons Inc, 2006, pp. 1–13.
[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016, pp. 373–416.
[17] Fredrik Gunnarsson. "Filtered Historical Simulation Value at Risk for Options: A Dimension Reduction Approach to Model the Volatility Surface Shifts". In: (2019). url: https://www.diva-portal.org/smash/get/diva2:1326070/FULLTEXT01.pdf.
[18] Kaiming He et al. "Deep Residual Learning for Image Recognition". In: (2015). url: https://arxiv.org/pdf/1512.03385.pdf.
[19] Kaiming He et al. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". In: (2015). url: https://arxiv.org/pdf/1502.01852v1.pdf.
[20] Yves J. Hilpisch. Derivatives Analytics with Python: Data Analysis, Models, Simulation, Calibration and Hedging. John Wiley & Sons Inc, 2015, pp. 19–36.
[21] G. E. Hinton et al. "Improving neural networks by preventing co-adaptation of feature detectors". In: (2012). url: https://arxiv.org/pdf/1207.0580.pdf.
[22] HKEX. Trading Calendar and Holiday Schedule. 2021. url: https://www.hkex.com.hk/Services/Trading/Derivatives/Overview/Trading-Calendar-and-Holiday-Schedule?sc_lang=en (visited on 04/16/2021).
[23] John C. Hull. Options, Futures, and Other Derivatives. 8th ed. Pearson Education, Inc., 2012.
[24] John C. Hull. Risk Management and Financial Institutions. 5th ed. Hoboken, New Jersey: John Wiley & Sons Inc, 2018, pp. 17, 277–295.
[25] Rob Hyndman. "Another look at forecast-accuracy metrics for intermittent demand". In: 4 (2006).
[26] Sergey Ioffe and Christian Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". In: (2015). url: https://arxiv.org/pdf/1502.03167.pdf.
[27] Richard Johnson and Dean Wichern. Applied Multivariate Statistical Analysis. 6th ed. Pearson Education Limited, 2014, pp. 149–209.
[28] D. P. Kingma and J. L. Ba. "Adam: A Method for Stochastic Optimization". In: (2015). url: https://arxiv.org/pdf/1412.6980.pdf.
[29] Junyan Liu, Sandeep Kumar, and Daniel P. Palomar. "Parameter Estimation of Heavy-Tailed AR Model with Missing Data via Stochastic EM". In: (2019). url: https://arxiv.org/pdf/1809.07203.pdf.
[30] Alexander J. McNeil, Rüdiger Frey, and Paul Embrechts. Quantitative Risk Management: Concepts, Techniques and Tools. Princeton University Press, 2005, pp. 116–182.
[31] Christopher Olah. Conv Nets: A Modular Perspective. 2014. url: https://colah.github.io/posts/2014-07-Conv-Nets-Modular/ (visited on 03/03/2021).
[32] Christopher Olah. Understanding LSTM Networks. 2015. url: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ (visited on 03/03/2021).
[33] Aäron van den Oord et al. "Conditional Image Generation with PixelCNN Decoders". In: (2016). url: https://arxiv.org/abs/1606.05328.
[34] Aäron van den Oord et al. "WaveNet: A Generative Model for Raw Audio". In: (2016). url: https://arxiv.org/pdf/1609.03499.pdf.
[35] Razvan Pascanu, Tomás Mikolov, and Yoshua Bengio. "On the difficulty of training Recurrent Neural Networks". In: (2013). url: http://arxiv.org/abs/1211.5063.
[36] Marcos López de Prado. Advances in Financial Machine Learning. John Wiley & Sons, Inc., 2018, pp. 315–416.
[37] Natraj Raman et al. "Synthetic Reality: Synthetic market data generation at scale using agent based modeling". In: (2020).
[38] Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. "Early stopping and non-parametric regression: An optimal data-dependent stopping rule". In: (2013). url: https://arxiv.org/pdf/1306.3574v1.pdf.
[39] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006, pp. 7–107.
[40] Riccardo Rebonato. Volatility and Correlation: The Perfect Hedger and the Fox. 2nd ed. John Wiley & Sons Inc, 2004, pp. 201–235.
[41] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. "Learning Internal Representations by Error-Propagation". In: (1986). Ed. by D. E. Rumelhart and J. L. McClelland.
[42] Zhipeng Shen et al. "A novel time series forecasting model with deep learning". In: Neurocomputing (2020), pp. 302–313.
[43] Wikipedia. Foreign exchange market. 2021. url: https://en.wikipedia.org/wiki/Foreign_exchange_market (visited on 04/20/2021).
[44] Wikipedia. List of S&P 500 companies. 2021. url: https://en.wikipedia.org/wiki/List_of_S%26P_500_companies (visited on 04/16/2021).
[45] Taco de Wolff, Alejandro Cuevas, and Felipe Tobar. "Gaussian Process Imputation of Multiple Financial Series". In: (2020), pp. 8444–8448.

Appendices

Chapter A Removed Holidays

The following dates have been removed from the dataset since they are public or unofficial holidays in most markets where the instruments in our dataset are traded. These days would therefore cause missing values, and they are removed in the data preparation part of the thesis.

2014-01-01 New Year’s Day
2014-01-20 Martin L. K. Jr. Day
2014-02-17 President’s Day
2014-04-18 Good Friday
2014-04-21 Easter Monday
2014-05-26 Memorial Day
2014-07-04 Fourth of July
2014-09-01 Labour Day
2014-11-27 Thanksgiving
2014-12-25 Christmas Day
2014-12-26 Boxing Day
2015-01-01 New Year’s Day
2015-01-19 Martin L. K. Jr. Day
2015-02-16 President’s Day
2015-04-03 Good Friday
2015-04-06 Easter Monday
2015-05-25 Memorial Day
2015-07-03 Independence Day
2015-09-07 Labour Day
2015-11-26 Thanksgiving
2015-12-25 Christmas Day
2016-01-01 New Year’s Day
2016-01-18 Martin L. K. Jr. Day
2016-02-15 President’s Day
2016-03-25 Good Friday
2016-03-28 Easter Monday
2016-05-30 Memorial Day
2016-07-04 Fourth of July
2016-09-05 Labour Day
2016-11-24 Thanksgiving
2016-12-26 Boxing Day
2017-01-02 Day After New Year’s
2017-01-16 Martin L. K. Jr. Day
2017-02-20 President’s Day
2017-04-14 Good Friday
2017-05-01 Labour Day
2017-05-29 Memorial Day
2017-07-04 Fourth of July
2017-09-04 Labour Day
2017-11-23 Thanksgiving
2017-12-25 Christmas Day
2017-12-26 Boxing Day
2018-01-01 New Year’s Day
2018-01-15 Martin L. K. Jr. Day
2018-02-19 President’s Day
2018-03-30 Good Friday
2018-04-02 Easter Monday
2018-05-28 Memorial Day
2018-07-04 Fourth of July
2018-09-03 Labour Day
2018-11-22 Thanksgiving
2018-12-25 Christmas Day
2018-12-26 Boxing Day
2019-01-01 New Year’s Day
2019-01-21 Martin L. K. Jr. Day
2019-02-18 President’s Day
2019-04-19 Good Friday
2019-05-27 Memorial Day
2019-07-04 Fourth of July
2019-09-02 Labour Day
2019-11-28 Thanksgiving
2019-12-25 Christmas Day
2020-01-01 New Year’s Day
2020-01-20 Martin L. K. Jr. Day
2020-02-17 President’s Day
2020-04-10 Good Friday
2020-05-25 Memorial Day
2020-07-03 Fourth of July
2020-09-07 Labour Day
2020-11-26 Thanksgiving
2020-12-25 Christmas Day
2021-01-01 New Year’s Day
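As an illustration of this preparation step (the price dictionary and the short holiday list below are hypothetical stand-ins, not the thesis code or dataset), dropping the holiday dates from a daily series can be sketched as:

```python
from datetime import date

# Hypothetical sketch of the data-preparation step: drop observations that
# fall on a known holiday before building the dataset.
HOLIDAYS = {date(2014, 1, 1), date(2014, 1, 20), date(2014, 2, 17)}

prices = {
    date(2013, 12, 31): 100.0,
    date(2014, 1, 1): None,    # holiday -> missing quote
    date(2014, 1, 2): 99.8,
    date(2014, 1, 20): None,   # holiday -> missing quote
}

# Keep only the rows whose date is not in the holiday set.
cleaned = {d: p for d, p in prices.items() if d not in HOLIDAYS}
print(sorted(cleaned))  # only the two trading days remain
```

The same filter applied to the full list above removes every systematically missing day in one pass, so the remaining gaps are genuine missing values rather than market closures.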

Chapter B Dataset

Tables B.1–B.4 present the futures, FX rates, discount factors and volatility surfaces included in the dataset.

Table B.1: List of all futures in the dataset.

Underlying                      Asset code   Venue code   Settlement   Currency   Maturity
Copper (C4)                     CME_CX-HG    CMX          Physically   USD        Monthly
Silver (C5)                     CME_CX-SI    CMX          Physically   USD        Monthly
Gold (C6)                       CME_CX-GC    CMX          Physically   USD        Monthly
HSCE Index (C11)                HSCEI        HKEX         Physically   HKD        Monthly
S&P 500 (C16)                   CME_SP       CME          Cash         USD        Monthly
USD Index (C17)                 ICUS_DX      ICUS         Physically   USD        Monthly
Palladium (C18)                 CME_NY-PA    NYM          Physically   USD        Monthly
Platinum (C19)                  CME_NY-PL    NYM          Physically   USD        Monthly
NIKKEI225 (C21)                 CME_NK       CME          Cash         USD        Monthly
Iron Ore (C25)                  SGX_FEF      SGX          Cash         USD        Monthly
EURO-Buxl(r) (C26)              EUREX_FGBX   EUREX        Physically   EUR        Monthly
Coal (API 2) (C27)              CME_NY-MTF   NYM          Physically   USD        Monthly
Hang Seng Index (C29)           HSI          HKEX         Cash         HKD        Monthly
Aluminium Alloy (C31)           LME_AA       LME          Physically   USD        Daily
Nickel (C33)                    LME_NI       LME          Physically   USD        Daily
Special High Grade Zinc (C36)   LME_ZS       LME          Physically   USD        Daily
Standard Lead (C37)             LME_PB       LME          Physically   USD        Daily
Tin (C38)                       LME_SN       LME          Physically   USD        Daily

Table B.2: List of all FX rates in the dataset.

From Currency                  To Currency
Euro (EUR) (C47)               US Dollar (USD)
Pound Sterling (GBP) (C48)     US Dollar (USD)
Canadian Dollar (CAD) (C49)    US Dollar (USD)
Hong Kong Dollar (HKD) (C54)   US Dollar (USD)
Japanese Yen (JPY) (C55)       US Dollar (USD)

Table B.3: List of all discount factors in the dataset.

Currency
US Dollar (USD) (C60)
Euro (EUR) (C62)
Pound Sterling (GBP) (C63)
Canadian Dollar (CAD) (C64)
Hong Kong Dollar (HKD) (C68)
Japanese Yen (JPY) (C70)

Table B.4: List of all volatility surfaces in the dataset.

Underlying                           Option type   Asset code   Venue code   Currency
Canadian Dollar (S74)                American      LME_CA       LME          USD
Primary High Grade Aluminium (S75)   American      LME_AH       LME          USD
Standard Lead (S76)                  American      LME_PB       LME          USD
Nickel (S79)                         American      LME_NI       LME          USD
Gold (S80)                           American      CME_CX-GC    CMX          USD
Special High Grade Zinc (S81)        American      LME_ZS       LME          USD

Chapter C Stylised Facts

Stylised Facts of Volatility

One of the key stylised facts about asset returns is volatility clustering. In addition, the volatility of asset returns has some stylised facts of its own that have been observed over time [20][7]. These are:

i) Stochasticity, volatility is not deterministic or constant, and one cannot forecast volatility with high confidence.

ii) Mean reversion, volatility seems to be mean-reverting, but the mean can change over time.

iii) Leverage effect, volatility tends to be negatively correlated with asset returns, i.e., when an asset's returns are high, its volatility tends to be lower, and vice versa.
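The first two facts can be illustrated with a toy GARCH(1,1) simulation (the parameters below are illustrative choices, not fitted to the thesis dataset): variance is stochastic but pulled back toward a long-run level, and the resulting clustering makes absolute returns autocorrelated even though the raw returns are not.

```python
import math
import random

random.seed(0)

# Toy GARCH(1,1): var_{t+1} = omega + alpha * r_t^2 + beta * var_t.
# Parameters are illustrative; the process mean-reverts to omega/(1-alpha-beta).
omega, alpha, beta = 0.05, 0.10, 0.85
long_run_var = omega / (1 - alpha - beta)

var = long_run_var
returns = []
for _ in range(20_000):
    r = math.sqrt(var) * random.gauss(0.0, 1.0)
    returns.append(r)
    var = omega + alpha * r * r + beta * var  # shocks decay back toward the mean

def lag1_corr(xs):
    n = len(xs) - 1
    mean = sum(xs) / len(xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n)) / n
    varx = sum((x - mean) ** 2 for x in xs) / len(xs)
    return cov / varx

# Clustering shows up in |r| (clearly positive lag-1 ACF), not in r itself.
print(lag1_corr([abs(r) for r in returns]) > lag1_corr(returns))
```

This mirrors why the thesis inspects absolute or squared returns when checking for volatility clustering: the sign-free series carries the persistence that raw returns hide.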

Furthermore, the implied volatility surface has the following stylised facts [40][20]:

i) Smiles, implied volatilities tend to have a smile shape, meaning that OTM and ITM implied volatilities are higher than ATM ones.

ii) Term structure, volatility smiles are more noticeable for options with a shorter time to maturity, implying that future volatility should be higher than today's.

Stylised Facts of Interest Rates

Short rates and their associated discount factors are involved in asset pricing for all asset types. The most important stylised facts for modelling are the following [20][23]:

i) Positivity, interest rates are in general non-negative.

ii) Stochasticity, interest rates in general, and short rates especially, behave randomly.

iii) Mean reversion, interest rates cannot trend nor go to infinity, so they must be mean-reverting.

iv) Term structure, interest rates vary with time to maturity and imply different forward rates, i.e., the yield of a five-year bond tends to be higher than the yield of a three-year bond.
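Facts ii) and iii) can be sketched with a toy Vasicek short-rate path, dr = κ(θ − r)dt + σ dW (the parameters below are illustrative, not calibrated): the rate moves randomly but is pulled back toward the long-run level θ instead of trending away.

```python
import math
import random

random.seed(1)

# Toy Vasicek simulation with illustrative parameters:
# kappa = speed of mean reversion, theta = long-run level, sigma = vol.
kappa, theta, sigma, dt = 2.0, 0.03, 0.01, 1.0 / 252.0

r = 0.10  # start well above the long-run mean
path = []
for _ in range(5 * 252):  # five years of daily steps
    r += kappa * (theta - r) * dt + sigma * math.sqrt(dt) * random.gauss(0.0, 1.0)
    path.append(r)

# After a few years the rate hovers near theta rather than staying at 0.10.
last_year_mean = sum(path[-252:]) / 252
print(abs(last_year_mean - theta) < 0.02)
```

Note that the plain Vasicek model can produce slightly negative rates, which is why fact i) (positivity) is usually hedged with "in general" as above.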

Chapter D Example of a WaveNet-architecture

Figure D.1: The WaveNet-architecture for 8 time-steps and 32 filters.

Chapter E Exploratory Data Analysis

Figure E.1: Prices of six different assets in the dataset from January 2nd 2014 to January 15th 2021. For details about the assets see Appendix B.

Figure E.2: Histogram of the log returns from six different assets in the dataset from January 2nd 2014 to January 15th 2021, displayed with 30 bins. The assets are named in the form CAXB or SAXBYC, where A denotes the specific curve or surface, B the time to maturity in days and C the option delta.

Figure E.3: ACF of six different assets' log returns in the dataset from January 2nd 2014 to January 15th 2021 for lags 1 to 40.

Before the actual modelling began, an exploratory data analysis (EDA) was performed to get to know the data and, possibly, uncover any difficulties. Beforehand, it was known that the different asset classes would have different characteristics and properties, but the differences within each class were not clear. To keep the report from containing thousands of figures, the EDA is explained here with only a subset of the instruments. The following assets were chosen:

i) A future on gold (C6).
ii) A future on S&P 500 (C16).
iii) A future on coal (C27).
iv) The euro-to-dollar FX rate (C47).
v) The discount rate for the dollar (C60).
vi) An option on gold (S80).

All instruments have a 90-day maturity, and the option price is derived with a delta of 0.5. In Figure E.1, the price process of each asset is depicted. For the futures, one can see that two of the assets, C6 and C16, have had a clear price trend over the last couple of years, while the last future, C27, shows a more seasonal behaviour. Seasonal price patterns are not uncommon among commodities. For the assets that are not futures, one cannot draw any conclusions about trend or seasonality, but the price of the zero-coupon bond derived with C60 and the option price derived by S80 seem to be mean-reverting, as described in the stylised facts of rates and volatility in Appendix C.

Moving on to the log returns of the price processes: in Figure E.2, a histogram of the log returns of each asset is depicted with 30 bins. One can clearly see that each asset has its own range of "viable" log returns; the zero-coupon bond derived with C60 has a narrow range, whereas the option price derived by S80 has a wide one. In addition, the results in Table E.1 show that all log-return distributions exhibit fat tails, but only some are also skewed.
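The fat-tail diagnostic used in Table E.1 can be sketched as follows (synthetic data, not the thesis assets): Fisher's excess kurtosis is zero for a Normal distribution, so a clearly positive sample value signals heavier-than-Normal tails.

```python
import random

random.seed(42)

# Fisher's (excess) kurtosis: fourth standardised moment minus 3,
# so a Normal distribution scores ~0 and fat tails score > 0.
def fisher_kurtosis(xs):
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / var**2 - 3.0

normal = [random.gauss(0.0, 1.0) for _ in range(50_000)]
# A Laplace sample has excess kurtosis 3 in theory -- visibly fatter tails.
laplace = [random.choice((-1, 1)) * random.expovariate(1.0) for _ in range(50_000)]

print(fisher_kurtosis(normal) < 1.0 < fisher_kurtosis(laplace))
```

The same statistic applied to the C16 or C60 log returns would reproduce the large positive values reported in Table E.1.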

In Figure E.3, one can see the autocorrelation function with a 95% confidence interval for the log returns at lags 1 to 40. Even where the correlation falls outside the bounds, i.e., is statistically significant, it is still minimal; e.g., for C16, the first significant correlation, between t and t − 1, is ≈ −0.15. The asset with the most autocorrelation in its log returns is C60, which seems to have significant correlation over five lags.
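The sample ACF and the white-noise band behind those 95% intervals can be sketched as below (the AR(1) series is illustrative, not one of the thesis assets); a correlation outside ±1.96/√n is what the figure flags as significant.

```python
import math
import random

random.seed(7)

# Sample autocorrelation at a given lag, and the +/-1.96/sqrt(n) band
# used to judge significance against white noise.
def acf(xs, lag):
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[t] - mean) * (xs[t - lag] - mean) for t in range(lag, n))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

n = 4000
band = 1.96 / math.sqrt(n)

# AR(1) series with coefficient 0.5: its lag-1 ACF sits near 0.5,
# far outside the band, unlike the tiny significant correlations in the EDA.
ar = [0.0]
for _ in range(n - 1):
    ar.append(0.5 * ar[-1] + random.gauss(0.0, 1.0))

print(acf(ar, 1) > band)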

Leaving the subset of assets: one of the key properties of financial time series is their ever-changing distributions and correlations. It is interesting to see whether some assets have been strongly correlated, both in prices and in log returns. Thus, Figure E.4 shows the absolute values of the correlation matrices as heatmaps for prices and log returns between all assets. For the prices in Figure E.4a, one can see a strong correlation between some of the assets. In Figure E.4b, which shows the correlations of log returns, almost all signs of a highly correlated dataset disappear. Compared to Figure E.4a, where it could be hard to distinguish between asset classes, the classes are now much easier to tell apart.
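The contrast between the two heatmaps is partly mechanical: levels of trending series correlate spuriously, while their increments need not. A minimal sketch with two independent random walks (synthetic data, unrelated to the thesis assets):

```python
import random

random.seed(3)

# Two independent random walks: their *prices* can correlate strongly by pure
# chance (both wander in trend-like fashion), while their *returns* stay close
# to zero correlation -- the same contrast as between Figures E.4a and E.4b.
def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

steps1 = [random.gauss(0.0, 1.0) for _ in range(2000)]  # i.i.d. "log returns"
steps2 = [random.gauss(0.0, 1.0) for _ in range(2000)]

p1, p2 = [0.0], [0.0]  # cumulate the steps into "price" paths
for a, b in zip(steps1, steps2):
    p1.append(p1[-1] + a)
    p2.append(p2[-1] + b)

print("price corr:", round(corr(p1, p2), 2))        # often large in magnitude
print("return corr:", round(corr(steps1, steps2), 2))  # close to zero
```

This is why correlations that survive the move from prices to log returns, such as C17 versus C47 below, are the economically meaningful ones.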

The conclusion one can draw is that there has been some correlation between assets and their log returns over time. For example, between C17 and C47 the absolute correlation is almost 1; this is because C17 is a future with USD as underlying, while C47 represents the EUR/USD FX curve. Other correlations that tend to be high are those between futures on different metals, e.g. gold and silver, and between futures on different indices. The futures and FX rates correlate among themselves, but neither correlates with the zero-coupon bonds or the options. It is interesting that the zero-coupon bonds seem to be uncorrelated with everything in terms of their log returns. The options are only correlated within their asset class, which is expected since they share the same underlying and the only difference is the implied volatilities.

Figure E.4: The assets are named in the form CAXB or SAXBYC, where A denotes the specific curve or surface, B the time to maturity in days and C the option delta. For a detailed description of the assets, see Appendix B. (a) The matrix of absolute correlations between all asset prices with 90 days maturity and delta 0.5. (b) The matrix of absolute correlations between all assets' log returns with 90 days maturity and delta 0.5.

Table E.1: Summary statistics including the mean, median, standard deviation, min, max, Fisher's kurtosis and skewness of the six assets' log returns. Fisher's kurtosis and skewness are measured relative to a Normal distribution (both are zero for a Normal).

Asset   mean       median     std       min        max       Fisher's Kurtosis   Skewness
C6       0.00023    0.00016   0.00924   -0.04991   0.05802    4.71                0.06
C16      0.00041    0.00063   0.01118   -0.11026   0.09133   19.08               -0.90
C27     -0.00009    0.00000   0.01512   -0.11496   0.07632    3.82               -0.05
C47     -0.00007    0.00000   0.00502   -0.02383   0.02989    2.52                0.04
C60     -0.00000    0.00000   0.00005   -0.00027   0.00059   21.78                2.18
S80     -0.00005   -0.00196   0.03586   -0.27361   0.35080   11.43                1.13

Chapter F Asset Class Results Use Case One

Table F.1: Descriptive statistics of the MASE for all models and the specific asset class on use case one. Numbers depicted in bold are the best for that statistic, while numbers with underscore are the worst.

Model                   mean    std (%)   median   min     max

Futures
Linear Interpolation    0.801   4.521     0.792    0.738   0.925
Lasso                   0.592   18.595    0.582    0.225   0.929
Random Forest           0.600   17.155    0.597    0.283   0.935
Multilayer Perceptron   0.645   18.769    0.630    0.335   1.079
Gaussian Process        0.705   7.951     0.681    0.610   0.925

FX Rate
Linear Interpolation    0.833   4.504     0.818    0.785   0.917
Lasso                   0.578   19.608    0.573    0.295   0.912
Random Forest           0.590   18.351    0.572    0.334   0.918
Multilayer Perceptron   0.633   16.227    0.629    0.405   0.917
Gaussian Process        0.731   8.939     0.693    0.668   0.917

Discount Factors
Linear Interpolation    0.889   6.658     0.907    0.782   1.047
Lasso                   0.885   7.130     0.907    0.750   1.047
Random Forest           0.941   14.482    0.925    0.766   1.366
Multilayer Perceptron   0.889   6.657     0.907    0.782   1.047
Gaussian Process        0.891   7.524     0.905    0.772   1.070

Volatilities
Linear Interpolation    0.686   5.987     0.686    0.588   0.806
Lasso                   0.653   8.068     0.625    0.536   0.797
Random Forest           0.658   11.562    0.621    0.514   1.030
Multilayer Perceptron   0.691   10.226    0.663    0.524   0.923
Gaussian Process        0.659   8.050     0.634    0.534   0.806
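The MASE reported in these tables can be sketched as follows (the numbers below are made up for illustration; the thesis' exact scaling series is assumed to be the in-sample naive forecast): the model's MAE is divided by the MAE of the naive last-observation forecast, so values below 1 beat the naive benchmark.

```python
# Minimal sketch of the Mean Absolute Scaled Error (MASE): the model's MAE
# scaled by the in-sample MAE of the naive (last-observation) forecast.
def mase(y_true, y_pred, y_train):
    naive_mae = sum(
        abs(b - a) for a, b in zip(y_train, y_train[1:])
    ) / (len(y_train) - 1)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return mae / naive_mae

y_train = [10.0, 12.0, 11.0, 13.0]  # naive in-sample MAE = (2 + 1 + 2) / 3 = 5/3
print(round(mase([14.0, 15.0], [13.5, 15.5], y_train), 2))  # 0.5 / (5/3) = 0.3
```

Read against Table F.1: every model's mean MASE below 1 is precisely this "better than naive" statement.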

Table F.2: The relative change in VaR (RDVaR) for all models and asset classes on use case one. Numbers depicted in bold are the best for that statistic, while numbers with underscore are the worst.

Model                   mean (%)   std (‰)   median (%)   min (%)   max (%)

Futures
Linear Interpolation    -9.729     28.685    -10.501      -15.803   -4.104
Lasso                   -5.177     27.616    -4.133       -10.322   -1.307
Random Forest           -4.832     27.342    -4.883       -9.242    -0.561
Multilayer Perceptron   -5.994     18.744    -6.264       -9.920    -2.721
Gaussian Process        -8.385     19.777    -8.104       -11.130   -4.104

FX Rates
Linear Interpolation    -9.244     14.940    -9.007       -11.485   -7.594
Lasso                   -5.508     33.758    -7.887       -8.445    -0.376
Random Forest           -5.432     29.408    -7.006       -8.275    -1.398
Multilayer Perceptron   -7.560     18.494    -8.221       -9.811    -5.354
Gaussian Process        -8.675     17.370    -9.007       -11.485   -6.760

Discount Factors
Linear Interpolation    -8.862     58.592    -8.117       -16.644   -1.937
Lasso                   -8.582     61.725    -8.117       -16.644   -1.155
Random Forest           -7.820     57.046    -5.578       -16.240   -1.865
Multilayer Perceptron   -8.860     58.561    -8.117       -16.630   -1.937
Gaussian Process        -8.861     58.397    -8.117       -16.601   -1.980

Volatilities
Linear Interpolation    -3.807     15.125    -3.682       -6.522    -2.196
Lasso                   -4.368     24.326    -4.202       -7.949    -0.985
Random Forest           -2.680     25.895    -2.615       -6.101    0.614
Multilayer Perceptron   -2.032     20.626    -1.769       -5.590    0.614
Gaussian Process        -4.061     13.905    -4.106       -6.525    -2.275

Table F.3: The relative change in ES (RDES) for all models and asset classes on use case one. Numbers depicted in bold are the best for that statistic, while numbers with underscore are the worst.

Model                   mean (%)   std (‰)   median (%)   min (%)   max (%)

Futures
Linear Interpolation    -9.750     18.803    -9.811       -12.841   -6.287
Lasso                   -5.513     27.249    -5.598       -9.990    0.146
Random Forest           -5.446     21.944    -6.058       -8.882    -1.220
Multilayer Perceptron   -6.517     18.009    -7.008       -8.886    -3.175
Gaussian Process        -8.855     15.512    -9.000       -12.110   -5.819

FX Rates
Linear Interpolation    -10.078    6.973     -9.885       -11.409   -9.446
Lasso                   -6.226     36.315    -7.779       -11.386   -1.973
Random Forest           -6.580     29.820    -7.203       -11.396   -3.250
Multilayer Perceptron   -7.674     23.474    -7.843       -11.409   -5.085
Gaussian Process        -9.352     11.103    -9.196       -11.409   -8.155

Discount Factors
Linear Interpolation    -11.007    22.933    -11.235      -13.982   -6.525
Lasso                   -10.965    22.958    -11.108      -13.982   -6.525
Random Forest           -8.378     40.862    -10.433      -11.576   -0.096
Multilayer Perceptron   -11.007    22.933    -11.235      -13.982   -6.525
Gaussian Process        -10.925    22.710    -11.224      -13.653   -6.382

Volatilities
Linear Interpolation    -6.790     12.189    -6.101       -9.052    -5.816
Lasso                   -6.129     15.697    -5.850       -8.957    -3.777
Random Forest           -5.497     17.532    -5.393       -8.590    -2.858
Multilayer Perceptron   -5.268     16.843    -4.849       -8.502    -3.590
Gaussian Process        -6.509     12.238    -5.828       -8.801    -5.441
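The VaR and ES that underlie the RDVaR and RDES metrics in Tables F.2 and F.3 can be sketched with a minimal historical estimator (the thesis' exact estimator and confidence level are assumed here): VaR is the loss at the empirical tail quantile, and ES is the mean loss at or beyond it, so ES ≥ VaR always holds.

```python
# Minimal historical VaR and ES at the 95% level (illustrative estimator).
def var_es(returns, alpha=0.95):
    r = sorted(returns)
    k = int((1 - alpha) * len(r))      # index of the tail quantile
    var = -r[k]                        # loss at the quantile
    es = -sum(r[: k + 1]) / (k + 1)    # mean loss at or beyond the quantile
    return var, es

# Hypothetical daily log returns, not thesis data.
rets = [-0.05, -0.03, -0.01, 0.0, 0.0, 0.01, 0.01, 0.02, 0.02, 0.03,
        0.01, 0.0, -0.02, 0.02, 0.01, 0.0, -0.01, 0.01, 0.03, 0.02]
var, es = var_es(rets)
print(round(var, 3), round(es, 3))  # 0.03 0.04
```

A negative RDVaR then simply means the imputed series produced a smaller `var` than the true series did, i.e. the systematic underestimation of tail risk discussed in the thesis.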

Chapter G Asset Class Results Use Case Two

Table G.1: Descriptive statistics of the MASE for all models and the specific asset class on use case two. Numbers depicted in bold are the best for that statistic, while numbers with underscore are the worst.

Model                   mean    std (%)   median   min     max

Futures
Lasso                   0.892   61.142    0.730    0.236   2.908
Gaussian Process        1.734   134.606   1.482    0.575   7.023
Multilayer Perceptron   1.136   76.701    0.908    0.143   3.546
WaveNet                 1.102   46.243    0.997    0.553   3.016
SeriesNet               1.188   67.570    1.005    0.405   3.514

FX Rate
Lasso                   1.079   95.208    0.757    0.248   2.958
Gaussian Process        1.621   63.796    1.790    0.369   2.280
Multilayer Perceptron   0.789   57.889    0.522    0.389   1.940
WaveNet                 0.969   46.257    0.841    0.501   1.770
SeriesNet               1.148   37.602    0.997    0.708   1.946

Discount Factors
Lasso                   0.528   26.079    0.390    0.242   1.082
Gaussian Process        0.776   25.090    0.753    0.472   1.382
Multilayer Perceptron   0.706   17.492    0.679    0.479   1.174
WaveNet                 0.715   36.115    0.647    0.213   1.514
SeriesNet               0.895   43.517    0.752    0.449   2.165

Volatilities
Lasso                   1.164   154.655   0.651    0.336   6.895
Gaussian Process        1.032   37.210    1.008    0.391   1.902
Multilayer Perceptron   1.076   57.371    0.914    0.412   3.139
WaveNet                 1.653   114.454   1.200    0.752   6.408
SeriesNet               1.289   66.177    1.045    0.615   3.279

Table G.2: The relative change in VaR (RDVaR) for all models and asset classes on use case two. Numbers depicted in bold are the best for that statistic, while numbers with underscore are the worst.

Model                   mean (%)   std (‰)   median (%)   min (%)   max (%)

Futures
NNI                     -12.993    61.400    -12.221      -25.763   -4.670
Lasso                   -6.743     52.019    -5.809       -14.629   2.300
Gaussian Process        -12.067    56.853    -10.616      -25.036   -4.670
Multilayer Perceptron   -7.295     46.660    -5.774       -17.780   -0.304
WaveNet                 -10.540    58.910    -9.662       -24.165   -2.604
SeriesNet               -8.323     70.641    -6.621       -24.338   3.026

FX Rates
NNI                     -9.982     71.777    -6.907       -22.931   -1.813
Lasso                   -6.253     89.189    -1.764       -22.931   1.434
Gaussian Process        -9.000     61.341    -6.907       -19.130   -0.704
Multilayer Perceptron   -6.969     85.939    -2.085       -22.931   -0.115
WaveNet                 -4.748     58.300    -2.338       -13.932   1.245
SeriesNet               -6.326     68.444    -4.437       -18.071   1.770

Discount Factors
NNI                     -6.593     60.483    -5.179       -15.517   -0.017
Lasso                   -6.592     60.479    -5.179       -15.516   -0.017
Gaussian Process        -6.508     60.362    -4.925       -15.517   -0.017
Multilayer Perceptron   -6.593     60.483    -5.179       -15.517   -0.017
WaveNet                 1.774      116.643   -1.344       -7.670    26.781
SeriesNet               7.271      133.567   7.012        -12.735   27.833

Volatilities
NNI                     -9.111     26.789    -9.080       -13.220   -4.667
Lasso                   -6.653     31.473    -7.476       -10.479   -2.046
Gaussian Process        -9.111     26.789    -9.080       -13.220   -4.667
Multilayer Perceptron   -7.496     35.554    -6.986       -13.050   -2.246
WaveNet                 -2.597     49.823    -2.638       -10.522   4.826
SeriesNet               0.849      42.854    -0.004       -5.120    7.160

Table G.3: The relative change in ES (RDES) for all models and asset classes on use case two. Numbers depicted in bold are the best for that statistic, while numbers with underscore are the worst.

Model                   mean (%)   std (‰)   median (%)   min (%)   max (%)

Futures
NNI                     -13.347    81.964    -11.275      -28.649   -4.598
Lasso                   -7.319     65.764    -6.127       -19.357   3.264
Gaussian Process        -13.069    80.063    -11.239      -28.649   -4.598
Multilayer Perceptron   -8.797     62.263    -6.681       -21.288   -1.717
WaveNet                 -11.097    80.060    -9.699       -26.955   0.537
SeriesNet               -8.557     88.679    -5.796       -26.470   5.614

FX Rates
NNI                     -8.279     44.403    -6.153       -15.810   -3.390
Lasso                   -4.939     64.678    -2.149       -15.810   1.209
Gaussian Process        -7.909     45.924    -6.153       -15.304   -2.048
Multilayer Perceptron   -5.467     63.008    -0.727       -15.810   -0.425
WaveNet                 -2.441     52.862    -0.953       -9.165    4.588
SeriesNet               -5.642     46.745    -3.431       -12.357   0.522

Discount Factors
NNI                     -11.754    79.398    -9.322       -28.268   -4.801
Lasso                   -11.732    79.557    -9.322       -28.262   -4.676
Gaussian Process        -11.741    79.420    -9.283       -28.268   -4.801
Multilayer Perceptron   -11.754    79.398    -9.322       -28.268   -4.801
WaveNet                 -1.763     165.702   -7.301       -14.021   34.079
SeriesNet               1.169      201.050   -6.240       -16.046   43.696

Volatilities
NNI                     -8.561     27.256    -7.625       -12.640   -5.828
Lasso                   -7.418     31.883    -5.889       -12.514   -3.811
Gaussian Process        -8.561     27.256    -7.625       -12.640   -5.828
Multilayer Perceptron   -7.842     28.683    -6.722       -12.242   -5.145
WaveNet                 -3.703     56.121    -4.631       -11.064   5.817
SeriesNet               -0.137     61.526    -1.126       -8.219    10.810

Chapter H Example of Imputation

Lasso

Figure H.1: Example of the price process with the lowest and highest MASE on use case two and the corresponding distribution of the generated log returns for the Lasso. The distribution is estimated through KDE with a bandwidth of 0.5.
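The KDE smoothing used in these figures can be sketched as follows (the exact estimator in the thesis is assumed; the sample points below are illustrative): the density is an average of Gaussian bumps of width 0.5 centred on the observed log returns.

```python
import math

# Minimal Gaussian kernel density estimate with a fixed bandwidth of 0.5.
def kde(x, samples, bandwidth=0.5):
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

samples = [-1.0, 0.0, 0.0, 1.0]  # hypothetical generated log returns
grid = [-4.0 + 0.02 * i for i in range(401)]
density = [kde(x, samples) for x in grid]

# Sanity check: the estimate integrates to (approximately) one.
print(round(sum(d * 0.02 for d in density), 2))  # 1.0
```

A bandwidth of 0.5 is fairly wide, which is worth keeping in mind when comparing the smoothed densities below: sharp features of the true return distribution are deliberately blurred.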

Gaussian Process

Figure H.2: Example of the price process with the lowest and highest MASE on use case two and the corresponding distribution of the generated log returns for the GP. The distribution is estimated through KDE with a bandwidth of 0.5.

Multilayer Perceptron

Figure H.3: Example of the price process with the lowest and highest MASE on use case two and the corresponding distribution of the generated log returns for the MLP. The distribution is estimated through KDE with a bandwidth of 0.5.

WaveNet

Figure H.4: Example of the price process with the lowest and highest MASE on use case two and the corresponding distribution of the generated log returns for the WaveNet. The distribution is estimated through KDE with a bandwidth of 0.5.

SeriesNet

Figure H.5: Example of the price process with the lowest and highest MASE on use case two and the corresponding distribution of the generated log returns for the SeriesNet. The distribution is estimated through KDE with a bandwidth of 0.5.
