DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Machine Learning Based Intraday Calibration of End of Day Implied Volatility Surfaces

CHRISTOPHER HERRON

ANDRÉ ZACHRISSON

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ENGINEERING SCIENCES

Machine Learning Based Intraday Calibration of End of Day Implied Volatility Surfaces

CHRISTOPHER HERRON ANDRÉ ZACHRISSON

Degree Projects in Mathematical Statistics (30 ECTS credits)
Master's Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, year 2020
Supervisor at Nasdaq Technology AB: Sebastian Lindberg
Supervisor at KTH: Fredrik Viklund
Examiner at KTH: Fredrik Viklund

TRITA-SCI-GRU 2020:081 MAT-E 2020:044

Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci

Abstract

The implied volatility surface plays an important role for Front Office and Risk Management functions at Nasdaq and other financial institutions, which require intraday mark-to-market of their books in order to properly value their instruments and measure risk in trading activities. Given these business needs, the ability to calibrate an end of day implied volatility surface to new market information is highly sought after. In this thesis a statistical learning approach is used to calibrate the implied volatility surface intraday. This is done by combining OMXS30 implied volatility surface data from 2019 with market information from close to at the money options and feeding it into three Machine Learning models. The models, a Feed Forward Neural Network, a Recurrent Neural Network and a Gaussian Process, were compared based on optimal input and data preprocessing steps. When comparing the best Machine Learning model to the benchmark, the performance was similar, indicating that the calibration approach did not offer much improvement. However, the calibrated models had a slightly lower spread and average error than the benchmark, indicating that there is potential in using Machine Learning to calibrate the implied volatility surface.

Sammanfattning

Implied volatility surfaces are an important tool for front office and risk management functions at Nasdaq and other financial institutions that need to revalue their derivative portfolios intraday, but also to measure risk in trading. Based on these business needs, it is sought after to be able to calibrate the implied volatility surfaces created at the end of the day to new market information on the following day. In this thesis, statistical learning is used to calibrate these surfaces. This is done by using historical surfaces from options on OMXS30 during 2019, in combination with options close to at the money, to train three Machine Learning models. The models include a Feed Forward Neural Network, a Recurrent Neural Network and a Gaussian Process, which were further compared based on data preprocessed in different ways. The best Machine Learning model was compared to a benchmark consisting of using the previous day's surface, where the result did not represent any major improvement. At the same time, the model had a lower spread and average error compared to the benchmark, which indicates that there is potential in using Machine Learning to calibrate these surfaces.

Acknowledgements

We would like to express our gratitude towards our examiner Fredrik Viklund at the Department of Mathematics at the Royal Institute of Technology for his feedback regarding our thesis. We would also like to thank our supervisor at Nasdaq, Sebastian Lindberg, for his support, feedback and genuine interest in our project. Also a big thanks to our manager Henrik Hedlund, who made sure that we had the data required for this project, and David White, who introduced us to the subject.

Contents

List of Figures ...... i

List of Tables ...... iii

Acronyms ...... iv

1 Introduction ...... 1
  1.1 Problem Setting ...... 2
  1.2 Previous Work ...... 3
  1.3 Thesis Outline ...... 3

2 Financial Background ...... 4
  2.1 Options ...... 4
    2.1.1 Put and Call Options ...... 4
    2.1.2 Spreads and Arbitrage ...... 5
    2.1.3 Black Scholes Options Pricing ...... 8
    2.1.4 Option Risks ...... 11
  2.2 Volatility ...... 12
    2.2.1 Implied Volatility ...... 12
    2.2.2 Constructing an Implied Volatility Surface ...... 13

3 Mathematical Background ...... 15
  3.1 Statistical Learning Theory ...... 15
    3.1.1 The Loss Function ...... 16
    3.1.2 Model Selection: The Bias Variance Dilemma ...... 18
    3.1.3 Regularisation ...... 19
    3.1.4 Dimension Reduction ...... 20
  3.2 Artificial Neural Networks ...... 22
    3.2.1 Feed Forward Neural Network ...... 22
    3.2.2 Recurrent Neural Networks ...... 23
    3.2.3 Training a Neural Network ...... 25
    3.2.4 Choosing a Network Architecture ...... 29
    3.2.5 Neural Network Drawbacks ...... 30
  3.3 The Gaussian Process ...... 31
    3.3.1 Gaussian Process Regression ...... 31
    3.3.2 Gaussian Process Model Selection ...... 34

4 Methods ...... 36
  4.1 Statistical Learning Approach ...... 36
  4.2 The Data ...... 36
    4.2.1 Implied Volatility Data ...... 36
    4.2.2 Intraday Data ...... 38
    4.2.3 Feature Engineering ...... 38
    4.2.4 The Final Data Set ...... 39
  4.3 Hyperparameter Optimization ...... 39
    4.3.1 Optimized Model Parameters ...... 40
  4.4 Algorithm Selection ...... 41

5 Results and Analysis ...... 43
  5.1 Data Comparison ...... 43
    5.1.1 Feature Engineering ...... 43
    5.1.2 ATM Options ...... 45
  5.2 Model Comparison ...... 46
  5.3 Benchmark Comparison ...... 47

6 Discussion and Conclusion ...... 51
  6.1 Results Evaluation ...... 51
    6.1.1 Data Comparison ...... 51
    6.1.2 Model Comparison ...... 52
    6.1.3 Benchmark Comparison ...... 52
  6.2 Conclusion ...... 52
  6.3 Future Work ...... 53

References 55

Appendices ...... a
  A ...... a
    A.1 List of OMXS30 listed companies 2019 ...... a
    A.2 Feature Engineering ...... b

List of Figures

2.1 Example of premiums for call and put options with 30 days until expiration ...... 5
2.2 Profit based on stock price for a given vertical spread ...... 6
2.3 Profit based on stock price for a given butterfly spread ...... 7
2.4 Profit based on stock price for a given calendar spread ...... 8
2.5 Example of an Implied Volatility Surface ...... 13

3.1 Illustration of bias-variance trade-off ...... 19
3.2 Comparison of training and validation error ...... 20
3.3 Single hidden layer Feed Forward Neural Network architecture ...... 22
3.4 Example of recurrence in a computational graph ...... 24
3.5 Comparison of path of descent for Gradient Descent (red) and Stochastic Gradient Descent (blue), where f(0, 0) minimises the function ...... 27
3.6 Computational graph for a single hidden layer Feed Forward Neural Network ...... 28
3.7 Gaussian Process Regression of the function f(x) = x · sin(x) ...... 33

4.1 Comparison of number of unique options observed per trading day using only traded or traded and order data...... 37

5.1 Comparison of Root Mean Squared Error and Mean Absolute Error between the three different approaches to preprocess the input data ...... 44
5.2 Comparison between using three, six or nine closest at the money options in the intraday data per model in terms of Root Mean Squared Error and Mean Absolute Error using Principal Component Analysis data ...... 45
5.3 Comparison of Root Mean Squared Error and Mean Absolute Error between the different models for the selected optimal dataset ...... 46
5.4 Comparison to see if either model is closer to the previous or future end of day implied volatility surface in terms of Root Mean Squared Error and Mean Absolute Error for the optimal data set ...... 47
5.5 Comparison of sample Root Mean Squared Error and Mean Absolute Error between the Benchmark and Gaussian Process for the selected optimal dataset ...... 48


5.6 Comparison of Absolute Error for the implied volatility surface based on time to maturity and log moneyness ...... 49
5.7 3D Comparison of the Root Mean Squared Error for the implied volatility surface based on time to maturity and log moneyness ...... 49
5.8 3D Comparison of the Mean Absolute Error for the implied volatility surface based on time to maturity and log moneyness ...... 50

A.1 Comparison of Root Mean Squared Error and Mean Absolute Error between the three different approaches to preprocess the input data ...... b
A.2 Comparison of Root Mean Squared Error and Mean Absolute Error between the three different approaches to preprocess the input data ...... b

List of Tables

4.1 Vol1-Vol5 is the implied volatility at 5 different times to maturity for the lowest log moneyness (-0.4), Vol6-Vol10 is the implied volatility at 5 different times to maturity for the second lowest log moneyness (-0.3). Following the same procedure for the remaining implied volatilities, Vol41-Vol45 is the IV at 5 different times to maturity for the highest log moneyness (0.4) ...... 37
4.2 The following features are stored for n at the money options ...... 38
4.3 The following features are stored independent of n at the money options ...... 38
4.4 Example of model training sample ...... 39
4.5 Neural Network hyperparameter grids ...... 40
4.6 Feed Forward Neural Network hyperparameters ...... 40
4.7 Long Short-Term Memory Network hyperparameters ...... 41
4.8 Gaussian Process hyperparameters ...... 41

5.1 Mean Root Mean Squared Error. A blue colored cell indicates the overall best score for a given model ...... 44
5.2 Mean Mean Absolute Error. A blue colored cell indicates the overall best score for a given model ...... 44
5.3 Mean Root Mean Squared Error and Mean Absolute Error for Principal Component Analysis processed data. A blue colored cell indicates the overall best score for a given model ...... 45
5.4 Mean Root Mean Squared Error and Mean Absolute Error for Principal Component Analysis processed data and 9 at the money options intraday data. A blue colored cell indicates the overall best score ...... 46
5.5 Mean Root Mean Squared Error and Mean Mean Absolute Error for Principal Component Analysis processed data with 9 at the money options intraday data and Benchmark results, i.e. using the previous day IVS. A blue colored cell indicates the overall best score ...... 48

Acronyms

AE Absolute Error.
ANN Artificial Neural Network.
ATM at the money.

BDT Bayesian Decision Theory.
BSM Black-Scholes-Merton.

CIID conditionally independent and identically distributed.
CV Cross Validation.

DNN Deep Neural Network.
DP Dot Product.

EOD end of day.

FNN Feed Forward Neural Network.
FX foreign exchange.

GD Gradient Descent.
GP Gaussian Process.

ITM in the money.
IV implied volatility.
IVS implied volatility surface.

LSTM Long Short-Term Memory.


MAE Mean Absolute Error.
ML Machine Learning.
MSE Mean Squared Error.

NN Neural Networks.

OTM out of the money.

PCA Principal Component Analysis.

RBF Radial Basis Function.
ReLu Rectified Linear Unit.
RMSE Root Mean Squared Error.
RNN Recurrent Neural Network.
RQ Rational Quadratic.

SGD Stochastic Gradient Descent.
SVD Singular Value Decomposition.
tanh Hyperbolic Tangent Function.
TTM time to maturity.

VaR value at risk.
w.r.t with respect to.


Chapter 1

Introduction

The financial derivatives market has been a hot topic ever since the financial crisis in 2008, and there has been much speculation about whether the derivatives markets were bound to decline and vanish due to regulation [1]. Despite the regulations, the exchange traded derivatives market has grown by 71% since the financial crisis [2]. The derivatives markets play an important role in the financial system: they let participants share the price risk of commodities and as a consequence effectively reduce the risk of price volatility for producers. They also enable hedging and risk management, although they introduce unpredictable dynamics in the financial markets as well. It is well established that derivatives markets enable reduction and redistribution of risk, and it has also been shown that they stabilize prices and enable price discovery [3]. Options are a type of financial derivative instrument commonly used for hedging or speculation by market participants. An option gives the holder the right to buy or sell some underlying asset for a specific price (the strike price) at a specific expiration date, and is commonly categorized into American and European style. An American option can be exercised before expiry, in contrast to European style options which can only be exercised at expiration. In 1973 Fischer Black, Myron Scholes and Robert C. Merton developed a mathematical framework for European style option pricing, the famous Black-Scholes-Merton (BSM) model. The BSM model prices options, under the no arbitrage assumption, based on an underlying stock price, strike price, dividends, risk free rate, expiration date and volatility. The only unknown or unobserved parameter in the model is the volatility. Since the volatility cannot be directly observed, the BSM model is often used as a map from observed option prices to the unobserved volatility, known as the implied volatility (IV).
The higher the volatility, the higher the risk for the seller, resulting in higher option premiums. This means that if an option's price increases with all other parameters unchanged, the expected future fluctuations of the underlying stock are considered to be higher, which results in a higher implied volatility. On the other hand, if the option price

decreases, and all other parameters are unchanged, the implied volatility consequently decreases as the market expects lower price fluctuations for the underlying asset. All models make assumptions about the dynamics of the problem and the BSM model is no different in that sense. It is well known that the assumptions made in the BSM model are far from realistic, but it is still used by many financial institutions in practice due to its simplicity, fast computation and the ability to estimate the implied volatility. The IV is often presented in relation to the logarithm of the strike price divided by the underlying price (log moneyness) and the time to maturity (TTM), which is the time from the current date until the expiration date. This representation is referred to as the implied volatility surface (IVS), and considerable effort has been spent on investigating how to create the surface more efficiently, which includes avoiding arbitrage opportunities that may occur when the wrong measure of implied volatility is used when pricing options.

1.1 Problem Setting

The implied volatility surface plays an important role for Front Office and Risk Management functions at Nasdaq and other financial institutions, as they require intraday mark-to-market of derivative books in order to properly value instruments, but also to measure risk in trading activities. Nasdaq has implemented end of day (EOD) IVS generation for equity options and is looking for ways to further improve its current IVS generation. Developing a method to accurately calibrate the EOD IVS intraday to new market information would significantly support the mentioned business objective. Calibrating these surfaces intraday is difficult owing to a lack of synchronicity and/or absence of market quotes and liquidity of option prices. The reason for the lack of synchronicity is that option quotes across chains do not change in lockstep, as an arbitrage-free volatility surface dictates. To expand on current methods used for intraday surface calibration to market, Nasdaq is actively investigating novel technologies and frameworks such as Machine Learning to achieve a more accurate calibration of the IVS, to better support the Front Office and Risk Management teams, and to prevent arbitrage opportunities. This leads to an interesting area of research where the goal is to use a Machine Learning approach for calibrating the EOD market-implied volatility surface to market conditions intraday, accounting for changes in both the underlying stock and option prices. The research focuses on Machine Learning techniques as an alternative to current rudimentary methods such as sticky delta for the underlying stock price and parallel shifts from changes to at the money (ATM) option prices. This thesis aims to answer the following question:

• Can Machine Learning based intraday calibration yield a closer estimate of the EOD implied volatility surface than the previous IVS?


1.2 Previous Work

Machine Learning has been applied to multiple areas within finance, including option pricing and the creation of implied volatility surfaces. Multiple studies have applied different types of Machine Learning methods to create the IVS, where some focus on calculation speed at the cost of accuracy [4]. There have also been attempts at both online learning approaches [5], which take a stream of data and update model parameters accordingly, and more static approaches to create the surface. The literature also contains attempts to reduce arbitrage opportunities by imposing specific constraints in the loss function when training Neural Networks [6]. In addition, there is research on directly mapping option prices with Machine Learning to improve performance, where a Bayesian approach incorporating prior knowledge about the structure of risk neutral pricing shows promising results in predicting deep ITM and OTM options [7].

1.3 Thesis Outline

Chapter 2 presents the financial background on options and volatility. This includes option types, spreads, risks, the Black-Scholes-Merton formula and how implied volatility is used and represented. Chapter 3 presents the mathematical background, covering statistical learning theory, Neural Networks and Gaussian Processes. Chapter 4 explains the methods of the study, including how the data was used and how the models were selected and evaluated. Chapter 5 presents the results and analysis for the data selection and model evaluation. Chapter 6 discusses the results and analysis, and also includes future work and the thesis conclusion.

Chapter 2

Financial Background

Options are an advanced trading alternative that can come with high risk for the seller due to the stochastic nature and incomplete information of the market. Being able to manage the risk and avoid arbitrage opportunities is vital to option sellers, and a great amount of research has been conducted in the area. To understand where and why risk and arbitrage arise when pricing options, a basic understanding of options and volatility is required. This chapter covers the basics of these areas of finance to aid in understanding the mathematical approach presented in later chapters.

2.1 Options

An option is a type of contract that allows the holder to expose themselves to future volatility, where volatility refers to a measure of uncertainty about the future price of the underlying asset. All options are time based and have an expiration date, where the time between the current date and the expiration date is known as the time to maturity. There are different styles of options, the main categories being European and American style: European options can only be exercised on the expiration date, while American options can be exercised at any time before or on the expiration date. There are also different types of options in terms of the underlying asset, such as index, commodity, foreign exchange (FX) and equity options. One of the most common types is the equity option, a derivative whose payoff depends on an underlying security, e.g. the underlying stock price on which the option is based.

2.1.1 Put and Call Options

Commonly there are two varieties of options being traded, put options and call options. A put option gives the contract owner the right (option), but not the obligation, to sell the underlying asset at a given price known as the strike price. This means that put options can be used as a type of insurance (hedging) to minimize potential losses, or as

an opportunity to buy low and sell high. A call option on the other hand gives the owner the right to buy the underlying asset at the strike price. Buying a call option implies a belief that the underlying asset will increase in value, allowing the buyer to purchase the stock below the market price. Depending on how the market moves, some options might not be worth exercising as they are non profitable, also known as out of the money (OTM). In other scenarios the options will be profitable and thus in the money (ITM), and lastly, if the option strike price is the same as the underlying asset price, the option is called at the money. The further the strike price is from the stock price, the more the option premium increases for ITM options and decreases for OTM options, see figure 2.1.

Figure 2.1: Example of premiums for call and put options with 30 days until expiration.

If the market fluctuates a lot, the probability that an option is ITM increases, as the probability of deviating from the original stock price increases. This of course means an increased risk for the option seller and results in higher option premiums [8].

2.1.2 Option Spreads and Arbitrage

In the context of the IVS, arbitrage means that options are priced such that buying and selling specific spreads of options results in risk free returns. There are different types of arbitrage, and typically the ones to avoid are the negative vertical, negative butterfly and negative calendar spreads [9].

The Vertical Spread

A vertical spread is the most basic spread and is considered to be the building block for other types of spreads. The idea is to buy and sell options at different strike prices but with the same time to maturity T in order to minimize potential risk. The vertical spread strategy can be used for both put and call options, where the profit for the latter

is defined as,

P(S) = C(T, K1) − C(T, K2) + max(0, S − K2) − max(0, S − K1),   (2.1)

where C(T, Ki) are the call option premiums, S the underlying stock price and Ki the different strike prices. As an example, consider an underlying stock S currently traded at 90. A vertical spread could be selling a call option C(T, K1) = 3 at strike price K1 = 100 and buying a call option C(T, K2) = 1.25 at strike price K2 = 105. Based on equation 2.1, this gives the potential profit as a function of the stock price seen in figure 2.2.

Figure 2.2: Profit based on stock price for a given vertical spread.

As seen in the graph, three different scenarios can occur:

1. S < K1: the profit is the value of the spread, P(S) = C(T, K1) − C(T, K2) > 0.

2. K1 < S < K2: the profit is P(S) < 0 if C(T, K1) − C(T, K2) < max(0, S − K1), and P(S) > 0 if C(T, K1) − C(T, K2) > max(0, S − K1).

3. S > K2: the profit is P(S) < 0.

In the third scenario the loss would be even higher if the $105 call option had not been purchased to reduce exposure, which is why this is a common strategy among traders. Negative vertical spread arbitrage occurs if the strategy guarantees that P(S) ≥ 0 ∀S.
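The profit in equation 2.1 and the three scenarios above can be sketched in a few lines of code; the function name is my own, and the premiums and strikes are the example values from the text, not market data.

```python
# Vertical spread profit per equation 2.1: sell a call at strike K1
# (collect premium C1), buy a call at strike K2 (pay premium C2).
def vertical_spread_profit(S, K1, K2, C1, C2):
    return C1 - C2 + max(0.0, S - K2) - max(0.0, S - K1)

# Example from the text: stock at 90, sell C(T, 100) = 3, buy C(T, 105) = 1.25.
print(vertical_spread_profit(90, 100, 105, 3.0, 1.25))   # scenario 1 (S < K1): 1.75
print(vertical_spread_profit(102, 100, 105, 3.0, 1.25))  # scenario 2 (K1 < S < K2): -0.25
print(vertical_spread_profit(110, 100, 105, 3.0, 1.25))  # scenario 3 (S > K2): -3.25
```

For S > K2 the loss is capped at C1 − C2 − (K2 − K1), which is exactly the point of buying the higher-strike call.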

The Butterfly Spread

A butterfly spread is a strategy that trades a total of four options with the same time to maturity T and can be used for both put and call options. The idea is to buy two options at different strike prices K1 and K2 and then sell two options at the strike K3 = (K1 + K2)/2, where the profit based on the stock price is given by,

P(S) = 2C(T, K3) − C(T, K1) − C(T, K2) + max(0, S − K1) + max(0, S − K2) − 2 max(0, S − K3).   (2.2)

Consider the example where an underlying stock S is currently traded at 50 and C(T, K1) = 8, K1 = 45, C(T, K2) = 3.75, K2 = 55, C(T, K3) = 5.50 and K3 = 50. The potential profit is given by the graph in figure 2.3.

Figure 2.3: Profit based on stock price for a given butterfly spread.

Based on the figure two scenarios can be defined:

1. If S < K1 − (2C(T, K3) − C(T, K1) − C(T, K2)) or S > K2 + (2C(T, K3) − C(T, K1) − C(T, K2)), then P(S) < 0.

2. Otherwise P(S) ≥ 0.

The strategy has a narrow window with limited profit for movements both up and down from the at the money strike price K3. Negative butterfly arbitrage occurs if the profit P(S) ≥ 0 ∀S for a given combination of strike prices Ki and premiums C(T, Ki).
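The tent-shaped payoff of equation 2.2 can be checked numerically with the example values from the text (function name and breakeven point are illustrative, derived from those example premiums):

```python
# Butterfly spread profit per equation 2.2: buy calls at K1 and K2,
# sell two calls at the midpoint strike K3 = (K1 + K2) / 2.
def butterfly_profit(S, K1, K2, C1, C2, C3):
    K3 = (K1 + K2) / 2
    return (2 * C3 - C1 - C2
            + max(0.0, S - K1) + max(0.0, S - K2)
            - 2 * max(0.0, S - K3))

# Example from the text: S traded at 50, C1 = 8 (K1 = 45),
# C2 = 3.75 (K2 = 55), C3 = 5.50 (K3 = 50).
print(butterfly_profit(50, 45, 55, 8.0, 3.75, 5.5))     # peak of the tent: 4.25
print(butterfly_profit(40, 45, 55, 8.0, 3.75, 5.5))     # far from K3: -0.75 (the net debit)
print(butterfly_profit(45.75, 45, 55, 8.0, 3.75, 5.5))  # lower breakeven: 0.0
```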

The Calendar Spread

The calendar spread differs from the previous two in that the options traded share the same strike price instead of the same TTM. The method can be used for both put and call options; the idea is to sell, for instance, one call option with TTM T1 and buy one call option with TTM T2, where T1 < T2 and K = K1 = K2. The belief when using a calendar spread is that the stock price will remain close to K. The potential profit is, like for the butterfly spread, tent shaped, as seen in figure 2.4.


Figure 2.4: Profit based on stock price for a given calendar spread.

The maximum loss with this strategy is P(S) = C(T1, K) − C(T2, K), where C(T, K) is the call option price. If C(T1, K) > C(T2, K), negative calendar arbitrage is present.

The Put Call Parity

Another important aspect of arbitrage free option pricing is the put call parity, which describes the relationship between the prices of put and call options when there is no arbitrage. The put-call parity states that

P(t, K, T) + S(t) = C(t, K, T) + K e^{-r(T-t)},   (2.3)

where P(t, K, T) is the price of a put option at time t with strike price K and time to maturity T, S(t) is the stock price, C(t, K, T) the price of the call option and r is the risk free rate. If the equation is not satisfied there are arbitrage opportunities.
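Equation 2.3 can be rearranged to recover a put price from a call quote (or to test a quote pair for a parity violation). A minimal sketch; the function name and numbers are illustrative, not market data:

```python
import math

# Put price implied by put-call parity (equation 2.3):
# P = C + K * exp(-r * (T - t)) - S
def put_from_parity(C, S, K, r, T, t=0.0):
    return C + K * math.exp(-r * (T - t)) - S

# Illustrative values: a one-year at the money call quoted at 10.45,
# spot 100, strike 100, risk free rate 5%.
P = put_from_parity(10.45, S=100.0, K=100.0, r=0.05, T=1.0)
print(round(P, 4))  # ≈ 5.5729
```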

2.1.3 Black Scholes Options Pricing

When the initial Black-Scholes-Merton formulation for European option pricing was published it was considered revolutionary and earned its inventors a Nobel Prize. The formula can be derived by considering the present value of a call option at time t0 as the discounted expected value, under the risk neutral probability measure, of the payoff at maturity T,

C(t0) = e^{-rT} E[C(T)],   (2.4)

where r is the risk free interest rate, which corresponds to the value under the risk-neutral assumption. Using the definition of the payoff for a call option at maturity T,

the present value is defined as

C(t0) = e^{-rT} E[max(S(T) − K, 0)],   (2.5)

where S(T) is the stock price at maturity T and K is the strike price. The stock price can be seen as a Geometric Brownian Motion if it is assumed that the returns of a stock follow a normal distribution and are independent over short time intervals. The stock price can be defined as

S_T = (S_T / S_{T-1})(S_{T-1} / S_{T-2}) ··· (S_2 / S_1)(S_1 / S_0) S_0,   (2.6)

which can be altered to,

S_T / S_0 = ∏_{i=1}^{T} S_i / S_{i-1}.   (2.7)

Applying a natural log gives,

ln(S_T / S_0) = ∑_{i=1}^{T} ln(S_i / S_{i-1}),   (2.8)

where the sum of independent random variables will be normally distributed by the central limit theorem. Furthermore, it is implied that the stock price is log normally distributed, as the return is approximately the change in the log stock price,

ln(S_T) − ln(S_{T-1}) = ln(S_T / S_{T-1}) ≈ (S_T − S_{T-1}) / S_{T-1}.   (2.9)

The stock price can thus be described by the stochastic differential equation,

dS_t / S_t = r dt + σ dB(t),   (2.10)

where B(t) is a Brownian Motion under the risk neutral probability measure and σ is the volatility. It has the solution

S(T) = S(t0) e^{(r − σ²/2)T + σ√T Z},   (2.11)

where Z follows a standard normal distribution.
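Equation 2.11 can be checked by simulation: draws of Z give terminal prices whose log averages to (r − σ²/2)T and whose discounted mean recovers S(t0). A minimal sketch with illustrative parameters (function name and values are my own):

```python
import math
import random

def simulate_terminal_prices(S0, r, sigma, T, n, seed=42):
    """Draw n terminal prices S(T) per equation 2.11 under the
    risk neutral measure, where Z is standard normal."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * T
    vol = sigma * math.sqrt(T)
    return [S0 * math.exp(drift + vol * rng.gauss(0.0, 1.0)) for _ in range(n)]

S0, r, sigma, T = 100.0, 0.05, 0.2, 1.0
prices = simulate_terminal_prices(S0, r, sigma, T, n=100_000)

# ln(S(T)/S0) should average to (r - sigma^2 / 2) * T = 0.03 ...
mean_log = sum(math.log(p / S0) for p in prices) / len(prices)
# ... and the discounted mean price should be close to S0 (martingale property).
disc_mean = math.exp(-r * T) * sum(prices) / len(prices)
print(round(mean_log, 3), round(disc_mean, 1))
```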


The assumption that stock prices are lognormal is convenient when evaluating the expected value of a call option, since the value depends on whether the option is ITM. The probability that a call option is ITM is then

P(S(T) ≥ K) = P(S(t0) e^{(r − σ²/2)T + σ√T Z} ≥ K)
            = P(Z ≥ [ln(K / S(t0)) − (r − σ²/2)T] / (σ√T))
            = {d2 = −[ln(K / S(t0)) − (r − σ²/2)T] / (σ√T)}
            = P(Z ≥ −d2).   (2.12)

This gives the value of the call option payoff at maturity,

max[S(T) − K, 0] = S(t0) e^{(r − σ²/2)T + σ√T Z} − K,  if Z ≥ −d2,
                   0,  otherwise.   (2.13)

By inserting equation 2.13 in 2.5 the present value is,

C(t0) = e^{-rT} E[max(S(t0) e^{(r − σ²/2)T + σ√T Z} − K, 0)]
      = e^{-rT} ∫_{−d2}^{∞} [S(t0) e^{(r − σ²/2)T + σ√T x} − K] (1/√(2π)) e^{−x²/2} dx
      = (e^{-rT} / √(2π)) ∫_{−d2}^{∞} S(t0) e^{(r − σ²/2)T + σ√T x} e^{−x²/2} dx − (K e^{-rT} / √(2π)) ∫_{−d2}^{∞} e^{−x²/2} dx.   (2.14)

By solving the integrals the Black-Scholes-Merton formula for the present value of a call option is given as,

C(t0) = S(t0) N(d1) − K e^{-rT} N(d2),   (2.15)

where d1 = [ln(S(t0)/K) + (r + σ²/2)T] / (σ√T), d2 = d1 − σ√T, and N(·) is the standard normal cumulative distribution function. The present value of a put option can be calculated by using the put call parity, see equation 2.3, and is defined as

P(t0) = K e^{-rT} N(−d2) − S(t0) N(−d1).   (2.16)

The formulations above are for European options, but there are variants specifically designed for American style options. However, pricing American options is mathematically much harder.
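Equations 2.15 and 2.16 translate directly into code; the sketch below uses only the standard library (math.erf for the normal CDF) and illustrative parameter values of my own choosing:

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function N(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bsm_call(S0, K, r, sigma, T):
    """Black-Scholes-Merton call price, equation 2.15."""
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S0 * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

def bsm_put(S0, K, r, sigma, T):
    """Black-Scholes-Merton put price, equation 2.16."""
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return K * math.exp(-r * T) * norm_cdf(-d2) - S0 * norm_cdf(-d1)

# Illustrative one-year at the money option: S0 = K = 100, r = 5%, sigma = 20%.
C = bsm_call(100.0, 100.0, 0.05, 0.2, 1.0)
P = bsm_put(100.0, 100.0, 0.05, 0.2, 1.0)
print(round(C, 4), round(P, 4))  # the pair satisfies put-call parity (eq. 2.3)
```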


2.1.4 Option Risks

At financial institutions risk is measured at different levels. At the front desk, individual traders hedge their risks by making sure exposures to individual markets are not too large. Measuring only the individual risk taken by the traders is not sufficient, however, and risk management teams aggregate the risk and evaluate whether the overall risk is acceptable in terms of capital adequacy and other regulatory compliance. On an individual basis, each trader needs to monitor how sensitive their portfolio is to changes in market conditions. The most important measures that a trader needs to monitor in terms of options are called the Greeks. The Greeks for a call option, priced using the Black-Scholes-Merton formula, can be derived from equation 2.15,

Δ = ∂C/∂S = N(d1)   (2.17)

Γ = ∂²C/∂S² = N′(d1) / (σ S(t) √(T − t))   (2.18)

ν = ∂C/∂σ = S(t) √(T − t) N′(d1)   (2.19)

Θ = −∂C/∂T = −r K e^{−r(T−t)} N(d2) − S(t) N′(d1) σ / (2√(T − t))   (2.20)

where N′(d1) = (1/√(2π)) e^{−d1²/2}. Delta, equation 2.17, describes how sensitive an instrument is to a change in the value of the underlying asset or, in a mathematical sense, the partial derivative of the value of the instrument w.r.t the spot level of the underlying. Since options are nonlinear products, the second partial derivative, equation 2.18, called gamma, is also of interest to a trader. Furthermore, since the value of an option does not depend only on the spot level of the underlying, traders are also interested in how sensitive the value of an option is to a change in the volatility of the underlying, equation 2.19, which is called vega. The value of an option decays with time, and traders are therefore interested in how much the value will drop as time passes, equation 2.20, which is called theta. In essence, the Greeks are measures of how sensitive the value of the option is, and if traders feel that they are exposing themselves to high risk in terms of a change in one of the above parameters, they can easily hedge the risk in terms of that Greek [10]. The Greeks provide important measures of risk for each individual trader, but as financial institutions are exposed to many different markets, each with its own variables, the total number of risk measures for a financial institution can be in the order of hundreds or thousands. The individual measures do not provide information about the total exposure. Therefore, value at risk (VaR) was created as an attempt to provide one single number that represents the worst case scenario, which is needed since the number of specific market variables is too extensive. VaR aims to make a statement

of the following form: ”I am X percent certain there will not be a loss of more than V dollars in the next N days.” In order to calculate the VaR of the current trading book or portfolio, the risk manager needs to re-value the options held in the book, which is done by using the EOD implied volatility surface and the pricing model for the type of option in question. The two main methods to estimate VaR are to either perform a historical simulation of the portfolio or to use a model based approach. Assume that the manager wants to know the VaR for tomorrow. The historical simulation approach is to look at the daily loss for the N most recent trading days, use the percentage losses between these trading days to build a loss distribution, and then calculate the 1 − α percentile of the loss in order to estimate the VaR. The second option is the model based approach, which for individual assets means that one needs to assume a distribution of the change in value of the asset and from that calculate the VaR. Similarly, for more complex portfolios, Monte Carlo simulations can be made to generate the distribution of the change in value in order to calculate the VaR [10].
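The closed-form Greeks in equations 2.17 to 2.20 can be sketched as below; as a sanity check, the analytic delta is compared against a central finite difference of the call price. Function names and parameter values are illustrative.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def bsm_call(S, K, r, sigma, tau):
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return S * norm_cdf(d1) - K * math.exp(-r * tau) * norm_cdf(d2)

def bsm_greeks(S, K, r, sigma, tau):
    """Delta, gamma, vega and theta of a call, equations 2.17-2.20 (tau = T - t)."""
    sq = math.sqrt(tau)
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * sq)
    d2 = d1 - sigma * sq
    delta = norm_cdf(d1)
    gamma = norm_pdf(d1) / (sigma * S * sq)
    vega = S * sq * norm_pdf(d1)
    theta = -r * K * math.exp(-r * tau) * norm_cdf(d2) - S * norm_pdf(d1) * sigma / (2.0 * sq)
    return delta, gamma, vega, theta

S, K, r, sigma, tau = 100.0, 100.0, 0.05, 0.2, 1.0
delta, gamma, vega, theta = bsm_greeks(S, K, r, sigma, tau)

# Finite-difference check of equation 2.17: delta ≈ (C(S + h) - C(S - h)) / (2h).
h = 0.01
fd_delta = (bsm_call(S + h, K, r, sigma, tau) - bsm_call(S - h, K, r, sigma, tau)) / (2 * h)
print(round(delta, 4), round(fd_delta, 4))
```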

2.2 Volatility

The fluctuations in the market are known as volatility, which is usually defined as the standard deviation of the return during a given period. Since options are derivatives on an underlying asset, the option premiums are greatly affected by the volatility of the asset. This is why a common view of options is that they are a bet on the future volatility of the underlying asset [8]. Calculating the historical volatility of an underlying asset is easy, but not as interesting when it comes to option pricing. This is because the future volatility is not always the same as the historical one, and since an option is a contract that will be exercised in the future, the future volatility is required in order to effectively price options. One method for forecasting volatility is to use time series models; another is to see how the market values the volatility, where the latter seems to be the better estimator [11].

2.2.1 Implied Volatility Estimating volatility based on the market's predictions is referred to as implied volatility. Instead of using the Black-Scholes-Merton model to price an option, the model is inverted: the current market price of the option is taken as an input parameter, and the equation is solved for the volatility implied by the BSM option price. The resulting IV can therefore be seen as the market's belief about future volatility. A concern with BSM is that it assumes that the underlying asset prices are log normal, which implies that volatility should be the same for each strike. If BSM is instead used as a mapping from the option price to the IV fitted to market data, a skew or smile is often observed, meaning the observed distribution is not log normal. A skew is a consequence of the market not valuing OTM, ITM and ATM options similarly, more specifically when the market has an asymmetric view of the value of put and call options. For example, stock options

tend to value protection of a long stock position higher than protection against a potential rise in the stock, and as a consequence the market view of the distribution of returns is asymmetric. A smile is observed when the market puts higher premiums on OTM and ITM options than on ATM options, which implies a higher implied volatility at those strikes. The smile occurs because the market believes that the distribution of log returns has fatter tails than the normal distribution assumed in BSM. It is called a smile since the graph of the implied volatility of an option over different strikes will show higher volatility for OTM and ITM options, and the volatility curve will therefore have the shape of a smile. It is common to visualize the implied volatility as a function of time to maturity and log moneyness. This results in a 3D plot known as the implied volatility surface, see figure 2.5.

Figure 2.5: Example of an Implied Volatility Surface.

The implied volatility surface is used to measure risk and value options, and an experienced practitioner can use it to assess whether results are reasonable. Commonly the surface is constructed EOD using settlement prices since the market does not change synchronously. This, however, means that the surface is not re-calibrated during the day based on new market information.
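The inversion from option price to implied volatility can be sketched in a few lines. The following is a minimal illustration, assuming Python with only the standard library; the contract parameters and the flat rate are arbitrary example values. It recovers the implied volatility of a European call by bisection, using the fact that the BSM call price is increasing in sigma:

```python
import math

def bs_call(S, K, T, r, sigma):
    """Black-Scholes-Merton price of a European call option."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    N = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal CDF
    return S * N(d1) - K * math.exp(-r * T) * N(d2)

def implied_vol(price, S, K, T, r, lo=1e-6, hi=5.0, tol=1e-8):
    """Invert BSM for sigma by bisection (the call price is increasing in sigma)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, T, r, mid) < price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Round-trip check: price an option at sigma = 0.2, then recover that sigma.
p = bs_call(S=100, K=105, T=0.5, r=0.01, sigma=0.2)
iv = implied_vol(p, S=100, K=105, T=0.5, r=0.01)
```

Repeating this inversion over a grid of strikes and maturities produces the raw points from which a surface such as figure 2.5 is built.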

2.2.2 Constructing an Implied Volatility Surface There exist many different models and approaches to construct an IVS end of day, where the most common approaches can be divided into two groups: indirect and direct methods. The indirect methods, such as Heston and SABR, are all driven by underlying dynamical models. The dynamical model is then


fitted to the market data in order to capture the IVS. The indirect methods are generally better founded mathematically, but tend to fit observed market data less well. Direct methods, on the other hand, define the volatility surface explicitly, such as SVI and B-Splines. The direct methods can either involve some time dependence or just capture a static representation of the surface, which can be considered as two subgroups of the direct methods [12].

Nasdaq’s Model Nasdaq currently uses a B-Spline parameterization to generate the EOD IVS, where the surface is created using a bi-directional grid in (TTM, log moneyness) that spans the IVS. The Spline parameterization relies on option prices being synchronously updated, since it aims to capture the representation of the IVS directly. Another characteristic of the current model is that it depends on arbitrage-free input data.

Chapter 3

Mathematical Background

Machine Learning, sometimes named Statistical Learning in the Statistics community, is a modern approach to making inferences and predictions from data. The roots of Machine Learning are deeply embedded in statistics, where some of its concepts and methods have been used for a long time. These ideas are now reaching their full potential as the amount of data and the need for analysis have drastically increased in parallel with improved computational power. Through this newfound interest in data analysis, newer methods such as Neural Networks have gained in popularity, and billions of dollars are spent yearly on research within the area [13]. The aim of this chapter is to give a background on the statistical learning theory behind Machine Learning and modern methods such as Artificial Neural Networks and Gaussian Processes.

3.1 Statistical Learning Theory

Statistical Learning comes from the concept of learning from data. Typically the goal is to predict some quantitative or qualitative outcome (response) based on a set of observed features. Consider the pair (X, Y) where Y is some response and X is the corresponding p-dimensional predictor. Moreover, it is assumed that there is some relationship between the two, e.g. of the form

$$Y = f(X) + \varepsilon, \qquad (3.1)$$
where f is some unknown function of X and $\varepsilon$ is an error term which is independent of X and has mean zero. The goal in this setting is to find the map f that takes the observed features and returns the prediction of the response Y. Statistical Learning can therefore be summarized as a set of approaches to estimate f in order to either predict the outcome Y or to understand the dynamics between X and Y [14].


Supervised and Unsupervised Learning Statistical Learning problems can normally be solved by two different learning approaches: supervised and unsupervised learning. A problem where both the features and the response are observed, and the aim is to predict or infer the relationship between the response and its features, can be solved by supervised learning. It is called supervised learning since the predicted response is used to measure how close the model is to the observed response, and the learning process can therefore be "supervised". In problems where the response is not observed, on the other hand, it is not possible to supervise the learning by measuring how far from or close to the observed response the model is. The aim could instead be to understand the relationship among the observed features, e.g. cluster analysis. However, sometimes it is very expensive or hard to retrieve the desired responses, and the problem then becomes a mix of the two groups, i.e. semi-supervised learning [14].

Regression and Classification In the context of supervised learning there exist roughly two types of problems, depending on whether the outcome is a quantitative or a qualitative measure. A quantitative measure is a numerical value, such as a stock price, and results in a regression task, while a qualitative measure takes a value among k classes or groups, e.g. the direction of a stock price (up, down), which is called a classification task [14].

3.1.1 The Loss Function In order to train a supervised model there has to exist some definition of an incorrect prediction and how it should be penalized. This is commonly referred to as the loss function. There are different variants of the loss function which change drastically depending on whether the problem involves classification or regression, but also depending on the problem at hand.

Consider the regression setting $Y_i = f_\theta(X_i) + \varepsilon_i$, now with conditionally independent and identically distributed (CIID) pairs $\{(Y_i, X_i)\}_{i=1}^{n}$, where $f_\theta(X_i)$ is the predicted value. As with most parametric models, the aim is to maximize the likelihood of the data given the parameter, in this case $\theta$, and base prediction entirely on $p_{Y|X,\Theta}(y|x, \theta_{ML})$. The likelihood when $X = (X_1, Y_1, \ldots, X_n, Y_n)$ is given by

$$P_{X|\Theta}(x|\theta) = \prod_{i=1}^{n} P_{Y_i|X_i,\Theta}(y_i|x_i, \theta)\, P_{X_i|\Theta}(x_i|\theta). \qquad (3.2)$$

A common assumption when considering regression is that $P_{Y|X,\theta}(y|x, \theta) \sim N(f_\theta(X), \sigma^2)$

and that the observed x do not depend on the parameter $\theta$. This gives

$$P_{X|\theta}(x|\theta) = \prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\Big\{-\frac{1}{2\sigma^2}\big(y_i - f_\theta(x_i)\big)^2\Big\}\, P_{X_i}(x_i). \qquad (3.3)$$

Typically the log likelihood is of interest for easier derivation,

$$\log P_{X|\theta}(x|\theta) = \sum_{i=1}^{n} \log\frac{1}{(2\pi\sigma^2)^{1/2}} - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\big(y_i - f_\theta(x_i)\big)^2 + \sum_{i=1}^{n} \log P_{X_i}(x_i) = C - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\big(y_i - f_\theta(x_i)\big)^2 + \sum_{i=1}^{n}\log P_{X_i}(x_i). \qquad (3.4)$$

Maximizing the log likelihood is the same as minimizing

$$L(y, f_\theta(x)) = \sum_{i=1}^{n}\big(y_i - f_\theta(x_i)\big)^2, \qquad (3.5)$$
which is commonly referred to as the squared loss. This is the most commonly used loss function for regression, especially when dividing $L(y, f_\theta(x))$ by the number of samples, giving the Mean Squared Error (MSE). A similar loss function that does not penalize outliers as much is the Mean Absolute Error (MAE), which is defined as:

$$L(y, f_\theta(x)) = \frac{1}{n}\sum_{i=1}^{n}\big|y_i - f_\theta(x_i)\big|. \qquad (3.6)$$
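The different outlier sensitivity of the squared loss (3.5) and the absolute loss (3.6) is easy to see numerically. A small sketch, assuming numpy; the residuals are made up, with one large outlier:

```python
import numpy as np

# Toy targets and predictions with one outlier: MAE penalizes it far less than MSE.
y      = np.array([1.0, 2.0, 3.0, 4.0, 50.0])
y_pred = np.array([1.1, 2.1, 2.9, 4.2, 5.0])

mse = np.mean((y - y_pred) ** 2)   # squared loss of eq. (3.5), divided by n
mae = np.mean(np.abs(y - y_pred))  # absolute loss of eq. (3.6)
```

Here the single outlier residual of 45 dominates the MSE (roughly 405) while the MAE stays at about 9, illustrating why the MAE is the more robust choice when outliers are expected.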

Bayesian Decision Theory When using a trained model to classify or regress based on some input data X, the model essentially takes some action a as the predicted value, where $\aleph$ is the set of possible actions. The resulting loss depends not only on the chosen action $a \in \aleph$, but also on an unobserved quantity Y. The loss function L(y, a) is thus interpreted as the incurred loss if action a is taken and the future outcome of Y is y. A mathematical approach such as Bayesian Decision Theory (BDT) can be used to find a so-called decision rule that defines the most suitable action for some input X. Bayesian Decision Theory is a statistical framework that tries to quantify the tradeoff between various decisions, with the aim to minimize the posterior risk, that is, the average loss for decision rule $\delta$ given the observation X = x, which is defined as

$$r(\delta|X) = \int L(y, \delta(x))\, P_{Y|X}(y|x)\, dy = E[L(y, \delta(x))\,|\,X = x], \qquad (3.7)$$

where L(y, a) is the loss function and $\delta(x)$ is the decision based on the input. By definition, if $\delta_0$ is a decision rule such that for all x, $r(\delta_0|x) < \infty$, and for all x and all decision rules $\delta$, $r(\delta_0|x) \le r(\delta|x)$, then $\delta_0$ is called a formal Bayes rule. When minimizing the posterior risk with the squared loss $L(y, a) = (y - a)^2$ as the loss function, the following decision rule can be derived:

$$E[L(Y, \delta(x))|X = x] = E[(Y - \delta(x))^2|X = x] = E[Y^2|X = x] - 2\delta(x)E[Y|X = x] + \delta(x)^2 = \mathrm{Var}(Y|X = x) + E[Y|X = x]^2 - 2\delta(x)E[Y|X = x] + \delta(x)^2 = \mathrm{Var}(Y|X = x) + \big(E[Y|X = x] - \delta(x)\big)^2. \qquad (3.8)$$

Clearly the decision rule that minimizes the posterior risk is $\delta(x) = E[Y|X = x]$, which is also a formal Bayes rule. Based on the assumption that $P_{Y|X,\theta}(y|x, \theta) \sim N(f_\theta(X), \sigma^2)$, this implies that for a regression model with squared loss the correct decision/action is to always take $\delta(x) = f_\theta(x)$ as the predicted value.
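The optimality of the conditional mean under squared loss can be checked by simulation. In the hypothetical sketch below (numpy assumed; the conditional distribution is an arbitrary example), the conditional mean attains a lower Monte Carlo estimate of the posterior risk than any shifted rule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Example conditional model: Y | X = x  ~  N(f(x), 1) with f(x) = 2x; condition on x = 1.5.
x = 1.5
y = rng.normal(2.0 * x, 1.0, size=100_000)

# Posterior risk estimates for two decision rules under squared loss (eq. 3.8):
risk_mean    = np.mean((y - 2.0 * x) ** 2)          # delta(x) = E[Y|X=x]
risk_shifted = np.mean((y - (2.0 * x + 0.5)) ** 2)  # any other action incurs extra loss
```

The gap between the two estimates is approximately $(E[Y|X=x] - \delta(x))^2 = 0.25$, exactly the last term in the derivation of (3.8).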

3.1.2 Model Selection: The Bias Variance Dilemma Most models have some type of smoothing parameter that determines the complexity of the model, where the complexity can range from a straight line to a high degree polynomial. It is not a good idea to use the Mean Squared Error on the training data (training error) to determine the smoothing parameters, as the best model in terms of training error would always be the one that interpolates the training data. If a model is too complex it will not generalize well, meaning it does not perform well on unseen data. On the other hand, if the model is too simple, it might not fully capture the dynamics in the system [15]. This dilemma is called the bias-variance trade-off and it can be observed by considering the expected prediction error. Consider the regression setting $Y = f(X) + \varepsilon$, assume that $\mathrm{Var}(\varepsilon) = \sigma^2$ and that squared loss is used. The expected prediction error of a regression fit $\hat{f}(X)$ at $X = x_0$ can then be derived as

$$E[(Y - \hat{f}(x_0))^2|X = x_0] = E[\varepsilon^2] + \big(f(x_0) - E[\hat{f}(x_0)]\big)^2 + E\big[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2\big] = \sigma^2 + \mathrm{bias}^2(\hat{f}(x_0)) + \mathrm{var}(\hat{f}(x_0)), \qquad (3.9)$$
where typically the more complex the model, the lower the squared bias and the higher the variance. The trade-off is made by, for instance, controlling the model complexity [15], where the goal is to find an optimal trade-off, see figure 3.1.


(a) High variance. (b) High bias. (c) Low bias and low variance.

Figure 3.1: Illustration of bias-variance trade-off.

As seen in figure 3.1, a good balance is when the model follows the shape of the training data but does not fit it perfectly.
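The trade-off can also be illustrated numerically. In the hypothetical sketch below (numpy assumed; the data is synthetic), the polynomial degree plays the role of the smoothing parameter: training error always falls as complexity grows, while validation error need not:

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of a smooth function; polynomial degree acts as the complexity knob.
x_train = np.linspace(0, 1, 30)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
x_val   = np.linspace(0.01, 0.99, 30)
y_val   = np.sin(2 * np.pi * x_val) + rng.normal(0, 0.3, x_val.size)

def errors(degree):
    """Fit a polynomial of the given degree and return (training MSE, validation MSE)."""
    coef = np.polyfit(x_train, y_train, degree)
    tr = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    va = np.mean((np.polyval(coef, x_val) - y_val) ** 2)
    return tr, va

tr1, va1 = errors(1)   # too simple: high bias, as in panel (b) of figure 3.1
tr3, va3 = errors(3)   # balanced, as in panel (c)
tr9, va9 = errors(9)   # flexible: training error keeps falling, variance grows
```

The training errors decrease monotonically with degree, whereas the validation error of the straight line stays high because the linear model cannot capture the sinusoid at all.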

3.1.3 Regularisation One way of controlling the bias-variance trade-off is by regularizing the original loss function. By penalizing the complexity of the model it is possible to adjust the bias and therefore the variance. The loss function can be penalized with, for example, L2- or L1-norm regularization

$$L_2 = \lambda \sum_{j=1}^{p} \theta_j^2 \qquad (3.10)$$

$$L_1 = \lambda \sum_{j=1}^{p} |\theta_j|, \qquad (3.11)$$
where L2 is naturally more sensitive to outliers [15]. Regularization is not always required and should only be used when there is a belief that the model is too complex (overfitting is occurring). A common way to identify overfitting is to divide the data into a training and a validation set. Only the training set is used to update the model parameters, while the validation set acts as an error check (validation). If the training error decreases while the validation error increases, this implies that the model is too complex and will generalise poorly, see figure 3.2.


Figure 3.2: Comparison of training and validation error.

There are other reasons for adding regularisation than countering overfitting. Sometimes specific regularisation is added to guarantee certain properties of the model. An example is that when creating implied volatility surfaces it is common to add regularisation to counter arbitrage [16].
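As a concrete illustration of L2 regularization, the sketch below (numpy assumed; the data is synthetic) minimizes the squared loss plus the penalty of eq. (3.10) for a linear model, which has the closed form $\theta = (X^T X + \lambda I)^{-1} X^T y$; increasing $\lambda$ shrinks the parameter vector and thereby reduces variance at the cost of bias:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic linear regression data: y = X theta_true + noise.
n, p = 50, 10
X = rng.normal(size=(n, p))
theta_true = np.zeros(p)
theta_true[:3] = [2.0, -1.0, 0.5]
y = X @ theta_true + rng.normal(0, 0.1, n)

def ridge(X, y, lam):
    """Minimize  ||y - X theta||^2 + lam * ||theta||^2  in closed form."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

theta_ols   = ridge(X, y, 0.0)    # no penalty: ordinary least squares
theta_ridge = ridge(X, y, 10.0)   # L2 penalty shrinks the coefficients
```

The L1 penalty of eq. (3.11) has no such closed form, but it tends to drive individual coefficients exactly to zero, which is why it is often preferred when a sparse model is desired.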

3.1.4 Dimension Reduction When the dimensionality of the feature space grows, learning becomes more complex and most likely requires a longer training process. Furthermore, when the dimensionality grows, the resulting feature space might be represented by highly correlated features, which do not add new information. The goal is therefore to reduce the feature space without losing information.

Principal Component Analysis Principal Component Analysis (PCA) can be used as a dimension reduction technique which enables a more compact representation of the features. The goal of PCA is to find the directions in the feature space that explain the maximum amount of variance in the data, and then project the original data down to a lower dimension that still contains most of the variance. The resulting components can therefore be seen as the directions that explain most of the variance while also being orthogonal to each other. In order to find the direction that maximizes the variance, consider the features $\{x_n\}_{n=1}^{N}$ where $x_n \in \mathbb{R}^p$, and take a unit vector $u_1 \in \mathbb{R}^p$ with $u_1^T u_1 = 1$; then $u_1^T x_n \in \mathbb{R}$ is the


projection of $x_n$ onto $u_1$. Similarly for the mean and the variance, the new mean is $u_1^T \bar{x}$, where $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$, and the variance is $\frac{1}{N}\sum_{n=1}^{N}\big(u_1^T x_n - u_1^T \bar{x}\big)^2 = u_1^T C_X u_1$, where $C_X$ is the covariance matrix of the data. By solving the optimization problem in equation 3.12, the first principal component $u_1$ can be found:

$$\max_{u_1}\; u_1^T C_X u_1 \quad \text{s.t.} \quad u_1^T u_1 = 1. \qquad (3.12)$$

By using the Lagrange multiplier method it can be shown that $u_1$ is an eigenvector of $C_X$. The Lagrangian and its derivative w.r.t. $u_1$ become

$$\mathcal{L}(u_1, \lambda) = u_1^T C_X u_1 + \lambda(1 - u_1^T u_1) \qquad (3.13)$$

$$\frac{\partial \mathcal{L}}{\partial u_1} = 2C_X u_1 - 2\lambda u_1 = 0 \iff C_X u_1 = \lambda u_1. \qquad (3.14)$$

Since the principal components are the eigenvectors of the covariance matrix $C_X$, one way of finding them is to use Singular Value Decomposition (SVD). Let $X \in \mathbb{R}^{p \times n}$ denote the data matrix, where the data is assumed to be centered and normalized; then $C_X = XX^T$ is the covariance matrix of X. The matrix X can be decomposed using SVD as $X = UDV^T$, where U is a $p \times p$ orthogonal matrix with the left singular vectors of X as columns $u_j$, D is a diagonal matrix with non-negative entries, and V is an orthogonal matrix with the right singular vectors of X as columns $v_j$. This means that U consists of the eigenvectors of $C_X$, i.e. all principal components. It can be shown that the transformed data is uncorrelated. Consider the transformed data $Z = U^T X$ with the corresponding covariance matrix expressed in terms of the original covariance,

$$C_Z = ZZ^T = U^T X (U^T X)^T = U^T X X^T U = U^T C_X U, \qquad (3.15)$$

where the covariance matrix $C_X$ is normal ($C_X C_X^T = C_X^T C_X$). Let $\Lambda$ be a diagonal matrix with the eigenvalues of $C_X$ on the diagonal; since $\Lambda^T = \Lambda$ and $C_X^T = C_X$, then

$$C_X = U\Lambda U^T, \qquad (3.16)$$
and recall that U is orthogonal ($U^T = U^{-1}$), which gives us

$$C_Z = U^T C_X U = U^T (U\Lambda U^T) U = (U^T U)\Lambda(U^T U) = \Lambda, \qquad (3.17)$$
i.e. Z is uncorrelated.
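The SVD route to PCA, and the fact that the transformed data is uncorrelated as in eq. (3.17), can be checked numerically. A sketch assuming numpy, with synthetic correlated data stored as a $p \times n$ matrix as in the text:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic correlated 2-D data; rows are features (p x n layout, as in the text).
n = 1000
z = rng.normal(size=(2, n))
A = np.array([[2.0, 0.0], [1.5, 0.5]])   # mixing matrix introduces correlation
X = A @ z
X = X - X.mean(axis=1, keepdims=True)    # center the data

# X = U D V^T; columns of U are the principal components.
U, D, Vt = np.linalg.svd(X, full_matrices=False)
Z = U.T @ X                              # transformed (decorrelated) data

C_Z = (Z @ Z.T) / n                      # should be ~diagonal, per eq. (3.17)
off_diag = abs(C_Z[0, 1])
```

Dimension reduction then amounts to keeping only the first few rows of Z, i.e. the components whose singular values D (and hence explained variance) are largest.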


3.2 Artificial Neural Networks

An Artificial Neural Network (ANN) is a nonlinear statistical model [15], and the name stems from attempts to mathematically represent information processing in biological systems [17]. The ANN architecture is comprised of an input layer, which represents the data fed to the system, and an output layer, representing the predicted value. In between, there are a number of hidden layers that are used to extract information from the data. Each layer consists of a number of nodes, often referred to as neurons, where the value of a neuron can be seen as a signal strength, similar to the neurons in our brain. Even though modern ANN models are not constructed with biology in mind, their ability to solve classification and regression problems has made them one of the most popular and powerful Machine Learning methods. There is a variety of ANN types, with different strengths in exploiting certain structures of the problem.

3.2.1 Feed Forward Neural Network The Feed Forward Neural Network (FNN) is a parametric model with the goal to predict some output y given x, the weight matrices $W^{(i)}$ and some biases $b^{(i)}$. A simple network with one hidden layer can be visualized as in figure 3.3.

Figure 3.3: Single hidden layer Feed Forward Neural Network architecture.

The network in figure 3.3 is simple, but the number of hidden layers and neurons can be altered into infinitely many combinations.

In order to get a prediction $f_\theta$, the data has to be pushed through the system until it reaches the output layer. The iterative manner of pushing the input x from one layer to another is usually referred to as the forward pass of the Feed Forward Neural Network.


The initial transformation from input to the hidden layer on vector and matrix form is, based on the architecture in figure 3.3, given by,

$$H^{(1)} = W^{(1)}x + b^{(1)}, \qquad (3.18)$$

where $W^{(1)} \in \mathbb{R}^{n \times p}$, $x \in \mathbb{R}^p$ and $b^{(1)} \in \mathbb{R}^n$. Since this is a simple linear transformation, a nonlinear activation function $a^{(i)}(\cdot)$ is applied,

$$z^{(1)} = a^{(1)}(H^{(1)}), \qquad (3.19)$$
and then, to get the final output,

$$f_\theta(x) = f_\theta(z^{(1)}) = W^{(2)}z^{(1)} + b^{(2)}, \qquad (3.20)$$

where $W^{(2)} \in \mathbb{R}^{d \times n}$, $z^{(1)} \in \mathbb{R}^n$ and $b^{(2)} \in \mathbb{R}^d$. A general form to compute the forward pass of a multilayered FNN with m layers is given by the composition

$$f_\theta(x) = f^{(m)} \circ \cdots \circ f^{(1)}(x) = f^{(m)}(f^{(m-1)}(\cdots f^{(1)}(x))), \qquad (3.21)$$
where the transition between layers is given by

$$f^{(i)}(H^{(i-1)}) = a^{(i)}(W^{(i)}z^{(i-1)} + b^{(i)}). \qquad (3.22)$$

When applying formula 3.22 to figure 3.3 it yields the following,

$$f^{(1)}(H^{(0)}) = f^{(1)}(x) = a^{(1)}(W^{(1)}x + b^{(1)}) = z^{(1)}$$
$$f^{(2)}(H^{(1)}) = f^{(2)}(z^{(1)}) = a^{(2)}(W^{(2)}z^{(1)} + b^{(2)}) = f_\theta(x), \qquad (3.23)$$
which is the same as the previously derived result in equation 3.20.
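The forward pass of equations (3.18)-(3.20) is a few matrix products. A minimal sketch, assuming numpy; tanh is used as an example activation and the dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)

# Dimensions in the notation of figure 3.3: p inputs, n hidden neurons, d outputs.
p, n, d = 3, 5, 2
W1, b1 = rng.normal(size=(n, p)), rng.normal(size=n)
W2, b2 = rng.normal(size=(d, n)), rng.normal(size=d)

def forward(x):
    H1 = W1 @ x + b1      # eq. (3.18): affine map into the hidden layer
    z1 = np.tanh(H1)      # eq. (3.19): nonlinear activation
    return W2 @ z1 + b2   # eq. (3.20): affine map to the output

y = forward(rng.normal(size=p))
```

Stacking more (W, b) pairs and repeating the affine-then-activate step gives the general m-layer composition of eq. (3.21).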

3.2.2 Recurrent Neural Networks Similar to the Feed Forward Neural Network, the Recurrent Neural Network (RNN) is a family of parametric models where the aim is to predict some output y based on some observed feature x, the difference being that the RNN exploits and handles the sequential structure of data. One of the main advantages is that it can scale to longer sequences than network architectures that do not possess the sequential properties; many RNNs can also handle sequences of different lengths. Consider some observed sequence of

input $(x^{(1)}, \ldots, x^{(\tau)})$, specifically consider the current state $h^{(t)}$ where $t \in \{1, \ldots, \tau\}$ and the corresponding input $x^{(t)}$; then the recurrence can be specified in the following way:

$$h^{(t)} = f(h^{(t-1)}, x^{(t)}, \theta). \qquad (3.24)$$

The recurrence can also be illustrated by the unfolded acyclic computational graph with the corresponding prediction $f_\theta^{(t)}$, as seen in figure 3.4.

Figure 3.4: Example of recurrence in a computational graph.

Essentially, any Neural Network that behaves like the one shown in figure 3.4 in terms of recurrence, meaning that the current state depends on the previous one, belongs to the family of RNNs. One problem with RNNs is that they tend to suffer from vanishing or exploding gradients during training. This problem arises from the recurrence: multiplying the same matrix by itself many times will eventually cause the weights to explode or vanish, depending on whether W is large or small. This would not be the case if the matrices were generated independently, but since the same matrix is applied at every time step it is a problem. As a solution for capturing long-term time dependence, Long Short-Term Memory (LSTM) networks were developed [18].

Long Short-Term Memory The Long Short-Term Memory network is a type of RNN that is based on a gated recurrent unit. The gated recurrent unit creates a path through time where the aim is to overcome the problem of vanishing or exploding gradients and capture long-term time dependence more efficiently. An LSTM consists of a number of cells, where each cell consists of an input, forget, internal and output gate. An LSTM network controls the input $x^{(t)}$ through its input gate $g_i^{(t)}$ for time step t and cell i, where the input gate is defined in the following way:

$$g_i^{(t)} = \sigma\Big(b_i^g + \sum_j U_{i,j}^g x_j^{(t)} + \sum_j W_{i,j}^g f_{\theta,j}^{(t-1)}\Big), \qquad (3.25)$$

where $(b^g, U^g, W^g)$ are the weights and biases associated with the input gate that the network aims to learn. The hidden state $h_i^{(t)}$ also has a gate that controls the information flow between hidden states, called the forget gate $f_i^{(t)}$, which can be defined as

$$f_i^{(t)} = \sigma\Big(b_i^f + \sum_j U_{i,j}^f x_j^{(t)} + \sum_j W_{i,j}^f f_{\theta,j}^{(t-1)}\Big), \qquad (3.26)$$
where, as for the input gate, $(b^f, U^f, W^f)$ corresponds to the weights and biases associated with the forget gate. Similarly to the RNN, a central component of the LSTM is the hidden state $h_i^{(t)}$, which has a linear self loop (recurrence) controlled by the forget gate $f_i^{(t)}$, representing how the current state depends on the previous one, and a second part corresponding to the current input, controlled by the input gate:

$$h_i^{(t)} = f_i^{(t)} h_i^{(t-1)} + g_i^{(t)} \sigma\Big(b_i^h + \sum_j U_{i,j}^h x_j^{(t)} + \sum_j W_{i,j}^h f_{\theta,j}^{(t-1)}\Big), \qquad (3.27)$$
where $(b^h, U^h, W^h)$ are the weights and biases associated with the internal state. The output of the LSTM, $f_{\theta,i}^{(t)}$, is also controlled by a gate, in a similar manner to the input and forget gates; the output gate $q_i^{(t)}$ is defined as follows:

$$q_i^{(t)} = \sigma\Big(b_i^q + \sum_j U_{i,j}^q x_j^{(t)} + \sum_j W_{i,j}^q f_{\theta,j}^{(t-1)}\Big), \qquad (3.28)$$
where $(b^q, U^q, W^q)$ corresponds to the weights and biases of the output of the network. The final output will then be

$$f_{\theta,i}^{(t)} = \tanh(h_i^{(t)})\, q_i^{(t)}. \qquad (3.29)$$

The three gates can also be extended to include the hidden state $h_i^{(t)}$ inside the respective gate, but that would add another three parameters to learn [18].
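Equations (3.25)-(3.29) can be collected into a single step function. The sketch below (numpy assumed; the weights are small random values and the dimensions are illustrative) runs a short sequence through one LSTM layer. It follows the text's equations, which use $\sigma$ for the internal candidate in (3.27), whereas many standard LSTM variants use tanh there:

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

p, m = 4, 3     # input size, number of LSTM cells
# One (b, U, W) triple per gate: input g, forget f, internal h, output q.
params = {k: (rng.normal(size=m) * 0.1,
              rng.normal(size=(m, p)) * 0.1,
              rng.normal(size=(m, m)) * 0.1) for k in "gfhq"}

def lstm_step(x_t, h_prev, out_prev):
    """One time step of eqs. (3.25)-(3.29)."""
    b, U, W = params["g"]; g = sigmoid(b + U @ x_t + W @ out_prev)  # input gate (3.25)
    b, U, W = params["f"]; f = sigmoid(b + U @ x_t + W @ out_prev)  # forget gate (3.26)
    b, U, W = params["h"]
    h = f * h_prev + g * sigmoid(b + U @ x_t + W @ out_prev)        # internal state (3.27)
    b, U, W = params["q"]; q = sigmoid(b + U @ x_t + W @ out_prev)  # output gate (3.28)
    out = np.tanh(h) * q                                            # output (3.29)
    return h, out

h, out = np.zeros(m), np.zeros(m)
for t in range(6):                  # push a short random input sequence through
    h, out = lstm_step(rng.normal(size=p), h, out)
```

The additive update of the internal state h is the "path through time": gradients flow through the sum rather than through repeated multiplication by the same matrix, which is what mitigates the vanishing/exploding gradient problem.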

3.2.3 Training a Neural Network As previously mentioned, NNs are parametric models and can, in the regression case, be written on the form $Y = f_\theta(X) + \varepsilon$, where for the FNN, $\theta = \{\theta^{(i)}\}_{i=1}^{m} = \{(W^{(i)}, b^{(i)})\}_{i=1}^{m}$, and for the LSTM, $\theta = \{\theta^{(i)}\}_{i=1}^{m} = \{(U_{(i)}^{j}, W_{(i)}^{j}, b_{(i)}^{j})\}_{i=1}^{m}$, $j \in \{g, f, h, q\}$, where m refers to the number of hidden layers in the FNN and the number of cells in


LSTM. The parameters θ are determined using a maximum likelihood framework and as discussed in section 3.1.1 this gives the squared loss,

$$L(y, f_\theta(x)) = \sum_{i=1}^{n}\big(y_i - f_\theta(x_i)\big)^2, \qquad (3.30)$$
as the loss function.

Gradient Methods Minimizing the loss function in equation 3.30 in a NN is complicated, as the function $f_\theta(x)$ is nonconvex, which makes $L(y, f_\theta(x))$ nonconvex as well. This makes finding a global minimum analytically unfeasible, which means the $\theta_{ML}$ that minimizes the loss function will most likely be a local minimum [17]. To find a minimum, iterative numerical methods such as Gradient Descent (GD) can be used [17], defined as:

$$\theta^{(t+1)} = \theta^{(t)} - \eta \nabla L(y, f_{\theta^{(t)}}(x)). \qquad (3.31)$$

The term $\eta$ is commonly referred to as the step size or learning rate, and its purpose is to control how fast the model learns and to avoid getting stuck in local minima. The learning rate is treated as a hyperparameter and different values should be tried, with $\eta = 0.1$ being a common starting point. When using plain gradient descent, all samples in the training set are used for a single update of $\theta$. As the training set grows this can become slow, and since there is no randomness the algorithm can get stuck in local minima. A faster alternative is Stochastic Gradient Descent (SGD), where instead of all samples, one sample or a subset of samples is used for each update. The samples are chosen uniformly at random from the training set, and the gradient is calculated and the parameters updated in the same manner as for Gradient Descent. The upside of choosing random samples is that it creates a more random descent, as illustrated in figure 3.5, resulting in a lower probability of getting stuck in a local minimum. The downside is that SGD has high variance, causing the objective function to fluctuate.


Figure 3.5: Comparison of path of descent for Gradient (red) and Stochastic Gradient Descent (blue), where f(0, 0) minimises the function.
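The two descent variants can be compared on a problem where the squared loss of eq. (3.30) applies. A minimal sketch, assuming numpy; the synthetic data, learning rates and iteration counts are illustrative choices, not tuned values:

```python
import numpy as np

rng = np.random.default_rng(6)

# Linear regression: minimize the squared loss of eq. (3.30) by (S)GD.
n, p = 200, 3
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(0, 0.1, n)

def grad(theta, Xb, yb):
    """Gradient of the mean squared loss over the batch (Xb, yb)."""
    return -2.0 * Xb.T @ (yb - Xb @ theta) / len(yb)

theta_gd = np.zeros(p)
for _ in range(500):                      # full-batch gradient descent, eq. (3.31)
    theta_gd -= 0.1 * grad(theta_gd, X, y)

theta_sgd = np.zeros(p)
for _ in range(2000):                     # stochastic: one random sample per update
    i = rng.integers(n)
    theta_sgd -= 0.01 * grad(theta_sgd, X[i:i+1], y[i:i+1])
```

Both variants approach the true parameters; the SGD iterates keep fluctuating around the optimum, which is the high-variance behaviour described above and visible in figure 3.5.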

It is common to use an adaptive learning rate on top of the stochastic approach. An adaptive learning rate means that the learning rate is high in the beginning and is then lowered as $\theta$ starts to converge. A popular variant is the ADAM (Adaptive Moment Estimation) optimizer.
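For reference, a single ADAM update consists of exponentially weighted estimates of the first and second moments of the gradient, with bias correction, giving a per-parameter adaptive step size. The sketch below (numpy assumed; the quadratic objective and hyperparameter values are illustrative defaults, not tuned choices) implements the update rule:

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update for gradient g at iteration t (1-indexed)."""
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g**2       # second moment estimate
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective f(theta) = theta^2, gradient 2*theta, starting from theta = 2.
theta, m, v = np.array([2.0]), np.zeros(1), np.zeros(1)
for t in range(1, 3001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, eta=0.01)
```

The division by the square root of the second moment normalizes the step per parameter, which in effect lowers the learning rate for parameters with consistently large gradients.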

Backpropagation As previously stated, the input to a NN is pushed through and transformed between hidden layers to determine the predicted output. This means that the gradient of the loss function with respect to (w.r.t.) the parameter $\theta$ depends on the parameter values through all layers, as the error propagates from layer to layer. To minimize the loss function, the aim is to send the error back through the network and update the weights. This process of propagating the error backwards through the network is known as backpropagation. Backpropagation utilizes the chain rule to efficiently calculate the derivatives. The method, when applied to the single hidden layer network in figure 3.3, can be illustrated in the form of a computational graph as in figure 3.6.


Figure 3.6: Computational graph for a single hidden layer Feed Forward Neural Network.

In order to calculate the derivatives, assuming the squared loss $L = L(y, f_\theta(x)) = (y - f_\theta(x))^2$ is used, the error has to be propagated right to left given the orientation of figure 3.6. The procedure starts from the leaf $\mathcal{L} = \{L\}$ where $\frac{\partial L}{\partial L} = 1$:

Leaf $\mathcal{L} = \{L\}$, Parents$(\mathcal{L}) = \{y, f_\theta\}$:

$$\frac{\partial L}{\partial f_\theta} = -2(y - f_\theta(x)) \qquad (3.32)$$
$$\frac{\partial L}{\partial y} = 2(y - f_\theta(x)) \qquad (3.33)$$

Leaf $\mathcal{L} = \{f_\theta\}$, Parents$(\mathcal{L}) = \{H^{(2)}\}$:

$$\frac{\partial L}{\partial H^{(2)}} = \frac{\partial L}{\partial f_\theta}\frac{\partial f_\theta}{\partial H^{(2)}} \qquad (3.34)$$

Leaf $\mathcal{L} = \{H^{(2)}\}$, Parents$(\mathcal{L}) = \{u^{(2)}, b^{(2)}\}$:

$$\frac{\partial L}{\partial u^{(2)}} = \frac{\partial L}{\partial H^{(2)}}\frac{\partial H^{(2)}}{\partial u^{(2)}} = \frac{\partial L}{\partial f_\theta}\frac{\partial f_\theta}{\partial H^{(2)}}\frac{\partial H^{(2)}}{\partial u^{(2)}}$$
$$\frac{\partial L}{\partial b^{(2)}} = \frac{\partial L}{\partial H^{(2)}}\frac{\partial H^{(2)}}{\partial b^{(2)}} = \frac{\partial L}{\partial f_\theta}\frac{\partial f_\theta}{\partial H^{(2)}}\frac{\partial H^{(2)}}{\partial b^{(2)}} \qquad (3.35)$$
and proceed in a similar manner for each partial derivative, e.g.: Leaf $\mathcal{L} = \{u^{(1)}\}$, Parents$(\mathcal{L}) = \{W^{(1)}, x\}$:


$$\frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial u^{(1)}}\frac{\partial u^{(1)}}{\partial W^{(1)}} = \frac{\partial L}{\partial H^{(1)}}\frac{\partial H^{(1)}}{\partial u^{(1)}}\frac{\partial u^{(1)}}{\partial W^{(1)}} = \ldots = \frac{\partial L}{\partial f_\theta}\frac{\partial f_\theta}{\partial H^{(2)}}\frac{\partial H^{(2)}}{\partial u^{(2)}}\frac{\partial u^{(2)}}{\partial z^{(1)}}\frac{\partial z^{(1)}}{\partial H^{(1)}}\frac{\partial H^{(1)}}{\partial u^{(1)}}\frac{\partial u^{(1)}}{\partial W^{(1)}} \qquad (3.36)$$

The method can be used on any network architecture and allows for fast computations, as previously calculated derivatives can be stored and reused.
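The chain of derivatives above can be implemented directly. The following sketch (numpy assumed; tanh is chosen as an example activation, and the analytic gradient is verified against a finite difference) computes the backward pass for the single hidden layer network of figure 3.3:

```python
import numpy as np

rng = np.random.default_rng(7)

# Single hidden layer network of figure 3.3 with tanh activation; the gradients
# below apply the chain rule in the same order as eqs. (3.32)-(3.36).
p, n, d = 3, 4, 1
W1, b1 = rng.normal(size=(n, p)), np.zeros(n)
W2, b2 = rng.normal(size=(d, n)), np.zeros(d)
x, y = rng.normal(size=p), np.array([1.0])

# Forward pass (intermediates are stored for reuse in the backward pass).
u1 = W1 @ x + b1
z1 = np.tanh(u1)
f = W2 @ z1 + b2

# Backward pass: propagate dL/df = -2(y - f) from right to left.
dL_df = -2.0 * (y - f)                 # eq. (3.32)
dL_dW2 = np.outer(dL_df, z1)           # the stored dL_df is reused for W2 and b2
dL_db2 = dL_df
dL_dz1 = W2.T @ dL_df
dL_du1 = dL_dz1 * (1.0 - z1**2)        # tanh'(u) = 1 - tanh(u)^2
dL_dW1 = np.outer(dL_du1, x)           # the full chain of eq. (3.36) assembled
dL_db1 = dL_du1

# Finite-difference check of one entry of dL/dW1.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
f_p = W2 @ np.tanh(W1p @ x + b1) + b2
num = ((y - f_p) ** 2 - (y - f) ** 2).item() / eps
```

The agreement between `num` and `dL_dW1[0, 0]` is the standard sanity check for a hand-written backward pass.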

3.2.4 Choosing a Network Architecture Once the training method is chosen, the next important part of creating a good Neural Network model is finding a suitable network architecture, commonly referred to as hyperparameter optimization. Deciding on the number of nodes per layer and the number of layers is an art in itself, but some guidelines can be used. The number of nodes is typically not higher than 100, where more nodes increase model flexibility and the capability of finding non-linearities in the data [15]. The number of hidden layers is usually found through experimentation, but surprisingly many problems can be solved with one hidden layer. The hyperparameter experimentation is usually done through a grid search, where multiple combinations of hidden layers and nodes per layer are compared. The combination with the lowest validation error is then chosen as the network architecture.

Activation Functions A vital and heavily studied part of Neural Network architecture are the activation functions, also referred to as transfer functions. From a mathematical perspective, activation functions are used to capture non-linear properties in the data and are therefore typically non-linear. The reason for the non-linearity is that a network with only linear activation functions would "collapse", since the network could then be written as a single linear combination and would thus not be able to capture non-linear properties. It is, however, desirable to have some linearity in the activation function, as this reduces computational cost. From an applied perspective, the goal of the activation function is to determine how strong the signal from the neuron is, where a stronger signal implies a higher probability that the neuron is activated [19]. There are many different types of activation functions, and which one to use has to be determined on a case by case basis. Generally desirable properties of activation functions are the ability to capture non-linearities, fast computation of the gradient, and that the gradient does not vanish in deep networks. A few popular and well researched activation functions are given below:


The Sigmoid activation function σ(·) is defined as,

$$\sigma(z) = \frac{e^z}{1 + e^z}, \qquad (3.37)$$
which has the property that the derivative is given by

$$\sigma'(z) = \sigma(z)(1 - \sigma(z)). \qquad (3.38)$$

The Sigmoid function outputs values in [0, 1] and is differentiable for all x, allowing efficient computation of the gradient. A drawback of the Sigmoid function is that the deeper the network becomes, the smaller the gradient becomes when trained with backpropagation, which results in slow or even stagnant learning. The Hyperbolic Tangent Function (tanh) is similar to the Sigmoid function, with the main difference being that it gives values in [−1, 1]. The function is defined as

ez − e−z tanh(z) = . (3.39) ez + e−z

The Hyperbolic Tangent Function also has the problem of vanishing gradients. The Rectified Linear Unit (ReLU) is starting to become the go-to function for many ANNs, since it is cheap to compute and works well for many applications. The function is defined as,

ReLU(z) = max(0, z), (3.40)

where z = Wx + b. The function counters the vanishing gradient problem by not squashing the values on both ends (the problem still persists for z < 0). One must note that the function is not differentiable at z = 0; however, the probability of z = 0 is very low and this is thus often ignored, though it can be handled by the subgradient method.
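The three activation functions above, together with the Sigmoid derivative identity (3.38), translate directly to NumPy:

```python
import numpy as np

def sigmoid(z):
    # eq. (3.37): output in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # eq. (3.38): derivative expressed through the function itself
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh(z):
    # eq. (3.39): output in (-1, 1)
    return np.tanh(z)

def relu(z):
    # eq. (3.40): no squashing for z > 0, zero gradient for z < 0
    return np.maximum(0.0, z)

z = np.linspace(-5.0, 5.0, 101)
```

Note that the maximal Sigmoid slope is σ'(0) = 0.25: stacking many Sigmoid layers therefore multiplies gradients by factors below one, which is exactly the vanishing gradient problem described above.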

3.2.5 Neural Network Drawbacks

Although extremely powerful and versatile, Neural Networks also have their drawbacks. A common concern is that the way the parameters are set is very complex. The process is commonly referred to as a black box: Neural Networks can approximate any function, but studying their structure will not give any insights into the structure of the function being approximated. There is also the problem of overfitting, which can be countered with regularisation as discussed in sections 3.1.2 and 3.1.3. Common methods such as L1 and L2 regularisation can also be applied to Neural Networks, but one problem with L2 regularisation is that the network is no longer invariant to linear transformations of the input. This is not ideal since, for example, in an image classification task it should not matter if a handwritten digit is completely straight or slightly skewed, or if it is big or small. In the case of L2 regularisation this is easily fixed by using different parameters λ for the different layers, but then on the other hand it would be harder to make a comparison of the model in the Bayesian sense, since it would mean using an improper prior leading to a model evidence of zero [17]. There are however regularisation techniques specific to Neural Networks, such as dropout and early stopping. Dropout refers to the idea of dropping out nodes during training in order to make the model more robust, forcing the remaining nodes to carry more of the information normally carried by the excluded node. Early stopping comes from observing the error on a validation set, not used during training, where the idea is to stop when the error on the validation data goes up [17], as this indicates overfitting.
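The early stopping rule can be sketched as pure bookkeeping over the validation-error history; the patience parameter and error values below are illustrative:

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return the epoch whose weights would be kept: training halts once
    the validation error has failed to improve `patience` epochs in a row."""
    best_err = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation error keeps rising: stop training
    return best_epoch

# validation error falls, then rises: stop and keep the weights from epoch 2
kept = early_stopping_epoch([0.9, 0.5, 0.3, 0.4, 0.6])
```

Training frameworks implement the same logic as a callback (e.g. an EarlyStopping callback in Keras), but the decision rule is no more than this comparison.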

3.3 The Gaussian Process

A Gaussian Process (GP) is a non-parametric method based on a Bayesian framework, used to predict outcomes based on prior knowledge and updating a posterior belief when new data is added. As the method is non-parametric, rather than calculating the probability distribution of the parameters of a specific function, the probability distribution over all admissible functions that fit the data is calculated. In order to get the full probability distribution, GP utilizes the powerful properties of multivariate Gaussian distributions to efficiently compute the distribution of the predicted value given the input and prior data. The mathematical definition of a GP is that it is a collection of random variables {f(x)}_{x∈X} such that, for any finite collection (x_1, ..., x_d), the random vector (f(x_1), ..., f(x_d))^T has a multivariate Gaussian distribution [20]. The method can be applied to both classification and regression, but is more commonly applied to the latter.

3.3.1 Gaussian Process Regression

Consider the regression setting,

Y_i = f(X_i) + ε_i, (3.41)

where ε_i ∼ N(0, σ²), the pairs {(Y_i, X_i)}_{i=1}^n are i.i.d. and f(X_i) is a Gaussian Process. The fact that f(X_i) is Gaussian is very convenient when calculating the predictive distribution,


P_{Y|X}(y|x) = ∫ P_{Y|f,X}(y|θ, x) P_{f|X}(θ|x) df
= {X = (X_1, Y_1, ..., X_n, Y_n, X_{n+1}) and Y = Y_{n+1}}
= ∫ P_{Y_{n+1}|f,X_{n+1}}(y|θ, x_{n+1}) P_{f|X_1,Y_1,...,X_n,Y_n,X_{n+1}}(θ|x_1, y_1, ..., x_n, y_n, x_{n+1}) df. (3.42)

If the prior P_f(θ) is Gaussian and the likelihood P_{X|f}(x|f) is Gaussian, the posterior P_{f|X}(θ|x) is also Gaussian. Furthermore, since P_{Y_{n+1}|f,X_{n+1}}(y|θ, x_{n+1}) is Gaussian, the predictive distribution P_{Y|X}(y|x) is also Gaussian. In order to get the predictive distribution, the goal is therefore to calculate the Gaussian conditional distribution Y|X. This can be done using another convenient property of multivariate Gaussian distributions, the fact that the conditional distribution can be easily calculated given the joint distribution, e.g.

(Z_1, Z_2)^T ∼ N( (m_1, m_2)^T, [ [k_1, k_12], [k_21, k_2] ] ) (3.43)

gives the conditional Z_1|Z_2,

m_{1|2} = m_1 + k_12 k_2^{−1} (Z_2 − m_2), (3.44)
k_{1|2} = k_1 − k_12 k_2^{−1} k_21. (3.45)

Utilizing the above mentioned properties, consider the same regression setting as in (3.41), where X = (X_1, ..., X_n)^T, X̂ = (X_{n+1}, ..., X_{n+k})^T, Y = (Y_1, ..., Y_n)^T, Ŷ = (Y_{n+1}, ..., Y_{n+k})^T and the prior f(X_i) has zero mean and covariance function k(·, ·). The joint distribution conditional on X = x and X̂ = x̂ is given by,

(f(x̂), Y)^T ∼ N( (0, 0)^T, [ [k(x̂, x̂), k(x̂, x)], [k(x, x̂), k(x, x) + σ²I_n] ] ), (3.46)

which gives the posterior distribution,

m_{f(X̂)|Y,X,X̂}(y, x, x̂) = k(x, x̂)[k(x, x) + σ²I_n]^{−1} y, (3.47)
k_{f(X̂)|Y,X,X̂}(y, x, x̂) = k(x̂, x̂) − k(x, x̂)[k(x, x) + σ²I_n]^{−1} k(x̂, x). (3.48)


Similarly for the joint between Ŷ and Y, including the noise,

(Ŷ, Y)^T ∼ N( (0, 0)^T, [ [k(x̂, x̂) + σ²I_k, k(x̂, x)], [k(x, x̂), k(x, x) + σ²I_n] ] ), (3.49)

which gives the predictive posterior,

m_{Ŷ|Y,X,X̂}(y, x, x̂) = k(x, x̂)[k(x, x) + σ²I_n]^{−1} y, (3.50)
k_{Ŷ|Y,X,X̂}(y, x, x̂) = k(x̂, x̂) + σ²I_k − k(x, x̂)[k(x, x) + σ²I_n]^{−1} k(x̂, x). (3.51)

Recall Bayesian Decision Theory in 3.1.1, where a formal Bayes rule when using squared loss was shown to be δ(X̂) = E[Ŷ|Y = y, X̂ = x̂, X = x]. This implies that the predicted value in Gaussian Process regression is given by the predictive mean,

m_{Ŷ|Y,X,X̂}(y, x, x̂) = k(x, x̂)[k(x, x) + σ²I_n]^{−1} y. (3.52)

Another nice property of GP is that the full predictive distribution is available, which gives the possibility to calculate confidence intervals for the predicted values. As an example, the function f(x) = x · sin(x) is approximated using n observations, as seen in figure 3.7.

Figure 3.7: Gaussian Processes Regression of the function f(x) = x · sin(x).

As seen in the figure, the model performs well on values which are in the interval of the observations and becomes more uncertain when there is no data between the points being regressed and the previous observations. A drawback with the Gaussian Process is that the full data set is required in order to get the predictive distribution, and since inverting large matrices is computationally costly this can be an issue depending on the task.
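An example like figure 3.7 can be reproduced with scikit-learn's GP regressor; the kernel and observation grid below are assumptions, since the settings behind the figure are not stated:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# n noise-free observations of f(x) = x*sin(x)
X = np.linspace(0.0, 10.0, 20).reshape(-1, 1)
y = (X * np.sin(X)).ravel()

# scaled RBF prior; the hyperparameters are tuned during fit
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-6)
gp.fit(X, y)

# the full predictive distribution gives a mean and a standard deviation
# per point, so a 95% confidence band is mean +/- 1.96*std
mean, std = gp.predict(np.array([[5.0], [15.0]]), return_std=True)
```

Inside the observed interval the predictive standard deviation is close to zero, while at x = 15, far outside the data, it reverts towards the prior, mirroring the behaviour described above.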

3.3.2 Gaussian Process Model Selection

Model selection for Gaussian Processes can be divided into two parts: choosing a prior and optimizing its hyperparameters.

The Prior and Covariance Functions

The first part of model selection in a GP setting is choosing a positive definite covariance function, also known as a kernel, and a mean function. The mean function is typically set to zero or to some constant, while the covariance function is kept as a function. There are different types of covariance functions, each with unique strengths suited for different problems. It is also possible to combine kernels by multiplying or adding the corresponding covariance matrices elementwise. That is, if two kernels are multiplied together, the resulting kernel will have a high value only if both of the two separate kernels have a high value. If two kernels are added together, the resulting kernel will have a high value if either of the two separate kernels has a high value. A few kernels used for regression are given below. The Radial Basis Function (RBF) kernel is a popular kernel function used in various machine learning algorithms which utilize kernels, and is defined as,

k(x, x̂) = σ² exp(−‖x − x̂‖² / (2ℓ²)), (3.53)

where σ² is the overall variance, which also affects the amplitude, and ℓ is the lengthscale. The strength of the RBF is that it is easy to calibrate and very smooth. This kernel can be altered to a Rational Quadratic (RQ) kernel by adding many RBF kernels with different lengthscales, which gives the form,

k(x, x̂) = σ²(1 + ‖x − x̂‖² / (2αℓ²))^{−α}. (3.54)

Another kernel is the Dot Product (DP) kernel, which is simply a kernel that can be written as a scalar product on the form,

k(x, x̂) = σ² φ(x)^T φ(x̂), (3.55)

where σ controls the inhomogeneity of the kernel.
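With scikit-learn's kernel classes, the three kernels above can be instantiated and combined by addition and multiplication exactly as described; the hyperparameter values here are placeholders:

```python
import numpy as np
from sklearn.gaussian_process.kernels import (
    RBF, RationalQuadratic, DotProduct)

rbf = RBF(length_scale=2.0)                          # eq. (3.53)
rq = RationalQuadratic(length_scale=2.0, alpha=1.0)  # eq. (3.54)
dp = DotProduct(sigma_0=1.0)                         # eq. (3.55)

# elementwise combination of the covariance matrices:
either = rbf + dp  # high if either kernel is high
both = rbf * dp    # high only if both kernels are high

X = np.array([[0.0], [1.0], [3.0]])
K = rbf(X)  # 3x3 covariance matrix over the three inputs
```

For the stationary RBF, nearby inputs receive a larger covariance than distant ones, e.g. K[0, 1] > K[0, 2] here.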


The Evidence and Hyperparameter Optimization

The second part of GP model selection is hyperparameter optimization, more specifically optimizing the hyperparameters γ of the prior P_f(θ), as these have a big effect on the prediction, especially if the number of samples is small, resulting in few updates to the posterior. The Bayesian approach to optimizing the hyperparameters is maximizing the marginal distribution, also known as the evidence,

P_{Y|X}(y|x) = ∫_Θ P_{Y|X,Θ}(y|x, θ) P_Θ(θ) dθ. (3.56)

Since the goal is to optimize over γ, the evidence can be written as,

P_{Y|X,Γ}(y|x, γ) = ∫_Θ P_{Y|X,Θ,Γ}(y|x, θ, γ) P_{Θ|Γ}(θ|γ) dθ, (3.57)

which is a function of the hyperparameters. Maximizing the evidence is the same as minimizing the negative log evidence,

−log P_{Y|X}(y|x) = (1/2) y^T (k(x, x) + σ²I_n)^{−1} y + (1/2) log det(k(x, x) + σ²I_n) + (n/2) log 2π, (3.58)

which can be solved using a gradient descent algorithm. An alternative to maximizing the evidence is maximizing the posterior to approximate P_{Y|X}(y|x) ≈ P_{Y|X,Θ}(y|x, θ_MAP), where θ_MAP = argmax_θ P_{Θ|X}(θ|x).
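Equation (3.58) is a direct NumPy computation for an arbitrary covariance matrix; this sketch uses a linear solve instead of an explicit inverse for numerical stability:

```python
import numpy as np

def neg_log_evidence(K, y, sigma2):
    # eq. (3.58): 0.5*y^T C^{-1} y + 0.5*log det C + (n/2)*log(2*pi),
    # with C = k(x, x) + sigma^2 * I_n
    n = len(y)
    C = K + sigma2 * np.eye(n)
    data_fit = 0.5 * y @ np.linalg.solve(C, y)
    complexity = 0.5 * np.linalg.slogdet(C)[1]
    constant = 0.5 * n * np.log(2.0 * np.pi)
    return data_fit + complexity + constant
```

The three terms make the evidence trade-off visible: the first rewards fitting the data, the second penalizes overly flexible kernels, and the third is a constant in the hyperparameters.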

Chapter 4

Methods

In this chapter the statistical learning based approach for calibrating the implied volatility surface, and the methods used for preparing the data and building the Machine Learning models, are presented. This includes generating implied volatility surface data, generating intraday data, feature engineering, and model and algorithm selection.

4.1 Statistical Learning Approach

Three well-established ML methods were implemented to address the research question of whether a Machine Learning based intraday calibration can yield a closer estimate of the EOD implied volatility surface than the previous IVS. The methods included Feed Forward Neural Network, Recurrent Neural Network and Gaussian Process, which were trained and optimized in terms of hyperparameters using differently preprocessed training data with varying amounts of features. Once the best models were chosen for the respective methods in terms of hyperparameters, they were benchmarked against each other but also against the previous EOD IVS by Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).

4.2 The Data

The input data for the Machine Learning models was end of day implied volatility and intraday market data. The data was generated from orders and trades for OMXS30 options and their respective underlying for the year 2019, see appendix A.1.

4.2.1 Implied Volatility Data

The implied volatility was generated with Nasdaq's own model, which is briefly mentioned in section 2.2.2. Nasdaq's model requires a minimum number of valid combinations of strike prices and times to maturity in order to generate the implied volatility. The underlying and option price for a combination was taken as the last traded of the day. However, the number of option trades for a given combination was in some cases very low, meaning that the surface could not always be generated using only trade data. In order to increase the number of generated surfaces, combinations which were not traded but could be found in the order book were added. Since the number of orders in the order book was on the order of 10^5-10^6, only the last 20 000 orders of the day were considered, the motivation being computational efficiency and that 20 000 was estimated to be enough to complement with more combinations. The price from the order book for a given combination can be ask or bid, where the average was taken to simulate a realistic trade. Furthermore, all duplicate options based on time to maturity, strike price, option price and put/call were removed. This resulted in the following distribution in the number of unique options used to generate the implied volatility surface, see figure 4.1.

Figure 4.1: Comparison of number of unique options observed per trading day using only traded or traded and order data.

When successfully feeding Nasdaq's model with the observed options and the respective underlying price, the output is the implied volatility for 5 times to maturity and 9 strike prices, resulting in 45 features, see table 4.1.

Table 4.1: Vol1-Vol5 is the implied volatility at 5 different times to maturity for the lowest log moneyness (-0.4), Vol6-Vol10 is the implied volatility at 5 different times to maturity for the second lowest log moneyness (-0.3). Following the same procedure for the remaining implied volatilities, Vol41-Vol45 is the IV at 5 different times to maturity for the highest log moneyness (0.4).

Vol1 Vol2 ... Vol45


4.2.2 Intraday Data

The intraday data was extracted by looking at the last 1000 derivative orders, which were estimated to cover approximately the last 30 minutes before market close. Then all duplicates based on time to maturity, strike price, option price and put/call were removed, and once again the option price was calculated as the average of the bid and ask. From the remaining derivatives the n most at the money options were extracted, with n = 3, 6, 9. The motivation for choosing the n most ATM options is that the market is more certain about option prices close to ATM. The most ATM options were chosen based on the corresponding log moneyness. In order to find the log moneyness for each option, the underlying price traded closest to the time stamp of the option was used. Once the options were chosen, the following features were stored per n options, see table 4.2.

Table 4.2: The following features are stored for n at the money options.

Put / Call Log Moneyness Option Price Std Option Price Time to Maturity

Furthermore, statistics of the return of the underlying over the entire day, where the return is defined as the ratio between the current and previous traded price, were stored independently of n, see table 4.3.

Table 4.3: The following features are stored independently of the n at the money options.

Min Return Max Return Mean Return Std Return

Using the features from tables 4.2 and 4.3, the number of intraday features is 19, 34 or 49 for n = 3, 6, 9 respectively.

4.2.3 Feature Engineering

In an attempt to improve model performance, some data preprocessing steps, which are partly discussed in section 3.1.4, were taken.

Normalization

The scale and distribution of the data varied amongst the features, e.g. the previous day EOD IVS in comparison to the option price, and the features were therefore normalized using min-max scaling on the interval (0, 1). The features that corresponded to the n most ATM options and the previous day EOD IVS were normalized jointly. This means that the max and min values used for normalization were collected over the related features, instead of normalizing column-wise and treating them separately, e.g. the minimum and maximum option price was based on all n options in the features. The features that came from the underlying, see table 4.3, were min-max scaled based on the min and max of each column.
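The joint min-max scaling described above differs from a column-wise MinMaxScaler: one shared minimum and maximum is taken over the whole group of related columns. A minimal sketch, with made-up prices:

```python
import numpy as np

def joint_minmax(group):
    # one min and one max over ALL columns in the related group,
    # rather than scaling each column with its own range
    lo, hi = group.min(), group.max()
    return (group - lo) / (hi - lo)

# e.g. the option prices of the n ATM options share one scale
prices = np.array([[10.0, 12.0, 11.0],
                   [20.0, 18.0, 14.0]])
scaled = joint_minmax(prices)
```

Scaling the group jointly preserves the relative ordering between the options within a sample, which would be distorted if every column were rescaled with its own range.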


PCA

In order to reduce the dimensions of the problem, and as an attempt to enhance performance and speed up the training process, a PCA was conducted where the amount of variance to keep was set to 95%. Since PCA does not handle categorical features, such as whether one of the n most ATM options is a put or a call option, the categorical features were removed. PCA works better when the data is normalized/standardized, since then the different scales do not affect in which directions the variance is the highest. In the case of the 3, 6 and 9 most at the money options in the intraday data, 95% of the variance can be explained by 17, 24 and 32 components respectively.
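In scikit-learn, keeping 95% of the variance is a one-liner, since passing a float in (0, 1) to `n_components` selects the smallest number of components reaching that threshold; the toy matrix below only illustrates the mechanics:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# toy numeric matrix standing in for the normalized features; the last
# five columns are near-copies of the first five, so they add little variance
X = rng.normal(size=(200, 10))
X[:, 5:] = X[:, :5] + 0.01 * rng.normal(size=(200, 5))

pca = PCA(n_components=0.95)       # keep 95% of the variance
X_reduced = pca.fit_transform(X)   # redundant dimensions are dropped
```

On the redundant toy data PCA collapses the ten columns to roughly half as many components while still explaining at least 95% of the variance, mirroring the 17, 24 and 32 components reported above.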

4.2.4 The Final Data Set

In the final data set, tables 4.1, 4.2 and 4.3 were combined, with the target value for the models being the future implied volatility surface, which had the same form as table 4.1 but for the next available date. An example of a training sample is given in table 4.4.

Table 4.4: Example of model training sample.

IVS Data   | Intraday Data | Target IVS
2019-09-24 | 2019-09-25    | 2019-09-25

In total there were approximately 5500 samples collected for OMXS30 2019, which were divided into training and test sets using an 80/20 split. The split was done using sklearn.model_selection.train_test_split, which also shuffled the samples. The process was repeated for all the differently preprocessed final data sets, resulting in 27 training and test sets.
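The split can be sketched as follows, with placeholder arrays standing in for the ~5500 samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(5500, dtype=float).reshape(-1, 1)  # placeholder features
y = np.arange(5500, dtype=float)                 # placeholder targets

# 80/20 split with shuffling; a fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)
```

`shuffle=True` is the default, so consecutive trading days are spread over both sets rather than the test set being the last contiguous block of the year.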

4.3 Hyperparameter Optimization

In order to find the optimal final models for the 3 Machine Learning approaches used to evaluate the results, hyperparameter optimization was performed.

Neural Networks

The Neural Network based models were implemented using the Keras API running on top of TensorFlow, where the ADAM optimizer was used during training. The NN models were optimized based on MSE, and a dropout rate of 20% was used as a regularization technique to prevent overfitting. Each NN model was individually optimized for the individual data sets, meaning that the best model was selected for the FNN and the LSTM for each n most at the money data set, using the original, normalized and PCA data. In order to select the best set of parameters for both of the models, a grid search was conducted based on a grid of different batch sizes, number of neurons (memory cells in LSTM) and number of layers in the network. As seen in table 4.5 the grids were different for FNN and LSTM; this is because adding another neuron in a FNN does not correspond to adding another memory cell in a LSTM in terms of the total number of parameters.

Table 4.5: Neural Network hyperparameter grids.

Model | Batch Size | Neurons/Memory-Cells | Layers
FNN   | {4, 16}    | {5, 20, 35, 50, 100} | {1, 2, 3}
LSTM  | {4, 16}    | {1, 3, 10, 25}       | {1, 2}

Each set of parameters in the grid was then trained with k-fold Cross Validation (CV), using the KFold split in the sklearn package to generate the folds. K-fold CV consists of dividing the initial training data into 10 subsets or folds; for each subset the model is trained on 9 folds, leaving 1 fold out for validation, and the results are then averaged over the 10 runs. The best model was then selected based on the lowest average k-fold MSE on the validation data.
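The bookkeeping for one grid point's 10-fold CV score might look like this; the constant-mean "model" is a stand-in for fitting the actual network:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100, dtype=float).reshape(-1, 1)  # placeholder training data
y = X.ravel()

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_mse = []
for train_idx, val_idx in kf.split(X):
    # stand-in model: "train" on 9 folds, validate on the held-out fold
    prediction = y[train_idx].mean()
    fold_mse.append(((y[val_idx] - prediction) ** 2).mean())

avg_mse = float(np.mean(fold_mse))  # selection score for this grid point
```

Every sample appears in exactly one validation fold, so the averaged score uses all of the training data without ever validating on data seen during that fold's training.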

Gaussian Process

The Gaussian Process was trained using a sklearn GP regressor. The kernels which were tested are: the Dot Product, Radial Basis Function and Rational Quadratic kernels. For each data set every kernel was also tested with a noise term, which seemed to increase performance. The hyperparameters of the kernels were optimized using the regressor's built-in optimizer, the Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm. Once the model was trained for a given kernel and data set, the log marginal likelihood was stored. The kernel that gave the largest log marginal likelihood was then chosen as the final GP regressor for the given data set.
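A sketch of this selection loop with scikit-learn, on toy data: each candidate kernel carries a WhiteKernel noise term, and fitting runs the built-in L-BFGS-B optimizer over the kernel hyperparameters:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF, RationalQuadratic, DotProduct, WhiteKernel)

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=(40, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=40)

candidates = [RBF() + WhiteKernel(),
              RationalQuadratic() + WhiteKernel(),
              DotProduct() + WhiteKernel()]

log_marginal_likelihoods = []
for kernel in candidates:
    gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)
    log_marginal_likelihoods.append(gp.log_marginal_likelihood_value_)

# the kernel with the largest log marginal likelihood wins
best_index = int(np.argmax(log_marginal_likelihoods))
```

On this sinusoidal toy data the linear Dot Product kernel cannot explain the structure, so one of the stationary kernels attains the larger evidence.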

4.3.1 Optimized Model Parameters

The final model parameters used for evaluating the results are summarized in tables 4.6, 4.7 and 4.8.

Table 4.6: Feed Forward Neural Network hyperparameters.

ATM | Original (Batch Size, Neurons, Layers) | Normalized (Batch Size, Neurons, Layers) | PCA (Batch Size, Neurons, Layers)
3   | 4, 100, 1                              | 16, 100, 1                               | 16, 35, 1
6   | 4, 100, 1                              | 16, 100, 1                               | 16, 35, 1
9   | 4, 100, 1                              | 16, 100, 2                               | 16, 35, 1


Table 4.7: Long Short-Term Memory Network hyperparameters.

ATM | Original (Batch Size, Memory Cells, Layers) | Normalized (Batch Size, Memory Cells, Layers) | PCA (Batch Size, Memory Cells, Layers)
3   | 16, 25, 1                                   | 4, 25, 1                                      | 4, 25, 1
6   | 4, 25, 1                                    | 4, 25, 1                                      | 4, 25, 1
9   | 16, 25, 1                                   | 4, 25, 1                                      | 4, 25, 1

Table 4.8: Gaussian Process hyperparameters.

ATM | Original (Kernel: Parameters) | Normalized (Kernel: Parameters) | PCA (Kernel: Parameters)
3   | RBF: ℓ = 40.9                 | RQ: α = 57.7, ℓ = 6.88          | RQ: α = 0.00222, ℓ = 7.37
6   | DP: σ = 0.00222               | RQ: α = 0.6, ℓ = 12.4           | RQ: α = 0.00199, ℓ = 9.47
9   | DP: σ = 0.00158               | RBF: ℓ = 15.6                   | RQ: α = 0.00183, ℓ = 11.6

4.4 Algorithm Selection

Once the best models for each Machine Learning approach were chosen, the next step was to determine which of the three was best at calibrating the implied volatility surface. This was determined by analysing the Root Mean Squared Error and Mean Absolute Error on the test set. The comparison was done in multiple steps, as there were 27 different models in total. In the first step the optimal data preprocessing method was chosen by analysing the sample errors. The errors for an individual test sample were defined as,

RMSE = sqrt( (1/k) Σ_{i=1}^{k} (y_i − f_θ(x_i))² ), (4.1)

MAE = (1/k) Σ_{i=1}^{k} |y_i − f_θ(x_i)|, (4.2)

where k was the number of pivot points, y the target value and f_θ(x) the predicted value. In the second step the distribution of RMSE and MAE was compared between the different models in order to determine how many ATM options were needed in the intraday data. In the third step the RMSE and MAE were compared to determine which model generalized best. Moreover, the models were compared in terms of which model had moved the most from the previous day IVS towards the future IVS.
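Equations (4.1) and (4.2) for a single test sample translate directly to NumPy; the pivot-point values below are made up for illustration:

```python
import numpy as np

def sample_errors(y_true, y_pred):
    # eqs. (4.1)-(4.2): errors over the k pivot points of one sample
    diff = y_true - y_pred
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    mae = float(np.mean(np.abs(diff)))
    return rmse, mae

target = np.array([0.20, 0.22, 0.25, 0.30])      # target IVS pivot points
calibrated = np.array([0.21, 0.22, 0.24, 0.28])  # model output
rmse, mae = sample_errors(target, calibrated)
```

Because RMSE squares the residuals before averaging, it always satisfies RMSE ≥ MAE and penalizes a few large pivot-point errors more heavily than many small ones.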


As a final step, the best model was compared to the previous day IVS (Benchmark) in terms of RMSE and MAE, to evaluate whether the prediction of the model could outperform it. This comparison was made both per sample and per pivot point. The error for pivot point i was defined as,

AE_i = |y_i − f_θ(x_i)|. (4.3)

Chapter 5

Results and Analysis

In this chapter the empirical results from feeding the three best Machine Learning models with implied volatility surface and intraday data are presented. All models are tested with the original, normalized and PCA augmented data. The results are divided into data comparison, model comparison and lastly Benchmark comparison.

5.1 Data Comparison

As mentioned in section 4.2.3, the data is split into three differently preprocessed data sets with three different amounts of intraday at the money data. In this section the models are compared based on the different input data, where the optimal one is chosen for the remaining analysis.

5.1.1 Feature Engineering

In order to compare which data preprocessing gives the best calibration, the methods are compared by their error distributions, see figure 5.1.


(a) Root Mean Squared Error. (b) Mean Absolute Error.

Figure 5.1: Comparison of Root Mean Squared Error and Mean Absolute Error between the three different approaches to preprocess the input data.

As seen in figure 5.1, and the remaining figures in appendix A.2, the original data set performs slightly worse in terms of the median. Another observation is that the number of outliers seems to be very high. Since there are many outliers and it is hard to distinguish the optimal preprocessing method convincingly, the average values of the errors are also compared, presented in tables 5.1 and 5.2.

Table 5.1: Mean Root Mean Squared Error. A blue colored cell indicates that it is the overall best score for a given model.

ATM | Original (FNN, GP, LSTM)  | Normalized (FNN, GP, LSTM) | PCA (FNN, GP, LSTM)
3   | 0.04033, 0.03679, 0.03806 | 0.03831, 0.03612, 0.03729  | 0.03716, 0.03655, 0.03723
6   | 0.04031, 0.03657, 0.03823 | 0.03869, 0.03633, 0.03797  | 0.03722, 0.03611, 0.03730
9   | 0.04058, 0.03662, 0.03897 | 0.04017, 0.03650, 0.03815  | 0.03766, 0.03607, 0.03722

Table 5.2: Mean Mean Absolute Error. A blue colored cell indicates that it is the overall best score for a given model.

ATM | Original (FNN, GP, LSTM)  | Normalized (FNN, GP, LSTM) | PCA (FNN, GP, LSTM)
3   | 0.02879, 0.02436, 0.02614 | 0.02663, 0.02332, 0.02525  | 0.02537, 0.02432, 0.02523
6   | 0.02850, 0.02371, 0.02625 | 0.02698, 0.02352, 0.02590  | 0.02531, 0.02376, 0.02522
9   | 0.02904, 0.02376, 0.02711 | 0.02838, 0.02372, 0.02627  | 0.02576, 0.02378, 0.02516

An interesting observation is that the Gaussian Process seems to be less affected by the data preprocessing than the neural network based models. As seen in figure 5.1 and tables 5.1 and 5.2, the original data set performs the worst and, generally, PCA is the best choice for preprocessing the data, although the difference is minor.


5.1.2 ATM Options

With PCA as the optimal data preprocessing step, the optimal number of at the money options is chosen by comparing the distribution of the errors for the different models, see figure 5.2.

(a) Root Mean Squared Error. (b) Mean Absolute Error.

Figure 5.2: Comparison between using three, six or nine closest at the money options in the intraday data per model in terms of Root Mean Squared Error and Mean Absolute Error using Principal Component Analysis data.

As seen in figure 5.2 the difference is marginal, and the number of at the money options in the input data seems to have little effect with the chosen features. Once again there are many outliers in the data, so the average errors are compared in table 5.3.

Table 5.3: Mean Root Mean Squared Error and Mean Absolute Error for Principal Component Analysis processed data. A blue colored cell indicates that it is the overall best score for a given model.

ATM | RMSE (FNN, GP, LSTM)      | MAE (FNN, GP, LSTM)
3   | 0.03716, 0.03655, 0.03723 | 0.02537, 0.02432, 0.02523
6   | 0.03722, 0.03611, 0.03730 | 0.02531, 0.02376, 0.02522
9   | 0.03766, 0.03607, 0.03722 | 0.02576, 0.02378, 0.02516

Based on table 5.3, it seems that 6-9 ATM options give a slightly lower error with PCA data, with 9 being the favorite based on overall best performance.


5.2 Model Comparison

With the previously selected optimal data set, 9 at the money options in the intraday data with PCA preprocessing, the errors of the Machine Learning models are compared, see figure 5.3.

(a) Root Mean Squared Error. (b) Mean Absolute Error.

Figure 5.3: Comparison of Root Mean Squared Error and Mean Absolute Error between the different models for the selected optimal dataset.

Once again there are many outliers and thus the average errors are compared in table 5.4.

Table 5.4: Mean Root Mean Squared Error and Mean Absolute Error for Principal Component Analysis processed data and 9 at the money options intraday data. A blue colored cell indicates that it is the overall best score.

ATM | RMSE (FNN, GP, LSTM)      | MAE (FNN, GP, LSTM)
9   | 0.03766, 0.03607, 0.03722 | 0.02576, 0.02378, 0.02516

As seen in figure 5.3 the distribution of the errors is similar for all models; the median seems to be slightly lower for the GP but with a slightly higher spread, and in terms of mean error the Gaussian Process has the lowest error both in terms of RMSE and MAE, as seen in table 5.4. Further model analysis is done by comparing how much each model has moved from the previous implied volatility surface (input) towards the future IVS (target), see figure 5.4.


(a) Root Mean Squared Error. (b) Mean Absolute Error.

Figure 5.4: Comparison to see if either model is closer to the previous or future end of day implied volatility surface in terms of Root Mean Squared Error and Mean Absolute Error for optimal data set.

As seen in figure 5.4 all of the models are closer to the previous EOD IVS, where the GP model seems to be closest to the previous day both in terms of RMSE and MAE. This implies that the model that has moved the least from the previous day surface seems to perform the best, although only marginally.

5.3 Benchmark Comparison

The optimal Machine Learning model is the Gaussian Process trained with 9 at the money options in the input data, preprocessed using PCA. To evaluate if the model can yield a closer estimate of the future EOD implied volatility surface than the previous IVS, the errors are compared in figure 5.5.


(a) Benchmark Root Mean Squared Error. (b) Gaussian Process Root Mean Squared Error.

Figure 5.5: Comparison of sample Root Mean Squared Error and Mean Absolute Error between the Benchmark and Gaussian Process for the selected optimal dataset.

As seen in figure 5.5 the GP has a similar median when comparing RMSE and a higher median when comparing MAE. An interesting observation is that the spread is smaller for the GP. There are again a lot of outliers and thus the average error is compared in table 5.5.

Table 5.5: Mean Root Mean Squared Error and Mean Mean Absolute Error for Principal Component Analysis processed data with 9 at the money options intraday data and Benchmark results, i.e. using the previous day IVS. A blue colored cell indicates that it is the overall best score.

ATM | RMSE (Benchmark, GP) | MAE (Benchmark, GP)
9   | 0.04111, 0.03607     | 0.02455, 0.02378

As seen in table 5.5 the GP model outperforms the Benchmark in terms of mean RMSE and mean MAE, although the difference is small. Since the difference is small, the distribution of the errors per pivot point is compared to further investigate the difference between the Benchmark and the GP, which can be seen in figure 5.6.


(a) Benchmark Absolute Error. (b) Gaussian Process Absolute Error.

Figure 5.6: Comparison of Absolute Error for the implied volatility surface based on time to maturity and log moneyness.

As seen in figure 5.6 the Benchmark and the GP behave similarly when it comes to the spread of the error in the pivot points. The highest errors seem to be in the "wings" of the implied volatility surface; there, the errors seem to be lower for the Benchmark but with a larger spread. The mean error per pivot point is also compared in figures 5.7 and 5.8.

(a) Benchmark Root Mean Squared Error. (b) Gaussian Process Root Mean Squared Error.

Figure 5.7: 3D Comparison of the Root Mean Squared Error for the implied volatility surface based on time to maturity and log moneyness.


(a) Benchmark Mean Absolute Error. (b) Feed Forward Neural Network Mean Absolute Error.

Figure 5.8: 3D Comparison of the Mean Absolute Error for the implied volatility surface based on time to maturity and log moneyness.

The general behaviour of the error is similar for the Benchmark and the GP in terms of RMSE and MAE, where the biggest errors are seen in the wings, as can be seen in figures 5.7 and 5.8. The difference in performance is more distinct when looking at the RMSE, where the Benchmark seems to be more inaccurate in the wings.

Chapter 6

Discussion and Conclusion

In this chapter the results are discussed, conclusions are drawn and future work to improve the implied volatility surface calibration is presented.

6.1 Results Evaluation

The results are evaluated based on the input data, the optimal model and lastly by comparing the optimal model trained with the optimal input data to the Benchmark.

6.1.1 Data Comparison

The distribution of the errors when comparing the RMSE and MAE between the different preprocessing steps behaves similarly for all models. This is not unexpected, since most, but not all, of the features are in the region of 0 to 1 in the original data set, and therefore the influence of each feature is weighted similarly. That the Gaussian Process is less affected by data preprocessing can be motivated by the fact that the model is not trained by minimizing MSE, which means it is not as sensitive to the higher weighted values. Although the difference in terms of increased performance is smaller for GP, it can still be motivated that normalization is a good idea for multiple reasons, such as the assumption that the mean is zero. The main motivation to use PCA is not to significantly increase performance in terms of error, but rather to reduce the dimension while maintaining the same information, which the results confirm, and also to further speed up the training process. There was a slight difference in performance based on the number of at the money options, but it was almost negligible. A reason can be that the features gathered from the options are not the best at explaining the variations in implied volatility, which means that the models could not find a clear pattern in order to improve the prediction.


6.1.2 Model Comparison

As for the model comparison, the distribution of the errors behaves similarly across models, although the spread seems to be slightly higher for the GP, which on the other hand has a lower median and mean. An explanation could be that the GP tends to predict a surface closer to the previous day's surface than the other models do. The property that the GP does not diverge as far from the previous day's IVS can be seen as a strength, since many surfaces do not change much between two consecutive days due to stable market conditions. This could also explain why there is a significant number of outliers among the errors.

6.1.3 Benchmark Comparison

The error of the implied volatility is higher in the wings, indicating that this region changes the most from day to day and that it is harder for the model to calibrate there. This is expected, since movements far from ATM are harder for the market to predict and, as a consequence, harder to calibrate accurately. Another factor that can contribute to the poor performance in the wings is that there is no information in the input data explaining the behavior in those areas, as all non-IV features come from close to ATM options rather than ITM and OTM options. The median errors at the wings are smaller for the Benchmark but with a larger spread compared to the GP model. This could be explained by the fact that most surfaces do not change much between days, but when they do change, the difference is larger in the wings, and the GP captures the behaviour in that region better, resulting in a lower spread. The difference in mean error between the Gaussian Process and the Benchmark is larger in terms of RMSE than MAE. This implies that there are more outliers for the Benchmark and that the Benchmark, when it is incorrect, is on average more incorrect than the calibration. That the GP has fewer outliers can also explain the lower spread in its error distribution. It can also be argued that the calibration will yield a better implied volatility surface when there is a significant change in the market; however, this is usually not the case, which is probably why the median error of the Benchmark is still at the same level as, or lower than, the calibration error.
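The observation that the RMSE gap is larger than the MAE gap follows from how the two metrics weight outliers: squaring the errors lets a single large wing error dominate the RMSE, while the MAE grows only linearly. A small sketch with synthetic error vectors (the error magnitudes are illustrative assumptions, not thesis data):

```python
import numpy as np

def rmse(e):
    return float(np.sqrt(np.mean(np.square(e))))

def mae(e):
    return float(np.mean(np.abs(e)))

clean = np.full(100, 0.01)     # uniformly small errors
outlier = clean.copy()
outlier[0] = 0.50              # one large "wing" error

# MAE grows roughly linearly with the outlier, RMSE much faster
print(mae(clean), mae(outlier))    # 0.0100 vs ~0.0149
print(rmse(clean), rmse(outlier))  # 0.0100 vs ~0.0510
```

A model with fewer outliers therefore looks disproportionately better under RMSE than under MAE, which is consistent with the comparison above.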

6.2 Conclusion

From this thesis it can be concluded that a Machine Learning based intraday calibration can yield a marginally closer estimate of the future EOD implied volatility surface than the previous EOD IVS. The improvement can be argued for, as both the spread of the error distribution and the average error are lower; however, the improvement is small. There are also indications that adding market data, in the form of close to at the money option features, can help with calibrating the surface, but more research is needed on finding the most relevant features. From the research it is also evident that the IVS does not change much from day to day, implying that calibration is not always required; thus, being able to identify when calibration is necessary is key for a successful model.


6.3 Future Work

A really important part of using Machine Learning is the data. It is therefore recommended to create the implied volatility surface using real option trades, instead of synthetic trades created from the order book. This could have a large impact, as the IVS makes up the larger part of the feature space when training the models.

Since 9 at the money options marginally gave the best results, it could be of interest to add more options to the input data. The added features could also come from ITM and OTM options to capture the trickier areas such as the wings of the IVS. One approach could be to divide the surface into regions, e.g. a region for negative log moneyness (< −0.2), an ATM region (−0.2 to 0.2) and a region for positive log moneyness (> 0.2). For each region, one or a few options would be collected to better represent the full surface. The same procedure could be applied to time to maturity if necessary.

It was quite evident that the wings of the implied volatility surface had the largest errors. One could therefore penalize the pivot points at the wings more heavily when training the models, thus putting more emphasis on calibrating the more volatile areas. Building on this idea, it could also be of interest to calibrate only specific pivot points which seem to be more sensitive to movements in the market.

Since the model that performed best was the one that moved the least from the previous day's IVS, it would be interesting to train the ML models only on data from the cases where calibration is required. This is motivated by the observation that the models seem to treat the days with higher movement as noise rather than as a surface that needs to be calibrated.
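The region-splitting idea above can be sketched as a simple bucketing function. The ±0.2 log-moneyness thresholds follow the split suggested in the text; the function name and the sample quotes are illustrative assumptions.

```python
def moneyness_region(log_moneyness, lower=-0.2, upper=0.2):
    """Assign an option to one of three regions of the surface,
    using the +/-0.2 log-moneyness split suggested above."""
    if log_moneyness < lower:
        return "left_wing"    # negative log moneyness
    if log_moneyness > upper:
        return "right_wing"   # positive log moneyness
    return "atm"

quotes = [-0.35, -0.1, 0.0, 0.15, 0.3]
print([moneyness_region(m) for m in quotes])
# ['left_wing', 'atm', 'atm', 'atm', 'right_wing']
```

From each bucket, one or a few representative options could then be selected as features, so that the wings are explicitly represented in the input data.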

Bibliography

[1] Keep Your Eye On Record Large Global Derivatives Markets. url: https://www.forbes.com/sites/mayrarodriguezvalladares/2019/12/08/keep-your-eye-on-record-large-global-derivatives-markets/#215b794e1a82.

[2] Record Level Exchange-Traded Derivatives Volumes Were Led By Asia-Pacific And South America. url: https://www.forbes.com/sites/mayrarodriguezvalladares/2019/02/06/record-level-exchange-traded-derivatives-volumes-were-led-by-asia-pacific-and-north-america/#48ae65242d82.

[3] Bruce Tuckman. "Derivatives: Understanding Their Usefulness and Their Role in the Financial Crisis". In: Journal of Applied Corporate Finance 28.1 (Mar. 2016), pp. 62–71. issn: 1078-1196. doi: 10.1111/jacf.12159. url: http://doi.wiley.com/10.1111/jacf.12159.

[4] Jan De Spiegeleer et al. "Machine learning for quantitative finance: fast derivative pricing, hedging and fitting". In: Quantitative Finance 18.10 (Oct. 2018), pp. 1635–1643. issn: 14697696. doi: 10.1080/14697688.2018.1495335.

[5] Yaxiong Zeng and Diego Klabjan. "Online adaptive machine learning based algorithm for implied volatility surface modeling". In: Knowledge-Based Systems 163 (2019), pp. 376–391. issn: 09507051. doi: 10.1016/j.knosys.2018.08.039. url: https://doi.org/10.1016/j.knosys.2018.08.039.

[6] Damien Ackerer, Natasa Tagasovska, and Thibault Vatter. "Deep Smoothing of the Implied Volatility Surface". In: (June 2019). url: http://arxiv.org/abs/1906.05065.

[7] Huisu Jang and Jaewook Lee. "Generative Bayesian neural network model for risk-neutral pricing of American index options". In: Quantitative Finance 19.4 (Apr. 2019), pp. 587–603. issn: 1469-7688. doi: 10.1080/14697688.2018.1490807. url: https://www.tandfonline.com/doi/full/10.1080/14697688.2018.1490807.

[8] Adam S. Iqbal. Volatility: Practical Options Theory. 2018. isbn: 9781119501671.

[9] Jim Gatheral. The Volatility Surface: A Practitioner's Guide. John Wiley & Sons, 2006, p. 210. isbn: 978-0-471-79251-2. doi: 10.1002/9781118967553.part2.

[10] John Hull. Risk Management and Financial Institutions. 4th ed. Wiley Finance Series. 2015. isbn: 1-118-95595-1.

[11] Martin Martens and Jason Zein. "Predicting financial volatility: High-frequency time-series forecasts vis-à-vis implied volatility". In: Journal of Futures Markets


24.11 (Nov. 2004), pp. 1005–1028. issn: 02707314. doi: 10.1002/fut.20126. url: http://doi.wiley.com/10.1002/fut.20126.

[12] Yu Zheng, Yongxin Yang, and Bowei Chen. "Gated deep neural networks for implied volatility surfaces". In: (Apr. 2019). url: http://arxiv.org/abs/1904.12834.

[13] Forbes. Roundup Of Machine Learning Forecasts And Market Estimates, 2020. 2020. url: https://www.forbes.com/sites/louiscolumbus/2020/01/19/roundup-of-machine-learning-forecasts-and-market-estimates-2020/#76a10b745c02.

[14] Gareth James et al. An Introduction to Statistical Learning with Applications in R. 2013. isbn: 9781461471370. doi: 10.1007/978-1-4614-7138-7_8.

[15] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. 2009, pp. 1–37. isbn: 9780387848570. doi: 10.1007/b94608_4.

[16] Blanka Horvath, Aitor Muguruza, and Mehdi Tomas. "Deep Learning Volatility". In: SSRN Electronic Journal (Feb. 2019). doi: 10.2139/ssrn.3322085.

[17] Christopher Michael Bishop. Pattern Recognition and Machine Learning. Vol. 27. 1. 2006. isbn: 9780387310732.

[18] Ian Goodfellow. Deep Learning. Adaptive Computation and Machine Learning. 2016. isbn: 9780262035613.

[19] Bekir Karlik and A. Vehbi Olgac. Performance Analysis of Various Activation Functions in Generalized MLP Architectures of Neural Networks. Tech. rep. 2011.

[20] Henrik Hult. "Gaussian Process - Lecture Notes". In: (2019), pp. 47–53.

Appendix A

A.1 List of OMXS30 listed companies 2019.

Company                     ISIN
ABB Ltd                     CH0012221716
Alfa Laval                  SE0000695876
Assa Abloy B                SE0007100581
Astra Zeneca                GB0009895292
Atlas Copco A               SE0011166610
Atlas Copco B               SE0011166628
Autoliv Inc. SDB            SE0000382335
Boliden                     SE0012455673
Electrolux B                SE0000103814
Ericsson B                  SE0000108656
Essity B                    SE0009922164
Getinge B                   SE0000202624
Hennes & Mauritz B          SE0000106270
Hexagon AB B                SE0000103699
Investor B                  SE0000107419
Kinnevik B                  SE0013256682
Nordea Bank                 FI4000297767
Sandvik                     SE0000667891
Securitas B                 SE0000163594
SEB A                       SE0000148884
Skanska B                   SE0000113250
SKF B                       SE0000108227
SSAB A                      SE0000171100
SCA B                       SE0000112724
Svenska Handelsbanken A     SE0007100599
Swedbank A                  SE0000242455
Swedish Match               SE0000310336
Tele2 B                     SE0005190238
Telia Company               SE0000667925
Volvo B                     SE0000115446


A.2 Feature Engineering

Figure A.1: Comparison of Root Mean Squared Error (a) and Mean Absolute Error (b) between the three different approaches to preprocess the input data.

Figure A.2: Comparison of Root Mean Squared Error (a) and Mean Absolute Error (b) between the three different approaches to preprocess the input data.

