Delft University of Technology Faculty of Electrical Engineering, Mathematics and Computer Science Delft Institute of Applied Mathematics

Application of State Space Hidden Markov Models to the approximation of (embedded) option prices

A thesis submitted to the Delft Institute of Applied Mathematics in partial fulfillment of the requirements

for the degree

MASTER OF SCIENCE in APPLIED MATHEMATICS

by

Josephine Alberts

Delft, the Netherlands
November 2016

Copyright © 2016 by Josephine Alberts. All rights reserved.

MSc THESIS APPLIED MATHEMATICS

“Application of State Space Hidden Markov Models to the approximation of (embedded) option prices”

Josephine Alberts

Delft University of Technology

Daily Supervisors: Prof. Dr. Ir. C.W. Oosterlee, Dr. Ir. L.A. Grzelak, Ir. S.N. Singor

Responsible Professor: Prof. Dr. Ir. C.W. Oosterlee

Other thesis committee members: Dr. P. Cirillo

November 2016 Delft, the Netherlands

Acknowledgments

This thesis has been submitted for the degree Master of Science in Applied Mathematics at Delft University of Technology. The responsible professor is Kees Oosterlee, professor at the Numerical Analysis group of Delft Institute of Applied Mathematics. Research for this project was carried out at Ortec Finance, under the supervision of Stefan Singor. Ortec Finance is a company aiming to improve investment decision-making by providing consistent solutions for risk and return management through a combination of market knowledge, mathematical models and information technology.

First of all, I would like to thank Kees Oosterlee, Stefan Singor and Lech Grzelak for their close involvement in this project and their valuable advice. I would also like to thank Pasquale Cirillo for being part of the examination committee. Furthermore I would like to thank my colleagues at Ortec Finance for providing a pleasant and inspiring working environment. Lastly, I would like to thank my family and friends for all their encouragement and moral support over the whole duration of my studies.

Abstract

This thesis discusses dimension reduction of the risk drivers that determine embedded option values by using the class of State Space Hidden Markov Models. As embedded options are typically valued by nested Monte Carlo simulations, this dimension reduction leads to a major reduction in computing time. This is especially important for insurance companies that deal with many embedded option valuations in order to determine the market value of their liabilities. To achieve the dimension reduction of the risk driver process, this thesis proposes a specific Hidden Markov Model approach. An overview of current methods for state and parameter inference within this class of models is presented. For the state-of-the-art CPF-SAEM method, insights are obtained by investigating an example of the dimension reduction model. Furthermore, the satisfactory behavior of this HMM approach is investigated in more detail for multiple (market) cases. Lastly, the dimension reduction model is applied to calibration of the Heston model parameters to market data. It is shown that this approach avoids overfitting issues and results in a more stable model than direct calibration of the parameters.

Contents

1 Introduction
  1.1 General setting
  1.2 Research objectives
  1.3 Organization of the report

2 Overview of State Space Hidden Markov Models
  2.1 Introduction to Hidden Markov Models
  2.2 State inference
  2.3 Combined state and parameter inference

3 Dimension reduction in option valuation models by a HMM approach
  3.1 Model description
  3.2 Black-Scholes example
  3.3 Benchmark: Kalman Filter within the EM framework
  3.4 Solving the BS example with the CPF-SAEM method
  3.5 Influence of the underlying HMM
  3.6 Conclusions

4 Test cases for the HMM approach
  4.1 Non-linear example
  4.2 Extensive example: Heston model
  4.3 Market example: basket of S&P-500 index options
  4.4 Conclusions

5 Application to reduction of overfitting in the Heston model
  5.1 Calibration of the Heston model
  5.2 Overfitting
  5.3 Hidden Markov Model approach
  5.4 Out-of-sample testing
  5.5 Conclusions

6 Conclusions
  6.1 Summary and conclusions
  6.2 Future research

References

A Conditioning on the particle with highest weight

B The Unscented Kalman Filter

C Correlation matrices for the risk drivers in Section 4.3

D Market Data for the Heston Calibration in Chapter 5

E Alternative conditionings for the second out-of-sample test in Section 5.4

CHAPTER 1

Introduction

1.1 General setting

Asset and Liability Management

Asset and liability management (ALM) plays an important role in the strategic decision making of liability driven companies, such as insurers, pension funds, housing corporations and banks. It refers to the practice of managing the risks faced by a company that arise due to a mismatch between assets and liabilities [68]. Within ALM the maximum allowable risk with respect to the objectives and constraints of the stakeholders is determined by analyzing the balance sheet. Thereafter, it helps specify policies which provide optimal returns given that maximum risk. ALM models are used as guidance to determine, for example, contribution, premium, indexation and investment policies. Besides this, the models also need to provide insight and transparency for regulating authorities and for other stakeholders, especially after the financial crisis of 2008.

In most ALM problems many different stakeholders are involved. In a pension plan the stakeholders are, for example, the sponsor, employees and beneficiaries (retired and non-active members of the plan). An insurance company has to consider, for example, policyholders and shareholders. Both also need to take into consideration indirect stakeholders such as regulators, government and accountants. This wide variety of stakeholders can have conflicting interests and requirements. For example, for shareholders it is important to have stable and high returns on their invested equity. However, investing in risky assets which provide higher expected returns implies more solvency risk, and this is not allowed by the regulator [69].

In practice ALM problems are approached with scenario analysis, in which external uncertainties are modeled by a set of possible plausible future developments, called scenarios.
The external uncertainties concern both the future development of economic variables such as interest rates, risk premiums of equity and inflation, and the development of non-economic variables such as coverage ratio and the size and composition of the group of policyholders. The scenarios are constructed to capture as many stylized facts of the market as possible, based on historical data and assumptions (market models and expert views). These scenarios form the input for an ALM model which determines scores on the required ALM criteria with respect to the objectives and

constraints that the management of the company has set. In Figure 1.1 a visualization of the scenario approach for ALM problems is given. We refer to recent Ortec Finance papers for a complete overview [59] and the relevance [60] of this scenario approach. In the comprehensive handbooks [68, 69] more information about general ALM techniques can be found.

Figure 1.1: ALM approach by scenario analysis, adapted from [69]

Valuation of embedded options

Following the financial crisis of 2008, new regulatory frameworks and accounting standards (e.g. Solvency II) were introduced. Insurance companies and pension funds are now obliged to value their liabilities at market value instead of at book value (which meant that future cash flows were simply discounted with a fixed interest rate) [62]. Especially for insurance companies, asset and liability management has become much more complicated, because they need to determine the amount of capital they have to hold against unforeseen losses. A difficult aspect of this calculation is the market valuation of so-called embedded options. An embedded option is built into the structure of a financial security and gives one of the parties the right, but not the obligation, to exercise some action by a certain date on terms that are established in advance. These options typically have a long contract duration and are very sensitive to interest rates. An example is a policy conversion option that gives the insurance policyholder the right to convert from the current policy to another at pre-specified conditions [55]. Ortec Finance has developed an advanced simulation framework in which these complicated insurance liabilities can be modeled and many other questions concerning investment decisions can be answered. The value of an embedded option calculated by Ortec Finance is denoted by

V (r1, . . . , rn).

In their valuation model, the price thus depends on n economic and non-economic variables, the so-called risk drivers [51]. To value an embedded option at time step t > 0, real world scenarios are generated based on assumed distributions for all of these risk drivers under the real

world measure P. These distributions correspond to appropriate time-series models for specific variables, for example a Hull-White model for modeling interest rates. As mentioned before, real world scenarios are thus instances of all of the risk drivers (r̂_1, . . . , r̂_n)_t at each time step t > 0. Since the valuation function V(r̂_1, . . . , r̂_n) is typically not known in closed form, multiple Monte Carlo simulations have to be generated under the risk neutral measure Q in order to determine the option value. Note that these so-called risk neutral scenarios have to be calculated for every real world scenario at each time step. This leads to time-consuming nested Monte Carlo simulations (see Figure 1.2). To reduce computing times for the clients of Ortec Finance, we would like to reduce the dimension of the risk driver process in the option valuation model. This will lead to a major reduction of the number of real world scenarios required for the option valuation, and therefore to an even greater reduction of the number of risk neutral scenarios. For example, if we have 5 risk drivers determining the option price and we want to generate scenarios with 3 possible realizations for each of the risk drivers, this will result in 3^5 = 243 different real world scenarios. If we could approximate the option price using only 2 risk drivers, it would require only 3^2 = 9 real world scenarios to obtain 3 possible realizations of each of the risk drivers. If every real world scenario leads to 10 risk neutral scenarios, we would obtain a reduction of 2340 risk neutral Monte Carlo runs at every time step for this small example. In practice, a risk neutral valuation consists of tens of thousands of scenarios for thousands of real world scenarios at multiple time points. We will obtain the dimension reduction by assuming that the n-dimensional risk driver process is driven by some lower dimensional hidden process which captures the most important properties of the risk drivers.
This hidden process needs to be inferred from relevant option market instruments that are observed in the market. We will make use of the class of State Space Hidden Markov Models, which provides a general modeling framework and is used in a broad range of applications. Besides the dimension reduction purpose, we will investigate if this Hidden Markov Model approach will lead to more stable out-of-sample option valuations compared to regular calibration of the model parameters. In other words, we will analyze whether this approach avoids overfitting issues and performs well on unseen data.
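As a sanity check on the scenario arithmetic in the example above, the counts can be computed directly. This is a small illustrative sketch; the function name is ours and the numbers (5 risk drivers, 3 realizations each, 10 risk neutral scenarios per real world scenario) are just the example values from the text.

```python
def nested_mc_scenarios(n_drivers, n_realizations, rn_per_rw):
    """Scenario counts per time step for the nested Monte Carlo setup:
    n_realizations^n_drivers real world scenarios, each of which spawns
    rn_per_rw risk neutral scenarios."""
    rw = n_realizations ** n_drivers
    return rw, rw * rn_per_rw

# Full 5-dimensional risk driver process vs. a 2-dimensional hidden process
rw_full, rn_full = nested_mc_scenarios(5, 3, 10)        # (243, 2430)
rw_reduced, rn_reduced = nested_mc_scenarios(2, 3, 10)  # (9, 90)
saving = rn_full - rn_reduced                           # 2340 fewer risk neutral runs
```

The exponential dependence on the number of risk drivers is exactly why even a modest dimension reduction pays off so strongly.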

Figure 1.2: Visualization of the real world (dark blue) and risk neutral (light blue) scenarios

1.2 Research objectives

The aim of this thesis is to investigate dimension reduction of the real world risk driver process determining option values in a risk neutral option valuation model V(r_1, . . . , r_n). We note that for this purpose we cannot rely on standard dimension reduction techniques, such as the well-known Principal Component Analysis [67]. The drawback of these methods is that they only reduce the dimension of the data matrix of the risk driver process at each time step, which makes it unclear how to compute the option value. A method is necessary that also takes the transformation from the lower dimensional process to the option prices into account. The class of State Space Hidden Markov Models fulfills this requirement. Another major advantage of this class of models is that within the model itself we already assume a transition distribution f for the hidden states that drive the risk drivers. Therefore, we do not have to separately calibrate a time-series model to the estimated hidden states in order to generate real world scenarios. We can simply take the estimated states and model parameters and sample realizations for the hidden state process according to this transition distribution f, see Figure 1.3.

Figure 1.3: Visualization of the scenario generation of the hidden states: model parameters θ and hidden states Xt are estimated from historical option prices Vt

The objectives of this thesis are as follows:

- Gain insight into the class of State Space Hidden Markov Models and make an inventory of existing methods for state and parameter inference within these models.
- Propose and test a Hidden Markov Model for dimension reduction of the risk driver process in option valuation models.
- Apply this approach to reduce the problem of overfitting within the Heston model.

Since it is complicated to construct an example where we calculate the price of some embedded option in the nested simulation framework of Ortec Finance, we restrict the analysis in this thesis to calculating values of European call and put options. Note that this means that we replace the risk neutral Monte Carlo simulations based on an instance (r̂_1, . . . , r̂_n)_t of all risk drivers (the gray square in Figure 1.2) by simply calculating, for example, the Black-Scholes formulas for this instance. The key point is that we need a realization of all risk drivers in order to determine the option value (either by risk neutral MC simulations or in closed form). The aim of this thesis is to reduce the dimension of this risk driver process.

A relationship between European options and embedded options can be found, for example, in unit-linked life insurance products. A unit-linked life insurance product is a contract between a policyholder and an insurance company. The policyholder pays either a regular premium or a lump sum, which is invested by the insurance company. The insurance company promises to pay out a guaranteed amount when the contract expires. This guaranteed return is an example of an embedded option and can be seen as a put option written by the insurance company. If the fund value is below the guaranteed value at expiration, the option will be ‘in-the-money’ and the insurance company will have to settle the difference. A very crude approximation of the price of such an option can be given by the price of some European put option, although in practice these guarantee options are valued by risk neutral Monte Carlo simulations [51].
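The crude approximation mentioned above can be sketched with the standard Black-Scholes put formula. The contract numbers below (fund value, guaranteed amount, horizon, rate, volatility) are hypothetical, chosen only for illustration.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def bs_put(S, K, T, r, sigma):
    """Black-Scholes price of a European put with spot S, strike K,
    maturity T (in years), risk-free rate r and volatility sigma.
    Here: a crude proxy for a unit-linked guarantee with fund value S
    and guaranteed amount K."""
    N = NormalDist().cdf
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return K * exp(-r * T) * N(-d2) - S * N(-d1)

# Hypothetical guarantee: fund at 100, guarantee 100, 10-year contract,
# 2% risk-free rate, 20% volatility
price = bs_put(100.0, 100.0, 10.0, 0.02, 0.2)
```

In the actual nested simulation framework this closed-form price stands in for a full set of risk neutral Monte Carlo runs, which is exactly the simplification the text describes.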

1.3 Organization of the report

In Chapter 2 we will first give a general introduction on Hidden Markov Models and Bayesian inference. After this, we will present an overview of existing methods for state inference, with a special focus on Particle Filters. Then, we will show how to use these methods within a framework for combined state and parameter estimation, which leads to the recent CPF-SAEM method. In Chapter 3 we present a new approach for dimension reduction in risk neutral option valuation models by defining a specific Hidden Markov Model. We present an example of this model and test the CPF-SAEM method for state and parameter inference on this example. We continue the examination of our Hidden Markov Model approach in Chapter 4, where we test cases for which we expect convergence difficulties. Besides this, we investigate the estimated hidden states and error distribution of a market example. In Chapter 5 we show that direct calibration of the Heston model parameters leads to overfitting. We then apply our Hidden Markov Model approach to this Heston calibration and show that this results in a more stable model. We finish this chapter by performing two out-of-sample tests. Lastly, Chapter 6 contains the overall conclusions of this thesis, as well as recommendations for future research.

CHAPTER 2

Overview of State Space Hidden Markov Models

In this chapter a (non-exhaustive) overview of methods available for inference in State Space Hidden Markov Models is presented. For clarity, only a brief overview of the most important methods is given. We especially elaborate on the famous Kalman Filter and on methods relevant for the understanding of the sophisticated CPF-SAEM algorithm introduced in 2013 in [45]. In Section 2.1 a general introduction on Hidden Markov Models and Bayesian inference is presented. Section 2.2 describes the most important algorithms for state inference in these models. We show how to use these state inference methods in frameworks for combined state and parameter inference in Section 2.3.

2.1 Introduction to Hidden Markov Models

Hidden Markov Models provide a general and flexible framework for modelling time-series in a broad range of applications. Examples of application areas are financial mathematics, telecommunication, gene prediction and speech recognition. A broad and thorough introduction to the field can be found in the books of Cappé [10] and Särkkä [56]. Let (Ω, F, P) be a probability space, where Ω represents the space of all possible states in the real world financial market and P is the physical (or real world) probability measure. By the filtration {F_t}_{t≥0} ⊆ F we represent all information available up to time t. All stochastic processes described in this thesis are defined on this probability space. We start by recalling the definitions of the Markov property and a Markov process. Then we provide the definition of a Hidden Markov Model as given in the tutorial [27].

Definition 2.1 (Markov property [9]). Let (Ω, F, P) be a probability space with a filtration {F_n}_{n≥1} ⊆ F. An X-valued stochastic process {X_n}_{n≥1} adapted to the filtration satisfies the Markov property with respect to {F_n}_{n≥1} if

P(X_n ∈ A | F_s) = P(X_n ∈ A | X_s)

for each A ∈ X and for each s < n.

Definition 2.2 (Markov process [9]). A Markov process is a stochastic process that satisfies the Markov property with respect to its natural filtration.

Definition 2.3 (Hidden Markov Model [27]). Consider an X-valued discrete-time Markov process {X_n}_{n≥1} such that

X_1 ∼ µ(x_1) and X_n | (X_{n−1} = x_{n−1}) ∼ f(x_n | x_{n−1}),   (2.1)

where X_n is the state of the model at time n, µ(x) is a probability density function and f(x | x′) denotes the transition probability density associated with moving from x′ to x. We are interested in {X_n}_{n≥1} but can only observe an Y-valued process {Y_n}_{n≥1}. Given {X_n}_{n≥1}, the observations {Y_n}_{n≥1} are statistically independent and their marginal densities are given by

Y_n | (X_n = x_n) ∼ g(y_n | x_n),   (2.2)

where g(y | x) denotes the observation probability density. Models compatible with (2.1)-(2.2) are called Hidden Markov Models (HMM) or general state-space models.

In Figure 2.1 we show the dependence structure of a HMM graphically. The observations {Yn}n≥1 can for example represent the observed value of embedded option(s) at time n.

Figure 2.1: Graphical representation of the dependence structure of a Hidden Markov Model.
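As an illustration of Definition 2.3, a minimal linear Gaussian instance of (2.1)-(2.2) can be simulated directly. The parameter values below are arbitrary choices for the sketch, not the thesis's risk-driver model.

```python
import random

def simulate_hmm(T, a=0.9, q=0.5, r=0.3, seed=1):
    """Simulate a 1-D linear Gaussian HMM:
    X_1 ~ N(0, q^2),
    X_n | (X_{n-1} = x) ~ N(a*x, q^2)   (the transition density f),
    Y_n | (X_n = x)    ~ N(x, r^2)      (the observation density g)."""
    rng = random.Random(seed)
    xs, ys = [], []
    x = rng.gauss(0.0, q)              # X_1 ~ mu
    for _ in range(T):
        xs.append(x)
        ys.append(rng.gauss(x, r))     # observe Y_n given X_n
        x = rng.gauss(a * x, q)        # propagate via f
    return xs, ys

hidden, observed = simulate_hmm(200)
```

The dependence structure of Figure 2.1 is visible in the code: each observation depends only on the current hidden state, and each hidden state only on its predecessor.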

We now want to estimate the hidden states x_{1:T} = (x_1, . . . , x_T) from the observed measurements y_{1:T} = (y_1, . . . , y_T). This means, in a Bayesian sense, that we want to compute the joint posterior distribution of all states given all observations. To achieve this, note that the Hidden Markov Model given by (2.1) and (2.2) can be analyzed using Bayesian techniques, where the joint prior distribution is given by

p(x_{1:T}) = µ(x_1) ∏_{n=2}^{T} f(x_n | x_{n−1}),   (2.3)

and the joint likelihood function by

p(y_{1:T} | x_{1:T}) = ∏_{n=1}^{T} g(y_n | x_n).   (2.4)

Now the posterior distribution can be calculated by a straightforward application of Bayes’ theorem and equations (2.3)-(2.4):

p(x_{1:T} | y_{1:T}) = p(x_{1:T}, y_{1:T}) / p(y_{1:T}) = p(y_{1:T} | x_{1:T}) p(x_{1:T}) / ∫ p(x_{1:T}, y_{1:T}) dx_{1:T}
                     = [ µ(x_1) ∏_{n=1}^{T} g(y_n | x_n) ∏_{n=2}^{T} f(x_n | x_{n−1}) ] / ∫ p(x_{1:T}, y_{1:T}) dx_{1:T},   (2.5)

where p(y_{1:T}) can be seen as a normalizing constant. In a few special cases it is possible to calculate the posterior (2.5) in closed form. However, for most non-linear non-Gaussian models this is not possible and we have to rely on numerical methods to estimate it. In the next section we will investigate techniques to sample from this posterior distribution and its marginals.

2.2 State inference

Within state inference we can distinguish between the optimal filtering and smoothing problems. Filtering means estimating the underlying hidden states up to time n, given the observations up to time n. It can refer to sequentially estimating the joint distributions {p(x_{1:n} | y_{1:n})}_{n≥1}; alternatively, in some literature the term is used to describe estimation of the marginal distributions {p(x_n | y_{1:n})}_{n≥1}. In this thesis we will state explicitly which filtering distribution we refer to. Smoothing means using future observations when estimating distributions at a certain time, i.e. using filtering techniques to sequentially estimate the marginals {p(x_k | y_{1:n})} where k ≤ n. In general, smoothing is computationally more challenging but leads to smoother trajectory estimates than filtering.

We can recursively compute the filter distributions p (x1:n | y1:n) and p (xn | y1:n) of the HMM defined by (2.1) and (2.2) [11]. Since we know that

p(x_{1:n}, y_{1:n}) = p(x_{1:n−1}, y_{1:n−1}) f(x_n | x_{n−1}) g(y_n | x_n),

we can consequently calculate the posterior by Bayes’ theorem and the Markov property of the HMM with the following recursion

p(x_{1:n} | y_{1:n}) = p(x_{1:n−1} | y_{1:n−1}) f(x_n | x_{n−1}) g(y_n | x_n) / p(y_n | y_{1:n−1}),

where, by the law of total probability and the Markov property,

p(y_n | y_{1:n−1}) = ∫ p(x_{n−1} | y_{1:n−1}) f(x_n | x_{n−1}) g(y_n | x_n) dx_{n−1:n}.

By integrating out x_{1:n−1}, the recursion satisfied by the marginal filter distribution p(x_n | y_{1:n}) can be obtained:

p(x_n | y_{1:n}) = g(y_n | x_n) p(x_n | y_{1:n−1}) / p(y_n | y_{1:n−1}),

where

p(x_n | y_{1:n−1}) = ∫ f(x_n | x_{n−1}) p(x_{n−1} | y_{1:n−1}) dx_{n−1}

is called the Chapman-Kolmogorov equation [56, 27]. These recursion formulas explain the sequential approach of all filtering and smoothing methods. The history of state inference in Hidden Markov Models starts with the Wiener Filter in 1950 [66], which was later shown to be a limiting special case of the well-known Kalman Filter. See Figure 2.2 for an overview of the most important methods for estimating filtering and smoothing distributions.
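For a finite state space the filtering recursions above can be evaluated exactly, with the Chapman-Kolmogorov integral becoming a sum over states. The following is a minimal sketch; the two-state model and noisy observation channel are invented purely for illustration.

```python
def discrete_filter(mu, f, g, ys):
    """Exact marginal filtering p(x_n | y_{1:n}) on a finite state space
    {0, ..., K-1}.  mu[j] = P(X_1 = j), f[i][j] = P(X_n = j | X_{n-1} = i),
    g(y, j) = probability/density of observing y in state j."""
    K = len(mu)
    # Bayes update at n = 1
    p = [mu[j] * g(ys[0], j) for j in range(K)]
    s = sum(p)
    p = [v / s for v in p]
    filters = [p]
    for y in ys[1:]:
        # Prediction (Chapman-Kolmogorov sum): sum_i f(j|i) p(X_{n-1}=i | y_{1:n-1})
        pred = [sum(f[i][j] * p[i] for i in range(K)) for j in range(K)]
        # Update: multiply by g(y_n | j); the normalizer equals p(y_n | y_{1:n-1})
        p = [pred[j] * g(y, j) for j in range(K)]
        s = sum(p)
        p = [v / s for v in p]
        filters.append(p)
    return filters

# Two hidden states observed through a noisy channel (80% correct label)
g = lambda y, j: 0.8 if y == j else 0.2
posts = discrete_filter([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], g, [0, 0, 1])
```

The same predict-then-update pattern recurs in every filter discussed below; only the way the integrals are approximated changes.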

Figure 2.2: Overview of methods for state inference in Hidden Markov Models

The Kalman Filter

The widespread Kalman Filter (KF), also known as the Kalman-Bucy Filter, was first introduced and partially developed in the historical papers of Kalman and Bucy in 1960 [40, 41]. Due to its great importance in engineering and econometrics applications, numerous literature studies on the filter can be found. See for example [13] for a recent discussion on the mathematical theory, computational algorithms and applications of the Kalman Filter. An early overview of the filter and its derivation can be found in [4]. For the special case that the state space model is linear and Gaussian, the KF gives an exact numerical evaluation of the underlying hidden states. It is a recursive estimator that first predicts the state estimate from the estimate in the previous time step. Then, it combines this prediction with the current observation to refine the state estimate by calculating a weighted average. The estimates are chosen in such a way that the mean-squared error is minimized. Although the original derivation of the KF was based on the orthogonal projection approach, it is also possible to obtain its equations by a pure probabilistic Bayesian analysis [56]. Mathematically, the Kalman Filter gives an exact solution to the Hidden Markov Model where the hidden states depend linearly on the previous states and the observations depend linearly on

13 the current states, both with some additive noise. This system is given by

xn = An−1xn−1 + Rn−1Un−1,

yn = Bnxn + SnVn, where Un,Vn ∼ N(0,I) and x0 ∼ N(0, Σ0) are all uncorrelated. The state transition matrix An−1, the measurement transition matrix Bn, the square-root of the state process noise covari- ance Rn−1 and the square-root of the measurement noise covariance Sn are all known matrices with appropriate dimensions. Note that this model defines a special case of the general state space model in equation (2.1) n m and (2.2) where X = R , Y = R , µ(x1) = N(0, Σ0) and

T  f (xn | xn−1) = N An−1xn−1,Rn−1Rn−1 , (2.6) T  g (yn | xn) = N Bnxn,SnSn . (2.7)

The Kalman Filter provides an algorithm for recursively calculating the best state estimator x̂_n at time n, and is summarized in Algorithm 1. The filter distributions can be evaluated in closed form and are given by p(x_n | y_{1:n}) = N(x̂_n, Σ_n) [10].

Algorithm 1 Kalman Filter (KF)
for k = 1, . . . , n do
  Prediction step:
    if k = 1 then x̂_k^pred = 0 and Σ_k^pred = Σ_0
    else
      x̂_k^pred = A_{k−1} x̂_{k−1}
      Σ_k^pred = A_{k−1} Σ_{k−1} A_{k−1}^T + R_{k−1} R_{k−1}^T

  Update step:
    ε_k = y_k − B_k x̂_k^pred
    Γ_k = B_k Σ_k^pred B_k^T + S_k S_k^T
    K_k = Σ_k^pred B_k^T Γ_k^{−1}
    x̂_k = x̂_k^pred + K_k ε_k
    Σ_k = Σ_k^pred − K_k B_k Σ_k^pred
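Algorithm 1 translates almost line by line into code. Below is a sketch for the scalar case (all model quantities one-dimensional and time-invariant), so transposes and matrix inverses reduce to ordinary arithmetic; the function name and parameter values are ours.

```python
def kalman_filter_1d(ys, a, b, r, s, sigma0_sq):
    """Kalman Filter (Algorithm 1) for the scalar model
    x_n = a x_{n-1} + r u_{n-1},  y_n = b x_n + s v_n,  u, v ~ N(0, 1).
    Returns the filtered means x_hat_k and variances Sigma_k."""
    means, variances = [], []
    x_hat, sigma_sq = 0.0, sigma0_sq
    for k, y in enumerate(ys):
        # Prediction step
        if k == 0:
            x_pred, sigma_pred = 0.0, sigma0_sq
        else:
            x_pred = a * x_hat
            sigma_pred = a * sigma_sq * a + r * r
        # Update step
        innovation = y - b * x_pred                 # epsilon_k
        gamma = b * sigma_pred * b + s * s          # innovation variance Gamma_k
        gain = sigma_pred * b / gamma               # Kalman gain K_k
        x_hat = x_pred + gain * innovation
        sigma_sq = sigma_pred - gain * b * sigma_pred
        means.append(x_hat)
        variances.append(sigma_sq)
    return means, variances

# With near-noiseless observations (s tiny) the filter should track y closely
means, variances = kalman_filter_1d([1.0, 2.0], 1.0, 1.0, 0.1, 1e-4, 1.0)
```

For vector-valued states the same steps apply with matrix products, transposes and Γ_k^{−1} in place of scalar arithmetic.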

Computation of the closed form smoothing distributions of the linear Gaussian HMM, p (xk | y1:n) where k ≤ n, can be done in a similar way by adding a backward recursion to Algorithm 1. This method was introduced in 1965 [52] and is called the Rauch-Tung-Striebel smoother.

Non-linear Hidden Markov Models

Unfortunately, in most applications the state space model is not linear Gaussian and the Kalman Filter or RTS smoother cannot be used. Therefore, various numerical approximations of these exact methods have been developed over the years. The Taylor series based Extended Kalman Filter (EKF) was introduced shortly after the Kalman Filter and is described in, e.g., [38] and [13]. The popular Unscented Kalman Filter (UKF) [39]

uses an unscented transform for the approximation instead of linearisation, see Appendix B. In general, the UKF can acquire more accurate estimation results than the EKF, but might lead to serious errors for non-Gaussian distributions. The Ensemble Kalman Filter [28] uses a Monte Carlo approximation of the KF and is closely related to the particle filters, which we will discuss in the next section. In 2000, shortly after the introduction of the UKF, it was indicated in [37] that this filter can be seen as a special case of the Gaussian filters [56]. The Gaussian filters solve the non-linear optimal filtering problem by assuming Gaussian density approximations and define a general framework which works well in many applications. However, when for example the filtering distribution is multi-modal or when some state components are discrete, it is not appropriate to use Gaussian approximations [56]. Since we want our approach to be as general as possible we will investigate the use of so-called Sequential Monte Carlo (SMC) methods. These methods do not rely on local linearisation techniques or any functional approximation and are therefore becoming more and more popular, especially since computer power is ever-increasing.

Sequential Monte Carlo Methods

The main idea behind Sequential Monte Carlo (SMC) methods is to obtain a large collection of weighted random samples, named particles, whose empirical distribution converges asymptotically to the distribution we want to sample from, in our case the posterior distribution [27, 18]. For this reason SMC methods are also referred to as Particle Filters in the filtering context. Since their introduction in 1993, see [33], they have become a popular class of methods for inference in non-linear non-Gaussian state space models. An overview of the theory and applications of different SMC methods can be found in [23]. In the thorough tutorial [27] it is shown that essentially all basic and advanced particle filters can be seen as special instances of one generic SMC algorithm. To develop this generic SMC algorithm, assume that we want to sample sequentially from a sequence of target probability densities {π_n(x_{1:n})}, where each distribution π_n(x_{1:n}) is defined on X^n. We require the target distributions to be known up to a normalizing constant, i.e. in

π_n(x_{1:n}) = γ_n(x_{1:n}) / ∫ γ_n(x_{1:n}) dx_{1:n} = γ_n(x_{1:n}) / Z_n,   (2.8)

where γ_n : X^n → R^+, the normalizing constant Z_n might be unknown. Note that if we want to sample from the posterior distribution we take γ_n(x_{1:n}) = p(x_{1:n}, y_{1:n}) and Z_n = p(y_{1:n}), so that π_n(x_{1:n}) = p(x_{1:n} | y_{1:n}), but the SMC approach is more general than this. An important related inference problem is the computation of the expectation of some test function ζ_n : X^n → R over the target distribution:

I_n(ζ_n) = ∫ ζ_n(x_{1:n}) π_n(x_{1:n}) dx_{1:n}.   (2.9)

This is used for example in techniques for parameter inference in Hidden Markov Models, when the expected likelihood function over the posterior distribution is required.

Importance Sampling

Since in general it is not possible to obtain samples directly from the target distribution, we introduce an approximate importance (or proposal) density q_n(x_{1:n}) from which we can easily draw samples. We require that the support of q_n(x_{1:n}) contains the support of π_n(x_{1:n}), i.e. π_n(x_{1:n}) > 0 ⇒ q_n(x_{1:n}) > 0. If we define the importance weight as

w_n(x_{1:n}) = γ_n(x_{1:n}) / q_n(x_{1:n}),   (2.10)

then we obtain from (2.8) the following identities

π_n(x_{1:n}) = w_n(x_{1:n}) q_n(x_{1:n}) / Z_n,   (2.11)

Z_n = ∫ w_n(x_{1:n}) q_n(x_{1:n}) dx_{1:n}.   (2.12)

Now if we can simulate N independent particles X_{1:n}^i ∼ q_n(x_{1:n}), we get by inserting the Monte Carlo approximation of q_n(x_{1:n}) into (2.11) and (2.12)

π̂_n(x_{1:n}) = ∑_{i=1}^{N} W_n^i δ_{X_{1:n}^i}(x_{1:n}),   (2.13)

Ẑ_n = (1/N) ∑_{i=1}^{N} w_n(X_{1:n}^i),   (2.14)

where

W_n^i = w_n(X_{1:n}^i) / ∑_{j=1}^{N} w_n(X_{1:n}^j),   (2.15)

and δ_{x_0}(x) denotes the Dirac delta mass located at x_0, defined by δ_{x_0}(x) = 0 for all x ≠ x_0 and ∫ δ_{x_0}(x) dx = 1 [11]. Consequently, we can obtain an estimate of expectation (2.9) by

I_n^{IS}(ζ_n) = ∫ ζ_n(x_{1:n}) π̂_n(x_{1:n}) dx_{1:n} = ∑_{i=1}^{N} W_n^i ζ_n(X_{1:n}^i).   (2.16)
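The self-normalized estimator (2.13)-(2.16) can be sketched in a static toy setting: estimating the mean of a target distribution known only up to its normalizing constant, using a Gaussian proposal. The target and proposal below are illustrative choices, not taken from the thesis.

```python
import math, random

def snis_mean(log_gamma, sample_q, q_pdf, N=100_000, seed=3):
    """Self-normalized importance sampling estimate of E_pi[X], where
    pi(x) = gamma(x)/Z is known only up to the constant Z, cf. (2.10)-(2.16)."""
    rng = random.Random(seed)
    xs = [sample_q(rng) for _ in range(N)]
    ws = [math.exp(log_gamma(x)) / q_pdf(x) for x in xs]  # unnormalized weights (2.10)
    Z_hat = sum(ws)                                        # proportional to (2.14)
    return sum(w * x for w, x in zip(ws, xs)) / Z_hat      # the estimate (2.16)

# Target: unnormalized N(1, 1) density; proposal: the wider N(0, 2)
log_gamma = lambda x: -0.5 * (x - 1.0) ** 2
q_pdf = lambda x: math.exp(-x * x / 8.0) / (2.0 * math.sqrt(2.0 * math.pi))
estimate = snis_mean(log_gamma, lambda rng: rng.gauss(0.0, 2.0), q_pdf)
```

Because the weights are normalized, the unknown constant Z cancels, which is exactly what makes the method usable for the posterior (2.5), where p(y_{1:T}) is intractable.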

For finite N, I_n^{IS} is biased; however, under weak assumptions the strong law of large numbers applies, that is, I_n^{IS}(ζ_n) → I_n(ζ_n) almost surely as N → ∞ [23].

Sequential Importance Sampling

It is possible to present an algorithm with fixed computational complexity at each time step if we form the importance distribution recursively, i.e. we set

q_n(x_{1:n}) = q_{n−1}(x_{1:n−1}) q_n(x_n | x_{1:n−1}) = q_1(x_1) ∏_{k=2}^{n} q_k(x_k | x_{1:k−1}).   (2.17)

16 i i i This means that at time 1 we sample X1 ∼ q1(x1) and Xk ∼ qk(xk | X1:k−1) at times k = 2, . . . , n i to obtain X1:n ∼ qn(x1:n) at time n. Consequently, the importance weights can be calculated recursively from (2.10) and (2.17)

w_n(x_{1:n}) = γ_n(x_{1:n}) / q_n(x_{1:n})
             = [ γ_{n−1}(x_{1:n−1}) / q_{n−1}(x_{1:n−1}) ] · [ γ_n(x_{1:n}) / ( γ_{n−1}(x_{1:n−1}) q_n(x_n | x_{1:n−1}) ) ]
             = w_{n−1}(x_{1:n−1}) α_n(x_{1:n})
             = w_1(x_1) ∏_{k=2}^{n} α_k(x_{1:k}),   (2.18)

where the incremental importance weight function α_n(x_{1:n}) is given by

α_n(x_{1:n}) = γ_n(x_{1:n}) / ( γ_{n−1}(x_{1:n−1}) q_n(x_n | x_{1:n−1}) ).   (2.19)

Now we can obtain the estimates π̂_n (2.13) and I_n^{IS} (2.16) by using Algorithm 2 [27], with each step carried out for i = 1, . . . , N.

Algorithm 2 Sequential Importance Sampling (SIS) At time n = 1 ˆ i Sample X1 ∼ q1(x1) γ (Xi ) ˆ i 1 1 Compute weights w1 X1 = i q1(X1) ˆ i Compute normalized weights W1 according to (2.15)

At time n ≥ 2
- Sample X_n^i \sim q_n(x_n \mid X^i_{1:n-1})
- Compute weights w_n(X^i_{1:n}) = w_{n-1}(X^i_{1:n-1}) \, \alpha_n(X^i_{1:n}) according to (2.19)
- Compute normalized weights W_n^i according to (2.15)

Practically, for this algorithm it is only required to choose qn(xn | x1:n−1) at time n. A reasonable choice is to select q so that the variance in the importance weights wn(x1:n) is minimized. This is achieved by selecting [27, 25]

    q_n^{opt}(x_n \mid x_{1:n-1}) = \pi_n(x_n \mid x_{1:n-1}).

In many cases it is not possible to sample from this optimal proposal distribution, but the result indicates that q must capture at least some important characteristics of the target distribution. In some scenarios, choosing the prior distribution of the underlying latent process as proposal distribution, i.e. setting q_n(x_n \mid x_{1:n-1}) = f(x_n \mid x_{n-1}), is the only known practical possibility. Especially when using the more involved methods discussed later, this choice can already lead to satisfactory results [5].

When using the SIS method we easily encounter the situation in which almost all particles have zero weights; this is called the degeneracy problem. The reason is that IS, and thus SIS, provides estimates whose importance-weight variances typically increase exponentially with n [27]. To solve this problem we introduce a resampling step, which intuitively removes

all particles with very small weights and duplicates those with large weights. This leads to a significant improvement in stability; however, it comes at the cost of some additional variance in the Monte Carlo approximations [10].

Generic SMC algorithm

In the SIS algorithm the approximation \hat{\pi}_n(x_{1:n}) is based on weighted samples from q_n(x_{1:n}) and does not provide samples from \pi_n(x_{1:n}). To obtain N approximate samples from \pi_n(x_{1:n}) we can sample from the approximation \hat{\pi}_n(x_{1:n}); this means we choose X^i_{1:n} with probability W_n^i, and repeat this N times. This procedure is called resampling, and corresponds to associating a number of offspring N_n^i with each particle X^i_{1:n} so that N_n^{1:N} = (N_n^1, \ldots, N_n^N) is multinomially distributed with parameter vector (N, W_n^{1:N}), and giving each offspring a weight of 1/N. Note that we can efficiently sample from a multinomial distribution in O(N) operations [27].

To get to the generic SMC algorithm we combine SIS and resampling. At time 1 we calculate the IS approximation \hat{\pi}(x_1), which is a weighted collection \{W_1^i, X_1^i\}. Then we resample to get N equally-weighted particles \{1/N, \bar{X}_1^i\}, after which we follow the SIS method and sample X_2^i \sim q_2(x_2 \mid \bar{X}_1^i). This means that the distribution of (\bar{X}_1^i, X_2^i) is approximately \pi_1(x_1) q_2(x_2 \mid x_1), and the corresponding importance weights are just the incremental weights \alpha_2(x_{1:2}), see equation (2.18). Then we resample the particles based on these normalized weights, and so on; see Algorithm 3.

Algorithm 3 Sequential Monte Carlo (SMC) algorithm

At time n = 1
- Sample X_1^i \sim q_1(x_1)
- Compute weights w_1(X_1^i) = \gamma_1(X_1^i) / q_1(X_1^i)
- Compute normalized weights W_1^i according to (2.15)
- Resample \{W_1^i, X_1^i\} to obtain N equally-weighted particles \{1/N, \bar{X}_1^i\}

At time n ≥ 2
- Sample X_n^i \sim q_n(x_n \mid \bar{X}^i_{1:n-1}) and set X^i_{1:n} = (\bar{X}^i_{1:n-1}, X_n^i)
- Compute weights \alpha_n(X^i_{1:n}) according to (2.19)
- Compute normalized weights W_n^i \propto \alpha_n(X^i_{1:n}) according to (2.15)
- Resample \{W_n^i, X^i_{1:n}\} to obtain N new equally-weighted particles \{1/N, \bar{X}^i_{1:n}\}
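In code, the resampling step of Algorithm 3 amounts to drawing offspring indices from a multinomial distribution. The sketch below (function and variable names are ours) also includes the effective sample size, a commonly used summary of the weight variance:

```python
import numpy as np

def multinomial_resample(particles, weights, rng):
    """Draw N offspring indices ~ Multinomial(N, W) and reset all weights to 1/N."""
    N = len(weights)
    idx = rng.choice(N, size=N, p=weights)
    return particles[idx], np.full(N, 1.0 / N)

def effective_sample_size(weights):
    """ESS = 1 / sum(W_i^2): equals N for uniform weights, 1 when fully degenerate."""
    return 1.0 / np.sum(weights ** 2)

rng = np.random.default_rng(1)
N = 1000
particles = rng.normal(size=N)
weights = rng.random(N)
weights /= weights.sum()

# Resample only when the weights have become uneven; N/2 is a common
# (but here assumed) threshold choice.
if effective_sample_size(weights) < N / 2:
    particles, weights = multinomial_resample(particles, weights, rng)
```

Triggering the step only when the ESS falls below a threshold corresponds to the adaptive-resampling variant discussed in the text.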

As stated before, resampling does add some additional variance to the Monte Carlo approximation, and if the particles have weights with small variance this step might not be necessary. Therefore it can be reasonable to resample only when the variance of the weights exceeds some threshold. This variation on Algorithm 3 is called SMC with adaptive resampling.

An important drawback of SMC methods is the fact that they are computationally expensive. However, this can be improved by using, for example, MCMC or stochastic approximation techniques. When discussing combined state and parameter inference in the next section, we will explain some of these more advanced SMC methods.

Due to the resampling procedure, obtaining convergence results for SMC methods is a lot more complicated than it is for the SIS algorithm, where standard results hold. However, there are accurate results available in the literature, see for example [15, 12, 17]. We discuss one illustrative result that provides some insight into the effect of the resampling step.

When estimating \hat{Z}_n given in (2.14) by the regular SIS algorithm, we can directly apply the central limit theorem and the strong law of large numbers to obtain the relative asymptotic variance of \hat{Z}_n [10]:

    \frac{1}{N} \left( \int \frac{\pi_n(x_{1:n})^2}{q_n(x_{1:n})} \, dx_{1:n} - 1 \right).    (2.20)

When we estimate (2.14) by the generic SMC algorithm, including a multinomial resampling step, the relative asymptotic variance of Zˆn is given by

    \frac{1}{N} \left( \int \frac{\pi_n(x_1)^2}{q_1(x_1)} \, dx_1 - 1 + \sum_{k=2}^{n} \left( \int \frac{\pi_n(x_{1:k})^2}{\pi_{k-1}(x_{1:k-1}) \, q_k(x_k \mid x_{1:k-1})} \, dx_{1:k} - 1 \right) \right).    (2.21)

A proof of this expression is not straightforward and is therefore omitted; it can be found in [12] or [17]. When comparing (2.20) with (2.21), we see that in the SMC variance expression the importance distribution q_n(x_{1:n}) is replaced with the importance distributions \pi_{k-1}(x_{1:k-1}) \, q_k(x_k \mid x_{1:k-1}) obtained after the resampling step at time k − 1. This illustrates the fact that the resampling step can be seen as 'resetting' the particle system each time it is applied.

SMC algorithm for particle filtering

For clarity we state Algorithm 3 for the special case that \gamma_n(x_{1:n}) = p(x_{1:n}, y_{1:n}) and Z_n = p(y_{1:n}), so that \pi_n(x_{1:n}) = p(x_{1:n} \mid y_{1:n}). This is often referred to as a standard particle filter (PF). In practice, to implement Algorithm 3 it is only necessary to select the importance distribution q_n(x_n \mid x_{1:n-1}). In [26] it is proved that in order to minimize the variance of the importance weights at time n, we should choose q_n^{opt}(x_n \mid x_{1:n-1}) = \pi_n(x_n \mid x_{1:n-1}), where by the Markov property and conditional independence

    \pi_n(x_n \mid x_{1:n-1}) = p(x_n \mid y_n, x_{n-1}) = \frac{g(y_n \mid x_n) \, f(x_n \mid x_{n-1})}{p(y_n \mid x_{n-1})},

and the incremental weight is \alpha_n(x_{1:n}) = p(y_n \mid x_{n-1}) [27]. In most cases we cannot sample from this distribution, but we should try to approximate it as well as possible. This suggests that we should use an importance distribution of the form

    q_n(x_n \mid x_{1:n-1}) = q(x_n \mid y_n, x_{n-1}).    (2.22)

To get the incremental weight, we combine (2.19) and (2.22):

    \alpha_n(x_{1:n}) = \alpha_n(x_{n-1:n}) = \frac{g(y_n \mid x_n) \, f(x_n \mid x_{n-1})}{q(x_n \mid y_n, x_{n-1})}.

We summarize the Particle Filter in Algorithm 4. At time n we obtain

    \hat{p}(x_{1:n} \mid y_{1:n}) = \sum_{i=1}^{N} W_n^i \, \delta_{X^i_{1:n}}(x_{1:n}).

Algorithm 4 Particle Filter (PF)

At time n = 1
- Sample X_1^i \sim q_1(x_1 \mid y_1)
- Compute weights w_1(X_1^i) = \mu(X_1^i) \, g(y_1 \mid X_1^i) / q(X_1^i \mid y_1)
- Compute normalized weights W_1^i according to (2.15)
- Resample \{W_1^i, X_1^i\} to obtain N equally-weighted particles \{1/N, \bar{X}_1^i\}

At time n ≥ 2
- Sample X_n^i \sim q_n(x_n \mid y_n, \bar{X}_{n-1}^i) and set X^i_{1:n} = (\bar{X}^i_{1:n-1}, X_n^i)
- Compute weights \alpha_n(X^i_{n-1:n}) = g(y_n \mid X_n^i) \, f(X_n^i \mid X_{n-1}^i) / q(X_n^i \mid y_n, X_{n-1}^i)
- Compute normalized weights W_n^i \propto \alpha_n(X^i_{n-1:n}) according to (2.15)
- Resample \{W_n^i, X^i_{1:n}\} to obtain N new equally-weighted particles \{1/N, \bar{X}^i_{1:n}\}
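To make Algorithm 4 concrete, here is a minimal bootstrap particle filter, i.e. the choice q = f, under which the incremental weight reduces to the likelihood g(y_n | x_n). The random-walk model and its parameter values are toy choices of ours:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model (our choice): x_t = x_{t-1} + N(0, sq^2), y_t = x_t + N(0, sr^2).
sq, sr, T, N = 0.5, 0.5, 100, 500

# Simulate a data set
x_true = np.cumsum(rng.normal(0.0, sq, T))
y = x_true + rng.normal(0.0, sr, T)

particles = rng.normal(0.0, 1.0, N)      # X_1^i ~ mu = N(0, 1)
filt_means = np.empty(T)
for t in range(T):
    if t > 0:
        # propagate through the transition density f (bootstrap proposal)
        particles = particles + rng.normal(0.0, sq, N)
    # incremental weight = likelihood g(y_t | x_t), computed in log-space for stability
    logw = -0.5 * ((y[t] - particles) / sr) ** 2
    w = np.exp(logw - logw.max())
    W = w / w.sum()
    filt_means[t] = np.sum(W * particles)        # estimate of the filtering mean
    # multinomial resampling at every step, as in Algorithm 3
    particles = particles[rng.choice(N, size=N, p=W)]

rmse = np.sqrt(np.mean((filt_means - x_true) ** 2))
```

The filtering means track the hidden states noticeably better than the raw observations, which carry the full observation noise.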

2.3 Combined state and parameter inference

In the previous section we implicitly assumed that the parameters of the state space model (2.1)-(2.2) were known. However, in practice our HMM depends on unknown parameters

    X_n \mid (X_{n-1} = x_{n-1}) \sim f_\theta(x_n \mid x_{n-1}), \qquad Y_n \mid (X_n = x_n) \sim g_\theta(y_n \mid x_n),

X_1 \sim \mu_\theta(x_1), and these model parameters \theta \in \Theta need to be estimated as well. Many parameter estimation methods are available in the literature. In this section we concentrate on methods based on the Expectation Maximization (EM) algorithm and on Markov Chain Monte Carlo (MCMC) theory, as these approaches are the most widely used in the HMM context [56]. These methods often iterate between updating \theta and estimating the hidden states x_{1:n}. We can use algorithms from the previous Section 2.2 (for example SMC methods) to address the intermediate state inference problem in each iteration. See Figure 2.3 for an overview of the most important methods for combined state and parameter inference.

EM based methods

The EM algorithm is a method for maximum likelihood inference, i.e. for the problem of finding \hat{\theta}_{ML} = \arg\max_\theta p_\theta(y_{1:T}) when it is not possible to evaluate, and thus optimize, this likelihood directly. The method was introduced in 1977 in [20], and applications to Hidden Markov Models can be found in e.g. [54] or [57]. We define a family \{Q(\cdot, \theta')\}_{\theta' \in \Theta} of real-valued auxiliary functions on \Theta by

    Q(\theta, \theta') = \int \log p_\theta(x_{1:T}, y_{1:T}) \, p_{\theta'}(x_{1:T} \mid y_{1:T}) \, dx_{1:T},    (2.23)

which is thus the expectation of the logarithm of the complete likelihood over the joint posterior distribution of the states (2.5) given parameter \theta'. The EM algorithm is based on the result that Q(\theta, \theta') may be used as a surrogate for p_\theta(y_{1:T}), because increasing Q(\theta, \theta') forces an increase of p_\theta(y_{1:T}), see e.g. [10].

Figure 2.3: Overview of methods for combined state and parameter inference in HMMs

The procedure is initialized at some θ0 ∈ Θ and then iterates between

- (E-step) Compute Q (θ, θk−1),

- (M-step) Compute θk = arg maxθ∈Θ Q (θ, θk−1).

This results in a sequence \{\theta_k\}_{k \geq 0} that, under weak assumptions, converges to a stationary point of the likelihood p_\theta(y_{1:T}) [45]. EM methods are widely known for their numerical stability, in the sense that the likelihood function is increased in every iteration.

Note that since the computation of the E-step includes a complicated multi-dimensional integral, we can approximate it using Monte Carlo integration [65]. For now, assume that we can sample from p_{\theta_{k-1}}(x_{1:T} \mid y_{1:T}). We replace the E-step by the simulation of M_k realizations \{X^j_{1:T}\}_{j=1}^{M_k} from p_{\theta_{k-1}}(x_{1:T} \mid y_{1:T}) and the computation of

    \tilde{Q}_k(\theta) = \frac{1}{M_k} \sum_{j=1}^{M_k} \log p_\theta(X^j_{1:T}, y_{1:T}).

This leads to the Monte Carlo EM algorithm (MCEM). Unfortunately, a drawback of this approach is that it requires the number of particles M_k to grow with each new iteration k of the algorithm [10]. Besides this, we need to sample a whole new set of realizations of the hidden states \{X^j_{1:T}\}_{j=1}^{M_k} at each iteration, which are not reused in later iterations. The Stochastic Approximation EM (SAEM) algorithm [19] makes more efficient use of the simulated variables by replacing \tilde{Q}_k(\theta) with a stochastic averaging procedure

    \hat{Q}_k(\theta) = (1 - \gamma_k)\hat{Q}_{k-1}(\theta) + \gamma_k \left( \frac{1}{M_k} \sum_{j=1}^{M_k} \log p_\theta(X^j_{1:T}, y_{1:T}) \right).    (2.24)

The decreasing sequence of positive step sizes \{\gamma_k\}_{k \geq 0} needs to satisfy \sum_k \gamma_k = \infty and \sum_k \gamma_k^2 < \infty. As k \to \infty, the SAEM algorithm converges to a local maximum of the likelihood function for every fixed M_k (often M_k = 1), see [19] for an extensive proof. The computational advantage of

the SAEM algorithm is especially significant in problems where maximization is much cheaper than simulation.

Note that in our HMM setting it is not possible to sample from the posterior distribution directly, and therefore we cannot use the MCEM and SAEM methods as they stand. A natural idea is to use particle filters to generate the required samples from the posterior distribution. This leads to an SMC analogue of the previous methods, the PSEM method described in [57]. If we take the SAEM approach, it is sufficient to generate a single sample in each iteration. Unfortunately, we still need a large number of particles to obtain proper particle approximations of the posterior. This leads to a computationally expensive E-step at each iteration k.

In [42] it is shown that for convergence of the SAEM algorithm it is not necessary to sample exactly from the posterior distribution. We can also sample from a family of Markov kernels \{M_\theta(x_{1:T} \mid x'_{1:T})\}_{\theta \in \Theta} on X^T that leaves the family of posterior distributions invariant¹. Assume that we have such a family, and let x_{1:T}[k-1] be the draw from the Markov kernel in the previous iteration of the SAEM method. We then sample X_{1:T}[k] \sim M_{\theta_{k-1}}(x_{1:T} \mid x_{1:T}[k-1]) and update \hat{Q} according to

    \hat{Q}_k(\theta) = (1 - \gamma_k)\hat{Q}_{k-1}(\theta) + \gamma_k \log p_\theta(X_{1:T}[k], y_{1:T}).    (2.25)

The next approximation of \theta is then obtained by maximizing this quantity with respect to \theta. To obtain kernels that leave the posterior distribution invariant, we use the MCMC techniques described in the following section.
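The stochastic averaging updates (2.24)-(2.25) can be illustrated on a toy model of our own choosing, in which the posterior over the latent variables is Gaussian and can be sampled exactly; SAEM then recovers the marginal maximum likelihood estimate (here simply the sample mean of y) from a single posterior sample per iteration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model (our choice): X_t ~ N(theta, 1), Y_t | X_t ~ N(X_t, 1).
# Marginally Y_t ~ N(theta, 2), so the MLE is mean(y); SAEM should recover it
# from posterior samples of X alone, via the averaging update (2.24) with M_k = 1.
T = 200
y = rng.normal(2.0, 1.0, T) + rng.normal(0.0, 1.0, T)   # true theta = 2

theta = 0.0    # theta_0
S = 0.0        # running average of the complete-data sufficient statistic
K = 2000
for k in range(1, K + 1):
    # E-step sample: exact posterior here is X_t | y_t, theta ~ N((theta + y_t)/2, 1/2)
    X = rng.normal((theta + y) / 2.0, np.sqrt(0.5))
    gamma_k = 1.0 / k                      # sum gamma_k = inf, sum gamma_k^2 < inf
    S = (1.0 - gamma_k) * S + gamma_k * X.mean()
    theta = S                              # M-step: the argmax of Q_hat is the averaged mean
```

After the loop, theta sits close to the marginal MLE mean(y), even though each iteration used only one posterior draw.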

MCMC based methods

Markov Chain Monte Carlo (MCMC) methods (see e.g. [49, 53]) form a standard class of algorithms for sampling from a complicated target distribution, based on constructing a Markov chain which has the desired distribution as its stationary distribution¹. Widely used examples are the Metropolis-Hastings algorithm [35] and the Gibbs sampler [32]. For the implementation of MCMC methods we again only need to know the target distribution up to a normalizing constant. This makes them highly suitable for sampling from the posterior distribution in parameter inference. A major advantage of MCMC algorithms is the fact that we can ensure asymptotic convergence under weak assumptions. However, when we do not use a suitable proposal distribution to explore the space, the performance of the methods becomes unreliable [5]. This proposal distribution again needs to reflect important properties of the target distribution and needs to be easy to sample from. It is often very complicated to construct such an efficient proposal distribution.

To resolve this problem, it is possible to use SMC methods to build efficient proposal distributions. For example, we can target the true p_\theta(x_{1:n} \mid y_{1:n}) by MCMC methods using the SMC approximation \hat{p}_\theta(x_{1:n} \mid y_{1:n}) as proposal distribution. These particle MCMC (PMCMC) methods were introduced in the seminal paper by Andrieu, Doucet and Holenstein in 2010 [5]. In this paper it is proven that for any fixed number of particles N ≥ 1, the transition kernels of PMCMC methods leave the target density invariant. This means that they are in a way 'exact approximations' to idealized MCMC algorithms. This key feature makes it possible to use these methods within the SAEM framework discussed in the previous section.

¹ Let \{X_n\}_{n \geq 1} be a Markov process with Markov kernel M(A \mid x) defining the probability of reaching the measurable set A \subset X from state x, for all x \in X. A probability distribution \pi is called the invariant or stationary distribution for this Markov process if \pi(A) = \int M(A \mid x) \pi(x) \, dx for all measurable sets A \subset X [43].

We will elaborate on the Particle Gibbs (PG) method [5], where the Markov kernel is constructed by running an SMC sampler in which one particle trajectory x'_{1:T} is specified a priori, a so-called conditional particle filter (CPF). Informally, we can think of this reference trajectory as guiding the simulated particles to a relevant region of the state space. The path x'_{1:T} is ensured to survive all resampling steps. In contrast to the earlier notation in the SMC algorithm, we now introduce a set \{a_n^i\}_{i=1}^{N}, which we call ancestor indices. Here a_n^i is the index of the ancestor particle at time n − 1 of particle X_n^i. To generate a particle at time n, we start by sampling the ancestor index with P(a_n^i = j) \propto w_{n-1}^j. We then sample X_n^i from the proposal distribution

    X_n^i \sim q_{\theta,n}\left( x_n \mid X_{n-1}^{a_n^i}, y_n \right),    (2.26)

and we define the particle trajectory recursively as

    X^i_{1:n} = \left( X^{a_n^i}_{1:n-1}, X_n^i \right).    (2.27)

Note that in this notation the multinomial resampling procedure is done implicitly by sampling ancestor indices for all particles. Since we condition on \{x'_{1:T}\}, we sample according to (2.26) only for i = 1, \ldots, N − 1. The Nth particle and its ancestor index are then set as X_n^N = x'_n and a_n^N = N, respectively.

After a run of this CPF a trajectory X^*_{1:T} is sampled from the particle trajectories, i.e. we draw X^*_{1:T} with P(X^*_{1:T} = X^i_{1:T}) \propto w_T^i. We note that this procedure maps x'_{1:T} to a probability distribution on X^T, implicitly defining a Markov kernel. This PG kernel leaves the exact target distribution invariant for any number of particles. An important drawback of this kernel is that the mixing can be very poor (i.e. the number of steps required to reach the target distribution is large) when there is path degeneracy in the underlying SMC sampler [47]. Unfortunately, some degree of path degeneracy is inevitable in all but trivial cases as a consequence of the resampling procedure.

To address this fundamental problem, it was proposed in [46] to sample a new value for the index a_n^N in a so-called ancestor sampling step. Adding this ancestor sampling step enables fast mixing of the kernel, even when using very few particles in the underlying SMC sampler. The use of this conditional particle filter with ancestor sampling (CPF-AS) therefore significantly reduces the typically long computing times associated with SMC methods. Another advantage of CPF-AS is that it can be implemented in a forward recursion only, and its computational cost is linear in the number of particles N. Consequently, the computational complexity of CPF-AS is in total O(NT).

To construct the Nth particle trajectory as in (2.27), the conditioned particle X_n^N = x'_n has to be associated with an ancestor at time n − 1. Therefore, we sample the ancestor index with

    P(a_n^N = j) \propto w_{n-1}^j \, f_\theta(x'_n \mid X_{n-1}^j).
This can be understood as an application of Bayes' theorem, where w_{n-1}^j is the prior probability of the particle X_{n-1}^j and f_\theta(x'_n \mid X_{n-1}^j) is the likelihood of moving from X_{n-1}^j to x'_n. Note that the only difference with PG is that we would simply set a_n^N = N when using the PG algorithm. However, in [46] it is shown that this small modification significantly improves the mixing of the kernel. We note that we assign the importance weights to all particles analogously to standard particle filtering. In other words, we set w_n^i = W_{\theta,n}(X_n^i, X_{n-1}^{a_n^i}), where the weight function is given by

    W_{\theta,n}(x_n, x_{n-1}) = \frac{g_\theta(y_n \mid x_n) \, f_\theta(x_n \mid x_{n-1})}{q_{\theta,n}(x_n \mid x_{n-1}, y_n)}.    (2.28)

In Algorithm 5 we present a summary of the CPF-AS algorithm. See [61] for a real-world example and an additional description of the algorithm. The key property of the CPF-AS algorithm is given in Theorem 2.1, which states the previous observations more formally, see [45].

Theorem 2.1. Assume that for any \theta \in \Theta and any n \in \{1, \ldots, T\}, S_n^\theta \subset Q_n^\theta, where

    S_n^\theta = \{x_{1:n} \in X^n : p_\theta(x_{1:n} \mid y_{1:n}) > 0\},
    Q_n^\theta = \{x_{1:n} \in X^n : q_{\theta,n}(x_n \mid x_{n-1}, y_n) \, p_\theta(x_{1:n-1} \mid y_{1:n-1}) > 0\}.

Then, for any \theta \in \Theta and any N ≥ 2, the procedure

i) Run Algorithm 5 conditionally on x'_{1:T}
ii) Sample X^*_{1:T} with P(X^*_{1:T} = X^i_{1:T}) \propto w_T^i

defines an irreducible and aperiodic Markov kernel on X^T which has p_\theta(x_{1:T} \mid y_{1:T}) as invariant distribution.

Proof. The invariance property is a consequence of the construction of CPF-AS in [46], and of the fact that the law of x^*_{1:T} is independent of permutations of the particle indices. Irreducibility and aperiodicity follow from Theorem 5 in [5].

Theorem 2.1 says, in other words, that if x'_{1:T} \sim p_\theta(x_{1:T} \mid y_{1:T}), it holds that x^*_{1:T} \sim p_\theta(x_{1:T} \mid y_{1:T}). To understand the intuition behind this result, it can be helpful to consider the two extreme cases. When N = 1, the algorithm will simply return X^*_{1:T} = x'_{1:T}, since we condition on x'_{1:T}. Because x'_{1:T} is distributed according to the filtering distribution, so is x^*_{1:T}. In this situation the correlation between x^*_{1:T} and x'_{1:T} is 1. When N = \infty, the CPF-AS becomes a regular Particle Filter with infinitely many particles, and the conditioning on x'_{1:T} will have negligible effect. Since such a Particle Filter recovers the filtering distribution exactly, we have that x^*_{1:T} \sim p_\theta(x_{1:T} \mid y_{1:T}); we note that x^*_{1:T} is now independent of x'_{1:T}. Choosing a fixed N can be seen as an interpolation between these two extreme cases. The invariance property holds for any N; however, the larger we take N, the less correlated x^*_{1:T} and x'_{1:T} will be [45].

Algorithm 5 CPF with ancestor sampling, conditioned on \{x'_{1:T}\}

At time n = 1
- Draw X_1^i \sim q_{\theta,1}(x_1 \mid y_1) for i = 1, \ldots, N − 1
- Set X_1^N = x'_1
- Compute w_1^i = \mu_\theta(X_1^i) \, g_\theta(y_1 \mid X_1^i) / q_{\theta,1}(X_1^i \mid y_1) for i = 1, \ldots, N

At time n = 2, \ldots, T
- Draw a_n^i with P(a_n^i = j) \propto w_{n-1}^j for i = 1, \ldots, N − 1
- Draw X_n^i \sim q_{\theta,n}(x_n \mid X_{n-1}^{a_n^i}, y_n) for i = 1, \ldots, N − 1
- Draw a_n^N with P(a_n^N = j) \propto w_{n-1}^j \, f_\theta(x'_n \mid X_{n-1}^j)
- Set X_n^N = x'_n
- Set X^i_{1:n} = (X^{a_n^i}_{1:n-1}, X_n^i) for i = 1, \ldots, N
- Compute w_n^i = W_{\theta,n}(X_n^i, X_{n-1}^{a_n^i}) according to (2.28) for i = 1, \ldots, N
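A compact sketch of Algorithm 5 for a toy random-walk model with bootstrap proposal q_{\theta,n} = f_\theta; the model, its parameters and all variable names are our own illustrative choices:

```python
import numpy as np

def cpf_as(y, x_ref, N, sq, sr, rng):
    """One sweep of CPF with ancestor sampling (bootstrap proposal) for the toy
    model x_t = x_{t-1} + N(0, sq^2), y_t = x_t + N(0, sr^2), x_1 ~ N(0, 1).
    Returns one trajectory drawn with probability proportional to w_T^i."""
    T = len(y)
    X = np.empty((T, N))
    anc = np.zeros((T, N), dtype=int)
    w = np.empty((T, N))

    X[0, :N - 1] = rng.normal(0.0, 1.0, N - 1)
    X[0, N - 1] = x_ref[0]                       # conditioned particle survives
    logw = -0.5 * ((y[0] - X[0]) / sr) ** 2      # bootstrap weight: g(y_1 | x_1)
    w[0] = np.exp(logw - logw.max()); w[0] /= w[0].sum()

    for t in range(1, T):
        anc[t, :N - 1] = rng.choice(N, size=N - 1, p=w[t - 1])
        X[t, :N - 1] = X[t - 1, anc[t, :N - 1]] + rng.normal(0.0, sq, N - 1)
        # ancestor sampling: P(a_t^N = j) proportional to w_{t-1}^j f(x_ref[t] | X_{t-1}^j)
        logp = np.log(w[t - 1]) - 0.5 * ((x_ref[t] - X[t - 1]) / sq) ** 2
        p = np.exp(logp - logp.max()); p /= p.sum()
        anc[t, N - 1] = rng.choice(N, p=p)
        X[t, N - 1] = x_ref[t]
        logw = -0.5 * ((y[t] - X[t]) / sr) ** 2
        w[t] = np.exp(logw - logw.max()); w[t] /= w[t].sum()

    # draw an index at time T and trace its ancestral line backwards
    j = rng.choice(N, p=w[T - 1])
    out = np.empty(T)
    for t in range(T - 1, -1, -1):
        out[t] = X[t, j]
        j = anc[t, j]
    return out

# Usage: iterate the resulting PG kernel from an arbitrary reference trajectory
rng = np.random.default_rng(4)
sq = sr = 0.5; T = 50
x_true = np.cumsum(rng.normal(0.0, sq, T))
y = x_true + rng.normal(0.0, sr, T)
x_ref = np.zeros(T)
for _ in range(20):
    x_ref = cpf_as(y, x_ref, N=10, sq=sq, sr=sr, rng=rng)
```

Even with only N = 10 particles, a handful of sweeps is enough for the sampled trajectories to move from the arbitrary starting path towards the smoothing distribution, in line with the fast mixing discussed above.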

The CPF-SAEM algorithm

From Theorem 2.1 we now know that if x'_{1:T} \sim p_\theta(x_{1:T} \mid y_{1:T}) and we sample x^*_{1:T} as stated in the given procedure, then x^*_{1:T} \sim p_\theta(x_{1:T} \mid y_{1:T}) for any number of particles N. This means that we have found the required Markov kernel and that we can state the final CPF-SAEM algorithm as introduced in [45]. We note that using the Rao-Blackwell Theorem (see e.g. [10]), it is possible to improve the auxiliary function (2.25) in such a way that we can reuse all N particle trajectories. That is, if we update \hat{Q}_k according to

    \hat{Q}_k(\theta) = (1 - \gamma_k)\hat{Q}_{k-1}(\theta) + \gamma_k \sum_{i=1}^{N} \frac{w_T^i}{\sum_l w_T^l} \log p_\theta(X^i_{1:T}, y_{1:T}),    (2.29)

the variance of (2.29) is smaller than that of (2.25). If we let J be the random index of the sampled particle path in Theorem 2.1 (x^*_{1:T} = X^J_{1:T}), this is a Rao-Blackwellization over J. The CPF-SAEM algorithm for maximum likelihood inference in non-linear Hidden Markov Models is summarized in Algorithm 6.

Algorithm 6 CPF-SAEM

Set \theta_0 and x_{1:T}[0] arbitrarily. Set \hat{Q}_0(\theta) \equiv 0
for k = 1, \ldots, K do
- Generate \{X^i_{1:T}, w_T^i\}_{i=1}^{N} by running Algorithm 5, conditioned on x_{1:T}[k−1] and targeting p_{\theta_{k-1}}(x_{1:T} \mid y_{1:T})
- Compute \hat{Q}_k(\theta) according to (2.29)
- Compute \theta_k = \arg\max_{\theta \in \Theta} \hat{Q}_k(\theta)
- Sample J with P(J = i) \propto w_T^i and set x_{1:T}[k] = X^J_{1:T}

CHAPTER 3

Dimension reduction in option valuation models by a HMM approach

In this chapter we develop a Hidden Markov Model for the approximation of an option value process, with the purpose of reducing the dimension of the underlying risk driver process. Besides this, we test methods for the required state and parameter inference within this model. We describe the proposed model for dimension reduction in Section 3.1. In Section 3.2 we give an example of this model, which we will use extensively in the rest of this chapter. In Section 3.3 we compare the CPF-SAEM method for state and parameter estimation with using a Kalman Filter within the described EM framework. The Kalman Filter cannot be used for the model that Ortec Finance uses to value embedded options, because this model is non-linear in its risk drivers. Therefore, we investigate the behavior of the CPF-SAEM method for inference in non-linear Hidden Markov Models in Section 3.4. We test the sensitivity of our model to the choice of transition density f and the dimension of the state space X in Section 3.5.

3.1 Model description

We are interested in modelling the process of d_Y embedded option values depending on n risk drivers. Let the observations \{Y_t\}_{t \geq 1} represent this d_Y-dimensional process of option values. The risk drivers are modeled as a stochastic process \{R_t\}_{t \geq 1}, so that each realization R_t = (\hat{r}_1, \ldots, \hat{r}_n)_t specifies an instance of all risk drivers and thus the option value(s). In other words, we set Y_t = V(R_t), where V : R^n \to R^{d_Y} represents the risk-neutral valuation of the embedded option(s). This mapping is in general expensive to evaluate and only known analytically in some special cases, for example in the Black-Scholes and Heston models for vanilla call and put options. As mentioned in Section 1.1, in practice Monte Carlo simulations are used for this valuation, which leads to time-consuming nested simulations. If we could reduce the dimension of the risk driver process \{R_t\}_{t \geq 1}, this would mean a major reduction in the number of real world scenarios required, and therefore a reduction in computing time (see Section 1.1).

We now assume that the process \{R_t\}_{t \geq 1} is generated by a hidden d_X-dimensional process \{X_t\}_{t \geq 1} \in R^{d_X} with

    R_t = \phi(X_t), \qquad \phi : R^{d_X} \to R^n.

This mapping \phi should map the risk drivers to an appropriate domain D \subset R^n. This domain must ensure, for example, that risk drivers such as volatilities are never negative and that correlations lie in [−1, 1]. We denote the minimum and maximum of this domain by D_min and D_max, respectively. For simplicity, we consider a linear mapping between X_t and R_t, but more general mappings can be adopted as well. This means that we set

    R_t = \phi(X_t) = \min\left( D_{max}, \max\left( D_{min}, A X_t + b \right) \right),    (3.1)

where A \in R^{n \times d_X} and b \in R^n. Here A and b represent the part of the risk driver process that does not change over time, i.e. the static part. Note that when A_{ij} > 0, the correlation between risk driver r_i and the jth hidden state is positive, and when A_{ij} < 0 it is negative. The hidden process \{X_t\}_{t \geq 1} then models the dynamic part of the risk driver process. Both stability and interpretability benefit from imposing such a structure on the risk driver process. Let \theta' \in \Theta represent the model parameters in the underlying Hidden Markov Model

    X_t \mid (X_{t-1} = x_{t-1}) \sim f_{\theta'}(x_t \mid x_{t-1}), \qquad X_1 \sim \mu_{\theta'}(x_1),    (3.2)
    Y_t \mid (X_t = x_t) \sim g_{\theta'}(y_t \mid x_t).

We need to estimate the parameters \theta = (A, b, \theta') and the hidden states in our model; the process Y_t is then approximated by \bar{Y}_t = V(\phi(X_t)). When d_X \ll n we obtain an effective dimension reduction, as the dependence of R_t is driven by a lower-dimensional process X_t. Typically, a normal distribution for the observation density is chosen. This means that each element j = 1, \ldots, d_Y of the observation process Y_t is governed by a normal distribution with mean \bar{Y}_t = V(\phi(X_t)), and we assume the volatility \sigma_Y \in R^+ to be the same for each element, i.e.

    g\left( y_t^{(j)} \mid x_t \right) = N\left( \bar{y}_t^{(j)}, \sigma_Y^2 \right),    (3.3)

(j) where N denotes the normal probability density function andy ¯t the j-th element of the real- izationy ¯t = V (φ (xt)). Note thaty ¯t is an accurate approximation for yt when σY ↓ 0.

Besides this, we initially assume that the hidden state is 1-dimensional (d_X = 1) and normally distributed,

    f(x_t \mid x_{t-1}) = N(x_{t-1}, \sigma_X^2), \qquad \mu(x_1) = N(x_0, \sigma_X^2),    (3.4)

where x_0 is some initial value and \sigma_X \in R^+. This means that we need to estimate \theta = (A, b, \sigma_X, \sigma_Y), where A, b \in R^n. Later, we will investigate the model behavior when f is governed by another distribution or when the state space has more dimensions.

3.2 Black-Scholes example

To illustrate the proposed method we first consider the price process of some at-the-money call option (dY = 1) with one year to maturity in the well-known Black-Scholes model [7]. In the Black-Scholes model, the price of a call option is given by

    BS(r, q, \sigma, S, \check{K}, \tau) = S e^{-q\tau} \Phi(d_1) - \check{K} e^{-r\tau} \Phi(d_2),    (3.5)

where

    d_1 = \frac{\log\frac{S}{\check{K}} + \left( r - q + \frac{1}{2}\sigma^2 \right)\tau}{\sigma\sqrt{\tau}} \qquad \text{and} \qquad d_2 = d_1 - \sigma\sqrt{\tau},    (3.6)

r denotes the risk-free interest rate, q the dividend rate, \sigma the volatility of the price of the underlying asset, S the price of the underlying, \check{K} the strike price, \tau the time to maturity and \Phi(\cdot) the cumulative standard normal distribution [7]. We assume S = 1, \check{K} = 1 and \tau = 1, so that we have n = 3 underlying risk drivers, R_t = (r, q, \sigma)_t. First we generate data

    Y_t = BS\left( r_t, q_t, \sigma_t \mid S = 1, \check{K} = 1, \tau = 1 \right)    (3.7)

for some known risk driver processes. We choose r_t \sim N(0.02, 0.005^2), so that r is random with mean 2%, and q_t = 0.01 for all t. Besides this, we let the mean of \sigma_t follow a typical volatility curve (see Figure 3.1) and add some noise, so that \sigma_t \sim N(\bar{\sigma}_t, 0.005^2). We then try to approximate this price process by

    \bar{Y}_t = BS\left( \phi(X_t) \mid S = 1, \check{K} = 1, \tau = 1 \right),

by estimating \theta and the underlying hidden states \{X_t\}_{t \geq 1} from the observations. In the transformation \phi given in (3.1) we set the domain D_min = 0 for all three risk drivers, and D_max is 0.15 for r_t, 0.2 for q_t and 1.5 for \sigma_t.
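The valuation (3.5)-(3.6) is cheap to evaluate in closed form; below is a direct transcription using the error function for \Phi. The parameter values in the usage line are the at-the-money setting of this section, with an assumed \sigma = 0.2:

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, tau, r, q, sigma):
    """Black-Scholes price (3.5)-(3.6) of a European call with dividend yield q."""
    d1 = (log(S / K) + (r - q + 0.5 * sigma ** 2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    return S * exp(-q * tau) * norm_cdf(d1) - K * exp(-r * tau) * norm_cdf(d2)

# At-the-money call as in this section: S = K = tau = 1, r = 2%, q = 1%,
# and an assumed volatility of 20%
price = bs_call(1.0, 1.0, 1.0, 0.02, 0.01, 0.2)
```

Since each evaluation is essentially free, the observation density (3.3) can be evaluated for every particle at every time step without the nested Monte Carlo simulations needed for general embedded options.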

Figure 3.1: Mean of the true risk driver process for σt

3.3 Benchmark: Kalman Filter within the EM framework

For combined state and parameter inference we would like to investigate the sophisticated CPF- SAEM method described in Algorithm 6. Before we do this, we first look into the most popular method for state estimation, the Kalman Filter (see Algorithm 1). We use the Kalman Filter to estimate the states of an HMM within the EM framework described in Section 2.3.

Linear Gaussian example

To investigate the accuracy of the CPF-SAEM method, we first test it on a linear Gaussian example for which the Kalman Filter gives us the exact solution. We consider the most basic Hidden Markov Model, a so-called standard Gaussian [56]:

    x_t = x_{t-1} + q_{t-1}, \qquad q_{t-1} \sim N(0, \sigma_q^2),
    y_t = x_t + r_t, \qquad r_t \sim N(0, \sigma_r^2),

where the initial distribution of the hidden states is given by \mu(x_1) = N(0, 1). Note that this corresponds to a HMM where X = Y = R, f(x_t \mid x_{t-1}) = N(x_{t-1}, \sigma_q^2), g(y_t \mid x_t) = N(x_t, \sigma_r^2) and \theta = (\sigma_q, \sigma_r). We generate an observation process with \sigma_q = \sigma_r = 1 and estimate the hidden states and parameters.

We analyze the average approximation \bar{y}_t of the observation process and the average sum of squared errors in 50 trials of the model. Here we define the error by SSE = \sum_{t=1}^{T} (y_t - \bar{y}_t)^2, where for this example the approximation is given by \bar{y}_t = x_t. For the Kalman Filter only the parameters \theta have to be estimated, since it gives an exact solution for the state estimation. Compared to using a Conditional Particle Filter to estimate the hidden states, this leads to a more accurate approximation and to a significantly smaller and faster converging error, see Figures 3.2 and 3.3.
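For this standard Gaussian model the Kalman filter recursions reduce to scalar updates; the following sketch (variable names are ours) filters a simulated observation process with \sigma_q = \sigma_r = 1:

```python
import numpy as np

rng = np.random.default_rng(5)

# Kalman filter for the standard Gaussian model above:
# x_t = x_{t-1} + N(0, sq^2), y_t = x_t + N(0, sr^2), x_1 ~ N(0, 1).
def kalman_filter(y, sq2, sr2, m0=0.0, P0=1.0):
    m, P = m0, P0
    means = np.empty(len(y))
    for t, yt in enumerate(y):
        if t > 0:
            P = P + sq2                  # predict through the random-walk transition
        K = P / (P + sr2)                # Kalman gain
        m = m + K * (yt - m)             # measurement update
        P = (1.0 - K) * P
        means[t] = m
    return means

# Generate data with sigma_q = sigma_r = 1 as in the text, then filter
T = 300
x = np.cumsum(rng.normal(0.0, 1.0, T))
y = x + rng.normal(0.0, 1.0, T)
filt = kalman_filter(y, sq2=1.0, sr2=1.0)
rmse = np.sqrt(np.mean((filt - x) ** 2))
```

Because the model is linear Gaussian, these filtered means are the exact conditional expectations, which is precisely why the Kalman Filter serves as the benchmark here.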

Figure 3.2: Average \bar{y}_t (N = 6, K = 50)    Figure 3.3: Average SSE (N = 6, K = 50)

Black-Scholes example

Now we apply the Kalman Filter to the Black-Scholes example of Section 3.2. To do this, we assume all parameters to be time-independent in equations (2.6)-(2.7) and consider a one-dimensional hidden state (d_X = 1), so that X = Y = R, \mu(x_1) = N(0, \sigma_0^2) and

    f(x_t \mid x_{t-1}) = N(a x_{t-1}, \sigma_X^2),
    g(y_t \mid x_t) = N(b x_t, \sigma_Y^2).

Note that this means that we approximate the price process by \bar{Y}_t = bX_t instead of \bar{Y}_t = BS(\phi(X_t)), since the KF can only be applied to linear Gaussian models. Estimating \theta = (a, b, \sigma_X, \sigma_Y) and the hidden states with only 10 EM iterations gives a remarkably accurate approximation, see Figure 3.4. This accuracy can also be illustrated by the steep descent of \sigma_Y, see Figure 3.5a. To determine the required number of EM iterations until convergence, we analyze the behavior of \hat{Q}_k(\theta) from (2.29). In Figure 3.5b it can be seen that the objective function reaches its maximum after a specific number of EM iterations, and therefore it is sufficient to set K = 10. This is confirmed when analyzing the average error in 100 different trials of the model, see Figure 3.6. We use K = 300 EM iterations as a reference, which results in a reference value of SSE ≈ 0.004.

29 Figure 3.4: Observations and approximation based on the last EM iteration

(a) \sigma_Y    (b) \hat{Q}_k(\theta)

Figure 3.5: Standard deviation of Y and objective function (2.29) in 10 trials

Although these results look very promising, we cannot use this method in the model that Ortec Finance uses to determine the embedded option values V(R_t) by risk-neutral Monte Carlo simulations. Note that a Hidden Markov Model with

    g(y_t \mid x_t) = N(\bar{y}_t, \sigma_Y^2)

is Gaussian (as long as we choose the transition density f Gaussian as well). However, the option value V is a highly non-linear function of its risk drivers, and thus \bar{Y}_t = V(\phi(X_t)) is not linear. We have used the Kalman Filter just for educational purposes. The reason why the Kalman Filter works so well for this Black-Scholes test case is that the function V(\phi(X_t)) = BS(\phi(X_t))

is approximately linear in Xt for an at-the-money BS call option with 1 year to maturity. Since φ given in (3.1) is a linear mapping, it is sufficient to check whether the Black-Scholes valuation function is (approximately) linear in r, q and σ for the relevant ranges of the domain. This is confirmed in Figure 3.7.

Figure 3.6: Sum of squared errors

(a) Interest rate (b) Dividend yield (c) Standard deviation of S

Figure 3.7: Ceteris paribus Black-Scholes price in various risk drivers

3.4 Solving the BS example with the CPF-SAEM method

We now approximate the same BS price process (3.7) by estimating the parameters and hidden states of the HMM defined in (3.3)-(3.4) with the CPF-SAEM method using N = 6 particles and K = 50 EM iterations. This more general method can be used on the model of Ortec Finance and also gives an accurate result, see Figures 3.8 and 3.9. The approximation gives us a less volatile price process than the real observations but stays close to the mean of the true price at every time step t.

Figure 3.8: Observations and approximation based on the last EM iteration

Figure 3.9: Standard deviation of Y in 10 trials

Sensitivity to the number of EM iterations

We analyze the averaged error in every EM iteration over 30 trials of the model with N = 5 particles and γk ≡ 1 for a fixed set of observations, see Figure 3.10. We observe that the error decreases with k and stabilizes, as we would expect. A first order linear approximation shows that the error converges roughly proportionally to k^(−0.60), see Figure 3.11. Therefore it is optimal to choose the number of EM iterations large. On the other hand, the total computing time grows approximately linearly in K (when γk ≡ 1), as we will see in one of the following subsections. To make a reasonable choice in the well-known trade-off between computing time and accuracy, we also investigate the behavior of the objective function Q̂k from (2.29) in Figure 3.12. The average value of Q̂ increases very rapidly for k ≤ 50; after this, Q̂ becomes almost constant in k. It is sufficient to take K between 50 and 100 for this case, but the exact choice depends on the required accuracy.
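The quoted rate k^(−0.60) comes from a first order linear fit in log-log space. A minimal sketch of such a fit, applied here to synthetic data with a known decay rate (illustrative data, not the actual SSE values behind Figure 3.10):

```python
import numpy as np

def fit_power_rate(k, err):
    # First order (linear) fit of log(err) against log(k):
    # for err ~ C * k**p this recovers the exponent p
    p, log_c = np.polyfit(np.log(k), np.log(err), 1)
    return p

# Synthetic error sequence decaying like k^(-0.6) with mild multiplicative noise
rng = np.random.default_rng(0)
k = np.arange(1, 101)
err = 0.05 * k**-0.6 * np.exp(0.01 * rng.standard_normal(k.size))
print(fit_power_rate(k, err))  # close to -0.6
```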

Figure 3.10: Sum of squared errors

Figure 3.11: Convergence rate of SSE in k
Figure 3.12: Average value of Q̂

Sensitivity to the number of particles

To get some insight into the dependence on the number of particles, we run the method 35 times for one set of observations (K = 15, γk ≡ 1) and analyze the error and final value of Q̂ for various numbers of particles. Although the behavior improves slightly when taking more particles, the method seems rather insensitive to the number of particles N, see Figures 3.13 and 3.14. This is in agreement with the analysis and empirical studies in [46] and [50], where only a minor increase in efficiency is found when increasing N. To reduce computing times, it is advisable to take a relatively small number of particles. In practice, the correlation between consecutive trajectories is found to drop very quickly as N increases [45, 31]. Therefore, the range N = 6-15 is sufficient to ensure rapid mixing of the CPF-AS Markov kernel.

Figure 3.13: Sum of squared errors
Figure 3.14: Value of Q̂ in the last EM iteration

Sensitivity to the initial conditions

The first step in the CPF-SAEM algorithm is to set the initial conditions x1:T[0] and θ0 arbitrarily. Since the HMM is given by (3.3)-(3.4) and transformation (3.1) has n = 3 risk drivers and a one-dimensional hidden state (dX = 1), we need to set θ0 = (σX, σY, a1, a2, a3, b1, b2, b3)′. We compare the average SSE of the approximation for various sets of initial conditions with the error obtained when using the 'true' values of the parameters and states. These optimal values are obtained by running the algorithm with K = 100 EM iterations, which gives values for θ between 0 and 0.259. Normally, we do not have any prior information about the states and therefore we set x1:T[0] = 0 in all five cases that we consider in Figure 3.15. Because we only want to do a quick scan of the sensitivity to the initial conditions, we only test cases in which all parameters of θ are given the same value. Since we require σX, σY ≥ 0, we choose θ0 in the range 0.01-10, see Figure 3.15. Naturally, the fastest convergence is achieved when the initial conditions are chosen as close to the optimal values as possible. Since EM algorithms are local optimization methods, it may be necessary to run an 'almost global' maximization method as well to make sure that we find a global maximum. The most straightforward way to do this is to run the method multiple times for different (uniformly chosen) sets of initial conditions and choose the best one, instead of running it only once for one predefined choice of x1:T[0] and θ0. We stress that this is an important step, because the CPF-SAEM method may provide a totally inaccurate estimate (for example ȳt ≡ c, for some constant c) when it is 'stuck' in a local maximum. Besides this, when highly inaccurate initial conditions are chosen, it might happen that the likelihood of the initial sample is so low that the weight assigned to each particle is zero. Since the weights are normalized in every step, this leads to a division by zero and the termination of the algorithm.
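This division-by-zero failure mode can be avoided in an implementation by normalizing the weights in log space. A small sketch (Python; the function name is our own) using the log-sum-exp trick, which also detects the all-zero-weights case explicitly:

```python
import numpy as np

def normalize_weights(log_w):
    # Normalize particle weights given as log-likelihoods using the
    # log-sum-exp trick; working in log space avoids the underflow that
    # turns every raw weight into zero for poor initial conditions
    log_w = np.asarray(log_w, dtype=float)
    m = np.max(log_w)
    if not np.isfinite(m):
        raise FloatingPointError(
            "all particle weights are zero; restart with different initial conditions")
    w = np.exp(log_w - m)
    return w / w.sum()

# Extremely unlikely particles: a naive exp() would underflow to all zeros,
# but the shifted version still yields a valid probability vector
w = normalize_weights([-2000.0, -2001.0, -2005.0])
print(w.sum())  # 1.0
```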

As a general guideline we set x1:T [0] = 0 and therefore a1, a2 and a3 can be set arbitrarily, see transformation (3.1). We set for example ai = 0.01. Now it is sufficient to choose the initial value of the other parameters in ‘the right order of magnitude’, to make sure that the algorithm converges normally. We choose for example 0.3 for the volatility parameters and 0.03 for bi when the corresponding risk driver ri refers to an interest rate.

Figure 3.15: Average sum of squared errors (35 trials, N = 6, γk ≡ 1)

Maximization of Q̂

Because A, b ∈ R³, we need to estimate θ = (σX, σY, a1, a2, a3, b1, b2, b3), where σX, σY ≥ 0. In order to compute a new parameter estimate in each EM iteration, we maximize Q̂k using the solver fmincon, the only multivariate constrained (non-semi-infinite) solver in Matlab. For this solver we need to define upper and lower bounds for every parameter. We set the natural bounds 0 ≤ σX, σY ≤ 1.5 and investigate the sensitivity of the algorithm to the unknown bounds for A and b.

In this analysis we take the same bounds for every parameter and set lb ≤ a1, a2, a3, b1, b2, b3 ≤ ub. It turns out that the maximization step is very sensitive to the choice of these bounds, see Figure 3.16. We note that SSE ≥ 0.03 already corresponds to a completely inaccurate approximation and observe that taking highly negative lower bounds (i.e. −Inf and −10) consistently leads to inaccurate results. Taking lb = −5 gives both accurate and inaccurate approximations, and decreasing the magnitude of the negative lower bound even further increases the performance of the method. The safest option is to assume that A, b ≥ 0 in order to make the maximization work well. In Figure 3.17 we observe that the method is relatively insensitive to the specific choice of positive bounds, although the most consistent results are obtained when taking ub ≤ 5.
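Outside Matlab, the same bounded maximization step can be sketched with scipy.optimize.minimize, which accepts box constraints in the role of fmincon's lb/ub. Below, a toy concave surrogate stands in for Q̂k (the actual objective is the weighted complete-data log-likelihood, which we do not reproduce here):

```python
import numpy as np
from scipy.optimize import minimize

def maximize_Q(Q, theta0, lb, ub):
    # Maximize Q subject to box constraints by minimizing -Q,
    # the role played by fmincon with lb/ub in the Matlab implementation
    bounds = list(zip(lb, ub))
    res = minimize(lambda th: -Q(th), theta0, bounds=bounds,
                   method="L-BFGS-B", options={"maxiter": 5})  # MaxIter = 5 as in the text
    return res.x

# Toy concave surrogate for Q-hat with maximizer (0.3, 0.03);
# sigma-type parameters are kept in [0, 1.5] as in the thesis
Q = lambda th: -(th[0] - 0.3)**2 - (th[1] - 0.03)**2
theta = maximize_Q(Q, theta0=[0.5, 0.5], lb=[0.0, 0.0], ub=[1.5, 1.5])
print(theta)
```

Even with the iteration cap at 5, a smooth concave objective is maximized essentially exactly; the cap matters only for the harder likelihood surfaces encountered in the actual algorithm.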

Figure 3.16: Error w.r.t. various bounds (N = 5, K = 30, γk ≡ 1, 50 trials)

Figure 3.17: Error w.r.t. positive bounds (N = 5, K = 30, γk ≡ 1, 50 trials)

We note that this maximization step is the most time-consuming step in each iteration of the CPF-SAEM algorithm. Setting the optimization options for the solver in Matlab has a major impact on the computation time of the algorithm. For this example we set the maximum number of iterations allowed within the solver, MaxIter = 5, to speed up calculations. We discovered that this maximization step, and therefore the whole algorithm, benefits greatly from parallelization. We ran the algorithm on two different laptops, one containing a quad-core Intel i7 CPU running at 2.6 GHz from 2011 (Intel Core i7-2630QM), and the other a dual-core Intel i7 CPU running at 3.2 GHz from 2015 (Intel Core i7-5600U). Despite its lower clock speed and older CPU architecture, the program runs approximately 1.5 times as fast on the quad-core system from 2011. This is due to the parallelization option in the solver used (fmincon).

Sensitivity to the choice of γk

In order to determine new estimates for the parameters in every EM iteration we maximize

$$
\begin{aligned}
\hat{Q}_k(\theta) &= (1-\gamma_k)\hat{Q}_{k-1}(\theta) + \gamma_k \underbrace{\sum_{i=1}^{N} \frac{w_T^i}{\sum_l w_T^l} \log p_\theta\!\left(x_{1:T}^i, y_{1:T}\right)}_{=:\,P_k(\theta)} \\
&= (1-\gamma_k)\hat{Q}_{k-1}(\theta) + \gamma_k P_k(\theta) \\
&= (1-\gamma_k)(1-\gamma_{k-1})\hat{Q}_{k-2}(\theta) + (1-\gamma_k)\gamma_{k-1}P_{k-1}(\theta) + \gamma_k P_k(\theta) \\
&\;\;\vdots \\
&= (1-\gamma_k)\cdots(1-\gamma_2)\gamma_1 P_1(\theta) + (1-\gamma_k)\cdots(1-\gamma_3)\gamma_2 P_2(\theta) + \ldots + (1-\gamma_k)\gamma_{k-1}P_{k-1}(\theta) + \gamma_k P_k(\theta) \\
&= \sum_{j=1}^{k-1}\left(\prod_{i=j+1}^{k}(1-\gamma_i)\right)\gamma_j P_j(\theta) + \gamma_k P_k(\theta). \qquad (3.8)
\end{aligned}
$$

Here, {γk}k≥1 is a decreasing sequence of positive step sizes, satisfying both stochastic approximation conditions

$$\sum_k \gamma_k = \infty \quad \text{and} \quad \sum_k \gamma_k^2 < \infty. \qquad (3.9)$$

As proposed in [45], we take

$$\gamma_k = \begin{cases} 1 & \text{if } k = 1, 2, \\ 0.98 & \text{if } k = 3, \ldots, 10, \\ \propto k^{-0.7} & \text{if } k > 10. \end{cases} \qquad (3.10)$$
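The step-size rule (3.10) and the equivalence between the recursion for Q̂k and the closed form (3.8) can be checked numerically. In the sketch below the constant of proportionality for k > 10 is chosen so the sequence is continuous at k = 10 (that constant is our own choice; [45] leaves it open), and the Pk are arbitrary numbers standing in for the per-iteration likelihood terms:

```python
import numpy as np

def gamma(k):
    # Step sizes as in (3.10); the k > 10 branch is scaled to join
    # continuously at k = 10 (an assumption on the proportionality constant)
    if k <= 2:
        return 1.0
    if k <= 10:
        return 0.98
    return 0.98 * (k / 10.0)**-0.7

# Verify that the recursion Q_k = (1 - g_k) Q_{k-1} + g_k P_k matches
# the unrolled closed-form sum (3.8) for an arbitrary sequence P_1..P_K
rng = np.random.default_rng(1)
K = 40
P = rng.standard_normal(K + 1)  # P[k] plays the role of P_k(theta)
Qhat = 0.0
for k in range(1, K + 1):
    Qhat = (1 - gamma(k)) * Qhat + gamma(k) * P[k]
closed = sum(np.prod([1 - gamma(i) for i in range(j + 1, K + 1)]) * gamma(j) * P[j]
             for j in range(1, K))
closed += gamma(K) * P[K]
print(abs(Qhat - closed))  # zero up to floating-point error
```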

Curved exponential models

Note that we have to maximize a weighted sum of k terms and have to store all states and weights of the previous iterates. This leads to a quite cumbersome and time-consuming maximization step. After consulting the author of [45], we found out that for models in the curved exponential family it is possible to simplify this maximization step. Models belong to this family if it is possible to write the complete data log-likelihood as an inner product of a 'natural parameter' and a sufficient statistic, i.e.

$$\log p_\theta(x_{1:T}, y_{1:T}) = \langle \eta(\theta), S(x_{1:T}, y_{1:T}) \rangle, \qquad (3.11)$$

where η(·) and S(·) are some arbitrary functions. By plugging (3.11) into Equation (3.8) we obtain an explicit expression Q̂k(θ) = ⟨η(θ), S̄k⟩, where S̄k is a stochastic approximation of the sufficient statistic of the model. It is then possible to average directly on the sufficient statistic and there is no need to keep all particles from the previous iterates in memory. Unfortunately, our Hidden Markov Model for dimension reduction does not belong to the curved exponential family, so we cannot use this approach. However, it can be very useful to check beforehand whether a model is in this family when investigating other applications of the CPF-SAEM method.

Numerical results

To test the sensitivity of the CPF-SAEM method for inference within our model to the choice of γk, we compare two different choices of (3.10) with taking γk ≡ 1. See Figure 3.18 for a visualization of these three different choices for γk. Taking γk ≡ 1 corresponds to maximizing only the most recent complete likelihood term of Q̂k and means that we do not have to store particles and weights of all previous iterates. In other words, we then consider a regular (Monte Carlo) EM method instead of a Stochastic Approximation EM method. In Figure 3.19a we see that the error is smaller if we optimize over all k terms of (3.8) instead of using only the last one (γk ≡ 1). Choosing a γk that declines more moderately in k (choice 2 in Figure 3.18) gives the best results for our example. However, since we have to maximize an increasing number of terms in every EM iteration, the time per iteration grows rapidly with k, see Figure 3.19b. Because ideally we would like to take K = 50-100 EM iterations, running the model would become very time-consuming (multiple hours for one run with N = 5 particles). To speed up the calculations we decide to only account for the last 5 terms of (3.8) in every EM iteration. As we would expect, the improvement of taking γk ∝ k^(−0.7) over taking γk ≡ 1 is now smaller, see Figure 3.20a. In Figure 3.20b we see that when k ≥ 5, the time per EM iteration no longer grows in k. However, maximizing over the last five terms still takes significantly more time than using only the last term. Besides this, the improvement in approximation of taking γk ∝ k^(−0.7) in the last 5 terms is small. Therefore we would advise to set γk ≡ 1 for this Black-Scholes example. This is supported by the fact that taking into account only the last 5 terms of (3.8) does not satisfy the second stochastic approximation condition in (3.9) and is therefore not theoretically correct (although it does work in practice for this Black-Scholes example).

Figure 3.18: Choices for γk

(a) Average SSE (b) Average time per EM iteration

Figure 3.19: Maximizing Qˆk for different choices of γk (80 trials, N = 5)

(a) Average SSE (b) Average time per EM iteration

Figure 3.20: Maximizing Qˆk using the last 5 terms for different choices of γk (80 trials, N = 5)

Note on the non-monotonicity of Qˆ

We note that although on average Q̂ increases with k for this Black-Scholes example (see Figure 3.12), it is possible to observe unexpected non-monotonic behavior of Q̂ in single runs of the algorithm. In Figure 3.21 we give a visualization. The explanation for this behavior is that Q̂k(θ) in (2.29) not only depends on θ, but also on the set of particles {X^i_{1:T}, w^i_T}, i = 1, ..., N, generated in the kth iteration of Algorithm 6. It is possible that in the last step of the algorithm a particle with relatively small likelihood is sampled to condition the next iteration on. This can result in the generation of a new set of particles that overall has a smaller complete likelihood than the set of particles in the previous iteration. This can lead to a decrease in Q̂ that cancels out the positive effect of the optimization with respect to the parameters. Note that this happens only with small probability, because the particles X^i_{1:T} to condition the next particle filter on are sampled with probability ∝ w^i_T; therefore the average trend is indeed increasing. In Appendix A we try conditioning on the particle with the highest weight in an attempt to reduce this non-monotonic behavior. However, this did not result in a significant reduction of the non-monotonic behavior in Q̂.
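The sampling step described above reduces to drawing an index with probability proportional to the final weights. A minimal sketch (Python, with illustrative weights) showing that low-weight particles are still selected occasionally, which is exactly what triggers the occasional dips in Q̂:

```python
import numpy as np

def draw_conditioning_index(w_T, rng):
    # Sample the index of the trajectory to condition the next CPF-AS
    # sweep on, with probability proportional to the final weights w_T^i
    w = np.asarray(w_T, dtype=float)
    return rng.choice(w.size, p=w / w.sum())

# With one dominant weight, the low-likelihood particles are still
# picked a fraction of the time
rng = np.random.default_rng(2)
w_T = [0.9, 0.05, 0.05]
draws = [draw_conditioning_index(w_T, rng) for _ in range(2000)]
print(sum(d == 0 for d in draws) / 2000)  # close to 0.9
```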

Figure 3.21: Non-monotonicity of Qˆ

Sensitivity to the choice of proposal distribution q

Using an accurate proposal (or importance) distribution q contributes to reducing the degeneracy problem, which was already addressed by using a resampling step and MCMC techniques. Also, we can expect faster convergence if we sample predictions for our new states in regions of high likelihood. Potential criteria for choosing a suitable proposal distribution are given in [11]:

- The support should be greater than or equal to that of the posterior distribution.
- Ease of sampling implementation.
- Achieving minimum variance of the importance weights.
- Taking into account the transition density, likelihood and most recent observations.
- Long-tailed behavior to account for outliers.
- Being as close as possible to the true posterior.

However, achieving any of these is not easy and most conventional methods only focus on a few specific criteria. We would like to emphasize that choosing an appropriate proposal is very case specific and requires a thorough understanding of the problem. One should also make sure that the potential advantages of using a good proposal distribution are not offset by an increase in computational costs [10].

Choices for the proposal distribution

See Table 3.1 for an overview of the most common choices for the proposal distribution. For clarity we omitted some mixture, trivial and less frequently used methods. Remember that choosing q^opt = p(xn | yn, xn−1) as proposal minimizes the variance of the importance weights and is therefore considered the 'optimal' proposal density in the literature. Unfortunately, there are only two cases in which it is possible to sample from this optimal proposal distribution: either the state space is a finite set, or the HMM is Gaussian and the observations depend linearly on the states. Since neither of these cases applies to our model, we cannot use this approach.

| proposal | reference | note |
| 'optimal' q^opt | [10, 24, 6] | mostly impossible to sample from |
| prior distribution f | [6, 11, 10] | |
| annealed prior distribution f^β | [11] | |
| approximate q^opt by EKF / local linearization | [24, 63, 11] | gradient of g required |
| approximate q^opt by UKF (UPF) | [63, 10] | |
| Hybrid Kalman Particle Filter | [44] | combination of EKF and UKF |

Table 3.1: Overview of different proposal distributions

The use of the transition prior f(xn | xn−1) is the most popular choice of proposal distribution and was proposed in one of the first SMC papers [33]. This method is very intuitive and its implementation is very straightforward. The importance weights are easily evaluated because they are simply equal to g(yn | xn), see (2.28). However, since this alternative explores the state space without any knowledge of the observations, it is not always effective. It is also very sensitive to outliers.

The sensitivity to outliers is a motivation to choose an annealed prior distribution as the proposal. This means that we let q(xn | yn, xn−1) = f(xn | xn−1)^β, so that the importance weights correspond to g(yn | xn) f(xn | xn−1)^(1−β). When β = 1 this approach reduces to the normal prior proposal, and when β = 0 it corresponds to taking a uniform distribution as proposal. Intuitively, this method makes the prior proposal flatter; this is equivalent to artificially adding noise to the predicted particle in order to sample from a broader space.

The remaining methods are all based on an approximation of the optimal proposal distribution, obtained by incorporating the most current observation into the optimal Gaussian approximation of the state. With the Extended Kalman Filter (EKF) we can recursively calculate the following approximation to the true posterior distribution: p(xn | y1:n) ≈ N(x̄n, P̂n). Within the particle filter framework, we can run a separate EKF to generate a proposal distribution for each particle:

$$q^{opt}(x_n \mid y_n, x_{n-1}^{(i)}) \approx \mathcal{N}(\bar{x}_n^{(i)}, \hat{P}_n^{(i)}), \qquad i = 1, \ldots, N.$$

However, the EKF relies on a first order Taylor expansion of the likelihood g and transition density f. If the model is highly nonlinear, as is the valuation function in the model of Ortec Finance, this method introduces inaccuracies due to this linearization. The Unscented Kalman Filter (UKF), see Appendix B, generates more accurate estimates of the true mean and covariance of the state [63]. Besides this, the UKF is derivative-free and tailored towards non-Gaussian distributions. These properties make using a UKF to approximate the optimal proposal distribution better suited for the Ortec Finance valuation model. Lastly, it is also possible to use a combination of the EKF and UKF to approximate q^opt; this is called the Hybrid Kalman Particle Filter.

Numerical results

We now investigate the effect of choosing a different proposal distribution on the CPF-SAEM method for our BS example. In all previous analyses we used the transition prior f as proposal distribution, since this approach is most commonly used and very easy to implement. First we test an annealed prior density q = f^β for two different values of β. See Figure 3.22 for a

visualization of these distributions for the same value of xn−1. We analyze the averaged error over 80 trials of the model, where N = 6, K = 50 and γk ≡ 1. Using an annealed prior distribution does not give better results in our case, see Figure 3.23. This observation is supported by the remark in [11] that when the prior f is flat compared to a peaked likelihood g, changing β will not improve the behavior of the particle filter. Since we observe empirically that σY ≪ σX, we know that this is the case for our model.
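For a Gaussian transition density, annealing has a simple closed form: raising N(xn−1, σX²) to the power β and renormalizing again yields a Gaussian, now with variance σX²/β, so β < 1 inflates the standard deviation by a factor 1/√β. A small sketch (function names are our own):

```python
import numpy as np
from scipy.stats import norm

def annealed_prior(x_prev, sigma_x, beta):
    # For a Gaussian transition density f = N(x_prev, sigma_x^2), the
    # annealed proposal f^beta is (after renormalization) again Gaussian
    # with inflated variance sigma_x^2 / beta: annealing flattens the prior
    return norm(loc=x_prev, scale=sigma_x / np.sqrt(beta))

def importance_log_weight(x, y, x_prev, sigma_x, sigma_y, beta, g_mean):
    # log of g(y|x) * f(x|x_prev)^(1-beta): the weight for the annealed
    # proposal; g_mean maps the state to the observation mean (assumption)
    return (norm.logpdf(y, loc=g_mean(x), scale=sigma_y)
            + (1 - beta) * norm.logpdf(x, loc=x_prev, scale=sigma_x))

q = annealed_prior(x_prev=0.0, sigma_x=0.5, beta=0.25)
lw = importance_log_weight(x=0.5, y=0.1, x_prev=0.0,
                           sigma_x=0.5, sigma_y=0.05, beta=0.25, g_mean=lambda x: x)
print(q.std())  # 1.0: beta = 0.25 doubles the standard deviation
```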

Figure 3.22: Annealed densities
Figure 3.23: Average SSE

We now use a UKF to approximate the optimal proposal distribution in 80 trials of the model, where N = 6, K = 50 and γk ≡ 1. This leads to a significant improvement of the average value of the auxiliary function Q̂k, see Figure 3.24. This indicates that we now find states and parameters in a higher region of the expected complete likelihood pθ(x1:T, y1:T). We note that, although implementing an additional filter within the algorithm is not straightforward, the additional computing time for using a UKF is negligible in this example. Therefore it is better to use a UKF to approximate q than to use the prior distribution. We do not observe a significant improvement in SSE for our approximately linear and Gaussian example; apparently, using the prior distribution as proposal already led to satisfactory results. However, a more significant difference may be visible in more complicated cases.

(a) Average SSE (b) Average value of Qˆ

Figure 3.24: Using a UKF as proposal distribution

3.5 Influence of the underlying HMM

After investigating the behavior of the CPF-SAEM method with respect to various parameters of the algorithm, we now test the sensitivity of our model to the underlying Hidden Markov Model. We keep the observation density (3.3) fixed for our Black-Scholes example of Section 3.2 and will investigate the behavior of the model if we change the hidden state process (3.4).

Sensitivity to the dimension of the hidden state process

In contrast to the previous analyses, where dX = 1, we now assume that the hidden state process is 2-dimensional. This means that the transition density becomes the multivariate normal distribution

$$f(x_t \mid x_{t-1}) = \mathcal{N}(x_{t-1}, \Sigma_X), \qquad \mu(x_1) = \mathcal{N}(x_0, \Sigma_X), \qquad (3.12)$$

where x_0 ∈ R² is some initial value and Σ_X ∈ R^{2×2} the covariance matrix. We need to estimate 14 parameters θ = (A, b, Σ_X, σ_Y), because transformation (3.1) now becomes

$$\phi(x_t) = \min\left(D_{\max}, \max\left(D_{\min}, \begin{pmatrix} a_1 & a_4 \\ a_2 & a_5 \\ a_3 & a_6 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}_t + \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}\right)\right) = \begin{pmatrix} r_t \\ q_t \\ \sigma_t \end{pmatrix}.$$

We use the CPF-SAEM method to simultaneously estimate the 2-dimensional hidden states and the parameters θ for the Black-Scholes example. In Figure 3.25 we see similar behavior in the averaged 1-dimensional and 2-dimensional approximations of the price process in the last EM iteration over 80 runs of the model. However, in Figure 3.26 we observe that convergence is significantly faster when approximating the price using a 2-dimensional hidden process. Therefore, it seems better to use a 2-dimensional hidden state than a 1-dimensional one for this example. However, remember that the purpose of our model was to obtain an effective dimension reduction, which can be quantified by n − dX. So although increasing dX improves the convergence speed, it deteriorates the overall effectiveness of our model.
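The 2-dimensional version of transformation (3.1) is a clipped affine map and is straightforward to implement. A sketch with illustrative (not estimated) parameter values:

```python
import numpy as np

def phi(x_t, A, b, D_min, D_max):
    # Linear mapping from the d_X-dimensional hidden state to the n risk
    # drivers, truncated componentwise to the domain [D_min, D_max], cf. (3.1)
    return np.clip(A @ x_t + b, D_min, D_max)

# dX = 2 hidden dimensions mapped to n = 3 risk drivers (r, q, sigma);
# all entries below are illustrative values, not estimates from the thesis
A = np.array([[0.03, 0.01],
              [0.01, 0.02],
              [0.10, 0.05]])
b = np.array([0.02, 0.01, 0.20])
D_min = np.array([0.0, 0.0, 0.05])
D_max = np.array([0.2, 0.1, 0.6])
print(phi(np.array([0.5, -0.2]), A, b, D_min, D_max))  # [0.033 0.011 0.24]
```

States that would push a risk driver outside its domain are truncated, exactly as in the min/max construction above.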

Figure 3.25: Average approximation of the price process (N = 6,K = 50, γk ≡ 1, q = f)

Figure 3.26: Average SSE (N = 6, K = 50, γk ≡ 1, q = f)

Sensitivity to the choice of transition density f

We will now investigate the situation where the state space process is again 1-dimensional, but the Hidden Markov Model is no longer Gaussian. Before, we assumed that the transition density f was a normal distribution,

$$f(x_t \mid x_{t-1}) = \mathcal{N}(x_{t-1}, \sigma_X^2), \qquad \mu(x_1) = \mathcal{N}(x_0, \sigma_X^2), \qquad (3.13)$$

where x_0 is some initial value and σ_X ∈ R⁺. We now analyze the behavior of the model if we assume that f is governed by some other distribution. The first alternative choice for f is the lognormal distribution. This is the distribution of a positive random variable whose logarithm has a normal distribution. The distribution depends on a location parameter m ∈ R and a scaling parameter s ∈ R⁺, which correspond to the mean and standard deviation of the related normal distribution, respectively. The mean of the lognormal density is then given by e^{m + s²/2}. Since we want this mean to correspond to the previous state estimate, we set m = ln(x_{t−1}) − ½s² and obtain

$$f(x_t \mid x_{t-1}) = \ln\mathcal{N}\!\left(\ln(x_{t-1}) - \tfrac{1}{2}s^2,\; s^2\right), \qquad \mu(x_1) = \ln\mathcal{N}\!\left(\ln(x_0) - \tfrac{1}{2}s^2,\; s^2\right), \qquad (3.14)$$

where x_0 is some initial value and s ∈ R⁺. Alternatively, we can consider the state process to follow a Cox-Ingersoll-Ross (CIR) model, described for example in [8]. This means that

$$dx_t = \kappa(\bar{x} - x_t)\,dt + \sigma\sqrt{x_t}\,dW_t, \qquad (3.15)$$

where W_t is a standard Brownian motion, κ ∈ R⁺ a mean reversion parameter, x̄ ∈ R⁺ a long term mean parameter and σ ∈ R⁺ the volatility parameter. The Feller condition 2κx̄ > σ² has to be imposed when one wants to ensure that the process remains strictly positive. It can be shown that x_t then features a noncentral chi-squared distribution, so that

$$f(x_t \mid x_{t-1}) = 2c_t\,\mathrm{NCCS}(2c_t x_t;\, d, \lambda), \qquad \mu(x_1) = 2c_t\,\mathrm{NCCS}(2c_t x_1;\, d, \lambda), \qquad (3.16)$$

where $c_t = \frac{2\kappa}{(1 - e^{-\kappa\Delta t})\sigma^2}$ and $\mathrm{NCCS}(\cdot\,;\, d, \lambda)$ denotes the noncentral chi-squared distribution with $d = \frac{4\kappa\bar{x}}{\sigma^2}$ degrees of freedom and non-centrality parameter $\lambda = \frac{4\kappa x_{t-1} e^{-\kappa\Delta t}}{(1 - e^{-\kappa\Delta t})\sigma^2}$.

The last choice for f is a continuous uniform distribution on a symmetric interval around the previous state estimate, i.e.

$$f(x_t \mid x_{t-1}) = \mathcal{U}(x_{t-1} - a,\, x_{t-1} + a), \qquad \mu(x_1) = \mathcal{U}(x_0 - a,\, x_0 + a), \qquad (3.17)$$

where x_0 is some initial value and a ∈ R⁺. See Figure 3.27 for a visualization of the different transition densities given that x_{t−1} = 5 and all other parameters are set to one.
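The four transition densities, with the parameter values of Figure 3.27, can be set up directly with scipy.stats. The sketch below (with Δt = 1 as an assumption) also confirms that the location choice in (3.14) indeed gives the lognormal density mean x_{t−1}, and that the CIR transition (3.16) has the familiar conditional mean x̄ + (x_{t−1} − x̄)e^{−κΔt}:

```python
import numpy as np
from scipy.stats import norm, lognorm, ncx2, uniform

x_prev = 5.0
sigma_x = s = kappa = xbar = sigma = a = 1.0  # parameter values of Figure 3.27
dt = 1.0

# Normal transition (3.13)
f_norm = norm(loc=x_prev, scale=sigma_x)

# Lognormal transition (3.14): location chosen so the mean equals x_prev
f_logn = lognorm(s=s, scale=np.exp(np.log(x_prev) - 0.5 * s**2))

# CIR transition (3.16): a scaled noncentral chi-squared distribution
c = 2 * kappa / ((1 - np.exp(-kappa * dt)) * sigma**2)
d = 4 * kappa * xbar / sigma**2
lam = 4 * kappa * x_prev * np.exp(-kappa * dt) / ((1 - np.exp(-kappa * dt)) * sigma**2)
f_cir_mean = ncx2(df=d, nc=lam).mean() / (2 * c)  # mean of x_t under (3.16)

# Uniform transition (3.17)
f_unif = uniform(loc=x_prev - a, scale=2 * a)

print(f_logn.mean(), f_cir_mean, f_unif.mean())
```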

Figure 3.27: Transition densities (xt−1 = 5, σX = s = κ =x ¯ = σ = a = 1)

Figure 3.28 shows the approximation of the Black-Scholes price process in the 100th EM iteration for choices (3.13)-(3.17) in one run of the model. We observe that assuming that f is governed by a lognormal distribution or a CIR process produces the most accurate approximation. However, choosing a uniform distribution for f does not lead to satisfactory results. Unfortunately, we note that if we choose a CIR process, the computing time of the algorithm is much longer than for the other three choices (about 8 hours for one run with K = 100 EM iterations). The reason for this is that the numerical maximization of the parameters can become arbitrarily time-consuming when these parameters have to satisfy an additional nonlinear constraint (the Feller condition). Therefore, we omit the CIR model in the rest of the analysis. We remark that in initial numerical tests, the convergence behavior of the CIR model choice resembles that of choosing a lognormal f. Figure 3.29 shows the averaged error for every EM iteration over 30 trials of the model. We observe that convergence of the error is very slow if we choose a uniform distribution as transition density. For this Black-Scholes example, choosing the lognormal density (3.14) gives slightly better results in the long run than assuming a normal density for f.

Figure 3.28: Approximation of the price process (N = 6,K = 100, γk ≡ 1, q = f)

Figure 3.29: Average SSE (N = 6, K = 250, γk ≡ 1, q = f)

3.6 Conclusions

- For linear Gaussian Hidden Markov Models, using the Kalman Filter in an EM framework gives more accurate results than using the CPF-SAEM method. However, our proposed Hidden Markov Model for dimension reduction is not linear, due to the non-linearity of the valuation function in its risk drivers.
- Our dimension reduction model estimated by the CPF-SAEM method gives in general accurate results in a Black-Scholes example. The error converges in this case roughly proportionally to K^(−0.60), where K is the number of EM iterations.
- The CPF-SAEM method is relatively insensitive to the number of particles N.
- The maximization step is the most time-consuming step in each EM iteration. However, this step benefits greatly from parallelization.
- Setting the step size γk ≡ 1, i.e. using a regular Expectation Maximization framework instead of a Stochastic Approximation EM framework, leads to a major decrease of computing time. This only results in a minor loss of accuracy for this BS example.
- Constructing a proposal distribution q by running a UKF is theoretically attractive, but does not lead to significantly better results in this BS example.
- Assuming a 2-dimensional hidden state process instead of setting dX = 1 leads to faster convergence. However, this deteriorates the overall effectiveness of our model.
- Choosing a normal distribution, lognormal distribution or CIR model as transition density f in our HMM gives comparable results for this BS example. Choosing a uniform distribution does not lead to satisfactory results.

CHAPTER 4

Test cases for the HMM approach

In this chapter we test our HMM approach and the CPF-SAEM method in cases where we expect difficulties with the option value approximation. Besides this, we look further into the obtained hidden states and the distribution of the errors for an example in which we approximate option prices observed in the market. In Section 4.1 we investigate an example in which the valuation function is (in contrast to our previous Black-Scholes example) definitely not approximately linear in its risk drivers. In Section 4.2 we investigate an example in which the valuation function is not known in closed form and has to be calculated using a semi-analytical pricing method. In Section 4.3 we investigate a market example with a very high-dimensional risk driver process. This is the example for which we investigate the hidden states and approximation errors.

4.1 Non-linear example

To test the CPF-SAEM method on an example that is not approximately a linear function of its risk drivers, we take the same risk driver processes described in Section 3.2 and generate observations according to

$$Y_t = h(r_t, q_t, \sigma_t) = \sqrt[4]{r_t + q_t}\; e^{\sigma_t}. \qquad (4.1)$$

We then try to approximate this observation process by Y¯t = h(φ(Xt)), where φ(Xt) is the linear mapping given in (3.1) between the 1-dimensional hidden state process and the 3-dimensional risk driver process. The domain [Dmin,Dmax] is the same as described in Section 3.2. We consider the following non-linear Hidden Markov Model with X , Y = R:

$$f(x_t \mid x_{t-1}) = \mathcal{N}(x_{t-1}, \sigma_X^2), \qquad g(y_t \mid x_t) = \mathcal{N}(\bar{y}_t, \sigma_Y^2),$$

with µ(x_1) = N(x_0, σ_X²), where x_0 is some initial guess and σ_X, σ_Y ∈ R⁺. To calculate Ȳt we estimate the hidden states and the parameters θ from a realization of observation process (4.1). In Figure 4.1 we see a very accurate average approximation of the non-linear observation process compared to the approximation of the Black-Scholes price process in Figure 3.8. This is caused by the even greater simplicity of the function h in (4.1) as compared to the Black-Scholes observation

process (3.7). This also leads to a significant reduction of the computing time per EM iteration due to a simpler maximization step. We observe in Figure 4.2 that, although we choose the same initial conditions as for the BS example, convergence is much slower for this non-linear example. As in the Black-Scholes case in Figure 3.24, we do not observe a significant advantage in using an Unscented Kalman Filter to generate a proposal distribution compared to just taking the transition density f as proposal. We note that convergence is not faster when using more particles for the estimation, so for this example too the CPF-SAEM method is relatively insensitive to the number of particles used.

Figure 4.1: Average approximation of (4.1) (50 trials, N = 6,K = 400, γk ≡ 1)

Figure 4.2: Average SSE (50 trials, N = 6,K = 400, γk ≡ 1)

4.2 Extensive example: Heston model

We now extend the Black-Scholes example of Section 3.2 to a model more often used in industrial practice with a higher dimensional risk driver process. Therefore we will try to approximate the price process of an at-the-money call option (dY = 1) with one year to maturity under the Heston model [36]. One of the main deficiencies of the Black-Scholes model is that it assumes the volatility σ to be constant over time. This is typically not observed in the market. The Heston model models volatility as a stochastic quantity which evolves according to a CIR process. Therefore, the model is able to take into account the market skew (or smile)

that is typically observed in financial asset returns. The Heston model consists of two stochastic differential equations for the underlying asset price St and the variance process vt, described under the risk neutral measure Q by

$$dS_t = (r - q)S_t\,dt + \sqrt{v_t}\,S_t\,dW_t^S, \qquad S_0 = 1, \qquad (4.2)$$
$$dv_t = \kappa(\bar{v} - v_t)\,dt + \sigma\sqrt{v_t}\,dW_t^v, \qquad v_0 > 0, \qquad (4.3)$$

where the Brownian motions are assumed to be correlated, dW_t^S dW_t^v = ρ dt. Here r denotes the risk free interest rate, q the dividend yield, κ ≥ 0 the speed of mean reversion, v̄ ≥ 0 the long-term mean of the variance process and σ > 0 the volatility of the volatility. For the Heston model it is possible to determine the characteristic function in closed form; therefore we can use the semi-analytical COS pricing method to determine the value of plain vanilla call options HS(r, q, κ, v_0, v̄, σ, ρ, S, Ǩ, τ). Here S is the price of the underlying asset, Ǩ denotes the strike price of the option and τ the time to maturity. The COS method is based on Fourier-cosine series expansions and has relatively high computational speed due to the exponential convergence of the error. For details about the COS method we refer to [29]. We consider the price process of an at-the-money option with 1 year to maturity (S = Ǩ = τ = 1) and assume that there is no dividend yield. Therefore we need to consider n = 6 underlying risk drivers, R_t = (r, κ, v_0, v̄, σ, ρ)_t, in order to determine the price of such an option at each time step. First we generate data
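The thesis prices these options with the semi-analytical COS method; as a complementary cross-check, the Heston dynamics (4.2)-(4.3) can also be simulated directly. A minimal Monte Carlo sketch (log-Euler for S with full truncation of the variance; all parameter values are illustrative) that verifies the martingale property E[S_T] = S_0 e^{(r−q)T} under Q:

```python
import numpy as np

def heston_paths(r, q, kappa, vbar, sigma, rho, v0, T, n_steps, n_paths, rng):
    # Euler discretization of (4.2)-(4.3) under Q with full truncation
    # (max(v, 0) inside the square root), so the scheme stays usable even
    # when the discretized variance dips slightly below zero
    dt = T / n_steps
    S = np.full(n_paths, 1.0)  # S_0 = 1
    v = np.full(n_paths, v0)
    for _ in range(n_steps):
        z1 = rng.standard_normal(n_paths)
        z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.standard_normal(n_paths)  # corr = rho
        vp = np.maximum(v, 0.0)
        S *= np.exp((r - q - 0.5 * vp) * dt + np.sqrt(vp * dt) * z1)
        v += kappa * (vbar - vp) * dt + sigma * np.sqrt(vp * dt) * z2
    return S

rng = np.random.default_rng(3)
S_T = heston_paths(r=0.02, q=0.0, kappa=2.0, vbar=0.04, sigma=0.3,
                   rho=-0.7, v0=0.04, T=1.0, n_steps=200, n_paths=50_000, rng=rng)
# Under Q, E[S_T] = S_0 * exp((r - q) * T)
print(S_T.mean(), np.exp(0.02))
```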

Yt = HS(rt, κt, v0,t, v̄t, σt, ρt | S = Ǩ = τ = 1, q = 0)    (4.4)

for some known risk driver processes. We generate realizations for the long-term volatility mean in the same way as we did for σt in Section 3.2. We then try to approximate these observations by Ȳt = HS(φ(Xt) | S = Ǩ = τ = 1, q = 0), assuming that the 6-dimensional risk driver process is driven by some 1-dimensional hidden process {Xt}t≥1. We set the domain for the risk drivers Rt in transformation (3.1) by Dmin = [0, 0.01, 0.001, 0.001, 0.1, −1] and Dmax = [0.2, 5, 0.5, 0.5, 1, −0.1]. We consider the Hidden Markov Model given in (3.3)-(3.4) and estimate the parameters and states using the CPF-SAEM method. In Figure 4.3 we see a realization of price process (4.4) and its average approximation over 40 trials of the model. We observe that, just as for the BS example, the approximation process stays close to the mean of the observation process but is less volatile. From Figure 4.4 we note that constructing a proposal distribution using a UKF gives slightly better results, but that there is still no significant difference in approximation quality; using the prior distribution as proposal already led to satisfactory results for this example as well. We note that we chose the initial conditions in the right order of magnitude and that the convergence speed is comparable to that of the BS example, and thus relatively fast compared to the highly non-linear example.
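The dynamics (4.2)-(4.3) can be sketched with a simple full-truncation Euler Monte Carlo scheme. This is an illustrative sketch only: the thesis prices options with the semi-analytical COS method, and all parameter values below are hypothetical rather than calibrated.

```python
import math
import random

def simulate_heston(r=0.01, q=0.0, kappa=1.5, v_bar=0.04, sigma=0.5,
                    rho=-0.7, s0=1.0, v0=0.04, T=1.0, n_steps=252, seed=42):
    """Full-truncation Euler discretization of the Heston SDEs (4.2)-(4.3).

    Returns one path [(S_t, v_t), ...]. Parameter values are illustrative.
    """
    rng = random.Random(seed)
    dt = T / n_steps
    s, v = s0, v0
    path = [(s, v)]
    for _ in range(n_steps):
        z1 = rng.gauss(0.0, 1.0)
        # correlate the two Brownian increments with coefficient rho
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        v_pos = max(v, 0.0)  # full truncation: keep v_t >= 0 inside the square roots
        s *= math.exp((r - q - 0.5 * v_pos) * dt + math.sqrt(v_pos * dt) * z1)
        v += kappa * (v_bar - v_pos) * dt + sigma * math.sqrt(v_pos * dt) * z2
        path.append((s, v))
    return path
```

The log-Euler step for St keeps the simulated asset price strictly positive, while the variance process may still dip below zero between steps, which the truncation handles.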

Figure 4.3: Average approximation of (4.4) (40 trials, N = 6, K = 50, γk ≡ 1)

Figure 4.4: Average SSE (40 trials, N = 6,K = 50, γk ≡ 1)

4.3 Market example: basket of S&P-500 index options

To test our model on market data, we consider the historical price process of a basket of S&P-500 index call options (dY = 1). The relative strikes are 80%, 90%, 100%, 110% and 120%, and the maturities are 3 months, 6 months, 1 year, 1.5 years and 2 years, so that the basket consists of 25 options. We use historical data from June 2005 until February 2014 with a monthly frequency, so that we have T = 105 time steps. We use the observed implied volatilities and the corresponding US interest rates (3 months - 2 years) to calculate the price of the basket. Implied volatilities are the values of σ in the Black-Scholes formulas (3.5)-(3.6) that match the model price with the option price observed in the market. Although in industrial practice more advanced option pricing models have overtaken the Black-Scholes model, it is still used to quote prices in terms of volatilities. The use of implied volatilities instead of market prices makes options with different strikes and maturities easier to compare.
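The implied volatility definition above can be illustrated with a short sketch: a Black-Scholes call price (the role of formulas (3.5)-(3.6)) and its inversion by bisection. The function names are illustrative, not the thesis code.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(S, K, tau, r, sigma, q=0.0):
    """Black-Scholes price of a European call option."""
    d1 = (math.log(S / K) + (r - q + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return S * math.exp(-q * tau) * norm_cdf(d1) - K * math.exp(-r * tau) * norm_cdf(d2)

def implied_vol(price, S, K, tau, r, q=0.0, lo=1e-6, hi=5.0, tol=1e-10):
    """The sigma that matches the Black-Scholes price with the observed price.

    Bisection works because the call price is increasing in sigma.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, tau, r, mid, q) < price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

A round trip (price an option at some volatility, then invert) recovers the original volatility to high precision.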

We assume that there is no dividend yield; this means that we have n = 30 risk drivers Rt = (r1, . . . , r5, σ1^impl, . . . , σ25^impl)t. The market price of the basket is then given by

Yt = Σ_{i=1}^{25} wi BS(ri,t, σi,t^impl | Ǩi, τi, S = 1, q = 0),    (4.5)

where Σi wi = 1 and we take all weights equal to 1/25. We approximate this price process by

Ȳt = Σ_{i=1}^{25} wi BS(φ(Xt) | Ǩi, τi, S = 1, q = 0),    (4.6)

where we consider the Hidden Markov Model given in (3.3)-(3.4) for a 1-dimensional hidden state process. In the transformation φ given in (3.1) we set the lower truncation bound Dmin = 0 for all risk drivers, and the upper bound Dmax = 0.15 for all ri and Dmax = 1.5 for all σi^impl. Note that we have A, b ∈ R^30. We use the CPF-SAEM method with f as proposal to estimate the parameters and hidden states. Note that we assume that the 30-dimensional risk driver process is driven by a 1-dimensional hidden process, which provides a significant dimension reduction. In Figure 4.5 we see price process (4.5) and its accurate average approximation over 40 trials of the model. A 1-dimensional hidden process can already achieve such satisfactory results because the risk drivers are all highly correlated: all US interest rates have very high correlation coefficients, as do the implied volatilities of two options on the same index with comparable relative strike and maturity. See Appendix C for the correlation matrices of the risk driver processes over this period. Note that in Figure 4.5 even the extreme option basket prices during the financial crisis of 2008 are captured relatively well by the approximation. In Figure 4.6 we observe that, just as in our BS and Heston examples, the convergence is fast compared to the non-linear example of Section 4.1.

Figure 4.5: Average approximation of (4.5) (40 trials, N = 6,K = 150, γk ≡ 1, q = f)

Figure 4.6: Average SSE (40 trials, N = 6,K = 150, γk ≡ 1, q = f)

Correlation of the hidden states with the market volatility

For this market example we investigate the obtained hidden state process {Xt}t≥1 further for one of the 40 trials. To apply some of the analysis done in working paper [58] to our Hidden Markov Model approach, we investigate the relationship between the estimated hidden states and the market volatility. A widely used measure of the market's expectation of equity market volatility over the next 30-day period is the VIX index. It is constructed using the prices of a wide range of short-maturity at-the-money and out-of-the-money S&P-500 index options [2]. In Figure 4.7 we present the estimated hidden states and the historical VIX index over time. The hidden states and the market volatility show a significant positive correlation; the Pearson correlation coefficient over this period is 0.7941. The explanation for this is that market volatility is one of the main drivers of option prices: the greater the expected volatility, the higher the option value. Besides this, the option prices are positively correlated with the hidden states (compare Figure 4.7 with Figure 4.5). The reason for this is that we chose A, b ≥ 0 in transformation (3.1), so the risk drivers r and σ are positively related to {Xt}t≥1, and furthermore r and σ have a positive relationship with the Black-Scholes price (see Figure 3.7).

The univariate process for the VIX index {VIXt}t≥1 is often modeled by (G)ARCH and mean-reverting models [21, 30]. Due to the high correlation between {VIXt}t≥1 and {Xt}t≥1, it might be beneficial for market examples to let the transition probability density f(xt | xt−1) follow the (mean-reverting) CIR model given in (3.16). We note that we can omit the Feller condition because the state process does not have to be strictly positive; choosing a CIR model instead of a normal distribution is therefore not computationally more expensive. We will investigate this possibility in the next chapter.

Figure 4.7: Estimated hidden states (left) and observed VIX index (right)

Distribution of the errors

We investigate the distribution of the errors for all 40 trials of this market example. This means that we analyze 4200 realizations of εt = Yt − Ȳt. Within our Hidden Markov Model we assumed that

Yt | (Xt = xt) ∼ g(yt | xt) = N(ȳt, σY²),    (4.7)

where ȳt is defined in (4.6). Therefore we expect the errors to be normally distributed according to εt ∼ N(0, σY²). However, three of the most popular statistical tests for normality (the Anderson-Darling, Jarque-Bera and Lilliefors tests) all reject the null hypothesis that the errors come from a normally distributed population at a 5% significance level.
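For reference, the Jarque-Bera statistic used here can be computed directly from sample moments; a minimal sketch (the Anderson-Darling and Lilliefors tests involve the empirical distribution function and are omitted):

```python
def jarque_bera(sample):
    """Jarque-Bera statistic: n/6 * (S^2 + K^2/4), with S the sample skewness
    and K the sample excess kurtosis. Under normality it is asymptotically
    chi-squared with 2 degrees of freedom; large values reject normality."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skew = m3 / m2 ** 1.5
    ex_kurt = m4 / m2 ** 2 - 3.0
    return n / 6.0 * (skew ** 2 + ex_kurt ** 2 / 4.0)
```

Heavy tails in the errors inflate the excess kurtosis term, which is exactly what the Student's t fit below accounts for.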

The best fit with the empirical error distribution is obtained when fitting a location-scale Student's t-distribution to the errors. This distribution is useful when modeling peaked data with heavier tails than the normal distribution. The location-scale Student's t-distribution is parametrized by a location parameter m ∈ R, scale parameter s > 0 and shape parameter ν > 0, and is a generalization of the classic Student's t-distribution: if X has a location-scale Student's t-distribution, (X − m)/s has a Student's t-distribution with ν degrees of freedom. The mean of the location-scale distribution is given by m when ν > 1 and is undefined otherwise. In Figure 4.8 a visualization of the best fitted normal distribution and location-scale Student's t-distribution is presented. The location-scale distribution obtains a significantly better fit with the observed errors than the normal distribution. The estimated parameters for the location-scale distribution are m = −0.0003, s = 0.0024, ν = 1.5289. Since the estimated m is very close to zero, it might be beneficial for market data to replace the normal observation density (4.7) with a location-scale Student's t-distribution with location parameter m = ȳt. We investigate model behavior for this new choice of observation density in the next chapter.
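The location-scale Student's t density follows from the standard Student's t density by the substitution z = (x − m)/s; a minimal sketch (the function name is illustrative):

```python
import math

def t_pdf_ls(x, m, s, nu):
    """Density of the location-scale Student's t-distribution: if X has this
    density, (X - m)/s is standard Student's t with nu degrees of freedom."""
    z = (x - m) / s
    # normalizing constant of the standard t, with the extra 1/s Jacobian factor
    c = math.gamma((nu + 1) / 2) / (math.gamma(nu / 2) * math.sqrt(nu * math.pi) * s)
    return c * (1.0 + z * z / nu) ** (-(nu + 1) / 2)
```

A crude Riemann sum confirms the density integrates to one, and the density is symmetric about m, consistent with the mean being m for ν > 1.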

Figure 4.8: Best fitting normal and location-scale Student’s t-distribution

4.4 Conclusions

• The Hidden Markov Model approach with a 1-dimensional hidden state process leads to satisfactory approximations for all three cases in this chapter.
• When the valuation function is non-linear in its risk drivers, convergence of the CPF-SAEM method for state and parameter inference is slower.
• A closed-form valuation function leads to better overall accuracy and a less time-consuming maximization step compared to a valuation function that has to be evaluated using numerical methods.
• For a market example the hidden states are strongly correlated with the volatility in the market.
• It might be beneficial for market examples to assume a location-scale Student's t-distribution as observation density, since this better fits the empirical error distribution.

CHAPTER 5

Application to reduction of overfitting in the Heston model

In this chapter we apply our Hidden Markov Model for dimension reduction, described in Section 3.1, to the reduction of overfitting within the Heston model. To this end we investigate the trade-off between goodness of fit and model (parameter) stability, which is required for accurate out-of-sample approximations. In Section 5.1 we discuss standard calibration of the Heston model, which will serve as a benchmark for our HMM approach. We discuss measures for overfitting and multicollinearity in Section 5.2 and show that direct calibration results in an overfitted Heston model. In Section 5.3 we apply various versions of our Hidden Markov Model to the calibration of the Heston model in order to reduce this overfitting. Finally, in Section 5.4 we perform two different out-of-sample tests.

5.1 Calibration of the Heston model

In this section we investigate the calibration of the Heston model given in (4.2)-(4.3) to approximate the price processes of nine S&P-500 equity index put options. Model calibration refers to the process of finding the set of model parameters θ^HS = {κ, v0, v̄, σ, ρ} that best fits the observed market data at each time step. Note that we assume that there is no dividend yield. The challenges in this calibration are the non-constant behavior over time and the dependence of the option prices on both moneyness and maturity, which causes high dimensionality in the observation process (dY = 9). The strike levels considered are 80%, 100% and 120% of the initial asset price S, and the times to maturity are 3 months, 1 year and 2 years. We use historical implied volatilities from June 2005 until March 2014 with a monthly frequency from Bloomberg (T = 106). Besides this, we use the corresponding 3 months, 1 year and 2 years US interest rates. See Appendix D for a full graphical representation of all data used. With the Black-Scholes formulas (3.5)-(3.6) we convert the implied volatilities with corresponding interest rates to market prices {Yt}t≥1 ∈ R^9, which we will use for the calibration. See Figure 5.1 for the price and implied volatility surface on 12/3/2014. We observe that the IV surface is skewed in the maturity dimension and that the price of a put option increases with a higher strike level or longer time to maturity.

(a) Implied volatility (b) Price

Figure 5.1: Observation surfaces on 12/3/2014

Our calibration procedure consists of computing for every time step t

arg min_{θt^HS ∈ D} Σ_{i=1}^{9} (Yi,t − HS(θt^HS | ri,t, Ǩi, τi, S = 1, q = 0))²    (5.1)

to obtain estimates for all 5 Heston parameters. We note that it is also possible to use some norm ‖·‖ of the difference between market and model option prices instead of the sum of squared errors; besides this, one can also use implied volatilities instead of option prices in this minimization procedure. As in Section 4.2, we compute the model option prices Ȳt = HS(θt^HS) using the COS method [29], because this is an accurate pricing method with low computing times. We solve minimization problem (5.1) numerically using the solver fmincon in Matlab. The domain D for the Heston parameters is given by Dmin = [0.01, 0.001, 0.001, 0.1, −1] and Dmax = [5, 0.5, 0.5, 1, −0.1] for every t. We measure the error in this calibration using the sum of squared errors (SSE), the average absolute error (|ε̄|) and the maximum absolute error (max |ε|). Besides this, we present the R-squared error (R²), which gives the proportion of the variability in the data that is accounted for by the model. It is also referred to as the coefficient of determination and is given by

R² = 1 − Σt (Yt − Ȳt)² / Σt (Yt − (1/T) Σt Yt)².

In Equation (5.2) we summarize the overall error measures of this calibration. The calibrated Heston model accurately approximates the observed historical market prices.

SSE = 0.0187,  |ε̄| = 0.0032,  max |ε| = 0.0177,  R² = 0.9967    (5.2)
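The four error measures used throughout this chapter can be sketched as a small helper (illustrative, not the thesis code):

```python
def error_measures(y, y_hat):
    """SSE, average absolute error, maximum absolute error and R^2
    for an observed series y and a model series y_hat."""
    T = len(y)
    resid = [yt - yh for yt, yh in zip(y, y_hat)]
    sse = sum(e * e for e in resid)
    mean_y = sum(y) / T
    tss = sum((yt - mean_y) ** 2 for yt in y)  # total sum of squares
    return {
        "SSE": sse,
        "mean_abs": sum(abs(e) for e in resid) / T,
        "max_abs": max(abs(e) for e in resid),
        "R2": 1.0 - sse / tss,
    }
```

Note that R² compares the residual variability against the variability of the data around its own mean, so a model that only reproduces the mean of Yt scores R² = 0.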

We present the calibrated Heston model parameters at every time step in Figure 5.2. We observe non-constant behavior over time for all parameters, especially for the volatility parameters v0, v̄ and σ. We observe a negative correlation ρ between the asset returns and changes in volatility. This corresponds to the leverage effect: when the asset return becomes positive, investors gain confidence in the market and therefore volatility in the near future will decrease. The calibrated

option and market prices for all strike-maturity pairs are shown in Figure 5.12 in Section 5.3. In general the model prices accurately approximate the prices observed in the market for each strike-maturity combination. The biggest relative error is observed in the approximation of the 3-months-to-maturity at-the-money option. In Figure 5.3 the monthly error measures and the mean absolute error surface are given. The errors resulting from the calibration are caused by limitations of the Heston model and by the numerical evaluation of minimization problem (5.1).

Figure 5.2: Calibrated Heston model parameters

Figure 5.3: Monthly error measures and mean absolute error surface

5.2 Overfitting

In the optimization of Equation (5.1) the Heston parameter set is modeled as if it is independent of earlier realizations, and the calibration is only done to the market prices observed at time t. This approach can lead to overfitting and to the volatile parameter evolution through time that we observe in Figure 5.2. In practice, we want stable Heston parameters for a consistent monthly valuation of the option prices. Overfitting occurs, for example, when there are too many model parameters relative to the number of observations. The model then describes random error or noise instead of the underlying relationship. This can lead to a more accurate fit to known historical training data but to unstable out-of-sample option valuations: the model has poor predictive performance because it is sensitive to minor fluctuations in the input or training data. A concept that is closely related to, and a possible cause of, overfitting is (multi)collinearity. This refers to a (near) linear relation among two or more variables in a model, meaning that one can accurately be linearly predicted from the others [3]. Collinearity denotes a linear dependency between two variables, whereas multicollinearity means that one variable is (nearly) a linear combination of two or more other variables; the two terms are used interchangeably in the literature, and for clarity we will use multicollinearity in this thesis to refer to the general concept. Multicollinearity is mostly studied in the field of multivariate linear regression analysis, but is also considered when analyzing machine-learning techniques and non-linear models [22, 1]. A commonly used, intuitive quantification of multicollinearity is the pairwise Pearson correlation coefficient between the variables of a model. We note that the correlation matrix cannot identify multicollinearity involving three or more variables.
Since correlation is a special case of multicollinearity, high correlation implies multicollinearity, but the converse does not hold automatically: multicollinearity can still be present even when all correlations are close to zero. Therefore using only the correlation matrix as a diagnostic is not enough. In Section 5.1 we obtained time series for the five Heston parameters θ^HS = {κ, v0, v̄, σ, ρ}; in Table 5.1 we present the Pearson correlation coefficients between these calibrated parameters. We observe high correlation coefficients (> 0.7) between the Heston volatility parameters {v0, v̄, σ}. The correlations between ρ and v̄ and between ρ and σ are slightly negative; the remaining correlations are positive. Besides this, the determinant of the correlation matrix is close to zero, det = 0.0552. In general, absolute correlation coefficients greater than 0.7 and near-zero determinants of the correlation matrix are considered strong indicators for multicollinearity [22]. The variance inflation factor (VIF) is a very popular measure for multicollinearity that is mostly used in ordinary least squares regression analysis. The VIF for the i-th variable is given by

VIFi = 1 / (1 − Ri²),

where Ri² is the coefficient of determination when variable i is regressed on all other variables in the model. When VIFi = 1, the i-th variable is not linearly related to the other variables. A large value of VIFi indicates multicollinearity, although various threshold values are used in the literature. Most common is a threshold of VIFi > 5-10 [34, 14]; in other literature a threshold of 2.5 is already given as a reason for concern [16]. The VIF values for the parameters determined in Section 5.1 are

VIF_κ = 1.2,  VIF_v0 = 4.3,  VIF_v̄ = 7.3,  VIF_σ = 4.0,  VIF_ρ = 1.4,

which indicates at least moderate multicollinearity. We note that the most severe multicollinearity is observed for the volatility parameters {v0, v̄, σ}, for which we already observed non-constant behavior over time in Figure 5.2 and the highest pairwise correlation coefficients in Table 5.1. The last indicator for multicollinearity that we consider is the condition number. This is also a general measure for the ill-conditioning (sensitivity to small changes in the data) of a problem. A condition number greater than 1 indicates the presence of multicollinearity, and when the condition number is greater than 30 this multicollinearity is severe [3]. We note that this is a relatively small range compared to the differences in condition number that are considered for stability tests in the field of numerical analysis. We calculate the condition number of

HS(θ^HS | r, Ǩ, τ, S = 1, q = 0) =: f : R⁵ → R⁹

for the Heston model parameters θ^HS = {κ, v0, v̄, σ, ρ} calibrated in Section 5.1 at every time step t = 1, . . . , T. The relative condition number is then given by [48]

CN(θt^HS) = ‖J(θt^HS)‖ / (‖f(θt^HS)‖ / ‖θt^HS‖).    (5.3)

Here ‖·‖ denotes the (induced) Euclidean norm and J(θt^HS) denotes the Jacobian matrix of partial derivatives of f at θt^HS, given by

             ⎡ ∂f1/∂κ  · · ·  ∂f1/∂ρ ⎤
J(θt^HS) =   ⎢   ⋮       ⋱      ⋮    ⎥ .
             ⎣ ∂f9/∂κ  · · ·  ∂f9/∂ρ ⎦t

We estimate this Jacobian numerically in Matlab. The condition number of the direct calibration is given in Figure 5.4; this also indicates the presence of some (moderate) multicollinearity. We note that there is a significant negative correlation between the market volatility (represented by the VIX index) and the condition number: the Pearson correlation coefficient over this period is −0.74. In periods with a volatile market, there is less multicollinearity present in the Heston model. This indicates that more parameters are required to describe the option prices in a volatile market than in a less volatile market.
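The relative condition number (5.3) can be estimated numerically with a central finite-difference Jacobian and the induced 2-norm obtained by power iteration. The thesis uses Matlab for this; the helper below and its names are an illustrative Python sketch.

```python
import math

def relative_condition_number(f, theta, h=1e-6):
    """Relative condition number ||J(theta)|| / (||f(theta)|| / ||theta||),
    cf. (5.3), for f: R^n -> R^m given as a plain Python function."""
    n = len(theta)
    f0 = f(theta)
    m = len(f0)
    # central finite-difference Jacobian
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        up = list(theta); up[j] += h
        dn = list(theta); dn[j] -= h
        fu, fd = f(up), f(dn)
        for i in range(m):
            J[i][j] = (fu[i] - fd[i]) / (2.0 * h)
    # largest singular value of J via power iteration on J^T J
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(100):
        Jv = [sum(J[i][j] * v[j] for j in range(n)) for i in range(m)]
        w = [sum(J[i][j] * Jv[i] for i in range(m)) for j in range(n)]
        nrm = math.sqrt(sum(x * x for x in w))
        v = [x / nrm for x in w]
    Jv = [sum(J[i][j] * v[j] for j in range(n)) for i in range(m)]
    norm_J = math.sqrt(sum(x * x for x in Jv))
    norm_f = math.sqrt(sum(x * x for x in f0))
    norm_t = math.sqrt(sum(x * x for x in theta))
    return norm_J * norm_t / norm_f
```

For a linear map f(x) = Ax the estimate reduces to ‖A‖·‖x‖/‖Ax‖, so the identity map has condition number exactly 1.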

1, 00 0, 23 0, 04 0, 15 0, 15  0, 23 1, 00 0, 80 0, 71 0, 18    0, 04 0, 80 1, 00 0, 86 −0.16   0.15 0, 71 0, 86 1, 00 −0.08 0, 15 0, 18 −0.16 −0.08 1, 00

Table 5.1: Correlation matrix of the calibrated parameters {κ, v0, v,¯ σ, ρ}
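For illustration, the VIF diagnostic discussed above can be computed by regressing each variable on the others via ordinary least squares. This is a self-contained sketch on hypothetical collinear data, not the calibrated Heston parameter series.

```python
def _solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            fac = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= fac * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def vif(columns):
    """VIF_i = 1 / (1 - R_i^2), with R_i^2 from an OLS regression
    (with intercept) of column i on all other columns."""
    T = len(columns[0])
    out = []
    for i, y in enumerate(columns):
        X = [[1.0] + [columns[j][t] for j in range(len(columns)) if j != i]
             for t in range(T)]
        k = len(X[0])
        # normal equations X^T X beta = X^T y
        XtX = [[sum(X[t][a] * X[t][b] for t in range(T)) for b in range(k)] for a in range(k)]
        Xty = [sum(X[t][a] * y[t] for t in range(T)) for a in range(k)]
        beta = _solve(XtX, Xty)
        y_hat = [sum(X[t][a] * beta[a] for a in range(k)) for t in range(T)]
        y_bar = sum(y) / T
        r2 = 1.0 - sum((yt - yh) ** 2 for yt, yh in zip(y, y_hat)) \
                 / sum((yt - y_bar) ** 2 for yt in y)
        out.append(1.0 / (1.0 - r2))
    return out
```

On data where one column is (nearly) a shifted copy of another, the VIFs of the two collinear columns explode while an unrelated column stays near 1, matching the interpretation given in the text.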

Figure 5.4: Condition number and VIX index

5.3 Hidden Markov Model approach

To reduce the problem of multicollinearity and overfitting described in the previous section, we apply our Hidden Markov Model approach to the calibration of the Heston model described in Section 5.1. We now regard the Heston parameters θt^HS as risk drivers Rt = (κ, v0, v̄, σ, ρ)t that are driven by some 1-dimensional hidden state process {Xt}t≥1 through transformation φ given in (3.1), where A, b ∈ R⁵₊. We use the same domain and data described in Section 5.1. Instead of calibrating (5.1) separately for every time step t, we now filter the hidden states and estimate the parameters iteratively using the CPF-SAEM method. We then approximate the price process {Yt}t≥1 ∈ R⁹ by Ȳt = HS(φ(xt) | ri,t, Ǩi, τi, S = 1, q = 0). See Figure 5.5 for a schematic overview of the two approaches. Instead of calibrating 5 parameters at every time step (530 parameters), we now only need to estimate 12-15 parameters and T = 106 hidden states. Since we have a 9-dimensional observation process, we choose the observation density according to (4.1), i.e.

g((yt^(1), . . . , yt^(9))′ | xt) = N((ȳt^(1), . . . , ȳt^(9))′, σY² I9),    (5.4)

where N denotes the multivariate normal distribution. We initially take the most standard choice for the proposal distribution, i.e. we take q equal to the transition density f. As described in Section 4.3, it might be beneficial for market data to consider a CIR model for the hidden state process. Therefore, we test three different choices for the transition density f(xt | xt−1). Firstly, we use the normal distribution (3.13), which has given us satisfactory results in all previous analyses. Secondly, we use the noncentral chi-squared distribution given in (3.16), which corresponds to the state process following a CIR model. We omit the Feller condition for this choice because the CPF-SAEM method will not break down if xt = 0 occurs.
Since we discard this condition in the optimization of the objective function, this choice leads to a major decrease in computing time. The last choice of transition density is the lognormal distribution given in (3.14), because this transition density gave us satisfactory results for our Black-Scholes example in Section 3.5.

Figure 5.5: Visualization of the approach in Section 5.1 (above) and in Section 5.3 (below)

For each of these three alternatives we run the CPF-SAEM algorithm with N = 6 particles, K = 250 EM iterations, γk ≡ 1 and the same random seed. In Table 5.2 and Figure 5.6 we see very comparable model behavior for all three choices of transition distribution f; all error measures are approximately the same. Against expectations, we observe that using a CIR model for the hidden state process leads to a slightly less accurate approximation of the price process {Yt}t≥1 compared to choosing f (log)normal. The fast convergence observed in Figure 5.7 is caused by the choice of relatively accurate initial conditions. We chose x1:T[0] = 0 and b = (0.36, 0.05, 0.13, 0.52, −0.71)′, so that the initial guess for each of the Heston parameters is close to the mean over time of the calibrated parameter in Section 5.1. We note that for such a choice of initial conditions, running the algorithm with far fewer EM iterations (for example K = 50) would already lead to satisfactory results.

               SSE      |ε̄|      max |ε|   R²
HMM normal     0.0670   0.0065   0.0252    0.9880
HMM CIR        0.0761   0.0069   0.0282    0.9864
HMM lognormal  0.0670   0.0065   0.0248    0.9880

Table 5.2: Overview of all error measures for different f

Figure 5.6: Monthly error measures and mean absolute error surface for different f

Figure 5.7: SSE for different f

Using the Unscented Kalman Filter

Since our model is relatively insensitive to the choice of transition distribution, we take the normal distribution for f and try to improve model behavior by changing the proposal distribution q. In Section 3.4 we gave different options for choosing a proposal distribution. We now construct a proposal distribution by running an Unscented Kalman Filter (see Appendix B), because this gave satisfactory results in previous examples and is theoretically attractive. Unfortunately, in Table 5.3 and Figure 5.8 we observe only a minimal improvement in the error measures when using an Unscented Kalman Filter to construct the proposal distribution. In Figure 5.9 we see very comparable convergence behavior for both choices of proposal distribution.

                 SSE      |ε̄|      max |ε|   R²
HMM q = f        0.0670   0.0065   0.0252    0.9880
HMM q using UKF  0.0662   0.0064   0.0253    0.9882

Table 5.3: Overview of all error measures for different q

Figure 5.8: Monthly error measures and mean absolute error surface for different q

Figure 5.9: SSE for different q

Change of observation density

When analyzing the error distribution of the market example in Section 4.3, we discovered that it might be beneficial for market data to consider a location-scale Student's t-distribution with location parameter m = ȳt as observation density g, the reason being that this distribution obtains the best fit with the empirical error distribution, where εt = Yt − Ȳt. Since our observation process {Yt}t≥1 is 9-dimensional, we change our multivariate normal observation density (5.4) into an uncorrelated multivariate Student's t-distribution. Within the CPF-SAEM algorithm we estimate the scale parameter s and shape parameter ν, and set

Zt^(j) = (yt^(j) − ȳt^(j)) / s    for j = 1, . . . , 9,

so that Zt has a multivariate Student's t-distribution with ν degrees of freedom. As transition density f we take the normal distribution and we construct the proposal q using a UKF. In Table 5.4 and Figure 5.10 we now observe a significant improvement in most error measures for all strike-maturity combinations when using a Student's t-distribution as observation density. This is caused by a smaller model error, because this observation density fits the empirical distribution of Yt | (Xt = xt) more closely. We note that the maximum absolute error is slightly bigger and that the monthly error measures become more volatile over time when using this new observation density. In Figure 5.11 we observe that the sum of squared errors in each EM iteration also becomes more volatile; however, the convergence of the error is significantly better.

                     SSE      |ε̄|      max |ε|   R²
HMM g = normal       0.0662   0.0064   0.0253    0.9882
HMM g = Student's t  0.0459   0.0048   0.0357    0.9918

Table 5.4: Overview of all error measures for different g

Figure 5.10: Monthly error measures and mean absolute error surface for different g

Figure 5.11: SSE for different g

Comparison with direct calibration

We now compare our Hidden Markov Model approach (with f normal, q constructed using a UKF and g Student's t) with the direct calibration of Section 5.1 in terms of goodness of fit and (parameter) stability. We see in Table 5.5 that the Hidden Markov Model approximation is less accurate than estimation by direct calibration, as expected. In Figure 5.12 we observe that the HMM approach especially underestimates the prices of short-maturity options in periods when their price process is very volatile. We note that as the time to maturity of the option increases, the approximation of the HMM approach improves.

However, since the hidden state process {Xt}t≥1 is the only variable changing over time, there can no longer be multicollinearity between variables when using the HMM approach; trivially, the determinant of the correlation matrix and the variance inflation factor are both 1 in this case. The condition number is now a measure for the overall stability of our approximation rather than an indicator for multicollinearity. We estimate the condition number of

HS(φ(xt) | r, Ǩ, τ, S = 1, q = 0) =: f : R → R⁹

for the estimated hidden states in our HMM approach at each time step t = 1, . . . , T. In the definition of CN(xt) given in (5.3), the Jacobian J(xt) is now calculated with respect to the 1-dimensional hidden state. The condition number of the HMM approach is presented in Figure 5.13. We observe that using an HMM approximation leads to a significant decrease in condition number compared to approximating the price process by direct calibration (see Figure 5.4). The maximum value of the condition number over time is reduced from 6.29 to 1.15, which indicates that we obtained a more stable approximation.

                    SSE      |ε̄|      max |ε|   R²
Direct calibration  0.0187   0.0032   0.0177    0.9967
HMM approach        0.0459   0.0048   0.0357    0.9918

Table 5.5: Overview of all error measures

Figure 5.12: Calibrated, HMM and observed option prices

Figure 5.13: Condition number of the HMM approach and direct calibration

5.4 Out-of-sample testing

To assess whether the reduction of overfitting by the Hidden Markov Model approach leads to more stable and accurate approximations, we now perform two out-of-sample tests. Since our HMM approach is recursive, it is not possible to use popular cross-validation techniques. Therefore we consider the regular forecasting error for 1 month, 3 months and 6 months ahead. We take the mean of the transition distribution f at time t as point prediction for the hidden state at time t + h, rather than sampling a range of scenarios (see Figure 1.3). Note that this means that we simply take Ȳt as forecast for Yt+h, h ∈ {1, 3, 6}. As error measure we use the average absolute error over {Yt}t≥1 ∈ R⁹. We note that the direct calibration is not recursive and that the calibration of the Heston parameters θt^HS at time step t is independent of the calibration at all other time steps. We test how accurate the prediction Ȳt^dc = HS(θt^HS) is for Yt+1, Yt+3 and Yt+6 at time steps t = 70, . . . , 100. For our HMM approach (with f normal, q constructed using a UKF and g Student's t) we perform K = 250 iterations to obtain state estimates x*1:106 and parameter estimates θ* = (A, b, σX, ν, s). We then test how accurate the prediction Ȳt^HMM = HS(φθ*(xt*)) is for Yt+1, Yt+3 and Yt+6 at time steps t = 70, . . . , 100. See Figure 5.16 I for a visualization. In Table 5.6 we observe that the HMM approach indeed gives more accurate prediction results, especially when predicting over a longer period. Besides this, we observe in Figure 5.14 that the 1-month-ahead prediction error of the HMM approach is less volatile than that of the direct calibration. Naturally, the error of this simple prediction method is large in periods where the price process is volatile, for both direct calibration and the HMM approach.
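The point forecast of this first test, taking the fitted value at time t as the prediction for Yt+h, leads to a simple error measure. A minimal sketch for one option series (the thesis averages this over the 9-dimensional observation; names are illustrative):

```python
def naive_forecast_mae(y, y_model, horizon, t_start, t_end):
    """Mean absolute h-step-ahead error when the fitted value at time t is
    used as the point forecast for Y_{t+horizon}.
    y and y_model are equal-length lists indexed from t = 0."""
    errs = [abs(y[t + horizon] - y_model[t]) for t in range(t_start, t_end + 1)]
    return sum(errs) / len(errs)
```

Even a perfectly fitted model then incurs an error equal to the average h-step movement of the observed series, which is why the measure grows with the horizon in volatile periods.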

                    1M       3M       6M
Direct calibration  0.0056   0.0077   0.0097
HMM approach        0.0056   0.0073   0.0085

Table 5.6: First out-of-sample test, overall |ε̄|

Figure 5.14: First out-of-sample test, |ε̄t|

We note that for the HMM approach this first test is not strictly out-of-sample, because we used all 106 observations to estimate θ∗. In the CPF-SAEM algorithm we obtain new parameter esti- mates in each iteration k by maximizing Qˆk. In this maximization step we calculate the complete i HMM ∗ ¯ ∗ likelihood and thus also gθk (yt | xt) for t = 1,...,T and therefore Yt = HS(φθ (xt )) is not independent of Yt+1, Yt+3 and Yt+6. The most fair out-of-sample test for the HMM approach is to redo the whole algorithm with K = 250 EM iterations for each time step t = 70,..., 100 ∗ ∗ to obtain state estimates x1:t and new parameter estimates θ (see Figure 5.16 Opt). However, we note that this involves running the entire model 31 times. Since running the model with K = 250 EM iterations (for a fair comparison) takes about one day, this is unfortunately too time-consuming at this stage of the project.

Therefore we use a compromise in our second out-of-sample test. We perform K = 250 iterations to obtain state estimates x*1:69 and parameter estimates θ* = (A, b, σX, ν, s). We then fix θ* and generate x*70:100 by running a particle filter conditional on (x*69, ..., x*69) ∈ R^31. See Figure 5.16 II for a graphical representation. Note that although we use Yt in the particle filter to generate the state estimate xt*, the prediction Ȳt^HMM = HS(φθ*(xt*)) is still an out-of-sample estimate for Yt+1, Yt+3 and Yt+6.

In Figure 5.15 and Table 5.7 we observe that the direct calibration gives more accurate prediction results in this second test. We note that we also tried conditioning on the weighted average of the particles in the previous time step, and on the particle with the highest weight in the previous time step, instead of on x*69 for every t = 70, ..., 100. However, this did not lead to better or significantly different results, see Appendix E. Since the parameters θ* and the hidden states are adjusted to each other by only one particle filter run, it makes sense that this test gives less accurate results for the HMM approach. In the optimal out-of-sample test they would have been adjusted to each other over K = 250 EM iterations, just as in the first test.

Since the first out-of-sample test is biased in favor of the HMM approach, and the second test is biased against it, we cannot draw definite conclusions from them. The only fair out-of-sample comparison between direct calibration and the HMM approach is the optimal test in Figure 5.16.

                     1M      3M      6M
Direct calibration   0.0056  0.0077  0.0097
HMM approach         0.0092  0.0100  0.0100

Table 5.7: Second out-of-sample test, overall average absolute error |ε̄|

Figure 5.15: Second out-of-sample test, absolute prediction errors |ε̄t|
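In the second test the fitted parameters θ* stay fixed and the hidden states for t = 70, ..., 100 come from a single particle filter run. As a rough illustration of such a fixed-parameter filtering pass, the sketch below runs a basic bootstrap particle filter on a toy one-dimensional model. The model, the function names and the noise levels are ours; the thesis uses a conditional particle filter on the fitted HMM, not this generic filter.

```python
import numpy as np

def bootstrap_pf(y, a, sigma_x, sigma_y, x0, n_particles=100, seed=1):
    """Bootstrap particle filter with fixed parameters, returning the
    weighted-mean state estimate at each time step.

    Toy model: x_t = a * x_{t-1} + N(0, sigma_x^2),  y_t = x_t + N(0, sigma_y^2).
    """
    rng = np.random.default_rng(seed)
    particles = np.full(n_particles, x0, dtype=float)
    estimates = []
    for yt in y:
        # propagate through the transition density f(x_t | x_{t-1})
        particles = a * particles + rng.normal(scale=sigma_x, size=n_particles)
        # weight by the Gaussian observation density g(y_t | x_t)
        logw = -0.5 * ((yt - particles) / sigma_y) ** 2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        estimates.append(float(np.sum(w * particles)))
        # multinomial resampling
        particles = rng.choice(particles, size=n_particles, p=w)
    return np.array(estimates)

y_obs = np.array([0.10, 0.12, 0.11, 0.15, 0.14])
x_est = bootstrap_pf(y_obs, a=0.95, sigma_x=0.05, sigma_y=0.02, x0=0.10)
```

As in the second test, the observations Yt enter the filtering recursion, but the parameters are never re-estimated, so the resulting state estimates remain out-of-sample with respect to future observations.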

5.5 Conclusions

- Direct calibration of the Heston model leads to volatile parameter evolution and overfitting.
- This overfitting is caused by the presence of some moderate multicollinearity in the set of calibrated Heston parameters.
- Application of our HMM approach to the calibration of the Heston model leads to satisfactory results. Choosing a normal transition density f, a Student's t observation density g and a proposal distribution q that is constructed by running an UKF leads to the most accurate approximation.
- In our HMM approach we lose some accuracy compared to direct calibration. However, the condition number decreases, which indicates a more stable option value approximation.

Figure 5.16: Visualization of the HMM out-of-sample tests

CHAPTER 6

Conclusions

6.1 Summary and conclusions

In this thesis we have investigated dimension reduction of the risk driver process that determines embedded option values by introducing a specific Hidden Markov Model. As embedded options are typically valued by nested Monte Carlo simulations, this leads to a major reduction in the computing time of the valuation. This is especially important for insurance companies that deal with many embedded option valuations in order to determine the market value of their liabilities. The major advantage of Hidden Markov Models is that they are able to take the transformation from the lower dimensional hidden process to the embedded option price into account. Besides this, they provide a straightforward way of generating real world scenarios for the hidden states (see Figure 1.3).

We started by providing an overview of the current methods available for inference in Hidden Markov Models. Since our proposed Hidden Markov Model is non-linear and in some cases non-Gaussian, we focused on the most general class of methods for state inference: the so-called Particle Filters. From this literature study it became apparent that the recent CPF-SAEM method introduced in [45] is state-of-the-art for combined state and parameter inference in Hidden Markov Models. Incorporating Markov Chain Monte Carlo theory within an Expectation Maximization framework significantly reduces the typically long computing times associated with Particle Filters.

We then introduced our Hidden Markov Model for dimension reduction in Section 3.1. Within this model we assume that the n-dimensional risk driver process is generated by some dX-dimensional hidden process by means of the linear transformation presented in (3.1). Within this mapping we divide the risk driver space into a static part and a part that changes over time, the dynamic part. Both stability and interpretability benefit from imposing such a structure on the risk driver process. We note that when dX ≪ n we obtain an effective dimension reduction.
By investigating a Black-Scholes example for our Hidden Markov Model, we gained insight into the CPF-SAEM method. In general the method gives satisfactory results, although computing times can be substantial for more complicated cases. Therefore, we focused especially on the well-known trade-off between accuracy and computational speed. The most intuitive feature in this trade-off is the number of EM iterations within the algorithm, K. The error converges roughly proportionally to K^(-0.60) in this BS example; we therefore took K = 50-100. However, the exact choice of K is case specific and depends on the required accuracy. An important finding is that the method is relatively insensitive to the number of particles N, see Figures 3.13 and 3.14. Therefore, we can take the number of particles small in order to reduce computing times. For the cases in this report N = 6 was sufficient.

Within the CPF-SAEM method an auxiliary function Q̂ is maximized in every iteration in order to obtain new parameter estimates. This maximization step is the most time-consuming step in the algorithm. However, we discovered that the solver we use within this maximization step, and therefore the whole algorithm, benefits greatly from parallelization. We learned that setting the step size in the auxiliary function to one, i.e. γk = 1 for every EM iteration k = 1, ..., K, only leads to an insignificant loss of accuracy. From Figure 3.19 we observe that this means a major reduction in total computing time, since the time per iteration is now constant in k instead of growing with k. Consistent results are obtained when assuming that the parameters of the linear transformation are positive, see Figure 3.16. In order to improve the performance of the CPF-SAEM method, we investigated more advanced alternatives for the proposal distribution q than simply taking the transition distribution f. Although using an Unscented Kalman Filter to construct the proposal distribution is theoretically attractive, this only led to a minor increase in accuracy for most cases analyzed in this thesis.

After testing the CPF-SAEM method for inference within our HMM approach, we investigated the sensitivity of our model to the underlying Hidden Markov Model. In addition to our BS example, we analyzed several cases for which we expected difficulties in Chapter 4.
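A convergence rate such as the K^(-0.60) quoted above can be estimated from (K, error) pairs by a least-squares fit in log-log space. A minimal sketch with synthetic errors that follow the reported rate exactly (the numbers are illustrative, not measurements from the thesis):

```python
import numpy as np

# illustrative (K, error) pairs following the reported rate K^(-0.60)
K = np.array([10.0, 25.0, 50.0, 100.0, 200.0])
err = 0.5 * K ** (-0.60)

# fit err ≈ c * K^p  <=>  log err ≈ p * log K + log c
p, log_c = np.polyfit(np.log(K), np.log(err), 1)
```

With real, noisy error measurements the estimated exponent p would only be close to -0.60; here, by construction, the fit recovers it exactly.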
The main finding is that assuming a 1-dimensional hidden process already leads to highly satisfactory option price approximations in all examples. For the market case with a high-dimensional risk driver process, this is caused by strong correlation within the market observed risk drivers. We learned that the option valuation function (or method) has a major influence on both the convergence speed and the overall achieved accuracy of the approximation. When the valuation function is (highly) non-linear in its risk drivers, convergence becomes slower and a greater number of EM iterations is required in order to obtain satisfactory results. Besides this, when the valuation function is simple (i.e. known in closed form, as opposed to calculated using a numerical pricing method), the overall achieved accuracy is better and the maximization step is less time-consuming.

The HMM approach is relatively insensitive to the choice of transition distribution f(xt | xt−1), although we note that choosing a uniform distribution leads to very slow convergence. In all cases, simply taking the normal distribution N(xt−1, σ²) led to (the most) adequate results, despite the fact that we expected from Section 4.3 that assuming a CIR process for the hidden states would lead to better results for market examples. Taking a normal distribution as observation density g(yt | xt) leads in general to accurate approximations. By analyzing the error distribution of a market example, we discovered that taking a location-scale Student's t-distribution improves the accuracy. This was confirmed for the Heston market example of Chapter 5.

In this last chapter we investigated the trade-off between goodness of fit and model stability for the calibration of the Heston model parameters to market data. We showed that direct calibration of the parameters at every time step leads to overfitting of the model and therefore might result in unstable out-of-sample option valuations. In Section 5.3 we applied various versions of our HMM approach to this Heston calibration and observed that we lose some accuracy compared to direct calibration. However, the maximum condition number over the whole period was reduced from 6.29 to 1.15, which shows that we obtained a more stable option value approximation.
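The condition number used as a stability diagnostic above can be computed as the ratio of the largest to the smallest singular value of the column-scaled matrix of calibrated parameters. The sketch below is a generic version of this diagnostic; the helper name and the toy matrices are ours, and the values 6.29 and 1.15 reported in the thesis come from its own calibration data.

```python
import numpy as np

def condition_number(X):
    """Condition number of a design matrix: ratio of its largest to its
    smallest singular value. Values near 1 indicate a well-conditioned
    problem; large values signal multicollinearity."""
    X = np.asarray(X, dtype=float)
    # scale columns to unit length, the usual convention in collinearity diagnostics
    Xs = X / np.linalg.norm(X, axis=0)
    s = np.linalg.svd(Xs, compute_uv=False)
    return s.max() / s.min()

# two nearly collinear columns -> large condition number
x = np.linspace(0.0, 1.0, 50)
X_collinear = np.column_stack([x, x + 1e-3 * np.sin(x)])
# orthogonal columns -> condition number 1
X_orthogonal = np.column_stack([np.ones(50), np.tile([1.0, -1.0], 25)])
```

A decreasing condition number, as observed for the HMM approach, thus directly indicates that the calibrated parameter set has become less collinear and the valuation more stable.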

6.2 Future research

There are many ways to extend and improve the research done in this thesis. In the following we present a list of suggestions for further research:

- Analyze the performance of our HMM approach when adopting a more general mapping from the hidden states to the risk driver process, instead of the linear transformation presented in (3.1).
- Perform the optimal out-of-sample test in Figure 5.16.
- Check the generality of the analysis done in Section 5.3 by performing all tests for multiple trials of the CPF-SAEM method.
- Investigate the exact relationship between the final value of Q̂ and the error. We would expect that when the value of Q̂ is higher, the error is smaller. However, this does not hold strictly for all cases; see Figure 3.24 for an example.

References

[1] Adkins, L. C., Waters, M. S., Hill, R. C., et al. (2015). Collinearity diagnostics in gretl. Technical report, Oklahoma State University, Department of Economics and Legal Studies in Business.
[2] Ahoniemi, K. (2008). Modeling and forecasting the VIX index. Available at SSRN 1033812.
[3] Alin, A. (2010). Multicollinearity. Wiley Interdisciplinary Reviews: Computational Statistics, 2(3):370-374.
[4] Anderson, B. and Moore, J. (1979). Optimal Filtering. Englewood Cliffs, New Jersey: Prentice-Hall.
[5] Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269-342.
[6] Arulampalam, M. S., Maskell, S., Gordon, N., and Clapp, T. (2002). A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174-188.
[7] Black, F. and Scholes, M. (1973). The pricing of options and corporate liabilities. The Journal of Political Economy, pages 637-654.
[8] Brigo, D. and Mercurio, F. (2007). Interest Rate Models - Theory and Practice: With Smile, Inflation and Credit. Springer Science & Business Media.
[9] Brzezniak, Z. and Zastawniak, T. (1999). Basic Stochastic Processes: A Course Through Exercises. Springer Science & Business Media.
[10] Cappé, O., Moulines, E., and Rydén, T. (2009). Inference in Hidden Markov Models.
[11] Chen, Z. (2003). Bayesian filtering: From Kalman filters to particle filters, and beyond. Statistics, 182(1):1-69.
[12] Chopin, N. (2004). Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference. Annals of Statistics, pages 2385-2411.
[13] Chui, C. K. and Chen, G. (2009). Kalman Filtering: With Real-Time Applications. Springer Science & Business Media.
[14] Craney, T. A. and Surles, J. G. (2002). Model-dependent variance inflation factor cutoff values. Quality Engineering, 14(3):391-403.
[15] Crisan, D. and Doucet, A. (2002). A survey of convergence results on particle filtering methods for practitioners. IEEE Transactions on Signal Processing, 50(3):736-746.

[16] de Jong, P., de Jongh, E., Pienaar, M., Gordon-Grant, H., Oberholzer, M., and Santana, L. (2015). The impact of pre-selected variance inflation factor thresholds on the stability and predictive power of logistic regression models in credit scoring. ORiON, 31(1):17.
[17] Del Moral, P. (2004). Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Probability and Applications. Springer-Verlag New York.
[18] Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411-436.
[19] Delyon, B., Lavielle, M., and Moulines, E. (1999). Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics, pages 94-128.
[20] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1-38.
[21] Donninger, C. (2015). Modeling and trading the VIX and VSTOXX with futures, options and the VXX (May 22, 2015).
[22] Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Marquéz, J. R. G., Gruber, B., Lafourcade, B., Leitão, P. J., et al. (2013). Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1):27-46.
[23] Doucet, A., De Freitas, N., and Gordon, N. (2001a). Sequential Monte Carlo Methods in Practice. Springer.
[24] Doucet, A., Godsill, S., and Andrieu, C. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197-208.
[25] Doucet, A. and Gordon, N. J. (1999). Simulation-based optimal filter for maneuvering target tracking. In SPIE's International Symposium on Optical Science, Engineering, and Instrumentation, pages 241-255. International Society for Optics and Photonics.
[26] Doucet, A., Gordon, N. J., and Krishnamurthy, V. (2001b). Particle filters for state estimation of jump Markov linear systems. IEEE Transactions on Signal Processing, 49(3):613-624.
[27] Doucet, A. and Johansen, A. M. (2009). A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering, 12(656-704):3.
[28] Evensen, G. (2003). The ensemble Kalman filter: Theoretical formulation and practical implementation. Ocean Dynamics, 53(4):343-367.
[29] Fang, F. and Oosterlee, C. W. (2008). A novel pricing method for European options based on Fourier-cosine series expansions. SIAM Journal on Scientific Computing, 31(2):826-848.
[30] Fernandes, M., Medeiros, M. C., and Scharth, M. (2014). Modeling and predicting the CBOE market volatility index. Journal of Banking & Finance, 40:1-10.
[31] Frigola, R., Lindsten, F., Schön, T. B., and Rasmussen, C. E. (2014). Identification of Gaussian process state-space models with particle stochastic approximation EM. IFAC Proceedings Volumes, 47(3):4097-4102.
[32] Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721-741.

[33] Gordon, N. J., Salmond, D. J., and Smith, A. F. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In Radar and Signal Processing, IEE Proceedings F, volume 140, pages 107-113. IET.
[34] Habshah, M. and Bagheri, A. (2013). Robust multicollinearity diagnostic measures based on minimum covariance determination approach. Economics Computation and Economic Cybernetics Studies and Research, 4.
[35] Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97-109.
[36] Heston, S. L. (1993). A closed-form solution for options with stochastic volatility with applications to bond and currency options. Review of Financial Studies, 6(2):327-343.
[37] Ito, K. and Xiong, K. (2000). Gaussian filters for nonlinear filtering problems. IEEE Transactions on Automatic Control, 45(5):910-927.
[38] Jazwinski, A. (1970). Stochastic Processes and Filtering Theory, vol. 64. San Diego, California: Mathematics in Science and Engineering.
[39] Julier, S. J. and Uhlmann, J. K. (1996). A general method for approximating nonlinear transformations of probability distributions. Technical report, Robotics Research Group, Department of Engineering Science, University of Oxford.
[40] Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME - Journal of Basic Engineering, 82(Series D):35-45.
[41] Kalman, R. E. and Bucy, R. S. (1961). New results in linear filtering and prediction theory. Journal of Basic Engineering, 83(1):95-108.
[42] Kuhn, E. and Lavielle, M. (2004). Coupling a stochastic approximation version of EM with an MCMC procedure. ESAIM: Probability and Statistics, 8:115-131.
[43] Lee, S. S. Markov chains on continuous state space. http://www.webpages.uidaho.edu/~stevel/565/lectures/5d%20MCMC.pdf. Accessed: 27-10-2016.
[44] Lin, Y., Zhang, T., Liu, J., et al. (2011). Particle filter with hybrid proposal distribution for nonlinear state estimation. Journal of Computers, 6(11):2491-2501.
[45] Lindsten, F. (2013). An efficient stochastic approximation EM algorithm using conditional particle filters. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6274-6278. IEEE.
[46] Lindsten, F., Jordan, M. I., and Schön, T. B. (2014). Particle Gibbs with ancestor sampling. The Journal of Machine Learning Research, 15(1):2145-2184.
[47] Lindsten, F. and Schön, T. B. (2013). Backward simulation methods for Monte Carlo statistical inference. Foundations and Trends in Machine Learning, 6(1):1-143.
[48] Liu, E. Conditioning and numerical stability. http://web.mit.edu/ehliu/Public/Yelp/conditioning_and_precision.pdf. Yelp, Applied Learning Group MIT. Accessed: 15-11-2016.
[49] Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. Springer Science & Business Media.

[50] Nonejad, N. et al. (2014). Particle Gibbs with ancestor sampling for stochastic volatility models with: heavy tails, in mean effects, leverage, serial dependence and structural breaks. Studies in Nonlinear Dynamics & Econometrics, 45:306-307.
[51] Possen, T. and Van Bragt, D. (2009). Market-consistent valuation of life cycle unit-linked contracts. Rotterdam, the Netherlands: Ortec Finance Research Center.
[52] Rauch, H. E., Striebel, C., and Tung, F. (1965). Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3(8):1445-1450.
[53] Robert, C. and Casella, G. (2011). A short history of Markov chain Monte Carlo: subjective recollections from incomplete data. Statistical Science, pages 102-115.
[54] Roweis, S. and Ghahramani, Z. (2001). Learning nonlinear dynamical systems using the expectation-maximization algorithm. Kalman Filtering and Neural Networks, page 175.
[55] Rüfenacht, N. (2012). Implicit Embedded Options in Life Insurance Contracts: A Market Consistent Valuation Framework. Springer Science & Business Media.
[56] Särkkä, S. (2013). Bayesian Filtering and Smoothing, volume 3. Cambridge University Press.
[57] Schön, T. B., Wills, A., and Ninness, B. (2011). System identification of nonlinear state-space models. Automatica, 47(1):39-49.
[58] Singor, S. N., Boer, A., and Oosterlee, C. W. (2014). Modeling the dynamics of equity index option implied volatilities in a real world scenario set. Rotterdam, the Netherlands: Ortec Finance Research Center, Methodological Working Paper No. 2014-02.
[59] Steehouwer, H. (2016a). Ortec Finance scenario approach. Rotterdam, the Netherlands: Ortec Finance scenario department paper.
[60] Steehouwer, H. (2016b). Relevance of scenario models. Rotterdam, the Netherlands: Ortec Finance scenario department paper.
[61] Svensson, A., Schön, T. B., and Kok, M. (2015). Nonlinear state space smoothing using the conditional particle filter. IFAC-PapersOnLine, 48(28):975-980.
[62] Van Bragt, D. and Steehouwer, H. (2009). Recent trends in asset and liability modeling for life insurers. Rotterdam, the Netherlands: Ortec Finance Research Center.
[63] Van Der Merwe, R., Doucet, A., De Freitas, N., and Wan, E. (2000). The unscented particle filter. In NIPS, volume 2000, pages 584-590.
[64] Wan, E. A. and Van Der Merwe, R. (2000). The unscented Kalman filter for nonlinear estimation. In Adaptive Systems for Signal Processing, Communications, and Control Symposium 2000 (AS-SPCC), pages 153-158. IEEE.
[65] Wei, G. C. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85(411):699-704.
[66] Wiener, N. (1950). Extrapolation, Interpolation, and Smoothing of Stationary Time Series, volume 2. MIT Press, Cambridge, MA.
[67] Wold, S., Esbensen, K., and Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3):37-52.

[68] Zenios, S. A. and Ziemba, W. T. (2006). Handbook of Asset and Liability Management: Applications and Case Studies, Volume I. Elsevier.
[69] Zenios, S. A. and Ziemba, W. T. (2007). Handbook of Asset and Liability Management: Applications and Case Studies, Volume II. Elsevier.

APPENDIX A

Conditioning on the particle with highest weight

In order to reduce the non-monotonic behaviour of Qˆ and to improve the overall performance of the CPF-SAEM method we condition on the ’best’ particle, i.e. the particle with the highest weight. This means that we change step ii) of the procedure in Theorem 2.1 to

ii)′ Sample X*1:T with

    P(X*1:T = X^i 1:T) = 1 if wT^i = max_{1≤j≤N} wT^j, and 0 otherwise.

Unfortunately, this alteration does not improve the performance of the algorithm for the Black-Scholes example, see Figure A.1. We note that the non-monotonic behavior of Q̂, visualized in Figure 3.21, can still be observed in single runs of the algorithm. This means that even if we condition on the particle with the highest weight, it is still possible to sample a new particle set with an overall smaller likelihood from the proposal distribution. Besides this, it is not evident that the proof of Theorem 2.1 still holds if we alter the procedure in this way. Therefore we would rather not change the sampling procedure in Theorem 2.1.

In Figure A.1 we observe that there is very little difference between the results of the two approaches. This is because the weights of all particles are very similar, wT^i ≈ 1/N for i = 1, ..., N, due to the ancestor sampling step. Therefore, all particle trajectories are (almost) equally likely, and there is no major difference between step ii) and ii)′.

(a) Average SSE (b) Average value of Qˆ

Figure A.1: Conditioning on the ’best’ particle (100 trials, N = 6, K = 50, γk ≡ 1, q = f)
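The difference between step ii) and step ii)′ can be sketched as a single selection function. The names below are ours and this is an illustration of the two sampling rules, not the thesis implementation:

```python
import numpy as np

def select_reference_trajectory(trajectories, weights, best=False, rng=None):
    """Choose the reference trajectory for the next conditional sweep.

    best=False : step ii)  -- sample trajectory i with probability w_T^i.
    best=True  : step ii)' -- deterministically take the highest-weight one.
    """
    rng = rng or np.random.default_rng()
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    i = int(np.argmax(weights)) if best else int(rng.choice(len(weights), p=weights))
    return trajectories[i]

trajs = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
w = np.array([0.2, 0.5, 0.3])
ref = select_reference_trajectory(trajs, w, best=True)
```

Because ancestor sampling leaves the final weights nearly uniform (wT^i ≈ 1/N), the two rules select essentially the same trajectory in practice, consistent with Figure A.1.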

APPENDIX B

The Unscented Kalman Filter

The Unscented Kalman Filter (UKF) [64], introduced in [39], is founded on the fact that it is easier to estimate a Gaussian distribution than an arbitrary nonlinear function. The UKF provides a Gaussian approximation to the marginal posterior distribution p(xn | y1:n) of a non-linear hidden Markov model and calculates the mean and covariance of this distribution recursively. This mean and covariance are completely captured by a set of carefully chosen sample points, the so-called sigma points. Third order accuracy of the Taylor series expansion can be achieved when using the UKF, compared to first order accuracy when using the Extended Kalman Filter.

The unscented transformation

The UKF is a straightforward extension of the unscented transformation. This transformation is a method for calculating the statistics of a random variable which undergoes a non-linear transformation. Consider a dx-dimensional state random variable x with mean x̄ and covariance Px. This variable is propagated through an arbitrary non-linear function g : R^dx → R^dy to generate y, i.e. y = g(x). To calculate the first two moments of y, we first choose 2dx + 1 weighted sigma points, so that they fully capture both x̄ and Px. This is possible by using the following scheme:

    X0 = x̄
    Xi = x̄ + ( sqrt((dx + λ) Px) )_i,          i = 1, ..., dx
    Xi = x̄ − ( sqrt((dx + λ) Px) )_{i−dx},     i = dx + 1, ..., 2dx
    W0^(m) = λ / (dx + λ)                                              (B.1)
    W0^(c) = λ / (dx + λ) + (1 − α² + β)
    Wi^(m) = Wi^(c) = 1 / (2(dx + λ)),          i = 1, ..., 2dx

where λ = α²(dx + κ) − dx is a scaling parameter and ( sqrt((dx + λ) Px) )_i is the ith row of the matrix square root. The weights Wi^(m) and Wi^(c) are used for calculating the mean and covariance, respectively. The parameter α controls the spread of the sigma points around x̄ and should ideally be a small positive number. The parameter κ is a secondary scaling parameter for which

an appropriate default choice is κ = 0. We can use β ≥ 0 to incorporate prior knowledge of the distribution of the state. For a Gaussian distribution we should set β = 2 [63]. After calculating the set of weighted sigma points, we propagate each point through the non-linear function, Yi = g(Xi), for i = 0, ..., 2dx. The mean and covariance of y are then approximated as follows:

    ȳ = Σ_{i=0}^{2dx} Wi^(m) Yi,

    Py = Σ_{i=0}^{2dx} Wi^(c) (Yi − ȳ)(Yi − ȳ)^T.

The UKF is now a direct application of this unscented transformation to the recursive minimum mean-squared-error estimation of the Kalman filter (see Algorithm 1).

The unscented Kalman filter

Consider the following non-linear hidden Markov model:

xn = f(xn−1, vn−1),

yn = g(xn, un),

where xn ∈ R^dx denotes the system state and yn ∈ R^dy the observation at time n. We assume that the process noise vn ∈ R^dv and the measurement noise un ∈ R^du are independent, with covariances Q and R respectively. To approximate the posterior distribution with the unscented Kalman filter we redefine the state variable as the concatenation of the original state and noise variables: xn^a = [xn^T vn^T un^T]^T. We select the sigma points for this augmented state following (B.1) to calculate the corresponding sigma matrix Xn^a. The UKF updates the mean x̄ and covariance P of the Gaussian approximation of the posterior distribution according to Algorithm 7. Here da = dx + dv + du is the dimension of the augmented state and X^a = [(X^x)^T (X^v)^T (X^u)^T]^T is the augmented sigma matrix.

Algorithm 7 Unscented Kalman Filter (UKF)

1) Initialize

    x̄0 = E[x0]
    P0 = E[(x0 − x̄0)(x0 − x̄0)^T]
    x̄0^a = E[x^a] = [x̄0^T 0 0]^T
    P0^a = E[(x0^a − x̄0^a)(x0^a − x̄0^a)^T] = [ P0 0 0 ; 0 Q 0 ; 0 0 R ]

2) For n ≥ 1

a) Select sigma points

    X_{n−1}^a = [ x̄_{n−1}^a    x̄_{n−1}^a ± sqrt((da + λ) P_{n−1}^a) ]

b) Predict

    X_{n|n−1}^x = f(X_{n−1}^x, X_{n−1}^v)
    x̄_{n|n−1} = Σ_{i=0}^{2da} Wi^(m) X_{i,n|n−1}^x
    P_{n|n−1} = Σ_{i=0}^{2da} Wi^(c) [X_{i,n|n−1}^x − x̄_{n|n−1}][X_{i,n|n−1}^x − x̄_{n|n−1}]^T
    Y_{n|n−1} = g(X_{n|n−1}^x, X_{n−1}^u)
    ȳ_{n|n−1} = Σ_{i=0}^{2da} Wi^(m) Y_{i,n|n−1}

c) Update

    P_{ỹn ỹn} = Σ_{i=0}^{2da} Wi^(c) [Y_{i,n|n−1} − ȳ_{n|n−1}][Y_{i,n|n−1} − ȳ_{n|n−1}]^T
    P_{xn yn} = Σ_{i=0}^{2da} Wi^(c) [X_{i,n|n−1} − x̄_{n|n−1}][Y_{i,n|n−1} − ȳ_{n|n−1}]^T
    Kn = P_{xn yn} P_{ỹn ỹn}^{−1}
    x̄n = x̄_{n|n−1} + Kn (yn − ȳ_{n|n−1})
    Pn = P_{n|n−1} − Kn P_{ỹn ỹn} Kn^T
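Algorithm 7 augments the state with the noise variables; when the noise enters additively, a simpler variant applies the same sigma-point machinery directly and adds Q and R to the predicted covariances. The sketch below implements one predict/update cycle of that simplified variant (our own simplification for illustration, not the augmented-state form of Algorithm 7; it also redraws the sigma points around the predicted mean, one of several common choices):

```python
import numpy as np

def ukf_step(f, g, x_mean, P, Q, R, y, alpha=1.0, beta=2.0, kappa=0.0):
    """One UKF predict/update cycle for a model with ADDITIVE noise."""
    dx = x_mean.size
    lam = alpha ** 2 * (dx + kappa) - dx

    def sigma_points(m, C):
        S = np.linalg.cholesky((dx + lam) * C)
        return np.vstack([m, m + S.T, m - S.T])

    wm = np.full(2 * dx + 1, 1.0 / (2.0 * (dx + lam)))
    wc = wm.copy()
    wm[0] = lam / (dx + lam)
    wc[0] = lam / (dx + lam) + (1.0 - alpha ** 2 + beta)

    # predict: propagate sigma points through the transition function f
    X = np.array([f(s) for s in sigma_points(x_mean, P)])
    x_pred = wm @ X
    dX = X - x_pred
    P_pred = (wc[:, None] * dX).T @ dX + Q

    # update: redraw sigma points around the predicted mean, propagate through g
    Xp = sigma_points(x_pred, P_pred)
    Y = np.array([g(s) for s in Xp])
    y_pred = wm @ Y
    dY = Y - y_pred
    Pyy = (wc[:, None] * dY).T @ dY + R          # innovation covariance
    Pxy = (wc[:, None] * (Xp - x_pred)).T @ dY   # cross covariance
    K = Pxy @ np.linalg.inv(Pyy)                 # Kalman gain
    x_new = x_pred + K @ (y - y_pred)
    P_new = P_pred - K @ Pyy @ K.T
    return x_new, P_new

# for a linear model the UKF reproduces the standard Kalman filter
x_new, P_new = ukf_step(lambda s: 0.9 * s, lambda s: s,
                        np.array([0.0]), np.eye(1),
                        0.1 * np.eye(1), 0.1 * np.eye(1), np.array([1.0]))
```

Checking the linear case against the closed-form Kalman filter recursions is a useful way to validate a UKF implementation before using it as a proposal distribution.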

APPENDIX C

Correlation matrices for the risk drivers in Section 4.3

1.0000  0.9986  0.9940  0.9894  0.9818
0.9986  1.0000  0.9979  0.9945  0.9881
0.9940  0.9979  1.0000  0.9987  0.9946
0.9894  0.9945  0.9987  1.0000  0.9986
0.9818  0.9881  0.9946  0.9986  1.0000

Table C.1: Correlation coefficients between r1, . . . , r5

1,00 0,97 0,95 0,95 0,94 0,98 0,97 0,96 0,94 0,93 0,96 0,96 0,95 0,94 0,93 0,94 0,94 0,93 0,93 0,92 0,91 0,92 0,92 0,91 0,91 0,97 1,00 0,99 0,99 0,98 0,98 0,99 0,99 0,98 0,98 0,96 0,97 0,98 0,97 0,97 0,93 0,95 0,96 0,96 0,96 0,92 0,92 0,94 0,94 0,94 0,95 0,99 1,00 1,00 0,99 0,96 0,98 0,99 0,99 0,99 0,95 0,97 0,98 0,98 0,98 0,92 0,95 0,97 0,97 0,97 0,91 0,93 0,94 0,95 0,95 0,95 0,99 1,00 1,00 1,00 0,96 0,98 0,99 1,00 0,99 0,94 0,96 0,98 0,99 0,99 0,92 0,94 0,96 0,97 0,97 0,90 0,92 0,94 0,95 0,96 0,94 0,98 0,99 1,00 1,00 0,95 0,97 0,99 0,99 1,00 0,93 0,96 0,98 0,99 0,99 0,91 0,94 0,96 0,97 0,98 0,90 0,92 0,94 0,95 0,96 0,98 0,98 0,96 0,96 0,95 1,00 0,99 0,98 0,97 0,96 0,99 0,99 0,98 0,97 0,96 0,97 0,98 0,97 0,96 0,96 0,95 0,96 0,96 0,96 0,95 0,97 0,99 0,98 0,98 0,97 0,99 1,00 0,99 0,99 0,98 0,99 1,00 0,99 0,99 0,98 0,97 0,98 0,99 0,98 0,98 0,95 0,97 0,97 0,97 0,97 0,96 0,99 0,99 0,99 0,99 0,98 0,99 1,00 1,00 1,00 0,97 0,99 1,00 1,00 0,99 0,95 0,98 0,99 0,99 0,99 0,94 0,96 0,97 0,98 0,98 0,94 0,98 0,99 1,00 0,99 0,97 0,99 1,00 1,00 1,00 0,96 0,98 0,99 1,00 1,00 0,94 0,97 0,98 0,99 0,99 0,93 0,95 0,97 0,98 0,98 0,93 0,98 0,99 0,99 1,00 0,96 0,98 1,00 1,00 1,00 0,95 0,97 0,99 1,00 1,00 0,93 0,96 0,98 0,99 0,99 0,92 0,94 0,96 0,97 0,98 0,96 0,96 0,95 0,94 0,93 0,99 0,99 0,97 0,96 0,95 1,00 0,99 0,98 0,97 0,96 0,99 0,99 0,98 0,97 0,96 0,97 0,98 0,98 0,97 0,96 0,96 0,97 0,97 0,96 0,96 0,99 1,00 0,99 0,98 0,97 0,99 1,00 0,99 0,99 0,98 0,99 1,00 0,99 0,99 0,98 0,97 0,98 0,99 0,99 0,98 0,95 0,98 0,98 0,98 0,98 0,98 0,99 1,00 0,99 0,99 0,98 0,99 1,00 1,00 0,99 0,97 0,99 1,00 1,00 0,99 0,96 0,97 0,99 0,99 0,99 0,94 0,97 0,98 0,99 0,99 0,97 0,99 1,00 1,00 1,00 0,97 0,99 1,00 1,00 1,00 0,96 0,98 0,99 1,00 1,00 0,95 0,97 0,98 0,99 0,99 0,93 0,97 0,98 0,99 0,99 0,96 0,98 0,99 1,00 1,00 0,96 0,98 0,99 1,00 1,00 0,95 0,97 0,99 1,00 1,00 0,94 0,96 0,98 0,99 0,99 0,94 0,93 0,92 0,92 0,91 0,97 0,97 0,95 0,94 0,93 0,99 0,99 0,97 0,96 0,95 1,00 0,99 0,98 0,97 0,96 0,99 0,99 0,98 0,98 0,97 
0,94 0,95 0,95 0,94 0,94 0,98 0,98 0,98 0,97 0,96 0,99 1,00 0,99 0,98 0,97 0,99 1,00 1,00 0,99 0,98 0,98 0,99 0,99 0,99 0,99 0,93 0,96 0,97 0,96 0,96 0,97 0,99 0,99 0,98 0,98 0,98 0,99 1,00 0,99 0,99 0,98 1,00 1,00 1,00 0,99 0,97 0,99 1,00 1,00 0,99 0,93 0,96 0,97 0,97 0,97 0,96 0,98 0,99 0,99 0,99 0,97 0,99 1,00 1,00 1,00 0,97 0,99 1,00 1,00 1,00 0,96 0,98 0,99 1,00 1,00 0,92 0,96 0,97 0,97 0,98 0,96 0,98 0,99 0,99 0,99 0,96 0,98 0,99 1,00 1,00 0,96 0,98 0,99 1,00 1,00 0,95 0,97 0,99 0,99 1,00 0,91 0,92 0,91 0,90 0,90 0,95 0,95 0,94 0,93 0,92 0,97 0,97 0,96 0,95 0,94 0,99 0,98 0,97 0,96 0,95 1,00 0,98 0,97 0,97 0,96 0,92 0,92 0,93 0,92 0,92 0,96 0,97 0,96 0,95 0,94 0,98 0,98 0,97 0,97 0,96 0,99 0,99 0,99 0,98 0,97 0,98 1,00 0,99 0,99 0,98 0,92 0,94 0,94 0,94 0,94 0,96 0,97 0,97 0,97 0,96 0,98 0,99 0,99 0,98 0,98 0,98 0,99 1,00 0,99 0,99 0,97 0,99 1,00 1,00 0,99 0,91 0,94 0,95 0,95 0,95 0,96 0,97 0,98 0,98 0,97 0,97 0,99 0,99 0,99 0,99 0,98 0,99 1,00 1,00 0,99 0,97 0,99 1,00 1,00 1,00 0,91 0,94 0,95 0,96 0,96 0,95 0,97 0,98 0,98 0,98 0,96 0,98 0,99 0,99 0,99 0,97 0,99 0,99 1,00 1,00 0,96 0,98 0,99 1,00 1,00

Table C.2: Correlation coefficients between σ1^impl, ..., σ25^impl

APPENDIX D

Market Data for the Heston Calibration in Chapter 5

Figure D.1: Historic implied volatility for S&P-500 options for all strike maturity pairs

Figure D.2: Historic US interest rates for all maturities

APPENDIX E

Alternative conditionings for the second out-of-sample test in Section 5.4

Figure E.1: Second out-of-sample test |ε̄t|, conditioning on the weighted average of x_{t−1}^i

                           1M      3M      6M
Direct calibration         0.0056  0.0077  0.0097
HMM, cond. on wght. avg.   0.0099  0.0112  0.0121

Table E.1: Second out-of-sample test, overall |ε̄|, conditioning on the weighted average of x_{t−1}^i

Figure E.2: Second out-of-sample test |ε̄t|, conditioning on the particle x_{t−1}^i with the highest weight

                      1M      3M      6M
Direct calibration    0.0056  0.0077  0.0097
HMM, cond. on best    0.0096  0.0111  0.0111

Table E.2: Second out-of-sample test, overall |ε̄|, conditioning on the particle x_{t−1}^i with the highest weight
