DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2016

Modelling of Stochastic Volatility using Partially Observed Markov Models

HJALMAR HEIMBÜRGER

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ENGINEERING SCIENCES

Modelling of Stochastic Volatility using Partially Observed Markov Models

HJALMAR HEIMBÜRGER

Master's Thesis in Mathematical Statistics (30 ECTS credits)
Master Programme in Applied and Computational Mathematics (120 credits)
Royal Institute of Technology, year 2016
Supervisor at Handelsbanken: Björn Löfdahl
Supervisor at KTH: Jimmy Olsson
Examiner: Jimmy Olsson

TRITA-MAT-E 2016:62 ISRN-KTH/MAT/E--16/62-SE

Royal Institute of Technology School of Engineering Sciences

KTH SCI SE-100 44 Stockholm, Sweden

URL: www.kth.se/sci

Modelling of Stochastic Volatility using Partially Observed Markov Models

Abstract

In this thesis, calibration of stochastic volatility models that allow correlation between the volatility and the returns has been considered. To achieve this, the dynamics have been modelled as an extension of hidden Markov models, and a special case of partially observed Markov models. This thesis shows that such models can be calibrated using sequential Monte Carlo methods, and that a model with correlation provides a better fit to the observed data. However, the results are not conclusive, and more research is needed in order to confirm this for other data sets and models.


Modellering av Stokastisk Volatilitet genom Partiellt Observerbara Markovmodeller

Sammanfattning

This Master's thesis treats the calibration of stochastic volatility models that allow correlation between the volatility and the returns. To achieve this behaviour, the dynamics have been modelled as a special case of partially observed Markov models, which are an extension of hidden Markov models (HMMs). The thesis shows that these types of models can be calibrated with sequential Monte Carlo methods, and that these models provide a better fit to the observed data. The results are, however, not conclusive, and the question needs to be investigated further for other model types and other data sets.


Acknowledgements

I would like to thank my supervisor Jimmy Olsson for introducing me to the subject of computational statistics, and for valuable input and suggestions on everything concerning this thesis. Furthermore, I would like to thank my supervisor at Handelsbanken, Björn Löfdahl, for the idea behind the model as well as for guiding me through the process. Moreover, I would also like to thank Handelsbanken for providing me with the data necessary to perform the analysis. Lastly, I would like to thank my other half, Carolina Eriksson, for listening to my problems and always being there for me. Without you I would not have finished this thesis.


Contents

1 Introduction
  1.1 Stochastic volatility
  1.2 Hidden Markov models
  1.3 Partially observed Markov models
  1.4 Thesis objectives
  1.5 Outline

2 Background
  2.1 Hidden Markov models
  2.2 Parameter estimation in HMMs — the Expectation-Maximisation algorithm
    2.2.1 The EM algorithm
    2.2.2 Numerical approximations
    2.2.3 Gradient ascent EM
    2.2.4 Averaging
  2.3 Sequential Monte Carlo methods
    2.3.1 The bootstrap particle filter

3 Sequential Monte Carlo Methods for Virtually Hidden Markov Models
  3.1 Virtually hidden Markov models
  3.2 Filtering in virtually hidden Markov models
  3.3 Smoothing in virtually hidden Markov models
    3.3.1 Fixed-lag smoothing
    3.3.2 The time-reversed process
    3.3.3 Forward-filtering, backward-smoothing
    3.3.4 Forward-only FFBSm
    3.3.5 Forward-filtering, backward-simulation
    3.3.6 The PaRIS algorithm
  3.4 Instrumental density design
    3.4.1 Designing the proposal density
    3.4.2 Measures of weight imbalance
    3.4.3 Instrumental kernel for backward index draws

4 Evaluation of Models
  4.1 Information criteria
    4.1.1 Akaike information criterion
    4.1.2 Bayesian information criterion
    4.1.3 Bootstrap information criterion
    4.1.4 Posterior probabilities of model candidates

5 Stochastic Volatility Models and Implementation
  5.1 Choice of proposal density
  5.2 The intermediate quantity
  5.3 Volatility index model
  5.4 Model evaluation
  5.5 A few notes on implementation
    5.5.1 Starting guesses for the EM algorithm
    5.5.2 Accept-reject algorithm
    5.5.3 Working with logarithms

6 Simulations and Results
  6.1 Design of the PaRIS algorithm
    6.1.1 Simulation time complexity
  6.2 Calibration of SV models
    6.2.1 Prefatory study
    6.2.2 Main study

7 Discussion
  7.1 Design of the PaRIS algorithm
    7.1.1 Proposal kernel selection
    7.1.2 Computational complexity for the PaRIS algorithm
  7.2 Stochastic volatility models
    7.2.1 Prefatory study
    7.2.2 Main study

8 Conclusions and Future Work
  8.1 Conclusions
  8.2 Future work

A Extension of the Accept-Reject Sampling Algorithm

B Derivation of the Intermediate Quantity
  B.1 E-step
  B.2 M-step
    B.2.1 Maximisation with respect to µ
    B.2.2 Maximisation with respect to βζ
    B.2.3 Maximisation with respect to Σ
    B.2.4 Updating formula for the HMM version
  B.3 Summary

Nomenclature

R — Correlation matrix of the log-returns, i.e. the correlation corresponding to the covariance matrix in (5.4). Its elements are denoted [R]_{ij} = r_{ij}.

Ω_t — Σ_{i=1}^N ω_t^i.

p_AIC^i — Akaike weights, p_AIC^i ∝ exp(−∆AIC_i/2), see (4.5).

p_BIC^i — Posterior probability of model g_i being the true model, p_BIC^i ∝ exp(−∆BIC_i/2), see (4.4).

x_{s:t} — A vector, x_{s:t} ≜ (x_s, x_{s+1}, ..., x_t), for all s ≤ t.

PaRIS(·) — One iteration of Algorithm 3.2.

PF(·) — One iteration of Algorithm 3.1.

Pr({ω_t^i}_{i=1}^N) — The categorical distribution, i.e. if J ∼ Pr({ω_t^i}_{i=1}^N), then P(J = j) = ω_t^j/Ω_t.

φ_{0:t;θ} — The smoothing distribution, φ_{0:t;θ} ≜ φ_{0:t|t;θ}.

φ_{s:s′|t}(x_{s:s′}) — Density of X_{s:s′}, conditionally on the observations, i.e. for any (measurable) A ⊆ X^{s′+1−s}, P(X_{s:s′} ∈ A | y_{1:t}) = ∫_A φ_{s:s′|t}(x_{s:s′}) dx_{s:s′}.

φ_{t;θ} — The filtering distribution, φ_{t;θ} ≜ φ_{t:t|t;θ}.

g_t(x, x′) — g(x, x′; y_t), i.e. the emission density associated with Y_t | X_{t−1} = x, X_t = x′.

q(x; ·) — The transition density associated with X_{t+1} | X_t = x.

1(·) — The indicator function.

E — Expectation of a stochastic variable.

V — Variance of a stochastic variable.

D(p‖q) — Kullback-Leibler divergence between distributions p and q.

‖A‖ — det(A).

x ∧ y — min(x, y).

(X, 𝒳) — The measurable space of the state process {X_t}.

(Y, 𝒴) — The measurable space of the observation process {Y_t}.

⟦m, n⟧ — {k ∈ N : m ≤ k ≤ n, (m, n) ∈ N²}, i.e. the set of all non-negative integers between the integers m and n.

N* — {n ∈ N : n > 0}, i.e. the positive integers.

R₊ — {x ∈ R : x > 0}, i.e. the positive real numbers.

X^n — The n-fold Cartesian product X × ··· × X.

Chapter 1

Introduction

1.1 Stochastic volatility

The pioneering work of Black and Scholes [3] and their derivation of the famous Black-Scholes formula was a major advancement for financial mathematics and the pricing of European-style derivatives. However, as time has passed, several of the assumptions have proved to be too crude: the observed data exhibit features that are impossible under the simplistic assumptions made. One of the most notable such assumptions is that the volatility of a stock is constant over time. To address these shortcomings, the main approach has been to add randomness to the volatility in the model. These models have become known as stochastic volatility models. A few models that have been proposed are the Heston model [18], the Bates model [1], and the model proposed by Hull and White [19]. A central concept in all the models listed is the existence of volatility clusters: empirical data suggest that periods of high volatility cluster together, and any plausible model should allow for such clusters. Furthermore, the market and the volatility are often assumed to be negatively correlated [2]. The concept is intuitive, as an increase in risk should make it less favourable to invest. However, not all literature points to this fact, as the effect only seems to be present in some markets [27]. There should nevertheless be a possibility to include such behaviour in the model.

1.2 Hidden Markov models

A hidden Markov model (often abbreviated HMM) is, roughly speaking, a Markov chain observed in noise. It is a bivariate stochastic process {(X_t, Y_t)}_{t∈N} which consists of a state process {X_t} and an observation process {Y_t}. The state process is a Markov chain that is only partially observable through the observation process. Furthermore, the observations are, conditionally on the state process, statistically independent, and the conditional distribution of each observation given the state process depends only on the state at the corresponding point in time. A graphical representation of an HMM can be seen in Figure 1.1.

[Figure 1.1: The general dependence structure of an HMM, where each arrow denotes conditional dependence: X_{k−1} → X_k → X_{k+1}, with X_k → Y_k for each k.]

HMMs have been one of the most successful modelling tools of the last 50 years [7]. They have been used in numerous fields, including computational biology, speech recognition, econometrics, etc. (a compilation of around 360 references is found in [6]). The Markov property of the state process allows for computationally viable algorithms.¹ On the other hand, it also provides a general enough structure that complex behaviour can be modelled. The combination of these two properties is the main attribute of HMMs that has made them successful. Some authors reserve the term HMM for processes where the state space of the hidden process is finite, and use the term general state-space models for processes where the state-space is uncountable. However, this distinction will not be made here, and the use of the term HMM will not imply any restrictions on the state-space in this report. In 1961, Kalman and Bucy [20] introduced the pioneering Kalman filter, which calculates the exact filter distribution whenever the HMM is linear and Gaussian. However, this turned out to be an exception to the rule. In general, the likelihood, and consequently the filter and smoothing distributions, is almost always analytically intractable. The only cases where analytical solutions exist for the filtering and smoothing distributions are when the state-space is finite, or when the dynamics are linear and Gaussian. For these cases, the solutions are provided by the Baum-Welch algorithm [43] and the disturbance smoother [4], respectively. For any other case, the exact filtering and smoothing distributions are intractable, and it is necessary to resort to approximations. One approach, common in applications within automatic control, is the extended Kalman filter, where the dynamics are linearised around the mean and covariance. However, if the dynamics are severely non-linear, the error from the linear approximation is non-negligible and leads to poor approximations. The unscented Kalman filter tries to remedy the error from the non-linearities by propagating a set of "sigma points" through the non-linear dynamics to obtain more accurate estimates of the mean and covariance. In this thesis, the approximations of the filtering and smoothing distributions will be obtained using sequential Monte Carlo methods (SMC), also known as particle filters. These algorithms generate a sequence of weighted particle samples that approximate the target distributions via the technique of importance sampling. The main interest will be to calculate smoothed sufficient statistics, and this will be done through the particle-based, rapid incremental smoother (PaRIS) [36].

¹ It should, however, be noted that the observation process {Y_t} is not Markovian.

[Figure 1.2: The general dependence structure of a virtually hidden Markov model. Each arrow denotes conditional dependence: X_{k−1} → X_k → X_{k+1}, with both X_{k−1} → Y_k and X_k → Y_k for each k.]

1.3 Partially observed Markov models

Even though HMMs have been studied for a very long time, there has been little research on generalisations of them. One of the few areas of research on HMM generalisations is Markov-switching processes.² These processes allow the observations to be conditionally dependent on the current value of the state process, as well as on the previous value of the observation process. This type of process is a special case of a partially observed Markov model. In this thesis, a special type of partially observed Markov model will be considered: unlike in the hidden Markov model, the conditional distribution of the observation process will be allowed to depend on the previous value of the hidden chain, as well as on the current. These types of processes will be called virtually hidden Markov models (VHMMs). A graphical illustration can be seen in Figure 1.2. This type of dependence allows the model to take into account correlations between the noise sequences of the state process and the observation process. Therefore, a well-specified VHMM could capture a (presumably negative) correlation between a stock and its volatility.

² When the state-space is finite, they are sometimes called Markov jump systems.

1.4 Thesis objectives

The objective of this thesis is to investigate whether a stochastic volatility model that allows correlation between the log-returns and the volatility is a better model than some benchmark models. Consequently, the aim is to calibrate a stochastic volatility VHMM using SMC methods and then determine whether this model provides a better fit than the benchmark models.

1.5 Outline

Chapter 2 begins with a brief summary of hidden Markov models and of parameter estimation in hidden Markov models using the Expectation-Maximisation algorithm. Section 2.3 gives a very brief introduction to the realm of SMC methods, the bootstrap particle filter, and the notation used for the relevant quantities. Any reader not familiar with the subject is advised to read e.g. [7, 13, 14] for a more thorough background. Chapter 3 defines partially observed Markov models in general, and VHMMs in particular, and provides the extension of SMC methods to such processes. It describes filtering for partially observed Markov models, which has previously been presented in [11]. Furthermore, the main focus is SMC methods that aim at computing smoothed expectations for VHMMs. Section 3.4 includes some theory about proposal kernels for the SISR algorithm, and derives, under suitable conditions, the optimal instrumental density for drawing backward indices in the accept-reject algorithm. Chapter 4 contains the relevant theory associated with model evaluation, and defines different measures of model fit. Chapter 5 specifies the models of stochastic volatility that will be considered in this thesis, and the relevant quantities needed to implement PaRIS and the Expectation-Maximisation algorithm are computed. Model-dependent quantities related to the goodness of fit are also computed. Finally, it contains some comments on the implementation of the algorithms. Afterwards, the results of the simulations are presented in Chapter 6, and they are discussed in Chapter 7. Lastly, conclusions and some discussion about possible future work are given in Chapter 8.

Chapter 2

Background

In this chapter, some background on HMMs and parameter estimation will be discussed. It begins with some general definitions, including the definitions of an HMM, the filtering distribution, and the smoothing distribution. Afterwards, the Expectation-Maximisation algorithm for parameter estimation in HMMs is covered. Finally, a brief introduction is given to SMC methods and, in particular, the bootstrap particle filter.

2.1 Hidden Markov models

An HMM is a bivariate stochastic process {(X_t, Y_t)}_{t∈N} consisting of two processes: the state process {X_t} and the observation process {Y_t}. The state process takes its values in some set X (e.g. X ⊆ R^d), which is equipped with a countably generated σ-algebra 𝒳. The observations live on another space Y with corresponding σ-algebra 𝒴. This means that (X, 𝒳) and (Y, 𝒴) are measurable spaces.

Definition 2.1 (Markov transition kernel). Let (Z₁, 𝒜) and (Z₂, ℬ) be two measurable spaces. The map M : Z₁ × ℬ → [0, 1] is a Markov transition kernel from (Z₁, 𝒜) to (Z₂, ℬ) if it holds that

(i) the map ℬ ∋ B ↦ M(z₁, B) is a probability measure on (Z₂, ℬ), for every z₁ ∈ Z₁, and

(ii) the map Z₁ ∋ z₁ ↦ M(z₁, B) is a measurable function, for every B ∈ ℬ.

Let Q : X × 𝒳 → [0, 1] and G : X × 𝒴 → [0, 1] be two Markov transition kernels, and let χ : 𝒳 → [0, 1] be a probability measure on (X, 𝒳). An HMM is the canonical bivariate Markov chain {(X_t, Y_t)}_{t∈N} induced by the Markov transition kernel

    X × Y × (𝒳 ⊗ 𝒴) ∋ ((x, y), A × B) ↦ ∫_A Q(x, dx′) G(x′, B),      (2.1)

with initial distribution 𝒳 ⊗ 𝒴 ∋ A × B ↦ ∫_A χ(dx′) G(x′, B).

This implies that

(i) {X_t} is a Markov chain with Markov transition kernel Q and initial distribution χ, and

(ii) conditionally on the state process, the observations are statistically independent, and the conditional distribution of Y_k given the state process depends only on X_k and is given by G(X_k, ·).

In this thesis, the HMM will be assumed to be fully dominated, which means that Q and G admit densities (with respect to relevant measures) as

    Q(x, A) = ∫_A q(x; x′) dx′,  (x, A) ∈ X × 𝒳,
    G(x, B) = ∫_B g(x; y) dy,   (x, B) ∈ X × 𝒴.      (2.2)

Here, q and g are known as the transition and emission densities, respectively. Often, these densities will belong to some parametric family indexed by a parameter θ ∈ Θ, where Θ is some parameter space. To show this dependence, the densities will be subscripted by the parameter θ, e.g. q_θ.

When making inference about HMMs, the aim is often to compute expectations under the posterior distribution of X_{s:s′} given the observations Y_{0:t} = y_{0:t}. Let this distribution be denoted by φ_{s:s′|t;θ}, and let expectations under this measure be denoted by φ_{s:s′|t;θ}h ≜ E_θ[h(X_{s:s′}) | Y_{0:t}]. To obtain an expression for the posterior distribution, first consider the joint density of (X_{0:t}, Y_{0:t}). The dynamics specified by (2.1) and (2.2) imply that (X_{0:t}, Y_{0:t}) admits a joint density, which factorises as

    p_θ(x_{0:t}, y_{0:t}) = χ_θ(x_0) g_θ(x_0; y_0) ∏_{k=1}^t q_θ(x_{k−1}; x_k) g_θ(x_k; y_k),      (2.3)

where χ_θ is the initial distribution of the state process. Consequently, for any measurable function h : X^{s′−s+1} → R, the posterior expectation of h(X_{s:s′}) given the observations Y_{0:t} = y_{0:t} is given by

    φ_{s:s′|t;θ}h = ∫_{X^{t+1}} h(x_{s:s′}) χ_θ(x_0) g_θ(x_0; y_0) ∏_{k=1}^t q_θ(x_{k−1}; x_k) g_θ(x_k; y_k) dx_{0:t}
                    / ∫_{X^{t+1}} χ_θ(x_0) g_θ(x_0; y_0) ∏_{k=1}^t q_θ(x_{k−1}; x_k) g_θ(x_k; y_k) dx_{0:t},

provided that the denominator is positive. There are two distributions of particular interest: the filter distribution φ_{t;θ} ≜ φ_{t:t|t;θ}, and the smoothing distribution φ_{0:t;θ} ≜ φ_{0:t|t;θ}. When making statistical inference about HMMs, these turn out to be the two most important distributions.
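To make the preceding definitions concrete, the following is a minimal Python sketch (with hypothetical parameter values, not a model from this thesis) of simulating a fully dominated HMM, here a linear Gaussian model in which q and g in (2.2) are Gaussian densities:

    import numpy as np

    def simulate_hmm(T, phi=0.9, sigma_x=0.5, sigma_y=1.0, rng=None):
        # X_0 ~ N(0, 1) is the initial distribution chi;
        # X_t | X_{t-1} = x ~ N(phi * x, sigma_x^2) is the transition Q;
        # Y_t | X_t = x ~ N(x, sigma_y^2) is the emission G.
        rng = rng or np.random.default_rng()
        x = np.empty(T + 1)
        y = np.empty(T + 1)
        x[0] = rng.normal(0.0, 1.0)
        y[0] = rng.normal(x[0], sigma_y)
        for t in range(1, T + 1):
            x[t] = rng.normal(phi * x[t - 1], sigma_x)
            y[t] = rng.normal(x[t], sigma_y)
        return x, y

The joint density of a simulated pair (x, y) then factorises exactly as in (2.3).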

2.2 Parameter estimation in HMMs — the Expectation-Maximisation algorithm

As mentioned in the introduction, the exact filtering and smoothing distributions for HMMs are in general intractable; consequently, so is the likelihood when trying to estimate parameters in HMMs. When the likelihood is intractable, there are several iterative methods that produce sequences of estimates converging towards the true parameter. These include gradient-based methods, which compute the gradient of the log-likelihood (the score function). However, in this thesis, the focus will be on the Expectation-Maximisation (EM) algorithm.

2.2.1 The EM algorithm

The Expectation-Maximisation algorithm is used to compute the maximum likelihood estimator for latent data models. It was proposed by several different authors. For instance, Sundberg [40] established several results for the algorithm when the considered distribution belonged to the exponential family. However, credit is often given to Dempster, Laird, and Rubin [10], who generalised the algorithm and established several of its properties. Their proof of convergence was, however, erroneous; this was detected by Wu [45], who also provided a correct proof. The idea behind the algorithm is simple: instead of maximising the log-likelihood directly, the algorithm maximises an intermediate quantity, which is more easily computed. It can be shown that an increase of this intermediate quantity guarantees an increase of the likelihood. Let (X, Y) denote a missing data model, where X is latent and only Y is observed. Define the intermediate quantity as

    Q_{θ′}(θ) ≜ E_{θ′}[log p_θ(X, Y) | Y] = ∫ log p_θ(x, Y) p_{θ′}(x | Y) dx,      (2.4)

where E_{θ′} denotes expectation taken under the parameter θ′ ∈ Θ. This quantity is the expectation of the complete data log-likelihood, conditional on the observations. Let ℓ(θ) ≜ log ∫ p_θ(x, y) dx denote the observed data log-likelihood under the parameter θ ∈ Θ. These quantities are related through the following theorem:

Theorem 2.1 (The EM inequality). For all (θ, θ′) ∈ Θ², it holds that

    ℓ(θ) − ℓ(θ′) ≥ Q_{θ′}(θ) − Q_{θ′}(θ′),      (2.5)

where the inequality is strict unless θ = θ′.

The proof is done by defining the entropy H_{θ′}(θ) ≜ ℓ(θ) − Q_{θ′}(θ), and showing that the difference in entropy H_{θ′}(θ) − H_{θ′}(θ′) is non-negative for any (θ, θ′) ∈ Θ².¹ For a complete proof, see for instance Dempster et al. [10]. Consequently, if Q_{θ′}(θ) − Q_{θ′}(θ′) is maximised with respect to θ, then the log-likelihood will increase by at least as much. Hence, constructing a sequence {θ_ℓ}_{ℓ≥0} by letting θ_{ℓ+1} ← arg max_{θ∈Θ} Q_{θ_ℓ}(θ) results in the EM algorithm.² The updating formula will be denoted Λ(θ_ℓ). Theorem 2.1 guarantees that, by construction, the sequence {ℓ(θ_ℓ)}_{ℓ≥0} is non-decreasing, which means that the EM algorithm is a monotone optimisation algorithm. It is summarised in Algorithm 2.1.

If {θ_ℓ}_{ℓ≥0} is constructed by Algorithm 2.1 and lim_{ℓ→∞} θ_ℓ = θ̂, then, assuming that θ ↦ p_θ is sufficiently smooth, it is possible to show that θ̂ is a stationary point of the log-likelihood, i.e. ∇_θ ℓ(θ)|_{θ=θ̂} = 0 [28, p. 174].

¹ The difference in entropy is a Kullback-Leibler divergence, which is always non-negative.
² Here, this would mean that θ_{ℓ+1} ← Λ(θ_ℓ) ≜ arg max_{θ∈Θ} Q_{θ_ℓ}(θ).

Algorithm 2.1: The Expectation-Maximisation (EM) algorithm.

    Data: Observed data y, initial guess θ_0.
    Result: {θ_ℓ}_{ℓ=0}^{ℓ_max}.
    for ℓ ← 0 to ℓ_max − 1 do
        Q_{θ_ℓ}(θ) ← E_{θ_ℓ}[log p_θ(X, Y) | Y = y]      /* E-step */
        θ_{ℓ+1} ← arg max_{θ∈Θ} Q_{θ_ℓ}(θ)               /* M-step */
    end for
    return {θ_ℓ}_{ℓ=0}^{ℓ_max}
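As an illustration, the loop structure of Algorithm 2.1 can be sketched in Python as follows; the callables e_step and m_step are placeholders (assumptions, not part of this thesis) for whatever representation of the intermediate quantity the model at hand admits:

    def em(y, theta0, e_step, m_step, n_iter=100):
        # E-step: compute (a representation of) Q_{theta_l}(.) given data y.
        # M-step: maximise that representation over theta.
        thetas = [theta0]
        for _ in range(n_iter):
            q = e_step(thetas[-1], y)
            thetas.append(m_step(q))
        return thetas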

2.2.2 Numerical approximations

There are several potential obstacles with the EM algorithm: the E-step or the M-step might be analytically intractable. Beyond the cases of linear Gaussian HMMs or HMMs with finite state-space, the E-step is analytically intractable and has to be approximated. This will be done using SMC methods; see Section 2.3 and Chapter 3. Unlike the E-step, a closed form expression for the M-step is in most cases available. However, if it is not possible to obtain a closed form expression for the maximisation, the problem can be tackled by various optimisation methods. Titterington [42] introduced a gradient algorithm for the case when the joint distribution belongs to the linear exponential family. Based on this work, Lange [24] introduced a generalisation of this gradient ascent algorithm that, under certain assumptions, is locally equivalent to the EM algorithm.

2.2.3 Gradient ascent EM

The idea behind the gradient ascent EM algorithm is very simple: instead of maximising Q_{θ_ℓ}(θ) exactly, Theorem 2.1 implies that it is sufficient to increase Q_{θ_ℓ}(θ) to guarantee an increase of the log-likelihood. Therefore, it should be sufficient to perform just one Newton step to increase the likelihood. The sequence of parameter estimates will be updated according to

    θ_{ℓ+1} = θ_ℓ − [ ∂²Q_{θ_ℓ}(θ)/∂θ² ]⁻¹ ∇_θ Q_{θ_ℓ}(θ) |_{θ=θ_ℓ} ≜ Λ(θ_ℓ),      (2.6)

where the Hessian matrix and the gradient are evaluated at the previous value θ_ℓ. Assuming (amongst other things) that the Hessian matrix is negative definite, Lange [24] showed that this scheme converges at the same rate as the original EM algorithm.

The assumption of negative definiteness guarantees a unique maximum of Q_{θ_ℓ}(θ), and that the iteration will increase the log-likelihood. If the Hessian is not negative definite, it is often possible to reparametrise the distributions in a way that provides negative definiteness. Reparametrising the model does not affect the regular EM algorithm,³ but does have an impact on this gradient approach.

³ Although reparametrisation might make it easier to find closed form expressions for the M-step, as can be seen in Appendix B.

It is possible to improve the convergence rate. Instead of updating according to (2.6), consider the modified updating formula

    θ_{ℓ+1} = θ_ℓ + t (Λ(θ_ℓ) − θ_ℓ) ≜ Λ_t(θ_ℓ),      (2.7)

for some step-size t. Lange [24] shows that for any t ∈ (0, 2), Λ_t converges to the same stationary point as Λ. For large amounts of missing data, as in HMMs, a value of t = 2 usually gives good results. However, if there are constraints on the parameters, one should be careful to check that Λ_t(θ_ℓ) ∈ Θ, to ensure that the new parameters are feasible. A noteworthy remark is that although choosing t > 1 might cause the intermediate quantity to decrease locally, it usually increases the speed of convergence. There are more advanced algorithms, such as quasi-Newton methods and conjugate gradient methods, that speed up the convergence further by also approximating the gradient and/or the Hessian of the log-likelihood [25]. However, this was not within the scope of this thesis.
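A minimal sketch of the over-relaxed update (2.7), where em_update is an assumed callable implementing the ordinary EM update Λ:

    def over_relaxed_step(theta, em_update, t=2.0):
        # Lambda_t(theta) = theta + t * (Lambda(theta) - theta), t in (0, 2).
        lam = em_update(theta)
        new_theta = theta + t * (lam - theta)
        return new_theta  # caller should verify that new_theta lies in Theta

For t = 1 this reduces to the ordinary EM update, and for t = 2 it takes the doubled step discussed above.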

2.2.4 Averaging

If the E-step in the EM algorithm is intractable and approximated with sequential Monte Carlo methods, there will always be an error associated with the smoothed sufficient statistics. This means that the sequence {θ_ℓ}_{ℓ≥0} will not converge to θ̂; rather, θ_ℓ will be a noisy estimate of the true maximum likelihood estimator. To reduce the noise, the parameters can be estimated using a weighted average of the observed parameters, with weights proportional to the sample sizes used. The averaged estimator θ̃_ℓ is then given by [7]

    θ̃_ℓ = Σ_{i=ℓ₀}^ℓ ( N_i / Σ_{j=ℓ₀}^ℓ N_j ) θ_i,      (2.8)

where N_i is the sample size used during iteration i, and ℓ₀ determines when to start the averaging; ℓ₀ should be chosen large enough that the estimators have converged and exhibit steady-state behaviour.
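In code, the weighted average (2.8) is a short computation; here thetas is the list of parameter estimates and ns the corresponding particle sample sizes:

    import numpy as np

    def averaged_estimate(thetas, ns, burn_in):
        # Weighted average of the iterates theta_{l0}, ..., theta_l with
        # weights proportional to the sample sizes N_i, as in (2.8).
        w = np.asarray(ns[burn_in:], dtype=float)
        th = np.asarray(thetas[burn_in:], dtype=float).reshape(len(w), -1)
        return np.squeeze(w @ th / w.sum())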

2.3 Sequential Monte Carlo methods

To approximate the smoothing distribution in the E-step of the EM algorithm, sequential Monte Carlo methods⁴ will be used. Handschin [17] introduced Monte Carlo methods to non-linear filtering by applying importance sampling in a sequential manner to find approximations of the filtering distributions {φ_t}. Importance sampling generates a weighted particle sample {(ξ_t^i, ω_t^i)}_{i=1}^N that approximates some target distribution φ_t; it consists of the particles ξ_t^i and the weights ω_t^i. Here, the case where the normalising constant of the target distribution is unknown is considered. This is known as self-normalised importance sampling.

⁴ To reiterate: this is a very brief introduction. Any reader not familiar with the subject is advised to read e.g. [7, 13, 14] for a more thorough background.


The weighted particle sample is an approximation of φ_t in the sense that

    φ_t h ≜ ∫ h(x) φ_t(x) dx ≈ Σ_{i=1}^N (ω_t^i / Ω_t) h(ξ_t^i),      (2.9)

where Ω_t ≜ Σ_{ℓ=1}^N ω_t^ℓ. Even though the technique has been known since 1970, it was for a long time not used to any major degree; the main reason is probably that the computational requirements were too large [7]. In addition, it was later revealed that the algorithm suffers from sample impoverishment, where the importance sampling weights (as a rule) degenerate over time.
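The self-normalised estimate (2.9) is computed from a weighted particle sample as follows (h is assumed to accept a numpy array of particles):

    import numpy as np

    def weighted_mean(h, particles, weights):
        # sum_i (omega_t^i / Omega_t) h(xi_t^i)
        w = weights / weights.sum()
        return np.sum(w * h(particles))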

2.3.1 The bootstrap particle filter

To resolve the weight degeneration in the naïve sequential importance sampling algorithm, Gordon et al. [16] introduced the bootstrap particle filter, which provides numerically stable approximations of the filter distributions {φ_t} for sequences of arbitrary length. It was introduced for HMMs, and the basic idea is to duplicate particles that have large weights, and to "kill" those with small weights that do not contribute to the estimator.

Assume there exists a weighted particle sample {(ξ_t^i, ω_t^i)}_{i=1}^N targeting the filter distribution φ_t. The aim is to compute a new set of weighted particles targeting φ_{t+1}. The first step is to generate a uniformly weighted sample targeting φ_t. For each particle, an index is drawn according to

    I_t^i ∼ Pr({ω_t^j}_{j=1}^N),  i ∈ ⟦1, N⟧.      (2.10)

Here, Pr({ω_t^i}_{i=1}^N) denotes the categorical distribution where the probability of drawing index j is proportional to its weight ω_t^j, i.e. P(I_t^i = j) = ω_t^j / Ω_t. By drawing I_t^i according to (2.10), the new particle sample {(ξ_t^{I_t^i}, 1)}_{i=1}^N is a uniformly weighted sample targeting φ_t. This is referred to as the selection step. After the particles have been resampled, they are propagated according to

    ξ_{t+1}^i ∼ r_{t+1}(ξ_t^{I_t^i}; ·),  i ∈ ⟦1, N⟧,      (2.11)

where r_{t+1} is some proposal density whose support covers the support of φ_{t+1}. This is known as the mutation step. Lastly, the new weights are calculated as

    ω_{t+1}^i = q(ξ_t^{I_t^i}; ξ_{t+1}^i) g(ξ_{t+1}^i; y_{t+1}) / r_{t+1}(ξ_t^{I_t^i}; ξ_{t+1}^i).      (2.12)

To initialise the algorithm, ξ_0^i is drawn from some other distribution ρ, and the weights are computed as ω_0^i = χ(ξ_0^i)/ρ(ξ_0^i). The key to the algorithm is the realisation that the resampling step does not add any bias to the estimator, and that it keeps the algorithm numerically stable. The original algorithm proposed by Gordon et al. [16] uses the dynamics of the underlying state process to propagate the particles, i.e. r_t = q for all t ∈ N*. This simplifies (2.12) to ω_{t+1}^i = g(ξ_{t+1}^i; y_{t+1}). The choice of q as the proposal density gives the bootstrap particle filter.
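A sketch of one bootstrap-filter step for the toy linear Gaussian HMM simulated earlier (hypothetical parameters, not a model from this thesis); since r_t = q, the new weight is simply the Gaussian emission density evaluated at the new observation:

    import numpy as np

    def bootstrap_step(particles, weights, y_next, phi=0.9, sigma_x=0.5,
                       sigma_y=1.0, rng=None):
        rng = rng or np.random.default_rng()
        n = len(particles)
        # Selection (2.10): multinomial resampling proportional to weights.
        idx = rng.choice(n, size=n, p=weights / weights.sum())
        # Mutation (2.11): propagate through the prior dynamics q.
        new = rng.normal(phi * particles[idx], sigma_x)
        # Weighting (2.12) with r = q: omega = g(xi_{t+1}; y_{t+1}).
        w = np.exp(-0.5 * ((y_next - new) / sigma_y) ** 2) \
            / (sigma_y * np.sqrt(2.0 * np.pi))
        return new, w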

The algorithm produces two additional results. Firstly, it provides an unbiased estimator of the normalising constant (i.e. the likelihood of the observed data), which is given by [35]

    L(y_{0:t}; θ) = ∏_{s=0}^t Ω_s.      (2.13)

Furthermore, at each time step the entire trajectory is copied in the resampling step, i.e. ξ_{0:s+1}^i = (ξ_{0:s}^{I_s^i}, ξ_{s+1}^i). The new trajectory is a sample from the smoothing distribution φ_{0:s+1}. Hence, in addition to the filter distribution, an approximation of any smoothed statistic φ_{0:t}h_t is given by

    φ_{0:t}h_t ≜ ∫ h_t(x_{0:t}) φ_{0:t}(x_{0:t}) dx_{0:t} ≈ Σ_{i=1}^N (ω_t^i / Ω_t) h_t(ξ_{0:t}^i).      (2.14)

There is, however, a major flaw with this estimator as a consequence of the systematic resampling: the resampling diminishes the number of unique sub-trajectories. If t is sufficiently large, there will exist a k < t such that ξ_{0:k}^i is the same for all indices i. Therefore, the estimator will suffer from large variance. This well-known phenomenon is called ancestral degeneration, and it will be addressed in Chapter 3.

Chapter 3

Sequential Monte Carlo Methods for Virtually Hidden Markov Models

In this chapter, the special type of partially observed Markov models under consideration will be defined; these will be called virtually hidden Markov models. Furthermore, the SMC algorithms for HMMs will be adapted to these types of processes. The algorithms in question include sequential importance sampling with resampling (SISR), forward-filtering, backward-smoothing (FFBSm), forward-filtering, backward-simulation (FFBSi), and the particle-based, rapid incremental smoother (PaRIS). The main results are Theorem 3.1 and Theorem 3.2, which form the foundation for the adaptations of the smoothing algorithms to virtually hidden Markov models. Lastly, the design of the proposal/instrumental density is covered for two cases: the proposal density in the SISR algorithm, and the instrumental density for the accept-reject algorithm used in the FFBSi/PaRIS algorithms.

3.1 Virtually hidden Markov models

Definition 3.1 (Partially observed Markov model). A partially observed Markov model is a bivariate Markov process {(X_t, Y_t)}_{t∈N}, of which only {Y_t}_{t∈N} is observable.

This somewhat non-technical definition explains intuitively what a partially observed Markov model is: a Markov process where some data is latent in the sense that it is not observable. In this thesis, only a certain type of partially observed Markov model will be considered. The term used for these types of processes will be virtually hidden Markov models (VHMMs). Let (X, 𝒳) and (Y, 𝒴) be two measurable spaces, and let Q : X × 𝒳 → [0, 1] and G : X² × 𝒴 → [0, 1] be two Markov transition kernels. Furthermore, let η : 𝒳 ⊗ 𝒴 → [0, 1] be a probability measure on the product space (X × Y, 𝒳 ⊗ 𝒴). A VHMM is the canonical Markov chain {(X_t, Y_t)}_{t∈N*} induced by the Markov transition kernel

    X × Y × (𝒳 ⊗ 𝒴) ∋ ((x, y), A × B) ↦ ∫_A Q(x, dx′) G((x, x′), B),      (3.1)

with initial distribution η. In this setup, it is simpler to define the initial distribution of the chain via an auxiliary state X_0 one step before the observations. Let χ be a probability measure on (X, 𝒳), and define η as

    η(A, B) ≜ ∫ 1_{X×A}(x, x′) χ(dx) Q(x, dx′) G((x, x′), B),  A × B ∈ 𝒳 ⊗ 𝒴.      (3.2)

It will be assumed that the VHMM is fully dominated, i.e. that Q and G admit densities (with respect to relevant measures) as

    Q(x, A) = ∫_A q(x; x′) dx′,        (x, A) ∈ X × 𝒳,
    G((x, x′), B) = ∫_B g(x, x′; y) dy,  ((x, x′), B) ∈ X² × 𝒴.      (3.3)

Furthermore, the initial distribution χ will also be assumed to admit a density, which will, by slight abuse of notation, also be denoted by χ.

To simplify the notation, the emission density's dependence on an observed value y_k will be dropped and instead implicitly denoted by a subscript: g_θ(x, x′; y_k) ≜ g_{k;θ}(x, x′). Here, the explicit dependence on the parameter θ has been added as well. Furthermore, note that the arguments of the emission density are in chronological order, i.e. g_{k;θ}(x, x′) is the density associated with the transition Y_k | X_{k−1} = x, X_k = x′. This is in contrast to densities of transition kernels, where the arguments are separated by ';', e.g. q_θ(x; ·) is the density associated with the transition X_{t+1} | X_t = x. The dynamics specified in (3.1) and (3.3) imply that the process (X_{0:t}, Y_{1:t}) admits a joint density, which is given by

    p_θ(x_{0:t}, y_{1:t}) = χ_θ(x_0) ∏_{k=1}^t q_θ(x_{k−1}; x_k) g_{k;θ}(x_{k−1}, x_k).      (3.4)

To implement the E-step of the EM algorithm, it is necessary to calculate smoothed sufficient statistics φ_{0:t;θ}h_t; to approximate them, SMC methods will be used. As can be seen in (3.4), the joint density of a VHMM factorises. This means that the smoothed sufficient statistics φ_{0:t;θ}h_t needed for the E-step of the EM algorithm will be of additive form, i.e.

    h_t(x_{0:t}) = Σ_{k=1}^t h̃_k(x_{k−1}, x_k),      (3.5)

where the h̃_k in general depend on the observations y_{1:t}.
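As an illustration of the additive form (3.5), the following is a hypothetical additive term h̃_k (not the statistic derived in Chapter 5): one term of the smoothed statistic Σ_k x_{k−1} x_k, typical for an autoregressive transition density:

    def h_tilde(x_prev, x_curr):
        # One term of the additive functional (3.5); works elementwise
        # on numpy arrays of particles.
        return x_prev * x_curr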

3.2 Filtering in virtually hidden Markov models

Filtering in VHMMs is very similar to filtering in HMMs; Desbouvries and Pieczynski [11] give the general formula for filtering in partially observed Markov models. Consider the special case of VHMMs. By Bayes' formula, it holds that

    p_θ(x_{0:t+1}, y_{1:t+1}) = p_θ(y_{t+1} | x_{0:t+1}, y_{1:t}) p_θ(x_{t+1} | x_{0:t}, y_{1:t}) p_θ(x_{0:t}, y_{1:t})
                              = p_θ(y_{t+1} | x_{t:t+1}) p_θ(x_{t+1} | x_t) p_θ(x_{0:t}, y_{1:t}),

where the second equality follows from the dependence structure of the VHMM. By definition, p_θ(x_{t+1} | x_t) = q_θ(x_t; x_{t+1}) and p_θ(y_{t+1} | x_{t:t+1}) = g_{t+1;θ}(x_t, x_{t+1}), and hence the joint density satisfies the recursion

    p_θ(x_{0:t+1}, y_{1:t+1}) = q_θ(x_t; x_{t+1}) g_{t+1;θ}(x_t, x_{t+1}) p_θ(x_{0:t}, y_{1:t}),  t ∈ N,      (3.6)

with the convention p_θ(x_{0:0}, y_{1:0}) ≜ χ_θ(x_0). The similarity with the analogous recursion for an HMM, p_θ(x_{0:t+1}, y_{1:t+1}) = q_θ(x_t; x_{t+1}) g_{t+1;θ}(x_{t+1}) p_θ(x_{0:t}, y_{1:t}), makes it reasonable to assume that algorithms for filtering will be similar; the only difference is that the emission density also depends on the previous value of the state process. By reasoning similar to Section 2.3, a weighted particle sample targeting φ_{t+1} for a VHMM can be constructed from {(ξ_t^i, ω_t^i)}_{i=1}^N by adjusting only the weight calculation in (2.12) to incorporate the new emission density. The algorithm is summarised in Algorithm 3.1.

Algorithm 3.1: Sequential importance sampling with resampling (SISR).

    Data: {(ξ_t^i, ω_t^i)}_{i=1}^N targeting φ_t.
    Result: {(ξ_{t+1}^i, ω_{t+1}^i)}_{i=1}^N targeting φ_{t+1}.
    for i ← 1 to N do
        Draw I_t^i ∼ Pr({ω_t^j}_{j=1}^N)                  /* Selection */
        Draw ξ_{t+1}^i ∼ r_{t+1}(ξ_t^{I_t^i}; ·)          /* Mutation */
        ω_{t+1}^i ← q(ξ_t^{I_t^i}; ξ_{t+1}^i) g_{t+1}(ξ_t^{I_t^i}, ξ_{t+1}^i) / r_{t+1}(ξ_t^{I_t^i}; ξ_{t+1}^i)
    end for
    return {(ξ_{t+1}^i, ω_{t+1}^i)}_{i=1}^N
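A Python sketch of Algorithm 3.1 with generic callables; q_dens(x, x'), g_dens(x, x', y), r_sample and r_dens are assumptions standing in for the VHMM densities and the proposal kernel, and are assumed to be vectorised over particles:

    import numpy as np

    def sisr_step(particles, weights, y_next, q_dens, g_dens,
                  r_sample, r_dens, rng=None):
        rng = rng or np.random.default_rng()
        n = len(particles)
        idx = rng.choice(n, size=n, p=weights / weights.sum())  # selection
        anc = particles[idx]
        new = r_sample(anc, y_next, rng)                        # mutation
        # Weight update with the VHMM emission g(x, x'; y_{t+1}).
        w = q_dens(anc, new) * g_dens(anc, new, y_next) / r_dens(anc, new, y_next)
        return new, w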

Following the notation in [36], one iteration of Algorithm 3.1 will be denoted by

    {(ξ_{t+1}^i, ω_{t+1}^i)}_{i=1}^N ← PF({(ξ_t^i, ω_t^i)}_{i=1}^N).      (3.7)

3.3 Smoothing in virtually hidden Markov models

As noted in Section 2.3, the particle trajectories generated by Algorithm 3.1 suffer from ancestral degeneration due to the systematic resampling. Consequently, any weighted sample of particle trajectories that approximates the smoothing distribution will exhibit large variance, and other methods need to be considered.

3.3.1 Fixed-lag smoothing

One remedy for the ancestral degeneracy is to apply fixed-lag smoothing [7, 21, 37]. The main idea of fixed-lag smoothing is that an observation far away from a certain state should not have a big influence on that state. Therefore, when considering the smoothing distribution of the consecutive states s:s′, it should hold that

    φ_{s:s′|t} ≈ φ_{s:s′|s′+∆},  t ≥ s′ + ∆,      (3.8)

for large enough ∆. This approximation is valid if the chain is ergodic, i.e. if it exhibits forgetting properties: given two copies of the chain with different initial distributions, the chain is ergodic if the distributions of the two chains approach each other as time increases. When calculating smoothed expectations of the additive form given in (3.5), (3.8) suggests that φ_{0:t}h_t can be approximated as

    φ_{0:t;θ}h_t ≈ Σ_{k=1}^t φ_{k−1:k|(k+∆)∧t;θ} h̃_k.      (3.9)

Advantages of fixed-lag smoothing include that only ∆ weighted particle samples need to be stored, that it is simple to implement, and that the computational complexity of the algorithm is O(Nt). The major disadvantage is the need to choose ∆ properly: if ∆ is too small, the approximation (3.8) is not good enough, whereas if ∆ is too large, the sample paths will have degenerated too much. It is a case of the classical bias-variance trade-off. A good choice of ∆ is very dependent on the process being considered, and there is no general rule. What can be said for VHMMs, however, is that this choice is perhaps even more important than for HMMs. Due to the extra degree of freedom in the emission density, the importance sampling weights degenerate faster than for HMMs. Therefore, it is necessary to resample more often, which speeds up the ancestral degeneration. This observation is mainly based on an implementation of the fixed-lag smoother for the stochastic volatility model described in Chapter 5: for ∆ as small as 5, the approximation (3.9) degenerated into one path for some time steps, almost independently of the number of particles. Consequently, the resulting estimators suffered from very large variance.

3.3.2 The time-reversed process

In HMMs, the "efficient smoothing algorithms" rely on the fact that, conditionally on the observations, the time-reversed process is a time-inhomogeneous Markov process as well. If the HMM is fully dominated, then it has a backward transition density ←q_{φ_s;θ} : X² → R₊ such that for any A ∈ 𝒳, it holds that P(X_s ∈ A | X_{s+1} = x_{s+1}, y_{0:t}) = ∫_A ←q_{φ_s;θ}(x_{s+1}; x_s) dx_s. For HMMs, the backward density is given by [7]

    ←q_{φ_s;θ}(x_{s+1}; x_s) = φ_s(x_s) q(x_s; x_{s+1}) / ∫ φ_s(x_s′) q(x_s′; x_{s+1}) dx_s′.      (3.10)

This can be extended to partially observed Markov models, for which the following theorem holds.

Theorem 3.1. For any fully dominated partially observed Markov model, the time-reversed process is, conditionally on the observations, a time-inhomogeneous Markov process with backward transition density given by

    ←q_{φ_s}(x_{s+1}; x_s) = p(x_{s+1}, y_{s+1} | x_s, y_s) φ_s(x_s) / ∫_X p(x_{s+1}, y_{s+1} | x_s′, y_s) φ_s(x_s′) dx_s′      (3.11)

for any s ∈ ⟦0, t − 1⟧.

Proof. If it is possible to show that X_s | X_{s+1:t}, Y_{1:t} =^d X_s | X_{s+1}, Y_{1:t} for any s ∈ ⟦0, t − 1⟧, the process is Markovian. This will be done by computing the density of X_s | X_{s+1:t}, Y_{1:t} and showing that it does not depend on X_{s+2:t}; as a bonus, this will yield the backward transition density. To find the conditional density, first decompose the joint density of (X_{s:t}, Y_{1:t}) as

    p(y_{1:s}, y_{s+1}, y_{s+2:t}, x_s, x_{s+1}, x_{s+2:t})
      = p(x_{s+2:t}, y_{s+2:t} | x_{s+1}, y_{s+1}, x_s, y_{1:s}) p(x_s, y_{1:s}, x_{s+1}, y_{s+1})
      = p(x_{s+2:t}, y_{s+2:t} | x_{s+1}, y_{s+1}) p(x_s, y_{1:s}, x_{s+1}, y_{s+1}),

with the convention p(x_{s+2:t}, y_{s+2:t} | x_{s+1}, y_{s+1}, x_s, y_{1:s}) ≜ 1 if s = t − 1. The joint density p(x_s, y_{1:s}, x_{s+1}, y_{s+1}) can be further decomposed as

    p(x_s, y_{1:s}, x_{s+1}, y_{s+1}) = p(x_{s+1}, y_{s+1} | x_s, y_s) p(x_s | y_{1:s}) p(y_{1:s}).

This means that the backward transition density for X_s | X_{s+1:t}, Y_{1:t} is given by

    p(x_s | x_{s+1:t}, y_{1:t}) = p(y_{1:s}, y_{s+1}, y_{s+2:t}, x_s, x_{s+1}, x_{s+2:t}) / ∫_X p(y_{1:s}, y_{s+1}, y_{s+2:t}, x_s′, x_{s+1}, x_{s+2:t}) dx_s′
                                = p(x_{s+1}, y_{s+1} | x_s, y_s) φ_s(x_s) / ∫_X p(x_{s+1}, y_{s+1} | x_s′, y_s) φ_s(x_s′) dx_s′
                                = p(x_s | x_{s+1}, y_{1:s+1})
                                = ←q_{φ_s}(x_{s+1}; x_s),

which is the expression above. Since it depends only on x_{s+1} (and not on x_{s+2:t}), the time-reversed process is Markovian. ∎

The following theorem is the main result upon which all further analysis in this chapter is based.

Theorem 3.2. For a fully dominated virtually hidden Markov model with the dynamics given in (3.1) and (3.3), the backward density is given by

    ←q_{φ_s}(x_{s+1}; x_s) = g_{s+1}(x_s, x_{s+1}) q(x_s; x_{s+1}) φ_s(x_s) / ∫_X g_{s+1}(x_s′, x_{s+1}) q(x_s′; x_{s+1}) φ_s(x_s′) dx_s′.      (3.12)

Proof. The product density for a VHMM with the dynamics given in (3.1) and (3.3) is given by

    p(x_{s+1}, y_{s+1} | x_s, y_s) = q(x_s; x_{s+1}) g_{s+1}(x_s, x_{s+1}),      (3.13)

which, inserted into (3.11), gives the result. ∎

The Markovian property of VHMMs in the time-reversed direction established in Theorem 3.2 is the foundation for all further results on smoothing in VHMMs. It implies that the smoothing distribution of any such process can be factorised as

    φ_{0:t;θ}(x_{0:t}) = φ_t(x_t) ∏_{s=0}^{t−1} ←q_{φ_s;θ}(x_{s+1}; x_s).      (3.14)

Hence, for any measurable function h_t : X^{t+1} → R, the smoothed expectation φ_{0:t;θ}h_t can be written as

    φ_{0:t;θ}h_t = ∫_{X^{t+1}} h_t(x_{0:t}) ( ∏_{s=0}^{t−1} ←q_{φ_s;θ}(x_{s+1}; x_s) ) φ_t(x_t) dx_{0:t},      (3.15)

which is called the backward decomposition of a VHMM.

3.3.3 Forward-filtering, backward-smoothing

Theorem 3.1 shows, in addition to the Markovian property, that the backward density is proportional to the filtering distribution for any partially observed Markov model. Since smoothing is desired for VHMMs, consider the setup given in Theorem 3.2. Assume that there exist two weighted particle samples {(ξ_t^i, ω_t^i)}_{i=1}^N and {(ξ_{t+1}^i, ω_{t+1}^i)}_{i=1}^N targeting φ_{t;θ} and φ_{t+1;θ}, respectively. An approximation of the backward density can then be computed as

    ←q_{φ_t;θ}^N(ξ_{t+1}^i; ξ_t^j) = ω_t^j q_θ(ξ_t^j; ξ_{t+1}^i) g_{t+1;θ}(ξ_t^j, ξ_{t+1}^i) / Σ_{ℓ=1}^N ω_t^ℓ q_θ(ξ_t^ℓ; ξ_{t+1}^i) g_{t+1;θ}(ξ_t^ℓ, ξ_{t+1}^i).      (3.16)

If weighted particle samples have been generated for all filtering distributions {φ_{s;θ}}_{s=0}^t, (3.16) can be plugged into (3.15), yielding the approximation of φ_{0:t;θ}h_t as

    φ_{0:t;θ}^N h_t = Σ_{i_0=1}^N ··· Σ_{i_t=1}^N ( ∏_{s=0}^{t−1} ω_s^{i_s} q_θ(ξ_s^{i_s}; ξ_{s+1}^{i_{s+1}}) g_{s+1;θ}(ξ_s^{i_s}, ξ_{s+1}^{i_{s+1}}) / Σ_{ℓ=1}^N ω_s^ℓ q_θ(ξ_s^ℓ; ξ_{s+1}^{i_{s+1}}) g_{s+1;θ}(ξ_s^ℓ, ξ_{s+1}^{i_{s+1}}) ) (ω_t^{i_t} / Ω_t) h_t(ξ_0^{i_0}, ..., ξ_t^{i_t}).      (3.17)

This is the general forward-filtering, backward-smoothing (FFBSm) estimator. However, for a general h_t, the complexity is O(N^t), which is not computationally feasible. Therefore, it is necessary to resort either to approximations of this distribution, or to adaptations to special structures of the problem being considered.
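The particle backward kernel (3.16) is cheap to evaluate for a single fixed ξ_{t+1}^i; a sketch (q_dens and g_dens as in the SISR sketch above):

    import numpy as np

    def backward_probs(x_next, particles_t, weights_t, y_next, q_dens, g_dens):
        # Probability of moving from x_next back to each xi_t^j, eq. (3.16).
        num = weights_t * q_dens(particles_t, x_next) \
                        * g_dens(particles_t, x_next, y_next)
        return num / num.sum()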

3.3.4 Forward-only FFBSm

Del Moral et al. [9] noted that it is possible to construct a forward-only FFBSm algorithm for additive functionals of the form given in (3.5). Let 𝒢_{s:s′} ≜ σ(X_{s:s′}, Y_{1:s′}) be the sigma-algebra generated by X_{s:s′} and Y_{1:s′}, and let T_t(X_t) ≜ E[h_t(X_{0:t}) | 𝒢_{t:t}]. The aim is to construct a recursion for T_t(X_t). A straightforward application of the tower property yields

    T_{t+1}(X_{t+1}) = E[h_{t+1}(X_{0:t+1}) | 𝒢_{t+1:t+1}]
      = E[ E[h_t(X_{0:t}) + h̃_{t+1}(X_{t:t+1}) | 𝒢_{t:t+1}] | 𝒢_{t+1:t+1} ]
        {since X_{t:t+1} is 𝒢_{t:t+1}-measurable and X_{0:t} | 𝒢_{t:t+1} =^d X_{0:t} | 𝒢_{t:t}}
      = E[ E[h_t(X_{0:t}) | 𝒢_{t:t}] + h̃_{t+1}(X_{t:t+1}) | 𝒢_{t+1:t+1} ]
      = E[ T_t(X_t) + h̃_{t+1}(X_{t:t+1}) | 𝒢_{t+1:t+1} ]
      = ∫ ←q_{φ_t;θ}(X_{t+1}; x_t) ( T_t(x_t) + h̃_{t+1}(x_t, X_{t+1}) ) dx_t.

Assume that {τ_t^i}_{i=1}^N are approximations of {T_t(ξ_t^i)}_{i=1}^N. Then it is possible to recursively update these approximations to obtain a new approximation τ_{t+1}^i of T_{t+1}(ξ_{t+1}^i) by inserting (3.16) into the expression above, yielding

    τ_{t+1}^i = Σ_{j=1}^N [ ω_t^j q_θ(ξ_t^j; ξ_{t+1}^i) g_{t+1;θ}(ξ_t^j, ξ_{t+1}^i) / Σ_{ℓ=1}^N ω_t^ℓ q_θ(ξ_t^ℓ; ξ_{t+1}^i) g_{t+1;θ}(ξ_t^ℓ, ξ_{t+1}^i) ] ( τ_t^j + h̃_{t+1}(ξ_t^j, ξ_{t+1}^i) ).      (3.18)

The algorithm is initialised with τ_0^i = 0. Furthermore, to obtain the estimate of φ_{0:t;θ}h_t, the tower property yields

    φ_{0:t;θ}h_t = E[h_t(X_{0:t}) | y_{0:t}] = E[T_t(X_t) | y_{0:t}] = ∫_X T_t(x_t) φ_t(x_t) dx_t,      (3.19)

which gives the approximation

    φ_{0:t;θ}^N h_t = Σ_{i=1}^N (ω_t^i / Ω_t) τ_t^i.      (3.20)

This forward-only implementation of FFBSm is attractive for several reasons. Firstly, the memory requirements are small, as only the filter distribution and the auxiliary statistics need to be stored. Secondly, it provides online estimates of the smoothed statistics, and does not require a complete forward pass every time new data is observed. However, as the normalisation constant in (3.18) depends on ξ_{t+1}^i, it needs to be computed once for each term in the sum. This implies that the computational complexity of the forward-only FFBSm is O(N²t). This is better than the geometric complexity of the standard FFBSm, but the quadratic complexity is still not optimal, as it makes very large sample sizes infeasible. Hence, it would be helpful if it could be reduced.
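The exact forward-only update (3.18) then amounts to one O(N) backward-kernel evaluation per new particle, i.e. O(N²) work per time step; a sketch using the backward_probs helper from the previous section:

    import numpy as np

    def ffbsm_update(taus, particles_t, weights_t, particles_next, y_next,
                     q_dens, g_dens, h_tilde):
        # tau_{t+1}^i = sum_j bw_j * (tau_t^j + h_tilde(xi_t^j, xi_{t+1}^i)),
        # with bw the backward probabilities (3.16) for xi_{t+1}^i.
        new_taus = np.empty_like(taus)
        for i, x_next in enumerate(particles_next):
            bw = backward_probs(x_next, particles_t, weights_t, y_next,
                                q_dens, g_dens)
            new_taus[i] = np.sum(bw * (taus + h_tilde(particles_t, x_next)))
        return new_taus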

3.3.5 Forward-filtering, backward-simulation

The exponential complexity of the naïve FFBSm algorithm for generic functionals makes it unusable in practice. It is possible to reduce the complexity by introducing some extra randomness; this is known as forward-filtering, backward-simulation (FFBSi). Just like the bootstrap particle filter, the idea is to generate a uniformly weighted sample by resampling. Assume that a sequence of weighted particle samples {{(ξ_s^i, ω_s^i)}_{i=1}^N}_{s=0}^t targeting {φ_s}_{s=0}^t has been generated. It is possible to draw index chains {J̃_s}_{s=0}^t that produce uniformly weighted samples from the smoothing distribution. This is done by initially drawing J̃_t ∼ Pr({ω_t^j}_{j=1}^N), and afterwards generating indices recursively according to

    J̃_s | J̃_{s+1} = j ∼ Pr({ω_s^i q_θ(ξ_s^i; ξ_{s+1}^j) g_{s+1;θ}(ξ_s^i, ξ_{s+1}^j)}_{i=1}^N),  s ∈ ⟦0, t − 1⟧.      (3.21)

Similarly to the SISR algorithm, this makes (ξ_0^{J̃_0}, ..., ξ_t^{J̃_t}) a uniformly weighted sample that approximates the smoothing distribution. Consequently, an arbitrary smoothed functional can be estimated by generating M index chains and computing the smoothed statistic as [15]

    φ_{0:t;θ}^M h_t = (1/M) Σ_{i=1}^M h_t(ξ_0^{J̃_0^i}, ..., ξ_t^{J̃_t^i}).      (3.22)

If M = N, the complexity of FFBSi is, due to the normalisation constant being dependent on the previous index, still O(N²t), just like the forward-only FFBSm algorithm. However, under the additional assumption that the product density q_θ(x; x′) g_{t;θ}(x, x′) is bounded from above and below, Douc et al. [12] show that the complexity of the algorithm can be brought down to O(Nt) by an accept-reject algorithm.¹
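A sketch of drawing one backward index chain according to (3.21), given stored particle clouds xs[s], weights ws[s] and observations ys[s] for s = 0, ..., t (direct multinomial draws, i.e. without the accept-reject speed-up):

    import numpy as np

    def backward_chain(xs, ws, ys, q_dens, g_dens, rng=None):
        rng = rng or np.random.default_rng()
        t = len(xs) - 1
        js = np.empty(t + 1, dtype=int)
        js[t] = rng.choice(len(ws[t]), p=ws[t] / ws[t].sum())
        for s in range(t - 1, -1, -1):
            x_next = xs[s + 1][js[s + 1]]
            # Unnormalised backward probabilities of (3.21).
            p = ws[s] * q_dens(xs[s], x_next) * g_dens(xs[s], x_next, ys[s + 1])
            js[s] = rng.choice(len(p), p=p / p.sum())
        return js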

3.3.6 The PaRIS algorithm

The particle-based, rapid incremental smoother (PaRIS) is a new algorithm proposed by Olsson and Westerborn [36]. It is based on the forward-only implementation of FFBSm; however, instead of computing (3.18) exactly, an FFBSi-like resampling step is performed. Given a particle sample {(ξ_t^i, ω_t^i)}_{i=1}^N targeting φ_{t;θ} and approximations {τ_t^i}_{i=1}^N targeting {T_t(ξ_t^i)}_{i=1}^N, the aim is to generate {(ξ_{t+1}^i, ω_{t+1}^i)}_{i=1}^N targeting φ_{t+1;θ} and approximations {τ_{t+1}^i}_{i=1}^N targeting {T_{t+1}(ξ_{t+1}^i)}_{i=1}^N. As derived above, it holds that

    T_{t+1}(x_{t+1}) = ∫_X ←q_{φ_t;θ}(x_{t+1}; x_t) ( T_t(x_t) + h̃(x_t, x_{t+1}) ) dx_t.      (3.23)

In the forward-only FFBSm regime, the naïve estimator of ←q_{φ_t;θ} was plugged in, giving the estimate (3.18). The idea behind PaRIS is to introduce additional randomness into the estimator, similarly to the FFBSi algorithm, by adding a resampling step. Instead of computing (3.18) exactly, a set of Ñ indices is drawn for each auxiliary statistic τ_t^i. These indices are drawn according to

    J_{t+1}^{(i,j)} ∼ Pr({ω_t^ℓ q_θ(ξ_t^ℓ; ξ_{t+1}^i) g_{t+1;θ}(ξ_t^ℓ, ξ_{t+1}^i)}_{ℓ=1}^N),  j ∈ ⟦1, Ñ⟧.      (3.24)

This gives a new estimator τ_{t+1}^i of T_{t+1}(ξ_{t+1}^i) as

    τ_{t+1}^i = (1/Ñ) Σ_{j=1}^Ñ ( τ_t^{J_{t+1}^{(i,j)}} + h̃_{t+1}(ξ_t^{J_{t+1}^{(i,j)}}, ξ_{t+1}^i) ).      (3.25)

The algorithm is summarised in Algorithm 3.2.

¹ The result is derived for hidden Markov models, but it should hold in general.

Algorithm 3.2: Particle-based, rapid incremental smoother (PaRIS).

    Data: Particle sample {(ξ_t^i, ω_t^i)}_{i=1}^N targeting φ_t, and estimated auxiliary statistics {τ_t^i}_{i=1}^N of φ_{0:t}h_t.
    Result: Particle sample {(ξ_{t+1}^i, ω_{t+1}^i)}_{i=1}^N targeting φ_{t+1}, and estimated auxiliary statistics {τ_{t+1}^i}_{i=1}^N of φ_{0:t+1}h_{t+1}.
    {(ξ_{t+1}^i, ω_{t+1}^i)}_{i=1}^N ← PF({(ξ_t^i, ω_t^i)}_{i=1}^N)
    for i ← 1 to N do
        for j ← 1 to Ñ do
            Draw J_{t+1}^{(i,j)} ∼ Pr({ω_t^ℓ q(ξ_t^ℓ; ξ_{t+1}^i) g_{t+1}(ξ_t^ℓ, ξ_{t+1}^i)}_{ℓ=1}^N)
        end for
        τ_{t+1}^i ← Ñ⁻¹ Σ_{j=1}^Ñ ( τ_t^{J_{t+1}^{(i,j)}} + h̃_{t+1}(ξ_t^{J_{t+1}^{(i,j)}}, ξ_{t+1}^i) )
    end for
    return {(ξ_{t+1}^i, ω_{t+1}^i)}_{i=1}^N and {τ_{t+1}^i}_{i=1}^N

Thorough theoretical analysis by Olsson and Westerborn [36] shows that the algorithm is numerically unstable for Ñ = 1, but that for any Ñ ≥ 2 it remains numerically stable. By using an accept-reject algorithm similar to the one proposed in [12], it is possible to show that, under some additional assumptions, the algorithm exhibits linear complexity in the number of particles; for VHMMs, this algorithm is specified in Algorithm A.1. For large N, the variance of PaRIS is O([1 + 1/(Ñ − 1)] t/N) [36]. As the complexity of PaRIS is also linear in the model parameter Ñ, Olsson and Westerborn propose to keep Ñ of moderate size, since the variance is reduced more by increasing the number of particles. The main advantage of PaRIS is the possibility to compute auxiliary statistics on-the-fly, without the need to perform the prefatory filtering pass of regular FFBSm and FFBSi. Hence, it can be used for online smoothing with only linear complexity.
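A sketch of the PaRIS auxiliary-statistic update (3.24)-(3.25), here with direct multinomial draws (the accept-reject variant of Algorithm A.1 avoids the explicit normalisation); backward_probs is the helper sketched earlier:

    import numpy as np

    def paris_update(taus, particles_t, weights_t, particles_next, y_next,
                     q_dens, g_dens, h_tilde, n_tilde=2, rng=None):
        rng = rng or np.random.default_rng()
        new_taus = np.empty_like(taus)
        for i, x_next in enumerate(particles_next):
            bw = backward_probs(x_next, particles_t, weights_t, y_next,
                                q_dens, g_dens)
            js = rng.choice(len(bw), size=n_tilde, p=bw)   # draws (3.24)
            # Monte Carlo average (3.25) over the Ntilde backward draws.
            new_taus[i] = np.mean(taus[js] + h_tilde(particles_t[js], x_next))
        return new_taus

Note the default n_tilde=2, in line with the recommendation to keep Ñ small and instead increase N.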

PaRIS EM algorithm

By using PaRIS to approximate the smoothed statistics, it is possible to implement the EM algorithm for partially observed Markov models. Let PaRIS(· ; θ) denote one sweep of Algorithm 3.2 under the dynamics specified by θ. The resulting EM algorithm is summarised in Algorithm 3.3.

3.4 Instrumental density design

There are several choices regarding the filter and smoothing algorithms that need to be made. Which dynamics should the new particles be drawn from to ensure good performance, and how is this performance measured? What is the optimal choice of instrumental density for the accept-reject sampling of backward indices used in the PaRIS algorithm (or FFBSi)? This section aims to shed some light on these questions.

20 Algorithm 3.3: PaRIS-EM for virtually hidden Markov models.

    Data: Observed data sequence y_{1:t}, and initial guess θ_0.
    Result: Sequence of estimates {θ_ℓ}_{ℓ≥0}.
    for ℓ ← 0 to ℓ_max − 1 do
        /* Initialisation */
        for i ← 1 to N do
            ξ_0^i ∼ ρ
            ω_0^i ← χ_{θ_ℓ}(ξ_0^i) / ρ(ξ_0^i)
            τ_0^i ← 0
        end for
        for s ← 0 to t − 1 do
            {(ξ_{s+1}^i, ω_{s+1}^i), τ_{s+1}^i}_{i=1}^N ← PaRIS({(ξ_s^i, ω_s^i), τ_s^i}_{i=1}^N; θ_ℓ)
        end for
        z ← Σ_{i=1}^N (ω_t^i / Ω_t) τ_t^i      /* E-step */
        θ_{ℓ+1} ← Λ(z)                         /* M-step */
    end for

3.4.1 Designing the proposal density

To implement the standard particle filter, the first choice that needs to be made is the dynamics from which the particles will be propagated. There are several aspects to take into consideration when selecting the proposal density, such as simplicity and similarity to the target density. In the sequential importance sampling literature,

    r_t(x; x′) = q(x; x′) g_t(x, x′) / ∫_X q(x; ξ) g_t(x, ξ) dξ,  (x, x′) ∈ X²,      (3.26)

is known as the optimal density. However, it is rarely possible to sample from (3.26) directly, and therefore other proposal densities have to be chosen.

Prior density

The proposal kernel most often used [7], mainly due to its simplicity, is the transition kernel of the hidden chain, i.e. r_t = q for all t. This has the attractive features that it is usually very easy to sample from q, and that the calculation of the importance weights simplifies to ω_{t+1} = g_{t+1}(ξ_t, ξ_{t+1}) ω_t. However, this choice of proposal density has one very important deficiency: it does not take the observations into consideration. Any extreme observation will lead to very degenerate importance weights and poor estimates. This is especially true in the case of VHMMs, since the emission density depends on the previous state in addition to the current state. There might be cases where this dependence makes the filter more robust towards extreme observations; however, with the models considered in this thesis, the extra degree of freedom sped up the degeneration of the importance weights.

21 Optimal density from a parametric family

If the prior density does not perform as well as desired, another density needs to be considered. The density is usually chosen from a parametric family of densities, with parameters ϑ in some parameter space Θ. For instance, the one-dimensional normal distribution is parametrised by its mean and variance, that is, ϑ = (µ, σ²). Let the (parametric) proposal density at time t be denoted by r_{t;ϑ}(x; ·). The optimal choice of ϑ is the solution to [7]

    min_{ϑ∈Θ} sup_{x′∈X} q(x; x′) g_t(x, x′) / r_{t;ϑ}(x; x′),      (3.27)

which provides an upper bound on the importance weights. If it is possible to find the solution to (3.27) and the supremum is bounded, the choice of ϑ can be used to design an optimal accept-reject algorithm to sample from q(x; x′) g_t(x, x′) exactly. However, finding the solution is often not feasible due to computational constraints, and it is necessary to resort to other approximations.
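A crude numerical sketch of (3.27) for a one-dimensional model: the scale of a Gaussian proposal centred at the previous particle (a hypothetical parametrisation, not one used in this thesis) is chosen by minimising the supremum of q·g/r over a grid of candidate points:

    import numpy as np

    def best_scale(x, y_next, q_dens, g_dens, scales, grid):
        # Approximate argmin over scales of sup_{x'} q(x;x')g(x,x',y)/r(x;x').
        best, best_sup = None, np.inf
        for s in scales:
            r = np.exp(-0.5 * ((grid - x) / s) ** 2) / (s * np.sqrt(2 * np.pi))
            sup = np.max(q_dens(x, grid) * g_dens(x, grid, y_next) / r)
            if sup < best_sup:
                best, best_sup = s, sup
        return best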

3.4.2 Measures of weight imbalance

To evaluate how well the proposal density works in the particle filter, several measures of weight imbalance occur in the literature. If there are just a few weights that are orders of magnitude larger than the rest, these are the only weights that will contribute to the estimator, and hence the estimator will exhibit larger variance.

Coefficient of variation. The coefficient of variation of the normalised weights is defined as [7, 22]

    CV_N = [ (1/N) Σ_{i=1}^N ( N ω^i / Σ_{ℓ=1}^N ω^ℓ − 1 )² ]^{1/2}.      (3.28)

When all weights are equal, it attains its minimal value CV_N = 0, and CV_N = √(N − 1) is the maximal value, attained when all but one of the weights are zero. CV_N² can be interpreted as the number of particles that do not contribute to the estimator.

Effective sample size. Another, perhaps more intuitive, measure is the effective sample size N_eff, defined as [7, 29]

    N_eff = [ Σ_{i=1}^N ( ω^i / Σ_{ℓ=1}^N ω^ℓ )² ]⁻¹.      (3.29)

In contrast to the coefficient of variation, it is maximal (N_eff = N) when all weights are equal, and minimal (N_eff = 1) when only one weight is positive. Its interpretation is the number of particles that do contribute to the estimator. The coefficient of variation and the effective sample size are related through

    N_eff = N / (1 + CV_N²).      (3.30)
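Both measures are straightforward to compute from the unnormalised weights:

    import numpy as np

    def coeff_var(weights):
        # Coefficient of variation (3.28) of the normalised weights.
        w = weights / weights.sum()
        n = len(weights)
        return np.sqrt(np.mean((n * w - 1.0) ** 2))

    def ess(weights):
        # Effective sample size (3.29); equals N/(1 + CV^2) by (3.30).
        w = weights / weights.sum()
        return 1.0 / np.sum(w ** 2)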

22 Resampling

In its most naïve form, SISR includes a resampling step at each iteration. However, the new estimator after resampling in general exhibits higher variance. Therefore, if there is little weight degeneration, there is no need to resample. A common practice is to resample only if the weights have degenerated too much, e.g. if the effective sample size drops below a predefined threshold. It is also possible to choose a resampling technique other than multinomial sampling; other resampling schemes include residual sampling and systematic sampling. In this thesis, only multinomial sampling will be used. The reader is referred to [7, Section 7.4] for a more thorough discussion of resampling techniques, including specifications of the aforementioned procedures.
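In code, adaptive resampling amounts to a simple guard around the selection step; the threshold (here a fraction of N) is a tuning choice, not a value prescribed in this thesis:

    import numpy as np

    def maybe_resample(particles, weights, rng, frac=0.5):
        # Resample only when the ESS (see the sketch above) drops below frac*N.
        n = len(particles)
        if ess(weights) < frac * n:
            idx = rng.choice(n, size=n, p=weights / weights.sum())
            return particles[idx], np.ones(n)
        return particles, weights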

3.4.3 Instrumental kernel for backward index draws

In the original accept-reject algorithm proposed by Douc et al. [12], the index candidates are drawn from the multinomial distribution proportional to the importance weights. This is not necessarily optimal when a VHMM is considered. Algorithm A.1 relies on Theorem 3.2 and the fact that the backward density is given by
$$\overleftarrow{q}_{\phi_t;\theta}(x_{t+1}; x_t) \propto q_\theta(x_t; x_{t+1})\,g_{t+1;\theta}(x_t, x_{t+1})\,\phi_t(x_t). \qquad (3.31)$$

If there exists a set of particles $\{(\xi_t^j, \omega_t^j)\}_{j=1}^N$ targeting $\phi_t$, the backward transition density for the transition from $x_{t+1} \in \mathsf{X}$ to $\xi_t^j$ is given by
$$\overleftarrow{q}_{\phi_t;\theta}(x_{t+1}; \xi_t^j) = \frac{\omega_t^j\,q_\theta(\xi_t^j; x_{t+1})\,g_{t+1;\theta}(\xi_t^j, x_{t+1})}{\sum_{\ell=1}^N \omega_t^\ell\,q_\theta(\xi_t^\ell; x_{t+1})\,g_{t+1;\theta}(\xi_t^\ell, x_{t+1})}.$$
Since the normalisation constant depends on the "new" $x'$, drawing an index from the categorical distribution $\mathsf{Pr}(\{\overleftarrow{q}_{\phi_t;\theta}(x'; \xi_t^j)\}_{j=1}^N)$ will require computing the normalisation constant for each $x'$. Hence, drawing $N$ indices would require $N^2$ operations. To combat this, rejection sampling can be used. Let $\{\nu_t^j\}_{j=1}^N$ denote the instrumental density from which the index candidates are drawn. The aim is to choose this instrumental density optimally.

Definition 3.2 (Kullback-Leibler Divergence). Let $p$ and $q$ be two probability mass functions on the same countable space $\mathsf{X}$. The Kullback-Leibler Divergence between $p$ and $q$ is then defined as
$$D(p\,\|\,q) \triangleq \mathbb{E}_p\!\left[\log\frac{p(X)}{q(X)}\right] = \sum_{x'\in\mathsf{X}} \log\frac{p(x')}{q(x')}\,p(x'). \qquad (3.32)$$

The Kullback-Leibler Divergence is also known as the relative entropy between $p$ and $q$. An intuitive interpretation of $D(p\,\|\,q)$ is how good $q$ is as an approximation of $p$. Consequently, since the aim is to have the instrumental density mimic the backward density, the instrumental density will be considered optimal if the Kullback-Leibler Divergence between the backward density and the instrumental density is minimised. The Kullback-Leibler Divergence between these distributions is given by

$$D\!\left(\left\{\overleftarrow{q}_{\phi_t;\theta}(x_{t+1}; \xi_t^j)\right\}_{j=1}^N \,\Big\|\, \{\nu_t^j\}_{j=1}^N\right) = \sum_{j=1}^N \log\!\left(\frac{\overleftarrow{q}_{\phi_t;\theta}(x_{t+1}; \xi_t^j)}{\nu_t^j}\right)\overleftarrow{q}_{\phi_t;\theta}(x_{t+1}; \xi_t^j)$$
$$= \underbrace{\sum_{j=1}^N \log\overleftarrow{q}_{\phi_t;\theta}(x_{t+1}; \xi_t^j)\,\overleftarrow{q}_{\phi_t;\theta}(x_{t+1}; \xi_t^j)}_{(*)} - \underbrace{\sum_{j=1}^N \log(\nu_t^j)\,\overleftarrow{q}_{\phi_t;\theta}(x_{t+1}; \xi_t^j)}_{(\dagger)}. \qquad (3.33)$$

Since only the distribution of $x_{t+1}$ is known, not $x_{t+1}$ itself, it is natural to consider the expected value of (3.33). As $(*)$ does not depend on $\{\nu_t^j\}_{j=1}^N$, it is only necessary to consider $(\dagger)$. Let $r_{t+1}$ denote the proposal density in the SISR algorithm. The new particles have, conditionally on the old particles, the density $\sum_{i=1}^N \frac{\omega_t^i}{\Omega_t}\,r_{t+1}(\xi_t^i; x_{t+1})$. Assume that $r_{t+1}(\xi_t^i; x_{t+1}) \overset{\propto}{\sim} q(\xi_t^i; x_{t+1})\,g_{t+1}(\xi_t^i, x_{t+1})$. The expectation of $(\dagger)$ in (3.33) is then given by

$$\mathbb{E}[(\dagger)] = \sum_{j=1}^N \int_{\mathsf{X}} \log(\nu_t^j)\,\overleftarrow{q}_{\phi_t;\theta}(x_{t+1}; \xi_t^j) \sum_{i=1}^N \frac{\omega_t^i}{\Omega_t}\,r(\xi_t^i; x_{t+1})\,\mathrm{d}x_{t+1}$$
$$= \sum_{j=1}^N \int_{\mathsf{X}} \frac{\log(\nu_t^j)\,\omega_t^j\,q(\xi_t^j; x_{t+1})\,g_{t+1}(\xi_t^j, x_{t+1})}{\sum_{\ell=1}^N \omega_t^\ell\,q(\xi_t^\ell; x_{t+1})\,g_{t+1}(\xi_t^\ell, x_{t+1})} \sum_{i=1}^N \frac{\omega_t^i}{\Omega_t}\,r(\xi_t^i; x_{t+1})\,\mathrm{d}x_{t+1}$$
$$\approx \sum_{j=1}^N \int_{\mathsf{X}} \frac{\log(\nu_t^j)\,\omega_t^j\,q(\xi_t^j; x_{t+1})\,g_{t+1}(\xi_t^j, x_{t+1})}{\sum_{\ell=1}^N \omega_t^\ell\,q(\xi_t^\ell; x_{t+1})\,g_{t+1}(\xi_t^\ell, x_{t+1})} \sum_{i=1}^N \frac{\omega_t^i}{\Omega_t}\,\frac{q(\xi_t^i; x_{t+1})\,g_{t+1}(\xi_t^i, x_{t+1})}{\int_{\mathsf{X}} q(\xi_t^i; x')\,g_{t+1}(\xi_t^i, x')\,\mathrm{d}x'}\,\mathrm{d}x_{t+1}.$$

Furthermore, assume that the normalising constant $\int_{\mathsf{X}} q(x; x')\,g_t(x, x')\,\mathrm{d}x' = c$ does not depend on $x$. Under this assumption, it is possible to simplify the above expression as

$$\mathbb{E}[(\dagger)] = \sum_{j=1}^N \int_{\mathsf{X}} \frac{\log(\nu_t^j)\,\omega_t^j\,q(\xi_t^j; x_{t+1})\,g_{t+1}(\xi_t^j, x_{t+1})}{\Omega_t\,c}\,\frac{\sum_{i=1}^N \omega_t^i\,q(\xi_t^i; x_{t+1})\,g_{t+1}(\xi_t^i, x_{t+1})}{\sum_{\ell=1}^N \omega_t^\ell\,q(\xi_t^\ell; x_{t+1})\,g_{t+1}(\xi_t^\ell, x_{t+1})}\,\mathrm{d}x_{t+1}$$
$$= \sum_{j=1}^N \log(\nu_t^j)\,\frac{\omega_t^j}{\Omega_t}$$
$$= \sum_{j=1}^N \log\!\left(\frac{\omega_t^j}{\Omega_t}\right)\frac{\omega_t^j}{\Omega_t} - \left[\sum_{j=1}^N \log\!\left(\frac{\omega_t^j}{\Omega_t}\right)\frac{\omega_t^j}{\Omega_t} - \sum_{j=1}^N \log(\nu_t^j)\,\frac{\omega_t^j}{\Omega_t}\right]$$
$$= \sum_{j=1}^N \log\!\left(\frac{\omega_t^j}{\Omega_t}\right)\frac{\omega_t^j}{\Omega_t} - D\!\left(\left\{\frac{\omega_t^j}{\Omega_t}\right\}_{j=1}^N \,\Big\|\, \{\nu_t^j\}_{j=1}^N\right).$$
Since the Kullback-Leibler Divergence is minimised when the distributions are equal almost surely, choosing $\{\nu_t^j\}_{j=1}^N$ as the categorical distribution with $\nu_t^j = \omega_t^j/\Omega_t$ maximises $\mathbb{E}[(\dagger)]$. In turn, this minimises the expected Kullback-Leibler Divergence for the indices, and it is thus in that sense optimal. However, it should be noted that even though the normalisation constant of the proposal density does not depend on $\xi_t^i$, there is no guarantee that the normalisation constant of the product density is independent of $\xi_t^i$.

For an HMM where the proposal density is chosen as the prior density, all the assumptions made in the derivation above hold, and $\mathsf{Pr}(\{\omega_t^i\}_{i=1}^N)$ is the optimal instrumental density to sample index candidates from.

Chapter 4

Evaluation of Models

To evaluate how well the model fits the observed data, it is essential to define what is meant by a good model. Given a set of candidate models, which of them provides the best fit? To determine what a good model is, a few different measures will be used. Most types of model evaluations are comparative between models, that is, they cannot say whether model i is good, just that it is probably better than model j.

4.1 Information criteria

In information theory, there are several information criteria that can be used to compare how well different models fit data. These include the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the bootstrap information criterion (EIC), amongst others. All of these make different trade-offs between small bias and high variance.

4.1.1 Akaike information criterion

Suppose that there is a "true" model $f$ from which the data comes. The idea is that if there is a set of candidate models $\{g_r\}_{r=1}^R$, the best model is the one that minimises the Kullback-Leibler Divergence $D(f\,\|\,g_r)$. Based on this notion, Akaike developed "an information criterion" (AIC), which is given by [23]

$$\mathrm{AIC} = -2\log L(y_{1:t}; \hat\theta) + 2k, \qquad (4.1)$$
where $k$ is the number of parameters and $\hat\theta$ is the maximum likelihood estimator. A small value of AIC is desired. AIC can be used whenever $\hat\theta$ is the maximum likelihood estimate [23].

The interpretation of AIC is simple. The bias of the model is represented by $-2\log L(y_{1:t}; \hat\theta)$: the larger the log-likelihood is, the better the fit of the model should be. However, the log-likelihood can be made arbitrarily large by just adding more parameters. If too many parameters are used, the

model will suffer from overfitting. To discourage this, a penalty is added for the number of parameters in the model, and it is therefore desirable to minimise the AIC.

The bias correction $2k$ is actually a first-order approximation, valid when the number of samples is large. If the number of samples is not much greater than the number of parameters, it is often necessary to adjust for the finite sample size. This gives the corrected AIC, denoted AICc. For the case where the model is univariate, linear, and has Gaussian residuals, it is given by $\mathrm{AICc} = \mathrm{AIC} + 2k(k+1)/(n-k-1)$ [5], where $n$ is the number of samples. The model considered here does not fulfil these criteria, and this correction will not be used. This is in line with other literature on model selection for stochastic volatility, which also uses AIC, and not AICc [26].

4.1.2 Bayesian information criterion

The Bayesian information criterion (BIC), also known as the Schwarz information criterion, uses another trade-off between the bias and the variance. Using the same notation as for AIC, along with $n$ representing the sample size, BIC can be defined as
$$\mathrm{BIC} = -2\log L(y_{1:t}; \hat\theta) + k\log n. \qquad (4.2)$$

When BIC was first derived, it was assumed that the likelihood belonged to the exponential family [8]. Since inference is done on only the observation process and not on the complete data, VHMMs do not belong to the exponential family. Hence, it would seem that BIC could not be used to compare the models. However, Cavanaugh and Neath [8] showed that under certain regularity conditions, such as continuous first- and second-order derivatives, the approximation is still valid.

4.1.3 Bootstrap information criterion

The aforementioned AIC and BIC both try to make a trade-off between the likelihood and the number of parameters. They use different methods to derive the bias of the estimated maximum likelihood compared to the true value, and both rely on asymptotic theory to derive these quantities. The bootstrap information criterion (EIC) instead tries to estimate this bias by generating bootstrap samples. It is given by [23]
$$\mathrm{EIC} = -2\log L(y_{1:t}; \hat\theta) + b_B(g_i), \qquad (4.3)$$
where $b_B(g_i)$ is a bootstrap estimate of the bias of the log-likelihood estimator for the model $g_i$. This would probably be the best information criterion to use in this context, as there is no guarantee that the assumptions for AIC or BIC are fulfilled. However, due to the extensive simulation time required to compute one maximum likelihood estimate, it would be very time-consuming to implement, and hence it was not within the scope of this thesis. The reader interested in a more thorough description of how to calculate the bias is referred to [23, sec. 8.2].

4.1.4 Posterior probabilities of model candidates

"Of course, models not in the set remain out of consideration. AIC is useful in selecting the best model in the set; however, if all the models are very poor, AIC will still select the one estimated to be best, but even that relatively best model might be poor in an absolute sense. Thus, every effort must be made to ensure that the set of models is well founded." [5]

AIC and BIC are both on a relative scale. The actual values do not matter; it is rather the differences between the values corresponding to the different models that are of interest.¹ Let $\Delta\mathrm{IC}_i \triangleq \mathrm{IC}_i - \mathrm{IC}_{\min}$ be the difference between the $i$th model and the best model. The smaller $\Delta\mathrm{IC}_i$ is, the more plausible model $i$ is as the best model (out of the ones considered). According to Burnham and Anderson [5], if $\Delta\mathrm{AIC}_i > 10$, then model $i$ has essentially no support and can be omitted from consideration.

A posterior probability of a model can be computed based on BIC. Let $\{g_r\}_{r=1}^R$ be a set of potential models, and let $\pi_i$ denote the prior probability of model $g_i$. The Bayesian posterior model probability for model $g_i$ is given by [5]
$$\mathbb{P}(g_i \mid y) = \frac{\pi_i \exp\!\left(-\frac{\Delta\mathrm{BIC}_i}{2}\right)}{\sum_{r=1}^R \pi_r \exp\!\left(-\frac{\Delta\mathrm{BIC}_r}{2}\right)}. \qquad (4.4)$$
Two special cases of (4.4) are of particular interest. Firstly, the obvious choice is a uniform prior over the models; for this choice, the posterior probability will be denoted by $p_{\mathrm{BIC}}^i$. Secondly, Burnham and Anderson [5] propose the prior $\pi_i \propto \exp\!\left(\frac{1}{2}k_i\log n - k_i\right)$. This choice yields the posterior probability
$$\mathbb{P}(g_i \mid y) = \frac{\exp\!\left(-\frac{\Delta\mathrm{AIC}_i}{2}\right)}{\sum_{r=1}^R \exp\!\left(-\frac{\Delta\mathrm{AIC}_r}{2}\right)} \triangleq p_{\mathrm{AIC}}^i, \qquad (4.5)$$
which they call the Akaike weights. To use any information criterion, the likelihood has to be computed for the same data. For instance, consider the case where one model is calibrated to the observations $\{y_t\}$, and another is calibrated to the observations $\{\log y_t\}$ or $\{(x_t, y_t)\}$. In this case, the likelihoods will be vastly different, and comparing, e.g., AIC does not make sense.

1IC denotes either AIC or BIC.
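The quantities above are simple to compute once the log-likelihoods are available. Below is a minimal sketch in R (the function name is mine); `loglik` is a vector of estimated observed-data log-likelihoods, `k` the corresponding parameter counts, and `n` the sample size.

```r
# Delta-IC values and posterior model probabilities (4.4)-(4.5),
# using a uniform model prior for the BIC-based probabilities.
ic_table <- function(loglik, k, n) {
  aic <- -2 * loglik + 2 * k          # (4.1)
  bic <- -2 * loglik + k * log(n)     # (4.2)
  d_aic <- aic - min(aic)
  d_bic <- bic - min(bic)
  data.frame(
    dAIC = d_aic,
    dBIC = d_bic,
    pAIC = exp(-d_aic / 2) / sum(exp(-d_aic / 2)),  # Akaike weights (4.5)
    pBIC = exp(-d_bic / 2) / sum(exp(-d_bic / 2))   # (4.4), uniform prior
  )
}
```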

Chapter 5

Stochastic Volatility Models and Implementation

When considering stochastic volatility, a popular model is given by [19]
$$\begin{cases} X_t = \phi X_{t-1} + \sigma_\zeta \zeta_t, \\ Y_t = \mu + \exp\!\left(\dfrac{X_t}{2}\right) D V_t, \end{cases} \qquad (5.1)$$

where $Y_t \triangleq \log(S_t/S_{t-1}) \in \mathbb{R}^d$, $t \in \mathbb{N}^*$, is the log-returns of some collection of stocks.¹ $\{\zeta_t\}_{t\in\mathbb{N}}$ and $\{V_t\}_{t\in\mathbb{N}^*}$ are two normally distributed sequences with unit variance, and $D$ is a diagonal $d\times d$ matrix. The state process $\{X_t\}$ is the log-volatility. For $\phi \in (-1, 1)$ it is weakly stationary, and if $\phi > 0$ this model exhibits volatility clustering, which is a desired property.

If $\{\zeta_t\}_{t\in\mathbb{N}}$ and $\{V_t\}_{t\in\mathbb{N}^*}$ are uncorrelated, this is a hidden Markov model. However, since the aim is to include a (presumably negative) correlation between the volatility and the returns, this will not be the case. Therefore, assume that $\rho(\zeta_t, V_s^i) = \delta_{ts}\rho_i$. It is then possible to write $V_t^i = \rho_i\zeta_t + \sqrt{1-\rho_i^2}\,\tilde W_t^i$, where $\{\tilde W_t\}$ and $\{\zeta_t\}$ are uncorrelated. Inserting this into (5.1) yields
$$\begin{cases} X_t = \phi X_{t-1} + \sigma_\zeta \zeta_t, \\ Y_t = \mu + \exp\!\left(\dfrac{X_t}{2}\right) D\left(\rho\zeta_t + \operatorname{diag}\!\left(\sqrt{1-\rho^2}\right)\tilde W_t\right). \end{cases} \qquad (5.2)$$

It is tempting to stop here and use this parametrisation to compute the relevant quantities. However, this formulation is not easy to use when implementing the Expectation-Maximisation algorithm: the first-order conditions in the M-step become very unwieldy. Instead, let $\beta_\zeta \triangleq D\rho$, $W_t \sim \mathcal{N}(0, I)$, and $\Sigma \triangleq AA^{\mathsf{T}}$, where $A$

1If S is vector-valued, the division is done component-wise.

is a lower triangular matrix. Insertion into (5.1) gives
$$\begin{cases} X_t = \phi X_{t-1} + \sigma_\zeta \zeta_t, \\ Y_t = \mu + \exp\!\left(\dfrac{X_t}{2}\right)\left(\beta_\zeta \zeta_t + A W_t\right). \end{cases} \qquad (5.3)$$

As $\zeta_t = (X_t - \phi X_{t-1})/\sigma$, this is a VHMM. For the derivations used later, this choice of parametrisation is advantageous. The covariance of the stocks and the correlation between the returns and the volatility are easily obtained as

$$\mathbb{V}[D V_t] = \mathbb{E}\!\left[(\beta_\zeta\zeta_t + A W_t)(\beta_\zeta\zeta_t + A W_t)^{\mathsf{T}}\right] = \beta_\zeta\beta_\zeta^{\mathsf{T}} + \Sigma,$$
$$\rho(\zeta_t, V_t^i) = \frac{\mathbb{E}[\zeta_t(\beta_\zeta^i\zeta_t + \tilde W_t^i)]}{\sqrt{\mathbb{V}[\zeta_t]\,\mathbb{V}[\beta_\zeta^i\zeta_t + \tilde W_t^i]}} = \frac{\beta_\zeta^i}{\sqrt{(\beta_\zeta^i)^2 + \Sigma_{ii}}}. \qquad (5.4)$$
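For concreteness, a minimal R sketch of simulating from (5.3) is given below (the function name is mine, and `A` is assumed to be the lower-triangular Cholesky factor of $\Sigma$).

```r
# Simulate t_len steps of the stochastic volatility VHMM (5.3).
simulate_vhmm <- function(t_len, phi, sigma, mu, beta, A, x0 = 0) {
  d <- length(mu)
  X <- numeric(t_len)
  Y <- matrix(0, t_len, d)
  x <- x0
  for (t in 1:t_len) {
    zeta <- rnorm(1)
    x <- phi * x + sigma * zeta                          # state recursion
    Y[t, ] <- mu + exp(x / 2) * (beta * zeta + A %*% rnorm(d))
    X[t] <- x
  }
  list(X = X, Y = Y)
}
```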

To implement the accept-reject algorithm, note that, for the product density $q(x; x')\,g_t(x, x')$, it holds that
$$q(x; x')\,g_t(x, x') \leq c(x') = \frac{1}{\sqrt{2\pi\sigma^2}}\,\frac{1}{\sqrt{(2\pi)^d\|\Sigma\|}}\exp\!\left(-\frac{d\,x'}{2}\right). \qquad (5.5)$$

5.1 Choice of proposal density

The extra degree of freedom in the emission density makes the choice of proposal density much more important. Suppose that the prior density does not suffice as proposal density, that the analytical solution to (3.27) is intractable, and that finding the solution numerically is not feasible. Furthermore, Subsection 3.4.3 suggests that the proposal density should be approximately proportional to the product density of the transition density and the emission density. Cappé et al. [7] suggest methods for selecting $\vartheta$ depending on which parametric family $r_{t;\vartheta}$ belongs to. Here, however, a more heuristic approach will be considered, where the aim is to (approximately) bound the supremum of the weights

$$\sup_{x'\in\mathsf{X}} \frac{q(x; x')\,g_t(x, x')}{r_{t;\vartheta}(x, x')}. \qquad (5.6)$$

This will be done by trying to construct rt;ϑ(x, ·) such that it has the same shape (in the second argument) as q(x, ·)gt(x, ·). This should produce roughly uniform weights, which in turn should increase the performance compared to using the prior density for new candidates.

Firstly, the parametric family needs to be chosen. Since $q$ and $g_t$ are (conditional) normal densities, the product of the densities should exhibit a shape similar to a normal density. Therefore, it is attractive to let the proposal density belong to the normal family. Secondly, $\vartheta$ needs to be determined. The transition and emission densities are given by
$$q(\xi_{t-1}; \xi_t) \propto \exp\!\left(-\frac{(\xi_t - \phi\xi_{t-1})^2}{2\sigma^2}\right),$$
$$g_t(\xi_{t-1}, \xi_t) \propto \exp\!\left(-\frac{1}{2}\left(y_t - \tilde\mu(\xi_{t-1}, \xi_t)\right)^{\mathsf{T}} S^{-1}\left(y_t - \tilde\mu(\xi_{t-1}, \xi_t)\right)\right),$$
where
$$\tilde\mu(\xi_{t-1}, \xi_t) = \mu + a\,\frac{\xi_t - \phi\xi_{t-1}}{\sigma}, \qquad S = e^{\xi_t}\Sigma, \qquad a = \beta_\zeta\exp\!\left(\frac{\xi_t}{2}\right).$$

Since the aim is to bound (5.6), the shape of $q(x; \cdot)\,g_t(x, \cdot)$ in the second argument is needed. For this, the logarithm of the product is considered. Any additive terms not containing $\xi_t$ are just a part of the bounding constant and will be disregarded.
$$\log\!\big(q(\xi_{t-1}; \xi_t)\,g_t(\xi_{t-1}, \xi_t)\big) \overset{c.}{=} -\frac{1}{2\sigma^2}\Big[(\xi_t - \phi\xi_{t-1})^2 + \big((y_t - \mu)\sigma - a(\xi_t - \phi\xi_{t-1})\big)^{\mathsf{T}} S^{-1}\big((y_t - \mu)\sigma - a(\xi_t - \phi\xi_{t-1})\big)\Big]$$
$$\overset{c.}{=} -\frac{1}{2\sigma^2}\Big[\xi_t^2 - 2\phi\xi_{t-1}\xi_t + a^{\mathsf{T}}S^{-1}a\,\xi_t^2 - 2\big\{(y_t - \mu)^{\mathsf{T}}S^{-1}a\,\sigma + \phi\xi_{t-1}\,a^{\mathsf{T}}S^{-1}a\big\}\xi_t\Big]$$
$$= -\frac{1 + a^{\mathsf{T}}S^{-1}a}{2\sigma^2}\left[\xi_t^2 - 2\left(\phi\xi_{t-1} + \frac{(y_t - \mu)^{\mathsf{T}}S^{-1}a}{1 + a^{\mathsf{T}}S^{-1}a}\,\sigma\right)\xi_t\right]$$
$$\overset{c.}{=} -\frac{1 + a^{\mathsf{T}}S^{-1}a}{2\sigma^2}\left(\xi_t - \phi\xi_{t-1} - \frac{(y_t - \mu)^{\mathsf{T}}S^{-1}a}{1 + a^{\mathsf{T}}S^{-1}a}\,\sigma\right)^2$$
$$= -\frac{1 + \beta_\zeta^{\mathsf{T}}\Sigma^{-1}\beta_\zeta}{2\sigma^2}\left(\xi_t - \phi\xi_{t-1} - \frac{(y_t - \mu)^{\mathsf{T}}\Sigma^{-1}\beta_\zeta}{1 + \beta_\zeta^{\mathsf{T}}\Sigma^{-1}\beta_\zeta}\,\sigma e^{-\xi_t/2}\right)^2.$$
Assuming that the exponential factor can be approximated well enough by its zeroth-order approximation, $e^{-\xi_t/2} = 1 + O(\xi_t)$, the mean and variance of $r_{t;\vartheta}$ become
$$m_t = \phi\xi_{t-1} + \frac{(y_t - \mu)^{\mathsf{T}}\Sigma^{-1}\beta_\zeta}{1 + \beta_\zeta^{\mathsf{T}}\Sigma^{-1}\beta_\zeta}\,\sigma, \qquad s_t^2 = \frac{\sigma^2}{1 + \beta_\zeta^{\mathsf{T}}\Sigma^{-1}\beta_\zeta}. \qquad (5.7)$$
Worth noting is that the prior kernel is recovered if the model is a hidden Markov model, i.e. $\beta_\zeta \equiv 0$.
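A minimal R sketch of drawing new particles from this proportional kernel (5.7); the function name is mine, `xi_prev` is the vector of previous particles, and `y` the current observation.

```r
# Propose new particles from the normal proposal with mean and
# variance given by (5.7).
propose <- function(xi_prev, y, phi, sigma, mu, beta, Sigma) {
  SinvB <- solve(Sigma, beta)            # Sigma^{-1} beta_zeta
  denom <- 1 + sum(beta * SinvB)         # 1 + beta' Sigma^{-1} beta
  m <- phi * xi_prev + sigma * sum((y - mu) * SinvB) / denom
  rnorm(length(xi_prev), mean = m, sd = sigma / sqrt(denom))
}
```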

5.2 The intermediate quantity

To implement the Expectation-Maximisation algorithm, an expression for the intermediate quantity needs to be computed. An extensive derivation of these quantities can be seen in Appendix B. The intermediate quantity, up to an additive constant not depending on the

parameters, is given by

$$Q_{\theta_\ell}(\theta) \overset{c.}{=} -\frac{1}{2}\Bigg[t\log\sigma^2 + t\log\|\Sigma\| + \frac{1}{\sigma^2}\left(z_1 - 2\phi z_2 + \phi^2 z_3\right) + \sum_{i,j}\Sigma^{-1}_{ij}\Big(\Delta^{(i,j)} - E^j\mu^i - E^i\mu^j + \mu^i\mu^j V - \beta_\zeta^j(A^i - \mu^i S) - \beta_\zeta^i(A^j - \mu^j S) + \beta_\zeta^i\beta_\zeta^j Z\Big)\Bigg],$$
where
$$\begin{aligned}
z_1 &\triangleq \mathbb{E}_{\theta_\ell}\!\left[\textstyle\sum_{k=1}^t X_k^2 \,\middle|\, Y_{1:t}\right], &
z_2 &\triangleq \mathbb{E}_{\theta_\ell}\!\left[\textstyle\sum_{k=1}^t X_{k-1}X_k \,\middle|\, Y_{1:t}\right], \\
z_3 &\triangleq \mathbb{E}_{\theta_\ell}\!\left[\textstyle\sum_{k=1}^t X_{k-1}^2 \,\middle|\, Y_{1:t}\right], &
\Delta^{(i,j)} &\triangleq \mathbb{E}_{\theta_\ell}\!\left[\textstyle\sum_{k=1}^t Y_k^i Y_k^j e^{-X_k} \,\middle|\, Y_{1:t}\right], \\
E^i &\triangleq \mathbb{E}_{\theta_\ell}\!\left[\textstyle\sum_{k=1}^t Y_k^i e^{-X_k} \,\middle|\, Y_{1:t}\right], &
S &\triangleq \mathbb{E}_{\theta_\ell}\!\left[\textstyle\sum_{k=1}^t e^{-X_k/2}\zeta_k \,\middle|\, Y_{1:t}\right] = \tfrac{1}{\sigma}\left(z_4 - \phi z_5\right), \\
z_4 &\triangleq \mathbb{E}_{\theta_\ell}\!\left[\textstyle\sum_{k=1}^t X_k e^{-X_k/2} \,\middle|\, Y_{1:t}\right], &
z_5 &\triangleq \mathbb{E}_{\theta_\ell}\!\left[\textstyle\sum_{k=1}^t X_{k-1} e^{-X_k/2} \,\middle|\, Y_{1:t}\right], \\
V &\triangleq \mathbb{E}_{\theta_\ell}\!\left[\textstyle\sum_{k=1}^t e^{-X_k} \,\middle|\, Y_{1:t}\right], &
Z &\triangleq \mathbb{E}_{\theta_\ell}\!\left[\textstyle\sum_{k=1}^t \zeta_k^2 \,\middle|\, Y_{1:t}\right] = \tfrac{1}{\sigma^2}\left(z_1 - 2\phi z_2 + \phi^2 z_3\right), \\
A^i &\triangleq \mathbb{E}_{\theta_\ell}\!\left[\textstyle\sum_{k=1}^t Y_k^i \zeta_k e^{-X_k/2} \,\middle|\, Y_{1:t}\right] = \tfrac{1}{\sigma}\left(a_1^{(i)} - \phi a_2^{(i)}\right), &
a_1^{(i)} &\triangleq \mathbb{E}_{\theta_\ell}\!\left[\textstyle\sum_{k=1}^t Y_k^i X_k e^{-X_k/2} \,\middle|\, Y_{1:t}\right], \\
a_2^{(i)} &\triangleq \mathbb{E}_{\theta_\ell}\!\left[\textstyle\sum_{k=1}^t Y_k^i X_{k-1} e^{-X_k/2} \,\middle|\, Y_{1:t}\right]. & & \qquad (5.8)
\end{aligned}$$
Hence, (5.8) specifies the smoothed statistics that need to be computed with, for instance, the PaRIS algorithm in order to implement the EM algorithm. Furthermore, explicit updating formulae for $(\mu, \beta_\zeta, \Sigma)$ are given by
$$\hat\Sigma = K, \qquad \hat\beta_\zeta = \left(Z - \frac{S^2}{V}\right)^{-1}\left(A - \frac{S}{V}E\right), \qquad \hat\mu = \frac{1}{V}\left(E - \hat\beta_\zeta S\right) = \frac{1}{V}\left\{E - S\left(Z - \frac{S^2}{V}\right)^{-1}\left(A - \frac{S}{V}E\right)\right\}. \qquad (5.9)$$

The maximisation of $Q_{\theta_\ell}(\theta)$ is not feasible in closed form for the parameters $(\phi, \sigma)$, and hence this will be done by the gradient-ascent algorithm described in Subsection 2.2.3. After insertion of the new parameters from (5.9) into the intermediate quantity, $Q_{\theta_\ell}(\theta)$ is given by
$$Q_{\theta_\ell}(\theta) \overset{c.}{=} -\frac{1}{2}\left[t\log\sigma^2 + \frac{1}{\sigma^2}\left(z_1 - 2\phi z_2 + \phi^2 z_3\right) + t\log\|K\|\right], \quad \text{where}$$
$$K_{ij} = \frac{1}{t}\left\{\Delta^{(i,j)} - \frac{1}{V}E^iE^j - \left(Z - \frac{S^2}{V}\right)^{-1}\left(A^j - \frac{S}{V}E^j\right)\left(A^i - \frac{S}{V}E^i\right)\right\}.$$

[Figure 5.1: time series plot of implied volatility (roughly 10–60) against date, 1990–2015.]

Figure 5.1: Monthly historical data for the CBOE Volatility Index.

To compute the derivatives, the following matrix identities are useful [38].

$$\frac{\partial}{\partial x}\log\|K\| = \operatorname{Tr}\!\left(K^{-1}\frac{\partial K}{\partial x}\right),$$
$$\frac{\partial^2}{\partial x\,\partial y}\log\|K\| = \operatorname{Tr}\!\left(K^{-1}\frac{\partial^2 K}{\partial x\,\partial y}\right) - \operatorname{Tr}\!\left(K^{-1}\frac{\partial K}{\partial x}\,K^{-1}\frac{\partial K}{\partial y}\right).$$
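As a quick numerical sanity check (my own illustration, not from the thesis), the first identity can be verified in R for a matrix depending linearly on a scalar:

```r
# Check d/dx log det K(x) = Tr(K^{-1} dK/dx) at x = 0
# for K(x) = K0 + x * K1.
K0 <- diag(3) * 2
K1 <- matrix(0.1, 3, 3)
f  <- function(x) determinant(K0 + x * K1, logarithm = TRUE)$modulus
h  <- 1e-6
numeric_deriv  <- as.numeric((f(h) - f(-h)) / (2 * h))  # central difference
analytic_deriv <- sum(diag(solve(K0) %*% K1))           # Tr(K0^{-1} K1)
all.equal(numeric_deriv, analytic_deriv, tolerance = 1e-6)
```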

Of course, the determinant could be computed before doing the maximisation. However, this approach is more advantageous, as it is easier to obtain general expressions for the derivatives that, for instance, are independent of the dimension of $Y_t$. The expressions for these derivatives are not very visually appealing and will not be included here.

5.3 Volatility index model

To obtain a benchmark for the VHMM, a heuristic model will be considered. An assumption in VHMMs is that the state process is unobservable. If there somehow is a way to estimate the state process from other data, this could be used as a proxy for the real volatility. The Chicago Board Options Exchange disseminates the CBOE Volatility Index, more commonly known by its acronym VIX. It is a measure of the implied volatility of S&P 500 index options. Implied volatility is the theoretical value of the volatility which would give rise to the observed price in the market. A time series of the historical VIX data can be seen in Figure 5.1, and it contains the expected volatility clusters. To be able to use VIX as an approximation, it should be well approximated by the same dynamics as the state process. Since $\{X_t\}$ models log-volatility, first take the logarithm of the VIX index. The state process is modelled as an autoregressive process of order one, and such processes can be easily identified by inspecting the autocorrelation functions of the time series

[Figure 5.2: (a) autocorrelation function and (b) partial autocorrelation function of the zero-mean log-VIX series, for lags 0–20.]

Figure 5.2: ACF and PACF for the zero-mean log-VIX time series. The red bars indicate significant (partial) autocorrelation at confidence level 95%.

in question. These can be seen in Figure 5.2. There is only one sharp peak in the partial autocorrelation function, at lag one. This, along with the slow decay of the autocorrelation function, suggests that the process is well approximated by an autoregressive process of order one, just like the state process.

There has to be some caution before using VIX as an approximation. Firstly, there is no guarantee that the implied volatility will be a sufficiently good approximation of the instantaneous volatility. Secondly, VIX uses options on the S&P 500 index to determine the implied volatility. These are American options, and how they translate to other markets (e.g. the Swedish one) is not clear. Lastly, in (5.3), the volatility is of the form $e^{X_t/2}$. The factor $\frac{1}{2}$ in the exponent is present purely for aesthetic reasons and could be omitted. In an HMM or a VHMM, it would only alter the parameters $(\phi, \sigma^2)$. However, when converting from implied volatility, the same parameters would be estimated from the historical VIX data, and thus the choice of constant in the exponent is no longer arbitrary. As VIX measures implied volatility, not implied variance, it should hold that $e^{X_t/2} \approx \mathrm{VIX}_t$. Hence, the log-volatility could be modelled as
$$\begin{cases} \tilde X_t = \phi\tilde X_{t-1} + \sigma\zeta_t, \\ \tilde X_t \triangleq 2\left(\log\mathrm{VIX}_t - \vartheta\right), \end{cases} \qquad (5.10)$$

where $\vartheta \triangleq \mathbb{E}[\log\mathrm{VIX}_t]$.² To calibrate the model, maximum likelihood estimation can be carried out directly, as the complete data likelihood is known for this model. The sufficient statistics needed are the same as in (5.8), except that $\{X_t\}$ is known and all expectations are removed (e.g. $z_1$ is replaced by $\tilde z_1 \triangleq \sum_{k=1}^t X_k^2$, etc.).
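A minimal R sketch of this calibration, assuming a numeric vector `vix` of monthly index levels; for brevity, the AR(1) maximum likelihood fit is delegated to `arima` rather than computed from the sufficient statistics as in the thesis.

```r
# Transform to the zero-mean log-volatility proxy of (5.10) and fit
# a zero-mean AR(1) by maximum likelihood.
x_tilde <- 2 * (log(vix) - mean(log(vix)))   # theta-hat = mean(log VIX_t)
fit <- arima(x_tilde, order = c(1, 0, 0), include.mean = FALSE)
phi_hat   <- unname(fit$coef["ar1"])
sigma_hat <- sqrt(fit$sigma2)
```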

5.4 Model evaluation

To compare the models using the criteria discussed in Chapter 4, some aspects of the models need to be considered before they can be applied. In light of the last paragraph in Subsection 4.1.4, the VIX model cannot be compared directly to the HMM or VHMM, as

²The presence of $\vartheta$ is to make $\{\tilde X_t\}$ a zero-mean process, so that the model aligns with the HMM and the VHMM.

it does not use the same data set. However, the parameters could be calibrated using VIX, which would produce a benchmark estimate of the MLE. The likelihood can afterwards be computed using Algorithm 3.1. Since this likelihood is a function of the same dataset, AIC (or BIC) can be compared between the models.

5.5 A few notes on implementation

The algorithms were implemented in R [39]. All plots have been made using the ggplot2 package [44].

5.5.1 Starting guesses for the EM algorithm

To initialise the EM algorithm, a starting guess $\theta_0$ has to be made. Assume that $\{X_t\}$ is in steady state and that the white-noise sequences are independent.³ This implies that $X_t \sim \mathcal{N}(0, \sigma^2/(1-\phi^2))$, and the expectation and variance of $Y_t$ are given by
$$\mathbb{E}[Y_t] = \mathbb{E}\!\left[\mu + \tilde W_t e^{X_t/2}\right] = \mu,$$
$$\mathbb{V}[Y_t] = \mathbb{V}\!\left[\tilde W_t e^{X_t/2}\right] = \mathbb{V}[\tilde W_t]\,\mathbb{E}\!\left[e^{X_t}\right] = \Gamma\exp\!\left(\frac{\sigma^2}{2(1-\phi^2)}\right),$$
where $\Gamma \triangleq \mathbb{V}[\tilde W_t] = \beta_\zeta\beta_\zeta^{\mathsf{T}} + \Sigma$. Hence, starting guesses for $(\mu, \beta_\zeta, \Sigma)$ can be constructed from empirical estimates of the expectation and variance of $\{Y_t\}$. The only parameter that needs to be chosen is the initial guess for the correlation $\rho$. After choosing $\rho$, the other parameters can be calculated as
$$\beta_\zeta^i = \sqrt{\Gamma_{ii}}\,\rho_i, \qquad \Sigma = \Gamma - \beta_\zeta\beta_\zeta^{\mathsf{T}}.$$

For the parameters of the state process, there is no a priori information in the data {Yt} about good starting values, and they can be chosen arbitrarily. However, as the model tries to capture volatility clusters, it makes sense to choose φ ∈ (0.5, 1). They could also be initialised with the estimates from the VIX model, which should be “in the same ballpark” as the true parameters.
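A minimal R sketch of this initialisation (the function name is mine), assuming a matrix `Y` of monthly log-returns (rows = months) and chosen values for $\phi$, $\sigma$, and $\rho$:

```r
# Starting guesses for (mu, beta_zeta, Sigma) from the steady-state
# moment relations above; phi, sigma, and rho are chosen by the user.
init_guess <- function(Y, phi = 0.7, sigma = 0.1, rho = -0.1) {
  mu    <- colMeans(Y)                                 # E[Y_t] = mu
  Gamma <- cov(Y) / exp(sigma^2 / (2 * (1 - phi^2)))   # invert V[Y_t] relation
  beta  <- sqrt(diag(Gamma)) * rho                     # beta^i = sqrt(Gamma_ii) rho_i
  Sigma <- Gamma - tcrossprod(beta)                    # Sigma = Gamma - beta beta'
  list(phi = phi, sigma = sigma, mu = mu, beta = beta, Sigma = Sigma)
}
```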

5.5.2 Accept-reject algorithm

When implementing the accept-reject algorithm described in Algorithm A.1, there is a potential problem in that the acceptance rate can be very small. For hidden Markov models, the accept criterion is $U \leq q(x; x')/c$. If the new value is very rare, this acceptance probability might be low for almost all old particles. Hence, it should be a good strategy to compute the normalising constant exactly for these particles, as the expected number of iterations until acceptance would otherwise be high.

3This assumption is incorrect. However, as the aim is to obtain good starting guesses, it should be superior compared to arbitrarily guessing the starting parameters.

One way to carry out this idea is to impose a maximum number of trials, after which the indices are drawn from the "exact" categorical distribution. In PaRIS, this means drawing all the remaining indices needed for a certain particle $i$. For instance, if 1 of $\tilde N$ indices has been accepted for particle $i$ and the stopping criterion is met, then the remaining $\tilde N - 1$ indices should be drawn at the same time, as the normalisation constant has already been computed for particle $i$.

How to choose the maximum number of trials is very model dependent. Olsson and Westerborn [36] suggest $\sqrt{N}$. There has been some work on adaptive stopping criteria [41], which tries to project when it is more time efficient to simulate from the exact multinomial distribution. However, they also find that choosing the maximum number of trials as $N/\delta$, for some $\delta \geq 1$, is usually a good stopping criterion for the models they consider, where $\delta$ can be found by trial and error. For VHMMs, a stopping criterion is perhaps even more necessary. Since the accept criterion is $U \leq q(x; x')\,g_t(x, x')/c(x')$, it might be that for a new particle $x'$, $q$ is small whenever $g$ is large, and vice versa. Consequently, the average number of trials might be much larger compared to a regular hidden Markov model.

5.5.3 Working with logarithms

When implementing the PaRIS algorithm, there are many ratios between small numbers that need to be computed. Many of these fall outside the range of numerical precision, which leads to some of them evaluating to zero even though they are positive. This is especially true for VHMMs, but it also happens when just considering hidden Markov models. In Algorithm 3.2, most of the quantities are self-normalised. Hence, it is usually more numerically stable to compute the logarithms, and then normalise before computing the weights. For instance, if the aim is to sample from $\mathsf{Pr}(\{\omega_t^\ell\,q(\xi_t^\ell, \xi_{t+1}^i)\,g_{t+1}(\xi_t^\ell, \xi_{t+1}^i)\}_{\ell=1}^N)$, the following approach is more stable against numerical precision errors (see the sketch after the footnote).

i. For all $\ell \in \llbracket 1, N\rrbracket$, compute $\log\omega_t^\ell$, $\log q(\xi_t^\ell, \xi_{t+1}^i)$, and $\log g_{t+1}(\xi_t^\ell, \xi_{t+1}^i)$.

ii. Let $\log v_t^\ell \leftarrow \log\omega_t^\ell + \log q(\xi_t^\ell, \xi_{t+1}^i) + \log g_{t+1}(\xi_t^\ell, \xi_{t+1}^i)$.

iii. Compute the normalised⁴ $\log\tilde v_t^\ell \leftarrow \log v_t^\ell - \max_{j\in\llbracket 1,N\rrbracket}\log v_t^j$.

iv. Compute $\tilde v_t^\ell$, which can be used to draw from the desired distribution.

In comparison, if $\omega_t^\ell\,q(\xi_t^\ell, \xi_{t+1}^i)\,g_{t+1}(\xi_t^\ell, \xi_{t+1}^i)$ were to be computed directly, many weights would be zero to numerical precision, resulting in very few positive terms to sample from. In fact, there was almost always a time index where all weights evaluated to zero. Important: if the likelihood of the smoothing distribution is of interest, the estimator (2.13) is not self-normalised, and special attention needs to be given to this estimator if this approach is used in Algorithm 3.1.

⁴In the sense that $\max_{\ell\in\llbracket 1,N\rrbracket}\log\tilde v_t^\ell = 0$.
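A minimal R sketch of steps i.–iv. (the function name is mine): given length-$N$ vectors of log-weights and log-densities, draw one backward index in the log domain.

```r
# Draw an index from Pr({w^l q_l g_l}) computed via logarithms,
# subtracting the maximum before exponentiating (steps ii.-iv.).
draw_backward_index <- function(log_w, log_q, log_g) {
  log_v <- log_w + log_q + log_g               # step ii.
  v <- exp(log_v - max(log_v))                 # step iii.
  sample.int(length(v), size = 1, prob = v)    # sample() renormalises v
}
```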

Chapter 6

Simulations and Results

In this chapter, the results will be presented. It starts with some preliminary results to establish that the algorithms work sufficiently well. The preliminary results concern the numerical performance of the SMC algorithms used: firstly, the choice of proposal kernel in the SISR algorithm is investigated; secondly, the computational complexity of the PaRIS algorithm is examined. This is presented in Section 6.1. After the numerical performance has been assessed, the chapter moves on to the calibration of the considered stochastic volatility models in Section 6.2. This section is split into two parts. The first part is a prefatory study of the calibration process, where the calibration is done on synthetic data for which the true model is known and the results can be interpreted more easily. In the second part, the calibration is performed on real-world data. A thorough discussion of the results is given in Chapter 7.

6.1 Design of the PaRIS algorithm

To examine how the proposal kernel influences the performance of Algorithm 3.1, a numerical study was carried out. The proposal kernel suggested in Section 5.1, henceforth denoted the proportional kernel, was compared to the prior kernel. To make the scenario as similar as possible to the real application, Algorithm 3.1 was run with observed log-returns $\{Y_t\}$. The observed data were the five indices described below in Subsection 6.2.2. Furthermore, the parameters were chosen with the methodology given in Subsection 5.5.1, with the "unknown"¹ parameters set to $(\phi, \sigma) = (0.9, 0.1)$ and $\rho_i = -0.1$, $i = 1, \ldots, d$.² The algorithm was run with $N = 2000$ particles.

The kernels were compared by their effective sample size $N_{\mathrm{eff}}$; the minimum and average $N_{\mathrm{eff}}$ for the two kernels can be seen in Table 6.1. The difference in effective sample size at each time step can be seen in Figure 6.1. It is clear that the proportional kernel performs better

1Unknown in the sense that they cannot be extrapolated from the data. 2Here, d = 5.

Table 6.1: Minimum and average effective sample size for the different kernels. The number of particles was N = 2000.

Kernel        | min Neff | mean Neff
Prior         | 42       | 1760
Proportional  | 64       | 1810

[Figure 6.1: histogram of counts against the normalised difference in effective sample size, ranging from about −0.2 to 0.4.]

Figure 6.1: Histogram over the normalised difference in $N_{\mathrm{eff}}$ when the proposal kernel is the proportional kernel suggested in Section 5.1, compared to the prior kernel. Larger values indicate a larger $N_{\mathrm{eff}}$ for the proportional kernel.

than the prior kernel, both in terms of the average as well as the minimum effective sample size.

6.1.1 Simulation time complexity

The simulation time complexity of the PaRIS algorithm was investigated for different stopping criteria. The simulation time was measured for a data set of length 400 and different numbers of particles. The stopping criteria investigated were to stop the accept-reject algorithm after $N/\delta$ iterations, for $\delta = 10, 20, 40, 80$, and to sample the remaining indices directly from $\mathsf{Pr}(\{\omega_t^\ell\,q(\xi_t^\ell; \xi_{t+1}^i)\,g_{t+1}(\xi_t^\ell, \xi_{t+1}^i)\}_{\ell=1}^N)$, $i \in \llbracket 1, N\rrbracket$. As a benchmark, PaRIS was also carried out without the accept-reject sampling, i.e. sampling from the aforementioned categorical distribution instantly ("after 0 iterations"). Recall the assumptions in Subsection 3.4.3, namely that the proposal density is approximately proportional to the product density and that the product density's normalisation constant does not depend on the previous particle. If these assumptions hold, the complexity should be linear in the number of particles if the stopping criterion is large enough. The results can be seen in Figure 6.2.

[Figure 6.2: simulation time in seconds against the number of particles N (80–1280), for stopping criteria N/10, N/20, N/40, and N/80; panel (a) linear scale, panel (b) log-log scale.]

Figure 6.2: Simulation time for the PaRIS algorithm when the considered model is a VHMM. The proposal density is the one suggested in Section 5.1. The model parameter was $\tilde N = 2$.

6.2 Calibration of SV models

The calibration of the stochastic volatility model was split into two parts. Firstly, a prefatory study was carried out in which the calibration was done on two synthetic data sets where the parameters and dynamics of the data were known. Secondly, the different models were calibrated to real-world data. To make the figures comparable, the parameters presented will not be in terms of $(\beta_\zeta, \Sigma)$, but instead in terms of the standard deviation matrix $D$, the correlation matrix (between the log-returns) $R$, and the correlations between the log-volatility and the log-returns $\rho$. These are easily obtained from (5.4).

6.2.1 Prefatory study

Before the EM algorithm was applied to the real-world data, it was investigated how well the parameters were calibrated when the true parameters were known. Two synthetic data sets were generated, one where the true model was an HMM, and one where the true model was a VHMM. Each data set was generated with the same realisations of the white-noise sequences and the same choices of parameters, with the exception of $\beta_\zeta$. Each synthetic data set consisted of 600 observations, and to make the data as similar to the real-world target as possible, the steady-state drift $\mu$ and standard deviation $D$ were estimated by the methodology in Subsection 5.5.1. The parameters obtained when the true model is an HMM can be seen in Table 6.2, and in Table 6.3 when the true model is a VHMM. The information criteria for each of the data sets were also computed. These results can be found in Table 6.4 and Table 6.5, which correspond to the situations when the true model is an HMM and a VHMM, respectively.

6.2.2 Main study

The data used were log-returns on five indices issued by MSCI Inc., covering the following five markets: Sweden [32], US [33], Japan [31], Europe [30], and World [34]. Each index captures large-cap and mid-cap representation on its respective market. For instance, the Swedish index's top-10 constituents include, amongst others, H&M, Handelsbanken, and Ericsson. The data available consisted of monthly log-returns on the aforementioned indices, from 1 January 1970 to 1 December 2015. In total, this gives 552 monthly log-returns. It should be noted that all the indices were launched on 31 March 1986; all data available prior to this date was generated by back-testing, i.e. the data was constructed by calculating how the index would have performed if it had existed [32]. The log-returns of the indices were ordered as
$$Y_t = \begin{pmatrix} \text{Sweden} \\ \text{US} \\ \text{Japan} \\ \text{Europe} \\ \text{World} \end{pmatrix}, \qquad (6.1)$$
that is, all parameters with subscript 1 correspond to the Swedish index, and so forth.

Table 6.2: Calibration of an HMM and a VHMM when the true model was an HMM. The number of observations generated was 600, and the calibrated values were averaged over 100 iterations after convergence.

Model       | φ    | σ    | µ [10⁻³]    | diag(D) [10⁻²] | R (off-diagonal) | ρ
True model  | 0.90 | 0.20 | (11.2, 8.8) | (6.0, 4.7)     | 0.52             | —
HMM         | 0.92 | 0.17 | (12.0, 9.1) | (5.8, 4.8)     | 0.51             | —
VHMM        | 0.92 | 0.15 | (12.0, 9.1) | (5.9, 4.9)     | 0.51             | (−0.02, −0.03)

Table 6.3: Calibration of an HMM and a VHMM when the true model was a VHMM. The number of observations generated was 600, and the calibrated values were averaged over 100 iterations after convergence.

Model       | φ    | σ    | µ [10⁻³]    | diag(D) [10⁻²] | R (off-diagonal) | ρ
True model  | 0.90 | 0.20 | (11.2, 8.8) | (6.0, 4.7)     | 0.52             | (−0.2, −0.4)
HMM         | 0.91 | 0.17 | (11.1, 7.9) | (6.0, 5.0)     | 0.53             | —
VHMM        | 0.93 | 0.15 | (11.6, 8.4) | (6.0, 5.0)     | 0.52             | (−0.25, −0.56)

Table 6.4: Observed data log-likelihood and information criteria corresponding to the parameters given in Table 6.2, i.e. when the true model is an HMM. The observed data log-likelihood estimate was calculated with N = 10⁶ particles.

Model | # of parameters | ℓ(θ̂) | ∆AIC | ∆BIC | p^i_AIC | p^i_BIC
HMM   | 7               | 1864 | 0    | 0    | 0.87    | 0.998
VHMM  | 9               | 1864 | 3.9  | 12.7 | 0.13    | 0.002

Table 6.5: Observed data log-likelihood and information criteria corresponding to the parameters given in Table 6.3, i.e. when the true model is a VHMM. The observed data log-likelihood estimate was calculated with N = 10⁶ particles.

Model | # of parameters | ℓ(θ̂) | ∆AIC | ∆BIC | p^i_AIC | p^i_BIC
HMM   | 7               | 1852 | 8.6  | 0    | 0.01    | 0.54
VHMM  | 9               | 1858 | 0    | 0.3  | 0.99    | 0.46

Comparison with the VIX models

To compare how well the different models performed, each was fitted to the data. Since VIX data was only available from 1 February 1990, all models were calibrated on the data after this point in time. Two VIX models were calibrated: one corresponding to the HMM, i.e. no correlation between the log-volatility and the log-returns, and one corresponding to the VHMM. These are denoted VIX Model1 and VIX Model2, respectively. The calibrated values can be seen in Table 6.6; both calibration approaches produce roughly similar results. To compare the models, the information criteria given in Chapter 4 were calculated. Firstly, the two versions of the VIX model were compared on the observed log-returns and the observed VIX data; this can be seen in Table 6.7. Furthermore, to compare the AIC and BIC of the best VIX model with those of the HMM and VHMM, the log-likelihood of the observed data is needed. Since the SISR algorithm produces an estimate of the observed data likelihood as a by-product, given by (2.13), an estimate of the observed data log-likelihood can be computed by taking the logarithm of (2.13). However, to be able to use the information criteria, the underlying data need to be the same for all models that are compared. VIX Model2 cannot use the VIX data to compute the observed data log-likelihood, as it would then be a function of more data compared to the HMM/VHMM. To remedy this, the SISR algorithm was run with the parameters obtained from the VIX calibration, and the observed data log-likelihood was estimated using (2.13). The estimated observed data log-likelihood, along with the information criteria and the posterior model probabilities, are given in Table 6.8.

HMM vs. VHMM

As Table 6.8 suggests that the volatility index model is much worse than the HMM or the VHMM, it is possible to disregard it and focus only on the HMM and VHMM. In this scenario, the models can be calibrated to data from 1 January 1970 and onwards. For the HMM, the trajectories of the EM algorithm can be seen in Figures 6.3, 6.4, 6.5, and 6.6. The corresponding trajectories for the VHMM can be seen in Figures 6.7, 6.8, 6.9, 6.10, and 6.11. The values the EM algorithm converged to are given in Table 6.9; each value was averaged over 100 iterations after the EM algorithm had converged. Just as in the prefatory study, the two models converge to similar results. The observed data log-likelihood and the information criteria for the parameter values given in Table 6.9 can be seen in Table 6.10.

Table 6.6: The respective models calibrated to monthly data from 1 Feb 1990 to 1 Dec 2015. The HMM and VHMM models were averaged over 100 iterations after convergence.

Model       | φ    | σ    | µ [10⁻³]                      | diag(D) [10⁻²]            | R (upper triangle, by row)                                   | ρ
VIX Model1  | 0.85 | 0.35 | (18.5, 14.9, 8.7, 14.7, 13.6) | (6.2, 4.5, 6.3, 4.1, 4.1) | (0.55, 0.39, 0.72, 0.67; 0.42, 0.75, 0.90; 0.46, 0.70; 0.88) | —
VIX Model2  | 0.85 | 0.36 | (14.5, 12.1, 6.9, 11.8, 10.9) | (6.3, 4.5, 6.3, 4.1, 4.1) | (0.44, 0.35, 0.64, 0.58; 0.38, 0.68, 0.87; 0.42, 0.70; 0.84) | (−0.45, −0.44, −0.20, −0.50, −0.47)
HMM         | 0.89 | 0.36 | (13.0, 9.5, 3.9, 10.1, 8.8)   | (5.5, 3.9, 5.0, 3.8, 3.6) | (0.60, 0.37, 0.78, 0.71; 0.55, 0.76, 0.94; 0.49, 0.69; 0.89) | —
VHMM        | 0.90 | 0.28 | (11.5, 9.2, 3.1, 9.0, 8.0)    | (6.2, 4.4, 5.6, 4.2, 4.0) | (0.59, 0.36, 0.77, 0.70; 0.52, 0.75, 0.94; 0.47, 0.68; 0.89) | (−0.32, −0.20, −0.17, −0.37, −0.31)

Table 6.7: Complete data log-likelihood $\ell_{\mathrm{VIX}}(\hat\theta) \triangleq \log L(\tilde X_{0:t}, Y_{1:t}; \hat\theta)$ with $t = 310$, and information criteria corresponding to the parameters for the VIX models given in Table 6.6.

Model      | # of parameters | ℓ_VIX(θ̂) | ∆AIC | ∆BIC | p^i_AIC   | p^i_BIC
VIX Model1 | 22              | 4974     | 89.9 | 71.2 | 3 · 10⁻²⁰ | 3 · 10⁻¹⁶
VIX Model2 | 27              | 5024     | 0    | 0    | 1         | 1

Table 6.8: Observed data log-likelihood and information criteria corresponding to the parameters given in Table 6.6. The observed data log-likelihood estimate was calculated with N = 10⁶ particles.

Model      | # of parameters | ℓ(θ̂) | ∆AIC | ∆BIC | p^i_AIC     | p^i_BIC
VIX Model2 | 27              | 3490 | 108  | 118  | 3.2 · 10⁻²⁴ | 2.0 · 10⁻²⁶
HMM        | 22              | 3535 | 8.45 | 0    | 0.014       | 0.994
VHMM       | 27              | 3544 | 0    | 10.3 | 0.986       | 0.006


Figure 6.3: Convergence of (φ, σ) (the parameters of the hidden process) in the EM algorithm when the model under consideration is the HMM stochastic volatility model. The algorithm was run with N = 200.

Table 6.9: The respective models calibrated to monthly data from 1 January 1970 to 1 December 2015. The HMM and VHMM models were averaged over 100 iterations after convergence. Each iteration used N = 200 particles.

Model | φ    | σ    | µ [10⁻³]                     | diag(D) [10⁻²]            | R (upper triangle, by row)                                   | ρ
HMM   | 0.91 | 0.27 | (12.8, 10.3, 8.1, 10.5, 9.9) | (5.8, 4.4, 5.4, 4.1, 3.9) | (0.50, 0.31, 0.65, 0.60; 0.43, 0.66, 0.91; 0.45, 0.65; 0.83) | —
VHMM  | 0.91 | 0.26 | (12.3, 10.0, 8.1, 10.2, 9.6) | (6.0, 4.6, 5.6, 4.2, 4.0) | (0.49, 0.31, 0.65, 0.59; 0.43, 0.66, 0.92; 0.44, 0.64; 0.83) | (−0.23, −0.21, −0.11, −0.29, −0.26)

Table 6.10: Likelihood and information criteria corresponding to the parameters given in Table 6.9. The log-likelihood was calculated with N = 10⁶ particles.

Model | # of parameters | ℓ(θ̂) | ∆AIC | ∆BIC | p^i_AIC | p^i_BIC
HMM   | 22              | 5900 | 1.25 | 0    | 0.32    | 1
VHMM  | 27              | 5906 | 0    | 19.8 | 0.68    | 5 · 10⁻⁵


Figure 6.4: Convergence of µ in the EM algorithm when the model under consideration is the HMM stochastic volatility model. The algorithm was run with N = 200.


Figure 6.5: Convergence of D (the standard deviation of the log-returns) in the EM algo- rithm when the model under consideration is the HMM stochastic volatility model. The algorithm was run with N = 200.


Figure 6.6: Convergence of R (the correlation of the log-returns) in the EM algorithm when the model under consideration is the HMM stochastic volatility model. The algorithm was run with N = 200.


Figure 6.7: Convergence of (φ, σ) (the parameters of the hidden process) in the EM algo- rithm when the model under consideration is the VHMM stochastic volatility model. The algorithm was run with N = 200.


Figure 6.8: Convergence of µ in the EM algorithm when the model under consideration is the VHMM stochastic volatility model. The algorithm was run with N = 200.


Figure 6.9: Convergence of D (the standard deviation of the log-returns) in the EM algo- rithm when the model under consideration is the VHMM stochastic volatility model. The algorithm was run with N = 200.


Figure 6.10: Convergence of ρ in the EM algorithm when the model under consideration is the VHMM stochastic volatility model. The algorithm was run with N = 200.


Figure 6.11: Convergence of R (the correlation of the log-returns) in the EM algorithm when the model under consideration is the VHMM stochastic volatility model. The algorithm was run with N = 200.

Chapter 7

Discussion

In this chapter, the results presented in Chapter 6 will be discussed. It follows the structure of the previous chapter, starting with the numerical studies in Section 7.1. Afterwards, the chapter continues with the calibration of the stochastic volatility models in Section 7.2.

7.1 Design of the PaRIS algorithm

7.1.1 Proposal kernel selection

The results presented in Section 6.1 show that the proportional kernel provides better performance in terms of effective sample size than the prior kernel. Since the aim was to test the different kernels under conditions similar to the real-world target, these results imply that the proportional kernel should be used when implementing the PaRIS algorithm.

In order to obtain a better proposal kernel, two corrections that extend the proportional kernel could be tested. Both of these corrections concern the approximation of $e^{-\xi_t/2}$. Firstly, as noted in Section 5.1, it can be approximated by propagating the previous value by the hidden chain's underlying dynamics, i.e. $e^{-\xi_t/2} \approx e^{-\phi\xi_{t-1}/2}$. Secondly, when deriving the proportional kernel, the zeroth-order approximation $e^{-\xi_t/2} = 1 + O(\xi_t)$ was used. To improve upon this, it could be extended to its first-order approximation $e^{-\xi_t/2} = 1 - \xi_t/2 + O(\xi_t^2)$. If this approximation were used, the proposal kernel would still belong to the normal family. This should provide a better proposal kernel, especially for extreme observations. Lastly, the methodology suggested in [7], where the parameter vector $\vartheta$ is chosen based on which parametric family $r_{t;\vartheta}$ belongs to, could be tested. However, the average effective sample size given in Table 6.1 was close to the number of particles, and the performance of the proportional kernel was therefore deemed satisfactory. The proportional kernel was used for all simulations done in this report.

7.1.2 Computational complexity for the PaRIS algorithm

The results found in Subsection 6.1.1 show that the PaRIS algorithm does not exhibit linear computational complexity in the number of particles. The accept-reject algorithm does not seem to accept enough backward draws before the stopping criterion is met, and the complexity for all tested stopping criteria is not linear. Furthermore, it was always best to sample directly from the exact distribution instead of applying the accept-reject sampling. The complexity may be greater than linear for several reasons. Firstly, the assumptions needed in order to obtain linear complexity are perhaps not fulfilled; alternatively, the instrumental density from which the backward indices are drawn is not optimal. The derivation made in Subsection 3.4.3 assumes that the proposal density used in the SISR algorithm is roughly proportional to the product density, and that the normalisation constant of the product density does not depend on the previous value of the underlying chain. Since the normalisation constant of $g_t$ does depend on the previous value, this assumption is not fulfilled for the product density. Therefore, the optimality of the categorical distribution with weights proportional to the importance weights is not guaranteed. If the true optimal density for sampling backward indices is very dissimilar to the categorical distribution induced by the importance weights, the performance will deteriorate; this could be the reason why the algorithm does not exhibit linear computational complexity. Consequently, there is no major advantage to using PaRIS compared to forward-only FFBSm from a computational complexity point of view.

7.2 Stochastic volatility models

7.2.1 Prefatory study

Inspecting the values the EM algorithm converges to, the first noticeable feature is that the estimates of the emission parameters $(\mu, D, R)$ are very similar between the models. This is expected, as a large deviation would suggest that there are major differences between the models. The largest differences are found in the parameters of the hidden process. This is also expected, as the unobservable features should be harder to distinguish. The most interesting parts are the information criteria given in Table 6.4 and Table 6.5. AIC always picks the true model, both for the HMM and the VHMM. However, due to its larger penalty on the number of parameters, BIC always gives the largest posterior probability to the HMM, even when the true model is a VHMM. Furthermore, if the dimension of $Y$ is large, BIC has the potential to be even more favourable towards the HMM, because the number of parameters grows linearly with the dimension of the observable space. This suggests that BIC should perhaps be applied with some caution here.

7.2.2 Main study

As the algorithms have been shown to converge to reasonable values in the prefatory study, it is straightforward to apply them to the real data. The VIX models do provide estimates of the parameters against which the calibrations of the HMM and the VHMM can be compared. In general, the values in Table 6.6 are comparable for all four models, and many of the calibrated values are in line with what was expected. Firstly, as $\phi$ is close to one for all models, they all exhibit volatility clustering. Furthermore, the VHMM and VIX Model2 both give rise to a negative correlation between the log-volatility and the log-returns, which is in line with previous results. The calibrated parameter value that stands out is $\sigma$ for the VHMM, which is quite different from the $\sigma$'s obtained for the other three models.

VIX models

Before comparing one of the VIX models with the HMM and the VHMM, the two VIX models were compared against each other to determine which was best given both the observed log-returns and the observed VIX data. The comparison of the information criteria was very conclusive: the VIX model with correlation (VIX Model2) provides a much better fit to the data. This means that, if VIX were used as an approximation, introducing a correlation between the log-volatility and the log-returns yields a much better model. Since the evidence for VIX Model2 against VIX Model1 is indisputable when comparing just the two of them, VIX Model1 was disregarded from further consideration.

Just as VIX Model2 is much better than its uncorrelated counterpart, Table 6.8 shows that the HMM and the VHMM provide a much better fit to the data than VIX Model2. This is expected, as the EM algorithm gives the maximum likelihood estimator of the observed data for the HMM and the VHMM, whereas the VIX model is merely an approximation. However, a comparison of the values in Table 6.6 shows that the different calibration methods yield roughly similar results. This confirms that VIX could be used as a coarse approximation of the shape of the volatility.

Which model is the best?

Comparing the calibrated parameter values for the HMM and the VHMM in Table 6.6 and Table 6.9, the overall impression is that most of the parameters are very similar between the models. On the other hand, there are a few changes between the values in Table 6.6 and Table 6.9 that are noteworthy. First and foremost, $\sigma$ changes considerably for the HMM, from 0.36 to 0.27. This drastic change could indicate that the HMM cannot capture the proper dynamics of the data, and is therefore more gravely affected by the noise in the data. Furthermore, $\mu_3$ changed considerably for both models. The average return for the MSCI Japan Index from 1 January 1970 to 1 January 1990 was $18 \cdot 10^{-3}$, whereas it was $1 \cdot 10^{-3}$ after 1 January 1990, which explains the large difference. However, this could imply that the assumption of a constant steady-state drift is too simplistic.

The main objective of this thesis is to determine whether a stochastic volatility model that allows correlation between the log-volatility and the log-returns gives a better fit to actually observed log-returns. Looking at Table 6.10, the evidence is inconclusive. If AIC is used as

a measure of goodness of fit, the VHMM has the higher posterior probability, whereas if BIC is used, the HMM is definitely better. At first sight, it is tempting to call out the HMM as the winner, as BIC is much smaller for the HMM and AIC is not as favourable for the VHMM. However, consider that in the prefatory study BIC always picked the HMM as the best model. This was true even when the true model was the VHMM, and that model had true parameter values in the same range as the calibrated values for the real data. This is in contrast to AIC, which always picked the true model. This suggests that, for this kind of model, BIC is not suitable, as it penalises additional parameters too heavily. As AIC and BIC contradict each other, it is impossible to draw any definite conclusions from this limited amount of results. More extensive analysis of which criterion is better in this situation is needed to draw any unequivocal conclusions. Nevertheless, there are several reasons why I think that AIC, and hence the VHMM, is better:

i) As mentioned before, AIC always picked the correct model in the prefatory study, and the number of extra parameters for the VHMM grows with the dimension of the observable space. As a result, BIC will give a very large penalty to models with more parameters (especially if the number of data points is large), and declare the HMM the best model merely on the fact that it has fewer parameters. In summary, BIC seems to overweight the number of parameters in this case.

ii) As mentioned in Chapter 4, previous literature has used AIC to compare stochastic volatility models [26].

iii) When comparing the two VIX models, the correlated model (VIX Model2) was much better than its uncorrelated counterpart. As both information criteria were much smaller for VIX Model2, there is no doubt that there exists a negative correlation between the log-VIX and the log-returns. As VIX is supposed to be a measure of volatility, this supports the claim that the VHMM is the better model.

However, even though all of this points towards the VHMM as the best model, the evidence is by no means conclusive, and the question has to be investigated further.

Chapter 8

Conclusions and Future Work

8.1 Conclusions

This thesis extends the framework of smoothing in hidden Markov models to virtually hidden Markov models. It is straightforward to calibrate such models with the EM algorithm by using the PaRIS algorithm, and this has been done for a stochastic volatility VHMM with great success. Compared to a regular HMM, the VHMM appears to be better, but the results are not conclusive and more research is needed. Lastly, even though it was possible to use the PaRIS algorithm to compute smoothed expectations for a VHMM, the algorithm did not attain linear complexity.

8.2 Future work

This thesis lays the groundwork for smoothing in virtually hidden Markov models. However, there are several remaining topics of research related to the results in this thesis. In terms of the algorithms used, it is of interest to see whether it is possible to obtain linear complexity for the PaRIS algorithm when the model is a VHMM. Perhaps another proposal kernel or backward instrumental kernel can remedy the high complexity, or maybe some additional assumptions are needed.

Regarding stochastic volatility, there are several ways forward. The methods are easily extended to stochastic volatility models that have different dynamics. In [26], several other stochastic volatility models are presented. First of all, the white-noise sequence they use for the log-returns is multivariate $t$-distributed with $\nu$ degrees of freedom; this could be a first simple extension. Secondly, the dynamics of the log-returns could incorporate a baseline volatility $\eta$, that is, replacing the $e^{X/2}$-term with $e^{X/2} + \eta$. Lastly, a regime-switching model could be studied, where the dynamics of the log-volatility are allowed to change. The log-volatility could then be given by the mixture model

$$X_t = \phi_{I_t} X_{t-1} + \sigma_{I_t}\zeta_t, \qquad (8.1)$$

where $I_t$ is a Markov chain on $\llbracket 1, k\rrbracket$.

Furthermore, it would be interesting to see whether more conclusive results could be obtained with more data points. Here, all data were monthly log-returns; it would be interesting to investigate how more data, e.g. weekly or daily log-returns, would affect the results. However, by increasing the number of data points, the linear convergence of the EM algorithm might make the time until convergence too long, and other algorithms such as quasi-Newton methods [7] could be used instead.

Lastly, implementing the bootstrap information criterion would be a natural next step to obtain another information criterion to compare models with. This should provide additional insight into the model selection process and help determine which model is the best.

Appendix A

Extension of the Accept-Reject Sampling Algorithm

When sampling backward indices, some complications arise when considering VHMMs. The emission density is often more complex than the transition density, and will sometimes not be uniformly bounded. Therefore, since the emission density is a factor in the backward kernel, the accept-reject algorithm needs to be altered to accommodate this.

Assumption A.1. The product density is bounded by $c(x')$, $x' \in \mathsf{X}$, that is,

$$q(x; x')\, g_t(x, x') \le c(x'), \qquad \forall (x, x') \in \mathsf{X}^2. \tag{A.1}$$

It will be shown that under Assumption A.1 it is still possible to sample indices using an accept-reject algorithm.

Theorem A.1. If $J_{t+1}^i$ is produced by Algorithm A.1, then it holds that $J_{t+1}^i \sim \Pr\big(\{\omega_t^\ell\, q(\xi_t^\ell; \xi_{t+1}^i)\, g_{t+1}(\xi_t^\ell, \xi_{t+1}^i)\}_{\ell=1}^N\big)$. (Here, $J_{t+1}^i \triangleq J_{t+1}^{(i,j)}$ for any $j$, as $J_{t+1}^{(i,j_1)} \overset{d}{=} J_{t+1}^{(i,j_2)}$ for all $(j_1, j_2) \in \llbracket 1, \tilde N \rrbracket^2$.)

Proof. Consider a fixed particle $\xi_{t+1}^i$. To show that $J_{t+1}^i$ has the desired distribution, first note that the event $\{J_{t+1}^i = j\}$ is equivalent to $\{J^* = j \mid J^* \text{ accepted}\}$. By Bayes' theorem, it holds that

$$\Pr(J^* = j \mid J^* \text{ accepted}) = \frac{\Pr(J^* \text{ accepted} \mid J^* = j)}{\Pr(J^* \text{ accepted})}\, \Pr(J^* = j). \tag{A.2}$$

Simple calculations give that

$$\Pr(J^* = j) = \frac{\omega_t^j}{\Omega_t},$$

$$\Pr(J^* \text{ accepted}) = \Pr\left( U \le \frac{q(\xi_t^{J^*}; \xi_{t+1}^i)\, g_{t+1}(\xi_t^{J^*}, \xi_{t+1}^i)}{c(\xi_{t+1}^i)} \right) = \mathbb{E}\left[ \frac{q(\xi_t^{J^*}; \xi_{t+1}^i)\, g_{t+1}(\xi_t^{J^*}, \xi_{t+1}^i)}{c(\xi_{t+1}^i)} \right] = \frac{1}{c(\xi_{t+1}^i)} \sum_{k=1}^N \frac{\omega_t^k}{\Omega_t}\, q(\xi_t^k; \xi_{t+1}^i)\, g_{t+1}(\xi_t^k, \xi_{t+1}^i),$$

$$\Pr(J^* \text{ accepted} \mid J^* = j) = \frac{q(\xi_t^j; \xi_{t+1}^i)\, g_{t+1}(\xi_t^j, \xi_{t+1}^i)}{c(\xi_{t+1}^i)}.$$

Insertion into (A.2) and simplification of the expression gives

$$\Pr(J^* = j \mid J^* \text{ accepted}) \propto \omega_t^j\, q(\xi_t^j; \xi_{t+1}^i)\, g_{t+1}(\xi_t^j, \xi_{t+1}^i), \tag{A.3}$$

which is the shape of the desired distribution.

Since it is possible to compute $c(\xi_{t+1}^i)$ before the algorithm starts, and there are $N$ such constants to compute, the complexity should still exhibit linear behaviour.

Algorithm A.1: Accept-reject sampling of backward indices for VHMMs.
Data: $\{\{\xi_s^i, \omega_s^i\}_{i=1}^N\}_{s=t}^{t+1}$.
Result: Backward indices $J_{t+1}^{(i,j)} \sim \Pr\big(\{\omega_t^\ell\, q(\xi_t^\ell; \xi_{t+1}^i)\, g_{t+1}(\xi_t^\ell, \xi_{t+1}^i)\}_{\ell=1}^N\big)$ for all $(i, j) \in \llbracket 1, N \rrbracket \times \llbracket 1, \tilde N \rrbracket$.

for $j \leftarrow 1$ to $\tilde N$ do
  $L \leftarrow \llbracket 1, N \rrbracket$.
  while $L \ne \emptyset$ do
    $L_n \leftarrow \emptyset$; $n \leftarrow \operatorname{size}(L)$.
    for $k \leftarrow 1$ to $n$ do
      $J^* \sim \Pr(\{\omega_t^\ell\}_{\ell=1}^N)$; $U \sim \mathcal{U}(0, 1)$.
      if $U \le q(\xi_t^{J^*}; \xi_{t+1}^{L(k)})\, g_{t+1}(\xi_t^{J^*}, \xi_{t+1}^{L(k)}) / c(\xi_{t+1}^{L(k)})$ then
        $J_{t+1}^{(L(k), j)} \leftarrow J^*$.
      else
        $L_n \leftarrow L_n \cup \{L(k)\}$.
      end if
    end for
    $L \leftarrow L_n$.
  end while
end for
return $\{J_{t+1}^{(i,j)} : (i, j) \in \llbracket 1, N \rrbracket \times \llbracket 1, \tilde N \rrbracket\}$.
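For concreteness, a minimal R sketch of the accept-reject draw follows. It samples a single backward index for one filter particle by repeated proposal from the filter weights, which is equivalent in distribution to how Algorithm A.1 treats a single index; the batched sweep over the remaining index set $L$ is a vectorised version of this loop. The function name is hypothetical, and the densities q and g are placeholders to be supplied by the model.

```r
# Sketch of the accept-reject draw in Algorithm A.1 for a single filter
# particle xi_next with precomputed bound c_next = c(xi_next).
# q and g are model-supplied density functions; names are illustrative.
sample_backward_index <- function(xi_t, w_t, xi_next, c_next, q, g) {
  probs <- w_t / sum(w_t)
  repeat {
    J <- sample.int(length(w_t), 1, prob = probs)  # propose J* from the weights
    U <- runif(1)
    # Accept with probability q * g / c, which is <= 1 under Assumption A.1
    if (U <= q(xi_t[J], xi_next) * g(xi_t[J], xi_next) / c_next) return(J)
  }
}
```

Precomputing the $N$ bounds $c(\xi_{t+1}^i)$ before the sweep is what keeps the expected cost per accepted index bounded, as noted above.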

Appendix B

Derivation of the Intermediate Quantity

To carry out the Expectation-Maximisation algorithm, the first thing needed is the intermediate quantity, together with an identification of the smoothed sufficient statistics that need to be computed (the E-step). The second part is to calculate the updating function $\Lambda$ that maximises the intermediate quantity (the M-step).

B.1 E-step

The intermediate quantity is the expectation of the complete data log-likelihood, conditional on the observations, that is

$$Q_{\theta_\ell}(\theta) = \mathbb{E}_{\theta_\ell}\left[ \log p_\theta(X_{0:t}, Y_{1:t}) \mid Y_{1:t} \right] \triangleq T_{\theta_\ell}(\theta) + G_{\theta_\ell}(\theta). \tag{B.1}$$

It will be assumed that the initial density $\chi$ does not depend on $\theta$ and can therefore be discarded. Under this assumption, (B.1) splits into two terms, $T$ and $G$, coming from the transition density and the emission density, respectively. First consider the term from the transition density, $T$.

" t # X T (θ) = log q (X ,X ) Y θ` Eθ` θ k−1 k 1:t k=1 " t 2 #! c. 1 X (Xk − φXk−1) = − t log σ2 + Y 2 Eθ` σ2 1:t k=1 1  1  = t log σ2 + z − 2φz + φ2z  , (B.2) 2 σ2 1 2 3

where
$$z_1 \triangleq \mathbb{E}_{\theta_\ell}\Big[ \sum_{k=1}^t X_k^2 \,\Big|\, Y_{1:t} \Big], \quad z_2 \triangleq \mathbb{E}_{\theta_\ell}\Big[ \sum_{k=1}^t X_{k-1} X_k \,\Big|\, Y_{1:t} \Big], \quad z_3 \triangleq \mathbb{E}_{\theta_\ell}\Big[ \sum_{k=1}^t X_{k-1}^2 \,\Big|\, Y_{1:t} \Big], \tag{B.3}$$
are the smoothed sufficient statistics that need to be computed. Now consider the term $G$ stemming from the emission density $g$. Tedious linear algebra gives

" t # X G (θ) log g (X ,X ) Y θ` , Eθ` t;θ k−1 k 1:t k=1 " t c. 1 X  T = − t log kΣk + e−Xk Y − µ − β eXk/2ζ Σ−1 2 Eθ` k ζ k k=1 #!   Xk/2 × Yk − µ − βζ e ζk Y0:t

  t c. 1 X X  = − t log kΣk + Σ−1e−Xk Y iY j − Y jµi − Y iµj + µiµj − 2  Eθ`  ij k k k k k=1 i,j  n o  i i j j j i Xk/2 Xk 2 i j (Yk − µ )β + (Y − µ )βζ ζke + e ζk βζ β Y0:t . ζ k ζ

Let the smoothed sufficient statistics needed be denoted by

$$\begin{aligned}
\Delta^{(i,j)} &\triangleq \mathbb{E}_{\theta_\ell}\Big[ \textstyle\sum_{k=1}^t Y_k^i Y_k^j e^{-X_k} \,\Big|\, Y_{0:t} \Big], \\
E^i &\triangleq \mathbb{E}_{\theta_\ell}\Big[ \textstyle\sum_{k=1}^t Y_k^i e^{-X_k} \,\Big|\, Y_{0:t} \Big], \\
S &\triangleq \mathbb{E}_{\theta_\ell}\Big[ \textstyle\sum_{k=1}^t e^{-X_k/2} \zeta_k \,\Big|\, Y_{0:t} \Big] = \frac{1}{\sigma}\big( z_4 - \phi z_5 \big), \\
z_4 &\triangleq \mathbb{E}_{\theta_\ell}\Big[ \textstyle\sum_{k=1}^t X_k e^{-X_k/2} \,\Big|\, Y_{0:t} \Big], \\
z_5 &\triangleq \mathbb{E}_{\theta_\ell}\Big[ \textstyle\sum_{k=1}^t X_{k-1} e^{-X_k/2} \,\Big|\, Y_{0:t} \Big], \\
V &\triangleq \mathbb{E}_{\theta_\ell}\Big[ \textstyle\sum_{k=1}^t e^{-X_k} \,\Big|\, Y_{0:t} \Big], \\
Z &\triangleq \mathbb{E}_{\theta_\ell}\Big[ \textstyle\sum_{k=1}^t \zeta_k^2 \,\Big|\, Y_{0:t} \Big] = \frac{1}{\sigma^2}\big( z_1 - 2\phi z_2 + \phi^2 z_3 \big), \\
A^i &\triangleq \mathbb{E}_{\theta_\ell}\Big[ \textstyle\sum_{k=1}^t Y_k^i \zeta_k e^{-X_k/2} \,\Big|\, Y_{0:t} \Big] = \frac{1}{\sigma}\big( a_1^{(i)} - \phi a_2^{(i)} \big), \\
a_1^{(i)} &\triangleq \mathbb{E}_{\theta_\ell}\Big[ \textstyle\sum_{k=1}^t Y_k^i X_k e^{-X_k/2} \,\Big|\, Y_{0:t} \Big], \\
a_2^{(i)} &\triangleq \mathbb{E}_{\theta_\ell}\Big[ \textstyle\sum_{k=1}^t Y_k^i X_{k-1} e^{-X_k/2} \,\Big|\, Y_{0:t} \Big].
\end{aligned} \tag{B.4}$$

Insertion of (B.4) into G gives

$$G_{\theta_\ell}(\theta) \overset{c.}{=} -\frac{1}{2}\left( t \log \|\Sigma\| + \sum_{i,j} \Sigma^{-1}_{ij} \Big[ \Delta^{(i,j)} - E^j \mu^i - E^i \mu^j + \mu^i \mu^j V - \beta_\zeta^i (A^j - \mu^j S) - \beta_\zeta^j (A^i - \mu^i S) + \beta_\zeta^i \beta_\zeta^j Z \Big] \right),$$
which, together with (B.2), gives the intermediate quantity as

$$Q_{\theta_\ell}(\theta) \overset{c.}{=} -\frac{1}{2}\left( t \log \sigma^2 + t \log \|\Sigma\| + \frac{1}{\sigma^2}\big( z_1 - 2\phi z_2 + \phi^2 z_3 \big) + \sum_{i,j} \Sigma^{-1}_{ij} \Big[ \Delta^{(i,j)} - E^j \mu^i - E^i \mu^j + \mu^i \mu^j V - \beta_\zeta^i (A^j - \mu^j S) - \beta_\zeta^j (A^i - \mu^i S) + \beta_\zeta^i \beta_\zeta^j Z \Big] \right).$$
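In practice, the statistics in (B.3)–(B.4) are approximated by Monte Carlo averages of additive functionals over smoothed particle trajectories (via the PaRIS algorithm in this thesis). As a hedged illustration of the averaging pattern only, the R sketch below computes a few of the statistics from a matrix of trajectories drawn from the joint smoothing distribution, e.g. by FFBSi; the data layout and names are assumptions made for the example.

```r
# X: M x (t+1) matrix of smoothed trajectories; column 1 holds X_0.
# Returns Monte Carlo estimates of z1, z2, z3 and V from (B.3)-(B.4);
# the remaining statistics follow the same averaging pattern.
smoothed_stats <- function(X) {
  t_len <- ncol(X) - 1
  Xk <- X[, 2:(t_len + 1), drop = FALSE]  # X_1, ..., X_t
  Xp <- X[, 1:t_len,       drop = FALSE]  # X_0, ..., X_{t-1}
  list(z1 = mean(rowSums(Xk^2)),
       z2 = mean(rowSums(Xp * Xk)),
       z3 = mean(rowSums(Xp^2)),
       V  = mean(rowSums(exp(-Xk))))
}
```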

B.2 M-step

The M-step is quite tricky, mainly due to the appearance of $(\phi, \sigma)$ in $G$. Analytic expressions for the updates of these parameters give horrendous algebraic expressions. Therefore, the maximisation will only be done analytically for the parameters that do not occur in the transition density, that is, $(\mu, \beta_\zeta, \Sigma)$. These parameters only occur in $G$, and therefore only this part will be considered in the subsequent segments. Furthermore, the maximisation will be done by finding the stationary points of the intermediate quantity. Since the derivative will be set equal to zero, $\overset{c.}{=}$ will also include equality up to multiplicative (as well as additive) constants, to make the results more readable.

B.2.1 Maximisation with respect to µ

Let $S^i \triangleq \beta_\zeta^i S$, and take the derivative of $Q_{\theta_\ell}(\theta)$ with respect to $\mu^n$. This gives

$$\begin{aligned}
\frac{\partial}{\partial \mu^n} Q_{\theta_\ell}(\theta) &\overset{c.}{=} \frac{\partial}{\partial \mu^n} \sum_{i,j} \Sigma^{-1}_{ij}\left( -\mu^i E^j - \mu^j E^i + \mu^i \mu^j V + \mu^i S^j + \mu^j S^i \right) \\
&= \sum_{i,j} \Sigma^{-1}_{ij}\Big[ -\delta_{in} E^j - \delta_{jn} E^i + (\delta_{in}\mu^j + \delta_{jn}\mu^i) V + \delta_{in} S^j + \delta_{jn} S^i \Big] \\
&= \sum_{j} \Sigma^{-1}_{nj}\Big[ V\mu^j - E^j + S^j \Big] + \sum_{i} \Sigma^{-1}_{in}\Big[ V\mu^i - E^i + S^i \Big] = \{ \Sigma^{-1}_{ij} = \Sigma^{-1}_{ji} \} \\
&= 2 \sum_{i} \Sigma^{-1}_{in}\Big[ V\mu^i - E^i + S^i \Big] = 0 \\
&\Rightarrow \Sigma^{-1}\mu = \frac{1}{V}\Sigma^{-1}(E - S).
\end{aligned}$$
Since $\Sigma^{-1}$ is invertible, the solution for $\mu$ is unique, and is given by
$$\hat\mu = \frac{1}{V}(E - S) = \frac{1}{V}(E - S\beta_\zeta). \tag{B.5}$$

Inserting this expression for $\hat\mu$ into $G_{\theta_\ell}(\theta)$ gives

$$G_{\theta_\ell}(\theta) \overset{c.}{=} -\frac{1}{2}\left( t \log \|\Sigma\| + \sum_{i,j} \Sigma^{-1}_{ij}\left[ \Delta^{(i,j)} - \frac{1}{V} E^i E^j + \beta_\zeta^i\left( \frac{S}{V} E^j - A^j \right) + \beta_\zeta^j\left( \frac{S}{V} E^i - A^i \right) + \beta_\zeta^i \beta_\zeta^j\left( Z - \frac{S^2}{V} \right) \right] \right).$$

B.2.2 Maximisation with respect to βζ

The procedure for $\beta_\zeta$ is identical to the one for $\mu$. Differentiating $Q_{\theta_\ell}(\theta)$ with respect to $\beta_\zeta^n$ gives
$$\begin{aligned}
\frac{\partial}{\partial \beta_\zeta^n} Q_{\theta_\ell}(\theta) &\overset{c.}{=} \frac{\partial}{\partial \beta_\zeta^n} \sum_{i,j} \Sigma^{-1}_{ij}\left[ -\beta_\zeta^i\left( A^j - \frac{S}{V} E^j \right) - \beta_\zeta^j\left( A^i - \frac{S}{V} E^i \right) + \beta_\zeta^i \beta_\zeta^j\left( Z - \frac{S^2}{V} \right) \right] = (\dots) \\
&= 2 \sum_{i=1}^d \Sigma^{-1}_{in}\left[ \beta_\zeta^i\left( Z - \frac{S^2}{V} \right) - \left( A^i - \frac{S}{V} E^i \right) \right] = 0 \\
&\Rightarrow \left( Z - \frac{S^2}{V} \right)\Sigma^{-1}\beta_\zeta = \Sigma^{-1}\left( A - \frac{S}{V} E \right).
\end{aligned}$$

Again, Σ−1 is invertible, and the unique solution is given by

$$\hat\beta_\zeta = \left( Z - \frac{S^2}{V} \right)^{-1}\left( A - \frac{S}{V} E \right). \tag{B.6}$$

Inserting this expression for $\hat\beta_\zeta$ into $G_{\theta_\ell}(\theta)$ gives
$$G_{\theta_\ell}(\theta) \overset{c.}{=} -\frac{t}{2}\left( \log \|\Sigma\| + \operatorname{Tr}(K^T \Sigma^{-1}) \right),$$
where $K$ is a $d \times d$ symmetric matrix with components given by
$$K_{ij} \triangleq \frac{1}{t}\left\{ \Delta^{(i,j)} - \frac{1}{V} E^i E^j - \left( Z - \frac{S^2}{V} \right)^{-1}\left( A^j - \frac{S}{V} E^j \right)\left( A^i - \frac{S}{V} E^i \right) \right\}.$$

B.2.3 Maximisation with respect to Σ

As Σ is a matrix, the maximisation is more complicated. However, there are two matrix identities that help greatly. For any symmetric matrices X and A, it holds that [38]

$$\frac{\partial}{\partial X} \log \|X\| = (X^{-1})^T = X^{-1}, \tag{B.7}$$

$$\frac{\partial}{\partial X} \operatorname{Tr}(A^T X) = A. \tag{B.8}$$
Taking the derivative of $Q$ with respect to $\Sigma^{-1}$ gives
$$\begin{aligned}
\frac{\partial}{\partial \Sigma^{-1}} G_{\theta_\ell}(\theta) &= -\frac{t}{2}\left( -\frac{\partial}{\partial \Sigma^{-1}} \log \|\Sigma^{-1}\| + \frac{\partial}{\partial \Sigma^{-1}} \operatorname{Tr}(K^T \Sigma^{-1}) \right) \\
&= \big\{ \text{(B.7) and (B.8) with } A = K \text{ and } X = \Sigma^{-1} \big\} \\
&= \frac{t}{2}\left( \Sigma - K \right) = 0 \\
&\Rightarrow \hat\Sigma = K.
\end{aligned}$$

B.2.4 Updating formula for the HMM version

The result is easily adapted to the case when the model is a HMM, i.e. when βζ ≡ 0. For this case, the updating formula is analytically tractable for all parameters, and they are given by z  φˆ = 2 ,  z  3  1  z2   2 2  σc = z4 − ,  t z3 (B.9) E  µˆ = ,  V   1  1   Σˆ = KHMM, where KHMM = ∆(i,j) − EiEj .  ij t V

B.3 Summary

The intermediate quantity, which still has $(\phi, \sigma)$ dependence, is given by
$$Q_{\theta_\ell}(\theta) \overset{c.}{=} -\frac{1}{2}\left( t \log \sigma^2 + \frac{1}{\sigma^2}\big( z_1 - 2\phi z_2 + \phi^2 z_3 \big) + t \log \|K\| \right), \quad \text{where}$$
$$K_{ij} = \frac{1}{t}\left\{ \Delta^{(i,j)} - \frac{1}{V} E^i E^j - \left( Z - \frac{S^2}{V} \right)^{-1}\left( A^j - \frac{S}{V} E^j \right)\left( A^i - \frac{S}{V} E^i \right) \right\},$$
and the smoothed sufficient statistics are given in (B.3) and (B.4). The analytic maximisations done for $(\mu, \beta_\zeta, \Sigma)$ can be summarised as
$$\begin{cases}
\hat\Sigma = K, \\[1ex]
\hat\beta_\zeta = \left( Z - \dfrac{S^2}{V} \right)^{-1}\left( A - \dfrac{S}{V} E \right), \\[1ex]
\hat\mu = \dfrac{1}{V}\big( E - \hat\beta_\zeta S \big) = \dfrac{1}{V}\left\{ E - S\left( Z - \dfrac{S^2}{V} \right)^{-1}\left( A - \dfrac{S}{V} E \right) \right\}.
\end{cases} \tag{B.10}$$
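As a compact illustration, the closed-form updates (B.10) translate directly into code. The following R sketch assumes the smoothed sufficient statistics of (B.4) are already available (Delta as a $d \times d$ matrix, E and A as $d$-vectors, and S, V, Z as scalars); the function name is hypothetical.

```r
# Closed-form M-step updates (B.5), (B.6) and Sigma-hat = K, following (B.10).
m_step <- function(Delta, E, A, S, V, Z, t_len) {
  resid     <- A - (S / V) * E                    # A - (S/V) E
  beta_hat  <- resid / (Z - S^2 / V)              # (B.6)
  mu_hat    <- (E - S * beta_hat) / V             # (B.5)
  Sigma_hat <- (Delta - outer(E, E) / V -
                outer(resid, resid) / (Z - S^2 / V)) / t_len  # K in (B.10)
  list(mu = mu_hat, beta_zeta = beta_hat, Sigma = Sigma_hat)
}
```

Since $(\phi, \sigma)$ admit no such closed form here, their update would be handled numerically, e.g. by a gradient step on the remaining part of the intermediate quantity.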

Bibliography

[1] D. S. Bates. Post-'87 crash fears in the S&P 500 futures market. Journal of Econometrics, 94(1–2):181–238, 2000. doi: 10.1016/S0304-4076(99)00021-4.

[2] G. Bekaert and G. Wu. Asymmetric volatility and risk in equity markets. The Review of Financial Studies, 13(1):1–42, 2000. URL http://www.jstor.org/stable/2646079.

[3] F. Black and M. Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, 81(3):637–654, 1973. URL http://www.jstor.org/stable/1831029.

[4] A. Bryson and M. Frazier. Smoothing for linear and nonlinear dynamic systems. In Proceedings of the Optimum System Synthesis Conference, pages 353–364, 1963.

[5] K. P. Burnham and D. R. Anderson. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer, New York, NY, 2002. doi: 10.1007/978-0-387-22456-5.

[6] O. Cappé. Ten years of HMMs (online bibliography 1989–2000), Mar. 2001. URL http://www.tsi.enst.fr/~cappe/docs/hmmbib.html.

[7] O. Cappé, E. Moulines, and T. Rydén. Inference in Hidden Markov Models. Springer Series in Statistics. Springer-Verlag, New York, NY, USA, 2005.

[8] J. E. Cavanaugh and A. A. Neath. Generalizing the derivation of the Schwarz information criterion. Communications in Statistics – Theory and Methods, 28(1):49–66, 1999. doi: 10.1080/03610929908832282.

[9] P. Del Moral, A. Doucet, and S. S. Singh. Forward smoothing using sequential Monte Carlo. ArXiv e-prints, Dec. 2010.

[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977. URL http://www.jstor.org/stable/2984875.

[11] F. Desbouvries and W. Pieczynski. Particle filtering in pairwise and triplet Markov chains. In IEEE–EURASIP Workshop on Nonlinear Signal and Image Processing (NSIP 2003), Grado-Gorizia, pages 8–11, 2003.

[12] R. Douc, A. Garivier, E. Moulines, and J. Olsson. Sequential Monte Carlo smoothing for general state space hidden Markov models. Annals of Applied Probability, 21(6):2109–2145, 2011. doi: 10.1214/10-AAP735.

[13] A. Doucet and A. M. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering, 12(656–704):3, 2009.

[14] A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000. doi: 10.1023/A:1008935410038.

[15] S. J. Godsill, A. Doucet, and M. West. Monte Carlo smoothing for nonlinear time series. Journal of the American Statistical Association, 99(465):156–168, 2004. URL http://www.jstor.org/stable/27590362.

[16] N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F – Radar and Signal Processing, 140(2):107–113, April 1993. doi: 10.1049/ip-f-2.1993.0015.

[17] J. E. Handschin. Monte Carlo techniques for prediction and filtering of non-linear stochastic processes. Automatica, 6(4):555–563, 1970. doi: 10.1016/0005-1098(70)90010-5.

[18] S. L. Heston. A closed-form solution for options with stochastic volatility with applications to bond and currency options. The Review of Financial Studies, 6(2):327–343, 1993. URL http://www.jstor.org/stable/2962057.

[19] J. Hull and A. White. The pricing of options on assets with stochastic volatilities. The Journal of Finance, 42(2):281–300, 1987. URL http://www.jstor.org/stable/2328253.

[20] R. E. Kalman and R. S. Bucy. New results in linear filtering and prediction theory. Transactions of the ASME, Series D, Journal of Basic Engineering, 83:95–107, 1961.

[21] G. Kitagawa and S. Sato. Monte Carlo Smoothing and Self-Organising State-Space Model, pages 177–195. Springer, New York, NY, 2001. doi: 10.1007/978-1-4757-3437-9_9.

[22] A. Kong, J. S. Liu, and W. H. Wong. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425):278–288, 1994. URL http://www.jstor.org/stable/2291224.

[23] S. Konishi and G. Kitagawa. Information Criteria and Statistical Modeling. Springer Science & Business Media, 2008.

[24] K. Lange. A gradient algorithm locally equivalent to the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 57(2):425–437, 1995. URL http://www.jstor.org/stable/2345971.

[25] K. Lange. A quasi-Newton acceleration of the EM algorithm. Statistica Sinica, pages 1–18, 1995.

[26] R. Langrock, I. L. MacDonald, and W. Zucchini. Some nonstandard stochastic volatility models and their estimation using structured hidden Markov models. Journal of Empirical Finance, 19(1):147–161, 2012. doi: 10.1016/j.jempfin.2011.09.003.

[27] Q. Li, J. Yang, C. Hsiao, and Y.-J. Chang. The relationship between stock returns and volatility in international stock markets. Journal of Empirical Finance, 12(5):650–665, 2005. doi: 10.1016/j.jempfin.2005.03.001.

[28] R. J. A. Little and D. B. Rubin. Maximum Likelihood for General Patterns of Missing Data: Introduction and Theory with Ignorable Nonresponse, pages 164–189. John Wiley & Sons, Inc., 2002. doi: 10.1002/9781119013563.ch8.

[29] J. S. Liu. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing, 6(2):113–119, 1996. doi: 10.1007/BF00162521.

[30] MSCI Inc. MSCI Europe Index, 2016. URL https://www.msci.com/resources/factsheets/index_fact_sheet/msci-europe-index.pdf. Last visited 2016-09-03.

[31] MSCI Inc. MSCI Japan Index, 2016. URL https://www.msci.com/resources/factsheets/index_fact_sheet/msci-japan-index.pdf. Last visited 2016-09-03.

[32] MSCI Inc. MSCI Sweden Index, 2016. URL https://www.msci.com/resources/factsheets/index_fact_sheet/msci-sweden-index.pdf. Last visited 2016-09-03.

[33] MSCI Inc. MSCI USA Index, 2016. URL https://www.msci.com/resources/factsheets/index_fact_sheet/msci-usa-index-gross.pdf. Last visited 2016-09-03.

[34] MSCI Inc. MSCI World Index, 2016. URL https://www.msci.com/resources/factsheets/index_fact_sheet/msci-world-index.pdf. Last visited 2016-09-03.

[35] J. Olsson. Lecture notes in Computer Intensive Methods for Mathematical Statistics (SF2955), 2015. URL https://www.math.kth.se/matstat/gru/sf2955/2016/material/L7.pdf. Last visited 2016-08-14.

[36] J. Olsson and J. Westerborn. Efficient particle-based online smoothing in general hidden Markov models: the PaRIS algorithm. ArXiv e-prints, Dec. 2014.

[37] J. Olsson, O. Cappé, R. Douc, and É. Moulines. Sequential Monte Carlo smoothing with application to parameter estimation in nonlinear state space models. Bernoulli, 14(1):155–179, 2008. doi: 10.3150/07-BEJ6150.

[38] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook, Nov. 2012. URL http://www2.imm.dtu.dk/pubdb/p.php?3274. Version 20121115.

[39] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2015. URL http://www.R-project.org/.

[40] R. Sundberg. Maximum likelihood theory for incomplete data from an exponential family. Scandinavian Journal of Statistics, 1(2):49–58, 1974. URL http://www.jstor.org/stable/4615553.

[41] E. Taghavi, F. Lindsten, L. Svensson, and T. B. Schön. Adaptive stopping for fast particle smoothing. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6293–6297, May 2013. doi: 10.1109/ICASSP.2013.6638876.

[42] D. M. Titterington. Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society, Series B (Methodological), 46(2):257–267, 1984. URL http://www.jstor.org/stable/2345509.

[43] L. R. Welch. Hidden Markov models and the Baum–Welch algorithm. IEEE Information Theory Society Newsletter, 53(4), Dec. 2003. URL http://www.itsoc.org/publications/nltr/it_dec_03final.pdf.

[44] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag, New York, 2009. URL http://ggplot2.org.

[45] C. F. J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103, 1983. URL http://www.jstor.org/stable/2240463.
