Exact likelihood inference for autoregressive gamma stochastic volatility models¹

Drew D. Creal Booth School of Business, University of Chicago

First draft: April 10, 2012 Revised: September 12, 2012

Abstract

Affine stochastic volatility models are widely applicable and appear regularly in empirical finance and macroeconomics. The likelihood function for this class of models is in the form of a high-dimensional integral that does not have a closed-form solution and is difficult to compute accurately. This paper develops a method to compute the likelihood function for discrete-time models that is accurate up to computer tolerance. The key insight is to integrate out the continuous-valued latent variance analytically, leaving only a latent discrete variable whose support is over the non-negative integers. The likelihood can be computed using standard formulas for Markov-switching models. The maximum likelihood estimator of the unknown parameters can therefore be calculated, as well as the filtered and smoothed estimates of the latent variance.

Keywords: stochastic volatility; Markov-switching; Cox-Ingersoll-Ross; affine models; autoregressive gamma.

JEL classification codes: C32, G32.

¹I would like to thank Alan Bester, Chris Hansen, Robert Gramacy, Siem Jan Koopman, Neil Shephard, and Jing Cynthia Wu as well as seminar participants at the Triangle Econometrics Workshop (UNC Chapel Hill), University of Washington, Northwestern University Kellogg School of Management, and Erasmus Universiteit Rotterdam for their helpful comments. Correspondence: Drew D. Creal, University of Chicago Booth School of Business, 5807 South Woodlawn Avenue, Chicago, IL 60637. Email: [email protected]

1 Introduction

Stochastic volatility models are essential tools in finance and increasingly in macroeconomics for modeling the uncertain variation of asset prices and macroeconomic time series. There exists a literature in econometrics on estimation of both discrete and continuous-time stochastic volatility (SV) models; see, e.g. Shephard (2005) for a survey. The challenge when estimating SV models stems from the fact that the likelihood function of the model is a high-dimensional integral that does not have a closed-form solution. The contribution of this paper is the development of a method that calculates the likelihood function of the model exactly up to computer tolerance for a class of SV models whose stochastic variance follows an autoregressive gamma process, the discrete-time version of the Cox, Ingersoll, and Ross (1985) process. This class of models is popular in finance because it belongs to the affine family, which produces closed-form formulas for derivatives prices; see, e.g. Heston (1993). A second contribution of this paper is to illustrate how the ideas developed here can be applied to other affine models, including stochastic intensity, stochastic duration, and stochastic transition models.

Linear, Gaussian state space models and finite state Markov-switching models are a cornerstone of research in macroeconomics and finance because the likelihood function for these state space models is known; see, e.g. Kalman (1960) and Hamilton (1989). SV models are non-linear, non-Gaussian state space models which in general do not have closed-form likelihood functions. This has led to a literature developing methods to approximate the maximum likelihood estimator, including importance sampling (Danielsson (1994), Durbin and Koopman (1997), Durham and Gallant (2002)), particle filters (Fernández-Villaverde and Rubio-Ramírez (2007)), and Markov chain Monte Carlo (Jacquier, Johannes, and Polson (2007)). Alternatively, estimation methods have also been developed that do not directly compute the likelihood, including Bayesian methods (Jacquier, Polson, and Rossi (1994), Kim, Shephard, and Chib (1998)), the generalized method of moments, simulated method of moments (Duffie and Singleton (1993)), efficient method of moments (Gallant and Tauchen (1996)), and indirect inference (Gouriéroux, Monfort, and Renault (1993)).

The results in this paper follow from the properties of the transition density of the Cox, Ingersoll, and Ross (1985) (CIR) process, which in discrete time is known as the autoregressive gamma (AG) process (Gouriéroux and Jasiak (2006)). The transition density of the CIR/AG process is a non-central gamma (non-central chi-squared up to scale) distribution, which is a Poisson mixture of gamma random variables. The CIR/AG process has two state variables. One is the continuous-valued variance ht, which is conditionally a gamma random variable, and the other is a discrete mixing variable zt, which is conditionally a Poisson random variable. The approach to computing the likelihood function here relies on two key insights about the model. First, it is possible to integrate out the continuous-valued state variable ht analytically. Second, an interesting feature of the CIR/AG process is that once one conditions on the discrete mixing variable zt and the data yt, the variance ht is independent of its own leads and lags. The combination of these properties means that the original SV model can be reformulated as a state space model whose only state variable is discrete-valued with support over the set of non-negative integers. I show that the Markov transition distribution of the discrete variable zt is a Sichel distribution (Sichel (1974, 1975)). Like the Poisson distribution from which it is built, the Sichel distribution assigns increasingly small probability mass to large integers and the probabilities eventually converge to zero. Although the support of the state variable zt is technically infinite, it is finite for all practical purposes.

In this paper, I propose to compute the likelihood for the affine SV model by representing it as a finite state Markov-switching model. The likelihood of a finite-state Markov switching model can be computed from known recursions; see, e.g. Baum and Petrie (1966), Baum, Petrie, Soules, and Weiss (1970), and Hamilton (1989). In addition to computing the maximum likelihood estimator, I also show how to compute moments and quantiles of the filtered and smoothed distributions of the latent variance ht. The filtering distribution characterizes the information about the state variable at time t given information up to and including time t, whereas the smoothing distribution conditions on all the available data.

Solving the filtering and smoothing problem for the variance ht reduces to a two-step problem. First, the marginal filtering and smoothing distributions for the discrete variable zt are calculated using Markov-switching algorithms. Then, the marginal filtering and smoothing distributions of the variance ht conditional on zt and the data yt are shown to be generalized inverse Gaussian (GIG) distributions whose moments can be computed analytically. Moments and quantiles of the filtering distribution for the continuous variable ht can be computed by averaging the moments and quantiles of the conditional GIG distribution by the marginal probabilities of zt calculated from the Markov-switching algorithm.

Another approach to approximating the likelihood and filtering distributions for this class of models is sequential Monte Carlo, also known as particle filtering. Particle filters were introduced into the economics literature by Kim, Shephard, and Chib (1998). Creal (2012a) provides a recent review of the literature on particle filters. In this paper, I describe how the filtering recursions based on the finite state Markov-switching model are related to the particle filter. Essentially, the finite state Markov-switching model can be interpreted as a particle filter in which no Monte Carlo methods are involved.

This paper continues as follows. In Section 2, I describe the autoregressive gamma stochastic volatility model and demonstrate how to reformulate it with the latent variance integrated out. The model reduces to a Markov-switching model, whose transition distribution is characterized. In Section 3, I provide the details of the filtering and smoothing algorithms and a discussion of how to compute them efficiently. In Section 4, I compare standard particle filters to the new algorithms and illustrate their relative accuracy. The new methods are then applied to several financial time series. Section 5 concludes.

2 Autoregressive gamma stochastic volatility models

The discrete time autoregressive gamma stochastic volatility (AG-SV) model for an observable time series yt is given by

$$y_t = \mu + \beta h_t + \sqrt{h_t}\,\varepsilon_t, \qquad \varepsilon_t \sim N(0,1), \tag{1}$$

$$h_t \sim \text{Gamma}\left(\nu + z_t,\, c\right), \tag{2}$$

$$z_t \sim \text{Poisson}\left(\frac{\phi h_{t-1}}{c}\right), \tag{3}$$

where µ controls the location, β determines the risk premium, and φ is the first-order autocorrelation of ht. The restriction φ < 1 is required for stationarity. The AG-SV model has two state variables ht and zt, where the latter can be regarded as an "auxiliary" variable in the sense that it is not of direct interest. From the relationship to the CIR process discussed below, it is possible to deduce the restriction ν > 1, which is the well-known Feller condition that guarantees that the process ht never reaches zero. The model is completed by specifying an initial condition that h0 is a draw from the stationary distribution $h_0 \sim \text{Gamma}\left(\nu, \tfrac{c}{1-\phi}\right)$ with mean $E[h_0] = \tfrac{\nu c}{1-\phi}$.
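To make the data-generating process in (1)-(3) concrete, the following sketch simulates a path from the AG-SV model. It is an illustrative implementation, and the parameter values in the usage line are placeholders rather than estimates from the paper.

```python
import numpy as np

def simulate_ag_sv(T, mu, beta, nu, phi, c, seed=0):
    """Simulate y_1,...,y_T from the AG-SV model in equations (1)-(3).

    Gamma(a, c) is parameterized by shape a and scale c, so that
    E[h_t | z_t] = (nu + z_t)*c, matching E[h_t | h_{t-1}] = nu*c + phi*h_{t-1}.
    """
    rng = np.random.default_rng(seed)
    # initial condition: stationary distribution h_0 ~ Gamma(nu, c/(1 - phi))
    h = rng.gamma(shape=nu, scale=c / (1.0 - phi))
    y = np.empty(T)
    for t in range(T):
        z = rng.poisson(phi * h / c)                               # equation (3)
        h = rng.gamma(shape=nu + z, scale=c)                       # equation (2)
        y[t] = mu + beta * h + np.sqrt(h) * rng.standard_normal()  # equation (1)
    return y

# illustrative parameter values (not the paper's estimates)
y = simulate_ag_sv(T=1000, mu=0.1, beta=-0.05, nu=1.5, phi=0.98, c=0.02)
```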

The process for ht given by (2) and (3) is known as the autoregressive gamma (AG) process of Gouriéroux and Jasiak (2006), who developed many of its properties. In particular, they illustrated how it is related to the continuous-time model of Cox, Ingersoll, and Ross (1985) for the short term interest rate. Consider an interval of time of length τ. The autoregressive gamma process (2) and (3) converges in the continuous-time limit as τ → 0 to

$$dh_t = \kappa\left(\theta_h - h_t\right)dt + \sigma\sqrt{h_t}\,dW_t, \tag{4}$$

where the discrete- and continuous-time parameters are related as $\phi = \exp(-\kappa\tau)$, $\nu = 2\kappa\theta_h/\sigma^2$, and $c = \sigma^2\left[1 - \exp(-\kappa\tau)\right]/(2\kappa)$. In the continuous-time parameterization, the unconditional mean of the variance is θh and κ controls the speed of mean reversion. The continuous-time process (4) was originally analyzed by Feller (1951) and is sometimes called a Feller or square-root process. The model is a staple of the literature in continuous-time finance because it is a subset of the class of affine models, which have closed-form formulas for derivatives prices; see, e.g. Heston (1993).

A stronger property than convergence in the continuous-time limit is that the AG and CIR processes have the same transition density for any time interval τ. Integrating zt out of (2) and (3), the transition density of ht is

$$p(h_t \mid h_{t-1};\theta) = \frac{1}{c^{\nu}}\, h_t^{\nu-1} \exp\left(-\frac{h_t + \phi h_{t-1}}{c}\right) \sum_{k=0}^{\infty} \frac{h_t^{k}\,\left(\phi h_{t-1}/c\right)^{k}}{c^{k}\,\Gamma(\nu+k)\,k!}, \tag{5}$$

which is a non-central gamma (non-central chi-squared up to scale) distribution. The conditional mean and variance of this distribution can easily be derived using properties of the gamma and Poisson distributions and are

$$E[h_t \mid h_{t-1}] = \nu c + \phi h_{t-1}, \tag{6}$$

$$V[h_t \mid h_{t-1}] = \nu c^2 + 2c\phi h_{t-1}. \tag{7}$$

The conditional mean (6) is a linear function of ht−1, which implies that φ is the autocorrelation parameter. The conditional variance (7) is also a linear function of ht−1, making the variance of the variance change through time. This is a property of the AG/CIR model that the log-normal SV model, specified as

$$y_t = \gamma_y + \exp\left(h_t/2\right)\varepsilon_t, \qquad \varepsilon_t \sim N(0,1),$$

$$h_t = \mu_y + \phi_y\left(h_{t-1} - \mu_y\right) + \sigma_y\eta_t, \qquad \eta_t \sim N(0,1),$$

does not share; see, e.g. Shephard (2005). I let θ = (µ, β, ν, φ, c) denote the vector of unknown parameters of the model. Throughout the paper, I use the notation $x_{t-k:t}$ to denote a sequence of variables $(x_{t-k}, \ldots, x_t)$, which may be either observed or latent.

Estimating the parameter vector θ is challenging because the likelihood of the model p(y1:T ; θ) is in the form of a high-dimensional integral

$$p(y_{1:T};\theta) = \int_0^{\infty} \cdots \int_0^{\infty} \sum_{z_T=0}^{\infty} \cdots \sum_{z_1=0}^{\infty} \left[\prod_{t=1}^{T} p(y_t \mid h_t;\theta)\, p(h_t \mid z_t;\theta)\, p(z_t \mid h_{t-1};\theta)\right] p(h_0;\theta)\, dh_0 \cdots dh_T.$$

For linear, Gaussian state space models and finite state Markov-switching models, this integral is solved recursively beginning at the initial iteration using the prediction error decomposition; see, e.g. Durbin and Koopman (2001) and Frühwirth-Schnatter (2006). The solutions to these integrals are the well-known Kalman filter and the filter for regime switching models; see Kalman (1960) and Hamilton (1989), respectively. It is the difficulty of computing this integral in closed form for SV models that led researchers to search for approximations to the likelihood function, primarily by Monte Carlo methods. The continuous-time CIR-SV model (possibly with jumps and leverage) has been estimated in applications for the term structure of interest rates and option pricing. Versions of the continuous-time model appear in, for example, Eraker, Johannes, and Polson (2003), Chib, Pitt, and Shephard (2006), and Jacquier, Johannes, and Polson (2007). Most work in the literature estimates the continuous-time version of the model using an Euler discretization (conditional normal approximation) for the dynamics of the variance instead of the exact non-central gamma distribution.

Developing recursions that can accurately compute the likelihood function for the AG-SV model relies on recognizing two key features of the model. First, it is possible to integrate the variance ht out of the model analytically. The only unobserved state variable remaining is the auxiliary variable zt, whose Markov transition distribution can be characterized. Second, the AG-SV model has a unique dependence structure where ht depends on ht−1 only through zt. Combining these features of the model makes the results of this paper possible. Integrating ht out of the model is a result of the following proposition.

Proposition 1 The conditional likelihood p(yt|zt; θ), transition distribution p(zt|zt−1, y1:t−1; θ), and initial distribution are

$$p(y_t \mid z_t;\theta) = \text{Normal Gamma}\left(\mu,\ \sqrt{\tfrac{2}{c} + \beta^2},\ \beta,\ \nu + z_t\right) \tag{8}$$

$$p(z_t \mid z_{t-1}, y_{1:t-1};\theta) = \text{Sichel}\left(\nu + z_{t-1} - \tfrac{1}{2},\ \tfrac{\phi}{c}\left(y_{t-1} - \mu\right)^2,\ \tfrac{c}{\phi}\left(\tfrac{2}{c} + \beta^2\right)\right) \tag{9}$$

$$p(z_1;\theta) = \text{Negative Binomial}\left(\nu,\ \phi\right) \tag{10}$$

Definitions for the distributions are available in Appendix A.1.

Proof: See Appendix A.2.

The proposition illustrates that the variance ht can always be integrated out of the model analytically by conditioning on the data yt and the auxiliary variable zt from only a single neighboring time period. This is due to the unique dependence structure of the AG-SV model where ht depends on ht−1 only through zt and not directly. This dependence structure is atypical for state space models with both discrete and continuous latent variables.

It is informative to juxtapose the dependence structure of the AG-SV model with the class of linear, Gaussian state space models with Markov switching (see, e.g. Shephard (1994), Kim and Nelson (1999), and Frühwirth-Schnatter (2006)), which are popular in economics. In the latter class of models, the continuous-valued state variable remains dependent on its own lag even after conditioning on the discrete variable. In order to make the distributions conditionally Gaussian and consequently analytically integrable via the Kalman filter, one needs to condition on all past discrete state variables at every iteration of the algorithm. The number of discrete state variables necessary to be integrated out grows exponentially over time. Although the likelihood can in principle be computed exactly, the computational demands of an exact solution increase with time as the number of observations increases. This is not the case for models which have a dependence structure like the AG-SV model. As ht only depends on zt and not additionally on ht−1, the latent variance ht can be integrated out analytically at each time period without having to condition on all past lags of zt.

Proposition 1 implies that the AG-SV model can be reduced to a new state space model where the observation density p(yt|zt; θ) is a normal gamma (NG) distribution with a Markov-switching state variable zt. The nomenclature for the normal gamma (NG) distribution used here follows Barndorff-Nielsen and Shephard (2012), but this distribution is also known in the literature as the variance gamma (VG) distribution as well as the generalized (asymmetric) Laplace distribution; for its properties see, e.g. Kotz, Kozubowski, and Podgórski (2001). The symmetric version (β = 0) of the distribution was introduced into economics by Madan and Seneta (1990) and subsequently extended to the asymmetric distribution (β ≠ 0) by Madan, Carr, and Chang (1998). The NG model for asset returns (without the Markov-switching variable zt) is popular among practitioners. Like the CIR-SV model, it has closed-form solutions for option prices. It is often advocated as an alternative to the CIR-SV model. This result illustrates how over discrete time intervals the CIR-SV model generalizes the NG model.

The state variable zt has a non-homogeneous Markov transition kernel p(zt|zt−1, yt−1; θ) known as a Sichel distribution; see, e.g. Sichel (1974, 1975). A Sichel distribution is a Poisson distribution whose mean is a random draw from a generalized inverse Gaussian (GIG) distribution; see Appendix A.1 for details. The support of a Sichel random variable is therefore the set of non-negative integers. After reformulating the AG-SV model as a Markov switching model, the challenge is now to compute the likelihood function p(y1:T ; θ) of an infinite dimensional Markov switching model.

Given an initial distribution p(z1; θ), a closed-form solution to this problem would recursively solve the integrals

$$p(z_t = i \mid y_{1:t};\theta) = \frac{p(y_t \mid z_t = i;\theta)\, p(z_t = i \mid y_{1:t-1};\theta)}{\sum_{j=0}^{\infty} p(y_t \mid z_t = j;\theta)\, p(z_t = j \mid y_{1:t-1};\theta)}, \tag{11}$$

$$p(z_{t+1} = i \mid y_{1:t};\theta) = \sum_{j=0}^{\infty} p(z_{t+1} = i \mid z_t = j, y_t;\theta)\, p(z_t = j \mid y_{1:t};\theta). \tag{12}$$

The contribution to the likelihood function for the current observation is the term $p(y_t \mid y_{1:t-1};\theta) = \sum_{j=0}^{\infty} p(y_t \mid z_t = j;\theta)\, p(z_t = j \mid y_{1:t-1};\theta)$ in the denominator of (11). The infinite sums over zt unfortunately do not have closed-form solutions. However, the stationarity of the model ensures that there always exists an integer $Z_{\theta,y_t}$ such that the probability $p(z_t = Z_{\theta,y_t} \mid y_{1:t};\theta)$ is zero up to machine tolerance on a computer. For all practical purposes, the AG-SV model is a finite state Markov switching model, which means that the likelihood as well as moments and quantiles can be computed accurately to a high degree of tolerance. In Section 3, I propose methods for computing solutions for this model.

In addition to calculating the likelihood function, estimates of the latent variance ht are also of interest and these are characterized by the posterior distributions p(ht|y1:t; θ), p(ht+1|y1:t; θ), and p(ht|y1:T ; θ). These distributions summarize the information about the latent state variable ht based on different information sets. Estimates of the latent variance ht rely on the results of the following proposition.

Proposition 2 Conditional on the discrete variable and the data, the predictive p(ht+1|z1:t+1, y1:t; θ), marginal filtering p(ht|z1:t, y1:t; θ), and marginal smoothing p(ht|z1:T , y1:T ; θ) and p(zt|h1:T , y1:T ; θ) distributions, as well as the distribution of the initial condition, are

$$p(h_{t+1} \mid z_{1:t+1}, y_{1:t};\theta) = \text{Gamma}\left(\nu + z_{t+1},\ c\right) \tag{13}$$

$$p(h_t \mid z_{1:t}, y_{1:t};\theta) = \text{GIG}\left(\nu + z_t - \tfrac{1}{2},\ \left(y_t - \mu\right)^2,\ \tfrac{2}{c} + \beta^2\right) \tag{14}$$

$$p(h_t \mid z_{1:T}, y_{1:T};\theta) = \text{GIG}\left(\nu + z_t + z_{t+1} - \tfrac{1}{2},\ \left(y_t - \mu\right)^2,\ \tfrac{2(1+\phi)}{c} + \beta^2\right) \tag{15}$$

$$p(z_t \mid h_{1:T}, y_{1:T};\theta) = \text{Bessel}\left(\nu - 1,\ 2\sqrt{\tfrac{\phi h_t h_{t-1}}{c^2}}\right) \tag{16}$$

$$p(h_0 \mid z_{1:T}, y_{1:T};\theta) = \text{Gamma}\left(\nu + z_1,\ c\right) \tag{17}$$

Definitions for the generalized inverse Gaussian (GIG) and Bessel distributions are available in Appendix A.1.

Proof: See Appendix A.3.

Conditional on the discrete mixing variable and the data, the marginal filtering and smoothing distributions for the variance ht are either generalized inverse Gaussian (GIG) or gamma distributions. Useful properties of GIG random variables that are needed for computational purposes are discussed in Appendix A.1. The fact that the conditional distribution p(zt|h1:T , y1:T ; θ) is a Bessel distribution was implicitly stated in Gouriéroux and Jasiak (2006). This result will not be used in this paper. However, this proposition states that all the conditional distributions of the latent variables given the observed variables are recognizable distributions. This implies that the latent variables can easily be imputed using Gibbs sampling with no Metropolis-Hastings steps, which is useful for any researchers interested in estimating the model via Bayesian methods.

It is possible to extend the model to include more general dynamics and densities than in (1)-(3), while still retaining tractability. The simplest and most important extension is to allow either or both of the measurement and transition densities to depend on an additional discrete state variable that takes on a finite set of possible values. For example, the model can be extended to allow the measurement density (1) to be a finite mixture of normals, or the dynamics of the variance ht may be generalized to allow the transition densities to depend on a finite state Markov-switching variable. Extensions of this kind can also be computed with no loss in accuracy, although they do require additional computations.

Another important contribution of this paper is to recognize that the ideas developed here can be used to analyze other models with continuous and discrete state variables whose dependence structure is like the CIR/AG dynamics. It is only necessary to analytically calculate the distributions p(yt|zt; θ) and p(zt|zt−1, y1:t−1; θ). For example, suppose the state variable ht follows an autoregressive gamma process as in (2)-(3). Then, additional models for which it is possible to calculate the likelihood function are those where the distribution p(yt|ht; θ) is Poisson with time-varying mean ht, exponential with time-varying intensity ht, or gamma with time-varying mean ht. The exponential model with time-varying intensity is an important model in the literature on credit risk; see, e.g. Duffie and Singleton (2003) and Duffie, Eckner, Horel, and Saita (2009). These results are developed further in Creal (2012b), who builds methods for Cox processes with CIR intensities, marked Poisson processes with CIR intensities, and duration models built from the AG process.

3 Filtering and smoothing recursions

In this section, I describe different methods for computing the likelihood as well as the filtered and smoothed estimates of the state variables. First, I consider deterministic schemes based on finite-state Markov switching models and then I discuss Monte Carlo methods based on the particle filter.

3.1 Filtering recursions from finite-state Markov switching models

The infinite dimensional Markov switching model implied by the AG-SV model has an exact filtering distribution described by the infinite dimensional vector p(zt|y1:t; θ). This vector can be accurately represented using a finite dimensional $(Z_{\theta,y_t} + 1) \times 1$ vector of probabilities $\hat{p}(z_t|y_{1:t};\theta)$ whose points of support are the integers from 0 to $Z_{\theta,y_t}$. The Markov transition distribution p(zt|zt−1, y1:t−1; θ) can also be represented by a $(Z_{\theta,y_t} + 1) \times (Z_{\theta,y_t} + 1)$ transition matrix $\hat{p}(z_t|z_{t-1}, y_{1:t-1};\theta)$. Both approximations become equivalent to the estimates produced by a closed-form solution as $Z_{\theta,y_t} \to \infty$. Computing the probabilities for extremely large values of zt is, however, computationally wasteful as these values are numerically zero. To obtain a fixed precision across the parameter space, the integer $Z_{\theta,y_t}$ at which the distributions are truncated should be a function of the parameters of the model θ and the data yt. A discussion on how to determine the value of $Z_{\theta,y_t}$ is provided further below. In the following, I suppress the dependence of $Z_{\theta,y_t}$ on θ and yt for notational convenience.

The recursions for the marginal filtering and one-step ahead predictive distributions proceed forward in time from t = 1, . . . , T as follows. The distributions for the initial iteration are computed analytically in the following proposition, assuming that the initial condition is a draw from the stationary distribution $h_0 \sim \text{Gamma}\left(\nu, \tfrac{c}{1-\phi}\right)$.

Proposition 3 The initial contribution to the likelihood p(y1; θ), the marginal filtering distributions p(z1|y1; θ) and p(h1|y1; θ), and the predictive distribution p(z2|y1; θ) are

$$p(y_1;\theta) = \text{Normal Gamma}\left(\mu,\ \sqrt{\tfrac{2(1-\phi)}{c} + \beta^2},\ \beta,\ \nu\right)$$

$$p(h_1 \mid y_1;\theta) = \text{GIG}\left(\nu - \tfrac{1}{2},\ \left(y_1 - \mu\right)^2,\ \tfrac{2(1-\phi)}{c} + \beta^2\right)$$

$$p(z_1 \mid y_1;\theta) = \text{Sichel}\left(\nu - \tfrac{1}{2},\ \tfrac{\phi}{c}\left(y_1 - \mu\right)^2,\ \tfrac{c}{\phi}\left(\tfrac{2(1-\phi)}{c} + \beta^2\right)\right)$$

$$p(z_2 \mid y_1;\theta) = \text{Sichel}\left(\nu - \tfrac{1}{2},\ \tfrac{\phi}{c}\left(y_1 - \mu\right)^2,\ \tfrac{c}{\phi}\left(\tfrac{2(1-\phi)}{c} + \beta^2\right)\right) \tag{18}$$

Proof: See Appendix A.4.

After the first iteration, the distributions cannot be solved analytically because the integrals involving zt have an unknown solution. In this section, I describe how the infinite dimensional Markov-switching model can be approximated by a finite-state Markov-switching model. The likelihood function can therefore be computed using known filtering recursions; see, e.g. Baum and Petrie (1966), Baum, Petrie, Soules, and Weiss (1970), and Hamilton (1989). It is consequently straightforward to estimate the unknown parameters θ by maximum likelihood. The marginal filtering p(zt|y1:t; θ), one-step ahead predictive p(zt+1|y1:t; θ), and smoothing distributions p(zt|y1:T ; θ) can also be computed from these recursions.

Proposition 3 provides a way to initialize the filtering algorithm for a finite-state Markov switching model with Z + 1 states. Using the Sichel distribution (18), compute the (Z + 1) × 1 vector of predictive probabilities $\hat{p}(z_2|y_1;\theta)$. Then, for t = 2, . . . , T, compute the filtering probabilities $\hat{p}(z_t|y_{1:t};\theta)$ and the predictive probabilities $\hat{p}(z_{t+1}|y_{1:t};\theta)$ recursively as

$$\hat{p}(z_t = i \mid y_{1:t};\theta) = \frac{p(y_t \mid z_t = i;\theta)\, \hat{p}(z_t = i \mid y_{1:t-1};\theta)}{\sum_{j=0}^{Z} p(y_t \mid z_t = j;\theta)\, \hat{p}(z_t = j \mid y_{1:t-1};\theta)} \tag{19}$$

$$\hat{p}(z_{t+1} = i \mid y_{1:t};\theta) = \sum_{j=0}^{Z} \hat{p}(z_{t+1} = i \mid z_t = j, y_t;\theta)\, \hat{p}(z_t = j \mid y_{1:t};\theta) \tag{20}$$

The term $\hat{p}(y_t \mid y_{1:t-1};\theta) = \sum_{j=0}^{Z} p(y_t \mid z_t = j;\theta)\, \hat{p}(z_t = j \mid y_{1:t-1};\theta)$ is the contribution to the likelihood function for the current observation. The log-likelihood function to be maximized is given by

$$\log \hat{p}(y_{1:T};\theta) = \sum_{t=2}^{T} \log \hat{p}(y_t \mid y_{1:t-1};\theta) + \log p(y_1;\theta)$$

Further discussion on Markov-switching algorithms can be found in Hamilton (1994), Cappé, Moulines, and Rydén (2005), and Frühwirth-Schnatter (2006).
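To make the recursions (19)-(20) concrete, the sketch below computes the filtered probabilities, the predictive probabilities, and the log-likelihood for a given truncation level Z. The arrays obs_lik and trans and the initial predictive vector are assumed to be supplied by the user (evaluated, for example, from the normal gamma and Sichel densities of Propositions 1 and 3); for simplicity the recursion is started at t = 1 from an initial predictive distribution for z_1, such as the stationary negative binomial in (10), rather than from the exact t = 1 expressions of Proposition 3.

```python
import numpy as np

def switching_filter(obs_lik, trans, p_pred_init):
    """Finite-state filtering recursions (19)-(20) for the AG-SV model.

    obs_lik     : (T, Z+1) array with obs_lik[t, j] = p(y_t | z_t = j; theta)
    trans       : (T, Z+1, Z+1) array with trans[t, i, j] = p-hat(z_{t+1} = i | z_t = j, y_t; theta)
    p_pred_init : (Z+1,) array of initial predictive probabilities for z_1

    Returns the filtered probabilities, the predictive probabilities, and the log-likelihood.
    """
    T, n_states = obs_lik.shape
    p_filt = np.empty((T, n_states))
    p_pred = np.empty((T, n_states))
    p_pred[0] = p_pred_init
    loglik = 0.0
    for t in range(T):
        num = obs_lik[t] * p_pred[t]
        lik_t = num.sum()                  # p-hat(y_t | y_{1:t-1}; theta), the likelihood contribution
        loglik += np.log(lik_t)
        p_filt[t] = num / lik_t            # filtering step (19)
        if t < T - 1:
            p_pred[t + 1] = trans[t] @ p_filt[t]   # prediction step (20)
    return p_filt, p_pred, loglik
```

Because the transition matrices are sparse and their entries decay quickly, the matrix-vector product in the prediction step can in practice be restricted to the entries that contribute materially to the sum, as discussed later in this section.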

The Markov switching algorithms provide estimates of the auxiliary variable zt, but in applications interest centers on estimation of the latent variance ht through filtering and smoothing algorithms similar to those of the linear, Gaussian state space model; see, e.g. Durbin and Koopman (2001). The marginal filtering and smoothing distributions for ht are not known analytically, but moments and quantiles of the distributions can be accurately computed. Computing these values starts by decomposing the joint filtering distribution as

$$p(h_t, z_t \mid y_{1:t};\theta) = p(h_t \mid y_{1:t}, z_t;\theta)\, p(z_t \mid y_{1:t};\theta) \approx p(h_t \mid y_t, z_t;\theta)\, \hat{p}(z_t \mid y_{1:t};\theta).$$

Moments and quantiles of the marginal distribution p(ht|y1:t; θ) can consequently be computed by weighting the expectations of the conditional distribution p(ht|zt, yt; θ) by the discrete probabilities p(zt|y1:t; θ) calculated from the Markov switching algorithm. Estimates of the moments of the marginal filtering and predictive distributions for the variance are

$$\hat{E}[h_t^{\alpha} \mid y_{1:t};\theta] = \sum_{z_t=0}^{Z} E[h_t^{\alpha} \mid y_t, z_t;\theta]\, \hat{p}(z_t \mid y_{1:t};\theta) \tag{21}$$

$$\hat{E}[h_{t+1}^{\alpha} \mid y_{1:t};\theta] = \sum_{z_{t+1}=0}^{Z} E[h_{t+1}^{\alpha} \mid z_{t+1};\theta]\, \hat{p}(z_{t+1} \mid y_{1:t};\theta) \tag{22}$$

where the orders of integration for zt and ht have been exchanged. In these expressions, the conditional expectations $E[h_t^{\alpha} \mid y_t, z_t;\theta]$ and $E[h_{t+1}^{\alpha} \mid z_{t+1};\theta]$ are with respect to the GIG and gamma distributions from Proposition 2. These can be calculated analytically for any value of zt. As Z → ∞, the solution converges to the exact values, whereas numerically accurate estimates can be computed for finite values of Z due to the fact that the probabilities p(zt|y1:t; θ) converge to zero for large integer values. If interest centers on the quantiles of the marginal filtering or predictive distributions of the variance, these can be computed analogously as
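As an illustration of how the conditional expectations in (21) can be evaluated, the sketch below assumes the GIG(λ, δ², γ²) parameterization with density proportional to x^(λ−1) exp(−(δ²/x + γ²x)/2), for which E[X^α] = (δ/γ)^α K_{λ+α}(δγ)/K_λ(δγ) with K the modified Bessel function of the second kind. This parameterization and moment formula are my reading of Proposition 2 and Appendix A.1, not something stated on this page.

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind

def gig_moment(alpha, lam, delta2, gamma2):
    """E[X^alpha] for X ~ GIG(lam, delta2, gamma2) with density proportional to
    x^(lam-1) * exp(-0.5*(delta2/x + gamma2*x)); assumes delta2 > 0.
    For very large orders, scaled or logged Bessel evaluations are preferable."""
    delta, gamma = np.sqrt(delta2), np.sqrt(gamma2)
    return (delta / gamma) ** alpha * kv(lam + alpha, delta * gamma) / kv(lam, delta * gamma)

def filtered_variance_mean(y_t, p_filt_t, mu, beta, nu, c):
    """E-hat[h_t | y_{1:t}] via (21): mix the conditional GIG means implied by (14)
    over the filtered probabilities p_filt_t[z] = p-hat(z_t = z | y_{1:t})."""
    z = np.arange(len(p_filt_t))
    lam = nu + z - 0.5
    delta2 = (y_t - mu) ** 2          # degenerate if y_t equals mu exactly
    gamma2 = 2.0 / c + beta ** 2
    cond_means = gig_moment(1.0, lam, delta2, gamma2)
    return np.sum(cond_means * p_filt_t)
```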

$$\hat{P}(h_t < x \mid y_{1:t};\theta) = \sum_{z_t=0}^{Z} \left[\int_0^x p(h_t \mid y_t, z_t;\theta)\, dh_t\right] \hat{p}(z_t \mid y_{1:t};\theta) \tag{23}$$

$$\hat{P}(h_{t+1} < x \mid y_{1:t};\theta) = \sum_{z_{t+1}=0}^{Z} \left[\int_0^x p(h_{t+1} \mid z_{t+1};\theta)\, dh_{t+1}\right] \hat{p}(z_{t+1} \mid y_{1:t};\theta) \tag{24}$$

where the terms in brackets are the cumulative distribution functions of the GIG and gamma distributions from Proposition 2. For a given quantile and with a fixed value of $\hat{p}(z_t|y_{1:t};\theta)$ or $\hat{p}(z_{t+1}|y_{1:t};\theta)$, the value of x in (23) and (24) can be determined with a simple zero-finding algorithm. Similar ideas can be applied to estimate moments and quantiles of the marginal smoothing distributions, which are also GIG and gamma distributions as shown in Proposition 2. Further details are discussed in Section 3.2.
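The zero-finding step is straightforward. The sketch below solves (24) for the predictive case, where each bracketed term is the cumulative distribution function of a Gamma(ν + z_{t+1}, c) distribution; the use of scipy's gamma CDF and Brent's method is an implementation choice rather than something prescribed by the paper.

```python
import numpy as np
from scipy.stats import gamma
from scipy.optimize import brentq

def predictive_quantile(q, p_pred, nu, c):
    """Solve P-hat(h_{t+1} < x | y_{1:t}) = q in (24), where
    p_pred[z] = p-hat(z_{t+1} = z | y_{1:t}) and h_{t+1} | z_{t+1} ~ Gamma(nu + z_{t+1}, c)."""
    z = np.arange(len(p_pred))

    def mixture_cdf(x):
        return np.sum(p_pred * gamma.cdf(x, a=nu + z, scale=c)) - q

    # bracket the root: expand the upper limit until the mixture CDF exceeds q
    upper = c * (nu + z[-1] + 1.0)
    while mixture_cdf(upper) <= 0.0:
        upper *= 2.0
    return brentq(mixture_cdf, 1e-12, upper)
```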

At this point, it is important to note that the prediction step in (20) is a vector-matrix multiplication, which is an O(Z²) operation. Fortunately, the Markov transition distribution p(zt|zt−1, y1:t−1; θ) is a sparse matrix. In addition, many of the entries in the vector of filtered probabilities p(zt|y1:t; θ) are close to zero. This feature of the model means that the O(Z²) computation in the prediction step of the algorithm can be avoided. When computing the vector-matrix multiplication in the prediction step, it is only necessary to compute those elements in each row-times-column operation that contribute to the sum in (20). In other words, once the sum in (20) has already converged after computing a subset of the states zt = j, or if any combination of the states i and j in the product p(zt+1 = i|zt = j, y1:t−1; θ)p(zt = j|y1:t; θ) is numerically zero, then it is unnecessary to compute the probability for other transitions whose probabilities are known to be smaller. In addition, each row-times-column operation in the vector-matrix multiplication of the prediction step can be computed in parallel on multi-core processors. The sparse nature of the transition matrix and the ability to compute the operations in parallel make computing the likelihood easily feasible for macroeconomic and financial data sets using daily, monthly, or quarterly data.

An important practical issue is the choice of the threshold Z that determines where the infinite sums are truncated. There are several strategies one can take. Here, I propose a procedure to determine a global value of Z for a given value of θ that works well in practice. The idea is to choose Z such that the likelihood contribution

$$\hat{p}(y_t \mid y_{1:t-1};\theta) = \sum_{j=0}^{Z} p(y_t \mid z_t = j;\theta)\, \hat{p}(z_t = j \mid y_{1:t-1};\theta) \tag{25}$$

is guaranteed to converge numerically for all time periods. It suffices to bound the time period which needs the largest value of Z necessary for the integral (25) to converge, as it will converge for all other time periods for this same value of Z as well. Determining the largest value of Z that is necessary to bound the integral corresponds to the date with the largest posterior mass for the filtering distribution p(zt|y1:t; θ). Given a data set and a value θ, it is simple to determine the conditional likelihood p(yt|zt; θ) which gives the most weight to the largest value of zt. As a function of yt, it is maximized for the largest absolute value of yt in the dataset. The challenge is to determine an approximation to the distribution $\hat{p}(z_t = j|y_{1:t-1};\theta)$ in (25) that will weight the conditional likelihood. A simple solution is to start with the negative binomial (stationary) distribution of zt for the parameters θ and compute two iterations of the filtering algorithm using the likelihood p(yt|zt; θ) for the largest absolute value of yt for both iterations. This produces an approximation to the unknown distribution p(zt = j|y1:t−1; θ) in the integral (25). The value of Z can then be selected as the integer where this sum converges.

3.2 Marginal smoothing recursion

An object of independent interest is the smoothed, or two-sided, estimator of the latent variance. It is also possible to recursively compute the marginal smoothing distributions p(zt|y1:T ; θ) and moments and quantiles of the marginal distribution p(ht|y1:T ; θ). The solution for smoothing is analogous to the solution for forward filtering. First, the joint distribution can be decomposed as p(h0:T , z1:T |y1:T ; θ) = p(h0:T |z1:T , y1:T ; θ)p(z1:T |y1:T ; θ), and the smoothing algorithms for discrete-state Markov switching models can be applied to calculate the marginal probabilities p(zt|y1:T ; θ). Moments and quantiles of the marginal smoothing distribution p(ht|y1:T ; θ) can be computed using the results from Proposition 2.

First, the filtering algorithm (19) and (20) is run forward in time and the one-step ahead predictive probabilities and the conditional likelihoods p(yt|zt; θ) are stored for t = 1, . . . , T. The backwards smoothing algorithm is initialized using the final iteration's filtering probabilities $\hat{p}(z_T|y_{1:T};\theta)$. For t = T − 1, . . . , 1, the smoothing distributions for zt are computed as

$$\hat{p}(z_t = i \mid z_{t+1} = j, y_{1:T};\theta) = \frac{p(y_t \mid z_t = i;\theta)\, \hat{p}(z_t = i \mid z_{t+1} = j, y_{1:t+1};\theta)}{\sum_{k=0}^{Z} p(y_t \mid z_t = k;\theta)\, \hat{p}(z_t = k \mid z_{t+1} = j, y_{1:t+1};\theta)} \tag{26}$$

$$\hat{p}(z_t = i, z_{t+1} = j \mid y_{1:T};\theta) = \hat{p}(z_{t+1} = j \mid y_{1:T};\theta)\, \hat{p}(z_t = i \mid z_{t+1} = j, y_{1:T};\theta) \tag{27}$$

$$\hat{p}(z_t = i \mid y_{1:T};\theta) = \sum_{j=0}^{Z} \hat{p}(z_{t+1} = j \mid y_{1:T};\theta)\, \hat{p}(z_t = i \mid z_{t+1} = j, y_{1:T};\theta) \tag{28}$$

Computation of (26) also requires an O(Z²) operation on the backwards pass. This can be reduced substantially by taking into account the sparse matrix structure of the transition probabilities.

This algorithm also computes the bivariate smoothing probabilities $\hat{p}(z_t = i, z_{t+1} = j|y_{1:T};\theta)$, which are used to calculate the moments and quantiles of the marginal smoothing distribution for the variance

$$\hat{E}[h_t^{\alpha} \mid y_{1:T}] = \sum_{z_t=0}^{Z} \sum_{z_{t+1}=0}^{Z} E[h_t^{\alpha} \mid y_t, z_t, z_{t+1}]\, \hat{p}(z_t, z_{t+1} \mid y_{1:T};\theta)$$

$$\hat{P}(h_t < x \mid y_{1:T};\theta) = \sum_{z_t=0}^{Z} \sum_{z_{t+1}=0}^{Z} \left[\int_0^x p(h_t \mid y_t, z_t, z_{t+1};\theta)\, dh_t\right] \hat{p}(z_t, z_{t+1} \mid y_{1:T};\theta)$$

Conditional expectations and quantiles of the marginal smoothing distribution p(ht|yt, zt, zt+1; θ) are calculated from the properties of the GIG and gamma distributions of Proposition 2. Smoothed estimates of the variance are of independent interest but the algorithm can also be used as part of the expectation-maximization (EM) algorithm of Dempster, Laird, and Rubin (1977) as an alternative approach to calculating the maximum likelihood estimator.
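As a rough illustration of the backward pass, the sketch below computes the marginal and bivariate smoothing probabilities from the filtered probabilities and the transition matrices stored during the forward pass. It uses the standard identity p̂(z_t = j | z_{t+1} = i, y_{1:T}) ∝ p̂(z_{t+1} = i | z_t = j, y_t) p̂(z_t = j | y_{1:t}) to obtain the conditional probabilities needed in (27)-(28); this is one standard way to organize the computation in terms of the quantities returned by the forward filter, not a line-by-line transcription of (26).

```python
import numpy as np

def switching_smoother(p_filt, trans):
    """Marginal and bivariate smoothing probabilities for z_t, t = 1,...,T.

    p_filt : (T, Z+1) filtered probabilities p-hat(z_t | y_{1:t}) from the forward pass
    trans  : (T, Z+1, Z+1) with trans[t, i, j] = p-hat(z_{t+1} = i | z_t = j, y_t)
    """
    T, n = p_filt.shape
    p_smooth = np.empty((T, n))
    p_smooth[-1] = p_filt[-1]                      # initialization at t = T
    p_joint = np.empty((T - 1, n, n))              # p_joint[t, i, j] = p-hat(z_{t+1} = i, z_t = j | y_{1:T})
    for t in range(T - 2, -1, -1):
        # cond[i, j] = p-hat(z_t = j | z_{t+1} = i, y_{1:T})
        cond = trans[t] * p_filt[t][None, :]
        cond /= np.maximum(cond.sum(axis=1, keepdims=True), 1e-300)
        p_joint[t] = p_smooth[t + 1][:, None] * cond   # analogue of (27)
        p_smooth[t] = p_joint[t].sum(axis=0)           # analogue of (28)
    return p_smooth, p_joint
```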

3.3 Filtering recursions based on particle filters

An alternative method for calculating the likelihood function and filtering distributions for the AG-SV model is provided by sequential Monte Carlo algorithms, also known as particle filters. Particle filters approximate distributions whose support is infinite by a finite set of points and probability masses; see Creal (2012a) for a recent survey. After integrating out the state variable ht, it is possible to design a particle filter for the AG-SV model that is similar to the filtering algorithm for finite state Markov switching models, with a few key differences. First, the methods differ in how the points of support are determined. A typical particle filter will select the support points for zt at random via Monte Carlo draws, while the finite state Markov-switching model in the previous section has a deterministic set of points of support. Secondly, the particle filter will approximate some probabilities via Monte Carlo that the Markov-switching algorithm will compute exactly from the definition of the probability mass functions involved in the recursions. The particle filter therefore introduces some variability due to Monte Carlo error that is not present in the deterministic scheme. In this section, I describe the similarities between a particle filter for the AG-SV model and the filter for finite state Markov-switching models as well as some of the relative advantages and disadvantages.

The first step to obtaining a particle filter that is analogous to the Markov-switching model is to analytically integrate out any continuous-valued state variables such as the variance ht. This is simply an application of Proposition 1 and is known as Rao-Blackwellisation; see, e.g. Chen and Liu (2000). Analytical integration of any state variable always decreases the Monte Carlo variance of a particle filter; see, e.g. Chopin (2004). Next, the filtering distribution p(zt−1|y1:t−1; θ) is represented by a collection of "particles" given by $\{w_{t-1}^{(i)}, z_{t-1}^{(i)}\}_{i=1}^{Z+1}$. The values $z_{t-1}^{(i)}$ are locations on the support of the distribution and the weights $w_{t-1}^{(i)}$ represent the probability mass at that point. When the support of the state variable is continuous, the probability that two particles take on the same value is zero, but this is not true when the support is discrete. If the particles in the particle filter are initialized at random with replacement, some integers may be duplicated while other support points are randomly omitted. Conversely, in the filter for Markov-switching models, the locations on the support of the distribution are the integers $z_{t-1}^{(i)} = i - 1$ for i = 1, . . . , Z + 1 and the weights are the probabilities $w_{t-1}^{(i)} = \hat{p}(z_{t-1} = i - 1|y_{1:t-1};\theta)$. Each "particle" in the Markov-switching algorithm is assigned a unique integer and no integers smaller than Z are omitted.

At each iteration of a particle filtering algorithm, the particles' locations and weights are updated from one time period $\{w_{t-1}^{(i)}, z_{t-1}^{(i)}\}_{i=1}^{Z+1}$ to the next time period $\{w_t^{(i)}, z_t^{(i)}\}_{i=1}^{Z+1}$ using Monte Carlo methods. The transitions for each particle from $z_{t-1}^{(i)}$ to $z_t^{(i)}$ are determined randomly by drawing from an importance distribution $q(z_t^{(i)}|y_{t-1}, z_{t-1}^{(i)};\theta)$ and calculating the unnormalized importance weights as

$$w_t^{(i)} \propto w_{t-1}^{(i)}\, \frac{p(y_t \mid z_t^{(i)};\theta)\, p(z_t^{(i)} \mid z_{t-1}^{(i)}, y_{t-1};\theta)}{q(z_t^{(i)} \mid y_{t-1}, z_{t-1}^{(i)};\theta)}.$$

The distribution $q(z_t^{(i)}|y_{t-1}, z_{t-1}^{(i)};\theta)$ is chosen by the user. The new set of particles $\{w_t^{(i)}, z_t^{(i)}\}_{i=1}^{Z+1}$ forms a discrete distribution that approximates the next filtering distribution p(zt|y1:t; θ). It has been argued in the particle filtering literature by Fearnhead and Clifford (2003) that when the support of the state variable is discrete, a particle filter should not allow particles to duplicate transitions. To see this, consider two particles that transitioned to the same integer: each one will receive half the total probability mass at that location. However, it is equivalent to have a single particle at that same location receiving all the mass at that point. Duplicating transitions is redundant and computationally wasteful, as any additional particle should go to a new integer location that has yet to be explored by any other particle. This is essentially what the filtering algorithm for Markov-switching models of Section 3.1 is doing. It forces each particle to transition deterministically through all of the possible future Z states, creating a total of Z² unique combinations, $z_t^{(i)} = j$, $z_{t-1}^{(i)} = k$, and no duplications.²

By comparing the filtering algorithm of the previous section with a particle filter that uses Rao-Blackwellisation, it is easy to see that the algorithms share many similarities. The particle filter has the advantage that one does not need to choose a value of Z to truncate the infinite sums, as this is endogenously determined. However, the finite dimensional Markov switching algorithm of the previous section does not introduce Monte Carlo variability. Finally, we note that there exists one important computational constraint that needs to be considered for both methods. The densities p(yt|zt; θ) and p(zt|zt−1, yt−1; θ) are both functions of the modified Bessel function of the second kind $K_{\lambda+z_t}(x)$, where the order parameter λ + zt is a function of the discrete state variable. Unfortunately, this function cannot be evaluated "directly" for arbitrary values of zt and a fixed argument x. Instead, all methods for calculating modified Bessel functions in a computationally stable manner start by evaluating the function twice at stable values for the order and then applying the recursion

$$K_{\lambda+z_t+1}(x) = K_{\lambda+z_t-1}(x) + \frac{2\left(\lambda + z_t\right)}{x}\, K_{\lambda+z_t}(x).$$

This has an important practical ramification for methods that randomly sample zt such as the particle filter. Even if an integer value of zt is not drawn at random, the probabilities associated with it will likely need to be computed and discarded during the algorithm. Consequently, for a given number of particles (support points), a particle filter that fully marginalizes out zt−1 at each iteration will typically have to perform at least as many computations as the deterministic algorithm of Section 3.1. This appears to be computationally wasteful as any additional computations are better spent on increasing the value of Z in the deterministic algorithm. It is for this reason as well as the lack of Monte Carlo variation that the empirical section of this paper focuses on the filtering algorithms for finite-state Markov switching models.
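To illustrate the recursion above, the sketch below builds the whole ladder of values K_{λ+z}(x) for z = 0, . . . , Z from two direct evaluations. The use of scipy.special.kv for the two starting orders is an implementation choice, and in practice one would work with scaled or logged values to avoid overflow at very large orders.

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind

def bessel_k_ladder(lam, x, Z):
    """Return K_{lam}(x), K_{lam+1}(x), ..., K_{lam+Z}(x) via the upward recursion
    K_{v+1}(x) = K_{v-1}(x) + (2*v/x) * K_v(x), started from two direct evaluations.
    The recursion is stable for K, but the values grow quickly with the order."""
    out = np.empty(Z + 1)
    out[0] = kv(lam, x)
    if Z >= 1:
        out[1] = kv(lam + 1.0, x)
    for z in range(1, Z):
        out[z + 1] = out[z - 1] + (2.0 * (lam + z) / x) * out[z]
    return out
```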

²Particle filters also include a resampling step, which is not present in the deterministic filtering algorithm. When the state variable is discrete-valued, the only role played by the resampling step within a particle filter is to ensure that the number of particles does not explode.

4 Application of the new methods

4.1 Comparisons with the particle filter

In this section, I compare the log-likelihood functions calculated from the new algorithm with a standard particle filtering algorithm to demonstrate their relative accuracy. The algorithms are compared by running them on the S&P500 series discussed further below in Section 4.2. All filtering algorithms were run with the parameter values fixed at the ML estimates from the next section. I run the filtering algorithm of Section 3.1 with the truncation parameter set at Z = 3500. For comparison purposes, I implement a particle filter with the transition density (5) as a proposal, where the particles are resampled at random times according to the effective sample size (ESS); see Creal (2012a). I use the residual resampling algorithm of Liu and Chen (1998), which is known to be unbiased. This is a simple extension of the original particle filter of Gordon, Salmond, and Smith (1993).³ The new methods are compared to a simple particle filter in order to demonstrate the accuracy of a typical particle filter which is commonly used in practice. Secondly, it is valuable to compare the new approach, which requires the evaluation of the modified Bessel function to compute the conditional likelihood p(yt|zt; θ), to another algorithm that does not.

To compare the methods, I compute slices of the log-likelihood function for the parameters φ and ν over the regions [0.97, 0.999] and [1.0, 2.2], respectively. Cuts of the log-likelihood function for the remaining parameters (µ, β, c) are similar and are available in an online appendix. Each of these regions is divided into 1000 equally spaced points at which the log-likelihood function is evaluated. While the values of φ and ν change one at a time, the remaining parameters are held fixed at their ML estimates. The log-likelihood functions from the particle filter are shown for different numbers of particles: N = 10000, N = 30000, and N = 100000. These values were selected because they are representative of values of N used throughout the literature. One limitation of the particle filter that was raised early in that literature can be seen from Figure 1. The particle filter does not produce an estimate of the log-likelihood function that is smooth in the parameter space. Consequently, standard derivative-based optimization routines may have trouble converging to the maximum.

³I implemented additional particle filtering algorithms, including the auxiliary particle filter of Pitt and Shephard (1999). The results were practically identical.


Figure 1: Slices of the log-likelihood function for φ (left) and ν (right) for the new algorithm (blue line) versus the particle filter (red dots). The particle sizes are N = 10000 (top row), N = 30000 (middle row), and N = 100000 (bottom row). The vertical (green) lines are the ML estimates of the model reported in Table 1.

The smoothness of the log-likelihood function produced by the new approach is an attractive feature. It makes calculation of the ML estimates easily feasible, as shown in the next section. Secondly, in this application, there is a sizeable downward bias in the particle filter's estimates of the log-likelihood function as well as a reasonably large amount of Monte Carlo variability in the estimates. It is known that the particle filter's estimator of the likelihood function is unbiased, as long as the resampling algorithm used within a particle filter is unbiased. The particle filter's estimate of the likelihood is also consistent and asymptotically normal; see, e.g. Chapter 11 of Del Moral (2004). However, taking logarithms causes the particle filter's estimate to be biased downward for finite numbers of particles due to Jensen's inequality.

4.2 Applications to financial data

In this section, I illustrate the new methods on several daily financial time series. These include the S&P 500 index, the MSCI Emerging Markets Asia index, and the euro-to-U.S. dollar exchange rate. The first two series were downloaded from Bloomberg and the latter series was taken from the Board of Governors of the Federal Reserve. All series run from January 3, 2000 to December 16, 2011, making for 3009, 3118, and 3008 observations, respectively. Starting values for the parameters of the variance (φ, ν, c) were obtained by matching the unconditional mean, variance, and persistence of average squared returns to the unconditional distribution. The values of (µ, β) were initialized at zero. The Feller condition ν > 1 as well as the constraints 0 < φ < 1 and c > 0 were imposed throughout the estimation.

Estimates of the parameters of the discrete-time model as well as standard errors are reported in Table 1. The standard errors are calculated by numerically inverting the Hessian at the ML estimates. The implied parameters of the continuous-time model are also reported in Table 1, assuming a discretization step size of τ = 1/256.⁴

⁴If interest centers on the continuous-time model, it is possible to estimate the parameters θ = (µ, β, κ, θh, σ²). However, one needs to make an assumption about the integrated variance $\int_t^{t+\tau} h(s)\,ds$, which does not have a closed-form solution for the CIR model. A standard assumption in the literature, see for example Eraker, Johannes, and Polson (2003), is to sub-divide time intervals τ into M equal sub-intervals and approximate this integral by the Riemann sum $\int_t^{t+\tau} h(s)\,ds \approx \sum_{j=1}^{M} h_{t+j\tau/M}\,\tfrac{\tau}{M}$. The discrete time model here assumes values M = 1 and τ = 1, whereas in most applications in continuous-time finance the parameters are reported on an annual basis such that τ = 1/256, where 256 is the number of trading days in a year.

Table 1: Maximum likelihood estimates for the AG-SV model.

                  µ         β         φ         c         ν        θ_h       κ        σ²       log-like
S&P500          0.102    -0.061     0.988     0.015     1.539     1.815    3.168     7.470     -4542.1
               (0.020)   (0.018)   (0.004)   (0.003)   (0.193)

MSCI-EM-ASIA    0.250    -0.115     0.978     0.024     1.963     2.115    5.736    12.364     -5213.2
               (0.031)   (0.020)   (0.006)   (0.005)   (0.229)

Euro/$          0.073    -0.148     0.994     0.0007    3.740     0.442    1.489     0.359     -2862.2
               (0.022)   (0.058)   (0.004)   (0.0002)  (1.263)

Maximum likelihood estimates of the parameters θ = (µ, β, φ, c, ν) of the discrete-time AG-SV model on three data sets of daily returns. The series are the S&P500, the MSCI emerging markets Asia index, and the Euro/$ exchange rate. The data cover January 3, 2000 to December 16, 2011. Asymptotic standard errors are reported in parentheses. The continuous-time parameters are reported for intervals τ = 1/256.

For all series, the risk premium parameters β are estimated to be negative and significant, implying that the distribution of returns is negatively skewed. Estimates of the autocorrelation parameter φ for the S&P500 and MSCI series are slightly smaller than is often reported in the literature for log-normal SV models. The dynamics of volatility for the Euro/$ series are substantially different from those of the other series. The mean θh of volatility is much lower and the volatility is more persistent.

Figure 2 contains output from the estimated model for the S&P500 series. The top left panel is a plot of the filtered and smoothed estimates of the volatility $\sqrt{h_t}$ over the sample period. The estimates are consistent with what one would expect from looking at the raw return series. There are large increases in volatility during the U.S. financial crisis of 2008, followed by another recent spike in volatility during the European debt crisis. To provide some information on how the truncation parameter Z impacts the estimates, the top right panel of Figure 2 is a plot of the marginal filtering distributions for the discrete mixing variable p(zt|y1:t; θ) on three different days. The three distributions that are pictured are representative of the distribution p(zt|y1:t; θ) during low (6/27/2007), medium (10/2/2008), and high volatility (12/1/2008) periods. The distribution of zt for the final date (12/1/2008) was chosen because it is the day at which the mean of zt is estimated to be the largest throughout the sample. Consequently, it is the distribution where the truncation will have the largest impact. This time period also corresponds to the largest estimated value of the variance ht. The graph illustrates visually that the impact of truncating the distribution is negligible, as the truncation point is far into the tails of the distribution.⁵


Figure 2: Estimation results for the AG-SV model on the S&P500 index from 1/3/2000 to 12/16/2011. Top left: filtered and smoothed estimates of the volatility $\sqrt{h_t}$. Top right: marginal probability mass functions p(zt|y1:t; θ) on three different dates. Bottom left: forecasting distribution of the variance ht for 100 days beginning on 7/29/2005. Bottom right: forecasting distribution of the variance ht for 100 days beginning on 11/28/2008.

In Section 3.3, I argued that the only way to compute the log-likelihood function more accurately was to increase the value of Z. To quantify this further, the algorithms were run again and the log-likelihood functions as well as the c.d.f.s for zt were calculated for a series of different truncation values on the S&P 500 dataset.

⁵Technically, the accuracy of the computations is limited to the accuracy of the modified Bessel function of the second kind, which is typically around 10⁻¹⁴ or 10⁻¹⁵ depending on the point at which it is evaluated.

Table 2: Cumulative distribution functions P(zt|y1:t; θ) for different values of Z.

            P(z_t ≤ 1500|y_{1:t}; θ)   P(z_t ≤ 2000|y_{1:t}; θ)   P(z_t ≤ 2500|y_{1:t}; θ)   log-like
Z = 2500    0.9960571656208360         0.9999956716590525         1.0000000000000000         -4542.0625588879939
Z = 3000    0.9960571624858776         0.9999956685945898         0.9999999986272218         -4542.0625588934108
Z = 3500    0.9960571624854725         0.9999956685941842         0.9999999986268223         -4542.0625588934117
Z = 5000    0.9960571624854725         0.9999956685941842         0.9999999986268223         -4542.0625588934117

This table contains the cumulative distribution functions of the filtering distribution P(zt|y1:t; θ) and the log-likelihood as a function of the truncation parameter for Z = 2500, 3000, 3500, 5000 on 12/1/2008. This is for the S&P 500 dataset. The log-likelihood values reported in the table are for the entire sample.

The cumulative distribution functions for the filtering distribution P(zt|y1:t; θ) on 12/1/2008 and the overall log-likelihoods are reported in Table 2 for values of Z = 2500, 3000, 3500, 5000. The first row of the table contains the cumulative probabilities for values of zt = 1500, 2000, 2500 when the algorithm was run with Z = 2500. When the value of Z is set at 2500, there is 0.9960571656208360 cumulative probability to the left of zt = 1500. When zt = 2500 and Z = 2500, the c.d.f. reaches a value of one due to self-normalizing the probabilities. The purpose of the table is to illustrate that the value of the log-likelihood function converges (numerically) between Z = 3000 and Z = 3500. Therefore, it is possible to calculate the log-likelihood function of the model accurately for reasonable values of Z that are computationally tractable for daily frequency data.

An important feature of the AG-SV model that is useful in practice is that forecasts of the variance can be computed accurately without simulation. These H-step ahead forecasts of the variance are produced by iterating on the vector of probabilities $\hat{p}(z_{t+1}|y_{1:t};\theta)$ from the last iteration of the Markov-switching algorithm using the transition distribution of the latent variable zt. When there are no observations available out of sample, the transition distribution of zt was shown by Gouriéroux and Jasiak (2006) to be a negative binomial distribution $p(z_t|z_{t-1};\theta) = \text{Neg. Bin.}\left(\nu + z_{t-1}, \tfrac{\phi}{1+\phi}\right)$. Starting with the filtering distribution $\hat{p}(z_T|y_{1:T};\theta)$ at the final time period and iterating for H periods on the transition distribution produces $\hat{p}(z_{t+H}|y_{1:t};\theta)$. This distribution can be combined with the conditional distribution of the variance at any horizon $p(h_{t+H}|z_{t+H};\theta) = \text{Gamma}\left(\nu + z_{t+H}, c\right)$ to construct forecasting intervals. For example, to calculate quantiles of the H-step ahead forecasting distribution, one can solve for x given a fixed value of Z and the ML estimator $\hat{\theta}$ in the equation

$$\hat{P}(h_{t+H} < x \mid y_{1:t};\hat{\theta}) = \sum_{z_{t+H}=0}^{Z} \left[\int_0^x p\left(h_{t+H} \mid z_{t+H};\hat{\theta}\right) dh_{t+H}\right] \hat{p}(z_{t+H} \mid y_{1:t};\hat{\theta}) \tag{29}$$

using a simple zero-finding algorithm.
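A sketch of the H-step ahead iteration is given below. The mapping of the negative binomial transition onto scipy's nbinom(n, p) convention (number of failures before the n-th success) is my own: setting p = 1/(1 + φ) reproduces the conditional mean φ(ν + z_{t−1}) implied by (2)-(3). The distribution is truncated at Z and renormalized, in the same spirit as the filtering recursions, and quantiles can then be obtained from (29) exactly as in (23)-(24).

```python
import numpy as np
from scipy.stats import nbinom

def forecast_z_distribution(p_filt_T, H, nu, phi):
    """Iterate p-hat(z_{T+H} | y_{1:T}) forward H steps using the out-of-sample
    negative binomial transition of z_t, truncated at Z = len(p_filt_T) - 1."""
    z = np.arange(len(p_filt_T))
    # trans[i, j] = p(z_t = i | z_{t-1} = j); nbinom convention is an assumption,
    # chosen so that the conditional mean equals (nu + j) * phi
    trans = nbinom.pmf(z[:, None], n=nu + z[None, :], p=1.0 / (1.0 + phi))
    trans /= trans.sum(axis=0, keepdims=True)   # renormalize after truncation
    p = p_filt_T.copy()
    for _ in range(H):
        p = trans @ p
    return p

def forecast_variance_mean(p_filt_T, H, nu, phi, c):
    """Point forecast E-hat[h_{T+H} | y_{1:T}]: mix the conditional means (nu + z)*c
    of Gamma(nu + z, c) over the forecasted distribution of z_{T+H}."""
    p_H = forecast_z_distribution(p_filt_T, H, nu, phi)
    z = np.arange(len(p_H))
    return np.sum((nu + z) * c * p_H)
```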

The bottom two plots in Figure 2 are forecasts of the future variance ht beginning on two different dates, 7/29/2005 and 11/28/2008. Forecasts for the mean, median, and 95% intervals for ht are produced for H = 100 days. These dates were selected to illustrate the substantial difference in both asymmetry and uncertainty in the forecasts during low and high periods of volatility. When volatility is low (7/29/2005), the forecasting distribution of ht is highly asymmetric and the mean and median of the distribution differ in economically important ways. Conversely, when volatility is high (11/28/2008), the distribution of the variance is roughly normally distributed. The difference in width of the 95% error bands between the two dates also illustrates how much more uncertainty exists in the financial markets during a crisis.

5 Conclusion

In this paper, I developed methods that can compute the log-likelihood function of discrete-time autoregressive gamma stochastic volatility models exactly up to computer tolerance. The new algorithms are based on the insight that it is possible to integrate out the latent variance analytically, leaving only a discrete mixture variable. The discrete variable is defined over the set of non-negative integers, but for practical purposes it is possible to approximate the distributions involved by a finite-dimensional Markov switching model. Consequently, the log-likelihood function can be computed using standard algorithms for Markov-switching models. Furthermore, filtered and smoothed estimates of the latent variance can easily be computed as a by-product.

Using the unique conditional independence structure highlighted in Proposition 2, it is also easy to draw samples from the joint distribution p(h0:T , z1:T |y1:T ; θ) and thus the marginal distribution of the variance. This follows by decomposing the joint distribution into a conditional and a marginal as p(h0:T , z1:T |y1:T ; θ) = p(h0:T |y1:T , z1:T ; θ)p(z1:T |y1:T ; θ). A draw is taken from the marginal distribution z1:T ∼ p̂(z1:T |y1:T ; θ) using standard results on Markov-switching algorithms. Then, conditional on the draw of z1:T , a draw is taken from h0:T ∼ p(h0:T |y1:T , z1:T ; θ), which consists of the distributions from Proposition 2. Algorithms that draw samples from the joint distribution of the latent variables given the data are known as simulation smoothers or alternatively as forward-filtering backward-sampling (FFBS) algorithms; see, e.g. Carter and Kohn (1994), Frühwirth-Schnatter (1994), de Jong and Shephard (1995), and Durbin and Koopman (2002) for linear, Gaussian models and Chib (1996) for Markov-switching models. Simulation smoothers are practically important as they allow for frequentist and Bayesian analysis of more complex AG-SV models. In addition, they are useful for calculating moments and quantiles of nonlinear functions of the state variable, such as the volatility $\sqrt{h_t}$.

References

Abramowitz, M. and I. A. Stegun (1964). Handbook of Mathematical Functions. New York, NY: Dover Publications Inc.

Barndorff-Nielsen, O. E. (1978). Hyperbolic distributions and distributions on hyperbolae. Scandinavian Journal of Statistics 5 (3), 151–157.

Barndorff-Nielsen, O. E. and N. Shephard (2012). Basics of Lévy processes. Working paper, Oxford-Man Institute, University of Oxford.

Baum, L. E. and T. P. Petrie (1966). Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics 37 (1), 1554–1563.

Baum, L. E., T. P. Petrie, G. Soules, and N. Weiss (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics 41 (6), 164–171.

Cappé, O., E. Moulines, and T. Rydén (2005). Inference in Hidden Markov Models. New York: Springer Press.

Carter, C. K. and R. Kohn (1994). On Gibbs sampling for state space models. Biometrika 81 (3), 541–553.

Chen, R. and J. S. Liu (2000). Mixture Kalman filters. Journal of the Royal Statistical Society, Series B 62 (3), 493–508.

Chib, S. (1996). Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics 75 (1), 79–97.

Chib, S., M. K. Pitt, and N. Shephard (2006). Likelihood inference for diffusion driven state space models. Working Paper. Department of Economics, Nuffield College, University of Oxford.

Chopin, N. (2004). Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference. The Annals of Statistics 32 (6), 2385–2411.

Cox, J. C., J. E. Ingersoll, and S. A. Ross (1985). A theory of the term structure of interest rates. Econometrica 53 (2), 385–407.

Creal, D. D. (2012a). A survey of sequential Monte Carlo methods for economics and finance. Econometric Reviews 31 (3), 245–296.

Creal, D. D. (2012b). Maximum likelihood estimation of affine stochastic intensity models. Working paper, University of Chicago, Booth School of Business.

Danielsson, J. (1994). Stochastic volatility in asset prices: estimation with simulated maximum likelihood. Journal of Econometrics 61 (1-2), 375–400.

de Jong, P. and N. Shephard (1995). The simulation smoother for time series models. Biometrika 82 (2), 339–350.

Del Moral, P. (2004). Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. New York, NY: Springer.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39 (1), 1–38.

Duffie, D., A. Eckner, G. Horel, and L. Saita (2009). Frailty correlated default. The Journal of Finance 64 (5), 2089–2123.

Duffie, D. and K. J. Singleton (1993). Simulated moments estimation of Markov models of asset prices. Econometrica 61 (4), 929–952.

Duffie, D. and K. J. Singleton (2003). Credit Risk. Princeton, NJ: Princeton University Press.

Dugué, D. (1941). Sur un nouveau type de courbe de fréquence. C. R. Académie des Sciences Paris 213, 634–635.

Durbin, J. and S. J. Koopman (1997). Monte Carlo maximum likelihood estimation of non-Gaussian state space models. Biometrika 84 (3), 669–684.

Durbin, J. and S. J. Koopman (2001). Time Series Analysis by State Space Methods. Oxford, UK: Oxford University Press.

Durbin, J. and S. J. Koopman (2002). A simple and efficient simulation smoother for state space time series analysis. Biometrika 89 (3), 603–616.

Durham, G. and A. R. Gallant (2002). Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes (with discussion). Journal of Business and Economic Statistics 20 (3), 297–338.

Eraker, B., M. Johannes, and N. G. Polson (2003). The impact of jumps in returns and volatility. The Journal of Finance 53 (3), 1269–1300.

Fearnhead, P. and P. Clifford (2003). On-line inference for hidden Markov models via particle filters. Journal of the Royal Statistical Society, Series B 65 (4), 887–899.

Feller, W. S. (1951). Two singular diffusion problems. Annals of Mathematics 54 (1), 173–182.

Fernández-Villaverde, J. and J. F. Rubio-Ramírez (2007). Estimating macroeconomic models: a likelihood approach. The Review of Economic Studies 74 (4), 1059–1087.

Frühwirth-Schnatter, S. (1994). Data augmentation and dynamic linear models. Journal of Time Series Analysis 15 (2), 183–202.

Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. New York, NY: Springer Press.

Gallant, A. R. and G. Tauchen (1996). Which moments to match? Econometric Theory 12 (4), 657–681.

Gordon, N. J., D. J. Salmond, and A. F. M. Smith (1993). A novel approach to nonlinear and non-Gaussian Bayesian state estimation. IEE Proceedings. Part F: Radar and Sonar Navigation 140 (2), 107–113.

Gouriéroux, C. and J. Jasiak (2006). Autoregressive gamma processes. Journal of Forecasting 25 (2), 129–152.

Gouriéroux, C., A. Monfort, and E. Renault (1993). Indirect inference. Journal of Applied Econometrics 8 (S), 85–118.

Hamilton, J. D. (1989). A new approach to the econometric analysis of nonstationary time series and the business cycle. Econometrica 57 (2), 357–384.

Hamilton, J. D. (1994). Time Series Analysis. Princeton, NJ: Princeton University Press.

Heston, S. L. (1993). A closed-form solution for options with stochastic volatility with appli- cations to bond and currency options. The Review of Financial Studies 6 (2), 327–343.

Jacquier, E., M. Johannes, and N. G. Polson (2007). MCMC maximum likelihood for latent state models. Journal of Econometrics 137 (2), 615–640.

Jacquier, E., N. G. Polson, and P. E. Rossi (1994). Bayesian analysis of stochastic volatility models (with discussion). Journal of Business and Economic Statistics 12 (4), 371–417.

Kalman, R. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, Transactions of the ASME 82 (1), 35–46.

Kim, C.-J. and C. R. Nelson (1999). State-Space Models with Regime Switching: Classical and Gibbs-Sampling Approaches with Applications. Cambridge, MA: MIT Press.

Kim, S., N. Shephard, and S. Chib (1998). Stochastic volatility: likelihood inference and comparison with ARCH models. The Review of Economic Studies 65 (3), 361–393.

Kotz, S., T. J. Kozubowski, and K. Podgórski (2001). The Laplace Distribution and Generalizations. Boston, MA: Birkhäuser.

Liu, J. S. and R. Chen (1998). Sequential Monte Carlo computation for dynamic systems. Journal of the American Statistical Association 93 (443), 1032–1044.

Madan, D. B., P. P. Carr, and E. C. Chang (1998). The variance gamma process and option pricing. Review of Finance 2 (1), 79–105.

Madan, D. B. and E. Seneta (1990). The variance gamma (V.G.) model for share market returns. Journal of Business 63 (4), 511–524.

Pitt, M. K. and N. Shephard (1999). Filtering via simulation: auxiliary particle filters. Journal of the American Statistical Association 94 (446), 590–599.

Shephard, N. (1994). Partial non-Gaussian state space. Biometrika 81 (1), 115–131.

Shephard, N. (2005). Stochastic Volatility: Selected Readings. Oxford, UK: Oxford University Press.

Sichel, H. S. (1974). On a distribution representing sentence-length in written prose. Journal of the Royal Statistical Society, Series A 135 (1), 25–34.

Sichel, H. S. (1975). On a distribution law for word frequencies. Journal of the American Statistical Association 70 (350), 542–547.

Slevinsky, R. M. and H. Safouhi (2010). A recursive algorithm for the G transformation and accurate computation of incomplete Bessel functions. Applied Numerical Mathematics 60, 1411–1417.

A Appendix:

A.1 Notation for the NG, GIG, Sichel and Bessel distributions

In this appendix, I define notation for the NG, GIG, Sichel and Bessel distributions that are used in the paper. The notation for the parameters of the distributions is local to this section of the appendix. A normal gamma random variable Y ∼ N.G.(µ, α, β, ν) is obtained by taking a normal distribution Y|σ² ∼ N(µ + βσ², σ²) and allowing its variance to have a gamma distribution σ² ∼ Gamma(ν, 2/(α² − β²)). The p.d.f. of a N.G. random variable is
\[
p(y|\mu,\alpha,\beta,\nu) = \frac{\left(\alpha^{2}-\beta^{2}\right)^{\nu}\,|y-\mu|^{\nu-\frac{1}{2}}}{\sqrt{\pi}\,\Gamma(\nu)\,(2\alpha)^{\nu-\frac{1}{2}}}\,\exp\left(\beta(y-\mu)\right)\,K_{\nu-\frac{1}{2}}\!\left(\alpha|y-\mu|\right)
\]
where K_λ(x) is the modified Bessel function of the second kind; see, e.g. Abramowitz and Stegun (1964). The mean and variance of the distribution are
\[
E[y] = E_{\sigma^{2}}\!\left[E\left(y|\sigma^{2}\right)\right] = E_{\sigma^{2}}\!\left[\mu+\beta\sigma^{2}\right] = \mu + \frac{2\beta\nu}{\alpha^{2}-\beta^{2}}
\]
\[
V[y] = E\!\left[V\left(y|\sigma^{2}\right)\right] + V\!\left[E\left(y|\sigma^{2}\right)\right] = E\!\left[\sigma^{2}\right] + V\!\left[\mu+\beta\sigma^{2}\right] = \frac{2\nu}{\alpha^{2}-\beta^{2}} + \frac{4\nu\beta^{2}}{\left(\alpha^{2}-\beta^{2}\right)^{2}}
\]
see, e.g. Kotz, Kozubowski, and Podgórski (2001). The parameter β controls the symmetry of the distribution: β < 0 produces a left-skewed density, β > 0 a right-skewed density, and β = 0 a symmetric density.
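For reference, the N.G. log-density above can be evaluated stably with SciPy's exponentially scaled Bessel function kve (K_v(x) = kve(v, x)·e^{-x}). The sketch below is an illustrative helper under that assumption, not code from the paper, and it requires α > |β| and y ≠ µ.

```python
# Illustrative helper for the N.G.(mu, alpha, beta, nu) log-density defined above.
import numpy as np
from scipy.special import kve, gammaln

def normal_gamma_logpdf(y, mu, alpha, beta, nu):
    d = np.abs(y - mu)                       # assumes y != mu and alpha > |beta|
    return (nu * np.log(alpha ** 2 - beta ** 2)
            + (nu - 0.5) * (np.log(d) - np.log(2.0 * alpha))
            - 0.5 * np.log(np.pi) - gammaln(nu)
            + beta * (y - mu)
            + np.log(kve(nu - 0.5, alpha * d)) - alpha * d)   # log K_{nu-1/2}(alpha*d)
```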

A generalized inverse Gaussian random variable X ∼ GIG(λ, χ, ψ) has p.d.f.
\[
p(x|\lambda,\chi,\psi) = \frac{\left(\psi/\chi\right)^{\lambda/2}}{2K_{\lambda}\!\left(\sqrt{\chi\psi}\right)}\, x^{\lambda-1}\exp\left(-\frac{1}{2}\left[\chi x^{-1}+\psi x\right]\right).
\]
The GIG distribution was first proposed by E. Halphen under the pseudonym Dugué (1941) and it was popularized in statistics by Barndorff-Nielsen (1978). Moments of the GIG distribution are
\[
E\!\left[X^{k}\right] = \left(\frac{\chi}{\psi}\right)^{k/2}\frac{K_{\lambda+k}\!\left(\sqrt{\chi\psi}\right)}{K_{\lambda}\!\left(\sqrt{\chi\psi}\right)},
\]
which are needed throughout the paper for computations of the filtering and smoothing distributions for the latent variance. The c.d.f. of a GIG random variable, which is needed for the calculation of quantiles, can be expressed in terms of the incomplete Bessel function. Slevinsky and Safouhi (2010) provide a routine for its calculation that does not use numerical integration.
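The moment formula translates directly into a small helper; the sketch below is an illustrative assumption rather than the paper's code and uses the scaled Bessel function kve so that the ratio K_{λ+k}/K_λ is computed without overflow for large arguments.

```python
# Illustrative helper for E[X^k], X ~ GIG(lam, chi, psi), using the moment formula above.
import numpy as np
from scipy.special import kve

def gig_moment(k, lam, chi, psi):
    eta = np.sqrt(chi * psi)
    # the exp(eta) scaling factors of kve cancel in the ratio
    return (chi / psi) ** (k / 2.0) * kve(lam + k, eta) / kve(lam, eta)
```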

A random variable Z ∼ Sichel(λ, χ, ψ) is obtained by taking a Poisson random variable Z|X ∼ Poisson(X) and allowing the mean X to be a random draw from a GIG distribution X ∼ GIG(λ, χ, ψ). The mass function of a Sichel random variable is
\[
p(z|\lambda,\chi,\psi) = \left(\sqrt{\frac{\psi}{\psi+2}}\right)^{\lambda}\left(\sqrt{\frac{\chi}{\psi+2}}\right)^{z}\frac{1}{z!}\,\frac{K_{\lambda+z}\!\left(\sqrt{\chi(\psi+2)}\right)}{K_{\lambda}\!\left(\sqrt{\chi\psi}\right)},\qquad z \ge 0.
\]
The first two moments of a Sichel distribution are often reported incorrectly in the literature. They follow from the law of iterated expectations as
\[
E[z] = E_{x}\!\left[E\left(z|x\right)\right] = E_{x}\!\left[x\right] = \left(\frac{\chi}{\psi}\right)^{\frac{1}{2}}\frac{K_{\lambda+1}\!\left(\sqrt{\chi\psi}\right)}{K_{\lambda}\!\left(\sqrt{\chi\psi}\right)}
\]
\[
E\!\left[z^{2}\right] = E_{x}\!\left[E\left(z^{2}|x\right)\right] = E_{x}\!\left[x+x^{2}\right] = \left(\frac{\chi}{\psi}\right)^{\frac{1}{2}}\frac{K_{\lambda+1}\!\left(\sqrt{\chi\psi}\right)}{K_{\lambda}\!\left(\sqrt{\chi\psi}\right)} + \frac{\chi}{\psi}\,\frac{K_{\lambda+2}\!\left(\sqrt{\chi\psi}\right)}{K_{\lambda}\!\left(\sqrt{\chi\psi}\right)}.
\]
The probabilities of the Sichel distribution and the density of the NG distribution can be computed in a numerically stable manner using the following recursive relationship for the ratio of modified Bessel functions:

\[
\frac{K_{\lambda+1}(x)}{K_{\lambda}(x)} = \frac{K_{\lambda-1}(x)}{K_{\lambda}(x)} + \frac{2\lambda}{x}.
\]
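A minimal sketch of how this recursion might be used to build the Sichel log-probabilities on 0, ..., z_max is given below: the z = 0 term is evaluated with scaled Bessel functions and each subsequent term reuses the ratio. The function name and interface are illustrative assumptions, not the paper's code.

```python
# Illustrative stable evaluation of log p(z), z = 0, ..., z_max, for Z ~ Sichel(lam, chi, psi).
import numpy as np
from scipy.special import kve

def sichel_logpmf(z_max, lam, chi, psi):
    eta, omega = np.sqrt(chi * psi), np.sqrt(chi * (psi + 2.0))
    logp = np.empty(z_max + 1)
    # z = 0 term: (psi/(psi+2))^(lam/2) * K_lam(omega) / K_lam(eta)
    logp[0] = (0.5 * lam * (np.log(psi) - np.log(psi + 2.0))
               + np.log(kve(lam, omega)) - omega
               - np.log(kve(lam, eta)) + eta)
    ratio = kve(lam + 1.0, omega) / kve(lam, omega)      # K_{lam+1}(omega) / K_lam(omega)
    half_log_step = 0.5 * (np.log(chi) - np.log(psi + 2.0))
    for z in range(z_max):
        logp[z + 1] = logp[z] + half_log_step - np.log(z + 1.0) + np.log(ratio)
        # advance the ratio with the recursion above, now at order lam + z + 1
        ratio = 1.0 / ratio + 2.0 * (lam + z + 1.0) / omega
    return logp
```

For example, the transition distribution derived in Appendix A.2 would correspond to λ = ν + z_{t-1} − 1/2, χ = (φ/c)(y_{t-1} − µ)², and ψ = (c/φ)(2/c + β²) in this routine.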

A Bessel random variable S ∼ Bessel(κ, γ) has probability mass function
\[
p(s|\kappa,\gamma) = \frac{1}{I_{\kappa}(\gamma)}\,\frac{1}{s!\,\Gamma(s+\kappa+1)}\left(\frac{\gamma}{2}\right)^{2s+\kappa},\qquad s \ge 0,
\]
where I_κ(γ) is the modified Bessel function of the first kind; see, e.g. Abramowitz and Stegun (1964).
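This mass function can likewise be evaluated in log space with the scaled function ive (I_κ(γ) = ive(κ, γ)·e^{γ} for γ > 0); the helper below is an illustrative sketch under that assumption.

```python
# Illustrative log-pmf of S ~ Bessel(kappa, gam) as defined above.
import numpy as np
from scipy.special import ive, gammaln

def bessel_logpmf(s, kappa, gam):
    s = np.asarray(s, dtype=float)
    return ((2.0 * s + kappa) * np.log(gam / 2.0)
            - gammaln(s + 1.0) - gammaln(s + kappa + 1.0)
            - (np.log(ive(kappa, gam)) + gam))    # log I_kappa(gam)
```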

A.2 Proof of Proposition 1:

The conditional likelihood obtained by integrating the variance ht out of the original measurement equation is

\begin{align*}
p(y_t|z_t;\theta) &= \int_{0}^{\infty} p(y_t|h_t;\theta)\,p(h_t|z_t;\theta)\,dh_t \\
&= \int_{0}^{\infty}\frac{1}{\sqrt{2\pi}}\,h_t^{-1/2}\exp\left(-\frac{1}{2}\left[y_t-\mu-\beta h_t\right]^{2}h_t^{-1}\right)\frac{1}{\Gamma(\nu+z_t)}\left(\frac{1}{c}\right)^{\nu+z_t}h_t^{(\nu+z_t)-1}\exp\left(-\frac{h_t}{c}\right)dh_t \\
&= \frac{\sqrt{2}\left(\frac{1}{c}\right)^{\nu+z_t}|y_t-\mu|^{\nu+z_t-\frac{1}{2}}}{\sqrt{\pi}\,\Gamma(\nu+z_t)\left(\sqrt{\frac{2}{c}+\beta^{2}}\right)^{\nu+z_t-\frac{1}{2}}}\exp\left(\beta(y_t-\mu)\right)K_{\nu+z_t-\frac{1}{2}}\!\left(\sqrt{\tfrac{2}{c}+\beta^{2}}\,|y_t-\mu|\right) \\
&= \text{N.G.}\left(\mu,\ \sqrt{\tfrac{2}{c}+\beta^{2}},\ \beta,\ \nu+z_t\right)
\end{align*}
see Appendix A.1 for the definition of the density. By Proposition 2 below, we know that p(h_{t-1}|z_{t-1}, y_{1:t-1}; θ) = GIG(ν + z_{t-1} − 1/2, (y_{t-1} − µ)², 2/c + β²). The transition distribution of the discrete mixing variable conditional on observing y_{1:t-1} is given by

\begin{align*}
p(z_t|z_{t-1}, y_{1:t-1};\theta) &= \int_{0}^{\infty} p(z_t|h_{t-1};\theta)\,p(h_{t-1}|z_{t-1}, y_{1:t-1};\theta)\,dh_{t-1} \\
&= \int_{0}^{\infty}\frac{1}{z_t!}\left(\frac{\phi h_{t-1}}{c}\right)^{z_t}\exp\left(-\frac{\phi h_{t-1}}{c}\right)
\frac{\left(\frac{\frac{2}{c}+\beta^{2}}{(y_{t-1}-\mu)^{2}}\right)^{\frac{1}{2}\left(\nu+z_{t-1}-\frac{1}{2}\right)}}{2K_{\nu+z_{t-1}-\frac{1}{2}}\!\left(|y_{t-1}-\mu|\sqrt{\frac{2}{c}+\beta^{2}}\right)} \\
&\qquad\times h_{t-1}^{\left(\nu+z_{t-1}-\frac{1}{2}\right)-1}\exp\left(-\frac{1}{2}\left[(y_{t-1}-\mu)^{2}h_{t-1}^{-1}+\left(\frac{2}{c}+\beta^{2}\right)h_{t-1}\right]\right)dh_{t-1} \\
&= \frac{1}{z_t!}\left(\frac{\phi}{c}\right)^{z_t}\left(\frac{\frac{2}{c}+\beta^{2}}{\frac{2(1+\phi)}{c}+\beta^{2}}\right)^{\frac{1}{2}\left(\nu+z_{t-1}-\frac{1}{2}\right)}\left(\frac{|y_{t-1}-\mu|}{\sqrt{\frac{2(1+\phi)}{c}+\beta^{2}}}\right)^{z_t}
\frac{K_{\nu+z_{t-1}+z_t-\frac{1}{2}}\!\left(|y_{t-1}-\mu|\sqrt{\frac{2(1+\phi)}{c}+\beta^{2}}\right)}{K_{\nu+z_{t-1}-\frac{1}{2}}\!\left(|y_{t-1}-\mu|\sqrt{\frac{2}{c}+\beta^{2}}\right)} \\
&= \text{Sichel}\left(\nu+z_{t-1}-\frac{1}{2},\ \frac{\phi}{c}\left(y_{t-1}-\mu\right)^{2},\ \frac{c}{\phi}\left(\frac{2}{c}+\beta^{2}\right)\right)
\end{align*}

The Markov transition distribution of z_t is non-homogeneous because it depends on the most recent observation y_{t-1}.

The initial distribution of z_1 was given by Gouriéroux and Jasiak (2006), but I provide it for completeness. Recall that h_0 is a draw from the stationary distribution Ga(ν, c/(1 − φ)). The initial distribution of z_1 is
\begin{align*}
p(z_1;\theta) &= \int_{0}^{\infty} p(z_1|h_0;\theta)\,p(h_0;\theta)\,dh_0 \\
&= \int_{0}^{\infty}\frac{1}{z_1!}\left(\frac{\phi h_0}{c}\right)^{z_1}\exp\left(-\frac{\phi h_0}{c}\right)\frac{h_0^{\nu-1}}{\Gamma(\nu)}\exp\left(-\frac{h_0(1-\phi)}{c}\right)\left(\frac{c}{1-\phi}\right)^{-\nu}dh_0 \\
&= \frac{\Gamma(\nu+z_1)}{\Gamma(z_1+1)\,\Gamma(\nu)}\,(1-\phi)^{\nu}\phi^{z_1} \\
&= \text{Negative Binomial}\left(\nu,\ \phi\right)
\end{align*}
where E[z_1] = νφ/(1 − φ).
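The first identity in this proof can be checked numerically: integrating the Gamma(ν + z_t, c) mixture by quadrature should reproduce the closed-form N.G. density with α = √(2/c + β²). The sketch below is such a check with arbitrary test values (not estimates from the paper); the variable names are illustrative.

```python
# Numerical check of p(y_t|z_t) = N.G.(mu, sqrt(2/c + beta^2), beta, nu + z_t).
import numpy as np
from scipy import integrate, stats
from scipy.special import kv, gammaln

mu, beta, nu, c, z_t, y_t = 0.0, -1.0, 1.5, 0.5, 3, 0.8   # arbitrary test values

numeric, _ = integrate.quad(
    lambda h: stats.norm.pdf(y_t, loc=mu + beta * h, scale=np.sqrt(h))
              * stats.gamma.pdf(h, a=nu + z_t, scale=c),
    0.0, np.inf)

alpha, d, m = np.sqrt(2.0 / c + beta ** 2), abs(y_t - mu), nu + z_t
closed = np.exp(0.5 * np.log(2.0) - m * np.log(c) + (m - 0.5) * np.log(d)
                - 0.5 * np.log(np.pi) - gammaln(m) - (m - 0.5) * np.log(alpha)
                + beta * (y_t - mu)) * kv(m - 0.5, alpha * d)

print(numeric, closed)   # the two values should agree to quadrature accuracy
```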

A.3 Proof of Proposition 2:

The one-step ahead predictive distribution p(h_{t+1}|z_{1:t}, y_{1:t}; θ) = Gamma(ν + z_t, c) follows immediately from the definition of the model. The marginal filtering distribution of the variance is

\begin{align*}
p(h_t|z_{1:t}, y_{1:t};\theta) &\propto p(y_t|h_t;\theta)\,p(h_t|z_t;\theta) \\
&\propto h_t^{-1/2}\exp\left(-\frac{1}{2}\left[y_t-\mu-\beta h_t\right]^{2}h_t^{-1}\right)\frac{c^{-(\nu+z_t)}}{\Gamma(\nu+z_t)}\,h_t^{(\nu+z_t)-1}\exp\left(-\frac{h_t}{c}\right) \\
&\propto h_t^{\nu+z_t-\frac{1}{2}-1}\exp\left(-\frac{1}{2}\left[(y_t-\mu)^{2}h_t^{-1}+\left(\frac{2}{c}+\beta^{2}\right)h_t\right]\right) \\
&= \text{GIG}\left(\nu+z_t-\frac{1}{2},\ (y_t-\mu)^{2},\ \frac{2}{c}+\beta^{2}\right)
\end{align*}

The conditional distribution only depends on zt and yt. The final distribution at time t = T is just a special case of the former result

\[
p(h_T|z_{1:T}, y_{1:T};\theta) = \text{GIG}\left(\nu+z_T-\frac{1}{2},\ (y_T-\mu)^{2},\ \frac{2}{c}+\beta^{2}\right).
\]

The marginal smoothing distribution for the variance is

\begin{align*}
p(h_t|z_{1:T}, y_{1:T};\theta) &\propto p(y_t|h_t;\theta)\,p(h_t|z_t;\theta)\,p(z_{t+1}|h_t;\theta) \\
&\propto h_t^{-\frac{1}{2}}\exp\left(-\frac{1}{2}\left[y_t-\mu-\beta h_t\right]^{2}h_t^{-1}\right)\frac{c^{-(\nu+z_t)}}{\Gamma(\nu+z_t)}\,h_t^{(\nu+z_t)-1}\exp\left(-\frac{h_t}{c}\right)\frac{1}{z_{t+1}!}\left(\frac{\phi h_t}{c}\right)^{z_{t+1}}\exp\left(-\frac{\phi h_t}{c}\right) \\
&\propto h_t^{\left(\nu+z_t+z_{t+1}-\frac{1}{2}\right)-1}\exp\left(-\frac{1}{2}\left[(y_t-\mu)^{2}h_t^{-1}+\left(\frac{2(1+\phi)}{c}+\beta^{2}\right)h_t\right]\right) \\
&= \text{GIG}\left(\nu+z_t+z_{t+1}-\frac{1}{2},\ (y_t-\mu)^{2},\ \frac{2(1+\phi)}{c}+\beta^{2}\right)
\end{align*}

The marginal smoothing distribution of h_t at intermediate dates only depends on z_t, z_{t+1} and y_t. The conditional distribution of z_t given the variances is

\begin{align*}
p(z_t|h_{1:T}, y_{1:T};\theta) &\propto p(z_t|h_{t-1};\theta)\,p(h_t|z_t;\theta) \\
&\propto \frac{1}{z_t!}\left(\frac{\phi h_{t-1}}{c}\right)^{z_t}\frac{1}{\Gamma(\nu+z_t)}\left(\frac{h_t}{c}\right)^{z_t} \\
&= \text{Bessel}\left(\nu-1,\ 2\sqrt{\frac{\phi h_t h_{t-1}}{c^{2}}}\right)
\end{align*}
which does not depend on the observed data. The distribution for the initial condition is independent of the data conditional on z_1. I find

\begin{align*}
p(h_0|z_{1:T}, y_{1:T};\theta) &\propto p(z_1|h_0;\theta)\,p(h_0;\theta) \\
&\propto \frac{1}{z_1!}\left(\frac{\phi h_0}{c}\right)^{z_1}\exp\left(-\frac{\phi h_0}{c}\right)\frac{h_0^{\nu-1}}{\Gamma(\nu)}\exp\left(-\frac{h_0(1-\phi)}{c}\right)\left(\frac{c}{1-\phi}\right)^{-\nu} \\
&\propto h_0^{\nu+z_1-1}\exp\left(-\frac{h_0}{c}\right) \\
&= \text{Gamma}\left(\nu+z_1,\ c\right)
\end{align*}
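As a quick numerical sanity check of the filtering result in this proof, the normalized product p(y_t|h_t)p(h_t|z_t) can be compared with the GIG(ν + z_t − 1/2, (y_t − µ)², 2/c + β²) density evaluated through SciPy's geninvgauss parameterization (p = λ, b = √(χψ), scale = √(χ/ψ)). The test values below are arbitrary assumptions, not estimates from the paper.

```python
# Numerical check that the normalized product equals the GIG filtering density.
import numpy as np
from scipy import integrate, stats

mu, beta, nu, c, z_t, y_t = 0.0, -1.5, 1.2, 0.4, 3, 0.4    # arbitrary test values

unnorm = lambda h: (stats.norm.pdf(y_t, loc=mu + beta * h, scale=np.sqrt(h))
                    * stats.gamma.pdf(h, a=nu + z_t, scale=c))
const, _ = integrate.quad(unnorm, 0.0, np.inf)

lam, chi, psi = nu + z_t - 0.5, (y_t - mu) ** 2, 2.0 / c + beta ** 2
gig = stats.geninvgauss(p=lam, b=np.sqrt(chi * psi), scale=np.sqrt(chi / psi))

h_grid = np.linspace(0.1, 1.5, 5)
print(unnorm(h_grid) / const)   # these two rows should match
print(gig.pdf(h_grid))
```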

A.4 Proof of Proposition 3:

Recall that the stationary distribution of the variance is a gamma distribution h_0 ∼ Ga(ν, c/(1 − φ)). One iteration of the Markov transition kernel leaves the distribution invariant, so that h_1 ∼ Ga(ν, c/(1 − φ)). The first contribution to the likelihood is

\begin{align*}
p(y_1;\theta) &= \int_{0}^{\infty} p(y_1|h_1;\theta)\,p(h_1;\theta)\,dh_1 \\
&= \int_{0}^{\infty}\frac{1}{\sqrt{2\pi}}\,h_1^{-\frac{1}{2}}\exp\left(-\frac{1}{2}\left(y_1-\mu-\beta h_1\right)^{2}h_1^{-1}\right)\exp\left(-\frac{h_1(1-\phi)}{c}\right)\left(\frac{c}{1-\phi}\right)^{-\nu}\frac{1}{\Gamma(\nu)}\,h_1^{\nu-1}\,dh_1 \\
&= \left(\frac{1-\phi}{c}\right)^{\nu}\frac{1}{\Gamma(\nu)}\frac{1}{\sqrt{2\pi}}\exp\left(\beta(y_1-\mu)\right)\int_{0}^{\infty}h_1^{\nu-\frac{1}{2}-1}\exp\left(-\frac{1}{2}\left[(y_1-\mu)^{2}h_1^{-1}+\left(\frac{2(1-\phi)}{c}+\beta^{2}\right)h_1\right]\right)dh_1 \\
&= \frac{\sqrt{2}\left(\frac{1-\phi}{c}\right)^{\nu}|y_1-\mu|^{\nu-\frac{1}{2}}}{\sqrt{\pi}\,\Gamma(\nu)\left(\sqrt{\frac{2(1-\phi)}{c}+\beta^{2}}\right)^{\nu-\frac{1}{2}}}\exp\left(\beta(y_1-\mu)\right)K_{\nu-\frac{1}{2}}\!\left(\sqrt{\tfrac{2(1-\phi)}{c}+\beta^{2}}\,|y_1-\mu|\right) \\
&= \text{N.G.}\left(\mu,\ \sqrt{\tfrac{2(1-\phi)}{c}+\beta^{2}},\ \beta,\ \nu\right)
\end{align*}

The filtering distribution p(h_1|y_1; θ) immediately follows from Bayes' rule

\begin{align*}
p(h_1|y_1;\theta) &\propto p(y_1|h_1;\theta)\,p(h_1;\theta) \\
&\propto h_1^{-1/2}\exp\left(-\frac{1}{2}\left[(y_1-\mu)^{2}h_1^{-1}+\beta^{2}h_1\right]\right)\exp\left(-\frac{h_1(1-\phi)}{c}\right)h_1^{\nu-1} \\
&\propto h_1^{\nu-\frac{1}{2}-1}\exp\left(-\frac{1}{2}\left[(y_1-\mu)^{2}h_1^{-1}+\left(\frac{2(1-\phi)}{c}+\beta^{2}\right)h_1\right]\right) \\
&= \text{GIG}\left(\nu-\frac{1}{2},\ (y_1-\mu)^{2},\ \frac{2(1-\phi)}{c}+\beta^{2}\right)
\end{align*}

The conditional filtering distribution of z1 is

\begin{align*}
p(z_1|y_1;\theta) &\propto p(y_1|z_1;\theta)\,p(z_1;\theta) \\
&\propto \frac{|y_1-\mu|^{z_1}\,K_{\nu+z_1-\frac{1}{2}}\!\left(\sqrt{\frac{2}{c}+\beta^{2}}\,|y_1-\mu|\right)}{\Gamma(\nu+z_1)\left[c\sqrt{\frac{2}{c}+\beta^{2}}\right]^{z_1}}\;\frac{\Gamma(\nu+z_1)}{z_1!}\,\phi^{z_1} \\
&\propto \frac{1}{z_1!}\left(\frac{\phi\,|y_1-\mu|}{c\sqrt{\frac{2}{c}+\beta^{2}}}\right)^{z_1}K_{\nu+z_1-\frac{1}{2}}\!\left(\sqrt{\tfrac{2}{c}+\beta^{2}}\,|y_1-\mu|\right) \\
&= \text{Sichel}\left(\nu-\frac{1}{2},\ \frac{\phi}{c}\left(y_1-\mu\right)^{2},\ \frac{c}{\phi}\left(\frac{2(1-\phi)}{c}+\beta^{2}\right)\right)
\end{align*}

A Sichel distribution is typically derived as a Poisson random variable with a GIG mixing distribution. Here, it is the posterior distribution of a normal gamma likelihood combined with a negative binomial prior. The one-step ahead predictive distribution of z2 is

\begin{align*}
p(z_2|y_1;\theta) &= \int_{0}^{\infty} p(z_2|h_1;\theta)\,p(h_1|y_1;\theta)\,dh_1 \\
&= \int_{0}^{\infty}\frac{1}{z_2!}\left(\frac{\phi h_1}{c}\right)^{z_2}\exp\left(-\frac{\phi h_1}{c}\right)
\frac{\left(\frac{\frac{2(1-\phi)}{c}+\beta^{2}}{(y_1-\mu)^{2}}\right)^{\frac{1}{2}\left(\nu-\frac{1}{2}\right)}}{2K_{\nu-\frac{1}{2}}\!\left(|y_1-\mu|\sqrt{\frac{2(1-\phi)}{c}+\beta^{2}}\right)}\,
h_1^{\left(\nu-\frac{1}{2}\right)-1}\exp\left(-\frac{1}{2}\left[(y_1-\mu)^{2}h_1^{-1}+\left(\frac{2(1-\phi)}{c}+\beta^{2}\right)h_1\right]\right)dh_1 \\
&= \frac{1}{z_2!}\left(\frac{\phi}{c}\right)^{z_2}\left(\frac{\frac{2(1-\phi)}{c}+\beta^{2}}{\frac{2}{c}+\beta^{2}}\right)^{\frac{1}{2}\left(\nu-\frac{1}{2}\right)}\left(\frac{|y_1-\mu|}{\sqrt{\frac{2}{c}+\beta^{2}}}\right)^{z_2}
\frac{K_{\nu+z_2-\frac{1}{2}}\!\left(|y_1-\mu|\sqrt{\frac{2}{c}+\beta^{2}}\right)}{K_{\nu-\frac{1}{2}}\!\left(|y_1-\mu|\sqrt{\frac{2(1-\phi)}{c}+\beta^{2}}\right)} \\
&= \text{Sichel}\left(\nu-\frac{1}{2},\ \frac{\phi}{c}\left(y_1-\mu\right)^{2},\ \frac{c}{\phi}\left(\frac{2(1-\phi)}{c}+\beta^{2}\right)\right)
\end{align*}

This is the usual construction of a Sichel distribution as a GIG mixture of Poisson random variables.
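This final identity can also be checked by simulation: drawing h_1 from the GIG filtering distribution and then z_2|h_1 ∼ Poisson(φh_1/c) should reproduce the Sichel mass function above. The sketch below is such a check; all numerical values are arbitrary test inputs, not estimates from the paper, and the Sichel probabilities are evaluated directly with scaled Bessel functions.

```python
# Simulation check of p(z_2|y_1) = Sichel(nu - 1/2, (phi/c)(y_1-mu)^2, (c/phi)(2(1-phi)/c + beta^2)).
import numpy as np
from scipy import stats
from scipy.special import kve, gammaln

mu, beta, nu, phi, c, y_1 = 0.0, -1.0, 1.5, 0.5, 1.0, 0.3   # arbitrary test values
rng = np.random.default_rng(0)

# composition sampling of z_2 | y_1
chi_f, psi_f = (y_1 - mu) ** 2, 2.0 * (1.0 - phi) / c + beta ** 2
h_1 = stats.geninvgauss.rvs(p=nu - 0.5, b=np.sqrt(chi_f * psi_f),
                            scale=np.sqrt(chi_f / psi_f), size=200_000, random_state=rng)
z_2 = rng.poisson(phi * h_1 / c)

# closed-form Sichel probabilities on the first few support points
lam, chi, psi = nu - 0.5, (phi / c) * chi_f, (c / phi) * psi_f
eta, omega = np.sqrt(chi * psi), np.sqrt(chi * (psi + 2.0))
z = np.arange(5)
logp = (0.5 * lam * np.log(psi / (psi + 2.0)) + 0.5 * z * np.log(chi / (psi + 2.0))
        - gammaln(z + 1.0) + np.log(kve(lam + z, omega)) - omega
        - np.log(kve(lam, eta)) + eta)

print(np.bincount(z_2, minlength=5)[:5] / z_2.size)   # empirical frequencies
print(np.exp(logp))                                   # should be close
```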
