Gaussian Process Volatility Model
Yue Wu [email protected]
José Miguel Hernández-Lobato [email protected]
Zoubin Ghahramani [email protected]
University of Cambridge, Department of Engineering, Cambridge CB2 1PZ, UK
arXiv:1402.3085v1 [stat.ME] 13 Feb 2014

Abstract

The accurate prediction of time-changing variances is an important task in the modeling of financial data. Standard econometric models are often limited as they assume rigid functional relationships for the variances. Moreover, function parameters are usually learned using maximum likelihood, which can lead to overfitting. To address these problems we introduce a novel model for time-changing variances using Gaussian Processes. A Gaussian Process (GP) defines a distribution over functions, which allows us to capture highly flexible functional relationships for the variances. In addition, we develop an online algorithm to perform inference. The algorithm has two main advantages. First, it takes a Bayesian approach, thereby avoiding overfitting. Second, it is much quicker than current offline inference procedures. Finally, our new model was evaluated on financial data and showed significant improvement in predictive performance over current standard models.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

1. Introduction

Time series of financial returns often exhibit heteroscedasticity, that is, the standard deviation or volatility of the returns is time-dependent. In particular, large returns (either positive or negative) are often followed by returns that are also large in size. The result is that financial time series frequently display periods of low and high volatility. This phenomenon is known as volatility clustering (Cont, 2001). Several univariate models have been proposed for capturing this property. The best known are the Autoregressive Conditional Heteroscedasticity model (ARCH) (Engle, 1982) and its extension, the Generalised Autoregressive Conditional Heteroscedasticity model (GARCH) (Bollerslev, 1986). GARCH has further inspired a host of variants and extensions; a review of many of these models can be found in Hentschel (1995). Most of these GARCH variants attempt to address one or both limitations of GARCH: a) the assumption of a linear dependency between the current volatility and past volatilities, and b) the assumption that positive and negative returns have symmetric effects on volatility. Asymmetric effects are often observed, as large negative returns send measures of volatility soaring, while large positive returns do not (Bekaert & Wu, 2000; Campbell & Hentschel, 1992).

Most solutions proposed in these GARCH variants involve: a) introducing nonlinear functional relationships for the volatility, and b) adding asymmetric terms in the functional relationships. However, the GARCH variants do not fundamentally address the problem that the functional relationship of the volatility is unknown. In addition, these variants can have a high number of parameters, which may lead to overfitting when learned using maximum likelihood.

More recently, volatility modeling has received attention within the machine learning community, with the development of copula processes (Wilson & Ghahramani, 2010) and heteroscedastic Gaussian processes (Lázaro-Gredilla & Titsias, 2011). These models leverage the flexibility of Gaussian Processes (Rasmussen & Williams, 2006) to model the unknown relationship in the variances. However, these models do not address the asymmetric effects of positive and negative returns on volatility.

In this paper we introduce a new non-parametric volatility model, called the Gaussian Process Volatility Model (GP-Vol). This new model is more flexible, as it is not limited by a fixed functional form. Instead, a prior distribution is placed on possible functions using GPs, and the functional relationship is learned from the data. Furthermore, GP-Vol explicitly models the asymmetric effects on volatility of positive and negative returns. Our new volatility model is evaluated in a series of experiments on real financial returns, comparing it against popular econometric models, namely GARCH, EGARCH (Nelson, 1991) and GJR-GARCH (Glosten et al., 1993). Overall, we found that our proposed model has the best predictive performance. In addition, the functional relationship learned by our model is highly intuitive and automatically discovers the nonlinear and asymmetric features that previous models attempt to capture.

The second main contribution of the paper is the development of an online algorithm for learning GP-Vol. GP-Vol is an instance of a Gaussian Process State Space Model (GP-SSM). Most previous work on GP-SSMs (Ko & Fox, 2009; Deisenroth et al., 2009; Deisenroth & Mohamed, 2012) has focused on developing approximation methods for filtering and smoothing the hidden states in GP-SSMs, assuming known GP transition dynamics. Only very recently, Frigola et al. (2013) addressed the problem of learning both the hidden states and the transition dynamics by using Particle Gibbs with ancestor sampling (PGAS) (Lindsten et al., 2012). In this paper, we introduce a new online algorithm for performing inference on GP-SSMs. Our algorithm has similar predictive performance to PGAS on financial datasets, but is much quicker as inference is online.

2. Review of GARCH and GARCH variants

The standard heteroscedastic variance model for financial data is GARCH. GARCH assumes a Gaussian observation model (1) and a linear transition function, so that the time-varying variance \sigma_t^2 is linearly dependent on p previous variance values and q previous squared time series values:

x_t \sim \mathcal{N}(0, \sigma_t^2), \quad (1)

\sigma_t^2 = \alpha_0 + \sum_{j=1}^{q} \alpha_j x_{t-j}^2 + \sum_{i=1}^{p} \beta_i \sigma_{t-i}^2, \quad (2)

where x_t are the values of the return time series being modeled. The GARCH(p,q) generative model is flexible and can produce a variety of clustering behaviors of high and low volatility periods for different settings of the model coefficients \alpha_1, \ldots, \alpha_q and \beta_1, \ldots, \beta_p.

While GARCH is flexible, it has several limitations. First, a linear relationship between \sigma_{t-p:t-1}^2 and \sigma_t^2 is assumed. Second, the effect of positive and negative returns is the same due to the quadratic term x_{t-j}^2. However, it is often observed that large negative returns lead to sharp rises in volatility, while positive returns do not (Bekaert & Wu, 2000; Campbell & Hentschel, 1992).

A more flexible and often cited GARCH extension is Exponential GARCH (EGARCH) (Nelson, 1991):

\log(\sigma_t^2) = \alpha_0 + \sum_{j=1}^{q} \alpha_j g(x_{t-j}) + \sum_{i=1}^{p} \beta_i \log(\sigma_{t-i}^2), \quad (3)

g(x_t) = \theta x_t + \lambda |x_t|.

Asymmetry in the effects of positive and negative returns is introduced through the function g(x_t). If the coefficient \theta is negative, negative returns will increase volatility.

Another popular GARCH extension with asymmetric effects of returns is GJR-GARCH (Glosten et al., 1993):

\sigma_t^2 = \alpha_0 + \sum_{j=1}^{q} \alpha_j x_{t-j}^2 + \sum_{i=1}^{p} \beta_i \sigma_{t-i}^2 + \sum_{k=1}^{r} \gamma_k x_{t-k}^2 I_{t-k}, \quad (4)

I_{t-k} = \begin{cases} 0, & \text{if } x_{t-k} \ge 0 \\ 1, & \text{if } x_{t-k} < 0 \end{cases}.

The asymmetric effect is captured by \gamma_k x_{t-k}^2 I_{t-k}, which is nonzero if x_{t-k} < 0.

3. Gaussian Process State Space Models

GARCH, EGARCH and GJR-GARCH can all be represented as General State-Space or Hidden Markov Models (HMMs) (Baum & Petrie, 1966; Doucet et al., 2001), with the unobserved dynamic variances being the hidden states. Transition functions for the hidden states are fixed and assumed to be linear in these models. The linearity assumption limits the flexibility of these models.

More generally, a non-parametric approach can be taken where a Gaussian Process prior is placed on the transition function, so that the functional form can be learned from data. This Gaussian Process state space model (GP-SSM) is a generalization of the HMM. The two classes of models differ in two main ways. First, in an HMM the transition function has a fixed functional form, while in a GP-SSM it is represented by a GP. Second, in a GP-SSM the states do not have Markovian structure once the transition function is marginalized out; see Section 5 for details.

However, the flexibility of GP-SSMs comes at a cost: inference in GP-SSMs is complicated. Most previous work on GP-SSMs (Ko & Fox, 2009; Deisenroth et al., 2009; Deisenroth & Mohamed, 2012) has focused on developing approximation methods for filtering and smoothing the hidden states in GP-SSMs, assuming known GP dynamics. A few papers considered learning the GP dynamics and the states, but for special cases of GP-SSMs. For example, Turner et al. (2010) applied EM to obtain maximum likelihood estimates for parametric systems that can be represented by GPs. Recently, Frigola et al. (2013) learned both the hidden states and the GP dynamics using PGAS (Lindsten et al., 2012). Unfortunately, PGAS is a full MCMC inference method and can be computationally expensive. In this paper, we present an online Bayesian inference algorithm for learning the hidden states v_{1:T}, the unknown function f, and any hyper-parameters \theta of the model. Our algorithm has similar predictive performance to PGAS, but is much quicker.

4. Gaussian Process Volatility Model

We now introduce our new nonparametric volatility model, which is an instance of a GP-SSM. We call our new model the Gaussian Process Volatility Model (GP-Vol):

x_t \sim \mathcal{N}(0, \sigma_t^2), \quad (5)

v_t := \log(\sigma_t^2) = f(v_{t-1}, x_{t-1}) + \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, \sigma_n^2). \quad (6)

We focus on modeling the log variance instead of the variance, as the former has support on the real line. Equations (5) and (6) define a GP-SSM. Specifically, we place a GP prior on f, letting z_t denote (v_{t-1}, x_{t-1}):

f \sim \mathcal{GP}(m, k), \quad (7)

where m(z_t) is the mean function and k(z_t, z_t') is the covariance or kernel function. In the GP prior, the mean function encodes prior knowledge of the system dynamics. For example, it can encode the fact that large negative returns lead to increases in volatility. The covariance function k(z_t, z_t') gives the prior covariance between outputs f(z_t) and f(z_t'), that is, Cov(f(z_t), f(z_t')) = k(z_t, z_t'). Note that the covariance between the outputs is a function of the pair of inputs. Intuitively, if inputs z_t and z_t' are close to each other, then the covariances between the corresponding outputs should be large, i.e. the outputs should be highly correlated.

The graphical model for GP-Vol is given in Figure 1.

Figure 1. A graphical model for GP-Vol. The transitions of the hidden states v_t are represented by the unknown function f, which takes as inputs the previous state v_{t-1} and the previous observation x_{t-1}.

The explicit dependence of the log variance v_t on the previous return x_{t-1} enables us to model asymmetric effects of positive and negative returns on the variance. Finally, GP-Vol can be extended to depend on p previous log variances and q past returns, as in GARCH(p,q). In this case, the transition would be of the form:

v_t = f(v_{t-1}, v_{t-2}, \ldots, v_{t-p}, x_{t-1}, x_{t-2}, \ldots, x_{t-q}) + \epsilon_t. \quad (8)

5. Bayesian Inference for GP-Vol

In the standard GP regression setting the inputs and targets are observed, and the function f can then be learned using exact inference. However, this is not the case in GP-Vol, where some inputs and all targets v_t are unknown. Directly learning the posterior of the unknown variables (f, \theta, v_{1:T}), where \theta denotes the hyper-parameters of the GP, is a challenging task. Fortunately, we can target p(\theta, v_{1:T} | x_{1:T}), where the function f has been marginalized out. Marginalizing out f introduces dependencies across time for the hidden states. First, consider the conditional prior for the hidden states p(v_{1:T} | \theta) with known \theta and f marginalized out. It is not Gaussian but a product of Gaussians:

p(v_{1:T} | \theta) = p(v_1 | \theta) \prod_{t=2}^{T} p(v_t | \theta, v_{1:t-1}, x_{1:t-1}). \quad (9)

Each term in Equation (9) can be viewed as a standard one-step GP prediction under the prior. Next, consider the posterior for the states p(v_{1:t} | \theta, x_{1:t}). For clarification, the prior distribution of v_t depends on all the previous states v_{1:t-1} and previous observations x_{1:t-1}. In contrast, the posterior for state v_t depends on all available observations x_{1:t}.

The posterior p(v_t | \theta, v_{1:t-1}, x_{1:t}) can be approximated with particles. We now describe a standard sequential Monte Carlo (SMC) particle filter to learn this posterior. Let v_{1:t-1}^i with i = 1, \ldots, N be particles that represent chains of states up to t-1, with corresponding weights W_{t-1}^i. Then the posterior distribution p(v_{1:t-1} | \theta, x_{1:t-1}) is approximated by the weighted particles:

\hat{p}(v_{1:t-1} | \theta, x_{1:t-1}) = \sum_{i=1}^{N} W_{t-1}^i \, \delta_{v_{1:t-1}^i}(v_{1:t-1}). \quad (10)

In addition, the posterior for v_t can be approximated by propagating the previous states forward and importance weighting according to the observation model. Specifically, sample a set of parent indices J according to the weights W_{t-1}^i. Then propagate forward the particles \{v_{1:t-1}^j\}_{j \in J}. Proposals for v_t are drawn from its conditional prior:

v_t^j \sim p(v_t | \theta, v_{1:t-1}^j, x_{1:t-1}). \quad (11)

The proposed particles are importance-weighted according to the observation model:

w_t^j = p(x_t | \theta, v_t^j), \quad (12)

W_t^j = \frac{w_t^j}{\sum_{k=1}^{N} w_t^k}. \quad (13)
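As a minimal sketch of this propagate-and-weight recursion, the code below replaces the marginalized GP conditional prior of Equation (11) with a hypothetical AR(1) transition in the log-variance (the coefficients phi, mu0 and tau are illustrative assumptions, not values from the paper), while keeping the observation weighting of Equations (12) and (13):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in transition: an AR(1) in the log-variance, used here
# instead of the marginalized GP conditional prior to keep the sketch short.
phi, mu0, tau = 0.9, 0.0, 0.3

def smc_step(v_particles, W, x_t):
    """One SMC update: resample parents, propagate (Eq. 11), weight (Eqs. 12-13)."""
    N = len(v_particles)
    parents = rng.choice(N, size=N, p=W)      # parent indices J drawn from W_{t-1}
    v_prev = v_particles[parents]
    # Propose v_t from the (stand-in) conditional prior.
    v_new = mu0 + phi * (v_prev - mu0) + tau * rng.normal(size=N)
    # Weight by the observation model x_t ~ N(0, exp(v_t)), computed in log space.
    log_w = -0.5 * (np.log(2 * np.pi) + v_new + x_t ** 2 * np.exp(-v_new))
    w = np.exp(log_w - log_w.max())           # subtract max for numerical stability
    return v_new, w / w.sum()

# Usage: filter a short return series with N = 500 particles.
N = 500
v_p = rng.normal(0.0, 1.0, size=N)
W = np.full(N, 1.0 / N)
for x_t in [0.1, -0.5, 1.2, -0.3]:
    v_p, W = smc_step(v_p, W, x_t)
```

In GP-Vol the proposal in `smc_step` would instead be the one-step GP predictive distribution conditioned on the whole particle chain, which is what makes the model non-Markovian.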
Finally, the posterior for v_t is approximated by:

\hat{p}(v_t | \theta, v_{1:t-1}, x_{1:t}) = \sum_{j=1}^{N} W_t^j \, \delta_{v_t^j}(v_t). \quad (14)

The above setup learns the states v_t, assuming that \theta is given. Now consider the desired joint posterior p(\theta, v_{1:T} | x_{1:T}). To learn this posterior, first a prior p(\theta, v_{1:T}) = p(v_{1:T} | \theta) p(\theta) is defined. This suggests that the hyper-parameters \theta can also be represented by particles and filtered together with the states. Naively filtering \theta particles without regeneration will fail due to particle impoverishment, where a few or even one particle for \theta receives all the weight. To resolve particle impoverishment, algorithms such as the Regularized Auxiliary Particle Filter (RAPF) (Liu & West, 1999) regenerate parameter particles at each time step by sampling from a kernel. This kernel introduces artificial dynamics and estimation bias, but works well in practice (Wu et al., 2013).

RAPF was designed for Hidden Markov Models, but GP-Vol is non-Markovian once f is marginalized out. Therefore, we design a new version of RAPF for non-Markovian systems and refer to it as the Regularized Auxiliary Particle Chain Filter (RAPCF), Algorithm 1. There are four main parts to RAPCF. First, there is the Auxiliary Particle Filter (APF) part of RAPCF in lines 5 and 6. The APF (Pitt & Shephard, 1999) proposes from more optimal importance densities, by considering how well previous particles would represent the current state (17). Second, the more likely particle chains are propagated forward in line 8. The main difference between RAPF and RAPCF is in what particles are propagated forward. In RAPCF for GP-Vol, particles representing chains of states v_{1:t-1}^i that are more likely to describe the new observation x_t are propagated forward, as the model is non-Markovian, while in standard RAPF only particles for the previous state v_{t-1} are propagated forward. Third, to avoid particle impoverishment in \theta, new particles are generated by applying a Gaussian kernel, in line 9. Finally, the importance weights are computed for the new states, adjusting for the probability of their chains of origin (18).

Algorithm 1 RAPCF for GP-Vol
1: Input: data x_{1:T}, number of particles N, shrinkage parameter 0 < a < 1, priors p(\theta).
2: At t = 0, sample N particles \{\theta_0^i\}_{i=1,\ldots,N} \sim p(\theta_0). Note that \theta_t denotes estimates for the hyper-parameters after observing x_{1:t}, not that \theta is time-varying.
3: Set initial importance weights w_0^i = 1/N.
4: for t = 1 to T do
5:   Compute point estimates m_t^i and \mu_t^i:
       m_t^i = a \theta_{t-1}^i + (1-a) \bar{\theta}_{t-1}, \quad \bar{\theta}_{t-1} = \frac{1}{N} \sum_{n=1}^{N} \theta_{t-1}^n, \quad (15)
       \mu_t^i = E(v_t | m_t^i, v_{1:t-1}^i, x_{1:t-1}). \quad (16)
6:   Compute point-estimate importance weights:
       g_t^i \propto w_{t-1}^i \, p(x_t | \mu_t^i, m_t^i). \quad (17)
7:   Resample N auxiliary indices J = \{j\} according to the weights \{g_t^i\}.
8:   Propagate these chains forward, i.e. set \{v_{1:t-1}^i\}_{i=1}^N = \{v_{1:t-1}^j\}_{j \in J}.
9:   Jitter the parameters: \theta_t^j \sim \mathcal{N}(m_t^j, (1-a^2) V_{t-1}), where V_{t-1} is the empirical covariance of \theta_{t-1}.
10:  Propose new states v_t^j \sim p(v_t | \theta_t^j, v_{1:t-1}^j, x_{1:t-1}).
11:  Compute the importance weights, adjusting for the modified proposal:
       w_t^j \propto \frac{p(x_t | v_t^j, \theta_t^j)}{p(x_t | \mu_t^j, m_t^j)}. \quad (18)
12: end for
13: Output: posterior particles for chains of states v_{1:T}^j, parameters \theta_t^j, and particle weights w_t^j.

RAPCF is a quick online algorithm that filters for unknown states and hyper-parameters. However, it has some limitations, just as standard RAPF does. First, it introduces bias in the estimates of the hyper-parameters, as sampling from the kernel in line 9 adds artificial dynamics. Second, it only filters forward and does not smooth backward. This means that the chains of states are never updated given new information in later observations. Consequently, there will be impoverishment in distant ancestors v_{t-L}, since these ancestor states are not regenerated. When impoverishment in the ancestors occurs, GP-Vol will treat the collapsed ancestor states as inputs with little uncertainty. Therefore, the variance of the predictions near these inputs will be underestimated.

Many of the potential issues faced by RAPCF can be addressed by adopting a full MCMC approach. In particular, Particle Markov Chain Monte Carlo (PMCMC) procedures (Andrieu et al., 2010) established a framework for learning the hidden states and the parameters of general state space models. Additionally, Lindsten et al. (2012) developed a PMCMC algorithm called Particle Gibbs with ancestor sampling (PGAS) for learning non-Markovian state space models, which was applied by Frigola et al. (2013) to learn GP-SSMs.

PGAS is described in Algorithm 2. There are three main parts to PGAS. First, it adopts a Gibbs sampling approach and alternately samples the parameters \theta[m], given all the data x_{1:T} and the current samples of the hidden states v_{1:T}[m-1], where m is the iteration count, and then the states, given the new parameters and the data. Conditionally sampling \theta[m] given x_{1:T} and v_{1:T}[m-1] can often be done with slice sampling (Neal, 2003). Second, samples for v_{1:T}[m] are drawn using a conditional auxiliary particle filter with ancestor sampling (CAPF-AS). CAPF-AS consists of two parts: a) a conditional auxiliary particle filter (CAPF) and b) ancestor sampling. CAPF generates N particles for each hidden state conditional on \theta[m] and v_{1:T}[m-1]. The conditional dependence is necessary, as each hidden state v_t depends on the parameters and all the other hidden states. In particular, Lindsten et al. (2012) verified that the CAPF corresponds to a collapsed Gibbs sampler (Van Dyk & Park, 2008). Then ancestor sampling is used to sample smoothed trajectories from the particles generated by CAPF. Third, the alternate sampling of \theta[m] and v_{1:T}[m] is repeated for M iterations.

Algorithm 2 PGAS
1: Input: data x_{1:T}, number of particles N, number of iterations M, initial hidden states v_{1:T}[0] and initial parameters \theta[0].
2: for m = 1 to M do
3:   Draw \theta[m] \sim p(\cdot | v_{1:T}[m-1], x_{1:T}). This can often be done by slice sampling.
4:   Run a CAPF-AS with N particles, targeting p(v_{1:T} | \theta[m], x_{1:T}) conditional on v_{1:T}[m-1].
5:   Draw a chain v_{1:T}^* with probability proportional to the final weights \{w_T^i\}_{i=1,\ldots,N}.
6:   Set v_{1:T}[m] = v_{1:T}^*.
7: end for
8: Output: chains of particles and parameters \{v_{1:T}[m], \theta[m]\}_{m=1:M}.

6.1. Synthetic data

Ten synthetic datasets of length T = 100 were generated according to Equations (5) and (6). The function f was specified with a linear mean function and a squared exponential covariance function. The linear mean function used was:

E(v_t) = m(v_{t-1}, x_{t-1}) = a v_{t-1} + b x_{t-1}. \quad (19)

This mean function encodes an asymmetric relationship between the positive and negative returns and the volatility. The squared exponential kernel or covariance function is given by:

k(y, z) = \sigma_f^2 \exp\left(-\frac{1}{2 l^2} |y - z|^2\right), \quad (20)

where l is the length-scale parameter and \sigma_f^2 is the signal variance parameter.

RAPCF was used to learn the hidden states v_{1:T} and the hyper-parameters \theta = (a, b, \sigma_n, \sigma_f, l) of f. The algorithm was initiated with diffuse priors placed on \theta. RAPCF was able to recover the hidden states and the hyper-parameters. For the sake of brevity, we only include two typical plots of the 90% posterior intervals for hyper-parameters a and b, in Figures 2 and 3 respectively. The intervals are estimated from the filtered particles for a and b at each time step t. In both plots, the posterior intervals eventually cover the parameters, shown as dotted blue lines, that were used to generate the synthetic dataset.
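To make this generative process concrete, the sketch below draws one dataset from Equations (5), (6), (19) and (20) by sampling f sequentially from its GP conditionals given the function values already drawn. All parameter values (a, b, sigma_f, ell, sigma_n and the initial state) are illustrative assumptions; the paper does not report the settings used for its ten synthetic datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) parameter values.
a, b = 0.6, -0.2          # linear mean function, Eq. (19)
sigma_f, ell = 0.5, 1.0   # SE kernel parameters, Eq. (20)
sigma_n = 0.1             # transition noise std, Eq. (6)
T = 100

def se_kernel(Z1, Z2):
    # Squared exponential kernel of Eq. (20) on inputs z_t = (v_{t-1}, x_{t-1}).
    d2 = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return sigma_f ** 2 * np.exp(-0.5 * d2 / ell ** 2)

def mean_fn(Z):
    # Linear mean of Eq. (19): m(v_{t-1}, x_{t-1}) = a v_{t-1} + b x_{t-1}.
    return a * Z[:, 0] + b * Z[:, 1]

v = [0.0]                                  # assumed initial log-variance v_1
x = [rng.normal(0.0, np.exp(0.5 * v[0]))]  # x_t ~ N(0, exp(v_t)), Eq. (5)
Z_hist, f_hist = [], []                    # past GP inputs and sampled f values

for t in range(1, T):
    z = np.array([[v[-1], x[-1]]])
    mu, var = mean_fn(z)[0], sigma_f ** 2  # GP prior at the new input
    if Z_hist:
        Za, fa = np.vstack(Z_hist), np.array(f_hist)
        K = se_kernel(Za, Za) + 1e-9 * np.eye(len(Za))
        k_star = se_kernel(z, Za)[0]
        # Condition f(z) on the function values already drawn, so the whole
        # trajectory is one consistent sample from a single GP.
        mu += k_star @ np.linalg.solve(K, fa - mean_fn(Za))
        var -= k_star @ np.linalg.solve(K, k_star)
    f_t = rng.normal(mu, np.sqrt(max(var, 0.0)))
    Z_hist.append(z[0]); f_hist.append(f_t)
    v.append(f_t + rng.normal(0.0, sigma_n))        # Eq. (6)
    x.append(rng.normal(0.0, np.exp(0.5 * v[-1])))  # Eq. (5)
```

Because b is negative in the assumed mean function, negative returns raise the next log-variance while positive returns lower it, which is the asymmetric effect the synthetic experiment is designed to exercise.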