Generalised Wishart Processes

Andrew Gordon Wilson∗                          Zoubin Ghahramani
Department of Engineering                      Department of Engineering
University of Cambridge, UK                    University of Cambridge, UK

∗ http://mlg.eng.cam.ac.uk/andrew

Abstract

We introduce a new stochastic process called the generalised Wishart process (GWP). It is a collection of positive semi-definite random matrices indexed by any arbitrary input variable. We use this process as a prior over dynamic (e.g. time varying) covariance matrices Σ(t). The GWP captures a diverse class of covariance dynamics, naturally handles missing data, scales nicely with dimension, has easily interpretable parameters, and can use input variables that include covariates other than time. We describe how to construct the GWP, introduce general procedures for inference and prediction, and show that it outperforms its main competitor, multivariate GARCH, even on financial data that especially suits GARCH.

1 INTRODUCTION

Modelling the dependencies between random variables is fundamental in machine learning and statistics. Covariance matrices provide the simplest measure of dependency, and therefore much attention has been placed on modelling covariance matrices. However, the often implausible assumption of constant variances and covariances can have a significant impact on statistical inferences.

In this paper, we are concerned with modelling the dynamic covariance matrix Σ(t) = cov[y|t] (multivariate volatility), for high dimensional vector valued observations y(t). These models are especially important in econometrics. Brownlees et al. (2009) remark that “The price of essentially every derivative security is affected by swings in volatility.” Indeed, Robert Engle and Clive Granger won the 2003 Nobel prize in economics “for methods of analysing economic time series with time-varying volatility”. The returns on major equity indices and currency exchanges are thought to have a time changing variance and zero mean, and GARCH (Bollerslev, 1986), a generalisation of Engle’s ARCH (Engle, 1982), is arguably unsurpassed at predicting the volatilities of returns on these equity indices and currency exchanges (Poon and Granger, 2005; Hansen and Lunde, 2005; Brownlees et al., 2009).

Multivariate volatility models can be used to understand the dynamic correlations (or co-movement) between equity indices, and can make better univariate predictions than univariate models. A good estimate of the covariance matrix Σ(t) is also necessary for portfolio management. An optimal portfolio allocation w∗ is said to maximise the Sharpe ratio (Sharpe, 1966):

Portfolio return / Portfolio risk = w^T r(t) / √(w^T Σ(t) w) ,   (1)

where r(t) are the expected returns for each asset and Σ(t) is the predicted covariance matrix for these returns. One may also wish to maximise the portfolio return w^T r(t) for a fixed level of risk: √(w^T Σ(t) w) = λ. Multivariate volatility models are also used to understand contagion: the transmission of a financial shock from one entity to another (Bae et al., 2003). And generally – in econometrics, machine learning, climate science, or otherwise – it is useful to know input dependent uncertainty, and the dynamic correlations between multiple entities.
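As a concrete illustration of equation (1), the following minimal NumPy sketch (our own, not from the original paper) evaluates the Sharpe ratio of a fixed allocation under a predicted covariance matrix; all numbers are invented for the example.

```python
import numpy as np

# Hypothetical expected returns r(t) and predicted covariance Sigma(t)
# for three assets; the values are made up for illustration.
r = np.array([0.04, 0.02, 0.03])
Sigma = np.array([[0.10, 0.02, 0.01],
                  [0.02, 0.08, 0.03],
                  [0.01, 0.03, 0.09]])
w = np.array([0.5, 0.3, 0.2])             # portfolio allocation

portfolio_return = w @ r                  # w^T r(t)
portfolio_risk = np.sqrt(w @ Sigma @ w)   # sqrt(w^T Sigma(t) w)
print(portfolio_return / portfolio_risk)  # Sharpe ratio, equation (1)
```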
Despite their importance, existing multivariate volatility models suffer from tractability issues and a lack of generality. For example, multivariate GARCH (MGARCH) has a number of free parameters that scales with dimension to the fourth power, and interpretation and estimation of these parameters is difficult to impossible (Silvennoinen and Teräsvirta, 2009; Gouriéroux, 1997), given the constraint that Σ(t) must be positive definite at all points in time. Thus MGARCH, and alternative multivariate stochastic volatility models, are generally limited to studying processes with fewer than 5 components (Gouriéroux et al., 2009). Recent efforts have led to simpler but less general models, which make assumptions such as constant correlations (Bollerslev, 1990).

We hope to unite machine learning and econometrics in an effort to solve these problems. We introduce a stochastic process, the generalised Wishart process (GWP), which we use as a prior over covariance matrices Σ(t) at all times t. We call it the generalised Wishart process, since it is a generalisation of the first Wishart process defined by Bru (1991).¹ To great acclaim, Bru’s Wishart process has recently been used (Gouriéroux et al., 2009) in multivariate stochastic volatility models (Philipov and Glickman, 2006; Harvey et al., 1994). This prior work is limited for several reasons: 1) it cannot scale to greater than 5 × 5 covariance matrices, 2) it assumes the input variable is a scalar, 3) it is restricted to using an Ornstein-Uhlenbeck (Brownian motion) covariance structure (which means Σ(t + a) and Σ(t − a) are independent given Σ(t), and complex interdependencies cannot be captured), 4) it is autoregressive, and 5) there are no general learning and inference procedures. The generalised Wishart process (GWP) addresses all of these issues. Specifically, in our GWP formulation,

• Estimation of Σ(t) is tractable in at least 200 dimensions, even without a factor representation.

• The input variable can come from any arbitrary index set X, just as easily as it can represent time. This allows one to condition on covariates like interest rates.

• One can easily handle missing data.

• One can easily specify a vast range of covariance structures (periodic, smooth, Ornstein-Uhlenbeck, . . . ).

• We develop procedures to make predictions, and to learn distributions over any relevant parameters. Aspects of the covariance structure are learned from data, rather than being a fixed property of the model.

Overall, the GWP is versatile and simple. It does not require any free parameters, and any optional parameters are easy to interpret. For this reason, it also scales well with dimension. Yet, the GWP provides an especially general description of multivariate volatility – more so than the most general MGARCH specifications. In the next section, we review Gaussian processes (GPs), which are used to construct the GWP we use in this paper. In the following sections we then review the Wishart distribution, present a GWP construction (which we use as a prior over Σ(t) for all t), introduce procedures to sample from the posterior over Σ(t), review the main competitor, MGARCH, and present experiments comparing the GWP to MGARCH on simulated and financial data. These experiments include a 5 dimensional data set, based on returns for NASDAQ, FTSE, NIKKEI, TSE, and the Dow Jones Composite, and a set of returns for 3 foreign currency exchanges. We also have a 200 dimensional experiment to show how the GWP can be used to study high dimensional problems.

Also, although it is not the focus of this paper, we show in the inference section how the GWP can be used as part of a new GP based regression model that accounts for changing correlations. In other words, it can be used to predict the mean µ(t) together with the covariance matrix Σ(t) of a multivariate process. Alternative GP based multivariate regression models for µ(t), which account for fixed correlations, were recently introduced by Bonilla et al. (2008), Teh et al. (2005), and Boyle and Frean (2004). We develop this extension, and many others, in a forthcoming paper.

¹ Our model is also related to Gelfand et al. (2004)’s coregionalisation model, which we discuss in section 5.

2 GAUSSIAN PROCESSES

We briefly review Gaussian processes, since the generalised Wishart process is constructed from GPs. For more detail, see Rasmussen and Williams (2006).

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. Using a Gaussian process, we can define a distribution over functions u(x):

u(x) ∼ GP(m(x), k(x, x′)) ,   (2)

where x is an arbitrary (potentially vector valued) input variable, and the mean m(x) and kernel function k(x, x′) are respectively defined as

m(x) = E[u(x)] ,   (3)

k(x, x′) = cov(u(x), u(x′)) .   (4)

This means that any collection of function values has a joint Gaussian distribution:

(u(x_1), u(x_2), . . . , u(x_N))^T ∼ N(µ, K) ,   (5)

where the N × N covariance matrix K has entries K_ij = k(x_i, x_j), and the mean µ has entries µ_i = m(x_i). The properties of these functions (smoothness, periodicity, etc.) are determined by the kernel function. The squared exponential kernel is popular:

k(x, x′) = exp(−0.5 ||x − x′||²/l²) .   (6)

Functions drawn from a Gaussian process with this kernel function are smooth, and can display long range trends.
The length-scale hyperparameter l is easy to interpret: it determines how much the function values u(x) and u(x + a) depend on one another, for some constant a.

Autoregressive processes such as

u(t + 1) = u(t) + ε(t) ,   (7)

ε(t) ∼ N(0, 1) ,   (8)

are widely used in time series modelling and are a particularly simple special case of Gaussian processes.
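To make the review concrete, here is a minimal sketch (our own illustration, in NumPy) of jointly drawing function values from a GP prior with the squared exponential kernel of equation (6); the inputs and length-scale are arbitrary choices.

```python
import numpy as np

def squared_exponential(x1, x2, l=1.0):
    # Equation (6): k(x, x') = exp(-0.5 ||x - x'||^2 / l^2).
    return np.exp(-0.5 * (x1 - x2) ** 2 / l ** 2)

x = np.linspace(0, 10, 100)                             # N = 100 inputs
K = squared_exponential(x[:, None], x[None, :], l=2.0)  # K_ij = k(x_i, x_j)
K += 1e-8 * np.eye(len(x))                              # jitter for stability

# One joint draw (u(x_1), ..., u(x_N))^T ~ N(0, K), as in equation (5):
u = np.random.multivariate_normal(np.zeros(len(x)), K)
```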
3 WISHART DISTRIBUTION

The Wishart distribution defines a probability density function over positive definite matrices S:

p(S|V, ν) = |S|^{(ν−D−1)/2} exp(−(1/2) tr(V⁻¹S)) / (2^{νD/2} |V|^{ν/2} Γ_D(ν/2)) ,   (9)

where V is a D × D positive definite scale matrix, and ν > D − 1 is the number of degrees of freedom. This distribution has mean νV and mode (ν − D − 1)V for ν ≥ D + 1. Γ_D(·) is the multivariate gamma function:

Γ_D(ν/2) = π^{D(D−1)/4} ∏_{j=1}^{D} Γ(ν/2 + (1 − j)/2) .   (10)

The Wishart distribution is a multivariate generalisation of the gamma distribution when ν is real valued, and of the chi-square (χ²) distribution when ν is integer valued. The sum of squares of univariate Gaussian random variables is chi-squared distributed. Likewise, the sum of outer products of multivariate Gaussian random variables is Wishart distributed:

S = ∑_{i=1}^{ν} u_i u_i^T ∼ W_D(V, ν) ,   (11)

where the u_i are i.i.d. N(0, V) D-dimensional random variables, and W_D(V, ν) is a Wishart distribution with D × D scale matrix V, and ν degrees of freedom. S is a D × D positive definite matrix. If D = V = 1 then W is a chi-square distribution with ν degrees of freedom. S⁻¹ has the inverse Wishart distribution, W_D⁻¹(V⁻¹, ν), which is a conjugate prior for covariance matrices of zero mean Gaussian distributions.
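Equation (11) is easy to verify empirically; the following small NumPy check (ours, with an arbitrary scale matrix) averages many such sums of outer products and recovers the Wishart mean νV.

```python
import numpy as np

rng = np.random.default_rng(0)
D, nu = 3, 10
V = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.0]])        # D x D scale matrix

S_mean = np.zeros((D, D))
for _ in range(5000):
    u = rng.multivariate_normal(np.zeros(D), V, size=nu)  # nu i.i.d. N(0, V) draws
    S_mean += u.T @ u / 5000           # u.T @ u = sum_i u_i u_i^T, equation (11)

print(np.round(S_mean, 2))             # approaches nu * V, the Wishart mean
```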

4 A GENERALISED WISHART PROCESS CONSTRUCTION

We saw that the Wishart distribution is constructed from multivariate Gaussian distributions. Essentially, by replacing these Gaussian distributions with Gaussian processes, we define a process with Wishart marginals – an example of a generalised Wishart process. It is a collection of positive semi-definite random matrices indexed by any arbitrary (potentially high dimensional) variable x. For clarity, we assume that time is the input variable, even though it takes no more effort to use a vector valued variable x from any arbitrary set. Everything still applies if we replace t with x. In an upcoming journal submission (Wilson and Ghahramani, 2011b), we introduce several new constructions, some of which do not have Wishart marginals.

Suppose we have νD independent Gaussian process functions, u_id(t) ∼ GP(0, k), where i = 1, . . . , ν and d = 1, . . . , D. This means cov(u_id(t), u_i′d′(t′)) = k(t, t′)δ_ii′ δ_dd′, and (u_id(t_1), u_id(t_2), . . . , u_id(t_N))^T ∼ N(0, K), where δ_ij is the Kronecker delta, and K is an N × N covariance matrix with elements K_ij = k(t_i, t_j). Let û_i(t) = (u_i1(t), . . . , u_iD(t))^T, and let L be the lower Cholesky decomposition of a D × D scale matrix V, such that LL^T = V. Then at each t the covariance matrix Σ(t) has a Wishart marginal distribution,

Σ(t) = ∑_{i=1}^{ν} L û_i(t) û_i^T(t) L^T ∼ W_D(V, ν) ,   (12)

subject to the constraint that the kernel function satisfies k(t, t) = 1.

We can understand (12) as follows. Each element of the vector û_i(t) is a univariate Gaussian with zero mean and variance k(t, t) = 1. Since these elements are uncorrelated, û_i(t) ∼ N(0, I). Therefore L û_i(t) ∼ N(0, V), since E[L û_i(t) û_i^T(t) L^T] = L I L^T = LL^T = V. We are summing the outer products of N(0, V) random variables, and there are ν terms in the sum, so by definition this has a Wishart distribution W_D(V, ν). It was not a restriction to assume k(t, t) = 1, since any scaling of k can be absorbed into L. In a forthcoming journal paper (Wilson and Ghahramani, 2011b), we use the following formal and general definition, for a GWP indexed by a variable x in an arbitrary set X.

Definition 4.1. A generalised Wishart process is a collection of positive semi-definite random matrices indexed by x ∈ X and constructed from outer products of points from collections of stochastic processes as in (12). If the random matrices have 1) Wishart marginal distributions, meaning that Σ(x) ∼ W_D(V, ν) at every x ∈ X, and 2) dependence on x as defined by a kernel k(x, x′), then we write

Σ(x) ∼ GWP(V, ν, k(x, x′)) .   (13)

The GWP notation of Definition 4.1 is just like the notation for a Wishart distribution, but includes a kernel k(t, t′) which controls the dynamics of how Σ(t) varies with t. This pleasingly compartmentalises the GWP, since there is separation between the shape parameters and the temporal (or spatial) dynamics parameters. We show an example of these dynamics in Figure 1 with a draw from a GWP.

Using this construction, we can also define a generalised inverse Wishart process (GIWP). If Σ(t) ∼ GWP, then inversion at each value of t defines a draw R(t) = Σ(t)⁻¹ from the GIWP. The conjugacy of the GIWP with a Gaussian likelihood could be useful when doing Bayesian inference.

We can further extend this construction by replacing the Gaussian processes with copula processes (Wilson and Ghahramani, 2010). For example, as part of Bayesian inference we could learn a mapping that would transform the Gaussian processes u_id to Gaussian copula processes with marginals that better suit the covariance structure of our data set. In a forthcoming journal paper (Wilson and Ghahramani, 2011b), we elaborate on such a construction, which allows the marginals to be Wishart distributed with real valued degrees of freedom ν. In this journal paper we also introduce efficient representations, and “heavy-tailed” representations.

The formulation we outlined in this section is different from other multivariate volatility models in that one can specify a kernel function k(t, t′) that controls how Σ(t) varies with t – for example, k(t, t′) could be periodic – and t need not be time: it can be an arbitrary input variable, including covariates like interest rates, and does not need to be represented on an evenly spaced grid. In the next section we introduce, for the first time, general inference procedures for making predictions when using a Wishart process prior, allowing 1) the model to scale to hundreds of dimensions without a factor representation, and 2) aspects of the covariance structure to be learned from data. These are based on recently developed Markov chain Monte Carlo techniques (Murray et al., 2010). We also introduce a new method for doing multivariate GP based regression with dynamic correlations.
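A prior draw from the construction in (12) can be sketched in a few lines of NumPy (our own illustration; the squared exponential kernel, scale matrix, and sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
D, nu, N = 2, 3, 200                 # dimension, degrees of freedom, time steps
t = np.linspace(0, 10, N)

def kernel(t1, t2, l=2.0):
    # Squared exponential kernel with k(t, t) = 1, as the construction requires.
    return np.exp(-0.5 * (t1 - t2) ** 2 / l ** 2)

K = kernel(t[:, None], t[None, :]) + 1e-8 * np.eye(N)
V = np.array([[1.0, 0.5],
              [0.5, 1.0]])           # scale matrix
L = np.linalg.cholesky(V)            # L L^T = V

# nu * D independent GP draws u_id(t) ~ GP(0, k), stored with shape (nu, D, N).
u = rng.multivariate_normal(np.zeros(N), K, size=(nu, D))

# Equation (12): Sigma(t_n) = sum_i L u_i(t_n) u_i(t_n)^T L^T ~ W_D(V, nu).
Sigma = np.zeros((N, D, D))
for n in range(N):
    for i in range(nu):
        a = L @ u[i, :, n]
        Sigma[n] += np.outer(a, a)
```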

5 BAYESIAN INFERENCE

[Figure 1: A draw from a generalised Wishart process (GWP). Each ellipse is a 2 × 2 covariance matrix indexed by time, which increases from left to right. The rotation indicates the correlation between the two variables, and the major and minor axes scale with the eigenvalues of the matrix. Like a draw from a Gaussian process is a collection of function values indexed by time, a draw from a GWP is a collection of matrices indexed by time.]

Assume we have a generalised Wishart process prior on a dynamic D × D covariance matrix:

Σ(t) ∼ GWP(V, ν, k) .   (14)

We want to sample from the posterior over Σ(t) given a D-dimensional data set D = {y(t_n) : n = 1, . . . , N}. We explain how to do this for a general likelihood function, p(D|Σ(t)), by finding the posterior distributions over the parameters in the model, given the data D. These parameters are: a vector of all relevant GP function values u, the hyperparameters of the GP kernel function θ, the degrees of freedom ν, and L, the lower Cholesky decomposition of the scale matrix V (LL^T = V). The graphical model in Figure 1s (of the supplementary material²) shows all the relevant parameters and conditional dependence relationships. The free parameters L and θ have clear interpretations: L gives a prior on the expectation of Σ(t) for all t, and θ describes how this structure changes with time – how much past data, for instance, one would need to make an accurate forecast. The degrees of freedom ν control how concentrated our prior is around our expected value of Σ(t); the smaller ν, the broader our prior on Σ(t). Learning the values of these parameters provides useful information. In the supplementary material, we show explicitly how the kernel function k controls the autocovariances for entries of Σ(t) at different times.

² http://mlg.eng.cam.ac.uk/andrew/gwpsupp.pdf

We can sample from these posterior distributions using Gibbs sampling (Geman and Geman, 1984), a Markov chain Monte Carlo algorithm in which initialising {u, θ, L, ν} and then sampling in cycles from
p(u|θ, L, ν, D) ∝ p(D|u, L, ν) p(u|θ) ,   (15)

p(θ|u, L, ν, D) ∝ p(u|θ) p(θ) ,   (16)

p(L|θ, u, ν, D) ∝ p(D|u, L, ν) p(L) ,   (17)

p(ν|θ, u, L, D) ∝ p(D|u, L, ν) p(ν) ,   (18)

will converge to samples from p(u, θ, L, ν|D). We will successively describe how to sample from the posterior distributions (15), (16), (17), and (18). In our discussion we assume there are N data points (one at each time step or input), and D dimensions. We then explain how to make predictions of Σ(t∗) at some test input t∗. Finally, we discuss a potential likelihood function, and how the GWP could also be used as part of a new GP based model for multivariate regression with outputs that have changing correlations.

5.1 SAMPLING THE GP FUNCTIONS

In this section we describe how to sample from the posterior distribution (15) over the Gaussian process function values u. We order the entries of u by fixing the degrees of freedom and dimension, and running the time steps from n = 1, . . . , N. We then increment dimensions, and finally, degrees of freedom. So u is a vector of length NDν. As before, let K be an N × N covariance matrix, formed by evaluating the kernel function at all pairs of training inputs. Then the prior p(u|θ) is a Gaussian distribution with NDν × NDν block diagonal covariance matrix K_B, formed using Dν of the K matrices; if the hyperparameters of the kernel function change depending on dimension or degrees of freedom, then these K matrices will be different from one another. In short,

p(u|θ) = N(0, K_B) .   (19)

With this prior, and the likelihood formulated in terms of the other parameters, we can sample from the posterior (15). Sampling from this posterior is difficult, because the Gaussian process function values are highly correlated by the K_B matrix. We use elliptical slice sampling (Murray et al., 2010): it has no free parameters, jointly updates every element of u, and was especially designed to sample from posteriors with correlated Gaussian priors. We found it effective.
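For reference, a compact sketch of one elliptical slice sampling update (following Murray et al., 2010) is given below. It is our own transcription of the published algorithm for a generic state u with prior N(0, K_B) and an arbitrary log-likelihood function, not the authors’ code; the block structure of K_B is ignored here for brevity.

```python
import numpy as np

def elliptical_slice(u, chol_KB, log_lik):
    """One elliptical slice sampling update for a state u with prior
    N(0, K_B), where chol_KB is the lower Cholesky factor of K_B and
    log_lik(u) returns the log-likelihood log p(D|u, ...)."""
    v = chol_KB @ np.random.randn(len(u))          # auxiliary draw from the prior
    log_y = log_lik(u) + np.log(np.random.rand())  # slice height
    theta = np.random.uniform(0.0, 2.0 * np.pi)
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        u_new = u * np.cos(theta) + v * np.sin(theta)  # point on the ellipse
        if log_lik(u_new) > log_y:
            return u_new                           # accept: still on the slice
        # Shrink the angle bracket towards 0 and retry.
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = np.random.uniform(theta_min, theta_max)
```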
(22) Sampling (Murray et al., 2010): it has no free param- eters, jointly updates every element of u, and was es- We can then construct Σ(t∗) using equation (12) and pecially designed to sample from posteriors with cor- the elements of u∗. related Gaussian priors. We found it effective. 5.4 LIKELIHOOD FUNCTION 5.2 SAMPLING OTHER PARAMETERS So far we have avoided making the likelihood explicit; We can similarly obtain distributions over the other the inference procedure we described will work with a parameters. The priors we use will depend on the variety of likelihoods parametrized through a matrix data we are modelling. We placed a vague lognor- Σ(t), such as the multivariate t distribution. However, mal prior on θ and sampled from the posterior (16) assuming for simplicity that each of the variables y(tn) using axis aligned slice sampling if θ was one dimen- has a Gaussian distribution, sional, and Metropolis Hastings otherwise. We also used Metropolis Hastings to sample from (17), with a y(t) ∼ N (µ(t), Σ(t)), (23) spherical Gaussian prior on the elements of L. To sam- then the likelihood is ple (18), one can use reversible jump MCMC (Green, 1995; Robert and Casella, 2004). But in our experi- N Y ments we set ν = D + 1, letting the prior be as flexible p(D|µ(t), Σ(t)) = p(y(tn)|µ(tn), Σ(tn)) as possible, and focus on other aspects of our model. n=1 In an upcoming journal paper (Wilson and Ghahra- N Y −1/2 1 > −1 mani, 2011b), we introduce a GWP construction with = |2πΣ(tn)| exp[− w(tn) Σ(tn) w(tn)], 2 real valued degrees of freedom, where it is easy to sam- n=1 ple from p(ν|D). Although learning L is not expensive, (24) where w(tn) = y(tn) − µ(tn). principle go to about 1000 dimensions, assuming ν is O(D), and for instance, a couple years worth of We can learn a distribution over µ(t), in addition to financial training data, which is typical for GARCH Σ(t). One possible specification would be to let µ(t) (Brownlees et al., 2009). In practice, MCMC may be be a separate vector of Gaussian processes: infeasible for very high D, but we have found Elliptical µ(t) = uˆ (t) . (25) Slice Sampling incredibly robust. Overall, this is im- ν+1 pressive scaling – without further assumptions in our We discuss this further in an upcoming paper where model, we can go well beyond 5 dimensions with full we develop a multivariate Gaussian process regres- generality. In the supplementary material we have a sion model which accounts for dynamic correlations 200 dimensional experiment as an empirical accomp- between the outputs. inament to this discussion.

5.4 LIKELIHOOD FUNCTION

So far we have avoided making the likelihood explicit; the inference procedure we described will work with a variety of likelihoods parametrized through a matrix Σ(t), such as the multivariate t distribution. However, assuming for simplicity that each of the variables y(t_n) has a Gaussian distribution,

y(t) ∼ N(µ(t), Σ(t)) ,   (23)

then the likelihood is

p(D|µ(t), Σ(t)) = ∏_{n=1}^{N} p(y(t_n)|µ(t_n), Σ(t_n)) = ∏_{n=1}^{N} |2πΣ(t_n)|^{−1/2} exp[−(1/2) w(t_n)^T Σ(t_n)⁻¹ w(t_n)] ,   (24)

where w(t_n) = y(t_n) − µ(t_n).

We can learn a distribution over µ(t), in addition to Σ(t). One possible specification would be to let µ(t) be a separate vector of Gaussian processes:

µ(t) = û_{ν+1}(t) .   (25)

We discuss this further in an upcoming paper where we develop a multivariate Gaussian process regression model which accounts for dynamic correlations between the outputs.

Alternative models, which account for fixed correlations, have recently been introduced by Bonilla et al. (2008), Teh et al. (2005), and Boyle and Frean (2004). Rather than use a GP based regression as in (25), Gelfand et al. (2004) combine a spatial Wishart process with a parametric linear regression on the mean, to make correlated mean predictions in a 2D spatial setting. There are some important differences in our methodology: 1) the correlation structure is a fixed property of their model (e.g. they do not learn the parameters of a kernel function, and so the autocorrelations are not learned from data), 2) they are not interested in developing a model of multivariate volatility; they do not explicitly evaluate or provide a means to forecast dynamic correlations, 3) the prior expectation E[Σ(t)] at each t is diagonal (which is not what we expect when applying a multivariate volatility model), 4) the regression on µ(x) is linear, 5) the observations must be on a regularly spaced grid (no missing observations), and perhaps most importantly, 6) the inference relies solely on Metropolis Hastings with Gaussian proposals, which is not tractable for p > 3, and will not mix efficiently as the strong GP prior correlations are not accounted for (Murray et al., 2010). In this paper we focus on making predictions of Σ(t), setting µ = 0. In an upcoming paper we develop and implement a multivariate GP regression model which accounts for dynamic correlations between the outputs, like we have briefly described in this section.

5.5 COMPUTATIONAL COMPLEXITY

Our method is mainly limited by taking the Cholesky decomposition of the block diagonal K_B, an NDν × NDν matrix. However, chol(blkdiag(A, B, . . . )) = blkdiag(chol(A), chol(B), . . . ). So in the case with equal length-scales for each dimension, we only need to take the Cholesky of an N × N matrix K, an O(N³) operation, independent of dimension! In the more general case with D different length-scales, it is an O(DN³) operation. The total “training” complexity therefore amounts to one O(N³) or one O(DN³) operation. Sampling then requires likelihood evaluations, which cost O(NνD²) operations. Thus we could in principle go to about 1000 dimensions, assuming ν is O(D), and, for instance, a couple of years’ worth of financial training data, which is typical for GARCH (Brownlees et al., 2009). In practice, MCMC may be infeasible for very high D, but we have found elliptical slice sampling incredibly robust. Overall, this is impressive scaling – without further assumptions in our model, we can go well beyond 5 dimensions with full generality. In the supplementary material we have a 200 dimensional experiment as an empirical accompaniment to this discussion.
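The block diagonal identity used above is easy to confirm numerically; a small sketch (ours, using np.kron to assemble the block diagonal from identical blocks):

```python
import numpy as np

N = 5
x = np.arange(N, dtype=float)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 1e-8 * np.eye(N)

n_blocks = 3                            # e.g. D * nu identical blocks
K_B = np.kron(np.eye(n_blocks), K)      # block diagonal K_B

# chol(blkdiag(K, ..., K)) equals blkdiag(chol(K), ..., chol(K)),
# so only the N x N factorisation is ever needed.
L_full = np.linalg.cholesky(K_B)
L_block = np.kron(np.eye(n_blocks), np.linalg.cholesky(K))
print(np.allclose(L_full, L_block))     # True
```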
6 MULTIVARIATE GARCH

We compare predictions of Σ(t) made by the generalised Wishart process to those made by multivariate GARCH (MGARCH), since GARCH (Bollerslev, 1986) is extremely popular and arguably unsurpassed at predicting the volatility of returns on equity indices and currency exchanges (Poon and Granger, 2005; Hansen and Lunde, 2005; Brownlees et al., 2009).

Consider a zero mean D dimensional vector stochastic process y(t) with a time changing covariance matrix Σ(t) as in (23). In the general MGARCH framework,

y(t) = Σ(t)^{1/2} η(t) ,   (26)

where η(t) is an i.i.d. vector white noise process with E[η(t)η(t)^T] = I, and Σ(t) is the covariance matrix of y(t) conditioned on all information up until time t − 1.

The first and most general MGARCH model, the VEC model of Bollerslev et al. (1988), specifies Σ_t as

vech(Σ_t) = a_0 + ∑_{i=1}^{q} A_i vech(y_{t−i} y_{t−i}^T) + ∑_{j=1}^{p} B_j vech(Σ_{t−j}) .   (27)

A_i and B_j are D(D + 1)/2 × D(D + 1)/2 matrices of parameters, and a_0 is a D(D + 1)/2 × 1 vector of parameters.³ The vech operator stacks the columns of the lower triangular part of a D × D matrix into a vector of length D(D + 1)/2. For example, vech(Σ) = (Σ_11, Σ_21, . . . , Σ_D1, Σ_22, . . . , Σ_D2, . . . , Σ_DD)^T. This model is general, but difficult to use. There are (p + q)(D(D + 1)/2)² + D(D + 1)/2 parameters! These parameters are hard to interpret, and there are no conditions under which Σ_t is positive definite for all t. Gouriéroux (1997) discusses the challenging (and sometimes impossible) problem of keeping Σ_t positive definite. Training is done by a constrained maximum likelihood, where the log likelihood, supposing that η_t ∼ N(0, I) and that there are N training points, is given by

L = −(ND/2) log(2π) − (1/2) ∑_{t=1}^{N} [log |Σ_t| + y_t^T Σ_t⁻¹ y_t] .   (28)

³ We use Σ(t) and Σ_t interchangeably.

Subsequent efforts have led to simpler but less general models. We can let A_i and B_j be diagonal matrices. This model has notably fewer (though still (p + q + 1)D(D + 1)/2) parameters, and there are conditions under which Σ_t is positive definite for all t (Engle et al., 1994). But now there are no interactions between the different conditional variances and covariances. A popular variant assumes constant correlations between the D components of y, and only lets the marginal variances – the diagonal entries of Σ(t) – vary (Bollerslev, 1990).

We compare to the ‘full’ BEKK variant of Engle and Kroner (1995), as implemented by Kevin Sheppard in the UCSD GARCH Toolbox.⁴ We chose BEKK because it is the most general MGARCH variant in widespread use. We use the first order model:

Σ_t = CC^T + A^T y_{t−1} y_{t−1}^T A + B^T Σ_{t−1} B ,   (29)

where A, B and C are D × D matrices of parameters. C is lower triangular to ensure that Σ_t is positive definite during maximum likelihood training. For a full review of MGARCH, see Silvennoinen and Teräsvirta (2009).

⁴ http://www.kevinsheppard.com/wiki/UCSD_GARCH
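To make the recursion in (29) concrete, here is a minimal filtering sketch (our own illustration, not the UCSD toolbox code); parameter estimation by the constrained maximum likelihood of (28) is omitted, and the initialisation is one common choice among several.

```python
import numpy as np

def bekk_covariances(y, A, B, C):
    """Run the first order BEKK recursion of equation (29) over returns y
    of shape (T, D), given D x D parameter matrices A, B and lower
    triangular C. Returns the conditional covariances Sigma_t."""
    T, D = y.shape
    Sigma = np.zeros((T, D, D))
    Sigma[0] = np.cov(y.T)            # one common initialisation choice
    for t in range(1, T):
        Sigma[t] = (C @ C.T
                    + A.T @ np.outer(y[t - 1], y[t - 1]) @ A
                    + B.T @ Sigma[t - 1] @ B)
    return Sigma
```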
7 EXPERIMENTS

In our experiments, we predict the covariance matrix for multivariate observations y(t) as Σ̂(t) = E[Σ(t)|D]. These experiments closely follow Brownlees et al. (2009), a rigorous empirical comparison of GARCH models. We use a Gaussian likelihood, as in (24), except with a zero mean function. We make historical predictions, and one step ahead forecasts. Historical predictions are made at observed time points, or between these points. The one step ahead forecasts are predictions of Σ(t + 1) taking into account all observations until time t. Historical predictions can be used, for example, to understand the nature of covariances between equity indices during a past financial crisis.

To make these predictions we learn distributions over the GWP parameters through the Gibbs sampling procedure outlined in section 5. The kernel functions we use are solely parametrized by a one dimensional length-scale l, which indicates how dependent Σ(t) and Σ(t + a) are on one another. We place a lognormal prior on the length-scale, and sample from the posterior with axis-aligned slice sampling.

For each experiment, we choose a kernel function we want to use with the GWP. We then compare to a GWP that uses an Ornstein-Uhlenbeck (OU) kernel function, k(t, t′) = exp(−|t − t′|/l). Even though we are still taking advantage of the inference procedures in the GWP formulation, we refer to this variant of the GWP as a simple Wishart process (WP), since the classic Bru (1991) construction is like a special case of a generalised Wishart process restricted to using a one dimensional GP with an OU covariance structure.

To assess predictions we use the Mean Squared Error (MSE) between the predicted and true covariance matrices. Since we never observe the true Σ(t) on real data, when the truth is not known we use the proxy S_ij(t) = y_i(t)y_j(t), to harmonize with the econometrics literature. Here y_i is the i-th component of the multivariate observation y(t). This proxy is intuitive because E[y_i(t)y_j(t)] = Σ_ij(t), assuming y(t) has a zero mean. In a thorough empirical study, Brownlees et al. (2009) use the univariate analogue of this proxy. We do not use likelihood for assessing historical predictions, since that is a training error (for MGARCH), but we do use log likelihood (L) for forecasts.

We begin by generating a 2 × 2 time varying covariance matrix Σ_p(t) with periodic components, and simulating data at 291 time steps from a Gaussian:

y(t) ∼ N(0, Σ_p(t)) .   (30)

Periodicity is especially common to financial and climate data, where daily trends repeat themselves. For example, the intraday volatility on equity indices and currency exchanges has a periodic covariance structure. Andersen and Bollerslev (1997) discuss the lack of – and critical need for – models that account for this periodicity. In the GWP formulation, we can easily account for it by using a periodic kernel function, whereas in previous Wishart process volatility models we are stuck with an OU covariance structure. We reconstruct Σ_p(t) using the periodic kernel k(t, t′) = exp(−2 sin²(t − t′)/l²). We reconstructed the historical Σ_p at all 291 data points, and after having learned the parameters for each of the models from the first 200 data points, made one step forecasts for the last 91 points. Table 1 and Figure 2 show the results. We call this data set PERIODIC. The GWP outperforms the competition on all error measures. It identifies the periodicity and underlying smoothness of Σ_p that neither the WP nor MGARCH accurately discern: both are too erratic. MGARCH is especially poor at learning the time changing covariance in this data set.

For our next experiment, we predict Σ(t) for the returns on three currency exchanges – the Canadian to US Dollar, the Euro to USD, and the USD to the Great Britain Pound – in the period 15/7/2008-15/2/2010; this encompasses the recent financial crisis and so is of particular interest to economists.⁵ We call this data set EXCHANGE. We use the proxy S_ij(t) = y_i(t)y_j(t). With the GWP, we use the squared exponential kernel k(t, t′) = exp(−0.5(t − t′)²/l²). We make 200 one step ahead forecasts, having learned the parameters for each of the models on the previous 200 data points. We also make 200 historical predictions for the same data points as the forecasts. Results are in Table 1.

⁵ We define a return as r_t = log(P_{t+1}/P_t), where P_t is the price on day t.
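A minimal sketch (ours) of the proxy based assessment described above: it forms S_ij(t) = y_i(t)y_j(t) for a zero mean return series and computes the MSE against a sequence of predicted covariance matrices.

```python
import numpy as np

def proxy_mse(Sigma_pred, y):
    """MSE between predicted covariances Sigma_pred, shape (T, D, D), and
    the proxy S_ij(t) = y_i(t) y_j(t), whose expectation is Sigma_ij(t)
    when y(t) has zero mean."""
    proxy = np.einsum('ti,tj->tij', y, y)   # outer products y(t) y(t)^T
    return np.mean((Sigma_pred - proxy) ** 2)
```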

Table 1: Error for predicting multivariate volatility.

            MSE Historical    MSE Forecast    L Forecast
PERIODIC:
GWP         0.0210            0.0295          −257
WP          0.115             0.760           −286
MGARCH      0.228             0.488           −270
EXCHANGE:
GWP         3.88 × 10⁻⁹       4.80 × 10⁻⁹     2020
WP          3.88 × 10⁻⁹       6.98 × 10⁻⁹     1950
MGARCH      3.96 × 10⁻⁹       4.94 × 10⁻⁹     2050
EQUITY:
GWP         2.80 × 10⁻⁹       5.84 × 10⁻⁹     2930
WP          3.96 × 10⁻⁹       8.92 × 10⁻⁹     1710
MGARCH      6.68 × 10⁻⁹       29.4 × 10⁻⁹     2760

[Figure 2: Reconstructing the historical Σ_p(t) for the PERIODIC data set. We show the truth (green), and GWP (blue), WP (dashed magenta), and MGARCH (thin red) predictions. a) and b) are the marginal variances (diagonal elements of Σ_p(t)), and c) is the covariance.]

Unfortunately, we cannot properly assess predictions on natural data, because we do not know the true Σ(t). For example, whether we use MSE with a proxy, or likelihood, historical predictions would be assessed with the same data used for training. In consideration of this problem, we generated a time varying covariance matrix Σ̃(t) based on the empirical time varying covariance of the daily returns on five equity indices – NASDAQ, FTSE, TSE, NIKKEI, and the Dow Jones Composite – over the period from 15/2/1990-15/2/2010. We then generated a return series by sampling from a multivariate Gaussian at each time step using Σ̃(t). As seen in Figure 2s (supplementary), the generated return series behaves like equity index returns. This method is not faultless; for example, we assume that the returns are normally distributed. However, the models we compare between also make this assumption, and so no model is given an unfair advantage. And there is a critical benefit: we compare predictions with the true underlying Σ̃(t).

To make forecasts and historical predictions on this data set (EQUITY), we used a GWP with a squared exponential kernel, k(t, t′) = exp(−0.5(t − t′)²/l²). We follow the same procedure as before and make 200 forecasts and historical predictions; results are in Table 1.

Both the EXCHANGE and EQUITY data sets are especially suited to GARCH (Poon and Granger, 2005; Hansen and Lunde, 2005; Brownlees et al., 2009; McCullough and Renfro, 1998; Brooks et al., 2001). However, the generalised Wishart process outperforms GARCH on both of these data sets. Based on our experiments, there is evidence that the GWP is particularly good at capturing the covariances (off-diagonal elements of Σ(t)) as compared to GARCH. The GWP also outperforms the WP, which has a fixed OU covariance structure, even though in our experiments the WP takes advantage of the new inference procedures we have derived. Thus the difference in performance is likely because the GWP is capable of capturing complex interdependencies, whereas the WP is not.

8 DISCUSSION

We introduced a stochastic process – the generalised Wishart process (GWP) – which we used to model time-varying covariance matrices Σ(t). In the future, the GWP could be applied to study how Σ depends on covariates like interest rates, in addition to time. In a forthcoming journal paper we introduce several new GWP constructions, with benefits in expressivity and efficiency (Wilson and Ghahramani, 2011b).

We hope to unify efforts in machine learning and econometrics to inspire new multivariate volatility models that are simultaneously general, easy to interpret, and tractable in high dimensions.

References

Andersen, T. G. and Bollerslev, T. (1997). Intraday periodicity and volatility persistence in financial markets. Journal of Empirical Finance, 4(2-3):115–158.

Bae, K., Karolyi, G., and Stulz, R. (2003). A new approach to measuring financial contagion. Review of Financial Studies, 16(3):717.

Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307–327.

Bollerslev, T. (1990). Modeling the coherence in short-term nominal exchange rates: A multivariate generalized ARCH approach. Review of Economics and Statistics, 72:498–505.

Bollerslev, T., Engle, R. F., and Wooldridge, J. M. (1988). A capital asset pricing model with time-varying covariances. The Journal of Political Economy, 96(1):116–131.

Bonilla, E., Chai, K., and Williams, C. (2008). Multi-task Gaussian process prediction. In NIPS.

Boyle, P. and Frean, M. (2004). Dependent Gaussian processes. In NIPS.

Brooks, C., Burke, S., and Persand, G. (2001). Benchmarks and the accuracy of GARCH model estimation. International Journal of Forecasting, 17:45–56.

Brownlees, C. T., Engle, R. F., and Kelly, B. T. (2009). A practical guide to volatility forecasting through calm and storm. Available at SSRN: http://ssrn.com/abstract=1502915.

Bru, M. (1991). Wishart processes. Journal of Theoretical Probability, 4(4):725–751.

Chen, Y. and Welling, M. (2010). Dynamical products of experts for modeling financial time series. In ICML, pages 207–214.

Engle, R. and Kroner, K. (1995). Multivariate simultaneous generalized ARCH. Econometric Theory, 11(01):122–150.

Engle, R., Nelson, D., and Bollerslev, T. (1994). ARCH models. Handbook of Econometrics, 4:2959–3038.

Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4):987–1007.

Gelfand, A., Schmidt, A., Banerjee, S., and Sirmans, C. (2004). Nonstationary multivariate process modeling through spatially varying coregionalization. Test, 13(2):263–312.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(2):721–741.

Gouriéroux, C. (1997). ARCH Models and Financial Applications. Springer Verlag.

Gouriéroux, C., Jasiak, J., and Sufana, R. (2009). The Wishart autoregressive process of multivariate stochastic volatility. Journal of Econometrics, 150(2):167–181.

Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711.

Hansen, P. R. and Lunde, A. (2005). A forecast comparison of volatility models: Does anything beat a GARCH(1,1)? Journal of Applied Econometrics, 20(7):873–889.

Harvey, A., Ruiz, E., and Shephard, N. (1994). Multivariate stochastic variance models. The Review of Economic Studies, 61(2):247–264.

McCullough, B. and Renfro, C. (1998). Benchmarks and software standards: A case study of GARCH procedures. Journal of Economic and Social Measurement, 25:59–71.

Murray, I., Adams, R. P., and MacKay, D. J. (2010). Elliptical slice sampling. JMLR: W&CP, 9:541–548.

Philipov, A. and Glickman, M. (2006). Multivariate stochastic volatility via Wishart processes. Journal of Business and Economic Statistics, 24(3):313–328.

Poon, S.-H. and Granger, C. W. (2005). Practical issues in forecasting volatility. Financial Analysts Journal, 61(1):45–56.

Rasmussen, C. E. and Williams, C. K. (2006). Gaussian Processes for Machine Learning. The MIT Press.

Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer Verlag.

Sharpe, W. (1966). Mutual fund performance. Journal of Business, 39(1):119–138.

Silvennoinen, A. and Teräsvirta, T. (2009). Multivariate GARCH models. Handbook of Financial Time Series, pages 201–229.

Teh, Y., Seeger, M., and Jordan, M. (2005). Semiparametric latent factor models. In Workshop on Artificial Intelligence and Statistics, volume 10.

Wilson, A. G. and Ghahramani, Z. (2010). Copula processes. In NIPS.

Wilson, A. G. and Ghahramani, Z. (2011a). Generalised Wishart Processes: Supplementary Material. mlg.eng.cam.ac.uk/andrew/gwpsupp.pdf.

Wilson, A. G. and Ghahramani, Z. (2011b). In preparation.