
Gaussian Process Regression Networks

Andrew Gordon Wilson  [email protected]
David A. Knowles  [email protected]
Zoubin Ghahramani  [email protected]
Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, UK

Abstract

We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the nonparametric flexibility of Gaussian processes. GPRN accommodates input (predictor) dependent signal and noise correlations between multiple output (response) variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both elliptical slice sampling and variational Bayes inference procedures for GPRN. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on real datasets, including a 1000 dimensional gene expression dataset.

1. Introduction

“Learning representations by back-propagating errors” by Rumelhart et al. (1986) is a defining paper in machine learning history. This paper made neural networks popular for their ability to capture correlations between multiple outputs, and to discover hidden features in data, by using adaptive hidden basis functions that were shared across the outputs.

MacKay (1992) and Neal (1996) later showed that no matter how large or complex the neural network, one could avoid overfitting using a Bayesian formulation. Neal (1996) also argued that “limiting complexity is likely to conflict with our prior beliefs, and can therefore only be justified to the extent that it is necessary for computational reasons”. Accordingly, Neal (1996) pursued the limit of large models, and found that Bayesian neural networks became Gaussian processes as the number of hidden units approached infinity, and conjectured that “there may be simpler ways to do inference in this case”.

These simple inference techniques became the cornerstone of subsequent Gaussian process models (Rasmussen & Williams, 2006). These models assume a prior directly over functions, rather than parameters. By further assuming homoscedastic Gaussian noise, one can analytically infer a posterior distribution over these functions, given data. The properties of these functions – smoothness, periodicity, etc. – can easily be controlled by a Gaussian process kernel. Gaussian process models have recently become popular for non-linear regression and classification (Rasmussen & Williams, 2006), and have impressive empirical performances (Rasmussen, 1996).

However, a neural network allowed for correlations between multiple outputs, through sharing adaptive hidden basis functions across the outputs. In the infinite limit of basis functions, these correlations vanished. Moreover, neural networks were envisaged as intelligent agents which discovered hidden features and representations in data, while Gaussian processes, though effective at regression and classification, are simply smoothing devices (MacKay, 1998).

Recently there has been an explosion of interest in extending the Gaussian process regression framework to account for fixed correlations between output variables (Alvarez & Lawrence, 2011; Yu et al., 2009; Bonilla et al., 2008; Teh et al., 2005; Boyle & Frean, 2004). These are often called ‘multi-task’ learning or ‘multiple output’ regression models. Capturing correlations between outputs (responses) can be used to make better predictions. Imagine we wish to predict cadmium concentrations in a region of the Swiss Jura, where geologists are interested in heavy metal concentrations. A standard Gaussian process regression model would only be able to use cadmium training measurements.

With a multi-task method, we can also make use of correlated heavy metal measurements to enhance cadmium predictions (Goovaerts, 1997). We could further enhance predictions if we could use how these (signal) correlations change with geographical location.

There has similarly been great interest in extending Gaussian process (GP) regression to account for input dependent noise variances (Goldberg et al., 1998; Kersting et al., 2007; Adams & Stegle, 2008; Turner, 2010; Wilson & Ghahramani, 2010b;a; Lázaro-Gredilla & Titsias, 2011). Wilson & Ghahramani (2010a; 2011) and Fox & Dunson (2011) further extended the GP framework to accommodate input dependent noise correlations between multiple output (response) variables.

In this paper, we introduce a new regression framework, Gaussian Process Regression Networks (GPRN), which combines the structural properties of Bayesian neural networks with the nonparametric flexibility of Gaussian processes. This network is an adaptive mixture of Gaussian processes, which naturally accommodates input dependent signal and noise correlations between multiple output variables, input dependent length-scales and amplitudes, and heavy tailed predictive distributions, without expensive or numerically unstable computations. The GPRN framework extends and unifies the work of Journel & Huijbregts (1978), Neal (1996), Gelfand et al. (2004), Teh et al. (2005), Adams & Stegle (2008), Turner (2010), and Wilson & Ghahramani (2010b; 2011).

Throughout this text we assume we are given a dataset of input output pairs, D = {(x_i, y(x_i)) : i = 1, …, N}, where x ∈ X is an input (predictor) variable belonging to an arbitrary set X, and y(x) is the corresponding p dimensional output; each element of y(x) is a one dimensional output (response) variable, for example the concentration of a single heavy metal at a geographical location x. We aim to predict y(x∗)|x∗, D and Σ(x∗) = cov[y(x∗)|x∗, D] at a test input x∗, while accounting for input dependent signal and noise correlations between the elements of y(x).

We start by introducing the GPRN framework and discussing inference. We then further discuss related work, before comparing to eight multiple output GP models on gene expression and geostatistics datasets, and to three multivariate volatility models on several benchmark financial datasets. In the supplementary material (Wilson & Ghahramani, 2012) we further discuss theoretical aspects of GPRN, and review GP regression and notation (Rasmussen & Williams, 2006).

2. Gaussian Process Regression Networks

We wish to model a p dimensional function y(x), with signal and noise correlations that vary with x. We model y(x) as

    y(x) = W(x)[f(x) + σ_f ε] + σ_y z ,   (1)

where ε = ε(x) and z = z(x) are respectively N(0, I_q) and N(0, I_p) white noise processes. I_q and I_p are q × q and p × p dimensional identity matrices. W(x) is a p × q matrix of independent Gaussian processes such that W_ij(x) ∼ GP(0, k_w), and f(x) = (f_1(x), …, f_q(x))^⊤ is a q × 1 vector of independent GPs with f_i(x) ∼ GP(0, k_{f_i}). The GPRN prior on y(x) is induced through the GP priors in W(x) and f(x), and the noise model is induced through ε and z.

We represent the Gaussian process regression network (GPRN)¹ of equation (1) in Figure 1. Each of the latent Gaussian processes in f(x) has additive Gaussian noise. Changing variables to include the noise σ_f ε, we let f̂_i(x) = f_i(x) + σ_f ε ∼ GP(0, k_{f̂_i}), where

    k_{f̂_i}(x_a, x_w) = k_{f_i}(x_a, x_w) + σ_f² δ_aw ,   (2)

and δ_aw is the Kronecker delta. The latent node functions f̂(x) are connected together to form the outputs y(x). The strengths of the connections change as a function of x; the weights themselves – the entries of W(x) – are functions. Old connections can break and new connections can form. This is an adaptive network, where the signal and noise correlations between the components of y(x) vary with x. We label the length-scale hyperparameters for the kernels k_w and k_{f_i} as θ_w and θ_f respectively. We often assume that all the weight GPs share the same covariance kernel k_w, including hyperparameters. Roughly speaking, sharing length-scale hyperparameters amongst the weights means that, a priori, the strengths of the connections in Figure 1 vary with x at the same rate.

¹Coincidentally, there is an unrelated paper called “Gaussian process networks” (Friedman & Nachman, 2000), which is about learning the structure of Bayesian networks – e.g. the direction of dependence between random variables.

To explicitly separate the adaptive signal and noise correlations, we re-write (1) as

    y(x) = W(x)f(x) + σ_f W(x)ε + σ_y z ,   (3)

where the first term, W(x)f(x), is the signal, and σ_f W(x)ε + σ_y z is the noise. Given W(x), each of the outputs y_i(x), i = 1, …, p, is a Gaussian process with kernel

    k_{y_i}(x_a, x_w) = Σ_{j=1}^{q} W_ij(x_a) k_{f̂_j}(x_a, x_w) W_ij(x_w) + δ_aw σ_y² .   (4)
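To make the generative model in equations (1)–(3) concrete, the sketch below draws a single realisation of y(x) from the GPRN prior on a one dimensional input grid. It is a minimal illustration rather than the code used in our experiments: squared exponential kernels, the particular hyperparameter values, and sampling through Cholesky factors are all assumptions of the sketch.

```python
import numpy as np

def se_kernel(x, length_scale=1.0):
    # Squared exponential kernel matrix on 1-D inputs x.
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def sample_gprn_prior(x, p=3, q=2, theta_f=0.3, theta_w=1.0,
                      sigma_f=0.1, sigma_y=0.05, seed=0):
    """Draw one realisation of y(x) from the GPRN prior of equation (1)."""
    rng = np.random.default_rng(seed)
    N = len(x)
    jitter = 1e-6 * np.eye(N)
    Lf = np.linalg.cholesky(se_kernel(x, theta_f) + jitter)
    Lw = np.linalg.cholesky(se_kernel(x, theta_w) + jitter)

    # q independent node functions f_i ~ GP(0, k_f), plus additive noise sigma_f * eps.
    f = Lf @ rng.standard_normal((N, q))
    f_hat = f + sigma_f * rng.standard_normal((N, q))

    # p x q independent weight functions W_ij ~ GP(0, k_w).
    W = (Lw @ rng.standard_normal((N, p * q))).reshape(N, p, q)

    # y(x_n) = W(x_n)[f(x_n) + sigma_f * eps] + sigma_y * z, for each input x_n.
    y = np.einsum('npq,nq->np', W, f_hat) + sigma_y * rng.standard_normal((N, p))
    return y, W, f_hat

x = np.linspace(0, 1, 100)
y, W, f_hat = sample_gprn_prior(x)
print(y.shape)  # (100, 3)
```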

Figure 1. The Gaussian process regression network. Latent random variables and observables are respectively labelled with circles and squares, except for the weight functions in a). Hyperparameters are labelled with dots. a) This neural network style diagram shows the q components of the vector f̂ (GPs with additive noise), and the p components of the vector y. The links in the graph, four of which are labelled, are latent random weight functions. Every quantity in this graph depends on the input x. This graph emphasises the adaptive nature of this network: links can change strength or even disappear as x changes. b) A directed graphical model showing the generative procedure with relevant variables.

The components of y(x) are coupled through the matrix W(x). Training the network involves conditioning W(x) on the data D, and so the predictive covariances of y(x∗)|D are now influenced by the values of the observations, and not just distances between the test point x∗ and the observed points x_1, …, x_N as is the case for independent GPs.

We can view (4) as an adaptive kernel learned from the data. There are several other interesting features in equation (4): 1) the amplitude of the covariance function, Σ_{j=1}^{q} W_ij(x)W_ij(x′), is non-stationary (input dependent); 2) even if each of the kernels k_{f_j} has different stationary length-scales, the mixture of the kernels k_{f_j} is input dependent and so the effective overall length-scale is non-stationary; 3) the kernels k_{f_j} may be entirely different: some may be periodic, others squared exponential, others Brownian motion, and so on. Therefore the overall covariance kernel may be continuously switching between regions of entirely different covariance structures.

In addition to modelling signal correlations, we can see from equation (3) that the GPRN is also a multivariate volatility model. The noise covariance is σ_f² W(x)W(x)^⊤ + σ_y² I_p. Since the entries of W(x) are GPs, this noise model is an example of a generalised Wishart process (Wilson & Ghahramani, 2010a; 2011).

The number of nodes q influences how the model accounts for signal and noise correlations. If q is smaller than p, the dimension of y(x), the model performs dimensionality reduction and matrix factorization as part of the regression on y(x) and cov[y(x)]. However, we may want q > p, for instance if the output space were one dimensional (p = 1). In this case we would need q > 1 for nonstationary length-scales and covariance structures. For a given dataset, we can vary q and select the value which gives the highest marginal likelihood on training data.
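To illustrate the quantities discussed above, the following sketch (an illustration only) evaluates one entry of the induced output kernel of equation (4) and the input dependent noise covariance σ_f² W(x)W(x)^⊤ + σ_y² I_p, given weight matrices and node kernel values assumed to have been evaluated elsewhere.

```python
import numpy as np

def noise_covariance(W_x, sigma_f, sigma_y):
    # Input dependent noise covariance from equation (3):
    # sigma_f^2 W(x) W(x)^T + sigma_y^2 I_p.
    p = W_x.shape[0]
    return sigma_f ** 2 * W_x @ W_x.T + sigma_y ** 2 * np.eye(p)

def induced_output_kernel(i, W_a, W_b, k_fhat_ab, sigma_y, same_input=False):
    """k_{y_i}(x_a, x_b) from equation (4).

    W_a, W_b: p x q weight matrices W(x_a), W(x_b).
    k_fhat_ab: length-q array of node kernel values k_fhat_j(x_a, x_b).
    """
    k = np.sum(W_a[i, :] * k_fhat_ab * W_b[i, :])
    return k + (sigma_y ** 2 if same_input else 0.0)
```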

3. Inference

We have specified a prior p(y(x)) at all points x in the domain X, and a noise model, so we can infer the posterior p(y(x)|D). The prior on y(x) is induced through the GP priors in W(x) and f(x), and the parameters γ = {θ_f, θ_w, σ_f, σ_y}. We perform inference directly over the GPs and parameters.

We explicitly re-write the prior over GPs in terms of u = (f̂, W), a vector composed of all the node and weight Gaussian process functions, evaluated at the training points {x_1, …, x_N}. There are q node functions and p × q weight functions. Therefore

    p(u | σ_f, θ_f, θ_w) = N(0, C_B) ,   (5)

where C_B is an Nq(p + 1) × Nq(p + 1) block diagonal matrix, since the weight and node functions are a priori independent. We order the entries of u so that the first q blocks are N × N covariance matrices K_{f̂_i} from the node kernels k_{f̂_i}, and the last blocks are N × N covariance matrices K_w from the weight kernel k_w. From (1), the likelihood is

    p(D | u, σ_y) = ∏_{i=1}^{N} N(y(x_i); W(x_i)f̂(x_i), σ_y² I_p) .   (6)

Applying Bayes’ theorem,

    p(u | D, γ) ∝ p(D | u, σ_y) p(u | σ_f, θ_f, θ_w) .   (7)

We sample from the posterior in (7) using elliptical slice sampling (ESS) (Murray et al., 2010), which is specifically designed to sample from posteriors with strongly correlated Gaussian priors. For comparison we approximate (7) using a message passing implementation of variational Bayes (VB). We also use VB to learn the hyperparameters γ|D. Details about our ESS and VB approaches are in Sections 1 and 2 of the supplementary material.

By incorporating noise on f, the GP network accounts for input dependent noise correlations (as in (3)), without the need for costly or numerically unstable matrix decompositions during inference. The matrix σ_y² I_p does not change with x and requires only one O(1) operation to invert. In a more typical multivariate volatility model, one must decompose a p × p matrix Σ(x) once for each datapoint x_i (N times in total), an O(Np³) operation which is prone to numerical instability. In general, multivariate volatility models are intractable for p > 5 (Gouriéroux et al., 2009; Engle, 2002). Moreover, multi-task Gaussian process models typically have an O(N³p³) complexity (Alvarez & Lawrence, 2011). In Section 3 of the supplementary material (Wilson & Ghahramani, 2012) we show that, fixing the number of ESS or VB iterations, GPRN inference scales linearly with p, and further discuss theoretical properties of GPRN, like the heavy-tailed predictive distribution.
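As a compact illustration of the sampling approach described above, the sketch below evaluates the GPRN log likelihood of equation (6) and performs one elliptical slice sampling update (Murray et al., 2010) targeting the posterior in equation (7). The stacking convention for u and passing C_B through its Cholesky factor are assumptions of this sketch; hyperparameter learning and the variational Bayes alternative are omitted.

```python
import numpy as np

def gprn_log_likelihood(u, Y, p, q, sigma_y):
    """log p(D | u, sigma_y), equation (6).

    Assumes u stacks the q node functions and then the p*q weight functions,
    each as a length-N block of values at the training inputs; Y is N x p.
    """
    N = Y.shape[0]
    f_hat = u[:N * q].reshape(q, N).T                   # (N, q)
    W = u[N * q:].reshape(p, q, N).transpose(2, 0, 1)   # (N, p, q)
    resid = Y - np.einsum('npq,nq->np', W, f_hat)
    return (-0.5 * np.sum(resid ** 2) / sigma_y ** 2
            - 0.5 * N * p * np.log(2 * np.pi * sigma_y ** 2))

def elliptical_slice_step(u, prior_chol, loglik, rng):
    """One elliptical slice sampling update for u ~ N(0, C_B) (Murray et al., 2010)."""
    nu = prior_chol @ rng.standard_normal(u.shape)   # auxiliary draw from the prior
    log_y = loglik(u) + np.log(rng.uniform())        # slice level
    theta = rng.uniform(0, 2 * np.pi)
    lo, hi = theta - 2 * np.pi, theta
    while True:
        proposal = u * np.cos(theta) + nu * np.sin(theta)
        if loglik(proposal) > log_y:
            return proposal
        # shrink the angle bracket towards the current state and try again
        if theta < 0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)

# usage sketch: u = elliptical_slice_step(u, np.linalg.cholesky(C_B),
#                   lambda v: gprn_log_likelihood(v, Y, p, q, sigma_y), rng)
```

In practice one alternates many such updates; since C_B is block diagonal, its Cholesky factor can be formed block by block.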
4. Related Work

Gaussian process regression networks are related to a large body of seemingly disparate work in machine learning, econometrics, geostatistics, and physics.

In machine learning, the semiparametric latent factor model (SLFM) (Teh et al., 2005) was introduced to model multiple outputs with fixed signal correlations. SLFM specifies a linear mixing of latent Gaussian processes. The SLFM is similar to the linear model of coregionalisation (LMC) (Journel & Huijbregts, 1978) and intrinsic coregionalisation model (ICM) (Goovaerts, 1997) in geostatistics, but the SLFM incorporates important Gaussian process hyperparameters like length-scales, and methodology for learning these hyperparameters. In machine learning, the SLFM has also been developed as “Gaussian process factor analysis” (Yu et al., 2009), with an emphasis on time being the input (predictor) variable.

For changing correlations, the Wishart process (Bru, 1991) was first introduced in probability theory as a distribution over a collection of positive definite covariance matrices with Wishart marginals. It was defined as an outer product of autoregressive Gaussian processes restricted to a Brownian motion or Ornstein-Uhlenbeck covariance structure. In the geostatistics literature, Gelfand et al. (2004) applied a Wishart process as part of a linear coregionalisation model with spatially varying signal covariances, on a p = 2 dimensional real-estate example. Later Gouriéroux et al. (2009) returned to the Wishart process of Bru (1991) to model multivariate volatility, letting the noise covariance be specified as an outer product of AR(1) Gaussian processes, assuming that the covariance matrices Σ(t) = cov(y|t) are observables on an evenly spaced one dimensional grid. In machine learning, Wilson & Ghahramani (2010a; 2011) introduced the generalised Wishart process (GWP), which generalises the Wishart process of Bru (1991) to a process over arbitrary positive definite matrices (Wishart marginals are not required) with a flexible covariance structure, and, using the GWP, extended the GP framework to account for input dependent noise correlations (multivariate volatility), without assuming the noise is observable, or that the input space is 1D, or on a grid.

Gaussian process regression networks act as both a multi-task and multivariate volatility model. The signal correlation model in GPRN differs from Gelfand et al. (2004) in that 1) the GPRN incorporates and estimates Gaussian process hyperparameters, like length-scales, effectively learning aspects of the covariance structure from data, 2) is tractable for p > 3, 3) is used as a latent factor model (where q < p), 4) can account for input dependent length-scales and covariance structures, and 5) incorporates an input dependent noise correlation model. Moreover, the VB and ESS inference procedures we present here are significantly more efficient than the Metropolis-Hastings proposals in Gelfand et al. (2004). Generally a noise model strongly influences a regression on the signal, even if the noise and signal models are a priori independent. In the GPRN prior of equation (3) the noise and signal correlations are explicitly related: through sharing W(x), the signal and noise are encouraged to increase and decrease together. The noise model is an example of a GWP, although GPRN scales linearly and not cubically with p, per iteration of ESS or VB. If the GPRN is exposed solely to input dependent noise, the length-scales on the node functions f(x) will train to large values, turning the GPRN into solely a multivariate volatility model: all the modelling then takes place in W(x). In other words, through learning Gaussian process hyperparameters, GPRN can automatically vary between a multi-task and a multivariate volatility model. The hyperparameters in GPRN are also important for distinguishing between the behaviour of the weight and node functions. We may expect, for example, that the node functions will vary more quickly than the weight functions, so that the components of y(x) vary more quickly than the correlations between the components of y(x). The rate at which the node and weight functions vary is controlled by the Gaussian process length-scale hyperparameters, which are learned from data.

When q = p = 1, the GPRN resembles the nonstationary GP regression model of Adams & Stegle (2008). Likewise, when the weight functions are constants, the GPRN becomes the semiparametric latent factor model (SLFM) of Teh et al. (2005), except that the resulting GP regression network is less prone to overfitting through its use of full Bayesian inference.

The GPRN also somewhat resembles the natural sound model (MPAD) in Section 5.3 of Turner (2010), except in MPAD the analogue of the node functions are AR(2) Gaussian processes, and the “weight functions” are a priori correlated.

Ver Hoef & Barry (1998) in geostatistics and Boyle & Frean (2004) in machine learning proposed an alternative convolution GP model for multiple outputs (CMOGP) with fixed signal correlations, where each output at each x ∈ X is a mixture of latent Gaussian processes mixed across the whole input domain X.

5. Experiments

We compare GPRN to multi-task learning and multivariate volatility models. We also compare between variational Bayes (VB) and elliptical slice sampling (ESS) inference within the GPRN framework. In the multi-task setting, there are p dimensional observations y(x), and the goal is to use the correlations between the elements of y(x) to make better predictions of y(x∗), for a test input x∗, than if we were to treat the dimensions independently. A major difference between GPRN and alternative multi-task models is that the GPRN accounts for signal correlations that change with x, and input dependent noise correlations, rather than fixed correlations. We compare to multi-task GP models on gene expression and geostatistics datasets.

In the multi-task experiments, the GPRN accounts for both input dependent signal and noise covariance matrices. To specifically test GPRN’s ability to model input dependent noise covariances (multivariate volatility), we compare predictions of cov[y(x)] = Σ(x) to those made by popular multivariate volatility models on benchmark financial datasets.

In all experiments, GPRN uses squared exponential covariance functions, with a length-scale shared across all node functions, and another length-scale shared across all weight functions. GPRN is robust to initialisation. We use an adversarial initialisation of N(0, 1) white noise for Gaussian process functions.

5.1. Gene Expression

Tomancak et al. (2002) measured gene expression levels every hour for 12 hours during Drosophila embryogenesis; they then repeated this experiment for an independent replica (a second independent time series). Gene expression is activated and deactivated by transcription factor proteins. We focus on genes which are thought to at least be regulated by the transcription factor twi, which influences mesoderm and muscle development in Drosophila (Zinzen et al., 2009). The assumption is that these gene expression levels are all correlated. We would like to use how these correlations change over time to make better predictions of time varying gene expression in the presence of transcription factors. In total there are 1621 genes (outputs) at N = 12 time points (inputs), on two independent replicas. For training, p = 50 random genes were selected from the first replica, and the corresponding 50 genes in the second replica were used for testing. We then repeated this experiment 10 times with a different set of genes each time, and averaged the results. We then repeated the whole experiment, but with p = 1000 genes. We used exactly the same training and testing sets as Alvarez & Lawrence (2011).

We use a smaller p = 50 dataset so that we are able to compare with popular alternative multi-task methods (LMC, CMOGP, SLFM), which have a complexity of O(N³p³) and would not scale to p = 1000 (Alvarez & Lawrence, 2011).² For p = 1000, we compare to the sparse convolved multiple output GP methods (CMOFITC, CMODTC, and CMOPITC) of Alvarez & Lawrence (2011). In both of these regressions, the GPRN is accounting for multivariate volatility; this is the first time a multivariate stochastic volatility model has been estimated for p > 50 (Chib et al., 2006). We assess performance using standardised mean square error (SMSE) and mean standardized log loss (MSLL), as defined in Rasmussen & Williams (2006) on page 23. Using the empirical mean and variance to fit the data would give an SMSE and MSLL of 1 and 0 respectively. The smaller the SMSE and more negative the MSLL the better.

²We also implemented the SVLMC of Gelfand et al. (2004) but found it intractable on the gene expression and geostatistics datasets, and on a subset of data it gave worse results than the other methods we compare to. SVLMC is not applicable to the multivariate volatility datasets. The supplementary material has more details.

The results are in Table 1, under the headings GENE (50D) and GENE (1000D). For SET 2 we reverse training and testing replicas in SET 1. GPRN outperforms all of the other models, with between 46% and 68% of the SMSE, and similarly strong results on the MSLL error metric.³ On both the 50 and 1000 dimensional datasets, the marginal likelihood for the network structure is sharply peaked at q = 1, as we might expect since there is likely one transcription factor twi controlling the expression levels of the genes in question.

³Independent GPs severely overfit on GENE, giving an MSLL of ∞.

Typical GPRN (VB) runtimes for the 50D and 1000D datasets were respectively 12 seconds and 330 seconds. These runtimes scale roughly linearly with dimension (p), which is what we expect. GPRN (VB) runs at about the same speed as the sparse CMOGP methods, and much faster than CMOGP, LMC and SLFM, which take days to run on the 1000D dataset.

The GPRN (ESS) runtimes for the 50D and 1000D datasets were 40 seconds and 9000 seconds (2.5 hr), and required respectively 6000 and 10⁵ samples to reach convergence, as assessed by trace plots of sample likelihoods. In terms of both speed and accuracy GPRN (ESS) outperforms all methods except GPRN (VB).

GPRN (ESS) does not mix as well in high dimensions, and the number of ESS iterations required to reach convergence noticeably grows with p. However, ESS is still tractable and performing relatively well in p = 1000 dimensions, in terms of speed and predictive accuracy. Runtimes are on a 2.3 GHz Intel i5 Duo Core processor.
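For reference, the two metrics can be computed as in the sketch below, which follows our reading of the definitions in Rasmussen & Williams (2006): SMSE normalises the squared error by the variance of the test targets, and MSLL subtracts the log loss of a trivial Gaussian model fit to the training mean and variance. Averaging over output dimensions is an assumption of the sketch.

```python
import numpy as np

def smse(y_true, y_pred):
    # Standardised mean squared error: MSE divided by the variance of the test targets.
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def msll(y_true, pred_mean, pred_var, train_mean, train_var):
    # Mean standardised log loss: negative log predictive density of the model,
    # minus that of a trivial Gaussian fit to the training mean and variance.
    nlpd_model = 0.5 * np.log(2 * np.pi * pred_var) \
                 + (y_true - pred_mean) ** 2 / (2 * pred_var)
    nlpd_trivial = 0.5 * np.log(2 * np.pi * train_var) \
                   + (y_true - train_mean) ** 2 / (2 * train_var)
    return np.mean(nlpd_model - nlpd_trivial)
```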

5.2. Jura Geostatistics

Here we are interested in predicting concentrations of cadmium at 100 locations within a 14.5 km² region of the Swiss Jura. For training, we have access to measurements of cadmium at 259 neighbouring locations. We also have access to nickel and zinc concentrations at these 259 locations, as well as at the 100 locations at which we wish to predict cadmium. While a standard Gaussian process regression model would only be able to make use of the cadmium training measurements, a multi-task method can use the correlated nickel and zinc measurements to enhance predictions. With GPRN we can also make use of how the correlations between nickel, zinc, and cadmium change with location to further enhance predictions.

The network structure with the highest marginal likelihood has q = 2 latent node functions. The node and weight functions learnt using VB for this setting are shown in Figure 1 of the supplementary material (Wilson & Ghahramani, 2012). Since there are p = 3 output dimensions, the result q < p suggests that heavy metal concentrations in the Swiss Jura are correlated. Indeed, using our model we can observe the spatially varying correlations between heavy metal concentrations, as shown for cadmium and zinc in Figure 2. Although the correlation between cadmium and zinc is generally positive (with values around 0.6), there is a region where the correlations noticeably decrease, perhaps corresponding to a geological structure. The quantitative results in Table 1 suggest that the ability of GPRN to learn these spatially varying correlations is beneficial for predicting cadmium concentrations.

Figure 2. Spatially dependent correlation between cadmium and zinc learnt by the GPRN. Markers show the locations where measurements were made. (Axes: longitude and latitude.)

We assess performance quantitatively using mean absolute error (MAE) between the predicted and true cadmium concentrations. We restart the experiment 10 times with different initialisations of the parameters, and average the MAE. The results are marked by JURA in Table 1. The experimental setup follows Goovaerts (1997) and Alvarez & Lawrence (2011). We found log transforming and normalising each dimension to have zero mean and unit variance to be beneficial due to the skewed distribution of the y-values (but we also include results on untransformed data, marked with *). All the multiple output methods give lower MAE than using an independent GP, and GPRN outperforms SLFM and the other methods.

For the JURA dataset, the improved performance of GPRN is at the cost of a slightly greater runtime. However, GPRN is accounting for input dependent signal and noise correlations, unlike the other methods. Moreover, the complexity of GPRN scales linearly with p (per iteration), unlike the other methods which scale as O(N³p³). This is why GPRN runs relatively quickly on the 1000 dimensional gene expression dataset, for which the other methods are intractable. These data are available from http://www.ai-geostats.org/.
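A correlation map like the one in Figure 2 can be read off the learnt weight functions. The sketch below is one plausible way to do this, not necessarily the exact computation behind the figure: it assumes unit variance node kernels, so that cov[y(x)] ≈ (1 + σ_f²) W(x)W(x)^⊤ + σ_y² I_p, and converts that covariance into a correlation between two chosen outputs at each location.

```python
import numpy as np

def output_correlation_map(W_grid, sigma_f, sigma_y, i=0, k=1):
    """Correlation between outputs i and k at each input location.

    W_grid: array of shape (n_locations, p, q) holding, e.g., the posterior
    mean of W(x) at each location. Assumes unit-variance node kernels,
    k_fhat_j(x, x) ~ 1, so that cov[y(x)] ~ (1 + sigma_f^2) W W^T + sigma_y^2 I_p.
    """
    p = W_grid.shape[1]
    corr = np.empty(W_grid.shape[0])
    for n, Wx in enumerate(W_grid):
        cov = (1.0 + sigma_f ** 2) * Wx @ Wx.T + sigma_y ** 2 * np.eye(p)
        corr[n] = cov[i, k] / np.sqrt(cov[i, i] * cov[k, k])
    return corr
```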
5.3. Multivariate Volatility

In the previous experiments the GPRN implicitly accounted for multivariate volatility (input dependent noise covariances) in making predictions of y(x∗). We now test the GPRN explicitly as a model of multivariate volatility, and assess predictions of Σ(t) = cov[y(t)]. We make 200 historical predictions of Σ(t) at observed time points, and 200 one day ahead forecasts. Historical predictions can be used, for example, to understand a past financial crisis. The forecasts are assessed using the log likelihood of new observations under the predicted covariance, denoted L Forecast. We follow Wilson & Ghahramani (2010a), and predict Σ(t) for returns on three currency exchanges (EXCHANGE) and five equity indices (EQUITY), processed as in Wilson & Ghahramani (2010a). These datasets are especially suited to MGARCH, the most popular multivariate volatility model, and have become a benchmark for assessing GARCH models (Poon & Granger, 2005; Hansen & Lunde, 2005; Brownlees et al., 2009; McCullough & Renfro, 1998; Brooks et al., 2001). We compare to full BEKK MGARCH (Engle & Kroner, 1995), the generalised Wishart process (Wilson & Ghahramani, 2010a), and the original Wishart process (Bru, 1991; Gouriéroux et al., 2009).

We see in Table 1 that GPRN (ESS) is often outperformed by GPRN (VB) on multivariate volatility sets, suggesting convergence difficulties with ESS. The high historical MSE for GPRN on EXCHANGE is essentially training error, and less meaningful than the encouraging step ahead forecast likelihoods; to harmonize with the econometrics literature, historical MSE for EXCHANGE is between the learnt covariance Σ(x) and observed y(x)y(x)^⊤. See Wilson & Ghahramani (2010a) for details. Overall, the GPRN shows promise as both a multi-task and multivariate volatility model, especially since the multivariate volatility datasets are suited to MGARCH. These data were obtained using Datastream (http://www.datastream.com/).
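The two columns reported for EXCHANGE and EQUITY can be computed as in the sketch below, which reflects the description above rather than the exact evaluation code: the historical MSE compares the learnt Σ(x) to the observed outer products y(x)y(x)^⊤, and L Forecast sums the Gaussian log likelihood of each new observation under its one step ahead forecast covariance (zero mean returns are an assumption of the sketch).

```python
import numpy as np

def historical_mse(Sigma_hat, Y):
    # MSE between learnt covariances Sigma_hat[t] (p x p) and the observed
    # outer products y_t y_t^T, averaged over time points and matrix entries.
    outer = np.einsum('tp,tq->tpq', Y, Y)
    return np.mean((Sigma_hat - outer) ** 2)

def forecast_log_likelihood(Sigma_forecast, Y_new):
    # Sum over forecasts of log N(y_t; 0, Sigma_forecast[t]), assuming zero-mean returns.
    total = 0.0
    for S, y in zip(Sigma_forecast, Y_new):
        p = len(y)
        _, logdet = np.linalg.slogdet(S)
        total += -0.5 * (p * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(S, y))
    return total
```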
Table 1. Comparative performance on all datasets.

GENE (50D)     Average SMSE       Average MSLL
SET 1:
GPRN (VB)      0.3356 ± 0.0294    −0.5945 ± 0.0536
GPRN (ESS)     0.3236 ± 0.0311    −0.5523 ± 0.0478
LMC            0.6069 ± 0.0294    −0.2687 ± 0.0594
CMOGP          0.4859 ± 0.0387    −0.3617 ± 0.0511
SLFM           0.6435 ± 0.0657    −0.2376 ± 0.0456
SET 2:
GPRN (VB)      0.3403 ± 0.0339    −0.6142 ± 0.0557
GPRN (ESS)     0.3266 ± 0.0321    −0.5683 ± 0.0542
LMC            0.6194 ± 0.0447    −0.2360 ± 0.0696
CMOGP          0.4615 ± 0.0626    −0.3811 ± 0.0748
SLFM           0.6264 ± 0.0610    −0.2528 ± 0.0453

GENE (1000D)   Average SMSE       Average MSLL
SET 1:
GPRN (VB)      0.3473 ± 0.0062    −0.6209 ± 0.0085
GPRN (ESS)     0.4520 ± 0.0079    −0.4712 ± 0.0327
CMOFITC        0.5469 ± 0.0125    −0.3124 ± 0.0200
CMOPITC        0.5537 ± 0.0136    −0.3162 ± 0.0206
CMODTC         0.5421 ± 0.0085    −0.2493 ± 0.0183
SET 2:
GPRN (VB)      0.3287 ± 0.0050    −0.6430 ± 0.0071
GPRN (ESS)     0.4140 ± 0.0078    −0.4787 ± 0.0315
CMOFITC        0.5565 ± 0.0425    −0.3024 ± 0.0294
CMOPITC        0.5713 ± 0.0794    −0.3128 ± 0.0138
CMODTC         0.5454 ± 0.0173     0.6499 ± 0.7961

JURA           Average MAE        Training time (secs)
GPRN (VB)      0.4040 ± 0.0006    1040
GPRN* (VB)     0.4525 ± 0.0036    1190
SLFM (VB)      0.4247 ± 0.0004    614
SLFM* (VB)     0.4679 ± 0.0030    810
SLFM           0.4578 ± 0.0025    792
Co-kriging     0.51               —
ICM            0.4608 ± 0.0025    507
CMOGP          0.4552 ± 0.0013    784
GP             0.5739 ± 0.0003    74

EXCHANGE       Historical MSE     L Forecast
GPRN (VB)      3.83 × 10⁻⁸        2073
GPRN (ESS)     6.120 × 10⁻⁹       2012
GWP            3.88 × 10⁻⁹        2020
WP             3.88 × 10⁻⁹        1950
MGARCH         3.96 × 10⁻⁹        2050

EQUITY         Historical MSE     L Forecast
GPRN (VB)      0.978 × 10⁻⁹       2740
GPRN (ESS)     0.827 × 10⁻⁹       2630
GWP            2.80 × 10⁻⁹        2930
WP             3.96 × 10⁻⁹        1710
MGARCH         6.69 × 10⁻⁹        2760

6. Discussion

A Gaussian process regression network (GPRN) has a simple and interpretable structure, and generalises many of the recent extensions to the Gaussian process regression framework. The model naturally accommodates input dependent signal and noise correlations between multiple output variables, heavy tailed predictive distributions, input dependent length-scales and amplitudes, and adaptive covariance functions. Furthermore, GPRN has scalable inference procedures, and strong empirical performance on several real datasets.

In the future, it would be enlightening to use GPRN with different types of adaptive covariance structures, particularly in the case where p = 1 and q > 1; in a one dimensional output space it would be easy, for instance, to visualise a process gradually switching between Brownian motion, periodic, and smooth covariance functions. It would also be interesting to apply this adaptive network to classification, or to use a GPRN where the weight functions depend on a different set of variables than the node functions. We hope the GPRN will inspire further research into adaptive networks, and further connections between different areas of machine learning and statistics.

References

Adams, R.P. and Stegle, O. Gaussian process product models for nonparametric nonstationarity. In ICML, 2008.

Alvarez, M.A. and Lawrence, N.D. Computationally efficient convolved multiple output Gaussian processes. JMLR, 12:1425–1466, 2011.

Bonilla, E., Chai, K.M.A., and Williams, C. Multi-task Gaussian process prediction. In NIPS, 2008.

Boyle, P. and Frean, M. Dependent Gaussian processes. In NIPS, 2004.

Brooks, C., Burke, S.P., and Persand, G. Benchmarks and the accuracy of GARCH model estimation. International Journal of Forecasting, 17:45–56, 2001.

Brownlees, Christian T., Engle, Robert F., and Kelly, Bryan T. A practical guide to volatility forecasting through calm and storm, 2009. Available at SSRN: http://ssrn.com/abstract=1502915.

Bru, M.F. Wishart processes. Journal of Theoretical Probability, 4(4):725–751, 1991.

Chib, S., Nardari, F., and Shephard, N. Analysis of high dimensional multivariate stochastic volatility models. Journal of Econometrics, 134(2):341–371, 2006.

Engle, R. New frontiers for ARCH models. Journal of Applied Econometrics, 17(5):425–446, 2002.

Engle, R.F. and Kroner, K.F. Multivariate simultaneous generalized ARCH. Econometric Theory, 11(01):122–150, 1995.

Fox, E. and Dunson, D. Bayesian nonparametric covariance regression. arXiv preprint arXiv:1101.2017, 2011.

Friedman, N. and Nachman, I. Gaussian process networks. In UAI, pp. 211–219, 2000.

Gelfand, A.E., Schmidt, A.M., Banerjee, S., and Sirmans, C.F. Nonstationary multivariate process modeling through spatially varying coregionalization. Test, 13(2):263–312, 2004.

Goldberg, Paul W., Williams, Christopher K.I., and Bishop, Christopher M. Regression with input-dependent noise: A Gaussian process treatment. In NIPS, 1998.

Goovaerts, P. Geostatistics for Natural Resources Evaluation. Oxford University Press, USA, 1997.

Gouriéroux, C., Jasiak, J., and Sufana, R. The Wishart autoregressive process of multivariate stochastic volatility. Journal of Econometrics, 150(2):167–181, 2009.

Hansen, Peter Reinhard and Lunde, Asger. A forecast comparison of volatility models: Does anything beat a GARCH(1,1)? Journal of Applied Econometrics, 20(7):873–889, 2005.

Journel, A.G. and Huijbregts, C.J. Mining Geostatistics. Academic Press (London and New York), 1978.

Kersting, K., Plagemann, C., Pfaff, P., and Burgard, W. Most likely heteroscedastic Gaussian process regression. In ICML, 2007.

Lázaro-Gredilla, M. and Titsias, M.K. Variational heteroscedastic Gaussian process regression. In ICML, 2011.

MacKay, D.J.C. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.

MacKay, David J.C. Introduction to Gaussian processes. In Christopher M. Bishop (ed.), Neural Networks and Machine Learning, chapter 11, pp. 133–165. Springer-Verlag, 1998.

McCullough, B.D. and Renfro, C.G. Benchmarks and software standards: A case study of GARCH procedures. Journal of Economic and Social Measurement, 25:59–71, 1998.

Murray, Iain, Adams, Ryan Prescott, and MacKay, David J.C. Elliptical slice sampling. JMLR: W&CP, 9:541–548, 2010.

Neal, R.M. Bayesian Learning for Neural Networks. Springer Verlag, 1996.

Poon, Ser-Huang and Granger, Clive W.J. Practical issues in forecasting volatility. Financial Analysts Journal, 61(1):45–56, 2005.

Rasmussen, Carl Edward. Evaluation of Gaussian Processes and Other Methods for Non-linear Regression. PhD thesis, Dept. of Computer Science, University of Toronto, 1996.

Rasmussen, Carl Edward and Williams, Christopher K.I. Gaussian Processes for Machine Learning. The MIT Press, 2006.

Rumelhart, D.E., Hinton, G.E., and Williams, R.J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Teh, Y.W., Seeger, M., and Jordan, M.I. Semiparametric latent factor models. In AISTATS, 2005.

Tomancak, P., Beaton, A., Weiszmann, R., Kwan, E., Shu, S., Lewis, S.E., et al. Systematic determination of patterns of gene expression during Drosophila embryogenesis. Genome Biology, 3(12):0081–0088, 2002.

Turner, Richard E. Statistical Models for Natural Sounds. PhD thesis, University College London, 2010.

Ver Hoef, J.M. and Barry, R.P. Constructing and fitting models for cokriging and multivariable spatial prediction. Journal of Statistical Planning and Inference, 69(2):275–294, 1998.

Wilson, Andrew G. and Ghahramani, Zoubin. GPRN supplementary material. 2012. http://mlg.eng.cam.ac.uk/andrew/gprnsupp.pdf.

Wilson, Andrew Gordon and Ghahramani, Z. Generalised Wishart processes. arXiv preprint arXiv:1101.0240, 2010a.

Wilson, Andrew Gordon and Ghahramani, Zoubin. Copula processes. In NIPS, 2010b.

Wilson, Andrew Gordon and Ghahramani, Z. Generalised Wishart processes. In UAI, 2011.

Yu, Byron M., Cunningham, John P., Santhanam, Gopal, Ryu, Stephen I., Shenoy, Krishna V., and Sahani, Maneesh. Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. In NIPS, 2009.

Zinzen, R.P., Girardot, C., Gagneur, J., Braun, M., and Furlong, E.E.M. Combinatorial binding predicts spatio-temporal cis-regulatory activity. Nature, 462(7269):65–70, 2009.