Gaussian Process Regression Networks

Gaussian Process Regression Networks Andrew Gordon Wilson [email protected] David A. Knowles [email protected] Zoubin Ghahramani [email protected] Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, UK Abstract sary for computational reasons". Accordingly, Neal (1996) pursued the limit of large models, and found We introduce a new regression frame- that Bayesian neural networks became Gaussian pro- work, Gaussian process regression networks cesses as the number of hidden units approached infin- (GPRN), which combines the structural ity, and conjectured that \there may be simpler ways properties of Bayesian neural networks with to do inference in this case". the nonparametric flexibility of Gaussian processes. GPRN accommodates input (pre- These simple inference techniques became the corner- dictor) dependent signal and noise corre- stone of subsequent Gaussian process models (Ras- lations between multiple output (response) mussen & Williams, 2006). These models assume a variables, input dependent length-scales and prior directly over functions, rather than parameters. amplitudes, and heavy-tailed predictive dis- By further assuming homoscedastic Gaussian noise, tributions. We derive both elliptical slice one can analytically infer a posterior distribution over sampling and variational Bayes inference pro- these functions, given data. The properties of these cedures for GPRN. We apply GPRN as a functions { smoothness, periodicity, etc. { can easily multiple output regression and multivariate be controlled by a Gaussian process covariance kernel. volatility model, demonstrating substantially Gaussian process models have recently become pop- improved performance over eight popular ular for non-linear regression and classification (Ras- multiple output (multi-task) Gaussian pro- mussen & Williams, 2006), and have impressive em- cess models and three multivariate volatility pirical performances (Rasmussen, 1996). models on real datasets, including a 1000 di- However, a neural network allowed for correlations be- mensional gene expression dataset. tween multiple outputs, through sharing adaptive hidden basis functions across the outputs. In the infinite limit of basis functions, these correlations vanished. 1. Introduction Moreover, neural networks were envisaged as intelli- \Learning representations by back-propagating errors" gent agents which discovered hidden features and rep- by Rumelhart et al.(1986) is a defining paper in ma- resentations in data, while Gaussian processes, though chine learning history. This paper made neural net- effective at regression and classification, are simply works popular for their ability to capture correlations smoothing devices (MacKay, 1998). between multiple outputs, and to discover hidden fea- Recently there has been an explosion of interest in ex- tures in data, by using adaptive hidden basis functions tending the Gaussian process regression framework to that were shared across the outputs. account for fixed correlations between output variables MacKay(1992) and Neal(1996) later showed that no (Alvarez & Lawrence, 2011; Yu et al., 2009; Bonilla matter how large or complex the neural network, one et al., 2008; Teh et al., 2005; Boyle & Frean, 2004). could avoid overfitting using a Bayesian formulation. These are often called `multi-task' learning or `multi- Neal(1996) also argued that \limiting complexity is ple output' regression models. Capturing correlations likely to conflict with our prior beliefs, and can there- between outputs (responses) can be used to make bet- fore only be justified to the extent that it is neces- ter predictions. Imagine we wish to predict cadmium concentrations in a region of the Swiss Jura, where ge- Appearing in Proceedings of the 29 th International Confer- ologists are interested in heavy metal concentrations. ence on Machine Learning, Edinburgh, Scotland, UK, 2012. A standard Gaussian process regression model would Copyright 2012 by the author(s)/owner(s). only be able to use cadmium training measurements. Gaussian Process Regression Networks With a multi-task method, we can also make use of 2. Gaussian Process Networks correlated heavy metal measurements to enhance cadmium predictions (Goovaerts, 1997). We could further We wish to model a p dimensional function y(x), with enhance predictions if we could use how these (signal) signal and noise correlations that vary with x. correlations change with geographical location. We model y(x) as There has similarly been great interest in extending y(x) = W (x)[f(x) + σf ] + σyz ; (1) Gaussian process (GP) regression to account for in- where = (x) and z = z(x) are respectively N (0;I ) put dependent noise variances (Goldberg et al., 1998; q and N (0;I ) white noise processes. I and I are q ×q Kersting et al., 2007; Adams & Stegle, 2008; Turner, p q p and p×p dimensional identity matrices. W (x) is a p×q 2010; Wilson & Ghahramani, 2010b;a; Lázaro-Gredilla matrix of independent Gaussian processes such that & Titsias, 2011). Wilson & Ghahramani(2010a; 2011) W (x) ∼ GP(0; k ), and f(x) = (f (x); : : : ; f (x))> and Fox & Dunson(2011) further extended the GP ij w 1 q is a q × 1 vector of independent GPs with f (x) ∼ framework to accommodate input dependent noise cor- i GP(0; k ). The GPRN prior on y(x) is induced relations between multiple output (response) variables. fi through GP priors in W (x) and f(x), and the noise In this paper, we introduce a new regression frame- model is induced through and z. work, Gaussian Process Regression Networks (GPRN), We represent the Gaussian process regression network which combines the structural properties of Bayesian (GPRN)1 of equation (1) in Figure1. Each of the la- neural networks with the nonparametric flexibility of tent Gaussian processes in f(x) has additive Gaussian Gaussian processes. This network is an adaptive mix- noise. Changing variables to include the noise σ , we ture of Gaussian processes, which naturally accommo- f let fî(x) = fi(x) + σf ∼ GP(0; k ^ ), where dates input dependent signal and noise correlations fi between multiple output variables, input dependent 2 k ^ (xa; xw) = kfi (xa; xw) + σf δaw ; (2) length-scales and amplitudes, and heavy tailed predic- fi tive distributions, without expensive or numerically and δaw is the Kronecker delta. The latent node func- ^ unstable computations. The GPRN framework ex- tions f(x) are connected together to form the outputs tends and unifies the work of Journel & Huijbregts y(x). The strengths of the connections change as a (1978), Neal(1996), Gelfand et al.(2004), Teh et al. function of x; the weights themselves { the entries of (2005), Adams & Stegle(2008), Turner(2010), and W (x) { are functions. Old connections can break and Wilson & Ghahramani(2010b; 2011). new connections can form. This is an adaptive network, where the signal and noise correlations between Throughout this text we assume we are given a dataset the components of y(x) vary with x. We label the D f g of input output pairs, = (xi; y(xi)) : i = 1;:::;N , length-scale hyperparameters for the kernels kw and where x 2 X is an input (predictor) variable belonging kfi as θw and θf respectively. We often assume that all X to an arbitrary set , and y(x) is the corresponding the weight GPs share the same covariance kernel kw, p dimensional output; each element of y(x) is a one including hyperparameters. Roughly speaking, shar- dimensional output (response) variable, for example ing length-scale hyperparameters amongst the weights the concentration of a single heavy metal at a geo- means that, a priori, the strengths of the connections graphical location x. We aim to predict y(x∗)jx∗; D in Figure1 vary with x at the same rate. and Σ(x∗) = cov[y(x∗)jx∗; D] at a test input x∗, while accounting for input dependent signal and noise cor- To explicitly separate the adaptive signal and noise relations between the elements of y(x). correlations, we re-write (1) as We start by introducing the GPRN framework and y(x) = W (x)f(x) + σf W (x) + σyz : (3) | {z } | {z } discussing inference. We then further discuss related signal noise work, before comparing to eight multiple output GP Given W (x), each of the outputs yi(x), i = 1; : : : ; p, is models, on gene expression and geostatistics datasets, a Gaussian process with kernel and three multivariate volatility models on several q benchmark financial datasets. In the supplementary X 2 ky (xa; xw) = Wij(xa)k ^ (xa; xw)Wij(xw)+δawσ : material (Wilson & Ghahramani, 2012) we further dis- i fj y cuss theoretical aspects of GPRN, and review GP re- j=1 (4) gression and notation (Rasmussen & Williams, 2006). 1Coincidentally, there is an unrelated paper called \Gaussian process networks" (Friedman & Nachman, 2000), which is about learning the structure of Bayesian networks { e.g. the direction of dependence between random variables. Gaussian Process Regression Networks ) (x y1(x) than p, the dimension of y(x), the model performs W11 fˆ1(x) W dimensionality reduction and matrix factorization as 1 q . ( part of the regression on y(x) and cov[y(x)]. How- x . θw θf σf . ) ever, we may want q > p, for instance if the output space were one dimensional (p = 1). In this case we x fî(x) yj(x) W (x) fˆ(x) would need q > 1 for nonstationary length-scales and W σ . y . p . 1 . covariance structures. For a given dataset, we can vary ( . x ) q and select the value which gives the highest marginal fˆq(x) W pq likelihood on training data. (x) yp(x) y(x) a) b) 3. Inference Figure 1. The Gaussian process regression network. Latent We have specified a prior p(y(x)) at all points x in the random variables and observables are respectively labelled domain X , and a noise model, so we can infer the pos- with circles and squares, except for the weight functions in terior p(y(x)jD).

Load more