
Gaussian Process Kernels for Pattern Discovery and Extrapolation

Andrew Gordon Wilson [email protected]
Department of Engineering, University of Cambridge, Cambridge, UK

Ryan Prescott Adams [email protected]
School of Engineering and Applied Sciences, Harvard University, Cambridge, USA

Abstract

Gaussian processes are rich distributions over functions, which provide a Bayesian nonparametric approach to smoothing and interpolation. We introduce simple closed form kernels that can be used with Gaussian processes to discover patterns and enable extrapolation. These kernels are derived by modelling a spectral density – the Fourier transform of a kernel – with a Gaussian mixture. The proposed kernels support a broad class of stationary covariances, but Gaussian process inference remains simple and analytic. We demonstrate the proposed kernels by discovering patterns and performing long range extrapolation on synthetic examples, as well as atmospheric CO2 trends and airline passenger data. We also show that it is possible to reconstruct several popular standard covariances within our framework.

1. Introduction

Machine learning is fundamentally about pattern discovery. The first machine learning models, such as the perceptron (Rosenblatt, 1962), were based on a simple model of a neuron (McCulloch & Pitts, 1943). Papers such as Rumelhart et al. (1986) inspired hope that it would be possible to develop intelligent agents with models like neural networks, which could automatically discover hidden representations in data. Indeed, machine learning aims not only to equip humans with tools to analyze data, but to fully automate the learning and decision making process.

Research on Gaussian processes (GPs) within the machine learning community developed out of neural networks research, triggered by Neal (1996), who observed that Bayesian neural networks became Gaussian processes as the number of hidden units approached infinity. Neal (1996) conjectured that “there may be simpler ways to do inference in this case”.

These simple inference techniques became the cornerstone of subsequent Gaussian process models for machine learning (Rasmussen & Williams, 2006). These models construct a prior directly over functions, rather than parameters. Assuming Gaussian noise, one can analytically infer a posterior distribution over these functions, given data. Gaussian process models have become popular for non-linear regression and classification (Rasmussen & Williams, 2006), and often have impressive empirical performance (Rasmussen, 1996).

The properties of likely functions under a GP, e.g., smoothness, periodicity, etc., are controlled by a positive definite covariance kernel [1], an operator which determines the similarity between pairs of points in the domain of the random function. The choice of kernel profoundly affects the performance of a Gaussian process on a given task – as much as the choice of architecture, activation functions, learning rate, etc., can affect the performance of a neural network.

Gaussian processes are sometimes used as expressive statistical tools, where the pattern discovery is performed by a human, and then hard coded into parametric kernels. Often, however, the squared exponential (Gaussian) kernel is used by default. In either case, GPs are used as smoothing interpolators with a fixed (albeit infinite) set of basis functions. Such simple smoothing devices are not a realistic replacement for neural networks, which were envisaged as intelligent agents that could discover hidden features in data via adaptive basis functions (MacKay, 1998). [2]

[1] The terms covariance kernel, covariance function, kernel function, and kernel are used interchangeably.
[2] We refer to representations, features and patterns interchangeably. Features sometimes means low dimensional representations of data, like neurons in a neural network.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

However, Bayesian nonparametrics can help build automated intelligent systems that reason and make decisions. It has been suggested that the human ability for inductive reasoning – concept generalization with remarkably few examples – could derive from a prior combined with Bayesian inference (Yuille & Kersten, 2006; Tenenbaum et al., 2011; Steyvers et al., 2006). Bayesian nonparametric models, and Gaussian processes in particular, are an expressive way to encode prior knowledge, and also reflect the belief that the real world is infinitely complex (Neal, 1996).

With more expressive kernels, one could use Gaussian processes to learn hidden representations in data. Expressive kernels have been developed by combining Gaussian processes in a type of Bayesian neural network structure (Salakhutdinov & Hinton, 2008; Wilson et al., 2012; Damianou & Lawrence, 2012). However, these approaches, while promising, typically 1) are designed to model specific types of structure (e.g., input-dependent correlations between different tasks); 2) make use of component GPs with simple interpolating kernels; 3) indirectly induce complicated kernels that do not have a closed form and are difficult to interpret; and 4) require sophisticated approximate inference techniques that are much more demanding than that required by simple analytic kernels.

Sophisticated kernels are most often achieved by composing together a few standard kernel functions (Archambeau & Bach, 2011; Durrande et al., 2011; Gönen & Alpaydın, 2011; Rasmussen & Williams, 2006). Tight restrictions are typically enforced on these compositions and they are hand-crafted for specialized applications. Without such restrictions, complicated compositions of kernels can lead to overfitting and unmanageable hyperparameter inference. Moreover, while some compositions (e.g., addition) have an interpretable effect, many other operations change the distribution over functions in ways that are difficult to identify. It is difficult, therefore, to construct an effective inductive bias for kernel composition that leads to automatic discovery of the appropriate statistical structure, without human intervention.

This difficulty is exacerbated by the fact that it is challenging to say anything about the covariance function of a stochastic process from a single draw if no assumptions are made. If we allow the covariance between any two points in the input space to arise from any positive definite function, with equal probability, then we gain essentially no information from a single realization. Most commonly one assumes a restriction to stationary kernels, meaning that covariances are invariant to translations in the input space.

In this paper, we explore flexible classes of kernels that go beyond composition of simple analytic forms, while maintaining the useful inductive bias of stationarity. We propose new kernels which can be used to automatically discover patterns and extrapolate far beyond the available data. This class of kernels contains many stationary kernels, but has a simple closed form that leads to straightforward analytic inference. The simplicity of these kernels is one of their strongest qualities. In many cases, these kernels can be used as a drop in replacement for the popular squared exponential kernel, with benefits in performance and expressiveness. By learning features in data, we not only improve predictive performance, but we can more deeply understand the structure of the problem at hand – greenhouse gases, air travel, heart physiology, brain activity, etc.

After a brief review of Gaussian processes in Section 2, we derive the new kernels in Section 3 by modelling a spectral density with a mixture of Gaussians. We focus our experiments in Section 4 on elucidating the fundamental differences between the proposed kernels and the popular alternatives in Rasmussen & Williams (2006). In particular, we show how the proposed kernels can automatically discover patterns and extrapolate on the CO2 dataset in Rasmussen & Williams (2006), on a synthetic dataset with strong negative covariances, on a difficult synthetic sinc pattern, and on airline passenger data. We also use our framework to reconstruct several popular standard kernels.

2. Gaussian Processes

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. Using a Gaussian process, we can define a distribution over functions f(x),

f(x) ∼ GP(m(x), k(x, x′)),   (1)

where x ∈ R^P is an arbitrary input variable, and the mean function m(x) and covariance kernel k(x, x′) are defined as

m(x) = E[f(x)],   (2)
k(x, x′) = cov(f(x), f(x′)).   (3)

Any collection of function values has a joint Gaussian distribution

[f(x1), f(x2), ..., f(xN)]⊤ ∼ N(µ, K),   (4)

where the N × N covariance matrix K has entries Kij = k(xi, xj), and the mean µ has entries µi = m(xi). The properties of the functions – smoothness, periodicity, etc. – are determined by the kernel function.
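To ground Eqs. (1)–(4), the following minimal NumPy sketch (an illustration, not code from the paper) builds the covariance matrix K from a kernel and draws joint samples of function values from a zero-mean GP prior, using the squared exponential kernel defined in Eq. (5) below.

    import numpy as np

    def se_kernel(X1, X2, ell=1.0):
        # Squared exponential kernel of Eq. (5): exp(-0.5 ||x - x'||^2 / ell^2)
        sq_dist = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
        return np.exp(-0.5 * sq_dist / ell ** 2)

    def sample_gp_prior(X, kernel, n_samples=3, jitter=1e-8):
        # Joint Gaussian draw of function values, Eq. (4), with m(x) = 0:
        # K_ij = k(x_i, x_j),  f ~ N(0, K)
        K = kernel(X, X) + jitter * np.eye(len(X))
        L = np.linalg.cholesky(K)
        return (L @ np.random.randn(len(X), n_samples)).T

    X = np.linspace(0, 10, 200)[:, None]   # N = 200 one-dimensional inputs
    f = sample_gp_prior(X, se_kernel)      # three draws from the GP prior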

The popular squared exponential (SE) kernel has the form

k_SE(x, x′) = exp(−0.5 ||x − x′||² / ℓ²).   (5)

Functions drawn from a Gaussian process with this kernel function are infinitely differentiable, and can display long range trends. GPs with a squared exponential kernel are simply smoothing devices: the only covariance structure that can be learned from data is the length-scale ℓ, which determines how quickly a Gaussian process function varies with x.

Assuming Gaussian noise, one can analytically infer a posterior predictive distribution over Gaussian process functions, and analytically derive a marginal likelihood of the observed function values y given only hyperparameters θ, and the input locations {x_n}_{n=1}^N, written p(y | θ, {x_n}_{n=1}^N). This marginal likelihood can be optimised to estimate hyperparameters such as ℓ, or used to integrate out the hyperparameters via Markov chain Monte Carlo (Murray & Adams, 2010). Detailed Gaussian process references include Rasmussen & Williams (2006), Stein (1999), and Cressie (1993).

3. Kernels for Pattern Discovery

In this section we introduce a class of kernels that can discover patterns, extrapolate, and model negative covariances. This class contains a large set of stationary kernels. Roughly, a kernel measures the similarity between data points. As in Equation (3), the covariance kernel of a GP determines how the associated random functions will tend to vary with inputs (predictors) x ∈ R^P. A stationary kernel is a function of τ = x − x′, i.e., it is invariant to translation of the inputs.

Any stationary kernel (aka covariance function) can be expressed as an integral using Bochner's theorem (Bochner, 1959; Stein, 1999):

Theorem 3.1 (Bochner) A complex-valued function k on R^P is the covariance function of a weakly stationary mean square continuous complex-valued random process on R^P if and only if it can be represented as

k(τ) = ∫_{R^P} e^{2πi s⊤τ} ψ(ds),   (6)

where ψ is a positive finite measure.

If ψ has a density S(s), then S is called the spectral density or power spectrum of k, and k and S are Fourier duals (Chatfield, 1989):

k(τ) = ∫ S(s) e^{2πi s⊤τ} ds,   (7)
S(s) = ∫ k(τ) e^{−2πi s⊤τ} dτ.   (8)

In other words, a spectral density entirely determines the properties of a stationary kernel. Substituting the squared exponential kernel of (5) into (8), we find its spectral density is S_SE(s) = (2πℓ²)^{P/2} exp(−2π²ℓ²s²). Therefore SE kernels, and mixtures of SE kernels, are a very small corner of the set of possible stationary kernels, as they correspond only to Gaussian spectral densities centred on the origin.

However, by using a mixture of Gaussians that have non-zero means, one can achieve a much wider range of spectral densities. Indeed, mixtures of Gaussians are dense in the set of all distribution functions (Kostantinos, 2000). Therefore, the dual of this set is also dense in stationary covariances. That is, we can approximate any stationary covariance kernel to arbitrary precision, given enough mixture components in the spectral representation. This observation motivates our approach, which is to model GP covariance functions via spectral densities that are scale-location mixtures of Gaussians.

We first consider a simple case, where

φ(s; µ, σ²) = (1 / √(2πσ²)) exp{−(s − µ)² / (2σ²)},   (9)
S(s) = [φ(s) + φ(−s)] / 2,   (10)

noting that spectral densities are symmetric (Rasmussen & Williams, 2006). Substituting S(s) into equation (7), we find

k(τ) = exp{−2π²τ²σ²} cos(2πτµ).   (11)

If φ(s) is instead a mixture of Q Gaussians on R^P, where the qth component has mean vector µ_q = (µ_q^(1), ..., µ_q^(P)) and covariance matrix M_q = diag(v_q^(1), ..., v_q^(P)), and τ_p is the pth component of the P dimensional vector τ = x − x′, then

k(τ) = Σ_{q=1}^{Q} w_q Π_{p=1}^{P} exp{−2π² τ_p² v_q^(p)} cos(2π τ_p µ_q^(p)).   (12)

The integral in (7) is tractable even when the spectral density is an arbitrary Gaussian mixture, allowing us to derive the exact closed form expressions in Eqs. (11) and (12) [3], and to perform analytic inference with Gaussian processes. Moreover, this class of kernels is expressive – containing many stationary kernels – but nevertheless has a simple form.

[3] Detailed derivations of Eqs. (11) and (12), and code, are in the supplement (Wilson & Adams, 2013).

These kernels are easy to interpret, and provide drop-in replacements for kernels in Rasmussen & Williams (2006). The weights w_q specify the relative contribution of each mixture component. The inverse means 1/µ_q are the component periods, and the inverse standard deviations 1/√v_q are length-scales, determining how quickly a component varies with the inputs x. The kernel in Eq. (12) can also be interpreted through its associated spectral density. In Section 4, we use the learned spectral density to interpret the number of discovered patterns in the data and how these patterns generalize. Henceforth, we refer to the kernel in Eq. (12) as a spectral mixture (SM) kernel.
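The closed forms in Eqs. (11) and (12) follow from the standard Fourier transform of a Gaussian density; the detailed derivations are in the supplement (Wilson & Adams, 2013), but the one-dimensional case can be sketched as follows:

    \int \phi(s;\mu,\sigma^2)\, e^{2\pi i s\tau}\, ds
        = \exp\!\left(2\pi i \mu \tau - 2\pi^2 \sigma^2 \tau^2\right),
    \int \phi(-s;\mu,\sigma^2)\, e^{2\pi i s\tau}\, ds
        = \exp\!\left(-2\pi i \mu \tau - 2\pi^2 \sigma^2 \tau^2\right),

so substituting S(s) = [\phi(s) + \phi(-s)]/2 into Eq. (7) gives

    k(\tau) = \tfrac{1}{2} e^{-2\pi^2\sigma^2\tau^2}
              \left( e^{2\pi i\mu\tau} + e^{-2\pi i\mu\tau} \right)
            = e^{-2\pi^2\sigma^2\tau^2} \cos(2\pi\mu\tau),

which is Eq. (11). Eq. (12) then follows by applying the same identity independently in each of the P input dimensions (the mixture components have diagonal covariances M_q) and summing the Q weighted components.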

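To make Eq. (12) concrete, the following is a minimal NumPy sketch of the SM kernel; it is an illustration rather than the code released with the supplement, with weights, means and variances standing in for w_q, µ_q and v_q.

    import numpy as np

    def sm_kernel(X1, X2, weights, means, variances):
        # Spectral mixture kernel, Eq. (12).
        # X1: (N1, P), X2: (N2, P); weights: (Q,), means: (Q, P), variances: (Q, P).
        tau = X1[:, None, :] - X2[None, :, :]                 # (N1, N2, P)
        K = np.zeros((X1.shape[0], X2.shape[0]))
        for w, mu, v in zip(weights, means, variances):
            exp_term = np.exp(-2.0 * np.pi ** 2 * tau ** 2 * v)   # per-dimension envelope
            cos_term = np.cos(2.0 * np.pi * tau * mu)             # per-dimension oscillation
            K += w * np.prod(exp_term * cos_term, axis=-1)
        return K

    # two components in one dimension: Q = 2, P = 1
    X = np.linspace(0, 10, 100)[:, None]
    w = np.array([1.0, 0.5])
    mu = np.array([[0.1], [0.35]])     # spectral means; 1/mu are periods
    v = np.array([[0.01], [0.02]])     # spectral variances; 1/sqrt(v) are length-scales
    K = sm_kernel(X, X, w, mu, v)

Because each term is a product of valid stationary kernels weighted by a non-negative w_q, a Gram matrix built this way is symmetric positive semi-definite and can be used directly in the Gaussian process machinery of Section 2.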
4. Experiments

We show how the SM kernel in Eq. (12) can be used to discover patterns, extrapolate, and model negative covariances. We contrast the SM kernel with popular kernels in, e.g., Rasmussen & Williams (2006) and Abrahamsen (1997), which typically only provide smooth interpolation. Although the SM kernel generally improves predictive likelihood over popular alternatives, we focus on clearly visualizing the learned kernels and spectral densities, examining patterns and predictions, and discovering structure in the data. Our objective is to elucidate the fundamental differences between the proposed SM kernel and the alternatives.

In all experiments, Gaussian noise is assumed, so that marginalization over the unknown function can be performed in closed form. Kernel hyperparameters are trained using nonlinear conjugate gradients to optimize the marginal likelihood p(y | θ, {x_n}_{n=1}^N) of the data y given hyperparameters θ, as described in Section 2, assuming a zero mean GP. A type of “automatic relevance determination” (MacKay et al., 1994; Tipping, 2004) takes place during training, minimizing the effect of extraneous components in the proposed model, through the complexity penalty in the marginal likelihood (Rasmussen & Williams, 2006). Moreover, the exponential terms in the SM kernel of Eq. (12) have an annealing effect on the marginal likelihood, reducing multimodality in the frequency parameters, making it easier to naively optimize the marginal likelihood without converging to undesirable local optima. For a fully Bayesian treatment, the spectral density could alternatively be integrated out using Markov chain Monte Carlo (Murray & Adams, 2010), rather than choosing a point estimate. However, we wish to emphasize that the SM kernel can be successfully used in the same way as other popular kernels, without additional inference efforts.

We compare with the popular squared exponential (SE), Matérn (MA), rational quadratic (RQ), and periodic (PE) kernels. In each comparison, we attempt to give these alternative kernels fair treatment: we initialise hyperparameters at values that give high marginal likelihoods and which are well suited to the datasets, based on features we can already see in the data. Conversely, we randomly initialise the parameters for the SM kernel. Training runtimes are on the order of minutes for all tested kernels. In these examples, comparing with multiple kernel learning (MKL) (Gönen & Alpaydın, 2011) has limited additional value. MKL is not typically intended for pattern discovery, and often uses mixtures of SE kernels. Mixtures of SE kernels correspond to scale-mixtures of Gaussian spectral densities, and do not perform well on these data, which are described by highly multimodal non-Gaussian spectral densities.
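As an illustration of this training procedure (a sketch under simplifying assumptions, not the implementation used for the experiments), the snippet below maximises the Gaussian-noise log marginal likelihood with nonlinear conjugate gradients, using the SE kernel of Eq. (5) and a synthetic toy series for brevity; the SM kernel hyperparameters {w_q, µ_q, v_q} are fitted in exactly the same way, by building the Gram matrix from Eq. (12) instead. The complexity penalty referred to above is the log-determinant term in this objective.

    import numpy as np
    from scipy.optimize import minimize

    def se_kernel(X1, X2, ell):
        sq_dist = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
        return np.exp(-0.5 * sq_dist / ell ** 2)

    def neg_log_marginal_likelihood(log_params, X, y):
        # -log p(y|theta) = 0.5 y^T (K + s^2 I)^-1 y + 0.5 log|K + s^2 I| + (N/2) log(2 pi)
        ell, sigma = np.exp(log_params)          # log parameterisation keeps both positive
        K = se_kernel(X, X, ell) + (sigma ** 2 + 1e-8) * np.eye(len(X))
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(X) * np.log(2 * np.pi)

    def gp_predict(X, y, Xstar, ell, sigma):
        # Posterior predictive mean and variance of the noise-free function
        K = se_kernel(X, X, ell) + (sigma ** 2 + 1e-8) * np.eye(len(X))
        Ks = se_kernel(X, Xstar, ell)            # (N, N*)
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        v = np.linalg.solve(L, Ks)
        mean = Ks.T @ alpha
        var = np.diag(se_kernel(Xstar, Xstar, ell)) - np.sum(v ** 2, axis=0)
        return mean, var

    # toy one-dimensional data; 'CG' is scipy's nonlinear conjugate gradient method
    X = np.linspace(0, 10, 50)[:, None]
    y = np.sin(X[:, 0]) + 0.1 * np.random.randn(50)
    res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 0.1]),
                   args=(X, y), method='CG')
    ell_hat, sigma_hat = np.exp(res.x)
    mu, var = gp_predict(X, y, np.linspace(0, 15, 100)[:, None], ell_hat, sigma_hat)
    # a 95% band, as in the figures below, is mu +/- 2 * np.sqrt(var + sigma_hat ** 2)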
4.1. Extrapolating Atmospheric CO2

Keeling & Whorf (2004) recorded monthly average atmospheric CO2 concentrations at the Mauna Loa Observatory, Hawaii. The data are shown in Figure 1. The first 200 months are used for training (in blue), and the remaining 301 months (≈ 25 years) are used for testing (in green).

This dataset was used in Rasmussen & Williams (2006), and is frequently used in Gaussian process tutorials, to show how GPs are flexible statistical tools: a human can look at the data, recognize patterns, and then hard code those patterns into covariance kernels. Rasmussen & Williams (2006) identify, by looking at the blue and green curves in Figure 1a, a long term rising trend, seasonal variation with possible decay away from periodicity, medium term irregularities, and noise, and hard code a stationary covariance kernel to represent each of these features.

However, in this view of GP modelling, all of the interesting pattern discovery is done by the human user, and the GP is used simply as a smoothing device, with the flexibility to incorporate human inferences in the prior. Our contention is that such pattern recognition can also be performed algorithmically. To discover these patterns without encoding them a priori into the GP, we use the spectral mixture kernel in Eq. (12), with Q = 10. The results are shown in Figure 1a.

In the training region, predictions using each kernel are essentially equivalent, and entirely overlap with the training data. However, unlike the other kernels, the SM kernel (in black) is able to discover patterns in the training data and accurately extrapolate over a long range. The 95% high predictive density (HPD) region contains the true CO2 measurements for the duration of the measurements.

We can see the structure discovered by the SM kernel in the learned log spectral density in Figure 1b. Of the Q = 10 components, only seven were used. There is a peak at a frequency near 0.07, which corresponds to a period of about 14 months, roughly a year, which is in line with seasonal variations.

However, there are departures away from periodicity, and accordingly, there is a large variance about this peak, reflecting the uncertainty in this learned feature. We see other, sharper peaks corresponding to periods of 6 months, 4 months, 3 months and 1 month.

Figure 1. Mauna Loa CO2 Concentrations. a) Forecasting CO2. The training data are in blue, the testing data in green. Mean forecasts made using the SM kernel are in black, with 2 stdev about the mean (95% of the predictive mass) in gray shade. Predictions using the Matérn (MA), squared exponential (SE), rational quadratic (RQ), and periodic kernels (PE) are in cyan, dashed red, magenta, and orange, respectively. b) The log spectral densities of the learned SM and squared exponential kernels are in black and red, respectively.

In red, we show the spectral density for the learned SE kernel, which misses many of the frequencies identified as important using the SM kernel. The peak at a frequency of 1 is an artifact caused by aliasing. While artifacts caused by aliasing (noise, mean function, etc.) into the kernel do not affect the performance of the model, for interpretability it can be useful to restrict the model from learning frequencies greater than the sampling rate of the data.

All SE kernels have a Gaussian spectral density centred on zero. Since a mixture of Gaussians centred on the origin is a poor approximation to many density functions, combinations of SE kernels have limited expressive power. Indeed the spectral density learned by the SM kernel in Figure 1b is highly non-Gaussian. The test predictive performance using the SM, SE, MA, RQ, and PE kernels is given in Table 1, under the heading CO2.

4.2. Recovering Popular Kernels

The SM class of kernels contains many stationary kernels, since mixtures of Gaussians can be used to construct a wide range of spectral densities. Even with a small number of mixture components, e.g., Q ≤ 10, the SM kernel can closely recover popular stationary kernels catalogued in Rasmussen & Williams (2006).

As an example, we start by sampling 100 points from a one-dimensional GP with a Matérn kernel with degrees of freedom ν = 3/2:

k_MA(τ) = a (1 + √3 τ / ℓ) exp(−√3 τ / ℓ),   (13)

where ℓ = 5 and a = 4. Sample functions from a Gaussian process with this Matérn kernel are far less smooth (only once-differentiable) than Gaussian process functions with a squared exponential kernel.

We attempt to reconstruct the kernel underlying the data by training an SM kernel with Q = 10. After training, only Q = 4 components are used. The log marginal likelihood of the data – having integrated away the Gaussian process – using the trained SM kernel is −133, compared to −138 for the Matérn kernel that generated the data. Training the SE kernel in (5) gives a log marginal likelihood of −140.

Figure 2. Recovering popular correlation functions (normalised kernels), with τ = x − x′. The true correlation function underlying the data is in green, and SM, SE, and empirical correlation functions are in dashed black, red, and magenta, respectively. Data are generated from a) a Matérn kernel, and b) a sum of RQ and periodic kernels.

Figure 2a shows the learned SM correlation function [4], compared to the generating Matérn correlation function, the empirical autocorrelation function, and learned squared exponential correlation function. Although often used in geostatistics to guide choices of GP kernels (and parameters) (Cressie, 1993), the empirical autocorrelation function tends to be unreliable, particularly with a small amount of data (e.g., N < 1000), and at high lags (for τ > 10). In Figure 2a, the empirical autocorrelation function is erratic and does not resemble the Matérn kernel for τ > 10. Moreover, the squared exponential kernel cannot capture the heavy tails of the Matérn kernel, no matter what length-scale it has. Even though the SM kernel is infinitely differentiable, it can closely approximate processes which are finitely differentiable, because mixtures of Gaussians can closely approximate the spectral densities of these processes, even with a small number of components, as in Figure 2a.

[4] A correlation function c(x, x′) is a normalised covariance kernel k(x, x′), such that c(x, x′) = k(x, x′) / √(k(x, x) k(x′, x′)) and c(x, x) = 1.
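A minimal sketch of this sampling step; the input locations are not stated in the text, so a unit-spaced grid is assumed here for illustration.

    import numpy as np

    def matern32_kernel(x1, x2, ell=5.0, a=4.0):
        # Matern kernel with nu = 3/2, Eq. (13), evaluated at tau = |x - x'|
        tau = np.abs(x1[:, None] - x2[None, :])
        r = np.sqrt(3.0) * tau / ell
        return a * (1.0 + r) * np.exp(-r)

    # 100 points from a zero-mean GP with ell = 5, a = 4, as in the text
    x = np.arange(100, dtype=float)
    K = matern32_kernel(x, x) + 1e-8 * np.eye(len(x))   # jitter for numerical stability
    y = np.linalg.cholesky(K) @ np.random.randn(len(x))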

Next, we reconstruct a mixture of the rational quadratic (RQ) and periodic kernels (PE) in Rasmussen & Williams (2006):

k_RQ(τ) = (1 + τ² / (2 α ℓ_RQ²))^(−α),   (14)
k_PE(τ) = exp(−2 sin²(π τ ω) / ℓ_PE²).   (15)

The rational quadratic kernel in (14) is derived as a scale mixture of squared exponential kernels with different length-scales. The standard periodic kernel in (15) is derived by mapping the two dimensional variable u(x) = (cos(x), sin(x)) through the squared exponential kernel in (5). Derivations for both the RQ and PE kernels in (14) and (15) are in Rasmussen & Williams (2006). Rational quadratic and Matérn kernels are also discussed in Abrahamsen (1997). We sample 100 points from a Gaussian process with kernel 10 k_RQ + 4 k_PE, with α = 2, ω = 1/20, ℓ_RQ = 40, ℓ_PE = 1.
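For reference, the two generating kernels and their combination, with the hyperparameter values quoted above (a sketch; a Gram matrix and samples can be produced exactly as in the Matérn example).

    import numpy as np

    def rq_kernel(tau, alpha=2.0, ell_rq=40.0):
        # Rational quadratic kernel, Eq. (14)
        return (1.0 + tau ** 2 / (2.0 * alpha * ell_rq ** 2)) ** (-alpha)

    def pe_kernel(tau, omega=1.0 / 20, ell_pe=1.0):
        # Periodic kernel, Eq. (15)
        return np.exp(-2.0 * np.sin(np.pi * tau * omega) ** 2 / ell_pe ** 2)

    def mixed_kernel(tau):
        # Generating kernel used in this experiment: 10 k_RQ + 4 k_PE
        return 10.0 * rq_kernel(tau) + 4.0 * pe_kernel(tau)

    def gram(kernel, x1, x2):
        # Evaluate a stationary kernel at all pairwise separations tau = x - x'
        return kernel(x1[:, None] - x2[None, :])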

We reconstruct the kernel function of the sampled process using an SM kernel with Q = 4, with the results shown in Figure 2b. The heavy tails of the RQ kernel are modelled by two components with large periods (essentially aperiodic), and one short length-scale and one large length-scale. The third component has a relatively short length-scale, and a period of 20. There is not enough information in the 100 sample points to justify using more than Q = 3 components, and so the fourth component in the SM kernel has no effect, through the complexity penalty in the marginal likelihood. The empirical autocorrelation function somewhat captures the periodicity in the data, but significantly underestimates the correlations. The squared exponential kernel learns a long length-scale: since the SE kernel is highly misspecified with the true generating kernel, the data are explained as noise.

4.3. Negative Covariances

All of the stationary covariance functions in the standard machine learning Gaussian process reference Rasmussen & Williams (2006) are everywhere positive, including the periodic kernel, k(τ) = exp(−2 sin²(π τ ω)/ℓ²). While positive covariances are often suitable for interpolation, capturing negative covariances can be essential for extrapolating patterns: for example, linear trends have long-range negative covariances. We test the ability of the SM kernel to learn negative covariances, by sampling 400 points from a simple AR(1) discrete time GP:

y(x + 1) = −e^(−0.01) y(x) + σ ε(x),   (16)
ε(x) ∼ N(0, 1),   (17)

which has kernel

k(x, x′) = σ² (−e^(−0.01))^(|x−x′|) / (1 − e^(−0.02)).   (18)

Figure 3. Negative Covariances. a) Observations of a discrete time autoregressive (AR) series with negative covariances. b) The SM learned kernel is shown in black, while the true kernel of the AR series is in green, with τ = x − x′. c) The spectral density of the learned SM kernel is in black.

The process in Eq. (16) is shown in Figure 3a. This process follows an oscillatory pattern, systematically switching states every x = 1 unit, but is not periodic and has long range covariances: if we were to only view every second data point, the resulting process would vary rather slowly and smoothly.

We see in Figure 3b that the learned SM covariance function accurately reconstructs the true covariance function. The spectral density in Figure 3c shows a sharp peak at a frequency of 0.5, or a period of 2. This feature represents the tendency for the process to oscillate from positive to negative every x = 1 unit. In this case Q = 4, but after automatic relevance determination in training, only 3 components were used. Using the SM kernel, we forecast 20 units ahead and compare to other kernels in Table 1 (NEG COV).
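A minimal simulation of this process and its covariance function; the noise scale σ is not given in the text, so σ = 1 is assumed here.

    import numpy as np

    # AR(1) series of Eq. (16): y(x+1) = -exp(-0.01) y(x) + sigma * eps(x), eps ~ N(0, 1)
    sigma = 1.0          # assumed; not specified in the text
    n = 400
    y = np.zeros(n)
    eps = np.random.randn(n)
    for t in range(n - 1):
        y[t + 1] = -np.exp(-0.01) * y[t] + sigma * eps[t]

    def ar1_kernel(x1, x2, sigma=1.0):
        # Stationary covariance of Eq. (18), valid at the integer lags of this process;
        # the negative base makes the covariance alternate in sign with the lag
        lag = np.abs(x1 - x2)
        return sigma ** 2 * (-np.exp(-0.01)) ** lag / (1.0 - np.exp(-0.02))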
4.4. Discovering the Sinc Pattern

The sinc function is defined as sinc(x) = sin(πx)/(πx). We created a pattern combining three sinc functions:

y(x) = sinc(x + 10) + sinc(x) + sinc(x − 10).   (19)
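A sketch of this training set; the paper reports 700 training points with the region x ∈ [−4.5, 4.5] held out and 300 test points, but the grid spacing and extent below are assumptions for illustration.

    import numpy as np

    def pattern(x):
        # Eq. (19); np.sinc(x) is sin(pi x) / (pi x), matching the definition above
        return np.sinc(x + 10) + np.sinc(x) + np.sinc(x - 10)

    x_all = np.linspace(-15, 15, 1000)                    # assumed grid
    train_mask = (x_all < -4.5) | (x_all > 4.5)           # hold out the missing region
    x_train, y_train = x_all[train_mask], pattern(x_all[train_mask])
    x_test, y_test = x_all[~train_mask], pattern(x_all[~train_mask])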

Figure 4. Discovering a sinc pattern. a) The training data are shown in blue. The goal is to fill in the missing region x ∈ [−4.5, 4.5]. b) The training data are in blue, the testing data in green. The mean of the predictive distribution using the SM kernel is shown in dashed black. The mean of the predictive distributions using the squared exponential, Matérn, rational quadratic, and periodic kernels, are in red, cyan, magenta, and orange. c) The learned SM correlation function (normalised kernel) is shown in black, and the learned Matérn correlation function is in red, with τ = x − x′. d) The log spectral densities of the SM and SE kernels are in black and red, respectively.

This is a complex oscillatory pattern. Given only the points shown in Figure 4a, we wish to complete the pattern for x ∈ [−4.5, 4.5]. Unlike the CO2 example in Section 4.1, it is perhaps even difficult for a human to extrapolate the missing pattern from the training data. It is an interesting exercise to focus on this figure, identify features, and fill in the missing part.

Notice that there is complete symmetry about the origin x = 0, peaks at x = −10 and x = 10, and destructive interference on each side of the peaks facing the origin. We therefore might expect a peak at x = 0 and a symmetric pattern around x = 0.

As shown in Figure 4b, the SM kernel with Q = 10 reconstructs the pattern in the region x ∈ [−4.5, 4.5] almost perfectly from the 700 training points in blue. Moreover, using the SM kernel, 95% of the posterior predictive mass entirely contains the true pattern in x ∈ [−4.5, 4.5]. GPs using Matérn, SE, RQ, and periodic kernels are able to predict reasonably within x = 0.5 units of the training data, but entirely miss the pattern in x ∈ [−4.5, 4.5].

Figure 4c shows the learned SM correlation function (normalised kernel). For τ ∈ [0, 10] there is a local pattern, roughly representing the behaviour of a single sinc function. For τ > 10 there is a repetitive pattern representing a new sinc function every 10 units – an extrapolation a human might make. With sinc functions centred at x = −10, 0, 10, we might expect more sinc patterns every 10 units. The learned Matérn correlation function is shown in red in Figure 4c – unable to discover complex patterns in the data, it simply assigns high correlation to nearby points.

Figure 4d shows the (highly non-Gaussian) spectral density of the SM kernel, with peaks at 0.003, 0.1, 0.2, 0.3, 0.415, 0.424, 0.492. In this case, only 7 of the Q = 10 components are used. The peak at 0.1 represents a period of 10: every 10 units, a sinc function is repeated. The variance of this peak is small, meaning the method will extrapolate this structure over long distances. By contrast, the squared exponential spectral density simply has a broad peak, centred at the origin. The predictive performance for recovering the missing 300 test points (in green) is given in Table 1 (SINC).

4.5. Airline Passenger Data

Figure 5a shows airline passenger numbers, recorded monthly, from 1949 to 1961 (Hyndman, 2005). Based on only the first 96 monthly measurements, in blue, we wish to forecast airline passenger numbers for the next 48 months (4 years); the corresponding 48 test measurements are in green. Companies wish to make such long range forecasts to decide upon expensive long term investments, such as large passenger jets, which can cost hundreds of millions of dollars.

There are several features apparent in these data: short seasonal variations, a long term rising trend, and an absence of white noise artifacts. Many popular kernels are forced to make one of two choices: 1) Model the short term variations and ignore the long term trend, at the expense of extrapolation. 2) Model the long term trend and treat the shorter variations as noise, at the expense of interpolation.

As seen in Figure 5a, the Matérn kernel is more inclined to model the short term trends than the smoother SE or RQ kernels, resulting in sensible interpolation (predicting almost identical values to the training data in the training region), but poor extrapolation – moving quickly to the prior mean, having learned no structure to generalise beyond the data. The SE kernel interpolates somewhat sensibly, but appears to underestimate the magnitudes of peaks and troughs, and treats repetitive patterns in the data as noise. Extrapolation using the SE kernel is poor. The RQ kernel, which is a scale mixture of SE kernels, is more able to manage different length-scales in the data, and generalizes the long term trend better than the SE kernel, but interpolates poorly.

Table 1. We compare the test performance of the proposed spectral mixture (SM) kernel with squared exponential (SE), Matérn (MA), rational quadratic (RQ), and periodic (PE) kernels. The SM kernel consistently has the lowest mean squared error (MSE) and highest log likelihood (L).

          SM         SE       MA       RQ       PE
CO2
  MSE     9.5        1200     1200     980      1200
  L       170        −320     −240     −100     −1800
NEG COV
  MSE     62         210      210      210      210
  L       −25        −70      −70      −70      −70
SINC
  MSE     0.000045   0.16     0.10     0.11     0.05
  L       3900       2000     1600     2000     600
AIRLINE
  MSE     460        43000    37000    4200     46000
  L       −190       −260     −240     −280     −370

Figure 5. Predicting airline passenger numbers. a) The training and testing data are in blue and green, respectively. The mean of the predictive distribution using the SM kernel is shown in black, with 95% of the predictive probability mass shown in gray shade. The mean of the predictive distributions using the SE, Matérn, RQ, and periodic kernels, are in red, cyan, magenta, and orange, respectively. In the training region, the SM and Matérn kernels are not shown, since their predictions essentially overlap with the training data. b) The log spectral densities of the SM and squared exponential kernels are in black and red, respectively.

By contrast, the SM kernel interpolates nicely (overlapping with the data in the training region), and is able to extrapolate complex patterns far beyond the data, capturing the true airline passenger numbers for years after the data ends, within a small band containing 95% of the predictive probability mass.

Of the Q = 10 initial components specified for the SM kernel, 7 were used after training. The learned spectral density in Figure 5b shows a large sharp low frequency peak (at about 0.00148). This peak corresponds to the rising trend, which generalises well beyond the data (small variance peak), is smooth (low frequency), and is important for describing the data (large relative weighting). The next largest peak corresponds to the yearly trend of 12 months, which again generalises, but not to the extent of the smooth rising trend, since the variance of this peak is larger than for the peak near the origin. The higher frequency peak at x = 0.34 (period of 3 months) corresponds to the beginnings of new seasons, which can explain the effect of seasonal holidays on air traffic. A more detailed study of these features and their properties – frequencies, variances, etc. – could isolate less obvious factors affecting airline passenger numbers. Table 1 (AIRLINE) shows predictive performance for forecasting 48 months of airline passenger numbers.

5. Discussion

We have derived expressive closed form kernels and we have shown that these kernels, when used with Gaussian processes, can discover patterns in data and extrapolate over long ranges. The simplicity of these kernels is one of their strongest properties: they can be used as drop-in replacements for popular kernels such as the squared exponential kernel, with major benefits in expressiveness and performance, while retaining simple training and inference procedures.

Gaussian processes have proven themselves as smoothing interpolators. Pattern discovery and extrapolation is an exciting new direction for Bayesian nonparametric approaches, which can capture rich variations in data. We have shown how Bayesian nonparametric models can naturally be used to generalise a pattern from a small number of examples.

We have only begun to explore what could be done with such pattern discovery methods. In future work, one could integrate away a spectral density, using recently developed efficient MCMC for GP hyperparameters (Murray & Adams, 2010). Moreover, recent Toeplitz methods (Saatchi, 2011) could be applied to the SM kernel to significantly speed up inference.

Acknowledgements

We thank Richard E. Turner, Carl Edward Rasmussen, David A. Knowles, and Neil D. Lawrence, for helpful discussions. This work was partly supported by NSERC, and DARPA Young Faculty Award N66001-12-1-4219.

References

Abrahamsen, P. A review of Gaussian random fields and correlation functions. Norwegian Computing Center Technical report, 1997.

Archambeau, C. and Bach, F. Multiple Gaussian process models. arXiv preprint arXiv:1110.5238, 2011.

Bochner, Salomon. Lectures on Fourier Integrals (AM-42), volume 42. Princeton University Press, 1959.

Chatfield, C. The Analysis of Time Series: An Introduction. London: Chapman and Hall, 1989.

Cressie, N.A.C. Statistics for Spatial Data (Wiley Series in Probability and Statistics). Wiley-Interscience, 1993.

Damianou, A.C. and Lawrence, N.D. Deep Gaussian processes. arXiv preprint arXiv:1211.0358, 2012.

Durrande, N., Ginsbourger, D., and Roustant, O. Additive kernels for Gaussian process modeling. arXiv preprint arXiv:1103.4023, 2011.

Gönen, M. and Alpaydın, E. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268, 2011.

Hyndman, R.J. Time series data library, 2005. http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/.

Keeling, C. D. and Whorf, T. P. Atmospheric CO2 records from sites in the SIO air sampling network. Trends: A Compendium of Data on Global Change. Carbon Dioxide Information Analysis Center, 2004.

Kostantinos, N. Gaussian mixtures and their applications to signal processing. Advanced Signal Processing Handbook: Theory and Implementation for Radar, Sonar, and Medical Imaging Real Time Systems, 2000.

MacKay, David J.C. Introduction to Gaussian processes. In Christopher M. Bishop (ed.), Neural Networks and Machine Learning, chapter 11, pp. 133–165. Springer-Verlag, 1998.

MacKay, D.J.C. et al. Bayesian nonlinear modeling for the prediction competition. Ashrae Transactions, 100(2):1053–1062, 1994.

McCulloch, W.S. and Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biology, 5(4):115–133, 1943.

Murray, Iain and Adams, Ryan P. Slice sampling covariance hyperparameters in latent Gaussian models. In Advances in Neural Information Processing Systems 23, 2010.

Neal, R.M. Bayesian Learning for Neural Networks. Springer Verlag, 1996. ISBN 0387947248.

Rasmussen, Carl Edward. Evaluation of Gaussian Processes and Other Methods for Non-linear Regression. PhD thesis, University of Toronto, 1996.

Rasmussen, Carl Edward and Williams, Christopher K.I. Gaussian Processes for Machine Learning. The MIT Press, 2006.

Rosenblatt, F. Principles of Neurodynamics. Spartan Book, 1962.

Rumelhart, D.E., Hinton, G.E., and Williams, R.J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Saatchi, Yunus. Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge, 2011.

Salakhutdinov, R. and Hinton, G. Using deep belief nets to learn covariance kernels for Gaussian processes. Advances in Neural Information Processing Systems, 20:1249–1256, 2008.

Stein, M.L. Interpolation of Spatial Data: Some Theory for Kriging. Springer Verlag, 1999.

Steyvers, M., Griffiths, T.L., and Dennis, S. Probabilistic inference in human semantic memory. Trends in Cognitive Sciences, 10(7):327–334, 2006.

Tenenbaum, J.B., Kemp, C., Griffiths, T.L., and Goodman, N.D. How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279–1285, 2011.

Tipping, M. Bayesian inference: An introduction to principles and practice in machine learning. Advanced Lectures on Machine Learning, pp. 41–62, 2004.

Wilson, A.G. and Adams, R.P. Gaussian process kernels for pattern discovery and extrapolation: supplementary material and code, 2013. http://mlg.eng.cam.ac.uk/andrew/smkernelsupp.pdf.

Wilson, Andrew G., Knowles, David A., and Ghahramani, Zoubin. Gaussian process regression networks. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

Wilson, Andrew Gordon, Knowles, David A., and Ghahramani, Zoubin. Gaussian process regression networks. arXiv preprint arXiv:1110.4411, 2011.

Yuille, A. and Kersten, D. Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive Sciences, 10(7):301–308, 2006.