Regression and Classification Using Gaussian Process Priors RADFORD M

BAYESIAN STATISTICS 6, pp. 000--000 J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith (Eds.) Oxford University Press, 1998 Regression and Classification Using Gaussian Process Priors RADFORD M. NEAL University of Toronto, CANADA SUMMARY Gaussian processes are a natural way of specifying prior distributions over functions of one or more input variables. When such a function defines the mean response in a regression model with Gaussian errors, inference can be done using matrix computations, which are feasible for datasets of up to about a thousand cases. The covariance function of the Gaussian process can be given a hierarchical prior, which allows the model to discover high-level properties of the data, such as which inputs are relevant to predicting the response. Inference for these covariance hyperparameters can be done using Markov chain sampling. Classification models can be defined using Gaussian processes for underlying latent values, which can also be sampled within the Markov chain. Gaussian processes are in my view the simplest and most obvious way of defining flexible Bayesian regression and classification models, but despite some past usage, they appear to have been rather neglected as a general-purpose technique. This may be partly due to a confusion between the properties of the function being modeled and the properties of the best predictor for this unknown function. Keywords: GAUSSIAN PROCESSES; NONPARAMETRIC MODELS; MARKOV CHAIN MONTE CARLO. In this paper, I hope to persuade you that Gaussian processes are a fruitful way of defining prior distributions for flexible regression and classification models in which the regression or class probability functions are not limited to simple parametric forms. The basic idea goes back many years in a regression context, but is nevertheless not widely appreciated. The use of general Gaussian process models for classification is more recent, and to my knowledge the work presented here is the first that implements an exact Bayesian approach. One attraction of Gaussian processes is the variety of covariance functions one can choose from, which lead to functions with different degrees of smoothness, or different sorts of additive structure. I will describe some of these possibilities, while also noting the limitations of Gaussian processes. I then discuss computations for Gaussian process models, starting with the basic matrix operations involved. For classification models, one must integrate over the ``latent values'' underlying each case. I show how this can be done using Markov chain Monte Carlo methods. In a full-fledged Bayesian treatment, one must also integrate over the posterior distribution for the hyperparameters of the covariance function, which can also be done using Markov chain sampling. I show how this all works for a synthetic three-way classification problem. The methods I describe are implemented in software that is available from my web page, at http://www.cs.utoronto.ca/radford/. Gaussian processes and related methods have been used in various contexts for many years. Despite this past usage, and despite the fundamental simplicity of the idea, Gaussian process models appear to have been little appreciated by most Bayesians. I speculate that this could be partly due to a confusion between the properties one expects of the true function being modeled and those of the best predictor for this unknown function. Clarity in this respect is particularly important when thinking of flexible models such as those based on Gaussian processes. 2 R. M. Neal 1. MODELS BASED ON GAUSSIAN PROCESS PRIORS (1) (1) (2) (2) (n) (n) x t x t x t Assume we have observed data for n cases, , in which (i) (i) (i) (i) x x x p i t p is the vector of ``inputs'' (predictors) for case and is the associated 1 (n+1) ``target'' (response). Our primary purpose is to predict the target, t , for a new case (n+1) where we have observed only the inputs, x . For a regression problem, the targets will K g be real-valued; for a classification problem, the targets will be from the set f . (i) It will sometimes be convenient to represent the distributions of the targets, t , in terms of (i) unobserved ``latent values'', y , associated with each case. Bayesian regression and classification models are usually formulated in terms of a prior distribution for a set of unknown model parameters, from which a posterior distribution for the parameters is derived. If our focus is on prediction for a future case, however, the final result is (n+1) a predictive distribution for a new target value, t , that is obtained by integrating over the unknown parameters. This predictive distribution could be expressed directly in terms of the (n+1) n inputs for the new case, x , and the inputs and targets for the observed cases, without any mention of the model parameters. Furthermore, rather than expressing our prior knowledge in terms of a prior for the parameters, we could instead simply specify a prior distribution for the targets in any set of cases. A predictive distribution for an unknown target can then be obtained by conditioning on the known targets. These operations are most easily carried out if all the distributions are Gaussian. Fortunately, Gaussian processes are flexible enough to represent a wide variety of interesting models, many of which would have an infinite number of parameters if formulated in more conventional fashion. 1.1 A Gaussian process for linear regression Before discussing such flexible models, however, it may help to see how the scheme works for a simple linear regression model, which can be written as p X (i) (i) (i) t x u u u=1 (i) i where is the Gaussian ``noise'' for case , assumed to be independent from case to case, 2 2 and to have mean zero and variance . For the moment, we will assume that is known, but that and the u are unknown. Let us give and the u independent Gaussian priors with means of zero and variances 2 2 (1) (2) x x and . For any set of cases with fixed inputs, , this prior distribution for u (1) (2) t parameters implies a prior distribution for the associated target values, t , which will be multivariate Gaussian, with mean zero, and with covariances given by p p h i X X (i) (j ) (i) (j ) (i) (j ) Cov t t E x x u u u u u=1 u=1 p X (j ) (i) 2 2 2 x x u u ij u u=1 i j where ij is one if and zero otherwise. This mean and covariance function define a ``Gaussian process'' giving a distribution over possible relationships between the inputs and the target. (One might wish to confine the term ``Gaussian process'' to distributions over functions Regression and classification using Gaussian process priors 3 (i) from the inputs to the target. The relationship above is not functional, since (due to noise) t (j ) (i) (j ) x x may differ from t even if is identical to , but the looser usage is convenient.) (1) (n) x n Suppose now that we know the inputs, x , for observed cases, as well as (n+1) x , the inputs in a case for which we wish to predict the target. We can use the covariance n function above to compute the n by covariance matrix of the associated targets, (1) (n) (n+1) t t t . Together with the assumption that the means are zero, these covariances define a Gaussian joint distribution for the targets in these cases. We can condition on the (n+1) (1) (n) t t known targets to obtain the predictive distribution for t given . Well-known results show that this predictive distribution is Gaussian, with mean and variance given by h i (n+1) (1) (n) T 1 E t j t t k C t h i (n+1) (1) (n) T 1 j t t C k t V k Var (1) (n) T n n t t t where C is the by covariance matrix of the observed targets, is the (n+1) t vector of known values for these targets, k is the vector of covariances between and the (n+1) (n+1) (n+1) V t Cov t t n known targets, and is the prior variance of (ie, ). 1.2 More general Gaussian processes The procedure just described is unnecessarily expensive for a typical linear regression model n with p . However, the Gaussian process procedure can handle more interesting models by simply using a different covariance function. For example, a regression model based on a class of smooth functions can be obtained using a covariance function of the form p X (j ) (i) 2 2 2 (i) (j ) 2 x x Cov t t exp u u ij u u=1 There are many other possibilities for the covariance function, some of which are discussed in Section 2. Our prior knowledge will usually not be sufficient to fix appropriate values for the u hyperparameters in the covariance function ( , , and the for the model above). We will therefore give prior distributions to the hyperparameters, and base predictions on a sample of values from their posterior distribution. Sampling from the posterior distribution requires computation of the log likelihood based on the n observed cases, which is n T 1 L log log det C t C t The derivatives of L can also be computed, if they are needed by the sampling method. 1.3 Gaussian process models for classification K g Models for classification problems, where the targets are from the set f , can be defined in terms of a Gaussian process model for ``latent values'' associated with each case.

Load more