A Marginalized Particle Gaussian Process Regression

Yali Wang and Brahim Chaib-draa
Department of Computer Science
Laval University
Quebec, Quebec G1V0A6
fwang,[email protected]

Abstract

We present a novel marginalized particle Gaussian process (MPGP) regression, which provides a fast, accurate online Bayesian filtering framework to model the latent function. Using a state space model established by the data construction procedure, our MPGP recursively estimates the hidden function values with a Gaussian mixture. Meanwhile, it provides a new online method for training hyperparameters with a number of weighted particles. We demonstrate the estimation performance of our MPGP on both simulated and real large data sets. The results show that our MPGP is a robust estimation algorithm with high computational efficiency, which outperforms other state-of-the-art sparse GP methods.

1 Introduction

The Gaussian process (GP) is a popular nonparametric Bayesian method for nonlinear regression. However, the $O(n^3)$ computational cost of training the GP model severely limits its applicability in practice when the number of training points $n$ is larger than a few thousand [1]. A number of attempts have been made to reduce this cost. One typical method is the sparse pseudo-input Gaussian process (SPGP) [2], which uses a pseudo-input data set with $m$ inputs ($m \ll n$) to parameterize the GP predictive distribution and so reduce the computational burden. The sparse spectrum Gaussian process (SSGP) [3] was then proposed to further improve on the performance of SPGP while retaining its computational efficiency, by using a stationary trigonometric Bayesian model with $m$ basis functions. However, both SPGP and SSGP learn hyperparameters offline by maximizing the marginal likelihood before making inference, and therefore risk converging to a local optimum.
Another recent model is the Kalman filter Gaussian process (KFGP) [4], which reduces the computational load by correlating function values of data subsets at each Kalman filter iteration. However, it can still underfit or overfit if the hyperparameters are badly learned offline.

In contrast, we propose in this paper an online marginalized particle filter that simultaneously learns the hyperparameters and the hidden function values. By collecting small data subsets sequentially, we establish a novel state space model which allows us to estimate the marginal posterior distribution (not the marginal likelihood) of the hyperparameters online with a number of weighted particles. For each particle, a Kalman filter is applied to estimate the posterior distribution of the hidden function values. We explain this in detail later and show its validity through experiments on large data sets.

2 Data Construction

In practice, the whole training data set is usually constructed by gathering small subsets several times. For the $t$th collection, the training subset $(X_t, y_t)$ consists of $n_t$ input-output pairs: $\{(x_t^1, y_t^1), \cdots, (x_t^{n_t}, y_t^{n_t})\}$. Each scalar output $y_t^i$ is generated from a nonlinear function $f(x_t^i)$ of a $d$-dimensional input vector $x_t^i$ with additive Gaussian noise $N(0, a_0^2)$. All the pairs are organized separately as an input matrix $X_t$ and an output vector $y_t$. For simplicity, the whole training data with $T$ collections is denoted $(X_{1:T}, y_{1:T})$. The goal is a regression problem: estimating the function value of $f(x)$ at $m$ test inputs $X_\star = [x_\star^1, \cdots, x_\star^m]$ given $(X_{1:T}, y_{1:T})$.

3 Gaussian Process Regression

A Gaussian process (GP) represents a distribution over functions, which is a generalization of the Gaussian distribution to an infinite dimensional function space. Formally, it is a collection of random variables, any finite number of which have a joint Gaussian distribution [1].
Similar to a Gaussian distribution specified by a mean vector and covariance matrix, a GP is fully defined by a mean function $m(x) = E[f(x)]$ and a covariance function $k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]$. Here we follow the practical choice of setting $m(x)$ to zero. Moreover, because of spatially nonstationary phenomena in the real world, we choose $k(x, x')$ as $k_{SE}(x, x') + k_{NN}(x, x')$, where

$$k_{SE} = a_1^2 \exp\left[-0.5\, a_2^{-2} (x - x')^T (x - x')\right]$$

is the stationary squared exponential covariance function and

$$k_{NN} = a_3^2 \sin^{-1}\left[a_4^{-2}\, \tilde{x}^T \tilde{x}' \left((1 + a_4^{-2}\, \tilde{x}^T \tilde{x})(1 + a_4^{-2}\, \tilde{x}'^T \tilde{x}')\right)^{-0.5}\right]$$

is the nonstationary neural network covariance function with the augmented input $\tilde{x} = [1 \; x^T]^T$. For simplicity, all the hyperparameters are collected into a vector $\theta = [a_0 \; a_1 \; a_2 \; a_3 \; a_4]^T$.

The regression problem can be solved by the standard GP in the following two steps.

First, learning $\theta$ given $(X_{1:T}, y_{1:T})$. One technique is to draw samples from $p(\theta | X_{1:T}, y_{1:T})$ using Markov Chain Monte Carlo (MCMC) [5, 6]; another popular way is to maximize the log of the evidence $p(y_{1:T} | X_{1:T}, \theta)$ via a gradient based optimizer [1].

Second, estimating the distribution of the function value $p(f(X_\star) | X_{1:T}, y_{1:T}, X_\star, \theta)$. From the perspective of GP, a function $f(x)$ can be loosely considered as an infinitely long vector in which each random variable is the function value at an input $x$, and any finite set of function values is jointly Gaussian distributed. Hence, the joint distribution $p(y_{1:T}, f(X_\star) | X_{1:T}, X_\star, \theta)$ is a multivariate Gaussian distribution. Then, by the conditional property of the Gaussian distribution, $p(f(X_\star) | X_{1:T}, y_{1:T}, X_\star, \theta)$ is also Gaussian distributed, with the following mean vector $\bar{f}(X_\star)$ and covariance matrix $P(X_\star, X_\star)$ [1, 7]:

$$\bar{f}(X_\star) = K_\theta(X_\star, X_{1:T}) \left[K_\theta(X_{1:T}, X_{1:T}) + a_0^2 I\right]^{-1} y_{1:T}$$

$$P(X_\star, X_\star) = K_\theta(X_\star, X_\star) - K_\theta(X_\star, X_{1:T}) \left[K_\theta(X_{1:T}, X_{1:T}) + a_0^2 I\right]^{-1} K_\theta(X_\star, X_{1:T})^T$$

If there are $n$ training inputs and $m$ test inputs, then $K_\theta(X_\star, X_{1:T})$ denotes an $m \times n$ covariance matrix in which each entry is calculated by the covariance function $k(x, x')$ with the learned $\theta$. The matrices $K_\theta(X_{1:T}, X_{1:T})$ and $K_\theta(X_\star, X_\star)$ are constructed in the same way.

4 Marginalized Particle Gaussian Process Regression

Even though GP is an elegant nonparametric method for Bayesian regression, it is commonly infeasible for large data sets because learning the model scales as $O(n^3)$. To derive a computationally tractable GP model which preserves the estimation accuracy, we first build a state space model from the data construction procedure, then propose a marginalized particle filter to estimate the hidden $f(X_\star)$ and $\theta$ in an online Bayesian filtering framework.

4.1 State Space Model

The standard state space model (SSM) consists of the state equation and the observation equation. The state equation reflects the Markovian evolution of the hidden states (the hyperparameters and function values). For the hidden static hyperparameter $\theta$, a popular method in filtering techniques is to add an artificial evolution using kernel smoothing, which guarantees the estimation convergence [8, 9, 10]:

$$\theta_t = b\,\theta_{t-1} + (1 - b)\,\bar{\theta}_{t-1} + s_{t-1} \tag{1}$$

where $b = (3\delta - 1)/(2\delta)$, $\delta$ is a discount factor which is typically around 0.95-0.99, $\bar{\theta}_{t-1}$ is the Monte Carlo mean of $\theta$ at $t - 1$, $s_{t-1} \sim N(0, r^2 \Sigma_{t-1})$ with $r^2 = 1 - b^2$, and $\Sigma_{t-1}$ is the Monte Carlo variance matrix of $\theta$ at $t - 1$.

For the hidden function values, we explore the relation between the $(t-1)$th and $t$th data subsets. For simplicity, we denote $X_t^c = X_t \cup X_\star$ and $f_t^c = f(X_t^c)$.
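Returning briefly to the artificial evolution (1), one kernel smoothing step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `evolve_hyperparameters`, the array layout (one particle per row), and the use of weighted sample moments for the Monte Carlo mean and variance matrix are all assumptions made here.

```python
import numpy as np

def evolve_hyperparameters(theta, weights, delta=0.98, rng=None):
    """One step of theta_t = b*theta_{t-1} + (1-b)*mean_{t-1} + s_{t-1}.

    theta   : (N, p) array, one hyperparameter vector per particle (assumed layout)
    weights : (N,) array of normalized particle weights
    delta   : discount factor, typically around 0.95-0.99
    """
    rng = np.random.default_rng() if rng is None else rng
    b = (3.0 * delta - 1.0) / (2.0 * delta)
    r2 = 1.0 - b ** 2

    # Weighted Monte Carlo mean and variance matrix of theta at t-1.
    mean = weights @ theta
    centered = theta - mean
    cov = (weights[:, None] * centered).T @ centered

    # s_{t-1} ~ N(0, r^2 * Sigma_{t-1}), drawn independently for each particle.
    noise = rng.multivariate_normal(
        np.zeros(theta.shape[1]), r2 * cov, size=theta.shape[0]
    )
    return b * theta + (1.0 - b) * mean + noise
```

Each particle is shrunk toward the weighted mean by the factor $b$ before the noise is added, and the noise variance $r^2 \Sigma_{t-1}$ with $r^2 = 1 - b^2$ compensates for that shrinkage.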
If $f(x) \sim GP(0, k(x, x'))$, then the prior distribution $p(f_t^c, f_{t-1}^c | X_{t-1}^c, X_t^c, \theta_t)$ is jointly Gaussian:

$$p(f_t^c, f_{t-1}^c | X_{t-1}^c, X_t^c, \theta_t) = N\left(0, \begin{bmatrix} K_{\theta_t}(X_t^c, X_t^c) & K_{\theta_t}(X_t^c, X_{t-1}^c) \\ K_{\theta_t}(X_t^c, X_{t-1}^c)^T & K_{\theta_t}(X_{t-1}^c, X_{t-1}^c) \end{bmatrix}\right)$$

Then, by the conditional property of the Gaussian distribution, we obtain

$$p(f_t^c | f_{t-1}^c, X_{t-1}^c, X_t^c, \theta_t) = N(G(\theta_t) f_{t-1}^c, Q(\theta_t)) \tag{2}$$

where

$$G(\theta_t) = K_{\theta_t}(X_t^c, X_{t-1}^c)\, K_{\theta_t}^{-1}(X_{t-1}^c, X_{t-1}^c) \tag{3}$$

$$Q(\theta_t) = K_{\theta_t}(X_t^c, X_t^c) - K_{\theta_t}(X_t^c, X_{t-1}^c)\, K_{\theta_t}^{-1}(X_{t-1}^c, X_{t-1}^c)\, K_{\theta_t}(X_t^c, X_{t-1}^c)^T \tag{4}$$

This conditional density (2) can be transformed into a linear equation of the function value with an additive Gaussian noise $v_t^f \sim N(0, Q(\theta_t))$:

$$f_t^c = G(\theta_t) f_{t-1}^c + v_t^f \tag{5}$$

Finally, the observation (output) equation can be obtained directly from the $t$th data collection:

$$y_t = H_t f_t^c + v_t^y \tag{6}$$

where $H_t = [I_{n_t} \; 0]$ is an index matrix that makes $H_t f_t^c = f(X_t)$, since $y_t$ is only obtained from the $t$th training inputs $X_t$. The noise $v_t^y \sim N(0, R(\theta_t))$ follows the noise model of Section 2, with $R(\theta_t) = a_{0,t}^2 I$. Note that $a_0$ is a fixed unknown hyperparameter. We use the symbol $a_{0,t}$ only for consistency with the artificial evolution of $\theta$. To sum up, our SSM is fully specified by (1), (5) and (6).

4.2 Bayesian Inference by Marginalized Particle Filter

In contrast to the GP regression with a two-step offline inference in Section 3, we propose an online filtering framework to simultaneously learn hyperparameters and estimate hidden function values.
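Before turning to the filter itself, the state space quantities of Section 4.1 can be made concrete. The following sketch is illustrative rather than the authors' code: it builds $G(\theta_t)$, $Q(\theta_t)$ and $H_t$ from (3), (4) and (6) for one fixed hyperparameter vector. For brevity it assumes only the squared exponential part of the covariance function, and the function names and the small `jitter` added to the diagonal for numerical stability are hypothetical choices that do not appear in the paper.

```python
import numpy as np

def k_se(A, B, a1, a2):
    """Squared exponential covariance a1^2 * exp(-0.5 * ||x - x'||^2 / a2^2)."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return a1 ** 2 * np.exp(-0.5 * np.maximum(sq, 0.0) / a2 ** 2)

def transition_matrices(Xc_prev, Xc_curr, a1, a2, jitter=1e-8):
    """Return G and Q of equations (3) and (4) for fixed hyperparameters."""
    # jitter is an assumed numerical safeguard, not part of equations (3)-(4).
    K_pp = k_se(Xc_prev, Xc_prev, a1, a2) + jitter * np.eye(len(Xc_prev))
    K_cp = k_se(Xc_curr, Xc_prev, a1, a2)
    K_cc = k_se(Xc_curr, Xc_curr, a1, a2)
    # G = K_cp @ inv(K_pp); a linear solve avoids forming the inverse
    # (valid because K_pp is symmetric).
    G = np.linalg.solve(K_pp, K_cp.T).T
    Q = K_cc - G @ K_cp.T
    return G, Q

def observation_matrix(n_t, m):
    """H_t = [I_{n_t} 0]: selects the n_t training entries of f_t^c."""
    return np.hstack([np.eye(n_t), np.zeros((n_t, m))])

# Usage: stack each training subset with the shared test inputs X_star.
X_star = np.array([[10.0], [11.0]])            # m = 2 test inputs
X_prev = np.array([[0.0], [1.0]])              # (t-1)th training inputs
X_curr = np.array([[2.0], [3.0], [4.0]])       # tth training inputs
Xc_prev = np.vstack([X_prev, X_star])
Xc_curr = np.vstack([X_curr, X_star])
G, Q = transition_matrices(Xc_prev, Xc_curr, a1=1.0, a2=0.5)
H = observation_matrix(len(X_curr), len(X_star))
```

Because the test inputs appear in both $X_{t-1}^c$ and $X_t^c$, the rows of $G$ that correspond to them reduce to an identity mapping and the matching block of $Q$ is close to zero, so the estimate of $f(X_\star)$ is carried from one step to the next.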
