Tutorial: Gaussian Process Models for Machine Learning

Ed Snelson ([email protected])
Gatsby Computational Neuroscience Unit, UCL
26th October 2006

Info

The GP book: Rasmussen and Williams, 2006
Basic GP (Matlab) code available: http://www.gaussianprocess.org/gpml/

Gaussian process history

Prediction with GPs:
• Time series: Wiener, Kolmogorov, 1940s
• Geostatistics: kriging, 1970s — naturally only two or three dimensional input spaces
• Spatial statistics in general: see Cressie [1993] for an overview
• General regression: O'Hagan [1978]
• Computer experiments (noise free): Sacks et al. [1989]
• Machine learning: Williams and Rasmussen [1996], Neal [1996]

Nonlinear regression

Consider the problem of nonlinear regression: you want to learn a function f with error bars from data D = {X, y}.

[Figure: example regression data, y plotted against x]

A Gaussian process is a prior over functions p(f) which can be used for Bayesian regression:

    p(f|D) = \frac{p(f)\, p(D|f)}{p(D)}

What is a Gaussian process?

• Continuous stochastic process — random functions — a set of random variables indexed by a continuous variable: f(x)
• Set of 'inputs' X = {x_1, x_2, ..., x_N}; corresponding set of random function variables f = {f_1, f_2, ..., f_N}
• GP: any set of function variables {f_n}_{n=1}^N has a joint (zero mean) Gaussian distribution:

    p(f|X) = \mathcal{N}(0, K)

• Conditional model — the density of the inputs is not modeled
• Consistency: p(f_1) = \int df_2\, p(f_1, f_2)

Covariances

Where does the covariance matrix K come from?
• The covariance matrix is constructed from a covariance function: K_{ij} = K(x_i, x_j)
• The covariance function characterizes correlations between different points in the process:

    K(x, x') = E[f(x)\, f(x')]

• It must produce positive semidefinite covariance matrices: v^\top K v \geq 0
• This ensures consistency

Squared exponential (SE) covariance

    K(x, x') = \sigma_0^2 \exp\!\left[ -\frac{1}{2} \left( \frac{x - x'}{\lambda} \right)^2 \right]

• Intuition: function variables close in input space are highly correlated, whilst those far away are uncorrelated
• λ, σ_0 — hyperparameters. λ: lengthscale, σ_0: amplitude
• Stationary: K(x, x') = K(x - x') — invariant to translations
• Very smooth sample functions — infinitely differentiable

Matérn class of covariances

    K(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\,|x - x'|}{\lambda} \right)^{\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\,|x - x'|}{\lambda} \right)

where K_\nu is a modified Bessel function.
• Stationary, isotropic
• ν → ∞: SE covariance
• Finite ν: much rougher sample functions
• ν = 1/2: K(x, x') = \exp(-|x - x'|/\lambda), the OU process, very rough sample functions
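To make the covariance material concrete, here is a minimal numpy sketch (not from the tutorial; the function name se_cov and the parameter values are my own choices) that builds an SE covariance matrix for one-dimensional inputs and draws sample functions from the zero-mean GP prior p(f|X) = N(0, K):

```python
import numpy as np

def se_cov(x1, x2, lengthscale=1.0, amplitude=1.0):
    """Squared exponential covariance K(x, x') = sigma_0^2 exp(-0.5 ((x - x') / lambda)^2)
    for one-dimensional inputs x1, x2 (1-D arrays)."""
    d = x1[:, None] - x2[None, :]
    return amplitude**2 * np.exp(-0.5 * (d / lengthscale)**2)

# Draw three sample functions from the zero-mean GP prior p(f | X) = N(0, K)
x = np.linspace(-5.0, 5.0, 200)
K = se_cov(x, x, lengthscale=1.0, amplitude=1.0)
# A small jitter on the diagonal keeps the Cholesky factorization numerically stable
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))
samples = L @ np.random.randn(len(x), 3)   # each column is one sample function
```

Shortening the lengthscale makes the drawn functions vary faster; swapping in a Matérn covariance with small ν would give visibly rougher samples, as the slides note.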
Nonstationary covariances

• Linear covariance: K(x, x') = \sigma_0^2 + x x'
• Brownian motion (Wiener process): K(x, x') = \min(x, x')
• Periodic covariance:

    K(x, x') = \exp\!\left( -\frac{2 \sin^2\!\big(\frac{x - x'}{2}\big)}{\lambda^2} \right)

• Neural network covariance

Constructing new covariances from old

There are several ways to combine covariances:
• Sum: K(x, x') = K_1(x, x') + K_2(x, x') — addition of independent processes
• Product: K(x, x') = K_1(x, x')\, K_2(x, x') — product of independent processes
• Convolution: K(x, x') = \int dz\, dz'\, h(x, z)\, K(z, z')\, h(x', z') — blurring of the process with a kernel h

Prediction with GPs

• We have seen examples of GPs with certain covariance functions
• General properties of covariances are controlled by a small number of hyperparameters
• Task: prediction from noisy data
• Use the GP as a Bayesian prior expressing beliefs about the underlying function we are modeling
• Link to the data via a noise model or likelihood

GP regression with Gaussian noise

Data are generated with Gaussian white noise around the function f:

    y = f + \varepsilon, \qquad E[\varepsilon(x)\,\varepsilon(x')] = \sigma^2 \delta(x - x')

Equivalently, the noise model, or likelihood, is:

    p(y|f) = \mathcal{N}(f, \sigma^2 I)

Integrating over the function variables gives the marginal likelihood:

    p(y) = \int df\, p(y|f)\, p(f) = \mathcal{N}(0, K + \sigma^2 I)

Prediction

N training input and output pairs (X, y), and T test inputs X_T. Consider the joint training and test marginal likelihood:

    p(y, y_T) = \mathcal{N}(0, K_{N+T} + \sigma^2 I), \qquad K_{N+T} = \begin{bmatrix} K_N & K_{NT} \\ K_{TN} & K_T \end{bmatrix}

Condition on the training outputs: p(y_T|y) = \mathcal{N}(\mu_T, \Sigma_T) with

    \mu_T = K_{TN} [K_N + \sigma^2 I]^{-1} y
    \Sigma_T = K_T - K_{TN} [K_N + \sigma^2 I]^{-1} K_{NT} + \sigma^2 I

This gives correlated predictions and defines a predictive Gaussian process.

Prediction

• Often only the marginal variances (diag Σ_T) are required — it is then sufficient to consider a single test input x_*:

    \mu_* = K_{*N} [K_N + \sigma^2 I]^{-1} y
    \sigma_*^2 = K_* - K_{*N} [K_N + \sigma^2 I]^{-1} K_{N*} + \sigma^2

• The mean predictor is a linear predictor: \mu_* = K_{*N}\, \alpha
• Inversion of K_N + \sigma^2 I costs O(N^3)
• Prediction cost per test case is O(N) for the mean and O(N^2) for the variance

Determination of hyperparameters

• Advantage of the probabilistic GP framework — the ability to choose hyperparameters and covariances directly from the training data
• Other models, e.g. SVMs, splines etc., require cross validation
• GP: minimize the negative log marginal likelihood L(θ) with respect to the hyperparameters and noise level θ:

    \mathcal{L} = -\log p(y|\theta) = \frac{1}{2} \log\det C(\theta) + \frac{1}{2} y^\top C^{-1}(\theta)\, y + \frac{N}{2} \log(2\pi), \qquad C = K + \sigma^2 I

• Uncertainty in the function variables f is taken into account

Gradient based optimization

• Minimization of L(θ) is a non-convex optimization task
• Standard gradient based techniques, such as CG or quasi-Newton
• Gradients:

    \frac{\partial \mathcal{L}}{\partial \theta_i} = \frac{1}{2} \mathrm{tr}\!\left( C^{-1} \frac{\partial C}{\partial \theta_i} \right) - \frac{1}{2} y^\top C^{-1} \frac{\partial C}{\partial \theta_i} C^{-1} y

• Local minima, but usually not much of a problem with few hyperparameters
• Use weighted sums of covariances and let ML choose

Automatic relevance determination¹

The ARD SE covariance function for multi-dimensional inputs:

    K(x, x') = \sigma_0^2 \exp\!\left[ -\frac{1}{2} \sum_{d=1}^{D} \left( \frac{x_d - x'_d}{\lambda_d} \right)^2 \right]

• Learn an individual lengthscale hyperparameter λ_d for each input dimension x_d
• λ_d determines the relevancy of input feature d to the regression
• If λ_d is very large, then the feature is irrelevant

¹ Neal, 1996. MacKay, 1998.
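As a companion to the prediction and marginal likelihood slides, the following numpy/scipy sketch implements the predictive equations and L(θ) for a generic covariance function cov (the helper names gp_predict and neg_log_marginal_likelihood are my own, not from the tutorial); the se_cov sketch shown earlier can be passed in as cov for one-dimensional inputs.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict(X, y, X_star, cov, noise_var):
    """GP regression with Gaussian noise:
    mu_*      = K_*N [K_N + sigma^2 I]^{-1} y
    sigma_*^2 = K_*  - K_*N [K_N + sigma^2 I]^{-1} K_N* + sigma^2"""
    K_N = cov(X, X) + noise_var * np.eye(len(X))
    K_sN = cov(X_star, X)
    c_and_low = cho_factor(K_N)            # O(N^3), done once for the training set
    alpha = cho_solve(c_and_low, y)        # alpha = [K_N + sigma^2 I]^{-1} y
    mean = K_sN @ alpha                    # O(N) per test case
    v = cho_solve(c_and_low, K_sN.T)       # O(N^2) per test case
    var = np.diag(cov(X_star, X_star)) - np.sum(K_sN * v.T, axis=1) + noise_var
    return mean, var

def neg_log_marginal_likelihood(X, y, cov, noise_var):
    """L(theta) = 0.5 log det C + 0.5 y^T C^{-1} y + 0.5 N log(2 pi), with C = K + sigma^2 I."""
    N = len(X)
    C = cov(X, X) + noise_var * np.eye(N)
    c_and_low = cho_factor(C)
    log_det = 2.0 * np.sum(np.log(np.diag(c_and_low[0])))
    return 0.5 * log_det + 0.5 * y @ cho_solve(c_and_low, y) + 0.5 * N * np.log(2.0 * np.pi)

# Example call (hypothetical data arrays):
# mean, var = gp_predict(x_train, y_train, x_test, se_cov, noise_var=0.1)
```

In practice L(θ) and the analytic gradient given on the gradient-based optimization slide would be handed to a CG or quasi-Newton optimizer; the gradient computation is omitted here for brevity.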
Relationship to generalized linear regression

• Weighted sum of a fixed finite set of M basis functions:

    f(x) = \sum_{m=1}^{M} w_m \phi_m(x) = w^\top \phi(x)

• Place a Gaussian prior on the weights: p(w) = \mathcal{N}(0, \Sigma_w)
• This defines a GP with a finite rank M (degenerate) covariance function:

    K(x, x') = E[f(x)\, f(x')] = \phi^\top(x)\, \Sigma_w\, \phi(x')

• Function space vs. weight space interpretations

Relationship to generalized linear regression

• General GP — we can specify the covariance function directly rather than via a set of basis functions
• Mercer's theorem: we can always decompose the covariance function into eigenfunctions and eigenvalues:

    K(x, x') = \sum_{i=1}^{\infty} \lambda_i \psi_i(x)\, \psi_i(x')

• If the sum is finite, we are back to linear regression. Often the sum is infinite, and there are no analytical expressions for the eigenfunctions
• Power of kernel methods in general (e.g. GPs, SVMs etc.) — project x ↦ ψ(x) into a high or infinite dimensional feature space and still handle the computations tractably

Relationship to neural networks

Neural net with one hidden layer of N_H units:

    f(x) = b + \sum_{j=1}^{N_H} v_j h(x; u_j)

h — a bounded hidden layer transfer function (e.g. h(x; u) = \mathrm{erf}(u^\top x))
• If the v's and b are zero mean and independent, and the weights u_j are iid, then the CLT implies NN → GP as N_H → ∞ [Neal, 1996]
• The NN covariance function depends on the transfer function h, but is in general non-stationary

Relationship to spline models

The univariate cubic spline has the cost functional:

    \sum_{n=1}^{N} (f(x_n) - y_n)^2 + \lambda \int_0^1 f''(x)^2\, dx

• Can give a probabilistic GP interpretation by considering the right-hand term as an (improper) GP prior
• Make it proper by a weak penalty on the zeroth and first derivatives
• Can derive a spline covariance function, and the full GP machinery can be applied to spline regression (uncertainties, ML hyperparameters)
• Penalties on derivatives — equivalent to specifying the inverse covariance function — no natural marginalization

Relationship to support vector regression

[Figure: −log p(y|f) error functions for GP regression and SV regression (ε-insensitive)]

• The ε-insensitive error function can be considered as a non-Gaussian likelihood or noise model. Integrating over f becomes intractable
• SV regression can be considered as the MAP solution f_MAP to a GP with an ε-insensitive error likelihood
• Advantages of SVR: naturally sparse solution by QP, robust to outliers. Disadvantages: uncertainties are not taken into account, no predictive variances, and no learning of hyperparameters by ML

Logistic and probit regression

• Binary classification task: y = ±1
• GLM likelihood: p(y = +1|x, w) = \pi(x) = \sigma(x^\top w)
• σ(z) — a sigmoid function such as the logistic or cumulative normal
• Weight space viewpoint: prior p(w) = \mathcal{N}(0, \Sigma_w)
• Function space viewpoint: let f(x) = w^\top x, then the likelihood is \pi(x) = \sigma(f(x)), with a Gaussian prior on f

GP classification

1. Place a GP prior directly on f(x)
2. Use a sigmoidal likelihood: p(y = +1|f) = \sigma(f)

Just as for SVR, the non-Gaussian likelihood makes integrating over f intractable:

    p(f_*|y) = \int df\, p(f_*|f)\, p(f|y), \qquad \text{where the posterior } p(f|y) \propto p(y|f)\, p(f)

Make this tractable by using a Gaussian approximation to the posterior. Then prediction:

    p(y_* = +1|y) = \int df_*\, \sigma(f_*)\, p(f_*|y)

GP classification

Two common ways to make a Gaussian approximation to the posterior:
1. Laplace approximation: a second order Taylor approximation about the mode of the posterior
2. Expectation propagation (EP)²: EP can be thought of as approximately minimizing KL[p(f|y)||q(f|y)] by an iterative procedure

• Kuss and Rasmussen [2005] evaluate both methods experimentally and find EP to be significantly superior
• Classification accuracy on digits data sets is comparable to SVMs
• Advantages: probabilistic predictions, hyperparameters by ML

² Minka, 2001
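To illustrate the Laplace route described above, here is a plain numpy sketch of binary GP classification with a logistic likelihood: Newton iterations to the posterior mode, a Gaussian approximation at the mode, and the final one-dimensional integral done by Gauss–Hermite quadrature. This is my own simplified version, not the numerically robust formulation in Rasmussen and Williams; all function names and the fixed iteration count are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_gp_classify(K, y, k_star, k_star_star, n_iter=30):
    """Laplace-approximation GP binary classification, logistic likelihood.
    K: N x N training covariance, y: labels in {-1, +1},
    k_star: covariances between one test input and the training inputs (length N),
    k_star_star: prior variance at the test input."""
    N = len(y)
    t = (y + 1) / 2.0                        # targets mapped to {0, 1}
    f = np.zeros(N)
    for _ in range(n_iter):                  # Newton iterations (fixed count, no convergence test)
        pi = sigmoid(f)
        grad = t - pi                        # d log p(y|f) / df
        W = pi * (1.0 - pi)                  # -d^2 log p(y|f) / df^2 (diagonal)
        b = W * f + grad
        A = np.eye(N) + W[:, None] * K       # (I + W K)
        f = K @ np.linalg.solve(A, b)        # f_new = K (I + W K)^{-1} (W f + grad)

    # Gaussian approximation q(f|y) = N(f_hat, (K^{-1} + W)^{-1}) at the mode f_hat
    pi = sigmoid(f)
    W = pi * (1.0 - pi)
    mu_star = k_star @ (t - pi)              # latent mean: at the mode, f_hat = K (t - pi)
    S = K + np.diag(1.0 / np.maximum(W, 1e-12))
    var_star = k_star_star - k_star @ np.linalg.solve(S, k_star)

    # p(y* = +1 | y) = int sigmoid(f*) N(f* | mu, var) df*, by Gauss-Hermite quadrature
    nodes, weights = np.polynomial.hermite.hermgauss(32)
    return np.sum(weights * sigmoid(mu_star + np.sqrt(2.0 * var_star) * nodes)) / np.sqrt(np.pi)
```

EP would replace this single Gaussian fit at the mode with an iteratively refined approximation, which, as the slides note, Kuss and Rasmussen [2005] found to give significantly better results.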
GP latent variable model (GPLVM)³

• Probabilistic model for dimensionality reduction: the data is a set of high dimensional vectors y_1, y_2, ..., y_N
• Model each dimension (k) of y as an independent GP with unknown low dimensional latent inputs x_1, x_2, ..., x_N:

    p(\{y_n\}|\{x_n\}) \propto \prod_k \exp\!\left( -\frac{1}{2} w_k^2\, Y_k^\top K^{-1} Y_k \right)

• Maximize the likelihood to find the latent projections {x_n} (not a true density model for y because we cannot tractably integrate over x)
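Below is a small sketch of the GPLVM objective as a function of the latent coordinates. It is an assumption-laden illustration, not the tutorial's code: the per-dimension weights w_k are set to 1, an SE covariance with added noise is assumed, and the function name is my own.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gplvm_neg_log_likelihood(X_latent, Y, lengthscale=1.0, amplitude=1.0, noise_var=1e-2):
    """Negative log-likelihood of the GPLVM: each column of Y is an independent GP
    over the shared (unknown) latent inputs.
    Y: N x D data matrix, X_latent: N x Q latent coordinates (Q << D)."""
    N, D = Y.shape
    sq = np.sum((X_latent[:, None, :] - X_latent[None, :, :])**2, axis=-1)
    K = amplitude**2 * np.exp(-0.5 * sq / lengthscale**2) + noise_var * np.eye(N)
    c_and_low = cho_factor(K)
    log_det = 2.0 * np.sum(np.log(np.diag(c_and_low[0])))
    quad = np.sum(Y * cho_solve(c_and_low, Y))   # sum_k Y_k^T K^{-1} Y_k = tr(K^{-1} Y Y^T)
    return 0.5 * D * log_det + 0.5 * quad + 0.5 * N * D * np.log(2.0 * np.pi)
```

Minimizing this quantity jointly over X_latent (and, if desired, the hyperparameters), for example with a generic gradient-based optimizer over the flattened latent coordinates, recovers the latent projections; as the slide notes, this is not a true density model for y because x cannot be integrated out tractably.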
