
Bayesian Linear Regression

UFC/DC, ATAI-I (CK0146), PR (TIP8311), 2016.2
Francesco Corona

In a maximum likelihood approach to setting the parameters of a linear model for regression, we tune the effective model complexity, the number of basis functions:

• We control it based on the size of the data set

Adding a regularisation term to the log likelihood function means that the effective model complexity can instead be controlled by the regularisation coefficient:

• The choice of the number and form of the basis functions is still important in determining the overall behaviour of the model

Bayesian linear regression (cont.)


This leaves the issue of setting the appropriate model complexity for the problem:

• It cannot be decided simply by maximising the likelihood function, as this always leads to excessively complex models and over-fitting

Remark: Independent hold-out data can be used to determine model complexity, but this can be both computationally expensive and wasteful of data.

We therefore turn to a Bayesian treatment of linear regression:

• It avoids the over-fitting problem of maximum likelihood
• It leads to automatic methods of setting model complexity

We again focus on the case of a single target variable t.

Outline


1 Bayesian linear regression
• Parameter distribution
• Predictive distribution
• Equivalent kernel

2 Gaussian processes
• Linear regression revisited
• Gaussian processes for regression
• Learning the hyper-parameters

Parameter distribution

The Bayesian treatment of linear regression starts by introducing a prior probability distribution over the model parameters w.¹

The likelihood function p(t|w) is the exponential of a quadratic function of w,

p(t|w) = ∏_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1})

The corresponding conjugate prior is thus a Gaussian distribution of the form

p(w) = N(w | m_0, S_0)    (1)

• with mean m_0 and covariance S_0

The posterior distribution is proportional to the product of the likelihood function and the prior.² Due to the choice of a conjugate Gaussian prior, the posterior is Gaussian too,

p(w|t) ∝ [ ∏_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1}) ] N(w | m_0, S_0)
       ∝ exp( -(β/2) (t - Φw)^T (t - Φw) ) exp( -(1/2) (w - m_0)^T S_0^{-1} (w - m_0) )

The posterior distribution can thus be written directly in the form³

p(w|t) = N(w | m_N, S_N)    (2)

m_N = S_N ( S_0^{-1} m_0 + β Φ^T t )    (3)

S_N^{-1} = S_0^{-1} + β Φ^T Φ    (4)
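As a minimal sketch of Eqs. (3)-(4) (assuming NumPy; the function name is illustrative, and `Phi` denotes the N × M design matrix with rows φ(x_n)^T):

```python
import numpy as np

def posterior_params(Phi, t, beta, m0, S0):
    """Posterior p(w|t) = N(w | m_N, S_N), Eqs. (2)-(4).
    Phi: (N, M) design matrix with rows phi(x_n)^T; t: (N,) targets."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + beta * Phi.T @ Phi          # Eq. (4)
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)    # Eq. (3)
    return mN, SN
```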

¹ There is also the noise precision parameter β; we first assume it is a known constant.
² We derived something similar when discussing Bayes' theorem for Gaussian variables.
³ This distribution is calculated by completing the square in the exponential and finding the normalisation coefficient using the result for a normalised Gaussian.

Parameter distribution (cont.)


Consider a simple form of the Gaussian prior: zero-mean and isotropic, characterised by a single precision parameter α,

p(w|α) = N(w | 0, α^{-1} I)    (5)

Because the posterior distribution is Gaussian, its mode and mean coincide:

• The maximum posterior weight vector is given by w_{MAP} = m_N

If we consider an infinitely broad prior S_0 = α^{-1} I with α → 0, the mean m_N of the posterior distribution reduces to the maximum likelihood value

w_{ML} = (Φ^T Φ)^{-1} Φ^T t

Similarly, if N = 0, then the posterior distribution reverts to the prior.

The corresponding posterior distribution over w is p(w|t) = N(w | m_N, S_N), with

m_N = β S_N Φ^T t    (6)

S_N^{-1} = α I + β Φ^T Φ    (7)

Parameter distribution (cont.)

The log of the posterior distribution is given by the sum of the log likelihood and the log of the prior. As a function of w, it takes the form

ln p(w|t) = -(β/2) ∑_{n=1}^{N} ( t_n - w^T φ(x_n) )^2 - (α/2) w^T w + const    (8)

Maximisation of this posterior distribution with respect to w is therefore equivalent to the minimisation of the sum-of-squares error function with the addition of a quadratic regularisation term,

(1/2) ∑_{n=1}^{N} ( t_n - w^T φ(x_n) )^2 + (λ/2) w^T w,  with λ = α/β

To illustrate Bayesian learning in a linear basis function model, together with the sequential update of a posterior distribution, we consider plain line fitting. Consider a single input variable x, a single target variable t and the linear model

y(x, w) = w_0 + w_1 x

We generate a synthetic set of data from the function f(x, a) = a_0 + a_1 x

• with a_0 = -0.3 and a_1 = 0.5

For a selection of input points x_n ∼ U(-1, +1), we first evaluate f(x_n, a) and then add Gaussian noise ε ∼ N(0, 0.2^2) to get the target values t_n.

The goal is to recover the values of a_0 and a_1 (through w_0 and w_1):

• under the assumption that the variance of the noise is known, β = 1/0.2^2 = 25
• we fix α = 2.0 in the Gaussian prior p(w|α) = N(w | 0, α^{-1} I)

A minimal sketch of this sequential update is given below.
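The following sketch of the line-fitting experiment assumes NumPy; it tracks the posterior in natural-parameter form, so each observation applies the sequential analogue of Eqs. (6)-(7) with the current posterior acting as the prior:

```python
import numpy as np

rng = np.random.default_rng(0)
a0, a1 = -0.3, 0.5          # true parameters of f(x, a) = a0 + a1*x
beta, alpha = 25.0, 2.0     # known noise precision and prior precision

# Prior p(w|alpha) = N(w | 0, alpha^{-1} I); track precision and precision*mean
SN_inv = alpha * np.eye(2)
b = np.zeros(2)

for n in range(20):
    x = rng.uniform(-1.0, 1.0)                # x_n ~ U(-1, +1)
    t = a0 + a1 * x + rng.normal(0.0, 0.2)    # add N(0, 0.2^2) noise
    phi = np.array([1.0, x])                  # phi(x) = (1, x)^T for y = w0 + w1*x
    SN_inv += beta * np.outer(phi, phi)       # sequential form of Eq. (7)
    b += beta * phi * t
    mN = np.linalg.solve(SN_inv, b)           # sequential form of Eq. (6)

print(mN)  # approaches (a0, a1) = (-0.3, 0.5) as observations accumulate
```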


The plain Gaussian is not the only available form of prior over the parameters; the Gaussian can be generalised to

p(w|α) = [ (q/2) (α/2)^{1/q} / Γ(1/q) ]^M exp( -(α/2) ∑_{j=0}^{M-1} |w_j|^q )    (9)

• It is not a conjugate prior to the likelihood function, unless q = 2

Finding the maximum of the posterior distribution over the parameters corresponds to the minimisation of a regularised error function

(1/2) ∑_{n=1}^{N} ( t_n - w^T φ(x_n) )^2 + (λ/2) ∑_{j=1}^{M} |w_j|^q

Because w is two-dimensional, we can plot the prior and posterior distributions directly.

Predictive distribution


In practice, we are not usually interested in the value of w itself:

• We want to make predictions of t for new values of x

This requires that we evaluate the predictive distribution defined by

p(t|t, α, β) = ∫ p(t|w, β) p(w|t, α, β) dw    (10)

where t is the vector of target values from the training set⁴

• The conditional distribution of the target is p(t|x, w, β); the conditioning on the input x may be omitted to simplify notation
• The posterior distribution of the weights is p(w|t, α, β)

⁴ We omit the corresponding input vectors X from the right-hand side of the conditioning to simplify the notation.

Predictive distribution (cont.)


Calculating the predictive distribution involves the convolution of Gaussians,

p(t|t, α, β) = ∫ p(t|w, β) p(w|t, α, β) dw

• The conditional distribution of the target: with y(x, w) = φ(x)^T w, we have p(t|w, β) = p(t|x, w, β) = N(t | y(x, w), β^{-1})
• The posterior distribution of the weights: p(w|t, α, β) = N(w | m_N, S_N), with m_N = S_N (S_0^{-1} m_0 + β Φ^T t) and S_N^{-1} = S_0^{-1} + β Φ^T Φ

The noise process and the distribution of w are independent Gaussians, so their variances are additive: the mean of the convolution is the sum of the means of the two Gaussians, and the covariance of the convolution is the sum of their covariances.

Using old results (Eq. 2.115, ⋆), the predictive distribution takes the form

p(t|x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x))    (11)

where the variance σ_N^2(x) of the predictive distribution is

σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)    (12)

• the first term, 1/β, represents the noise on the data
• the second term reflects the uncertainty associated with w

A sketch of this computation is given below.
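A minimal sketch of Eqs. (11)-(12) for a single test input (NumPy assumed; `phi_x` is the basis-function vector φ(x), and `mN`, `SN` come from the posterior computed earlier):

```python
import numpy as np

def predictive(phi_x, mN, SN, beta):
    """Predictive mean m_N^T phi(x) and variance sigma_N^2(x), Eqs. (11)-(12)."""
    mean = mN @ phi_x
    var = 1.0 / beta + phi_x @ SN @ phi_x   # noise term plus uncertainty in w
    return mean, var
```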

Predictive distribution (cont.)


As more points are observed, the posterior distribution becomes narrower:

• As a consequence, it can be shown that σ_{N+1}^2(x) ≤ σ_N^2(x)

In the limit N → ∞, the second term in σ_N^2(x) = 1/β + φ(x)^T S_N φ(x) goes to zero:

• The variance of the predictive distribution arises solely from the additive noise governed by the parameter β

Illustration of the predictive distribution for Bayesian linear regression (⋆):

• The sinusoidal data with additive Gaussian noise
• Model fitted to the data: a linear combination of 9 Gaussian basis functions
• Different datasets of different sizes: N = 1, N = 2, N = 4 and N = 25

The red curve (one per N) is the mean of the Gaussian predictive distribution; the red shaded region spans one standard deviation either side of the mean.

[Figure: predictive mean and one-standard-deviation band for N = 1, 2, 4, 25 data points; axes t versus x]

So far, we showed only the point-wise predictive variance as a function of x. In order to gain insight into the covariance between predictions at different values of x, we can draw samples from the posterior distribution over w:

• We have a probabilistic model and we can generate new data




The predictive uncertainty (the variance) depends on x: it is smallest in the neighbourhood of the data points, and it decreases as more points are observed.
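A sketch of this sampling procedure (NumPy assumed; `basis` is any callable returning the vector φ(x), e.g. Gaussian basis functions, and the function name is illustrative):

```python
import numpy as np

def sample_posterior_functions(xs, mN, SN, basis, n_samples=5, seed=1):
    """Draw w ~ N(m_N, S_N) and evaluate y(x, w) = w^T phi(x) on a grid,
    so that correlations between predictions at different x become visible."""
    rng = np.random.default_rng(seed)
    W = rng.multivariate_normal(mN, SN, size=n_samples)  # (n_samples, M)
    Phi_grid = np.array([basis(x) for x in xs])          # (len(xs), M)
    return W @ Phi_grid.T                                # one sampled curve per row
```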

Predictive distribution (cont.)

Plots of the functions y(x, w), with samples of w drawn from the posterior distribution:

[Figure: functions y(x, w) with w sampled from posteriors based on N = 1, 2, 4, 25 data points]

If both w and β are treated as unknowns, we can introduce a conjugate prior distribution p(w, β), which will be given by a Gaussian-gamma distribution:

• The resulting predictive distribution is a Student's t-distribution




Equivalent kernel

The posterior mean solution m_N = β S_N Φ^T t for the linear basis function model has an interesting interpretation that sets the stage for kernel methods.

Substituting m_N into y(x, w) = ∑_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x), we get a new expression for the predictive mean,

y(x, m_N) = m_N^T φ(x) = β φ(x)^T S_N Φ^T t = ∑_{n=1}^{N} β φ(x)^T S_N φ(x_n) t_n    (13)

where S_N^{-1} = S_0^{-1} + β Φ^T Φ:

• The mean of the predictive distribution at a point x is a linear combination of the training set target variables t_n

y(x, m_N) = ∑_{n=1}^{N} [ β φ(x)^T S_N φ(x_n) ] t_n,  where the bracketed factor is k(x, x_n)

Equivalent kernel (cont.)

The function

k(x, x') = β φ(x)^T S_N φ(x')    (14)

is known as the smoother matrix or the equivalent kernel. It can be plotted as a function of x' for different values of x.

Regression functions that make predictions by taking linear combinations of the target values t_n in the training set are known as linear smoothers.

y(x, m_N) = ∑_{n=1}^{N} k(x, x_n) t_n    (15)

The dependence on the input values x_n in the training set enters through S_N.

The equivalent kernel is localised around x, so the mean y(x, m_N) of the predictive distribution at x is a weighted combination of the target values in which points close to x are given higher weight: intuitively, local evidence is weighted more strongly than distant evidence.
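A minimal sketch of Eqs. (14)-(15) (NumPy assumed; `basis` again denotes a callable returning φ(x), and both function names are illustrative):

```python
import numpy as np

def equivalent_kernel(x, X_train, SN, beta, basis):
    """Weights k(x, x_n) = beta * phi(x)^T S_N phi(x_n), Eq. (14)."""
    phi_x = basis(x)
    return np.array([beta * phi_x @ SN @ basis(xn) for xn in X_train])

def smoother_mean(x, X_train, t, SN, beta, basis):
    """Predictive mean as a linear smoother, Eq. (15): sum_n k(x, x_n) * t_n."""
    return equivalent_kernel(x, X_train, SN, beta, basis) @ t
```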

Equivalent kernel (cont.)

Examples of equivalent kernels k(x, x') for x = 0, plotted as a function of x':

• Polynomial basis functions (left) and sigmoidal basis functions (right)

[Figure: equivalent kernels for polynomial and sigmoidal basis functions]

Further insight into the role of the equivalent kernel can be obtained by considering the covariance between y(x) and y(x'), which is given by⁵

cov[y(x), y(x')] = cov[φ(x)^T w, w^T φ(x')] = φ(x)^T S_N φ(x') = β^{-1} k(x, x')    (16)

• From the form of the equivalent kernel, we see that the predictive mean at nearby points will be highly correlated, whereas for more distant pairs of points the correlation will be smaller


k is a localised function of x', even though the corresponding basis functions are not.

⁵ We used p(w|t) = N(w | m_N, S_N) and k(x, x') = β φ(x)^T S_N φ(x').

Equivalent kernel (cont.)

The effective kernel defines the weights by which the training set target values are combined in order to make a prediction at a new value of x.

It can be shown that these weights sum to one; in other words,

∑_{n=1}^{N} k(x, x_n) = 1,  for all x    (17)

The formulation of linear regression in terms of a kernel function suggests an alternative approach to regression: instead of introducing a set of basis functions, which implicitly determines an equivalent kernel, we can define a localised kernel directly and use it to make predictions for new input vectors x, given the observed training set.

It can also be shown that the kernel function can be written as an inner product,

k(x, z) = ψ(x)^T ψ(z)    (18)

with respect to the vector ψ(x) of a set of nonlinear functions, with

ψ(x) = β^{1/2} S_N^{1/2} φ(x)

Remark: This leads to a practical framework for regression with Gaussian processes.

Gaussian processes


We previously applied the concept of duality to a non-probabilistic model for regression:

• We now extend the role of kernels to probabilistic discriminative models

We considered linear regression models of the form y(x, w) = w^T φ(x):

• w is a vector of parameters, and φ(x) is a vector of fixed nonlinear basis functions that depend on the input vector x
• We showed that a prior distribution over w induces a corresponding prior distribution over functions y(x, w)

Given a training data set, we evaluated the posterior distribution over w:

• This yields a corresponding posterior distribution over regression functions
• With noise, it implies a predictive distribution p(t|x) for new inputs x

Gaussian processes (cont.)



In the Gaussian process viewpoint, we dispense with the parametric model and instead define a distribution over functions directly.

It is difficult to work with a distribution over the infinite space of functions:

• For a finite training set, we only need to consider the values of the function at the discrete set of input values x_n corresponding to the training set and test set data points, and so in practice we can work in a finite space

Linear regression revisited


To illustrate the Gaussian process viewpoint, we consider linear regression:

• We re-derive the predictive distribution
• In terms of distributions over functions y(x, w)

Consider a prior distribution over w given by an isotropic Gaussian of the form

p(w) = N(w | 0, α^{-1} I)    (20)

governed by the hyperparameter α, the precision (inverse variance) of the distribution:

• For any given value of w, y(x) = w^T φ(x) defines a function of x
• The probability distribution over w thus induces a probability distribution over functions y(x)

Consider a model defined in terms of a linear combination of M fixed basis functions, given by the elements of the vector φ(x) = (φ_0(x), ..., φ_{M-1}(x))^T,

y(x) = w^T φ(x)    (19)

where x is the input vector and w is the M-dimensional weight vector. We wish to evaluate this function at specific values of x, say at the training data points

x_1, ..., x_N

Linear regression revisited (cont.)


We are interested in the joint distribution of the function values y(x_1), ..., y(x_N), which we denote by the vector y with elements y_n = y(x_n), for n = 1, ..., N.

We find the probability distribution of y by noting that y is a linear combination of Gaussian distributed variables, the elements of w, and is thus itself Gaussian.

From y(x) = w^T φ(x), this vector is given by

y = Φw    (21)

We need to find its mean and covariance:

E[y] = Φ E[w] = 0    (22)

cov[y] = E[y y^T] = Φ E[w w^T] Φ^T = (1/α) Φ Φ^T = K    (23)

where Φ is the design matrix, with elements Φ_{nk} = φ_k(x_n),

Φ = [ φ_0(x_1)  φ_1(x_1)  ···  φ_{M-1}(x_1)
      φ_0(x_2)  φ_1(x_2)  ···  φ_{M-1}(x_2)
         ⋮          ⋮       ⋱        ⋮
      φ_0(x_N)  φ_1(x_N)  ···  φ_{M-1}(x_N) ]

and K is the Gram matrix with elements

k_{nm} = k(x_n, x_m) = (1/α) φ(x_n)^T φ(x_m)    (24)

This model provides us only with a particular example of a Gaussian process.
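A sketch of Eqs. (21)-(24) (NumPy assumed; the quadratic basis is an illustrative choice, and the small jitter term is an assumption added to guard against numerical issues when sampling from the rank-deficient K):

```python
import numpy as np

def gram_matrix(X, alpha, basis):
    """K with elements k(x_n, x_m) = (1/alpha) phi(x_n)^T phi(x_m), Eq. (24)."""
    Phi = np.array([basis(x) for x in X])   # design matrix, Phi_nk = phi_k(x_n)
    return Phi @ Phi.T / alpha              # Eq. (23)

# Sampling y ~ N(0, K) draws a function from the induced prior over functions
X = np.linspace(-1.0, 1.0, 50)
basis = lambda x: np.array([1.0, x, x**2])  # illustrative basis choice
K = gram_matrix(X, alpha=2.0, basis=basis)
y = np.random.default_rng(0).multivariate_normal(
    np.zeros(len(X)), K + 1e-10 * np.eye(len(X)))  # jitter for stability
```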


Linear regression revisited (cont.)

A key point about Gaussian processes is that the joint distribution over the N variables y_1, ..., y_N is specified completely by second-order statistics:

• Mean: In most applications we will not have any prior knowledge about the mean of y(x), and so by symmetry we take it to be zero. This is equivalent to choosing the mean of the prior over weight values p(w|α) to be zero in the basis function viewpoint.

Remark: A Gaussian process is a probability distribution over functions y(x) such that the set of values of y(x) evaluated at an arbitrary set of points {x_n} jointly have a Gaussian distribution.

• Covariance: The specification of the Gaussian process is completed by giving the covariance of y(x) evaluated at any two values of x

This is given by the kernel function, E[y(x_n) y(x_m)] = k(x_n, x_m)

Linear regression revisited (cont.)


We can also define the kernel function directly, rather than indirectly through a choice of basis functions:

• By-pass the choice of basis functions
• Draw samples of functions from the GP

For the specific case of a Gaussian process defined by the linear regression model y(x) = w^T φ(x), with weight prior p(w|α) = N(w | 0, α^{-1} I),

• the kernel function is given by k_{nm} = k(x_n, x_m) = (1/α) φ(x_n)^T φ(x_m)

[Figure: sample functions drawn from two Gaussian process priors]


Gaussian processes for regression


In order to apply Gaussian process models to the problem of regression, we need to take account of the noise on the observed target values,

t_n = y_n + ε_n    (25)

where y_n = y(x_n) and ε_n is a random noise variable whose value is chosen independently for each observation n.

We consider noise processes that have a Gaussian distribution, so that

p(t_n | y_n) = N(t_n | y_n, β^{-1})    (26)

where β is a hyperparameter representing the precision of the noise

Gaussian processes for regression (cont.)


Because the noise is independent for each point, the joint distribution of the target values t = (t_1, ..., t_N)^T, conditioned on the values of y = (y_1, ..., y_N)^T, is given by an isotropic Gaussian of the form

p(t|y) = N(t | y, β^{-1} I_N)    (27)

From the definition of a Gaussian process, the marginal distribution p(y) is given by a Gaussian whose mean is zero and whose covariance is a Gram matrix K,

p(y) = N(y | 0, K)    (28)

In order to find the marginal distribution p(t), conditioned on the input values x_1, ..., x_N, we need to integrate p(t|y) over y:

p(t) = ∫ p(t|y) p(y) dy = N(t | 0, C)    (29)

where the covariance matrix C has elements

C(x_n, x_m) = k(x_n, x_m) + β^{-1} δ_{nm}    (30)

• δ_{nm} is the Kronecker delta (1 if n = m, 0 otherwise)

The kernel function that determines K can be chosen to express the property that, for points x_n and x_m that are similar, the corresponding values y(x_n) and y(x_m) will be more strongly correlated than for dissimilar points.

The covariance matrix C reflects the fact that the two Gaussian sources of randomness, one associated with y(x) and one with ε, are independent:

• their covariances (K and β^{-1} I) simply add, as in the sketch below
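A minimal sketch of the marginal covariance of Eqs. (29)-(30) (NumPy assumed; `kernel` stands for any valid kernel function of two inputs):

```python
import numpy as np

def marginal_covariance(X, kernel, beta):
    """C with elements C(x_n, x_m) = k(x_n, x_m) + beta^{-1} delta_nm, Eq. (30)."""
    K = np.array([[kernel(xn, xm) for xm in X] for xn in X])
    return K + np.eye(len(X)) / beta   # the two covariances simply add
```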

Gaussian processes for regression (cont.)

From a GP prior with a given covariance function k(x_n, x_m), we can sample functions.

One widely used kernel function for Gaussian processes is the exponential of a quadratic form, with the addition of constant and linear terms, to give

k(x_n, x_m) = θ_0 exp( -(θ_1/2) ||x_n - x_m||^2 ) + θ_2 + θ_3 x_n^T x_m    (31)

The term involving θ_3 corresponds to a parametric model that is a linear function of the input variables.

[Figure: samples from the GP prior with kernel (31), one panel per setting of (θ_0, θ_1, θ_2, θ_3): (1.00, 4.00, 0.00, 0.00), (9.00, 4.00, 0.00, 0.00), (1.00, 64.00, 0.00, 0.00), (1.00, 0.25, 0.00, 0.00), (1.00, 4.00, 10.00, 0.00), (1.00, 4.00, 0.00, 5.00)]
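A sketch of the kernel of Eq. (31) (NumPy assumed; the default θ values are just one of the settings shown in the figure):

```python
import numpy as np

def exp_quad_kernel(xn, xm, theta=(1.0, 4.0, 0.0, 0.0)):
    """Eq. (31): theta0 exp(-theta1/2 ||xn - xm||^2) + theta2 + theta3 xn^T xm."""
    t0, t1, t2, t3 = theta
    xn, xm = np.atleast_1d(xn), np.atleast_1d(xm)
    return t0 * np.exp(-0.5 * t1 * np.sum((xn - xm) ** 2)) + t2 + t3 * (xn @ xm)
```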




Gaussian processes for regression (cont.)


A sampled function (blue line), drawn from the Gaussian process prior over functions, is evaluated at a set of points {x_n} to give the points {y_n} (red dots). The corresponding values of {t_n} (green dots) are obtained by adding independent Gaussian noise to each point in {y_n}.

[Figure: a sample from the GP prior with noise-free values {y_n} and noisy targets {t_n}]

Our goal in regression is to make predictions of target variables for new inputs, given a set of training data.

Let us suppose that t_N = (t_1, ..., t_N)^T, corresponding to the input values x_1, ..., x_N, comprises the observed training data set, and that our goal is to predict the variable t_{N+1} for a new input vector x_{N+1}:

• This requires that we evaluate the predictive distribution p(t_{N+1}|t_N)
• This distribution is conditioned also on the variables x_1, ..., x_N and x_{N+1}

Gaussian processes for regression (cont.)

To find the conditional distribution p(t_{N+1}|t_N), we begin by writing down the joint distribution p(t_{N+1}), where t_{N+1} is the vector t_{N+1} = (t_1, ..., t_N, t_{N+1})^T. We then apply results for the Gaussian distribution to obtain the conditional.

From p(t) = ∫ p(t|y) p(y) dy, the joint distribution over t_1, ..., t_{N+1} is

p(t_{N+1}) = N(t_{N+1} | 0, C_{N+1})    (32)

where C_{N+1} is an (N+1) × (N+1) covariance matrix with elements given by

C(x_n, x_m) = k(x_n, x_m) + β^{-1} δ_{nm}

Because this joint distribution is Gaussian, we can apply the results for the Gaussian distribution to characterise the conditional distribution.

[Figure: contours of the joint distribution p(t_1, t_2) for one training point t_1 and one test point t_2; conditioning on (fixing) the value of t_1 yields p(t_2|t_1)]


The conditional distribution p(t_{N+1}|t) will also be a Gaussian distribution.

Gaussian processes for regression (cont.)

Remember that when two sets of variables are jointly Gaussian, the conditional distribution of one set, conditioned on the other, is also Gaussian.

For an arbitrary vector x with Gaussian distribution N(x | µ, Σ):

• We first partition x into two disjoint subsets x_a and x_b,

x = [ x_a ; x_b ]

• We then partition the mean vector µ and the covariance matrix Σ accordingly,

µ = [ µ_a ; µ_b ],   Σ = [ Σ_aa  Σ_ab ; Σ_ba  Σ_bb ]

From these, we obtained the expressions for the mean and covariance of the conditional distribution p(x_a|x_b) in terms of the partitioned covariance matrix:

µ_{a|b} = µ_a + Σ_ab Σ_bb^{-1} (x_b - µ_b)

Σ_{a|b} = Σ_aa - Σ_ab Σ_bb^{-1} Σ_ba

We therefore partition the covariance matrix of the joint distribution,

C_{N+1} = [ C_N  k ; k^T  c ]    (33)

where:

• C_N is the N × N covariance matrix with elements C(x_n, x_m) = k(x_n, x_m) + β^{-1} δ_{nm}, for n, m = 1, ..., N
• the vector k has elements k(x_n, x_{N+1}), for n = 1, ..., N
• the scalar c = k(x_{N+1}, x_{N+1}) + β^{-1}

A sketch of this conditioning step is given below.
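A minimal sketch of the partitioned-Gaussian formulas (NumPy assumed; the index arrays selecting the a- and b-blocks are illustrative, and a linear solve replaces the explicit inverse of Σ_bb):

```python
import numpy as np

def conditional_gaussian(mu, Sigma, idx_a, idx_b, xb):
    """Mean and covariance of p(x_a | x_b) from the partitioned joint Gaussian."""
    mu_a, mu_b = mu[idx_a], mu[idx_b]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    mu_cond = mu_a + S_ab @ np.linalg.solve(S_bb, xb - mu_b)
    S_cond = (Sigma[np.ix_(idx_a, idx_a)]
              - S_ab @ np.linalg.solve(S_bb, Sigma[np.ix_(idx_b, idx_a)]))
    return mu_cond, S_cond
```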


Gaussian processes for regression (cont.)

Using the expressions for the conditional Gaussian distribution, the conditional p(t_{N+1}|t) yields

m(x_{N+1}) = k^T C_N^{-1} t    (34)

σ^2(x_{N+1}) = c - k^T C_N^{-1} k    (35)

Illustration of Gaussian process regression applied to the sinusoidal data set (the three right-most data points were omitted):

• Green curve: the sine function from which the data points (blue) are obtained by sampling and adding Gaussian noise

• Red line: the mean of the Gaussian process predictive distribution, with a shaded region spanning ± two standard deviations

[Figure: GP regression fit to the sinusoidal data over the interval (0, 1)]

Because the vector k is a function of the test-point input value x_{N+1}, the predictive distribution is a Gaussian whose mean and variance both depend on x_{N+1}.

Note how the uncertainty increases in the region to the right of the data points
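A sketch of GP prediction via Eqs. (34)-(35) for a single new input (NumPy assumed; `kernel` as above, with a linear solve replacing the explicit inverse of C_N):

```python
import numpy as np

def gp_predict(x_new, X, t, kernel, beta):
    """Predictive mean (34) and variance (35) at a single new input."""
    C_N = (np.array([[kernel(a, b) for b in X] for a in X])
           + np.eye(len(X)) / beta)                  # Eq. (30)
    k = np.array([kernel(xn, x_new) for xn in X])    # vector k
    c = kernel(x_new, x_new) + 1.0 / beta            # scalar c
    mean = k @ np.linalg.solve(C_N, t)               # Eq. (34)
    var = c - k @ np.linalg.solve(C_N, k)            # Eq. (35)
    return mean, var
```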

Gaussian processes for regression (cont.)


The only restriction on the kernel function is that the covariance matrix C(x_n, x_m) = k(x_n, x_m) + β^{-1} δ_{nm} must be positive definite:

• If λ_i is an eigenvalue of K, then the associated eigenvalue of C will be λ_i + β^{-1}

It is therefore sufficient that the kernel matrix k(x_n, x_m) be positive semidefinite for any pair of points x_n and x_m, so that λ_i ≥ 0:

• any eigenvalue λ_i that is zero will still give rise to a positive eigenvalue for C, because β > 0

The mean m(x_{N+1}) = k^T C_N^{-1} t of the predictive distribution is a function of x_{N+1},

m(x_{N+1}) = ∑_{n=1}^{N} a_n k(x_n, x_{N+1})    (36)

where a_n is the n-th component of C_N^{-1} t.

For a kernel function k(x_n, x_m) that depends only on the distance ||x_n - x_m||, we obtain an expansion in radial basis functions.

This is the same restriction on the kernel function discussed earlier:

• We can exploit all of the available techniques to construct suitable kernels

UFC/DC UFC/DC ATAI-I (CK0146) ATAI-I (CK0146) PR (TIP8311) PR (TIP8311) 2016.2 2016.2

Bayesian linear Bayesian linear The central computational operation in GP is the inversion ofaN × N matrix regression regression • Standard methods require O(N3)computations Parameter distribution m(x )=kT C−1t Parameter distribution Predictive distribution N+1 N Predictive distribution Equivalent kernel 2 T −1 Equivalent kernel In the basis function model, we have to invert a matrix SN of size M × M σ (xN+1)=c − k CN t Gaussian processes Gaussian processes • O(M3)computationalcomplexity Linear regression revisited The results above define the predictive distribution for Gaussian Linear regression revisited Gaussian processes for Gaussian processes for regression process for regression with an arbitrary kernel function k(xn , xm) regression For both, matrix inversion must be performed once for the given training set Learning the Learning the hyper-parameters hyper-parameters

In the particular case in which the kernel function k(x_n, x_m) is defined in terms of a finite set of basis functions, we can derive the linear regression results from a Gaussian process viewpoint (⋆).

For such models, we can therefore obtain the predictive distribution either:

• by taking a parameter space viewpoint and using the linear regression results, or
• by taking a function space viewpoint and using the GP model results

Remark: For each new test point, both require a vector-matrix multiplication, which has cost O(N^2) for Gaussian process models and O(M^2) for linear basis function models. If the number M of basis functions is smaller than the number N of points, it is computationally more efficient to work in the basis function framework.

Learning the hyper-parameters


The predictions of a GP model depend in part on the choice of covariance function.

In practice, rather than fixing the covariance function, we may prefer to use a parametric family of functions and then infer the parameter values from the data.

These parameters govern such things as the length scale of the correlations and the precision of the noise; they correspond to the hyper-parameters in a standard parametric model.

Learning the hyper-parameters (cont.)


Techniques for learning the hyper-parameters are based on the evaluation of the likelihood function p(t|Θ), where Θ denotes the hyper-parameters of the GP.

The simplest approach is to make a point estimate of Θ by maximising the log likelihood function (e.g., by gradient-based optimisation algorithms such as conjugate gradients).

The log likelihood function for a Gaussian process regression model is evaluated using the standard form for a multivariate Gaussian distribution,

ln p(t|Θ) = -(1/2) ln|C_N| - (1/2) t^T C_N^{-1} t - (N/2) ln(2π)    (37)

For nonlinear optimisation, we also need the gradient of the log likelihood function with respect to the parameter vector Θ,

∂ ln p(t|Θ)/∂θ_i = -(1/2) Tr( C_N^{-1} ∂C_N/∂θ_i ) + (1/2) t^T C_N^{-1} (∂C_N/∂θ_i) C_N^{-1} t    (38)

In general, ln p(t|Θ) is a non-convex function, and it might have multiple maxima.
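A minimal sketch of Eqs. (37)-(38) (NumPy assumed; `dC_dtheta` stands for the derivative matrix ∂C_N/∂θ_i for one hyper-parameter, which the caller must supply for the chosen kernel):

```python
import numpy as np

def log_marginal_likelihood(t, C_N):
    """ln p(t|Theta), Eq. (37), via a numerically stable log-determinant."""
    N = len(t)
    _, logdet = np.linalg.slogdet(C_N)
    return (-0.5 * logdet
            - 0.5 * t @ np.linalg.solve(C_N, t)
            - 0.5 * N * np.log(2.0 * np.pi))

def log_marginal_gradient(t, C_N, dC_dtheta):
    """d/dtheta_i ln p(t|Theta), Eq. (38), for one hyper-parameter."""
    Cinv_t = np.linalg.solve(C_N, t)
    return (-0.5 * np.trace(np.linalg.solve(C_N, dC_dtheta))
            + 0.5 * Cinv_t @ dC_dtheta @ Cinv_t)
```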

Learning the hyper-parameters (cont.)


Optionally, we could introduce a prior over Θ and maximise the log posterior:

• Again, by using gradient-based methods

Remark: In a fully Bayesian treatment, we would need to evaluate marginals over Θ, weighted by the product of the prior p(Θ) and the likelihood function p(t|Θ):

• In general, however, exact marginalisation will be intractable
• We must resort to approximations

The Gaussian process regression model gives a predictive distribution whose mean and variance are functions of the input vector x. However, we have assumed that the contribution to the predictive variance arising from the additive noise, governed by the parameter β, is a constant.

For heteroscedastic problems, the noise variance itself will also depend on x:

• To model this, we can extend the Gaussian process framework by introducing a second Gaussian process to represent the dependence of β on the input x