
Bayesian Linear Regression

UFC/DC, ATAI-I (CK0146), PR (TIP8311), 2016.2
Francesco Corona

In a maximum likelihood approach to setting the parameters of a linear model for regression, we tune the effective model complexity, the number of basis functions:

• We control it based on the size of the data set

Adding a regularisation term to the log likelihood function means that the effective model complexity can instead be controlled by the regularisation coefficient:

• The choice of the number and form of the basis functions is still important in determining the overall behaviour of the model

Bayesian linear regression (cont.)


This leaves the issue of setting the appropriate model complexity for the problem:

• It cannot be decided simply by maximising the likelihood function, as this always leads to excessively complex models and over-fitting

Remark: Independent hold-out data can be used to determine model complexity, but this can be both computationally expensive and wasteful of data.

We therefore turn to a Bayesian treatment of linear regression:

• It avoids the over-fitting problem of maximum likelihood
• It leads to automatic methods of setting model complexity

We again focus on the case of a single target variable t.

Outline


1 Bayesian linear regression
• Parameter distribution
• Predictive distribution
• Equivalent kernel

2 Gaussian processes
• Linear regression revisited
• Gaussian processes for regression
• Learning the hyper-parameters

Parameter distribution

The Bayesian treatment of linear regression starts by introducing a prior probability distribution over the model parameters w.¹

The likelihood function p(t|w) is the exponential of a quadratic function of w,

p(t|w) = ∏_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1})

The corresponding conjugate prior is thus a Gaussian distribution of the form

p(w) = N(w | m_0, S_0)    (1)

• with mean m_0 and covariance S_0

The posterior distribution is proportional to the product of the likelihood function and the prior.² Due to the choice of a conjugate Gaussian prior, the posterior is Gaussian too,

p(w|t) ∝ [ ∏_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1}) ] N(w | m_0, S_0)
       ∝ exp( -(β/2) (t - Φw)^T (t - Φw) ) exp( -(1/2) (w - m_0)^T S_0^{-1} (w - m_0) )

The posterior distribution can thus be written directly in the form³

p(w|t) = N(w | m_N, S_N)    (2)

m_N = S_N ( S_0^{-1} m_0 + β Φ^T t )    (3)

S_N^{-1} = S_0^{-1} + β Φ^T Φ    (4)
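As a minimal sketch of Eqs. (3)-(4) (assuming NumPy; the function name is illustrative, and `Phi` denotes the N × M design matrix with rows φ(x_n)^T):

```python
import numpy as np

def posterior_params(Phi, t, beta, m0, S0):
    """Posterior p(w|t) = N(w | m_N, S_N), Eqs. (2)-(4).
    Phi: (N, M) design matrix with rows phi(x_n)^T; t: (N,) targets."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + beta * Phi.T @ Phi          # Eq. (4)
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)    # Eq. (3)
    return mN, SN
```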

¹ There is also the noise precision parameter β; we first assume it is a known constant.
² We derived something similar when discussing Bayes' theorem for Gaussian variables.
³ This distribution is calculated by completing the square in the exponential and finding the normalisation coefficient using the result for a normalised Gaussian.

Parameter distribution (cont.)


Consider a simple form of the Gaussian prior: zero-mean and isotropic, characterised by a single precision parameter α,

p(w|α) = N(w | 0, α^{-1} I)    (5)

Because the posterior distribution is Gaussian, its mode and mean coincide:

• The maximum posterior weight vector is given by w_{MAP} = m_N

If we consider an infinitely broad prior S_0 = α^{-1} I with α → 0, the mean m_N of the posterior distribution reduces to the maximum likelihood value

w_{ML} = (Φ^T Φ)^{-1} Φ^T t

Similarly, if N = 0, then the posterior distribution reverts to the prior.

The corresponding posterior distribution over w is p(w|t) = N(w | m_N, S_N), with

m_N = β S_N Φ^T t    (6)

S_N^{-1} = α I + β Φ^T Φ    (7)

Parameter distribution (cont.)

The log of the posterior distribution is given by the sum of the log likelihood and the log of the prior. As a function of w, it takes the form

ln p(w|t) = -(β/2) ∑_{n=1}^{N} ( t_n - w^T φ(x_n) )^2 - (α/2) w^T w + const    (8)

Maximisation of this posterior distribution with respect to w is therefore equivalent to the minimisation of the sum-of-squares error function with the addition of a quadratic regularisation term,

(1/2) ∑_{n=1}^{N} ( t_n - w^T φ(x_n) )^2 + (λ/2) w^T w,  with λ = α/β

To illustrate Bayesian learning in a linear basis function model, together with the sequential update of a posterior distribution, we consider plain line fitting. Consider a single input variable x, a single target variable t and the linear model

y(x, w) = w_0 + w_1 x

We generate a synthetic set of data from the function f(x, a) = a_0 + a_1 x

• with a_0 = -0.3 and a_1 = 0.5

For a selection of input points x_n ∼ U(-1, +1), we first evaluate f(x_n, a) and then add Gaussian noise ε ∼ N(0, 0.2^2) to get the target values t_n.

The goal is to recover the values of a_0 and a_1 (through w_0 and w_1):

• under the assumption that the variance of the noise is known, β = 1/0.2^2 = 25
• we fix α = 2.0 in the Gaussian prior p(w|α) = N(w | 0, α^{-1} I)

A minimal sketch of this sequential update is given below.
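The following sketch of the line-fitting experiment assumes NumPy; it tracks the posterior in natural-parameter form, so each observation applies the sequential analogue of Eqs. (6)-(7) with the current posterior acting as the prior:

```python
import numpy as np

rng = np.random.default_rng(0)
a0, a1 = -0.3, 0.5          # true parameters of f(x, a) = a0 + a1*x
beta, alpha = 25.0, 2.0     # known noise precision and prior precision

# Prior p(w|alpha) = N(w | 0, alpha^{-1} I); track precision and precision*mean
SN_inv = alpha * np.eye(2)
b = np.zeros(2)

for n in range(20):
    x = rng.uniform(-1.0, 1.0)                # x_n ~ U(-1, +1)
    t = a0 + a1 * x + rng.normal(0.0, 0.2)    # add N(0, 0.2^2) noise
    phi = np.array([1.0, x])                  # phi(x) = (1, x)^T for y = w0 + w1*x
    SN_inv += beta * np.outer(phi, phi)       # sequential form of Eq. (7)
    b += beta * phi * t
    mN = np.linalg.solve(SN_inv, b)           # sequential form of Eq. (6)

print(mN)  # approaches (a0, a1) = (-0.3, 0.5) as observations accumulate
```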


The plain Gaussian is not the only available form of prior over the parameters; the Gaussian can be generalised to

p(w|α) = [ (q/2) (α/2)^{1/q} / Γ(1/q) ]^M exp( -(α/2) ∑_{j=0}^{M-1} |w_j|^q )    (9)

• It is not a conjugate prior to the likelihood function, unless q = 2

Finding the maximum of the posterior distribution over the parameters corresponds to the minimisation of a regularised error function

(1/2) ∑_{n=1}^{N} ( t_n - w^T φ(x_n) )^2 + (λ/2) ∑_{j=1}^{M} |w_j|^q

Because w is two-dimensional, we can plot the prior and posterior distributions directly.

Predictive distribution


In practice, we are not usually interested in the value of w itself:

• We want to make predictions of t for new values of x

This requires that we evaluate the predictive distribution defined by

p(t|t, α, β) = ∫ p(t|w, β) p(w|t, α, β) dw    (10)

where t is the vector of target values from the training set⁴

• The conditional distribution of the target is p(t|x, w, β); the conditioning on the input x may be omitted to simplify notation
• The posterior distribution of the weights is p(w|t, α, β)

⁴ We omit the corresponding input vectors X from the right-hand side of the conditioning to simplify the notation.

Predictive distribution (cont.)


Calculating the predictive distribution involves the convolution of Gaussians,

p(t|t, α, β) = ∫ p(t|w, β) p(w|t, α, β) dw

• The conditional distribution of the target: with y(x, w) = φ(x)^T w, we have p(t|w, β) = p(t|x, w, β) = N(t | y(x, w), β^{-1})
• The posterior distribution of the weights: p(w|t, α, β) = N(w | m_N, S_N), with m_N = S_N (S_0^{-1} m_0 + β Φ^T t) and S_N^{-1} = S_0^{-1} + β Φ^T Φ

The noise process and the distribution of w are independent Gaussians, so their variances are additive: the mean of the convolution is the sum of the means of the two Gaussians, and the covariance of the convolution is the sum of their covariances.

Using old results (Eq. 2.115, ⋆), the predictive distribution takes the form

p(t|x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x))    (11)

where the variance σ_N^2(x) of the predictive distribution is

σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)    (12)

• the first term, 1/β, represents the noise on the data
• the second term reflects the uncertainty associated with w

A sketch of this computation is given below.
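A minimal sketch of Eqs. (11)-(12) for a single test input (NumPy assumed; `phi_x` is the basis-function vector φ(x), and `mN`, `SN` come from the posterior computed earlier):

```python
import numpy as np

def predictive(phi_x, mN, SN, beta):
    """Predictive mean m_N^T phi(x) and variance sigma_N^2(x), Eqs. (11)-(12)."""
    mean = mN @ phi_x
    var = 1.0 / beta + phi_x @ SN @ phi_x   # noise term plus uncertainty in w
    return mean, var
```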

Predictive distribution (cont.)


As more points are observed, the posterior distribution becomes narrower:

• As a consequence, it can be shown that σ_{N+1}^2(x) ≤ σ_N^2(x)

In the limit N → ∞, the second term in σ_N^2(x) = 1/β + φ(x)^T S_N φ(x) goes to zero:

• The variance of the predictive distribution arises solely from the additive noise governed by the parameter β

Illustration of the predictive distribution for Bayesian linear regression (⋆):

• The sinusoidal data with additive Gaussian noise
• Model fitted to the data: a linear combination of 9 Gaussian basis functions
• Different datasets of different sizes: N = 1, N = 2, N = 4 and N = 25

The red curve (one per N) is the mean of the Gaussian predictive distribution; the red shaded region spans one standard deviation either side of the mean.

[Figure: predictive mean and one-standard-deviation band for N = 1, 2, 4, 25 data points; axes t versus x]

So far, we showed only the point-wise predictive variance as a function of x. In order to gain insight into the covariance between predictions at different values of x, we can draw samples from the posterior distribution over w:

• We have a probabilistic model and we can generate new data




The predictive uncertainty (the variance) depends on x: it is smallest in the neighbourhood of the data points, and it decreases as more points are observed.
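A sketch of this sampling procedure (NumPy assumed; `basis` is any callable returning the vector φ(x), e.g. Gaussian basis functions, and the function name is illustrative):

```python
import numpy as np

def sample_posterior_functions(xs, mN, SN, basis, n_samples=5, seed=1):
    """Draw w ~ N(m_N, S_N) and evaluate y(x, w) = w^T phi(x) on a grid,
    so that correlations between predictions at different x become visible."""
    rng = np.random.default_rng(seed)
    W = rng.multivariate_normal(mN, SN, size=n_samples)  # (n_samples, M)
    Phi_grid = np.array([basis(x) for x in xs])          # (len(xs), M)
    return W @ Phi_grid.T                                # one sampled curve per row
```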

Predictive distribution (cont.)

Plots of the functions y(x, w), with samples of w drawn from the posterior distribution:

[Figure: functions y(x, w) with w sampled from posteriors based on N = 1, 2, 4, 25 data points]

If both w and β are treated as unknowns, we can introduce a conjugate prior distribution p(w, β), which will be given by a Gaussian-gamma distribution:

• The resulting predictive distribution is a Student's t-distribution




Equivalent kernel

The posterior mean solution m_N = β S_N Φ^T t for the linear basis function model has an interesting interpretation that sets the stage for kernel methods.

Substituting m_N into y(x, w) = ∑_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x), we get a new expression for the predictive mean,

y(x, m_N) = m_N^T φ(x) = β φ(x)^T S_N Φ^T t = ∑_{n=1}^{N} β φ(x)^T S_N φ(x_n) t_n    (13)

where S_N^{-1} = S_0^{-1} + β Φ^T Φ:

• The mean of the predictive distribution at a point x is a linear combination of the training set target variables t_n

y(x, m_N) = ∑_{n=1}^{N} [ β φ(x)^T S_N φ(x_n) ] t_n,  where the bracketed factor is k(x, x_n)

Equivalent kernel (cont.)

The function

k(x, x') = β φ(x)^T S_N φ(x')    (14)

is known as the smoother matrix or the equivalent kernel. It can be plotted as a function of x' for different values of x.

Regression functions that make predictions by taking linear combinations of the target values t_n in the training set are known as linear smoothers.

y(x, m_N) = ∑_{n=1}^{N} k(x, x_n) t_n    (15)

The dependence on the input values x_n in the training set enters through S_N.

The equivalent kernel is localised around x, so the mean y(x, m_N) of the predictive distribution at x is a weighted combination of the target values in which points close to x are given higher weight: intuitively, local evidence is weighted more strongly than distant evidence.
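A minimal sketch of Eqs. (14)-(15) (NumPy assumed; `basis` again denotes a callable returning φ(x), and both function names are illustrative):

```python
import numpy as np

def equivalent_kernel(x, X_train, SN, beta, basis):
    """Weights k(x, x_n) = beta * phi(x)^T S_N phi(x_n), Eq. (14)."""
    phi_x = basis(x)
    return np.array([beta * phi_x @ SN @ basis(xn) for xn in X_train])

def smoother_mean(x, X_train, t, SN, beta, basis):
    """Predictive mean as a linear smoother, Eq. (15): sum_n k(x, x_n) * t_n."""
    return equivalent_kernel(x, X_train, SN, beta, basis) @ t
```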

Equivalent kernel (cont.)

Examples of equivalent kernels k(x, x') for x = 0, plotted as a function of x':

• Polynomial basis functions (left) and sigmoidal basis functions (right)

[Figure: equivalent kernels for polynomial and sigmoidal basis functions]

Further insight into the role of the equivalent kernel can be obtained by considering the covariance between y(x) and y(x'), which is given by⁵

cov[y(x), y(x')] = cov[φ(x)^T w, w^T φ(x')] = φ(x)^T S_N φ(x') = β^{-1} k(x, x')    (16)

• From the form of the equivalent kernel, we see that the predictive mean at nearby points will be highly correlated, whereas for more distant pairs of points the correlation will be smaller


k is a localised function of x', even though the corresponding basis functions are not.

⁵ We used p(w|t) = N(w | m_N, S_N) and k(x, x') = β φ(x)^T S_N φ(x').

Equivalent kernel (cont.)

The effective kernel defines the weights by which the training set target values are combined in order to make a prediction at a new value of x.

It can be shown that these weights sum to one; in other words,

∑_{n=1}^{N} k(x, x_n) = 1,  for all x    (17)

The formulation of linear regression in terms of a kernel function suggests an alternative approach to regression: instead of introducing a set of basis functions, which implicitly determines an equivalent kernel, we can define a localised kernel directly and use it to make predictions for new input vectors x, given the observed training set.

It can also be shown that the kernel function can be written as an inner product,

k(x, z) = ψ(x)^T ψ(z)    (18)

with respect to the vector ψ(x) of a set of nonlinear functions, with

ψ(x) = β^{1/2} S_N^{1/2} φ(x)

Remark: This leads to a practical framework for regression with Gaussian processes.

Gaussian processes


We previously applied the concept of duality to a non-probabilistic model for regression:

• We now extend the role of kernels to probabilistic discriminative models

We considered linear regression models of the form y(x, w) = w^T φ(x):

• w is a vector of parameters, and φ(x) is a vector of fixed nonlinear basis functions that depend on the input vector x
• We showed that a prior distribution over w induces a corresponding prior distribution over functions y(x, w)

Given a training data set, we evaluated the posterior distribution over w:

• This yields a corresponding posterior distribution over regression functions
• With noise, it implies a predictive distribution p(t|x) for new inputs x

Gaussian processes (cont.)



In the Gaussian process viewpoint, we dispense with the parametric model and instead define a distribution over functions directly.

It is difficult to work with a distribution over the infinite space of functions:

• For a finite training set, we only need to consider the values of the function at the discrete set of input values x_n corresponding to the training set and test set data points, and so in practice we can work in a finite space

Linear regression revisited


To illustrate the Gaussian process viewpoint, we consider linear regression:

• We re-derive the predictive distribution
• In terms of distributions over functions y(x, w)

Consider a prior distribution over w given by an isotropic Gaussian of the form

p(w) = N(w | 0, α^{-1} I)    (20)

governed by the hyperparameter α, the precision (inverse variance) of the distribution:

• For any given value of w, y(x) = w^T φ(x) defines a function of x
• The probability distribution over w thus induces a probability distribution over functions y(x)

Consider a model defined in terms of a linear combination of M fixed basis functions, given by the elements of the vector φ(x) = (φ_0(x), ..., φ_{M-1}(x))^T,

y(x) = w^T φ(x)    (19)

where x is the input vector and w is the M-dimensional weight vector. We wish to evaluate this function at specific values of x, say at the training data points

x_1, ..., x_N

Linear regression revisited (cont.)


We are interested in the joint distribution of the function values y(x_1), ..., y(x_N), which we denote by the vector y with elements y_n = y(x_n), for n = 1, ..., N.

We find the probability distribution of y by noting that y is a linear combination of Gaussian distributed variables, the elements of w, and is thus itself Gaussian.

From y(x) = w^T φ(x), this vector is given by

y = Φw    (21)

We need to find its mean and covariance:

E[y] = Φ E[w] = 0    (22)

cov[y] = E[y y^T] = Φ E[w w^T] Φ^T = (1/α) Φ Φ^T = K    (23)

where Φ is the design matrix, with elements Φ_{nk} = φ_k(x_n),

Φ = [ φ_0(x_1)  φ_1(x_1)  ···  φ_{M-1}(x_1)
      φ_0(x_2)  φ_1(x_2)  ···  φ_{M-1}(x_2)
         ⋮          ⋮       ⋱        ⋮
      φ_0(x_N)  φ_1(x_N)  ···  φ_{M-1}(x_N) ]

and K is the Gram matrix with elements

k_{nm} = k(x_n, x_m) = (1/α) φ(x_n)^T φ(x_m)    (24)

This model provides us only with a particular example of a Gaussian process.
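A sketch of Eqs. (21)-(24) (NumPy assumed; the quadratic basis is an illustrative choice, and the small jitter term is an assumption added to guard against numerical issues when sampling from the rank-deficient K):

```python
import numpy as np

def gram_matrix(X, alpha, basis):
    """K with elements k(x_n, x_m) = (1/alpha) phi(x_n)^T phi(x_m), Eq. (24)."""
    Phi = np.array([basis(x) for x in X])   # design matrix, Phi_nk = phi_k(x_n)
    return Phi @ Phi.T / alpha              # Eq. (23)

# Sampling y ~ N(0, K) draws a function from the induced prior over functions
X = np.linspace(-1.0, 1.0, 50)
basis = lambda x: np.array([1.0, x, x**2])  # illustrative basis choice
K = gram_matrix(X, alpha=2.0, basis=basis)
y = np.random.default_rng(0).multivariate_normal(
    np.zeros(len(X)), K + 1e-10 * np.eye(len(X)))  # jitter for stability
```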


Linear regression revisited (cont.)

A key point about Gaussian processes is that the joint distribution over the N variables y_1, ..., y_N is specified completely by second-order statistics:

• Mean: In most applications we will not have any prior knowledge about the mean of y(x), and so by symmetry we take it to be zero. This is equivalent to choosing the mean of the prior over weight values p(w|α) to be zero in the basis function viewpoint.

Remark: A Gaussian process is a probability distribution over functions y(x) such that the set of values of y(x) evaluated at an arbitrary set of points {x_n} jointly have a Gaussian distribution.

• Covariance: The specification of the Gaussian process is completed by giving the covariance of y(x) evaluated at any two values of x

This is given by the kernel function, E[y(x_n) y(x_m)] = k(x_n, x_m)

Linear regression revisited (cont.)


We can also define the kernel function directly, rather than indirectly through a choice of basis functions:

• By-pass the choice of basis functions
• Draw samples of functions from the GP

For the specific case of a Gaussian process defined by the linear regression model y(x) = w^T φ(x), with weight prior p(w|α) = N(w | 0, α^{-1} I),

• the kernel function is given by k_{nm} = k(x_n, x_m) = (1/α) φ(x_n)^T φ(x_m)

[Figure: sample functions drawn from two Gaussian process priors]


Gaussian processes for regression


In order to apply Gaussian process models to the problem of regression, we need to take account of the noise on the observed target values,

t_n = y_n + ε_n    (25)

where y_n = y(x_n) and ε_n is a random noise variable whose value is chosen independently for each observation n.

We consider noise processes that have a Gaussian distribution, so that

p(t_n | y_n) = N(t_n | y_n, β^{-1})    (26)

where β is a hyperparameter representing the precision of the noise

Gaussian processes for regression (cont.)


Because the noise is independent for each point, the joint distribution of the target values t = (t_1, ..., t_N)^T, conditioned on the values of y = (y_1, ..., y_N)^T, is given by an isotropic Gaussian of the form

p(t|y) = N(t | y, β^{-1} I_N)    (27)

From the definition of a Gaussian process, the marginal distribution p(y) is given by a Gaussian whose mean is zero and whose covariance is a Gram matrix K,

p(y) = N(y | 0, K)    (28)

In order to find the marginal distribution p(t), conditioned on the input values x_1, ..., x_N, we need to integrate p(t|y) over y:

p(t) = ∫ p(t|y) p(y) dy = N(t | 0, C)    (29)

where the covariance matrix C has elements

C(x_n, x_m) = k(x_n, x_m) + β^{-1} δ_{nm}    (30)

• δ_{nm} is the Kronecker delta (1 if n = m, 0 otherwise)

The kernel function that determines K can be chosen to express the property that, for points x_n and x_m that are similar, the corresponding values y(x_n) and y(x_m) will be more strongly correlated than for dissimilar points.

The covariance matrix C reflects the fact that the two Gaussian sources of randomness, one associated with y(x) and one with ε, are independent:

• their covariances (K and β^{-1} I) simply add, as in the sketch below
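A minimal sketch of the marginal covariance of Eqs. (29)-(30) (NumPy assumed; `kernel` stands for any valid kernel function of two inputs):

```python
import numpy as np

def marginal_covariance(X, kernel, beta):
    """C with elements C(x_n, x_m) = k(x_n, x_m) + beta^{-1} delta_nm, Eq. (30)."""
    K = np.array([[kernel(xn, xm) for xm in X] for xn in X])
    return K + np.eye(len(X)) / beta   # the two covariances simply add
```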

Gaussian processes for regression (cont.)

From a GP prior with a given covariance function k(x_n, x_m), we can sample functions.

One widely used kernel function for Gaussian processes is the exponential of a quadratic form, with the addition of constant and linear terms, to give

k(x_n, x_m) = θ_0 exp( -(θ_1/2) ||x_n - x_m||^2 ) + θ_2 + θ_3 x_n^T x_m    (31)

The term involving θ_3 corresponds to a parametric model that is a linear function of the input variables.

[Figure: samples from the GP prior with kernel (31), one panel per setting of (θ_0, θ_1, θ_2, θ_3): (1.00, 4.00, 0.00, 0.00), (9.00, 4.00, 0.00, 0.00), (1.00, 64.00, 0.00, 0.00), (1.00, 0.25, 0.00, 0.00), (1.00, 4.00, 10.00, 0.00), (1.00, 4.00, 0.00, 5.00)]
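A sketch of the kernel of Eq. (31) (NumPy assumed; the default θ values are just one of the settings shown in the figure):

```python
import numpy as np

def exp_quad_kernel(xn, xm, theta=(1.0, 4.0, 0.0, 0.0)):
    """Eq. (31): theta0 exp(-theta1/2 ||xn - xm||^2) + theta2 + theta3 xn^T xm."""
    t0, t1, t2, t3 = theta
    xn, xm = np.atleast_1d(xn), np.atleast_1d(xm)
    return t0 * np.exp(-0.5 * t1 * np.sum((xn - xm) ** 2)) + t2 + t3 * (xn @ xm)
```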




Gaussian processes for regression (cont.)


A sampled function (blue line), drawn from the Gaussian process prior over functions, is evaluated at a set of points {x_n} to give the points {y_n} (red dots). The corresponding values of {t_n} (green dots) are obtained by adding independent Gaussian noise to each point in {y_n}.

[Figure: a sample from the GP prior with noise-free values {y_n} and noisy targets {t_n}]

Our goal in regression is to make predictions of target variables for new inputs, given a set of training data.

Let us suppose that t_N = (t_1, ..., t_N)^T, corresponding to the input values x_1, ..., x_N, comprises the observed training data set, and that our goal is to predict the variable t_{N+1} for a new input vector x_{N+1}:

• This requires that we evaluate the predictive distribution p(t_{N+1}|t_N)
• This distribution is conditioned also on the variables x_1, ..., x_N and x_{N+1}

Gaussian processes for regression (cont.)

To find the conditional distribution p(t_{N+1}|t_N), we begin by writing down the joint distribution p(t_{N+1}), where t_{N+1} is the vector t_{N+1} = (t_1, ..., t_N, t_{N+1})^T. We then apply results for the Gaussian distribution to obtain the conditional.

From p(t) = ∫ p(t|y) p(y) dy, the joint distribution over t_1, ..., t_{N+1} is

p(t_{N+1}) = N(t_{N+1} | 0, C_{N+1})    (32)

where C_{N+1} is an (N+1) × (N+1) covariance matrix with elements given by

C(x_n, x_m) = k(x_n, x_m) + β^{-1} δ_{nm}

Because this joint distribution is Gaussian, we can apply the results for the Gaussian distribution to characterise the conditional distribution.

[Figure: contours of the joint distribution p(t_1, t_2) for one training point t_1 and one test point t_2; conditioning on (fixing) the value of t_1 yields p(t_2|t_1)]


The conditional distribution p(t_{N+1}|t) will also be a Gaussian distribution.

Gaussian processes for regression (cont.)

Remember that when two sets of variables are jointly Gaussian, the conditional distribution of one set, conditioned on the other, is also Gaussian.

For an arbitrary vector x with Gaussian distribution N(x | µ, Σ):

• We first partition x into two disjoint subsets x_a and x_b,

x = [ x_a ; x_b ]

• We then partition the mean vector µ and the covariance matrix Σ accordingly,

µ = [ µ_a ; µ_b ],   Σ = [ Σ_aa  Σ_ab ; Σ_ba  Σ_bb ]

From these, we obtained the expressions for the mean and covariance of the conditional distribution p(x_a|x_b) in terms of the partitioned covariance matrix:

µ_{a|b} = µ_a + Σ_ab Σ_bb^{-1} (x_b - µ_b)

Σ_{a|b} = Σ_aa - Σ_ab Σ_bb^{-1} Σ_ba

We therefore partition the covariance matrix of the joint distribution,

C_{N+1} = [ C_N  k ; k^T  c ]    (33)

where:

• C_N is the N × N covariance matrix with elements C(x_n, x_m) = k(x_n, x_m) + β^{-1} δ_{nm}, for n, m = 1, ..., N
• the vector k has elements k(x_n, x_{N+1}), for n = 1, ..., N
• the scalar c = k(x_{N+1}, x_{N+1}) + β^{-1}

A sketch of this conditioning step is given below.
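A minimal sketch of the partitioned-Gaussian formulas (NumPy assumed; the index arrays selecting the a- and b-blocks are illustrative, and a linear solve replaces the explicit inverse of Σ_bb):

```python
import numpy as np

def conditional_gaussian(mu, Sigma, idx_a, idx_b, xb):
    """Mean and covariance of p(x_a | x_b) from the partitioned joint Gaussian."""
    mu_a, mu_b = mu[idx_a], mu[idx_b]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    mu_cond = mu_a + S_ab @ np.linalg.solve(S_bb, xb - mu_b)
    S_cond = (Sigma[np.ix_(idx_a, idx_a)]
              - S_ab @ np.linalg.solve(S_bb, Sigma[np.ix_(idx_b, idx_a)]))
    return mu_cond, S_cond
```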


Gaussian processes for regression (cont.)

Using the expressions for the conditional Gaussian distribution, the conditional p(t_{N+1}|t) yields

m(x_{N+1}) = k^T C_N^{-1} t    (34)

σ^2(x_{N+1}) = c - k^T C_N^{-1} k    (35)

Illustration of Gaussian process regression applied to the sinusoidal data set (the three right-most data points were omitted):

• Green curve: the sine function from which the data points (blue) are obtained by sampling and adding Gaussian noise

• Red line: the mean of the Gaussian process predictive distribution, with a shaded region spanning ± two standard deviations

[Figure: GP regression fit to the sinusoidal data over the interval (0, 1)]

Because the vector k is a function of the test-point input value x_{N+1}, the predictive distribution is a Gaussian whose mean and variance both depend on x_{N+1}.

Note how the uncertainty increases in the region to the right of the data points
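A sketch of GP prediction via Eqs. (34)-(35) for a single new input (NumPy assumed; `kernel` as above, with a linear solve replacing the explicit inverse of C_N):

```python
import numpy as np

def gp_predict(x_new, X, t, kernel, beta):
    """Predictive mean (34) and variance (35) at a single new input."""
    C_N = (np.array([[kernel(a, b) for b in X] for a in X])
           + np.eye(len(X)) / beta)                  # Eq. (30)
    k = np.array([kernel(xn, x_new) for xn in X])    # vector k
    c = kernel(x_new, x_new) + 1.0 / beta            # scalar c
    mean = k @ np.linalg.solve(C_N, t)               # Eq. (34)
    var = c - k @ np.linalg.solve(C_N, k)            # Eq. (35)
    return mean, var
```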

Gaussian processes for regression (cont.)


The only restriction on the kernel function is that the covariance matrix C(x_n, x_m) = k(x_n, x_m) + β^{-1} δ_{nm} must be positive definite:

• If λ_i is an eigenvalue of K, then the associated eigenvalue of C will be λ_i + β^{-1}

It is therefore sufficient that the kernel matrix k(x_n, x_m) be positive semidefinite for any pair of points x_n and x_m, so that λ_i ≥ 0:

• any eigenvalue λ_i that is zero will still give rise to a positive eigenvalue for C, because β > 0

The mean m(x_{N+1}) = k^T C_N^{-1} t of the predictive distribution is a function of x_{N+1},

m(x_{N+1}) = ∑_{n=1}^{N} a_n k(x_n, x_{N+1})    (36)

where a_n is the n-th component of C_N^{-1} t.

For a kernel function k(x_n, x_m) that depends only on the distance ||x_n - x_m||, we obtain an expansion in radial basis functions.

This is the same restriction on the kernel function discussed earlier:

• We can exploit all of the available techniques to construct suitable kernels

UFC/DC UFC/DC ATAI-I (CK0146) ATAI-I (CK0146) PR (TIP8311) PR (TIP8311) 2016.2 2016.2

Bayesian linear Bayesian linear The central computational operation in GP is the inversion ofaN × N matrix regression regression • Standard methods require O(N3)computations Parameter distribution m(x )=kT C−1t Parameter distribution Predictive distribution N+1 N Predictive distribution Equivalent kernel 2 T −1 Equivalent kernel In the basis function model, we have to invert a matrix SN of size M × M σ (xN+1)=c − k CN t Gaussian processes Gaussian processes • O(M3)computationalcomplexity Linear regression revisited The results above define the predictive distribution for Gaussian Linear regression revisited Gaussian processes for Gaussian processes for regression process for regression with an arbitrary kernel function k(xn , xm) regression For both, matrix inversion must be performed once for the given training set Learning the Learning the hyper-parameters hyper-parameters

In the particular case in which the kernel function k(x_n, x_m) is defined in terms of a finite set of basis functions, we can derive the linear regression results from a Gaussian process viewpoint (⋆).

For such models, we can therefore obtain the predictive distribution either:

• by taking a parameter space viewpoint and using the linear regression results, or
• by taking a function space viewpoint and using the GP model results

Remark: For each new test point, both require a vector-matrix multiplication, which has cost O(N^2) for Gaussian process models and O(M^2) for linear basis function models. If the number M of basis functions is smaller than the number N of points, it is computationally more efficient to work in the basis function framework.

Learning the hyper-parameters


The predictions of a GP model depend in part on the choice of covariance function.

In practice, rather than fixing the covariance function, we may prefer to use a parametric family of functions and then infer the parameter values from the data.

These parameters govern such things as the length scale of the correlations and the precision of the noise; they correspond to the hyper-parameters in a standard parametric model.

Learning the hyper-parameters (cont.)


Techniques for learning the hyper-parameters are based on the evaluation of the likelihood function p(t|Θ), where Θ denotes the hyper-parameters of the GP.

The simplest approach is to make a point estimate of Θ by maximising the log likelihood function (e.g., by gradient-based optimisation algorithms such as conjugate gradients).

The log likelihood function for a Gaussian process regression model is evaluated using the standard form for a multivariate Gaussian distribution,

ln p(t|Θ) = -(1/2) ln|C_N| - (1/2) t^T C_N^{-1} t - (N/2) ln(2π)    (37)

For nonlinear optimisation, we also need the gradient of the log likelihood function with respect to the parameter vector Θ,

∂ ln p(t|Θ)/∂θ_i = -(1/2) Tr( C_N^{-1} ∂C_N/∂θ_i ) + (1/2) t^T C_N^{-1} (∂C_N/∂θ_i) C_N^{-1} t    (38)

In general, ln p(t|Θ) is a non-convex function, and it might have multiple maxima.
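A minimal sketch of Eqs. (37)-(38) (NumPy assumed; `dC_dtheta` stands for the derivative matrix ∂C_N/∂θ_i for one hyper-parameter, which the caller must supply for the chosen kernel):

```python
import numpy as np

def log_marginal_likelihood(t, C_N):
    """ln p(t|Theta), Eq. (37), via a numerically stable log-determinant."""
    N = len(t)
    _, logdet = np.linalg.slogdet(C_N)
    return (-0.5 * logdet
            - 0.5 * t @ np.linalg.solve(C_N, t)
            - 0.5 * N * np.log(2.0 * np.pi))

def log_marginal_gradient(t, C_N, dC_dtheta):
    """d/dtheta_i ln p(t|Theta), Eq. (38), for one hyper-parameter."""
    Cinv_t = np.linalg.solve(C_N, t)
    return (-0.5 * np.trace(np.linalg.solve(C_N, dC_dtheta))
            + 0.5 * Cinv_t @ dC_dtheta @ Cinv_t)
```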

Learning the hyper-parameters (cont.)


Optionally, we could introduce a prior over Θ and maximise the log posterior:

• Again, by using gradient-based methods

Remark: In a fully Bayesian treatment, we would need to evaluate marginals over Θ, weighted by the product of the prior p(Θ) and the likelihood function p(t|Θ):

• In general, however, exact marginalisation will be intractable
• We must resort to approximations

The Gaussian process regression model gives a predictive distribution whose mean and variance are functions of the input vector x. However, we have assumed that the contribution to the predictive variance arising from the additive noise, governed by the parameter β, is a constant.

For heteroscedastic problems, the noise variance itself will also depend on x:

• To model this, we can extend the Gaussian process framework by introducing a second Gaussian process to represent the dependence of β on the input x