Bayesian Linear Regression

Sargur Srihari
[email protected]

Topics in Bayesian Regression

• Recall Max Likelihood Linear Regression
• Parameter Distribution
• Predictive Distribution
• Equivalent Kernel

Linear Regression: Model Complexity M

• Polynomial regression: y(x,w) = w0 + w1 x + w2 x^2 + .. + wM x^M = Σ_{j=0}^{M} wj x^j
  – Red lines are the best fits with M = 0, 1, 3, 9 and N = 10
  – [Figure: fits to noisy samples of sin(2πx); M = 0 and M = 1 are poor representations, M = 3 is the best fit, M = 9 over-fits]

Max Likelihood Regression

• Input vector x, basis functions {φ1(x),.., φM(x)}:
    y(x,w) = Σ_{j=0}^{M−1} wj φj(x) = w^T φ(x)
  – Radial basis functions: φj(x) = exp( −(1/2)(x − μj)^T Σ^{−1} (x − μj) )
• Objective function: max likelihood objective with N examples {x1,..,xN} (equivalent to the mean squared error objective):
    E(w) = (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2
  – Regularized MSE with N examples (λ is the regularization coefficient):
    E(w) = (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2 + (λ/2) w^T w
• Closed-form ML solution: wML = (Φ^T Φ)^{−1} Φ^T t
  – Φ is the N × M design matrix with entries Φnj = φj(xn); (Φ^T Φ)^{−1} Φ^T is the Moore-Penrose pseudoinverse
  – Regularized solution: w = (λI + Φ^T Φ)^{−1} Φ^T t
• Gradient descent: w^(τ+1) = w^(τ) − η ∇E, where
    ∇E = − Σ_{n=1}^{N} { tn − w^(τ)T φ(xn) } φ(xn)
  – Regularized version: ∇E = − Σ_{n=1}^{N} { tn − w^(τ)T φ(xn) } φ(xn) + λ w^(τ)

Shortcomings of MLE

• The M.L.E. of the parameters w does not address M (model complexity: how many basis functions?)
  – Complexity is controlled by the data size N: more data allows a better fit without overfitting
  – Regularization also controls overfitting (λ controls its effect):
    E(w) = ED(w) + λ EW(w), where ED(w) = (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2 and EW(w) = (1/2) w^T w
• But M and the choice of the φj are still important
  – M can be determined by a holdout set, but that is wasteful of data
• Model complexity and over-fitting are better handled using the Bayesian approach

Bayesian Linear Regression

• Using Bayes rule, the posterior is proportional to Likelihood × Prior:
    p(w|t) = p(t|w) p(w) / p(t)
  – where p(t|w) is the likelihood of the observed data
  – p(w) is the prior distribution over the parameters
• We will look at:
  – A normal distribution for the prior p(w)
  – A likelihood p(t|w) that is a product of Gaussians based on the noise model
  – And conclude that the posterior is also Gaussian

Gaussian Prior over Parameters

• Assume a multivariate Gaussian prior for w (which has components w0,..,wM−1):
    p(w) = N(w | m0, S0), with mean m0 and covariance matrix S0
• If we choose S0 = α^{−1} I, the variances of the weights are all equal to α^{−1} and the covariances are zero
  – [Figure: p(w) in (w0, w1) space with zero mean (m0 = 0) and isotropic covariance (same variance over weights)]

Likelihood of the Data is Gaussian

• Assume a noise precision parameter β:
    t = y(x,w) + ε, where ε is zero-mean Gaussian noise, so p(t|x,w,β) = N(t | y(x,w), β^{−1})
  – Note that the output t is a scalar
• The likelihood of t = {t1,..,tN} is then
    p(t|X,w,β) = Π_{n=1}^{N} N(tn | w^T φ(xn), β^{−1})
  – This is the probability of the target data t given the parameters w and inputs X = {x1,..,xN}
  – Due to the Gaussian noise, the likelihood p(t|w) is also Gaussian
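Before moving to the posterior, here is a minimal NumPy sketch (not part of the slides; helper names such as design_matrix and w_ml are illustrative) of the quantities defined so far: the design matrix Φ, the closed-form ML and regularized least-squares solutions, and the Gaussian log-likelihood of the targets under noise precision β.

```python
import numpy as np

def design_matrix(x, M):
    """N x M design matrix: Phi[n, j] = phi_j(x_n); here a polynomial basis x**j."""
    return np.vander(x, M, increasing=True)

def w_ml(Phi, t):
    """Closed-form ML solution w_ML = (Phi^T Phi)^{-1} Phi^T t (Moore-Penrose pseudoinverse)."""
    return np.linalg.pinv(Phi) @ t

def w_regularized(Phi, t, lam):
    """Regularized solution w = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    return np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

def log_likelihood(w, Phi, t, beta):
    """Gaussian log-likelihood ln p(t|X,w,beta) = sum_n ln N(t_n | w^T phi(x_n), 1/beta)."""
    resid = t - Phi @ w
    return 0.5 * t.size * np.log(beta / (2 * np.pi)) - 0.5 * beta * np.sum(resid ** 2)

# Example: N = 10 noisy samples of sin(2*pi*x), as in the model-complexity figure
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 10)
Phi = design_matrix(x, M=4)                      # M = 4 basis functions (cubic fit)
w1, w2 = w_ml(Phi, t), w_regularized(Phi, t, lam=1e-3)
print(w1, log_likelihood(w1, Phi, t, beta=25.0))
```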
Posterior Distribution is also Gaussian

• Prior: p(w) = N(w | m0, S0), i.e., it is Gaussian
• The likelihood comes from Gaussian noise:
    p(t|X,w,β) = Π_{n=1}^{N} N(tn | w^T φ(xn), β^{−1})
• It follows that the posterior p(w|t) is also Gaussian
• Proof: use a standard result for Gaussians. If the marginal p(w) and the conditional p(t|w) have Gaussian forms, then the marginal p(t) and the conditional p(w|t) are also Gaussian:
  – Let p(w) = N(w | μ, Λ^{−1}) and p(t|w) = N(t | Aw + b, L^{−1})
  – Then the marginal is p(t) = N(t | Aμ + b, L^{−1} + A Λ^{−1} A^T) and the conditional is p(w|t) = N(w | Σ{A^T L (t − b) + Λμ}, Σ), where Σ = (Λ + A^T L A)^{−1}

Exact Form of the Posterior Distribution

• We have p(w) = N(w | m0, S0) and p(t|X,w,β) = Π_{n=1}^{N} N(tn | w^T φ(xn), β^{−1})
• The posterior is also Gaussian and can be written directly as p(w|t) = N(w | mN, SN)
  – where the mean of the posterior is mN = SN (S0^{−1} m0 + β Φ^T t)
  – and the covariance matrix of the posterior is given by SN^{−1} = S0^{−1} + β Φ^T Φ
  – Φ is the design matrix with entries Φnj = φj(xn)
  – [Figure: prior p(w|α) = N(w | 0, α^{−1}I) and posterior in weight space for scalar input x and y(x,w) = w0 + w1 x]

Properties of the Posterior

1. Since the posterior p(w|t) = N(w | mN, SN) is Gaussian, its mode coincides with its mean
   – Thus the maximum posterior weight is wMAP = mN
2. For an infinitely broad prior S0 = α^{−1}I, i.e., precision α → 0, the mean mN reduces to the maximum likelihood value, i.e., the solution vector wML = (Φ^T Φ)^{−1} Φ^T t
3. If N = 0, the posterior reverts to the prior
4. If data points arrive sequentially, the posterior at any stage acts as the prior distribution for the subsequent data points

Choose a Simple Gaussian Prior p(w)

• Model: y(x,w) = w0 + w1 x
• Zero-mean (m0 = 0), isotropic (same variances) Gaussian with a single precision parameter α:
    p(w|α) = N(w | 0, α^{−1}I)
• The corresponding posterior distribution is p(w|t) = N(w | mN, SN), where
    mN = β SN Φ^T t and SN^{−1} = αI + β Φ^T Φ
  – Note: β is the noise precision and α is the precision of the prior over the parameters; with infinitely many samples mN becomes a point estimate of w

Equivalence to MLE with Regularization

• Since p(t|X,w,β) = Π_{n=1}^{N} N(tn | w^T φ(xn), β^{−1}) and p(w|α) = N(w | 0, α^{−1}I)
• we have p(w|t) ∝ Π_{n=1}^{N} N(tn | w^T φ(xn), β^{−1}) · N(w | 0, α^{−1}I)
• The log of the posterior is
    ln p(w|t) = −(β/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2 − (α/2) w^T w + const
• Thus maximization of the posterior is equivalent to minimization of the sum-of-squares error
    E(w) = (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2 + (λ/2) w^T w
  with the addition of the quadratic regularization term w^T w, with λ = α/β

Bayesian Linear Regression Example (Straight-Line Fit)

• Single input variable x, single target variable t
• The goal is to fit the linear model y(x,w) = w0 + w1 x
• The goal of linear regression is to recover w = [w0, w1] given the samples

Data Generation

• Synthetic data are generated from f(x,w) = w0 + w1 x with parameter values w0 = −0.3 and w1 = 0.5
  – First choose xn from U(x | −1, 1), then evaluate f(xn, w)
  – Add Gaussian noise with standard deviation 0.2 to get the target tn
  – Precision parameter β = (1/0.2)^2 = 25
• For the prior over w we choose α = 2: p(w|α) = N(w | 0, α^{−1}I)

Sampling p(w) and p(w|t)

• Each sample of w represents a straight line in data space (modified as examples arrive)
  – [Figure: the distribution p(w) and six sampled lines y(x,w) = w0 + w1 x with no examples; the posterior p(w|t) and sampled lines with two examples]
• Goal of Bayesian linear regression: determine p(w|t)
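The sequential updating shown next can be reproduced numerically. Below is a minimal sketch, assuming the synthetic set-up above (w0 = −0.3, w1 = 0.5, noise standard deviation 0.2 so β = 25, prior precision α = 2); the helper names phi and posterior are illustrative, not from the slides.

```python
import numpy as np

alpha, beta = 2.0, 25.0                 # prior precision and noise precision from the slides
true_w = np.array([-0.3, 0.5])          # parameters used to generate the synthetic data

def phi(x):
    """Design matrix for the straight-line model y(x,w) = w0 + w1*x."""
    return np.column_stack([np.ones_like(x), x])

def posterior(Phi, t, alpha, beta):
    """Return m_N, S_N for prior N(w|0, alpha^{-1} I) and Gaussian noise precision beta."""
    S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 20)
t = true_w[0] + true_w[1] * x + rng.normal(0.0, 0.2, x.size)

# Sequential updating: the posterior after n points equals the batch posterior on
# those n points, so we simply recompute it as data arrive (1, 2, then 20 points).
for n in (1, 2, 20):
    m_N, S_N = posterior(phi(x[:n]), t[:n], alpha, beta)
    print(n, m_N)                       # the mean approaches the true parameters (-0.3, 0.5)
```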
Sequential Bayesian Learning

• Since there are only two parameters, we can plot the prior and posterior distributions in parameter space; here p(w|t) is plotted as each single data point arrives
• We look at the sequential update of the posterior
  – [Figure: each row shows the likelihood p(t|x,w) as a function of w, the prior/posterior p(w), and six samples of the regression function y(x,w) with w drawn from the posterior]
  – Before any data are observed: only the prior, with the true parameter value marked by a white cross
  – After the first data point (x1, t1) is observed: the posterior band represents values of (w0, w1) whose straight lines pass near that data point
  – After the second data point: the likelihood is shown for the 2nd point alone
  – After twenty data points: the likelihood is shown for the 20th point alone
  – With infinitely many points the posterior becomes a delta function centered at the true parameters (white cross)

Generalization of the Gaussian Prior

• The Gaussian prior over the parameters is
    p(w|α) = N(w | 0, α^{−1}I)
  – Maximization of the posterior ln p(w|t) is equivalent to minimization of the sum-of-squares error
    E(w) = (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2 + (λ/2) w^T w
• Other priors yield the Lasso and its variations:
    p(w|α) = [ (q/2) (α/2)^{1/q} (1/Γ(1/q)) ]^M exp( −(α/2) Σ_{j=1}^{M} |wj|^q )
  – q = 2 corresponds to the Gaussian
  – This corresponds to minimization of the regularized error function
    (1/2) Σ_{n=1}^{N} { tn − w^T φ(xn) }^2 + (λ/2) Σ_{j=1}^{M} |wj|^q

Predictive Distribution

• Usually we are not interested in the value of w itself, but in predicting t for a new value of x: p(t | t, X, x), or p(t|t) for short
  – The conditioning variables X and x are left out for convenience
• Marginalizing over the parameter variable w is the standard Bayesian approach
  – Sum rule of probability: p(t) = ∫ p(t, w) dw = ∫ p(t|w) p(w) dw
  – We can now write p(t|t) = ∫ p(t|w) p(w|t) dw

Predictive Distribution with α, β, x, t

• We can predict t for a new value of x using p(t|t) = ∫ p(t|w) p(w|t) dw, where the conditioning variables X and x have again been left out for convenience
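The excerpt stops before evaluating this integral. For the Gaussian posterior above, the standard closed form (not derived in this excerpt) is p(t | x, t, α, β) = N(t | mN^T φ(x), σN^2(x)) with σN^2(x) = 1/β + φ(x)^T SN φ(x), the sum of the noise variance and the parameter uncertainty. A minimal sketch continuing the straight-line example above; the helper name predictive is illustrative.

```python
import numpy as np

def predictive(x_star, m_N, S_N, beta):
    """Predictive mean and variance of t at a new scalar input x_star."""
    phi_x = np.array([1.0, x_star])             # basis for y(x,w) = w0 + w1*x
    mean = m_N @ phi_x                          # m_N^T phi(x)
    var = 1.0 / beta + phi_x @ S_N @ phi_x      # noise variance + parameter uncertainty
    return mean, var

# Example usage, with m_N, S_N, beta taken from the posterior sketch above:
# mean, var = predictive(0.5, m_N, S_N, beta)
```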
