Bayesian Linear Regression
Sargur Srihari
[email protected]

Topics in Bayesian Regression
• Recall Max Likelihood Linear Regression
• Parameter Distribution
• Predictive Distribution
• Equivalent Kernel

Linear Regression: model complexity M
• Polynomial regression: y(x,w) = w_0 + w_1 x + w_2 x^2 + .. + w_M x^M = ∑_{j=0}^{M} w_j x^j
• [Figure: red lines are the best fits with M = 0, 1, 3, 9 and N = 10 points drawn from sin(2πx). M = 0 and M = 1 are poor representations of sin(2πx), M = 3 is the best fit, and M = 9 over-fits.]

Max Likelihood Regression
• Input vector x, basis functions {φ_1(x),.., φ_M(x)}:
    y(x,w) = ∑_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x)
  – Radial basis functions: φ_j(x) = exp( −(1/2) (x − µ_j)^T Σ^{−1} (x − µ_j) )
• Objective function (max likelihood objective with N examples {x_1,..,x_N}, equivalent to the mean squared error objective):
    E(w) = (1/2) ∑_{n=1}^{N} { t_n − w^T φ(x_n) }^2
  – Regularized MSE with N examples (λ is the regularization coefficient):
    E(w) = (1/2) ∑_{n=1}^{N} { t_n − w^T φ(x_n) }^2 + (λ/2) w^T w
• Closed-form ML solution: w_ML = (Φ^T Φ)^{−1} Φ^T t
  – where Φ is the N × M design matrix with elements Φ_{nj} = φ_j(x_n), and (Φ^T Φ)^{−1} Φ^T is the Moore-Penrose pseudo-inverse
  – Regularized solution: w = (λI + Φ^T Φ)^{−1} Φ^T t
• Gradient Descent: w^{(τ+1)} = w^{(τ)} − η ∇E, with
    ∇E = − ∑_{n=1}^{N} { t_n − w^{(τ)T} φ(x_n) } φ(x_n)
  – Regularized version: ∇E = − ∑_{n=1}^{N} { t_n − w^{(τ)T} φ(x_n) } φ(x_n) + λ w^{(τ)}
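The closed-form solutions above map directly to a few lines of NumPy. The sketch below is an added illustration, not code from the slides: it builds a polynomial design matrix Φ, computes w_ML with the Moore-Penrose pseudo-inverse, and computes the regularized solution. The choice of polynomial basis, the toy data set, and the value of λ are assumptions made only for the example.

```python
import numpy as np

def design_matrix(x, M):
    # N x M design matrix with polynomial basis functions phi_j(x) = x**j, j = 0..M-1
    return np.vander(x, M, increasing=True)

def w_ml(Phi, t):
    # Closed-form ML solution w_ML = (Phi^T Phi)^{-1} Phi^T t via the Moore-Penrose pseudo-inverse
    return np.linalg.pinv(Phi) @ t

def w_regularized(Phi, t, lam):
    # Regularized solution w = (lam*I + Phi^T Phi)^{-1} Phi^T t
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Toy data: N = 10 noisy samples of sin(2*pi*x), as in the model-complexity example
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=10)

Phi = design_matrix(x, M=4)            # M = 4 basis functions (cubic polynomial)
print("w_ML         :", w_ml(Phi, t))
print("w_regularized:", w_regularized(Phi, t, lam=0.01))
```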
Shortcomings of MLE
• The M.L.E. of the parameters w does not address
  – M (model complexity): how many basis functions?
  – It is controlled by the data size N
    • More data allows a better fit without overfitting
    • Regularization also controls overfit (λ controls the effect):
      E(w) = E_D(w) + λ E_W(w), where E_D(w) = (1/2) ∑_{n=1}^{N} { t_n − w^T φ(x_n) }^2 and E_W(w) = (1/2) w^T w
• But M and the choice of the φ_j are still important
  – M can be determined by holdout, but this is wasteful of data
• Model complexity and over-fitting are better handled using the Bayesian approach

Bayesian Linear Regression
• Using Bayes rule, the posterior is proportional to Likelihood × Prior:
    p(w|t) = p(t|w) p(w) / p(t)
  – where p(t|w) is the likelihood of the observed data
  – p(w) is the prior distribution over the parameters
• We will look at:
  – A normal distribution for the prior p(w)
  – The likelihood p(t|w) as a product of Gaussians based on the noise model
  – And conclude that the posterior is also Gaussian

Gaussian Prior Parameters
• Assume a multivariate Gaussian prior for w (which has components w_0,..,w_{M−1}):
    p(w) = N(w | m_0, S_0)  with mean m_0 and covariance matrix S_0
• If we choose S_0 = α^{−1} I, the variances of the weights are all equal to α^{−1} and the covariances are zero
  – p(w) is then zero mean (m_0 = 0) and isotropic over the weights (same variances)

Likelihood of Data is Gaussian
• Assume a noise precision parameter β:
    t = y(x,w) + ε, where ε is defined probabilistically as Gaussian noise, p(t|x,w,β) = N(t | y(x,w), β^{−1})
  – Note that the output t is a scalar
• The likelihood of t = {t_1,..,t_N} is then
    p(t | X, w, β) = ∏_{n=1}^{N} N(t_n | w^T φ(x_n), β^{−1})
  – This is the probability of the target data t given the parameters w and input X = {x_1,..,x_N}
  – Due to the Gaussian noise, the likelihood p(t|w) is also a Gaussian

Posterior Distribution is also Gaussian
• Prior: p(w) = N(w | m_0, S_0), i.e., it is Gaussian
• The likelihood comes from Gaussian noise: p(t | X, w, β) = ∏_{n=1}^{N} N(t_n | w^T φ(x_n), β^{−1})
• It follows that the posterior p(w|t) is also Gaussian
• Proof: use a standard result for Gaussians:
  – If the marginal p(w) and the conditional p(t|w) have Gaussian forms, then the marginal p(t) and the conditional p(w|t) are also Gaussian
  – Let p(w) = N(w | µ, Λ^{−1}) and p(t|w) = N(t | Aw + b, L^{−1})
  – Then the marginal is p(t) = N(t | Aµ + b, L^{−1} + A Λ^{−1} A^T)
    and the conditional is p(w|t) = N(w | Σ{A^T L (t − b) + Λµ}, Σ), where Σ = (Λ + A^T L A)^{−1}

Exact form of Posterior Distribution
• We have p(w) = N(w | m_0, S_0) and p(t | X, w, β) = ∏_{n=1}^{N} N(t_n | w^T φ(x_n), β^{−1})
• The posterior is also Gaussian, written directly as p(w|t) = N(w | m_N, S_N)
  – where the mean of the posterior is m_N = S_N (S_0^{−1} m_0 + β Φ^T t)
  – and the covariance matrix of the posterior is given by S_N^{−1} = S_0^{−1} + β Φ^T Φ
  – Φ is the N × M design matrix with elements Φ_{nj} = φ_j(x_n)
• [Figure: prior and posterior in weight space (w_0, w_1) for scalar input x and y(x,w) = w_0 + w_1 x, with prior p(w|α) = N(w | 0, α^{−1} I).]

Properties of Posterior
1. Since the posterior p(w|t) = N(w | m_N, S_N) is Gaussian, its mode coincides with its mean
   – Thus the maximum posterior weight is w_MAP = m_N
2. For an infinitely broad prior S_0 = α^{−1} I, i.e., precision α → 0, the mean m_N reduces to the maximum likelihood value, i.e., the mean is the solution vector w_ML = (Φ^T Φ)^{−1} Φ^T t
3. If N = 0, the posterior reverts to the prior
4. If data points arrive sequentially, the posterior at any stage acts as the prior distribution for the subsequent data points

Choose a simple Gaussian prior p(w)
• For y(x,w) = w_0 + w_1 x, choose a zero-mean (m_0 = 0) isotropic (same variances) Gaussian prior with a single precision parameter α:
    p(w | α) = N(w | 0, α^{−1} I)
• The corresponding posterior distribution is p(w|t) = N(w | m_N, S_N), where
    m_N = β S_N Φ^T t  and  S_N^{−1} = α I + β Φ^T Φ
  – Note: β is the noise precision and α is the precision of the parameters in the prior
  – With infinitely many samples the posterior concentrates on a point estimate of w

Equivalence to MLE with Regularization
• Since p(t | X, w, β) = ∏_{n=1}^{N} N(t_n | w^T φ(x_n), β^{−1}) and p(w | α) = N(w | 0, α^{−1} I), we have
    p(w|t) ∝ ∏_{n=1}^{N} N(t_n | w^T φ(x_n), β^{−1}) N(w | 0, α^{−1} I)
• The log of the posterior is
    ln p(w|t) = −(β/2) ∑_{n=1}^{N} { t_n − w^T φ(x_n) }^2 − (α/2) w^T w + const
• Thus maximization of the posterior is equivalent to minimization of the sum-of-squares error
    E(w) = (1/2) ∑_{n=1}^{N} { t_n − w^T φ(x_n) }^2 + (λ/2) w^T w
  with the addition of the quadratic regularization term w^T w, with λ = α/β

Bayesian Linear Regression Example (Straight Line Fit)
• Single input variable x, single target variable t
• Goal is to fit the linear model y(x,w) = w_0 + w_1 x
• The goal of linear regression is to recover w = [w_0, w_1] given the samples

Data Generation
• Synthetic data are generated from f(x,w) = w_0 + w_1 x with parameter values w_0 = −0.3 and w_1 = 0.5
  – First choose x_n from U(x | −1, 1), then evaluate f(x_n, w)
  – Add Gaussian noise with standard deviation 0.2 to get the target t_n
  – Precision parameter β = (1/0.2)^2 = 25
• For the prior over w we choose α = 2:  p(w | α) = N(w | 0, α^{−1} I)

Sampling p(w) and p(w|t)
• Each sample of w represents a straight line y(x,w) = w_0 + w_1 x in data space (modified by examples)
• [Figure: the distribution p(w) over (w_0, w_1) and six sampled lines with no examples; the posterior p(w|t) and six sampled lines after two examples.]
• Goal of Bayesian Linear Regression: determine p(w|t)
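A minimal NumPy sketch of this example (an added illustration, not code from the slides): it generates data with w_0 = −0.3, w_1 = 0.5 and noise precision β = 25, computes the posterior m_N, S_N from the formulas above with prior precision α = 2, and draws six weight samples from the prior and from the posterior after two examples, as in the figure.

```python
import numpy as np

# Settings from the data-generation example: alpha = 2, beta = 25, true w = (-0.3, 0.5)
alpha, beta = 2.0, 25.0
w_true = np.array([-0.3, 0.5])

rng = np.random.default_rng(1)

def make_data(n):
    # x_n ~ U(-1, 1); t_n = w0 + w1*x_n + Gaussian noise with std 0.2 (= 1/sqrt(beta))
    x = rng.uniform(-1.0, 1.0, size=n)
    t = w_true[0] + w_true[1] * x + rng.normal(0.0, 1.0 / np.sqrt(beta), size=n)
    return x, t

def posterior(x, t):
    # Posterior p(w|t) = N(w | m_N, S_N) with
    #   S_N^{-1} = alpha*I + beta * Phi^T Phi  and  m_N = beta * S_N Phi^T t
    Phi = np.column_stack([np.ones_like(x), x])   # design matrix for y(x,w) = w0 + w1*x
    S_N_inv = alpha * np.eye(2) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

# Prior: six sampled lines correspond to w drawn from N(0, alpha^{-1} I)
w_prior_samples = rng.multivariate_normal(np.zeros(2), np.eye(2) / alpha, size=6)

# Posterior after two examples: sampled lines concentrate around the true parameters
x, t = make_data(2)
m_N, S_N = posterior(x, t)
w_post_samples = rng.multivariate_normal(m_N, S_N, size=6)
print("posterior mean m_N:", m_N)
```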
Sequential Bayesian Learning
• Since there are only two parameters (w_0, w_1), we can plot the prior and posterior distributions in parameter space
• We look at the sequential update of the posterior as data points arrive one at a time
• [Figure: each row shows, left to right, the likelihood p(t|x,w) of the latest data point alone as a function of w, the prior/posterior p(w|t), and six sample regression functions y(x,w) with w drawn from that posterior. Rows: before any data point is observed; after the first data point (x_1, t_1); after the second data point; after twenty data points. The true parameter value is marked with a white cross.]
  – We are plotting p(w|t); each likelihood panel is for a single data point
  – In data space, the band represents the values of (w_0, w_1) corresponding to straight lines passing near the data point
  – With infinitely many points the posterior becomes a delta function centered at the true parameter values (white cross)

Generalization of Gaussian prior
• The Gaussian prior over the parameters is p(w | α) = N(w | 0, α^{−1} I)
  – Maximization of the posterior ln p(w|t) is then equivalent to minimization of the sum-of-squares error
    E(w) = (1/2) ∑_{n=1}^{N} { t_n − w^T φ(x_n) }^2 + (λ/2) w^T w
• Other priors yield the Lasso and variations:
    p(w | α) = [ (q/2) (α/2)^{1/q} (1/Γ(1/q)) ]^M exp( −(α/2) ∑_{j=1}^{M} |w_j|^q )
  – q = 2 corresponds to the Gaussian (q = 1 gives the Lasso)
  – This corresponds to minimization of the regularized error function
    (1/2) ∑_{n=1}^{N} { t_n − w^T φ(x_n) }^2 + (λ/2) ∑_{j=1}^{M} |w_j|^q

Predictive Distribution
• Usually we are not interested in the value of w itself
• But in predicting t for a new value of x: p(t | t, X, x), or p(t | t), leaving out the conditioning variables X and x for convenience
• Marginalizing over the parameter variable w is the standard Bayesian approach
  – Sum rule of probability: p(t) = ∫ p(t, w) dw = ∫ p(t|w) p(w) dw
  – We can now write p(t | t) = ∫ p(t|w) p(w|t) dw

Predictive Distribution with α, β, x, t
• We can predict t for a new value of x using p(t | t) = ∫ p(t|w) p(w|t) dw (the conditioning variables X and x are again left out for convenience)
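This marginalization can be approximated numerically by sampling from the posterior. The sketch below is an added illustration, not from the slides: it draws w from the Gaussian posterior N(m_N, S_N) and averages the Gaussian noise model p(t|w) over the draws. For this model the result approaches the standard closed-form Gaussian predictive with mean m_N^T φ(x) and variance 1/β + φ(x)^T S_N φ(x), a result the excerpt above does not reach.

```python
import numpy as np

def predictive_mc(x_new, m_N, S_N, beta, n_samples=100_000, rng=None):
    """Monte Carlo approximation of p(t | t), the integral of p(t|w) p(w|t) dw,
    for the straight-line model y(x, w) = w0 + w1*x.

    Draw w ~ p(w|t) = N(m_N, S_N); for each draw the noise model gives
    p(t|w) = N(t | w^T phi(x), 1/beta).  Returns the mean and variance
    of the resulting mixture."""
    rng = np.random.default_rng() if rng is None else rng
    phi = np.array([1.0, x_new])               # phi(x) = (1, x) for the straight line
    w = rng.multivariate_normal(m_N, S_N, size=n_samples)
    y = w @ phi                                # y(x, w) for each posterior sample
    mean = y.mean()
    var = 1.0 / beta + y.var()                 # noise variance + parameter uncertainty
    return mean, var

# Example usage with a posterior (m_N, S_N) computed as in the earlier sketch:
# mean, var = predictive_mc(0.5, m_N, S_N, beta=25.0)
```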