
Introduction to Machine Learning: Bayesian Regression

Varun Chandola
Computer Science & Engineering
State University of New York at Buffalo, Buffalo, NY, USA
chandola@buffalo.edu
CSE 474/574

Outline
- Linear Regression: Problem Formulation; Learning Parameters
- Bayesian Linear Regression: Bayesian Regression; Estimating Bayesian Regression Parameters; Prediction with Bayesian Regression
- Handling Outliers in Regression
- Probabilistic Interpretation of Logistic Regression
- Logistic Regression - Training: Using Gradient Descent for Learning Weights; Using Newton's Method
- Regularization with Logistic Regression
- Handling Multiple Classes
- Bayesian Logistic Regression

Linear Regression
- There is one scalar target variable y (observed, not hidden)
- There is one vector input variable x
- Inductive bias: y = w^T x

Linear Regression Learning Task: learn w given training examples ⟨X, y⟩.

Probabilistic Interpretation
- y is assumed to be normally distributed: y ~ N(w^T x, σ^2)
- Equivalently, y = w^T x + ε, where ε ~ N(0, σ^2)
- y is a linear combination of the input variables
- Given w and σ^2, one can find the probability distribution of y for a given x

Learning Parameters - MLE Approach
- Find w and σ^2 that maximize the likelihood of the training data:

  ŵ_MLE = (X^T X)^{-1} X^T y
  σ̂^2_MLE = (1/N) (y - X ŵ_MLE)^T (y - X ŵ_MLE)

Putting a Prior on w
- "Penalize" large values of w
- A zero-mean Gaussian prior: p(w) = N(w | 0, τ^2 I)
- What is the posterior of w?

  p(w | D) ∝ [ ∏_i N(y_i | w^T x_i, σ^2) ] p(w)

- The posterior is also Gaussian

Parameter Estimation for Bayesian Regression
- Prior for w: w ~ N(w | 0, τ^2 I_D)
- Posterior for w:

  p(w | y, X) = p(y | X, w) p(w) / p(y | X) = N(w̄, Σ), where
  w̄ = (X^T X + (σ^2/τ^2) I_D)^{-1} X^T y
  Σ = σ^2 (X^T X + (σ^2/τ^2) I_D)^{-1}

- The posterior distribution for w is also Gaussian
- What will be the MAP estimate for w?

Prediction with Bayesian Regression
- For a new input x*, predict y*
- Point estimate of y*: y* = ŵ_MLE^T x*
- Treating y* as a Gaussian random variable:

  p(y* | x*) = N(ŵ_MLE^T x*, σ̂^2_MLE)
  p(y* | x*) = N(ŵ_MAP^T x*, σ̂^2_MAP)

Full Bayesian Treatment
- Treating both y and w as random variables:

  p(y* | x*) = ∫ p(y* | x*, w) p(w | X, y) dw

- This predictive distribution is also Gaussian!
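To make the preceding posterior and predictive formulas concrete, here is a minimal NumPy sketch (not part of the original slides); the function names, the toy data, and the chosen values of σ^2 and τ^2 are illustrative assumptions.

```python
import numpy as np

def posterior_w(X, y, sigma2, tau2):
    """Posterior N(w_bar, S) for w under the prior w ~ N(0, tau2 * I_D)."""
    D = X.shape[1]
    A = X.T @ X + (sigma2 / tau2) * np.eye(D)   # X^T X + (sigma^2/tau^2) I_D
    A_inv = np.linalg.inv(A)
    w_bar = A_inv @ (X.T @ y)                   # posterior mean (= MAP estimate)
    S = sigma2 * A_inv                          # posterior covariance
    return w_bar, S

def predict(x_star, w_bar, S, sigma2):
    """Gaussian predictive p(y* | x*) obtained by integrating out w."""
    mean = x_star @ w_bar
    var = sigma2 + x_star @ S @ x_star          # noise + parameter uncertainty
    return mean, var

# Toy usage with simulated data (illustrative values).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
w_bar, S = posterior_w(X, y, sigma2=0.01, tau2=1.0)
mean, var = predict(np.array([0.2, -0.1, 0.3]), w_bar, S, sigma2=0.01)
```

The predictive variance adds the observation noise σ^2 to the term x*^T Σ x* contributed by the posterior uncertainty in w, which is what the integral in the full Bayesian treatment evaluates to.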
Impact of Outliers on Regression
- Linear regression training is affected by the presence of outliers
- The squared term in the exponent of the Gaussian pdf is the culprit
- It is equivalent to the squared term in the loss
- How to handle this (Robust Regression)?
- Probabilistic approach: use a different distribution instead of the Gaussian for p(y | x); robust regression uses the Laplace distribution:

  p(y | x) ~ Laplace(w^T x, b)

- Geometric approach: least absolute deviations instead of least squares:

  J(w) = Σ_{i=1}^N |y_i - w^T x_i|

Logistic Regression
- y | x is a Bernoulli distribution with parameter θ = sigmoid(w^T x)
- When a new input x* arrives, we toss a coin with sigmoid(w^T x*) as the probability of heads
- If the outcome is heads, the predicted class is 1, else 0
- Learns a linear decision boundary

Learning Task for Logistic Regression: given training examples ⟨x_i, y_i⟩, i = 1, ..., N, learn w.

Learning Parameters
- MLE approach; assume that y ∈ {0, 1}
- What is the likelihood for a Bernoulli sample?
  - If y_i = 1, p(y_i) = θ_i = 1 / (1 + exp(-w^T x_i))
  - If y_i = 0, p(y_i) = 1 - θ_i = 1 / (1 + exp(w^T x_i))
  - In general, p(y_i) = θ_i^{y_i} (1 - θ_i)^{1 - y_i}
- Log-likelihood:

  LL(w) = Σ_{i=1}^N [ y_i log θ_i + (1 - y_i) log(1 - θ_i) ]

- There is no closed-form solution for maximizing the log-likelihood

Using Gradient Descent for Learning Weights
- Compute the gradient of LL with respect to w
- LL is a concave function of w with a unique global maximum:

  d LL(w)/dw = Σ_{i=1}^N (y_i - θ_i) x_i

- Update rule:

  w_{k+1} = w_k + η d LL(w_k)/dw_k

Using Newton's Method
- Setting η is sometimes tricky: too large gives incorrect results, too small gives slow convergence
- Another way to speed up convergence is Newton's method:

  w_{k+1} = w_k - η H_k^{-1} d LL(w_k)/dw_k

  (H_k is negative definite for the concave LL, so -H_k^{-1} ∇LL is an ascent direction)

What is the Hessian?
- The Hessian H is the matrix of second-order derivatives of the objective function
- Newton's method belongs to the family of second-order optimization algorithms
- For logistic regression, the Hessian is:

  H = - Σ_i θ_i (1 - θ_i) x_i x_i^T

Regularization with Logistic Regression
- Overfitting is an issue, especially with a large number of features
- Add a Gaussian prior: w ~ N(0, τ^2 I)
- Easy to incorporate in the gradient-based approach:

  LL'(w) = LL(w) - (λ/2) w^T w
  d LL'(w)/dw = d LL(w)/dw - λ w
  H' = H - λ I

  where I is the identity matrix.

Handling Multiple Classes
- p(y | x) ~ Multinoulli(θ)
- The Multinoulli parameter vector θ is defined as:

  θ_j = exp(w_j^T x) / Σ_{k=1}^C exp(w_k^T x)

- Multiclass logistic regression has C weight vectors to learn

Bayesian Logistic Regression
- How to get the posterior for w?
- Not easy. Why?

Laplace Approximation
- We do not know what the true posterior distribution for w is.
- Is there a close-enough (approximate) Gaussian distribution?
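As an illustration of how the pieces above fit together, here is a minimal NumPy sketch (not from the slides) of the Laplace approximation for Bayesian logistic regression: it finds w_MAP by Newton's method using the regularized gradient and Hessian given earlier, then approximates the posterior by a Gaussian centered at w_MAP whose covariance is the inverse of the negative Hessian there. The function names, the value of λ (lam), and the stopping tolerance are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_logistic(X, y, lam=1.0, n_iters=100, tol=1e-8):
    """Gaussian approximation N(w_MAP, Sigma) to the posterior over w.

    Newton's method is applied to the penalized log-likelihood
    LL'(w) = LL(w) - (lam/2) w^T w; Sigma is the inverse of the negative
    Hessian of LL' at w_MAP (the Laplace approximation).
    """
    N, D = X.shape
    w = np.zeros(D)
    H = -lam * np.eye(D)                            # placeholder, overwritten in the loop
    for _ in range(n_iters):
        theta = sigmoid(X @ w)
        grad = X.T @ (y - theta) - lam * w          # d LL'(w)/dw
        H = -(X.T * (theta * (1 - theta))) @ X - lam * np.eye(D)  # H' = H - lam * I
        step = np.linalg.solve(H, grad)             # H'^{-1} grad
        w = w - step                                # Newton ascent step on LL'
        if np.linalg.norm(step) < tol:
            break
    Sigma = np.linalg.inv(-H)                       # covariance of the Gaussian approximation
    return w, Sigma

# Toy usage with simulated binary labels (illustrative).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (sigmoid(X @ np.array([2.0, -1.0])) > rng.uniform(size=200)).astype(float)
w_map, Sigma = laplace_logistic(X, y, lam=1.0)
```

The resulting N(w_MAP, Sigma) can then stand in for the intractable true posterior, for example by averaging sigmoid(w^T x*) over samples w ~ N(w_MAP, Sigma) to obtain predictive class probabilities.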