
Bayesian Regression
David Rosenberg, New York University
DS-GA 1003, October 29, 2016

Bayesian Statistics: Recap

The Bayesian Method
1. Define the model:
   - Choose a probability model, or "likelihood model": $\{p(\mathcal{D} \mid \theta) : \theta \in \Theta\}$.
   - Choose a distribution $p(\theta)$, called the prior distribution.
2. After observing $\mathcal{D}$, compute the posterior distribution $p(\theta \mid \mathcal{D})$.
3. Choose an action based on $p(\theta \mid \mathcal{D})$:
   - e.g. $\mathbb{E}[\theta \mid \mathcal{D}]$ as a point estimate for $\theta$,
   - e.g. an interval $[a,b]$ with $p(\theta \in [a,b] \mid \mathcal{D}) = 0.95$.

The Posterior Distribution
By Bayes' rule, we can write the posterior distribution as
$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}.$$
- likelihood: $p(\mathcal{D} \mid \theta)$
- prior: $p(\theta)$
- marginal likelihood: $p(\mathcal{D})$
Note: $p(\mathcal{D})$ is just a normalizing constant for $p(\theta \mid \mathcal{D})$, so we can write
$$\underbrace{p(\theta \mid \mathcal{D})}_{\text{posterior}} \propto \underbrace{p(\mathcal{D} \mid \theta)}_{\text{likelihood}}\; \underbrace{p(\theta)}_{\text{prior}}.$$

Summary
- The prior represents our beliefs about $\theta$ before observing the data $\mathcal{D}$.
- The posterior represents the rationally "updated" beliefs after seeing $\mathcal{D}$.
- All inferences and actions are based on the posterior distribution.

Bayesian Gaussian Linear Regression

Bayesian Conditional Models
- Input space $\mathcal{X} = \mathbb{R}^d$; output space $\mathcal{Y} = \mathbb{R}$.
- Conditional probability model, or likelihood model: $\{p(y \mid x, \theta) : \theta \in \Theta\}$.
- "Conditional" here refers to conditioning on the input $x$: the $x$'s are not governed by our probability model, and everything conditioned on $x$ treats $x$ as known.
- Prior distribution: $p(\theta)$ on $\theta \in \Theta$.

Gaussian Regression Model
- Input space $\mathcal{X} = \mathbb{R}^d$; output space $\mathcal{Y} = \mathbb{R}$.
- Conditional probability (likelihood) model: $y \mid x, w \sim \mathcal{N}(w^\top x, \sigma^2)$, for some known $\sigma^2 > 0$.
- Parameter space? $\mathbb{R}^d$.
- Data: $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$.
- Notation: $y = (y_1, \dots, y_n)$ and $x = (x_1, \dots, x_n)$.
- Assume the $y_i$'s are conditionally independent, given $x$ and $w$.

Gaussian Likelihood
The likelihood of $w \in \mathbb{R}^d$ for the data $\mathcal{D}$ is
$$p(y \mid x, w) = \prod_{i=1}^n p(y_i \mid x_i, w) \quad \text{(by conditional independence)}$$
$$= \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y_i - w^\top x_i)^2}{2\sigma^2}\right).$$
You should see in your head that the MLE is
$$w_{\text{MLE}} = \operatorname*{argmax}_{w \in \mathbb{R}^d} p(y \mid x, w) = \operatorname*{argmin}_{w \in \mathbb{R}^d} \sum_{i=1}^n (y_i - w^\top x_i)^2.$$
(See https://davidrosenberg.github.io/ml2015/docs/8.Lab.glm.pdf, slide 5.)

Priors and Posteriors
Choose a Gaussian prior distribution $p(w)$ on $\mathbb{R}^d$: $w \sim \mathcal{N}(0, \Sigma_0)$ for some covariance matrix $\Sigma_0 \succ 0$ (i.e. $\Sigma_0$ is symmetric positive definite). The posterior distribution is
$$p(w \mid \mathcal{D}) = p(w \mid x, y) = p(y \mid x, w)\, p(w) / p(y) \propto p(y \mid x, w)\, p(w)$$
$$= \underbrace{\prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y_i - w^\top x_i)^2}{2\sigma^2}\right)}_{\text{likelihood}} \times \underbrace{|2\pi\Sigma_0|^{-1/2} \exp\left(-\tfrac{1}{2} w^\top \Sigma_0^{-1} w\right)}_{\text{prior}}.$$
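To make the Gaussian likelihood concrete, here is a minimal NumPy sketch (not from the slides; the synthetic data and variable names are my own) checking numerically that maximizing the Gaussian likelihood is the same as solving ordinary least squares:

```python
import numpy as np

# Minimal sketch with synthetic data: under y_i | x_i, w ~ N(w^T x_i, sigma^2),
# maximizing the likelihood in w is equivalent to minimizing the sum of
# squared residuals.

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.2
X = rng.normal(size=(n, d))                    # design matrix (rows are x_i^T)
w_true = rng.normal(size=d)
y = X @ w_true + sigma * rng.normal(size=n)

# MLE = ordinary least squares solution argmin_w sum_i (y_i - w^T x_i)^2
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

def log_likelihood(w):
    """Gaussian log-likelihood of w for the data (X, y) in the model above."""
    resid = y - X @ w
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - resid @ resid / (2 * sigma**2)

# The log-likelihood at the MLE is at least as large as at any other w,
# including the true parameter that generated the data.
assert log_likelihood(w_mle) >= log_likelihood(w_true)
```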
What does the posterior give us?
- Likelihood model: $y \mid x, w \sim \mathcal{N}(w^\top x, \sigma^2)$.
- Prior distribution: $w \sim \mathcal{N}(0, \Sigma_0)$.
- Given data, compute the posterior distribution $p(w \mid \mathcal{D})$.
- If we knew $w$, the best prediction function (for square loss) would be $\hat{y}(x) = \mathbb{E}[y \mid x, w] = w^\top x$.
- The prior $p(w)$ and posterior $p(w \mid \mathcal{D})$ therefore give distributions over prediction functions!

Gaussian Regression Example

Example in 1 Dimension: Setup
- Input space $\mathcal{X} = [-1, 1]$; output space $\mathcal{Y} = \mathbb{R}$.
- Suppose that for any $x$, $y$ is generated as $y = w_0 + w_1 x + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, 0.2^2)$.
- Written another way, the likelihood model is $y \mid x, w_0, w_1 \sim \mathcal{N}(w_0 + w_1 x,\; 0.2^2)$.
- What's the parameter space? $\mathbb{R}^2$.
- Prior distribution: $w = (w_0, w_1) \sim \mathcal{N}(0, \tfrac{1}{2} I)$.

Example in 1 Dimension: Prior Situation
- Prior distribution: $w = (w_0, w_1) \sim \mathcal{N}(0, \tfrac{1}{2} I)$.
- On the right, $y(x) = \mathbb{E}[y \mid x, w]$ for randomly chosen $w \sim \mathcal{N}(0, \tfrac{1}{2} I)$; that is, $y(x) = w_0 + w_1 x$ for random $(w_0, w_1) \sim \mathcal{N}(0, \tfrac{1}{2} I)$.
[Figure: Bishop's PRML, Fig. 3.7]

Example in 1 Dimension: 1 Observation
- On the left: the posterior distribution; the white '+' indicates the true parameters.
- On the right: the blue circle indicates the training observation.
[Figure: Bishop's PRML, Fig. 3.7]

Example in 1 Dimension: 2 and 20 Observations
[Figure: Bishop's PRML, Fig. 3.7]

Gaussian Regression Continued

Closed Form for Posterior
Model:
$$w \sim \mathcal{N}(0, \Sigma_0), \qquad y_i \mid x, w \ \text{i.i.d.}\ \mathcal{N}(w^\top x_i, \sigma^2).$$
Let $X$ be the design matrix and $y$ the response column vector. The posterior distribution is Gaussian:
$$w \mid \mathcal{D} \sim \mathcal{N}(\mu_P, \Sigma_P)$$
$$\mu_P = (X^\top X + \sigma^2 \Sigma_0^{-1})^{-1} X^\top y$$
$$\Sigma_P = (\sigma^{-2} X^\top X + \Sigma_0^{-1})^{-1}$$
The posterior variance $\Sigma_P$ gives us a natural uncertainty measure. (See Rasmussen and Williams' Gaussian Processes for Machine Learning, Ch. 2.1: http://www.gaussianprocess.org/gpml/chapters/RW2.pdf)

The MAP estimator and the posterior mean are both given by
$$\mu_P = (X^\top X + \sigma^2 \Sigma_0^{-1})^{-1} X^\top y.$$
For the prior variance $\Sigma_0 = \frac{\sigma^2}{\lambda} I$, we get
$$\mu_P = (X^\top X + \lambda I)^{-1} X^\top y,$$
which is of course the ridge regression solution.

Posterior Mean and Posterior Mode (MAP)
The posterior density for $\Sigma_0 = \frac{\sigma^2}{\lambda} I$ is
$$p(w \mid \mathcal{D}) \propto \underbrace{\exp\left(-\frac{\lambda}{2\sigma^2} \|w\|^2\right)}_{\text{prior}}\; \underbrace{\prod_{i=1}^n \exp\left(-\frac{(y_i - w^\top x_i)^2}{2\sigma^2}\right)}_{\text{likelihood}}.$$
To find the MAP, it suffices to minimize the negative log posterior:
$$\hat{w}_{\text{MAP}} = \operatorname*{argmin}_{w \in \mathbb{R}^d} \left[-\log p(w \mid \mathcal{D})\right] = \operatorname*{argmin}_{w \in \mathbb{R}^d}\; \underbrace{\sum_{i=1}^n (y_i - w^\top x_i)^2}_{\text{from log-likelihood}} + \underbrace{\lambda \|w\|^2}_{\text{from log-prior}},$$
which is the ridge regression objective. A small numerical check of these formulas appears in the sketch below.
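The following is a short NumPy sketch (my own illustrative data, $\sigma$, and $\Sigma_0$, not values from the lecture) that computes $\mu_P$ and $\Sigma_P$ directly from the closed-form expressions above and verifies that, with $\Sigma_0 = \frac{\sigma^2}{\lambda} I$, the posterior mean matches the ridge regression solution:

```python
import numpy as np

# Sketch of the closed-form Gaussian posterior from the slides:
#   w | D ~ N(mu_P, Sigma_P)
#   mu_P    = (X^T X + sigma^2 Sigma0^{-1})^{-1} X^T y
#   Sigma_P = (sigma^{-2} X^T X + Sigma0^{-1})^{-1}
# The data, sigma, and Sigma0 below are illustrative choices.

rng = np.random.default_rng(1)
n, d, sigma = 30, 2, 0.2
X = rng.uniform(-1.0, 1.0, size=(n, d))
w_true = np.array([0.3, -0.5])
y = X @ w_true + sigma * rng.normal(size=n)

Sigma0 = 0.5 * np.eye(d)                          # prior: w ~ N(0, Sigma0)
Sigma0_inv = np.linalg.inv(Sigma0)

Sigma_P = np.linalg.inv(X.T @ X / sigma**2 + Sigma0_inv)            # posterior covariance
mu_P = np.linalg.solve(X.T @ X + sigma**2 * Sigma0_inv, X.T @ y)    # posterior mean = MAP

# With Sigma0 = (sigma^2 / lam) I, the posterior mean is the ridge solution.
lam = sigma**2 / 0.5                              # chosen so that Sigma0 = (sigma^2/lam) I
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
assert np.allclose(mu_P, w_ridge)
```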
Predictive Distribution
Given a new input point $x_{\text{new}}$, how do we predict $y_{\text{new}}$? The predictive distribution is
$$p(y_{\text{new}} \mid x_{\text{new}}, \mathcal{D}) = \int p(y_{\text{new}} \mid x_{\text{new}}, \theta, \mathcal{D})\, p(\theta \mid \mathcal{D})\, d\theta = \int p(y_{\text{new}} \mid x_{\text{new}}, \theta)\, p(\theta \mid \mathcal{D})\, d\theta.$$
For Gaussian regression, the predictive distribution has a closed form.

Closed Form for Predictive Distribution
Model:
$$w \sim \mathcal{N}(0, \Sigma_0), \qquad y_i \mid x, w \ \text{i.i.d.}\ \mathcal{N}(w^\top x_i, \sigma^2).$$
The predictive distribution
$$p(y_{\text{new}} \mid x_{\text{new}}, \mathcal{D}) = \int p(y_{\text{new}} \mid x_{\text{new}}, w)\, p(w \mid \mathcal{D})\, dw$$
averages over the prediction for each $w$, weighted by the posterior distribution. In closed form,
$$y_{\text{new}} \mid x_{\text{new}}, \mathcal{D} \sim \mathcal{N}(\eta_{\text{new}}, \sigma^2_{\text{new}})$$
$$\eta_{\text{new}} = \mu_P^\top x_{\text{new}}$$
$$\sigma^2_{\text{new}} = \underbrace{x_{\text{new}}^\top \Sigma_P x_{\text{new}}}_{\text{from variance in } w} + \underbrace{\sigma^2}_{\text{inherent variance in } y}$$

Predictive Distributions
With predictive distributions, we can give a mean prediction with error bands.
[Figure: Rasmussen and Williams' Gaussian Processes for Machine Learning, Fig. 2.1(b)]
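Putting the pieces together, here is a brief NumPy sketch of how the predictive mean and an error band would be computed from $\mu_P$, $\Sigma_P$, and $\sigma^2$. The posterior quantities below are placeholders of my own, not values from the lecture; in practice they would come from the posterior computation above.

```python
import numpy as np

# Sketch of the closed-form predictive distribution from the slides:
#   y_new | x_new, D ~ N(eta_new, sigma_new^2)
#   eta_new     = mu_P^T x_new
#   sigma_new^2 = x_new^T Sigma_P x_new + sigma^2
# The first variance term reflects uncertainty in w; the second is the
# inherent noise variance in y.

def predictive(x_new, mu_P, Sigma_P, sigma):
    """Predictive mean and standard deviation at a new input x_new."""
    mean = mu_P @ x_new
    var = x_new @ Sigma_P @ x_new + sigma**2
    return mean, np.sqrt(var)

# Example with placeholder posterior quantities for a 2-dimensional w:
mu_P = np.array([0.3, -0.5])
Sigma_P = np.array([[0.010, 0.002],
                    [0.002, 0.015]])
sigma = 0.2

x_new = np.array([0.7, -0.2])
mean, sd = predictive(x_new, mu_P, Sigma_P, sigma)
lower, upper = mean - 1.96 * sd, mean + 1.96 * sd   # ~95% error band, as in the figure
```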