
Bayesian Regression

David Rosenberg

New York University

October 29, 2016


Bayesian Statistics: Recap

The Bayesian Method

1. Define the model:
   Choose a likelihood model {p(D | θ) | θ ∈ Θ}.
   Choose a distribution p(θ), called the prior distribution.

2. After observing D, compute the posterior distribution p(θ | D).

3. Choose an action based on p(θ | D):
   e.g. use E[θ | D] as a point estimate for θ
   e.g. report an interval [a, b], where p(θ ∈ [a, b] | D) = 0.95

The Posterior Distribution

By Bayes’ rule, can write the posterior distribution as

p(θ | D) = p(D | θ) p(θ) / p(D).

likelihood: p(D | θ); prior: p(θ); marginal likelihood (evidence): p(D).
Note: p(D) is just a normalizing constant for p(θ | D). Can write

p(θ | D) ∝ p(D | θ)p(θ).

(posterior ∝ likelihood × prior)
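To make the recipe concrete, here is a minimal numerical sketch (not from the slides), assuming a made-up one-dimensional model: data drawn from N(θ, 1) with prior θ ∼ N(0, 1). The posterior is evaluated on a grid, normalized, and used for a point estimate and an approximate 95% interval.

```python
import numpy as np

# Minimal sketch: posterior ∝ likelihood × prior, evaluated on a grid.
# Hypothetical model: data ~ N(theta, 1), prior theta ~ N(0, 1).
rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=20)             # observed D

theta_grid = np.linspace(-5, 5, 2001)                      # candidate theta values
log_prior = -0.5 * theta_grid**2                           # log N(0, 1), up to a constant
log_lik = np.array([-0.5 * np.sum((data - t)**2) for t in theta_grid])

log_post = log_lik + log_prior                             # log(likelihood × prior)
post = np.exp(log_post - log_post.max())                   # unnormalized posterior
post /= np.trapz(post, theta_grid)                         # normalize to get p(theta | D)

post_mean = np.trapz(theta_grid * post, theta_grid)        # E[theta | D] as a point estimate
cdf = np.cumsum(post) * (theta_grid[1] - theta_grid[0])    # approximate CDF
interval = (theta_grid[np.searchsorted(cdf, 0.025)],       # approximate 95% credible interval
            theta_grid[np.searchsorted(cdf, 0.975)])
print(post_mean, interval)
```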

Summary

Prior represents belief about θ before observing data D.

Posterior represents the rationally “updated” beliefs after seeing D.

All inferences and actions are based on the posterior distribution.


Bayesian Gaussian Linear Regression

Bayesian Conditional Models

Input space: X = R^d
Output space: Y = R

Conditional probability model, or likelihood model:

{p(y | x,θ) | θ ∈ Θ}

“Conditional” here refers to conditioning on the input x; the x’s are not governed by our probability model.
Everything conditioned on x means “x is known.”
Prior distribution: p(θ) on θ ∈ Θ

Gaussian Regression Model

Input space: X = R^d
Output space: Y = R
Conditional probability model, or likelihood model:

y | x, w ∼ N(wᵀx, σ²),

for some known σ² > 0. What’s the parameter space? R^d.

Data: D = {(x1,y1),...,(xn,yn)}

Notation: y = (y1,...,yn) and x = (x1,...,xn).

Assume the yi’s are conditionally independent, given x and w.

Gaussian Likelihood

The likelihood of w ∈ R^d for the data D is

p(y | x, w) = ∏_{i=1}^n p(yi | xi, w)    (by conditional independence)
            = ∏_{i=1}^n [1/(σ√(2π))] exp( −(yi − wᵀxi)² / (2σ²) )

You should see in your head¹ that the MLE is

w*_MLE = argmax_{w ∈ R^d} p(y | x, w)
       = argmin_{w ∈ R^d} ∑_{i=1}^n (yi − wᵀxi)².

¹ See https://davidrosenberg.github.io/ml2015/docs/8.Lab.glm.pdf, slide 5.
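As a sanity check on this claim, here is a short sketch (with made-up data) showing that maximizing the Gaussian likelihood is the same as solving an ordinary least-squares problem.

```python
import numpy as np

# Under the Gaussian likelihood, the MLE for w is the least-squares solution.
# X (design matrix), y (responses), and w_true are synthetic, for illustration only.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.2 * rng.normal(size=n)          # y_i = w^T x_i + Gaussian noise

w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)      # argmin_w sum_i (y_i - w^T x_i)^2
print(w_mle)                                        # should be close to w_true
```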

Priors and Posteriors

Choose a Gaussian prior distribution p(w) on R^d:

w ∼ N(0,Σ0)

for some covariance matrix Σ0 ≻ 0 (i.e. Σ0 is spd).
Posterior distribution:

p(w | D) = p(w | x, y)
         = p(y | x, w) p(w) / p(y)
         ∝ p(y | x, w) p(w)
         = ∏_{i=1}^n [1/(σ√(2π))] exp( −(yi − wᵀxi)² / (2σ²) )    (likelihood)
           × |2πΣ0|^{−1/2} exp( −(1/2) wᵀΣ0^{−1}w )               (prior)

What does the posterior give us?

Likelihood model: y | x, w ∼ N(wᵀx, σ²)

Prior distribution: w ∼ N(0, Σ0)
Given data, compute the posterior distribution p(w | D).
If we knew w, the best prediction function (for square loss) would be

ŷ(x) = E[y | x, w] = wᵀx.

Prior p(w) and posterior p(w | D) give distributions over prediction functions!


Gaussian Regression Example

Example in 1-Dimension: Setup

Input space: X = [−1, 1]
Output space: Y = R
Let’s suppose for any x, y is generated as

y = w0 + w1x + ε,

where ε ∼ N(0, 0.2²). Written another way, the likelihood model is

y | x, w0, w1 ∼ N(w0 + w1x, 0.2²).

What’s the parameter space? R².
Prior distribution: w = (w0, w1) ∼ N(0, ½ I)
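A quick sketch of sampling data from this generative model; the “true” (w0, w1) below is hypothetical, chosen only to have something to simulate from.

```python
import numpy as np

# Simulate from y = w0 + w1*x + eps, eps ~ N(0, 0.2^2), with x in [-1, 1].
rng = np.random.default_rng(0)
w0_true, w1_true = -0.3, 0.5            # hypothetical true parameters (illustration only)

n = 20
x = rng.uniform(-1.0, 1.0, size=n)      # inputs from X = [-1, 1]
y = w0_true + w1_true * x + rng.normal(0.0, 0.2, size=n)
```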

Example in 1-Dimension: Prior Situation

Prior distribution: w = (w0, w1) ∼ N(0, ½ I)

On right: y(x) = E[y | x, w] for randomly chosen w ∼ N(0, ½ I),
i.e. y(x) = w0 + w1x for random (w0, w1) ∼ p(θ) = N(0, ½ I).
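One way to produce a picture like the prior-sample plot described here (cf. Bishop’s PRML Fig 3.7, credited below) is to draw several w = (w0, w1) from the prior N(0, ½ I) and plot the corresponding lines; a sketch assuming matplotlib is available.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample prediction functions y(x) = w0 + w1*x with (w0, w1) ~ N(0, 0.5*I).
rng = np.random.default_rng(0)
xs = np.linspace(-1, 1, 100)

for _ in range(6):
    w0, w1 = rng.multivariate_normal(mean=[0.0, 0.0], cov=0.5 * np.eye(2))
    plt.plot(xs, w0 + w1 * xs)

plt.xlabel("x")
plt.ylabel("y(x)")
plt.title("Prediction functions drawn from the prior")
plt.show()
```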

Bishop’s PRML Fig 3.7

Example in 1-Dimension: 1 Observation

On left: posterior distribution; the white ‘+’ indicates the true parameter values.
On right: the blue circle indicates the training observation.

Bishop’s PRML Fig 3.7

Example in 1-Dimension: 2 and 20 Observations

Bishop’s PRML Fig 3.7

Gaussian Regression Continued

Closed Form for Posterior

Model:

w ∼ N(0, Σ0)
yi | x, w  i.i.d.  N(wᵀxi, σ²)
Design matrix X; response column vector y.
The posterior distribution is a Gaussian distribution:

w | D ∼ N(µP, ΣP)
µP = (XᵀX + σ²Σ0^{−1})^{−1} Xᵀy
ΣP = (σ^{−2}XᵀX + Σ0^{−1})^{−1}

Posterior Variance ΣP gives us a natural uncertainty measure.
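A minimal sketch of these formulas in code (the design matrix, responses, and prior below are synthetic, for illustration only):

```python
import numpy as np

def gaussian_posterior(X, y, sigma2, Sigma0):
    """Return (mu_P, Sigma_P) for the posterior w | D ~ N(mu_P, Sigma_P)."""
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma_P = np.linalg.inv(X.T @ X / sigma2 + Sigma0_inv)            # (sigma^-2 X^T X + Sigma0^-1)^-1
    mu_P = np.linalg.solve(X.T @ X + sigma2 * Sigma0_inv, X.T @ y)    # (X^T X + sigma^2 Sigma0^-1)^-1 X^T y
    return mu_P, Sigma_P

# Toy usage with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sigma2 = 0.2**2
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, np.sqrt(sigma2), size=50)

mu_P, Sigma_P = gaussian_posterior(X, y, sigma2, Sigma0=np.eye(3))
print(mu_P)              # posterior mean
print(np.diag(Sigma_P))  # per-coordinate posterior variances (uncertainty)
```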

See Rasmussen and Williams’ Gaussian Processes for Machine Learning, Ch. 2.1: http://www.gaussianprocess.org/gpml/chapters/RW2.pdf

Closed Form for Posterior

Posterior distribution is a Gaussian distribution:

w | D ∼ N(µP, ΣP)
µP = (XᵀX + σ²Σ0^{−1})^{−1} Xᵀy
ΣP = (σ^{−2}XᵀX + Σ0^{−1})^{−1}

The MAP estimator and the posterior mean are given by

ŵMAP = µP = (XᵀX + σ²Σ0^{−1})^{−1} Xᵀy

For the prior variance Σ0 = (σ²/λ) I, we get
µP = (XᵀX + λI)^{−1} Xᵀy,
which is of course the ridge regression solution.
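A quick numerical check of this identity (synthetic data; the choice λ = 1 is arbitrary):

```python
import numpy as np

# With Sigma0 = (sigma^2 / lambda) * I, the posterior mean equals the ridge solution.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.2 * rng.normal(size=50)

sigma2, lam = 0.2**2, 1.0
Sigma0_inv = (lam / sigma2) * np.eye(3)                         # inverse of (sigma^2/lambda) I

mu_P = np.linalg.solve(X.T @ X + sigma2 * Sigma0_inv, X.T @ y)  # posterior mean
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)   # ridge solution
print(np.allclose(mu_P, w_ridge))                               # True: same estimator
```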

Posterior Mean and Posterior Mode (MAP)

Posterior density for Σ0 = (σ²/λ) I:

p(w | D) ∝ exp( −λ‖w‖² / (2σ²) )  ×  ∏_{i=1}^n exp( −(yi − wᵀxi)² / (2σ²) )
                  (prior)                          (likelihood)

To find the MAP, it is sufficient to minimize the negative log posterior:

ŵMAP = argmin_{w ∈ R^d} [ −log p(w | D) ]
      = argmin_{w ∈ R^d} ∑_{i=1}^n (yi − wᵀxi)²  +  λ‖w‖²
                         (from the log-likelihood)  (from the log-prior)

which is the ridge regression objective.
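The same connection can also be checked by minimizing the objective numerically; a sketch using scipy.optimize with made-up data (any general-purpose optimizer would do):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.2 * rng.normal(size=50)
lam = 1.0

# Negative log posterior, up to additive/multiplicative constants:
# sum_i (y_i - w^T x_i)^2 + lambda * ||w||^2  (the ridge objective).
def objective(w):
    return np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)

w_map = minimize(objective, x0=np.zeros(3)).x
w_closed_form = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(np.max(np.abs(w_map - w_closed_form)))   # should be tiny
```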

Predictive Distribution

Given a new input point xnew, how do we predict ynew?
Predictive distribution:

p(ynew | xnew, D) = ∫ p(ynew | xnew, θ, D) p(θ | D) dθ
                  = ∫ p(ynew | xnew, θ) p(θ | D) dθ

For Gaussian regression, predictive distribution has closed form.

Closed Form for Predictive Distribution

Model:

w ∼ N(0, Σ0)
yi | x, w  i.i.d.  N(wᵀxi, σ²)
Predictive distribution:

p(ynew | xnew, D) = ∫ p(ynew | xnew, w) p(w | D) dw.

Averages over the prediction for each w, weighted by the posterior distribution.
Closed form:

ynew | xnew, D ∼ N(ηnew, σ²new)
ηnew = µPᵀ xnew
σ²new = xnewᵀ ΣP xnew  +  σ²
        (from variance in θ)   (inherent variance in y)
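A short sketch of the predictive mean and variance for a new input, given µP, ΣP and σ² (synthetic numbers, reusing the kind of closed-form posterior computed above):

```python
import numpy as np

def predictive(x_new, mu_P, Sigma_P, sigma2):
    """Predictive mean and variance at x_new, per the closed form above."""
    eta = mu_P @ x_new                          # eta_new = mu_P^T x_new
    var = x_new @ Sigma_P @ x_new + sigma2      # x_new^T Sigma_P x_new + sigma^2
    return eta, var

# Toy usage (posterior computed as on the earlier slide; numbers are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sigma2 = 0.2**2
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, np.sqrt(sigma2), size=50)

Sigma0_inv = np.eye(3)
Sigma_P = np.linalg.inv(X.T @ X / sigma2 + Sigma0_inv)
mu_P = np.linalg.solve(X.T @ X + sigma2 * Sigma0_inv, X.T @ y)

x_new = np.array([0.5, -1.0, 2.0])
eta, var = predictive(x_new, mu_P, Sigma_P, sigma2)
print(eta, np.sqrt(var))   # predictive mean and standard deviation
```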

Predictive Distributions

With predictive distributions, we can give the mean prediction with error bands:
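A sketch of how such a plot could be produced for the 1-D example (matplotlib assumed; the data and “true” line are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# 1-D example: y = w0 + w1*x + noise; plot predictive mean with +/- 2 std error bands.
rng = np.random.default_rng(0)
sigma2 = 0.2**2
x = rng.uniform(-1, 1, size=20)
y = -0.3 + 0.5 * x + rng.normal(0, np.sqrt(sigma2), size=20)   # hypothetical true line

X = np.column_stack([np.ones_like(x), x])      # design matrix with a bias column
Sigma0_inv = 2.0 * np.eye(2)                   # prior N(0, 0.5*I)  =>  precision 2*I
Sigma_P = np.linalg.inv(X.T @ X / sigma2 + Sigma0_inv)
mu_P = np.linalg.solve(X.T @ X + sigma2 * Sigma0_inv, X.T @ y)

xs = np.linspace(-1, 1, 200)
Xs = np.column_stack([np.ones_like(xs), xs])
mean = Xs @ mu_P
std = np.sqrt(np.einsum("ij,jk,ik->i", Xs, Sigma_P, Xs) + sigma2)  # predictive std dev

plt.plot(x, y, "o", label="observations")
plt.plot(xs, mean, label="predictive mean")
plt.fill_between(xs, mean - 2 * std, mean + 2 * std, alpha=0.3, label="±2 std")
plt.legend()
plt.show()
```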

Rasmussen and Williams’ Gaussian Processes for Machine Learning, Fig. 2.1(b)