
Bayesian Regression

David Rosenberg

New York University

October 29, 2016


Bayesian Statistics: Recap

The Bayesian Method

1. Define the model:
   Choose a likelihood model {p(D | θ) | θ ∈ Θ}.
   Choose a distribution p(θ), called the prior distribution.

2. After observing D, compute the posterior distribution p(θ | D).

3. Choose an action based on p(θ | D):
   e.g. use E[θ | D] as a point estimate for θ
   e.g. report an interval [a, b], where p(θ ∈ [a, b] | D) = 0.95

The Posterior Distribution

By Bayes’ rule, can write the posterior distribution as

p(θ | D) = p(D | θ) p(θ) / p(D).

likelihood: p(D | θ); prior: p(θ); marginal likelihood (evidence): p(D).
Note: p(D) is just a normalizing constant for p(θ | D). Can write

p(θ | D) ∝ p(D | θ)p(θ).

(posterior ∝ likelihood × prior)
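To make the recipe concrete, here is a minimal numerical sketch (not from the slides), assuming a made-up one-dimensional model: data drawn from N(θ, 1) with prior θ ∼ N(0, 1). The posterior is evaluated on a grid, normalized, and used for a point estimate and an approximate 95% interval.

```python
import numpy as np

# Minimal sketch: posterior ∝ likelihood × prior, evaluated on a grid.
# Hypothetical model: data ~ N(theta, 1), prior theta ~ N(0, 1).
rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=20)             # observed D

theta_grid = np.linspace(-5, 5, 2001)                      # candidate theta values
log_prior = -0.5 * theta_grid**2                           # log N(0, 1), up to a constant
log_lik = np.array([-0.5 * np.sum((data - t)**2) for t in theta_grid])

log_post = log_lik + log_prior                             # log(likelihood × prior)
post = np.exp(log_post - log_post.max())                   # unnormalized posterior
post /= np.trapz(post, theta_grid)                         # normalize to get p(theta | D)

post_mean = np.trapz(theta_grid * post, theta_grid)        # E[theta | D] as a point estimate
cdf = np.cumsum(post) * (theta_grid[1] - theta_grid[0])    # approximate CDF
interval = (theta_grid[np.searchsorted(cdf, 0.025)],       # approximate 95% credible interval
            theta_grid[np.searchsorted(cdf, 0.975)])
print(post_mean, interval)
```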

Summary

Prior represents belief about θ before observing data D.

Posterior represents the rationally “updated” beliefs after seeing D.

All inferences and actions are based on the posterior distribution.


Bayesian Gaussian Linear Regression

Bayesian Conditional Models

Input space: X = R^d
Output space: Y = R

Conditional probability model, or likelihood model:

{p(y | x,θ) | θ ∈ Θ}

“Conditional” here refers to conditioning on the input x; the x’s are not governed by our probability model.
Everything conditioned on x means “x is known.”
Prior distribution: p(θ) on θ ∈ Θ

Gaussian Regression Model

Input space: X = R^d
Output space: Y = R
Conditional probability model, or likelihood model:

y | x, w ∼ N(wᵀx, σ²),

for some known σ² > 0. What’s the parameter space? R^d.

Data: D = {(x1,y1),...,(xn,yn)}

Notation: y = (y1,...,yn) and x = (x1,...,xn).

Assume the yi’s are conditionally independent, given x and w.

Gaussian Likelihood

The likelihood of w ∈ R^d for the data D is

p(y | x, w) = ∏_{i=1}^n p(yi | xi, w)    (by conditional independence)
            = ∏_{i=1}^n [1/(σ√(2π))] exp( −(yi − wᵀxi)² / (2σ²) )

You should see in your head¹ that the MLE is

w*_MLE = argmax_{w ∈ R^d} p(y | x, w)
       = argmin_{w ∈ R^d} ∑_{i=1}^n (yi − wᵀxi)².

¹ See https://davidrosenberg.github.io/ml2015/docs/8.Lab.glm.pdf, slide 5.
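As a sanity check on this claim, here is a short sketch (with made-up data) showing that maximizing the Gaussian likelihood is the same as solving an ordinary least-squares problem.

```python
import numpy as np

# Under the Gaussian likelihood, the MLE for w is the least-squares solution.
# X (design matrix), y (responses), and w_true are synthetic, for illustration only.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.2 * rng.normal(size=n)          # y_i = w^T x_i + Gaussian noise

w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)      # argmin_w sum_i (y_i - w^T x_i)^2
print(w_mle)                                        # should be close to w_true
```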

Priors and Posteriors

Choose a Gaussian prior distribution p(w) on R^d:

w ∼ N(0,Σ0)

for some covariance matrix Σ0 ≻ 0 (i.e. Σ0 is spd).
Posterior distribution:

p(w | D) = p(w | x, y)
         = p(y | x, w) p(w) / p(y)
         ∝ p(y | x, w) p(w)
         = ∏_{i=1}^n [1/(σ√(2π))] exp( −(yi − wᵀxi)² / (2σ²) )    (likelihood)
           × |2πΣ0|^{−1/2} exp( −(1/2) wᵀΣ0^{−1}w )               (prior)

What does the posterior give us?

Likelihood model: y | x, w ∼ N(wᵀx, σ²)

Prior distribution: w ∼ N(0, Σ0)
Given data, compute the posterior distribution p(w | D).
If we knew w, the best prediction function (for square loss) would be

ŷ(x) = E[y | x, w] = wᵀx.

Prior p(w) and posterior p(w | D) give distributions over prediction functions!


Gaussian Regression Example

Example in 1-Dimension: Setup

Input space: X = [−1, 1]
Output space: Y = R
Let’s suppose for any x, y is generated as

y = w0 + w1x + ε,

where ε ∼ N(0, 0.2²). Written another way, the likelihood model is

y | x, w0, w1 ∼ N(w0 + w1x, 0.2²).

What’s the parameter space? R².
Prior distribution: w = (w0, w1) ∼ N(0, ½ I)
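A quick sketch of sampling data from this generative model; the “true” (w0, w1) below is hypothetical, chosen only to have something to simulate from.

```python
import numpy as np

# Simulate from y = w0 + w1*x + eps, eps ~ N(0, 0.2^2), with x in [-1, 1].
rng = np.random.default_rng(0)
w0_true, w1_true = -0.3, 0.5            # hypothetical true parameters (illustration only)

n = 20
x = rng.uniform(-1.0, 1.0, size=n)      # inputs from X = [-1, 1]
y = w0_true + w1_true * x + rng.normal(0.0, 0.2, size=n)
```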

Example in 1-Dimension: Prior Situation

Prior distribution: w = (w0, w1) ∼ N(0, ½ I)

On right: y(x) = E[y | x, w] for randomly chosen w ∼ N(0, ½ I),
i.e. y(x) = w0 + w1x for random (w0, w1) ∼ p(θ) = N(0, ½ I).
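One way to produce a picture like the prior-sample plot described here (cf. Bishop’s PRML Fig 3.7, credited below) is to draw several w = (w0, w1) from the prior N(0, ½ I) and plot the corresponding lines; a sketch assuming matplotlib is available.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample prediction functions y(x) = w0 + w1*x with (w0, w1) ~ N(0, 0.5*I).
rng = np.random.default_rng(0)
xs = np.linspace(-1, 1, 100)

for _ in range(6):
    w0, w1 = rng.multivariate_normal(mean=[0.0, 0.0], cov=0.5 * np.eye(2))
    plt.plot(xs, w0 + w1 * xs)

plt.xlabel("x")
plt.ylabel("y(x)")
plt.title("Prediction functions drawn from the prior")
plt.show()
```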

Bishop’s PRML Fig 3.7

Example in 1-Dimension: 1 Observation

On left: posterior distribution; the white ‘+’ indicates the true parameter values.
On right: the blue circle indicates the training observation.

Bishop’s PRML Fig 3.7

Example in 1-Dimension: 2 and 20 Observations

Bishop’s PRML Fig 3.7

Gaussian Regression Continued

Closed Form for Posterior

Model:

w ∼ N(0, Σ0)
yi | x, w  i.i.d.  N(wᵀxi, σ²)
Design matrix X; response column vector y.
The posterior distribution is a Gaussian distribution:

w | D ∼ N(µP, ΣP)
µP = (XᵀX + σ²Σ0^{−1})^{−1} Xᵀy
ΣP = (σ^{−2}XᵀX + Σ0^{−1})^{−1}

Posterior Variance ΣP gives us a natural uncertainty measure.
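A minimal sketch of these formulas in code (the design matrix, responses, and prior below are synthetic, for illustration only):

```python
import numpy as np

def gaussian_posterior(X, y, sigma2, Sigma0):
    """Return (mu_P, Sigma_P) for the posterior w | D ~ N(mu_P, Sigma_P)."""
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma_P = np.linalg.inv(X.T @ X / sigma2 + Sigma0_inv)            # (sigma^-2 X^T X + Sigma0^-1)^-1
    mu_P = np.linalg.solve(X.T @ X + sigma2 * Sigma0_inv, X.T @ y)    # (X^T X + sigma^2 Sigma0^-1)^-1 X^T y
    return mu_P, Sigma_P

# Toy usage with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sigma2 = 0.2**2
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, np.sqrt(sigma2), size=50)

mu_P, Sigma_P = gaussian_posterior(X, y, sigma2, Sigma0=np.eye(3))
print(mu_P)              # posterior mean
print(np.diag(Sigma_P))  # per-coordinate posterior variances (uncertainty)
```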

See Rasmussen and Williams’ Gaussian Processes for Machine Learning, Ch. 2.1: http://www.gaussianprocess.org/gpml/chapters/RW2.pdf

Closed Form for Posterior

Posterior distribution is a Gaussian distribution:

w | D ∼ N(µP, ΣP)
µP = (XᵀX + σ²Σ0^{−1})^{−1} Xᵀy
ΣP = (σ^{−2}XᵀX + Σ0^{−1})^{−1}

The MAP estimator and the posterior mean are given by

ŵMAP = µP = (XᵀX + σ²Σ0^{−1})^{−1} Xᵀy

For the prior variance Σ0 = (σ²/λ) I, we get
µP = (XᵀX + λI)^{−1} Xᵀy,
which is of course the ridge regression solution.
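A quick numerical check of this identity (synthetic data; the choice λ = 1 is arbitrary):

```python
import numpy as np

# With Sigma0 = (sigma^2 / lambda) * I, the posterior mean equals the ridge solution.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.2 * rng.normal(size=50)

sigma2, lam = 0.2**2, 1.0
Sigma0_inv = (lam / sigma2) * np.eye(3)                         # inverse of (sigma^2/lambda) I

mu_P = np.linalg.solve(X.T @ X + sigma2 * Sigma0_inv, X.T @ y)  # posterior mean
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)   # ridge solution
print(np.allclose(mu_P, w_ridge))                               # True: same estimator
```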

Posterior Mean and Posterior Mode (MAP)

Posterior density for Σ0 = (σ²/λ) I:

p(w | D) ∝ exp( −λ‖w‖² / (2σ²) )  ×  ∏_{i=1}^n exp( −(yi − wᵀxi)² / (2σ²) )
                  (prior)                          (likelihood)

To find the MAP, it is sufficient to minimize the negative log posterior:

ŵMAP = argmin_{w ∈ R^d} [ −log p(w | D) ]
      = argmin_{w ∈ R^d} ∑_{i=1}^n (yi − wᵀxi)²  +  λ‖w‖²
                         (from the log-likelihood)  (from the log-prior)

which is the ridge regression objective.
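The same connection can also be checked by minimizing the objective numerically; a sketch using scipy.optimize with made-up data (any general-purpose optimizer would do):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.2 * rng.normal(size=50)
lam = 1.0

# Negative log posterior, up to additive/multiplicative constants:
# sum_i (y_i - w^T x_i)^2 + lambda * ||w||^2  (the ridge objective).
def objective(w):
    return np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)

w_map = minimize(objective, x0=np.zeros(3)).x
w_closed_form = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(np.max(np.abs(w_map - w_closed_form)))   # should be tiny
```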

Predictive Distribution

Given a new input point xnew, how do we predict ynew?
Predictive distribution:

p(ynew | xnew, D) = ∫ p(ynew | xnew, θ, D) p(θ | D) dθ
                  = ∫ p(ynew | xnew, θ) p(θ | D) dθ

For Gaussian regression, predictive distribution has closed form.

Closed Form for Predictive Distribution

Model:

w ∼ N(0, Σ0)
yi | x, w  i.i.d.  N(wᵀxi, σ²)
Predictive distribution:

p(ynew | xnew, D) = ∫ p(ynew | xnew, w) p(w | D) dw.

Averages over the prediction for each w, weighted by the posterior distribution.
Closed form:

ynew | xnew, D ∼ N(ηnew, σ²new)
ηnew = µPᵀ xnew
σ²new = xnewᵀ ΣP xnew  +  σ²
        (from variance in θ)   (inherent variance in y)
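A short sketch of the predictive mean and variance for a new input, given µP, ΣP and σ² (synthetic numbers, reusing the kind of closed-form posterior computed above):

```python
import numpy as np

def predictive(x_new, mu_P, Sigma_P, sigma2):
    """Predictive mean and variance at x_new, per the closed form above."""
    eta = mu_P @ x_new                          # eta_new = mu_P^T x_new
    var = x_new @ Sigma_P @ x_new + sigma2      # x_new^T Sigma_P x_new + sigma^2
    return eta, var

# Toy usage (posterior computed as on the earlier slide; numbers are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sigma2 = 0.2**2
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, np.sqrt(sigma2), size=50)

Sigma0_inv = np.eye(3)
Sigma_P = np.linalg.inv(X.T @ X / sigma2 + Sigma0_inv)
mu_P = np.linalg.solve(X.T @ X + sigma2 * Sigma0_inv, X.T @ y)

x_new = np.array([0.5, -1.0, 2.0])
eta, var = predictive(x_new, mu_P, Sigma_P, sigma2)
print(eta, np.sqrt(var))   # predictive mean and standard deviation
```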

Predictive Distributions

With predictive distributions, we can give the mean prediction with error bands:
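A sketch of how such a plot could be produced for the 1-D example (matplotlib assumed; the data and “true” line are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# 1-D example: y = w0 + w1*x + noise; plot predictive mean with +/- 2 std error bands.
rng = np.random.default_rng(0)
sigma2 = 0.2**2
x = rng.uniform(-1, 1, size=20)
y = -0.3 + 0.5 * x + rng.normal(0, np.sqrt(sigma2), size=20)   # hypothetical true line

X = np.column_stack([np.ones_like(x), x])      # design matrix with a bias column
Sigma0_inv = 2.0 * np.eye(2)                   # prior N(0, 0.5*I)  =>  precision 2*I
Sigma_P = np.linalg.inv(X.T @ X / sigma2 + Sigma0_inv)
mu_P = np.linalg.solve(X.T @ X + sigma2 * Sigma0_inv, X.T @ y)

xs = np.linspace(-1, 1, 200)
Xs = np.column_stack([np.ones_like(xs), xs])
mean = Xs @ mu_P
std = np.sqrt(np.einsum("ij,jk,ik->i", Xs, Sigma_P, Xs) + sigma2)  # predictive std dev

plt.plot(x, y, "o", label="observations")
plt.plot(xs, mean, label="predictive mean")
plt.fill_between(xs, mean - 2 * std, mean + 2 * std, alpha=0.3, label="±2 std")
plt.legend()
plt.show()
```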

Rasmussen and Williams’ Gaussian Processes for Machine Learning, Fig. 2.1(b)