
Bayesian Linear Regression (Hyperparameter Estimation, Sparse Priors), Bayesian Logistic Regression

Piyush Rai

Topics in Probabilistic Modeling and Inference (CS698X)

Jan 21, 2019

Recap: Bayesian Linear Regression

Assume a Gaussian likelihood: p(y|X, w, β) = ∏_{n=1}^N N(y_n | w^T x_n, β^{-1}) = N(y | Xw, β^{-1} I_N)

Assume a zero-mean spherical Gaussian prior: p(w|λ) = ∏_{d=1}^D N(w_d | 0, λ^{-1}) = N(w | 0, λ^{-1} I_D)

Assuming the hyperparameters to be fixed, the posterior is Gaussian

p(w|y, X, β, λ) = N(µ_N, Σ_N)

Σ_N = (β ∑_{n=1}^N x_n x_n^T + λ I_D)^{-1} = (β X^T X + λ I_D)^{-1}   (posterior's covariance matrix)

µ_N = Σ_N [β ∑_{n=1}^N y_n x_n] = Σ_N [β X^T y] = (X^T X + (λ/β) I_D)^{-1} X^T y   (posterior's mean)

The posterior predictive distribution is also Gaussian

p(y_*|x_*, X, y, β, λ) = ∫ p(y_*|w, x_*, β) p(w|y, X, β, λ) dw = N(µ_N^T x_*, β^{-1} + x_*^T Σ_N x_*)

Gives both the predictive mean and the predictive variance (imp: pred-var is different for each input)
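These quantities are just a few matrix operations, so they are easy to compute directly. A minimal NumPy sketch (synthetic data; β and λ are treated as known here, and all names are illustrative):

```python
import numpy as np

def blr_posterior(X, y, beta, lam):
    """Posterior N(mu_N, Sigma_N) over w for the model above:
    likelihood N(y | Xw, beta^{-1} I_N), prior N(w | 0, lam^{-1} I_D)."""
    D = X.shape[1]
    Sigma_N = np.linalg.inv(beta * X.T @ X + lam * np.eye(D))  # posterior covariance
    mu_N = beta * Sigma_N @ X.T @ y                            # posterior mean
    return mu_N, Sigma_N

def blr_predictive(x_star, mu_N, Sigma_N, beta):
    """Posterior predictive mean and variance for a single test input x_star."""
    mean = mu_N @ x_star
    var = 1.0 / beta + x_star @ Sigma_N @ x_star   # note: the variance depends on x_star
    return mean, var

# Toy usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta, lam = 25.0, 1.0                              # noise precision, prior precision (assumed known)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=beta ** -0.5, size=50)

mu_N, Sigma_N = blr_posterior(X, y, beta, lam)
print(blr_predictive(X[0], mu_N, Sigma_N, beta))
```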

A Visualization of Uncertainty in Bayesian Linear Regression

Posterior p(w|X, y) and lines (w_0 intercept, w_1 slope) corresponding to some random w's

[Figure: panels for the prior (N=0) and the posteriors after N=1, N=2, and N=20 observations, with lines corresponding to sampled w's]

A visualization of the posterior predictive of a Bayesian linear regression model

[Figure: a line showing the predictive mean in the (x, y) plane, with small predictive variance in some regions and large predictive variance in others]
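Such "random lines" figures are easy to reproduce: sample a few w's from N(µ_N, Σ_N) and plot the corresponding lines. A small sketch (assuming 1-D inputs with an added intercept feature; the data and settings below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_posterior_lines(x, y, beta, lam, n_lines=6, grid=None):
    """Sample (w_0, w_1) pairs from the posterior and return the corresponding lines
    w_0 + w_1 * x evaluated on a grid (one column per sampled line)."""
    grid = np.linspace(-1.0, 1.0, 100) if grid is None else grid
    Phi = np.column_stack([np.ones_like(x), x])                 # features: [1, x]
    Sigma_N = np.linalg.inv(beta * Phi.T @ Phi + lam * np.eye(2))
    mu_N = beta * Sigma_N @ Phi.T @ y
    W = rng.multivariate_normal(mu_N, Sigma_N, size=n_lines)    # sampled weight vectors
    return np.column_stack([np.ones_like(grid), grid]) @ W.T

# With no data this reduces to sampling lines from the prior; with more data the lines cluster
x_obs = np.array([-0.5, 0.3, 0.8])
y_obs = 0.5 - 0.7 * x_obs + 0.05 * rng.normal(size=3)
lines = sample_posterior_lines(x_obs, y_obs, beta=100.0, lam=2.0)
print(lines.shape)   # (100, 6): each column is one plausible regression line
```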

A Visualization of Uncertainty (Contd)

We can similarly visualize a Bayesian nonlinear regression model

Figures below: Green curve is the true function and blue circles are the observations (x_n, y_n)

Posterior of the nonlinear regression model: Some curves drawn from the posterior   [four-panel figure]

Posterior predictive: Red curve is the predictive mean, shaded region denotes the predictive uncertainty   [four-panel figure]

Estimating Hyperparameters for Bayesian Linear Regression

Learning Hyperparameters in Probabilistic Models

Can treat hyperparams as just a bunch of additional unknowns

Can be learned using a suitable inference algorithm (point estimation or fully Bayesian)

Example: For the linear regression model, the full set of parameters would be (w, λ, β)

Can assume priors on all these parameters and infer their "joint" posterior distribution

p(w, β, λ|X, y) = p(y|X, w, β, λ) p(w, λ, β) / p(y|X) = p(y|X, w, β, λ) p(w|λ) p(β) p(λ) / ∫ p(y|X, w, β) p(w|λ) p(β) p(λ) dw dλ dβ

Inferring the above is usually intractable (rare to have conjugacy). Requires approximations.

Also: What priors (or "hyperpriors") to choose for β and λ? What about the hyperparameters of those priors?

Learning Hyperparameters via Point Estimation

One popular way to estimate hyperparameters is by maximizing the marginal likelihood

For our linear regression model, this quantity (a function of the hyperparams) will be

p(y|X, β, λ) = ∫ p(y|X, w, β) p(w|λ) dw

The "optimal" hyperparameters in this case can then be found by

β̂, λ̂ = arg max_{β,λ} log p(y|X, β, λ)

This is called MLE-II or (log) evidence maximization

Akin to doing MLE to estimate the hyperparameters, where the "main" parameter (in this case w) has been integrated out from the model's likelihood function

Note: If the likelihood and prior are conjugate then the marginal likelihood is available in closed form

What is MLE-II Doing?

For the linear regression case, we would ideally like the posterior over all the unknowns, i.e., p(w, λ, β|X, y)

p(w, β, λ|X, y) = p(w|X, y, β, λ) p(β, λ|X, y)   (from product rule)

Note that p(w|X, y, β, λ) is easy if λ, β are known

However, p(β, λ|X, y) = p(y|X, β, λ) p(β) p(λ) / p(y|X) is hard (lack of conjugacy, intractable denominator)

Let's approximate it by a point function δ at the mode of p(β, λ|X, y)

p(β, λ|X, y) ≈ δ(β̂, λ̂)   where   β̂, λ̂ = arg max_{β,λ} p(β, λ|X, y) = arg max_{β,λ} p(y|X, β, λ) p(λ) p(β)

Moreover, if p(β), p(λ) are uniform/uninformative priors, then

β̂, λ̂ = arg max_{β,λ} p(y|X, β, λ)

Thus MLE-II is approximating the posterior of the hyperparams by their point estimate, assuming uniform priors (therefore we don't need to worry about a prior over the hyperparams)

MLE-II for Linear Regression

For the linear regression case, the marginal likelihood is defined as

p(y|X, β, λ) = ∫ p(y|X, w, β) p(w|λ) dw

Since p(y|X, w, β) = N(y|Xw, β^{-1} I_N) and p(w|λ) = N(w|0, λ^{-1} I_D), the marginal likelihood is

p(y|X, β, λ) = N(y | 0, β^{-1} I + λ^{-1} X X^T)
             = (2π)^{-N/2} |β^{-1} I + λ^{-1} X X^T|^{-1/2} exp( -(1/2) y^T (β^{-1} I + λ^{-1} X X^T)^{-1} y )

MLE-II maximizes log p(y|X, β, λ) w.r.t. β and λ to estimate these hyperparams

This objective doesn't have a closed-form solution; it is solved using iterative/alternating optimization (PRML Chapter 3 contains the iterative update equations)

Note: Can also do "MAP-II" using a suitable prior on these hyperparams (e.g., gamma)

Note: Can also use a different λ_d for each w_d
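As a complement to the iterative PRML-style updates, MLE-II can also be illustrated with generic numerical optimization of the log evidence. A minimal sketch (synthetic data; optimizing over log β and log λ is just a convenient choice to keep both positive):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marglik(params, X, y):
    """Negative log marginal likelihood log N(y | 0, beta^{-1} I + lam^{-1} X X^T),
    parameterized by (log beta, log lam)."""
    beta, lam = np.exp(params)
    N = X.shape[0]
    C = np.eye(N) / beta + X @ X.T / lam          # marginal covariance of y
    sign, logdet = np.linalg.slogdet(C)
    quad = y @ np.linalg.solve(C, y)
    return 0.5 * (N * np.log(2 * np.pi) + logdet + quad)

# Toy usage (synthetic data; this is generic numerical optimization, not the
# alternating updates from PRML Chapter 3)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + rng.normal(scale=0.2, size=100)  # true noise precision beta = 25

res = minimize(neg_log_marglik, x0=np.zeros(2), args=(X, y), method="L-BFGS-B")
beta_hat, lam_hat = np.exp(res.x)
print(beta_hat, lam_hat)
```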

Using MLE-II Estimates for Making Predictions

With the MLE-II approximation p(β, λ|X, y) ≈ δ(β̂, λ̂), the posterior over the unknowns is

p(w, β, λ|X, y) = p(w|X, y, β, λ) p(β, λ|X, y) ≈ p(w|X, y, β̂, λ̂)

The posterior predictive distribution can also be approximated as

p(y_*|x_*, X, y) = ∫ p(y_*|x_*, w, β) p(w, β, λ|X, y) dw dβ dλ
                 = ∫ p(y_*|x_*, w, β) p(w|X, y, β, λ) p(β, λ|X, y) dβ dλ dw
                 ≈ ∫ p(y_*|x_*, w, β̂) p(w|X, y, β̂, λ̂) dw

This is also the same as the usual posterior predictive distribution we have seen earlier, except that we treat the hyperparams β̂, λ̂ as fixed at their MLE-II based estimates

Modeling Sparse Weights

Modeling Sparse Weights

Many probabilistic models consist of weights that are given zero-mean Gaussian priors, e.g.,

µ(x) = ∑_{d=1}^D w_d x_d   (mean of a prob. linear regression model)
µ(x) = ∑_{n=1}^N w_n k(x_n, x)   (mean of a prob. kernel-based nonlinear regression model)

A zero-mean prior is of the form p(w_d) = N(0, λ^{-1}) or p(w_d) = N(0, λ_d^{-1})

The precision λ or λ_d specifies our belief about how close to zero w_d is (like a regularization hyperparam)

However, such a prior usually gives small weights but not very strong sparsity

Putting a gamma prior on the precision can give sparsity (will soon see why)

Sparsity of weights will be a very useful thing to have in many models, e.g.,
  For a linear model, this helps learn the relevance of each feature x_d
  For a kernel-based model, this helps learn the relevance of each input x_n (Relevance Vector Machine)

Sparsity via a Hierarchical Prior

Consider linear regression with the prior p(w_d|λ_d) = N(0, λ_d^{-1}) on each weight

Let's treat the precision λ_d as unknown and use a gamma (shape = a, rate = b) prior on it

p(λ_d) = Gamma(a, b) = (b^a / Γ(a)) λ_d^{a-1} exp(-b λ_d)

Marginalizing the precision leads to a Student-t prior on each w_d

p(w_d) = ∫ p(w_d|λ_d) p(λ_d) dλ_d = (b^a Γ(a + 1/2)) / (√(2π) Γ(a)) · (b + w_d^2 / 2)^{-(a + 1/2)}

[Figure: contour plots in the (w_1, w_2) plane illustrating the shape of the resulting prior]

Note: Can make the prior an uninformative prior by setting a and b to be very small (e.g., 10^{-4})

Note: Some other priors on λ_d (e.g., the exponential distribution) also result in sparse priors on w_d
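The marginalization on this slide is easy to sanity-check numerically. A small sketch (made-up a, b values) that compares the closed-form Student-t density with a direct numerical integration over λ_d:

```python
import numpy as np
from scipy import integrate, stats
from scipy.special import gammaln

def student_t_marginal(w, a, b):
    """Closed-form marginal p(w) = ∫ N(w|0, 1/lam) Gamma(lam|a, b) dlam
    (the Student-t density from the slide), computed via log-gamma for stability."""
    log_p = (a * np.log(b) + gammaln(a + 0.5) - 0.5 * np.log(2 * np.pi)
             - gammaln(a) - (a + 0.5) * np.log(b + 0.5 * w ** 2))
    return np.exp(log_p)

def numerical_marginal(w, a, b):
    """Numerically integrate out the precision lam to check the closed form."""
    integrand = lambda lam: stats.norm.pdf(w, 0.0, lam ** -0.5) * stats.gamma.pdf(lam, a, scale=1.0 / b)
    val, _ = integrate.quad(integrand, 0.0, np.inf)
    return val

a, b = 2.0, 1.0  # made-up shape/rate values for illustration
for w in [0.0, 0.5, 2.0]:
    print(w, student_t_marginal(w, a, b), numerical_marginal(w, a, b))  # the two should agree
```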

Bayesian Linear Regression with Sparse Prior on Weights

Posterior inference for w is not straightforward since p(w) = ∏_{d=1}^D p(w_d) is no longer Gaussian

Approximate inference is usually needed for inferring the full posterior

Many approaches exist (which we will see later)

Such approaches are mostly in the form of alternating estimation of w and λ (a small sketch follows below)
  Estimate λ_d given w_d, estimate w_d given λ_d

Popular approaches: EM, Gibbs sampling, variational inference, etc.

Working with such sparse priors is known as Sparse Bayesian Learning

Used in many models where we want to have sparsity in the weights (very few non-zero weights)

Note: We will later look at other ways of getting sparsity (e.g., spike-and-slab priors defined by binary switch variables for each weight)
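To make the alternating-estimation idea concrete, here is a rough sketch on hypothetical toy data. The λ_d update shown is one common EM/variational-style choice that uses the posterior moments of w_d and the gamma hyperprior (a, b); other update rules (e.g., MacKay-style) exist:

```python
import numpy as np

def sparse_blr_em(X, y, beta, n_iters=50, a=1e-4, b=1e-4):
    """Sketch of alternating estimation for sparse Bayesian linear regression:
    given per-weight precisions lam_d, the posterior of w is Gaussian; given that
    posterior, each lam_d is re-estimated from the moments of w_d."""
    N, D = X.shape
    lam = np.ones(D)                                    # per-weight precisions lam_d
    for _ in range(n_iters):
        # "Estimate w given lam": Gaussian posterior with a diagonal prior precision
        Sigma = np.linalg.inv(beta * X.T @ X + np.diag(lam))
        mu = beta * Sigma @ X.T @ y
        # "Estimate lam given w": uses E[w_d^2] = mu_d^2 + Sigma_dd under the posterior
        lam = (2 * a + 1) / (2 * b + mu ** 2 + np.diag(Sigma))
    return mu, Sigma, lam

# Toy usage: only the first 2 of 10 features are relevant; irrelevant weights get large lam_d
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
w_true = np.zeros(10); w_true[:2] = [2.0, -3.0]
y = X @ w_true + 0.1 * rng.normal(size=80)
mu, Sigma, lam = sparse_blr_em(X, y, beta=100.0)
print(np.round(mu, 2))   # weights of irrelevant features shrink toward zero
print(np.round(lam, 1))  # their precisions lam_d grow large
```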

Prob. Mod. & Inference - CS698X (Piyush Rai, IITK) Bayesian Linear Regression (Hyperparameter Estimation, Sparse Priors), Bayesian Logistic Regression 14 Such approaches are mostly in form of alternating estimation of w and λ

Estimate λd given wd , estimate wd given λd

Popular approaches: EM, , variational inference, etc Working with such sparse priors is known as Sparse Bayesian Learning Used in many models where we want to have sparsity in the weights (very few non-zero weights)

Note: We will later look at other ways of getting sparsity (e.g., spike-and-slab priors defined by binary switch variables for each weight)

Bayesian Linear Regression with Sparse Prior on Weights

QD Posterior inference for w not straightforward since p(w) = d=1 p(wd ) is no longer Gaussian Approximate inference is usually needed for inferring the full posterior Many approaches exist (which we will see later)

Prob. Mod. & Inference - CS698X (Piyush Rai, IITK) Bayesian Linear Regression (Hyperparameter Estimation, Sparse Priors), Bayesian Logistic Regression 14 Estimate λd given wd , estimate wd given λd

Popular approaches: EM, Gibbs sampling, variational inference, etc Working with such sparse priors is known as Sparse Bayesian Learning Used in many models where we want to have sparsity in the weights (very few non-zero weights)

Note: We will later look at other ways of getting sparsity (e.g., spike-and-slab priors defined by binary switch variables for each weight)

Bayesian Linear Regression with Sparse Prior on Weights

QD Posterior inference for w not straightforward since p(w) = d=1 p(wd ) is no longer Gaussian Approximate inference is usually needed for inferring the full posterior Many approaches exist (which we will see later) Such approaches are mostly in form of alternating estimation of w and λ

Bayesian Linear Regression with Sparse Prior on Weights

Posterior inference for w is not straightforward since $p(w) = \prod_{d=1}^{D} p(w_d)$ is no longer Gaussian
Approximate inference is usually needed for inferring the full posterior
Many approaches exist (which we will see later)
Such approaches are mostly in the form of alternating estimation of w and λ (a small sketch follows below)

Estimate $\lambda_d$ given $w_d$, and estimate $w_d$ given $\lambda_d$

Popular approaches: EM, Gibbs sampling, variational inference, etc.
Working with such sparse priors is known as Sparse Bayesian Learning
Used in many models where we want to have sparsity in the weights (very few non-zero weights)

Note: We will later look at other ways of getting sparsity (e.g., spike-and-slab priors defined by binary switch variables for each weight)
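To make the alternating scheme above concrete, here is a minimal sketch, not the course's exact algorithm: it assumes an EM/ARD-style refit of each precision, $\lambda_d \leftarrow 1/(\mu_d^2 + \Sigma_{dd})$, from the current posterior moments of $w_d$, and reuses the Gaussian-posterior formulas from the Bayesian linear regression recap for the w-step (all names below are illustrative).

```python
import numpy as np

def sparse_blr_alternating(X, y, beta=1.0, n_iters=50):
    """Illustrative sketch (assumed updates): alternate between (1) the Gaussian
    posterior of w given per-dimension precisions lambda_d, and (2) an EM/ARD-style
    refit of each lambda_d from the current posterior moments of w_d."""
    N, D = X.shape
    lam = np.ones(D)                                  # per-weight precisions lambda_d
    for _ in range(n_iters):
        # w-step: same form as the Bayesian linear regression posterior,
        # but with a diagonal prior precision matrix diag(lambda)
        Sigma = np.linalg.inv(beta * (X.T @ X) + np.diag(lam))
        mu = beta * (Sigma @ (X.T @ y))
        # lambda-step: refit each precision from E[w_d^2] = mu_d^2 + Sigma_dd
        lam = 1.0 / (mu**2 + np.diag(Sigma) + 1e-12)
    return mu, Sigma, lam
```

As the iterations proceed, the precisions of irrelevant weights grow large, shrinking those weights toward zero; this is the mechanism by which such priors produce sparsity.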

Bayesian Logistic Regression

(...a simple, single-parameter, yet non-conjugate model)

Probabilistic Models for Classification

The goal is to learn p(y|x). Here p(y|x) will be a discrete distribution (e.g., Bernoulli, multinoulli)

Usually two approaches to learn p(y|x): Discriminative Classification and Generative Classification
Discriminative Classification: Model and learn p(y|x) directly
This approach does not model the distribution of the inputs x

Generative Classification: Model and learn p(y|x) “indirectly” as $p(y|x) = \frac{p(y)\,p(x|y)}{p(x)}$ (see the sketch after this slide)
Called generative because, via p(x|y), we model how the inputs x of each class are generated
The approach requires first learning the class-marginal p(y) and the class-conditional distributions p(x|y)
Usually harder to learn than discriminative, but also has some advantages (more on this later)

Both approaches can be given a non-Bayesian or a Bayesian treatment
The Bayesian treatment won’t rely on point estimates but will instead infer the posterior over the unknowns
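To make the generative route concrete, here is a purely illustrative sketch that assumes Gaussian class-conditionals p(x|y) and applies Bayes rule to obtain p(y|x); the function and argument names are hypothetical, not from any library discussed here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def generative_posterior(x, class_priors, class_means, class_covs):
    """p(y = k | x) is proportional to p(y = k) * p(x | y = k); normalize over classes.
    Gaussian class-conditionals are an assumption made only for illustration."""
    unnorm = np.array([
        prior * multivariate_normal.pdf(x, mean=m, cov=c)
        for prior, m, c in zip(class_priors, class_means, class_covs)
    ])
    return unnorm / unnorm.sum()  # dividing by the sum plays the role of p(x)
```

Here class_priors would typically be estimated from label counts and (class_means, class_covs) from the inputs of each class, which is exactly why the generative approach must model the distribution of x.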

Discriminative Classification via Logistic Regression

Logistic Regression (LR) is an example of discriminative binary classification, i.e., y ∈ {0, 1}
Logistic Regression models the x-to-y relationship using the sigmoid function
$$p(y = 1 | x, w) = \mu = \sigma(w^\top x) = \frac{1}{1 + \exp(-w^\top x)} = \frac{\exp(w^\top x)}{1 + \exp(w^\top x)}$$
where $w \in \mathbb{R}^D$ is the weight vector. Also note that $p(y = 0 | x, w) = 1 - \mu$

A large positive (negative) “score” $w^\top x$ means a large probability of the label being 1 (0)
Is the sigmoid the only way to convert the score into a probability? No; while LR uses the sigmoid, there exist models that define µ in other ways, e.g., Probit Regression: $\mu = p(y = 1 | x, w) = \Phi(w^\top x)$, where Φ denotes the CDF of N(0, 1) (a small sketch comparing the two links follows below)
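A small sketch comparing the two link functions just mentioned; it assumes only NumPy and SciPy's standard normal CDF, and the function names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))      # logistic link: sigma(w^T x)

def predict_proba(w, x, link="logistic"):
    """p(y = 1 | x, w) under either link; both squash the real-valued score w^T x into (0, 1)."""
    score = w @ x
    if link == "logistic":
        return sigmoid(score)                 # logistic regression
    return norm.cdf(score)                    # probit regression: Phi(w^T x)
```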

Logistic Regression

The LR classification rule is
$$p(y = 1 | x, w) = \mu = \sigma(w^\top x) = \frac{1}{1 + \exp(-w^\top x)} = \frac{\exp(w^\top x)}{1 + \exp(w^\top x)}$$
$$p(y = 0 | x, w) = 1 - \mu = 1 - \sigma(w^\top x) = \frac{1}{1 + \exp(w^\top x)}$$
This implies a Bernoulli likelihood model for the labels
$$p(y | x, w) = \text{Bernoulli}(\sigma(w^\top x)) = \left[\frac{\exp(w^\top x)}{1 + \exp(w^\top x)}\right]^{y} \left[\frac{1}{1 + \exp(w^\top x)}\right]^{(1-y)}$$

Given N observations $(X, y) = \{x_n, y_n\}_{n=1}^{N}$, we can do point estimation for w by maximizing the log-likelihood (or minimizing the negative log-likelihood). This is basically MLE.
$$w_{\text{MLE}} = \arg\max_{w} \sum_{n=1}^{N} \log p(y_n | x_n, w) = \arg\min_{w} \left[ -\sum_{n=1}^{N} \log p(y_n | x_n, w) \right] = \arg\min_{w} \text{NLL}(w)$$
The loss function is convex, so it has a global minimum; both first-order and second-order methods are widely used.
Can also add a regularizer on w to prevent overfitting. This corresponds to doing MAP estimation with a prior on w, i.e., $w_{\text{MAP}} = \arg\max_{w} \left[ \sum_{n=1}^{N} \log p(y_n | x_n, w) + \log p(w) \right]$ (a small sketch of both follows below)
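A minimal sketch of the point-estimation view, assuming plain gradient descent (second-order methods such as Newton/IRLS are equally standard): setting lam = 0 gives the MLE, while lam > 0 adds the (λ/2)‖w‖² penalty that corresponds to MAP with a zero-mean Gaussian prior. The step size and iteration count are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lam=0.0, lr=0.01, n_iters=5000):
    """Gradient descent on NLL(w) + (lam/2) * ||w||^2.
    lam = 0 recovers w_MLE; lam > 0 gives w_MAP under a N(0, lam^{-1} I) prior."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        mu = sigmoid(X @ w)               # mu_n = p(y_n = 1 | x_n, w)
        grad = X.T @ (mu - y) + lam * w   # gradient of the regularized NLL
        w -= lr * grad
    return w
```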

Bayesian Logistic Regression

MLE/MAP only gives a point estimate. We would like to infer the full posterior over w
Recall that the likelihood model is Bernoulli
$$p(y | x, w) = \text{Bernoulli}(\sigma(w^\top x)) = \left[\frac{\exp(w^\top x)}{1 + \exp(w^\top x)}\right]^{y} \left[\frac{1}{1 + \exp(w^\top x)}\right]^{(1-y)}$$
Just like the Bayesian linear regression case, let's use a Gaussian prior on w
$$p(w) = \mathcal{N}(0, \lambda^{-1} I_D) \propto \exp\!\left(-\frac{\lambda}{2}\, w^\top w\right)$$
Given N observations $(X, y) = \{x_n, y_n\}_{n=1}^{N}$, where X is N × D and y is N × 1, the posterior over w is
$$p(w | X, y) = \frac{p(y | X, w)\, p(w)}{\int p(y | X, w)\, p(w)\, dw} = \frac{\prod_{n=1}^{N} p(y_n | x_n, w)\, p(w)}{\int \prod_{n=1}^{N} p(y_n | x_n, w)\, p(w)\, dw}$$
The denominator is intractable in general (the logistic-Bernoulli likelihood and the Gaussian prior are not conjugate)
Can't get a closed-form expression for p(w|X, y). Must approximate it!
Several ways to do it, e.g., MCMC, variational inference, Laplace approximation (next class); the unnormalized log-posterior that all of these build on is sketched below
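Although the normalizing integral is intractable, the unnormalized log-posterior log p(y|X, w) + log p(w) is easy to evaluate, and it is the quantity that MCMC, variational inference, and the Laplace approximation all build on. A minimal sketch, assuming the sigmoid-Bernoulli likelihood and the N(0, λ⁻¹I) prior above:

```python
import numpy as np

def log_joint(w, X, y, lam=1.0):
    """log p(y | X, w) + log p(w), up to an additive constant:
    sigmoid-Bernoulli log-likelihood plus the N(0, lam^{-1} I) log-prior."""
    scores = X @ w
    # Bernoulli log-likelihood: sum_n [ y_n * s_n - log(1 + exp(s_n)) ], written stably
    log_lik = np.sum(y * scores - np.logaddexp(0.0, scores))
    log_prior = -0.5 * lam * (w @ w)
    return log_lik + log_prior
```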

Next Class

Laplace approximation
Computing the posterior and the posterior predictive for logistic regression
Properties/benefits of Bayesian logistic regression

Bayesian approach to generative classification
