Bayesian Logistic Regression, Bayesian Generative Classification
Piyush Rai
Topics in Probabilistic Modeling and Inference (CS698X)
Jan 23, 2019

Bayesian Logistic Regression

Recall that the likelihood model for logistic regression is Bernoulli (since $y \in \{0,1\}$):
$$p(y \mid x, w) = \mathrm{Bernoulli}(\sigma(w^\top x)) = \left[\frac{\exp(w^\top x)}{1+\exp(w^\top x)}\right]^{y} \left[\frac{1}{1+\exp(w^\top x)}\right]^{1-y} = \mu^{y}(1-\mu)^{1-y}$$

Just like the Bayesian linear regression case, let's use a Gaussian prior on $w$:
$$p(w) = \mathcal{N}(0, \lambda^{-1} I_D) \propto \exp\left(-\frac{\lambda}{2}\, w^\top w\right)$$

Given $N$ observations $(X, y) = \{x_n, y_n\}_{n=1}^{N}$, where $X$ is $N \times D$ and $y$ is $N \times 1$, the posterior over $w$ is
$$p(w \mid X, y) = \frac{p(y \mid X, w)\, p(w)}{\int p(y \mid X, w)\, p(w)\, dw} = \frac{\prod_{n=1}^{N} p(y_n \mid x_n, w)\, p(w)}{\int \prod_{n=1}^{N} p(y_n \mid x_n, w)\, p(w)\, dw}$$

The denominator is intractable in general (the logistic-Bernoulli likelihood and the Gaussian prior are not conjugate), so we can't get a closed-form expression for $p(w \mid X, y)$. We must approximate it! There are several ways to do so, e.g., MCMC, variational inference, and the Laplace approximation (today's topic).

Laplace Approximation of Posterior Distribution

Approximate the posterior distribution $p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} = \frac{p(\mathcal{D}, \theta)}{p(\mathcal{D})}$ by the following Gaussian:
$$p(\theta \mid \mathcal{D}) \approx \mathcal{N}(\theta_{\mathrm{MAP}}, H^{-1})$$

Note: $\theta_{\mathrm{MAP}}$ is the maximum-a-posteriori (MAP) estimate of $\theta$, i.e.,
$$\theta_{\mathrm{MAP}} = \arg\max_{\theta}\, p(\theta \mid \mathcal{D}) = \arg\max_{\theta}\, p(\mathcal{D}, \theta) = \arg\max_{\theta}\, p(\mathcal{D} \mid \theta)\, p(\theta) = \arg\max_{\theta}\, [\log p(\mathcal{D} \mid \theta) + \log p(\theta)]$$

Usually $\theta_{\mathrm{MAP}}$ can be solved for easily (e.g., using first/second-order iterative methods).

$H$ is the Hessian matrix of the negative log-posterior (or negative log-joint probability) at $\theta_{\mathrm{MAP}}$:
$$H = -\nabla^2 \log p(\theta \mid \mathcal{D})\big|_{\theta=\theta_{\mathrm{MAP}}} = -\nabla^2 \log p(\mathcal{D}, \theta)\big|_{\theta=\theta_{\mathrm{MAP}}} = -\nabla^2 [\log p(\mathcal{D} \mid \theta) + \log p(\theta)]\big|_{\theta=\theta_{\mathrm{MAP}}}$$

Derivation of the Laplace Approximation

Let's write Bayes' rule as
$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D}, \theta)}{p(\mathcal{D})} = \frac{p(\mathcal{D}, \theta)}{\int p(\mathcal{D}, \theta)\, d\theta} = \frac{e^{\log p(\mathcal{D}, \theta)}}{\int e^{\log p(\mathcal{D}, \theta)}\, d\theta}$$

Suppose $\log p(\mathcal{D}, \theta) = f(\theta)$. Let's approximate $f(\theta)$ using its 2nd-order Taylor expansion
$$f(\theta) \approx f(\theta_0) + (\theta - \theta_0)^\top \nabla f(\theta_0) + \frac{1}{2}(\theta - \theta_0)^\top \nabla^2 f(\theta_0)(\theta - \theta_0)$$
where $\theta_0$ is some arbitrarily chosen point in the domain of $f$.

Let's choose $\theta_0 = \theta_{\mathrm{MAP}}$. Since $\theta_{\mathrm{MAP}}$ maximizes $f$, we have $\nabla f(\theta_{\mathrm{MAP}}) = \nabla \log p(\mathcal{D}, \theta_{\mathrm{MAP}}) = 0$. Therefore
$$\log p(\mathcal{D}, \theta) \approx \log p(\mathcal{D}, \theta_{\mathrm{MAP}}) + \frac{1}{2}(\theta - \theta_{\mathrm{MAP}})^\top \nabla^2 \log p(\mathcal{D}, \theta_{\mathrm{MAP}})\, (\theta - \theta_{\mathrm{MAP}})$$

Plugging this 2nd-order Taylor approximation for $\log p(\mathcal{D}, \theta)$ into Bayes' rule, we have
$$p(\theta \mid \mathcal{D}) = \frac{e^{\log p(\mathcal{D}, \theta)}}{\int e^{\log p(\mathcal{D}, \theta)}\, d\theta} \approx \frac{e^{\log p(\mathcal{D}, \theta_{\mathrm{MAP}}) + \frac{1}{2}(\theta - \theta_{\mathrm{MAP}})^\top \nabla^2 \log p(\mathcal{D}, \theta_{\mathrm{MAP}})(\theta - \theta_{\mathrm{MAP}})}}{\int e^{\log p(\mathcal{D}, \theta_{\mathrm{MAP}}) + \frac{1}{2}(\theta - \theta_{\mathrm{MAP}})^\top \nabla^2 \log p(\mathcal{D}, \theta_{\mathrm{MAP}})(\theta - \theta_{\mathrm{MAP}})}\, d\theta}$$

Further simplifying (the constant factor $e^{\log p(\mathcal{D}, \theta_{\mathrm{MAP}})}$ cancels between numerator and denominator), we have
$$p(\theta \mid \mathcal{D}) \approx \frac{e^{-\frac{1}{2}(\theta - \theta_{\mathrm{MAP}})^\top \{-\nabla^2 \log p(\mathcal{D}, \theta_{\mathrm{MAP}})\}(\theta - \theta_{\mathrm{MAP}})}}{\int e^{-\frac{1}{2}(\theta - \theta_{\mathrm{MAP}})^\top \{-\nabla^2 \log p(\mathcal{D}, \theta_{\mathrm{MAP}})\}(\theta - \theta_{\mathrm{MAP}})}\, d\theta}$$

Therefore the Laplace approximation of the posterior $p(\theta \mid \mathcal{D})$ is a Gaussian, given by
$$p(\theta \mid \mathcal{D}) \approx \mathcal{N}(\theta \mid \theta_{\mathrm{MAP}}, H^{-1}), \quad \text{where } H = -\nabla^2 \log p(\mathcal{D}, \theta_{\mathrm{MAP}})$$
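To make the generic recipe concrete, here is a minimal Python/NumPy/SciPy sketch (not from the slides): find $\theta_{\mathrm{MAP}}$ by numerically minimizing the negative log-joint, then take the Hessian of the negative log-joint at that point as the inverse covariance of the Gaussian. The callables `neg_log_joint` and `grad` are hypothetical user-supplied functions, and the finite-difference Hessian is an illustrative choice; in practice one would use analytic or automatic second derivatives.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approximation(neg_log_joint, grad, theta_init, eps=1e-5):
    """Laplace approximation N(theta_MAP, H^{-1}) of a posterior p(theta | D).

    neg_log_joint : callable theta -> -log p(D, theta)   (hypothetical user input)
    grad          : callable theta -> gradient of neg_log_joint at theta
    Returns (theta_MAP, covariance H^{-1}).
    """
    # Step 1: theta_MAP minimizes the negative log-joint (= maximizes log p(D, theta)).
    res = minimize(neg_log_joint, theta_init, jac=grad, method="BFGS")
    theta_map = res.x

    # Step 2: H = Hessian of the negative log-joint at theta_MAP,
    # approximated here by central differences of the supplied gradient.
    D = theta_map.size
    H = np.zeros((D, D))
    for d in range(D):
        e = np.zeros(D)
        e[d] = eps
        H[:, d] = (grad(theta_map + e) - grad(theta_map - e)) / (2.0 * eps)
    H = 0.5 * (H + H.T)  # symmetrize away numerical asymmetry

    return theta_map, np.linalg.inv(H)
```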
Properties of Laplace Approximation

- Usually straightforward if the derivatives (first and second) can be computed easily.
- Expensive if the number of parameters is very large (due to the Hessian computation and inversion).
- Can do badly if the (true) posterior is multimodal.
- Can actually be applied to any regularized loss function (not just probabilistic models) to get a Gaussian posterior distribution over the parameters, by interpreting: negative log-likelihood (NLL) = loss function, negative log-prior = regularizer.
- Easy exercise: try doing this for $\ell_2$-regularized least squares regression (you will get the same posterior as in Bayesian linear regression).

Laplace Approximation for Bayesian Logistic Regression

Data $\mathcal{D} = (X, y)$ and parameter $\theta = w$. The Laplace approximation of the posterior will be
$$p(w \mid X, y) \approx \mathcal{N}(w_{\mathrm{MAP}}, H^{-1})$$

The required quantities are defined as
$$w_{\mathrm{MAP}} = \arg\max_{w}\, \log p(w \mid y, X) = \arg\max_{w}\, \log p(y, w \mid X) = \arg\min_{w}\, [-\log p(y, w \mid X)]$$
$$H = \nabla^2 [-\log p(y, w \mid X)]\big|_{w = w_{\mathrm{MAP}}}$$

We can compute $w_{\mathrm{MAP}}$ using iterative optimization methods:
- First-order (gradient) methods: $w_{t+1} = w_t - \eta\, g_t$. Requires the gradient of $-\log p(y, w \mid X)$, i.e., $g = \nabla[-\log p(y, w \mid X)]$.
- Second-order methods: $w_{t+1} = w_t - H_t^{-1} g_t$. Requires both the gradient and the Hessian (defined above).

Note: when using second-order methods for estimating $w_{\mathrm{MAP}}$, we get the Hessian needed for the Laplace approximation of the posterior anyway.

An Aside: Gradient and Hessian for Logistic Regression

The LR objective function $-\log p(y, w \mid X) = -\log p(y \mid X, w) - \log p(w)$ can be written as
$$-\log \prod_{n=1}^{N} p(y_n \mid x_n, w) - \log p(w) = -\sum_{n=1}^{N} \log p(y_n \mid x_n, w) - \log p(w)$$

For the logistic regression model, $p(y_n \mid x_n, w) = \mu_n^{y_n} (1 - \mu_n)^{1 - y_n}$, where $\mu_n = \frac{\exp(w^\top x_n)}{1 + \exp(w^\top x_n)}$.

With a Gaussian prior $p(w) = \mathcal{N}(w \mid 0, \lambda^{-1} I) \propto \exp(-\frac{\lambda}{2} w^\top w)$, the gradient and Hessian will be
$$g = -\sum_{n=1}^{N} (y_n - \mu_n)\, x_n + \lambda w = X^\top (\mu - y) + \lambda w \quad (\text{a } D \times 1 \text{ vector})$$
$$H = \sum_{n=1}^{N} \mu_n (1 - \mu_n)\, x_n x_n^\top + \lambda I = X^\top S X + \lambda I \quad (\text{a } D \times D \text{ matrix})$$
where $\mu = [\mu_1, \ldots, \mu_N]^\top$ is $N \times 1$ and $S$ is an $N \times N$ diagonal matrix with $S_{nn} = \mu_n(1 - \mu_n)$. (These quantities are used in the Newton-method sketch below.)

Logistic Regression: Predictive Distributions

When using MLE, the (plug-in) predictive distribution will be
$$p(y_* = 1 \mid x_*, w_{\mathrm{MLE}}) = \sigma(w_{\mathrm{MLE}}^\top x_*), \qquad p(y_* \mid x_*, w_{\mathrm{MLE}}) = \mathrm{Bernoulli}(\sigma(w_{\mathrm{MLE}}^\top x_*))$$

When using MAP, the predictive distribution will be
$$p(y_* = 1 \mid x_*, w_{\mathrm{MAP}}) = \sigma(w_{\mathrm{MAP}}^\top x_*), \qquad p(y_* \mid x_*, w_{\mathrm{MAP}}) = \mathrm{Bernoulli}(\sigma(w_{\mathrm{MAP}}^\top x_*))$$

When using fully Bayesian inference, the posterior predictive distribution is based on posterior averaging:
$$p(y_* = 1 \mid x_*, X, y) = \int p(y_* = 1 \mid x_*, w)\, p(w \mid X, y)\, dw = \int \sigma(w^\top x_*)\, p(w \mid X, y)\, dw$$

The above is hard in general. :-( If using the Laplace approximation for $p(w \mid X, y)$, it becomes
$$p(y_* = 1 \mid x_*, X, y) \approx \int \sigma(w^\top x_*)\, \mathcal{N}(w \mid w_{\mathrm{MAP}}, H^{-1})\, dw$$

Even after the Laplace approximation for $p(w \mid X, y)$, this integral is still intractable, so we will need to also approximate the posterior predictive. :-)
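The closed-form gradient and Hessian above are all a second-order solver needs, and the Hessian at convergence is exactly the $H$ required by the Laplace approximation. Below is a minimal Newton-method sketch under the assumptions that `X` is the $N \times D$ data matrix, `y` holds 0/1 labels, and `lam` is the prior precision $\lambda$; the fixed iteration count and variable names are illustrative choices, not from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_logistic_regression(X, y, lam=1.0, num_iters=25):
    """Laplace approximation N(w_MAP, H^{-1}) for Bayesian logistic regression
    with prior p(w) = N(0, lam^{-1} I), using Newton's method:
        w <- w - H^{-1} g,  g = X^T (mu - y) + lam * w,  H = X^T S X + lam * I.
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        mu = sigmoid(X @ w)                            # mu_n = sigma(w^T x_n)
        g = X.T @ (mu - y) + lam * w                   # D x 1 gradient of -log p(y, w | X)
        S = mu * (1.0 - mu)                            # diagonal of S (as an N-vector)
        H = X.T @ (X * S[:, None]) + lam * np.eye(D)   # D x D Hessian
        w = w - np.linalg.solve(H, g)                  # Newton update
    # Hessian at w_MAP gives the Laplace posterior covariance H^{-1}
    mu = sigmoid(X @ w)
    S = mu * (1.0 - mu)
    H = X.T @ (X * S[:, None]) + lam * np.eye(D)
    return w, np.linalg.inv(H)
```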
Posterior Predictive via Monte-Carlo Sampling

The posterior predictive is given by the following integral:
$$p(y_* = 1 \mid x_*, X, y) = \int \sigma(w^\top x_*)\, \mathcal{N}(w \mid w_{\mathrm{MAP}}, H^{-1})\, dw$$

Monte-Carlo approximation: draw several samples of $w$ from $\mathcal{N}(w \mid w_{\mathrm{MAP}}, H^{-1})$ and replace the above integral by an empirical average of $\sigma(w^\top x_*)$ computed using each of those samples:
$$p(y_* = 1 \mid x_*, X, y) \approx \frac{1}{S} \sum_{s=1}^{S} \sigma(w_s^\top x_*), \quad \text{where } w_s \sim \mathcal{N}(w \mid w_{\mathrm{MAP}}, H^{-1}), \; s = 1, \ldots, S$$
(a code sketch of this estimator appears after the probit discussion below).

More on Monte-Carlo methods when we discuss MCMC sampling.

Predictive Posterior via Probit Approximation

The posterior predictive we wanted to compute was
$$p(y_* = 1 \mid x_*, X, y) \approx \int \sigma(w^\top x_*)\, \mathcal{N}(w \mid w_{\mathrm{MAP}}, H^{-1})\, dw$$

In the above, let's replace the sigmoid $\sigma(w^\top x_*)$ by $\Phi(w^\top x_*)$, the CDF of the standard normal:
$$\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^2/2}\, dt \quad (\text{note: } z \text{ is a scalar and } 0 \le \Phi(z) \le 1)$$

Note: the inverse of $\Phi$ is known as the probit function, which gives this approximation its name. The approach relies on a numerical approximation of the sigmoid by a suitably scaled $\Phi$ (as we will see).

With this replacement, the predictive posterior will be
$$p(y_* = 1 \mid x_*, X, y) \approx \int \Phi(w^\top x_*)\, \mathcal{N}(w \mid w_{\mathrm{MAP}}, H^{-1})\, dw \quad (\text{an expectation})$$
$$= \int_{-\infty}^{\infty} \Phi(a)\, \mathcal{N}(a \mid \mu_a, \sigma_a^2)\, da \quad (\text{an equivalent, one-dimensional expectation})$$

Since $a = w^\top x_* = x_*^\top w$ and $w$ is normally distributed, $p(a) = \mathcal{N}(a \mid \mu_a, \sigma_a^2)$ with $\mu_a = w_{\mathrm{MAP}}^\top x_*$ and $\sigma_a^2 = x_*^\top H^{-1} x_*$ (this follows from the linear transformation property of Gaussian random variables).

Given $\mu_a$ and $\sigma_a^2$, the predictive posterior will be
$$p(y_* = 1 \mid x_*, X, y) \approx \int_{-\infty}^{\infty} \Phi(a)\, \mathcal{N}(a \mid \mu_a, \sigma_a^2)\, da = \Phi\!\left(\frac{\mu_a}{\sqrt{1 + \sigma_a^2}}\right)$$

Note that the variance $\sigma_a^2$ also "moderates" the probability of $y_*$ being 1 (plugging in the MAP estimate alone would give $\Phi(\mu_a)$).

Since the logistic and probit functions aren't exactly identical, we usually scale $a$ by a scalar $t$ such that $\sigma(a) \approx \Phi(t a)$; matching the slopes of the two functions at the origin gives $t^2 = \pi/8$, and carrying this scaling through the integral yields
$$p(y_* = 1 \mid x_*, X, y) \approx \sigma\!\left(\frac{\mu_a}{\sqrt{1 + \pi \sigma_a^2 / 8}}\right)$$
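A small sketch of the Monte-Carlo estimator described above, assuming `w_map` and `H_inv` come from a Laplace approximation such as the Newton sketch earlier; the sample count `S` is an arbitrary illustrative default.

```python
import numpy as np

def predictive_mc(x_star, w_map, H_inv, S=1000, rng=None):
    """Monte-Carlo estimate of p(y* = 1 | x*, X, y):
    average sigma(w_s^T x*) over samples w_s ~ N(w_MAP, H^{-1})."""
    rng = np.random.default_rng() if rng is None else rng
    W = rng.multivariate_normal(w_map, H_inv, size=S)    # S x D samples of w
    return np.mean(1.0 / (1.0 + np.exp(-(W @ x_star))))  # empirical average of sigmoids
```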

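And a corresponding sketch of the probit-style approximation: it needs only the scalar mean $\mu_a$ and variance $\sigma_a^2$ of $a = w^\top x_*$, with the $\pi/8$ scaling applied so that the sigmoid and the scaled $\Phi$ agree near the origin. Again, `w_map` and `H_inv` are assumed to come from the earlier Laplace step.

```python
import numpy as np

def predictive_probit(x_star, w_map, H_inv):
    """Probit approximation of p(y* = 1 | x*, X, y):
    sigma(mu_a / sqrt(1 + pi * sigma_a^2 / 8)),
    where mu_a = w_MAP^T x*, sigma_a^2 = x*^T H^{-1} x*."""
    mu_a = w_map @ x_star
    var_a = x_star @ H_inv @ x_star
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var_a / 8.0)  # moderation factor
    return 1.0 / (1.0 + np.exp(-kappa * mu_a))
```

For large $\sigma_a^2$ the moderated prediction is pulled toward 0.5 relative to the plug-in MAP prediction $\sigma(\mu_a)$, and it usually agrees well with the Monte-Carlo estimate above.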