
Approximate Inference
9.520 Class 19
Ruslan Salakhutdinov
BCS and CSAIL, MIT

Plan
1. Introduction/Notation.
2. Examples of successful Bayesian models.
3. Laplace and Variational Inference.
4. Basic Sampling Algorithms.
5. Markov chain Monte Carlo algorithms.

References/Acknowledgements
• Chris Bishop's book: Pattern Recognition and Machine Learning, chapter 11 (many figures are borrowed from this book).
• David MacKay's book: Information Theory, Inference, and Learning Algorithms, chapters 29-32.
• Radford Neal's technical report on Probabilistic Inference Using Markov Chain Monte Carlo Methods.
• Zoubin Ghahramani's ICML tutorial on Bayesian Machine Learning: http://www.gatsby.ucl.ac.uk/~zoubin/ICML04-tutorial.html
• Iain Murray's tutorial on Sampling Methods: http://www.cs.toronto.edu/~murray/teaching/

Basic Notation
    P(x)       probability of x
    P(x|θ)     conditional probability of x given θ
    P(x, θ)    joint probability of x and θ

Bayes Rule:
    P(θ|x) = P(x|θ)P(θ) / P(x)
where
    P(x) = ∫ P(x, θ) dθ     (marginalization)

I will use probability distribution and probability density interchangeably. It should be obvious from the context.

Inference Problem
Given a dataset D = {x1, ..., xn}:

Bayes Rule:
    P(θ|D) = P(D|θ)P(θ) / P(D)

    P(D|θ)    Likelihood function of θ
    P(θ)      Prior probability of θ
    P(θ|D)    Posterior distribution over θ

Computing the posterior distribution is known as the inference problem. But:
    P(D) = ∫ P(D, θ) dθ
This integral can be very high-dimensional and difficult to compute.

Prediction
Given D, computing the conditional probability of a new observation x* requires the integral:
    P(x*|D) = ∫ P(x*|θ, D) P(θ|D) dθ = E_{P(θ|D)}[P(x*|θ, D)],
which is sometimes called the predictive distribution. Computing the predictive distribution requires the posterior P(θ|D).

Model Selection
Compare model classes, e.g. M1 and M2.
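As a toy illustration of comparing model classes by their evidence (a hypothetical coin-flip example, not from the slides): when θ is one-dimensional, the integrals above can simply be computed on a grid.

```python
import numpy as np

# Hypothetical data: 8 heads out of 10 coin tosses; θ is the heads probability.
heads, tosses = 8, 10

theta = np.linspace(1e-6, 1 - 1e-6, 2001)   # grid over the parameter
dtheta = theta[1] - theta[0]

def evidence(prior):
    """P(D|M) = ∫ P(D|θ) P(θ|M) dθ, approximated by a Riemann sum."""
    likelihood = theta**heads * (1 - theta)**(tosses - heads)
    return float(np.sum(likelihood * prior) * dtheta)

# M1: uniform prior over θ.  M2: prior concentrated near a fair coin.
prior1 = np.ones_like(theta)
prior2 = np.exp(-0.5 * ((theta - 0.5) / 0.05) ** 2)
prior2 /= np.sum(prior2) * dtheta            # normalize to a density

e1, e2 = evidence(prior1), evidence(prior2)
post_m1 = e1 / (e1 + e2)   # P(M1|D) under equal prior model probabilities
```

With these made-up numbers the mostly-heads data favor the more flexible model M1: the evidence automatically trades off fit against prior commitment.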
Need to compute posterior probabilities given D:
    P(M|D) = P(D|M)P(M) / P(D)
where
    P(D|M) = ∫ P(D|θ, M) P(θ|M) dθ
is known as the marginal likelihood or evidence.

Computational Challenges
• Computing marginal likelihoods often requires computing very high-dimensional integrals.
• Computing posterior distributions (and hence predictive distributions) is often analytically intractable.
• In this class, we will concentrate on Markov chain Monte Carlo (MCMC) methods for performing approximate inference.
• First, let us look at some specific examples:
  – Bayesian Probabilistic Matrix Factorization
  – Bayesian Neural Networks
  – Dirichlet Process Mixtures (last class)

Bayesian PMF
[Figure: a partially observed N×M rating matrix R (missing entries marked "?") factorized as R ≈ UᵀV, with latent user features U and movie features V.]

We have N users, M movies, and integer rating values from 1 to K. Let r_ij be the rating of user i for movie j, and let U ∈ R^{D×N}, V ∈ R^{D×M} be latent user and movie feature matrices:
    R ≈ UᵀV
Goal: Predict missing ratings.

Bayesian PMF
[Figure: graphical model with hyperpriors α over the hyperparameters Θ_U, Θ_V, latent features u_i (i = 1, ..., N) and v_j (j = 1, ..., M), observed ratings r_ij, and noise level σ.]

Probabilistic linear model with Gaussian observation noise. Likelihood:
    p(r_ij | u_i, v_j, σ²) = N(r_ij | u_iᵀv_j, σ²)

Gaussian priors over parameters:
    p(U | Θ_U) = Π_{i=1}^{N} N(u_i | µ_u, Σ_u),
    p(V | Θ_V) = Π_{j=1}^{M} N(v_j | µ_v, Σ_v).

Conjugate Gaussian-inverse-Wishart priors on the user and movie hyperparameters Θ_U = {µ_u, Σ_u} and Θ_V = {µ_v, Σ_v}: a hierarchical prior.

Bayesian PMF
Predictive distribution: Consider predicting a rating r*_ij for user i and query movie j:
    p(r*_ij | R) = ∫∫ p(r*_ij | u_i, v_j) p(U, V, Θ_U, Θ_V | R) d{U, V} d{Θ_U, Θ_V},
where p(U, V, Θ_U, Θ_V | R) is the posterior over parameters and hyperparameters.

Exact evaluation of this predictive distribution is analytically intractable: the posterior p(U, V, Θ_U, Θ_V | R) is complicated and does not have a closed-form expression. Need to approximate.

Bayesian Neural Nets
Regression problem: Given a set of i.i.d. observations X = {x^n}, n = 1, ..., N, with corresponding targets D = {t^n}, n = 1, ..., N.
Likelihood:
    p(D | X, w) = Π_{n=1}^{N} N(t^n | y(x^n, w), β²)

The mean is given by the output of the neural network:
    y_k(x, w) = Σ_{j=0}^{M} w_kj^{(2)} σ( Σ_{i=0}^{D} w_ji^{(1)} x_i ),
where σ(x) is the sigmoid function.

Gaussian prior over the network parameters: p(w) = N(0, α²I).

Bayesian Neural Nets
Likelihood:
    p(D | X, w) = Π_{n=1}^{N} N(t^n | y(x^n, w), β²)
Gaussian prior over parameters: p(w) = N(0, α²I).

The posterior is analytically intractable:
    p(w | D, X) = p(D | w, X) p(w) / ∫ p(D | w, X) p(w) dw

Remark: Under certain conditions, Radford Neal (1994) showed that, as the number of hidden units goes to infinity, a Gaussian prior over parameters results in a Gaussian process prior over functions.

Undirected Models
x is a binary random vector with x_i ∈ {+1, −1}:
    p(x) = (1/Z) exp( Σ_{(i,j)∈E} θ_ij x_i x_j + Σ_{i∈V} θ_i x_i ),
where Z is known as the partition function:
    Z = Σ_x exp( Σ_{(i,j)∈E} θ_ij x_i x_j + Σ_{i∈V} θ_i x_i ).

If x is 100-dimensional, we need to sum over 2^100 terms. The sum might decompose (e.g. junction tree). Otherwise we need to approximate. Remark: Compare to the marginal likelihood.

Inference
For most situations we will be interested in evaluating the expectation:
    E[f] = ∫ f(z) p(z) dz
We will use the notation p(z) = p̃(z)/Z: we can evaluate p̃(z) pointwise, but cannot evaluate Z.
• Posterior distribution: P(θ|D) = (1/P(D)) P(D|θ) P(θ)
• Markov random fields: P(z) = (1/Z) exp(−E(z))

Laplace Approximation
[Figure: a distribution p(z) and a Gaussian approximation q(z) centered on one of its modes.]

Consider:
    p(z) = p̃(z)/Z
Goal: Find a Gaussian approximation q(z) which is centered on a mode of the distribution p(z). At a stationary point z₀ the gradient ∇p̃(z) vanishes. Consider a Taylor expansion of ln p̃(z):
    ln p̃(z) ≈ ln p̃(z₀) − (1/2)(z − z₀)ᵀA(z − z₀),
where A is the Hessian matrix:
    A = −∇∇ ln p̃(z) |_{z=z₀}
Exponentiating both sides:
    p̃(z) ≈ p̃(z₀) exp( −(1/2)(z − z₀)ᵀA(z − z₀) )
We get a multivariate Gaussian approximation:
    q(z) = (|A|^{1/2} / (2π)^{D/2}) exp( −(1/2)(z − z₀)ᵀA(z − z₀) )

Laplace Approximation
Remember p(z) = p̃(z)/Z, where we approximate:
    Z = ∫ p̃(z) dz ≈ p̃(z₀) ∫ exp( −(1/2)(z − z₀)ᵀA(z − z₀) ) dz = p̃(z₀) (2π)^{D/2} / |A|^{1/2}

Bayesian Inference: P(θ|D) = (1/P(D)) P(D|θ) P(θ). Identify p̃(θ) = P(D|θ)P(θ) and Z = P(D):
• The posterior is approximately Gaussian around the MAP estimate θ_MAP:
    p(θ|D) ≈ (|A|^{1/2} / (2π)^{D/2}) exp( −(1/2)(θ − θ_MAP)ᵀA(θ − θ_MAP) )
• Can approximate the model evidence:
    P(D) = ∫ P(D|θ) P(θ) dθ
• Using the Laplace approximation:
    ln P(D) ≈ ln P(D|θ_MAP) + ln P(θ_MAP) + (D/2) ln 2π − (1/2) ln |A|,
where the last three terms form the Occam factor, which penalizes model complexity.

Bayesian Information Criterion
BIC can be obtained from the Laplace approximation:
    ln P(D) ≈ ln P(D|θ_MAP) + ln P(θ_MAP) + (D/2) ln 2π − (1/2) ln |A|
by taking the large-sample limit (N → ∞), where N is the number of data points:
    ln P(D) ≈ ln P(D|θ_MAP) − (D/2) ln N
• Quick, easy, does not depend on the prior.
• Can use the maximum likelihood estimate of θ instead of the MAP estimate.
• D denotes the number of "well-determined parameters".
• Danger: Counting parameters can be tricky (e.g. infinite models).

Variational Inference
Key Idea: Approximate the intractable distribution p(θ|D) with a simpler, tractable distribution q(θ).
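The Laplace approximation above admits a quick one-dimensional numerical check. The sketch below uses a hypothetical unnormalized target p̃(z) = z^a e^{−z} (chosen because its true normalizer ∫ p̃(z) dz = Γ(a+1) is known), finds the mode and Hessian analytically, and compares the Laplace estimate of Z with brute-force numerical integration.

```python
import numpy as np

a = 9.0                                    # shape of the hypothetical target
log_ptilde = lambda z: a * np.log(z) - z   # ln p̃(z) = a ln z − z, for z > 0

# Mode: d/dz [a ln z − z] = a/z − 1 = 0  =>  z0 = a.
z0 = a
# A = −d²/dz² ln p̃(z) evaluated at z0, i.e. a/z0² (scalar Hessian).
A = a / z0**2

# Laplace estimate of the normalizer: Z ≈ p̃(z0) · (2π)^{1/2} / A^{1/2}.
Z_laplace = np.exp(log_ptilde(z0)) * np.sqrt(2 * np.pi / A)

# Brute-force Riemann sum on a grid (the integrand is negligible past z = 100).
z = np.linspace(1e-6, 100.0, 200001)
Z_numeric = float(np.sum(np.exp(log_ptilde(z))) * (z[1] - z[0]))
```

For a = 9 the true normalizer is Γ(10) = 362880, and the Laplace estimate lands within about 1% of it: the approximation is accurate here because this target is nearly Gaussian around its mode.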
We can lower bound the marginal likelihood using Jensen's inequality:
    ln p(D) = ln ∫ p(D, θ) dθ = ln ∫ q(θ) (p(D, θ)/q(θ)) dθ
            ≥ ∫ q(θ) ln (p(D, θ)/q(θ)) dθ
            = ∫ q(θ) ln p(D, θ) dθ + ∫ q(θ) ln (1/q(θ)) dθ
The second term is the entropy functional, and the right-hand side as a whole is the variational lower bound
    L(q) = ln p(D) − KL(q(θ) || p(θ|D)),
where KL(q||p) is the Kullback–Leibler divergence, a non-symmetric measure of the difference between two probability distributions q and p. The goal of variational inference is to maximize the variational lower bound w.r.t. the approximating distribution q, or equivalently to minimize KL(q||p).

Variational Inference
Key Idea: Approximate the intractable distribution p(θ|D) with a simpler, tractable distribution q(θ) by minimizing KL(q(θ)||p(θ|D)). We can choose a fully factorized distribution q(θ) = Π_{i=1}^{D} q_i(θ_i), also known as a mean-field approximation.
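As a concrete sketch of the mean-field idea (a standard textbook example, cf. Bishop's PRML, Section 10.1.2; the numbers here are made up): for a two-dimensional Gaussian target with precision matrix Λ, the factorized approximation q(z) = q1(z1)q2(z2) has Gaussian factors with fixed precisions Λ_ii, and coordinate ascent on the lower bound reduces to closed-form updates of the factor means.

```python
import numpy as np

# Hypothetical target: 2-D Gaussian with mean mu and precision matrix Lam.
mu = np.array([0.0, 0.0])
Lam = np.array([[2.0, 1.2],
                [1.2, 2.0]])

# Mean-field factors q_i(z_i) = N(z_i | m_i, 1/Lam[i, i]); only the means
# are updated. Each update conditions on the other factor's current mean:
#   m_i = mu_i − Lam[i, j]/Lam[i, i] · (m_j − mu_j)
m = np.array([1.0, -1.0])          # deliberately bad initialization
for _ in range(50):                # coordinate ascent on the lower bound
    m[0] = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m[1] - mu[1])
    m[1] = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m[0] - mu[0])
```

The iteration recovers the true mean, but the factor variances 1/Λ_ii underestimate the true marginal variances (Λ⁻¹)_ii — a well-known consequence of minimizing KL(q||p) with a factorized q.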