Approximate Inference

Approximate Inference 9.520 Class 19 Ruslan Salakhutdinov BCS and CSAIL, MIT 1 Plan 1. Introduction/Notation. 2. Examples of successful Bayesian models. 3. Laplace and Variational Inference. 4. Basic Sampling Algorithms. 5. Markov chain Monte Carlo algorithms. 2 References/Acknowledgements • Chris Bishop's book: Pattern Recognition and Machine Learning, chapter 11 (many figures are borrowed from this book). • David MacKay's book: Information Theory, Inference, and Learning Algorithms, chapters 29-32. • Radford Neals's technical report on Probabilistic Inference Using Markov Chain Monte Carlo Methods. • Zoubin Ghahramani's ICML tutorial on Bayesian Machine Learning: http://www.gatsby.ucl.ac.uk/∼zoubin/ICML04-tutorial.html • Ian Murray's tutorial on Sampling Methods: http://www.cs.toronto.edu/∼murray/teaching/ 3 Basic Notation P (x) probability of x P (xjθ) conditional probability of x given θ P (x; θ) joint probability of x and θ Bayes Rule: P (xjθ)P (θ) P (θjx) = P (x) where Z P (x) = P (x; θ)dθ Marginalization I will use probability distribution and probability density interchangeably. It should be obvious from the context. 4 Inference Problem Given a dataset D = fx1; :::; xng: Bayes Rule: P (Djθ) Likelihood function of θ P (Djθ)P (θ) P (θjD) = P (θ) Prior probability of θ P (D) P (θjD) Posterior distribution over θ Computing posterior distribution is known as the inference problem. But: Z P (D) = P (D; θ)dθ This integral can be very high-dimensional and difficult to compute. 5 Prediction P (Djθ) Likelihood function of θ P (Djθ)P (θ) P (θjD) = P (D) P (θ) Prior probability of θ P (θjD) Posterior distribution over θ Prediction: Given D, computing conditional probability of x∗ requires computing the following integral: Z P (x∗jD) = P (x∗jθ; D)P (θjD)dθ ∗ = EP (θjD)[P (x jθ; D)] which is sometimes called predictive distribution. Computing predictive distribution requires posterior P (θjD). 6 Model Selection Compare model classes, e.g. M1 and M2. Need to compute posterior probabilities given D: P (DjM)P (M) P (MjD) = P (D) where Z P (DjM) = P (Djθ; M)P (θjM)dθ is known as the marginal likelihood or evidence. 7 Computational Challenges • Computing marginal likelihoods often requires computing very high- dimensional integrals. • Computing posterior distributions (and hence predictive distributions) is often analytically intractable. • In this class, we will concentrate on Markov Chain Monte Carlo (MCMC) methods for performing approximate inference. • First, let us look at some specific examples: { Bayesian Probabilistic Matrix Factorization { Bayesian Neural Networks { Dirichlet Process Mixtures (last class) 8 Bayesian PMF User Features 1 2 3 4 5 6 7 ... 1 5 3 ? 1 ... V 2 3 ? 4 ? 3 2 ... 3 4 Movie 5 R ~ U Features 6 7 ... We have N users, M movies, and integer rating values from 1 to K. D×N D×M Let rij be the rating of user i for movie j, and U 2 R , V 2 R be latent user and movie feature matrices: R ≈ U >V Goal: Predict missing ratings. 9 Bayesian PMF α α VU Probabilistic linear model with Gaussian observation noise. Likelihood: Θ Θ VU 2 > 2 p(rijjui; vj; σ ) = N (rijjui vj; σ ) Gaussian Priors over parameters: Vj U i N Y p(UjµU ; ΛU ) = N (uijµu; Σu); i=1 Rij i=1,...,N M j=1,...,M Y p(V jµV ; ΛV ) = N (vijµv; Σv): σ i=1 Conjugate Gaussian-inverse-Wishart priors on the user and movie hyperparameters ΘU = fµu; Σug and ΘV = fµv; Σvg. Hierarchical Prior. 10 Bayesian PMF ∗ Predictive distribution: Consider predicting a rating rij for user i and query movie j: ZZ ∗ ∗ p(rijjR) = p(rijjui; vj)p(U; V; ΘU ; ΘV jR)dfU; V gdfΘU ; ΘV g | {z } Posterior over parameters and hyperparameters Exact evaluation of this predictive distribution is analytically intractable. Posterior distribution p(U; V; ΘU ; ΘV jR) is complicated and does not have a closed form expression. Need to approximate. 11 Bayesian Neural Nets n N Regression problem: Given a set of i:i:d observations X = fx gn=1 n N with corresponding targets D = ft gn=1. Likelihood: N Y p(DjX; w) = N (tnjy(xn; w); β2) n=1 The mean is given by the output of the neural network: M D X 2 X 1 yk(x; w) = wkjσ wjixi j=0 i=0 where σ(x) is the sigmoid function. Gaussian prior over the network parameters: p(w) = N (0; α2I). 12 Bayesian Neural Nets Likelihood: N Y p(DjX; w) = N (tnjy(xn; w); β2) n=1 Gaussian prior over parameters: p(w) = N (0; α2I) Posterior is analytically intractable: p(Djw; X)p(w) p(wjD; X) = R p(Djw; X)p(w)dw Remark: Under certain conditions, Radford Neal (1994) showed, as the number of hidden units go to infinity, a Gaussian prior over parameters results in a Gaussian process prior for functions. 13 Undirected Models x is a binary random vector with xi 2 f+1; −1g: 1 X X p(x) = exp θ x x + θ x : Z ij i j i i (i;j)2E i2V where Z is known as partition function: X X X Z = exp θijxixj + θixi : x (i;j)2E i2V If x is 100-dimensional, need to sum over 2100 terms. The sum might decompose (e.g. junction tree). Otherwise we need to approximate. Remark: Compare to marginal likelihood. 14 Inference For most situations we will be interested in evaluating the expectation: Z E[f] = f(z)p(z)dz p~(z) We will use the following notation: p(z) = Z . We can evaluate p~(z) pointwise, but cannot evaluate Z. 1 • Posterior distribution: P (θjD) = P (D)P (Djθ)P (θ) 1 • Markov random fields: P (z) = Z exp(−E(z)) 15 Laplace Approximation Consider: 0.8 p~(z) p(z) = (1) 0.6 Z 0.4 Goal: Find a Gaussian approximation 0.2 q(z) which is centered on a mode 0 of the distribution p(z). −2 −1 0 1 2 3 4 At a stationary point z0 the gradient 5p~(z) vanishes. Consider a Taylor expansion of lnp ~(z): 1 lnp ~(z) ≈ lnp ~(z ) − (z − z )T A(z − z ) 0 2 0 0 where A is a Hessian matrix: A = − 5 5 lnp ~(z)jz=z0 16 Laplace Approximation Consider: 0.8 p~(z) p(z) = (2) 0.6 Z 0.4 Goal: Find a Gaussian approximation 0.2 q(z) which is centered on a mode 0 of the distribution p(z). −2 −1 0 1 2 3 4 Exponentiating both sides: 1 p~(z) ≈ p~(z ) exp − (z − z )T A(z − z ) 0 2 0 0 We get a multivariate Gaussian approximation: jAj1=2 1 q(z) = exp − (z − z )T A(z − z ) (2π)D=2 2 0 0 17 Laplace Approximation p~(z) Remember p(z) = Z , where we approximate: Z Z 1 (2π)D=2 Z = p~(z)dz ≈ p~(z ) exp − (z − z )T A(z − z ) =p ~(z ) 0 2 0 0 0 jAj1=2 1 Bayesian Inference: P (θjD) = P (D)P (Djθ)P (θ). Identify: p~(θ) = P (Djθ)P (θ) and Z = P (D): • The posterior is approximately Gaussian around the MAP estimate θMAP jAj1=2 1 p(θjD) ≈ exp − (θ − θ )T A(θ − θ ) (2π)D=2 2 MAP MAP 18 Laplace Approximation p~(z) Remember p(z) = Z , where we approximate: Z Z 1 (2π)D=2 Z = p~(z)dz ≈ p~(z ) exp − (z − z )T A(z − z ) =p ~(z ) 0 2 0 0 0 jAj1=2 1 Bayesian Inference: P (θjD) = P (D)P (Djθ)P (θ). Identify: p~(θ) = P (Djθ)P (θ) and Z = P (D): • Can approximate Model Evidence: Z P (D) = P (Djθ)P (θ)dθ • Using Laplace approximation D 1 ln P (D) ≈ ln P (Djθ ) + ln P (θ ) + ln 2π − ln jAj MAP MAP 2 2 | {z } Occam factor: penalize model complexity 19 Bayesian Information Criterion BIC can be obtained from the Laplace approximation: D 1 ln P (D) ≈ ln P (Djθ ) + ln P (θ ) + ln 2π − ln jAj MAP MAP 2 2 by taking the large sample limit (N ! 1) where N is the number of data points: 1 ln P (D) ≈ P (Djθ ) − D ln N MAP 2 • Quick, easy, does not depend on the prior. • Can use maximum likelihood estimate of θ instead of the MAP estimate • D denotes the number of \well-determined parameters" • Danger: Counting parameters can be tricky (e.g. infinite models) 20 Variational Inference Key Idea: Approximate intractable distribution p(θjD) with simpler, tractable distribution q(θ). We can lower bound the marginal likelihood using Jensen's inequality: Z Z P (D; θ) ln p(D) = ln p(D; θ)dθ = ln q(θ) dθ q(θ) Z p(D; θ) Z Z 1 ≥ q(θ) ln dθ = q(θ) ln p(D; θ)dθ + q(θ) ln dθ q(θ) q(θ) | {z } Entropy functional | {z } Variational Lower-Bound = ln p(D) − KL(q(θ)jjp(θjD)) = L(q) where KL(qjjp) is a Kullback{Leibler divergence. It is a non-symmetric measure of the difference between two probability distributions q and p. The goal of variational inference is to maximize the variational lower-bound w.r.t. approximate q distribution, or minimize KL(qjjp). 21 Variational Inference Key Idea: Approximate intractable distribution p(θjD) with simpler, tractable distribution q(θ) by minimizing KL(q(θ)jjp(θjD)). QD We can choose a fully factorized distribution: q(θ) = i=1 qi(θi), also known as a mean-field approximation.

Approximate Inference

The Composite Marginal Likelihood (CML) Inference Approach with Applications to Discrete and Mixed Dependent Variable Models

A Widely Applicable Bayesian Information Criterion

Categorical Distributions in Natural Language Processing Version 0.1

Binomial and Multinomial Distributions

Bayesian Monte Carlo

Marginal Likelihood

Bayesian Inference

On the Derivation of the Bayesian Information Criterion

Marginal Likelihood from the Gibbs Output Siddhartha Chib Journal Of

Bayesian Analysis of Graphical Models of Marginal Independence for Three Way Contingency Tables

Objective Priors in the Empirical Bayes Framework Arxiv:1612.00064V5 [Stat.ME] 11 May 2020

Estimating the Marginal Likelihood with Integrated Nested Laplace Approximation (INLA)