Approximate Inference Using MCMC
Total Page:16
File Type:pdf, Size:1020Kb
Approximate Inference using MCMC 9.520 Class 22 Ruslan Salakhutdinov BCS and CSAIL, MIT 1 Plan 1. Introduction/Notation. 2. Examples of successful Bayesian models. 3. Basic Sampling Algorithms. 4. Markov chains. 5. Markov chain Monte Carlo algorithms. 2 References/Acknowledgements • Chris Bishop’s book: Pattern Recognition and Machine Learning, chapter 11 (many figures are borrowed from this book). • David MacKay’s book: Information Theory, Inference, and Learning Algorithms, chapters 29-32. • Radford Neals’s technical report on Probabilistic Inference Using Markov Chain Monte Carlo Methods. • Zoubin Ghahramani’s ICML tutorial on Bayesian Machine Learning: http://www.gatsby.ucl.ac.uk/∼zoubin/ICML04-tutorial.html • Ian Murray’s tutorial on Sampling Methods: http://www.cs.toronto.edu/∼murray/teaching/ 3 Basic Notation P (x) probability of x P (x|θ) conditional probability of x given θ P (x, θ) joint probability of x and θ Bayes Rule: P (x|θ)P (θ) P (θ|x) = P (x) where P (x) = P (x, θ)dθ Marginalization Z I will use probability distribution and probability density interchangeably. It should be obvious from the context. 4 Inference Problem Given a dataset D = {x1, ..., xn}: Bayes Rule: P (D|θ) Likelihood function of θ P (D|θ)P (θ) P (θ|D)= P (θ) Prior probability of θ P (D) P (θ|D) Posterior distribution over θ Computing posterior distribution is known as the inference problem. But: P (D) = P (D,θ)dθ Z This integral can be very high-dimensional and difficult to compute. 5 Prediction P (D|θ) Likelihood function of θ P (D|θ)P (θ) P (θ|D)= P (D) P (θ) Prior probability of θ P (θ|D) Posterior distribution over θ Prediction: Given D, computing conditional probability of x∗ requires computing the following integral: P (x∗|D) = P (x∗|θ, D)P (θ|D)dθ Z E ∗ = P (θ|D)[P (x |θ, D)] which is sometimes called predictive distribution. Computing predictive distribution requires posterior P (θ|D). 6 Model Selection Compare model classes, e.g. M1 and M2. Need to compute posterior probabilities given D: P (D|M)P (M) P (M|D)= P (D) where P (D|M)= P (D|θ, M)P (θ, M)dθ Z is known as the marginal likelihood or evidence. 7 Computational Challenges • Computing marginal likelihoods often requires computing very high- dimensional integrals. • Computing posterior distributions (and hence predictive distributions) is often analytically intractable. • In this class, we will concentrate on Markov Chain Monte Carlo (MCMC) methods for performing approximate inference. • First, let us look at some specific examples: – Bayesian Probabilistic Matrix Factorization – Bayesian Neural Networks – Dirichlet Process Mixtures (last class) 8 Bayesian PMF User Features 1 2 3 4 5 6 7 ... 1 5 3 ? 1 ... V 2 3 ? 4 ? 3 2 ... 3 4 Movie 5 R ~ U Features 6 ~ 7 ... We have N users, M movies, and integer rating values from 1 to K. D×N D×M Let rij be the rating of user i for movie j, and U ∈ R , V ∈ R be latent user and movie feature matrices: R ≈ U ⊤V Goal: Predict missing ratings. 9 Bayesian PMF α α VU Probabilistic linear model with Gaussian observation noise. Likelihood: Θ Θ VU 2 ⊤ 2 p(rij|ui, vj, σ )= N (rij|ui vj, σ ) Gaussian Priors over parameters: Vj U i N p(U|µU , ΛU )= N (ui|µu, Σu), iY=1 Rij i=1,...,N M j=1,...,M p(V |µV , ΛV )= N (vi|µv, Σv). σ iY=1 Conjugate Gaussian-inverse-Wishart priors on the user and movie hyperparameters ΘU = {µu, Σu} and ΘV = {µv, Σv}. Hierarchical Prior. 10 Bayesian PMF ∗ Predictive distribution: Consider predicting a rating rij for user i and query movie j: ∗ ∗ p(r |R)= p(r |ui, vj)p(U, V, ΘU , ΘV |R)d{U, V }d{ΘU , ΘV } ij ZZ ij Posterior over parameters and hyperparameters | {z } Exact evaluation of this predictive distribution is analytically intractable. Posterior distribution p(U, V, ΘU , ΘV |R) is complicated and does not have a closed form expression. Need to approximate. 11 Bayesian Neural Nets X xn N Regression problem: Given a set of i.i.d observations = { }n=1 n N with corresponding targets D = {t }n=1. hidden units zM Likelihood: (1) N wMD (2) wKM n n 2 xD p(D|X, w)= N (t |y(x , w), β ) yK nY=1 inputs outputs The mean is given by the output y1 of the neural network: x1 M D x w 2 1 yk( , )= wkjσ wjixi z1 (2) w10 x0 Xj=0 Xi=0 where σ(x) is the sigmoid function. z0 Gaussian prior over the network parameters: p(w)= N (0, α2I). 12 Bayesian Neural Nets Likelihood: hidden units N zM n n 2 (1) p(D|X, w)= N (t |y(x , w), β ) wMD (2) wKM n=1 xD Y yK Gaussian prior over parameters: inputs outputs p(w)= N (0, α2I) y1 x1 Posterior is analytically intractable: z1 (2) w10 x0 w X w w X p(D| , )p( ) p( |D, )= w X w w z0 p(D| , )p( )d R Remark: Under certain conditions, Radford Neal (1994) showed, as the number of hidden units go to infinity, a Gaussian prior over parameters results in a Gaussian process prior for functions. 13 Undirected Models x is a binary random vector with xi ∈ {+1, −1}: 1 p(x)= exp θ x x + θ x . Z ij i j i i (i,jX)∈E iX∈V where Z is known as partition function: Z = exp θijxixj + θixi . Xx (i,jX)∈E iX∈V If x is 100-dimensional, need to sum over 2100 terms. The sum might decompose (e.g. junction tree). Otherwise we need to approximate. Remark: Compare to marginal likelihood. 14 Monte Carlo p(z) f(z) For most situations we will be interested in evaluating the expectation: E[f]= f(z)p(z)dz Z z p˜(z) We will use the following notation: p(z)= Z . We can evaluate p˜(z) pointwise, but cannot evaluate Z. 1 • Posterior distribution: P (θ|D)= P (D)P (D|θ)P (θ) 1 • Markov random fields: P (z)= Z exp(−E(z)) 15 Simple Monte Carlo General Idea: Draw independent samples {z1, ..., zn} from distribution p(z) to approximate expectation: N 1 E[f]= f(z)p(z)dz ≈ f(zn)= fˆ Z N nX=1 Note that E[f]= E[fˆ], so the estimator fˆhas correct mean (unbiased). The variance: 1 var[fˆ]= E (f − E[f])2 N Remark: The accuracy of the estimator does not depend on dimensionality of z. 16 Simple Monte Carlo In general: N 1 f(z)p(z)dz ≈ f(zn), zn ∼ p(z) Z N nX=1 Predictive distribution: P (x∗|D) = P (x∗|θ, D)P (θ|D)dθ Z N 1 ≈ P (x∗|θn, D), θn ∼ p(θ|D) N nX=1 Problem: It is hard to draw exact samples from p(z). 17 Basic Sampling Algorithm How to generate samples from simple non-uniform distributions assuming we can generate samples from uniform distribution. 1 h(y) y Define: h(y)= −∞ p(ˆy)dyˆ R p(y) Sample: z ∼ U[0, 1]. Then: y = h−1(z) is a sample from p(y). 0 y Problem: Computing cumulative h(y) is just as hard! 18 Rejection Sampling Sampling from target distribution p(z) =p ˜(z)/Zp is difficult. Suppose we have an easy-to-sample proposal distribution q(z), such that kq(z) ≥ p˜(z), ∀z. kq(z) Sample z0 from q(z). kq(z0) Sample u0 from Uniform[0, kq(z0)] p(z) u0 The pair (z0, u0) has uniform distribution e z0 z under the curve of kq(z). If u0 > p˜(z0), the sample is rejected. 19 Rejection Sampling Probability that a sample is accepted is: kq(z) kq(z0) p˜(z) p(accept) = q(z)dz Z kq(z) p(z) 1 u0 = p˜(z)dz e k Z z0 z The fraction of accepted samples depends on the ratio of the area under p˜(z) and kq(z). Hard to find appropriate q(z) with optimal k. Useful technique in one or two dimensions. Typically applied as a subroutine in more advanced algorithms. 20 Importance Sampling Suppose we have an easy-to-sample proposal distribution q(z), such that q(z) > 0 if p(z) > 0. p(z) q(z) f(z) E[f] = f(z)p(z)dz Z p(z) = f(z) q(z)dz Z q(z) 1 p(zn) ≈ f(zn), zn ∼ q(z) z N q(zn) Xn n n The quantities w = p(z )/q(zn) are known as importance weights. Unlike rejection sampling, all samples are retained. But wait: we cannot compute p(z), only p˜(z). 21 Importance Sampling Let our proposal be of the form q(z) =q ˜(z)/Zq: p(z) Z p˜(z) E[f] = f(z)p(z)dz = f(z) q(z)dz = q f(z) q(z)dz Z Z q(z) Zp Z q˜(z) Z 1 p˜(zn) Z 1 ≈ q f(zn)= q wnf(zn), zn ∼ q(z) Z N q˜(zn) Z N p Xn p Xn But we can use the same importance weights to approximate Zp: Zq Z 1 p˜(z) 1 p˜(zn) 1 p = p˜(z)dz = q(z)dz ≈ = wn Z Z q˜(z) N q˜(zn) N q q Z Z Xn Xn Hence: 1 wn E[f] ≈ f(zn) Consistentbutbiased. N wn Xn n P 22 Problems If our proposal distribution q(z) poorly matches our target distribution p(z) then: • Rejection Sampling: almost always rejects • Importance Sampling: has large, possibly infinite, variance (unreliable estimator).