Approximate Inference Using MCMC

Approximate Inference Using MCMC

Approximate Inference using MCMC 9.520 Class 22 Ruslan Salakhutdinov BCS and CSAIL, MIT 1 Plan 1. Introduction/Notation. 2. Examples of successful Bayesian models. 3. Basic Sampling Algorithms. 4. Markov chains. 5. Markov chain Monte Carlo algorithms. 2 References/Acknowledgements • Chris Bishop’s book: Pattern Recognition and Machine Learning, chapter 11 (many figures are borrowed from this book). • David MacKay’s book: Information Theory, Inference, and Learning Algorithms, chapters 29-32. • Radford Neals’s technical report on Probabilistic Inference Using Markov Chain Monte Carlo Methods. • Zoubin Ghahramani’s ICML tutorial on Bayesian Machine Learning: http://www.gatsby.ucl.ac.uk/∼zoubin/ICML04-tutorial.html • Ian Murray’s tutorial on Sampling Methods: http://www.cs.toronto.edu/∼murray/teaching/ 3 Basic Notation P (x) probability of x P (x|θ) conditional probability of x given θ P (x, θ) joint probability of x and θ Bayes Rule: P (x|θ)P (θ) P (θ|x) = P (x) where P (x) = P (x, θ)dθ Marginalization Z I will use probability distribution and probability density interchangeably. It should be obvious from the context. 4 Inference Problem Given a dataset D = {x1, ..., xn}: Bayes Rule: P (D|θ) Likelihood function of θ P (D|θ)P (θ) P (θ|D)= P (θ) Prior probability of θ P (D) P (θ|D) Posterior distribution over θ Computing posterior distribution is known as the inference problem. But: P (D) = P (D,θ)dθ Z This integral can be very high-dimensional and difficult to compute. 5 Prediction P (D|θ) Likelihood function of θ P (D|θ)P (θ) P (θ|D)= P (D) P (θ) Prior probability of θ P (θ|D) Posterior distribution over θ Prediction: Given D, computing conditional probability of x∗ requires computing the following integral: P (x∗|D) = P (x∗|θ, D)P (θ|D)dθ Z E ∗ = P (θ|D)[P (x |θ, D)] which is sometimes called predictive distribution. Computing predictive distribution requires posterior P (θ|D). 6 Model Selection Compare model classes, e.g. M1 and M2. Need to compute posterior probabilities given D: P (D|M)P (M) P (M|D)= P (D) where P (D|M)= P (D|θ, M)P (θ, M)dθ Z is known as the marginal likelihood or evidence. 7 Computational Challenges • Computing marginal likelihoods often requires computing very high- dimensional integrals. • Computing posterior distributions (and hence predictive distributions) is often analytically intractable. • In this class, we will concentrate on Markov Chain Monte Carlo (MCMC) methods for performing approximate inference. • First, let us look at some specific examples: – Bayesian Probabilistic Matrix Factorization – Bayesian Neural Networks – Dirichlet Process Mixtures (last class) 8 Bayesian PMF User Features 1 2 3 4 5 6 7 ... 1 5 3 ? 1 ... V 2 3 ? 4 ? 3 2 ... 3 4 Movie 5 R ~ U Features 6 ~ 7 ... We have N users, M movies, and integer rating values from 1 to K. D×N D×M Let rij be the rating of user i for movie j, and U ∈ R , V ∈ R be latent user and movie feature matrices: R ≈ U ⊤V Goal: Predict missing ratings. 9 Bayesian PMF α α VU Probabilistic linear model with Gaussian observation noise. Likelihood: Θ Θ VU 2 ⊤ 2 p(rij|ui, vj, σ )= N (rij|ui vj, σ ) Gaussian Priors over parameters: Vj U i N p(U|µU , ΛU )= N (ui|µu, Σu), iY=1 Rij i=1,...,N M j=1,...,M p(V |µV , ΛV )= N (vi|µv, Σv). σ iY=1 Conjugate Gaussian-inverse-Wishart priors on the user and movie hyperparameters ΘU = {µu, Σu} and ΘV = {µv, Σv}. Hierarchical Prior. 10 Bayesian PMF ∗ Predictive distribution: Consider predicting a rating rij for user i and query movie j: ∗ ∗ p(r |R)= p(r |ui, vj)p(U, V, ΘU , ΘV |R)d{U, V }d{ΘU , ΘV } ij ZZ ij Posterior over parameters and hyperparameters | {z } Exact evaluation of this predictive distribution is analytically intractable. Posterior distribution p(U, V, ΘU , ΘV |R) is complicated and does not have a closed form expression. Need to approximate. 11 Bayesian Neural Nets X xn N Regression problem: Given a set of i.i.d observations = { }n=1 n N with corresponding targets D = {t }n=1. hidden units zM Likelihood: (1) N wMD (2) wKM n n 2 xD p(D|X, w)= N (t |y(x , w), β ) yK nY=1 inputs outputs The mean is given by the output y1 of the neural network: x1 M D x w 2 1 yk( , )= wkjσ wjixi z1 (2) w10 x0 Xj=0 Xi=0 where σ(x) is the sigmoid function. z0 Gaussian prior over the network parameters: p(w)= N (0, α2I). 12 Bayesian Neural Nets Likelihood: hidden units N zM n n 2 (1) p(D|X, w)= N (t |y(x , w), β ) wMD (2) wKM n=1 xD Y yK Gaussian prior over parameters: inputs outputs p(w)= N (0, α2I) y1 x1 Posterior is analytically intractable: z1 (2) w10 x0 w X w w X p(D| , )p( ) p( |D, )= w X w w z0 p(D| , )p( )d R Remark: Under certain conditions, Radford Neal (1994) showed, as the number of hidden units go to infinity, a Gaussian prior over parameters results in a Gaussian process prior for functions. 13 Undirected Models x is a binary random vector with xi ∈ {+1, −1}: 1 p(x)= exp θ x x + θ x . Z ij i j i i (i,jX)∈E iX∈V where Z is known as partition function: Z = exp θijxixj + θixi . Xx (i,jX)∈E iX∈V If x is 100-dimensional, need to sum over 2100 terms. The sum might decompose (e.g. junction tree). Otherwise we need to approximate. Remark: Compare to marginal likelihood. 14 Monte Carlo p(z) f(z) For most situations we will be interested in evaluating the expectation: E[f]= f(z)p(z)dz Z z p˜(z) We will use the following notation: p(z)= Z . We can evaluate p˜(z) pointwise, but cannot evaluate Z. 1 • Posterior distribution: P (θ|D)= P (D)P (D|θ)P (θ) 1 • Markov random fields: P (z)= Z exp(−E(z)) 15 Simple Monte Carlo General Idea: Draw independent samples {z1, ..., zn} from distribution p(z) to approximate expectation: N 1 E[f]= f(z)p(z)dz ≈ f(zn)= fˆ Z N nX=1 Note that E[f]= E[fˆ], so the estimator fˆhas correct mean (unbiased). The variance: 1 var[fˆ]= E (f − E[f])2 N Remark: The accuracy of the estimator does not depend on dimensionality of z. 16 Simple Monte Carlo In general: N 1 f(z)p(z)dz ≈ f(zn), zn ∼ p(z) Z N nX=1 Predictive distribution: P (x∗|D) = P (x∗|θ, D)P (θ|D)dθ Z N 1 ≈ P (x∗|θn, D), θn ∼ p(θ|D) N nX=1 Problem: It is hard to draw exact samples from p(z). 17 Basic Sampling Algorithm How to generate samples from simple non-uniform distributions assuming we can generate samples from uniform distribution. 1 h(y) y Define: h(y)= −∞ p(ˆy)dyˆ R p(y) Sample: z ∼ U[0, 1]. Then: y = h−1(z) is a sample from p(y). 0 y Problem: Computing cumulative h(y) is just as hard! 18 Rejection Sampling Sampling from target distribution p(z) =p ˜(z)/Zp is difficult. Suppose we have an easy-to-sample proposal distribution q(z), such that kq(z) ≥ p˜(z), ∀z. kq(z) Sample z0 from q(z). kq(z0) Sample u0 from Uniform[0, kq(z0)] p(z) u0 The pair (z0, u0) has uniform distribution e z0 z under the curve of kq(z). If u0 > p˜(z0), the sample is rejected. 19 Rejection Sampling Probability that a sample is accepted is: kq(z) kq(z0) p˜(z) p(accept) = q(z)dz Z kq(z) p(z) 1 u0 = p˜(z)dz e k Z z0 z The fraction of accepted samples depends on the ratio of the area under p˜(z) and kq(z). Hard to find appropriate q(z) with optimal k. Useful technique in one or two dimensions. Typically applied as a subroutine in more advanced algorithms. 20 Importance Sampling Suppose we have an easy-to-sample proposal distribution q(z), such that q(z) > 0 if p(z) > 0. p(z) q(z) f(z) E[f] = f(z)p(z)dz Z p(z) = f(z) q(z)dz Z q(z) 1 p(zn) ≈ f(zn), zn ∼ q(z) z N q(zn) Xn n n The quantities w = p(z )/q(zn) are known as importance weights. Unlike rejection sampling, all samples are retained. But wait: we cannot compute p(z), only p˜(z). 21 Importance Sampling Let our proposal be of the form q(z) =q ˜(z)/Zq: p(z) Z p˜(z) E[f] = f(z)p(z)dz = f(z) q(z)dz = q f(z) q(z)dz Z Z q(z) Zp Z q˜(z) Z 1 p˜(zn) Z 1 ≈ q f(zn)= q wnf(zn), zn ∼ q(z) Z N q˜(zn) Z N p Xn p Xn But we can use the same importance weights to approximate Zp: Zq Z 1 p˜(z) 1 p˜(zn) 1 p = p˜(z)dz = q(z)dz ≈ = wn Z Z q˜(z) N q˜(zn) N q q Z Z Xn Xn Hence: 1 wn E[f] ≈ f(zn) Consistentbutbiased. N wn Xn n P 22 Problems If our proposal distribution q(z) poorly matches our target distribution p(z) then: • Rejection Sampling: almost always rejects • Importance Sampling: has large, possibly infinite, variance (unreliable estimator).

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    37 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us