Gibbs sampling for Dirichlet-Multinomials

Mark Johnson

Macquarie University, Sydney, Australia

MLSS “Summer School”

Random variables and “distributed according to” notation

• A distribution F is a non-negative function on some set X whose values sum (integrate) to 1
• A random variable X is distributed according to a distribution F, or more simply, X has distribution F, written X ∼ F, iff:
  P(X = x) = F(x) for all x
  (This is for discrete RVs.)
• You’ll sometimes see the notation X | Y ∼ F, which means “X is generated conditional on Y with distribution F” (where F usually depends on Y), i.e., P(X | Y) = F(X | Y)

Outline

Introduction to Bayesian Inference

Mixture models

Sampling with Markov Chains

The Gibbs sampler

Gibbs sampling for Dirichlet-Multinomial mixtures

Topic modeling with Dirichlet multinomial mixtures

Bayes’ rule

P(Hypothesis | Data) = P(Data | Hypothesis) P(Hypothesis) / P(Data)

• Bayesians use Bayes’ Rule to update beliefs in hypotheses in response to data
• P(Hypothesis | Data) is the posterior distribution,
• P(Hypothesis) is the prior distribution,
• P(Data | Hypothesis) is the likelihood, and
• P(Data) is a normalising constant sometimes called the evidence

Computing the normalising constant

P(Data) = ∑_{Hypothesis′ ∈ H} P(Data, Hypothesis′)
        = ∑_{Hypothesis′ ∈ H} P(Data | Hypothesis′) P(Hypothesis′)

• If the set of hypotheses H is small, we can calculate P(Data) by enumeration
• But often these sums are intractable
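As a concrete illustration (not from the slides), here is a minimal Python sketch of Bayes’ rule with the evidence computed by enumerating a small hypothesis space; the two coin hypotheses and their numbers are made up:

# Hypothetical example: two hypotheses about a coin, data = 8 heads in 10 flips.
from math import comb

prior = {"fair": 0.5, "biased": 0.5}       # P(Hypothesis)
head_prob = {"fair": 0.5, "biased": 0.8}   # per-hypothesis probability of heads

def likelihood(h, heads=8, flips=10):
    # P(Data | Hypothesis) under a binomial model
    p = head_prob[h]
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

# P(Data) = sum over hypotheses of P(Data | Hypothesis) P(Hypothesis)
evidence = sum(likelihood(h) * prior[h] for h in prior)

# Bayes' rule: P(Hypothesis | Data) = P(Data | Hypothesis) P(Hypothesis) / P(Data)
posterior = {h: likelihood(h) * prior[h] / evidence for h in prior}
print(posterior)   # the posterior favours the biased coin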

Bayesian belief updating

• Idea: treat the posterior from the last observation as the prior for the next
• Consistency follows because the likelihood factors
  – Suppose d = (d1, d2). Then the posterior of a hypothesis h is:

P(h | d1, d2) ∝ P(h) P(d1, d2 | h)
             = P(h) P(d1 | h) P(d2 | h, d1)
             ∝ P(h | d1) P(d2 | h, d1)

where P(h | d1) plays the role of the updated prior and P(d2 | h, d1) the role of the likelihood

Discrete distributions

• A discrete distribution has a finite set of outcomes 1, . . . , m
• A discrete distribution is parameterized by a vector θ = (θ1, . . . , θm), where P(X = j | θ) = θj (so ∑_{j=1}^m θj = 1)
  – Example: an m-sided die, where θj = prob. of face j
• Suppose X = (X1, . . . , Xn) and each Xi | θ ∼ DISCRETE(θ). Then:

  P(X | θ) = ∏_{i=1}^n DISCRETE(Xi; θ) = ∏_{j=1}^m θj^{Nj}

  where Nj is the number of times j occurs in X.
• Goal of next few slides: compute P(θ | X)

Multinomial distributions

• Suppose Xi ∼ DISCRETE(θ) for i = 1, . . . , n, and Nj is the number of times j occurs in X
• Then N | n, θ ∼ MULTI(θ, n), and

  P(N | n, θ) = (n! / ∏_{j=1}^m Nj!) ∏_{j=1}^m θj^{Nj}

  where n! / ∏_{j=1}^m Nj! is the number of sequences of values with occurrence counts N
• The vector N is known as a sufficient statistic for θ because it supplies as much information about θ as the original sequence X does.

Dirichlet distributions

• Dirichlet distributions are probability distributions over multinomial parameter vectors
  – called Beta distributions when m = 2
• Parameterized by a vector α = (α1, . . . , αm), where αj > 0, that determines the shape of the distribution

  DIR(θ | α) = (1 / C(α)) ∏_{j=1}^m θj^{αj−1}

  C(α) = ∫_∆ ∏_{j=1}^m θj^{αj−1} dθ = ∏_{j=1}^m Γ(αj) / Γ(∑_{j=1}^m αj)

• Γ is a generalization of the factorial function
  – Γ(k) = (k − 1)! for positive integer k
  – Γ(x) = (x − 1) Γ(x − 1) for all x

Plots of the Dirichlet distribution

  P(θ | α) = (Γ(∑_{j=1}^m αj) / ∏_{j=1}^m Γ(αj)) ∏_{k=1}^m θk^{αk−1}

[Plot: P(θ1 | α) as a function of θ1 (the probability of outcome 1) for α = (1,1), α = (5,2) and α = (0.1,0.1)]
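Not part of the original slides: a minimal sketch (assuming numpy, scipy and matplotlib are installed) of reproducing this kind of plot for the two-outcome case, where the density of θ1 under DIR(α1, α2) is the Beta(α1, α2) density:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

theta1 = np.linspace(0.001, 0.999, 500)
for a1, a2 in [(1, 1), (5, 2), (0.1, 0.1)]:
    # For m = 2, DIR(theta | alpha) as a function of theta_1 is Beta(alpha_1, alpha_2)
    plt.plot(theta1, beta.pdf(theta1, a1, a2), label=f"alpha = ({a1},{a2})")
plt.xlabel("theta_1 (probability of outcome 1)")
plt.ylabel("P(theta_1 | alpha)")
plt.legend()
plt.show()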

Dirichlet distributions as priors for θ

• Generative model:

  θ | α ∼ DIR(α)
  Xi | θ ∼ DISCRETE(θ), i = 1, . . . , n

• We can depict this as a Bayes net using plates, which indicate replication

[Bayes net: α → θ → Xi, with Xi inside a plate replicated n times]

Inference for θ with Dirichlet priors

• Data X = (X1, . . . , Xn) generated i.i.d. from DISCRETE(θ)
• Prior is DIR(α). By Bayes Rule, the posterior is:

  P(θ | X) ∝ P(X | θ) P(θ)
           ∝ (∏_{j=1}^m θj^{Nj}) (∏_{j=1}^m θj^{αj−1})
           = ∏_{j=1}^m θj^{Nj+αj−1}, so

  P(θ | X) = DIR(N + α)

• So if the prior is Dirichlet with parameters α, the posterior is Dirichlet with parameters N + α
  ⇒ we can regard the Dirichlet parameters α as “pseudo-counts” from “pseudo-data”
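A minimal Python sketch of this conjugate update, using the die rolls d = (2, 5, 4, 2, 6) from the example later in the slides and a uniform prior α = (1, . . . , 1):

import numpy as np

alpha = np.ones(6)            # prior pseudo-counts for a six-sided die
X = [2, 5, 4, 2, 6]           # observed rolls (faces numbered 1..6)

# N_j = number of times face j occurs in X
N = np.bincount(np.array(X) - 1, minlength=6)

posterior_alpha = alpha + N   # posterior is DIR(N + alpha)
print(posterior_alpha)        # [1. 3. 1. 2. 2. 2.]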

Conjugate priors

• If the prior is DIR(α) and the likelihood is i.i.d. DISCRETE(θ), then the posterior is DIR(N + α)
  ⇒ the prior parameters α specify “pseudo-observations”
• A class C of prior distributions P(H) is conjugate to a class of likelihood functions P(D | H) iff the posterior P(H | D) is also a member of C
• In general, conjugate priors encode “pseudo-observations”
  – the difference between the prior P(H) and the posterior P(H | D) are the observations in D
  – but P(H | D) belongs to the same family as P(H), and can serve as a prior for inferences about more data D′
  ⇒ it must be possible to encode the observations D using the parameters of the prior
• In general, the likelihood functions that have conjugate priors belong to the exponential family

Point estimates from Bayesian posteriors

• A “true” Bayesian prefers to use the full P(H | D), but sometimes we have to choose a “best” hypothesis
• The Maximum a posteriori (MAP) estimate or posterior mode is

  Ĥ = argmax_H P(H | D) = argmax_H P(D | H) P(H)

• The expected value E_P[X] of X under distribution P is:

  E_P[X] = ∫ x P(X = x) dx

The expected value is a kind of average, weighted by P(X). The expected value E[θ] of θ is an estimate of θ.

The posterior mode of a Dirichlet

• The Maximum a posteriori (MAP) estimate or posterior mode is

  Ĥ = argmax_H P(H | D) = argmax_H P(D | H) P(H)

• For Dirichlets with parameters α, the MAP estimate is:

  θ̂j = (αj − 1) / ∑_{j′=1}^m (αj′ − 1)

  so if the posterior is DIR(N + α), the MAP estimate for θ is:

  θ̂j = (Nj + αj − 1) / (n + ∑_{j′=1}^m (αj′ − 1))

• If α = 1 then θ̂j = Nj/n, which is also the maximum likelihood estimate (MLE) for θ

The expected value of θ for a Dirichlet

• The expected value E_P[X] of X under distribution P is:

  E_P[X] = ∫ x P(X = x) dx

• For Dirichlets with parameters α, the expected value of θj is:

  E_{DIR(α)}[θj] = αj / ∑_{j′=1}^m αj′

• Thus if the posterior is DIR(N + α), the expected value of θj is:

  E_{DIR(N+α)}[θj] = (Nj + αj) / (n + ∑_{j′=1}^m αj′)

• E[θ] smooths or regularizes the MLE by adding pseudo-counts α to N
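Not on the slides: a small Python sketch contrasting the MLE, the MAP estimate and the posterior mean for the die counts above; the prior αj = 2 is an arbitrary illustrative choice:

import numpy as np

def dirichlet_map(alpha):
    # Posterior mode: (alpha_j - 1) / sum_j' (alpha_j' - 1), assumes all alpha_j > 1
    a = np.asarray(alpha, dtype=float)
    return (a - 1.0) / (a - 1.0).sum()

def dirichlet_mean(alpha):
    # Expected value: alpha_j / sum_j' alpha_j'
    a = np.asarray(alpha, dtype=float)
    return a / a.sum()

N = np.array([0, 2, 0, 1, 1, 1])   # counts from d = (2, 5, 4, 2, 6)
alpha = np.full(6, 2.0)            # hypothetical prior pseudo-counts

print(N / N.sum())                 # MLE (equals the MAP estimate when alpha = 1)
print(dirichlet_map(alpha + N))    # posterior mode
print(dirichlet_mean(alpha + N))   # posterior mean (smoothed / regularized estimate)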

Sampling from a Dirichlet

  θ | α ∼ DIR(α) iff P(θ | α) = (1 / C(α)) ∏_{j=1}^m θj^{αj−1}, where:

  C(α) = ∏_{j=1}^m Γ(αj) / Γ(∑_{j=1}^m αj)

• There are several algorithms for producing samples from DIR(α). A simple one relies on the following result:
  – If Vk ∼ GAMMA(αk) and θk = Vk / ∑_{k′=1}^m Vk′, then θ ∼ DIR(α)
• This leads to the following algorithm (sketched in code below) for producing a sample θ from DIR(α):
  – Sample vk from GAMMA(αk) for k = 1, . . . , m
  – Set θk = vk / ∑_{k′=1}^m vk′
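A minimal sketch of this Gamma-based algorithm in Python (numpy also provides rng.dirichlet, which does the same thing directly):

import numpy as np

rng = np.random.default_rng(0)

def sample_dirichlet(alpha, rng):
    # Sample v_k ~ GAMMA(alpha_k) and normalize; the result is distributed DIR(alpha)
    v = rng.gamma(shape=np.asarray(alpha, dtype=float), scale=1.0)
    return v / v.sum()

theta = sample_dirichlet([1.0, 3.0, 1.0, 2.0, 2.0, 2.0], rng)
print(theta, theta.sum())   # a random point on the probability simplex; sums to 1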

Posterior with Dirichlet priors

θ | α ∼ DIR(α)
Xi | θ ∼ DISCRETE(θ), i = 1, . . . , n

• Integrate out θ to calculate the probability of X:

  P(X | α) = ∫_∆ P(X, θ | α) dθ = ∫_∆ P(X | θ) P(θ | α) dθ
           = ∫_∆ (∏_{j=1}^m θj^{Nj}) (1 / C(α)) (∏_{j=1}^m θj^{αj−1}) dθ
           = (1 / C(α)) ∫_∆ ∏_{j=1}^m θj^{Nj+αj−1} dθ
           = C(N + α) / C(α), where C(α) = ∏_{j=1}^m Γ(αj) / Γ(∑_{j=1}^m αj)

• Collapsed Gibbs samplers and the Chinese Restaurant Process rely on this result
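Not on the original slide: a small Python sketch of evaluating P(X | α) = C(N + α)/C(α) in log space, using gammaln for log Γ to avoid overflow:

import numpy as np
from scipy.special import gammaln

def log_C(alpha):
    # log C(alpha) = sum_j log Gamma(alpha_j) - log Gamma(sum_j alpha_j)
    a = np.asarray(alpha, dtype=float)
    return gammaln(a).sum() - gammaln(a.sum())

def log_marginal(N, alpha):
    # log P(X | alpha) = log C(N + alpha) - log C(alpha)
    return log_C(np.asarray(N) + np.asarray(alpha)) - log_C(alpha)

N = np.array([0, 2, 0, 1, 1, 1])   # counts from d = (2, 5, 4, 2, 6)
print(np.exp(log_marginal(N, np.ones(6))))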
Predictive distribution for Dirichlet-Multinomial

• The predictive distribution is the distribution of observation X_{n+1} given observations X = (X1, . . . , Xn) and prior DIR(α):

  P(X_{n+1} = k | X, α) = ∫_∆ P(X_{n+1} = k | θ) P(θ | X, α) dθ
                        = ∫_∆ θk DIR(θ | N + α) dθ
                        = (Nk + αk) / ∑_{j=1}^m (Nj + αj)

Example: rolling a die

• Data d = (2, 5, 4, 2, 6)

[Plot: posterior densities P(θ2 | α) as a function of θ2 (the probability of side 2), showing the Dirichlet parameters updating as the observations are added one at a time: α = (1,1,1,1,1,1), (1,2,1,1,1,1), (1,2,1,1,2,1), (1,2,1,2,2,1), (1,3,1,2,2,1), (1,3,1,2,2,2)]

Inference in complex models

• If the model is simple enough we can calculate the posterior exactly (conjugate priors)
• When the model is more complicated, we can only approximate the posterior
• Variational Bayes calculates the function closest to the posterior within a class of functions
• Sampling algorithms produce samples from the posterior distribution

  – Markov chain Monte Carlo (MCMC) algorithms use a Markov chain to produce samples
  – A Gibbs sampler is a particular MCMC algorithm
• Particle filters are a kind of on-line sampling algorithm (on-line algorithms only make one pass through the data)

Outline

Introduction to Bayesian Inference

Mixture models

Sampling with Markov Chains

The Gibbs sampler

Gibbs sampling for Dirichlet-Multinomial mixtures

Topic modeling with Dirichlet multinomial mixtures

Mixture models

• Observations Xi are a mixture of ℓ source distributions F(θk), k = 1, . . . , ℓ
• The value of Zi specifies which source distribution is used to generate Xi (Z is like a switch)
• If Zi = k, then Xi ∼ F(θk)
• Here we assume the Zi are not observed, i.e., hidden

  Xi | Zi, θ ∼ F(θ_{Zi}), i = 1, . . . , n

[Bayes net: θk inside a plate replicated ℓ times; Zi → Xi inside a plate replicated n times]

Applications of mixture models

• Blind source separation: data Xi come from ℓ different sources
  – Which Xi come from which source? (Zi specifies the source of Xi)
  – What are the sources? (θk specifies properties of source k)
• Xi could be a document and Zi the topic of Xi
• Xi could be an image and Zi the object(s) in Xi
• Xi could be a person’s actions and Zi the “cause” of Xi
• These are unsupervised learning problems, which are kinds of clustering problems
• In a Bayesian setting, compute the posterior P(Z, θ | X). But how can we compute this?

Dirichlet-Multinomial mixtures

φ | β ∼ DIR(β)
Zi | φ ∼ DISCRETE(φ), i = 1, . . . , n
θk | α ∼ DIR(α), k = 1, . . . , ℓ

Xi,j | Zi, θ ∼ DISCRETE(θ_{Zi}), i = 1, . . . , n; j = 1, . . . , di

[Bayes net: β → φ → Zi and α → θk → Xi,j, with plates over the ℓ components, the n observations and the di elements of each observation]

• Zi is generated from a multinomial φ
• Dirichlet priors on φ and θk
• Easy to modify this framework for other applications

• Why does each observation Xi consist of di elements?
• What effect do the priors α and β have?
  (see the code sketch of this generative process below)
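A minimal Python sketch of this generative process; the number of components, vocabulary size and document lengths below are made-up values, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
ell, m, n = 3, 10, 100          # hypothetical: components, vocabulary size, documents
beta = np.ones(ell)             # prior on the mixing weights phi
alpha = np.ones(m)              # prior on each component's theta_k

phi = rng.dirichlet(beta)                         # phi | beta ~ DIR(beta)
theta = rng.dirichlet(alpha, size=ell)            # theta_k | alpha ~ DIR(alpha)

Z, X = [], []
for i in range(n):
    z_i = rng.choice(ell, p=phi)                  # Z_i | phi ~ DISCRETE(phi)
    d_i = rng.integers(3, 10)                     # hypothetical document length d_i
    x_i = rng.choice(m, size=d_i, p=theta[z_i])   # X_ij | Z_i, theta ~ DISCRETE(theta_{Z_i})
    Z.append(z_i)
    X.append(x_i)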

Outline

Introduction to Bayesian Inference

Mixture models

Sampling with Markov Chains

The Gibbs sampler

Gibbs sampling for Dirichlet-Multinomial mixtures

Topic modeling with Dirichlet multinomial mixtures

Why sample?

• Setup: a Bayes net has variables X, whose value x we observe, and variables Y, whose value we don’t know
  – Y includes any parameters we want to estimate, such as θ
• Goal: compute the expected value of some function f:

  E[f | X = x] = ∑_y f(x, y) P(Y = y | X = x)

  – E.g., f(x, y) = 1 if x1 and x2 are both generated from the same hidden state, and 0 otherwise
• In what follows, everything is conditioned on X = x, so take P(Y) to mean P(Y | X = x)
• Suppose we can produce n samples y^(t), where Y^(t) ∼ P(Y). Then we can estimate:

  E[f | X = x] ≈ (1/n) ∑_{t=1}^n f(x, y^(t))

Markov chains

• A (first-order) Markov chain is a distribution over random variables S^(0), . . . , S^(n), all ranging over the same state space S, where:

  P(S^(0), . . . , S^(n)) = P(S^(0)) ∏_{t=0}^{n−1} P(S^(t+1) | S^(t))

  i.e., S^(t+1) is conditionally independent of S^(0), . . . , S^(t−1) given S^(t)
• A Markov chain is homogeneous or time-invariant iff:

  P(S^(t+1) = s′ | S^(t) = s) = P_{s′,s} for all t, s, s′

  The matrix P is called the transition probability matrix (tpm) of the Markov chain
• If P(S^(t) = s) = π_s^(t) (i.e., π^(t) is the vector of state probabilities at time t) then:
  – π^(t+1) = P π^(t)
  – π^(t) = P^t π^(0)

Ergodicity

• A Markov chain with tpm P is ergodic iff there is a positive integer m s.t. all elements of P^m are positive (i.e., there is an m-step path between any two states)
• Informally, an ergodic Markov chain “forgets” its past states
• Theorem: For each homogeneous ergodic Markov chain with tpm P there is a unique limiting distribution D_P, i.e., as n approaches infinity, the distribution of S^(n) converges on D_P
• D_P is called the stationary distribution of the Markov chain
• Let π be the vector representation of D_P, i.e., D_P(y) = π_y. Then:

  π = P π, and π = lim_{n→∞} P^n π^(0) for every initial distribution π^(0)
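Not from the slides: a tiny Python sketch showing π^(t) = P^t π^(0) converging to the stationary distribution of a made-up two-state chain:

import numpy as np

# Hypothetical 2-state tpm; column s holds P(next state | current state = s),
# matching the convention pi^(t+1) = P pi^(t) used above
P = np.array([[0.9, 0.2],
              [0.1, 0.8]])

pi = np.array([1.0, 0.0])   # an arbitrary initial distribution pi^(0)
for _ in range(100):
    pi = P @ pi             # pi^(t+1) = P pi^(t)
print(pi)                   # approaches the stationary distribution (2/3, 1/3)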

Using a Markov chain for inference of P(Y)

• Set the state space S of the Markov chain to the range of Y (S may be astronomically large)

• Find a tpm P whose stationary distribution D_P is P(Y)
• “Run” the Markov chain, i.e.:
  – Pick y^(0) somehow
  – For t = 0, . . . , n − 1: sample y^(t+1) from P(Y^(t+1) | Y^(t) = y^(t)), i.e., from P_{·,y^(t)}
  – After discarding the first burn-in samples, use the remaining samples to estimate E[f | X = x]
• WARNING: in general the samples y^(t) are not independent

Outline

Introduction to Bayesian Inference

Mixture models

Sampling with Markov Chains

The Gibbs sampler

Gibbs sampling for Dirichlet-Multinomial mixtures

Topic modeling with Dirichlet multinomial mixtures

The Gibbs sampler

• The Gibbs sampler is useful when:
  – Y is multivariate, i.e., Y = (Y1, . . . , Ym), and
  – it is easy to sample from P(Yj | Y−j)
• The Gibbs sampler for P(Y) is the tpm P = ∏_{j=1}^m P^(j), where:

  P^(j)_{y′,y} = 0 if y′_{−j} ≠ y_{−j}
  P^(j)_{y′,y} = P(Yj = y′j | Y−j = y−j) if y′_{−j} = y_{−j}

• Informally, the Gibbs sampler cycles through each of the variables Yj, replacing the current value yj with a sample from P(Yj | Y−j = y−j)
• There are sequential scan and random scan variants of Gibbs sampling

A simple example of Gibbs sampling

  P(Y1, Y2) = c if |Y1| < 5, |Y2| < 5 and |Y1 − Y2| < 1
  P(Y1, Y2) = 0 otherwise

• The Gibbs sampler for P(Y1, Y2) samples repeatedly from:

  P(Y2 | Y1) = UNIFORM(max(−5, Y1 − 1), min(5, Y1 + 1))
  P(Y1 | Y2) = UNIFORM(max(−5, Y2 − 1), min(5, Y2 + 1))
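A minimal Python sketch of this sampler, alternating the two conditional draws; variable names are my own:

import numpy as np

rng = np.random.default_rng(0)

def gibbs_band(n_samples, rng):
    # Gibbs sampler for the uniform distribution on |Y1| < 5, |Y2| < 5, |Y1 - Y2| < 1
    y1, y2 = 0.0, 0.0
    samples = []
    for _ in range(n_samples):
        y2 = rng.uniform(max(-5, y1 - 1), min(5, y1 + 1))   # sample Y2 | Y1
        y1 = rng.uniform(max(-5, y2 - 1), min(5, y2 + 1))   # sample Y1 | Y2
        samples.append((y1, y2))
    return np.array(samples)

print(gibbs_band(5, rng))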
[Plot: sample run of this Gibbs sampler in the (Y1, Y2) plane; the first samples are (0, 0), (0, −0.119), (0.363, −0.119), (0.363, 0.146), (−0.681, 0.146), (−0.681, −1.551), . . .]

A non-ergodic Gibbs sampler

  P(Y1, Y2) = c if 1 < Y1, Y2 < 5 or −5 < Y1, Y2 < −1
  P(Y1, Y2) = 0 otherwise

• The Gibbs sampler for P(Y1, Y2), initialized at (2,2), samples repeatedly from:

  P(Y2 | Y1) = UNIFORM(1, 5)
  P(Y1 | Y2) = UNIFORM(1, 5)

• I.e., the sampler never visits the negative values of Y1, Y2

[Plot: sample run in the (Y1, Y2) plane starting at (2, 2); the samples (2, 2), (2, 2.72), (2.84, 2.72), (2.84, 4.71), (2.63, 4.71), (2.63, 4.52), (1.11, 4.52), (1.11, 2.46), . . . all stay in the positive square]

Why does the Gibbs sampler work?

• The Gibbs sampler tpm is P = ∏_{j=1}^m P^(j), where P^(j) replaces yj with a sample from P(Yj | Y−j = y−j) to produce y′
• But if y is a sample from P(Y), then so is y′, since y′ differs from y only by replacing yj with a sample from P(Yj | Y−j = y−j)
• Since P^(j) maps samples from P(Y) to samples from P(Y), so does P
  ⇒ P(Y) is a stationary distribution for P
• If P is ergodic, then P(Y) is the unique stationary distribution for P, i.e., the sampler converges to P(Y)

Gibbs sampling with Bayes nets

• Gibbs sampler: update yj with a sample from P(Yj | Y−j) ∝ P(Yj, Y−j)
• Only need to evaluate the terms that depend on Yj in the Bayes net factorization

  – Yj appears once in a term P(Yj | Y_{Paj})
  – Yj can appear multiple times in terms P(Yk | . . . , Yj, . . .)
• In graphical terms, need to know the value of:
  – Yj’s parents
  – Yj’s children, and their other parents

Outline

Introduction to Bayesian Inference

Mixture models

Sampling with Markov Chains

The Gibbs sampler

Gibbs sampling for Dirichlet-Multinomial mixtures

Topic modeling with Dirichlet multinomial mixtures

Dirichlet-Multinomial mixtures

φ | β ∼ DIR(β)
Zi | φ ∼ DISCRETE(φ), i = 1, . . . , n
θk | α ∼ DIR(α), k = 1, . . . , ℓ

Xi,j | Zi, θ ∼ DISCRETE(θ_{Zi}), i = 1, . . . , n; j = 1, . . . , di

  P(φ, Z, θ, X | α, β) = (1 / C(β)) ∏_{k=1}^ℓ φk^{βk−1+Nk(Z)} · ∏_{k=1}^ℓ (1 / C(α)) ∏_{j=1}^m θ_{k,j}^{αj−1+∑_{i:Zi=k} Nj(Xi)}

  where C(α) = ∏_{j=1}^m Γ(αj) / Γ(∑_{j=1}^m αj)

Gibbs sampling for D-M mixtures

φ | β ∼ DIR(β)
Zi | φ ∼ DISCRETE(φ), i = 1, . . . , n
θk | α ∼ DIR(α), k = 1, . . . , ℓ

Xi,j | Zi, θ ∼ DISCRETE(θ_{Zi}), i = 1, . . . , n; j = 1, . . . , di

  P(φ | Z, β) = DIR(φ; β + N(Z))

  P(Zi = k | φ, θ, Xi) ∝ φk ∏_{j=1}^m θ_{k,j}^{Nj(Xi)}

  P(θk | α, X, Z) = DIR(θk; α + ∑_{i:Zi=k} N(Xi))

Collapsed Dirichlet-Multinomial mixtures

  P(Z | β) = C(N(Z) + β) / C(β)

  P(X | α, Z) = ∏_{k=1}^ℓ C(α + ∑_{i:Zi=k} N(Xi)) / C(α), so

  P(Zi = k | Z−i, α, β) ∝ (Nk(Z−i) + βk) / (n − 1 + β•) · C(α + ∑_{i′≠i:Zi′=k} N(Xi′) + N(Xi)) / C(α + ∑_{i′≠i:Zi′=k} N(Xi′))

• P(Zi = k | Z−i, α, β) is proportional to the prob. of generating:
  – Zi = k, given the other Z−i, and
  – Xi in cluster k, given X−i and Z−i

Gibbs sampling for Dirichlet-Multinomial mixtures

• Each Xi could be generated from one of several Dirichlet multinomials

• The variable Zi indicates the source for Xi
• The uncollapsed sampler samples Z, θ and φ
• The collapsed sampler integrates out θ and φ and just samples Z
• Collapsed samplers often (but not always) converge faster than uncollapsed samplers
• Collapsed samplers are usually easier to implement

Outline

Introduction to Bayesian Inference

Mixture models

Sampling with Markov Chains

The Gibbs sampler

Gibbs sampling for Dirichlet-Multinomial mixtures

Topic modeling with Dirichlet multinomial mixtures

Topic modeling of child-directed speech

• Data: Adam, Eve and Sarah’s mothers’ child-directed utterances

    I like it .
    why don’t you read Shadow yourself ?
    that’s a terribly small horse for you to ride .
    why don’t you look at some of the toys in the basket .
    want to ?
    do you want to see what I have ?
    what is that ?
    not in your mouth .

• 59,959 utterances, composed of 337,751 words

Uncollapsed Gibbs sampler for topic model

• Data consists of “documents” Xi
• Each Xi is a sequence of “words” Xi,j
• Initialize by randomly assigning each document Xi to a topic Zi
• Repeat the following:

  – Replace φ with a sample from a Dirichlet with parameters β + N(Z)
  – For each topic k, replace θk with a sample from a Dirichlet with parameters α + ∑_{i:Zi=k} N(Xi)
  – For each document i, replace Zi with a sample from P(Zi = k | φ, θ, Xi) ∝ φk ∏_{j=1}^m θ_{k,j}^{Nj(Xi)}
  (this sampler is sketched in code below)
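A minimal sketch of this uncollapsed sampler in Python, assuming each document is an array of integer word ids over a vocabulary of size m; the function and variable names are my own, not from the slides:

import numpy as np

def uncollapsed_gibbs(docs, ell, m, alpha, beta, n_iter, rng):
    # docs: list of integer arrays (word ids in 0..m-1); ell: number of topics
    n = len(docs)
    N = np.array([np.bincount(d, minlength=m) for d in docs])   # word counts N(X_i) per document
    Z = rng.integers(ell, size=n)                               # random initial topic assignments
    for _ in range(n_iter):
        # phi ~ DIR(beta + N(Z))
        phi = rng.dirichlet(beta + np.bincount(Z, minlength=ell))
        # theta_k ~ DIR(alpha + summed word counts of the documents assigned to topic k)
        theta = np.array([rng.dirichlet(alpha + N[Z == k].sum(axis=0)) for k in range(ell)])
        # Z_i ~ P(Z_i = k | phi, theta, X_i), proportional to phi_k * prod_j theta_kj^N_j(X_i)
        log_p = np.log(phi) + N @ np.log(theta).T               # shape (n, ell), computed in logs
        p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        Z = np.array([rng.choice(ell, p=row) for row in p])
    return Z, phi, theta

Calling this with, e.g., rng = np.random.default_rng(0), alpha = np.ones(m) and beta = np.ones(ell) mirrors the α = 1, β = 1 setting used in the experiments later in the slides.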

Collapsed Gibbs sampler for topic model

• Initialize by randomly assigning each document Xi to a topic Zi
• Repeat the following:

  – For each document i in 1, . . . , n (in random order):
    – Replace Zi with a random sample from P(Zi | Z−i, α, β)

  P(Zi = k | Z−i, α, β) ∝ (Nk(Z−i) + βk) / (n − 1 + β•) · C(α + ∑_{i′≠i:Zi′=k} N(Xi′) + N(Xi)) / C(α + ∑_{i′≠i:Zi′=k} N(Xi′))
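A minimal Python sketch of this collapsed sampler; the constant denominator n − 1 + β• is dropped since it does not depend on k. Function and variable names are my own, and documents are assumed to be arrays of integer word ids:

import numpy as np
from scipy.special import gammaln

def log_C(a):
    # log of the Dirichlet normaliser C(a)
    return gammaln(a).sum() - gammaln(a.sum())

def collapsed_gibbs(docs, ell, m, alpha, beta, n_iter, rng):
    n = len(docs)
    N = np.array([np.bincount(d, minlength=m) for d in docs])            # word counts per document
    Z = rng.integers(ell, size=n)                                        # random initial assignments
    topic_counts = np.bincount(Z, minlength=ell)                         # N_k(Z)
    word_counts = np.array([N[Z == k].sum(axis=0) for k in range(ell)])  # word totals per topic
    for _ in range(n_iter):
        for i in rng.permutation(n):
            # Remove document i from its current topic
            k_old = Z[i]
            topic_counts[k_old] -= 1
            word_counts[k_old] -= N[i]
            # log P(Z_i = k | Z_-i, alpha, beta), up to a constant
            log_p = np.log(topic_counts + beta)
            for k in range(ell):
                log_p[k] += log_C(alpha + word_counts[k] + N[i]) - log_C(alpha + word_counts[k])
            p = np.exp(log_p - log_p.max())
            p /= p.sum()
            # Sample a new topic for document i and add its counts back in
            k_new = rng.choice(ell, p=p)
            Z[i] = k_new
            topic_counts[k_new] += 1
            word_counts[k_new] += N[i]
    return Z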

Topics assigned after 100 iterations

  1 big drum ?
  3 horse .
  8 who is that ?
  9 those are checkers .
  3 two checkers # yes .
  1 play checkers ?
  1 big horn ?
  2 get over # Mommy .
  1 shadow ?
  9 I like it .
  1 why don’t you read Shadow yourself ?
  9 that’s a terribly small horse for you to ride .
  2 why don’t you look at some of the toys in the basket .
  1 want to ?
  1 do you want to see what I have ?
  8 what is that ?
  2 not in your mouth .
  2 let me put them together .
  2 no # put floor .
  3 no # that’s his pencil .
  3 that’s not Daddy # that’s Colin .
  9 I think perhaps he’s going back to school .

Most probable words in each cluster

Cluster Z=4, P(Z=4) = 0.4334 — words x with highest P(X=x | Z=4):
  . 0.12526, # 0.045402, you 0.040475, the 0.030259, it 0.024154, I 0.021848, to 0.018473, don't 0.015473, a 0.013662, ? 0.013459, in 0.011708, on 0.011064, your 0.010145, and 0.009578, that 0.0093303, have 0.0088019, no 0.0082514, put 0.0067486, know 0.0064239, there 0.0058789

Cluster Z=9, P(Z=9) = 0.3111:
  ? 0.19147, you 0.062577, what 0.061256, that 0.022295, the 0.022126, # 0.021809, is 0.021683, do 0.016127, it 0.015927, a 0.015092, to 0.013783, did 0.012631, are 0.011427, what's 0.011195, your 0.0098961, huh 0.0082591, want 0.0076782, where 0.0072346, why 0.0070656, hmm 0.0066537

Cluster Z=7, P(Z=7) = 0.2555:
  . 0.2258, # 0.0695, that's 0.034538, a 0.034066, no 0.02649, oh 0.023558, yeah 0.020332, the 0.014907, xxx 0.014288, not 0.013864, it's 0.013343, ? 0.013033, yes 0.011795, right 0.0094166, alright 0.0088953, is 0.0087975, you're 0.0076571, one 0.006647, ! 0.0057673, it 0.0055555

Cluster Z=3, P(Z=3) = 5.003e-05:
  quack 0.85, . 0.15

Remarks on cluster results

• The samplers cluster words by clustering the documents they appear in, and cluster documents by clustering the words that appear in them
• Even though there were ℓ = 10 clusters and α = 1, β = 1, typically only 4 clusters were occupied after convergence
• Words x with high marginal probability P(X = x) are typically so frequent that they occur in all clusters
  ⇒ Listing the most probable words in each cluster may not be a good way of characterizing the clusters
• Instead, we can Bayes invert and find the words that are most strongly associated with each class:

  P(Z = k | X = x) = (N_{k,x}(Z, X) + e) / (Nx(X) + eℓ)

  where e is a small smoothing constant
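Not from the slides: a tiny Python sketch of this inversion, where N_kx[k, x] counts how many tokens of word x were assigned to cluster k; the count matrix below is made up:

import numpy as np

def word_cluster_association(N_kx, e=1.0):
    # P(Z = k | X = x) = (N_kx[k, x] + e) / (N_x + e * ell), with N_x = sum_k N_kx[k, x]
    ell = N_kx.shape[0]
    N_x = N_kx.sum(axis=0)
    return (N_kx + e) / (N_x + e * ell)

N_kx = np.array([[10, 0, 3],       # hypothetical counts: 2 clusters, 3 word types
                 [ 1, 5, 3]])
print(word_cluster_association(N_kx))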

Purest words of each cluster

Cluster Z=4, P(Z=4) = 0.4334 — words x with highest P(Z=4 | X=x):
  I'll 0.97168, we'll 0.96486, c(o)me 0.95319, you'll 0.95238, may 0.94845, let's 0.947, thought 0.94382, won't 0.93645, come 0.93588, let 0.93255, I 0.93192, (h)ere 0.93082, stay 0.92073, later 0.91964, thank 0.91667, them 0.9124, can't 0.90762, never 0.9058, em 0.89922, back 0.89319

Cluster Z=9, P(Z=9) = 0.3111:
  d(o) 0.97138, what's 0.95242, what're 0.94348, happened 0.93722, hmm 0.93343, whose 0.92437, what 0.9227, where's 0.92241, doing 0.90196, where'd 0.9009, don't] 0.89157, whyn't 0.89157, who 0.88527, how's 0.875, who's 0.85068, [: 0.85047, ? 0.84783, matter 0.82963, what'd 0.8125, else 0.80712

Cluster Z=7, P(Z=7) = 0.2555:
  0 0.94715, mmhm 0.944, www 0.90244, m:hm 0.83019, uhhuh 0.81667, uh(uh) 0.78571, uhuh 0.77551, that's 0.7755, yep 0.76531, um 0.76282, oh+boy 0.73529, d@l 0.72603, goodness 0.7234, s@l 0.72, sorry 0.70588, thank+you 0.6875, o:h 0.68, nope 0.67857, hi 0.67213, alright 0.6687

Cluster Z=3, P(Z=3) = 5.003e-05:
  quack 0.64286, . 0.00010802

Summary

• Complex models often don’t have analytic solutions
• Approximate inference can be used on many such models
• Markov chain Monte Carlo (MCMC) methods produce samples from (an approximation to) the posterior distribution
• Gibbs sampling is an MCMC procedure that resamples each variable conditioned on the values of the other variables
• If you can sample from the conditional distribution of each hidden variable in a Bayes net, you can use Gibbs sampling to sample from the joint posterior distribution
• We applied Gibbs sampling to Dirichlet-multinomial mixtures to cluster sentences
