Bayesian Inference for Dirichlet-Multinomials

Mark Johnson, Macquarie University, Sydney, Australia. MLSS “Summer School”.

Random variables and “distributed according to” notation

• A probability distribution F is a non-negative function on some set X whose values sum (integrate) to 1.
• A random variable X is distributed according to a distribution F, or more simply, X has distribution F, written X ∼ F, iff P(X = x) = F(x) for all x (this is the discrete case).
• You will sometimes see the notation X | Y ∼ F, which means “X is generated conditional on Y with distribution F” (where F usually depends on Y), i.e., P(X | Y) = F(X | Y).

Outline

• Introduction to Bayesian inference
• Mixture models
• Sampling with Markov chains
• The Gibbs sampler
• Gibbs sampling for Dirichlet-multinomial mixtures
• Topic modeling with Dirichlet-multinomial mixtures

Bayes’ rule

    P(Hypothesis | Data) = P(Data | Hypothesis) P(Hypothesis) / P(Data)

• Bayesians use Bayes’ rule to update beliefs in hypotheses in response to data.
• P(Hypothesis | Data) is the posterior distribution,
• P(Hypothesis) is the prior distribution,
• P(Data | Hypothesis) is the likelihood, and
• P(Data) is a normalising constant, sometimes called the evidence.

Computing the normalising constant

    P(Data) = ∑_{Hypothesis′ ∈ H} P(Data, Hypothesis′)
            = ∑_{Hypothesis′ ∈ H} P(Data | Hypothesis′) P(Hypothesis′)

• If the set of hypotheses H is small, we can calculate P(Data) by enumeration.
• But often these sums are intractable.

Bayesian belief updating

• Idea: treat the posterior from the last observation as the prior for the next.
• Consistency follows because the likelihood factors.
  – Suppose d = (d_1, d_2). Then the posterior of a hypothesis h is:

        P(h | d_1, d_2) ∝ P(h) P(d_1, d_2 | h)
                        = P(h) P(d_1 | h) P(d_2 | h, d_1)
                        ∝ P(h | d_1) P(d_2 | h, d_1)

    where P(h | d_1) plays the role of the updated prior and P(d_2 | h, d_1) is the likelihood.

Discrete distributions

• A discrete distribution has a finite set of outcomes 1, ..., m.
• A discrete distribution is parameterized by a vector θ = (θ_1, ..., θ_m), where P(X = j | θ) = θ_j (so ∑_{j=1}^m θ_j = 1).
  – Example: an m-sided die, where θ_j is the probability of face j.
• Suppose X = (X_1, ..., X_n) and each X_i | θ ∼ DISCRETE(θ). Then:

        P(X | θ) = ∏_{i=1}^n DISCRETE(X_i; θ) = ∏_{j=1}^m θ_j^{N_j}

  where N_j is the number of times j occurs in X.
• Goal of the next few slides: compute P(θ | X).

Multinomial distributions

• Suppose X_i ∼ DISCRETE(θ) for i = 1, ..., n, and N_j is the number of times j occurs in X.
• Then N | n, θ ∼ MULTI(θ, n), and

        P(N | n, θ) = (n! / ∏_{j=1}^m N_j!) ∏_{j=1}^m θ_j^{N_j}

  where n! / ∏_{j=1}^m N_j! is the number of sequences of values with occurrence counts N.
• The vector N is known as a sufficient statistic for θ because it supplies as much information about θ as the original sequence X does.

Dirichlet distributions

• Dirichlet distributions are probability distributions over multinomial parameter vectors.
  – They are called Beta distributions when m = 2.
• They are parameterized by a vector α = (α_1, ..., α_m) with α_j > 0 that determines the shape of the distribution:

        DIR(θ | α) = (1 / C(α)) ∏_{j=1}^m θ_j^{α_j − 1}

        C(α) = ∫_Δ ∏_{j=1}^m θ_j^{α_j − 1} dθ = ∏_{j=1}^m Γ(α_j) / Γ(∑_{j=1}^m α_j)

• Γ is a generalization of the factorial function:
  – Γ(k) = (k − 1)! for positive integer k
  – Γ(x) = (x − 1) Γ(x − 1) for all x

Plots of the Dirichlet distribution

        P(θ | α) = (Γ(∑_{j=1}^m α_j) / ∏_{j=1}^m Γ(α_j)) ∏_{k=1}^m θ_k^{α_k − 1}

[Figure: P(θ_1 | α) plotted against θ_1 (the probability of outcome 1) for α = (1, 1), (5, 2), and (0.1, 0.1).]

A small numerical sketch of this density follows below.
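The density above is straightforward to evaluate numerically. The following is a minimal Python sketch (not part of the original slides; the function name log_dirichlet_pdf is just an illustrative choice) that computes log DIR(θ | α) via the log-Gamma form of the normaliser C(α). It assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy.special import gammaln

def log_dirichlet_pdf(theta, alpha):
    """log DIR(theta | alpha) = sum_j (alpha_j - 1) log theta_j - log C(alpha),
    where log C(alpha) = sum_j log Gamma(alpha_j) - log Gamma(sum_j alpha_j)."""
    theta = np.asarray(theta, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    log_C = gammaln(alpha).sum() - gammaln(alpha.sum())
    return np.sum((alpha - 1.0) * np.log(theta)) - log_C

# The m = 2 (Beta) case plotted above: evaluate the density of theta_1 at 0.3
# for the same alpha vectors shown in the figure.
for alpha in [(1.0, 1.0), (5.0, 2.0), (0.1, 0.1)]:
    theta1 = 0.3
    p = np.exp(log_dirichlet_pdf([theta1, 1.0 - theta1], alpha))
    print(f"alpha={alpha}: P(theta_1={theta1}) = {p:.3f}")
```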
Dirichlet distributions as priors for θ

• Generative model:

        θ | α ∼ DIR(α)
        X_i | θ ∼ DISCRETE(θ),   i = 1, ..., n

• We can depict this as a Bayes net using plates, which indicate replication.

[Figure: Bayes net with nodes α → θ → X_i, where X_i sits inside a plate replicated n times.]

Inference for θ with Dirichlet priors

• Data X = (X_1, ..., X_n) generated i.i.d. from DISCRETE(θ).
• The prior is DIR(α). By Bayes’ rule, the posterior is:

        P(θ | X) ∝ P(X | θ) P(θ)
                 ∝ (∏_{j=1}^m θ_j^{N_j}) (∏_{j=1}^m θ_j^{α_j − 1})
                 = ∏_{j=1}^m θ_j^{N_j + α_j − 1},   so
        P(θ | X) = DIR(N + α)

• So if the prior is Dirichlet with parameters α, the posterior is Dirichlet with parameters N + α
  ⇒ we can regard the Dirichlet parameters α as “pseudo-counts” from “pseudo-data”.

Conjugate priors

• If the prior is DIR(α) and the likelihood is i.i.d. DISCRETE(θ), then the posterior is DIR(N + α)
  ⇒ the prior parameters α specify “pseudo-observations”.
• A class C of prior distributions P(H) is conjugate to a class of likelihood functions P(D | H) iff the posterior P(H | D) is also a member of C.
• In general, conjugate priors encode “pseudo-observations”:
  – the difference between the prior P(H) and the posterior P(H | D) is the observations in D,
  – but P(H | D) belongs to the same family as P(H), and can serve as the prior for inferences about more data D′
  ⇒ it must be possible to encode the observations D using the parameters of the prior.
• In general, the likelihood functions that have conjugate priors belong to the exponential family.

Point estimates from Bayesian posteriors

• A “true” Bayesian prefers to use the full P(H | D), but sometimes we have to choose a single “best” hypothesis.
• The maximum a posteriori (MAP) estimate, or posterior mode, is

        Ĥ = argmax_H P(H | D) = argmax_H P(D | H) P(H)

• The expected value E_P[X] of X under distribution P is:

        E_P[X] = ∫ x P(X = x) dx

  The expected value is a kind of average, weighted by P(X). The expected value E[θ] of θ is an estimate of θ.

The posterior mode of a Dirichlet

• For Dirichlets with parameters α, the MAP estimate (posterior mode) is:

        θ̂_j = (α_j − 1) / ∑_{j′=1}^m (α_j′ − 1)

  so if the posterior is DIR(N + α), the MAP estimate for θ is:

        θ̂_j = (N_j + α_j − 1) / (n + ∑_{j′=1}^m (α_j′ − 1))

• If α = 1 then θ̂_j = N_j / n, which is also the maximum likelihood estimate (MLE) for θ.

The expected value of θ for a Dirichlet

• For Dirichlets with parameters α, the expected value of θ_j is:

        E_{DIR(α)}[θ_j] = α_j / ∑_{j′=1}^m α_j′

• Thus if the posterior is DIR(N + α), the expected value of θ_j is:

        E_{DIR(N + α)}[θ_j] = (N_j + α_j) / (n + ∑_{j′=1}^m α_j′)

• E[θ] smooths or regularizes the MLE by adding pseudo-counts α to N.

Sampling from a Dirichlet

• θ | α ∼ DIR(α) iff P(θ | α) = (1 / C(α)) ∏_{j=1}^m θ_j^{α_j − 1}, where:

        C(α) = ∏_{j=1}^m Γ(α_j) / Γ(∑_{j=1}^m α_j)

• There are several algorithms for producing samples from DIR(α). A simple one relies on the following result:
  – If V_k ∼ GAMMA(α_k) and θ_k = V_k / ∑_{k′=1}^m V_k′, then θ ∼ DIR(α).
• This leads to the following algorithm for producing a sample θ from DIR(α) (a code sketch follows below):
  – Sample v_k from GAMMA(α_k) for k = 1, ..., m.
  – Set θ_k = v_k / ∑_{k′=1}^m v_k′.
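As a concrete companion to the last few slides, here is a minimal Python/NumPy sketch (not from the original slides; the helper names sample_dirichlet and posterior_mean_and_mode are illustrative) of the Gamma-normalisation sampler, together with the posterior-mean and posterior-mode point estimates for a DIR(N + α) posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dirichlet(alpha):
    """Draw theta ~ DIR(alpha) by normalising independent Gamma(alpha_k) draws."""
    v = rng.gamma(shape=np.asarray(alpha, dtype=float))  # v_k ~ GAMMA(alpha_k)
    return v / v.sum()                                    # theta_k = v_k / sum_k' v_k'

def posterior_mean_and_mode(N, alpha):
    """Point estimates of theta under the posterior DIR(N + alpha)."""
    post = np.asarray(N, dtype=float) + np.asarray(alpha, dtype=float)
    mean = post / post.sum()                  # (N_j + alpha_j) / (n + sum_j' alpha_j')
    mode = (post - 1.0) / (post - 1.0).sum()  # valid when all N_j + alpha_j >= 1
    return mean, mode

# Toy example: counts N for a six-sided die with a uniform prior alpha = (1, ..., 1)
N = [0, 2, 0, 1, 1, 1]
alpha = [1, 1, 1, 1, 1, 1]
mean, mode = posterior_mean_and_mode(N, alpha)
print("posterior mean:", np.round(mean, 3))
print("posterior mode (= MLE here):", np.round(mode, 3))
print("a posterior sample:", np.round(sample_dirichlet(np.add(N, alpha)), 3))
```

With a uniform prior α = (1, ..., 1) the posterior mode reduces to the MLE N_j / n, while the posterior mean adds one pseudo-count to each face, as on the slides above.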
Posterior with Dirichlet priors

        θ | α ∼ DIR(α)
        X_i | θ ∼ DISCRETE(θ),   i = 1, ..., n

• Integrate out θ to calculate the posterior probability of X:

        P(X | α) = ∫_Δ P(X, θ | α) dθ = ∫_Δ P(X | θ) P(θ | α) dθ
                 = ∫_Δ (∏_{j=1}^m θ_j^{N_j}) (1 / C(α)) (∏_{j=1}^m θ_j^{α_j − 1}) dθ
                 = (1 / C(α)) ∫_Δ ∏_{j=1}^m θ_j^{N_j + α_j − 1} dθ
                 = C(N + α) / C(α),   where C(α) = ∏_{j=1}^m Γ(α_j) / Γ(∑_{j=1}^m α_j)

• Collapsed Gibbs samplers and the Chinese Restaurant Process rely on this result.

Predictive distribution for Dirichlet-multinomials

• The predictive distribution is the distribution of observation X_{n+1} given observations X = (X_1, ..., X_n) and prior DIR(α):

        P(X_{n+1} = k | X, α) = ∫_Δ P(X_{n+1} = k | θ) P(θ | X, α) dθ
                              = ∫_Δ θ_k DIR(θ | N + α) dθ
                              = (N_k + α_k) / ∑_{j=1}^m (N_j + α_j)

  (The numerical sketch at the end of this section checks this formula and the evidence P(X | α) on the die example.)

Example: rolling a die

• Data d = (2, 5, 4, 2, 6).

[Figure: P(θ_2 | α) plotted against θ_2 (the probability of side 2) as the rolls are observed one at a time, for α = (1,1,1,1,1,1), (1,2,1,1,1,1), (1,2,1,1,2,1), (1,2,1,2,2,1), (1,3,1,2,2,1), (1,3,1,2,2,2).]

Inference in complex models

• If the model is simple enough we can calculate the posterior exactly (conjugate priors).
• When the model is more complicated, we can only approximate the posterior.
• Variational Bayes calculates the function closest to the posterior within a class of functions.
• Sampling algorithms produce samples from the posterior distribution.
  – Markov chain Monte Carlo (MCMC) algorithms use a Markov chain to produce samples.
  – A Gibbs sampler is a particular MCMC algorithm.
• Particle filters are a kind of on-line sampling algorithm (on-line algorithms only make one pass through the data).

Mixture models

• Observations X_i are a mixture of ℓ source distributions F(θ_k), k = 1, ..., ℓ
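To tie the marginal-likelihood and predictive-distribution results above to the die example, here is a small Python/SciPy sketch (my own illustration, not from the slides; the helper names are invented). It computes log P(X | α) = log C(N + α) − log C(α) and the predictive probabilities (N_k + α_k) / ∑_j (N_j + α_j) for the rolls d = (2, 5, 4, 2, 6) under a uniform prior.

```python
import numpy as np
from scipy.special import gammaln

def log_C(alpha):
    """log C(alpha) = sum_j log Gamma(alpha_j) - log Gamma(sum_j alpha_j)."""
    alpha = np.asarray(alpha, dtype=float)
    return gammaln(alpha).sum() - gammaln(alpha.sum())

def log_evidence(N, alpha):
    """log P(X | alpha) = log C(N + alpha) - log C(alpha), with theta integrated out."""
    return log_C(np.asarray(N, dtype=float) + np.asarray(alpha, dtype=float)) - log_C(alpha)

def predictive(N, alpha):
    """P(X_{n+1} = k | X, alpha) = (N_k + alpha_k) / sum_j (N_j + alpha_j)."""
    post = np.asarray(N, dtype=float) + np.asarray(alpha, dtype=float)
    return post / post.sum()

# Die-roll data d = (2, 5, 4, 2, 6) from the example above, uniform prior alpha = 1
d = [2, 5, 4, 2, 6]
N = np.bincount(d, minlength=7)[1:]   # counts for faces 1..6
alpha = np.ones(6)
print("log P(X | alpha):", round(log_evidence(N, alpha), 4))
print("predictive distribution:", np.round(predictive(N, alpha), 3))
```

Multiplying the predictive probability of each successive roll (updating the counts as you go) gives the same value as C(N + α) / C(α), which is one way to sanity-check an implementation of collapsed (θ-integrated-out) models.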
