A Gibbs Sampler for a Hierarchical Dirichlet Process Mixture Model

Mark Andrews April 11, 2019

The hierarchical Dirichlet process mixture model (HDPMM), when its mixture components are categorical distributions, is a probabilistic model of multinomial data. It was first described as part of a more general description of Hierarchical Dirichlet Process (HDP) models by Teh, Jordan, Beal, and Blei (2004, 2006), and is the Bayesian nonparametric generalization of the Latent Dirichlet Allocation (LDA) model of Blei, Ng, and Jordan (2003). The aim of this note is to describe in detail a Gibbs sampler for the HDPMM when used with multinomial data. This Gibbs sampler is based on what is described in Teh et al. (2004, 2006) for general HDP models. However, as they did not deal in detail with HDP mixture models for multinomial data, important details of the sampler required for this particular case were not described there. Newman, Asuncion, Smyth, and Welling (2009), on the other hand, do deal explicitly with the HDPMM for multinomial data, and the Gibbs sampler described here is almost identical to theirs. However, for some hyper-parameters, Newman et al. (2009) either make simplifying assumptions or assume that their values are known, assumptions that we do not make here. As such, the sampler described here is a minor extension of that described in Newman et al. (2009).

The probabilistic model

One of the most straightforward applications of the multinomial data HDPMM is as a bag-of-words probabilistic language model, and in what follows we'll describe it with this application in mind. However, modulo some possible changes in notation, this will in fact constitute a general description of the HDPMM for multinomial data. In general, according to a bag-of-words language model, a corpus of natural language is a set of $J$ documents or texts
$$
w_1, w_2 \ldots w_j \ldots w_J,
$$
where text $j$, i.e. $w_j$, is a set of $n_j$ words from a finite vocabulary of $V$ word types. For simplicity, this vocabulary can be represented as the $V$ integers $\{1, 2 \ldots V\}$. From this, we have each $w_j$ defined as
$$
w_j = w_{j1}, w_{j2} \ldots w_{ji} \ldots w_{jn_j},
$$
with each $w_{ji} \in \{1 \ldots V\}$. The bag-of-words assumption is that, for each text, $w_{j1}, w_{j2} \ldots w_{ji} \ldots w_{jn_j}$ are exchangeable random variables, i.e. their joint probability distribution is invariant to any permutation of the indices. By this assumption, therefore, as the name implies, each text is modelled as an unordered set, or bag, of words.

As a generative model of this language corpus, the HDPMM treats each observed word $w_{ji}$ as a sample from one of an underlying set of text or discourse topics
$$
\phi = \phi_1, \phi_2 \ldots \phi_k \ldots,
$$


where each $\phi_k$ is a probability distribution over $\{1 \ldots V\}$. The identity of the particular topic distribution from which $w_{ji}$ is drawn is determined by the value of a discrete latent variable $x_{ji} \in \{1, 2 \ldots k \ldots\}$ that corresponds to $w_{ji}$. As such, for each $w_{ji}$, we model it as
$$
w_{ji} \mid x_{ji}, \phi \sim \mathrm{dcat}(\phi_{x_{ji}}).
$$

To be clear, the HDPMM assumes that there is an unlimited number of topic distributions from which the observed data are drawn, and so each $x_{ji}$ can take infinitely many discrete values. The probability distribution over the infinitely many possible values of each $x_{ji}$ is given by an infinite length array $\pi_j$, i.e. $\pi_j = \pi_{j1}, \pi_{j2} \ldots \pi_{jk} \ldots$, where $0 \leq \pi_{jk} \leq 1$ and $\sum_{k=1}^{\infty} \pi_{jk} = 1$, that is specific to text $j$. In other words,
$$
x_{ji} \mid \pi_j \sim \mathrm{dcat}(\pi_j).
$$

Each $\pi_j$ is assumed to be drawn from a Dirichlet process prior whose base distribution, $m$, is a categorical distribution over the positive integers and whose scalar concentration parameter is $a$:
$$
\pi_j \mid a, m \sim \mathrm{ddp}(a, m).
$$

The $m$ base distribution is assumed to be drawn from a stick-breaking distribution with parameter $\gamma$:
$$
m \mid \gamma \sim \mathrm{dstick}(\gamma).
$$
The prior distributions of the Dirichlet process concentration parameter $a$ and the stick-breaking parameter $\gamma$ are Gamma distributions, both with shape and scale parameters equal to 1. For the topic distributions $\phi_1, \phi_2 \ldots \phi_k \ldots$, we assume they are independently and identically drawn from a Dirichlet distribution with a length $V$ location parameter $\psi$ and concentration parameter $b$. In turn, $\psi$ is drawn from a symmetric Dirichlet distribution with concentration parameter $c$. Finally, both $b$ and $c$, like $a$ and $\gamma$, are given Gamma priors, again with shape and scale parameters equal to 1.
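To make these generative assumptions concrete, the following is a minimal sketch in Python (NumPy) of the generative process. Because the model is nonparametric, the infinite topic set is truncated at a finite `K_max` purely for simulation; the truncation level, the corpus sizes, and all variable names are illustrative assumptions of this sketch rather than part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions of this sketch only).
J, V, K_max = 5, 50, 20                 # documents, vocabulary size, truncation level
n_j = rng.integers(20, 40, size=J)      # words per document
a, gamma, b, c = 1.0, 1.0, 1.0, 1.0     # hyper-parameters (each has a Gamma(1, 1) prior in the text)

# m | gamma ~ dstick(gamma), truncated at K_max and renormalised.
omega = rng.beta(1.0, gamma, size=K_max)
m = omega * np.concatenate(([1.0], np.cumprod(1.0 - omega)[:-1]))
m /= m.sum()

# psi ~ symmetric Dirichlet(c); phi_k | psi, b ~ Dirichlet(b * psi).
psi = rng.dirichlet(np.full(V, c / V))
phi = rng.dirichlet(b * psi, size=K_max)

# pi_j | a, m ~ ddp(a, m); with a finite truncated m this reduces to Dirichlet(a * m).
corpus = []
for j in range(J):
    pi_j = rng.dirichlet(a * m)
    x_j = rng.choice(K_max, size=n_j[j], p=pi_j)              # latent topic of each word
    w_j = np.array([rng.choice(V, p=phi[k]) for k in x_j])    # observed words
    corpus.append(w_j)
```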

Sampling each latent variable $x_{ji}$

The probability that $x_{ji}$ takes the value $k$, for any $k \in 1, 2 \ldots$, is
$$
P(x_{ji} = k \mid w_{ji} = v, x_{\neg ji}, w_{\neg ji}, b, \psi, a, m) \propto P(w_{ji} = v \mid x_{ji} = k, w_{\neg ji}, x_{\neg ji}, b, \psi)\, P(x_{ji} = k \mid x_{\neg ji}, a, m),
$$
where $x_{\neg ji}$ denotes all latent variables excluding $x_{ji}$, with an analogous meaning for $w_{\neg ji}$. Here, the likelihood term is

$$
P(w_{ji} = v \mid x_{ji} = k, x_{\neg ji}, w_{\neg ji}, b, \psi) = \int P(w_{ji} = v \mid \phi_k)\, P(\phi_k \mid x_{\neg ji}, w_{\neg ji}, b, \psi)\, d\phi_k.
$$
This is the expected value of $\phi_{kv}$ according to the Dirichlet posterior
$$
P(\phi_k \mid x_{\neg ji}, w_{\neg ji}, b, \psi) = \frac{\Gamma(S^{\neg ji}_{k\cdot} + b)}{\prod_{v=1}^{V} \Gamma(S^{\neg ji}_{kv} + b\psi_v)} \prod_{v=1}^{V} \phi_{kv}^{S^{\neg ji}_{kv} + b\psi_v - 1},
$$
where $S^{\neg ji}_{kv} \triangleq \sum_{j'i' \neq ji} \mathbb{I}(w_{j'i'} = v, x_{j'i'} = k)$ and $S^{\neg ji}_{k\cdot} = \sum_{v=1}^{V} S^{\neg ji}_{kv}$. As such,
$$
P(w_{ji} = v \mid x_{ji} = k, x_{\neg ji}, w_{\neg ji}, b, \psi) = \frac{S^{\neg ji}_{kv} + b\psi_v}{S^{\neg ji}_{k\cdot} + b}.
$$

The prior term, on the other hand, is

$$
P(x_{ji} = k \mid x_{\neg ji}, a, m) = \int P(x_{ji} = k \mid \pi_j)\, P(\pi_j \mid x_{\neg ji}, a, m)\, d\pi_j,
$$
and this is the expected value of $\pi_{jk}$ according to
$$
P(\pi_j \mid x_{\neg ji}, a, m) \propto P(x_{\neg ji} \mid \pi_j)\, P(\pi_j \mid a, m).
$$

As $P(\pi_j \mid a, m)$ is a Dirichlet process, by the definition of a Dirichlet process we have $P(\pi_{jk} \mid a, m) = \mathrm{Beta}(a m_k, a \sum_{k' \neq k} m_{k'})$, and therefore
$$
P(\pi_{jk} \mid x_{\neg ji}, a, m) = \mathrm{Beta}\Big(R^{\neg ji}_{jk} + a m_k,\ \textstyle\sum_{k' \neq k} \big(R^{\neg ji}_{jk'} + a m_{k'}\big)\Big),
$$
where $R^{\neg ji}_{jk} \triangleq \sum_{i' \neq i}^{n_j} \mathbb{I}(x_{ji'} = k)$. As such, the expected value of $\pi_{jk}$ is
$$
P(x_{ji} = k \mid x_{\neg ji}, a, m) = \frac{R^{\neg ji}_{jk} + a m_k}{R^{\neg ji}_{j\cdot} + a},
$$

where $R^{\neg ji}_{j\cdot} = \sum_{k=1}^{\infty} R^{\neg ji}_{jk}$. Given these likelihood and prior terms, the posterior is simply
$$
\begin{aligned}
P(x_{ji} = k \mid w_{ji} = v, x_{\neg ji}, w_{\neg ji}, b, \psi, a, m) &\propto \frac{S^{\neg ji}_{kv} + b\psi_v}{S^{\neg ji}_{k\cdot} + b} \cdot \frac{R^{\neg ji}_{jk} + a m_k}{R^{\neg ji}_{j\cdot} + a},\\
&\propto \frac{S^{\neg ji}_{kv} + b\psi_v}{S^{\neg ji}_{k\cdot} + b} \times \big(R^{\neg ji}_{jk} + a m_k\big).
\end{aligned}
$$
Note that from this, we also have
$$
P(x_{ji} > K \mid x_{\neg ji}, w, b, \psi, a, m) \propto \sum_{\{k > K\}} \frac{S^{\neg ji}_{kv} + b\psi_v}{S^{\neg ji}_{k\cdot} + b} \times \big(R^{\neg ji}_{jk} + a m_k\big).
$$

Given that for all $k > K$, where $K$ is the maximum value of the set $\{x_{ji} : j \in 1 \ldots J, i \in 1 \ldots n_j\}$, we have $R^{\neg ji}_{jk} = 0$ and $S^{\neg ji}_{kv} = 0$, it follows that
$$
P(x_{ji} > K \mid x_{\neg ji}, w, b, \psi, a, m) \propto \sum_{\{k > K\}} \frac{b\psi_v}{b} \times a m_k = \psi_v \times a m_u,
$$
where $m_u = \sum_{\{k > K\}} m_k$. As a practical matter of sampling, for each latent variable $x_{ji}$ we calculate
$$
f_{jik} = \frac{S^{\neg ji}_{kv} + b\psi_v}{S^{\neg ji}_{k\cdot} + b} \times \big(R^{\neg ji}_{jk} + a m_k\big),
$$
for $k \in 1, 2 \ldots K$, and then

$$
f_{jiu} = \psi_v \times a m_u,
$$
where $K$ and $m_u$ are defined as above and $v = w_{ji}$. Now,

$$
P(x_{ji} \leq K \mid x_{\neg ji}, w, b, \psi, a, m) = \frac{\sum_{k=1}^{K} f_{jik}}{\sum_{k=1}^{K} f_{jik} + f_{jiu}}
$$
and
$$
P(x_{ji} > K \mid x_{\neg ji}, w, b, \psi, a, m) = \frac{f_{jiu}}{\sum_{k=1}^{K} f_{jik} + f_{jiu}},
$$
and so a single random sample will be sufficient to decide whether $x_{ji} \leq K$ or $x_{ji} > K$. If $x_{ji} \leq K$, then
$$
P(x_{ji} = k \mid x_{\neg ji}, w, b, \psi, a, m, x_{ji} \leq K) = \frac{f_{jik}}{\sum_{k=1}^{K} f_{jik}}.
$$
On the other hand, if $x_{ji} > K$, the probability that $x_{ji} = k^{\mathrm{new}}$ for $k^{\mathrm{new}} > K$ is
$$
P(x_{ji} = k^{\mathrm{new}} \mid x_{\neg ji}, w, b, \psi, a, m, x_{ji} > K) = \frac{\psi_v \times a m_{k^{\mathrm{new}}}}{f_{jiu}} = \frac{m_{k^{\mathrm{new}}}}{m_u}.
$$
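As a concrete illustration of this update, here is a minimal Python (NumPy) sketch of a single Gibbs step for one $x_{ji}$. The count arrays `S` and `R` (with the contribution of the current word already decremented), the function name, and its signature are illustrative assumptions about how the sampler's state might be stored.

```python
import numpy as np

def sample_x_ji(v, S, R, b, psi, a, m, m_u, rng):
    """One Gibbs update for x_ji given word w_ji = v.

    S : (K, V) word-topic counts S_kv (with w_ji, x_ji already removed).
    R : (K,)   document-topic counts R_jk for document j (x_ji removed).
    m : (K,)   base-measure weights for the K currently used topics;
    m_u is the leftover mass sum_{k > K} m_k.
    """
    K = S.shape[0]
    # f_jik for the K existing topics.
    f = (S[:, v] + b * psi[v]) / (S.sum(axis=1) + b) * (R + a * m)
    # f_jiu: unnormalised probability of using a topic beyond K.
    f_u = psi[v] * a * m_u
    total = f.sum() + f_u
    # One draw decides between an existing topic (x_ji <= K) and a new one.
    if rng.random() < f.sum() / total:
        return rng.choice(K, p=f / f.sum())
    return K  # signal "new topic"
```

The caller is then responsible for extending the count arrays and $m$ when a new topic is signalled, choosing the specific $k^{\mathrm{new}}$ with probability $m_{k^{\mathrm{new}}}/m_u$, e.g. by further breaking the stick.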

Sampling m and a

The posterior distribution over the infinite length array m is

$$
P(m \mid x_{1:J}, a, \gamma) \propto P(x_{1:J} \mid a, m)\, P(m \mid \gamma),
$$
while the posterior over the scalar parameter $a$ is
$$
P(a \mid x_{1:J}) \propto P(x_{1:J} \mid a, m)\, P(a),
$$
with the priors as stated above. The likelihood term in both cases is

$$
P(x_{1:J} \mid a, m) = \prod_{j=1}^{J} \int \prod_{i=1}^{n_j} P(x_{ji} \mid \pi_j)\, P(\pi_j \mid a, m)\, d\pi_j,
$$
where $\prod_{i=1}^{n_j} P(x_{ji} \mid \pi_j) = \prod_{k=1}^{K} \pi_{jk}^{R_{jk}}$, with $R_{jk} = \sum_{i=1}^{n_j} \mathbb{I}(x_{ji} = k)$, and $K$ is, as stated above, the maximum value attained by any latent variable. The prior $P(\pi_j \mid a, m)$ is a Dirichlet process prior and so, by definition of the Dirichlet process,

$$
P(\pi_{j1}, \pi_{j2} \ldots \pi_{jK}, \pi_{ju} \mid a, m) = \frac{\Gamma(a)}{\Gamma(a m_u) \prod_{k=1}^{K} \Gamma(a m_k)}\, \pi_{ju}^{a m_u - 1} \prod_{k=1}^{K} \pi_{jk}^{a m_k - 1},
$$
where $m_u = \sum_{\{k > K\}} m_k$, as stated above, and $\pi_{ju} = \sum_{\{k > K\}} \pi_{jk}$. Therefore,

$$
\begin{aligned}
P(x_{1:J} \mid a, m) &= \prod_{j=1}^{J} \int \prod_{i=1}^{n_j} P(x_{ji} \mid \pi_j)\, P(\pi_j \mid a, m)\, d\pi_j,\\
&= \prod_{j=1}^{J} \frac{\Gamma(a)}{\Gamma(a m_u) \prod_{k=1}^{K} \Gamma(a m_k)} \int \pi_{ju}^{a m_u - 1} \prod_{k=1}^{K} \pi_{jk}^{R_{jk} + a m_k - 1}\, d\pi_j,\\
&= \prod_{j=1}^{J} \frac{\Gamma(a)}{\Gamma(a m_u) \prod_{k=1}^{K} \Gamma(a m_k)} \cdot \frac{\Gamma(a m_u) \prod_{k=1}^{K} \Gamma(R_{jk} + a m_k)}{\Gamma(R_{j\cdot} + a)},\\
&= \prod_{j=1}^{J} \frac{\Gamma(a)}{\Gamma(R_{j\cdot} + a)} \prod_{k=1}^{K} \frac{\Gamma(R_{jk} + a m_k)}{\Gamma(a m_k)},
\end{aligned}
$$
and this can be re-written as

$$
\prod_{j=1}^{J} \left[\frac{1}{\Gamma(R_{j\cdot})} \int_{0}^{1} (\tau^r_j)^{a-1} (1 - \tau^r_j)^{R_{j\cdot}-1}\, d\tau^r_j\right] \prod_{k=1}^{K} \sum_{\sigma^r_{jk}=0}^{R_{jk}} \mathcal{S}(R_{jk}, \sigma^r_{jk}) (a m_k)^{\sigma^r_{jk}},
$$
given that
$$
\int_{0}^{1} (\tau^r_j)^{a-1} (1 - \tau^r_j)^{R_{j\cdot}-1}\, d\tau^r_j = \frac{\Gamma(a)\Gamma(R_{j\cdot})}{\Gamma(R_{j\cdot} + a)},
$$
and
$$
\sum_{\sigma^r_{jk}=0}^{R_{jk}} \mathcal{S}(R_{jk}, \sigma^r_{jk}) (a m_k)^{\sigma^r_{jk}} = \frac{\Gamma(R_{jk} + a m_k)}{\Gamma(a m_k)},
$$
where $\mathcal{S}$ is an unsigned Stirling number of the first kind. By treating $\tau^r$ and $\sigma^r$ as unobserved auxiliary variables, this leads to the augmented likelihood term

$$
P(x_{1:J} \mid a, m, \tau^r, \sigma^r) = \prod_{j=1}^{J} \frac{1}{\Gamma(R_{j\cdot})} (\tau^r_j)^{a-1} (1 - \tau^r_j)^{R_{j\cdot}-1} \prod_{k=1}^{K} \mathcal{S}(R_{jk}, \sigma^r_{jk}) (a m_k)^{\sigma^r_{jk}}.
$$

With the augmented likelihood treated as a function of $\sigma^r_{jk}$, we have
$$
P(x_{1:J} \mid a, m, \sigma^r) \propto \mathcal{S}(R_{jk}, \sigma^r_{jk}) (a m_k)^{\sigma^r_{jk}},
$$
and with a uniform prior over the values of $\sigma^r_{jk}$, the posterior probability of $\sigma^r_{jk}$ given all other variables is therefore
$$
P(\sigma^r_{jk} \mid x_{1:J}, a, m) = \frac{\mathcal{S}(R_{jk}, \sigma^r_{jk}) (a m_k)^{\sigma^r_{jk}}}{\sum_{\sigma^r_{jk}=0}^{R_{jk}} \mathcal{S}(R_{jk}, \sigma^r_{jk}) (a m_k)^{\sigma^r_{jk}}}.
$$
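In practice this distribution can be sampled without evaluating Stirling numbers explicitly: by Antoniak's result, it is the distribution of the number of occupied tables in a Chinese restaurant process with concentration $a m_k$ serving $R_{jk}$ customers, which is a sum of independent Bernoulli draws. The sketch below uses this equivalence; it is one convenient implementation choice, not the only one, and the function name is illustrative.

```python
import numpy as np

def sample_num_tables(count, alpha, rng):
    """Draw sigma with P(sigma) proportional to S(count, sigma) * alpha**sigma,
    i.e. the number of occupied tables when a CRP with concentration alpha
    seats `count` customers."""
    if count == 0:
        return 0
    # Customer i (0-based) starts a new table with probability alpha / (alpha + i).
    i = np.arange(count)
    return int(rng.binomial(1, alpha / (alpha + i)).sum())

# Example: sigma_r[j, k] = sample_num_tables(R[j, k], a * m[k], rng)
```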

On the other hand, when the augmented likelihood is treated as a function of $\tau^r_j$, we have
$$
P(x_{1:J} \mid a, \tau^r_j) \propto (\tau^r_j)^{a-1} (1 - \tau^r_j)^{R_{j\cdot}-1},
$$
and so, with a uniform prior on $\tau^r_j$, the posterior is
$$
P(\tau^r_j \mid x_{1:J}, a) \propto (\tau^r_j)^{a-1} (1 - \tau^r_j)^{R_{j\cdot}-1}, \quad \text{i.e. } \tau^r_j \mid x_{1:J}, a \sim \mathrm{Beta}(a, R_{j\cdot}).
$$

Similarly, with the augmented likelihood treated as a function of $m$, we have

$$
P(x_{1:J} \mid m, \sigma^r) \propto \prod_{k=1}^{K} m_k^{\sum_{j=1}^{J} \sigma^r_{jk}} = \prod_{k=1}^{K} m_k^{\sigma^r_{\cdot k}}.
$$
The prior on $m$ is a stick-breaking prior, and so the probability distribution over $m_1 \ldots m_k \ldots m_K, m_u$ is given by $m_1 = \omega'_1$, then $m_k = \omega'_k (1 - \sum_{k'=1}^{k-1} m_{k'})$ for $k \in 2 \ldots K$, and finally $m_u = 1 - \sum_{k=1}^{K} m_k$, where $\omega'_1 \ldots \omega'_K$ are independently and identically distributed as $\mathrm{Beta}(1, \gamma)$. This finite stick-breaking distribution is a $K + 1$ dimensional Generalized Dirichlet distribution, see Connor and Mosimann (1969); Wong (1998), whose parameter vectors are a vector of $K$ 1's and a vector of $K$ $\gamma$'s. As described in Wong (1998), the Generalized Dirichlet distribution is a conjugate prior for a multinomial likelihood. As such, the posterior over $m_1, m_2 \ldots m_K, m_u$ is a Generalized Dirichlet distribution with length $K$ parameter vectors $\alpha_1, \alpha_2 \ldots \alpha_K$ and $\beta_1, \beta_2 \ldots \beta_K$, where

$$
\alpha_k = 1 + \sigma^r_{\cdot k}, \qquad \beta_k = \gamma + \sum_{k'=k+1}^{K} \sigma^r_{\cdot k'}, \qquad \text{for } k \in 1, 2 \ldots K.
$$
We can sample from this Generalized Dirichlet distribution by using a finite stick-breaking construction, as was used for the prior, i.e. $m_1 = \omega_1$, $m_k = \omega_k (1 - \sum_{k'=1}^{k-1} m_{k'})$

for $k \in 2 \ldots K$, and $m_u = 1 - \sum_{k=1}^{K} m_k$, where $\omega_k \sim \mathrm{Beta}(\alpha_k, \beta_k)$. Finally, when treated as a function of $a$, the augmented likelihood is
$$
P(x_{1:J} \mid m, \tau^r, \sigma^r) \propto \prod_{j=1}^{J} (\tau^r_j)^{a} \prod_{k=1}^{K} a^{\sigma^r_{jk}} = e^{a \sum_{j=1}^{J} \log \tau^r_j}\, a^{\sum_{jk} \sigma^r_{jk}}.
$$
As stated above, the prior on $a$ is a Gamma distribution whose shape and scale parameters are both equal to 1.0. (Here, we assume the following parameterization of the Gamma distribution: $P(x \mid a, s) = \frac{1}{s^a \Gamma(a)} x^{a-1} e^{-x/s}$, where $a$ and $s$ are the shape and scale parameters, respectively.) Therefore, the posterior is a Gamma distribution with shape and scale
$$
\sum_{jk} \sigma^r_{jk} + 1, \qquad \frac{1}{1 - \sum_{j=1}^{J} \log \tau^r_j},
$$
respectively.
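Putting these pieces together, a sketch of the joint update for $m$ and $a$ might look as follows (Python/NumPy, relying on `sample_num_tables` from the earlier sketch; array and function names are illustrative assumptions about the sampler's state).

```python
import numpy as np

def sample_m_and_a(R, m, gamma, a, rng):
    """Resample (m_1 ... m_K, m_u) and the concentration a, given document-topic
    counts R (J x K) and the current m over the K used topics.
    Relies on sample_num_tables from the earlier sketch."""
    J, K = R.shape

    # Auxiliary variables: tau^r_j ~ Beta(a, R_j.) (documents assumed non-empty),
    # and sigma^r_jk via the CRP-table trick with concentration a * m_k.
    tau = rng.beta(a, R.sum(axis=1))
    sigma = np.zeros((J, K), dtype=int)
    for j in range(J):
        for k in range(K):
            sigma[j, k] = sample_num_tables(R[j, k], a * m[k], rng)

    # m | sigma, gamma: Generalized Dirichlet posterior, sampled by stick-breaking
    # with omega_k ~ Beta(alpha_k, beta_k).
    col = sigma.sum(axis=0)                 # sigma^r_{.k}
    tail = col[::-1].cumsum()[::-1]         # tail[k] = sum over k' >= k of sigma^r_{.k'}
    alpha = 1.0 + col
    beta = gamma + (tail - col)             # beta_k = gamma + sum over k' > k
    omega = rng.beta(alpha, beta)
    stick = np.concatenate(([1.0], np.cumprod(1.0 - omega)))
    m_new, m_u_new = omega * stick[:-1], stick[-1]

    # a | sigma, tau: Gamma posterior with the shape/scale parameterization in the text.
    a_new = rng.gamma(sigma.sum() + 1.0, 1.0 / (1.0 - np.log(tau).sum()))

    return m_new, m_u_new, a_new
```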

Sampling ψ and b

The posterior distribution over ψ is

$$
P(\psi \mid w_{1:J}, x_{1:J}) \propto P(w_{1:J} \mid x_{1:J}, \psi, b)\, P(\psi \mid c),
$$
while the posterior over the concentration parameter $b$ is
$$
P(b \mid w_{1:J}, x_{1:J}) \propto P(w_{1:J} \mid x_{1:J}, \psi, b)\, P(b).
$$

In both cases, the likelihood term is

$$
\begin{aligned}
P(w_{1:J} \mid x_{1:J}, \psi, b) &= \int \prod_{j=1}^{J} \prod_{i=1}^{n_j} P(w_{ji} \mid x_{ji}, \phi_1, \phi_2 \ldots)\, P(\phi_1, \phi_2 \ldots \mid \psi, b)\, d\phi_1 d\phi_2 \ldots,\\
&= \int \prod_{j=1}^{J} \prod_{i=1}^{n_j} P(w_{ji} \mid \phi_{x_{ji}}) \prod_{k=1}^{\infty} P(\phi_k \mid \psi, b)\, d\phi_1 d\phi_2 \ldots,\\
&= \underbrace{\prod_{k=K+1}^{\infty} \int P(\phi_k \mid \psi, b)\, d\phi_k}_{=1}\ \int \prod_{j=1}^{J} \prod_{i=1}^{n_j} P(w_{ji} \mid \phi_{x_{ji}}) \prod_{k=1}^{K} P(\phi_k \mid \psi, b)\, d\phi_1 d\phi_2 \ldots d\phi_K,\\
&= \prod_{k=1}^{K} \int \frac{\Gamma(b)}{\prod_{v=1}^{V} \Gamma(b\psi_v)} \prod_{v=1}^{V} \phi_{kv}^{S_{kv} + b\psi_v - 1}\, d\phi_k,\\
&= \prod_{k=1}^{K} \frac{\Gamma(b)}{\Gamma(S_{k\cdot} + b)} \prod_{v=1}^{V} \frac{\Gamma(S_{kv} + b\psi_v)}{\Gamma(b\psi_v)},
\end{aligned}
$$
and, as was the case for $P(x_{1:J} \mid a, m)$, this likelihood can be rewritten as
$$
P(w_{1:J} \mid x_{1:J}, \psi, b) = \prod_{k=1}^{K} \left[\frac{1}{\Gamma(S_{k\cdot})} \int_{0}^{1} (\tau^s_k)^{b-1} (1 - \tau^s_k)^{S_{k\cdot}-1}\, d\tau^s_k\right] \prod_{v=1}^{V} \sum_{\sigma^s_{kv}=0}^{S_{kv}} \mathcal{S}(S_{kv}, \sigma^s_{kv}) (b\psi_v)^{\sigma^s_{kv}},
$$
and by treating $\sigma^s$ and $\tau^s$ as auxiliary variables, we obtain the augmented likelihood
$$
P(w_{1:J} \mid x_{1:J}, \sigma^s, \tau^s, \psi, b) = \prod_{k=1}^{K} \frac{1}{\Gamma(S_{k\cdot})} (\tau^s_k)^{b-1} (1 - \tau^s_k)^{S_{k\cdot}-1} \prod_{v=1}^{V} \mathcal{S}(S_{kv}, \sigma^s_{kv}) (b\psi_v)^{\sigma^s_{kv}}.
$$

As the prior for ψ is a symmetric Dirichlet with concentration parameter c, its posterior is the Dirichlet distribution

$$
P(\psi \mid \sigma^s, c, V) \propto \prod_{v=1}^{V} \psi_v^{\sum_k \sigma^s_{kv} + c/V - 1}.
$$
On the other hand, just as in the case of $\sigma^r_{jk}$, the posterior for each $\sigma^s_{kv}$ is
$$
P(\sigma^s_{kv} \mid w, x, \psi, b) = \frac{\mathcal{S}(S_{kv}, \sigma^s_{kv}) (b\psi_v)^{\sigma^s_{kv}}}{\sum_{\sigma'=0}^{S_{kv}} \mathcal{S}(S_{kv}, \sigma') (b\psi_v)^{\sigma'}}.
$$
For the case of $b$, its prior is a Gamma distribution with shape and scale parameters equal to 1. The augmented likelihood treated as a function of $b$ is

$$
P(w_{1:J} \mid x_{1:J}, b, \tau^s, \sigma^s) \propto \prod_{k} (\tau^s_k)^{b} \prod_{v} b^{\sigma^s_{kv}} = e^{b \sum_k \log \tau^s_k}\, b^{\sum_{kv} \sigma^s_{kv}},
$$
and so the posterior is a Gamma distribution with shape and scale
$$
\sum_{kv} \sigma^s_{kv} + 1, \qquad \frac{1}{1 - \sum_k \log \tau^s_k},
$$
respectively. Finally, with a uniform prior on each $\tau^s_k$, and given that the augmented likelihood treated as a function of $\tau^s_k$ is
$$
P(w_{1:J} \mid x_{1:J}, b, \tau^s_k) \propto (\tau^s_k)^{b-1} (1 - \tau^s_k)^{S_{k\cdot}-1},
$$
the posterior for each $\tau^s_k$ is $\mathrm{Beta}(b, S_{k\cdot})$.
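The corresponding update for $\psi$ and $b$ mirrors the one for $m$ and $a$; a minimal sketch, again relying on `sample_num_tables` and with illustrative names, is the following.

```python
import numpy as np

def sample_psi_and_b(S, psi, b, c, rng):
    """Resample psi (Dirichlet) and b (Gamma) given word-topic counts S (K x V).
    Relies on sample_num_tables from the earlier sketch."""
    K, V = S.shape

    # Auxiliary variables: tau^s_k ~ Beta(b, S_k.) (each used topic assumed to have
    # at least one word), and sigma^s_kv via the CRP-table trick with concentration b * psi_v.
    tau = rng.beta(b, S.sum(axis=1))
    sigma = np.zeros((K, V), dtype=int)
    for k in range(K):
        for v in range(V):
            sigma[k, v] = sample_num_tables(S[k, v], b * psi[v], rng)

    # psi | sigma, c: Dirichlet with parameters sum_k sigma^s_kv + c / V.
    psi_new = rng.dirichlet(sigma.sum(axis=0) + c / V)

    # b | sigma, tau: Gamma posterior, shape/scale parameterization as in the text.
    b_new = rng.gamma(sigma.sum() + 1.0, 1.0 / (1.0 - np.log(tau).sum()))

    return psi_new, b_new, sigma
```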

Sampling c

Rather than inferring $c$, the concentration parameter of the prior over $\psi$, directly from the sampled value of $\psi$, we can infer $c$ on the basis of the remaining observed variables, integrating over $\phi$ and $\psi$. The probability of the observed variables, integrating over $\phi$ and $\psi$, is

$$
P(w_{1:J} \mid x_{1:J}, b, c) = \int \underbrace{\left[\int P(w_{1:J} \mid x_{1:J}, \phi)\, P(\phi \mid b, \psi)\, d\phi\right]}_{P(w_{1:J} \mid x_{1:J}, \psi, b)} P(\psi \mid c)\, d\psi.
$$

Using the augmented version of $P(w_{1:J} \mid x_{1:J}, \psi, b)$ from the previous section, i.e. $P(w_{1:J} \mid x_{1:J}, \sigma^s, \tau^s, \psi, b)$, and treating this augmented likelihood as a function of $\psi$, we then have
$$
\begin{aligned}
\int P(w_{1:J} \mid \sigma^s, \psi)\, P(\psi \mid c)\, d\psi &\propto \int \prod_{v=1}^{V} \psi_v^{\sum_{k=1}^{K} \sigma^s_{kv}} \prod_{v=1}^{V} \psi_v^{c/V - 1}\, d\psi,\\
&= \frac{\Gamma(c)}{\Gamma(c + \sigma^s_{\cdot\cdot})} \prod_{v=1}^{V} \frac{\Gamma(\sigma^s_{\cdot v} + c/V)}{\Gamma(c/V)},
\end{aligned}
$$
which is the likelihood for $c$. As before, we can re-write this likelihood as
$$
\left[\frac{1}{\Gamma(\sigma^s_{\cdot\cdot})} \int (\tau^q)^{c-1} (1 - \tau^q)^{\sigma^s_{\cdot\cdot}-1}\, d\tau^q\right] \prod_{v=1}^{V} \sum_{\sigma^q_v=0}^{\sigma^s_{\cdot v}} \mathcal{S}(\sigma^s_{\cdot v}, \sigma^q_v) \left(\frac{c}{V}\right)^{\sigma^q_v},
$$
and treating $\tau^q$ and $\sigma^q$ as auxiliary variables, we have the augmented likelihood
$$
\frac{1}{\Gamma(\sigma^s_{\cdot\cdot})} (\tau^q)^{c-1} (1 - \tau^q)^{\sigma^s_{\cdot\cdot}-1} \prod_{v=1}^{V} \mathcal{S}(\sigma^s_{\cdot v}, \sigma^q_v) \left(\frac{c}{V}\right)^{\sigma^q_v}.
$$
Following the same procedure here as we used for the case of $a$ and $m$, and $b$ and $\psi$, the posterior for each $\sigma^q_v$ is

$$
\frac{\mathcal{S}(\sigma^s_{\cdot v}, \sigma^q_v) \left(\frac{c}{V}\right)^{\sigma^q_v}}{\sum_{\sigma'=0}^{\sigma^s_{\cdot v}} \mathcal{S}(\sigma^s_{\cdot v}, \sigma') \left(\frac{c}{V}\right)^{\sigma'}}.
$$
The posterior for $c$ is the Gamma distribution with shape and scale
$$
\sum_{v=1}^{V} \sigma^q_v + 1, \qquad \frac{1}{1 - \log \tau^q},
$$
respectively, and the posterior for $\tau^q$ is $\mathrm{Beta}(c, \sigma^s_{\cdot\cdot})$.
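A sketch of this update for $c$, reusing the auxiliary counts $\sigma^s$ from the $\psi$ and $b$ step and the `sample_num_tables` helper (names illustrative), is the following.

```python
import numpy as np

def sample_c(sigma_s, c, V, rng):
    """Resample c given the auxiliary counts sigma_s (K x V) from the psi/b step.
    Relies on sample_num_tables from the earlier sketch."""
    col = sigma_s.sum(axis=0)        # sigma^s_{.v}
    total = col.sum()                # sigma^s_{..}

    # tau^q ~ Beta(c, sigma^s_..); sigma^q_v via the CRP-table trick with concentration c / V.
    tau_q = rng.beta(c, total)
    sigma_q = np.array([sample_num_tables(n, c / V, rng) for n in col])

    # c | sigma^q, tau^q: Gamma posterior with the text's shape/scale parameterization.
    return rng.gamma(sigma_q.sum() + 1.0, 1.0 / (1.0 - np.log(tau_q)))
```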

Sampling γ

The posterior distribution for the stick-breaking parameter $\gamma$ is
$$
P(\gamma \mid m) \propto P(m \mid \gamma)\, P(\gamma).
$$

As $m_1, m_2 \ldots m_K, m_u$ is a deterministic function of $\omega_1, \omega_2 \ldots \omega_K$, the likelihood of $m$ given $\gamma$ is equivalent to the likelihood of $\omega$ given $\gamma$, which is a product of Beta distributions

$$
\begin{aligned}
P(\omega \mid \gamma) &= \prod_{k=1}^{K} \mathrm{Beta}(\omega_k \mid 1, \gamma)\\
&= \prod_{k=1}^{K} \frac{\Gamma(1 + \gamma)}{\Gamma(\gamma)} (1 - \omega_k)^{\gamma - 1} \qquad \text{(for any } a, \ \Gamma(1+a)/\Gamma(a) = a\text{)}\\
&\propto \gamma^K e^{\gamma \sum_k \log(1 - \omega_k)}.
\end{aligned}
$$
With a Gamma prior on $\gamma$, whose shape and scale equal 1, the posterior is also a Gamma distribution with shape and scale

$$
K + 1, \qquad \frac{1}{1 - \sum_k \log(1 - \omega_k)},
$$
respectively.
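A minimal sketch of this final update (Python/NumPy), recovering the stick fractions $\omega_k$ deterministically from the current $m$ as noted above, with illustrative names:

```python
import numpy as np

def sample_gamma(m, rng):
    """Resample the stick-breaking parameter gamma from its Gamma posterior,
    given the current base-measure weights m_1 ... m_K (Gamma(1, 1) prior assumed)."""
    # Recover omega_k = m_k / (1 - sum over k' < k of m_k').
    remaining = 1.0 - np.concatenate(([0.0], np.cumsum(m)[:-1]))
    omega = m / remaining
    K = len(m)
    return rng.gamma(K + 1.0, 1.0 / (1.0 - np.log1p(-omega).sum()))
```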

References

Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Connor, R. J., & Mosimann, J. E. (1969). Concepts of independence for proportions with a generalization of the Dirichlet distribution. Journal of the American Statistical Association, 64(325), 194-206.

Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. The Journal of Machine Learning Research, 10, 1801-1828.

Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2004). Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems (Vol. 17).

Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566-1581.

Wong, T.-T. (1998). Generalized Dirichlet distribution in Bayesian analysis. Applied Mathematics and Computation, 97(2), 165-181.