A Gibbs Sampler for a Hierarchical Dirichlet Process Mixture Model

Mark Andrews April 11, 2019

The hierarchical Dirichlet process mixture model (HDPMM), when its mixture components are categorical distributions, is a probabilistic model of multinomial data. It was first described as part of a more general description of Hierarchical Dirichlet Process (HDP) models by Teh, Jordan, Beal, and Blei (2004, 2006), and is the Bayesian nonparametric generalization of the Latent Dirichlet Allocation (LDA) model of Blei, Ng, and Jordan (2003). The aim of this note is to describe in detail a Gibbs sampler for the HDPMM when used with multinomial data. This Gibbs sampler is based on what is described in Teh et al. (2004, 2006) for general HDP models. However, as they did not deal in detail with HDP mixture models for multinomial data, important details of the sampler required for this particular case were not described there. Newman, Asuncion, Smyth, and Welling (2009), on the other hand, do deal explicitly with the HDPMM for multinomial data, and the Gibbs sampler described here is almost identical to theirs. However, for some hyper-parameters, Newman et al. (2009) either make simplifying assumptions or assume that their values are known, assumptions that we do not make here. As such, the sampler described here is a minor extension of that described in Newman et al. (2009).

The probabilistic model

One of the most straightforward applications of the multinomial data HDPMM is as a bag-of-words probabilistic language model, and in what follows we'll describe it with this application in mind. However, modulo some possible changes in notation, this will in fact constitute a general description of the HDPMM for multinomial data. In general, according to a bag-of-words language model, a corpus of natural language is a set of $J$ documents or texts
$$
w_1, w_2 \ldots w_j \ldots w_J,
$$
where text $j$, i.e. $w_j$, is a set of $n_j$ words from a finite vocabulary of $V$ word types. For simplicity, this vocabulary can be represented as the $V$ integers $\{1, 2 \ldots V\}$. From this, we have each $w_j$ defined as
$$
w_j = w_{j1}, w_{j2} \ldots w_{ji} \ldots w_{jn_j},
$$
with each $w_{ji} \in \{1 \ldots V\}$. The bag-of-words assumption is that, for each text, $w_{j1}, w_{j2} \ldots w_{ji} \ldots w_{jn_j}$ are exchangeable random variables, i.e. their joint probability distribution is invariant to any permutation of the indices. By this assumption, therefore, as the name implies, each text is modelled as an unordered set, or bag, of words.

As a generative model of this language corpus, the HDPMM treats each observed word $w_{ji}$ as a sample from one of an underlying set of text or discourse topics
$$
\phi = \phi_1, \phi_2 \ldots \phi_k \ldots,
$$


where each $\phi_k$ is a probability distribution over $\{1 \ldots V\}$. The identity of the particular topic distribution from which $w_{ji}$ is drawn is determined by the value of a discrete latent variable $x_{ji} \in \{1, 2 \ldots k \ldots\}$ that corresponds to $w_{ji}$. As such, for each $w_{ji}$, we model it as
$$
w_{ji} \mid x_{ji}, \phi \sim \mathrm{dcat}(\phi_{x_{ji}}).
$$

To be clear, the HDPMM assumes that there is an unlimited number of topic distributions from which the observed data are drawn, and so each $x_{ji}$ can take infinitely many discrete values. The probability distribution over the infinitely many possible values of each $x_{ji}$ is given by an infinite length array $\pi_j$, i.e. $\pi_j = \pi_{j1}, \pi_{j2} \ldots \pi_{jk} \ldots$, where $0 \leq \pi_{jk} \leq 1$ and $\sum_{k=1}^{\infty} \pi_{jk} = 1$, that is specific to text $j$. In other words,
$$
x_{ji} \mid \pi_j \sim \mathrm{dcat}(\pi_j).
$$

Each $\pi_j$ is assumed to be drawn from a Dirichlet process prior whose base distribution, $m$, is a categorical distribution over the positive integers and whose scalar concentration parameter is $a$:
$$
\pi_j \mid a, m \sim \mathrm{ddp}(a, m).
$$

The $m$ base distribution is assumed to be drawn from a stick-breaking distribution with parameter $\gamma$:
$$
m \mid \gamma \sim \mathrm{dstick}(\gamma).
$$
The prior distributions of the Dirichlet process concentration parameter $a$ and the stick-breaking parameter $\gamma$ are Gamma distributions, both with shape and scale parameters equal to 1. For the topic distributions $\phi_1, \phi_2 \ldots \phi_k \ldots$, we assume they are independently and identically drawn from a Dirichlet distribution with a length $V$ location parameter $\psi$ and concentration parameter $b$. In turn, $\psi$ is drawn from a symmetric Dirichlet distribution with concentration parameter $c$. Finally, both $b$ and $c$, like $a$ and $\gamma$, are given Gamma priors, again with shape and scale parameters equal to 1.
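To make these generative assumptions concrete, the following is a minimal sketch in Python (NumPy) of the generative process. Because the model is nonparametric, the infinite topic set is truncated at a finite `K_max` purely for simulation; the truncation level, the corpus sizes, and all variable names are illustrative assumptions of this sketch rather than part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions of this sketch only).
J, V, K_max = 5, 50, 20                 # documents, vocabulary size, truncation level
n_j = rng.integers(20, 40, size=J)      # words per document
a, gamma, b, c = 1.0, 1.0, 1.0, 1.0     # hyper-parameters (each has a Gamma(1, 1) prior in the text)

# m | gamma ~ dstick(gamma), truncated at K_max and renormalised.
omega = rng.beta(1.0, gamma, size=K_max)
m = omega * np.concatenate(([1.0], np.cumprod(1.0 - omega)[:-1]))
m /= m.sum()

# psi ~ symmetric Dirichlet(c); phi_k | psi, b ~ Dirichlet(b * psi).
psi = rng.dirichlet(np.full(V, c / V))
phi = rng.dirichlet(b * psi, size=K_max)

# pi_j | a, m ~ ddp(a, m); with a finite truncated m this reduces to Dirichlet(a * m).
corpus = []
for j in range(J):
    pi_j = rng.dirichlet(a * m)
    x_j = rng.choice(K_max, size=n_j[j], p=pi_j)              # latent topic of each word
    w_j = np.array([rng.choice(V, p=phi[k]) for k in x_j])    # observed words
    corpus.append(w_j)
```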

Sampling each latent variable $x_{ji}$

The probability that $x_{ji}$ takes the value $k$, for any $k \in 1, 2 \ldots$, is
$$
P(x_{ji} = k \mid w_{ji} = v, x_{\neg ji}, w_{\neg ji}, b, \psi, a, m) \propto P(w_{ji} = v \mid x_{ji} = k, w_{\neg ji}, x_{\neg ji}, b, \psi)\, P(x_{ji} = k \mid x_{\neg ji}, a, m),
$$
where $x_{\neg ji}$ denotes all latent variables excluding $x_{ji}$, with an analogous meaning for $w_{\neg ji}$. Here, the likelihood term is

$$
P(w_{ji} = v \mid x_{ji} = k, x_{\neg ji}, w_{\neg ji}, b, \psi) = \int P(w_{ji} = v \mid \phi_k)\, P(\phi_k \mid x_{\neg ji}, w_{\neg ji}, b, \psi)\, d\phi_k.
$$
This is the expected value of $\phi_{kv}$ according to the Dirichlet posterior
$$
P(\phi_k \mid x_{\neg ji}, w_{\neg ji}, b, \psi) = \frac{\Gamma(S^{\neg ji}_{k\cdot} + b)}{\prod_{v=1}^{V} \Gamma(S^{\neg ji}_{kv} + b\psi_v)} \prod_{v=1}^{V} \phi_{kv}^{S^{\neg ji}_{kv} + b\psi_v - 1},
$$
where $S^{\neg ji}_{kv} \triangleq \sum_{j'i' \neq ji} \mathbb{I}(w_{j'i'} = v, x_{j'i'} = k)$ and $S^{\neg ji}_{k\cdot} = \sum_{v=1}^{V} S^{\neg ji}_{kv}$. As such,
$$
P(w_{ji} = v \mid x_{ji} = k, x_{\neg ji}, w_{\neg ji}, b, \psi) = \frac{S^{\neg ji}_{kv} + b\psi_v}{S^{\neg ji}_{k\cdot} + b}.
$$

The prior term, on the other hand, is

$$
P(x_{ji} = k \mid x_{\neg ji}, a, m) = \int P(x_{ji} = k \mid \pi_j)\, P(\pi_j \mid x_{\neg ji}, a, m)\, d\pi_j,
$$
and this is the expected value of $\pi_{jk}$ according to
$$
P(\pi_j \mid x_{\neg ji}, a, m) \propto P(x_{\neg ji} \mid \pi_j)\, P(\pi_j \mid a, m).
$$

As $P(\pi_j \mid a, m)$ is a Dirichlet process, by the definition of a Dirichlet process we have $P(\pi_{jk} \mid a, m) = \mathrm{Beta}(a m_k, a \sum_{k' \neq k} m_{k'})$, and therefore
$$
P(\pi_{jk} \mid x_{\neg ji}, a, m) = \mathrm{Beta}\Big(R^{\neg ji}_{jk} + a m_k,\ \textstyle\sum_{k' \neq k} \big(R^{\neg ji}_{jk'} + a m_{k'}\big)\Big),
$$
where $R^{\neg ji}_{jk} \triangleq \sum_{i' \neq i}^{n_j} \mathbb{I}(x_{ji'} = k)$. As such, the expected value of $\pi_{jk}$ is
$$
P(x_{ji} = k \mid x_{\neg ji}, a, m) = \frac{R^{\neg ji}_{jk} + a m_k}{R^{\neg ji}_{j\cdot} + a},
$$

where $R^{\neg ji}_{j\cdot} = \sum_{k=1}^{\infty} R^{\neg ji}_{jk}$. Given these likelihood and prior terms, the posterior is simply
$$
\begin{aligned}
P(x_{ji} = k \mid w_{ji} = v, x_{\neg ji}, w_{\neg ji}, b, \psi, a, m) &\propto \frac{S^{\neg ji}_{kv} + b\psi_v}{S^{\neg ji}_{k\cdot} + b} \cdot \frac{R^{\neg ji}_{jk} + a m_k}{R^{\neg ji}_{j\cdot} + a},\\
&\propto \frac{S^{\neg ji}_{kv} + b\psi_v}{S^{\neg ji}_{k\cdot} + b} \times \big(R^{\neg ji}_{jk} + a m_k\big).
\end{aligned}
$$
Note that from this, we also have
$$
P(x_{ji} > K \mid x_{\neg ji}, w, b, \psi, a, m) \propto \sum_{\{k > K\}} \frac{S^{\neg ji}_{kv} + b\psi_v}{S^{\neg ji}_{k\cdot} + b} \times \big(R^{\neg ji}_{jk} + a m_k\big).
$$

Given that for all $k > K$, where $K$ is the maximum value of the set $\{x_{ji} : j \in 1 \ldots J, i \in 1 \ldots n_j\}$, we have $R^{\neg ji}_{jk} = 0$ and $S^{\neg ji}_{kv} = 0$, it follows that
$$
P(x_{ji} > K \mid x_{\neg ji}, w, b, \psi, a, m) \propto \sum_{\{k > K\}} \frac{b\psi_v}{b} \times a m_k = \psi_v \times a m_u,
$$
where $m_u = \sum_{\{k > K\}} m_k$. As a practical matter of sampling, for each latent variable $x_{ji}$ we calculate
$$
f_{jik} = \frac{S^{\neg ji}_{kv} + b\psi_v}{S^{\neg ji}_{k\cdot} + b} \times \big(R^{\neg ji}_{jk} + a m_k\big),
$$
for $k \in 1, 2 \ldots K$, and then

$$
f_{jiu} = \psi_v \times a m_u,
$$
where $K$ and $m_u$ are defined as above and $v = w_{ji}$. Now,

$$
P(x_{ji} \leq K \mid x_{\neg ji}, w, b, \psi, a, m) = \frac{\sum_{k=1}^{K} f_{jik}}{\sum_{k=1}^{K} f_{jik} + f_{jiu}}
$$
and
$$
P(x_{ji} > K \mid x_{\neg ji}, w, b, \psi, a, m) = \frac{f_{jiu}}{\sum_{k=1}^{K} f_{jik} + f_{jiu}},
$$
and so a single random sample will be sufficient to decide whether $x_{ji} \leq K$ or $x_{ji} > K$. If $x_{ji} \leq K$, then
$$
P(x_{ji} = k \mid x_{\neg ji}, w, b, \psi, a, m, x_{ji} \leq K) = \frac{f_{jik}}{\sum_{k=1}^{K} f_{jik}}.
$$
On the other hand, if $x_{ji} > K$, the probability that $x_{ji} = k^{\mathrm{new}}$ for $k^{\mathrm{new}} > K$ is
$$
P(x_{ji} = k^{\mathrm{new}} \mid x_{\neg ji}, w, b, \psi, a, m, x_{ji} > K) = \frac{\psi_v \times a m_{k^{\mathrm{new}}}}{f_{jiu}} = \frac{m_{k^{\mathrm{new}}}}{m_u}.
$$
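As a concrete illustration of this update, here is a minimal Python (NumPy) sketch of a single Gibbs step for one $x_{ji}$. The count arrays `S` and `R` (with the contribution of the current word already decremented), the function name, and its signature are illustrative assumptions about how the sampler's state might be stored.

```python
import numpy as np

def sample_x_ji(v, S, R, b, psi, a, m, m_u, rng):
    """One Gibbs update for x_ji given word w_ji = v.

    S : (K, V) word-topic counts S_kv (with w_ji, x_ji already removed).
    R : (K,)   document-topic counts R_jk for document j (x_ji removed).
    m : (K,)   base-measure weights for the K currently used topics;
    m_u is the leftover mass sum_{k > K} m_k.
    """
    K = S.shape[0]
    # f_jik for the K existing topics.
    f = (S[:, v] + b * psi[v]) / (S.sum(axis=1) + b) * (R + a * m)
    # f_jiu: unnormalised probability of using a topic beyond K.
    f_u = psi[v] * a * m_u
    total = f.sum() + f_u
    # One draw decides between an existing topic (x_ji <= K) and a new one.
    if rng.random() < f.sum() / total:
        return rng.choice(K, p=f / f.sum())
    return K  # signal "new topic"
```

The caller is then responsible for extending the count arrays and $m$ when a new topic is signalled, choosing the specific $k^{\mathrm{new}}$ with probability $m_{k^{\mathrm{new}}}/m_u$, e.g. by further breaking the stick.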

Sampling m and a

The posterior distribution over the infinite length array m is

$$
P(m \mid x_{1:J}, a, \gamma) \propto P(x_{1:J} \mid a, m)\, P(m \mid \gamma),
$$
while the posterior over the scalar parameter $a$ is
$$
P(a \mid x_{1:J}) \propto P(x_{1:J} \mid a, m)\, P(a),
$$
with the priors as stated above. The likelihood term in both cases is

$$
P(x_{1:J} \mid a, m) = \prod_{j=1}^{J} \int \prod_{i=1}^{n_j} P(x_{ji} \mid \pi_j)\, P(\pi_j \mid a, m)\, d\pi_j,
$$
where $\prod_{i=1}^{n_j} P(x_{ji} \mid \pi_j) = \prod_{k=1}^{K} \pi_{jk}^{R_{jk}}$, with $R_{jk} = \sum_{i=1}^{n_j} \mathbb{I}(x_{ji} = k)$, and $K$ is, as stated above, the maximum value attained by any latent variable. The prior $P(\pi_j \mid a, m)$ is a Dirichlet process prior and so, by definition of the Dirichlet process,

$$
P(\pi_{j1}, \pi_{j2} \ldots \pi_{jK}, \pi_{ju} \mid a, m) = \frac{\Gamma(a)}{\Gamma(a m_u) \prod_{k=1}^{K} \Gamma(a m_k)}\, \pi_{ju}^{a m_u - 1} \prod_{k=1}^{K} \pi_{jk}^{a m_k - 1},
$$
where $m_u = \sum_{\{k > K\}} m_k$, as stated above, and $\pi_{ju} = \sum_{\{k > K\}} \pi_{jk}$. Therefore,

$$
\begin{aligned}
P(x_{1:J} \mid a, m) &= \prod_{j=1}^{J} \int \prod_{i=1}^{n_j} P(x_{ji} \mid \pi_j)\, P(\pi_j \mid a, m)\, d\pi_j,\\
&= \prod_{j=1}^{J} \frac{\Gamma(a)}{\Gamma(a m_u) \prod_{k=1}^{K} \Gamma(a m_k)} \int \pi_{ju}^{a m_u - 1} \prod_{k=1}^{K} \pi_{jk}^{R_{jk} + a m_k - 1}\, d\pi_j,\\
&= \prod_{j=1}^{J} \frac{\Gamma(a)}{\Gamma(a m_u) \prod_{k=1}^{K} \Gamma(a m_k)} \cdot \frac{\Gamma(a m_u) \prod_{k=1}^{K} \Gamma(R_{jk} + a m_k)}{\Gamma(R_{j\cdot} + a)},\\
&= \prod_{j=1}^{J} \frac{\Gamma(a)}{\Gamma(R_{j\cdot} + a)} \prod_{k=1}^{K} \frac{\Gamma(R_{jk} + a m_k)}{\Gamma(a m_k)},
\end{aligned}
$$
and this can be re-written as

$$
\prod_{j=1}^{J} \left[\frac{1}{\Gamma(R_{j\cdot})} \int_{0}^{1} (\tau^r_j)^{a-1} (1 - \tau^r_j)^{R_{j\cdot}-1}\, d\tau^r_j\right] \prod_{k=1}^{K} \sum_{\sigma^r_{jk}=0}^{R_{jk}} \mathcal{S}(R_{jk}, \sigma^r_{jk}) (a m_k)^{\sigma^r_{jk}},
$$
given that
$$
\int_{0}^{1} (\tau^r_j)^{a-1} (1 - \tau^r_j)^{R_{j\cdot}-1}\, d\tau^r_j = \frac{\Gamma(a)\Gamma(R_{j\cdot})}{\Gamma(R_{j\cdot} + a)},
$$
and
$$
\sum_{\sigma^r_{jk}=0}^{R_{jk}} \mathcal{S}(R_{jk}, \sigma^r_{jk}) (a m_k)^{\sigma^r_{jk}} = \frac{\Gamma(R_{jk} + a m_k)}{\Gamma(a m_k)},
$$
where $\mathcal{S}$ is an unsigned Stirling number of the first kind. By treating $\tau^r$ and $\sigma^r$ as unobserved auxiliary variables, this leads to the augmented likelihood term

$$
P(x_{1:J} \mid a, m, \tau^r, \sigma^r) = \prod_{j=1}^{J} \frac{1}{\Gamma(R_{j\cdot})} (\tau^r_j)^{a-1} (1 - \tau^r_j)^{R_{j\cdot}-1} \prod_{k=1}^{K} \mathcal{S}(R_{jk}, \sigma^r_{jk}) (a m_k)^{\sigma^r_{jk}}.
$$

With the augmented likelihood treated as a function of $\sigma^r_{jk}$, we have
$$
P(x_{1:J} \mid a, m, \sigma^r) \propto \mathcal{S}(R_{jk}, \sigma^r_{jk}) (a m_k)^{\sigma^r_{jk}},
$$
and with a uniform prior over the values of $\sigma^r_{jk}$, the posterior probability of $\sigma^r_{jk}$ given all other variables is therefore
$$
P(\sigma^r_{jk} \mid x_{1:J}, a, m) = \frac{\mathcal{S}(R_{jk}, \sigma^r_{jk}) (a m_k)^{\sigma^r_{jk}}}{\sum_{\sigma^r_{jk}=0}^{R_{jk}} \mathcal{S}(R_{jk}, \sigma^r_{jk}) (a m_k)^{\sigma^r_{jk}}}.
$$
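In practice this distribution can be sampled without evaluating Stirling numbers explicitly: by Antoniak's result, it is the distribution of the number of occupied tables in a Chinese restaurant process with concentration $a m_k$ serving $R_{jk}$ customers, which is a sum of independent Bernoulli draws. The sketch below uses this equivalence; it is one convenient implementation choice, not the only one, and the function name is illustrative.

```python
import numpy as np

def sample_num_tables(count, alpha, rng):
    """Draw sigma with P(sigma) proportional to S(count, sigma) * alpha**sigma,
    i.e. the number of occupied tables when a CRP with concentration alpha
    seats `count` customers."""
    if count == 0:
        return 0
    # Customer i (0-based) starts a new table with probability alpha / (alpha + i).
    i = np.arange(count)
    return int(rng.binomial(1, alpha / (alpha + i)).sum())

# Example: sigma_r[j, k] = sample_num_tables(R[j, k], a * m[k], rng)
```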

On the other hand, when the augmented likelihood is treated as a function of $\tau^r_j$, we have
$$
P(x_{1:J} \mid a, \tau^r_j) \propto (\tau^r_j)^{a-1} (1 - \tau^r_j)^{R_{j\cdot}-1},
$$
and so, with a uniform prior on $\tau^r_j$, the posterior is
$$
P(\tau^r_j \mid x_{1:J}, a) \propto (\tau^r_j)^{a-1} (1 - \tau^r_j)^{R_{j\cdot}-1}, \quad \text{i.e. } \tau^r_j \mid x_{1:J}, a \sim \mathrm{Beta}(a, R_{j\cdot}).
$$

Similarly, with the augmented likelihood treated as a function of $m$, we have

$$
P(x_{1:J} \mid m, \sigma^r) \propto \prod_{k=1}^{K} m_k^{\sum_{j=1}^{J} \sigma^r_{jk}} = \prod_{k=1}^{K} m_k^{\sigma^r_{\cdot k}}.
$$
The prior on $m$ is a stick-breaking prior, and so the probability distribution over $m_1 \ldots m_k \ldots m_K, m_u$ is given by $m_1 = \omega'_1$, then $m_k = \omega'_k (1 - \sum_{k'=1}^{k-1} m_{k'})$ for $k \in 2 \ldots K$, and finally $m_u = 1 - \sum_{k=1}^{K} m_k$, where $\omega'_1 \ldots \omega'_K$ are independently and identically distributed as $\mathrm{Beta}(1, \gamma)$. This finite stick-breaking distribution is a $K + 1$ dimensional Generalized Dirichlet distribution, see Connor and Mosimann (1969); Wong (1998), whose parameter vectors are a vector of $K$ 1's and a vector of $K$ $\gamma$'s. As described in Wong (1998), the Generalized Dirichlet distribution is a conjugate prior for a multinomial likelihood. As such, the posterior over $m_1, m_2 \ldots m_K, m_u$ is a Generalized Dirichlet distribution with length $K$ parameter vectors $\alpha_1, \alpha_2 \ldots \alpha_K$ and $\beta_1, \beta_2 \ldots \beta_K$, where

$$
\alpha_k = 1 + \sigma^r_{\cdot k}, \qquad \beta_k = \gamma + \sum_{k'=k+1}^{K} \sigma^r_{\cdot k'}, \qquad \text{for } k \in 1, 2 \ldots K.
$$
We can sample from this Generalized Dirichlet distribution by using a finite stick-breaking construction, as was used for the prior, i.e. $m_1 = \omega_1$, $m_k = \omega_k (1 - \sum_{k'=1}^{k-1} m_{k'})$

for $k \in 2 \ldots K$, and $m_u = 1 - \sum_{k=1}^{K} m_k$, where $\omega_k \sim \mathrm{Beta}(\alpha_k, \beta_k)$. Finally, when treated as a function of $a$, the augmented likelihood is
$$
P(x_{1:J} \mid m, \tau^r, \sigma^r) \propto \prod_{j=1}^{J} (\tau^r_j)^{a} \prod_{k=1}^{K} a^{\sigma^r_{jk}} = e^{a \sum_{j=1}^{J} \log \tau^r_j}\, a^{\sum_{jk} \sigma^r_{jk}}.
$$
As stated above, the prior on $a$ is a Gamma distribution whose shape and scale parameters are both equal to 1.0. (Here, we assume the following parameterization of the Gamma distribution: $P(x \mid a, s) = \frac{1}{s^a \Gamma(a)} x^{a-1} e^{-x/s}$, where $a$ and $s$ are the shape and scale parameters, respectively.) Therefore, the posterior is a Gamma distribution with shape and scale
$$
\sum_{jk} \sigma^r_{jk} + 1, \qquad \frac{1}{1 - \sum_{j=1}^{J} \log \tau^r_j},
$$
respectively.
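Putting these pieces together, a sketch of the joint update for $m$ and $a$ might look as follows (Python/NumPy, relying on `sample_num_tables` from the earlier sketch; array and function names are illustrative assumptions about the sampler's state).

```python
import numpy as np

def sample_m_and_a(R, m, gamma, a, rng):
    """Resample (m_1 ... m_K, m_u) and the concentration a, given document-topic
    counts R (J x K) and the current m over the K used topics.
    Relies on sample_num_tables from the earlier sketch."""
    J, K = R.shape

    # Auxiliary variables: tau^r_j ~ Beta(a, R_j.) (documents assumed non-empty),
    # and sigma^r_jk via the CRP-table trick with concentration a * m_k.
    tau = rng.beta(a, R.sum(axis=1))
    sigma = np.zeros((J, K), dtype=int)
    for j in range(J):
        for k in range(K):
            sigma[j, k] = sample_num_tables(R[j, k], a * m[k], rng)

    # m | sigma, gamma: Generalized Dirichlet posterior, sampled by stick-breaking
    # with omega_k ~ Beta(alpha_k, beta_k).
    col = sigma.sum(axis=0)                 # sigma^r_{.k}
    tail = col[::-1].cumsum()[::-1]         # tail[k] = sum over k' >= k of sigma^r_{.k'}
    alpha = 1.0 + col
    beta = gamma + (tail - col)             # beta_k = gamma + sum over k' > k
    omega = rng.beta(alpha, beta)
    stick = np.concatenate(([1.0], np.cumprod(1.0 - omega)))
    m_new, m_u_new = omega * stick[:-1], stick[-1]

    # a | sigma, tau: Gamma posterior with the shape/scale parameterization in the text.
    a_new = rng.gamma(sigma.sum() + 1.0, 1.0 / (1.0 - np.log(tau).sum()))

    return m_new, m_u_new, a_new
```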

Sampling ψ and b

The posterior distribution over ψ is

$$
P(\psi \mid w_{1:J}, x_{1:J}) \propto P(w_{1:J} \mid x_{1:J}, \psi, b)\, P(\psi \mid c),
$$
while the posterior over the concentration parameter $b$ is
$$
P(b \mid w_{1:J}, x_{1:J}) \propto P(w_{1:J} \mid x_{1:J}, \psi, b)\, P(b).
$$

In both cases, the likelihood term is

$$
\begin{aligned}
P(w_{1:J} \mid x_{1:J}, \psi, b) &= \int \prod_{j=1}^{J} \prod_{i=1}^{n_j} P(w_{ji} \mid x_{ji}, \phi_1, \phi_2 \ldots)\, P(\phi_1, \phi_2 \ldots \mid \psi, b)\, d\phi_1 d\phi_2 \ldots,\\
&= \int \prod_{j=1}^{J} \prod_{i=1}^{n_j} P(w_{ji} \mid \phi_{x_{ji}}) \prod_{k=1}^{\infty} P(\phi_k \mid \psi, b)\, d\phi_1 d\phi_2 \ldots,\\
&= \underbrace{\prod_{k=K+1}^{\infty} \int P(\phi_k \mid \psi, b)\, d\phi_k}_{=1}\ \int \prod_{j=1}^{J} \prod_{i=1}^{n_j} P(w_{ji} \mid \phi_{x_{ji}}) \prod_{k=1}^{K} P(\phi_k \mid \psi, b)\, d\phi_1 d\phi_2 \ldots d\phi_K,\\
&= \prod_{k=1}^{K} \int \frac{\Gamma(b)}{\prod_{v=1}^{V} \Gamma(b\psi_v)} \prod_{v=1}^{V} \phi_{kv}^{S_{kv} + b\psi_v - 1}\, d\phi_k,\\
&= \prod_{k=1}^{K} \frac{\Gamma(b)}{\Gamma(S_{k\cdot} + b)} \prod_{v=1}^{V} \frac{\Gamma(S_{kv} + b\psi_v)}{\Gamma(b\psi_v)},
\end{aligned}
$$
and, as was the case for $P(x_{1:J} \mid a, m)$, this likelihood can be rewritten as
$$
P(w_{1:J} \mid x_{1:J}, \psi, b) = \prod_{k=1}^{K} \left[\frac{1}{\Gamma(S_{k\cdot})} \int_{0}^{1} (\tau^s_k)^{b-1} (1 - \tau^s_k)^{S_{k\cdot}-1}\, d\tau^s_k\right] \prod_{v=1}^{V} \sum_{\sigma^s_{kv}=0}^{S_{kv}} \mathcal{S}(S_{kv}, \sigma^s_{kv}) (b\psi_v)^{\sigma^s_{kv}},
$$
and by treating $\sigma^s$ and $\tau^s$ as auxiliary variables, we obtain the augmented likelihood
$$
P(w_{1:J} \mid x_{1:J}, \sigma^s, \tau^s, \psi, b) = \prod_{k=1}^{K} \frac{1}{\Gamma(S_{k\cdot})} (\tau^s_k)^{b-1} (1 - \tau^s_k)^{S_{k\cdot}-1} \prod_{v=1}^{V} \mathcal{S}(S_{kv}, \sigma^s_{kv}) (b\psi_v)^{\sigma^s_{kv}}.
$$

As the prior for ψ is a symmetric Dirichlet with concentration parameter c, its posterior is the Dirichlet distribution

$$
P(\psi \mid \sigma^s, c, V) \propto \prod_{v=1}^{V} \psi_v^{\sum_k \sigma^s_{kv} + c/V - 1}.
$$
On the other hand, just as in the case of $\sigma^r_{jk}$, the posterior for each $\sigma^s_{kv}$ is
$$
P(\sigma^s_{kv} \mid w, x, \psi, b) = \frac{\mathcal{S}(S_{kv}, \sigma^s_{kv}) (b\psi_v)^{\sigma^s_{kv}}}{\sum_{\sigma'=0}^{S_{kv}} \mathcal{S}(S_{kv}, \sigma') (b\psi_v)^{\sigma'}}.
$$
For the case of $b$, its prior is a Gamma distribution with shape and scale parameters equal to 1. The augmented likelihood treated as a function of $b$ is

$$
P(w_{1:J} \mid x_{1:J}, b, \tau^s, \sigma^s) \propto \prod_{k} (\tau^s_k)^{b} \prod_{v} b^{\sigma^s_{kv}} = e^{b \sum_k \log \tau^s_k}\, b^{\sum_{kv} \sigma^s_{kv}},
$$
and so the posterior is a Gamma distribution with shape and scale
$$
\sum_{kv} \sigma^s_{kv} + 1, \qquad \frac{1}{1 - \sum_k \log \tau^s_k},
$$
respectively. Finally, with a uniform prior on each $\tau^s_k$, and given that the augmented likelihood treated as a function of $\tau^s_k$ is
$$
P(w_{1:J} \mid x_{1:J}, b, \tau^s_k) \propto (\tau^s_k)^{b-1} (1 - \tau^s_k)^{S_{k\cdot}-1},
$$
the posterior for each $\tau^s_k$ is $\mathrm{Beta}(b, S_{k\cdot})$.
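The corresponding update for $\psi$ and $b$ mirrors the one for $m$ and $a$; a minimal sketch, again relying on `sample_num_tables` and with illustrative names, is the following.

```python
import numpy as np

def sample_psi_and_b(S, psi, b, c, rng):
    """Resample psi (Dirichlet) and b (Gamma) given word-topic counts S (K x V).
    Relies on sample_num_tables from the earlier sketch."""
    K, V = S.shape

    # Auxiliary variables: tau^s_k ~ Beta(b, S_k.) (each used topic assumed to have
    # at least one word), and sigma^s_kv via the CRP-table trick with concentration b * psi_v.
    tau = rng.beta(b, S.sum(axis=1))
    sigma = np.zeros((K, V), dtype=int)
    for k in range(K):
        for v in range(V):
            sigma[k, v] = sample_num_tables(S[k, v], b * psi[v], rng)

    # psi | sigma, c: Dirichlet with parameters sum_k sigma^s_kv + c / V.
    psi_new = rng.dirichlet(sigma.sum(axis=0) + c / V)

    # b | sigma, tau: Gamma posterior, shape/scale parameterization as in the text.
    b_new = rng.gamma(sigma.sum() + 1.0, 1.0 / (1.0 - np.log(tau).sum()))

    return psi_new, b_new, sigma
```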

Sampling c

Rather than inferring $c$, the concentration parameter of the prior over $\psi$, directly from the sampled value of $\psi$, we can infer $c$ on the basis of the remaining observed variables, integrating over $\phi$ and $\psi$. The probability of the observed variables, integrating over $\phi$ and $\psi$, is

$$
P(w_{1:J} \mid x_{1:J}, b, c) = \int \underbrace{\left[\int P(w_{1:J} \mid x_{1:J}, \phi)\, P(\phi \mid b, \psi)\, d\phi\right]}_{P(w_{1:J} \mid x_{1:J}, \psi, b)} P(\psi \mid c)\, d\psi.
$$

Using the augmented version of $P(w_{1:J} \mid x_{1:J}, \psi, b)$ from the previous section, i.e. $P(w_{1:J} \mid x_{1:J}, \sigma^s, \tau^s, \psi, b)$, and treating this augmented likelihood as a function of $\psi$, we then have
$$
\begin{aligned}
\int P(w_{1:J} \mid \sigma^s, \psi)\, P(\psi \mid c)\, d\psi &\propto \int \prod_{v=1}^{V} \psi_v^{\sum_{k=1}^{K} \sigma^s_{kv}} \prod_{v=1}^{V} \psi_v^{c/V - 1}\, d\psi,\\
&= \frac{\Gamma(c)}{\Gamma(c + \sigma^s_{\cdot\cdot})} \prod_{v=1}^{V} \frac{\Gamma(\sigma^s_{\cdot v} + c/V)}{\Gamma(c/V)},
\end{aligned}
$$
which is the likelihood for $c$. As before, we can re-write this likelihood as
$$
\left[\frac{1}{\Gamma(\sigma^s_{\cdot\cdot})} \int (\tau^q)^{c-1} (1 - \tau^q)^{\sigma^s_{\cdot\cdot}-1}\, d\tau^q\right] \prod_{v=1}^{V} \sum_{\sigma^q_v=0}^{\sigma^s_{\cdot v}} \mathcal{S}(\sigma^s_{\cdot v}, \sigma^q_v) \left(\frac{c}{V}\right)^{\sigma^q_v},
$$
and treating $\tau^q$ and $\sigma^q$ as auxiliary variables, we have the augmented likelihood
$$
\frac{1}{\Gamma(\sigma^s_{\cdot\cdot})} (\tau^q)^{c-1} (1 - \tau^q)^{\sigma^s_{\cdot\cdot}-1} \prod_{v=1}^{V} \mathcal{S}(\sigma^s_{\cdot v}, \sigma^q_v) \left(\frac{c}{V}\right)^{\sigma^q_v}.
$$
Following the same procedure here as we used for the case of $a$ and $m$, and $b$ and $\psi$, the posterior for each $\sigma^q_v$ is

$$
\frac{\mathcal{S}(\sigma^s_{\cdot v}, \sigma^q_v) \left(\frac{c}{V}\right)^{\sigma^q_v}}{\sum_{\sigma'=0}^{\sigma^s_{\cdot v}} \mathcal{S}(\sigma^s_{\cdot v}, \sigma') \left(\frac{c}{V}\right)^{\sigma'}}.
$$
The posterior for $c$ is the Gamma distribution with shape and scale
$$
\sum_{v=1}^{V} \sigma^q_v + 1, \qquad \frac{1}{1 - \log \tau^q},
$$
respectively, and the posterior for $\tau^q$ is $\mathrm{Beta}(c, \sigma^s_{\cdot\cdot})$.
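A sketch of this update for $c$, reusing the auxiliary counts $\sigma^s$ from the $\psi$ and $b$ step and the `sample_num_tables` helper (names illustrative), is the following.

```python
import numpy as np

def sample_c(sigma_s, c, V, rng):
    """Resample c given the auxiliary counts sigma_s (K x V) from the psi/b step.
    Relies on sample_num_tables from the earlier sketch."""
    col = sigma_s.sum(axis=0)        # sigma^s_{.v}
    total = col.sum()                # sigma^s_{..}

    # tau^q ~ Beta(c, sigma^s_..); sigma^q_v via the CRP-table trick with concentration c / V.
    tau_q = rng.beta(c, total)
    sigma_q = np.array([sample_num_tables(n, c / V, rng) for n in col])

    # c | sigma^q, tau^q: Gamma posterior with the text's shape/scale parameterization.
    return rng.gamma(sigma_q.sum() + 1.0, 1.0 / (1.0 - np.log(tau_q)))
```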

Sampling γ

The posterior distribution for the stick-breaking parameter $\gamma$ is
$$
P(\gamma \mid m) \propto P(m \mid \gamma)\, P(\gamma).
$$

As $m_1, m_2 \ldots m_K, m_u$ is a deterministic function of $\omega_1, \omega_2 \ldots \omega_K$, the likelihood of $m$ given $\gamma$ is equivalent to the likelihood of $\omega$ given $\gamma$, which is a product of Beta distributions

$$
\begin{aligned}
P(\omega \mid \gamma) &= \prod_{k=1}^{K} \mathrm{Beta}(\omega_k \mid 1, \gamma)\\
&= \prod_{k=1}^{K} \frac{\Gamma(1 + \gamma)}{\Gamma(\gamma)} (1 - \omega_k)^{\gamma - 1} \qquad \text{(for any } a, \ \Gamma(1+a)/\Gamma(a) = a\text{)}\\
&\propto \gamma^K e^{\gamma \sum_k \log(1 - \omega_k)}.
\end{aligned}
$$
With a Gamma prior on $\gamma$, whose shape and scale equal 1, the posterior is also a Gamma distribution with shape and scale

$$
K + 1, \qquad \frac{1}{1 - \sum_k \log(1 - \omega_k)},
$$
respectively.
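A minimal sketch of this final update (Python/NumPy), recovering the stick fractions $\omega_k$ deterministically from the current $m$ as noted above, with illustrative names:

```python
import numpy as np

def sample_gamma(m, rng):
    """Resample the stick-breaking parameter gamma from its Gamma posterior,
    given the current base-measure weights m_1 ... m_K (Gamma(1, 1) prior assumed)."""
    # Recover omega_k = m_k / (1 - sum over k' < k of m_k').
    remaining = 1.0 - np.concatenate(([0.0], np.cumsum(m)[:-1]))
    omega = m / remaining
    K = len(m)
    return rng.gamma(K + 1.0, 1.0 / (1.0 - np.log1p(-omega).sum()))
```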

References

Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Connor, R. J., & Mosimann, J. E. (1969). Concepts of independence for proportions with a generalization of the Dirichlet distribution. Journal of the American Statistical Association, 64(325), 194-206.

Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed algorithms for topic models. The Journal of Machine Learning Research, 10, 1801-1828.

Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2004). Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems (Vol. 17).

Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566-1581.

Wong, T.-T. (1998). Generalized Dirichlet distribution in Bayesian analysis. Applied Mathematics and Computation, 97(2), 165-181.