Gamma Processes, Stick-Breaking, and Variational Inference

Anirban Roychowdhury
Dept. of Computer Science and Engineering, Ohio State University
[email protected]

Brian Kulis
Dept. of Computer Science and Engineering, Ohio State University
[email protected]

Abstract

While most Bayesian nonparametric models in machine learning have focused on the Dirichlet process, the beta process, or their variants, the gamma process has recently emerged as a useful nonparametric prior in its own right. Current inference schemes for models involving the gamma process are restricted to MCMC-based methods, which limits their scalability. In this paper, we present a variational inference framework for models involving gamma process priors. Our approach is based on a novel stick-breaking constructive definition of the gamma process. We prove correctness of this stick-breaking process by using the characterization of the gamma process as a completely random measure (CRM), and we explicitly derive the rate measure of our construction using Poisson process machinery. We also derive error bounds on the truncation of the infinite process required for variational inference, similar to the truncation analyses for other nonparametric models based on the Dirichlet and beta processes. Our representation is then used to derive a variational inference algorithm for a particular Bayesian nonparametric latent structure formulation known as the infinite Gamma-Poisson model, where the latent variables are drawn from a gamma process prior with Poisson likelihoods. Finally, we present results for our algorithm on non-negative matrix factorization tasks on document corpora, and show that we compare favorably to both sampling-based techniques and variational approaches based on beta-Bernoulli priors, as well as a direct DP-based construction of the gamma process.

1 Introduction

The gamma process is a versatile pure-jump Lévy process with widespread applications in various fields of science. Of late it is emerging as an increasingly popular prior in the Bayesian nonparametric literature within the machine learning community; it has recently been applied to exchangeable models of sparse graphs Caron and Fox (2013) as well as to nonparametric ranking models Caron et al. (2013). It has also been used as a prior for infinite-dimensional latent indicator matrices Titsias (2008). This latter application is one of the earliest Bayesian nonparametric approaches to count modeling, and as such can be thought of as an extension of the venerable Indian Buffet Process to modeling latent structures where each feature can occur multiple times for a datapoint, instead of being simply binary.

The flexibility of gamma process models allows them to be applied in a wide variety of Bayesian nonparametric settings, but their relative complexity makes principled inference nontrivial. In particular, most direct applications of the gamma process in the Bayesian nonparametric literature use Markov chain Monte Carlo samplers (typically Gibbs sampling) for posterior inference, which often suffer from poor scalability. For other Bayesian nonparametric models—in particular those involving the Dirichlet process or beta process—a successful thread of research has considered variational alternatives to standard sampling methods Blei and Jordan (2003); Teh et al. (2007a); Wang et al. (2011). One first derives an explicit construction of the underlying "weights" of the atomic measure component of the random measures underlying the infinite priors; so-called "stick-breaking" processes for the Dirichlet and beta processes yield such a construction. These weights are then truncated and integrated into a mean-field variational inference algorithm.

For instance, stick-breaking was derived for the Dirichlet process in the seminal paper by Sethuraman (1994), which was in turn used for variational inference in Dirichlet process models Blei and Jordan (2003). Similar stick-breaking representations for a special case of the Indian Buffet Process Teh et al. (2007b) and the beta process Paisley et al. (2010) have been constructed, and have naturally led to mean-field variational inference algorithms for nonparametric models involving these priors Doshi-Velez et al. (2009); Paisley et al. (2011). Such variational inference algorithms have been shown to be more scalable than the sampling-based inference techniques normally used; moreover, they work with the full model posterior without marginalizing out any variables.

In this paper we propose a variational inference framework for gamma process priors using a novel stick-breaking construction of the process. We use the characterization of the gamma process as a completely random measure (CRM), which allows us to leverage Poisson process properties to arrive at a simple derivation of the rate measure of our stick-breaking construction, and show that it is indeed equal to the Lévy measure of the gamma process. We also use the Poisson process formulation to derive a bound on the error of the truncated version compared to the full process, analogous to the bounds derived for the Dirichlet process Ishwaran and James (2001), the Indian Buffet Process Doshi-Velez et al. (2009) and the beta process Paisley et al. (2011). We then, as a particular example, focus on the infinite Gamma-Poisson model of Titsias (2008) (note that variational inference need not be limited to this model). This model is a prior on infinitely wide latent indicator matrices with non-negative integer-valued entries; each column has an associated parameter independently drawn from a gamma distribution, and the matrix values are independently drawn from Poisson distributions with these parameters as means. We develop a mean-field variational technique using a truncated version of our stick-breaking construction, and a sampling algorithm that uses Monte Carlo integration for parameter marginalization, similar to Paisley et al. (2010), as a baseline inference algorithm for comparison. We also derive a variational algorithm based on the naïve construction of the gamma process. Finally, we compare these with variational algorithms based on beta-Bernoulli priors on a non-negative matrix factorization task involving the Psychological Review, NIPS, KOS and New York Times document corpora, and show that the variational algorithm based on our construction performs and scales better than all the others.

Related Work. To our knowledge this is the first explicit "stick-breaking"-like construction of the gamma CRM, apart from the naïve approach of denormalizing the construction of the DP with a suitable gamma random variable Miller (2011); Gopalan et al. (2014); moreover, as mentioned above, we develop a variational inference algorithm using the naïve construction (see Section 5.1) and show that it performs worse than our main algorithm on both synthetic and real datasets. The very general inverse Lévy measure algorithm of Wolpert and Ickstadt (1998) requires inversion of the exponential integral, as does the generalized CRM construction technique of Orbanz and Williamson (2012) when applied to the gamma process; since a closed-form solution of the inverse of the exponential integral is not known, these techniques do not give us an analytic construction of the weights, and hence cannot be adapted to variational techniques in a straightforward manner. Other constructive definitions of the gamma process include Thibaux (2008), who discusses a sampling-based scheme for the weights of a gamma process by sampling from a Poisson process. As an alternative to gamma process-based models for count modeling, recent research has examined the negative binomial-beta process and its variants Zhou and Carin (2012); Zhou et al. (2012); Broderick et al. (2014); the stick-breaking construction of Paisley et al. (2010) readily extends to such models since they have beta process priors. The beta stick-breaking construction has also been used for variational inference in beta-Bernoulli priors Paisley et al. (2011), though such models have scalability issues when applied to the count modeling problems addressed in this work, as we show in the experimental section.

2 Background

2.1 Completely random measures

A completely random measure Kingman (1967); Jordan (2010) G on a space (Ω, F) is defined as a random measure on F such that for any two disjoint Borel subsets A1 and A2 in F, the random variables G(A1) and G(A2) are independent. The canonical way of constructing a completely random measure G is to first take a σ-finite product measure H on Ω ⊗ R+, then draw a countable set of points {(ω_k, p_k)} from a Poisson process on a Borel σ-algebra on Ω ⊗ R+ with H as the rate measure. The CRM is then constructed as $G = \sum_{k=0}^{\infty} p_k \delta_{\omega_k}$, where the measure given to a measurable Borel set B ⊂ Ω is $G(B) = \sum_{k:\,\omega_k \in B} p_k$. In this notation the p_k are referred to as weights and the ω_k as atoms.

If the rate measure is defined on Ω ⊗ [0, 1] as $H(d\omega, dp) = c\,p^{-1}(1-p)^{c-1} B_0(d\omega)\,dp$, where B_0 is an arbitrary finite continuous measure on Ω and c is some constant (or function of ω), then the corresponding CRM constructed as above is known as a beta process. If the rate measure is defined as $H(d\omega, dp) = c\,p^{-1}e^{-cp} G_0(d\omega)\,dp$, with the same restrictions on c and G_0, then the corresponding CRM constructed as above is known as the gamma process.
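As a quick numerical illustration (not from the paper, with arbitrary illustrative values), the sketch below checks that integrating p against the gamma-process rate density reproduces the mean of the total mass discussed just below, i.e. the mean of a Gamma(cG_0(Ω), c) random variable:

```python
import numpy as np
from scipy import integrate

# Gamma-process rate measure: H(dw, dp) = c * p^{-1} * exp(-c p) * G0(dw) dp.
# Campbell's theorem gives E[G(Omega)] = G0(Omega) * int_0^inf p * c p^{-1} e^{-c p} dp;
# the integrand simplifies to c e^{-c p}, so the integral is 1 and E[G(Omega)] = G0(Omega).
c, base_mass = 2.0, 3.0                       # illustrative values only

integral, _ = integrate.quad(lambda p: c * np.exp(-c * p), 0.0, np.inf)
print(base_mass * integral)                   # ~3.0
print(c * base_mass / c)                      # mean of Gamma(shape=c*G0(Omega), rate=c): also 3.0
```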


The total mass of the gamma process G, G(Ω), is distributed as Gamma(cG_0(Ω), c). The improper distributions in these rate measures integrate to infinity over their respective domains, ensuring a countably infinite set of points in a draw from the Poisson process. For the beta process the weights p_k are in [0, 1], whereas for the gamma process they are in [0, ∞). In both cases, however, the sum of the weights is finite, as can be seen from Campbell's theorem Kingman (1967), and is governed by c and the total mass of the base measure on Ω. For completeness we note that completely random measures as defined in Kingman (1967) have three components: a set of fixed atoms, a deterministic measure (usually assumed absent), and a random discrete measure. It is this third component that is explicitly generated using a Poisson process, though the fixed component can be readily incorporated into this construction Kingman (1993).

If we create an atomic measure by normalizing the weights {p_k} from the gamma process, i.e. $D = \sum_{k=0}^{\infty} \pi_k \delta_{\omega_k}$ where $\pi_k = p_k / \sum_{i=0}^{\infty} p_i$, then D is known as a Dirichlet process Ferguson (1973), denoted as D ∼ DP(α_0, H_0) where α_0 = G_0(Ω) and H_0 = G_0/α_0. It is not a CRM, as the random variables induced on disjoint sets lack independence because of the normalization; it belongs to the class of normalized random measures with independent increments (NRMIs).

2.2 Stick-breaking for the Dirichlet and Beta Processes

A recursive way to generate the weights of random measures is given by stick-breaking, where a unit interval is subdivided into fragments based on draws from suitably chosen distributions. For example, the stick-breaking construction of the Dirichlet process Sethuraman (1994) is given by

$$D = \sum_{i=1}^{\infty} V_i \prod_{j=1}^{i-1} (1 - V_j)\, \delta_{\omega_i},$$

where the V_i are i.i.d. Beta(1, α) and the ω_i are i.i.d. draws from H_0. Here the length of the first break from a unit-length stick is given by V_1. In the next round, a fraction V_2 of the remaining stick of length 1 − V_1 is broken off, and we are left with a piece of length (1 − V_2)(1 − V_1). The length of the piece broken off in the next round is therefore given by V_3(1 − V_2)(1 − V_1), and so on. Note that the weights belong to (0, 1), and since this is a normalized measure, the weights sum to 1 almost surely. This is consistent with the use of the Dirichlet process as a prior on probability distributions.
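To make the recursion concrete, here is a minimal sketch (not from the paper) of the Sethuraman construction truncated to a finite number of sticks; the atom locations ω_i drawn from H_0 are omitted since only the weights matter here:

```python
import numpy as np

def dp_stick_weights(alpha, num_sticks, seed=0):
    """Truncated Sethuraman stick-breaking: pi_i = V_i * prod_{j<i} (1 - V_j)."""
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=num_sticks)                      # V_i ~ Beta(1, alpha)
    leftover = np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))   # prod_{j<i} (1 - V_j)
    return v * leftover                                            # sums to ~1 for large truncations

print(dp_stick_weights(alpha=2.0, num_sticks=1000).sum())          # close to 1
```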


This construction was generalized in Paisley et al. (2010) to yield stick-breaking for the beta process:

$$B = \sum_{i=1}^{\infty}\sum_{j=1}^{C_i} V^{(i)}_{ij}\prod_{l=1}^{i-1}\left(1 - V^{(l)}_{ij}\right)\delta_{\omega_{ij}}, \qquad (1)$$

where the $V^{(i)}_{ij}$ are i.i.d. Beta(1, α), the C_i are i.i.d. Poisson(γ), and the ω_ij are i.i.d. draws from (1/γ)B_0. We use this representation as the basis for our stick-breaking-like construction of the gamma CRM, and use Poisson process-based proof techniques similar to Paisley et al. (2012) to derive the rate measure.

3 The Stick-breaking Construction of the Gamma Process

3.1 Constructions and proof of correctness

We propose a simple recursive construction of the gamma process CRM, based on the stick-breaking construction for the beta process proposed in Paisley et al. (2010, 2012). In particular, we augment (or 'mark') a slightly modified stick-breaking beta process with an independent gamma-distributed random measure and show that the resultant Poisson process has the rate measure $H(d\omega, dp) = c\,p^{-1}e^{-cp}G_0(d\omega)\,dp$ as defined above. We show this by directly deriving the rate measure of the marked Poisson process using product distribution formulae. Our proposed stick-breaking construction is as follows:

$$G = \sum_{i=1}^{\infty}\sum_{j=1}^{C_i} G^{(i)}_{ij} V^{(i)}_{ij}\prod_{l=1}^{i}\left(1 - V^{(l)}_{ij}\right)\delta_{\omega_{ij}}, \qquad (2)$$

where the $G^{(i)}_{ij}$ are i.i.d. Gamma(α + 1, c), the $V^{(l)}_{ij}$ are i.i.d. Beta(1, α), the C_i are i.i.d. Poisson(γ), and the ω_ij are i.i.d. draws from (1/γ)H_0. As with the beta process stick-breaking construction, the product of beta random variables allows us to interpret each j as corresponding to a stick that is being broken into an infinite number of pieces. Note that the expected weight on an atom in round i is $\alpha^i / (c(1+\alpha)^i)$. The parameter c can therefore be used to control the weight decay cadence along with α.

The above representation provides the clearest view of the construction, but is somewhat cumbersome to deal with in practice, mostly due to the introduction of the additional gamma random variable. We reduce the number of random variables by noting that the product of a Beta(1, α) and a Gamma(α + 1, c) random variable has an Exp(c) distribution; we also perform a change of variables on the product of the (1 − V_ij)s to arrive at the following equivalent construction, for which we now prove its correctness:

Theorem 1. A gamma CRM with positive concentration parameters α and c and finite base measure H_0 may be constructed as

$$G = \sum_{i=1}^{\infty}\sum_{j=1}^{C_i} E_{ij}\, e^{-T_{ij}}\, \delta_{\omega_{ij}}, \qquad (3)$$

where the E_ij are i.i.d. Exp(c), T_ij ∼ Gamma(i, α) independently across (i, j), the C_i are i.i.d. Poisson(γ), and the ω_ij are i.i.d. draws from (1/γ)H_0.

Proof. Note that, by construction, in each round i in (3), each set of weighted atoms $\{(\omega_{ij}, E_{ij}e^{-T_{ij}})\}_{j=1}^{C_i}$ forms a Poisson process since the C_i are drawn from a Poisson(γ) distribution. In particular, each of these sets is a marked Poisson process Kingman (1993), where the atoms ω_ij of the Poisson process on Ω are marked with the random variables $E_{ij}e^{-T_{ij}}$, which have a probability measure on (0, ∞). The superposition theorem of Kingman (1993) tells us that the countable union of Poisson processes is itself a Poisson process on the same measure space; therefore, denoting $G_i = \sum_{j=1}^{C_i} E_{ij}e^{-T_{ij}}\delta_{\omega_{ij}}$, we can say $G = \bigcup_{i=1}^{\infty} G_i$ is a Poisson process on Ω × [0, ∞). We show below that the rate measure of this process equals that of the gamma CRM.

Now, we note that the random variable $E_{ij}e^{-T_{ij}}$ has a probability measure on [0, ∞); denote this by q_ij. We are going to mark the underlying Poisson process with this measure. The density corresponding to this measure can be readily derived using product distribution formulae. To that end, ignoring indices, if we denote W = exp(−T), then we can derive its distribution by a change of variable. Then, denoting Q = E × W where E ∼ Exp(c), we can use the product distribution formula to write the density of Q as

$$f_Q(q) = \int_0^1 \frac{\alpha^i}{\Gamma(i)}\,(-\log w)^{i-1}\, w^{\alpha-2}\, c\, e^{-\frac{cq}{w}}\, dw,$$

where T ∼ Gamma(i, α). Formally speaking, this is the Radon-Nikodym density corresponding to the measure q, since it is absolutely continuous with respect to the Lebesgue measure on [0, ∞) and σ-finite by virtue of being a probability measure. Furthermore, these conditions hold for all the measures that we have in our union of marked Poisson processes; this allows us to write the density of the combined measure as

$$f(p) = \sum_{i=1}^{\infty}\int_0^1 \frac{\alpha^i}{\Gamma(i)}\,(-\log w)^{i-1}\, w^{\alpha-2}\, c\, e^{-\frac{cp}{w}}\, dw
     = \int_0^1 \sum_{i=1}^{\infty}\frac{\alpha^i}{\Gamma(i)}\,(-\log w)^{i-1}\, w^{\alpha-2}\, c\, e^{-\frac{cp}{w}}\, dw
     = \int_0^1 \alpha\, w^{-2}\, c\, e^{-\frac{cp}{w}}\, dw
     = \alpha\, p^{-1} e^{-cp}
     = \frac{\alpha}{c}\, c\, p^{-1} e^{-cp},$$

where we have used monotone convergence to move the sum inside the integral and recognize the Taylor expansion of exp(−α log w). Note that the measure defined on B([0, ∞)) by the "improper" gamma distribution $p^{-1}e^{-cp}$ is σ-finite, in the sense that we can decompose [0, ∞) into the countable union of disjoint intervals [1/k, 1/(k − 1)), k = 1, 2, ..., ∞, each of which has finite measure. In particular, the measure of the interval [1, ∞) is given by the exponential integral.

Therefore the rate measure of the process G as constructed here is $G(d\omega, dp) = c\,p^{-1}e^{-cp}G_0(d\omega)\,dp$, where G_0 is the same as H_0 up to the multiplicative constant α/c, and therefore satisfies the finiteness assumption imposed on H_0.

We use the form specified in the theorem above in our variational inference algorithm, since the variational distributions on almost all the parameters and variables in this construction lend themselves to simple closed-form exponential family updates. As an aside, we note that the random variables (1 − V_ij) have a Beta(α, 1) distribution; therefore, if we denote U_ij = 1 − V_ij, then the construction in (2) is equivalent to

$$G = \sum_{i=1}^{\infty}\sum_{j=1}^{C_i} E_{ij}\prod_{l=1}^{i} U^{(l)}_{ij}\, \delta_{\omega_{ij}},$$

where the E_ij are i.i.d. Exp(c), the $U^{(l)}_{ij}$ are i.i.d. Beta(α, 1), the C_i are i.i.d. Poisson(γ), and the ω_ij are i.i.d. draws from (1/γ)H_0. This notation therefore relates our construction to the stick-breaking construction of the Indian Buffet Process Teh et al. (2007b), where the Bernoulli parameters π_k are generated as products of i.i.d. Beta(α, 1) random variables: $\pi_1 = \nu_1$, $\pi_k = \prod_{i=1}^{k}\nu_i$ where the ν_i are i.i.d. Beta(α, 1). In particular, we can view our construction as a generalization of the IBP stick-breaking, where the stick-breaking weights are multiplied with independent Exp(c) random variables, with the summation over j providing an explicit Poissonization.
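The construction in Theorem 1 is straightforward to simulate once the outer sum is truncated to a finite number of rounds. The following minimal sketch (not the authors' code; parameter values are arbitrary) draws the atom weights E_ij e^{−T_ij} round by round:

```python
import numpy as np

def gamma_process_weights(alpha, c, gamma_mass, num_rounds, seed=0):
    """Atom weights from the stick-breaking construction of Theorem 1, truncated to
    a finite number of rounds: in round i, draw C_i ~ Poisson(gamma), and give each
    of the C_i atoms the weight E * exp(-T) with E ~ Exp(c) and T ~ Gamma(i, alpha).
    Atom locations omega_ij ~ (1/gamma) H0 are omitted; only the weights are returned."""
    rng = np.random.default_rng(seed)
    weights = []
    for i in range(1, num_rounds + 1):
        c_i = rng.poisson(gamma_mass)                         # number of atoms in round i
        e = rng.exponential(1.0 / c, size=c_i)                # E_ij ~ Exp(c)  (rate c)
        t = rng.gamma(i, 1.0 / alpha, size=c_i)               # T_ij ~ Gamma(i, alpha)  (rate alpha)
        weights.append(e * np.exp(-t))
    return np.concatenate(weights)

w = gamma_process_weights(alpha=1.0, c=1.0, gamma_mass=3.0, num_rounds=50)
print(w.size, w.sum())   # weights decay geometrically across rounds; their sum is finite a.s.
```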

3.2 Truncation analysis

The variational algorithm requires a truncation level for the number of atoms for tractability. Therefore we need to analyze the closeness between the marginal distributions of the data drawn from the full prior and the truncated prior, with the stick-breaking prior weights integrated out. Our construction leads to a simpler truncation analysis if we truncate the number of rounds (indexed by i in the outer sum), which automatically truncates the atoms to a finite number. For this analysis, we will use the stick-breaking gamma process as the base measure of a Poisson likelihood process, which we denote by PP; this is precisely the model for which we develop variational inference in the next section. If we denote the gamma process as $G = \sum_{k=0}^{\infty} g_k \delta_{\omega_k}$, with g_k the recursively constructed weights, then PP can be written as $PP = \sum_{k=0}^{\infty} p_k \delta_{\omega_k}$ where $p_k \sim \mathrm{Poisson}(g_k)$. Under this model, we can obtain the following result, which is analogous to error bounds derived for other nonparametric models Ishwaran and James (2001); Doshi-Velez et al. (2009); Paisley et al. (2011) in the literature.

Theorem 2. Let N samples X = (X_1, ..., X_N) be drawn from PP(G). If G ∼ ΓP(c, G_0), the full gamma process, then denote the marginal density of X as m_∞(X). If G is a gamma process truncated after R rounds, denote the marginal density of X as m_R(X). Then

$$\frac{1}{4}\int \left|m_\infty(X) - m_R(X)\right|\, dX \;\le\; 1 - \exp\left\{-N\gamma\,\frac{\alpha}{c}\left(\frac{\alpha}{1+\alpha}\right)^{R}\right\}.$$

Proof. The starting intuition is that if we truncate the process after R rounds, then the error in the marginal distribution of the data will depend on the probability of positive indicator values appearing for atoms after the Rth round in the infinite version. Combining this with ideas analogous to those in Ishwaran and James (2000) and Ishwaran and James (2001), we get the following bound for the difference between the marginal distributions:

$$\frac{1}{4}\int \left|m_\infty(X) - m_R(X)\right|\, dX \;\le\; \mathbb{P}\left\{\exists\,(k,j),\; k > \sum_{r=1}^{R} C_r,\; 1 \le n \le N \;\text{s.t.}\; X_n(\omega_{kj}) > 0\right\}.$$

Since we have a Poisson likelihood on the underlying gamma process, this probability can be written as

$$\mathbb{P}(\cdot) = 1 - \mathbb{E}\left[\,\mathbb{E}\left[\left(\prod_{r=R+1}^{\infty}\prod_{j=1}^{C_r} e^{-\pi_{rj}}\right)^{\!N} \,\middle|\, C_r\right]\right],$$

where $\pi_{rj} = G^{(r)}_{rj} V^{(r)}_{rj} \prod_{l=1}^{r}\bigl(1 - V^{(l)}_{rj}\bigr)$. We may then use Jensen's inequality to bound it as follows:

$$\mathbb{P}(\cdot) \;\le\; 1 - \exp\left\{\sum_{r=R+1}^{\infty}\mathbb{E}\left[\sum_{j=1}^{C_r} N\,\log\!\left(e^{-\pi_{rj}}\right)\right]\right\}
= 1 - \exp\left\{-\frac{N\gamma}{c}\sum_{r=R+1}^{\infty}\left(\frac{\alpha}{1+\alpha}\right)^{r}\right\}
= 1 - \exp\left\{-N\gamma\,\frac{\alpha}{c}\left(\frac{\alpha}{1+\alpha}\right)^{R}\right\},$$

which is the bound in the theorem statement and completes the proof.
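As a rough illustration of how the bound behaves (values chosen arbitrarily, not taken from the paper's experiments), the right-hand side of Theorem 2 can be evaluated for a few truncation levels; it decays geometrically in R at rate α/(1 + α):

```python
import numpy as np

def truncation_bound(R, N, alpha, c, gamma_mass):
    """Right-hand side of Theorem 2: 1 - exp(-N * gamma * (alpha/c) * (alpha/(1+alpha))**R)."""
    return 1.0 - np.exp(-N * gamma_mass * (alpha / c) * (alpha / (1.0 + alpha)) ** R)

# Illustrative values: alpha = c = 1, gamma = 3, N = 1000 documents.
for R in (10, 20, 30, 40):
    print(R, truncation_bound(R, N=1000, alpha=1.0, c=1.0, gamma_mass=3.0))
```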


4 Variational Inference

As discussed in Section 3.2, we will focus on the infinite Gamma-Poisson model, where a gamma process prior is used in conjunction with a Poisson likelihood function. When integrating out the weights of the gamma process, this process is known to yield a nonparametric prior for sparse, infinite count matrices Titsias (2008). We note that our approach should easily be applicable to other models involving gamma process priors.

4.1 The Model

To effectively perform variational inference, we rewrite G as a single sum of weighted atoms, using indicator variables {d_k} for the rounds in which the atoms occur, similar to Paisley et al. (2010):

$$G = \sum_{k=1}^{\infty} E_k\, e^{-T_k}\, \delta_{\omega_k}, \qquad (4)$$

where the E_k are i.i.d. Exp(c), T_k ∼ Gamma(d_k, α) independently, $\sum_{k=1}^{\infty}\mathbf{1}(d_k = r)$ is i.i.d. Poisson(γ) across rounds r, and the ω_k are i.i.d. draws from (1/γ)H_0. We also place gamma priors on α, γ and c: α ∼ Gamma(a_1, a_2), γ ∼ Gamma(b_1, b_2), c ∼ Gamma(c_1, c_2). Denoting the data, the latent prior variables and the model hyperparameters by D, Π and Λ respectively, the full likelihood may be written as P(D, Π|Λ) = P(D, Π_{−G}|Π_G, Λ) · P(Π_G|Λ), where

$$P(\Pi_G|\Lambda) = P(\alpha)\, P(\gamma)\, P(c)\, P(d|\gamma) \prod_{k=1}^{K} P(E_k|c)\, P(T_k|d_k, \alpha) \prod_{n=1}^{N} P(z_{nk}|E_k, T_k).$$

We truncate the infinite gamma process to K atoms, and take N to be the total number of datapoints. Π_{−G} denotes the set of latent variables excluding those from the Poisson-Gamma prior; for instance, in factor analysis for topic models, this contains the Dirichlet-distributed factor variables (or topics).

From the Poisson likelihood, we have z_nk|E_k, T_k ∼ Poisson(E_k e^{−T_k}), independently for each n. The distributions of T_k and d involve indicator functions on the round indicator variables d_k:

$$P(T_k \mid d_k, \alpha) = \prod_{r \ge 1}\left[\frac{\alpha^{\nu_k(0)}}{\Gamma(r)}\, T_k^{\nu_k(1)}\, e^{-\alpha T_k}\right]^{\mathbf{1}(d_k = r)},$$

where $\nu_k(s) = \sum_{r \ge 1} (r - s)\,\mathbf{1}(d_k = r)$. We use the same weighting factors in our distribution on d as Paisley et al. (2011). See Paisley et al. (2011) for a discussion of how to approximate these factors in the variational algorithm.

4.2 The Variational Prior Distribution

Mean-field variational inference involves minimizing the KL divergence between the model posterior and a suitably constructed variational distribution, which is used as a more tractable alternative to the actual posterior distribution. To that end, we propose a fully-factorized variational distribution on the Poisson-Gamma prior as follows:

$$Q = q(\alpha)\, q(\gamma)\, q(c) \prod_{k=1}^{K} q(E_k)\, q(T_k)\, q(d_k) \prod_{n=1}^{N} q(z_{nk}),$$

where $q(E_k) \sim \mathrm{Gamma}(\acute{\xi}_k, \acute{\epsilon}_k)$, $q(T_k) \sim \mathrm{Gamma}(\acute{u}_k, \acute{\upsilon}_k)$, $q(\alpha) \sim \mathrm{Gamma}(\kappa_1, \kappa_2)$, $q(\gamma) \sim \mathrm{Gamma}(\tau_1, \tau_2)$, $q(c) \sim \mathrm{Gamma}(\rho_1, \rho_2)$, $q(z_{nk}) \sim \mathrm{Poisson}(\lambda_{nk})$, $q(d_k) \sim \mathrm{Mult}(\varphi_k)$.

Instead of working with the actual KL divergence between the full posterior and the factorized proxy distribution, variational inference maximizes what is canonically known as the evidence lower bound (ELBO), a function that is the same as the KL divergence up to a constant. In our case it may be written as $\mathcal{L} = \mathbb{E}_Q \log P(D, \Pi|\Lambda) - \mathbb{E}_Q \log Q$. We omit the full representation here for brevity.

4.3 The Variational Parameter Updates

Since we are using exponential family variational distributions, we leverage the closed-form variational updates for exponential families wherever we can, and perform gradient ascent on the ELBO for the parameters of those distributions which do not have closed-form updates. We list the updates on the distributions of the prior below. The closed-form updates for the hyperparameters in q(E_k), q(α), q(c) and q(γ) are as follows:

$$\acute{\xi}_k = \sum_{n=1}^{N}\mathbb{E}_Q(z_{nk}) + 1, \qquad \acute{\epsilon}_k = \mathbb{E}_Q(c) + N \cdot \mathbb{E}_Q\!\left(e^{-T_k}\right),$$
$$\kappa_1 = \sum_{k=1}^{K}\sum_{r \ge 1} r\,\varphi_k(r) + a_1, \qquad \kappa_2 = \sum_{k=1}^{K}\mathbb{E}_Q(T_k) + a_2,$$
$$\rho_1 = c_1 + K, \qquad \rho_2 = \sum_{k=1}^{K}\mathbb{E}_Q(E_k) + c_2,$$
$$\tau_1 = b_1 + K, \qquad \tau_2 = \sum_{r \ge 1}\left\{1 - \prod_{k=1}^{K}\sum_{\acute{r}=1}^{r-1}\varphi_k(\acute{r})\right\} + b_2.$$

The updates for the multinomial probabilities in q(d_k) are given by:

$$\varphi_k(r) \propto \exp\Big\{ r\,\mathbb{E}_Q(\log\alpha) - \log\Gamma(r) + (r-1)\,\mathbb{E}_Q(\log T_k) - \zeta \sum_{i \ne k}\varphi_i(r) - \mathbb{E}_Q(\gamma)\sum_{j=2}^{r}\prod_{k' \ne k}\sum_{r'=1}^{j-1}\varphi_{k'}(r')\Big\}.$$

The variational distribution q(T_k) does not lend itself to closed-form analytical updates, so we perform gradient ascent on the evidence lower bound. The variational updates for q(z_nk) and for the variational distributions on the latent variables in Π_{−G} are model dependent, and require some approximations for the factor analysis case. See Roychowdhury and Kulis (2014) for details.

5 Other Algorithms

Here we briefly describe the two primary competing algorithms we developed based on constructions of the Gamma process: a variational inference algorithm from the naïve construction, and a Markov chain Monte Carlo sampler based on our construction.

5.1 Naïve Variational Inference

We derive a variational inference algorithm from a simpler construction of the Gamma process, where we multiply the stick-breaking construction of the Dirichlet process by a Gamma random variable. The construction can be written as:

$$G = G_0 \sum_{i=1}^{\infty} V_i \prod_{j=1}^{i-1}(1 - V_j)\,\delta_{\omega_i},$$

where G_0 ∼ Gamma(α, c), the V_i are i.i.d. Beta(1, α), and the ω_i are i.i.d. draws from H_0. We use an equivalent form of the construction that is similar to the one used above:

$$G = G_0 \sum_{k=1}^{\infty} V_k\, e^{-T_k}\,\delta_{\omega_k},$$

where G_0 ∼ Gamma(α, c), the V_k are i.i.d. Beta(1, α), T_k ∼ Gamma(k − 1, α) independently, and the ω_k are i.i.d. draws from H_0. As before, we place gamma priors on α and c: α ∼ Gamma(a_1, a_2), c ∼ Gamma(c_1, c_2). The closed-form coordinate ascent updates for G_0, α and c and the gradient ascent updates for {V_k, T_k} are detailed in the supplementary.
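For comparison with the earlier sketch of Theorem 1, a minimal simulation of this naïve construction (again with arbitrary illustrative values, not the authors' code) simply scales truncated DP stick-breaking weights by an independent Gamma(α, c) draw:

```python
import numpy as np

def naive_gamma_weights(alpha, c, num_sticks, seed=0):
    """Naive construction of Section 5.1: a truncated DP stick-breaking measure
    multiplied by an independent G0 ~ Gamma(alpha, c) random variable."""
    rng = np.random.default_rng(seed)
    g0 = rng.gamma(alpha, 1.0 / c)                                  # G0 ~ Gamma(alpha, c)  (rate c)
    v = rng.beta(1.0, alpha, size=num_sticks)                       # V_i ~ Beta(1, alpha)
    leftover = np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))    # prod_{j<i}(1 - V_j)
    return g0 * v * leftover

# Total mass is roughly a Gamma(alpha, c) draw, since the truncated sticks sum to ~1.
print(naive_gamma_weights(alpha=2.0, c=1.0, num_sticks=500).sum())
```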


5.2 The MCMC Sampler

As a baseline, we also derive and compare the variational algorithm with a standard MCMC sampler for this model. We use the construction in (4) for sampling from the model. To avoid inferring the latent variables in all the atom weights of the Poisson-Gamma prior, we use Monte Carlo techniques to integrate them out, as in Paisley et al. (2010). This affects posterior inference for the indicators z_nk, the round indicators d and the hyperparameters c and α. The posterior distribution for γ is closed form, as are those for the likelihood latent variables in Π_{−G}. The complete updates are described in the supplementary.

[Figure 1: Plots of held-out test likelihoods and per-iteration running times (best viewed in color). Plots (d), (e) and (f) are for PsyRev, KOS, and NYT respectively. Plots (b) and (c) are for the PsyRev dataset. Algorithm trace colors are common to all plots. See text for full details.]

6 Experiments

We consider the problem of learning latent topics in document corpora. Given an observed set of counts of vocabulary words in a set of documents, represented by, say, a V × N count matrix, where V is the vocabulary size and N the number of documents, we aim to learn K latent factors and their vocabulary realizations using Poisson factor analysis. In particular, we model the observed corpus count matrix D as D ∼ Poi(ΦI), where the V × K matrix Φ models the factor loadings, and the K × N matrix I models the actual factor counts in the documents.

We implemented and analyzed the performance of four variational algorithms corresponding to four different priors on I: the Poisson-gamma process prior from this paper (abbreviated hereafter as VGP), a Poisson-gamma prior using the naïve construction of the gamma process (VnGP), the Bernoulli-beta prior from Paisley et al. (2011) (VBP) and the IBP prior from Doshi-Velez et al. (2009) (VIBP), along with the MCMC sampler mentioned above (SGP). For the Bernoulli-beta priors we modeled I as I = W ◦ Z as in Paisley et al. (2011), where the nonparametric priors are put on Z and a vague Gamma prior is put on W. For the VGP and SGP models we set I = Z. In addition, for all four algorithms, we put a symmetric Dirichlet(β_1, ..., β_V) prior on the columns of Φ. We added corresponding variational distributions for the variables in the collection denoted as Π_{−G} above. We use held-out per-word test log-likelihoods and times required to update all variables in Π in each iteration as our comparison metrics, with 80% of the data used for training. We used the same likelihood metric as Zhou and Carin (2012), with the samples replaced by the expectations of the variational distributions.
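The following sketch (not the authors' code) generates a toy corpus from the D ∼ Poi(ΦI) model with I = Z, along the lines of the synthetic-data experiment described next; unlike that experiment, which fixes 200 atoms, the number of factors here is random because the weights are drawn from the truncated stick-breaking prior:

```python
import numpy as np

def synthetic_corpus(V=200, N=3000, alpha=1.0, c=1.0, gamma_mass=3.0,
                     num_rounds=30, dirichlet_beta=0.1, seed=0):
    """Toy corpus from the Poisson factor analysis model D ~ Poisson(Phi Z):
    gamma-process weights give Poisson rates for the K x N count matrix Z,
    and Phi has symmetric-Dirichlet columns over the V vocabulary terms."""
    rng = np.random.default_rng(seed)
    # Stick-breaking gamma-process weights (Theorem 1, truncated to num_rounds).
    weights = []
    for i in range(1, num_rounds + 1):
        c_i = rng.poisson(gamma_mass)
        e = rng.exponential(1.0 / c, size=c_i)
        t = rng.gamma(i, 1.0 / alpha, size=c_i)
        weights.append(e * np.exp(-t))
    g = np.concatenate(weights)                                # one weight per latent factor
    K = g.size
    Z = rng.poisson(g[:, None], size=(K, N))                   # factor counts per document
    Phi = rng.dirichlet(np.full(V, dirichlet_beta), size=K).T  # V x K factor loadings
    D = rng.poisson(Phi @ Z)                                   # V x N observed word counts
    return D, Phi, Z

D, Phi, Z = synthetic_corpus()
print(D.shape, Z.shape)
```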


Synthetic Data. As a warm-up, we consider the performances of VGP and SGP on some synthetic data generated from this model. We generate 200 weighted atoms from the gamma prior using the stick-breaking construction, and use the Poisson likelihood to generate 3000 values for each atom to yield the indicator matrix Z. We simulated a vocabulary of 200 terms, generated a 200×200 factor-loading matrix Φ using symmetric Dirichlet priors, and then generated D = Poi(ΦZ). For the VGP and VnGP, we measure the test likelihood after every iteration and average the results across 10 random restarts. These measurements are plotted in fig. 1a. As shown, VGP's measured held-out likelihood converges within 10 iterations. The SGP traceplot shows the first thirty held-out likelihoods measured after burn-in. Per-iteration times were 15 seconds and 2.36 minutes for VGP (with K=125) and SGP respectively. The SGP learned K online, with values oscillating around 50. SNBP refers to the Poisson-Gamma mixture ("NB process") sampler from Zhou and Carin (2012). Its traceplot shows the first 30 likelihoods measured after 1000 burn-in iterations. We see that it performed similarly to our algorithms, though slightly worse.

Real data. We used a similar framework to model the count data from the KOS¹, NIPS², Psychological Review (PsyRev)³, and New York Times¹ corpora. The vocabulary sizes are 2566, 13649, 6906 and 100872 respectively, while the document counts are 1281, 1740, 3430 and 300000 respectively. For each dataset, we ran all three variational algorithms with 10 random restarts each, measuring the held-out log-likelihoods and per-iteration runtimes for different values of the truncation factor K. The learning rates for gradient ascent updates were kept on the order of 10^{-4} for both VGP and VBP, with 5 gradient steps per iteration. A representative subset of results is shown in figs. 1b through 1f.

We used vague gamma priors on the hyperparameters α, γ and c in the variational algorithms, and improper (1) priors for the sampler. We found the test likelihoods to be independent of these initializations. The results for the variational algorithms were dependent on the Dirichlet prior β on Φ, as noted in fig. 1b. We therefore used the learned test likelihood after 100 iterations as a heuristic to select β. We found the three variational algorithms to attain very similar test likelihoods across all four datasets after a few hours of CPU time, with the VGP and VBP having a slight edge over the VIBP. The sampler somewhat unexpectedly did not attain a competitive score for any dataset, unlike the synthetic case. For instance, as shown in fig. 1c, it oscillated around -7.45 for the PsyRev dataset, whereas the variational algorithms attained -7.23. For comparison, the NB process sampler from Zhou and Carin (2012) attains -7.25 each iteration after 1000 iterations of burn-in. VnGP was the worst performer, with a stable log-likelihood of -7.85. Also, as seen in fig. 1c, VGP was faster to converge (in fewer than 10 iterations in ∼5 seconds) than VIBP and VBP (∼50 iterations each). The test log-likelihoods after a few hours of runtime were largely independent of the truncation K for the three variational algorithms. Behavior for the other datasets was similar.

Among the three variational algorithms, the VIBP scaled best for small to medium datasets as a function of the truncation factor due to all updates being closed-form, in spite of having to learn the additional weight matrix W. The VGP running times were competitive for small values of K on these datasets. However, on the large NYT dataset, VGP was orders of magnitude faster than the Bernoulli-beta algorithms (note the log scale in fig. 1f). For example, with a truncation of 100 atoms, VGP took around 45 seconds per iteration, whereas both VIBP and VBP took more than 3 minutes. The VBP scaled poorly for all datasets, as seen in figs. 1d through 1f. The reason for this is three-fold: learning the parameters for the additional matrix W, which is directly affected by dimensionality (also the reason for VIBP being slow on the NYT dataset); gradient updates for two variables (as opposed to one for VGP); and a Taylor approximation required for these gradient updates (see Paisley et al. (2011)). The sampler SGP required around 7 minutes per iteration for the small datasets and an hour and 40 minutes on average for NYT.

To summarize, we found the VGP to post running times that are competitive with the fastest algorithm (VIBP) on small to medium datasets, and to outperform the other methods completely on the large NYT dataset, all the while providing similar accuracy compared to the other variational algorithms, as measured by held-out likelihood. It was also the fastest to converge, typically taking fewer than 15 iterations. Compared with SGP, our variational method is substantially faster (particularly on large-scale data) and produces higher likelihood scores on real data.

7 Conclusion

We have described a novel stick-breaking representation for gamma processes and used it to derive a variational inference algorithm. This algorithm has been shown to be far more scalable for large datasets than related variational algorithms, while attaining similar accuracy and outperforming sampling-based methods. We expect that recent improvements to variational techniques can also be applied to our algorithm, potentially yielding even further scalability.

Acknowledgements

This work was supported by NSF award IIS-1217433.

¹ https://archive.ics.uci.edu/ml/datasets/Bag+of+Words
² http://www.stats.ox.ac.uk/~teh/data.html
³ http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm


References

Blei, D. and Jordan, M. (2003). Variational methods for Dirichlet process mixtures. Bayesian Analysis, 1:121–144.

Broderick, T., Mackey, L., Paisley, J., and Jordan, M. I. (2014). Combinatorial clustering and the beta negative binomial process. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Caron, F. and Fox, E. B. (2013). Bayesian Nonparametric Models of Sparse and Exchangeable Random Graphs. arXiv:1401.1137.

Caron, F., Teh, Y. W., and Murphy, B. T. (2013). Bayesian Nonparametric Plackett-Luce Models for the Analysis of Clustered Ranked Data. arXiv:1211.5037.

Doshi-Velez, F., Miller, K., Gael, J. V., and Teh, Y. W. (2009). Variational Inference for the Indian Buffet Process. In AISTATS.

Ferguson, T. (1973). A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1(2):209–230.

Gopalan, P., Ruiz, F., Ranganath, R., and Blei, D. (2014). Bayesian nonparametric Poisson factorization. In AISTATS.

Ishwaran, H. and James, L. F. (2000). Approximate Dirichlet Process Computing in Finite Normal Mixtures: Smoothing and Prior Information. Journal of Computational and Graphical Statistics, 11:508–532.

Ishwaran, H. and James, L. F. (2001). Gibbs Sampling Methods for Stick-Breaking Priors. Journal of the American Statistical Association, 96:161–173.

Jordan, M. I. (2010). Hierarchical Models, Nested Models and Completely Random Measures. In Chen, M.-H., Dey, D., Mueller, P., Sun, D., and Ye, K., editors, Frontiers of Statistical Decision Making and Bayesian Analysis: In Honor of James O. Berger. New York: Springer.

Kingman, J. (1967). Completely Random Measures. Pacific Journal of Mathematics, 21(1):59–78.

Kingman, J. F. C. (1993). Poisson Processes, volume 3 of Oxford Studies in Probability. Oxford University Press, New York.

Miller, K. (2011). Bayesian Nonparametric Latent Feature Models. PhD thesis, University of California at Berkeley.

Orbanz, P. and Williamson, S. (2012). Unit-rate Poisson representations of completely random measures.

Paisley, J., Blei, D. M., and Jordan, M. I. (2012). Stick-Breaking Beta Processes and the Poisson Process. In Artificial Intelligence and Statistics.

Paisley, J., Carin, L., and Blei, D. M. (2011). Variational Inference for Stick-Breaking Beta Process Priors. In International Conference on Machine Learning.

Paisley, J., Zaas, A., Woods, C. W., Ginsburg, G. S., and Carin, L. (2010). A Stick-Breaking Construction of the Beta Process. In International Conference on Machine Learning.

Roychowdhury, A. and Kulis, B. (2014). Gamma Processes, Stick-Breaking, and Variational Inference. arXiv:1410.1068.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650.

Teh, Y., Kurihara, K., and Welling, M. (2007a). Collapsed variational inference for HDP. In NIPS.

Teh, Y. W., Görür, D., and Ghahramani, Z. (2007b). Stick-breaking construction for the Indian buffet process. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 11.

Thibaux, R. (2008). Nonparametric Bayesian Models for Machine Learning. PhD thesis, University of California at Berkeley.

Titsias, M. (2008). The Infinite Gamma-Poisson Model. In Advances in Neural Information Processing Systems.

Wang, C., Paisley, J., and Blei, D. (2011). Online variational inference for the hierarchical Dirichlet process. In AISTATS.

Wolpert, R. and Ickstadt, K. (1998). Simulation of Lévy Random Fields. In Practical Nonparametric and Semiparametric Bayesian Statistics. Springer-Verlag.

Zhou, M. and Carin, L. (2012). Augment-and-conquer negative binomial processes. In NIPS.

Zhou, M., Hannah, L., Dunson, D., and Carin, L. (2012). Beta-negative binomial process and Poisson factor analysis. In AISTATS.
