Gamma Processes, Stick-Breaking, and Variational Inference

Anirban Roychowdhury
Dept. of Computer Science and Engineering, Ohio State University
[email protected]

Brian Kulis
Dept. of Computer Science and Engineering, Ohio State University
[email protected]

Abstract

While most Bayesian nonparametric models in machine learning have focused on the Dirichlet process, the beta process, or their variants, the gamma process has recently emerged as a useful nonparametric prior in its own right. Current inference schemes for models involving the gamma process are restricted to MCMC-based methods, which limits their scalability. In this paper, we present a variational inference framework for models involving gamma process priors. Our approach is based on a novel stick-breaking constructive definition of the gamma process. We prove correctness of this stick-breaking process by using the characterization of the gamma process as a completely random measure (CRM), and we explicitly derive the rate measure of our construction using Poisson process machinery. We also derive error bounds on the truncation of the infinite process required for variational inference, similar to the truncation analyses for other nonparametric models based on the Dirichlet and beta processes. Our representation is then used to derive a variational inference algorithm for a particular Bayesian nonparametric latent structure formulation known as the infinite Gamma-Poisson model, where the latent variables are drawn from a gamma process prior with Poisson likelihoods. Finally, we present results for our algorithm on non-negative matrix factorization tasks on document corpora, and show that we compare favorably to both sampling-based techniques and variational approaches based on beta-Bernoulli priors, as well as a direct DP-based construction of the gamma process.

1 Introduction

The gamma process is a versatile pure-jump Lévy process with widespread applications in various fields of science. Of late it is emerging as an increasingly popular prior in the Bayesian nonparametric literature within the machine learning community; it has recently been applied to exchangeable models of sparse graphs Caron and Fox (2013) as well as to nonparametric ranking models Caron et al. (2013). It has also been used as a prior for infinite-dimensional latent indicator matrices Titsias (2008). This latter application is one of the earliest Bayesian nonparametric approaches to count modeling, and as such can be thought of as an extension of the venerable Indian Buffet Process to modeling latent structures where each feature can occur multiple times for a datapoint, instead of being simply binary.

The flexibility of gamma process models allows them to be applied in a wide variety of Bayesian nonparametric settings, but their relative complexity makes principled inference nontrivial. In particular, most direct applications of the gamma process in the Bayesian nonparametric literature use Markov chain Monte Carlo samplers (typically Gibbs sampling) for posterior inference, which often suffer from poor scalability. For other Bayesian nonparametric models—in particular those involving the Dirichlet process or beta process—a successful thread of research has considered variational alternatives to standard sampling methods Blei and Jordan (2003); Teh et al. (2007a); Wang et al. (2011). One first derives an explicit construction of the underlying "weights" of the atomic measure component of the random measures underlying the infinite priors; so-called "stick-breaking" processes for the Dirichlet and beta processes yield such a construction. These weights are then truncated and integrated into a mean-field variational inference algorithm.

For instance, stick-breaking was derived for the Dirichlet process in the seminal paper by Sethuraman (1994), which was in turn used for variational inference in Dirichlet process models Blei and Jordan (2003). Similar stick-breaking representations for a special case of the Indian Buffet Process Teh et al. (2007b) and the beta process Paisley et al. (2010) have been constructed, and have naturally led to mean-field variational inference algorithms for nonparametric models involving these priors Doshi-Velez et al. (2009); Paisley et al. (2011). Such variational inference algorithms have been shown to be more scalable than the sampling-based inference techniques normally used; moreover, they work with the full model posterior without marginalizing out any variables.

In this paper we propose a variational inference framework for gamma process priors using a novel stick-breaking construction of the process. We use the characterization of the gamma process as a completely random measure (CRM), which allows us to leverage Poisson process properties to arrive at a simple derivation of the rate measure of our stick-breaking construction, and show that it is indeed equal to the Lévy measure of the gamma process. We also use the Poisson process formulation to derive a bound on the error of the truncated version compared to the full process, analogous to the bounds derived for the Dirichlet process Ishwaran and James (2001), the Indian Buffet Process Doshi-Velez et al. (2009) and the beta process Paisley et al. (2011). We then, as a particular example, focus on the infinite Gamma-Poisson model of Titsias (2008) (note that variational inference need not be limited to this model). This model is a prior on infinitely wide latent indicator matrices with non-negative integer-valued entries; each column has an associated parameter independently drawn from a gamma distribution, and the matrix values are independently drawn from Poisson distributions with these parameters as means. We develop a mean-field variational technique using a truncated version of our stick-breaking construction, and a sampling algorithm that uses Monte Carlo integration for parameter marginalization, similar to Paisley et al. (2010), as a baseline inference algorithm for comparison. We also derive a variational algorithm based on the naïve construction of the gamma process. Finally, we compare these with variational algorithms based on beta-Bernoulli priors on a non-negative matrix factorization task involving the Psychological Review, NIPS, KOS and New York Times document corpora, and show that the variational algorithm based on our construction performs and scales better than all the others.

Related Work. To our knowledge this is the first explicit "stick-breaking"-like construction of the gamma CRM, apart from the naïve approach of denormalizing the construction of the DP with a suitable gamma random variable Miller (2011); Gopalan et al. (2014); moreover, as mentioned above, we develop a variational inference algorithm using the naïve construction (see Section 5.1) and show that it performs worse than our main algorithm on both synthetic and real datasets. The very general inverse Lévy measure algorithm of Wolpert and Ickstadt (1998) requires inversion of the exponential integral, as does the generalized CRM construction technique of Orbanz and Williamson (2012) when applied to the gamma process; since a closed-form solution of the inverse of the exponential integral is not known, these techniques do not give us an analytic construction of the weights, and hence cannot be adapted to variational techniques in a straightforward manner. Other constructive definitions of the gamma process include Thibaux (2008), who discusses a sampling-based scheme for the weights of a gamma process by sampling from a Poisson process. As an alternative to gamma process-based models for count modeling, recent research has examined the negative binomial-beta process and its variants Zhou and Carin (2012); Zhou et al. (2012); Broderick et al. (2014); the stick-breaking construction of Paisley et al. (2010) readily extends to such models since they have beta process priors. The beta stick-breaking construction has also been used for variational inference in beta-Bernoulli priors Paisley et al. (2011), though such models have scalability issues when applied to the count modeling problems addressed in this work, as we show in the experimental section.

2 Background

2.1 Completely random measures

A completely random measure Kingman (1967); Jordan (2010) G on a space (Ω, F) is defined as a random measure on F such that for any two disjoint Borel subsets A1 and A2 in F, the random variables G(A1) and G(A2) are independent. The canonical way of constructing a completely random measure G is to first take a σ-finite product measure H on Ω ⊗ R+, then draw a countable set of points {(ω_k, p_k)} from a Poisson process on a Borel σ-algebra on Ω ⊗ R+ with H as the rate measure. The CRM is then constructed as $G = \sum_{k=0}^{\infty} p_k \delta_{\omega_k}$, where the measure given to a measurable Borel set B ⊂ Ω is $G(B) = \sum_{k:\,\omega_k \in B} p_k$. In this notation the p_k are referred to as weights and the ω_k as atoms.

If the rate measure is defined on Ω ⊗ [0, 1] as $H(d\omega, dp) = c\,p^{-1}(1-p)^{c-1} B_0(d\omega)\,dp$, where B_0 is an arbitrary finite continuous measure on Ω and c is some constant (or function of ω), then the corresponding CRM constructed as above is known as a beta process. If the rate measure is defined as $H(d\omega, dp) = c\,p^{-1}e^{-cp} G_0(d\omega)\,dp$, with the same restrictions on c and G_0, then the corresponding CRM constructed as above is known as the gamma process.
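As a quick numerical illustration (not from the paper, with arbitrary illustrative values), the sketch below checks that integrating p against the gamma-process rate density reproduces the mean of the total mass discussed just below, i.e. the mean of a Gamma(cG_0(Ω), c) random variable:

```python
import numpy as np
from scipy import integrate

# Gamma-process rate measure: H(dw, dp) = c * p^{-1} * exp(-c p) * G0(dw) dp.
# Campbell's theorem gives E[G(Omega)] = G0(Omega) * int_0^inf p * c p^{-1} e^{-c p} dp;
# the integrand simplifies to c e^{-c p}, so the integral is 1 and E[G(Omega)] = G0(Omega).
c, base_mass = 2.0, 3.0                       # illustrative values only

integral, _ = integrate.quad(lambda p: c * np.exp(-c * p), 0.0, np.inf)
print(base_mass * integral)                   # ~3.0
print(c * base_mass / c)                      # mean of Gamma(shape=c*G0(Omega), rate=c): also 3.0
```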


The total mass of the gamma process G, G(Ω), is distributed as Gamma(cG_0(Ω), c). The improper distributions in these rate measures integrate to infinity over their respective domains, ensuring a countably infinite set of points in a draw from the Poisson process. For the beta process the weights p_k are in [0, 1], whereas for the gamma process they are in [0, ∞). In both cases, however, the sum of the weights is finite, as can be seen from Campbell's theorem Kingman (1967), and is governed by c and the total mass of the base measure on Ω. For completeness we note that completely random measures as defined in Kingman (1967) have three components: a set of fixed atoms, a deterministic measure (usually assumed absent), and a random discrete measure. It is this third component that is explicitly generated using a Poisson process, though the fixed component can be readily incorporated into this construction Kingman (1993).

If we create an atomic measure by normalizing the weights {p_k} from the gamma process, i.e. $D = \sum_{k=0}^{\infty} \pi_k \delta_{\omega_k}$ where $\pi_k = p_k / \sum_{i=0}^{\infty} p_i$, then D is known as a Dirichlet process Ferguson (1973), denoted as D ∼ DP(α_0, H_0) where α_0 = G_0(Ω) and H_0 = G_0/α_0. It is not a CRM, as the random variables induced on disjoint sets lack independence because of the normalization; it belongs to the class of normalized random measures with independent increments (NRMIs).

2.2 Stick-breaking for the Dirichlet and Beta Processes

A recursive way to generate the weights of random measures is given by stick-breaking, where a unit interval is subdivided into fragments based on draws from suitably chosen distributions. For example, the stick-breaking construction of the Dirichlet process Sethuraman (1994) is given by

$$D = \sum_{i=1}^{\infty} V_i \prod_{j=1}^{i-1} (1 - V_j)\, \delta_{\omega_i},$$

where the V_i are i.i.d. Beta(1, α) and the ω_i are i.i.d. draws from H_0. Here the length of the first break from a unit-length stick is given by V_1. In the next round, a fraction V_2 of the remaining stick of length 1 − V_1 is broken off, and we are left with a piece of length (1 − V_2)(1 − V_1). The length of the piece broken off in the next round is therefore given by V_3(1 − V_2)(1 − V_1), and so on. Note that the weights belong to (0, 1), and since this is a normalized measure, the weights sum to 1 almost surely. This is consistent with the use of the Dirichlet process as a prior on probability distributions.
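To make the recursion concrete, here is a minimal sketch (not from the paper) of the Sethuraman construction truncated to a finite number of sticks; the atom locations ω_i drawn from H_0 are omitted since only the weights matter here:

```python
import numpy as np

def dp_stick_weights(alpha, num_sticks, seed=0):
    """Truncated Sethuraman stick-breaking: pi_i = V_i * prod_{j<i} (1 - V_j)."""
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=num_sticks)                      # V_i ~ Beta(1, alpha)
    leftover = np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))   # prod_{j<i} (1 - V_j)
    return v * leftover                                            # sums to ~1 for large truncations

print(dp_stick_weights(alpha=2.0, num_sticks=1000).sum())          # close to 1
```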


This construction was generalized in Paisley et al. (2010) to yield stick-breaking for the beta process:

$$B = \sum_{i=1}^{\infty}\sum_{j=1}^{C_i} V^{(i)}_{ij}\prod_{l=1}^{i-1}\left(1 - V^{(l)}_{ij}\right)\delta_{\omega_{ij}}, \qquad (1)$$

where the $V^{(i)}_{ij}$ are i.i.d. Beta(1, α), the C_i are i.i.d. Poisson(γ), and the ω_ij are i.i.d. draws from (1/γ)B_0. We use this representation as the basis for our stick-breaking-like construction of the gamma CRM, and use Poisson process-based proof techniques similar to Paisley et al. (2012) to derive the rate measure.

3 The Stick-breaking Construction of the Gamma Process

3.1 Constructions and proof of correctness

We propose a simple recursive construction of the gamma process CRM, based on the stick-breaking construction for the beta process proposed in Paisley et al. (2010, 2012). In particular, we augment (or 'mark') a slightly modified stick-breaking beta process with an independent gamma-distributed random measure and show that the resultant Poisson process has the rate measure $H(d\omega, dp) = c\,p^{-1}e^{-cp}G_0(d\omega)\,dp$ as defined above. We show this by directly deriving the rate measure of the marked Poisson process using product distribution formulae. Our proposed stick-breaking construction is as follows:

$$G = \sum_{i=1}^{\infty}\sum_{j=1}^{C_i} G^{(i)}_{ij} V^{(i)}_{ij}\prod_{l=1}^{i}\left(1 - V^{(l)}_{ij}\right)\delta_{\omega_{ij}}, \qquad (2)$$

where the $G^{(i)}_{ij}$ are i.i.d. Gamma(α + 1, c), the $V^{(l)}_{ij}$ are i.i.d. Beta(1, α), the C_i are i.i.d. Poisson(γ), and the ω_ij are i.i.d. draws from (1/γ)H_0. As with the beta process stick-breaking construction, the product of beta random variables allows us to interpret each j as corresponding to a stick that is being broken into an infinite number of pieces. Note that the expected weight on an atom in round i is $\alpha^i / (c(1+\alpha)^i)$. The parameter c can therefore be used to control the weight decay cadence along with α.

The above representation provides the clearest view of the construction, but is somewhat cumbersome to deal with in practice, mostly due to the introduction of the additional gamma random variable. We reduce the number of random variables by noting that the product of a Beta(1, α) and a Gamma(α + 1, c) random variable has an Exp(c) distribution; we also perform a change of variables on the product of the (1 − V_ij)s to arrive at the following equivalent construction, for which we now prove its correctness:

Theorem 1. A gamma CRM with positive concentration parameters α and c and finite base measure H_0 may be constructed as

$$G = \sum_{i=1}^{\infty}\sum_{j=1}^{C_i} E_{ij}\, e^{-T_{ij}}\, \delta_{\omega_{ij}}, \qquad (3)$$

where the E_ij are i.i.d. Exp(c), T_ij ∼ Gamma(i, α) independently across (i, j), the C_i are i.i.d. Poisson(γ), and the ω_ij are i.i.d. draws from (1/γ)H_0.

Proof. Note that, by construction, in each round i in (3), each set of weighted atoms $\{(\omega_{ij}, E_{ij}e^{-T_{ij}})\}_{j=1}^{C_i}$ forms a Poisson process since the C_i are drawn from a Poisson(γ) distribution. In particular, each of these sets is a marked Poisson process Kingman (1993), where the atoms ω_ij of the Poisson process on Ω are marked with the random variables $E_{ij}e^{-T_{ij}}$, which have a probability measure on (0, ∞). The superposition theorem of Kingman (1993) tells us that the countable union of Poisson processes is itself a Poisson process on the same measure space; therefore, denoting $G_i = \sum_{j=1}^{C_i} E_{ij}e^{-T_{ij}}\delta_{\omega_{ij}}$, we can say $G = \bigcup_{i=1}^{\infty} G_i$ is a Poisson process on Ω × [0, ∞). We show below that the rate measure of this process equals that of the gamma CRM.

Now, we note that the random variable $E_{ij}e^{-T_{ij}}$ has a probability measure on [0, ∞); denote this by q_ij. We are going to mark the underlying Poisson process with this measure. The density corresponding to this measure can be readily derived using product distribution formulae. To that end, ignoring indices, if we denote W = exp(−T), then we can derive its distribution by a change of variable. Then, denoting Q = E × W where E ∼ Exp(c), we can use the product distribution formula to write the density of Q as

$$f_Q(q) = \int_0^1 \frac{\alpha^i}{\Gamma(i)}\,(-\log w)^{i-1}\, w^{\alpha-2}\, c\, e^{-\frac{cq}{w}}\, dw,$$

where T ∼ Gamma(i, α). Formally speaking, this is the Radon-Nikodym density corresponding to the measure q, since it is absolutely continuous with respect to the Lebesgue measure on [0, ∞) and σ-finite by virtue of being a probability measure. Furthermore, these conditions hold for all the measures that we have in our union of marked Poisson processes; this allows us to write the density of the combined measure as

$$f(p) = \sum_{i=1}^{\infty}\int_0^1 \frac{\alpha^i}{\Gamma(i)}\,(-\log w)^{i-1}\, w^{\alpha-2}\, c\, e^{-\frac{cp}{w}}\, dw
     = \int_0^1 \sum_{i=1}^{\infty}\frac{\alpha^i}{\Gamma(i)}\,(-\log w)^{i-1}\, w^{\alpha-2}\, c\, e^{-\frac{cp}{w}}\, dw
     = \int_0^1 \alpha\, w^{-2}\, c\, e^{-\frac{cp}{w}}\, dw
     = \alpha\, p^{-1} e^{-cp}
     = \frac{\alpha}{c}\, c\, p^{-1} e^{-cp},$$

where we have used monotone convergence to move the sum inside the integral and recognize the Taylor expansion of exp(−α log w). Note that the measure defined on B([0, ∞)) by the "improper" gamma distribution $p^{-1}e^{-cp}$ is σ-finite, in the sense that we can decompose [0, ∞) into the countable union of disjoint intervals [1/k, 1/(k − 1)), k = 1, 2, ..., ∞, each of which has finite measure. In particular, the measure of the interval [1, ∞) is given by the exponential integral.

Therefore the rate measure of the process G as constructed here is $G(d\omega, dp) = c\,p^{-1}e^{-cp}G_0(d\omega)\,dp$, where G_0 is the same as H_0 up to the multiplicative constant α/c, and therefore satisfies the finiteness assumption imposed on H_0.

We use the form specified in the theorem above in our variational inference algorithm, since the variational distributions on almost all the parameters and variables in this construction lend themselves to simple closed-form exponential family updates. As an aside, we note that the random variables (1 − V_ij) have a Beta(α, 1) distribution; therefore, if we denote U_ij = 1 − V_ij, then the construction in (2) is equivalent to

$$G = \sum_{i=1}^{\infty}\sum_{j=1}^{C_i} E_{ij}\prod_{l=1}^{i} U^{(l)}_{ij}\, \delta_{\omega_{ij}},$$

where the E_ij are i.i.d. Exp(c), the $U^{(l)}_{ij}$ are i.i.d. Beta(α, 1), the C_i are i.i.d. Poisson(γ), and the ω_ij are i.i.d. draws from (1/γ)H_0. This notation therefore relates our construction to the stick-breaking construction of the Indian Buffet Process Teh et al. (2007b), where the Bernoulli parameters π_k are generated as products of i.i.d. Beta(α, 1) random variables: $\pi_1 = \nu_1$, $\pi_k = \prod_{i=1}^{k}\nu_i$ where the ν_i are i.i.d. Beta(α, 1). In particular, we can view our construction as a generalization of the IBP stick-breaking, where the stick-breaking weights are multiplied with independent Exp(c) random variables, with the summation over j providing an explicit Poissonization.
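The construction in Theorem 1 is straightforward to simulate once the outer sum is truncated to a finite number of rounds. The following minimal sketch (not the authors' code; parameter values are arbitrary) draws the atom weights E_ij e^{−T_ij} round by round:

```python
import numpy as np

def gamma_process_weights(alpha, c, gamma_mass, num_rounds, seed=0):
    """Atom weights from the stick-breaking construction of Theorem 1, truncated to
    a finite number of rounds: in round i, draw C_i ~ Poisson(gamma), and give each
    of the C_i atoms the weight E * exp(-T) with E ~ Exp(c) and T ~ Gamma(i, alpha).
    Atom locations omega_ij ~ (1/gamma) H0 are omitted; only the weights are returned."""
    rng = np.random.default_rng(seed)
    weights = []
    for i in range(1, num_rounds + 1):
        c_i = rng.poisson(gamma_mass)                         # number of atoms in round i
        e = rng.exponential(1.0 / c, size=c_i)                # E_ij ~ Exp(c)  (rate c)
        t = rng.gamma(i, 1.0 / alpha, size=c_i)               # T_ij ~ Gamma(i, alpha)  (rate alpha)
        weights.append(e * np.exp(-t))
    return np.concatenate(weights)

w = gamma_process_weights(alpha=1.0, c=1.0, gamma_mass=3.0, num_rounds=50)
print(w.size, w.sum())   # weights decay geometrically across rounds; their sum is finite a.s.
```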

3.2 Truncation analysis

The variational algorithm requires a truncation level for the number of atoms for tractability. Therefore we need to analyze the closeness between the marginal distributions of the data drawn from the full prior and the truncated prior, with the stick-breaking prior weights integrated out. Our construction leads to a simpler truncation analysis if we truncate the number of rounds (indexed by i in the outer sum), which automatically truncates the atoms to a finite number. For this analysis, we will use the stick-breaking gamma process as the base measure of a Poisson likelihood process, which we denote by PP; this is precisely the model for which we develop variational inference in the next section. If we denote the gamma process as $G = \sum_{k=0}^{\infty} g_k \delta_{\omega_k}$, with g_k the recursively constructed weights, then PP can be written as $PP = \sum_{k=0}^{\infty} p_k \delta_{\omega_k}$ where $p_k \sim \mathrm{Poisson}(g_k)$. Under this model, we can obtain the following result, which is analogous to error bounds derived for other nonparametric models Ishwaran and James (2001); Doshi-Velez et al. (2009); Paisley et al. (2011) in the literature.

Theorem 2. Let N samples X = (X_1, ..., X_N) be drawn from PP(G). If G ∼ ΓP(c, G_0), the full gamma process, then denote the marginal density of X as m_∞(X). If G is a gamma process truncated after R rounds, denote the marginal density of X as m_R(X). Then

$$\frac{1}{4}\int \left|m_\infty(X) - m_R(X)\right|\, dX \;\le\; 1 - \exp\left\{-N\gamma\,\frac{\alpha}{c}\left(\frac{\alpha}{1+\alpha}\right)^{R}\right\}.$$

Proof. The starting intuition is that if we truncate the process after R rounds, then the error in the marginal distribution of the data will depend on the probability of positive indicator values appearing for atoms after the Rth round in the infinite version. Combining this with ideas analogous to those in Ishwaran and James (2000) and Ishwaran and James (2001), we get the following bound for the difference between the marginal distributions:

$$\frac{1}{4}\int \left|m_\infty(X) - m_R(X)\right|\, dX \;\le\; \mathbb{P}\left\{\exists\,(k,j),\; k > \sum_{r=1}^{R} C_r,\; 1 \le n \le N \;\text{s.t.}\; X_n(\omega_{kj}) > 0\right\}.$$

Since we have a Poisson likelihood on the underlying gamma process, this probability can be written as

$$\mathbb{P}(\cdot) = 1 - \mathbb{E}\left[\,\mathbb{E}\left[\left(\prod_{r=R+1}^{\infty}\prod_{j=1}^{C_r} e^{-\pi_{rj}}\right)^{\!N} \,\middle|\, C_r\right]\right],$$

where $\pi_{rj} = G^{(r)}_{rj} V^{(r)}_{rj} \prod_{l=1}^{r}\bigl(1 - V^{(l)}_{rj}\bigr)$. We may then use Jensen's inequality to bound it as follows:

$$\mathbb{P}(\cdot) \;\le\; 1 - \exp\left\{\sum_{r=R+1}^{\infty}\mathbb{E}\left[\sum_{j=1}^{C_r} N\,\log\!\left(e^{-\pi_{rj}}\right)\right]\right\}
= 1 - \exp\left\{-\frac{N\gamma}{c}\sum_{r=R+1}^{\infty}\left(\frac{\alpha}{1+\alpha}\right)^{r}\right\}
= 1 - \exp\left\{-N\gamma\,\frac{\alpha}{c}\left(\frac{\alpha}{1+\alpha}\right)^{R}\right\},$$

which is the bound in the theorem statement and completes the proof.
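As a rough illustration of how the bound behaves (values chosen arbitrarily, not taken from the paper's experiments), the right-hand side of Theorem 2 can be evaluated for a few truncation levels; it decays geometrically in R at rate α/(1 + α):

```python
import numpy as np

def truncation_bound(R, N, alpha, c, gamma_mass):
    """Right-hand side of Theorem 2: 1 - exp(-N * gamma * (alpha/c) * (alpha/(1+alpha))**R)."""
    return 1.0 - np.exp(-N * gamma_mass * (alpha / c) * (alpha / (1.0 + alpha)) ** R)

# Illustrative values: alpha = c = 1, gamma = 3, N = 1000 documents.
for R in (10, 20, 30, 40):
    print(R, truncation_bound(R, N=1000, alpha=1.0, c=1.0, gamma_mass=3.0))
```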


4 Variational Inference

As discussed in Section 3.2, we will focus on the infinite Gamma-Poisson model, where a gamma process prior is used in conjunction with a Poisson likelihood function. When integrating out the weights of the gamma process, this process is known to yield a nonparametric prior for sparse, infinite count matrices Titsias (2008). We note that our approach should easily be applicable to other models involving gamma process priors.

4.1 The Model

To effectively perform variational inference, we rewrite G as a single sum of weighted atoms, using indicator variables {d_k} for the rounds in which the atoms occur, similar to Paisley et al. (2010):

$$G = \sum_{k=1}^{\infty} E_k\, e^{-T_k}\, \delta_{\omega_k}, \qquad (4)$$

where the E_k are i.i.d. Exp(c), T_k ∼ Gamma(d_k, α) independently, $\sum_{k=1}^{\infty}\mathbf{1}(d_k = r)$ is i.i.d. Poisson(γ) across rounds r, and the ω_k are i.i.d. draws from (1/γ)H_0. We also place gamma priors on α, γ and c: α ∼ Gamma(a_1, a_2), γ ∼ Gamma(b_1, b_2), c ∼ Gamma(c_1, c_2). Denoting the data, the latent prior variables and the model hyperparameters by D, Π and Λ respectively, the full likelihood may be written as P(D, Π|Λ) = P(D, Π_{−G}|Π_G, Λ) · P(Π_G|Λ), where

$$P(\Pi_G|\Lambda) = P(\alpha)\, P(\gamma)\, P(c)\, P(d|\gamma) \prod_{k=1}^{K} P(E_k|c)\, P(T_k|d_k, \alpha) \prod_{n=1}^{N} P(z_{nk}|E_k, T_k).$$

We truncate the infinite gamma process to K atoms, and take N to be the total number of datapoints. Π_{−G} denotes the set of latent variables excluding those from the Poisson-Gamma prior; for instance, in factor analysis for topic models, this contains the Dirichlet-distributed factor variables (or topics).

From the Poisson likelihood, we have z_nk|E_k, T_k ∼ Poisson(E_k e^{−T_k}), independently for each n. The distributions of T_k and d involve indicator functions on the round indicator variables d_k:

$$P(T_k \mid d_k, \alpha) = \prod_{r \ge 1}\left[\frac{\alpha^{\nu_k(0)}}{\Gamma(r)}\, T_k^{\nu_k(1)}\, e^{-\alpha T_k}\right]^{\mathbf{1}(d_k = r)},$$

where $\nu_k(s) = \sum_{r \ge 1} (r - s)\,\mathbf{1}(d_k = r)$. We use the same weighting factors in our distribution on d as Paisley et al. (2011). See Paisley et al. (2011) for a discussion of how to approximate these factors in the variational algorithm.

4.2 The Variational Prior Distribution

Mean-field variational inference involves minimizing the KL divergence between the model posterior and a suitably constructed variational distribution, which is used as a more tractable alternative to the actual posterior distribution. To that end, we propose a fully-factorized variational distribution on the Poisson-Gamma prior as follows:

$$Q = q(\alpha)\, q(\gamma)\, q(c) \prod_{k=1}^{K} q(E_k)\, q(T_k)\, q(d_k) \prod_{n=1}^{N} q(z_{nk}),$$

where $q(E_k) \sim \mathrm{Gamma}(\acute{\xi}_k, \acute{\epsilon}_k)$, $q(T_k) \sim \mathrm{Gamma}(\acute{u}_k, \acute{\upsilon}_k)$, $q(\alpha) \sim \mathrm{Gamma}(\kappa_1, \kappa_2)$, $q(\gamma) \sim \mathrm{Gamma}(\tau_1, \tau_2)$, $q(c) \sim \mathrm{Gamma}(\rho_1, \rho_2)$, $q(z_{nk}) \sim \mathrm{Poisson}(\lambda_{nk})$, $q(d_k) \sim \mathrm{Mult}(\varphi_k)$.

Instead of working with the actual KL divergence between the full posterior and the factorized proxy distribution, variational inference maximizes what is canonically known as the evidence lower bound (ELBO), a function that is the same as the KL divergence up to a constant. In our case it may be written as $\mathcal{L} = \mathbb{E}_Q \log P(D, \Pi|\Lambda) - \mathbb{E}_Q \log Q$. We omit the full representation here for brevity.

4.3 The Variational Parameter Updates

Since we are using exponential family variational distributions, we leverage the closed-form variational updates for exponential families wherever we can, and perform gradient ascent on the ELBO for the parameters of those distributions which do not have closed-form updates. We list the updates on the distributions of the prior below. The closed-form updates for the hyperparameters in q(E_k), q(α), q(c) and q(γ) are as follows:

$$\acute{\xi}_k = \sum_{n=1}^{N}\mathbb{E}_Q(z_{nk}) + 1, \qquad \acute{\epsilon}_k = \mathbb{E}_Q(c) + N \cdot \mathbb{E}_Q\!\left(e^{-T_k}\right),$$
$$\kappa_1 = \sum_{k=1}^{K}\sum_{r \ge 1} r\,\varphi_k(r) + a_1, \qquad \kappa_2 = \sum_{k=1}^{K}\mathbb{E}_Q(T_k) + a_2,$$
$$\rho_1 = c_1 + K, \qquad \rho_2 = \sum_{k=1}^{K}\mathbb{E}_Q(E_k) + c_2,$$
$$\tau_1 = b_1 + K, \qquad \tau_2 = \sum_{r \ge 1}\left\{1 - \prod_{k=1}^{K}\sum_{\acute{r}=1}^{r-1}\varphi_k(\acute{r})\right\} + b_2.$$

The updates for the multinomial probabilities in q(d_k) are given by:

$$\varphi_k(r) \propto \exp\Big\{ r\,\mathbb{E}_Q(\log\alpha) - \log\Gamma(r) + (r-1)\,\mathbb{E}_Q(\log T_k) - \zeta \sum_{i \ne k}\varphi_i(r) - \mathbb{E}_Q(\gamma)\sum_{j=2}^{r}\prod_{k' \ne k}\sum_{r'=1}^{j-1}\varphi_{k'}(r')\Big\}.$$

The variational distribution q(T_k) does not lend itself to closed-form analytical updates, so we perform gradient ascent on the evidence lower bound. The variational updates for q(z_nk) and for the variational distributions on the latent variables in Π_{−G} are model dependent, and require some approximations for the factor analysis case. See Roychowdhury and Kulis (2014) for details.

5 Other Algorithms

Here we briefly describe the two primary competing algorithms we developed based on constructions of the Gamma process: a variational inference algorithm from the naïve construction, and a Markov chain Monte Carlo sampler based on our construction.

5.1 Naïve Variational Inference

We derive a variational inference algorithm from a simpler construction of the Gamma process, where we multiply the stick-breaking construction of the Dirichlet process by a Gamma random variable. The construction can be written as:

$$G = G_0 \sum_{i=1}^{\infty} V_i \prod_{j=1}^{i-1}(1 - V_j)\,\delta_{\omega_i},$$

where G_0 ∼ Gamma(α, c), the V_i are i.i.d. Beta(1, α), and the ω_i are i.i.d. draws from H_0. We use an equivalent form of the construction that is similar to the one used above:

$$G = G_0 \sum_{k=1}^{\infty} V_k\, e^{-T_k}\,\delta_{\omega_k},$$

where G_0 ∼ Gamma(α, c), the V_k are i.i.d. Beta(1, α), T_k ∼ Gamma(k − 1, α) independently, and the ω_k are i.i.d. draws from H_0. As before, we place gamma priors on α and c: α ∼ Gamma(a_1, a_2), c ∼ Gamma(c_1, c_2). The closed-form coordinate ascent updates for G_0, α and c and the gradient ascent updates for {V_k, T_k} are detailed in the supplementary.
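For comparison with the earlier sketch of Theorem 1, a minimal simulation of this naïve construction (again with arbitrary illustrative values, not the authors' code) simply scales truncated DP stick-breaking weights by an independent Gamma(α, c) draw:

```python
import numpy as np

def naive_gamma_weights(alpha, c, num_sticks, seed=0):
    """Naive construction of Section 5.1: a truncated DP stick-breaking measure
    multiplied by an independent G0 ~ Gamma(alpha, c) random variable."""
    rng = np.random.default_rng(seed)
    g0 = rng.gamma(alpha, 1.0 / c)                                  # G0 ~ Gamma(alpha, c)  (rate c)
    v = rng.beta(1.0, alpha, size=num_sticks)                       # V_i ~ Beta(1, alpha)
    leftover = np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))    # prod_{j<i}(1 - V_j)
    return g0 * v * leftover

# Total mass is roughly a Gamma(alpha, c) draw, since the truncated sticks sum to ~1.
print(naive_gamma_weights(alpha=2.0, c=1.0, num_sticks=500).sum())
```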


5.2 The MCMC Sampler

As a baseline, we also derive and compare the variational algorithm with a standard MCMC sampler for this model. We use the construction in (4) for sampling from the model. To avoid inferring the latent variables in all the atom weights of the Poisson-Gamma prior, we use Monte Carlo techniques to integrate them out, as in Paisley et al. (2010). This affects posterior inference for the indicators z_nk, the round indicators d and the hyperparameters c and α. The posterior distribution for γ is closed form, as are those for the likelihood latent variables in Π_{−G}. The complete updates are described in the supplementary.

[Figure 1: Plots of held-out test likelihoods and per-iteration running times (best viewed in color). Plots (d), (e) and (f) are for PsyRev, KOS, and NYT respectively. Plots (b) and (c) are for the PsyRev dataset. Algorithm trace colors are common to all plots. See text for full details.]

6 Experiments

We consider the problem of learning latent topics in document corpora. Given an observed set of counts of vocabulary words in a set of documents, represented by, say, a V × N count matrix, where V is the vocabulary size and N the number of documents, we aim to learn K latent factors and their vocabulary realizations using Poisson factor analysis. In particular, we model the observed corpus count matrix D as D ∼ Poi(ΦI), where the V × K matrix Φ models the factor loadings, and the K × N matrix I models the actual factor counts in the documents.

We implemented and analyzed the performance of four variational algorithms corresponding to four different priors on I: the Poisson-gamma process prior from this paper (abbreviated hereafter as VGP), a Poisson-gamma prior using the naïve construction of the gamma process (VnGP), the Bernoulli-beta prior from Paisley et al. (2011) (VBP) and the IBP prior from Doshi-Velez et al. (2009) (VIBP), along with the MCMC sampler mentioned above (SGP). For the Bernoulli-beta priors we modeled I as I = W ◦ Z as in Paisley et al. (2011), where the nonparametric priors are put on Z and a vague Gamma prior is put on W. For the VGP and SGP models we set I = Z. In addition, for all four algorithms, we put a symmetric Dirichlet(β_1, ..., β_V) prior on the columns of Φ. We added corresponding variational distributions for the variables in the collection denoted as Π_{−G} above. We use held-out per-word test log-likelihoods and times required to update all variables in Π in each iteration as our comparison metrics, with 80% of the data used for training. We used the same likelihood metric as Zhou and Carin (2012), with the samples replaced by the expectations of the variational distributions.
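The following sketch (not the authors' code) generates a toy corpus from the D ∼ Poi(ΦI) model with I = Z, along the lines of the synthetic-data experiment described next; unlike that experiment, which fixes 200 atoms, the number of factors here is random because the weights are drawn from the truncated stick-breaking prior:

```python
import numpy as np

def synthetic_corpus(V=200, N=3000, alpha=1.0, c=1.0, gamma_mass=3.0,
                     num_rounds=30, dirichlet_beta=0.1, seed=0):
    """Toy corpus from the Poisson factor analysis model D ~ Poisson(Phi Z):
    gamma-process weights give Poisson rates for the K x N count matrix Z,
    and Phi has symmetric-Dirichlet columns over the V vocabulary terms."""
    rng = np.random.default_rng(seed)
    # Stick-breaking gamma-process weights (Theorem 1, truncated to num_rounds).
    weights = []
    for i in range(1, num_rounds + 1):
        c_i = rng.poisson(gamma_mass)
        e = rng.exponential(1.0 / c, size=c_i)
        t = rng.gamma(i, 1.0 / alpha, size=c_i)
        weights.append(e * np.exp(-t))
    g = np.concatenate(weights)                                # one weight per latent factor
    K = g.size
    Z = rng.poisson(g[:, None], size=(K, N))                   # factor counts per document
    Phi = rng.dirichlet(np.full(V, dirichlet_beta), size=K).T  # V x K factor loadings
    D = rng.poisson(Phi @ Z)                                   # V x N observed word counts
    return D, Phi, Z

D, Phi, Z = synthetic_corpus()
print(D.shape, Z.shape)
```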


Synthetic Data. As a warm-up, we consider the performances of VGP and SGP on some synthetic data generated from this model. We generate 200 weighted atoms from the gamma prior using the stick-breaking construction, and use the Poisson likelihood to generate 3000 values for each atom to yield the indicator matrix Z. We simulated a vocabulary of 200 terms, generated a 200×200 factor-loading matrix Φ using symmetric Dirichlet priors, and then generated D = Poi(ΦZ). For the VGP and VnGP, we measure the test likelihood after every iteration and average the results across 10 random restarts. These measurements are plotted in fig. 1a. As shown, VGP's measured held-out likelihood converges within 10 iterations. The SGP traceplot shows the first thirty held-out likelihoods measured after burn-in. Per-iteration times were 15 seconds and 2.36 minutes for VGP (with K=125) and SGP respectively. The SGP learned K online, with values oscillating around 50. SNBP refers to the Poisson-Gamma mixture ("NB process") sampler from Zhou and Carin (2012). Its traceplot shows the first 30 likelihoods measured after 1000 burn-in iterations. We see that it performed similarly to our algorithms, though slightly worse.

Real data. We used a similar framework to model the count data from the KOS¹, NIPS², Psychological Review (PsyRev)³, and New York Times¹ corpora. The vocabulary sizes are 2566, 13649, 6906 and 100872 respectively, while the document counts are 1281, 1740, 3430 and 300000 respectively. For each dataset, we ran all three variational algorithms with 10 random restarts each, measuring the held-out log-likelihoods and per-iteration runtimes for different values of the truncation factor K. The learning rates for gradient ascent updates were kept on the order of 10^{-4} for both VGP and VBP, with 5 gradient steps per iteration. A representative subset of results is shown in figs. 1b through 1f.

We used vague gamma priors on the hyperparameters α, γ and c in the variational algorithms, and improper (1) priors for the sampler. We found the test likelihoods to be independent of these initializations. The results for the variational algorithms were dependent on the Dirichlet prior β on Φ, as noted in fig. 1b. We therefore used the learned test likelihood after 100 iterations as a heuristic to select β. We found the three variational algorithms to attain very similar test likelihoods across all four datasets after a few hours of CPU time, with the VGP and VBP having a slight edge over the VIBP. The sampler somewhat unexpectedly did not attain a competitive score for any dataset, unlike the synthetic case. For instance, as shown in fig. 1c, it oscillated around -7.45 for the PsyRev dataset, whereas the variational algorithms attained -7.23. For comparison, the NB process sampler from Zhou and Carin (2012) attains -7.25 each iteration after 1000 iterations of burn-in. VnGP was the worst performer, with a stable log-likelihood of -7.85. Also, as seen in fig. 1c, VGP was faster to converge (in fewer than 10 iterations in ∼5 seconds) than VIBP and VBP (∼50 iterations each). The test log-likelihoods after a few hours of runtime were largely independent of the truncation K for the three variational algorithms. Behavior for the other datasets was similar.

Among the three variational algorithms, the VIBP scaled best for small to medium datasets as a function of the truncation factor due to all updates being closed-form, in spite of having to learn the additional weight matrix W. The VGP running times were competitive for small values of K on these datasets. However, on the large NYT dataset, VGP was orders of magnitude faster than the Bernoulli-beta algorithms (note the log scale in fig. 1f). For example, with a truncation of 100 atoms, VGP took around 45 seconds per iteration, whereas both VIBP and VBP took more than 3 minutes. The VBP scaled poorly for all datasets, as seen in figs. 1d through 1f. The reason for this is three-fold: learning the parameters for the additional matrix W, which is directly affected by dimensionality (also the reason for VIBP being slow on the NYT dataset); gradient updates for two variables (as opposed to one for VGP); and a Taylor approximation required for these gradient updates (see Paisley et al. (2011)). The sampler SGP required around 7 minutes per iteration for the small datasets and an hour and 40 minutes on average for NYT.

To summarize, we found the VGP to post running times that are competitive with the fastest algorithm (VIBP) on small to medium datasets, and to outperform the other methods completely on the large NYT dataset, all the while providing similar accuracy compared to the other variational algorithms, as measured by held-out likelihood. It was also the fastest to converge, typically taking fewer than 15 iterations. Compared with SGP, our variational method is substantially faster (particularly on large-scale data) and produces higher likelihood scores on real data.

7 Conclusion

We have described a novel stick-breaking representation for gamma processes and used it to derive a variational inference algorithm. This algorithm has been shown to be far more scalable for large datasets than related variational algorithms, while attaining similar accuracy and outperforming sampling-based methods. We expect that recent improvements to variational techniques can also be applied to our algorithm, potentially yielding even further scalability.

Acknowledgements

This work was supported by NSF award IIS-1217433.

¹ https://archive.ics.uci.edu/ml/datasets/Bag+of+Words
² http://www.stats.ox.ac.uk/~teh/data.html
³ http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm


References

Blei, D. and Jordan, M. (2003). Variational methods for Dirichlet process mixtures. Bayesian Analysis, 1:121–144.

Broderick, T., Mackey, L., Paisley, J., and Jordan, M. I. (2014). Combinatorial clustering and the beta negative binomial process. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Caron, F. and Fox, E. B. (2013). Bayesian Nonparametric Models of Sparse and Exchangeable Random Graphs. arXiv:1401.1137.

Caron, F., Teh, Y. W., and Murphy, B. T. (2013). Bayesian Nonparametric Plackett-Luce Models for the Analysis of Clustered Ranked Data. arXiv:1211.5037.

Doshi-Velez, F., Miller, K., Gael, J. V., and Teh, Y. W. (2009). Variational Inference for the Indian Buffet Process. In AISTATS.

Ferguson, T. (1973). A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1(2):209–230.

Gopalan, P., Ruiz, F., Ranganath, R., and Blei, D. (2014). Bayesian nonparametric Poisson factorization. In AISTATS.

Ishwaran, H. and James, L. F. (2000). Approximate Dirichlet Process Computing in Finite Normal Mixtures: Smoothing and Prior Information. Journal of Computational and Graphical Statistics, 11:508–532.

Ishwaran, H. and James, L. F. (2001). Gibbs Sampling Methods for Stick-Breaking Priors. Journal of the American Statistical Association, 96:161–173.

Jordan, M. I. (2010). Hierarchical Models, Nested Models and Completely Random Measures. In Chen, M.-H., Dey, D., Mueller, P., Sun, D., and Ye, K., editors, Frontiers of Statistical Decision Making and Bayesian Analysis: In Honor of James O. Berger. New York: Springer.

Kingman, J. (1967). Completely Random Measures. Pacific Journal of Mathematics, 21(1):59–78.

Kingman, J. F. C. (1993). Poisson Processes, volume 3 of Oxford Studies in Probability. Oxford University Press, New York.

Miller, K. (2011). Bayesian Nonparametric Latent Feature Models. PhD thesis, University of California at Berkeley.

Orbanz, P. and Williamson, S. (2012). Unit-rate Poisson representations of completely random measures.

Paisley, J., Blei, D. M., and Jordan, M. I. (2012). Stick-Breaking Beta Processes and the Poisson Process. In Artificial Intelligence and Statistics.

Paisley, J., Carin, L., and Blei, D. M. (2011). Variational Inference for Stick-Breaking Beta Process Priors. In International Conference on Machine Learning.

Paisley, J., Zaas, A., Woods, C. W., Ginsburg, G. S., and Carin, L. (2010). A Stick-Breaking Construction of the Beta Process. In International Conference on Machine Learning.

Roychowdhury, A. and Kulis, B. (2014). Gamma Processes, Stick-Breaking, and Variational Inference. arXiv:1410.1068.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650.

Teh, Y., Kurihara, K., and Welling, M. (2007a). Collapsed variational inference for HDP. In NIPS.

Teh, Y. W., Görür, D., and Ghahramani, Z. (2007b). Stick-breaking construction for the Indian buffet process. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 11.

Thibaux, R. (2008). Nonparametric Bayesian Models for Machine Learning. PhD thesis, University of California at Berkeley.

Titsias, M. (2008). The Infinite Gamma-Poisson Model. In Advances in Neural Information Processing Systems.

Wang, C., Paisley, J., and Blei, D. (2011). Online variational inference for the hierarchical Dirichlet process. In AISTATS.

Wolpert, R. and Ickstadt, K. (1998). Simulation of Lévy Random Fields. In Practical Nonparametric and Semiparametric Bayesian Statistics. Springer-Verlag.

Zhou, M. and Carin, L. (2012). Augment-and-conquer negative binomial processes. In NIPS.

Zhou, M., Hannah, L., Dunson, D., and Carin, L. (2012). Beta-negative binomial process and Poisson factor analysis. In AISTATS.
