Slice Sampling for General Completely Random Measures

Peiyuan Zhu, Alexandre Bouchard-Côté, Trevor Campbell
Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4

Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR volume 124, 2020.

Abstract

Completely random measures provide a principled approach to creating flexible unsupervised models, where the number of latent features is infinite and the number of features that influence the data grows with the size of the data set. Due to the infinitude of the latent features, posterior inference requires either marginalization—resulting in dependence structures that prevent efficient computation via parallelization and conjugacy—or finite truncation, which arbitrarily limits the flexibility of the model. In this paper we present a novel Markov chain Monte Carlo algorithm for posterior inference that adaptively sets the truncation level using auxiliary slice variables, enabling efficient, parallelized computation without sacrificing flexibility. In contrast to past work that achieved this on a model-by-model basis, we provide a general recipe that is applicable to the broad class of completely random measure-based priors. The efficacy of the proposed algorithm is evaluated on several popular nonparametric models, demonstrating a higher effective sample size per second compared to algorithms using marginalization, as well as higher predictive performance compared to models employing fixed truncations.

1 INTRODUCTION

In unsupervised data analysis, one aims to uncover complex latent structure in data. Traditionally, this structure has been assumed to take the form of a clustering, in which each data point is associated with exactly one latent category. Here we are concerned with a newer generation of unobserved structures in which each data point can be associated with any number of latent categories. When such a model can select each category either zero times or once, the latent categories are called features [1], whereas if each category can be selected with multiplicities, the latent categories are called traits [2].

Consider, for example, the problem of modelling movie ratings for a set of users. As a first rough approximation, an analyst may entertain a clustering over the movies and hope to automatically infer movie genres. Clustering in this context is limited; users may like or dislike movies based on many overlapping factors such as genre, actor, and score preferences. Feature models, in contrast, support inference of these overlapping movie attributes.

As the amount of data increases, one may hope to capture increasingly sophisticated patterns, and we therefore want the model to increase its complexity accordingly. In our movie example, this means uncovering more and more diverse user preference patterns and movie attributes from the growing number of registered users and new movie releases. Bayesian nonparametric (BNP) methods enable unbounded model capacity by positing infinite-dimensional prior distributions. These infinite-dimensional priors are designed so that for any given dataset only a finite number of latent parameters are utilized, making Bayesian nonparametric inference possible in practice.

The present work is concerned with developing efficient and flexible inference methods for a class of BNP priors called completely random measures (CRMs) [3], which are commonly used in practice [4–7]. In particular, CRMs provide a unified approach to the construction of BNP priors over both latent features and traits [8, 9].

Previous approaches to CRM posterior inference fall into two main types. First, some methods analytically marginalize the infinite-dimensional objects involved in CRMs [1, 10–14]. This has the disadvantage of making restrictive conjugacy assumptions and is moreover not amenable to parallelization. A second type of inference method, introduced by Blei and Jordan [15], instead uses a fixed truncation of the infinite model. However, this strategy is at odds with the motivation behind BNP, namely its ability to learn model capacity as part of the inferential procedure. Campbell et al. [16] provide a priori error bounds on such truncations, but it is not obvious how to extend these bounds to approximation errors on the posterior distribution.

Our method is based on slice sampling, a family of methods first used in a BNP context by [17]. Slice samplers have advantages over both marginalization and truncation techniques: they do not require conjugacy, enable parallelization, and target the exact nonparametric posterior distribution. But while there is a rich literature on sampling methods for BNP latent feature and trait models—e.g., the Indian buffet / beta-Bernoulli process [13, 18], hierarchies thereof [10], normalized CRMs [12, 19], the beta-negative binomial process [14, 20], the generalized gamma process [21], the gamma-Poisson process [11], and more—these have often been developed on a model-by-model basis.

In contrast to these past model-specific techniques, we develop our sampler based on a series representation of the Lévy process [22] underlying the CRM. In a fashion similar to [23], we introduce auxiliary variables that adaptively truncate the series representation; only finitely many latent features are updated in each Gibbs sweep. The representation that we utilize factorizes the weights of the CRM into a transformed Poisson process with independent and identically distributed marks, thereby turning the sampling problem into evaluating the mean measure of a marginalized Poisson point process over a zero-set.

The remainder of the paper is organized as follows: Section 2 introduces the general model that we consider, series representations of CRMs, and posterior inference via marginalization and truncation. Section 3 discusses our main contributions, including model augmentation and slice sampling. Section 4 demonstrates how the methodology can be applied to two popular latent feature models. Finally, in Section 5 we compare our method against several state-of-the-art samplers for these models on both real and synthetic datasets.

2 BACKGROUND

2.1 MODEL

In the standard Bayesian nonparametric latent trait model [16], we are given a data set of random observations $(Y_n)_{n=1}^N$ generated using an infinite collection of latent traits $(\psi_k)_{k=1}^\infty$, $\psi_k \in \Psi$, with corresponding rates $(\theta_k)_{k=1}^\infty$, $\theta_k \in \mathbb{R}_+ := [0, \infty)$. We assume each data point $Y_n$, $n \in [N] := \{1, \dots, N\}$, is influenced by each trait $\psi_k$ in an amount corresponding to an integer count $X_{nk} \in \mathbb{N}_0 := \{0, 1, 2, \dots\}$ via

$$X_{nk} \overset{\text{indep}}{\sim} h(\cdot\,; \theta_k), \quad n, k \in [N] \times \mathbb{N}, \qquad Y_n \overset{\text{indep}}{\sim} f\Big(\cdot\,; \sum_{k=1}^\infty X_{nk}\delta_{\psi_k}\Big), \quad n \in [N],$$

where $\delta(\cdot)$ denotes a Dirac delta measure, $h$ is a distribution on $\mathbb{N}_0$, and $f$ is a distribution on the space of observations. Note that each data point $y_n$ is influenced only by those traits $\psi_k$ for which $x_{nk} > 0$, and the value of $x_{nk}$ denotes the amount of influence.

To generate the infinite collection of $(\psi_k, \theta_k)$ pairs, we use a Poisson point process [24] on the product space of traits and rates $\Psi \times \mathbb{R}_+$ with $\sigma$-finite mean measure $\mu$,

$$\{\psi_k, \theta_k\}_{k=1}^\infty \sim \mathrm{PP}(\mu), \qquad \mu(\Psi \times \mathbb{R}_+) = \infty. \tag{1}$$

Equivalently, this process can be formulated as a completely random measure (CRM) [3] on the space of traits $\Psi$ by placing a Dirac measure at each $\psi_k$ with weight $\theta_k$,¹

$$\sum_{k=1}^\infty \theta_k \delta_{\psi_k} \sim \mathrm{CRM}(\mu). \tag{2}$$

In Bayesian nonparametric modelling, the traits are typically generated independently of the rates, i.e.,

$$\mu(d\theta, d\psi) = \nu(d\theta)\, H(d\psi),$$

where $H$ is a probability measure on $\Psi$, and $\nu$ is a $\sigma$-finite measure on $\mathbb{R}_+$. In order to guarantee that the CRM has infinitely many atoms, we require that $\nu$ satisfies

$$\nu(\mathbb{R}_+) = \infty,$$

and in order to guarantee that each observation $y_n$ is influenced by only finitely many traits $\psi_k$ having $x_{nk} \neq 0$ a.s., we require that

$$\mathbb{E}\Big(\sum_{k=1}^\infty \mathbb{1}\{X_{nk} \neq 0\}\Big) = \int \big(1 - h(0 \mid \theta)\big)\, \nu(d\theta) < \infty.$$

To summarize, the model we consider in this paper is:

$$\sum_{k=1}^\infty \theta_k \delta_{\psi_k} \sim \mathrm{CRM}(\nu \times H), \qquad X_{nk} \overset{\text{indep}}{\sim} h(\cdot\,; \theta_k), \quad n, k \in [N] \times \mathbb{N}, \qquad Y_n \overset{\text{indep}}{\sim} f\Big(\cdot\,; \sum_k X_{nk}\delta_{\psi_k}\Big), \quad n \in [N]. \tag{3}$$

¹More generally, CRMs are the sum of a deterministic measure, an atomic measure with fixed atom locations, and a Poisson point process-based measure as in Eq. (2). In BNP models, there is typically no deterministic component, and the fixed-location atomic component has finitely many atoms, posing no challenge in posterior inference. Thus we focus only on the infinite Poisson point process-based component in this paper.

2.2 SEQUENTIAL REPRESENTATION

While the specification of $\{\psi_k, \theta_k\}_k$ as a Poisson point process is mathematically elegant, it does not lend itself immediately to computation. For this purpose—since there are infinitely many atoms—we require a way of generating them one-by-one in a sequence using familiar finite-dimensional distributions; this is known as a sequential representation of the CRM. While there are many such representations (see [16] for an overview), here we will employ the general class of series representations, which simulate the traits $\psi_k$ and rates $\theta_k$ via

$$E_j \overset{\text{i.i.d.}}{\sim} \mathrm{Exp}(1), \qquad \Gamma_k = \sum_{j=1}^k E_j, \qquad V_k \overset{\text{i.i.d.}}{\sim} G, \qquad \theta_k = \tau(V_k, \Gamma_k), \qquad \psi_k \overset{\text{i.i.d.}}{\sim} H, \tag{4}$$

where $\Gamma_k$ are the ordered jumps of a homogeneous, unit-rate Poisson process on $\mathbb{R}_+$, $G$ is a probability distribution on $\mathbb{R}_+$, and $\tau: \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$ is a nonnegative measurable function such that $\lim_{u \to \infty} \tau(v, u) = 0$ for $G$-almost every $v$. For each mean measure $\mu$ in Eq. (1), there are many choices of $G$ and $\tau$ that together yield a valid series representation for $\mathrm{PP}(\mu)$, such as the inverse-Lévy representation [25], the Bondesson representation [26], the rejection representation [22], etc.
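As a concrete illustration of Eq. (4) (not part of the original paper), the following minimal Python sketch generates the first $K$ atoms of a series representation; the particular $\tau$ and $G$ in the usage example correspond to the Bondesson representation of the beta process used later in Section 4, and all function names are illustrative.

```python
import numpy as np

def series_representation_atoms(K, tau, sample_V, sample_psi, rng):
    """Generate the first K atoms (psi_k, theta_k) of a CRM series
    representation (Eq. 4): Gamma_k are the ordered jumps of a unit-rate
    Poisson process, V_k ~ G i.i.d., and theta_k = tau(V_k, Gamma_k)."""
    E = rng.exponential(1.0, size=K)      # E_j ~ Exp(1)
    Gamma = np.cumsum(E)                  # Gamma_k = E_1 + ... + E_k
    V = sample_V(K)                       # V_k ~ G
    theta = tau(V, Gamma)                 # rates theta_k
    psi = sample_psi(K)                   # psi_k ~ H
    return psi, theta, V, Gamma

# Usage: Bondesson representation of the beta process (Section 4),
# tau(v, gamma) = v * exp(-gamma / (lambda * alpha)), G = Beta(1, lambda - 1),
# with an illustrative Gaussian trait distribution H.
rng = np.random.default_rng(0)
lam, alpha, d = 1.1, 1.0, 5
psi, theta, V, Gamma = series_representation_atoms(
    K=50,
    tau=lambda v, g: v * np.exp(-g / (lam * alpha)),
    sample_V=lambda K: rng.beta(1.0, lam - 1.0, size=K),
    sample_psi=lambda K: rng.normal(0.0, 1.0, size=(K, d)),
    rng=rng,
)
```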

2.3 POSTERIOR INFERENCE

Posterior inference in the BNP model Eq. (3) is complicated by the presence of infinitely many traits $\psi_k$ and rates $\theta_k$, as the application of traditional MCMC and variational procedures would require infinite computation and memory. Past work has handled this issue in two ways: marginalization and fixed truncation.

Marginalization: In a wide variety of CRM-based models, it is possible to analytically integrate out the latent rates and traits [9], thus expressing the model in terms of only the i.i.d. traits $\psi_k$ and a sequence of conditional distributions for the assignments $X_n$,

$$\psi_k \overset{\text{i.i.d.}}{\sim} H, \quad k \in \mathbb{N}, \qquad X_n \sim P(X_n = \cdot \mid X_{1:n-1}), \quad n \in [N], \qquad Y_n \overset{\text{indep}}{\sim} f\Big(\cdot\,; \sum_k X_{nk}\delta_{\psi_k}\Big), \quad n \in [N].$$

Using the exchangeability of the sequence $(X_n)_{n=1}^N$, Gibbs sampling [27] algorithms can be derived that alternate between sampling $X_n$ for each $n \in [N]$ and sampling $\psi_k$ for each of the (finitely many) "active traits" $k$ such that $\sum_n X_{nk} > 0$. However, because each $X_n$ must be sampled conditioned on $X_{-n}$, these methods cannot be parallelized across $n$, making them computationally expensive with large amounts of data.

Fixed truncation: Another option for posterior inference is to truncate a sequential representation of the CRM such that it generates finitely many traits, i.e.,

$$(\psi_k, V_k, \Gamma_k)_{k=1}^K \sim \text{Eq. (4)}, \qquad X_{nk} \overset{\text{indep}}{\sim} h(\cdot\,; \tau(V_k, \Gamma_k)), \quad n, k \in [N] \times [K], \qquad Y_n \overset{\text{indep}}{\sim} f\Big(\cdot\,; \sum_{k=1}^K X_{nk}\delta_{\psi_k}\Big), \quad n \in [N].$$

Because there are only finitely many traits and rates in this model, it is not difficult to develop MCMC and variational algorithms [28] that iterate between updating the rates $(\theta_k)_{k=1}^K$, the traits $(\psi_k)_{k=1}^K$, and then the assignments $(X_n)_{n=1}^N$. Further, the independence of the assignments across observations $n$ conditioned on the rates and traits enables computationally efficient parallelization of the $X$ update. However, the major drawback of this approach is that the error incurred by truncation is unknown; previous work provides bounds on the total variation distance between the truncated and infinite data marginal distributions [16, 29–31], but the error incurred in the posterior distribution is unknown.

3 SLICE SAMPLING FOR CRMs

In this section, we employ an adaptive truncation of general CRM series representations to obtain both the computational efficiency of truncated methods and the statistical correctness of approaches based on marginalization. In Section 3.1 we first add an auxiliary variable for each observation $n$ that truncates the full conditional distribution of its underlying assignments $X_n$. Section 3.2 provides a slice sampling scheme for the augmented model, resulting in a truncation that adapts from iteration to iteration.

3.1 AUGMENTED MODEL

We begin by augmenting the model Eq. (3) with auxiliary variables $(U_n)_{n=1}^N$ that truncate the full conditional distributions of the assignments $X_n$. In particular, suppose we fix the assignments for observations other than $n$ (denoted $X_{-n}$), the CRM variables $\psi, V, \Gamma$, and the auxiliary variables $U$. Then we require, for some $T < \infty$,

$$\forall k > T, \quad P(X_{nk} > 0 \mid X_{-n}, \psi, V, \Gamma, U) = 0.$$

Past model augmentations in Bayesian nonparametrics have largely required either the normalization of the random measure [12, 17, 19] or a particular sequential representation that guarantees strictly decreasing values of $\theta_k$ [18]. In the present setting of general, unnormalized CRMs, we cannot take advantage of either of these facts. We therefore take an approach inspired by [23] for augmenting the model. In particular, for each observation $n \in [N]$, define its maximum active index

$$k_n := \max \{k \in \mathbb{N} : X_{nk} > 0\} \cup \{0\},$$

and let $\xi: \mathbb{N}_0 \to \mathbb{R}_+$ be a monotone decreasing sequence such that $\lim_{n \to \infty} \xi(n) = 0$. Then we add a uniform random slice variable $U_n$ lying in the interval $[0, \xi(k_n)]$ to the model for each observation $n$, i.e.,

$$(\psi_k, V_k, \Gamma_k)_{k=1}^\infty \sim \text{Eq. (4)}, \qquad X_{nk} \overset{\text{indep}}{\sim} h(\cdot\,; \tau(V_k, \Gamma_k)), \quad n, k \in [N] \times \mathbb{N},$$
$$U_n \overset{\text{indep}}{\sim} \mathrm{Unif}[0, \xi(k_n)], \quad n \in [N], \qquad Y_n \overset{\text{indep}}{\sim} f\Big(\cdot\,; \sum_k X_{nk}\delta_{\psi_k}\Big), \quad n \in [N]. \tag{5}$$

The variables $(U_n)_{n=1}^N$ do not change the posterior marginal of interest on $X, \psi, \theta$, but do provide computational benefits. In particular, the full conditional distribution of $X_n$ based on Eq. (5) sets $X_{nk} = 0$ for any $k$ such that $\xi(k) < U_n$. Thus, the truncation level for each $X_n$ will adapt as $U_n$ changes from iteration to iteration. Further, slice sampling in Eq. (5) requires only finite memory and computation, since we need to store and simulate only those finitely many $\psi_k, V_k, \Gamma_k$ such that $\xi(k) \geq \min_n U_n$ at each iteration. The ability to instantiate the latent $\psi_k, V_k, \Gamma_k$ variables has many advantages; e.g., we can leverage the independence of $X_n$ for parallelization without sacrificing the fidelity of the model. The augmented probabilistic model is depicted in Fig. 1.

Figure 1: Probabilistic graphical model based on the series representation of the CRM, in plate notation, after augmentation with the auxiliary variables $U$.
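To make the role of the slice variables concrete, the following short Python sketch (not from the paper) samples $U_n$ and computes the adaptive truncation level for the particular choice $\xi(k) = \exp(-k/\Delta_\xi)$ used in Section 4; all names are illustrative.

```python
import numpy as np

def xi(k, delta_xi=1.0):
    """Monotone decreasing slice sequence xi(k) = exp(-k / delta_xi)."""
    return np.exp(-np.asarray(k, dtype=float) / delta_xi)

def sample_slices(X, delta_xi, rng):
    """Sample U_n ~ Unif[0, xi(k_n)] for each observation and return the
    adaptive truncation level K = max{k : xi(k) >= min_n U_n}."""
    active = X > 0
    # k_n: largest (1-based) column index with X_nk > 0, or 0 if the row is empty
    k_n = np.where(active.any(axis=1),
                   X.shape[1] - np.argmax(active[:, ::-1], axis=1), 0)
    U = rng.uniform(0.0, xi(k_n, delta_xi))
    # xi(k) >= u  <=>  k <= -delta_xi * log(u)
    K = int(np.floor(-delta_xi * np.log(U.min())))
    return U, k_n, K
```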

3.2 SLICE SAMPLING

In this section, we develop a slice sampling scheme for the augmented model Eq. (5) that iteratively simulates from each full conditional distribution. The state of the Markov chain that we construct is infinite-dimensional, consisting of $(X_{nk})_{n \in [N], k \in \mathbb{N}}$, $(\psi_k, V_k, \Gamma_k)_{k \in \mathbb{N}}$, and $(U_n)_{n \in [N]}$. Due to the augmentation in Eq. (5), however, only finitely many of these variables need to be stored or simulated during any iteration of the algorithm. The particular steps follow; note that the order of the steps is important.

Initialization: Set the assignment variables $X = 0$, the global truncation levels $K = K_{\text{prev}} = 0$, and, for all $n \in [N]$, the local truncation levels $k_n = k_n' = 0$. Run this step only a single time at the beginning of the algorithm.

Sample $U$: For $n \in [N]$, draw $U_n \overset{\text{indep}}{\sim} \mathrm{Unif}[0, \xi(k_n)]$.

Update global truncation level: Set

$$K_{\text{prev}} \leftarrow \max_{n \in [N]} k_n, \qquad K \leftarrow \max\Big\{k \in \mathbb{N}_0 : \xi(k) \geq \min_n U_n\Big\}.$$

In other words, $K$ is the maximum index that observations might activate in this iteration, and $K_{\text{prev}}$ is the maximum index that observations activated in the previous iteration. Note that $\Gamma_k, V_k, \psi_k$ are all guaranteed to be instantiated for $k = 1, \dots, K_{\text{prev}}$.

Sample $\psi$: For each $k \in [K]$, sample $\psi_k$ from its full conditional distribution, with measure proportional to

$$H(d\psi_k) \prod_{n=1}^N f\Big(Y_n;\, X_{nk}\delta_{\psi_k} + \sum_{j \neq k} X_{nj}\delta_{\psi_j}\Big).$$

If possible, $(\psi_k)_{k=1}^K$ should instead be sampled jointly from the same density above. Further, if $f$ is not conjugate to the prior measure $H$, this step may be conducted using Metropolis-Hastings. Since the remaining values $(\psi_k)_{k=K+1}^\infty$ will not influence the remainder of this iteration, they do not need to be simulated.

Sample $V, \Gamma$: This step is split into two substeps: first sample $(V_k, \Gamma_k)_{k=1}^{K_{\text{prev}}-1}$ each from their own full conditional, and then sample $(V_k, \Gamma_k)_{k=K_{\text{prev}}}^\infty$ as a single block. Define $\Gamma_0 := 0$ for notational convenience.

Substep 1: Note that $V_k, \Gamma_k$ are conditionally independent of all other variables given $(X_{nk})_{n=1}^N$, $\Gamma_{k-1}$, and $\Gamma_{k+1}$. Thus, for each such $k$, we generate the pair $V_k, \Gamma_k$ from its full conditional, with measure proportional to

$$G(dV_k)\, d\Gamma_k\, \mathbb{1}[\Gamma_{k-1} \leq \Gamma_k \leq \Gamma_{k+1}] \prod_{n=1}^N h(X_{nk}; \tau(V_k, \Gamma_k)).$$

Substep 2: If $K_{\text{prev}} > K$, this step is skipped. Otherwise, for each $k \in \{K_{\text{prev}}, \dots, K\}$ in increasing order, we generate $V_k, \Gamma_k$ conditioned on $V_{1:k-1}, \Gamma_{1:k-1}$ and the remaining variables. In particular, note that $V_k, \Gamma_k$ are conditionally independent of all other variables given $(X_{nj})_{n \in [N], j \geq k}$ and $\Gamma_{k-1}$. Thus, we generate $V_k, \Gamma_k$ from a measure proportional to

$$G(dV_k)\, d\Gamma_k\, \exp(-\Gamma_k)\, \mathbb{1}[\Gamma_k \geq \Gamma_{k-1}] \cdot P\big((X_{nj})_{n \in [N], j > k} = 0 \mid \Gamma_k\big) \prod_{n=1}^N h(X_{nk}; \tau(V_k, \Gamma_k)).$$

Again, since the remaining values $(V_k, \Gamma_k)_{k=K+1}^\infty$ will not influence the remainder of this iteration, they do not need to be simulated and can be safely ignored.

In order to evaluate $P\big((X_{nj})_{n \in [N], j > k} = 0 \mid \Gamma_k\big)$, we use the machinery of Poisson point processes. In particular, note that the independent generation of $X_{nk}$ given $V_k, \Gamma_k$ ensures that $\Pi := \{\Gamma_j, V_j, (X_{nj})_{n \in [N]}\}_{j > k}$ conditioned on $\Gamma_k$ is itself a Poisson point process on the joint space, by repeated use of the marking theorem [24, Ch. 5.2]. Therefore, this probability can be written as the probability that $\Pi$ has no atoms with a nonzero integer component:

$$P\big((X_{nj})_{n \in [N], j > k} = 0 \mid \Gamma_k\big) = P\big(\Pi \cap \big(\mathbb{R}_+^2 \times (\mathbb{N}_0^N \setminus \{0\})\big) = \emptyset \mid \Gamma_k\big),$$

which itself can be written explicitly using the fact that the number of atoms of a Poisson point process in any set has a Poisson distribution:

$$= \exp\Big(-\int_{\gamma \geq \Gamma_k} \sum_{x \in \mathbb{N}_0^N \setminus \{0\}} \prod_{n=1}^N h\big(x_n \mid \tau(v, \gamma)\big)\, G(dv)\, d\gamma\Big) = \exp\Big(-\int_{\gamma \geq \Gamma_k} \Big(1 - h\big(0 \mid \tau(v, \gamma)\big)^N\Big)\, G(dv)\, d\gamma\Big). \tag{6}$$

If this expression cannot be evaluated exactly but an upper bound is available, then rejection sampling may be used. It is worth noting that Eq. (6) appears in past bounds on the error incurred by truncating completely random measures [16, Eqn 4.2]; here we bridge the connection between truncation and slice sampling: if a tight bound on the truncation error exists, then an efficient rejection sampling scheme can be developed accordingly. Prior to inference, the integral in Eq. (6) can be precomputed for a range of values of $\Gamma_k$; afterward, it can be evaluated via interpolation.
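As an illustration (not from the paper), the zero-probability in Eq. (6) can be evaluated by direct numerical integration; the sketch below assumes SciPy and, in the usage example, the beta process with Bernoulli likelihood from Section 4.1, where $\tau(v, \gamma) = v e^{-\gamma/c}$, $h(0 \mid \theta) = 1 - \theta$, and $G = \delta_1$. The double-integral branch assumes $G$ has a density supported on $[0, 1]$, as for the beta process; all names are illustrative.

```python
import numpy as np
from scipy import integrate

def log_zero_prob(Gamma_k, N, h0, tau, G_density=None):
    """log of Eq. (6): log P(no active atoms beyond Gamma_k)
       = -integral_{gamma >= Gamma_k} integral (1 - h(0 | tau(v, gamma))^N) G(dv) dgamma.
    h0(theta) = h(0 | theta). If G_density is None, G = delta_1 is assumed;
    otherwise G_density is a density on [0, 1] (e.g. Beta(1, lambda - 1))."""
    if G_density is None:
        val, _ = integrate.quad(lambda g: 1.0 - h0(tau(1.0, g)) ** N,
                                Gamma_k, np.inf)
    else:
        val, _ = integrate.dblquad(
            lambda v, g: (1.0 - h0(tau(v, g)) ** N) * G_density(v),
            Gamma_k, np.inf, 0.0, 1.0)
    return -val

# Usage: beta process + Bernoulli likelihood (Section 4.1)
c, N = 1.0, 100
lp = log_zero_prob(Gamma_k=2.0, N=N,
                   h0=lambda t: 1.0 - t,
                   tau=lambda v, g: v * np.exp(-g / c))
```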
Thus, we generate Vk, Γk   from a measure proportional to X f Yn; Xnkδψk + Xnjδψj  h (Xnk|τ (Vk, Γk)) G(dVk)dΓk exp (−Γk) 1 [Γk ≥ Γk−1] · j6=k    −1 h  i ˆ 1 ˆ P (Xnj)n∈[N],j>k = 0|Γk · ·ξ kn(Xnk) Un ≤ ξ kn (Xnk) , N Y ˆ h (Xnk; τ(Vk, Γk)) where the function kn : N0 → N0 is defined by n=1  0 kn x = 0, k = kn ∞ ˆ  Again, since the remaining values (Vk, Γk)k=K+1 will kn(x) := k x > 0, k > kn , not influence the remainder of this iteration, they do not  kn otherwise need to be simulated and can be safely ignored.   i.e., it computes what kn would be as we vary the value In order to evaluate P (Xnj)n∈[N],j>k = 0 | Γk , we of Xnk. Note that this step can be parallelized across the use the machinery of Poisson point processes. In partic- N data points. Further, note that (Xnk)n∈[N],k>K are all ular, note that the independent generation of Xnk given guaranteed to be 0 due to the truncation at level K, and so Vk, Γk ensures that Π := {Γj,Vj,Xnj}n∈[N],j>k con- do not need to be simulated. If the density is intractable, ditioned on Γk is itself a Poisson point process on the we can use Metropolis-Hasting within Gibbs for this step. Algorithm 1 Slice sampling for general CRMs where PP(1) is shorthand for a unit-rate homogeneous 1: procedure SLICE SAMPLER(y, MK, f, h, τ, G) Poisson process on [0, ∞). The beta process can be paired 0 with a Bernoulli likelihood 2: Initialize X,Kprev, K, k, k 3: for m ← 1, ··· ,M do: h(x|θ) = θx (1 − θ)1−x , x ∈ {0, 1}, 4: Sample Un for n ∈ [N] or negative binomial likelihood, 5: Update K,Kprev 6: Resize X, ψ, Γ,V to K x + r − 1 h(x|θ, r) = (1 − θ)r θx, x ∈ ∪ {0}. 7: Sample ψk for k ∈ [K] x N 8: for k ← 1, ··· ,K do In both of the following examples, we set the monotone 9: if k < Kprev then: sequence ξ(k) to ξ(k) = exp (−k/∆ξ), where ∆ξ > 0 10: Sample Γk,Vk|Γk−1, Γk+1 11: else: is a hyperparameter to be tuned in each case. 12: Sample Γk,Vk|Γk−1 13: end if 4.1 BETA-BERNOULLI FEATURE MODEL 14: end for Given a dataset of observations y ∈ d, n = 1,...,N, 15: Sample X for n ∈ [N], k ∈ [K] n R nk the beta-Bernoulli latent feature model aims to uncover a 16: Update k, k0 collection of latent features ψ ∈ d and binary assign- 17: end for k R ments X ∈ {0, 1} of data to features responsible for 18: end procedure nk generating the observations: ∞ {Γk}k=1 ∼ PP (1) Update local truncation levels For each n ∈ [N],   Γ  X indep∼ Bern exp − k , n ∈ [N] compute the first and second maximum active indices, nk c i.i.d. 2  kn ← max{k ∈ N : Xnk > 0} ∪ {0} ψk ∼ N 0, σ0I , k ∈ N k0 ← max{k < k : X > 0} ∪ {0}. ! n n nk indep X 2 yn ∼ N Xnkψk, σ I , n ∈ [N]. k Most probablistic graphical models with CRM component 2 2 can be converted to the form of 3. This algorithm then We assume the hyperparameters σ0, σ , and c are given. takes the input of a probablistic graphical model with We don’t need to sample V because here G = δ1. samplers that sample from the conditional distributions and pointers to the variables Y , X, and ψ. This procedure Sample X: For each n ∈ [N] and k ∈ [K], we sample is shown in pseudocode in Algorithm 1. Xnk = r ∈ {0, 1} with probability proportional to  ∞  X 2 4 APPLICATIONS N yn; Xnjψj + rψk, σ I · j=1,j6=k    In this section, we show how the general slice sampler Γk  ˆ  Bern r; exp − Unif Un; 0, ξ kn(r) , from Section 3 can be applied to two popular Bayesian c nonparametric models: the beta-Bernoulli latent feature where N (·; ... ), Bern(·; ... ), and Unif(·; ... 
4 APPLICATIONS

In this section, we show how the general slice sampler from Section 3 can be applied to two popular Bayesian nonparametric models: the beta-Bernoulli latent feature model [1, 32] and the beta-negative binomial (BNB) combinatorial clustering model [20]. In particular, we provide the details of each of the sampling steps from Section 3.2 based on the Bondesson series representation [26] of the beta process. The beta process with mass parameter $\alpha$ and shape parameter $\lambda$ has rate measure

$$\nu(d\theta) = \lambda\alpha\theta^{-1}(1-\theta)^{\lambda-1}\, d\theta,$$

and has a Bondesson representation [16, 26] for the rates,

$$\theta_k = V_k \exp\big(-\Gamma_k/(\lambda\alpha)\big), \qquad (\Gamma_k)_{k=1}^\infty \sim \mathrm{PP}(1), \qquad V_k \overset{\text{i.i.d.}}{\sim} \mathrm{Beta}(1, \lambda-1),$$

where $\mathrm{PP}(1)$ is shorthand for a unit-rate homogeneous Poisson process on $[0, \infty)$. The beta process can be paired with a Bernoulli likelihood,

$$h(x \mid \theta) = \theta^x (1-\theta)^{1-x}, \quad x \in \{0, 1\},$$

or a negative binomial likelihood,

$$h(x \mid \theta, r) = \binom{x+r-1}{x} (1-\theta)^r \theta^x, \quad x \in \mathbb{N} \cup \{0\}.$$

In both of the following examples, we set the monotone sequence $\xi(k)$ to $\xi(k) = \exp(-k/\Delta_\xi)$, where $\Delta_\xi > 0$ is a hyperparameter to be tuned in each case.

4.1 BETA-BERNOULLI FEATURE MODEL

Given a dataset of observations $y_n \in \mathbb{R}^d$, $n = 1, \dots, N$, the beta-Bernoulli latent feature model aims to uncover a collection of latent features $\psi_k \in \mathbb{R}^d$ and binary assignments $X_{nk} \in \{0, 1\}$ of data to features responsible for generating the observations:

$$\{\Gamma_k\}_{k=1}^\infty \sim \mathrm{PP}(1), \qquad X_{nk} \overset{\text{indep}}{\sim} \mathrm{Bern}\Big(\exp\Big(-\frac{\Gamma_k}{c}\Big)\Big), \quad n \in [N],$$
$$\psi_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_0^2 I), \quad k \in \mathbb{N}, \qquad y_n \overset{\text{indep}}{\sim} \mathcal{N}\Big(\sum_k X_{nk}\psi_k,\, \sigma^2 I\Big), \quad n \in [N].$$

We assume the hyperparameters $\sigma_0^2$, $\sigma^2$, and $c$ are given. We do not need to sample $V$ because here $G = \delta_1$.

Sample $X$: For each $n \in [N]$ and $k \in [K]$, we sample $X_{nk} = r \in \{0, 1\}$ with probability proportional to

$$\mathcal{N}\Big(y_n;\, \sum_{j \neq k} X_{nj}\psi_j + r\psi_k,\, \sigma^2 I\Big) \cdot \mathrm{Bern}\Big(r;\, \exp\Big(-\frac{\Gamma_k}{c}\Big)\Big) \cdot \mathrm{Unif}\big(U_n;\, 0, \xi(\hat{k}_n(r))\big),$$

where $\mathcal{N}(\cdot\,; \dots)$, $\mathrm{Bern}(\cdot\,; \dots)$, and $\mathrm{Unif}(\cdot\,; \dots)$ are the density functions of the respective distributions. One may parallelize this sampling step across $n \in [N]$ due to the introduction of the auxiliary variables.
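The Sample $X$ step above can be written as a small routine; the sketch below (not from the paper) updates one row $X_n$ with the slice-thresholded Bernoulli probabilities, assuming $\xi(k) = \exp(-k/\Delta_\xi)$ and 0-based array indexing; all names are illustrative.

```python
import numpy as np

def xi(k, delta_xi):
    return np.exp(-float(k) / delta_xi)

def active_indices(Xn):
    """Return (k_n, k_n'): the largest and second-largest active 1-based indices."""
    idx = np.flatnonzero(Xn) + 1
    return (int(idx[-1]) if idx.size else 0,
            int(idx[-2]) if idx.size > 1 else 0)

def sample_Xn(yn, Xn, psi, Gamma, Un, c, sigma, delta_xi, rng):
    """Gibbs update of the assignment row X_n in the beta-Bernoulli model."""
    K = len(Gamma)
    for k in range(1, K + 1):
        kn, kn2 = active_indices(Xn)
        others = Xn @ psi - Xn[k - 1] * psi[k - 1]     # sum_{j != k} X_nj psi_j
        p = np.exp(-Gamma[k - 1] / c)                  # Bernoulli probability
        logp = np.full(2, -np.inf)
        for r in (0, 1):
            # khat_n(r): what k_n would become if X_nk were set to r
            if r == 0 and k == kn:
                khat = kn2
            elif r == 1 and k > kn:
                khat = k
            else:
                khat = kn
            if Un > xi(khat, delta_xi):
                continue                               # slice constraint violated
            logp[r] = (-0.5 * np.sum((yn - others - r * psi[k - 1]) ** 2) / sigma**2
                       + (np.log(p) if r == 1 else np.log1p(-p))
                       - np.log(xi(khat, delta_xi)))
        probs = np.exp(logp - logp.max())
        Xn[k - 1] = rng.choice(2, p=probs / probs.sum())
    return Xn
```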

Sample $\psi$: Using the conjugacy of the feature prior and data likelihood, we sample all the features $(\psi_k)_{k=1}^K$ simultaneously,

$$Q = X^T X + \frac{\sigma^2}{\sigma_0^2} I, \qquad \psi \sim \mathcal{MN}\big(\psi;\, Q^{-1} X^T y,\, Q^{-1},\, \sigma^2 I\big),$$

where $\mathcal{MN}(\cdot\,; \dots)$ is the density function for the matrix normal distribution [33].
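A minimal sketch of this joint conjugate update is given below (not from the paper); it draws from the matrix normal by exploiting the Cholesky factor of $Q$, with NumPy/SciPy assumed and illustrative names.

```python
import numpy as np
from scipy.linalg import cho_solve, solve_triangular

def sample_psi_conjugate(X, y, sigma, sigma0, rng):
    """Jointly sample psi ~ MN(Q^{-1} X^T y, Q^{-1}, sigma^2 I) with
    Q = X^T X + (sigma^2 / sigma0^2) I (conjugate update of Section 4.1)."""
    K, d = X.shape[1], y.shape[1]
    Q = X.T @ X + (sigma**2 / sigma0**2) * np.eye(K)
    L = np.linalg.cholesky(Q)                     # Q = L L^T
    mean = cho_solve((L, True), X.T @ y)          # Q^{-1} X^T y
    Z = rng.standard_normal((K, d))
    # A = L^{-T} satisfies A A^T = Q^{-1}; the column covariance is sigma^2 I
    return mean + sigma * solve_triangular(L.T, Z, lower=False)
```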

Sample $\Gamma$ (substep 1): The full conditional distribution of $\Gamma_k$ has density proportional to

$$e^{-m_k\Gamma_k/c}\big(1 - e^{-\Gamma_k/c}\big)^{N-m_k}\, \mathbb{1}[\Gamma_{k-1} \leq \Gamma_k \leq \Gamma_{k+1}],$$

where $m_k := \sum_{n=1}^N X_{nk}$. Rather than simulating from this density exactly—which would require expensive iterative numerical integration—we use Metropolis-Hastings to sample $\Gamma_k$, with proposal

$$W \sim \mathrm{Unif}[-\Delta_\Gamma, \Delta_\Gamma], \qquad \Gamma_k' = W + \max\big(\min(\Gamma_k,\, \Gamma_{k+1} - \Delta_\Gamma),\, \Gamma_{k-1} + \Delta_\Gamma\big),$$

for step size $\Delta_\Gamma > 0$, where $\Delta_\Gamma = \frac{\Gamma_{k+1} - \Gamma_{k-1}}{n_\Gamma}$ is obtained by dividing the interval length into $n_\Gamma$ pieces.
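A sketch of this Metropolis-Hastings step is shown below (not from the paper). It assumes the array `G` of jumps is padded so that `G[k-1]` and `G[k+1]` exist and that $n_\Gamma \geq 2$; because the clamping can make the proposal asymmetric at the interval edges, the sketch also checks that the reverse move is possible. Names are illustrative.

```python
import numpy as np

def log_target_gamma(G, k, m_k, N, c):
    """Unnormalized log full conditional of Gamma_k (substep 1):
    exp(-m_k Gamma_k / c) (1 - exp(-Gamma_k / c))^(N - m_k) on [G[k-1], G[k+1]]."""
    g = G[k]
    if not (G[k - 1] <= g <= G[k + 1]):
        return -np.inf
    return -m_k * g / c + (N - m_k) * np.log1p(-np.exp(-g / c))

def mh_update_gamma(G, k, m_k, N, c, n_gamma, rng):
    """One MH update of Gamma_k with the thresholded uniform proposal above."""
    delta = (G[k + 1] - G[k - 1]) / n_gamma
    clamp = lambda g: np.clip(g, G[k - 1] + delta, G[k + 1] - delta)
    prop = clamp(G[k]) + rng.uniform(-delta, delta)
    reverse_ok = abs(G[k] - clamp(prop)) <= delta      # reverse move can reach G[k]
    G_new = G.copy()
    G_new[k] = prop
    log_ratio = (log_target_gamma(G_new, k, m_k, N, c)
                 - log_target_gamma(G, k, m_k, N, c))
    if reverse_ok and np.log(rng.uniform()) < log_ratio:
        G[k] = prop
    return G
```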

Sample $\Gamma$ (substep 2): The expansion distribution of $\Gamma_k$ has density proportional to

$$e^{-(1 + m_k/c)\Gamma_k - I(\Gamma_k)}\big(1 - e^{-\Gamma_k/c}\big)^{N-m_k}\, \mathbb{1}[\Gamma_k \geq \Gamma_{k-1}],$$

where $m_k := \sum_{n=1}^N X_{nk}$ and

$$I(\Gamma_k) = \int_{\Gamma_k}^\infty \Big(1 - \exp\big(-N e^{-\gamma/c}\big)\Big)\, d\gamma.$$

We again use Metropolis-Hastings to sample $\Gamma_k$. For convenience we set $\Delta_\Gamma = \frac{1}{n_\Gamma}$ for this step in particular. Note that $I(\gamma)$ can be precomputed using numerical integration at a wide range of points prior to slice sampling; here we chose to precompute $I(x)$ at 1000 evenly spaced points $e^{-x/c} \in [\epsilon, 1]$ for $\epsilon = 10^{-30}$. During MCMC, we evaluate $I(\Gamma_k)$ with spline interpolation.
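A possible implementation of this precomputation (not from the paper) is sketched below using SciPy's quadrature and cubic-spline interpolation; names and the grid construction are illustrative.

```python
import numpy as np
from scipy import integrate, interpolate

def precompute_I(N, c, num=1000, eps=1e-30):
    """Tabulate I(x) = integral_x^inf (1 - exp(-N e^(-gamma/c))) dgamma on a
    grid with e^(-x/c) evenly spaced in [eps, 1], and return a spline
    interpolant so that I(Gamma_k) is an O(1) lookup during MCMC."""
    u = np.linspace(eps, 1.0, num)                      # u = exp(-x/c)
    xs = -c * np.log(u)                                 # corresponding x grid
    integrand = lambda g: 1.0 - np.exp(-N * np.exp(-g / c))
    Ivals = np.array([integrate.quad(integrand, x, np.inf)[0] for x in xs])
    order = np.argsort(xs)                              # spline needs increasing x
    return interpolate.CubicSpline(xs[order], Ivals[order])

# usage: I = precompute_I(N=100, c=1.0); later, I(Gamma_k) evaluates the spline
```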

4.2 BNB CLUSTERING MODEL

Given a collection of documents $d = 1, \dots, D$, each containing $N_d \in \mathbb{N}$ words $y_{dn} \in [W]$, $W \in \mathbb{N}$, BNB combinatorial clustering aims to uncover latent topics $\psi_k \in [0,1]^W$, $\sum_w \psi_{kw} = 1$, and document-specific topic rates $\pi_{dk} > 0$, $k = 1, \dots, \infty$, via

$$\{\Gamma_k\}_{k=1}^\infty \sim \mathrm{PP}(1), \qquad V_k \overset{\text{i.i.d.}}{\sim} \mathrm{Beta}(1, \lambda-1), \qquad \theta_{dk} \overset{\text{indep}}{\sim} \mathrm{Beta}\Big(\alpha\lambda V_k e^{-\frac{\Gamma_k}{c}},\, \lambda\big(1 - \alpha V_k e^{-\frac{\Gamma_k}{c}}\big)\Big),$$
$$\psi_k \overset{\text{i.i.d.}}{\sim} \mathrm{Dir}(\beta), \qquad \pi_{dk} \overset{\text{indep}}{\sim} \mathrm{Gam}\Big(r, \frac{1 - \theta_{dk}}{\theta_{dk}}\Big), \qquad Z_{dn} \overset{\text{indep}}{\sim} \mathrm{Categorical}\Big(\Big(\frac{\pi_{dk}}{\sum_k \pi_{dk}}\Big)_{k=1}^\infty\Big), \qquad y_{dn} \overset{\text{indep}}{\sim} \mathrm{Categorical}(\psi_{Z_{dn}}).$$

Here $Z_{dn}$ is the topic indicator of word $n$ in document $d$. We assume the hyperparameters $\lambda > 1$, $0 < \alpha < 1$ (and set $c = \lambda\alpha$), $\beta \in \mathbb{R}_+^W$, and $r > 0$ are given. In this model, note that we sample per-word auxiliary variables:

$$U_{dn} \sim \mathrm{Unif}[0, \xi(Z_{dn})].$$

Sample $Z$: The conditional distribution of $Z_{dn}$ is a categorical thresholded by the auxiliary variable; in particular, the probability that $Z_{dn} = k$ is proportional to

$$\pi_{dk}\, \psi_{k y_{dn}}\, \mathbb{1}[U_{dn} \leq \xi(k)]\, /\, \xi(k). \tag{7}$$

Sample $\psi$: We sample the latent topics exactly from their full conditional Dirichlet distribution via

$$\psi_k \sim \mathrm{Dir}\Big(\beta + \Big(\sum_{(n,d):\, Z_{dn} = k} \mathbb{1}\{y_{dn} = w\}\Big)_{w=1}^W\Big).$$

The calculation is standard and similar to [34].

Sample $V, \Gamma$ (substep 1): The full conditional distribution of $V_k, \Gamma_k$ has density proportional to

$$\prod_{d=1}^D \mathrm{BNB}\Big(X_{dk};\, r,\, \alpha\lambda V_k e^{-\frac{\Gamma_k}{c}},\, \lambda\big(1 - \alpha V_k e^{-\frac{\Gamma_k}{c}}\big)\Big) \cdot \mathrm{Beta}(V_k; 1, \lambda-1) \cdot \mathrm{Unif}(\Gamma_k; \Gamma_{k-1}, \Gamma_{k+1}),$$

where $X_{dk} = \sum_{n=1}^{N_d} \mathbb{1}[Z_{dn} = k]$, and $\mathrm{BNB}(\cdot\,; \dots)$, $\mathrm{Beta}(\cdot\,; \dots)$, and $\mathrm{Unif}(\cdot\,; \dots)$ are the density functions for the beta-negative binomial, beta, and uniform distributions. Here the $X_{dk}$'s are conditionally independent because we do not condition on a fixed number of words in each document $N_d$. We again use Metropolis-Hastings to sample $V_k$ and $\Gamma_k$. Here $V_k$ is sampled with a thresholded uniform proposal with step-size hyperparameter $\Delta_V$,

$$E \sim \mathrm{Unif}[-\Delta_V, \Delta_V], \qquad V_k' = E + \max\big(\min(V_k,\, 1 - \Delta_V),\, \Delta_V\big),$$

and $\Gamma_k$ is sampled using the same thresholded uniform Metropolis-Hastings algorithm as in Section 4.1.

Sample $V, \Gamma$ (substep 2): The expansion distribution of $V_k, \Gamma_k$ has density proportional to

$$\exp\big(-I(\Gamma_k)\big) \prod_{d=1}^D \mathrm{BNB}\Big(X_{dk};\, r,\, \alpha\lambda V_k e^{-\frac{\Gamma_k}{c}},\, \lambda\big(1 - \alpha V_k e^{-\frac{\Gamma_k}{c}}\big)\Big) \cdot \mathrm{Beta}(V_k; 1, \lambda-1) \cdot \mathrm{Exp}(\Gamma_k - \Gamma_{k-1}; 1),$$

where the integral expression is given by

$$I(\Gamma_k) = \int_{\Gamma_k}^\infty \int_0^1 \big(1 - F(v, \gamma)\big)\, \mathrm{Beta}(v; 1, \lambda-1)\, dv\, d\gamma, \qquad F(v, \gamma) = \mathrm{BNB}\Big(0;\, r,\, \alpha\lambda v e^{-\frac{\gamma}{c}},\, \lambda\big(1 - \alpha v e^{-\frac{\gamma}{c}}\big)\Big)^D.$$

For simplicity, the formula presented here is for the beta-negative binomial likelihood with a homogeneous failure probability, but we set them differently in the later experiments. This density can be jointly sampled using Metropolis-Hastings with thresholded uniform proposals, similarly to the previous substep. Furthermore, as in substep 2 of the previous example, this integral can be precomputed for a range of values before sampling.
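The thresholded categorical update in Eq. (7) can be implemented as below (not from the paper); the sketch assumes $\xi(k) = \exp(-k/\Delta_\xi)$, 0-based word indices, and that $\pi_d$ and $\psi$ are instantiated up to the global truncation level; names are illustrative.

```python
import numpy as np

def sample_Z_dn(pi_d, psi, y_dn, U_dn, delta_xi, rng):
    """Sample Z_dn from Eq. (7): P(Z_dn = k) is proportional to
    pi_dk * psi[k, y_dn] * 1[U_dn <= xi(k)] / xi(k), xi(k) = exp(-k / delta_xi).
    Only the finitely many k with xi(k) >= U_dn need to be considered."""
    K = min(len(pi_d), int(np.floor(-delta_xi * np.log(U_dn))))
    ks = np.arange(1, K + 1)
    xi_k = np.exp(-ks / delta_xi)
    w = pi_d[:K] * psi[:K, y_dn] / xi_k
    return int(rng.choice(ks, p=w / w.sum()))
```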
5 EXPERIMENTS

In this section, we compare the performance of our algorithm on the two models described in Sections 4.1 and 4.2. On both synthetic and real datasets, our algorithm outperforms state-of-the-art methods and fixed truncations.

5.1 BETA-BERNOULLI FEATURE MODEL

In the first experiment we generate synthetic data from a truncated version of the beta-Bernoulli model from Section 4.1, with model parameters set to $(\sigma, \sigma_0, c) = (0.2, 0.5, 1)$. We test a number of experimental settings: for each $N \in \{10000, 11000, \dots, 19000, 20000\}$, we set the synthetic generating model truncation level to $K = 2\lceil\log(N)\rceil$ and the data dimension to $D = 2\big\lceil\frac{N\log(N)}{N - \log(N)}\big\rceil$, such that the features are roughly identifiable from the data. For the auxiliary variables in the slice sampler, we set the scale of the $\xi$ sequence to $\Delta_\xi = 1$. The scale is optimized over $\{0.1, 0.2, 0.3, \dots, 2.9, 3\}$ to maximize effective sample size per second (ESS/s) using simulated data ($N = 1000$). We set the Metropolis-Hastings step size to $n_\Gamma = 10$; this is chosen from the set $\{1, 2, \dots, 10\}$ so that the MH acceptance rate for $\Gamma$, averaged over iterations, is roughly between 0.2 and 0.9, which is standard practice in MCMC methods. We generate synthetic data and run the proposed slice sampling algorithm for 1,000 iterations over 10 independent trials, comparing to the state-of-the-art accelerated collapsed Gibbs sampler of Doshi-Velez and Ghahramani [13], which has $O(N^2)$ runtime per MCMC iteration. We perform the comparison by measuring both ESS/s and the 2-norm error of held-out data. The error is evaluated with latent features from the Monte Carlo samples and the combinatorial variable $X$ chosen to minimize the error, with $(N_{\text{train}}, N_{\text{test}}, K, c, \sigma, \sigma_0, \Delta_\xi) = (300, 200, 20, 2, 0.5, 0.5)$, where the parameters that require tuning are tuned in a procedure similar to the previous experiment. To evaluate the ESS/s for both samplers, we compute a test function that returns 1 if the combinatorial matrix $X$ has an even number of non-zero entries and 0 otherwise, and use the batch mean estimator [35].

The results are shown in Figs. 3a and 3b. Fig. 3a suggests our sampler has ESS/s that scales as $O(N^{-0.6})$, while the accelerated collapsed sampler has ESS/s that scales as $O(N^{-1.34})$; this improvement arises from the linear per-iteration runtime of our method compared to the quadratic runtime of the algorithm of [13]. In Fig. 3b, our sampler achieved the smallest error by quickly selecting a suitable truncation level.

5.2 HIERARCHICAL CLUSTERING

In the second experiment, we used the BNB clustering model from Section 4.2 to analyze the NeurIPS corpus from 2010 to 2015, preprocessed to remove stopwords and truncate the vocabulary to those words appearing more than 50 times. We randomly split each document of the NeurIPS papers corpus into held-out test words (30%) and training words (70%). We set the concentration and scale parameters to $(\alpha, \lambda) = (1, 1.1)$, the prior Dirichlet topic distribution parameter over the $V$ vocabulary words to $\beta = (0.1, \dots, 0.1)$, and the failure rate of the negative binomial distribution to $r_d = \frac{N_d(\lambda - 1)}{\alpha\lambda}$ for each document $d \in \{1, \dots, D\}$. We set the Metropolis-Hastings step sizes to $(n_\Gamma, \Delta_V) = (10, 0.3)$, and the scale of the $\xi$ sequence to $\Delta_\xi = 3$. These parameters are tuned similarly to the previous section. We compared our slice sampler to the slice sampler of [20] on both ESS/s and held-out data perplexity. We use the same procedure as in the previous experiment to estimate ESS/s.

The results are shown in Figs. 4a and 4b. Fig. 4a shows that our algorithm produces a roughly two orders of magnitude improvement in ESS/s on large datasets over the comparison method. Fig. 4b demonstrates that the proposed slice sampler also provides a significant decrease in the held-out test set perplexity [34]. This is at least in part because the proposed slice sampler is generic and can use any series representation of the underlying CRM; here, we take advantage of that and use the Bondesson representation, which is known to provide exponentially decreasing truncation error [16] and is significantly more efficient than the superposition representation used by [20]. In practice, the latter manifests as a high number of unused or redundant atoms in past samplers, while the proposed sampler does not exhibit this issue.

Figure 3: (3a): Boxplot and linear fit of the ESS/s of 20 runs of the model. Both ESS/s and $N$ are plotted in log scale of base 10. The line "double slice" corresponds to the auxiliary variable $U_n \equiv K_n' \sim \mathrm{Unif}[k_n, 2k_n]$. (3b): 2-norm error of held-out data for the truncated and adaptively truncated (slice sampler) models, optimized over the combinatorial space.


Figure 4: (4a): Boxplot and linear fit of the ESS/s of 6 runs of the model. Both ESS/s and $N$ are plotted in log scale of base 10. (4b): Perplexity evaluated on the test set at each iteration.

6 CONCLUSION

In this paper, we introduced a computational method for posterior inference in a large class of unsupervised Bayesian nonparametric models. Compared with past work, our method enables parallel inference, does not require conjugacy, and targets the exact posterior.

It is worth noting that the proposed sampler does not necessarily generalize past slice samplers for specific models (e.g., [18]). Although model-specific methods may provide performance gains in some cases, our method can be easily incorporated in a general probabilistic programming system, and provides parallel inference for a wide range of models. Future work on the proposed methodology could include automated selection of the deterministic sequence $\xi$, and the incorporation of more advanced Markov chain moves, such as split-merge [36].

Acknowledgements

This research is supported by a National Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant and Discovery Launch Supplement.

References

[1] Thomas Griffiths and Zoubin Ghahramani. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, 2005.
[2] Trevor Campbell, Diana Cai, and Tamara Broderick. Exchangeable trait allocations. Electronic Journal of Statistics, 12:2290–2322, 2018.
[3] John Kingman. Completely random measures. Pacific Journal of Mathematics, 21(1):59–78, 1967.
[4] Diana Cai and Tamara Broderick. Completely random measures for modeling power laws in sparse graphs. NeurIPS Workshop on Networks in the Social and Information Sciences, 2015.
[5] Sunil Gupta, Dinh Phung, and Svetha Venkatesh. Factorial multi-task learning: A Bayesian nonparametric approach. In International Conference on Machine Learning, 2013.
[6] John Paisley, David Blei, and Michael Jordan. Stick-breaking beta processes and the Poisson process. In Artificial Intelligence and Statistics, 2012.
[7] Ayan Acharya, Joydeep Ghosh, and Mingyuan Zhou. Nonparametric Bayesian factor analysis for dynamic count matrices. In Artificial Intelligence and Statistics, 2015.
[8] Michael Jordan. Hierarchical models, nested models and completely random measures. In Frontiers of Statistical Decision Making and Bayesian Analysis: In Honor of James O. Berger, pages 207–218. New York: Springer, 2010.
[9] Tamara Broderick, Ashia Wilson, and Michael Jordan. Posteriors, conjugacy, and exponential families for completely random measures. Bernoulli, 24(4):3181–3221, 2018.
[10] Romain Thibaux and Michael Jordan. Hierarchical Beta processes and the Indian buffet process. In Artificial Intelligence and Statistics, 2007.
[11] Michalis Titsias. The infinite gamma-Poisson feature model. In Advances in Neural Information Processing Systems, 2008.
[12] Jim Griffin and Stephen Walker. Posterior simulation of normalized random measure mixtures. Journal of Computational and Graphical Statistics, 20(1):241–259, 2011.
[13] Finale Doshi-Velez and Zoubin Ghahramani. Accelerated sampling for the Indian buffet process. In International Conference on Machine Learning, 2009.
[14] Mingyuan Zhou, Lauren Hannah, David Dunson, and Lawrence Carin. Beta-negative binomial process and Poisson factor analysis. In Artificial Intelligence and Statistics, 2012.
[15] David Blei and Michael Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–144, 2006.
[16] Trevor Campbell, Jonathan Huggins, Jonathan How, and Tamara Broderick. Truncated random measures. Bernoulli, 25(2):1256–1288, 2019.
[17] Stephen Walker. Sampling the Dirichlet mixture model with slices. Communications in Statistics - Simulation and Computation, 36(1):45–54, 2007.
[18] Yee Whye Teh, Dilan Görür, and Zoubin Ghahramani. Stick-breaking construction for the Indian buffet process. In Artificial Intelligence and Statistics, 2007.
[19] Stefano Favaro and Yee Whye Teh. MCMC for normalized random measure mixture models. Statistical Science, 28(3):335–359, 2013.
[20] T. Broderick, L. Mackey, J. Paisley, and M. Jordan. Combinatorial clustering and the beta negative binomial process. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):290–306, 2015.
[21] Fadhel Ayed and François Caron. Nonnegative Bayesian nonparametric factor models with completely random measures for community detection. arXiv:1902.10693, 2019.
[22] Jan Rosiński. Series representations of Lévy processes from the perspective of point processes. In Lévy Processes: Theory and Applications, pages 401–415. Springer, 2001.
[23] Maria Kalli, Jim Griffin, and Stephen Walker. Slice sampling mixture models. Statistics and Computing, 21(1):93–105, 2011.
[24] John Kingman. Poisson Processes. Clarendon Press, 1992.
[25] Thomas Ferguson and Michael Klass. A representation of independent increment processes without Gaussian components. The Annals of Mathematical Statistics, 43(5):1634–1643, 1972.
[26] Lennart Bondesson. On simulation from infinitely divisible distributions. Advances in Applied Probability, 14:855–869, 1982.
[27] Martin Tanner and Wing Hung Wong. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398):528–540, 1987.
[28] Anirban Roychowdhury and Brian Kulis. Gamma processes, stick-breaking, and variational inference. In Artificial Intelligence and Statistics, 2015.
[29] Julyan Arbel and Igor Prünster. A moment-matching Ferguson & Klass algorithm. Statistics and Computing, 27(1):3–17, 2017.
[30] Raffaele Argiento, Ilaria Bianchini, and Alessandra Guglielmi. A blocked Gibbs sampler for NGG-mixture models via a priori truncation. Statistics and Computing, 26(3):641–661, 2016.
[31] Finale Doshi, Kurt Miller, Jurgen Van Gael, and Yee Whye Teh. Variational inference for the Indian buffet process. In Artificial Intelligence and Statistics, 2009.
[32] Nils Lid Hjort. Nonparametric Bayes estimators based on beta processes in models for life history data. The Annals of Statistics, 18(3):1259–1294, 1990.
[33] Philip Dawid. Some matrix-variate distribution theory: Notational considerations and a Bayesian application. Biometrika, 68(1):265–274, 1981.
[34] David Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.
[35] James Flegal and Galin Jones. Batch means and spectral variance estimators in Markov chain Monte Carlo. The Annals of Statistics, 38(2):1034–1070, 2010.
[36] Sonia Jain and Radford Neal. Splitting and merging components of a nonconjugate Dirichlet process mixture model. Bayesian Analysis, 2(3), 2007.