Slice Sampling for General Completely Random Measures
Peiyuan Zhu    Alexandre Bouchard-Côté    Trevor Campbell
Department of Statistics
University of British Columbia
Vancouver, BC V6T 1Z4

Abstract

Completely random measures provide a principled approach to creating flexible unsupervised models, where the number of latent features is infinite and the number of features that influence the data grows with the size of the data set. Due to the infinity of the latent features, posterior inference requires either marginalization—resulting in dependence structures that prevent efficient computation via parallelization and conjugacy—or finite truncation, which arbitrarily limits the flexibility of the model. In this paper we present a novel Markov chain Monte Carlo algorithm for posterior inference that adaptively sets the truncation level using auxiliary slice variables, enabling efficient, parallelized computation without sacrificing flexibility. In contrast to past work that achieved this on a model-by-model basis, we provide a general recipe that is applicable to the broad class of completely random measure-based priors. The efficacy of the proposed algorithm is evaluated on several popular nonparametric models, demonstrating a higher effective sample size per second compared to algorithms using marginalization, as well as higher predictive performance compared to models employing fixed truncations.

1 INTRODUCTION

In unsupervised data analysis, one aims to uncover complex latent structure in data. Traditionally, this structure has been assumed to take the form of a clustering, in which each data point is associated with exactly one latent category. Here we are concerned with a new generation of unobserved structures in which each data point can be associated with any number of latent categories. When such a model can select each category zero times or once, the latent categories are called features [1], whereas if each category can be selected with multiplicities, the latent categories are called traits [2].

Consider, for example, the problem of modelling movie ratings for a set of users. As a first rough approximation, an analyst may entertain a clustering over the movies and hope to automatically infer movie genres. Clustering in this context is limited: users may like or dislike movies based on many overlapping factors such as genre, actor, and score preferences. Feature models, in contrast, support inference of these overlapping movie attributes.

As the amount of data increases, one may hope to capture increasingly sophisticated patterns, and we therefore want the model to increase its complexity accordingly. In our movie example, this means uncovering more and more diverse user preference patterns and movie attributes from the growing number of registered users and new movie releases. Bayesian nonparametric (BNP) methods enable unbounded model capacity by positing infinite-dimensional prior distributions. These infinite-dimensional priors are designed so that, for any given dataset, only a finite number of latent parameters are utilized, making Bayesian nonparametric inference possible in practice. The present work is concerned with developing efficient and flexible inference methods for a class of BNP priors called completely random measures (CRMs) [3], which are commonly used in practice [4–7]. In particular, CRMs provide a unified approach to the construction of BNP priors over both latent features and traits [8, 9].

Previous approaches to CRM posterior inference can be categorized into two main types. First, some methods analytically marginalize the infinite-dimensional objects involved in CRMs [1, 10–14].
This has the disadvantage of making restrictive conjugacy assumptions and is moreover not amenable to parallelization. A second type of inference method, introduced by Blei and Jordan [15], instead uses a fixed truncation of the infinite model. However, this strategy is at odds with the motivation behind BNP, namely its ability to learn model capacity as part of the inferential procedure. Campbell et al. [16] provide a priori error bounds on such truncations, but it is not obvious how to extend these bounds to approximation errors on the posterior distribution.

Our method is based on slice sampling, a family of Markov chain Monte Carlo methods first used in a BNP context by [17]. Slice samplers have advantages over both marginalization and truncation techniques: they do not require conjugacy, enable parallelization, and target the exact nonparametric posterior distribution. But while there is a rich literature on sampling methods for BNP latent feature and trait models—e.g., the Indian buffet / beta-Bernoulli process [13, 18], hierarchies thereof [10], normalized CRMs [12, 19], the beta-negative binomial process [14, 20], the generalized gamma process [21], the gamma-Poisson process [11], and more—these have often been developed on a model-by-model basis.

In contrast to these past model-specific techniques, we develop our sampler based on a series representation of the Lévy process [22] underlying the CRM. In a fashion similar to [23], we introduce auxiliary variables that adaptively truncate the series representation, so that only finitely many latent features are updated in each Gibbs sweep. The representation that we utilize factorizes the weights of the CRM into a transformed Poisson process with independent and identically distributed marks, thereby turning the sampling problem into evaluating the mean measure of a marginalized Poisson point process over a zero-set.

The remainder of the paper is organized as follows: Section 2 introduces the general model that we consider, series representations of CRMs, and posterior inference via marginalization and truncation. Section 3 discusses our main contributions, including model augmentation and slice sampling. Section 4 demonstrates how the methodology can be applied to two popular latent feature models. Finally, in Section 5 we compare our method against several state-of-the-art samplers for these models on both real and synthetic datasets.

2 BACKGROUND

2.1 MODEL

In the standard Bayesian nonparametric latent trait model [16], we are given a data set of random observations $(Y_n)_{n=1}^N$ generated using an infinite collection of latent traits $(\psi_k)_{k=1}^\infty$, $\psi_k \in \Psi$, with corresponding rates $(\theta_k)_{k=1}^\infty$, $\theta_k \in \mathbb{R}_+ := [0, \infty)$. We assume each data point $Y_n$, $n \in [N] := \{1, \dots, N\}$, is influenced by each trait $\psi_k$ in an amount corresponding to an integer count $X_{nk} \in \mathbb{N}_0 := \{0, 1, 2, \dots\}$ via

$$X_{nk} \overset{\text{indep}}{\sim} h(\cdot\,; \theta_k), \quad n, k \in [N] \times \mathbb{N}$$
$$Y_n \overset{\text{indep}}{\sim} f\Big(\cdot\,; \sum_{k=1}^{\infty} X_{nk}\,\delta_{\psi_k}\Big), \quad n \in [N],$$

where $\delta_{\psi}(\cdot)$ denotes a Dirac delta measure at $\psi$, $h$ is a distribution on $\mathbb{N}_0$, and $f$ is a distribution on the space of observations. Note that each data point $Y_n$ is influenced only by those traits $\psi_k$ for which $X_{nk} > 0$, and the value of $X_{nk}$ denotes the amount of influence.

To generate the infinite collection of $(\psi_k, \theta_k)$ pairs, we use a Poisson point process [24] on the product space of traits and rates $\Psi \times \mathbb{R}_+$ with $\sigma$-finite mean measure $\mu$,

$$\{\psi_k, \theta_k\}_{k=1}^{\infty} \sim \mathrm{PP}(\mu), \qquad \mu(\Psi \times \mathbb{R}_+) = \infty. \qquad (1)$$

Equivalently, this process can be formulated as a completely random measure (CRM) [3] on the space of traits $\Psi$ by placing a Dirac measure at each $\psi_k$ with weight $\theta_k$,¹

$$\sum_{k=1}^{\infty} \theta_k\,\delta_{\psi_k} \sim \mathrm{CRM}(\mu). \qquad (2)$$

¹More generally, CRMs are the sum of a deterministic measure, an atomic measure with fixed atom locations, and a Poisson point process-based measure as in Eq. (2). In BNP models there is typically no deterministic component, and the fixed-location atomic component has finitely many atoms, posing no challenge in posterior inference. Thus we focus only on the infinite Poisson point process-based component in this paper.

In Bayesian nonparametric modelling, the traits are typically generated independently of the rates, i.e.,

$$\mu(d\theta, d\psi) = \nu(d\theta)\, H(d\psi),$$

where $H$ is a probability measure on $\Psi$ and $\nu$ is a $\sigma$-finite measure on $\mathbb{R}_+$. In order to guarantee that the CRM has infinitely many atoms, we require that $\nu$ satisfies

$$\nu(\mathbb{R}_+) = \infty,$$

and in order to guarantee that each observation $Y_n$ is influenced by only finitely many traits $\psi_k$ having $X_{nk} \neq 0$ a.s., we require that

$$\mathbb{E}\Big[\sum_{k=1}^{\infty} \mathbb{1}\{X_{nk} \neq 0\}\Big] = \int \big(1 - h(0 \mid \theta)\big)\,\nu(d\theta) < \infty.$$
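As a concrete illustration of these two requirements (our example, not drawn from this excerpt; beta-process parameterizations vary in the literature), take

$$\nu(d\theta) = \gamma\,\alpha\,\theta^{-1}(1-\theta)^{\alpha-1}\,d\theta \ \text{ on } (0,1], \qquad h(x \mid \theta) = \theta^x (1-\theta)^{1-x}, \quad x \in \{0, 1\},$$

i.e., a beta process with mass $\gamma > 0$ and concentration $\alpha > 0$, paired with a Bernoulli likelihood. The $\theta^{-1}$ singularity at zero gives $\nu((0,1]) = \infty$, so the CRM has infinitely many atoms; yet since $1 - h(0 \mid \theta) = \theta$,

$$\int \big(1 - h(0 \mid \theta)\big)\,\nu(d\theta) = \gamma\alpha \int_0^1 (1-\theta)^{\alpha-1}\,d\theta = \gamma < \infty,$$

so each observation is influenced by only finitely many features (on average, $\gamma$ of them).

A minimal numerical check of this closed form, with hypothetical parameter values (the factor $1 - h(0 \mid \theta) = \theta$ is cancelled against $\theta^{-1}$ analytically before integrating, so the integrand is bounded):

    from scipy.integrate import quad

    # Hypothetical beta-process parameters: mass gamma_ and concentration alpha.
    gamma_, alpha = 1.5, 3.0

    def integrand(t):
        # (1 - h(0|theta)) * nu-density; the theta from 1 - h(0|theta) cancels
        # theta^{-1} in nu, leaving gamma_ * alpha * (1 - theta)^(alpha - 1).
        return gamma_ * alpha * (1.0 - t) ** (alpha - 1.0)

    value, abserr = quad(integrand, 0.0, 1.0)
    print(value)  # ~1.5, matching the closed form: gamma_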
To summarize, the model we consider in this paper is:

$$\sum_{k=1}^{\infty} \theta_k\,\delta_{\psi_k} \sim \mathrm{CRM}(\nu \times H)$$
$$X_{nk} \overset{\text{indep}}{\sim} h(\cdot\,; \theta_k), \quad n, k \in [N] \times \mathbb{N}$$
$$Y_n \overset{\text{indep}}{\sim} f\Big(\cdot\,; \sum_k X_{nk}\,\delta_{\psi_k}\Big), \quad n \in [N]. \qquad (3)$$

2.2 SEQUENTIAL REPRESENTATION

While the specification of $\{\psi_k, \theta_k\}_k$ as a Poisson point process is mathematically elegant, it does not lend itself immediately to computation. For this purpose—since there are infinitely many atoms—we require a way of generating them one-by-one in a sequence using familiar finite-dimensional distributions; this is known as a sequential representation of the CRM. While there are many such representations (see [16] for an overview), here we will employ the general class of series representations, which simulate the traits $\psi_k$ and rates $\theta_k$ via

$$E_j \overset{\text{i.i.d.}}{\sim} \mathrm{Exp}(1), \quad \Gamma_k = \sum_{j=1}^{k} E_j, \quad V_k \overset{\text{i.i.d.}}{\sim} G, \quad \psi_k \overset{\text{i.i.d.}}{\sim} H, \quad \theta_k = \tau(V_k, \Gamma_k), \qquad (4)$$

where the mark distribution $G$ and the measurable function $\tau$ are chosen so that the resulting measure $\sum_k \theta_k\,\delta_{\psi_k}$ has distribution $\mathrm{CRM}(\nu \times H)$.

Marginalization  One option for posterior inference is to analytically marginalize the infinite-dimensional CRM, as in the methods discussed in Section 1. Beyond their restrictive conjugacy assumptions, the conditional updates of such samplers cannot be parallelized across $n$, making them computationally expensive with large amounts of data.

Fixed truncation  Another option for posterior inference is to truncate a sequential representation of the CRM such that it generates finitely many traits, i.e.,

$$(\psi_k, V_k, \Gamma_k)_{k=1}^{K} \sim \text{Eq. (4)}$$
$$X_{nk} \overset{\text{indep}}{\sim} h(\cdot\,; \tau(V_k, \Gamma_k)), \quad n, k \in [N] \times [K]$$
$$Y_n \overset{\text{indep}}{\sim} f\Big(\cdot\,; \sum_{k=1}^{K} X_{nk}\,\delta_{\psi_k}\Big), \quad n \in [N].$$

Because there are only finitely many traits and rates in this model, it is not difficult to develop Gibbs sampling algorithms.
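To make Eq. (4) and the truncated model concrete, the following is a minimal simulation sketch; it is our illustration, not the paper's algorithm. We assume a Bondesson-style series for a gamma-process CRM, taking $G = \mathrm{Exp}(1)$ and $\tau(v, u) = v\,e^{-u/\gamma_0}$ with mass parameter $\gamma_0$ (treat this pairing as an assumption), combine it with a Poisson choice of $h$ in the spirit of the gamma-Poisson process [11], and leave the observation likelihood $f$ abstract.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    def series_representation(K, gamma0=2.0):
        """Simulate the first K atoms of Eq. (4).

        Illustrative choices (assumed, not prescribed by the paper):
        G = Exp(1), H = N(0, 1), and tau(v, u) = v * exp(-u / gamma0),
        a Bondesson-style series for a gamma-process CRM.
        """
        E = rng.exponential(1.0, size=K)      # E_j ~iid Exp(1)
        Gamma = np.cumsum(E)                  # Gamma_k = E_1 + ... + E_k
        V = rng.exponential(1.0, size=K)      # V_k ~iid G
        theta = V * np.exp(-Gamma / gamma0)   # theta_k = tau(V_k, Gamma_k)
        psi = rng.standard_normal(K)          # psi_k ~iid H
        return psi, theta

    # Fixed-truncation generative model with h = Poisson(theta_k);
    # the observation distribution f is left abstract here.
    N, K = 10, 200
    psi, theta = series_representation(K)
    X = rng.poisson(theta, size=(N, K))       # X_nk ~indep h(.; theta_k)

    print(theta[:5])                          # rates decay stochastically in k
    print((X > 0).sum(axis=1))                # active traits per observation

Since $\theta_k \to 0$ as $k \to \infty$, all but finitely many counts are zero in expectation, which is the behaviour guaranteed by the finiteness condition of Section 2.1; the sampler developed in Section 3 exploits the same decay, using auxiliary slice variables to choose the truncation level adaptively rather than fixing $K$ in advance.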