
Sequential Monte Carlo Samplers for Dirichlet Process Mixtures

Yener Ulker and Bilge Gunsel
Dept. of Electronics and Communications Eng.
Istanbul Technical University
34469 Maslak, Istanbul, Turkey
[email protected], [email protected]

A. Taylan Cemgil
Dept. of Computer Engineering
Bogazici University
34342 Bebek, Istanbul, Turkey
[email protected]

Abstract

In this paper, we develop a novel online algorithm based on the Sequential Monte Carlo (SMC) samplers framework for posterior inference in Dirichlet Process Mixtures (DPM) (DelMoral et al., 2006). Our method generalizes many sequential importance sampling approaches. It provides a computationally efficient improvement to particle filtering that is less prone to getting stuck in isolated modes. The proposed method is a particular SMC sampler that enables us to design sophisticated clustering update schemes, such as updating past trajectories of the particles in light of recent observations, and still ensures convergence to the true DPM target distribution asymptotically. Performance has been evaluated in a Bayesian infinite Gaussian mixture density estimation problem and it is shown that the proposed algorithm outperforms conventional Monte Carlo approaches in terms of estimation variance and average log-marginal likelihood.

1 Introduction

In recent years, the Dirichlet Process Mixture (DPM) model (Antoniak, 1974) has become one of the most widely used approaches to nonparametric probabilistic modelling. The DPM is typically used as a building block in hierarchical models for solving density estimation and clustering problems where the actual form of the data generation process is not constrained to a particular parametric family. Provided that inference can be carried out effectively for the DPM, at least in principle, any density can be approximated with arbitrary precision. However, exact inference is unfortunately intractable. Yet, due to the mentioned potential advantages of nonparametric approaches, there has been a surge of interest in the DPM model and in efficient inference strategies based on variational techniques (Blei and Jordan, 2004) and Markov chain Monte Carlo (MCMC) (Escobar and West, 1992; Jain and Neal, 2000).

By construction, the DPM model is exchangeable and the ordering of the data does not matter. However, for inference it is nevertheless beneficial to process data sequentially in some natural order. Such an approach gives computational advantages, especially for large datasets. In the literature, a number of online inference techniques have been proposed to estimate an artificially time-evolving DPM posterior (MacEachern et al., 1999; Quintana, 1996; Fearnhead, 2004).

However, it has been argued that sequential importance sampling is not an appropriate method for models with static parameters, especially for large datasets, due to the degeneracy phenomenon and the Monte Carlo error accumulated over time (Quintana and Newton, 1998). The sampler becomes 'sticky', meaning that previously assigned clusterings can never be updated according to the information provided by the latest observations. In order to overcome this drawback, we propose an efficient sequential Monte Carlo sampler that estimates the sequentially evolving DPM model posterior. Unlike the existing methods (MacEachern et al., 1999; Quintana, 1996; Fearnhead, 2004), our algorithm enables us to update past trajectories of the particles in the light of recent observations. The proposed method takes advantage of the SMC samplers framework to design such update schemes (DelMoral et al., 2006).

Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.


2 Dirichlet Process Mixtures (DPM)

In a batch Bayesian setting, the joint distribution corresponding to a finite mixture model over N observations y = {y_i}, i = 1,...,N, can be defined as follows:

p(θ, z, y) = p(z) ( ∏_{i=1}^{N} g(y_i | θ_{z_i}) ) ∏_{j=1}^{k} p(θ_j).   (1)

Here, for i = 1,...,N, z_i ∈ {1,...,k} denotes the cluster index of the i-th observation and θ = {θ_j}, j ∈ {1,...,k}, denotes the cluster conditional parameters, where k refers to the maximum number of clusters. We will use z = {z_i}, i = 1,...,N, to refer to the clustering variables, which we also call cluster labels or simply labels.

However, the number of mixture components is often unknown in practice, and the DPM model provides an elegant solution for the construction of mixture models with an unknown number of components. In the sequel, we will refer to the target posterior as

π(x) ≡ p(z, θ | y)   (2)

where x = {z, θ}. It is advantageous to construct a mixture model sequentially, where data arrive one by one. To denote the sequential construction, we extend our notation with an explicit 'time' index n.

We denote the observation sequence at time n by y_n = {y_{n,1}, ..., y_{n,n}}. Each observation y_{n,i}, i = 1,...,n, is assigned to a cluster, where z_{n,i} ∈ {1,...,k_n} is the cluster label and k_n ∈ {1,...,n} represents the number of clusters at time n. The vector of cluster variables is defined as z_n = {z_{n,1}, ..., z_{n,n}} and the corresponding cluster parameters are represented by the parameter vector θ_n = {θ_{n,1}, ..., θ_{n,k_n}}.

The DPM model assumes that the cluster parameters are independently drawn from the prior π(θ) and that the observations are independent of each other conditional on the assignment variable z_{n,i}. Hence the DPM posterior density π_n(x_n) can be expressed as

π_n(x_n) ∝ p(z_n) ∏_{j=1}^{k_n} p(θ_{n,j}) ∏_{i=1}^{n} g(y_{n,i} | θ_{n,z_{n,i}})   (3)

where x_n = {z_n, θ_n}. The prior on the clustering variable vector z_n is formulated recursively by Eq.(4),

p(z_{n,i+1} = j | z_{n,{1:i}}) = { l_j / (i + κ),   j = 1,...,k_i
                                  κ / (i + κ),     j = k_i + 1 }   (4)

where k_i is the number of clusters in the assignment z_{n,{1:i}}, l_j is the number of observations that z_{n,{1:i}} assigns to cluster j, and κ is a 'novelty' parameter.

In our work, we assume that a conjugate prior is chosen such that, given z_n, the parameter θ_n can be integrated out and the DPM posterior distribution can be calculated up to a normalizing constant.
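To make the recursive prior of Eq.(4) concrete, the following minimal Python sketch (our illustration, not code from the paper) draws an assignment vector from this prior for a given novelty parameter κ.

```python
import numpy as np

def sample_assignments(n, kappa, seed=0):
    """Draw a clustering z_{1:n} from the recursive DPM prior of Eq.(4)."""
    rng = np.random.default_rng(seed)
    z = np.zeros(n, dtype=int)
    counts = [1]                      # l_j: the first observation opens cluster 0
    for i in range(1, n):             # i observations have already been assigned
        k_i = len(counts)
        # join cluster j with prob. l_j/(i + kappa), open a new one with prob. kappa/(i + kappa)
        probs = np.array(counts + [kappa], dtype=float) / (i + kappa)
        j = rng.choice(k_i + 1, p=probs)
        if j == k_i:
            counts.append(1)          # a new cluster is opened
        else:
            counts[j] += 1
        z[i] = j
    return z

print(sample_assignments(10, kappa=0.5))   # e.g. [0 0 1 0 2 ...]
```

Small values of κ make new clusters rare, which is the 'sticky' regime probed with κ = 0.05 in Section 5.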


3 Sequential Monte Carlo (SMC) Samplers

Sequential inference schemes have limited success in maintaining an accurate approximation to the true target density. Particularly for large datasets, Monte Carlo error accumulates over time and the estimation variance increases (Quintana and Newton, 1998). This is due to the fact that past states of the particle trajectories (i.e., past clusterings) are not updated with new observations. The problem can be alleviated by a retrospective method that is able to reassign the previous clusterings {z_{n,1}, ..., z_{n,n−1}} at time n according to the latest observations received. The SMC samplers framework enables us to accomplish this in practice and still ensures convergence to the true target posterior asymptotically (DelMoral et al., 2006).

3.1 The Method

In sequential Monte Carlo algorithms such as particle filtering, we sample from a sequence of target densities evolving with a countable index n, π_1(x_1), ..., π_n(x_n), each defined on a common measurable space (E_n, ℰ_n) where x_n ∈ E_n. Conventionally, the particle filter is defined on the sequence of target densities π_1(x_1), ..., π_n(x_n) with corresponding proposal distributions η_1(x_1), ..., η_n(x_n). The unnormalized importance weight w_n at time n can be defined as

w_n = γ_n(x_n) / η_n(x_n)   (5)

where γ_n is the unnormalized target distribution, i.e. π_n = γ_n / Z, and Z is the normalizing constant. In order to derive the importance weights sequentially, one needs to calculate the proposal distribution η_n(x_n) pointwise.

Computation of the importance distribution η_n(x_n) for n > 1 requires an integration with respect to x_{1:n−1} = {x_1, ..., x_{n−1}}; thus a closed form solution to η_n(x_n) is not available except for specifically designed kernels such as K(x_{n−1}, x_n) = K(x_n). SMC samplers (DelMoral et al., 2006) circumvent this limitation using an auxiliary variable technique which solves the sequential importance sampling problem in an extended space E^n = E_1 × ... × E_n. One performs importance sampling between the joint importance distribution η_n(x_{1:n}) and the artificial joint target distribution defined by π̃_n(x_{1:n}) = γ̃_n(x_{1:n}) / Z_n, where Z_n denotes the normalizing constant. The algorithm enables us to calculate efficient weight update equations for a huge class of proposal kernels K_n(x_{n−1}, x_n), such as MCMC transition kernels.

The proposal distribution η_n(x_{1:n}) of SMC is defined on the extended space E^n as follows,

η_n(x_{1:n}) = η_1(x_1) ∏_{k=2}^{n} K(x_{k−1}, x_k).   (6)

Note that here an integration is no longer required. However, this comes at the expense of an artificial target density that is also defined on a larger space:

γ̃_n(x_{1:n}) = γ_n(x_n) ∏_{k=1}^{n−1} L_k(x_{k+1}, x_k).   (7)

Here, a sequence of backward kernels L_k(x_{k+1}, x_k), k = 1,...,n−1, is introduced to define the artificial target distribution shown in Eq.(7). By construction, the joint posterior π̃_n(x_{1:n}) defined on the extended space E^n admits π_n(x_n) as a marginal. Therefore the resulting weight function ensures convergence to the original target density. The generic SMC algorithm used to sample from a sequentially evolving target posterior π_n proceeds as follows.

Assume that a set of weighted particles {W_{n−1}^i, X_{1:n−1}^i}_{i=1}^{N_p} approximates π̃_{n−1} at time n−1. At time n the path of each particle can be extended using a Markov kernel K_n(x_{n−1}, x_n). The unnormalized importance weights associated with the extended particles are calculated according to Eq.(8),

w_n(x_{1:n}) = w_{n−1}(x_{1:n−1}) v_n(x_{n−1}, x_n) = γ̃_n(x_{1:n}) / η_n(x_{1:n})   (8)

where the incremental term of the weight equation, v_n(x_{n−1}, x_n), is equal to

v_n(x_{n−1}, x_n) = γ_n(x_n) L_{n−1}(x_n, x_{n−1}) / ( γ_{n−1}(x_{n−1}) K_n(x_{n−1}, x_n) ).   (9)

The discrepancy between η_n and γ̃_n tends to grow with n, and consequently the variance of the unnormalized importance weights increases. A resampling step is used if the variance is above a certain level as measured by, e.g., the effective sample size (ESS).

3.2 Backward kernels

Design of efficient sampling schemata hinges on properly choosing L_n with respect to K_n. The introduction of L_n extends the integration domain from E to E^n and eliminates the necessity of calculating η_n(x_n). However, increasing the integration domain also increases the variance of the importance weights. In (DelMoral et al., 2006) it is shown that the optimal backward Markov kernel L_{k−1}^{opt} (k = 2,...,n) minimizing the variance of the unnormalized importance weight w_n(x_{1:n}) is given for any k by

L_{k−1}^{opt}(x_k, x_{k−1}) = η_{k−1}(x_{k−1}) K_k(x_{k−1}, x_k) / η_k(x_k).   (10)

However, the kernel given by Eq.(10) usually does not admit a closed form solution. Therefore a common strategy is to approximate the optimal kernel as closely as possible in order to obtain asymptotically consistent estimates (DelMoral et al., 2006). A sensible approximation at a given time step n can be obtained by substituting π_{n−1} for η_{n−1}, where the approximate kernel L_{n−1} can be expressed as in Eq.(11),

L_{n−1}(x_n, x_{n−1}) = π_{n−1}(x_{n−1}) K_n(x_{n−1}, x_n) / ∫ π_{n−1}(x_{n−1}) K_n(x_{n−1}, x_n) dx_{n−1}   (11)

which can yield a closed form solution to the weight update equation if it is possible to calculate the integration. An alternative approximate backward kernel can be obtained as in Eq.(12) by replacing π_{n−1}(x_{n−1}) with π_n(x_{n−1}) in Eq.(11) and selecting K_n as an MCMC kernel targeting π_n,

L_{n−1}(x_n, x_{n−1}) = π_n(x_{n−1}) K_n(x_{n−1}, x_n) / π_n(x_n).   (12)

Note that, although Eq.(11) is a closer approximation to the optimal backward kernel, Eq.(12) can lead to simpler weight update equations.
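The recursion of Eqs.(8)-(9), together with an ESS-triggered resampling step, can be summarized in a few lines of generic Python. This is our own sketch with abstract kernel and incremental-weight callbacks, not the authors' implementation.

```python
import numpy as np

def smc_sampler_step(particles, log_w, forward_kernel, log_incremental_weight,
                     ess_threshold, rng):
    """One generic SMC sampler step: extend each particle with K_n, multiply its weight
    by the incremental term v_n of Eq.(9), and resample when the ESS drops too low."""
    new_particles, new_log_w = [], []
    for x_prev, lw in zip(particles, log_w):
        x_new = forward_kernel(x_prev, rng)                 # x_n ~ K_n(x_{n-1}, .)
        lv = log_incremental_weight(x_prev, x_new)          # log v_n(x_{n-1}, x_n)
        new_particles.append(x_new)
        new_log_w.append(lw + lv)                           # Eq.(8): w_n = w_{n-1} v_n
    log_w = np.array(new_log_w)
    W = np.exp(log_w - log_w.max())
    W /= W.sum()                                            # normalized weights
    if 1.0 / np.sum(W ** 2) < ess_threshold:                # effective sample size check
        idx = rng.choice(len(W), size=len(W), p=W)          # multinomial resampling
        new_particles = [new_particles[i] for i in idx]
        log_w = np.full(len(W), -np.log(len(W)))            # reset to equal weights
    return new_particles, log_w
```

In the DPM setting of Section 4, a particle corresponds to a clustering vector z_n, the forward kernel is one of the block-update kernels defined there, and the incremental weight reduces to ratios of the γ terms as in Eqs.(16), (17), (19) and (20).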


4 A SMC Sampler for the Dirichlet Process Mixtures

In this section we explain the proposed SMC based algorithm that generates weighted samples from the DPM model posterior described in Section 2. Assuming conjugacy, we devise an algorithm that approximates the posterior distribution

P(z_n | y_n) ≈ ∑_{i=1}^{N_p} W_n^i δ_{Z_n^i}(z_n)   (13)

with a set of weighted samples {W_n^i, Z_n^i}_{i=1}^{N_p}, where each particle Z_n^i encodes an assignment vector of all datapoints up to time n, formally represented with a Dirac delta function δ_{Z_n^i}(z_n).

Let us define a forward kernel K_n generating samples from the sequence of distributions built according to Eq.(3). We first partition an assignment vector z_n = {z_{n,r}, z_{n,d}, z_{n,n}}, where r is a subset of {1,...,n−1}, a set of not necessarily consecutive indices, and d = {1,...,n−1} − r. Throughout the text we will call the set z_{n,r} the active block. We define u = r ∪ {n} and denote −u ≡ d. Exploiting the conjugacy property, we propose using the following conditional distribution for K_n, as given in Eq.(14),

K_n(z_{n−1}, z_n) = δ_{z_{n−1,−u}}(z_{n,−u}) π_n(z_{n,u} | z_{n,−u}).   (14)

This kernel allows us to update z_{n,u}, which includes the current and some past assignments, without changing the rest, z_{n,−u}.

By replacing K_n in Eq.(11) we obtain a straightforward derivation of the approximate backward kernel,

L_{n−1}(z_n, z_{n−1}) = δ_{z_{n,−u}}(z_{n−1,−u}) π_{n−1}(z_{n−1,r} | z_{n−1,−r}).   (15)

Given our choices of the forward and backward kernels, we can now write the expression for the incremental weight function given in Eq.(9) as follows,

v_n(z_{n−1}, z_n) = γ_n(z_{n,−u}) / γ_{n−1}(z_{n−1,−r}).   (16)

The proposed scheme can also be seen as a generalization of a conventional particle filtering weight update scheme. The particle filter simply uses the forward kernel K_n(z_{n−1}, z_n) = δ_{z_{n−1}}(z_{n,−n}) π_n(z_{n,n} | z_{n,−n}). In this case only the clustering variable z_{n,n} is updated upon arrival of the new observation, which yields the weighting function given in Eq.(17),

v_n^{pf}(z_{n−1}, z_n) = γ_n(z_{n,−n}) / γ_{n−1}(z_{n−1}).   (17)

The sequential imputation scheme (Liu, 1996) and many particle filtering based methods (Quintana and Newton, 1998; Chen and Liu, 2000) use the simplified incremental weight update function given by Eq.(17). Note that such a kernel selection strategy is not capable of updating the active set z_{n,r} according to the new observations and can therefore yield poor estimation performance.

In order to render our sampling approach more efficient by making more global moves, we wish to change a block of variables, i.e., choose the cardinality of the index set r as large as possible. However, when the cardinality of r increases, the time required for the exact computation of the incremental weight in Eq.(16) grows exponentially. In the sequel, we define MCMC and approximate Gibbs type moves for which the associated weight update equations can be computed efficiently. This leads to low complexity algorithms for sampling from the time evolving DPM posterior distribution.

4.1 MCMC kernels

We first define the forward kernel as

K_n(z_{n−1}, z_n) = δ_{z_{n−1,−u}}(z_{n,−u}) K_n(z_{n,n}, z_{n,r} | z_{n−1})   (18)

where K_n(z_{n,n}, z_{n,r} | z_{n−1}) is a valid MCMC kernel applying a single Gibbs step targeting the full conditional distribution π_n(z_{n,n}, z_{n,r} | z_{n,−u}). Intuitively, this kernel updates the active block using a Gibbs sampler and constructs the proposal distribution using the sequence of full conditional distributions.

A corresponding backward kernel can be obtained by substituting K_n(z_{n−1}, z_n) into Eq.(12). This yields the following incremental weight update equation,

v_n^{gb}(z_{n−1}, z_n) = γ_n(z_{n−1,r}, z_{n,−u}) / γ_{n−1}(z_{n−1}).   (19)

Note that, as a consequence of the chosen backward kernel, Eq.(19) is independent of the initialization of the Gibbs moves. If the active block is selected as r = {1,...,n−1}, the update equation of Eq.(18) becomes equal to the one introduced in (MacEachern et al., 1999) as the S4 algorithm.

The above schema depends exclusively on local Gibbs moves. As is the case in the application of the Gibbs sampler, we may expect to get stuck in local modes due to slow mixing, especially when the posterior distribution is multimodal. In such situations, annealing is a general strategy to pass through low probability barriers. However, as one modifies the target density gradually, finding the correct schedule is crucial. On the other hand, in the SMC framework we do not have to choose a schedule explicitly. We are free to choose any forward kernel, provided we compute the corresponding incremental weight. Here, we propose a forward kernel which targets the modified full conditional distribution π_n(z_{n,n}, z_{n,r} | z_{n,−u}, ρ_n). Note that bridging is achieved simply by changing the novelty parameter of the underlying Dirichlet process to ρ_n. The SMC theory guarantees that we still target the original target density.

A valid backward kernel can be obtained by replacing π_n with the modified version of the target distribution, π_n(·|ρ_n), in Eq.(12), and the resulting weight update equation can be represented as follows,

v_n^{an}(z_{n−1}, z_n) = ( γ_n(z_n) / γ_{n−1}(z_{n−1}) ) × ( π_n(z_{n−1,r} | z_{n−1,−r}, ρ_n) / π_n(z_{n,u} | z_{n,−u}, ρ_n) ).   (20)
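For intuition, the simple particle filter move of Eq.(17) can be written out explicitly once the cluster parameters are integrated out. The sketch below is our illustration, not the paper's code, and it deliberately uses a simpler conjugate model than the Normal-inverse-gamma setup of Section 5, namely a Gaussian likelihood with known variance and a Gaussian prior on the cluster mean, so that the predictive densities are available in closed form.

```python
import numpy as np

def pf_update(particle, y_new, kappa, mu0=0.0, tau0=1.0, sigma=1.0, rng=None):
    """Eq.(17)-style particle filter move: sample the label of the newest observation from
    its full conditional and return the log incremental weight log v_n^{pf}, i.e. the
    predictive density of y_new given the particle's current clustering."""
    rng = rng or np.random.default_rng()
    counts = particle['counts'].copy()                    # per-cluster counts l_j
    sums = particle['sums'].copy()                        # per-cluster sums of observations
    i = counts.sum()                                      # observations assigned so far
    log_terms = []
    for l_j, s_j in zip(counts, sums):
        # posterior predictive of an existing cluster (Gaussian, known variance sigma^2)
        prec = 1.0 / tau0**2 + l_j / sigma**2
        mean = (mu0 / tau0**2 + s_j / sigma**2) / prec
        var = 1.0 / prec + sigma**2
        log_terms.append(np.log(l_j / (i + kappa))
                         - 0.5 * np.log(2 * np.pi * var) - 0.5 * (y_new - mean)**2 / var)
    # opening a new cluster: prior predictive, weighted by the novelty probability
    var0 = tau0**2 + sigma**2
    log_terms.append(np.log(kappa / (i + kappa))
                     - 0.5 * np.log(2 * np.pi * var0) - 0.5 * (y_new - mu0)**2 / var0)
    log_terms = np.array(log_terms)
    m = log_terms.max()
    w = np.exp(log_terms - m)
    j = rng.choice(len(w), p=w / w.sum())                 # z_{n,n} ~ pi_n(z_{n,n} | z_{n,-n})
    if j == len(counts):                                  # a new cluster is opened
        counts, sums = np.append(counts, 0), np.append(sums, 0.0)
    counts[j] += 1
    sums[j] += y_new
    return {'counts': counts, 'sums': sums}, m + np.log(w.sum())   # log v_n^{pf}
```

Starting from an empty particle {'counts': np.array([], dtype=int), 'sums': np.array([])} and repeatedly calling pf_update on a data stream while accumulating the returned log weights reproduces the sequential importance sampler that the block-update kernels above are designed to improve upon.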

While we are able to escape low probability barriers, the modified full conditional distribution introduces a further approximation. Thus, we advise choosing ρ_n so that it converges to the true κ with increasing time index n. Note that this is merely a choice, not a requirement, in contrast to a tempered Gibbs sampler where the final density must coincide with the true target.

4.2 Sequential approximation

As we rely on a blocked Gibbs sampler, we are constrained to low dimensional blocks. The key idea in this method is to approximate the exact full conditional distributions given by Eq.(14) and Eq.(15) sequentially. As in the previous section, we are free to choose any approximation to the full conditionals, as these are merely used as our proposal density. Asymptotically, the SMC sampler ensures convergence to the true target posterior even when approximations to these full conditional distributions are used. Note that the approximations should be selected as close as possible to the full conditionals to obtain an efficient sampler.

We assume that there are Q clustering variables in the active block and we further enumerate them as

z_{n,r} = {z_{n,r_1}, ..., z_{n,r_Q}}   (21)

where r_q denotes the q-th index of the block at time n, with q = 1,...,Q. In the sequel, we design an approximation that leads to kernels whose computational load increases only linearly with Q. Hence, we can choose an active block with a size Q that is quite large in practice.

We propose the following approximation to the forward kernel K_n,

K_n(z_{n−1}, z_n) = δ_{z_{n−1,−u}}(z_{n,−u}) π̂_n(z_{n,u} | z_{n,−u})   (22)

where

π̂_n(z_{n,u} | z_{n,−u}) = π_n(z_{n,n} | z_{n,r}, z_{n,−u}, ρ_n) ∏_{i=1}^{Q} π_{n−1,−r_{i+1:Q}}(z_{n,r_i} | z_{n,−u}, z_{n,r_{1:i−1}}, ρ_n).   (23)

We assume r_{i:j} is empty for i > j. The rationale behind this choice is as follows: we drop all the observations corresponding to the active block, including the last observation, and incorporate them one by one in a new (possibly random) order. Recall that in Eq.(23), π_{n−1,−r_{i+1:Q}}(z | z′, ρ_n) = p(z | z′, {y_{n−1,1:n−1}} − {y_{n−1,r_{i+1}}, ..., y_{n−1,r_Q}}) denotes the modified full conditional distribution given all the observations excluding the ones indexed by r_{i+1:Q}. Note that approximations of this form are quite common in approximate inference for state space models, where q corresponds to a time index; we merely omit the effect of the 'future' observations.

The formulation given by Eq.(23) enables us to recursively calculate and sample the overall kernel density function π̂_n(z_{n,u} | z_{n,−u}) efficiently, with a reasonable complexity even for large values of Q. Note that the scheme introduced in Eq.(23) processes the observations sequentially in the indexed order {r_1, ..., r_Q} and finally extends the space using the proposal function π_n(z_{n,n} | z_{n,r}, z_{n,−u}, ρ_n). Clearly, due to the exchangeability of the DPM model, there is no need to process the observations in a fixed order. To diminish the effects of a particular processing order, it is preferable to apply a random permutation to the indices in r at each step of the algorithm.

A similar sequential procedure is also required to approximate the backward kernel, given by

L_{n−1}(z_n, z_{n−1}) = δ_{z_{n,−u}}(z_{n−1,−u}) π̂_{n−1}(z_{n−1,r} | z_{n−1,−r})   (24)

where

π̂_{n−1}(z_{n−1,r} | z_{n−1,−r}) = ∏_{i=1}^{Q} π_{n−1,−r_{i+1:Q}}(z_{n−1,r_i} | z_{n−1,−u}, z_{n−1,r_{1:i−1}}, ρ′_n).   (25)

According to the resampling scheme given in Section 4.4, it is convenient to select ρ′_n = ρ_n in order to construct a good approximation to the optimal backward kernel. Note that, given the same index order r_1,...,r_Q, Eq.(25) has the same functional form as the rightmost factor of Eq.(23) when ρ′_n = ρ_n.

Finally, using the given approximations for the forward and backward kernels, the weight update equation can be arranged according to Eq.(9) as follows,

v_n^{sq}(z_{n−1}, z_n) = ( γ_n(z_n) / γ_{n−1}(z_{n−1}) ) × ( π̂_{n−1}(z_{n−1,r} | z_{n−1,−r}) / π̂_n(z_{n,u} | z_{n,−u}) ).   (26)
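The recursive construction of Eq.(23) can be sketched in Python as follows. This is our illustration; log_full_conditional is a hypothetical helper that returns the log-probabilities of the candidate labels for one observation under the DPM model, with the observations listed in `excluded` removed and novelty parameter ρ.

```python
import numpy as np

def sequential_block_proposal(z, r, n, log_full_conditional, rho, rng):
    """Sample z_{n,u} from the sequential approximation pi_hat_n of Eq.(23) and return its
    log-density, which is needed in the weight update of Eq.(26)."""
    z = np.array(z)                               # work on a copy of the label vector
    order = list(rng.permutation(r))              # random processing order, as suggested above
    log_q = 0.0
    for i, idx in enumerate(order):
        excluded = order[i + 1:] + [n]            # drop not-yet-processed block members and y_n
        logp = log_full_conditional(z, idx, excluded, rho)
        probs = np.exp(logp - logp.max())
        probs /= probs.sum()
        z[idx] = rng.choice(len(probs), p=probs)  # reassign observation idx
        log_q += np.log(probs[z[idx]])
    # finally extend the space with the newest observation, conditioning on everything
    logp = log_full_conditional(z, n, [], rho)
    probs = np.exp(logp - logp.max())
    probs /= probs.sum()
    z[n] = rng.choice(len(probs), p=probs)
    log_q += np.log(probs[z[n]])
    return z, log_q
```

A companion routine that evaluates (rather than samples) the same product at the old labels z_{n−1,r}, with ρ′_n = ρ_n and the final extension step omitted, gives the backward-kernel density of Eq.(25) required by Eq.(26).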

4.3 Mixture kernels

In Monte Carlo computations for solving high dimensional complex problems, it is common practice to resort to a collection of kernels rather than committing to a fixed choice. One can define a mixture kernel in the context of an SMC algorithm as follows,

K_n(z_{n−1}, z_n) = ∑_{m=1}^{M} α_n^m(z_{n−1}) K_n^m(z_{n−1}, z_n)   (27)

where m ∈ {1,...,M} is the mixture label, α_n^m denotes the selection probability of the m-th mixture component at time n, ∑_{m=1}^{M} α_n^m(z_{n−1}) = 1, and K_n^m(z_{n−1}, z_n) denotes the forward kernel corresponding to the m-th component. In order to circumvent the computational burden of Eq.(27), a backward kernel of the same mixture form is proposed in (DelMoral et al., 2006),

L_n(z_n, z_{n−1}) = ∑_{m=1}^{M} β_n^m(z_n) L_n^m(z_n, z_{n−1})   (28)

where β_n^m is the backward mixture component selection probability at time n, ∑_{m=1}^{M} β_n^m(z_{n−1}) = 1. This definition enables us to perform importance sampling on an extended space E × E × M by introducing a latent kernel selector variable M_n taking values in M = {1,...,M}, m ∈ M. Consequently, the weight function for each mixture component can be expressed as given in Eq.(29),

v_n(z_{n−1}, z_n, m) = ( γ_n(z_n) / γ_{n−1}(z_{n−1}) ) × ( L_{n−1}^m(z_n, z_{n−1}) β_n^m(z_n) ) / ( K_n^m(z_{n−1}, z_n) α_n^m(z_{n−1}) ).   (29)

In our work we define three different algorithms, labeled SMC-1, SMC-2 and SMC-3, each utilizing a different kernel. The SMC-1 algorithm employs the forward kernel given by Eq.(18) and updates the weights according to Eq.(19).

The SMC-2 algorithm uses a mixture kernel in order to allow the Gibbs sampler to make global moves in the DPM space. When the selection probabilities α and β are chosen equal and independent of z_n and z_{n−1} respectively, the mixture weight update functions for m = 1 and m = 2 are given by Eq.(19) and Eq.(20), respectively.

In SMC-3 we use a mixture of the forward kernels given by Eq.(18) and Eq.(22), where the sequential construction enables us to define a backward kernel that is independent of the modified forward kernel parameter ρ_n. The associated weight update functions are calculated according to Eq.(19) and Eq.(26), respectively, when the selection probabilities α and β are equal and chosen independent of z.

4.4 Algorithmic details

As noted before, in our work we propose an active block to be updated as each new observation arrives. In order to limit the computational cost required at each time step, we determine a constant block size Q and index the block with r_1,...,r_Q. Similar block update strategies have also been proposed in (Doucet et al., 2006) under the SMC samplers framework. In our scheme the indices of the active block r_1,...,r_Q are incremented by Q as each new observation arrives. Whenever the last index value satisfies r_Q > n, we reset r = {1,...,Q}. Note that, according to the sequential construction defined in Eq.(23), the approximations will be less accurate for the SMC-3 algorithm as the block size Q increases.

In order to prevent particle degeneracy, in an SMC framework it is required to perform occasional resampling steps when the effective sample size drops below a predefined threshold. Intuitively, this step selects the highly weighted particles and discards the lowly weighted ones. However, discarding the low weighted particles prematurely may prevent an algorithm from exploring promising modes of the time evolving posterior distribution. It is quite common in practice that a mode which is initially less dominant becomes more pronounced when a larger fraction of the data is processed. Hence, for the DPM model, we found it crucial to apply the resampling step on the modified target distribution π(·|ρ) instead of the true target posterior π, in order to prevent the low weighted particles from being discarded too early.

We calculate the weights as follows. We first calculate the unnormalized weights for the modified target distribution π_n(·|ρ_n) according to w′_n = w_n × γ_n(z_n|ρ_n)/γ_n(z_n). Assuming that {W_n^{′(i)}} represents the normalized weights, we apply systematic resampling if the effective sample size, N_eff = 1 / ∑_{i=1}^{N_p} (W_n^{′(i)})², is below a predefined threshold. Following the resampling, a reweighting step, w_n = γ_n(z_n)/γ_n(z_n|ρ_n), is carried out in order to find the weights approximating the true target posterior.
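The reweight-resample-reweight procedure of Section 4.4 could be coded along the following lines. This is a sketch under our own conventions, not the authors' implementation; log_gamma and log_gamma_rho stand for the log unnormalized posterior and its ρ-modified version.

```python
import numpy as np

def resample_on_modified_target(particles, log_w, log_gamma, log_gamma_rho,
                                ess_threshold, rng):
    """Section 4.4 resampling step: move the weights to the modified target gamma(.|rho),
    resample systematically if the ESS is low, then reweight back to the true target."""
    # w'_n = w_n * gamma_n(z|rho) / gamma_n(z)
    log_w_mod = np.array([lw + log_gamma_rho(z) - log_gamma(z)
                          for z, lw in zip(particles, log_w)])
    W = np.exp(log_w_mod - log_w_mod.max())
    W /= W.sum()
    if 1.0 / np.sum(W ** 2) < ess_threshold:                      # N_eff on the modified target
        u = (rng.random() + np.arange(len(W))) / len(W)           # systematic resampling
        idx = np.minimum(np.searchsorted(np.cumsum(W), u), len(W) - 1)
        particles = [particles[i] for i in idx]
        log_w_mod = np.full(len(W), -np.log(len(W)))              # equal weights after resampling
    # reweight back to the true target: w_n = w'_n * gamma_n(z) / gamma_n(z|rho)
    log_w = np.array([lw + log_gamma(z) - log_gamma_rho(z)
                      for z, lw in zip(particles, log_w_mod)])
    return particles, log_w
```

If the resampling is not triggered, the final reweighting simply undoes the first one, so the true-target weights are left unchanged.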


5 Test Results

Our goal in this section is to illustrate the effectiveness of the SMC samplers framework for online inference in DPM models. For this purpose, we compare the performance of the SMC samplers detailed in Section 4.3, namely SMC-1, SMC-2 and SMC-3, the particle filter (PF) (MacEachern et al., 1999; Fearnhead, 2004), and a batch algorithm, the Gibbs sampler (GS) (MacEachern, 1994). Performance is reported in terms of log-marginal likelihoods, mean and variance estimates, and respective standard errors. Mixture density estimates are also provided for visual comparison.

The problem is the standard Gaussian mixture density estimation problem with an unknown number of components. Our model is standard and assumes that observations y are drawn from a univariate Gaussian with unknown mean µ and variance σ², θ = {µ, σ²}, where the number of mixtures is unknown. The distributions of the parameters µ and σ² are chosen as normal and inverse-gamma, respectively, to ensure the conjugacy condition.

Table 1: True model parameters

              Mixture weights   Mean       Std. dev.
Data-1 (D-1)  1/3, 1/3, 1/3     0, 1.5, 3  0.5, 0.5, 0.5
Data-2 (D-2)  1/2, 1/6, 1/3     0, 2, 4    0.5, 0.5, 2.5

Apart from the resampling threshold and the number of particles, several algorithm parameters need to be set: the selection probabilities of the forward and backward kernels (α and β), the active block size (q), and the parameter sequence ρ_n. The selection probabilities determine the shape of the forward and backward kernels, and therefore an appropriate choice is crucial. The selection probabilities of the forward and backward kernels are set to α^m = {0.9, 0.1} and β^m = {0.9, 0.1}, respectively. Note that m = 2 corresponds to the modified kernel component, and we observed in practice that a small weight is often enough to obtain good mixing. Increasing the weight of the modified kernel often increases the algorithm's ability to explore new modes. Even a single kernel with α^m = {0, 1} can be used for certain datasets where the modes are highly isolated. The parameter sequence ρ_n is updated according to a geometric update function (Neal, 2001), where the common parameter is set to 1/150 and the initial value is set to ρ_1 = 1. The active block size q is set to 4. This choice seems to balance computational burden well against inference quality.

To alleviate degeneracy, we apply a systematic resampling scheme. The resampling scheme for SMC-2 and SMC-3 is applied according to Section 4.4. For a fair comparison the particle size is selected as N_p = 1000 for the particle filter and N_p = 200 for the SMC algorithms, and we performed 1000 iterations of the Gibbs sampler, where the first 300 were used for the burn-in period. All results are reported over 100 independent Monte Carlo runs. We selected two test sets (D-1 and D-2) generated from a Gaussian mixture model. Each data set has 1000 points, and the results are reported sequentially for 200, 500 and 1000 observations. Both datasets are generated from a model comprising three mixture components, with parameters given in Table 1. In order to evaluate the performance of the proposed kernels, we performed two tests that aim to assess the mixing property (ability to escape local modes) as well as the consistency and quality of the estimate (bias and low variance).

Sampling based inference schemes get stuck in local modes of the posterior distribution, particularly when the novelty parameter is chosen too small for the given problem. In order to compare the mixing property of the proposed algorithms, we set the novelty parameter to a very low value of κ = 0.05 and compare the mixture densities estimated by the SMC-1 and SMC-3 algorithms, respectively. We performed the test by generating a total of 1000 observations from the model D-1, which comprises three overlapping mixture components. As a gold standard reference we performed a very long Gibbs sampler run and observed that the estimated number of components is 2.16, 3.09 and 3.11 for 200, 500 and 1000 observations, respectively.

[Figure 1: Estimated mixture densities (PDF) obtained by (a) the SMC-1 and (b) the SMC-3 algorithm over 50 Monte Carlo runs.]

Fig.1(a) and (b) respectively illustrate the estimated mixture densities obtained by SMC-1 and SMC-3 over 50 Monte Carlo runs. It is clear that SMC-3 can represent all three components of the mixture density in all runs. However, in nearly half of the runs SMC-1 estimates 2 mixture components because it gets stuck in a local mode. The mean estimates of the log-marginal likelihood and the number of components are given in Table 2 for SMC-1, SMC-2, SMC-3, PF and GS. The results show that the mean estimates of SMC-2 and SMC-3 are very close to the long run estimate of the Gibbs sampler, whereas SMC-1, PF and GS underestimate the mean value even when the observation size is 1000, and GS requires a longer burn-in period in order to converge to the true posterior distribution. SMC-2 and SMC-3 are also superior in terms of marginal log-likelihoods.

Table 2: Estimated average log-marginal likelihoods, mean values and respective Monte Carlo errors (in parentheses) for SMC-1, SMC-2, SMC-3, PF and GS.

Dataset-1 (D-1), κ = 0.05
                          Estimated mean
Algo.   Log-marg.         200            500            1000
SMC-1   -723.4 (102.8)    2.11 (0.014)   2.51 (0.233)   2.67 (0.243)
SMC-2   -711.2 (4.41)     2.15 (0.006)   3.10 (0.025)   3.10 (0.020)
SMC-3   -711.1 (3.22)     2.15 (0.007)   3.09 (0.011)   3.09 (0.010)
PF      -727.6 (52.9)     2.10 (0.015)   2.35 (0.181)   2.49 (0.249)
GS      —                 —              —              2.69 (0.239)


In order to assess the consistency of the proposed algorithms, their performance for a different parameter setting is reported in Table 3 for D-1 and D-2, respectively. The novelty parameter is set to κ = 0.5, which prevents the algorithms from getting stuck at a local solution. The mean estimate of the long Gibbs sampler run is 3.73 for D-1 and 4.63 for D-2 at n = 1000. As shown in Table 3, PF, GS and the SMC algorithms all provide mean estimates very close to the long run Gibbs sampler for D-1.

Table 3: Estimated average log-marginal likelihoods, mean values and respective Monte Carlo errors (in parentheses) for SMC-1, SMC-2, SMC-3, PF and GS.

Dataset-1 (D-1), κ = 0.5
                          Estimated mean
Algo.   Log-marg.         200            500            1000
SMC-1   -708.9 (1.54)     2.99 (0.038)   3.61 (0.061)   3.71 (0.056)
SMC-2   -708.6 (0.96)     3.00 (0.025)   3.61 (0.044)   3.69 (0.036)
SMC-3   -708.9 (1.20)     2.96 (0.025)   3.60 (0.042)   3.69 (0.038)
PF      -712.4 (9.86)     2.98 (0.041)   3.70 (0.272)   3.79 (0.293)
GS      —                 —              —              3.68 (0.055)

Dataset-2 (D-2), κ = 0.5
                          Estimated mean
Algo.   Log-marg.         200            500            1000
SMC-1   -1117.3 (0.35)    4.14 (0.035)   4.54 (0.068)   4.65 (0.091)
SMC-2   -1117.3 (0.31)    4.14 (0.021)   4.53 (0.064)   4.63 (0.129)
SMC-3   -1117.2 (0.29)    4.13 (0.019)   4.50 (0.054)   4.58 (0.086)
PF      -1117.7 (0.98)    4.14 (0.030)   4.56 (0.119)   4.73 (0.281)
GS      —                 —              —              4.68 (0.095)

The Monte Carlo standard error of the mean estimate for the particle filter gradually increases with the observation size and reaches a value of 0.293 at n = 1000, whereas all three SMC samplers achieve approximately 8 times lower errors. It can be concluded that all SMC algorithms provide a significant performance improvement over PF at the same computational cost, and they are also more reliable. When we compare SMC-1, SMC-2 and SMC-3, none of the algorithms outperforms the others in terms of Monte Carlo error. The results also show that the performance of the SMC algorithms and GS is comparable for the dataset D-1 when κ = 0.5.

A similar performance is also observed for the dataset D-2. The mean estimates of PF, GS, the SMC samplers and the long run Gibbs sampler are very close. All three SMC algorithms and GS provide comparable Monte Carlo errors for the mean estimates, while these remain lower than the errors of PF. The results obtained for D-1 and D-2 also indicate that SMC-2 and SMC-3 achieve reliable estimates for different parameter settings.

6 Conclusion

In this paper we propose novel sequential Monte Carlo algorithms for the DPM model with a conjugate prior. In contrast to the existing sequential importance sampling methods, the local moves are designed to update clustering labels, which enables the introduced algorithms to obtain efficient samples from the time evolving posterior even for large dataset sizes.

We have evaluated the performance of the conventional particle filter, the Gibbs sampler and the SMC samplers with different kernel settings on two different datasets. Test results show that SMC based methods provide more reliable estimates compared to the conventional particle filter, and that the proposed mixture kernel can better represent the modes of the posterior distribution compared to an SMC sampler utilizing only Gibbs moves. We also conclude that the SMC framework is a competitive alternative to the conventional Gibbs sampler for DPM models. As future work, we envision various applications in hierarchical Bayesian models with a DPM prior. Finally, while we have concentrated exclusively on the conjugate setting, we believe that the actual added benefit of the SMC framework can be realized in the non-conjugate setting, where the model parameters also need to be sampled.

References

Antoniak, C. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2:1152–1174.

Blei, D. M. and Jordan, M. I. (2004). Variational methods for the Dirichlet process. In Proceedings of the 21st International Conference on Machine Learning.

Chen, R. and Liu, J. S. (2000). Mixture Kalman filters. J. Roy. Stat. Soc. B Stat. Meth., 62:493–508.

DelMoral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. J. Roy. Stat. Soc. B Stat. Meth., 68:411–436.

Doucet, A., Briers, M., and Senecal, S. (2006). Efficient block sampling strategies for sequential Monte Carlo methods. J. Comput. Graph. Stat., 15(3):693–711.

Escobar, M. and West, M. (1992). Computing Bayesian nonparametric hierarchical models. Technical report, Duke University, Durham, USA.

Fearnhead, P. (2004). Particle filters for mixture models with an unknown number of components. Statistics and Computing, 14:11–21.

Jain, S. and Neal, R. (2000). A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. J. Comput. Graph. Stat., 13:158–182.

Liu, J. S. (1996). Nonparametric hierarchical Bayes via sequential imputations. Ann. Statist., 24:910–930.

MacEachern, S. N. (1994). Estimating normal means with a conjugate style Dirichlet process prior. Comm. Statist. Simulation Comput., 23:727–741.

MacEachern, S. N., Clyde, M., and Liu, J. (1999). Sequential importance sampling for nonparametric Bayes models: the next generation. Can. J. Stat., 27:251–267.

Neal, R. (2001). Annealed importance sampling. Statist. Comput., 11:125–139.

Quintana, F. A. (1996). Nonparametric Bayesian analysis for assessing homogeneity in k×l contingency tables with fixed right margin totals. J. Am. Stat. Assoc., 93:1140–1149.

Quintana, F. A. and Newton, M. A. (1998). Computational aspects of nonparametric Bayesian analysis with applications to the modeling of multiple binary sequences. J. Comput. Graph. Stat., 9:711–737.
