Dirichlet Process Gaussian Mixture Models: Choice of the Base Distribution
Görür D, Rasmussen CE. Dirichlet process Gaussian mixture models: Choice of the base distribution. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 25(4): 615-626, July 2010. DOI 10.1007/s11390-010-1051-1

Dilan Görür^1 and Carl Edward Rasmussen^{2,3}

1 Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.
2 Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, U.K.
3 Max Planck Institute for Biological Cybernetics, D-72076 Tübingen, Germany

E-mail: [email protected]; [email protected]

Received May 2, 2009; revised February 10, 2010.

Abstract: In the Bayesian mixture modeling framework it is possible to infer the necessary number of components to model the data, and therefore it is unnecessary to explicitly restrict the number of components. Nonparametric mixture models sidestep the problem of finding the "correct" number of mixture components by assuming infinitely many components. In this paper Dirichlet process mixture (DPM) models are cast as infinite mixture models and inference using Markov chain Monte Carlo is described. The specification of the priors on the model parameters is often guided by mathematical and practical convenience. The primary goal of this paper is to compare the choice of conjugate and non-conjugate base distributions on a particular class of DPM models which is widely used in applications, the Dirichlet process Gaussian mixture model (DPGMM). We compare the computational efficiency and modeling performance of the DPGMM defined using a conjugate and a conditionally conjugate base distribution. We show that better density models can result from using a wider class of priors, with no or only a modest increase in computational effort.

Keywords: Bayesian nonparametrics, Dirichlet processes, Gaussian mixtures

This work is supported by the Gatsby Charitable Foundation and PASCAL2.

1 Introduction

Bayesian inference requires assigning prior distributions to all unknown quantities in a model. The uncertainty about the parametric form of the prior distribution can be expressed by using a nonparametric prior. The Dirichlet process (DP) is one of the most prominent random probability measures due to its richness, computational ease, and interpretability. It can be used to model the uncertainty about the functional form of the distribution of the parameters in a model.

The DP, first defined using Kolmogorov consistency conditions[1], can be characterised from several different perspectives. It can be obtained by normalising a gamma process[1]. Using exchangeability, a generalisation of the Pólya urn scheme leads to the DP[2]. A closely related sequential process is the Chinese restaurant process (CRP)[3-4], a distribution over partitions, which also results in the DP when each partition is assigned an independent parameter with a common distribution. A constructive definition of the DP was given by [5], which leads to the characterisation of the DP as a stick-breaking prior[6].

Hierarchical models in which the DP is used as a prior over the distribution of the parameters are referred to as Dirichlet process mixture (DPM) models, also called mixture of Dirichlet process models by some authors, following [7]. Construction of the DP using a stick-breaking process or a gamma process represents the DP as a countably infinite sum of atomic measures. These approaches suggest that a DPM model can be seen as a mixture model with infinitely many components. The distribution of parameters imposed by a DP can also be obtained as a limiting case of a parametric mixture model[8-11]. This approach shows that a DPM can easily be defined as an extension of a parametric mixture model, without the need to do model selection for determining the number of components to be used. Here, we take this approach to extend simple finite mixture models to Dirichlet process mixture models.
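The "countably infinite sum of atomic measures" view can be made concrete with a small numerical sketch of the stick-breaking construction. The sketch below is an illustration only, not part of the samplers developed in this paper; the truncation level T and the standard normal base distribution are arbitrary choices for the example.

```python
import numpy as np

def stick_breaking_dp_sample(alpha, base_sampler, T=1000, rng=None):
    """Approximate draw G = sum_k w_k * delta_{theta_k} from DP(alpha, G0),
    truncated at T atoms (the mass beyond T is negligible for large T)."""
    rng = np.random.default_rng() if rng is None else rng
    betas = rng.beta(1.0, alpha, size=T)            # stick-breaking proportions
    sticks = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights = betas * sticks                        # w_k = beta_k * prod_{l<k} (1 - beta_l)
    atoms = base_sampler(T, rng)                    # theta_k ~ G0, i.i.d.
    return weights, atoms

# Example with G0 = N(0, 1); a small alpha concentrates mass on few atoms.
w, th = stick_breaking_dp_sample(alpha=1.0,
                                 base_sampler=lambda T, rng: rng.standard_normal(T))
print("mass in the 10 largest atoms:", np.sort(w)[::-1][:10].sum())
```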
The DP is defined by two parameters, a positive scalar α and a probability measure G_0, referred to as the concentration parameter and the base measure, respectively. The base distribution G_0 is the parameter on which the nonparametric distribution is centered, and can be thought of as the prior guess[7]. The concentration parameter α expresses the strength of belief in G_0. For small values of α, a sample from the DP is likely to be composed of a small number of atomic measures with large weights. For large values, most samples are likely to be distinct, and thus concentrated on G_0 (see the simulation sketch at the end of this section).

The form of the base distribution and the value of the concentration parameter are important parts of the model choice that will affect the modeling performance. The concentration parameter can be given a vague prior distribution and its value can be inferred from the data. It is harder to decide on the base distribution, since the model performance will heavily depend on its parametric form even if it is defined in a hierarchical manner for robustness. Generally, the choice of the base distribution is guided by mathematical and practical convenience. In particular, conjugate distributions are preferred for computational ease. It is important to be aware of the implications of the particular choice of the base distribution. An important question is whether the modeling performance is weakened by using a conjugate base distribution instead of a less restricted distribution. A related question is whether inference is actually computationally cheaper for the conjugate DPM models.

The Dirichlet process Gaussian mixture model (DPGMM) with both conjugate and non-conjugate base distributions has been used extensively in applications of DPM models for density estimation and clustering[11-15]. However, the performance of the models under these different prior specifications has not been compared. For Gaussian mixture models, the conjugate (Normal-Inverse-Wishart) priors have some unappealing properties, with prior dependencies between the mean and the covariance parameters, see e.g., [16]. The main focus of this paper is an empirical evaluation of the differences between the modeling performance of the DPGMM with conjugate and non-conjugate base distributions.

Markov chain Monte Carlo (MCMC) techniques are the most popular tools used for inference in DPM models. Inference algorithms for DPM models with a conjugate base distribution are relatively easier to implement than for the non-conjugate case. Nevertheless, several MCMC algorithms have also been developed for the general case[17-18]. In our experiments, we also compare the computational cost of inference under the conjugate and the non-conjugate model specifications.

We define the DPGMM with a non-conjugate base distribution by removing the undesirable dependency between the distributions of the mean and the covariance parameters. This results in what we refer to as a conditionally conjugate base distribution, since either one of the parameters (mean or covariance) can be integrated out conditioned on the other, but both parameters cannot simultaneously be integrated out. In the following, we give formulations of the DPGMM with both a conjugate and a conditionally conjugate base distribution G_0. For both prior specifications, we define hyperpriors on G_0 for robustness, building upon [11]. We refer to the models with the conjugate and the conditionally conjugate base distributions as the conjugate model and the conditionally conjugate model, respectively. After specifying the model structure, we discuss in detail how to do inference in both models. We show that the mixing performance of the non-conjugate sampler can be improved substantially by exploiting the conditional conjugacy. We present experimental results comparing the two prior specifications. Both predictive accuracy and computational demand are compared systematically on several artificial and real multivariate density estimation problems.
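As an illustration of the role of α described above, the following sketch simulates the Chinese restaurant process seating scheme and reports the number of occupied components for a few arbitrarily chosen values of α; the helper crp_table_counts is a hypothetical name for this example, not code from the paper.

```python
import numpy as np

def crp_table_counts(alpha, n, rng):
    """Sequentially seat n customers according to CRP(alpha); return table sizes."""
    counts = []
    for i in range(n):
        # Customer i joins table j with prob counts[j]/(i + alpha),
        # or opens a new table with prob alpha/(i + alpha).
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)      # open a new table
        else:
            counts[k] += 1        # join an existing table
    return counts

rng = np.random.default_rng(0)
for alpha in (0.5, 5.0, 50.0):
    sizes = crp_table_counts(alpha, n=1000, rng=rng)
    print(f"alpha={alpha:5.1f}: {len(sizes)} occupied components, "
          f"largest holds {max(sizes)} points")
```

Larger α yields many small components (samples spread out towards G_0), while small α concentrates the mass on a few large components, matching the qualitative description above.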
2 Dirichlet Process Gaussian Mixture Models

A DPM model can be constructed as the limit of a parametric mixture model[8-11]. We start by setting out the hierarchical Gaussian mixture model formulation, and then take the limit as the number of mixture components approaches infinity to obtain the Dirichlet process mixture model.

Throughout the paper, vector quantities are written in bold. The index i always runs over observations, i = 1, \ldots, n, and the index j runs over components, j = 1, \ldots, K. Generally, variables that play no role in conditional distributions are dropped from the condition for simplicity.

The Gaussian mixture model with K components may be written as

    p(\mathbf{x} \mid \theta_1, \ldots, \theta_K) = \sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, S_j),    (1)

where \theta_j = \{\boldsymbol{\mu}_j, S_j, \pi_j\} is the set of parameters for component j, the \pi_j are the mixing proportions (which must be positive and sum to one), \boldsymbol{\mu}_j is the mean vector of component j, and S_j is its precision (inverse covariance) matrix.

Defining a joint prior distribution G_0 on the component parameters and introducing indicator variables c_i, i = 1, \ldots, n, the model can be written as:

    \mathbf{x}_i \mid c_i, \Theta \sim \mathcal{N}(\boldsymbol{\mu}_{c_i}, S_{c_i})
    c_i \mid \boldsymbol{\pi} \sim \mathrm{Discrete}(\pi_1, \ldots, \pi_K)
    (\boldsymbol{\mu}_j, S_j) \sim G_0
    \boldsymbol{\pi} \mid \alpha \sim \mathrm{Dir}(\alpha/K, \ldots, \alpha/K).    (2)

Given the mixing proportions \boldsymbol{\pi}, the distribution of the number of observations assigned to each component, called the occupation number, is multinomial,

    p(n_1, \ldots, n_K \mid \boldsymbol{\pi}) = \frac{n!}{n_1! \, n_2! \cdots n_K!} \prod_{j=1}^{K} \pi_j^{n_j},

and the distribution of the indicators is

    p(c_1, \ldots, c_n \mid \boldsymbol{\pi}) = \prod_{j=1}^{K} \pi_j^{n_j}.

Integrating out the mixing proportions using the standard Dirichlet integral gives the conditional prior of a single indicator given all the others,

    p(c_i = j \mid \mathbf{c}_{-i}, \alpha) = \frac{n_{-i,j} + \alpha/K}{n - 1 + \alpha},

where n_{-i,j} denotes the number of observations other than \mathbf{x}_i assigned to component j. In the limit K \to \infty, this conditional prior becomes n_{-i,j}/(n - 1 + \alpha) for each component with n_{-i,j} > 0, and \alpha/(n - 1 + \alpha) for all the (infinitely many) empty components combined. Note that these probabilities are the same as the probabilities of seating a new customer in a Chinese restaurant process[4]. Thus, the infinite limit of the model in (2) is equivalent to a DPM, which we define by starting with Gaussian distributed data \mathbf{x}_i \mid \theta_i \sim \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_i, S_i), with a random distribution of the model parameters (\boldsymbol{\mu}_i, S_i) \sim G drawn from a DP, G \sim \mathrm{DP}(\alpha, G_0).
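As a concrete reading of (1) and (2), the following sketch generates data from the finite symmetric-Dirichlet mixture. It is a minimal one-dimensional illustration under an assumed placeholder base distribution G_0 (an independent Normal on the mean and Gamma on the precision), not the conjugate or conditionally conjugate specifications analysed later in this paper.

```python
import numpy as np

def sample_finite_gmm(n, K, alpha, rng):
    """Generate data from the finite model (2), in one dimension for simplicity.
    Placeholder G0 for illustration: mu_j ~ N(0, 3^2), precision S_j ~ Gamma(2, 1/2)."""
    pi = rng.dirichlet(np.full(K, alpha / K))      # pi | alpha ~ Dir(alpha/K, ..., alpha/K)
    mu = rng.normal(0.0, 3.0, size=K)              # (mu_j, S_j) ~ G0
    S = rng.gamma(shape=2.0, scale=0.5, size=K)
    c = rng.choice(K, size=n, p=pi)                # c_i | pi ~ Discrete(pi_1, ..., pi_K)
    x = rng.normal(mu[c], 1.0 / np.sqrt(S[c]))     # x_i | c_i ~ N(mu_{c_i}, S_{c_i}^{-1})
    return x, c

rng = np.random.default_rng(1)
x, c = sample_finite_gmm(n=500, K=20, alpha=2.0, rng=rng)
print("occupied components:", len(np.unique(c)))   # typically far fewer than K
```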
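The limiting behaviour of the conditional indicator prior can also be checked numerically: for a fixed assignment of the other n - 1 observations, the finite-K probabilities (n_{-i,j} + \alpha/K)/(n - 1 + \alpha) approach the CRP seating probabilities as K grows. The counts and the value of α below are made-up numbers for illustration.

```python
import numpy as np

# Hypothetical assignment of the other n - 1 = 9 observations to 3 occupied components.
n_minus_i = np.array([5, 3, 1])
alpha, n = 1.5, 10

for K in (10, 100, 10000):
    occupied = (n_minus_i + alpha / K) / (n - 1 + alpha)
    # The remaining K - 3 components are empty; their combined probability mass:
    empty = (K - len(n_minus_i)) * (alpha / K) / (n - 1 + alpha)
    print(f"K={K:6d}: occupied {occupied.round(4)}, new component {empty:.4f}")

# CRP seating probabilities (the K -> infinity limit):
print("CRP limit: occupied", (n_minus_i / (n - 1 + alpha)).round(4),
      f", new component {alpha / (n - 1 + alpha):.4f}")
```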