
Görür D, Rasmussen C E. Gaussian mixture models: Choice of the base distribution. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 25(4): 615–626, July 2010. DOI 10.1007/s11390-010-1051-1

Dirichlet Process Gaussian Mixture Models: Choice of the Base Distribution

Dilan Görür¹ and Carl Edward Rasmussen²,³

¹ Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.
² Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, U.K.
³ Max Planck Institute for Biological Cybernetics, D-72076 Tübingen, Germany
E-mail: [email protected]; [email protected]
Received May 2, 2009; revised February 10, 2010.

Abstract  In the Bayesian mixture modeling framework it is possible to infer the necessary number of components to model the data, and it is therefore unnecessary to explicitly restrict the number of components. Nonparametric mixture models sidestep the problem of finding the "correct" number of mixture components by assuming infinitely many components. In this paper Dirichlet process mixture (DPM) models are cast as infinite mixture models and inference using Markov chain Monte Carlo is described. The specification of the priors on the model parameters is often guided by mathematical and practical convenience. The primary goal of this paper is to compare the choice of conjugate and non-conjugate base distributions on a particular class of DPM models which is widely used in applications, the Dirichlet process Gaussian mixture model (DPGMM). We compare the computational efficiency and modeling performance of the DPGMM defined using a conjugate and a conditionally conjugate base distribution. We show that better density models can result from using a wider class of priors, with no or only a modest increase in computational effort.

Keywords  Bayesian nonparametrics, Dirichlet processes, Gaussian mixtures

1 Introduction

Bayesian modeling requires assigning prior distributions to all unknown quantities in a model. The uncertainty about the parametric form of the prior distribution can be expressed by using a nonparametric prior. The Dirichlet process (DP) is one of the most prominent random probability measures due to its richness, computational ease, and interpretability. It can be used to model the uncertainty about the functional form of the distribution for parameters in a model.

The DP, first defined using Kolmogorov consistency conditions[1], can be defined from several different perspectives. It can be obtained by normalising a gamma process[1]. Using exchangeability, a generalisation of the Pólya urn scheme leads to the DP[2]. A closely related sequential process is the Chinese restaurant process (CRP)[3-4], a distribution over partitions, which also results in the DP when each partition is assigned an independent parameter with a common distribution. A constructive definition of the DP was given by [5], which leads to the characterisation of the DP as a stick-breaking prior[6].

The hierarchical models in which the DP is used as a prior over the distribution of the parameters are referred to as Dirichlet process mixture (DPM) models, also called mixture of Dirichlet process models by some authors due to [7]. Construction of the DP using a stick-breaking process or a gamma process represents the DP as a countably infinite sum of atomic measures. These approaches suggest that a DPM model can be seen as a mixture model with infinitely many components. The distribution of parameters imposed by a DP can also be obtained as a limiting case of a parametric mixture model[8-11]. This approach shows that a DPM can easily be defined as an extension of a parametric mixture model without the need to do model selection for determining the number of components to be used. Here, we take this approach to extend simple finite mixture models to Dirichlet process mixture models.

The DP is defined by two parameters, a positive scalar α and a probability measure G0, referred to as the concentration parameter and the base measure, respectively. The base distribution G0 is the parameter on which the nonparametric distribution is centered, and can be thought of as the prior guess[7].

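As an aside for readers who want to experiment, the stick-breaking construction mentioned above can be sketched in a few lines of Python. This is an illustrative sketch only (not code from the paper); the truncation level and the standard-normal base measure are arbitrary choices made for the example.

import numpy as np

def stick_breaking_sample(alpha, base_sampler, truncation=1000, rng=None):
    """Draw an (approximate) sample G from DP(alpha, G0) by truncated
    stick-breaking: G = sum_k w_k * delta_{theta_k}, with
    v_k ~ Beta(1, alpha), w_k = v_k * prod_{l<k} (1 - v_l), theta_k ~ G0."""
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, alpha, size=truncation)                   # stick proportions
    w = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))   # atom weights
    atoms = np.array([base_sampler(rng) for _ in range(truncation)])
    return w, atoms                                             # weighted atomic measure

# Example: base measure G0 = N(0, 1); a small alpha concentrates mass on few atoms.
w, atoms = stick_breaking_sample(alpha=1.0, base_sampler=lambda r: r.normal(0.0, 1.0))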
The concentration parameter α expresses the strength of belief in G0. For small values of α, a sample from a DP is likely to be composed of a small number of atomic measures with large weights. For large values, most samples are likely to be distinct, thus concentrated on G0.

The form of the base distribution and the value of the concentration parameter are important parts of the model choice that will affect the modeling performance. The concentration parameter can be given a vague prior distribution and its value can be inferred from the data. It is harder to decide on the base distribution, since the model performance will heavily depend on its parametric form even if it is defined in a hierarchical manner for robustness. Generally, the choice of the base distribution is guided by mathematical and practical convenience. In particular, conjugate distributions are preferred for computational ease. It is important to be aware of the implications of the particular choice of the base distribution. An important question is whether the modeling performance is weakened by using a conjugate base distribution instead of a less restricted distribution. A related interesting question is whether the inference is actually computationally cheaper for the conjugate DPM models.

The Dirichlet process Gaussian mixture model (DPGMM) with both conjugate and non-conjugate base distributions has been used extensively in applications of the DPM models for density estimation and clustering[11-15]. However, the performance of the models using these different prior specifications has not been compared. For Gaussian mixture models the conjugate (Normal-Inverse-Wishart) priors have some unappealing properties, with prior dependencies between the mean and the covariance parameters, see e.g., [16]. The main focus of this paper is an empirical evaluation of the differences between the modeling performance of the DPGMM with conjugate and non-conjugate base distributions.

Markov chain Monte Carlo (MCMC) techniques are the most popular tools used for inference on the DPM models. Inference algorithms for the DPM models with a conjugate base distribution are relatively easier to implement than for the non-conjugate case. Nevertheless, several MCMC algorithms have been developed also for the general case[17-18]. In our experiments, we also compare the computational cost of inference on the conjugate and the non-conjugate model specifications.

We define the DPGMM model with a non-conjugate base distribution by removing the undesirable dependency of the distribution of the mean and the covariance parameters. This results in what we refer to as a conditionally conjugate base distribution, since one of the parameters (mean or covariance) can be integrated out conditioned on the other, but both parameters cannot simultaneously be integrated out. In the following, we give formulations of the DPGMM with both a conjugate and a conditionally conjugate base distribution G0. For both prior specifications, we define hyperpriors on G0 for robustness, building upon [11]. We refer to the models with the conjugate and the conditionally conjugate base distributions shortly as the conjugate model and the conditionally conjugate model, respectively. After specifying the model structure, we discuss in detail how to do inference on both models. We show that the performance of the non-conjugate sampler can be improved substantially by exploiting the conditional conjugacy. We present experimental results comparing the two prior specifications. Both predictive accuracy and computational demand are compared systematically on several artificial and real multivariate density estimation problems.

2 Dirichlet Process Gaussian Mixture Models

A DPM model can be constructed as a limit of a parametric mixture model[8-11]. We start with setting out the hierarchical Gaussian mixture model formulation and then take the limit as the number of mixture components approaches infinity to obtain the Dirichlet process mixture model. Throughout the paper, vector quantities are written in bold. The index i always runs over observations, i = 1, ..., n, and index j runs over components, j = 1, ..., K. Generally, variables that play no role in conditional distributions are dropped from the condition for simplicity.

The Gaussian mixture model with K components may be written as:

p(x \mid \theta_1, \ldots, \theta_K) = \sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, S_j),    (1)

where θj = {µj, Sj, πj} is the set of parameters for component j, π are the mixing proportions (which must be positive and sum to one), µj is the mean vector for component j, and Sj is its precision (inverse covariance matrix).

Defining a joint prior distribution G0 on the component parameters and introducing indicator variables ci, i = 1, ..., n, the model can be written as:

x_i \mid c_i, \Theta \sim \mathcal{N}(\mu_{c_i}, S_{c_i})
c_i \mid \pi \sim \mathrm{Discrete}(\pi_1, \ldots, \pi_K)
(\mu_j, S_j) \sim G_0
\pi \mid \alpha \sim \mathrm{Dir}(\alpha/K, \ldots, \alpha/K).    (2)

Given the mixing proportions π, the distribution of the number of observations assigned to each component, called the occupation number, is multinomial,

p(n_1, \ldots, n_K \mid \pi) = \frac{n!}{n_1! n_2! \cdots n_K!} \prod_{j=1}^{K} \pi_j^{n_j},

and the distribution of the indicators is

p(c_1, \ldots, c_n \mid \pi) = \prod_{j=1}^{K} \pi_j^{n_j}.    (3)

The indicator variables are stochastic variables whose values encode the class (or mixture component) to which observation x_i belongs. The actual values of the indicators are arbitrary, as long as they faithfully represent which observations belong to the same class, but they can conveniently be thought of as taking values from 1, ..., K, where K is the total number of classes to which observations are associated.

2.1 Dirichlet Process Mixtures

Placing a symmetric Dirichlet prior with parameter α/K on the mixing proportions and treating all components as equivalent is the key in defining the DPM as a limiting case of the parametric mixture model. Taking the product of the prior over the mixing proportions p(π) and the indicator variables p(c|π), and integrating out the mixing proportions, we can write the prior on c directly in terms of the Dirichlet parameter α:

p(c \mid \alpha) = \frac{\Gamma(\alpha)}{\Gamma(n + \alpha)} \prod_{j=1}^{K} \frac{\Gamma(n_j + \alpha/K)}{\Gamma(\alpha/K)}.    (4)

Fixing all but a single indicator c_i in (4) and using the fact that the datapoints are exchangeable a priori, we may obtain the conditional for the individual indicators:

p(c_i = j \mid c_{-i}, \alpha) = \frac{n_{-i,j} + \alpha/K}{n - 1 + \alpha},    (5)

where the subscript −i indicates all indices except for i, and n−i,j is the number of datapoints, excluding x_i, that are associated with class j. Taking the limit K → ∞, the conditional prior for c_i reaches the following limits:

components for which n−i,j > 0:
p(c_i = j \mid c_{-i}, \alpha) = \frac{n_{-i,j}}{n - 1 + \alpha},    (6a)

all other components combined:
p(c_i \neq c_{i'} \text{ for all } i \neq i' \mid c_{-i}, \alpha) = \frac{\alpha}{n - 1 + \alpha}.    (6b)

Note that these probabilities are the same as the probabilities of seating a new customer in a Chinese restaurant process[4]. Thus, the infinite limit of the model in (2) is equivalent to a DPM, which we define by starting with Gaussian distributed data x_i | θ_i ∼ N(x_i | µ_i, S_i), with a random distribution of the model parameters (µ_i, S_i) ∼ G drawn from a DP, G ∼ DP(α, G0).

We need to specify the base distribution G0 to complete the model. The base distribution of the DP is the prior guess of the distribution of the parameters in a model. It corresponds to the distribution of the component parameters in an infinite mixture model. The mixture components of a DPGMM are specified by their mean and precision (or covariance) parameters; therefore G0 specifies the prior on the joint distribution of µ and S. For this model, a conjugate base distribution exists; however, it has the unappealing property of prior dependency between the mean and the covariance, as detailed below. We proceed by giving the detailed model formulation for both conjugate and conditionally conjugate cases.

2.2 Conjugate DPGMM

The natural choice of priors for the mean of the Gaussian is a Gaussian, and a Wishart distribution① for the precision (inverse-Wishart for the covariance). To accomplish conjugacy of the joint prior distribution of the mean and the precision to the likelihood, the distribution of the mean has to depend on the precision. The prior distribution of the mean µj is Gaussian conditioned on Sj:

(\mu_j \mid S_j, \xi, \rho) \sim \mathcal{N}\big(\xi, (\rho S_j)^{-1}\big),    (7)

and the prior distribution of Sj is Wishart:

(S_j \mid \beta, W) \sim \mathcal{W}\big(\beta, (\beta W)^{-1}\big).    (8)

The joint distribution of µj and Sj is the Normal/Wishart distribution, denoted as:

(\mu_j, S_j) \sim \mathcal{NW}(\xi, \rho, \beta, \beta W),    (9)

with ξ, ρ, β and W being hyperparameters common to all mixture components, expressing the belief that the component parameters should be similar, centered around some particular value. The graphical representation of the hierarchical model is depicted in Fig.1.

① There are multiple parameterizations used for the density function of the Wishart distribution. We use \mathcal{W}(\beta, (\beta W)^{-1}) = \frac{\big(|W|(\beta/2)^{D}\big)^{\beta/2}}{\Gamma_D(\beta/2)} |S|^{(\beta - D - 1)/2} \exp\big(-\tfrac{\beta}{2}\,\mathrm{tr}(SW)\big), where \Gamma_D(z) = \pi^{D(D-1)/4} \prod_{d=1}^{D} \Gamma\big(z + (d - D)/2\big).

Fig.1. Graphical representation of the hierarchical GMM model with conjugate priors. Note the dependency of the distribution of the component mean on the component precision. Variables are labelled below by the name of their distribution, and the parameters of these distributions are given above. The circles/ovals are segmented for the parameters whose prior is defined not directly on that parameter but on a related quantity.

Note in (7) that the data precision and the prior on the mean are linked, as the precision for the mean is a multiple of the component precision itself. This dependency is probably not generally desirable, since it means that the prior distribution of the mean of a component depends on the covariance of that component, but it is an unavoidable consequence of requiring conjugacy.

2.3 Conditionally Conjugate DPGMM

If we remove the undesired dependency of the mean on the precision, we no longer have conjugacy. For a more realistic model, we define the prior on µj to be

p(\mu_j \mid \xi, R) \sim \mathcal{N}(\xi, R^{-1}),    (10)

whose mean vector ξ and precision matrix R are hyperparameters common to all mixture components. Keeping the Wishart prior over the precisions as in (8), we obtain the conditionally conjugate model. That is, the prior of the mean is conjugate to the likelihood conditional on S, and the prior of the precision is conjugate conditional on µ. See Fig.2 for the graphical representation of the hierarchical model.

Fig.2. Graphical representation of the hierarchical GMM model with conditionally conjugate priors. Note that the mean µj and the precision Sj are independent conditioned on the data.

A model is required to be flexible to be able to deal with a large range of datasets. Furthermore, robustness is required so that the performance of the model does not change drastically with small changes in its parameter values. Generally it is hard to specify good parameter values that give rise to a successful model fit for a given problem, and misspecifying parameter values may lead to poor modeling performance. Using hierarchical priors guards against this possibility by allowing us to express the uncertainty about the initial parameter values, leading to flexible and robust models. We put hyperpriors on the hyperparameters in both prior specifications to have a robust model. We use the hierarchical model specification of [11] for the conditionally conjugate model, and a similar specification for the conjugate case. Vague priors are put on the hyperparameters, some of which depend on the observations, which technically they ought not to. However, only the empirical mean µy and the covariance Σy of the data are used, in such a way that the full procedure becomes invariant to translations, rotations and rescaling of the data. One could equivalently use unit priors and scale the data before the analysis, since finding the overall mean and covariance of the data is not the primary concern of the analysis; rather, we wish to find structure within the data.
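For concreteness, the two base distributions specified in (7), (8) and (10) can be sampled as in the following Python sketch. This is only an illustration under placeholder hyperparameter values (it is not the authors' code, and in the model the hyperparameters are themselves given the hyperpriors described next); it uses scipy's Wishart parameterized by degrees of freedom and scale matrix to match (8).

import numpy as np
from scipy.stats import wishart, multivariate_normal

def draw_conjugate(xi, rho, beta, W, rng):
    """Conjugate base, (7)-(8): S_j ~ W(beta, (beta W)^{-1}) and
    mu_j | S_j ~ N(xi, (rho S_j)^{-1}); the mean's prior depends on S_j."""
    S = wishart.rvs(df=beta, scale=np.linalg.inv(beta * W), random_state=rng)
    mu = multivariate_normal.rvs(mean=xi, cov=np.linalg.inv(rho * S), random_state=rng)
    return mu, S

def draw_cond_conjugate(xi, R, beta, W, rng):
    """Conditionally conjugate base, (8) and (10): mu_j ~ N(xi, R^{-1})
    independently of S_j ~ W(beta, (beta W)^{-1})."""
    S = wishart.rvs(df=beta, scale=np.linalg.inv(beta * W), random_state=rng)
    mu = multivariate_normal.rvs(mean=xi, cov=np.linalg.inv(R), random_state=rng)
    return mu, S

# Placeholder hyperparameter values in D = 2 dimensions, for illustration only.
rng = np.random.default_rng(0)
D = 2
xi, rho, beta, W, R = np.zeros(D), 1.0, D + 2.0, np.eye(D), np.eye(D)
mu_c, S_c = draw_conjugate(xi, rho, beta, W, rng)
mu_cc, S_cc = draw_cond_conjugate(xi, R, beta, W, rng)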

In detail, for both models the hyperparameters W and β associated with the component precisions Sj are given the following priors, keeping in mind that β should be greater than D − 1:

W \sim \mathcal{W}\Big(D, \tfrac{1}{D}\Sigma_y\Big), \qquad \frac{1}{\beta - D + 1} \sim \mathcal{G}\Big(1, \tfrac{1}{D}\Big).    (11)

For the conjugate model, the priors for the hyperparameters ξ and ρ associated with the mixture means µj are Gaussian and gamma②:

\xi \sim \mathcal{N}(\mu_y, \Sigma_y), \qquad \rho \sim \mathcal{G}(1/2, 1/2).    (12)

② The gamma distribution is equivalent to a one-dimensional Wishart distribution. Its density function is given by \mathcal{G}(\alpha, \beta) = \frac{(\alpha/(2\beta))^{\alpha/2}}{\Gamma(\alpha/2)}\, s^{\alpha/2 - 1} \exp\big(-\tfrac{s\alpha}{2\beta}\big).

For the non-conjugate case, the component means have a mean vector ξ and a precision matrix R as hyperparameters. We put a Gaussian prior on ξ and a Wishart prior on the precision matrix:

\xi \sim \mathcal{N}(\mu_y, \Sigma_y), \qquad R \sim \mathcal{W}\big(D, (D\Sigma_y)^{-1}\big).    (13)

The prior number of components is governed by the concentration parameter α. For both models we chose the prior for α−1 to have Gamma shape with unit mean and 1 degree of freedom:

p(\alpha^{-1}) \sim \mathcal{G}(1, 1).    (14)

This prior is asymmetric, having a short tail for small values of α, expressing our prior belief that we do not expect that a very small number of active classes (say K† = 1) is overwhelmingly likely a priori.

The probability of the occupation numbers given α and the number of active components, as a function of α, is the likelihood function for α. It can be derived from (6) by reinterpreting this equation to draw a sequence of indicators, each conditioned only on the previously drawn ones. This gives us the following likelihood function:

\alpha^{K^\dagger} \prod_{i=1}^{n} \frac{1}{i - 1 + \alpha} = \frac{\alpha^{K^\dagger}\, \Gamma(\alpha)}{\Gamma(n + \alpha)},    (15)

where K† is the number of active components, that is, the components that have nonzero occupation numbers. We notice that this expression depends only on the total number of active components, K†, and not on how the observations are distributed among them.

3 Inference Using MCMC

We utilise MCMC algorithms for inference on the models described in the previous section. The Markov chain relies on Gibbs updates, where each variable is updated in turn by sampling from its posterior distribution conditional on all other variables. We repeatedly sample the parameters, hyperparameters and the indicator variables from their posterior distributions conditional on all other variables. As a general summary, we iterate:
• update mixture parameters (µj, Sj);
• update hyperparameters;
• update the indicators, conditional on the other indicators and the (hyper)parameters;
• update the DP concentration parameter α.

For the models we consider, the conditional posteriors for all parameters and hyperparameters except for α, β and the indicator variables ci are of standard form, and thus can be sampled easily. The conditional posteriors of log(α) and log(β) are both log-concave, so they can be updated using Adaptive Rejection Sampling (ARS)[19], as suggested in [11].

The likelihood for components that have observations associated with them is given by the parameters of that component, and the likelihood pertaining to currently inactive classes (which have no mixture parameters associated with them) is obtained through integration over the prior distribution. The conditional posterior class probabilities are calculated by multiplying the likelihood term by the prior. The conditional posterior class probabilities for the DPGMM are:

components for which n−i,j > 0:
p(c_i = j \mid c_{-i}, \mu_j, S_j, \alpha) \propto p(c_i = j \mid c_{-i}, \alpha)\, p(x_i \mid \mu_j, S_j) \propto \frac{n_{-i,j}}{n - 1 + \alpha}\, \mathcal{N}(x_i \mid \mu_j, S_j),    (16a)

all others combined:
p(c_i \neq c_{i'} \text{ for all } i \neq i' \mid c_{-i}, \xi, \rho, \beta, W, \alpha) \propto \frac{\alpha}{n - 1 + \alpha} \int p(x_i \mid \mu, S)\, p(\mu, S \mid \xi, \rho, \beta, W)\, \mathrm{d}\mu\, \mathrm{d}S.    (16b)

We can evaluate this integral to obtain the conditional posterior of the inactive classes in the conjugate case, but it is analytically intractable in the non-conjugate case.
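The conditional class probabilities in (16) can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation; inactive_marginal_lik is a placeholder for the integral in (16b), which is analytic in the conjugate case and must be approximated otherwise.

import numpy as np
from scipy.stats import multivariate_normal

def indicator_update_probs(x_i, means, precisions, counts_minus_i, alpha, n,
                           inactive_marginal_lik):
    """Conditional posterior over class assignments for one observation,
    following (16a)/(16b): active class j gets weight
    n_{-i,j}/(n-1+alpha) * N(x_i | mu_j, S_j); all inactive classes together
    get alpha/(n-1+alpha) times the marginal likelihood under the base."""
    weights = []
    for mu, S, n_mi in zip(means, precisions, counts_minus_i):
        lik = multivariate_normal.pdf(x_i, mean=mu, cov=np.linalg.inv(S))
        weights.append(n_mi / (n - 1.0 + alpha) * lik)
    weights.append(alpha / (n - 1.0 + alpha) * inactive_marginal_lik(x_i))
    w = np.asarray(weights)
    return w / w.sum()   # the last entry is the probability of starting a new class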

For updating the indicator variables of the conjugate model, we use Algorithm 2 of [9], which makes full use of conjugacy to integrate out the component parameters, leaving only the indicator variables as the state of the Markov chain. For the conditionally conjugate case, we use Algorithm 8 of [9], utilizing auxiliary variables.

3.1 Conjugate Model

When the priors are conjugate, the integral in (16b) is analytically tractable. In fact, even for the active classes, we can marginalise out the component parameters using an integral over their posterior, by analogy with the inactive classes. Thus, in all cases the log likelihood term is:

\log p(x_i \mid c_{-i}, \rho, \xi, \beta, W) = \frac{D}{2} \log \frac{\rho + n_j}{\rho + n_j + 1} - \frac{D}{2} \log \pi + \log \Gamma\Big(\frac{\beta + n_j + 1}{2}\Big) - \log \Gamma\Big(\frac{\beta + n_j + 1 - D}{2}\Big) + \frac{\beta + n_j}{2} \log |W^*| - \frac{\beta + n_j + 1}{2} \log \Big|W^* + \frac{\rho + n_j}{\rho + n_j + 1}(x_i - \xi^*)(x_i - \xi^*)^{\mathrm T}\Big|,    (17)

where

\xi^* = \Big(\rho \xi + \sum_{l: c_l = j} x_l\Big) \Big/ (\rho + n_j) \quad \text{and} \quad W^* = \beta W + \rho \xi \xi^{\mathrm T} + \sum_{l: c_l = j} x_l x_l^{\mathrm T} - (\rho + n_j)\, \xi^* \xi^{*\mathrm T},

which simplifies considerably for the inactive classes. The sampling iterations become:
• update µj and Sj conditional on the data, the indicators and the hyperparameters;
• update hyperparameters conditional on µj and Sj;
• remove the parameters µj and Sj from the representation;
• update each indicator variable, conditional on the data, the other indicators and the hyperparameters;
• update the DP concentration parameter α, using ARS.

3.2 Conditionally Conjugate Model

As a consequence of not using fully conjugate priors, the posterior conditional class probabilities for inactive classes cannot be computed analytically. Here, we give details for using the auxiliary variable sampling scheme of [9], and also show how to improve this algorithm by making use of the conditional conjugacy.

SampleBoth

The auxiliary variable algorithm of [9] (Algorithm 8) suggests the following sampling steps. For each observation xi in turn, the updates are performed by: "inventing" ζ auxiliary classes by picking means µj and precisions Sj from their priors; updating ci using Gibbs sampling (i.e., sampling from the discrete conditional posterior class distribution); and finally removing the components that are no longer associated with any observations. Here, to emphasize the difference from the other sampling schemes that we will describe, we call it the SampleBoth scheme, since both means and precisions are sampled from their priors to represent the inactive components.

The auxiliary classes represent the effect of the inactive classes; therefore, using (6b), the prior for each auxiliary class is

\frac{\alpha/\zeta}{n - 1 + \alpha}.    (18)

In other words, we define n−i,j = α/ζ for the auxiliary components.

The sampling iterations are as follows:
• Update µj and Sj conditional on the indicators and hyperparameters.
• Update hyperparameters conditional on µj and Sj.
• For each indicator variable:
  – If ci is a singleton, assign its parameters µci and Sci to one of the auxiliary parameter pairs;
  – Invent the other auxiliary components by sampling values for µj and Sj from their priors, (10) and (8) respectively;
  – Update the indicator variable, conditional on the data, the other indicators and hyperparameters, using (16a) and defining n−i,j = α/ζ for the auxiliary components;
  – Discard the empty components;
• Update the DP concentration parameter α.

The integral we want to evaluate is over two parameters, µj and Sj. Exploiting the conditional conjugacy, it is possible to integrate over one of these parameters given the other. Thus, we can pick only one of the parameters randomly from its prior, and integrate over the other, which might lead to faster mixing.
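The auxiliary-variable assignment probabilities used by the three schemes can be sketched as below. This is an illustrative sketch, not the authors' code: active_liks and aux_liks are placeholders for the likelihood terms, which are full Gaussian densities for SampleBoth and the partially marginalised likelihoods (19) or (20) for SampleMu and SampleS.

import numpy as np

def aux_class_probs(x_i, active_liks, counts_minus_i, aux_liks, alpha, n):
    """Assignment probabilities for one observation under the auxiliary-variable
    scheme. Active class j carries prior mass n_{-i,j}/(n-1+alpha); each of the
    zeta auxiliary classes carries (alpha/zeta)/(n-1+alpha), cf. (18).
    active_liks[j] = p(x_i | parameters of active class j);
    aux_liks[m]    = likelihood under the m-th invented auxiliary class."""
    zeta = len(aux_liks)
    w_active = [n_mi / (n - 1.0 + alpha) * lik
                for n_mi, lik in zip(counts_minus_i, active_liks)]
    w_aux = [(alpha / zeta) / (n - 1.0 + alpha) * lik for lik in aux_liks]
    w = np.asarray(w_active + w_aux, dtype=float)
    # Sample an index from w / w.sum(); indices >= len(w_active) start a new class.
    return w / w.sum()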

The log likelihoods for the SampleMu and the SampleS schemes are as follows.

SampleMu

Sampling µj from its prior and integrating over Sj gives the conditional log likelihood:

\log p(x_i \mid c_{-i}, \mu_j, \beta, W) = -\frac{D}{2}\log \pi + \frac{\beta + n_j}{2} \log |W^*| - \frac{\beta + n_j + 1}{2} \log |W^* + x_i x_i^{\mathrm T}| + \log \Gamma\Big(\frac{\beta + n_j + 1}{2}\Big) - \log \Gamma\Big(\frac{\beta + n_j + 1 - D}{2}\Big),    (19)

where W^* = \beta W + \sum_{l: c_l = j} (x_l - \mu_j)(x_l - \mu_j)^{\mathrm T}.

The sampling steps for the indicator variables are:
• Remove all precision parameters Sj from the representation;
• If ci is a singleton, assign its mean parameter µci to one of the auxiliary parameters;
• Invent the other auxiliary components by sampling values for the component mean µj from its prior, (10);
• Update the indicator variable, conditional on the data, the other indicators, the component means and hyperparameters, using the likelihood given in (19) and the prior given in (6a) and (18);
• Discard the empty components.

SampleS

Sampling Sj from its prior and integrating over µj, the conditional log likelihood becomes:

\log p(x_i \mid c_{-i}, S_j, R, \xi) = -\frac{D}{2}\log(2\pi) - \frac{1}{2} x_i^{\mathrm T} S_j x_i + \frac{1}{2} (\xi^* + S_j x_i)^{\mathrm T}\big((n_j + 1) S_j + R\big)^{-1} (\xi^* + S_j x_i) - \frac{1}{2} \xi^{*\mathrm T} (n_j S_j + R)^{-1} \xi^* + \frac{1}{2} \log \frac{|S_j|\, |n_j S_j + R|}{|(n_j + 1) S_j + R|},    (20)

where \xi^* = S_j \sum_{l: c_l = j} x_l + R\xi.

The sampling steps for the indicator variables are:
• Remove the component means µj from the representation;
• If ci is a singleton, assign its precision Sci to one of the auxiliary parameters;
• Invent the other auxiliary components by sampling values for the component precision Sj from its prior, (8);
• Update the indicator variable, conditional on the data, the other indicators, the component precisions and hyperparameters, using the likelihood given in (20) and the prior given in (6a) and (18);
• Discard the empty components.

The number of auxiliary components ζ can be thought of as a free parameter of these sampling algorithms. It is important to note that the algorithms will converge to the same true posterior distribution regardless of how many auxiliary components are used. The value of ζ may affect the convergence and mixing time of the algorithm. If more auxiliary components are used, the Markov chain may mix faster; however, using more components will increase the computation time of each iteration. In our experiments, we tried different values of ζ and found that using a single auxiliary component was enough to get a good mixing behaviour. In the experiments, we report results using ζ = 1.

Note that SampleBoth, SampleMu and SampleS are three different ways of doing inference for the same model, since there are no approximations involved. One should expect only the computational cost of the inference algorithm (e.g., the convergence and the mixing time) to differ among these schemes, whereas the conjugate model discussed in the previous section is a different model, since the prior distribution is different.

4 Predictive Distribution

The predictive distribution is obtained by averaging over the samples generated by the Markov chain. For a particular sample in the chain, the predictive distribution is a mixture of the finite number of Gaussian components which have observations associated with them, and the infinitely many components that have no data associated with them, given by the integral over the base distribution:

\int p(x_i \mid \mu, S)\, p(\mu, S \mid \xi, \rho, \beta, W)\, \mathrm{d}\mu\, \mathrm{d}S.    (21)

The combined mass of the represented components is n/(n + α) and the combined mass of the unrepresented components is the remaining α/(n + α).

For the conjugate model, the predictive distribution can be given analytically. For the non-conjugate models, since the integral for the unrepresented classes (21) is not tractable, it is approximated by a mixture of a finite number ζ† of components, with parameters (µ and S, or µ or S, depending on which of the three sampling schemes is being used) drawn from the base distribution p(µ, S). Note that the larger the ζ†, the better the approximation will get. The predictive performance of the model will be underestimated if we do not use enough components to approximate the integral. Therefore the predictive performance calculated using a finite ζ† can be thought of as a lower bound on the actual performance of the model. In our experiments, we used ζ† = 10.
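The per-sample predictive density described above can be sketched as follows; it is an illustration of the stated approximation (ζ† draws from the base for the unrepresented mass), not the authors' implementation.

import numpy as np
from scipy.stats import multivariate_normal

def predictive_density(x_star, means, precisions, counts, alpha, base_draws):
    """Predictive density at x_star for a single MCMC sample.
    Represented component j contributes weight n_j/(n+alpha); the unrepresented
    components contribute alpha/(n+alpha), approximated by averaging the
    likelihood over base_draws, a list of zeta_dagger (mu, S) pairs drawn from
    the base distribution (this integral is analytic in the conjugate case)."""
    n = float(np.sum(counts))
    dens = sum(n_j / (n + alpha) *
               multivariate_normal.pdf(x_star, mean=mu, cov=np.linalg.inv(S))
               for n_j, mu, S in zip(counts, means, precisions))
    unrep = np.mean([multivariate_normal.pdf(x_star, mean=mu, cov=np.linalg.inv(S))
                     for mu, S in base_draws])
    return dens + alpha / (n + alpha) * unrep

# The final predictive estimate averages predictive_density over the retained MCMC samples.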

5 Experiments

We present results on simulated and real datasets with different dimensions to compare the predictive accuracy and computational cost of the different model specifications and sampling schemes described above. We use the duration of consecutive eruptions of the Old Faithful geyser[20] as a two-dimensional example. The three-dimensional Spiral dataset used in [11], the four-dimensional "Iris" dataset used in [21] and the 13-dimensional "Wine" dataset[22] are modeled for assessing the model performance and the computational cost in higher dimensions.

The density of each dataset has been estimated using the conjugate model (CDP), the three different sampling schemes for the conditionally conjugate model (CCDP), SampleBoth, SampleMu, and SampleS, and by kernel density estimation③ (KDE) using Gaussian kernels.

③ Kernel density estimation is a classical nonparametric density estimation technique which places kernels on each training datapoint. The kernel bandwidth is adjusted separately on each dimension to obtain a smooth density estimate, by maximising the sum of leave-one-out log densities.

5.1 Density Estimation Performance

As a measure of modeling performance, we use the average leave-one-out predictive density. That is, for all datasets considered, we leave out one observation, model the density using all others, and calculate the log predictive density on the left-out datapoint. We repeat this for all datapoints in the training set and report the average log predictive density. We choose this as a performance measure because the log predictive density gives a quantity proportional to the KL divergence, which is a measure of the discrepancy between the actual generating density and the modeled density. The three different sampling schemes for the conditionally conjugate model all have identical equilibrium distributions; therefore the result of the conditionally conjugate model is presented only once, instead of discriminating between different schemes when the predictive densities are considered.

We use the Old Faithful geyser dataset to visualise the estimated densities, Fig.3. Visualisation is important for giving an intuition about the behaviour of the different algorithms. Convergence and mixing of all samplers is fast for the two-dimensional Geyser dataset. There is also not a significant difference in the predictive performance, see Tables 1 and 2. However, we can see from the plots in Fig.3 and the average entropy values in Table 3 that the resulting density estimates are different for all models. CDP uses fewer components to fit the data, therefore the density estimate has fewer modes compared to the estimates obtained by CCDP or KDE. In Fig.3 we see three almost Gaussian modes for CDP. Since KDE places a kernel on each datapoint, its resulting density estimate is less smooth. Note that both CDP and CCDP have the potential to use one component per datapoint and return the same estimate as the KDE; however, their prior as well as the likelihood of the data does not favor this. The density fit by CCDP can be seen as an interpolation between the CDP and the KDE results, as it utilizes more mixture components than CDP but has a smoother density estimate than KDE. For all datasets, the KDE model has the lowest average leave-one-out predictive density, and the conditionally conjugate model has the best, with the difference between the models getting larger in higher dimensions, see Table 1. For instance, on the Wine data CCDP is five fold better than KDE.

Fig.3. Old Faithful geyser dataset and its density modelled by CDP, CCDP and KDE. The two-dimensional data consists of the durations of the consecutive eruptions of the Old Faithful geyser.

Table 1. Average Leave-One-Out Log-Predictive Densities for Kernel Density Estimation (KDE), Conjugate DP Mixture Model (CDP), Conditionally Conjugate DP Mixture Model (CCDP) on Different Datasets

Dataset    KDE        CDP        CCDP
Geyser     -1.906     -1.902     -1.879
Spiral     -7.205     -7.123     -7.117
Iris       -1.860     -1.577     -1.546
Wine      -18.979    -17.595    -17.341

Table 2. Paired t-Test Scores of Leave-One-Out Predictive Densities (The test does not give enough evidence in the case of the Geyser data; however, it shows that KDE is statistically significantly different from both DP models for the higher dimensional datasets. Also, CDP is significantly different from CCDP for the Wine data.)

Dataset    KDE/CDP    KDE/CCDP    CDP/CCDP
Geyser     0.95       0.59        0.41
Spiral     <0.01      <0.01       0.036
Iris       <0.01      <0.01       0.099
Wine       <0.01      <0.01       <0.01
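The leave-one-out measure of Subsection 5.1 can be sketched as below; fit_and_logpdf is a placeholder for fitting one of the density models (CDP, CCDP or KDE) on the reduced dataset and evaluating its log predictive density at the held-out point.

import numpy as np

def average_loo_log_predictive(X, fit_and_logpdf):
    """Average leave-one-out log predictive density (a sketch of the measure
    used in 5.1). X is an (n, D) data array; fit_and_logpdf(X_train, x_test)
    fits a density model on X_train and returns log p(x_test)."""
    n = len(X)
    scores = []
    for i in range(n):
        X_train = np.delete(X, i, axis=0)        # leave observation i out
        scores.append(fit_and_logpdf(X_train, X[i]))
    return float(np.mean(scores))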

p-values for a paired t-test are given in Table 2 to compare the distributions of the leave-one-out densities. There is no statistically significant difference between any of the models on the Geyser dataset. For the Spiral, Iris and Wine datasets, the differences between the predictions of KDE and both DP models are statistically significant. CCDP is significantly different from CDP in terms of its predictive density only on the Wine dataset.

Furthermore, for all datasets, the CCDP consistently utilizes more components than CDP. The average number of datapoints assigned to different components and the distribution over the number of active components are given in Fig.4 and Fig.5, respectively. Fig.5 shows that the distribution of the number of components used by the CCDP is much broader and centered on higher values. The difference in the density estimates is also reflected in the average entropy values reported in Table 3.

Fig.4. Number of datapoints assigned to the components averaged over different positions in the chain. The standard deviations are indicated by the error bars. Note the existence of many small components for the CCDP model.

Fig.5. Distribution of the number of active components for DPGMM from 1000 iterations. The CDP model favors a lower number of components for all datasets. The average number of components for the CCDP model is larger, with a more diffuse distribution. Note that the histogram for the CDP model for the Geyser dataset and the Wine dataset has been cut off on the y-axis.

Table 3. Average Entropies of Mixing Proportions for the Conjugate DP Mixture Model (CDP), Conditionally Conjugate DP Mixture Model (CCDP) on Various Datasets

Dataset    CDP            CCDP
Geyser     1.64 (0.14)    2.65 (0.51)
Spiral     4.03 (0.06)    4.10 (0.08)
Iris       1.71 (0.13)    2.13 (0.28)
Wine       1.58 (0.004)   2.35 (0.17)

5.2 Clustering Performance

The main objective of the models presented in this paper is density estimation, but the models can be used for clustering (or classification where labels are available) as well, by observing the assignment of datapoints to model components. Since the number of components changes over the chain, one would need to form a confusion matrix showing the frequency of each data pair being assigned to the same component for the entire Markov chain, see Fig.6.

Class labels are available for the Iris and Wine datasets, both datasets consisting of 3 classes. The CDP model has 3∼4 active components for the Iris data and 3 active components for the Wine dataset. The assignment of datapoints to the components shows successful clustering. The CCDP model has more components on average for both datasets, but datapoints with different labels are generally not assigned to the same component, resulting in successful clustering, which can be seen from the block diagonal structure of the confusion matrices given in Fig.6, comparing to the true labels given in the right hand side figures. The confusion matrices were constructed by counting the number of times each datapoint was assigned to the same component with another datapoint. Note that the CCDP utilizes more clusters, resulting in less extreme values for the confusion matrix entries (darker gray values of the confusion matrix), which expresses the uncertainty in cluster assignments for some datapoints. Furthermore, we can see in Fig.6(d) that there is one datapoint in the Wine dataset that CDP assigns to the wrong cluster for almost all MCMC iterations, whereas this datapoint is allowed to have its own cluster by CCDP.

Fig.6. Confusion matrices for the Iris dataset: (a) CDP; (b) CCDP; (c) Correct labels. Wine dataset: (d) CDP; (e) CCDP; (f) Correct labels. Brighter means higher, hence the darker gray values for the datapoints that were not always assigned to the same component.

The Spiral dataset is generated by sampling 5 points from each of the 160 Gaussians whose means lie on a spiral. For this data, the number of active components of CDP and CCDP do not go beyond 21 and 28, respectively. This is due to the assumption of independence of component means for both models, which does not hold for this dataset; therefore it is not surprising that the models do not find the correct clustering structure, although they can closely estimate the density.
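The co-assignment (confusion) matrices described above can be computed from stored indicator samples as in the following sketch; the input format (one row of class indicators per retained MCMC sample) is an assumption of this illustration, not the authors' code.

import numpy as np

def coassignment_matrix(indicator_samples):
    """Pairwise co-assignment frequencies from MCMC indicator samples.
    indicator_samples is an (n_samples, n_datapoints) integer array; entry
    (i, j) of the result is the fraction of samples in which datapoints i and j
    were assigned to the same component."""
    C = np.asarray(indicator_samples)
    n_samples, n = C.shape
    M = np.zeros((n, n))
    for c in C:
        M += (c[:, None] == c[None, :])   # 1 where the pair shares a component
    return M / n_samples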

5.3 Computational Cost

The differences in density estimates and predictive performances show that different specifications of the base distribution lead to different behaviour of the model on the same data. It is also interesting to find out if there is a significant gain (if at all) in computational efficiency when a conjugate base distribution is used rather than the non-conjugate one. The inference algorithms considered only differ in the way they update the indicator variables; therefore the computation time per iteration is similar for all algorithms.

We use the convergence and burn-in time as measures of computational cost. Convergence was determined by examining various properties of the state of the Markov chain, and mixing time was calculated as the sum of the auto-covariance coefficients of the slowest mixing quantities from lag −1000 to 1000.

The slowest mixing quantity was the number of active components in all experiments. Example auto-covariance coefficients are shown in Fig.7. The convergence time for the CDP model is usually shorter than the SampleBoth scheme of CCDP but longer than the two other schemes. For the CCDP model, the SampleBoth scheme is the slowest in terms of both converging and mixing. SampleS has comparable convergence time to the SampleMu scheme.

Fig.7. Autocorrelation coefficients of the number of active components for CDP and different sampling schemes for CCDP, for the Spiral data based on 5 × 10^5 iterations, the Iris data based on 10^6 iterations and the Wine data based on 1.5 × 10^6 iterations.
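A possible sketch of the mixing-time summary described above is given below; it uses normalized autocorrelation coefficients of the trace of the number of active components, which may differ in normalization from the exact estimator used by the authors.

import numpy as np

def mixing_time(trace, max_lag=1000):
    """Sum of autocorrelation coefficients of a scalar chain statistic
    (e.g., the number of active components per iteration) over lags
    -max_lag..max_lag, as a rough mixing-time summary."""
    x = np.asarray(trace, dtype=float)
    x = x - x.mean()
    var = x.var()
    def rho(lag):
        lag = abs(lag)
        if lag >= len(x):
            return 0.0
        return np.mean(x[:len(x) - lag] * x[lag:]) / var
    return sum(rho(l) for l in range(-max_lag, max_lag + 1))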

6 Conclusions

The Dirichlet process mixture of Gaussians model is one of the most widely used DPM models. We have presented hierarchical formulations of the DPGMM with conjugate and conditionally conjugate base distributions. The only difference between the two models is the prior on µ and the related hyperparameters. We kept the specifications for all other parameters the same so as to make sure that only the presence or absence of conjugacy would affect the results. We compared the two model specifications in terms of their modeling properties for density estimation and clustering, and the computational cost of the inference algorithms, on several datasets with differing dimensions.

Experiments show that the behaviours of the two base distributions differ. For density estimation, the predictive accuracy of the CCDP model is found to be better than the CDP model for all datasets considered, the difference being larger in high dimensions. The CDP model tends to use fewer components than the CCDP model, having smaller entropy. The clustering performances of both cases are comparable, with the CCDP expressing more uncertainty on some datapoints and allowing some datapoints to have their own cluster when they do not match the datapoints in other clusters. From the experimental results we can conclude that the more restrictive form of the base distribution forces the conjugate model to be more parsimonious in the number of components utilized. This may be a desirable feature if only a rough clustering is adequate for the task at hand; however, it has the risk of overlooking the outliers in the dataset and assigning them to a cluster together with other datapoints. Since it is more flexible in the number of components, the CCDP model may in general result in more reliable clusterings.

We adjusted MCMC algorithms from [9] for inference on both specifications of the model and proposed two sampling schemes for the conditionally conjugate base model with improved convergence and mixing properties. Although MCMC inference on the conjugate case is relatively easier to implement, experimental results show that it is not necessarily computationally cheaper than inference on the conditionally conjugate model when conditional conjugacy is exploited.

In the light of the empirical results, we conclude that marginalising over one of the parameters by exploiting conditional conjugacy leads to considerably faster mixing in the conditionally conjugate model. When using this trick, the fully conjugate model is not necessarily computationally cheaper in the case of the DPGMM. The DPGMM with the more flexible prior specification (conditionally conjugate) can be used on higher dimensional density estimation problems, resulting in better density estimates than the model with the conjugate prior specification.

References

[1] Ferguson T S. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1973, 1(2): 209-230.
[2] Blackwell D, MacQueen J B. Ferguson distributions via Pólya urn schemes. Annals of Statistics, 1973, 1(2): 353-355.
[3] Aldous D. Exchangeability and Related Topics. École d'Été de Probabilités de Saint-Flour XIII — 1983, Berlin: Springer, 1985, pp.1-198.
[4] Pitman J. Combinatorial Stochastic Processes. École d'Été de Probabilités de Saint-Flour XXXII — 2002, Lecture Notes in Mathematics, Vol.1875, Springer, 2006.
[5] Sethuraman J, Tiwari R C. Convergence of Dirichlet Measures and the Interpretation of Their Parameter. Statistical Decision Theory and Related Topics, III, Gupta S S, Berger J O (eds.), London: Academic Press, Vol.2, 1982, pp.305-315.
[6] Ishwaran H, James L F. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, March 2001, 96(453): 161-173.
[7] Antoniak C E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 1974, 2(6): 1152-1174.
[8] Neal R M. Bayesian mixture modeling. In Proc. the 11th International Workshop on Maximum Entropy and Bayesian Methods of Statistical Analysis, Seattle, USA, June 1991, pp.197-211.
[9] Neal R M. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 2000, 9(2): 249-265.
[10] Green P, Richardson S. Modelling heterogeneity with and without the Dirichlet process. Scandinavian Journal of Statistics, 2001, 28: 355-375.
[11] Rasmussen C E. The infinite Gaussian mixture model. Advances in Neural Information Processing Systems, 2000, 12: 554-560.
[12] Escobar M D. Estimating normal means with a Dirichlet process prior. Journal of the American Statistical Association, 1994, 89(425): 268-277.
[13] MacEachern S N. Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics: Simulation and Computation, 1994, 23(3): 727-741.
[14] Escobar M D, West M. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 1995, 90(430): 577-588.
[15] Müller P, Erkanli A, West M. Bayesian curve fitting using multivariate normal mixtures. Biometrika, 1996, 83(1): 67-79.
[16] West M, Müller P, Escobar M D. Hierarchical Priors and Mixture Models with Applications in Regression and Density Estimation. Aspects of Uncertainty, Freeman P R, Smith A F M (eds.), John Wiley, 1994, pp.363-386.
[17] MacEachern S N, Müller P. Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics, 1998, 7(2): 223-238.
[18] Neal R M. Markov chain sampling methods for Dirichlet process mixture models. Technical Report 4915, Department of Statistics, University of Toronto, 1998.
[19] Gilks W R, Wild P. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, 1992, 41(2): 337-348.
[20] Scott D W. Multivariate Density Estimation: Theory, Practice and Visualization, Wiley, 1992.
[21] Fisher R A. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 1936, 7: 179-188.
[22] Forina M, Armanino C, Castino M, Ubigli M. Multivariate data analysis as a discriminating method of the origin of wines. Vitis, 1986, 25(3): 189-201.

Dilan Görür is a post-doctoral research fellow in the Gatsby Computational Neuroscience Unit, University College London. She completed her Ph.D. on machine learning in 2007 at the Max Planck Institute for Biological Cybernetics, Germany, under the supervision of Carl Edward Rasmussen. She received the B.S. and M.Sc. degrees from the Electrical and Electronics Engineering Department of Middle East Technical University, Turkey, in 2000 and 2003, respectively. Her research interests lie in the theoretical and practical aspects of machine learning and Bayesian inference.

Carl Edward Rasmussen is a lecturer in the Computational and Biological Learning Lab at the Department of Engineering, University of Cambridge, and an adjunct research scientist at the Max Planck Institute for Biological Cybernetics, Tübingen, Germany. His main research interests are Bayesian inference and machine learning. He received his Master's degree in engineering from the Technical University of Denmark and his Ph.D. degree in computer science from the University of Toronto in 1996. Since then he has been a post doc at the Technical University of Denmark, a senior research fellow at the Gatsby Computational Neuroscience Unit at University College London from 2000 to 2002, and a junior research group leader at the Max Planck Institute for Biological Cybernetics in Tübingen, Germany, from 2002 to 2007.