arXiv:1807.10693v2 [cs.LG] 2 Feb 2020

Infinite Mixture of Inverted Dirichlet Distributions

Zhanyu Ma and Yuping Lai

Preliminary work of ongoing work.

Abstract—In this work, we develop a novel Bayesian estimation method for the Dirichlet process (DP) mixture of the inverted Dirichlet distributions, which has been shown to be very flexible for modeling vectors with positive elements. The recently proposed extended variational inference (EVI) framework is adopted to derive an analytically tractable solution. The convergence of the proposed algorithm is theoretically guaranteed by introducing a single lower bound approximation to the original objective function in the VI framework. In principle, the proposed model can be viewed as an infinite inverted Dirichlet mixture model (InIDMM) that allows the automatic determination of the number of mixture components from the data. Therefore, the problem of pre-determining the optimal number of mixing components has been overcome. Moreover, the problems of over-fitting and under-fitting are avoided by the Bayesian estimation approach. Comparing with several recently proposed DP-related methods, the good performance and effectiveness of the proposed method have been demonstrated with both synthesized data and real data evaluations.

Index Terms—Dirichlet process, inverted Dirichlet distribution, Bayesian estimation, extended variational inference, lower bound approximation

1 INTRODUCTION

Finite mixture modeling [1]–[3] is a flexible and powerful probabilistic modeling tool for univariate and multivariate data that are assumed to be generated from heterogeneous populations. It has been widely applied to many areas, such as machine learning, data mining, computer vision, and pattern recognition [4]. Among all the finite mixture models, the Gaussian mixture model (GMM) has been the most popular method for modeling continuous data. Much of its popularity is due to the fact that any continuous distribution can be arbitrarily well approximated by a GMM with an unlimited number of mixture components. Moreover, the parameters in a GMM can be efficiently estimated with the maximum likelihood (ML) estimation via the expectation maximization (EM) algorithm [5]. By assigning prior distributions to the parameters in a GMM, Bayesian estimation of the GMM can be carried out with conjugate matching of the prior-posterior pair [6]. Both the ML and the Bayesian estimation algorithms can be represented by an analytically tractable form [6].

Recent studies have shown that non-Gaussian statistical models, e.g., the beta mixture model (BMM), the Dirichlet mixture model (DMM), the Gamma mixture model (GaMM) [7], and the von Mises-Fisher mixture model (vMM) [8], can model non-Gaussian distributed data more efficiently, compared to the conventional GMM. For example, the BMM has been widely applied in modeling grey image pixel values and DNA methylation data [9]. In order to model proportional data efficiently, the DMM can be utilized to describe the underlying distribution [10], [11]. In generalized-K (KG) fading channels, the GaMM has been used to analyze the capacity and error probability [7]. The vMM has been widely used in modeling directional data, such as topic detection [15] and gene expression in yeast. The finite inverted Dirichlet mixture model (IDMM), among others, has been demonstrated to be an efficient tool for modeling data vectors with positive elements [16], [17]. For instance, the IDMM has been widely used for software module classification [16] and for visual scene analysis and classification [17], [18].

An essential problem in finite mixture modeling, no matter with Gaussian or non-Gaussian mixture components, is how to automatically decide the number of mixture components based on the observed data. The appropriate component number has a strong effect on the modeling accuracy [19]. If the number of mixture components is not properly chosen, the mixture model may over-fit or under-fit the observed data. To deal with this problem, many methods have been proposed. These can be categorized into two groups: deterministic approaches [20], [21] and Bayesian methods [22]. Deterministic approaches are generally implemented by an EM-based ML estimation under the integration of some information theoretic criteria, such as the minimum message length (MML) criterion [21], the entropy criterion, the Bayesian information criterion (BIC) [23], and the Akaike information criterion (AIC) [24], to determine the maximum number of components in the mixture model. It is worth noting that, in general, the EM algorithm converges to a saddle point or a local maximum and its solution is highly dependent on its initialization. On the other hand, the Bayesian methods, which are not sensitive to the initialization, have been widely used to find a suitable number of components by introducing proper prior distributions to the parameters in the model. In this case, the parameters in a finite mixture model (including the mixture weighting coefficients and the parameters of the mixture components) are treated as random variables under the Bayesian framework. The posterior distributions of the parameters, rather than simple point estimates, are computed [3]. The model truncation in Bayesian estimation of finite mixture models is carried out by setting the corresponding weights of the unimportant mixture components to zero (or a small value close to zero) [3]. However, the number of mixture components should be properly initialized, as it can only decrease during the training process.

The increasing interest in mixture modeling has led to the development of model selection methods¹. Recent work has shown that the non-parametric Bayesian approach [25], [26] can provide an elegant solution for automatically determining the complexity of the model. The basic idea behind this approach is that it provides methods to adaptively select the optimal number of mixing components, while it also allows the number of mixture components to remain unbounded. In other words, this approach allows the number of components to increase as new data arrive, which is the key difference from finite mixture modeling. The most widely used Bayesian nonparametric [30] model selection method is based on the Dirichlet process (DP) mixture model [31], [32]. The DP mixture model extends distributions over measures, which has the appealing property that it does not need to set a prior on the number of components. In essence, the DP mixture model can also be viewed as an infinite mixture model with its complexity increasing as the size of the dataset grows. Recently, the DP mixture model has been applied in many important applications. For instance, the DP mixture model has been adopted to mixtures of different types of non-Gaussian distributions, such as the DP mixture of beta-Liouville distributions [33], the DP mixture of student's-t distributions [34], the DP mixture of generalized Dirichlet distributions [35], the DP mixture of student's-t factors [36], and the DP mixture of hidden Markov random field models [37].

¹Here, model selection means selecting the best of a set of models of different orders.

Generally speaking, most parameter estimation algorithms for both the deterministic and the Bayesian methods are time consuming, because they have to numerically evaluate a given model selection criterion [38], [39]. This is especially true for the fully Bayesian Markov chain Monte Carlo (MCMC) [26], [40], which is one of the widely applied Bayesian approaches with numerical simulations. The MCMC approach has its own limitations when high-dimensional data are involved in the training stage [41], [42]. This is due to the fact that its sampling-based characteristics yield a heavy computational burden and it is difficult to monitor the convergence in the high-dimensional space. To overcome the aforementioned problems, variational inference (VI), which can provide an analytically tractable solution and good generalization performance, has been proposed as an efficient alternative to the MCMC approach [43], [44]. With an analytically tractable solution, the numerical sampling during each iteration in the optimization stage can be avoided. Hence, the VI-based solutions can lead to more efficient estimation. They have been successfully applied in a variety of applications, including the estimation of mixture models [45].

Motivated by the ability of the Bayesian non-parametric approaches to solve the model selection problem and the good performance recently obtained by the VI framework, we focus on the variational learning of the DP mixture of inverted Dirichlet distributions (a.k.a. the infinite inverted Dirichlet mixture model (InIDMM)). Since the InIDMM is a typical non-Gaussian statistical model, it is not feasible to apply the traditional VI framework to obtain an analytically tractable solution for the Bayesian estimation. To derive an analytically tractable solution for the variational learning of the InIDMM, the recently proposed extended variational inference (EVI), which is particularly suitable for non-Gaussian statistical models, has been adopted to provide an appropriate single lower bound approximation to the original objective function. With the auxiliary function, an analytically tractable solution for the Bayesian estimation of the InIDMM is derived. The key contributions of our work are three-fold: 1) The finite inverted Dirichlet mixture model (IDMM) has been extended to the infinite inverted Dirichlet mixture model (InIDMM) under the stick-breaking framework [31], [46]. Thus, the difficulty in automatically determining the number of mixture components can be overcome. 2) An analytically tractable solution is derived within the EVI framework for the InIDMM. Moreover, compared with the recently proposed algorithm for the InIDMM [47], which is based on multiple lower bound approximations, our algorithm can not only theoretically guarantee convergence but also provide better approximations. 3) The proposed method has been applied in several important applications, such as image categorization and object detection. The good performance has been illustrated with both synthesized and real data evaluations.

The remaining part of this paper is organized as follows: Section 2 provides a brief overview of the finite inverted Dirichlet mixture and the DP mixture. The infinite inverted Dirichlet mixture model is also proposed. In Section 3, a Bayesian learning algorithm with EVI is derived. The proposed algorithm has an analytically tractable form. The experimental results with both synthesized and real data evaluations are reported in Section 4. Finally, we draw conclusions and future research directions in Section 5.

2 THE STATISTICAL MODEL

In this section, we first present a brief overview of the finite inverted Dirichlet mixture model (IDMM). Then, the DP mixture model with stick-breaking representation is introduced. Finally, we extend the IDMM to the InIDMM.

2.1 Finite Inverted Dirichlet Mixture Model

Given a D-dimensional vector ~x = {x_1, ..., x_D} generated from an IDMM with M components, the probability density function (PDF) of ~x is denoted as [16]

\mathrm{IDMM}(\vec{x}\,|\,\vec{\pi},\Lambda) = \sum_{m=1}^{M} \pi_m\, \mathrm{iDir}(\vec{x}\,|\,\vec{\alpha}_m),   (1)

where Λ = {~α_m}_{m=1}^{M}, and ~π = {π_m}_{m=1}^{M} is the mixing coefficient vector subject to the constraints 0 ≤ π_m ≤ 1 and Σ_{m=1}^{M} π_m = 1. Moreover, iDir(~x|~α) is an inverted Dirichlet distribution with its (D+1)-dimensional positive parameter vector ~α = {α_1, ..., α_{D+1}}, defined as

\mathrm{iDir}(\vec{x}\,|\,\vec{\alpha}) = \frac{\Gamma\!\left(\sum_{d=1}^{D+1}\alpha_d\right)}{\prod_{d=1}^{D+1}\Gamma(\alpha_d)}\, \prod_{d=1}^{D} x_d^{\alpha_d-1} \left(1+\sum_{d=1}^{D} x_d\right)^{-\sum_{d=1}^{D+1}\alpha_d},   (2)

where x_d > 0 for d = 1, ..., D and Γ(·) is the Gamma function defined as Γ(a) = \int_0^{\infty} t^{a-1} e^{-t}\, dt.

2.2 Dirichlet Process with Stick-Breaking

The Dirichlet process (DP) [31], [32] is a stochastic process used for Bayesian nonparametric data analysis, particularly in a DP mixture model (infinite mixture model). It is a distribution over distributions rather than over parameters, i.e., each draw from a DP is a probability distribution itself, rather than a parameter vector [48]. We adopt the DP to extend the IDMM to the infinite case, such that the difficulty of the automatic determination of the model complexity (i.e., the number of mixture components) can be overcome. To this end, the DP is constructed by the following stick-breaking formulation [30], which is an intuitive and simple constructive definition of the DP.

Assume that H is a random distribution and ϕ is a positive real scalar. We consider two countably infinite collections of independently generated stochastic variables Ω_m ∼ H and λ_m ∼ Beta(λ_m; 1, ϕ) for m = {1, ..., ∞}, where Beta(x; a, b) is the beta distribution² defined as Beta(x; a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1}(1-x)^{b-1}. A distribution G is said to be DP distributed with a concentration parameter ϕ and a base measure or base distribution H (denoted as G ∼ DP(ϕ, H)), if the following conditions are satisfied:

G = \sum_{m=1}^{\infty} \pi_m \delta_{\Omega_m}, \quad \pi_m = \lambda_m \prod_{l=1}^{m-1} (1-\lambda_l),   (3)

where {π_m} is a set of stick-breaking weights with the constraint Σ_{m=1}^{∞} π_m = 1, and δ_{Ω_m} is a delta function whose value is 1 at location Ω_m and 0 otherwise. The generation of the mixing coefficients {π_m} can be considered as a process of breaking a unit-length stick into an infinite number of pieces. The length of each piece, λ_m, which is proportional to the rest of the "stick" before the current breaking, is considered as an independent random variable generated from Beta(λ_m; 1, ϕ). Because of its simplicity and natural generalization ability, the stick-breaking construction has been a widely applied scheme for the inference of DPs [33], [46], [49].

²To avoid confusion, we use f(x; a) to denote the PDF of x parameterized by parameter a. f(x|a) is used to denote the conditional PDF of x given a, where both x and a are random variables. Both f(x; a) and f(x|a) have exactly the same mathematical expressions.

2.3 Infinite Inverted Dirichlet Mixture Model

Now we consider the problem of modeling ~x by an Infinite Inverted Dirichlet Mixture Model (InIDMM), which is actually an extended IDMM with an infinite number of components. Therefore, (1) can be reformulated as

\mathrm{InIDMM}(\vec{x}\,|\,\vec{\pi},\Lambda) = \sum_{m=1}^{\infty} \pi_m\, \mathrm{iDir}(\vec{x}\,|\,\vec{\alpha}_m),   (4)

where ~π = {π_m}_{m=1}^{∞} and Λ = {~α_m}_{m=1}^{∞}. Then, the likelihood function of the InIDMM given the observed dataset X = {~x_n}_{n=1}^{N} is given by

\mathrm{InIDMM}(\mathcal{X}\,|\,\vec{\pi},\Lambda) = \prod_{n=1}^{N} \left\{ \sum_{m=1}^{\infty} \pi_m\, \mathrm{iDir}(\vec{x}_n\,|\,\vec{\alpha}_m) \right\}.   (5)

In order to clearly illustrate the generation process of each observation ~x_n in the mixture model, we introduce a latent indication vector variable ~z_n = {z_{n1}, z_{n2}, ...}. ~z_n has only one element equal to 1 and the other elements in ~z_n are 0. For example, z_{nm} = 1 indicates that the sample ~x_n comes from the mixture component m. Therefore, the conditional distribution of X given the parameters Λ and the latent variables Z = {z_{nm}} is

\mathrm{InIDMM}(\mathcal{X}\,|\,\mathcal{Z},\Lambda) = \prod_{n=1}^{N}\prod_{m=1}^{\infty} \mathrm{iDir}(\vec{x}_n\,|\,\vec{\alpha}_m)^{z_{nm}}.   (6)

Moreover, to exploit the advantages of the Bayesian framework, prior distributions are introduced for all the unknown parameters according to their distribution properties. In this work, we place the conjugate priors over the unknown stochastic variables Z, Λ, and ~λ = (λ_1, λ_2, ...), such that a full Bayesian estimation model can be obtained.

In the aforementioned full Bayesian model, the prior distribution of Z given ~π is given by

p(\mathcal{Z}\,|\,\vec{\pi}) = \prod_{n=1}^{N}\prod_{m=1}^{\infty} \pi_m^{z_{nm}}.   (7)

As ~π is a function of ~λ according to the stick-breaking construction of the DP as shown in (3), we rewrite (7) as

p(\mathcal{Z}\,|\,\vec{\lambda}) = \prod_{n=1}^{N}\prod_{m=1}^{\infty} \left[ \lambda_m \prod_{l=1}^{m-1}(1-\lambda_l) \right]^{z_{nm}}.   (8)

As previously mentioned in Section 2.2, the prior distribution of ~λ is

p(\vec{\lambda}\,|\,\vec{\varphi}) = \prod_{m=1}^{\infty} \mathrm{Beta}(\lambda_m; 1, \varphi_m) = \prod_{m=1}^{\infty} \varphi_m (1-\lambda_m)^{\varphi_m-1},   (9)

where ~ϕ = (ϕ_1, ϕ_2, ...). Based on (3), we can obtain the expected value of π_m. In order to do this, the expected value of λ_m will first be calculated as

\langle \lambda_m \rangle = 1/(1+\varphi_m).   (10)
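To make the definitions in (2)-(6) concrete, the following minimal sketch (assuming NumPy and SciPy; the truncation level, dimensionality, and hyperparameter values are illustrative and not taken from the paper, and the function names are ours) evaluates the inverted Dirichlet log-density in (2), forms stick-breaking weights as in (3), and draws one observation from a truncated InIDMM. It uses the standard fact that if ~y is Dirichlet distributed with parameter ~α, then x_d = y_d / y_{D+1} follows iDir(~α).

```python
import numpy as np
from scipy.special import gammaln

def log_inverted_dirichlet(x, alpha):
    """Log-density of the inverted Dirichlet distribution, Eq. (2).
    x: positive vector of length D, alpha: positive vector of length D+1."""
    x, alpha = np.asarray(x, float), np.asarray(alpha, float)
    norm = gammaln(alpha.sum()) - gammaln(alpha).sum()
    return (norm
            + np.sum((alpha[:-1] - 1.0) * np.log(x))
            - alpha.sum() * np.log1p(x.sum()))

def stick_breaking_weights(lam):
    """Mixing weights pi_m = lam_m * prod_{l<m}(1 - lam_l), as in Eq. (3)."""
    lam = np.asarray(lam, float)
    rest = np.concatenate(([1.0], np.cumprod(1.0 - lam)[:-1]))
    return lam * rest

rng = np.random.default_rng(0)
truncation, D, concentration = 15, 3, 1.0            # illustrative values
lam = rng.beta(1.0, concentration, size=truncation)  # lam_m ~ Beta(1, phi)
lam[-1] = 1.0                                         # finite truncation
pi = stick_breaking_weights(lam)
alphas = rng.gamma(2.0, 5.0, size=(truncation, D + 1))  # component parameters

# Draw one observation: pick a component, then map a Dirichlet draw y to
# the inverted Dirichlet variate x_d = y_d / y_{D+1}.
m = rng.choice(truncation, p=pi)
y = rng.dirichlet(alphas[m])
x = y[:-1] / y[-1]
print("component:", m, "log-density:", log_inverted_dirichlet(x, alphas[m]))
```

The finite truncation used here for sampling mirrors the variational truncation introduced later in (24).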

Then, the expected value of π_m is denoted as

\langle \pi_m \rangle = \langle \lambda_m \rangle \prod_{l=1}^{m-1} \big(1 - \langle \lambda_l \rangle\big).   (11)

It is worth noting that, when the value of ϕ_m is small, ⟨λ_m⟩ will become large. Therefore, the expected values of the mixing coefficients π_m are controlled by the parameters ϕ_m, i.e., a small value of ϕ_m yields a large ⟨λ_m⟩, which suppresses the weights of the subsequent components, such that the distribution of {π_m} will be sparse.

As ϕ_m is positive, we assume ~ϕ follows a product of gamma prior distributions as

p(\vec{\varphi}; \vec{s}, \vec{t}) = \prod_{m=1}^{\infty} \mathrm{Gam}(\varphi_m; s_m, t_m) = \prod_{m=1}^{\infty} \frac{t_m^{s_m}}{\Gamma(s_m)}\, \varphi_m^{s_m-1} e^{-t_m \varphi_m},   (12)

where Gam(·) is the gamma distribution. ~s = (s_1, s_2, ...) and ~t = (t_1, t_2, ...) are the hyperparameters, subject to the constraints s_m > 0 and t_m > 0.

Next, we introduce an approximating conjugate prior distribution for the parameter Λ in the InIDMM. The inverted Dirichlet distribution belongs to the exponential family, and its formal conjugate prior can be derived with the Bayesian rule [3] as

p(\vec{\alpha}\,|\,\vec{\mu}_0, \nu_0) = C(\vec{\mu}_0, \nu_0) \left[ \frac{\Gamma\!\left(\sum_{d=1}^{D+1}\alpha_d\right)}{\prod_{d=1}^{D+1}\Gamma(\alpha_d)} \right]^{\nu_0} e^{-\vec{\mu}_0^{\mathrm{T}} (\vec{\alpha} - \vec{I}_{D+1})},   (13)

where ~µ_0 = [µ_{1,0}, ..., µ_{D+1,0}] and ν_0 are the hyperparameters in the prior distribution, C(~µ_0, ν_0) is a normalization coefficient such that \int p(\vec{\alpha}\,|\,\vec{\mu}_0, \nu_0)\, d\vec{\alpha} = 1, and ~I_d denotes a d-dimensional vector with all elements equal to one. Then, we can write the posterior distribution of ~α as (with N i.i.d. observations X)

f(\vec{\alpha}\,|\,\mathcal{X}) = \frac{\mathrm{iDir}(\mathcal{X}|\vec{\alpha})\, f(\vec{\alpha}|\vec{\mu}_0,\nu_0)}{\int \mathrm{iDir}(\mathcal{X}|\vec{\alpha})\, f(\vec{\alpha}|\vec{\mu}_0,\nu_0)\, d\vec{\alpha}} = C(\vec{\mu}_N, \nu_N) \left[ \frac{\Gamma\!\left(\sum_{d=1}^{D+1}\alpha_d\right)}{\prod_{d=1}^{D+1}\Gamma(\alpha_d)} \right]^{\nu_N} e^{-\vec{\mu}_N^{\mathrm{T}} (\vec{\alpha} - \vec{I}_{D+1})},   (14)

where the hyperparameters ν_N and ~µ_N in the posterior distribution are

\nu_N = \nu_0 + N, \quad \vec{\mu}_N = \vec{\mu}_0 - \left[ \ln \mathcal{X}^{+} - \vec{I}_{D+1} \ln\!\left(1 + \vec{I}_{D+1}^{\mathrm{T}} \mathcal{X}^{+}\right) \right] \vec{I}_N.   (15)

In (15), X^+ is a (D+1) × N matrix obtained by connecting a row of ones, ~I_N^T, to the bottom of X. However, this formal conjugate prior is not applicable in our VI framework due to the analytically intractable normalization factor in (13). Because Λ is positive, we adopt gamma prior distributions to approximate the conjugate prior for Λ as well. By assuming the parameters of the inverted Dirichlet distribution to be mutually independent, we have

p(\Lambda) = \mathrm{Gam}(\Lambda; U, V) = \prod_{m=1}^{\infty}\prod_{d=1}^{D+1} \frac{v_{md}^{u_{md}}}{\Gamma(u_{md})}\, \alpha_{md}^{u_{md}-1} e^{-v_{md}\alpha_{md}},   (16)

where all the hyperparameters U = {u_{md}} and V = {v_{md}} are positive.

With the Bayesian rules and by combining (6) and (8)-(16) together, we can represent the joint density of the observation X with all the i.i.d. latent variables Θ = (Z, Λ, ~λ, ~ϕ) as

p(\mathcal{X}, \Theta) = p(\mathcal{X}|\mathcal{Z},\Lambda)\, p(\mathcal{Z}|\vec{\lambda})\, p(\vec{\lambda}|\vec{\varphi})\, p(\vec{\varphi})\, p(\Lambda)
  = \prod_{n=1}^{N}\prod_{m=1}^{\infty} \left\{ \lambda_m \prod_{j=1}^{m-1}(1-\lambda_j)\, \frac{\Gamma\!\left(\sum_{d=1}^{D+1}\alpha_{md}\right)}{\prod_{d=1}^{D+1}\Gamma(\alpha_{md})} \prod_{d=1}^{D} x_{nd}^{\alpha_{md}-1} \left(1+\sum_{d=1}^{D} x_{nd}\right)^{-\sum_{d=1}^{D+1}\alpha_{md}} \right\}^{z_{nm}}
  \times \prod_{m=1}^{\infty} \left[ \varphi_m (1-\lambda_m)^{\varphi_m-1}\, \frac{t_m^{s_m}}{\Gamma(s_m)}\, \varphi_m^{s_m-1} e^{-t_m\varphi_m} \right]
  \times \prod_{m=1}^{\infty}\prod_{d=1}^{D+1} \frac{v_{md}^{u_{md}}}{\Gamma(u_{md})}\, \alpha_{md}^{u_{md}-1} e^{-v_{md}\alpha_{md}}.   (17)

The structure of the InIDMM can be represented in terms of a directed probabilistic graphical model, as shown in Fig. 1, which illustrates the relations among the variables and the observations.

[Fig. 1 (graphical model, not reproduced): nodes ~z_n, λ_m, ϕ_m, ~α_m and observations ~x_n, with plates over N and ∞.]
Fig. 1: Graphical representation of the variables' relationships in the Bayesian model of an InIDMM. All of the circles in the graphical figure represent variables. Arrows show the relationships between variables. The variables in the box are the i.i.d. observations.

3 VARIATIONAL LEARNING FOR INIDMM

In this section, we develop a variational Bayesian inference framework for learning the InIDMM. With the assistance of the recently proposed EVI, an analytically tractable algorithm, which prevents numerical sampling during each iteration and facilitates the training procedure, is obtained. The proposed solution is also able to overcome the problem of overfitting and to automatically decide the number of mixture components.

3.1 Variational Inference

The purpose of Bayesian analysis is to estimate the values of the hyperparameters as well as the posterior distributions of the latent variables. We can formally write the posterior distribution p(Θ|X) by using the Bayesian rule as

p(\Theta\,|\,\mathcal{X}) = \frac{p(\mathcal{X},\Theta)}{p(\mathcal{X})}.   (18)
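The joint density in (17) is simply the product of the likelihood (6) with the priors (8), (9), (12), and (16). As a concrete reference, the following minimal sketch (assuming NumPy/SciPy, a finite truncation of the infinite products, and array shapes and names of our own choosing) evaluates its logarithm; a quantity of this kind also underlies sampling-based checks of the variational objective function such as those mentioned in Section 4.

```python
import numpy as np
from scipy.special import gammaln

def log_joint(X, Z, lam, phi, alpha, s, t, u, v):
    """ln p(X, Theta) assembled from Eqs. (6), (8), (9), (12), and (16).
    X: (N, D) positive data, Z: (N, M) one-hot assignments,
    lam, phi, s, t: (M,) with lam strictly inside (0, 1),
    alpha, u, v: (M, D+1). Illustrative finite truncation at M components."""
    X, Z = np.asarray(X, float), np.asarray(Z, float)
    # log iDir(x_n | alpha_m) for every (n, m) pair, cf. Eq. (2)
    norm = gammaln(alpha.sum(axis=1)) - gammaln(alpha).sum(axis=1)      # (M,)
    log_idir = (norm[None, :]
                + np.log(X) @ (alpha[:, :-1] - 1.0).T
                - np.log1p(X.sum(axis=1))[:, None] * alpha.sum(axis=1)[None, :])
    ll = np.sum(Z * log_idir)                                           # Eq. (6)
    # ln p(Z | lambda), Eq. (8), via the stick-breaking log-weights
    log_pi = np.log(lam) + np.concatenate(([0.0], np.cumsum(np.log1p(-lam))[:-1]))
    lp_z = np.sum(Z * log_pi[None, :])
    # ln p(lambda | phi), Eq. (9); ln p(phi), Eq. (12); ln p(Lambda), Eq. (16)
    lp_lam = np.sum(np.log(phi) + (phi - 1.0) * np.log1p(-lam))
    lp_phi = np.sum(s * np.log(t) - gammaln(s) + (s - 1.0) * np.log(phi) - t * phi)
    lp_alp = np.sum(u * np.log(v) - gammaln(u) + (u - 1.0) * np.log(alpha) - v * alpha)
    return ll + lp_z + lp_lam + lp_phi + lp_alp
```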

However, the calculation of p(X) from the joint distribution p(X, Θ), which involves the summation and the integration over the latent variables, is analytically intractable for most non-Gaussian statistical models. Therefore, we apply the VI framework to approximate the actual posterior p(Θ|X) with an approximating distribution q(Θ). In principle, q(Θ) can be of arbitrary form. To make this approximation as close as possible to the actual posterior distribution, we can find the optimal approximation by minimizing the Kullback-Leibler (KL) divergence of p(Θ|X) from q(Θ) as

\mathrm{KL}(q\|p) = -\int q(\Theta) \ln\!\left[ \frac{p(\Theta|\mathcal{X})}{q(\Theta)} \right] d\Theta.   (19)

By some mathematical manipulations, we obtain the following expression:

\ln p(\mathcal{X}) = \underbrace{\int q(\Theta)\ln\!\left[\frac{p(\mathcal{X},\Theta)}{q(\Theta)}\right] d\Theta}_{\mathcal{L}(q)} \ \underbrace{-\int q(\Theta)\ln\!\left[\frac{p(\Theta|\mathcal{X})}{q(\Theta)}\right] d\Theta}_{\mathrm{KL}(q\|p)},   (20)

where the KL divergence KL(q||p) is nonnegative and equal to zero if and only if q(Θ) = p(Θ|X) [50]. However, it is infeasible to solve q(Θ) by minimizing KL(q(Θ)||p(Θ|X)) directly, as p(Θ|X) is unknown. As the logarithm of the marginal evidence ln p(X) is fixed by a given X, minimizing the KL divergence is equivalent to maximizing the lower bound L(q) (which is known as the variational objective function). Hence, it is usual to find the optimal approximation q(Θ) by maximizing L(q) [3] in the VI framework. Within the variational inference framework, the variational objective function that needs to be maximized is

\mathcal{L}(q) = \mathrm{E}_{q(\Theta)}[\ln p(\mathcal{X},\Theta)] - \mathrm{E}_{q(\Theta)}[\ln q(\Theta)].   (21)

3.2 Extended Variational Inference

For most of the non-Gaussian mixture models (e.g., the beta mixture model, the Dirichlet mixture model, the beta-Liouville mixture model [33], and the inverted Dirichlet mixture model [17]), the term E_{q(Θ)}[ln p(X, Θ)] is analytically intractable, such that the lower bound L(q) cannot be maximized directly by a closed-form solution. Therefore, the EVI method [43], [51] was proposed to overcome the aforementioned problem. With an auxiliary function p̃(X, Θ) that satisfies

\mathrm{E}_{q(\Theta)}[\ln p(\mathcal{X},\Theta)] \ge \mathrm{E}_{q(\Theta)}[\ln \tilde{p}(\mathcal{X},\Theta)]   (22)

and substituting (22) into (21), we can still reach the maximum value of L(q) at some given points by maximizing a lower bound L̃(q):

\mathcal{L}(q) \ge \tilde{\mathcal{L}}(q) = \mathrm{E}_{q(\Theta)}[\ln \tilde{p}(\mathcal{X},\Theta)] - \mathrm{E}_{q(\Theta)}[\ln q(\Theta)].   (23)

If p̃(X, Θ) is properly selected, an analytically tractable solution can be obtained. The strategy for selecting a proper p̃(X, Θ) can be found in the EVI literature [43], [51].

In order to properly formulate the variational posterior q(Θ), we truncate the stick-breaking representation for the InIDMM at a value M as

\lambda_M = 1, \quad \pi_m = 0 \ \text{when} \ m > M, \quad \text{and} \quad \sum_{m=1}^{M}\pi_m = 1.   (24)

Note that the model is still a full DP mixture. The truncation level M is not a part of our prior infinite mixture model; it is only a variational parameter for pursuing an approximation to the posterior, which can be freely initialized and automatically optimized without yielding overfitting during the learning process. Additionally, we make use of the following factorized variational distribution to approximate p(Θ|X):

q(\Theta) = \left[\prod_{m=1}^{M} q(\lambda_m)\, q(\varphi_m)\right] \left[\prod_{n=1}^{N}\prod_{m=1}^{M} q(z_{nm})\right] \left[\prod_{m=1}^{M}\prod_{d=1}^{D+1} q(\alpha_{md})\right],   (25)

where the variables in the posterior distribution are assumed to be mutually independent (as illustrated by the graphical model in Fig. 1). This is the only assumption we introduced to the posterior distribution. No other restrictions are imposed on the mathematical forms of the individual factor distributions [3].

Applying the full factorization formulation and the truncated stick-breaking representation for the proposed model, we can solve the variational learning by maximizing the lower bound L̃(q) shown in (23). The optimal solution in this case is given by

\ln q_s(\Theta_s) = \langle \ln \tilde{p}(\mathcal{X},\Theta) \rangle_{j\neq s} + \mathrm{Con.},   (26)

where ⟨·⟩_{j≠s} refers to the expectation with respect to all the distributions q_j(Θ_j) except for variable s. In addition, any term that does not include Θ_s is absorbed into the additive constant "Con." [3], [43]. In the variational inference, all factors q_s(Θ_s) need to be suitably initialized; then each factor is updated in turn with a revised value obtained by (26), using the current values of all the other factors. Convergence is theoretically guaranteed, since the lower bound is convex with respect to each factor q_s(Θ_s) [3]. It is worth noting that, although convergence is promised, the algorithm may also fall into local maxima or saddle points.

3.3 EVI for the Optimal Posterior Distributions

According to the principles of EVI, the expectation of the logarithm of the joint distribution, given the joint posterior distributions of the parameters, can be expressed as

\langle \ln p(\mathcal{X},\Theta) \rangle = \sum_{n=1}^{N}\sum_{m=1}^{M} \langle z_{nm}\rangle \Bigg[ R_m + \sum_{d=1}^{D} \big(\langle\alpha_{md}\rangle - 1\big)\ln x_{nd} - \sum_{d=1}^{D+1}\langle\alpha_{md}\rangle \ln\!\Big(1+\sum_{d=1}^{D} x_{nd}\Big) + \langle\ln\lambda_m\rangle + \sum_{j=1}^{m-1}\langle\ln(1-\lambda_j)\rangle \Bigg]
  + \sum_{m=1}^{M}\Big[\langle\ln\varphi_m\rangle + \big(\langle\varphi_m\rangle - 1\big)\langle\ln(1-\lambda_m)\rangle\Big]
  + \sum_{m=1}^{M}\sum_{d=1}^{D+1}\Big[(u_{md}-1)\langle\ln\alpha_{md}\rangle - v_{md}\langle\alpha_{md}\rangle\Big]
  + \sum_{m=1}^{M}\Big[(s_m-1)\langle\ln\varphi_m\rangle - t_m\langle\varphi_m\rangle\Big] + \mathrm{Con.},   (27)

where R_m = \Big\langle \ln \frac{\Gamma\left(\sum_{d=1}^{D+1}\alpha_{md}\right)}{\prod_{d=1}^{D+1}\Gamma(\alpha_{md})} \Big\rangle.

With the mathematical expression in (27), an analytically tractable solution is not feasible, which is due to the fact that R_m cannot be explicitly calculated (although it can be simulated by some numerical sampling methods). In order to apply (26) to explicitly calculate the optimal posterior distributions, and following the principles of the EVI framework, it is required to introduce an auxiliary function R̃_m such that R_m ≥ R̃_m. Following [43], [51], we can select R̃_m as

\tilde{R}_m = \ln \frac{\Gamma\!\left(\sum_{d=1}^{D+1}\langle\alpha_{md}\rangle\right)}{\prod_{d=1}^{D+1}\Gamma(\langle\alpha_{md}\rangle)} + \sum_{d=1}^{D+1} \Big[ \Psi\!\Big(\sum_{k=1}^{D+1}\langle\alpha_{mk}\rangle\Big) - \Psi\big(\langle\alpha_{md}\rangle\big) \Big] \Big[\langle\ln\alpha_{md}\rangle - \ln\langle\alpha_{md}\rangle\Big] \langle\alpha_{md}\rangle,   (28)

where Ψ(·) is the digamma function defined as Ψ(a) = ∂ ln Γ(a)/∂a. Substituting (28) into (27), a lower bound to ⟨ln p(X, Θ)⟩ can be obtained as

\langle \ln \tilde{p}(\mathcal{X},\Theta) \rangle = \sum_{n=1}^{N}\sum_{m=1}^{M} \langle z_{nm}\rangle \Bigg[ \tilde{R}_m + \sum_{d=1}^{D} \big(\langle\alpha_{md}\rangle - 1\big)\ln x_{nd} - \sum_{d=1}^{D+1}\langle\alpha_{md}\rangle \ln\!\Big(1+\sum_{d=1}^{D} x_{nd}\Big) + \langle\ln\lambda_m\rangle + \sum_{j=1}^{m-1}\langle\ln(1-\lambda_j)\rangle \Bigg]
  + \sum_{m=1}^{M}\Big[\langle\ln\varphi_m\rangle + \big(\langle\varphi_m\rangle - 1\big)\langle\ln(1-\lambda_m)\rangle\Big]
  + \sum_{m=1}^{M}\sum_{d=1}^{D+1}\Big[(u_{md}-1)\langle\ln\alpha_{md}\rangle - v_{md}\langle\alpha_{md}\rangle\Big]
  + \sum_{m=1}^{M}\Big[(s_m-1)\langle\ln\varphi_m\rangle - t_m\langle\varphi_m\rangle\Big] + \mathrm{Con.}   (29)

With (26), we can get analytically tractable solutions for optimally estimating the posterior distributions of Z, ~λ, ~ϕ, and Λ. We now consider each of these in more detail:

1) The posterior distribution of q(Z). As any term that is independent of z_{nm} can be absorbed into the additive constant, we have

\ln q^{*}(z_{nm}) = \mathrm{Con.} + z_{nm} \Bigg[ \tilde{R}_m + \langle\ln\lambda_m\rangle + \sum_{j=1}^{m-1}\langle\ln(1-\lambda_j)\rangle + \sum_{d=1}^{D}\big(\langle\alpha_{md}\rangle-1\big)\ln x_{nd} - \sum_{d=1}^{D+1}\langle\alpha_{md}\rangle\ln\!\Big(1+\sum_{d=1}^{D}x_{nd}\Big) \Bigg],   (30)

which has the same logarithmic form as the prior distribution (i.e., the categorical distribution). Therefore, we can write ln q^*(Z) as

\ln q^{*}(\mathcal{Z}) = \sum_{n=1}^{N}\sum_{m=1}^{M} z_{nm}\ln\rho_{nm} + \mathrm{Con.},   (31)

with the definition

\ln\rho_{nm} = \tilde{R}_m + \langle\ln\lambda_m\rangle + \sum_{j=1}^{m-1}\langle\ln(1-\lambda_j)\rangle + \sum_{d=1}^{D}\big(\langle\alpha_{md}\rangle-1\big)\ln x_{nd} - \sum_{d=1}^{D+1}\langle\alpha_{md}\rangle\ln\!\Big(1+\sum_{d=1}^{D}x_{nd}\Big).   (32)

Recalling that z_{nm} ∈ {0, 1} and Σ_{m=1}^{M} z_{nm} = 1, we define

r_{nm} = \frac{\rho_{nm}}{\sum_{m=1}^{M}\rho_{nm}}.   (33)

Taking the exponential of both sides of (31), we have

q^{*}(\mathcal{Z}) = \prod_{n=1}^{N}\prod_{m=1}^{M} r_{nm}^{z_{nm}},   (34)

which is the optimal posterior distribution of Z. The posterior mean ⟨z_{nm}⟩ can be calculated as ⟨z_{nm}⟩ = r_{nm}. Actually, the quantities {r_{nm}} play a similar role to the responsibilities in the conventional EM algorithm [52].

2) The posterior distribution of q(~λ). The optimal solution to the posterior distribution of λ_m is given by

\ln q(\lambda_m) = \mathrm{Con.} + \ln\lambda_m \sum_{n=1}^{N}\langle z_{nm}\rangle + \ln(1-\lambda_m)\Bigg[\sum_{n=1}^{N}\sum_{j=m+1}^{M}\langle z_{nj}\rangle + \langle\varphi_m\rangle - 1\Bigg],   (35)

which has the logarithmic form of the beta prior distribution. Hence, the optimal posterior distribution is

q(\vec{\lambda}) = \prod_{m=1}^{M}\mathrm{Beta}(\lambda_m; g_m^{*}, h_m^{*}),   (36)

where the hyperparameters g_m^{*} and h_m^{*} are

g_m^{*} = 1 + \sum_{n=1}^{N}\langle z_{nm}\rangle, \quad h_m^{*} = \langle\varphi_m\rangle + \sum_{n=1}^{N}\sum_{j=m+1}^{M}\langle z_{nj}\rangle.   (37)

3) The posterior distribution of q(~ϕ). For the variable ϕ_m, we have

\ln q^{*}(\varphi_m) = \mathrm{Con.} + s_m\ln\varphi_m + \big[\langle\ln(1-\lambda_m)\rangle - t_m\big]\varphi_m.   (38)

It can be observed that (38) has the logarithmic form of the gamma prior distribution. By taking the exponential of both sides of (38), we have

q^{*}(\vec{\varphi}) = \prod_{m=1}^{M}\mathrm{Gam}(\varphi_m; s_m^{*}, t_m^{*}),   (39)

where the optimal solutions to the hyperparameters s_m^{*} and t_m^{*} are

s_m^{*} = 1 + s_m^{0}, \quad t_m^{*} = t_m^{0} - \langle\ln(1-\lambda_m)\rangle,   (40)

where s_m^0 and t_m^0 denote the hyperparameters initialized in the prior distribution, respectively.

4) The posterior distribution of q(Λ). Similar to the above derivations, for the variable α_{md}, 1 ≤ d ≤ D+1, the optimal approximation to the posterior distribution is

\ln q^{*}(\alpha_{md}) = \Bigg\{ \sum_{n=1}^{N}\langle z_{nm}\rangle \Big[\Psi\!\Big(\sum_{k=1}^{D+1}\langle\alpha_{mk}\rangle\Big) - \Psi\big(\langle\alpha_{md}\rangle\big)\Big]\langle\alpha_{md}\rangle + u_{md}^{0} - 1 \Bigg\}\ln\alpha_{md}
  - \Bigg\{ v_{md}^{0} - \sum_{n=1}^{N}\langle z_{nm}\rangle\Big[\ln x_{nd} - \ln\!\Big(1+\sum_{d=1}^{D}x_{nd}\Big)\Big] \Bigg\}\alpha_{md} + \mathrm{Con.}   (41)

Since the posterior distribution of α_{md} has the logarithmic form of the gamma distribution, we have

q^{*}(\Lambda) = \prod_{m=1}^{M}\prod_{d=1}^{D+1}\mathrm{Gam}(\alpha_{md}; u_{md}^{*}, v_{md}^{*}),   (42)

where the optimal solutions to the hyperparameters u_{md}^{*} and v_{md}^{*} are given by

u_{md}^{*} = u_{md}^{0} + \sum_{n=1}^{N}\langle z_{nm}\rangle \Big[\Psi\!\Big(\sum_{k=1}^{D+1}\langle\alpha_{mk}\rangle\Big) - \Psi\big(\langle\alpha_{md}\rangle\big)\Big]\langle\alpha_{md}\rangle   (43)

and

v_{md}^{*} = v_{md}^{0} - \sum_{n=1}^{N}\langle z_{nm}\rangle \Big[\ln x_{nd} - \ln\!\Big(1+\sum_{d=1}^{D}x_{nd}\Big)\Big].   (44)

In the above equations, u_{md}^0 and v_{md}^0 are the hyperparameters in the prior distribution, and we set x_{n,D+1} = 1. The following expectations are needed to calculate the aforementioned update equations:

\langle\ln(1-\lambda_m)\rangle = \Psi(h_m^{*}) - \Psi(g_m^{*}+h_m^{*}), \quad \langle\ln\lambda_m\rangle = \Psi(g_m^{*}) - \Psi(g_m^{*}+h_m^{*}), \quad \langle\ln\alpha_{md}\rangle = \Psi(u_{md}^{*}) - \ln v_{md}^{*}, \quad \langle\varphi_m\rangle = \frac{s_m^{*}}{t_m^{*}}, \quad \langle\alpha_{md}\rangle = \frac{u_{md}^{*}}{v_{md}^{*}}.   (45)
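Taken together, (32)-(34), (37), (40), and (43)-(45) define one full round of the coupled updates. The following is a minimal NumPy/SciPy sketch of such a round; the function name, array shapes, and the exact ordering of the updates are our illustrative choices, and the handling of the truncation boundary λ_M = 1 as well as the k-means initialization prescribed in Algorithm 1 below are omitted. It is a sketch under these assumptions, not the authors' reference implementation.

```python
import numpy as np
from scipy.special import digamma, gammaln, logsumexp

def update_round(X, r, g, h, s, t, u, v, s0, t0, u0, v0):
    """One round of the coupled variational updates, cf. Eqs. (28), (32)-(34),
    (37), (40), (43)-(45). Shapes: X (N, D) positive data; r (N, M);
    g, h, s, t, s0, t0 (M,); u, v, u0, v0 (M, D+1)."""
    N, D = X.shape
    Xp = np.hstack([X, np.ones((N, 1))])                       # x_{n,D+1} = 1
    log_geom = np.log(Xp) - np.log1p(X.sum(axis=1))[:, None]   # (N, D+1)

    # Current moments, Eq. (45)
    a_bar = u / v                                      # <alpha_md>
    ln_a = digamma(u) - np.log(v)                      # <ln alpha_md>
    ln_lam = digamma(g) - digamma(g + h)               # <ln lambda_m>
    ln_1mlam = digamma(h) - digamma(g + h)             # <ln(1 - lambda_m)>
    phi_bar = s / t                                    # <phi_m>

    # Auxiliary single lower bound R~_m, Eq. (28)
    R = (gammaln(a_bar.sum(axis=1)) - gammaln(a_bar).sum(axis=1)
         + np.sum((digamma(a_bar.sum(axis=1))[:, None] - digamma(a_bar))
                  * (ln_a - np.log(a_bar)) * a_bar, axis=1))   # (M,)

    # Responsibilities, Eqs. (32)-(34), normalized via log-sum-exp
    prior_term = ln_lam + np.concatenate(([0.0], np.cumsum(ln_1mlam)[:-1]))
    log_rho = ((R + prior_term)[None, :]
               + np.log(X) @ (a_bar[:, :D] - 1.0).T
               - np.log1p(X.sum(axis=1))[:, None] * a_bar.sum(axis=1)[None, :])
    r = np.exp(log_rho - logsumexp(log_rho, axis=1, keepdims=True))

    # q(lambda), Eq. (37); q(phi), Eq. (40)
    Nm = r.sum(axis=0)
    g = 1.0 + Nm
    h = phi_bar + np.concatenate((np.cumsum(Nm[::-1])[::-1][1:], [0.0]))
    s = 1.0 + s0
    t = t0 - (digamma(h) - digamma(g + h))

    # q(alpha), Eqs. (43)-(44)
    coef = (digamma(a_bar.sum(axis=1))[:, None] - digamma(a_bar)) * a_bar
    u = u0 + Nm[:, None] * coef
    v = v0 - r.T @ log_geom
    return r, g, h, s, t, u, v
```

In practice such a round is repeated until the lower bound in (46) below stops increasing.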

3.4 Full Variational Learning Algorithm

As can be observed from the above updating process, the optimal solutions for the posterior distributions depend on the moments evaluated with respect to the posterior distributions of the other variables. Thus, all the variational update equations are mutually coupled. In order to obtain optimal posterior distributions for all the variables, iterative updates are required until convergence. With the obtained posterior distributions, it is straightforward to calculate the lower bound L̃(q):

\tilde{\mathcal{L}}(q) = \int q(\Theta)\ln\frac{\tilde{p}(\mathcal{X},\Theta)}{q(\Theta)}\, d\Theta = \langle\ln\tilde{p}(\mathcal{X},\Theta)\rangle - \langle\ln q(\Theta)\rangle
  = \langle\ln\tilde{p}(\mathcal{X},\Theta)\rangle - \langle\ln q(\mathcal{Z})\rangle - \langle\ln q(\vec{\lambda})\rangle - \langle\ln q(\vec{\varphi})\rangle - \langle\ln q(\Lambda)\rangle,   (46)

which is helpful in monitoring the convergence. In (46), each term with an expectation (i.e., ⟨·⟩) is evaluated with respect to all the variables in its argument as

\langle\ln q(\mathcal{Z})\rangle = \sum_{n=1}^{N}\sum_{m=1}^{M} r_{nm}\ln r_{nm},   (47)

\langle\ln q(\vec{\lambda})\rangle = \sum_{m=1}^{M}\Big[\ln\Gamma(g_m^{*}+h_m^{*}) - \ln\Gamma(g_m^{*}) - \ln\Gamma(h_m^{*}) + (g_m^{*}-1)\langle\ln\lambda_m\rangle + (h_m^{*}-1)\langle\ln(1-\lambda_m)\rangle\Big],   (48)

\langle\ln q(\vec{\varphi})\rangle = \sum_{m=1}^{M}\Big[s_m^{*}\ln t_m^{*} - \ln\Gamma(s_m^{*}) + (s_m^{*}-1)\langle\ln\varphi_m\rangle - t_m^{*}\bar{\varphi}_m\Big],   (49)

and

\langle\ln q(\Lambda)\rangle = \sum_{m=1}^{M}\sum_{d=1}^{D+1}\Big[u_{md}^{*}\ln v_{md}^{*} - \ln\Gamma(u_{md}^{*}) + (u_{md}^{*}-1)\langle\ln\alpha_{md}\rangle - v_{md}^{*}\bar{\alpha}_{md}\Big],   (50)

where ϕ̄_m = ⟨ϕ_m⟩ and ᾱ_{md} = ⟨α_{md}⟩. Additionally, ⟨ln p̃(X, Θ)⟩ is given in (29). The algorithm of the proposed EVI-based Bayesian estimation of InIDMM is summarized in Algorithm 1.

Algorithm 1 Algorithm for EVI-based Bayesian InIDMM
1: Set the initial truncation level M and the initial values for the hyperparameters s_m^0, t_m^0, u_md^0, and v_md^0.
2: Initialize the values of r_nm by the K-means algorithm.
3: repeat
4:   Calculate the expectations in (45).
5:   Update the posterior distributions for each variable by (37), (40), (43), and (44).
6: until the stop criterion is reached.
7: For all m, calculate ⟨λ_m⟩ = g_m^*/(g_m^* + h_m^*) and substitute it back into (11) to get the estimated values of the mixing coefficients π̂_m.
8: Determine the optimum number of components M̂ by eliminating the components with mixing weights smaller than 10^-5.
9: Renormalize {π̂_m} to have a unit l_1 norm.
10: Calculate α̂_md = u_md^*/v_md^* for all m and d.

4 EXPERIMENTAL RESULTS

In this section, both synthesized data and real data are utilized to demonstrate the performance of the proposed algorithm for the InIDMM. In the initialization stage of all the experiments, the truncation level M is set to 15 and the hyperparameters of the gamma prior distributions are chosen as u^0 = s^0 = 1 and v^0 = t^0 = 0.005, which provide non-informative prior distributions. Note that these specific choices were based on our experiments and were found convenient and effective in our case. We take the posterior means as point estimates of the parameters in an InIDMM.

4.1 Synthesized Data Evaluation

As shown in previous studies on EVI-based Bayesian estimation, the single lower bound (SLB) approximation can guarantee the convergence while the multiple lower bound (MLB) approximation cannot. We use the synthesized data evaluation to compare the Bayesian InIDMM using the SLB approximation (proposed in this paper and denoted as InIDMM_SLB) with the Bayesian InIDMM using the MLB approximation (proposed in [47] and denoted as InIDMM_MLB). Three models (see Tab. 1 for details) were selected to generate the synthesized datasets.

4.1.1 Observations of Oscillations

We ran the InIDMM_MLB algorithm and monitored the value of the variational objective function during each iteration. It can be observed that the variational objective function was not always increasing in the Bayesian estimation with the InIDMM_MLB. Figure 2 illustrates the decreasing values during iterations. On the other hand, the variational objective function obtained with the InIDMM_SLB algorithm was always increasing until convergence, as the SLB approximation ensures the convergence theoretically. The observations of oscillations demonstrate that the convergence with the MLB approximation cannot be guaranteed. The original variational objective function was numerically calculated by employing a sampling method. In order to monitor the parameter estimation process of the InIDMM_SLB, we show the value of the variational objective function during iterations in Fig. 3. It can be observed that the variational objective function obtained by the InIDMM_SLB increases during iterations, and in most cases it increases very fast.

[Fig. 2 (plots not reproduced): panels (a) Model A, (b) Model B, (c) Model C; y-axis E_q(Θ)[ln p̃(X, Θ)] − E_q(Θ)[ln q(Θ)], x-axis iteration number.]
Fig. 2: Observations of the objective function's oscillations during iterations. This non-convergence indicates that the MLB approximation-based method cannot theoretically guarantee convergence. The model settings are the same as in Tab. 1.

[Fig. 3 (plots not reproduced): panels (a) Model A, (b) Model B, (c) Model C; the top axis marks the remaining number of mixture components at selected iterations.]
Fig. 3: Illustration of the variational objective function's values obtained by SLB against the number of iterations.

TABLE 1: Comparisons of true and estimated models.

Model A
  True Model:  π1 = 0.5,  ~α1 = [16 8 6 12]^T;  π2 = 0.5,  ~α2 = [8 12 15 18]^T
  InIDMM_SLB:  π̂1 = 0.502, α̂1 = [16.96 8.58 6.39 12.49]^T;  π̂2 = 0.498, α̂2 = [8.20 12.16 15.49 18.34]^T
  InIDMM_MLB:  π̂1 = 0.508, α̂1 = [15.20 7.71 5.90 11.64]^T;  π̂2 = 0.492, α̂2 = [9.21 13.76 17.13 21.10]^T

Model B
  True Model:  π1 = 0.25, ~α1 = [12 36 14 18 55 16]^T;  π2 = 0.25, ~α2 = [32 48 25 12 36 48]^T;  π3 = 0.25, ~α3 = [25 10 18 10 36 48]^T;  π4 = 0.25, ~α4 = [6 28 16 32 12 24]^T
  InIDMM_SLB:  π̂1 = 0.251, α̂1 = [12.26 36.59 14.30 18.19 56.36 16.25]^T;  π̂2 = 0.249, α̂2 = [33.37 49.92 25.85 12.80 37.00 49.79]^T;  π̂3 = 0.252, α̂3 = [25.72 10.32 18.09 10.09 37.27 49.58]^T;  π̂4 = 0.248, α̂4 = [6.14 28.94 16.72 33.46 12.32 25.20]^T
  InIDMM_MLB:  π̂1 = 0.249, α̂1 = [12.18 37.82 14.56 18.85 57.32 16.44]^T;  π̂2 = 0.249, α̂2 = [33.71 51.10 26.92 12.89 38.66 51.73]^T;  π̂3 = 0.250, α̂3 = [24.94 9.90 18.07 10.04 36.10 48.25]^T;  π̂4 = 0.252, α̂4 = [5.82 27.43 15.77 31.14 11.82 23.58]^T

Model C
  True Model:  π1 = 0.2, ~α1 = [12 21 36 18 32 65 76]^T;  π2 = 0.2, ~α2 = [28 42 21 8 54 21 48]^T;  π3 = 0.2, ~α3 = [32 12 7 35 13 32 18]^T;  π4 = 0.2, ~α4 = [62 44 31 65 72 15 44]^T;  π5 = 0.2, ~α5 = [53 12 18 44 65 33 52]^T
  InIDMM_SLB:  π̂1 = 0.201, α̂1 = [12.08 20.89 36.25 18.28 32.69 65.72 76.70]^T;  π̂2 = 0.199, α̂2 = [29.12 43.43 21.41 8.33 56.11 21.74 49.20]^T;  π̂3 = 0.200, α̂3 = [31.57 11.89 6.99 34.70 12.90 31.85 17.89]^T;  π̂4 = 0.201, α̂4 = [59.83 42.55 29.89 61.98 67.68 14.11 42.46]^T;  π̂5 = 0.199, α̂5 = [58.00 12.80 20.02 47.70 71.08 36.57 57.66]^T
  InIDMM_MLB:  π̂1 = 0.200, α̂1 = [12.56 21.50 37.69 19.00 33.06 68.04 79.64]^T;  π̂2 = 0.200, α̂2 = [28.26 43.02 20.85 8.14 55.36 21.21 49.17]^T;  π̂3 = 0.199, α̂3 = [32.17 12.19 7.13 35.66 13.01 32.54 17.84]^T;  π̂4 = 0.199, α̂4 = [63.61 45.48 32.00 66.63 74.31 15.21 45.45]^T;  π̂5 = 0.202, α̂5 = [52.12 11.83 18.34 43.77 64.80 32.53 51.48]^T

4.1.2 Quantitative Comparisons

Next, we compare the InIDMM_SLB with the InIDMM_MLB quantitatively. With a known IDMM, 2000 samples were generated. The InIDMM_SLB and the InIDMM_MLB were applied to estimate the posterior distributions of the model, respectively. In Tab. 1, we list the estimated parameters obtained by taking the posterior means. It can be observed that both the InIDMM_SLB and the InIDMM_MLB can carry out the estimation properly. However, with 20 repeats of the aforementioned "data generation-model estimation" procedure and calculating the variational objective function with the sampling method, superior performance of the InIDMM_SLB over the InIDMM_MLB can be observed from Tab. 2. The mean values of the objective function obtained by the InIDMM_SLB are larger than those obtained by the InIDMM_MLB, while the computational cost (measured in seconds) required by the InIDMM_SLB is smaller than that required by the InIDMM_MLB. Moreover, smaller KL divergences³ of the estimated models from the corresponding true models also verify that the InIDMM_SLB yields better estimates than the InIDMM_MLB. In order to examine whether the differences between the InIDMM_SLB and the InIDMM_MLB are statistically significant, we conducted the student's t-test with the null-hypothesis that the results obtained by these two methods have equal means and equal but unknown variances. All the p-values in Tab. 2 are smaller than the significance level 0.1, which indicates that the superiority of the InIDMM_SLB over the InIDMM_MLB is statistically significant. The distributions of the objective function values are shown by the boxplots in Fig. 4.

³Here, the KL divergence is calculated as KL(p(X|Θ) || p(X|Θ̂)) by a sampling method. Θ̂ denotes the point estimate of the parameters from the posterior distribution.

TABLE 2: Comparisons of objective function values and runtime for InIDMM with SLB and MLB.

                       |        Model A          |        Model B          |        Model C
Model & Method         | InIDMM_SLB | InIDMM_MLB | InIDMM_SLB | InIDMM_MLB | InIDMM_SLB | InIDMM_MLB
Obj. Func. Val.        | -1.86x10^3 | -1.90x10^3 | 0.42x10^3  | 0.32x10^3  | 3.05x10^3  | 2.99x10^3
p-values               |         0.046           |       6.48x10^-4        |         0.016
KL(p(X|Θ)||p(X|Θ̂))     | 3.35x10^-3 | 6.97x10^-3 | 2.80x10^-3 | 8.07x10^-3 | 2.93x10^-3 | 6.24x10^-3
p-values               |       1.46x10^-11       |       6.93x10^-15       |       2.08x10^-7
Runtime (in s)†        | 2.06       | 2.26       | 3.06       | 3.61       | 2.84       | 3.07
† On a ThinkCentre computer with an Intel Core i5-4590 CPU and 8 GB memory.

[Fig. 4 (boxplots not reproduced): panels (a) Model A, (b) Model B, (c) Model C, each comparing SLB and MLB.]
Fig. 4: Boxplots for comparisons of the objective function values' distributions obtained by SLB and MLB with different models. The model settings are the same as those in Tab. 1. The central mark is the median, the edges of the box are the 25th and 75th percentiles. The outliers are marked individually.
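Footnote 3 states that the KL divergences reported in Tab. 2 are computed by a sampling method. The following is a minimal sketch of such a Monte Carlo estimate for two inverted Dirichlet mixtures (assuming NumPy/SciPy; the helper names and the two parameter settings are ours and purely illustrative, not the models of Tab. 1).

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def log_idmm(X, pi, alphas):
    """Log-density of an inverted Dirichlet mixture, Eq. (1), at the rows of X."""
    X = np.atleast_2d(X)
    norm = gammaln(alphas.sum(axis=1)) - gammaln(alphas).sum(axis=1)      # (M,)
    comp = (norm[None, :]
            + np.log(X) @ (alphas[:, :-1] - 1.0).T
            - np.log1p(X.sum(axis=1))[:, None] * alphas.sum(axis=1)[None, :])
    return logsumexp(np.log(pi)[None, :] + comp, axis=1)

def sample_idmm(rng, n, pi, alphas):
    """Draw n samples: pick a component, then x_d = y_d / y_{D+1} with Dirichlet y."""
    comps = rng.choice(len(pi), size=n, p=pi)
    Y = np.vstack([rng.dirichlet(alphas[m]) for m in comps])
    return Y[:, :-1] / Y[:, -1:]

def mc_kl(rng, pi_true, a_true, pi_est, a_est, n=5000):
    """Monte Carlo estimate of KL(p_true || p_est) = E_true[ln p_true - ln p_est]."""
    X = sample_idmm(rng, n, pi_true, a_true)
    return np.mean(log_idmm(X, pi_true, a_true) - log_idmm(X, pi_est, a_est))

rng = np.random.default_rng(1)
# Illustrative two-component "true" and "estimated" models
pi_t = np.array([0.5, 0.5])
a_t = np.array([[16.0, 8.0, 6.0, 12.0], [8.0, 12.0, 15.0, 18.0]])
pi_e = np.array([0.51, 0.49])
a_e = np.array([[16.5, 8.3, 6.2, 12.3], [8.1, 12.1, 15.3, 18.2]])
print("estimated KL:", mc_kl(rng, pi_t, a_t, pi_e, a_e))
```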

TABLE 3: Comparisons of image categorization accuracies (in %) obtained with different models. The standard deviations are in the brackets. The p-values of the student's t-test with the null-hypothesis that InIDMM_SLB and the referring method have equal means but unknown variances are listed.

            | InIDMM_SLB  | InIDMM_MLB  | IDMM_SLB    | IDMM_MLB
Caltech-4   | 93.49(1.05) | 92.27(1.91) | 89.27(0.84) | 88.75(2.04)
p-value     | N/A         | 0.094       | 1.01x10^-8  | 3.79x10^-6
ETH-80      | 75.49(0.75) | 73.94(1.90) | 72.88(1.46) | 71.51(0.61)
p-value     | N/A         | 0.027       | 8.69x10^-5  | 1.25x10^-10

5 CONCLUSIONS

The inverted Dirichlet distribution has been widely applied in modeling the positive vector (a vector that contains only positive elements). The Dirichlet process mixture of inverted Dirichlet distributions (InIDMM) can provide good modeling performance for positive vectors. Compared to the conventional finite inverted Dirichlet mixture model (IDMM), the InIDMM has more flexible model complexity, as the number of mixture components can be automatically determined. Moreover, the over-fitting and under-fitting problem is avoided by the Bayesian estimation of the InIDMM. To obtain an analytically tractable solution for the Bayesian estimation of the InIDMM, we utilized the recently proposed extended variational inference (EVI) framework. With the single lower bound (SLB) approximation, the convergence of the proposed analytically tractable solution is guaranteed, while the solution obtained via multiple lower bound (MLB) approximations may result in oscillations of the variational objective function. Extensive synthesized data evaluations and real data evaluations demonstrated the superior performance of the proposed method.

REFERENCES

[1] B. Everitt and D. Hand, Finite Mixture Distributions. Chapman and Hall, London, UK, 1981.
[2] G. McLachlan and D. Peel, Finite Mixture Models. New York, NY, USA: Wiley, 2000.
[3] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., 2006.
[4] N. Bouguila, D. Ziou, and J. Vaillancourt, "Unsupervised learning of a finite mixture model based on the Dirichlet distribution and its application," IEEE Transactions on Image Processing, vol. 13, no. 11, pp. 1533–1543, 2004.
[5] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.
[6] N. Nasios and A. G. Bors, "Variational learning for Gaussian mixture models," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 36, no. 4, pp. 849–862, July 2006.
[7] J. Jung, S. R. Lee, H. Park, S. Lee, and I. Lee, "Capacity and error probability analysis of diversity reception schemes over generalized-K fading channels using a mixture gamma distribution," IEEE Transactions on Wireless Communications, vol. 13, no. 9, pp. 4721–4730, Sept 2014.
[8] J. Taghia, Z. Ma, and A. Leijon, "Bayesian estimation of the von-Mises Fisher mixture model with variational inference," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 9, pp. 1701–1715, Sept 2014.
[9] E. A. Houseman, B. C. Christensen, R. F. Yeh, C. J. Marsit, M. R. Karagas, M. Wrensch, H. H. Nelson, J. Wiemels, S. Zheng, J. K. Wiencke, and K. T. Kelsey, "Model-based clustering of DNA methylation array data: a recursive-partitioning algorithm for high-dimensional data arising as a mixture of beta distributions," Bioinformatics, vol. 9, p. 365, 2008.
[10] J. M. P. Nascimento and J. M. Bioucas-Dias, "Hyperspectral unmixing based on mixtures of Dirichlet components," IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 3, pp. 863–878, March 2012.
[11] Z. Ma, A. Leijon, and W. B. Kleijn, "Vector quantization of LSF parameters with a mixture of Dirichlet distributions," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 9, pp. 1777–1790, Sept 2013.
[12] P. Xu, Q. Yin, Y. Huang, Y.-Z. Song, Z. Ma, L. Wang, T. Xiang, W. B. Kleijn, and J. Guo, "Cross-modal subspace learning for fine-grained sketch-based image retrieval," Neurocomputing, vol. 278, pp. 75–86, Feb. 2018.
[13] H. Yu, Z. H. Tan, Z. Ma, R. Martin, and J. Guo, "Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features," IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–12, 2018.
[14] Z. Ma, J. H. Xue, A. Leijon, Z. H. Tan, Z. Yang, and J. Guo, "Decorrelation of neutral vector variables: Theory and applications," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 1, pp. 129–143, Jan 2018.
[15] Q. He, K. Chang, E. P. Lim, and A. Banerjee, "Keep it simple with time: A reexamination of probabilistic topic detection models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 10, pp. 1795–1808, Oct 2010.
[16] T. Bdiri and N. Bouguila, "Positive vectors clustering using inverted Dirichlet finite mixture models," Expert Systems with Applications, vol. 39, no. 2, pp. 1869–1882, 2012.
[17] ——, "Bayesian learning of inverted Dirichlet mixtures for SVM kernels generation," Neural Computing and Applications, vol. 23, no. 5, pp. 1443–1458, 2013.
[18] T. Bdiri, N. Bouguila, and D. Ziou, "Visual scenes categorization using a flexible hierarchical mixture model supporting users ontology," in IEEE International Conference on Tools with Artificial Intelligence, 2013, pp. 262–267.
[19] S. C. Markley and D. J. Miller, "Joint parsimonious modeling and model order selection for multivariate Gaussian mixtures," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 3, pp. 548–559, June 2010.
[20] Z. Liang and S. Wang, "An EM approach to MAP solution of segmenting tissue mixtures: a numerical analysis," IEEE Transactions on Medical Imaging, vol. 28, no. 2, pp. 297–310, 2009.
[21] N. Bouguila and D. Ziou, "Unsupervised selection of a finite Dirichlet mixture model: an MML-based approach," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 8, pp. 993–1009, June 2006.
[22] S. Richardson and P. J. Green, "Corrigendum: On Bayesian analysis of mixtures with an unknown number of components," Journal of the Royal Statistical Society, vol. 60, no. 3, p. 661, 1996.
[23] L. Huang, Y. Xiao, K. Liu, H. C. So, and J. K. Zhang, "Bayesian information criterion for source enumeration in large-scale adaptive antenna array," IEEE Transactions on Vehicular Technology, vol. 65, no. 5, pp. 3018–3032, May 2016.
[24] X. Chen, "Using Akaike information criterion for selecting the field distribution in a reverberation chamber," IEEE Transactions on Electromagnetic Compatibility, vol. 55, no. 4, pp. 664–670, Aug 2013.
[25] K. Bousmalis, S. Zafeiriou, L. P. Morency, M. Pantic, and Z. Ghahramani, "Variational infinite hidden conditional random fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1917–1929, Sept 2015.
[26] M. Meilă and H. Chen, "Bayesian non-parametric clustering of ranking data," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 11, pp. 2156–2169, Nov 2016.
[27] Y. Xu, M. Megjhani, K. Trett, W. Shain, B. Roysam, and Z. Han, "Unsupervised profiling of microglial arbor morphologies and distribution using a nonparametric Bayesian approach," IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 1, pp. 115–129, Feb 2016.
[28] T. S. Ferguson, "A Bayesian analysis of some nonparametric problems," Annals of Statistics, vol. 1, no. 2, pp. 209–230, 1973.
[29] C. E. Antoniak, "Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems," Annals of Statistics, vol. 2, no. 6, pp. 1152–1174, 1974.
[30] N. L. Hjort, C. Holmes, P. Müller, and S. G. Walker, Eds., Bayesian Nonparametrics. Cambridge University Press, 2010.
[31] Y. W. Teh and D. M. Blei, "Hierarchical Dirichlet processes," Journal of the American Statistical Association, vol. 101, no. 476, pp. 1566–1581, 2006.
[32] N. J. Foti and S. A. Williamson, "A survey of non-exchangeable priors for Bayesian nonparametric models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 2, pp. 359–371, Feb 2015.
[33] W. Fan and N. Bouguila, "Online learning of a Dirichlet process mixture of beta-Liouville distributions via variational inference," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 11, pp. 1850–1862, 2013.
[34] X. Wei and C. Li, "The infinite student's t-mixture for robust modeling," Signal Processing, vol. 92, no. 1, pp. 224–234, 2012.
[35] N. Bouguila and D. Ziou, "A Dirichlet process mixture of generalized Dirichlet distributions for proportional data modeling," IEEE Transactions on Neural Networks, vol. 21, no. 1, pp. 107–122, 2010.
[36] X. Wei and Z. Yang, "The infinite student's t-factor mixture analyzer for robust clustering and classification," Pattern Recognition, vol. 45, no. 12, pp. 4346–4357, 2012.
[37] S. P. Chatzis and G. Tsechpenakis, "The infinite hidden Markov random field model," IEEE Transactions on Neural Networks, vol. 21, no. 6, pp. 1004–1014, 2010.
[38] N. Bouguila and D. Ziou, "High-dimensional unsupervised selection and estimation of a finite generalized Dirichlet mixture model based on minimum message length," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 10, pp. 1716–1731, Aug. 2007.
[39] N. Bouguila, "Hybrid generative/discriminative approaches for proportional data modeling and classification," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 12, pp. 2184–2202, July 2012.
[40] M. Wedel and P. Lenk, Markov Chain Monte Carlo. Boston, MA: Springer US, 2013, pp. 925–930.
[41] C. P. Robert, The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer-Verlag New York, 2007.
[42] M. Pereyra, P. Schniter, E. Chouzenoux, J. C. Pesquet, J. Y. Tourneret, A. O. Hero, and S. McLaughlin, "A survey of stochastic simulation and optimization methods in signal processing," IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 2, pp. 224–241, Mar. 2016.
[43] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, "An introduction to variational methods for graphical models," Machine Learning, vol. 37, no. 2, pp. 183–233, 1999.
[44] Z. Ma and A. Leijon, "Modeling speech line spectral frequencies with Dirichlet mixture models," in Proceedings of INTERSPEECH, 2010.
[45] J. Taghia and A. Leijon, "Variational inference for Watson mixture model," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 9, pp. 1886–1900, 2015.
[46] J. Paisley, C. Wang, D. M. Blei, and M. I. Jordan, "Nested hierarchical Dirichlet processes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 2, pp. 256–270, Feb. 2015.
[47] W. Fan and N. Bouguila, "Topic novelty detection using infinite variational inverted Dirichlet mixture models," in IEEE International Conference on Machine Learning and Applications (ICMLA), Dec 2015, pp. 70–75.
[48] B. A. Frigyik, A. Kapila, and M. R. Gupta, "Introduction to the Dirichlet distribution and related processes," Department of Electrical Engineering, University of Washington, Tech. Rep., 2010.
[49] J. Paisley and L. Carin, "Hidden Markov models with stick-breaking priors," IEEE Transactions on Signal Processing, vol. 57, no. 10, pp. 3905–3917, June 2009.
[50] S. Kullback and R. A. Leibler, "On information and sufficiency," Annals of Mathematical Statistics, vol. 22, no. 22, pp. 79–86, 1951.
[51] H. Attias, "A variational Bayesian framework for graphical models," Advances in Neural Information Processing Systems, vol. 12, pp. 209–215, 2000.
[52] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. 39, pp. 1–38, 1977.