arXiv:1710.07954v3 [math.ST] 27 Aug 2018

© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: 10.1109/TSP.2018.2866385, IEEE Transactions on Signal Processing.

Bayesian Cluster Enumeration Criterion for Unsupervised Learning

Freweyni K. Teklehaymanot, Student Member, IEEE, Michael Muma, Member, IEEE, and Abdelhak M. Zoubir, Fellow, IEEE

Abstract—We derive a new Bayesian Information Criterion (BIC) by formulating the problem of estimating the number of clusters in an observed data set as maximization of the posterior probability of the candidate models. Given that some mild assumptions are satisfied, we provide a general BIC expression for a broad class of data distributions. This serves as a starting point when deriving the BIC for specific data distributions. Along this line, we provide a closed-form BIC expression for multivariate Gaussian distributed data. We show that incorporating the data structure of the clustering problem into the derivation of the BIC results in an expression whose penalty term is different from that of the original BIC. We propose a two-step cluster enumeration algorithm. First, a model-based unsupervised learning algorithm partitions the data according to the number of clusters specified by each candidate model. Subsequently, the number of clusters is determined as the one associated with the candidate model for which the proposed BIC is maximal. The performance of the proposed two-step algorithm is tested using synthetic and real data sets.

Index Terms—model selection, Bayesian information criterion, cluster enumeration, cluster analysis, unsupervised learning, multivariate Gaussian distribution

F. K. Teklehaymanot and A. M. Zoubir are with the Signal Processing Group and the Graduate School of Computational Engineering, Technische Universität Darmstadt, Darmstadt, Germany (e-mail: [email protected]; [email protected]).
M. Muma is with the Signal Processing Group, Technische Universität Darmstadt, Darmstadt, Germany (e-mail: [email protected]).

I. INTRODUCTION

STATISTICAL model selection is concerned with choosing a model that adequately explains the observations from a family of candidate models. Many methods have been proposed in the literature, see for example [1]–[25] and the review in [26]. Model selection problems arise in various applications, such as the estimation of the number of zero regression parameters [15], [18]–[20], [23]–[25], the selection of the number of regression parameters [11], [12], [14], the estimation of the number of signal components [4]–[6], [21], [22], and the estimation of the number of data clusters in unsupervised learning problems [27]–[45]. In this paper, our focus lies on the derivation of a model selection criterion for cluster analysis. The estimation of the number of clusters, also called cluster enumeration, has been intensively researched for decades [27]–[45]. A popular approach is to apply the Bayesian Information Criterion (BIC) [29], [31]–[33], [37]–[41], [44], which is based on the large sample limit of the Bayes' estimator and leads to the selection of the model that is a posteriori most probable. It is consistent if the true data generating model belongs to the family of candidate models that is under investigation.

The BIC was originally derived by Schwarz in [8] assuming that (i) the observations are independent and identically distributed (iid), (ii) they arise from an exponential family of distributions, and (iii) the candidate models are linear in parameters. Ignoring these rather restrictive assumptions, the BIC has been used in a much larger scope of model selection problems. A justification of the widespread applicability of the BIC was provided in [16] by generalizing Schwarz's derivation. In [16], the authors drop the first two assumptions made by Schwarz given that some regularity conditions are satisfied. The BIC is a generic criterion in the sense that it does not incorporate information regarding the specific model selection problem at hand. As a result, it penalizes two structurally different models in the same way if they have the same number of unknown parameters. The works in [15], [46] have shown that the rules that penalize for model complexity have to be carefully examined before they are applied to specific model selection problems. Nevertheless, despite the widespread use of the BIC for cluster enumeration [29], [31]–[33], [37]–[41], [44], little effort has been made towards checking the appropriateness of the original BIC formulation for cluster analysis. One noticeable work in this direction was made in [38] by providing a more accurate approximation of the marginal likelihood for small sample sizes. This derivation was made specifically for mixture models assuming that they are well separated. The resulting expression contains the original BIC term plus some additional terms that are based on the mixing probability and the Fisher Information Matrix (FIM) of each partition. The method proposed in [38] requires the calculation of the FIM for each cluster in each candidate model, which is computationally very expensive and impractical in real world applications with high dimensional data. This greatly limits the applicability of the cluster enumeration method proposed in [38]. Other than the above mentioned work, to the best of our knowledge, no one has thoroughly investigated the derivation of the BIC for cluster analysis using large sample approximations.

We derive a new BIC by formulating the problem of estimating the number of partitions (clusters) in an observed data set as maximization of the posterior probability of candidate models. Under some mild assumptions, we provide a general expression for the BIC, BIC_G(·), which is applicable to a broad class of data distributions. This serves as a starting point when deriving the BIC for specific data distributions in cluster analysis. Along this line, we simplify BIC_G(·) by imposing an assumption on the data distribution. A closed-form expression, BIC_N(·), is derived assuming that the data set is distributed as a multivariate Gaussian.

The derived criterion, BIC_N(·), is based on large sample approximations and it does not require the calculation of the FIM. This renders our criterion computationally cheap and practical compared to the criterion presented in [38].

Standard clustering methods, such as the Expectation-Maximization (EM) and K-means algorithms, can be used to cluster data only when the number of clusters is supplied by the user. To mitigate this shortcoming, we propose a two-step cluster enumeration algorithm which provides a principled way of estimating the number of clusters by utilizing existing clustering algorithms. The proposed two-step algorithm uses a model-based unsupervised learning algorithm to partition the observed data into the number of clusters provided by the candidate model prior to the calculation of BIC_N(·) for that particular model. We use the EM algorithm, which is a model-based unsupervised learning algorithm, because it is suitable for Gaussian mixture models and this complies with the Gaussianity assumption made by BIC_N(·). However, the model selection criterion that we propose is more general and can be used as a wrapper around any clustering algorithm, see [47] for a survey of clustering methods.

The paper is organized as follows. Section II formulates the problem of estimating the number of clusters given data. The proposed generic Bayesian cluster enumeration criterion, BIC_G(·), is introduced in Section III. Section IV presents the proposed Bayesian cluster enumeration algorithm for multivariate Gaussian data in detail. A brief description of the existing BIC-based cluster enumeration methods is given in Section V. Section VI provides a comparison of the penalty terms of different cluster enumeration criteria. A detailed performance evaluation of the proposed criterion and comparisons to existing BIC-based cluster enumeration criteria using simulated and real data sets are given in Section VII. Finally, concluding remarks are drawn and future directions are briefly discussed in Section VIII. A detailed proof is provided in Appendix A, whereas Appendix B contains the vector and matrix differentiation rules that we used in the derivations.

Notation: Lower- and upper-case boldface letters stand for column vectors and matrices, respectively; calligraphic letters denote sets with the exception of L, which is used for the likelihood function; R represents the set of real numbers; Z+ denotes the set of positive integers; probability density and mass functions are denoted by f(·) and p(·), respectively; x ∼ N(µ, Σ) represents a Gaussian distributed random variable x with mean µ and covariance matrix Σ; θ̂ stands for the estimator (or estimate) of the parameter θ; log denotes the natural logarithm; iid stands for independent and identically distributed; (A.) denotes an assumption, for example (A.1) stands for the first assumption; O(1) represents Landau's term which tends to a constant as the data size goes to infinity; I_r stands for an r × r identity matrix; 0_{r×r} denotes an r × r all zero matrix; #X represents the cardinality of the set X; ⊤ stands for vector or matrix transpose; |Y| denotes the determinant of the matrix Y; Tr(·) represents the trace of a matrix; ⊗ denotes the Kronecker product; vec(Y) refers to the stacking of the columns of an arbitrary matrix Y into one long column vector.

II. PROBLEM FORMULATION

Given a set of r-dimensional vectors X ≜ {x_1, ..., x_N}, let {X_1, ..., X_K} be a partition of X into K clusters X_k ⊆ X for k ∈ K ≜ {1, ..., K}. The subsets (clusters) X_k, k ∈ K, are independent, mutually exclusive, and non-empty. Let M ≜ {M_{L_min}, ..., M_{L_max}} be a family of candidate models that represent a partitioning of X into l = L_min, ..., L_max subsets, where l ∈ Z+. The parameters of each model M_l ∈ M are denoted by Θ_l = [θ_1, ..., θ_l], which lies in a parameter space Ω_l ⊂ R^{q×l}. Let f(X|M_l, Θ_l) denote the probability density function (pdf) of the observation set X given the candidate model M_l and its associated parameter matrix Θ_l. Let p(M_l) be the discrete prior of the model M_l over the set of candidate models M and let f(Θ_l|M_l) denote a prior on the parameter vectors in Θ_l given M_l ∈ M.

According to Bayes' theorem, the joint posterior density of M_l and Θ_l given the observed data set X is given by

    f(M_l, Θ_l | X) = p(M_l) f(Θ_l|M_l) f(X|M_l, Θ_l) / f(X),    (1)

where f(X) is the pdf of X. Our objective is to choose the candidate model M_K̂ ∈ M, where K̂ ∈ {L_min, ..., L_max}, which is most probable a posteriori assuming that

(A.1) the true number of clusters (K) in the observed data set X satisfies the constraint L_min ≤ K ≤ L_max.

Mathematically, this corresponds to solving

    M_K̂ = arg max_M p(M_l|X),    (2)

where p(M_l|X) is the posterior probability of M_l given the observations X. p(M_l|X) can be written as

    p(M_l|X) = ∫_{Ω_l} f(M_l, Θ_l|X) dΘ_l
             = f(X)^{-1} p(M_l) ∫_{Ω_l} f(Θ_l|M_l) L(Θ_l|X) dΘ_l,    (3)

where L(Θ_l|X) ≜ f(X|M_l, Θ_l) is the likelihood function. M_K̂ can also be determined via

    arg max_M log p(M_l|X)    (4)

instead of Eq. (2) since log is a monotonic function. Hence, taking the logarithm of Eq. (3) results in

    log p(M_l|X) = log p(M_l) + log ∫_{Ω_l} f(Θ_l|M_l) L(Θ_l|X) dΘ_l + ρ,    (5)

where −log f(X) is replaced by ρ (a constant) since it is not a function of M_l ∈ M and thus has no effect on the maximization of log p(M_l|X) over M. Since the partitions (clusters) X_m ⊆ X, m = 1, ..., l, are independent, mutually exclusive, and non-empty, f(Θ_l|M_l) and L(Θ_l|X) can be written as

    f(Θ_l|M_l) = ∏_{m=1}^{l} f(θ_m|M_l)    (6)

l 1 d2 log L(θ |X ) Θ θ˜⊤ m m θ˜ L( l|X )= L(θm|Xm). (7) + m ⊤ m 2 dθmdθ ˆ m=1 " m θm=θm # Y ˆ Nm ˜⊤ ˆ ˜ Substituting Eqs. (6) and (7) into Eq. (5) results in = log L(θm|Xm) − θ Hmθm, (10) 2 m l where θ˜m , θm − θˆm,m = 1,...,l. The first derivative of log p(Ml|X )=log p(Ml)+ log f(θm|Ml)L(θm|Xm)dθm Rq ˆ m=1 log L(θm|Xm) evaluated at θm vanishes because of assump- X Z +ρ. (8) tion (A.3). With , Maximizing log p(Ml|X ) over all candidate models Ml ∈M U f(θm|Ml) exp (log L(θm|Xm)) dθm, (11) Rq involves the computation of the logarithm of a multidimen- Z sional integral. Unfortunately, the solution of the multidimen- substituting Eq. (10) into Eq. (11) and approximating sional integral does not possess a closed analytical form for f(θm|Ml) by its Taylor series expansion yields most practical cases. This problem can be solved using either ˆ ˜⊤ df(θm|Ml) numerical integration or approximations that allow a closed- U ≈ f(θm|Ml)+ θm + HOT Rq dθm ˆ form solution. In the context of model selection, closed-form Z  θm=θm  N approximations are known to provide more insight into the × L(θˆ |X )exp − m θ˜⊤ Hˆ θ˜ dθ , (12) m m 2 m m m m problem than numerical integration [15]. Following this line    of argument, we use Laplace’s method of integration [15], where HOT denotes higher order terms and [46], [48] and provide an asymptotic approximation to the Nm ⊤ exp − θ˜ Hˆ mθ˜m is a Gaussian kernel with mean multidimensional integral in Eq. (8). 2 m   −1 θˆm and NmHˆ m . The second term III. PROPOSED BAYESIAN CLUSTER ENUMERATION in the first line of Eq. (12) vanishes because it simplifies to ˆ CRITERION κ E θm − θm = 0, where κ is a constant (see [48, p. 53] In this section, we derive a general BIC expression for for moreh details).i Consequently, Eq. (12) reduces to , which we call BIC . Under some mild G(·) ˆ ˆ Nm ˜⊤ ˆ ˜ U ≈f(θm|Ml)L(θm|Xm) exp − θmHmθm dθm assumptions, we provide a closed-form expression that is Rq 2 Z   applicable to a broad class of data distributions. 1/2 ˆ ˆ q/2 −1 ˆ −1 In order to provide a closed-form analytic approximation to = f(θm|Ml)L(θm|Xm) (2π) Nm Hm Rq Eq. (8), we begin by approximating the multidimensional inte- Z  1 Nm gral using Laplace’s method of integration. Laplace’s method θ˜⊤ Hˆ θ˜ dθ 1/2 exp − m m m m q/2 −1 −1 2 of integration makes the following assumptions. (2π) Nm Hˆ m    (A.2) log L(θm|Xm) with m =1,...,l has first- and second- 1/2 ˆ ˆ q/2 −1 ˆ −1 order derivatives which are continuous over the param- = f(θm |Ml)L(θm |Xm)(2π) Nm Hm , (13) eter space Ωl. where | · | stands for the determinant, given that N → ∞. (A.3) log L(θm|Xm) with m =1,...,l has a global maximum m Using Eq. (13), we are thus able to provide an asymptotic at θˆm, where θˆm is an interior point of Ωl. approximation to the multidimensional integral in Eq. (8). (A.4) f(θm|Ml) with m = 1,...,l is continuously differen- Now, substituting Eq. (13) into Eq. (8), we arrive at tiable and its first-order derivatives are bounded on Ωl. (A.5) The negative of the Hessian matrix of l 1 log L(θ |X ) p M p M f θˆ M θˆ Nm m m log ( l|X ) ≈ log ( l)+ log ( m| l)L( m|Xm) m=1 1 d2 log L(θ |X ) X   Hˆ , m m Rq×q (9) l m − ⊤ ∈ lq 1 Nm dθmdθ ˆ + log 2π − log Jˆm + ρ, (14) m θm=θm 2 2 m=1 is positive definite, where Nm is the cardinality of Xm X where (Nm = #Xm). That is, mins,m λs Hˆ m > ǫ for s = d2 log L(θ |X )  ˆ Jˆ , N Hˆ m m Rq×q (15) 1,...,q and m = 1,...,l, where λs Hm is the sth m m m = − ⊤ ∈ dθmdθ ˆ ˆ m θm=θm eigenvalue of Hm and ǫ is a small positive  constant. 
is the Matrix (FIM) of data from the mth The first step in Laplace’s method of integration is to write partition. the Taylor series expansion of f(θ |M ) and log L(θ |X ) m l m m In the derivation of log p(M |X ), so far, we have made no around θˆ ,m = 1,...,l. We begin by approximating l m distributional assumption on the data set X except that the log L(θm|Xm) by its second-order Taylor series expansion log-likelihood function log L(θm|Xm) and the prior on the around θˆm as follows: parameter vectors f(θm|Ml), for m =1,...,l, should satisfy some mild conditions under each model M . Hence, ˆ ˜⊤ d log L(θm|Xm) l ∈ M log L(θm|Xm)≈log L(θm|Xm)+θm dθm ˆ Eq. (14) is a general expression of the posterior probability of θm=θm

4

l l the model M given X for a general class of data distributions N l ≈ N log N − m log Σˆ that satisfy assumptions (A.2)-(A.5). The BIC is concerned m m 2 m m=1 m=1 with the computation of the posterior probability of candidate X X l q models and thus Eq. (14) can also be written as − log N , (18) 2 m , m=1 BICG(Ml) log p(Ml|X ) X where Nm is the cardinality of the subset Xm and it sat- ≈ log p(Ml) + log f(Θˆ l|Ml) + log L(Θˆ l|X ) isfies N = l N . The term l log N sums the l m=1 m m=1 m lq 1 logarithms of the number of data vectors in each cluster + log 2π − log Jˆ + ρ. (16) 2 2 m m =1,...,l.P P m=1 X Proof. Proving Theorem 1 requires finding an asymptotic After calculating BICG(Ml) for each candidate model Ml ∈ M, the number of clusters in X is estimated as approximation to log Jˆm in Eq. (16) and, based on this ap- ˆ proximation, deriving an expression for BICN(Ml). A detailed KBICG = argmax BICG(Ml). (17)  l=Lmin,...,Lmax proof is given in Appendix A.
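As a concrete illustration of the closed-form criterion in Eq. (18) and the selection rule in Eq. (19), the following minimal Python sketch evaluates BIC_N(M_l) from a hard partition of the data. The function name bic_n, the label convention, and the use of NumPy are choices made for this sketch and are not part of the paper; q = r + r(r+1)/2 is the per-cluster parameter count of the Gaussian model (mean entries plus unique covariance entries).

```python
import numpy as np

def bic_n(X, labels, l):
    """Sketch of the proposed criterion BIC_N(M_l), Eq. (18), for a hard partition.

    X      : (N, r) data matrix.
    labels : length-N integer array of cluster assignments in {0, ..., l-1}.
    l      : number of clusters specified by the candidate model M_l.
    """
    N, r = X.shape
    q = r + r * (r + 1) // 2        # per-cluster parameters: r mean entries
                                    # plus r(r+1)/2 unique covariance entries
    score = 0.0
    for m in range(l):
        Xm = X[labels == m]
        Nm = Xm.shape[0]
        if Nm <= r:                 # empty or too-small cluster: covariance estimate singular
            return -np.inf
        mu = Xm.mean(axis=0)
        Sigma = (Xm - mu).T @ (Xm - mu) / Nm     # ML covariance estimate, cf. Eq. (53)
        sign, logdet = np.linalg.slogdet(Sigma)
        if sign <= 0:
            return -np.inf
        score += Nm * np.log(Nm) - 0.5 * Nm * logdet - 0.5 * q * np.log(Nm)
    return score

# Eq. (19): with a clustering produced for each candidate l = L_min, ..., L_max,
# the estimate K_hat is the l whose partition maximizes bic_n(X, labels_l, l).
```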

However, calculating BICG(Ml) using Eq. (16) is a compu- Once the Bayesian Information Criterion, BICN(Ml), is com- tationally expensive task as it requires the estimation of the puted for each candidate model Ml ∈ M, the number of FIM, Jˆm, for each cluster m =1,...,l in the candidate model partitions (clusters) in X is estimated as Ml ∈M. Our objective is to find an asymptotic approximation ˆ KBICN = argmax BICN(Ml). (19) for log Jˆm ,m = 1, . . . , l, in order to simplify the com- l=Lmin,...,Lmax putation of BICG(Ml). We solve this problem by imposing Remark. The proposed criterion, BIC , and the original BIC N specific assumptions on the distribution of the data set X . as derived in [8], [16] differ in terms of their penalty terms. In the next section, we provide an asymptotic approximation A detailed discussion is provided in Section VI. for log Jˆm ,m = 1, . . . , l, assuming that Xm contains iid The first step in calculating BICN(Ml) for each model Ml ∈ multivariate Gaussian data points. M is the partitioning of the data set X into l clusters

Xm,m =1, . . . , l, and the estimation of the associated cluster IV. PROPOSED BAYESIAN CLUSTER ENUMERATION parameters using an unsupervised learning algorithm. Since ALGORITHM FOR MULTIVARIATE GAUSSIAN DATA the approximations in BICN(Ml) are based on maximizing the We propose a two-step approach to estimate the number of likelihood function of Gaussian distributed random variables, partitions (clusters) in X and provide an estimate of cluster pa- we use a clustering algorithm that is based on the maximum rameters, such as cluster centroids and covariance matrices, in likelihood principle. Accordingly, a natural choice is the EM an unsupervised learning framework. The proposed approach algorithm for Gaussian mixture models. consists of a model-based clustering algorithm, which clusters the data set X according to each candidate model Ml ∈ M, B. The Expectation-Maximization (EM) Algorithm for Gaus- and a Bayesian cluster enumeration criterion that selects the sian Mixture Models model which is a posteriori most probable. The EM algorithm finds maximum likelihood solutions for models with latent variables [49]. In our case, the latent A. Proposed Bayesian Cluster Enumeration Criterion for Mul- variables are the cluster memberships of the data vectors in tivariate Gaussian Data X , given that the l-component Gaussian mixture distribution

Let X , {x1,..., xN } denote the observed data set which of a data vector xn can be written as can be partitioned into K clusters {X1,..., XK }. Each cluster l Xk, k ∈ K, contains Nk data vectors that are realizations f(xn|Ml, Θl)= τmg(xn; µm, Σm), (20) of iid Gaussian random variables xk ∼ N (µk, Σk), where m=1 r×1 r×r X µk ∈ R and Σk ∈ R represent the centroid and the where g(xn; µm, Σm) represents the r-variate Gaussian pdf covariance matrix of the kth cluster, respectively. Further, and τ is the mixing coefficient of the mth cluster. The goal of , m let M {MLmin ,...,MLmax } denote a set of Gaussian the EM algorithm is to maximize the log-likelihood function candidate models and let there be a clustering algorithm that of the data set X with respect to the parameters of interest as partitions X into l independent, mutually exclusive, and non- follows: empty subsets (clusters) Xm,m = 1, . . . , l, by providing ˆ Σˆ ⊤ N l parameter estimates θm = [µˆm, m] for each candidate Ψ Σ Z+ argmax log L( l|X )=arg max log τmg(xn; µm, m), model Ml ∈ M, where l = Lmin,...,Lmax and l ∈ . Ψl Ψl n=1 m=1 Assume that (A.1)-(A.7) are satisfied. X X (21) where Ψ τ , Θ⊤ and τ τ ,...,τ ⊤. Maximizing Theorem 1. The posterior probability of M given l = [ l l ] l = [ 1 l] l ∈ M X Ψ can be asymptotically approximated as Eq. (21) with respect to the elements of l results in coupled equations. The EM algorithm solves these coupled equations , BICN(Ml) log p(Ml|X ) using a two-step iterative procedure. The first step (E step) 5

(i) evaluates υˆnm, which is an estimate of the probability that Algorithm 1 Proposed two-step cluster enumeration approach data vector xn belongs to the mth cluster at the ith iteration, Inputs: data set ; set of candidate models , (i) X M for n = 1,...,N and m = 1,...,l. υˆnm is calculated as {MLmin ,...,MLmax } follows: for l = Lmin,...,Lmax do (i−1) (i−1) Σˆ (i−1) Step 1: Model-based clustering (i) τˆm g(xn; µˆm , m ) υˆnm = , (22) Step 1.1: The EM algorithm l τ (i−1)g x µˆ(i−1), Σˆ (i−1) j=1 ˆj ( n; j j ) for m =1,...,l do (i−1) PΣˆ (i−1) Initialize µm using K-means++ [50], [51] where µˆm and m represent the centroid and covariance 1 ⊤ Σˆ m = (xn − µˆm)(xn − µˆm) matrix estimates, respectively, of the mth cluster at the previ- Nm xn∈Xm τˆ = Nm ous iteration (i−1). The second step (M step) re-estimates the m N P end for cluster parameters using the current values of υˆnm as follows: for i =1, 2,... do N (i) υˆnmx E step: µˆ(i) = n=1 n (23) m N (i) for n =1,...,N do P n=1 υˆnm ( ) ( ) ( ) for m =1,...,l do N υˆ i (x − µˆ i )(x − µˆ i )⊤ (i) Σˆ (i) = Pn=1 nm n m n m (24) Calculate υˆnm using Eq. (22) m N (i) end for P n=1 υˆnm N (i) end for υˆnm τˆ(i) = n=1 P (25) M step: m N P for m =1,...,l do The E and M steps are performed iteratively until either the Determine µˆ(i), Σˆ (i), and τˆ(i) via Eqs. (23)-(25) Ψˆ m m m cluster parameter estimates l or the log-likelihood estimate end for log L(Ψˆ l|X ) converges. Ψˆ (i) Check for convergence of either l or A summary of the estimation of the number of clusters in Ψˆ (i) log L( l |X ) an observed data set using the proposed two-step approach if convergence condition is satisfied then is provided in Algorithm 1. Note that the computational Exit for loop complexity of BICN(Ml) is only O(1), which can easily be end if ignored during the run-time analysis of the proposed approach. end for Hence, since the EM algorithm is run for all candidate models Step 1.2: Hard clustering in M, the computational complexity of the proposed two-step for n =1,...,N do approach is ζNr2 L ... L , where ζ is a fixed O( ( min + + max)) for m =1,...,l do stopping threshold of the EM algorithm. (i) 1, m = argmax υˆnj ι = j=1,...,l V. EXISTING BIC-BASED CLUSTER ENUMERATION nm 0, otherwise METHODS  As discussed in Section I, existing cluster enumeration end for  algorithms that are based on the original BIC use the crite- end for for m =1,...,l do rion as it is known from parameter estimation tasks without N questioning its validity on cluster analysis. Nevertheless, since Nm = n=1 ιnm end for these criteria have been widely used, we briefly review them P Step 2: Calculate BICN(Ml) via Eq. (18) to provide a comparison to the proposed criterion BICN, which is given by Eq. (18). end for Estimate the number of clusters, Kˆ , in via Eq. (19) The original BIC, as derived in [16], evaluated at a candidate BICN X model Ml ∈M is written as BIC (M ) = 2 log L(Θˆ |X ) − ql log N, (26) O l l to each cluster are iid as Gaussian and all clusters are spherical 2 where L(Θˆ l|X ) denotes the likelihood function and N = #X . with an identical , i.e. Σm = Σj = σ Ir for m 6= j, 2 In Eq. (26), 2 log L(Θˆ l|X ) denotes the data-fidelity term, where σ is the common variance of the clusters in Ml [29], while ql log N is the penalty term. Under the assumption [32], [33]. Under these assumptions, the original BIC is given that the observed data is Gaussian distributed, the data-fidelity by terms of BIC and the ones of our proposed criterion, BIC , O N Θˆ are exactly the same. 
The only deference between the two BICOS(Ml) = 2 log L( l|X ) − α log N, (27) is the penalty term. Hence, we use a similar procedure as in Algorithm 1 to implement the original BIC as a wrapper where BICOS(Ml) denotes the original BIC of the candidate around the EM algorithm. model Ml derived under the assumptions stated above and Moreover, the original BIC is commonly used as a wrapper α = (rl + 1) is the number of estimated parameters in Ml ∈ around K-means by assuming that the data points that belong M. Ignoring the model independent terms, BICOS(Ml) can be 6 written as enumeration criteria. Then, the numerical per- l formed on sets and the results obtained from real 2 BICOS(Ml)=2 Nm log Nm −rN logσ ˆ −α log N, (28) data sets are discussed in detail. For all simulations, we assume m=1 that a family of candidate models M , {M ,...,M } X Lmin Lmax where is given with Lmin = 1 and Lmax = 2K, where K is the l true number of clusters in the data set X . All simulation 2 1 ⊤ results are an average of 1000 Monte Carlo experiments unless σˆ = (xn − µˆm) (xn − µˆm) (29) rN stated otherwise. The compared cluster enumeration criteria m=1 xn∈Xm X X are based on the same initial cluster parameters in each Monte is the maximum likelihood estimator of the common variance. Carlo , which allows for a fair comparison. In our experiments, we implement BICOS as a wrapper around The MATLAB c code that implements the proposed two- the K-means++ algorithm [51]. The implementation of the step algorithm and the Bayesian cluster enumeration methods proposed BIC as a wrapper around the K-means++ algorithm discussed in Section V is available at: is given by Eq. (37). https://github.com/FreTekle/Bayesian-Cluster-Enumeration
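To make the wrapper construction discussed in this section concrete, the sketch below runs a model-based clustering step for each candidate l and then scores the resulting partition. It uses scikit-learn's GaussianMixture as a stand-in for the EM algorithm with K-means++ seeding of Algorithm 1 (an assumption of this sketch, not the authors' MATLAB implementation), reuses the bic_n helper sketched after Theorem 1, and also evaluates the original BIC of Eq. (26) for comparison; note that the mixture log-likelihood returned by scikit-learn includes the mixing proportions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def enumerate_clusters(X, L_min, L_max, random_state=0):
    """Two-step enumeration in the spirit of Algorithm 1, wrapped around EM.

    Step 1: EM for an l-component Gaussian mixture followed by hard assignment.
    Step 2: score each candidate with BIC_N (Eq. (18)) and BIC_O (Eq. (26));
            return both argmax estimates, cf. Eq. (19).
    """
    N, r = X.shape
    q = r + r * (r + 1) // 2                          # per-cluster parameter count
    scores_n, scores_o = {}, {}
    for l in range(L_min, L_max + 1):
        gmm = GaussianMixture(n_components=l, covariance_type="full",
                              init_params="kmeans", random_state=random_state).fit(X)
        labels = gmm.predict(X)                       # hard clustering (Step 1.2)
        scores_n[l] = bic_n(X, labels, l)             # proposed criterion, Eq. (18)
        log_lik = gmm.score(X) * N                    # total mixture log-likelihood
        scores_o[l] = 2.0 * log_lik - q * l * np.log(N)   # original BIC, Eq. (26)
    return max(scores_n, key=scores_n.get), max(scores_o, key=scores_o.get)
```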

VI. COMPARISON OF THE PENALTY TERMS OF DIFFERENT BAYESIAN CLUSTER ENUMERATION CRITERIA A. Performance Measures Comparing Eqs. (18), (26), and (27), we notice that they The of detection (pdet), the empirical have a common form [11], [46], that is, probability of underestimation (punder), the empirical proba- bility of selection, and the Mean Absolute Error (MAE) are Θˆ 2 log L( l|X ) − η, (30) used as performance measures. The empirical probability of but with different penalty terms, where detection is defined as the probability with which the correct number of clusters is selected and it is calculated as l MC BICN : η = q log Nm (31) 1

pdet = ½ ˆ , (34) m=1 MC {Ki=K} X i=1 BICO : η = ql log N (32) X where MC is the total number of Monte Carlo experiments, BICOS : η = (rl + 1) log N. (33) Kˆi is the estimated number of clusters in the ith Monte Carlo

Remark. BIC and BIC carry information about the struc- experiment, and ½ is an indicator function which is O OS {Kˆ i=K} ture of the data only on their data-fidelity term, which is the defined as first term in Eq. (30). On the other hand, as shown in Eq. (18), ˆ , 1, if Ki = K both the data-fidelity and penalty terms of our proposed ½ ˆ . (35) {Ki=K} 0, otherwise criterion, BICN, contain information about the structure of the ( data. The empirical probability of underestimation (punder) is the ˆ The penalty terms of BICO and BICOS depend linearly on probability that K

In the asymptotic regime, the penalty terms of BICN and BICO (MAE), is computed as coincide. But, in the finite sample regime, for values of l> 1, MC 1 the penalty term of BICO is stronger than the penalty term of MAE = K − Kˆi . (36) BIC . Note that the penalty term of our proposed criterion, MC N i=1 BIC , depends on the number of data vectors in each cluster, X N Nm,m = 1,...,l, of each candidate model Ml ∈ M, while B. Numerical Experiments the penalty term of the original BIC depends only on the total 1) Simulation Setup: We consider two synthetic data sets, number of data vectors in the data set. Hence, the penalty namely Data-1 and Data-2, in our simulations. Data-1, shown term of our proposed criterion might exhibit sensitivities to the in Fig. 1a, contains realizations of the random variables initialization of cluster parameters and the associated number xk ∼ N (µk, Σk), where k = 1, 2, 3, with cluster centroids of data vectors per cluster. ⊤ ⊤ ⊤ µ1 = [2, 3.5] , µ2 = [6, 2.7] , µ3 = [9, 4] , and covariance matrices VII. EXPERIMENTAL EVALUATION 0.2 0.1 0.5 0.25 1 0.5 In this section, we compare the cluster enumeration per- Σ = , Σ = , Σ = . 1 0.1 0.75 2 0.25 0.5 3 0.5 1 formance of our proposed criterion, BICN given by Eq. (18),       with the cluster enumeration methods discussed in Section V, The first cluster is linearly separable from the others, while the namely BICO and BICOS given by Eqs. (26) and (28), respec- remaining clusters are overlapping. The number of data vectors tively, using synthetic and real data sets. We first describe the per cluster is specified as N1 = γ × 50, N2 = γ × 100, and performance measures used for comparing the different cluster N3 = γ × 200, where γ is a constant. 7
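The simulation setup just described, together with the performance measures of Section VII-A, can be reproduced along the following lines. This is a minimal Python sketch under the stated setup; all function names are ours, and estimate_K in the usage comment stands for whichever enumeration routine is being evaluated (for instance, the two-step sketch of Section V).

```python
import numpy as np

def generate_data1(gamma, rng):
    """One realization of Data-1: three Gaussian clusters with N_k = gamma*(50, 100, 200)."""
    mus = [np.array([2.0, 3.5]), np.array([6.0, 2.7]), np.array([9.0, 4.0])]
    covs = [np.array([[0.2, 0.1], [0.1, 0.75]]),
            np.array([[0.5, 0.25], [0.25, 0.5]]),
            np.array([[1.0, 0.5], [0.5, 1.0]])]
    sizes = [50 * gamma, 100 * gamma, 200 * gamma]
    return np.vstack([rng.multivariate_normal(m, c, n)
                      for m, c, n in zip(mus, covs, sizes)])

def empirical_measures(K_true, K_hats):
    """Empirical p_det, p_under and MAE, cf. Eqs. (34)-(36), from Monte Carlo estimates."""
    K_hats = np.asarray(K_hats)
    p_det = np.mean(K_hats == K_true)           # Eq. (34)
    p_under = np.mean(K_hats < K_true)          # empirical probability of underestimation
    mae = np.mean(np.abs(K_true - K_hats))      # Eq. (36)
    return p_det, p_under, mae

# Example Monte Carlo loop (estimate_K is a placeholder for an enumeration routine):
# rng = np.random.default_rng(0)
# K_hats = [estimate_K(generate_data1(gamma=1, rng=rng)) for _ in range(1000)]
# print(empirical_measures(3, K_hats))
```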

Fig. 1: Synthetic data sets. (a) Data-1 for γ = 1. (b) Data-2 for Nk = 100. (Scatter plots over Feature 1 and Feature 2.)

The second data set, Data-2, contains realizations of the random variables xk ∼ N(µk, Σk), where k = 1, ..., 10, with cluster centroids µ1 = [0, 0]⊤, µ2 = [3, −2.5]⊤, µ3 = [3, 1]⊤, µ4 = [−1, −3]⊤, µ5 = [−4, 0]⊤, µ6 = [−1, 1]⊤, µ7 = [−3, 3]⊤, µ8 = [2.5, 4]⊤, µ9 = [−3.5, −2.5]⊤, µ10 = [0, 3]⊤, and covariance matrices Σ1 = [0.25, −0.15; −0.15, 0.15], Σ2 = [0.5, 0; 0, 0.15], and Σi = [0.1, 0; 0, 0.1], where i = 3, ..., 10. As depicted in Fig. 1b, Data-2 contains eight identical and spherical clusters and two elliptical clusters. There exists an overlap between two clusters, while the rest of the clusters are well separated. All clusters in this data set have the same number of data vectors.

2) Simulation Results: Data-1 is particularly challenging for cluster enumeration criteria because it has not only overlapping but also unbalanced clusters. Cluster unbalance refers to the fact that different clusters have a different number of data vectors, which might result in some clusters dominating the others. The impact of cluster overlap and unbalance on pdet and MAE is displayed in Table I.

TABLE I: The empirical probability of detection in %, the empirical probability of underestimation in %, and the Mean Absolute Error (MAE) of various Bayesian cluster enumeration criteria as a function of γ for Data-1.

                    γ        1       3       6       12      48
  pdet (%)    BICN           55.2    74.3    87.4    95.7    100
              BICO           43.6    69.7    85.1    94.9    100
              BICOS          53.9    50.5    49.4    42.4    31.8
  punder (%)  BICN           44.5    25.7    12.6    4.3     0
              BICO           56.4    30.3    14.9    5.1     0
              BICOS          0       0       0       0       0
  MAE         BICN           0.449   0.257   0.126   0.043   0
              BICO           0.564   0.303   0.149   0.051   0
              BICOS          0.469   0.495   0.506   0.576   0.682

This table shows pdet and MAE as a function of γ, where γ is allowed to take values from the set {1, 3, 6, 12, 48}. The cluster enumeration performance of BICOS is lower than the other methods because it is designed for spherical clusters with identical variance, while Data-1 has one elliptical and two spherical clusters with different covariance matrices. Our proposed criterion performs best in terms of pdet and MAE for all values of γ. As γ increases, which corresponds to an increase in the number of data vectors in the data set, the cluster enumeration performance of BICN and BICO greatly improves, while the performance of BICOS deteriorates because of the increase in overestimation.

The total criterion (BIC) and penalty term of different Bayesian cluster enumeration criteria as a function of the number of clusters specified by the candidate models for γ = 1 is depicted in Fig. 2. The BIC plot in Fig. 2a is the result of one Monte Carlo run. It shows that BICN and BICO have a maximum at the true number of clusters (K = 3), while BICOS overestimates the number of clusters to K̂BICOS = 4. As shown in Fig. 2b, our proposed criterion, BICN, has the second strongest penalty term. Note that the penalty term of our proposed criterion shows a curvature at the true number of clusters, while the penalty terms of BICO and BICOS are uninformative on their own.

TABLE II: The empirical probability of detection in %, the empirical probability of underestimation in %, and the Mean Absolute Error (MAE) of various Bayesian cluster enumeration criteria as a function of the number of data vectors per cluster (Nk) for Data-2.

                    Nk       100     200     500     1000
  pdet (%)    BICN           56.1    66      81      85.3
              BICO           41      57.1    78      84.9
              BICOS          2.7     0.9     0.1     0
  punder (%)  BICN           37.6    30.2    18.2    13.5
              BICO           58.6    41.7    21.4    14.1
              BICOS          0       0       0       0
  MAE         BICN           0.452   0.341   0.19    0.148
              BICO           0.59    0.429   0.22    0.151
              BICOS          1.613   1.659   1.745   1.8

Table II shows pdet and MAE as a function of the number of data vectors per cluster, Nk, k = 1, ..., 10, where Nk is allowed to take values from the set {100, 200, 500, 1000}, for Data-2.

points. However, unless the random initializations are repeated 3700 sufficiently many times, the algorithms tend to converge to a BIC N poor local optimum. K-means++ [51] attempts to solve this BIC 3600 O problem by providing a systematic initialization to K-means. BIC OS One can also use a few runs of K-means++ to initialize the EM 3500 algorithm. An alternative approach to the initialization prob- lem is to use random swap [52], [53]. Unlike repeated random 3400 initializations, random swap creates random perturbations to BIC the solutions of K-means and EM in an attempt to move the 3300 clustering result away from an inferior local optimum. We compare the performance of the proposed criterion and 3200 the original BIC as wrappers around the above discussed clustering methods using five synthetic data sets, which in- 3100 clude Data-1 with γ = 6, Data-2 with Nk = 500, and the 1 2 3 4 5 6 ones summarized in Table III. The number of random swaps Number of clusters specified by the candidate models is set to 100 and the results are an average of 100 Monte (a) Total criteria Carlo experiments. To allow for a fair comparison, the number of replicates required by the clustering methods that use K- 200 means++ initialization is set equal to the number of random BIC N swaps. BIC O The empirical probability of detection (pdet) of the proposed BIC 150 OS criterion and the original BIC as wrappers around the different ) clustering methods is depicted in Table IV, where RSK-means is the random swap K-means and REM is the random swap

100 EM. BICNS is the implementation of the proposed BIC as a wrapper around the K-means variants and is given by

l Penalty term ( Nr 50 BIC = N log N − logσ ˆ2 NS m m 2 m=1 X α l − log N , (37) 0 2 m 1 2 3 4 5 6 m=1 Number of clusters specified by the candidate models X where α = r +1 and σˆ2 is given by Eq. (29). For the data sets (b) Penalty terms that are mostly spherical, the K-means variants outperform the Fig. 2: The BIC (a) and penalty term (b) of different Bayesian ones that are based on EM in terms of the correct estimation cluster enumeration criteria for Data-1 when γ =1. of the number of clusters, while, as expected, EM is superior for the elliptical data sets. Among the K-means variants, the gain obtained from using random swap instead of simple K- means++ is almost negligible. On the other hand, for the EM Data-2. Data-2 contains both spherical and elliptical clusters variants, EM significantly outperforms RSEM especially for and there is an overlap between two clusters, while the rest of BICN. the clusters are well separated. The proposed criterion, BICN, consistently outperforms the cluster enumeration methods that TABLE III: Summary of synthetic data sets in terms of their are based on the original BIC for the specified number of number of features (r), number of samples (N), number of data vectors per cluster (Nk). BICO tends to underestimate samples per cluster (N ), and number of clusters (K). ˆ k the number of clusters to KBICO = 9 when Nk is small, and it merges the two overlapping clusters. Even though majority Data sets r N Nk K of the clusters are spherical, BICOS rarely finds the correct S3 [54] 2 5000 333 15 number of clusters. A1 [55] 2 3000 150 20 G2-2-40 [56] 2 2048 1024 2 3) Initialization Strategies for Clustering Algorithms: The overall performance of the two-step approach presented in Algorithm 1 depends on how well the clustering algorithm in C. Real Data Results the first step is able to partition the given data set. Clustering Although there is no when repeating the ex- algorithms such as K-means and EM are known to converge periments for the real data sets, we still use the empirical to a local optimum and exhibit sensitivity to initialization probabilities defined in Section VII-A as performance mea- of cluster parameters. The simplest initialization method is sures because the cluster enumeration results vary depending to randomly select cluster centroids from the set of data on the initialization of the EM and K-means++ algorithm. 9
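For concreteness, the K-means-based variants of Eqs. (28) and (37) can be sketched as follows. The sketch uses scikit-learn's KMeans (k-means++ initialization) as a stand-in for the K-means++ algorithm [51]; the function name and the return convention are ours, and the common variance is the maximum likelihood estimate of Eq. (29) computed from the assigned centroids.

```python
import numpy as np
from sklearn.cluster import KMeans

def bic_spherical(X, l, random_state=0):
    """BIC_OS (Eq. (28)) and BIC_NS (Eq. (37)) as wrappers around K-means++.

    Both criteria assume spherical clusters sharing a common variance,
    estimated by Eq. (29) as the average squared deviation from the
    assigned centroid.
    """
    N, r = X.shape
    km = KMeans(n_clusters=l, init="k-means++", n_init=10,
                random_state=random_state).fit(X)
    labels, centers = km.labels_, km.cluster_centers_
    Nm = np.bincount(labels, minlength=l).astype(float)
    if np.any(Nm == 0):
        return -np.inf, -np.inf
    sigma2 = np.sum((X - centers[labels]) ** 2) / (r * N)        # Eq. (29)
    bic_os = (2.0 * np.sum(Nm * np.log(Nm)) - r * N * np.log(sigma2)
              - (r * l + 1) * np.log(N))                         # Eq. (28), alpha = rl + 1
    alpha = r + 1
    bic_ns = (np.sum(Nm * np.log(Nm)) - 0.5 * N * r * np.log(sigma2)
              - 0.5 * alpha * np.sum(np.log(Nm)))                # Eq. (37)
    return bic_os, bic_ns
```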

TABLE IV: Empirical probability of detection in %. 100 Data-1 Data-2 S3 A1 G2-2-40 BIC N BIC 49 0 100 98 100 50 K-means++ [51] NS BICOS 48 0 100 98 100 0 BIC 49 0 100 100 100 RSK-means [52] NS 1 2 3 4 5 6 BIC 100 100 100 OS 48 0 100

BICN 87 92 10 98 100 BIC EM [49] O BICO 85 89 10 98 100 50 BIC 22 68 11 16 90 RSEM [53] N BIC 85 89 9 28 97 0 O 1 2 3 4 5 6 100 BIC OS 1) Iris Data Set: The Iris data set, also called Fisher’s 50

Iris data set [57], is a 4-dimensional data set collected from Empirical probability of selection in % 0 three species of the Iris flower. It contains three clusters of 1 2 3 4 5 6 50 instances each, where each cluster corresponds to one Number of clusters specified by the candidate models species of the Iris flower [58]. One cluster is linearly separable Fig. 3: Empirical probability of selection of our proposed from the other two, while the remaining ones overlap. We criterion, BICN, and existing Bayesian cluster enumeration have normalized the data set by dividing the features by their criteria for the Iris data set. corresponding mean. Fig. 3 shows the empirical probability of selection of different cluster enumeration criteria as a function of the number of clusters specified by the candidate models in M. Our proposed cluster enumeration in the sense that each camera monitors the criterion, BICN, is able to estimate the correct number of scene from different angles, which can result in differences in clusters (K = 3) 98.8% of the time, while BICO always the extracted feature vectors (descriptors) of the same object. ˆ underestimates the number of clusters to KBICO = 2. BICOS Furthermore, as shown in Fig. 5, the video that is captured by completely breaks down and, in most cases, goes for the the cameras has a low resolution. specified maximum number of clusters. Even though two We consider a centralized network structure where the spa- out of three clusters are not linearly separable, our proposed tially distributed cameras send feature vectors to a central criterion is able to estimate the correct number of clusters with fusion center for further processing. Hence, each camera a very high empirical probability of detection. Fig. 4 shows ci ∈C , {c1,...,c7} first extracts the objects of interest, cars the behavior of the BIC curves of the proposed criterion, BICN in this case, from the frames in the video using a Gaussian given by Eq. (18), and the original BIC implemented as a mixture model-based foreground detector. Then, SURF [59] wrapper around the EM algorithm, BICO given by Eq. (26), and color features are extracted from the cars. A standard for one Monte Carlo experiment. From Eq. (30), we know MATLAB c implementation of SURF is used to generate a that the data-fidelity terms of both criteria are the same and 64-dimensional feature vector for each detected object. Addi- this can be seen in Fig. 4a. But, their penalty terms are quite tionally, a 10 bin for each of the RGB color channels different, see Fig. 4b. Due to the difference in the penalty is extracted, resulting in a 30-dimensional color feature vector. terms of BICN and BICO, we observe a different BIC curve In our simulations, we apply Principal Component Analysis in Fig. 4c. The total criterion (BIC) curve of BICN has a (PCA) to reduce the dimension of the color features to 15. maximum at the true number of clusters (K = 3), while Each camera ci ∈C stores its feature vectors in Xci . Finally, ˆ BICO has a maximum at KBICO =2. Observe that, again, the the feature vectors extracted by each camera, Xci , are sent to penalty term of our proposed criterion, BICN, has a curvature the fusion center. At the fusion center, we have the total set , R79×5213 at the true number of clusters K =3. Just as in the simulated of feature vectors X {Xc1 ,..., Xc7 }⊂ based on data experiments, the penalty term of our proposed criterion which the cluster enumeration is performed. 
gives valuable information about the true number of clusters The empirical probability of selection for different Bayesian in the data set while the penalty terms of the other methods cluster enumeration criteria as a function of the number of are uninformative on their own. clusters specified by the candidate models in M is displayed 2) Multi-Object Multi-Camera Network Application: The in Fig. 6. Even though there are six cars in the scene of multi-object multi-camera network application [44], [45] de- interest, two cars have similar colors. Our proposed criterion, picted in Fig. 5 contains seven cameras that actively monitor BICN, finds six clusters only 14.7% of the time, while the a common scene of interest from different view points. There other cluster enumeration criteria are unable to find the correct are six cars that enter and leave the scene of interest at number of clusters (cars). BICN finds five clusters majority of different time frames. The video captured by each camera in the time. This is very reasonable due to the color similarity the network is 18 seconds long and 550 frames are captured of the two cars, which results in the merging of their clusters. by each camera. Our objective is to estimate the total number The original BIC, BICO, also finds five clusters majority of the of cars observed by the camera network. This multi-object time. But, it also tends to underestimate the number of clusters multi-camera network example is a challenging scenario for even more by detecting only four clusters. Hence, our proposed 10

cluster enumeration criterion outperforms existing BIC-based methods in terms of MAE, as shown in Table V, which summarizes the performance of different Bayesian cluster enumeration criteria on the real data sets.

Fig. 4: The data-fidelity term (a), penalty term (b), and the BIC (c) of the proposed criterion, BICN, and BICO for the Iris data set. (Curves plotted against the number of clusters specified by the candidate models.)

Fig. 5: A wireless camera network continuously observing a common scene of interest. The top image depicts a camera network with 7 spatially distributed cameras (c1, ..., c7) that actively monitor the scene from different viewpoints. The bottom left and right images show a frame captured by cameras 1 and 7, respectively, at the same time instant.

Fig. 6: Empirical probability of selection of our proposed criterion, BICN, and existing Bayesian cluster enumeration criteria for the multi-object multi-camera network application. (Plotted against the number of clusters specified by the candidate models.)

3) Seeds Data Set: The Seeds data set is a 7-dimensional data set which contains measurements of geometric properties of kernels belonging to three different varieties of wheat, where each variety is represented by 70 instances [58]. As can be seen in Fig. 7, the proposed criterion, BICN, and BICO are able to estimate the correct number of clusters 100% of the time, while BICOS overestimates the number of clusters to K̂BICOS = 6.

In cases where either the maximum found from the BIC curve is very near to the maximum number of clusters specified

TABLE V: Comparison of cluster enumeration performance closed-form BIC expression. Further, we have presented a two- of different Bayesian criteria for the real data sets. The step cluster enumeration algorithm. The proposed criterion performance metrics are the empirical probability of detection contains information about the structure of the data in both in %, the empirical probability of underestimation in %, and its data-fidelity and penalty terms because it is derived by the Mean Absolute Error (MAE). taking the cluster analysis problem into account. Simulation Iris Cars Seeds results indicate that the penalty term of the proposed criterion has a curvature point at the true number of clusters which is BICN 98.8 14.7 100 pdet(%) created due to the change in the trend of the curve at that point. BICO 0 0 100 BICOS 0 0 0 Hence, the penalty term of the proposed criterion can contain

BICN 0 85 0 information about the true number of clusters in an observed punder(%) BICO 100 100 0 data set. In contrast, the penalty terms of the existing BIC- BICOS 0 0 0 based cluster enumeration methods are uninformative on their 0.024 0.853 0 BICN own. In a forthcoming paper, we will alleviate the Gaussian MAE 0 BICO 1 1.012 assumption and consider robust [61], [62] cluster enumeration BIC 2.674 6 3 OS in the presence of outliers. by the candidate models or no clear maximum can be found, APPENDIX A different post-processing steps that attempt to find a significant PROOF OF THEOREM 1 curvature in the BIC curve have been proposed in the literature. To obtain an asymptotic approximation of the FIM Jˆm, we One such method is the knee point detection strategy [32], first express the log-likelihood of the data points that belong [33]. For the Seeds data set, applying the knee point detection to the mth cluster as follows: method to the BIC curve generated by BICOS allows for the correct estimation of the number of clusters 100% of the time. log L(θm|Xm) = log p(xn ∈ Xm)f(xn|θm)

Once the number of clusters in an observed data set is xn∈Xm Y estimated, the next step is to analyze the overall classification Nm 1 = log r 1 performance of the proposed two-step approach. An evaluation N 2 Σ 2 x (2π) | m| of the cluster enumeration and classification performance of nX∈Xm  1 the proposed algorithm using radar data of human gait can be × exp − x˜⊤Σ−1x˜ 2 n m n found in [60].   N r 1 = log m − log 2π − log |Σ | N 2 2 m x 100 nX∈Xm  BIC N 1 ⊤ −1 50 − Tr x˜ Σ x˜ 2 n m n xn∈Xm ! 0 X 1 2 3 4 5 6 Nm rNm Nm = Nm log − log 2π − log |Σm| 100 N 2 2 BIC O 1 −1 ⊤ 50 − Tr Σ x˜ x˜ 2 m n n xn∈Xm ! 0 X 1 2 3 4 5 6 Nm rNm Nm = Nm log − log 2π − log |Σm| 100 N 2 2 BIC 1 OS − Tr Σ−1∆ , (38) 50 2 m m Empirical probability of selection in %  ⊤ 0 where x˜n , xn − µm and ∆m , x˜nx˜ . Here, we 1 2 3 4 5 6 xn∈Xm n have assumed that Number of clusters specified by the candidate models P (A.6) the covariance matrix Σ ,m = 1, . . . , l, is positive Fig. 7: Empirical probability of selection of our proposed m definite. criterion, BICN, and existing Bayesian cluster enumeration criteria for the Seeds data set. Then, we take the first- and second-order derivatives of log L(θm|Xm) with respect to the elements of θm = ⊤ [µm, Σm] . To make the paper self contained, we have included the vector and matrix differentiation rules used in VIII. CONCLUSION Eqs. (39)-(41), and (50) in Appendix B (see [63] for de- We have derived a general expression of the BIC for cluster tails). A generic expression of the first-order derivative of analysis which is applicable to a broad class of data distri- log L(θm|Xm) with respect to θm is given by butions. By imposing the multivariate Gaussian assumption d log L(θ |X ) N dΣ m m = − m Tr Σ−1 m on the distribution of the observations, we have provided a dθ 2 m dθ m  m  12

Σ 1 Σ−1 d m Σ−1∆ Eq. (41) can be further simplified into + Tr m m m 2 dθm   d2 log L(θ |X ) N du ⊤ du dµ⊤ m m = m m D⊤V D m Σ−1 m ˇ ˇ⊤ m ⊤ + Tr m x˜n dθmdθm 2 dum dum dθm   xn∈Xm ! ⊤ X dum dum dΣ − D⊤W D 1 m Σ−1 Σ−1 ⊤ m = Tr m Em m dum dum 2 dθm     du ⊤ dµ⊤ dµ⊤ − N m D⊤Z vec m + N m Σ−1(x¯ − µ ), (39) m du m dµ⊤ m dθ m m m  m   m  m ⊤ dµ dµm 1 N m Σ−1 , (45) where x¯m , xn is the sample mean of the data − m m ⊤ Nm xn∈Xm dµ dµ , ∆ Σ m m points that belong to the mth cluster and Em m−Nm m. 2 1 P ⊤ where D ∈ Rr × 2 r(r+1) denotes the duplication matrix. For Differentiating Eq. (39) with respect to θm results in the symmetric matrix Σ , the duplication matrix transforms 2 Σ Σ−1 m d log L(θm|Xm) 1 d m d m −1 Σ Σ = Tr E Σ um into vec( m) using the relation vec( m) = Dum [63, dθ dθ⊤ 2 dθ dθ⊤ m m m m  m m  pp. 56–57]. 1 dΣ dΣ−1 A compact matrix representation of the second-order derivative m Σ−1E m + Tr m m ⊤ 2 dθ dθ of log L(θm|Xm) is given by  m m  1 dΣm dEm 2 2 Σ−1 Σ−1 ∂ log L(θm|Xm) ∂ log L(θm|Xm) + Tr 2 ⊤ ⊤ m ⊤ m d log L(θm|Xm) ∂µm∂µ ∂µm∂u 2 dθm dθm = 2 m 2 m .   ∂ log L(θm|Xm) ∂ log L(θm|Xm) ⊤ −1 dθˇ dθˇ⊤   dµ dΣ m m ∂u ∂µ⊤ ∂u ∂u⊤ N m m x µ m m m m + m ⊤ (¯m − m) (46) dθm dθm   ⊤ The individual elements of the above matrix can be written as dµm −1 dµm − N Σ 2 m m ⊤ ∂ log L(θm|Xm) dθm dθm N Σ−1 (47) ⊤ = − m m Σ Σ ∂µm∂µ Nm d m −1 d m −1 m = Tr Σ Σ 2 m ⊤ m ∂ log L(θm|Xm) 2 dθm dθm N Z⊤D (48)   ⊤ = − m m Σ Σ ∂µm∂u d m −1 d m −1 −1 m − Tr Σ Σ ∆mΣ 2 m ⊤ m m ∂ log L(θm|Xm) dθm dθm N D⊤Z (49)   ⊤ = − m m Σ ⊤ ∂um∂µ d m −1 dµm −1 m −N Tr Σ (x¯ −µ ) Σ 2 m m m m ⊤ m ∂ log L(θm|Xm) Nm dθm dθm D⊤F D, (50)   ⊤ = m dµ⊤ dµ ∂um∂um 2 N m Σ−1 m . (40) − m m ⊤ 2 2 dθm dθm , Σ−1 Σ−1 2 Σ−1∆ Σ−1 Rr ×r where Fm m ⊗ m − N m m m ∈ . Σ m Next, we exploit the symmetry of the covariance matrix m The FIM of the mth cluster can be written as  to come up with a final expression for the second-order 2 2 ∂ log L(θˆm|Xm) ∂ log L(θˆm|Xm) Σ − ⊤ − ⊤ derivative of log L(θm|Xm). The unique elements of m can ˆ ∂µm∂µm ∂µm∂um 1 Jm = 2 ˆ 2 ˆ be collected into a vector u ∈ R 2 r(r+1)×1 as defined in  ∂ log L(θm|Xm) ∂ log L(θm|Xm)  m − ∂u ∂µ⊤ − ∂u ∂u⊤ [63, pp. 56–57]. Hence, incorporating the symmetry of the m m m m  ˆ −1 ˆ⊤  Σ NmΣ NmZ D covariance matrix m and replacing the parameter vector θm = m m . (51) ˇ ⊤ N D⊤Zˆ − Nm D⊤Fˆ D by θm = [µm, um] in Eq. (40) results in the following  m m 2 m  expression: The maximum likelihood of the mean and covari- 2 ⊤ ance matrix of the mth Gaussian cluster are given by d log L(θm|Xm) Nm dΣm dΣm = vec Vmvec dθˇ dθˇ⊤ 2 dθˇ dθˇ⊤ 1 m m  m   m  µˆm = xn = x¯m (52) ⊤ Nm dΣ dΣ x∈Xm − vec m W vec m X ⊤ m 1 dθˇ dθˇm Σˆ ⊤  m    m = (xn − µˆm) (xn − µˆm) . (53) ⊤ ⊤ Nm dΣ dµ xn∈Xm − N vec m Z vec m X m ˇ m ˇ⊤ dθm dθ ˆ , Σˆ −1 Σˆ −1 0 2    m  Hence, Zm m (x¯m − µˆm) ⊗ m = r ×r. Conse- ⊤ dµ dµm quently, Eq. (51) can be further simplified to − N m Σ−1 , (41) m ˇ m ˇ⊤ dθm dθm −1 NmΣˆ 0 1 Jˆ m r× 2 r(r+1) . (54) where m = Nm ⊤ 0 1 − D Fˆ D 2 r(r+1)×r m 2 2 " 2 # V , Σ−1 Σ−1 Rr ×r (42) m m ⊗ m ∈ ˆ 2 2 The determinant of the FIM, Jm, can be written as W , Σ−1 ⊗ Σ−1∆ Σ−1 ∈ Rr ×r (43) m m m m m N 2 ˆ ˆ −1 m ⊤ ˆ , Σ−1 Σ−1 Rr ×r Jm = NmΣ × − D FmD . (55) Zm m (x¯m − µm) ⊗ m ∈ . (44) m 2

13

As N → ∞, Nm → ∞ given that l<

[11] C. R. Rao and Y. Wu, “A strongly consistent procedure for model [39] A. Dasgupta and A. E. Raftery, “Detecting features in spatial point selection in a regression problem,” Biometrika, vol. 76, no. 2, pp. 369– processes with clutter via model-based clustering,” J. Am. Stat. Assoc., 74, 1989. vol. 93, no. 441, pp. 294–302, Mar. 1998. [12] L. Breiman, “The little bootstrap and other methods for dimensionality [40] J. G. Campbell, C. Fraley, F. Murtagh, and A. E. Raftery, “Linear selection in regression: X-fixed prediction error,” J. Am. Stat. Assoc, flaw detection in woven textiles using model-based clustering,” Pattern vol. 87, no. 419, pp. 738–754, Sept. 1992. Recognit. Lett., vol. 18, pp. 1539–1548, Aug. 1997. [13] R. E. Kass and A. E. Raftery, “Bayes factors,” J. Am. Stat. Assoc., [41] S. Mukherjee, E. D. Feigelson, G. J. Babu, F. Murtagh, C. Fraley, and vol. 90, no. 430, pp. 773–795, June 1995. A. Raftery, “Three types of Gamma-ray bursts,” Astrophysical J., vol. [14] J. Shao, “Bootstrap model selection,” J. Am. Stat. Assoc., vol. 91, no. 508, pp. 314–327, Nov. 1998. 434, pp. 655–665, June 1996. [42] W. J. Krzanowski and Y. T. Lai, “A criterion for determining the number [15] P. M. Djuri´c, “Asymptotic MAP criteria for model selection,” IEEE of groups in a data set using sum-of-squares clustering,” Biometrics, Trans. Signal Process., vol. 46, no. 10, pp. 2726–2735, Oct. 1998. vol. 44, no. 1, pp. 23–34, Mar. 1988. [16] J. E. Cavanaugh and A. A. Neath, “Generalizing the derivation of the [43] R. Tibshirani, G. Walther, and T. Hastie, “Estimating the number of Schwarz information criterion,” Commun. Statist.-Theory Meth., vol. 28, clusters in a dataset via the gap ,” J. R. Statist. Soc. B, vol. 63, no. 1, pp. 49–66, 1999. pp. 411–423, 2001. [17] A. M. Zoubir, “Bootstrap methods for model selection,” Int. J. Electron. [44] F. K. Teklehaymanot, M. Muma, J. Liu, and A. M. Zoubir, “In-network Commun., vol. 53, no. 6, pp. 386–392, 1999. adaptive cluster enumeration for distributed classification/labeling,” in [18] A. M. Zoubir and D. R. Iskander, “Bootstrap modeling of a class of Proc. 24th Eur. Signal Process. Conf. (EUSIPCO), Budapest, Hungary, nonstationary signals,” IEEE Trans. Signal Process., vol. 48, no. 2, pp. 2016, pp. 448–452. 399–408, Feb. 2000. [45] P. Binder, M. Muma, and A. M. Zoubir, “Gravitational clustering: a [19] R. F. Brcich, A. M. Zoubir, and P. Pelin, “Detection of sources using simple, robust and adaptive approach for distributed networks,” Signal bootstrap techniques,” IEEE Trans. Signal Process., vol. 50, no. 2, pp. Process., vol. 149, pp. 36–48, Aug. 2018. 206–215, Feb. 2002. [46] P. Stoica and Y. Selen, “Model-order selection: a review of information [20] M. R. Morelande and A. M. Zoubir, “Model selection of random criterion rules,” IEEE Signal Process. Mag., vol. 21, no. 4, pp. 36–47, amplitude polynomial phase signals,” IEEE Trans. Signal Process., July 2004. vol. 50, no. 3, pp. 578–589, Mar. 2002. [47] R. Xu and D. Wunsch, “Survey of clustering algorithms,” IEEE Trans. [21] D. J. Spiegelhalter, N. G. Best, B. P. Carlin, and A. van der Linde, Neural Netw., vol. 16, no. 3, pp. 645–678, May 2005. “Bayesian measures of model complexity and fit,” J. R. Statist. Soc. B, [48] T. Ando, Bayesian Model Selection and Statistical Modeling, ser. Statis- vol. 64, no. 4, pp. 583–639, 2002. tics: Textbooks and Monographs. Florida, USA: Taylor and Francis [22] G. Claeskens and N. L. Hjort, “The focused information criterion,” J. Group, LLC, 2010. Am. Stat. Assoc., vol. 98, no. 
464, pp. 900–916, Dec. 2003. [49] C. M. Bishop, Pattern Recognition and Machine Learning. New York, USA: Springer Science+Business Media, LLC, 2006. [23] Z. Lu and A. M. Zoubir, “Generalized Bayesian information criterion for source enumeration in array processing,” IEEE Trans. Signal Process., [50] J. Bl¨omer and K. Bujna, “Adaptive seeding for Gaussian mixture Proc. 20th Pacific Asia Conf. Adv. Knowl. Discovery and vol. 61, no. 6, pp. 1470–1480, Mar. 2013. models,” in Data Mining (PAKDD), vol. 9652, Auckland, New Zealand, 2016, pp. [24] ——, “Flexible detection criterion for source enumeration in array 296–308. processing,” IEEE Trans. Signal Process., vol. 61, no. 6, pp. 1303–1314, [51] D. Arthur and S. Vassilvitskii, “K-means++: the advantages of careful Mar. 2013. seeding,” in Proc. 18th Annu. ACM-SIAM Symp. Discrete Algorithms, [25] ——, “Source enumeration in array processing using a two-step test,” New Orleans, USA, 2007, pp. 1027–1035. IEEE Trans. Signal Process., vol. 63, no. 10, pp. 2718–2727, May 2015. [52] P. Fr¨anti, “Efficiency of random swap clustering,” J. Big Data, vol. 5, [26] C. R. Rao and Y. Wu, “On model selection,” IMS Lecture Notes - no. 13, 2018. Monograph Series, pp. 1–57, 2001. [53] Q. Zhao, V. Hautam¨aki, I. K¨arkk¨ainen, and P. Fr¨anti, “Random swap [27] A. Kalogeratos and A. Likas, “Dip-means: an incremental clustering EM algorithm for Gaussian mixture models,” Pattern Recognit. Lett., method for estimating the number of clusters,” in Proc. Adv. Neural Inf. vol. 33, pp. 2120–2126, 2012. Process. Syst. 25, 2012, pp. 2402–2410. [54] P. Fr¨anti and O. Virmajoki, “Iterative shrinking method for clustering [28] G. Hamerly and E. Charles, “Learning the K in K-Means,” in Proc. 16th problems,” Pattern Recognit., vol. 39, no. 5, pp. 761–765, 2006. Int. Conf. Neural Inf. Process. Syst. (NIPS), Whistler, Canada, 2003, pp. [Online]. Available: http://dx.doi.org/10.1016/j.patcog.2005.09.012 281–288. [55] I. K¨arkk¨ainen and P. Fr¨anti, “Dynamic local search algorithm for the [29] D. Pelleg and A. Moore, “X-means: extending K-means with efficient clustering problem,” Department of Computer Science, University of estimation of the number of clusters,” in Proc. 17th Int. Conf. Mach. Joensuu, Joensuu, Finland, Tech. Rep. A-2002-6, 2002. Learn. (ICML), 2000, pp. 727–734. [56] P. Fr¨anti, R. Mariescu-Istodor, and C. Zhong, “Xnn graph,” Joint Int. [30] M. Shahbaba and S. Beheshti, “Improving X-means clustering with Workshop on Structural, Syntactic, and Statist. Pattern Recognit., vol. MNDL,” in Proc. 11th Int. Conf. Inf. Sci., Signal Process. and Appl. LNCS 10029, pp. 207–217, 2016. (ISSPA), Montreal, Canada, 2012, pp. 1298–1302. [57] R. A. Fisher, “The use of multiple measurements in taxonomic prob- [31] T. Ishioka, “An expansion of X-means for automatically determining the lems,” Ann. Eugenics, vol. 7, pp. 179–188, 1936. optimal number of clusters,” in Proc. 4th IASTED Int. Conf. Comput. [58] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Intell., Calgary, Canada, 2005, pp. 91–96. Available: http://archive.ics.uci.edu/ml [32] Q. Zhao, V. Hautamaki, and P. Fr¨anti, “Knee point detection in BIC for [59] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-Up Robust detecting the number of clusters,” in Proc. 10th Int. Conf. Adv. Concepts Features (SURF),” Comput. Vis. Image Underst., vol. 110, pp. 346–359, Intell. Vis. Syst. (ACIVS), Juan-les-Pins, France, 2008, pp. 664–673. June 2008. [33] Q. Zhao, M. Xu, and P. 
Fr¨anti, “Knee point detection on Bayesian [60] F. K. Teklehaymanot, A.-K. Seifert, M. Muma, M. G. Amin, and A. M. information criterion,” in Proc. 20th IEEE Int. Conf. Tools with Artificial Zoubir, “Bayesian target enumeration and labeling using radar data of Intell., Dayton, USA, 2008, pp. 431–438. human gait,” in 26th Eur. Signal Process. Conf. (EUSIPCO) (accepted), [34] Y. Feng and G. Hamerly, “PG-means: learning the number of clusters 2018. in data,” in Proc. Conf. Adv. Neural Inf. Process. Syst. 19 (NIPS), 2006, [61] A. M. Zoubir, V. Koivunen, Y. Chakhchoukh, and M. Muma, “Robust pp. 393–400. estimation in signal processing,” IEEE Signal Process. Mag., vol. 29, [35] C. Constantinopoulos, M. K. Titsias, and A. Likas, “Bayesian feature no. 4, pp. 61–80, July 2012. and model selection for Gaussian mixture models,” IEEE Trans. Pattern [62] A. M. Zoubir, V. Koivunen, E. Ollila, and M. Muma, Robust Anal. Mach. Intell., vol. 28, no. 6, June 2006. for Signal Processing. Cambridge University Press, 2018. [36] T. Huang, H. Peng, and K. Zhang, “Model selection for Gaussian [63] J. R. Magnus and H. Neudecker, Matrix Differential Calculus with mixture models,” Statistica Sinica, vol. 27, no. 1, pp. 147–169, 2017. Applications in Statistics and (3 ed.), ser. Wiley Series [37] C. Fraley and A. Raftery, “How many clusters? Which clustering in Probability and Statistics. Chichester, England: John Wiley & Sons method? Answers via model-based cluster analysis,” Comput. J., vol. 41, Ltd, 2007. no. 8, 1998. [38] A. Mehrjou, R. Hosseini, and B. N. Araabi, “Improved Bayesian information criterion for mixture model selection,” Pattern Recognit. Lett., vol. 69, pp. 22–27, Jan. 2016.