
IEICE TRANS. INF. & SYST., VOL.E102-D, NO.6 JUNE 2019

PAPER    Direct Log-Density Gradient Estimation with Gaussian Mixture Models and Its Application to Clustering

Qi ZHANG†a), Hiroaki SASAKI††b), Nonmembers, and Kazushi IKEDA††c), Senior Member

SUMMARY    Estimation of the gradient of the logarithm of a probability density is a versatile tool in statistical data analysis. A recent method for mode-seeking clustering called the least-squares log-density gradient clustering (LSLDGC) [Sasaki et al., 2014] employs a sophisticated gradient estimator, which directly estimates the log-density gradient without going through density estimation. However, the typical implementation of LSLDGC is based on a spherical Gaussian function, which may not work well when the probability density function for data has highly correlated local structures. To cope with this problem, we propose a new estimator for log-density gradients with Gaussian mixture models (GMMs). Covariance matrices in GMMs enable the new estimator to capture the highly correlated structures. Through the application of the new gradient estimator to mode-seeking clustering and hierarchical clustering, we experimentally demonstrate the usefulness of our clustering methods over existing methods.

key words: probability density gradient, mixture model, clustering, hierarchical clustering

Manuscript received October 20, 2018. Manuscript revised February 14, 2019. Manuscript publicized March 22, 2019.
† The author is with the Graduate University for Advanced Studies, Tachikawa-shi, 190-0014 Japan.
†† The authors are with Nara Institute of Science and Technology, Ikoma-shi, 630-0192 Japan.
a) E-mail: [email protected]
b) E-mail: [email protected]
c) E-mail: [email protected]
DOI: 10.1587/transinf.2018EDP7354

1. Introduction

Estimating the gradient of the logarithm of the probability density function underlying data is an important problem. For instance, an unsupervised dimensionality reduction method requires estimating the log-density gradient [1]. In a supervised learning problem, estimating the log-density gradient enables us to capture multiple functional relationships between output and input variables [2]. Other statistical topics can be found in [3]. Thus, log-density gradient estimation offers a wide range of applications in statistical data analysis.

Among them, an interesting application is mode-seeking clustering. Mean shift clustering (MS) [4]-[6] iteratively updates data samples toward the modes (i.e., local maxima) of the estimated probability density function by gradient ascent, and then assigns a cluster label to the data samples that converged to the same mode. Compared with k-means and mixture-model-based clustering [7], MS has two notable points: MS does not make strong assumptions on the probability density function, and the number of clusters in MS is automatically determined by the detected modes. Therefore, MS has been applied to a variety of problems such as image segmentation [6], [8], [9] and object tracking [10], [11] (see also a recent review article [12]). Furthermore, MS and its related methods have been extended to handle manifold data [13], [14] and to hierarchical clustering [15].

A naive approach is first to estimate the probability density function (e.g., by kernel density estimation), and then to compute its log-gradient. However, this approach can be unreliable because a good density estimator does not necessarily yield a good gradient estimator. To alleviate this problem, a recent mode-seeking clustering method called LSLDG clustering employs a sophisticated gradient estimator, which directly fits a gradient model to the true log-density gradient without going through density estimation [16]-[18]. LSLDG clustering has been experimentally demonstrated to significantly improve the performance of MS, particularly for high-dimensional data [17], [18]. However, the gradient estimator in LSLDG clustering is typically implemented with the spherical Gaussian kernel. Thus, when the probability density function includes clusters with highly correlated structures, LSLDG may require a huge number of samples to produce a smooth gradient estimate. This can be problematic particularly in mode-seeking clustering because an unsmooth estimate may create spurious modes, as we demonstrate later.

To improve the performance of LSLDG clustering on highly correlated structures, we propose to use a Gaussian mixture model (GMM) in log-density gradient estimation. Estimating the covariance matrices in the GMM makes the gradient estimator much more adaptive to the local structures in the probability density function. A challenge is to satisfy the positive semidefinite constraint on the covariance matrices. To overcome this challenge, we develop an estimation algorithm combined with manifold optimization [19].

Next, we apply the proposed estimator to mode-seeking clustering. To update data samples during mode-seeking, we derive an update formula based on the fixed-point method. A similar formula has been proposed in MS [20], but as shown later, our formula includes it as a special case. Furthermore, we extend the proposed clustering method to hierarchical clustering, which is a novel extension of LSLDG clustering. The usefulness of the proposed clustering methods is experimentally demonstrated.

This paper is organized as follows: Sect. 2 reviews an existing gradient estimator and MS. In Sect. 3, we propose a new estimator for log-density gradients with Gaussian mixture models. Section 4 develops a mode-seeking clustering method as an application of the proposed gradient estimator. Then, the developed clustering method is further extended to hierarchical clustering. The performance of the clustering methods is experimentally investigated in Sect. 5. Section 6 concludes this paper.

2. Background

This section reviews an existing estimator for log-density gradients and mode-seeking clustering methods.

2.1 Review of LSLDG

Here, we review a direct estimator for log-density gradients, which we refer to as the least-squares log-density gradients (LSLDG) [16], [17]. Suppose that n i.i.d. data samples drawn from a probability distribution with density p(x) are available as

\[ \mathcal{D} := \{ x_i = (x_i^{(1)}, \ldots, x_i^{(d)})^\top \}_{i=1}^{n} \sim p(x). \]

The goal of LSLDG is to estimate the gradient of the log-density from \(\mathcal{D}\):

\[ \nabla_x \log p(x) = \left( \frac{\partial_1 p(x)}{p(x)}, \ldots, \frac{\partial_d p(x)}{p(x)} \right)^\top, \]

where \(\partial_j := \frac{\partial}{\partial x^{(j)}}\) and \(\nabla_x\) denotes the differential operator with respect to x.

LSLDG directly fits a model \(g_j\) to the true partial derivative of the log-density under the squared loss:

\[
J_j(g_j) := \frac{1}{2} \int \{ g_j(x) - \partial_j \log p(x) \}^2 p(x)\,dx - C_j
= \frac{1}{2} \int \{ g_j(x) \}^2 p(x)\,dx - \int g_j(x)\, \partial_j p(x)\,dx
= \frac{1}{2} \int \{ g_j(x) \}^2 p(x)\,dx + \int \partial_j g_j(x)\, p(x)\,dx,
\]

where \(C_j := \frac{1}{2} \int \{ \partial_j \log p(x) \}^2 p(x)\,dx\), and we applied integration by parts to the second term on the right-hand side under the assumption that \(|g_j(x)p(x)| \to 0\) as \(|x^{(j)}| \to \infty\). Then, the empirical risk up to the ignorable constant \(C_j\) is obtained as

\[ \widehat{J}_j(g_j) := \frac{1}{2n} \sum_{i=1}^{n} g_j(x_i)^2 + \frac{1}{n} \sum_{i=1}^{n} \partial_j g_j(x_i). \quad (1) \]

To estimate \(g_j\), LSLDG employs the following model:

\[ g_j(x) := \sum_{i=1}^{b} \alpha_i^{(j)} \varphi_i^{(j)}(x), \quad (2) \]

where \(\alpha_i^{(j)}\) and \(\varphi_i^{(j)}\) are coefficients and basis functions, respectively, and b denotes the number of basis functions. In [17], the derivative of the Gaussian kernel is used as

\[ \varphi_i^{(j)}(x) := \partial_j \exp\left( -\frac{\| x - c_i \|^2}{2\sigma_j^2} \right), \quad (3) \]

where \(c_i := (c_i^{(1)}, \ldots, c_i^{(d)})^\top\) denotes a kernel center; the centers are fixed at a subset of data samples randomly chosen from \(\mathcal{D}\), and \(\sigma_j\) denotes the bandwidth parameter. After substituting the model (2) into \(\widehat{J}_j\), the optimal \(\alpha^{(j)} := (\alpha_1^{(j)}, \ldots, \alpha_b^{(j)})^\top\) can be computed analytically by solving the following problem:

\[
\widehat{\alpha}^{(j)} := \mathop{\mathrm{argmin}}_{\alpha^{(j)}} \widehat{J}_j(\alpha^{(j)}) + \lambda_j \| \alpha^{(j)} \|^2
= -(G_j + \lambda_j I_b)^{-1} h_j,
\]

where \(\lambda_j\) is the regularization parameter, \(I_b\) denotes the b-by-b identity matrix, and, with \(\varphi^{(j)}(x) := (\varphi_1^{(j)}(x), \ldots, \varphi_b^{(j)}(x))^\top\),

\[ G_j := \frac{1}{n} \sum_{i=1}^{n} \varphi^{(j)}(x_i) \varphi^{(j)}(x_i)^\top, \qquad h_j := \frac{1}{n} \sum_{i=1}^{n} \partial_j \varphi^{(j)}(x_i). \]

The gradient estimator is finally given by

\[ \widehat{g}_j(x) = \sum_{i=1}^{b} \widehat{\alpha}_i^{(j)} \varphi_i^{(j)}(x). \]

2.2 Review of Mode-Seeking Clustering

Mean shift clustering (MS) [4], [6] is a mode-seeking clustering method based on kernel density estimation (KDE). MS employs KDE to estimate the probability density function as follows:

\[ p_{\mathrm{KDE}}(x) := \frac{1}{n} \sum_{i=1}^{n} K\!\left( \frac{\| x - x_i \|^2}{2h^2} \right), \quad (4) \]

where h denotes the bandwidth parameter and K is a smooth kernel function for KDE (see [6, Eq. (3)] for the definition). Taking the gradient of \(p_{\mathrm{KDE}}(x)\) with respect to x yields

\[
\nabla_x p_{\mathrm{KDE}}(x) = \frac{1}{n h^2} \sum_{i=1}^{n} (x_i - x)\, H\!\left( \frac{\| x - x_i \|^2}{2h^2} \right)
= \frac{1}{n h^2} \left[ \sum_{i=1}^{n} x_i H\!\left( \frac{\| x - x_i \|^2}{2h^2} \right) - x \sum_{i=1}^{n} H\!\left( \frac{\| x - x_i \|^2}{2h^2} \right) \right],
\]

where \(H(t) := -\frac{d}{dt} K(t)\). Based on the fixed-point method, setting \(\nabla_x p_{\mathrm{KDE}}(x) = 0\) yields a simple update formula:

\[ x \leftarrow \frac{\sum_{i=1}^{n} x_i H\!\left( \frac{\| x - x_i \|^2}{2h^2} \right)}{\sum_{i=1}^{n} H\!\left( \frac{\| x - x_i \|^2}{2h^2} \right)}. \]

This update formula is iteratively applied to the data samples \(x_i\) until they converge to the modes of \(p_{\mathrm{KDE}}\). MS finally assigns the same cluster label to the data samples converging to the same mode.

LSLDG clustering [17], [18] is another mode-seeking clustering method. The main difference is that \(\nabla_x p_{\mathrm{KDE}}(x)\) is replaced with the LSLDG estimator \(\widehat{g}(x) = (\widehat{g}_1(x), \ldots, \widehat{g}_d(x))^\top\). An update formula is derived based on the fixed-point method in LSLDG clustering as well. It has been experimentally demonstrated that LSLDG clustering significantly outperforms MS for high-dimensional data.
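To make the review above concrete, the following is a minimal NumPy sketch, not the authors' implementation, of the LSLDG estimator with the Gaussian-derivative basis (2)-(3) and the analytic ridge solution. All names (e.g., lsldg_fit, centers) are ours, and a single shared bandwidth and regularization parameter are assumed for simplicity instead of per-dimension values.

```python
import numpy as np

def lsldg_fit(X, n_basis=100, sigma=1.0, lam=1e-3, seed=0):
    """Fit LSLDG coefficients alpha^(j) = -(G_j + lam I)^{-1} h_j for each dimension j.

    X: (n, d) data matrix. Returns (alpha, centers) where alpha has shape (d, b).
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    b = min(n_basis, n)
    centers = X[rng.choice(n, size=b, replace=False)]        # kernel centers c_i

    diff = X[:, None, :] - centers[None, :, :]                # x - c_i, shape (n, b, d)
    kern = np.exp(-np.sum(diff**2, axis=2) / (2.0 * sigma**2))  # Gaussian kernel, (n, b)

    alpha = np.zeros((d, b))
    for j in range(d):
        # Basis (3): phi_i^(j)(x) = d/dx^(j) exp(-||x - c_i||^2 / (2 sigma^2))
        phi = -(diff[:, :, j] / sigma**2) * kern
        # Its derivative with respect to x^(j), needed for h_j
        dphi = (-(1.0 / sigma**2) + (diff[:, :, j] / sigma**2)**2) * kern
        G = phi.T @ phi / n                                    # (b, b)
        h = dphi.mean(axis=0)                                  # (b,)
        alpha[j] = -np.linalg.solve(G + lam * np.eye(b), h)
    return alpha, centers

def lsldg_gradient(X_query, alpha, centers, sigma=1.0):
    """Evaluate the estimated log-density gradient ghat_j(x) at query points."""
    diff = X_query[:, None, :] - centers[None, :, :]
    kern = np.exp(-np.sum(diff**2, axis=2) / (2.0 * sigma**2))
    grads = np.stack([(-(diff[:, :, j] / sigma**2) * kern) @ alpha[j]
                      for j in range(alpha.shape[0])], axis=1)
    return grads                                               # (m, d)
```

In LSLDG clustering, an estimate of this kind plays the role that the KDE gradient plays in MS, as described above.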

3. LSLDG with Gaussian Mixture Models

To improve the performance of LSLDG for local correlation structures in the probability density function, instead of the spherical Gaussian kernel in (3), we use a Gaussian function with a mean parameter \(\mu_i\) and a precision matrix \(\Lambda_i\) (i.e., the inverse of a covariance matrix) as the basis function:

\[
\psi_i^{(j)}(x; \mu_i, \Lambda_i) := \partial_j \exp\left( -\frac{1}{2}(x - \mu_i)^\top \Lambda_i (x - \mu_i) \right)
= [\Lambda_i(\mu_i - x)]_j\, \phi_i(x; \mu_i, \Lambda_i), \quad (5)
\]

where \([x]_j\) denotes the j-th element of x and

\[ \phi_i(x; \mu_i, \Lambda_i) := \exp\left( -\frac{1}{2}(x - \mu_i)^\top \Lambda_i (x - \mu_i) \right). \]

Then, a gradient model based on a Gaussian mixture model† is given by

\[ g_j^{\mathrm{GM}}(x) := \sum_{i=1}^{b} \theta_i^{(j)} \psi_i^{(j)}(x; \mu_i, \Lambda_i), \quad (6) \]

where \(\theta_i^{(j)}\) denote coefficients. Substituting the Gaussian mixture model (6) into \(\widehat{J}_j\) yields the following optimization problem:

\[
\{\widehat{\theta}^{(j)}\}_{j=1}^{d}, \{\widehat{\mu}_i\}_{i=1}^{b}, \{\widehat{\Lambda}_i\}_{i=1}^{b}
:= \mathop{\mathrm{argmin}}_{\{\theta^{(j)}\}_{j=1}^{d}, \{\mu_i\}_{i=1}^{b}, \{\Lambda_i\}_{i=1}^{b}} \sum_{j=1}^{d} \widehat{J}_j(\theta^{(j)}, \{\mu_i\}_{i=1}^{b}, \{\Lambda_i\}_{i=1}^{b})
\quad \text{subject to } \Lambda_i = \Lambda_i^\top,\ \Lambda_i \succeq O \text{ for all } i, \quad (7)
\]

where \(\theta^{(j)} := (\theta_1^{(j)}, \ldots, \theta_b^{(j)})^\top\), and \(\Lambda_i \succeq O\) means that \(\Lambda_i\) is positive semidefinite.

We next describe our estimation algorithm. Our algorithm alternately repeats the following two steps until \(\sum_{j=1}^{d} \widehat{J}_j\) converges: (1) optimization of \(\{\theta^{(j)}\}_{j=1}^{d}\) and (2) optimization of \(\{\mu_i\}_{i=1}^{b}\) and \(\{\Lambda_i\}_{i=1}^{b}\). The details of the two steps are given as follows:

(1) Optimization of \(\{\theta^{(j)}\}_{j=1}^{d}\)

Given \(\{\mu_i\}_{i=1}^{b}\) and \(\{\Lambda_i\}_{i=1}^{b}\), the optimal \(\theta^{(j)}\) for all j can be computed analytically††:

\[ \widehat{\theta}^{(j)} = -G_j^{-1} h_j, \quad (8) \]

where, with \(\psi^{(j)}(x) := (\psi_1^{(j)}(x), \ldots, \psi_b^{(j)}(x))^\top\),

\[ G_j := \frac{1}{n} \sum_{i=1}^{n} \psi^{(j)}(x_i) \psi^{(j)}(x_i)^\top, \qquad h_j := \frac{1}{n} \sum_{i=1}^{n} \partial_j \psi^{(j)}(x_i). \]

(2) Optimization of \(\{\mu_i\}_{i=1}^{b}\) and \(\{\Lambda_i\}_{i=1}^{b}\)

Given \(\{\theta^{(j)}\}_{j=1}^{d}\), we optimize \(\{\mu_i\}_{i=1}^{b}\) and \(\{\Lambda_i\}_{i=1}^{b}\). Here, the challenge is that \(\Lambda_i\) must be symmetric positive semidefinite. To satisfy the constraint, we employ manifold optimization techniques [19]. In this step, gradient descent on a manifold is performed for a single step (iteration)†††. Regarding \(\mu_i\), standard gradient descent is applied, also for a single step.

After repeating the alternate optimization over the parameters, we finally obtain the estimator as

\[ \widehat{g}_j^{\mathrm{GM}}(x) := \sum_{i=1}^{b} \widehat{\theta}_i^{(j)} \psi_i^{(j)}(x; \widehat{\mu}_i, \widehat{\Lambda}_i). \quad (9) \]

We call this estimator the Gaussian mixture LSLDG (GM-LSLDG).

† This model is different from the standard Gaussian mixture model in statistics because the Gaussian functions in (6) are not normalized. Nonetheless, we call it a Gaussian mixture model to emphasize that the mean parameters \(\mu_i\) and precision matrices \(\Lambda_i\) are estimated from data samples.
†† The analytic solution can be easily derived by solving \(\nabla_{\theta^{(j)}} \widehat{J}_j(\theta^{(j)}, \{\mu_i\}_{i=1}^{b}, \{\Lambda_i\}_{i=1}^{b}) = 0\) with respect to \(\theta^{(j)}\).
††† We used a MATLAB function, sympositivedefinitefactory(), in the software called Manopt [21].
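As an illustration only, here is a minimal NumPy sketch of the coefficient step (8) of GM-LSLDG, assuming the means and precision matrices are given. The function names (gm_parts, solve_theta) and the small ridge term added for numerical stability are our own and not part of the paper; the manifold-optimization step (2) for the means and precision matrices is omitted.

```python
import numpy as np

def gm_parts(X, mus, Lams):
    """Building blocks of the basis (5): phi_i(x) and a_i(x) = Lambda_i (mu_i - x).

    X: (n, d) samples, mus: (b, d) means, Lams: (b, d, d) precision matrices.
    Returns phi with shape (n, b) and a with shape (n, b, d).
    """
    diff = mus[None, :, :] - X[:, None, :]                  # mu_i - x, shape (n, b, d)
    a = np.einsum('bjk,nbk->nbj', Lams, diff)               # Lambda_i (mu_i - x)
    quad = np.einsum('nbj,nbj->nb', diff, a)                 # (x - mu_i)^T Lambda_i (x - mu_i)
    phi = np.exp(-0.5 * quad)                                 # unnormalized Gaussian, (n, b)
    return phi, a

def solve_theta(X, mus, Lams, ridge=1e-8):
    """Analytic coefficient step (8): theta^(j) = -G_j^{-1} h_j for each dimension j.

    A tiny ridge term is added to G_j for numerical stability (our addition).
    """
    n, d = X.shape
    b = mus.shape[0]
    phi, a = gm_parts(X, mus, Lams)
    theta = np.zeros((d, b))
    for j in range(d):
        psi_j = a[:, :, j] * phi                              # psi_i^(j)(x) = [a_i(x)]_j phi_i(x)
        # d/dx^(j) psi_i^(j)(x) = (-Lambda_i[j, j] + [a_i(x)]_j ** 2) * phi_i(x)
        dpsi_j = (a[:, :, j] ** 2 - Lams[:, j, j][None, :]) * phi
        G = psi_j.T @ psi_j / n                               # (b, b)
        h = dpsi_j.mean(axis=0)                               # (b,)
        theta[j] = -np.linalg.solve(G + ridge * np.eye(b), h)
    return theta
```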
4. Application to Clustering

This section applies GM-LSLDG to mode-seeking clustering. Furthermore, the clustering method is extended to hierarchical clustering.

4.1 Mode-Seeking Clustering

Here, we derive an update formula of data samples to perform mode-seeking, as done in MS and LSLDG clustering. Let us denote the gradient vector estimated by GM-LSLDG as \(\widehat{g}^{\mathrm{GM}}(x) := (\widehat{g}_1^{\mathrm{GM}}(x), \ldots, \widehat{g}_d^{\mathrm{GM}}(x))^\top\). Then, by expanding the right-hand side of (9), \(\widehat{g}^{\mathrm{GM}}(x)\) can be expressed as

\[
\widehat{g}^{\mathrm{GM}}(x) = \sum_{i=1}^{b} \widehat{\Theta}_i \widehat{\Lambda}_i (\widehat{\mu}_i - x)\, \phi_i(x)
= \sum_{i=1}^{b} \widehat{\Theta}_i \widehat{\Lambda}_i \widehat{\mu}_i \phi_i(x) - \left\{ \sum_{i=1}^{b} \widehat{\Theta}_i \widehat{\Lambda}_i \phi_i(x) \right\} x, \quad (10)
\]

where \(\phi_i(x) := \phi_i(x; \widehat{\mu}_i, \widehat{\Lambda}_i)\) and \(\widehat{\Theta}_i\) denotes the d-by-d diagonal matrix with diagonals \(\widehat{\theta}_i^{(1)}, \ldots, \widehat{\theta}_i^{(d)}\) for a fixed i. Based on the fixed-point method, setting \(\widehat{g}^{\mathrm{GM}}(x) = 0\) yields the following update formula:

\[
x \leftarrow \left\{ \sum_{i=1}^{b} \widehat{\Theta}_i \widehat{\Lambda}_i \phi_i(x) \right\}^{-1} \left\{ \sum_{i=1}^{b} \widehat{\Theta}_i \widehat{\Lambda}_i \widehat{\mu}_i \phi_i(x) \right\}. \quad (11)
\]

A similar update formula has been derived in MS with a mixture model [20, Eq. (2)], but it is a special case of our formula where \(\Theta_i = \theta_i I_d\) with a single parameter \(\theta_i\), while \(\widehat{\Theta}_i\) in our method is a diagonal matrix whose diagonal elements can differ.

It is important to confirm whether the formula (11) updates data samples toward the modes. To this end, we show a sufficient condition under which the update formula (11) performs gradient ascent. By multiplying \(\{\sum_{i=1}^{b} \widehat{\Theta}_i \widehat{\Lambda}_i \phi_i(x)\}^{-1}\) on both sides of (10), (11) can be equivalently expressed as

\[
x \leftarrow x + \left\{ \sum_{i=1}^{b} \widehat{\Theta}_i \widehat{\Lambda}_i \phi_i(x) \right\}^{-1} \widehat{g}^{\mathrm{GM}}(x). \quad (12)
\]

From the definition of an ascent direction [22, Sect. 9.2], (12) shows that the update formula (11) performs gradient ascent whenever

\[
\widehat{g}^{\mathrm{GM}}(x)^\top \left\{ \sum_{i=1}^{b} \widehat{\Theta}_i \widehat{\Lambda}_i \phi_i(x) \right\}^{-1} \widehat{g}^{\mathrm{GM}}(x) > 0. \quad (13)
\]

Thus, we need to check (13), or that \(\sum_{i=1}^{b} \widehat{\Theta}_i \widehat{\Lambda}_i \phi_i(x)\) is a positive definite matrix, when the update formula (11) is used.

Based on this fact, our clustering algorithm consists of the following steps (a minimal code sketch of the mode-seeking update is given at the end of this subsection):

Step 1 Perform GM-LSLDG and obtain the optimal parameters \(\{\widehat{\theta}^{(j)}\}_{j=1}^{d}\), \(\{\widehat{\mu}_i\}_{i=1}^{b}\) and \(\{\widehat{\Lambda}_i\}_{i=1}^{b}\).
Step 2 Repeat the following mode-seeking step for each data sample \(x_i\) until convergence: if (13) is satisfied, \(x_i\) is updated by the fixed-point formula (11); otherwise, we perform standard gradient ascent as \(x_i \leftarrow x_i + \varepsilon\, \widehat{g}^{\mathrm{GM}}(x_i)\), where \(\varepsilon\) is a fixed step-size parameter.
Step 3 Assign the same cluster label to a set of data points whose converged points are close enough.

We empirically observed that most data samples monotonically converged to the modes without using standard gradient ascent. Thus, we conjecture that, as in MS [6, Theorem 1] and LSLDG clustering [18, Theorem 3], the update rule (11) has a monotonic hill-climbing property toward the modes. Proving this conjecture is left for future work.

Finally, we discuss whether our gradient estimator enables us to accurately find the modes of the true density, not of the estimated one. Following the analysis in [18], [23], Appendix A discusses that our gradient estimator with a sufficiently large number of Gaussians would suffice to accurately capture the modes for a wide range of density functions as the sample size goes to infinity. However, our experimental results indicate that our clustering method with a small number of Gaussians often works well. Filling this gap between theory and practice is also an interesting challenge, and one of our future works.
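The following is a minimal sketch, under our own naming and simplifications, of the mode-seeking step in Sect. 4.1: it evaluates \(\widehat{g}^{\mathrm{GM}}(x)\) via (10), applies the fixed-point update (11) when condition (13) holds, and otherwise falls back to a gradient-ascent step. It reuses the hypothetical gm_parts and solve_theta helpers from the previous sketch; the tolerance, step size, and iteration cap are arbitrary choices, not values from the paper.

```python
import numpy as np

def gm_gradient_and_matrix(x, theta, mus, Lams):
    """Return ghat^GM(x) and A(x) = sum_i Theta_i Lambda_i phi_i(x) from Eq. (10)."""
    phi, a = gm_parts(x[None, :], mus, Lams)            # phi: (1, b), a: (1, b, d)
    phi, a = phi[0], a[0]                                # (b,), (b, d)
    g = np.sum(phi[:, None] * theta.T * a, axis=0)       # sum_i Theta_i Lambda_i (mu_i - x) phi_i
    A = np.einsum('ji,ijk,i->jk', theta, Lams, phi)      # sum_i Theta_i Lambda_i phi_i(x)
    return g, A

def mode_seek(x0, theta, mus, Lams, step=0.1, tol=1e-6, max_iter=500):
    """Move one point toward a mode: fixed-point update (11) when condition (13) holds,
    otherwise a plain gradient-ascent step, as in Step 2 of the clustering algorithm."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, A = gm_gradient_and_matrix(x, theta, mus, Lams)
        if np.linalg.norm(g) < tol:                      # converged to a stationary point
            break
        use_fixed_point = False
        try:
            Ainv_g = np.linalg.solve(A, g)
            use_fixed_point = g @ Ainv_g > 0             # condition (13)
        except np.linalg.LinAlgError:
            pass                                          # A(x) singular: fall back to ascent
        # Form (12): x + A(x)^{-1} ghat^GM(x) is equivalent to the fixed-point update (11)
        x = x + Ainv_g if use_fixed_point else x + step * g
    return x
```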
4.2 Extension to Hierarchical Clustering

First, we briefly review an existing hierarchical clustering method based on mode-seeking, and then propose our hierarchical clustering method, which is a novel extension of LSLDG clustering.

4.2.1 Hierarchical Mode Association Clustering

A hierarchical clustering method called hierarchical mode association clustering (HMAC) has been proposed based on a mode-seeking clustering method [15]. When the kernel density estimator in (4) is employed, the main idea of HMAC is to perform mode-seeking clustering over a set of bandwidth parameters h: a smaller (or larger) value of the bandwidth parameter h produces a less (or more) smooth density estimate, which tends to have more (or fewer) modes (i.e., clusters). Thus, roughly speaking, a hierarchy of clusters can be built by organizing the clustering results obtained from the set of bandwidth parameters. As discussed in [15], HMAC is substantially different from linkage clustering [24, Sect. 14.3.12]: clusters in HMAC are merged globally based on the estimated density, while linkage clustering merges two clusters based on a local distance, which may result in skewed clusters.

We follow the approach of HMAC, but the extension is not straightforward because we need to prepare a set of precision matrices in the Gaussian mixture model to obtain a variety of cluster representations, which seems much more challenging than selecting a set of bandwidth parameters. Below, we describe how to select a set of precision matrices and then propose a practical algorithm for our hierarchical clustering method.

4.2.2 Selecting a Set of Precision Matrices

To systematically select a set of precision matrices, we optimize \(\{\Lambda_i\}_{i=1}^{b}\) twice from different initializations and store \(\{\Lambda_i\}_{i=1}^{b}\) at each iteration to build a hierarchy of clusters. The initialization of \(\{\Lambda_i\}_{i=1}^{b}\) is an important problem, and we initialize \(\Lambda_i = \eta I_d\) for all i, where \(\eta\) is some parameter. Intuitively, when \(\eta\) is a small (or large) value, the set of precision matrices obtained over the iterations of the optimization would produce from a large (or small) to a smaller (or larger) number of clusters. For this reason, we optimize \(\Lambda_i\) twice from different initializations with a small and a large value of \(\eta\), and unify the sets \(\{\Lambda_i\}_{i=1}^{b}\) obtained at each iteration.

However, the difficulty in this approach is that the optimization problem (7) is nonconvex. Thus, the final solutions from the two different initializations may not converge to the same point. To cope with this problem, we first perform GM-LSLDG and obtain \(\{\widehat{\theta}^{(j)}\}_{j=1}^{d}\), \(\{\widehat{\mu}_i\}_{i=1}^{b}\) and \(\{\widehat{\Lambda}_i\}_{i=1}^{b}\) in (7). Then, given these parameters, we solve the following optimization problem with an additional term:

\[
\min_{\{\Lambda_i\}_{i=1}^{b}} \sum_{j=1}^{d} \widehat{J}_j(\{\Lambda_i\}_{i=1}^{b}) + \beta \sum_{i=1}^{b} \| \Lambda_i - \widehat{\Lambda}_i \|_F^2, \quad (14)
\]

where \(\widehat{J}_j(\{\Lambda_i\}_{i=1}^{b}) := \widehat{J}_j(\widehat{\theta}^{(j)}, \{\widehat{\mu}_i\}_{i=1}^{b}, \{\Lambda_i\}_{i=1}^{b})\), \(\beta\) is a nonnegative parameter, and \(\|\cdot\|_F\) denotes the Frobenius norm.

The second term in (14) ensures that the final solution is close to \(\widehat{\Lambda}_i\) when \(\beta\) is chosen appropriately. Thus, we use a set of precision matrices obtained by solving (14) for hierarchical clustering. We choose \(\beta\) as follows: we begin with a small value of \(\beta\) and compute the Frobenius distance between the final solution and \(\widehat{\Lambda}_i\), as in the second term of (14). If this Frobenius distance is sufficiently small, then we take the sequence of precision matrices up to the final solution and use it for hierarchical clustering. Otherwise, we slightly increase the value of \(\beta\) and repeat the same procedure until the Frobenius distance becomes sufficiently small.

4.2.3 Practical Algorithm

Before going into the details of the practical algorithm, let us define some notation. We denote by \(M_l\) the set of mode points obtained at hierarchical level l. A label partition function of x at hierarchical level l assigns a cluster label to x and is defined by \(P_l(x) \in \{1, 2, \ldots, K_l\}\), where \(K_l\) denotes the number of clusters at level l.

Our algorithm takes the following steps (a minimal sketch of the level-wise labeling appears after this subsection):

Step 1 Perform GM-LSLDG and obtain the optimal parameters \(\{\widehat{\theta}^{(j)}\}_{j=1}^{d}\), \(\{\widehat{\mu}_i\}_{i=1}^{b}\) and \(\{\widehat{\Lambda}_i\}_{i=1}^{b}\).
Step 2 Given \(\{\widehat{\theta}^{(j)}\}_{j=1}^{d}\) and \(\{\widehat{\mu}_i\}_{i=1}^{b}\), solve (14) with a small value of \(\eta\) (e.g., \(\eta = 0.001\)) for the initialization \(\Lambda_i = \eta I_d\), and obtain a sequence of precision matrices \(\{\widetilde{\Lambda}_i^{(1)}, \widetilde{\Lambda}_i^{(2)}, \ldots, \widetilde{\Lambda}_i^{(T)}\}_{i=1}^{b}\), where \(\widetilde{\Lambda}_i^{(t)}\) denotes the precision matrix at the t-th iteration and T is the total number of iterations†.
Step 3 With a large value of \(\eta\) (e.g., \(\eta = 1000\)), solve (14) and obtain a sequence of precision matrices \(\{\check{\Lambda}_i^{(1)}, \check{\Lambda}_i^{(2)}, \ldots, \check{\Lambda}_i^{(T)}\}_{i=1}^{b}\).
Step 4 Unify the two sequences of precision matrices as \(\{\widetilde{\Lambda}_i^{(1)}, \ldots, \widetilde{\Lambda}_i^{(T)}, \widehat{\Lambda}_i, \check{\Lambda}_i^{(T)}, \ldots, \check{\Lambda}_i^{(1)}\}_{i=1}^{b}\), and re-express it as \(\{\widetilde{\Lambda}_i^{(1)}, \ldots, \widetilde{\Lambda}_i^{(L)}\}_{i=1}^{b}\), where \(L = 2T + 1\) and \(\widetilde{\Lambda}_i^{(L)} = \check{\Lambda}_i^{(1)}\). Then, repeatedly perform GM-LSLDGC with \(\{\widehat{\theta}^{(j)}\}_{j=1}^{d}\), \(\{\widehat{\mu}_i\}_{i=1}^{b}\) and \(\{\widetilde{\Lambda}_i^{(l)}\}_{i=1}^{b}\) from l = 1 to l = L as follows:
(a) Perform GM-LSLDGC with \(\{\widetilde{\Lambda}_i^{(l)}\}_{i=1}^{b}\) on the points in \(M_{l-1}\) and obtain \(M_l\), where \(M_0\) corresponds to the set of data samples.
(b) If \(P_{l-1}(x_i) = k\) and the k-th mode point in \(M_{l-1}\) converged to the k'-th element of \(M_l\) after mode-seeking, then set \(P_l(x_i) = k'\).

Step 4 essentially follows HMAC [15] and ensures that this hierarchical clustering algorithm produces a nested clustering representation, i.e., the following mapping holds: \(x_i \to M_1(x_i) \to M_2(M_1(x_i)) \to \cdots\), where \(M_l(x)\) maps x to a mode point in \(M_l\).

Compared with HMAC, the proposed method has an advantage in terms of the selection of precision matrices: the set of bandwidth parameters in HMAC is selected manually, while the set of precision matrices in our method is obtained through the optimization, and the only manually chosen parameter is \(\eta\). Therefore, it should be easier to select a set of precision matrices in our method. In addition, thanks to the precision matrices, our hierarchical clustering method would be more useful than HMAC when the data density includes highly correlated local structures.

† To promote the interpretability of the results, when T is large, we only use the precision matrices at the 0.7T-th, 0.8T-th and 0.9T-th iterations.
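Below is a minimal sketch, with our own hypothetical names (mode_seek_all, hierarchical_gm_lsldgc), of the level-wise part of Step 4: at each level, mode-seeking is run on the previous level's mode points and the labels are propagated, which yields the nested partition described above. It reuses the mode_seek helper from the earlier sketch, and matching converged points to modes by a distance threshold is our simplification.

```python
import numpy as np

def mode_seek_all(points, theta, mus, Lams, merge_tol=1e-2):
    """Run mode-seeking on every point and group the converged points into modes."""
    converged = np.array([mode_seek(p, theta, mus, Lams) for p in points])
    modes, labels = [], np.empty(len(points), dtype=int)
    for idx, c in enumerate(converged):
        for m_idx, m in enumerate(modes):
            if np.linalg.norm(c - m) < merge_tol:          # close enough: same mode
                labels[idx] = m_idx
                break
        else:
            modes.append(c)                                 # a new mode
            labels[idx] = len(modes) - 1
    return np.array(modes), labels

def hierarchical_gm_lsldgc(X, theta, mus, Lam_sequence):
    """Level-wise labeling of Step 4: mode-seek the previous level's modes and
    propagate cluster labels, producing a nested partition P_1, ..., P_L.

    Lam_sequence: list of length L, each entry an array of shape (b, d, d).
    """
    M_prev = np.asarray(X, dtype=float)                     # M_0: the data samples
    P_prev = np.arange(len(M_prev))                         # P_0: each sample is its own cluster
    partitions = []
    for Lams in Lam_sequence:                                # precision sets for l = 1, ..., L
        M_cur, to_cur = mode_seek_all(M_prev, theta, mus, Lams)   # step (a)
        P_cur = to_cur[P_prev]                               # step (b): follow mode k to k'
        partitions.append(P_cur)
        M_prev, P_prev = M_cur, P_cur
    return partitions
```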
5. Experiments

This section performs numerical experiments both for mode-seeking clustering and for hierarchical clustering to demonstrate how the proposed methods work and to compare them with existing methods.

5.1 Mode-Seeking Clustering

Here, we illustrate the performance of the proposed method for mode-seeking clustering on both artificial and benchmark datasets.

5.1.1 Artificial Data

(1) Clustering performance and computational efficiency

We first generated artificial data from the following mixture of three Gaussians:

\[ p_{\mathrm{data}}(x) = \sum_{k=1}^{3} \alpha_k N(x \mid \mu_k, C_k), \]

where \(N(x \mid \mu_k, C_k)\) denotes the Gaussian density with mean \(\mu_k\) and covariance matrix \(C_k\). The mixing coefficients were \(\alpha_1 = 0.3\), \(\alpha_2 = 0.4\), and \(\alpha_3 = 0.3\). After generating data samples, we applied the following three methods:

• GM-LSLDGC: The proposed mode-seeking clustering method based on GM-LSLDG. We performed ten-fold cross-validation for choosing b from the candidates \(\{2, 3, \ldots, 8, 9\}\) with respect to \(\sum_{j=1}^{d} \widehat{J}_j\). We initialized \(\Lambda_i\) for all i as a diagonal matrix whose diagonals were sampled from the uniform distribution on [0.1, 1.0]. For \(\mu_i\), the j-th element was initialized by the uniform distribution on \([\min_i x_i^{(j)}, \max_i x_i^{(j)}]\).
• LSLDGC [18]: Mode-seeking clustering based on LSLDG††. All the hyper-parameters were determined by five-fold cross-validation.
• MS [4]-[6]: The mean shift clustering method. We employed kernel density estimation with a bandwidth matrix H (a small sketch of this baseline appears after this list) as

\[ p_H(x) := \frac{1}{nZ} \sum_{i=1}^{n} \exp\left( -\frac{1}{2}(x - x_i)^\top H^{-1} (x - x_i) \right), \]

where \(Z = (2\pi)^{d/2} |H|^{1/2}\). Based on the well-known normal reference rule [25], H was fixed at \(4 C n^{-2/(d+6)} / (d+4)\) with the sample covariance matrix C, which is optimal in terms of the asymptotic mean integrated squared error for the density gradient [26, Eq. (4) with r = 1].

†† The software is available at https://sites.google.com/site/hworksites/home/software/lsldg.
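For concreteness, this is a small sketch, not the authors' code, of the KDE baseline above: the bandwidth matrix is set by the normal reference rule \(H = 4 C n^{-2/(d+6)} / (d+4)\) stated in the text, and \(p_H(x)\) is evaluated directly. The function names are hypothetical.

```python
import numpy as np

def normal_reference_bandwidth(X):
    """Bandwidth matrix H = 4 * C * n**(-2/(d+6)) / (d + 4) with sample covariance C."""
    n, d = X.shape
    C = np.cov(X, rowvar=False)
    return 4.0 * C * n ** (-2.0 / (d + 6)) / (d + 4)

def kde_with_bandwidth_matrix(x_query, X, H):
    """Evaluate p_H(x) = (1 / (n Z)) * sum_i exp(-0.5 (x - x_i)^T H^{-1} (x - x_i))."""
    n, d = X.shape
    Z = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(H))
    Hinv = np.linalg.inv(H)
    diff = x_query[None, :] - X                               # (n, d)
    quad = np.einsum('nd,de,ne->n', diff, Hinv, diff)
    return np.exp(-0.5 * quad).sum() / (n * Z)
```

A typical use would be H = normal_reference_bandwidth(X) followed by evaluating kde_with_bandwidth_matrix at the points being updated by mean shift.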

Fig. 3  Comparison in terms of CPU time in seconds against (left) data dimension with n = 200 and (right) sample size with d = 2. The markers and error bars denote the means and standard deviations over 30 runs, respectively. GM-LSLDGC (no CV) denotes GM-LSLDGC without cross-validation, where the number of Gaussians is fixed at 4.

Fig. 1  Examples of mode-seeking clustering results.

Fig. 2  Clustering performance on artificial data against (left) correlation and (right) data dimension. The markers and error bars denote the means and standard deviations over 30 runs, respectively.

We measured the clustering performance by the adjusted Rand index (ARI) [27]: ARI takes a value less than or equal to one, a larger value indicates a better clustering result, and when a clustering result is perfect, the ARI value equals one.

First, to investigate how GM-LSLDGC works for densities with highly correlated structures, we set \(\mu_k\) and \(C_k\) in \(p_{\mathrm{data}}\) as follows: \(\mu_1 = (1.5, -3.5)^\top\), \(\mu_2 = (0.0, 0.0)^\top\) and \(\mu_3 = (-1.5, 3.5)^\top\), and

\[
C_1 = \begin{pmatrix} 6 & c \\ c & 1.8 \end{pmatrix}, \qquad
C_2 = \begin{pmatrix} 6 & -c \\ -c & 1.5 \end{pmatrix}, \qquad
C_3 = \begin{pmatrix} 6 & c \\ c & 1.6 \end{pmatrix},
\]

where c is a non-negative parameter related to the strength of the correlations. Here, we generated n = 300 samples. Figure 1 illustrates that GM-LSLDGC well-captures the clusters. On the other hand, LSLDGC produces many clusters, which implies that it is not easy to obtain a smooth gradient estimate with the spherical Gaussian kernel. The left plot in Fig. 2 quantitatively shows that GM-LSLDGC and MS perform well on the highly correlated structures because these methods employ non-spherical Gaussian functions, while LSLDGC gets worse as the correlation parameter c becomes larger.

Next, we demonstrate the performance of GM-LSLDGC on higher-dimensional data. Here, we set \(C_k\) in \(p_{\mathrm{data}}\) as follows: the diagonals in all \(C_k\) are all one, and the (1,2)-th and (2,1)-th elements in \(C_1\), the (1,3)-th and (3,1)-th elements in \(C_2\), and the (2,3)-th and (3,2)-th elements in \(C_3\) are 0.75, while the other elements in all \(C_k\) are zeros. Regarding \(\mu_k\), we appended zeros to the mean parameters of the previous experiment. n = 200 samples were generated. The right plot in Fig. 2 shows the superior performance of GM-LSLDGC on higher-dimensional data. LSLDGC works fairly well, but the performance of MS rapidly decreases, as previously demonstrated in [17], [18]. A possible reason is that, combined with the direct fitting, our model (6) is more adaptive thanks to the precision matrices \(\Lambda_i\) as well as the coefficient parameters \(\theta_i^{(j)}\).

Finally, we investigate the computational efficiency of GM-LSLDGC. To this end, we performed the same experiment as in the right plot of Fig. 2. Figure 3 shows that the computationally most efficient method is mean shift, and LSLDGC is also fairly efficient. On the other hand, GM-LSLDGC is the most computationally demanding method. This is due to the fact that, unlike mean shift and LSLDGC, an iterative optimization procedure is adopted to estimate the Gaussian mixture model, together with cross-validation for the number of Gaussians in the mixture model. The computational cost of GM-LSLDGC can be alleviated by not performing cross-validation (Fig. 3). This approach is promising because, as will be shown in Fig. 4, the good clustering performance is retained as long as a large enough number of Gaussians is used.

In short, GM-LSLDGC can be advantageous when the data are high-dimensional and include highly correlated local structures, while the superior clustering performance comes at the price of an increased computational cost. As a remedy, the computational cost could be alleviated by fixing the number of Gaussians in advance.
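As an illustration of the first artificial-data setup above (the three-Gaussian mixture with correlation parameter c) and of the ARI evaluation, here is a short sketch. The value of c, the use of k-means as a stand-in clusterer, and scikit-learn's adjusted_rand_score are our own choices for the demo, not part of the paper's protocol.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def sample_three_gaussian_mixture(n=300, c=0.6, seed=0):
    """Draw n samples from p_data with the means/covariances of the first experiment."""
    rng = np.random.default_rng(seed)
    weights = np.array([0.3, 0.4, 0.3])
    means = [np.array([1.5, -3.5]), np.array([0.0, 0.0]), np.array([-1.5, 3.5])]
    covs = [np.array([[6.0, c], [c, 1.8]]),
            np.array([[6.0, -c], [-c, 1.5]]),
            np.array([[6.0, c], [c, 1.6]])]
    components = rng.choice(3, size=n, p=weights)          # true cluster labels
    X = np.array([rng.multivariate_normal(means[k], covs[k]) for k in components])
    return X, components

if __name__ == "__main__":
    X, y_true = sample_three_gaussian_mixture(n=300, c=0.75)
    # Stand-in clusterer for the demo; the paper compares GM-LSLDGC, LSLDGC and MS here.
    y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print("ARI:", adjusted_rand_score(y_true, y_pred))
```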

Fig. 5  (Left) A cluster-merging process and (right) the dendrogram. The levels in the dendrogram correspond to the cluster representations in the left plots.

Table 1  The average and standard deviation of ARI values over 20 runs. Numbers in parentheses are standard deviations. The best and comparable methods judged by the unpaired t-test at the significance level 5% are shown in boldface. k denotes the true number of clusters.

Dataset                                    GM-LSLDGC      LSLDGC         Mean shift     k-means
Electricity (d = 2, n = 400, k = 4)        0.493 (0.083)  0.367 (0.035)  0.379 (0.014)  0.365 (0.055)
Yeast (d = 8, n = 250, k = 5)              0.662 (0.029)  0.651 (0.044)  0.118 (0.003)  0.643 (0.041)
Anuran Calls (d = 18, n = 150, k = 3)      0.298 (0.072)  0.255 (0.048)  0.000 (0.000)  0.237 (0.029)
Sports articles (d = 59, n = 200, k = 2)   0.224 (0.068)  0.323 (0.116)  0.000 (0.000)  0.000 (0.000)

Fig. 4  Clustering performance against the number of Gaussians. The same two-dimensional data as in Fig. 1 was used. The markers and error bars denote the means and standard deviations over 30 runs, respectively.

(2) Influence of the number of Gaussians

Here, we show how the number of Gaussians affects the clustering performance of GM-LSLDGC. In this experiment, we fixed the number of Gaussians and performed GM-LSLDGC without cross-validation. The same two-dimensional data as in Fig. 1 was used. Figure 4 indicates that accurate clustering is possible as long as a large enough number of Gaussians is used. This would be because a larger number of Gaussians can potentially approximate a wider range of functions, and thus a better gradient estimate is obtained with an enough number of Gaussians. This property of GM-LSLDGC is in stark contrast with standard mixture-model-based clustering, where the number of clusters corresponds to the number of mixture components [7]. Therefore, with GM-LSLDGC it would be easier to determine the number of Gaussians.

5.1.2 Benchmark Data

Here, we investigate the clustering performance on benchmark datasets available at the UCI machine learning repository†. Since we used classification datasets, we regarded the class labels as the true cluster labels. In addition to the three methods in the previous experiments, we applied the k-means clustering method, where the number of clusters was fixed at the true cluster number. Before applying the clustering methods, data samples were standardized by subtracting the sample mean and normalizing with the sample variance.

Table 1 shows that GM-LSLDGC performs better than MS and k-means, and compares favorably with LSLDGC on the benchmark datasets.

5.2 Hierarchical Clustering

Finally, we perform the experiments for hierarchical clustering. To first visualize the cluster-merging process in our method, artificial data were generated as in Fig. 1. The left plot in Fig. 5 shows the cluster-merging process. Initially, all the data points are regarded as individual clusters, but as the hierarchy level becomes higher, the clusters are merged. This merging process is summarized in the dendrogram of Fig. 5.

Next, we compare our hierarchical clustering method with HMAC on artificial and benchmark datasets. As a performance measure for hierarchical clustering, we used the FScore [28], which takes a value between zero and one and whose larger value means better hierarchical clustering. For HMAC, we downloaded the software at http://personal.psu.edu/jol2/hmac/ and used it with the default settings.

Table 2 shows FScores on artificial and benchmark datasets downloaded from the UCI machine learning repository. Our hierarchical clustering method obtains higher FScore values than HMAC. This could be due to the fact that our method employs a set of precision matrices, while HMAC uses a set of bandwidth parameters. Thus, our method could better capture locally correlated structures than HMAC.

† https://archive.ics.uci.edu/ml/index.php

Table 2  The average and standard deviation of FScores over 20 runs. Numbers in parentheses are standard deviations. The best and comparable methods judged by the unpaired t-test at the significance level 5% are shown in boldface. h-GM-LSLDGC indicates the proposed hierarchical clustering method. k denotes the true number of clusters.

Dataset                                   h-GM-LSLDGC    HMAC
Artificial Data (d = 2, n = 350, k = 3)   0.889 (0.163)  0.651 (0.217)
Electricity (d = 2, n = 400, k = 4)       0.467 (0.294)  0.230 (0.091)
Iris (d = 4, n = 90, k = 3)               0.595 (0.174)  0.547 (0.036)

6. Conclusion

In this paper, we proposed a new estimator of the gradient of a logarithmic probability density function. In contrast with an existing estimator, we proposed to use Gaussian mixture models, which enable the new estimator to flexibly capture highly correlated structures in the probability density function. We applied the proposed estimator to mode-seeking clustering, and developed a fixed-point algorithm for mode-seeking. Our experimental results showed that our mode-seeking clustering method is advantageous particularly when the dimensionality of the data is relatively high and the clusters have highly correlated structures, although the good performance comes at the cost of making our clustering method computationally more demanding. A promising remedy is to use a large enough number of Gaussians without performing cross-validation, which could decrease the computational cost without sacrificing the clustering performance. Finally, we extended our clustering method to hierarchical clustering and demonstrated its usefulness.

This paper focused only on mode-seeking clustering and hierarchical clustering as applications of the new gradient estimator. However, the proposed estimator is potentially applicable to other problems such as unsupervised dimensionality reduction [1], modal regression [2] and so on. In the future, we will explore new applications of the proposed estimator.

References

[1] H. Sasaki, G. Niu, and M. Sugiyama, "Non-Gaussian component analysis with log-density gradient estimation," Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), pp.1177-1185, 2016.
[2] J. Einbeck and G. Tutz, "Modelling beyond regression functions: an application of multimodal regression to speed-flow data," Journal of the Royal Statistical Society: Series C (Applied Statistics), vol.55, no.4, pp.461-475, 2006.
[3] R. Singh, "Applications of estimators of a density and its derivatives to certain statistical problems," Journal of the Royal Statistical Society: Series B, vol.39, no.3, pp.357-363, 1977.
[4] K. Fukunaga and L. Hostetler, "The estimation of the gradient of a density function, with applications in pattern recognition," IEEE Transactions on Information Theory, vol.21, no.1, pp.32-40, 1975.
[5] Y. Cheng, "Mean shift, mode seeking, and clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.17, no.8, pp.790-799, 1995.
[6] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.24, no.5, pp.603-619, 2002.
[7] V. Melnykov and R. Maitra, "Finite mixture models and model-based clustering," Statistics Surveys, vol.4, pp.80-116, 2010.
[8] W. Tao, H. Jin, and Y. Zhang, "Color image segmentation based on mean shift and normalized cuts," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol.37, no.5, pp.1382-1389, 2007.
[9] J. Wang, B. Thiesson, Y. Xu, and M. Cohen, "Image and video segmentation by anisotropic kernel mean shift," Proceedings of the European Conference on Computer Vision (ECCV), vol.3022, pp.238-249, 2004.
[10] R.T. Collins, "Mean-shift blob tracking through scale space," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp.234-240, 2003.
[11] D. Comaniciu, V. Ramesh, and P. Meer, "Real-time tracking of non-rigid objects using mean shift," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.142-149, 2000.
[12] M. Carreira-Perpiñán, "A review of mean-shift algorithms for clustering," arXiv preprint arXiv:1503.00687, 2015.
[13] R. Subbarao and P. Meer, "Nonlinear mean shift for clustering over analytic manifolds," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp.1168-1175, 2006.
[14] R. Subbarao and P. Meer, "Nonlinear mean shift over Riemannian manifolds," International Journal of Computer Vision, vol.84, no.1, pp.1-20, 2009.
[15] J. Li, S. Ray, and B.G. Lindsay, "A nonparametric statistical approach to clustering via mode identification," Journal of Machine Learning Research, vol.8, pp.1687-1723, 2007.
[16] D.D. Cox, "A penalty method for nonparametric estimation of the logarithmic derivative of a density function," Annals of the Institute of Statistical Mathematics, vol.37, no.1, pp.271-288, 1985.
[17] H. Sasaki, A. Hyvärinen, and M. Sugiyama, "Clustering via mode seeking by direct estimation of the gradient of a log-density," Machine Learning and Knowledge Discovery in Databases, Part III, European Conference ECML/PKDD 2014, pp.19-34, 2014.
[18] H. Sasaki, T. Kanamori, A. Hyvärinen, G. Niu, and M. Sugiyama, "Mode-seeking clustering and density ridge estimation via direct estimation of density-derivative-ratios," Journal of Machine Learning Research, vol.18, no.180, 2018.
[19] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds, Princeton University Press, 2008.
[20] M. Carreira-Perpiñán, "Gaussian mean-shift is an EM algorithm," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.29, no.5, pp.767-776, 2007.
[21] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre, "Manopt, a Matlab toolbox for optimization on manifolds," Journal of Machine Learning Research, vol.15, pp.1455-1459, 2014.
[22] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[23] Y.-C. Chen, C.R. Genovese, and L. Wasserman, "A comprehensive approach to mode clustering," Electronic Journal of Statistics, vol.10, no.1, pp.210-241, 2016.
[24] T. Hastie, R. Tibshirani, and J. Friedman, "Ensemble Learning," The Elements of Statistical Learning, pp.605-624, Springer, 2009.
[25] B. Silverman, Density Estimation for Statistics and Data Analysis, CRC Press, 1986.
[26] J.E. Chacón, T. Duong, and M. Wand, "Asymptotics for general multivariate kernel density derivative estimators," Statistica Sinica, vol.21, pp.807-840, 2011.
[27] L. Hubert and P. Arabie, "Comparing partitions," Journal of Classification, vol.2, no.1, pp.193-218, 1985.
[28] B. Larsen and C. Aone, "Fast and effective text mining using linear-time document clustering," Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.16-22, ACM, 1999.
[29] H. Sasaki and A. Hyvärinen, "Neural-kernelized conditional density estimation," arXiv preprint arXiv:1806.01754, 2018.
[30] J. Park and I. Sandberg, "Universal approximation using radial-basis-function networks," Neural Computation, vol.3, no.2, pp.246-257, 1991.

Appendix A: Accuracy of Mode Estimation

Let us define a mode \(m_j\) of the true logarithmic probability density function as a point satisfying

\[ g(m_j) = 0 \quad \text{and} \quad \nabla_x g(m_j) \prec O, \quad (A\cdot 1) \]

where \(g(x) := \nabla_x \log p(x)\), \(\nabla_x g(x) := \nabla_x \nabla_x^\top \log p(x)\), and \(\nabla_x g(m_j) \prec O\) means that \(\nabla_x g(x)\) is negative definite at \(x = m_j\). Here, we follow the analysis in [18], [23]. First of all, let us make two assumptions: (1) the matrix \(\nabla_x \widehat{g}^{\mathrm{GM}}(m_j)\) is invertible, and (2) each mode \(m_j\) is uniquely approximated by an estimated mode \(\widehat{m}_j\), which is defined as a point satisfying

\[ \widehat{g}^{\mathrm{GM}}(\widehat{m}_j) = 0 \quad \text{and} \quad \nabla_x \widehat{g}^{\mathrm{GM}}(\widehat{m}_j) \prec O. \quad (A\cdot 2) \]

With the definition (A·2), a Taylor expansion yields

\[
\widehat{g}^{\mathrm{GM}}(m_j) = \widehat{g}^{\mathrm{GM}}(m_j) - \widehat{g}^{\mathrm{GM}}(\widehat{m}_j)
= \nabla_x \widehat{g}^{\mathrm{GM}}(m_j)(m_j - \widehat{m}_j) + o(\| m_j - \widehat{m}_j \|). \quad (A\cdot 3)
\]

On the other hand, by the definition (A·1), we have

\[ \widehat{g}^{\mathrm{GM}}(m_j) = \widehat{g}^{\mathrm{GM}}(m_j) - g(m_j). \quad (A\cdot 4) \]

Combining (A·3) with (A·4) yields

\[ \| \widehat{m}_j - m_j \| \le O\bigl( \| \widehat{g}^{\mathrm{GM}}(m_j) - g(m_j) \| \bigr). \quad (A\cdot 5) \]

Thus, a mode \(m_j\) can be well-approximated by a mode estimate \(\widehat{m}_j\) if \(\widehat{g}^{\mathrm{GM}}\) is an accurate estimate of g around \(m_j\).

Since we directly fit a model \(g^{\mathrm{GM}}\) to g under the squared loss, \(\widehat{g}^{\mathrm{GM}}\) converges to g as the sample size n goes to infinity if \(g^{\mathrm{GM}}\) and g belong to the same function set (see Proposition 3 in [29] for a more rigorous analysis based on the same squared loss). Thus, we next need to confirm whether the function set of \(g^{\mathrm{GM}}\) is large enough. The Gaussian mixture model in (6) includes radial basis function (RBF) networks with Gaussian functions as a special case, and RBF networks with a sufficiently large number of Gaussians have the universal approximation capability under some conditions [30]. Since \(g^{\mathrm{GM}}\) is the gradient of the Gaussian mixture model, \(g^{\mathrm{GM}}\) with a sufficiently large number of Gaussians could well-approximate a wide range of continuous g as the sample size n approaches infinity. To make a more rigorous statement, we need to establish the consistency of \(\widehat{g}^{\mathrm{GM}}\) under the uniform norm \(\max_j \sup_x |\widehat{g}_j^{\mathrm{GM}}(x) - g_j(x)|\), but this is beyond the scope of this paper. Therefore, we leave this challenge for our future work.

Zhang Qi received his B.E. in Traffic and Transportation from Henan University of Science and Technology, China, and his M.E. in Information Science from Nara Institute of Science and Technology, Japan, in 2014 and 2018, respectively. He is currently pursuing the doctoral degree at the Graduate University for Advanced Studies, Japan. His research interests include applications and theories of machine learning.

Hiroaki Sasaki received his B.E. in Applied Physics, and his M.E. and D.E. in Electrical and Communication Engineering from Tohoku University in 2006, 2008 and 2011, respectively. He was a research fellow of the Japan Society for the Promotion of Science from 2011 to 2014. He was a postdoctoral researcher with the Graduate School of Information Science and Engineering, Tokyo Institute of Technology for six months in 2014, and with the Graduate School of Frontier Sciences, the University of Tokyo from 2014 to 2015. Since 2016, he has been an assistant professor at Nara Institute of Science and Technology. His research interests include algorithms and theories of machine learning.

Kazushi Ikeda received his B.E., M.E., and Ph.D. in mathematical engineering and information physics from the University of Tokyo in 1989, 1991, and 1994. He was a research associate with the Department of Electrical and Computer Engineering of Kanazawa University from 1994 to 1998. He was a research associate of the Chinese University of Hong Kong for three months in 1995. He was with the Graduate School of Informatics, Kyoto University, as an associate professor from 1998 to 2008. Since 2008, he has been a full professor at Nara Institute of Science and Technology. He was the editor-in-chief of the Journal of the Japanese Neural Network Society, and is currently an action editor of Neural Networks and an associate editor of IEEE Transactions on Neural Networks and Learning Systems.