
IEICE TRANS. INF. & SYST., VOL.E102-D, NO.6 JUNE 2019

PAPER    Direct Log-Density Gradient Estimation with Gaussian Mixture Models and Its Application to Clustering

Qi ZHANG†a), Hiroaki SASAKI††b), Nonmembers, and Kazushi IKEDA††c), Senior Member

SUMMARY    Estimation of the gradient of the logarithm of a probability density is a versatile tool in statistical data analysis. A recent method for mode-seeking clustering called the least-squares log-density gradient clustering (LSLDGC) [Sasaki et al., 2014] employs a sophisticated gradient estimator, which directly estimates the log-density gradient without going through density estimation. However, the typical implementation of LSLDGC is based on a spherical Gaussian function, which may not work well when the probability density function for data has highly correlated local structures. To cope with this problem, we propose a new estimator for log-density gradients with Gaussian mixture models (GMMs). Covariance matrices in GMMs enable the new estimator to capture the highly correlated structures. Through the application of the new gradient estimator to mode-seeking clustering and hierarchical clustering, we experimentally demonstrate the usefulness of our clustering methods over existing methods.

key words: probability density gradient, mixture model, clustering, hierarchical clustering

Manuscript received October 20, 2018. Manuscript revised February 14, 2019. Manuscript publicized March 22, 2019.
† The author is with the Graduate University for Advanced Studies, Tachikawa-shi, 190-0014 Japan.
†† The authors are with Nara Institute of Science and Technology, Ikoma-shi, 630-0192 Japan.
a) E-mail: [email protected]
b) E-mail: [email protected]
c) E-mail: [email protected]
DOI: 10.1587/transinf.2018EDP7354

1. Introduction

Estimating the gradient of the logarithm of the probability density function underlying data is an important problem. For instance, an unsupervised dimensionality reduction method requires estimating the log-density gradient [1]. In a supervised learning problem, estimating the log-density gradient enables us to capture multiple functional relationships between output and input variables [2]. Other statistical topics can be found in [3]. Thus, log-density gradient estimation offers a wide range of applications in statistical data analysis.

Among them, an interesting application is mode-seeking clustering. Mean shift clustering (MS) [4]-[6] iteratively updates data samples toward the modes (i.e., local maxima) of the estimated probability density function by gradient ascent, and then assigns a cluster label to the data samples that converged to the same mode. Compared with k-means and mixture-model-based clustering [7], MS has two notable points: MS does not make strong assumptions on the probability density function, and the number of clusters in MS is automatically determined by the detected modes. Therefore, MS has been applied to a variety of problems such as image segmentation [6], [8], [9] and object tracking [10], [11] (see also a recent review article [12]). Furthermore, MS and its related methods have been extended to handle manifold data [13], [14] and to hierarchical clustering [15].

A naive approach is first to estimate the probability density function (e.g., by kernel density estimation), and then to compute its log-gradient. However, this approach can be unreliable because a good density estimator does not necessarily yield a good gradient estimator. To alleviate this problem, a recent mode-seeking clustering method called LSLDG clustering employs a sophisticated gradient estimator, which directly fits a gradient model to the true log-density gradient without going through density estimation [16]-[18]. LSLDG clustering has been experimentally demonstrated to significantly improve the performance of MS, particularly for high-dimensional data [17], [18]. However, the gradient estimator in LSLDG clustering is typically implemented with the spherical Gaussian kernel. Thus, when the probability density function includes clusters with highly correlated structures, LSLDG may require a huge number of samples to produce a smooth gradient estimate. This can be problematic particularly in mode-seeking clustering because an unsmooth estimate may create spurious modes, as we demonstrate later.

To improve the performance of LSLDG clustering on highly correlated structures, we propose to use a Gaussian mixture model (GMM) in log-density gradient estimation. Estimating the covariance matrices in the GMM makes the gradient estimator much more adaptive to the local structures in the probability density function. A challenge is to satisfy the positive semidefinite constraint on the covariance matrices. To overcome this challenge, we develop an estimation algorithm combined with manifold optimization [19].

Next, we apply the proposed estimator to mode-seeking clustering. To update data samples during mode-seeking, we derive an update formula based on the fixed-point method. A similar formula has been proposed in MS [20], but as shown later, our formula includes it as a special case. Furthermore, we extend the proposed clustering method to hierarchical clustering, which is a novel extension of LSLDG clustering. The usefulness of the proposed clustering methods is experimentally demonstrated.

This paper is organized as follows: Sect. 2 reviews an existing gradient estimator and MS. In Sect. 3, we propose a new estimator for log-density gradients with Gaussian mixture models. Section 4 develops a mode-seeking clustering method as an application of the proposed gradient estimator. Then, the developed clustering method is further extended to hierarchical clustering. The performance of the clustering methods is experimentally investigated in Sect. 5. Section 6 concludes this paper.

2. Background

This section reviews an existing estimator for log-density gradients and mode-seeking clustering methods.

2.1 Review of LSLDG

Here, we review a direct estimator for log-density gradients, which we refer to as the least-squares log-density gradients (LSLDG) [16], [17]. Suppose that n i.i.d. data samples drawn from a probability distribution with density p(x) are available as

\[ \mathcal{D} := \{ x_i = (x_i^{(1)}, \ldots, x_i^{(d)})^\top \}_{i=1}^{n} \sim p(x). \]

The goal of LSLDG is to estimate the gradient of the log-density from \(\mathcal{D}\):

\[ \nabla_x \log p(x) = \left( \frac{\partial_1 p(x)}{p(x)}, \ldots, \frac{\partial_d p(x)}{p(x)} \right)^\top, \]

where \(\partial_j := \frac{\partial}{\partial x^{(j)}}\) and \(\nabla_x\) denotes the differential operator with respect to x.

LSLDG directly fits a model \(g_j\) to the true partial derivative of the log-density under the squared loss:

\[
J_j(g_j) := \frac{1}{2} \int \{ g_j(x) - \partial_j \log p(x) \}^2 p(x)\,dx - C_j
= \frac{1}{2} \int \{ g_j(x) \}^2 p(x)\,dx - \int g_j(x)\, \partial_j p(x)\,dx
= \frac{1}{2} \int \{ g_j(x) \}^2 p(x)\,dx + \int \partial_j g_j(x)\, p(x)\,dx,
\]

where \(C_j := \frac{1}{2} \int \{ \partial_j \log p(x) \}^2 p(x)\,dx\), and we applied integration by parts to the second term on the right-hand side under the assumption that \(|g_j(x)p(x)| \to 0\) as \(|x^{(j)}| \to \infty\). Then, the empirical risk up to the ignorable constant \(C_j\) is obtained as

\[ \widehat{J}_j(g_j) := \frac{1}{2n} \sum_{i=1}^{n} g_j(x_i)^2 + \frac{1}{n} \sum_{i=1}^{n} \partial_j g_j(x_i). \quad (1) \]

To estimate \(g_j\), LSLDG employs the following model:

\[ g_j(x) := \sum_{i=1}^{b} \alpha_i^{(j)} \varphi_i^{(j)}(x), \quad (2) \]

where \(\alpha_i^{(j)}\) and \(\varphi_i^{(j)}\) are coefficients and basis functions, respectively, and b denotes the number of basis functions. In [17], the derivative of the Gaussian kernel is used as

\[ \varphi_i^{(j)}(x) := \partial_j \exp\left( -\frac{\| x - c_i \|^2}{2\sigma_j^2} \right), \quad (3) \]

where \(c_i := (c_i^{(1)}, \ldots, c_i^{(d)})^\top\) denotes a kernel center; the centers are fixed at a subset of data samples randomly chosen from \(\mathcal{D}\), and \(\sigma_j\) denotes the bandwidth parameter. After substituting the model (2) into \(\widehat{J}_j\), the optimal \(\alpha^{(j)} := (\alpha_1^{(j)}, \ldots, \alpha_b^{(j)})^\top\) can be computed analytically by solving the following problem:

\[
\widehat{\alpha}^{(j)} := \mathop{\mathrm{argmin}}_{\alpha^{(j)}} \widehat{J}_j(\alpha^{(j)}) + \lambda_j \| \alpha^{(j)} \|^2
= -(G_j + \lambda_j I_b)^{-1} h_j,
\]

where \(\lambda_j\) is the regularization parameter, \(I_b\) denotes the b-by-b identity matrix, and, with \(\varphi^{(j)}(x) := (\varphi_1^{(j)}(x), \ldots, \varphi_b^{(j)}(x))^\top\),

\[ G_j := \frac{1}{n} \sum_{i=1}^{n} \varphi^{(j)}(x_i) \varphi^{(j)}(x_i)^\top, \qquad h_j := \frac{1}{n} \sum_{i=1}^{n} \partial_j \varphi^{(j)}(x_i). \]

The gradient estimator is finally given by

\[ \widehat{g}_j(x) = \sum_{i=1}^{b} \widehat{\alpha}_i^{(j)} \varphi_i^{(j)}(x). \]

2.2 Review of Mode-Seeking Clustering

Mean shift clustering (MS) [4], [6] is a mode-seeking clustering method based on kernel density estimation (KDE). MS employs KDE to estimate the probability density function as follows:

\[ p_{\mathrm{KDE}}(x) := \frac{1}{n} \sum_{i=1}^{n} K\!\left( \frac{\| x - x_i \|^2}{2h^2} \right), \quad (4) \]

where h denotes the bandwidth parameter and K is a smooth kernel function for KDE (see [6, Eq. (3)] for the definition). Taking the gradient of \(p_{\mathrm{KDE}}(x)\) with respect to x yields

\[
\nabla_x p_{\mathrm{KDE}}(x) = \frac{1}{n h^2} \sum_{i=1}^{n} (x_i - x)\, H\!\left( \frac{\| x - x_i \|^2}{2h^2} \right)
= \frac{1}{n h^2} \left[ \sum_{i=1}^{n} x_i H\!\left( \frac{\| x - x_i \|^2}{2h^2} \right) - x \sum_{i=1}^{n} H\!\left( \frac{\| x - x_i \|^2}{2h^2} \right) \right],
\]

where \(H(t) := -\frac{d}{dt} K(t)\). Based on the fixed-point method, setting \(\nabla_x p_{\mathrm{KDE}}(x) = 0\) yields a simple update formula:

\[ x \leftarrow \frac{\sum_{i=1}^{n} x_i H\!\left( \frac{\| x - x_i \|^2}{2h^2} \right)}{\sum_{i=1}^{n} H\!\left( \frac{\| x - x_i \|^2}{2h^2} \right)}. \]

This update formula is iteratively applied to the data samples \(x_i\) until they converge to the modes of \(p_{\mathrm{KDE}}\). MS finally assigns the same cluster label to the data samples converging to the same mode.

LSLDG clustering [17], [18] is another mode-seeking clustering method. The main difference is that \(\nabla_x p_{\mathrm{KDE}}(x)\) is replaced with the LSLDG estimator \(\widehat{g}(x) = (\widehat{g}_1(x), \ldots, \widehat{g}_d(x))^\top\). An update formula is derived based on the fixed-point method in LSLDG clustering as well. It has been experimentally demonstrated that LSLDG clustering significantly outperforms MS for high-dimensional data.
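To make the review above concrete, the following is a minimal NumPy sketch, not the authors' implementation, of the LSLDG estimator with the Gaussian-derivative basis (2)-(3) and the analytic ridge solution. All names (e.g., lsldg_fit, centers) are ours, and a single shared bandwidth and regularization parameter are assumed for simplicity instead of per-dimension values.

```python
import numpy as np

def lsldg_fit(X, n_basis=100, sigma=1.0, lam=1e-3, seed=0):
    """Fit LSLDG coefficients alpha^(j) = -(G_j + lam I)^{-1} h_j for each dimension j.

    X: (n, d) data matrix. Returns (alpha, centers) where alpha has shape (d, b).
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    b = min(n_basis, n)
    centers = X[rng.choice(n, size=b, replace=False)]        # kernel centers c_i

    diff = X[:, None, :] - centers[None, :, :]                # x - c_i, shape (n, b, d)
    kern = np.exp(-np.sum(diff**2, axis=2) / (2.0 * sigma**2))  # Gaussian kernel, (n, b)

    alpha = np.zeros((d, b))
    for j in range(d):
        # Basis (3): phi_i^(j)(x) = d/dx^(j) exp(-||x - c_i||^2 / (2 sigma^2))
        phi = -(diff[:, :, j] / sigma**2) * kern
        # Its derivative with respect to x^(j), needed for h_j
        dphi = (-(1.0 / sigma**2) + (diff[:, :, j] / sigma**2)**2) * kern
        G = phi.T @ phi / n                                    # (b, b)
        h = dphi.mean(axis=0)                                  # (b,)
        alpha[j] = -np.linalg.solve(G + lam * np.eye(b), h)
    return alpha, centers

def lsldg_gradient(X_query, alpha, centers, sigma=1.0):
    """Evaluate the estimated log-density gradient ghat_j(x) at query points."""
    diff = X_query[:, None, :] - centers[None, :, :]
    kern = np.exp(-np.sum(diff**2, axis=2) / (2.0 * sigma**2))
    grads = np.stack([(-(diff[:, :, j] / sigma**2) * kern) @ alpha[j]
                      for j in range(alpha.shape[0])], axis=1)
    return grads                                               # (m, d)
```

In LSLDG clustering, an estimate of this kind plays the role that the KDE gradient plays in MS, as described above.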

3. LSLDG with Gaussian Mixture Models

To improve the performance of LSLDG for local correlation structures in the probability density function, instead of the spherical Gaussian kernel in (3), we use a Gaussian function with a mean parameter \(\mu_i\) and a precision matrix \(\Lambda_i\) (i.e., the inverse of a covariance matrix) as the basis function:

\[
\psi_i^{(j)}(x; \mu_i, \Lambda_i) := \partial_j \exp\left( -\frac{1}{2}(x - \mu_i)^\top \Lambda_i (x - \mu_i) \right)
= [\Lambda_i(\mu_i - x)]_j\, \phi_i(x; \mu_i, \Lambda_i), \quad (5)
\]

where \([x]_j\) denotes the j-th element of x and

\[ \phi_i(x; \mu_i, \Lambda_i) := \exp\left( -\frac{1}{2}(x - \mu_i)^\top \Lambda_i (x - \mu_i) \right). \]

Then, a gradient model based on a Gaussian mixture model† is given by

\[ g_j^{\mathrm{GM}}(x) := \sum_{i=1}^{b} \theta_i^{(j)} \psi_i^{(j)}(x; \mu_i, \Lambda_i), \quad (6) \]

where \(\theta_i^{(j)}\) denote coefficients. Substituting the Gaussian mixture model (6) into \(\widehat{J}_j\) yields the following optimization problem:

\[
\{\widehat{\theta}^{(j)}\}_{j=1}^{d}, \{\widehat{\mu}_i\}_{i=1}^{b}, \{\widehat{\Lambda}_i\}_{i=1}^{b}
:= \mathop{\mathrm{argmin}}_{\{\theta^{(j)}\}_{j=1}^{d}, \{\mu_i\}_{i=1}^{b}, \{\Lambda_i\}_{i=1}^{b}} \sum_{j=1}^{d} \widehat{J}_j(\theta^{(j)}, \{\mu_i\}_{i=1}^{b}, \{\Lambda_i\}_{i=1}^{b})
\quad \text{subject to } \Lambda_i = \Lambda_i^\top,\ \Lambda_i \succeq O \text{ for all } i, \quad (7)
\]

where \(\theta^{(j)} := (\theta_1^{(j)}, \ldots, \theta_b^{(j)})^\top\), and \(\Lambda_i \succeq O\) means that \(\Lambda_i\) is positive semidefinite.

We next describe our estimation algorithm. Our algorithm alternately repeats the following two steps until \(\sum_{j=1}^{d} \widehat{J}_j\) converges: (1) optimization of \(\{\theta^{(j)}\}_{j=1}^{d}\) and (2) optimization of \(\{\mu_i\}_{i=1}^{b}\) and \(\{\Lambda_i\}_{i=1}^{b}\). The details of the two steps are given as follows:

(1) Optimization of \(\{\theta^{(j)}\}_{j=1}^{d}\)

Given \(\{\mu_i\}_{i=1}^{b}\) and \(\{\Lambda_i\}_{i=1}^{b}\), the optimal \(\theta^{(j)}\) for all j can be computed analytically††:

\[ \widehat{\theta}^{(j)} = -G_j^{-1} h_j, \quad (8) \]

where, with \(\psi^{(j)}(x) := (\psi_1^{(j)}(x), \ldots, \psi_b^{(j)}(x))^\top\),

\[ G_j := \frac{1}{n} \sum_{i=1}^{n} \psi^{(j)}(x_i) \psi^{(j)}(x_i)^\top, \qquad h_j := \frac{1}{n} \sum_{i=1}^{n} \partial_j \psi^{(j)}(x_i). \]

(2) Optimization of \(\{\mu_i\}_{i=1}^{b}\) and \(\{\Lambda_i\}_{i=1}^{b}\)

Given \(\{\theta^{(j)}\}_{j=1}^{d}\), we optimize \(\{\mu_i\}_{i=1}^{b}\) and \(\{\Lambda_i\}_{i=1}^{b}\). Here, the challenge is that \(\Lambda_i\) must be symmetric positive semidefinite. To satisfy the constraint, we employ manifold optimization techniques [19]. In this step, gradient descent on a manifold is performed for a single step (iteration)†††. Regarding \(\mu_i\), standard gradient descent is applied, also for a single step.

After repeating the alternate optimization over the parameters, we finally obtain the estimator as

\[ \widehat{g}_j^{\mathrm{GM}}(x) := \sum_{i=1}^{b} \widehat{\theta}_i^{(j)} \psi_i^{(j)}(x; \widehat{\mu}_i, \widehat{\Lambda}_i). \quad (9) \]

We call this estimator the Gaussian mixture LSLDG (GM-LSLDG).

† This model is different from the standard Gaussian mixture model in statistics because the Gaussian functions in (6) are not normalized. Nonetheless, we call it a Gaussian mixture model to emphasize that the mean parameters \(\mu_i\) and precision matrices \(\Lambda_i\) are estimated from data samples.
†† The analytic solution can be easily derived by solving \(\nabla_{\theta^{(j)}} \widehat{J}_j(\theta^{(j)}, \{\mu_i\}_{i=1}^{b}, \{\Lambda_i\}_{i=1}^{b}) = 0\) with respect to \(\theta^{(j)}\).
††† We used a MATLAB function, sympositivedefinitefactory(), in the software called Manopt [21].
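As an illustration only, here is a minimal NumPy sketch of the coefficient step (8) of GM-LSLDG, assuming the means and precision matrices are given. The function names (gm_parts, solve_theta) and the small ridge term added for numerical stability are our own and not part of the paper; the manifold-optimization step (2) for the means and precision matrices is omitted.

```python
import numpy as np

def gm_parts(X, mus, Lams):
    """Building blocks of the basis (5): phi_i(x) and a_i(x) = Lambda_i (mu_i - x).

    X: (n, d) samples, mus: (b, d) means, Lams: (b, d, d) precision matrices.
    Returns phi with shape (n, b) and a with shape (n, b, d).
    """
    diff = mus[None, :, :] - X[:, None, :]                  # mu_i - x, shape (n, b, d)
    a = np.einsum('bjk,nbk->nbj', Lams, diff)               # Lambda_i (mu_i - x)
    quad = np.einsum('nbj,nbj->nb', diff, a)                 # (x - mu_i)^T Lambda_i (x - mu_i)
    phi = np.exp(-0.5 * quad)                                 # unnormalized Gaussian, (n, b)
    return phi, a

def solve_theta(X, mus, Lams, ridge=1e-8):
    """Analytic coefficient step (8): theta^(j) = -G_j^{-1} h_j for each dimension j.

    A tiny ridge term is added to G_j for numerical stability (our addition).
    """
    n, d = X.shape
    b = mus.shape[0]
    phi, a = gm_parts(X, mus, Lams)
    theta = np.zeros((d, b))
    for j in range(d):
        psi_j = a[:, :, j] * phi                              # psi_i^(j)(x) = [a_i(x)]_j phi_i(x)
        # d/dx^(j) psi_i^(j)(x) = (-Lambda_i[j, j] + [a_i(x)]_j ** 2) * phi_i(x)
        dpsi_j = (a[:, :, j] ** 2 - Lams[:, j, j][None, :]) * phi
        G = psi_j.T @ psi_j / n                               # (b, b)
        h = dpsi_j.mean(axis=0)                               # (b,)
        theta[j] = -np.linalg.solve(G + ridge * np.eye(b), h)
    return theta
```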
4. Application to Clustering

This section applies GM-LSLDG to mode-seeking clustering. Furthermore, the clustering method is extended to hierarchical clustering.

4.1 Mode-Seeking Clustering

Here, we derive an update formula of data samples to perform mode-seeking, as done in MS and LSLDG clustering. Let us denote the gradient vector estimated by GM-LSLDG as \(\widehat{g}^{\mathrm{GM}}(x) := (\widehat{g}_1^{\mathrm{GM}}(x), \ldots, \widehat{g}_d^{\mathrm{GM}}(x))^\top\). Then, by expanding the right-hand side of (9), \(\widehat{g}^{\mathrm{GM}}(x)\) can be expressed as

\[
\widehat{g}^{\mathrm{GM}}(x) = \sum_{i=1}^{b} \widehat{\Theta}_i \widehat{\Lambda}_i (\widehat{\mu}_i - x)\, \phi_i(x)
= \sum_{i=1}^{b} \widehat{\Theta}_i \widehat{\Lambda}_i \widehat{\mu}_i \phi_i(x) - \left\{ \sum_{i=1}^{b} \widehat{\Theta}_i \widehat{\Lambda}_i \phi_i(x) \right\} x, \quad (10)
\]

where \(\phi_i(x) := \phi_i(x; \widehat{\mu}_i, \widehat{\Lambda}_i)\) and \(\widehat{\Theta}_i\) denotes the d-by-d diagonal matrix with diagonals \(\widehat{\theta}_i^{(1)}, \ldots, \widehat{\theta}_i^{(d)}\) for a fixed i. Based on the fixed-point method, setting \(\widehat{g}^{\mathrm{GM}}(x) = 0\) yields the following update formula:

\[
x \leftarrow \left\{ \sum_{i=1}^{b} \widehat{\Theta}_i \widehat{\Lambda}_i \phi_i(x) \right\}^{-1} \left\{ \sum_{i=1}^{b} \widehat{\Theta}_i \widehat{\Lambda}_i \widehat{\mu}_i \phi_i(x) \right\}. \quad (11)
\]

A similar update formula has been derived in MS with a mixture model [20, Eq. (2)], but it is a special case of our formula where \(\Theta_i = \theta_i I_d\) with a single parameter \(\theta_i\), while \(\widehat{\Theta}_i\) in our method is a diagonal matrix whose diagonal elements can differ.

It is important to confirm whether the formula (11) updates data samples toward the modes. To this end, we show a sufficient condition under which the update formula (11) performs gradient ascent. By multiplying \(\{\sum_{i=1}^{b} \widehat{\Theta}_i \widehat{\Lambda}_i \phi_i(x)\}^{-1}\) on both sides of (10), (11) can be equivalently expressed as

\[
x \leftarrow x + \left\{ \sum_{i=1}^{b} \widehat{\Theta}_i \widehat{\Lambda}_i \phi_i(x) \right\}^{-1} \widehat{g}^{\mathrm{GM}}(x). \quad (12)
\]

From the definition of an ascent direction [22, Sect. 9.2], (12) shows that the update formula (11) performs gradient ascent whenever

\[
\widehat{g}^{\mathrm{GM}}(x)^\top \left\{ \sum_{i=1}^{b} \widehat{\Theta}_i \widehat{\Lambda}_i \phi_i(x) \right\}^{-1} \widehat{g}^{\mathrm{GM}}(x) > 0. \quad (13)
\]

Thus, we need to check (13), or that \(\sum_{i=1}^{b} \widehat{\Theta}_i \widehat{\Lambda}_i \phi_i(x)\) is a positive definite matrix, when the update formula (11) is used.

Based on this fact, our clustering algorithm consists of the following steps (a minimal code sketch of the mode-seeking update is given at the end of this subsection):

Step 1 Perform GM-LSLDG and obtain the optimal parameters \(\{\widehat{\theta}^{(j)}\}_{j=1}^{d}\), \(\{\widehat{\mu}_i\}_{i=1}^{b}\) and \(\{\widehat{\Lambda}_i\}_{i=1}^{b}\).
Step 2 Repeat the following mode-seeking step for each data sample \(x_i\) until convergence: if (13) is satisfied, \(x_i\) is updated by the fixed-point formula (11); otherwise, we perform standard gradient ascent as \(x_i \leftarrow x_i + \varepsilon\, \widehat{g}^{\mathrm{GM}}(x_i)\), where \(\varepsilon\) is a fixed step-size parameter.
Step 3 Assign the same cluster label to a set of data points whose converged points are close enough.

We empirically observed that most data samples monotonically converged to the modes without using standard gradient ascent. Thus, we conjecture that, as in MS [6, Theorem 1] and LSLDG clustering [18, Theorem 3], the update rule (11) has a monotonic hill-climbing property toward the modes. Proving this conjecture is left for future work.

Finally, we discuss whether our gradient estimator enables us to accurately find the modes of the true density, not of the estimated one. Following the analysis in [18], [23], Appendix A discusses that our gradient estimator with a sufficiently large number of Gaussians would suffice to accurately capture the modes for a wide range of density functions as the sample size goes to infinity. However, our experimental results indicate that our clustering method with a small number of Gaussians often works well. Filling this gap between theory and practice is also an interesting challenge, and one of our future works.
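The following is a minimal sketch, under our own naming and simplifications, of the mode-seeking step in Sect. 4.1: it evaluates \(\widehat{g}^{\mathrm{GM}}(x)\) via (10), applies the fixed-point update (11) when condition (13) holds, and otherwise falls back to a gradient-ascent step. It reuses the hypothetical gm_parts and solve_theta helpers from the previous sketch; the tolerance, step size, and iteration cap are arbitrary choices, not values from the paper.

```python
import numpy as np

def gm_gradient_and_matrix(x, theta, mus, Lams):
    """Return ghat^GM(x) and A(x) = sum_i Theta_i Lambda_i phi_i(x) from Eq. (10)."""
    phi, a = gm_parts(x[None, :], mus, Lams)            # phi: (1, b), a: (1, b, d)
    phi, a = phi[0], a[0]                                # (b,), (b, d)
    g = np.sum(phi[:, None] * theta.T * a, axis=0)       # sum_i Theta_i Lambda_i (mu_i - x) phi_i
    A = np.einsum('ji,ijk,i->jk', theta, Lams, phi)      # sum_i Theta_i Lambda_i phi_i(x)
    return g, A

def mode_seek(x0, theta, mus, Lams, step=0.1, tol=1e-6, max_iter=500):
    """Move one point toward a mode: fixed-point update (11) when condition (13) holds,
    otherwise a plain gradient-ascent step, as in Step 2 of the clustering algorithm."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, A = gm_gradient_and_matrix(x, theta, mus, Lams)
        if np.linalg.norm(g) < tol:                      # converged to a stationary point
            break
        use_fixed_point = False
        try:
            Ainv_g = np.linalg.solve(A, g)
            use_fixed_point = g @ Ainv_g > 0             # condition (13)
        except np.linalg.LinAlgError:
            pass                                          # A(x) singular: fall back to ascent
        # Form (12): x + A(x)^{-1} ghat^GM(x) is equivalent to the fixed-point update (11)
        x = x + Ainv_g if use_fixed_point else x + step * g
    return x
```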
4.2 Extension to Hierarchical Clustering

First, we briefly review an existing hierarchical clustering method based on mode-seeking, and then propose our hierarchical clustering method, which is a novel extension of LSLDG clustering.

4.2.1 Hierarchical Mode Association Clustering

A hierarchical clustering method called hierarchical mode association clustering (HMAC) has been proposed based on a mode-seeking clustering method [15]. When the kernel density estimator in (4) is employed, the main idea of HMAC is to perform mode-seeking clustering over a set of bandwidth parameters h: a smaller (or larger) value of the bandwidth parameter h produces a less (or more) smooth density estimate, which tends to have more (or fewer) modes (i.e., clusters). Thus, roughly speaking, a hierarchy of clusters can be built by organizing the clustering results obtained from the set of bandwidth parameters. As discussed in [15], HMAC is substantially different from linkage clustering [24, Sect. 14.3.12]: clusters in HMAC are merged globally based on the estimated density, while linkage clustering merges two clusters based on a local distance, which may result in skewed clusters.

We follow the approach of HMAC, but the extension is not straightforward because we need to prepare a set of precision matrices in the Gaussian mixture model to obtain a variety of cluster representations, which seems much more challenging than selecting a set of bandwidth parameters. Below, we describe how to select a set of precision matrices and then propose a practical algorithm for our hierarchical clustering method.

4.2.2 Selecting a Set of Precision Matrices

To systematically select a set of precision matrices, we optimize \(\{\Lambda_i\}_{i=1}^{b}\) twice from different initializations and store \(\{\Lambda_i\}_{i=1}^{b}\) at each iteration to build a hierarchy of clusters. The initialization of \(\{\Lambda_i\}_{i=1}^{b}\) is an important problem, and we initialize \(\Lambda_i = \eta I_d\) for all i, where \(\eta\) is some parameter. Intuitively, when \(\eta\) is a small (or large) value, the set of precision matrices obtained over the iterations of the optimization would produce from a large (or small) to a smaller (or larger) number of clusters. For this reason, we optimize \(\Lambda_i\) twice from different initializations with a small and a large value of \(\eta\), and unify the sets \(\{\Lambda_i\}_{i=1}^{b}\) obtained at each iteration.

However, the difficulty in this approach is that the optimization problem (7) is nonconvex. Thus, the final solutions from the two different initializations may not converge to the same point. To cope with this problem, we first perform GM-LSLDG and obtain \(\{\widehat{\theta}^{(j)}\}_{j=1}^{d}\), \(\{\widehat{\mu}_i\}_{i=1}^{b}\) and \(\{\widehat{\Lambda}_i\}_{i=1}^{b}\) in (7). Then, given these parameters, we solve the following optimization problem with an additional term:

\[
\min_{\{\Lambda_i\}_{i=1}^{b}} \sum_{j=1}^{d} \widehat{J}_j(\{\Lambda_i\}_{i=1}^{b}) + \beta \sum_{i=1}^{b} \| \Lambda_i - \widehat{\Lambda}_i \|_F^2, \quad (14)
\]

where \(\widehat{J}_j(\{\Lambda_i\}_{i=1}^{b}) := \widehat{J}_j(\widehat{\theta}^{(j)}, \{\widehat{\mu}_i\}_{i=1}^{b}, \{\Lambda_i\}_{i=1}^{b})\), \(\beta\) is a nonnegative parameter, and \(\|\cdot\|_F\) denotes the Frobenius norm.

The second term in (14) ensures that the final solution is close to \(\widehat{\Lambda}_i\) when \(\beta\) is chosen appropriately. Thus, we use a set of precision matrices obtained by solving (14) for hierarchical clustering. We choose \(\beta\) as follows: we begin with a small value of \(\beta\) and compute the Frobenius distance between the final solution and \(\widehat{\Lambda}_i\), as in the second term of (14). If this Frobenius distance is sufficiently small, then we take the sequence of precision matrices up to the final solution and use it for hierarchical clustering. Otherwise, we slightly increase the value of \(\beta\) and repeat the same procedure until the Frobenius distance becomes sufficiently small.

4.2.3 Practical Algorithm

Before going into the details of the practical algorithm, let us define some notation. We denote by \(M_l\) the set of mode points obtained at hierarchical level l. A label partition function of x at hierarchical level l assigns a cluster label to x and is defined by \(P_l(x) \in \{1, 2, \ldots, K_l\}\), where \(K_l\) denotes the number of clusters at level l.

Our algorithm takes the following steps (a minimal sketch of the level-wise labeling appears after this subsection):

Step 1 Perform GM-LSLDG and obtain the optimal parameters \(\{\widehat{\theta}^{(j)}\}_{j=1}^{d}\), \(\{\widehat{\mu}_i\}_{i=1}^{b}\) and \(\{\widehat{\Lambda}_i\}_{i=1}^{b}\).
Step 2 Given \(\{\widehat{\theta}^{(j)}\}_{j=1}^{d}\) and \(\{\widehat{\mu}_i\}_{i=1}^{b}\), solve (14) with a small value of \(\eta\) (e.g., \(\eta = 0.001\)) for the initialization \(\Lambda_i = \eta I_d\), and obtain a sequence of precision matrices \(\{\widetilde{\Lambda}_i^{(1)}, \widetilde{\Lambda}_i^{(2)}, \ldots, \widetilde{\Lambda}_i^{(T)}\}_{i=1}^{b}\), where \(\widetilde{\Lambda}_i^{(t)}\) denotes the precision matrix at the t-th iteration and T is the total number of iterations†.
Step 3 With a large value of \(\eta\) (e.g., \(\eta = 1000\)), solve (14) and obtain a sequence of precision matrices \(\{\check{\Lambda}_i^{(1)}, \check{\Lambda}_i^{(2)}, \ldots, \check{\Lambda}_i^{(T)}\}_{i=1}^{b}\).
Step 4 Unify the two sequences of precision matrices as \(\{\widetilde{\Lambda}_i^{(1)}, \ldots, \widetilde{\Lambda}_i^{(T)}, \widehat{\Lambda}_i, \check{\Lambda}_i^{(T)}, \ldots, \check{\Lambda}_i^{(1)}\}_{i=1}^{b}\), and re-express it as \(\{\widetilde{\Lambda}_i^{(1)}, \ldots, \widetilde{\Lambda}_i^{(L)}\}_{i=1}^{b}\), where \(L = 2T + 1\) and \(\widetilde{\Lambda}_i^{(L)} = \check{\Lambda}_i^{(1)}\). Then, repeatedly perform GM-LSLDGC with \(\{\widehat{\theta}^{(j)}\}_{j=1}^{d}\), \(\{\widehat{\mu}_i\}_{i=1}^{b}\) and \(\{\widetilde{\Lambda}_i^{(l)}\}_{i=1}^{b}\) from l = 1 to l = L as follows:
(a) Perform GM-LSLDGC with \(\{\widetilde{\Lambda}_i^{(l)}\}_{i=1}^{b}\) on the points in \(M_{l-1}\) and obtain \(M_l\), where \(M_0\) corresponds to the set of data samples.
(b) If \(P_{l-1}(x_i) = k\) and the k-th mode point in \(M_{l-1}\) converged to the k'-th element of \(M_l\) after mode-seeking, then set \(P_l(x_i) = k'\).

Step 4 essentially follows HMAC [15] and ensures that this hierarchical clustering algorithm produces a nested clustering representation, i.e., the following mapping holds: \(x_i \to M_1(x_i) \to M_2(M_1(x_i)) \to \cdots\), where \(M_l(x)\) maps x to a mode point in \(M_l\).

Compared with HMAC, the proposed method has an advantage in terms of the selection of precision matrices: the set of bandwidth parameters in HMAC is selected manually, while the set of precision matrices in our method is obtained through the optimization, and the only manually chosen parameter is \(\eta\). Therefore, it should be easier to select a set of precision matrices in our method. In addition, thanks to the precision matrices, our hierarchical clustering method would be more useful than HMAC when the data density includes highly correlated local structures.

† To promote the interpretability of the results, when T is large, we only use the precision matrices at the 0.7T-th, 0.8T-th and 0.9T-th iterations.
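Below is a minimal sketch, with our own hypothetical names (mode_seek_all, hierarchical_gm_lsldgc), of the level-wise part of Step 4: at each level, mode-seeking is run on the previous level's mode points and the labels are propagated, which yields the nested partition described above. It reuses the mode_seek helper from the earlier sketch, and matching converged points to modes by a distance threshold is our simplification.

```python
import numpy as np

def mode_seek_all(points, theta, mus, Lams, merge_tol=1e-2):
    """Run mode-seeking on every point and group the converged points into modes."""
    converged = np.array([mode_seek(p, theta, mus, Lams) for p in points])
    modes, labels = [], np.empty(len(points), dtype=int)
    for idx, c in enumerate(converged):
        for m_idx, m in enumerate(modes):
            if np.linalg.norm(c - m) < merge_tol:          # close enough: same mode
                labels[idx] = m_idx
                break
        else:
            modes.append(c)                                 # a new mode
            labels[idx] = len(modes) - 1
    return np.array(modes), labels

def hierarchical_gm_lsldgc(X, theta, mus, Lam_sequence):
    """Level-wise labeling of Step 4: mode-seek the previous level's modes and
    propagate cluster labels, producing a nested partition P_1, ..., P_L.

    Lam_sequence: list of length L, each entry an array of shape (b, d, d).
    """
    M_prev = np.asarray(X, dtype=float)                     # M_0: the data samples
    P_prev = np.arange(len(M_prev))                         # P_0: each sample is its own cluster
    partitions = []
    for Lams in Lam_sequence:                                # precision sets for l = 1, ..., L
        M_cur, to_cur = mode_seek_all(M_prev, theta, mus, Lams)   # step (a)
        P_cur = to_cur[P_prev]                               # step (b): follow mode k to k'
        partitions.append(P_cur)
        M_prev, P_prev = M_cur, P_cur
    return partitions
```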
5. Experiments

This section performs numerical experiments both for mode-seeking clustering and for hierarchical clustering to demonstrate how the proposed methods work and to compare them with existing methods.

5.1 Mode-Seeking Clustering

Here, we illustrate the performance of the proposed method for mode-seeking clustering on both artificial and benchmark datasets.

5.1.1 Artificial Data

(1) Clustering performance and computational efficiency

We first generated artificial data from the following mixture of three Gaussians:

\[ p_{\mathrm{data}}(x) = \sum_{k=1}^{3} \alpha_k N(x \mid \mu_k, C_k), \]

where \(N(x \mid \mu_k, C_k)\) denotes the Gaussian density with mean \(\mu_k\) and covariance matrix \(C_k\). The mixing coefficients were \(\alpha_1 = 0.3\), \(\alpha_2 = 0.4\), and \(\alpha_3 = 0.3\). After generating data samples, we applied the following three methods:

• GM-LSLDGC: The proposed mode-seeking clustering method based on GM-LSLDG. We performed ten-fold cross-validation for choosing b from the candidates \(\{2, 3, \ldots, 8, 9\}\) with respect to \(\sum_{j=1}^{d} \widehat{J}_j\). We initialized \(\Lambda_i\) for all i as a diagonal matrix whose diagonals were sampled from the uniform distribution on [0.1, 1.0]. For \(\mu_i\), the j-th element was initialized by the uniform distribution on \([\min_i x_i^{(j)}, \max_i x_i^{(j)}]\).
• LSLDGC [18]: Mode-seeking clustering based on LSLDG††. All the hyper-parameters were determined by five-fold cross-validation.
• MS [4]-[6]: The mean shift clustering method. We employed kernel density estimation with a bandwidth matrix H (a small sketch of this baseline appears after this list) as

\[ p_H(x) := \frac{1}{nZ} \sum_{i=1}^{n} \exp\left( -\frac{1}{2}(x - x_i)^\top H^{-1} (x - x_i) \right), \]

where \(Z = (2\pi)^{d/2} |H|^{1/2}\). Based on the well-known normal reference rule [25], H was fixed at \(4 C n^{-2/(d+6)} / (d+4)\) with the sample covariance matrix C, which is optimal in terms of the asymptotic mean integrated squared error for the density gradient [26, Eq. (4) with r = 1].

†† The software is available at https://sites.google.com/site/hworksites/home/software/lsldg.
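For concreteness, this is a small sketch, not the authors' code, of the KDE baseline above: the bandwidth matrix is set by the normal reference rule \(H = 4 C n^{-2/(d+6)} / (d+4)\) stated in the text, and \(p_H(x)\) is evaluated directly. The function names are hypothetical.

```python
import numpy as np

def normal_reference_bandwidth(X):
    """Bandwidth matrix H = 4 * C * n**(-2/(d+6)) / (d + 4) with sample covariance C."""
    n, d = X.shape
    C = np.cov(X, rowvar=False)
    return 4.0 * C * n ** (-2.0 / (d + 6)) / (d + 4)

def kde_with_bandwidth_matrix(x_query, X, H):
    """Evaluate p_H(x) = (1 / (n Z)) * sum_i exp(-0.5 (x - x_i)^T H^{-1} (x - x_i))."""
    n, d = X.shape
    Z = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(H))
    Hinv = np.linalg.inv(H)
    diff = x_query[None, :] - X                               # (n, d)
    quad = np.einsum('nd,de,ne->n', diff, Hinv, diff)
    return np.exp(-0.5 * quad).sum() / (n * Z)
```

A typical use would be H = normal_reference_bandwidth(X) followed by evaluating kde_with_bandwidth_matrix at the points being updated by mean shift.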

Fig. 3  Comparison in terms of CPU time in seconds against (left) data dimension with n = 200 and (right) sample size with d = 2. The markers and error bars denote the means and standard deviations over 30 runs, respectively. GM-LSLDGC (no CV) denotes GM-LSLDGC without cross-validation, where the number of Gaussians is fixed at 4.

Fig. 1  Examples of mode-seeking clustering results.

Fig. 2  Clustering performance on artificial data against (left) correlation and (right) data dimension. The markers and error bars denote the means and standard deviations over 30 runs, respectively.

We measured the clustering performance by the adjusted Rand index (ARI) [27]: ARI takes a value less than or equal to one, a larger value indicates a better clustering result, and when a clustering result is perfect, the ARI value equals one.

First, to investigate how GM-LSLDGC works for densities with highly correlated structures, we set \(\mu_k\) and \(C_k\) in \(p_{\mathrm{data}}\) as follows: \(\mu_1 = (1.5, -3.5)^\top\), \(\mu_2 = (0.0, 0.0)^\top\) and \(\mu_3 = (-1.5, 3.5)^\top\), and

\[
C_1 = \begin{pmatrix} 6 & c \\ c & 1.8 \end{pmatrix}, \qquad
C_2 = \begin{pmatrix} 6 & -c \\ -c & 1.5 \end{pmatrix}, \qquad
C_3 = \begin{pmatrix} 6 & c \\ c & 1.6 \end{pmatrix},
\]

where c is a non-negative parameter related to the strength of the correlations. Here, we generated n = 300 samples. Figure 1 illustrates that GM-LSLDGC well-captures the clusters. On the other hand, LSLDGC produces many clusters, which implies that it is not easy to obtain a smooth gradient estimate with the spherical Gaussian kernel. The left plot in Fig. 2 quantitatively shows that GM-LSLDGC and MS perform well on the highly correlated structures because these methods employ non-spherical Gaussian functions, while LSLDGC gets worse as the correlation parameter c becomes larger.

Next, we demonstrate the performance of GM-LSLDGC on higher-dimensional data. Here, we set \(C_k\) in \(p_{\mathrm{data}}\) as follows: the diagonals in all \(C_k\) are all one, and the (1,2)-th and (2,1)-th elements in \(C_1\), the (1,3)-th and (3,1)-th elements in \(C_2\), and the (2,3)-th and (3,2)-th elements in \(C_3\) are 0.75, while the other elements in all \(C_k\) are zeros. Regarding \(\mu_k\), we appended zeros to the mean parameters of the previous experiment. n = 200 samples were generated. The right plot in Fig. 2 shows the superior performance of GM-LSLDGC on higher-dimensional data. LSLDGC works fairly well, but the performance of MS rapidly decreases, as previously demonstrated in [17], [18]. A possible reason is that, combined with the direct fitting, our model (6) is more adaptive thanks to the precision matrices \(\Lambda_i\) as well as the coefficient parameters \(\theta_i^{(j)}\).

Finally, we investigate the computational efficiency of GM-LSLDGC. To this end, we performed the same experiment as in the right plot of Fig. 2. Figure 3 shows that the computationally most efficient method is mean shift, and LSLDGC is also fairly efficient. On the other hand, GM-LSLDGC is the most computationally demanding method. This is due to the fact that, unlike mean shift and LSLDGC, an iterative optimization procedure is adopted to estimate the Gaussian mixture model, together with cross-validation for the number of Gaussians in the mixture model. The computational cost of GM-LSLDGC can be alleviated by not performing cross-validation (Fig. 3). This approach is promising because, as will be shown in Fig. 4, the good clustering performance is retained as long as a large enough number of Gaussians is used.

In short, GM-LSLDGC can be advantageous when the data are high-dimensional and include highly correlated local structures, while the superior clustering performance comes at the price of an increased computational cost. As a remedy, the computational cost could be alleviated by fixing the number of Gaussians in advance.
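As an illustration of the first artificial-data setup above (the three-Gaussian mixture with correlation parameter c) and of the ARI evaluation, here is a short sketch. The value of c, the use of k-means as a stand-in clusterer, and scikit-learn's adjusted_rand_score are our own choices for the demo, not part of the paper's protocol.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def sample_three_gaussian_mixture(n=300, c=0.6, seed=0):
    """Draw n samples from p_data with the means/covariances of the first experiment."""
    rng = np.random.default_rng(seed)
    weights = np.array([0.3, 0.4, 0.3])
    means = [np.array([1.5, -3.5]), np.array([0.0, 0.0]), np.array([-1.5, 3.5])]
    covs = [np.array([[6.0, c], [c, 1.8]]),
            np.array([[6.0, -c], [-c, 1.5]]),
            np.array([[6.0, c], [c, 1.6]])]
    components = rng.choice(3, size=n, p=weights)          # true cluster labels
    X = np.array([rng.multivariate_normal(means[k], covs[k]) for k in components])
    return X, components

if __name__ == "__main__":
    X, y_true = sample_three_gaussian_mixture(n=300, c=0.75)
    # Stand-in clusterer for the demo; the paper compares GM-LSLDGC, LSLDGC and MS here.
    y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print("ARI:", adjusted_rand_score(y_true, y_pred))
```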

Fig. 5  (Left) A cluster-merging process and (right) the dendrogram. The levels in the dendrogram correspond to the cluster representations in the left plots.

Table 1  The average and standard deviation of ARI values over 20 runs. Numbers in parentheses are standard deviations. The best and comparable methods judged by the unpaired t-test at the significance level 5% are shown in boldface. k denotes the true number of clusters.

Dataset                                    GM-LSLDGC      LSLDGC         Mean shift     k-means
Electricity (d = 2, n = 400, k = 4)        0.493 (0.083)  0.367 (0.035)  0.379 (0.014)  0.365 (0.055)
Yeast (d = 8, n = 250, k = 5)              0.662 (0.029)  0.651 (0.044)  0.118 (0.003)  0.643 (0.041)
Anuran Calls (d = 18, n = 150, k = 3)      0.298 (0.072)  0.255 (0.048)  0.000 (0.000)  0.237 (0.029)
Sports articles (d = 59, n = 200, k = 2)   0.224 (0.068)  0.323 (0.116)  0.000 (0.000)  0.000 (0.000)

Fig. 4  Clustering performance against the number of Gaussians. The same two-dimensional data as in Fig. 1 was used. The markers and error bars denote the means and standard deviations over 30 runs, respectively.

(2) Influence of the number of Gaussians

Here, we show how the number of Gaussians affects the clustering performance of GM-LSLDGC. In this experiment, we fixed the number of Gaussians and performed GM-LSLDGC without cross-validation. The same two-dimensional data as in Fig. 1 was used. Figure 4 indicates that accurate clustering is possible as long as a large enough number of Gaussians is used. This would be because a larger number of Gaussians can potentially approximate a wider range of functions, and thus a better gradient estimate is obtained with an enough number of Gaussians. This property of GM-LSLDGC is in stark contrast with standard mixture-model-based clustering, where the number of clusters corresponds to the number of mixture components [7]. Therefore, with GM-LSLDGC it would be easier to determine the number of Gaussians.

5.1.2 Benchmark Data

Here, we investigate the clustering performance on benchmark datasets available at the UCI machine learning repository†. Since we used classification datasets, we regarded the class labels as the true cluster labels. In addition to the three methods in the previous experiments, we applied the k-means clustering method, where the number of clusters was fixed at the true cluster number. Before applying the clustering methods, data samples were standardized by subtracting the sample mean and normalizing with the sample variance.

Table 1 shows that GM-LSLDGC performs better than MS and k-means, and compares favorably with LSLDGC on the benchmark datasets.

5.2 Hierarchical Clustering

Finally, we perform the experiments for hierarchical clustering. To first visualize the cluster-merging process in our method, artificial data were generated as in Fig. 1. The left plot in Fig. 5 shows the cluster-merging process. Initially, all the data points are regarded as individual clusters, but as the hierarchy level becomes higher, the clusters are merged. This merging process is summarized in the dendrogram of Fig. 5.

Next, we compare our hierarchical clustering method with HMAC on artificial and benchmark datasets. As a performance measure for hierarchical clustering, we used the FScore [28], which takes a value between zero and one and whose larger value means better hierarchical clustering. For HMAC, we downloaded the software at http://personal.psu.edu/jol2/hmac/ and used it with the default settings.

Table 2 shows FScores on artificial and benchmark datasets downloaded from the UCI machine learning repository. Our hierarchical clustering method obtains higher FScore values than HMAC. This could be due to the fact that our method employs a set of precision matrices, while HMAC uses a set of bandwidth parameters. Thus, our method could better capture locally correlated structures than HMAC.

† https://archive.ics.uci.edu/ml/index.php

Table 2  The average and standard deviation of FScores over 20 runs. Numbers in parentheses are standard deviations. The best and comparable methods judged by the unpaired t-test at the significance level 5% are shown in boldface. h-GM-LSLDGC indicates the proposed hierarchical clustering method. k denotes the true number of clusters.

Dataset                                   h-GM-LSLDGC    HMAC
Artificial Data (d = 2, n = 350, k = 3)   0.889 (0.163)  0.651 (0.217)
Electricity (d = 2, n = 400, k = 4)       0.467 (0.294)  0.230 (0.091)
Iris (d = 4, n = 90, k = 3)               0.595 (0.174)  0.547 (0.036)

6. Conclusion

In this paper, we proposed a new estimator of the gradient of a logarithmic probability density function. In contrast with an existing estimator, we proposed to use Gaussian mixture models, which enable the new estimator to flexibly capture highly correlated structures in the probability density function. We applied the proposed estimator to mode-seeking clustering, and developed a fixed-point algorithm for mode-seeking. Our experimental results showed that our mode-seeking clustering method is advantageous particularly when the dimensionality of the data is relatively high and the clusters have highly correlated structures, although the good performance comes at the cost of making our clustering method computationally more demanding. A promising remedy is to use a large enough number of Gaussians without performing cross-validation, which could decrease the computational cost without sacrificing the clustering performance. Finally, we extended our clustering method to hierarchical clustering and demonstrated its usefulness.

This paper focused only on mode-seeking clustering and hierarchical clustering as applications of the new gradient estimator. However, the proposed estimator is potentially applicable to other problems such as unsupervised dimensionality reduction [1], modal regression [2] and so on. In the future, we will explore new applications of the proposed estimator.

References

[1] H. Sasaki, G. Niu, and M. Sugiyama, "Non-Gaussian component analysis with log-density gradient estimation," Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), pp.1177-1185, 2016.
[2] J. Einbeck and G. Tutz, "Modelling beyond regression functions: an application of multimodal regression to speed-flow data," Journal of the Royal Statistical Society: Series C (Applied Statistics), vol.55, no.4, pp.461-475, 2006.
[3] R. Singh, "Applications of estimators of a density and its derivatives to certain statistical problems," Journal of the Royal Statistical Society: Series B, vol.39, no.3, pp.357-363, 1977.
[4] K. Fukunaga and L. Hostetler, "The estimation of the gradient of a density function, with applications in pattern recognition," IEEE Transactions on Information Theory, vol.21, no.1, pp.32-40, 1975.
[5] Y. Cheng, "Mean shift, mode seeking, and clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.17, no.8, pp.790-799, 1995.
[6] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.24, no.5, pp.603-619, 2002.
[7] V. Melnykov and R. Maitra, "Finite mixture models and model-based clustering," Statistics Surveys, vol.4, pp.80-116, 2010.
[8] W. Tao, H. Jin, and Y. Zhang, "Color image segmentation based on mean shift and normalized cuts," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol.37, no.5, pp.1382-1389, 2007.
[9] J. Wang, B. Thiesson, Y. Xu, and M. Cohen, "Image and video segmentation by anisotropic kernel mean shift," Proceedings of the European Conference on Computer Vision (ECCV), vol.3022, pp.238-249, 2004.
[10] R.T. Collins, "Mean-shift blob tracking through scale space," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp.234-240, 2003.
[11] D. Comaniciu, V. Ramesh, and P. Meer, "Real-time tracking of non-rigid objects using mean shift," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.142-149, 2000.
[12] M. Carreira-Perpiñán, "A review of mean-shift algorithms for clustering," arXiv preprint arXiv:1503.00687, 2015.
[13] R. Subbarao and P. Meer, "Nonlinear mean shift for clustering over analytic manifolds," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp.1168-1175, 2006.
[14] R. Subbarao and P. Meer, "Nonlinear mean shift over Riemannian manifolds," International Journal of Computer Vision, vol.84, no.1, pp.1-20, 2009.
[15] J. Li, S. Ray, and B.G. Lindsay, "A nonparametric statistical approach to clustering via mode identification," Journal of Machine Learning Research, vol.8, pp.1687-1723, 2007.
[16] D.D. Cox, "A penalty method for nonparametric estimation of the logarithmic derivative of a density function," Annals of the Institute of Statistical Mathematics, vol.37, no.1, pp.271-288, 1985.
[17] H. Sasaki, A. Hyvärinen, and M. Sugiyama, "Clustering via mode seeking by direct estimation of the gradient of a log-density," Machine Learning and Knowledge Discovery in Databases, Part III, European Conference ECML/PKDD 2014, pp.19-34, 2014.
[18] H. Sasaki, T. Kanamori, A. Hyvärinen, G. Niu, and M. Sugiyama, "Mode-seeking clustering and density ridge estimation via direct estimation of density-derivative-ratios," Journal of Machine Learning Research, vol.18, no.180, 2018.
[19] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds, Princeton University Press, 2008.
[20] M. Carreira-Perpiñán, "Gaussian mean-shift is an EM algorithm," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.29, no.5, pp.767-776, 2007.
[21] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre, "Manopt, a Matlab toolbox for optimization on manifolds," Journal of Machine Learning Research, vol.15, pp.1455-1459, 2014.
[22] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[23] Y.-C. Chen, C.R. Genovese, and L. Wasserman, "A comprehensive approach to mode clustering," Electronic Journal of Statistics, vol.10, no.1, pp.210-241, 2016.
[24] T. Hastie, R. Tibshirani, and J. Friedman, "Ensemble Learning," The Elements of Statistical Learning, pp.605-624, Springer, 2009.
[25] B. Silverman, Density Estimation for Statistics and Data Analysis, CRC Press, 1986.
[26] J.E. Chacón, T. Duong, and M. Wand, "Asymptotics for general multivariate kernel density derivative estimators," Statistica Sinica, vol.21, pp.807-840, 2011.
[27] L. Hubert and P. Arabie, "Comparing partitions," Journal of Classification, vol.2, no.1, pp.193-218, 1985.
[28] B. Larsen and C. Aone, "Fast and effective text mining using linear-time document clustering," Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.16-22, ACM, 1999.
[29] H. Sasaki and A. Hyvärinen, "Neural-kernelized conditional density estimation," arXiv preprint arXiv:1806.01754, 2018.
[30] J. Park and I. Sandberg, "Universal approximation using radial-basis-function networks," Neural Computation, vol.3, no.2, pp.246-257, 1991.

Appendix A: Accuracy of Mode Estimation

Let us define a mode \(m_j\) of the true logarithmic probability density function as a point satisfying

\[ g(m_j) = 0 \quad \text{and} \quad \nabla_x g(m_j) \prec O, \quad (A\cdot 1) \]

where \(g(x) := \nabla_x \log p(x)\), \(\nabla_x g(x) := \nabla_x \nabla_x^\top \log p(x)\), and \(\nabla_x g(m_j) \prec O\) means that \(\nabla_x g(x)\) is negative definite at \(x = m_j\). Here, we follow the analysis in [18], [23]. First of all, let us make two assumptions: (1) the matrix \(\nabla_x \widehat{g}^{\mathrm{GM}}(m_j)\) is invertible, and (2) each mode \(m_j\) is uniquely approximated by an estimated mode \(\widehat{m}_j\), which is defined as a point satisfying

\[ \widehat{g}^{\mathrm{GM}}(\widehat{m}_j) = 0 \quad \text{and} \quad \nabla_x \widehat{g}^{\mathrm{GM}}(\widehat{m}_j) \prec O. \quad (A\cdot 2) \]

With the definition (A·2), a Taylor expansion yields

\[
\widehat{g}^{\mathrm{GM}}(m_j) = \widehat{g}^{\mathrm{GM}}(m_j) - \widehat{g}^{\mathrm{GM}}(\widehat{m}_j)
= \nabla_x \widehat{g}^{\mathrm{GM}}(m_j)(m_j - \widehat{m}_j) + o(\| m_j - \widehat{m}_j \|). \quad (A\cdot 3)
\]

On the other hand, by the definition (A·1), we have

\[ \widehat{g}^{\mathrm{GM}}(m_j) = \widehat{g}^{\mathrm{GM}}(m_j) - g(m_j). \quad (A\cdot 4) \]

Combining (A·3) with (A·4) yields

\[ \| \widehat{m}_j - m_j \| \le O\bigl( \| \widehat{g}^{\mathrm{GM}}(m_j) - g(m_j) \| \bigr). \quad (A\cdot 5) \]

Thus, a mode \(m_j\) can be well-approximated by a mode estimate \(\widehat{m}_j\) if \(\widehat{g}^{\mathrm{GM}}\) is an accurate estimate of g around \(m_j\).

Since we directly fit a model \(g^{\mathrm{GM}}\) to g under the squared loss, \(\widehat{g}^{\mathrm{GM}}\) converges to g as the sample size n goes to infinity if \(g^{\mathrm{GM}}\) and g belong to the same function set (see Proposition 3 in [29] for a more rigorous analysis based on the same squared loss). Thus, we next need to confirm whether the function set of \(g^{\mathrm{GM}}\) is large enough. The Gaussian mixture model in (6) includes radial basis function (RBF) networks with Gaussian functions as a special case, and RBF networks with a sufficiently large number of Gaussians have the universal approximation capability under some conditions [30]. Since \(g^{\mathrm{GM}}\) is the gradient of the Gaussian mixture model, \(g^{\mathrm{GM}}\) with a sufficiently large number of Gaussians could well-approximate a wide range of continuous g as the sample size n approaches infinity. To make a more rigorous statement, we need to establish the consistency of \(\widehat{g}^{\mathrm{GM}}\) under the uniform norm \(\max_j \sup_x |\widehat{g}_j^{\mathrm{GM}}(x) - g_j(x)|\), but this is beyond the scope of this paper. Therefore, we leave this challenge for our future work.

Zhang Qi received his B.E. in Traffic and Transportation from Henan University of Science and Technology, China, and his M.E. in Information Science from Nara Institute of Science and Technology, Japan, in 2014 and 2018, respectively. He is currently pursuing the doctoral degree at the Graduate University for Advanced Studies, Japan. His research interests include applications and theories of machine learning.

Hiroaki Sasaki received his B.E. in Applied Physics, and his M.E. and D.E. in Electrical and Communication Engineering from Tohoku University in 2006, 2008 and 2011, respectively. He was a research fellow of the Japan Society for the Promotion of Science from 2011 to 2014. He was a postdoctoral researcher with the Graduate School of Information Science and Engineering, Tokyo Institute of Technology for six months in 2014, and with the Graduate School of Frontier Sciences, the University of Tokyo from 2014 to 2015. Since 2016, he has been an assistant professor at Nara Institute of Science and Technology. His research interests include algorithms and theories of machine learning.

Kazushi Ikeda received his B.E., M.E., and Ph.D. in mathematical engineering and information physics from the University of Tokyo in 1989, 1991, and 1994. He was a research associate with the Department of Electrical and Computer Engineering of Kanazawa University from 1994 to 1998. He was a research associate of the Chinese University of Hong Kong for three months in 1995. He was with the Graduate School of Informatics, Kyoto University, as an associate professor from 1998 to 2008. Since 2008, he has been a full professor at Nara Institute of Science and Technology. He was the editor-in-chief of the Journal of the Japanese Neural Network Society, and is currently an action editor of Neural Networks and an associate editor of IEEE Transactions on Neural Networks and Learning Systems.