The Burbea-Rao and Bhattacharyya centroids
Frank Nielsen, Senior Member, IEEE, and Sylvain Boltz, Nonmember, IEEE
IEEE Transactions on Information Theory, 57(8):5455-5466, 2011.

Abstract—We study the centroid with respect to the class of information-theoretic Burbea-Rao divergences that generalize the celebrated Jensen-Shannon divergence by measuring the non-negative Jensen difference induced by a strictly convex and differentiable function. Although those Burbea-Rao divergences are symmetric by construction, they are not metrics since they fail to satisfy the triangle inequality. We first explain how a particular symmetrization of Bregman divergences, called Jensen-Bregman distances, yields exactly those Burbea-Rao divergences. We then proceed by defining skew Burbea-Rao divergences, and show that skew Burbea-Rao divergences amount in limit cases to computing Bregman divergences. We then prove that Burbea-Rao centroids are unique, and can be arbitrarily finely approximated by a generic iterative concave-convex optimization algorithm with guaranteed convergence. In the second part of the paper, we consider the Bhattacharyya distance, which is commonly used to measure the degree of overlap of probability distributions. We show that Bhattacharyya distances between members of the same statistical exponential family amount to calculating a Burbea-Rao divergence in disguise. We thus obtain an efficient algorithm for computing the Bhattacharyya centroid of a set of parametric distributions belonging to the same exponential family, improving over former specialized methods found in the literature that were limited to univariate or "diagonal" multivariate Gaussians. To illustrate the performance of our Bhattacharyya/Burbea-Rao centroid algorithm, we present experimental performance results for k-means and hierarchical clustering of Gaussian mixture models.

Index Terms—Centroid, Kullback-Leibler divergence, Jensen-Shannon divergence, Burbea-Rao divergence, Bregman divergences, Exponential families, Bhattacharyya divergence, Information geometry.

(F. Nielsen is with the Department of Fundamental Research of Sony Computer Science Laboratories, Inc., Tokyo, Japan, and the Computer Science Department (LIX) of École Polytechnique, Palaiseau, France. S. Boltz is with the Computer Science Department (LIX) of École Polytechnique, Palaiseau, France. Manuscript received April 2010, revised April 2012. This revision includes in the appendix a proof of the uniqueness of the centroid. arXiv:1004.5049v3 [cs.IT] 19 Apr 2012.)

I. INTRODUCTION

A. Means and centroids

In Euclidean geometry, the centroid c of a point set P = {p_1, ..., p_n} is defined as the center of mass (1/n) \sum_{i=1}^n p_i, also characterized as the point that minimizes the average squared Euclidean distance: c = \arg\min_p \sum_{i=1}^n \frac{1}{n}\|p - p_i\|^2. This basic notion of Euclidean centroid can be extended to denote a mean point M(P) representing the centrality of a given point set P. There are basically two complementary approaches to define mean values of numbers: (1) by axiomatization, or (2) by optimization, summarized concisely as follows:

- By axiomatization. This approach was historically pioneered by the independent work of Kolmogorov [1] and Nagumo [2] in 1930, and simplified and refined later by Aczél [3]. Without loss of generality we consider the mean of two non-negative numbers x_1 and x_2, and postulate the following expected behaviors of a mean function M(x_1, x_2) as axioms (common sense):
  - Reflexivity. M(x, x) = x,
  - Symmetry. M(x_1, x_2) = M(x_2, x_1),
  - Continuity and strict monotonicity. M(·, ·) continuous and M(x_1, x_2) < M(x_1', x_2) for x_1 < x_1', and
  - Anonymity. M(M(x_{11}, x_{12}), M(x_{21}, x_{22})) = M(M(x_{11}, x_{21}), M(x_{12}, x_{22})) (also called bisymmetry, expressing the fact that the mean can be computed as a mean of the row means or, equivalently, as a mean of the column means).

  Then one can show that the mean function M(·, ·) is necessarily written as

  M(x_1, x_2) = f^{-1}\left(\frac{f(x_1) + f(x_2)}{2}\right) \stackrel{\mathrm{def}}{=} M_f(x_1, x_2),   (1)

  for a strictly increasing function f. The arithmetic mean \frac{x_1+x_2}{2}, the geometric mean \sqrt{x_1 x_2} and the harmonic mean \frac{2}{1/x_1 + 1/x_2} are instances of such generalized means, obtained for f(x) = x, f(x) = \log x and f(x) = 1/x, respectively. Those generalized means are also called quasi-arithmetic means, since they can be interpreted as the arithmetic mean on the sequence f(x_1), ..., f(x_n), the f-representation of the numbers. To get geometric centroids, we simply consider means on each coordinate axis independently. The Euclidean centroid is thus interpreted as the Euclidean arithmetic mean. Barycenters (weighted centroids) are similarly obtained using non-negative weights (normalized so that \sum_{i=1}^n w_i = 1):

  M_f(x_1, ..., x_n; w_1, ..., w_n) = f^{-1}\left(\sum_{i=1}^n w_i f(x_i)\right).   (2)

  Those generalized means satisfy the inequality property

  M_f(x_1, ..., x_n; w_1, ..., w_n) \leq M_g(x_1, ..., x_n; w_1, ..., w_n),   (3)

  if and only if function g dominates f, that is, \forall x, g(x) > f(x). Therefore the arithmetic mean (f(x) = x) dominates the geometric mean (f(x) = \log x), which in turn dominates the harmonic mean (f(x) = 1/x). Note that the inequality in Eq. 3 is not strict, as the means coincide for identical elements: if all x_i are equal to x then M_f(x_1, ..., x_n) = f^{-1}(f(x)) = x = g^{-1}(g(x)) = M_g(x_1, ..., x_n). A short numerical sketch of these generalized means is given below.
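To make the quasi-arithmetic mean construction of Eqs. (1)-(2) concrete, here is a minimal Python sketch (function and variable names are ours, for illustration only); it also checks the dominance inequality of Eq. (3) on a toy example.

```python
import numpy as np

# Minimal sketch of weighted quasi-arithmetic (generalized) means, Eqs. (1)-(2).
def quasi_arithmetic_mean(x, w, f, finv):
    """M_f(x; w) = f^{-1}(sum_i w_i f(x_i)) for normalized weights w."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    return finv(np.dot(w, f(x)))

x = np.array([1.0, 4.0, 9.0])
w = np.ones(3) / 3.0

arithmetic = quasi_arithmetic_mean(x, w, lambda t: t, lambda t: t)        # f(x) = x
geometric  = quasi_arithmetic_mean(x, w, np.log, np.exp)                  # f(x) = log x
harmonic   = quasi_arithmetic_mean(x, w, lambda t: 1.0 / t, lambda t: 1.0 / t)  # f(x) = 1/x

# Dominance (Eq. 3) and the interness property discussed next:
# harmonic <= geometric <= arithmetic, and every mean lies in [min(x), max(x)].
assert harmonic <= geometric <= arithmetic
assert x.min() <= harmonic and arithmetic <= x.max()
```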

All those quasi-arithmetic means further satisfy the "interness" property

min(x_1, ..., x_n) \leq M_f(x_1, ..., x_n) \leq max(x_1, ..., x_n),   (4)

derived from the limit cases p \to \pm\infty of the power means obtained for f(x) = x^p, with p \in \mathbb{R}^* = (-\infty, \infty) \setminus \{0\} a non-zero real number.

(Footnote 1: Besides the min/max operators interpreted as extremal power means, the geometric mean itself can also be interpreted as a limit of power means in the limit case p \to 0.)

- By optimization. In this second, alternative approach, the barycenter c is defined according to a distance function d(·, ·) as the optimal solution of a minimization problem

  (OPT): \min_x \sum_{i=1}^n w_i d(x, p_i) = \min_x L(x; P, d),   (5)

  where the non-negative weights w_i denote the multiplicity or relative importance of the points (by default, the centroid is defined by fixing all w_i = 1/n). Ben-Tal et al. [4] considered an information-theoretic class of distances called f-divergences [5], [6]:

  I_f(x, p) = p\, f\!\left(\frac{x}{p}\right),   (6)

  for a strictly convex differentiable function f(·) satisfying f(1) = 0 and f'(1) = 0. Although those f-divergences were primarily investigated for probability measures (Footnote 2: In that context, a d-dimensional point is interpreted as a discrete and finite probability measure lying in the (d-1)-dimensional unit simplex.), we can extend the f-divergence to positive measures. Since program (OPT) is strictly convex in x, it admits a unique minimizer M(P; I_f) = \arg\min_x L(x; P, I_f), termed the entropic mean by Ben-Tal et al. [4]. Interestingly, those entropic means are linear scale-invariant (Footnote 3: That is, means of homogeneous degree 1.):

  M(\lambda p_1, ..., \lambda p_n; I_f) = \lambda M(p_1, ..., p_n; I_f).   (7)

  Nielsen and Nock [7] considered another class of information-theoretic distortion measures B_F called Bregman divergences [8], [9]:

  B_F(x, p) = F(x) - F(p) - (x - p)F'(p),   (8)

  for a strictly convex differentiable function F. It follows that (OPT) is convex, and admits a unique minimizer M(p_1, ..., p_n; B_F) = M_{F'}(p_1, ..., p_n), a quasi-arithmetic mean for the strictly increasing and continuous function F', the derivative of F. Observe that information-theoretic distances may be asymmetric (i.e., d(x, p) \neq d(p, x)), and therefore one may also define a right-sided centroid M' as the minimizer of

  (OPT'): \min_x \sum_{i=1}^n w_i d(p_i, x).   (9)

  It turns out that for f-divergences, we have

  I_f(x, p) = I_{f^*}(p, x),   (10)

  for f^*(x) = x f(1/x), so that (OPT') is solved as an (OPT) problem for the conjugate function f^*(·). In the same spirit, we have

  B_F(x, p) = B_{F^*}(F'(p), F'(x))   (11)

  for Bregman divergences, where F^* denotes the Legendre convex conjugate [8], [9]. (Footnote 4: Legendre dual convex conjugates F and F^* have necessarily reciprocal gradients: F^{*\prime} = (F')^{-1}. See [7].) Surprisingly, although (OPT') may not be convex in x for Bregman divergences (e.g., F(x) = -\log x), (OPT') nevertheless admits a unique minimizer, independent of the generator function F: the center of mass M'(P; B_F) = \frac{1}{n}\sum_{i=1}^n p_i. Bregman means are not homogeneous except for the power generators F(x) = x^p, which yield entropic means, i.e. means that can also be interpreted as minimizers of average f-divergences [4]. (Footnote 5: In fact, Amari [10] proved that the intersection of the class of f-divergences with the class of Bregman divergences is the class of α-divergences.) Amari [11] further studied those power means (known as α-means in information geometry [12]), and showed that they are linear scale-free means obtained as minimizers of α-divergences, a proper subclass of f-divergences. Nielsen and Nock [13] reported an alternative, simpler proof of α-means by showing that the α-divergences are Bregman divergences in disguise (namely, representational Bregman divergences for positive measures, but not for normalized distribution measures [10]).

  To get geometric centroids, we simply consider multivariate extensions of the optimization task (OPT). In particular, one may consider separable divergences, that is, divergences that can be assembled coordinate-wise:

  d(x, p) = \sum_{i=1}^d d_i(x^{(i)}, p^{(i)}),   (12)

  with x^{(i)} denoting the ith coordinate. A typical non-separable divergence is the squared Mahalanobis distance [14]:

  d(x, p) = (x - p)^T Q (x - p),   (13)

  a Bregman divergence called generalized quadratic distance, defined for the generator F(x) = x^T Q x, where Q is a positive-definite matrix (Q \succ 0). For separable distances, the optimization problem (OPT) may then be reinterpreted as the task of finding the projection [15] of a point p (of dimension d \times n) onto the upper line U:

  (PROJ): \inf_{u \in U} d(u, p),   (14)

  with u_1 = ... = u_{d \times n} > 0, and p the (n \times d)-dimensional point obtained by stacking the d coordinates of each of the n points.

In geometry, means (centroids) play a crucial role in center-based clustering (i.e., k-means [16] for vector quantization applications). Indeed, the mean of a cluster allows one to aggregate data into a single center datum. Thus the notion of means is encapsulated into the broader theory of mathematical aggregators [17].

Results on geometric means can easily be transferred to the field of statistics [4] by generalizing the optimization task to a random variable X with distribution F as

(OPT): \min_x E[X d(x, X)] = \min_x \int_t t\, d(x, t)\, dF(t),   (15)

where E[·] denotes the expectation defined with respect to the Lebesgue-Stieltjes integral. Although this approach is discussed in [4] and important for defining various notions of centrality in statistics, we shall not cover this extended framework here, for the sake of brevity.

B. Burbea-Rao divergences

In this paper, we focus on the optimization approach (OPT) for defining other (geometric) means, using the class of information-theoretic distances obtained from the Jensen difference of a strictly convex and differentiable function F:

d(x, p) = \frac{F(x) + F(p)}{2} - F\!\left(\frac{x+p}{2}\right) \stackrel{\mathrm{def}}{=} BR_F(x, p) \geq 0.   (16)

Since the underlying differential geometry implied by those Jensen difference distances was seminally studied in papers of Burbea and Rao [18], [19], we shall term them Burbea-Rao divergences, and denote them by BR_F. In the remainder, we consider separable Burbea-Rao divergences. That is, for d-dimensional points p and q, we define

BR_F(p, q) = \sum_{i=1}^d BR_F(p^{(i)}, q^{(i)}),   (17)

and study the Burbea-Rao centroids (and barycenters) as the minimizers of the average Burbea-Rao divergence. Those Burbea-Rao divergences generalize the celebrated Jensen-Shannon divergence [20]

JS(p, q) = H\!\left(\frac{p+q}{2}\right) - \frac{H(p) + H(q)}{2}   (18)

by choosing F(x) = -H(x), the negative Shannon entropy, with H(x) = -x \log x. Generators F(·) of parametric distances are convex functions representing entropies, which are concave functions. Burbea-Rao divergences contain all generalized quadratic distances (F(x) = x^T Q x = \langle Qx, x\rangle for a positive definite matrix Q \succ 0, also called squared Mahalanobis distances):

BR_F(p, q) = \frac{F(p) + F(q)}{2} - F\!\left(\frac{p+q}{2}\right)
= \frac{2\langle Qp, p\rangle + 2\langle Qq, q\rangle - \langle Q(p+q), p+q\rangle}{4}
= \frac{1}{4}\left(\langle Qp, p\rangle + \langle Qq, q\rangle - 2\langle Qp, q\rangle\right)
= \frac{1}{4}\langle Q(p-q), p-q\rangle = \frac{1}{4}\|p - q\|_Q^2.

Although the square root of the Jensen-Shannon divergence yields a metric (a Hilbertian metric), this is not true in general for Burbea-Rao divergences. The closest work to our paper is a 1-page symposium paper [21] discussing Ali-Silvey-Csiszár f-divergences [5], [6] and Bregman divergences [22], [8] (two entropy-based divergence classes). Those information-theoretic distortion classes are compared using quadratic differential metrics, mean values and projections. The notion of skew Jensen differences intervenes in the discussion.

(Footnote 6: In the nineties, the IEEE International Symposium on Information Theory (ISIT) published only 1-page papers. We are grateful to Prof. Michèle Basseville for sending us the corresponding slides.)
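The separable Burbea-Rao divergence of Eqs. (16)-(17) is easy to instantiate; the short Python sketch below (our own helper names, not from the paper) uses the negative Shannon entropy generator F(x) = x log x, so that it reduces to the Jensen-Shannon divergence of Eq. (18).

```python
import numpy as np

def burbea_rao(p, q, F):
    """BR_F(p, q) = (F(p) + F(q))/2 - F((p + q)/2), summed coordinate-wise (Eq. 17)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum((F(p) + F(q)) / 2.0 - F((p + q) / 2.0))

neg_shannon = lambda x: x * np.log(x)   # F(x) = x log x, the negative Shannon entropy

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

js = burbea_rao(p, q, neg_shannon)      # Jensen-Shannon divergence of Eq. (18)
# Symmetric and non-negative by Jensen's inequality, but not a metric.
assert js >= 0.0 and np.isclose(js, burbea_rao(q, p, neg_shannon))
```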

C. Contributions and paper organization

The paper is articulated into two parts: the first part studies the Burbea-Rao centroids, and the second part shows some applications in Statistics. We summarize our contributions as follows:

- We define the parametric class of (skew) Burbea-Rao divergences, and show that those divergences naturally arise when generalizing the principle of the Jensen-Shannon divergence [20] to Jensen-Bregman divergences. In the limit cases, we further prove that those skew Burbea-Rao divergences yield asymptotically Bregman divergences.
- We show that the centroids with respect to the (skew) Burbea-Rao divergences are unique. Except for special cases of Burbea-Rao divergences (including the squared Euclidean distance), those centroids are not available in closed-form equations. However, we show that any Burbea-Rao centroid can be estimated efficiently using an iterative convex-concave optimization procedure. As a by-product, we find the Bregman sided centroids [7] in closed form in the extremal skew cases.

We then consider applications of Burbea-Rao centroids in Statistics, and show the link with Bhattacharyya distances. A wide class of statistical parametric models can be handled in a unified manner as exponential families [23]. The classes of exponential families contain many of the standard parametric models, including the Poisson, Gaussian, multinomial, and Gamma/Beta distributions, to name a few prominent members. However, only a few closed-form formulas for the statistical Bhattacharyya distances between those densities are reported in the literature. (Footnote 7: For instance, the Bhattacharyya distance between multivariate normal distributions is given in [24].)

For the second part, our contributions are reviewed as follows:

- We show that the (skew) Bhattacharyya distances calculated for distributions belonging to the same exponential family in Statistics are equivalent to (skew) Burbea-Rao divergences. We mention the corresponding closed-form formulas for computing Chernoff coefficients and α-divergences of exponential families. In the limit case, we obtain an alternative proof showing that the Kullback-Leibler divergence of members of the same exponential family is equivalent to a Bregman divergence calculated on the natural parameters [14].
- We approximate iteratively the Bhattacharyya centroid of any set of distributions of the same exponential family (including multivariate Gaussians) using the Burbea-Rao centroid algorithm. For the case of multivariate Gaussians, we design yet another tailored iterative scheme based on matrix differentials, generalizing the former univariate study of Rigazio et al. [25]. We thus get either the generic way or the tailored way for computing the Bhattacharyya centroids of arbitrary Gaussians.
- As a field application, we show how to simplify Gaussian mixture models using hierarchical clustering, and show experimentally that the results obtained with the Bhattacharyya centroids compare favorably with former results obtained for Bregman centroids [26]. Our numerical experiments show that the generic method outperforms the alternative tailored method for multivariate Gaussians.

The paper is organized as follows. In Section II, we introduce Burbea-Rao divergences as a natural extension of the Jensen-Shannon divergence using the framework of Bregman divergences. It is followed by Section III, which considers the general case of skew divergences and reveals the asymptotic behavior of extremely skewed Burbea-Rao divergences as Bregman divergences. Section IV defines the (skew) Burbea-Rao centroids, shows that they are unique, and presents a simple iterative algorithm with guaranteed convergence. We then consider applications in Statistics in Section V: after briefly recalling exponential family distributions in Section V-A, we show that Bhattacharyya distances and Chernoff/Amari α-divergences are available in closed-form equations as Burbea-Rao divergences for distributions of the same exponential family. Section V-C presents an alternative iterative algorithm tailored to compute the Bhattacharyya centroid of multivariate Gaussians, generalizing the former specialized work of Rigazio et al. [25]. In Section V-D, we use those Bhattacharyya/Burbea-Rao centroids to hierarchically simplify Gaussian mixture models, and comment both qualitatively and quantitatively on our experiments on a color image segmentation application. Finally, Section VI concludes this paper by describing further perspectives and hinting at some information-geometric aspects of this work.

II. BURBEA-RAO DIVERGENCES FROM SYMMETRIZATION OF BREGMAN DIVERGENCES

Let R^+ = [0, +\infty) denote the set of non-negative reals. For a strictly convex (and differentiable) generator F, we define the Burbea-Rao divergence as the following non-negative function:

BR_F: X \times X \to R^+,
(p, q) \mapsto BR_F(p, q) = \frac{F(p) + F(q)}{2} - F\!\left(\frac{p+q}{2}\right) \geq 0.

The non-negativity of those divergences follows straightforwardly from Jensen's inequality. Although Burbea-Rao distances are symmetric (BR_F(p, q) = BR_F(q, p)), they are not metrics since they fail to satisfy the triangle inequality. A geometric interpretation of those divergences is given in Figure 1. Note that F is defined up to an affine term ax + b.

[Fig. 1. Interpreting the Burbea-Rao divergence BR_F(p, q) as the vertical distance between the midpoint of the segment [(p, F(p)), (q, F(q))] and the point ((p+q)/2, F((p+q)/2)) on the graph of F.]

We show that Burbea-Rao divergences extend the Jensen-Shannon divergence using the broader concept of Bregman divergences instead of the Kullback-Leibler divergence. A Bregman divergence [22], [8], [9] B_F is defined as the positive tail of the first-order Taylor expansion of a strictly convex and differentiable function F:

B_F(p, q) = F(p) - F(q) - \langle p - q, \nabla F(q)\rangle,   (19)

where \nabla F denotes the gradient of F (the vector of partial derivatives \frac{\partial F}{\partial x_i}), and \langle x, y\rangle = x^T y the inner product (dot product for vectors). A Bregman divergence is interpreted geometrically [14] as the vertical distance between the tangent plane H_q at q of the graph plot F = \{\hat{x} = (x, F(x)) \mid x \in X\} and its translate H'_q passing through \hat{p} = (p, F(p)). Figure 2 depicts the geometric interpretation of the Bregman divergence (to be compared with the Burbea-Rao divergence in Figure 1).

[Fig. 2. Interpreting the Bregman divergence B_F(p, q) as the vertical distance between the tangent plane at q to the graph of F and its translate passing through (p, F(p)) (with identical slope \nabla F(q)).]

Bregman divergences are never metrics, and they are symmetric only for the generalized quadratic distances [14] obtained by choosing F(x) = x^T Q x, for some positive definite matrix Q \succ 0. Bregman divergences allow one to encapsulate both statistical and geometric distances:

- the Kullback-Leibler divergence, obtained for F(x) = x \log x:

  KL(p, q) = \sum_{i=1}^d p^{(i)} \log\frac{p^{(i)}}{q^{(i)}},   (20)

- the squared Euclidean distance, obtained for F(x) = x^2:

  L_2^2(p, q) = \sum_{i=1}^d (p^{(i)} - q^{(i)})^2 = \|p - q\|^2.   (21)

Basically, there are two ways to symmetrize Bregman divergences (see also work on Bregman metrization [27], [28]):

- Jeffreys-Bregman divergences. We consider half of the double-sided divergences:

  S_F(p; q) = \frac{B_F(p, q) + B_F(q, p)}{2}   (22)
  = \frac{1}{2}\langle p - q, \nabla F(p) - \nabla F(q)\rangle.   (23)

  Except for the generalized quadratic distances, this symmetric distance cannot be interpreted as a Bregman divergence [14].

- Jensen-Bregman divergences. We consider the Jeffreys-Bregman divergences from the source parameters to the average parameter \frac{p+q}{2}:

  J_F(p; q) \stackrel{\mathrm{def}}{=} \frac{B_F(p, \frac{p+q}{2}) + B_F(q, \frac{p+q}{2})}{2}   (24)
  = \frac{F(p) + F(q)}{2} - F\!\left(\frac{p+q}{2}\right) = BR_F(p, q).

Note that even for the negative Shannon entropy F(x) = x \log x - x (extended to positive measures), those two symmetrizations yield different divergences: while S_F uses the gradient \nabla F, J_F relies only on the generator F. Both J_F and S_F always have finite values. (Footnote 8: This may not be the case for Bregman/Kullback-Leibler divergences, which can potentially be unbounded.) The first symmetrization approach was historically studied by Jeffreys [29]. The second way to symmetrize Bregman divergences generalizes the spirit of the Jensen-Shannon divergence [20],

JS(p, q) = \frac{1}{2}\left(KL\!\left(p, \frac{p+q}{2}\right) + KL\!\left(q, \frac{p+q}{2}\right)\right)   (25)
= H\!\left(\frac{p+q}{2}\right) - \frac{H(p) + H(q)}{2},   (26)

with non-negativity that can be derived from Jensen's inequality, hence its name. The Jensen-Shannon divergence is also called the total divergence to the average, a generalized measure of diversity from the population distributions p and q to the average population \frac{p+q}{2}. Those Jensen-difference-type divergences are by definition Burbea-Rao divergences. For the Shannon entropy, these two different information divergence symmetrizations (the Jensen-Shannon divergence and the Jeffreys divergence J) satisfy the following inequality:

\frac{J(p, q)}{4} \geq JS(p, q) \geq 0.   (27)

Nielsen and Nock [7] investigated the centroids with respect to Jeffreys-Bregman divergences (the symmetrized Kullback-Leibler divergence).

III. SKEW BURBEA-RAO DIVERGENCES

We further generalize Burbea-Rao divergences by introducing a positive weight α \in (0, 1) when averaging the source parameters p and q as follows:

BR_F^{(α)}: X \times X \to R^+,
BR_F^{(α)}(p, q) = αF(p) + (1-α)F(q) - F(αp + (1-α)q).

We consider the open interval (0, 1) since otherwise the divergence has no discriminatory power (indeed, for α \in \{0, 1\}, BR_F^{(α)}(p, q) = 0, \forall p, q). Although skewed divergences are asymmetric, BR_F^{(α)}(p, q) \neq BR_F^{(α)}(q, p), we can swap the arguments by replacing α by 1-α:

BR_F^{(α)}(p, q) = αF(p) + (1-α)F(q) - F(αp + (1-α)q) = BR_F^{(1-α)}(q, p).   (28)

Those skew Burbea-Rao divergences are similarly found using a skew Jensen-Bregman counterpart (the gradient terms \nabla F(αp + (1-α)q) cancel perfectly in the sum of skew Bregman divergences):

α B_F(p, αp + (1-α)q) + (1-α) B_F(q, αp + (1-α)q) = αF(p) + (1-α)F(q) - F(αp + (1-α)q) = BR_F^{(α)}(p, q).

In the limit cases α \to 0 or α \to 1, we have BR_F^{(α)}(p, q) \to 0, \forall p, q. That is, those divergences lose their discriminatory power at the extremities. However, we show that those skew Burbea-Rao divergences tend asymptotically to Bregman divergences:

B_F(p, q) = \lim_{α \to 0} \frac{1}{α} BR_F^{(α)}(p, q),   (29)
B_F(q, p) = \lim_{α \to 1} \frac{1}{1-α} BR_F^{(α)}(p, q).   (30)

The limit on the right-hand side of Eq. 30 can be expressed alternatively as the following one-sided limit:

\lim_{α \uparrow 1} \frac{1}{1-α} BR_F^{(α)}(p, q) = \lim_{α \downarrow 0} \frac{1}{α} BR_F^{(α)}(q, p),   (31)

where the arrows \uparrow and \downarrow denote the limit from the left and the limit from the right, respectively (see [30] for notations). The right derivative of a function f at x is defined as f'_+(x) = \lim_{y \downarrow x} \frac{f(y) - f(x)}{y - x}. Since BR_F^{(0)}(p, q) = 0, \forall p, q, it follows that the right-hand-side limit of Eq. 31 is the right derivative (see Theorem 1 of [30], which gives a generalized Taylor expansion of convex functions) of the map

L(α): α \mapsto BR_F^{(α)}(q, p)   (32)

taken at α = 0. Thus we have

\lim_{α \downarrow 0} \frac{1}{α} BR_F^{(α)}(q, p) = L'_+(0).   (33)

Since

L'_+(0) = \frac{d_+}{dα}\left(αF(q) + (1-α)F(p) - F(αq + (1-α)p)\right)\Big|_{α=0}
= F(q) - F(p) - \langle q - p, \nabla F(p)\rangle   (34)
= B_F(q, p),   (35)

we obtain:

Lemma 1: Skew Burbea-Rao divergences tend asymptotically to Bregman divergences (α \to 0) or reverse Bregman divergences (α \to 1).
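The asymptotic behavior of Eq. (29) is easy to verify numerically. Below is a short Python sketch (our own helper names; the Burg-type generator F(x) = -sum log x is chosen only as an example) comparing the scaled skew Burbea-Rao divergence against the corresponding Bregman divergence as α shrinks.

```python
import numpy as np

# Check of Eq. (29): (1/alpha) BR_F^{(alpha)}(p, q) -> B_F(p, q) as alpha -> 0,
# for the separable generator F(x) = -sum_i log x_i with gradient -1/x.
F     = lambda x: -np.sum(np.log(x))
gradF = lambda x: -1.0 / x

def skew_burbea_rao(p, q, alpha):
    return alpha * F(p) + (1 - alpha) * F(q) - F(alpha * p + (1 - alpha) * q)

def bregman(p, q):
    return F(p) - F(q) - np.dot(p - q, gradF(q))

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 1.0, 5.0])

for alpha in (1e-1, 1e-3, 1e-5):
    # The scaled skew divergence approaches the Bregman divergence B_F(p, q).
    print(alpha, skew_burbea_rao(p, q, alpha) / alpha, bregman(p, q))
```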

Thus we may scale skew Burbea-Rao divergences so that Bregman divergences belong to the class of skew Burbea-Rao divergences:

sBR_F^{(α)}(p, q) = \frac{1}{α(1-α)}\left(αF(p) + (1-α)F(q) - F(αp + (1-α)q)\right).   (36)

Moreover, α is then no longer restricted to (0, 1) but ranges over the full real line, α \in \mathbb{R}, as also noticed in [31]. Setting α = \frac{1-α'}{2} (that is, α' = 1 - 2α), we get

sBR_F^{(α')}(p, q) = \frac{4}{1-α'^2}\left(\frac{1-α'}{2}F(p) + \frac{1+α'}{2}F(q) - F\!\left(\frac{1-α'}{2}p + \frac{1+α'}{2}q\right)\right).   (37)

IV. BURBEA-RAO CENTROIDS

Let P = \{p_1, ..., p_n\} denote a d-dimensional point set. To each point, let us further associate a positive weight w_i (accounting for arbitrary multiplicity) and a positive scalar α_i \in (0, 1) to define an anchored distance BR_F^{(α_i)}(·, p_i). Define the skew Burbea-Rao barycenter (or centroid) c as the minimizer of the following optimization task:

(OPT): c = \arg\min_x \sum_{i=1}^n w_i BR_F^{(α_i)}(x, p_i) = \arg\min_x L(x).   (38)

(Footnote 9: We also call them skew Jensen barycenters or centroids since they are induced by a divergence relying on the Jensen inequality.)

Without loss of generality, we consider the argument x in the left position (otherwise, we change all α_i \to 1 - α_i to get the right-sided Burbea-Rao centroid). Removing all terms independent of x, the minimization program (OPT) amounts to equivalently minimizing the following energy function:

E(c) = \left(\sum_{i=1}^n w_i α_i\right) F(c) - \sum_{i=1}^n w_i F(α_i c + (1-α_i)p_i).   (39)

Observe that the energy function is decomposable into the sum of a convex function (\sum_{i=1}^n w_i α_i)F(c) and a concave function -\sum_{i=1}^n w_i F(α_i c + (1-α_i)p_i) (since a sum of concave functions is concave). We can thus solve this optimization problem iteratively using the Convex-ConCave Procedure [32], [33] (CCCP), starting from an initial position c_0 (say, the barycenter c_0 = \sum_{i=1}^n w_i p_i), and iteratively updating the barycenter as follows:

\nabla F(c_{t+1}) = \frac{1}{\sum_{i=1}^n w_i α_i} \sum_{i=1}^n w_i α_i \nabla F(α_i c_t + (1-α_i)p_i),   (40)

c_{t+1} = (\nabla F)^{-1}\left(\frac{1}{\sum_{i=1}^n w_i α_i} \sum_{i=1}^n w_i α_i \nabla F(α_i c_t + (1-α_i)p_i)\right).   (41)

Since F is strictly convex, the second-order derivative \nabla^2 F is positive definite, and \nabla F is strictly monotone increasing. Thus we can interpret Eq. 41 as a fixed-point equation by considering the \nabla F-representation. Each iteration is interpreted as a quasi-arithmetic mean. This proves that the Burbea-Rao centroid is always well-defined and unique (see the Appendix for a detailed proof), since there is (at most) a unique fixed point of x = g(x) for a strictly monotone increasing function g(·).

In some cases, like the squared Euclidean distance (or squared Mahalanobis distances), we find closed-form solutions for the Burbea-Rao barycenters. For example, consider the (negative) quadratic entropy F(x) = \langle x, x\rangle = \sum_{i=1}^d (x^{(i)})^2 with weights w_i and all α_i = \frac{1}{2} (non-skew, symmetric Burbea-Rao divergences). We have

\min_x E(x) = \frac{F(x)}{2} - \sum_{i=1}^n w_i F\!\left(\frac{p_i + x}{2}\right)   (42)
= \min_x \frac{\langle x, x\rangle}{2} - \frac{1}{4}\sum_{i=1}^n w_i\left(\langle x, x\rangle + 2\langle x, p_i\rangle + \langle p_i, p_i\rangle\right).

The minimum is obtained when the gradient \nabla E(x) = 0, that is, when x = \bar{p} = \sum_{i=1}^n w_i p_i, the barycenter of the point set P. For most Burbea-Rao divergences, Eq. 42 can only be solved numerically.

Observe that for the extremal skew cases (α \to 0 or α \to 1), we obtain the Bregman centroids in closed form (see Eq. 30). Thus skew Burbea-Rao centroids allow one to obtain a smooth transition from the right-sided centroid (the center of mass) to the left-sided centroid (a quasi-arithmetic mean M_f obtained for f = \nabla F, a continuous and strictly increasing function).

Theorem 1: Skew Burbea-Rao centroids are unique. They can be estimated iteratively using the CCCP algorithm. In the extremal skew cases, the Burbea-Rao centroids tend to the Bregman left/right-sided centroids, and admit closed-form equations in those limit cases.

To describe the orbit of Burbea-Rao centroids linking the left-sided to the right-sided Bregman centroid, we compute for α \in [0, 1] the skew Burbea-Rao centroids with the following update scheme:

c_{t+1} = (\nabla F)^{-1}\left(\sum_{i=1}^n w_i \nabla F(α c_t + (1-α)p_i)\right).   (43)
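The CCCP fixed-point iteration of Eq. (41) is only a few lines of code. The Python sketch below (our own function names, with the Burg-type generator F(x) = -sum log x chosen only as an example whose gradient and inverse gradient are explicit) illustrates the non-skew case α_i = 1/2, initialized at the barycenter.

```python
import numpy as np

# Sketch of the CCCP iteration of Eq. (41) for the Burbea-Rao centroid,
# generator F(x) = -sum_i log x_i, so gradF(x) = -1/x and (gradF)^{-1}(y) = -1/y.
gradF     = lambda x: -1.0 / x
gradF_inv = lambda y: -1.0 / y

def burbea_rao_centroid(points, weights, alpha=0.5, iters=50):
    pts = np.asarray(points, float)
    w = np.asarray(weights, float); w = w / w.sum()
    c = np.average(pts, axis=0, weights=w)      # initialize with the barycenter
    for _ in range(iters):
        g = sum(wi * gradF(alpha * c + (1 - alpha) * pi) for wi, pi in zip(w, pts))
        c = gradF_inv(g)                        # quasi-arithmetic mean update
    return c

pts = np.array([[1.0, 2.0], [3.0, 1.0], [2.0, 5.0]])
print(burbea_rao_centroid(pts, np.ones(3)))
```

With all weights equal and all α_i = α, the normalization in Eq. (41) cancels, which is why the update reduces to a quasi-arithmetic mean of the shifted points αc_t + (1-α)p_i.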
X X E(c) = ( w α )F (c) w F (α c + (1 α )p ) (39) To describe the orbit of Burbea-Rao centroids linking the i i − i i − i i i=1 i=1 left to right sided Bregman centroids, we compute for α ∈ Observe that the energy function is decomposable in the [0, 1] the skew Burbea-Rao centroids with the following update Pn scheme: sum of a convex function ( i=1 wiαi)F (c) with a concave Pn function i=1 wiF (αic + (1 αi)pi) (since the sum of n − − n ! 9 1 X We also call them skew Jensen barycenters or centroids since they are c = F − w F (αc + (1 α)p ) (43) t+1 ∇ i∇ t − i induced by a divergence using the Jensen inequality. i=1 IEEE TRANSACTIONS ON INFORMATION THEORY 57(8):5455-5466, 2011. 7

We may further consider various convex generators F_i, one for each point, and consider the updating scheme

c_{t+1} = \left(\sum_{i=1}^n w_i \nabla F_i\right)^{-1}\left(\frac{1}{\sum_{i=1}^n w_i α_i}\sum_{i=1}^n w_i α_i \nabla F_i(α_i c_t + (1-α_i)p_i)\right).

A. Burbea-Rao divergences of a population

Consider now the Burbea-Rao divergence of a population p_1, ..., p_n with respective positive normalized weights w_1, ..., w_n. The Burbea-Rao divergence of the population is defined by

BR_F^w(p_1, ..., p_n) = \sum_{i=1}^n w_i F(p_i) - F\!\left(\sum_{i=1}^n w_i p_i\right) \geq 0.   (44)

This family of diversity measures includes the Jensen-Rényi divergences [34], [35] for F(x) = -R_α(x), where R_α(x) = \frac{1}{1-α}\log\sum_{j=1}^d p_j^α is the Rényi entropy of order α. (The Rényi entropy is concave for α \in (0, 1) and tends to the Shannon entropy as α \to 1.)

V. BHATTACHARYYA DISTANCES AS BURBEA-RAO DISTANCES

We first briefly recall the versatile class of exponential family distributions in Section V-A. Then we show in Section V-B that the statistical Bhattacharyya/Chernoff distances between exponential family distributions amount to computing a Burbea-Rao divergence.

A. Exponential family distributions in Statistics

Many usual statistical parametric distributions p(x; λ) (e.g., Gaussian, Poisson, Bernoulli/multinomial, Gamma/Beta, etc.) share common properties arising from their common canonical decomposition [9]:

p(x; λ) = p_F(x; θ) = \exp\left(\langle t(x), θ\rangle - F(θ) + k(x)\right).   (45)

Those distributions (Footnote 10: The distributions can either be discrete or continuous. We do not introduce the unifying framework of probability measures in order not to burden the paper.) are said to belong to the exponential families (see [23] for a tutorial). An exponential family is characterized by its log-normalizer F(θ), and a distribution in that family by its natural parameter θ belonging to the natural space Θ. The log-normalizer F is strictly convex and C^\infty, and can also be expressed in the source coordinate system λ using the 1-to-1 map τ: Λ \to Θ that converts parameters from the source coordinate system λ to the natural coordinate system θ:

F(θ) = F(τ(λ)) = (F \circ τ)(λ) = F_λ(λ),   (46)

where F_λ = F \circ τ denotes the log-normalizer function expressed in the λ-coordinates instead of the natural θ-coordinates. The vector t(x) denotes the sufficient statistics, that is, the set of linearly independent functions that allows one to concentrate, without any loss, all the information about the parameter θ carried in the i.i.d. observations x_1, x_2, .... The inner product \langle p, q\rangle is defined according to the primitive type of θ. Namely, it is a multiplication \langle p, q\rangle = pq for scalars, a dot product \langle p, q\rangle = p^T q for vectors, a matrix trace \langle p, q\rangle = \mathrm{tr}(p^T q) = \mathrm{tr}(p q^T) for matrices, etc. For composite types, such as θ being defined by both a vector part and a matrix part, the composite inner product is defined as the sum of the inner products on the primitive types. Finally, k(x) represents the carrier measure with respect to the counting or Lebesgue measure. Decompositions for most common exponential family distributions are given in [23]. An exponential family E_F = \{p_F(x; θ) \mid θ \in Θ\} is the set of probability distributions obtained for the same log-normalizer function F. Information geometry considers E_F as a manifold entity, and studies its differential geometric properties [12].

For example, consider the family of Poisson distributions with mass function

p(x; λ) = \frac{λ^x}{x!}\exp(-λ),   (47)

for x \in \mathbb{N}_+ = \mathbb{N} \cup \{0\} a non-negative integer. Poisson distributions form a univariate exponential family (x \in \mathbb{N}_+) of order 1 (parameter λ). The canonical decomposition yields
- the sufficient statistic t(x) = x,
- the natural parameter θ = \log λ,
- the log-normalizer F(θ) = \exp θ,
- and the carrier measure k(x) = -\log x! (with respect to the counting measure).

Since we deal with applications using multivariate normals in the following, we also report explicitly the canonical decomposition of the multivariate Gaussian family \{p_F(x; θ) \mid θ \in Θ\}. We rewrite the usual Gaussian density of mean μ and variance-covariance matrix Σ,

p(x; λ) = p(x; μ, Σ) = \frac{1}{(2π)^{d/2}\sqrt{\det Σ}}\exp\left(-\frac{(x-μ)^T Σ^{-1}(x-μ)}{2}\right),   (48)-(49)

in the canonical form of Eq. 45 with
- θ = (Σ^{-1}μ, \frac{1}{2}Σ^{-1}) \in Θ = \mathbb{R}^d \times K_{d\times d}, where K_{d\times d} denotes the cone of positive definite matrices,
- F(θ) = \frac{1}{4}\mathrm{tr}(θ_2^{-1}θ_1θ_1^T) - \frac{1}{2}\log\det θ_2 + \frac{d}{2}\log π,
- t(x) = (x, -x x^T),
- k(x) = 0.

In this case, the inner product is composite, and is calculated as the sum of a dot product and a matrix trace:

\langle θ, θ'\rangle = θ_1^T θ_1' + \mathrm{tr}(θ_2^T θ_2').   (50)

The coordinate transformation τ: Λ \to Θ is given for λ = (μ, Σ) by

τ(λ) = \left(λ_2^{-1}λ_1, \frac{1}{2}λ_2^{-1}\right),   (51)

and its inverse mapping τ^{-1}: Θ \to Λ by

τ^{-1}(θ) = \left(\frac{1}{2}θ_2^{-1}θ_1, \frac{1}{2}θ_2^{-1}\right).   (52)

B. Bhattacharyya/Chernoff coefficients and α-divergences as skew Burbea-Rao divergences

For arbitrary probability distributions p(x) and q(x) (parametric or not), we measure the amount of overlap between those distributions using the Bhattacharyya coefficient [36]:

C(p, q) = \int \sqrt{p(x) q(x)}\, dx.   (53)

Clearly, the Bhattacharyya coefficient (measuring the affinity between distributions [37]) falls in the unit range:

0 \leq C(p, q) \leq 1.   (54)

In fact, we may interpret this coefficient geometrically by considering \sqrt{p(x)} and \sqrt{q(x)} as unit vectors. The Bhattacharyya coefficient is then their inner product, representing the cosine of the angle made by the two unit vectors. The Bhattacharyya distance B: X \times X \to R^+ is derived from its coefficient [36] as

B(p, q) = -\ln C(p, q).   (55)

The Bhattacharyya distance allows one to both upper and lower bound the Bayes' classification error [38], [39], while there are no such results for the symmetric Kullback-Leibler divergence. Both the Bhattacharyya distance and the symmetric Kullback-Leibler divergence agree with the Fisher information at the infinitesimal level. Although the Bhattacharyya distance is symmetric, it is not a metric. Nevertheless, it can be metrized by transforming it into the following Hellinger metric [40]:

H(p, q) = \sqrt{\frac{1}{2}\int\left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 dx},   (56)

such that 0 \leq H(p, q) \leq 1. It follows that

H(p, q) = \sqrt{\frac{1}{2}\left(\int p(x)dx + \int q(x)dx - 2\int\sqrt{p(x)}\sqrt{q(x)}\,dx\right)} = \sqrt{1 - C(p, q)}.   (57)

The Hellinger metric is also called the Matusita metric [37] in the literature. The thesis of Hellinger was emphasized in the work of Kakutani [41].

We consider a direct generalization of Bhattacharyya coefficients and divergences called Chernoff divergences (Footnote 11: In the literature, the Chernoff information is also defined as -\log\inf_{α\in[0,1]}\int p^α(x)q^{1-α}(x)dx. Similarly, Chernoff coefficients C_α(p, q) are defined as the supremum C_α(p, q) = \sup_{α\in[0,1]}\int p^α(x)q^{1-α}(x)dx.):

B_α(p, q) = -\ln\int_x p^α(x)q^{1-α}(x)dx = -\ln C_α(p, q)   (58)
= -\ln\int_x q(x)\left(\frac{p(x)}{q(x)}\right)^α dx   (59)
= -\ln E_q[L^α(x)],   (60)

defined for some α \in (0, 1) (the Bhattacharyya divergence is obtained for α = \frac{1}{2}), where E[·] denotes the expectation, and L(x) = \frac{p(x)}{q(x)} the likelihood ratio. The term \int_x p^α(x)q^{1-α}(x)dx is called the Chernoff coefficient. The Bhattacharyya/Chernoff distance of members of the same exponential family yields a weighted asymmetric Burbea-Rao divergence (namely, a skew Burbea-Rao divergence):

B_α(p_F(x; θ_p), p_F(x; θ_q)) = BR_F^{(α)}(θ_p, θ_q)   (61)

with

BR_F^{(α)}(θ_p, θ_q) = αF(θ_p) + (1-α)F(θ_q) - F(αθ_p + (1-α)θ_q).   (62)

Chernoff coefficients are also related to α-divergences, the canonical divergences in α-flat spaces in information geometry [12] (p. 57):

D_α(p\|q) =
  \frac{4}{1-α^2}\left(1 - \int p(x)^{\frac{1-α}{2}} q(x)^{\frac{1+α}{2}} dx\right),   α \neq \pm 1,
  \int p(x)\log\frac{p(x)}{q(x)}dx = KL(p, q),   α = -1,
  \int q(x)\log\frac{q(x)}{p(x)}dx = KL(q, p),   α = 1.   (63)

The class of α-divergences satisfies the reference duality D_α(p\|q) = D_{-α}(q\|p). Remapping α' = \frac{1-α}{2} (α = 1 - 2α'), we transform Amari α-divergences into Chernoff α'-divergences (Footnote 12: Chernoff coefficients are also related to the Rényi α-divergence generalizing the Kullback-Leibler divergence, R_α(p\|q) = \frac{1}{α-1}\log\int_x p(x)^α q^{1-α}(x)dx, built on the Rényi entropy H^R_α(p) = \frac{1}{1-α}\log\int_x p^α(x)dx. The Tsallis entropy H^T_α(p) = \frac{1}{α-1}(1 - \int p(x)^α dx) can also be obtained from the Rényi entropy (and vice versa) via the mappings H^T_α(p) = \frac{1}{1-α}(e^{(1-α)H^R_α(p)} - 1) and H^R_α(p) = \frac{1}{1-α}\log(1 + (1-α)H^T_α(p)).):

D_{α'}(p, q) =
  \frac{1}{α'(1-α')}\left(1 - \int p(x)^{α'} q(x)^{1-α'} dx\right),   α' \notin \{0, 1\},
  \int p(x)\log\frac{p(x)}{q(x)}dx = KL(p, q),   α' = 1,
  \int q(x)\log\frac{q(x)}{p(x)}dx = KL(q, p),   α' = 0.   (64)

Theorem 2: The Chernoff α'-divergence (α \neq \pm 1) of distributions belonging to the same exponential family is given in closed form by means of a skewed Burbea-Rao divergence as D_{α'}(p, q) = \frac{1}{α'(1-α')}\left(1 - e^{-BR_F^{(α')}(θ_p, θ_q)}\right), with BR_F^{(α)}(θ_p, θ_q) = αF(θ_p) + (1-α)F(θ_q) - F(αθ_p + (1-α)θ_q). The Amari α-divergence for members of the same exponential family amounts to computing D_α(p, q) = \frac{4}{1-α^2}\left(1 - e^{-BR_F^{(\frac{1-α}{2})}(θ_p, θ_q)}\right).

Let us compute the Chernoff coefficient for distributions belonging to the same exponential family. Without loss of generality, consider the reduced canonical form of exponential families p_F(x; θ) = \exp(\langle x, θ\rangle - F(θ)). The Chernoff coefficient C_α(p, q) of members p = p_F(x; θ_p) and q = p_F(x; θ_q) of the same exponential family E_F is:

C_α(p, q) = \int p^α(x) q^{1-α}(x)\, dx = \int p_F^α(x; θ_p)\, p_F^{1-α}(x; θ_q)\, dx
= \int \exp\left(α(\langle x, θ_p\rangle - F(θ_p))\right) \exp\left((1-α)(\langle x, θ_q\rangle - F(θ_q))\right) dx
= \int \exp\left(\langle x, αθ_p + (1-α)θ_q\rangle - (αF(θ_p) + (1-α)F(θ_q))\right) dx
= \exp\left(-(αF(θ_p) + (1-α)F(θ_q))\right) \int \exp\left(\langle x, αθ_p + (1-α)θ_q\rangle - F(αθ_p + (1-α)θ_q) + F(αθ_p + (1-α)θ_q)\right) dx
= \exp\left(F(αθ_p + (1-α)θ_q) - (αF(θ_p) + (1-α)F(θ_q))\right) \underbrace{\int p_F(x; αθ_p + (1-α)θ_q)\, dx}_{=1}
= \exp\left(-BR_F^{(α)}(θ_p, θ_q)\right) \geq 0.
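The identity just derived is easy to check numerically. The Python sketch below (our own helper names) computes the Bhattacharyya distance (α = 1/2) between two multivariate Gaussians as BR_F^{(1/2)} on the natural parameters, using the log-normalizer of Section V-A, and compares it with the Gaussian closed form reported later in Table I.

```python
import numpy as np

def natural_params(mu, Sigma):
    P = np.linalg.inv(Sigma)
    return P @ mu, 0.5 * P                      # theta = (Sigma^{-1} mu, Sigma^{-1}/2)

def log_normalizer(theta1, theta2):
    d = theta1.shape[0]                         # F(theta) of Section V-A
    return (0.25 * theta1 @ np.linalg.inv(theta2) @ theta1
            - 0.5 * np.log(np.linalg.det(theta2)) + 0.5 * d * np.log(np.pi))

def bhattacharyya_gauss(mu_p, S_p, mu_q, S_q):
    tp, Tp = natural_params(mu_p, S_p)
    tq, Tq = natural_params(mu_q, S_q)
    F = log_normalizer
    # BR_F^{(1/2)}(theta_p, theta_q) = (F(theta_p)+F(theta_q))/2 - F((theta_p+theta_q)/2)
    return 0.5 * F(tp, Tp) + 0.5 * F(tq, Tq) - F(0.5 * (tp + tq), 0.5 * (Tp + Tq))

mu_p, S_p = np.array([0.0, 1.0]), np.array([[2.0, 0.3], [0.3, 1.0]])
mu_q, S_q = np.array([1.0, -1.0]), np.array([[1.0, 0.0], [0.0, 3.0]])

S = 0.5 * (S_p + S_q)
closed_form = (0.125 * (mu_p - mu_q) @ np.linalg.inv(S) @ (mu_p - mu_q)
               + 0.5 * np.log(np.linalg.det(S)
                              / np.sqrt(np.linalg.det(S_p) * np.linalg.det(S_q))))
print(bhattacharyya_gauss(mu_p, S_p, mu_q, S_q), closed_form)   # values agree
```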

We get the following theorem for Bhattacharyya/Chernoff distances:

Theorem 3: The skew Bhattacharyya divergence B_α(p, q) is equivalent to a Burbea-Rao divergence for members of the same exponential family E_F: B_α(p, q) = B_α(p_F(x; θ_p), p_F(x; θ_q)) = -\log C_α(p_F(x; θ_p), p_F(x; θ_q)) = BR_F^{(α)}(θ_p, θ_q) \geq 0. In particular, for α = \pm 1, the Kullback-Leibler divergence of those exponential family distributions amounts to computing a Bregman divergence [14] (obtained by taking the limit α \to 1 or α \to 0).

Corollary 1: In the limit cases α' \in \{0, 1\}, the α'-divergences amount to computing a Kullback-Leibler divergence, which is equivalent to computing a Bregman divergence for the log-normalizer on the swapped natural parameters: KL(p_F(x; θ_p), p_F(x; θ_q)) = B_F(θ_q, θ_p).

Proof: The proof relies on the equivalence of Burbea-Rao divergences to Bregman divergences for extremal values of α \in \{0, 1\}:

KL(p, q) = KL(p_F(x; θ_p), p_F(x; θ_q))   (65)
= \lim_{α' \to 1} D_{α'}(p_F(x; θ_p), p_F(x; θ_q))   (66)
= \lim_{α' \to 1} \frac{1}{α'(1-α')}\left(1 - C_{α'}(p_F(x; θ_p), p_F(x; θ_q))\right)   (since \exp x \simeq 1 + x for x \simeq 0)
= \lim_{α' \to 1} \frac{1}{α'(1-α')} BR_F^{(α')}(θ_p, θ_q)   (67)   (with BR_F^{(α')}(θ_p, θ_q) \simeq (1-α')B_F(θ_q, θ_p) as α' \to 1)
= \lim_{α' \to 1} \frac{1}{α'} B_F(θ_q, θ_p) = B_F(θ_q, θ_p).   (68)

Similarly, we have \lim_{α' \to 0} D_{α'}(p_F(x; θ_p), p_F(x; θ_q)) = KL(p_F(x; θ_q), p_F(x; θ_p)) = B_F(θ_p, θ_q).

Table I reports the Bhattacharyya distances for members of the same exponential families.

C. Direct method for calculating the Bhattacharyya centroids of multivariate normals

To the best of our knowledge, the Bhattacharyya centroid has only been studied for univariate Gaussian or diagonal multivariate Gaussian distributions [42] in the context of speech recognition, where it is reported that it can be estimated using an iterative algorithm (no convergence guarantees are reported in [42]). In order to compare this scheme on multivariate data with our generic Burbea-Rao scheme, we extend the approach of Rigazio et al. [42] to multivariate Gaussians. Plugging the Bhattacharyya distance of Gaussians into the energy function of the optimization problem (OPT), we get

L(c) = \sum_{i=1}^n \frac{1}{8}(μ_c - μ_i)^T\left(\frac{Σ_c + Σ_i}{2}\right)^{-1}(μ_c - μ_i) + \frac{1}{2}\log\frac{\det\frac{Σ_c+Σ_i}{2}}{\sqrt{\det Σ_c \det Σ_i}}.   (69)

This is equivalent to minimizing the following energy:

F(c) = \sum_{i=1}^n (μ_c - μ_i)^T(Σ_c + Σ_i)^{-1}(μ_c - μ_i) + 2\log\det(Σ_c + Σ_i) - \log\det Σ_c - \log\left(2^{2d}\det Σ_i\right).   (70)

In order to minimize F(c), let us differentiate it with respect to μ_c. Let U_i denote (Σ_c + Σ_i)^{-1}. Using matrix differentials [43] (p. 10, Eq. 73), we get

\frac{\partial L}{\partial μ_c} = \sum_{i=1}^n \left(U_i + U_i^T\right)(μ_c - μ_i).   (71)

Then one can estimate μ_c iteratively, since U_i depends on Σ_c, which is unknown. We update μ_c as follows:

μ_c(t+1) = \left[\sum_{i=1}^n \left(U_i + U_i^T\right)\right]^{-1}\left[\sum_{i=1}^n \left(U_i + U_i^T\right)μ_i\right].   (72)

TABLE I
CLOSED-FORM BHATTACHARYYA DISTANCES FOR SOME CLASSES OF EXPONENTIAL FAMILIES (EXPRESSED IN THE SOURCE PARAMETERS FOR EASE OF USE).

Exponential family | τ: λ → θ | F(θ) (up to a constant) | Bhattacharyya/Burbea-Rao BR_F(λ_p, λ_q) = BR_F(τ(λ_p), τ(λ_q))
Multinomial | θ_i = log(p_i/p_d) | log(1 + Σ_{i=1}^{d-1} exp θ_i) | -ln Σ_{i=1}^d √(p_i q_i)
Poisson | θ = log λ | exp θ | (1/2)(√λ_p - √λ_q)^2
Gaussian | (θ_1 = μ/σ^2, θ_2 = -1/(2σ^2)) | -θ_1^2/(4θ_2) + (1/2) log(-π/θ_2) | (1/4)(μ_p - μ_q)^2/(σ_p^2 + σ_q^2) + (1/2) ln((σ_p^2 + σ_q^2)/(2σ_pσ_q))
Multivariate Gaussian | (θ = Σ^{-1}μ, Θ = (1/2)Σ^{-1}) | (1/4) tr(Θ^{-1}θθ^T) - (1/2) log det Θ | (1/8)(μ_p - μ_q)^T((Σ_p + Σ_q)/2)^{-1}(μ_p - μ_q) + (1/2) ln( det((Σ_p + Σ_q)/2) / √(det Σ_p det Σ_q) )
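As a quick numerical check of the Poisson row of Table I, the sketch below (our own helper names) evaluates BR_F^{(1/2)} on the natural parameters θ = log λ with log-normalizer F(θ) = exp θ and compares it with the closed form (1/2)(√λ_p - √λ_q)^2.

```python
import math

# Bhattacharyya distance of two Poisson laws via BR_F^{(1/2)} on natural parameters.
def bhattacharyya_poisson(lam_p, lam_q):
    F = math.exp                                   # log-normalizer of the Poisson family
    tp, tq = math.log(lam_p), math.log(lam_q)      # natural parameters theta = log(lambda)
    return 0.5 * F(tp) + 0.5 * F(tq) - F(0.5 * (tp + tq))

lam_p, lam_q = 3.0, 7.0
closed_form = 0.5 * (math.sqrt(lam_p) - math.sqrt(lam_q)) ** 2
print(bhattacharyya_poisson(lam_p, lam_q), closed_form)   # identical values
```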

Now let us estimate Σ_c. We use matrix differentials [43] (p. 9, Eq. 55 for the first term, and p. 8, Eq. 51 for the two others):

\frac{\partial L}{\partial Σ_c} = -\sum_{i=1}^n U_i^T(μ_c - μ_i)(μ_c - μ_i)^T U_i^T + 2\sum_{i=1}^n U_i^T - \sum_{i=1}^n Σ_c^{-T}.   (73)

Taking into account the fact that Σ_c is symmetric, the differential calculus on symmetric matrices can simply be estimated as

\frac{dL}{dΣ_c} = \frac{\partial L}{\partial Σ_c} + \left(\frac{\partial L}{\partial Σ_c}\right)^T - \mathrm{diag}\left(\frac{\partial L}{\partial Σ_c}\right).   (74)

Thus, if one denotes

A = \sum_{i=1}^n 2U_i^T - U_i^T(μ_c - μ_i)(μ_c - μ_i)^T U_i^T,   (75)

and recalling that Σ_c is symmetric, one has to solve

n\left(2Σ_c^{-1} - \mathrm{diag}(Σ_c^{-1})\right) = A + A^T - \mathrm{diag}(A).   (76)

Let

B = A + A^T - \mathrm{diag}(A).   (77)

Then one can estimate Σ_c iteratively as follows:

Σ_c^{(k+1)} = 2n\left[B^{(k)} + \mathrm{diag}(B^{(k)})\right]^{-1}.   (78)

Let us now compare the two methods, the generic Burbea-Rao scheme and the tailored Gaussian scheme, for computing the Bhattacharyya centroids of multivariate Gaussians.

D. Applications to mixture simplification in statistics

Simplifying Gaussian mixtures is important in many applications arising in signal processing [26]. Mixture simplification is also a crucial step when one wants to study the Riemannian geometry induced by the Rao distance with respect to the Fisher metric: the set of mixture models needs to have the same number of components, so that we simplify source mixtures to get a set of Gaussian mixtures of prescribed size. We adapt the hierarchical clustering algorithm of Garcia et al. [26] by replacing the symmetrized Bregman centroid (namely, the Jeffreys-Bregman centroid) by the Bhattacharyya centroid. We consider the task of color image segmentation by learning a Gaussian mixture model for each image. Each image is represented as a set of 5D points (color RGB and position xy).

The first experimental results, depicted in Figure 3, demonstrate the qualitative stability of the clustering performance. In particular, the hierarchical clustering with respect to the Bhattacharyya distance performs qualitatively much better on the last colormap image. (Footnote 13: See reference images and segmentation using Bregman centroids at http://www.informationgeometry.org/MEF/)

The second experiment focuses on characterizing the numerical convergence of the generic Burbea-Rao method compared to the tailored Gaussian method. Since we presented two different novel schemes to compute the Bhattacharyya centroids of multivariate Gaussians, we wish to compare them, both in terms of stability and accuracy. Whenever the ratio of the Bhattacharyya distance energy function between those estimated centroids is greater than 1%, we consider that one of the two estimation methods is beaten (namely, the method that gives the higher Bhattacharyya distance). Among the 760 centroids computed to generate Figure 3, 100% were correct with the Burbea-Rao approach, while only 87% were correct with the tailored multivariate Gaussian matrix optimization method. The average number of iterations to reach the 1% accuracy is 4.1 for the Burbea-Rao estimation algorithm, and 5.2 for the alternative method. Thus we experimentally checked that the generic CCCP iterative Burbea-Rao algorithm described for computing the Bhattacharyya centroids always converges, and moreover beats another ad-hoc iterative method tailored for multivariate Gaussians.

VI. CONCLUDING REMARKS

In this paper, we have shown that the Bhattacharyya distance for distributions of the same statistical exponential family can be computed equivalently as a Burbea-Rao divergence on the corresponding natural parameters. Those results extend to skew Chernoff coefficients (and Amari α-divergences) and skew Bhattacharyya distances using the notion of skew Burbea-Rao divergences. We proved that (skew) Burbea-Rao centroids are unique, and can be efficiently estimated using an iterative concave-convex procedure with guaranteed convergence. We have shown that extremely skewed Burbea-Rao divergences amount asymptotically to evaluating Bregman divergences. This work emphasizes the attractiveness of exponential families in Statistics. Indeed, it turns out that many statistical distances can be evaluated in closed form for them. For the sake of brevity, we have not mentioned the recent β-divergences and γ-divergences [44], although their distances on exponential families are again available in closed form.

[Fig. 3. Color image segmentation results: (a) source images, (b) segmentation with k = 48 5D Gaussians, and (c) segmentation with k = 16 5D Gaussians.]

The differential Riemannian geometry induced by the class of such Jensen difference measures was studied by Burbea and Rao [18], [19], who built quadratic differential metrics on probability spaces using Jensen differences. The Jensen-Shannon divergence is also an instance of a broad class of divergences called the f-divergences. An f-divergence I_f is a statistical measure of dissimilarity defined by the functional I_f(p, q) = \int p(x) f\!\left(\frac{q(x)}{p(x)}\right) dx. It turns out that the Jensen-Shannon divergence is an f-divergence for the generator

f(x) = \frac{1}{2}\left((x+1)\log\frac{2}{x+1} + x\log x\right).   (79)

f-divergences preserve information monotonicity [44], and their differential geometry was studied by Vos [45]. However, the Jensen-Shannon divergence is a very particular case of Burbea-Rao divergences, since the squared Euclidean distance (another Burbea-Rao divergence) does not belong to the class of f-divergences.

SOURCE CODE

The generic Burbea-Rao barycenter estimation algorithm shall be released in the jMEF open source library: http://www.informationgeometry.org/MEF/ An applet visualizing the skew Burbea-Rao centroids, ranging from the right-sided to the left-sided Bregman centroids, is available at: http://www.informationgeometry.org/BurbeaRao/

ACKNOWLEDGMENTS

We gratefully acknowledge financial support from the French agency DIGITEO (GAS 2008-16D) and the French National Research Agency (ANR GAIA 07-BLAN-0328-01), and Sony Computer Science Laboratories, Inc. We are very grateful to the reviewers for their thorough and thoughtful comments and suggestions. In particular, we are thankful to the anonymous referee who pointed out a rigorous proof of Lemma 1.

APPENDIX
PROOF OF UNIQUENESS OF THE BURBEA-RAO CENTROIDS

Consider without loss of generality the Burbea-Rao centroid (also called Jensen centroid) defined as the minimizer of

c = \arg\min_x \sum_{i=1}^n \frac{1}{n} J_F(p_i, x),

where J_F(p, q) = \frac{F(p)+F(q)}{2} - F\!\left(\frac{p+q}{2}\right) \geq 0. For the sake of simplicity, let us consider univariate generators. The Jensen divergence may not be convex, as J'(x, p) = \frac{F'(x)}{2} - \frac{1}{2}F'\!\left(\frac{x+p}{2}\right) and J''(x, p) = \frac{1}{2}F''(x) - \frac{1}{4}F''\!\left(\frac{x+p}{2}\right) can alternately be positive or negative (see [46]). In general, minimizing an average non-convex divergence may a priori yield many local minima [47]. It is remarkable that the centroid induced by a Jensen divergence is unique although the problem may not be convex.

The proof of uniqueness of the Burbea-Rao centroid and the convergence of the CCCP approximation algorithm rely on the "interness" property (called compensativeness in [48]; a fact following from the monotonicity of the generator function \nabla F) of quasi-arithmetic means:

\min_{i=1}^n p_i \leq M_{\nabla F}(p_1, ..., p_n) \leq \max_{i=1}^n p_i,

with

M_{\nabla F}(p_1, ..., p_n) = (\nabla F)^{-1}\left(\frac{1}{n}\sum_{i=1}^n \nabla F(p_i)\right),

for a strictly convex function F (and hence a strictly monotone increasing gradient \nabla F). The interness property of quasi-arithmetic means ensures that the output is indeed a mean value contained within the extremal values.

For the sake of simplicity, let us first consider a univariate convex generator F with the p_i's in increasing order: p_1 \leq ... \leq p_n. Let initially c_0 \in [p_1^{(0)} = p_1, p_n^{(0)} = p_n]. Since c_1 = M_{\nabla F}\!\left(p_1^{(1)} = \frac{p_1+c_0}{2}, ..., p_n^{(1)} = \frac{p_n+c_0}{2}\right) is a quasi-arithmetic mean, we necessarily have c_1 \in [p_1^{(1)}, p_n^{(1)}] and p_n^{(1)} - p_1^{(1)} = \frac{c_0+p_n}{2} - \frac{c_0+p_1}{2} = \frac{p_n - p_1}{2}. Thus the CCCP iterations induce a sequence of iterated quasi-arithmetic means c_t such that

c_t = M_{\nabla F}\!\left(p_1^{(t)} = \frac{p_1^{(t-1)} + c_{t-1}}{2}, ..., p_n^{(t)} = \frac{p_n^{(t-1)} + c_{t-1}}{2}\right),   c_t \in [p_1^{(t)}, p_n^{(t)}],

with

p_n^{(t)} - p_1^{(t)} = \frac{c_{t-1} + p_n^{(t-1)}}{2} - \frac{c_{t-1} + p_1^{(t-1)}}{2} = \frac{p_n^{(t-1)} - p_1^{(t-1)}}{2} = \frac{1}{2^t}(p_n - p_1).

It follows that the sequence of centroid approximations c_t converges in the limit to a unique centroid c^*. That is, the Burbea-Rao centroids exist and are unique for any strictly convex generator F. The centroid can be approximated within \frac{1}{2^t} relative precision after t iterations (linear convergence of the CCCP). Since the CCCP iterations yield both an approximation c_t and a range [p_1^{(t)}, p_n^{(t)}] where c_t should lie at the t-th iteration, we choose in practice to stop iterating whenever \frac{p_n^{(t)} - p_1^{(t)}}{p_n^{(t)}} goes below a prescribed threshold (for example, taking \nabla F = \log x, we find in about 50 iterations the centroid with machine precision 10^{-12}). The CCCP algorithm with b bits of precision requires O(nb) time. Note that \lim_{t\to\infty} p_1^{(t)} = \lim_{t\to\infty} \frac{1}{2}\sum_{i=0}^{t} \frac{1}{2^i} p_1 = p_1 (and similarly, \lim_{t\to\infty} p_n^{(t)} = p_n). It follows that c^* \in [p_1, p_n] as expected (all of the initial extremal range is possible, and the center depends on the chosen generator F). The proof extends naturally to separable multivariate functions by carrying out the analysis on each dimension independently.

REFERENCES

[1] A. N. Kolmogorov, "Sur la notion de la moyenne," Accad. Naz. Lincei Mem. Cl. Sci. Fis. Mat. Natur. Sez., vol. 12, pp. 388-391, 1930.
[2] M. Nagumo, "Über eine Klasse der Mittelwerte," Japanese Journal of Mathematics, vol. 7, pp. 71-79, 1930; see Collected Papers, Springer, 1993.
[3] J. D. Aczél, "On mean values," Bulletin of the American Mathematical Society, vol. 54, no. 4, pp. 392-400, 1948. http://www.mta.hu/
[4] A. Ben-Tal, A. Charnes, and M. Teboulle, "Entropic means," Journal of Mathematical Analysis and Applications, vol. 139, no. 2, pp. 537-551, 1989.
[5] S. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," Journal of the Royal Statistical Society, Series B, vol. 28, pp. 131-142, 1966.
[6] I. Csiszár, "Information-type measures of difference of probability distributions and indirect observation," Studia Scientiarum Mathematicarum Hungarica, vol. 2, pp. 299-318, 1967.
[7] F. Nielsen and R. Nock, "Sided and symmetrized Bregman centroids," IEEE Transactions on Information Theory, vol. 55, no. 6, pp. 2048-2059, June 2009.
[8] Y. Censor and S. A. Zenios, Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, 1997.
[9] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, pp. 1-305, January 2008.
[10] S.-I. Amari, "α-divergence is unique, belonging to both f-divergence and Bregman divergence classes," IEEE Transactions on Information Theory, vol. 55, no. 11, pp. 4925-4931, 2009.
[11] S.-I. Amari, "Integration of stochastic models by minimizing α-divergence," Neural Computation, vol. 19, no. 10, pp. 2780-2796, 2007.
[12] S. Amari and H. Nagaoka, Methods of Information Geometry. American Mathematical Society / Oxford University Press, 2000.
[13] F. Nielsen and R. Nock, "The dual Voronoi diagrams with respect to representational Bregman divergences," in International Symposium on Voronoi Diagrams (ISVD), DTU Lyngby, Denmark, IEEE, June 2009.
[14] J.-D. Boissonnat, F. Nielsen, and R. Nock, "Bregman Voronoi diagrams," Discrete & Computational Geometry, 2010 (accepted; extends ACM-SIAM SODA 2007).
[15] I. Csiszár, "Generalized projections for non-negative functions," Acta Mathematica Hungarica, vol. 68, no. 1-2, pp. 161-185, 1995.
[16] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," Journal of Machine Learning Research, vol. 6, pp. 1705-1749, 2005.
[17] M. Detyniecki, "Mathematical aggregation operators and their application to video querying," Ph.D. dissertation, 2000.
[18] J. Burbea and C. R. Rao, "On the convexity of some divergence measures based on entropy functions," IEEE Transactions on Information Theory, vol. 28, no. 3, pp. 489-495, 1982.
[19] J. Burbea and C. R. Rao, "On the convexity of higher order Jensen differences based on entropy functions," IEEE Transactions on Information Theory, vol. 28, no. 6, pp. 961-963, 1982.
[20] J. Lin, "Divergence measures based on the Shannon entropy," IEEE Transactions on Information Theory, vol. 37, pp. 145-151, 1991.
[21] M. Basseville and J.-F. Cardoso, "On entropies, divergences and mean values," in Proceedings of the IEEE Workshop on Information Theory, 1995.
[22] L. M. Bregman, "The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming," USSR Computational Mathematics and Mathematical Physics, vol. 7, pp. 200-217, 1967.
[23] F. Nielsen and V. Garcia, "Statistical exponential families: A digest with flash cards," arXiv:0911.4863, 2009.
[24] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. Academic Press Professional, Inc., 1990.
[25] L. Rigazio, B. Tsakam, and J.-C. Junqua, "An optimal Bhattacharyya centroid algorithm for Gaussian clustering with applications in automatic speech recognition," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 3, 2000, pp. 1599-1602. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs all.jsp?arnumber=861998
[26] V. Garcia, F. Nielsen, and R. Nock, "Hierarchical Gaussian mixture model," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2010.
[27] P. Chen, Y. Chen, and M. Rao, "Metrics defined by Bregman divergences: Part I," Communications in Mathematical Sciences, vol. 6, pp. 915-926, 2008.

[28] P. Chen, Y. Chen, and M. Rao, "Metrics defined by Bregman divergences: Part II," Communications in Mathematical Sciences, vol. 6, pp. 927-948, 2008.
[29] H. Jeffreys, "An invariant form for the prior probability in estimation problems," Proceedings of the Royal Society of London, vol. 186, no. 1007, pp. 453-461, March 1946.
[30] F. Liese and I. Vajda, "On divergences and informations in statistics and information theory," IEEE Transactions on Information Theory, vol. 52, no. 10, pp. 4394-4412, October 2006.
[31] J. Zhang, "Divergence function, duality, and convex analysis," Neural Computation, vol. 16, no. 1, pp. 159-195, 2004.
[32] A. Yuille and A. Rangarajan, "The concave-convex procedure," Neural Computation, vol. 15, no. 4, pp. 915-936, 2003.
[33] B. Sriperumbudur and G. Lanckriet, "On the convergence of the concave-convex procedure," in Neural Information Processing Systems, 2009.
[34] Y. He, A. B. Hamza, and H. Krim, "An information divergence measure for ISAR image registration," in Automatic Target Recognition XI (SPIE), vol. 4379, 2001, pp. 199-208.
[35] A. O. Hero, B. Ma, O. Michel, and J. D. Gorman, "Alpha-divergence for classification, indexing and retrieval," Comm. and Sig. Proc. Lab. (CSPL), Dept. EECS, University of Michigan, Ann Arbor, Tech. Rep. 328, July 2001; presented at the Joint Statistical Meeting.
[36] A. Bhattacharyya, "On a measure of divergence between two statistical populations defined by their probability distributions," Bulletin of the Calcutta Mathematical Society, vol. 35, pp. 99-110, 1943.
[37] K. Matusita, "Decision rules based on the distance, for problems of fit, two samples, and estimation," Annals of Mathematical Statistics, vol. 26, pp. 631-640, 1955.
[38] T. Kailath, "The divergence and Bhattacharyya distance measures in signal selection," IEEE Transactions on Communication Technology, vol. 15, no. 1, pp. 52-60, 1967.
[39] F. Aherne, N. Thacker, and P. Rockett, "The Bhattacharyya metric as an absolute similarity measure for frequency coded data," Kybernetika, vol. 34, no. 4, pp. 363-368, 1998.
[40] E. D. Hellinger, "Die Orthogonalinvarianten quadratischer Formen von unendlich vielen Variablen," thesis, University of Göttingen, 1907.
[41] S. Kakutani, "On equivalence of infinite product measures," Annals of Mathematics, vol. 49, pp. 214-224, 1948.
[42] L. Rigazio, B. Tsakam, and J. Junqua, "Optimal Bhattacharyya centroid algorithm for Gaussian clustering with applications in automatic speech recognition," in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, 2000, pp. 1599-1602.
[43] K. B. Petersen and M. S. Pedersen, The Matrix Cookbook. Technical University of Denmark, Oct. 2008. [Online]. Available: http://www2.imm.dtu.dk/pubdb/p.php?3274
[44] A. Cichocki and S.-I. Amari, "Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities," Entropy, 2010.
[45] P. Vos, "Geometry of f-divergence," Annals of the Institute of Statistical Mathematics, vol. 43, no. 3, pp. 515-537, 1991.
[46] F. Nielsen and R. Nock, "Skew Jensen-Bregman Voronoi diagrams," Transactions on Computational Science, no. 14, pp. 102-128, 2011.
[47] P. Auer, M. Herbster, and M. Warmuth, "Exponentially many local minima for single neurons," Advances in Neural Information Processing Systems, vol. 8, pp. 316-317, 1995.
[48] J.-L. Marichal, "Aggregation operators for multicriteria decision aid," Institute of Mathematics, University of Liège, Belgium, 1998.

Sylvain Boltz received the M.S. degree and the Ph.D. degree in computer vision from the University of Nice-Sophia Antipolis, France, in 2004 and 2008, respectively. Since then, he has been a postdoctoral fellow at the VisionLab, University of California, Los Angeles, and a LIX-Qualcomm postdoctoral fellow at École Polytechnique, France. His research spans computer vision and image and video processing, with a particular interest in applications of information theory and compressed sensing to these areas.

Frank Nielsen received the BSc (1992) and MSc (1994) degrees from École Normale Supérieure (ENS Lyon, France). He prepared his PhD on adaptive computational geometry at INRIA Sophia-Antipolis (France) and defended it in 1996. As a civil servant of the University of Nice (France), he gave lectures at the engineering schools ESSI and ISIA (École des Mines). In 1997, he served in the army as a scientific member in the computer science laboratory of École Polytechnique. In 1998, he joined Sony Computer Science Laboratories Inc., Tokyo (Japan), where he is a senior researcher. He became a professor of the CS Dept. of École Polytechnique in 2008. His current research interests include geometry, vision, graphics, learning, and optimization. He is a senior ACM and senior IEEE member.