The Burbea-Rao and Bhattacharyya centroids
Frank Nielsen, Senior Member, IEEE, and Sylvain Boltz, Nonmember, IEEE
IEEE Transactions on Information Theory, 57(8):5455-5466, 2011.

Abstract—We study the centroid with respect to the class of information-theoretic Burbea-Rao divergences that generalize the celebrated Jensen-Shannon divergence by measuring the non-negative Jensen difference induced by a strictly convex and differentiable function. Although those Burbea-Rao divergences are symmetric by construction, they are not metrics since they fail to satisfy the triangle inequality. We first explain how a particular symmetrization of Bregman divergences, called Jensen-Bregman distances, yields exactly those Burbea-Rao divergences. We then proceed by defining skew Burbea-Rao divergences, and show that skew Burbea-Rao divergences amount in limit cases to computing Bregman divergences. We then prove that Burbea-Rao centroids are unique, and can be arbitrarily finely approximated by a generic iterative concave-convex optimization algorithm with guaranteed convergence. In the second part of the paper, we consider the Bhattacharyya distance, which is commonly used to measure the degree of overlap of probability distributions. We show that Bhattacharyya distances between members of the same statistical exponential family amount to calculating a Burbea-Rao divergence in disguise. We thus obtain an efficient algorithm for computing the Bhattacharyya centroid of a set of parametric distributions belonging to the same exponential family, improving over former specialized methods found in the literature that were limited to univariate or "diagonal" multivariate Gaussians. To illustrate the performance of our Bhattacharyya/Burbea-Rao centroid algorithm, we present experimental performance results for k-means and hierarchical clustering of Gaussian mixture models.

Index Terms—Centroid, Kullback-Leibler divergence, Jensen-Shannon divergence, Burbea-Rao divergence, Bregman divergences, Exponential families, Bhattacharyya divergence, Information geometry.

(F. Nielsen is with the Department of Fundamental Research of Sony Computer Science Laboratories, Inc., Tokyo, Japan, and the Computer Science Department (LIX) of École Polytechnique, Palaiseau, France. S. Boltz is with the Computer Science Department (LIX) of École Polytechnique, Palaiseau, France. Manuscript received April 2010, revised April 2012. This revision includes in the appendix a proof of the uniqueness of the centroid. arXiv:1004.5049v3 [cs.IT] 19 Apr 2012.)

I. INTRODUCTION

A. Means and centroids

In Euclidean geometry, the centroid c of a point set P = {p_1, ..., p_n} is defined as the center of mass (1/n) \sum_{i=1}^n p_i, also characterized as the point that minimizes the average squared Euclidean distance: c = \arg\min_p \sum_{i=1}^n \frac{1}{n}\|p - p_i\|^2. This basic notion of Euclidean centroid can be extended to denote a mean point M(P) representing the centrality of a given point set P. There are basically two complementary approaches to define mean values of numbers: (1) by axiomatization, or (2) by optimization, summarized concisely as follows:

- By axiomatization. This approach was historically pioneered by the independent work of Kolmogorov [1] and Nagumo [2] in 1930, and simplified and refined later by Aczél [3]. Without loss of generality we consider the mean of two non-negative numbers x_1 and x_2, and postulate the following expected behaviors of a mean function M(x_1, x_2) as axioms (common sense):
  - Reflexivity. M(x, x) = x,
  - Symmetry. M(x_1, x_2) = M(x_2, x_1),
  - Continuity and strict monotonicity. M(·, ·) continuous and M(x_1, x_2) < M(x_1', x_2) for x_1 < x_1', and
  - Anonymity. M(M(x_{11}, x_{12}), M(x_{21}, x_{22})) = M(M(x_{11}, x_{21}), M(x_{12}, x_{22})) (also called bisymmetry, expressing the fact that the mean can be computed as a mean of the row means or, equivalently, as a mean of the column means).

  Then one can show that the mean function M(·, ·) is necessarily written as

  M(x_1, x_2) = f^{-1}\left(\frac{f(x_1) + f(x_2)}{2}\right) \stackrel{\mathrm{def}}{=} M_f(x_1, x_2),   (1)

  for a strictly increasing function f. The arithmetic mean \frac{x_1+x_2}{2}, the geometric mean \sqrt{x_1 x_2} and the harmonic mean \frac{2}{1/x_1 + 1/x_2} are instances of such generalized means, obtained for f(x) = x, f(x) = \log x and f(x) = 1/x, respectively. Those generalized means are also called quasi-arithmetic means, since they can be interpreted as the arithmetic mean on the sequence f(x_1), ..., f(x_n), the f-representation of the numbers. To get geometric centroids, we simply consider means on each coordinate axis independently. The Euclidean centroid is thus interpreted as the Euclidean arithmetic mean. Barycenters (weighted centroids) are similarly obtained using non-negative weights (normalized so that \sum_{i=1}^n w_i = 1):

  M_f(x_1, ..., x_n; w_1, ..., w_n) = f^{-1}\left(\sum_{i=1}^n w_i f(x_i)\right).   (2)

  Those generalized means satisfy the inequality property

  M_f(x_1, ..., x_n; w_1, ..., w_n) \leq M_g(x_1, ..., x_n; w_1, ..., w_n),   (3)

  if and only if function g dominates f, that is, \forall x, g(x) > f(x). Therefore the arithmetic mean (f(x) = x) dominates the geometric mean (f(x) = \log x), which in turn dominates the harmonic mean (f(x) = 1/x). Note that the inequality in Eq. 3 is not strict, as the means coincide for identical elements: if all x_i are equal to x then M_f(x_1, ..., x_n) = f^{-1}(f(x)) = x = g^{-1}(g(x)) = M_g(x_1, ..., x_n). A short numerical sketch of these generalized means is given below.
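To make the quasi-arithmetic mean construction of Eqs. (1)-(2) concrete, here is a minimal Python sketch (function and variable names are ours, for illustration only); it also checks the dominance inequality of Eq. (3) on a toy example.

```python
import numpy as np

# Minimal sketch of weighted quasi-arithmetic (generalized) means, Eqs. (1)-(2).
def quasi_arithmetic_mean(x, w, f, finv):
    """M_f(x; w) = f^{-1}(sum_i w_i f(x_i)) for normalized weights w."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    return finv(np.dot(w, f(x)))

x = np.array([1.0, 4.0, 9.0])
w = np.ones(3) / 3.0

arithmetic = quasi_arithmetic_mean(x, w, lambda t: t, lambda t: t)        # f(x) = x
geometric  = quasi_arithmetic_mean(x, w, np.log, np.exp)                  # f(x) = log x
harmonic   = quasi_arithmetic_mean(x, w, lambda t: 1.0 / t, lambda t: 1.0 / t)  # f(x) = 1/x

# Dominance (Eq. 3) and the interness property discussed next:
# harmonic <= geometric <= arithmetic, and every mean lies in [min(x), max(x)].
assert harmonic <= geometric <= arithmetic
assert x.min() <= harmonic and arithmetic <= x.max()
```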

All those quasi-arithmetic means further satisfy the "interness" property

min(x_1, ..., x_n) \leq M_f(x_1, ..., x_n) \leq max(x_1, ..., x_n),   (4)

derived from the limit cases p \to \pm\infty of the power means obtained for f(x) = x^p, with p \in \mathbb{R}^* = (-\infty, \infty) \setminus \{0\} a non-zero real number.

(Footnote 1: Besides the min/max operators interpreted as extremal power means, the geometric mean itself can also be interpreted as a limit of power means in the limit case p \to 0.)

- By optimization. In this second, alternative approach, the barycenter c is defined according to a distance function d(·, ·) as the optimal solution of a minimization problem

  (OPT): \min_x \sum_{i=1}^n w_i d(x, p_i) = \min_x L(x; P, d),   (5)

  where the non-negative weights w_i denote the multiplicity or relative importance of the points (by default, the centroid is defined by fixing all w_i = 1/n). Ben-Tal et al. [4] considered an information-theoretic class of distances called f-divergences [5], [6]:

  I_f(x, p) = p\, f\!\left(\frac{x}{p}\right),   (6)

  for a strictly convex differentiable function f(·) satisfying f(1) = 0 and f'(1) = 0. Although those f-divergences were primarily investigated for probability measures (Footnote 2: In that context, a d-dimensional point is interpreted as a discrete and finite probability measure lying in the (d-1)-dimensional unit simplex.), we can extend the f-divergence to positive measures. Since program (OPT) is strictly convex in x, it admits a unique minimizer M(P; I_f) = \arg\min_x L(x; P, I_f), termed the entropic mean by Ben-Tal et al. [4]. Interestingly, those entropic means are linear scale-invariant (Footnote 3: That is, means of homogeneous degree 1.):

  M(\lambda p_1, ..., \lambda p_n; I_f) = \lambda M(p_1, ..., p_n; I_f).   (7)

  Nielsen and Nock [7] considered another class of information-theoretic distortion measures B_F called Bregman divergences [8], [9]:

  B_F(x, p) = F(x) - F(p) - (x - p)F'(p),   (8)

  for a strictly convex differentiable function F. It follows that (OPT) is convex, and admits a unique minimizer M(p_1, ..., p_n; B_F) = M_{F'}(p_1, ..., p_n), a quasi-arithmetic mean for the strictly increasing and continuous function F', the derivative of F. Observe that information-theoretic distances may be asymmetric (i.e., d(x, p) \neq d(p, x)), and therefore one may also define a right-sided centroid M' as the minimizer of

  (OPT'): \min_x \sum_{i=1}^n w_i d(p_i, x).   (9)

  It turns out that for f-divergences, we have

  I_f(x, p) = I_{f^*}(p, x),   (10)

  for f^*(x) = x f(1/x), so that (OPT') is solved as an (OPT) problem for the conjugate function f^*(·). In the same spirit, we have

  B_F(x, p) = B_{F^*}(F'(p), F'(x))   (11)

  for Bregman divergences, where F^* denotes the Legendre convex conjugate [8], [9]. (Footnote 4: Legendre dual convex conjugates F and F^* have necessarily reciprocal gradients: F^{*\prime} = (F')^{-1}. See [7].) Surprisingly, although (OPT') may not be convex in x for Bregman divergences (e.g., F(x) = -\log x), (OPT') nevertheless admits a unique minimizer, independent of the generator function F: the center of mass M'(P; B_F) = \frac{1}{n}\sum_{i=1}^n p_i. Bregman means are not homogeneous except for the power generators F(x) = x^p, which yield entropic means, i.e. means that can also be interpreted as minimizers of average f-divergences [4]. (Footnote 5: In fact, Amari [10] proved that the intersection of the class of f-divergences with the class of Bregman divergences is the class of α-divergences.) Amari [11] further studied those power means (known as α-means in information geometry [12]), and showed that they are linear scale-free means obtained as minimizers of α-divergences, a proper subclass of f-divergences. Nielsen and Nock [13] reported an alternative, simpler proof of α-means by showing that the α-divergences are Bregman divergences in disguise (namely, representational Bregman divergences for positive measures, but not for normalized distribution measures [10]).

  To get geometric centroids, we simply consider multivariate extensions of the optimization task (OPT). In particular, one may consider separable divergences, that is, divergences that can be assembled coordinate-wise:

  d(x, p) = \sum_{i=1}^d d_i(x^{(i)}, p^{(i)}),   (12)

  with x^{(i)} denoting the ith coordinate. A typical non-separable divergence is the squared Mahalanobis distance [14]:

  d(x, p) = (x - p)^T Q (x - p),   (13)

  a Bregman divergence called generalized quadratic distance, defined for the generator F(x) = x^T Q x, where Q is a positive-definite matrix (Q \succ 0). For separable distances, the optimization problem (OPT) may then be reinterpreted as the task of finding the projection [15] of a point p (of dimension d \times n) onto the upper line U:

  (PROJ): \inf_{u \in U} d(u, p),   (14)

  with u_1 = ... = u_{d \times n} > 0, and p the (n \times d)-dimensional point obtained by stacking the d coordinates of each of the n points.

In geometry, means (centroids) play a crucial role in center-based clustering (i.e., k-means [16] for vector quantization applications). Indeed, the mean of a cluster allows one to aggregate data into a single center datum. Thus the notion of means is encapsulated into the broader theory of mathematical aggregators [17].

Results on geometric means can easily be transferred to the field of statistics [4] by generalizing the optimization task to a random variable X with distribution F as

(OPT): \min_x E[X d(x, X)] = \min_x \int_t t\, d(x, t)\, dF(t),   (15)

where E[·] denotes the expectation defined with respect to the Lebesgue-Stieltjes integral. Although this approach is discussed in [4] and important for defining various notions of centrality in statistics, we shall not cover this extended framework here, for the sake of brevity.

B. Burbea-Rao divergences

In this paper, we focus on the optimization approach (OPT) for defining other (geometric) means, using the class of information-theoretic distances obtained from the Jensen difference of a strictly convex and differentiable function F:

d(x, p) = \frac{F(x) + F(p)}{2} - F\!\left(\frac{x+p}{2}\right) \stackrel{\mathrm{def}}{=} BR_F(x, p) \geq 0.   (16)

Since the underlying differential geometry implied by those Jensen difference distances was seminally studied in papers of Burbea and Rao [18], [19], we shall term them Burbea-Rao divergences, and denote them by BR_F. In the remainder, we consider separable Burbea-Rao divergences. That is, for d-dimensional points p and q, we define

BR_F(p, q) = \sum_{i=1}^d BR_F(p^{(i)}, q^{(i)}),   (17)

and study the Burbea-Rao centroids (and barycenters) as the minimizers of the average Burbea-Rao divergence. Those Burbea-Rao divergences generalize the celebrated Jensen-Shannon divergence [20]

JS(p, q) = H\!\left(\frac{p+q}{2}\right) - \frac{H(p) + H(q)}{2}   (18)

by choosing F(x) = -H(x), the negative Shannon entropy, with H(x) = -x \log x. Generators F(·) of parametric distances are convex functions representing entropies, which are concave functions. Burbea-Rao divergences contain all generalized quadratic distances (F(x) = x^T Q x = \langle Qx, x\rangle for a positive definite matrix Q \succ 0, also called squared Mahalanobis distances):

BR_F(p, q) = \frac{F(p) + F(q)}{2} - F\!\left(\frac{p+q}{2}\right)
= \frac{2\langle Qp, p\rangle + 2\langle Qq, q\rangle - \langle Q(p+q), p+q\rangle}{4}
= \frac{1}{4}\left(\langle Qp, p\rangle + \langle Qq, q\rangle - 2\langle Qp, q\rangle\right)
= \frac{1}{4}\langle Q(p-q), p-q\rangle = \frac{1}{4}\|p - q\|_Q^2.

Although the square root of the Jensen-Shannon divergence yields a metric (a Hilbertian metric), this is not true in general for Burbea-Rao divergences. The closest work to our paper is a 1-page symposium paper [21] discussing Ali-Silvey-Csiszár f-divergences [5], [6] and Bregman divergences [22], [8] (two entropy-based divergence classes). Those information-theoretic distortion classes are compared using quadratic differential metrics, mean values and projections. The notion of skew Jensen differences intervenes in the discussion.

(Footnote 6: In the nineties, the IEEE International Symposium on Information Theory (ISIT) published only 1-page papers. We are grateful to Prof. Michèle Basseville for sending us the corresponding slides.)
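The separable Burbea-Rao divergence of Eqs. (16)-(17) is easy to instantiate; the short Python sketch below (our own helper names, not from the paper) uses the negative Shannon entropy generator F(x) = x log x, so that it reduces to the Jensen-Shannon divergence of Eq. (18).

```python
import numpy as np

def burbea_rao(p, q, F):
    """BR_F(p, q) = (F(p) + F(q))/2 - F((p + q)/2), summed coordinate-wise (Eq. 17)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum((F(p) + F(q)) / 2.0 - F((p + q) / 2.0))

neg_shannon = lambda x: x * np.log(x)   # F(x) = x log x, the negative Shannon entropy

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

js = burbea_rao(p, q, neg_shannon)      # Jensen-Shannon divergence of Eq. (18)
# Symmetric and non-negative by Jensen's inequality, but not a metric.
assert js >= 0.0 and np.isclose(js, burbea_rao(q, p, neg_shannon))
```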

C. Contributions and paper organization

The paper is articulated into two parts: the first part studies the Burbea-Rao centroids, and the second part shows some applications in Statistics. We summarize our contributions as follows:

- We define the parametric class of (skew) Burbea-Rao divergences, and show that those divergences naturally arise when generalizing the principle of the Jensen-Shannon divergence [20] to Jensen-Bregman divergences. In the limit cases, we further prove that those skew Burbea-Rao divergences yield asymptotically Bregman divergences.
- We show that the centroids with respect to the (skew) Burbea-Rao divergences are unique. Except for special cases of Burbea-Rao divergences (including the squared Euclidean distance), those centroids are not available in closed-form equations. However, we show that any Burbea-Rao centroid can be estimated efficiently using an iterative convex-concave optimization procedure. As a by-product, we find the Bregman sided centroids [7] in closed form in the extremal skew cases.

We then consider applications of Burbea-Rao centroids in Statistics, and show the link with Bhattacharyya distances. A wide class of statistical parametric models can be handled in a unified manner as exponential families [23]. The classes of exponential families contain many of the standard parametric models, including the Poisson, Gaussian, multinomial, and Gamma/Beta distributions, to name a few prominent members. However, only a few closed-form formulas for the statistical Bhattacharyya distances between those densities are reported in the literature. (Footnote 7: For instance, the Bhattacharyya distance between multivariate normal distributions is given in [24].)

For the second part, our contributions are reviewed as follows:

- We show that the (skew) Bhattacharyya distances calculated for distributions belonging to the same exponential family in Statistics are equivalent to (skew) Burbea-Rao divergences. We mention the corresponding closed-form formulas for computing Chernoff coefficients and α-divergences of exponential families. In the limit case, we obtain an alternative proof showing that the Kullback-Leibler divergence of members of the same exponential family is equivalent to a Bregman divergence calculated on the natural parameters [14].
- We approximate iteratively the Bhattacharyya centroid of any set of distributions of the same exponential family (including multivariate Gaussians) using the Burbea-Rao centroid algorithm. For the case of multivariate Gaussians, we design yet another tailored iterative scheme based on matrix differentials, generalizing the former univariate study of Rigazio et al. [25]. We thus get either the generic way or the tailored way for computing the Bhattacharyya centroids of arbitrary Gaussians.
- As a field application, we show how to simplify Gaussian mixture models using hierarchical clustering, and show experimentally that the results obtained with the Bhattacharyya centroids compare favorably with former results obtained for Bregman centroids [26]. Our numerical experiments show that the generic method outperforms the alternative tailored method for multivariate Gaussians.

The paper is organized as follows. In Section II, we introduce Burbea-Rao divergences as a natural extension of the Jensen-Shannon divergence using the framework of Bregman divergences. It is followed by Section III, which considers the general case of skew divergences and reveals the asymptotic behavior of extremely skewed Burbea-Rao divergences as Bregman divergences. Section IV defines the (skew) Burbea-Rao centroids, shows that they are unique, and presents a simple iterative algorithm with guaranteed convergence. We then consider applications in Statistics in Section V: after briefly recalling exponential family distributions in Section V-A, we show that Bhattacharyya distances and Chernoff/Amari α-divergences are available in closed-form equations as Burbea-Rao divergences for distributions of the same exponential family. Section V-C presents an alternative iterative algorithm tailored to compute the Bhattacharyya centroid of multivariate Gaussians, generalizing the former specialized work of Rigazio et al. [25]. In Section V-D, we use those Bhattacharyya/Burbea-Rao centroids to hierarchically simplify Gaussian mixture models, and comment both qualitatively and quantitatively on our experiments on a color image segmentation application. Finally, Section VI concludes this paper by describing further perspectives and hinting at some information-geometric aspects of this work.

II. BURBEA-RAO DIVERGENCES FROM SYMMETRIZATION OF BREGMAN DIVERGENCES

Let R^+ = [0, +\infty) denote the set of non-negative reals. For a strictly convex (and differentiable) generator F, we define the Burbea-Rao divergence as the following non-negative function:

BR_F: X \times X \to R^+,
(p, q) \mapsto BR_F(p, q) = \frac{F(p) + F(q)}{2} - F\!\left(\frac{p+q}{2}\right) \geq 0.

The non-negativity of those divergences follows straightforwardly from Jensen's inequality. Although Burbea-Rao distances are symmetric (BR_F(p, q) = BR_F(q, p)), they are not metrics since they fail to satisfy the triangle inequality. A geometric interpretation of those divergences is given in Figure 1. Note that F is defined up to an affine term ax + b.

[Fig. 1. Interpreting the Burbea-Rao divergence BR_F(p, q) as the vertical distance between the midpoint of the segment [(p, F(p)), (q, F(q))] and the point ((p+q)/2, F((p+q)/2)) on the graph of F.]

We show that Burbea-Rao divergences extend the Jensen-Shannon divergence using the broader concept of Bregman divergences instead of the Kullback-Leibler divergence. A Bregman divergence [22], [8], [9] B_F is defined as the positive tail of the first-order Taylor expansion of a strictly convex and differentiable function F:

B_F(p, q) = F(p) - F(q) - \langle p - q, \nabla F(q)\rangle,   (19)

where \nabla F denotes the gradient of F (the vector of partial derivatives \frac{\partial F}{\partial x_i}), and \langle x, y\rangle = x^T y the inner product (dot product for vectors). A Bregman divergence is interpreted geometrically [14] as the vertical distance between the tangent plane H_q at q of the graph plot F = \{\hat{x} = (x, F(x)) \mid x \in X\} and its translate H'_q passing through \hat{p} = (p, F(p)). Figure 2 depicts the geometric interpretation of the Bregman divergence (to be compared with the Burbea-Rao divergence in Figure 1).

[Fig. 2. Interpreting the Bregman divergence B_F(p, q) as the vertical distance between the tangent plane at q to the graph of F and its translate passing through (p, F(p)) (with identical slope \nabla F(q)).]

Bregman divergences are never metrics, and they are symmetric only for the generalized quadratic distances [14] obtained by choosing F(x) = x^T Q x, for some positive definite matrix Q \succ 0. Bregman divergences allow one to encapsulate both statistical and geometric distances:

- the Kullback-Leibler divergence, obtained for F(x) = x \log x:

  KL(p, q) = \sum_{i=1}^d p^{(i)} \log\frac{p^{(i)}}{q^{(i)}},   (20)

- the squared Euclidean distance, obtained for F(x) = x^2:

  L_2^2(p, q) = \sum_{i=1}^d (p^{(i)} - q^{(i)})^2 = \|p - q\|^2.   (21)

Basically, there are two ways to symmetrize Bregman divergences (see also work on Bregman metrization [27], [28]):

- Jeffreys-Bregman divergences. We consider half of the double-sided divergences:

  S_F(p; q) = \frac{B_F(p, q) + B_F(q, p)}{2}   (22)
  = \frac{1}{2}\langle p - q, \nabla F(p) - \nabla F(q)\rangle.   (23)

  Except for the generalized quadratic distances, this symmetric distance cannot be interpreted as a Bregman divergence [14].

- Jensen-Bregman divergences. We consider the Jeffreys-Bregman divergences from the source parameters to the average parameter \frac{p+q}{2}:

  J_F(p; q) \stackrel{\mathrm{def}}{=} \frac{B_F(p, \frac{p+q}{2}) + B_F(q, \frac{p+q}{2})}{2}   (24)
  = \frac{F(p) + F(q)}{2} - F\!\left(\frac{p+q}{2}\right) = BR_F(p, q).

Note that even for the negative Shannon entropy F(x) = x \log x - x (extended to positive measures), those two symmetrizations yield different divergences: while S_F uses the gradient \nabla F, J_F relies only on the generator F. Both J_F and S_F always have finite values. (Footnote 8: This may not be the case for Bregman/Kullback-Leibler divergences, which can potentially be unbounded.) The first symmetrization approach was historically studied by Jeffreys [29]. The second way to symmetrize Bregman divergences generalizes the spirit of the Jensen-Shannon divergence [20],

JS(p, q) = \frac{1}{2}\left(KL\!\left(p, \frac{p+q}{2}\right) + KL\!\left(q, \frac{p+q}{2}\right)\right)   (25)
= H\!\left(\frac{p+q}{2}\right) - \frac{H(p) + H(q)}{2},   (26)

with non-negativity that can be derived from Jensen's inequality, hence its name. The Jensen-Shannon divergence is also called the total divergence to the average, a generalized measure of diversity from the population distributions p and q to the average population \frac{p+q}{2}. Those Jensen-difference-type divergences are by definition Burbea-Rao divergences. For the Shannon entropy, these two different information divergence symmetrizations (the Jensen-Shannon divergence and the Jeffreys divergence J) satisfy the following inequality:

\frac{J(p, q)}{4} \geq JS(p, q) \geq 0.   (27)

Nielsen and Nock [7] investigated the centroids with respect to Jeffreys-Bregman divergences (the symmetrized Kullback-Leibler divergence).

III. SKEW BURBEA-RAO DIVERGENCES

We further generalize Burbea-Rao divergences by introducing a positive weight α \in (0, 1) when averaging the source parameters p and q as follows:

BR_F^{(α)}: X \times X \to R^+,
BR_F^{(α)}(p, q) = αF(p) + (1-α)F(q) - F(αp + (1-α)q).

We consider the open interval (0, 1) since otherwise the divergence has no discriminatory power (indeed, for α \in \{0, 1\}, BR_F^{(α)}(p, q) = 0, \forall p, q). Although skewed divergences are asymmetric, BR_F^{(α)}(p, q) \neq BR_F^{(α)}(q, p), we can swap the arguments by replacing α by 1-α:

BR_F^{(α)}(p, q) = αF(p) + (1-α)F(q) - F(αp + (1-α)q) = BR_F^{(1-α)}(q, p).   (28)

Those skew Burbea-Rao divergences are similarly found using a skew Jensen-Bregman counterpart (the gradient terms \nabla F(αp + (1-α)q) cancel perfectly in the sum of skew Bregman divergences):

α B_F(p, αp + (1-α)q) + (1-α) B_F(q, αp + (1-α)q) = αF(p) + (1-α)F(q) - F(αp + (1-α)q) = BR_F^{(α)}(p, q).

In the limit cases α \to 0 or α \to 1, we have BR_F^{(α)}(p, q) \to 0, \forall p, q. That is, those divergences lose their discriminatory power at the extremities. However, we show that those skew Burbea-Rao divergences tend asymptotically to Bregman divergences:

B_F(p, q) = \lim_{α \to 0} \frac{1}{α} BR_F^{(α)}(p, q),   (29)
B_F(q, p) = \lim_{α \to 1} \frac{1}{1-α} BR_F^{(α)}(p, q).   (30)

The limit on the right-hand side of Eq. 30 can be expressed alternatively as the following one-sided limit:

\lim_{α \uparrow 1} \frac{1}{1-α} BR_F^{(α)}(p, q) = \lim_{α \downarrow 0} \frac{1}{α} BR_F^{(α)}(q, p),   (31)

where the arrows \uparrow and \downarrow denote the limit from the left and the limit from the right, respectively (see [30] for notations). The right derivative of a function f at x is defined as f'_+(x) = \lim_{y \downarrow x} \frac{f(y) - f(x)}{y - x}. Since BR_F^{(0)}(p, q) = 0, \forall p, q, it follows that the right-hand-side limit of Eq. 31 is the right derivative (see Theorem 1 of [30], which gives a generalized Taylor expansion of convex functions) of the map

L(α): α \mapsto BR_F^{(α)}(q, p)   (32)

taken at α = 0. Thus we have

\lim_{α \downarrow 0} \frac{1}{α} BR_F^{(α)}(q, p) = L'_+(0).   (33)

Since

L'_+(0) = \frac{d_+}{dα}\left(αF(q) + (1-α)F(p) - F(αq + (1-α)p)\right)\Big|_{α=0}
= F(q) - F(p) - \langle q - p, \nabla F(p)\rangle   (34)
= B_F(q, p),   (35)

we obtain:

Lemma 1: Skew Burbea-Rao divergences tend asymptotically to Bregman divergences (α \to 0) or reverse Bregman divergences (α \to 1).
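The asymptotic behavior of Eq. (29) is easy to verify numerically. Below is a short Python sketch (our own helper names; the Burg-type generator F(x) = -sum log x is chosen only as an example) comparing the scaled skew Burbea-Rao divergence against the corresponding Bregman divergence as α shrinks.

```python
import numpy as np

# Check of Eq. (29): (1/alpha) BR_F^{(alpha)}(p, q) -> B_F(p, q) as alpha -> 0,
# for the separable generator F(x) = -sum_i log x_i with gradient -1/x.
F     = lambda x: -np.sum(np.log(x))
gradF = lambda x: -1.0 / x

def skew_burbea_rao(p, q, alpha):
    return alpha * F(p) + (1 - alpha) * F(q) - F(alpha * p + (1 - alpha) * q)

def bregman(p, q):
    return F(p) - F(q) - np.dot(p - q, gradF(q))

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 1.0, 5.0])

for alpha in (1e-1, 1e-3, 1e-5):
    # The scaled skew divergence approaches the Bregman divergence B_F(p, q).
    print(alpha, skew_burbea_rao(p, q, alpha) / alpha, bregman(p, q))
```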

Thus we may scale skew Burbea-Rao divergences so that Bregman divergences belong to the class of skew Burbea-Rao divergences:

sBR_F^{(α)}(p, q) = \frac{1}{α(1-α)}\left(αF(p) + (1-α)F(q) - F(αp + (1-α)q)\right).   (36)

Moreover, α is then no longer restricted to (0, 1) but ranges over the full real line, α \in \mathbb{R}, as also noticed in [31]. Setting α = \frac{1-α'}{2} (that is, α' = 1 - 2α), we get

sBR_F^{(α')}(p, q) = \frac{4}{1-α'^2}\left(\frac{1-α'}{2}F(p) + \frac{1+α'}{2}F(q) - F\!\left(\frac{1-α'}{2}p + \frac{1+α'}{2}q\right)\right).   (37)

IV. BURBEA-RAO CENTROIDS

Let P = \{p_1, ..., p_n\} denote a d-dimensional point set. To each point, let us further associate a positive weight w_i (accounting for arbitrary multiplicity) and a positive scalar α_i \in (0, 1) to define an anchored distance BR_F^{(α_i)}(·, p_i). Define the skew Burbea-Rao barycenter (or centroid) c as the minimizer of the following optimization task:

(OPT): c = \arg\min_x \sum_{i=1}^n w_i BR_F^{(α_i)}(x, p_i) = \arg\min_x L(x).   (38)

(Footnote 9: We also call them skew Jensen barycenters or centroids since they are induced by a divergence relying on the Jensen inequality.)

Without loss of generality, we consider the argument x in the left position (otherwise, we change all α_i \to 1 - α_i to get the right-sided Burbea-Rao centroid). Removing all terms independent of x, the minimization program (OPT) amounts to equivalently minimizing the following energy function:

E(c) = \left(\sum_{i=1}^n w_i α_i\right) F(c) - \sum_{i=1}^n w_i F(α_i c + (1-α_i)p_i).   (39)

Observe that the energy function is decomposable into the sum of a convex function (\sum_{i=1}^n w_i α_i)F(c) and a concave function -\sum_{i=1}^n w_i F(α_i c + (1-α_i)p_i) (since a sum of concave functions is concave). We can thus solve this optimization problem iteratively using the Convex-ConCave Procedure [32], [33] (CCCP), starting from an initial position c_0 (say, the barycenter c_0 = \sum_{i=1}^n w_i p_i), and iteratively updating the barycenter as follows:

\nabla F(c_{t+1}) = \frac{1}{\sum_{i=1}^n w_i α_i} \sum_{i=1}^n w_i α_i \nabla F(α_i c_t + (1-α_i)p_i),   (40)

c_{t+1} = (\nabla F)^{-1}\left(\frac{1}{\sum_{i=1}^n w_i α_i} \sum_{i=1}^n w_i α_i \nabla F(α_i c_t + (1-α_i)p_i)\right).   (41)

Since F is strictly convex, the second-order derivative \nabla^2 F is positive definite, and \nabla F is strictly monotone increasing. Thus we can interpret Eq. 41 as a fixed-point equation by considering the \nabla F-representation. Each iteration is interpreted as a quasi-arithmetic mean. This proves that the Burbea-Rao centroid is always well-defined and unique (see the Appendix for a detailed proof), since there is (at most) a unique fixed point of x = g(x) for a strictly monotone increasing function g(·).

In some cases, like the squared Euclidean distance (or squared Mahalanobis distances), we find closed-form solutions for the Burbea-Rao barycenters. For example, consider the (negative) quadratic entropy F(x) = \langle x, x\rangle = \sum_{i=1}^d (x^{(i)})^2 with weights w_i and all α_i = \frac{1}{2} (non-skew, symmetric Burbea-Rao divergences). We have

\min_x E(x) = \frac{F(x)}{2} - \sum_{i=1}^n w_i F\!\left(\frac{p_i + x}{2}\right)   (42)
= \min_x \frac{\langle x, x\rangle}{2} - \frac{1}{4}\sum_{i=1}^n w_i\left(\langle x, x\rangle + 2\langle x, p_i\rangle + \langle p_i, p_i\rangle\right).

The minimum is obtained when the gradient \nabla E(x) = 0, that is, when x = \bar{p} = \sum_{i=1}^n w_i p_i, the barycenter of the point set P. For most Burbea-Rao divergences, Eq. 42 can only be solved numerically.

Observe that for the extremal skew cases (α \to 0 or α \to 1), we obtain the Bregman centroids in closed form (see Eq. 30). Thus skew Burbea-Rao centroids allow one to obtain a smooth transition from the right-sided centroid (the center of mass) to the left-sided centroid (a quasi-arithmetic mean M_f obtained for f = \nabla F, a continuous and strictly increasing function).

Theorem 1: Skew Burbea-Rao centroids are unique. They can be estimated iteratively using the CCCP algorithm. In the extremal skew cases, the Burbea-Rao centroids tend to the Bregman left/right-sided centroids, and admit closed-form equations in those limit cases.

To describe the orbit of Burbea-Rao centroids linking the left-sided to the right-sided Bregman centroid, we compute for α \in [0, 1] the skew Burbea-Rao centroids with the following update scheme:

c_{t+1} = (\nabla F)^{-1}\left(\sum_{i=1}^n w_i \nabla F(α c_t + (1-α)p_i)\right).   (43)
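The CCCP fixed-point iteration of Eq. (41) is only a few lines of code. The Python sketch below (our own function names, with the Burg-type generator F(x) = -sum log x chosen only as an example whose gradient and inverse gradient are explicit) illustrates the non-skew case α_i = 1/2, initialized at the barycenter.

```python
import numpy as np

# Sketch of the CCCP iteration of Eq. (41) for the Burbea-Rao centroid,
# generator F(x) = -sum_i log x_i, so gradF(x) = -1/x and (gradF)^{-1}(y) = -1/y.
gradF     = lambda x: -1.0 / x
gradF_inv = lambda y: -1.0 / y

def burbea_rao_centroid(points, weights, alpha=0.5, iters=50):
    pts = np.asarray(points, float)
    w = np.asarray(weights, float); w = w / w.sum()
    c = np.average(pts, axis=0, weights=w)      # initialize with the barycenter
    for _ in range(iters):
        g = sum(wi * gradF(alpha * c + (1 - alpha) * pi) for wi, pi in zip(w, pts))
        c = gradF_inv(g)                        # quasi-arithmetic mean update
    return c

pts = np.array([[1.0, 2.0], [3.0, 1.0], [2.0, 5.0]])
print(burbea_rao_centroid(pts, np.ones(3)))
```

With all weights equal and all α_i = α, the normalization in Eq. (41) cancels, which is why the update reduces to a quasi-arithmetic mean of the shifted points αc_t + (1-α)p_i.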
X X E(c) = ( w α )F (c) w F (α c + (1 α )p ) (39) To describe the orbit of Burbea-Rao centroids linking the i i − i i − i i i=1 i=1 left to right sided Bregman centroids, we compute for α ∈ Observe that the energy function is decomposable in the [0, 1] the skew Burbea-Rao centroids with the following update Pn scheme: sum of a convex function ( i=1 wiαi)F (c) with a concave Pn function i=1 wiF (αic + (1 αi)pi) (since the sum of n − − n ! 9 1 X We also call them skew Jensen barycenters or centroids since they are c = F − w F (αc + (1 α)p ) (43) t+1 ∇ i∇ t − i induced by a divergence using the Jensen inequality. i=1 IEEE TRANSACTIONS ON INFORMATION THEORY 57(8):5455-5466, 2011. 7

We may further consider various convex generators F_i, one for each point, and consider the updating scheme

c_{t+1} = \left(\sum_{i=1}^n w_i \nabla F_i\right)^{-1}\left(\frac{1}{\sum_{i=1}^n w_i α_i}\sum_{i=1}^n w_i α_i \nabla F_i(α_i c_t + (1-α_i)p_i)\right).

A. Burbea-Rao divergences of a population

Consider now the Burbea-Rao divergence of a population p_1, ..., p_n with respective positive normalized weights w_1, ..., w_n. The Burbea-Rao divergence of the population is defined by

BR_F^w(p_1, ..., p_n) = \sum_{i=1}^n w_i F(p_i) - F\!\left(\sum_{i=1}^n w_i p_i\right) \geq 0.   (44)

This family of diversity measures includes the Jensen-Rényi divergences [34], [35] for F(x) = -R_α(x), where R_α(x) = \frac{1}{1-α}\log\sum_{j=1}^d p_j^α is the Rényi entropy of order α. (The Rényi entropy is concave for α \in (0, 1) and tends to the Shannon entropy as α \to 1.)

V. BHATTACHARYYA DISTANCES AS BURBEA-RAO DISTANCES

We first briefly recall the versatile class of exponential family distributions in Section V-A. Then we show in Section V-B that the statistical Bhattacharyya/Chernoff distances between exponential family distributions amount to computing a Burbea-Rao divergence.

A. Exponential family distributions in Statistics

Many usual statistical parametric distributions p(x; λ) (e.g., Gaussian, Poisson, Bernoulli/multinomial, Gamma/Beta, etc.) share common properties arising from their common canonical decomposition [9]:

p(x; λ) = p_F(x; θ) = \exp\left(\langle t(x), θ\rangle - F(θ) + k(x)\right).   (45)

Those distributions (Footnote 10: The distributions can either be discrete or continuous. We do not introduce the unifying framework of probability measures in order not to burden the paper.) are said to belong to the exponential families (see [23] for a tutorial). An exponential family is characterized by its log-normalizer F(θ), and a distribution in that family by its natural parameter θ belonging to the natural space Θ. The log-normalizer F is strictly convex and C^\infty, and can also be expressed in the source coordinate system λ using the 1-to-1 map τ: Λ \to Θ that converts parameters from the source coordinate system λ to the natural coordinate system θ:

F(θ) = F(τ(λ)) = (F \circ τ)(λ) = F_λ(λ),   (46)

where F_λ = F \circ τ denotes the log-normalizer function expressed in the λ-coordinates instead of the natural θ-coordinates. The vector t(x) denotes the sufficient statistics, that is, the set of linearly independent functions that allows one to concentrate, without any loss, all the information about the parameter θ carried in the i.i.d. observations x_1, x_2, .... The inner product \langle p, q\rangle is defined according to the primitive type of θ. Namely, it is a multiplication \langle p, q\rangle = pq for scalars, a dot product \langle p, q\rangle = p^T q for vectors, a matrix trace \langle p, q\rangle = \mathrm{tr}(p^T q) = \mathrm{tr}(p q^T) for matrices, etc. For composite types, such as θ being defined by both a vector part and a matrix part, the composite inner product is defined as the sum of the inner products on the primitive types. Finally, k(x) represents the carrier measure with respect to the counting or Lebesgue measure. Decompositions for most common exponential family distributions are given in [23]. An exponential family E_F = \{p_F(x; θ) \mid θ \in Θ\} is the set of probability distributions obtained for the same log-normalizer function F. Information geometry considers E_F as a manifold entity, and studies its differential geometric properties [12].

For example, consider the family of Poisson distributions with mass function

p(x; λ) = \frac{λ^x}{x!}\exp(-λ),   (47)

for x \in \mathbb{N}_+ = \mathbb{N} \cup \{0\} a non-negative integer. Poisson distributions form a univariate exponential family (x \in \mathbb{N}_+) of order 1 (parameter λ). The canonical decomposition yields
- the sufficient statistic t(x) = x,
- the natural parameter θ = \log λ,
- the log-normalizer F(θ) = \exp θ,
- and the carrier measure k(x) = -\log x! (with respect to the counting measure).

Since we deal with applications using multivariate normals in the following, we also report explicitly the canonical decomposition of the multivariate Gaussian family \{p_F(x; θ) \mid θ \in Θ\}. We rewrite the usual Gaussian density of mean μ and variance-covariance matrix Σ,

p(x; λ) = p(x; μ, Σ) = \frac{1}{(2π)^{d/2}\sqrt{\det Σ}}\exp\left(-\frac{(x-μ)^T Σ^{-1}(x-μ)}{2}\right),   (48)-(49)

in the canonical form of Eq. 45 with
- θ = (Σ^{-1}μ, \frac{1}{2}Σ^{-1}) \in Θ = \mathbb{R}^d \times K_{d\times d}, where K_{d\times d} denotes the cone of positive definite matrices,
- F(θ) = \frac{1}{4}\mathrm{tr}(θ_2^{-1}θ_1θ_1^T) - \frac{1}{2}\log\det θ_2 + \frac{d}{2}\log π,
- t(x) = (x, -x x^T),
- k(x) = 0.

In this case, the inner product is composite, and is calculated as the sum of a dot product and a matrix trace:

\langle θ, θ'\rangle = θ_1^T θ_1' + \mathrm{tr}(θ_2^T θ_2').   (50)

The coordinate transformation τ: Λ \to Θ is given for λ = (μ, Σ) by

τ(λ) = \left(λ_2^{-1}λ_1, \frac{1}{2}λ_2^{-1}\right),   (51)

and its inverse mapping τ^{-1}: Θ \to Λ by

τ^{-1}(θ) = \left(\frac{1}{2}θ_2^{-1}θ_1, \frac{1}{2}θ_2^{-1}\right).   (52)

B. Bhattacharyya/Chernoff coefficients and α-divergences as skew Burbea-Rao divergences

For arbitrary probability distributions p(x) and q(x) (parametric or not), we measure the amount of overlap between those distributions using the Bhattacharyya coefficient [36]:

C(p, q) = \int \sqrt{p(x) q(x)}\, dx.   (53)

Clearly, the Bhattacharyya coefficient (measuring the affinity between distributions [37]) falls in the unit range:

0 \leq C(p, q) \leq 1.   (54)

In fact, we may interpret this coefficient geometrically by considering \sqrt{p(x)} and \sqrt{q(x)} as unit vectors. The Bhattacharyya coefficient is then their inner product, representing the cosine of the angle made by the two unit vectors. The Bhattacharyya distance B: X \times X \to R^+ is derived from its coefficient [36] as

B(p, q) = -\ln C(p, q).   (55)

The Bhattacharyya distance allows one to both upper and lower bound the Bayes' classification error [38], [39], while there are no such results for the symmetric Kullback-Leibler divergence. Both the Bhattacharyya distance and the symmetric Kullback-Leibler divergence agree with the Fisher information at the infinitesimal level. Although the Bhattacharyya distance is symmetric, it is not a metric. Nevertheless, it can be metrized by transforming it into the following Hellinger metric [40]:

H(p, q) = \sqrt{\frac{1}{2}\int\left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 dx},   (56)

such that 0 \leq H(p, q) \leq 1. It follows that

H(p, q) = \sqrt{\frac{1}{2}\left(\int p(x)dx + \int q(x)dx - 2\int\sqrt{p(x)}\sqrt{q(x)}\,dx\right)} = \sqrt{1 - C(p, q)}.   (57)

The Hellinger metric is also called the Matusita metric [37] in the literature. The thesis of Hellinger was emphasized in the work of Kakutani [41].

We consider a direct generalization of Bhattacharyya coefficients and divergences called Chernoff divergences (Footnote 11: In the literature, the Chernoff information is also defined as -\log\inf_{α\in[0,1]}\int p^α(x)q^{1-α}(x)dx. Similarly, Chernoff coefficients C_α(p, q) are defined as the supremum C_α(p, q) = \sup_{α\in[0,1]}\int p^α(x)q^{1-α}(x)dx.):

B_α(p, q) = -\ln\int_x p^α(x)q^{1-α}(x)dx = -\ln C_α(p, q)   (58)
= -\ln\int_x q(x)\left(\frac{p(x)}{q(x)}\right)^α dx   (59)
= -\ln E_q[L^α(x)],   (60)

defined for some α \in (0, 1) (the Bhattacharyya divergence is obtained for α = \frac{1}{2}), where E[·] denotes the expectation, and L(x) = \frac{p(x)}{q(x)} the likelihood ratio. The term \int_x p^α(x)q^{1-α}(x)dx is called the Chernoff coefficient. The Bhattacharyya/Chernoff distance of members of the same exponential family yields a weighted asymmetric Burbea-Rao divergence (namely, a skew Burbea-Rao divergence):

B_α(p_F(x; θ_p), p_F(x; θ_q)) = BR_F^{(α)}(θ_p, θ_q)   (61)

with

BR_F^{(α)}(θ_p, θ_q) = αF(θ_p) + (1-α)F(θ_q) - F(αθ_p + (1-α)θ_q).   (62)

Chernoff coefficients are also related to α-divergences, the canonical divergences in α-flat spaces in information geometry [12] (p. 57):

D_α(p\|q) =
  \frac{4}{1-α^2}\left(1 - \int p(x)^{\frac{1-α}{2}} q(x)^{\frac{1+α}{2}} dx\right),   α \neq \pm 1,
  \int p(x)\log\frac{p(x)}{q(x)}dx = KL(p, q),   α = -1,
  \int q(x)\log\frac{q(x)}{p(x)}dx = KL(q, p),   α = 1.   (63)

The class of α-divergences satisfies the reference duality D_α(p\|q) = D_{-α}(q\|p). Remapping α' = \frac{1-α}{2} (α = 1 - 2α'), we transform Amari α-divergences into Chernoff α'-divergences (Footnote 12: Chernoff coefficients are also related to the Rényi α-divergence generalizing the Kullback-Leibler divergence, R_α(p\|q) = \frac{1}{α-1}\log\int_x p(x)^α q^{1-α}(x)dx, built on the Rényi entropy H^R_α(p) = \frac{1}{1-α}\log\int_x p^α(x)dx. The Tsallis entropy H^T_α(p) = \frac{1}{α-1}(1 - \int p(x)^α dx) can also be obtained from the Rényi entropy (and vice versa) via the mappings H^T_α(p) = \frac{1}{1-α}(e^{(1-α)H^R_α(p)} - 1) and H^R_α(p) = \frac{1}{1-α}\log(1 + (1-α)H^T_α(p)).):

D_{α'}(p, q) =
  \frac{1}{α'(1-α')}\left(1 - \int p(x)^{α'} q(x)^{1-α'} dx\right),   α' \notin \{0, 1\},
  \int p(x)\log\frac{p(x)}{q(x)}dx = KL(p, q),   α' = 1,
  \int q(x)\log\frac{q(x)}{p(x)}dx = KL(q, p),   α' = 0.   (64)

Theorem 2: The Chernoff α'-divergence (α \neq \pm 1) of distributions belonging to the same exponential family is given in closed form by means of a skewed Burbea-Rao divergence as D_{α'}(p, q) = \frac{1}{α'(1-α')}\left(1 - e^{-BR_F^{(α')}(θ_p, θ_q)}\right), with BR_F^{(α)}(θ_p, θ_q) = αF(θ_p) + (1-α)F(θ_q) - F(αθ_p + (1-α)θ_q). The Amari α-divergence for members of the same exponential family amounts to computing D_α(p, q) = \frac{4}{1-α^2}\left(1 - e^{-BR_F^{(\frac{1-α}{2})}(θ_p, θ_q)}\right).

Let us compute the Chernoff coefficient for distributions belonging to the same exponential family. Without loss of generality, consider the reduced canonical form of exponential families p_F(x; θ) = \exp(\langle x, θ\rangle - F(θ)). The Chernoff coefficient C_α(p, q) of members p = p_F(x; θ_p) and q = p_F(x; θ_q) of the same exponential family E_F is:

C_α(p, q) = \int p^α(x) q^{1-α}(x)\, dx = \int p_F^α(x; θ_p)\, p_F^{1-α}(x; θ_q)\, dx
= \int \exp\left(α(\langle x, θ_p\rangle - F(θ_p))\right) \exp\left((1-α)(\langle x, θ_q\rangle - F(θ_q))\right) dx
= \int \exp\left(\langle x, αθ_p + (1-α)θ_q\rangle - (αF(θ_p) + (1-α)F(θ_q))\right) dx
= \exp\left(-(αF(θ_p) + (1-α)F(θ_q))\right) \int \exp\left(\langle x, αθ_p + (1-α)θ_q\rangle - F(αθ_p + (1-α)θ_q) + F(αθ_p + (1-α)θ_q)\right) dx
= \exp\left(F(αθ_p + (1-α)θ_q) - (αF(θ_p) + (1-α)F(θ_q))\right) \underbrace{\int p_F(x; αθ_p + (1-α)θ_q)\, dx}_{=1}
= \exp\left(-BR_F^{(α)}(θ_p, θ_q)\right) \geq 0.
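The identity just derived is easy to check numerically. The Python sketch below (our own helper names) computes the Bhattacharyya distance (α = 1/2) between two multivariate Gaussians as BR_F^{(1/2)} on the natural parameters, using the log-normalizer of Section V-A, and compares it with the Gaussian closed form reported later in Table I.

```python
import numpy as np

def natural_params(mu, Sigma):
    P = np.linalg.inv(Sigma)
    return P @ mu, 0.5 * P                      # theta = (Sigma^{-1} mu, Sigma^{-1}/2)

def log_normalizer(theta1, theta2):
    d = theta1.shape[0]                         # F(theta) of Section V-A
    return (0.25 * theta1 @ np.linalg.inv(theta2) @ theta1
            - 0.5 * np.log(np.linalg.det(theta2)) + 0.5 * d * np.log(np.pi))

def bhattacharyya_gauss(mu_p, S_p, mu_q, S_q):
    tp, Tp = natural_params(mu_p, S_p)
    tq, Tq = natural_params(mu_q, S_q)
    F = log_normalizer
    # BR_F^{(1/2)}(theta_p, theta_q) = (F(theta_p)+F(theta_q))/2 - F((theta_p+theta_q)/2)
    return 0.5 * F(tp, Tp) + 0.5 * F(tq, Tq) - F(0.5 * (tp + tq), 0.5 * (Tp + Tq))

mu_p, S_p = np.array([0.0, 1.0]), np.array([[2.0, 0.3], [0.3, 1.0]])
mu_q, S_q = np.array([1.0, -1.0]), np.array([[1.0, 0.0], [0.0, 3.0]])

S = 0.5 * (S_p + S_q)
closed_form = (0.125 * (mu_p - mu_q) @ np.linalg.inv(S) @ (mu_p - mu_q)
               + 0.5 * np.log(np.linalg.det(S)
                              / np.sqrt(np.linalg.det(S_p) * np.linalg.det(S_q))))
print(bhattacharyya_gauss(mu_p, S_p, mu_q, S_q), closed_form)   # values agree
```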

We get the following theorem for Bhattacharyya/Chernoff distances:

Theorem 3: The skew Bhattacharyya divergence B_α(p, q) is equivalent to a Burbea-Rao divergence for members of the same exponential family E_F: B_α(p, q) = B_α(p_F(x; θ_p), p_F(x; θ_q)) = -\log C_α(p_F(x; θ_p), p_F(x; θ_q)) = BR_F^{(α)}(θ_p, θ_q) \geq 0. In particular, for α = \pm 1, the Kullback-Leibler divergence of those exponential family distributions amounts to computing a Bregman divergence [14] (obtained by taking the limit α \to 1 or α \to 0).

Corollary 1: In the limit cases α' \in \{0, 1\}, the α'-divergences amount to computing a Kullback-Leibler divergence, which is equivalent to computing a Bregman divergence for the log-normalizer on the swapped natural parameters: KL(p_F(x; θ_p), p_F(x; θ_q)) = B_F(θ_q, θ_p).

Proof: The proof relies on the equivalence of Burbea-Rao divergences to Bregman divergences for extremal values of α \in \{0, 1\}:

KL(p, q) = KL(p_F(x; θ_p), p_F(x; θ_q))   (65)
= \lim_{α' \to 1} D_{α'}(p_F(x; θ_p), p_F(x; θ_q))   (66)
= \lim_{α' \to 1} \frac{1}{α'(1-α')}\left(1 - C_{α'}(p_F(x; θ_p), p_F(x; θ_q))\right)   (since \exp x \simeq 1 + x for x \simeq 0)
= \lim_{α' \to 1} \frac{1}{α'(1-α')} BR_F^{(α')}(θ_p, θ_q)   (67)   (with BR_F^{(α')}(θ_p, θ_q) \simeq (1-α')B_F(θ_q, θ_p) as α' \to 1)
= \lim_{α' \to 1} \frac{1}{α'} B_F(θ_q, θ_p) = B_F(θ_q, θ_p).   (68)

Similarly, we have \lim_{α' \to 0} D_{α'}(p_F(x; θ_p), p_F(x; θ_q)) = KL(p_F(x; θ_q), p_F(x; θ_p)) = B_F(θ_p, θ_q).

Table I reports the Bhattacharyya distances for members of the same exponential families.

C. Direct method for calculating the Bhattacharyya centroids of multivariate normals

To the best of our knowledge, the Bhattacharyya centroid has only been studied for univariate Gaussian or diagonal multivariate Gaussian distributions [42] in the context of speech recognition, where it is reported that it can be estimated using an iterative algorithm (no convergence guarantees are reported in [42]). In order to compare this scheme on multivariate data with our generic Burbea-Rao scheme, we extend the approach of Rigazio et al. [42] to multivariate Gaussians. Plugging the Bhattacharyya distance of Gaussians into the energy function of the optimization problem (OPT), we get

L(c) = \sum_{i=1}^n \frac{1}{8}(μ_c - μ_i)^T\left(\frac{Σ_c + Σ_i}{2}\right)^{-1}(μ_c - μ_i) + \frac{1}{2}\log\frac{\det\frac{Σ_c+Σ_i}{2}}{\sqrt{\det Σ_c \det Σ_i}}.   (69)

This is equivalent to minimizing the following energy:

F(c) = \sum_{i=1}^n (μ_c - μ_i)^T(Σ_c + Σ_i)^{-1}(μ_c - μ_i) + 2\log\det(Σ_c + Σ_i) - \log\det Σ_c - \log\left(2^{2d}\det Σ_i\right).   (70)

In order to minimize F(c), let us differentiate it with respect to μ_c. Let U_i denote (Σ_c + Σ_i)^{-1}. Using matrix differentials [43] (p. 10, Eq. 73), we get

\frac{\partial L}{\partial μ_c} = \sum_{i=1}^n \left(U_i + U_i^T\right)(μ_c - μ_i).   (71)

Then one can estimate μ_c iteratively, since U_i depends on Σ_c, which is unknown. We update μ_c as follows:

μ_c(t+1) = \left[\sum_{i=1}^n \left(U_i + U_i^T\right)\right]^{-1}\left[\sum_{i=1}^n \left(U_i + U_i^T\right)μ_i\right].   (72)

TABLE I
CLOSED-FORM BHATTACHARYYA DISTANCES FOR SOME CLASSES OF EXPONENTIAL FAMILIES (EXPRESSED IN THE SOURCE PARAMETERS FOR EASE OF USE).

Exponential family | τ: λ → θ | F(θ) (up to a constant) | Bhattacharyya/Burbea-Rao BR_F(λ_p, λ_q) = BR_F(τ(λ_p), τ(λ_q))
Multinomial | θ_i = log(p_i/p_d) | log(1 + Σ_{i=1}^{d-1} exp θ_i) | -ln Σ_{i=1}^d √(p_i q_i)
Poisson | θ = log λ | exp θ | (1/2)(√λ_p - √λ_q)^2
Gaussian | (θ_1 = μ/σ^2, θ_2 = -1/(2σ^2)) | -θ_1^2/(4θ_2) + (1/2) log(-π/θ_2) | (1/4)(μ_p - μ_q)^2/(σ_p^2 + σ_q^2) + (1/2) ln((σ_p^2 + σ_q^2)/(2σ_pσ_q))
Multivariate Gaussian | (θ = Σ^{-1}μ, Θ = (1/2)Σ^{-1}) | (1/4) tr(Θ^{-1}θθ^T) - (1/2) log det Θ | (1/8)(μ_p - μ_q)^T((Σ_p + Σ_q)/2)^{-1}(μ_p - μ_q) + (1/2) ln( det((Σ_p + Σ_q)/2) / √(det Σ_p det Σ_q) )
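As a quick numerical check of the Poisson row of Table I, the sketch below (our own helper names) evaluates BR_F^{(1/2)} on the natural parameters θ = log λ with log-normalizer F(θ) = exp θ and compares it with the closed form (1/2)(√λ_p - √λ_q)^2.

```python
import math

# Bhattacharyya distance of two Poisson laws via BR_F^{(1/2)} on natural parameters.
def bhattacharyya_poisson(lam_p, lam_q):
    F = math.exp                                   # log-normalizer of the Poisson family
    tp, tq = math.log(lam_p), math.log(lam_q)      # natural parameters theta = log(lambda)
    return 0.5 * F(tp) + 0.5 * F(tq) - F(0.5 * (tp + tq))

lam_p, lam_q = 3.0, 7.0
closed_form = 0.5 * (math.sqrt(lam_p) - math.sqrt(lam_q)) ** 2
print(bhattacharyya_poisson(lam_p, lam_q), closed_form)   # identical values
```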

Now let us estimate Σ_c. We use matrix differentials [43] (p. 9, Eq. 55 for the first term, and p. 8, Eq. 51 for the two others):

\frac{\partial L}{\partial Σ_c} = -\sum_{i=1}^n U_i^T(μ_c - μ_i)(μ_c - μ_i)^T U_i^T + 2\sum_{i=1}^n U_i^T - \sum_{i=1}^n Σ_c^{-T}.   (73)

Taking into account the fact that Σ_c is symmetric, the differential calculus on symmetric matrices can simply be estimated as

\frac{dL}{dΣ_c} = \frac{\partial L}{\partial Σ_c} + \left(\frac{\partial L}{\partial Σ_c}\right)^T - \mathrm{diag}\left(\frac{\partial L}{\partial Σ_c}\right).   (74)

Thus, if one denotes

A = \sum_{i=1}^n 2U_i^T - U_i^T(μ_c - μ_i)(μ_c - μ_i)^T U_i^T,   (75)

and recalling that Σ_c is symmetric, one has to solve

n\left(2Σ_c^{-1} - \mathrm{diag}(Σ_c^{-1})\right) = A + A^T - \mathrm{diag}(A).   (76)

Let

B = A + A^T - \mathrm{diag}(A).   (77)

Then one can estimate Σ_c iteratively as follows:

Σ_c^{(k+1)} = 2n\left[B^{(k)} + \mathrm{diag}(B^{(k)})\right]^{-1}.   (78)

Let us now compare the two methods, the generic Burbea-Rao scheme and the tailored Gaussian scheme, for computing the Bhattacharyya centroids of multivariate Gaussians.

D. Applications to mixture simplification in statistics

Simplifying Gaussian mixtures is important in many applications arising in signal processing [26]. Mixture simplification is also a crucial step when one wants to study the Riemannian geometry induced by the Rao distance with respect to the Fisher metric: the set of mixture models needs to have the same number of components, so that we simplify source mixtures to get a set of Gaussian mixtures of prescribed size. We adapt the hierarchical clustering algorithm of Garcia et al. [26] by replacing the symmetrized Bregman centroid (namely, the Jeffreys-Bregman centroid) by the Bhattacharyya centroid. We consider the task of color image segmentation by learning a Gaussian mixture model for each image. Each image is represented as a set of 5D points (color RGB and position xy).

The first experimental results, depicted in Figure 3, demonstrate the qualitative stability of the clustering performance. In particular, the hierarchical clustering with respect to the Bhattacharyya distance performs qualitatively much better on the last colormap image. (Footnote 13: See reference images and segmentation using Bregman centroids at http://www.informationgeometry.org/MEF/)

The second experiment focuses on characterizing the numerical convergence of the generic Burbea-Rao method compared to the tailored Gaussian method. Since we presented two different novel schemes to compute the Bhattacharyya centroids of multivariate Gaussians, we wish to compare them, both in terms of stability and accuracy. Whenever the ratio of the Bhattacharyya distance energy function between those estimated centroids is greater than 1%, we consider that one of the two estimation methods is beaten (namely, the method that gives the higher Bhattacharyya distance). Among the 760 centroids computed to generate Figure 3, 100% were correct with the Burbea-Rao approach, while only 87% were correct with the tailored multivariate Gaussian matrix optimization method. The average number of iterations to reach the 1% accuracy is 4.1 for the Burbea-Rao estimation algorithm, and 5.2 for the alternative method. Thus we experimentally checked that the generic CCCP iterative Burbea-Rao algorithm described for computing the Bhattacharyya centroids always converges, and moreover beats another ad-hoc iterative method tailored for multivariate Gaussians.

VI. CONCLUDING REMARKS

In this paper, we have shown that the Bhattacharyya distance for distributions of the same statistical exponential family can be computed equivalently as a Burbea-Rao divergence on the corresponding natural parameters. Those results extend to skew Chernoff coefficients (and Amari α-divergences) and skew Bhattacharyya distances using the notion of skew Burbea-Rao divergences. We proved that (skew) Burbea-Rao centroids are unique, and can be efficiently estimated using an iterative concave-convex procedure with guaranteed convergence. We have shown that extremely skewed Burbea-Rao divergences amount asymptotically to evaluating Bregman divergences. This work emphasizes the attractiveness of exponential families in Statistics. Indeed, it turns out that many statistical distances can be evaluated in closed form for them. For the sake of brevity, we have not mentioned the recent β-divergences and γ-divergences [44], although their distances on exponential families are again available in closed form.

[Fig. 3. Color image segmentation results: (a) source images, (b) segmentation with k = 48 5D Gaussians, and (c) segmentation with k = 16 5D Gaussians.]

The differential Riemannian geometry induced by the class of such Jensen difference measures was studied by Burbea and Rao [18], [19], who built quadratic differential metrics on probability spaces using Jensen differences. The Jensen-Shannon divergence is also an instance of a broad class of divergences called the f-divergences. An f-divergence I_f is a statistical measure of dissimilarity defined by the functional I_f(p, q) = \int p(x) f\!\left(\frac{q(x)}{p(x)}\right) dx. It turns out that the Jensen-Shannon divergence is an f-divergence for the generator

f(x) = \frac{1}{2}\left((x+1)\log\frac{2}{x+1} + x\log x\right).   (79)

f-divergences preserve information monotonicity [44], and their differential geometry was studied by Vos [45]. However, the Jensen-Shannon divergence is a very particular case of Burbea-Rao divergences, since the squared Euclidean distance (another Burbea-Rao divergence) does not belong to the class of f-divergences.

SOURCE CODE

The generic Burbea-Rao barycenter estimation algorithm shall be released in the jMEF open source library: http://www.informationgeometry.org/MEF/ An applet visualizing the skew Burbea-Rao centroids, ranging from the right-sided to the left-sided Bregman centroids, is available at: http://www.informationgeometry.org/BurbeaRao/

ACKNOWLEDGMENTS

We gratefully acknowledge financial support from the French agency DIGITEO (GAS 2008-16D) and the French National Research Agency (ANR GAIA 07-BLAN-0328-01), and Sony Computer Science Laboratories, Inc. We are very grateful to the reviewers for their thorough and thoughtful comments and suggestions. In particular, we are thankful to the anonymous referee who pointed out a rigorous proof of Lemma 1.

APPENDIX
PROOF OF UNIQUENESS OF THE BURBEA-RAO CENTROIDS

Consider without loss of generality the Burbea-Rao centroid (also called Jensen centroid) defined as the minimizer of

c = \arg\min_x \sum_{i=1}^n \frac{1}{n} J_F(p_i, x),

where J_F(p, q) = \frac{F(p)+F(q)}{2} - F\!\left(\frac{p+q}{2}\right) \geq 0. For the sake of simplicity, let us consider univariate generators. The Jensen divergence may not be convex, as J'(x, p) = \frac{F'(x)}{2} - \frac{1}{2}F'\!\left(\frac{x+p}{2}\right) and J''(x, p) = \frac{1}{2}F''(x) - \frac{1}{4}F''\!\left(\frac{x+p}{2}\right) can alternately be positive or negative (see [46]). In general, minimizing an average non-convex divergence may a priori yield many local minima [47]. It is remarkable that the centroid induced by a Jensen divergence is unique although the problem may not be convex.

The proof of uniqueness of the Burbea-Rao centroid and the convergence of the CCCP approximation algorithm rely on the "interness" property (called compensativeness in [48]; a fact following from the monotonicity of the generator function \nabla F) of quasi-arithmetic means:

\min_{i=1}^n p_i \leq M_{\nabla F}(p_1, ..., p_n) \leq \max_{i=1}^n p_i,

with

M_{\nabla F}(p_1, ..., p_n) = (\nabla F)^{-1}\left(\frac{1}{n}\sum_{i=1}^n \nabla F(p_i)\right),

for a strictly convex function F (and hence a strictly monotone increasing gradient \nabla F). The interness property of quasi-arithmetic means ensures that the output is indeed a mean value contained within the extremal values.

For the sake of simplicity, let us first consider a univariate convex generator F with the p_i's in increasing order: p_1 \leq ... \leq p_n. Let initially c_0 \in [p_1^{(0)} = p_1, p_n^{(0)} = p_n]. Since c_1 = M_{\nabla F}\!\left(p_1^{(1)} = \frac{p_1+c_0}{2}, ..., p_n^{(1)} = \frac{p_n+c_0}{2}\right) is a quasi-arithmetic mean, we necessarily have c_1 \in [p_1^{(1)}, p_n^{(1)}] and p_n^{(1)} - p_1^{(1)} = \frac{c_0+p_n}{2} - \frac{c_0+p_1}{2} = \frac{p_n - p_1}{2}. Thus the CCCP iterations induce a sequence of iterated quasi-arithmetic means c_t such that

c_t = M_{\nabla F}\!\left(p_1^{(t)} = \frac{p_1^{(t-1)} + c_{t-1}}{2}, ..., p_n^{(t)} = \frac{p_n^{(t-1)} + c_{t-1}}{2}\right),   c_t \in [p_1^{(t)}, p_n^{(t)}],

with

p_n^{(t)} - p_1^{(t)} = \frac{c_{t-1} + p_n^{(t-1)}}{2} - \frac{c_{t-1} + p_1^{(t-1)}}{2} = \frac{p_n^{(t-1)} - p_1^{(t-1)}}{2} = \frac{1}{2^t}(p_n - p_1).

It follows that the sequence of centroid approximations c_t converges in the limit to a unique centroid c^*. That is, the Burbea-Rao centroids exist and are unique for any strictly convex generator F. The centroid can be approximated within \frac{1}{2^t} relative precision after t iterations (linear convergence of the CCCP). Since the CCCP iterations yield both an approximation c_t and a range [p_1^{(t)}, p_n^{(t)}] where c_t should lie at the t-th iteration, we choose in practice to stop iterating whenever \frac{p_n^{(t)} - p_1^{(t)}}{p_n^{(t)}} goes below a prescribed threshold (for example, taking \nabla F = \log x, we find in about 50 iterations the centroid with machine precision 10^{-12}). The CCCP algorithm with b bits of precision requires O(nb) time. Note that \lim_{t\to\infty} p_1^{(t)} = \lim_{t\to\infty} \frac{1}{2}\sum_{i=0}^{t} \frac{1}{2^i} p_1 = p_1 (and similarly, \lim_{t\to\infty} p_n^{(t)} = p_n). It follows that c^* \in [p_1, p_n] as expected (all of the initial extremal range is possible, and the center depends on the chosen generator F). The proof extends naturally to separable multivariate functions by carrying out the analysis on each dimension independently.

REFERENCES

[1] A. N. Kolmogorov, "Sur la notion de la moyenne," Accad. Naz. Lincei Mem. Cl. Sci. Fis. Mat. Natur. Sez., vol. 12, pp. 388-391, 1930.
[2] M. Nagumo, "Über eine Klasse der Mittelwerte," Japanese Journal of Mathematics, vol. 7, pp. 71-79, 1930; see Collected Papers, Springer, 1993.
[3] J. D. Aczél, "On mean values," Bulletin of the American Mathematical Society, vol. 54, no. 4, pp. 392-400, 1948. http://www.mta.hu/
[4] A. Ben-Tal, A. Charnes, and M. Teboulle, "Entropic means," Journal of Mathematical Analysis and Applications, vol. 139, no. 2, pp. 537-551, 1989.
[5] S. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," Journal of the Royal Statistical Society, Series B, vol. 28, pp. 131-142, 1966.
[6] I. Csiszár, "Information-type measures of difference of probability distributions and indirect observation," Studia Scientiarum Mathematicarum Hungarica, vol. 2, pp. 299-318, 1967.
[7] F. Nielsen and R. Nock, "Sided and symmetrized Bregman centroids," IEEE Transactions on Information Theory, vol. 55, no. 6, pp. 2048-2059, June 2009.
[8] Y. Censor and S. A. Zenios, Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, 1997.
[9] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, pp. 1-305, January 2008.
[10] S.-I. Amari, "α-divergence is unique, belonging to both f-divergence and Bregman divergence classes," IEEE Transactions on Information Theory, vol. 55, no. 11, pp. 4925-4931, 2009.
[11] S.-I. Amari, "Integration of stochastic models by minimizing α-divergence," Neural Computation, vol. 19, no. 10, pp. 2780-2796, 2007.
[12] S. Amari and H. Nagaoka, Methods of Information Geometry. American Mathematical Society / Oxford University Press, 2000.
[13] F. Nielsen and R. Nock, "The dual Voronoi diagrams with respect to representational Bregman divergences," in International Symposium on Voronoi Diagrams (ISVD), DTU Lyngby, Denmark, IEEE, June 2009.
[14] J.-D. Boissonnat, F. Nielsen, and R. Nock, "Bregman Voronoi diagrams," Discrete & Computational Geometry, 2010 (accepted; extends ACM-SIAM SODA 2007).
[15] I. Csiszár, "Generalized projections for non-negative functions," Acta Mathematica Hungarica, vol. 68, no. 1-2, pp. 161-185, 1995.
[16] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," Journal of Machine Learning Research, vol. 6, pp. 1705-1749, 2005.
[17] M. Detyniecki, "Mathematical aggregation operators and their application to video querying," Ph.D. dissertation, 2000.
[18] J. Burbea and C. R. Rao, "On the convexity of some divergence measures based on entropy functions," IEEE Transactions on Information Theory, vol. 28, no. 3, pp. 489-495, 1982.
[19] J. Burbea and C. R. Rao, "On the convexity of higher order Jensen differences based on entropy functions," IEEE Transactions on Information Theory, vol. 28, no. 6, pp. 961-963, 1982.
[20] J. Lin, "Divergence measures based on the Shannon entropy," IEEE Transactions on Information Theory, vol. 37, pp. 145-151, 1991.
[21] M. Basseville and J.-F. Cardoso, "On entropies, divergences and mean values," in Proceedings of the IEEE Workshop on Information Theory, 1995.
[22] L. M. Bregman, "The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming," USSR Computational Mathematics and Mathematical Physics, vol. 7, pp. 200-217, 1967.
[23] F. Nielsen and V. Garcia, "Statistical exponential families: A digest with flash cards," arXiv:0911.4863, 2009.
[24] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. Academic Press Professional, Inc., 1990.
[25] L. Rigazio, B. Tsakam, and J.-C. Junqua, "An optimal Bhattacharyya centroid algorithm for Gaussian clustering with applications in automatic speech recognition," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 3, 2000, pp. 1599-1602. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs all.jsp?arnumber=861998
[26] V. Garcia, F. Nielsen, and R. Nock, "Hierarchical Gaussian mixture model," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2010.
[27] P. Chen, Y. Chen, and M. Rao, "Metrics defined by Bregman divergences: Part I," Communications in Mathematical Sciences, vol. 6, pp. 915-926, 2008.

[28] P. Chen, Y. Chen, and M. Rao, "Metrics defined by Bregman divergences: Part II," Communications in Mathematical Sciences, vol. 6, pp. 927-948, 2008.
[29] H. Jeffreys, "An invariant form for the prior probability in estimation problems," Proceedings of the Royal Society of London, vol. 186, no. 1007, pp. 453-461, March 1946.
[30] F. Liese and I. Vajda, "On divergences and informations in statistics and information theory," IEEE Transactions on Information Theory, vol. 52, no. 10, pp. 4394-4412, October 2006.
[31] J. Zhang, "Divergence function, duality, and convex analysis," Neural Computation, vol. 16, no. 1, pp. 159-195, 2004.
[32] A. Yuille and A. Rangarajan, "The concave-convex procedure," Neural Computation, vol. 15, no. 4, pp. 915-936, 2003.
[33] B. Sriperumbudur and G. Lanckriet, "On the convergence of the concave-convex procedure," in Neural Information Processing Systems, 2009.
[34] Y. He, A. B. Hamza, and H. Krim, "An information divergence measure for ISAR image registration," in Automatic Target Recognition XI (SPIE), vol. 4379, 2001, pp. 199-208.
[35] A. O. Hero, B. Ma, O. Michel, and J. D. Gorman, "Alpha-divergence for classification, indexing and retrieval," Comm. and Sig. Proc. Lab. (CSPL), Dept. EECS, University of Michigan, Ann Arbor, Tech. Rep. 328, July 2001; presented at the Joint Statistical Meeting.
[36] A. Bhattacharyya, "On a measure of divergence between two statistical populations defined by their probability distributions," Bulletin of the Calcutta Mathematical Society, vol. 35, pp. 99-110, 1943.
[37] K. Matusita, "Decision rules based on the distance, for problems of fit, two samples, and estimation," Annals of Mathematical Statistics, vol. 26, pp. 631-640, 1955.
[38] T. Kailath, "The divergence and Bhattacharyya distance measures in signal selection," IEEE Transactions on Communication Technology, vol. 15, no. 1, pp. 52-60, 1967.
[39] F. Aherne, N. Thacker, and P. Rockett, "The Bhattacharyya metric as an absolute similarity measure for frequency coded data," Kybernetika, vol. 34, no. 4, pp. 363-368, 1998.
[40] E. D. Hellinger, "Die Orthogonalinvarianten quadratischer Formen von unendlich vielen Variablen," thesis, University of Göttingen, 1907.
[41] S. Kakutani, "On equivalence of infinite product measures," Annals of Mathematics, vol. 49, pp. 214-224, 1948.
[42] L. Rigazio, B. Tsakam, and J. Junqua, "Optimal Bhattacharyya centroid algorithm for Gaussian clustering with applications in automatic speech recognition," in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, 2000, pp. 1599-1602.
[43] K. B. Petersen and M. S. Pedersen, The Matrix Cookbook. Technical University of Denmark, Oct. 2008. [Online]. Available: http://www2.imm.dtu.dk/pubdb/p.php?3274
[44] A. Cichocki and S.-I. Amari, "Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities," Entropy, 2010.
[45] P. Vos, "Geometry of f-divergence," Annals of the Institute of Statistical Mathematics, vol. 43, no. 3, pp. 515-537, 1991.
[46] F. Nielsen and R. Nock, "Skew Jensen-Bregman Voronoi diagrams," Transactions on Computational Science, no. 14, pp. 102-128, 2011.
[47] P. Auer, M. Herbster, and M. Warmuth, "Exponentially many local minima for single neurons," Advances in Neural Information Processing Systems, vol. 8, pp. 316-317, 1995.
[48] J.-L. Marichal, "Aggregation operators for multicriteria decision aid," Institute of Mathematics, University of Liège, Belgium, 1998.

Sylvain Boltz received the M.S. degree and the Ph.D. degree in computer vision from the University of Nice-Sophia Antipolis, France, in 2004 and 2008, respectively. Since then, he has been a postdoctoral fellow at the VisionLab, University of California, Los Angeles, and a LIX-Qualcomm postdoctoral fellow at École Polytechnique, France. His research spans computer vision and image and video processing, with a particular interest in applications of information theory and compressed sensing to these areas.

Frank Nielsen received the BSc (1992) and MSc (1994) degrees from École Normale Supérieure (ENS Lyon, France). He prepared his PhD on adaptive computational geometry at INRIA Sophia-Antipolis (France) and defended it in 1996. As a civil servant of the University of Nice (France), he gave lectures at the engineering schools ESSI and ISIA (École des Mines). In 1997, he served in the army as a scientific member in the computer science laboratory of École Polytechnique. In 1998, he joined Sony Computer Science Laboratories Inc., Tokyo (Japan), where he is a senior researcher. He became a professor of the CS Dept. of École Polytechnique in 2008. His current research interests include geometry, vision, graphics, learning, and optimization. He is a senior ACM and senior IEEE member.