arXiv:1904.04017v3 [cs.IT] 10 Dec 2020

On a generalization of the Jensen-Shannon divergence and the JS-symmetrization of distances relying on abstract means∗

Frank Nielsen†
Sony Computer Science Laboratories, Inc.
Tokyo, Japan

Abstract

The Jensen-Shannon divergence is a renowned bounded symmetrization of the unbounded Kullback-Leibler divergence which measures the total Kullback-Leibler divergence to the average mixture distribution. However, the Jensen-Shannon divergence between Gaussian distributions is not available in closed-form. To bypass this problem, we present a generalization of the Jensen-Shannon (JS) divergence using abstract means which yields closed-form expressions when the mean is chosen according to the parametric family of distributions. More generally, we define the JS-symmetrizations of any distance using statistical mixtures derived from abstract means. In particular, we first show that the geometric mean is well-suited for exponential families, and report two closed-form formula for (i) the geometric Jensen-Shannon divergence between probability densities of the same exponential family, and (ii) the geometric JS-symmetrization of the reverse Kullback-Leibler divergence. As a second illustrating example, we show that the harmonic mean is well-suited for the scale Cauchy distributions, and report a closed-form formula for the harmonic Jensen-Shannon divergence between scale Cauchy distributions. We also define generalized Jensen-Shannon divergences between matrices (e.g., quantum Jensen-Shannon divergences) and consider clustering with respect to these novel Jensen-Shannon divergences.

Keywords: Jensen-Shannon divergence, Jeffreys divergence, resistor average distance, Bhattacharyya distance, Chernoff information, f-divergence, Jensen divergence, Burbea-Rao divergence, Bregman divergence, abstract weighted mean, quasi-arithmetic mean, mixture family, statistical M-mixture, exponential family, Gaussian family, Cauchy scale family, clustering.

Paper web page: https://FrankNielsen.github.io/M-JS/

∗This work appeared in the journal [41] (2019). This report further includes new results and extensions.
†E-mail: [email protected]

1 Introduction and motivations

1.1 Kullback-Leibler divergence and its symmetrizations

Let (X, A) be a measurable space [10] where X denotes the sample space and A the σ-algebra of measurable events. Consider a positive measure µ (usually the Lebesgue measure µ_L with Borel σ-algebra B(R^d), or the counting measure µ_c with power set σ-algebra 2^X). Denote by P the set of probability distributions.
The Kullback-Leibler Divergence [16] (KLD for short) KL : P × P → [0, ∞] is the most fundamental distance¹ [16] between probability distributions, defined by:

KL(P : Q) := ∫ p log (p/q) dµ,    (1)

where p and q denote the Radon-Nikodym derivatives of probability measures P and Q with respect to µ (with P, Q ≪ µ). The KLD expression between P and Q in Eq. 1 is independent of the dominating measure µ. For example, the dominating measure can be chosen as µ = (P + Q)/2. Appendix A summarizes the various distances and their notations used in this paper.
The KLD is also called the relative entropy [16] because it can be written as the difference of the cross-entropy h× minus the entropy h:

KL(p : q) = h×(p : q) − h(p),    (2)

where h× denotes the cross-entropy [16]:

h×(p : q) := ∫ p log (1/q) dµ = − ∫ p log q dµ,    (3)

and

h(p) := ∫ p log (1/p) dµ = − ∫ p log p dµ = h×(p : p),    (4)

denotes the Shannon entropy [16]. Although the formula of the Shannon entropy in Eq. 4 unifies both the discrete case and the continuous case of probability distributions, the behavior of entropy in the discrete case and in the continuous case is very different: When µ = µ_c (counting measure), Eq. 4 yields the discrete Shannon entropy which is always positive and upper bounded by log |X|. When µ = µ_L (Lebesgue measure), Eq. 4 defines the Shannon differential entropy which may be negative and unbounded [16] (e.g., the differential entropy of the Gaussian distribution N(m, σ) is (1/2) log(2πeσ²)). See also [26] for further important differences between the discrete case and the continuous case.
In general, the KLD is an asymmetric distance (i.e., KL(p : q) ≠ KL(q : p), hence the argument separator notation² using the delimiter ':') that is unbounded and may even be infinite. The reverse KL divergence or dual KL divergence is defined by:

KL*(P : Q) := KL(Q : P) = ∫ q log (q/p) dµ.    (5)

In general, the reverse distance or dual distance for a distance D is written as:

D∗(p : q):=D(q : p). (6)

¹In this paper, we call distance a dissimilarity measure which may not be a metric. Many synonyms have been used in the literature like distortion, divergence, information, etc.
²In [16], it is customary to use the double bar notation 'p_X ‖ p_Y' instead of the comma ',' notation to avoid confusion with joint random variables (X, Y).

One way to symmetrize the KLD is to consider the Jeffreys³ Divergence [37] (JD):

J(p; q) := KL(p : q) + KL(q : p) = ∫ (p − q) log (p/q) dµ = J(q; p).    (7)

However this symmetric distance is not upper bounded, and its sensitivity can raise numerical issues in applications. Here, we used the optional argument separator notation ';' to emphasize that the distance is symmetric⁴ but not necessarily a metric distance (i.e., it may violate the triangle inequality of metric distances). The symmetrization of the KLD may also be obtained using the harmonic mean instead of the arithmetic mean, yielding the resistor average distance [27] R(p; q):

1/R(p; q) = (1/2) (1/KL(p : q) + 1/KL(q : p)),    (8)
1/R(p; q) = (KL(p : q) + KL(q : p)) / (2 KL(p : q) KL(q : p)),    (9)
R(p; q) = 2 KL(p : q) KL(q : p) / J(p; q).    (10)
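As a concrete illustration (our addition, not part of the original paper), the following minimal Python sketch computes the KLD, the Jeffreys divergence of Eq. 7, and the resistor average distance of Eq. 10 between two discrete distributions; the function and variable names are ours.

```python
import numpy as np

def kl(p, q):
    """Discrete Kullback-Leibler divergence KL(p:q) = sum_i p_i log(p_i / q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p_i = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jeffreys(p, q):
    """Jeffreys divergence J(p;q) = KL(p:q) + KL(q:p) (Eq. 7)."""
    return kl(p, q) + kl(q, p)

def resistor_average(p, q):
    """Resistor average distance R(p;q) = 2 KL(p:q) KL(q:p) / J(p;q) (Eq. 10)."""
    return 2.0 * kl(p, q) * kl(q, p) / jeffreys(p, q)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])
print(kl(p, q), jeffreys(p, q), resistor_average(p, q))
```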

Another famous symmetrization of the KLD is the Jensen-Shannon Divergence [30] (JSD) defined by:

JS(p; q) := (1/2) ( KL(p : (p + q)/2) + KL(q : (p + q)/2) ),    (11)
          = (1/2) ∫ ( p log (2p/(p + q)) + q log (2q/(p + q)) ) dµ.    (12)

This distance can be interpreted as the total divergence to the average distribution (see Eq. 11). The JSD can be rewritten as a Jensen divergence (or Burbea-Rao divergence [47]) for the negentropy generator −h (called Shannon information):

JS(p; q) = h((p + q)/2) − (h(p) + h(q))/2.    (13)

An important property of the Jensen-Shannon divergence compared to the Jeffreys divergence is that this Jensen-Shannon distance is always bounded:

0 ≤ JS(p : q) ≤ log 2.    (14)

This follows from the fact that

KL(p : (p + q)/2) = ∫ p log (2p/(p + q)) dµ ≤ ∫ p log (2p/p) dµ = log 2.    (15)

Last but not least, the square root of the JSD (i.e., √JS) yields a metric distance satisfying the triangular inequality [64, 22]. The JSD has found applications in many fields like [62]

³Sir Harold Jeffreys (1891-1989) was a British statistician.
⁴To match the notational convention of the mutual information of two joint random variables in information theory [16].

and social sciences [18], just to name a few. Recently, the JSD has gained attention in the deep learning community with the Generative Adversarial Networks (GANs) [23].
In information geometry [3, 38, 43], the KLD, JD and JSD are invariant divergences which satisfy the property of information monotonicity. The class of (separable) distances satisfying the information monotonicity is exhaustively characterized as the class of Csiszár f-divergences [17]. A f-divergence is defined for a convex generator function f strictly convex at 1 (with f(1) = f′(1) = 0) by:

I_f(p : q) = ∫ p f(q/p) dµ.    (16)

The Jeffreys divergence and the Jensen-Shannon divergence are f-divergences for the following f-generators:

f_J(u) := (u − 1) log u,    (17)
f_JS(u) := (1/2) ( u log u − (u + 1) log ((1 + u)/2) ).    (18)

1.2 Statistical distances and parameter divergences

In information theory and statistics, the term "divergence" informally means a statistical distance [16]. However in information geometry [3], a divergence has a stricter meaning of being a smooth parametric distance (called a contrast function in [19]) from which a dual geometric structure can be derived [4, 43]. Consider a parametric family of distributions {p_θ : θ ∈ Θ} (e.g., Gaussian family or Cauchy family), where Θ denotes the parameter space. Then a statistical distance D between distributions p_θ and p_θ′ amounts to an equivalent parameter distance:

P (θ : θ′):=D(pθ : pθ′ ). (19)

For example, the KLD between two distributions belonging to the same exponential family (e.g., Gaussian family) amounts to a reverse Bregman divergence for the cumulant generator F of the exponential family [8, 43]:

KL(pθ : pθ′ )= BF∗ (θ : θ′)= BF (θ′ : θ). (20)

A Bregman divergence BF is defined for a strictly convex and differentiable generator F as:

B_F(θ : θ′) := F(θ) − F(θ′) − ⟨θ − θ′, ∇F(θ′)⟩,    (21)

where ⟨·, ·⟩ is an inner product (usually the Euclidean dot product for vector parameters). Similar to the interpretation of the Jensen-Shannon divergence (statistical divergence) as a Jensen divergence for the negentropy generator, the Jensen-Bregman divergence [47] JB_F (parametric divergence JBD) amounts to a Jensen divergence J_F for a strictly convex generator F : Θ → R:

JB_F(θ : θ′) := (1/2) ( B_F(θ : (θ + θ′)/2) + B_F(θ′ : (θ + θ′)/2) ),    (22)
             = (F(θ) + F(θ′))/2 − F((θ + θ′)/2) =: J_F(θ : θ′),    (23)
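As a quick sanity check (our addition, not in the paper), the sketch below verifies numerically the Jensen-Bregman/Jensen identity of Eqs. 22-23 for a strictly convex scalar generator; the choice F(θ) = log(1 + e^θ) is purely illustrative.

```python
import numpy as np

def F(t):                       # an arbitrary strictly convex generator (illustrative choice)
    return np.log1p(np.exp(t))

def gradF(t):
    return 1.0 / (1.0 + np.exp(-t))

def bregman(t1, t2):            # B_F(t1 : t2), Eq. 21 (scalar parameters)
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

def jensen_bregman(t1, t2):     # Eq. 22
    m = 0.5 * (t1 + t2)
    return 0.5 * (bregman(t1, m) + bregman(t2, m))

def jensen(t1, t2):             # Eq. 23
    return 0.5 * (F(t1) + F(t2)) - F(0.5 * (t1 + t2))

t1, t2 = -1.3, 2.7
assert abs(jensen_bregman(t1, t2) - jensen(t1, t2)) < 1e-12
```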

4 Let us introduce the handy notation (θpθq)α:=(1 α)θp + αθq to denote the linear interpolation − (LERP) of the parameters (for α [0, 1]). Then we have more generally that the skew Jensen- α ∈ α Bregman divergence JBF (θ : θ′) amounts to a skew Jensen divergence JF (θ : θ′):

α JBF (θ : θ′) := (1 α)BF θ : (θθ′)α + αBF θ′ : (θθ′)α) , (24) − α = (F (θ)F (θ′))α F (θθ′)α =: J (θ : θ′), (25) −  F   1.3 Jeffreys’ J-symmetrization and Jensen-Shannon JS-symmetrization of dis- tances For any arbitrary distance D(p : q), we can define its skew J-symmetrization for α [0, 1] by: ∈ J α (p : q):=(1 α)D (p : q)+ αD (q : p) , (26) D − and its JS-symmetrization by:

JSα (p : q) := (1 α)D (p : (1 α)p + αq)+ αD (q : (1 α)p + αq) , (27) D − − − = (1 α)D (p : (pq)α)+ αD (q : (pq)α) . (28) − 1 1 2 Usually, α = 2 , and for notational brevity, we drop the superscript: JSD(p : q) := JSD(p : q). The Jeffreys divergence is twice the J-symmetrization of the KLD, and the Jensen-Shannon divergence is the JS-symmetrization of the KLD. The J-symmetrization of a f-divergence If is obtained by taking the generator

J f (u) = (1 α)f(u)+ αf ⋄(u), (29) α − 1 where f ⋄(u)= uf( u ) is the conjugate generator:

If ⋄ (p : q)= If∗(p : q)= If (q : p). (30)

The JS-symmetrization of a f-divergence

α I (p : q):=(1 α)If (p : (pq)α)+ αIf (q : (pq)α), (31) f − with (pq)α = (1 α)p + αq is obtained by taking the generator − 1 α f JS(u):=(1 α)f(αu + 1 α)+ αf α + − . (32) α − − u   We check that we have:

α 1 α I (p : q) = (1 α)If (p : (pq)α)+ αIf (q : (pq)α)= I − (q : p)= I JS (p : q). (33) f − f fα A family of symmetric distances unifying the Jeffreys divergence with the Jensen-Shannon divergence was proposed in [34]. Finally, let us mention that once we have symmetrized a distance, we may also metrize it by choosing (when it exists) the largest exponent δ > 0 such that Dδ becomes a metric distance [13, 14, 28, 60, 64].
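The following small numerical check (ours, with illustrative distributions) confirms that the f-divergence of Eq. 16 instantiated with the generator f_JS of Eq. 18 coincides with the Jensen-Shannon divergence of Eq. 12 in the discrete case.

```python
import numpy as np

def f_js(u):
    """Generator of Eq. 18."""
    return 0.5 * (u * np.log(u) - (u + 1.0) * np.log((1.0 + u) / 2.0))

def f_divergence(p, q, f):
    """Csiszar f-divergence I_f(p:q) = sum_i p_i f(q_i / p_i) (Eq. 16), discrete case."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * f(q / p)))

def jsd(p, q):
    """Jensen-Shannon divergence of Eq. 12, discrete case."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * float(np.sum(p * np.log(p / m) + q * np.log(q / m)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.1, 0.5])
assert abs(f_divergence(p, q, f_js) - jsd(p, q)) < 1e-12
```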

5 1.4 Paper outline The paper is organized as follows:

• Section 2 reports the special case of mixture families in information geometry [3] for which the Jensen-Shannon divergence can be expressed as a Bregman divergence (Theorem 1), and highlight the lack of closed-form formula when considering generic exponential families. This observation precisely motivated this work.

• Section 3 introduces the generalized Jensen-Shannon divergences using generalized statistical mixtures derived from abstract weighted means (Definition 3 and Definition 7), presents the JS-symmetrization of statistical distances, and report a sufficient condition to get bounded JS-symmetrizations (Property 5).

• In §4.1, we consider the calculation of the geometric JSD between members of the same exponential family (Theorem 8) and instantiate the formula for the multivariate Gaussian distributions (Corollary 10). We show that the Bhattacharrya distance can be interpreted as the negative of the cumulant function of likelihood ratio exponential families in §4.2 and introduce the Chernoff information. We discuss about applications for k-means clustering in §4.1.2. In §4.3, we illustrate the method with another example that calculates in closed-form the harmonic JSD between scale Cauchy distributions (Theorem 15).

Finally, we wrap up and conclude this work in Section 5.

2 Jensen-Shannon divergence in mixture and exponential families

We are interested to calculate the JSD between densities belonging to parametric families of dis- tributions. A trivial example is when p = (p0,...,pD) and q = (q0,...,qD) are categorical dis- tributions (finite discrete distributions sometimes called multinoulli distributions): The average p+q distribution 2 is again a categorical distribution, and the JSD is expressed plainly as:

JS(p, q) = (1/2) Σ_{i=0}^{D} ( p_i log (2p_i/(p_i + q_i)) + q_i log (2q_i/(p_i + q_i)) ).    (34)

Another example is when p = m_θp and q = m_θq both belong to the same mixture family [3] M:

M := { m_θ(x) = (1 − Σ_{i=1}^{D} θ_i) p_0(x) + Σ_{i=1}^{D} θ_i p_i(x) : θ_i > 0, Σ_{i=1}^{D} θ_i < 1 },    (35)

for linearly independent component distributions p_0(x), p_1(x), ..., p_D(x). We have [57]:

KL(mθp : mθq )= BF (θp : θq), (36) where BF is a Bregman divergence defined in Eq. 21 obtained for the convex negentropy genera- tor [57] F (θ)= h(mθ). (The proof that F (θ) is a strictly convex function is not trivial, see [49].) − The mixture families include the family of categorical distributions over a finite alphabet = X E ,...,ED (the D-dimensional probability simplex) since those categorical distributions form { 0 }

6 a mixture family with pi(x):= Pr(X = Ei) = δEi (x). Beware that mixture families impose to prescribe the component distributions. Therefore a density of a mixture family is a special case of statistical mixtures (e.g., Gaussian mixture models) with prescribed component distributions. The special mixtures with prescribed components are also called w-mixtures [57] because they are convex weighted combinations of components. The remarkable mathematical identity of Eq. 36 that does not yield a practical formula since F (θ) is usually not itself available in closed-form.5 Worse, the Bregman generator can be non- analytic [65]. Nevertheless, this identity is useful for computing the right-sided Bregman centroid (left KL centroid of mixtures) since this centroid is equivalent to the center of mass, and is always independent of the Bregman generator [57]. The mixture of mixtures is also a mixture:

mθp + mθq = m θp+θq . (37) 2 2 ∈ M Thus we get a closed-form expression for the JSD between mixtures belonging to . M

Theorem 1 (Jensen-Shannon divergence between w-mixtures). The Jensen-Shannon divergence between two distributions p = mθ and q = mθ belonging to the same mixture family is expressed p q M as a Jensen-Bregman divergence for the negentropy generator F :

JS(m_θp, m_θq) = (1/2) ( B_F(θ_p : (θ_p + θ_q)/2) + B_F(θ_q : (θ_p + θ_q)/2) ).    (38)

This amounts to calculating the Jensen divergence:

JS(m_θp, m_θq) = J_F(θ_1; θ_2) = (F(θ_1) F(θ_2))_{1/2} − F((θ_1 θ_2)_{1/2}),    (39)

where (v_1 v_2)_α := (1 − α) v_1 + α v_2.
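To illustrate Theorem 1 numerically (a sketch of ours), the categorical distributions form a mixture family whose negentropy generator F(θ) = −h(m_θ) is available in closed form, so the identity of Eq. 39 can be checked directly:

```python
import numpy as np

def neg_entropy(theta):
    """F(theta) = -h(m_theta) for the categorical mixture family;
    theta holds the weights of components E_1..E_D, component E_0 gets 1 - sum(theta)."""
    m = np.concatenate(([1.0 - np.sum(theta)], theta))
    return float(np.sum(m * np.log(m)))

def jensen(F, t1, t2):
    """Jensen (Burbea-Rao) divergence J_F(t1; t2) of Eq. 39 with alpha = 1/2."""
    return 0.5 * (F(t1) + F(t2)) - F(0.5 * (t1 + t2))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * float(np.sum(p * np.log(p / m) + q * np.log(q / m)))

theta_p = np.array([0.2, 0.3])   # categorical distribution (0.5, 0.2, 0.3)
theta_q = np.array([0.5, 0.1])   # categorical distribution (0.4, 0.5, 0.1)
p = np.concatenate(([1.0 - theta_p.sum()], theta_p))
q = np.concatenate(([1.0 - theta_q.sum()], theta_q))
assert abs(jsd(p, q) - jensen(neg_entropy, theta_p, theta_q)) < 1e-12
```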

Now, consider distributions p = eθ and q = eθ belonging to the same exponential family [3] : p q E

:= eθ(x) = exp θ⊤x F (θ) : θ Θ , (40) E − ∈ where n   o D Θ:= θ R : exp(θ⊤x)dµ< , (41) ∈ ∞  Z  denotes the natural parameter space. We have [3, 43]:

KL(eθp : eθq )= BF (θq : θp), (42) where F denotes the log-normalizer or cumulant function of the exponential family [3] (also called log-partition or log-Laplace function):

F (θ) := log exp(θ⊤x)dµ . (43) Z  5Namely, it is available in closed-form when the fixed component distributions have pairwise disjoint supports with closed-form entropy formula.

7 e +e However, θp θq does not belong to in general, except for the case of the categori- 2 E cal/multinomial family which is both an exponential family and a mixture family [3]. For example, the mixture of two Gaussian distributions with distinct components is not a Gaussian distribution. Thus it is not obvious to get a closed-form expression for the JSD in that case. This limitation precisely motivated the introduction of generalized Jensen-Shannon divergences defined in the next section. Notice that in [53, 50], it is shown how to express or approximate the f-divergences using expansions of power χ pseudo-distances. These power χ pseudo-distances can all be expressed in closed-form when dealing with isotropic Gaussians. This results holds for the JSD since the JSD is a f-divergence [50].

3 Generalized Jensen-Shannon divergences

We first define abstract means M(x,y) of two reals x and y, and then define generic statistical M-mixtures from which generalized Jensen-Shannon divergences are built thereof.

3.1 Definitions Consider an abstract mean [33] M. That is, a continuous bivariate function M( , ) : I I I on · · × → an interval I R that satisfies the following in-betweenness property: ⊂ inf x,y M(x,y) sup x,y , x,y I. (44) { }≤ ≤ { } ∀ ∈ Using the unique dyadic expansion of real numbers, we can always build a corresponding weighted mean Mα(p,q) (with α [0, 1]) following the construction reported in [33] (page 3) such ∈ that M (p,q)= p and M (p,q)= q. In the remainder, we consider I = (0, ). 0 1 ∞ Examples of common weighted means are:

• the arithmetic mean A_α(x, y) = (1 − α) x + α y,
• the geometric mean G_α(x, y) = x^{1−α} y^α, and
• the harmonic mean H_α(x, y) = 1 / ((1 − α)(1/x) + α(1/y)) = xy / ((1 − α) y + α x).

These means can be unified using the concept of quasi-arithmetic means [33] (also called Kolmogorov-De Finetti-Nagumo means):

M_α^h(x, y) := h^{-1}((1 − α) h(x) + α h(y)),    (45)

where h is a strictly monotonous and continuous function. For example, the geometric mean G_α(x, y) is obtained as M_α^h(x, y) for the generator h(u) = log(u). Rényi used the concept of quasi-arithmetic means instead of the arithmetic mean to define axiomatically the Rényi entropy [61] of order α in information theory [16]. For any abstract weighted mean M_α, we can build a statistical mixture called a M-mixture as follows:

8 M Definition 2 (Statistical M-mixture). The Mα-interpolation (pq) (with α [0, 1]) of densities α ∈ p and q with respect to a mean M is a α-weighted M-mixture defined by:

M Mα(p(x),q(x)) (pq)α (x):= M , (46) Zα (p : q) where M Zα (p : q)= Mα(p(t),q(t))dµ(t) =: Mα(p,q) . (47) t h i Z ∈X is the normalizer function (or scaling factor) ensuring that (pq)M . (The bracket notation f α ∈ P h i denotes the integral of f over .) X
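Definition 2 can be implemented directly by numerical normalization. The sketch below (ours; the grid, densities and step size are illustrative choices) builds the α-weighted A-, G- and H-mixtures of two Gaussian densities on a grid and reports the normalizers Z_α^M of Eq. 47:

```python
import numpy as np
from scipy.stats import norm

# weighted means: arithmetic, geometric, harmonic (instances of Eq. 45)
MEANS = {
    "A": lambda x, y, a: (1 - a) * x + a * y,
    "G": lambda x, y, a: x ** (1 - a) * y ** a,
    "H": lambda x, y, a: 1.0 / ((1 - a) / x + a / y),
}

def m_mixture(p_pdf, q_pdf, mean, alpha, grid):
    """alpha-weighted M-mixture of Eq. 46, normalized numerically on a grid."""
    dx = grid[1] - grid[0]
    unnormalized = mean(p_pdf(grid), q_pdf(grid), alpha)
    Z = float(np.sum(unnormalized) * dx)      # normalizer Z_alpha^M of Eq. 47
    return unnormalized / Z, Z

grid = np.linspace(-12, 12, 24001)
p = norm(loc=-1.0, scale=1.0).pdf             # two illustrative Gaussian densities
q = norm(loc=2.0, scale=0.5).pdf
for name, M in MEANS.items():
    mix, Z = m_mixture(p, q, M, alpha=0.3, grid=grid)
    print(name, "Z =", round(Z, 6))           # Z is 1 for the arithmetic mean
```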

A The A-mixture (pq)α (x) = (1 α)p(x)+αq(x) (‘A’ standing for the arithmetic mean) represents − A G p(x)1−αq(x)α the usual statistical mixture [31] (with Zα (p : q) = 1). The G-mixture (pq)α (x) = G of Zα (p:q) two distributions p(x) and q(x) (’G’ standing for the geometric mean G) is an exponential family of order 1, see [48]:

(pq)_α^G(x) = exp( (1 − α) log p(x) + α log q(x) − log Z_α^G(p : q) ).    (48)

The two-component M-mixture can be generalized to a k-component M-mixture with α ∆k 1, ∈ − the (k 1)-dimensional standard simplex: − α1 αk M p1(x) . . . pk(x) (p1 ...pk)α := × × , (49) Zα(p1,...,pk)

α1 α where Zα(p1,...,pk):= p1(x) . . . pk(x) k dµ(x). X × × For a given pair of distributions p and q, the set Mα(p(x),q(x)) : α [0, 1] describes a path R { ∈ } in the space of probability density functions called an Hellinger arc [25]. This density interpolation scheme was investigated for quasi-arithmetic weighted means in [39, 20, 21]. In [6], the authors study the Fisher information matrix for the α-mixture models (using α-power means). M We call (pq)α the α-weighted M-mixture, thus extending the notion of α-mixtures [2] obtained for power means Pα. Notice that abstract means have also been used to generalize Bregman divergences using the concept of (M,N)-convexity generalizing the Jensen midpoint inequality [56]. Let us state a first generalization of the Jensen-Shannon divergence:

Definition 3 (M-Jensen-Shannon divergence). For a mean M, the skew M-Jensen-Shannon di- vergence (for α [0, 1]) is defined by ∈ JSMα (p : q):=(1 α)KL p : (pq)M + α KL q : (pq)M (50) − α α  

When Mα = Aα, we recover the ordinary Jensen-Shannon divergence since Aα(p : q) = (pq)α A (and Zα (p : q) = 1). We can extend the definition to the JS-symmetrization of any distance:

9 Definition 4 (M-JS symmetrization). For a mean M and a distance D, the skew M-JS sym- metrization of D (for α [0, 1]) is defined by ∈ JSMα (p : q):=(1 α)D p : (pq)M + αD q : (pq)M (51) D − α α  

Mα Mα By notation, we have JS (p : q)=JSKL (p : q). That is, the arithmetic JS-symmetrization of the KLD is the JSD. Let us define the α-skew K-divergence [29, 30] Kα(p : q) as

Kα (p : q) :=KL(p : (1 α)p + αq) = KL(p : (pq)α), (52) − where (pq)α(x):=(1 α)p(x)+ αq(x). Then the Jensen-Shannon divergence and the Jeffreys diver- − gence can be rewritten [34] as 1 JS (p; q) = K 1 (p : q)+ K 1 (q : p) , (53) 2 2 2 J (p; q) = K1(p : q)+ K1(q : p),  (54) since KL(p : q)= K1(p : q). Then JSα(p : q) = (1 α)Kα(p : q)+ αK1 α(q : p). Similarly, we can − − define the generalized skew K-divergence:

Mα M KD (p : q):=D p : (pq)α . (55) The success of the JSD compared to the JD in applications  is partially due to the fact that the JSD is upper bounded by log 2 (and thus can handle distributions with non- supports). So one question to ask is whether those generalized JSDs are upper bounded or not? To report a sufficient condition, let us first introduce the dominance relationship between means: We say that a mean M dominates a mean N when M(x,y) N(x,y) for all x,y 0, see [33]. In ≥ ≥ that case we write concisely M N. For example, the Arithmetic-Geometric-Harmonic (AGH) ≥ inequality states that A G H. ≥ ≥ Consider the term p(x)ZM (p,q) KL(p : (pq)M ) = p(x) log α dµ(x), (56) α M (p(x),q(x)) Z α p(x) = log ZM (p,q)+ p(x) log dµ(x). (57) α M (p(x),q(x)) Z α When mean Mα dominates the arithmetic mean Aα, we have p(x) p(x) p(x) log dµ(x) p(x) log dµ(x), M (p(x),q(x)) ≤ A (p(x),q(x)) Z α Z α and p(x) p(x) p(x) log dµ(x) p(x) log dµ(x)= log(1 α). Aα(p(x),q(x)) ≤ (1 α)p(x) − − Z Z − A Notice that Zα (p : q) = 1 (when M = A is the arithmetic mean), and we recover the fact that the α-skew Jensen-Shannon divergence is upper bounded by log(1 α) (e.g., log 2 when α = 1 ). − − 2 We summarize the result in the following property:

10 M Zα (p,q) Property 5 (Upper bound on M-JSD). The M-JSD is upper bounded by log 1 α when M A. − ≥ Let us observe that dominance of means can be used to define distances: For example, the celebrated α-divergences

α 1 α Iα(p : q)= αp(x) + (1 α)q(x) p(x) q(x) − dµ(x), α 0, 1 (58) − − 6∈ { } Z  can be interpreted as a difference of two means, the arithmetic mean and the geometry mean:

Iα(p : q)= (Aα(q(x) : p(x)) Gα(q(x) : p(x))) dµ(x). (59) − Z See [44] for more details. We can also define the generalized Jeffreys divergence as follows:

Definition 6 (N-Jeffreys divergence). For a mean N, the skew N-Jeffreys divergence (for β ∈ [0, 1]) is defined by

Nβ J (p : q) := Nβ(KL (p : q) , KL (q : p)). (60)

This definition includes the (scaled) resistor average distance [27] R(p; q), obtained for the 1 harmonic mean N = H for the KLD with skew parameter β = 2 : 1 1 1 1 = + , (61) R(p; q) 2 KL(p : q) KL(q : p)   2KL(p : q)KL(q : p) R(p; q) = . (62) J(p; q)

1 In [27], the factor 2 is omitted to keep the spirit of the original Jeffreys divergence. Notice that G(KL(p : q), KL(q : p)) R(p; q)= 1 (63) A(KL(p : q), KL(q : p)) ≤ since A G (i.e., the arithmetic mean A dominates the geometric mean G). ≥ Thus for any arbitrary divergence D, we can define the following generalization of the Jensen- Shannon divergence:

Definition 7 (Skew (M,N)-JS D divergence). The skew (M,N)-divergence with respect to weighted means Mα and Nβ as follows:

Mα,Nβ M M JSD (p : q):=Nβ D p : (pq)α , D q : (pq)α . (64)  

We now show how to choose the abstract mean according to the parametric family of distribu- tions in order to obtain some closed-form formula for some statistical distances.
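Before turning to closed-form cases, here is a generic, purely numerical sketch of Definition 7 that we add for illustration: it computes the skew (M, N)-JS symmetrization of a base distance D on a grid, here with D = KL, M the geometric mean and N the arithmetic mean (so that it reduces to the geometric JSD studied next). All names are ours.

```python
import numpy as np

def kl(px, qx, dx):
    """KL divergence by numerical integration on a grid."""
    return float(np.sum(px * np.log(px / qx)) * dx)

def js_symmetrization(D, M, N, p, q, grid, alpha=0.5):
    """Skew (M, N)-JS symmetrization of Definition 7:
    N( D(p : (pq)_alpha^M), D(q : (pq)_alpha^M) ) with the normalized M-mixture of Eq. 46."""
    dx = grid[1] - grid[0]
    px, qx = p(grid), q(grid)
    mix = M(px, qx, alpha)
    mix = mix / (np.sum(mix) * dx)            # normalize the M-mixture on the grid
    return N(D(px, mix, dx), D(qx, mix, dx))

geometric = lambda x, y, a: x ** (1 - a) * y ** a
arithmetic = lambda u, v: 0.5 * (u + v)

grid = np.linspace(-10, 10, 20001)
p = lambda x: np.exp(-0.5 * (x + 1) ** 2) / np.sqrt(2 * np.pi)   # N(-1, 1)
q = lambda x: np.exp(-0.5 * (x - 2) ** 2) / np.sqrt(2 * np.pi)   # N(2, 1)
print(js_symmetrization(kl, geometric, arithmetic, p, q, grid))  # numerical geometric JSD
```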

11 4 Some closed-form formula for the M-Jensen-Shannon diver- gences

Our motivation to introduce these novel families of M-Jensen-Shannon divergences is to obtain closed-form formula when probability densities belong to some given parametric families . We PΘ shall illustrate the principle of the method to choose the right abstract mean for the considered parametric family, and report corresponding formula for the following two case studies:

1. The geometric G-Jensen-Shannon divergence for the exponential families (§4.1), and

2. the harmonic H-Jensen-Shannon divergence for the family of Cauchy scale distributions (§4.3).

Recall that the arithmetic A-Jensen-Shannon divergence is well-suited for mixture families (The- orem 1).

4.1 The geometric Jensen-Shannon divergence: G-JSD

Consider an exponential family [48] F with log-normalizer F : E

F = pθ(x)dµ = exp(θ⊤x F (θ))dµ : θ Θ , (65) E − ∈ n o and natural parameter space

Θ= θ : exp(θ⊤x)dµ< . (66) ∞  ZX  The log-normalizer (a log-Laplace function also called log-partition or cumulant function) is a real analytic convex function. Choose for the abstract mean Mα(x,y) the weighted geometric mean Gα: Mα(x,y)= Gα(x,y)= 1 α α x − y , for x,y > 0. It is well-known that the normalized weighted product of distributions belonging to the same exponential family also belongs to this exponential family [42]:

1 α α p − (x)p (x) G Gα(pθ1 (x),pθ2 (x)) θ1 θ2 x , (pθ1 pθ2 )α (x) := = G , (67) ∀ ∈X Gα(pθ1 (t),pθ2 (t))dµ(t) Zα (p : q) = p (x), (68) R(θ1θ2)α where the normalization factor is

ZG(p : q) = exp( J α(θ : θ )), (69) α − F 1 2 α for the skew Jensen divergence JF defined by:

α J (θ : θ ):=(F (θ )F (θ ))α F ((θ θ )α). (70) F 1 2 1 2 − 1 2

Notice that since the natural parameter space Θ is convex, the distribution p F (since (θ1θ2)α ∈E (θ θ )α Θ). 1 2 ∈

12 Thus it follows that we have:

G KL pθ : (pθ1 pθ2 )α = KL pθ : p(θ1θ2)α , (71)  = BF (( θ1θ2)α : θ). (72) This allows us to conclude that the G-Jensen-Shannon divergence admits the following closed- form expression between densities belonging to the same exponential family:

G G G JS (pθ : pθ ) := (1 α)KL(pθ : (pθ pθ ) )+ α KL(pθ : (pθ pθ ) ), (73) α 1 2 − 1 1 2 α 2 1 2 α = (1 α) BF ((θ θ )α : θ )+ αBF ((θ θ )α : θ ). (74) − 1 2 1 1 2 2

Note that since (θ1θ2)α θ1 = α(θ2 θ1) and (θ1θ2)α θ2 = (1 α)(θ1 θ2), it follows that − − α − − − (1 α)BF (θ : (θ θ )α)+ αBF (θ : (θ θ )α)= J (θ : θ ). − 1 1 2 2 1 2 F 1 2 The dual divergence [67] D∗ (with respect to the reference argument) or reverse divergence of a divergence D is defined by swapping the calling arguments: D∗(θ : θ′):=D(θ′ : θ). Thus if we defined the Jensen-Shannon divergence for the dual KL divergence KL∗(p : q):=KL(q : p)

1 p + q p + q JS ∗ (p : q) := KL∗ p : + KL∗ q : , (75) KL 2 2 2      1 p + q p + q = KL : p + KL : q , (76) 2 2 2      then we obtain:

Gα G G JSKL∗ (pθ1 : pθ2 ) := (1 α)KL((pθ1 pθ2 )α : pθ1 )+ αKL((pθ1 pθ2 )α : pθ2 ), (77) − α = (1 α)BF (θ : (θ θ )α)+ αBF (θ : (θ θ )α)=JB (θ : θ ), (78) − 1 1 2 2 1 2 F 1 2 = (1 α)F (θ1)+ αF (θ2) F ((θ1θ2)α), (79) α− − = JF (θ1 : θ2). (80)

Note that JSD∗ = JSD∗. 6 In general, the JS-symmetrization for the reverse KL divergence is

1 p + q p + q JS ∗ (p; q) = KL : p + KL : q , (81) KL 2 2 2      m A(p,q) = m log dµ = A(p,q) log dµ, (82) √pq G(p,q) Z Z where m = p+q = A(p,q) and G(p,q) = √pq. Since A G (arithmetic-geometric inequality), it 2 ≥ follows that JS ∗ (p; q) 0. KL ≥

13 Theorem 8 (G-JSD and its dual JS-symmetrization in exponential families). The α-skew G- Gα Jensen-Shannon divergence JS between two distributions pθ1 and pθ2 of the same exponential family F is expressed in closed-form for α (0, 1) as: E ∈ Gα JS (pθ : pθ ) = (1 α)BF ((θ θ )α : θ )+ αBF ((θ θ )α : θ ) , (83) 1 2 − 1 2 1 1 2 2 Gα α α JSKL∗ (pθ1 : pθ2 ) = JBF (θ1 : θ2)= JF (θ1 : θ2). (84)

4.1.1 Case study: The multivariate Gaussian family Consider the exponential family [3, 48] of multivariate Gaussian distributions [66, 52, 40]

N(µ, Σ) : µ Rd, Σ 0 . (85) ∈ ≻ n o The multivariate Gaussian family is also called the MultiVariate Normal family in the literature, or MVN family for short. Let λ:=(λv, λM ) = (µ, Σ) denote the composite (vector, matrix) parameter of a MVN. The d-dimensional MVN density is given by

1 1 1 pλ(x; λ) := d exp (x λv)⊤λM− (x λv) , (86) (2π) 2 λM −2 − − | |   where denotes the matrix determinant.p The natural parameters θ are also expressed using both | · | a vector parameter θv and a matrix parameter θM in a compound object θ = (θv,θM ). By defining the following compound inner product on a composite (vector,matrix) object

θ,θ′ :=θ⊤θ′ + tr θ′ ⊤θM , (87) h i v v M   where tr( ) denotes the matrix trace, we rewrite the MVN density of Eq.86 in the canonical form · of an exponential family [48]:

pθ(x; θ) := exp ( t(x),θ Fθ(θ)) = pλ(x; λ(θ)), (88) h i− where 1 1 1 1 1 1 θ = (θ ,θ )= Σ− µ, Σ− = θ(λ)= λ− λ , λ− , (89) v M 2 M v 2 M     is the compound natural parameter and

t(x) = (x, xx⊤) (90) − is the compound sufficient . The function Fθ is the strictly convex and continuously differ- entiable log-normalizer defined by:

1 1 1 Fθ(θ)= d log π log θM + θ⊤θ− θv , (91) 2 − | | 2 v M  

14 The log-normalizer can be expressed using the ordinary parameters, λ = (µ, Σ), as:

1 1 Fλ(λ) = λ⊤λ− λv + log λM + d log 2π , (92) 2 v M | | 1  1  = µ⊤Σ− µ + log Σ + d log 2π . (93) 2 | |   The /expectation parameters [3, 40] are

η = (ηv, ηM )= E[t(x)] = F (θ). (94) ∇ We report the conversion formula between the three types of coordinate systems (namely, the ordinary parameter λ, the natural parameter θ and the moment parameter η) as follows:

θ_v(λ) = λ_M^{-1} λ_v = Σ^{-1} µ,   θ_M(λ) = (1/2) λ_M^{-1} = (1/2) Σ^{-1}   ⇔   λ_v(θ) = (1/2) θ_M^{-1} θ_v = µ,   λ_M(θ) = (1/2) θ_M^{-1} = Σ    (95)
η_v(θ) = (1/2) θ_M^{-1} θ_v,   η_M(θ) = −(1/2) θ_M^{-1} − (1/4) (θ_M^{-1} θ_v)(θ_M^{-1} θ_v)^⊤   ⇔   θ_v(η) = −(η_M + η_v η_v^⊤)^{-1} η_v,   θ_M(η) = −(1/2) (η_M + η_v η_v^⊤)^{-1}    (96)
λ_v(η) = η_v = µ,   λ_M(η) = −η_M − η_v η_v^⊤ = Σ   ⇔   η_v(λ) = λ_v = µ,   η_M(λ) = −λ_M − λ_v λ_v^⊤ = −Σ − µ µ^⊤    (97)

The dual Legendre convex conjugate [3, 40] is

1 1 F ∗(η)= log(1 + η⊤η− ηv) + log ηM + d(1 + log 2π) , (98) η −2 v M |− |   and θ = ηF (η). We check the Fenchel-Young equality when η = F (θ) and θ = F (η): ∇ η∗ ∇ ∇ ∗ Fθ(θ)+ F ∗(η) θ, η = 0. (99) η − h i Notice that the log-normalizer can be expressed using the expectation parameters as well:

1 1 1 Fη(η)= η⊤(ηM ηvη⊤)− ηv + log 2π(ηM ηvη⊤) . (100) 2 v − v 2 | − v |

We have Fθ(θ)= Fλ(λ(θ)) = Fη(η(θ)).

The Kullback-Leibler divergence between two d-dimensional Gaussians distributions p(µ1,Σ1) and p (with ∆µ = µ µ ) is (µ2,Σ2) 2 − 1

KL(p_(µ1,Σ1) : p_(µ2,Σ2)) = (1/2) ( tr(Σ_2^{-1} Σ_1) + ∆µ^⊤ Σ_2^{-1} ∆µ + log (|Σ_2|/|Σ_1|) − d ) = KL(p_λ1 : p_λ2).    (101)

We check that KL(p_(µ,Σ) : p_(µ,Σ)) = 0 since ∆µ = 0 and tr(Σ^{-1} Σ) = tr(I) = d. Notice that when Σ_1 = Σ_2 = Σ, we have

1 1 1 2 KL(p : p )= ∆⊤Σ− ∆ = D −1 (µ ,µ ), (102) (µ1,Σ) (µ2,Σ) 2 µ µ 2 Σ 1 2 1 that is half the squared for the precision matrix Σ− (a positive-definite matrix: Σ 1 0), where the Mahalanobis distance is defined for any positive matrix Q 0 as − ≻ ≻ follows: DQ(p : p )= (p p ) Q(p p ). (103) 1 2 1 − 2 ⊤ 1 − 2 q 15 The Kullback-Leibler divergence between two probability densities of the same exponential families amount to a Bregman divergence [3]:

∗ KL(p(µ1,Σ1) : p(µ2,Σ2)) = KL(pλ1 : pλ2 )= BF (θ2 : θ1)= BF (η1 : η2), (104) where the Bregman divergence is defined by

BF (θ : θ′):=F (θ) F (θ′) θ θ′, F (θ′) , (105) − − h − ∇ i with η = F (θ ). Define the canonical divergence [3] ′ ∇ ′ AF (θ : η )= F (θ )+ F ∗(η ) θ , η = AF ∗ (η : θ ), (106) 1 2 1 2 − h 1 2i 2 1 since F ∗∗ = F . We have BF (θ1 : θ2)= AF (θ1 : η2). Now, observe that pθ(0,θ) = exp( F (θ)) when t(0),θ = 0. In particular, this holds for the − h i multivariate normal family. Thus we have the following proposition Proposition 9. For the MVN family, we have

1 α α pθ(x,θ1) − pθ(x,θ2) pθ(x; (θ1θ2)α)= G , (107) Zα (pθ1 : pθ2 ) with the scaling normalization factor:

1 α α G α pθ(0; θ1) − pθ(0; θ2) Zα (pθ1 : pθ2 ) = exp( JF (θ1 : θ2)) = . (108) − pθ(0; (θ1θ2)α) More generally, we have for a k-dimensional weight vector α belonging to the (k 1)-dimensional − standard simplex: k αi G i=1 pθ(0,θi) Z (pθ ,...pθ )= , (109) α 1 k p (0; θ¯) Q θ ¯ k where θ = i=1 αiθi. Finally, we state the formulas for the G-JS divergence between MVNs for the KL and reverse P KL, respectively:

G Definition 10 (G-JSD between Gaussians). The skew G-Jensen-Shannon divergence JSα and the G dual skew G-Jensen-Shannon divergence JS∗α between two multivariate Gaussians N(µ1, Σ1) and N(µ2, Σ2) is

JSGα (p : p ) = (1 α)KL(p : p )+ αKL(p : p ), (110) (µ1,Σ1) (µ2,Σ2) − (µ1,Σ1) (µα,Σα) (µ2,Σ2) (µα,Σα) = (1 α)BF ((θ θ )α : θ )+ αBF ((θ θ )α : θ ), (111) − 1 2 1 1 2 2 Gα JS (p(µ1,Σ1) : p(µ2,Σ2)) = (1 α)KL(p(µα,Σα) : p(µ1,Σ1))+ αKL(p(µα,Σα) : p(µ2,Σ2)), (112) ∗ − = (1 α)BF (θ : (θ θ )α)+ αBF (θ : (θ θ )α), (113) − 1 1 2 2 1 2 = JF (θ1 : θ2), (114) 1 α α 1 1 1 1 Σ1 − Σ2 = (1 α)µ1⊤Σ1− µ1 + αµ2⊤Σ2− µ2 µα⊤Σα− µα + log | | | | , 2 − − Σα  | | 

16 where

Σ 1 1 1 Σα = (Σ Σ ) = (1 α)Σ− + αΣ− − , (115) 1 2 α − 1 2 (matrix harmonic barycenter) and 

µ 1 1 µα = (µ µ ) =Σα (1 α)Σ− µ + αΣ− µ . (116) 1 2 α − 1 1 2 2 

Notice that the α-skew [47]:

1 α α Bα(p : q)= log p − q dµ (117) − ZX between two members of the same exponential family amounts to a α-skew Jensen divergence between the corresponding natural parameters:

α Bα(pθ1 : pθ2 )= JF (θ1 : θ2). (118)

A simple proof follows from the fact that

1 α α p − (x)p (x) p (x)dµ(x)=1= θ1 θ2 dµ(x). (119) (θ1θ2)α ZG(p : p ) Z Z α θ1 θ2 Therefore we have

1 α α G log 1 = 0 = log p − (x)p (x)dµ(x) log Z (pθ : pθ ), (120) θ1 θ2 − α 1 2 Z G with Z (pθ : pθ ) = exp( JF (pθ : pθ )). Thus it follows that α 1 2 − 1 2

1 α α Bα(pθ : pθ ) = log p − (x)p (x)dµ(x), (121) 1 2 − θ1 θ2 Z G = log Z (pθ : pθ ), (122) − α 1 2 = JF (pθ1 : pθ2 ). (123)

In the literature, the Bhattacharyya distance Bhatα(p : q) is often defined as our reverse Bhat- tacharyya distance: α 1 α Bhatα(p : q) := log p q − dµ = Bα(q : p). (124) − ZX Corollary 11. The JS-symmetrization of the reverse Kullback-Leibler divergence between densities of the same exponential family amount to calculate a Jensen/Burbea-Rao divergence between the corresponding natural parameters.

1 1 Let us report one numerical example: Consider α = 2 , and the source parameters λv = µ1 = 0 1 1 0 2 1 2 1 1 , λ = Σ1 = and λ = µ2 = , λ = Σ2 = − . The corresponding 0 M 0 1 v 2 M 1 2        − 

17 0 1 0 4 1 1 natural parameters are θ1 = ,θ1 = 2 and θ2 = ,θ2 = 2 , and the dual v 0 M 0 1 v 3 M 1 1    2     2 2  0 1 0 1 2 1 expectation parameters are η1 = , η1 = − and η2 = , η2 = − − . v 0 M 0 1 v 2 M 1 6    3− 1    −  α 2 α 4 4 α 1 α The interpolated parameter is θv = 3 ,θM = 1 1 , or equivalently ηv = , ηM = 2 4 2 1 9 3 4  2     5 5 α 1 α 5 5 3 11 or λv = , λM = 2 −6 . We find that geometric Jensen-Shannon KL diver- 5 5 1 5 5 gence is  1.26343 and the geometric − Jensen-Shannon reverse KL divergence is 0.86157. ≃ ≃ In §4.2, we further show that the Bhattacharyya distance Bα between two densities is the negative of the log-normalized of an exponential family induced by the geometric mixtures of the densities.

4.1.2 Applications to k-means clustering

Let P = p ,...,pn denote a point set, and C = c ,...,ck denote a set of k (cluster) centers. { 1 } { 1 } The generalized k-means objective [8] with respect to a distance D is defined by:

n 1 ED(P,C)= min D(pi : cj). (125) n j 1,...,k i X=1 ∈{ }

By defining the distance D(p,C) = minj 1,...,k D(p : cj) of a point to a set of points, we can ∈{ } 1 n rewrite compactly the objective function as ED(P,C)= n i=1 D(pi,C). Denote by ED∗ (P, k) the minimum objective loss for a set of k = C clusters: ED∗ (P, k) = min C =k ED(P,C). It is NP- | | P | | hard [54] to compute ED∗ (P, k) when k > 1 and the dimension d> 1. The most common heuristic is Lloyd’s batched k-means [8] that yields a local minimum. The performance of the probabilistic k-means++ initialization [5] has been extended to arbitrary distances in [58] as follows:

Theorem 12 (Generalized k-means++ performance, [55]). Let κ1 and κ2 be two constants such that κ1 defines the quasi-triangular inequality property:

D(x : z) κ (D(x : y)+ D(y : z)) , x,y,z ∆d, (126) ≤ 1 ∀ ∈ and κ2 handles the symmetry inequality:

D(x : y) κ D(y : x), x,y ∆d. (127) ≤ 2 ∀ ∈ Then the generalized k-means++ seeding guarantees with high probability a configuration C of cluster centers such that:

2 ED(P,C) 2κ (1 + κ )(2 + log k)E∗ (P, k). (128) ≤ 1 2 D

To bound the constants κ1 and κ2, we rewrite the generalized Jensen-Shannon divergences using quadratic form expressions: That is, using a squared Mahalanobis distance:

DQ(p : q)= (p q) Q(p q), (129) − ⊤ − q 18 for a positive-definite matrix Q 0. Since the Bregman divergence can be interpreted as the tail ≻ of a first-order Taylor expansion, we have:

1 2 BF (θ : θ )= (θ θ )⊤ F (ξ)(θ θ ), (130) 1 2 2 1 − 2 ∇ 1 − 2 for ξ Θ (open convex). Similarly, the Jensen divergence can be interpreted as a Jensen-Bregman ∈ divergence, and thus we have

1 2 JF (θ : θ ) (θ θ )⊤ F (ξ′)(θ θ ), (131) 1 2 2 1 − 2 ∇ 1 − 2 for ξ Θ. More precisely, for a prescribed point set θ ,...,θn , we have ξ,ξ CH( θ ,...,θn ), ′ ∈ { 1 } ′ ∈ { 1 } where CH denotes the closed convex hull. We can therefore upper bound κ1 and κ2 using the ratio max 2F (θ) θ∈CH({θ1,...,θn}) k∇ k max 2F (θ) . See [1] for further details. θ∈CH({θ1,...,θn}) k∇ k A centroid for a set of parameters θ1,...,θn is defined as the minimizer of the functional 1 E (θ)= D(θ : θ). (132) D n i i X In particular, the symmetrized Bregman centroids have been studied in [51] (for JSGα ), and the Jensen centroids (for JSGα ) have been investigated in [47] using the convex-concave iterative procedure. ∗
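As an illustration of the generalized k-means++ seeding of Theorem 12 (our sketch, not the authors' code), the following routine seeds k cluster centers for an arbitrary divergence D; the squared Euclidean distance used in the demo is an arbitrary plug-in choice.

```python
import numpy as np

def kmeanspp_seeding(points, k, D, seed=0):
    """Generalized k-means++ seeding: pick the next center with probability
    proportional to the divergence D to the nearest already-chosen center."""
    rng = np.random.default_rng(seed)
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        dist = np.array([min(D(p, c) for c in centers) for p in points])
        centers.append(points[rng.choice(len(points), p=dist / dist.sum())])
    return centers

def sqeuclid(p, c):               # a Bregman divergence used here as an illustrative D
    return float(np.sum((p - c) ** 2))

pts = np.random.default_rng(1).normal(size=(200, 2))
centers = kmeanspp_seeding(pts, k=3, D=sqeuclid)
print(np.array(centers))
```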

4.2 The skewed Bhattacharyya distance interpreted as a geometric Jensen- Shannon symmetrization (G-JS)

Let KL∗(p : q) := KL(q : p) denote the reverse Kullback-Leibler divergence (rKL divergence for short). The -reference duality is an involution: (D ) (p : q)= D(p : q). ∗ ∗ ∗ The geometric Jensen-Shannon symmetrization [45] of the reverse Kullback-Leibler divergence is defined by Gα G G JS ∗ (p : q) := (1 α)KL∗(p : (pq) )+ αKL∗(q : (pq) ), (133) KL − α α 1 α α where Gα(x,y)= x − y denotes the geometric weighted mean for x> 0 and y > 0, and

G Gα(p(x),q(x)) (pq)α (x) := G , (134) Zα (p : q) is the geometric mixture with normalizing coefficient:

G Zα (p : q)= Gα(p(x),q(x))dµ(x). (135) Z Let us observe that following identity:

Definition 13. The skewed Bhattacharyya distance is a geometric Jensen-Shannon divergence with respect to the reverse Kullback-Leibler divergence:

Gα JSKL∗ (p : q)= Bα(p : q).

19 Proof:

G G Gα G (pq)α G (pq)α JS ∗ (p : q) = (1 α)(pq) log + α(pq) log dµ, (136) KL − α p α q Z   (pq)G = (pq)G log α dµ, (137) α p1 αqα Z  −  1 = (pq)G log dµ, (138) α ZG(p : q) Z α = log ZG(p : q) (pq)Gdµ, (139) − α α Z = Bα(p : q), (140)

G 1 α α G 1 α α G since Zα (p : q)= p − q dµ and (pq)α = p − q /Zα (p : q). The Bhattacharyya distance can also be interpreted as the negative the log-normalizer of the 1D R exponential family induced by the two distinct distributions p and q. Indeed, consider the family of geometric mixtures

1 λ λ p − (x)q (x) (p,q) := pλ(x) := , λ (0, 1) . E ZG(p,q) ∈  λ 

This family describes an Hellinger arc [25, 59] (also called a Bhattacharyya arc). Let p0(x) := p(x) and p1(x) := q(x). The Bhattacharyya arc is a 1D natural exponential family (i.e., an exponential family of order 1):

1 λ λ p0− (x)p1 (x) pλ(x) = G , (141) Zλ (p,q) p (x) = p (x)exp λ log 1 log ZG(p,q) , (142) 0 p (x) − λ   0   = exp (λt(x) Fpq(λ)+ k(x)) , (143) − with sufficient statistics t(x) the logarithm of likelihood ratio6 (i.e., the ratio of the densities p1(x) ): p0(x)

p (x) t(x) := log 1 , (144) p (x)  0  auxiliary carrier term k(x) = log p(x) with respect to µ (e.g., counting or Lebesgue positive mea- sure), natural parameter θ = λ, and log-normalizer:

G Fpq(λ)= Fpq(θ) := log Zλ (p,q) , (145)

1 λ λ = log p − (x)q (x)dµ(x) , (146) ZX  =: Bλ[p : q], (147) − where Bα is the skewed α-Bhattacharyya distance also called α-Chernoff distance.

6Hence, Gr¨unwald [24] called this Hellinger arc a likelihood ratio exponential family.

20 ω Since the log-normalizer Fpq of an exponential family is strictly convex and real analytic C , we deduce that Bα[p : q] is strictly concave and real analytic for α (0, 1). Moreover, since ∈ B0[p : q]= B1[p : q] = 0, there exists are unique value α∗ maximizing Bα[p : q]:

α∗ = arg max Bα[p : q]. (148) α [0,1] ∈ The maximal skewed Bhattacharyya distance is called the Chernoff information [35, 36]: Chernoff Dα [p : q] := Bα∗ [p : q]. (149)

The value α∗ corresponds to the error exponent in Bayesian asymptotic hypothesis testing [36, 39]. Note that since Fpq(λ) < for λ (0, 1), we have Bα[p : q] < for α (0, 1). Moreover, the ∞ ∈ ∞ ∈ moment parameterization of a 1D exponential family [48] is η(θ) = Epθ [t(x)] = F ′(θ). Thus it follows that p1(x) F ′ (λ)= E [t(x)] = E log . (150) pq pλ pλ p (x)   0  The optimal exponent α for the Chernoff information is therefore found for B (p : q)= F (α)= ∗ α′ − pq′ 0: p (x) E log 0 = 0, (151) pλ p (x)   1  p (x) p (x) log 0 dµ(x) = 0, (152) λ p (x) Z  1  p (x)p (x) p (x) log 0 λ dµ(x) = 0, (153) λ p (x)p (x) Z  λ 1  p (x) p (x) p (x) log 0 dµ(x)+ p (x) log λ dµ(x) = 0, (154) λ p (x) λ p (x) Z  λ  Z  1  KL(pλ : p ) KL(pλ : p ) = 0. (155) 1 − 0 Thus at the optimal exponent α∗, we have KL(pα∗ : p1) = KL(pα∗ : p0). We summarize the result:

Definition 14. The skewed Bhattacharyya distance Bα is strictly concave on (0, 1) and real analytic: Bα admits a unique maximum α (0, 1) called the Chernoff information so that ∗ ∈ KL(pα∗ : p1) = KL(pα∗ : p0).

Since the reverse KL divergence between two densities amount to a Bregman divergence between the corresponding natural parameters for the Bregman generator set to the cumulant function [40, 43], we have:

G G G G KL∗((pq)λ1 : (pq)λ2 )= BFpq (λ1 : λ2) = KL((pq)λ2 : (pq)λ1 ), (156) where BF is the Bregman divergence corresponding to the univariate generator Fpq(θ)= Bθ(p : pq − q). Therefore we check the following identity: 1 β β G G p − q dµ KL∗((pq) : (pq) ) = log (β α) (KL(pα : p ) KL(pα : p )) . (157) λ1 λ2 p1 βqβdµ − − 1 − 0 R −  R 21 4.3 The harmonic Jensen-Shannon divergence (H-JS) The principle to get closed-form formula for generalized Jensen-Shannon divergences between dis- tributions belonging to a parametric family Θ = pθ : θ Θ consists in finding an abstract mean M P { ∈ } M such that the M-mixture (pθ1 pθ2 )α belongs to the family Θ. In particular, when Θ is a convex M P domain, we seek a mean M such that (pθ pθ ) = p with (θ θ )α Θ. 1 2 α (θ1θ2)α 1 2 ∈ Let us consider the weighted harmonic mean [33] (induced by the harmonic mean) H: 1 xy xy Hα(x,y):= 1 1 = = , α [0, 1]. (158) (1 α) + α (1 α)y + αx (xy)1 α ∈ − x y − − h The harmonic mean is a quasi-arithmetic mean Hα(x,y)= Mα (x,y) obtained for the monotone (decreasing) function h(u)= 1 (or equivalently for the increasing monotone function h(u)= 1 ). u − u This harmonic mean is well-suited for the scale family of Cauchy probability distributions [46] C (also called Lorentzian distributions):

1 x γ := pγ(x)= p = : γ Γ = (0, ) , (159) CΓ γ std γ π(γ2 + x2) ∈ ∞     1 where γ denotes the scale and pstd(x)= π(1+x2) the standard Cauchy distribution. Using a computer algebra system7 (see Appendix B), we find that

H Hα(pγ1 (x) : pγ2 (x)) (pγ1 pγ2 ) 1 (x)= H = p(γ1γ2)α (160) 2 Zα (γ1, γ2) where the normalizing coefficient is

H γ1γ2 γ1γ2 Zα (γ1, γ2):= = , (161) (γ1γ2)α(γ1γ2)1 α (γ1γ2)α(γ2γ1)α r − r since we have (γ1γ2)1 α = (γ2γ1)α. − The H-Jensen-Shannon symmetrization of a distance D between distributions writes as:

JSHα (p : q) = (1 α)D(p : (pq)H )+ αD(q : (pq)H ), (162) D − α α where Hα denote the weighted harmonic mean. When D is available in closed form for distributions Hα belonging to the scale Cauchy distributions, so is JSD (p : q). For example, consider the KL divergence formula between two scale Cauchy distributions:8

A(γ1, γ2) γ1 + γ2 KL(pγ1 : pγ2 ) = 2 log = 2 log , (163) G(γ1, γ2) 2√γ1γ2

A where A and G denote the arithmetic and geometric means, respectively. Since A G (and G 1), 9≥ ≥ it follows that KL(pγ : pγ ) 0. Notice that the KL divergence is symmetric for Cauchy scale 1 2 ≥ 7We use Maxima: http://maxima.sourceforge.net/ 8The formula initially reported in [63] has been corrected by the authors. 9For exponential families, the KL divergence is symmetric only for the location Gaussian family (since the only symmetric Bregman divergences are the squared Mahalanobis distances [11]).

22 2 (γ1+γ2) distributions. The cross-entropy between scale Cauchy distributions is h (pγ : pγ ) = log π , × 1 2 γ2 and the differential entropy is h(pγ)= h×(pγ : pγ) = log 4πγ.

Then the H-JS divergence between p = pγ1 and q = pγ2 is:

H 1 H H JS (p : q) = KL p : (pq) 1 + KL q : (pq) 1 , (164) 2 2 2 H 1      JS (pγ1 : pγ2 ) = KL pγ1 : p γ1+γ2 + KL pγ2 : p γ1+γ2 , (165) 2 2 2  (3γ + γ )(3γ + γ )   = log 1 2 2 1 . (166) 8√γ1γ2(γ1 + γ2)  

Hα We check that when γ1 = γ2 = γ, we have JS (pγ : pγ) = 0. Notice that we could use the more generic KL divergence formula between two location-scale Cauchy distributions pl1,γ1 and pl2,γ2 (with respective location l1 and l2):

2 2 (γ1 + γ2) + (l1 l2) KL(pl1,γ1 : pl2,γ2 ) = log − . (167) 4γ1γ2 Theorem 15 (Harmonic JSD between scale Cauchy distributions.). The harmonic Jensen-Shannon divergence between two scale Cauchy distributions pγ1 and pγ2 is H (3γ1+γ2)(3γ2+γ1) JS (pγ : pγ ) = log . 1 2 8√γ1γ2(γ1+γ2)
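A small sketch (ours) implementing Theorem 15 and cross-checking it against the definition: at α = 1/2 the harmonic mixture of two scale Cauchy densities is the scale Cauchy density of scale (γ_1 + γ_2)/2 (Eq. 160), so the divergence can also be assembled from two closed-form KL terms (Eq. 163).

```python
import numpy as np

def kl_cauchy_scale(g1, g2):
    """Closed-form KL between scale Cauchy densities (Eq. 163)."""
    return 2.0 * np.log((g1 + g2) / (2.0 * np.sqrt(g1 * g2)))

def harmonic_jsd(g1, g2):
    """Harmonic Jensen-Shannon divergence between scale Cauchy densities (Theorem 15)."""
    return float(np.log((3 * g1 + g2) * (3 * g2 + g1) / (8.0 * np.sqrt(g1 * g2) * (g1 + g2))))

g1, g2 = 0.1, 0.5
gm = 0.5 * (g1 + g2)              # scale of the harmonic mixture at alpha = 1/2
direct = 0.5 * (kl_cauchy_scale(g1, gm) + kl_cauchy_scale(g2, gm))
assert abs(harmonic_jsd(g1, g2) - direct) < 1e-12
print(harmonic_jsd(g1, g2))       # about 0.176 for these scales
```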

Let us report some numerical examples: Consider γ_1 = 0.1 and γ_2 = 0.5; we find that JS^H(p_γ1 : p_γ2) ≃ 0.176. When γ_1 = 0.2 and γ_2 = 0.8, we find that JS^H(p_γ1 : p_γ2) ≃ 0.129.
Notice that the KL formula is scale-invariant, and this property holds for any scale family:

Lemma 16. The Kullback-Leibler divergence between two distributions ps1 and ps2 belonging to 1 x the same scale family ps(x)= s p( s ) s (0, ) with standard density p is scale-invariant: KL(pλs1 : { } ∈ ∞ s s pλs2 ) = KL(ps1 : ps2 ) = KL(p : p 2 ) = KL(p 1 : p) for any λ> 0. s1 s2 x A direct proof follows from a change of variable in the KL integral with y = λ and dx = λdy. Note that although the KLD between scale Cauchy distributions is symmetric, it is not the case for all scale families: For example, the Rayleigh distributions form a scale family with the KLD amounting to compute a Bregman asymmetric Itakura-Saito divergence between parameters [48]. Instead of the KLD, we can choose the total variation distance for which a formula has been reported in [39] between two Cauchy distributions. Notice that the Cauchy distributions are alpha- stable distributions for α = 1 and q gaussian distributions for q = 2 ([32], p. 104). A closed-form formula for the divergence between two q-Gaussians is given in [32] when q < 2. The definite + q integral hq(p)= ∞ p(x) dµ is available in closed-form for Cauchy distributions. When q = 2, we have h (p )= 1 −∞. 2 γ 2πγR This example further extends to power means and Student t-distributions: We refer to [39] for yet other illustrative examples considering the family of Pearson type VII distributions and central multivariate t-distributions which use the power means (quasi-arithmetic means M h induced by h(u)= uα for α> 0) for defining mixtures. Table 1 summarizes the various examples introduced in the paper.

JS^{M_α}    mean M        parametric family      normalizer Z_α^M(p : q)
JS^{A_α}    arithmetic A  mixture family         Z_α^A(θ_1 : θ_2) = 1
JS^{G_α}    geometric G   exponential family     Z_α^G(θ_1 : θ_2) = exp(−J_F^α(θ_1 : θ_2))
JS^{H_α}    harmonic H    Cauchy scale family    Z_α^H(θ_1 : θ_2) = sqrt( θ_1 θ_2 / ((θ_1 θ_2)_α (θ_1 θ_2)_{1−α}) )

Table 1: Summary of the weighted means M chosen according to the parametric family in order to ensure that the family is closed under M-mixturing: (p_θ1 p_θ2)_α^M = p_((θ1 θ2)_α).

4.4 The M-Jensen-Shannon matrix distances In this section, we consider distances between matrices which play an important role in quantum computing [12, 7]. We refer to [15] for the matrix Jensen-Bregman logdet divergence. The can be interpreted as the difference of an arithmetic mean A and a geometric mean G:

DH (p,q)= 1 p(x) q(x)dµ(x)= (A(p(x),q(x)) G(p(x),q(x)))dµ(x). (168) s − s − ZX p p ZX Notice that since A G, we have DH (p,q) 0. The scaled and squared Hellinger distance is ≥ ≥ an α-divergence Iα for α = 0. Recall that the α-divergence can be interpreted as the difference of a weighted arithmetic minus a weighted geometry mean. In general, if a mean M1 dominates a mean M2, we may define the distance as

DM ,M (p,q)= (M (p,q) M (p,q)) dµ(x). (169) 1 2 1 − 2 ZX However, when considering matrices [9], there is not a unique definition of a geometric matrix mean, and thus we have different notions of matrix Hellinger distances [9], some of them are divergences (i.e., smooth distances defining a dualistic structure in information geometry). We define the matrix M-Jensen-Shannon divergence for a matrix divergence D as follows: 1 JSM (X , X ) := (D(X , M(X , X )) + D(X , M(X , X ))) . (170) D 1 2 2 1 1 2 2 1 2 5 Conclusion and perspectives

In this work, we introduced a generalization of the celebrated Jensen-Shannon divergence [30], termed the (M,N)-Jensen-Shannon divergences, based on statistical M-mixtures derived from generic abstract means M and N, where the N mean is used to symmetrize the asymmetric Kullback-Leibler divergence. This new family of divergences includes the ordinary Jensen-Shannon divergence when both M and N are set to the arithmetic mean. This process can be extended to any base divergence D to obtain its JS-symmetrization. We reported closed-form expressions of the M Jensen-Shannon divergences for mixture families and exponential families in information geometry by choosing the arithmetic and geometric weighted mean, respectively. The α-skewed geometric Jensen-Shannon divergence (G-JSD for short) between densities pθ1 and pθ2 of the same exponential family with cumulant function F is

Gα Aα JS [p : p ]=JS ∗ (θ : θ ). KL θ1 θ2 BF 1 2

24 Gα Here, we used the bracket notation to emphasize the fact that the statistical distance JSKL is Aα between densities, and the parenthesis notation to emphasize that the distance JS ∗ is between BF Gα α (vector) parameters. We also have JSKL∗ [pθ1 : pθ2 ] = JF (θ1 : θ2). We reported how to get a closed-form formula for the harmonic Jensen-Shannon divergence of Cauchy scale distributions by taking harmonic mixtures. We defined the skew N-Jeffreys symmetrization for an arbitrary distance D and scalar β [0, 1]: ∈ Nβ JD (p1 : p2)= Nβ(D(p1 : p2), D(p2 : p1)), (171) and the skew (M,N)-JS symmetrization of an arbitrary distance D:

Mα,Nβ M M JSD (p1 : p2)= Nβ(D(p1, (p1p2)α ), D(p2, (p1p2)α )). (172) Finally, let us mention that the Jensen-Shannon divergence has further been extended recently to a skew vector parameter in [45] instead of an ordinary scalar parameter.

References

[1] Marcel R Ackermann and Johannes Bl¨omer. Bregman clustering for separable instances. In Scandinavian Workshop on Algorithm Theory, pages 212–223. Springer, 2010.

[2] Shun-ichi Amari. Integration of stochastic models by minimizing α-divergence. Neural com- putation, 19(10):2780–2796, 2007.

[3] Shun-ichi Amari. Information geometry and its applications. Springer, 2016.

[4] Shun-ichi Amari and Andrzej Cichocki. Information geometry of divergence functions. Bulletin of the Polish Academy of Sciences: Technical Sciences, 58(1):183–195, 2010.

[5] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.

[6] M. Asadi, N. Ebrahimi, O. Kharazmi, and E. S. Soofi. Mixture models, bayes fisher informa- tion, and divergence measures. IEEE Transactions on Information Theory, 65(4):2316–2321, April 2019.

[7] Koenraad MR Audenaert. Quantum skew divergence. Journal of Mathematical Physics, 55(11):112202, 2014.

[8] Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of machine learning research, 6(Oct):1705–1749, 2005.

[9] Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. Strong convexity of sandwiched entropies and related optimization problems. Reviews in Mathematical Physics, 30(09):1850014, 2018.

[10] Patrick Billingsley. Probability and measure. John Wiley & Sons, 2008.

[11] Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock. Bregman Voronoi diagrams. Dis- crete & Computational Geometry, 44(2):281–307, 2010.

25 [12] Jop Bri¨et and Peter Harremo¨es. Properties of classical and quantum Jensen-Shannon diver- gence. Physical review A, 79(5):052311, 2009.

[13] Pengwen Chen, Yunmei Chen, and Murali Rao. Metrics defined by Bregman divergences. Communications in Mathematical Sciences, 6(4):915–926, 2008.

[14] Pengwen Chen, Yunmei Chen, and Murali Rao. Metrics defined by Bregman divergences: Part 2. Communications in Mathematical Sciences, 6(4):927–948, 2008.

[15] Anoop Cherian, Suvrit Sra, Arindam Banerjee, and Nikolaos Papanikolopoulos. Jensen- Bregman logdet divergence with application to efficient similarity search for matri- ces. IEEE transactions on pattern analysis and machine intelligence, 35(9):2161–2174, 2013.

[16] Thomas M. Cover and Joy A. Thomas. Elements of information theory. John Wiley & Sons, 2012.

[17] Imre Csisz´ar. Information-type measures of difference of probability distributions and indirect observation. studia scientiarum Mathematicarum Hungarica, 2:229–318, 1967.

[18] Simon DeDeo, Robert XD Hawkins, Sara Klingenstein, and Tim Hitchcock. Bootstrap methods for the empirical study of decision-making and information flows in social systems. Entropy, 15(6):2246–2276, 2013.

[19] Shinto Eguchi. Geometry of minimum contrast. Hiroshima Mathematical Journal, 22(3):631– 647, 1992.

[20] Shinto Eguchi and Osamu Komori. Path connectedness on a space of probability density functions. In Geometric Science of Information (GSI), pages 615–624, 2015.

[21] Shinto Eguchi, Osamu Komori, and Atsumi Ohara. Information geometry associated with generalized means. In Information Geometry and its Applications IV, pages 279–295. Springer, 2016.

[22] Bent Fuglede and Flemming Topsoe. Jensen-Shannon divergence and Hilbert space embedding. In International Symposium on Information Theory (ISIT), page 31. IEEE, 2014.

[23] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

[24] Peter D Gr¨unwald. The minimum description length principle. MIT press, 2007.

[25] Henryk Gzyl. The method of maximum entropy, volume 29. World scientific, 1995.

[26] Siu-Wai Ho and Raymond W Yeung. On the discontinuity of the Shannon information mea- sures. In International Symposium on Information Theory (ISIT), pages 159–163. IEEE, 2005.

[27] Don Johnson and Sinan Sinanovic. Symmetrizing the Kullback-Leibler distance. IEEE Trans- actions on Information Theory, 2001.

[28] P Kafka, F Osterreicher,¨ and I Vincze. On powers of f-divergences defining a distance. Studia Sci. Math. Hungar, 26(4):415–422, 1991.

26 [29] Lillian Lee. Measures of distributional similarity. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, ACL ’99, pages 25–32, Stroudsburg, PA, USA, 1999. Association for Computational Linguistics.

[30] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information theory, 37(1):145–151, 1991.

[31] Geoffrey J McLachlan, Sharon X Lee, and Suren I Rathnayake. Finite mixture models. Annual review of statistics and its application, 6:355–378, 2019.

[32] Jan Naudts. Generalised thermostatistics. Springer Science & Business Media, 2011.

[33] Constantin Niculescu and Lars-Erik Persson. Convex functions and their applications. Springer, 2018. Second Edition.

[34] Frank Nielsen. A family of statistical symmetric divergences based on Jensen’s inequality. arXiv preprint arXiv:1009.4004, 2010.

[35] Frank Nielsen. Chernoff information of exponential families. arXiv preprint arXiv:1102.2684, 2011.

[36] Frank Nielsen. An information-geometric characterization of Chernoff information. IEEE Signal Processing Letters, 20(3):269–272, 2013.

[37] Frank Nielsen. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Processing Letters, 20(7):657–660, 2013.

[38] Frank Nielsen. Pattern learning and recognition on statistical manifolds: an information-geometric review. In International Workshop on Similarity-Based Pattern Recognition, pages 1–25. Springer, 2013.

[39] Frank Nielsen. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Pattern Recognition Letters, 42:25–34, 2014.

[40] Frank Nielsen. An elementary introduction to information geometry. CoRR, abs/1808.08271, 2018.

[41] Frank Nielsen. On the Jensen–Shannon symmetrization of distances relying on abstract means. Entropy, 21(5):485, 2019.

[42] Frank Nielsen. The statistical Minkowski distances: Closed-form formula for Gaussian mixture models. arXiv preprint arXiv:1901.03732, 2019.

[43] Frank Nielsen. An elementary introduction to information geometry. Entropy, 22(10):1100, 2020.

[44] Frank Nielsen. A generalization of the α-divergences based on comparable and distinct weighted means. CoRR, abs/2001.09660, 2020.

[45] Frank Nielsen. On a generalization of the Jensen-Shannon divergence and the Jensen-Shannon centroid. Entropy, 22(2):221, 2020.

[46] Frank Nielsen. On Voronoi diagrams on the information-geometric Cauchy manifolds. Entropy, 22(7):713, 2020.

[47] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, 2011.

[48] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. arXiv preprint arXiv:0911.4863, 2009.

[49] Frank Nielsen and Gaëtan Hadjeres. Monte Carlo information geometry: The dually flat case. arXiv preprint arXiv:1803.07225, 2018.

[50] Frank Nielsen and Gaëtan Hadjeres. On power chi expansions of f-divergences. arXiv preprint arXiv:1903.05818, 2019.

[51] Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6):2882–2904, 2009.

[52] Frank Nielsen and Richard Nock. A closed-form expression for the Sharma–Mittal entropy of exponential families. Journal of Physics A: Mathematical and Theoretical, 45(3):032003, 2011.

[53] Frank Nielsen and Richard Nock. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Processing Letters, 21(1):10–13, 2014.

[54] Frank Nielsen and Richard Nock. Optimal interval clustering: Application to Bregman clustering and statistical mixture learning. IEEE Signal Processing Letters, 21(10):1289–1292, 2014.

[55] Frank Nielsen and Richard Nock. Total Jensen divergences: definition, properties and clustering. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2016–2020. IEEE, 2015.

[56] Frank Nielsen and Richard Nock. Generalizing skew Jensen divergences and Bregman divergences with comparative convexity. IEEE Signal Processing Letters, 24(8):1123–1127, 2017.

[57] Frank Nielsen and Richard Nock. On the geometry of mixtures of prescribed distributions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2861–2865. IEEE, 2018.

[58] Frank Nielsen, Richard Nock, and Shun-ichi Amari. On clustering histograms with k-means by using mixed α-divergences. Entropy, 16(6):3273–3301, 2014.

[59] Michael Nussbaum and Arleta Szkoła. The Chernoff lower bound for symmetric quantum hypothesis testing. The Annals of Statistics, 37(2):1040–1057, 2009.

[60] Ferdinand Österreicher and Igor Vajda. A new class of metric divergences on probability spaces and its applicability in statistics. Annals of the Institute of Statistical Mathematics, 55(3):639–653, 2003.

[61] Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1961.

[62] Gregory E Sims, Se-Ran Jun, Guohong A Wu, and Sung-Hou Kim. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences, pages pnas–0813249106, 2009.

[63] George Tzagkarakis and Panagiotis Tsakalides. A statistical approach to texture image retrieval via alpha-stable modeling of wavelet decompositions. In 5th International Workshop on Image Analysis for Multimedia Interactive Services, pages 21–23, 1997.

[64] Igor Vajda. On metric divergences of probability measures. Kybernetika, 45(6):885–900, 2009.

[65] Sumio Watanabe, Keisuke Yamazaki, and Miki Aoyagi. Kullback information of normal mixture is not an analytic function. IEICE technical report. Neurocomputing, 104(225):41–46, 2004.

[66] Shintaro Yoshizawa and Kunio Tanabe. Dual differential geometry associated with the Kullback-Leibler information on the Gaussian distributions and its 2-parameter deformations. SUT Journal of Mathematics, 35(1):113–137, 1999.

[67] Jun Zhang. Reference duality and representation duality in information geometry. AIP Conference Proceedings, 1641(1):130–146, 2015.

A Summary of distances and their notations

Weighted mean $M_\alpha$, $\alpha \in (0,1)$:
Arithmetic mean: $A_\alpha(x,y) = (1-\alpha) x + \alpha y$
Geometric mean: $G_\alpha(x,y) = x^{1-\alpha} y^{\alpha}$
Harmonic mean: $H_\alpha(x,y) = \frac{xy}{(1-\alpha) y + \alpha x}$
Power mean: $P_\alpha(x,y) = \left((1-\alpha) x^p + \alpha y^p\right)^{1/p}$, $p \in \mathbb{R}\setminus\{0\}$, with $\lim_{p\to 0} P_\alpha = G_\alpha$
Quasi-arithmetic mean: $M_\alpha^f(x,y) = f^{-1}\left((1-\alpha) f(x) + \alpha f(y)\right)$ for a strictly monotone function $f$
$M$-mixture: $(pq)_\alpha^M(x) = \frac{M_\alpha(p(x),q(x))}{Z_\alpha^M(p,q)}$ with normalizer $Z_\alpha^M(p,q) = \int_{t\in\mathcal{X}} M_\alpha(p(t),q(t))\,\mathrm{d}\mu(t)$
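For illustration, these weighted means can be evaluated in Maxima (a minimal sketch; the helper names and the values $x=2$, $y=8$, $\alpha=1/4$ are arbitrary), and the output exhibits the ordering $H_\alpha \leq G_\alpha \leq A_\alpha$:

A(x, y, a) := (1 - a)*x + a*y;           /* weighted arithmetic mean */
G(x, y, a) := x^(1 - a) * y^a;           /* weighted geometric mean */
H(x, y, a) := (x*y)/((1 - a)*y + a*x);   /* weighted harmonic mean */
float([H(2, 8, 1/4), G(2, 8, 1/4), A(2, 8, 1/4)]);  /* H <= G <= A */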

Statistical distance $D(p:q)$:
Dual/reverse distance: $D^*(p:q) := D(q:p)$
Kullback-Leibler divergence: $\mathrm{KL}(p:q) = \int p(x) \log\frac{p(x)}{q(x)}\,\mathrm{d}\mu(x)$
reverse Kullback-Leibler divergence: $\mathrm{KL}^*(p:q) = \mathrm{KL}(q:p) = \int q(x) \log\frac{q(x)}{p(x)}\,\mathrm{d}\mu(x)$
Jeffreys divergence: $J(p;q) = \mathrm{KL}(p:q) + \mathrm{KL}(q:p) = \int (p(x)-q(x)) \log\frac{p(x)}{q(x)}\,\mathrm{d}\mu(x)$
Resistor divergence: $R(p;q) = \frac{2\,\mathrm{KL}(p:q)\,\mathrm{KL}(q:p)}{J(p;q)}$, i.e., $\frac{2}{R(p;q)} = \frac{J(p;q)}{\mathrm{KL}(p:q)\,\mathrm{KL}(q:p)} = \frac{1}{\mathrm{KL}(p:q)} + \frac{1}{\mathrm{KL}(q:p)}$
skew $K$-divergence: $K_\alpha(p:q) = \int p(x) \log\frac{p(x)}{(1-\alpha) p(x) + \alpha q(x)}\,\mathrm{d}\mu(x)$
Jensen-Shannon divergence: $\mathrm{JS}(p,q) = \frac{1}{2}\left(\mathrm{KL}\left(p:\frac{p+q}{2}\right) + \mathrm{KL}\left(q:\frac{p+q}{2}\right)\right)$
skew Bhattacharyya divergence: $B_\alpha(p:q) = -\log \int_{\mathcal{X}} p(x)^{1-\alpha} q(x)^{\alpha}\,\mathrm{d}\mu(x)$
skew Bhattacharyya divergence: $\mathrm{Bhat}_\alpha(p:q) = -\log \int_{\mathcal{X}} p(x)^{\alpha} q(x)^{1-\alpha}\,\mathrm{d}\mu(x)$
Hellinger distance: $D_H(p,q) = \sqrt{1 - \int_{\mathcal{X}} \sqrt{p(x)}\sqrt{q(x)}\,\mathrm{d}\mu(x)}$
$\alpha$-divergences: $I_\alpha(p:q) = \int \left(\alpha p(x) + (1-\alpha) q(x) - p(x)^{\alpha} q(x)^{1-\alpha}\right)\mathrm{d}\mu(x)$, $\alpha \notin \{0,1\}$, i.e., $I_\alpha(p:q) = \int_{\mathcal{X}} \left(A_\alpha(q(x):p(x)) - G_\alpha(q(x):p(x))\right)\mathrm{d}\mu(x)$
Mahalanobis distance: $D_Q(p:q) = \sqrt{(p-q)^{\top} Q\,(p-q)}$ for a positive-definite matrix $Q \succ 0$
$f$-divergence: $I_f(p:q) = \int p(x)\, f\!\left(\frac{q(x)}{p(x)}\right)\mathrm{d}\mu(x)$, with $f(1) = f'(1) = 0$ and $f$ strictly convex at $1$
reverse $f$-divergence: $I_f^*(p:q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)\mathrm{d}\mu(x) = I_{f^\diamond}(p:q)$ for $f^\diamond(u) = u f\!\left(\frac{1}{u}\right)$
$J$-symmetrized $f$-divergence: $J_f(p;q) = \frac{1}{2}\left(I_f(p:q) + I_f(q:p)\right)$
JS-symmetrized $f$-divergence: $I_f^{\mathrm{JS}_\alpha}(p;q) := (1-\alpha)\, I_f(p:(pq)_\alpha) + \alpha\, I_f(q:(pq)_\alpha) = I_{f_\alpha^{\mathrm{JS}}}(p:q)$ with $(pq)_\alpha = (1-\alpha) p + \alpha q$ and $f_\alpha^{\mathrm{JS}}(u) := (1-\alpha)\, f(\alpha u + 1-\alpha) + \alpha u\, f\!\left(\alpha + \frac{1-\alpha}{u}\right)$
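For illustration, the discrete Kullback-Leibler and Jensen-Shannon divergences above can be evaluated in Maxima on two 3-bin histograms (a minimal sketch; the helper names and histogram values are arbitrary, and log is the natural logarithm, so divergences are measured in nats):

p : [1/2, 1/4, 1/4]$
q : [1/4, 1/4, 1/2]$
kl(a, b) := apply("+", map(lambda([x, y], x*log(x/y)), a, b))$  /* discrete KL divergence */
avg : (p + q)/2$                       /* arithmetic mid-mixture (p+q)/2 */
js : (kl(p, avg) + kl(q, avg))/2$      /* Jensen-Shannon divergence JS(p,q) */
float([kl(p, q), kl(q, p), js]);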

Parameter distance:
Bregman divergence: $B_F(\theta:\theta') := F(\theta) - F(\theta') - \langle \theta-\theta', \nabla F(\theta')\rangle$
skew Jeffreys-Bregman divergence: $S_F^{\alpha}(\theta:\theta') = (1-\alpha)\, B_F(\theta:\theta') + \alpha\, B_F(\theta':\theta)$
skew Jensen divergence: $J_F^{\alpha}(\theta:\theta') := (F(\theta) F(\theta'))_\alpha - F((\theta\theta')_\alpha)$, where $(F(\theta) F(\theta'))_\alpha = (1-\alpha) F(\theta) + \alpha F(\theta')$ and $(\theta\theta')_\alpha = (1-\alpha)\theta + \alpha\theta'$
Jensen-Bregman divergence: $\mathrm{JB}_F(\theta;\theta') = \frac{1}{2}\left(B_F\!\left(\theta:\frac{\theta+\theta'}{2}\right) + B_F\!\left(\theta':\frac{\theta+\theta'}{2}\right)\right) = J_F(\theta;\theta')$

Generalized Jensen-Shannon divergences:
skew $J$-symmetrization: $J_D^{\alpha}(p:q) := (1-\alpha)\, D(p:q) + \alpha\, D(q:p)$
skew JS-symmetrization: $\mathrm{JS}_D^{\alpha}(p:q) := (1-\alpha)\, D\big(p:(1-\alpha) p + \alpha q\big) + \alpha\, D\big(q:(1-\alpha) p + \alpha q\big)$
skew $M$-Jensen-Shannon divergence: $\mathrm{JS}^{M_\alpha}(p:q) := (1-\alpha)\, \mathrm{KL}\big(p:(pq)_\alpha^M\big) + \alpha\, \mathrm{KL}\big(q:(pq)_\alpha^M\big)$
skew $M$-JS-symmetrization: $\mathrm{JS}_D^{M_\alpha}(p:q) := (1-\alpha)\, D\big(p:(pq)_\alpha^M\big) + \alpha\, D\big(q:(pq)_\alpha^M\big)$
$N$-Jeffreys divergence: $J^{N_\beta}(p:q) := N_\beta\big(\mathrm{KL}(p:q), \mathrm{KL}(q:p)\big)$
$N$-$J_D$ divergence: $J_D^{N_\beta}(p:q) = N_\beta\big(D(p:q), D(q:p)\big)$
skew $(M,N)$-$D$ JS divergence: $\mathrm{JS}_D^{M_\alpha,N_\beta}(p:q) := N_\beta\big(D(p:(pq)_\alpha^M), D(q:(pq)_\alpha^M)\big)$
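For instance, instantiating the last definition with the arithmetic mean for both $M$ and $N$, $\alpha = \beta = \frac{1}{2}$ and $D = \mathrm{KL}$ recovers the ordinary Jensen-Shannon divergence (the arithmetic mixture requires no normalization since $Z_\alpha^A(p,q) = 1$):
$$\mathrm{JS}_{\mathrm{KL}}^{A_{1/2},A_{1/2}}(p:q) = \frac{1}{2}\left(\mathrm{KL}\Big(p:\frac{p+q}{2}\Big) + \mathrm{KL}\Big(q:\frac{p+q}{2}\Big)\right) = \mathrm{JS}(p,q).$$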

B Symbolic calculations in Maxima

The program below (written in Maxima, software available at http://maxima.sourceforge.net/) calculates the normalizer Z for the harmonic H-mixture of two scale Cauchy distributions (Eq. 161).

assume(gamma>0);
Cauchy(x,gamma) := gamma/(%pi*(x**2+gamma**2));   /* scale Cauchy density */
assume(alpha>0);
assume(alpha<1);                                  /* skewing weight alpha lies in (0,1) */
h(x,y,alpha) := (x*y)/((1-alpha)*y+alpha*x);      /* weighted harmonic mean */
assume(gamma1>0);
assume(gamma2>0);
m(x,alpha) := ratsimp(h(Cauchy(x,gamma1),Cauchy(x,gamma2),alpha));  /* unnormalized H-mixture */

/* calculate Z */
integrate(m(x,alpha),x,-inf,inf);
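For instance, after running the lines above, the normalizer can be evaluated at the arbitrary parameter values gamma1 = 1, gamma2 = 2 and alpha = 1/3; the result is at most 1, as expected since the harmonic mean is dominated by the arithmetic mean, whose mixture integrates to one:

/* numerical sanity check of the normalizer at illustrative parameter values */
float(integrate(subst([gamma1 = 1, gamma2 = 2, alpha = 1/3], m(x, alpha)), x, -inf, inf));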
