GENERALIZED BREGMAN AND JENSEN DIVERGENCES WHICH INCLUDE SOME F-DIVERGENCES

TOMOHIRO NISHIYAMA

Abstract. In this paper, we introduce new classes of divergences by extending the definitions of the Bregman and the skew Jensen divergence. These new divergence classes (the g-Bregman divergence and the skew g-Jensen divergence) satisfy some properties similar to the Bregman or skew Jensen divergence. We show that these g-divergences include divergences which belong to a class of f-divergence (the Hellinger distance, the chi-square divergence and the alpha-divergence, in addition to the Kullback-Leibler divergence). Moreover, we derive an inequality between the skew g-Jensen divergence and the g-Bregman divergence and show that this inequality is a generalization of Lin's inequality.

Keywords: Bregman divergence, f-divergence, Jensen divergence, Kullback-Leibler divergence, Jeffreys divergence, Jensen-Shannon divergence, Hellinger distance, chi-square divergence, alpha divergence.

1. Introduction

Divergences are functions that measure the discrepancy between two points and play a key role in fields such as machine learning and signal processing. Given a set S and p, q ∈ S, a divergence is defined as a function D : S × S → R which satisfies the following properties.
1. D(p, q) ≥ 0 for all p, q ∈ S
2. D(p, q) = 0 ⟺ p = q

The Bregman divergence [6] $B_F(p,q) \equiv F(p) - F(q) - \langle p - q, \nabla F(q)\rangle$ and the f-divergence [1, 9] $D_f(p\|q) \equiv \sum_i q_i f\left(\frac{p_i}{q_i}\right)$ are representative divergences defined by using a strictly convex function, where $F, f : S \to R$ are strictly convex functions and f(1) = 0. Besides that, Nielsen has introduced the skew Jensen divergence $J_{F,\alpha}(p,q) \equiv (1-\alpha)F(p) + \alpha F(q) - F((1-\alpha)p + \alpha q)$ for a parameter α ∈ (0, 1) and has shown the relation between the Bregman divergence and the skew Jensen divergence [7, 14]. The Bregman divergence and the f-divergence include well-known divergences. For example, the Kullback-Leibler divergence (KL-divergence) [11] is a type of the Bregman divergence and of the f-divergence. On the other hand, the Hellinger distance, the χ-square divergence and the α-divergence [4, 8] are types of the f-divergence, and the Jensen-Shannon divergence (JS-divergence) [12] is a type of the skew Jensen divergence. For probability distributions p and q, each divergence is defined as follows.
KL-divergence

$$KL(p\|q) \equiv \sum_i p_i \ln\left(\frac{p_i}{q_i}\right) \qquad (1)$$

Hellinger distance
$$H^2(p, q) \equiv \sum_i (\sqrt{p_i} - \sqrt{q_i})^2 \qquad (2)$$
Pearson χ-square divergence

$$D_{\chi^2,P}(p\|q) \equiv \sum_i \frac{(p_i - q_i)^2}{q_i} \qquad (3)$$
Neyman χ-square divergence

$$D_{\chi^2,N}(p\|q) \equiv \sum_i \frac{(p_i - q_i)^2}{p_i} \qquad (4)$$
α-divergence
$$D_\alpha(p\|q) \equiv \frac{1}{\alpha(\alpha - 1)}\left(\sum_i p_i^\alpha q_i^{1-\alpha} - 1\right) \qquad (5)$$
The main goal of this paper is to introduce new classes of divergences ("g-divergences") and to clarify the relations among the g-divergences and the f-divergence. First, we introduce the g-Bregman divergence and the skew g-Jensen divergence by extending the definitions of the Bregman divergence and the skew Jensen divergence, respectively. For these new classes of divergences, we show that they satisfy some properties similar to the Bregman or the skew Jensen divergence, and we derive an inequality between them. Then, we show that the g-Bregman divergence includes the Hellinger distance, the Pearson and Neyman χ-square divergences and the α-divergence. Furthermore, we show that the skew g-Jensen divergence includes the Hellinger distance and the α-divergence. These divergences have not only the properties of the f-divergence but also some properties similar to the Bregman or the skew Jensen divergence. Finally, we derive many inequalities by using an inequality between the g-Bregman divergence and the skew g-Jensen divergence.
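As a numerical companion to definitions (1)–(5), the following Python sketch implements each divergence directly. It is our own illustration rather than part of the original text; it assumes NumPy, strictly positive vectors, and the function names are ours.

import numpy as np

def kl(p, q):
    # Eq. (1): Kullback-Leibler divergence
    return np.sum(p * np.log(p / q))

def hellinger2(p, q):
    # Eq. (2): squared Hellinger distance
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

def chi2_pearson(p, q):
    # Eq. (3): Pearson chi-square divergence
    return np.sum((p - q) ** 2 / q)

def chi2_neyman(p, q):
    # Eq. (4): Neyman chi-square divergence
    return np.sum((p - q) ** 2 / p)

def alpha_div(p, q, a):
    # Eq. (5): alpha-divergence, defined for a outside {0, 1}
    return (np.sum(p ** a * q ** (1.0 - a)) - 1.0) / (a * (a - 1.0))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), hellinger2(p, q), chi2_pearson(p, q), chi2_neyman(p, q), alpha_div(p, q, 0.5))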

2. g-Bregman divergence and skew g-Jensen divergence

By using a function g which has an inverse function, we define the g-Bregman divergence and the skew g-Jensen divergence.

Definition 1. Let Ω ⊂ $R^d$ be a convex set and let F : Ω → R be a strictly convex function. The Bregman divergence and the symmetric Bregman divergence are defined as

$$B_F(p, q) \equiv F(p) - F(q) - \langle p - q, \nabla F(q)\rangle \qquad (6)$$
$$B_{F,sym}(p, q) \equiv B_F(p, q) + B_F(q, p) = \langle p - q, \nabla F(p) - \nabla F(q)\rangle. \qquad (7)$$
The symbol ⟨·, ·⟩ indicates an inner product. For a parameter α ∈ R \ {0, 1}, the scaled skew Jensen divergence [13, 14] is defined as
$$sJ_{F,\alpha}(p, q) \equiv \frac{1}{\alpha(1-\alpha)}\Big((1 - \alpha)F(p) + \alpha F(q) - F((1 - \alpha)p + \alpha q)\Big). \qquad (8)$$
Nielsen has shown the relation between the Bregman divergence and the Jensen divergence as follows [14].

$$B_F(q, p) = \lim_{\alpha \to 0} sJ_{F,\alpha}(p, q) \qquad (9)$$

$$B_F(p, q) = \lim_{\alpha \to 1} sJ_{F,\alpha}(p, q) \qquad (10)$$
$$sJ_{F,\alpha}(p, q) = \frac{1}{\alpha(1-\alpha)}\Big((1 - \alpha)B_F(p, (1 - \alpha)p + \alpha q) + \alpha B_F(q, (1 - \alpha)p + \alpha q)\Big) \qquad (11)$$

Definition 2. (Definition of the g-convex set) Let S be a vector space over the real numbers and let g be a function g : S → S. If S satisfies the following condition, we call the set S a "g-convex set": for all p and q in S and all α in the interval (0, 1), the point (1 − α)g(p) + αg(q) also belongs to S.

Definition 3. (Definition of the g-divergences) Let Ω be a g-convex set and let g : Ω → Ω be a function which satisfies g(p) = g(q) ⟺ p = q. Let α be a parameter in (0, 1). We define the (scaled) skew g-Jensen divergence, the g-Bregman divergence and the symmetric g-Bregman divergence as follows.
$$sJ^g_{F,\alpha}(p, q) \equiv sJ_{F,\alpha}(g(p), g(q)) \qquad (12)$$
$$B^g_F(p, q) \equiv B_F(g(p), g(q)) \qquad (13)$$
$$B^g_{F,sym}(p, q) \equiv B_{F,sym}(g(p), g(q)) \qquad (14)$$
From the definition of the function g, these functions satisfy the divergence properties $D^g(p, q) = 0 \iff g(p) = g(q) \iff p = q$ and $D^g(p, q) \ge 0$, where $D^g$ denotes any of the g-divergences. When g(p) is equal to p for all p ∈ Ω, the g-divergences coincide with the original divergences. In the following, we assume that g has an inverse function.

The g-Bregman divergence and the symmetric g-Bregman divergence satisfy linearity, the generalized law of cosines and the generalized parallelogram law, as do the Bregman and the symmetric Bregman divergence.

Linearity. For positive constants $c_1$ and $c_2$,
$$B^g_{c_1 F_1 + c_2 F_2}(p, q) = c_1 B^g_{F_1}(p, q) + c_2 B^g_{F_2}(p, q) \qquad (15)$$
holds.

Proposition 1. (Generalized law of cosines) Let Ω be a g-convex set. For points p, q, r ∈ Ω, the following equations hold.
$$B^g_F(p, q) = B^g_F(p, r) + B^g_F(r, q) - \langle g(p) - g(r), \nabla F(g(q)) - \nabla F(g(r))\rangle \qquad (16)$$
$$B^g_{F,sym}(p, q) = B^g_{F,sym}(p, r) + B^g_{F,sym}(r, q) - \langle g(p) - g(r), \nabla F(g(q)) - \nabla F(g(r))\rangle - \langle g(q) - g(r), \nabla F(g(p)) - \nabla F(g(r))\rangle \qquad (17)$$
These are easily proved in the same way as for the Bregman divergence by putting p′ = g(p), q′ = g(q) and r′ = g(r).

Theorem 1. (Generalized parallelogram law) Let Ω be a g-convex set. When points p, q, r, s ∈ Ω satisfy g(p) + g(r) = g(q) + g(s), the following equation holds.
$$B^g_{F,sym}(p, q) + B^g_{F,sym}(q, r) + B^g_{F,sym}(r, s) + B^g_{F,sym}(s, p) = B^g_{F,sym}(p, r) + B^g_{F,sym}(q, s) \qquad (18)$$
The left-hand side is the sum over the four sides of the quadrilateral pqrs and the right-hand side is the sum over its diagonals.

Proof. It has been shown that the Bregman divergence satisfies the four point identity (see equation (2.3) in [16]).

$$\langle \nabla F(r) - \nabla F(s), p - q\rangle = B_F(q, r) + B_F(p, s) - B_F(p, r) - B_F(q, s) \qquad (19)$$

By putting p′ = g(p), q′ = g(q), r′ = g(r) and s′ = g(s), we can prove that the g-Bregman divergence satisfies the analogous four point identity.
$$\langle \nabla F(g(r)) - \nabla F(g(s)), g(p) - g(q)\rangle = B^g_F(q, r) + B^g_F(p, s) - B^g_F(p, r) - B^g_F(q, s) \qquad (20)$$
By combining the assumption with the definition of the symmetric g-Bregman divergence ((7) and (14)), we have
$$-B^g_{F,sym}(r, s) = B^g_F(q, r) + B^g_F(p, s) - B^g_F(p, r) - B^g_F(q, s). \qquad (21)$$
Exchanging p ↔ r and q ↔ s in (21) and taking the sum with (21), the result follows.
This theorem can also be proved in the same way as mentioned in the paper [15].
The same relations between the Bregman and the skew Jensen divergence also hold for the g-Bregman and the skew g-Jensen divergence.

$$B^g_F(q, p) = \lim_{\alpha \downarrow 0} sJ^g_{F,\alpha}(p, q) \qquad (22)$$
$$B^g_F(p, q) = \lim_{\alpha \uparrow 1} sJ^g_{F,\alpha}(p, q) \qquad (23)$$

Lemma 1. Let Ω be a g-convex set and let p, q be points in Ω. For a parameter α ∈ (0, 1) and a point r ∈ Ω which satisfies g(r) = (1 − α)g(p) + αg(q), the skew g-Jensen divergence can be expressed as follows.
$$sJ^g_{F,\alpha}(p, q) = \frac{1}{\alpha(1-\alpha)}\Big((1 - \alpha)B^g_F(p, r) + \alpha B^g_F(q, r)\Big) \qquad (24)$$
This is easily proved in the same way as equation (11) by putting p′ = g(p) and q′ = g(q). Next, we derive an inequality between the skew g-Jensen and the symmetric g-Bregman divergence.

Lemma 2. Let Ω be a g-convex set and let p, q be points in Ω. For a parameter α ∈ (0, 1) and a point r ∈ Ω which satisfies g(r) = (1 − α)g(p) + αg(q), the following equations hold.
$$B^g_F(p, q) = B^g_F(p, r) + B^g_F(r, q) + \frac{\alpha}{1 - \alpha} B^g_{F,sym}(r, q) \qquad (25)$$
$$B^g_F(q, p) = B^g_F(q, r) + B^g_F(r, p) + \frac{1 - \alpha}{\alpha} B^g_{F,sym}(r, p) \qquad (26)$$
Proof. We first prove (25). By the generalized law of cosines (16), we have
$$B^g_F(p, q) = B^g_F(p, r) + B^g_F(r, q) + \langle g(r) - g(p), \nabla F(g(q)) - \nabla F(g(r))\rangle. \qquad (27)$$
From the assumption, the equation
$$g(r) - g(p) = \frac{\alpha}{1 - \alpha}\big(g(q) - g(r)\big) \qquad (28)$$
holds. By substituting (28) into (27) and using (14), the result follows. By exchanging p and q in (27), we can prove (26) in the same way.

Lemma 3. Let Ω be a g-convex set and let p, q be points in Ω. For a parameter α ∈ (0, 1) and a point r ∈ Ω which satisfies g(r) = (1 − α)g(p) + αg(q), the following equation holds.
$$B^g_{F,sym}(p, q) = \frac{1}{\alpha} B^g_{F,sym}(p, r) + \frac{1}{1 - \alpha} B^g_{F,sym}(r, q) \qquad (29)$$
Proof. Taking the sum of (25) and (26), the result follows.

Theorem 2. (Jensen-Bregman inequality) Let Ω be a g-convex set and let p, q be points in Ω. For a parameter α ∈ (0, 1), the skew g-Jensen divergence and the symmetric g-Bregman divergence satisfy the following inequality.
$$B^g_{F,sym}(p, q) \ge sJ^g_{F,\alpha}(p, q) \qquad (30)$$
Proof. Let r ∈ Ω be a point which satisfies g(r) = (1 − α)g(p) + αg(q). From (24) and using $B^g_F(p, q) \le B^g_{F,sym}(p, q)$, we have
$$sJ^g_{F,\alpha}(p, q) = \frac{1}{\alpha} B^g_F(p, r) + \frac{1}{1 - \alpha} B^g_F(q, r) \le \frac{1}{\alpha} B^g_{F,sym}(p, r) + \frac{1}{1 - \alpha} B^g_{F,sym}(q, r). \qquad (31)$$
Using Lemma 3, we have
$$sJ^g_{F,\alpha}(p, q) \le \frac{1}{\alpha} B^g_{F,sym}(p, r) + \frac{1}{1 - \alpha} B^g_{F,sym}(r, q) = B^g_{F,sym}(p, q). \qquad (32)$$
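To make Definition 3 and Theorem 2 concrete, the following is a minimal numerical sketch (our own illustration, assuming NumPy; the gradient ∇F is supplied in closed form rather than derived symbolically, and the function names are ours). It builds the g-divergences from user-supplied F, ∇F and g and spot-checks inequality (30) on random positive points.

import numpy as np

def bregman(F, gradF, x, y):
    # Eq. (6): B_F(x, y) = F(x) - F(y) - <x - y, grad F(y)>
    return F(x) - F(y) - np.dot(x - y, gradF(y))

def g_bregman(F, gradF, g, p, q):
    # Eq. (13): B^g_F(p, q) = B_F(g(p), g(q))
    return bregman(F, gradF, g(p), g(q))

def g_bregman_sym(F, gradF, g, p, q):
    # Eq. (14): symmetric g-Bregman divergence
    return g_bregman(F, gradF, g, p, q) + g_bregman(F, gradF, g, q, p)

def skew_g_jensen(F, g, p, q, a):
    # Eq. (12): scaled skew g-Jensen divergence
    gp, gq = g(p), g(q)
    return ((1 - a) * F(gp) + a * F(gq) - F((1 - a) * gp + a * gq)) / (a * (1 - a))

# Illustration with F(x) = sum_i x_i log x_i (negative entropy) and g = identity
F = lambda x: np.sum(x * np.log(x))
gradF = lambda x: np.log(x) + 1.0
g = lambda x: x

rng = np.random.default_rng(0)
for _ in range(5):
    p, q = rng.random(4) + 0.1, rng.random(4) + 0.1
    a = rng.uniform(0.05, 0.95)
    # Theorem 2, Eq. (30): the symmetric g-Bregman divergence dominates the skew g-Jensen divergence
    assert g_bregman_sym(F, gradF, g, p, q) >= skew_g_jensen(F, g, p, q, a) - 1e-12

With g the identity this reduces to the ordinary Bregman and skew Jensen divergences; other choices of g reproduce the examples of the next section.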

3. Examples

We show some examples of the g-divergences. We denote the g-Bregman divergence as "GBD", the symmetric g-Bregman divergence as "SGBD", the skew g-Jensen divergence as "SGJD" and the Jensen-Bregman inequality (30) as "JB-inequality". The parameter α is in (0, 1) except in example 3), where we use a parameter β ∈ (0, 1) instead of α. In examples 1) to 6), we show cases in which the GBD belongs to a class of f-divergence, and example 7) treats the case of information geometry [2–4] and statistical mechanics.

1) KL-divergence, Jeffreys divergence, JS-divergence
Generating functions: $F(p) = \sum_i p_i \ln p_i$, $g(p) = p$
GBD: generalized KL-divergence $KL(p\|q) \equiv \sum_i \left(-p_i + q_i + p_i \ln\frac{p_i}{q_i}\right)$
SGBD: Jeffreys divergence (J-divergence) [10] $J(p, q) \equiv \sum_i (p_i - q_i) \ln\frac{p_i}{q_i}$
SGJD: scaled skew JS-divergence $\frac{1}{\alpha(1-\alpha)} JS_\alpha(p\|q) \equiv \frac{1}{\alpha(1-\alpha)}\Big((1 - \alpha)KL(p\|(1 - \alpha)p + \alpha q) + \alpha KL(q\|(1 - \alpha)p + \alpha q)\Big)$

JB-inequality: generalized Lin's inequality $\alpha(1 - \alpha)J(p, q) \ge JS_\alpha(p\|q)$. When α is equal to 1/2, this inequality is consistent with Lin's inequality [12].
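The generalized Lin's inequality can also be spot-checked numerically. The sketch below is ours (assuming NumPy and normalized distributions) and compares α(1 − α)J(p, q) with JS_α(p‖q) for random inputs.

import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(1)
for _ in range(5):
    p = rng.random(5); p /= p.sum()
    q = rng.random(5); q /= q.sum()
    a = rng.uniform(0.05, 0.95)
    m = (1 - a) * p + a * q
    jeffreys = kl(p, q) + kl(q, p)            # J(p, q) for normalized p, q
    js_a = (1 - a) * kl(p, m) + a * kl(q, m)  # skew JS-divergence JS_alpha(p||q)
    assert a * (1 - a) * jeffreys >= js_a - 1e-12  # generalized Lin's inequality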

2) KL-divergence, Jeffreys divergence, α-divergence
Generating functions: $F(p) = \sum_i \exp(p_i)$, $g(p) = \ln p$
GBD: generalized reverse KL-divergence $KL(q\|p) \equiv \sum_i \left(p_i - q_i + q_i \ln\frac{q_i}{p_i}\right)$
SGBD: J-divergence $J(p, q) \equiv \sum_i (p_i - q_i) \ln\frac{p_i}{q_i}$
SGJD: generalized α-divergence $D_{1-\alpha}(p\|q)$, where $D_\alpha(p\|q) \equiv \frac{1}{\alpha(\alpha - 1)} \sum_i \left(p_i^\alpha q_i^{1-\alpha} - \alpha p_i - (1 - \alpha) q_i\right)$
JB-inequality: $J(p, q) \ge D_\alpha(p\|q)$
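As an illustrative check of example 2) (our own sketch, assuming NumPy), evaluating the Bregman divergence of F(x) = Σ_i exp(x_i) at g(p) = ln p reproduces the generalized reverse KL-divergence, and the SGBD reproduces the J-divergence.

import numpy as np

rng = np.random.default_rng(2)
p, q = rng.random(4) + 0.1, rng.random(4) + 0.1

F = lambda x: np.sum(np.exp(x))   # generating function of example 2)
gradF = np.exp
gp, gq = np.log(p), np.log(q)     # g(p) = ln p

gbd = F(gp) - F(gq) - np.dot(gp - gq, gradF(gq))
assert np.isclose(gbd, np.sum(p - q + q * np.log(q / p)))   # generalized reverse KL-divergence

sgbd = gbd + F(gq) - F(gp) - np.dot(gq - gp, gradF(gp))
assert np.isclose(sgbd, np.sum((p - q) * np.log(p / q)))    # J-divergence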

3) α-divergence
Generating functions: $F(p) = \frac{1}{1-\alpha} \sum_i p_i^{1/\alpha}$, $g(p) = p^\alpha$ for α ∈ R \ {0, 1}
GBD: generalized α-divergence $D_\alpha(p\|q)$
SGBD: symmetric α-divergence $D_{\alpha,sym}(p, q) \equiv D_\alpha(p\|q) + D_\alpha(q\|p)$
SGJD: $\frac{1}{\beta(1-\beta)(1-\alpha)} \sum_i \left((1 - \beta)p_i + \beta q_i - \left((1 - \beta)p_i^\alpha + \beta q_i^\alpha\right)^{1/\alpha}\right)$
JB-inequality: $D_{\alpha,sym}(p, q) \ge \frac{1}{\beta(1-\beta)(1-\alpha)} \sum_i \left((1 - \beta)p_i + \beta q_i - \left((1 - \beta)p_i^\alpha + \beta q_i^\alpha\right)^{1/\alpha}\right)$
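Example 3) can be verified in the same way. The sketch below (ours, assuming NumPy; the choices α = 1.5 and β = 0.3 are arbitrary) checks that the GBD coincides with the generalized α-divergence and that the JB-inequality holds.

import numpy as np

def d_alpha(p, q, a):
    # generalized alpha-divergence of example 2)
    return np.sum(p ** a * q ** (1 - a) - a * p - (1 - a) * q) / (a * (a - 1))

a, b = 1.5, 0.3
rng = np.random.default_rng(3)
p, q = rng.random(4) + 0.1, rng.random(4) + 0.1

F = lambda x: np.sum(x ** (1 / a)) / (1 - a)         # generating function of example 3)
gradF = lambda x: x ** (1 / a - 1) / (a * (1 - a))
gp, gq = p ** a, q ** a                              # g(p) = p^alpha

gbd = F(gp) - F(gq) - np.dot(gp - gq, gradF(gq))
assert np.isclose(gbd, d_alpha(p, q, a))             # GBD = generalized alpha-divergence

d_sym = d_alpha(p, q, a) + d_alpha(q, p, a)
sgjd = np.sum((1 - b) * p + b * q - ((1 - b) * p ** a + b * q ** a) ** (1 / a)) / (b * (1 - b) * (1 - a))
assert d_sym >= sgjd - 1e-12                         # JB-inequality of example 3)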

By using the equations $\frac{1}{2} D_{\frac{1}{2}}(p\|q) = H^2(p, q)$, $2 D_{2}(p\|q) = D_{\chi^2,P}(p\|q)$ and $2 D_{-1}(p\|q) = D_{\chi^2,N}(p\|q)$ [8], we obtain examples 4) to 6).
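These identities are easy to confirm with the generalized α-divergence of example 2); the following sketch (ours, assuming NumPy and positive, not necessarily normalized, vectors) does so.

import numpy as np

def d_alpha(p, q, a):
    return np.sum(p ** a * q ** (1 - a) - a * p - (1 - a) * q) / (a * (a - 1))

rng = np.random.default_rng(4)
p, q = rng.random(5) + 0.1, rng.random(5) + 0.1

hell2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
assert np.isclose(0.5 * d_alpha(p, q, 0.5), hell2)                    # (1/2) D_{1/2} = H^2
assert np.isclose(2 * d_alpha(p, q, 2.0), np.sum((p - q) ** 2 / q))   # 2 D_2 = Pearson chi-square
assert np.isclose(2 * d_alpha(p, q, -1.0), np.sum((p - q) ** 2 / p))  # 2 D_{-1} = Neyman chi-square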

4) Hellinger distance
Generating functions: $F(p) = \sum_i p_i^2$, $g(p) = \sqrt{p}$
GBD: Hellinger distance $H^2(p, q) \equiv \sum_i (\sqrt{p_i} - \sqrt{q_i})^2$
SGBD: Hellinger distance $2 H^2(p, q)$
SGJD: Hellinger distance $H^2(p, q)$
JB-inequality: trivial
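A notable feature of example 4) is that the SGJD equals H²(p, q) for every α, which the following sketch (ours, assuming NumPy) confirms.

import numpy as np

rng = np.random.default_rng(5)
p, q = rng.random(5) + 0.1, rng.random(5) + 0.1
hell2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

F = lambda x: np.sum(x ** 2)        # generating function of example 4)
gp, gq = np.sqrt(p), np.sqrt(q)     # g(p) = sqrt(p)
for a in (0.1, 0.5, 0.9):
    sgjd = ((1 - a) * F(gp) + a * F(gq) - F((1 - a) * gp + a * gq)) / (a * (1 - a))
    assert np.isclose(sgjd, hell2)  # SGJD = H^2 independently of alpha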

5) Pearson χ-square divergence
Generating functions: $F(p) = -2 \sum_i \sqrt{p_i}$, $g(p) = p^2$
GBD: Pearson χ-square divergence $D_{\chi^2,P}(p\|q) \equiv \sum_i \frac{(p_i - q_i)^2}{q_i}$
SGBD: symmetric χ-square divergence $D_{\chi^2,sym}(p, q) \equiv D_{\chi^2,P}(p\|q) + D_{\chi^2,P}(q\|p)$
SGJD: $\frac{2}{\alpha(1-\alpha)} \sum_i \left(-(1 - \alpha)p_i - \alpha q_i + \sqrt{(1 - \alpha)p_i^2 + \alpha q_i^2}\right)$
JB-inequality: $D_{\chi^2,sym}(p, q) \ge \frac{2}{\alpha(1-\alpha)} \sum_i \left(-(1 - \alpha)p_i - \alpha q_i + \sqrt{(1 - \alpha)p_i^2 + \alpha q_i^2}\right)$
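The closed form of the GBD in example 5) and the corresponding JB-inequality can be checked as follows (our sketch, assuming NumPy).

import numpy as np

rng = np.random.default_rng(6)
p, q = rng.random(5) + 0.1, rng.random(5) + 0.1
a = 0.4

F = lambda x: -2 * np.sum(np.sqrt(x))   # generating function of example 5)
gradF = lambda x: -1 / np.sqrt(x)
gp, gq = p ** 2, q ** 2                 # g(p) = p^2

gbd = F(gp) - F(gq) - np.dot(gp - gq, gradF(gq))
assert np.isclose(gbd, np.sum((p - q) ** 2 / q))   # Pearson chi-square divergence

sgbd = gbd + F(gq) - F(gp) - np.dot(gq - gp, gradF(gp))
sgjd = 2 * np.sum(-(1 - a) * p - a * q + np.sqrt((1 - a) * p ** 2 + a * q ** 2)) / (a * (1 - a))
assert sgbd >= sgjd - 1e-12                        # JB-inequality of example 5)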

6) Neyman χ-square divergence
Generating functions: $F(p) = \sum_i p_i^{-1}$, $g(p) = p^{-1}$
GBD: Neyman χ-square divergence $D_{\chi^2,N}(p\|q) \equiv \sum_i \frac{(p_i - q_i)^2}{p_i}$
SGBD: symmetric χ-square divergence $D_{\chi^2,sym}(p, q)$
SGJD: $\frac{1}{\alpha(1-\alpha)} \sum_i \left((1 - \alpha)p_i + \alpha q_i - \frac{p_i q_i}{(1 - \alpha)q_i + \alpha p_i}\right)$
JB-inequality: $D_{\chi^2,sym}(p, q) \ge \frac{1}{\alpha(1-\alpha)} \sum_i \left((1 - \alpha)p_i + \alpha q_i - \frac{p_i q_i}{(1 - \alpha)q_i + \alpha p_i}\right)$
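An analogous check for example 6) (our sketch, assuming NumPy):

import numpy as np

rng = np.random.default_rng(7)
p, q = rng.random(5) + 0.1, rng.random(5) + 0.1
a = 0.6

F = lambda x: np.sum(1 / x)       # generating function of example 6)
gradF = lambda x: -1 / x ** 2
gp, gq = 1 / p, 1 / q             # g(p) = 1/p

gbd = F(gp) - F(gq) - np.dot(gp - gq, gradF(gq))
assert np.isclose(gbd, np.sum((p - q) ** 2 / p))   # Neyman chi-square divergence

sgbd = np.sum((p - q) ** 2 / q) + np.sum((p - q) ** 2 / p)
sgjd = np.sum((1 - a) * p + a * q - p * q / ((1 - a) * q + a * p)) / (a * (1 - a))
assert sgbd >= sgjd - 1e-12                        # JB-inequality of example 6)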

7) Information geometry and statistical mechanics
In a dually flat space M, there exist two affine coordinates $\theta^i$, $\eta_i$ and two convex functions called potentials $\psi(\theta)$, $\phi(\eta)$. For points P, Q ∈ M, we put $p_i = \theta^i(P)$, $q_i = \theta^i(Q)$ or $p_i = \eta_i(P)$, $q_i = \eta_i(Q)$.

Generating functions: $F(p) = \psi(\theta)$ or $F(p) = \phi(\eta)$, $g(p) = p$
GBD: canonical divergence $D(P\|Q) \equiv \phi(\eta(Q)) + \psi(\theta(P)) - \sum_i \eta_i(Q)\theta^i(P)$
SGBD: $D_A(P, Q) \equiv \sum_i (\eta_i(Q) - \eta_i(P))(\theta^i(Q) - \theta^i(P))$
SGJD: $sJ_{\psi,\alpha}(\theta(P), \theta(Q))$ or $sJ_{\phi,\alpha}(\eta(P), \eta(Q))$
JB-inequality: $D_A(P, Q) \ge sJ_{\psi,\alpha}(\theta(P), \theta(Q))$ or $D_A(P, Q) \ge sJ_{\phi,\alpha}(\eta(P), \eta(Q))$
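As a concrete instance of example 7), the following sketch (our own illustration, assuming NumPy) takes the Bernoulli family as a one-dimensional dually flat space, with θ = ln(μ/(1 − μ)), ψ(θ) = ln(1 + e^θ), η = μ and ϕ(η) = η ln η + (1 − η) ln(1 − η), and checks the JB-inequality in both coordinate systems.

import numpy as np

psi = lambda th: np.log1p(np.exp(th))                          # potential in theta-coordinates
phi = lambda et: et * np.log(et) + (1 - et) * np.log(1 - et)   # dual potential in eta-coordinates

def s_jensen(F, x, y, a):
    # scaled skew Jensen divergence, Eq. (8)
    return ((1 - a) * F(x) + a * F(y) - F((1 - a) * x + a * y)) / (a * (1 - a))

mu_p, mu_q, a = 0.2, 0.7, 0.3
th_p, th_q = np.log(mu_p / (1 - mu_p)), np.log(mu_q / (1 - mu_q))

# SGBD: D_A(P, Q) = (eta(Q) - eta(P)) (theta(Q) - theta(P))
d_a = (mu_q - mu_p) * (th_q - th_p)

assert d_a >= s_jensen(psi, th_p, th_q, a) - 1e-12   # JB-inequality in theta-coordinates
assert d_a >= s_jensen(phi, mu_p, mu_q, a) - 1e-12   # JB-inequality in eta-coordinates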

For the exponential or mixture families, the GBD is equal to the KL-divergence and the SGBD is equal to the J-divergence. For the exponential families, the SGJD $sJ_{\psi,\alpha}(\theta(P), \theta(Q))$ is equal to the skew Bhattacharyya distance [5, 14], and for the mixture families, the SGJD $sJ_{\phi,\alpha}(\eta(P), \eta(Q))$ is equal to the skew JS-divergence (see [15] for details).
In the case of the canonical ensemble in statistical mechanics, the probability distribution function belongs to the exponential families. Because the manifold of the exponential family is a dually flat space, the same equations and inequality hold for
$$\theta = -\beta \equiv -\frac{1}{k_B T} \qquad (33)$$
$$\eta = U \qquad (34)$$
$$\psi = -\beta F \qquad (35)$$
$$\phi = -\frac{1}{k_B} S, \qquad (36)$$
where $k_B$ is the Boltzmann constant, T is the temperature, U is the internal energy, S is the entropy and F is the Helmholtz free energy.
Defining a parameter α as an index of the state which satisfies $\frac{1}{T_\alpha} = (1 - \alpha)\frac{1}{T_0} + \alpha\frac{1}{T_1}$ or $U_\alpha = (1 - \alpha)U_0 + \alpha U_1$, the JB-inequality can be written as follows.
$$\alpha(1 - \alpha)\frac{(T_1 - T_0)(U_1 - U_0)}{T_0 T_1} \ge \frac{F_\alpha}{T_\alpha} - (1 - \alpha)\frac{F_0}{T_0} - \alpha\frac{F_1}{T_1} \qquad (37)$$
$$\alpha(1 - \alpha)\frac{(T_1 - T_0)(U_1 - U_0)}{T_0 T_1} \ge S_\alpha - (1 - \alpha)S_0 - \alpha S_1 \qquad (38)$$

4. Conclusion

We have introduced the g-Bregman divergence and the skew g-Jensen divergence by extending the definitions of the Bregman and the skew Jensen divergence, respectively. We have shown that they satisfy some properties similar to the Bregman and the skew Jensen divergence. Furthermore, we have derived an inequality between the symmetric g-Bregman divergence and the skew g-Jensen divergence, and we have shown that these classes include divergences which belong to a class of f-divergence. If the relationship between these three classes of divergences is studied in more detail, applications to various fields including machine learning are expected to follow.

References

[1] Syed Mumtaz Ali and Samuel D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological), pages 131–142, 1966.
[2] Shun-ichi Amari. Information geometry and its applications. Springer, 2016.
[3] Shun-ichi Amari and Andrzej Cichocki. Information geometry of divergence functions. Bulletin of the Polish Academy of Sciences: Technical Sciences, 58(1):183–195, 2010.
[4] Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191. American Mathematical Soc., 2007.
[5] Anil Bhattacharyya. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc., 35:99–109, 1943.
[6] Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
[7] Jacob Burbea and C. Radhakrishna Rao. On the convexity of some divergence measures based on entropy functions. Technical report, Pittsburgh Univ. PA Inst. for Statistics and Applications, 1980.
[8] Andrzej Cichocki and Shun-ichi Amari. Families of alpha-, beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532–1568, 2010.
[9] Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.
[10] Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. A, 186(1007):453–461, 1946.
[11] Solomon Kullback. Information theory and statistics. Courier Corporation, 1997.
[12] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
[13] Frank Nielsen. A family of statistical symmetric divergences based on Jensen's inequality. arXiv preprint arXiv:1009.4004, 2010.

[14] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, 2011.
[15] Tomohiro Nishiyama. Divergence functions in dually flat spaces and their properties. 2018.
[16] Simeon Reich and Shoham Sabach. Two strong convergence theorems for Bregman strongly nonexpansive operators in reflexive Banach spaces. Nonlinear Analysis: Theory, Methods & Applications, 73(1):122–135, 2010.