Information Geometry. DOI 10.1007/s41884-018-0002-8

RESEARCH PAPER

Information geometry connecting Wasserstein distance and Kullback–Leibler divergence via the entropy-relaxed transportation problem

Shun-ichi Amari1,3 · Ryo Karakida2 · Masafumi Oizumi1,3

Received: 5 October 2017 / Revised: 9 February 2018 © The Author(s) 2018

Abstract Two geometrical structures have been extensively studied for a manifold of probability distributions. One is based on the Fisher information metric, which is invariant under reversible transformations of random variables, while the other is based on the Wasserstein distance of optimal transportation, which reflects the structure of the distance between underlying random variables. Here, we propose a new information-geometrical theory that provides a unified framework connecting the Wasserstein distance and the Kullback–Leibler (KL) divergence. We primarily considered a discrete case consisting of $n$ elements and studied the geometry of the probability simplex $S_{n-1}$, which is the set of all probability distributions over $n$ elements. The Wasserstein distance was introduced in $S_{n-1}$ by the optimal transportation of commodities from distribution $p$ to distribution $q$, where $p, q \in S_{n-1}$. We relaxed the optimal transportation by using entropy, which was introduced by Cuturi. The optimal solution was called the entropy-relaxed stochastic transportation plan. The entropy-relaxed optimal cost $C(p, q)$ was computationally much less demanding than the original Wasserstein distance but does not define a distance because it is not minimized at $p = q$. To define a proper divergence while retaining the computational advantage, we first introduced a divergence function in the manifold $S_{n-1} \times S_{n-1}$ composed of all optimal transportation plans. We fully explored the information geometry of the manifold of the optimal transportation plans and subsequently constructed a new one-parameter family of divergences in $S_{n-1}$ that are related to both the Wasserstein distance and the KL-divergence.

Corresponding author: Shun-ichi Amari, [email protected]

1 RIKEN Brain Science Institute, 2–1 Hirosawa, Wako-shi, Saitama 351-0198, Japan
2 National Institute of Advanced Industrial Science and Technology (AIST), 2-3-26 Aomi, Koto-ku, Tokyo 135-0064, Japan
3 Araya Inc., 2-8-10 Toranomon, Minato-ku, Tokyo 105-0001, Japan

Keywords Wasserstein distance · Kullback–Leibler divergence · Optimal transportation · Information geometry

1 Introduction

Information geometry [1] studies the properties of a manifold of probability distributions and is useful for various applications in statistics, machine learning, signal processing, and optimization. Two geometrical structures have been introduced from two distinct backgrounds. One is based on the invariance principle, where the geometry is invariant under reversible transformations of random variables. The Fisher information matrix, for example, is a unique invariant Riemannian metric from the invariance principle [1–3]. Moreover, two dually coupled affine connections are used as invariant connections [1,4], which are useful in various applications. The other geometrical structure was introduced through the transportation problem, where one distribution of commodities is transported to another distribution. The minimum transportation cost defines a distance between the two distributions, which is called the Wasserstein, Kantorovich or earth-mover distance [5,6]. This structure provides a tool to study the geometry of distributions by taking the metric of the supporting manifold into account.

Let $\chi = \{1, \ldots, n\}$ be the support of a probability measure $p$. The invariant geometry provides a structure that is invariant under permutations of elements of $\chi$ and results in an efficient estimator in statistical estimation. On the other hand, when we consider a picture over $n^2$ pixels $\chi = \{(i, j);\ i, j = 1, \ldots, n\}$ and regard it as a distribution over $\chi$, the pixels have a proper distance structure in $\chi$. Spatially close pixels tend to take similar values. A permutation of $\chi$ destroys such a neighboring structure, suggesting that the invariance might not play a useful role. The Wasserstein distance takes such a structure into account and is therefore useful for problems with a metric structure in the support $\chi$ (see, e.g., [7–9]).

An interesting question is how these two geometrical structures are related. While both are important in their own respects, it would be intriguing to construct a unified framework that connects the two. With this purpose in mind, we examined the discrete case over $n$ elements, such that a probability distribution is given by a probability vector $p = (p_1, \ldots, p_n)$ in the probability simplex

$$S_{n-1} = \left\{ p \;\middle|\; p_i > 0,\ \sum_i p_i = 1 \right\}. \tag{1}$$

It is easy to naively extend our theory to distributions over $\mathbb{R}^n$, ignoring mathematical difficulties of the geometry of function spaces; see Ay et al. [4] for details. We consider Gaussian distributions over the one-dimensional real line $X$ as an example of the continuous case.

Cuturi modified the transportation problem such that the cost is minimized under an entropy constraint [7]. This is called the entropy-relaxed optimal transportation problem and is computationally less demanding than the original transportation problem. In addition to the advantage in computational cost, Cuturi showed that the quasi-distance defined by the entropy-relaxed optimal solution yields superior results in many applications compared to the original Wasserstein distance and information-geometric divergences such as the KL divergence.

We followed the entropy-relaxed framework that Cuturi et al. proposed [7–9] and introduced a Lagrangian function, which is a linear combination of the transportation cost and entropy. Given a distribution $p$ of commodities on the sender's side and $q$ on the receiver's side, the constrained optimal transportation plan is the minimizer of the Lagrangian function. The minimum value $C(p, q)$ is a function of $p$ and $q$, which we called the Cuturi function. However, this does not define a distance between $p$ and $q$ because it is non-zero at $p = q$ and is not minimized when $p = q$.

To define a proper distance-like function in $S_{n-1}$, we introduced a divergence between $p$ and $q$ derived from the optimal transportation plan. A divergence is a general metric concept that includes the square of a distance but is more flexible, allowing asymmetry between $p$ and $q$. A manifold equipped with a divergence yields a Riemannian metric with a pair of dual affine connections. Dually coupled geodesics are defined, which possess remarkable properties, generalizing the Riemannian geometry [1].

We studied the geometry of the entropy-relaxed optimal transportation plans within the framework of information geometry. They form an exponential family of probability distributions defined in the product manifold $S_{n-1} \times S_{n-1}$. Therefore, a dually flat structure was introduced. The m-flat coordinates are the expectation parameters $(p, q)$ and their dual, e-flat coordinates (canonical parameters), are $(\alpha, \beta)$, which are assigned from the minimax duality of nonlinear optimization problems. We can naturally define a canonical divergence, that is, the KL divergence $\mathrm{KL}[(p, q) : (p', q')]$ between the two optimal transportation plans for $(p, q)$ and $(p', q')$, sending $p$ to $q$ and $p'$ to $q'$, respectively.

To define a divergence from $p$ to $q$ in $S_{n-1}$, we used a reference distribution $r$. Given $r$, we defined a divergence between $p$ and $q$ by $\mathrm{KL}[(r, p) : (r, q)]$. There are a number of potential choices for $r$: one is to use $r = p$ and another is to use the arithmetic or geometric mean of $p$ and $q$. These options yield one-parameter families of divergences connecting the Wasserstein distance and the KL-divergence. Our work uncovers a novel direction for studying the geometry of a manifold of probability distributions by integrating the Wasserstein distance and KL divergence.

2 Entropy-constrained transportation problem

Let us consider $n$ terminals $\chi = (X_1, \ldots, X_n)$, some of which, say $X_1, \ldots, X_s$, are sending terminals at which $p_1, \ldots, p_s$ ($p_i > 0$) amounts of commodities are stocked. At the other terminals, $X_{s+1}, \ldots, X_n$, no commodities are stocked ($p_i = 0$). These are transported within $\chi$ such that $q_1, \ldots, q_r$ amounts are newly stored at the receiving terminals $X_{j_1}, \ldots, X_{j_r}$. There may be overlap in the sending and receiving terminals, $\chi_S = \{X_1, \ldots, X_s\}$ and $\chi_R = \{X_{j_1}, \ldots, X_{j_r}\}$, including the case that $\chi_R = \chi_S$ (Fig. 1). We normalized the total amount of commodities to be equal to 1 so that $p = (p_1, \ldots, p_s)$ and $q = (q_1, \ldots, q_r)$ can be regarded as probability distributions in the probability simplices $S_{s-1}$ and $S_{r-1}$, respectively,

$$\sum_i p_i = 1, \quad \sum_j q_j = 1, \quad p_i > 0, \quad q_j > 0. \tag{2}$$

Fig. 1 Transportation from the sending terminals $\chi_S$ to the receiving terminals $\chi_R$

Let $\bar{S}_{n-1}$ be the closure of the probability simplex $S_{n-1}$ over $\chi$,

$$\bar{S}_{n-1} = \left\{ r \;\middle|\; r_i \ge 0,\ \sum_i r_i = 1 \right\}. \tag{3}$$

Then $S_{s-1} \subset \bar{S}_{n-1}$ and $S_{r-1} \subset \bar{S}_{n-1}$.

It should be noted that if some components of $p$ and $q$ are allowed to be 0, we do not need to treat $\chi_S$ and $\chi_R$ separately, i.e., we can consider both $\chi_S$ and $\chi_R$ to be equal to $\chi$. Under such a situation, we simply considered both $p$ and $q$ as elements of $\bar{S}_{n-1}$.

We considered a transportation plan $P = (P_{ij})$ denoted by an $s \times r$ matrix, where $P_{ij} \ge 0$ is the amount of commodity transported from $X_i \in \chi_S$ to $X_j \in \chi_R$. The plan $P$ was regarded as a (probability) distribution of commodities flowing from $X_i$ to $X_j$, satisfying the sender's and receiver's conditions,

$$\sum_j P_{ij} = p_i, \quad \sum_i P_{ij} = q_j, \quad \sum_{ij} P_{ij} = 1. \tag{4}$$

We denoted the set of $P$ satisfying Eq. (4) as $U(p, q)$. Let $M = (m_{ij})$ be the cost matrix, where $m_{ij} \ge 0$ denotes the cost of transporting one unit of commodity from $X_i$ to $X_j$. We can interpret $m_{ij}$ as a generalized distance between $X_i$ and $X_j$. The transportation cost of plan $P$ is

$$C(P) = \langle M, P \rangle = \sum_{ij} m_{ij} P_{ij}. \tag{5}$$

The Wasserstein distance between $p$ and $q$ is the minimum cost of transporting commodities distributed by $p$ on the sender's side to $q$ on the receiver's side,

$$C_W(p, q) = \min_{P \in U(p, q)} \langle M, P \rangle, \tag{6}$$

where the minimum is taken over all $P$ satisfying the constraints in Eq. (4) [5,6].

We considered the joint entropy of $P$,

$$H(P) = -\sum_{ij} P_{ij} \log P_{ij}. \tag{7}$$

Given marginal distributions $p$ and $q$, the plan that maximizes the entropy is the direct product of $p$ and $q$,

$$P_D = p \otimes q = \left( p_i q_j \right). \tag{8}$$

This is because the entropy of $P_D$,

$$H(P_D) = -\sum_{ij} P_{D\,ij} \log P_{D\,ij} = H(p) + H(q), \tag{9}$$

is the maximum among all possible $P$ belonging to $U(p, q)$, i.e.,

$$H(P) \le H(p) + H(q) = H(P_D), \tag{10}$$

where $H(P)$, $H(p)$ and $H(q)$ are the entropies of the respective distributions. We consider the constrained problem of searching for the $P$ that minimizes $\langle M, P \rangle$ under the constraint $H(P) \ge \mathrm{const}$. This is equivalent to imposing the condition that $P$ lies within a KL-divergence ball centered at $P_D$,

$$\mathrm{KL}[P : P_D] \le d \tag{11}$$

for constant $d$, because the KL-divergence from plan $P$ to $P_D$ is

$$\mathrm{KL}[P : P_D] = \sum_{ij} P_{ij} \log \frac{P_{ij}}{p_i q_j} = -H(P) + H(p) + H(q). \tag{12}$$

The entropy of $P$ increases within the ball as $d$ increases. Therefore, this is equivalent to the entropy-constrained problem that minimizes a linear combination of the transportation cost $\langle M, P \rangle$ and the entropy $H(P)$,

$$F_\lambda(P) = \langle M, P \rangle - \lambda H(P) \tag{13}$$

for constant $\lambda$ [7]. Here, $\lambda$ is regarded as a Lagrangian multiplier for the entropy constraint, and $\lambda$ becomes smaller as $d$ becomes larger.
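The equivalence in Eqs. (11)–(13) is easy to check numerically. The following sketch is our own illustration, not from the paper; the $3 \times 3$ plan and the cost matrix are arbitrary choices. It verifies Eq. (12) and evaluates the entropy-relaxed objective of Eq. (13).

```python
import numpy as np

rng = np.random.default_rng(0)

# random 3x3 plan; its marginals p, q are read off after normalization
P = rng.random((3, 3))
P /= P.sum()
p, q = P.sum(axis=1), P.sum(axis=0)
P_D = np.outer(p, q)                        # maximum-entropy plan p (x) q of Eq. (8)

H = lambda A: -np.sum(A * np.log(A))        # Shannon entropy (all entries > 0 here)
kl = np.sum(P * np.log(P / P_D))            # KL[P : P_D]

assert np.isclose(kl, -H(P) + H(p) + H(q))  # Eq. (12)

# entropy-relaxed objective of Eq. (13) for a hypothetical |i - j| cost matrix
M = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
lam = 0.1
print(np.sum(M * P) - lam * H(P))           # F_lambda(P)
```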

3 Solution to the entropy-constrained problem: Cuturi function

Let us fix $\lambda$ as a parameter controlling the magnitude of the entropy or the size of the KL-ball. When $P$ satisfies the constraints in Eq. (4), minimization of Eq. (13) is formulated in the Lagrangian form by using Lagrangian multipliers $\alpha_i$, $\beta_j$,

$$L_\lambda(P) = \frac{1}{1+\lambda} \langle M, P \rangle - \frac{\lambda}{1+\lambda} H(P) - \sum_{i,j} \left( \alpha_i + \beta_j \right) P_{ij}, \tag{14}$$

where $\lambda$ in (13) is slightly modified. By differentiating Eq. (14) with respect to $P_{ij}$, we have

$$\frac{1+\lambda}{\lambda} \frac{\partial}{\partial P_{ij}} L_\lambda(P) = \frac{1}{\lambda} m_{ij} + \log P_{ij} - \frac{1+\lambda}{\lambda} \left( \alpha_i + \beta_j \right) + 1. \tag{15}$$

By setting the above derivatives equal to 0, we have the following solution,

$$P_{ij} \propto \exp \left\{ -\frac{m_{ij}}{\lambda} + \frac{1+\lambda}{\lambda} \left( \alpha_i + \beta_j \right) \right\}. \tag{16}$$

Let us put

$$K_{ij} = \exp\left( -\frac{m_{ij}}{\lambda} \right), \tag{17}$$

$$a_i = \exp\left( \frac{1+\lambda}{\lambda} \alpha_i \right), \quad b_j = \exp\left( \frac{1+\lambda}{\lambda} \beta_j \right). \tag{18}$$

Then, the optimal solution is written as

$$P^*_{ij} = c\, a_i b_j K_{ij}, \tag{19}$$

where $a_i$ and $b_j$ are positive and correspond to the Lagrangian multipliers $\alpha_i$ and $\beta_j$, to be determined from the constraints [Eq. (4)], and $c$ is the normalization constant. Since the $r + s$ constraints [Eq. (4)] are not independent because of the conditions $\sum p_i = 1$ and $\sum q_j = 1$, we can use $b_r = 1$. Further, we noted that $\mu a$ and $b/\mu$ yield the same answer for any $\mu > 0$, where $a = (a_i)$ and $b = (b_j)$. Therefore, the degrees of freedom of $a$ and $b$ are $s - 1$ and $r - 1$, respectively. We can choose $a$ and $b$ such that they satisfy

$$\sum_i a_i = 1, \quad \sum_j b_j = 1. \tag{20}$$

Then, $a$ and $b$ are included in $S_{s-1}$ and $S_{r-1}$, respectively. We have the theorem below.

Theorem 1 The optimal transportation plan $P^*_\lambda$ is given by

$$P^*_{\lambda\, ij} = c\, a_i b_j K_{ij}, \tag{21}$$

$$c = \frac{1}{\sum_{ij} a_i b_j K_{ij}}, \tag{22}$$

where the two vectors $a$ and $b$ are determined from $p$ and $q$ using Eq. (4).

We have a generalized cost function of transporting $p$ to $q$ based on the entropy-constrained optimal plan $P^*_\lambda(p, q)$:

$$C_\lambda(p, q) = \frac{1}{1+\lambda} \left\langle M, P^*_\lambda \right\rangle - \frac{\lambda}{1+\lambda} H\!\left( P^*_\lambda \right). \tag{23}$$

We called it the Cuturi function because extensive studies have been conducted by Cuturi and colleagues [7–9]. The function has been used in various applications as a measure of discrepancy between p and q. The following theorem holds for the Cuturi function:

Theorem 2 The Cuturi function $C_\lambda(p, q)$ is a convex function of $(p, q)$.

Proof Let $P^*_1$ and $P^*_2$ be the optimal solutions of the transportation problems $(p_1, q_1)$ and $(p_2, q_2)$, respectively. For a scalar $0 \le \nu \le 1$, we use

$$\bar{P} = \nu P^*_1 + (1 - \nu) P^*_2. \tag{24}$$

We have

$$\begin{aligned} \nu C_\lambda(p_1, q_1) + (1-\nu) C_\lambda(p_2, q_2) &= \frac{1}{1+\lambda} \left\{ \nu \langle M, P^*_1 \rangle + (1-\nu) \langle M, P^*_2 \rangle \right\} - \frac{\lambda}{1+\lambda} \left\{ \nu H(P^*_1) + (1-\nu) H(P^*_2) \right\} \\ &\ge \frac{1}{1+\lambda} \langle M, \bar{P} \rangle - \frac{\lambda}{1+\lambda} H(\bar{P}), \end{aligned} \tag{25}$$

because $H(P)$ is a concave function of $P$. We further have

$$\frac{1}{1+\lambda} \langle M, \bar{P} \rangle - \frac{\lambda}{1+\lambda} H(\bar{P}) \ge \min_P \left\{ \frac{1}{1+\lambda} \langle M, P \rangle - \frac{\lambda}{1+\lambda} H(P) \right\} = C_\lambda\!\left( \nu p_1 + (1-\nu) p_2,\ \nu q_1 + (1-\nu) q_2 \right), \tag{26}$$

since the minimum is taken over $P$ transporting commodities from $\nu p_1 + (1-\nu) p_2$ to $\nu q_1 + (1-\nu) q_2$. Hence, the convexity of $C_\lambda$ is proven.

When $\lambda \to 0$, $C_\lambda(p, q)$ converges to the original Wasserstein distance $C_W(p, q)$. However, it does not satisfy important requirements for a "distance". When $p = q$, $C_\lambda$ is not equal to 0 and is not minimized, i.e., there are some $q$ ($\ne p$) that yield a smaller $C_\lambda$ than $q = p$:

$$C_\lambda(p, p) > C_\lambda(p, q). \tag{27}$$
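For concreteness, the optimal plan of Theorem 1 and the Cuturi function of Eq. (23) can be computed by alternating scaling iterations of the Sinkhorn type (cf. Sect. 7.4). The sketch below is our illustration, not the authors' code; the cost matrix, marginals, $\lambda$ and iteration count are arbitrary assumptions. It also shows that $C_\lambda(p, p)$ is not zero, so $C_\lambda$ is not a distance.

```python
import numpy as np

def cuturi(p, q, M, lam, n_iter=2000):
    """Return (optimal plan of Theorem 1, Cuturi function of Eq. (23))."""
    K = np.exp(-M / lam)                   # Eq. (17)
    a = np.ones_like(p)
    for _ in range(n_iter):                # alternately enforce the constraints of Eq. (4)
        b = q / (K.T @ a)
        a = p / (K @ b)
    P = a[:, None] * K * b[None, :]         # Eq. (21); the constant c is absorbed into a, b
    H = -np.sum(P * np.log(P))
    return P, (np.sum(M * P) - lam * H) / (1.0 + lam)   # Eq. (23)

M = np.abs(np.subtract.outer(np.arange(3.0), np.arange(3.0)))   # |i - j| ground cost
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
P_star, C_pq = cuturi(p, q, M, lam=0.1)
_, C_pp = cuturi(p, p, M, lam=0.1)
print(np.allclose(P_star.sum(1), p), np.allclose(P_star.sum(0), q))
print(C_pq, C_pp)                           # C_lambda(p, p) != 0: not a distance
```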

4 Geometry of optimal transportation plans

We first showed that the set of optimal transportation plans forms an exponential family embedded within the manifold of all transportation plans. Then, we studied the invariant geometry induced within these plans. A transportation plan $P$ is a probability distribution over branches $(i, j)$ connecting terminals $X_i \in \chi_S$ and $X_j \in \chi_R$. Let $x$ denote the branches $(i, j)$. We used the delta function $\delta_{ij}(x)$, which is 1 when $x$ is $(i, j)$ and 0 otherwise. Then, $P$ is written as a probability distribution of the random variable $x$,

$$P(x) = \sum_{i,j} P_{ij} \delta_{ij}(x). \tag{28}$$

By introducing new parameters

$$\theta_{ij} = \log \frac{P_{ij}}{P_{sr}}, \quad \theta = \left( \theta_{ij} \right), \tag{29}$$

it is rewritten in parameterized form as

$$P(x, \theta) = \exp \left\{ \sum_{i,j} \theta_{ij} \delta_{ij}(x) + \log P_{sr} \right\}. \tag{30}$$

This shows that the set of transportation plans is an exponential family, where $\theta_{ij}$ are the canonical parameters and $\eta_{ij} = P_{ij}$ are the expectation parameters. They form an $(sr - 1)$-dimensional manifold denoted by $S_{TP}$, because $\theta_{sr} = 0$. The transportation problem is related to various problems in information theory, such as rate-distortion theory. We provide detailed studies of the transportation plans in the information-geometric framework in Sect. 7, but here we introduce the manifold of the optimal transportation plans, which are determined by the sender's and receiver's probability distributions $p$ and $q$. The optimal transportation plan specified by $(\alpha, \beta)$ in Eq. (16) is written as

$$P_\lambda(x, \alpha, \beta) = \exp \left\{ \sum_{i,j} \left( \frac{1+\lambda}{\lambda} \left( \alpha_i + \beta_j \right) - \frac{m_{ij}}{\lambda} \right) \delta_{ij}(x) - \frac{1+\lambda}{\lambda} \psi_\lambda \right\}. \tag{31}$$

The notation $\psi_\lambda$ is a normalization factor called the potential function, which is defined by

$$\psi_\lambda(\alpha, \beta) = -\frac{\lambda}{1+\lambda} \log c, \tag{32}$$

where $c$ is calculated by taking the summation over all $x$,

$$c = \left[ \sum_{x \in (\chi_S, \chi_R)} \exp \left\{ \sum_{i,j} \left( \frac{1+\lambda}{\lambda} \left( \alpha_i + \beta_j \right) - \frac{m_{ij}}{\lambda} \right) \delta_{ij}(x) \right\} \right]^{-1}. \tag{33}$$

This corresponds to the free energy in physics. Using

$$\theta_{ij} = \frac{1+\lambda}{\lambda} \left( \alpha_i + \beta_j \right) - \frac{m_{ij}}{\lambda}, \tag{34}$$

we see that the set $S_{OPT,\lambda}$ of the optimal transportation plans is a submanifold of $S_{TP}$. Because Eq. (34) is linear in $\alpha$ and $\beta$, $S_{OPT,\lambda}$ itself is an exponential family, where the canonical parameters are $(\alpha, \beta)$ and the expectation parameters are $(p, q) \in S_{s-1} \times S_{r-1}$. This is confirmed by

$$E\left[ \sum_j \delta_{ij}(x) \right] = p_i, \tag{35}$$

$$E\left[ \sum_i \delta_{ij}(x) \right] = q_j, \tag{36}$$

where $E$ denotes the expectation. Because $p \in S_{s-1}$ and $q \in S_{r-1}$, $S_{OPT,\lambda}$ is an $(r + s - 2)$-dimensional dually flat manifold. We can use $\alpha_s = \beta_r = 0$ without loss of generality, which corresponds to using $a_s = b_r = 1$ instead of the normalization $\sum a_i = \sum b_j = 1$ of $a$ and $b$. In a dually flat manifold, the dual potential function $\varphi_\lambda$ is obtained from the potential function $\psi_\lambda$ as its Legendre dual, which is given by

$$\varphi_\lambda(p, q) = p \cdot \alpha + q \cdot \beta - \psi_\lambda(\alpha, \beta). \tag{37}$$

When we use the new notations $\eta = (p, q)^T$, $\theta = (\alpha, \beta)^T$, we have

$$\psi_\lambda(\theta) + \varphi_\lambda(\eta) = \theta \cdot \eta, \tag{38}$$

which is the Legendre relationship between $\theta$ and $\eta$, and we have the following theorem:

Theorem 3 The dual potential $\varphi_\lambda$ is equal to the Cuturi function $C_\lambda$.

Proof Direct calculation of Eq. (37) gives

$$\begin{aligned} \varphi_\lambda(p, q) &= p \cdot \alpha + q \cdot \beta - \psi_\lambda(\alpha, \beta) \\ &= \frac{1}{1+\lambda} \langle M, P \rangle + \sum_{i,j} P_{ij} \left( \alpha_i + \beta_j - \frac{1}{1+\lambda} m_{ij} \right) - \psi_\lambda \\ &= \frac{1}{1+\lambda} \langle M, P \rangle + \frac{\lambda}{1+\lambda} \sum_{i,j} P_{ij} \left( \log a_i + \log b_j - \frac{m_{ij}}{\lambda} + \log c \right) \\ &= C_\lambda(p, q). \end{aligned} \tag{39}$$

We summarize the Legendre relationship below.

Theorem 4 The dual potential function $\varphi_\lambda$ (Cuturi function) and the potential function (free energy, cumulant generating function) $\psi_\lambda$ of the exponential family $S_{OPT,\lambda}$ are both convex, connected by the Legendre transformation,

$$\theta = \nabla_\eta \varphi_\lambda(\eta), \quad \eta = \nabla_\theta \psi_\lambda(\theta), \tag{40}$$

or

$$\alpha = \nabla_p \varphi_\lambda(p, q), \quad \beta = \nabla_q \varphi_\lambda(p, q), \tag{41}$$

$$p = \nabla_\alpha \psi_\lambda(\alpha, \beta), \quad q = \nabla_\beta \psi_\lambda(\alpha, \beta). \tag{42}$$

Since $S_{OPT,\lambda}$ is dually flat, we can introduce a Riemannian metric and a cubic tensor. The Riemannian metric $G_\lambda$ is given on $S_{s-1} \times S_{r-1}$ by

$$G_\lambda = \nabla_\eta \nabla_\eta \varphi_\lambda(\eta) \tag{43}$$

in the $\eta$-coordinate system $(p, q)$. Its inverse is

$$G_\lambda^{-1} = \nabla_\theta \nabla_\theta \psi_\lambda(\theta). \tag{44}$$

Calculating Eq. (44) carefully, we have the following theorem:

Theorem 5 The Fisher information matrix $G_\lambda^{-1}$ in the $\theta$-coordinate system is given by

$$G_\lambda^{-1} = \begin{pmatrix} p_i \delta_{ij} - p_i p_j & P_{ij} - p_i q_j \\ P_{ij} - p_i q_j & q_i \delta_{ij} - q_i q_j \end{pmatrix}. \tag{45}$$

Remark 1 The $p$-part and $q$-part of $G_\lambda^{-1}$ are equal to the corresponding Fisher information matrices in $S_{s-1}$ and $S_{r-1}$ in the e-coordinate systems.

Remark 2 The $p$-part and the $q$-part of $G_\lambda$ are not equal to the corresponding Fisher information matrices in the m-coordinate system. This is because the $(p, q)$-part of $G$ is not 0.

We can similarly calculate the cubic tensor,

$$T = \nabla \nabla \nabla \psi_\lambda, \tag{46}$$

but we do not show the results here. From the Legendre pair of convex functions $\varphi_\lambda$ and $\psi_\lambda$, we can also introduce the canonical divergence between two transportation problems $(p, q)$ and $(p', q')$,

$$D_\lambda\left[ (p, q) : (p', q') \right] = \psi_\lambda(\alpha, \beta) + \varphi_\lambda(p', q') - \alpha \cdot p' - \beta \cdot q', \tag{47}$$

where $(\alpha, \beta)$ corresponds to $(p, q)$. This is the KL-divergence between the two optimal transportation plans,

$$D_\lambda\left[ (p, q) : (p', q') \right] = \mathrm{KL}\left[ P_\lambda(p, q) : P_\lambda(p', q') \right]. \tag{48}$$


5 λ-divergences in Sn−1

5.1 Derivation of λ-divergences

We define a divergence between $p \in S_{n-1}$ and $q \in S_{n-1}$ using the canonical divergence in the set $S_{OPT,\lambda}$ of the optimal transportation plans [Eq. (48)]. For the sake of simplicity, we hereafter only study the case $\chi_S = \chi_R = \chi$. We introduce a reference distribution $r \in S_{n-1}$ and define the $r$-referenced divergence between $p$ and $q$ by

$$D_{r,\lambda}[p : q] = \gamma_\lambda\, \mathrm{KL}\left[ P^*_\lambda(r, p) : P^*_\lambda(r, q) \right], \tag{49}$$

where $\gamma_\lambda$ is a scaling factor, which we discuss later, and $P^*_\lambda(r, p)$ is the optimal transportation plan from $r$ to $p$. Note that its dual

$$\tilde{D}_{r,\lambda}[p : q] = \gamma_\lambda\, \mathrm{KL}\left[ P^*_\lambda(r, q) : P^*_\lambda(r, p) \right] \tag{50}$$

is another candidate. There are other combinations, but we study only Eq. (49) as the first step. There are various ways of choosing a reference distribution $r$. We first considered the simple choice of $r = p$, yielding the following $\lambda$-divergence:

$$D_\lambda[p : q] = \gamma_\lambda\, \mathrm{KL}\left[ P^*_\lambda(p, p) : P^*_\lambda(p, q) \right]. \tag{51}$$

Theorem 6 $D_\lambda[p : q]$ with the scaling factor $\gamma_\lambda = \frac{\lambda}{1+\lambda}$ is given by

$$D_\lambda[p : q] = C_\lambda(p, p) - C_\lambda(p, q) - \nabla_q C_\lambda(p, q) \cdot (p - q), \tag{52}$$

which is constructed from the Cuturi function.

Proof The optimal transportation plans are rewritten in terms of the $\theta$-coordinates in the form

$$\frac{\lambda}{1+\lambda} \log P^*_\lambda(p, p)_{ij} = \alpha'_i + \beta'_j - \frac{m_{ij}}{1+\lambda} - \psi'_\lambda, \tag{53}$$

$$\frac{\lambda}{1+\lambda} \log P^*_\lambda(p, q)_{ij} = \alpha_i + \beta_j - \frac{m_{ij}}{1+\lambda} - \psi_\lambda, \tag{54}$$

where $(\alpha', \beta')$ and $\psi'_\lambda$ correspond to the plan $P^*_\lambda(p, p)$, and $(\alpha, \beta)$ and $\psi_\lambda$ to $P^*_\lambda(p, q)$.

Then, we have

$$\begin{aligned} D_\lambda[p : q] &= p \cdot \alpha' + p \cdot \beta' - \psi'_\lambda - \left( p \cdot \alpha + q \cdot \beta - \psi_\lambda \right) - (p - q) \cdot \beta \\ &= \varphi_\lambda(p, p) - \varphi_\lambda(p, q) - \nabla_q \varphi_\lambda(p, q) \cdot (p - q). \end{aligned} \tag{55}$$

Since we showed that $\varphi_\lambda = C_\lambda$ in Theorem 3, we obtain Eq. (52).

This is a divergence function satisfying $D_\lambda[p : q] \ge 0$, with equality if and only if $p = q$. However, it is not a canonical divergence of a dually flat manifold. The Bregman divergence derived from a convex function $\tilde{\varphi}(p)$ is given by

$$\tilde{D}_\lambda[p : q] = \tilde{\varphi}(p) - \tilde{\varphi}(q) - \nabla \tilde{\varphi}(q) \cdot (p - q). \tag{56}$$

This is different from Eq. (52), which is derived from $\varphi_\lambda(p, q)$. Thus, we call $D_\lambda[p : q]$ a Bregman-like divergence.

In the extremes of $\lambda$, the proposed divergence $D_\lambda[p : q]$ is related to the KL-divergence and the Wasserstein distance in the following sense:

1. When $\lambda \to \infty$, $D_\lambda$ converges to $\mathrm{KL}[p : q]$. This is because $P^*$ converges to $p \otimes q$ in the limit, and we easily have

$$\mathrm{KL}[p \otimes p : p \otimes q] = \mathrm{KL}[p : q]. \tag{57}$$

2. When $\lambda \to 0$, $D_\lambda$ with $\gamma_\lambda = \lambda/(1+\lambda)$ converges to 0, because $\mathrm{KL}\left[ P^*_0(p, p) : P^*_0(p, q) \right]$ takes a finite value (see Example 1 in the next section). This suggests that it is preferable to use a scaling factor other than $\gamma_\lambda = \lambda/(1+\lambda)$ when $\lambda$ is small. When $\lambda = 0$, $C_\lambda = \varphi_\lambda$ is not differentiable; hence, we cannot construct the Bregman-like divergence from $C_0$ [Eq. (52)], as shown in a simple example given in Sect. 5.3.
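A minimal numerical sketch of the $\lambda$-divergence of Eq. (51) with $r = p$ is given below. It obtains the two optimal plans by Sinkhorn-type scaling (Sect. 7.4) and applies the scaling factor $\gamma_\lambda = \lambda/(1+\lambda)$ of Theorem 6; the cost matrix, marginals and $\lambda$ values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def optimal_plan(p, q, M, lam, n_iter=5000):
    """Entropy-relaxed optimal plan (Theorem 1) via alternating scaling."""
    K = np.exp(-M / lam)
    a = np.ones_like(p)
    for _ in range(n_iter):
        b = q / (K.T @ a)
        a = p / (K @ b)
    return a[:, None] * K * b[None, :]

def lambda_divergence(p, q, M, lam):
    P_pp = optimal_plan(p, p, M, lam)           # reference plan, r = p
    P_pq = optimal_plan(p, q, M, lam)
    kl = np.sum(P_pp * np.log(P_pp / P_pq))     # KL between the two optimal plans, Eq. (51)
    return lam / (1.0 + lam) * kl               # scaling factor gamma_lambda of Theorem 6

M = np.abs(np.subtract.outer(np.arange(3.0), np.arange(3.0)))
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
for lam in (0.1, 1.0, 100.0):
    print(lam, lambda_divergence(p, q, M, lam))
print('KL[p:q] =', np.sum(p * np.log(p / q)))   # the large-lambda values approach this
```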

5.2 Other choices of reference distribution r

We can consider other choices of the reference distribution $r$. One option is choosing the $r$ that minimizes the KL-divergence,

$$\tilde{D}_\lambda[p : q] = \gamma_\lambda \min_r \mathrm{KL}\left[ P^*_\lambda(r, p) : P^*_\lambda(r, q) \right]. \tag{58}$$

However, obtaining the minimizer $r$ is not computationally easy. Thus, we can simply replace the optimal $r$ with the arithmetic mean or geometric mean of $p$ and $q$. The arithmetic mean is given by the m-mixture midpoint of $p$ and $q$,

$$r = \frac{1}{2}(p + q). \tag{59}$$

The geometric mean is given by the e-midpoint of $p$ and $q$,

$$r = \left( c \sqrt{p_i q_i} \right), \tag{60}$$

where $c$ is the normalization constant.
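Both mean choices are immediate to compute; the following lines (with arbitrary illustrative distributions) construct the m-mixture midpoint of Eq. (59) and the normalized e-midpoint of Eq. (60).

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

r_arithmetic = 0.5 * (p + q)          # m-mixture midpoint, Eq. (59)
r_geometric = np.sqrt(p * q)          # e-midpoint, Eq. (60), before normalization
r_geometric /= r_geometric.sum()      # the constant c makes it a probability vector
print(r_arithmetic, r_geometric)
```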

5.3 Examples of λ-divergence

Below, we consider the case where $r = p$. We show two simple examples where $D_\lambda[p : q]$ can be computed analytically.

Example 1 Let n = 2 and

$$m_{ii} = 0, \quad m_{ij} = 1 \quad (i \ne j). \tag{61}$$

We use $a_2 = b_2 = 1$ for normalization,

$$P_{ij} = c\, a_i b_j K_{ij}, \tag{62}$$

$$K_{ij} = \exp\left( -\frac{m_{ij}}{\lambda} \right) = \begin{pmatrix} 1 & \varepsilon \\ \varepsilon & 1 \end{pmatrix}, \tag{63}$$

$$\varepsilon = \exp\left( -\frac{1}{\lambda} \right). \tag{64}$$

Note that $\varepsilon \to 0$ as $\lambda \to 0$.

When $\lambda > 0$, the sender's and receiver's conditions require

$$cab + ca\varepsilon = p, \tag{65}$$

$$cab + cb\varepsilon = q, \tag{66}$$

where we use $a = a_1$, $b = b_1$ and

$$c = \frac{1}{ab + \varepsilon(a + b) + 1}. \tag{67}$$

Solving the above equations, we have

$$a = \frac{z - (q - p)/\varepsilon}{2(1 - p)}, \tag{68}$$

$$b = \frac{z + (q - p)/\varepsilon}{2(1 - q)}, \tag{69}$$

where

$$z = -\varepsilon(1 - p - q) + \sqrt{(q - p)^2/\varepsilon^2 + \varepsilon^2(1 - p - q)^2 + 2p(1 - p) + 2q(1 - q)}.$$

We can show $D_\lambda[p : q]$ explicitly by using the solution, although it is complicated. When $\lambda = 0$, we easily have

$$C_0(p, q) = |p - q|, \tag{70}$$

where $p = (p, 1-p)$ and $q = (q, 1-q)$. $C_0(p, q)$ is piecewise linear and cannot be used to construct a Bregman-like divergence. However, we can calculate the limiting case of $\lambda \to 0$, because the optimal transportation plans $P^*$ for small $\lambda$ are directly calculated by minimizing $C_\lambda(p, q)$ as

$$P^*_\lambda(p, p) = \begin{pmatrix} p & 0 \\ 0 & 1-p \end{pmatrix} + \begin{pmatrix} -\varepsilon & \varepsilon \\ \varepsilon & -\varepsilon \end{pmatrix}, \tag{71}$$

$$P^*_\lambda(p, q) = \begin{pmatrix} p & 0 \\ q-p & 1-q \end{pmatrix} + \begin{pmatrix} -\varepsilon^2 & \varepsilon^2 \\ \varepsilon^2 & -\varepsilon^2 \end{pmatrix}, \tag{72}$$

where we set $q > p$. The limit of the KL divergence is given by

$$\lim_{\lambda \to 0} \mathrm{KL}\left[ P^*_\lambda(p, p) : P^*_\lambda(p, q) \right] = \begin{cases} p \log \dfrac{p}{q} & (p \ge q), \\[1ex] (1-p) \log \dfrac{1-p}{1-q} & (p < q). \end{cases} \tag{73}$$

In the general case of $n \ge 2$, the optimal transportation plan is $P^*_0(p, p) = (p_i \delta_{ij})$. The diagonal parts of the optimal $P^*_0(p, q)$ are $\min\{p_i, q_i\}$ when $m_{ii} = 0$, $m_{ij} > 0$ $(i \ne j)$. Thus, the KL divergence is given by

$$\mathrm{KL}\left[ P^*_0(p, p) : P^*_0(p, q) \right] = \sum_{i;\, p_i > q_i} p_i \log \frac{p_i}{q_i}. \tag{74}$$
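The closed-form limit in Eq. (74) is easy to evaluate. The short sketch below (with arbitrary illustrative distributions) compares the truncated sum of Eq. (74) with the full KL-divergence $\mathrm{KL}[p : q]$.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

mask = p > q
limit_value = np.sum(p[mask] * np.log(p[mask] / q[mask]))   # Eq. (74): sum over p_i > q_i
full_kl = np.sum(p * np.log(p / q))                          # ordinary KL[p : q]
print(limit_value, full_kl)   # dropping the negative terms makes Eq. (74) larger here
```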

Remark that when $\lambda \to \infty$,

$$\lim_{\lambda \to \infty} \mathrm{KL}\left[ P^*_\lambda(p, p) : P^*_\lambda(p, q) \right] = \sum_i p_i \log \frac{p_i}{q_i}. \tag{75}$$

Example 2 We take a family of Gaussian distributions $N\left( \mu, \sigma^2 \right)$,

$$p\left( x; \mu, \sigma^2 \right) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\} \tag{76}$$

on the real line $\chi = \{x\}$, extending the discrete case to the continuous case. We transport $p\left( x; \mu_p, \sigma_p^2 \right)$ to $q\left( x; \mu_q, \sigma_q^2 \right)$, where the transportation cost is

$$m(x, y) = |x - y|^2. \tag{77}$$

Then, we have

$$K(x, y) = \exp\left\{ -\frac{(x - y)^2}{2\lambda^2} \right\}, \tag{78}$$

where we use $2\lambda^2$ instead of the previous $\lambda$ for the sake of convenience.

The optimal transportation plan is written as

$$P^*(x, y) = c\, a(x) b(y) K(x, y), \tag{79}$$

where $a$ and $b$ are determined from

$$\int c\, a(x) b(y) K(x, y)\, dy = p(x), \tag{80}$$

$$\int c\, a(x) b(y) K(x, y)\, dx = q(y). \tag{81}$$

The solutions are given in the Gaussian framework, $a(x) \sim N\left( \tilde{\mu}, \tilde{\sigma}^2 \right)$, $b(y) \sim N\left( \tilde{\mu}', \tilde{\sigma}'^2 \right)$. As derived in Appendix A, the optimal cost and divergence are as follows:

$$C_\lambda(p, q) = \frac{1}{1+\lambda} \left\{ \left( \mu_p - \mu_q \right)^2 + \sigma_p^2 + \sigma_q^2 + \frac{\lambda}{2} \left( 1 - \sqrt{1 + X} \right) \right\} - \frac{\lambda}{1+\lambda} \left\{ \log \sigma_p \sigma_q + \frac{1}{2} \log 8\pi^2 e^2 - \frac{1}{2} \log\left( 1 + \sqrt{1 + X} \right) \right\}, \tag{82}$$

$$D_\lambda[p : q] = \gamma_\lambda \left\{ \frac{1}{2} \left( \sqrt{1 + X} - \sqrt{1 + X_p} \right) + \log \frac{\sigma_q}{\sigma_p} + \frac{1}{2} \log \frac{1 + \sqrt{1 + X_p}}{1 + \sqrt{1 + X}} + \frac{1 + \sqrt{1 + X}}{4} \left( \frac{(\mu_p - \mu_q)^2}{\sigma_q^2} + \frac{\sigma_p^2}{\sigma_q^2} - 1 \right) \right\}, \quad \text{where } X = \frac{16 \sigma_p^2 \sigma_q^2}{\lambda^2}, \quad X_p = \frac{16 \sigma_p^4}{\lambda^2}. \tag{83}$$

Note that $D_\lambda = \mathrm{KL}\left[ P^*_\lambda(p, p) : P^*_\lambda(p, q) \right]$ diverges to infinity in the limit of $\lambda \to 0$ because the support of the optimal transport $P^*_\lambda(p, q)$ reduces to a one-dimensional subspace. To prevent $D_\lambda$ from diverging and to make it finite, we set the scaling factor $\gamma_\lambda = \frac{\lambda}{1+\lambda}$. In this case, $D_\lambda$ is equivalent to the Bregman-like divergence of the Cuturi function, as shown in Theorem 6. With this scaling factor $\gamma_\lambda$, $D_\lambda$ in the limits of $\lambda \to \infty$ and $\lambda \to 0$ is given by

$$\lim_{\lambda \to \infty} D_\lambda = \frac{1}{2} \left\{ \frac{(\mu_p - \mu_q)^2}{\sigma_q^2} + \frac{\sigma_p^2}{\sigma_q^2} - 1 + \log \frac{\sigma_q^2}{\sigma_p^2} \right\} = \mathrm{KL}[p : q], \tag{84}$$

$$\lim_{\lambda \to 0} D_\lambda = \frac{\sigma_p}{\sigma_q} (\mu_p - \mu_q)^2 + \frac{\sigma_p}{\sigma_q} (\sigma_p - \sigma_q)^2. \tag{85}$$

6 Applications of λ-divergence

6.1 Cluster center (barycenter)

Let $q_1, \ldots, q_k$ be $k$ distributions in $S_{n-1}$. Their $\lambda$-center is defined by $p^*$, which minimizes the average of the $\lambda$-divergences from $q_i$ to $p \in S_{n-1}$,

$$p^* = \arg\min_p \sum_i D_\lambda[q_i : p]. \tag{86}$$

The center is obtained from

$$\sum_i \partial_p D_\lambda[q_i : p] = 0, \tag{87}$$

which yields the equation giving $p^*$,

$$\sum_i G\left( q_i, p^* \right) \left( q_i - p^* \right) = 0, \tag{88}$$

where

$$G(q, p) = \nabla_p \nabla_p \varphi_\lambda(q, p). \tag{89}$$

It is known [5] that the mean (center) of two Gaussian distributions $N\left( \mu_1, \sigma_1^2 \right)$ and $N\left( \mu_2, \sigma_2^2 \right)$ over the real line $\chi = \mathbb{R}$ is the Gaussian $N\left( \frac{\mu_1 + \mu_2}{2}, \frac{(\sigma_1 + \sigma_2)^2}{4} \right)$, when we use the square of the Wasserstein distance $W_2$ with the cost function $|x_1 - x_2|^2$. It would be interesting to see how the center changes depending on $\lambda$ based on $D_\lambda[p : q]$. We consider the center of two Gaussian distributions $q_1$ and $q_2$, defined by

$$\eta_p = \arg\min_p \sum_i D_\lambda[p : q_i]. \tag{90}$$

When $\lambda \to 0$ and $\lambda \to \infty$, we have

$$\lambda \to \infty: \quad \sigma_p^2 = \frac{\sigma_{q_1}^2 \sigma_{q_2}^2}{\sigma_{q_1}^2 + \sigma_{q_2}^2}, \quad \mu_p = \frac{\sigma_{q_2}^2 \mu_{q_1} + \sigma_{q_1}^2 \mu_{q_2}}{\sigma_{q_1}^2 + \sigma_{q_2}^2}, \tag{91}$$

$$\lambda \to 0: \quad \sigma_p = \frac{2 \sigma_{q_1} \sigma_{q_2}}{\sigma_{q_1} + \sigma_{q_2}}, \quad \mu_p = \frac{\sigma_{q_2} \mu_{q_1} + \sigma_{q_1} \mu_{q_2}}{\sigma_{q_1} + \sigma_{q_2}}. \tag{92}$$

However, if we use $C_\lambda$ instead of $D_\lambda$ the centers are

$$\lambda \to \infty: \quad \sigma_p = \lambda, \tag{93}$$

$$\lambda \to 0: \quad \sigma_p = \frac{\sigma_{q_1} + \sigma_{q_2}}{2}, \tag{94}$$

which are not reasonable for large $\lambda$.

6.2 Statistical estimation

Let us consider a statistical model $M$,

$$M = \{p(x, \xi)\}, \tag{95}$$

parameterized by $\xi$. An interesting example is the set of distributions over $\chi = \{0, 1\}^n$, where $x$ is a vector random variable defined on the $n$-cube $\chi$.

The Boltzmann machine $M$ is its submodel, consisting of probability distributions which do not include higher-order interaction terms of the random variables $x_i$,

$$p(x, \xi) = \exp\left\{ \sum_i b_i x_i + \sum_{i < j} w_{ij} x_i x_j - \psi \right\}, \tag{96}$$

where $\xi = \left( b_i, w_{ij} \right)$. The transportation cost is

$$m(x, y) = \sum_i |x_i - y_i|, \tag{97}$$

which is the Hamming distance [10]. Let $\hat{q} = \hat{q}(x)$ be an empirical distribution. Then, the $D_\lambda$-estimator $p^* = p^*(x, \xi^*) \in M$ is defined by

$$p^*\left( x, \xi^* \right) = \arg\min_\xi D_\lambda\left[ \hat{q} : p(x, \xi) \right]. \tag{98}$$

Differentiating $D_\lambda$ with respect to $\xi$, we obtain the following theorem:

Theorem 7 The $\lambda$-estimator $\xi^*$ satisfies

$$G\left( \hat{q}, p \right) \left( p - \hat{q} \right) \cdot \frac{\partial p(x, \xi^*)}{\partial \xi} = 0. \tag{99}$$
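As a concrete illustration of the Boltzmann-machine model in Eq. (96) and the Hamming cost in Eq. (97), the sketch below enumerates the $n$-cube for a small $n$ and builds the cost matrix. The size $n = 3$, the random parameters and the brute-force normalization are illustrative assumptions, not part of the paper.

```python
import itertools
import numpy as np

n = 3                                              # illustrative size; chi = {0,1}^n
states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def boltzmann(b, W):
    """Probabilities p(x; xi) of Eq. (96), normalized by brute-force enumeration."""
    energy = states @ b + np.einsum('ki,ij,kj->k', states, np.triu(W, 1), states)
    e = np.exp(energy)
    return e / e.sum()                             # dividing by e.sum() plays the role of psi

# Hamming cost of Eq. (97) between all pairs of binary states
m = np.abs(states[:, None, :] - states[None, :, :]).sum(axis=-1)

rng = np.random.default_rng(0)
p = boltzmann(rng.normal(size=n), rng.normal(size=(n, n)))
print(p.sum(), m.shape)                            # 1.0, (8, 8)
```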

6.3 Pattern classifier

Let $p_1$ and $p_2$ be two prototype patterns of categories $C_1$ and $C_2$. A separating hypersubmanifold of the two categories is defined by the set of $q$ that satisfy

$$D_\lambda[p_1 : q] = D_\lambda[p_2 : q] \tag{100}$$

or

$$D_\lambda[q : p_1] = D_\lambda[q : p_2]. \tag{101}$$

It would be interesting to study the geometrical properties of the $\lambda$-separating hypersubmanifold (Fig. 2).

7 Information geometry of transportation plans

We provide a general framework of the transportation plans from the viewpoint of information geometry. The manifold of all transportation plans is the probability simplex $M = S_{n^2-1}$ consisting of all the joint probability distributions $P$ over $\chi \times \chi$. It is dually flat, where the m-coordinates are $\eta_{ij} = P_{ij}$, from which $P_{nn}$ is determined by

$$\sum_{ij} P_{ij} = 1. \tag{102}$$

Fig. 2 λ-separating hyperplane

The corresponding e-coordinates are log Pij divided by Pnn as

$$\theta_{ij} = \log \frac{P_{ij}}{P_{nn}}. \tag{103}$$

We considered three problems in $M = S_{n^2-1}$, when the cost matrix $M = (m_{ij})$ is given.

7.1 Free problem

Minimize the entropy-relaxed transportation cost $\varphi_\lambda(P)$ without any constraints on $P$. The solution is

$$P^*_{\mathrm{free}} = \exp\left\{ -\frac{m_{ij}}{\lambda} - \frac{1+\lambda}{\lambda} \psi \right\} = cK, \tag{104}$$

where $c$ is a normalization constant. This clarifies the meaning of the matrix $K$ [Eq. (17)], i.e., $K$ is the optimal transportation plan for the free problem.

7.2 Rate-distortion problem

We considered a communication channel in which $p$ is a probability distribution on the sender's terminals. The channel is noisy and $P_{ij}/p_i$ is the probability that $x_j$ is received when $x_i$ is sent. The costs $m_{ij}$ are regarded as the distortion of $x_i$ changing to $x_j$. The rate-distortion problem in information theory searches for the $P$ that minimizes the mutual information between the sender and receiver under the constraint on the distortion $\langle M, P \rangle$. The problem is formulated by maximizing $\varphi_\lambda(P)$ under the sender's constraint $p$, where $q$ is free (R. Belavkin, personal communication; see also [11]).

Fig. 3 e-projection in the rate-distortion problem

The optimal solution is given by

$$P^*_{rd\,ij} = c\, a_i K_{ij}, \tag{105}$$

since $q$ is free and $\beta = 0$, or $b_j = 1$. The $a_i$ are determined from $p$ such that the sender's condition

$$c \sum_j a_i K_{ij} = p_i \tag{106}$$

is satisfied. Therefore, the dual parameters $a_i$ are given explicitly as

$$c\, a_i = \frac{p_i}{\sum_j K_{ij}}. \tag{107}$$

Let $M(p, \cdot)$ be the set of plans that satisfy the sender's condition

$$\sum_j P_{ij} = p_i. \tag{108}$$

Then, we will see that $P^*_{rd}$ is the e-projection of $P^*_{\mathrm{free}}$ to $M(p, \cdot)$. The e-projection is explicitly given by Eq. (107) (Fig. 3).
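Because only the sender's marginal is constrained, the e-projection of Eq. (107) needs no iteration. A small sketch (with an illustrative cost matrix, $\lambda$ and $p$, not taken from the paper) is:

```python
import numpy as np

lam = 0.5
M = np.abs(np.subtract.outer(np.arange(3.0), np.arange(3.0)))
K = np.exp(-M / lam)                       # Eq. (17)
p = np.array([0.5, 0.3, 0.2])

ca = p / K.sum(axis=1)                     # c * a_i of Eq. (107), in one step
P_rd = ca[:, None] * K                     # optimal plan of Eq. (105)
print(np.allclose(P_rd.sum(axis=1), p))    # sender's condition (106) holds; q is free
```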

7.3 Transportation problem

A transportation plan satisfies the sender's and receiver's conditions. Let $M(\cdot, q)$ be the set of plans that satisfies the receiver's condition

$$\sum_i P_{ij} = q_j. \tag{109}$$

Fig. 4 m-flat submanifolds in the transportation problem

Then, the transportation problem searches for the plan that minimizes the entropy-relaxed cost in the subset

$$M(p, q) = M(p, \cdot) \cap M(\cdot, q). \tag{110}$$

Since the constraints in Eqs. (108) and (109) are linear in the m-coordinates $P$, the sets $M(p, \cdot)$, $M(\cdot, q)$ and $M(p, q)$ are m-flat submanifolds (Fig. 4). Since $p$ and $q$ are fixed, $M(p, q)$ is of dimension $(n-1)^2$, in which all the degrees of freedom represent mutual interactions between the sender and receiver. We define them by

$$\Theta_{ij} = \log \frac{P_{ij} P_{nn}}{P_{in} P_{nj}}, \quad i, j = 1, \ldots, n-1. \tag{111}$$

They vanish for $P_D = p \otimes q$, as is easily seen from Eq. (111). Since the $\Theta_{ij}$ are linear in $\log P_{ij}$, the submanifold $E\left( \Theta_{ij} \right)$, in which the $\Theta_{ij}$'s take fixed values but $p$ and $q$ are free, is a $2(n-1)$-dimensional e-flat submanifold. We introduce mixed coordinates

$$\Xi = \left( p, q, \Theta_{ij} \right), \tag{112}$$

such that the first $2(n-1)$ coordinates $(p, q)$ are the marginal distributions in the m-coordinates and the last $(n-1)^2$ coordinates $\Theta$ are interactions in the e-coordinates given in Eq. (111). Since the two complementary coordinates are orthogonal, we have orthogonal foliations of $S_{n^2-1}$ [1] (Fig. 5). Given two vectors $a = (a_i)$ and $b = (b_j)$, we considered the following transformation of $P$,

$$T_{ab} P = \left( c\, a_i b_j P_{ij} \right), \tag{113}$$

where $c$ is a constant determined from the normalization condition,

$$c \sum_{i,j} a_i b_j P_{ij} = 1. \tag{114}$$

Fig. 5 Orthogonal foliations of $S_{n^2-1}$ with the mixed coordinates

$\Xi$ gives the mixed coordinates of $P$, and the m-flat submanifold $M(p, q)$, defined by fixing the first $2(n-1)$ coordinates, is orthogonal to the e-flat submanifold $E(\Theta)$, defined by fixing the last $(n-1)^2$ coordinates at $\Theta_{ij}$. The transformation $T_{ab}$ is called the RAS transformation in the input-output analysis of economics.

Lemma For any a, b, transformation Tab does not change the interaction terms Θij. Moreover, the e-geodesic connecting P and TabP is orthogonal to M( p, q).

Proof By calculating the mixed coordinates of $T_{ab}P$, we easily see that the $\Theta$-part does not change. Hence, the e-geodesic connecting $P$ and $T_{ab}P$ is given, in terms of the mixed coordinates, by keeping the last part fixed while changing the first part. This geodesic is included in $E(\Theta)$; therefore, it is orthogonal to $M(p, q)$.

Since the optimal solution is given by applying $T_{ab}$ to $K$ (even when $K$ is not normalized) such that the terminal conditions [Eq. (4)] are satisfied, we have the following theorem:

Theorem 8 The optimal solution $P^*$ is given by e-projecting $K$ to $M(p, q)$.

7.4 Iterative algorithm (Sinkhorn algorithm) for obtaining a and b

We need to calculate $a$ and $b$ when $p$ and $q$ are given in order to obtain the optimal transportation plan. The Sinkhorn algorithm is well known for this purpose [5]. It is an iterative algorithm for obtaining the e-projection of $K$ to $M(p, q)$.

Let $T_{A\cdot}$ be the e-projection of $P$ to $M(p, \cdot)$ and let $T_{\cdot B}$ be the e-projection to $M(\cdot, q)$. From the Pythagorean theorem, we have

$$\mathrm{KL}\left[ T_{A\cdot}P : P \right] + \mathrm{KL}\left[ P^* : T_{A\cdot}P \right] = \mathrm{KL}\left[ P^* : P \right], \tag{115}$$

where $P^* = T_{ab}P$ is the optimal solution, that is, the e-projection of $K$ to $M(p, q)$. Hence, we have

$$\mathrm{KL}\left[ P^* : T_{A\cdot}P \right] \le \mathrm{KL}\left[ P^* : P \right], \tag{116}$$

and the equality holds when and only when $P \in M(p, \cdot)$. The e-projection of $P$ decreases the dual KL-divergence to $P^*$. The same property holds for the e-projection to $M(\cdot, q)$. The iterative e-projections of $K$ to $M(p, \cdot)$ and $M(\cdot, q)$ converge to the optimal solution $P^*$.

It is difficult to obtain an explicit expression for the e-projection of $P$ to $M(p, q)$, but those of the e-projections to $M(p, \cdot)$ and $M(\cdot, q)$ are easily obtained. The e-projection of $P$ to $M(p, \cdot)$ is given by

$$T_{A\cdot}P = \left( a_i P_{ij} \right), \tag{117}$$

where $a$ is given explicitly by

$$a_i = \frac{p_i}{\sum_j P_{ij}}. \tag{118}$$

Similarly, the e-projection to $M(\cdot, q)$ is given by

$$T_{\cdot B}P = \left( b_j P_{ij} \right), \tag{119}$$

with

$$b_j = \frac{q_j}{\sum_i P_{ij}}. \tag{120}$$

Therefore, the iterative algorithm of e-projections from $K$, known as the Sinkhorn algorithm [7,12], is formulated as follows:

Iterative e-projection algorithm

1. Begin with P0 = K.

2. For t = 0, 1, 2,..., e-project P2t to M( p, ·) to obtain

$$P_{2t+1} = T_{A\cdot} P_{2t}. \tag{121}$$

3. To obtain $P_{2t+2}$, e-project $P_{2t+1}$ to $M(\cdot, q)$,

$$P_{2t+2} = T_{\cdot B} P_{2t+1}. \tag{122}$$

4. Repeat until convergence.

Figure 6 schematically illustrates the iterative e-projection algorithm for finding the optimal solution $P^*$.
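A compact sketch of the iterative e-projection algorithm, written directly in terms of the row and column scalings of Eqs. (117)–(120), is given below; the cost matrix, marginals, $\lambda$, tolerance and iteration cap are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sinkhorn_projection(K, p, q, tol=1e-12, max_iter=10000):
    """Alternating e-projections of P0 = K onto M(p, .) and M(., q)."""
    P = K / K.sum()                                  # start from the (normalized) free solution K
    for _ in range(max_iter):
        P = P * (p / P.sum(axis=1))[:, None]         # e-projection onto M(p, .), Eqs. (117)-(118)
        P = P * (q / P.sum(axis=0))[None, :]         # e-projection onto M(., q), Eqs. (119)-(120)
        if np.allclose(P.sum(axis=1), p, atol=tol):
            break                                    # both marginal conditions of Eq. (4) hold
    return P

lam = 0.1
M = np.abs(np.subtract.outer(np.arange(3.0), np.arange(3.0)))
K = np.exp(-M / lam)                                 # Eq. (17)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
P_opt = sinkhorn_projection(K, p, q)
print(P_opt.sum(axis=1), P_opt.sum(axis=0))          # marginals p and q
```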

Fig. 6 Sinkhorn algorithm as iterative e-projections

8 Conclusions and additional remarks

We elucidated the geometry of optimal transportation plans and introduced a one-parameter family of divergences in the probability simplex that connects the Wasserstein distance and the KL-divergence. A one-parameter family of Riemannian metrics and dually coupled affine connections was introduced in $S_{n-1}$, although they are not dually flat in general. We uncovered a new way of studying the geometry of probability distributions. Future studies should examine the properties of the $\lambda$-divergence and apply it to various problems. We touch upon some related problems below.

1. Uniqueness of the optimal plan. The original Wasserstein distance is obtained by solving a linear programming problem. Hence, the solution is not unique in some cases and is not necessarily a continuous function of $M$. However, the entropy-constrained solution is unique and continuous with respect to $M$ [7]. While $\varphi_\lambda(p, q)$ converges to $\varphi_0(p, q)$ as $\lambda \to 0$, $\varphi_0(p, q)$ is not necessarily differentiable with respect to $p$ and $q$.

2. Integrated information theory of consciousness. Given a joint probability distribution $P$, the amount of integrated information is measured by the amount of interactions of information among different terminals. We used a disconnected model in which no information is transferred through branches connecting different terminals. The geometric measure of integrated information theory is given by the KL-divergence from $P$ to the submanifold of disconnected models [13,14]. However, the Wasserstein divergence can be considered as such a measure when the cost of transferring information through different terminals depends on the physical positions of the terminals [15]. We can use the entropy-constrained divergence $D_\lambda$ to define the amount of information integration.

3. f-divergence. We used the KL-divergence in a dually flat manifold for defining $D_\lambda$. It is possible to use other divergences, for example the f-divergence, instead of the KL-divergence. We would obtain similar results.

4. q-entropy. Muzellec et al. used the $\alpha$-entropy (Tsallis $q$-entropy) instead of the Shannon entropy for regularization [16]. This yields the $q$-entropy-relaxed framework.

5. Comparison of $C_\lambda$ and $D_\lambda$. Although $D_\lambda$ satisfies the criterion of a divergence, it might differ considerably from the original $C_\lambda$. In particular, when $C_\lambda(p, q)$ includes a piecewise linear term such as $\sum_i d_i |p_i - q_i|$ for constants $d_i$, the $D_\lambda$ defined in Eq. (52) eliminates this term. When this term is important, we can use $\{C_\lambda(p, q)\}^2$ instead of $C_\lambda(p, q)$ for defining a new divergence $D_\lambda$ in Eq. (52). In our accompanying paper [17], we define a new type of divergence that retains the properties of $C_\lambda$ and is closer to $C_\lambda$.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 Interna- tional License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix: The proof of Example 2

Let us assume that the functions $a(x)$ and $b(y)$ are constrained to be Gaussian: $a(x) \sim N\left( \tilde{\mu}, \tilde{\sigma}^2 \right)$, $b(y) \sim N\left( \tilde{\mu}', \tilde{\sigma}'^2 \right)$. This means that the optimal plan $P^*(x, y)$ is also given by a Gaussian distribution $N(\boldsymbol{\mu}, \Sigma)$. The marginal distributions $p$ and $q$ require the mean value of the optimal plan to become

$$\boldsymbol{\mu} = \left[ \mu_p \ \ \mu_q \right]^T. \tag{A.1}$$

It is also necessary for the diagonal part of the covariance matrix to become

$$\Sigma_{11} = \sigma_p^2, \tag{A.2}$$

$$\Sigma_{22} = \sigma_q^2. \tag{A.3}$$

Because the entropy-relaxed optimal transport is given by Eq. (79), $\Sigma$ is composed of $\tilde{\sigma}^2$ and $\tilde{\sigma}'^2$ as follows:

$$\Sigma_{11} = \frac{\tilde{\sigma}^2 \left( 2\tilde{\sigma}'^2 + \lambda \right)}{2\left( \tilde{\sigma}^2 + \tilde{\sigma}'^2 \right) + \lambda}, \tag{A.4}$$

$$\Sigma_{22} = \frac{\tilde{\sigma}'^2 \left( 2\tilde{\sigma}^2 + \lambda \right)}{2\left( \tilde{\sigma}^2 + \tilde{\sigma}'^2 \right) + \lambda}. \tag{A.5}$$

Solving Eqs. (A.4) and (A.5) under the conditions given in Eqs. (A.2) and (A.3), we have

$$\tilde{\sigma}^2 = \left[ \frac{1}{2\sigma_p^2} \left( 1 + \sqrt{1 + X} \right) - \frac{2}{\lambda} \right]^{-1}, \tag{A.6}$$

$$\tilde{\sigma}'^2 = \left[ \frac{1}{2\sigma_q^2} \left( 1 + \sqrt{1 + X} \right) - \frac{2}{\lambda} \right]^{-1}, \tag{A.7}$$

where

$$X = \frac{16 \sigma_p^2 \sigma_q^2}{\lambda^2}. \tag{A.8}$$

Substituting the mean [Eq. (A.1)] and the variances [Eqs. (A.6), (A.7)] into the definition of the cost [Eq. (23)], after straightforward calculations, we get Eq. (82). In general, the $\eta$ coordinates of the Gaussian distribution $q$ are given by $(\eta_1, \eta_2) = \left( \mu_q, \mu_q^2 + \sigma_q^2 \right)$. After differentiating $C_\lambda(p, q)$ with respect to the $\eta$ coordinates and substituting them into Eq. (52), we get Eq. (83).

References

1. Amari, S.: Information Geometry and Its Applications. Springer, Japan (2016)
2. Chentsov, N.N.: Statistical Decision Rules and Optimal Inference. Nauka (1972) (translated into English, AMS (1982))
3. Rao, C.R.: Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37, 81–91 (1945)
4. Ay, N., Jost, J., Le, H.V., Schwachhöfer, L.: Information Geometry. Springer, Cham (2017)
5. Santambrogio, F.: Optimal Transport for Applied Mathematicians. Birkhäuser, Basel (2015)
6. Villani, C.: Topics in Optimal Transportation. Graduate Studies in Math. AMS, Providence (2013)
7. Cuturi, M.: Sinkhorn distances: lightspeed computation of optimal transport. In: Advances in Neural Information Processing Systems, pp. 2292–2300 (2013)
8. Cuturi, M., Avis, D.: Ground metric learning. J. Mach. Learn. Res. 15, 533–564 (2014)
9. Cuturi, M., Peyré, G.: A smoothed dual formulation for variational Wasserstein problems. SIAM J. Imaging Sci. 9, 320–343 (2016)
10. Montavon, G., Müller, K., Cuturi, M.: Wasserstein training for Boltzmann machines (2015). arXiv:1507.01972v1
11. Belavkin, R.V.: Optimal measures and Markov transition kernels. J. Glob. Optim. 55, 387–416 (2013)
12. Sinkhorn, R.: A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Stat. 35, 876–879 (1964)
13. Oizumi, M., Tsuchiya, N., Amari, S.: Unified framework for information integration based on information geometry. Proc. Natl. Acad. Sci. 113, 14817–14822 (2016)
14. Amari, S., Tsuchiya, N., Oizumi, M.: Geometry of information integration (2017). arXiv:1709.02050
15. Oizumi, M., Albantakis, L., Tononi, G.: From the phenomenology to the mechanisms of consciousness: integrated information theory 3.0. PLoS Comput. Biol. 10, e1003588 (2014)
16. Muzellec, B., Nock, R., Patrini, G., Nielsen, F.: Tsallis regularized optimal transport and ecological inference (2016). arXiv:1609.04495v1
17. Amari, S., Karakida, R., Oizumi, M., Cuturi, M.: New divergence derived from Cuturi function (in preparation)
