
VERSIONS OF THE HELLINGER DISTANCE

RAJENDRA BHATIA, STEPHANE GAUBERT, AND TANVI JAIN

Abstract. On the space of positive definite matrices we consider distance functions of the form $d(A,B) = [\operatorname{tr}\mathcal{A}(A,B) - \operatorname{tr}\mathcal{G}(A,B)]^{1/2}$, where $\mathcal{A}(A,B)$ is the arithmetic mean and $\mathcal{G}(A,B)$ is one of the different versions of the geometric mean. When $\mathcal{G}(A,B) = A^{1/2}B^{1/2}$ this distance is $\|A^{1/2} - B^{1/2}\|_2$, and when $\mathcal{G}(A,B) = (A^{1/2}BA^{1/2})^{1/2}$ it is the Bures-Wasserstein metric. We study two other cases: $\mathcal{G}(A,B) = A^{1/2}(A^{-1/2}BA^{-1/2})^{1/2}A^{1/2}$, the Pusz-Woronowicz geometric mean, and $\mathcal{G}(A,B) = \exp\left(\frac{\log A + \log B}{2}\right)$, the log Euclidean mean. With these choices $d(A,B)$ is no longer a metric, but it turns out that $d^2(A,B)$ is a divergence. We establish some (strict) convexity properties of these divergences. We obtain characterisations of barycentres of $m$ positive definite matrices with respect to these distance measures.

1. Introduction

Let $p$ and $q$ be two discrete probability distributions; i.e. $p = (p_1, \ldots, p_n)$ and $q = (q_1, \ldots, q_n)$ are $n$-vectors with nonnegative coordinates such that $\sum p_i = \sum q_i = 1$. The Hellinger distance between $p$ and $q$ is the Euclidean norm of the difference between the square roots of $p$ and $q$; i.e.
\[ d(p,q) = \|\sqrt{p} - \sqrt{q}\|_2 = \Big[\sum (\sqrt{p_i} - \sqrt{q_i})^2\Big]^{1/2} = \Big[\sum (p_i + q_i) - 2\sum \sqrt{p_i q_i}\Big]^{1/2}. \tag{1} \]
This distance, and its continuous version, are much used in statistics, where it is customary to take $d_H(p,q) = \frac{1}{\sqrt{2}}\, d(p,q)$ as the definition of the Hellinger distance. We have then
\[ d_H(p,q) = \sqrt{\operatorname{tr}\mathcal{A}(p,q) - \operatorname{tr}\mathcal{G}(p,q)}, \tag{2} \]
where $\mathcal{A}(p,q)$ is the arithmetic mean of the vectors $p$ and $q$, $\mathcal{G}(p,q)$ is their geometric mean, and $\operatorname{tr} x$ stands for $\sum x_i$.

arXiv:1901.01378v2 [math-ph] 8 Apr 2020
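The identity between (1) and (2) can be checked numerically. The following is a minimal sketch, not part of the original paper; the function names and the two test vectors are hypothetical choices made for illustration.

```python
import math

def hellinger_norm(p, q):
    """d(p, q): Euclidean norm of sqrt(p) - sqrt(q), as in (1)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def hellinger_from_means(p, q):
    """d_H(p, q) = sqrt(tr A(p, q) - tr G(p, q)), as in (2)."""
    tr_arith = sum((a + b) / 2 for a, b in zip(p, q))
    tr_geom = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Clamp at zero to guard against a tiny negative value caused by rounding.
    return math.sqrt(max(tr_arith - tr_geom, 0.0))

# Two illustrative probability vectors (assumed data, not from the paper).
p = (0.5, 0.3, 0.2)
q = (0.1, 0.6, 0.3)

# The two numbers printed should agree up to rounding: d / sqrt(2) equals d_H.
print(hellinger_norm(p, q) / math.sqrt(2), hellinger_from_means(p, q))
```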

A matrix/noncommutative/quantum version would seek to replace the probability vectors $p$ and $q$ by density matrices $A$ and $B$; i.e., positive semidefinite matrices $A, B$ with $\operatorname{tr} A = \operatorname{tr} B = 1$. In the discussion that follows, the restriction on the trace is not needed, and so we let $A$ and $B$ be any two positive semidefinite matrices. On the other hand, a part of our analysis requires $A$ and $B$ to be positive definite. This will be clear from the context.

2010 Subject Classification. 15B48, 49K35, 94A17, 81P45.
Key words and phrases. Geometric mean, matrix divergence, Bregman divergence, relative entropy, strict convexity, barycentre.

We let $\mathbb{P}$ be the set of $n \times n$ complex positive definite matrices. The notation $A \ge 0$ means that $A$ is positive semidefinite, and $A > 0$ that $A$ is positive definite.

Here we run into the essential difference between the matrix and the scalar case. For positive definite matrices $A$ and $B$, there is only one possible arithmetic mean, $\mathcal{A}(A,B) = (A+B)/2$. However, the geometric mean $\mathcal{G}(A,B)$ could have different meanings. Each of these leads to a different version of the Hellinger distance on matrices. In this paper we study some of these distances and their properties.

The Euclidean inner product on $n \times n$ matrices is defined as $\langle A, B\rangle = \operatorname{tr} A^*B$. The associated Euclidean norm is
\[ \|A\|_2 = (\operatorname{tr} A^*A)^{1/2} = \Big(\sum |a_{ij}|^2\Big)^{1/2}. \]

Recall that the matrices $AB$ and $BA$ have the same eigenvalues. If $A$ and $B$ are positive definite, then $AB$ is not positive definite unless $A$ and $B$ commute. However, the eigenvalues of $AB$ are all positive, as they are the same as the eigenvalues of $A^{1/2}BA^{1/2}$. Also, every matrix with positive eigenvalues has a unique square root with positive eigenvalues. If $A, B$ are positive definite, then we denote by $(AB)^{1/2}$ the square root that has positive eigenvalues. Since $(AB)^{1/2} = A^{1/2}(A^{1/2}BA^{1/2})^{1/2}A^{-1/2}$, the matrices $(AB)^{1/2}$ and $(A^{1/2}BA^{1/2})^{1/2}$ are similar, and hence have the same eigenvalues.

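The straightforward generalisation of (1) for positive definite matrices $A, B$ is evidently
\[ d_1(A,B) = \|A^{1/2} - B^{1/2}\|_2 = \big[\operatorname{tr}(A+B) - 2\operatorname{tr} A^{1/2}B^{1/2}\big]^{1/2}. \tag{3} \]
Another version could be
\[ d_2(A,B) = \big[\operatorname{tr}(A+B) - 2\operatorname{tr}(A^{1/2}BA^{1/2})^{1/2}\big]^{1/2} = \big[\operatorname{tr}(A+B) - 2\operatorname{tr}(AB)^{1/2}\big]^{1/2}. \tag{4} \]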

While it is clear from (3) that $d_1$ is a metric on $\mathbb{P}$, it is not obvious that $d_2$ is a metric. It turns out that
\[ d_2(A,B) = \min_U \|A^{1/2} - B^{1/2}U\|_2, \tag{5} \]
where the minimum is taken over all unitary matrices $U$. It follows from this that $d_2$ is a metric. This is called the Bures distance in the quantum information literature and the Wasserstein metric in the literature on optimal transport. It plays an important role in both these subjects. We refer the reader to [18] for a recent exposition, and to [12, 26, 28, 36] for earlier work. The quantity $F(A,B) = \operatorname{tr}(A^{1/2}BA^{1/2})^{1/2}$ is called the fidelity between the states $A$ and $B$. In the special case when $A = uu^*$, $B = vv^*$ are pure states, we have $F(A,B) = |u^*v|$ and $d_2(A,B) = \sqrt{2}\,(1 - |u^*v|)^{1/2}$. For qubit states this is the distance on the Bloch sphere.

For various reasons, theoretical and practical, the most accepted definition of the geometric mean of $A, B$ is the entity
\[ A\#B = A^{1/2}(A^{-1/2}BA^{-1/2})^{1/2}A^{1/2}. \tag{6} \]
This formula was introduced by Pusz and Woronowicz [32]. When $A$ and $B$ commute, $A\#B$ reduces to $A^{1/2}B^{1/2}$. The mean $A\#B$ has been studied extensively for several years and has remarkable properties that make it useful in diverse areas. One of them is its connection with inequalities related to monotonicity and convexity theorems for the quantum entropy. See Chapter 4 of [15] for a detailed exposition. Another object of interest has been the log Euclidean mean $\mathcal{L}(A,B)$ defined as
\[ \mathcal{L}(A,B) = \exp\left(\frac{\log A + \log B}{2}\right). \tag{7} \]
This mean too reduces to $A^{1/2}B^{1/2}$ when $A$ and $B$ commute, and has been used in various contexts [7], though it lacks some pleasing properties that $A\#B$ has.

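Thus it is natural to consider two more matrix versions of the Hellinger distance, viz,
\[ d_3(A,B) = \big[\operatorname{tr}(A+B) - 2\operatorname{tr}(A\#B)\big]^{1/2}, \tag{8} \]
and
\[ d_4(A,B) = \big[\operatorname{tr}(A+B) - 2\operatorname{tr}\mathcal{L}(A,B)\big]^{1/2}. \tag{9} \]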
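As a sanity check on the four definitions, the commuting case can be worked out by hand. The sketch below assumes that $A$ and $B$ commute, so that they are simultaneously diagonalisable with eigenvalues $a_i$ and $b_i$ (this notation is introduced here only for the illustration); all four versions of the geometric mean then coincide and the four distances collapse to the classical expression (1).

```latex
% Assumption: AB = BA, with common eigenbasis and eigenvalues a_i, b_i > 0.
\[
  A^{1/2}B^{1/2} \;=\; (A^{1/2}BA^{1/2})^{1/2} \;=\; A\#B \;=\; \mathcal{L}(A,B),
  \qquad \text{with eigenvalues } \sqrt{a_i b_i}.
\]
\[
  d_1(A,B) = d_2(A,B) = d_3(A,B) = d_4(A,B)
  = \Big[\sum_i (a_i + b_i) - 2\sum_i \sqrt{a_i b_i}\Big]^{1/2}
  = \Big[\sum_i \big(\sqrt{a_i} - \sqrt{b_i}\big)^2\Big]^{1/2}.
\]
```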

In view of what has been discussed, we may expect that d3 and d4 are metrics on P. However, it turns out that neither of them obeys the triangle inequality. Examples are given in Section 2. Nevertheless, this is compensated by the fact that the squares of d3 and d4 both are divergences, and hence they can serve as good distance measures.

A smooth function $\Phi$ from $\mathbb{P} \times \mathbb{P}$ to the set of nonnegative real numbers, $\mathbb{R}_+$, is called a divergence if
(i) $\Phi(A,B) = 0$ if and only if $A = B$.
(ii) The first derivative $D\Phi$ with respect to the second variable vanishes on the diagonal; i.e.,
\[ D\Phi(A,X)|_{X=A} = 0. \tag{10} \]
(iii) The second derivative $D^2\Phi$ is positive on the diagonal; i.e.,
\[ D^2\Phi(A,X)|_{X=A}(Y,Y) > 0 \quad \text{for all Hermitian } Y. \tag{11} \]
See [4], Sections 1.2 and 1.3.

The prototypical example is the Euclidean divergence $\Phi(A,B) = \|A - B\|_2^2$. The functions $d_1^2(A,B)$ and $d_2^2(A,B)$ are also divergences. Another well-known example is the Kullback-Leibler divergence [4]. A special kind of divergence is the Bregman divergence corresponding to a strictly convex differentiable function $\varphi : \mathbb{P} \to \mathbb{R}$. If $\varphi$ is such a function, then
\[ \Phi(A,B) = \varphi(A) - \varphi(B) - D\varphi(B)(A - B) \tag{12} \]
is called the Bregman divergence corresponding to $\varphi$. Not every divergence arises in this way. In particular, $d_H^2(p,q)$, the square of the Hellinger distance on probability vectors, is not a Bregman divergence.
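For the Euclidean divergence mentioned above, the three conditions in the definition can be verified in a few lines. The following worked computation is an illustration added here, written for Hermitian $A$, $X$ and a Hermitian direction $Y$.

```latex
% Worked check for Phi(A, X) = ||A - X||_2^2 = tr (A - X)^2, with A, X, Y Hermitian.
\[
  \Phi(A,X) = \operatorname{tr}(A - X)^2 = 0 \iff A = X ,
\]
\[
  D\Phi(A,X)(Y) = \frac{d}{dt}\Big|_{t=0} \operatorname{tr}(A - X - tY)^2
                = -2\operatorname{tr}(A - X)Y ,
  \qquad\text{so}\qquad D\Phi(A,X)\big|_{X=A} = 0 ,
\]
\[
  D^2\Phi(A,X)(Y,Y) = 2\operatorname{tr} Y^2 = 2\|Y\|_2^2 ,
  \qquad\text{which is positive for every Hermitian } Y \neq 0 .
\]
```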

Now we describe our main results. We will show that both the functions $\Phi_3(A,B) = d_3^2(A,B)$ and $\Phi_4(A,B) = d_4^2(A,B)$ are divergences. We will show that $\Phi_3$ and $\Phi_4$ are jointly convex in the variables $A$ and $B$, and strictly convex in each of the variables separately. One consequence of this is that for every $m$-tuple $A_1, \ldots, A_m$ in $\mathbb{P}$ and positive weights $w_1, \ldots, w_m$ the minimisation problem
\[ \min_{X>0} \sum_{j=1}^m w_j\, d^2(X, A_j) \tag{13} \]
has a unique solution when $d = d_3$ or $d_4$. When $d = d_1$ the minimum in (13) is attained at the $1/2$-power mean
\[ Q_{1/2} = \Big(\sum_{j=1}^m w_j A_j^{1/2}\Big)^2. \tag{14} \]

This is one of the much studied family of classical power means. When $d = d_2$, the minimiser in (13) is the Wasserstein mean [2, 18]. This is the unique solution of the matrix equation
\[ X = \sum_{j=1}^m w_j (X^{1/2}A_jX^{1/2})^{1/2}. \tag{15} \]
This mean has major applications in optimal transport, statistics, quantum information and other areas. Means with respect to various divergences have also been of interest in information theory. See e.g., [8, 30].

An inspection of (14) and (15) shows a common feature. Both for $d_1$ and $d_2$ the minimiser in (13) is the solution of the equation
\[ X = \sum_{j=1}^m w_j\, \mathcal{G}(X, A_j), \tag{16} \]
where $\mathcal{G}$ is the version of the geometric mean chosen in the definition of $d$. That is, $\mathcal{G}(A,B) = A^{1/2}B^{1/2}$ in the case of $d_1$, and $\mathcal{G}(A,B) = (A^{1/2}BA^{1/2})^{1/2}$ in the case of $d_2$. It turns out that this is also the case for $d_4$ but not for $d_3$. When $d = d_3$ the minimisation problem (13) has a unique solution $X$ which is also the solution of the matrix equation
\[ X^2 = \frac{2}{\pi} \sum_{j=1}^m w_j \int_0^\infty \big(\lambda X^{-1} + A_j^{-1}\big)^{-2} \sqrt{\lambda}\, d\lambda. \tag{17} \]

This, in general, is different from the solution of the matrix equation
\[ X = \sum_{j=1}^m w_j (X\#A_j). \tag{18} \]

When $d = d_4$, the problem (13) has a unique solution $X$ which is also the solution of the matrix equation
\[ X = \sum_{j=1}^m w_j\, \mathcal{L}(X, A_j). \tag{19} \]
In the past few years there has been extensive work on the Cartan mean (also known as the Karcher or Riemann mean) of positive definite matrices. This is the solution of the minimisation problem
\[ \min_{X>0} \sum_{j=1}^m w_j\, \delta^2(X, A_j), \tag{20} \]
where
\[ \delta(A,B) = \|\log A^{-1/2}BA^{-1/2}\|_2 \]
is the Cartan metric on the manifold $\mathbb{P}$. This mean from classical differential geometry has found several important applications [9, 15, 16, 24, 29].

Our analysis of Φ4 leads to some interesting facts about quantum relative entropy. We observe that the convex function ϕ(A) = tr (A log A − A) leads to the Bregman divergence Φ(A, B) = tr A(log A−log B)−tr(A−B), and the log Euclidean mean is the barycentre with respect to this Bregman divergence. As a related issue, we explore properties of barycentres with respect to general matrix Bregman divergences, and point out similarities and crucial differences between the scalar and matrix case.

Convexity properties of matrix Bregman divergences have been studied in [11, 31], and matrix approximation problems with divergences in [23]. Means with respect to matrix divergences are studied in [22]. In [35] Sra studied a related distance function
\[ \delta_S(A,B) := \Big[\log\det\Big(\frac{A+B}{2}\Big) - \frac{1}{2}\big(\log\det A + \log\det B\big)\Big]^{1/2} \]
and showed that this is a metric on $\mathbb{P}$. Several parallels between this metric and the Cartan metric are pointed out in [35].

2. Convexity and derivative computations

Inequalities for traces of matrix expressions have a long history. For the different geometric means mentioned in Section 1, we know [17] that
\[ \operatorname{tr}(A\#B) \le \operatorname{tr}\mathcal{L}(A,B) \le \operatorname{tr}(A^{1/2}B^{1/2}) \le \operatorname{tr}(AB)^{1/2}. \tag{21} \]

It follows that
\[ d_3^2(A,B) \ge d_4^2(A,B) \ge d_1^2(A,B) \ge d_2^2(A,B). \tag{22} \]
Since $d_1$ is a metric, this implies that $d_3^2(A,B) = 0$ if and only if $A = B$. The same is true for $d_4^2(A,B)$. Thus $\Phi_3$ and $\Phi_4$ satisfy the first condition in the definition of a divergence. To prove that $\Phi_3$ is a divergence we need to compute its first and second derivatives. These results are of independent interest.
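The chain (21), and hence (22), can be probed numerically on random examples. The sketch below is an illustration only: it assumes real symmetric positive definite matrices, computes matrix functions through an eigendecomposition, and uses hypothetical helper names (`fun_h`, `mean_traces`, `random_pd`) that do not come from the paper.

```python
import numpy as np

def fun_h(A, f):
    """Apply a scalar function f to a real symmetric matrix A via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.T

def mean_traces(A, B):
    """Return the four traces appearing in (21), in the order written there."""
    Ah = fun_h(A, np.sqrt)
    Aih = fun_h(A, lambda w: 1.0 / np.sqrt(w))
    Bh = fun_h(B, np.sqrt)
    t_pw = np.trace(Ah @ fun_h(Aih @ B @ Aih, np.sqrt) @ Ah)                     # tr(A # B)
    t_le = np.trace(fun_h(0.5 * (fun_h(A, np.log) + fun_h(B, np.log)), np.exp))  # tr L(A, B)
    t_prod = np.trace(Ah @ Bh)                                                   # tr(A^{1/2} B^{1/2})
    t_fid = np.sum(np.sqrt(np.linalg.eigvalsh(Ah @ B @ Ah)))                     # tr(AB)^{1/2}
    return t_pw, t_le, t_prod, t_fid

def random_pd(n, rng):
    """A random symmetric positive definite matrix (illustrative test data)."""
    M = rng.standard_normal((n, n))
    return M @ M.T + 0.1 * np.eye(n)

rng = np.random.default_rng(0)
A, B = random_pd(3, rng), random_pd(3, rng)
# The four printed numbers should be nondecreasing from left to right.
print(mean_traces(A, B))
```

Since each squared distance is $\operatorname{tr}(A+B)$ minus twice one of these traces, an increasing chain of traces corresponds to the decreasing chain of squared distances in (22).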

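Proposition 1. Let $A$ be a positive definite matrix. Let $g$ be the map on $\mathbb{P}$ defined as $g(X) = A\#X$. Then the derivative of $g$ is given by the formula
\[ Dg(X)(Y) = \int_0^\infty (\lambda + XA^{-1})^{-1}\, Y\, (\lambda + A^{-1}X)^{-1}\, d\nu(\lambda), \tag{23} \]
where $d\nu(\lambda) = \frac{1}{\pi}\lambda^{1/2}\, d\lambda$.

Proof. We will use the integral representation
\[ x^{1/2} = \frac{1}{\sqrt{2}} + \int_0^\infty \Big(\frac{\lambda}{\lambda^2 + 1} - \frac{1}{\lambda + x}\Big)\, d\nu(\lambda), \tag{24} \]
where $d\nu(\lambda) = \frac{1}{\pi}\lambda^{1/2}\, d\lambda$. See [14] p.143. Using this we see that the derivative of the function $X \mapsto X^{1/2}$ is the linear map
\[ DX^{1/2}(Y) = \int_0^\infty (\lambda + X)^{-1}\, Y\, (\lambda + X)^{-1}\, d\nu(\lambda), \tag{25} \]
where $Y$ is any Hermitian matrix. This shows that
\[ Dg(X)(Y) = \int_0^\infty A^{1/2}(\lambda + A^{-1/2}XA^{-1/2})^{-1}A^{-1/2}\, Y\, A^{-1/2}(\lambda + A^{-1/2}XA^{-1/2})^{-1}A^{1/2}\, d\nu(\lambda) \]
\[ \phantom{Dg(X)(Y)} = \int_0^\infty (\lambda + XA^{-1})^{-1}\, Y\, (\lambda + A^{-1}X)^{-1}\, d\nu(\lambda). \]
This proves the proposition.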

Theorem 2. Let $D\Phi_3$ and $D^2\Phi_3$ be the first and the second derivatives of $\Phi_3$. Then
\[ D\Phi_3(A,A) = 0, \tag{26} \]
\[ D^2\Phi_3(A,A)(Y,Y) = \frac{1}{2}\operatorname{tr} YA^{-1}Y. \tag{27} \]
(In other words, the gradient of $\Phi_3$ at every diagonal point is 0 and the Hessian is positive.)

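Proof. For a fixed $A$, let $g$ be the map on $\mathbb{P}$ defined as $g(X) = A\#X$. When $X = A$, the expression in (23) reduces to
\[ \frac{1}{\pi}\int_0^\infty \frac{\lambda^{1/2}}{(1+\lambda)^2}\, d\lambda \; Y = \frac{1}{2}\, Y. \]

Recalling that $\Phi_3(A,X) = \operatorname{tr}(A+X) - 2\operatorname{tr} g(X)$, we see that
\[ D\Phi_3(A,X)|_{X=A}(Y) = \operatorname{tr} Y - 2\cdot\tfrac{1}{2}\operatorname{tr} Y = 0 \quad \text{for all } Y. \]
This establishes (26). Next note that for the second derivative we have
\[ D^2\Phi_3(A,X)(Y,Z) = -2\, D^2(\operatorname{tr} g(X))(Y,Z). \tag{28} \]
From (23) we see that
\[ D(\operatorname{tr} g(X))(Y) = \int_0^\infty \operatorname{tr}\, (\lambda + XA^{-1})^{-1}\, Y\, (\lambda + A^{-1}X)^{-1}\, d\nu(\lambda). \tag{29} \]
By definition
\[ D^2(\operatorname{tr} g(X))(Y,Z) = \frac{d}{dt}\Big|_{t=0} D(\operatorname{tr} g(X + tZ))(Y). \]
Hence, from (29) we see that $D^2(\operatorname{tr} g(X))(Y,Z)$ is equal to
\[ -\int_0^\infty \operatorname{tr}\, (\lambda + XA^{-1})^{-1} ZA^{-1} (\lambda + XA^{-1})^{-1}\, Y\, (\lambda + A^{-1}X)^{-1}\, d\nu(\lambda) \]
\[ -\int_0^\infty \operatorname{tr}\, (\lambda + XA^{-1})^{-1}\, Y\, (\lambda + A^{-1}X)^{-1} A^{-1}Z (\lambda + A^{-1}X)^{-1}\, d\nu(\lambda). \tag{30} \]
When $X = A$ and $Z = Y$, each of the two integrands in (30) equals $(1+\lambda)^{-3}\operatorname{tr} YA^{-1}Y$, so together with (28) this gives
\[ D^2\Phi_3(A,A)(Y,Y) = \frac{4}{\pi}\int_0^\infty \frac{\lambda^{1/2}}{(1+\lambda)^3}\, d\lambda \; \operatorname{tr} YA^{-1}Y = \frac{1}{2}\operatorname{tr} YA^{-1}Y. \]
This proves (27).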
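Formulas (26) and (27) lend themselves to a finite-difference check. The sketch below is an illustration under stated assumptions: real symmetric matrices, a hypothetical test pair $A$, $Y$, and central differences with step `t`. It compares the numerical first and second directional derivatives of $X \mapsto \Phi_3(A,X)$ at $X = A$ with the values $0$ and $\frac12 \operatorname{tr} YA^{-1}Y$.

```python
import numpy as np

def fun_h(A, f):
    """Apply a scalar function f to a real symmetric matrix A via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.T

def phi3(A, X):
    """Phi_3(A, X) = tr(A + X) - 2 tr(A # X), with A # X as in (6)."""
    Ah = fun_h(A, np.sqrt)
    Aih = fun_h(A, lambda w: 1.0 / np.sqrt(w))
    G = Ah @ fun_h(Aih @ X @ Aih, np.sqrt) @ Ah
    return np.trace(A + X) - 2.0 * np.trace(G)

# Illustrative data (assumed): a positive definite A and a symmetric direction Y.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
Y = np.array([[1.0, 2.0], [2.0, -1.0]])
t = 1e-3  # finite-difference step; small enough that A +/- tY stays positive definite

# Central differences along Y at the diagonal point X = A.
first = (phi3(A, A + t * Y) - phi3(A, A - t * Y)) / (2 * t)
second = (phi3(A, A + t * Y) - 2 * phi3(A, A) + phi3(A, A - t * Y)) / t**2

# Value predicted by (27).
predicted = 0.5 * np.trace(Y @ np.linalg.inv(A) @ Y)
print(first, second, predicted)
```

The agreement is only up to the discretisation error of the difference quotients, so this supports the formulas rather than proving them.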

Consider maps $f$ defined on $\mathbb{P}$ and taking values in $\mathbb{P}$ or $\mathbb{R}_{++}$ (the set of positive real numbers). We say that $f$ is concave if for all $X, Y$ in $\mathbb{P}$ and $0 \le \alpha \le 1$
\[ f((1-\alpha)X + \alpha Y) \ge (1-\alpha)f(X) + \alpha f(Y). \tag{31} \]
It is strictly concave if the two sides of (31) are equal only if $X = Y$. A map $f$ from $\mathbb{P} \times \mathbb{P}$ into $\mathbb{P}$ or $\mathbb{R}_+$ is called jointly concave if for all $X_1, X_2, Y_1, Y_2$ in $\mathbb{P}$ and $0 \le \alpha \le 1$,
\[ f((1-\alpha)X_1 + \alpha Y_1,\, (1-\alpha)X_2 + \alpha Y_2) \ge (1-\alpha)f(X_1, X_2) + \alpha f(Y_1, Y_2). \]

It is a basic fact in the theory of the geometric mean that $A\#B$ is jointly concave in $A$ and $B$, see [5, 6]. However, it is not strictly jointly concave. Indeed, even the function $f(a,b) = \sqrt{ab}$ on $\mathbb{R}_+ \times \mathbb{R}_+$ is not strictly jointly concave (its restriction to the diagonal is linear). Our next theorem says that in each of the variables separately, the geometric mean is strictly concave.

Theorem 3. For each $A$ the function $f(X) = \operatorname{tr} A\#X$ is strictly concave on $\mathbb{P}$. This implies that the function $g(X) = A\#X$ is also strictly concave.

Proof. Suppose
\[ \operatorname{tr} A\#\Big(\frac{X+Y}{2}\Big) = \frac{\operatorname{tr} A\#X + \operatorname{tr} A\#Y}{2}. \]
We have to show that this implies $X = Y$. Rewrite the above equality as
\[ \operatorname{tr}\Big\{ A\#\Big(\frac{X+Y}{2}\Big) - \frac{A\#X + A\#Y}{2} \Big\} = 0. \]
By the concavity of $A\#X$, the expression inside the braces is positive semidefinite. The trace of such a matrix is zero if and only if the matrix itself is zero. Hence
\[ A\#\Big(\frac{X+Y}{2}\Big) = \frac{A\#X + A\#Y}{2}. \]
Using the definition (6) this can be written as
\[ A^{1/2}\Big(A^{-1/2}\,\frac{X+Y}{2}\,A^{-1/2}\Big)^{1/2}A^{1/2} = \frac{1}{2}A^{1/2}\big(A^{-1/2}XA^{-1/2}\big)^{1/2}A^{1/2} + \frac{1}{2}A^{1/2}\big(A^{-1/2}YA^{-1/2}\big)^{1/2}A^{1/2}. \]
Cancel the factors $A^{1/2}$ occurring on both sides, then square both sides, and rearrange terms to get
\[ A^{-1/2}(X+Y)A^{-1/2} - (A^{-1/2}XA^{-1/2})^{1/2}(A^{-1/2}YA^{-1/2})^{1/2} - (A^{-1/2}YA^{-1/2})^{1/2}(A^{-1/2}XA^{-1/2})^{1/2} = 0. \]
This is the same as saying
\[ \big[(A^{-1/2}XA^{-1/2})^{1/2} - (A^{-1/2}YA^{-1/2})^{1/2}\big]^2 = 0. \]
The square of a Hermitian matrix $Z$ is zero only if $Z = 0$. Hence, we have
\[ (A^{-1/2}XA^{-1/2})^{1/2} = (A^{-1/2}YA^{-1/2})^{1/2}. \]
From this it follows that $X = Y$. Finally, if $X, Y$ are two elements of $\mathbb{P}$ such that $g((X+Y)/2) = (g(X) + g(Y))/2$, then taking traces on both sides we have $f((X+Y)/2) = (f(X) + f(Y))/2$. We have seen that this implies $X = Y$.

As a consequence, we observe that

Φ3(A, B) = tr(A + B) − 2tr(A#B) is jointly convex in A and B and is strictly convex in each of the variables separately.

Now we turn to the analysis of $\Phi_4$ on the same lines as above. The arguments we present in this case are quite different. Writing $\Phi_1 = d_1^2$, from (22) we know that
\[ \Phi_3(A,B) \ge \Phi_4(A,B) \ge \Phi_1(A,B). \]
We also know that
\[ \Phi_3(A,A) = \Phi_4(A,A) = \Phi_1(A,A) = 0, \]
and $D\Phi_1(A,A) = D\Phi_3(A,A) = 0$. Together, these three relations lead to the conclusion that
\[ D\Phi_4(A,A) = 0. \]

Thus Φ4 satisfies condition (10).

By a theorem of Bhagwat and Subramanian [13]
\[ \exp\Big(\frac{1}{m}\sum_{j=1}^m \log A_j\Big) = \lim_{p\to 0^+}\Big(\frac{1}{m}\sum_{j=1}^m A_j^p\Big)^{1/p}. \tag{32} \]
One of the several remarkable concavity theorems of Carlen and Lieb [20, 21] says that the expression $\operatorname{tr}\big(\sum A_j^p\big)^{1/p}$ is jointly concave in $A_1, \ldots, A_m$ when $0 < p \le 1$, and jointly convex when $1 \le p \le 2$. Using equation (32) we obtain from this the joint concavity of $\operatorname{tr}\mathcal{L}(A,B)$. As a consequence $\Phi_4(A,B)$ is jointly convex in $A, B$. Hence we have proved the following theorem.
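The limit (32) can be observed numerically by evaluating the power mean for a decreasing sequence of exponents $p$. This is a sketch under assumptions: real symmetric positive definite inputs, three hypothetical test matrices, and illustrative helper names.

```python
import numpy as np

def fun_h(A, f):
    """Apply a scalar function f to a real symmetric matrix A via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.T

def power_mean(mats, p):
    """((1/m) sum_j A_j^p)^(1/p), the right-hand side of (32) before taking the limit."""
    M = sum(fun_h(A, lambda w: w ** p) for A in mats) / len(mats)
    return fun_h(M, lambda w: w ** (1.0 / p))

def log_euclidean_mean(mats):
    """exp((1/m) sum_j log A_j), the left-hand side of (32)."""
    return fun_h(sum(fun_h(A, np.log) for A in mats) / len(mats), np.exp)

# Illustrative positive definite matrices (assumed data).
mats = [
    np.array([[2.0, 1.0], [1.0, 3.0]]),
    np.array([[1.0, 0.0], [0.0, 4.0]]),
    np.array([[3.0, -1.0], [-1.0, 2.0]]),
]
target = log_euclidean_mean(mats)
exponents = (1.0, 0.1, 0.01, 0.001)
errors = [np.linalg.norm(power_mean(mats, p) - target) for p in exponents]
# The Frobenius-norm gap should shrink as p decreases towards 0.
print(errors)
```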

Theorem 4. The function Φ4 is a divergence on P.

We have shown that Φ3 and Φ4 are divergences. But unlike Φ1 and Φ2 they are not the squares of metrics on P, i.e., d3 and d4 are not metrics. The following two examples show that d3 and d4 do not satisfy the triangle inequality.

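Let
\[ A = \begin{pmatrix} 2 & 5 \\ 5 & 17 \end{pmatrix}, \quad B = \begin{pmatrix} 13 & 8 \\ 8 & 5 \end{pmatrix}, \quad C = \begin{pmatrix} 5 & 3 \\ 3 & 10 \end{pmatrix}. \]

Then $d_3(A,B) \approx 5.0347$ and $d_3(A,C) + d_3(C,B) \approx 4.6768$. This example is a small modification of one suggested to us by Suvrit Sra, to whom we are thankful.

Let
\[ A = \begin{pmatrix} 4 & -7 \\ -7 & 13 \end{pmatrix}, \quad B = \begin{pmatrix} 8 & -2 \\ -2 & 1 \end{pmatrix}, \quad C = \begin{pmatrix} 5 & -4 \\ -4 & 5 \end{pmatrix}. \]

Then $d_4(A,B) \approx 3.3349$ and $d_4(A,C) + d_4(C,B) \approx 3.3146$.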
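The two counterexamples above can be reproduced with a few lines of NumPy. The sketch below assumes real symmetric inputs and computes matrix functions by eigendecomposition; the function and variable names are illustrative, while the matrices and the reported values are those of the text.

```python
import numpy as np

def fun_h(A, f):
    """Apply a scalar function f to a real symmetric matrix A via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.T

def d3(A, B):
    """d_3 from (8), built on the geometric mean A # B of (6)."""
    Ah = fun_h(A, np.sqrt)
    Aih = fun_h(A, lambda w: 1.0 / np.sqrt(w))
    G = Ah @ fun_h(Aih @ B @ Aih, np.sqrt) @ Ah
    return np.sqrt(np.trace(A + B) - 2.0 * np.trace(G))

def d4(A, B):
    """d_4 from (9), built on the log Euclidean mean of (7)."""
    L = fun_h(0.5 * (fun_h(A, np.log) + fun_h(B, np.log)), np.exp)
    return np.sqrt(np.trace(A + B) - 2.0 * np.trace(L))

# First example (for d_3).
A3 = np.array([[2.0, 5.0], [5.0, 17.0]])
B3 = np.array([[13.0, 8.0], [8.0, 5.0]])
C3 = np.array([[5.0, 3.0], [3.0, 10.0]])
# The text reports about 5.0347 for the direct distance and 4.6768 for the detour via C.
print(d3(A3, B3), d3(A3, C3) + d3(C3, B3))

# Second example (for d_4).
A4 = np.array([[4.0, -7.0], [-7.0, 13.0]])
B4 = np.array([[8.0, -2.0], [-2.0, 1.0]])
C4 = np.array([[5.0, -4.0], [-4.0, 5.0]])
# The text reports about 3.3349 for the direct distance and 3.3146 for the detour via C.
print(d4(A4, B4), d4(A4, C4) + d4(C4, B4))
```

In both cases the direct distance exceeds the sum along the detour, which is the failure of the triangle inequality.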

Next we study some more properties of Φ4 , like its strict convexity in each of the arguments, and its connections with matrix entropy. To put these in context we recall some facts about Bregman divergence.

Let $\varphi : \mathbb{R}_+ \to \mathbb{R}$ be a smooth strictly convex function and let
\[ \Phi(x,y) = \varphi(x) - \varphi(y) - \varphi'(y)(x - y) \tag{33} \]
be the associated Bregman divergence. Then $\Phi$ is strictly convex in the variable $x$ but need not be convex in $y$. (See, e.g., [23] Section 2.2.) Given $x_1, \ldots, x_m$ in $\mathbb{R}_+$, the minimiser
\[ \operatorname*{argmin}_x \sum_{j=1}^m \frac{1}{m}\Phi(x_j, x) \tag{34} \]
always turns out to be the arithmetic mean
\[ \bar{x} = \sum_{j=1}^m \frac{1}{m} x_j, \]
independent of the mother function $\varphi$.

In fact, this property characterises Bregman divergences; see [23, 8]. We can also consider the problem
\[ \operatorname*{argmin}_x \sum_{j=1}^m \frac{1}{m}\Phi(x, x_j). \tag{35} \]
In this case, a calculation shows that the solution is the quasi-arithmetic mean (the Kolmogorov mean) associated with the function $\varphi'$. More precisely, the solution of (35), which we may think of as the mean, or the barycentre, of the points $x_1, \ldots, x_m$ with respect to the divergence $\Phi$, is
\[ \mu_\Phi(x_1, \ldots, x_m) = (\varphi')^{-1}\Big(\sum_{j=1}^m \frac{1}{m}\varphi'(x_j)\Big). \tag{36} \]
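The calculation behind (36) is short enough to spell out. The following is a sketch of that computation, where $\Psi$ is a name introduced here for the objective function in (35).

```latex
% Objective of (35), using the definition (33) of the Bregman divergence.
\[
  \Psi(x) = \sum_{j=1}^m \frac{1}{m}\Big[\varphi(x) - \varphi(x_j) - \varphi'(x_j)(x - x_j)\Big],
  \qquad
  \Psi'(x) = \varphi'(x) - \sum_{j=1}^m \frac{1}{m}\varphi'(x_j).
\]
% Psi is strictly convex because phi is, so its critical point is its unique minimiser.
\[
  \Psi'(x) = 0
  \iff
  \varphi'(x) = \sum_{j=1}^m \frac{1}{m}\varphi'(x_j)
  \iff
  x = (\varphi')^{-1}\Big(\sum_{j=1}^m \frac{1}{m}\varphi'(x_j)\Big).
\]
```

The inverse is well defined here because $\varphi'$ is strictly increasing and continuous, so its range is an interval that contains every average of the values $\varphi'(x_j)$.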

We wish to study the matrix version of the problems (34) and (35). Here we run into a basic difference between the one-variable and the several-variables cases. It is natural to replace the derivative $\varphi'$ in (36) by the gradient $\nabla\varphi$ in the several-variables case. If $\varphi$ is a differentiable strictly convex function defined on an open interval $I$ of $\mathbb{R}$, then its derivative $\varphi'$ is a strictly monotone continuous function, and hence a homeomorphism from $I$ to its image $\varphi'(I)$. In particular, $(\varphi')^{-1}$ is defined. The appropriate generalisation of these facts to the several-variable case requires the notion of a Legendre type function.

Definition (Section 26 in [33] or Def. 2.8 in [10]). Suppose $\varphi$ is a convex lower-semicontinuous function from $\mathbb{R}^n$ to $\mathbb{R} \cup \{+\infty\}$, and let $\operatorname{dom}\varphi := \{x \in \mathbb{R}^n \mid \varphi(x) < +\infty\}$. We say that $\varphi$ is of Legendre type if it satisfies
(i) $\operatorname{int}\operatorname{dom}\varphi \ne \emptyset$,
(ii) $\varphi$ is differentiable on $\operatorname{int}\operatorname{dom}\varphi$,
(iii) $\varphi$ is strictly convex on $\operatorname{int}\operatorname{dom}\varphi$,
(iv) $\lim_{t\to 0^+} \langle \nabla\varphi(x + t(y - x)),\, y - x\rangle = -\infty$ for all $x \in \operatorname{bdry}(\operatorname{dom}\varphi)$ and $y \in \operatorname{int}\operatorname{dom}\varphi$.

If $\varphi$ is of Legendre type, the gradient mapping $\nabla\varphi$ is a homeomorphism from $\operatorname{int}\operatorname{dom}\varphi$ to $\operatorname{int}\operatorname{dom}\varphi^*$, where $\varphi^*$ denotes the Legendre-Fenchel conjugate of $\varphi$. See Theorem 26.5 in [33].

Lemma 5. If $\varphi$ is of Legendre type, $\Phi$ is the Bregman divergence associated with $\varphi$, and $a_1, \ldots, a_m \in \operatorname{int}\operatorname{dom}\varphi$, then the function
\[ x \mapsto \sum_{j=1}^m \Phi(x, a_j) \]
achieves its minimum at a unique point, which belongs to $\operatorname{int}\operatorname{dom}\varphi$.

The proof is given in Appendix A. We shall apply this lemma in the situation where $\varphi$ is a convex function defined only on $\mathbb{P}$ and taking finite values on this set. The map $\varphi$ trivially extends to a convex lower-semicontinuous function defined on the whole space of Hermitian matrices: set $\varphi(X) := \liminf_{Y \to X,\, Y \in \mathbb{P}} \varphi(Y)$ for $X \in \operatorname{bdry}(\mathbb{P})$, and $\varphi(X) = +\infty$ if $X \notin \operatorname{clo}(\mathbb{P})$. We shall say that the original function $\varphi$ defined on $\mathbb{P}$ is of Legendre type if its extension is of Legendre type.

Theorem 6. Let $\varphi$ be a differentiable strictly convex function from $\mathbb{P}$ to $\mathbb{R}$, and let $\Phi$ be the Bregman divergence corresponding to $\varphi$. Then:
(i) The minimiser in the problem
\[ \operatorname*{argmin}_{X \in \mathbb{P}} \sum_{j=1}^m \frac{1}{m}\Phi(A_j, X) \tag{37} \]
is the arithmetic mean $\sum_{j=1}^m \frac{1}{m}A_j$.
(ii) If, in addition, $\varphi$ is of Legendre type, then the problem
\[ \operatorname*{argmin}_{X \in \mathbb{P}} \sum_{j=1}^m \frac{1}{m}\Phi(X, A_j) \tag{38} \]
has a unique solution, and this is given by
\[ X = (\nabla\varphi)^{-1}\Big(\sum_{j=1}^m \frac{1}{m}\nabla\varphi(A_j)\Big). \tag{39} \]
(iii) If $\psi$ is any differentiable strictly convex function from $\mathbb{R}_{++}$ to $\mathbb{R}$ and $\Phi$ is the Bregman divergence on $\mathbb{P}$ corresponding to the function $\varphi(X) := \operatorname{tr}\psi(X)$ on $\mathbb{P}$, then the solution of the minimisation problem (38) is
\[ X = (\psi')^{-1}\Big(\sum_{j=1}^m \frac{1}{m}\psi'(A_j)\Big). \tag{40} \]

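Proof. (i). Since $\Phi$ is given by (12),
\[ \sum_{j=1}^m \frac{1}{m}\Phi(A_j, X) = \sum_{j=1}^m \frac{1}{m}\varphi(A_j) - \varphi(X) - \sum_{j=1}^m \frac{1}{m} D\varphi(X)(A_j - X) \]
\[ = \sum_{j=1}^m \frac{1}{m}\varphi(A_j) - \varphi(X) - D\varphi(X)\Big(\sum_{j=1}^m \frac{1}{m}A_j - X\Big) \]
\[ = \sum_{j=1}^m \frac{1}{m}\varphi(A_j) - \varphi(X) - D\varphi(X)(\bar{A} - X), \]
where $\bar{A}$ denotes the arithmetic mean $\sum_{j=1}^m \frac{1}{m}A_j$. Hence
\[ \sum_{j=1}^m \frac{1}{m}\Phi(A_j, \bar{A}) = \sum_{j=1}^m \frac{1}{m}\varphi(A_j) - \varphi(\bar{A}). \]
Since $\varphi$ is strictly convex, for every $X \ne \bar{A}$
\[ \varphi(\bar{A}) - \varphi(X) > D\varphi(X)(\bar{A} - X). \]
This implies that
\[ \sum_{j=1}^m \frac{1}{m}\Phi(A_j, X) > \sum_{j=1}^m \frac{1}{m}\Phi(A_j, \bar{A}), \]
which shows that $\bar{A}$ is the unique minimiser of the problem (37).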

(ii). Let $\Psi$ be the map from $\mathbb{P}$ to $\mathbb{R}_+$ defined as
\[ \Psi(X) = \sum_{j=1}^m \frac{1}{m}\Phi(X, A_j). \]
Then
\[ D\Psi(X)(Z) = D\varphi(X)(Z) - \sum_{j=1}^m \frac{1}{m} D\varphi(A_j)(Z). \]
Lemma 5 shows that the minimum of the map $\Psi$ on the set $\mathbb{P}$ is achieved at some point $X \in \mathbb{P}$, and by the first order optimality condition, $D\Psi(X) = 0$, showing that $X$ satisfies (39).

(iii). If $\psi$ is a differentiable convex function on $\mathbb{R}_{++}$ and $\Phi$ is the Bregman divergence corresponding to $\varphi = \operatorname{tr}\psi$, then $\nabla\varphi(X) = \psi'(X)$. Hence, to show that the minimisation problem (38) has a solution, it suffices to show that the first order optimality condition
\[ \psi'(X) = \sum_{j=1}^m \frac{1}{m}\psi'(A_j) \tag{41} \]
is satisfied for some $X$ in $\mathbb{P}$. Since $\psi$ is strictly convex, as noted above, $\psi'$ is strictly increasing and is a homeomorphism from $\mathbb{R}_{++}$ to the interval $J := \psi'(\mathbb{R}_{++})$. The spectrum of each matrix $\psi'(A_j)$ belongs to $J$, and so the spectrum of $\sum_{j=1}^m \frac{1}{m}\psi'(A_j)$ also belongs to $J$, which implies that (41) is solvable.

The assumption that ϕ is of Legendre type is not needed in the tracial case (statement (iii)). Proposition 11 in Appendix B shows that this assumption cannot be dispensed with in the case of statement (ii). The much studied convex function

\[ \varphi(x) = x\log x - x, \tag{42} \]
on $\mathbb{R}_+$ leads to the Bregman divergence
\[ \Phi(x,y) = x(\log x - \log y) - (x - y). \tag{43} \]

This is called the Kullback-Leibler divergence. Since $\varphi'(x) = \log x$, the solution of the minimisation problem (35) in this case is
\[ \mu_\Phi(x_1, \ldots, x_m) = \exp\Big(\frac{1}{m}\sum_{j=1}^m \varphi'(x_j)\Big) = \prod_{j=1}^m x_j^{1/m}, \]
the geometric mean of $x_1, \ldots, x_m$.

As a matrix analogue of (42) one considers the function on P defined as ϕ(A) = tr(A log A − A). (44)

The associated Bregman divergence then is

Φ(A, B) = tr A(log A − log B) − tr(A − B). (45)

(See [4], p.12). The quantity

\[ S(A|B) = \operatorname{tr} A(\log A - \log B), \tag{46} \]
is called the relative entropy and has been of great interest in quantum information. Given $A_1, \ldots, A_m$ in $\mathbb{P}$, their barycentre with respect to the divergence $\Phi$, i.e., the solution of the minimisation problem (38), is the log Euclidean mean
\[ \mathcal{L}(A_1, \ldots, A_m) = \exp\Big(\frac{1}{m}\sum_{j=1}^m \log A_j\Big). \tag{47} \]

It is also of interest to compute the variance of the points $A_1, \ldots, A_m$ with respect to $\Phi$, i.e., the minimum value of the objective function in (38). This is the quantity
\[ \sigma_\Phi^2 = \sum_{j=1}^m \frac{1}{m}\Phi(\mu_\Phi, A_j). \tag{48} \]

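For the divergence $\Phi$ in (45), $\mu_\Phi$ is the log Euclidean mean $\mathcal{L}$ given in (47). So
\[ \sigma_\Phi^2 = \frac{1}{m}\sum_{j=1}^m \Phi(\mathcal{L}, A_j) \]
\[ = \frac{1}{m}\sum_{j=1}^m \big[\operatorname{tr}\mathcal{L}(\log\mathcal{L} - \log A_j) - \operatorname{tr}(\mathcal{L} - A_j)\big] \]
\[ = \operatorname{tr}\Big\{\frac{1}{m}\sum_{j=1}^m \Big[\mathcal{L}\Big(\frac{1}{m}\sum_{k=1}^m \log A_k - \log A_j\Big) - (\mathcal{L} - A_j)\Big]\Big\} \]
\[ = -\operatorname{tr}\mathcal{L} + \frac{1}{m}\sum_{j=1}^m \operatorname{tr} A_j. \]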

In other words

\[ \sigma_\Phi^2 = \operatorname{tr}\mathcal{A}(A_1, \ldots, A_m) - \operatorname{tr}\mathcal{L}(A_1, \ldots, A_m), \tag{49} \]
the difference between the traces of the arithmetic and the log Euclidean means of $A_1, \ldots, A_m$.

In particular, the divergence $\Phi_4(A,B)$ can be characterised using (49), as the minimum value
\[ \min_{X>0}\, [\Phi(X,A) + \Phi(X,B)], \tag{50} \]
where $\Phi$ is defined by (45). Using this characterisation we can show that the function $\Phi_4(A,B)$ is strictly convex in each of the variables separately. To this end, we recall the following lemma of convex analysis, showing that the "marginal" of a jointly convex function is convex; compare with Proposition 2.22 of [34] where a similar result (without the strictness conclusion) is provided.
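The variance identity (49) and the characterisation (50) can be checked numerically for $m = 2$. The sketch below is illustrative: it assumes real symmetric positive definite matrices, two hypothetical test matrices $A$ and $B$, and helper names that are not from the paper. It also compares the objective at the log Euclidean mean with its value at one perturbed point.

```python
import numpy as np

def fun_h(A, f):
    """Apply a scalar function f to a real symmetric matrix A via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.T

def bregman_kl(A, B):
    """Phi(A, B) = tr A(log A - log B) - tr(A - B), as in (45)."""
    return np.trace(A @ (fun_h(A, np.log) - fun_h(B, np.log))) - np.trace(A - B)

def log_euclidean_mean(mats):
    """exp((1/m) sum_j log A_j), as in (47)."""
    return fun_h(sum(fun_h(A, np.log) for A in mats) / len(mats), np.exp)

def objective(X, mats):
    """(1/m) sum_j Phi(X, A_j), the objective function of (38)."""
    return sum(bregman_kl(X, A) for A in mats) / len(mats)

# Illustrative data (assumed).
A = np.array([[2.0, 1.0], [1.0, 3.0]])
B = np.array([[1.0, 0.0], [0.0, 4.0]])
L = log_euclidean_mean([A, B])

variance = objective(L, [A, B])                        # left-hand side of (49)
trace_gap = np.trace(0.5 * (A + B)) - np.trace(L)      # right-hand side of (49)
phi4 = np.trace(A + B) - 2.0 * np.trace(L)             # Phi_4(A, B) = d_4(A, B)^2

# A nearby positive definite point; the objective there should be larger than at L.
X_perturbed = L + 0.05 * np.array([[1.0, 0.5], [0.5, -1.0]])
print(variance, trace_gap, phi4, objective(X_perturbed, [A, B]))
```

With $m = 2$ the minimum in (50) is twice the variance, which is why `phi4` is compared with `2 * variance`.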

Lemma 7. Let $f(x,y)$ be a jointly convex function which is strictly convex in each of its variables separately. Suppose that for each $a, b$
\[ g(a,b) = \min_x\, [f(x,a) + f(x,b)] \tag{51} \]
exists. Then the function $g(a,b)$ is jointly convex, and is strictly convex in each of the variables separately.

Proof. Given $a_1, a_2, b_1, b_2$, choose $x_1$ and $x_2$ such that
\[ g(a_1, b_1) = f(x_1, a_1) + f(x_1, b_1) \]
and
\[ g(a_2, b_2) = f(x_2, a_2) + f(x_2, b_2). \]

Then
\[ g\Big(\frac{a_1 + a_2}{2}, \frac{b_1 + b_2}{2}\Big) \le f\Big(\frac{x_1 + x_2}{2}, \frac{a_1 + a_2}{2}\Big) + f\Big(\frac{x_1 + x_2}{2}, \frac{b_1 + b_2}{2}\Big) \]
\[ \le \frac{1}{2}\,[f(x_1, a_1) + f(x_2, a_2) + f(x_1, b_1) + f(x_2, b_2)] \]
\[ = \frac{1}{2}\,[g(a_1, b_1) + g(a_2, b_2)]. \]
This shows that $g$ is jointly convex. Now we show that it is strictly convex in the first variable.

Let $a_1, a_2, b$ be any three points with $a_1 \ne a_2$. Choose $x_1$ and $x_2$ such that
\[ g(a_1, b) = f(x_1, a_1) + f(x_1, b) \]
and
\[ g(a_2, b) = f(x_2, a_2) + f(x_2, b). \]

Two cases arise. If $x_1 = x_2 = x$, then
\[ f\Big(\frac{x_1 + x_2}{2}, \frac{a_1 + a_2}{2}\Big) = f\Big(x, \frac{a_1 + a_2}{2}\Big) < \frac{1}{2}\,[f(x, a_1) + f(x, a_2)], \]
because of strict convexity of $f$ in the second variable. This implies that
\[ g\Big(\frac{a_1 + a_2}{2}, b\Big) < \frac{1}{2}\,[f(x, a_1) + f(x, a_2) + f(x, b) + f(x, b)] = \frac{1}{2}\,[g(a_1, b) + g(a_2, b)]. \]

If $x_1 \ne x_2$, then by strict convexity of $f$ in the first variable,
\[ f\Big(\frac{x_1 + x_2}{2}, b\Big) < \frac{1}{2}\,[f(x_1, b) + f(x_2, b)], \]
and by joint convexity of $f$
\[ f\Big(\frac{x_1 + x_2}{2}, \frac{a_1 + a_2}{2}\Big) \le \frac{1}{2}\,[f(x_1, a_1) + f(x_2, a_2)]. \]
Adding the last two inequalities we get
\[ g\Big(\frac{a_1 + a_2}{2}, b\Big) < \frac{1}{2}\,[g(a_1, b) + g(a_2, b)]. \]
Thus $g(a,b)$ is strictly convex in the first variable, and by symmetry it is so in the second variable.

Theorem 8. For each $A$, the function $f(X) = \Phi_4(X,A)$ is strictly convex on $\mathbb{P}$.

Proof. One of the fundamental, and best known, properties of the relative entropy $S(A|B)$ is that it is a jointly convex function of $A$ and $B$. (See, e.g., Section IX.6 in [14].) It is also known that if $\varphi$ is a strictly convex function on $\mathbb{R}_+$, then the function $\operatorname{tr}\varphi(X)$ is strictly convex on $\mathbb{P}$. (See, e.g., Theorem 4 in [19].) It follows from this that $S(A|B)$ is strictly convex in each of the variables separately. Combining these properties of $S(A|B)$, Lemma 7 and the characterisation of $\Phi_4(A,B)$ as the minimum value in (50) we obtain Theorem 8.

It might be pertinent to add here that the question of equality in the joint convexity inequality
\[ S\Big(\frac{A_1 + A_2}{2} \,\Big|\, \frac{B_1 + B_2}{2}\Big) \le \frac{S(A_1|B_1) + S(A_2|B_2)}{2} \tag{52} \]
has been addressed in [25] and [27]. In [27] Jencova and Ruskai show that equality holds in (52) if and only if
\[ \log(A_1 + A_2) - \log(B_1 + B_2) = \log A_1 - \log B_1 = \log A_2 - \log B_2. \]
On the other hand, Hiai et al. [25] show that equality holds in (52) if and only if
\[ (B_1 + B_2)^{-1/2}(A_1 + A_2)(B_1 + B_2)^{-1/2} = B_1^{-1/2}A_1B_1^{-1/2} = B_2^{-1/2}A_2B_2^{-1/2}. \]
We are thankful to F. Hiai for making us aware of these results.

3. Barycentres

If $f$ is a convex function on an open convex set, then a critical point of $f$ is the global minimum of $f$. If $f$ is strictly convex, then $f$ can have at most one such critical point. In this section we show that for $d = d_3$ and $d_4$, the objective function in (13) has a critical point, and hence in both cases the problem (13) has a unique solution.

Theorem 9. When $d = d_3$, the minimum in (13) is attained at a unique point $X$ which is the solution of the matrix equation (17)
\[ X^2 = \frac{2}{\pi}\sum_{j=1}^m w_j \int_0^\infty \big(\lambda X^{-1} + A_j^{-1}\big)^{-2}\sqrt{\lambda}\, d\lambda. \]

This minimiser is the $1/2$-power mean $Q_{1/2}$ given by (14) if $Q_{1/2}$ commutes with every $A_j$. In particular, the minimiser is $Q_{1/2}$ if

(i) all $A_j$'s commute, or (ii) $Q_{1/2} = I$.

Proof. For a fixed positive definite matrix $A$, define the map $G_A$ as
\[ G_A(X) = A\#X. \]

By Proposition 1, we have
\[ DG_A(X)(Y) = \int_0^\infty (\lambda + XA^{-1})^{-1}\, Y\, (\lambda + A^{-1}X)^{-1}\, d\nu(\lambda). \]
The objective function in (13) is
\[ f(X) = \sum_{j=1}^m w_j \Phi_3(X, A_j). \]

Using the definition of $\Phi_3$ we have
\[ Df(X)(Y) = \operatorname{tr}\Big(Y - 2\sum_{j=1}^m w_j\, DG_{A_j}(X)(Y)\Big). \]

Then using the above expression for $DG_{A_j}(X)$ we see that
\[ Df(X)(Y) = \operatorname{tr}\Big[Y - 2\sum_{j=1}^m w_j \int_0^\infty (\lambda + XA_j^{-1})^{-1}\, Y\, (\lambda + A_j^{-1}X)^{-1}\, d\nu(\lambda)\Big] \]
\[ = \operatorname{tr}\Big[\Big(I - 2\sum_{j=1}^m w_j \int_0^\infty \big((\lambda + XA_j^{-1})(\lambda + A_j^{-1}X)\big)^{-1}\, d\nu(\lambda)\Big) Y\Big]. \]
At the last step above we use the cyclicity of the trace function. Hence the critical point of $f$ is the matrix $X_0$ if and only if $X_0$ satisfies the matrix equation
\[ I = 2\sum_{j=1}^m w_j \int_0^\infty \big((\lambda + XA_j^{-1})(\lambda + A_j^{-1}X)\big)^{-1}\, d\nu(\lambda). \tag{53} \]
Taking congruence with $X$ on both sides we see that (53) is equivalent to (17).

We now show that there exists a positive definite matrix $X_0$ that satisfies (17). Let $\alpha, \beta > 0$ be such that $\alpha I \le A_j \le \beta I$ for all $j = 1, \ldots, m$, and let $K$ be the compact set $K = \{X \in \mathbb{P}(n) : \alpha I \le X \le \beta I\}$. Define the map $F : K \to \mathbb{P}(n)$ as
\[ F(X) = \Big[2\sum_{j=1}^m w_j \int_0^\infty \big(\lambda X^{-1} + A_j^{-1}\big)^{-2}\, d\nu(\lambda)\Big]^{1/2}. \]

Since $X, A_j \in K$, we have $(\lambda + 1)\alpha^{-1} \ge \lambda X^{-1} + A_j^{-1} \ge (\lambda + 1)\beta^{-1}$. Thus we have $\alpha^2/(\lambda + 1)^2 \le (\lambda X^{-1} + A_j^{-1})^{-2} \le \beta^2/(\lambda + 1)^2$. We know that $\int_0^\infty d\nu(\lambda)/(\lambda + 1)^2 = 1/2$. This gives $F(X) \in K$. By the Brouwer fixed point theorem, we get that $F$ has a fixed point $X_0$ in $K$. This fixed point $X_0$ is the solution of (17).

Suppose $Q_{1/2}$ commutes with every $A_j$, $1 \le j \le m$. We show that $Q_{1/2}$ satisfies (17). Differentiating (24) we get
\[ \frac{1}{2}x^{-1/2} = \int_0^\infty \frac{1}{(\lambda + x)^2}\, d\nu(\lambda). \tag{54} \]
Using $Q_{1/2}A_j^{-1} = A_j^{-1}Q_{1/2}$ in (53) and using (54) we get
\[ I = Q_{1/2}^{1/2}\, Q_{1/2}^{-1/2} = \sum_{j=1}^m w_j\, A_j^{1/2}Q_{1/2}^{-1/2} = \sum_{j=1}^m w_j \big(A_jQ_{1/2}^{-1}\big)^{1/2} \]
\[ = 2\sum_{j=1}^m w_j \int_0^\infty \big(\lambda + A_j^{-1}Q_{1/2}\big)^{-2}\, d\nu(\lambda) \]
\[ = 2\sum_{j=1}^m w_j \int_0^\infty \big((\lambda + Q_{1/2}A_j^{-1})(\lambda + A_j^{-1}Q_{1/2})\big)^{-1}\, d\nu(\lambda). \]
This proves the second statement of the theorem. If (i) holds, it follows from (14) that $Q_{1/2}$ commutes with the $A_j$'s. The same is trivially true if (ii) holds.
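The commuting case can be made concrete by reducing (17) to scalars. The computation below is a worked illustration, assuming positive numbers $x$ and $a_j$ in place of the matrices $X$ and $A_j$, and weights with $\sum_j w_j = 1$; it uses only (54) and the definition of $d\nu$.

```latex
% Scalar version of (17): x and a_j are positive numbers.
\[
  \big(\lambda x^{-1} + a_j^{-1}\big)^{-2} = \frac{x^2}{(\lambda + x/a_j)^2},
  \qquad
  \int_0^\infty \frac{d\nu(\lambda)}{(\lambda + x/a_j)^2}
  = \frac{1}{2}\Big(\frac{x}{a_j}\Big)^{-1/2}
  \quad\text{by (54)} .
\]
% Since (2/pi) sqrt(lambda) d lambda = 2 d nu(lambda), equation (17) becomes
\[
  x^2 = 2\sum_{j=1}^m w_j\, x^2 \cdot \frac{1}{2}\Big(\frac{a_j}{x}\Big)^{1/2}
  \iff
  x^{1/2} = \sum_{j=1}^m w_j\, a_j^{1/2}
  \iff
  x = \Big(\sum_{j=1}^m w_j\, a_j^{1/2}\Big)^2 ,
\]
% which is the scalar 1/2-power mean, in agreement with (14).
```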

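Theorem 10. When $d = d_4$ the minimum in (13) is attained at a unique point $X$ which satisfies the matrix equation (19)
\[ X = \sum_{j=1}^m w_j\, \mathcal{L}(X, A_j). \]

Proof. Start with the integral representation
\[ \log x = \int_0^\infty \Big(\frac{\lambda}{\lambda^2 + 1} - \frac{1}{\lambda + x}\Big)\, d\lambda, \quad x > 0. \]
This shows that for all $X > 0$ and all Hermitian $Y$ we have
\[ D(\log X)(Y) = \int_0^\infty (\lambda + X)^{-1}\, Y\, (\lambda + X)^{-1}\, d\lambda. \]

For a fixed $A$, let
\[ g(X) = \frac{1}{2}(\log A + \log X). \]
Then
\[ Dg(X)(Y) = \frac{1}{2}\int_0^\infty (\lambda + X)^{-1}\, Y\, (\lambda + X)^{-1}\, d\lambda. \tag{55} \]

The log Euclidean mean is $\mathcal{L}(A,X) = e^{g(X)}$. So, by the chain rule and Dyson's formula (see [14] p. 311), we have
\[ D\mathcal{L}(A,X)(Y) = \int_0^1 e^{(1-t)g(X)}\, Dg(X)(Y)\, e^{tg(X)}\, dt. \]

This shows that
\[ D(\operatorname{tr}\mathcal{L}(A,X))(Y) = \operatorname{tr}\int_0^1 e^{(1-t)g(X)}\, Dg(X)(Y)\, e^{tg(X)}\, dt = \operatorname{tr}\, e^{g(X)}\, Dg(X)(Y), \]
using the cyclicity of trace. Using (55) and the cyclicity once again, we obtain
\[ D(\operatorname{tr}\mathcal{L}(A,X))(Y) = \frac{1}{2}\operatorname{tr}\int_0^\infty (\lambda + X)^{-1}\, e^{g(X)}\, (\lambda + X)^{-1}\, Y\, d\lambda \]
\[ = \frac{1}{2}\operatorname{tr}\Big[\int_0^\infty (\lambda + X)^{-1}\, \mathcal{L}(A,X)\, (\lambda + X)^{-1}\, d\lambda\Big] Y. \]
Hence, for the function
\[ \Phi_4(A,X) = d_4^2(A,X) = \operatorname{tr}(A + X) - 2\operatorname{tr}\mathcal{L}(A,X), \]
we have
\[ D\Phi_4(A,X)(Y) = \operatorname{tr}\Big[I - \int_0^\infty (\lambda + X)^{-1}\, \mathcal{L}(A,X)\, (\lambda + X)^{-1}\, d\lambda\Big] Y. \]
The objective function in (13) is
\[ f(X) = \sum_{j=1}^m w_j \Phi_4(A_j, X). \]
So, we have
\[ Df(X)(Y) = \operatorname{tr}\Big[I - \int_0^\infty (\lambda + X)^{-1}\, Z\, (\lambda + X)^{-1}\, d\lambda\Big] Y, \tag{56} \]
where
\[ Z = \sum_{j=1}^m w_j\, \mathcal{L}(A_j, X). \]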

This shows that $Df(X) = 0$ if and only if
\[ \int_0^\infty (\lambda + X)^{-1}\, Z\, (\lambda + X)^{-1}\, d\lambda = I. \tag{57} \]

Choose an orthonormal basis in which $X = \operatorname{diag}(x_1, \ldots, x_n)$, and let $Z = [z_{ij}]$ in this basis. Then the condition (57) says that
\[ \int_0^\infty \frac{z_{ij}}{(\lambda + x_i)(\lambda + x_j)}\, d\lambda = \delta_{ij} \quad \text{for all } i, j. \]
This shows that $Z$ is diagonal, and
\[ \frac{1}{z_{ii}} = \int_0^\infty \frac{1}{(\lambda + x_i)^2}\, d\lambda = \frac{1}{x_i}. \]
Thus $X = Z = \sum_{j=1}^m w_j\, \mathcal{L}(A_j, X)$, as claimed.

We should also show that the equation (19) has a solution. Let $\alpha, \beta$ be positive numbers such that $\alpha I \le A_j \le \beta I$ for all $1 \le j \le m$. Let $K$ be the compact convex set $K = \{X \in \mathbb{P} : \alpha I \le X \le \beta I\}$. The function $\log X$ is operator monotone. So for all $X$ in $K$ we have $\log \alpha I \le \log X \le \log \beta I$. Hence $\mathcal{L}(X, A_j)$ is in $K$ for all $1 \le j \le m$. This shows that the function $F(X) = \sum_{j=1}^m w_j\, \mathcal{L}(X, A_j)$ maps $K$ into itself. By Brouwer's fixed point theorem $F$ has a fixed point $X$ in $K$. This $X$ is a solution of (19), hence a critical point of $f$, and by the strict convexity of $f$ it must be unique.
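Equation (19) suggests trying the map $F$ from the proof as a numerical iteration. The sketch below is only an experiment: the proof above uses $F$ to show that a fixed point exists, and convergence of the plain iteration $X \leftarrow F(X)$ is an assumption made here, not a claim of the paper. The helper names, the starting point and the iteration count are likewise illustrative. The example uses commuting (diagonal) matrices, for which each coordinate of the iteration is the scalar recursion $x \leftarrow \sqrt{x}\,\sum_j w_j\sqrt{a_j}$ and the limit is the $1/2$-power mean.

```python
import numpy as np

def fun_h(A, f):
    """Apply a scalar function f to a real symmetric matrix A via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.T

def log_euclidean(X, A):
    """L(X, A) = exp((log X + log A) / 2), as in (7)."""
    return fun_h(0.5 * (fun_h(X, np.log) + fun_h(A, np.log)), np.exp)

def fixed_point_map(X, mats, weights):
    """F(X) = sum_j w_j L(X, A_j), the right-hand side of (19)."""
    return sum(w * log_euclidean(X, A) for w, A in zip(weights, mats))

def d4_barycentre(mats, weights, iters=200):
    """Naive iteration X <- F(X), started at the weighted arithmetic mean (assumed to converge)."""
    X = sum(w * A for w, A in zip(weights, mats))
    for _ in range(iters):
        X = fixed_point_map(X, mats, weights)
        X = 0.5 * (X + X.T)  # remove asymmetry introduced by rounding
    return X

def residual(X, mats, weights):
    """Frobenius norm of X - F(X); zero exactly at a solution of (19)."""
    return np.linalg.norm(X - fixed_point_map(X, mats, weights))

# Commuting example: the limit is ((1 + 3)/2)^2 = 4 and ((2 + 4)/2)^2 = 9 on the diagonal.
mats = [np.diag([1.0, 4.0]), np.diag([9.0, 16.0])]
weights = [0.5, 0.5]
X = d4_barycentre(mats, weights)
print(X, residual(X, mats, weights))
```

For non-commuting inputs the same functions can be called unchanged, and `residual` gives a direct way to see whether the iterate actually solves (19).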

Finally, we remark that in the case of $d_1$, the barycentre is given explicitly by the formula (14). For $d_2, d_3, d_4$ it has been given implicitly as the solution of the equations (15), (17), (19), respectively. When $m = 2$ and $w_1 = w_2 = 1/2$, the solution of (15) is the Wasserstein mean of $A_1$ and $A_2$, defined as
$$\frac{1}{4} \left[ A_1 + A_2 + (A_1 A_2)^{1/2} + (A_2 A_1)^{1/2} \right].$$
See [18].
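The closed formula for the Wasserstein mean can be evaluated directly. The sketch below is an illustration only, assuming real symmetric positive definite inputs. It uses the identity $(A_1 A_2)^{1/2} = A_1^{1/2} (A_1^{1/2} A_2 A_1^{1/2})^{1/2} A_1^{-1/2}$ for the principal square root, and the fact that for real symmetric inputs $(A_2 A_1)^{1/2}$ is the transpose of $(A_1 A_2)^{1/2}$. The function names are hypothetical.

```python
import numpy as np

def spd_sqrt(S):
    """Square root of a real symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(w)) @ V.T

def sqrt_of_product(A1, A2):
    """Principal square root of the (generally non-symmetric) product A1 A2."""
    R = spd_sqrt(A1)
    return R @ spd_sqrt(R @ A2 @ R) @ np.linalg.inv(R)

def wasserstein_mean(A1, A2):
    """(1/4) [A1 + A2 + (A1 A2)^{1/2} + (A2 A1)^{1/2}] for real SPD A1, A2."""
    root_12 = sqrt_of_product(A1, A2)
    root_21 = root_12.T  # (A2 A1)^{1/2} in the real symmetric case
    return 0.25 * (A1 + A2 + root_12 + root_21)
```

For commuting $A_1$ and $A_2$ the formula collapses to $\big((A_1^{1/2} + A_2^{1/2})/2\big)^2$, which is a convenient check.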

Acknowledgements: The authors thank F. Hiai and S. Sra for helpful comments and references, and the anonymous referee for a careful reading of the manuscript. The first author is grateful to INRIA and École polytechnique, Palaiseau for visits that facilitated this work, and to CSIR (India) for the award of a Bhatnagar Fellowship.

Appendix A. Proof of Lemma 5

We make a variation of the proof of Theorem 3.12 in [10], dealing with a related problem (the minimisation of $\Phi$ over a closed convex set).

Since $\varphi$ is of Legendre type, Theorem 3.7(iii) of [10] shows that for all $a \in \operatorname{int} \operatorname{dom} \varphi$, the map $x \mapsto \Phi(x, a)$ is coercive, meaning that $\lim_{\|x\| \to \infty} \Phi(x, a) = +\infty$. A sum of coercive functions is coercive, and so the map
$$\Psi(x) := \frac{1}{m} \sum_{j=1}^{m} \Phi(x, a_j)$$
is coercive. The infimum of a coercive lower-semicontinuous function on a closed non-empty set is attained, so there is an element $\bar{x} \in \operatorname{clo} \operatorname{int} \operatorname{dom} \varphi$ such that $\inf_{x \in \operatorname{clo} \operatorname{int} \operatorname{dom} \varphi} \Psi(x) = \Psi(\bar{x}) < +\infty$. Suppose that $\bar{x}$ belongs to the boundary of $\operatorname{int} \operatorname{dom} \varphi$. Let us fix an arbitrary $z \in \operatorname{int} \operatorname{dom} \varphi$, and let $g(t) := \Psi((1 - t)\bar{x} + tz)$, defined for $t \in [0, 1)$. For $t \in (0, 1)$ we have
$$g'(t) = \frac{1}{m} \sum_{j=1}^{m} \langle \nabla\varphi((1 - t)\bar{x} + tz) - \nabla\varphi(a_j), \, z - \bar{x} \rangle.$$
Using property (iv) of the definition of Legendre type functions, we get that $\lim_{t \to 0^+} g'(t) = -\infty$, which entails that $g(t) < g(0) = \Psi(\bar{x})$ for $t$ small enough. Since $(1 - t)\bar{x} + tz \in \operatorname{int} \operatorname{dom} \varphi$ for all $t \in (0, 1)$, this contradicts the optimality of $\bar{x}$. So $\bar{x} \in \operatorname{int} \operatorname{dom} \varphi$, which proves Lemma 5.
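A scalar illustration may help to see the mechanism of the proof. The example below is an assumption made for illustration and is not taken from the paper: it uses the Legendre type function $\varphi(x) = x \log x - x$ on $[0, \infty)$, with $\varphi(0) = 0$, and points $a_1, \ldots, a_m > 0$. The derivative of $\Psi$ blows up to $-\infty$ at the boundary point $0$, so the minimiser is pushed into the interior, where it is the geometric mean of the $a_j$.

```latex
% Assumed scalar example: \varphi(x) = x \log x - x, \operatorname{dom}\varphi = [0,\infty).
\begin{align*}
\Phi(x, a) &= \varphi(x) - \varphi(a) - \varphi'(a)(x - a)
            = x \log\frac{x}{a} - x + a, \\
\Psi(x) &= \frac{1}{m} \sum_{j=1}^{m} \Phi(x, a_j), \qquad
\Psi'(x) = \log x - \frac{1}{m} \sum_{j=1}^{m} \log a_j, \\
\lim_{x \to 0^+} \Psi'(x) &= -\infty, \qquad
\Psi'(\bar{x}) = 0 \iff \bar{x} = \Big( \prod_{j=1}^{m} a_j \Big)^{1/m}.
\end{align*}
```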

Appendix B. Examples

In the last statement of Theorem 6, dealing with tracial convex functions, we required $\varphi$ to be differentiable and strictly convex on $\mathbb{P}$. In the second statement, dealing with the non-tracial case, we made a stronger assumption, requiring $\varphi$ to be of Legendre type. We now give an example showing that the Legendre condition cannot be dispensed with. To this end, it is convenient to construct first an example showing the tightness of Lemma 5.

Need for the Legendre condition in Lemma 5. Let us fix $N > 3$, let $e = (1, 1)^\top \in \mathbb{R}^2$,
$$L = \begin{pmatrix} N - 1 & -2 \\ -2 & N - 1 \end{pmatrix} \tag{58}$$
and consider the affine transformation $g(x) = e + Lx$. Let $a = (N, 0)^\top$, $b = (0, N)^\top$, and
$$\bar{a} := g^{-1}(a) = \frac{1}{N^2 - 2N - 3} \begin{pmatrix} N^2 - 2N - 1 \\ N - 1 \end{pmatrix}, \qquad \bar{b} := g^{-1}(b) = \frac{1}{N^2 - 2N - 3} \begin{pmatrix} N - 1 \\ N^2 - 2N - 1 \end{pmatrix}.$$
Observe that $\bar{a}, \bar{b} \in \mathbb{R}^2_{++}$ since $N > 3$.

Consider now, for $p > 1$, the map $\varphi(x) := \|x\|_p^p = |x_1|^p + |x_2|^p$ defined on $\mathbb{R}^2$ and $\bar{\varphi}(x) = \varphi(g(x))$. Observe that $\varphi$ is strictly convex and differentiable. Let $\bar{\Phi}$ denote the Bregman divergence associated with $\bar{\varphi}$, and let
$$\bar{\Psi}(x) := \frac{1}{2} \big( \bar{\Phi}(x, \bar{a}) + \bar{\Phi}(x, \bar{b}) \big).$$
We claim that $0$ is the unique point of minimum of $\bar{\Psi}$ over $\mathbb{R}^2_+$. Indeed,
$$\nabla\bar{\Psi}(x) = L^\top(\nabla\varphi(g(x))) - \frac{1}{2} \big( L^\top(\nabla\varphi(a)) + L^\top(\nabla\varphi(b)) \big),$$
from which we get
$$\nabla\bar{\Psi}(0) = L\big( p(1 - N^{p-1}/2) e \big) = (N - 3) p (1 - N^{p-1}/2) e.$$
It follows that $\nabla\bar{\Psi}(0) \in \mathbb{R}^2_{++}$ if $p > 1$ is chosen close enough to $1$, so that $1 - N^{p-1}/2 > 0$. Then, since $\bar{\Psi}$ is convex, we have
$$\bar{\Psi}(x) - \bar{\Psi}(0) \geq \langle \nabla\bar{\Psi}(0), x \rangle > 0, \quad \text{for all } x \in \mathbb{R}^2_+ \setminus \{0\}, \tag{59}$$
showing the claim.

Consider now the modification $\hat{\varphi}$ of $\bar{\varphi}$, so that $\hat{\varphi}(x) = \bar{\varphi}(x)$ for $x \in \mathbb{R}^2_+$, and $\hat{\varphi}(x) = +\infty$ otherwise. The function $\hat{\varphi}$ is strictly convex, lower-semicontinuous, and differentiable on the interior of its domain, but not of Legendre type, and the conclusion of Lemma 5 does not apply to it. The geometric intuition leading to this example is described in the figure.

[Figure (labelled points $a$, $b$, $e$, $u$ and cone $C$): The example illustrated. The point $u$ is the unconstrained minimum of the sum of Bregman divergences $\Psi(x) := \Phi(x, a) + \Phi(x, b)$ associated with $\varphi(x) = x_1^p + x_2^p$, here $p = 1.2$. Level curves of $\Psi$ are shown. The minimum of $\Psi$ on the simplicial cone $C$ is at the unit vector $e$. An affine change of variables sending $C$ to the standard quadrant, and a lift to the cone of positive semidefinite matrices, leads to Proposition 11.]
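The computations in this example are easy to check numerically. The sketch below is an illustration under stated assumptions: the value $N = 4$ is an arbitrary choice satisfying $N > 3$, $p = 1.2$ is the value quoted in the figure caption, and all function names are hypothetical. The Bregman divergence of $\bar{\varphi}$ is written in terms of the image points $a = g(\bar{a})$ and $b = g(\bar{b})$, so that $\nabla\varphi$ is evaluated at an exact zero coordinate, where it is not Lipschitz for $p < 2$.

```python
import numpy as np

N, p = 4.0, 1.2  # assumed values: any N > 3 with N**(p - 1) < 2 would do
e = np.ones(2)
L = np.array([[N - 1.0, -2.0], [-2.0, N - 1.0]])
a = np.array([N, 0.0])
b = np.array([0.0, N])
den = N**2 - 2 * N - 3
a_bar = np.array([N**2 - 2 * N - 1, N - 1]) / den  # g^{-1}(a)
b_bar = np.array([N - 1, N**2 - 2 * N - 1]) / den  # g^{-1}(b)

def g(x):
    """Affine map g(x) = e + L x."""
    return e + L @ x

def phi(y):
    """phi(y) = |y_1|^p + |y_2|^p."""
    return float(np.sum(np.abs(y) ** p))

def grad_phi(y):
    """Gradient of phi, valid for p > 1."""
    return p * np.sign(y) * np.abs(y) ** (p - 1)

def bregman_bar(x, c, c_bar):
    """Bregman divergence of phi_bar = phi o g at (x, c_bar), where c = g(c_bar)."""
    return phi(g(x)) - phi(c) - (L.T @ grad_phi(c)) @ (x - c_bar)

def psi_bar(x):
    """Psi_bar(x) = (Phi_bar(x, a_bar) + Phi_bar(x, b_bar)) / 2."""
    return 0.5 * (bregman_bar(x, a, a_bar) + bregman_bar(x, b, b_bar))

def grad_psi_bar(x):
    """Gradient of Psi_bar, as in the displayed formula."""
    return L.T @ grad_phi(g(x)) - 0.5 * (L.T @ grad_phi(a) + L.T @ grad_phi(b))
```

With these definitions, `grad_psi_bar(np.zeros(2))` can be compared with $(N - 3) p (1 - N^{p-1}/2) e$, and `psi_bar` can be evaluated on a grid in the nonnegative quadrant to observe that its smallest value there is at the origin.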

Need for the Legendre condition in Theorem 6. We next construct an example showing that the Legendre condition in the second statement of Theorem 6 cannot be dispensed with. Observe that the inverse of the linear operator $L$ in (58) is given by
$$L^{-1} = \frac{1}{N^2 - 2N - 3} \begin{pmatrix} N - 1 & 2 \\ 2 & N - 1 \end{pmatrix}.$$
In particular, it is a matrix with nonnegative entries. We set $\tau = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, and consider the "quantum" analogue of $L$, i.e.,
$$T(X) = (N - 1)X - 2\tau X \tau.$$
Then,
$$T^{-1}(X) = \frac{1}{N^2 - 2N - 3} \big( (N - 1)X + 2\tau X \tau \big)$$
is a completely positive map leaving $\mathbb{P}$ invariant. The analogue of the map $g$ is $G(X) = I + T(X)$, where $I$ denotes the identity matrix.

We now consider the map $\varphi(X) := \|X\|_p^p = \operatorname{tr}(|X|^p)$ defined on the space of Hermitian matrices. The function $\varphi$ is differentiable and strictly convex, still assuming that $p > 1$. We set $\bar{A} := \operatorname{diag}(\bar{a}) \in \mathbb{P}$, $\bar{B} := \operatorname{diag}(\bar{b}) \in \mathbb{P}$, and now define $\bar{\Phi}$ to be the Bregman divergence associated with $\bar{\varphi} := \varphi \circ G$. Let
$$\bar{\Psi}(X) := \frac{1}{2} \big( \bar{\Phi}(X, \bar{A}) + \bar{\Phi}(X, \bar{B}) \big).$$
We then have the following result.

Proposition 11. The minimum of the function $\bar{\Psi}$ on the closure of $\mathbb{P}$ is achieved at the point $0$. Moreover, the equation
$$\nabla\bar{\varphi}(X) = \frac{1}{2} \big( \nabla\bar{\varphi}(\bar{A}) + \nabla\bar{\varphi}(\bar{B}) \big) \tag{60}$$
has no solution $X$ in $\mathbb{P}$.

Proof. From [3] (Theorem 2.1) or [1] (Theorem 2.3), we have
$$\frac{d}{dt}\Big|_{t=0} \operatorname{tr} |X + tY|^p = p \operatorname{Re} \operatorname{tr} |X|^{p-1} U^* Y,$$
where $X = U|X|$ is the polar decomposition of $X$. In particular, if $X$ is diagonal and positive semidefinite, $\nabla\varphi(X) = pX^{p-1}$. Then, by a computation similar to the one in the scalar case above, we get
$$\nabla\bar{\Psi}(0) = (N - 3) p (1 - N^{p-1}/2) I \in \mathbb{P}.$$
We conclude, as in (59), that
$$\bar{\Psi}(X) - \bar{\Psi}(0) \geq \langle \nabla\bar{\Psi}(0), X \rangle > 0, \quad \text{for all } X \in \operatorname{clo} \mathbb{P} \setminus \{0\},$$
where now $\langle \cdot, \cdot \rangle$ is the Frobenius scalar product on the space of Hermitian matrices. It follows that $0$ is the unique point of minimum of $\bar{\Psi}$ on $\operatorname{clo} \mathbb{P}$. Moreover, if the equation (60) had a solution $X \in \mathbb{P}$, the first order optimality condition for the minimisation of the function $\bar{\Psi}$ over $\mathbb{P}$ would be satisfied, showing that $\bar{\Psi}(Y) \geq \bar{\Psi}(X)$ for all $Y \in \mathbb{P}$, and by density, $\bar{\Psi}(0) \geq \bar{\Psi}(X)$, contradicting the fact that $0$ is the unique point of minimum of $\bar{\Psi}$ over $\operatorname{clo} \mathbb{P}$.
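The algebraic facts used in this construction can be checked numerically. The sketch below is an illustration only: $N = 4$ and $p = 1.2$ are assumed values (the latter is the one quoted in the figure caption), the function names are hypothetical, and only real symmetric $2 \times 2$ matrices are considered. It relies on $T$ being self-adjoint for the Frobenius product (because $\tau$ is symmetric), so that $\nabla\bar{\varphi}(X) = T(\nabla\varphi(G(X)))$, and on $G(0) = I$, $G(\bar{A}) = \operatorname{diag}(a)$, $G(\bar{B}) = \operatorname{diag}(b)$.

```python
import numpy as np

N, p = 4.0, 1.2  # assumed values with N > 3 and N**(p - 1) < 2
tau = np.array([[0.0, 1.0], [1.0, 0.0]])
L = np.array([[N - 1.0, -2.0], [-2.0, N - 1.0]])

def T(X):
    """T(X) = (N - 1) X - 2 tau X tau."""
    return (N - 1) * X - 2 * tau @ X @ tau

def T_inv(X):
    """Claimed inverse of T; a positive combination of congruences."""
    return ((N - 1) * X + 2 * tau @ X @ tau) / (N**2 - 2 * N - 3)

def grad_phi_diag_psd(X):
    """Gradient p X^{p-1} of tr|X|^p at a diagonal positive semidefinite X."""
    return np.diag(p * np.diag(X) ** (p - 1))

def grad_psi_bar_at_zero():
    """Gradient of Psi_bar at 0, using G(0) = I and the images diag(a), diag(b)."""
    A = np.diag([N, 0.0])
    B = np.diag([0.0, N])
    return T(grad_phi_diag_psd(np.eye(2))) - 0.5 * (
        T(grad_phi_diag_psd(A)) + T(grad_phi_diag_psd(B)))
```

On diagonal matrices $T$ acts as $L$ acts on vectors, that is, `T(np.diag(x))` equals `np.diag(L @ x)`, which is the sense in which $T$ is the analogue of $L$.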

Note added to the second version: In the earlier version of this paper, posted on January 5, 2019, which appeared in Letters in Mathematical Physics 109 (2019), 1777-1804, we made an unfortunate error. Theorem 9 in that version wrongly claimed that for the case $d = d_3$ the solution of the minimisation problem (13) is also the solution of the matrix equation (18). The mistake in the statement and in the proof has been pointed out in J. Pitrik and D. Virosztek, Quantum Hellinger distances revisited, arXiv:1903.10455v3. In that paper some more general divergence functions are considered, the barycentre equations are derived, and an example is given to show that the solutions of the matrix equations (17) and (18) need not be the same.

References

[1] T.J. Abatzoglou, Norm derivatives on spaces of operators, Math. Ann. 239 (1979), 129-135.
[2] M. Agueh and G. Carlier, Barycenters in the Wasserstein space, SIAM J. Math. Anal. 43 (2011), 904-924.
[3] J.G. Aiken, J.A. Erdos and J.A. Goldstein, Unitary approximation of positive operators, Illinois J. Math. 24 (1980), 61-72.
[4] S. Amari, Information Geometry and its Applications, Springer (Tokyo), 2016.
[5] T. Ando, Concavity of certain maps on positive definite matrices and applications to Hadamard products, Linear Algebra Appl. 26 (1979), 203-241.
[6] T. Ando, C.-K. Li and R. Mathias, Geometric means, Linear Algebra Appl. 385 (2004), 305-334.
[7] V. Arsigny, P. Fillard, X. Pennec and N. Ayache, Geometric means in a novel vector space structure on symmetric positive-definite matrices, SIAM J. Matrix Anal. Appl. 29 (2007), 328-347.
[8] A. Banerjee, S. Merugu, I. S. Dhillon and J. Ghosh, Clustering with Bregman divergences, J. Mach. Learn. Res. 6 (2005), 1705-1749.
[9] F. Barbaresco, Innovative tools for radar signal processing based on Cartan's geometry of SPD matrices and information geometry, IEEE Radar Conference, Rome, May 2008.
[10] H. H. Bauschke and J. M. Borwein, Legendre functions and the method of random Bregman projections, J. Convex Anal. 4 (1997), 27-67.
[11] H. H. Bauschke and J. M. Borwein, Joint and separate convexity of the Bregman distance, Stud. Comput. Math. 8 (2001), 23-36.
[12] I. Bengtsson and K. Zyczkowski, Geometry of Quantum States: An Introduction to Quantum Entanglement, Cambridge University Press, 2006.
[13] K. V. Bhagwat and R. Subramanian, Inequalities between means of positive operators, Math. Proc. Camb. Phil. Soc. 83 (1978), 393-401.
[14] R. Bhatia, Matrix Analysis, Springer, 1997.
[15] R. Bhatia, Positive Definite Matrices, Princeton University Press, 2007.
[16] R. Bhatia, The Riemannian mean of positive matrices, in Matrix Information Geometry, eds. F. Nielsen and R. Bhatia, Springer (2013), 35-51.
[17] R. Bhatia and P. Grover, Norm inequalities related to the matrix geometric mean, Linear Algebra Appl. 437 (2012), 726-733.
[18] R. Bhatia, T. Jain and Y. Lim, On the Bures-Wasserstein distance between positive definite matrices, Expos. Math., to appear.
[19] R. Bhatia, T. Jain and Y. Lim, Strong convexity of sandwiched entropies and related optimization problems, Rev. Math. Phys. 30 (2018), 1850014.
[20] E. A. Carlen and E. H. Lieb, A Minkowski type trace inequality and strong subadditivity of quantum entropy, Advances in the Mathematical Sciences, AMS Transl. 180 (1999), 59-68.
[21] E. A. Carlen and E. H. Lieb, A Minkowski type trace inequality and strong subadditivity of quantum entropy. II. Convexity and concavity, Lett. Math. Phys. 83 (2008), 107-126.
[22] Z. Chebbi and M. Moakher, Means of Hermitian positive-definite matrices based on the log-determinant α-divergence function, Linear Algebra Appl. 436 (2012), 1872-1889.
[23] I. S. Dhillon and J. A. Tropp, Matrix nearness problems with Bregman divergences, SIAM J. Matrix Anal. Appl. 29 (2004), 1120-1146.
[24] P. Fletcher and S. Joshi, Riemannian geometry for the statistical analysis of diffusion tensor data, Signal Processing 87 (2007), 250-262.
[25] F. Hiai, M. Mosonyi, D. Petz and C. Beny, Quantum f-divergences and error correction, Rev. Math. Phys. 23 (2011), 691-747.
[26] A. Jenčová, Geodesic distances on density matrices, J. Math. Phys. 45 (2004), 1787-1794.
[27] A. Jenčová and M. B. Ruskai, A unified treatment of convexity of relative entropy and related trace functions with conditions for equality, Rev. Math. Phys. 22 (2010), 1099-1121.
[28] K. Modin, Geometry of matrix decompositions seen through optimal transport and information geometry, J. Geom. Mech. 9 (2017), 335-390.
[29] F. Nielsen and R. Bhatia, eds., Matrix Information Geometry, Springer, 2013.
[30] F. Nielsen and S. Boltz, The Burbea-Rao and Bhattacharyya centroids, IEEE Transactions on Information Theory 57 (2011), 5455-5466.
[31] J. Pitrik and D. Virosztek, On the joint convexity of the Bregman divergence of matrices, Lett. Math. Phys. 105 (2015), 675-692.
[32] W. Pusz and S. L. Woronowicz, Functional calculus for sesquilinear forms and the purification map, Rep. Math. Phys. 8 (1975), 159-170.
[33] R. T. Rockafellar, Convex Analysis, Princeton University Press, 1970.
[34] R. T. Rockafellar and R. J-B. Wets, Variational Analysis, Springer, 1998.
[35] S. Sra, Positive definite matrices and the S-divergence, Proc. Amer. Math. Soc. 144 (2016), 2787-2797.
[36] A. Takatsu, Wasserstein geometry of Gaussian measures, Osaka J. Math. 48 (2011), 1005-1026.

January 4, 2019

Ashoka University, Sonepat, Haryana, 131029, India

INRIA and CMAP, Ecole Polytechnique, CNRS, 91128, Palaiseau, France

Indian Statistical Institute, New Delhi 110016, India