arXiv:1901.01378v2 [math-ph] 8 Apr 2020
MATRIX VERSIONS OF THE HELLINGER DISTANCE

RAJENDRA BHATIA, STEPHANE GAUBERT, AND TANVI JAIN

Abstract. On the space of positive definite matrices we consider distance functions of the form $d(A,B) = \left[\operatorname{tr}\mathcal{A}(A,B) - \operatorname{tr}\mathcal{G}(A,B)\right]^{1/2}$, where $\mathcal{A}(A,B)$ is the arithmetic mean and $\mathcal{G}(A,B)$ is one of the different versions of the geometric mean. When $\mathcal{G}(A,B) = A^{1/2}B^{1/2}$ this distance is $\|A^{1/2} - B^{1/2}\|_2$, and when $\mathcal{G}(A,B) = (A^{1/2}BA^{1/2})^{1/2}$ it is the Bures-Wasserstein metric. We study two other cases: $\mathcal{G}(A,B) = A^{1/2}(A^{-1/2}BA^{-1/2})^{1/2}A^{1/2}$, the Pusz-Woronowicz geometric mean, and $\mathcal{G}(A,B) = \exp\left(\frac{\log A + \log B}{2}\right)$, the log Euclidean mean. With these choices $d(A,B)$ is no longer a metric, but it turns out that $d^2(A,B)$ is a divergence. We establish some (strict) convexity properties of these divergences. We obtain characterisations of barycentres of $m$ positive definite matrices with respect to these distance measures.

1. Introduction

Let $p$ and $q$ be two discrete probability distributions; i.e. $p = (p_1, \ldots, p_n)$ and $q = (q_1, \ldots, q_n)$ are $n$-vectors with nonnegative coordinates such that $\sum p_i = \sum q_i = 1$. The Hellinger distance between $p$ and $q$ is the Euclidean norm of the difference between the square roots of $p$ and $q$; i.e.

$$d(p,q) = \|\sqrt{p} - \sqrt{q}\|_2 = \left[\sum \left(\sqrt{p_i} - \sqrt{q_i}\right)^2\right]^{1/2} = \left[\sum (p_i + q_i) - 2\sum \sqrt{p_i q_i}\right]^{1/2}. \tag{1}$$

This distance and its continuous version are much used in statistics, where it is customary to take $d_H(p,q) = \frac{1}{\sqrt{2}}\, d(p,q)$ as the definition of the Hellinger distance.
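As a quick numerical check of (1), the following minimal Python sketch evaluates both expressions for $d(p,q)$ on a pair of example distributions and then applies the $1/\sqrt{2}$ normalisation. The function names and the sample vectors are illustrative choices made here, not taken from the paper.

```python
import math

def hellinger_unnormalised(p, q):
    """d(p, q) = || sqrt(p) - sqrt(q) ||_2, the first form in (1)."""
    return math.sqrt(sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                         for pi, qi in zip(p, q)))

def hellinger_expanded(p, q):
    """The last form in (1): [sum(p_i + q_i) - 2 sum sqrt(p_i q_i)]^(1/2)."""
    total = sum(pi + qi for pi, qi in zip(p, q))
    cross = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    # Guard against a tiny negative value caused by rounding.
    return math.sqrt(max(total - 2.0 * cross, 0.0))

def hellinger(p, q):
    """d_H(p, q) = d(p, q) / sqrt(2), the normalisation customary in statistics."""
    return hellinger_unnormalised(p, q) / math.sqrt(2.0)

# Illustrative probability vectors (hypothetical data).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
print(hellinger_unnormalised(p, q), hellinger_expanded(p, q), hellinger(p, q))
```

With this normalisation $d_H$ takes values in $[0, 1]$, the value 1 being attained when $p$ and $q$ have disjoint supports.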
We have then

$$d_H(p,q) = \sqrt{\operatorname{tr}\mathcal{A}(p,q) - \operatorname{tr}\mathcal{G}(p,q)}, \tag{2}$$

where $\mathcal{A}(p,q)$ is the arithmetic mean of the vectors $p$ and $q$, $\mathcal{G}(p,q)$ is their geometric mean, and $\operatorname{tr} x$ stands for $\sum x_i$.

A matrix/noncommutative/quantum version would seek to replace the probability vectors $p$ and $q$ by density matrices $A$ and $B$; i.e., positive semidefinite matrices $A, B$ with $\operatorname{tr} A = \operatorname{tr} B = 1$. In the discussion that follows, the restriction on trace is not needed, and so we let $A$ and $B$ be any two positive semidefinite matrices. On the other hand, a part of our analysis requires $A$ and $B$ to be positive definite. This will be clear from the context.

2010 Mathematics Subject Classification. 15B48, 49K35, 94A17, 81P45.
Key words and phrases. Geometric mean, matrix divergence, Bregman divergence, relative entropy, strict convexity, barycentre.

We let $\mathbb{P}$ be the set of $n \times n$ complex positive definite matrices. The notation $A \geq 0$ means that $A$ is positive (semi) definite. Here we run into the essential difference between the matrix and the scalar case. For positive definite matrices $A$ and $B$, there is only one possible arithmetic mean, $\mathcal{A}(A,B) = (A+B)/2$. However, the geometric mean $\mathcal{G}(A,B)$ could have different meanings. Each of these leads to a different version of the Hellinger distance on matrices. In this paper we study some of these distances and their properties.

The Euclidean inner product on $n \times n$ matrices is defined as $\langle A, B \rangle = \operatorname{tr} A^*B$. The associated Euclidean norm is

$$\|A\|_2 = (\operatorname{tr} A^*A)^{1/2} = \left(\sum |a_{ij}|^2\right)^{1/2}.$$

Recall that the matrices $AB$ and $BA$ have the same eigenvalues. Thus if $A$ and $B$ are positive definite, then $AB$ is not positive definite unless $A$ and $B$ commute. However, the eigenvalues of $AB$ are all positive as they are the same as the eigenvalues of $A^{1/2}BA^{1/2}$. Also every matrix with positive eigenvalues has a unique square root with positive eigenvalues.
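The remark about $AB$ can be seen on a small example. The sketch below (plain Python, with $2 \times 2$ matrices chosen purely for illustration) forms $AB$ and $BA$ for two positive definite matrices that do not commute, and reads off the eigenvalues from the trace and determinant: the product is not symmetric, yet both of its eigenvalues are positive and agree with those of $BA$.

```python
import math

def mat_mul(X, Y):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def eigenvalues_2x2(M):
    """Roots of t^2 - tr(M) t + det(M); assumes both roots are real."""
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    disc = math.sqrt(tr * tr - 4.0 * det)
    return ((tr - disc) / 2.0, (tr + disc) / 2.0)

# Two positive definite matrices that do not commute (illustrative choice).
A = [[2.0, 1.0], [1.0, 1.0]]
B = [[1.0, 0.0], [0.0, 3.0]]

AB = mat_mul(A, B)
BA = mat_mul(B, A)
print("AB =", AB)                              # not symmetric
print("eigenvalues of AB:", eigenvalues_2x2(AB))
print("eigenvalues of BA:", eigenvalues_2x2(BA))
```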
If $A, B$ are positive definite, then we denote by $(AB)^{1/2}$ the square root that has positive eigenvalues. Since

$$(AB)^{1/2} = A^{1/2}(A^{1/2}BA^{1/2})^{1/2}A^{-1/2},$$

the matrices $(AB)^{1/2}$ and $(A^{1/2}BA^{1/2})^{1/2}$ are similar, and hence have the same eigenvalues.

The straightforward generalisation of (1) for positive definite matrices $A, B$ is evidently

$$d_1(A,B) = \|A^{1/2} - B^{1/2}\|_2 = \left[\operatorname{tr}(A+B) - 2\operatorname{tr} A^{1/2}B^{1/2}\right]^{1/2}. \tag{3}$$

Another version could be

$$d_2(A,B) = \left[\operatorname{tr}(A+B) - 2\operatorname{tr}(A^{1/2}BA^{1/2})^{1/2}\right]^{1/2} = \left[\operatorname{tr}(A+B) - 2\operatorname{tr}(AB)^{1/2}\right]^{1/2}. \tag{4}$$

While it is clear from (3) that $d_1$ is a metric on $\mathbb{P}$, it is not obvious that $d_2$ is a metric. It turns out that

$$d_2(A,B) = \min_U \|A^{1/2} - B^{1/2}U\|_2, \tag{5}$$

where the minimum is taken over all unitary matrices $U$. It follows from this that $d_2$ is a metric. This is called the Bures distance in the quantum information literature and the Wasserstein metric in the literature on optimal transport. It plays an important role in both these subjects. We refer the reader to [18] for a recent exposition, and to [12, 26, 28, 36] for earlier work.

The quantity $F(A,B) = \operatorname{tr}(A^{1/2}BA^{1/2})^{1/2}$ is called the fidelity between the states $A$ and $B$. In the special case when $A = uu^*$, $B = vv^*$ are pure states, we have $F(A,B) = |u^*v|$ and $d_2(A,B) = \sqrt{2}\,(1 - |u^*v|)^{1/2}$. For qubit states this is the distance on the Bloch sphere.

For various reasons, theoretical and practical, the most accepted definition of geometric mean of $A, B$ is the entity

$$A \# B = A^{1/2}(A^{-1/2}BA^{-1/2})^{1/2}A^{1/2}. \tag{6}$$

This formula was introduced by Pusz and Woronowicz [32]. When $A$ and $B$ commute, $A \# B$ reduces to $A^{1/2}B^{1/2}$. The mean $A \# B$ has been studied extensively for several years and has remarkable properties that make it useful in diverse areas. One of them is its connection with operator inequalities related to monotonicity and convexity theorems for the quantum entropy. See Chapter 4 of [15] for a detailed exposition.
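Formulas (3), (4) and (6) can be tried out on small examples. The following is a minimal sketch, restricted to real symmetric positive definite $2 \times 2$ matrices so that it needs only the Python standard library. The square root uses the closed form $M^{1/2} = (M + \sqrt{\det M}\, I)/\sqrt{\operatorname{tr} M + 2\sqrt{\det M}}$, which follows from the Cayley-Hamilton theorem; this shortcut, the helper names and the test matrices are conveniences of the sketch, not something taken from the paper.

```python
import math

def mat_mul(X, Y):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace(M):
    return M[0][0] + M[1][1]

def det(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inverse(M):
    d = det(M)
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

def sqrt_pd(M):
    """Positive definite square root of a symmetric positive definite 2 x 2 matrix."""
    s = math.sqrt(det(M))
    t = math.sqrt(trace(M) + 2.0 * s)
    return [[(M[i][j] + (s if i == j else 0.0)) / t for j in range(2)]
            for i in range(2)]

def d1(A, B):
    """Formula (3): [tr(A + B) - 2 tr(A^(1/2) B^(1/2))]^(1/2)."""
    cross = trace(mat_mul(sqrt_pd(A), sqrt_pd(B)))
    return math.sqrt(max(trace(A) + trace(B) - 2.0 * cross, 0.0))

def d2(A, B):
    """Formula (4): [tr(A + B) - 2 tr (A^(1/2) B A^(1/2))^(1/2)]^(1/2)."""
    rA = sqrt_pd(A)
    fidelity = trace(sqrt_pd(mat_mul(mat_mul(rA, B), rA)))
    return math.sqrt(max(trace(A) + trace(B) - 2.0 * fidelity, 0.0))

def geometric_mean(A, B):
    """Formula (6): A # B = A^(1/2) (A^(-1/2) B A^(-1/2))^(1/2) A^(1/2)."""
    rA = sqrt_pd(A)
    rA_inv = inverse(rA)
    inner = sqrt_pd(mat_mul(mat_mul(rA_inv, B), rA_inv))
    return mat_mul(mat_mul(rA, inner), rA)

# Non-commuting pair (illustrative choice).
A = [[2.0, 1.0], [1.0, 1.0]]
B = [[1.0, 0.0], [0.0, 3.0]]
print("d1 =", d1(A, B), " d2 =", d2(A, B))

# Commuting (diagonal) pair: A # B should equal A^(1/2) B^(1/2).
C = [[4.0, 0.0], [0.0, 9.0]]
D = [[1.0, 0.0], [0.0, 16.0]]
print("C # D =", geometric_mean(C, D))
```

Taking $U = I$ in (5) shows that $d_2 \le d_1$, which the first pair reproduces; for the commuting pair the two distances coincide and $C \# D$ equals $C^{1/2}D^{1/2}$.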
Another object of interest has been the log Euclidean mean $\mathcal{L}(A,B)$ defined as

$$\mathcal{L}(A,B) = \exp\left(\frac{\log A + \log B}{2}\right). \tag{7}$$

This mean too reduces to $A^{1/2}B^{1/2}$ when $A$ and $B$ commute, and has been used in various contexts [7], though it lacks some pleasing properties that $A \# B$ has.

Thus it is natural to consider two more matrix versions of the Hellinger distance, viz.,

$$d_3(A,B) = \left[\operatorname{tr}(A+B) - 2\operatorname{tr}(A \# B)\right]^{1/2}, \tag{8}$$

and

$$d_4(A,B) = \left[\operatorname{tr}(A+B) - 2\operatorname{tr}\mathcal{L}(A,B)\right]^{1/2}. \tag{9}$$

In view of what has been discussed, we may expect that $d_3$ and $d_4$ are metrics on $\mathbb{P}$. However, it turns out that neither of them obeys the triangle inequality. Examples are given in Section 2. Nevertheless, this is compensated by the fact that the squares of $d_3$ and $d_4$ both are divergences, and hence they can serve as good distance measures.

A smooth function $\Phi$ from $\mathbb{P} \times \mathbb{P}$ to the set of nonnegative real numbers, $\mathbb{R}_+$, is called a divergence if

(i) $\Phi(A,B) = 0$ if and only if $A = B$.

(ii) The first derivative $D\Phi$ with respect to the second variable vanishes on the diagonal; i.e.,

$$D\Phi(A,X)\big|_{X=A} = 0. \tag{10}$$

(iii) The second derivative $D^2\Phi$ is positive on the diagonal; i.e.,

$$D^2\Phi(A,X)\big|_{X=A}(Y,Y) \geq 0 \quad \text{for all Hermitian } Y. \tag{11}$$

See [4], Sections 1.2 and 1.3.

The prototypical example is the Euclidean divergence $\Phi(A,B) = \|A - B\|_2^2$. The functions $d_1^2(A,B)$ and $d_2^2(A,B)$ are also divergences. Another well-known example is the Kullback-Leibler divergence [4]. A special kind of divergence is the Bregman divergence corresponding to a strictly convex differentiable function $\varphi : \mathbb{P} \to \mathbb{R}$. If $\varphi$ is such a function, then

$$\Phi(A,B) = \varphi(A) - \varphi(B) - D\varphi(B)(A - B) \tag{12}$$

is called the Bregman divergence corresponding to $\varphi$. Not every divergence arises in this way. In particular, $d_H^2(p,q)$, the square of the Hellinger distance on probability vectors, is not a Bregman divergence.

Now we describe our main results.
We will show that both the functions $\Phi_3(A,B) = d_3^2(A,B)$ and $\Phi_4(A,B) = d_4^2(A,B)$ are divergences. We will show that $\Phi_3$ and $\Phi_4$ are jointly convex in the variables $A$ and $B$, and strictly convex in each of the variables separately. One consequence of this is that for every $m$-tuple $A_1, \ldots, A_m$ in $\mathbb{P}$ and positive weights $w_1, \ldots, w_m$ the minimisation problem

$$\min_{X > 0} \sum_{j=1}^{m} w_j\, d^2(X, A_j) \tag{13}$$

has a unique solution when $d = d_3$ or $d_4$. When $d = d_1$ the minimum in (13) is attained at the $1/2$-power mean

$$Q_{1/2} = \left(\sum_{j=1}^{m} w_j A_j^{1/2}\right)^{2}. \tag{14}$$

This is one of the much studied family of classical power means. When $d = d_2$, the minimiser in (13) is the Wasserstein mean [2, 18]. This is the unique solution of the matrix equation

$$X = \sum_{j=1}^{m} w_j (X^{1/2}A_jX^{1/2})^{1/2}. \tag{15}$$

This mean has major applications in optimal transport, statistics, quantum information and other areas. Means with respect to various divergences have also been of interest in information theory.
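The statement about $d = d_1$ can be probed numerically. The sketch below is an illustration only: it works with real symmetric positive definite $2 \times 2$ matrices, uses a closed-form $2 \times 2$ square root (a convenience of this sketch), and assumes that the weights sum to 1, which is the normalisation under which (14) is the minimiser. It computes $Q_{1/2}$ and compares the objective in (13) at $Q_{1/2}$ with its value at a few nearby positive definite matrices. All names and data are hypothetical.

```python
import math

def mat_mul(X, Y):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace(M):
    return M[0][0] + M[1][1]

def sqrt_pd(M):
    """Square root of a symmetric positive definite 2 x 2 matrix (closed form)."""
    s = math.sqrt(M[0][0] * M[1][1] - M[0][1] * M[1][0])
    t = math.sqrt(trace(M) + 2.0 * s)
    return [[(M[i][j] + (s if i == j else 0.0)) / t for j in range(2)]
            for i in range(2)]

def d1_squared(X, A):
    """Square of (3): tr(X + A) - 2 tr(X^(1/2) A^(1/2))."""
    return trace(X) + trace(A) - 2.0 * trace(mat_mul(sqrt_pd(X), sqrt_pd(A)))

def objective(X, mats, weights):
    """The sum in (13) with d = d1."""
    return sum(w * d1_squared(X, A) for w, A in zip(weights, mats))

def power_mean_half(mats, weights):
    """Formula (14): the square of the weighted sum of square roots."""
    roots = [sqrt_pd(A) for A in mats]
    S = [[sum(w * R[i][j] for w, R in zip(weights, roots)) for j in range(2)]
         for i in range(2)]
    return mat_mul(S, S)

# Illustrative data; the weights are assumed to sum to 1.
mats = [[[2.0, 1.0], [1.0, 1.0]],
        [[1.0, 0.0], [0.0, 3.0]],
        [[3.0, -1.0], [-1.0, 2.0]]]
weights = [0.5, 0.3, 0.2]

Q = power_mean_half(mats, weights)
print("Q_1/2 =", Q)
print("objective at Q_1/2:", objective(Q, mats, weights))

# Small symmetric perturbations of Q_1/2 should not decrease the objective.
for E in ([[0.1, 0.0], [0.0, 0.0]], [[0.0, 0.05], [0.05, 0.0]]):
    X = [[Q[i][j] + E[i][j] for j in range(2)] for i in range(2)]
    print("objective at perturbed point:", objective(X, mats, weights))
```

The reason this works is that, in the variable $Y = X^{1/2}$, the objective becomes $\sum_j w_j \|Y - A_j^{1/2}\|_2^2$, a weighted least squares problem whose minimiser is the weighted average of the $A_j^{1/2}$.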