
DISTANCES BETWEEN PROBABILITY DISTRIBUTIONS OF DIFFERENT DIMENSIONS

YUHANG CAI AND LEK-HENG LIM

Abstract. Comparing probability distributions is an indispensable and ubiquitous task in machine learning and statistics. The most common way to compare a pair of Borel probability measures is to compute a metric between them, and by far the most widely used notions of metric are the Wasserstein metric and the total variation metric. The next most common way is to compute a divergence between them, and in this case almost every known divergence, such as those of Kullback–Leibler, Jensen–Shannon, Rényi, and many more, is a special case of the f-divergence. Nevertheless these metrics and divergences may only be computed, and in fact are only defined, when the pair of probability measures are on spaces of the same dimension. How would one quantify, say, a KL-divergence between the uniform distribution on the interval [−1, 1] and a Gaussian distribution on R^3? We will show that, in a completely natural manner, various common notions of metrics and divergences give rise to a distance between Borel probability measures defined on spaces of different dimensions, e.g., one on R^m and another on R^n where m and n are distinct, so as to give a meaningful answer to the previous question.

1. Introduction

Measuring a distance, whether in the sense of a metric or a divergence, between two probability distributions is a fundamental endeavor in machine learning and statistics. We encounter it in clustering [27], [61], generative adversarial networks [2], image recognition [3], minimax lower bounds [22], and just about any field that undertakes a statistical approach towards data. It is well-known that the set of Borel probability measures on a measurable space Ω ⊆ R^n may be equipped with many different metrics and divergences, each good for its own purpose, but two of the most common families are:

p-Wasserstein:  W_p(µ, ν) := inf_{γ ∈ Γ(µ,ν)} ( ∫_{Ω×Ω} ‖x − y‖_2^p dγ(x, y) )^{1/p},

f-divergence:   D_f(µ‖ν) := ∫_Ω f(dµ/dν) dν.

For p = 1 and 2, the p-Wasserstein metric gives the Kantorovich metric (also called the earth mover's metric) and the Lévy–Fréchet metric respectively. Likewise, for various choices of f, we obtain as special cases the Kullback–Leibler, Jensen–Shannon, Rényi, Jeffreys, Chernoff, Pearson chi-squared, Hellinger squared, exponential, and alpha–beta divergences, as well as the total variation metric (see Table 1). However, we stress that the p-Wasserstein metric cannot be expressed as an f-divergence.

All these distances are only defined when µ and ν are probability measures on a common measurable space Ω ⊆ R^n. This article provides an answer to the question:

How can one define a distance between µ, a probability measure on Ω_1 ⊆ R^m, and ν, a probability measure on Ω_2 ⊆ R^n, where m ≠ n?

We will show that this problem has a natural solution that works for any of the aforementioned metrics and divergences in a way that is consistent with recent works that extended common notions of distances to matrices [37] and subspaces [69] of distinct dimensions. Although we will

2020 Mathematics Subject Classification. 28A33, 28A50, 46E27, 49Q22, 60E05, 94A17.
Key words and phrases. Probability densities, probability measures, Wasserstein distance, total variation distance, KL-divergence, Rényi divergence.

draw from the same high-level ideas in [37, 69], we require substantially different techniques in order to work with probability measures.

Given a p-Wasserstein metric or an f-divergence, which is defined between two probability measures of the same dimension, we show that it naturally defines two different distances for probability measures µ and ν on spaces of different dimensions — we call these the embedding distance and projection distance respectively. Both these distances are completely natural and are each befitting candidates for the distance we seek; the trouble is that there are not one but two of them, both equally reasonable. The punchline, as we shall prove, is that the two distances are always equal, giving us a unique distance defined on inequidimensional probability measures. We will state this result more precisely after introducing a few notations.

To the best of our knowledge — and we have one of our referees to thank for filling us in on this — the only alternative for defining a distance between probability measures of different dimensions is the Gromov–Wasserstein distance proposed in [43]. As will be evident from our description below, we adopt a 'bottom-up' approach that begins from first principles and requires nothing aside from the most basic definitions. On the other hand, the approach in [43] is a 'top-down' one that adapts the vastly more general and powerful Gromov–Hausdorff distance to a special case. Our construction works with a wide variety of the common metrics and divergences mentioned in the first paragraph.
Although the work in [43] is restricted to the 2-Wasserstein metric, it is conceivable that the framework therein would apply more generally to other metrics as well; however, it is not obvious how the framework might apply to divergences, given that the Gromov–Hausdorff approach requires a metric. In the one case that allows a comparison, namely, applying the two different constructions to the 2-Wasserstein metric to obtain distances on probability measures of different dimensions, they lead to different results. We are of the opinion that both approaches are useful, although our simplistic approach is more likely to yield distances that have closed-form expressions or are readily computable, as we will see in Section 6; the Gromov–Wasserstein distance tends to be NP-hard [39], and closed-form expressions are rare and not easy to obtain [58].
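Both families of distances recalled above are straightforward to evaluate in simple discrete settings, which may help fix ideas before the general construction. The following sketch is our own illustration, not part of the development below: for two one-dimensional empirical measures with equally many atoms, an optimal coupling for W_1 simply matches sorted samples, and the KL-divergence, i.e., the f-divergence with f(t) = t log t, between two discrete distributions is a finite sum.

```python
import numpy as np

def wasserstein1_1d(xs, ys):
    """W1 between two empirical measures on R with equally many atoms;
    in one dimension an optimal coupling matches sorted samples."""
    xs, ys = np.sort(xs), np.sort(ys)
    return float(np.mean(np.abs(xs - ys)))

def kl_divergence(p, q):
    """KL-divergence (f-divergence with f(t) = t log t) between two
    discrete distributions, with q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

w = wasserstein1_1d([0.0, 1.0], [1.0, 2.0])   # every atom moves distance 1
kl = kl_divergence([0.5, 0.5], [0.25, 0.75])  # equals (1/2) log(4/3)
print(w, kl)
```

Note that W_1 requires only a metric on the underlying space, whereas the KL sum is finite only when the support of p is contained in that of q, mirroring the absolute-continuity requirement of the f-divergence.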

1.1. Main result. Let M(Ω) denote the set of all Borel probability measures on Ω ⊆ R^n and let M^p(Ω) ⊆ M(Ω) denote those with finite pth moments, p ∈ N. For any m, n ∈ N, m ≤ n, we write

O(m, n) := {V ∈ R^{m×n} : V V^T = I_m},

i.e., the Stiefel manifold of m × n matrices with orthonormal rows. We write O(n) := O(n, n) for the orthogonal group. For any V ∈ O(m, n) and b ∈ R^m, let

ϕ_{V,b} : R^n → R^m,  ϕ_{V,b}(x) = V x + b;

and for any µ ∈ M(R^n), let ϕ_{V,b}(µ) := µ ∘ ϕ_{V,b}^{-1} be the pushforward measure. For simplicity, we write ϕ_V := ϕ_{V,0} when b = 0.

For any m, n ∈ N, there is no loss of generality in assuming that m ≤ n for the remainder of our article. Our goal is to define a distance d(µ, ν) for measures µ ∈ M(Ω_1) and ν ∈ M(Ω_2) where Ω_1 ⊆ R^m and Ω_2 ⊆ R^n, and where by 'distance' we include both metrics and divergences. Again, there is no loss of generality in assuming that

(1) Ω_1 = R^m,  Ω_2 = R^n,

since we may simply restrict our attention to measures supported on smaller subsets. Henceforth, we will assume (1). We will call µ and ν an m- and n-dimensional measure respectively.

We begin by defining the projection and embedding of measures. These are measure-theoretic analogues of the Schubert varieties in [69] and we choose our notations to be consistent with [69].

Definition 1.1. Let m, n ∈ N, m ≤ n. For any µ ∈ M(R^m) and ν ∈ M(R^n), the embeddings of µ into R^n are the set of n-dimensional measures

Φ^+(µ, n) := {α ∈ M(R^n) : ϕ_{V,b}(α) = µ for some V ∈ O(m, n), b ∈ R^m};

the projections of ν onto R^m are the set of m-dimensional measures

Φ^-(ν, m) := {β ∈ M(R^m) : ϕ_{V,b}(ν) = β for some V ∈ O(m, n), b ∈ R^m}.

Let d be any notion of distance on M(R^n) for any n ∈ N. Define the projection distance

d^-(µ, ν) := inf_{β ∈ Φ^-(ν,m)} d(µ, β)

and the embedding distance

d^+(µ, ν) := inf_{α ∈ Φ^+(µ,n)} d(α, ν).

Both d^-(µ, ν) and d^+(µ, ν) are natural ways of defining d on probability measures µ and ν of different dimensions. The trouble is that they are just as natural and there is no reason to favor one over the other. Our main result, which resolves this dilemma, may be stated as follows.

Theorem 1.2. Let m, n ∈ N, m ≤ n. Let d be a p-Wasserstein metric or an f-divergence. Then

(2) d^-(µ, ν) = d^+(µ, ν).

The common value in (2), denoted d̂(µ, ν), defines a distance between µ and ν and serves as our answer to the question on page 1. We will prove Theorem 1.2 for the p-Wasserstein metric (Theorem 2.2) and for the f-divergence (Theorem 3.4). The Jensen–Shannon divergence (Theorem 4.2) and total variation metric (Theorem 5.2) require separate treatments since the definition of the p-Wasserstein metric requires that µ and ν have finite pth moments and the definition of the f-divergence requires that µ and ν have densities, assumptions that we do not need for the Jensen–Shannon divergence and total variation metric. While the proofs of Theorems 2.2, 3.4, 4.2, and 5.2 follow a similar broad outline, the subtle details are different and depend on the specific distance involved.

An important departure from the results in [37, 69] is that in general d̂(µ, ν) ≠ d(µ, ν) when m = n. Incidentally, the Gromov–Wasserstein distance [43] mentioned earlier also lacks this property. To see this, we first state a more general corollary.

Corollary 1.3. Let m, n ∈ N, m ≤ n. Let d be a p-Wasserstein metric, a Jensen–Shannon divergence, a total variation metric, or an f-divergence. Then d̂(µ, ν) = d^-(µ, ν) = d^+(µ, ν) = 0 if and only if there exist V ∈ O(m, n) and b ∈ R^m with ϕ_{V,b}(ν) = µ.

Corollary 1.3 gives a necessary and sufficient condition for d̂(µ, ν) to be zero, saying that this happens if and only if the two measures µ and ν are rotated copies of each other, modulo embedding in a higher-dimensional ambient space when m ≠ n. Hence for any d that is not rotationally invariant, we will generally have d̂(µ, ν) ≠ d(µ, ν) when m = n.

The discussion in the previous paragraph notwithstanding, the distance d̂ has several nice features. Firstly, it preserves certain well-known relations satisfied by the original distance d. For example, we know that for p ≤ q, the p- and q-Wasserstein metrics satisfy W_p(µ, ν) ≤ W_q(µ, ν) for measures µ, ν of the same dimension; we will see in Corollary 2.3 that

Ŵ_p(µ, ν) ≤ Ŵ_q(µ, ν)

for measures µ, ν of different dimensions. Secondly, as our construction applies consistently across a wide variety of distances, both metrics and divergences, relations between different types of distances can also be preserved. For example, it is well-known that the total variation metric and KL-divergence satisfy Pinsker's inequality d_TV(µ, ν)^2 ≤ (1/2) D_KL(µ‖ν) for measures µ, ν of the same dimension; we will see in Corollary 3.5 that

d̂_TV(µ, ν)^2 ≤ (1/2) D̂_KL(µ‖ν)

for measures µ, ν of different dimensions. As another example, the Hellinger squared divergence and total variation metric satisfy D_H(µ, ν) ≤ 2 d_TV(µ, ν) ≤ 2√(D_H(µ, ν)) for measures µ, ν of the same dimension; we will see in Corollary 3.6 that

D̂_H(µ, ν) ≤ 2 d̂_TV(µ, ν) ≤ 2√(D̂_H(µ, ν))

for measures µ, ν of different dimensions.

Another advantage of our construction is that for some common distributions, the distance d̂ obtained often has a closed-form expression or is readily computable,¹ as we will see in Section 6. In particular, we will have an explicit answer for the rhetorical question in the abstract: What is the KL-divergence between the uniform distribution on [−1, 1] and a Gaussian distribution on R^3?

1.2. Background. For easy reference, we remind the reader of two standard results.

Theorem 1.4 (Hahn Decomposition). Let Ω be a measurable space and µ be a signed measure on the σ-algebra Σ(Ω). Then there exist P and N ∈ Σ(Ω) such that (i) P ∪ N = Ω, P ∩ N = ∅; (ii) any E ∈ Σ(Ω) with E ⊆ P has µ(E) ≥ 0; (iii) any E ∈ Σ(Ω) with E ⊆ N has µ(E) ≤ 0.

The Disintegration Theorem [51] rigorously defines the notion of a nontrivial "restriction" of a measure to a measure-zero subset of a measure space. It is famously used to establish the existence of conditional probability measures.

Theorem 1.5 (Disintegration Theorem). Let Ω_1 and Ω_2 be two Radon spaces. Let µ ∈ M(Ω_1) and ϕ : Ω_1 → Ω_2 be a Borel measurable map. Set ν ∈ M(Ω_2) to be the pushforward measure ν = µ ∘ ϕ^{-1}. Then there exists a ν-almost everywhere uniquely determined family of probability measures {µ_y ∈ M(Ω_1) : y ∈ Ω_2} such that
(i) the function Ω_2 → M(Ω_1), y ↦ µ_y is Borel measurable, i.e., for any measurable B ⊆ Ω_1, y ↦ µ_y(B) is a measurable function of y;
(ii) µ_y(Ω_1 \ ϕ^{-1}(y)) = 0;
(iii) for every Borel-measurable function f : Ω_1 → [0, ∞],

∫_{Ω_1} f(x) dµ(x) = ∫_{Ω_2} ∫_{ϕ^{-1}(y)} f(x) dµ_y(x) dν(y).

In this article, we use the terms 'probability measure' and 'probability distribution' interchangeably since given a cumulative distribution function F and A ∈ Σ(Ω), µ(A) := ∫_{x∈A} dF(x) defines a probability measure.

2. Wasserstein metric

We begin by properly defining the p-Wasserstein metric, filling in some details left out in Section 1. Given two measures µ, ν ∈ M^p(R^n) and any p ∈ [1, ∞], the p-Wasserstein metric, also called the L^p-Wasserstein metric, between them is

(3) W_p(µ, ν) := inf_{γ ∈ Γ(µ,ν)} ( ∫_{R^n × R^n} ‖x − y‖_2^p dγ(x, y) )^{1/p},

where p = ∞ is interpreted in the sense of the essential supremum, and

Γ(µ, ν) := {γ ∈ M(R^n × R^n) : ∫_{x∈R^n} dγ(x, y) = dν(y), ∫_{y∈R^n} dγ(x, y) = dµ(x)}

¹To the extent afforded by the original distance d — if d has no closed-form expression or is NP-hard, we would not expect d̂ to be any different.

is the set of couplings between µ and ν. The measure π ∈ Γ(µ, ν) that attains the minimum in (3) is called the optimal transport coupling. For the purpose of this article, we use the standard Euclidean metric d_E(x, y) = ‖x − y‖_2, but this may be replaced by other metrics and R^n by other metric spaces, in which case (3) is just called the Wasserstein metric or transportation distance. The general definition is due to Villani [65], but the notion has a long history involving the works of Fréchet [18], Kantorovich [30], Lévy [36], Wasserstein [64], and many others. As we mentioned earlier, the 1-Wasserstein metric is often called the earth mover's metric or Kantorovich metric, whereas the 2-Wasserstein metric is sometimes called the Lévy–Fréchet metric [18].

The Wasserstein metric is widely used in the imaging sciences for capturing geometric features [57, 63, 59], with a variety of applications including contrast equalization [14], texture synthesis [23], image [68, 71], image fusion [10], medical imaging [67], shape registration [41], and image watermarking [42]. In economics, it is used to match job seekers with jobs, determine real estate prices, and form matrimonial unions, among many other things [20]. The Wasserstein metric and optimal transport coupling also show up unexpectedly in areas from astrophysics, where it is used to reconstruct initial conditions of the early universe [19]; to computer music, where it is used to automate music transcriptions [17]; to machine learning, where it is used for machine translation [70] and word embedding [34].
Unlike the f-divergence that we will discuss in Section 3, a significant advantage afforded by the Wasserstein distance is that it is finite even when neither measure is absolutely continuous with respect to the other. Our goal is to use W_p to construct a new distance Ŵ_p so that Ŵ_p(µ, ν) would be well-defined for µ ∈ M(R^m) and ν ∈ M(R^n) where m ≠ n. Note that any attempt to directly extend the definition in (3) to such a scenario would require that we make sense of ‖x − y‖_2 for x ∈ R^m and y ∈ R^n — our approach avoids this conundrum entirely. We begin by establishing a simple but crucial lemma.

Lemma 2.1. Let m, n ∈ N, m ≤ n, and p ∈ [1, ∞]. For any α, ν ∈ M^p(R^n), any V ∈ O(m, n), and any b ∈ R^m, we have

W_p(ϕ_{V,b}(α), ϕ_{V,b}(ν)) ≤ W_p(α, ν).

Proof. Let π ∈ M(R^n × R^n) be the optimal transport coupling for W_p(α, ν). We define

dπ_+(x, y) := ∫_{ϕ_{V,b}(z)=x, ϕ_{V,b}(w)=y} dπ(z, w).

Then

∫_y dπ_+(x, y) = ∫_{ϕ_{V,b}(z)=x, w∈R^n} dπ(z, w) = ∫_{ϕ_{V,b}(z)=x} dα(z) = dϕ_{V,b}(α)(x),

∫_x dπ_+(x, y) = ∫_{z∈R^n, ϕ_{V,b}(w)=y} dπ(z, w) = ∫_{ϕ_{V,b}(w)=y} dν(w) = dϕ_{V,b}(ν)(y),

and thus π_+ ∈ Γ(ϕ_{V,b}(α), ϕ_{V,b}(ν)). Now

W_p(ϕ_{V,b}(α), ϕ_{V,b}(ν))^p ≤ ∫_{x∈R^m, y∈R^m} ‖x − y‖_2^p dπ_+(x, y)
= ∫_{z∈R^n, w∈R^n} ‖ϕ_{V,b}(z) − ϕ_{V,b}(w)‖_2^p dπ(z, w)
≤ ∫_{z∈R^n, w∈R^n} ‖z − w‖_2^p dπ(z, w),

and taking pth roots gives the result. The last inequality follows from ‖ϕ_{V,b}(z) − ϕ_{V,b}(w)‖_2 ≤ ‖z − w‖_2, as ϕ_{V,b} is an orthogonal projection plus a translation. □
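Lemma 2.1 can be sanity-checked numerically on empirical measures, for which the pushforward under ϕ_V is just a matrix product applied to the atoms. The sketch below is our own illustration under an arbitrary choice of V and point clouds; it relies on the standard fact that for uniform atomic measures with equally many atoms, an optimal coupling can be taken to be a permutation, so a brute-force search over permutations computes W_1 exactly.

```python
import itertools
import numpy as np

def w1_point_clouds(xs, ys):
    """W1 between uniform empirical measures with equally many atoms,
    by brute force over permutations (an optimal coupling of two
    uniform atomic measures can be taken to be a permutation)."""
    n = len(xs)
    return min(
        sum(np.linalg.norm(xs[i] - ys[p[i]]) for i in range(n)) / n
        for p in itertools.permutations(range(n))
    )

rng = np.random.default_rng(0)
alpha = rng.normal(size=(4, 2))          # atoms of alpha in R^2
nu = rng.normal(size=(4, 2))             # atoms of nu in R^2

t = 0.7
V = np.array([[np.cos(t), np.sin(t)]])   # V in O(1, 2); take b = 0
proj = lambda pts: pts @ V.T             # pushforward of an empirical measure

lhs = w1_point_clouds(proj(alpha), proj(nu))   # W1 of the projections
rhs = w1_point_clouds(alpha, nu)               # W1 upstairs in R^2
print(lhs <= rhs + 1e-12)  # Lemma 2.1: prints True
```

The inequality holds for any V with orthonormal rows, since ϕ_{V,b} is 1-Lipschitz and pushing the upstairs optimal coupling down gives a feasible coupling of the projected measures.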

Lemma 2.1 assures us that Φ^-(ν, m) ⊆ M^p(R^m), but in general we may not have Φ^+(µ, n) ⊆ M^p(R^n). With this in mind, we introduce the set

Φ_p^+(µ, n) := {α ∈ M^p(R^n) : ϕ_{V,b}(α) = µ for some V ∈ O(m, n), b ∈ R^m}

for the purpose of our next result, which shows that the projection and embedding Wasserstein distances are always equal.

Theorem 2.2. Let m, n ∈ N, m ≤ n, and p ∈ [1, ∞]. For µ ∈ M^p(R^m) and ν ∈ M^p(R^n), let

W_p^-(µ, ν) := inf_{β ∈ Φ^-(ν,m)} W_p(µ, β),  W_p^+(µ, ν) := inf_{α ∈ Φ_p^+(µ,n)} W_p(α, ν).

Then

(4) W_p^-(µ, ν) = W_p^+(µ, ν).

Proof. It is easy to deduce that W_p^-(µ, ν) ≤ W_p^+(µ, ν): for any α ∈ Φ_p^+(µ, n), there exist V_α ∈ O(m, n) and b_α ∈ R^m with ϕ_{V_α,b_α}(α) = µ. It follows from Lemma 2.1 that W_p(α, ν) ≥ W_p(µ, ϕ_{V_α,b_α}(ν)) and thus

inf_{α ∈ Φ_p^+(µ,n)} W_p(α, ν) ≥ inf_{α ∈ Φ_p^+(µ,n)} W_p(µ, ϕ_{V_α,b_α}(ν)) ≥ inf_{V ∈ O(m,n), b ∈ R^m} W_p(µ, ϕ_{V,b}(ν)) = W_p^-(µ, ν).

The bulk of the work is to show that W_p^-(µ, ν) ≥ W_p^+(µ, ν). Let ε > 0 be arbitrary. Then there exists β_* ∈ Φ^-(ν, m) with

W_p(µ, β_*) ≤ W_p^-(µ, ν) + ε.

Let V_* ∈ O(m, n) and b_* ∈ R^m be such that ϕ_{V_*,b_*}(ν) = β_*, and let W_* ∈ O(n − m, n) be such that the stacked matrix [V_*; W_*] ∈ O(n). Then ϕ_{W_*} is the complementary projection of ϕ_{V_*,b_*}. Applying Theorem 1.5 to ϕ_{V_*,b_*}, we obtain a family of measures {ν_y ∈ M(R^n) : y ∈ R^m} satisfying

∫_{R^n} dν(x) = ∫_{R^m} ∫_{ϕ_{V_*,b_*}^{-1}(y)} dν_y(x) dβ_*(y).

Let γ ∈ Γ(β_*, µ) be the optimal transport coupling attaining W_p(β_*, µ). Then

∫_{x∈R^m} dγ(x, y) = dµ(y),  ∫_{y∈R^m} dγ(x, y) = dβ_*(x).

We define a new measure γ_+ ∈ M(R^n × R^n) that will in turn give us a measure α_* ∈ Φ_p^+(µ, n) with W_p(α_*, ν) ≤ W_p(µ, β_*). Let

dγ_+(x, y) := δ(ϕ_{W_*}(x) = ϕ_{W_*}(y)) dν_{ϕ_{V_*,b_*}(x)}(x) · dγ(ϕ_{V_*,b_*}(x), ϕ_{V_*,b_*}(y)) if ϕ_{W_*}(x) = ϕ_{W_*}(y), and dγ_+(x, y) := 0 if ϕ_{W_*}(x) ≠ ϕ_{W_*}(y),

where δ(ϕ_{W_*}(x) = ϕ_{W_*}(y)) is the Dirac delta for the set {(x, y) : ϕ_{W_*}(x) = ϕ_{W_*}(y)}. For any fixed x ∈ R^n,

∫_{y∈R^n} dγ_+(x, y) = ∫_{y∈R^n, ϕ_{W_*}(x)=ϕ_{W_*}(y)} δ(ϕ_{W_*}(x) = ϕ_{W_*}(y)) dν_{ϕ_{V_*,b_*}(x)}(x) · dγ(ϕ_{V_*,b_*}(x), ϕ_{V_*,b_*}(y))
= ∫_{z∈R^m} dν_{ϕ_{V_*,b_*}(x)}(x) · dγ(ϕ_{V_*,b_*}(x), z)
= dν_{ϕ_{V_*,b_*}(x)}(x) · dβ_*(ϕ_{V_*,b_*}(x)) = dν(x).

Note that for the second equality, {ϕ_{V_*,b_*}(y) : y ∈ R^n, ϕ_{W_*}(x) = ϕ_{W_*}(y)} ≅ R^m measure-theoretically; for the third equality, one marginal of γ is β_*. Hence ν is a marginal measure of γ_+. Let

α_* ∈ M(R^n) be defined by dα_*(y) = ∫_{x∈R^n} dγ_+(x, y). Since

∫_{y∈R^n} ‖y‖_2^p dα_*(y) = ∫_{y∈R^n} (‖y‖_2^2)^{p/2} dα_*(y) = ∫_{y∈R^n} (‖ϕ_{V_*,b_*}(y)‖_2^2 + ‖ϕ_{W_*}(y)‖_2^2)^{p/2} dα_*(y)
≤ max{2^{(p−2)/2}, 1} · ∫_{y∈R^n} (‖ϕ_{V_*,b_*}(y)‖_2^p + ‖ϕ_{W_*}(y)‖_2^p) dα_*(y)
= max{2^{(p−2)/2}, 1} · [ ∫_{y∈R^m} ‖y‖_2^p dµ(y) + ∫_{x∈R^n, x'∈R^n, ϕ_{W_*}(x)=ϕ_{W_*}(x')} ‖ϕ_{W_*}(x')‖_2^p δ(ϕ_{W_*}(x) = ϕ_{W_*}(x')) dν_{ϕ_{V_*,b_*}(x)}(x) · dγ(ϕ_{V_*,b_*}(x), ϕ_{V_*,b_*}(x')) ]
= max{2^{(p−2)/2}, 1} · [ ∫_{y∈R^m} ‖y‖_2^p dµ(y) + ∫_{x∈R^n} ‖ϕ_{W_*}(x)‖_2^p dν(x) ]
≤ max{2^{(p−2)/2}, 1} · [ ∫_{y∈R^m} ‖y‖_2^p dµ(y) + ∫_{x∈R^n} ‖x‖_2^p dν(x) ],

where in the first inequality we have used (σ + τ)^{p/2} ≤ max{2^{(p−2)/2}, 1} · (σ^{p/2} + τ^{p/2}); in the fourth equality, ϕ_{W_*}(x) = ϕ_{W_*}(y); and in the last inequality, ‖ϕ_{W_*}(x)‖_2 ≤ ‖x‖_2. Since the pth moment of α_* is bounded by the pth moments of µ and ν, we have α_* ∈ M^p(R^n). Finally,

W_p(α_*, ν)^p ≤ ∫_{R^n × R^n} ‖x − y‖_2^p dγ_+(x, y)
= ∫_{ϕ_{W_*}(x)=ϕ_{W_*}(y)} ‖ϕ_{V_*,b_*}(x) − ϕ_{V_*,b_*}(y)‖_2^p δ(ϕ_{W_*}(x) = ϕ_{W_*}(y)) dν_{ϕ_{V_*,b_*}(x)}(x) · dγ(ϕ_{V_*,b_*}(x), ϕ_{V_*,b_*}(y))
= ∫_{R^m × R^m} ‖x − y‖_2^p dγ(x, y) = W_p(µ, β_*)^p.

Note that the first relation is an inequality since γ_+ may not be an optimal transport coupling between α_* and ν.

It remains to show that under the projection ϕ_{V_*,b_*} we have ϕ_{V_*,b_*}(α_*) = µ, i.e., α_* ∈ Φ_p^+(µ, n). This follows from

dϕ_{V_*,b_*}(α_*)(y) = ∫_{x∈R^n} ∫_{ϕ_{V_*,b_*}(z)=y} dγ_+(x, z)
= ∫_{x∈R^n} dν_{ϕ_{V_*,b_*}(x)}(x) · dγ(ϕ_{V_*,b_*}(x), y)
= ∫_{x∈R^m} dγ(x, y) = dµ(y).

Therefore, with Lemma 2.1, we have W_p(α_*, ν) = W_p(µ, β_*). Hence

W_p^+(µ, ν) = inf_{α ∈ Φ_p^+(µ,n)} W_p(α, ν) ≤ W_p(α_*, ν) = W_p(µ, β_*) ≤ W_p^-(µ, ν) + ε.

Since ε > 0 is arbitrary, we conclude that W_p^-(µ, ν) ≥ W_p^+(µ, ν). □

We denote the common value in (4) by Ŵ_p(µ, ν) and call it the augmented p-Wasserstein distance between µ ∈ M^p(R^m) and ν ∈ M^p(R^n). Note that this is a distance in the sense of a distance from a point to a set; it is not a metric since if we take µ, ν ∈ M^p(R^m) with ν a nontrivial rotation of µ, we will have Ŵ_p(µ, ν) = 0 even though µ ≠ ν.
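To make the augmented distance concrete, here is a hedged numerical sketch of our own (it is not one of the closed-form examples of Section 6): take µ = N(0, 3^2) on R and ν = N(0, diag(1, 4)) on R^2, both centered, so that the optimal translation b is 0. Projecting ν along a unit row vector V = (cos t, sin t) ∈ O(1, 2) gives the centered Gaussian N(0, V Σ V^T), and W_2 between centered one-dimensional Gaussians is the absolute difference of their standard deviations, so Ŵ_2(µ, ν) = W_2^-(µ, ν) can be approximated by a grid search over t.

```python
import numpy as np

# mu = N(0, 3^2) on R; nu = N(0, diag(1, 4)) on R^2.
sigma_mu = 3.0
cov_nu = np.diag([1.0, 4.0])

def w2_centered_gaussians_1d(s1, s2):
    # W2(N(0, s1^2), N(0, s2^2)) = |s1 - s2| in one dimension.
    return abs(s1 - s2)

# Grid search over V = (cos t, sin t) in O(1, 2); the pushforward of nu
# under phi_V is the centered 1D Gaussian with variance V cov V^T.
thetas = np.linspace(0.0, np.pi, 1001)
dists = []
for t in thetas:
    v = np.array([np.cos(t), np.sin(t)])
    s_proj = np.sqrt(v @ cov_nu @ v)     # std dev of the projected Gaussian
    dists.append(w2_centered_gaussians_1d(sigma_mu, s_proj))
d_minus = min(dists)   # approximates the projection distance W_2^-(mu, nu)
print(d_minus)         # best projected variance is 4, so this is |3 - 2| = 1
```

The projected standard deviation ranges over [1, 2] as t varies, so the infimum is attained at the direction of largest variance, illustrating how the projection distance picks out the best-matching marginal of ν.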

The augmented p-Wasserstein distance Ŵ_p preserves some well-known properties of the p-Wasserstein metric W_p; an example is the following inequality, which is known to hold for W_p.

Corollary 2.3. Let m, n ∈ N, m ≤ n. Let p, q ∈ [1, ∞], p ≤ q. For any µ ∈ M^q(R^m) and ν ∈ M^q(R^n), we have Ŵ_p(µ, ν) ≤ Ŵ_q(µ, ν).

Proof. Ŵ_p(µ, ν) = inf_{β ∈ Φ^-(ν,m)} W_p(µ, β) ≤ inf_{β ∈ Φ^-(ν,m)} W_q(µ, β) = Ŵ_q(µ, ν). □

3. f-Divergence

The most useful notion of distance on probability densities is sometimes not a metric. Divergences are in general asymmetric and do not satisfy the triangle inequality. The Kullback–Leibler divergence [33, 32] is probably the best known example, ubiquitous in information theory, machine learning, and statistics. It has been used to characterize relative entropy in information systems [60], to measure in continuous [66], and to quantify information gain in comparison of statistical models of inference [7], among other things. The KL-divergence is a special limiting case of the Rényi divergence [56], which is in turn a special case of a vast generalization called the f-divergence [11].

Definition 3.1. Let µ, ν ∈ M(Ω) and µ be absolutely continuous with respect to ν. For any convex function f : R → R with f(1) = 0, the f-divergence of µ from ν is

D_f(µ‖ν) = ∫_Ω f(dµ/dν) dν = ∫_Ω f(g(x)) dν(x),

where g is the Radon–Nikodym derivative with dµ(x) = g(x) dν(x).

Aside from the Rényi divergence, the f-divergence includes just about every known divergence as a special case. These include the Pearson chi-squared [52], Hellinger squared [25], Chernoff [9], Jeffreys [29], alpha–beta [16], Jensen–Shannon [38, 45], and exponential [6] divergences, as well as the total variation metric. For easy reference, we provide a list in Table 1. Note that taking the limit as θ → 1 in the Rényi divergence gives us the KL-divergence.

These divergences are all useful in their own right. The Pearson chi-squared divergence is used in statistical tests of categorical data to quantify the difference between two distributions [52]. The Hellinger squared divergence is used in dimension reduction for multivariate data [55]. The Jeffreys divergence is used in Markov random fields for image classification [46]. The Chernoff divergence is used in image feature classification, indexing, and retrieval [26].
The Rényi divergence is used in quantum information theory as a measure of entanglement [24]. The alpha–beta divergence is used in geometrical analyses of parametric inference [1]. We will defer discussions of the Jensen–Shannon divergence and total variation metric to Sections 4 and 5 respectively.

By definition, an f-divergence D_f(µ‖ν) is only defined if µ is absolutely continuous with respect to ν. For convenience, in this section we will restrict our attention to probability measures with densities so that we do not have to keep track of which measure is absolutely continuous with respect to which other measure. Let λ^n be the Lebesgue measure restricted to Ω ⊆ R^n. With respect to λ^n, we define

M_d(Ω) := {µ ∈ M(Ω) : µ has a density function},

M_pd(Ω) := {µ ∈ M_d(Ω) : µ has a strictly positive density function}.

Note that µ ∈ M_d(Ω) if and only if it is absolutely continuous with respect to λ^n. The following lemma guarantees the existence of the projection and embedding f-divergences, to be defined later.

Lemma 3.2. Let m, n ∈ N, m ≤ n, and Φ^-(ν, m) be as in Definition 1.1.
(i) If ν ∈ M_pd(R^n), then Φ^-(ν, m) ⊆ M_pd(R^m).
(ii) If α ∈ M_d(R^n) and ν ∈ M_pd(R^n), then α is absolutely continuous with respect to ν.

Table 1. In the following, ζ = (1 − θ)µ + θν, η = (1 − θ)ν + θµ, and θ, φ ∈ (0, 1).

Divergence  |  f(t)  |  D_f(µ‖ν)

Kullback–Leibler  |  t log t  |  ∫_Ω log(dµ/dν) dµ
Exponential  |  t log^2 t  |  ∫_Ω log^2(dµ/dν) dµ
Pearson  |  (t − 1)^2  |  ∫_Ω ((dµ/dν) − 1)^2 dν
Hellinger  |  (√t − 1)^2  |  ∫_Ω ((dµ/dν)^{1/2} − 1)^2 dν
Jeffreys  |  (t − 1) log t  |  ∫_Ω ((dµ/dν) − 1) log(dµ/dν) dν
Rényi  |  (t^θ − t)/(θ(θ − 1))  |  (1/(θ(θ − 1))) ∫_Ω [(dµ/dν)^θ − (dµ/dν)] dν
Chernoff  |  4(1 − t^{(1+θ)/2})/(1 − θ^2)  |  (4/(1 − θ^2)) ∫_Ω [1 − (dµ/dν)^{(1+θ)/2}] dν
Alpha–beta  |  2(1 − t^{(1−θ)/2})(1 − t^{(1−φ)/2})/((1 − θ)(1 − φ))  |  (2/((1 − θ)(1 − φ))) ∫_Ω [1 − (dµ/dν)^{(1−θ)/2}][1 − (dµ/dν)^{(1−φ)/2}] dν
Jensen–Shannon  |  (t/2) log(t/((1 − θ)t + θ)) + (1/2) log(1/(1 − θ + θt))  |  (1/2) ∫_Ω log(dµ/dζ) dµ + (1/2) ∫_Ω log(dν/dη) dν
Total variation  |  |t − 1|/2  |  sup_{A∈Σ(Ω)} |µ(A) − ν(A)|
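For discrete distributions with strictly positive ν, each entry of Table 1 reduces to a finite sum in the density ratio t_i = p_i/q_i. The following sketch is our own discrete illustration; it evaluates a few rows of the table and checks the basic sanity properties that each divergence is nonnegative and vanishes at µ = ν, as forced by the convexity of f and f(1) = 0.

```python
import numpy as np

def f_divergence(f, p, q):
    """D_f(p || q) = sum_i f(p_i / q_i) * q_i for strictly positive q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(f(p / q) * q))

kl        = lambda t: t * np.log(t)             # Kullback-Leibler row
pearson   = lambda t: (t - 1.0) ** 2            # Pearson chi-squared row
hellinger = lambda t: (np.sqrt(t) - 1.0) ** 2   # Hellinger squared row
jeffreys  = lambda t: (t - 1.0) * np.log(t)     # Jeffreys row

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]
for f in (kl, pearson, hellinger, jeffreys):
    d = f_divergence(f, p, q)
    assert d >= 0.0                              # convexity of f and f(1) = 0
    assert abs(f_divergence(f, p, p)) < 1e-12    # vanishes at p = p
print(f_divergence(kl, p, q))
```

The same helper works for any convex f with f(1) = 0, which is what makes the f-divergence such a convenient umbrella for the table above.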

Proof. Let β ∈ Φ^-(ν, m) and let V ∈ O(m, n), b ∈ R^m be such that ϕ_{V,b}(ν) = β. Let dν(x) = t(x) dλ^n(x) and W ∈ O(n − m, n) be such that [V; W] ∈ O(n). For any measurable g : R^m → R,

∫_{y∈R^m} g(y) dβ(y) = ∫_{x∈R^n} g(ϕ_{V,b}(x)) dν(x) = ∫_{x∈R^n} g(ϕ_{V,b}(x)) t(x) dλ^n(x) = ∫_{y∈R^m} g(y) t'(y) dλ^m(y),

where t'(y) = ∫_{ϕ_{V,b}^{-1}(y)} t(x) dλ^{n−m}(ϕ_W(x)). This is because ϕ_{V,b} is an orthogonal projection plus a translation and

∫_{x∈R^n} dλ^n(x) = ∫_{y∈R^m} ∫_{ϕ_{V,b}^{-1}(y)} dλ^{n−m}(ϕ_W(x)) dλ^m(y),

where the existence of ϕ_W follows from Theorem 1.5. Hence dβ(y) = t'(y) dλ^m(y), and we have (i). For (ii), suppose dα(x) = t_α(x) dλ^n(x) and dν(x) = t_ν(x) dλ^n(x) with t_ν(x) > 0; then

dα(x) = (t_α(x)/t_ν(x)) dν(x). □

We have the following f-divergence analogue of Lemma 2.1.

Lemma 3.3. Let m, n ∈ N, m ≤ n, and f : R → R be convex with f(1) = 0. Let α ∈ M_d(R^n), ν ∈ M_pd(R^n), V ∈ O(m, n), and b ∈ R^m. Then

D_f(α‖ν) ≥ D_f(ϕ_{V,b}(α)‖ϕ_{V,b}(ν)).

Proof. Let µ = ϕ_{V,b}(α) and β = ϕ_{V,b}(ν) with dα(x) = t(x) dν(x). Then by Theorem 1.5, we have

∫_{R^n} dα(x) = ∫_{R^m} ∫_{ϕ_{V,b}^{-1}(y)} dα_y(x) dµ(y),  ∫_{R^n} dν(x) = ∫_{R^m} ∫_{ϕ_{V,b}^{-1}(y)} dν_y(x) dβ(y).

By Lemma 3.2, we have dµ(y) = t'(y) dβ(y) with t'(y) = ∫_{ϕ_{V,b}^{-1}(y)} t(x) dν_y(x). Hence, by Definition 3.1 and Jensen's inequality,

D_f(α‖ν) = ∫_{x∈R^n} f(t(x)) dν(x) = ∫_{y∈R^m} ∫_{ϕ_{V,b}^{-1}(y)} f(t(x)) dν_y(x) dβ(y)
≥ ∫_{y∈R^m} f( ∫_{ϕ_{V,b}^{-1}(y)} t(x) dν_y(x) ) dβ(y) = ∫_{y∈R^m} f(t'(y)) dβ(y) = D_f(µ‖β). □
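Lemma 3.3 is a data-processing inequality: pushing both measures forward through ϕ_{V,b} cannot increase D_f. A discrete analogue (our own illustration, not the paper's setting) takes the coordinate projection V = [1 0] ∈ O(1, 2), under which the pushforward of a joint distribution on a grid in R^2 is its first marginal, and checks that the KL-divergence decreases.

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
# Strictly positive joint distributions on a 3 x 4 grid in R^2.
A = rng.random((3, 4)); A /= A.sum()
B = rng.random((3, 4)); B /= B.sum()

# The coordinate projection (V = [1 0] in O(1, 2), b = 0) pushes each
# joint distribution forward to its first marginal.
full = kl(A, B)
marginal = kl(A.sum(axis=1), B.sum(axis=1))
print(marginal <= full + 1e-12)   # prints True, as Lemma 3.3 requires
```

The check passes for any pair of strictly positive joints, since marginalization is a special case of the pushforward in the lemma.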

Lemma 3.2 assures us that Φ^-(ν, m) ⊆ M_d(R^m), but in general it will not be true that Φ^+(µ, n) ⊆ M_d(R^n). As such, we introduce the following subset:

Φ_d^+(µ, n) := {α ∈ M_d(R^n) : ϕ_{V,b}(α) = µ for some V ∈ O(m, n), b ∈ R^m},

and with this, we establish Theorem 1.2 for the f-divergence.

Theorem 3.4. Let m, n ∈ N and m ≤ n. For µ ∈ M(R^m) and ν ∈ M(R^n), let

D_f^-(µ‖ν) := inf_{β ∈ Φ^-(ν,m)} D_f(µ‖β),  D_f^+(µ‖ν) := inf_{α ∈ Φ_d^+(µ,n)} D_f(α‖ν).

Then

(5) D_f^-(µ‖ν) = D_f^+(µ‖ν).

Proof. Again, D_f^-(µ‖ν) ≤ D_f^+(µ‖ν) is easy: for any α ∈ Φ_d^+(µ, n), there exist V_α ∈ O(m, n) and b_α ∈ R^m with ϕ_{V_α,b_α}(α) = µ. It follows from Lemma 3.3 that D_f(α‖ν) ≥ D_f(µ‖ϕ_{V_α,b_α}(ν)) and thus

inf_{α ∈ Φ_d^+(µ,n)} D_f(α‖ν) ≥ inf_{α ∈ Φ_d^+(µ,n)} D_f(µ‖ϕ_{V_α,b_α}(ν)) ≥ inf_{V ∈ O(m,n), b ∈ R^m} D_f(µ‖ϕ_{V,b}(ν)).

It remains to show that D_f^-(µ‖ν) ≥ D_f^+(µ‖ν). By the definition of D_f^-(µ‖ν), for any ε > 0, there exists β_* ∈ Φ^-(ν, m) with

D_f^-(µ‖ν) ≤ D_f(µ‖β_*) ≤ D_f^-(µ‖ν) + ε.

Let V_* ∈ O(m, n) and b_* ∈ R^m be such that ϕ_{V_*,b_*}(ν) = β_*, and W_* ∈ O(n − m, n) be such that [V_*; W_*] ∈ O(n). Applying Theorem 1.5 to ϕ_{V_*,b_*}, we obtain a family of measures {ν_y ∈ M(R^n) : y ∈ R^m} such that

∫_{R^n} dν(x) = ∫_{R^m} ∫_{ϕ_{V_*,b_*}^{-1}(y)} dν_y(x) dβ_*(y).

Define α_* ∈ M(R^n) by

∫_{R^n} dα_*(x) = ∫_{R^m} ∫_{ϕ_{V_*,b_*}^{-1}(y)} dν_y(x) dµ(y).

Since ν ∈ M_pd(R^n), we may identify {ν_y ∈ M(R^n) : y ∈ R^m} as a subset of M_d(R^{n−m}). Let dν_y(x) = s_y(x) dλ^{n−m}(ϕ_{W_*}(x)) and dµ(y) = g(y) dλ^m(y). Then

dα_*(x) = g(ϕ_{V_*,b_*}(x)) s_{ϕ_{V_*,b_*}(x)}(x) dλ^{n−m}(ϕ_{W_*}(x)) dλ^m(ϕ_{V_*,b_*}(x)) = g(ϕ_{V_*,b_*}(x)) s_{ϕ_{V_*,b_*}(x)}(x) dλ^n(x).

Hence we deduce that α_* ∈ M_d(R^n). We may also check that ϕ_{V_*,b_*}(α_*) = µ. Let dµ(y) = t(y) dβ_*(y). Then

dα_*(x) = t(ϕ_{V_*,b_*}(x)) dν(x).

Finally, by Definition 3.1, we have

D_f^+(µ‖ν) ≤ D_f(α_*‖ν) = ∫_{R^n} f(t(ϕ_{V_*,b_*}(x))) dν(x) = ∫_{R^m} f(t(y)) dβ_*(y) = D_f(µ‖β_*) ≤ D_f^-(µ‖ν) + ε.

Since ε > 0 is arbitrary, we have D_f^+(µ‖ν) ≤ D_f^-(µ‖ν). □

As in the case of the Wasserstein distance, we denote the common value in (5) by D̂_f(µ‖ν) and call it the augmented f-divergence. This of course applies to all special cases of the f-divergence. What is perhaps surprising is that the relations between these distances remain true even with this extension to probability densities of different dimensions. For example, Pinsker's inequality [12] between the total variation metric d_TV and the KL-divergence D_KL also holds for the augmented total variation distance d̂_TV and augmented KL-divergence D̂_KL; another standard relation between the Hellinger squared divergence D_H and the total variation metric is preserved for their augmented counterparts too.

Corollary 3.5 (Pinsker's inequality for probability measures of different dimensions). Let m, n ∈ N, m ≤ n. For any µ ∈ M_d(R^m) and ν ∈ M_pd(R^n), we have

d̂_TV(µ, ν) ≤ √((1/2) D̂_KL(µ‖ν)).

Proof. D̂_KL(µ‖ν) = inf_{β ∈ Φ^-(ν,m)} D_KL(µ‖β) ≥ inf_{β ∈ Φ^-(ν,m)} 2 d_TV(µ, β)^2 = 2 d̂_TV(µ, ν)^2. □

Corollary 3.6. Let m, n ∈ N, m ≤ n. For any µ ∈ M_d(R^m) and ν ∈ M_pd(R^n), we have

D̂_H(µ, ν) ≤ 2 d̂_TV(µ, ν) ≤ 2√(D̂_H(µ, ν)).

Proof. It is easy to see that the two inequalities hold for D_H and d_TV when m = n. The inequidimensional version then follows from

D̂_H(µ, ν) = inf_{β ∈ Φ^-(ν,m)} D_H(µ, β) ≤ inf_{β ∈ Φ^-(ν,m)} 2 d_TV(µ, β) = 2 d̂_TV(µ, ν) ≤ inf_{β ∈ Φ^-(ν,m)} 2√(D_H(µ, β)) = 2√(D̂_H(µ, ν)). □
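Since the augmented quantities are infima of their fixed-dimension counterparts, Corollaries 3.5 and 3.6 ultimately rest on the equidimensional inequalities, which are easy to test numerically. A discrete sketch of our own, with d_TV computed as half the ℓ^1 distance and D_H as in Table 1:

```python
import numpy as np

def tv(p, q):
    # total variation: sup_A |p(A) - q(A)| = (1/2) * l1 distance
    return 0.5 * float(np.abs(np.asarray(p, float) - np.asarray(q, float)).sum())

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def hellinger_sq(p, q):
    # D_H of Table 1: sum_i q_i * (sqrt(p_i/q_i) - 1)^2 = sum_i (sqrt(p_i) - sqrt(q_i))^2
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

p = [0.1, 0.6, 0.3]
q = [0.3, 0.3, 0.4]
assert tv(p, q) <= np.sqrt(0.5 * kl(p, q)) + 1e-12            # Pinsker
assert hellinger_sq(p, q) <= 2 * tv(p, q) + 1e-12             # Corollary 3.6, left
assert 2 * tv(p, q) <= 2 * np.sqrt(hellinger_sq(p, q)) + 1e-12  # Corollary 3.6, right
print("all inequalities hold")
```

Taking infima over Φ^-(ν, m) on both sides of each assertion is exactly how the corollaries lift these relations to measures of different dimensions.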

4. Jensen–Shannon divergence

Let µ, ν ∈ M(R^n) and θ ∈ (0, 1). The Jensen–Shannon divergence is defined by

(6) D_JS(µ, ν) := (1/2) D_KL(µ‖ζ) + (1/2) D_KL(ν‖η),

where ζ := (1 − θ)µ + θν and η := (1 − θ)ν + θµ. What we call the Jensen–Shannon divergence here is slightly more general [45] than the usual definition [38],² which corresponds to the case when θ = 1/2. When θ = 1, we get the Jeffreys divergence in Table 1 as another special case. We have written D_JS(µ, ν) instead of the usual D_JS(µ‖ν) for an f-divergence to remind the reader that D_JS is symmetric in its arguments; in fact, D_JS(µ, ν)^{1/2} defines a metric on M(R^n).

The Jensen–Shannon divergence is often viewed as the symmetrization of the Kullback–Leibler divergence, but this perspective hides an important distinction, namely, the JS-divergence may be directly defined on probability measures without the requirement that they have densities: observe that µ, ν are automatically absolutely continuous with respect to ζ and η. As such, the definition in (6) is valid for any µ, ν ∈ M(R^n) and we do not need to work over M_d(R^n) as in Section 3.

The JS-divergence is used in applications to compare genomes [28, 62] and protein surfaces [48]; to quantify information flow in social and biological systems [13, 31]; and to detect anomalies in fire [44]. It was notably used to establish the main theorem in the landmark paper on Generative Adversarial Nets [21].

Lemma 4.1. Let m, n ∈ N, m ≤ n, and θ ∈ (0, 1). For any α, ν ∈ M(R^n), V ∈ O(m, n), and b ∈ R^m,

D_JS(α, ν) ≥ D_JS(ϕ_{V,b}(α), ϕ_{V,b}(ν)).

²Neither Jensen nor Shannon is a coauthor of [38]. The name comes from an application of the Jensen inequality to the Shannon entropy as a convex function to establish nonnegativity of the divergence.

Proof. The proof is similar to that of Lemma 3.3. We only need to check that

ϕ_{V,b}(ζ) = (1 − θ)ϕ_{V,b}(α) + θϕ_{V,b}(ν),  ϕ_{V,b}(η) = θϕ_{V,b}(α) + (1 − θ)ϕ_{V,b}(ν),

where ζ = (1 − θ)α + θν and η = (1 − θ)ν + θα. □

We now prove Theorem 1.2 for the Jensen–Shannon divergence.

Theorem 4.2. Let m, n ∈ N and m ≤ n. For µ ∈ M(R^m) and ν ∈ M(R^n), let

D_JS^-(µ, ν) := inf_{β ∈ Φ^-(ν,m)} D_JS(µ, β),  D_JS^+(µ, ν) := inf_{α ∈ Φ^+(µ,n)} D_JS(α, ν).

Then

(7) D_JS^-(µ, ν) = D_JS^+(µ, ν).

Proof. For any α ∈ Φ^+(µ, n), there exist V_α ∈ O(m, n) and b_α ∈ R^m with ϕ_{V_α,b_α}(α) = µ. It follows from Lemma 4.1 that D_JS(α, ν) ≥ D_JS(µ, ϕ_{V_α,b_α}(ν)) and thus

inf_{α ∈ Φ^+(µ,n)} D_JS(α, ν) ≥ inf_{α ∈ Φ^+(µ,n)} D_JS(µ, ϕ_{V_α,b_α}(ν)) ≥ inf_{V ∈ O(m,n), b ∈ R^m} D_JS(µ, ϕ_{V,b}(ν))

− + and thus DJS(µ, ν) ≤ DJS(µ, ν). − + − + We next show that DJS(µ, ν) ≥ DJS(µ, ν). By the definition of DJS(µ, ν), for each α ∈ Φ (µ, n) − and any ε > 0, there exists β∗ ∈ Φ (ν, m) such that

− − DJS(µ, ν) ≤ DJS(µ, β∗) ≤ DJS(µ, ν) + ε. m Let V∗ ∈ O(m, n) and b∗ ∈ R be such that ϕV∗,b∗ (ν) = β∗. Applying Theorem 1.5 to ϕV∗,b∗ , we n m obtain {νy ∈ M(R ): y ∈ R } satisfying Z Z Z dν(x) = dνy(x) dβ∗(y). n m ϕ−1 (y) R R V∗,b∗ n Let α∗ ∈ M(R ) be such that Z Z Z dα∗(x) = dνy(x) dµ(y). n m ϕ−1 (y) R R V∗,b∗

+ Then ϕV∗,b∗ (α∗) = µ and so α∗ ∈ Φ (µ, n). Consider the weighted measures ∗ ∗ ζ := (1 − θ)µ + θβ∗, η := θµ + (1 − θ)β∗, ξ1 := (1 − θ)α∗ + θν, ξ2 := θα∗ + (1 − θ)ν. ∗ ∗ ∗ ∗ Since µ is absolutely continuous with respect to ζ and β∗ to η , we let dµ = g1 dζ and dβ∗ = g2 dη . Then we have ∗ ∗   ϕV∗,b∗ (ξ1) = ζ , ϕV∗,b∗ (ξ2) = η , dα∗(x) = g1 ϕV∗,b∗ (x) dξ1(x), dν(x) = g2 ϕV∗,b∗ (x) dξ2(x).

+ By the definition of DJS, 1 D+ (µ, ν) ≤ D (α , ν) = D (α kξ ) + D (νkξ ) JS JS ∗ 2 KL ∗ 1 KL 2 Z 1       = log g1 ϕV∗,b∗ (x) g1 ϕV∗,b∗ (x) dξ1(x) + log g2 ϕV∗,b∗ (x) g2 ϕV∗,b∗ (x) dξ2(x) 2 Rn Z 1  ∗  ∗ 1 ∗ ∗  = log g1(y) g1(y) dζ (y) + log g2(y) g2(y) dη (y) = DKL(µkζ ) + DKL(β∗kη ) 2 Rm 2 − = DJS(µ, β∗) ≤ DJS(µ, ν) + ε.

+ − Since ε > 0 is arbitrary, we have that DJS(µ, ν) ≤ DJS(µ, ν).  DISTANCES BETWEEN PROBABILITY DISTRIBUTIONS OF DIFFERENT DIMENSIONS 13
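Although Lemma 4.1 concerns the maps ϕ_{V,b}, the same monotonicity holds for any measurable pushforward, and it is easy to check numerically on discrete measures. The sketch below is illustrative only and not part of the proof: the coarse-graining map `push`, which merges pairs of states, is our own stand-in for ϕ_{V,b}.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence of discrete distributions p, q (with 0 log 0 = 0)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q, theta=0.5):
    """Parametric Jensen-Shannon divergence of (6)."""
    zeta = (1 - theta) * p + theta * q   # mixture against which p is compared
    eta = (1 - theta) * q + theta * p    # mixture against which q is compared
    return 0.5 * kl(p, zeta) + 0.5 * kl(q, eta)

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(6))
q = rng.dirichlet(np.ones(6))

# Push both measures through the same coarse-graining map (merge states
# {0,1}, {2,3}, {4,5}); this plays the role of phi_{V,b} in Lemma 4.1.
push = lambda r: r.reshape(3, 2).sum(axis=1)

assert js(push(p), push(q)) <= js(p, q) + 1e-12
```

Taking θ = 1/2 recovers the usual JS-divergence of [38].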

5. Total Variation Distance

The total variation metric is quite possibly the most classical notion of distance between probability measures. It is used in Markov models [8, 15], stochastic processes [47], Monte Carlo algorithms [5], geometric approximation [53], and image restoration [50], among other areas. The definition is straightforward: the total variation metric between µ, ν ∈ M(R^n) is simply

d_TV(µ, ν) := sup_{A∈Σ(R^n)} |µ(A) − ν(A)|.

As we saw in Section 3, when the probability measures have densities, the total variation metric is a special case of the f-divergence with f(t) = |t − 1|/2. While it is not a special case of the Wasserstein metric in Section 2, it is related in that

d_TV(µ, ν) = inf_{γ∈Γ(µ,ν)} ∫_{R^{2n}} I_{x≠y}(x, y) dγ(x, y),

where Γ(µ, ν) is the set of couplings as in the definition of the Wasserstein metric and I_{x≠y} is the indicator function, i.e., it takes the value 1 when x ≠ y and 0 when x = y.

For any µ, ν ∈ M(R^n), it follows from the Hahn decomposition, i.e., Theorem 1.4, that there exists S ∈ Σ(R^n) such that

(8)  d_TV(µ, ν) = µ(S) − ν(S).

So for any measurable B ⊆ S and C ⊆ S^c := R^n \ S,

(9)  µ(B) − ν(B) ≥ 0,  µ(C) − ν(C) ≤ 0;

or, equivalently, for every measurable function f ≥ 0,

(10)  ∫_S f(x) d(µ − ν)(x) ≥ 0  and  ∫_{S^c} f(x) d(µ − ν)(x) ≤ 0.

Lemma 5.1. Let α, ν ∈ M(R^n). Then for any V ∈ O(m, n) and b ∈ R^m,

d_TV(α, ν) ≥ d_TV(ϕ_{V,b}(α), ϕ_{V,b}(ν)).

Proof. By (8) and (9), d_TV(ϕ_{V,b}(α), ϕ_{V,b}(ν)) = ϕ_{V,b}(α)(S) − ϕ_{V,b}(ν)(S) = α(ϕ⁻¹_{V,b}(S)) − ν(ϕ⁻¹_{V,b}(S)) ≤ d_TV(α, ν). ∎

Theorem 5.2. Let m, n ∈ N and m ≤ n. For µ ∈ M(R^m) and ν ∈ M(R^n), let

d⁻_TV(µ, ν) := inf_{β∈Φ⁻(ν,m)} d_TV(µ, β),  d⁺_TV(µ, ν) := inf_{α∈Φ⁺(µ,n)} d_TV(α, ν).

Then

(11)  d⁺_TV(µ, ν) = d⁻_TV(µ, ν).

Proof. For any α ∈ Φ⁺(µ, n), there exist V_α ∈ O(m, n) and b_α ∈ R^m with ϕ_{V_α,b_α}(α) = µ. So by Lemma 5.1, d_TV(α, ν) ≥ d_TV(µ, ϕ_{V_α,b_α}(ν)) and we get d⁺_TV(µ, ν) ≥ d⁻_TV(µ, ν) from

inf_{α∈Φ⁺(µ,n)} d_TV(α, ν) ≥ inf_{α∈Φ⁺(µ,n)} d_TV(µ, ϕ_{V_α,b_α}(ν)) ≥ inf_{V∈O(m,n), b∈R^m} d_TV(µ, ϕ_{V,b}(ν)).

We next show that d⁻_TV(µ, ν) ≥ d⁺_TV(µ, ν). By the definition of d⁻_TV(µ, ν), for any ε > 0, there exists β∗ ∈ Φ⁻(ν, m) such that

d⁻_TV(µ, ν) ≤ d_TV(µ, β∗) ≤ d⁻_TV(µ, ν) + ε.

Let V∗ ∈ O(m, n) and b∗ ∈ R^m be such that ϕ_{V∗,b∗}(ν) = β∗ and S ∈ Σ(R^m) be such that d_TV(µ, β∗) = µ(S) − β∗(S). Applying Theorem 1.5 to ϕ_{V∗,b∗}, we obtain {ν_y ∈ M(R^n) : y ∈ R^m} such that

∫_{R^n} dν(x) = ∫_{R^m} ∫_{ϕ⁻¹_{V∗,b∗}(y)} dν_y(x) dβ∗(y).

We may check that α∗ ∈ M(R^n) defined by

∫_{R^n} dα∗(x) := ∫_{R^m} ∫_{ϕ⁻¹_{V∗,b∗}(y)} dν_y(x) dµ(y)

is indeed a probability measure and ϕ_{V∗,b∗}(α∗) = µ. Partition R^n into ϕ⁻¹_{V∗,b∗}(S) and ϕ⁻¹_{V∗,b∗}(S^c). We claim that for any measurable B ⊆ ϕ⁻¹_{V∗,b∗}(S) and C ⊆ ϕ⁻¹_{V∗,b∗}(S^c),

α∗(B) − ν(B) ≥ 0,  α∗(C) − ν(C) ≤ 0.

Let g = 1_{x∈B}. Then

α∗(B) − ν(B) = ∫_{R^n} g(x) d(α∗ − ν)(x) = ∫_{R^m} ∫_{ϕ⁻¹_{V∗,b∗}(y)} g(x) dν_y(x) d(µ − β∗)(y)
= ∫_S ∫_{ϕ⁻¹_{V∗,b∗}(y)} g(x) dν_y(x) d(µ − β∗)(y) = ∫_S h(y) d(µ − β∗)(y),

where h(y) = ∫_{ϕ⁻¹_{V∗,b∗}(y)} g(x) dν_y(x) ≥ 0. By (10), we deduce that α∗(B) − ν(B) ≥ 0. Likewise, α∗(C) − ν(C) ≤ 0. Let T = ϕ⁻¹_{V∗,b∗}(S). Then for any measurable A ⊆ R^n,

α∗(A) − ν(A) = [α∗(A ∩ T) − ν(A ∩ T)] + [α∗(A ∩ T^c) − ν(A ∩ T^c)]
≤ α∗(A ∩ T) − ν(A ∩ T) ≤ α∗(T) − ν(T).

Hence we obtain

d⁺_TV(µ, ν) ≤ d_TV(α∗, ν) = α∗(T) − ν(T) = µ(S) − β∗(S) = d_TV(µ, β∗) ≤ d⁻_TV(µ, ν) + ε.

Since ε > 0 is arbitrary, d⁺_TV(µ, ν) ≤ d⁻_TV(µ, ν). ∎

Theorem 5.2 is stronger than what we may deduce from Theorem 3.4 since we do not require that our measures have densities.
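For intuition, the inequality of Lemma 5.1 can be checked numerically on discrete measures, where d_TV(µ, ν) = ½ Σₓ |µ(x) − ν(x)|. The sketch below is illustrative only; the state-merging map `push` is our own stand-in for ϕ_{V,b}.

```python
import numpy as np

def tv(p, q):
    """Total variation distance of discrete distributions:
    sup_A |p(A) - q(A)| = 0.5 * ||p - q||_1."""
    return 0.5 * float(np.abs(p - q).sum())

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(8))
q = rng.dirichlet(np.ones(8))

# A common pushforward can only cancel mass differences, never create them,
# so the total variation distance cannot increase (cf. Lemma 5.1).
push = lambda r: r.reshape(4, 2).sum(axis=1)

assert tv(push(p), push(q)) <= tv(p, q) + 1e-12
```

In the discrete case the set S of (8) is simply {x : p(x) > q(x)}.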

6. Examples

Theorems 2.2, 3.4, 4.2, and 5.2 show that to compute any of the distances therein between probability measures of different dimensions, we may either compute the projection distance d⁻ or the embedding distance d⁺. We will present five examples, three continuous and two discrete. In this section, we denote our probability measures by ρ₁, ρ₂ instead of µ, ν to avoid any clash with the standard notation for the mean of a Gaussian.

In the following, we will write N_n(µ, Σ) for the n-dimensional normal measure with mean µ ∈ R^n and covariance Σ ∈ R^{n×n}. For ρ₁ = N_n(µ₁, Σ₁) and ρ₂ = N_n(µ₂, Σ₂) ∈ M(R^n), recall that the 2-Wasserstein metric and the KL-divergence between them are given by

W₂(ρ₁, ρ₂)² = ‖µ₁ − µ₂‖₂² + tr(Σ₁ + Σ₂ − 2(Σ₂^{1/2} Σ₁ Σ₂^{1/2})^{1/2}),

D_KL(ρ₁‖ρ₂) = ½ [tr(Σ₂⁻¹Σ₁) + (µ₂ − µ₁)ᵀ Σ₂⁻¹ (µ₂ − µ₁) − n + log(det Σ₂ / det Σ₁)]

respectively. The former may be found in [49] while the latter is a routine calculation.

We adopt the standard convention that a vector in R^m will always be assumed to be a column vector, i.e., R^m ≡ R^{m×1}. A matrix X ∈ R^{m×n}, when denoted X = [x₁, …, xₙ], implicitly means that x₁, …, xₙ ∈ R^m are its column vectors, and when denoted X = [y₁, …, yₘ]ᵀ, implicitly means that y₁, …, yₘ ∈ R^n are its row vectors. The notation diag(λ₁, …, λₙ) means an n × n diagonal matrix with diagonal entries λ₁, …, λₙ ∈ R.
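The two closed-form expressions above translate directly into code. The sketch below assumes only NumPy, with an eigendecomposition-based square root for symmetric positive semidefinite matrices; the function names are ours.

```python
import numpy as np

def _sqrtm_psd(S):
    """Matrix square root of a symmetric positive semidefinite matrix."""
    w, U = np.linalg.eigh(S)
    return (U * np.sqrt(np.clip(w, 0.0, None))) @ U.T

def w2_gaussian(mu1, S1, mu2, S2):
    """2-Wasserstein distance between N(mu1, S1) and N(mu2, S2), cf. [49]."""
    r = _sqrtm_psd(S2)
    cross = _sqrtm_psd(r @ S1 @ r)
    w2sq = np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)
    return float(np.sqrt(max(w2sq, 0.0)))

def kl_gaussian(mu1, S1, mu2, S2):
    """D_KL(N(mu1, S1) || N(mu2, S2)) for nonsingular S1, S2."""
    n = len(mu1)
    S2inv = np.linalg.inv(S2)
    d = mu2 - mu1
    return 0.5 * (np.trace(S2inv @ S1) + d @ S2inv @ d - n
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))
```

For instance, for N(0, 1) and N(1, 1) in one dimension these return 1 and 1/2 respectively.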

Example 6.1 (2-Wasserstein distance between one- and n-dimensional Gaussians). Let ρ₁ = N₁(µ₁, σ²) ∈ M(R) be a one-dimensional Gaussian measure and ρ₂ = Nₙ(µ₂, Σ) ∈ M(R^n) be an n-dimensional Gaussian measure, n ∈ N arbitrary. We seek the 2-Wasserstein distance Ŵ₂(ρ₁, ρ₂) between them. By Theorem 2.2, we have the option of computing either W₂⁻(ρ₁, ρ₂) or W₂⁺(ρ₁, ρ₂), but the choice is obvious, given that the former is considerably simpler:

W₂⁻(ρ₁, ρ₂)² = min_{‖x‖₂=1, y∈R} [(µ₁ − xᵀµ₂ − y)² + σ² + xᵀΣx − 2σ√(xᵀΣx)] = min_{‖x‖₂=1} (σ − √(xᵀΣx))².

Let λ₁ and λₙ be the largest and smallest eigenvalues of Σ. Then λₙ ≤ xᵀΣx ≤ λ₁ and thus we must have

Ŵ₂(ρ₁, ρ₂) = √λₙ − σ if σ < √λₙ;  0 if √λₙ ≤ σ ≤ √λ₁;  σ − √λ₁ if σ > √λ₁.
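This case analysis is easy to implement and to sanity-check numerically: Ŵ₂(ρ₁, ρ₂) depends only on σ and the extreme eigenvalues of Σ, while every unit vector x furnishes the upper bound |σ − √(xᵀΣx)|. The sketch below, with an arbitrary illustrative Σ, verifies the bound at random unit vectors; all names are our own.

```python
import numpy as np

def w2hat_1d_vs_nd(sigma, Sigma):
    """W2-hat between N1(mu1, sigma^2) and Nn(mu2, Sigma), per Example 6.1."""
    lam = np.linalg.eigvalsh(Sigma)        # eigenvalues in ascending order
    lo, hi = np.sqrt(lam[0]), np.sqrt(lam[-1])
    if sigma < lo:
        return float(lo - sigma)
    if sigma > hi:
        return float(sigma - hi)
    return 0.0

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 0.1 * np.eye(4)          # an arbitrary SPD covariance
sigma = 0.05

# Every unit vector x gives the upper bound |sigma - sqrt(x' Sigma x)|.
for _ in range(100):
    x = rng.standard_normal(4)
    x /= np.linalg.norm(x)
    assert w2hat_1d_vs_nd(sigma, Sigma) <= abs(sigma - np.sqrt(x @ Sigma @ x)) + 1e-9
```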

Example 6.2 (KL-divergence between one- and n-dimensional Gaussians). Let ρ₁ = N₁(µ₁, σ²) ∈ M(R) and ρ₂ = Nₙ(µ₂, Σ) ∈ M(R^n) be as in Example 6.1. By Theorem 3.4, we may compute either D⁻_KL(ρ₁‖ρ₂) or D⁺_KL(ρ₁‖ρ₂), and again the simpler option is

D⁻_KL(ρ₁‖ρ₂) = min_{‖x‖₂=1, y∈R} ½ [σ²/(xᵀΣx) + (µ₁ − xᵀµ₂ − y)²/(xᵀΣx) − 1 + log(xᵀΣx/σ²)]
= min_{‖x‖₂=1} ½ [σ²/(xᵀΣx) − 1 + log(xᵀΣx/σ²)].

Again λₙ ≤ xᵀΣx ≤ λ₁, where λ₁ and λₙ are the largest and smallest eigenvalues of Σ. Since f(λ) = σ²/λ + log(λ/σ²) has f′(λ) = (λ − σ²)/λ², we obtain

D̂_KL(ρ₁‖ρ₂) = ½ [σ²/λₙ − 1 + log(λₙ/σ²)] if σ < √λₙ;  0 if √λₙ ≤ σ ≤ √λ₁;  ½ [σ²/λ₁ − 1 + log(λ₁/σ²)] if σ > √λ₁.

Example 6.3 (KL-divergence between uniform measure on the m-dimensional ball and an n-dimensional Gaussian). Let B^m = {x ∈ R^m : ‖x‖₂ ≤ 1} be the unit 2-norm ball in R^m and let ρ₁ = U(B^m) be the uniform probability measure on B^m. Let ρ₂ = Nₙ(µ₂, Σ) be an n-dimensional Gaussian measure with mean µ₂ ∈ R^n and Σ ∈ R^{n×n} symmetric positive definite. By Theorem 3.4,

D̂_KL(ρ₁‖ρ₂) = D⁻_KL(ρ₁‖ρ₂) = inf_{V∈O(m,n), b∈R^m} D_KL(ρ₁‖ϕ_{V,b}(ρ₂)).

Note that ϕ_{V,b}(ρ₂) = N_m(Vµ₂ + b, VΣVᵀ) is an m-dimensional Gaussian. Let λ₁ ≥ ⋯ ≥ λₙ > 0 be the eigenvalues of Σ and σ₁ ≥ ⋯ ≥ σₘ > 0 be the eigenvalues of VΣVᵀ. Then

D_KL(ρ₁‖ϕ_{V,b}(ρ₂)) = ½ Σᵢ₌₁ᵐ log σᵢ + 1/(2(m+2)) Σᵢ₌₁ᵐ 1/σᵢ + log Γ(m/2 + 1) + (m log 2)/2 + ½ (Vµ₂ + b)ᵀ(VΣVᵀ)⁻¹(Vµ₂ + b),

and so

(12)  inf_{b∈R^m} D_KL(ρ₁‖ϕ_{V,b}(ρ₂)) = ½ Σᵢ₌₁ᵐ log σᵢ + 1/(2(m+2)) Σᵢ₌₁ᵐ 1/σᵢ + log Γ(m/2 + 1) + (m log 2)/2,

where the minimum is attained at b = −Vµ₂ and Γ is the gamma function. Let g(σ) := (log σ)/2 + 1/[2(m+2)σ], which has its global minimum at σ = 1/(m+2). For any α ≥ β > 0, define

(13)  g_m(α, β) := g(β) if β > 1/(m+2);  g(1/(m+2)) if β ≤ 1/(m+2) ≤ α;  g(α) if α < 1/(m+2).

Thus when m = 1, we have

D̂_KL(ρ₁‖ρ₂) = g₁(λ₁, λₙ) + ½ log(π/2)
= ½ log(π/2) + 1/(6λₙ) + ½ log λₙ  if λₙ > 1/3;
= ½ log(π/6) + ½  if λₙ ≤ 1/3 ≤ λ₁;
= ½ log(π/2) + 1/(6λ₁) + ½ log λ₁  if λ₁ < 1/3.

Note that setting n = 3 answers the question we posed in the abstract: what is the KL-divergence between the uniform distribution ρ₁ = U([−1, 1]) and a Gaussian distribution ρ₂ = N₃(µ₂, Σ) on R³? More generally, suppose m < n/2. Observe that for any σ₁ ≥ ⋯ ≥ σₘ ≥ 0 with

λ_{n−m+i} ≤ σᵢ ≤ λᵢ,  i = 1, …, m,

we may construct a V ∈ O(m, n) with VΣVᵀ = diag(σ₁, …, σₘ). Let Σ = QΛQᵀ be an eigenvalue decomposition with Q = [q₁, …, qₙ] ∈ O(n). For each i = 1, …, m, let

vᵢ(θᵢ) := qᵢ sin θᵢ + q_{n−m+i} cos θᵢ ∈ R^n,  σᵢ(θᵢ) := λᵢ sin²θᵢ + λ_{n−m+i} cos²θᵢ ∈ R₊.

Then V = [v₁(θ₁), …, vₘ(θₘ)]ᵀ ∈ O(m, n) and VΣVᵀ = diag(σ₁(θ₁), …, σₘ(θₘ)). Choosing θᵢ so that σᵢ(θᵢ) = σᵢ, i = 1, …, m, gives us the required result. With this observation, it follows that when m < n/2, the minimum in (12) over V ∈ O(m, n) may be taken coordinatewise over σᵢ ∈ [λ_{n−m+i}, λᵢ], i = 1, …, m, and we obtain the closed-form expression

D̂_KL(ρ₁‖ρ₂) = Σᵢ₌₁ᵐ g_m(λᵢ, λ_{n−m+i}) + log Γ(m/2 + 1) + (m log 2)/2,

where g_m is as defined in (13).

Example 6.4 (2-Wasserstein distance between a Dirac measure on R^m and a discrete measure on R^n). Let y ∈ R^m and ρ₁ ∈ M(R^m) be the Dirac measure with ρ₁(y) = 1, i.e., all mass centered at y. Let x₁, …, x_k ∈ R^n be distinct points, p₁, …, p_k ≥ 0 with p₁ + ⋯ + p_k = 1, and let ρ₂ ∈ M(R^n) be the discrete measure of point masses with ρ₂(xᵢ) = pᵢ, i = 1, …, k. We seek the 2-Wasserstein distance Ŵ₂(ρ₁, ρ₂), which by Theorem 2.2 is given by W₂⁻(ρ₁, ρ₂). We will show it has a closed-form solution. Suppose m ≤ n; then

W₂⁻(ρ₁, ρ₂)² = inf_{V∈O(m,n), b∈R^m} Σᵢ₌₁ᵏ pᵢ‖Vxᵢ + b − y‖₂²
= inf_{V∈O(m,n)} Σᵢ₌₁ᵏ pᵢ‖Vxᵢ − Σⱼ₌₁ᵏ pⱼVxⱼ‖₂² = inf_{V∈O(m,n)} tr(VXVᵀ),

noting that the second infimum is attained at b = y − Σᵢ₌₁ᵏ pᵢVxᵢ and defining X in the last infimum to be

X := Σᵢ₌₁ᵏ pᵢ (xᵢ − Σⱼ₌₁ᵏ pⱼxⱼ)(xᵢ − Σⱼ₌₁ᵏ pⱼxⱼ)ᵀ ∈ R^{n×n}.

Let the eigenvalue decomposition of the symmetric positive semidefinite matrix X be X = QΛQᵀ with Λ = diag(λ₁, …, λₙ), λ₁ ≥ ⋯ ≥ λₙ ≥ 0. Then

inf_{V∈O(m,n)} tr(VXVᵀ) = Σᵢ₌₀^{m−1} λ_{n−i}

and the infimum is attained when V ∈ O(m, n) has row vectors given by the last m columns of Q ∈ O(n).

Example 6.5 (2-Wasserstein distance between discrete measures on R^m and R^n). More generally, we may seek the 2-Wasserstein distance between discrete probability measures ρ₁ ∈ M(R^m) and ρ₂ ∈ M(R^n). Let ρ₁ be supported on x₁, …, x_k ∈ R^m with ρ₁(xᵢ) = pᵢ, i = 1, …, k, and ρ₂ on y₁, …, y_l ∈ R^n with ρ₂(yⱼ) = qⱼ, j = 1, …, l. The optimization problem for W₂⁻(ρ₁, ρ₂)² becomes

(14)  inf_{V∈O(m,n), b∈R^m, π∈Γ(ρ₁,ρ₂)} Σᵢ₌₁ᵏ Σⱼ₌₁ˡ πᵢⱼ‖xᵢ − Vyⱼ − b‖₂²,

where

Γ(ρ₁, ρ₂) = {π ∈ R₊^{k×l} : Σⱼ₌₁ˡ πᵢⱼ = pᵢ, i = 1, …, k;  Σᵢ₌₁ᵏ πᵢⱼ = qⱼ, j = 1, …, l}.

While the solution to (14) may no longer be determined in closed form, it is a polynomial optimization problem and can be solved using the Lasserre sum-of-squares technique as a sequence of semidefinite programs [35].
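The closed-form solution of Example 6.4 thus reduces to an eigenvalue computation: form the weighted covariance X and sum its m smallest eigenvalues. The sketch below, with arbitrary illustrative atoms and weights, also verifies that every V ∈ O(m, n) gives the upper bound tr(VXVᵀ); all names are our own.

```python
import numpy as np

def w2hat_dirac_vs_discrete_sq(m, xs, ps):
    """Squared W2-hat between a Dirac measure on R^m and the discrete measure
    with atoms xs[i] in R^n and weights ps[i] (m <= n), per Example 6.4: the sum
    of the m smallest eigenvalues of X.  The Dirac point y is irrelevant, since
    the optimal shift b absorbs it."""
    xbar = ps @ xs
    c = xs - xbar
    X = (c * ps[:, None]).T @ c            # X = sum_i p_i (x_i - xbar)(x_i - xbar)^T
    return float(np.linalg.eigvalsh(X)[:m].sum())

rng = np.random.default_rng(3)
m, n, k = 2, 5, 7
xs = rng.standard_normal((k, n))           # k atoms in R^n
ps = rng.dirichlet(np.ones(k))             # their weights
best = w2hat_dirac_vs_discrete_sq(m, xs, ps)

# Any V in O(m, n) gives an upper bound tr(V X V^T) on the infimum.
for _ in range(50):
    Q, _ = np.linalg.qr(rng.standard_normal((n, m)))
    V = Q.T                                # rows of V are orthonormal
    proj = xs @ V.T
    val = float((ps * ((proj - ps @ proj) ** 2).sum(axis=1)).sum())
    assert best <= val + 1e-9
```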

7. Conclusion

We proposed a simple, natural framework for taking any p-Wasserstein metric or f-divergence and constructing a corresponding distance for probability distributions on m- and n-dimensional measure spaces where m ≠ n. The new distances preserve some well-known properties satisfied by the original distances. We saw from several examples that the new distances may be either determined in closed form or near closed form, or computed using Stiefel manifold optimization or sum-of-squares polynomial optimization. In future work, we hope to apply our framework to other less common distances like the Bhattacharyya distance [4], the Lévy–Prokhorov metric [36, 54], and the Łukaszyk–Karmowski metric [40].

Acknowledgment. We are deeply grateful to the two anonymous referees for their exceptionally helpful comments and suggestions that vastly improved our article. This work is supported by NSF IIS 1546413, DMS 1854831, and the Eckhardt Faculty Fund. YC would like to express his gratitude to Philippe Rigollet, Jonathan Niles-Weed, and Geoffrey Schiebinger for teaching him about optimal transport and for their hospitality when he visited MIT in 2018. LHL would like to thank Louis H. Y. Chen for asking the question on page 1 that led to this work.

References

[1] S.-I. Amari, O. E. Barndorff-Nielsen, R. E. Kass, S. L. Lauritzen, and C. R. Rao. Differential Geometry in Statistical Inference, volume 10 of Institute of Mathematical Statistics Lecture Notes. Institute of Mathematical Statistics, Hayward, CA, 1987.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv:1701.07875, 2017.
[3] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. CVAE-GAN: fine-grained image generation through asymmetric training. arXiv:1703.10155, 2017.
[4] A. Bhattacharyya. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc., 35:99–109, 1943.
[5] S. P. Brooks, P. Dellaportas, and G. O. Roberts. An approach to diagnosing total variation convergence of MCMC algorithms. J. Comput. Graph. Statist., 6(3):251–265, 1997.
[6] O. Calin and C. Udrişte. Geometric Modeling in Probability and Statistics. Springer, Cham, 2014.

[7] K. Chaloner and I. Verdinelli. Bayesian experimental design: a review. Statist. Sci., 10(3):273–304, 1995.
[8] T. Chen and S. Kiefer. On the total variation distance of labelled Markov chains. In Proc. Joint Meeting 23rd EACSL Ann. Conf. Comp. Sci. Logic and the 29th Ann. ACM/IEEE Symp. Logic Comp. Sci., pages 1–10, 2014.
[9] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Statistics, 23:493–507, 1952.
[10] N. Courty, R. Flamary, D. Tuia, and T. Corpetti. Optimal transport for data fusion in remote sensing. In IEEE Int. Geosci. Remote Sensing Symp., pages 3571–3574. IEEE, 2016.
[11] I. Csiszár. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutató Int. Közl., 8:85–108, 1963.
[12] I. Csiszár and J. Körner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, Cambridge, 2011.
[13] S. DeDeo, R. X. Hawkins, S. Klingenstein, and T. Hitchcock. Bootstrap methods for the empirical study of decision-making and information flows in social systems. Entropy, 15(6):2246–2276, 2013.
[14] J. Delon. Midway image equalization. J. Math. Imaging Vision, 21(2):119–134, 2004.
[15] J. Ding, E. Lubetzky, and Y. Peres. Total variation cutoff in birth-and-death chains. Probab. Theory Relat. Fields, 146(1-2):61, 2010.
[16] S. Eguchi et al. A differential geometric approach to statistical inference on the basis of contrast functionals. Hiroshima Math. J., 15(2):341–391, 1985.
[17] R. Flamary, C. Févotte, N. Courty, and V. Emiya. Optimal spectral transportation with application to music transcription. In Proc. 30th Int. Conf. Adv. Neural Inform. Process. Sys., pages 703–711, 2016.
[18] M. Fréchet. Sur la distance de deux lois de probabilité. C. R. Acad. Sci. Paris, 244:689–692, 1957.
[19] U. Frisch, S. Matarrese, R. Mohayaee, and A. Sobolevski.
A reconstruction of the initial conditions of the universe by optimal mass transportation. Nature, 417:260–262, 2002. [20] A. Galichon. Optimal Transport Methods in Economics. Princeton University Press, Princeton, NJ, 2016. [21] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proc. Int. Conf. Adv. Neural Inform. Process. Sys., pages 2672–2680. Curran Associates, Inc., 2014. [22] A. Guntuboyina. Lower bounds for the minimax risk using f-divergences, and applications. IEEE Trans. Inform. Theory, 57(4):2386–2399, 2011. [23] J. Gutierrez, J. Rabin, B. Galerne, and T. Hurtut. Optimal patch assignment for statistically constrained texture synthesis. In Scale Space and Variational Methods in Computer Vision, pages 172–183. Springer, 2017. [24] M. B. Hastings, I. Gonz´alez, A. B. Kallin, and R. G. Melko. Measuring Renyi entanglement entropy in quantum Monte Carlo simulations. Phys. Rev. Lett., 104(15):157201, 2010. [25] E. Hellinger. Neue Begr¨undungder Theorie quadratischer Formen von unendlichvielen Ver¨anderlichen. J. Reine Angew. Math., 136:210–271, 1909. [26] A. O. Hero, B. Ma, O. Michel, and J. Gorman. Alpha-divergence for classification, indexing and retrieval. Technical report, Communication and Signal Processing Laboratory, University of Michigan, Technical Report CSPL-328, 2001. [27] A. Irpino and R. Verde. Dynamic clustering of interval data using a wasserstein-based distance. Pattern Recognit. Lett., 29(11):1648–1658, 2008. [28] S. Itzkovitz, E. Hodis, and E. Segal. Overlapping codes within protein-coding sequences. Genome Res., 20(11):1582–1589, 2010. [29] H. Jeffreys. Theory of Probability. Third edition. Clarendon Press, Oxford, 1961. [30] L. V. Kantorovich. On a problem of Monge. Zap. Nauchn. Sem. S.-Peterburg. Otdel. Mat. Inst. Steklov. (POMI), 312(11):15–16, 2004. [31] S. Klingenstein, T. Hitchcock, and S. DeDeo. The civilizing process in London’s Old Bailey. Proc. 
Natl. Acad. Sci. U.S.A., 111(26):9419–9424, 2014.
[32] S. Kullback. Information Theory and Statistics. Dover Publications, Inc., Mineola, NY, 1997.
[33] S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Statistics, 22:79–86, 1951.
[34] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger. From word embeddings to document distances. In Proc. 32nd Int. Conf. Mach. Learn., pages 957–966, 2015.
[35] J. B. Lasserre. An Introduction to Polynomial and Semi-algebraic Optimization. Cambridge Texts in Applied Mathematics. Cambridge University Press, Cambridge, 2015.
[36] P. Lévy. Théorie de l'Addition des Variables Aléatoires, volume 1 of Monographies des Probabilités, publiées sous la direction de E. Borel. Gauthier–Villars, Paris, 1937.
[37] L.-H. Lim, R. Sepulchre, and K. Ye. Geometric distance between positive definite matrices of different dimensions. IEEE Trans. Inform. Theory, 65(9):5401–5405, 2019.
[38] J. Lin. Divergence measures based on the Shannon entropy. IEEE Trans. Inform. Theory, 37(1):145–151, 1991.

[39] E. M. Loiola, N. M. Maia de Abreu, P. O. Boaventura-Netto, P. Hahn, and T. Querido. A survey for the quadratic assignment problem. European J. Oper. Res., 176(2):657–690, 2007.
[40] S. Łukaszyk. A new concept of probability metric and its applications in approximation of scattered data sets. Comput. Mech., 33(4):299–304, 2004.
[41] Y. Makihara and Y. Yagi. Earth mover's morphing: Topology-free shape morphing using cluster-based EMD flows. In Computer Vision, pages 202–215. Springer, 2010.
[42] B. Mathon, F. Cayre, P. Bas, and B. Macq. Optimal transport for secure spread-spectrum watermarking of still images. IEEE Trans. Image Process., 23(4):1694–1705, 2014.
[43] F. Mémoli. Gromov–Wasserstein distances and the metric approach to object matching. Found. Comput. Math., 11(4):417–487, 2011.
[44] F.-C. Mitroi-Symeonidis, I. Anghel, and N. Minculete. Parametric Jensen–Shannon statistical complexity and its applications on full-scale compartment fire data. Symmetry, 12(1):22, 2020.
[45] F. Nielsen. A family of statistical symmetric divergences based on Jensen's inequality. arXiv:1009.4004, 2010.
[46] R. Nishii and S. Eguchi. Image classification based on Markov random field models with Jeffreys divergence. Stoch. Process. Appl., 97(9):1997–2008, 2006.
[47] I. Nourdin and G. Poly. Convergence in total variation on Wiener chaos. Stoch. Process. Appl., 123(2):651–674, 2013.
[48] Y. Ofran and B. Rost. Analysing six types of protein–protein interfaces. J. Mol. Biol., 325(2):377–387, 2003.
[49] I. Olkin and F. Pukelsheim. The distance between two random vectors with given dispersion matrices. Linear Algebra Appl., 48:257–263, 1982.
[50] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin. An iterative regularization method for total variation-based image restoration. Multiscale Model. Simul., 4(2):460–489, 2005.
[51] J. K. Pachl. Disintegration and compact measures. Math. Scand., 43(1):157–168, 1978/79.
[52] K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. London Edinburgh Philos. Mag. J. Sci., 50(302):157–175, 1900.
[53] E. A. Peköz, A. Röllin, N. Ross, et al. Total variation error bounds for geometric approximation. Bernoulli, 19(2):610–632, 2013.
[54] Y. V. Prokhorov. Convergence of random processes and limit theorems in probability theory. Teor. Veroyatnost. i Primenen., 1:177–238, 1956.
[55] C. R. Rao. A review of canonical coordinates and an alternative to correspondence analysis using Hellinger distance. Qüestiió, 19(1-3):23–63, 1995.
[56] A. Rényi. On measures of entropy and information. In Proc. 4th Berkeley Sympos. Math. Statist. and Prob., volume 1, pages 547–561. University of California Press, Berkeley, CA, 1961.
[57] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover's distance as a metric for image retrieval. Int. J. Comput. Vis., 40(2):99–121, 2000.
[58] A. Salmona, J. Delon, and A. Desolneux. Gromov–Wasserstein distances between Gaussian distributions. arXiv:2104.07970, 2021.
[59] R. Sandler and M. Lindenbaum. Nonnegative matrix factorization with earth mover's distance metric for image analysis. IEEE Trans. Pattern Anal. Mach. Intell., 33(8):1590–1602, 2011.
[60] C. E. Shannon. A mathematical theory of communication. Bell System Tech. J., 27:379–423, 623–656, 1948.
[61] S. J. Sheather. Density estimation. Statist. Sci., 19(4):588–597, 2004.
[62] G. E. Sims, S.-R. Jun, G. A. Wu, and S.-H. Kim. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc. Natl. Acad. Sci. U.S.A., 106(8):2677–2682, 2009.
[63] J. Solomon, F. De Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Trans. Graph., 34(4):66, 2015.
[64] L. N. Vasershtein. Markov processes over denumerable products of spaces describing large system of automata. Problems Inform. Transmission, 5(3):47–52, 1969.
[65] C. Villani. Optimal Transport: Old and New, volume 338 of Grundlehren der Mathematischen Wissenschaften. Springer-Verlag, Berlin, 2009.
[66] P. von Bünau, F. C. Meinecke, F. C. Király, and K.-R. Müller. Finding stationary subspaces in multivariate time series. Phys. Rev. Lett., 103(21):214101, 2009.
[67] W. Wang, J. A. Ozolek, D. Slepčev, A. B. Lee, C. Chen, and G. K. Rohde. An optimal transportation approach for nuclear structure-based pathology. IEEE Trans. Med. Imaging, pages 621–631, 2010.
[68] W. Wang, D. Slepčev, S. Basu, J. A. Ozolek, and G. K. Rohde. A linear optimal transportation framework for quantifying and visualizing variations in sets of images. Int. J. Comput. Vis., 101(2):254–269, 2013.
[69] K. Ye and L.-H. Lim. Schubert varieties and distances between subspaces of different dimensions. SIAM J. Matrix Anal. Appl., 37(3):1176–1197, 2016.

[70] M. Zhang, Y. Liu, H. Luan, M. Sun, T. Izuha, and J. Hao. Building earth mover’s distance on bilingual word embeddings for machine translation. In Proc. 30th AAAI Conf. Artif. Intell., pages 2870–2876. AAAI Press, 2016. [71] L. Zhu, Y. Yang, S. Haker, and A. Tannenbaum. An image morphing technique based on optimal mass preserving mapping. IEEE Trans. Image Process., 16(6):1481–1495, 2007.

Department of Statistics, University of Chicago, Chicago, IL 60637 Email address: [email protected], [email protected]