PROCEEDINGS OF THE AMERICAN MATHEMATICAL SOCIETY Volume 136, Number 1, January 2008, Pages 333–339 S 0002-9939(07)09020-X Article electronically published on September 27, 2007

AN ELEMENTARY PROOF OF THE TRIANGLE INEQUALITY FOR THE WASSERSTEIN METRIC

PHILIPPE CLEMENT AND WOLFGANG DESCH

(Communicated by Richard C. Bradley)

Abstract. We give an elementary proof for the triangle inequality of the p-Wasserstein metric for probability measures on separable metric spaces. Unlike known approaches, our proof does not rely on the disintegration theorem in its full generality; therefore the additional assumption that the underlying space is Radon can be omitted. We also supply a proof, not depending on disintegration, that the Wasserstein metric is complete on Polish spaces.

1. Introduction

In [1] the Wasserstein metric $W_p$ (with $p \ge 1$) is defined for probability measures on a Radon space. The proof of the triangle inequality relies on the disintegration theorem [1, Theorem 5.3.1]. The aim of this paper is to give an elementary proof of the triangle inequality for a general separable metric space.

To be more precise, we introduce the following notation and definitions (according to [1]): Let $(X, d)$ be a separable metric space with its Borel σ-algebra $\mathcal{B}(X)$. Here $\mathcal{P}(X)$ denotes the set of Borel probability measures. If $(Y, d)$ is another separable metric space, if $\mu \in \mathcal{P}(X)$ and $f : X \to Y$ is a measurable map, then $f_\#\mu$ is the image measure on $\mathcal{B}(Y)$, i.e. $(f_\#\mu)(A) = \mu(f^{-1}(A))$ for $A \in \mathcal{B}(Y)$. In particular, considering the product $X_1 \times X_2$ of two separable metric spaces, we define the canonical projections $\pi^i : X_1 \times X_2 \to X_i$. If $\gamma \in \mathcal{P}(X_1 \times X_2)$, then $\pi^i_\#\gamma$ (for $i = 1, 2$) are the marginal distributions of $\gamma$. Similarly, for products of three spaces, $\pi^{i,j} : X_1 \times X_2 \times X_3 \to X_i \times X_j$ denotes the canonical projection. If $\mu^i \in \mathcal{P}(X_i)$, we define
\[
\Gamma(\mu^1, \mu^2) = \{\gamma \in \mathcal{P}(X_1 \times X_2) : \pi^i_\#\gamma = \mu^i \text{ for } i = 1, 2\}.
\]

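Let $1 \le p < \infty$. By $\mathcal{P}_p(X)$ we denote the set of all $\mu \in \mathcal{P}(X)$ such that $\int_X d(x, y)^p \, d\mu(x) < \infty$ for some (equivalently, all) $y \in X$. For $\mu^1, \mu^2 \in \mathcal{P}_p(X)$ we define the Wasserstein metric as in [2, Section 11.8, Problem 7]:
\[
W_p(\mu^1, \mu^2) = \inf_{\gamma \in \Gamma(\mu^1, \mu^2)} \left( \int_{X \times X} d(x_1, x_2)^p \, d\gamma(x_1, x_2) \right)^{1/p}. \tag{1.1}
\]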
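As an illustration of definition (1.1), the following minimal Python sketch evaluates the transport cost of two couplings of finitely supported measures on the real line. The helper names (`marginals`, `transport_cost`) and the sample measures are assumptions made for this example and are not part of the paper. Each coupling in $\Gamma(\mu^1, \mu^2)$ only yields an upper bound for $W_p(\mu^1, \mu^2)$; the sketch does not compute the infimum.

```python
from itertools import product

def marginals(gamma):
    """Return the two marginals of a coupling given as {(x1, x2): mass}."""
    mu1, mu2 = {}, {}
    for (x1, x2), mass in gamma.items():
        mu1[x1] = mu1.get(x1, 0.0) + mass
        mu2[x2] = mu2.get(x2, 0.0) + mass
    return mu1, mu2

def transport_cost(gamma, d, p):
    """(sum of d(x1, x2)^p * gamma(x1, x2))^(1/p), the quantity minimised in (1.1)."""
    return sum(mass * d(x1, x2) ** p for (x1, x2), mass in gamma.items()) ** (1.0 / p)

d = lambda x, y: abs(x - y)           # metric on the real line
mu1 = {0.0: 0.5, 1.0: 0.5}            # hypothetical sample measures
mu2 = {1.0: 0.5, 2.0: 0.5}

# Two elements of Gamma(mu1, mu2): the product measure and a "shift" coupling.
product_coupling = {(x, y): mu1[x] * mu2[y] for x, y in product(mu1, mu2)}
shift_coupling = {(0.0, 1.0): 0.5, (1.0, 2.0): 0.5}

cost_product = transport_cost(product_coupling, d, 2)
cost_shift = transport_cost(shift_coupling, d, 2)
print(cost_product, cost_shift)       # both are upper bounds for W_2(mu1, mu2)
```

Both couplings have the prescribed marginals, yet their costs differ, which is why (1.1) takes an infimum over all of $\Gamma(\mu^1, \mu^2)$.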

In [2, Section 11.8.3] it is shown for $p = 1$ that $W_1$ is a metric. Moreover, [2, Section 11.8.3, Problem 9] consists in proving that $W_p$ is a metric for $p \ge 1$, provided $(X, d)$ is a Polish space. In [1, Section 7.1] it is proved that $W_p$ is a metric for $p \ge 1$ under the more general assumption that $(X, d)$ is a separable Radon space. In this note (Section 2) we show that the triangle inequality can be proved by more elementary means, thus extending its validity to general separable metric spaces. The strategy of our proof follows quite closely that of [2]. However, in the case of a countable metric space, the disintegration can be done in a very straightforward way without requiring higher tools of measure theory. The generalisation from countable to separable metric spaces is done by an approximation procedure.

In [1], the disintegration theorem is also used to prove the completeness of $(\mathcal{P}_p(X), W_p)$ when $(X, d)$ is complete. In Section 3 of this note we give a proof which does not rely on the disintegration theorem but on the completeness of $(\mathcal{P}(X), \beta)$, where $\beta$ is the dual bounded Lipschitz metric (see [2, p. 394]).

Received by the editors October 30, 2006.
2000 Mathematics Subject Classification. Primary 60B05.
Key words and phrases. Wasserstein metric, triangle inequality, probability measures on metric spaces.
©2007 American Mathematical Society
License or copyright restrictions may apply to redistribution; see https://www.ams.org/journal-terms-of-use

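2. The triangle inequality

We begin with a proof of the triangle inequality in the case where $(X, d)$ is a countable metric space.

Proposition 2.1. Let $(X, d)$ be a countable metric space and let $1 \le p < \infty$. Let $\mu^1, \mu^2, \mu^3 \in \mathcal{P}_p(X)$, $\gamma^{1,2} \in \Gamma(\mu^1, \mu^2)$ and $\gamma^{2,3} \in \Gamma(\mu^2, \mu^3)$. Then there exists some $\gamma^{1,3} \in \Gamma(\mu^1, \mu^3)$ such that
\[
\left( \int_{X \times X} d(x_1, x_3)^p \, d\gamma^{1,3}(x_1, x_3) \right)^{1/p}
\le \left( \int_{X \times X} d(x_1, x_2)^p \, d\gamma^{1,2}(x_1, x_2) \right)^{1/p}
+ \left( \int_{X \times X} d(x_2, x_3)^p \, d\gamma^{2,3}(x_2, x_3) \right)^{1/p}.
\]
In particular, $W_p(\mu^1, \mu^3) \le W_p(\mu^1, \mu^2) + W_p(\mu^2, \mu^3)$.

Proof. Let $X = \{v_1, v_2, \cdots\}$. For short, we denote $\mu^i_k = \mu^i(\{v_k\})$ and $\gamma^{i,j}_{k,l} = \gamma^{i,j}(\{(v_k, v_l)\})$ whenever $i, j \in \{1, 2, 3\}$ and $k, l \in \mathbb{N}$. We define (in accordance with the notation just introduced) the measure
\[
\gamma = \sum_{k,m,n} \gamma_{k,m,n} \, (\delta_{v_k} \times \delta_{v_m} \times \delta_{v_n})
\quad \text{with} \quad
\gamma_{k,m,n} =
\begin{cases}
\dfrac{\gamma^{1,2}_{k,m} \, \gamma^{2,3}_{m,n}}{\mu^2_m} & \text{if } \mu^2_m \ne 0, \\[1ex]
0 & \text{if } \mu^2_m = 0.
\end{cases}
\]
Since the first marginal of $\gamma^{2,3}$ equals $\mu^2$, we obtain
\[
\pi^{1,2}_\#\gamma(\{(v_k, v_m)\}) =
\begin{cases}
\displaystyle\sum_n \frac{\gamma^{1,2}_{k,m} \, \gamma^{2,3}_{m,n}}{\mu^2_m} = \gamma^{1,2}_{k,m} & \text{if } \mu^2_m \ne 0, \\[1ex]
0 = \gamma^{1,2}_{k,m} & \text{if } \mu^2_m = 0.
\end{cases}
\]
Similarly, $\pi^{2,3}_\#\gamma = \gamma^{2,3}$. Since the marginals are probability measures, we infer that $\gamma$ is a probability measure. Moreover it follows for $j = 1, 2, 3$ that $\pi^j_\#\gamma = \mu^j$. We define $\gamma^{1,3} = \pi^{1,3}_\#\gamma$, which is again a probability measure and has marginals $\mu^1$ and $\mu^3$.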


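Now use the definition of $\gamma^{1,3}$ and Minkowski's inequality to estimate
\begin{align*}
\left( \int_{X \times X} d(x_1, x_3)^p \, d\gamma^{1,3}(x_1, x_3) \right)^{1/p}
&= \left( \int_{X \times X \times X} d(x_1, x_3)^p \, d\gamma(x_1, x_2, x_3) \right)^{1/p} \\
&\le \left( \int_{X \times X \times X} [d(x_1, x_2) + d(x_2, x_3)]^p \, d\gamma(x_1, x_2, x_3) \right)^{1/p} \\
&\le \left( \int_{X \times X \times X} d(x_1, x_2)^p \, d\gamma(x_1, x_2, x_3) \right)^{1/p}
 + \left( \int_{X \times X \times X} d(x_2, x_3)^p \, d\gamma(x_1, x_2, x_3) \right)^{1/p} \\
&= \left( \int_{X \times X} d(x_1, x_2)^p \, d\gamma^{1,2}(x_1, x_2) \right)^{1/p}
 + \left( \int_{X \times X} d(x_2, x_3)^p \, d\gamma^{2,3}(x_2, x_3) \right)^{1/p}.
\end{align*}
Thus $\gamma^{1,3}$ satisfies the desired estimate. □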
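The construction in the proof of Proposition 2.1 can be tried out on a finite space. The sketch below is only an illustration under stated assumptions: the function names (`glue`, `project13`, `cost`), the dictionary representation of measures, and the sample couplings on integer points of the real line are all hypothetical choices and not taken from the paper.

```python
def glue(gamma12, gamma23):
    """gamma[k, m, n] = gamma12[k, m] * gamma23[m, n] / mu2[m] when mu2[m] > 0,
    where mu2 is the common marginal (second of gamma12, first of gamma23)."""
    mu2 = {}
    for (_, m), mass in gamma12.items():
        mu2[m] = mu2.get(m, 0.0) + mass
    gamma = {}
    for (k, m), a in gamma12.items():
        for (m2, n), b in gamma23.items():
            if m2 == m and mu2[m] > 0:
                gamma[(k, m, n)] = a * b / mu2[m]
    return gamma

def project13(gamma):
    """Image of the three-point measure under (x1, x2, x3) -> (x1, x3)."""
    out = {}
    for (k, _, n), mass in gamma.items():
        out[(k, n)] = out.get((k, n), 0.0) + mass
    return out

def cost(gamma, d, p):
    """(sum of d(x, y)^p * gamma(x, y))^(1/p) for a two-point coupling."""
    return sum(w * d(x, y) ** p for (x, y), w in gamma.items()) ** (1.0 / p)

d = lambda x, y: abs(x - y)
p = 2
# Hypothetical couplings; the second marginal of gamma12 equals the first of gamma23.
gamma12 = {(0, 1): 0.5, (1, 1): 0.25, (1, 2): 0.25}
gamma23 = {(1, 3): 0.5, (1, 2): 0.25, (2, 2): 0.25}

gamma13 = project13(glue(gamma12, gamma23))
lhs = cost(gamma13, d, p)
rhs = cost(gamma12, d, p) + cost(gamma23, d, p)
print(lhs <= rhs)
```

Here `gamma13` has the first marginal of `gamma12` and the second marginal of `gamma23`, and its cost is bounded by the sum of the two given costs, as the Minkowski estimate predicts.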

The extension from the case of a countable metric space to a separable metric space is done by an approximation procedure. We will need several lemmas.

Lemma 2.2. Let $(X, d)$ be a separable metric space and let $\tilde{X}$ be a countable dense subset of $X$. Then, for each $\varepsilon > 0$, there exists a Borel measurable map $f : X \to \tilde{X}$ such that $d(x, f(x)) < \varepsilon$ for each $x \in X$. The sets $S_i = f^{-1}(\{v_i\})$ form a partition of $X$.

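Proof. Let $\tilde{X} = \{v_1, v_2, \cdots\}$. We define inductively
\[
S_1 = B(v_1, \varepsilon), \qquad S_i = B(v_i, \varepsilon) \setminus \bigcup_{j < i} S_j,
\]
and set $f(x) = v_i$ whenever $x \in S_i$. The sets $S_i$ are pairwise disjoint Borel sets, and they cover $X$ because $\tilde{X}$ is dense; hence $f$ is well defined, Borel measurable, and satisfies $d(x, f(x)) < \varepsilon$. □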
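The map $f$ of Lemma 2.2 sends $x$ to the first point $v_i$ of the enumeration with $d(x, v_i) < \varepsilon$, which need not be the nearest one. A small Python sketch of this rule follows; the finite list `centres` is a hypothetical stand-in for the countable dense set, and the function name is an assumption made for the example.

```python
def first_centre_within(x, centres, eps, d):
    """Value f(x) for S_1 = B(v_1, eps), S_i = B(v_i, eps) minus the earlier S_j:
    x lies in S_i exactly when v_i is the first centre with d(x, v_i) < eps."""
    for v in centres:
        if d(x, v) < eps:
            return v
    raise ValueError("no centre within eps of x: the truncated list is too coarse")

d = lambda x, y: abs(x - y)
centres = [0.0, 0.25, 0.5, 0.75, 1.0]    # hypothetical finite stand-in for the dense set
eps = 0.2
grid = [i / 100 for i in range(101)]      # sample points of [0, 1]
images = [first_centre_within(x, centres, eps, d) for x in grid]

# 0.4 is closer to 0.5, but 0.25 comes first in the enumeration and is within eps.
print(first_centre_within(0.4, centres, eps, d))
```

Choosing the first admissible centre rather than the nearest one is what makes the sets $S_i$ disjoint by construction.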

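Remark 2.3. By skipping the indices where $S_i = \emptyset$ and renumbering, we can obtain a partition of $X$ into nonempty sets $S_i$.

Lemma 2.4. Let $X$ be a separable metric space and let $\tilde{X}$ be a countable dense subset of $X$. Let $\varepsilon > 0$ and let $f$ be given according to Lemma 2.2. Moreover, let $\gamma \in \mathcal{P}(X \times X)$ and $\tilde{\gamma} \in \mathcal{P}(\tilde{X} \times \tilde{X})$ be such that $\tilde{\gamma} = (f \times f)_\#\gamma$. Then the following assertions hold:
(1) the marginals satisfy (for $i = 1, 2$) $\pi^i_\#\tilde{\gamma} = f_\#(\pi^i_\#\gamma)$,
(2) $\left| \left( \int_{X \times X} d(x, y)^p \, d\gamma(x, y) \right)^{1/p} - \left( \int_{\tilde{X} \times \tilde{X}} d(x, y)^p \, d\tilde{\gamma}(x, y) \right)^{1/p} \right| \le 2\varepsilon$.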

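Proof. Let $\tilde{U} \subset \tilde{X}$. Then
\[
(\pi^1_\#\tilde{\gamma})(\tilde{U}) = \tilde{\gamma}(\tilde{U} \times \tilde{X}) = \gamma(f^{-1}(\tilde{U}) \times X) = (\pi^1_\#\gamma)(f^{-1}(\tilde{U})) = [f_\#(\pi^1_\#\gamma)](\tilde{U}).
\]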


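Of course, the same proof holds for $\pi^2$. To obtain the estimate, we utilize the fact that $\tilde{\gamma} = (f \times f)_\#\gamma$ and Minkowski's inequality:
\begin{align*}
&\left| \left( \int_{X \times X} d(x, y)^p \, d\gamma(x, y) \right)^{1/p} - \left( \int_{\tilde{X} \times \tilde{X}} d(x, y)^p \, d\tilde{\gamma}(x, y) \right)^{1/p} \right| \\
&\qquad = \left| \left( \int_{X \times X} d(x, y)^p \, d\gamma(x, y) \right)^{1/p} - \left( \int_{X \times X} d(f(x), f(y))^p \, d\gamma(x, y) \right)^{1/p} \right| \\
&\qquad \le \left( \int_{X \times X} |d(x, y) - d(f(x), f(y))|^p \, d\gamma(x, y) \right)^{1/p} \\
&\qquad \le \left( \int_{X \times X} (2\varepsilon)^p \, d\gamma(x, y) \right)^{1/p} = 2\varepsilon. \qquad □
\end{align*}

Theorem 2.5. Let $(X, d)$ be a separable metric space and let $1 \le p < \infty$. Let $\mu^1, \mu^2, \mu^3 \in \mathcal{P}_p(X)$. Then $W_p(\mu^1, \mu^3) \le W_p(\mu^1, \mu^2) + W_p(\mu^2, \mu^3)$.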

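Proof. Let $\tilde{X} = \{v_1, v_2, \cdots\}$ be a dense countable subset of $X$, let $\varepsilon > 0$, and let $f$ be given according to Lemma 2.2. Let $S_i = f^{-1}(\{v_i\})$. For $i = 1, 2, 3$ we define $\tilde{\mu}^i = f_\#\mu^i$. Now, for the pairs $(i, j) \in \{(1, 2), (2, 3)\}$, let $\gamma^{i,j} \in \Gamma(\mu^i, \mu^j)$ be such that
\[
\left( \int_{X \times X} d(x_i, x_j)^p \, d\gamma^{i,j}(x_i, x_j) \right)^{1/p} \ldots
\]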


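From this we conclude that $W_p(\mu^1, \mu^3) \le W_p(\mu^1, \mu^2) + W_p(\mu^2, \mu^3) + 6\varepsilon$. Letting $\varepsilon \to 0$ we infer the desired triangle inequality.

Now we show that $\gamma^{1,3} \in \Gamma(\mu^1, \mu^3)$. We will use the definition of $\gamma^{1,3}$ and the marginals of $\tilde{\gamma}^{1,3}$. In the following computation the sum has to be understood over all indices where the denominators are nonzero. Notice that $\tilde{\mu}^1_k = 0$ or $\tilde{\mu}^3_n = 0$ implies $\tilde{\gamma}^{1,3}_{k,n} = 0$.
\begin{align*}
\pi^1_\#\gamma^{1,3}(V) = \gamma^{1,3}(V \times X)
&= \sum_{k,n=1}^{\infty} \frac{\tilde{\gamma}^{1,3}_{k,n}}{\tilde{\mu}^1_k \tilde{\mu}^3_n} \, (\mu^1 \times \mu^3)[(V \cap S_k) \times S_n] \\
&= \sum_{k,n=1}^{\infty} \frac{\tilde{\gamma}^{1,3}_{k,n}}{\tilde{\mu}^1_k} \, \mu^1(V \cap S_k)
 = \sum_{k=1}^{\infty} \frac{1}{\tilde{\mu}^1_k} \, \mu^1(V \cap S_k) \sum_{n=1}^{\infty} \tilde{\gamma}^{1,3}_{k,n} \\
&= \sum_{k=1}^{\infty} \mu^1(V \cap S_k) = \mu^1(V).
\end{align*}
To show that $\tilde{\gamma}^{1,3} = (f \times f)_\#\gamma^{1,3}$, consider $\tilde{U} \subset \tilde{X} \times \tilde{X}$. Then $(f \times f)^{-1}(\tilde{U}) = \bigcup_{k,n : (v_k, v_n) \in \tilde{U}} S_k \times S_n$. Consequently,
\begin{align*}
(f \times f)_\#\gamma^{1,3}(\tilde{U})
&= \gamma^{1,3}\Bigl( \bigcup_{k,n : (v_k, v_n) \in \tilde{U}} S_k \times S_n \Bigr) \\
&= \sum_{k,n : (v_k, v_n) \in \tilde{U}} \frac{\tilde{\gamma}^{1,3}_{k,n}}{\tilde{\mu}^1_k \tilde{\mu}^3_n} \, (\mu^1 \times \mu^3)(S_k \times S_n)
 = \sum_{k,n : (v_k, v_n) \in \tilde{U}} \tilde{\gamma}^{1,3}_{k,n} = \tilde{\gamma}^{1,3}(\tilde{U}).
\end{align*}
Thus the proof is finished. □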

3. Completeness

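Here we give an alternative proof of the completeness of $(\mathcal{P}_p(X), W_p)$ whenever $X$ is complete (see [1, Proposition 7.1.5]). Throughout this section, let $1 \le p < \infty$. We recall the definition of the dual bounded Lipschitz metric for $\mu, \nu \in \mathcal{P}(X)$:
\[
\beta(\mu, \nu) = \sup \left| \int_X f(x) \, d\mu(x) - \int_X f(x) \, d\nu(x) \right| \tag{3.1}
\]
where the supremum is taken over all bounded and Lipschitz continuous $f : X \to \mathbb{R}$ such that $\|f\|_\infty + [f]_{\mathrm{Lip}} \le 1$. Convergence with respect to $\beta$ is equivalent to narrow convergence [2, 11.3.3]. Moreover, for $\mu, \nu \in \mathcal{P}_p(X)$
\[
\beta(\mu, \nu) \le W_p(\mu, \nu). \tag{3.2}
\]
Indeed, let $f$ be bounded and Lipschitz with $[f]_{\mathrm{Lip}} \le 1$. For $\gamma \in \Gamma(\mu, \nu)$ we have
\begin{align*}
\left| \int_X f(x) \, d\mu(x) - \int_X f(x) \, d\nu(x) \right|
&= \left| \int_{X \times X} (f(x) - f(y)) \, d\gamma(x, y) \right| \\
&\le \int_{X \times X} |f(x) - f(y)| \, d\gamma(x, y) \le \int_{X \times X} d(x, y) \, d\gamma(x, y) \\
&\le \left( \int_{X \times X} d(x, y)^p \, d\gamma(x, y) \right)^{1/p}.
\end{align*}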
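The chain of inequalities behind (3.2) can be checked numerically for one admissible test function and one coupling. The Python sketch below is an illustration only: the test function `clipped_half`, the sample measures, and the coupling are assumptions chosen for this example. A single $f$ gives a lower bound for $\beta(\mu, \nu)$ and a single coupling gives an upper bound for $W_p(\mu, \nu)$, so the sketch checks the displayed estimate rather than (3.2) itself.

```python
def clipped_half(x):
    """Test function with sup norm 1/2 and Lipschitz constant 1/2, so their sum is 1."""
    return 0.5 * max(-1.0, min(1.0, x))

def integral(f, mu):
    """Integral of f against a finitely supported measure {point: mass}."""
    return sum(mass * f(x) for x, mass in mu.items())

def coupling_cost(gamma, p):
    """(sum of |x - y|^p * gamma(x, y))^(1/p) on the real line."""
    return sum(mass * abs(x - y) ** p for (x, y), mass in gamma.items()) ** (1.0 / p)

mu = {0.0: 0.5, 1.0: 0.5}                      # hypothetical sample measures
nu = {1.0: 0.5, 2.0: 0.5}
gamma = {(0.0, 2.0): 0.5, (1.0, 1.0): 0.5}     # a coupling of mu and nu

lhs = abs(integral(clipped_half, mu) - integral(clipped_half, nu))
cost_1 = coupling_cost(gamma, 1)               # integral of d against gamma
cost_p = coupling_cost(gamma, 3)               # (integral of d^p against gamma)^(1/p), p = 3
print(lhs, cost_1, cost_p)
```

The three printed numbers are nondecreasing, matching the three steps of the displayed estimate.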


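Let $(\mu_n)$ be a Cauchy sequence in $(\mathcal{P}_p(X), W_p)$. By (3.2) it is also a Cauchy sequence in $(\mathcal{P}(X), \beta)$. Since $(X, d)$ is complete, $(\mathcal{P}(X), \beta)$ is also complete by [2, Corollary 11.5.5]. Let $\mu$ denote the limit of $\mu_n$ with respect to the metric $\beta$. Given $\varepsilon > 0$, let $N \ge 1$ be such that $W_p(\mu_m, \mu_n) \le \varepsilon$ for all $m, n \ge N$. We claim that given $\bar{x} \in X$, $n \ge N$
\begin{align*}
\int_X d(x, \bar{x})^p \, d\mu(x) &\le 2^p \varepsilon^p + 2^p \int_X d(y, \bar{x})^p \, d\mu_n(y), \tag{3.3} \\
W_p(\mu, \mu_n) &\le \varepsilon. \tag{3.4}
\end{align*}
This implies that $\mu \in \mathcal{P}_p(X)$ and $W_p(\mu_n, \mu) \to 0$.

By [1, Section 7.1], since $(X, d)$ is complete, the infimum in (1.1) is in fact a minimum. For $m, n \ge 1$ let $\gamma_{m,n} \in \Gamma(\mu_m, \mu_n)$ be such that
\[
W_p^p(\mu_m, \mu_n) = \int_{X \times X} d(x, y)^p \, d\gamma_{m,n}(x, y). \tag{3.5}
\]
Since $\mu_m \to \mu$ narrowly, [1, Lemma 5.2.2] implies that for each $n$ there exist $\gamma_n \in \mathcal{P}(X \times X)$ and a subsequence $\gamma_{m_k,n}$ such that
\[
\gamma_{m_k,n} \to \gamma_n \text{ narrowly in } \mathcal{P}(X \times X). \tag{3.6}
\]
Since the maps $\pi^i_\#$ ($i = 1, 2$) are continuous with respect to narrow convergence, we infer that
\[
\gamma_n \in \Gamma(\mu, \mu_n). \tag{3.7}
\]
Let $\bar{x} \in X$. Then, for $m_k, n \ge N$,
\begin{align*}
\int_X d(x, \bar{x})^p \, d\mu_{m_k}(x)
&= \int_{X \times X} d(x, \bar{x})^p \, d\gamma_{m_k,n} \\
&\le 2^p \int_{X \times X} d(x, y)^p \, d\gamma_{m_k,n} + 2^p \int_{X \times X} d(y, \bar{x})^p \, d\gamma_{m_k,n} \\
&\le 2^p \varepsilon^p + 2^p \int_X d(y, \bar{x})^p \, d\mu_n.
\end{align*}
In the last step we have utilized (3.5). Since $\mu_{m_k}$ converges narrowly to $\mu$, claim (3.3) follows from the Portmanteau theorem (see [1, (5.1.15)]). Similarly,
\[
\int_{X \times X} d(x, y)^p \, d\gamma_n(x, y) \le \liminf_{k \to \infty} \int_{X \times X} d(x, y)^p \, d\gamma_{m_k,n}(x, y) = \liminf_{k \to \infty} W_p^p(\mu_{m_k}, \mu_n) \le \varepsilon^p.
\]
By (3.7) we have
\[
W_p^p(\mu, \mu_n) \le \int_{X \times X} d(x, y)^p \, d\gamma_n(x, y) \le \varepsilon^p.
\]
This proves claim (3.4). Thus our proof is complete.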

Acknowledgments

The authors wish to thank L. Ambrosio, G. Savaré, and C. Villani for their valuable comments. In particular a suggestion by G. Savaré helped to make the technicalities in Section 2 shorter and the proof of Proposition 2.1 more transparent.


References

[1] L. Ambrosio, N. Gigli, G. Savaré, Gradient Flows in Metric Spaces and in the Space of Probability Measures, Birkhäuser, 2005. MR2129498 (2006k:49001)
[2] R. M. Dudley, Real Analysis and Probability, Cambridge Studies in Advanced Mathematics 74, Cambridge University Press, 2002. MR1932358 (2003h:60001)

Mathematical Institute, Leiden University, P. O. Box 9512, NL-2300 RA Leiden, The Netherlands
E-mail address: [email protected]

Institut für Mathematik und Wissenschaftliches Rechnen, Karl-Franzens-Universität Graz, Heinrichstrasse 36, 8010 Graz, Austria
E-mail address: [email protected]
