Smooth p-Wasserstein Distance: Structure, Empirical Approximation, and Statistical Applications

Sloan Nietert 1 Ziv Goldfeld 2 Kengo Kato 3

1Department of Computer Science, Cornell University, Ithaca, NY. 2School of Electrical and Computer Engineering, Cornell University, Ithaca, NY. 3Department of Statistics and Data Science, Cornell University, Ithaca, NY. Correspondence to: Sloan Nietert.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Discrepancy measures between probability distributions, often termed statistical distances, are ubiquitous in data science, statistics, and machine learning. To combat the curse of dimensionality when estimating these distances from data, recent work has proposed smoothing out local irregularities in the measured distributions via convolution with a Gaussian kernel. Motivated by the scalability of this framework to high dimensions, we investigate the structural and statistical behavior of the Gaussian-smoothed p-Wasserstein distance W_p^{(σ)}, for arbitrary p ≥ 1. After establishing basic metric and topological properties of W_p^{(σ)}, we explore the asymptotic statistical behavior of W_p^{(σ)}(μ̂_n, μ), where μ̂_n is the empirical distribution of n independent observations from μ. We prove that W_p^{(σ)} enjoys a parametric empirical convergence rate of n^{−1/2}, which contrasts the n^{−1/d} rate for unsmoothed W_p when d ≥ 3. Our proof relies on controlling W_p^{(σ)} by a pth-order smooth Sobolev distance d_p^{(σ)} and deriving the limit distribution of √n d_p^{(σ)}(μ̂_n, μ) for all dimensions d. As applications, we provide asymptotic guarantees for two-sample testing and minimum distance estimation using W_p^{(σ)}, with experiments for p = 2 using a maximum mean discrepancy formulation of d_2^{(σ)}.

1. Introduction

The Wasserstein distance W_p is a discrepancy measure between probability distributions rooted in the theory of optimal transport (Villani, 2003; 2008). It has seen a surge of applications in data science and ML, ranging from generative modeling (Arjovsky et al., 2017; Gulrajani et al., 2017; Tolstikhin et al., 2018) and image recognition (Rubner et al., 2000; Sandler & Lindenbaum, 2011; Li et al., 2013) to domain adaptation (Courty et al., 2014; 2016) and robust optimization (Mohajerin Esfahani & Kuhn, 2018; Blanchet et al., 2018; Gao & Kleywegt, 2016). The widespread use of this statistical distance is driven by an array of desirable properties, including its metric structure, a convenient dual form, and its robustness to support mismatch.

In applications, the Wasserstein distance is often estimated from samples. However, the error of these empirical estimates suffers from an exponential dependence on dimension that presents an obstacle to sample-efficient bounds for inference and learning. More specifically, the rate at which W_p(μ̂_n, μ) converges to 0, where μ̂_n is an empirical measure based on n independent samples from μ, scales as n^{−1/d} under mild conditions, for d ≥ 3 (Dereich et al., 2013; Boissard & Le Gouic, 2014; Fournier & Guillin, 2015; Panaretos & Zemel, 2019; Weed & Bach, 2019; Lei, 2020). This rate deteriorates poorly with dimension and seems at odds with the scalability of empirical W_p observed in modern machine learning practice.

1.1. Smooth Wasserstein Distances

Gaussian smoothing was recently introduced as a means to alleviate the curse of dimensionality of empirical W_p, while preserving the virtuous structural properties of the classic framework (Goldfeld et al., 2020b; Goldfeld & Greenewald, 2020; Goldfeld et al., 2020a). Specifically, the σ-smooth p-Wasserstein distance is W_p^{(σ)}(μ, ν) := W_p(μ ∗ N_σ, ν ∗ N_σ), where N_σ = N(0, σ²I) is the isotropic Gaussian measure of parameter σ. Goldfeld & Greenewald (2020) showed that W_1^{(σ)} inherits the metric and topological structure of W_1 and approximates it within an O(σ√d) gap. At the same time, empirical convergence rates for W_p^{(σ)} are much faster. As shown in (Goldfeld et al., 2020b), E[W_1^{(σ)}(μ̂_n, μ)] = O(n^{−1/2}) when μ is sub-Gaussian, in any dimension, i.e., it exhibits a parametric convergence rate. This fast rate was also established for W_2^{(σ)}, but only when the sub-Gaussian constant is smaller than σ/2. These results significantly depart from the n^{−1/d} rate in the unsmoothed case.

Follow-up work (Goldfeld et al., 2020a) developed a limit distribution theory for W_1^{(σ)}, showing that √n W_1^{(σ)}(μ̂_n, μ) converges in distribution to the supremum of a tight Gaussian process, for all dimensions d and under a milder moment condition. Such limit distribution results are known for unsmoothed W_p only when p ∈ {1, 2} and d = 1 (del Barrio et al., 1999; 2005) or μ is supported on a finite or countable set (Sommerfeld & Munk, 2018; Tameling et al., 2019), but are a wide open question otherwise. Other works investigated the behavior of W_p^{(σ)} as σ → ∞ (Chen & Niles-Weed, 2020), and established related results for other statistical distances, including total variation (TV), Kullback-Leibler (KL) divergence, and χ²-divergence (Goldfeld et al., 2020b; Chen & Niles-Weed, 2020).

1.2. Contributions

We focus on the smooth p-Wasserstein distance W_p^{(σ)} for p > 1 and arbitrary dimension d. We first explore basic structural properties of W_p^{(σ)}, proving that many of the beneficial attributes of W_p carry over to the smooth setting. We show that W_p^{(σ)} is a metric and induces the same topology as W_p. Then, we prove that W_p^{(σ)} is stable under small perturbations of the smoothing parameter σ, implying, in particular, that W_p^{(σ)} → W_p as σ → 0. We then extend the stability of optimal transport distances to that of transport plans, establishing weak convergence of the optimal couplings for W_p^{(σ)} to those of W_p as σ shrinks.

Moving on to a statistical analysis, we explore empirical convergence for W_p^{(σ)}. Elementary techniques imply that E[W_p^{(σ)}(μ̂_n, μ)] = O(n^{−1/(2p)}) under a mild moment condition. While this rate is independent of d, it is suboptimal in p, with the expected answer being n^{−1/2}, as previously established for p = 1, 2 (Goldfeld et al., 2020a;b). To get the correct rate, we establish a comparison between W_p^{(σ)} and a smooth pth-order Sobolev integral probability metric (IPM) d_p^{(σ)}; the latter lends itself well to tools from empirical process theory. Under a sub-Gaussian assumption, we prove that the function class defining d_p^{(σ)} is μ-Donsker, giving a limit distribution for √n d_p^{(σ)}(μ̂_n, μ) that implies the n^{−1/2} rate for W_p^{(σ)}. We conclude with a concentration inequality for W_p^{(σ)}(μ̂_n, μ).

We next turn to computational aspects, first showing that d_2^{(σ)} is efficiently computable as a maximum mean discrepancy (MMD) and characterizing its reproducing kernel. Next, we consider applications to two-sample testing and generative modeling using W_p^{(σ)}. We construct two-sample tests based on the smooth p-Wasserstein statistic that achieve asymptotic consistency and correct asymptotic level. For generative modeling, we examine minimum distance estimation with W_p^{(σ)} and establish measurability, consistency, and parametric convergence rates, along with finite-sample generalization guarantees in arbitrary dimension. Many of these directions (beyond measurability and consistency) are intractable with standard W_p unless d = 1. We conclude with numerical results that support our theory.

1.3. Related Discrepancy Measures

The sliced Wasserstein distance (Rabin et al., 2011) takes an average (or maximum (Deshpande et al., 2019)) of one-dimensional Wasserstein distances over random projections of the d-dimensional distributions. Like the smooth framework considered herein, the sliced distance also exhibits an n^{−1/2} empirical convergence rate and has characterized limit distributions in some cases (Nadjahi et al., 2019; 2020). In (Nadjahi et al., 2019), sliced W_1 was shown to be a metric that induces a topology at least as fine as that of weak convergence, akin to W_1^{(σ)}. They further examined generative modeling via minimum sliced Wasserstein estimation, establishing measurability, consistency, and some limit theorems. However, while W_p^{(σ)} converges to W_p as σ → 0, there is no approximation parameter for these sliced distances, and comparisons to the standard distances typically require compact support and feature dimension-dependent multiplicative constants (see, e.g., (Bonnotte, 2013)).

Another relevant framework is entropic optimal transport (EOT), which admits efficient algorithms (Cuturi, 2013; Altschuler et al., 2017) and some desirable statistical properties (Genevay et al., 2016; Rigollet & Weed, 2018). In particular, two-sample EOT has empirical convergence rate n^{−1/2} for smooth costs and compactly supported distributions (Genevay et al., 2019). For the quadratic cost, (Mena & Niles-Weed, 2019) extended this rate to sub-Gaussian distributions and derived a central limit theorem (CLT) for empirical EOT, mirroring a result for W_2 established in (del Barrio & Loubes, 2019). The Mena & Niles-Weed (2019) CLT is notably different from ours: it uses as a centering constant the expected empirical distance between μ̂_m and ν̂_n, as opposed to the population distance between μ and ν (which corresponds to our centering about 0 in the one-sample case). Finally, while EOT can be computed efficiently, it is no longer a metric, even if the underlying cost is (Feydy et al., 2019; Bigot et al., 2019).

Sobolev IPMs, often referred to as 'dual Sobolev norms', have proven independently useful for generative modeling. For example, alternative Sobolev IPMs are the basis for multiple generative adversarial network (GAN) frameworks (Mroueh et al., 2018; Xu et al., 2020) and are featured in (Si et al., 2020), which examines Wasserstein projections of empirical measures onto a chosen hypothesis class.
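To make the smoothing definition from Section 1.1 concrete, here is a minimal one-dimensional Monte Carlo sketch (not the paper's implementation; `wasserstein_1d` and `smooth_wasserstein_1d` are illustrative helper names). Convolving an empirical measure with N(0, σ²) corresponds to adding independent Gaussian noise to each sample point, and in one dimension W_p between equal-size samples reduces to an L^p average of sorted-sample gaps.

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """Empirical p-Wasserstein distance between two equal-size 1-D samples.

    In one dimension the optimal coupling matches order statistics, so
    W_p reduces to an L^p average of gaps between sorted samples.
    """
    xs, ys = np.sort(x), np.sort(y)
    return float(np.mean(np.abs(xs - ys) ** p) ** (1.0 / p))

def smooth_wasserstein_1d(x, y, sigma, p=1, seed=0):
    """Crude Monte Carlo sketch of W_p^{(sigma)}(mu, nu) = W_p(mu * N_sigma, nu * N_sigma):
    convolution with N(0, sigma^2) is simulated by adding independent
    Gaussian noise to each sample before comparing."""
    rng = np.random.default_rng(seed)
    return wasserstein_1d(x + sigma * rng.standard_normal(x.shape),
                          y + sigma * rng.standard_normal(y.shape), p)
```

For σ = 0 this recovers the unsmoothed distance; a single noise draw gives a crude estimate, and averaging over independent draws reduces the Monte Carlo error.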

2. Preliminaries

2.1. Notation

Let |·| and ⟨·,·⟩ denote the Euclidean norm and inner product. For a (signed) measure μ and a measurable function f on ℝ^d, we write μ(f) = ∫ f dμ. For a non-empty set T, let ℓ^∞(T) be the space of all bounded functions f : T → ℝ, equipped with the sup-norm ‖f‖_{∞,T} = sup_{t∈T} |f(t)|. The space of compactly supported, infinitely differentiable real functions on ℝ^d is C_0^∞. For any p ∈ [1, ∞) and any Borel measure γ on ℝ^d, we denote by L^p(γ; ℝ^k) the space of measurable maps f : ℝ^d → ℝ^k such that ‖f‖_{L^p(γ;ℝ^k)} = (∫ |f|^p dγ)^{1/p} < ∞. The space (L^p(γ; ℝ^k), ‖·‖_{L^p(γ;ℝ^k)}) is a Banach space, and we also write L^p(γ) = L^p(γ; ℝ). The class of Borel probability measures on ℝ^d is P, and the subset of measures μ ∈ P with finite pth moment ∫_{ℝ^d} |x|^p dμ(x) is P_p. The convolution of measures μ, ν ∈ P is defined by (μ ∗ ν)(A) := ∬ 1_A(x + y) dμ(x) dν(y), where 1_A is the indicator of A. The convolution of measurable functions f, g on ℝ^d is (f ∗ g)(x) := ∫ f(x − y) g(y) dy. Recall that N_σ = N(0, σ²I) and write φ_σ(x) = (2πσ²)^{−d/2} e^{−|x|²/(2σ²)}, x ∈ ℝ^d, for the Gaussian density. Write μ ⊗ ν for the product measure of μ, ν ∈ P. Let →_w and →_d denote weak convergence of probability measures and convergence in distribution of random variables.

2.2. Background

We next provide some background on the statistical distances used in this paper.

(Smooth) Wasserstein Distance. For p ≥ 1, the p-Wasserstein distance between μ, ν ∈ P_p is defined by

    W_p(μ, ν) := inf_{π ∈ Π(μ,ν)} ( ∫ |x − y|^p dπ(x, y) )^{1/p},

where Π(μ, ν) is the set of couplings of μ and ν. See (Villani, 2003; 2008; Santambrogio, 2015) for additional background. The σ-smooth p-Wasserstein distance between probability measures μ, ν ∈ P is defined by

    W_p^{(σ)}(μ, ν) := W_p(μ ∗ N_σ, ν ∗ N_σ).

Integral Probability Metrics. Let F be a class of measurable real functions on ℝ^d. The IPM with respect to (w.r.t.) F between probability measures μ, ν ∈ P is defined by

    ‖μ − ν‖_{∞,F} = sup_{f ∈ F} μ(f) − ν(f).

We subsequently control W_p^{(σ)} via an IPM whose functions have bounded Sobolev norm.

Smooth Sobolev IPM. Let γ be a Borel measure on ℝ^d and fix p ≥ 1. For a differentiable function f : ℝ^d → ℝ, let

    ‖f‖_{Ḣ^{1,p}(γ)} := ‖∇f‖_{L^p(γ;ℝ^d)} = ( ∫_{ℝ^d} |∇f|^p dγ )^{1/p}

be its Sobolev seminorm. We define the homogeneous Sobolev space Ḣ^{1,p}(γ) as the completion of C̃_0^∞ := {f + a : a ∈ ℝ, f ∈ C_0^∞} w.r.t. ‖·‖_{Ḣ^{1,p}(γ)}. The dual Sobolev norm of a signed measure ℓ on ℝ^d with zero total mass is

    ‖ℓ‖_{Ḣ^{−1,p}(γ)} := sup{ ℓ(f) : f ∈ C_0^∞, ‖f‖_{Ḣ^{1,q}(γ)} ≤ 1 },

where q is the conjugate index of p, i.e., 1/p + 1/q = 1. We define the pth-order smooth Sobolev IPM by

    d_p^{(σ)}(μ, ν) := ‖(μ − ν) ∗ N_σ‖_{Ḣ^{−1,p}(N_σ)}

for measures μ, ν ∈ P. Observe that d_p^{(σ)} is an IPM w.r.t. the class F ∗ φ_σ = {f ∗ φ_σ : f ∈ F} with F = {f ∈ C_0^∞ : ‖f‖_{Ḣ^{1,q}(N_σ)} ≤ 1}.

3. Structure of Smooth Wasserstein Distance and Comparison with Smooth Sobolev IPM

We now examine basic properties of smooth Wasserstein distances, including a useful connection to the smooth Sobolev IPM. The case of W_1^{(σ)} has been well-studied in (Goldfeld & Greenewald, 2020; Goldfeld et al., 2020a). Herein we present results that hold for arbitrary p ≥ 1 and σ ≥ 0 unless stated otherwise, with proofs left for the supplement. Extending beyond p = 1 requires new techniques, most prominently a comparison result between W_p^{(σ)} and d_p^{(σ)}.

3.1. Structural Properties

We first consider the topology induced by W_p^{(σ)}. Since convolution acts as a contraction, we have W_p^{(σ)} ≤ W_p. In fact, the two distances induce the same topology on P_p, which coincides with that of weak convergence in addition to convergence of pth moments.

Proposition 1 (Metric and topological structure of W_p^{(σ)}). W_p^{(σ)} is a metric on P_p inducing the same topology as W_p.

The proof uses existence of optimal couplings and uniform integrability arguments. Next, we examine the behavior of W_p^{(σ)} as a function of the smoothing parameter σ. We start from the following stability lemma, guaranteeing that small changes in σ result only in slight perturbations of W_p^{(σ)}.

Lemma 1 (Stability of W_p^{(σ)}). For μ, ν ∈ P_p and 0 ≤ σ₁ ≤ σ₂ < ∞, we have

    W_p^{(σ₂)}(μ, ν) − W_p^{(σ₁)}(μ, ν) ≤ 2 √( (σ₂² − σ₁²)(d + 2p + 2) ).
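The objects above all involve the convolved measure μ̂_n ∗ N_σ, which has an explicit density: an equal-weight Gaussian mixture centered at the sample points. A minimal one-dimensional sketch follows (`smoothed_density` is an illustrative name, not from the paper's code).

```python
import numpy as np

def smoothed_density(samples, sigma):
    """Return the density of mu_hat_n * N_sigma in one dimension.

    Convolving the empirical measure n^{-1} sum_i delta_{X_i} with
    N(0, sigma^2) yields an n-component Gaussian mixture with equal
    weights, centered at the sample points X_i.
    """
    samples = np.asarray(samples, dtype=float)

    def pdf(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]
        z = (x - samples[None, :]) / sigma
        return np.exp(-0.5 * z**2).mean(axis=1) / (sigma * np.sqrt(2.0 * np.pi))

    return pdf
```

For a single sample at 0, the returned function coincides with the N(0, σ²) density; in general it is the kernel density estimate with bandwidth σ, which is exactly the smoothing the paper studies.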

This result generalizes Lemma 1 of (Goldfeld & Greenewald, 2020), which covers p = 1, and establishes uniform continuity of W_p^{(σ)} in σ. Its proof takes a different approach, using Minkowski's inequality instead of the Kantorovich-Rubinstein duality. An immediate consequence of Lemma 1 is given next, mirroring Theorem 3 of (Goldfeld & Greenewald, 2020).

Corollary 1 (W_p^{(σ)} dependence on σ). For μ, ν ∈ P_p, the following hold:

(i) W_p^{(σ)}(μ, ν) is continuous and monotonically non-increasing in σ ∈ [0, +∞);

(ii) lim_{σ→0} W_p^{(σ)}(μ, ν) = W_p(μ, ν);

(iii) lim_{σ→∞} W_p^{(σ)}(μ, ν) = |E[X] − E[Y]|, for X ∼ μ and Y ∼ ν sub-Gaussian.

Remark 1 (Infinite smoothing). A detailed study of W_p^{(σ)} in the infinite smoothing regime (i.e., when σ → ∞) is conducted in (Chen & Niles-Weed, 2020). Therein, the authors prove Item (iii) above and examine the convergence of W_p^{(σ)}(μ, ν) to 0 when E[X] = E[Y]. For that case, they show that if μ and ν have matching moment tensors up to order n (but not n + 1), then W_2^{(σ)}(μ, ν) ≍ σ^{−n} as σ → ∞.

Corollary 1 guarantees the convergence of transport costs as σ → 0. It is natural to ask whether optimal transport plans (i.e., couplings) that achieve these costs converge as well. We answer this question in the affirmative.

Proposition 2 (Convergence of transport plans). Fix μ, ν ∈ P_p and let (σ_k)_{k∈ℕ} be a sequence with σ_k ↘ σ ≥ 0. For each k ∈ ℕ, let π_k ∈ Π(μ ∗ N_{σ_k}, ν ∗ N_{σ_k}) be an optimal coupling for W_p^{(σ_k)}(μ, ν). Then there exists π ∈ Π(μ ∗ N_σ, ν ∗ N_σ) such that π_k →_w π as k → ∞, and π is optimal for W_p^{(σ)}(μ, ν).

The proof observes that the arguments for Theorem 4 of (Goldfeld & Greenewald, 2020) extend from p = 1 to the general case with minor changes.

So far we have studied metric, topological, and limiting properties of W_p^{(σ)}. In Section 4 we explore its statistical behavior when distributions are estimated from samples. To that end, we now establish a relation between W_p^{(σ)} and the smooth Sobolev IPM d_p^{(σ)}. This result is later used to study the empirical convergence under W_p^{(σ)} using tools from empirical process theory (applied to the d_p^{(σ)} upper bound).

Theorem 1 (Comparison between W_p^{(σ)} and d_p^{(σ)}). Fix p > 1 and let q be the conjugate index of p. Then, for X ∼ μ ∈ P_p with mean 0 and ν ∈ P, we have

    W_p^{(σ)}(μ, ν)^p ≤ p e^{E[|X|²]/(2qσ²)} d_p^{(σ)}(μ, ν).   (1)

The proof builds upon related inequalities established for standard W_p (Dolbeault et al., 2009; Peyre, 2018; Ledoux, 2019), exploiting the metric structure of the Wasserstein space and the Benamou-Brenier dynamic formulation of optimal transport (Benamou & Brenier, 2000). Namely, we note that W_p(μ₀, μ₁) is upper bounded by the length of any continuous path from μ₀ to μ₁ in (P_p, W_p) and examine the path t ↦ tμ₁ + (1 − t)μ₀, which interpolates linearly between the two densities. The theorem follows upon applying the resulting bound to μ ∗ N_σ and ν ∗ N_σ. We also give a lower bound for W_p^{(σ)}(μ, ν) using ‖(μ − ν) ∗ N_σ‖_{Ḣ^{−1,p}(N_{√2 σ})}, though the constant factor restricts its usefulness.

Remark 2. When p = 1, one can show that W_1 and the dual Sobolev norm ‖·‖_{Ḣ^{−1,1}(γ)} coincide (Dolbeault et al., 2009). In particular, this implies that W_1^{(σ)}(μ, ν) = d_1^{(σ)}(μ, ν). For larger p, the gap between W_p^{(σ)}(μ, ν) and the upper bound given by Theorem 1 can grow quite large, so we view the comparison as a useful theoretical tool rather than a device for practical approximation guarantees.

Finally, we establish some basic properties of d_p^{(σ)}.

Proposition 3 (d_p^{(σ)} dependence on σ). For μ, ν ∈ P, the following hold:

(i) lim_{σ→0} d_p^{(σ)}(μ, ν) = ∞ for μ ≠ ν;

(ii) lim_{σ→∞} d_2^{(σ)}(μ, ν) = |E[X] − E[Y]|, for X ∼ μ and Y ∼ ν sub-Gaussian.

We focus on p = 2 for (ii) due to a convenient MMD formulation for d_2^{(σ)} established in Section 5.

4. Empirical Approximations

Fix p > 1, σ > 0, and let μ ∈ P_p with X ∼ μ. Given independently and identically distributed (i.i.d.) samples X₁, …, X_n ∼ μ with empirical distribution μ̂_n := n^{−1} Σ_{i=1}^n δ_{X_i}, we study the convergence rate of E[W_p^{(σ)}(μ̂_n, μ)] to zero. To start, we observe that elementary techniques imply E[W_p^{(σ)}(μ̂_n, μ)] = O(n^{−1/(2p)}) under mild conditions on μ. Although the rate n^{−1/(2p)} is dimension-free, its dependence on p is sub-optimal.

Theorem 2 (Slow rate). If X ∼ μ satisfies

    ∫₀^∞ r^{d+p−1} √(P(|X| > r)) dr < ∞,   (2)

then E[W_p^{(σ)}(μ̂_n, μ)] = O(n^{−1/(2p)}). Condition (2) holds if μ has finite (2d + 2p + ε)-th moment for some ε > 0.

The proof follows by coupling μ and μ̂_n via the maximal coupling. This bounds W_p^{(σ)}(μ̂_n, μ)^p from above by a weighted TV distance, which converges as n^{−1/2}, provided that the above moment condition holds. This proof technique was previously applied in (Goldfeld et al., 2020b) to achieve the same rate when p = 1.

We next turn to show that the n^{−1/2} rate is attainable for W_p^{(σ)}(μ̂_n, μ) itself (rather than for its pth power). To this end, we first establish a limit distribution result for the empirical smooth Sobolev IPM d_p^{(σ)}(μ̂_n, μ). This, in turn, yields the desired rate for E[W_p^{(σ)}(μ̂_n, μ)] via the comparison from Theorem 1. Recall the function class F = {f ∈ C_0^∞ : ‖f‖_{Ḣ^{1,q}(N_σ)} ≤ 1}.

Theorem 3 (Limit distribution for empirical d_p^{(σ)}). Suppose there exists θ > p − 1 for which X ∼ μ satisfies

    ∫₀^∞ e^{θr²/(2σ²)} √(P(|X| > r)) dr < ∞.   (3)

Then √n d_p^{(σ)}(μ̂_n, μ) →_d ‖G‖_{∞,F} as n → ∞, where G = (G(f))_{f∈F} is a tight Gaussian process in ℓ^∞(F) with mean zero and covariance function Cov(G(f), G(g)) = Cov(f ∗ φ_σ(X), g ∗ φ_σ(X)).

Corollary 2 (Fast rate). Under the conditions of Theorem 3, we have lim_{n→∞} √n E[d_p^{(σ)}(μ̂_n, μ)] = E[‖G‖_{∞,F}] < ∞. Consequently, E[W_p^{(σ)}(μ̂_n, μ)] = O(n^{−1/2}).

The proof of Theorem 3 shows that the smoothed function class F ∗ φ_σ = {f ∗ φ_σ : f ∈ F} is μ-Donsker. Specifically, we prove that functions in F ∗ φ_σ are smooth, with derivatives uniformly bounded on domains within a fixed radius of the origin. Using these bounds, we apply techniques from empirical process theory to establish the Donsker property. Importantly, the preceding argument hinges on the convolution with the smooth Gaussian density and does not hold for the unsmoothed function class. No mean zero requirement appears because we can center μ̂_n and μ by the mean of μ.

Condition (3) requires that P(|X| > r) → 0 faster than e^{−Cr²} as r → ∞ for some C > (p − 1)/σ², which in turn requires |X| to be sub-Gaussian. The requirement is trivially satisfied if μ is compactly supported. We can also relate Condition (3) to a more standard notion of sub-Gaussianity for random vectors.

Definition 1 (Sub-Gaussian distribution). Let Y ∼ ν ∈ P with E[|Y|] < ∞. We say that ν or Y is β-sub-Gaussian for β ≥ 0 if E[exp(⟨α, Y − E[Y]⟩)] ≤ exp(β²|α|²/2) for all α ∈ ℝ^d.

Proposition 4 (Sub-Gaussianity implies (3)). If μ is β-sub-Gaussian with β < σ/√(2(p − 1)), then (3) holds.

Next, we consider the concentration of W_p^{(σ)}(μ̂_n, μ).

Proposition 5 (Concentration inequality). If μ has compact support, then, for all t > 0 and n ∈ ℕ, we have

    P( W_p^{(σ)}(μ̂_n, μ) ≥ Cn^{−1/2} + t ) ≤ exp(−cnt²),   (4)

with constants C, c independent of n and t.

For unbounded domains, concentration results mirroring those of Corollary 3 in (Goldfeld et al., 2020a) can be established in the same way under a stronger sub-Gaussianity assumption; we omit the details for brevity.

Remark 3 (Constants). While the rates provided in this section are dimension-free, the constants necessarily exhibit an exponential dependence on dimension. Indeed, minimax results for estimation of standard W_p due to Singh & Póczos (2018), combined with Lemma 1, imply that achieving dimension-free rates with constants scaling only polynomially in dimension is impossible in general. We provide further details in the proofs of Theorem 2 and Theorem 3.

5. Smooth Sobolev IPM Efficient Computation

We next consider computation of d_p^{(σ)}, which takes a convenient form when p = 2. Specifically, the Hilbertian structure of Ḣ^{1,2}(N_σ) enables us to streamline calculations significantly for all dimensions d. In the following, fix σ > 0, μ, ν ∈ P, X, X′ ∼ μ ⊗ μ, and Y, Y′ ∼ ν ⊗ ν.

Consider the function space Ḣ₀^{1,2}(N_σ) := {f ∈ Ḣ^{1,2}(N_σ) : N_σ(f) = 0} with norm ‖·‖_{Ḣ^{1,2}(N_σ)} (this norm is proper because of the constraint N_σ(f) = 0). This space becomes a Hilbert space when equipped with the inner product ⟨f, g⟩_{Ḣ^{1,2}(N_σ)} = ∫_{ℝ^d} ⟨∇f, ∇g⟩ dN_σ. Likewise, the space Ḣ₀^{1,2}(N_σ) ∗ φ_σ = {f ∗ φ_σ : f ∈ Ḣ₀^{1,2}(N_σ)} is a Hilbert space with inner product ⟨f ∗ φ_σ, g ∗ φ_σ⟩_{Ḣ^{1,2}(N_σ)∗φ_σ} = ⟨f, g⟩_{Ḣ^{1,2}(N_σ)} (see Appendix A.3 for a proof that this is well-defined). In fact, we can say a bit more, first recalling some definitions.

Reproducing Kernel Hilbert Space (RKHS). Let H be a Hilbert space of real-valued functions on ℝ^d. We say that H is an RKHS if there is a positive semidefinite function k : ℝ^d × ℝ^d → ℝ, called a reproducing kernel, such that k(·, x) ∈ H and h(x) = ⟨h, k(·, x)⟩ for all x ∈ ℝ^d and h ∈ H. See (Steinwart & Christmann, 2008; Aronszajn, 1950) for comprehensive background on RKHSs.

Maximum Mean Discrepancy. Let H be an RKHS with kernel k. The IPM corresponding to its unit ball, termed MMD, is given by

    MMD_H(μ, ν) := sup_{f ∈ H : ‖f‖_H ≤ 1} μ(f) − ν(f).

Proposition 6 (Borgwardt et al. (2006)). If E[√(k(X, X))], E[√(k(Y, Y))] < ∞, then

    MMD_H(μ, ν)² = E[k(X, X′)] − 2 E[k(X, Y)] + E[k(Y, Y′)].
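Proposition 6 reduces MMD computation to kernel averages over the samples. A minimal NumPy sketch follows, using a Gaussian RBF kernel as a generic stand-in (not the Sobolev-based kernel constructed in the next section); `gaussian_kernel` and `mmd_squared` are illustrative names.

```python
import numpy as np

def gaussian_kernel(x, y, bw=1.0):
    """RBF kernel k(x, y) = exp(-|x - y|^2 / (2 bw^2)) between rows of x and y.

    A generic stand-in kernel for illustration; any positive semidefinite
    kernel with finite E[sqrt(k(X, X))] fits Proposition 6.
    """
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / bw**2)

def mmd_squared(x, y, kernel=gaussian_kernel):
    """Plug-in estimate of MMD_H(mu, nu)^2 per Proposition 6:
    E[k(X, X')] - 2 E[k(X, Y)] + E[k(Y, Y')], with each expectation
    replaced by an average over a kernel Gram block of the samples."""
    return float(kernel(x, x).mean()
                 - 2.0 * kernel(x, y).mean()
                 + kernel(y, y).mean())
```

The plug-in (V-statistic) form above is slightly biased; the same double-sum structure appears when this estimate is instantiated with the specific kernel derived for d_2^{(σ)}.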

We prove that Ḣ₀^{1,2}(N_σ) ∗ φ_σ is an RKHS whose kernel is expressed in terms of the entire exponential integral (Oldham et al., 2009) Ein(z) := ∫₀^z (1 − e^{−t}) dt/t = Σ_{k=1}^∞ (−1)^{k+1} z^k/(k · k!), giving an MMD form for d_2^{(σ)}.

Theorem 4 (d_2^{(σ)} as an MMD). The space Ḣ₀^{1,2}(N_σ) ∗ φ_σ is an RKHS with reproducing kernel κ^{(σ)}(x, y) := −σ² Ein(−⟨x, y⟩/σ²). Thus, if E[√(κ^{(σ)}(X, X))], E[√(κ^{(σ)}(Y, Y))] < ∞, then

    d_2^{(σ)}(μ, ν)² = E[κ^{(σ)}(X, X′)] + E[κ^{(σ)}(Y, Y′)] − 2 E[κ^{(σ)}(X, Y)].   (5)

The proof begins with a reduction to σ = 1 and observes that properly normalized multivariate Hermite polynomials form an orthonormal basis for Ḣ₀^{1,2}(N₁). Convolving these polynomials with φ₁, we obtain an orthonormal basis for Ḣ₀^{1,2}(N₁) ∗ φ₁ comprising scaled monomials, which can then be used to calculate the kernel.

The MMD formulation (5) gives a convenient way to compute d_2^{(σ)} in practice. Suppose that we generate i.i.d. samples X₁, …, X_m ∼ μ and Y₁, …, Y_n ∼ ν with empirical distributions μ̂_m = m^{−1} Σ_{i=1}^m δ_{X_i} and ν̂_n = n^{−1} Σ_{j=1}^n δ_{Y_j}. Then, we can compute

    d_2^{(σ)}(μ̂_m, ν̂_n)² = (1/m²) Σ_{i=1}^m Σ_{j=1}^m κ^{(σ)}(X_i, X_j) + (1/n²) Σ_{i=1}^n Σ_{j=1}^n κ^{(σ)}(Y_i, Y_j) − (2/(mn)) Σ_{i=1}^m Σ_{j=1}^n κ^{(σ)}(X_i, Y_j).

Provided that μ and ν are compactly supported or β-sub-Gaussian with β < σ/√2, Corollary 2 and the triangle inequality imply

    E[ |d_2^{(σ)}(μ̂_m, ν̂_n) − d_2^{(σ)}(μ, ν)| ] = O(min{m, n}^{−1/2}).

Hence, we can approximate d_2^{(σ)} up to expected error ε with O(ε^{−2}) samples from the measured distributions and O(ε^{−4}) evaluations of κ^{(σ)}, for any dimension d.

6. Statistical Applications

With the empirical approximation and computational results in hand, we now present applications to two-sample testing and minimum distance estimation. These results highlight the benefits of smoothing, as several of the subsequent claims are unavailable for standard W_p due to the lack of parametric rates and limit distributions.

6.1. Two-Sample Testing

We start from two-sample testing with W_p^{(σ)} and d_p^{(σ)}, where p > 1 and σ > 0 are fixed throughout. Let μ, ν ∈ P_p and take X₁, …, X_m ∼ μ and Y₁, …, Y_n ∼ ν to be mutually independent samples. The goal of nonparametric two-sample testing is to detect, based on the samples, whether the null hypothesis H₀ : μ = ν holds, without imposing parametric assumptions on the distributions.

A standard class of tests rejects H₀ if D_{m,n} > c_{m,n}, where D_{m,n} = D_{m,n}(X₁, …, X_m, Y₁, …, Y_n) is a scalar test statistic and c_{m,n} is a critical value chosen according to the desired level α ∈ (0, 1). Precisely, we say that such a sequence of tests has asymptotic level α if limsup_{m,n→∞} P(D_{m,n} > c_{m,n}) ≤ α whenever μ = ν. We say that these tests are asymptotically consistent if lim_{m,n→∞} P(D_{m,n} > c_{m,n}) = 1 whenever μ ≠ ν. In what follows, we assume that m, n → ∞ and m/N → τ ∈ (0, 1) with N = m + n.

The previous theorems will help us construct tests that enjoy asymptotic consistency and correct asymptotic level based on the smooth p-Wasserstein distance, using W_{m,n} := √(mn/N) W_p^{(σ)}(μ̂_m, ν̂_n). Two-sample testing using the Wasserstein distance was previously explored in (Ramdas et al., 2017), but those results are fundamentally restricted to the one-dimensional setting. Specifically, while the authors designed tests with data-independent critical values for d = 1, they rely heavily on limit distributions of empirical Wasserstein distances that do not extend to higher dimensions. Our results use data-dependent critical values but scale to arbitrary dimension, demonstrating the compatibility of smooth distances with multivariate two-sample testing.

We use the bootstrap to calibrate critical values. Consider the pooled data (Z₁, …, Z_N) = (X₁, …, X_m, Y₁, …, Y_n) with empirical distribution γ̂_N = N^{−1} Σ_{i=1}^N δ_{Z_i}. Let X₁^B, …, X_m^B and Y₁^B, …, Y_n^B be i.i.d. from γ̂_N given Z₁, …, Z_N, and take μ̂_m^B = m^{−1} Σ_{i=1}^m δ_{X_i^B} and ν̂_n^B = n^{−1} Σ_{i=1}^n δ_{Y_i^B} to be the corresponding bootstrap empirical measures.

Specifying critical values requires a bit of care, as the comparison inequality (1) requires centering of one of the measures. So we center the bootstrap empirical measures μ̂_m^B and ν̂_n^B by the pooled sample mean Z̄_N = N^{−1} Σ_{i=1}^N Z_i; namely, we apply the bootstrap as

    W_{m,n}^B = p e^{tr(Σ̂_Z)/(2qσ²)} √(mn/N) d_p^{(σ)}( μ̂_m^B ∗ δ_{−Z̄_N}, ν̂_n^B ∗ δ_{−Z̄_N} ),

where Σ̂_Z = N^{−1} Σ_{i=1}^N (Z_i − Z̄_N)(Z_i − Z̄_N)^⊤. Denote the conditional (1 − α)-quantile of W_{m,n}^B by w_{m,n}^B(1 − α), i.e.,

    w_{m,n}^B(1 − α) = inf{ t : P(W_{m,n}^B ≤ t) ≥ 1 − α }.

Then, we have the following result.

Proposition 7 (Asymptotic validity). For μ, ν ∈ P_p satisfying the condition of Theorem 3, the sequence of tests that re-

Figure 1. (Left) Empirical convergence of estimated E[W_2^{(σ)}(μ̂_n, μ)] to 0 for μ = Unif([−1, 1]^d) with d ∈ {1, 3, 5} and σ ∈ {0, 0.02, 0.05}, plotted against the number of samples n. (Right) Loose upper bound on E[W_2^{(0.5)}(μ̂_n, μ)] provided by d_2^{(0.5)} for μ = Unif([−1, 1]^d) with d ∈ {1, 3, 6, 10}.

ject the null hypothesis H₀ : μ = ν if W_{m,n} > w_{m,n}^B(1 − α) is asymptotically consistent with level α.

We remark that the same argument gives a simpler result when p = 1. Because W_1^{(σ)} is an IPM itself, w.r.t. a function class that is μ- and ν-Donsker under moment conditions (see (Goldfeld et al., 2020a, Theorem 1)), no centering is necessary, and W_{m,n}^B can be replaced by √(mn/N) W_1^{(σ)}(μ̂_m^B, ν̂_n^B).

6.2. Generative Modeling

In the unsupervised learning task of generative modeling, we obtain an i.i.d. sample X₁, …, X_n from a distribution μ ∈ P and aim to learn a model from a parameterized family {ν_θ}_{θ∈Θ} ⊂ P which approximates μ under some statistical distance. We adopt the smooth Wasserstein distance as the figure of merit and use the empirical distribution μ̂_n as an estimate for μ. Generative modeling is thus formulated as the following minimum smooth Wasserstein estimation (M-SWE) problem:

    inf_{θ∈Θ} W_p^{(σ)}(μ̂_n, ν_θ).

In the unsmooth case, this objective with W_1 inspired the Wasserstein GAN (W-GAN) framework that continues to underlie state-of-the-art methods in generative modeling (Arjovsky et al., 2017; Gulrajani et al., 2017). M-SWE with p = 1 was studied in (Goldfeld et al., 2020a), and here we pursue similar measurability and consistency results for p > 1.

In what follows, we take both μ ∈ P_p and {ν_θ}_{θ∈Θ} ⊂ P_p. Further, we suppose that Θ ⊂ ℝ^{d₀} is compact with nonempty interior and that θ ↦ ν_θ is continuous w.r.t. the weak topology, i.e., ν_θ →_w ν_θ̄ whenever θ → θ̄. We start by establishing measurability, consistency, and parametric convergence rates for M-SWE.

Proposition 8 (M-SWE measurability). For each n ∈ ℕ, there exists a measurable function ω ↦ θ̂_n(ω) such that θ̂_n(ω) ∈ argmin_{θ∈Θ} W_p^{(σ)}(μ̂_n(ω), ν_θ).

Proposition 9 (M-SWE consistency). The following hold:

1. inf_{θ∈Θ} W_p^{(σ)}(μ̂_n, ν_θ) → inf_{θ∈Θ} W_p^{(σ)}(μ, ν_θ) a.s.

2. There exists an event with probability one on which the following holds: for any sequence {θ̂_n}_{n∈ℕ} of measurable estimators such that W_p^{(σ)}(μ̂_n, ν_{θ̂_n}) ≤ inf_{θ∈Θ} W_p^{(σ)}(μ̂_n, ν_θ) + o_P(1), the set of cluster points of {θ̂_n}_{n∈ℕ} is included in argmin_{θ∈Θ} W_p^{(σ)}(μ, ν_θ).

3. If argmin_{θ∈Θ} W_p^{(σ)}(μ, ν_θ) = {θ*}, then θ̂_n → θ* a.s.

Proposition 10 (M-SWE convergence rate). If μ satisfies the conditions of Theorem 3, then

    | inf_{θ∈Θ} W_p^{(σ)}(μ̂_n, ν_θ) − inf_{θ∈Θ} W_p^{(σ)}(μ, ν_θ) | = O_P(n^{−1/2}).

Likewise, under additional regularity conditions, the solutions θ̂_n to M-SWE converge at the parametric rate, i.e., |θ̂_n − θ*| = O_P(n^{−1/2}), where θ* is the unique solution as above; see Supplement A.4 for details.

These propositions follow by similar arguments to those in (Goldfeld et al., 2020a), which build on (Pollard, 1980), with arbitrary p ≥ 1 instead of p = 1 as considered therein (the needed results from (Villani, 2008) hold for all p ≥ 1). We thus omit their proofs for brevity.

We next examine a high-probability generalization bound for generative modeling via M-SWE, in accordance with the framework from (Arora et al., 2017; Zhang et al., 2018). Thus, we want to control the gap between the W_p^{(σ)} loss attained by approximate, possibly suboptimal, empirical minimizers and the population loss inf_{θ∈Θ} W_p^{(σ)}(μ, ν_θ). Upper bounding this gap by the rate of empirical convergence, the concentration result Proposition 5 implies the following.

Corollary 3 (M-SWE generalization error). Assume μ has compact support and let θ̂_n be an estimator with W_p^{(σ)}(μ̂_n, ν_{θ̂_n}) ≤ inf_{θ∈Θ} W_p^{(σ)}(μ̂_n, ν_θ) + ε, for some ε > 0. We have

    P( W_p^{(σ)}(μ, ν_{θ̂_n}) − inf_{θ∈Θ} W_p^{(σ)}(μ, ν_θ) > ε + t ) ≤ C e^{−cnt²},

for constants C, c independent of n and t.

7. Experiments

We present several numerical experiments supporting the theoretical results established in the previous sections. We focus on p = 2 so that we can use the MMD form for d_2^{(σ)}. Code is provided at https://github.com/sbnietert/smooth-Wp.

First, we examine W_2^{(σ)}(µ, µ̂_n) directly, with computations feasible for small sample sizes using the stochastic averaged gradient (SAG) method as proposed by (Genevay et al., 2016) and implemented by (Hallin et al., 2020). In Figure 1 (left), we take µ = Unif([−1, 1]^d) and estimate E[W_2^{(σ)}(µ̂_n, µ)] averaged over 10 trials, for varied d and σ. We observe the contractive property of W_2^{(σ)} and the speed-up in convergence rate due to smoothing. However, SAG is not well-suited for computation in high dimensions and larger values of σ. Indeed, this method computes standard W_2 between the convolved measures, which needs a number of samples from the Gaussian measure that is exponential in d. Recently, Vacher et al. (2021) suggested that OT distances between smooth densities (like those of our convolved measures) may be computed more efficiently, but their algorithm is restricted to compactly supported distributions and leaves important hyperparameters unspecified.

Turning to d_2^{(σ)}, the MMD form from (5) readily enables efficient computation. In Figure 1 (right), we plot the n^{−1/2} upper bound it gives for W_2^{(σ)} empirical convergence, using a closed form for relevant expectations described in Appendix A.5.1. We emphasize that this bound is too loose for practical approximation and serves rather as a theoretical tool for obtaining correct rates. In Figure 2, we plot distributions of √n d_2^{(1)}(µ̂_n, µ) as n increases, using µ = N_s with varied s and d = 5. Distributions are computed using kernel density estimates over 50 trials, with µ estimated by µ̂_1000. We see convergence to a limit distribution for small σ (estimating ||G||_{∞,F} from Theorem 3) and note the necessity of the sub-Gaussian condition (3), with convergence failing as σ surpasses 1/√2, supporting Proposition 4.

Figure 2. Empirical limiting behavior of √n d_2^{(1)}(µ̂_n, µ) for µ = N_s with s ∈ {0.1, 0.2, 0.5, 1.0} and d = 5.

Next, we examine M-SWE when d = 1, exploiting the fact that W_p can be expressed as an L^p distance between quantile functions (see, e.g., (Villani, 2008)). The considered task is fitting a two-parameter generative model to a Gaussian mixture (parameterized by the means of its two modes). Distance minimization is implemented via gradient descent. Plotted in Figure 3 are √n-scaled scatter plots of the estimation errors, with 40 trials for each σ and n pair. The consistent spread of the (scaled) estimation errors as n increases demonstrates the n^{−1/2} convergence rate. The bottom-right subplot shows W_2^{(σ)} estimation errors that further support the fast convergence. In Appendix A.5, we provide additional results for a single Gaussian parameterized by mean and variance.

Figure 3. One-dimensional limiting behavior of M-SWE estimates for the two mean parameters of µ = N(a_1, 1)/2 + N(a_2, 1)/2 with a_1 = −1 and a_2 = 1. Also shown is a log-log plot of W_2^{(σ)} convergence in n.

Finally, we provide two-sample testing results in Figure 4 for p = 1, leveraging the simplifications discussed at the end of Section 6.1. We approximate the convolved empirical measures by adding Gaussian noise samples and compute W_1 (exactly) for d = 1 via its representation as the L^1 distance between quantile functions. For d = 2, we estimate W_1 using a standard implementation of W-GAN (Gulrajani et al., 2017; Cao, 2017). For varied sample sizes n = m, the quantiles of √(n²/N) W_1^{(σ)}(µ̂_n^B, ν̂_n^B) are estimated

using 1000 and 200 bootstrap samples for d = 1 and d = 2, respectively. The probability of rejecting the null hypothesis for varied significance levels and sample sizes is estimated by repeating the tests over 100 and 200 draws of the original samples, for d = 1 and d = 2, respectively. Figure 4 displays the probability of false alarm versus the significance level α. Evidently, the curves approximately fall along the diagonal y = x, supporting the consistency result.

Figure 4. Estimated probability that the W_1^{(0.1)} two-sample test rejects the null hypothesis H_0: µ = ν given that µ = ν = Unif([0, 1]^d).

8. Conclusions and Future Directions

This work provided a thorough analysis of structural and statistical properties of the Gaussian-smoothed Wasserstein distance W_p^{(σ)}. While W_p^{(σ)} maintains many desirable properties of standard W_p, we have shown via comparison to the smooth Sobolev IPM d_p^{(σ)} that it admits a parametric empirical convergence rate, avoiding the curse of dimensionality that arises when estimating W_p from data. Using this fast rate and the associated limit distribution for d_p^{(σ)}, we have explored new applications to two-sample testing and generative modeling.

An important direction for future research is efficient computation of W_p^{(σ)}. While standard methods for computing W_p are applicable in the smooth case (by sampling the noise), it is desirable to find computational techniques that make use of the structure induced by the convolution with a known smooth kernel. Furthermore, while W_p^{(σ)} exhibits an expected empirical convergence rate of O(n^{−1/2}) that is optimal in n, the prefactor scales exponentially with dimension and warrants additional study. We suspect that this scaling can be shown under a manifold hypothesis to depend only on the intrinsic dimension of the data distribution rather than that of the ambient space.

Finally, we are interested in the limiting behavior of W_p^{(σ)} as σ → 0 and p → ∞. The former case has implications for standard W_p and its dependence on intrinsic dimension, as well as for noise annealing that is common in machine learning practice. The latter may connect to differential privacy, where smoothing corresponds to (Gaussian) noise injection and W_∞ underlies the Wasserstein privacy mechanism (Song et al., 2017).

Acknowledgements

S. Nietert is supported by the National Science Foundation (NSF) Graduate Research Fellowship under Grant DGE-1650441. Z. Goldfeld is supported by the NSF CRII grant CCF-1947801, in part by the 2020 IBM Academic Award, and in part by the NSF CAREER Award CCF-2046018. K. Kato is partially supported by NSF grants DMS-1952306 and DMS-2014636.

References

Altschuler, J., Weed, J., and Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems (NeurIPS-2017), pp. 1964–1974, Long Beach, CA, USA, Dec. 2017.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML-2017), pp. 214–223, Sydney, Australia, Aug. 2017.

Aronszajn, N. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. Generalization and equilibrium in generative adversarial nets (GANs). In International Conference on Machine Learning (ICML-2017), Sydney, Australia, Aug. 2017.

Benamou, J.-D. and Brenier, Y. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.

Bernton, E., Jacob, P. E., Gerber, M., and Robert, C. P. On parameter estimation with the Wasserstein distance. Information and Inference. A Journal of the IMA, 8(4):657–676, 2019.

Bigot, J., Cazelles, E., and Papadakis, N. Central limit theorems for entropy-regularized optimal transport on finite spaces and statistical applications. Electronic Journal of Statistics, 13(2):5120–5150, 2019.

Bilodeau, G. G. The Weierstrass transform and Hermite polynomials. Duke Mathematical Journal, 29:293–308, 1962.

Blanchet, J., Murthy, K., and Zhang, F. Optimal transport based distributionally robust optimization: Structural properties and iterative schemes. arXiv preprint arXiv:1810.02403, 2018.

Bogachev, V. I. Gaussian Measures. American Mathematical Society, 1998.

Boissard, E. and Le Gouic, T. On the mean speed of convergence of empirical and occupation measures in Wasserstein distance. Annales de l'IHP Probabilités et Statistiques, 50(2):539–563, 2014.

Bonnotte, N. Unidimensional and Evolution Methods for Optimal Transportation. PhD thesis, Paris-Sud University, 2013.

Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel, H.-P., Schölkopf, B., and Smola, A. J. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Cao, M. WGAN-GP. https://github.com/caogang/wgan-gp, 2017.

Chen, H.-B. and Niles-Weed, J. Asymptotics of smoothed Wasserstein distances. arXiv preprint arXiv:2005.00738, 2020.

Courty, N., Flamary, R., and Tuia, D. Domain adaptation with regularized optimal transport. In European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2014), Nancy, France, Sep. 2014.

Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2016.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems (NIPS-2013), pp. 2292–2300, Lake Tahoe, Nevada, USA, Dec. 2013.

del Barrio, E. and Loubes, J.-M. Central limit theorems for empirical transportation cost in general dimension. The Annals of Probability, 47(2):926–951, 2019.

del Barrio, E., Giné, E., and Matrán, C. Central limit theorems for the Wasserstein distance between the empirical and the true distributions. The Annals of Probability, 27(2):1009–1071, 1999.

del Barrio, E., Giné, E., and Utzet, F. Asymptotics for L2 functionals of the empirical quantile process, with applications to tests of fit based on weighted Wasserstein distances. Bernoulli, 11(1):131–189, 2005.

Dereich, S., Scheutzow, M., and Schottstedt, R. Constructive quantization: Approximation by empirical measures. Annales de l'Institut Henri Poincaré Probabilités et Statistiques, 49(4):1183–1203, 2013.

Deshpande, I., Hu, Y., Sun, R., Pyrros, A., Siddiqui, N., Koyejo, S., Zhao, Z., Forsyth, D. A., and Schwing, A. G. Max-sliced Wasserstein distance and its use for GANs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR-2019), pp. 10648–10656, Long Beach, CA, USA, Jun. 2019.

Dolbeault, J., Nazaret, B., and Savaré, G. A new class of transport distances between measures. Calculus of Variations and Partial Differential Equations, 34(2):193–231, 2009.

Feydy, J., Séjourné, T., Vialard, F., Amari, S., Trouvé, A., and Peyré, G. Interpolating between optimal transport and MMD using Sinkhorn divergences. In The International Conference on Artificial Intelligence and Statistics (AISTATS-2019), pp. 2681–2690, Naha, Okinawa, Japan, Apr. 2019.

Fournier, N. and Guillin, A. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162:707–738, 2015.

Gao, R. and Kleywegt, A. J. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.

Genevay, A., Cuturi, M., Peyré, G., and Bach, F. R. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems (NeurIPS-2016), pp. 3432–3440, Barcelona, Spain, Dec. 2016.

Genevay, A., Chizat, L., Bach, F. R., Cuturi, M., and Peyré, G. Sample complexity of Sinkhorn divergences. In The International Conference on Artificial Intelligence and Statistics (AISTATS-2019), pp. 1574–1583, Naha, Okinawa, Japan, Apr. 2019.

Goldfeld, Z. and Greenewald, K. Gaussian-smoothed optimal transport: Metric structure and statistical efficiency. In International Conference on Artificial Intelligence and Statistics (AISTATS-2020), Palermo, Sicily, Italy, Jun. 2020.

Goldfeld, Z., Greenewald, K., and Kato, K. Asymptotic guarantees for generative modeling based on the smooth Wasserstein distance. In Advances in Neural Information Processing Systems (NeurIPS-2020), 2020a.

Goldfeld, Z., Greenewald, K. H., Niles-Weed, J., and Polyanskiy, Y. Convergence of smoothed empirical measures with applications to entropy estimation. IEEE Transactions on Information Theory, 66(7):4368–4391, 2020b.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NeurIPS-2017), Long Beach, CA, USA, Dec. 2017.

Hallin, M., Mordant, G., and Segers, J. Multivariate goodness-of-fit tests based on Wasserstein distance. arXiv preprint arXiv:2003.06684, 2020.

Ledoux, M. On optimal matching of Gaussian samples. Journal of Mathematical Sciences, 238:495–522, 2019.

Lei, J. Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. Bernoulli, 26(1):767–798, 2020.

Li, P., Wang, Q., and Zhang, L. A novel earth mover's distance methodology for image matching with Gaussian mixture models. In IEEE International Conference on Computer Vision (ICCV-2013), Sydney, Australia, Dec. 2013.

Mena, G. and Niles-Weed, J. Statistical bounds for entropic optimal transport: Sample complexity and the central limit theorem. In Advances in Neural Information Processing Systems (NeurIPS-2019), pp. 4543–4553, Vancouver, BC, Canada, Dec. 2019.

Milman, E. On the role of convexity in isoperimetry, spectral gap and concentration. Inventiones Mathematicae, 177(1):1–43, 2009.

Mohajerin Esfahani, P. and Kuhn, D. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171:115–166, 2018.

Mroueh, Y., Li, C., Sercu, T., Raj, A., and Cheng, Y. Sobolev GAN. In International Conference on Learning Representations (ICLR-2018), Vancouver, BC, Canada, May 2018.

Nadjahi, K., Durmus, A., Simsekli, U., and Badeau, R. Asymptotic guarantees for learning generative models with the sliced-Wasserstein distance. In Advances in Neural Information Processing Systems (NeurIPS-2019), pp. 250–260, 2019.

Nadjahi, K., Durmus, A., Chizat, L., Kolouri, S., Shahrampour, S., and Simsekli, U. Statistical and topological properties of sliced probability divergences. In Advances in Neural Information Processing Systems (NeurIPS-2020), Dec. 2020.

Oldham, K., Myland, J., and Spanier, J. An Atlas of Functions. Springer, second edition, 2009.

Panaretos, V. M. and Zemel, Y. Statistical aspects of Wasserstein distances. Annual Review of Statistics and its Application, 6:405–431, 2019.

Peyré, G. Comparison between W2 distance and Ḣ^{−1} norm, and localization of Wasserstein distance. ESAIM: Control, Optimisation and Calculus of Variations, 24(4):1489–1501, 2018.

Pollard, D. The minimum distance method of testing. Metrika, 27(1):43–70, 1980.

Rabin, J., Peyré, G., Delon, J., and Bernot, M. Wasserstein barycenter and its application to texture mixing. In Scale Space and Variational Methods in Computer Vision (SSVM-2011), volume 6667, pp. 435–446, Ein-Gedi, Israel, May 2011.

Ramdas, A., García Trillos, N., and Cuturi, M. On Wasserstein two-sample testing and related families of nonparametric tests. Entropy, 19(2):Paper No. 47, 2017.

Rasala, R. The Rodrigues formula and polynomial differential operators. Journal of Mathematical Analysis and Applications, 84(2):443–482, 1981.

Rigollet, P. and Weed, J. Entropic optimal transport is maximum-likelihood deconvolution. Comptes Rendus Mathématique. Académie des Sciences. Paris, 356(11-12):1228–1235, 2018.

Rubner, Y., Tomasi, C., and Guibas, L. J. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.

Sandler, R. and Lindenbaum, M. Nonnegative matrix factorization with earth mover's distance metric for image analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1590–1602, 2011.

Santambrogio, F. Optimal Transport for Applied Mathematicians. Birkhäuser, 2015.

Schmuland, B. Dirichlet forms with polynomial domain. Mathematica Japonica, 37(6):1015–1024, 1992.

Si, N., Blanchet, J. H., Ghosh, S., and Squillante, M. S. Quantifying the empirical Wasserstein distance to a set of measures: Beating the curse of dimensionality. In Advances in Neural Information Processing Systems (NeurIPS-2020), pp. 21260–21270, virtual, Dec. 2020.

Singh, S. and Póczos, B. Minimax distribution estimation in Wasserstein distance. arXiv preprint arXiv:1802.08855, 2018.

Sommerfeld, M. and Munk, A. Inference for empirical Wasserstein distances on finite spaces. Journal of the Royal Statistical Society Series B, 80:219–238, 2018.

Song, S., Wang, Y., and Chaudhuri, K. Pufferfish privacy mechanisms for correlated data. In ACM International Conference on Management of Data (SIGMOD-2017), pp. 1291–1306, New York, NY, USA, May 2017.

Steinwart, I. and Christmann, A. Support Vector Machines. Information Science and Statistics. Springer, 2008.

Tameling, C., Sommerfeld, M., and Munk, A. Empirical optimal transport on countable metric spaces: Distributional limits and statistical applications. Annals of Applied Probability, 29:2744–2781, 2019.

Tolstikhin, I., Bousquet, O., Gelly, S., and Schölkopf, B. Wasserstein auto-encoders. In International Conference on Learning Representations (ICLR-2018), Vancouver, Canada, Apr.–May 2018.

Vacher, A., Muzellec, B., Rudi, A., Bach, F., and Vialard, F.-X. A dimension-free computational upper-bound for smooth optimal transport estimation. arXiv preprint arXiv:2101.05380, 2021.

van der Vaart, A. New Donsker classes. The Annals of Probability, 24(4):2128–2140, 1996.

van der Vaart, A. Asymptotic Statistics, volume 3. Cambridge University Press, 1998.

van der Vaart, A. and Wellner, J. A. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 1996.

Villani, C. Topics in Optimal Transportation. American Mathematical Society, 2003.

Villani, C. Optimal Transport: Old and New. Springer, 2008.

Weed, J. and Bach, F. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620–2648, 2019.

Xu, M., Zhou, Z., Lu, G., Tang, J., Zhang, W., and Yu, Y. Towards generalized implementation of Wasserstein distance in GANs. arXiv preprint arXiv:2012.03420v2, 2020.

Zhang, P., Liu, Q., Zhou, D., Xu, T., and He, X. On the discrimination-generalization tradeoff in GANs. In International Conference on Learning Representations (ICLR-2018), Vancouver, BC, Canada, Apr.–May 2018.