arXiv:1909.07513v1 [math.ST] 16 Sep 2019

Estimation of Wasserstein distances in the Spiked Transport Model

Jonathan Niles-Weed and Philippe Rigollet
New York University and Massachusetts Institute of Technology

∗ Supported by NSF awards IIS-BIGDATA-1838071, DMS-1712596 and CCF-TRIPODS-1740751; ONR grant N00014-17-1-2147.
† Part of this research was conducted while at the Institute for Advanced Study. JNW gratefully acknowledges its support.

Abstract. We propose a new statistical model, the spiked transport model, which formalizes the assumption that two probability distributions differ only on a low-dimensional subspace. We study the minimax rate of estimation for the Wasserstein distance under this model and show that this low-dimensional structure can be exploited to avoid the curse of dimensionality. As a byproduct of our minimax analysis, we establish a lower bound showing that, in the absence of such structure, the plug-in estimator is nearly rate-optimal for estimating the Wasserstein distance in high dimension. We also give evidence for a statistical-computational gap and conjecture that any computationally efficient estimator is bound to suffer from the curse of dimensionality.

AMS 2000 subject classifications: Primary 62G99; secondary 62H99.
Key words and phrases: Wasserstein distance; optimal transport; high-dimensional statistics.

1. INTRODUCTION

Optimal transport is an increasingly useful toolbox in various data-driven disciplines such as machine learning [2, 3, 4, 22, 59, 74, 93], statistics [7, 17, 18, 19, 20, 21, 32, 36, 41, 43, 44, 49, 51, 55, 64, 65, 67, 70, 73, 75, 80], computer graphics [71, 72, 76, 82, 91, 94], and the sciences [31, 54, 78, 79]. A core primitive of this toolbox is the computation of Wasserstein distances between probability measures, and a natural statistical question is the estimation of Wasserstein distances from data.

A key object in this endeavor is the empirical measure $\mu_n$ associated to $\mu$, defined by
$$\mu_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}\,, \qquad X_i \sim \mu \ \text{ i.i.d.}$$
Owing to their flexibility, Wasserstein distances are notoriously hard to estimate in high dimension since, in such cases, the empirical distribution $\mu_n$ is a poor proxy for the underlying distribution of interest. Indeed, for $d$-dimensional distributions, Wasserstein distances between empirical measures generally converge at the slow rate $n^{-1/d}$ [11, 23, 29, 34, 90] and thus suffer from the curse of dimensionality. For example, the following behavior is typical.

Proposition 1. Let $\mu$ be a probability measure on $[-1,1]^d$. If $\mu_n$ is an empirical measure comprising $n$ i.i.d. samples from $\mu$, then for any $p \in [1,\infty)$,
$$\mathbb{E} W_p(\mu_n,\mu) \le r_{p,d}(n) := c_p \sqrt{d}\begin{cases} n^{-1/2p} & \text{if } d < 2p \\ n^{-1/2p}(\log n)^{1/p} & \text{if } d = 2p \\ n^{-1/d} & \text{if } d > 2p. \end{cases}$$

In many settings, this bound is known to be tight up to logarithmic factors. In fact, the rate in Proposition 1 has been shown to be essentially minimax optimal for the problem of estimating $\mu$ in Wasserstein distance [77], using any estimator, not necessarily the empirical distribution $\mu_n$. Since $W_p$ satisfies the triangle inequality, Proposition 1 readily yields that $W_p(\mu^{(1)},\mu^{(2)})$ for two probability measures $\mu^{(1)},\mu^{(2)}$ on $[-1,1]^d$ can be estimated at the rate $n^{-1/d}$ by the plug-in estimator $W_p(\mu_n^{(1)},\mu_n^{(2)})$ when $d > 2p$, and a lower bound of the same order can be shown for the plug-in estimator when $\mu^{(1)}$ and $\mu^{(2)}$ are, for example, the uniform measure on $[-1,1]^d$. However, while the above results give a strong indication that the Wasserstein distance $W_p(\mu^{(1)},\mu^{(2)})$ itself is also hard to estimate in high dimension, they do not preclude the existence of estimators that are better than $W_p(\mu_n^{(1)},\mu_n^{(2)})$. Indeed, until recently, the best known lower bound for the problem of estimating the distance itself was of order $n^{-3/(2d)}$ [28]. A concurrent and independent result [58] closes this gap for $p = 1$ and indicates that estimating $W_1$ itself is essentially as hard as estimating the measure itself. Our results show that, in fact, estimating the distance $W_p(\mu^{(1)},\mu^{(2)})$ is essentially as hard as estimating a measure $\mu$ in $W_p$-distance, for any $p \ge 1$. As a result, any estimator of the distance itself must suffer from the curse of dimensionality.

One goal of statistical optimal transport is to develop new models and methods that overcome this curse of dimensionality by leveraging plausible structure in the problem. Early contributions in this direction include assuming smoothness [43, 58, 91] or sparsity [33]. In this work, we propose a new model, called the spiked transport model, to formalize the assumption that two distributions differ only on a low-dimensional subspace of $\mathbb{R}^d$. Such an assumption forms the basis of several popular alternatives to Wasserstein distances such as the Sliced Wasserstein distance [69] or random one-dimensional projections [68]. More recently, several numerical methods that exploit a form of low-dimensional structure were proposed together with illustrations of their good numerical performance [24, 50, 66].

To exploit the low-dimensional structure of the spiked transport model, we consider a standard method in statistics often called "projection pursuit" [35, 52, 53]. This general method aims to alleviate the challenges of high-dimensional data analysis by considering low-dimensional projections of the dataset that reveal "interesting" features of the data. We show that a suitable instantiation of this method to the present problem, which we call Wasserstein Projection Pursuit (WPP), leads to near-optimal rates of estimation of the Wasserstein distance in the spiked transport model and alleviates the curse of dimensionality from which the plug-in estimator suffers.

While our results establish a clear statistical picture, it is unclear how to implement WPP efficiently. An efficient relaxation of this estimator was recently proposed by Paty and Cuturi [66], and a natural question is to analyze its performance in the spiked transport model.
Instead of pursuing this direction, we bring strong evidence that, in fact, no computationally efficient estimator is likely to be able to take advantage of the low-dimensional structure inherent to the spiked transport model. Our computational lower bounds come from the well-established statistical query framework [48]. In particular, they indicate a fundamental trade-off between statistical and computational efficiency [5, 6, 13, 16, 60]: computationally efficient methods to estimate Wasserstein distances are bound to suffer from the curse of dimensionality.

The rest of this paper is organized as follows. We introduce our model and main results in Sections 2 and 3. In Section 4, we define a low-dimensional version of the Wasserstein distance and establish its connection to the spiked transport model. Section 5 proves the equivalence between transportation inequalities and subgaussian concentration properties of the Wasserstein distance. In Section 6, we propose and analyze an estimator for the Wasserstein distance under the spiked transport model. We establish a minimax lower bound in Section 7. Finally, we prove a statistical query bound on the performance of efficient estimators in Section 8. Supplementary proofs and lemmas appear in the appendices.

Notation. We denote by $\|\cdot\|$ the Euclidean norm on $\mathbb{R}^d$. The symbols $\|\cdot\|_{op}$ and $\|\cdot\|_F$ denote the operator norm and Frobenius norm, respectively. If $X$ is a random variable on $\mathbb{R}$, we let $\|X\|_p := (\mathbb{E}|X|^p)^{1/p}$. Throughout, we use $c$ and $C$ to denote positive constants whose value may change from line to line, and we use subscripts to indicate when these constants depend on other parameters. We write $a \lesssim b$ if $a \le Cb$ holds for a universal positive constant $C$.

2. MODEL AND METHODS

In this section, we describe the spiked transport model and Wasserstein projection pursuit.

2.1 Wasserstein distances

Given two probability measures $\mu$ and $\nu$ on $\mathbb{R}^d$, let $\Gamma_{\mu,\nu}$ denote the set of couplings between $\mu$ and $\nu$, so that $\gamma \in \Gamma_{\mu,\nu}$ iff $\gamma(U \times \mathbb{R}^d) = \mu(U)$ and $\gamma(\mathbb{R}^d \times V) = \nu(V)$. For any $p \ge 1$, the $p$-Wasserstein distance $W_p$ between $\mu$ and $\nu$ is defined as

(1) $$W_p(\mu,\nu) := \left( \inf_{\gamma \in \Gamma_{\mu,\nu}} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x-y\|^p \, d\gamma(x,y) \right)^{1/p}.$$

The definition of Wasserstein distances may be extended to measures defined on general spaces, but such extensions are beyond the scope of this paper. We refer the reader to Villani [89] for a comprehensive treatment.
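To make definition (1) concrete, the following is a minimal numerical sketch (ours, not part of the paper) of the one-dimensional case, where the infimum in (1) is attained by the coupling that matches order statistics; the function name is illustrative.

```python
import numpy as np

def empirical_wp_1d(x, y, p=2):
    # In one dimension, the optimal coupling between two empirical
    # measures with n atoms each matches order statistics, so
    # W_p(mu_n, nu_n)^p = (1/n) * sum_i |x_(i) - y_(i)|^p.
    x, y = np.sort(x), np.sort(y)
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)   # samples from N(0, 1)
y = rng.normal(0.5, 1.0, size=1000)   # samples from N(0.5, 1)
print(empirical_wp_1d(x, y))          # approximately 0.5, the mean shift
```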

2.2 Spiked transport model

We introduce a new model that induces a low-dimensional structure on the optimal transport between two measures $\mu^{(1)}$ and $\mu^{(2)}$ over $\mathbb{R}^d$. To that end, fix a subspace $\mathcal{U} \subseteq \mathbb{R}^d$ of dimension $k \ll d$ and let $X^{(1)}, X^{(2)} \in \mathcal{U}$ be two random variables with arbitrary distributions. Next, let $Z$ be a third random variable, independent of $(X^{(1)}, X^{(2)})$ and such that $Z$ is supported on the orthogonal complement $\mathcal{U}^\perp$ of $\mathcal{U}$. Finally, let

(2) $$\mu^{(1)} := \mathrm{Law}(X^{(1)} + Z)\,, \qquad \mu^{(2)} := \mathrm{Law}(X^{(2)} + Z)\,.$$
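For illustration, here is a minimal sampling sketch (ours; the particular laws of $X^{(1)}$, $X^{(2)}$, and $Z$ are arbitrary choices) of a pair of measures satisfying (2):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 50, 1, 500

# Random k-dimensional spike: the columns of U form an orthonormal basis.
U, _ = np.linalg.qr(rng.normal(size=(d, k)))
Q_perp = np.eye(d) - U @ U.T        # projector onto the complement of the spike

def sample(shift):
    # X lives on the spike and its law differs between the two measures;
    # Z has the same law under both measures and lives on the complement.
    X = (rng.normal(size=(n, k)) + shift) @ U.T
    Z = rng.normal(size=(n, d)) @ Q_perp
    return X + Z

mu1, mu2 = sample(0.0), sample(1.0)  # samples from mu^(1) and mu^(2)
```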

Though $\mu^{(1)}$ and $\mu^{(2)}$ are high-dimensional distributions, they differ only on the low-dimensional subspace $\mathcal{U}$. Borrowing terminology from principal component analysis [47], we say that the pair $(\mu^{(1)},\mu^{(2)})$ satisfies the spiked transport model and we call $\mathcal{U}$ the spike.

We pose the following question: given $n$ independent observations from both $\mu^{(1)}$ and $\mu^{(2)}$, is it possible to estimate the Wasserstein distance between them at a rate faster than $n^{-1/d}$?

2.3 Concentration assumptions

In order to establish sharp statistical results for estimation of the Wasserstein distance, it is necessary to adopt smoothness and decay assumptions on the measures in question [see, e.g., 9, 34]. We focus on a family of such conditions known as transport inequalities, the study of which is a central object in the theory of concentration of measure [56].

A probability measure $\mu$ on $\mathbb{R}^d$ is said to satisfy the $T_p(\sigma^2)$ transport inequality if

(3) $$W_p(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu\|\mu)}$$

for all probability measures $\nu$ on $\mathbb{R}^d$.

These inequalities interpolate between several well known assumptions in high-dimensional probability. For example, the inequality $T_1(\sigma^2)$ is essentially equivalent to the assertion that $\mu$ is subgaussian, and $T_2(\sigma^2)$ is implied by (and often equivalent to) a stronger log-Sobolev inequality [40].

Our main results on the estimation of $W_p(\mu^{(1)},\mu^{(2)})$ are established under the assumption that both $\mu^{(1)}$ and $\mu^{(2)}$ satisfy a $T_p(\sigma^2)$ transport inequality. We show in Section 5 that (3) is precisely equivalent to requiring that the random variable $W_p(\mu_n,\mu)$ is subgaussian.

2.4 Wasserstein projection pursuit

To take advantage of the spiked transport model, we employ a natural estimation method that we call Wasserstein Projection Pursuit (WPP). Let $\mu$ and $\nu$ be two probability distributions on $\mathbb{R}^d$. Given a $k \times d$ matrix $U$ with orthonormal rows, let $\mu_U$ (resp. $\nu_U$) denote the distribution of $UY$ where $Y \sim \mu$ (resp. $Y \sim \nu$). We define

$$\tilde{W}_{p,k}(\mu,\nu) := \max_{U \in \mathcal{V}_k(\mathbb{R}^d)} W_p(\mu_U,\nu_U)\,,$$

where the maximization is taken over the Stiefel manifold $\mathcal{V}_k(\mathbb{R}^d)$ of $k \times d$ matrices with orthonormal rows.

Given empirical measures $\mu_n^{(1)}$ and $\mu_n^{(2)}$ associated to $\mu^{(1)}$ and $\mu^{(2)}$ that satisfy the spiked transport model, we propose the following WPP estimator of $W_p(\mu^{(1)},\mu^{(2)})$:
$$\hat{W}_{p,k} = \tilde{W}_{p,k}(\mu_n^{(1)},\mu_n^{(2)})\,.$$
In the next section, we show that this estimator is near-minimax-optimal.
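As a crude numerical stand-in for this estimator when $k = 1$ (where the Stiefel manifold is the unit sphere), the following sketch (ours) replaces the exact maximization, which is the computationally hard step discussed in Section 8, by a search over random unit directions; it reuses `empirical_wp_1d`, `mu1`, and `mu2` from the earlier snippets.

```python
import numpy as np

def wpp_k1(X, Y, p=2, n_dirs=2000, seed=2):
    # Heuristic WPP for k = 1: maximize the projected one-dimensional
    # Wasserstein distance over randomly drawn unit vectors.
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_dirs):
        v = rng.normal(size=X.shape[1])
        v /= np.linalg.norm(v)
        best = max(best, empirical_wp_1d(X @ v, Y @ v, p))
    return best

print(wpp_k1(mu1, mu2))  # roughly recovers the shift planted along the spike
```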

3. MAIN RESULTS

As a theoretical justification for Wasserstein projection pursuit, we prove that our procedure successfully avoids the curse of dimensionality under the spiked transport model. Our results primarily focus on the estimation of the Wasserstein distance itself, but we also obtain, as a byproduct of Wasserstein projection pursuit, an estimator for the spike $\mathcal{U}$ using standard perturbation results.

3.1 Estimation of the Wasserstein distance

The following theorem shows that Wasserstein projection pursuit takes advantage of the low-dimensional structure of the spiked transport model when estimating the Wasserstein distance.

Theorem 1. Let $(\mu^{(1)},\mu^{(2)})$ satisfy the spiked transport model (2). For any $p \in [1,2]$, if $\mu^{(1)}$ and $\mu^{(2)}$ satisfy the $T_p(\sigma^2)$ transport inequality, then the WPP estimator $\hat{W}_{p,k}$ satisfies

$$\mathbb{E}\left|\hat{W}_{p,k} - W_p(\mu^{(1)},\mu^{(2)})\right| \le c_k \cdot \sigma\left( r_{p,k}(n) + \sqrt{\frac{d\log n}{n}} \right).$$

Strikingly, the rate $r_{p,d}(n)$ achieved by the naïve plug-in estimator (see Proposition 1) has been replaced by $r_{p,k}(n)$—in other words, this estimator enjoys the rate typical for $k$-dimensional rather than $d$-dimensional measures. The only dependence on the ambient dimension is in the second term, which is of lower order than the first whenever $p > 1$ or $k > 2$. A more general version of this theorem appears in Section 6.

3.2 Estimation of the spike

We show that if the distance between $\mu^{(1)}$ and $\mu^{(2)}$ is large enough, Wasserstein projection pursuit recovers the subspace $\mathcal{U}$. For simplicity, we state here the result when $k = 1$ and defer the full version to Section 6.

Theorem 2. Let $(\mu^{(1)},\mu^{(2)})$ satisfy the spiked transport model with $k = 1$ and let $\mathcal{U}$ be spanned by the unit vector $u \in \mathbb{R}^d$. Fix $p \in [1,2]$ and assume that $\mu^{(1)}$ and $\mu^{(2)}$ satisfy the $T_p(\sigma^2)$ transport inequality and that $W_p(\mu^{(1)},\mu^{(2)}) \gtrsim 1$. Then the estimator
$$\hat{u} := \operatorname*{argmax}_{v \in \mathbb{R}^d,\, \|v\|=1} W_p\big((\mu_n^{(1)})_v, (\mu_n^{(2)})_v\big)$$
satisfies

$$\mathbb{E}\left[\sin^2\angle(\hat{u}, u)\right] \lesssim \sigma\left( n^{-1/2p} + \sqrt{\frac{d\log n}{n}} \right).$$
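As a numerical sanity check on Theorem 2 (our sketch, reusing `empirical_wp_1d`, `mu1`, `mu2`, and `U` from the snippets above; the random search is again a stand-in for the exact argmax):

```python
import numpy as np

def estimate_spike_k1(X, Y, p=2, n_dirs=2000, seed=3):
    # Direction attaining the (approximate) maximum projected distance.
    rng = np.random.default_rng(seed)
    best_val, best_v = -1.0, None
    for _ in range(n_dirs):
        v = rng.normal(size=X.shape[1])
        v /= np.linalg.norm(v)
        val = empirical_wp_1d(X @ v, Y @ v, p)
        if val > best_val:
            best_val, best_v = val, v
    return best_v

u_hat = estimate_spike_k1(mu1, mu2)
u = U[:, 0]                                              # true spike direction
print(np.sin(np.arccos(min(abs(u_hat @ u), 1.0))) ** 2)  # sin^2 of the angle
```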

3.3 Lower bounds

To show that Theorem 1 has the right dependence on $n$ and $d$, we exhibit two lower bounds, which imply that neither term in Theorem 1 can be avoided. To show the optimality of the first term, we define

$$r'_{p,d}(n) := c_{p,d}\begin{cases} n^{-1/2p} & \text{if } d < 2p \\ n^{-1/2p} & \text{if } d = 2p \\ (n\log n)^{-1/d} & \text{if } d > 2p. \end{cases}$$

Theorem 3. Fix $p \ge 1$. For any estimator $\hat{W}$, there exists a pair of measures $\mu$ and $\nu$ supported on $[0,1]^d$ such that

$$\mathbb{E}\left|\hat{W} - W_p(\mu,\nu)\right| \ge r'_{p,d}(n)\,.$$

This lower bound readily implies that the plug-in estimator for the Wasserstein distance is optimal up to logarithmic factors. By embedding $[0,1]^k$ into $[0,1]^d$, this result likewise implies that the term $r_{p,k}(n)$ in Theorem 1 is essentially optimal. A proof appears in Section 7.

Independently, Liang [58] recently obtained a similar result in the case $p = 1$. More specifically, he proved that when $d \ge 2$, for any estimator $\hat{W}$, there exist probability measures $\mu^{(1)}$ and $\mu^{(2)}$ such that the following lower bound holds:

$$\mathbb{E}\left|\hat{W} - W_1(\mu^{(1)},\mu^{(2)})\right| \gtrsim \frac{\log\log n}{\log n}\, n^{-1/d}\,.$$

In particular, while our lower bound is slightly stronger and holds for all $p \ge 1$, both our result and that of Liang [58] fail to match the naïve upper bound of order $n^{-1/d}$ by logarithmic factors when $d$ is large. The presence of a logarithmic factor in our lower bound comes from a reduction to estimating the total variation distance. In that case, as in several other instances of functional estimation problems, the presence of this factor is, in fact, optimal, and has been dubbed sample size enlargement [45]. Closing this gap in the context of estimation of the Wasserstein distance is an interesting and fundamental question.

The only appearance of the ambient dimension $d$ is in the second term of Theorem 1. The following theorem shows that this dependence cannot be eliminated, even when $k = 1$.

Theorem 4. Let $p \in [1,2]$ and $\sigma > 0$, and assume $k = 1$. For all estimators $\hat{W}$, there exists a pair of measures $\mu^{(1)}$ and $\mu^{(2)}$ satisfying the spiked transport model and $T_p(\sigma^2)$ such that

$$\mathbb{E}\left|\hat{W} - W_p(\mu^{(1)},\mu^{(2)})\right| \gtrsim \sigma\sqrt{\frac{d}{n}}\,.$$

The proof of Theorem 4 is deferred to the appendix.

For problems where $d$ is large, dependence on $d$ may be a crippling limitation. In that case, we conjecture that assuming a sparse spike, in the same spirit as sparse PCA, can mitigate this effect and bring interpretability to the estimated spike.

3.4 A computational-statistical gap

The WPP estimator achieving the rate in Theorem 1 is computationally expensive to implement, which raises the question of whether an efficient estimator achieving the same rate exists. We give evidence in the form of a statistical query lower bound that no such estimator exists. The statistical query model considers algorithms with access to an oracle VSTAT($t$), where $t > 0$ is a parameter which plays the role of sample size. We show that any such algorithm for estimating the Wasserstein distance needs an exponential number of queries to an oracle with exponential sample size parameter, even under the spiked transport model. By contrast, Theorem 1 implies that a non-efficient estimator needs a number of samples only polynomial in the dimension.

Theorem 5. Let $p \in [1,2]$, and consider probability measures $\mu^{(1)}$ and $\mu^{(2)}$ satisfying the spiked transport model. There exists a positive constant $c$ such that any statistical query algorithm which estimates $W_p(\mu^{(1)},\mu^{(2)})$ to accuracy $1/\mathrm{poly}(d)$ with probability at least $2/3$ requires at least $2^{cd}$ queries to VSTAT($2^{cd}$).

4. LOW-DIMENSIONAL WASSERSTEIN DISTANCES

Motivated by projection pursuit, we define the following version of the Wasserstein distance, which measures the discrepancy between low-dimensional projections of the measures.

Definition 1. For $k \in [d]$, the $k$-dimensional Wasserstein distance between $\mu^{(1)}$ and $\mu^{(2)}$ is

$$\tilde{W}_{p,k}(\mu^{(1)},\mu^{(2)}) := \sup_{U \in \mathcal{V}_k(\mathbb{R}^d)} W_p\big(\mu^{(1)}_U, \mu^{(2)}_U\big)\,.$$

This definition has been proposed independently and concurrently by a number of other recent works [24, 50, 66]. We will use throughout the following basic fact about the $k$-dimensional Wasserstein distance [66, Proposition 1].

Proposition 2. $\tilde{W}_{p,k}$ is a metric on the set of probability measures over $\mathbb{R}^d$ with finite $p$th moment.

The definition of $\tilde{W}_{p,k}$ is chosen so that, under the spiked transport model (2), the $k$-dimensional Wasserstein distance agrees with the normal Wasserstein distance.

Proposition 3. Under the spiked transport model (2),

$$\tilde{W}_{p,k}(\mu^{(1)},\mu^{(2)}) = W_p(\mu^{(1)},\mu^{(2)})\,.$$

Proposition 3 follows from the following statement, which pertains to distributions that are allowed to have a different component on the space orthogonal to the spike $\mathcal{U}$. Suppose $\nu^{(1)}$ and $\nu^{(2)}$ satisfy

(4) $$\nu^{(1)} = \mathrm{Law}(X^{(1)} + Z^{(1)})\,, \qquad \nu^{(2)} = \mathrm{Law}(X^{(2)} + Z^{(2)})\,,$$

where as before $X^{(1)}$ and $X^{(2)}$ are supported on a subspace $\mathcal{U}$ and $Z^{(1)}$ and $Z^{(2)}$ are supported on its orthogonal complement $\mathcal{U}^\perp$, and where we assume that $X^{(i)}$ and $Z^{(i)}$ are independent for $i \in \{1,2\}$. Note that unlike in the spiked transport model, the components $Z^{(1)}$ and $Z^{(2)}$ on the orthogonal complement of $\mathcal{U}$ need not be identical. The following result shows that under this relaxed model, the $k$-dimensional Wasserstein distance between $\nu^{(1)}$ and $\nu^{(2)}$ still captures the true Wasserstein distance between the distributions as long as the distributions of $Z^{(1)}$ and $Z^{(2)}$ are sufficiently close.

Proposition 4. Under the relaxed spiked transport model (4),

$$\left| \tilde{W}_{p,k}(\nu^{(1)},\nu^{(2)}) - W_p(\nu^{(1)},\nu^{(2)}) \right| \le W_p\big(\mathrm{Law}(Z^{(1)}), \mathrm{Law}(Z^{(2)})\big)\,.$$

5. CONCENTRATION

A key step to establish the upper bound of Section 6 consists in establishing good concentration properties for the Wasserstein distance between a measure and its empirical counterpart. The main assumption we adopt is that the measures in question satisfy a transport inequality. Since the pioneering work of [61, 62] and [81], transport inequalities have played a central role in the analysis of the concentration properties of high-dimensional measures. We require two definitions.

Definition 2. Given a space $\mathcal{X}$ equipped with a metric $\rho$, denote by $\mathcal{P}(\mathcal{X})$ the space of all Borel probability measures on $\mathcal{X}$. Let $\mathcal{P}_p(\mathcal{X}) := \{\mu \in \mathcal{P}(\mathcal{X}) : \int \rho(x,\cdot)^p \, d\mu(x) < \infty\}$.

A measure $\mu \in \mathcal{P}_p(\mathcal{X})$ satisfies the $T_p(\sigma^2)$ inequality for some $\sigma > 0$ if
$$W_p(\nu,\mu) \le \sqrt{2\sigma^2 D(\nu\|\mu)} \qquad \forall \nu \in \mathcal{P}_p(\mathcal{X})\,,$$
where $W_p$ is the Wasserstein-$p$ distance on $(\mathcal{X},\rho)$ and $D$ is the Kullback-Leibler divergence.

Definition 3. A random variable $X$ on $\mathbb{R}$ is $\sigma^2$-subgaussian if

$$\mathbb{E} e^{\lambda(X - \mathbb{E}X)} \le e^{\lambda^2\sigma^2/2} \qquad \forall \lambda \in \mathbb{R}\,.$$

In this section, we present a surprisingly simple equivalence between transport inequalities and subgaussian concentration for the Wasserstein distance. The essence of this result is present in the works of Gozlan and Léonard [see 38, 39, 40], and similar bounds have been obtained by Bolley et al. [12]. Nevertheless, we could not find this simple fact stated in a form suitable for our purposes in the literature. For any measure $\mu$, recall that the random measure $\mu_n := \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$, where $X_i \sim \mu$ i.i.d., denotes its associated empirical measure.

Theorem 6. Let $p \in [1,2]$. A measure $\mu \in \mathcal{P}_p(\mathcal{X})$ satisfies $T_p(\sigma^2)$ if and only if the random variable $W_p(\mu_n,\mu)$ is $\sigma^2/n$-subgaussian for all $n$.

Because $T_{p'}(\sigma^2)$ implies $T_p(\sigma^2)$ when $p' \ge p$, Theorem 6 also implies that a measure satisfying $T_{p'}(\sigma^2)$ has good concentration for $W_p$ if $p \le p'$. In the opposite direction, if $p > p'$, then satisfying $T_{p'}(\sigma^2)$ still yields a weaker concentration bound. A modification of the proof of Theorem 6 yields the following result.

Theorem 7. Let $p' \in [1,2]$ and $p \ge 1$. If $\mu \in \mathcal{P}_{p'}(\mathcal{X})$ satisfies $T_{p'}(\sigma^2)$, then $W_p(\mu_n,\mu)$ is $\sigma^2/n^{1-\left(\frac{2}{p'}-\frac{2}{p}\right)_+}$-subgaussian.

The conclusion of Theorem 7 is interesting whenever $\left(\frac{2}{p'} - \frac{2}{p}\right)_+ < 1$. For example, if we assume merely that $\mu$ satisfies $T_1(\sigma^2)$, Theorem 7 only yields a nontrivial concentration result for $W_p(\mu_n,\mu)$ when $p < 2$; by contrast, if $\mu$ satisfies $T_2(\sigma^2)$, then Theorem 7 implies a concentration result for $W_p(\mu_n,\mu)$ for all $p < \infty$.

In Section 6, we require concentration properties not of the Wasserstein distance itself but of the $k$-dimensional Wasserstein distance. The following result shows that low-dimensional projections inherit the concentration properties of the $d$-dimensional measure.

Proposition 5. Let $U \in \mathcal{V}_k(\mathbb{R}^d)$. For any $p \in [1,2]$ and $\sigma > 0$, if $\mu$ satisfies $T_p(\sigma^2)$, then so does $\mu_U$.

Proof. The projection $x \mapsto Ux$ is a contraction. The result then follows from Gozlan [37, Corollary 20] [see also 63]. □

We conclude this section by giving some simple conditions under which the $T_1(\sigma^2)$ inequality is satisfied. The following characterization is well known. Denote by $\mathrm{Lip}(\mathcal{X})$ the space of all functions $f : \mathcal{X} \to \mathbb{R}$ satisfying $|f(x) - f(y)| \le d(x,y)$ for all $x,y \in \mathcal{X}$.

Proposition 6 ([10, Theorem 1.3]). A measure $\mu \in \mathcal{P}_1(\mathcal{X})$ satisfies $T_1(\sigma^2)$ if and only if $f(X)$ is $\sigma^2$-subgaussian for all $f \in \mathrm{Lip}(\mathcal{X})$.

It is common to extend Definition 3 to random vectors as follows.

Definition 4. A random variable $X$ on $\mathbb{R}^d$ is $\sigma^2$-subgaussian if $u^\top X$ is $\sigma^2$-subgaussian for all $u \in \mathbb{R}^d$ satisfying $\|u\| = 1$.

Subgaussian random vectors yield a large collection of random variables satisfying a $T_1$ inequality.

Lemma 1. If $\mu$ on $\mathbb{R}^k$ satisfies $T_p(\sigma^2)$ for any $p \ge 1$, then $X \sim \mu$ is $\sigma^2$-subgaussian.

Conversely, if $X \sim \mu$ on $\mathbb{R}^k$ is $\sigma^2$-subgaussian, then $\mu$ satisfies $T_1(Ck\sigma^2)$ for a universal constant $C > 0$.

If the entries of $X$ are independent, then the result holds with $C = 1$ by a result of Marton [61]. The presence of the factor $k$ in the converse statement is unavoidable; unlike $T_2$ inequalities, $T_1$ inequalities do not exhibit dimension-free concentration [40].

6. UPPER BOUNDS

In this section, we establish that under the spiked transport model, Wasserstein projection pursuit produces a significantly more accurate estimate of the Wasserstein distance than the plug-in estimator.

Let $\mu^{(1)}$ and $\mu^{(2)}$ be two measures generated according to the spiked transport model (2). For $i \in \{1,2\}$, we let $\mu_n^{(i)} := \frac{1}{n}\sum_{j=1}^n \delta_{X_j^{(i)}}$, where $X_j^{(i)} \sim \mu^{(i)}$ are i.i.d. We define
$$\hat{W}_{p,k} := \tilde{W}_{p,k}(\mu_n^{(1)}, \mu_n^{(2)})\,.$$

Our main upper bound shows that $\hat{W}_{p,k}$ converges to the true Wasserstein distance $W_p(\mu^{(1)},\mu^{(2)})$ at a rate much faster than $n^{-1/d}$.

Theorem 8. Let $p' \in [1,2]$ and $p \ge 1$. Under the spiked transport model, if $\mu^{(1)}$ and $\mu^{(2)}$ satisfy $T_{p'}(\sigma^2)$, then

$$\mathbb{E}\left|\hat{W}_{p,k} - W_p(\mu^{(1)},\mu^{(2)})\right| \lesssim \sigma\left( r_{p,k}(n) + c_p\, n^{\left(\frac{1}{p'}-\frac{1}{p}\right)_+}\sqrt{\frac{dk\log n}{n}} \right).$$

Theorem 8 can also be extended to the misspecified model proposed in (4).

Theorem 9. Let $p' \in [1,2]$ and $p \ge 1$. Under the relaxed spiked transport model (4), if $\nu^{(1)}$ and $\nu^{(2)}$ satisfy $T_{p'}(\sigma^2)$, then

$$\mathbb{E}\left|\hat{W}_{p,k} - W_p(\nu^{(1)},\nu^{(2)})\right| \lesssim \sigma\left( r_{p,k}(n) + c_p\, n^{\left(\frac{1}{p'}-\frac{1}{p}\right)_+}\sqrt{\frac{dk\log n}{n}} \right) + \varepsilon\,,$$

where $\varepsilon = W_p\big(\mathrm{Law}(Z^{(1)}), \mathrm{Law}(Z^{(2)})\big)$.

Theorem 9, which follows almost immediately from the proof of Theorem 8, establishes that Wasserstein projection pursuit brings statistical benefits even in the situation where the spiked transport model holds only approximately.

Theorem 8 follows from the following two propositions. We first show that the quality of the proposed estimator $\hat{W}_{p,k}$ can be bounded by the sum of two terms depending only on $\mu_n^{(1)}$ and $\mu_n^{(2)}$ individually.

Proposition 7.

$$\mathbb{E}\left|\hat{W}_{p,k} - W_p(\mu^{(1)},\mu^{(2)})\right| \le \mathbb{E}\tilde{W}_{p,k}(\mu_n^{(1)},\mu^{(1)}) + \mathbb{E}\tilde{W}_{p,k}(\mu_n^{(2)},\mu^{(2)})\,.$$

Proof. Since $\hat{W}_{p,k} = \tilde{W}_{p,k}(\mu_n^{(1)},\mu_n^{(2)})$ and $W_p(\mu^{(1)},\mu^{(2)}) = \tilde{W}_{p,k}(\mu^{(1)},\mu^{(2)})$, the claim is immediate from Proposition 2. □

The following proposition allows us to bound both terms of Proposition 7 by the desired quantity.

Proposition 8. Let $p' \in [1,2]$ and $p \ge 1$. If $\mu$ satisfies $T_{p'}(\sigma^2)$, then
$$\mathbb{E}\tilde{W}_{p,k}(\mu,\mu_n) \lesssim \sigma\left( r_{p,k}(n) + c_p\, n^{\left(\frac{1}{p'}-\frac{1}{p}\right)_+}\sqrt{\frac{dk\log n}{n}} \right).$$

Proof. Wasserstein distances are invariant under translating both measures by the same vector. Therefore, we can assume without loss of generality that $\mu$ has mean $0$. Likewise, by homogeneity, we assume $\sigma = 1$. Let $Z_U := W_p(\mu_U, (\mu_n)_U)$. We first show that the process $Z_U$ is Lipschitz.

Lemma 2. There exists a random variable $L$ such that for all $U, V \in \mathcal{V}_k(\mathbb{R}^d)$,
$$|Z_U - Z_V| \le L\,\|U - V\|_{op}$$
and $\mathbb{E}L \lesssim \sqrt{dp}$.

Proof. Let $X \sim \mu$. Then
$$|Z_U - Z_V| \le W_p(\mu_U,\mu_V) + W_p\big((\mu_n)_U,(\mu_n)_V\big) \le \big(\mathbb{E}\|(U-V)X\|^p\big)^{1/p} + \left(\frac{1}{n}\sum_{i=1}^n \|(U-V)X_i\|^p\right)^{1/p} \le \|U-V\|_{op}\left[\big(\mathbb{E}\|X\|^p\big)^{1/p} + \left(\frac{1}{n}\sum_{i=1}^n \|X_i\|^p\right)^{1/p}\right].$$

We obtain that $|Z_U - Z_V| \le L\|U-V\|_{op}$, where $L = (\mathbb{E}\|X\|^p)^{1/p} + \big(\frac{1}{n}\sum_{i=1}^n \|X_i\|^p\big)^{1/p}$. By Jensen's inequality, we have $\mathbb{E}L \le 2(\mathbb{E}\|X\|^p)^{1/p}$. Together with Lemma 5, it yields the claim. □

By Theorem 7, for all $U \in \mathcal{V}_k(\mathbb{R}^d)$, the random variable $Z_U$ is $n^{-1+\left(\frac{2}{p'}-\frac{2}{p}\right)_+}$-subgaussian. Therefore, by a standard $\varepsilon$-net argument, if we denote by $\mathcal{N}(\mathcal{V}_k, \varepsilon, \|\cdot\|_{op})$ the covering number of $\mathcal{V}_k$ with respect to the operator norm, we obtain

$$\mathbb{E}\sup_{U \in \mathcal{V}_k(\mathbb{R}^d)} (Z_U - \mathbb{E}Z_U) \lesssim \inf_{\varepsilon > 0}\left\{ \varepsilon\,\mathbb{E}L + n^{\left(\frac{1}{p'}-\frac{1}{p}\right)_+}\sqrt{\frac{\log \mathcal{N}(\mathcal{V}_k,\varepsilon,\|\cdot\|_{op})}{n}} \right\}.$$

Lemma 4 shows that there exists a universal constant $c$ such that $\log \mathcal{N}(\mathcal{V}_k,\varepsilon,\|\cdot\|_{op}) \le dk\log\frac{c\sqrt{k}}{\varepsilon}$ for $\varepsilon \in (0,1]$. Choosing $\varepsilon = \sqrt{k/n}$ yields
$$\mathbb{E}\sup_{U \in \mathcal{V}_k(\mathbb{R}^d)} (Z_U - \mathbb{E}Z_U) \lesssim \sqrt{\frac{dkp}{n}} + n^{\left(\frac{1}{p'}-\frac{1}{p}\right)_+}\sqrt{\frac{dk\log n}{n}} \le c_p\, n^{\left(\frac{1}{p'}-\frac{1}{p}\right)_+}\sqrt{\frac{dk\log n}{n}}\,.$$
Applying Proposition 17 yields

$$\mathbb{E}\sup_{U \in \mathcal{V}_k(\mathbb{R}^d)} W_p(\mu_U,(\mu_n)_U) \le \sup_{U \in \mathcal{V}_k(\mathbb{R}^d)} \mathbb{E}W_p(\mu_U,(\mu_n)_U) + \mathbb{E}\sup_{U \in \mathcal{V}_k(\mathbb{R}^d)} (Z_U - \mathbb{E}Z_U) \lesssim r_{p,k}(n) + c_p\, n^{\left(\frac{1}{p'}-\frac{1}{p}\right)_+}\sqrt{\frac{dk\log n}{n}}\,,$$

as claimed. □

We also obtain a Davis-Kahan-type theorem on subspace recovery. Given two subspaces $\mathcal{U}_1$ and $\mathcal{U}_2$, the minimal angle [25, 27] between them is defined to be

$$\angle(\mathcal{U}_1,\mathcal{U}_2) := \arccos\left( \sup_{u_1 \in \mathcal{U}_1,\, u_2 \in \mathcal{U}_2} \frac{u_1^\top u_2}{\|u_1\|\,\|u_2\|} \right).$$
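Numerically, the minimal angle is computable from a singular value decomposition; the following sketch (ours, assuming each subspace is given by a matrix with orthonormal columns) illustrates this:

```python
import numpy as np

def minimal_angle(U1, U2):
    # The supremum of u1^T u2 over unit vectors in the two subspaces is
    # the largest singular value of U1^T U2, so the minimal angle is its
    # arccosine (clipped to guard against floating-point overshoot).
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return np.arccos(min(s.max(), 1.0))
```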

If $\angle(\mathcal{U}_1,\mathcal{U}_2) = 0$, then $\mathcal{U}_1 \cap \mathcal{U}_2 \ne \{0\}$, so that $\mathcal{U}_1$ and $\mathcal{U}_2$ are at least partially aligned. In the important special case that $\mathcal{U}_1$ and $\mathcal{U}_2$ are each one dimensional, this definition reduces to the angle between the subspaces.

The following result indicates that as long as $\mu^{(1)}$ and $\mu^{(2)}$ are well separated, Wasserstein projection pursuit also yields a subspace with at least partial alignment to $\mathcal{U}$.

Theorem 10. Let $p' \in [1,2]$ and $p \ge 1$. Assume that $\mu^{(1)}$ and $\mu^{(2)}$ satisfy the spiked transport model and $T_{p'}(\sigma^2)$. Let $\hat{\mathcal{U}} := \mathrm{span}(\hat{U})$, where
$$\hat{U} := \operatorname*{argmax}_{U \in \mathcal{V}_k(\mathbb{R}^d)} W_p\big((\mu_n^{(1)})_U, (\mu_n^{(2)})_U\big)\,.$$
Then
$$\mathbb{E}\sin^2\angle(\hat{\mathcal{U}},\mathcal{U}) \lesssim \frac{\sigma\left( r_{p,k}(n) + c_p\, n^{\left(\frac{1}{p'}-\frac{1}{p}\right)_+}\sqrt{\frac{dk\log n}{n}} \right)}{W_p(\mu^{(1)},\mu^{(2)})}\,.$$

A proof of Theorem 10 appears in the appendix. Note that a bound on the minimal angle is a rather weak guarantee. Indeed, $\angle(\hat{\mathcal{U}},\mathcal{U}) \to 0$ implies that the subspaces $\hat{\mathcal{U}}$ and $\mathcal{U}$ asymptotically share at least a common line, but not more. When $k = 1$, this ensures recovery of the subspace, but this no longer holds true for higher-dimensional spikes. In retrospect, such a guarantee is all that we can hope for under the mere assumption that $W_p(\mu^{(1)},\mu^{(2)}) > 0$. Indeed, it may be the case that these distributions differ only on a one-dimensional space. Stronger guarantees may be achieved by assuming that $W_p(\mu_V^{(1)},\mu_V^{(2)}) > 0$ for a large family of $V$, but we leave them for future research.

7. A LOWER BOUND ON ESTIMATING THE WASSERSTEIN DISTANCE

In this section, we prove that the rate $r_{p,d}(n)$ is optimal for estimating the Wasserstein distance, up to logarithmic factors. The core idea of our lower bound is to relate estimating the Wasserstein distance to the problem of estimating total variation distance, sharp rates for which are known [45, 86]. To obtain sufficient control over the Wasserstein distance as a function of total variation, we prove a refined bound incorporating both total variation and the $\chi^2$ divergence (Proposition 9). We then show a modified lower bound (Proposition 10) for a testing problem involving the total variation distance over the class of distributions on $[m]$ close to the uniform measure in $\chi^2$ divergence.

In the interest of generality, we formulate our results for any compact $\mathcal{X}$ whose covering numbers satisfy

(5) $$c\varepsilon^{-d} \le \mathcal{N}(\mathcal{X},\varepsilon) \le C\varepsilon^{-d}$$

for all $\varepsilon \le \mathrm{diam}(\mathcal{X})$. This condition clearly holds for compact subsets of $\mathbb{R}^d$ and more generally for metric spaces with Minkowski dimension $d$. We adopt the assumption $\mathrm{diam}(\mathcal{X}) = 1$ without loss of generality.

Let $\mathcal{P}$ be the set of distributions supported on $\mathcal{X}$ and let $R(n,\mathcal{P})$ denote the minimax risk over $\mathcal{P}$,

$$R(n,\mathcal{P}) := \inf_{\hat{W}} \sup_{\mu,\nu \in \mathcal{P}} \mathbb{E}_{\mu,\nu}\left|\hat{W} - W_p(\mu,\nu)\right|.$$

The bound $R(n,\mathcal{P}) \gtrsim n^{-1/2p}$ is an almost trivial consequence of the fact that the distribution $\frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_1$ cannot be distinguished from $(\frac{1}{2}+\varepsilon)\delta_{-1} + (\frac{1}{2}-\varepsilon)\delta_1$ on the basis of $n$ samples when $\varepsilon \asymp n^{-1/2}$. The interesting part of Theorem 3 is the rate when $d > 2p$. We prove the following.

Theorem 11. Let $d > 2p \ge 2$ and assume $\mathcal{X}$ satisfies (5). Then
$$R(n,\mathcal{P}) \ge C_{d,p}\,(n\log n)^{-1/d}\,.$$

Before proving Theorem 11, we establish the two propositions described above. Proposition 9 allows us to reduce Theorem 11 to an estimation problem involving total variation distance, and Proposition 10 is a lower bound on the minimax rate for that total variation estimation problem.

Proposition 9. Assume $d > 2p \ge 2$, and let $m$ be a positive integer. Let $u$ be the uniform distribution on $[m] := \{1,\dots,m\}$. There exists a random function $F : [m] \to \mathcal{X}$ such that for any distribution $q$ on $[m]$,
$$c\,m^{-1/d}\, d_{TV}(q,u)^{1/p} \le W_p(F_\sharp q, F_\sharp u) \le C_{d,p}\, m^{-1/d}\,\big(\chi^2(q,u)\big)^{1/d}\, d_{TV}(q,u)^{\frac{1}{p}-\frac{2}{d}}$$
with probability at least $.9$.

Proof. Lemma 6 shows that the condition $\mathcal{N}(\mathcal{X},\varepsilon) \ge c\varepsilon^{-d}$ implies the existence of a set $\mathcal{G}_m := \{x_1,\dots,x_m\} \subseteq \mathcal{X}$ such that $d(x_i,x_j) \gtrsim m^{-1/d}$ for all $i \ne j$. We select $F$ uniformly at random from the set of all bijections from $[m]$ to $\mathcal{G}_m$.

To show the lower bound, we note that any points $x,y \in \mathcal{G}_m$ satisfy
$$d(x,y)^p \gtrsim m^{-p/d}\,\mathbb{1}\{x \ne y\}\,,$$
which implies that for any coupling $\pi$ between $F_\sharp q$ and $F_\sharp u$,

$$\int d(x,y)^p \, d\pi(x,y) \gtrsim m^{-p/d}\,\mathbb{P}_\pi[X \ne Y] \ge m^{-p/d}\, d_{TV}(F_\sharp q, F_\sharp u) = m^{-p/d}\, d_{TV}(q,u)\,.$$

The lower bound therefore holds with probability $1$. We now turn to the upper bound. We employ a dyadic covering bound [90, Proposition 1]. For any $k^*$ there exists a dyadic partition $\{\mathcal{Q}^k\}_{1 \le k \le k^*}$ of $\mathcal{X}$ with parameter $\delta = 1/3$ such that $|\mathcal{Q}^k| \le \mathcal{N}(\mathcal{X}, 3^{-(k+1)})$. We obtain that for any $k^* \ge 0$,
$$W_p^p(F_\sharp q, F_\sharp u) \le 3^{-k^* p} + \sum_{k=1}^{k^*} 3^{-(k-1)p} \sum_{Q_i^k \in \mathcal{Q}^k} \big|F_\sharp q(Q_i^k) - F_\sharp u(Q_i^k)\big|\,.$$
By Lemma 7, for any $k$,

$$\mathbb{E}\sum_{Q_i^k \in \mathcal{Q}^k} \big|F_\sharp q(Q_i^k) - F_\sharp u(Q_i^k)\big| \le 2\,d_{TV}(q,u) \wedge C_{d,p}\left(\frac{3^{kd}\chi^2(q,u)}{m}\right)^{1/2}.$$

Let $k_0$ be a positive integer to be fixed later. By applying the first bound, we obtain

$$\mathbb{E}\sum_{k > k_0} 3^{-(k-1)p} \sum_{Q_i^k \in \mathcal{Q}^k} \big|F_\sharp q(Q_i^k) - F_\sharp u(Q_i^k)\big| \lesssim 3^{-k_0 p}\, d_{TV}(q,u)\,.$$
Applying the second bound and recalling that $d/2 > p$ yields

$$\mathbb{E}\sum_{k \le k_0} 3^{-(k-1)p} \sum_{Q_i^k \in \mathcal{Q}^k} \big|F_\sharp q(Q_i^k) - F_\sharp u(Q_i^k)\big| \le C_{d,p}\left(\frac{\chi^2(q,u)}{m}\right)^{1/2} \sum_{k \le k_0} 3^{k(d/2-p)} \le C_{d,p}\left(\frac{\chi^2(q,u)}{m}\right)^{1/2} 3^{k_0(d/2-p)}\,.$$

We obtain for any $k^* \ge 0$ that
$$\mathbb{E}W_p^p(F_\sharp q, F_\sharp u) \le C_{d,p}\left(\frac{\chi^2(q,u)}{m}\right)^{1/2} 3^{k_0(d/2-p)} + C\,3^{-k_0 p}\, d_{TV}(q,u) + 3^{-k^* p}\,,$$
and taking $k^* \to \infty$ it suffices to bound the first two terms.

Let $k_0$ be the smallest positive integer such that

$$3^{k_0 d} \ge m\,\frac{d_{TV}(q,u)^2}{\chi^2(q,u)}\,.$$

Then
$$\left(\frac{\chi^2(q,u)}{m}\right)^{1/2} 3^{k_0 d/2} \le C_d \cdot d_{TV}(q,u)\,,$$
and hence

$$\mathbb{E}W_p^p(F_\sharp q, F_\sharp u) \le C_{d,p}\,3^{-k_0 p}\, d_{TV}(q,u) \le C_{d,p}\, m^{-p/d}\,\big(\chi^2(q,u)\big)^{p/d}\, d_{TV}(q,u)^{1-\frac{2p}{d}}\,.$$

The claim follows from Markov's inequality. □

We now show that there are composite hypotheses that are well separated in total variation distance but nevertheless hard to distinguish on the basis of samples.

Proposition 10. Fix a positive integer $n$ and a constant $\delta \in [0,1/10]$. Given a positive integer $m$, let $\mathcal{D}_m$ be the set of probability distributions $q$ on $[m]$ satisfying $\chi^2(q,u) \le 9$. Denote by $\mathcal{D}_{m,\delta}^-$ the subset of $\mathcal{D}_m$ of distributions satisfying $d_{TV}(q,u) \le \delta$ and by $\mathcal{D}_m^+$ the subset of $\mathcal{D}_m$ satisfying $d_{TV}(q,u) \ge 1/4$. If $m = \lceil C\delta^{-1}n\log n\rceil$ for a sufficiently large universal constant $C$ and $n$ is sufficiently large, then

$$\inf_\psi\left\{ \sup_{q \in \mathcal{D}_m^+} \mathbb{P}_q[\psi = 1] + \sup_{q \in \mathcal{D}_{m,\delta}^-} \mathbb{P}_q[\psi = 0] \right\} \ge .9\,,$$
where the infimum is taken over all (possibly randomized) tests based on $n$ samples.

The proof of Proposition 10 follows a strategy due to Valiant and Valiant [85] and Wu and Yang [92], and our argument is a modification of theirs which permits simultaneous control of total variation and the $\chi^2$ divergence. We give the proof in Appendix A.2. We now give a proof of the main theorem.

Proof of Theorem 11. Let $\hat{W}$ be any estimator for the Wasserstein distance between distributions on $\mathcal{X}$ constructed on the basis of $n$ samples from each distribution. Let $u$ be the uniform distribution on $[m]$, for some $m$ to be specified. Let $c^*$ be the constant appearing in the lower bound of Proposition 9 and define $\Delta_d = \frac{1}{16}c^* m^{-1/d}$.

Given $n$ samples $X_1,\dots,X_n$ from an unknown distribution on $[m]$, define the randomized test

$$\psi = \psi(X_1,\dots,X_n) := \mathbb{1}\left\{ \hat{W}\big(F(X_1),\dots,F(X_n); F(Y_1),\dots,F(Y_n)\big) \le 2\Delta_d \right\},$$
where $F$ is the random function constructed in Proposition 9 and where the $Y_i$ are i.i.d. from $u$.

By Proposition 9, if $\delta \le \delta_{d,p} := \left(\frac{c^*}{176\,C_{d,p}}\right)^{(1/p-2/d)^{-1}}$, any $q \in \mathcal{D}_{m,\delta}^-$ satisfies the bound $W_p(F_\sharp q, F_\sharp u) \le \Delta_d$ with probability at least $.9$. Likewise, for $q \in \mathcal{D}_m^+$, the bound $W_p(F_\sharp q, F_\sharp u) \ge 3\Delta_d$ also holds with probability at least $.9$.

Define the event $A = \{|\hat{W} - W_p(F_\sharp q, F_\sharp u)| \ge \Delta_d\}$. We obtain, for any $q \in \mathcal{D}_{m,\delta}^-$,
$$\mathbb{E}_F\,\mathbb{P}_{F_\sharp q, F_\sharp u}[A] \ge \mathbb{E}_F\,\mathbb{P}_{F_\sharp q, F_\sharp u}\big[\hat{W} > 2\Delta_d \text{ and } W_p(F_\sharp q, F_\sharp u) \le \Delta_d\big] \ge \mathbb{E}_F\,\mathbb{P}_{F_\sharp q, F_\sharp u}[\hat{W} > 2\Delta_d] - \mathbb{P}[W_p(F_\sharp q, F_\sharp u) > \Delta_d] \ge \mathbb{P}_q[\psi = 0] - .1\,,$$
and analogously for $q \in \mathcal{D}_m^+$,
$$\mathbb{E}_F\,\mathbb{P}_{F_\sharp q, F_\sharp u}[A] \ge \mathbb{P}_q[\psi = 1] - .1\,.$$
For any estimator $\hat{W}$, we have

$$\sup_{\mu,\nu \in \mathcal{P}} \mathbb{P}_{\mu,\nu}\big[|\hat{W} - W_p(\mu,\nu)| \ge \Delta_d\big] \ge \frac{1}{2}\left( \sup_{q \in \mathcal{D}_m^+} \mathbb{E}_F\,\mathbb{P}_{F_\sharp q, F_\sharp u}[A] + \sup_{q \in \mathcal{D}_{m,\delta}^-} \mathbb{E}_F\,\mathbb{P}_{F_\sharp q, F_\sharp u}[A] \right) \ge \frac{1}{2}\left( \sup_{q \in \mathcal{D}_m^+} \mathbb{P}_q[\psi = 1] + \sup_{q \in \mathcal{D}_{m,\delta}^-} \mathbb{P}_q[\psi = 0] \right) - .1\,.$$

Choosing $m = \lceil C\delta^{-1}n\log n\rceil$ for a sufficiently large constant $C$ and applying Proposition 10 yields that $\sup_{\mu,\nu \in \mathcal{P}} \mathbb{P}_{\mu,\nu}\big[|\hat{W} - W_p(\mu,\nu)| \ge \Delta_d\big] \ge .35$, and Markov's inequality yields the claim. □

8. COMPUTATIONAL-STATISTICAL GAPS FOR THE SPIKED TRANSPORT MODEL

Sections 4 and 7 clarify the statistical price for estimating the Wasserstein distance for high-dimensional measures. Section 4 shows that the curse of dimensionality can be avoided under the spiked transport model. The WPP estimator exploits the low-dimensional structure in the spiked transport model, thereby beating the worst-case rate presented in Section 7. However, it is not clear how to make the estimator we propose computationally efficient. In this section, we give evidence that this obstruction is a fundamental obstacle, that is, that no computationally efficient estimator can beat the curse of dimensionality.

The statistical query model, first introduced in the context of PAC learning [48], is a well known computational framework for analyzing statistical algorithms. Instead of being given access to data points from a distribution, a statistical query (SQ) algorithm can approximately evaluate the expectation of arbitrary functions with respect to the distribution. This model naturally captures the power of noise-tolerant algorithms [48] and is strong enough to implement nearly all common machine learning procedures [see, e.g., 8]. We recall the following definition.

Definition 5. Given a distribution $\mathcal{D}$ on $\mathbb{R}^d$, for any sample size parameter $t > 0$ and function $f : \mathbb{R}^d \to [0,1]$, the oracle VSTAT($t$) returns a value $v \in [p - \tau, p + \tau]$, where $p = \mathbb{E}f(X)$ and $\tau = \frac{1}{t} \vee \sqrt{\frac{p(1-p)}{t}}$.

A query to a VSTAT($t$) oracle can be simulated by using a data set of size approximately $t$. Our main result proves a lower bound against an oracle with sample size parameter $t = 2^{cd}$ for a positive constant $c$. Simulating such an oracle would require a number of samples exponential in the dimension. Nevertheless, we show that even under this strong assumption, at least $2^{cd}$ queries to the oracle are required. This result suggests that any computationally efficient procedure to estimate the Wasserstein distance under the spiked transport model requires an exponential number of samples. By contrast, Section 6 establishes that, information theoretically, only a polynomial number of samples are required. We now state our main result.

Theorem 12. There exists a positive universal constant $c$ such that, for any $d$, estimating $W_1(\mu^{(1)},\mu^{(2)})$ for distributions $\mu^{(1)}$ and $\mu^{(2)}$ on $\mathbb{R}^d$ satisfying the spiked transport assumption with $k = 1$ to accuracy $\Theta(1/\sqrt{d})$ with probability at least $2/3$ requires at least $2^{cd}$ queries to VSTAT($2^{cd}$).

Our proof is based on a construction due to [26] [see also 14]. We defer the details to the appendix.
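To make Definition 5 concrete, the following sketch (ours, purely illustrative) shows how one VSTAT($t$) query can be simulated from data, with the true mean $p$ replaced by an empirical mean over roughly $t$ samples and an adversarial perturbation within the allowed tolerance:

```python
import numpy as np

def vstat_answer(samples, f, t, rng=np.random.default_rng(4)):
    # The oracle may return any value within tau of p = E f(X), where
    # tau = max(1/t, sqrt(p(1-p)/t)); here p is approximated by the
    # empirical mean and the answer is perturbed within the tolerance.
    vals = np.asarray([f(x) for x in samples])
    p_hat = float(vals.mean())
    tau = max(1.0 / t, float(np.sqrt(max(p_hat * (1 - p_hat), 0.0) / t)))
    return p_hat + rng.uniform(-tau, tau)
```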

APPENDIX A: PROOFS OF LOWER BOUNDS

A.1 Proof of Theorem 4

We reduce from the spiked covariance model. By homogeneity, we may assume that $\sigma = 1$. Let $\mu^{(1)}$ be the standard Gaussian measure on $\mathbb{R}^d$, and for $\mu^{(2)}$ we take either the standard Gaussian measure or the distribution of a centered Gaussian with covariance $I + \beta vv^\top$, where $\|v\| = 1$ and $\beta > 0$ is to be specified. As long as $\beta \lesssim 1$, the measure $\mu^{(2)}$ is an $O(1)$-Lipschitz pushforward of the Gaussian measure. Hence, it satisfies $T_p(O(1))$ [37, Corollary 20].

Note that if $\mu^{(2)}$ has covariance $I + \beta vv^\top$, then

$$W_p(\mu^{(1)},\mu^{(2)}) \ge W_1(\mu^{(1)},\mu^{(2)}) = W_1\big(\mathcal{N}(0,1), \mathcal{N}(0,1+\beta)\big) \gtrsim \beta\,.$$

However, Cai et al. [15, Proposition 2] establish that the minimax testing error for
$$H_0 : \mathcal{N}(0,I) \quad \text{vs.} \quad H_1 : \mathcal{N}(0, I + \beta vv^\top)\,,\ \|v\| = 1\,,$$
is bounded below by a constant when $\beta \lesssim \sqrt{d/n}$. A standard application of Le Cam's two-point method [84] yields the claim. □

A.2 Proof of Proposition 10

We require the existence of two distributions on R+, which will serve as the building blocks of our construction.

Proposition 11. For any integer $L \ge 0$ and $\varepsilon \in [0,1/6]$, there exists a pair of random variables $U$ and $V$ with the following properties:
• $\mathbb{E}U^j = \mathbb{E}V^j$ for all $j \le L$;
• $U, V \in [0, 16\varepsilon^{-1}L^2]$ almost surely;
• $\mathbb{E}U = \mathbb{E}V = 1$ and $\mathbb{E}U^2 = \mathbb{E}V^2 \le 6$;
• $\mathbb{E}|U - 1| \le 12\varepsilon$ but $\mathbb{E}|V - 1| \ge 1$.

The proof is deferred to Appendix A.3.

Proof of Proposition 10. The proof follows closely the approach of Wu and Yang [92, Proposition 1]. We first employ a standard argument showing that we can consider the Poissonized setting. It is trivial to see that given samples $X_1,\dots,X_n$ from a distribution $q$ on $[m]$, the counts $N_i = N_i(X_1,\dots,X_n) := |\{j \in [n] : X_j = i\}|$ are sufficient for $q$. We therefore consider tests $\psi$ based on count vectors. Note that, under $q$, the count vector $(N_1,\dots,N_m)$ has distribution $\mathrm{Multinomial}(n,q)$. Define
$$R_n := \inf_\psi\left\{ \sup_{q \in \mathcal{D}_m^+} \mathbb{P}_q[\psi = 1] + \sup_{q \in \mathcal{D}_{m,\delta}^-} \mathbb{P}_q[\psi = 0] \right\}.$$
We aim to prove a lower bound on $R_n$. Let $\rho > 0$, and for $n \ge 1$ let $\{\psi_n\}$ be a set of near optimal tests for a fixed sample size; i.e.,
$$\sup_{q \in \mathcal{D}_{m,\delta}^- \cup \mathcal{D}_m^+} \mathbb{P}_q\big[\psi_n \ne \mathbb{1}\{q \in \mathcal{D}_{m,\delta}^-\}\big] \le R_n + \rho\,.$$

Define the set of approximate probability vectors

$$\tilde{\mathcal{D}}_{m,\delta}^- := \left\{ q \in \mathbb{R}_+^m : \left|\sum_{i=1}^m q_i - 1\right| \le \delta\,,\ \frac{q}{\sum_{i=1}^m q_i} \in \mathcal{D}_{m,\delta}^- \right\}$$
$$\tilde{\mathcal{D}}_m^+ := \left\{ q \in \mathbb{R}_+^m : \left|\sum_{i=1}^m q_i - 1\right| \le \delta\,,\ \frac{q}{\sum_{i=1}^m q_i} \in \mathcal{D}_m^+ \right\}.$$

We let $\tilde{\mathcal{D}}_{m,\delta} := \tilde{\mathcal{D}}_{m,\delta}^- \cup \tilde{\mathcal{D}}_m^+$. Given $q \in \tilde{\mathcal{D}}_{m,\delta}$, define the renormalization $\bar{q} = q/\sum_{i=1}^m q_i$. We then define

$$\tilde{R}_n := \inf_\psi\left\{ \sup_{q \in \tilde{\mathcal{D}}_m^+} \mathbb{P}_q[\psi = 1] + \sup_{q \in \tilde{\mathcal{D}}_{m,\delta}^-} \mathbb{P}_q[\psi = 0] \right\},$$
where the infimum is taken over all estimators based on the counts $N_1,\dots,N_m$ and where $\mathbb{P}_q$ indicates the probability when $N_1,\dots,N_m$ are independent and $N_i \sim \mathrm{Pois}(nq_i)$ for all $i \in [m]$. We set $N = \sum_{i=1}^m N_i$, and note that, conditioned on $N = n'$, the count vector $(N_1,\dots,N_m)$ has distribution $\mathrm{Multinomial}(n', \bar{q})$. We define a test $\tilde{\psi}$ based on these Poissonized counts by setting

$$\tilde{\psi}(N_1,\dots,N_m) := \psi_N(N_1,\dots,N_m)\,.$$

This definition along with the near optimality of $\psi_{n'}$ for $n' \ge 0$ implies
$$\sup_{q \in \tilde{\mathcal{D}}_m^+} \mathbb{P}_q[\tilde{\psi} = 1] + \sup_{q \in \tilde{\mathcal{D}}_{m,\delta}^-} \mathbb{P}_q[\tilde{\psi} = 0] \le \sum_{n' \ge 0} R_{n'}\,\mathbb{P}_q[N = n'] + \rho \le R_n + \mathbb{P}[N < n] + \rho\,.$$

$$d_{TV}(\mathbb{P}, \mathbb{P}') \le m\left(\frac{8enL}{\varepsilon m}\right)^L.$$

Let $E = \{Q \in \tilde{\mathcal{D}}_{m,\delta}^-\}$ and $E' = \{Q' \in \tilde{\mathcal{D}}_m^+\}$. By Lemma 10, $\mathbb{P}[E^C]$ and $\mathbb{P}[E'^C]$ are each at most $C\frac{L^4}{\delta^2 m}$. Let $\pi_E$ be the law of $Q$ conditioned on $E$, and define $\pi'_{E'}$ analogously, and let $\mathbb{P}_E$ and $\mathbb{P}'_{E'}$ be the laws of $N$ and $N'$ under these priors. We obtain for any estimator $\psi$ based on count vectors

$$\sup_{q \in \tilde{\mathcal{D}}_m^+} \mathbb{P}_q[\psi = 1] + \sup_{q \in \tilde{\mathcal{D}}_{m,\delta}^-} \mathbb{P}_q[\psi = 0] \ge \int \mathbb{P}_{q'}[\psi = 1]\, d\pi'_{E'}(q') + \int \mathbb{P}_q[\psi = 0]\, d\pi_E(q) \ge 1 - d_{TV}(\mathbb{P}_E, \mathbb{P}'_{E'}) \ge 1 - d_{TV}(\mathbb{P}, \mathbb{P}') - C\frac{L^4}{\delta^2 m}\,.$$

Choosing $L = c\,\frac{\delta m}{n}$ for a sufficiently small constant $c$ yields that
$$\tilde{R}_n \ge 1 - m\exp\left(-C\frac{\delta m}{n}\right) - C\frac{\delta^2 m^3}{n^4}\,.$$

Therefore
$$R_n \ge 1 - m\exp\left(-C\frac{\delta m}{n}\right) - C\frac{\delta^2 m^3}{n^4} - \exp(-Cn)\,,$$
and choosing $m = \lceil C\delta^{-1}n\log n\rceil$ for $C$ a sufficiently large constant and $n$ sufficiently large yields the claim. □

A.3 Proof of Proposition 11

First, the reduction of Wu and Yang [92, Lemma 7] implies that it suffices to construct random variables $Y$ and $Y'$ such that $\mathbb{E}Y^j = \mathbb{E}Y'^j$ for all $0 \le j < L$.

$$\Delta_\varepsilon,\ \Delta'_\varepsilon \in \left[0, \tfrac{9}{5}\varepsilon^2\right], \qquad Z_\varepsilon \ge \tfrac{3}{10}\varepsilon\,,$$
which implies in particular that both $Q$ and $Q'$ are probability distributions. Let $Y \sim Q$ and $Y' \sim Q'$. We first check the last three conditions. Clearly $Y$ and $Y'$ are supported on $[1, 16\varepsilon^{-1}L^2]$, and Lemma 9 implies that $\mathbb{E}Y = \mathbb{E}Y' \le 6$. We have $\mathbb{P}[Y = 1] = 1 - \frac{\Delta_\varepsilon}{Z_\varepsilon} \ge 1 - 6\varepsilon$, and since $Y' \ge 2$ almost surely the bound $\mathbb{E}|Y' - 1| \ge 1$ is immediate.

It remains to check the moment-matching condition. Any polynomial $p(y)$ of degree at most $L-1$ can be written
$$p(y) = (y-1)(y-2)q(y) + \alpha y + \beta\,,$$
where $q(y)$ has degree less than $L-1$. Then

$$\mathbb{E}p(Y) - \mathbb{E}p(Y') = \mathbb{E}(Y-1)(Y-2)q(Y) - \mathbb{E}(Y'-1)(Y'-2)q(Y') + \alpha\,(\mathbb{E}Y - \mathbb{E}Y')\,.$$

The last term vanishes because $\mathbb{E}Y = \mathbb{E}Y'$, and

$$\mathbb{E}(Y-1)(Y-2)q(Y) - \mathbb{E}(Y'-1)(Y'-2)q(Y') = \frac{1}{Z_\varepsilon}\left( \mathbb{E}q(\varepsilon^{-1}X) - \mathbb{E}q(\varepsilon^{-1}X') \right) = 0\,,$$
since $\mathbb{E}X^j = \mathbb{E}X'^j$ for all $j < L$.

$$W_1(A, \mathcal{N}(0,1)) \ge W_1(Q, \mathcal{N}(0,1)) - W_1(A, Q) = \Omega(1/\sqrt{m})$$
and $\chi^2(A, \mathcal{N}(0,1)) = O(\sqrt{m})\exp(O(m)) = \exp(O(m))$.

Finally, we show that $A$ is $O(1)$-subgaussian, and therefore satisfies $T_1(C)$ for a positive constant $C$. Standard facts [see 88] imply that it suffices to show that if $X \sim A$, then $\|X\|_k = O(\sqrt{k})$ for all $k$. This clearly holds for $k \le 2m-1$, since $\mathcal{N}(0,1)$ is itself $1$-subgaussian and the first $2m-1$ moments of $A$ and

$\mathcal{N}(0,1)$ agree. On the other hand, for $k \ge 2m$, if $Y \sim Q$ and $Z \sim \mathcal{N}(0,1)$, then $X = Y + \sqrt{\delta}Z \sim A$, and

$$\|Y + \sqrt{\delta}Z\|_k \le \|Y\|_\infty + \sqrt{\delta}\,\|Z\|_k \lesssim \sqrt{m} + \sqrt{k/m} \lesssim \sqrt{k}\,,$$
as desired. □

The separation $W_1(A, \mathcal{N}(0,1)) = \Omega(1/\sqrt{m})$ in Proposition 12 is easily seen to be tight. Indeed, Rigollet and Weed [72, Corollary 2] show that if $\mu$ and $\nu$ are $O(1)$-subgaussian and agree on their first $O(m)$ moments, then $W_1(\mu,\nu) = O(1/\sqrt{m})$.

By planting the distribution constructed in Proposition 12 in a random direction, we obtain two high-dimensional measures satisfying the spiked transport model.

Lemma 3. Let $v$ be a unit vector in $\mathbb{R}^d$, and denote by $P_v$ the distribution on $\mathbb{R}^d$ of the random variable $Xv + Z$, where $X \sim A$ and $Z \sim \mathcal{N}(0, I_d - vv^\top)$ is independent of $X$. Then $\mu^{(1)} = P_v$ and $\mu^{(2)} = \mathcal{N}(0, I_d)$ satisfy the spiked transport model (2), and $W_1(P_v, \mathcal{N}(0,I_d)) = \Omega(1/\sqrt{m})$.

Proof. If we let $\xi \sim \mathcal{N}(0,1)$, then $\mathcal{N}(0,I_d)$ is the law of $\xi v + Z$, where $\xi$ and $Z$ are independent and $Z \sim \mathcal{N}(0, I_d - vv^\top)$. Denoting by $\mathcal{U}$ the span of $v$, we see that $P_v$ and $\mathcal{N}(0,I_d)$ satisfy (2) with $X^{(1)} = Xv$ and $X^{(2)} = \xi v$. By Propositions 3 and 12, $W_1(P_v, \mathcal{N}(0,I_d)) = W_1(A, \mathcal{N}(0,1)) = \Omega(1/\sqrt{m})$. □

The proof of Theorem 12 follows from a framework due to Feldman et al. [30], from which the following result is extracted. Given distributions $P_1$, $P_2$, and $Q$, define
$$\chi_Q^2(P_1, P_2) := \int \left(\frac{dP_1}{dQ} - 1\right)\left(\frac{dP_2}{dQ} - 1\right) dQ\,.$$
We call a set of distributions $\mathcal{P}$ $(\gamma,\beta)$-correlated with respect to $Q$ if for all $P_i, P_j \in \mathcal{P}$,
$$\chi_Q^2(P_i, P_j) \le \begin{cases} \beta & \text{if } i = j, \\ \gamma & \text{if } i \ne j. \end{cases}$$
We then have the following.

Proposition 13. Let $\mathcal{Q}$ be a set of distributions, and let $Q$ be a reference distribution. Suppose that there exists a set $\mathcal{P} \subseteq \mathcal{Q}$ such that $\mathcal{P}$ is $(\gamma,\beta)$-correlated with respect to $Q$. Then any SQ algorithm that distinguishes queries from $P = Q$ and $P \in \mathcal{Q}$ requires at least $|\mathcal{P}|\gamma/3\beta$ queries to VSTAT($1/2\gamma$).

Proof. By choosing $\gamma' = \gamma$ in Feldman et al. [30, Lemma 3.10], we obtain that the set $\mathcal{P}$ satisfies $\mathrm{SDA}(\mathcal{P}, Q, 2\gamma) \ge |\mathcal{P}|\gamma/(\beta - \gamma) \ge |\mathcal{P}|\gamma/\beta$. Then Feldman et al. [30, Theorem 3.7] implies that distinguishing $Q$ from $\mathcal{Q}$ requires at least $|\mathcal{P}|\gamma/3\beta$ queries to VSTAT($1/2\gamma$). □

We can now prove the lower bound.

Proof of Theorem 12. Let $\mathcal{Q}$ be the set $\{P_v : v \in \mathbb{R}^d,\ \|v\| = 1\}$. The Johnson-Lindenstrauss lemma [46] implies that for any $\delta \in (0,1)$, there exists a

set of $2^{\Omega(\delta^2 d)}$ unit vectors in $\mathbb{R}^d$ with pairwise inner product at most $\delta$. Denote by $\mathcal{S}$ a set of such vectors, and set $\mathcal{P} := \{P_v : v \in \mathcal{S}\} \subseteq \mathcal{Q}$.

Write $\Delta$ for $\chi^2(A, \mathcal{N}(0,1))$, and recall that $\Delta = \exp(O(m))$. By Diakonikolas et al. [26, Lemma 3.4], $\chi^2_{\mathcal{N}(0,I)}(P_v, P_{v'}) \le |v \cdot v'|^{2m}\Delta$, and the set $\mathcal{P}$ is therefore $(\delta^{2m}\Delta, \Delta)$-correlated. By Proposition 13, any SQ algorithm that distinguishes between $\mu = \mathcal{N}(0,I_d)$ and $\mu \in \mathcal{P}$ with probability at least $2/3$ requires $2^{\Omega(\delta^2 d)}\delta^{2m}$ queries to VSTAT($\delta^{-2m}/2\Delta$).

Let $\delta > 0$ be a constant small enough that $\delta^{-2m}/2\Delta = \exp(\Omega(m))$. If $m = cd$ for a sufficiently small positive constant $c$, then $2^{\Omega(\delta^2 d)}\delta^{2m} = \exp(\Omega(d))$. An SQ algorithm to distinguish queries from $\mathcal{N}(0,I_d)$ from those from a distribution in $\mathcal{Q}$ with probability at least $2/3$ therefore requires $\exp(\Omega(d))$ queries to VSTAT($\exp(\Omega(d))$). Therefore, by Lemma 3, any SQ algorithm which estimates $W_1$ under the spiked transport model to accuracy $\Theta(1/\sqrt{m}) = \Theta(1/\sqrt{d})$ requires $\exp(\Omega(d))$ queries to VSTAT($\exp(\Omega(d))$), as claimed. □

APPENDIX B: SUPPLEMENTARY MATERIAL

B.1 Additional lemmas

Lemma 4. There exists a universal constant $c$ such that for all $\varepsilon \in (0,1)$, the covering number of $\mathcal{V}_k(\mathbb{R}^d)$ satisfies
$$\log \mathcal{N}(\mathcal{V}_k, \varepsilon, \|\cdot\|_{op}) \le dk \log \frac{c\sqrt{k}}{\varepsilon}\,.$$

Proof. If we denote by $\mathcal{S}_d$ the unit sphere in $\mathbb{R}^d$, then $\mathcal{V}_k(\mathbb{R}^d) \subseteq \mathcal{S}_d^{\times k}$, and
$$\|U - V\|_{op}^2 \le \sum_{i=1}^k \|U_i - V_i\|^2\,,$$
where $\{U_i\}$ (resp. $\{V_i\}$) are the rows of $U$ (resp. $V$). The claim follows immediately from the fact that $\log \mathcal{N}(\mathcal{S}_d, \varepsilon, \|\cdot\|) \le d\log\frac{c}{\varepsilon}$ for a universal constant $c$. □

Lemma 5. Let $\mu$ be a centered distribution satisfying $T_1(\sigma^2)$. If $X \sim \mu$, then $(\mathbb{E}\|X\|^p)^{1/p} \lesssim \sqrt{dp}$.

Proof. By the monotonicity of $L_p$ norms, we may assume that $p$ is a positive even integer. The claim then follows from Proposition 16 and standard bounds on the moments of subgaussian random variables [88]. □

The following proposition provides a sort of converse to Proposition 3: if the k- dimensional Wasserstein distance agrees with the Wasserstein distance, then there is an optimal coupling between the measures which acts only on a k-dimensional subspace.

Proposition 14. If $\tilde{W}_{p,k}(\mu,\nu) = W_p(\mu,\nu)$, then there exists a coupling $\gamma$ optimal for $\mu$ and $\nu$ such that $X - Y$ is supported on a $k$-dimensional subspace.

Proof. Let $\gamma$ be an optimal coupling for $\mu$ and $\nu$, and write

$$U = \operatorname*{argmax}_{V \in \mathcal{V}_k(\mathbb{R}^d)} W_p(\mu_V,\nu_V)\,.$$

Then

$$W_p^p(\mu,\nu) = \int \|x-y\|^p \, d\gamma(x,y) \ge \int \|U(x-y)\|^p \, d\gamma(x,y) \ge \tilde{W}_{p,k}^p(\mu,\nu)\,,$$
where the first inequality uses the fact that $\|U\|_{op} = 1$ and the second uses the fact that if $(X,Y) \sim \gamma$, then $UX$ and $UY$ are distributed according to $\mu_U$ and $\nu_U$, respectively. If the first inequality holds with equality, then we must have $\|X - Y\| = \|U(X-Y)\|$ $\gamma$-almost surely, and if the second holds with equality then this coupling is optimal for $\mu_U$ and $\nu_U$. □

Proposition 15 ([88, Proposition 2.5.2]). If a real-valued random variable $X$ is $\sigma^2$-subgaussian, then

(8) $$\mathbb{E}e^{\lambda^2 X^2} \le e^{4\lambda^2\sigma^2} \qquad \forall\, |\lambda| \le \frac{1}{2\sigma}\,.$$

Conversely, if (8) holds and $\mathbb{E}X = 0$, then $X$ is $8\sigma^2$-subgaussian.

Proof. For the first claim, let $Z$ be a standard Gaussian random variable independent of $X$. Then, if $\lambda^2\sigma^2 < 1/4$,

$$\mathbb{E}e^{\lambda^2 X^2} = \mathbb{E}e^{\sqrt{2}\lambda ZX} \le \mathbb{E}e^{\lambda^2\sigma^2 Z^2} = \frac{1}{(1-2\lambda^2\sigma^2)^{1/2}} \le e^{4\lambda^2\sigma^2}\,,$$
where the last step uses the inequality $(1-x)^{-1} \le e^{2x}$ for $0 \le x \le \frac{1}{2}$.

Conversely, suppose that (8) holds and $\mathbb{E}X = 0$. Then, for $|\lambda| \le \frac{1}{2\sigma}$, we have
$$\mathbb{E}e^{\lambda X} \le \mathbb{E}\lambda X + \mathbb{E}e^{\lambda^2 X^2} \le e^{4\lambda^2\sigma^2}\,.$$
If $|\lambda| > \frac{1}{2\sigma}$, then
$$\mathbb{E}e^{\lambda X} \le \mathbb{E}e^{2\lambda^2\sigma^2 + \frac{X^2}{8\sigma^2}} \le e^{2\lambda^2\sigma^2 + 1} \le e^{4\lambda^2\sigma^2}\,. \qquad \square$$

Proposition 16. If $X$ on $\mathbb{R}^k$ is $\sigma^2$-subgaussian, then $\varepsilon\|X\|$ is $(8k\sigma^2)$-subgaussian, where $\varepsilon$ is a Rademacher random variable independent of $X$.

Proof. For $|\lambda| < \frac{1}{2\sqrt{k}\sigma}$,
$$\mathbb{E}\exp\big(\lambda^2(\varepsilon\|X\|)^2\big) = \mathbb{E}\exp(\lambda^2\|X\|^2) \le \max_i \mathbb{E}\exp\big(k\lambda^2 (e_i^\top X)^2\big) \le \exp(4k\lambda^2\sigma^2)\,.$$
Applying Proposition 15 yields the claim. □

Proposition 17. Let $p' \in [1,2]$. If $\rho$ on $\mathbb{R}^k$ satisfies $T_{p'}(\sigma^2)$, then for any $p \in [1,\infty)$,
$$\mathbb{E}W_p(\rho_n,\rho) \le r_{p,k}(n) := c_p\,\sigma\sqrt{k}\begin{cases} n^{-1/2p} & \text{if } k < 2p \\ n^{-1/2p}(\log n)^{1/p} & \text{if } k = 2p \\ n^{-1/k} & \text{if } k > 2p. \end{cases}$$

Proof. Lemma 1 implies that $X \sim \rho$ is $\sigma^2$-subgaussian. In particular, $X$ satisfies $(\mathbb{E}\|X\|^q)^{1/q} \le \sigma\sqrt{qk}$. Choosing $q = 3p$ and using Lei [57, Theorem 3.1] implies the claim. □

Lemma 6. If $\mathcal{N}_\varepsilon(\mathcal{X}) \ge c\varepsilon^{-d}$, then for any positive integer $m$ there exists a subset of $\mathcal{X}$ of cardinality $m$ such that each pair of points is separated by at least $\frac{1}{2}\left(\frac{c}{m}\right)^{1/d}$.

Proof. Let $S$ be a maximal subset of $\mathcal{X}$ such that each pair of points in $S$ is separated by at least $\frac{1}{2}\left(\frac{c}{m}\right)^{1/d}$. Then, for any point $x \in \mathcal{X}$, there must be an $s \in S$ such that $d(x,s) < \frac{1}{2}\left(\frac{c}{m}\right)^{1/d}$, since otherwise $x$ could be added to $S$. Therefore, balls of radius $\frac{1}{2}\left(\frac{c}{m}\right)^{1/d}$ around the points $s \in S$ cover $\mathcal{X}$, which implies
$$|S| \ge \mathcal{N}\left(\mathcal{X}, \left(\frac{c}{m}\right)^{1/d}\right) \ge m\,,$$
as claimed. □

Lemma 7. Let $Q_1,\dots,Q_\ell$ be a partition of $\mathcal{X}$. If $F$ is a uniform random bijection from $[m]$ to $\mathcal{G}_m$, then
$$\mathbb{E}\sum_{i=1}^\ell \big|F_\sharp q(Q_i) - F_\sharp u(Q_i)\big| \le \|q - u\|_1 \wedge \sqrt{\frac{\ell\cdot\chi^2(q,u)}{m}}\,.$$

Proof. The first bound follows immediately from the triangle inequality:

$$\sum_{i=1}^\ell \big|F_\sharp q(Q_i) - F_\sharp u(Q_i)\big| \le \sum_{i=1}^\ell \sum_{j : F(j) \in Q_i} \left|q(j) - \frac{1}{m}\right| = \|q - u\|_1\,.$$

We now show the second bound. Fix $i \in [\ell]$. Under the random choice of $F$, the quantity $F_\sharp q(Q_i) - F_\sharp u(Q_i) = \sum_{j : F(j) \in Q_i} \big(q(j) - \frac{1}{m}\big)$ is a sum of $|Q_i|$ terms selected uniformly without replacement from the set $\Delta := \{q(j) - \frac{1}{m} : j \in [m]\}$.

By Hoeffding [42, Theorem 4], since $x \mapsto |x|$ is continuous and convex,
$$\mathbb{E}\left|\sum_{j : F(j) \in Q_i} \left(q(j) - \frac{1}{m}\right)\right| \le \mathbb{E}\left|\sum_{j=1}^{|Q_i|} \Delta_j\right|,$$
where the $\Delta_j$ are selected uniformly with replacement from $\Delta$. Note that $\mathbb{E}\Delta_j = 0$, and $\mathbb{E}\Delta_j^2 = \frac{1}{m}\sum_{j=1}^m (q(j) - \frac{1}{m})^2 = m^{-2}\chi^2(q,u)$. Jensen's inequality then yields
$$\mathbb{E}\left|\sum_{j=1}^{|Q_i|} \Delta_j\right| \le \frac{\sqrt{|Q_i|\,\chi^2(q,u)}}{m}\,.$$

The Cauchy-Schwarz inequality finally implies

$$\mathbb{E}\sum_{i=1}^\ell \left|\sum_{j : F(j) \in Q_i} \left(q(j) - \frac{1}{m}\right)\right| \le \sum_{i=1}^\ell \frac{\sqrt{|Q_i|\,\chi^2(q,u)}}{m} \le \sqrt{\frac{\ell\,\chi^2(q,u)}{m}}\,. \qquad \square$$


Lemma 8. Let $\varepsilon \in [0,1/6]$. If $X, X' \ge 1$ almost surely and $\mathbb{E}\frac{1}{X} - \mathbb{E}\frac{1}{X'} \ge \frac{1}{2}$, then
$$\mathbb{E}\frac{1}{(\varepsilon^{-1}X-1)(\varepsilon^{-1}X-2)}\,,\ \mathbb{E}\frac{1}{(\varepsilon^{-1}X'-1)(\varepsilon^{-1}X'-2)} \in \left[0, \tfrac{9}{5}\varepsilon^2\right]$$
and
$$\mathbb{E}\frac{1}{\varepsilon^{-1}X-2} - \mathbb{E}\frac{1}{\varepsilon^{-1}X'-1} \ge \frac{3}{10}\varepsilon\,.$$

Proof. If $x \ge 1$ and $\varepsilon \le 1/6$, then $(x-\varepsilon)(x-2\varepsilon) \ge \frac{5}{9}$, so
$$\frac{1}{(\varepsilon^{-1}x-1)(\varepsilon^{-1}x-2)} = \frac{\varepsilon^2}{(x-\varepsilon)(x-2\varepsilon)} \le \frac{9}{5}\varepsilon^2\,,$$
and since $X, X' \ge 1$ almost surely the first claim is immediate.

Similarly, if $x' \ge 1$ and $\varepsilon \le 1/6$, we have $\frac{\varepsilon}{x'(x'-\varepsilon)} \le \frac{1}{5}$; hence almost surely
$$\frac{1}{\varepsilon^{-1}X-2} - \frac{1}{\varepsilon^{-1}X'-1} \ge \frac{\varepsilon}{X} - \frac{\varepsilon}{X'} - \frac{\varepsilon^2}{X'(X'-\varepsilon)} \ge \frac{\varepsilon}{X} - \frac{\varepsilon}{X'} - \frac{1}{5}\varepsilon\,,$$
and taking expectations yields the claim. □

Lemma 9. Let $Q$ and $Q'$ be defined as in (6) and (7). If $Y \sim Q$ and $Y' \sim Q'$, then $\mathbb{E}Y = \mathbb{E}Y' \le 6$.

Proof. It follows directly from the definition that
$$\mathbb{E}Y = 1 + \frac{1}{Z_\varepsilon}\left(\int \frac{y}{(y-1)(y-2)}\,\mathbb{P}_\varepsilon(dy) - \Delta_\varepsilon\right) = 1 + \frac{1}{Z_\varepsilon}\int \frac{1}{y-2}\,\mathbb{P}_\varepsilon(dy)$$
and analogously

$$\mathbb{E}Y' = 2 + \frac{1}{Z_\varepsilon}\left(\int \frac{y'}{(y'-1)(y'-2)}\,\mathbb{P}'_\varepsilon(dy') - 2\Delta'_\varepsilon\right) = 2 + \frac{1}{Z_\varepsilon}\int \frac{1}{y'-1}\,\mathbb{P}'_\varepsilon(dy')\,,$$
which implies
$$\mathbb{E}Y' - \mathbb{E}Y = 1 - \frac{1}{Z_\varepsilon}\left(\int \frac{1}{y-2}\,d\mathbb{P}_\varepsilon(y) - \int \frac{1}{y'-1}\,d\mathbb{P}'_\varepsilon(y')\right) = 1 - \frac{1}{Z_\varepsilon}(Z_\varepsilon) = 0\,.$$
To verify the inequality, note that
$$\int \frac{1}{y-2}\,\mathbb{P}_\varepsilon(dy) = \mathbb{E}\frac{1}{\varepsilon^{-1}X-2} = \varepsilon\,\mathbb{E}\frac{1}{X-2\varepsilon}\,,$$
and the fact that $X \ge 1$ almost surely and $\varepsilon \le 1/6$ implies that this quantity is bounded by $\frac{3}{2}\varepsilon$. By Lemma 8, $Z_\varepsilon \ge \frac{3}{10}\varepsilon$; therefore
$$\mathbb{E}Y = 1 + \frac{1}{Z_\varepsilon}\int \frac{1}{y-2}\,\mathbb{P}_\varepsilon(dy) \le 1 + \frac{\frac{3}{2}\varepsilon}{\frac{3}{10}\varepsilon} = 6\,. \qquad \square$$

Lemma 10. If $Q = \frac{1}{m}(U_1,\dots,U_m)$ and $Q' = \frac{1}{m}(V_1,\dots,V_m)$, where $U$ and $V$ satisfy the assumptions of Proposition 11 with $\varepsilon = \frac{1}{24}\delta$, and recall that $E = \{Q \in \tilde{\mathcal{D}}_{m,\delta}^-\}$ and $E' = \{Q' \in \tilde{\mathcal{D}}_m^+\}$. Then
$$\mathbb{P}[E^C] + \mathbb{P}[E'^C] \lesssim \frac{L^4}{\delta^2 m}\,.$$

Proof. We first show that $E$ holds as long as $\frac{1}{m}\sum_{i=1}^m U_i^2 \le 8$ and $\frac{1}{m}\sum_{i=1}^m |U_i - 1| \le \delta$. Indeed, if we set $s := \sum_{i=1}^m Q_i$, where $Q_i = U_i/m$, then the second condition together with Jensen's inequality implies that $|s - 1| \le \delta$. Therefore, recalling that $\bar{Q} = Q/s$, we get

$$\chi^2(\bar{Q},u) + 1 = m\sum_{i=1}^m \bar{Q}_i^2 = \frac{1}{s^2 m}\sum_{i=1}^m U_i^2 \le \frac{8}{s^2} \le \frac{8}{1-2\delta} \le 10\,,$$
where the final inequality follows from the assumption that $\delta \le 1/10$. Hence $\chi^2(\bar{Q},u) \le 9$.

Likewise,
$$d_{TV}(\bar{Q},u) = \frac{1}{2}\sum_{i=1}^m \left|\bar{Q}_i - \frac{1}{m}\right| \le \frac{1}{2}\sum_{i=1}^m |\bar{Q}_i - Q_i| + \frac{1}{2}\sum_{i=1}^m \left|Q_i - \frac{1}{m}\right| \le \frac{1}{2}(|s-1| + \delta) \le \delta\,.$$

By Chebyshev’s inequality combined with the assumption tha t EU 2 6, ≤ m 2 P 1 2 Var(U ) Ui > 8 , "m # ≤ 4m Xi=1 and since U 16ε 1L2 almost surely, ≤ − 2 4 1 2 2 2 2 4 Var(U ) EU (16ε− L ) EU . δ− L . ≤ ≤ Moreover, since ε = 1 δ and E U 1 12ε, 24 | − |≤ m P 1 4 Var( U 1 ) 1 Ui 1 > δ 2| − | . 2 . "m | − | # ≤ δ m δ m Xi=1 Combining these bounds yields

$$\mathbb{P}[E^C] \lesssim \frac{L^4}{\delta^2 m}\,.$$

The argument for $E'$ is analogous. It suffices for $E'$ to hold that $\frac{1}{m}\sum_{i=1}^m V_i^2 \le 8$ along with the conditions $\frac{1}{m}\sum_{i=1}^m |V_i - 1| \ge \frac{3}{4}$ and $\big|(\frac{1}{m}\sum_{i=1}^m V_i) - 1\big| \le \delta$. As above, the first condition is violated with probability at most $C\frac{L^4}{\delta^2 m}$, and the latter two conditions are violated with probability at most $C\frac{1}{\delta^2 m}$. The claim follows. □

B.2 Omitted proofs

Proof of Proposition 4. Denote by $\rho^{(1)}$ and $\rho^{(2)}$ the distributions of $Z^{(1)}$ and $Z^{(2)}$, respectively. It suffices to show $\tilde{W}_{p,k}(\nu^{(1)},\nu^{(2)}) + W_p(\rho^{(1)},\rho^{(2)}) \ge W_p(\nu^{(1)},\nu^{(2)})$.

Let $U \in \mathcal{V}_k(\mathbb{R}^d)$ be such that the column span of $U$ agrees with the subspace $\mathcal{U}$ on which the random variables $X^{(1)}$ and $X^{(2)}$ are supported in the definition of $\nu^{(1)}$ and $\nu^{(2)}$. We let $(Y^{(1)},Y^{(2)})$ be a pair of random variables such that marginally $Y^{(i)} \sim \nu_U^{(i)}$ for $i \in \{1,2\}$ and
$$W_p^p(\nu_U^{(1)},\nu_U^{(2)}) = \mathbb{E}\|Y^{(1)} - Y^{(2)}\|^p\,.$$

Likewise, let $(W^{(1)},W^{(2)})$ be a coupling of $\rho^{(1)}$ and $\rho^{(2)}$ such that
$$W_p^p(\rho^{(1)},\rho^{(2)}) = \mathbb{E}\|W^{(1)} - W^{(2)}\|^p\,.$$
Then (4) implies that $(Y^{(1)} + W^{(1)}, Y^{(2)} + W^{(2)})$ is a coupling of $\nu^{(1)}$ and $\nu^{(2)}$. Therefore

$$W_p(\nu^{(1)},\nu^{(2)}) \le \left(\mathbb{E}\big\|Y^{(1)} + W^{(1)} - (Y^{(2)} + W^{(2)})\big\|^p\right)^{1/p} \le \left(\mathbb{E}\|Y^{(1)} - Y^{(2)}\|^p\right)^{1/p} + \left(\mathbb{E}\|W^{(1)} - W^{(2)}\|^p\right)^{1/p} = W_p(\nu_U^{(1)},\nu_U^{(2)}) + W_p(\rho^{(1)},\rho^{(2)}) \le \tilde{W}_{p,k}(\nu^{(1)},\nu^{(2)}) + W_p(\rho^{(1)},\rho^{(2)})\,. \qquad \square$$

Proof of Lemma 1. This can be deduced from a series of well known facts. By Proposition 6, it suffices to show that for $f$ Lipschitz,

$$\mathbb{E}e^{\lambda(f(X) - \mathbb{E}f(X))} \le e^{Ck\lambda^2\sigma^2/2}\,.$$
By symmetrization, we can bound

$$\mathbb{E}e^{\lambda(f(X) - \mathbb{E}f(X))} \le \mathbb{E}e^{\lambda\varepsilon(f(X) - f(X'))} \le \mathbb{E}e^{\lambda\varepsilon\|X - X'\|}\,,$$
where $X' \sim \mu$ is independent of $X$. Continuing, we obtain
$$\mathbb{E}e^{\lambda\varepsilon\|X - X'\|} \le \mathbb{E}e^{2\lambda\varepsilon\|X\|} \le e^{16k\lambda^2\sigma^2}$$
by Proposition 16. This proves the claim with $C = 32$. □

Proof of Theorem 6. First, assume that $\mu$ satisfies the $T_p(\sigma^2)$ inequality. Then, by Gozlan and Léonard [40, Proposition 1.9], the product measure $\mu^{\otimes n}$ satisfies the inequality $T_p(n^{2/p-1}\sigma^2)$ on the space $\mathcal{X}^n$ equipped with the metric $\rho_p(x,y) := \big(\sum_{i=1}^n \rho(x_i,y_i)^p\big)^{1/p}$. Therefore, $\mu^{\otimes n}$ also satisfies $T_1(n^{2/p-1}\sigma^2)$ with respect to this same metric.

We now note that the function

$$(x_1,\dots,x_n) \mapsto W_p\left(\frac{1}{n}\sum_{i=1}^n \delta_{x_i}, \mu\right)$$

is $n^{-1/p}$-Lipschitz with respect to $\rho_p$. Indeed, we have

$$\left|W_p\left(\frac{1}{n}\sum_{i=1}^n \delta_{x_i}, \mu\right) - W_p\left(\frac{1}{n}\sum_{i=1}^n \delta_{y_i}, \mu\right)\right| \le W_p\left(\frac{1}{n}\sum_{i=1}^n \delta_{x_i}, \frac{1}{n}\sum_{i=1}^n \delta_{y_i}\right) \le \left(\frac{1}{n}\sum_{i=1}^n \rho(x_i,y_i)^p\right)^{1/p} = n^{-1/p}\rho_p(x,y)\,.$$

Combining these observations with Proposition 6, we obtain

$$\mathbb{E}e^{\lambda(W_p(\mu_n,\mu) - \mathbb{E}W_p(\mu_n,\mu))} \le e^{\lambda^2 n^{-2/p}\, n^{2/p-1}\sigma^2/2} = e^{\lambda^2 n^{-1}\sigma^2/2} \qquad \forall \lambda \in \mathbb{R}\,.$$

In the other direction, suppose that $W_p(\mu_n,\mu)$ is $\sigma^2/n$-subgaussian. A Chernoff bound implies the concentration inequality

$$\mathbb{P}[W_p(\mu_n, \mu) - \mathbb{E}W_p(\mu_n, \mu) \ge t] \le e^{-nt^2/2\sigma^2}\,.$$
We follow the proof of Gozlan [38, Theorem 3.4], and only sketch the argument here. It is easy to verify that the function $W_p(\cdot\,, \mu)$ is lower semi-continuous with respect to the weak topology [see 89, proof of Theorem 4.1], which implies that, for any $t \ge 0$, the set $\mathcal{O}_t := \{\nu \in \mathcal{P}(\mathcal{X}) : W_p(\nu, \mu) > t\}$ is open. Moreover, as $\mu \in \mathcal{P}_p$, Villani [89, Theorem 6.9] combined with Varadarajan's theorem yields that $\mathbb{E}W_p(\mu_n, \mu) \to 0$ as $n \to \infty$ [see 89, proof of Theorem 22.22].
Applying Sanov's theorem then yields
\begin{align*}
-\inf_{\nu \in \mathcal{O}_t} D(\nu \,\|\, \mu) &\le \liminf_{n\to\infty} \frac{1}{n} \log \mathbb{P}[W_p(\mu_n, \mu) > t] \\
&\le \liminf_{n\to\infty} \frac{1}{n} \cdot \Big(\!-\frac{n \max\{t - \mathbb{E}W_p(\mu_n,\mu),\, 0\}^2}{2\sigma^2}\Big) \\
&= -\frac{t^2}{2\sigma^2}\,.
\end{align*}
We obtain that, for any $t \ge 0$,
$$D(\nu \,\|\, \mu) \ge \frac{t^2}{2\sigma^2} \qquad \forall \nu \in \mathcal{P}(\mathcal{X}) \text{ s.t. } W_p(\nu, \mu) > t\,.$$
Therefore $\mu$ satisfies $T_p(\sigma^2)$.
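The first part of the theorem predicts that the fluctuations of $W_p(\mu_n, \mu)$ around their mean are of order $n^{-1/2}$, even when $\mathbb{E}W_p(\mu_n, \mu)$ itself decays much more slowly. A quick simulation for $p = 1$ in one dimension (the Gaussian $\mu$ is an arbitrary choice, approximated by a large auxiliary sample):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(4)
ref = rng.normal(size=200_000)  # large sample standing in for mu

for n in (100, 400, 1600):
    vals = [wasserstein_distance(rng.normal(size=n), ref) for _ in range(200)]
    # the rescaled standard deviation should be roughly constant in n
    print(n, np.std(vals) * np.sqrt(n))
```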

Proof of Theorem 7. As in the proof of Theorem 6, we have that $\mu^{\otimes n}$ satisfies $T_1(n^{2/p'-1}\sigma^2)$ with respect to $\rho_{p'}(x, y)$. We also note that the function

$$(x_1, \ldots, x_n) \mapsto W_p\Big(\frac{1}{n}\sum_{i=1}^n \delta_{x_i}, \mu\Big)$$
is $n^{-1/p}$-Lipschitz with respect to $\rho_{p'}$, since the fact that $p' \le p$ implies
$$\Big|W_p\Big(\frac{1}{n}\sum_{i=1}^n \delta_{x_i}, \mu\Big) - W_p\Big(\frac{1}{n}\sum_{i=1}^n \delta_{y_i}, \mu\Big)\Big| \le n^{-1/p}\rho_p(x, y) \le n^{-1/p}\rho_{p'}(x, y)\,.$$
We conclude as in the first part of the proof of Theorem 6.
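The only new ingredient relative to Theorem 6 is the comparison $\rho_p \le \rho_{p'}$ for $p' \le p$, i.e., monotonicity of $\ell_p$ norms, which a two-line check confirms on arbitrary nonnegative data:

```python
import numpy as np

rng = np.random.default_rng(5)
d = rng.uniform(size=50)  # coordinatewise distances rho(x_i, y_i)

rho = lambda p: np.sum(d**p) ** (1.0 / p)  # rho_p(x, y)
assert rho(2.0) <= rho(1.0)                # p' = 1 <= p = 2
```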

Proof of Theorem 10. We show that for any measures $\nu^{(1)}, \nu^{(2)}$, if $\mathcal{V} = \mathrm{span}(\hat{V})$ where $\hat{V} = \operatorname*{argmax}_{U \in \mathcal{V}_k(\mathbb{R}^d)} W_p(\nu_U^{(1)}, \nu_U^{(2)})$, then
$$\sin^2 \angle(\mathcal{V}, \mathcal{U}) \lesssim \frac{\tilde{W}_{p,k}(\mu^{(1)}, \nu^{(1)}) + \tilde{W}_{p,k}(\mu^{(2)}, \nu^{(2)})}{W_p(\mu^{(1)}, \mu^{(2)})}\,.$$
Combined with Proposition 8, this implies the claim. Proposition 14 guarantees the existence of an optimal coupling $\gamma$ between $\mu^{(1)}$ and $\mu^{(2)}$ such that if $(X, Y) \sim \gamma$, then $X - Y$ lies in $\mathcal{U}$ almost surely. Let $U \in \mathcal{V}_k(\mathbb{R}^d)$ be such that the column span of $U$ is $\mathcal{U}$. We have
\begin{align*}
W_p(\mu_{\hat{V}}^{(1)}, \mu_{\hat{V}}^{(2)}) &\le \big(\mathbb{E}\|\hat{V}^\top(X - Y)\|^p\big)^{1/p} = \big(\mathbb{E}\|\hat{V}^\top U U^\top (X - Y)\|^p\big)^{1/p} \\
&\le \|\hat{V}^\top U\|_{\mathrm{op}} \big(\mathbb{E}\|X - Y\|^p\big)^{1/p} = \|\hat{V}^\top U\|_{\mathrm{op}}\, W_p(\mu^{(1)}, \mu^{(2)})\,.
\end{align*}

By the definition of $\hat{V}$,

\begin{align*}
W_p(\mu^{(1)}, \mu^{(2)}) - W_p(\mu_{\hat{V}}^{(1)}, \mu_{\hat{V}}^{(2)}) &= W_p(\mu_U^{(1)}, \mu_U^{(2)}) - W_p(\mu_{\hat{V}}^{(1)}, \mu_{\hat{V}}^{(2)}) \\
&\le W_p(\mu_U^{(1)}, \mu_U^{(2)}) - W_p(\nu_U^{(1)}, \nu_U^{(2)}) \\
&\qquad + W_p(\nu_{\hat{V}}^{(1)}, \nu_{\hat{V}}^{(2)}) - W_p(\mu_{\hat{V}}^{(1)}, \mu_{\hat{V}}^{(2)}) \\
&\le 2 \sup_{U' \in \mathcal{V}_k(\mathbb{R}^d)} \big|W_p(\mu_{U'}^{(1)}, \mu_{U'}^{(2)}) - W_p(\nu_{U'}^{(1)}, \nu_{U'}^{(2)})\big| \\
&\le 2\tilde{W}_{p,k}(\mu^{(1)}, \nu^{(1)}) + 2\tilde{W}_{p,k}(\mu^{(2)}, \nu^{(2)})\,.
\end{align*}
Therefore

$$\big(1 - \|\hat{V}^\top U\|_{\mathrm{op}}\big)\, W_p(\mu^{(1)}, \mu^{(2)}) \le 2\tilde{W}_{p,k}(\mu^{(1)}, \nu^{(1)}) + 2\tilde{W}_{p,k}(\mu^{(2)}, \nu^{(2)})\,.$$

Moreover,

\begin{align*}
\sin^2 \angle(\mathcal{V}, \mathcal{U}) &= 1 - \sup_{\substack{v \in \mathcal{V},\, u \in \mathcal{U} \\ \|u\| = \|v\| = 1}} (u^\top v)^2 = 1 - \sup_{\substack{x, y \in \mathbb{R}^k \\ \|x\| = \|y\| = 1}} \big((Ux)^\top (\hat{V}y)\big)^2 \\
&= 1 - \|\hat{V}^\top U\|_{\mathrm{op}}^2 \le 2\big(1 - \|\hat{V}^\top U\|_{\mathrm{op}}\big)\,,
\end{align*}
since $\|\hat{V}^\top U\|_{\mathrm{op}} \le 1$. The claim follows.
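The quantity $\sin^2 \angle(\mathcal{V}, \mathcal{U}) = 1 - \|\hat{V}^\top U\|_{\mathrm{op}}^2$ used above is straightforward to compute from orthonormal bases of the two subspaces, as the following sketch shows (random subspaces, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
d, k = 20, 3

# Orthonormal bases of two k-dimensional subspaces of R^d.
U, _ = np.linalg.qr(rng.normal(size=(d, k)))
V, _ = np.linalg.qr(rng.normal(size=(d, k)))

cos_angle = np.linalg.svd(V.T @ U, compute_uv=False).max()  # ||V^T U||_op
sin_sq = 1.0 - cos_angle**2
assert sin_sq <= 2 * (1.0 - cos_angle)  # the bound used in the proof
print(sin_sq)
```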

ACKNOWLEDGEMENTS
The authors thank Oded Regev for helpful suggestions.

REFERENCES [1] Ahidar-Coutrix, A., Gouic, T. L., and Paris, Q. On the rate of convergence of empirical barycentres in metric spaces: curvature, convexity and extendible geodesics. arXiv:1806.02740, 2018.

[2] Alaux, J., Grave, E., Cuturi, M., and Joulin, A. Unsupervised hyperalignment for multilingual word embeddings. ICLR, 2019.

[3] Alvarez-Melis, D., Jaakkola, T. S., and Jegelka, S. Structured optimal transport. AISTATS, 2018.

[4] Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. ICML, 2017.

[5] Bandeira, A. S., Perry, A., and Wein, A. S. Notes on computational-to-statistical gaps: predictions using statistical physics. arXiv:1803.11132, 2018.

[6] Berthet, Q. and Rigollet, P. Optimal detection of sparse principal components in high dimension. Ann. Statist., 41(4):1780–1815, 2013.

[7] Bigot, J., Cazelles, E., and Papadakis, N. Central limit theorems for Sinkhorn divergence between probability distributions on finite spaces and statistical applications. arXiv:1711.08947, 2017.

[8] Blum, A., Dwork, C., McSherry, F., and Nissim, K. Practical privacy: the SuLQ framework. PODS, 2005.

[9] Bobkov, S. and Ledoux, M. One-dimensional empirical measures, order statistics and Kantorovich transport distances. Preprint, 2014.

[10] Bobkov, S. G. and Götze, F. Exponential integrability and transportation cost related to logarithmic Sobolev inequalities. J. Funct. Anal., 163(1):1–28, 1999.

[11] Boissard, E. and Le Gouic, T. On the mean speed of convergence of empirical and occupation measures in Wasserstein distance. Ann. Inst. Henri Poincaré Probab. Stat., 50(2):539–563, 2014.

[12] Bolley, F., Guillin, A., and Villani, C. Quantitative concentration inequalities for empirical measures on non-compact spaces. Probab. Theory Related Fields, 137(3-4):541–593, 2007.

[13] Brennan, M., Bresler, G., and Huleihel, W. Reducibility and computational lower bounds for problems with planted sparse structure. COLT, 2018.

[14] Bubeck, S., Lee, Y. T., Price, E., and Razenshteyn, I. P. Adversarial examples from computational constraints. ICML, 2019.

[15] Cai, T., Ma, Z., and Wu, Y. Optimal estimation and rank detection for sparse spiked covariance matrices. Probab. Theory Related Fields, 161(3-4):781–815, 2015.

[16] Cai, T. T., Liang, T., and Rakhlin, A. Computational and statistical boundaries for submatrix localization in a large noisy matrix. Ann. Statist., 45(4):1403–1430, 2017.

[17] Canas, G. and Rosasco, L. Learning probability measures with respect to optimal transport metrics. NIPS, 2012.

[18] Cazelles, E., Seguy, V., Bigot, J., Cuturi, M., and Papadakis, N. Geodesic PCA versus log-PCA of histograms in the Wasserstein space. SIAM J. Scientific Computing, 40(2), 2018.

[19] Claici, S. and Solomon, J. Wasserstein coresets for Lipschitz costs. arXiv:1805.07412, 2018.

[20] Del Barrio, E., Gamboa, F., Gordaliza, P., and Loubes, J.-M. Obtaining fairness using optimal transport theory. ICML, 2019.

[21] del Barrio, E., Gordaliza, P., Lescornel, H., and Loubes, J.-M. Central limit theorem and bootstrap procedure for Wasserstein's variations with an application to structural relationships between distributions. Journal of Multivariate Analysis, 169:341–362, 2019.

[22] del Barrio, E., Inouzhe, H., Loubes, J.-M., Matrán, C., and Mayo-Íscar, A. optimalFlow: Optimal-transport approach to flow cytometry gating and population matching. arXiv:1907.08006, 2019.

[23] Dereich, S., Scheutzow, M., and Schottstedt, R. Constructive quantization: approximation by empirical measures. Ann. Inst. Henri Poincaré Probab. Stat., 49(4):1183–1203, 2013.

[24] Deshpande, I., Hu, Y.-T., Sun, R., Pyrros, A., Siddiqui, N., Koyejo, S., Zhao, Z., Forsyth, D., and Schwing, A. Max-sliced Wasserstein distance and its use for GANs. arXiv:1904.05877, 2019.

[25] Deutsch, F. The angle between subspaces of a Hilbert space. In Approximation theory, wavelets and applications (Maratea, 1994), volume 454 of NATO Adv. Sci. Inst. Ser. C Math. Phys. Sci., pages 107–130. Kluwer Acad. Publ., Dordrecht, 1995.

[26] Diakonikolas, I., Kane, D. M., and Stewart, A. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. FOCS, 2017.

[27] Dixmier, J. Étude sur les variétés et les opérateurs de Julia, avec quelques applications. Bull. Soc. Math. France, 77:11–101, 1949.

[28] Do Ba, K., Nguyen, H. L., Nguyen, H. N., and Rubinfeld, R. Sublinear time algorithms for earth mover’s distance. Theory Comput. Syst., 48(2):428–442, 2011.

[29] Dudley, R. M. The speed of mean Glivenko-Cantelli convergence. Ann. Math. Statist, 40:40–50, 1969.

[30] Feldman, V., Grigorescu, E., Reyzin, L., Vempala, S. S., and Xiao, Y. Statistical algorithms and a lower bound for detecting planted cliques. J. ACM, 64(2):Art. 8, 37, 2017.

[31] Feydy, J., Charlier, B., Vialard, F., and Peyré, G. Optimal transport for diffeomorphic registration. MICCAI, 2017.

[32] Flamary, R., Cuturi, M., Courty, N., and Rakotomamonjy, A. Wasserstein discriminant analysis. Machine Learning, 107(12):1923–1945, 2018.

[33] Forrow, A., Hütter, J.-C., Nitzan, M., Rigollet, P., Schiebinger, G., and Weed, J. Statistical optimal transport via factored couplings. AISTATS, 2019.

[34] Fournier, N. and Guillin, A. On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Related Fields, 162(3-4):707–738, 2015.

[35] Friedman, J. H. and Tukey, J. W. A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Computers, 23(9):881–890, 1974.

[36] Genevay, A., Peyré, G., and Cuturi, M. Learning generative models with Sinkhorn divergences. AISTATS, 2018.

[37] Gozlan, N. Characterization of Talagrand's like transportation-cost inequalities on the real line. J. Funct. Anal., 250(2):400–425, 2007.

[38] Gozlan, N. A characterization of dimension free concentration in terms of transportation inequalities. Ann. Probab., 37(6):2480–2498, 2009.

[39] Gozlan, N. and Léonard, C. A large deviation approach to some transportation cost inequalities. Probab. Theory Related Fields, 139(1-2):235–283, 2007.

[40] Gozlan, N. and Léonard, C. Transport inequalities. A survey. Markov Process. Related Fields, 16(4):635–736, 2010.

[41] Grave, E., Joulin, A., and Berthet, Q. Unsupervised alignment of embeddings with Wasserstein Procrustes. AISTATS, 2019.

[42] Hoeffding, W. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58:13–30, 1963.

[43] Hütter, J.-C. and Rigollet, P. Minimax rates of estimation for smooth optimal transport maps. arXiv:1905.05828, 2019.

[44] Janati, H., Cuturi, M., and Gramfort, A. Wasserstein regularization for sparse multi-task regression. AISTATS, 2019.

[45] Jiao, J., Han, Y., and Weissman, T. Minimax estimation of the L1 distance. IEEE Trans. Inform. Theory, 64(10):6672–6706, 2018.

[46] Johnson, W. B. and Lindenstrauss, J. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability (New Haven, Conn., 1982), volume 26 of Contemp. Math., pages 189–206. Amer. Math. Soc., Providence, RI, 1984.

[47] Johnstone, I. M. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, 29(2):295–327, 2001.

[48] Kearns, M. Efficient noise-tolerant learning from statistical queries. J. ACM, 45(6):983–1006, 1998.

[49] Klatt, M., Tameling, C., and Munk, A. Empirical regularized optimal transport: Statistical theory and applications. arXiv:1810.09880, 2018.

[50] Kolouri, S., Nadjahi, K., Simsekli, U., Badeau, R., and Rohde, G. K. Generalized sliced Wasserstein distances. arXiv:1902.00434, 2019.

[51] Kroshnin, A., Spokoiny, V., and Suvorikova, A. Statistical inference for Bures-Wasserstein barycenters. arXiv:1901.00226, 2019.

[52] Kruskal, J. B. Toward a practical method which helps uncover the structure of a set of multivariate observations by finding the linear transformation which optimizes a new “index of condensation”. In Statistical Computation, pages 427–440. Elsevier, 1969.

[53] Kruskal, J. B. Linear transformation of multivariate data to reveal clustering. Multidimensional scaling: theory and applications in the behavioral sciences, 1:181–191, 1972.

[54] Lavenant, H., Claici, S., Chien, E., and Solomon, J. Dynamical optimal transport on discrete surfaces. ACM Trans. Graph., 37(6):250:1–250:16, 2018.

[55] Le Gouic, T., Paris, Q., Rigollet, P., and Stromme, A. J. Fast convergence of empirical barycenters in Alexandrov spaces and the Wasserstein space. arXiv:1908.00828, 2019.

[56] Ledoux, M. The concentration of measure phenomenon, volume 89 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 2001.

[57] Lei, J. Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. arXiv:1804.10556, 2018.

[58] Liang, T. On the Minimax Optimality of Estimating the Wasserstein Metric. arXiv:1908.10324, 2019.

[59] Lim, S., Lee, S.-E., Chang, S., and Ye, J. C. CycleGAN with a Blur Kernel for Deconvolution Microscopy: Optimal Transport Geometry. arXiv:1908.09414, 2019.

[60] Ma, Z. and Wu, Y. Computational barriers in minimax submatrix detection. Ann. Statist., 43(3):1089–1116, 2015.

[61] Marton, K. Bounding d-distance by informational divergence: a method to prove measure concentration. Ann. Probab., 24(2):857–866, 1996.

[62] Marton, K. A measure concentration inequality for contracting Markov chains. Geom. Funct. Anal., 6(3):556–571, 1996.

[63] Maurey, B. Some deviation inequalities. Geom. Funct. Anal., 1(2):188–197, 1991.

[64] Montavon, G., Müller, K., and Cuturi, M. Wasserstein training of restricted Boltzmann machines. NIPS, pages 3711–3719, 2016.

[65] Panaretos, V. M. and Zemel, Y. Statistical aspects of Wasserstein distances. Annual Review of Statistics and Its Application, 6(1):405–431, 2019.

[66] Paty, F.-P. and Cuturi, M. Subspace robust Wasserstein distances. ICML, 2019.

[67] Peyré, G. and Cuturi, M. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

[68] Pitié, F., Kokaram, A. C., and Dahyot, R. Automated colour grading using colour distribution transfer. Computer Vision and Image Understanding, 107(1-2):123–137, 2007.

[69] Rabin, J., Peyré, G., Delon, J., and Bernot, M. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011.

[70] Ramdas, A., Trillos, N. G., and Cuturi, M. On Wasserstein two-sample testing and related families of nonparametric tests. Entropy, 19(2):47, 2017.

[71] Rigollet, P. and Weed, J. Entropic optimal transport is maximum-likelihood deconvolution. Comptes Rendus Mathematique, 356(11):1228–1235, 2018.

[72] Rigollet, P. and Weed, J. Uncoupled isotonic regression via minimum Wasserstein deconvolution. Information and Inference: A Journal of the IMA, 2019.

[73] Rolet, A., Cuturi, M., and Peyré, G. Fast dictionary learning with a smoothed Wasserstein loss. MIFODS Semester on Learning under complex structure, 2016.

[74] Schiebinger, G., Shu, J., Tabaka, M., Cleary, B., Subramanian, V., Solomon, A., Gould, J., Liu, S., Lin, S., Berube, P., Lee, L., Chen, J., Brumbaugh, J., Rigollet, P., Hochedlinger, K., Jaenisch, R., Regev, A., and Lander, E. S. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell, 176(4):928–943, 2019.

[75] Schmitz, M. A., Heitz, M., Bonneel, N., Mboula, F. M. N., Coeurjolly, D., Cuturi, M., Peyré, G., and Starck, J. Wasserstein dictionary learning: Optimal transport-based unsupervised nonlinear dictionary learning. SIAM J. Imaging Sciences, 11(1):643–678, 2018.

[76] Seguy, V. and Cuturi, M. Principal geodesic analysis for probability measures under the optimal transport metric. NIPS, 2015.

[77] Singh, S. and Póczos, B. Minimax distribution estimation in Wasserstein distance. arXiv:1802.08855, 2018.

[78] Solomon, J., de Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., and Guibas, L. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Trans. Graph., 34(4):66:1–66:11, July 2015.

[79] Solomon, J., Peyré, G., Kim, V. G., and Sra, S. Entropic metric alignment for correspondence problems. ACM Trans. Graph., 35(4):72:1–72:13, 2016.

[80] Staib, M., Claici, S., Solomon, J. M., and Jegelka, S. Parallel streaming Wasserstein barycenters. NIPS, 2017.

[81] Talagrand, M. Transportation cost for Gaussian and other product measures. Geom. Funct. Anal., 6(3):587–600, 1996.

[82] Tameling, C. and Munk, A. Computational strategies for statistical inference based on empirical optimal transport. In 2018 IEEE Data Science Workshop, DSW 2018, Lausanne, Switzerland, June 4-6, 2018, pages 175–179, 2018.

[83] Timan, A. F. Theory of approximation of functions of a real variable. Dover Publications, Inc., New York, 1994. Translated from the Russian by J. Berry, Translation edited and with a preface by J. Cossar, Reprint of the 1963 English translation.

[84] Tsybakov, A. B. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009. Revised and extended from the 2004 French original, Translated by Vladimir Zaiats.

[85] Valiant, G. and Valiant, P. A CLT and tight lower bounds for estimating entropy. Electronic Colloquium on Computational Complexity (ECCC), 17: 183, 2010.

[86] Valiant, G. and Valiant, P. The power of linear estimators. FOCS, 2011.

[87] Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint, 2010.

[88] Vershynin, R. High-dimensional probability, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2018. An introduction with applications in data science, With a foreword by Sara van de Geer.

[89] Villani, C. Optimal transport, volume 338 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 2009. Old and new.

[90] Weed, J. and Bach, F. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 2018. To appear.

[91] Weed, J. and Berthet, Q. Estimation of smooth densities in Wasserstein distance. COLT, 2019.

[92] Wu, Y. and Yang, P. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. Ann. Statist., 47(2):857–883, 2019.

[93] Yang, K. D., Damodaran, K., Venkatchalapathy, S., Soylemezoglu, A. C., Shivashankar, G., and Uhler, C. Autoencoder and optimal transport to infer single-cell trajectories of biological processes. bioRxiv, page 455469, 2018.

[94] Zemel, Y. and Panaretos, V. M. Fréchet means and Procrustes analysis in Wasserstein space. Bernoulli, 25(2):932–976, 2019.

Jonathan Niles-Weed
Courant Institute of Mathematical Sciences
New York University
251 Mercer Street
New York, NY 10012-1185, USA
([email protected])

Philippe Rigollet
Department of Mathematics
Massachusetts Institute of Technology
77 Massachusetts Avenue
Cambridge, MA 02139-4307, USA
([email protected])