
Wasserstein Barycenters: Statistics and Optimization

by Austin J. Stromme

B.S., Mathematics, University of Washington (2018)
B.S., Computer Science, University of Washington (2018)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of

Master of Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

May 2020

© Massachusetts Institute of Technology 2020. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 15th, 2020

Certified by: Philippe Rigollet, Associate Professor of Mathematics, Thesis Supervisor

Accepted by: Leslie A. Kolodziejski, Professor of Electrical Engineering and Computer Science, Chair, Department Committee for Graduate Students

Submitted to the Department of Electrical Engineering and Computer Science on May 15th, 2020, in partial fulfillment of the requirements for the degree of Master of Science

Abstract

We study a geometric notion of average, the barycenter, over 2-Wasserstein space. We significantly advance the state of the art by introducing extendible geodesics, a simple synthetic geometric condition which implies non-asymptotic convergence of the empirical barycenter in non-negatively curved spaces such as Wasserstein space. We further establish convergence of first-order methods in the Gaussian case, overcoming the non-convexity of the barycenter functional. These results are accomplished by various novel geometrically inspired estimates for the barycenter functional, including a variance inequality, new so-called quantitative stability estimates, and a Polyak-Łojasiewicz (PL) inequality. These inequalities may be of independent interest.

Thesis Supervisor: Philippe Rigollet Title: Associate Professor of Mathematics

Acknowledgments

Foremost thanks goes to my collaborators: Sinho Chewi, Thibaut Le Gouic, Quentin Paris, Tyler Maunu, and Philippe Rigollet. Without them none of this would have been possible. I am especially grateful to Philippe for his talented and well-rounded support as an advisor. Extra thanks as well to Sinho for always patiently teaching me mathematics. At this point in my PhD, I have learned far more from Sinho than from any class or book. Thanks again to Matt and Dheeraj for showing Enric and me how it's done. Thank you to my friends in the chemistry department and in Washington state. Finally, highest gratitude and love for Steve, Peggy, Emily, Kamran, Lang, Sarah, and Stephanie.

Contents

1 Introduction
  1.1 Background
  1.2 Main Results
    1.2.1 Empirical barycenters
    1.2.2 First-order methods for barycenters on Gaussians
  1.3 Guide to the thesis

2 Geometry of optimal transport
  2.1 Background on optimal transport
  2.2 A simple notion of curvature in geodesic metric spaces
  2.3 Curvature and convexity of the barycenter functional
  2.4 Wasserstein space is non-negatively curved
  2.5 Wasserstein space has infinite curvature
  2.6 An ounce of convexity
  2.7 Extendible geodesics and curvature bounds
  2.8 The tangent cone in $W_2(\mathbb{R}^d)$*
  2.9 New quantitative stability bounds*

3 Averaging probability distributions: statistics
  3.1 Introduction to the problem
  3.2 Parametric rates under bi-extendibility

4 Averaging probability distributions: optimization
  4.1 Introduction and main results
  4.2 Local to global phenomena for Wasserstein barycenters
  4.3 Sufficient conditions for first-order methods to converge on $W_2(\mathbb{R}^d)$
  4.4 An integrated Polyak-Łojasiewicz inequality
  4.5 Specializing to the Bures-Wasserstein case
    4.5.1 Bures-Wasserstein gradient descent algorithms
    4.5.2 Proof of the main results

A Additional results and omitted proofs
  A.1 Additional results and omitted proofs from Chapter 2
  A.2 Additional results and omitted proofs from Chapter 4
    A.2.1 Proof of Theorem 4.3.5
    A.2.2 Proof of Lemma 4.4.3
    A.2.3 Results for Bures-Wasserstein barycenters

B Experiments for Bures-Wasserstein barycenters
  B.0.1 Simulations for the Bures manifold
  B.0.2 Details of the non-convexity example

Chapter 1

Introduction

This thesis is concerned with averaging probability measures. It combines the theory of optimal transport, whose metric geometry is known as Wasserstein space, with geometric averages, termed barycenters, to provide a novel quantitative understanding of barycenters on Wasserstein space.

1.1 Background

Optimal transport is a geometrically meaningful way of measuring distances between probability distributions. Essentially, optimal transport distances measure the most efficient means of moving one distribution to another. Stated at the level of random variables, the optimal transport distance between two random variables $X, Y$ is the infimum, over all couplings $\pi$ of $X, Y$, of $\mathbb{E}_\pi[\|X - Y\|_2^2]$. Note the contrast between this distance measure and classical information divergences, such as the Kullback-Leibler divergence, which measure purely pointwise differences of densities. This distinction is often summed up by saying that optimal transport is "horizontal" whereas divergences are "vertical". We shall call the space of measures with the optimal transport metric Wasserstein space. Optimal transport was first studied in 1781 by Gaspard Monge, who wondered about the optimal way to fill holes with sand [42]. The theory developed in fits and starts until the 1980's, when Knott and Smith [34] and Brenier [16] independently discovered a simple and beautiful form for the optimal coupling. Since these major advances, the theory has deepened and expanded at an accelerating pace. Although this rapid theoretical progress inspired a few practical forays into optimal transport, it was Cuturi's 2013 introduction of an efficient approximation algorithm via entropic regularization that heralded the current wave of interest amongst applied scientists [23]. His efficient algorithm, combined with optimal transport's appealing geometric semantics for comparing probability distributions, led to an explosion of applications; see e.g. [49] and the references therein. The aspect of optimal transport at stake in the present thesis is a means of using the geometry to construct meaningful summaries, i.e., "averages," of collections of distributions. To incorporate the geometry of optimal transport into this average, we use the concept of barycenters.

The barycenter, also dubbed the Fréchet mean, center of gravity, or (mistakenly) the Karcher mean, is a natural generalization of averages to curved spaces that has been studied since at least the early 20th century [58, 31]. Ignoring issues of existence and uniqueness, the barycenter $b^*$ of a distribution $P$ on a metric space $(M, d_M)$ is defined as

$$b^* := \operatorname*{arg\,min}_{b \in M} F(b) := \frac{1}{2}\,\mathbb{E}_{p \sim P}\big[d_M^2(b, p)\big].$$

As can be easily verified, this coincides with the usual Euclidean average in the case where $(M, d_M) = (\mathbb{R}^d, \|\cdot\|_2)$.
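For concreteness, here is the one-line verification of the Euclidean case (a standard computation, recorded here as an aside): writing
$$F(b) = \frac12\,\mathbb{E}_{p\sim P}\|b - p\|_2^2 = \frac12\|b\|_2^2 - \langle b, \mathbb{E}[p]\rangle + \frac12\,\mathbb{E}\|p\|_2^2,$$
we get $\nabla F(b) = b - \mathbb{E}[p]$, which vanishes precisely at $b^* = \mathbb{E}[p]$.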

Wasserstein barycenters are thus barycenters in the case of probability measures on $\mathbb{R}^d$ metrized by the optimal transport distance. Wasserstein barycenters offer practitioners a highly non-linear yet meaningful average for applications where the pointwise average of measures is inappropriate. Whenever data can be embedded as a probability distribution over $\mathbb{R}^d$, Wasserstein barycenters may offer a significant advantage over more traditional techniques of summarization [24]. They have thus been applied in a broad variety of areas, including graphics, neuroimaging, Bayesian statistics, dimensionality reduction, and economics [51, 50, 55, 27, 15, 56, 21].

The theoretical study of Wasserstein barycenters dates back to the 1990's [45, 35, 41]. Several special cases were considered, the most important being the case where $P$ is supported on two points and the corresponding theory of Wasserstein geodesics by McCann in [41]. Although there was significant theoretical understanding of barycenters in some general contexts around that time [58], the existing theory did not apply in a significant way due to the positive curvature of Wasserstein space [7, 44]. Because of this, insight into the Wasserstein space case was not provided until the work of Agueh and Carlier in 2011 [1], where the authors used the structure of Wasserstein space to develop a primal-dual proof establishing existence, uniqueness, and optimality conditions under general assumptions. One beautiful consequence of the work of Agueh and Carlier is a surprising connection between the barycenter problem and the original optimal transport problem via the concept of multi-marginal transport. This connection, combined with the fundamental relationship between barycenters and geometry, serves to theoretically motivate the study of Wasserstein barycenters.

1.2 Main Results

Theoretical work subsequent to the Agueh and Carlier paper has focused on both statistical and computational aspects of Wasserstein barycenters. In this section, we present the two primary results contained in this thesis. The first result is on the convergence of empirical Wasserstein barycenters to their population counterparts, from [26]. The second result is on the convergence of popular first-order methods for computing Wasserstein barycenters, from [22].

The failure of standard techniques for both of these problems essentially stems from the (sometimes infinitely) positive curvature of Wasserstein space. As such, the proofs rely on building a powerful quantitative understanding of the positive curvature of Wasserstein space and carefully applying it to the problems of interest.

1.2.1 Empirical barycenters

Empirical averages and their convergence to population averages are at the heart of statistics. We thus study the corresponding problem on Wasserstein space. Consider a distribution $P$ on Wasserstein space, and let $\hat b_n$ be the barycenter of the empirical distribution $P_n := \frac{1}{n}\sum_{i=1}^n \delta_{p_i}$ for independent samples $p_i \sim P$. We call $\hat b_n$ the empirical barycenter. It has been shown that $\hat b_n$ is a consistent estimator of $b^*$ [38]. It is then natural to ask about the finite sample variance of this estimator, which we can define by $\mathbb{E}[W_2^2(b^*, \hat b_n)]$, where the expectation is over all samples of size $n$. Note that in the Euclidean case, the variance of the empirical mean is always exactly

$$\mathbb{E}\big[\|b^* - \hat b_n\|_2^2\big] = \frac{\sigma^2}{n},$$

where $\sigma^2 := \mathbb{E}[\|x - b^*\|_2^2]$ is the variance of the distribution. We are thus motivated to understand when a parametric rate¹ holds for the empirical barycenter over Wasserstein space. Some special cases have been considered [47, 12, 14, 2]. In [36], parametric rates are obtained for the Gaussian case. Prior to the following theorem, the best rates in a general context were of the form $O(n^{-1/d})$ and were established using techniques from empirical process theory [3]. Before we can state our first main result, we review the discovery of Brenier and of Agueh and Carlier: consider the 2-Wasserstein problem between two measures $\mu, \nu$. If $\mu$ is absolutely continuous w.r.t. the Lebesgue measure, then the infimum over couplings in the

$W_2$ problem is uniquely attained by a deterministic mapping, which we will often write as $T_{\mu\to\nu}$. Moreover, this mapping is the gradient of a convex function; we write this as

$T_{\mu\to\nu} = \nabla\phi_{\mu\to\nu}$. The first main result in this thesis is the following:

Theorem 1.2.1 (Main Theorem 1). Suppose $P$ is a distribution supported on absolutely continuous measures over $\mathbb{R}^d$ with finite second moment. Suppose further that $P$ has a barycenter $b^*$ such that for each $\mu \in \operatorname{supp}(P)$, the optimal potential $\phi_{b^*\to\mu}$ is $\alpha$-strongly convex and $\beta$-smooth, where $\beta - \alpha < 1$. Then $b^*$ is the unique barycenter of $P$, and moreover

$$\mathbb{E}\big[W_2^2(\hat b_n, b^*)\big] \le \frac{4\sigma^2}{k^2 n},$$

where $k := 1 - (\beta - \alpha) > 0$ and $\sigma^2 := \mathbb{E}[W_2^2(p, b^*)]$ is the variance of $P$.

A couple of remarks are now in order. First, the condition on the optimal potential is geometrically natural. In fact, it is merely the statement that the $W_2$-geodesic between $b^*$ and $\mu$ can be extended past $b^*$ and past $\mu$ by an amount depending on $\beta$ and $\alpha$, respectively. The second remark is that this result is actually a special case of a much more general theorem proved by the author and collaborators in [26].

1.2.2 First-order methods for barycenters on Gaussians

The above theorem provides finite-sample convergence rates for empirical Wasserstein barycenters in a broad variety of situations. However, given that there is no explicit form for the Wasserstein barycenter, the question becomes how to compute one. A common approach considered in a number of applications is to use a first-order optimization method such as gradient descent or stochastic gradient descent to approximate the barycenter. This can be made formally rigorous (see [7]), but for our purposes we will simply define the $W_2$-gradient of the barycenter functional to be, for all $b$ absolutely continuous and with finite second moment,

¹Precisely: a rate of the form $c/n$ for $c$ independent of dimension.

$$\nabla_{W_2} F(b) := -\mathbb{E}_{p\sim P}\big[T_{b\to p} - \mathrm{id}\big],$$

where $T_{b\to p}$ is the optimal map from $b$ to $p$ and $\mathrm{id}\colon \mathbb{R}^d \to \mathbb{R}^d$ is the identity. Panaretos and Zemel were able to show that under mild assumptions on the distribution $P$, a simple gradient descent algorithm converges to the population barycenter asymptotically [46]. The case where $P$ is supported on Gaussians was also studied in [6, 9, 62]. Fast convergence of first-order methods was observed in each of these three works, and proving such convergence was left open in [6]. The following theorem, from our work [22], resolves this open problem.

Theorem 1.2.2 (Main Theorem 2). Let $P$ be a distribution supported on mean-zero Gaussians whose covariance matrices have eigenvalues uniformly bounded between $\lambda_{\min}$ and $\lambda_{\max}$. Let $\kappa := \lambda_{\max}/\lambda_{\min}$. Then $P$ has a unique barycenter $b^*$. Moreover, $W_2$ gradient descent for the barycenter functional initialized at $b_0 \in \operatorname{supp}(P)$ obeys

$$W_2^2(b_T, b^*) \le 2\kappa \left(1 - \frac{1}{4\kappa^2}\right)^{T} \big(F(b_0) - F(b^*)\big).$$

Performing stochastic gradient descent for the barycenter functional initialized at $b_0 \in \operatorname{supp}(P)$, we obtain the bound
$$\mathbb{E}\big[W_2^2(b_n, b^*)\big] \le \frac{96\sigma^2\kappa^5}{n}.$$

We remark that this theorem is an example of a non-convex optimization problem where convex methods provably work. Specifically, the positive curvature of Wasserstein space means the barycenter functional is not geodesically convex; in fact, over Gaussians, it can actually be concave! The proof overcomes this non-convexity by establishing a Polyak-Łojasiewicz inequality using novel theory for the Wasserstein barycenter functional. In fact, a significant portion of the technical work is valid at the level of general first-order optimization for Wasserstein barycenters.

1.3 Guide to the thesis

Chapter 2 begins by summarizing facts from optimal transport and then dives into a significant exploration of curvature in $W_2(\mathbb{R}^d)$ and its connection with the barycenter functional. We believe that this chapter is the most independently interesting, so we attempted to develop a cohesive discussion of the relevant ideas. Sections marked with an asterisk are not used in the remainder of the work, though we feel they are of independent interest. Chapter 3 takes the machinery developed in the previous chapter and applies it to provide a quick solution to the empirical barycenters problem. Chapter 4 studies first-order methods for the barycenter problem. The most generally useful and novel inequalities are in Theorem 2.3.7, Theorem 2.7.11 (see also Theorem 2.6.1 and Theorem 2.9.2 for specializations), and Lemma 4.4.1. The appendices contain some empirical work as well as omitted proofs.

Chapter 2

Geometry of optimal transport

In this chapter we develop some of the most general and conceptually important results of the thesis. We start by summarizing background on optimal transport, discussing some simple notions of curvature in metric spaces, and then explaining their close relationship with the barycenter functional. We then study the curvature of Wasserstein space. We conclude with a series of related inequalities that, roughly, allow us to control the positive curvature under the assumption of extendible geodesics.

2.1 Background on optimal transport

In this section, we define frequently used notation and state the most important theorems about optimal transport. We refer the reader to the books [60, 61, 7, 52]. We let $\mathcal{P}_2(\mathbb{R}^d)$ be the set of probability measures on $\mathbb{R}^d$ with finite second moment, namely $\mu$ such that $\mathbb{E}_{x\sim\mu}[\|x\|_2^2] < \infty$. We write the collection of probability measures that are absolutely continuous w.r.t. the Lebesgue measure and have finite second moment as $\mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$. We shall often write the identity map $\mathrm{id}\colon \mathbb{R}^d \to \mathbb{R}^d$. Given two measures $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^d)$, the set $\Pi(\mu, \nu)$ is the set of all couplings of $\mu$ and $\nu$, i.e. the set of measures $\pi$ on $\mathbb{R}^d \times \mathbb{R}^d$ with marginals $\mu$ and $\nu$. The 2-Wasserstein distance between $\mu$ and $\nu$ is defined as

$$W_2^2(\mu, \nu) := \inf_{\pi \in \Pi(\mu,\nu)} \int \|x - y\|_2^2 \,\mathrm{d}\pi(x, y).$$
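As a concrete illustration (ours, not part of the thesis), the one-dimensional case of this definition is computable in closed form: for two empirical measures with $n$ equal-mass atoms on $\mathbb{R}$, the optimal coupling is the monotone (sorted) matching, so $W_2$ reduces to a sort. A minimal NumPy sketch:

```python
import numpy as np

def w2_empirical_1d(x, y):
    """W_2 between two n-point empirical measures on the real line.

    In one dimension the optimal coupling matches sorted samples,
    so W_2^2 = (1/n) * sum_i (x_(i) - y_(i))^2.
    """
    xs, ys = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((xs - ys) ** 2))

rng = np.random.default_rng(0)
mu = rng.normal(0.0, 1.0, 100_000)
nu = rng.normal(3.0, 2.0, 100_000)
# For 1-D Gaussians, W_2^2 = (m0 - m1)^2 + (s0 - s1)^2 = 9 + 1 = 10 here.
print(w2_empirical_1d(mu, nu))  # ~ sqrt(10) = 3.1622...
```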

The infimum can be shown to be attained by compactness of $\Pi(\mu, \nu)$ w.r.t. the weak topology. A minimizer is referred to as an optimal transport plan. Moreover, $W_2(\mu, \nu)$ thus defined is a metric and, when restricted to probability measures over a compact set, the $W_2$ distance metrizes weak convergence of measures. In the non-compact case, $W_2$ metrizes weak convergence together with convergence of second moments [60, Thm 7.12]. The 2-Wasserstein problem admits a dual formulation, called the dual Kantorovich problem, given by

$$\sup_{(f,g) \in S_{\mu,\nu}} \left( \int f \,\mathrm{d}\mu + \int g \,\mathrm{d}\nu \right),$$
where
$$S_{\mu,\nu} := \big\{(f, g) \in L^1(\mu) \times L^1(\nu) : f(x) + g(y) \le \|x - y\|_2^2\big\}.$$

Given a map $T\colon \mathbb{R}^d \to \mathbb{R}^d$, we let $T_\#\mu$ be the push-forward of $\mu$ under $T$, namely the law of $T(x)$ when $x \sim \mu$. We shall call a convex function $\phi$ proper if $\phi(x) < +\infty$ for some $x$ and $\phi(x) > -\infty$ for all $x$. The domain $\operatorname{dom}(\phi)$ of a convex function shall be the set of points at which it is finite. The convex conjugate of a convex function $\phi$ is denoted $\phi^*$ and defined as $\phi^*(y) := \sup_{x\in\mathbb{R}^d} \langle x, y\rangle - \phi(x)$. We shall often use that a proper lower semicontinuous function $\phi$ is convex if and only if $\phi = \phi^{**}$, as well as that $\nabla\phi^*$ and $\nabla\phi$, when uniquely defined, are inverses of one another. We can now state the fundamental theorem of optimal transport.

Theorem 2.1.1. Suppose $\mu \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$ and $\nu \in \mathcal{P}_2(\mathbb{R}^d)$. Then the following are equivalent:

1. 휋 ∈ Π(휇, 휈) is an optimal transport plan

2. 휋 = (id, ∇휙)#휇 for a proper convex function 휙

3. Strong duality holds between the 푊2 problem and the dual Kantorovich problem:

$$\int \|x - y\|_2^2 \,\mathrm{d}\pi(x, y) = \sup_{(f,g)\in S_{\mu,\nu}} \left( \int f \,\mathrm{d}\mu + \int g \,\mathrm{d}\nu \right).$$

Moreover, the supremum is attained for $f(x) = \|x\|_2^2 - 2\phi(x)$ and $g(y) = \|y\|_2^2 - 2\phi^*(y)$.

Finally, there is a unique ∇휙 such that the above holds, in the sense that if ∇휓 is also optimal then ∇휙(푥) = ∇휓(푥) 휇-a.e.

The potentials $\phi, \phi^*$ are referred to as Kantorovich potentials for the pair $(\mu, \nu)$. We shall often write the optimal transport map between $\mu \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$ and $\nu \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$ as $T_{\mu\to\nu}$, and an optimal Kantorovich potential as $\phi_{\mu\to\nu}$, so that $T_{\mu\to\nu} = \nabla\phi_{\mu\to\nu}$.
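One family where all of these objects are closed-form is the Gaussian family, which will be central in Chapter 4. Between mean-zero Gaussians $N(0, \Sigma_0)$ and $N(0, \Sigma_1)$ the optimal map is the linear map $T = \Sigma_0^{-1/2}(\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2})^{1/2}\Sigma_0^{-1/2}$, the gradient of the convex quadratic $\phi(x) = \frac12\langle x, Tx\rangle$. A small numerical check (ours, not from the thesis):

```python
import numpy as np
from scipy.linalg import sqrtm

def brenier_map_gaussian(S0, S1):
    """Optimal (Brenier) map x -> T @ x from N(0, S0) to N(0, S1).

    T = S0^{-1/2} (S0^{1/2} S1 S0^{1/2})^{1/2} S0^{-1/2}; it is symmetric
    positive definite, hence the gradient of a convex quadratic potential.
    """
    r = np.real(sqrtm(S0))
    r_inv = np.linalg.inv(r)
    return r_inv @ np.real(sqrtm(r @ S1 @ r)) @ r_inv

S0 = np.array([[2.0, 0.5], [0.5, 1.0]])
S1 = np.array([[1.0, -0.3], [-0.3, 3.0]])
T = brenier_map_gaussian(S0, S1)
# Sanity check: T pushes N(0, S0) forward to N(0, S1), i.e. T S0 T' = S1.
print(np.allclose(T @ S0 @ T.T, S1))  # True
```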

We shall always implicitly assume metric spaces (푋, 푑푋 ) are complete and separable.

A constant speed geodesic in (푋, 푑푋 ) is a curve 훾 :[푎, 푏] → 푋 such that for all 푡, 푠 ∈ [푎, 푏],

$$d_X(\gamma(t), \gamma(s)) = \frac{|t - s|}{b - a}\, d_X(\gamma(a), \gamma(b)).$$

We then say that (푋, 푑푋 ) is geodesically complete if each pair of points can be connected by a constant speed geodesic.

Theorem 2.1.2. $W_2(\mathbb{R}^d)$ is a geodesically complete metric space. For any $\mu, \nu \in W_2$, let $\pi_t(x, y) := (1-t)x + ty$ for $t \in [0, 1]$. Let $\pi^* \in \Pi(\mu, \nu)$ be an optimal transport plan from $\mu$ to $\nu$. Then the path $\omega(t) = (\pi_t)_\#\pi^*$ is a constant-speed geodesic in $W_2$ connecting $\omega(0) = \mu$ to $\omega(1) = \nu$. Moreover, all constant-speed geodesics are of this form. Hence, if $\mu$ is absolutely continuous with respect to Lebesgue measure, there is in fact a unique geodesic joining $\mu$ to $\nu$.
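The following sketch (ours, for illustration only) checks the constant-speed property of Theorem 2.1.2 numerically in one dimension, where the optimal plan is the sorted matching and displacement interpolation simply interpolates matched atoms:

```python
import numpy as np

def mccann_interpolation_1d(x, y, t):
    """Atoms of the W_2 geodesic mu_t between two n-point empirical
    measures on R. In 1-D the optimal plan is the sorted matching,
    so mu_t has atoms (1-t) x_(i) + t y_(i)."""
    return (1 - t) * np.sort(x) + t * np.sort(y)

def w2_1d(x, y):
    xs, ys = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((xs - ys) ** 2))

rng = np.random.default_rng(1)
x, y = rng.normal(0, 1, 400), rng.exponential(2.0, 400)
s, t = 0.25, 0.8
ms = mccann_interpolation_1d(x, y, s)
mt = mccann_interpolation_1d(x, y, t)
# Constant speed: W_2(mu_s, mu_t) = |t - s| * W_2(mu_0, mu_1).
print(np.isclose(w2_1d(ms, mt), abs(t - s) * w2_1d(x, y)))  # True
```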

Remark 2.1.3. We remark that geodesics between non-absolutely continuous measures need not be unique. Consider as an example the measures $\mu_0 := \frac12\delta_{(0,1)} + \frac12\delta_{(0,-1)}$ and $\mu_1 := \frac12\delta_{(1,0)} + \frac12\delta_{(-1,0)}$. Then both
$$\mu_t := \frac12\delta_{(t,\,1-t)} + \frac12\delta_{(-t,\,-(1-t))}$$
and
$$\tilde\mu_t := \frac12\delta_{(-t,\,1-t)} + \frac12\delta_{(t,\,-(1-t))}$$
are distance-minimizing geodesics between $\mu_0$ and $\mu_1$.

Lastly, we shall need to define strong convexity and smoothness: for $\alpha > 0$ and $\beta > 0$, we will say that a convex function $\phi$ is $\alpha$-strongly convex if for all $x, y \in \operatorname{dom}(\phi)$ and $t \in [0, 1]$,

$$\phi((1-t)x + ty) \le (1-t)\phi(x) + t\phi(y) - \frac{\alpha}{2}\,t(1-t)\|x - y\|_2^2.$$

It is 훽-smooth if the reverse inequality holds with 훼 replaced by 훽.
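It may help to keep in mind the second-order reformulation (a standard fact, recorded here as an aside, not from the thesis): if $\phi$ is twice differentiable, then $\phi$ is $\alpha$-strongly convex and $\beta$-smooth if and only if
$$\alpha I \preceq \nabla^2\phi(x) \preceq \beta I \qquad \text{for all } x \in \operatorname{dom}(\phi).$$
In particular, for a quadratic $\phi(x) = \frac12\langle x, Ax\rangle$ with $A$ symmetric positive definite (these generate the linear optimal maps between mean-zero Gaussians), one can take $\alpha = \lambda_{\min}(A)$ and $\beta = \lambda_{\max}(A)$.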

2.2 A simple notion of curvature in geodesic metric spaces

It is not within scope for this thesis to give an introduction to curvature in Riemannian and metric geometry. Suffice it to say that curvature is a word with many meanings, all generally related to controlling the deviation of an object from its flat Euclidean counterpart. As Gromov says in [28]: "The curvature tensor of a Riemannian manifold is a little monster of (multi)linear algebra whose full geometric meaning remains obscure." Luckily, there are intuitive simplifications of the full curvature tensor which generalize well beyond Riemannian manifolds. In this section we shall define one such notion that will be fundamental for the rest of the thesis.

Definition 2.2.1. We say a geodesically complete metric space $(X, d_X)$ is non-positively curved (NPC) if for every $y, x_0, x_1 \in X$, and every constant speed geodesic $\gamma\colon [0,1] \to (X, d_X)$ joining $x_0$ to $x_1$,
$$d_X^2(\gamma(t), y) \le (1-t)\,d_X^2(x_0, y) + t\,d_X^2(x_1, y) - t(1-t)\,d_X^2(x_0, x_1). \tag{2.2.1.1}$$
If the reverse inequality holds for all triples $y, x_0, x_1$ and constant speed geodesics $\gamma\colon [0,1] \to (X, d_X)$, namely
$$d_X^2(\gamma(t), y) \ge (1-t)\,d_X^2(x_0, y) + t\,d_X^2(x_1, y) - t(1-t)\,d_X^2(x_0, x_1), \tag{2.2.1.2}$$
then we say that $(X, d_X)$ is non-negatively curved (NNC).

Remark 2.2.2. We remark that the parallelogram identity in $\mathbb{R}^d$ is the equality case of the above inequalities, and so we can say that $\mathbb{R}^d$ has zero curvature, or is flat. The standard examples of NPC and NNC spaces are the constant curvature surfaces: hyperbolic manifolds and spheres, respectively. In fact, the sectional curvature of these surfaces, and indeed of any Riemannian manifold, is reflected at least locally by a constant factor times the third term [43]. For our purposes these simple definitions are sufficient.
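As a quick sanity check (ours, not part of the thesis), inequality (2.2.1.2) can be verified numerically on the unit sphere $S^2$, the basic NNC example just mentioned, using great-circle distances and spherical linear interpolation for geodesics:

```python
import numpy as np

rng = np.random.default_rng(2)

def rand_sphere():
    v = rng.normal(size=3)
    return v / np.linalg.norm(v)

def dist(a, b):  # great-circle distance on the unit sphere
    return np.arccos(np.clip(a @ b, -1.0, 1.0))

def geodesic(a, b, t):  # slerp: constant-speed geodesic from a to b
    th = dist(a, b)
    g = (np.sin((1 - t) * th) * a + np.sin(t * th) * b) / np.sin(th)
    return g / np.linalg.norm(g)

x0, x1, y = rand_sphere(), rand_sphere(), rand_sphere()
for t in np.linspace(0.05, 0.95, 19):
    lhs = dist(geodesic(x0, x1, t), y) ** 2
    rhs = ((1 - t) * dist(x0, y) ** 2 + t * dist(x1, y) ** 2
           - t * (1 - t) * dist(x0, x1) ** 2)
    assert lhs >= rhs - 1e-9  # the NNC inequality (2.2.1.2)
```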

Remark 2.2.3. The theory of curvature in metric spaces goes back to work of Alexandrov in 1951 [4]. There is a huge literature on the subject, see e.g. [58, 57, 19] as well as the books [18, 17, 5]. In the case where $(X, d_X)$ is a Riemannian manifold, these definitions coincide with complete Riemannian manifolds of non-positive sectional curvature and complete Riemannian manifolds of non-negative sectional curvature, respectively. They arise not merely as an idle generalization of Riemannian manifolds, but as limits of sequences of Riemannian manifolds cropping up in geometric flows. As evidenced in this thesis, they also provide a useful way of thinking about the structure of infinite-dimensional, non-Riemannian spaces such as $W_2(\mathbb{R}^d)$.

2.3 Curvature and convexity of the barycenter functional

In this section we explore the close relationship between curvature of the metric space and convexity properties of barycenter functionals.

Definition 2.3.1 (Barycenters). Given a distribution $P$ on $(X, d_X)$, we define the barycenter functional $F_P\colon (X, d_X) \to \mathbb{R}$ as
$$F_P(b) := \frac12\,\mathbb{E}_{x\sim P}\big[d_X^2(x, b)\big].$$
If $F_P(b)$ is finite for any $b \in X$, then we say that $P$ has finite second moment. The collection of all distributions supported on $X$ with finite second moment is denoted $\mathcal{P}_2(X)$. The variance of $P$ is defined as
$$\sigma^2(P) := 2\inf_{b} F_P(b).$$

Lastly, a barycenter of 푃 is a minimizer of 퐹푃 .

Remark 2.3.2. We shall typically suppress the distribution $P$ and write simply $F = F_P$ and $\sigma^2 = \sigma^2(P)$. Note as well that by the triangle inequality, $F_P(b)$ is finite for any fixed $b \in X$ if and only if it is finite for all $b \in X$.

Definition 2.3.3. Given a geodesically complete metric space $(X, d_X)$, we say that a function $f\colon (X, d_X) \to \mathbb{R}$ is $\alpha$-geodesically convex for $\alpha > 0$ if along all constant speed geodesics $\gamma\colon [0,1] \to (X, d_X)$ we have the analog of the Euclidean inequality:
$$f(\gamma(t)) \le (1-t)f(\gamma(0)) + tf(\gamma(1)) - \frac{\alpha}{2}\,t(1-t)\,d_X^2(\gamma(0), \gamma(1)).$$
We say it is $\beta$-geodesically smooth for $\beta > 0$ if along all constant speed geodesics $\gamma\colon [0,1] \to (X, d_X)$ we have the opposite inequality:
$$f(\gamma(t)) \ge (1-t)f(\gamma(0)) + tf(\gamma(1)) - \frac{\beta}{2}\,t(1-t)\,d_X^2(\gamma(0), \gamma(1)).$$

Using these definitions we can derive the following simple but important observation:

Proposition 2.3.4. (푋, 푑푋 ) is non-positively curved (resp. non-negatively curved) if and only if for all distributions 푃 ∈ 풫2(푋) the barycenter functional 퐹푃 is 1-geodesically convex (resp. smooth).

Proof. For the forward direction apply our definition of curvature pointwise in the expectation, and for the reverse direction take the distribution $P = \delta_x$ for each $x \in X$.

We thus see that the convexity of the barycenter functional is a direct reflection of the underlying curvature of $(X, d_X)$. In this thesis the central issue shall be that we want the barycenter functional to be convex, but it is not, owing to the positive curvature of $W_2(\mathbb{R}^d)$ (see Section 2.4 below). To circumvent this issue we marshal various weak and quantitative forms of convexity that allow us to achieve our ends. One important weak form of convexity that shall play a major role is termed a quadratic growth condition. This is a standard notion in optimization which says that if a function $f$ has a minimizer $x^*$, then $f(x) - f(x^*) \ge c\|x - x^*\|_2^2$ for some constant $c > 0$. In particular, it is a consequence of strong convexity. In the context of barycenters, this is known as a variance inequality.

Definition 2.3.5 (Variance Inequality). Suppose $(X, d_X)$ is a geodesically complete metric space and $P \in \mathcal{P}_2(X)$. Then we say that $F_P$ satisfies a variance inequality with constant $C_{\mathrm{var}} > 0$ if there exists a point $b^* \in X$ such that for all $b \in X$,
$$\frac{C_{\mathrm{var}}}{2}\, d_X^2(b, b^*) \le F(b) - F(b^*).$$

Let’s look at this inequality when (푋, 푑푋 ) is NPC. In that case, for any 푃 ∈ 풫2(푋) with a barycenter 푏*, fix a point 푏 ∈ 푋. Let 훾 be the constant speed geodesic joining 푏* to 푏. Then by Prop. 2.3.4, for each 푡 ∈ [0, 1], we have

$$F(\gamma(t)) \le (1-t)F(b^*) + tF(b) - \frac12\,t(1-t)\,d_X^2(b, b^*).$$

Observe that by definition $F(b^*) \le F(\gamma(t))$ for all $t \in [0, 1]$. Whence

$$0 \le F(\gamma(t)) - F(b^*) \le t\big(F(b) - F(b^*)\big) - \frac12\,t(1-t)\,d_X^2(b, b^*).$$

Dividing by 푡 and re-arranging yields

$$\frac12(1-t)\,d_X^2(b, b^*) \le F(b) - F(b^*).$$

Since $t$ is arbitrary, we conclude that in NPC spaces, barycenter functionals (with minimizers) always satisfy a variance inequality with $C_{\mathrm{var}} = 1$. In fact, more is true.

Theorem 2.3.6 ([58]). Suppose $(X, d_X)$ is a geodesically complete metric space. Then $(X, d_X)$ is non-positively curved if and only if $F_P$ obeys a variance inequality with $C_{\mathrm{var}} = 1$ for every $P \in \mathcal{P}_2(X)$.

We omit the proof for brevity. As this theorem evidences, even relaxations of convexity for the barycenter functional are equivalent to non-positive curvature. Variance inequalities for barycenter functionals are fundamental to the two central problems of this thesis. For the empirical barycenters problem considered in Chapter 3, it has been shown that a variance inequality leads to dimension-dependent rates [3]. Moreover, our approach to the statistical barycenters problem crucially uses a variance inequality. And a general variance inequality shall be central to our approach to analyzing first-order methods for barycenters in Chapter 4. We conclude this section with the first significant result of this thesis, which provides general conditions under which distributions on Wasserstein space satisfy a variance inequality.

Theorem 2.3.7 (Variance inequality in $W_2(\mathbb{R}^d)$ [22]). Let $P \in \mathcal{P}_2(\mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d))$ be a distribution with barycenter $b^* \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$. Assume there exists a mapping $\phi\colon \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d) \times \mathbb{R}^d \to \mathbb{R}$ such that

(1) $\phi$ is measurable;

(2) for $P$-a.e. $\mu$, $\phi_\mu$ is an optimal Kantorovich potential for $b^*$ to $\mu$;

(3) for almost all $x \in \mathbb{R}^d$,
$$\mathbb{E}_{\mu\sim P}[\phi_\mu(x)] = \frac12\|x\|_2^2;$$

(4) for $P$-a.e. $\mu \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$, the mapping $\phi_\mu$ is $\alpha(\mu)$-strongly convex for some measurable function $\alpha\colon \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d) \to \mathbb{R}_+$.

Then $P$ satisfies a variance inequality for all $b \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$ with constant
$$C_{\mathrm{var}} = \int \alpha(\mu)\,\mathrm{d}P(\mu).$$

Remark 2.3.8. Assumption (3) is roughly a first-order optimality statement for $b^*$, and so should be thought of as essentially following from the assumption that $b^*$ is the barycenter of $P$. We shall have more to say about assumption (3) in Chapter 4. Assumption (4) is related to extendible geodesics; see Section 2.7. Finally, we remark that a similar result appears in the unpublished note [37].

Proof. Fix $\mu \in \operatorname{supp}(P)$ and $b \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$. By Lemma A.1.1, for arbitrary $x \in \mathbb{R}^d$ and almost every $y \in \mathbb{R}^d$ we have
$$\phi_\mu(x) + \phi_\mu^*(y) \ge \langle x, y\rangle + \frac{\alpha(\mu)}{2}\,\|x - \nabla\phi_\mu^*(y)\|_2^2.$$

Rearranging, this is
$$\frac12\|x\|_2^2 - \phi_\mu(x) + \frac12\|y\|_2^2 - \phi_\mu^*(y) \le \frac12\|x - y\|_2^2 - \frac{\alpha(\mu)}{2}\,\|x - \nabla\phi_\mu^*(y)\|_2^2.$$

Hence we can integrate over the optimal coupling of 휇 to 푏 to yield

$$\begin{aligned}
\frac12 W_2^2(b, \mu) &\ge \int \Big(\frac12\|x\|_2^2 - \phi_\mu(x)\Big)\,\mathrm{d}b(x) + \int \Big(\frac12\|y\|_2^2 - \phi_\mu^*(y)\Big)\,\mathrm{d}\mu(y) + \frac{\alpha(\mu)}{2}\,\|T_{\mu\to b} - T_{\mu\to b^*}\|^2_{L^2(\mu)}\\
&\ge \int \Big(\frac12\|x\|_2^2 - \phi_\mu(x)\Big)\,\mathrm{d}b(x) + \int \Big(\frac12\|y\|_2^2 - \phi_\mu^*(y)\Big)\,\mathrm{d}\mu(y) + \frac{\alpha(\mu)}{2}\, W_2^2(b, b^*),
\end{aligned}$$
where in the last inequality we used the definition of the $W_2$ distance. We now integrate over $\mu \sim P$ and argue that the first integral is 0 irrespective of $b$. To do this, we first observe that

$$\begin{aligned}
\iint \Big|\frac12\|x\|_2^2 - \phi_\mu(x)\Big|\,\mathrm{d}P(\mu)\,\mathrm{d}b(x) &\le \frac12\sigma^2(b) + \iint |\phi_\mu(x)|\,\mathrm{d}P(\mu)\,\mathrm{d}b(x)\\
&= \frac12\sigma^2(b) + \iint \phi_\mu(x)\,\mathrm{d}P(\mu)\,\mathrm{d}b(x) = \sigma^2(b) < \infty,
\end{aligned}$$
where we applied the triangle inequality, added and subtracted a linear function lower bounding $\phi_\mu(x)$ (the linear function is integrable by the assumption that $P \in \mathcal{P}_2(\mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d))$), and used assumption (3), respectively. It follows that $\frac12\|x\|_2^2 - \phi_\mu(x) \in L^1(b \otimes P)$, so that we can swap integrals and apply assumption (3) again to conclude

$$\iint \Big(\frac12\|x\|_2^2 - \phi_\mu(x)\Big)\,\mathrm{d}b(x)\,\mathrm{d}P(\mu) = \iint \Big(\frac12\|x\|_2^2 - \phi_\mu(x)\Big)\,\mathrm{d}P(\mu)\,\mathrm{d}b(x) = 0.$$

In fact, the same reasoning applied to $b = b^*$ shows
$$\iint \Big(\frac12\|x\|_2^2 - \phi_\mu(x)\Big)\,\mathrm{d}b(x)\,\mathrm{d}P(\mu) = 0 = \iint \Big(\frac12\|x\|_2^2 - \phi_\mu(x)\Big)\,\mathrm{d}b^*(x)\,\mathrm{d}P(\mu).$$

Hence
$$\begin{aligned}
F(b) &\ge \iint \Big(\frac12\|x\|_2^2 - \phi_\mu(x)\Big)\,\mathrm{d}b(x)\,\mathrm{d}P(\mu) + \iint \Big(\frac12\|y\|_2^2 - \phi_\mu^*(y)\Big)\,\mathrm{d}\mu(y)\,\mathrm{d}P(\mu) + \frac12\int \alpha(\mu)\,\mathrm{d}P(\mu)\, W_2^2(b, b^*)\\
&= \iint \Big(\frac12\|x\|_2^2 - \phi_\mu(x)\Big)\,\mathrm{d}b^*(x)\,\mathrm{d}P(\mu) + \iint \Big(\frac12\|y\|_2^2 - \phi_\mu^*(y)\Big)\,\mathrm{d}\mu(y)\,\mathrm{d}P(\mu) + \frac12\int \alpha(\mu)\,\mathrm{d}P(\mu)\, W_2^2(b, b^*)\\
&= F(b^*) + \frac12\int \alpha(\mu)\,\mathrm{d}P(\mu)\, W_2^2(b, b^*).
\end{aligned}$$

This proves the result.

2.4 Wasserstein space is non-negatively curved

Theorem 2.4.1. The space $W_2(\mathbb{R}^d)$ is non-negatively curved.

Proof. We know that it is geodesically complete by Theorem 2.1.2. Fix $\mu, \nu_0, \nu_1 \in \mathcal{P}_2(\mathbb{R}^d)$. Choose a $W_2$ geodesic $\nu_t$ from $\nu_0$ to $\nu_1$. Then $\nu_t = ((1-t)\pi_1 + t\pi_2)_\#\gamma$, where $\gamma$ is an optimal coupling of $\nu_0$ to $\nu_1$ and the $\pi_i$ are the projection operators. Fix $t \in [0, 1]$, let $y_t := (1-t)y_0 + ty_1$, and let $\alpha_t$ be an optimal coupling from $\nu_t$ to $\mu$. We describe a coupling $(x, y_0, y_1)$ where $x \sim \mu$, $y_0 \sim \nu_0$, and $y_1 \sim \nu_1$. We take $y_0 \sim \nu_0$, $y_1$ from $\gamma$ conditional on $y_0$, and $x$ from $\alpha_t$ conditional on $y_t$. Then we can calculate:

$$\begin{aligned}
W_2^2(\mu, \nu_t) &= \mathbb{E}[\|x - y_t\|_2^2]\\
&= (1-t)\,\mathbb{E}[\|x - y_0\|_2^2] + t\,\mathbb{E}[\|x - y_1\|_2^2] - t(1-t)\,\mathbb{E}[\|y_0 - y_1\|_2^2]\\
&= (1-t)\,\mathbb{E}[\|x - y_0\|_2^2] + t\,\mathbb{E}[\|x - y_1\|_2^2] - t(1-t)\,W_2^2(\nu_0, \nu_1).
\end{aligned}$$

We observe that the joint couplings of $(x, y_0)$ and $(x, y_1)$ have marginals $\mu, \nu_0$ and $\mu, \nu_1$ respectively, but may not in fact be optimal. Hence we get the lower bound

$$W_2^2(\mu, \nu_t) \ge (1-t)\,W_2^2(\mu, \nu_0) + t\,W_2^2(\mu, \nu_1) - t(1-t)\,W_2^2(\nu_0, \nu_1),$$

which proves the result.

We remark that this theorem can be significantly extended, though we omit the proof for brevity.

Theorem 2.4.2 (Thm. A.2 [39]). A smooth compact connected Riemannian manifold $M$ has non-negative curvature if and only if $W_2(M)$ has non-negative curvature.

2.5 Wasserstein space has infinite curvature

In this section, we describe an example which shows, in a certain sense, that "Wasserstein space has infinite curvature at every scale." The idea of the construction is that we can embed the square counterexample of Remark 2.1.3 at an arbitrarily small scale within a given measure. We thereby show that any $W_2$ ball around any measure contains elements with non-unique midpoints, which means there cannot be a uniform upper curvature bound on any open subset of Wasserstein space.

Fix a measure $\mu \in \mathcal{P}_2(\mathbb{R}^d)$ and $\varepsilon > 0$. More specifically, we shall show that there exist measures $\tilde\mu_0$ and $\tilde\mu_1$ such that $W_2(\mu, \tilde\mu_0), W_2(\mu, \tilde\mu_1) < \varepsilon$ and $\tilde\mu_0$ and $\tilde\mu_1$ have two midpoints contained in the $W_2$ $\varepsilon$-ball around $\mu$.

Begin by letting 휇˜ be the (normalized) restriction of 휇 to the complement of a ball 퐵.

We claim that it is possible to choose the ball 퐵 such that 푊2(휇, 휇˜) < 휀/2. If 휇 has a continuous density this ball can be chosen within the support, and otherwise it can be chosen at infinity. Suppose the ball 퐵 = 퐵(푚, 푅). For a small parameter 휏 > 0, let

$$\nu_0 := \frac12\delta_{m+\tau e_1} + \frac12\delta_{m-\tau e_1}, \qquad \nu_1 := \frac12\delta_{m+\tau e_2} + \frac12\delta_{m-\tau e_2}.$$

Let the union of these four points be $S$. Finally, for a small parameter $\eta > 0$, set

$$\tilde\mu_0 := (1-\eta)\tilde\mu + \eta\nu_0, \qquad \tilde\mu_1 := (1-\eta)\tilde\mu + \eta\nu_1.$$

We claim that by taking $\tau$ sufficiently small, the optimal coupling $\pi_{01}^*$ between $\tilde\mu_0$ and $\tilde\mu_1$ can be decomposed as
$$\pi_{01}^* = (1-\eta)(\mathrm{id}, \mathrm{id})_\#\tilde\mu + \eta\, q_{01},$$
where $q_{01}$ is an optimal coupling between $\nu_0$ and $\nu_1$. To see this, consider any coupling $\pi \in \Pi(\tilde\mu_0, \tilde\mu_1)$. Let $p_{cc} = \mathbb{P}_\pi[(x_0, x_1) \in B^c \times B^c]$, $p_{cs} = \mathbb{P}_\pi[(x_0, x_1) \in B^c \times S]$, and similarly for $p_{sc}$ and $p_{ss}$. Then

$$\mathbb{E}_\pi[\|x_0 - x_1\|_2^2] \ge (R-\tau)^2\, p_{cs} + \tau^2 p_{ss} \ge \tau^2(p_{cs} + p_{ss}) = \eta\tau^2 = \mathbb{E}_{\pi_{01}^*}[\|x_0 - x_1\|_2^2],

where the first inequality follows by ignoring the contributions of $p_{cc}$ and $p_{sc}$, the second by choosing $\tau < R/2$, the third equality by the fact that $\pi \in \Pi(\tilde\mu_0, \tilde\mu_1)$, and the last equality by our choice of $\pi_{01}^*$. We observe that if $p_{cs} > 0$ then, by our choice of $\tau < R/2$, the second inequality is in fact strict. Hence it follows that the optimal couplings must indeed be of the form $\pi_{01}^*$.

But then, since there are two optimal couplings of $\nu_0$ to $\nu_1$, there are two optimal couplings of $\tilde\mu_0$ and $\tilde\mu_1$. Hence they don't have a unique midpoint. Lastly, we verify that $\eta > 0$ can be chosen small enough that both $W_2$-geodesics between $\tilde\mu_0$ and $\tilde\mu_1$ are always within

$\varepsilon$ of $\mu$ in $W_2$ distance. Fix one of the optimal couplings from $\nu_0$ to $\nu_1$ as $q_{01}$, and let $\nu_t$ be the point along the corresponding geodesic from $\nu_0$ to $\nu_1$. Let

$$\tilde\mu_t := (1-\eta)\tilde\mu + \eta\nu_t.$$

Observe that

$$\big((1-\eta)(\mathrm{id}, \mathrm{id})_\#\tilde\mu + \eta\,\tilde\mu \otimes \nu_t\big) \in \Pi(\tilde\mu, \tilde\mu_t)$$

and using this coupling we have the upper bound

$$W_2(\tilde\mu, \tilde\mu_t) \le \eta\,\big(2\,\mathbb{E}_{x\sim\tilde\mu}[\|x\|_2^2] + 2\|m\|_2^2 + \tau^2\big)^{1/2}.$$

Take $\eta$ small enough that this is smaller than $\varepsilon/2$. With these choices we get

$$W_2(\mu, \tilde\mu_t) \le W_2(\mu, \tilde\mu) + W_2(\tilde\mu, \tilde\mu_t) < \varepsilon/2 + \varepsilon/2 = \varepsilon.$$

We summarize this construction in the next proposition.

Proposition 2.5.1. Suppose $\mu \in \mathcal{P}_2(\mathbb{R}^d)$ and $\varepsilon > 0$. Then there exist distinct distance-minimizing $W_2$-geodesics $\gamma_0, \gamma_1\colon [0,1] \to B_{W_2}(\mu, \varepsilon)$ such that $\gamma_0(0) = \gamma_1(0)$ and $\gamma_0(1) = \gamma_1(1)$.

Combining this with the fact that geodesic spaces with bounded curvature (the reader can regard "bounded curvature" as strong geodesic convexity of the squared distance for the moment) have locally unique geodesics, we get the following theorem.

Theorem 2.5.2. No open subset of $W_2(\mathbb{R}^d)$ has bounded curvature.

2.6 An ounce of convexity

Not discouraged by the previous section, we will look for some weak form of geodesic convexity of the squared 2-Wasserstein distance.

Specifically, fix measures $\mu_0, \mu_1, \mu_2 \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$, and consider the $W_2$-geodesic $\mu_t\colon [0,1] \to W_{2,\mathrm{ac}}(\mathbb{R}^d)$ from $\mu_0$ to $\mu_1$. Consider the following approach to establishing a form of convexity for $W_2^2(\mu_2, \mu_t)$ by applying the definition of Wasserstein space and the flatness of Hilbert spaces:
$$\begin{aligned}
W_2^2(\mu_2, \mu_t) &\le \|T_{\mu_0\to\mu_t} - T_{\mu_0\to\mu_2}\|^2_{L^2(\mu_0)}\\
&= (1-t)\,W_2^2(\mu_2, \mu_0) + t\,\|T_{\mu_0\to\mu_1} - T_{\mu_0\to\mu_2}\|^2_{L^2(\mu_0)} - t(1-t)\,W_2^2(\mu_0, \mu_1).
\end{aligned}$$

Compare to the $K$-strong convexity inequality for $W_2^2(\mu_2, \mu_t)$:
$$W_2^2(\mu_2, \mu_t) \le (1-t)\,W_2^2(\mu_2, \mu_0) + t\,W_2^2(\mu_2, \mu_1) - \frac{K}{2}\,t(1-t)\,W_2^2(\mu_0, \mu_1). \tag{2.6.0.1}$$

Hence we see that if we could show an inequality of the form
$$\|T_{\mu_0\to\mu_1} - T_{\mu_0\to\mu_2}\|^2_{L^2(\mu_0)} \le W_2^2(\mu_2, \mu_1) + (1 - K/2)(1-t)\,W_2^2(\mu_0, \mu_1),$$
we'd be able to get the $K$-strong convexity inequality (2.6.0.1). We remark that when each

$\mu_i$ is simply a translate of some base measure $\mu$, we in fact expect this inequality to hold with $K = 1$, since in that case the 2-Wasserstein geometry is totally flat. Hence, this inequality can actually hold. On the other hand, by the construction in Section 2.5, we know that it cannot hold uniformly over open subsets of $W_2(\mathbb{R}^d)$. In this section, we show that under a simple regularity condition on the Kantorovich potentials generating the optimal coupling, we can get inequalities of this form. The crucial estimate is as follows.

Theorem 2.6.1. Suppose $\mu \in W_{2,\mathrm{ac}}(\mathbb{R}^d)$, $\nu_0, \nu_1 \in W_2(\mathbb{R}^d)$, and the optimal Kantorovich potential $\phi_{\mu\to\nu_0}$ is $\alpha$-strongly convex and $\beta$-smooth. Then
$$\|T_{\mu\to\nu_0} - T_{\mu\to\nu_1}\|^2_{L^2(\mu)} \le W_2^2(\nu_0, \nu_1) + (\beta - \alpha)\,W_2^2(\mu, \nu_1).$$

Before we give the proof, we note that combining the preceding discussion, this estimate, and a tedious calculation, we obtain the following corollary.

Corollary 2.6.2. Fix $\mu \in W_{2,\mathrm{ac}}(\mathbb{R}^d)$ and $\nu_0, \nu_1 \in W_2(\mathbb{R}^d)$. Suppose that the optimal Kantorovich potentials $\phi_{\mu\to\nu_i}$ are $\alpha_i$-strongly convex and $\beta_i$-smooth, $i = 0, 1$. Let $\kappa_i := 1/\alpha_i - 1/\beta_i$ and suppose $\kappa_i < 1$, $i = 0, 1$. Then the $K$-strong convexity inequality (2.6.0.1) holds with $K = 2(\kappa_0 + \kappa_1)$.

We remark as well that the most useful applications of this estimate will generally be when only one of the optimal potentials is known to be strongly convex and smooth, which corresponds to a 퐾-strong convexity inequality but only for small 푡.

Proof of Theorem 2.6.1. We begin by writing

$$\|T_{\mu\to\nu_0} - T_{\mu\to\nu_1}\|^2_{L^2(\mu)} = W_2^2(\nu_0, \mu) + W_2^2(\mu, \nu_1) - 2\,\mathbb{E}_\mu\big[\langle T_{\mu\to\nu_0}(x) - x,\ T_{\mu\to\nu_1}(x) - x\rangle\big].$$

We will focus on bounding this last term. Write $T_{\mu\to\nu_0} = \nabla f_0$. We can integrate the $\beta$-smoothness condition to obtain
$$\begin{aligned}
\mathbb{E}_\mu[\langle T_{\mu\to\nu_0}(x), T_{\mu\to\nu_1}(x) - x\rangle] &\ge \mathbb{E}_\mu[f_0(T_{\mu\to\nu_1}(x)) - f_0(x)] - \frac{\beta}{2}\,\mathbb{E}_\mu[\|T_{\mu\to\nu_1}(x) - x\|_2^2]\\
&= \mathbb{E}_{\nu_1}[f_0] - \mathbb{E}_\mu[f_0] - \frac{\beta}{2}\,W_2^2(\mu, \nu_1).
\end{aligned}$$

Let $z$ be such that $(T_{\mu\to\nu_0}(x), z)$ is an optimal coupling of $\nu_0$ to $\nu_1$ when $x \sim \mu$. In this case, we can apply the strong convexity assumption to continue the lower bound:
$$\begin{aligned}
\mathbb{E}_\mu[\langle T_{\mu\to\nu_0}(x), T_{\mu\to\nu_1}(x) - x\rangle] &\ge \mathbb{E}[\langle \nabla f_0(x), z - x\rangle] + \frac{\alpha}{2}\,\mathbb{E}[\|z - x\|_2^2] - \frac{\beta}{2}\,W_2^2(\mu, \nu_1)\\
&\ge \mathbb{E}[\langle \nabla f_0(x), z - x\rangle] - \frac{\beta - \alpha}{2}\,W_2^2(\mu, \nu_1),
\end{aligned}$$
where in the last line we used that the induced joint law of $x$ and $z$ is a valid coupling of $\mu$ to

$\nu_1$. We also calculate
$$W_2^2(\nu_0, \mu) = \mathbb{E}_{x\sim\mu}[\|x\|_2^2] + \mathbb{E}_{x\sim\nu_0}[\|x\|_2^2] - 2\,\mathbb{E}_\mu[\langle T_{\mu\to\nu_0}(x), x\rangle],$$
and similarly

$$W_2^2(\nu_1, \mu) = \mathbb{E}_{x\sim\mu}[\|x\|_2^2] + \mathbb{E}_{x\sim\nu_1}[\|x\|_2^2] - 2\,\mathbb{E}_\mu[\langle T_{\mu\to\nu_1}(x), x\rangle].$$

Putting this together we find
$$\begin{aligned}
\|T_{\mu\to\nu_0} - T_{\mu\to\nu_1}\|^2_{L^2(\mu)} &\le \mathbb{E}[\|\nabla f_0(x) - z\|_2^2] + (\beta - \alpha)\,W_2^2(\mu, \nu_1)\\
&= W_2^2(\nu_0, \nu_1) + (\beta - \alpha)\,W_2^2(\mu, \nu_1),
\end{aligned}$$
where the last equality is by our assumption on $z$. Hence the result.

2.7 Extendible geodesics and curvature bounds

In this section, we shall quite extensively generalize Theorem 2.6.1. The starting point for this work is the observation that the hypotheses in Theorem 2.6.1 are related to simple geometric conditions on the Wasserstein space.

Definition 2.7.1. A constant-speed geodesic $\gamma\colon [0,1] \to (X, d_X)$ is said to be $(\lambda_{\mathrm{in}}, \lambda_{\mathrm{out}})$-extendible for $\lambda_{\mathrm{in}}, \lambda_{\mathrm{out}} > 0$ if there exists a constant-speed geodesic $\gamma^+\colon [-\lambda_{\mathrm{in}}, 1 + \lambda_{\mathrm{out}}] \to (X, d_X)$ such that $\gamma$ is the restriction of $\gamma^+$ to $[0, 1]$.

Theorem 2.7.2 (Thm. 3.5 [3]). Let $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^d)$ and let $\gamma\colon [0,1] \to W_2(\mathbb{R}^d)$ be a geodesic connecting $\mu$ to $\nu$. Then $\gamma$ is $(0, \lambda)$-extendible if, and only if, the support of the optimal transport plan of $\mu$ to $\nu$ lies in the subdifferential $\partial\phi_{\mu\to\nu}$ of a $\lambda/(1+\lambda)$-strongly convex map $\phi_{\mu\to\nu}$.

Hence, we can re-interpret theorem 2.6.1 as follows.

Corollary 2.7.3. Fix $\mu \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$ and $\nu_0, \nu_1 \in \mathcal{P}_2(\mathbb{R}^d)$. Suppose the $W_2$-geodesic connecting $\mu$ to $\nu_0$ is $(\lambda_{\mathrm{in}}, \lambda_{\mathrm{out}})$-extendible. Then
$$\|T_{\mu\to\nu_0} - T_{\mu\to\nu_1}\|^2_{L^2(\mu)} \le W_2^2(\nu_0, \nu_1) + \left(1 + \frac{1}{\lambda_{\mathrm{in}}} - \frac{\lambda_{\mathrm{out}}}{1+\lambda_{\mathrm{out}}}\right) W_2^2(\mu, \nu_1). \tag{2.7.3.1}$$

In this section, we shall show that an analogous statement in fact holds over all non-negatively curved metric spaces. The power and utility of this result will be made clear subsequently. To complete our work, we will need a certain amount of technical machinery from the theory of Alexandrov spaces with (lower) curvature bounds. Specifically, we will need to find the proper generalization of the left-hand side of (2.7.3.1). We begin with this, and refer to [18] for a comprehensive treatment.

Definition 2.7.4. Suppose $(X, d_X)$ is a non-negatively curved metric space. For $p, x, y \in X$, the comparison angle $\angle_p^0(x, y) \in [0, \pi]$ at $p$ is defined by
$$\cos\angle_p^0(x, y) := \frac{d_X^2(p, x) + d_X^2(p, y) - d_X^2(x, y)}{2\,d_X(p, x)\,d_X(p, y)}.$$

Applying the triangle inequality shows that this is well-defined. Now, fix $p \in X$, and let $\Gamma_p$ be the set of all geodesics $\gamma\colon [0,1] \to (X, d_X)$ with $\gamma(0) = p$. For $\gamma, \sigma \in \Gamma_p$ the Alexandrov angle $\angle_p(\gamma, \sigma)$ is defined as
$$\angle_p(\gamma, \sigma) := \lim_{s,t\to 0} \angle_p^0(\gamma(s), \sigma(t)).$$

It can be shown that when $(X, d_X)$ is non-negatively curved, this is well defined and is in fact a (pseudo-)metric on $\Gamma_p$. Quotient by this equivalence relationship and take the completion of the result to yield the space of directions $(\Sigma_p, \angle_p)$. Then the tangent cone $T_pX$ of $(X, d_X)$ at $p$ is the set $\Sigma_p \times \mathbb{R}_{\ge 0}$ modulo the equivalence $(\gamma, s) \approx (\sigma, t)$ when $s = t = 0$. We denote the unique element $(\gamma, 0)$ by $o_p$, and call it the tip of the cone. For $u, v \in T_pX$ with $u = (\gamma, s)$ and $v = (\sigma, t)$ the metric is
$$\|u - v\|_p^2 := s^2 + t^2 - 2st\cos\angle_p(\gamma, \sigma).$$

We use the notation $\|u\|_p := \|u - o_p\|_p$ and $\langle u, v\rangle_p := st\cos\angle_p(\gamma, \sigma)$, which means
$$\|u - v\|_p^2 = \|u\|_p^2 + \|v\|_p^2 - 2\langle u, v\rangle_p.$$

Let $C_p \subset X$ be the cut-locus of $p$, and for all $x \in X \setminus C_p$, let $\gamma_{p\to x} \in \Sigma_p$ denote the direction of the unique geodesic connecting $p$ to $x$. Then the log map at $p$ is the map $\log_p\colon X \setminus C_p \to T_pX$ which sends
$$x \mapsto \log_p(x) := \big(\gamma_{p\to x},\, d_X(p, x)\big).$$

We will extend this to 푥 ∈ 퐶푝 by selecting an arbitrary direction from 푝 to 푥.

Using these definitions, we now collect the relevant facts we will need about non-negatively curved spaces, tangent cones, and barycenters. We'll omit proofs as they would take us quite far afield; full details are given in our work [26].

Theorem 2.7.5. Suppose $(X, d_X)$ is non-negatively curved and $P \in \mathcal{P}_2(X)$ with barycenter $b^*$.

1. For any $p, x, y \in X$,
$$d_X^2(x, y) \le \|\log_p(x) - \log_p(y)\|_p^2. \tag{2.7.5.1}$$

2. We have
$$\iint \langle \log_{b^*}(x), \log_{b^*}(y)\rangle_{b^*}\,\mathrm{d}P(x)\,\mathrm{d}P(y) = 0. \tag{2.7.5.2}$$

3. There exists a subset $\mathcal{L}_{b^*}X \subset T_{b^*}X$ which is a Hilbert space when equipped with the restricted metric and such that $\log_{b^*}(\operatorname{supp}(P)) \subset \mathcal{L}_{b^*}X$.

4. For any $Q \in \mathcal{P}_2(X)$ with $\log_{b^*}(\operatorname{supp}(Q)) \subset \mathcal{L}_{b^*}X$ and $b \in X$, we have
$$\int \langle \log_{b^*}(x), \log_{b^*}(b)\rangle_{b^*}\,\mathrm{d}Q(x) = \Big\langle \int \log_{b^*}(x)\,\mathrm{d}Q(x),\ \log_{b^*}(b)\Big\rangle_{b^*}. \tag{2.7.5.3}$$

The following property is sometimes called being an exponential barycenter in the literature [3].

Corollary 2.7.6. Suppose $(X, d_X)$ is non-negatively curved and $P \in \mathcal{P}_2(X)$ with barycenter $b^*$. Then
$$\int \log_{b^*}(x)\,\mathrm{d}P(x) = 0.$$

Definition 2.7.7. Suppose $(X, d_X)$ is non-negatively curved, $P \in \mathcal{P}_2(X)$, and $b^*$ is a barycenter of $P$. Then for all $b, x \in X$ we define the hugging function at $b^*$ as
$$k_{b^*}^{b}(x) := 1 - \frac{\|\log_{b^*}(x) - \log_{b^*}(b)\|_{b^*}^2 - d_X^2(x, b)}{d_X^2(b, b^*)}.$$

Remark 2.7.8. In Section 2.8, we shall show that in the case $(X, d_X) = (\mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d), W_2)$, the tangent cone metric is the "$L^2(b^*)$ norm on transport maps." This will mean that
$$\|T_{\mu\to\nu_0} - T_{\mu\to\nu_1}\|^2_{L^2(\mu)} = W_2^2(\nu_0, \nu_1) + \big(1 - k_\mu^{\nu_1}(\nu_0)\big)\,W_2^2(\mu, \nu_1).$$

Hence, contingent upon Section 2.8, we are indeed studying a generalization of the previous estimates.

We start with the following fact.

Proposition 2.7.9. Suppose $(X, d_X)$ is non-negatively curved, $P \in \mathcal{P}_2(X)$, and $b^*$ is a barycenter of $P$. Then, for all $b \in X$,
$$\frac12\, d_X^2(b, b^*) \int k_{b^*}^{b}(x)\,\mathrm{d}P(x) = F(b) - F(b^*).$$

Proof. Expanding and using the definition of the cone metric, we calculate:
$$\begin{aligned}
d_X^2(b, b^*)\,k_{b^*}^{b}(x) &= d_X^2(x, b) + d_X^2(b, b^*) - \|\log_{b^*}(x) - \log_{b^*}(b)\|_{b^*}^2\\
&= d_X^2(x, b) - d_X^2(x, b^*) + 2\langle \log_{b^*}(x), \log_{b^*}(b)\rangle_{b^*}.
\end{aligned}$$

Taking the expectation over 푥 ∼ 푃 , the second term is 0 by Corollary 2.7.6 and (2.7.5.3). This yields the result.

Theorem 2.7.10 (Theorem 3.3 [3]). Suppose $(X, d_X)$ is non-negatively curved. Let $P \in \mathcal{P}_2(X)$ with barycenter $b^*$. Suppose that for each $x \in \operatorname{supp}(P)$ there exists a geodesic $\gamma_x\colon [0,1] \to X$ connecting $b^*$ to $x$ which is $(0, \lambda)$-extendible. Suppose in addition that $b^*$ remains a barycenter of the distribution $P_\lambda = (e_\lambda)_\#P$, where $e_\lambda(x) = \gamma_x^+(1+\lambda)$. Then $F$ satisfies a variance inequality with $C_{\mathrm{var}} = \lambda/(1+\lambda)$.

We give a simpler proof.

Proof. Fix any $b \in X$. Write the NNC inequality:
$$d_X^2(b, x) \ge \frac{\lambda}{1+\lambda}\, d_X^2(b, b^*) + \frac{1}{1+\lambda}\, d_X^2(b, e_\lambda(x)) - \frac{\lambda}{(1+\lambda)^2}\, d_X^2(b^*, e_\lambda(x)).$$

Re-arrange to yield
$$\frac{\lambda}{1+\lambda}\, d_X^2(b^*, b) \le d_X^2(b, x) - \frac{1}{1+\lambda}\, d_X^2(b, e_\lambda(x)) + \frac{\lambda}{(1+\lambda)^2}\, d_X^2(b^*, e_\lambda(x)).$$

We calculate that
$$\frac{\lambda}{(1+\lambda)^2}\, d_X^2(b^*, e_\lambda(x)) = \frac{1}{1+\lambda}\, d_X^2(b^*, e_\lambda(x)) - d_X^2(b^*, x).$$

Hence
$$\frac{\lambda}{1+\lambda}\, d_X^2(b^*, b) \le d_X^2(b, x) - d_X^2(b^*, x) + \frac{1}{1+\lambda}\big(d_X^2(b^*, e_\lambda(x)) - d_X^2(b, e_\lambda(x))\big).$$

We assumed that 푏* is a barycenter of the extended distribution, so by taking expectation over 푃 the second term is non-positive and we get the result.

According to Proposition 2.7.9, a variance inequality with $C_{\mathrm{var}} = \lambda/(1+\lambda)$ is equivalent to the statement that, for all $b \in X$,
$$\int k_{b^*}^{b}(x)\,\mathrm{d}P(x) \ge \frac{\lambda}{1+\lambda}.$$

We’ll use this observation next.

Theorem 2.7.11. Suppose that $(X, d_X)$ is non-negatively curved and let $x, b, b^* \in X$. Assume that for some $\lambda_{\mathrm{in}}, \lambda_{\mathrm{out}} > 0$, there is a geodesic connecting $b^*$ to $x$ which is $(\lambda_{\mathrm{in}}, \lambda_{\mathrm{out}})$-extendible. Then
$$k_{b^*}^{b}(x) \ge \frac{\lambda_{\mathrm{out}}}{1+\lambda_{\mathrm{out}}} - \frac{1}{\lambda_{\mathrm{in}}}.$$

32 Remark 2.7.12. We note that we can combine this bound with the analogous argument at the beginning of Section 2.6 (also using (2.7.5.1)) to get similar strong convexity statements for the squared distance function as in Corollary 2.7.3.

Proof. Let $\gamma\colon [0,1] \to X$ be a $(\lambda_{\mathrm{in}}, \lambda_{\mathrm{out}})$-extendible geodesic connecting $b^*$ to $x$ and let $\gamma^+\colon [-\lambda_{\mathrm{in}}, 1+\lambda_{\mathrm{out}}] \to X$ be its extension. Let $z = \gamma^+(-\xi)$, where $\xi = \lambda_{\mathrm{in}}/(1+\lambda_{\mathrm{out}})$. Then it may be easily checked that $b^*$ is a barycenter of the probability measure
$$P := \frac{\xi}{1+\xi}\,\delta_x + \frac{1}{1+\xi}\,\delta_z.$$

Now, we wish to apply Theorem 2.7.10 to $P$. To this aim, note that the geodesic $\gamma$ from $b^*$ to $x$ is $(0, \lambda_{\mathrm{out}})$-extendible by assumption, with $e_{\lambda_{\mathrm{out}}}(x) = \gamma^+(1+\lambda_{\mathrm{out}})$. Similarly, we check that the geodesic $\sigma\colon [0,1] \to X$ connecting $b^*$ to $z$ defined by $\sigma(t) = \gamma^+(-t\xi)$ is $(0, \lambda_{\mathrm{out}})$-extendible by construction, with $e_{\lambda_{\mathrm{out}}}(z) = \gamma^+(-\lambda_{\mathrm{in}})$. Finally, one checks that $b^*$ remains a barycenter of the probability measure $P_{\lambda_{\mathrm{out}}} = (e_{\lambda_{\mathrm{out}}})_\#P$. As a result, Theorem 2.7.10 implies that

$$\frac{\lambda_{\mathrm{out}}}{1+\lambda_{\mathrm{out}}} \le \mathbb{E}_{x\sim P}\big[k_{b^*}^{b}(x)\big] = \frac{\xi}{1+\xi}\,k_{b^*}^{b}(x) + \frac{1}{1+\xi}\,k_{b^*}^{b}(z).$$

Finally, the fact that $(X, d_X)$ is non-negatively curved implies that $d_X(x, y) \le \|\log_{b^*}(x) - \log_{b^*}(y)\|_{b^*}$ for all $x, y \in X$. Thus $k_{b^*}^{b}(z) \le 1$ for all $b, z \in X$. Hence we obtain

$$k_{b^*}^{b}(x) \ge \frac{1+\xi}{\xi}\left(\frac{\lambda_{\mathrm{out}}}{1+\lambda_{\mathrm{out}}} - \frac{1}{1+\xi}\right) = \frac{\lambda_{\mathrm{out}}}{1+\lambda_{\mathrm{out}}} - \frac{1}{\lambda_{\mathrm{in}}},$$
which completes the proof.

2.8 The tangent cone in $W_2(\mathbb{R}^d)$*

In this section we give a characterization of the tangent cone of Definition 2.7.4 in the case $(X, d_X) = (\mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d), W_2)$. This allows us to prove that the result of Theorem 2.7.11 is a total generalization of Theorem 2.6.1. We remark that our proof uses Theorem 2.7.11 to provide a geometrically natural analysis of the tangent cone.

Start by fixing a measure $\mu \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$. As stated in Definition 2.7.4, since $W_2(\mathbb{R}^d)$ is non-negatively curved, the Alexandrov angles always exist. Fix two elements of the tangent cone, $u, v \in T_\mu W_2(\mathbb{R}^d)$, and assume $u = [\gamma_0, s_0]$ and $v = [\gamma_1, s_1]$ with representatives such that $W_2(\gamma_i(1), \mu) = s_i$, $i = 0, 1$. Since the angle $\angle_\mu(\gamma_0, \gamma_1)$ exists, we can take the limit with $t = s$. Applying continuity of $\cos$ and re-arranging, we find that in this case
$$\|u - v\|_\mu^2 = \lim_{t\to 0} \frac{W_2^2(\gamma_0(t), \gamma_1(t))}{t^2}.$$

The key point of this section is to show another form for the right side. Since our assumptions will always be satisfied in particular by $u$ and $v$ of the form $\log_\mu(\nu)$, this will prove that the squared norm in the definition of the hugging function (Definition 2.7.7) is precisely the $L^2$ distance between transport maps, as claimed in Remark 2.7.8.

Theorem 2.8.1. Suppose $\mu \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$ and $\nu_0, \nu_1 \in W_2(\mathbb{R}^d)$. Let $\gamma_i\colon [0,1] \to W_2(\mathbb{R}^d)$ be the constant-speed geodesic joining $\mu$ and $\nu_i$, $i = 0, 1$. Then
$$\lim_{t\to 0} \frac{W_2^2(\gamma_0(t), \gamma_1(t))}{t^2} = \|T_{\mu\to\nu_0} - T_{\mu\to\nu_1}\|^2_{L^2(\mu)}.$$

Proof. Using the definition of $W_2(\mathbb{R}^d)$ we get that
$$\begin{aligned}
W_2^2(\gamma_0(t), \gamma_1(t)) &\le \|T_{\mu\to\gamma_0(t)} - T_{\mu\to\gamma_1(t)}\|^2_{L^2(\mu)}\\
&= \|((1-t)\,\mathrm{id} + tT_{\mu\to\nu_0}) - ((1-t)\,\mathrm{id} + tT_{\mu\to\nu_1})\|^2_{L^2(\mu)}\\
&= t^2\,\|T_{\mu\to\nu_0} - T_{\mu\to\nu_1}\|^2_{L^2(\mu)}.
\end{aligned}$$

Dividing by $t^2$ and taking $t \to 0^+$, we get one side of the equality. The other side is a bit more subtle. We start by assuming that the transport map $T_{\mu\to\nu_0} = \nabla\phi$ with $\phi$ $\beta$-smooth for some $\beta \in \mathbb{R}_{>0}$. Then $T_{\mu\to\gamma_0(t)}$ is the gradient of a $((1-t) + t\beta)$-smooth and at least $(1-t)$-strongly convex function. Apply Theorem 2.6.1 to yield
$$\begin{aligned}
t^2\,\|T_{\mu\to\nu_0} - T_{\mu\to\nu_1}\|^2_{L^2(\mu)} &= \|T_{\mu\to\gamma_0(t)} - T_{\mu\to\gamma_1(t)}\|^2_{L^2(\mu)}\\
&\le \big((1 - t + t\beta) - (1 - t)\big)\,W_2^2(\mu, \gamma_1(t)) + W_2^2(\gamma_0(t), \gamma_1(t))\\
&= \beta t^3\, W_2^2(\mu, \nu_1) + W_2^2(\gamma_0(t), \gamma_1(t)).
\end{aligned}$$

Dividing by $t^2$ and letting $t \to 0^+$ shows that
$$\lim_{t\to 0} \frac{W_2^2(\gamma_0(t), \gamma_1(t))}{t^2} = \|T_{\mu\to\nu_0} - T_{\mu\to\nu_1}\|^2_{L^2(\mu)},$$

at least when $T_{\mu\to\nu_0}$ is $\beta$-smooth for some $\beta$. As we remarked above, this implies that, for smooth $T_{\mu\to\nu_0}$,
$$\big\|[\gamma_0, W_2(\mu, \gamma_0(1))] - [\gamma_1, W_2(\mu, \gamma_1(1))]\big\|_\mu^2 = \|T_{\mu\to\nu_0} - T_{\mu\to\nu_1}\|^2_{L^2(\mu)}. \tag{2.8.1.1}$$

To finish, we can take a sequence of smooth transport maps approximating a given one (say by Moreau-Yosida regularization): each side of (2.8.1.1) will converge, and so we will have (2.8.1.1) for all $\gamma_0, \gamma_1 \in \Sigma_\mu$. Thence we can apply the discussion above the theorem to conclude the result.

Lastly, we mention that this result can be strengthened into a full isomorphism.

Theorem 2.8.2 (Thm. 12.4.4 [7]). Suppose $\mu \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$. Then
$$T_\mu(W_2(\mathbb{R}^d)) \cong \overline{\big\{\lambda(\nabla\phi - \mathrm{id}) : \lambda > 0,\ \nabla\phi(x) = T_{\mu\to(\nabla\phi)_\#\mu}(x)\ \mu\text{-a.e.}\big\}},$$
where the overline indicates a completion with respect to $L^2(\mu)$.

2.9 New quantitative stability bounds*

In this section we demonstrate the power of our bound on the hugging function $k$ through the following application. Fix a compact smooth manifold $M$ without boundary and with non-negative sectional curvatures, and consider the Wasserstein geometry on $\mathcal{P}_2(M)$. Then $\mathcal{P}_2(M)$ is non-negatively curved, and moreover Theorem 2.8.1 can be generalized [39, Prop. A.8, A.33], so that we can write, for any $\nu, \mu_1, \mu_2 \in \mathcal{P}_2(M)$,
$$\int_M d(T_{\nu\to\mu_1}, T_{\nu\to\mu_2})^2\,\mathrm{d}\nu = \big(1 - k(\mu_1, \mu_2, \nu)\big)\,W_2^2(\mu_2, \nu) + W_2^2(\mu_1, \mu_2). \tag{2.9.0.1}$$

Hence, analogous to the Euclidean case, lower bounds on the quantity $k(\mu_1, \mu_2, \nu)$ are directly related to upper bounds on $\mathbb{E}[d(T_{\nu\to\mu_1}, T_{\nu\to\mu_2})^2]$. This was stated explicitly as a problem in [8]. Desire for upper bounds on this quantity arises naturally in applications beyond that setting, such as when obtaining fast rates for estimating optimal transport maps [30]. We can apply Theorem 2.7.11 and optimal transport theory to get different quantitative stability estimates in terms of analytic conditions on the potentials generating the transport maps. We make an example explicit. Unfortunately, the manifold case does not exhibit the same simple relationship between the Brenier theorem optimality condition, namely $d^2/2$-convexity, and analytic conditions [61, Chapter 13]. However, under fairly restrictive conditions on the potential $\phi$, we may still obtain sufficient conditions for $d^2/2$-convexity.

Theorem 2.9.1 ([25]). Suppose $(M, g)$ is a smooth compact manifold without boundary. Then there is a constant $c$ depending on $M$ such that if $\phi\colon M \to \mathbb{R}$ is $C^2$ and
$$\|\nabla\phi\|_\infty + \|\nabla^2\phi\|_\infty \le c,$$
then $\phi$ is $d^2/2$-convex.

With this theorem in hand we can state a new quantitative stability result.

Theorem 2.9.2. Suppose $(M, g)$ is a smooth compact manifold with non-negative sectional curvatures and no boundary. Fix $\nu, \mu_1, \mu_2 \in \mathcal{P}_2(M)$ with $\nu \ll \mathrm{d\,vol}_g$. Let $T_1, T_2$ be the optimal maps between $\nu$ and $\mu_1, \mu_2$ respectively, and assume $T_1 = \exp(\nabla\phi)$ for some $d^2/2$-convex $\phi\colon M \to \mathbb{R}$. Then there is a constant $c$ depending on the manifold $M$ such that whenever $\|\nabla\phi\|_\infty + \|\nabla^2\phi\|_\infty \le c$,
$$\int_M d(T_1, T_2)^2\,\mathrm{d}\nu \le \frac{c}{c - (\|\nabla\phi\|_\infty + \|\nabla^2\phi\|_\infty)}\, W_2^2(\mu_1, \nu) + W_2^2(\mu_1, \mu_2).$$

Remark 2.9.3. Contrast with [8], where they obtain the estimate
$$\int_M d(T_1, T_2)^2\,\mathrm{d}\nu \lesssim W_2^2(\mu_1, \mu_2) + W_2(\mu_1, \mu_2)\,W_2(\nu, \mu_2).$$

Proof of Theorem 2.9.2. If the denominator in the first term is 0, then the bound is trivial, so suppose it is not. Since $W_2(M)$ is non-negatively curved, we may apply Theorem 2.7.11 to see that, if the geodesic between $\nu$ and $\mu_1$ is $(\lambda_{\mathrm{in}}, \lambda_{\mathrm{out}})$-extendible, then
$$k(\mu_1, \mu_2, \nu) \ge \frac{\lambda_{\mathrm{out}}}{1+\lambda_{\mathrm{out}}} - \frac{1}{\lambda_{\mathrm{in}}}.$$

Now, by the sufficient condition in Theorem 2.9.1 we know
$$\lambda_{\mathrm{in}} \ge \frac{c}{\|\nabla\phi\|_\infty + \|\nabla^2\phi\|_\infty} - 1.$$

Plugging this into equation (2.9.0.1), we get
$$\begin{aligned}
\int_M d(T_{\nu\to\mu_1}, T_{\nu\to\mu_2})^2\,\mathrm{d}\nu &= \big(1 - k(\mu_1, \mu_2, \nu)\big)\,W_2^2(\mu_2, \nu) + W_2^2(\mu_1, \mu_2)\\
&\le \left(1 + \frac{1}{\lambda_{\mathrm{in}}} - \frac{\lambda_{\mathrm{out}}}{1+\lambda_{\mathrm{out}}}\right) W_2^2(\nu, \mu_1) + W_2^2(\mu_1, \mu_2)\\
&\le \left(1 + \frac{1}{\lambda_{\mathrm{in}}}\right) W_2^2(\nu, \mu_1) + W_2^2(\mu_1, \mu_2)\\
&\le \frac{c}{c - (\|\nabla\phi\|_\infty + \|\nabla^2\phi\|_\infty)}\, W_2^2(\mu_1, \nu) + W_2^2(\mu_1, \mu_2).
\end{aligned}$$

Chapter 3

Averaging probability distributions: statistics

3.1 Introduction to the problem

In this chapter we develop the first major application of the theory in Chapter 2. We are interested here in perhaps the most basic statistical question of all: how quickly does the sample average converge to the population average? Specifically, let $P \in \mathcal{P}_2(X)$ for $X$ a non-negatively curved metric space. Assume $P$ has a barycenter $b^*$. Given samples $x_1, \ldots, x_n \sim P$, we shall let $P_n$ be the empirical measure defined as
$$P_n := \frac{1}{n}\sum_{i=1}^n \delta_{x_i}.$$

Assume $P_n$ has a barycenter $\hat b_n$. Then we wish to understand the quantity
$$\mathbb{E}\big[d_X^2(\hat b_n, b^*)\big],$$

푑 where the expectation is over samples of size 푛. We note that in the case where 푋 = R , we can use the fact that barycenters are averages to see that this is precisely

휎2(푃 ) [푑2 (^푏 , 푏*) = [‖^푏 − 푏*‖2] = . E 푋 푛 E 푛 2 푛

39 We shall call a rate of the form 푐/푛 for parametric. This is the model case that we will try and emulate in our results.
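A small simulation (ours, not from the thesis) confirms this Euclidean identity for $P = N(0, I_d)$, where $\sigma^2(P) = d$:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, trials = 5, 50, 20_000
b_star = np.zeros(d)                       # population barycenter of N(0, I_d)
samples = rng.normal(size=(trials, n, d))  # `trials` datasets of size n
err = np.mean(np.sum((samples.mean(axis=1) - b_star) ** 2, axis=1))
print(err, d / n)  # E||b_n - b*||^2 matches sigma^2 / n = d / n = 0.1
```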

Remark 3.1.1. We shall assume throughout that barycenters exist. In all cases of interest this is known to hold [38]. Uniqueness of the population barycenter $b^*$ will hold under the hypotheses we consider as well. We note that our results do not require uniqueness of the empirical barycenter.

3.2 Parametric rates under bi-extendibility

We now state the main result of this section.

Theorem 3.2.1. Suppose $(X, d_X)$ is non-negatively curved and $P \in \mathcal{P}_2(X)$ with barycenter $b^*$. If there exists $k_{\min} > 0$ such that $k_{b^*}^{b}(x) \ge k_{\min}$ (cf. Definition 2.7.7) for all $b, x \in X$, then $b^*$ is unique and any empirical barycenter $\hat b_n$ obeys
$$\mathbb{E}\big[d_X^2(\hat b_n, b^*)\big] \le \frac{4\sigma^2}{n k_{\min}^2}.$$

Taking this statement for granted for a moment, we immediately obtain the following corollaries, courtesy of Section 2.7.

Corollary 3.2.2. Suppose $(X, d_X)$ is non-negatively curved and $P \in \mathcal{P}_2(X)$ with barycenter $b^*$. Suppose for each $x \in \operatorname{supp}(P)$ the geodesic from $b^*$ to $x$ is $(\lambda_{\mathrm{in}}, \lambda_{\mathrm{out}})$-extendible and such that
$$k_0 := \frac{\lambda_{\mathrm{out}}}{1+\lambda_{\mathrm{out}}} - \frac{1}{\lambda_{\mathrm{in}}}$$
is positive. Then
$$\mathbb{E}\big[d_X^2(\hat b_n, b^*)\big] \le \frac{4\sigma^2(P)}{n k_0^2}.$$
In the case $(X, d_X) = W_2(\mathbb{R}^d)$, suppose $P$ has a barycenter $b^* \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$ such that for each $\mu \in \operatorname{supp}(P)$, the optimal Kantorovich potential $\phi_{b^*\to\mu}$ is $\alpha$-strongly convex and $\beta$-smooth with $\beta - \alpha < 1$. Then
$$\mathbb{E}\big[W_2^2(\hat b_n, b^*)\big] \le \frac{4\sigma^2(P)}{n\big(1 - (\beta - \alpha)\big)^2}.$$

Remark 3.2.3. This result significantly advances the state of the art. Early works on empirical barycenters focused on consistency and limit distributions [10, 11, 33]. Asymptotics in the Wasserstein case were considered in [38]. Finite sample rates have been established in the non-positively curved regime and mild generalizations thereof [58, 13, 54]. The most significant result valid on spaces of non-negative curvature is from [3]. There, the authors show dimension-dependent rates under a variance inequality. In the case $(X, d_X) = W_2(\mathbb{R}^d)$, their convergence rates are of the form $n^{-1/d}$.

Proof of Theorem 3.2.1. Fix $n$ elements $x_1, \ldots, x_n \in \operatorname{supp}(P)$ and let $\hat b_n$ be their barycenter.

We apply our assumption on the hugging function to observe that for each 푥푖,

$$\|\log_{b^*}(\hat b_n) - \log_{b^*}(x_i)\|_{b^*}^2 \le (1 - k_{\min})\,d_X^2(b^*, \hat b_n) + d_X^2(\hat b_n, x_i).$$

On the other hand, by definition of the tangent cone,
$$\|\log_{b^*}(\hat b_n) - \log_{b^*}(x_i)\|_{b^*}^2 = d_X^2(b^*, \hat b_n) - 2\langle \log_{b^*}(\hat b_n), \log_{b^*}(x_i)\rangle_{b^*} + d_X^2(b^*, x_i).$$

Whence
$$k_{\min}\,d_X^2(b^*, \hat b_n) \le 2\langle \log_{b^*}(\hat b_n), \log_{b^*}(x_i)\rangle_{b^*} + d_X^2(\hat b_n, x_i) - d_X^2(b^*, x_i).$$

If we sum over $i$ and divide by $n$, then the difference on the right-hand side is non-positive (since $\hat b_n$ minimizes the empirical barycenter functional), so we get
$$k_{\min}\,d_X^2(b^*, \hat b_n) \le 2\langle \log_{b^*}(\hat b_n), \bar b_n\rangle_{b^*},$$
where we set $\bar b_n := \frac{1}{n}\sum_{i=1}^n \log_{b^*}(x_i)$. Applying Cauchy-Schwarz and simplifying, we find

$$k_{\min}^2\, d_X^2(b^*, \hat{b}_n) \leq 4\|\bar{b}_n\|_{b^*}^2.$$

We observe that by theorem 2.7.5-(3) the $\log_{b^*}(x_i)$ lie in a Hilbert space, so that

$$\|\bar{b}_n\|_{b^*}^2 = \frac{1}{n^2} \sum_{i,j=1}^n \langle \log_{b^*}(x_i), \log_{b^*}(x_j) \rangle_{b^*}.$$

Taking each $x_i$ independently distributed according to $P$ and applying theorem 2.7.5-(2), which makes the terms with $i \neq j$ non-positive in expectation, we obtain

$$k_{\min}^2\, \mathbb{E}[d_X^2(b^*, \hat{b}_n)] \leq \frac{4}{n}\, \mathbb{E}[d_X^2(b^*, x)] = \frac{4\sigma^2}{n}.$$

This proves the result.

Chapter 4

Averaging probability distributions: optimization

In this chapter, we turn to the question of computing Wasserstein barycenters. We develop a general machinery for studying first-order optimization methods for this purpose, and apply it to the case of distributions supported on Gaussian measures, where the relevant geometry is known as the Bures-Wasserstein manifold. We provide the first analysis of exponential convergence of gradient descent in this setting, resolving an open question of [6].

4.1 Introduction and main results

Establishing fast convergence of first-order methods is usually intimately related to convexity. As we well know by now, however, the barycenter functional on $W_2(\mathbb{R}^d)$ cannot be expected to be (geodesically) convex. Indeed, as we can see in Figure 4-1, the barycenter functional may even be concave along geodesics.

Fortunately, the optimization literature describes conditions for global convergence of first-order algorithms even for non-convex objectives. We shall employ one such condition, a Polyak-Łojasiewicz (PL) inequality of the form (4.3.2.1), which is known to yield linear convergence for a variety of gradient methods on flat spaces even in the absence of convexity [32].

The main work of this chapter is in two parts. The first is to develop general machinery towards a PL inequality for the barycenter functional, and the second is to instantiate it for the Bures-Wasserstein case. The consequences for distributions supported on Gaussians are given below.

[Figure 4-1: Example of the non-geodesic convexity of $W_2^2$. Displayed is the squared Bures distance $W_2^2(C, \gamma(t))$, $t \in [0,1]$, along a Bures geodesic and a Euclidean geodesic. Details are given in Appendix B.0.2.]

Theorem 4.1.1. Fix $\zeta \in (0, 1]$ and let $P$ be a distribution supported on mean-zero Gaussian measures whose covariance matrices $\Sigma$ satisfy $\|\Sigma\|_{\mathrm{op}} \leq 1$ and $\det \Sigma \geq \zeta$. Then $P$ has a unique barycenter $b^*$, and Gradient Descent (Algorithm 1) initialized at $b_0 \in \mathrm{supp}(P)$ yields a sequence $(b_T)_{T \geq 1}$ such that

$$W_2^2(b_T, b^*) \leq \frac{2}{\zeta}\Big(1 - \frac{\zeta^2}{4}\Big)^T [F(b_0) - F(b^*)].$$

The above theorem establishes a linear rate of convergence for gradient descent and answers a question left open in [6]. Moreover, when $P$ is an empirical distribution, combined with the existing results of [3, 36], it yields a procedure to estimate Bures-Wasserstein barycenters at the parametric rate after a number of iterations that is logarithmic in the sample size $n$. Still in the Gaussian case, we also show that a stochastic gradient descent (SGD) algorithm converges to the true barycenter at a parametric rate.

Theorem 4.1.2. Fix $\zeta \in (0, 1]$ and let $P$ be a distribution supported on mean-zero Gaussian measures whose covariance matrices $\Sigma$ satisfy $\|\Sigma\|_{\mathrm{op}} \leq 1$ and $\det \Sigma \geq \zeta$. Then $P$ has a unique barycenter $b^*$, and Stochastic Gradient Descent (Algorithm 2) run on a sample of size $n + 1$ from $P$ returns a Gaussian measure $b_n$ such that

$$\mathbb{E}\, W_2^2(b_n, b^*) \leq \frac{96\sigma^2(P)}{n\zeta^5}.$$

[Figure 4-2: Left: convergence of SGD on the Bures manifold for $n = 1000$, $d = 3$, and $b^\star = \gamma_{0,I_3}$ (curves: SGD, SGD (averaged), SGD with replacement, SGD with replacement (averaged), empirical barycenter). Right: linear convergence of GD to the empirical and population barycenters on the same problem.]

This theorem shows that SGD yields an estimator $b_n$, different from the empirical barycenter $\hat{b}_n$, that also converges at the parametric rate to $b^\star$. Hence we actually have two ways to estimate the barycenter $b^\star$: we can either apply theorem 4.1.1 to the empirical distribution or theorem 4.1.2 to the population distribution. The latter exhibits slower convergence but has much cheaper iterations.

As far as we are aware, these results provide the first non-asymptotic rates of convergence for first-order methods on the Bures-Wasserstein manifold. The problem of Bures-Wasserstein barycenters over Gaussians has been studied since the 1990s [35]. Previous works have focused primarily on empirical and asymptotic explorations, as well as connections with matrix analysis [6, 9].

Remark 4.1.3. The assumption $\|\Sigma\|_{\mathrm{op}} \leq 1$ is simply a normalization and changes nothing. A natural sufficient condition for $\det \Sigma \geq \zeta$ to be satisfied is when all the eigenvalues of the covariance matrix $\Sigma$ are lower bounded by a constant $\lambda_{\min} > 0$. In this case, the parameter $\zeta \geq \lambda_{\min}^d$ can be exponentially small in the dimension. Note, however, that in this case the Gaussian measure is quite degenerate in the sense that the density of $\gamma_{0,\Sigma}$ is exponentially large at $0$.

In Figure 4-2, we present an experiment confirming these two results; see Appendix B for more details and further numerical results.

4.2 Local to global phenomena for Wasserstein barycenters

The gradient of the barycenter functional is defined as

$$\nabla F(b) := \mathbb{E}_{\mu \sim P}[\mathrm{id} - T_{b \to \mu}] \qquad \forall b \in \mathcal{P}_{2,\mathrm{ac}}.$$

This formula can be justified rigorously, but we take it simply as a convention [7]. The early work of Agueh and Carlier in 2011 [1] showed that the norm of this gradient is closely related to the barycenter problem. Specifically, they proved:

Theorem 4.2.1 ([1, Prop. 3.8]). Suppose $P$ has finite support in $\mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$. Fix $b \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$. For each $\mu \in \mathrm{supp}(P)$, let the Monge-Kantorovich potential generating the optimal coupling from $b$ to $\mu$ be $\phi_{b \to \mu}$, so that $T_{b \to \mu} = \nabla \phi_{b \to \mu}$. Then $b$ is the barycenter of $P$ if and only if there exists a constant $C$ such that

$$\mathbb{E}_{\mu \sim P}[\phi_{b \to \mu}(x)] \leq C + \frac{1}{2}\|x\|_2^2 \qquad \forall x \in \mathbb{R}^d,$$

and such that equality holds for $b$-a.e. $x \in \mathbb{R}^d$.

The crucial point here is that the statement that $\mathbb{E}_{\mu \sim P}[\phi_{b \to \mu}(x)] = C + \frac{1}{2}\|x\|_2^2$ for $b$-a.e. $x$ is equivalent to $b$ being a critical point of $F$:

$$\|\nabla F(b)\|_{L^2(b)} = 0.$$

Indeed, under some mild regularity assumptions on $b$ and on the support of $P$, their theorem can be strengthened to an unconditional if and only if: $\|\nabla F(b)\|_{L^2(b)} = 0$ if and only if $b = b^*$ [48].

Taking this interpretation, we have found a promising fact. The positive curvature of $W_2(\mathbb{R}^d)$ a priori blocks convexity of $F$, and yet $F$ still has one of the local-to-global attributes of convex functions: critical points are global optima. It is this fact which motivates the previous and present study of first-order methods for computing barycenters.

4.3 Sufficient conditions for first-order methods to converge on $W_2(\mathbb{R}^d)$

In this section we describe an approach that combines the 1-smoothness of $F$ (cf. proposition 2.3.4) with a certain quantitative weakening of strong convexity to obtain fast rates of convergence for first-order algorithms for Wasserstein barycenters. To begin, we describe the methods under consideration. Given a sequence of step sizes $(\eta_t)_{t \geq 1}$ and an initial point $b_0 \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$, we consider the gradient descent (GD) dynamics:

$$b_{t+1} := (\mathrm{id} - \eta_t \nabla F(b_t))_{\#} b_t, \qquad t = 1, 2, \ldots \tag{4.3.0.1}$$

We shall also consider the stochastic gradient descent (SGD) dynamics: given a sequence of step sizes $(\eta_t)_{t=1}^{n-1}$, an initial point $b_0 \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$, and $n$ samples $\mu_t \sim P$, $t = 1, \ldots, n$, the iterates are

$$b_{t+1} := \big(\mathrm{id} + \eta_t (T_{b_t \to \mu_{t+1}} - \mathrm{id})\big)_{\#} b_t, \qquad t = 0, \ldots, n - 1. \tag{4.3.0.2}$$
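In dimension one these dynamics can be simulated exactly, since $T_{b \to \mu} = F_\mu^{-1} \circ F_b$ and pushforwards act coordinatewise on quantile functions. The following minimal sketch is ours (NumPy/SciPy; the grid size and the Gaussian inputs are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

# Represent each measure on R by its quantile vector on a fixed grid of levels u.
# In 1D, T_{b->mu} = F_mu^{-1} o F_b, so the GD update (4.3.0.1) becomes a convex
# combination of quantile vectors.
rng = np.random.default_rng(0)
u = (np.arange(200) + 0.5) / 200
locs, scales = rng.normal(0.0, 1.0, 50), rng.uniform(0.5, 2.0, 50)
Q = locs[:, None] + scales[:, None] * norm.ppf(u)   # quantiles of each mu_i

def F(qb):
    # Barycenter functional: (1/2) * average W_2^2(b, mu_i), with W_2^2
    # approximated by the mean squared difference of quantile vectors.
    return 0.5 * np.mean(np.mean((qb - Q) ** 2, axis=1))

qb = Q[0].copy()                    # initialize at one of the measures
for _ in range(20):
    grad = qb - Q.mean(axis=0)      # quantile coordinates of grad F(b)
    qb = qb - 0.5 * grad            # GD step with eta = 1/2
print(F(qb) - F(Q.mean(axis=0)))    # gap to the exact barycenter -> 0
```

Because one-dimensional Wasserstein space is flat, GD with unit step would converge in a single iteration here; the interest of the machinery below is precisely that $W_2(\mathbb{R}^d)$ for $d \geq 2$ is curved.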

To analyze these dynamics we shall first need a calculus version of the 1-smoothness property of 퐹 .

Proposition 4.3.1. For any $b_0, b_1 \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$ such that $F(b_0) < \infty$, the barycenter functional satisfies the smoothness inequality

$$F(b_1) \leq F(b_0) + \langle \nabla F(b_0), T_{b_0 \to b_1} - \mathrm{id} \rangle_{b_0} + \frac{1}{2} W_2^2(b_0, b_1). \tag{4.3.1.1}$$

Moreover, for any $b \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$ and $b^+ := [\mathrm{id} - \nabla F(b)]_{\#} b$, it holds that

$$F(b^+) - F(b) \leq -\frac{1}{2}\|\nabla F(b)\|_b^2. \tag{4.3.1.2}$$

Proof. Let $(b_s)_{s \in [0,1]}$ be the constant-speed geodesic between arbitrary $b_0, b_1 \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$. From the non-negative curvature inequality (2.2.1.2), it holds that for any $s \in (0, 1]$,

$$\int \frac{W_2^2(b_s, \mu) - W_2^2(b_0, \mu)}{s} \,\mathrm{d}P(\mu) \geq \int \big[ W_2^2(b_1, \mu) - W_2^2(b_0, \mu) \big] \,\mathrm{d}P(\mu) - (1 - s)\, W_2^2(b_0, b_1).$$

We will apply the dominated convergence theorem to the left-hand side. To do this, we

47 observe that

$$\frac{1}{s}\big|W_2^2(b_s, \mu) - W_2^2(b_0, \mu)\big| = \frac{1}{s}\big(W_2(b_s, \mu) + W_2(b_0, \mu)\big)\,\big|W_2(b_s, \mu) - W_2(b_0, \mu)\big|$$
$$\leq \frac{1}{s}\big(W_2(b_0, b_s) + 2 W_2(b_0, \mu)\big)\, W_2(b_0, b_s)$$
$$\leq \big(W_2(b_0, b_1) + 2 W_2(b_0, \mu)\big)\, W_2(b_0, b_1).$$

By our assumption that $F(b_0) < \infty$, this bound is $P$-integrable. Hence, the LHS converges to

$$\int \frac{\mathrm{d}}{\mathrm{d}s} W_2^2(b_s, \mu)\Big|_{s=0^+} \,\mathrm{d}P(\mu) = -2 \int \langle T_{b_0 \to \mu} - \mathrm{id},\ T_{b_0 \to b_1} - \mathrm{id} \rangle_{L^2(b_0)} \,\mathrm{d}P(\mu) = 2\langle \nabla F(b_0), T_{b_0 \to b_1} - \mathrm{id} \rangle_{b_0},$$

where in the first identity we used the characterization of [7, Prop. 7.3.6]. Rearranging terms yields (4.3.1.1). Noticing that $W_2^2(b, b^+) = \|{-\nabla F(b)}\|_b^2$, the second equation follows from the first.

We now introduce a certain quantitative weakening of strong convexity which shall be useful in the sequel.

Definition 4.3.2. We say that $P$ satisfies a Polyak-Łojasiewicz (PL) inequality at $b \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$ if

$$2 C_{\mathrm{PL}}\big(F(b) - F(b^*)\big) \leq \|\nabla F(b)\|_{L^2(b)}^2. \tag{4.3.2.1}$$

Remark 4.3.3. PL inequalities are a standard item in the theory of optimization, for they are essentially a minimal requirement to derive typical rates of convergence for optimization methods [32]. We note as well that a PL inequality for the barycenter functional can be viewed as a quantitative strengthening of the “critical points are optimal” statement from the previous section.
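For intuition, recall the standard Euclidean computation showing that $\alpha$-strong convexity implies a PL inequality with $C_{\mathrm{PL}} = \alpha$: minimizing the lower bound

$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2}\|y - x\|^2$$

over $y$ (the minimum is attained at $y = x - \nabla f(x)/\alpha$) gives $f^* \geq f(x) - \frac{1}{2\alpha}\|\nabla f(x)\|^2$, that is, $2\alpha(f(x) - f^*) \leq \|\nabla f(x)\|^2$. The converse fails, which is what makes (4.3.2.1) available for the non-convex barycenter functional.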

As the following theorems evidence, if 퐹 satisfies a PL inequality at each iterate of gradient descent or stochastic gradient descent then we can recover the usual rates from Euclidean optimization.

Theorem 4.3.4 (Exponential convergence for GD). Suppose $F$ satisfies a PL inequality at each iterate of the gradient descent dynamics $(b_t)_{t < T}$. Then

$$F(b_T) - F(b^*) \leq (1 - C_{\mathrm{PL}})^T \big(F(b_0) - F(b^*)\big).$$

Proof of Theorem 4.3.4. The PL inequality implies $F(b_t) < \infty$ for all $t$. We may thus apply the smoothness (4.3.1.2) and PL (4.3.2.1) inequalities to see that

$$F(b_{t+1}) - F(b_t) \leq -C_{\mathrm{PL}}\,[F(b_t) - F(b^*)].$$

This yields $F(b_{t+1}) - F(b^*) \leq (1 - C_{\mathrm{PL}})[F(b_t) - F(b^*)]$, which gives the result.

We also get a 1/푛 rate for the SGD dynamics.

Theorem 4.3.5 ($1/n$ convergence for SGD). Suppose that there exists a constant $C_{\mathrm{PL}} > 0$ such that $F$ satisfies the PL inequality (4.3.2.1) at all iterates $(b_t)_{0 \leq t \leq n}$ of SGD run with step size

$$\eta_t = C_{\mathrm{PL}}\Bigg(1 - \sqrt{1 - \frac{2(t+k) + 1}{C_{\mathrm{PL}}^2 (t + k + 1)^2}}\Bigg) \leq \frac{2}{C_{\mathrm{PL}}(t + k + 1)}, \tag{4.3.5.1}$$

where we take $k = 2/C_{\mathrm{PL}}^2 - 1 \geq 0$. Then,

$$\mathbb{E} F(b_n) - F(b^*) \leq \frac{3\sigma^2(P)}{C_{\mathrm{PL}}^2 n}.$$

The proof is relegated to Appendix A.2.1. Although mostly standard, there is a critical usage of the non-negative curvature of $W_2(\mathbb{R}^d)$.

4.4 An integrated Polyak-Łojasiewicz inequality

The aim of this section is to prove the following “integrated PL inequality”.

Lemma 4.4.1 (Integrated PL). Let $P$ satisfy a variance inequality with constant $C_{\mathrm{var}}$ and let $b \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$ be such that the barycenter $b^*$ of $P$ is absolutely continuous w.r.t. $b$. Assume further the following measurability condition: there exists a $P \otimes b^*$- and $P \otimes b$-integrable mapping $\phi \colon \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d) \times \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$, $(\mu, x) \mapsto \phi_{b \to \mu}(x)$, such that for $P$-almost every $\mu \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$, $\phi_{b \to \mu} \colon \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ is a Kantorovich potential for the optimal transport map from $b$ to $\mu$. Then,

$$F(b) - F(b^*) \leq \frac{2}{C_{\mathrm{var}}} \Big( \int_0^1 \|\nabla F(b)\|_{L^2(b_s)} \,\mathrm{d}s \Big)^2,$$

where $(b_s)_{s \in [0,1]}$ is the constant-speed $W_2$-geodesic joining $b$ to $b^*$.

49 Remark 4.4.2. We comment on deriving a full PL inequality from the above. If we know that

$\|\mathrm{d}b_s/\mathrm{d}b\|_\infty \leq C$, then this immediately yields a full PL inequality. The difficulty with such an approach lies in the fact that we must guarantee it to be true when $b$ is an iterate of the gradient descent trajectory. In fact, sup-norm upper bounds on the iterates can be easily maintained throughout the trajectory, as they hold along $W_2$-geodesics. To complete such an argument we would need a lower bound of the form $\mathrm{d}b \geq c$. However, ensuring lower bounds along $W_2$-geodesics is a difficult open problem [53].

The following lemma will be useful for us. It appears in [39, Lem. A.1] in the case of Lipschitz functions. A minor modification of their proof, which we include in Appendix A.2.2, allows us to handle locally Lipschitz rather than only Lipschitz functions.

Lemma 4.4.3. Let $(b_s)_{s \in [0,1]}$ be a Wasserstein geodesic in $\mathcal{P}_2(\mathbb{R}^d)$. Let $\Omega \subseteq \mathbb{R}^d$ be a convex open subset for which $b_0(\Omega) = b_1(\Omega) = 1$. Then, for any function $f \colon \mathbb{R}^d \to \mathbb{R}$ which is locally Lipschitz on $\Omega$, it holds that

$$\Big| \int f \,\mathrm{d}b_0 - \int f \,\mathrm{d}b_1 \Big| \leq W_2(b_0, b_1) \int_0^1 \|\nabla f\|_{L^2(b_s)} \,\mathrm{d}s.$$

Proof of Lemma 4.4.1. By Kantorovich duality (cf. theorem 2.1.1),

$$\frac{1}{2} W_2^2(b, \mu) = \int \Big( \frac{\|\cdot\|^2}{2} - \phi_{\mu \to b} \Big) \,\mathrm{d}\mu + \int \Big( \frac{\|\cdot\|^2}{2} - \phi_{b \to \mu} \Big) \,\mathrm{d}b,$$
$$\frac{1}{2} W_2^2(b^*, \mu) \geq \int \Big( \frac{\|\cdot\|^2}{2} - \phi_{\mu \to b} \Big) \,\mathrm{d}\mu + \int \Big( \frac{\|\cdot\|^2}{2} - \phi_{b \to \mu} \Big) \,\mathrm{d}b^*.$$

This yields the inequality

$$F(b) - F(b^*) \leq \int \Big( \frac{\|\cdot\|^2}{2} - \int \phi_{b \to \mu} \,\mathrm{d}P(\mu) \Big) \,\mathrm{d}(b - b^*),$$

where we have used the integrability assumption to swap integrals. Let $\phi := \int \phi_{b \to \mu} \,\mathrm{d}P(\mu)$; this is a proper LSC convex function $\mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$. We apply lemma 4.4.3 with $\Omega = \operatorname{int} \operatorname{dom} \phi$. Since $\phi$ is locally Lipschitz on the interior of its domain and $b^* \ll b$, we have $b(\Omega) = b^*(\Omega) = 1$, whence

$$F(b) - F(b^*) \leq W_2(b, b^*) \int_0^1 \|\nabla \phi - \mathrm{id}\|_{L^2(b_s)} \,\mathrm{d}s \leq \sqrt{\frac{2[F(b) - F(b^*)]}{C_{\mathrm{var}}}} \int_0^1 \|\nabla \phi - \mathrm{id}\|_{L^2(b_s)} \,\mathrm{d}s.$$

Square and rearrange to yield

$$F(b) - F(b^*) \leq \frac{2}{C_{\mathrm{var}}} \Big( \int_0^1 \|\nabla \phi - \mathrm{id}\|_{L^2(b_s)} \,\mathrm{d}s \Big)^2.$$

Recognizing that ∇퐹 (푏) = id −∇휙 yields the result.

4.5 Specializing to the Bures-Wasserstein case

As discussed in Remark 4.4.2, obtaining a full PL inequality from the integrated PL is not known in general. In this section, we apply the integrated PL inequality to obtain a bona fide PL inequality for distributions supported on Gaussians. Identifying a centered non-degenerate Gaussian measure with its covariance matrix, the Wasserstein geometry induces a Riemannian structure on the space of positive definite matrices, known as the Bures geometry. Accordingly, we now refer to the barycenter of $P$ as the Bures-Wasserstein barycenter [20, 9].

4.5.1 Bures-Wasserstein gradient descent algorithms

We now specialize both GD and SGD to the case where $P$ is supported on mean-zero Gaussian measures. In this case, the updates of both algorithms take a remarkably simple form. To see this, for $m \in \mathbb{R}^d$, $\Sigma \in \mathbb{S}_{++}^d$, let $\gamma_{m,\Sigma}$ denote the Gaussian measure on $\mathbb{R}^d$ with mean $m$ and covariance matrix $\Sigma$. In particular, the optimal coupling between $\gamma_{m_0,\Sigma_0}$ and $\gamma_{m_1,\Sigma_1}$ has the explicit form

$$x \mapsto T_{\gamma_{m_0,\Sigma_0} \to \gamma_{m_1,\Sigma_1}}(x) := m_1 + \Sigma_0^{-1/2} \big( \Sigma_0^{1/2} \Sigma_1 \Sigma_0^{1/2} \big)^{1/2} \Sigma_0^{-1/2} (x - m_0). \tag{4.5.0.1}$$
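The map (4.5.0.1) is straightforward to evaluate numerically. A minimal sketch (NumPy/SciPy; the helper `sqrtm_spd` and the test matrices are our own illustrative choices):

```python
import numpy as np
from scipy.linalg import sqrtm

def sqrtm_spd(A):
    """Symmetric PSD square root (discard the tiny imaginary part sqrtm may return)."""
    return np.real(sqrtm(A))

def gaussian_ot_matrix(S0, S1):
    """Linear part of the optimal map (4.5.0.1) from N(0, S0) to N(0, S1)."""
    r = sqrtm_spd(S0)
    r_inv = np.linalg.inv(r)
    return r_inv @ sqrtm_spd(r @ S1 @ r) @ r_inv

S0 = np.array([[2.0, 0.5], [0.5, 1.0]])
S1 = np.array([[1.0, -0.3], [-0.3, 1.5]])
M = gaussian_ot_matrix(S0, S1)
print(np.allclose(M @ S0 @ M, S1))  # pushforward covariance M S0 M equals S1
```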

Observe that $T_{\gamma_{m_0,\Sigma_0} \to \gamma_{m_1,\Sigma_1}}$ is affine, and thus $\int T_{\gamma_{m_0,\Sigma_0} \to \gamma} \,\mathrm{d}P(\gamma)$ is affine. This means that all of the GD (or SGD) iterates are Gaussian measures, so it suffices to keep track of the mean and covariance matrix of the current iterate. For both GD and SGD, the update equation for the descent step decomposes into two decoupled equations: an update equation for the mean, and an update equation for the covariance matrix. Moreover, the update equation for the mean is trivial, corresponding to a simple GD or SGD procedure on the objective function $m \mapsto \int \|m - m(\mu)\|^2 \,\mathrm{d}P(\mu)$. Therefore, for simplicity and without loss of generality, we consider only mean-zero Gaussians throughout this section.

Algorithm 1 Bures-Wasserstein GD
1: procedure BURES-GD($\Sigma_0$, $P$, $T$)
2:   for $t = 1, \ldots, T$ do
3:     $S_t \leftarrow \int \Sigma_{t-1}^{-1/2} \{\Sigma_{t-1}^{1/2} \Sigma(\mu) \Sigma_{t-1}^{1/2}\}^{1/2} \Sigma_{t-1}^{-1/2} \,\mathrm{d}P(\mu)$
4:     $\Sigma_t \leftarrow S_t \Sigma_{t-1} S_t$
5:   end for
6:   return $\Sigma_T$
7: end procedure

Algorithm 2 Bures-Wasserstein SGD
1: procedure BURES-SGD($\Sigma_0$, $(\eta_t)_{t=1}^T$, $(K_t)_{t=1}^T$)
2:   for $t = 1, \ldots, T$ do
3:     $\hat{S}_t \leftarrow \Sigma_{t-1}^{-1/2} \{\Sigma_{t-1}^{1/2} K_t \Sigma_{t-1}^{1/2}\}^{1/2} \Sigma_{t-1}^{-1/2}$
4:     $\Sigma_t \leftarrow \big((1 - \eta_t) I_D + \eta_t \hat{S}_t\big) \Sigma_{t-1} \big((1 - \eta_t) I_D + \eta_t \hat{S}_t\big)$
5:   end for
6:   return $\Sigma_T$
7: end procedure

We simply have to write down the update equations for the covariance matrix $\Sigma_t$ of the iterate; they are summarized in Algorithms 1 and 2.
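For concreteness, the following is a minimal NumPy rendering of Algorithms 1 and 2, a sketch under our own conventions (the distribution is a finite list of covariance matrices, and `sqrtm_spd` is as in the snippet above), not the authors' implementation:

```python
import numpy as np
from scipy.linalg import sqrtm

def sqrtm_spd(A):
    return np.real(sqrtm(A))

def bures_gd(Sigma0, Sigmas, T):
    """Algorithm 1 (BURES-GD) for an empirical distribution over covariances."""
    Sig = Sigma0.copy()
    for _ in range(T):
        r = sqrtm_spd(Sig)
        r_inv = np.linalg.inv(r)
        S = np.mean([r_inv @ sqrtm_spd(r @ K @ r) @ r_inv for K in Sigmas], axis=0)
        Sig = S @ Sig @ S
    return Sig

def bures_sgd(Sigma0, Ks, etas):
    """Algorithm 2 (BURES-SGD); K_t is the covariance of the t-th sampled Gaussian."""
    Sig = Sigma0.copy()
    I = np.eye(Sigma0.shape[0])
    for K, eta in zip(Ks, etas):
        r = sqrtm_spd(Sig)
        r_inv = np.linalg.inv(r)
        S_hat = r_inv @ sqrtm_spd(r @ K @ r) @ r_inv
        step = (1 - eta) * I + eta * S_hat
        Sig = step @ Sig @ step
    return Sig
```

For instance, `bures_gd(Sigmas[0], Sigmas, 20)` runs GD initialized at a point of the support, matching the initialization in Theorem 4.1.1.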

In the rest of this section, we prove the guarantees for GD and SGD on the Bures-Wasserstein manifold given in theorems 4.1.1 and 4.1.2.

4.5.2 Proof of the main results

For simplicity, we make the following reductions: we assume that the Gaussians are centered (see the previous subsection) and that the eigenvalues of the covariance matrices of the Gaussians are uniformly bounded above by 1. The latter assumption is justified by the observation that if there is a uniform upper bound on the eigenvalues of the covariance matrices, then we can simply rescale them so that the bound becomes 1. As can be easily checked, the barycenter thus obtained will be the scaled version of the original barycenter, so there is no harm in doing this. While the centering and scaling assumptions stated above can be made without loss of generality, our results require the following regularity condition. Note that it is equivalent to a uniform upper bound on the densities of the Gaussians.

Definition 4.5.1 ($\zeta$-regular). Fix $\zeta \in (0, 1]$. A distribution $P \in \mathcal{P}_2(\mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d))$ is said to be $\zeta$-regular if its support is contained in

$$\mathcal{S}_\zeta = \big\{ \gamma_{0,\Sigma} : \Sigma \in \mathbb{S}_{++}^d,\ \|\Sigma\|_{\mathrm{op}} \leq 1,\ \det \Sigma \geq \zeta \big\}. \tag{4.5.1.1}$$

We use this definition to prove our main theorems with the following four steps. We show that, for any 휁-regular distribution 푃 ,

1. Corollary 4.5.5: 푃 has a barycenter in 풮휁 .

2. Theorem 4.5.7: 푃 obeys a variance inequality with 퐶var = 휁.

3. Theorem 4.5.8: $P$ obeys a PL inequality uniformly over $\mathcal{S}_\zeta$ with $C_{\mathrm{PL}} = \zeta^2/4$.

4. Corollary 4.5.6: GD and SGD initialized in 풮휁 remain in 풮휁 .

We thus obtain a PL inequality throughout the optimization trajectories and can apply our optimization results from Section 4.3 to conclude. For the rest of this section, we execute this proof plan. We lay the groundwork by defining the following strong form of geodesic convexity for $\mathcal{S}_\zeta$.

Definition 4.5.2 (Defn. 7.4 [1]). Fix $\mu, \nu_0, \nu_1 \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$. Let $T_i$ be the optimal coupling from $\mu$ to $\nu_i$, $i = 0, 1$. Then the generalized geodesic from $\nu_0$ to $\nu_1$ with base $\mu$ is the interpolated curve

$$\nu_t^\mu := \big((1 - t) T_0 + t T_1\big)_{\#} \mu.$$

We say that a functional $G \colon \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d) \to \mathbb{R}$ is convex along generalized geodesics if for all generalized geodesics $\nu_t^\mu$, the function $G \circ \nu_t^\mu \colon [0, 1] \to \mathbb{R} \cup \{\infty\}$ is convex. A set $S \subset \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$ is said to be convex along generalized geodesics if the convex indicator

$$\iota_S(\mu) := \begin{cases} 0 & \mu \in S \\ \infty & \mu \notin S \end{cases}$$

is convex along generalized geodesics.
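For centered Gaussians, a generalized geodesic has an explicit covariance curve, since the optimal maps are the linear maps of (4.5.0.1). A minimal sketch of ours (NumPy/SciPy, with the same helper assumptions as the earlier snippets):

```python
import numpy as np
from scipy.linalg import sqrtm

def sqrtm_spd(A):
    return np.real(sqrtm(A))

def ot_matrix(S_from, S_to):
    r = sqrtm_spd(S_from)
    r_inv = np.linalg.inv(r)
    return r_inv @ sqrtm_spd(r @ S_to @ r) @ r_inv

def generalized_geodesic_cov(S_base, S0, S1, t):
    """Covariance of nu_t^mu = ((1 - t) T_0 + t T_1)_# mu for centered Gaussians."""
    M = (1 - t) * ot_matrix(S_base, S0) + t * ot_matrix(S_base, S1)
    return M @ S_base @ M   # pushforward of N(0, S_base) under the symmetric map M
```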

We next demonstrate two important functionals which are convex along generalized geodesics.

Lemma 4.5.3. For a measure $\mu \in \mathcal{P}_2(\mathbb{R}^d)$, let $M(\mu) := \int x \otimes x \,\mathrm{d}\mu(x)$. Then the functional $\mu \mapsto \|M(\mu)\|_{\mathrm{op}} = \lambda_{\max}(M(\mu))$ is convex along generalized geodesics on $\mathcal{P}_2(\mathbb{R}^d)$.

Proof. Let $S^{d-1}$ denote the unit sphere of $\mathbb{R}^d$ and observe that for any $e \in S^{d-1}$ the function $x \mapsto \langle x, e \rangle^2$ is convex on $\mathbb{R}^d$. By known results for geodesic convexity in Wasserstein space (see [7, Prop. 9.3.2]), the functional $\mu \mapsto \int \langle \cdot, e \rangle^2 \,\mathrm{d}\mu = \langle e, M(\mu) e \rangle$ is convex along generalized geodesics in $\mathcal{P}_2(\mathbb{R}^d)$; hence, so is the functional $\mu \mapsto \max_{e \in S^{d-1}} \langle e, M(\mu) e \rangle = \|M(\mu)\|_{\mathrm{op}}$.

The next lemma establishes convexity along generalized geodesics of $\mu \mapsto -\ln \det \Sigma(\mu)$. It follows from specializing lemma A.2.3 in the Appendix to the Bures-Wasserstein manifold.

Lemma 4.5.4. The functional $\gamma_{0,\Sigma} \mapsto -\sum_{i=1}^d \ln \lambda_i(\Sigma)$ is convex along generalized geodesics on the space of non-degenerate Gaussian measures.

It follows readily from lemmas 4.5.3 and 4.5.4 that the set $\mathcal{S}_\zeta$ is convex along generalized geodesics. Combining this with the fact that $\zeta$-regular distributions have barycenters (proposition A.2.1 in the Appendix) and a result of Agueh and Carlier [1, Prop. 7.6], we get the following.

Corollary 4.5.5. If $P$ is $\zeta$-regular then it has a barycenter $b^* \in \mathcal{S}_\zeta$.

Moreover, since SGD moves along geodesics and is initialized at $b_0 \in \operatorname{supp} P \subset \mathcal{S}_\zeta$, all the iterates of SGD stay in $\mathcal{S}_\zeta$. To show that the same holds for GD, observe that the set $\log_{b_t}(\mathcal{S}_\zeta)$ is convex. Therefore, $-\nabla F(b_t) = \int (T_{b_t \to \mu} - \mathrm{id}) \,\mathrm{d}P(\mu) \in \log_{b_t}(\mathcal{S}_\zeta)$ as a convex combination of elements of this set. This is equivalent to $b_{t+1} = \exp_{b_t}(-\nabla F(b_t)) \in \mathcal{S}_\zeta$. These observations yield the following corollary.

Corollary 4.5.6. The set 풮휁 is convex along generalized geodesics and when initialized in supp 푃 , the iterates of both GD and SGD remain in 풮휁 .

The next result establishes uniqueness and a variance inequality.

Theorem 4.5.7. Suppose $P$ is a $\zeta$-regular distribution. Then $P$ has a unique barycenter $b^* \in \mathcal{S}_\zeta$ and obeys a variance inequality with $C_{\mathrm{var}} = \zeta$.

54 Proof. By lemma A.2.2 we know that the Jacobian of the transport map between any two elements of 풮휁 has eigenvalues between 휁 and 1/휁. We can thus apply theorem 2.3.7 to get a variance inequality with 퐶var = 휁. This implies uniqueness.

We can now show a PL inequality over all of 풮휁 .

Theorem 4.5.8. Fix $\zeta \in (0, 1]$, and let $P$ be a $\zeta$-regular distribution. Then the barycenter functional $F$ satisfies the PL inequality with constant $C_{\mathrm{PL}} = \zeta^2/4$ uniformly at all $b \in \mathcal{S}_\zeta$:

$$F(b) - F(b^*) \leq \frac{2}{\zeta^2}\|\nabla F(b)\|_b^2.$$

Proof. For any $\gamma_{0,\Sigma} \in \mathcal{S}_\zeta$, the eigenvalues of $\Sigma$ are in $[\zeta, 1]$. Let $(\tilde{b}_s)_{s \in [0,1]}$ be the constant-speed geodesic between $\tilde{b}_0 := b := \gamma_{0,\Sigma}$ and $\tilde{b}_1 := b^* := \gamma_{0,\Sigma^*}$. Combining lemma 4.4.1 (with an additional use of the Cauchy-Schwarz inequality) and theorem 4.5.7, we get

$$F(b) - F(b^*) \leq \frac{2}{\zeta} \int_0^1 \int \|\nabla F(b)\|_2^2 \,\mathrm{d}\tilde{b}_s \,\mathrm{d}s. \tag{4.5.8.1}$$

Define a random variable $X_s \sim \tilde{b}_s$ and observe that

$$\int \|\nabla F(b)\|_2^2 \,\mathrm{d}\tilde{b}_s = \mathbb{E}\|(\tilde{M} - I_D) X_s\|_2^2, \quad \text{where } \tilde{M} = \int \Sigma^{-1/2} (\Sigma^{1/2} S \Sigma^{1/2})^{1/2} \Sigma^{-1/2} \,\mathrm{d}P(\gamma_{0,S}).$$

Moreover, recall that $X_s = s X_1 + (1 - s) X_0$, where $X_0 \sim \tilde{b}_0$ and $X_1 \sim \tilde{b}_1$ are optimally coupled. Therefore, by Jensen's inequality, we have for all $s \in [0, 1]$,

$$\mathbb{E}\|(\tilde{M} - I_D) X_s\|_2^2 \leq s\, \mathbb{E}\|(\tilde{M} - I_D) X_1\|_2^2 + (1 - s)\, \mathbb{E}\|(\tilde{M} - I_D) X_0\|_2^2 \leq \frac{1}{\zeta}\, \mathbb{E}\|(\tilde{M} - I_D) X_0\|_2^2,$$

where in the second inequality we used the fact that

$$\mathbb{E}\|(\tilde{M} - I_D) X_1\|_2^2 = \operatorname{tr}\big(\Sigma^* (\tilde{M} - I_D)^2\big) \leq \|\Sigma^* \Sigma^{-1}\|_{\mathrm{op}} \operatorname{tr}\big(\Sigma (\tilde{M} - I_D)^2\big) \leq \frac{1}{\zeta}\, \mathbb{E}\|(\tilde{M} - I_D) X_0\|_2^2.$$

Together with (4.5.8.1), it yields

$$F(b) - F(b^*) \leq \frac{2}{\zeta^2}\, \mathbb{E}\|(\tilde{M} - I_D) X_0\|_2^2 = \frac{2}{\zeta^2}\|\nabla F(b)\|_b^2.$$

By combining corollary 4.5.6 with theorem 4.5.8, we can instantiate theorems 4.3.4 and 4.3.5 to obtain our main results for this chapter, theorem 4.1.1 and theorem 4.1.2 respectively.

Appendix A

Additional results and omitted proofs

A.1 Additional results and omitted proofs from Chapter2

Lemma A.1.1. Suppose $\phi \colon \mathbb{R}^d \to \mathbb{R}$ is an $\alpha$-strongly convex proper lower semi-continuous function. Then for arbitrary $x \in \mathbb{R}^d$ and almost every $y \in \mathbb{R}^d$,

$$\phi(x) + \phi^*(y) \geq \langle x, y \rangle + \frac{\alpha}{2}\|x - \nabla \phi^*(y)\|_2^2.$$

Proof. Fix $x, y \in \mathbb{R}^d$, and let $x_0 \in \partial \phi^*(y)$. Then $y \in \partial \phi(x_0)$ (by, e.g., [60, Prop. 2.4]), so

$$\phi\big((1 - t) x_0 + t x\big) \geq t \langle x - x_0, y \rangle + \phi(x_0).$$

Using this, we have

$$\begin{aligned} \phi^*(y) &\geq \langle x_0, y \rangle - \phi(x_0) \\ &\geq \langle x, y \rangle + \frac{1}{t}\Big(\phi(x_0) - \phi\big((1 - t) x_0 + t x\big)\Big) - \phi(x_0) \\ &\geq \langle x, y \rangle + \phi(x_0) - \phi(x) + \frac{\alpha}{2}(1 - t)\|x - x_0\|_2^2 - \phi(x_0), \end{aligned}$$

where the last inequality follows from $\alpha$-strong convexity. Rearranging and taking $t \to 0$ yields

$$\phi(x) + \phi^*(y) \geq \langle x, y \rangle + \frac{\alpha}{2}\|x - x_0\|_2^2.$$

We know that $\phi^*$ is differentiable at almost every $y \in \mathbb{R}^d$, in which case $x_0 = \nabla \phi^*(y)$. This gives the result.

A.2 Additional results and omitted proofs from Chapter 4

A.2.1 Proof of Theorem 4.3.5

We begin by noting that the step size 휂푡 is chosen to solve the equation

$$1 - 2 C_{\mathrm{PL}} \eta_t + \eta_t^2 = \Big( \frac{t + k}{t + k + 1} \Big)^2.$$

We use the definition of Wasserstein distance to calculate:

$$W_2^2(b_{t+1}, \mu) \leq \|\log_{b_t} b_{t+1} - \log_{b_t} \mu\|_{b_t}^2 = \|\eta_t \log_{b_t} \mu_{t+1} - \log_{b_t} \mu\|_{b_t}^2$$
$$= \|\log_{b_t} \mu\|_{b_t}^2 + \eta_t^2 \|\log_{b_t} \mu_{t+1}\|_{b_t}^2 - 2 \eta_t \langle \log_{b_t} \mu, \log_{b_t} \mu_{t+1} \rangle_{b_t}.$$

Taking the expectation with respect to $(\mu, \mu_{t+1}) \sim P^{\otimes 2}$ (conditioning appropriately on the increasing sequence of $\sigma$-fields), we have

$$\mathbb{E} F(b_{t+1}) \leq \mathbb{E}\big[(1 + \eta_t^2) F(b_t) - \eta_t \|\nabla F(b_t)\|_{L^2(b_t)}^2\big].$$

Using the PL inequality (4.3.2.1),

$$\mathbb{E} F(b_{t+1}) \leq \mathbb{E}\big[(1 + \eta_t^2) F(b_t) - 2 C_{\mathrm{PL}} \eta_t [F(b_t) - F(b^*)]\big].$$

Subtracting 퐹 (푏*) and rearranging,

$$\mathbb{E} F(b_{t+1}) - F(b^*) \leq (1 - 2 C_{\mathrm{PL}} \eta_t + \eta_t^2)\,[\mathbb{E} F(b_t) - F(b^*)] + \frac{\eta_t^2}{2}\, \sigma^2(P).$$

With the chosen step size, we find

$$\mathbb{E} F(b_{t+1}) - F(b^*) \leq \Big( \frac{t + k}{t + k + 1} \Big)^2 [\mathbb{E} F(b_t) - F(b^*)] + \frac{2 \sigma^2(P)}{C_{\mathrm{PL}}^2 (t + k + 1)^2}.$$

Or equivalently,

$$(t + k + 1)^2\, [\mathbb{E} F(b_{t+1}) - F(b^*)] \leq (t + k)^2\, [\mathbb{E} F(b_t) - F(b^*)] + \frac{2 \sigma^2(P)}{C_{\mathrm{PL}}^2}.$$

Unrolling over $t = 0, 1, \ldots, n - 1$ yields

$$(n + k)^2\, [\mathbb{E} F(b_n) - F(b^*)] \leq k^2\, [\mathbb{E} F(b_0) - F(b^*)] + \frac{2 n \sigma^2(P)}{C_{\mathrm{PL}}^2},$$

or, equivalently,

$$\mathbb{E} F(b_n) - F(b^*) \leq \frac{k^2}{(n + k)^2}\, [\mathbb{E} F(b_0) - F(b^*)] + \frac{2 \sigma^2(P)}{C_{\mathrm{PL}}^2 (n + k)}. \tag{A.2.0.1}$$

To conclude the proof, recall that from (4.3.1.1), we have

$$F(b_0) - F(b^*) \leq \frac{1}{2} W_2^2(b_0, b^*).$$

Taking the expectation over 푏0 ∼ 푃 we find

$$\mathbb{E} F(b_0) - F(b^*) \leq F(b^*) = \frac{1}{2} \sigma^2(P),$$

as claimed. Together with (A.2.0.1), it yields

$$\mathbb{E} F(b_n) - F(b^*) \leq \frac{\sigma^2(P)}{n + k} \Big( \frac{k^2}{2(n + k)} + \frac{2}{C_{\mathrm{PL}}^2} \Big) \leq \frac{\sigma^2(P)}{n} \Big( \frac{k + 1}{2} + \frac{2}{C_{\mathrm{PL}}^2} \Big).$$

Plugging in the value of 푘 completes the proof.

A.2.2 Proof of Lemma 4.4.3

According to [39, Corollary 7.22], there exists a probability measure $\Pi$ on the space of constant-speed geodesics in $\mathbb{R}^d$ such that if $\gamma \sim \Pi$, then $b_s$ is the law of $\gamma(s)$. In particular, this yields

$$\int f \,\mathrm{d}b_0 - \int f \,\mathrm{d}b_1 = \int \big[ f(\gamma(0)) - f(\gamma(1)) \big] \,\mathrm{d}\Pi(\gamma).$$

We can cover the geodesic $(\gamma(s))_{s \in [0,1]}$ by finitely many open neighborhoods contained in $\Omega$ so that $f$ is Lipschitz on each such neighborhood; thus the mapping $s \mapsto f(\gamma(s))$ is Lipschitz, and we may apply the fundamental theorem of calculus, the Fubini-Tonelli

theorem, and Cauchy-Schwarz:

$$\begin{aligned} \int f \,\mathrm{d}b_0 - \int f \,\mathrm{d}b_1 &= \int \int_0^1 \big\langle \nabla f(\gamma(s)),\ \dot{\gamma}(s) \big\rangle \,\mathrm{d}s \,\mathrm{d}\Pi(\gamma) \\ &\leq \int_0^1 \int \operatorname{len}(\gamma)\, \big\|\nabla f(\gamma(s))\big\| \,\mathrm{d}\Pi(\gamma) \,\mathrm{d}s \\ &\leq \int_0^1 \Big( \int \operatorname{len}(\gamma)^2 \,\mathrm{d}\Pi(\gamma) \Big)^{1/2} \Big( \int \big\|\nabla f(\gamma(s))\big\|^2 \,\mathrm{d}\Pi(\gamma) \Big)^{1/2} \mathrm{d}s \\ &= W_2(b_0, b_1) \int_0^1 \|\nabla f\|_{L^2(b_s)} \,\mathrm{d}s. \end{aligned}$$

It yields the result.

A.2.3 Results for Bures-Wasserstein barycenters

Proposition A.2.1 (Gaussian barycenter). Fix $0 < \lambda_{\min} \leq \lambda_{\max} < \infty$. Let $P \in \mathcal{P}_2(\mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d))$ be such that for all $\mu \in \operatorname{supp} P$, $\mu = \gamma_{m(\mu), \Sigma(\mu)}$ is a Gaussian with $\lambda_{\min} I_D \preceq \Sigma(\mu) \preceq \lambda_{\max} I_D$. Let $\gamma_{m^*, \Sigma^*}$ be the Gaussian measure with mean $m^* := \int m(\mu) \,\mathrm{d}P(\mu)$ and covariance matrix $\Sigma^*$ which is a fixed point of the mapping $S \mapsto F(S) := \int (S^{1/2} \Sigma(\mu) S^{1/2})^{1/2} \,\mathrm{d}P(\mu)$. Then $\gamma_{m^*, \Sigma^*}$ is a barycenter of $P$.

Proof. To show that there exists a fixed point for the mapping $F$, apply Brouwer's fixed-point theorem as in [1, Theorem 6.1]. To see that $\gamma_{m^*,\Sigma^*}$ is indeed a barycenter, we observe that the mapping

$$\phi \colon (\mu, x) \mapsto \phi_\mu(x) := \langle x, m(\mu) \rangle + \frac{1}{2}\big\langle x - m^*,\ (\Sigma^*)^{-1/2}\big[(\Sigma^*)^{1/2} \Sigma(\mu) (\Sigma^*)^{1/2}\big]^{1/2} (\Sigma^*)^{-1/2}(x - m^*)\big\rangle$$

1 1 [휙 (푥)] = ‖푥‖2 + ‖푚*‖2. E휇∼푃 휇 2 2 2 2

This implies optimality by simply applying the dual definition of optimal transport to any

60 alternative 훾푚,Σ:

$$F(\gamma_{m,\Sigma}) \geq \iint \Big( \frac{1}{2}\|x\|_2^2 - \phi_\mu \Big) \,\mathrm{d}\gamma_{m,\Sigma} \,\mathrm{d}P(\mu) + \iint \Big( \frac{1}{2}\|x\|_2^2 - \phi_\mu^* \Big) \,\mathrm{d}\mu \,\mathrm{d}P(\mu)$$
$$= \iint \Big( \frac{1}{2}\|x\|_2^2 - \phi_\mu^* \Big) \,\mathrm{d}\mu \,\mathrm{d}P(\mu) = F(\gamma_{m^*,\Sigma^*}),$$

where we swapped integrals with the same justification as in the proof of Theorem 2.3.7.

Hence $\gamma_{m^*,\Sigma^*}$ is a barycenter of $P$.
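The fixed-point property is easy to check numerically by iterating the GD update of Algorithm 1 until stationarity and evaluating the residual $\|F(\Sigma) - \Sigma\|$. A minimal sketch of ours (NumPy/SciPy; the random test covariances are illustrative):

```python
import numpy as np
from scipy.linalg import sqrtm

def sqrtm_spd(A):
    return np.real(sqrtm(A))

def fixed_point_map(S, Sigmas):
    """F(S) = average of (S^{1/2} Sigma S^{1/2})^{1/2} over the support."""
    r = sqrtm_spd(S)
    return np.mean([sqrtm_spd(r @ K @ r) for K in Sigmas], axis=0)

rng = np.random.default_rng(0)
Sigmas = [a @ a.T + 0.5 * np.eye(3) for a in rng.standard_normal((5, 3, 3))]

S = np.eye(3)
for _ in range(200):                 # GD iterations as in Algorithm 1
    r = sqrtm_spd(S)
    r_inv = np.linalg.inv(r)
    T = np.mean([r_inv @ sqrtm_spd(r @ K @ r) @ r_inv for K in Sigmas], axis=0)
    S = T @ S @ T
print(np.linalg.norm(fixed_point_map(S, Sigmas) - S))  # residual ~ 0 at the barycenter
```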

Lemma A.2.2. Suppose there exist constants $0 < \lambda_{\min} \leq \lambda_{\max} < \infty$ such that all of the eigenvalues of $\Sigma, \Sigma' \in \mathbb{S}_{++}^D$ are bounded between $\lambda_{\min}$ and $\lambda_{\max}$, and define $\kappa = \lambda_{\max}/\lambda_{\min}$. Then the transport map from $\gamma_{0,\Sigma}$ to $\gamma_{0,\Sigma'}$ is $(\kappa^{-1}, \kappa)$-regular.

Proof. The transport map from $\gamma_{0,\Sigma}$ to $\gamma_{0,\Sigma'}$ is the map $x \mapsto \Sigma^{-1/2}(\Sigma^{1/2} \Sigma' \Sigma^{1/2})^{1/2} \Sigma^{-1/2} x$.

Throughout this proof, we write $\|\cdot\| = \|\cdot\|_{\mathrm{op}}$ for simplicity. We have the trivial bound

$$\big\|\Sigma^{-1/2}(\Sigma^{1/2} \Sigma' \Sigma^{1/2})^{1/2} \Sigma^{-1/2}\big\| \leq \sqrt{\|\Sigma^{-1}\|\, \|\Sigma^{1/2} \Sigma' \Sigma^{1/2}\|\, \|\Sigma^{-1}\|}.$$

Moreover, $\|\Sigma^{-1}\| \leq \lambda_{\min}^{-1}$ and $\|\Sigma^{1/2} \Sigma' \Sigma^{1/2}\| \leq \lambda_{\max}^2$, so that the smoothness is bounded as

$$\big\|\Sigma^{-1/2}(\Sigma^{1/2} \Sigma' \Sigma^{1/2})^{1/2} \Sigma^{-1/2}\big\| \leq \frac{\lambda_{\max}}{\lambda_{\min}}.$$

We can take advantage of the fact that Σ, Σ′ are interchangeable and infer that the strong convexity parameter of the transport map from Σ to Σ′ is the inverse of the smoothness parameter of the transport map from Σ′ to Σ. In other words,

$$\min_{1 \leq j \leq D} \lambda_j\big( \Sigma^{-1/2}(\Sigma^{1/2} \Sigma' \Sigma^{1/2})^{1/2} \Sigma^{-1/2} \big) \geq \frac{\lambda_{\min}}{\lambda_{\max}}.$$

This concludes the proof.

Lemma A.2.3. Identify measures $\rho \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$ with their densities, and let $\|\cdot\|_{L^\infty}$ denote the $L^\infty$-norm (essential supremum) w.r.t. the Lebesgue measure on $\mathbb{R}^d$. Then, for any $b, \mu_0, \mu_1 \in \mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d)$, any $s \in [0, 1]$, and almost every $x \in \mathbb{R}^d$, it holds that

$$\ln \mu_s^b\big( \nabla \phi_{b \to \mu_s^b}(x) \big) \leq (1 - s) \ln \mu_0\big( \nabla \phi_{b \to \mu_0}(x) \big) + s \ln \mu_1\big( \nabla \phi_{b \to \mu_1}(x) \big).$$

In particular, taking the essential supremum over $x$ on both sides, we deduce that the functional $\mathcal{P}_{2,\mathrm{ac}}(\mathbb{R}^d) \to (-\infty, \infty]$ given by $\rho \mapsto \ln \|\rho\|_{L^\infty}$ is convex along generalized geodesics.

Proof. Let $\rho := [(1 - s) T_{b \to \mu} + s T_{b \to \nu}]_{\#} b$ be a point on the generalized geodesic with base $b$ connecting $\mu$ to $\nu$. Let $\phi_{b \to \mu}, \phi_{b \to \nu}$ be the convex potentials whose gradients are $T_{b \to \mu}$ and $T_{b \to \nu}$ respectively. Then, for almost all $x \in \mathbb{R}^d$, the Monge-Ampère equation applied to the pairs $(b, \mu)$, $(b, \nu)$, and $(b, \rho)$ respectively yields

$$b(x) = \begin{cases} \mu\big( \nabla \phi_{b \to \mu}(x) \big) \det D_A^2 \phi_{b \to \mu}(x) \\ \nu\big( \nabla \phi_{b \to \nu}(x) \big) \det D_A^2 \phi_{b \to \nu}(x) \\ \rho\big( (1-s) \nabla \phi_{b \to \mu}(x) + s \nabla \phi_{b \to \nu}(x) \big) \det\big( (1-s) D_A^2 \phi_{b \to \mu}(x) + s D_A^2 \phi_{b \to \nu}(x) \big). \end{cases}$$

Here, $D_A^2 \phi$ denotes the Hessian of $\phi$ in the Alexandrov sense; see [60, Theorem 4.8]. Fix $x$ such that $b(x) > 0$. On the one hand, applying log-concavity of the determinant, it follows from the third Monge-Ampère equation that

$$\begin{aligned} \ln b(x) &= \ln \rho\big( (1-s) \nabla \phi_{b \to \mu}(x) + s \nabla \phi_{b \to \nu}(x) \big) + \ln \det\big( (1-s) D_A^2 \phi_{b \to \mu}(x) + s D_A^2 \phi_{b \to \nu}(x) \big) \\ &\geq \ln \rho\big( (1-s) \nabla \phi_{b \to \mu}(x) + s \nabla \phi_{b \to \nu}(x) \big) + (1-s) \ln \det D_A^2 \phi_{b \to \mu}(x) + s \ln \det D_A^2 \phi_{b \to \nu}(x). \end{aligned}$$

On the other hand, it follows from the first two Monge-Ampère equations that

$$\ln b(x) = (1-s) \ln \mu\big( \nabla \phi_{b \to \mu}(x) \big) + s \ln \nu\big( \nabla \phi_{b \to \nu}(x) \big) + (1-s) \ln \det D_A^2 \phi_{b \to \mu}(x) + s \ln \det D_A^2 \phi_{b \to \nu}(x).$$

The above two displays yield

$$\ln \rho\big( (1-s) \nabla \phi_{b \to \mu}(x) + s \nabla \phi_{b \to \nu}(x) \big) \leq (1-s) \ln \mu\big( \nabla \phi_{b \to \mu}(x) \big) + s \ln \nu\big( \nabla \phi_{b \to \nu}(x) \big).$$

It yields the result.

Appendix B

Experiments for Bures-Wasserstein barycenters

In this section, we demonstrate the linear convergence of GD, the fast rate of estimation for SGD, and some potential advantages of averaging the SGD iterates, by way of numerical experiments. In evaluating SGD, we also include a variant that involves sampling with replacement from the empirical distribution.

B.0.1 Simulations for the Bures manifold

First, we begin by illustrating how SGD indeed achieves the fast rate of convergence to the true barycenter on the Bures manifold, as indicated by Theorem 4.1.2.

To generate distributions with a known barycenter, we use the following fact: if the mean of the distribution $(\log_{b^\star})_{\#} P$ is $0$, then $b^\star$ is a barycenter of $P$. This fact follows from our PL inequality (Theorem 4.5.8), or alternatively from general arguments in [48, Theorem 2]. We also use the fact that the tangent space of the Bures manifold is given by the set of all symmetric matrices [9].

Figure 4-2 shows convergence of SGD for distributions on the Bures manifold. To generate a sample, we let $A_i$ be a matrix with i.i.d. $\gamma_{0,\sigma^2}$ entries. Our random sample on the Bures manifold is then given by

$$\Sigma_i = \exp_{\gamma_{0,I_D}}\Big( \frac{A_i + A_i^\top}{2} \Big), \tag{B.0.0.1}$$

[Figure B-1: Log-log plot of convergence for SGD on the Bures manifold for $n = 1000$, $d = 3$, and $b^\star = \gamma_{0,I_3}$; the fitted slope is $-1.014 \pm 0.004$. This corresponds to the experiment on the left in Figure 4-2.]

which has population barycenter $b^\star = \gamma_{0,I_D}$. An explicit form of this exponential map is derived in [40]. We run two versions of SGD. The first variant uses each sample only once, and passes over the data once. The second variant samples from $\Sigma_1, \ldots, \Sigma_n$ with replacement at each iteration and takes the stochastic gradient step towards the selected matrix. For the resulting sequences, we also show the results of averaging the iterates. Specifically, if $(b_t)_{t \in \mathbb{N}}$ is the sequence generated by SGD, then the averaged sequence is given by $\tilde{b}_0 = b_0$ and

$$\tilde{b}_{t+1} = \Big[ \frac{t}{t+1}\, \mathrm{id} + \frac{1}{t+1}\, T_{\tilde{b}_t \to b_{t+1}} \Big]_{\#} \tilde{b}_t.$$
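Using the identification of tangent vectors with symmetric matrices, $\exp_{\gamma_{0,I}}(S) = (I + S)^2$ whenever $I + S \succ 0$ (cf. [40]), the sampling step (B.0.0.1) can be sketched as follows; the rejection of invalid tangent vectors is our own practical safeguard, which slightly perturbs the population barycenter, though rejections are rare at this noise level:

```python
import numpy as np

def sample_bures(n, d, sigma, rng):
    """Draw covariances Sigma_i = exp_{gamma_{0,I}}((A_i + A_i^T)/2) as in (B.0.0.1),
    assuming exp_{gamma_{0,I}}(S) = (I + S)^2 for symmetric S with I + S > 0."""
    I, out = np.eye(d), []
    while len(out) < n:
        A = sigma * rng.standard_normal((d, d))
        S = (A + A.T) / 2
        if np.linalg.eigvalsh(I + S).min() > 0:   # keep only valid tangent vectors
            out.append((I + S) @ (I + S))
    return out

rng = np.random.default_rng(0)
covs = sample_bures(1000, 3, 0.5, rng)            # sigma^2 = 0.25, as in the text
```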

On Riemannian manifolds, averaged SGD is known to attain optimal statistical rates under smoothness and geodesic convexity assumptions [59].

Here, we generate 100 datasets of size $n = 1000$ in the way specified above and set $\sigma^2 = 0.25$. In this experiment, the SGD step size is chosen to be $\eta_t = 2/[0.7 \cdot (t + 2/0.7 + 1)]$. The results from these 100 datasets are then averaged for each algorithm, and we also display 95% confidence bands for the resulting sequences. As is clear from the log-log plot in Figure B-1, SGD achieves the fast $O(n^{-1})$ statistical rate on this dataset.

The right of Figure 4-2 shows convergence of GD to the empirical barycenter and true barycenter. We generate samples in the same way as before. This linear convergence was observed previously by [6].

In Figure B-2, we repeat the same experiment, except this time the barycenter has covariance matrix

[Figure B-2: Convergence of SGD on the Bures manifold. Here, $n = 1000$, $d = 3$, and the barycenter is given by $\mathrm{diag}(20, 1, 1)$; the fitted slope on the log-log plot is $-1.05 \pm 0.031$. The result displays the average over 100 randomly generated datasets.]

$$\Sigma^\star = \begin{pmatrix} 20 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix},$$

and the entries of $A_i$ are drawn i.i.d. from $\gamma_{0,1}$. In this situation, the condition numbers of the matrices generated according to this distribution are typically much larger than those centered around $\gamma_{0,I_3}$. To account for a potentially smaller PL constant, we chose

$\eta_t = 2/[0.1 \cdot (t + 2/0.1 + 1)]$. It is again clear from the right pane of Figure B-2 that SGD achieves the fast $O(n^{-1})$ statistical rate on this dataset. To account for the initially slow convergence, we only fit the line to the last 500 iterations. We also note that averaging yields drastically better performance in this case, which we are currently unable to justify theoretically.

Figure B-3 shows convergence of SGD with replacement to the empirical barycenter. We generate $n = 500$ samples in the same way as in Figure 4-2, where the true barycenter is $I_3$ and $\sigma^2 = 0.25$. We calculate the error obtained by the empirical barycenter by running GD on this dataset until convergence, which is displayed with the green line. We also calculate the error obtained by a single pass of SGD, which is given by the blue line. SGD with replacement is then run for 5000 iterations, and we observe that it does indeed achieve better error than single-pass SGD if run for long enough. SGD with replacement converges to the empirical barycenter, albeit at a slow rate.

[Figure B-3: Convergence of SGD with replacement on the Bures manifold, compared with GD and single-pass SGD. Here, $n = 500$, $d = 3$, and the distribution is given by (B.0.0.1) with $\Sigma^\star = I_3$ and $\sigma^2 = 0.25$. The result displays the average over 100 randomly generated datasets.]

B.0.2 Details of the non-convexity example

We consider the example of the Wasserstein metric restricted to centered Gaussian measures, which induces the Bures metric on positive definite matrices. Even restricted to such Gaussian measures, the Wasserstein barycenter objective is geodesically non-convex, despite the fact that it is Euclidean convex [62]. Figure 4-1 gives a simulated example of this fact. This figure plots the squared Bures distance between a positive definite matrix $C$ and points along a geodesic $\gamma$, which runs between two matrices $A$ and $B$. The matrices used in this example are

$$A = \begin{pmatrix} 0.8 & -0.4 \\ -0.4 & 0.3 \end{pmatrix}, \qquad B = \begin{pmatrix} 0.3 & -0.5 \\ -0.5 & 1.0 \end{pmatrix}, \qquad C = \begin{pmatrix} 0.5 & 0.5 \\ 0.5 & 0.6 \end{pmatrix},$$

and $\gamma(t)$, $t \in [0, 1]$, is taken to be the Bures or Euclidean geodesic from $A$ to $B$ (the Euclidean geodesic is given by $t \mapsto (1-t)A + tB$). This function is clearly non-convex, and therefore we cannot assume that there is some underlying strong convexity (although the Bures distance is in fact strongly geodesically convex on sufficiently small balls [29]).
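The curves in Figure 4-1 can be reproduced directly from these matrices using the closed form $W_2^2(\gamma_{0,\Sigma_0}, \gamma_{0,\Sigma_1}) = \operatorname{tr} \Sigma_0 + \operatorname{tr} \Sigma_1 - 2 \operatorname{tr} (\Sigma_0^{1/2} \Sigma_1 \Sigma_0^{1/2})^{1/2}$. A minimal sketch of ours (NumPy/SciPy):

```python
import numpy as np
from scipy.linalg import sqrtm

def sqrtm_spd(X):
    return np.real(sqrtm(X))

def w2sq(S0, S1):
    """Closed-form squared Bures-Wasserstein distance between centered Gaussians."""
    r = sqrtm_spd(S0)
    return np.trace(S0) + np.trace(S1) - 2 * np.trace(sqrtm_spd(r @ S1 @ r))

A = np.array([[0.8, -0.4], [-0.4, 0.3]])
B = np.array([[0.3, -0.5], [-0.5, 1.0]])
C = np.array([[0.5, 0.5], [0.5, 0.6]])

rA = sqrtm_spd(A)
M = np.linalg.inv(rA) @ sqrtm_spd(rA @ B @ rA) @ np.linalg.inv(rA)  # OT matrix A -> B
for t in np.linspace(0.0, 1.0, 6):
    Mt = (1 - t) * np.eye(2) + t * M
    # Bures geodesic covariance is Mt A Mt; Euclidean geodesic is (1-t)A + tB
    print(round(t, 1), w2sq(C, Mt @ A @ Mt), w2sq(C, (1 - t) * A + t * B))
```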

Bibliography

[1] Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.

[2] Martial Agueh and Guillaume Carlier. Vers un théorème de la limite centrale dans l'espace de Wasserstein? Comptes Rendus Mathématique, 355(7):812–818, 2017.

[3] Adil Ahidar-Coutrix, Thibaut Le Gouic, and Quentin Paris. Convergence rates for empirical barycenters in metric spaces: curvature, convexity and extendable geodesics. Probability Theory and Related Fields, pages 1–46.

[4] A. D. Aleksandrov. A theorem on triangles in a metric space and some of its applications. 38:5–23, 1951.

[5] Stephanie Alexander, Vitali Kapovitch, and Anton Petrunin. Alexandrov geometry: preliminary version no. 1. arXiv preprint arXiv:1903.08539, 2019.

[6] Pedro C. Álvarez-Esteban, E. del Barrio, J. A. Cuesta-Albertos, and C. Matrán. A fixed-point approach to barycenters in Wasserstein space. Journal of Mathematical Analysis and Applications, 441(2):744–762, 2016.

[7] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.

[8] Luigi Ambrosio, Federico Glaudo, and Dario Trevisan. On the optimal map in the 2-dimensional random matching problem. arXiv preprint arXiv:1903.12153, 2019.

[9] Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures-Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 2018.

[10] Rabi Bhattacharya and Vic Patrangenaru. Large sample theory of intrinsic and extrinsic sample means on manifolds. The Annals of Statistics, 31(1):1–29, 2003.

[11] Rabi Bhattacharya and Vic Patrangenaru. Large sample theory of intrinsic and extrinsic sample means on manifolds—II. The Annals of Statistics, 33(3):1225–1259, 2005.

[12] Jérémie Bigot, Elsa Cazelles, and Nicolas Papadakis. Data-driven regularization of Wasserstein barycenters with an application to multivariate density registration. Information and Inference: A Journal of the IMA, 8(4):719–755, 2019.

[13] Jérémie Bigot, Raúl Gouet, Thierry Klein, and Alfredo Lopez. Upper and lower risk bounds for estimating the Wasserstein barycenter of random measures on the real line. Electronic Journal of Statistics, 12(2):2253–2289, 2018.

[14] Emmanuel Boissard, Thibaut Le Gouic, and Jean-Michel Loubes. Distribution's template estimate with Wasserstein metrics. Bernoulli, 21(2):740–759, 2015.

[15] Nicolas Bonneel, Gabriel Peyré, and Marco Cuturi. Wasserstein barycentric coordinates: histogram regression using optimal transport. ACM Transactions on Graphics, 35(4), 2016.

[16] Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375–417, 1991.

[17] Martin R. Bridson and André Haefliger. Metric spaces of non-positive curvature, volume 319. Springer Science & Business Media, 2013.

[18] Dmitri Burago, Yuri Burago, and Sergei Ivanov. A course in metric geometry, volume 33. American Mathematical Society, 2001.

[19] Yu. Burago, Mikhail Gromov, and Gregory Perel'man. A. D. Alexandrov spaces with curvature bounded below. Russian Mathematical Surveys, 47(2):1, 1992.

[20] Donald Bures. An extension of Kakutani's theorem on infinite product measures to the tensor product of semifinite w*-algebras. Transactions of the American Mathematical Society, 135:199–212, 1969.

[21] Guillaume Carlier and Ivar Ekeland. Matching for teams. Economic Theory, 42(2):397–418, 2010.

[22] Sinho Chewi, Tyler Maunu, Philippe Rigollet, and Austin J. Stromme. Gradient descent algorithms for Bures-Wasserstein barycenters. arXiv preprint arXiv:2001.01700, 2020.

[23] Marco Cuturi. Sinkhorn distances: lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

[24] Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685–693, 2014.

[25] Federico Glaudo. On the c-concavity with respect to the quadratic cost on a manifold. Nonlinear Analysis, 178:145–151, 2019.

[26] Thibaut Le Gouic, Quentin Paris, Philippe Rigollet, and Austin J. Stromme. Fast convergence of empirical barycenters in Alexandrov spaces and the Wasserstein space. arXiv preprint arXiv:1908.00828, 2019.

[27] Alexandre Gramfort, Gabriel Peyré, and Marco Cuturi. Fast optimal transport averaging of neuroimaging data. In International Conference on Information Processing in Medical Imaging, pages 261–272. Springer, 2015.

[28] Mikhael Gromov. Sign and geometric meaning of curvature. Rendiconti del Seminario Matematico e Fisico di Milano, 61(1):9–123, 1991.

[29] W. Huang, K. A. Gallivan, and P.-A. Absil. A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM Journal on Optimization, 25(3):1660–1685, 2015.

[30] Jan-Christian Hütter and Philippe Rigollet. Minimax rates of estimation for smooth optimal transport maps. arXiv preprint arXiv:1905.05828, 2019.

[31] Hermann Karcher. Riemannian center of mass and so-called Karcher mean. arXiv preprint arXiv:1407.2087, 2014.

[32] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.

[33] Wilfrid S. Kendall and Huiling Le. Limit theorems for empirical Fréchet means of independent and non-identically distributed manifold-valued random variables. Brazilian Journal of Probability and Statistics, 25(3):323–352, 2011.

[34] Martin Knott and Cyril S. Smith. On the optimal mapping of distributions. Journal of Optimization Theory and Applications, 43(1):39–49, 1984.

[35] Martin Knott and Cyril S. Smith. On a generalization of cyclic monotonicity and distances among random vectors. Linear Algebra and its Applications, 199:363–371, 1994.

[36] Alexey Kroshnin, Vladimir Spokoiny, and Alexandra Suvorikova. Statistical inference for Bures-Wasserstein barycenters. arXiv preprint arXiv:1901.00226, 2019.

[37] Thibaut Le Gouic. Dual and multimarginal problems for the Wasserstein barycenter. Unpublished, 2020.

[38] Thibaut Le Gouic and Jean-Michel Loubes. Existence and consistency of Wasserstein barycenters. Probability Theory and Related Fields, August 2017.

[39] John Lott and Cédric Villani. Ricci curvature for metric-measure spaces via optimal transport. Annals of Mathematics, pages 903–991, 2009.

[40] L. Malagò, L. Montrucchio, and G. Pistone. Wasserstein Riemannian geometry of Gaussian densities. Information Geometry, 1(2):137–179, 2018.

[41] Robert J. McCann. A convexity principle for interacting gases. Advances in Mathematics, 128(1):153–179, 1997.

[42] Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie Royale des Sciences de Paris, 1781.

[43] Shin-ichi Ohta. Convexities of metric spaces. Geometriae Dedicata, 125(1):225–250, 2007.

[44] Shin-ichi Ohta. Barycenters in Alexandrov spaces of curvature bounded below. Advances in Geometry, 12(4):571–587, 2012.

[45] Ingram Olkin and Svetlozar T. Rachev. Maximum submatrix traces for positive definite matrices. SIAM Journal on Matrix Analysis and Applications, 14(2):390–397, 1993.

[46] Victor M. Panaretos and Yoav Zemel. Statistical aspects of Wasserstein distances. Annual Review of Statistics and Its Application, 6:405–431, 2019.

[47] Victor M. Panaretos and Yoav Zemel. Amplitude and phase variation of point processes. The Annals of Statistics, 44(2):771–812, 2016.

[48] V. M. Panaretos and Y. Zemel. Fréchet means and Procrustes analysis in Wasserstein space. Bernoulli, to be published, 2017.

[49] G. Peyré and M. Cuturi. Computational optimal transport. arXiv e-prints, March 2018.

[50] Julien Rabin and Nicolas Papadakis. Convex color image segmentation with optimal transport distances. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 256–269. Springer, 2015.

[51] Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011.

[52] Filippo Santambrogio. Optimal Transport for Applied Mathematicians. Birkhäuser, 2015.

[53] Filippo Santambrogio and Xu-Jia Wang. Convexity of the support of the displacement interpolation: counterexamples. Applied Mathematics Letters, 58:152–158, 2016.

[54] Christof Schötz. Convergence rates for the generalized Fréchet mean via the quadruple inequality. Electronic Journal of Statistics, 13(2):4280–4345, 2019.

[55] Justin Solomon, Fernando de Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):66, 2015.

[56] Sanvesh Srivastava, Cheng Li, and David B. Dunson. Scalable Bayes via barycenter in Wasserstein space. The Journal of Machine Learning Research, 19(1):312–346, 2018.

[57] Karl-Theodor Sturm. Metric spaces of lower bounded curvature. Expositiones Mathematicae, 17, 1999.

[58] Karl-Theodor Sturm. Probability measures on metric spaces of nonpositive curvature. Heat Kernels and Analysis on Manifolds, Graphs, and Metric Spaces, 338:357, 2003.

[59] Nilesh Tripuraneni, Nicolas Flammarion, Francis Bach, and Michael I. Jordan. Averaging stochastic gradient descent on Riemannian manifolds. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference on Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 650–687, 2018.

[60] Cédric Villani. Topics in optimal transportation, volume 58 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2003.

[61] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

[62] Melanie Weber and Suvrit Sra. Nonconvex stochastic optimization on manifolds via Riemannian Frank-Wolfe methods. arXiv preprint arXiv:1910.04194, 2019.
