<<

NEW TECHNIQUES IN OPTIMAL TRANSPORT A Thesis Submitted to the Faculty in partial fulfillment of the requirements for the degree of Doctor of Philosophy in by James Gordon Ronan DARTMOUTH COLLEGE Hanover, New Hampshire May 2021

Examining Committee: AnneGeet Anne Gelb, Chair Mather Pour Matthew Parno

Peter Doyle

Douglas Cochran

F. Jon Kull, Ph.D. Dean of the Guarini School of Graduate and Advanced Studies

Abstract

This thesis develops a new technique for applications of optimal transport and presents anewperspectiveonoptimaltransportthroughthemeasuretheoretictooloftransi- tion kernels. Optimal transport provides a way of lifting a distance metric on a space to probability measures over that space. This makes the field well suited for certain types of image analysis. Part of this thesis focuses on a new application for optimal transport, while the other focuses on a new approach to optimal transport itself. With respect to the first part of this thesis, we propose using semi-discrete optimal transport for the estimation of parameters in physical scenes and show how to do so. Optimal transport is a natural setting when studying images because displacements of the objects in the image directly correspond to a change in the optimal transport cost. In the second part of this thesis we discuss transition kernels, which provide a mathematical tool that can be used to map measures to measures. It therefore seems intuitive to incorporate transition kernels into optimal transport problems. However, this requires changing the traditional perspective of viewing optimal transport as primarily a tool to distances between two fixed measures. To that end, this thesis develops theory to show how kernels may be used to extend optimal transport to signed measures.

ii Preface

IamfortunateandgratefultobewhereIamtoday.Myparentshavesupportedand encouraged me along the way, and pushed me to take all of the opportunities that I have been given. My sincere gratitude goes to my advisor Anne Gelb for accepting me as her odd duck of a student. She has helped me to grow into a better mathematician and think about how mathematics should serve the world. She and my secondary advisor, Matthew Parno have been resiliently positive and optimistic despite the challenges of the past year. I am excited to continue to work with and to learn from them in the future. Thank you to the remainder of my committee, Douglas Cochran and Peter Doyle, for meeting with me and helping to improve this thesis.

iii Contents

Abstract...... ii Preface ...... iii

1Overview 1

2 Preliminaries 4 2.1 Measure Theory Background ...... 4 2.2 OT Background ...... 10 2.2.1 Historical development of optimal transport ...... 10 2.2.2 The Kantorovich Problem ...... 11 2.2.3 Duality Theory ...... 17 2.3 Wasserstein Distances ...... 25 2.3.1 Examples of Wasserstein distances ...... 29 2.3.2 Geodesics in Wasserstein Space ...... 31 2.4 1-D OT ...... 35 2.4.1 One Dimensional Transport ...... 36 2.4.2 Structure of Monotone Coupling ...... 38 2.4.3 c-Cyclic Monotonicity in One Dimension ...... 41 2.4.4 OptimalTransportMaps...... 44

3 SDOT 45

iv 3.1 SDOT ...... 46 3.1.1 Laguerre Cells ...... 47 3.1.2 Theoretical Results on SDOT ...... 52 3.2 AlgorithmicSDOT ...... 54 3.2.1 AlgorithmicModifications ...... 60 3.2.2 Other Regularized Optimal Transport ...... 66 3.2.3 Quantization ...... 67

4ParameterEstimation 69 4.1 Simple Examples ...... 70 4.1.1 Centers of Mass ...... 70 4.1.2 AngleofRotation...... 73 4.1.3 Rotating and Translating Object ...... 76 4.2 Misfits in Time ...... 77 4.2.1 Velocity Estimation ...... 78 4.2.2 Colliding Balls ...... 80 4.3 Cantilever Beam ...... 85 4.3.1 Model Set-up ...... 86 4.3.2 Results ...... 90 4.3.3 AdjointGradient ...... 92 4.4 Further Questions ...... 98 4.4.1 Representation of Objects ...... 98 4.4.2 Time Sensitivity ...... 99 4.4.3 BlurringandNoise ...... 99 4.4.4 AnalysisofSolutionOperators...... 99

5 OT Kernels 101

v 5.1 Kernels Background ...... 102 5.2 Kernels for OT ...... 105 5.2.1 Optimal transport kernels ...... 107 5.2.2 Geodesics and Kernels ...... 112 5.3 Signed OT with Kernels ...... 114 5.3.1 One Dimensional Signed Optimal Transport ...... 118 5.4 FutureQuestions ...... 123

6 Conclusion 124

References 127

vi Chapter 1

Overview

Optimal transport is an area of mathematics that combines analysis, probability and geometry while o↵ering applications in diverse areas. Optimal transport studies various kinds of transport costs that describe how to rearrange one measure into another. The initial description of optimal transport codified these rearrangements as transport maps, T , and the mass of the measure at x was sent to T (x). Just as there are many ways to go from point A to point B, there are many di↵erent possible transport maps which send one measure into another. Optimal transport is a way of assigning a cost to these rearrangements and understanding the transport map with the lowest total cost. This can be useful when looking at physical systems because it acts like Occam’s razor. The optimal transport map is the one that required the least e↵ort to get things done. While an optimal map might not perfectly reflect what happened, in some ways it is the simplest. It is important to note that a transport map sends one measure to another on the global scale, but it does so by coordinating the local paths at each point supported in the measure. The optimal transport cost represents perfect cooperation, where every point acts in accordance with everyone else to minimize the cost. Hence we see that optimal transport exists in a perfect world while still allowing us to understand our

1 Overview Overview imperfect one. The roots of optimal transport date back to Monge in 1781, and the idea was originally motivated by Monge’s interest in quantifying the amount of work it would take to excavate a hole of a particular shape and construct a pile in another shape [38]. While we have moved on from focusing on digging ditches, applications are an integral part of optimal transport to this day. In 1942 Kantorovich o↵ered a new perspective and re-framed the problem and showed the field’s applicability to economic problems [23]. The trends of re-interpretation and application can be found running through the history of optimal transport. A notable re-formulation was to view the problem in a continuous time setting in [8], which opened up the geometric aspects of the field which were explored further in [35, 40, 33]. The di↵erent formulations of the problems lent themselves to di↵ering applications and improved implementations that reinforce each other – more applications become feasible as the implementations improve, while the increase in potential applications drives the desire for better implementations. A few notable implementations include using an elliptic partial di↵erential equation (PDE) formulation of the problem, [9], entropic regularization, which adds a regularization term to the objective function of the problem, [14, 51], and semi-discrete optimal transport (SDOT) which exploits connections to computational geometry that arise when the class of measures un- der consideration is restricted, [24, 26, 28, 37]. Applications arise in diverse fields including economics [36], fluid dynamics [21], and seismic imaging [16, 17, 18]. This thesis presents a new way to apply optimal transport for parameter estima- tion. We show how to apply SDOT to form misfit functions for parameter estimation of physical systems. We show how we can use quantitization to create a discrete representation of the objects in the system, and use SDOT to compare our model to the observation. This approach can be used for a variety of applications and we

2 Overview Overview demonstrate this through examples. Additionally, this thesis demonstrates that there is a latent measure theoretic ker- nel structure in optimal transport, which allows us to restore some of the functionality that has been lost in reformulations of the problem. The original formulation of opti- mal transport su↵ered from many disadvantages, but when it was reframed in terms of couplings we lost the ability to send multiple measures through the same map. Kernels allow us to understand couplings in a new way to restore some of the lost functionality. We discuss how they may be exploited as a new approach to extending optimal transport to signed measures. This remainder of this thesis proceeds in four parts:

(a) Chapter 2 reviews the conceptual and mathematical foundations of optimal transport. This provides a somewhat broad introduction to the subject and establishes the notational conventions used throughout the thesis.

(b) Chapter 3 presents a focused review of semi-discrete optimal transport. Both the theoretical background of the area as well as topics related to the imple- mentation are discussed.

(c) Chapter 4 demonstrates how to construct an optimal transport misfit function and provides examples. We show how this formulation can be used to estimate parameters of physical systems.

(d) Chapter 5 introduces the kernel based framework and presents a new approach to signed optimal transport. We conclude this chapter by focusing on the one dimensional case.

Small results are presented throughout the thesis, with the major contributions taking place in Chapters 4 and 5.

3 Chapter 2

Preliminaries

The purpose of this chapter is to serve as an easy reference for the necessary back- ground on optimal transport. This includes first defining and introducing our notation for measure theoretic concepts before providing an introduction to optimal transport. A few examples are included in this chapter to illustrate key ideas.

Section 2.1 Definitions and Conventions in Measure Theory

Measure theory provides an essential building block of optimal transport. This section therefore summarizes measure theoretic concepts used in the thesis. A more thorough background may be found in standard analysis texts such as [20, 48] and probability focused texts such as [13, 25]. A curated review of the measure theory necessary for optimal transport is provided in [5, Ch. 5].

At the heart of measure theory is the intertwined triple (X, M,µ)whereX is aset,M is a -algebra of subsets of X,andµ is a measure. Elements of M are called measurable sets and the pair (X, M) is called a measurable space. When the -algebra is understood from context, X itself may be called a measurable space.

Both signed measures µ : M R and positive measures (more properly called ! [{1} 4 2.1 Measure Theory Background Preliminaries non-negative measures) µ : M [0, ] are used in this thesis. A measure may be ! 1 restricted to a subspace and sub-algebra. If µ is a measure on X, then we denote the restriction of µ to Y X as µ Y . ⇢ A X with metric dist(x, y)iscompletewhenallCauchysequences converge and is separable when it contains a dense countable subset. If X is a , the smallest -algebra containing all open subsets of X is called the Borel -algebra. Measures on Borel -algebras are called Borel measures.

Definition 2.1 ([56, Pg. XX]). A is a complete, separable metric space with the metric topology generated by open balls of the distance metric and the corresponding Borel -algebra.

Following [56, Pg. XX], the convention of this thesis is that all measures which we consider will be Borel measures on Polish Spaces. For a Polish space, the triple

(X, B,µ)canalsobethoughtofas(X, dist,µ)becausethe-algebra is generated from the distance dist.

Definition 2.2 ([5, Pg. 105]). The support supp(µ)ofapositivemeasureisthe defined by

supp(µ):= x X : µ(U) > 0foreachneighborhoodU of x . { 2 }

If supp(µ) A, then we say that µ is concentrated on A. ⇢

We say a property holds µ almost everywhere, or µ-a.e. or simply a.e., if the set of points N where the property does not hold has measure zero, i.e. µ(N)=0.Thisis common for the uniqueness of functions in integral conditions, i.e. if E fdµ = E gdµ for all sets E M, this implies that f = g a.e. R R 2

Definition 2.3 ([20, Pg. 87]). Positive measures µ, ⌫ are said to be mutually

5 2.1 Measure Theory Background Preliminaries singular if there are disjoint sets A and B with A B = X and µ(A)=0and [ ⌫(B)=0.Thisiswrittenµ ⌫. ?

Theorem 2.4 (The Jordan Decomposition Theorem [20, Pg. 87]). If ⌫ is a signed

+ + measure, there exist unique positive measures ⌫ and ⌫ such that ⌫ = ⌫ ⌫ and + ⌫ ⌫. ?

Definition 2.5 ([20, Pg. 88]). We say a signed measure ⌫ is absolutely continuous with respect to a positive measure µ if whenever µ(E)=0,then⌫(E)=0.Wedenote this as ⌫ µ. ⌧

We will say that a positive measure ⌫ is dominated by a positive measure µ if

⌫(E) µ(E)forallE M.  2 The integral of a measure µ on a space X is µ(X). The mass of a measure µ is

+ µ (X)+µ(X). For positive measures the mass and integral are the same. For a

+ + signed measure ⌫ = ⌫ ⌫, the total variation measure is defined as ⌫ = ⌫ + ⌫, | | so the mass of a measure is the integral of the total variation of the measure. The space of signed Borel measures with finite mass on X is denoted M(X), and the space of positive measures is denoted by M+(X). A positive measure with mass equal to 1 is a probability measure and the space of Borel probability measures is

P(X).

Definition 2.6 ([20, Pg. 43]). For two polish spaces (X, BX )and(Y,BY ), a function

1 T : X Y is a measurable map if T (E) B for all E B . 7! 2 X 2 Y

The class of measurable functions is broad but not all encompassing. However, limiting our discussion to measurable maps should not be seen as a constraint. A function f : X R is a if it is a measurable map between X 7! and R with the Borel -algebra on R.

6 2.1 Measure Theory Background Preliminaries

Definition 2.7 ([50, Pg. 3]). On a metric space X, a function f : X R is 7! [ {1} lower semicontinuous if for every sequence x x, f(x) lim inf f(x ). n !  n

Definition 2.8 ([20, Pg. 314]). Let (X, BX ,µ)beameasurespace,(Y,BY )bea measurable space, and T : X Y be a measurable map. Then T induces a push- 7! forward measure, or image measure, T#µ on Y via

1 T#µ(E):=µ(T (E)) for all E B . 2 Y

Integration of a function f on Y with respect to a push-forward measure T#µ is given by

f(y)dT#µ(y)= f(T (x))dµ(x). ZY ZX Push-forward measures play a large role in historical and modern optimal transport theory, and an important class of push-forward measures comes from projection maps. To this end, let X and Y be two Polish spaces (this condition is unnecessary, but present because our default space is a Polish space), then X Y is the product space ⇥ which is also a Polish space with the product -algebra.1 In this thesis we will use proj and proj to represent the maps from X Y X and X Y Y given by x y ⇥ 7! ⇥ 7! projx((x, y)) = x and projy((x, y)) = y.

Definition 2.9 ([20, Pg. 53]). A measurable function f is said to be integrable with respect to µ if f dµ < and we write f L1(µ). X | | 1 2 R Complex measures do not take on the value on any set. In this thesis we will 1 primarily be concerned with probability measures, which also do not take on the value

1A complete metric can be put on this space, and a countable product of separable spaces will still be separable.

7 2.1 Measure Theory Background Preliminaries

. Observe in Theorem 2.10 that one of the measures is complex. This is done to 1 avoid the complications caused by taking the value . 1 Theorem 2.10 (The Theorem of Lebesque-Radon-Nikodym [48, Pg. 121]). Recall a positive measure µ on X is said to be -finite if there is a countable union of sets E such that µ(E ) < and X = E . Let µ be a positive -finite measure on a i i 1 [ i -algebra M, and let be a complex measure on M.

(a) There is a unique pair of complex measures a and s on M such that

= + , µ, µ. a s a ⌧ s ?

If is positive and finite, then so are a and s.

(b) There is a unique h L1(µ) such that 2

a(E)= h(x)dµ(x) ZE

for every set E M. The function h is called the Radon-Nikodym derivative of 2 da a with respect to µ and we write da = hdµ and h = dµ .

The condition that is not infinite on any set ensures that the Radon-Nikodym derivative h is integrable.

Definition 2.11 ([13, Pg. 17]). A positive measure µ on a space X has an atom at apointx if µ( x ) > 0. If a measure has no atoms, then it is atom-less. { } Of course the -function has an atom at the origin. We will denote a -function centered at x as x. An exception will be when we look at -function on the real line centered at an integer value i, which will be written as (x i). This is to avoid confusion with the common notation for the Kronecker -function, which is not used in this thesis.

8 2.1 Measure Theory Background Preliminaries

We refer to the Radon-Nikodym derivative of a measure µ with respect to ⌫ as the density with respect to ⌫. In the case that µ is absolutely continuous to the Lebesgue measure on Rn then we simply say that µ has a density and refer to it as µ(x)dx. A measure may be atom-less without being absolutely continuous with respect to the reference measure. Consider a probability measure supported on the line from

(0, 0) to (0, 1) in R2. Theorem 2.12, while not essential to the general understanding of this thesis, is included since it is foundational to optimal transport.

Theorem 2.12 (Disintegration, [5, Pg. 121]). Let X and Y be Polish spaces2 and

µ P(X), and T : X Y be a Borel map and let ⌫ = T µ P(X). Then there 2 7! # 2 exists a ⌫-a.e. uniquely determined Borel family of probability measures µy y Y { } 2 ⇢ P(X) such that

1 µ (X T (x)) = 0 for ⌫-a.e. x X y \ 2 and

f(x)dµ(x)= f(x)dµy(x) d⌫(y) 1 ZX ZY ✓ZT (x) ◆ for every Borel map f : X [0, ]. 7! 1

Our default notion of convergence of measures will be weak convergence which is given by duality with bounded continuous test functions. This induces the on P(X). Theorem 2.13, which can be found in [5, Pg. 108], provides a key result in understanding the weak topology on P(X).3

Theorem 2.13 (Prokhorov). If a set P(X) is tight, i.e. if K ⇢

✏ > 0 K compact in X such that µ(X K ) < ✏ µ , 8 9 ✏ \ ✏ 8 2 K 2Note that this theorem holds for a broader class of spaces, Radon spaces. We limit it here to keep our attention on Polish spaces, which is the environment for optimal transport. 3This is of course not the original citation, however it serves as a readily accessible reference for those interested in optimal transport

9 2.2 OT Background Preliminaries then is relatively compact4 in P(X). Conversely when X is a Polish Space, then K every compact subset of P(X) is tight.

Section 2.2 Optimal Transport Background

We now present an abbreviated introduction to optimal transport. Additional intro- ductory material may be found in [4, 36, 49]. For broader, but still non-exhaustive background, we point to [5, 50, 55, 56]. Finally, [2] provides information for a broad set of methods to which optimal transport methods are often compared, however those comparisons are not made in this work and optimal transport is not discussed in [2]. While Monge’s original interest stemmed from an interest in the distribution of aphysicalquantity,optimaltransporthasmanyareasofapplicability,andeconomic interpretations often help understand the problem more intuitively. This perspective is used in many treatments of optimal transport, see e.g. [36, 50, 56].

2.2.1. Historical development of optimal transport We start our introduction of optimal transport by informally examining the original Monge formulation of the problem.

We start with two Polish spaces (X, B ,µ), and (Y,B , ⌫)whereµ P(X) X Y 2 and ⌫ P(Y ) are probability measures. Connecting these spaces is a cost function 2 c : X Y R . Monge’s version of optimal transport may be formulated as ⇥ 7! [ {1} the minimization of the functional

T c (x, T (x)) dµ(x) 7! ZX 4A set is relatively compact if its closure is compact.

10 2.2 OT Background Preliminaries

among all measurable maps T such that T#µ = ⌫. Here the cost function is c(x, y)= y x and X, Y are both copies of Rn. If the measure µ is thought of as a collection | | of particles, then x T (x) represents the distance the ‘particle’ moves and dµ(x) | | represents the amount of mass that is moved that distance. The constraint T#µ = ⌫ ensures that the ‘particles’ of µ are rearranged into the shape of ⌫. Observe that x T (x) dµ(x)isthecumulativedistancethattheparticlesofthemeasuremust Rd | | moveR weighted by their mass according to µ. This leads to the following definition.

Definition 2.14 ([50, Pg. 2]). A measurable function T : X Y which satisfies 7! the push-forward constraint is called a transport map. If it is a minimizer of the Monge problem, then it is an optimal transport map.

The Monge formulation of the problem does not necessarily have a solution and can fail in two notable ways. First, the set of measurable maps which satisfy the constraint

1 3 1 may be empty. For example, suppose µ = i=1 (x i)and⌫ = ( 1 + 3 ). No 3 2 2 2 map will satisfy the marginalization constraintP in either direction. A second way to fail is when the set is non-empty but without an optimal candidate. An example of this type of failure may be found in [56, Fig. 4.1].

2.2.2. The Kantorovich Problem The Kantorovich problem is is a more modern statement of the problem and changes the focus from measurable maps to probabilities over the joint space. Specifically, it is defined as the infimum of the functional given by

C : c(x, y)d(x, y)(2.1) 7! X Y Z ⇥ for P(X Y ) such that (proj ) = µ and (proj ) = ⌫. 2 ⇥ x # y #

Definition 2.15 ([56, Ch. 4]). The set of measures which satisfy the marginalization

11 2.2 OT Background Preliminaries constraints are called couplings or transport plans of µ and ⌫. The set of couplings is denoted ⇧(µ, ⌫).

The functional in Equation (2.1) is called the transport cost. We will let C() be the evaluation of the function for a given coupling, while C(µ, ⌫)willrefertothe infimum of the transport cost. A transport map T is included in the set of transport plans by considering couplings of the form (Id T )#µ as X Y c(x, y)d(Id T )#µ = ⌦ ⇥ ⌦ R X c(x, T (x))dµ. This equivalence can be seen in [3, Prop 2.1]. R A transport map can be understood as moving the mass from a location x to a location T (x). A coupling conveys similar information. Rather than looking at a single point, we consider sets. First note that the marginalization constraints are equivalently written as (A Y )=µ(A)and(X B)=⌫(B)forallmeasureablesets ⇥ ⇥ A and B in their respective -algebras. Hence, (A B) µ(A) and (A B) ⌫(B). ⇥  ⇥  The value (A B)canbeunderstoodashowmuchmassinA is moved into the set ⇥ B. Furthermore, this allows a coupling to move mass in a set A (or in a point) to disjoint sets B and B0, something that could not be done by transport maps which preclude mass splitting. We can see that couplings still convey the notion that the mass in A is mapped

B B to B by evaluating (Id T ) µ(A B). Splitting A into A and A0 where A is the ⌦ # ⇥ part of A which maps to B, then our expectation is that (Id T ) µ(A B)=µ(AB). ⌦ # ⇥ We see that this is borne out as,

1 (Id T ) µ(A B)=µ (Id T ) (A B) (2.2) ⌦ # ⇥ ⌦ ⇥ = µ(AB), (2.3)

1 B because (Id T ) (A B)=A are the set of elements in A which are mapped by ⌦ ⇥ (Id T )intoA B. ⌦ ⇥

12 2.2 OT Background Preliminaries

It is important to first establish that there is indeed a solution which minimizes the functional in Equation (2.1).

Theorem 2.16 (Existence of an optimal coupling [56, Pg. 43]). Let (X, µ) and (Y,⌫) be two Polish spaces; let a : X R and b : Y R be two upper 7! [ {1} 7! [ {1} semicontinuous functions such that a L1(µ), b L1(⌫). Let c : X Y R + 2 2 ⇥ 7! [{ 1} be a lower semicontinuous cost function, such that c(x, y) a(x)+b(y) for all x, y.

Then there is a coupling of (µ, ⌫) which minimizes the total cost X Y c(x, y)d(x, y) ⇥ among all possible couplings. R

Before proving Theorem 2.16, it is useful to provide some context. First, observe that the theorem immediately demonstrates that unlike the Monge problem, which in some cases does not even have any candidate solutions, the Kantorovich problem always has an optimal solution. Second, this thesis will primarily focus on cost functions of the form c(x, y)=dist(x, y)p where X and Y are thus copies of the same space and p is a positive integer. We note that cost functions of this form give rise to the Wasserstein distances discussed in Section 2.3. They are also non- negative, implying that we can choose the zero function for a and b in Theorem 2.16. This non-negativity also provides another useful property, as given by the following lemma.

Lemma 2.17. [56, Pg. 44] If c is non-negative, then C : cd is lower 7! semicontinuous on P(X Y ) for the weak topology.5 R ⇥

Couplings are probabilities on the product space X Y . The following lemma ⇥ describes how the topologies of P(X)andP(Y )areinheritedbyP(X Y ). ⇥

Lemma 2.18. [56, Pg. 44] Let X and Y be two Polish spaces. Let P(X) and P ⇢ P(Y ) be tight subsets of their respective spaces. Then let ⇧( , ) be the set Q ⇢ P Q 5This is a particular case of a broader theorem.

13 2.2 OT Background Preliminaries of couplings whose marginals lie in and respectively. Then ⇧( , ) is tight in P Q P Q P(X Y ). ⇥

We are now able to sketch a proof for Theorem 2.16 using Lemmas 2.17 and 2.18.

Proof. [56, Pg. 44] This proof proceeds by first showing that the set of couplings is tight, extracting a sequence with a limit coupling and then looking at the limit of a subsequence.

(1) Since X and Y are Polish spaces, the singleton sets of the measures µ and ⌫ are tight in their respective spaces. From Lemma 2.18, we know that the set of couplings ⇧(µ, ⌫)istightandfromTheorem2.13(Prokhorov’stheorem)we know it has compact closure.

(2) The limit of any convergent sequence of couplings would satisfy the marginal- ization constraint and would therefore be a coupling itself. This shows that ⇧(µ, ⌫) is closed and also compact. We can extract a sequence from ⇧(µ, ⌫) which converges to the infimum of the transport cost.

(3) Since ⇧(µ, ⌫)iscompactthissequencehasaconvergentsubsequence(k)with limit . By Lemma 2.17, the transport cost is lower semicontinuous and so is optimal.

This completes the proof.

With Theorem 2.16 guaranteeing the existence of a solution to the Kantorovich formulation, the next steps are understanding the nature of the solutions. Theorem 2.19 helps to inform that understanding, especially when considered in conjunction with later theorems dependent on it.

14 2.2 OT Background Preliminaries

Theorem 2.19 (Optimality is inherited by restriction [56, Pg. 46]). Let (X, µ) and

(Y,⌫) be two Polish spaces; let a : X R and b : Y R be two upper 7! [{1} 7! [{1} semicontinuous functions such that a L1(µ), b L1(⌫). Let c : X Y R + 2 2 ⇥ 7! [{ 1} be a lower semicontinuous cost function, such that c(x, y) a(x)+b(y) for all x, y. Let C(µ, ⌫) be the optimal transport cost from µ to ⌫. Assume that C(µ, ⌫) < and 1 let ⇧(µ, ⌫) be an optimal coupling. Let ˜ be a non-negative measure on X Y 2 ⇥ such that ˜ and ˜[X Y ] > 0. Then the probability measure  ⇥

˜ 0 := ˜[X Y ] ⇥ is an optimal transport plan between its marginals.

Moreover, if is the unique transport plan between its marginals, then 0 is the unique optimal transport plan between its marginals.

Proof. The proof of this theorem can be found in full in [56, Pg. 46]. The key idea is that if part of the transport plan ˜ when normalized to be a probability measure 0 is not optimal, then the original plan would not be optimal. The normalization procedure can be thought of as ‘zooming in’ on a portion of the plan and ensuring that everything is proper in that region.

Theorem 2.19 tells us that an optimal transport plan can be considered to be built from smaller optimal transport plans and that each piece must itself be opti- mal. This interpretation is alluring as the evaluation of an optimal transport cost is computationally expensive. The hope would be that if the problem could be broken into pieces, then the plan could be reassembled at a fraction of the original cost. This is not the case as the object which is broken into components in Theorem 2.19 is the optimal transport plan ,andnotthemarginalsµ and ⌫. The problem would need to be solved before it was ‘simplified’ into smaller problems because it is the optimal

15 2.2 OT Background Preliminaries plan itself that is broken into components. Our next steps will help us see how the pieces of a transport plan must fit together.

Definition 2.20 ([50, Pg. 28]). Let X and Y be arbitrary sets and c : X Y ⇥ 7! ( , )beafunction.Asubset X Y is said to be c-cyclically monotone, 1 1 ⇢ ⇥ or c-CM, if for any N N, permutation , and family of points (x1,y1),...,(xn,yn) 2 in ,wehave N N c(x ,y) c(x ,y ). (2.4) i i  i (i) i=1 i=1 X X Note that Equation (2.4) is sometimes stated as

N N c(x ,y) c(x ,y ). (2.5) i i  i i+1 i=1 i=1 X X By breaking up a permutation into cycles and reordering we see that Equations (2.4) and (2.5) are equivalent. The cycles in the latter definition are why the property is called ‘cyclically’ monotone. We chose to use Equation (2.4) in Definition 2.20 to emphasize that any finite collection of pairs in a c-CM set has minimal ‘c-cost’ when compared to any permutation of partners, however Equation (2.5) is an easier property to show. We will see in Theorem 2.24 that a plan is optimal if and only if it has c-cyclically monotone support, and we will say that a plan is c-CM if it has c-cyclically mono- tone support. While the theorem will hold for lower semicontinuous cost functions, consider the reasoning for a continuous cost function. The idea here is that if you had a family of points (x1,y1) ...(xn,yn) in the support of your coupling which failed the c-CM criterion, then you would be able to form neighborhoods Ai and Bi around x and y such that (A B ) > 0. The neighborhoods A and B can be cho- i i i ⇥ i i i sen small enough such that the inequality in c-CM fails for the entire neighborhood. The restriction of the optimal plan to these neighborhoods would not be optimal in

16 2.2 OT Background Preliminaries contradiction of Theorem 2.19. This is the intuition in saying that an optimal plan is c-CM, because if it were not c-CM, then we could make a modification to make it slightly better. The property of c-CM is enough to characterize an optimal plan.

2.2.3. Duality Theory So far we have discussed the primal version of optimal transport. We now turn to the dual formulation, and use an analogy from economics to aid in its understanding. While optimal transport began with sand piles and is also often described in terms of transporting ore from mines to factories, the analogy for the dual problem will be phrased in terms of bakeries and cafes, paying homage to the French origins (and continued importance) of optimal transport [56, Pg. 53]. In primal optimal transport we minimize the cost of a transport plan directly, while in the dual problem we introduce intermediate agents whose goal is to maximize profits. To begin, let µ represent the output of fresh bread from bakeries in X and let ⌫ represent the demand for bread at cafes in Y . The bakeries and cafes are in a union to cooperate to minimize the cost of transporting the bread. The dual problem can be understood as one in which an intermediary purchases all of the bread from the bakeries at cost (x)andthenresellsittothecafesatprice(y). The total amount of money exchanged at x depends on the quantity of bread and is given by (x)dµ(x), and likewise at y we have (y)d⌫(y). Thus from the perspective of the union, transporting bread now costs (y) (x). For this to be advantageous to the union, it must hold that (y) (x) c(x, y), (2.6)  otherwise the bakery would transport the bread to the cafe without the intermediary.

17 2.2 OT Background Preliminaries

We call such a pair competitive. The middleman seeks to maximize profits, which yields the dual Kantorovich problem:

sup (y)d⌫(y) (x)dµ(x):(y) (x) c(x, y) . (2.7) L1(⌫) Y X  2 ⇢ L1(µ) Z Z 2

Letting be the optimal coupling of the primal problem, then for a competitive pair ( , ) we have that

c(x, y)d(x, y) ((y) (x))d(x, y) X Y X Y Z ⇥ Z ⇥ = (y)d⌫(y) (x)dµ(x). (2.8) ZY ZX

Since (2.8) holds for any competitive pair, it immediately follows that the supremum given in the dual problem Equation (2.7) is always less than or equal to the infimum in the primal problem in Equation (2.1). Since neither the middleman nor the union can spend an infinite amount of money at any point, and must be correspondingly integrable with respect to µ and ⌫. This line of reasoning follows the story that is told in [56], also see [50, Pg. 9] for a more formal derivation of the duality.6 Rearranging the inequality in Equation (2.6) we see that a competitive pair ( , ) must satisfy both (y) (x)+c(x, y), (2.9) 

(x) (y) c(x, y). (2.10)

Following Equation (2.9), the best (highest) possible selling price is obtained by solv-

6A word of warning however is that these texts use di↵erent conventions in the dual problem. Specifically, [56] considers pairs ( , ) such that (y) (y) c(x, y), while [50] considers pairs such that (y)+ (x) c(x, y).   18 2.2 OT Background Preliminaries ing (y)=inf( (x)+c(x, y)) . x

Similarly from Equation (2.10), the best (lowest) is satisfied by

(x)=sup((y) c(x, y)) . y

These relationships motivate the following definitions.

Definition 2.21 ([56, Pg. 54]). Let X, Y be sets and c : X Y ( , ]. A ⇥ 7! 1 1 function : X R is said to be c-convex if it is not identically + ,and 7! [ {1} 1 there exists ⇣ : Y R such that 7! [ {±1}

x X, (x) = sup (⇣(y) c(x, y)) . 8 2 y

The corresponding c-transform of is the function c defined as

y Y, c(y)=inf( (x)+c(x, y)) . 8 2 x

Finally, its c-subdi↵erential is the c-cyclically monotone set defined by

@ = (x, y) X Y ; c(y) (x)=c(x, y) . c { 2 ⇥ }

Definition 2.22 ([56, Pg. 54]). With the same notation as Definition 2.21, a function

: Y R is said to be c-concave if it is not identically and there exists 7! [ {1} 1 : R such that = c. Then its c-transform is the function c defined by [ {±1}

x X c(x)=sup((y) c(x, y)) , 8 2 y

19 2.2 OT Background Preliminaries and its c-superdi↵erntial is the c-cyclically monotone set defined by

@ = (x, y) X Y : (y) c(x)=c(x, y) . c { 2 ⇥ }

Definitions 2.21 and 2.22 yield some ambiguity regarding the meaning of c-transform, and indeed some authors refer to the c-transform given in Definition 2.22 as thec ¯- transform. In practice the meaning is clear, however. What is more germane to the discussion is that by using either definition of the c-transform, it is possible to start from a competitive pair ( , )andformapair(c, )(or( , c)) which yields a greater or equal value in the dual problem. To further understand the subdi↵erential, note that a point y Y is an element 2 of @c (x)ifandonlyif

z X, (x)+c(x, y) (z)+c(z,y). 8 2 

Before using these concepts, we provide an alternate characterization of c-convexity using the c-transforms.

Proposition 2.23 (Alternative characterization of c-convexity [56, Pg. 57]). For any function : X R , let its c-convexification be defined by cc =( c)c. More 7! [ {1} explicitly, cc(x) = sup inf ( (z)+c(z,y) c(x, y)). y Y z X 2 2 Then is c-convex if and only if cc = .

We are now in a position to state a theorem that determines a large part of the structure of optimal transport.

Theorem 2.24 (Kantorovich Duality [56, Pg. 57]). Let (X, µ) and (Y,⌫) be two

Polish probability spaces and let c : X Y R be a lower semicontinuous ⇥ 7! [ {1} 20 2.2 OT Background Preliminaries cost function, such that

(x, y) X Y, c(x, y) a(x)+b(y) 8 2 ⇥ for some real-valued upper semicontinuous functions a L1(µ) and b L(⌫). Then 2 2 (a) There is duality:

min c(x, y)d(x, y) ⇧(µ,⌫) X Y 2 Z ⇥ =sup (y)d⌫(y) (x)dµ(x) ( ,) Cb(X) Cb(Y ) Y X 2 C⇥ ✓Z Z ◆  =sup (y)d⌫(y) (x)dµ(x) ( ,) L1(µ) L1(⌫) Y X 2 C⇥ ✓Z Z ◆  =sup c(y)d⌫(y) (x)dµ(x) L1(µ) Y X 2 ✓Z Z ◆ =sup (y)d⌫(y) c(x)dµ(x) . (2.11) L1(⌫) Y X 2 ✓Z Z ◆

Note that in the above suprema one might as well impose that be c-convex and c-concave.

(b) If c is real-valued and the optimal cost C(µ, ⌫)=inf ⇧(µ,⌫) cd is finite, then 2 there is a measurable c-cyclically monotone set X Y R(closed if a,b,c are ⇢ ⇥ continuous) such that for any ⇧(µ, ⌫) the following five statements are 2 equivalent:

(i) is optimal;

(ii) is c-cyclically monotone;

(iii) There is a c-convex such that, -almost everywhere,

c(y) (x)=c(x, y); 21 2.2 OT Background Preliminaries

(iv) There exist : X R and : Y R such that 7! [ {1} 7! [ {1}

(y) (x) c(x, y) 

for all (x, y), with equality almost everywhere;

(v) is concentrated on .

(c) if c is real valued, C(µ, ⌫) < and one has the pointwise upper bound 1

c(x, y) c (x)+c (y), (c ,c ) L1(µ) L1(⌫),  X Y X Y 2 ⇥

then both the primal and dual Kantorovich problems have solutions, meaning that the suprema in Equation (2.11) are all maximums. If in addition a, b, and c are continuous, then there is a closed c-cyclically monotone set X Y ⇢ ⇥ such that for any ⇧(µ, ⌫) and for any convex L1(µ), 2 2

(i) is optimal in the Kantorovich problem if and only if ()=1;

(ii) is optimal in the dual Kantorovich problem if and only if @ . ⇢ c

The proof of Theorem 2.24 can be found in [56, Pg. 57].

Remark 2.25. The optimal prices for the transport problem from µ to ⌫ are not the optimal prices for the transport problem from ⌫ to µ. For a symmetric cost function c(x, y), if ( , )isanoptimaldualpairforµ, ⌫, then the optimal pair for ⌫,µ is ( , ). This is because of the dual form that we have chosen [56, Pg. 60]. Corollary 2.26 of Theorem 2.24 is a new result, and motivates the work in Chapter 5.

Corollary 2.26. Let (X, µ) and (Y,⌫) be two Polish probability spaces and let c be a real valued cost function which satisfies the hypotheses of Theorem 2.24 with an

22 2.2 OT Background Preliminaries optimal transport plan (with finite cost). Let ⇡ be a probability measure which is absolutely continuous with respect to . Then if ⇡ has finite cost, it is an optimal transport plan between its marginals.

Proof. Consider such a ⇡ and call its marginals ⌘ and ⇠. By assumption it has finite cost, and hence it is an immediate upper bound for the optimal transport cost C(⌘, ⇠), as defined in Equation (2.1). Thus we satisfy the hypothesis for Theorem 2.24(b). Since is an optimal transport plan, it is supported on a c-CM set .Since⇡ is absolutely continuous with respect to , it is supported on the same c-CM set . Thus, by Theorem 2.24(b) it is an optimal transport plan between its marginals. This completes the proof.

Corollary 2.26 bears resemblance to Theorem 2.19, that is, the optimality inher- ited by restriction. Observe that by construction 0 in Theorem 2.19 satisfies the hypothesis in Corollary 2.26, however they are not interchangable. The purpose of Corollary 2.26 is to consider measures that are absolutely contin- uous to an optimal plan as opposed to those dominated by one, which is the extent of Theorem 2.19. We note that Corollary 2.26 is proven through the equivalence of c-CM support and optimality, and this equivalence is a consequence of Theorem 2.19. Despite being a consequence of Theorem 2.19, Corollary 2.26 is an extension and cannot be proven in the same way as Theorem 2.19. In particular, an intermediate step of Theorem 2.19 considers the measure ⇡, where ⇡ is a restriction of , in which case it must be true that ⇡ is a non-negative measure. By contrast, in Corollary 2.26 the relationship between ⇡ and is such that ⇡ is absolutely continuous with respect to , as opposed to being dominated by , thus preventing ⇡ from being a non-negative measure. Note that the hypothesis that ⇡ has finite cost in Corollary 2.26 is not superfluous.

Let X = Y = R and consider the cost function c(x, y)=dist(x, y)p and µ the normal

23 2.2 OT Background Preliminaries

1 distribution with unit covariance and ⌫ = 2 (0 + 1). Then a Cauchy distribution ✏, which does not have finite second moment, will be absolutely continuous with respect to µ. However, there does not exist a coupling with finite cost and therefore optimality is not tied to cyclic monotonicity. We chose ⌫ to have two point masses because there is only one coupling between a measure and a single point mass and that coupling will not only have c-cyclic monotonic support because there is only a single y coordinate available, but also since the set of couplings is a singleton, it is a degenerate optimum for any problem. To end our introduction to general optimal transport, we answer the question that started it all with Theorem 2.27, which tells us when there exists a solution to the Monge problem.

Theorem 2.27 (Criterion for solvability of the Monge problem [56, Pg. 84]). Let a L1(µ) and b L1(⌫) be two real-valued upper semicontinuous functions. Let 2 2 c : X Y R be a lower semicontinuous cost function such that c(x, y) a(x)+b(y) ⇥ 7! for all x, y. Let C(µ, ⌫) be the optimal transport cost . If

(a) C(µ, ⌫) < and 1

(b) for any c-convex function : X R , the set of x X such that @ (x) 7! [ {1} 2 c contains more than one element is µ-negligible, then there is a unique optimal coupling of (µ, ⌫) that is determined by an optimal transport map. Moreover, it is characterized by the existence of a c-convex function

such that ⌫(@c(X)) = 1. In particular, the Monge problem with initial measure µ and final measure ⌫ admits a unique solution.

In later sections we will see the large role that plays. For now, we observe that the subdi↵erential at almost all points must only have a single element in order for the Monge problem to have a solution. This is because a transport map cannot split mass

24 2.3 Wasserstein Distances Preliminaries in the same way that a transport plan can, and the elements of the subdi↵erential at a point determine to where the mass at that point will be sent.

Section 2.3 Wasserstein Distances

The rest of this thesis will focus on optimal transport with cost functions given by c(x, y) = dist(x, y)p on a generic space X or x y p when we consider Rn. This cost | | function is a function from X X R. However, we will still often write X Y ⇥ 7! ⇥ and Y to distinguish the coordinates.

Definition 2.28 (Wasserstein distance [56, Pg. 93]). Let (X, dist) be a Polish space and let p [1, ). For any two probability measures µ, ⌫ on X, the Wasserstein 2 1 distance of order p (also called the Wasserstein p-distance or the pth Wasserstein distance) between µ and ⌫ is defined by the formula

1/p p p(µ, ⌫)= inf dist(x, y) d(x, y) . (2.12) W ⇧(µ,⌫) X X ✓ 2 Z ⇥ ◆

As mentioned earlier, the cost function c(x, y)=dist(x, y)p is non-negative and thus choosing a(x)=b(y)=0(wherec(x, y) a(x)+b(y) from Theorem 2.16) yields an upper semicontinuous integrable function which is a lower bound for c(x, y). We restrict our attention to a subset of measures on which the Wasserstein distance takes finite values.

Definition 2.29 (Wasserstein space [56, Pg. 94]). With the same conventions as Definition 2.28, the Wasserstein space of order p is defined as

P (X):= µ P(X): dist(x, x )pdµ(x) < p 2 0 1 ⇢ ZX

25 2.3 Wasserstein Distances Preliminaries where x X is arbitrary. This space does not depend on the choice of the point x . 0 2 0 Then defines a finite distance on P (X). Wp p

The Wasserstein distance satisfies the axioms of a distance. While complete proofs may be found in the aforementioned references, below we discuss some of the main properties that will help provide context for this work.

(a) Symmetry: Notice that if (x, y)isacouplingofµ(x)and⌫(y), then ˜(x, y)= (y, x)isacouplingof⌫(x)andµ(y). Since our cost function c(x, y)= dist(x, y)p is symmetric, C()=C(˜), so we obtain

(µ, ⌫)= (⌫,µ). Wp Wp

(b) Identity: If (µ, ⌫)=0,thenthereisacouplingwhichissupportedonthe Wp

diagonal, which is the graph of the identity. Then ⌫ =Id# µ = µ.

(c) Triangular inequality: Let µ ,µ and µ be our measures. A measure 1 2 3 1,2,3 2 P(X X X)canbeformedfromtheoptimalplans between µ and µ , ⇥ ⇥ 1,2 1 2

and 2,3 between µ2 and µ3. Such couplings are joined using Theorem 2.12, the Disintegration theorem, along the common marginal to form the measure on X X X. A full explanation may be found in [5, Pg. 122]. ⇥ ⇥

Note that limiting the measures to those with finite pth moments is important

p 1 p p 1 p because it allows us to set cX (x)=2 dist(x, x0) and cY (y)=2 dist(y, x0) in Theorem 2.24(c). These functions are integrable when µ and ⌫ are in Pp(X) and c(x, y) c (x)+c (y)soweareguaranteedasolutiontotheprimalanddual  X Y problems. The topology on the Wasserstein space is the topology given by weak convergence plus a bit more as given by the following definition.

26 2.3 Wasserstein Distances Preliminaries

Definition 2.30 (Weak Convergence in Pp [56, Pg. 96]). Let (X, dist) be a Polish space and p [1, ). Let (µk)k N be a sequence of probability measures in Pp(X) 2 1 2 and let µ be another element of Pp(X). Then (µk)k N is said to converge weakly in 2 P (X)ifanyoneofthefollowingequivalentpropertiesissatisfiedforsomex X: p 0 2

(a) µ µ and dist(x, x )pdµ (x) dist(x, x )pdµ(x) k ! 0 k ! 0 R R p p (b) µk µ and lim sup dist(x, x0) dµk(x) dist(x, x0) dµ(x) ! k  !1 R R (c) µ µ and lim lim sup dist(x, x )pdµ (x)=0 k dist(x0,x) R 0 k ! R k !1 !1 R p (d) For all continuous functions with (x) C(1 + dist(x0,x) ), C R,onehas | |  2

(x)dµ (x) (x)dµ(x). k ! Z Z

The final criterion shows that the space Pp(X)shouldbethoughtofasbeing dual to the space of functions satisfying the inequality (x) C(1 + dist(x ,x)p), | |  0 which can be thought of as functions which grow no faster than order p out at infinity.

This space is larger than the Cb(X)andexplainswhyweakconvergenceisinsucient in fully characterizing the space. The first equivalent criterion may be the easiest to remember as it says that µ µ in P (X)iftheseriesisweaklyconvergentandwe k ! p have convergence of the ‘pth moment’ of the series.

Theorem 2.31 ( metrizes P [56, Pg. 96]). Let (X, d) be a Polish space, and p Wp p 2 [1, ); then the Wasserstein distance metrizes the weak convergence in P (X). 1 Wp p

If (µk)k N is a sequence of measures in Pp(X), then it is equivalent to say 2

(a) µk converges weakly in Pp(X) to µ

(b) (µ ,µ) 0. Wp k !

27 2.3 Wasserstein Distances Preliminaries

Corollary 2.32 allows us to find the Wasserstein distance through a refinement scheme.

Corollary 2.32 (Continuity of [56, Pg. 97]). If (X, d) is a Polish space, and Wp p [1, ), then is continuous on P (X). More explicitly, if µ converges to µ 2 1 Wp p k weakly in Pp(X) and ⌫k to ⌫, then

(µ , ⌫ ) (µ, ⌫). Wp k k ! Wp

This shows that we are able to approximate the Wasserstein distance between two measures by calculating the Wasserstein distance between two discrete measures that approximate the original measures. A coupling between two measures which are finite sums of point masses is itself a measure comprised of finitely many point masses. If µ and ⌫ have M and N point masses respectively, then a coupling between them has at most MN. The optimal transport problem then becomes a linear programming problem; µ and ⌫ can be represented as M and N dimensional vectors respectively, the cost function as an N M array, and the coupling can also be represented ⇥ as an N M array. Evaluating the cost of a coupling is done via an element-wise ⇥ multiplication and sum, or a ‘double-dot’ product. The marginalization constraints on a coupling become 1 = µ and 1T = ⌫T . Along with the advantage of existence, the amenability of the Kantorovich problem to linear programming provides another benefit in comparison to the Monge formu- lation. Specifically, it not only produces a practical formulation of the Wasserstein distance and therefore its potential relevance, but also allows for increased under- standing through computation and experimentation. Point masses best demonstrate the way in which the Wasserstein distance incor- porates the metric properties of the underlying space. Mapping x isometrically 7! x embeds X into P (X)as ( , )=dist(x, y), regardless of p. Theorem 2.33 p Wp x y 28 2.3 Wasserstein Distances Preliminaries

describes another way in which the metric structure of X a↵ects Pp(X).

Theorem 2.33 (Topology of the Wasserstein space [56, Pg. 104]). Let (X, dist) be a

Polish space, then P (X) with is a Polish space. Moreover, any probability mea- p Wp sure can be approximated by a sequence of probability measures comprised of finitely many point masses.

This completes the basic discussion of the metric structure of the Wasserstein space and encapsulates how the Wasserstein space inherits properties from the underlying space X.

2.3.1. Examples of Wasserstein distances We now provide some examples to illustrate the Wasserstein distance.

Example 2.34. As mentioned earlier, the Wasserstein distance between x and y

1 n is dist(x, y). Expanding beyond the case of a single point mass, let µ = n i=1 xi 1 n and ⌫ = n i=1 yi , with the further criterion that xi is the unique closestP point (among theP collection x )toy ,andthaty is the unique closest point to x (among { i} i i i the collection y ). This shows that (x ,y) X Y is a c-CM set and that { i} { i i } ⇢ ⇥ 1 n = n i=1 (xi,yi) is the optimal coupling between µ, ⌫.

In ExampleP 2.34, the transport cost C(µ, ⌫)istheaveragetransportcostoftrans-

porting each point x to its nearest y-neighbor. While any coupling i,j = (xi,yj ) is trivially optimal between its marginals xi and yi here we see how we need to satisfy a global criterion in order to assemble this coupling. Note that the criterion that xi is the closest point to yi is a strong way to enforce c-CM, but conveys a simple way in which each point needs to coordinate with the others. With the same collection of points as in Example 2.34, we can also consider

n n n µ0 = a and ⌫0 = a with a 0and a =1.Thentheoptimal i=1 i xi i=1 i yi i i=1 i n couplingP between these twoP measures is 0 = i=1 aPi(xi,yi). This is because it has P 29 2.3 Wasserstein Distances Preliminaries the same support as , and so it has c-CM support. It can also be seen by letting f(xi,yi)=ain so that 0(x, y)=f(x, y)(x, y)andthenapplyingCorollary2.26, which is itself an argument using the c-CM support. The Wasserstein distance is generally dicult to compute but in special cases has a closed form. Before discussing one such case, we review a well-known general principle of the Wasserstein distance, namely that it is a product metric between a

‘shape’ and a ‘mean’ component. Specifically, let X = Rn, and consider two measures

µ and ⌫ with m and n their corresponding means. Let Sh be the shift operator such that Sh(x)=x + h. Then let ⇠ =(S m)#µ and ⌘ =(S n)#⌫. The new measures, ⇠ and ⌘ have the same shape as µ and ⌫, but have zero mean. Then (µ, ⌫)p = m n p + (⇠, ⌘)p. The mean component is m n p and the shape Wp || || Wp || || component is (⇠, ⌘)p, the Wasserstein distance between the measures. Wp The transport cost C(µ, ⌫)canberewrittenas

inf E[dist(x, y)p]= inf dist(x, y)pd(x, y), (2.13) ⇧(µ,⌫) ⇧(µ,⌫) X Y 2 2 Z ⇥ where we view the Wasserstein distance as the pth root of the expectation of the cost function. In this context, breaking the argument into its mean and mean-free components is very common and can be found in [22].

We have already seen the trivial example of this. Specifically, since x and y have the same shape, the Wasserstein distance between them is only a function of their means. This is true for any shape of measure and provides motivation for using the Wasserstein distance for data assimilation [19, 29].

Example 2.35. Let X = Rn and let µ = (m, A)and⌫ = (n, B)beGaussian N N measures with means m and n respectively and covariance matrices A and B. Then

(µ, ⌫)2 = m n 2 +tr(A)+tr(B) 2tr[(pABpA)1/2]. (2.14) W2 || ||

30 2.3 Wasserstein Distances Preliminaries

Observe that the latter terms in (2.14) involving the trace of the covariance matri- ces are the shape component of the cost. This can be see by examining a few special cases. When A and B are the same, then 2 tr[(pABpA)1/2]=2tr(A)andtheshape component vanishes. When B = A, with > 0, then the shape component is (1 + 21/2)tr(A). These are both special cases of when A and B commute, and the shape component is larger when they do not commute. This result and proof for Example 2.35 can be seen in [22], and demonstrates the mean and shape decomposition of the Wasserstein distance. Additional discussion on Gaussian measures is provided in the next section.

2.3.2. Geodesics in Wasserstein Space Optimal transport is situated at a juncture between many di↵erent fields. Of course this includes measure theory which was covered in Section 2.1, but it also includes geometry. Since the work in this thesis only requires broadly accessible and well understood concepts, we limit our background discussion. Specifically, we review the very basics of geometry in Wasserstein spaces and examine these concepts through examples. Many treatments in optimal transport include some background in the necessary concepts from geometry, see e.g. [4, 5, 36, 50, 56]. A path or curve ! in a space X is a from [0, 1] X. The 7! length of a path is given by

n 1 Length(!) := sup dist(!(tk), !(tk+1)) , ( i=0 ) X where the supremum is taken over all n 1andallpartitionsoftheunitinterval such that 0 = t0

31 2.3 Wasserstein Distances Preliminaries inf Length(!) . { }

Definition 2.36 ([50, Pg. 202]). A curve ! is a geodesic between x and y if !(0) = x, !(1) = y and Length(w)=dist(x, y). A space (X, dist) is a geodesic space if there exists geodesics between arbitrary points.

We refine our notion of a geodesic to be a constant speed geodesic where

dist(!(s), !(t)) = (t s)dist(x, y).

We also define functions et as evaluations at time t which map a curve to its location at t, so that et(!)=!(t). An important notion for understanding geodesic spaces is the midpoint criterion: A complete metric space (X, dist) is a geodesic space, if and only if between any two points x, y in X there is a point m X which is the midpoint between them,i.e. 2

dist(x, y) dist(x, m)=dist(m, y)= . 2

This criterion is used in [56] in his proof that the Wasserstein spaces are geodesic. When looking at a geodesic space X, let Geod(X)refertothespaceofconstant speed geodesics over X. Geodesics spaces are guaranteed to have at least one geodesic between two points, but this geodesic may not be unique. When that is the case, it is useful to have a geodesic selection function, GeodSel, which maps a start and end point to a geodesic between them, i.e. GeodSel : X X Geod(X). ⇥ 7!

Corollary 2.37 ([56, Pg. 127]). Let (X, d) be a geodesic space. Let p>1 and let

th Pp(X) be the space of probability measures on X with finite p moment. Then, given any two µ0,µ1 Pp(X), and a continuous curve (µt)0 t 1 valued in Pp(X), 2   the following properties are equivalent:

32 2.3 Wasserstein Distances Preliminaries

(a) µ is the law of , where is a random geodesic, P(Geod(X)), such that t t 2

(e0,e1)# =(0, 1) is an optimal coupling;

(b) (µt)0 t 1 is a geodesic curve in the space Pp(X).  

Moreover, if µ0 and µ1 are given, there exists at least one such curve.

Notice that Corollary 2.37 excludes p =1.Thisisbecausewhenp =1,the Wasserstein distance is not tied to a coercive Lagrangian action. More information may be found in [56, Ch. 7]. Corollary 2.37 tells us that (X) is a geodesic space and that the geodesic of Wp the probabilities is a probability on geodesics. This is clear in [4, Pg. 32] where a coupling is mapped to a measure on geodesics via the geodesic selection function, GeodSel.

Let ⇡ be the optimal coupling between µ0 and µ1, then := (GeodSel)#⇡(x, y). This measure, , on the geodesics satisfies the following properties:

(a) (e0)# = µ0 and

(b) (e1)# = µ1.

The map GeodSel uses x as the starting point and y for the ending point for the geodesic it selects, and when we use it to push forward a coupling, the first marginal

µ0 is the push-forward measure when evaluating the geodesics at t =0andthesecond marginal µ1 when evaluating the geodesics at t =1.

Corollary 2.37 is remarkable because it tells us that =(GeodSel)#⇡ will be an element of P(Geod(X)) as expected, but also that it is an element of Geod(Pp(X)).

It is worthwhile to take a moment to examine what (GeodSel)#⇡ looks like in the following example.

33 2.3 Wasserstein Distances Preliminaries

Example 2.38. Let X = R2 and let the measures µ and ⌫ be given by

1 µ = ( + ), 2 (0,0) (0,1) 1 ⌫ = ( + ). 2 (1,0) (1,1)

That is, they are concentrated on the four corners of a square with µ concentrated on the left corners and ⌫ concentrated on the right. Then the optimal coupling is

1 ⇡ = ((0,0) (1,0) + (0,1) (1,1)). 2 ⇥ ⇥

In this case, the geodesic curve in Pp(X)isgivenby := (GeodSel)#⇡ is (t)= 1 ( + ). Said another way, is concentrated on the geodesic (0, 0) (1, 0) 2 (t,0) (t,1) ! and the geodesic (0, 1) (1, 1). !

It is important to note that in general P(Geod(X)) is not Geod(P_p(X)); the inclusion only holds when looking at the push-forward measure of an optimal coupling. To emphasize this point, consider once again Example 2.38, but instead of the horizontal geodesics, consider the diagonal geodesics, i.e. (0, 0) → (1, 1) and (0, 1) → (1, 0).

This is not a geodesic in P_p(X) because it is not the push-forward of an optimal coupling. Thus we see that the optimal coupling does more than serve as a measure that we can push into the space of geodesics; it also coordinates which geodesics are selected.

Choosing the correct geodesics ensures that we form an element of Geod(P(X)), not just P(Geod(X)). The general lesson to take from Example 2.38 is not that path lines (the trace of the geodesic) cannot cross, but rather that the ‘particles’ themselves cannot collide. We turn to another example to demonstrate that path lines may indeed cross.

Example 2.39 ([55, Pg. 309]). Let x_1 = (−1, 2), x_2 = (1, −2) and y_1 = (1, 1), y_2 = (−1, −1). Then the optimal coupling between µ = ½(δ_{x_1} + δ_{x_2}) and ν = ½(δ_{y_1} + δ_{y_2}) sends x_i → y_i. This particle identification can be thought of as being related to the 'shape' component of the optimal coupling (as both measures have mean zero already).

In Rⁿ when p = 2, an optimal transport plan remains optimal even if one of the marginals is translated. That is, let S_h be translation by h so that S_h(x) = x + h; then if γ is optimal, so too is (Id ⊗ S_h)_#(γ). Suppose h = (1, 4) and let z_i = y_i + h and η = ½(δ_{z_1} + δ_{z_2}). Then the optimal coupling between µ and η is the one which sends x_i → z_i. We can interpret this as saying that while the translation affects the mean, the shape of the distribution is invariant to translations. Now let γ be the optimal coupling between µ and ν and π = (Id ⊗ S_h)_#(γ) the optimal coupling between µ and η. We see neither collisions nor crossings in (GeodSel)_# γ.

However, while there are still no collisions, there is a crossing in (GeodSel)_# π.

Observe that this is yet another way in which p = 1 distinguishes itself from the other Wasserstein spaces. When p = 1, not only may 'particles' not collide, but also path lines cannot cross. More details may be found in [56, Ch. 8]. While we limited our discussion to geodesics, the connections between optimal transport and geometry are much deeper. One can put a Riemannian manifold structure on the Wasserstein space when we consider Rⁿ and p = 2 [40], and Gaussian measures form a closed sub-manifold that inherits the Riemannian metric [52]. The Riemannian structure is further developed in [33], and since then in [4, 5, 36, 50, 56].

Section 2.4 One Dimensional Optimal Transport

We begin with a deeper look into the specifics of optimal transport on R. Although one dimension provides some intuition as to the behavior of optimal transport, some properties are not carried into higher dimensional spaces, as will become evident in Chapter 5. Most of the reference material in this section can be found in [50].


There are two primary reasons why 1-d optimal transport is simpler than the higher dimensional cases: (i) for a 1-d measure we can form a cumulative distribution function (CDF), and (ii) c-CM becomes a very restrictive condition for a broad class of cost functions, including the cost functions for the Wasserstein p-distances when p > 1.

2.4.1. One Dimensional Transport We begin by discussing a specific construction of a transport plan that is only suitable when our base space is R or a domain Ω ⊂ R. The primary reference for this section is [50, Ch. 2].

Given a measure µ on R, the CDF F_µ is the function defined at a point by the measure of the set from negative infinity to the point, i.e. F_µ(x) := µ((−∞, x]).

Because µ is a positive measure, Fµ is monotonically increasing. It is important to note here that Fµ will be everywhere right continuous in addition to being continuous where µ is non-atomic.7 When a measure µ is non-atomic and has a density that is nowhere zero, then Fµ is invertible because it is strictly increasing. This is not always the case, but we can form the pseudo-inverse of the CDF, which will prove to be useful.

Definition 2.40 ([50, Pg. 60]). Given a monotone increasing and right continuous function F : R → [0, 1], the pseudo-inverse function G : [0, 1] → R is given by

G(t) = inf { x ∈ R : F(x) ≥ t }.

In plain words, the pseudo-inverse G(t) is the smallest value x such that at least t mass accumulates up to x. We need the function F to be monotonically increasing for the pseudo-inverse to behave like an inverse, while right continuity ensures that the infimum is a minimum when the set is non-empty.

7 Recall that right continuity means that when a sequence (x_n)_{n ∈ N} converges to x with all x_n > x, then lim F_µ(x_n) = F_µ(x).


Two important consequences of Definition 2.40 that will be used in Proposition 2.43 are

G(t) ≤ x ⟺ F(x) ≥ t   and   G(t) > x ⟺ F(x) < t.   (2.15)

It is worth looking at a couple of examples before moving on. In both cases, Gµ refers to the pseudo-inverse of Fµ, i.e. the CDF of a measure µ.

Example 2.41. Let µ be the uniform measure on [0, 2]. Then

F_µ(x) = 0 for x ≤ 0,   F_µ(x) = x/2 for 0 ≤ x ≤ 2,   F_µ(x) = 1 for 2 ≤ x.

Here G_µ(t) = 2t.

Example 2.42. Let µ = ½ Σ_{i=1}^{2} δ(x − i). Then

F_µ(x) = 0 for x < 1,   F_µ(x) = 1/2 for 1 ≤ x < 2,   F_µ(x) = 1 for 2 ≤ x.

Here

G_µ(t) = 1 for 0 < t ≤ 1/2,   G_µ(t) = 2 for 1/2 < t ≤ 1.

The pseudo-inverse is valuable when looking at measures because of the following

well-known result, which can be found in [50].

Proposition 2.43 ([50, Pg. 60]). If µ ∈ P(R) and G_µ is the pseudo-inverse of its CDF F_µ, then (G_µ)_#(L|[0,1]) = µ.

Given µ, ν ∈ P(R), if we define η := (G_µ, G_ν)_#(L|[0,1]), then η ∈ Π(µ, ν) and η((−∞, a] × (−∞, b]) = F_µ(a) ∧ F_ν(b), where F_µ(a) ∧ F_ν(b) = min { F_µ(a), F_ν(b) }.

Proof. We can determine the measure µ̂ = (G_µ)_#(L|[0,1]) by looking at its values on sets of the form (−∞, a]. Letting a be arbitrary, we have F_µ̂(a) = µ̂((−∞, a]) = L(A), where A is defined as A = { t ∈ [0, 1] : G_µ(t) ≤ a }. Using Equation (2.15), we see that A can be rewritten as A = { t ∈ [0, 1] : F_µ(a) ≥ t }. In this second form it is clear that L(A) = F_µ(a). Thus µ̂ and µ have the same CDFs and so are equal as measures. The second half is proven similarly. By direct calculation,

η((−∞, a] × (−∞, b]) = L({ t ∈ [0, 1] : G_µ(t) ≤ a and G_ν(t) ≤ b })
                     = L({ t ∈ [0, 1] : F_µ(a) ≥ t and F_ν(b) ≥ t })
                     = F_µ(a) ∧ F_ν(b),

yielding the second part of the proposition.

Definition 2.44 ([50, Pg. 61]). We call the coupling η := (G_µ, G_ν)_#(L|[0,1]) the monotone coupling and denote it by γ_mon.
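As a computational aside, the pseudo-inverse reduces one-dimensional transport to a comparison of quantile functions. The sketch below is a minimal illustration under the assumption that µ and ν are represented by equal-weight samples, so that G_µ and G_ν can be approximated with NumPy's quantile routine; it estimates W_p^p(µ, ν) = ∫_0^1 |G_µ(t) − G_ν(t)|^p dt on a uniform grid of quantile levels.

```python
import numpy as np

def wasserstein_p_1d(x_samples, y_samples, p=2, n_quantiles=1000):
    """Approximate W_p^p between two empirical 1-D measures via their
    pseudo-inverse CDFs (quantile functions) on a uniform grid of levels."""
    t = (np.arange(n_quantiles) + 0.5) / n_quantiles   # quantile levels in (0, 1)
    gx = np.quantile(x_samples, t)                      # approximation of G_mu(t)
    gy = np.quantile(y_samples, t)                      # approximation of G_nu(t)
    return np.mean(np.abs(gx - gy) ** p)                # integral over [0, 1]

rng = np.random.default_rng(0)
mu_samples = rng.uniform(0.0, 2.0, size=5000)   # Example 2.41: uniform on [0, 2]
nu_samples = mu_samples + 1.0                    # the same shape shifted by one unit
print(wasserstein_p_1d(mu_samples, nu_samples, p=2))   # close to 1.0 = shift^2
```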

2.4.2. Structure of Monotone Coupling Before moving on, consider the structure of the monotone coupling through the following lemma, which we will use later in Chapter 5.

Lemma 2.45. Let µ and ν be measures with points c_i and d_i, for i from 1 to N − 1, such that F_µ(c_i) = F_ν(d_i) ≠ 0. Let γ_mon be the monotone coupling between them. Also let c_0, d_0 = −∞ and c_N, d_N = ∞. Then supp(γ_mon) ⊂ ∪_{i=1}^{N} (c_{i−1}, c_i] × (d_{i−1}, d_i].

Proof. We prove this by showing that γ_mon((c_i, ∞) × (−∞, d_i]) = 0, and note that the proof would hold for any i as well as for reversing the roles of c_i and d_i. Consider the following:

F_ν(d_i) = γ_mon((−∞, ∞) × (−∞, d_i])
        = γ_mon((−∞, c_i] × (−∞, d_i]) + γ_mon((c_i, ∞) × (−∞, d_i])
        = F_µ(c_i) ∧ F_ν(d_i) + γ_mon((c_i, ∞) × (−∞, d_i])
        = F_ν(d_i) + γ_mon((c_i, ∞) × (−∞, d_i]).

Subtracting F_ν(d_i) from both sides shows that γ_mon((c_i, ∞) × (−∞, d_i]) = 0. This argument can be done in the same way starting with F_µ(c_i) to show that γ_mon((−∞, c_i] × (d_i, ∞)) = 0. Together these prove our claim, as we have shown that no mass lies in the region outside of ∪_{i=1}^{N} (c_{i−1}, c_i] × (d_{i−1}, d_i], which completes the proof.

The previous lemma allows us to more intuitively understand the monotone coupling. Let µ, ν be non-atomic measures and γ_mon the monotone coupling between them. Since the measures are non-atomic, we can choose c_i and d_i, for i = 1, ..., 9, to be the markers for the i-th decile of the mass of µ and ν respectively, i.e. F_µ(c_i) = F_ν(d_i) = 0.1 i. We call the sets (c_{i−1}, c_i] and (d_{i−1}, d_i] the i-th deciles of µ and ν respectively. Applying Lemma 2.45, we see that γ_mon is supported on the rectangles made by the matching deciles of the two measures, and there is 0.1 mass in each of these rectangles. Using the interpretation of a coupling that says the measure γ_mon(A × B) represents the mass from A which is sent to B, we see that the coupling sends all of the mass in the i-th decile of µ to the i-th decile of ν. For atom-less measures this same deduction could be made at any level of coarseness, and it tells us that the monotone coupling keeps the mass in the same order that it was in. A final note on the monotone coupling is that it does not move all of the mass in a single direction, as might be understood from its name. It is called the monotone coupling here because of the following lemma.

Lemma 2.46 ([50, p. 62]). Let γ ∈ Π(µ, ν) be a transport plan between two measures µ, ν ∈ P(R). Suppose that it satisfies the monotone support property

(x, y), (x′, y′) ∈ supp(γ), x < x′ ⟹ y ≤ y′.   (2.16)

Then we have that γ = γ_mon.

Proof. To prove this statement we need only show that for such a γ,

γ((−∞, a] × (−∞, b]) = F_µ(a) ∧ F_ν(b),   i.e.   γ((−∞, a] × (−∞, b]) = γ_mon((−∞, a] × (−∞, b]).

Sets of this form fully characterize a measure, so by doing this we show that the two measures agree everywhere. Consider the sets A = (−∞, a] × (b, ∞) and B = (a, ∞) × (−∞, b]. At least one of γ(A) or γ(B) will be equal to zero because of the monotone support assumption we have placed on γ. Letting C = (−∞, a] × (−∞, b], we see that

γ(C) = γ(C ∪ A) ∧ γ(C ∪ B),   (2.17)

because γ(A) or γ(B) will be zero.


Since γ is still a coupling, it must satisfy the marginalization constraints. Specifically,

γ(C ∪ A) = γ((−∞, a] × (−∞, b] ∪ (−∞, a] × (b, ∞))
         = γ((−∞, a] × R) = µ((−∞, a]) = F_µ(a).

Similarly γ(C ∪ B) = F_ν(b), and Equation (2.17) becomes

γ(C) = F_µ(a) ∧ F_ν(b).

Since sets of the form (−∞, a] × (−∞, b] are enough to fully characterize the measure γ, we see that γ agrees with γ_mon. This demonstrates that any γ satisfying Property (2.16) agrees with γ_mon, yielding γ_mon as the unique coupling with monotone support. This completes the proof.

2.4.3. c-Cyclic Monotonicity in One Dimension We know that it is equivalent for a coupling to be optimal and to have c-cyclically monotone support. In one dimension, for a cost function of the form c(x, y) = h(y − x) with h strictly convex, c-cyclically monotone support becomes equivalent to monotone support. This is now a statement about pairs of points in the support rather than the full statement of c-CM, which is a condition on any finite collection of points in the support. Not only is this a simpler constraint to satisfy, but also we saw in Lemma 2.46 that the monotone coupling is the unique coupling which satisfies it. This leads us to the conclusion that the monotone coupling is the optimal coupling, which is the content of the following theorem.

Theorem 2.47 ([50, Pg. 63]). Let h : R → R⁺ be a strictly convex function and µ, ν ∈ P(R) be probability measures. Consider the cost c(x, y) = h(y − x) and suppose that the Kantorovich problem has a finite value. Then the Kantorovich problem has a unique solution given by γ_mon, the monotone coupling between µ and ν.

Moreover, if strict convexity is withdrawn and h is only convex, then the same γ_mon is an optimal transport plan, but it may no longer be unique.

Proof. The full proof may be found in [50, p. 63]. The proof proceeds by showing that the support must satisfy

(x, y), (x′, y′) ∈ supp(γ), x < x′ ⟹ y ≤ y′,

from which Lemma 2.46 yields γ = γ_mon.

Since the class of strictly convex cost functions includes |x − y|^p when p > 1, Theorem 2.47 applies to all of the Wasserstein p-distances with p ≠ 1. Theorem 2.47 further demonstrates that optimal transport in one dimension is much simpler than in higher dimensions because in many cases it reduces to monotone transport. We note that typically when looking at Wasserstein p-distances, one considers measures in P_p(X), i.e. the set of measures with finite p-th moment. This restriction is made to ensure that the Kantorovich problem has a finite value. However, since this latter constraint is explicitly included in the hypothesis, the former is not required by Theorem 2.47.

Book shifting (p =1). It is worth noting the broader class of transport plans that are optimal for p = 1 but are not optimal for any other p. Such plans generally fall under the category of book shifting, and can be described by the following example:

Example 2.48. Let

µ = (1/N) Σ_{i=0}^{N−1} δ(x − i)   and let   ν = (1/N) Σ_{j=1}^{N} δ(x − j).

Clearly, µ and ν have the same shape, with ν merely shifted over by one unit. The principle this example demonstrates is commonly called book shifting, because one can imagine each delta function as representing a book on a shelf. The optimal transport plan for c(x, y) = |x − y|^p with p > 1 would be the monotone coupling, which would send the book at position i to i + 1. This preserves the order of the books and each book moves the same distance. While this is still an optimal coupling for p = 1, it is no longer the only optimal coupling. In particular, we can achieve the same distance by moving any book to position N, so long as we then move an earlier book into the new vacancy, a process that continues until we move the book at position 0. Hence there will still be monotonicity among the books that are moved, but not among the entire collection. This added flexibility comes because the Wasserstein 1-distance depends only on the difference between the measures, as can be seen most easily in the dual problem formulation. From [56, Pg. 95],

W_1(µ, ν) = sup_{‖φ‖_Lip ≤ 1} { ∫_X φ dµ − ∫_X φ dν } = sup_{‖φ‖_Lip ≤ 1} { ∫_X φ d(µ − ν) }.   (2.18)

Equation (2.18) looks strange because when c(x, y) = dist(x, y), the c-convex functions are 1-Lipschitz functions and are their own c-transforms. This form clearly illustrates that in the dual problem, it does not matter where the common mass of the measures µ and ν is; all that matters is their difference. It is also useful to note that when p ≠ 1, it does matter where the common mass is. In our example µ and ν have common mass at positions 1 through N − 1, and the optimal plan is to shift all of the books over by one position for a total cost of 1. Were we only aware of the difference between the two measures, then we would be forced to move the book from position 0 to position N at a total cost of N^{p−1}, which is greater than 1 as soon as p > 1.


Book shifting is a phenomenon that is present in higher dimensions as well.

2.4.4. Optimal Transport Maps In this section we have talked exclusively about optimal transport plans and the Kantorovich form of the problem. Before moving on, it is necessary to mention the following theorem.

Theorem 2.49 ([50, Pg. 61]). Given µ, ν ∈ P(R), suppose that µ is atom-less. Then there exists a unique non-decreasing map T_mon : R → R such that (T_mon)_# µ = ν.

Proof. The full proof is available in [50]. The idea is that since µ is atom-less, (F_µ)_# µ = L|[0,1]. Then from Proposition 2.43 we know that (G_ν)_#(L|[0,1]) = ν, and so the map

T_mon(x) := G_ν(F_µ(x))

satisfies the push-forward criterion. It is shown that this map is well defined µ-a.e.

With the additional hypothesis that µ is atom-less appended to Lemma 2.46 and Theorem 2.47, we can further conclude that γ_mon is the coupling induced by T_mon.
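Theorem 2.49 also has a direct computational analogue: compose an empirical CDF with the target quantile function. The sketch below is a minimal illustration, assuming µ is represented by samples of an atom-less distribution and choosing ν to be a standard normal purely for convenience; it pushes the samples through T_mon(x) = G_ν(F_µ(x)) and checks the first two moments of the result.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 2.0, size=20000)            # samples of an atom-less mu

# F_mu evaluated empirically at each sample (rank / n), then G_nu applied.
F_mu = (np.argsort(np.argsort(x)) + 0.5) / x.size
T_mon = norm.ppf(F_mu)                            # G_nu for nu = standard normal

print(T_mon.mean(), T_mon.std())                  # approximately 0 and 1
```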

Chapter 3

Theoretical and Experimental Results in Semi-Discrete Optimal Transport

Optimal transport became more useful when it became more usable. The initial step in this direction was re-framing the problem in terms of couplings and solving it as a linear programming problem. There have been numerous ingenious advances since then, many of which are compiled in [42] and [50, Ch. 6]. This chapter considers semi-discrete optimal transport (SDOT); we specifically focus our attention on optimal transport between a purely atomic measure and a non-atomic measure. This restriction allows us to take advantage of the dual problem and make the problem computationally tractable. Our spaces X, Y in this section will both be R², and the measure ν will be absolutely continuous with respect to the Lebesgue measure.


Section 3.1 Semi-Discrete Optimal Transport

The primary reference for this section is [28], with more specific references given when appropriate. We start this section with another economic interpretation of the dual problem. Rather than thinking about the production side as we did in our first analogy, here we think about bread shops and the customers who choose to buy from them. We have a distribution of bread shops that is given by µ(x) and which is concentrated on the store locations {x_i}_{i=1}^{n}. We model this as a finite sum of δ-functions weighted by the capacity of bread that they can sell. These bread shops sell bread of indistinguishable quality at market price ψ(x). The market price function ψ(x) only matters at the locations of the bread shops. The customer base is diffuse and spread throughout the region and has no atoms, so we model it as an absolutely continuous measure. This customer base is cheap and lazy in equal proportions. In deciding which bread shop they go to, a customer at y cares only about the price of the bread ψ(x) and the transport cost of moving themselves to the shop c(x, y). The total price that a customer at y must pay is then c(x, y) + ψ(x). As a cheap and lazy customer, the customer at y will choose the bread shop which has the lowest total cost. Taking this minimum (since it is the infimum of a finite set), we recover

min_{1 ≤ i ≤ n} { c(x_i, y) + ψ(x_i) },

which is the c-transform of the market price ψ. We will move into a more technical consideration of these details; however, this analogy is very useful in understanding the role that each of the components has to play.


Grouping together the customer locations that all 'shop' at the same bread shop will nearly partition the region into disjoint sets called Laguerre cells. We say 'nearly' because the intersections will have measure zero when the customer base ν is absolutely continuous with respect to a reference measure. This will break up R² into the Laguerre cells associated with the market price ψ, cost function c, and store locations.

3.1.1. Laguerre Cells

In this section ν(y) is our absolutely continuous measure and µ = Σ_{i=1}^{n} µ_i δ_{x_i} is our discrete measure. The points in the support of µ, {x_i}_{i=1}^{n}, will be called nodes. When we consider a price function ψ, we only care about its values on the nodes and write ψ_i = ψ(x_i).

Definition 3.1 ([28]). With a price function ψ, cost function c(x, y), and nodes {x_i}_{i=1}^{n}, the Laguerre cell with node x_i is the collection of points y for which the total cost c(x_i, y) + ψ(x_i) is minimal at x_i when compared to the other node locations, i.e.

Lag^c_ψ(x_i) = { y ∈ Y : c(x_i, y) + ψ_i ≤ c(x_j, y) + ψ_j, ∀ j ≠ i }.   (3.1)

Since we will only discuss one particular cost function, to avoid cumbersome notation we henceforth omit the superscript c. The Laguerre cell Lag_ψ(x_i) will be referred to as the i-th Laguerre cell and will be written as L_i.

Increasing a price ψ_i to ψ′_i = ψ_i + ε will shrink the Laguerre cell. This is because an increase in the market price ψ_i must be matched by a decrease in the transport cost to preserve the inequality in Equation (3.1). With our cost function being the squared distance, this means the points must be closer to the node. This reflects the economic intuition that raising the price means that you will attract fewer customers and that lowering the price will attract more. Laguerre cells are a function of the relative prices between the nodes, so if two price functions differ by a constant, then they form the same Laguerre cells. Laguerre cells are a generalization of Voronoi cells, which partition a space into the regions closest to a set of nodes. The Laguerre cells generated by ψ ≡ 0 are Voronoi cells. Interestingly, when the cost function is the squared Euclidean distance, Laguerre cells can also be computed by creating the Voronoi cells of a space which is one dimension higher. The extra dimension is used to lift the node above the plane we care about according to the price on the node. The intersection of Laguerre cells will be contained within the set of points y where

c(x_i, y) + ψ_i = c(x_j, y) + ψ_j.   (3.2)

With our cost function c(x, y) = ‖x − y‖², the set of points y satisfying Equation (3.2) forms a hyperplane. Specifically, starting from

Lag_ψ(x_k) ∩ Lag_ψ(x_j) ⊂ { y : ‖x_k − y‖² + ψ_k = ‖x_j − y‖² + ψ_j },   (3.3)

and then manipulating the equality we obtain

0 = ‖y − x_k‖² + ψ_k − ‖y − x_j‖² − ψ_j
  = ‖y‖² + ‖x_k‖² − 2⟨y, x_k⟩ − ‖y‖² − ‖x_j‖² + 2⟨y, x_j⟩ + ψ_k − ψ_j
  = 2⟨y, x_j − x_k⟩ + ‖x_k‖² − ‖x_j‖² + ψ_k − ψ_j.   (3.4)

Thus, the border between two cells lies on the hyperplane defined by

⟨y, x_j − x_k⟩ = ½ ( ‖x_j‖² − ‖x_k‖² + ψ_j − ψ_k ).   (3.5)

Each cell will be the intersection of regions bounded by hyperplanes, showing that

the Laguerre cells are convex shapes. Additionally, the boundaries are ν-negligible because they are segments of hyperplanes. Any price function partitions the space into Laguerre cells, yielding a joint density that sends all of the mass in each Laguerre cell to its node. We do not call this a coupling, because this joint probability does not have the correct marginals unless the mass in each Laguerre cell matches the mass on its node. This is interesting to note, because each of these measures has c-CM support, and so is optimal for the marginals which it has. This is because the dual problem is searching for feasibility (compliance with the marginalization constraint) as opposed to searching for optimality. Every set of prices not only gives an optimal coupling, but also an optimal map in one direction. There is no optimal map from the discrete distribution µ to the diffuse distribution ν; however, there is a T_ψ such that (T_ψ)_# ν = µ which sends all of the mass in a Laguerre cell to the node of that cell. It may be useful to see how Theorem 2.27, about the solvability of the Monge problem, is satisfied in this case. Theorem 2.27 says that the Monge solution exists when the optimal cost is finite and when the set of points where the subdifferential of any c-convex function contains more than one element is µ-negligible. We are guaranteed to have finite cost because we are considering the case when we already have an optimal coupling with finite cost. So we focus on the second criterion, about the subdifferential. Observe that the second criterion is not met when considering whether there is an optimal map from µ to ν, as it would require a map which sends the point masses to the diffuse measure. The subdifferential of our price function, ∂_c ψ(x_i), at a node x_i is the set of points y where ψ^c(y) − ψ(x_i) = c(x_i, y), or where ψ^c(y) = ψ(x_i) + c(x_i, y). This holds for the entire Laguerre cell associated to the node, so the subdifferential contains more than a single point. This is at a node of µ, so it is not µ-negligible and thus

we do not satisfy the criterion. The optimal map does exist from ν to µ, however. This can be seen by amending Theorem 2.27 to include a statement about the superdifferential of ψ, or by transforming the dual pair (ψ, φ) for µ and ν into the dual pair for ν and µ. As mentioned in Section 2.2.3, if (ψ, φ) is the dual pair for µ and ν, then (φ, ψ) is the dual pair for ν and µ. The superdifferential of ψ is the same as the subdifferential of −ψ. In our problem φ(y) = ψ^c(y), and a point x is in the subdifferential ∂_c(ψ^c)(y) if ψ^c(y) − ψ(x) = c(x, y). The points in the interior of a Laguerre cell have only the node of that cell in their subdifferential, and the optimal map sends all points in the cell to the node. However, the points on the boundaries between cells have both of the corresponding nodes in their subdifferential, and these are the only such points. Since ν is absolutely continuous with respect to the Lebesgue measure and the boundaries of the Laguerre cells are subsets of hyperplanes, they are ν-negligible. Thus the set of points y where the subdifferential has more than one element is ν-negligible. It follows that there is an optimal transport map from ν to µ, which we call

T_ψ because it is determined by the prices we put on the nodes. When ν is supported on the entire plane, T_ψ is uniquely determined by ψ, up to an additive constant on the prices.

The Dual Functional. In discussing the Laguerre cells it will be useful to introduce a functional F(ψ), defined as

F(ψ) = ∫_Y ψ^c(y) dν(y) − Σ_{i=1}^{n} ψ_i µ_i.   (3.6)


We will call this the dual functional because it evaluates candidate functions in the dual problem. This means that sup_ψ F(ψ) = C(µ, ν). Expanding out the c-transform and partitioning the integral into Laguerre cells, we can see that

F(ψ) = ∫_Y min_i ( ψ(x_i) + c(x_i, y) ) dν(y) − ∫_X ψ(x) dµ
     = Σ_{i=1}^{n} ∫_{Lag_ψ(x_i)} ψ(x_i) dν + Σ_{i=1}^{n} ∫_{Lag_ψ(x_i)} c(x_i, y) dν − Σ_{i=1}^{n} ψ_i µ_i
     = Σ_{i=1}^{n} ψ_i ( ∫_{L_i} dν − µ_i ) + Σ_{i=1}^{n} ∫_{L_i} c(x_i, y) dν.   (3.7)

We can intuit several properties from Equation (3.7); for full proofs we refer the reader to [15, 28, 57].

1. The functional F(ψ) is the sum of two parts: the former explicitly depends on the price function, and the latter implicitly depends on the price function through the Laguerre cells. When we have the correct price function, this latter part, Σ_{i=1}^{n} ∫_{L_i} c(x_i, y) dν, is equal to the transport cost, i.e. the squared Wasserstein 2-distance. When we do not have a price function which solves the dual problem, then this is the transport cost of an optimal transport plan with different marginals.

2. Since the value of the dual problem never exceeds that of the primal problem and the latter part of Equation (3.7) will be equal to the primal cost, the former part Σ_{i=1}^{n} ψ_i ( ∫_{L_i} dν − µ_i ) must equal zero for the correct ψ. This tells us that the mass in the i-th Laguerre cell must equal µ_i, and it also provides the marginalization constraint in the dual problem.


3. The transport cost portion has no explicit dependence on ψ, and

∂F(ψ)/∂ψ_i = ∫_{L_i} dν − µ_i.   (3.8)

We will see that this allows for the optimal prices to be found through straightforward algorithms.

These results are presented here without proof; however, the third result merits a brief note. The idea used in showing Equation (3.8) in [28] is that the price function and the regions are different variables, which removes the implicit dependence between them. Practically, this means that when taking the first derivative, the focus is on the argument of the integral rather than the change to the boundary. We will see that the results for the second derivative are more nuanced. The classic setting for SDOT is not only that ν is absolutely continuous, but also that it is supported on the entire region. We will see later that for our application this is not true, but only a minor modification to the algorithm is necessary.

3.1.2. Theoretical Results on SDOT We now present some minor theoretical results which may be useful in schemes moving forward. While minor, we have not seen these results formally presented elsewhere.

Lemma 3.2. Let ψ be a price function on the nodes {x_i}_{i=1}^{n} and let c(x, y) = ‖x − y‖². Also let h be some translation vector and let

ψ′_i = ψ_i − 2⟨x_i, h⟩,

and a new set of nodes be given by x′_i = x_i + h. Then

Lag^c_ψ(x_i) = Lag^c_{ψ′}(x_i + h).


Proof. This proof relies heavily on the cost function having the form c(x, y) = ‖x − y‖². Noting Equation (3.4), the Laguerre cells are entirely determined by the hyperplanes that define their boundaries. Hence showing that ψ′ generates the same boundaries as ψ is equivalent to showing that they form the same Laguerre cells. That is, we aim to show that y satisfies

c(x_i, y) + ψ_i = c(x_j, y) + ψ_j   ⟺   c(x_i + h, y) + ψ′_i = c(x_j + h, y) + ψ′_j.

Observe that

c(x_i + h, y) + ψ′_i = c(x_j + h, y) + ψ′_j
⟺ ‖x_i − (y − h)‖² + ψ_i − 2⟨x_i, h⟩ = ‖x_j − (y − h)‖² + ψ_j − 2⟨x_j, h⟩
⟺ ‖x_i‖² + ‖y − h‖² − 2⟨x_i, y − h⟩ + ψ_i − 2⟨x_i, h⟩ = ‖x_j‖² + ‖y − h‖² − 2⟨x_j, y − h⟩ + ψ_j − 2⟨x_j, h⟩
⟺ ‖x_i‖² + ‖y‖² − 2⟨x_i, y⟩ + ψ_i = ‖x_j‖² + ‖y‖² − 2⟨x_j, y⟩ + ψ_j
⟺ c(x_i, y) + ψ_i = c(x_j, y) + ψ_j.

The first three lines merely expand out definitions and rearrange terms. The transition from the third to the fourth involves subtracting the common term ‖y − h‖², adding ‖y‖² to both sides, and cancelling the +2⟨x, h⟩ with the −2⟨x, h⟩ for x_i and x_j. The final step is putting the terms back into the cost function. This shows that the boundaries of the Laguerre cells are the same, implying that the Laguerre cells themselves must be the same. This completes the proof.

Theorem 3.3. Let ψ be the optimal price function between the distributions µ = Σ_{i=1}^{N} µ_i δ(x − x_i) and ν = ν(x) dx, and let c(x, y) = ‖x − y‖². Then the optimal price function ψ′ between ν and (S_h)_# µ = Σ_{i=1}^{N} µ_i δ(x − (x_i + h)) is

ψ′_i = ψ_i − 2⟨x_i, h⟩.   (3.9)

Proof. From Lemma 3.2, we see that the Laguerre cells from ψ and nodes {x_i} are the same as the Laguerre cells from ψ′ and nodes {x_i + h}. Since ψ is optimal, we have

∫_{Lag^c_ψ(x_i)} dν − µ_i = 0.   (3.10)

As the Laguerre cells are the same, and ν is unchanged, this implies that

∫_{Lag^c_{ψ′}(x_i + h)} dν − µ_i = 0.   (3.11)

Thus from [24, 28] we know that the transport plan given by the map T_{ψ′} is optimal and that ψ′ is unique up to an additive constant. This completes the proof.
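A quick numerical check of Lemma 3.2 and Theorem 3.3, reusing the hypothetical `laguerre_labels` helper sketched in Section 3.1.1: shifting the nodes by h while shifting the prices by −2⟨x_i, h⟩ should leave the Laguerre cells, and hence the cell masses, unchanged.

```python
import numpy as np

def laguerre_labels(nodes, psi, points):
    cost = ((points[None, :, :] - nodes[:, None, :]) ** 2).sum(-1) + psi[:, None]
    return np.argmin(cost, axis=0)

rng = np.random.default_rng(2)
nodes = rng.uniform(0.0, 1.0, size=(5, 2))
psi = rng.uniform(-0.1, 0.1, size=5)
points = rng.uniform(0.0, 1.0, size=(10000, 2))

h = np.array([0.4, -0.7])
psi_shift = psi - 2.0 * nodes @ h     # price update from Lemma 3.2 / Theorem 3.3

agree = np.mean(laguerre_labels(nodes, psi, points)
                == laguerre_labels(nodes + h, psi_shift, points))
print(agree)   # 1.0, up to floating-point ties on cell boundaries
```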

Section 3.2 Algorithmic SDOT

The previous section outlines the theory of optimal transport between a discrete distribution and a continuous one, the setting of SDOT. This theory allows us to iteratively search for the maximum of the dual problem, which equals the squared Wasserstein distance. The implementation has its roots in computational geometry [7], and is further developed in [15, 24, 26, 28, 37, 57]. There are minor differences between what we present here and these papers; some differences are due to notation, while others are due to using a different but equivalent formulation of the dual problem.


The principle of algorithmic SDOT is that we are able to create Laguerre diagrams for a given price function, and so by maximizing the dual functional F in Equation (3.6), we are able to find the Wasserstein distance between the discrete distribution and the continuous one. This setting is already easier than general optimal transport, and it is made easier still by the following theorem.

Theorem 3.4. [28, Pg. 12] The objective function F of the semi-discrete Kantorovich dual problem in Equation (3.6) is concave.

The concavity of the dual problem implies that there is a unique maximum value and allows us to search for the optimal price in straightforward ways. While Theorem 3.4 states that the objective is concave, it is not strictly concave. One would need to place additional assumptions on the continuous measure ν(y) in order to ensure strict concavity, such as being diffuse over the entire region. While it would be possible to simply use a gradient ascent approach using the derivative of F given by Equation (3.8), we employed approaches that took advantage of the analytic Hessian information that is available. Our original approach was to use a damped Newton's method with backtracking; however, the current implementation uses a trust region Newton's method. To use any Newton's method, we need the Hessian of F, which is given in [24, Pg. 6] and [28, Pg. 15] (although not in this exact form):

∂²F/(∂ψ_i ∂ψ_j)(ψ) = ∫_{Lag_ψ(x_i) ∩ Lag_ψ(x_j)} ν(y) / ( 2 ‖x_i − x_j‖ ) dS(y),   i ≠ j,   (3.12)

and

∂²F/∂ψ_j² = − Σ_{i ≠ j} ∂²F/(∂ψ_i ∂ψ_j)(ψ).   (3.13)

We will refer to the Hessian matrix given by Equations (3.12) and (3.13) as H or


H(ψ). The Hessian is never invertible because H1 = 0, where 1 is the all-ones vector, as we can see by rearranging Equation (3.13). If ν(y) is diffuse and supported over the entire domain, then 1 spans the kernel even for a non-optimal ψ. The conditions in [24] loosen that requirement but still ensure that the kernel of H is spanned by 1. This element of the kernel has already been explained in economic terms when we noted that when a constant function is added to the price function, the Laguerre cells do not change. We circumvent this issue by fixing the price of the n-th node while allowing the others to vary; of course, any other node could be fixed instead. By setting ψ_n = 0 and calling the remaining prices ψ̃, we are able to solve the problem F̃(ψ̃) = F((ψ̃, 0)). When ν(y) is diffuse and supported over the entire domain, F̃ has an invertible Hessian, H̃, given by the (n − 1) × (n − 1) minor of H, i.e. by dropping the last row and column. This will allow us to use a modified Newton's method to more quickly find the correct prices.

Algorithms for computing the optimal price. The general procedure of any algorithm to determine the optimal price can be seen in Algorithm 1.


Algorithm 1: Solving the dual problem in SDOT.
Result: Solve the dual problem.
Input the distributions µ and ν.
Initialize the price function ψ⁰ and set the convergence tolerance tol.
while convergence is not met do
    Compute the Laguerre diagram according to ψ^k.
    Compute the masses in the cells, ν⃗.
    if ‖ν⃗ − µ⃗‖ < tol then
        break
    else
        Call a subroutine to obtain the step d.
        Update the prices to form ψ^{k+1} = ψ^k + d.
    end
end
Return the optimal prices ψ and the squared Wasserstein cost F(ψ).

Remark 3.5. We note that the initial code for all of the simulations and experiments was written in Python. This was later updated to C++, which allowed the use of the Computational Geometry Algorithms Library, CGAL [53]. CGAL is specifically designed for computational geometry, which greatly increased the efficiency.1

There are different ways to implement Algorithm 1, with the most notable contrast occurring in how we update the prices. The updates discussed in this thesis are a gradient step and two variants of Newton's method. All of these update the prices by setting ψ^k = ψ^{k−1} + τ^k p^k, where τ^k is the step size of the k-th step and p^k is the

1Thanks to Dr. Matthew Parno of CRREL for providing the base of the C++ code which uses CGAL [53].

step direction. In either case, p^k is an ascent direction, meaning that for some small enough τ, F(ψ^k) > F(ψ^{k−1}). Since our dual functional F is concave, these algorithms approach the maximum value (recall that optimal transport looks for the minimum transport cost, while the dual problem looks for the maximum). We set a non-zero convergence tolerance when implementing these algorithms, so there is some necessary error in our calculation of the Wasserstein distance. In Chapter 4, this leads to a small amount of 'noise' appearing in computations, but it does not obscure the results.

Gradient Step. A gradient step algorithm attempts to find the maximum of the dual problem by looking at the gradient of the objective and following that direction. With a gradient step the step is proportional to the gradient, so p = ∇F(ψ). Recall Equation (3.8),

∂F/∂ψ_i = ∫_{L_i} dν(y) − µ_i.

By allowing ν⃗ to be the vector with entries equal to the masses in the Laguerre cells with price function ψ, and µ⃗ to be the vector of weights on the nodes, the gradient of F is ν⃗ − µ⃗. Note that both µ⃗ and ν⃗ represent probability vectors since their respective entries sum to 1. A gradient step method for finding the optimal prices is therefore given by

ψ^k = ψ^{k−1} + τ ( ν⃗ − µ⃗ ).   (3.14)

To ensure that this converges to the solution, a line search or other globalization strategy must also be employed [12]. The specific values of ψ obviously depend on µ and ν; however, the possible range of ψ depends on the diameter of the domain. This possible range should inform the choice of the parameter τ. To illustrate this effect, first consider the case where both measures are supported in the domain Ω₁ = [0, 1]². The maximum cost c(x, y) = ‖x − y‖² is equal to 2. If, however, both measures are supported in the domain Ω₂ = [0, 100]², then the maximum cost is 20,000. Since we are solving a problem looking for the minimal total cost, it is unlikely that any pair (x, y) in the support of the optimal coupling achieves the maximal cost. Nevertheless, we should allow ψ to have a range such that it could put any y into any Laguerre cell. In this way, while it is unlikely for ψ to take on the maximal cost, the range of possible values will still be proportional to it. Finally, we note that the gradient p is the difference between two probability vectors, so its entries sum to 0 and it has a maximum magnitude of 2, while the range of ψ is proportional to the diameter of the domain. Thus the value of τ should grow when the diameter of the domain under consideration grows. We do not give a specific recommendation here, except to warn against applying a method tuned for a domain Ω₁ = [0, 1]² to a domain Ω₂ = [0, 100]², because the required number of steps would grow.
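A minimal dense-grid implementation of the gradient step in Equation (3.14), given purely as an illustration: ν is approximated by a uniform grid on the unit square, the cell masses are computed with the hypothetical `laguerre_labels` helper from Section 3.1.1, and a fixed step size τ is used without the line search discussed above.

```python
import numpy as np

def laguerre_labels(nodes, psi, points):
    # Laguerre cell index of each point for c(x, y) = ||x - y||^2 (see Section 3.1.1).
    cost = ((points[None, :, :] - nodes[:, None, :]) ** 2).sum(-1) + psi[:, None]
    return np.argmin(cost, axis=0)

def cell_masses(nodes, psi, points, weights):
    labels = laguerre_labels(nodes, psi, points)
    return np.bincount(labels, weights=weights, minlength=len(nodes))

# Discrete measure mu: four weighted nodes.  nu: uniform density on the unit square.
nodes = np.array([[0.2, 0.2], [0.8, 0.2], [0.2, 0.8], [0.8, 0.8]])
mu = np.array([0.1, 0.2, 0.3, 0.4])
g = (np.arange(128) + 0.5) / 128
points = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
weights = np.full(len(points), 1.0 / len(points))

psi, tau = np.zeros(len(nodes)), 0.25
for k in range(2000):
    grad = cell_masses(nodes, psi, points, weights) - mu   # dF/dpsi_i, Eq. (3.8)
    psi = psi + tau * grad                                  # ascent step, Eq. (3.14)

# Residual should be small, limited by the grid discretization of nu.
print(np.abs(cell_masses(nodes, psi, points, weights) - mu).max())
```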

Newton’s Method. Newton’s method is a standard root finding method, see e.g. [39]. The idea of it in this application is that we are looking for a root of the function F ( ), and that because F is a concave function, the root of its gradient corresponds r to the maximal value of the function. For a one dimensional smooth function, f(x), Newton’s method starts with an initial guess x0 and computes the linear approximation of f(x)atx0, which is

0 0 0 f(x )+f 0(x )(x x ). The next iterate is then given by the the root of the lin- i+1 i i 1 i ear approximation at the previous step, x = x f 0(x ) (f(x )). We can apply Newton’s method to the problem of finding the optimal prices by looking at the functional F˜( ˜)=F (( ˜, 0)). The gradient F˜( ˜) has the same entries r as F ( ), except for the missing term stemming from the node that has a fixed price. r In order to apply this method, we need H˜ , the Hessian of F˜, to be invertible. This can be done either by requiring ⌫(y)tosatisfyadditionalhypothesis,ormodifying

the method so that invertibility is not required. For this problem, Newton's method takes the form

ψ̃^k = ψ̃^{k−1} − τ ( H̃ )^{−1} ∇F̃(ψ̃^{k−1}).   (3.15)

While the simplest version of either of these algorithms would use a fixed step size, better theoretical and practical results are obtained by allowing τ to vary; one particular way to do so is by implementing back-tracking [39]. Back-tracking is designed to first find the step direction and then compare F(ψ^k) to F(ψ^{k−1}). If F(ψ^k) ≤ F(ψ^{k−1}) we have 'overstepped' our ascent direction and hence need to 'back-track'. It is common to then halve the step size and reset so that ψ^k = ψ^{k−1} + (τ^k / 2) p. This can be done as many times as necessary to ensure that we are approaching a maximum. Back-tracking can be done with either algorithm. Newton's method is a higher-order method and has faster theoretical convergence rates; however, there are additional assumptions needed in order to implement it reliably, namely the ability to calculate H̃^{−1} at each iterate. We implemented an approach that can be thought of as a hybrid between Newton's method and a gradient step.

3.2.1. Algorithmic Modifications

The two assumptions necessary to ensure the invertibility of H̃, and hence the convergence of the damped Newton's method in [24], are:

1. Prices at each step form non-empty Laguerre cells.

2. The density ν(y) satisfies a weighted L¹-Poincaré-Wirtinger inequality.

An empty Laguerre cell means that an entire row and column of the Hessian will have zero entries, adding an additional dimension to the kernel. The issue for one-dimensional kernels is easily rectified, since finding the probability vector ν⃗ is not impacted – determining all but one of the entries is equivalent to determining all the entries. This is not the case for a two or higher dimensional kernel. The second assumption ensures that the support of ν(y) is connected, and connectedness, although not equivalent, acts as a de facto definition for satisfying a weighted L¹-Poincaré-Wirtinger inequality. When the support is disconnected, it is possible for the boundary between two Laguerre cells to fall into a region where there is no mass. This leads the integral along the boundary to be 0, and can lead to the Hessian being non-invertible. We employed three techniques – initialization through translation, Hessian regularization, and a trust region – that may help make schemes more robust to the obstructions that the above assumptions are meant to circumvent. We describe them below.

Initialization through translation. The first modification can be considered as an application of either Lemma 3.2 or Theorem 3.3, both of which relate the optimal set of prices for a set of nodes to the optimal set of prices for a translated set of nodes. This is useful because while ψ ≡ 0 may form empty Laguerre cells for µ and ν (making it a poor initialization), it may be a good initialization for a translation of µ (and untranslated ν). Figure 3.1 provides an example of a problem which we may want to solve using SDOT that benefits from our proposed modification. Translating the nodes so that the distributions have the same center of mass is computationally convenient. Thinking of the dual problem as having a 'mean' and 'shape' component like the primal problem, altering the prices in this way solves the 'mean' component directly and allows us to focus the problem only on the 'shape'. We will now walk through the initialization-through-translation technique with the two approaches, one following Lemma 3.2 and the other Theorem 3.3. Let W and M be the centers of mass of µ and ν respectively; then the corresponding shift is h = M − W. Let the set {x_i}_{i=1}^{n} be the set of untranslated nodes, with corresponding price function ψ, and let the set {x_i + h}_{i=1}^{n} be the set of translated nodes, with price function ψ′. The two approaches we now describe do not yield significant differences.

1. Initialize the prices as ψ⁰_i = 2⟨x_i, h⟩ for i from 1 to n, rather than ψ⁰ ≡ 0. Recall from Lemma 3.2 that this is equivalent to having ψ⁰ ≡ 0 if the distribution were translated to have the same center of mass as the continuous distribution. After this shift in the initialization, the rest of the optimization proceeds as usual. Note that the n-th price (or any other) could either be fixed at this price or it could be set to zero by subtracting the price on that node from all of the others. Setting the n-th price to zero yields the price function ψ⁰_i = 2⟨x_i, h⟩ − 2⟨x_n, h⟩ = 2⟨x_i − x_n, h⟩.

2. Solve the optimization problem for the translated set of nodes. In this case, the initialization price (ψ′)⁰ ≡ 0 would be more appropriate. After finding the optimal price function ψ′ for the translated nodes, Theorem 3.3 tells us that the optimal prices for the original, untranslated nodes are ψ_i = ψ′_i + 2⟨x_i, h⟩ (up to an additive constant).

Observe that both approaches similarly alter the prices. The difference is whether the initialization prices or the optimal prices are being modified. Both approaches provide 'shortcuts' to the optimization problem by relating the problem to an easier version, which we have found useful in practice. Figure 3.1 shows a uniform density spread over a rectangular region and a discrete density given by µ = ½(δ_{x_1} + δ_{x_2}), with x_1 = (−2, 0) and x_2 = (0, 0). To achieve the boundary shown in Figure 3.1(b), we let h = (2, 0) and set the prices to be

ψ_i = 2⟨h, x_i⟩,

so ψ = (−8, 0). These prices reflect that when we are looking at the nodes shifted by


(a) Original Laguerre Diagram. (b) Modified Laguerre Diagram.

Figure 3.1: A uniform density over a rectangular region (shaded in blue) and a discrete density equally spread over the two (red) circular nodes. (a) Laguerre Diagram when the initial prices are equal to zero. (b) Laguerre Diagram modified based on the adjusted prices using Theorem 3.3. Note that the mass is now correctly distributed into both Laguerre cells.

h, then the optimal prices ψ′ in (3.9) would be ψ′ ≡ 0.

Hessian Regularization. Our second modification is devised to circumvent the requirement that the support of the continuous density be connected. Originally motivated by wanting to use SDOT for image processing of sea ice images as in [41], it is useful in this work and will allow SDOT to be used in a variety of image processing applications. Figure 3.2 displays a simple example of such a disconnected support.


(a) Original Laguerre Diagram. (b) Optimal Laguerre Diagram.

Figure 3.2: The boundaries of the Laguerre cells are shown by the red dashed lines. The initial prices ψ_i ≡ 0, i = 1, ..., 4, cause two Laguerre cells to be empty.

As suggested by its name, this modification works by regularizing the Hessian, in particular by replacing H(ψ) with R(ψ) = H(ψ) − ε Id. The definition of the Hessian in Equation (3.13) shows that the Hessian is weakly diagonally dominant with negative entries on the diagonal. Adding −ε Id makes the matrix strongly diagonally

dominant, and therefore invertible. Convergence analysis of regularized Newton's methods can be found in [47]. To understand the effect of regularization, we examine R(ψ) when there is no mass along any of the cell boundaries. In this case, R(ψ) = −εI and our iterative update is correspondingly

ψ̃_{n+1} = ψ̃_n − τ ( −ε Id )^{−1} ∇F̃(ψ̃_n) = ψ̃_n + (τ/ε) ∇F̃(ψ̃_n).   (3.16)

Since there is no information from the Hessian, the regularized Hessian is equivalent to a damped gradient step algorithm. Even when the Hessian is non-zero, the regularized Hessian modification can be thought of as incorporating the gradient step into the update. We can also view the Hessian regularization in terms of the economic analogy for semi-discrete optimal transport. Specifically, the Hessian (and the inverse of the Hessian) describes the situation where all of the store keepers adjust their prices by considering whether they, and adjacent shops, have too many or too few customers. However, the Hessian only contains information when there are customers along the boundaries between Laguerre cells. The regularization term makes it so that a store keeper will also adjust their prices based on whether they alone have too many or too few customers. An example of an optimal transport problem which benefits from using a regularized Hessian can be found in Figure 3.2. Specifically, in this case we cannot find the optimal prices as was done in Figure 3.1 because it would require changing the prices according to two different translations. By contrast, the regularized Hessian would initially lower the prices on nodes x_1 and x_4 while simultaneously raising the prices of x_2 and x_3. Adjustments would then continue until convergence to an optimal solution is reached.


Regularizing the Hessian is the more practical of the two methods discussed, since it ensures that the Hessian is invertible, thereby circumventing the need for the assumptions of non-empty Laguerre cells and connected support. Additionally, while it is useful to understand the effect of translations on the optimal prices, it takes further processing to determine the necessary translation for the price shift to be used.
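The sketch below illustrates the regularized update on the same kind of toy problem as the gradient-step sketch above, again assuming a grid discretization of ν. Since the analytic Hessian of Equations (3.12)–(3.13) requires boundary integrals, a finite-difference approximation `hessian_fd` is used here purely as a stand-in; only the step ψ ← ψ − τ(H − ε Id)⁻¹∇F is meant to reflect the method described above.

```python
import numpy as np

def laguerre_labels(nodes, psi, points):
    cost = ((points[None, :, :] - nodes[:, None, :]) ** 2).sum(-1) + psi[:, None]
    return np.argmin(cost, axis=0)

def grad_F(nodes, psi, points, weights, mu):
    labels = laguerre_labels(nodes, psi, points)
    return np.bincount(labels, weights=weights, minlength=len(nodes)) - mu

def hessian_fd(nodes, psi, points, weights, mu, delta=0.05):
    """Finite-difference stand-in for the analytic Hessian of Eqs. (3.12)-(3.13)."""
    n, g0 = len(nodes), grad_F(nodes, psi, points, weights, mu)
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = delta
        H[:, j] = (grad_F(nodes, psi + e, points, weights, mu) - g0) / delta
    return 0.5 * (H + H.T)

# Same toy problem as in the gradient-step sketch above.
nodes = np.array([[0.2, 0.2], [0.8, 0.2], [0.2, 0.8], [0.8, 0.8]])
mu = np.array([0.1, 0.2, 0.3, 0.4])
g = (np.arange(128) + 0.5) / 128
points = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
weights = np.full(len(points), 1.0 / len(points))

psi, eps, tau = np.zeros(4), 0.5, 1.0
for k in range(50):
    grad = grad_F(nodes, psi, points, weights, mu)
    R = hessian_fd(nodes, psi, points, weights, mu) - eps * np.eye(4)  # R = H - eps*Id
    psi = psi - tau * np.linalg.solve(R, grad)                         # regularized Newton step

print(np.abs(grad_F(nodes, psi, points, weights, mu)).max())  # small, grid-limited residual
```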

Trust Region. For a discussion of the trust region Newton's method for convex optimization, see [12, Pg. 515]. We adapt what is there to the concave optimization that we need in our problem; because of this, some expressions may look a bit different from their usual form. To understand the trust region method, we first consider the quadratic approximation to F̃ around ψ̃,

F̃(ψ̃ + v) ≈ F̃(ψ̃) + ⟨∇F̃(ψ̃), v⟩ + ½ vᵀ H̃(ψ̃) v.   (3.17)

We are aiming to find the v that maximizes this approximation. We can ignore the term F̃(ψ̃) that appears on the right-hand side of Equation (3.17) since it does not depend on v, and so we solve for the v that gives us

max_v { ½ vᵀ H̃(ψ̃) v + ⟨∇F̃(ψ̃), v⟩ }.   (3.18)

However, the step v is not well defined when H̃ is singular. In this case, we solve for

max_{‖v‖ ≤ ρ} { ½ vᵀ H̃(ψ̃) v + ⟨∇F̃(ψ̃), v⟩ }.   (3.19)

The set ‖v‖ ≤ ρ is the trust region, and gives the method its name. The trust region and regularized Hessian are different approaches; however, they

are related. We can see this because the solution to the trust region problem in Equation (3.19) also solves

max_v { ½ vᵀ ( H̃ − ε Id ) v + ⟨∇F̃(ψ̃), v⟩ }.   (3.20)

The solution to this is v = −( H̃ − ε Id )^{−1} ∇F̃(ψ̃), which is the same as our step from the regularized Hessian method. With that said, there is not a simple relationship between ρ and ε, so while the methods are similar and related, they are not the same.

3.2.2. Other Regularized Optimal Transport For purposes of completeness, we briefly discuss entropic regularization in optimal transport. Entropic regularization changes the objective in optimal transport to

W²_{2,ε}(µ, ν) = inf_{γ ∈ Π(µ,ν)} { ∫_{X×X} dist(x, y)² dγ − ε H(γ) },   (3.21)

where H(γ) is the entropy of the coupling γ, defined by

H(γ) = − ∫_{X×X} ln( γ(x, y) ) dγ(x, y).   (3.22)

Entropic regularization changes the optimal transport problem into something which is more amenable to computation (by considering a problem that is guaranteed to be strictly convex), and is therefore a popular tool in optimal transport problems. However, the resulting solutions are optimal for the regularized objective, but not for the true transport cost. More information may be found in [14, 51] and in [42, Ch. 4]. We point out that, by contrast, regularizing the Hessian in Equation (3.16) simply applies a different optimization technique to solve the true optimal transport problem.
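For two discrete measures, the standard way to solve the entropy-regularized problem (3.21) is the Sinkhorn algorithm described in [14, 51] and [42, Ch. 4]. The sketch below is a minimal illustration, not part of the SDOT machinery used in this thesis; note how the transport cost of the returned coupling is biased away from the true squared Wasserstein distance, as discussed above.

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.05, n_iter=500):
    """Entropy-regularized OT between two discrete measures via Sinkhorn iterations."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(n_iter):
        u = mu / (K @ v)                 # enforce the row marginals
        v = nu / (K.T @ u)               # enforce the column marginals
    gamma = u[:, None] * K * v[None, :]  # regularized coupling
    return gamma, (gamma * C).sum()      # coupling and its unregularized transport cost

# Toy example: two discrete measures on the line with squared-distance cost.
x = np.linspace(0.0, 1.0, 6)
y = np.linspace(0.3, 1.3, 6)
mu = np.full(6, 1.0 / 6)
nu = np.full(6, 1.0 / 6)
C = (x[:, None] - y[None, :]) ** 2

gamma, cost = sinkhorn(mu, nu, C)
print(cost)   # close to, but biased away from, the true squared W_2 distance of 0.09
```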


3.2.3. Quantization Often we want to represent a continuous probability as a sum of weighted delta functions. This falls under the broad umbrella of quantization, where a continuous quantity (like a function or a measure) is represented by a discrete set of values. Quantization is combined with optimal transport in [10, 11, 15] by using SDOT to measure the misfit of the representation to the original continuous quantity. While the goal of SDOT is to solve the optimal transport problem between a fixed µ and ν, quantization looks at ways to modify the discrete distribution µ to lower the optimal transport cost. Quantization plays an important role in Chapter 4, where nodes are used to represent objects in our models. Quantization concerns itself with two things – the positions of the nodes and their corresponding weights. With the positions of the nodes fixed, the optimal weights are the masses obtained by integrating over the Voronoi cells.2 The optimal transport problem cares only about the transport cost, which is a function of the distance, so it makes sense that the assignment of points to the nearest node is minimal. It also makes sense in our economic analogy, because we use the prices to incentivize a customer to travel a farther distance for their bread. In order to discuss how changing the position of the nodes affects the optimal transport cost, we must first define a Laguerre cell's centroid.

Definition 3.6. The centroid m_i of the i-th Laguerre cell, L_i, is the center of mass of the cell according to ν and is given by

m_i = (1/µ_i) ∫_{L_i} y dν(y).   (3.23)

We note that the weight µi from the discrete distribution µ appears in this definition

2Voronoi cells choose the nearest node for each point, which is equivalent to saying the price function is constant.

because it is the mass (according to the continuous distribution ν) in the Laguerre cell.

The centroid is always within its Laguerre cell, which is not always the case for the node. For fixed weights on the nodes, the gradient of the optimal transport cost is given by

∇_{x_i} C(µ, ν) = ∫_{L_i} ∇_{x_i} c(x_i, y) dν(y),   (3.24)

and in our case by

∇_{x_i} W²(µ, ν) = 2 µ_i ( x_i − m_i ).   (3.25)

As can be inferred from Equation (3.25), the optimal locations for the nodes are where they agree with the centroids. In this context, we mention Lloyd's algorithm [10], a simple gradient step algorithm with linear convergence that computes the centroidal diagram. The algorithm is constructed to move node locations closer to the centroids. The procedure is iterative, since changing the node location changes the Laguerre cell and the centroid, which is akin to hitting a moving target. While we mention Lloyd's algorithm for its simplicity and intuitive understanding, faster algorithms such as a quasi-Newton BFGS [27] are more commonly used. Optimal locations for the discrete representations were not needed in our applications. This is due to the fact that we are only considering rigid objects, so any shape mismatch is a constant and does not affect the optimization. Hence the additional step of determining the centroidal Voronoi diagram is not justified. Instead, we simply choose the representation to capture some relevant feature of the distribution, e.g. we represent a bar with a rectangular grid, and an ellipse with a diamond. In Chapter 4 we consider naive quantizations with predetermined weights and node locations, as well as distributions where we only prescribe the node locations.
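A minimal sketch of the classic Lloyd iteration on a sampled density, assuming the constant-price (Voronoi) setting of footnote 2: each node is repeatedly replaced by the centroid of its cell, in line with the gradient in Equation (3.25).

```python
import numpy as np

def lloyd(points, weights, nodes, n_iter=50):
    """Classic Lloyd iteration: assign mass to the nearest node (Voronoi cells,
    i.e. a constant price function) and move each node to its cell's centroid."""
    for _ in range(n_iter):
        d2 = ((points[None, :, :] - nodes[:, None, :]) ** 2).sum(-1)
        labels = np.argmin(d2, axis=0)
        for i in range(len(nodes)):
            w = weights[labels == i]
            if w.sum() > 0:
                nodes[i] = (w[:, None] * points[labels == i]).sum(0) / w.sum()
    return nodes

# nu: uniform density on the unit square, discretized on a grid; four nodes.
rng = np.random.default_rng(3)
g = (np.arange(100) + 0.5) / 100
points = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
weights = np.full(len(points), 1.0 / len(points))
nodes = rng.uniform(0.0, 1.0, size=(4, 2))

print(lloyd(points, weights, nodes))   # nodes migrate toward their cell centroids
```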

Chapter 4

Parameter Estimation using Optimal Transport Misfit Functions

We now lay out a framework for using the Wasserstein distance as a means of tuning parameters in order to match observations. All of our examples will use misfit functions that incorporate the squared Wasserstein 2-distance. To avoid cumbersome notation, the subscript will henceforth be omitted.


Section 4.1 Simple Examples

For a family of distributions with parameter θ, continuous probability ν, and discrete probability µ^θ, the misfit J(θ) is given by

J(θ) = W²(ν, µ^θ).   (4.1)

As before, W² denotes the squared Wasserstein 2-distance. For our initial examples, the goal is to find the parameters which minimize the misfit between the image and the discrete distribution. Later we discuss how this methodology may be extended to more complicated problems.
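As a concrete illustration of the misfit in Equation (4.1), the sketch below estimates a single translation parameter of a known object. It substitutes an equal-weight sample cloud and a linear assignment for the semi-discrete solver of Chapter 3 — a stand-in chosen only to keep the example self-contained — and evaluates J(θ) on a grid of candidate parameters.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_squared(a, b):
    """Squared W_2 between two equal-size, equal-weight point clouds (assignment)."""
    C = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    r, c = linear_sum_assignment(C)
    return C[r, c].mean()

# 'Observed image' nu: samples of a disk whose true center has x-coordinate 0.7.
rng = np.random.default_rng(4)
angles = rng.uniform(0.0, 2.0 * np.pi, 300)
radii = 0.1 * np.sqrt(rng.uniform(0.0, 1.0, 300))
disk = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
nu_samples = disk + np.array([0.7, 0.5])

# Misfit J(theta): translate the model disk horizontally and compare with W_2^2.
thetas = np.linspace(0.3, 1.0, 15)
J = [w2_squared(disk + np.array([t, 0.5]), nu_samples) for t in thetas]
print(thetas[int(np.argmin(J))])   # minimized near the true value 0.7
```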

4.1.1. Centers of Mass Although our first example may be thought of as quantization with fixed weights, it is useful to interpret the results differently. Here we are using the Wasserstein distance to determine the characterizing properties of our distribution. Consider an image of two bodies represented by the continuous density ν(y) dy, which is the sum of indicator functions over the objects and normalized so that one has

2/3 of the mass and the other has 1/3. We compare the continuous density to the discrete density (1/3) δ_{z_1} + (2/3) δ_{z_2} as we change the y-coordinates of z_1 and z_2. This arrangement and the corresponding heat map of the misfit function are shown in Figure 4.1. Varying the y-coordinates, y_1 and y_2, moves the nodes across the red dashed lines in Figure 4.1(a). The goal of the optimization is to find the center of mass of the two components

by treating (z_1, z_2) as the parameter θ of our misfit function. Let µ^θ(x) be our distribution on X with parameter θ. Since the weight on the nodes matches the mass


(a) The continuous image ν. (b) Contour of the misfit function.

Figure 4.1: (a) The yellow ball has 2/3 of the mass and the teal ball has 1/3. The dotted lines represent the range of values considered for y_1 and y_2, respectively. (b) A contour plot of the misfit for varying y_1 and y_2.

in each body, our problem is to solve

min_θ J(θ) = min_θ W²(ν, µ^θ).   (4.2)

This is an optimal quantization problem which can be solved with Lloyd's algorithm or related faster algorithms like the BFGS method in [27]. We note that while this method allows z_1 and z_2 to range over the entire domain, for ease of presentation we show what the misfit function looks like when only the y-coordinates of the two nodes, y_1 and y_2, are varied. As demonstrated in Figure 4.1(b), the minimum of the misfit function indeed occurs when the nodes correspond to the centers of the balls (the true centers of mass). Moreover, the basin is convex and thus amenable to optimization techniques. This is an important feature because it underlies the success of later applications. We also note that convexity was a motivating factor for the algorithmic development with respect to the one-dimensional optimal transport problem in [16, 17, 18]. More contours occur when we vary y_1 since it corresponds to the node which has a greater weight.


Although we considered only two disjoint bodies, this technique can be applied to find the means for more general multimodal distributions. However, greater care must be taken when the nodes have different weights, as there will still be local minima when the nodes are assigned to the wrong bodies.

Non-optimal quantization. It is worth pointing out that there are non-optimal equilibria in the optimal quantization problem. Consider Figure 4.2, which shows a stable centroidal Laguerre diagram and one which would converge to an unstable centroidal Laguerre diagram; moving $z_i$ to $m_i$ in Figure 4.2(b) would yield an unstable centroidal Laguerre diagram.

(a) Stable equilibrium. (b) Unstable arrangement.

Figure 4.2: Laguerre diagrams that (a) produces stability and (b) leads to unstable results when employing Lloyd’s algorithm. The boundaries of the Laguerre cells in each case are shown by the red dashed lines.

While the more intuitive quantization is to have a node at the center of mass of each square as shown in Figure 4.2(a), by working backwards it is possible to obtain a second configuration of nodes that is an equilibrium in Lloyd's algorithm. Specifically, by placing the boundary of the would-be Laguerre cells so that it bisects each square, the two Laguerre cells will have the correct amount of mass. Recall from Equation (3.5) that the boundary of two Laguerre cells will always be perpendicular to the line connecting the two nodes. Due to the symmetry in the diagram displayed in Figure 4.2(b), the line connecting the centers of mass of these Laguerre cells will be perpendicular to this boundary, and these centers therefore serve as a possible set of node locations which generate

the desired Laguerre cells with price function $\equiv 0$. Any two points vertically stacked will generate these Laguerre cells; however, the price is only $\equiv 0$ when they are equidistant, and it is only a stationary point for Lloyd's algorithm when they are the centers of mass of the cells. Figure 4.2(b) therefore displays an unstable configuration, since if either of the nodes moved in any direction other than vertical, the centers of mass would be skewed in opposite directions towards opposite balls. Lloyd's algorithm on this configuration would have the nodes chasing the new centers of mass until the nodes were at the center of each ball. Because of this instability we do not further consider these non-optimal equilibria.

4.1.2. Angle of Rotation

By decomposing the Wasserstein distance into 'shape' and 'mean' components, a translation only changes the 'mean' component and leaves the 'shape' unchanged. Rotations (around the center of mass) will leave the 'mean' unchanged, but alter the 'shape' of the distribution. In this section we show that the Wasserstein distance can be used to detect the rotation of an object. We consider the question of detecting the angle of rotation of a known object assuming a fixed configuration of nodes for the discrete distribution representing the non-rotated version of our image. We aim to infer the angle of rotation by minimizing the misfit in Equation (4.1), where $\mu_\theta$ is our distribution rotated by $\theta$ about the center of mass of the discrete distribution. Note that the discrete distribution needs only to be representative of the original image and that it does not need to be an optimal quantization. We are not looking for the best configuration of nodes, but rather the best rotation of our configuration. A variety of simple configurations are suitable for this purpose, and Figure 4.3 shows two possible arrangements of nodes, neither of which are chosen to be optimal quantizations of the non-rotated image. Figure 4.3(a) has the nodes arranged as the corners of a parallelogram, while Figure 4.3(b) simply

has two nodes opposite each other. For each configuration, the weights are evenly distributed over the nodes.

(a) Four node discrete distribution. (b) Two node discrete distribution.

Figure 4.3: Two different configurations of nodes used to determine the rotation of the ellipse. The long axis of the nodes is $\pi/4$ away from the true orientation.

Figure 4.4: $\mathcal{W}^2(\nu, \mu_\theta)$ for rotation angle $\theta$ given (a) the four-node configuration and (b) the two-node configuration.

As displayed in Figure 4.4, both choices of nodal configurations are suitable for detecting the angle of rotation. In each case the misfit function yields an optimal parameter corresponding to the true rotation of the ellipse. Due to the symmetry of the image, the misfit at a rotation angle $\theta = \alpha$ is indistinguishable from that at

$\theta = \alpha + \frac{\pi}{2}$. This leads to two minima for each misfit function. The scales on the two misfit traces are different, however, which demonstrates that the representation with four points is the more suitable of the two choices.


As was already described in Section 3.2 in the context of Newton's method, in optimization settings it is advantageous to have gradient information of the misfit function in order to efficiently approach the optimal parameters. In the previous problem, when the model parameters were the node locations themselves, the gradient of the misfit was already with respect to the parameters. This is not true for the optimal rotation case, but we can still find the gradient of our misfit function with respect to the parameters. This is essentially accomplished using the chain rule, as we will demonstrate below. A more complicated example will be provided in Section 4.3.3 when we employ a partial differential equation (PDE) model.

For convenience we define $g_i$ as the gradient of the transport cost with respect to a single node location given in Equation (3.25):
$$g_i \equiv \nabla_{x_i}\mathcal{W}^2(\mu, \nu) = 2\mu_i\,(x_i - m_i).$$

The effect on $\mathcal{W}^2$ of moving $x_i$ by $h_i$ is then approximately $\langle g_i, h_i\rangle$. By viewing the transport cost as a function of node locations, and the node locations themselves as functions of the parameter $\theta$, we are able to use the chain rule to see that
$$\frac{d\,\mathcal{W}^2(\mu,\nu)}{d\theta} = \sum_{i=1}^n \left\langle g_i, \frac{dx_i}{d\theta}\right\rangle. \tag{4.3}$$
Each node location can also be thought of in terms of its radial coordinates, with $x_i$ represented as $r_i(\cos(\theta + \theta_i), \sin(\theta + \theta_i))^T$, and
$$\frac{dx_i}{d\theta} = r_i\,(-\sin(\theta + \theta_i), \cos(\theta + \theta_i))^T.$$

This is a simple example of how we combine the point gradient with the dependence of the node locations on the parameters of our model, which allows us to use gradient-based optimization methods to find the optimal parameters for our misfit

functions. In Section 4.3.3 we will consider a PDE model determining the locations of our nodes. Although the problem is more complex, the same general approach may be used. In order to establish the effect of the parameters on the optimal transport cost, we insert an intermediary and determine how the parameters affect the node locations. We then combine this information with our understanding of how the node locations affect the optimal transport cost.
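The following sketch assembles the gradient of Equation (4.3) for the rotation parameter from precomputed point gradients. The point gradients are assumed to come from an SDOT solve at the current angle; the function itself is a hypothetical helper, not the implementation used for Figure 4.4.

```python
import numpy as np

def rotation_misfit_gradient(point_grads, nodes0, center, theta):
    """Chain rule of Eq. (4.3) for the rotation angle.

    point_grads : (n, 2) array of point gradients g_i = 2*mu_i*(x_i - m_i)
    nodes0      : (n, 2) node locations of the un-rotated configuration
    center      : (2,) rotation center (center of mass of the nodes)
    theta       : current rotation angle"""
    c, s = np.cos(theta), np.sin(theta)
    rel = nodes0 - center
    # d x_i / d theta = R'(theta) (x_i^0 - center), R the rotation matrix
    dx_dtheta = np.stack([-s * rel[:, 0] - c * rel[:, 1],
                           c * rel[:, 0] - s * rel[:, 1]], axis=1)
    return np.sum(point_grads * dx_dtheta)   # sum_i <g_i, dx_i/dtheta>

# g = point gradients returned by the SDOT solve at angle theta (hypothetical input)
# dJ_dtheta = rotation_misfit_gradient(g, nodes0, nodes0.mean(axis=0), theta)
```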

4.1.3. Rotating and Translating Object

Although it is possible to detect the translation and rotation of an object by parameterizing our model with $(h, \theta)$, doing so would be unnecessarily complicated. Since the center of mass of the continuous distribution, $M$, is the same as the center of mass of the weighted centroids, solving the semi-discrete optimal transport problem and finding the centroids, $\{m_i\}_{i=1}^n$, of the Laguerre cells is equivalent to finding $M$. The translation which aligns the centers of mass can also be found by looking at half the sum of the point gradients, as seen by

$$\frac{1}{2}\sum_{i=1}^n 2\mu_i\,(x_i - m_i) = \sum_{i=1}^n \mu_i x_i - \sum_{i=1}^n \mu_i \frac{1}{\mu_i}\int_{L_i} y\, d\nu(y) = W - M. \tag{4.4}$$

Because the centers of mass were not known, in Section 3.2.1 it was decided that in general Hessian regularization was better suited for SDOT algorithm modification than nodal translation was. Here, however, due to (4.4), the problem is simplified since we do know the center of mass. Specifically, we can simply return to the case where the centers of mass are aligned and we are interested in determining the rotation angle. This apparent conflict is resolved by recognizing that we are solving two different problems. In Section 3.2.1 we are focused on solving for the optimal transport cost

between two fixed probability distributions, while here we are focused on finding the probability distribution (and parameter) which minimizes the misfit. This involves first solving the optimal transport problem for some initial parameter choice, and in doing so we obtain the information about the centers of mass of the distribution.
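A small sketch of this observation, assuming the point gradients $g_i$ have already been computed by one SDOT solve:

```python
import numpy as np

def aligning_translation(point_grads):
    """Eq. (4.4): half the sum of the point gradients g_i = 2*mu_i*(x_i - m_i)
    equals W - M, the offset between the discrete and continuous centers of
    mass, so subtracting it aligns the two distributions."""
    return -0.5 * np.sum(point_grads, axis=0)

# nodes_aligned = nodes + aligning_translation(g)   # g from one SDOT solve
```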

Section 4.2 Misfits in Time

The previous section focused on determining a parameter from a single image. We now extend our methodology to include dynamic scenes, specifically by looking at multiple snapshots. There are a variety of applications for which our approach may be useful. A main motivation is the study of the evolution of ice floes, and in particular determining the related winds and currents. The misfit function is now the sum of the optimal transport cost at multiple snapshots in time. The data consist of a series of images, which we represent as probability measures $\{\nu^{t_i}\}_{i=1}^K$, and a series of discrete distributions arising from our model, $\{\mu^{t_i}\}_{i=1}^K$. With this notation we now define the generic form for the misfit function as

$$J(\theta) = \sum_{i=1}^K \mathcal{W}^2(\nu^{t_i}, \mu^{t_i}). \tag{4.5}$$
Although we restrict our discussion to simple ordinary differential equations (ODEs), our approach is not inherently limited. We speculate that this technique will be useful for a broad class of processes, whenever it is meaningful for a misfit function to convey geometric information. This is clearly true in cases where an ODE describes how an object moves in time. It would not be useful in modeling a process which cares about the number of distinct bodies which are spawned. To illustrate this difference, the misfit in Equation (4.5) would be appropriate to evaluate a model of

how a population of rabbits spreads out in an environment, but it would not be useful to measure how many rabbits there are. Additionally, it is necessary that $\nu^{t_i}$ be a probability measure, so it would be poorly suited to measuring growth processes, and rabbits are notoriously prolific. While we do not have a full theory on what sort of ODE is appropriate for this method, it is natural to assume that they satisfy conservation and that the conserved quantity should form the basis for $\nu^{t_i}$. The two ODEs we consider govern the position of the object(s), in this case balls. The continuous probability will be formed by placing the center of a ball at the output of the ODE. We form the image using 'true' parameters, and evaluate the misfit for the discrete distributions made using a range of parameters. To complete our analysis, as was done in previous sections, we construct the traces and heat maps of the misfit functions to demonstrate the convexity of the optimization problem.
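Assuming, as in the earlier sketch, that a per-snapshot transport cost is available (here again via POT as a stand-in for the SDOT solver), the misfit of Equation (4.5) is a plain sum over the observation times:

```python
import ot   # POT, again standing in for the SDOT solver of Chapter 3

def time_misfit(theta, snapshots, model, node_weights):
    """Eq. (4.5): sum the squared transport costs over the K observation times.

    snapshots : list of (pixel_xy, pixel_mass) pairs, one per time t_i
    model     : model(theta, i) -> node locations predicted at time t_i"""
    total = 0.0
    for i, (pixel_xy, pixel_mass) in enumerate(snapshots):
        cost = ot.dist(pixel_xy, model(theta, i), metric="sqeuclidean")
        total += ot.emd2(pixel_mass, node_weights, cost)
    return total
```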

4.2.1. Velocity Estimation

To begin we consider a simple example of a ball moving through space at constant speed. The corresponding ODE tracking the center of the object is then
$$\frac{dx}{dt} = v. \tag{4.6}$$
For our initial investigation, we assume that we have the correct direction and vary the magnitude of the velocity. The misfit function in Equation (4.5) for this example is
$$J(v) = \sum_{i=1}^3 \mathcal{W}^2(\nu^{t_i}, \mu^{t_i}), \tag{4.7}$$
with $t_i = 1, 2, 3$. We omit the measurement at time $t_0 = 0$ because it would be the same across all parameters. The principle at play here is that the velocity parameter affects the position in the scene, which in turn affects the Wasserstein distance. The Wasserstein distance is an appropriate misfit function for the velocity since it has a


geometric impact on each image. This important feature will be noted in the following examples.

Figure 4.5: The images and node locations at $t = 1, 2, 3$ used to determine the velocity of the moving ball. The ball moves according to $v_{\mathrm{true}} = 0.25$, and the nodes move according to $v = 0.35$.

(a) Components of Equation (4.7). (b) Total misfit.

Figure 4.6: The individual components of the misfit given by Equation (4.7) and the misfit function itself. (a) The squared Wasserstein distance at $t = 1$ (blue, lowest), $t = 2$ (green, middle), and $t = 3$ (red, highest). All of the curves have their minimum at the true velocity value.

Figure 4.6 shows that the misfit function given by Equation (4.7) provides a convex environment for optimization and that the minimum of the misfit is the true parameter. However, Figure 4.6(a) also suggests that the whole misfit function is not needed, and that any of the component squared Wasserstein distances would suffice. We will see in the next example that this is not generally the case.


4.2.2. Colliding Balls

Our next example describes two balls in an elastic collision. This is modelled using Newton's second law of motion, $F = ma$, with the force given by Hooke's law for a spring. We do not deform the balls, but the force is proportional to the extent to which the balls overlap. The resulting system of ODEs is
$$\frac{dc_1}{dt} = v_1, \qquad \frac{dc_2}{dt} = v_2, \qquad m_1\frac{dv_1}{dt} = -k \cdot p \cdot d, \qquad m_2\frac{dv_2}{dt} = k \cdot p \cdot d, \tag{4.8}$$
where $c_i$ and $v_i$ are the center of mass and velocity of each ball, $k$ is the spring constant, and $p = \min\{0, \|c_1 - c_2\| - r_1 - r_2\}$ is the penetration distance of the balls, where $r_1$ and $r_2$ are the radii of the respective balls, $r_1 = r_2 = 0.2$. Finally, $d = (c_1 - c_2)/\|c_1 - c_2\|$ is the direction between the two centers and the direction in which the force is applied.
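A sketch of this system using SciPy's ODE integrator is given below. The initial conditions, parameter values, and the sign convention on the contact force (chosen so that the force is repulsive) are our illustrative assumptions, not the exact configuration used in the figures.

```python
import numpy as np
from scipy.integrate import solve_ivp

R1 = R2 = 0.2   # ball radii

def rhs(t, state, m1, m2, k):
    """Right-hand side of system (4.8); state = (c1, c2, v1, v2) flattened."""
    c1, c2, v1, v2 = state.reshape(4, 2)
    gap = np.linalg.norm(c1 - c2)
    p = min(0.0, gap - R1 - R2)          # penetration depth (negative in contact)
    d = (c1 - c2) / gap                  # direction between the two centers
    f = k * p * d                        # Hooke-type contact force
    return np.concatenate([v1, v2, -f / m1, f / m2])

# One forward solve with illustrative initial conditions and parameters:
y0 = np.concatenate([[0.0, 0.0], [1.0, 0.05], [0.5, 0.0], [-0.5, 0.0]])
sol = solve_ivp(rhs, (0.0, 3.0), y0, args=(1.0, 1.0, 500.0), t_eval=[0, 1, 2, 3])
centers = sol.y[:4].reshape(2, 2, -1)    # ball centers at the observation times
```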

For this example, we will examine two pairs of parameters: (1) the mass, $m_1$, and initial height of one of the balls, $y_1(0)$; and (2) the mass, $m_1$, and the spring constant $k$. The true initial height $y_1(0)$ positions the ball such that the right ball strikes the underside of the left ball. We vary this by considering $\tilde{y}_1(0) = y_1(0) + y_0$ as initial conditions for the differential equation and measure the effect in terms of $y_0$. We vary the height enough so that the ball can glance off in either direction. The range of values is chosen so that there is always a collision, however. Note that we treat the mass as a parameter for the dynamics of the scene, and not as a parameter that describes how the mass is distributed over the two bodies in the probability distribution. This is an important distinction, because if we treated the probability

mass as a parameter, then it would affect our ability to use the Wasserstein distance as a proxy for particle identification. However, this description is consistent with how one might expect to gather the observations when these methods are implemented in practice. To put this into context, there is little difference between a photo of a soccer ball and a bowling ball, until you try to kick one of them.

When examining the problem with parameters $(m_1, y_1)$ and $(m_1, k)$, our misfit functions will have the form
$$J(\theta) = \sum_{i=1}^4 \mathcal{W}^2(\nu^{t_i}, \mu^{t_i}), \tag{4.9}$$
for $t_i = 0, 1, 2, 3$. The balls always collide between $t = 1$ and $t = 2$. It is important that the collision takes place during the interval we are observing, but it does not need to be observed itself. This is because $m_1$ and $k$ only have an effect on the dynamics after the collision, and the height has a greater effect after the collision as well. Our misfit functions can only detect things which influence the scene, and only to the degree that they affect the distribution of mass. Figure 4.7 shows the configuration of the balls, which we are treating as our observation, as well as a configuration of the nodes with incorrect parameters. Initially the balls move towards each other, then they collide and move apart. The mass of the moving ball, $m_1$, affects how momentum and energy are transferred between the two balls and changes the scene after the collision. While the initial height has an effect before the collision, it too has a much larger effect after the collision because it changes the angles of the trajectories. Figure 4.8 shows the value of the misfit function given by Equation (4.9), and demonstrates not only that the minimum is attained at the true parameters, but also that this misfit provides a convex environment suitable for optimization. Figure 4.9 shows two of the components in the misfit function given in Equation


Figure 4.7: The images and node locations at $t = 0, 1, 2, 3$ used to determine parameters of our colliding-ball examples. The balls collide between $t = 1$ and $t = 2$. The balls move according to the true parameters, $m_1 = 1$, $y_0 = 0$, and the nodes move according to $m_1 = 1.9375$, $y_0 = -0.1$.

(4.9). The misfit from before the collision, shown in Figure 4.9(a), varies very little, and only proportionally to the initial displacement $y_0$, whereas the misfit after, shown in Figure 4.9(b), has a nearly identical shape to the total misfit shown in Figure 4.8. As discussed in Chapter 3, the Wasserstein distances we calculate are computed to a tolerance level; this tolerance leads to the 'wiggles' in the contours displayed in our figures.


Figure 4.8: Contour plot of the misfit function given by Equation (4.9) for the two colliding balls. The true parameter values are marked with the red dot at $m_1 = 1$, $y_0 = 0$. This shows the convexity of the misfit around the true parameters.

Figure 4.9: $\mathcal{W}^2(\mu^t, \nu^t)$ has a very different dependence on the parameters before and after the collision: (a) the optimal transport cost at $t = 2$, prior to the balls colliding; (b) the optimal transport cost at $t = 4$, after the balls have collided. Note that the scales on the two contour plots are not the same.

It is clear from Figure 4.9 that the squared Wasserstein distance before the collision is not enough to determine the correct parameters, but that a single observation afterwards is. We do not have a theory as to how many observations are needed in order to determine the

parameters, but we will see in Section 4.3 an example where the physics of the problem cause there to be a curve of parameters which produce the same value of the misfit function.

Now consider the problem of varying $k$ and $m_1$. As we have mentioned, varying $m_1$ affects how the momentum and energy are transferred between the two balls. The spring constant $k$ affects the duration of the interaction. A higher $k$ means that the balls are stiffer and bounce off of each other more quickly.

Figure 4.10: Contour plot of the misfit function given by Equation (4.9). The true parameter values are marked with the red dot at $m_1 = 1$, $k = 500$.

In examining Figure 4.10, we appear not to have any dependence on $k$, evidenced by the nearly vertical contours.¹ This is because $k$ affects the strength of the force and how long the balls are in contact, which can be thought of as a time delay. The contact time for the entire range of $k$ is brief, however, and when combined with the velocity of the balls generates very little impact, causing only a very minor difference in the position of the balls and an indiscernible effect in the Wasserstein distance

¹As discussed before, the wiggles are due to the tolerance we allow in calculating the optimal transport cost.


when compared to the effect of $m_1$ and $y_0$. While it is possible to estimate the time delay, it is not particularly instructive to do so. The above analysis is simply meant to convey that even in situations where certain parameters appear to be important, it is always best to have physical intuition about the problem to understand the scale of the effect.

Section 4.3 Cantilever Beam

We now demonstrate how a Wasserstein misfit may be used in situations other than particle tracking, and in particular how it can be useful in determining material properties. We consider small elastic deformations of a beam with a fixed end, also known as a cantilever beam. This setup is related to the problem considered in [41] when looking at the stress on a miter gate, a type of lock gate for canals. In future work we will apply this approach to estimate parameters relating to ice floes. In what follows we introduce the cantilever beam PDE model, describe how it fits into our framework, and then show how the Wasserstein point gradient may be incorporated into the adjoint equations to find the gradient with respect to the model parameters. While we limit our discussion to a low-dimensional setting, the techniques are easily extended to incorporate high-dimensional spatially varying parameters. In this section, we combine elastic deformation with optimal transport. Unfortunately both of those fields use $\mu$ and $\nu$ to refer to important concepts. The symbol $\mu$ refers to a parameter of the PDE that we are varying, and the symbol $\nu$ refers to the Poisson ratio, which is a material property related to the PDE. The probability densities in this section will all appear with subscripts, indicating whether they are the densities before or after the deformation has been applied.


4.3.1. Model Set-up

In our example, the cantilever beam is under an unknown tangential force and has unknown material properties. As mentioned, this model is related to miter gates, but a more familiar example might describe someone standing at the end of a diving board. We model the beam as a solid material over a domain $\Omega \subset \mathbb{R}^2$. Small steady-state elastic deformations of this material can be modeled as

$$\nabla \cdot \sigma = f \tag{4.10}$$
$$\sigma = \lambda\,\mathrm{tr}(\varepsilon)\,\mathrm{Id} + 2\mu\,\varepsilon \tag{4.11}$$
$$\varepsilon = \frac{1}{2}\left[\nabla u + (\nabla u)^T\right]. \tag{4.12}$$
Here $u : \Omega \to \mathbb{R}^2$ is the displacement of the material, $\varepsilon : \Omega \to \mathbb{R}^{2\times 2}$ is the two-dimensional strain tensor, $\mathrm{Id}$ is the identity tensor, $\sigma : \Omega \to \mathbb{R}^{2\times 2}$ is the two-dimensional stress tensor, and $f : \Omega \to \mathbb{R}^2$ represents two-dimensional body forces, which are zero except along the boundary. The quantities $\lambda$ and $\mu$ are called the Lamé parameters and represent the material properties of the beam. The Young's modulus, $E$, and Poisson ratio, $\nu$, are correspondingly equivalent to the Lamé parameters but with more apparent physical significance. Specifically, the Young's modulus is a stiffness parameter that determines how much an object bends when under strain, and the Poisson ratio measures the relative expansion (or less commonly contraction) of the material horizontally when under a vertical force. The Young's modulus is measured in pascals or gigapascals due to their typical size, and the Poisson ratio varies from $-1.0$ to $0.5$. Most materials have a Poisson ratio from $0$ to $0.5$, reflecting that most objects expand slightly when you push on them. The conversion between

these two pairs of parameters is given by

$$\lambda = \frac{E\nu}{(1+\nu)(1-2\nu)}, \qquad \mu = \frac{E}{2(1+\nu)}. \tag{4.13}$$

We require that the solution satisfy the following two boundary conditions

$$\sigma(x)\cdot \hat{n}(x) = \ell\,\hat{t} \quad \forall x \in \Gamma_L \tag{4.14}$$
$$u(x) = 0 \quad \forall x \in \Gamma_R, \tag{4.15}$$
where $\ell$ is the magnitude of the tangential force and $\hat{t}$ is the tangent direction, and

$\Gamma_L$ and $\Gamma_R$ are the boundaries of the beam on the left and right. This corresponds to applying a tangential force of magnitude $\ell$ on the left, and having a 'no-slip' boundary condition on the right. To solve this PDE, we used the finite element PDE solver FEniCS [1, 30, 31, 32]. Our problem is to determine the Young's modulus, Poisson ratio, and tangential load of an object from an observation before the load is applied and one after. This setup corresponds to a laboratory setting where a known force would be applied in order to estimate material properties. We treat the load as an unknown to better show how this technique may be used outside of a laboratory setting. This problem will not have a unique solution, not due to a failing of our technique, but rather because of the nature of the problem.
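A minimal sketch of such a solve in the legacy FEniCS (dolfin) interface is shown below. The mesh resolution, geometry, parameter values, and boundary markers are illustrative assumptions rather than the configuration used to generate the figures.

```python
from dolfin import *   # legacy FEniCS (dolfin) interface; assumed available

# Beam geometry and a candidate set of material parameters (illustrative values).
length, height = 2.0, 0.4
mesh = RectangleMesh(Point(0.0, 0.0), Point(length, height), 60, 12)
V = VectorFunctionSpace(mesh, "P", 1)

E, nu_p, ell = 1e9, 0.3, -5e4             # Young's modulus, Poisson ratio, tangential load
mu_ = E / (2.0 * (1.0 + nu_p))            # Lame parameters via Eq. (4.13)
lam = E * nu_p / ((1.0 + nu_p) * (1.0 - 2.0 * nu_p))

def sigma(u):
    eps = sym(grad(u))
    return lam * tr(eps) * Identity(2) + 2.0 * mu_ * eps

# Clamp the right end; apply a tangential (downward) traction on the left end.
right = CompiledSubDomain("near(x[0], L) && on_boundary", L=length)
left = CompiledSubDomain("near(x[0], 0.0) && on_boundary")
bc = DirichletBC(V, Constant((0.0, 0.0)), right)

boundaries = MeshFunction("size_t", mesh, mesh.topology().dim() - 1, 0)
left.mark(boundaries, 1)
ds_left = Measure("ds", domain=mesh, subdomain_data=boundaries)(1)

u, v = TrialFunction(V), TestFunction(V)
a = inner(sigma(u), sym(grad(v))) * dx
Lform = dot(Constant((0.0, ell)), v) * ds_left   # traction ell * t_hat on Gamma_L

u_theta = Function(V)
solve(a == Lform, u_theta, bc)
# u_theta(x_i) now gives the displacement used to move the node locations.
```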

Procedure. There are a few different ways to use our framework for this problem. We will describe one now and mention other possibilities at the end of the section. In brief, the procedure is to look at the observation before the experiment is run to determine a discrete representation, and then compare the experiment results to

those observed by the simulation. For our purposes here, 'running the experiment' means solving the PDE using true parameters and deforming the image. The 'simulation' entails solving the PDE using candidate parameters and deforming the discrete representation. Unlike in our previous examples, we do not assign weights to the node locations of our representation, but rather we assume that we have one observation of the unbent beam before it is under strain and a second observation afterwards. We place the nodes in a regular formation on the object before it undergoes stress. Since we are looking at a long rectangular beam, we place the nodes in two rows of six equidistant positions (see Figure 4.11).

Figure 4.11: The beam before undergoing the elastic deformation with the node locations marked by the red dots. In this figure, white corresponds to zero mass.

With the nodes $\{x_i\}_{i=1}^{12}$ placed, we determine their corresponding weights by creating a Voronoi diagram and evaluating the mass in each cell. The weights associated to the Voronoi tessellation, $\{w_i\}_{i=1}^{12}$, are used because they minimize the Wasserstein distance among all distributions with those nodes. Specifically, the Voronoi tessellation sends each point to the closest node, so all of the mass is going to the nearest cell. More information on this as well as other discussions on quantization may be found in [11]. We now have an observation of the material before a load is applied and a discrete representation of the material, respectively given by the atom-less density $\nu_1(y)$ and


$\mu_1(x) = \sum_{i=1}^{12} w_i\,\delta_{x_i}$. We solve the PDE given in Equation (4.12) with the fixed true parameters and deform the observation of the unbent beam to form the observation of the bent beam. The deformation is done by interpolating the displacement from the mesh of the finite element solver. Since the process does not conserve the total mass of the image, we must re-normalize.² The final result after re-normalization serves as our observation of the experiment and is represented by $\nu_2(y)$.
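A sketch of the weight computation on a pixelated image, assuming the image has already been normalized to a probability measure; the helper name is ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def voronoi_weights(pixel_xy, pixel_mass, nodes):
    """Assign to each node the image mass of its Voronoi cell.

    pixel_xy   : (N, 2) array of pixel-center coordinates
    pixel_mass : (N,) array of normalized pixel masses for nu_1
    nodes      : (12, 2) array of node locations x_i"""
    _, nearest = cKDTree(nodes).query(pixel_xy)     # nearest node per pixel
    weights = np.bincount(nearest, weights=pixel_mass, minlength=len(nodes))
    return weights / weights.sum()                  # w_i, summing to one
```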

Figure 4.12: Image of the deformed beam with the nodes corresponding to the true parameters. In this figure, white corresponds to zero mass.

We evaluate a candidate parameter $\theta$ by first solving the PDE given in Equation (4.12) to obtain a displacement function $u^\theta$. We then use $u^\theta$ to displace the nodes of

the discrete representation to form $\mu_2(x) = \sum_i w_i\,\delta_{x_i + u^\theta(x_i)}$. Finally we compare this distribution $\mu_2$ to the continuous distribution $\nu_2$. The misfit function for this problem is
$$J(\theta) = \mathcal{W}^2(\mu_2, \nu_2),$$
which is the transport cost between $\nu_2$ and $\mu_2$. Figure 4.12 shows the beam after it has been deformed and the node locations we observe from the true parameters. We can consider this procedure in another way. Let $u$ and $u^\theta$ be displacement functions found by solving the PDE with the true parameter and the

²While we have not observed any ill effects from this re-normalization, we do not have general results demonstrating that this will always yield stable or accurate results.

candidate parameter $\theta$. Define the maps $S := \mathrm{Id} + u$ and $S^\theta := \mathrm{Id} + u^\theta$. Then

$\nu_2 = S_\#\nu_1$ and $\mu_2 = (S^\theta)_\#\mu_1$. The misfit is correspondingly

$$J(\theta) = \mathcal{W}^2(S_\#\nu_1, (S^\theta)_\#\mu_1). \tag{4.16}$$

Since µ1 is determined by the Voronoi cells, we expect that the minimum of this misfit function will be the true parameters. This argument is only intuitive, however. Indeed what we will discover is that when considered independently, two of the three true parameters given are minima of the cost function, and the third is nearly minimal. By contrast the problem becomes degenerate when we vary the Young’s modulus and tangent load together.

Finally, it is worth mentioning that $\mu_2$ is not necessarily equivalent to the distribution one would obtain by directly calculating the weights according to the Voronoi tessellation of the bent beam with nodes at $x_i + u(x_i)$. In particular we note that the weights of $\mu_2$ were determined using the Voronoi tessellation of $\nu_1$ with the nodes $\{x_i\}_{i=1}^{12}$, rather than from the Voronoi tessellation of $\nu_2$ with the nodes $x_i + u(x_i)$.

4.3.2. Results

We parameterized the PDE in Equation (4.12) by the Young's modulus, Poisson ratio, and tangential load along the left boundary (the normal load is 0). Our range for the Young's modulus is $[1\times 10^8,\ 2\times 10^9]$ (Pa), for the Poisson ratio $[0.1, 0.4]$, and for the tangent load $[-100{,}000,\ -5{,}000]$ (N) (the loads are negative since they are directed downwards). We chose the range of the Young's modulus and Poisson ratio to be representative of material values, and the tangent load was chosen to give the beam a moderate flex. These parameters are on very different scales, so we convert them to coordinates in $[0, 1]^3$. We scale the Young's modulus and tangent load logarithmically, and the Poisson ratio linearly since it has a limited range. Our image was made using

the parameters $10^9$ (Pa) for the Young's modulus, $0.3$ for the Poisson ratio, and $-50{,}000$ (N) for the tangential load, respectively, which approximately corresponds to the point $(0.75, 0.667, 0.25)$ in the re-scaled range.

(a) Young’s Modulus. (b) Tangent Load. (c) Poisson Ratio

Figure 4.13: Trace of the optimal transport misfit when varying the parameter for (a) the Young’s modulus, (b) the tangent load, and (c) the Poisson ratio. The true values are marked by the red dots.

Figure 4.13 shows the effect of each parameter on the misfit function when the other two are at their true values. The true values for the Young's modulus and tangent load are minima for their traces, while the true value of the Poisson ratio is nearly so. The three traces have different scales, with the Young's modulus having the greatest range. The Poisson ratio has a relatively minor effect on the misfit when compared to the other two parameters, which is likely due to the fact that the Poisson ratio measures the expansion of the material when a normal force is applied, and in this experiment we are only applying a tangential load. Figure 4.14 shows the optimal transport misfit when we vary the Young's modulus and tangent load in conjunction and demonstrates that this misfit function does not uniquely identify the true parameters. This should be thought of not as a failure of the misfit function, but rather as a consequence of the nature of the PDE. The misfit function views those parameter values as equivalent because they alter the material in the same way; a weak force applied to a squishy beam causes the same effect as a strong force applied to a stiff beam. The texture of the beams in Figure 4.11 is taken from satellite imagery of ice. While not of material importance here, we mention this fact to keep in mind the sort of images to which this method may be applied.


Figure 4.14: Contour plot of the misfit function for the cantilever beam when varying the magnitude of the tangent load and the Young's modulus. The true parameter values are marked with the red dot at $(0.23, 0.77)$.

The non-uniformity of the image would lead to difficulties with an $L^2$ misfit function, as it would be pocked with local minima. It would also be difficult to use gradient methods in this case. By contrast, employing the optimal transport cost as a misfit function does allow us to use gradient-based methods, as we now demonstrate.

4.3.3. Adjoint Gradient

While Figure 4.14 is helpful in demonstrating that an optimal transport misfit is amenable to a gradient-based approach, it is not efficient in determining the optimal set of parameters. We now show that we can efficiently calculate the gradient of our misfit using the adjoint equations of the PDE. In particular, calculating the gradient via the adjoint equations is approximately as computationally expensive as a single additional run of our model, which is more efficient than approximating the gradient via a finite difference scheme. Further, it allows us to use gradient-based optimization

methods to find the optimal value without an exhaustive search. Finally, it makes it possible to use this sort of misfit in settings where the Lamé parameters are spatially varying.

A Theory of the Adjoint Approach. There are multiple approaches to developing the adjoint equations that allow the computation of the gradient. Here we consider the Lagrangian of the constrained optimization problem [54]. The general setting of PDE-constrained optimization considers the problem

$$\min_{u\in U,\, m\in M} J(u, m) \quad \text{subject to } e(u, m) = 0, \tag{4.17}$$
for admissible pairs $(u, m) \in (U, M)$.³ For the cantilever beam, $u$ is the displacement function, $m$ is the set of model parameters, $J$ is the transport cost, $e$ is the PDE given in Equation (4.12), $U$ is the appropriate set of functions from $\Omega \mapsto \mathbb{R}^2$ (not necessarily satisfying the elastic PDE), and $M$ is the range of our parameter values. In some ways it is easier to think about $M$ as being the cube $[0, 1]^3$, but in other ways it is easier to think about it as the appropriate range of the tangential load $\ell$ and the Lamé parameters $(\mu, \lambda)$, because these parameters affect the PDE in more direct ways. We will find the gradient associated with $(\ell, \mu, \lambda)$ and then map that to the gradient on $[0, 1]^3$. The Lagrangian associated with Equation (4.17) is

$$L(u, m, p) = J(u, m) + \langle p, e(u, m)\rangle. \tag{4.18}$$

Recall that Lagrangians are often used in constrained optimization to reformulate the

³Our variable names are changed from [54].

problem in its unconstrained form [12]. Specifically, by incorporating the additional variable $p$ in Equation (4.18) we are able to remove the constraint in Equation (4.17). Since the parameter of our PDE, $m$, determines the displacement function, it is appropriate to think of it as $u(m)$. Hence we rewrite the misfit function as $\hat{J}(m) = J(u, m)$; that is, we reframe the misfit function as a function of parameters rather than as a function on the probability distribution itself. Probability distributions have a complicated differential structure, which is examined in [40]. We are now able to circumvent this difficulty and write

$$\hat{J}(m) = J(u(m), m) = L(u(m), m, p). \tag{4.19}$$

Differentiating yields
$$\langle \hat{J}'(m), s\rangle = \langle L_u(u(m), m, p),\, u'(m)s\rangle + \langle L_m(u(m), m, p),\, s\rangle. \tag{4.20}$$

By choosing an appropriate $p(u)$, we obtain $L_u = 0$, so that the first term on the right-hand side is zero. The second term then yields the derivative with respect to the model parameters.

To satisfy $L_u = 0$ we use Equation (4.18), and write
$$\langle L_u(u, m, p), d\rangle = \langle J_u(u, m), d\rangle + \langle p, e_u(u, m)\,d\rangle = \langle J_u(u, m) + e_u(u, m)^* p,\ d\rangle. \tag{4.21}$$

With $p(u)$ determined, we now use Equation (4.20) to obtain the gradient of interest as
$$\langle \hat{J}'(m), s\rangle = \langle L_m(u(m), m, p(m)),\, s\rangle. \tag{4.22}$$

This is the framework that allows us to take the gradient with respect to the model

parameters in our example.
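The recipe is easiest to see in a finite-dimensional analogue. In the sketch below the 'PDE' is a small linear system, the misfit is a quadratic, and all matrices are invented for illustration; only the adjoint recipe itself mirrors Equations (4.17)–(4.22).

```python
import numpy as np

# Toy analogue: the constraint is e(u, m) = A(m) u - f = 0 with
# A(m) = A0 + m[0]*A1 + m[1]*A2, and the misfit is J(u) = 0.5*||u - u_obs||^2.
rng = np.random.default_rng(0)
n = 5
A0 = np.eye(n) * 4.0
A1, A2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
f = rng.standard_normal(n)
u_obs = rng.standard_normal(n)

def solve_state(m):
    A = A0 + m[0] * A1 + m[1] * A2
    return A, np.linalg.solve(A, f)

def gradient(m):
    A, u = solve_state(m)
    J_u = u - u_obs                        # dJ/du
    p = np.linalg.solve(A.T, -J_u)         # adjoint solve: L_u = J_u + A^T p = 0
    # L_m = <p, (dA/dm_k) u>; one linear solve gives every parameter derivative.
    return np.array([p @ (A1 @ u), p @ (A2 @ u)])

# Check against finite differences.
m0, h = np.array([0.3, -0.2]), 1e-6
fd = [(0.5 * np.sum((solve_state(m0 + h * e)[1] - u_obs) ** 2)
       - 0.5 * np.sum((solve_state(m0 - h * e)[1] - u_obs) ** 2)) / (2 * h)
      for e in np.eye(2)]
print(gradient(m0), fd)   # the two should agree to roughly 1e-6
```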

The gradient of our model. To find the gradient of our model, we start by rewriting our system as a Lagrangian,

$$L(u, m, p) = \mathcal{W}^2(\nu, \mu) + \langle p, \nabla\cdot\sigma\rangle - \int_{\Gamma_L} p\cdot \ell\hat{t}\, ds + \int_{\Gamma_L} (p\cdot\sigma\cdot\hat{n})\, ds. \tag{4.23}$$

Here $\langle \cdot, \cdot\rangle$ refers to integration taken over the interior of the beam material. This form is reasonably compact; however, the relationships become clearer when we expand $\sigma$. First noting that
$$\langle p, \nabla\cdot\sigma\rangle = \langle \nabla p, \sigma\rangle, \tag{4.24}$$
where we change the integral of a dot product between two vector functions into the integral of the 'double-dot' product between two matrix functions, we change our Lagrangian into

$$L(u, m, p) = \mathcal{W}^2(\nu, \mu) + \langle\nabla p, \lambda\,\mathrm{tr}(\varepsilon)\,\mathrm{Id}\rangle + \langle\nabla p, 2\mu\varepsilon\rangle - \int_{\Gamma_L} p\cdot\ell\hat{t}\, ds + \int_{\Gamma_L} (p\cdot\sigma\cdot\hat{n})\, ds. \tag{4.25}$$
While eventually our model is parameterized with coordinates in $[0, 1]^3$, we will first consider the direct model parameters $(\ell, \lambda, \mu)$: the magnitude of the load and the two Lamé parameters. The form of the Lagrangian in Equation (4.25) is useful in determining the derivatives with respect to the model parameters $m = (\ell, \lambda, \mu)$, which are needed for the final calculation of Equation (4.22). Another form is more useful in determining $L_u$, which is needed to determine $p(u)$, however. This form is obtained

by expanding $\langle\nabla p, \sigma\rangle$ as

$$\begin{aligned}
\langle\nabla p, \sigma\rangle &= \int_\Omega \nabla p : \sigma \, d\Omega\\
&= \int_\Omega \nabla p : \left(\mu(\nabla u + \nabla u^T) + \lambda(\nabla\cdot u)\,\mathrm{Id}\right) d\Omega\\
&= \int_\Omega \frac{\mu}{2}(\nabla p + \nabla p^T):(\nabla u + \nabla u^T)\, d\Omega + \int_\Omega \lambda\,(\nabla\cdot p)(\nabla\cdot u)\, d\Omega,
\end{aligned} \tag{4.26}$$
where $\nabla\cdot u = \mathrm{tr}(\nabla u)$. It is now clear that the roles of $p$ and $u$ are interchangeable, which allows us to consider $\sigma_p := \mu(\nabla p + \nabla p^T) + \lambda(\nabla\cdot p)\,\mathrm{Id}$, and rewrite (again)

$$\langle\nabla p, \sigma\rangle = \langle \sigma_p, \nabla u\rangle. \tag{4.27}$$

The weak form of the $u$-derivative is found by adding a perturbation $w$, yielding $\langle\sigma_p, \nabla w\rangle$. Using test functions which vanish at the boundaries allows us to omit the boundary terms of Equation (4.25) and find
$$\langle L_u, w\rangle = \left\langle \frac{d}{du}\mathcal{W}^2(\nu, \mu),\, w\right\rangle + \langle\sigma_p, \nabla w\rangle. \tag{4.28}$$

We have already examined $\frac{d}{du}\mathcal{W}^2(\nu, \mu)$ in previous sections, but it is worth recontextualizing. The displacement function $u$ is the function which bends the beam. It is also the function which determines the node locations. The perturbation $w$ affects the transport cost according to the point gradient given by Equation (3.25). What is needed is the dot product of the point gradient with the value of $w$ at the node locations; the value of $w$ elsewhere is not relevant. In this regard, we let

$g_i$ be the point gradient of the $i$th node to obtain
$$\frac{d}{du}\mathcal{W}^2(\nu, \mu) = \sum_{i=1}^{12} g_i\,\delta_{n_i}. \tag{4.29}$$


This is the final piece that allows us to solve the adjoint equation in Equation (4.28) to find $p(u)$, which we then plug into our equation for $L_m$ to determine the gradient with respect to the parameters. Using Equation (4.25), we now separate the components of the gradient which are of interest, $J'(m) = (L_\ell, L_\lambda, L_\mu)^T$, as
$$L_\ell = -\int_{\Gamma_L} p\cdot\hat{t}\, ds \tag{4.30}$$
$$L_\lambda = \langle\nabla p, \mathrm{tr}(\varepsilon)\,\mathrm{Id}\rangle \tag{4.31}$$
$$L_\mu = \langle\nabla p, 2\varepsilon\rangle. \tag{4.32}$$

We have now demonstrated how an optimal transport misfit can be incorporated into physical examples and how we can combine the gradient from the adjoint equations with the point gradient we have from the optimal transport. By employing the chain rule, the gradient on the $m \in M$ parameters is then mapped to become a gradient on the cube $\Theta = [0, 1]^3$. Specifically, let $A$ map $\Theta \mapsto M$, and $\tilde{J}(\theta) = J(A(\theta)) = J(m)$. Then $\frac{d}{d\theta}\tilde{J}(\theta) = J'(m)(dA)^T$. This allows us to have the best of both worlds by computing the gradient with respect to natural parameters, but also by choosing an environment suitable for optimization. Thus we see that we can compute useful gradients with respect to an optimal transport misfit in situations where the gradient with respect to an $L^2$ misfit would not be useful.


Section 4.4 Further Questions

The ideas presented in this chapter are merely invitations for future investigations. The thrust is that the Wasserstein distance is able to serve as a misfit function which can help to estimate parameters in many physical settings. With that said, we present a few questions that we think deserve further attention.

4.4.1. Representation of Objects

Our current representation of objects is somewhat naive. We choose a set of nodes that in some ways captures the relevant feature of the object we hope to match, such as a long side and a short side in the case of determining the rotation of an object. In the examples we have looked at there is not a clear downside or sensitivity to our choices, however that does not mean there never will be one. When looking at the cantilever beam, we allowed the weights on the nodes to be determined by the Voronoi tessellation. Would we have found the same results if we had placed uniform weights on the nodes? Are there settings where this choice causes a divergence in results? Do we get better results if we use a centroidal Voronoi tessellation, where not only are the weights chosen according to the tessellation but the node locations are the centroids of their cells? If there is some difference between centroidal nodes and non-centroidal nodes, what is the right way to quantify the difference? It will only be in controlled settings that we can implement a centroidal Voronoi tessellation before the effect is observed, but it may be worthwhile to understand the cost of doing it otherwise. In our examples so far we only consider representing a discrete object by assigning one or many nodes to a particle object which does not fracture. How do we adapt a method like this to problems where we need to represent a fluid, or where objects

break apart and come together?

4.4.2. Time Sensitivity

In Section 4.2, we looked at a misfit which was the sum of the optimal transport cost at a few points in time. A misfit function of this sort draws connections to sampling, and so rather than always using equally spaced points, there are likely better choices based on the properties of what is being sampled.

4.4.3. Blurring and Noise

The Wasserstein distance is dependent on the distribution of mass in a distribution. This means that this method should be largely impervious to distortions which do not cause large impacts on the distribution of mass. Both blurring and additive Gaussian noise fall in this category. Blurring is a local averaging of the distribution, so the effect of blurring is that mass at one location is spread out over the blurred area. Our expectation is that this technique will be more robust to blurring than other techniques, but this of course should be investigated. We similarly expect our technique to be robust to additive Gaussian noise, because a small ball would contain enough pixels such that the mean of the noise would be nearly zero. There is some work showing this to be partially true in [17], although some care must be taken to ensure that the density value remains positive.

4.4.4. Analysis of Solution Operators

As mentioned in Equation (4.16), in some cases the misfit function can be rephrased as $J(\theta) = \mathcal{W}^2(S(\nu), S^\theta(\mu))$. This provides a clue that the well-posedness of the optimization question depends on the nature of the solution operator of our underlying model.


Analysis in this direction may provide understanding of the limitations of this technique, but also assurances of its results in appropriate contexts. It may also provide clues as to how to improve the technique and answer some of the other questions presented here.

Chapter 5

Optimal Transport through Kernels

We now introduce an idea for using kernels in optimal transport. The definition of kernel in mathematics depends on the context, and here we are referring to transition kernels from probability theory, which are used to map points in (or measures over) X to measures over Y . What we describe in this chapter as a culminating topic for this thesis is how kernels may be used to study optimal transport. Although this seems intuitive, to the best of our knowledge it has yet to be done. The rest of this chapter is organized as follows: We first provide the necessary background on the theory of kernels. We then show their connections to optimal transport. Finally we discuss how they may be used for one-dimensional signed optimal transport. Primary references for useful background material include [13, 25].


Section 5.1 Kernels Background

Kernels, or transition kernels, are most often seen in relation to Markov chain Monte Carlo (MCMC) and in stochastic processes, but they are also well-suited for optimal transport.

Definition 5.1. [13, Pg. 37], [25, Pg. 180] Let $(X, M)$ and $(Y, N)$ be measurable spaces and $K$ be a mapping from $X \times N$ into $[0, \infty]$. Then $K$ is called a transition kernel from $(X, M)$ into $(Y, N)$ if:

(i) the mapping $x \mapsto K(x, B)$ is $M$-measurable for every set $B$ in $N$, and
(ii) the mapping $B \mapsto K(x, B)$ is a measure on $(Y, N)$ for every $x \in X$.

If in (ii) the measure is a probability measure for all $x \in X$, then $K$ is called a stochastic kernel or a Markov kernel. If in (ii) we also have $K(x, Y) \le 1$ for any $x \in X$, then $K$ is called sub-stochastic or sub-Markov.

Since there is no ambiguity in our meaning, henceforth we refer to transition kernels simply as kernels. The definition of a kernel tells us that if we fix a measurable set in the space $Y$, then we have a measurable function on the base space, and if we fix a point in the space $X$, then we have a measure over $Y$. As of yet, this does not provide us the functionality of mapping measures to measures. That comes now.

Theorem 5.2. [13, Pg. 38] Let $K$ be a transition kernel from $(X, M)$ into $(Y, N)$. Then
$$Kg(x) = \int_Y g(y)\, K(x, dy), \quad x \in X,$$
defines a measurable function $Kg$ in $M$ for every measurable function $g$ in $N$;
$$\mu K(B) = \int_X K(x, B)\, d\mu(x), \quad B \in N,$$
defines a measure $\mu K$ on $(Y, N)$ for each measure $\mu$ on $(X, M)$; and
$$(\mu K)g = \mu(Kg) = \int_X d\mu(x) \int_Y g(y)\, K(x, dy)$$
for every measure $\mu$ on $(X, M)$ and measurable function $g$ in $N$.

Theorem 5.2 allows us to treat kernels as objects that map measures. In addition to viewing the kernel as mapping a measure $\mu$ on $X$ to a measure $\mu K$ on $Y$, we can also view it as mapping to a measure on the product space $X \times Y$, with $\sigma$-algebra $M \otimes N$. We specialize the next theorem to Markov kernels since they will be our focus.

Theorem 5.3. [13, Pg. 41] Let µ be a measure on (X, M) and K be a Markov kernel from (X, M) to (Y,N). Then

$$\pi f = \int_X d\mu(x) \int_Y f(x, y)\, K(x, dy)$$
for measurable functions $f$ in $(M \otimes N)_+$ defines a measure $\pi$ on the product space $(X \times Y, M \otimes N)$. If $K$ is a Markov kernel and $\mu$ is $\sigma$-finite, then $\pi$ is $\sigma$-finite and is the unique measure on the product space satisfying
$$\pi(A \times B) = \int_A K(x, B)\, d\mu(x), \quad A \in M,\ B \in N.$$

Theorem 5.3 tells us that we can form a product measure $\mu K$ by averaging the values $K(x, B)$ over the set $A$ according to $\mu$. Transition kernels can thus be used as a map between measures and to produce product measures. Both functions serve important roles in extending transition kernels to optimal transport.
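In the finite setting these statements reduce to elementary matrix operations. The following small example, with arbitrarily chosen values, treats a Markov kernel as a row-stochastic matrix and forms both the mapped measure $\mu K$ and the product measure of Theorem 5.3.

```python
import numpy as np

# A Markov kernel between two finite spaces, stored as a matrix with
# K[i, j] = K(x_i, {y_j}); each row is a probability measure on Y.
K = np.array([[0.7, 0.3, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])
mu = np.array([0.5, 0.3, 0.2])           # a measure on X

muK = mu @ K                              # the measure (mu K)(B) on Y
pi = mu[:, None] * K                      # product measure pi(A x B) of Theorem 5.3

g = np.array([1.0, 2.0, 5.0])             # a function g on Y
assert np.isclose((muK * g).sum(), (pi * g[None, :]).sum())   # (mu K) g = pi g
print(muK, pi.sum())                      # pi has total mass 1 since K is Markov
```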


Stochastic Processes and Markov Kernels

We first collect a few facts about stochastic processes that will help to provide context for our discussion about transition kernels.

Let $(X, M)$ be a measurable space and $T$ be an arbitrary set, normally either $\mathbb{R}_+$ to represent a time parameter, or $\mathbb{N}$ when the process takes place in a series of steps. Then let $\mu_t$ be a measure supported on $(X, M)$ for each $t$. The collection $\{\mu_t : t \in T\}$ is called a stochastic process with state space $(X, M)$ and parameter set $T$.

Stochastic processes are typically phrased in terms of random variables $X_t$ on a latent probability space $(\Omega, \mathcal{H}, \mathbb{P})$, where $\mu_t = (X_t)_\#\mathbb{P}$. We shift the framing here towards focusing on the measures, since those are the objects of interest in optimal transport. One special way in which the elements of the stochastic process can be related to one another is if there is a family of Markov kernels from $(X, M)$ to itself, indexed by pairs of times $t \le u$, such that they satisfy the Chapman-Kolmogorov equation,
$$K_{s,t} K_{t,u} = K_{s,u}, \quad 0 \le s \le t \le u. \tag{5.1}$$
Then there is the stochastic process $(\mu_t)_{t\in T}$ where $\mu_0$ is defined, and $\mu_t = \mu_0 K_{0,t}$.

The Chapman-Kolmogorov equation ensures consistency, so that $\mu_t K_{t,u} = \mu_s K_{s,u}$ for $0 \le s \le t \le u$. If random variables $X_t$ are being considered in our process, then the kernels act as conditional probabilities, with $K_{t,u}(x, A) = \mathbb{P}(X_u \in A \mid X_t = x)$, where $\mathbb{P}$ is the probability measure on the latent probability space $(\Omega, \mathcal{H}, \mathbb{P})$ [13, Pg. 446].

Kt,u = K0,u t. Those which are not time-homogeneous are called inhomogeneous. After extending the use of kernels to optimal transport, whether or not optimal transport kernels are homogeneous will be one of the first questions we ask.


Section 5.2 Optimal Transport Kernels

Before connecting the role of kernels to optimal transport, we first reexamine optimal transport between probability vectors $\mu$ and $\nu$ with a cost function $c(x, y) \le c_X(x) + c_Y(y)$. The optimal coupling between $\mu = (\mu_i)$ and $\nu = (\nu_j)$ is then a c-CM array $\gamma = (\gamma_{i,j})$ such that
$$\mu_i = \sum_j \gamma_{i,j}, \quad\text{and}\quad \nu_j = \sum_i \gamma_{i,j}. \tag{5.2}$$
Equation (5.2) is called the marginalization condition and is often written as $\mu = \gamma\mathbf{1}$ and $\nu^T = \mathbf{1}^T\gamma$, although we never use $\gamma$ as a matrix in any other context. Note that $\mu$ is the sum of the columns of $\gamma$, while $\nu$ is the sum of the rows. If we assume $\mu$ and $\nu$ have all non-zero entries (since the corresponding row or column of $\gamma$ would otherwise be all zeros), then we can form the array
$$S_{i,j} = \frac{\gamma_{i,j}}{\nu_j}. \tag{5.3}$$
Summing the entries of each column of $S$, we have
$$\sum_i S_{i,j} = \sum_i \frac{\gamma_{i,j}}{\nu_j} = \frac{1}{\nu_j}\sum_i \gamma_{i,j} = \frac{1}{\nu_j}\,\nu_j = 1. \tag{5.4}$$

This shows that $S$ is a stochastic matrix, and $S\nu = \sum_j \frac{\gamma_{i,j}}{\nu_j}\nu_j = \mu$. We can similarly form a stochastic matrix $\tilde{S} = \left(\frac{\gamma_{j,i}}{\mu_i}\right)$ that sends $\mu$ to $\nu$.
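The construction is easy to verify numerically. The sketch below uses the POT package to compute an optimal coupling and then forms the stochastic matrix of Equation (5.3); the specific vectors are arbitrary illustrative choices, and the final lines anticipate the re-weighting construction discussed next.

```python
import numpy as np
import ot   # Python Optimal Transport (POT); assumed available

# Two probability vectors on a common 1-D grid and the squared-distance cost.
x = np.linspace(0.0, 1.0, 6)
mu = np.array([0.3, 0.2, 0.1, 0.1, 0.2, 0.1])
nu = np.array([0.1, 0.1, 0.3, 0.2, 0.2, 0.1])
C = (x[:, None] - x[None, :]) ** 2

gamma = ot.emd(mu, nu, C)            # optimal coupling; rows sum to mu, columns to nu
S = gamma / nu[None, :]              # Eq. (5.3): divide each column by nu_j

print(np.allclose(S.sum(axis=0), 1.0))    # columns of S sum to one, Eq. (5.4)
print(np.allclose(S @ nu, mu))            # S sends nu back to mu

# Sending a different vector eta (absolutely continuous w.r.t. nu) through S,
# as discussed next:
eta = np.array([0.2, 0.0, 0.2, 0.2, 0.2, 0.2])
zeta = S @ eta
omega = gamma * (eta / nu)[None, :]       # the coupling of eta and zeta from the text
print(np.allclose(omega.sum(axis=1), zeta), np.allclose(omega.sum(axis=0), eta))
```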


The above construction becomes useful when considering a probability vector $\eta$ and defining a probability vector $\zeta := S\eta$. Specifically, we observe that

$$\omega = \left(\frac{\gamma_{i,j}}{\nu_j}\,\eta_j\right)_{i,j}$$
yields the optimal coupling between $\eta$ and $\zeta$. First, from Equation (5.2), we have that $\omega$ is a coupling of $\eta$ and $\zeta$ because

$$(\omega\mathbf{1})_i = \sum_j \frac{\gamma_{i,j}}{\nu_j}\,\eta_j = \sum_j S_{i,j}\,\eta_j = \zeta_i,$$
and
$$(\mathbf{1}^T\omega)_j = \sum_i \frac{\gamma_{i,j}}{\nu_j}\,\eta_j = \eta_j \sum_i S_{i,j} = \eta_j,$$
since $S$ is a stochastic matrix with each column summing to 1, following Equation (5.3). Second, we see that the optimality of the coupling comes via Corollary 2.26, which says that a probability measure which is absolutely continuous with respect to an optimal coupling is optimal itself. Allowing $\gamma_{i,j} = \gamma(x, y)$ (and likewise for $\omega$), we see that $\omega \ll \gamma$ with $d\omega(x, y) = \frac{\eta(x)}{\nu(x)}\, d\gamma(x, y)$.

It may be unsatisfying that the coupling which we refer to as solving the problem from $\mu$ to $\nu$ creates a matrix which sends $\nu$ to $\mu$. This is a consequence of our choices to view $\mu$ and $\nu$ as probability vectors instead of probability covectors, and because we interpret the $i, j$ in $\gamma_{i,j}$ as being rows $\times$ columns as opposed to columns $\times$ rows.

So far we have looked at how, in the discrete to discrete case, we can transform the coupling into a stochastic matrix; how we extend this to the general case is through Markov kernels. To this end, we begin with the following definition:

Definition 5.4. A Markov kernel $K$ is an optimal transport kernel for a cost function $c(x, y)$ if, for measures $\mu$ and $\nu := \mu K$, the product measure $\gamma := \mu K$ of Theorem 5.3 is an optimal coupling between $\mu$ and $\nu$ whenever the transport problem has finite cost.


While there are many advantages to phrasing optimal transport in terms of couplings as opposed to transport maps, there is one significant disadvantage. Specifically, while an optimal transport map may transport any measure optimally, a coupling is only between fixed marginals. Fortunately, optimal transport kernels allow us to restore that important functionality. The question that remains is how to construct an optimal transport kernel, and the surprising answer is that we have been constructing them all along by constructing optimal transport couplings. To help clarify our presentation, we briefly pause to advise the reader on the organization of the remainder of this chapter. We first focus on the construction of an optimal transport kernel and how we can use optimal transport couplings as kernels. We will discuss how we can recognize when two couplings are images of the same optimal transport kernel, and this will be conveyed through the c-CM compatibility property. We then move to a discussion on connections between optimal transport kernels and stochastic processes, in particular by looking at the Chapman-Kolmogorov equation and time homogeneity. We then show how optimal transport kernels allow us to extend the theory to signed measures. Finally, we consider some future questions in this direction.

5.2.1. Optimal transport kernels

Recall that the stochastic matrix was formed in Equation (5.3) by dividing the entries of the coupling by the marginal in the discrete to discrete case. We could instead choose to use $\gamma$ itself for this purpose if, rather than dividing the entries of the matrix, we considered the vector $\frac{\eta_j}{\nu_j}$. Note that this is still a probability vector, but it is now a probability vector in reference to $\nu$. This shift in perspective is helpful because it allows us to use the coupling, in effect, as a kernel. There is an important caveat, however: we can only use it as a kernel for a measure $\eta$ which is absolutely continuous

with respect to $\mu$ (or $\nu$, but this would send it to a different measure), and for which $\int_X c_X(x)\, d\eta(x) < \infty$.

Proof. This is proven by first showing that ! has finite transport cost, and then showing that ! is optimal because it is absolutely continuous with respect to by applying Corollary 2.26. Observe that ! has finite transport cost since

c(x, y)d!(x, y) (cX (x)+cY (y))f(x)d(x, y) X Y  X Y Z ⇥ Z ⇥

= cX (x)f(x)dµ(x)+ cY (y)d⌫(y) ZX ZY = c (x)d⌘(x)+ c (y)d⌫(y) < . X Y 1 ZX ZY

Clearly ! , with Radon-Nikodym derivative d! (x, y)=f(x). Thus ! satis- ⌧ d fies the conditions of Corollary 2.26 and is an optimal transport plan between the marginals ⌘ and ⇣. This completes the proof.

Corollary 5.5 applies to the Wasserstein distances since for any $x_0$
$$c_X(x) = c_Y(x) = 2^{p-1}\operatorname{dist}(x, x_0)^p,$$

and in that setting $P_c(X)$ is the now-familiar space $P_p(X)$.


In light of the earlier discussion in this chapter, Corollary 5.5 implies that we can use an optimal coupling as an optimal transport kernel. It moreover provides further conditions that we should impose on transport kernels. To ensure that a transport kernel gives rise to a coupling with a finite cost, it is useful to consider cost functions $c(x, y) \le c_X(x) + c_Y(y)$, as well as to consider transition kernels that are not only Markov kernels, but also $c_Y$-finite Markov kernels, as defined by

Definition 5.6. A Markov transition kernel is said to be a $c_Y$-finite kernel if
$$\int_Y c_Y(y)\, K(x, dy) < \infty \quad \text{for all } x \in X.$$

When $\mu \in P_c(X)$ and $K$ is a $c_Y$-finite kernel, then we can be assured that $\mu K$ as a product measure has finite transport cost. We note that some optimal transport

kernels are not $c_Y$-finite. Henceforth we will consider only $c(x, y) = \operatorname{dist}(x, y)^p$, but the results can be generalized to broader cost functions as long as care is taken to ensure the existence of optimal couplings. It is important to point out that although we can use a coupling as a kernel in some cases, there are some key differences. Unlike a kernel, a coupling is not defined point-wise, which is why we can only send measures which are absolutely continuous with respect to a marginal through a coupling. If a coupling is not defined everywhere, then it is as though we are only seeing part of the picture. A related question, then, is how we can know whether two couplings come from the same kernel. To answer that question we introduce the following definition.

Definition 5.7. Two optimal couplings $\gamma_1, \gamma_2$ are said to be c-CM compatible, or just compatible, if $\mathrm{supp}(\gamma_1) \cup \mathrm{supp}(\gamma_2)$ is a c-CM set.

It is interesting to note that the concept of ‘plan splitting’ in [34] is related to

decomposing a coupling into smaller pieces. The concept we introduce here, by contrast, allows us to put the pieces back together. Moreover, there is no mention of the joint support being c-CM in [34], and the use there is in line with Theorem 2.19 as opposed to Corollary 2.26. When we have an optimal transport kernel $K$ and two measures with disjoint support, $\mu_1$ and $\mu_2$, satisfying appropriate hypotheses, then we can consider $\gamma_1(x, y) =$

$\mu_1 K$ and $\gamma_2(x, y) = \mu_2 K$. These two couplings will be compatible. We can see this because, by letting $\mu = \frac{1}{2}(\mu_1 + \mu_2)$, we have $\mathrm{supp}(\mu K) = \mathrm{supp}(\gamma_1) \cup \mathrm{supp}(\gamma_2)$, and $\mathrm{supp}(\mu K)$ will be c-CM because $K$ is an optimal transport kernel. The c-CM compatibility allows us to do the reverse and put couplings together in order to create a kernel which can take inputs from a broader set of measures. This would be most useful if we were able to determine when the optimal coupling between $\mu_1$ and $\nu_1$ is compatible with the optimal coupling between $\mu_2$ and $\nu_2$ before we actually solved for them. Unfortunately, this is difficult to do and we only have coarse results in that direction.

Theorem 5.8. Let the measures $\mu_1, \nu_1 \in P_p(X)$ be supported on the compact sets $J_1$ and $L_1$, and analogously the measures $\mu_2, \nu_2 \in P_p(X)$ on the compact sets $J_2$ and $L_2$. Also let $\gamma_1$ and $\gamma_2$ be optimal couplings between $\mu_1$ and $\nu_1$ and between $\mu_2$ and $\nu_2$ respectively. If
\[
\max_{x \in J_1,\, y \in L_1} \{ c(x, y) \} < \min_{x \in J_1,\, y \in L_2} \{ c(x, y) \} \tag{5.5}
\]
and
\[
\max_{x \in J_2,\, y \in L_2} \{ c(x, y) \} < \min_{x \in J_2,\, y \in L_1} \{ c(x, y) \}, \tag{5.6}
\]
then $\gamma_1$ and $\gamma_2$ are compatible.

Proof. The inequalities required by this theorem are quite strong and are designed to ensure c-cyclic monotonicity.


Choose $n + m$ arbitrary points from $\mathrm{supp}(\gamma_1) \cup \mathrm{supp}(\gamma_2)$, and let $\{(x_1, y_1), \dots, (x_n, y_n)\} \subset \mathrm{supp}(\gamma_1)$ and $\{(\hat{x}_1, \hat{y}_1), \dots, (\hat{x}_m, \hat{y}_m)\} \subset \mathrm{supp}(\gamma_2)$. Then using the alternate definition of c-CM given by Equation (2.5), we need to show that

\[
\sum_{i=1}^{n-1} c(x_i, y_{i+1}) + c(x_n, \hat{y}_1) + \sum_{i=1}^{m-1} c(\hat{x}_i, \hat{y}_{i+1}) + c(\hat{x}_m, y_1) \ge \sum_{i=1}^{n} c(x_i, y_i) + \sum_{i=1}^{m} c(\hat{x}_i, \hat{y}_i). \tag{5.7}
\]
From Equations (5.5) and (5.6) we have that

\[
c(x_n, y_1) < c(x_n, \hat{y}_1) \quad\text{and}\quad c(\hat{x}_m, \hat{y}_1) < c(\hat{x}_m, y_1). \tag{5.8}
\]

Hence
\[
\sum_{i=1}^{n-1} c(x_i, y_{i+1}) + c(x_n, \hat{y}_1) + \sum_{i=1}^{m-1} c(\hat{x}_i, \hat{y}_{i+1}) + c(\hat{x}_m, y_1) > \sum_{i=1}^{n} c(x_i, y_{i+1}) + \sum_{i=1}^{m} c(\hat{x}_i, \hat{y}_{i+1}), \tag{5.9}
\]
with $y_{n+1} = y_1$ and $\hat{y}_{m+1} = \hat{y}_1$. By hypothesis $\gamma_1$ and $\gamma_2$ are optimal couplings and each has c-CM support. Thus

\[
\sum_{i=1}^{n} c(x_i, y_{i+1}) \ge \sum_{i=1}^{n} c(x_i, y_i) \quad\text{and}\quad \sum_{i=1}^{m} c(\hat{x}_i, \hat{y}_{i+1}) \ge \sum_{i=1}^{m} c(\hat{x}_i, \hat{y}_i). \tag{5.10}
\]
Combining Equations (5.8)–(5.10) yields the desired inequality (5.7) and shows that $\mathrm{supp}(\gamma_1) \cup \mathrm{supp}(\gamma_2)$ is a c-CM set.

The results in Theorem 5.8 are not surprising. The theorem tells us that if the measures involved are far apart from each other, then the couplings will not interact at all. In this case we can put them together because they represent distinct parts of a larger picture.
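As a quick sanity check, the separation conditions (5.5) and (5.6) can be verified directly in a toy discrete setting with the cost $c(x, y) = |x - y|^2$ (the data below are made up for illustration).

import numpy as np

def pairwise_cost(A, B, p=2):
    # c(x, y) = |x - y|^p for all pairs of 1-D points in A and B
    return np.abs(A[:, None] - B[None, :]) ** p

# Supports of (mu_1, nu_1) and (mu_2, nu_2): two well-separated clusters.
J1, L1 = np.array([0.0, 0.2]), np.array([0.1, 0.3])
J2, L2 = np.array([10.0, 10.2]), np.array([10.1, 10.3])

# Conditions (5.5) and (5.6): staying within a cluster is always cheaper
# than transporting across clusters.
cond_55 = pairwise_cost(J1, L1).max() < pairwise_cost(J1, L2).min()
cond_56 = pairwise_cost(J2, L2).max() < pairwise_cost(J2, L1).min()
print(cond_55 and cond_56)   # True, so Theorem 5.8 gives compatibility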


5.2.2. Geodesics and Kernels

A geodesic in $P_p(X)$ is described in Corollary 2.37 as a continuous curve $(\mu_t)_{0 \le t \le 1}$ valued in $P_p(X)$. In the background there is a probability measure, $\Pi$, on the space of geodesics, with $e_t$ the evaluation of those paths at time $t$ giving the measure $\mu_t$. Note that only special members of $P(\mathrm{Geod}(X))$ yield geodesics in $P_p(X)$. We can find the optimal coupling between any two points on this curve by looking at $(e_s, e_u)_\# \Pi = \gamma_{s,u}$, which has marginals $\mu_s$ and $\mu_u$.

As previously established, each of these couplings $\gamma_{s,u}$ can function as a kernel (for measures absolutely continuous with respect to $\mu_s$), but it is worth taking a moment to explore what that means. Consider $\gamma_{s,u}$ as a kernel $K_{s,u}$, and send $\mu_s$ through it. Trivially $\mu_s \ll \mu_s$, and the Radon–Nikodym derivative will be the constant function $1$, so this complies with Corollary 5.5. Then $\mu_s K_{s,u} = (\mathrm{proj}_y)_\#(1 \cdot \gamma_{s,u}) = \mu_u$. So the kernel progresses the measure $\mu_s$ along the geodesic.

When we have a family of kernels $(K_{s,u})_{0 \le s \le u \le 1}$ and a geodesic $(\mu_t)_{0 \le t \le 1}$ such that

$\mu_s K_{s,u} = \mu_u$, we say that the family of kernels is connected to the geodesic. We now turn to a more interesting example. Let $\eta_s \ll \mu_s$ with Radon–Nikodym derivative $f_s(x)$. Then $f_s(x)\gamma_{s,u}(x, y)$ is an optimal coupling between $\eta_s$ and $\eta_u$, where

\[
\eta_u := (\mathrm{proj}_y)_\#\big(f_s(x)\gamma_{s,u}\big).
\]

Note that the geodesics in the support of $\Pi$ do not change; rather, they are just re-weighted. With this in mind, we can think of a kernel as a collection of strings, each with attached weights. The connections stay the same regardless of the mass transported through the kernel, but the appearance varies as a result of differing mass quantities. It is therefore not surprising that $\eta_u$ can be represented in a number of different ways, depending on the process at an intermediate time $t$. For $s \le t \le u$,

let

\[
\eta_t := (\mathrm{proj}_y)_\#\big(f_s(x)\gamma_{s,t}\big),
\]

and $f_t(x)$ be the Radon–Nikodym derivative of $\eta_t$ with respect to $\mu_t$. Then

\[
(\mathrm{proj}_y)_\#\big(f_t(x)\gamma_{t,u}\big) = \eta_u = (\mathrm{proj}_y)_\#\big(f_s(x)\gamma_{s,u}\big). \tag{5.11}
\]

This provides a nice consistency check, but also stems from understanding that all of these couplings are from the same dynamical optimal coupling in the background.
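As an entirely discrete illustration of this consistency (a toy construction, not from the thesis): in one dimension the geodesic kernels are induced by monotone maps, and the kernel from $s$ to $t$ composed with the kernel from $t$ to $u$ agrees with the kernel from $s$ to $u$.

import numpy as np

# A toy 1-D geodesic: three equal-mass atoms moving on straight lines (p = 2),
# so mu_t has atoms at (1 - t) * x0 + t * x1 and the monotone matching is optimal.
x0 = np.array([0.0, 1.0, 2.0])
x1 = np.array([3.0, 5.0, 9.0])

def atoms(t):
    return (1 - t) * x0 + t * x1

def push(positions, u):
    # The kernel K_{s,u} moves the atom sitting at the i-th smallest position of
    # mu_s to the i-th smallest position of mu_u (the monotone optimal map).
    order = np.argsort(positions)
    out = np.empty_like(positions)
    out[order] = np.sort(atoms(u))
    return out

s, t, u = 0.2, 0.5, 0.9
via_t  = push(push(atoms(s), t), u)     # go s -> t, then t -> u
direct = push(atoms(s), u)              # go s -> u directly
print(np.allclose(via_t, direct))       # True: Chapman-Kolmogorov-style consistency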

If we consider $\xi(x, y, z) := (e_s, e_t, e_u)_\#(\Pi)$, then $f_s(x)\xi(x, y, z)$ will be a measure on $X \times X \times X$ with the individual marginals $\eta_s$, $\eta_t$, $\eta_u$, and with the marginals from $\mathrm{proj}_{x,y}$ (and the other pairs) as the optimal couplings between them. From this perspective, the consistency observed in Equation (5.11) is expected. This type of consistency is comparable to that of the Chapman–Kolmogorov equation in Equation (5.1). In this way we can see the evolution of a Wasserstein geodesic as a stochastic process. An immediate follow-up question is whether optimal transport kernels are homogeneous. The answer is that when $p > 1$, optimal transport kernels are generically inhomogeneous, and the reason was actually given back in Chapter 2 by Example 2.39. If the path-lines of a kernel cross, then it cannot be homogeneous.

Let $K_{s,t}$ be a family of optimal transport kernels connected to the geodesic $(\mu_t)_{0 \le t \le 1}$ and the measure $\Pi$ on the geodesics. Let $P(t)$ and $Q(t)$ represent two path-lines of

$(\mu_t)_{0 \le t \le 1}$, i.e. $P(t)$ and $Q(t)$ are two geodesics in the support of $\Pi$. Suppose that $P(t)$ and $Q(t)$ cross, so that there are times $t_1 \neq t_2$ such that $P(t_1) = Q(t_2) = C$. Since the kernel

$K$ is connected to $(\mu_t)_{0 \le t \le 1}$, then $P(s)K_{s,t} = P(t)$ and $Q(s)K_{s,t} = Q(t)$.

We can see that $K$ is inhomogeneous by comparing $K_{t_1, t_1+t}$ and $K_{t_2, t_2+t}$. If


K were homogeneous, then these kernels would be identical. However

\[
C K_{t_1, t_1+t} = P(t_1) K_{t_1, t_1+t} = P(t_1 + t)
\]
and

\[
C K_{t_2, t_2+t} = Q(t_2) K_{t_2, t_2+t} = Q(t_2 + t).
\]

$P$ and $Q$ are different path-lines, so they can intersect at most once, and we see that the kernels send $C$ to different positions depending on the time. Thus whenever we have path-lines which can cross, we have inhomogeneous kernels, and path-lines can cross for all of the Wasserstein distances with $p \ge 2$.

Section 5.3 Signed Optimal Transport with Kernels

An exciting prospect for using kernels in optimal transport is that they provide an opportunity to extend optimal transport to signed measures. There have been other attempts to do this, notably [6, 34, 43, 44, 45, 46]; however, the approach here is new and quite distinct.

Having a theory of optimal transport which allows for signed measures will allow for new application areas, like those in signal processing. We note that optimal transport was used to estimate the parameters in seismic imaging models in [16, 17, 18], but these studies all required that the signal first be converted to be non-negative. Multiple approaches were taken to do this, including adding a large constant, exponentiating the signal, and separating the signal into positive and negative parts. These approaches were all done without a developed theory for signed optimal transport, and although successful to some degree, it is clear that potential applications for optimal transport methods are currently limited because of the lack of fundamental results. The theory presented here helps to bridge this gap. While it is more restrained than other theories related to signed optimal transport, it carries over the essential structure of optimal transport to signed measures using transition kernels.

In what follows we refer to signed measures with Latin letters, i.e. $a, b$, and the positive absolute value measures either as $|a|, |b|$ or with Greek letters, i.e. $\alpha, \beta$. The class of signed measures that we focus on are those $a \in M(X)$ with the properties

(a) $\int_X da < \infty$ (finite integral),

(b) $\int_X d\alpha < \infty$ (finite mass),

(c) $\int_X \mathrm{dist}(x, x_0)^p\, d\alpha < \infty$ (finite moment).

The requirements on the signed measure are such that the associated positive measure is one which will have finite optimal transport cost. We will also be looking at optimal couplings between two positive measures which are not probability measures, but which have the same mass. Such couplings are simply the couplings that we have grown familiar with over the course of this thesis, scaled by a constant. They must still satisfy the marginalization constraints and have c-CM support, but the marginals are no longer required to be probability measures.
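For a discrete signed measure all three quantities are finite sums; the short sketch below (a made-up example, not from the thesis) records the Jordan decomposition and the quantities in (a)–(c).

import numpy as np

# A discrete signed measure on the line: signed weights w_i at points x_i.
x = np.array([-1.0, 0.0, 0.5, 2.0])
w = np.array([ 0.7, -0.3, 0.4, -0.8])
x0, p = 0.0, 2                                                 # reference point and exponent

a_plus, a_minus = np.clip(w, 0, None), np.clip(-w, 0, None)    # Jordan decomposition
alpha = a_plus + a_minus                                       # the absolute value measure |a|

integral = w.sum()                                  # quantity (a)
mass     = alpha.sum()                              # quantity (b)
moment   = (np.abs(x - x0) ** p * alpha).sum()      # quantity (c)
print(integral, mass, moment)

The conditions only carry real content for measures with infinite support, but the decomposition mirrors the notation used in the rest of this section.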

Definition 5.9. A signed coupling $g$ is an optimal signed coupling between the marginals $a$ and $b$, which have finite integral, mass, and moment, if it is equal to the product measure derived from $aK$ for some optimal transport kernel $K$.

For two signed measures to be connected by an optimal transport kernel, they must have the same integral. This is because if $b = aK$, then

\[
b(Y) = \int_X K(x, Y)\, da(x) = \int_X da(x),
\]

since $K$ is a Markov kernel and $K(x, Y) = 1$ for all $x \in X$. While they do not need to have the same mass, we will treat this as the standard. Kernels which do not preserve mass are considered special. A salient feature and restriction of the kernel-based approach to signed optimal transport comes in the following theorem.

Theorem 5.10. Let $a$ and $b$ be two signed measures with finite moment, and equal mass and integral. Let $a = a^+ - a^-$ and $b = b^+ - b^-$ be the Jordan decompositions of $a$ and $b$ respectively. Then $a$ and $b$ are connected by an optimal transport kernel if and only if there are compatible optimal couplings $\gamma^+$ between $a^+$ and $b^+$, and $\gamma^-$ between $a^-$ and $b^-$.

Proof. Let $\alpha = |a| = a^+ + a^-$ and $\beta = |b| = b^+ + b^-$. Suppose that $a$ and $b$ are connected by an optimal transport kernel $K$, i.e. $aK = b$. Since $a$ and $b$ have the same integral,

\[
\int_X da = \|a^+\| - \|a^-\| = \|b^+\| - \|b^-\| = \int_X db,
\]
and the same mass,

\[
\int_X d\alpha = \|a^+\| + \|a^-\| = \|b^+\| + \|b^-\| = \int_X d\beta,
\]
then

\[
\|a^+\| = \|b^+\| \quad\text{and}\quad \|a^-\| = \|b^-\|.
\]

We want to first show that $a^+K = b^+$ and $a^-K = b^-$. Notice that for any set $B$,

\[
aK(B) = \int_X K(x, B)\, da = \int_X K(x, B)\, da^+ - \int_X K(x, B)\, da^- \le \int_X K(x, B)\, da^+ = a^+K(B). \tag{5.12}
\]

Therefore $(aK)^+ \le (a^+K)^+$. Now, since $aK = b$, we have $(aK)^+ = b^+$. We also already have that $a^+K$ is a positive measure, so that $(a^+K)^+ = a^+K$. Since $K$ is a Markov kernel, we have $\|a^+K\| = \|a^+\|$, implying that $\|a^+K\| = \|b^+\|$. Hence $(aK)^+ = (a^+K)^+$ and $b^+ = a^+K$; that is, when $b$ is the image of $a$ from the kernel $K$, then $b^+ = a^+K$. We note that in general $b^+ \le a^+K$, but here we have equality since the measures have equal mass. Similarly we find $b^- = a^-K$, while in general $b^- \le a^-K$. This yields $a^+K = \gamma^+$ and $a^-K = \gamma^-$ (as product measures). These measures will be compatible

because $\alpha K$ is an optimal coupling, and $\mathrm{supp}(\alpha K) = \mathrm{supp}(\gamma^+) \cup \mathrm{supp}(\gamma^-)$. For the other direction, suppose that $\gamma^+$ and $\gamma^-$ are compatible and let $\gamma :=$

$\gamma^+ + \gamma^-$. Observe that $\gamma$ is an optimal coupling between its marginals (note, this is not a coupling between probability measures).

Now $a^+ \perp a^-$ and $a^\pm \ll \alpha := (\mathrm{proj}_x)_\#(\gamma)$. Let $K$ be the transition kernel associated to $\gamma$, defined on measures that are absolutely continuous with respect to $\alpha$.

Consider the Radon–Nikodym derivatives of $a^+$ and $a^-$ with respect to $\alpha$. Let

$f_+(x)$ be the Radon–Nikodym derivative of $a^+$ with respect to $\alpha$, which will be equal

to $+1$ on $\mathrm{supp}(a^+) \subset \mathrm{supp}(\alpha)$ and $0$ on $\mathrm{supp}(a^-)$ since $a^+ \perp a^-$. Let $f_-$ be the Radon–Nikodym derivative of $a^-$ with respect to $\alpha$, and similar statements will hold.

Then $f_+(x)\gamma = \gamma^+$ and $f_-(x)\gamma = \gamma^-$, so

\[
aK = (\mathrm{proj}_y)_\#\big((f_+(x) - f_-(x))\gamma\big) = (\mathrm{proj}_y)_\#(\gamma^+ - \gamma^-) = b. \tag{5.13}
\]

In this way, $\gamma$ acts as a kernel sending $a$ to $b$, which completes the proof.

Theorem 5.10 tells us that we will not be able to connect any two arbitrary signed measures together, only the ones where the positive and negative couplings are jointly c-CM.


This is disappointing for some visions of signed optimal transport that want to be able to connect any two signed measures in the same way as one can with probability measures, but it is an inherent limitation of our approach, which focuses on c-CM as the salient feature of optimality. It would be nice if there were an equivalence relation on signed measures such that for any three measures $a_i$ in the same equivalence class, there are kernels $K_1, K_2$ and $K_3$ such that $a_1 K_1 = a_3$, $a_2 K_2 = a_3$, and $a_1 K_3 = a_2$ (and kernels mapping in the reverse direction as well). Unfortunately, in two dimensions and higher this is not the case, which we demonstrate with the following example.

Example 5.11. Let $X = \mathbb{R}^2$, and $a_1 = \delta_{(1,.5)} - \delta_{(-1,-.5)}$, $a_2 = \delta_{(-1,.5)} - \delta_{(1,-.5)}$, and $a_3 = \delta_{(0,1)} - \delta_{(0,-1)}$.

In this example there are optimal transport kernels $K_1$ and $K_2$ such that $a_1 K_1 = a_3$ and $a_3 K_2 = a_2$. However, Theorem 5.10 says that there is no kernel sending $a_1$ to $a_2$. This example more generally demonstrates that, in two dimensions or higher, the relation of being connected by an optimal transport kernel is not transitive. However, it will be in one dimension, and there is a simple tool that we can use to signify when two signed measures are in the same equivalence class.
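A small numerical check of Example 5.11 with the squared Euclidean cost (the helper functions below are ours, not part of the thesis): for Dirac-to-Dirac couplings, c-cyclical monotonicity of the joint support reduces to a single swap inequality.

import numpy as np

def c(x, y):
    # squared Euclidean cost
    return float(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def two_point_c_cm(pair1, pair2):
    # c-CM check for a support of two points (x1, y1), (x2, y2):
    # swapping the targets must not decrease the total cost
    (x1, y1), (x2, y2) = pair1, pair2
    return c(x1, y1) + c(x2, y2) <= c(x1, y2) + c(x2, y1)

# a1 -> a3: gamma+ sends (1, .5) to (0, 1); gamma- sends (-1, -.5) to (0, -1).
print(two_point_c_cm(((1, .5), (0, 1)), ((-1, -.5), (0, -1))))      # True: compatible
# a1 -> a2: gamma+ sends (1, .5) to (-1, .5); gamma- sends (-1, -.5) to (1, -.5).
print(two_point_c_cm(((1, .5), (-1, .5)), ((-1, -.5), (1, -.5))))   # False: not c-CM, so no kernel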

5.3.1. One Dimensional Signed Optimal Transport

The preliminaries of this section were laid out in Section 2.4, and we encourage the reader to revisit that section. We now look at measures on $\mathbb{R}$ which not only have finite integral, mass, and moment, but also have a finite length signature, as defined by

Definition 5.12. A signed measure $a$ is said to have a finite length signature if

it has finite mass and can be decomposed into $a = \sum_{i=1}^{n} a^i$ for some $n$, and the $a^i$ measures are mutually singular with support contained in an interval $(p_{i-1}, p_i]$, with


$p_0 = -\infty$ and $p_n = \infty$, and each $a^i$ equal to either $a^+|_{(p_{i-1}, p_i]}$ or $-a^-|_{(p_{i-1}, p_i]}$. We further require that the measure $a^i$ has opposite sign to the measures $a^{i-1}$ and

$a^{i+1}$ when they exist. The signed measure $a$ then has signature $(z_1, \dots, z_n)$ with

$z_i := \int_{p_{i-1}}^{p_i} da$. The intervals $(p_{i-1}, p_i]$ are called the signature intervals of $a$. The signature of a measure is meant to encapsulate the order of the positive and negative mass of the measure. It has the following properties:

(a) $\int_{\mathbb{R}} da = \sum_j z_j$,

(b) $\int_{\mathbb{R}} d\alpha = \sum_j |z_j|$,

(c) $\int_{-\infty}^{p_i} da = \sum_{j=1}^{i} z_j$.

We now rephrase Lemma 2.45 in the context of signed measures with the same signature.

Lemma 5.13. Let $\alpha$ and $\beta$ be positive measures coming from signed measures $a$ and

$b$ that both have the same signature $(z_i)_{i=1}^{n}$ and with signature intervals $(p_{i-1}, p_i]$ and

$(q_{i-1}, q_i]$. Then the support of the optimal plan is contained in

\[
\bigcup_{i=1}^{n} (p_{i-1}, p_i] \times (q_{i-1}, q_i].
\]

With this lemma, we can show that any two signed measures with the same signature can be sent to one another by the kernel generated by the coupling between the absolute value measures.

Theorem 5.14. Let $a$ and $b$ be signed measures with the same finite length signature $(z_1, \dots, z_n)$, and let $\alpha := |a|$ and $\beta := |b|$ be their absolute value measures with appropriate finite moments. Let $\gamma$ be the optimal coupling between $\alpha$ and $\beta$, and $K$ the kernel associated to $\gamma$. Then $K$ sends $a$ to $b$.


Proof. Let $f(x)$ be the Radon–Nikodym derivative of $a$ with respect to $\alpha$, and $g(y)$ the

Radon–Nikodym derivative of $b$ with respect to $\beta$. First, notice that $\int_Y f(x)\, d\gamma(x, y) = f\alpha = a$ and $\int_X g(y)\, d\gamma(x, y) = g\beta = b$ by the marginalization criteria of an optimal coupling and the definition of the Radon–Nikodym derivative.

Now, let $\tilde{b} = \int_X f(x)\, d\gamma(x, y)$. This is the image of $a$ under the kernel given by $\gamma$. We will show that $\tilde{b} = b$ as measures by comparing their evaluations on a measurable set. Without loss of generality, assume that the set $I$ is contained within a single signature interval $(q_{i-1}, q_i]$. Since the intervals are disjoint, we could of course partition a set that spanned multiple signature intervals into its parts contained within each signature interval.

Now, $\tilde{b}(I) = \int_I \int_X f(x)\, d\gamma(x, y)$. Since $\mathrm{supp}(\gamma) \subset \bigcup_i (p_{i-1}, p_i] \times (q_{i-1}, q_i]$ by Lemma

5.13 and $I \subset (q_{i-1}, q_i]$, then we can restrict our integral to being over $(p_{i-1}, p_i]$. In

$(p_{i-1}, p_i] \times I \subset (p_{i-1}, p_i] \times (q_{i-1}, q_i]$, the Radon–Nikodym derivative $f(x)$ is constant and is equal to $\mathrm{sign}(z_i)$; likewise $g(y)$ is constant and is also equal to $\mathrm{sign}(z_i)$. Thus,

\[
\tilde{b}(I) = \int_{y \in I} \int_{p_{i-1}}^{p_i} f(x)\, d\gamma(x, y) = \int_{y \in I} \int_{p_{i-1}}^{p_i} \mathrm{sign}(z_i)\, d\gamma(x, y) = \int_{y \in I} \int_{p_{i-1}}^{p_i} g(y)\, d\gamma(x, y).
\]

Observe that the above integral is equivalent to the integral over all of $X$ because $\gamma$ is only supported on the part of $X$ that is currently included. It is equal to $b(I)$ by the marginalization constraint on $\gamma$, that is,

\[
\int_{I} \int_{p_{i-1}}^{p_i} g(y)\, d\gamma(x, y) = \int_I \int_X g(y)\, d\gamma(x, y) = b(I).
\]

By combining the above two equations, we obtain $\tilde{b}(I) = b(I)$ for any measurable set contained in a single signature interval, and thus $\tilde{b} = b$ as measures. This completes the proof.
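As a purely discrete sketch of Theorem 5.14 (the helper functions below are ours, not part of the thesis): compute the signature by scanning atoms left to right, build the monotone coupling between the absolute value measures with a north-west-corner pass, and push $a$ through the induced kernel to recover $b$.

import numpy as np

def signature(x, w):
    # Signature of the discrete signed measure sum_i w_i delta_{x_i}:
    # signed masses of maximal runs of constant sign, scanned left to right.
    sig = []
    for wi in w[np.argsort(x)]:
        if sig and np.sign(wi) == np.sign(sig[-1]):
            sig[-1] += wi
        else:
            sig.append(wi)
    return sig

def monotone_coupling(x, wa, y, wb):
    # North-west-corner construction of the monotone (1-D optimal) coupling
    # between two positive discrete measures with equal total mass.
    ix, iy = np.argsort(x), np.argsort(y)
    ra, rb = wa[ix].astype(float), wb[iy].astype(float)
    i = j = 0
    plan = []                                   # entries (source index, target index, mass)
    while i < len(ra) and j < len(rb):
        m = min(ra[i], rb[j])
        if m > 0:
            plan.append((ix[i], iy[j], m))
        ra[i] -= m; rb[j] -= m
        if ra[i] <= 1e-12: i += 1
        if rb[j] <= 1e-12: j += 1
    return plan

# Two signed measures with the same signature (1, -1, 0.5).
xa = np.array([0.0, 0.5, 1.0, 2.0]);  wa = np.array([0.4, 0.6, -1.0, 0.5])
xb = np.array([3.0, 4.0, 4.5, 5.0]);  wb = np.array([1.0, -0.7, -0.3, 0.5])
print(np.allclose(signature(xa, wa), signature(xb, wb)))   # True: same signature

# Push a through the kernel induced by the monotone coupling between |a| and |b|:
# each unit of |a|-mass carries the sign of a at its source point.
image = np.zeros_like(wb)
for i, j, m in monotone_coupling(xa, np.abs(wa), xb, np.abs(wb)):
    image[j] += np.sign(wa[i]) * m
print(np.allclose(image, wb))                               # True, as Theorem 5.14 predicts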

An alternate proof shows that the supports of the couplings $\gamma^+$ (between $a^+$ and $b^+$) and $\gamma^-$ (between $a^-$ and $b^-$) are c-CM compatible and then uses Theorem 5.10 to obtain the final result. We prefer the proof presented here because it emphasizes the ability to use the coupling between the absolute value measures as a kernel. We now show that if two signed measures have the same integral and mass but different signatures, then they cannot be sent to each other.

Theorem 5.15. Let $a$ and $b$ be signed measures with equal mass and integral and finite moment. Let $(z_i)_{i=1}^{n}$ be the signature of $a$ and $(w_i)_{i=1}^{m}$ the signature of $b$, and let $(z_i) \neq (w_i)$. Then there is not an optimal transport kernel $K$ sending $a$ to $b$.

Proof. Theorem 5.10 says that there is an optimal transport kernel between $a$ and

$b$ if and only if the coupling $\gamma^+$ between $a^+$ and $b^+$ and the coupling $\gamma^-$ between $a^-$ and $b^-$ are compatible. Let $k$ be the first index at which $\sum_{i=1}^{k} z_i \neq \sum_{i=1}^{k} w_i$. There will be no issue transporting the mass before we reach $p_{k-1}$ or $q_{k-1}$, but we can modify the measures such that they are equal to $0$ up to $p_{k-1}$ and $q_{k-1}$. Thus we can assume that it is the first term of the signature that is different.

Case 1: If $w_1$ and $z_1$ have different signs, then because the measures still have the same mass and integral there will be a $w_2$ and a $z_2$ also with different signs. Without loss of generality, assume that $z_1$ is positive, which determines the signs for $z_2$, $w_1$, and $w_2$.

Forming the optimal couplings $\gamma^+$ and $\gamma^-$, we know that $\gamma^+\big((-\infty, p_1] \times (q_1, q_2]\big) = z_1 \wedge w_2$ and that $\gamma^-\big((p_1, p_2] \times (-\infty, q_1]\big) = |z_2| \wedge |w_1|$. So there is $(x_1, y_1) \in \mathrm{supp}(\gamma^+)$ where $x_1 \le p_1$ and $y_1 > q_1$, and there is $(x_2, y_2) \in \mathrm{supp}(\gamma^-)$ where $x_2 > p_1$ and $y_2 \le q_1$. These points show that $\mathrm{supp}(\gamma^+) \cup \mathrm{supp}(\gamma^-)$ is not a monotone set, and thus is not c-CM. Thus $\gamma^+$ and $\gamma^-$ are not compatible and there cannot be an optimal transport kernel from $a$ to $b$.

Case 2: Suppose that $w_1$ and $z_1$ are distinct but have the same sign. Assume

that $|z_1| > |w_1|$. We know that there is at least a $z_2$, $w_2$ and $w_3$ because there needs to be an equal amount of positive and negative mass. In a similar way to the case when

they have opposite signs, consider $\gamma^+$ and $\gamma^-$. As before, assume that $z_1$ is positive. If it is not, then the argument still works; we would only need to switch the labels

$+$ and $-$. Now there is too much mass of $a$ in $(-\infty, p_1]$ to be sent to $(-\infty, q_1]$, and so some of it must be sent to the interval $(q_2, q_3]$. Thus $\gamma^+\big((-\infty, p_1] \times (q_2, q_3]\big) = (z_1 - w_1) \wedge w_3$, and this is the mass in the first interval of $a$ that is sent to the third interval of $b$.

For the coupling of the negative measures, we have $\gamma^-\big((p_1, p_2] \times (q_1, q_2]\big) = |z_2| \wedge |w_2|$. Thus there is an $(x_1, y_1) \in \mathrm{supp}(\gamma^+)$ where $x_1 \le p_1$ and $y_1 > q_2$, and similarly an $(x_2, y_2) \in \mathrm{supp}(\gamma^-)$ where $x_2 > p_1$ and $y_2 \le q_2$. These points show that $\mathrm{supp}(\gamma^+) \cup \mathrm{supp}(\gamma^-)$ is not a monotone set, and thus is not c-CM. Then we see that $\gamma^+$ and $\gamma^-$ are not compatible and thus there is not an optimal transport kernel from $a$ to $b$. As these are the only two possible cases, the proof is complete.

The theory above tells us that among signed measures with the same mass and integral, the signature indicates when two signed measures can be sent to one another through a kernel. It is important to point out that throughout this section, we have emphasized that the signed measures not only have equal integral, as would be expected from the image of a Markov kernel, but also equal mass. This is because an optimal transport kernel can destroy mass. Consider the kernel $K$ which maps every point to $\frac{1}{2}\lambda|_{[0,2]}$. Then the measure $a = \lambda|_{[0,1]} - \lambda|_{(1,2]}$ is a signed measure with finite mass, integral and moment, but $aK = 0$. This is because the positive and negative parts of $a$ are sent to the same place and cancel out, but this is done without violating c-CM. Mass destruction is a feature one may want to consider in future theoretical treatments of signed optimal transport.
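Written out, the mass destruction is immediate: for any Borel set $B$,
\[
aK(B) = \int_{\mathbb{R}} K(x, B)\, da(x) = \tfrac{1}{2}\lambda\big(B \cap [0, 2]\big) \int_{\mathbb{R}} da(x) = \tfrac{1}{2}\lambda\big(B \cap [0, 2]\big)\big(\lambda([0, 1]) - \lambda((1, 2])\big) = 0,
\]
since $K(x, \cdot)$ does not depend on $x$ and $a$ has integral zero, even though $|a|$ has mass $2$.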


Section 5.4 Future Questions

There remain a number of interesting questions in the context of what was presented within this chapter. One such question: while we can use a coupling as a kernel for measures which are absolutely continuous with respect to a marginal, how do we extend that kernel beyond its original support? While it seems likely that there should be many ways to do this while also preserving the kernel's optimality, coming up with a robust method is less clear.

Further research could look into the role of kernels that map non-singular measures to singular measures. While they destroy mass for signed measures, they also show how singular measures are cut points of geodesics in the Wasserstein space through the connection between the Chapman–Kolmogorov equations and geodesics. General research can also work to connect the kernel-based approach to the other formulations of the problem.

Our hope is that kernels provide a new approach for extending optimal transport that does not compromise its salient features. In this chapter we discussed signed measures, but by broadening the class of kernels beyond Markov kernels, they may also provide a way to perform unbalanced optimal transport. Hence one future goal is to more fully explore the capabilities of the new tools developed here.

Chapter 6

Conclusion

This thesis showed an interesting and new application of semi-discrete optimal transport. It also developed new methodology regarding the latent transition kernel structure of an optimal coupling.

Chapter 2 provided the bulk of the necessary background in optimal transport for the rest of the thesis, and includes Corollary 2.26, which lays the groundwork for Chapter 5.

Chapter 3 provided the specific background for semi-discrete optimal transport. Lemma 3.2 and Theorem 3.3 showed that the boundaries of a Laguerre cell are invariant to translations of the nodes and how to relate the price function between the translated and untranslated problems. We modified an existing SDOT algorithm by regularizing the Hessian.

Chapter 4 showed how we can use SDOT to relate observations to model outputs. We presented a varied set of examples and showed how misfit functions incorporating an optimal transport cost are amenable to optimization. We also showed how the squared Wasserstein distance can be combined with the gradient of the model through the adjoint equations. This showed that SDOT can be a valuable tool when the parameter space is high dimensional or the model is expensive.


Chapter 5 introduced the idea of using transition kernels as a tool for optimal transport. We showed how they offer a way to extend optimal transport to signed measures, and we looked at what this revealed in one dimension.

The most interesting areas to continue to explore follow the ideas of Chapters 4 and 5. In Chapter 4 we used SDOT for parameter estimation, and hopefully showed that it could be useful, but the task remains to go and use it. We are particularly interested in applications that leverage our ability to take the gradient through the model. In addition to applying an optimal transport misfit to new areas, interesting general questions remain about the technique. These were mentioned in Section 4.4, and many are standard questions in a new context. How robust is the technique to blurring and noise? We are optimistic, but that work was not done in this thesis. How does the number of evaluation points affect the problem when it is time dependent? Other questions are specific to this approach and are related to SDOT. What discrete representations are appropriate? Is a centroidal Voronoi diagram best? What are the relevant properties of the model that ensure that the minimum of the solution is optimal?

Chapter 5 introduced transition kernels as a tool for optimal transport. This was based on the observation in Corollary 2.26 that, since optimality can be phrased in terms of the support of a coupling, we could use a coupling as a transition kernel. This allowed us a new perspective on signed optimal transport. There are structural questions that remain in this area. What is a useful class of transition kernels? We offered a stringent definition of $c_Y$-finite Markov kernels, but did this knowing that it excludes too many kernels. Are we able to cross-pollinate stochastic processes with ideas from optimal transport, or vice versa? The answer is most likely ‘yes’, but that does not tell us what the ideas will be. Do the Chapman–Kolmogorov equations about the consistency of a stochastic process tell us anything about the geodesic structure of the Wasserstein spaces? What role do kernels which produce singularities from non-singular measures play for positive measures and for signed measures? Can we find a problem where a kernel description is more useful than the standard description? Can kernels be used for unbalanced optimal transport?

Answering these questions will only lead to more, and we look forward to developing the ideas in this thesis further and seeing how they grow.

Bibliography

[1] Martin S Alnæs, Jan Blechta, Johan Hake, August Johansson, Benjamin Kehlet, Anders Logg, Chris Richardson, Johannes Ring, Marie E Rognes, and Garth N Wells, The FEniCS project version 1.5, Archive of Numerical Software 3 (2015), no. 100.

[2] Shun-ichi Amari and Hiroshi Nagaoka, Methods of information geometry, vol. 191, American Mathematical Soc., 2007.

[3] Luigi Ambrosio, Lecture notes on optimal transport problems, Mathematical aspects of evolving interfaces, Springer, 2003, pp. 1–52.

[4] Luigi Ambrosio and Nicola Gigli, A user’s guide to optimal transport, Modelling and optimisation of flows on networks, Springer, 2013, pp. 1–155.

[5] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré, Gradient flows: in metric spaces and in the space of probability measures, Springer Science & Business Media, 2008.

[6] Luigi Ambrosio, Edoardo Mainini, and Sylvia Serfaty, Gradient flow of the Chapman–Rubinstein–Schatzman model for signed vortices, Annales de l'IHP Analyse non linéaire, vol. 28, 2011, pp. 217–246.

[7] Franz Aurenhammer, Friedrich Hoffmann, and Boris Aronov, Minkowski-type theorems and least-squares clustering, Algorithmica 20 (1998), no. 1, 61–76.


[8] Jean-David Benamou and Yann Brenier, A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem, Numerische Mathematik 84 (2000), no. 3, 375–393.

[9] Jean-David Benamou, Brittany D Froese, and Adam M Oberman, Numerical solution of the optimal transportation problem using the Monge–Ampère equation, Journal of Computational Physics 260 (2014), 107–126.

[10] David P Bourne and Steven M Roper, Centroidal power diagrams, Lloyd's algorithm, and applications to optimal location problems, SIAM Journal on Numerical Analysis 53 (2015), no. 6, 2545–2569.

[11] David P Bourne, Bernhard Schmitzer, and Benedikt Wirth, Semi-discrete unbalanced optimal transport and quantization, arXiv preprint arXiv:1808.01962 (2018).

[12] Stephen Boyd and Lieven Vandenberghe, Convex optimization, Cambridge University Press, 2004.

[13] Erhan Çınlar, Probability and stochastics, vol. 261, Springer Science & Business Media, 2011.

[14] Marco Cuturi, Sinkhorn distances: Lightspeed computation of optimal transport, Advances in neural information processing systems, 2013, pp. 2292–2300.

[15] Fernando De Goes, Katherine Breeden, Victor Ostromoukhov, and Mathieu Desbrun, Blue noise through optimal transport, ACM Transactions on Graphics (TOG) 31 (2012), no. 6, 171.

[16] Bjorn Engquist and Brittany D Froese, Application of the Wasserstein metric to seismic signals, arXiv preprint arXiv:1311.4581 (2013).


[17] Bjorn Engquist, Brittany D Froese, and Yunan Yang, Optimal transport for seismic full waveform inversion, arXiv preprint arXiv:1602.01540 (2016).

[18] Björn Engquist and Yunan Yang, Seismic imaging and optimal transport, arXiv preprint arXiv:1808.04801 (2018).

[19] Nelson Feyeux, Arthur Vidard, and Maëlle Nodet, Optimal transport for variational data assimilation, Nonlinear Processes in Geophysics 25 (2018), no. 1, 55–66.

[20] Gerald B Folland, Real analysis: Modern techniques and their applications, John Wiley & Sons, 2013.

[21] Thomas O Gallouët and Quentin Mérigot, A Lagrangian scheme à la Brenier for the incompressible Euler equations, Foundations of Computational Mathematics 18 (2018), no. 4, 835–865.

[22] Clark R Givens, Rae Michael Shortt, et al., A class of Wasserstein metrics for probability distributions, The Michigan Mathematical Journal 31 (1984), no. 2, 231–240.

[23] Leonid V Kantorovich, On the translocation of masses, Dokl. Akad. Nauk. USSR (NS), vol. 37, 1942, pp. 199–201.

[24] Jun Kitagawa, Quentin Mérigot, and Boris Thibert, Convergence of a Newton algorithm for semi-discrete optimal transport, arXiv preprint arXiv:1603.05579 (2016).

[25] Achim Klenke, Probability theory: a comprehensive course, Springer Science & Business Media, 2013.


[26] Bruno Lévy, A numerical algorithm for L2 semi-discrete optimal transport in 3D, ESAIM: Mathematical Modelling and Numerical Analysis 49 (2015), no. 6, 1693–1715.

[27] Bruno Lévy and Yang Liu, Lp centroidal Voronoi tessellation and its applications, ACM Transactions on Graphics (TOG) 29 (2010), no. 4, 1–11.

[28] Bruno Lévy and Erica L Schwindt, Notions of optimal transport theory and how to implement them on a computer, Computers & Graphics 72 (2018), 135–148.

[29] Long Li, Arthur Vidard, François-Xavier Le Dimet, and Jianwei Ma, Topological data assimilation using Wasserstein distance, Inverse Problems 35 (2018), no. 1, 015006.

[30] Anders Logg, Kent-Andre Mardal, Garth N Wells, et al., Automated solution of differential equations by the finite element method, Springer, 2012.

[31] Anders Logg and Garth N Wells, Dolfin: Automated finite element computing, ACM Transactions on Mathematical Software 37 (2010), no. 2.

[32] Anders Logg, Garth N Wells, and Johan Hake, Dolfin: a C++/python finite element library, ch. 10, Springer, 2012.

[33] John Lott, Some geometric calculations on Wasserstein space, arXiv preprint math/0612562 (2006).

[34] Edoardo Mainini, A description of transport cost for signed measures, Journal of Mathematical Sciences 181 (2012), no. 6, 837–855.

[35] Robert J McCann, A convexity principle for interacting gases, Advances in Mathematics 128 (1997), no. 1, 153–179.


[36] Robert J McCann and Nestor Guillen, Five lectures on optimal transportation: geometry, regularity and applications, Analysis and geometry of metric measure spaces: lecture notes of the séminaire de Mathématiques Supérieure (SMS) Montréal (2011), 145–180.

[37] Quentin Mérigot, A multiscale approach to optimal transport, Computer Graphics Forum, vol. 30, Wiley Online Library, 2011, pp. 1583–1592.

[38] Gaspard Monge, Mémoire sur la théorie des déblais et des remblais, Histoire de l'Académie Royale des Sciences de Paris (1781).

[39] Jorge Nocedal and Stephen Wright, Numerical optimization, Springer Science & Business Media, 2006.

[40] Felix Otto, The geometry of dissipative evolution equations: the porous medium equation, (2001).

[41] Matthew D Parno, Brendan A West, Arnold J Song, Taylor S Hodgdon, and DT O’Connor, Remote measurement of sea ice dynamics with regularized optimal transport, Geophysical Research Letters 46 (2019), no. 10, 5341–5350.

[42] Gabriel Peyré, Marco Cuturi, et al., Computational optimal transport: With applications to data science, Foundations and Trends in Machine Learning 11 (2019), no. 5-6, 355–607.

[43] Benedetto Piccoli and Francesco Rossi, Transport equation with nonlocal velocity in Wasserstein spaces: convergence of numerical schemes, Acta Applicandae Mathematicae 124 (2013), no. 1, 73–105.

[44] , Generalized Wasserstein distance and its application to transport equations with source, Archive for Rational Mechanics and Analysis 211 (2014), no. 1, 335–358.


[45] , On properties of the generalized Wasserstein distance, Archive for Rational Mechanics and Analysis 222 (2016), no. 3, 1339–1365.

[46] Benedetto Piccoli, Francesco Rossi, and Magali Tournus, A Wasserstein norm for signed measures, with application to nonlocal transport equation with source term, arXiv preprint arXiv:1910.05105 (2019).

[47] Roman Polyak, Complexity of the regularized Newton method, arXiv preprint arXiv:1706.08483 (2017).

[48] Walter Rudin, Real and complex analysis, 3rd ed., McGraw Hill, 1987.

[49] Filippo Santambrogio, Introduction to optimal transport theory, Notes (2014).

[50] , Optimal transport for applied mathematicians, Birkhäuser, NY 55 (2015), 58–63.

[51] Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas, Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains, ACM Transactions on Graphics (TOG) 34 (2015), no. 4, 66.

[52] Asuka Takatsu et al., Wasserstein geometry of Gaussian measures, Osaka Journal of Mathematics 48 (2011), no. 4, 1005–1026.

[53] The CGAL Project, CGAL user and reference manual, 5.2 ed., CGAL Editorial Board, 2020.

[54] Stefan Ulbrich, Analytical background and optimality theory, Optimization with PDE Constraints, Springer, 2009, pp. 1–95.

[55] Cédric Villani, Topics in optimal transportation, no. 58, American Mathematical Soc., 2003.


[56] , Optimal transport: old and new, vol. 338, Springer Science & Business Media, 2008.

[57] Shi-Qing Xin, Bruno Lévy, Zhonggui Chen, Lei Chu, Yaohui Yu, Changhe Tu, and Wenping Wang, Centroidal power diagrams with capacity constraints: Computation, applications, and extension, ACM Transactions on Graphics (TOG) 35 (2016), no. 6, 1–12.
