Optimal Mass Transport

Signal processing and machine-learning applications

Soheil Kolouri, Se Rim Park, Matthew Thorpe, Dejan Slepčev, and Gustavo K. Rohde

Transport-based techniques for signal and data analysis have recently received increased interest. Given their ability to provide accurate generative models for signal intensities and other data distributions, they have been used in a variety of applications, including content-based retrieval, cancer detection, image superresolution, and statistical machine learning, to name a few, and they have been shown to produce state-of-the-art results. Moreover, the geometric characteristics of transport-related metrics have inspired new kinds of algorithms for interpreting the meaning of data distributions. Here, we provide a practical overview of the mathematical underpinnings of mass transport-related methods, including numerical implementation, as well as a review, with demonstrations, of several applications. Software accompanying this article is available from [43].

Purposes for optimal mass transport

Motivation and goals

Numerous applications in science and technology depend on effective modeling and information extraction from signal and image data. Examples include being able to distinguish between benign and malignant tumors in medical images; learning models (e.g., dictionaries) for solving inverse problems; identifying people from images of faces, voice profiles, or fingerprints; and many others. Techniques based on the mathematics of optimal mass transport, also known as the Earth mover's distance in engineering-related fields, have received significant attention recently given their ability to incorporate spatial (in addition to intensity) information when comparing signals, images, and other data sources, thus giving rise to different geometric interpretations of data distributions. These techniques have been used to simplify and augment the accuracy of numerous pattern recognition-related problems.


Some examples covered in this article include image retrieval [32], [44], signal and image representation [25], [27], [40], [50], inverse problems [30], cancer detection [4], [39], texture and color modeling [18], [41], shape and image registration [22], [29], and machine learning [12], [17], [19], [28], [36], [42], to name a few. This article is meant to serve as an introductory guide to those wishing to familiarize themselves with these emerging techniques. Specifically, we
■ provide a brief overview of key mathematical concepts related to optimal mass transport
■ describe recent advances in transport-related methodology and theory
■ provide a practical overview of their applications in modern signal analysis, modeling, and learning problems.

Why transport?

In recent years, numerous techniques for signal and image analysis have been developed to address important learning and estimation problems. Researchers working to unveil solutions to these problems have found it necessary to develop techniques to compare signal intensities across different signal/image coordinates. A common problem in medical imaging, for example, is the analysis of magnetic resonance images with the goal of learning about brain morphology differences between healthy and diseased populations. Decades of research in this area have culminated with techniques such as voxel- and deformation-based morphology that make use of nonlinear registration methods to understand differences in tissue density and locations. Likewise, the development of dynamic time-warping techniques was necessary to enable the comparison of time series data more meaningfully, without confounds from commonly encountered variations in time. Furthermore, researchers desiring to create realistic models of facial appearance have long understood that appearance models for the eyes, lips, nose, and other facial features are significantly different and thus must be dependent on a position relative to a fixed anatomy. The pervasive success of these as well as other techniques, such as optical flow, level-set methods, and deep neural networks, has shown that 1) nonlinearity and 2) modeling the location of pixel intensities are essential concepts to keep in mind when solving modern regression problems related to estimation and classification.

The previously mentioned methodology for modeling appearance and learning morphology, time series analysis and predictive modeling, deep neural networks for classification of sensor data, and the like is algorithmic in nature. The transport-related techniques reviewed in this article are nonlinear methods that, unlike linear methods such as Fourier, wavelets, and dictionary models, explicitly model signal intensities and their locations. Furthermore, they are often based on the theory of optimal mass transport, from which fundamental principles can be put to use. Thus, they hold the promise to ultimately play a significant role in the development of a theoretical foundation for certain subclasses of modern learning and estimation problems.

A brief historical note

The optimal mass transport problem seeks the most efficient way of transforming one distribution of mass to another, relative to a given cost function. The problem was initially studied by the French mathematician Gaspard Monge in his seminal work "Mémoire sur la Théorie des Déblais et des Remblais" [35] in 1781. In 1942, Leonid V. Kantorovich, who, at that time, was unaware of Monge's work, proposed a general formulation of the problem by considering optimal mass transport plans, which, as opposed to Monge's formulation, allow for mass splitting [23]. Kantorovich shared the 1975 Nobel Prize in Economic Sciences with Tjalling Koopmans for his work in the optimal allocation of scarce resources. Kantorovich's contribution is considered "the birth of the modern formulation of optimal transport" [49], and it made the optimal mass transport problem an active field of research in the following years.

A significant portion of the theory of the optimal mass transport problem was developed in the 1990s, starting with Brenier's seminal work on the characterization, existence, and uniqueness of optimal transport maps [9], followed by Caffarelli's work on regularity conditions of such mappings [10] and Gangbo and McCann's work on a geometric interpretation of the problem [20]. A more thorough history and background on the optimal mass transport problem can be found in Villani's book Optimal Transport: Old and New [49] and Santambrogio's book Optimal Transport for Applied Mathematicians [45]. The significant contributions in the mathematical foundations of the optimal transport problem, together with recent advancements in numerical methods [6], [14], [31], [37], have spurred the recent development of numerous data-analysis techniques for modern estimation and detection (e.g., classification) problems.

Formulation of the problem and methodology

While reviewing both the continuous and discrete formulations of the optimal transport problem (i.e., Monge's and Kantorovich's formulations), the geometrical characteristics of the problem, and the transport-based signal/image embeddings, we have elected to avoid measure-theoretic notation and other detailed mathematical language in lieu of a more informal and intuitive description of the problem. However, it must be said that certain mathematical precision is required to best understand the differences between Monge's and Kantorovich's formulations, their geometric interpretations, and other points. The interested reader may find it useful to consult [24] for a more complete and mathematical description of the concepts explained in the following sections.

Optimal transport: Formulation

Over the past century or so, the theory of optimal transport (Earth mover's distance) has developed two main formulations, one utilizing a continuous map (Monge's formulation) and another utilizing what is called a transport plan (Kantorovich's formulation), for assigning the spatial correspondence necessary for the related transport problem.

Although Monge's continuous formulation is helpful in problems where a point-to-point assignment is desired, Kantorovich's formulation is more general and also covers the case of discrete (Dirac) masses (in our case, signal intensities). These not only differ in mathematical formulation but also have consequences with regard to their respective numerical solutions as well as applications.

Monge's continuous formulation

The Monge optimal mass transport problem is formulated as follows. Consider two signals or images $I_0$ and $I_1$ defined over their respective domains $\Omega_0$ and $\Omega_1$. Here, $\Omega_0$ and $\Omega_1$ are typically subsets of $\mathbb{R}^d$ and can often be taken as the unit square (or cube in three dimensions). Although a detailed measure-theoretic formulation is typically required (see [24]), we bypass the rigorous formulation here and simply assume that $I_0(x)$ and $I_1(y)$ correspond to signal intensities at positions $x \in \Omega_0$ and $y \in \Omega_1$. For digital signals, an interpolating model can be used to construct these functions defined over continuous domains from sampled discrete data. The signals are required to be nonnegative, i.e., $I_0(x) \geq 0\ \forall x \in \Omega_0$ and $I_1(y) \geq 0\ \forall y \in \Omega_1$. In addition, the total amount of signal (or mass) for both signals should be equal to the same constant (which is generally chosen to be 1): $\int_{\Omega_0} I_0(x)\,dx = \int_{\Omega_1} I_1(y)\,dy = 1$. In other words, $I_0$ and $I_1$ are assumed to be probability density functions (PDFs).

Monge's optimal transportation problem is to find a function $f: \Omega_0 \to \Omega_1$ that pushes $I_0$ onto $I_1$ and minimizes the objective function

$$M(I_0, I_1) = \inf_{f \in MP} \int_{\Omega_0} c(x, f(x))\, I_0(x)\, dx, \quad (1)$$

where $c: \Omega_0 \times \Omega_1 \to \mathbb{R}^+$ is the cost of moving pixel intensity $I_0(x)$ from $x$ to $f(x)$ [Monge considered the Euclidean distance as the cost function in his original formulation, $c(x, f(x)) = |x - f(x)|$], and $MP$ stands for a measure-preserving map that moves all the signal intensity from $I_0$ to $I_1$. That is, for a subset $B \subset \Omega_1$, the MP requirement is that

$$\int_{\{x:\, f(x) \in B\}} I_0(x)\, dx = \int_B I_1(y)\, dy. \quad (2)$$

If $f$ is one to one, this just means that, for $A \subset \Omega_0$,

$$\int_A I_0(x)\, dx = \int_{f(A)} I_1(y)\, dy.$$

Such maps $f \in MP$ are sometimes called transport maps or mass-preserving maps. Simply put, the Monge formulation of the problem seeks to rearrange signal $I_0$ into signal $I_1$ while minimizing a specific cost function. In cases when $f$ is smooth and one to one, the requirement (2) can be written in a differential form as

$$\det(Df(x))\, I_1(f(x)) = I_0(x) \quad (3)$$

almost everywhere, where $Df$ is the Jacobian of $f$ [see Figure 1(a)]. Note that both the objective function and the constraint in (1) are nonlinear with respect to $f(x)$. Hence, for more than a century, the answers to questions regarding the existence and characterization of solutions to the Monge problem remained unknown.

For certain measures, the Monge formulation of the optimal transport problem is ill posed, in the sense that there is no transport map to rearrange one PDF into another. For instance, consider the case where $I_0$ is a Dirac mass and $I_1$ is not. Kantorovich's formulation alleviates this problem by finding the optimal transport plan as opposed to the transport map.

FIGURE 1. (a) The Monge transport map and (b) Kantorovich's transport plan, illustrated for discrete masses, where the mass at a single $x_i$ may be split among several $y_j$.
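Although the general Monge problem is hard, in one dimension the optimal map for a convex cost is simply the monotone rearrangement $f = F_1^{-1} \circ F_0$ built from the two cumulative distribution functions (CDFs), a fact used later in this article [see the discussion around (7)]. The following minimal sketch, assuming densities sampled on a uniform grid, illustrates this numerically; the grid, the test densities, and the helper name monge_map_1d are our own illustrative choices, not part of the software in [43].

```python
import numpy as np

def monge_map_1d(I0, I1, x):
    """Approximate the 1-D Monge map f = F1^{-1} o F0 that pushes the
    density I0 onto I1; both densities are sampled on the uniform grid x."""
    dx = x[1] - x[0]
    # Normalize so both signals carry the same total mass (here, 1).
    F0 = np.cumsum(I0) * dx; F0 /= F0[-1]   # CDF of I0
    F1 = np.cumsum(I1) * dx; F1 /= F1[-1]   # CDF of I1
    # Evaluate F1^{-1}(F0(x)) by inverse interpolation of the CDF F1.
    return np.interp(F0, F1, x)

x = np.linspace(-6, 6, 2000)
I0 = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)            # N(0, 1)
I1 = np.exp(-0.5 * (x - 1) ** 2 / 4) / np.sqrt(8 * np.pi)  # N(1, 2^2)
f = monge_map_1d(I0, I1, x)
# For these Gaussians the optimal map is affine, f(x) = 1 + 2x; the
# numerical map agrees away from the poorly sampled tails.
m = np.abs(x) < 2
print(np.max(np.abs(f[m] - (1 + 2 * x[m]))))  # small
```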

Kantorovich's formulation

Kantorovich formulated the transport problem by optimizing over transportation plans, which we denote as $\gamma$. One can think of $\gamma$ as the joint distribution of $I_0$ and $I_1$, describing how much mass is being moved to different coordinates; i.e., let $A$ be a subset of $\Omega_0$ and similarly $B \subset \Omega_1$. For notational simplicity, we will not make a distinction between a probability distribution and its density. More precisely, we associate a probability distribution to a signal $I_0$ by $I_0(A) = \int_A I_0(x)\,dx$. The quantity $\gamma(A \times B)$ tells us how much mass in set $A$ is being moved to set $B$. Here, the MP constraint can be expressed as $\gamma(\Omega_0 \times B) = I_1(B)$ and $\gamma(A \times \Omega_1) = I_0(A)$. Kantorovich's formulation of the optimal transport problem can then be written as

$$K(I_0, I_1) = \min_{\gamma \in MP} \int_{\Omega_0 \times \Omega_1} c(x, y)\, d\gamma(x, y). \quad (4)$$

Note that the integration notation $d\gamma(x, y)$ is meant to represent the fact that this integral is more general than the routine Riemann-type integral commonly used in signal processing; the integral can cover integration over domains that are more general. The minimizer of the optimization problem above, $\gamma^*$, is called the optimal transport plan. Unlike the Monge problem, in Kantorovich's formulation the objective function and the constraints are linear with respect to $\gamma(x, y)$. Moreover, Kantorovich's formulation is in the form of a convex optimization problem. We also note that the Monge problem is more restrictive than the Kantorovich problem; i.e., in Monge's version, mass from a single location in $\Omega_0$ is being sent to a single location in $\Omega_1$. Kantorovich's formulation, however, considers transport plans that can deal with arbitrary measurable sets and has the ability to distribute mass from one location in one density to multiple locations in another [see Figure 1(b)]. For any transport map $f: \Omega_0 \to \Omega_1$, there is an associated transport plan, determined by

$$\gamma(A \times B) = \int_{\{x \in A:\, f(x) \in B\}} I_0(x)\, dx. \quad (5)$$

Furthermore, when an optimal transport map $f^*$ exists, it can be shown that the transport plan $\gamma^*$ derived from (5) is an optimal transportation plan [49].

The Kantorovich problem is especially interesting in a discrete setting, i.e., for PDFs of the form $I_0 = \sum_{i=1}^{M} p_i\, \delta(x - x_i)$ and $I_1 = \sum_{j=1}^{N} q_j\, \delta(y - y_j)$, where $\delta(x)$ is the Dirac delta function. Generally speaking, for such PDFs a transport map that pushes $I_0$ into $I_1$ does not exist. In these cases, mass splitting, as allowed by the Kantorovich formulation, is necessary [see Figure 1(b)]. The Kantorovich problem can be written as

$$K(I_0, I_1) = \min_{\gamma} \sum_i \sum_j c(x_i, y_j)\, \gamma_{ij}$$
$$\text{s.t.} \quad \sum_j \gamma_{ij} = p_i, \quad \sum_i \gamma_{ij} = q_j,$$
$$\gamma_{ij} \geq 0, \quad i = 1, \ldots, M, \quad j = 1, \ldots, N, \quad (6)$$

where $\gamma_{ij}$ identifies how much of the mass $p_i$ at $x_i$ needs to be moved to $y_j$ [see Figure 1(b)]. The optimization above has a linear objective function and linear constraints; therefore, it is a linear programming problem. This problem is convex (which, in practice, translates to a relatively easier process of finding a global minimum), but not strictly so, and the constraints define a polyhedral set of $M \times N$ matrices. In practice, a nondiscrete measure is often approximated by a discrete measure, and the Kantorovich problem is solved through the linear programming optimization expressed in (6).
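For small discrete problems, (6) can be handed directly to a generic linear programming solver. The following sketch, our own illustrative example (the toy masses and locations are arbitrary, not from [43]), builds the two marginal constraints of (6) row by row and solves for the optimal plan with scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

# Discrete measures: masses p at locations x_i and masses q at y_j.
x = np.array([0.0, 1.0, 2.0]); p = np.array([0.5, 0.3, 0.2])
y = np.array([0.5, 1.5]);      q = np.array([0.6, 0.4])
M, N = len(p), len(q)
C = (x[:, None] - y[None, :]) ** 2      # cost c(x_i, y_j) = |x_i - y_j|^2

# Equality constraints of (6): row sums equal p_i, column sums equal q_j.
A_eq, b_eq = [], []
for i in range(M):
    a = np.zeros((M, N)); a[i, :] = 1.0
    A_eq.append(a.ravel()); b_eq.append(p[i])
for j in range(N):
    a = np.zeros((M, N)); a[:, j] = 1.0
    A_eq.append(a.ravel()); b_eq.append(q[j])

res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=(0, None))         # gamma_ij >= 0
gamma = res.x.reshape(M, N)             # optimal transport plan
print(gamma, res.fun)                   # plan and the cost K(I0, I1)
```

Note that mass splitting appears automatically: a row of the recovered plan may distribute a single mass $p_i$ over several target locations.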

Basic properties

Consider a transportation cost $c(x, y)$ that is continuous and bounded from below. Given two signals $I_0$ and $I_1$ as previously described, there always exists a transportation plan minimizing (4). This holds true both when the signals $I_0$ and $I_1$ are functions and when they are discrete probability distributions [49]. Another important question regards the existence of an optimal transport map instead of a plan. Brenier [9] addressed this problem for the special case where $c(x, y) = |x - y|^2$. Brenier's results were later relaxed to more general cases by Gangbo and McCann [20], which led to the following theorem.

Theorem

Let $I_0$ and $I_1$ be nonnegative functions of the same total mass and with bounded support. When $c(x, y) = h(x - y)$ for some strictly convex function $h$, there exists a unique optimal transportation map $f^*$ minimizing (1). In addition, the optimal transport plan is unique and given by (5). Moreover, if $c(x, y) = |x - y|^2$, then there exists a (unique up to adding a constant) convex function $\phi$ such that $f^* = \nabla\phi$. A proof is available in [20] and [49].

Optimal mass transport: Geometric properties

Wasserstein metric

Let $\Omega$ be a bounded subset of $\mathbb{R}^d$ on which the signals are defined. As an example, for signals ($d = 1$) or images ($d = 2$), this can simply be the space $[0, 1]^d$. Let $P(\Omega)$ be the set of probability densities supported on $\Omega$. The $p$-Wasserstein metric, $W_p$, for $p \geq 1$ on $P(\Omega)$ is then defined using the optimal transportation problem (4) with the cost function $c(x, y) = |x - y|^p$. For $I_0$ and $I_1$ in $P(\Omega)$,

$$W_p(I_0, I_1) = \Big( \inf_{\gamma \in MP} \int_{\Omega \times \Omega} |x - y|^p\, d\gamma(x, y) \Big)^{1/p}.$$

For any $p \geq 1$, $W_p$ is a metric on $P(\Omega)$. The metric space $(P(\Omega), W_p)$ is referred to as the $p$-Wasserstein space. To understand the nature of the optimal transportation distances, it is useful to note that, for any $p \geq 1$, convergence with respect to $W_p$ is equivalent to the weak convergence of measures; i.e., $W_p(I_n, I) \to 0$ as $n \to \infty$ if and only if, for every bounded and continuous function $f: \Omega \to \mathbb{R}$,

$$\int_\Omega f(x)\, I_n(x)\, dx \to \int_\Omega f(x)\, I(x)\, dx.$$

For the specific case of $p = 1$, the $p$-Wasserstein metric is also known as the Monge–Rubinstein metric [49] or the Earth mover's distance [44]. The $p$-Wasserstein metric in one dimension has a simple characterization: for one-dimensional (1-D) signals $I_0$ and $I_1$, the optimal transport map has a closed-form solution. Let $F_i$ be the cumulative distribution function of $I_i$ for $i = 0, 1$, i.e.,

$$F_i(x) = \int_{\inf(\Omega)}^{x} I_i(u)\, du \quad \text{for } i = 0, 1.$$

Note that this is a nondecreasing function going from 0 to 1. We define the pseudoinverse of $F_0$ as follows: for $z \in (0, 1)$, $F_0^{-1}(z)$ is the smallest $x$ for which $F_0(x) \geq z$, i.e.,

$$F_0^{-1}(z) = \inf\{x \in \Omega : F_0(x) \geq z\}.$$

If $I_0 > 0$, then $F_0$ is continuous and increasing (and thus invertible), and the inverse of the function $F_0$ is equal to the pseudoinverse we just defined.

In other words, the pseudoinverse is a generalization of the notion of the inverse of a function. The pseudoinverse (i.e., the inverse if $I_0 > 0$ and $I_1 > 0$) provides a closed-form solution for the $p$-Wasserstein distance:

$$W_p(I_0, I_1) = \Big( \int_0^1 |F_0^{-1}(z) - F_1^{-1}(z)|^p\, dz \Big)^{1/p}. \quad (7)$$

The closed-form solution of the $p$-Wasserstein distance in one dimension is an attractive property, as it alleviates the need for optimization. This property was employed in the sliced-Wasserstein metrics, as defined below.
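A direct numerical transcription of (7) only requires evaluating the two pseudoinverse CDFs on a common grid of quantiles. The sketch below is our own minimal example (the helper name and discretization choices are assumptions, not the implementation in [43]):

```python
import numpy as np

def wasserstein_1d(I0, I1, x, p=2):
    """p-Wasserstein distance between 1-D densities via the closed form (7)."""
    dx = x[1] - x[0]
    F0 = np.cumsum(I0) * dx; F0 /= F0[-1]      # CDFs, normalized to reach 1
    F1 = np.cumsum(I1) * dx; F1 /= F1[-1]
    z = np.linspace(1e-6, 1 - 1e-6, 4000)      # quantile grid in (0, 1)
    F0inv = np.interp(z, F0, x)                # pseudoinverses F_i^{-1}(z)
    F1inv = np.interp(z, F1, x)
    return np.trapz(np.abs(F0inv - F1inv) ** p, z) ** (1.0 / p)

x = np.linspace(-10, 10, 4000)
gauss = lambda m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
# W2 between N(0,1) and N(3,1) is exactly 3 (a pure translation).
print(wasserstein_1d(gauss(0), gauss(3), x, p=2))
```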

Sliced-Wasserstein metric

The idea behind the sliced-Wasserstein metric is to first obtain a set of 1-D representations for a higher-dimensional probability distribution through projections (slicing the measure) and then calculate the distance between two input distributions as a functional on the Wasserstein distances of their 1-D representations. In this sense, the distance is obtained by solving several 1-D optimal transport problems, which have closed-form solutions.

The projection of high-dimensional PDFs is closely related to the well-known Radon transform in the imaging and image-processing community [8], [25]. The $d$-dimensional Radon transform $R$ maps a function $I \in L^1(\mathbb{R}^d)$, where $L^1(\mathbb{R}^d) := \{I: \mathbb{R}^d \to \mathbb{R} \mid \int_{\mathbb{R}^d} |I(x)|\, dx < \infty\}$, into the set of its integrals over the hyperplanes of $\mathbb{R}^d$. It is defined as

$$R I(t, \theta) := \int_{\mathbb{R}} I(t\theta + s\theta^{\perp})\, ds, \quad \forall t \in \mathbb{R},\ \forall \theta \in S^{d-1};$$

here, $\theta^{\perp}$ is the subspace orthogonal to $\theta$, and $S^{d-1}$ is the unit sphere in $\mathbb{R}^d$. Note that $R: L^1(\mathbb{R}^d) \to L^1(\mathbb{R} \times S^{d-1})$. In other words, the Radon transform projects a PDF $I \in P(\mathbb{R}^d)$, where $d > 1$, into an infinite set of 1-D PDFs $R I(\cdot, \theta)$. The sliced-Wasserstein metric for PDFs $I_0$ and $I_1$ on $\mathbb{R}^d$ is then defined as

$$SW_p(I_0, I_1) = \Big( \int_{S^{d-1}} W_p^p\big(R I_0(\cdot, \theta), R I_1(\cdot, \theta)\big)\, d\theta \Big)^{1/p},$$

where $p \geq 1$, and $W_p$ is the $p$-Wasserstein metric, which, for the 1-D PDFs $R I_0(\cdot, \theta)$ and $R I_1(\cdot, \theta)$, has a closed-form solution [see (7)]. For more details and definitions of the sliced-Wasserstein metric, we refer the reader to [8], [25], and [29].
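In practice, the integral over $S^{d-1}$ is approximated with a finite set of random directions, and for empirical (equal-mass sample) measures each 1-D slice reduces to sorting. The following Monte Carlo sketch is our own illustration of the principle, not the estimator of [8], [25], or [29]:

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=500, p=2, seed=0):
    """Monte Carlo estimate of SW_p between two point clouds X, Y in R^d
    with the same number of equal-mass samples. Each random slice is a
    1-D transport problem solved in closed form by sorting."""
    rng = np.random.default_rng(seed)
    d, sw = X.shape[1], 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)          # random direction on S^{d-1}
        xp, yp = np.sort(X @ theta), np.sort(Y @ theta)
        sw += np.mean(np.abs(xp - yp) ** p)     # W_p^p of the two projections
    return (sw / n_proj) ** (1.0 / p)

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))                  # samples from N(0, I)
Y = rng.normal(size=(1000, 2)) + [3.0, 0.0]     # translated copy
# For a translation by t, SW_2^2 averages |t . theta|^2 over directions,
# giving |t|^2 / d in R^d: here about 9/2, so SW_2 is roughly 2.12.
print(sliced_wasserstein(X, Y))
```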

Wasserstein spaces, geodesics, and Riemannian structure

In this section, we assume that $\Omega$ is convex. Here, we highlight that the $p$-Wasserstein space $(P(\Omega), W_p)$ is not just a metric space but has additional geometric structure. In particular, for any $p \geq 1$ and any $I_0, I_1 \in P(\Omega)$, there exists a continuous path (interpolation) between $I_0$ and $I_1$ whose length is the distance between $I_0$ and $I_1$. Furthermore, the space with $p = 2$ is special because it possesses the structure of a formal, infinite-dimensional, Riemannian manifold. That structure was first noted by Otto [38], who developed the formal calculations for using it. The precise description of the manifold of probability measures endowed with the Wasserstein metric can be found in [1].

Next, we review the two main notions that have wide use. We characterize the geodesics in $(P(\Omega), W_p)$, and in the case of $p = 2$, we describe the local, Riemannian metric of $(P(\Omega), W_2)$. Finally, we state the seminal result of Benamou and Brenier [5], who provided a characterization of geodesics via action minimization, which is useful in computations and also gives an intuitive explanation of the Wasserstein metric.

We first recall the definition of the length of a curve in a metric space. Let $(X, d)$ be a metric space and $I: [a, b] \to X$. Then the length of $I$, denoted by $L(I)$, is

$$L(I) = \sup_{m \in \mathbb{N},\ a = t_0 < t_1 < \cdots < t_m = b}\ \sum_{i=1}^{m} d(I(t_{i-1}), I(t_i)).$$

A metric space $(X, d)$ is a geodesic space if, for any $I_0$ and $I_1$, there exists a curve $I: [0, 1] \to X$ such that $I(0) = I_0$, $I(1) = I_1$, and, for all $0 \leq s < t \leq 1$, $d(I(s), I(t)) = L(I|_{[s,t]})$. In particular, the length of $I$ is equal to the distance from $I_0$ to $I_1$. Such a curve $I$ is called a geodesic. The existence of geodesics is useful because it allows one to define the average of $I_0$ and $I_1$ as the midpoint of the geodesic connecting them.

An important property of $(P(\Omega), W_p)$ is that it is a geodesic space and that geodesics are easy to characterize. Specifically, they are given by the displacement interpolation (also known as McCann interpolation). When a unique transportation map $f^*$ from $I_0$ to $I_1$ exists that minimizes (1) for $c(x, y) = |x - y|^p$, the geodesic is obtained by moving the mass at constant speed from $x$ to $f^*(x)$. More precisely, for $t \in [0, 1]$ and $x \in \Omega$, let

$$f_t^*(x) = (1 - t)\,x + t f^*(x)$$

be the position at time $t$ of the mass initially at $x$. Note that $f_0^*$ is the identity mapping and $f_1^* = f^*$. Pushing forward the mass by $f_t^*$, which by (3) has the form

$$I_t(f_t^*(x)) = \frac{I_0(x)}{\det(Df_t^*(x))}$$

if $f^*$ is smooth, provides the desired geodesic from $I_0$ to $I_1$. The velocity of each particle, $\partial_t f_t^* = f^*(x) - x$, is the displacement of the optimal transportation map. Figure 2 conceptualizes the geodesic between two PDFs in $P(\Omega)$ and visualizes it for three different pairs of PDFs.

FIGURE 2. Geodesics in (a) the 2-Wasserstein space, where $I(x, t) = \det(Dg_t^*(x))\, I_0(g_t^*(x))$ with $g_t^*(f_t^*(x)) = x$, and (b) the Euclidean space, where $I(x, t) = (1 - t) I_0(x) + t I_1(x)$, between various 1-D and two-dimensional (2-D) PDFs, shown at $t = 0, 0.25, 0.5, 0.75, 1$. Note that the geodesic in the 2-Wasserstein space captures the nonlinear structure of the signals and images and provides a natural morphing. (Face portraits courtesy of the public CMU Pose, Illumination, and Expression database.)
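For 1-D signals, the displacement interpolation can be computed without any optimization: the quantile function of $I_t$ is simply $(1 - t) F_0^{-1} + t F_1^{-1}$, which is exactly the pushforward along $f_t^*$. The sketch below is our own example (the density-recovery step, differentiating the interpolated CDF, is one of several possible discretizations), morphing one Gaussian into another:

```python
import numpy as np

def geodesic_1d(I0, I1, x, t):
    """Density I_t on the W2 geodesic between 1-D densities I0 and I1:
    linear interpolation of the quantile functions (displacement
    interpolation), followed by differentiation of the resulting CDF."""
    dx = x[1] - x[0]
    F0 = np.cumsum(I0) * dx; F0 /= F0[-1]
    F1 = np.cumsum(I1) * dx; F1 /= F1[-1]
    z = np.linspace(1e-4, 1 - 1e-4, len(x))
    Qt = (1 - t) * np.interp(z, F0, x) + t * np.interp(z, F1, x)
    Ft = np.interp(x, Qt, z, left=0.0, right=1.0)   # CDF of I_t
    return np.gradient(Ft, x)                       # density I_t

x = np.linspace(-8, 8, 2000)
gauss = lambda m, s: np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
I_mid = geodesic_1d(gauss(-3, 0.5), gauss(3, 1.0), x, t=0.5)
# The W2 midpoint is a single translated/rescaled bump (mode near 0),
# not the bimodal average (I0 + I1) / 2 produced by Euclidean interpolation.
print(round(x[np.argmax(I_mid)], 2))
```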

An important fact regarding the 2-Wasserstein space is Otto's presentation of a formal Riemannian metric for this space [38]. It involves shifting to a Lagrangian point of view. To explain, consider a path $I(x, t)$ in $P(\Omega)$ with $I(x, t)$ smooth. Then $s(x, t) = \partial I(x, t)/\partial t$ can be considered a tangent vector to the manifold, or a density perturbation. Instead of thinking of increasing/decreasing the density, this perturbation can be viewed as resulting from moving the mass by a vector field. In other words, consider vector fields $v(x, t)$ such that

$$s = -\nabla \cdot (I v). \quad (8)$$

There are many such vector fields. Otto defined the size of $s(\cdot, t)$ as the square root of the minimal kinetic energy of the vector field that produces the perturbation to density $s$, i.e.,

$$\langle s, s \rangle = \min_{v\ \text{satisfies}\ (8)} \int |v|^2\, I\, dx. \quad (9)$$

Utilizing the Riemannian manifold structure of $P(\Omega)$, together with the inner product presented in (9), the 2-Wasserstein metric can be reformulated into finding the minimizer of the following action among all curves in $P(\Omega)$ connecting $I_0$ and $I_1$ [5]:

$$W_2^2(I_0, I_1) = \inf_{I, v} \int_0^1 \int_\Omega I(x, t)\, |v(x, t)|^2\, dx\, dt$$
$$\text{s.t.}\quad \partial_t I + \nabla \cdot (I v) = 0,$$
$$I(\cdot, 0) = I_0, \quad I(\cdot, 1) = I_1,$$

where the first constraint is the well-known continuity equation.

Optimal transport: Embeddings and transforms

The optimal transport problem and, specifically, the 2-Wasserstein metric and the sliced 2-Wasserstein metric have recently been used to define nonlinear transforms for signals and images [25], [27], [40], [50]. In contrast to commonly used linear signal transformation frameworks (e.g., Fourier and wavelet transforms), which employ signal intensities only at fixed coordinate points and thus adopt an Eulerian point of view, the idea behind transport-based transforms is to consider the intensity variations together with the locations of the intensity variations in the signal. Therefore, such transforms adopt a Lagrangian point of view for analyzing signals; i.e., they are able to move signal (pixel) intensities around. Moreover, the transforms can be viewed as Euclidean embeddings for the data under the previously described transport-related metric space structure. The benefit of such a Euclidean embedding is that it facilitates the application of many standard data-analysis algorithms (e.g., learning). Here, we briefly describe these transforms and some of their prominent properties.

The linear optimal transportation framework

The linear optimal transportation (LOT) framework was proposed by Wang et al. [50]. The framework was used in [4] and [39] for pattern recognition in biomedical images, specifically histopathology and cytology images. Later, it was extended in [27] as a generic framework for pattern recognition, and it was used in [26] for the single-frame superresolution reconstruction of face images. The LOT framework, which provides an invertible Lagrangian transform for images, was initially proposed as a method to simultaneously amend the computationally expensive requirement of calculating pairwise 2-Wasserstein distances between $N$ signals for pattern recognition purposes and to allow for the construction of generative models for images involving textures and shapes. For a given set of images $I_i \in P_2(\Omega)$, $i = 1, \ldots, N$, and a fixed template $I_0$, all nonnegative and normalized to have the same sum, the transform projects the images to the tangent space at $I_0$. The projections are acquired by finding the optimal velocity fields corresponding to the optimal transport plans between $I_0$ and each image in the set.

The framework provides a linear embedding for $P_2(\Omega)$ with respect to a fixed signal $I_0 \in P_2(\Omega)$. This means that the Euclidean distance between an embedded signal, denoted as $\tilde{I}_i$, and the fixed reference, $\tilde{I}_0$, is equal to $W_2(I_0, I_i)$, while the Euclidean distance between two embedded normalized signals is, generally speaking, an approximation of their 2-Wasserstein distance. The geometric interpretation of the LOT framework is presented in Figure 3. The linear embedding then facilitates the application of linear techniques such as principal component analysis (PCA) and linear discriminant analysis (LDA) to probability measures.

The cumulative distribution transform

Park et al. [40] considered the LOT framework for 1-D PDFs (positive signals normalized to integrate to 1), and since in dimension one the transport maps are explicit, they were able to characterize the properties of the transformed densities.

Similar to the LOT framework, let $I_i$ for $i = 1, \ldots, N$ and $I_0$ be signals (PDFs) defined on $\mathbb{R}$. The framework first calculates the optimal transport map between $I_i$ and $I_0$ via $f_i(x) = F_i^{-1} \circ F_0(x)$ for all $i = 1, \ldots, N$. Then the forward and inverse transport-based transform, denoted as the cumulative distribution transform (CDT) by Park et al. [40], for these density functions with respect to the fixed template $I_0$ is defined as

$$\tilde{I}_i = (f_i - \mathrm{Id})\sqrt{I_0} \quad \text{(analysis)},$$
$$I_i = (f_i^{-1})'\,(I_0 \circ f_i^{-1}) \quad \text{(synthesis)},$$

where $(I_0 \circ f_i^{-1})(x) = I_0(f_i^{-1}(x))$. Note that the $L_2$-norm (Euclidean distance) of the transformed signal, $\tilde{I}_i$, corresponds to the 2-Wasserstein distance between $I_0$ and $I_i$. In contrast to the higher-dimensional LOT, however, the Euclidean distance between two transformed (embedded) signals $\tilde{I}_i$ and $\tilde{I}_j$ is the exact 2-Wasserstein distance between $I_i$ and $I_j$ (see [40] for a proof) and not just an approximation. Hence, the transformation is isometric with respect to the 2-Wasserstein metric. This isometric nature of the CDT was utilized in [28] to provide positive definite kernels for machine learning of $n$-dimensional signals.

FIGURE 3. A graphical representation of the LOT framework. The framework embeds the PDFs (i.e., signals or images) $I_i$ in the tangent space (i.e., the set of all tangent vectors) of $P(\Omega)$ with respect to a fixed PDF $I_0$, via $\tilde{I}_i = (f_i^* - \mathrm{Id})\sqrt{I_0}$. As a consequence, $\|\tilde{I}_0 - \tilde{I}_i\| = \|\tilde{I}_i\| = W_2(I_0, I_i)$, and the Euclidean distance between the embedded functions $\tilde{I}_1$ and $\tilde{I}_2$ provides an approximation for the 2-Wasserstein distance, $W_2(I_1, I_2)$.

From a signal processing point of view, the CDT is a nonlinear signal transformation that captures certain nonlinear variations in signals, including translation and scaling. Specifically, it gives rise to the transformation pairs presented in Table 1.

Table 1. The CDT pairs. Note that the composition holds for all strictly monotonically increasing functions $g$.

■ Translation: signal domain $I(x - \tau)$; CDT domain $\tilde{I}(x) + \tau\sqrt{I_0(x)}$.
■ Scaling: signal domain $a I(ax)$; CDT domain $\tilde{I}(x)/a - x\sqrt{I_0(x)}\,(a - 1)/a$.
■ Composition: signal domain $g'(x)\, I(g(x))$; CDT domain $\big(g^{-1}\big(\tilde{I}(x)/\sqrt{I_0(x)} + x\big) - x\big)\sqrt{I_0(x)}$.

From Table 1, one can observe that, although $I(t - \tau)$ is nonlinear in $\tau$ (when $I(\cdot)$ is not a linear function), its CDT representation $\tilde{I}(t) + \tau\sqrt{I_0(t)}$ becomes affine in $\tau$ (a similar effect is observed for scaling). In effect, the Lagrangian transformations (compositions) in the original signal space are rendered into Eulerian perturbations in transform space, borrowing from the partial differential equation (PDE) parlance.
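The translation row of Table 1 is easy to verify numerically. The sketch below is our own example (cdt is a hypothetical helper, not the implementation in [43]); it computes the CDT of a Gaussian and of its translate and checks that they differ by exactly $\tau\sqrt{I_0}$:

```python
import numpy as np

def cdt(I, I0, x):
    """CDT of density I with respect to template I0 on grid x:
    I_tilde = (f - Id) * sqrt(I0), with f = F^{-1} o F0."""
    dx = x[1] - x[0]
    F  = np.cumsum(I)  * dx; F  /= F[-1]
    F0 = np.cumsum(I0) * dx; F0 /= F0[-1]
    f = np.interp(F0, F, x)        # transport map from I0 to I
    return (f - x) * np.sqrt(I0)

x = np.linspace(-12, 12, 4000)
gauss = lambda m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
I0, I, tau = gauss(0), gauss(1), 2.0
I_shifted = gauss(1 + tau)         # the translated signal I(x - tau)
lhs = cdt(I_shifted, I0, x)
rhs = cdt(I, I0, x) + tau * np.sqrt(I0)
# The two transforms agree (away from the numerically saturated tails),
# confirming that translation acts affinely in CDT space.
print(np.max(np.abs(lhs - rhs)[np.abs(x) < 6]))
```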
sentations in the LDA discriminant subspace calculated from the images [Figure 4(c)] and the Radon CDT of the data set If there exists no hC! for which ph00= l ()qh% , then the sets P and Q are disjoint but not necessarily linearly separable in the signal space. A main result of [40] states that the sig- Table 1. The CDT pairs. Note that the composition holds for all strictly nal classes P and Q are guaranteed to be linearly separable in monotonically increasing functions g. the transform space (regardless of the choice of the reference Signal Domain CDT Domain ­signal I0) if C satisfies the following conditions: Property I(x) Ixu() & -1 1) hC!!hC u Translation Ix()-x Ix()+x Ix0 () 2) hh12,(!!Ch& tt12+-10),hC 6t ! [ ,]1 u 3) hh12,(!!Ch& 12hh), 21()hC Ix() ()a -1 Scaling aI()ax -x Ix0 () 4) hpl ()00% hq! , 6hC! . a a u The set of translations Cf=={|fx() x +xx,}! R and -1 Ix() Composition gxl ()Ig((x)) ((g +-xx))Ix0 () scaling Cf=={|fx() ax,}a ! R+ , for instance, satisfy the Ix0 ()

FIGURE 4. Examples of the linear separability characteristic of the CDT and the Radon CDT. The discriminant subspace for each case is calculated using penalized LDA. It can be seen that the nonlinear structure of the data is well captured in the transform spaces. (a) and (b) The linear separation property of the CDT: the classes $P$ and $Q$ and their transforms $\tilde{P}$ and $\tilde{Q}$, projected onto a three-dimensional (3-D) discriminant subspace in the signal space and in the transform space, respectively. (c) A facial expression data set with two classes and its representation in a 2-D LDA discriminant subspace, and (d) the Radon CDT of the data set and the corresponding representation of the transformed data in the 2-D LDA discriminant subspace. (Face portraits courtesy of the public CMU Pose, Illumination, and Expression database.)

FIGURE 5. The cumulative percentage variation of the face data set in Figure 4 in the image space, the Radon transform space, the Ridgelet transform space, and the Radon-CDT transform space, where $\mathrm{CPV} = \sum_{i=1}^{k} \lambda_i / \sum_{n=1}^{N} \lambda_n$ and $\lambda_i$ is the $i$th eigenvalue.

Numerical methods

The development of robust and efficient numerical methods for computing transport-related maps, plans, metrics, and geodesics is crucial for the development of algorithms that can be used in practical applications. We next present several notable approaches for finding transportation maps and plans. Table 2 provides a high-level overview of these methods.

Table 2. The key properties of various numerical approaches.

■ Linear programming: Applicable to general costs. A good approach if the PDFs are supported at very few sites.
■ Multiscale linear programming: Applicable to general costs. A fast and robust method, though the truncation involved can lead to imprecise distances.
■ Auction algorithm: Applicable only when the number of particles in the source and the target is equal and all of their masses are the same.
■ Entropy-regularized linear programming: Applicable to general costs. Simple, and performs very well in practice for moderately large problems. Difficult to obtain high accuracy.
■ Fluid mechanics: Can be adapted to generalizations of the quadratic cost, based on action along paths.
■ AHT minimization: Quadratic cost. Requires some smoothness and positivity of densities. Convergence is guaranteed only for infinitesimal step size.
■ Gradient descent on the dual problem: Quadratic cost. Convergence depends on the smoothness of the densities; hence, a multiscale approach is needed for nonsmooth densities (i.e., normalized images).
■ Monge–Ampère solver: Quadratic cost. The solver in [7] is proved to be convergent. Accuracy is an issue due to the wide stencil used.
■ Semidiscrete approximation: An efficient way to find the map between a continuous and a discrete signal [31].

AHT: Angenent, Haker, and Tannenbaum.

A linear programming problem

The linear programming problem is an optimization problem with a linear objective function and linear equality and inequality constraints. Several numerical methods exist for solving linear programming problems, among which are the simplex method and its variations and the interior-point methods. The computational complexity of these numerical methods, however, scales at best cubically in the size of the domain. Hence, assuming the measures considered have $N$ particles each, the number of unknowns $\gamma_{ij}$ is $N^2$, and the computational complexity of the solvers is at best $O(N^3 \log N)$ [14], [44]. The computational complexity of the linear programming methods is a very important limiting factor for the applications of the Kantorovich problem. We note that, in the special case where $I_0$ and $I_1$ both have $N$ equidistributed particles, the optimal transport problem simplifies to a one-to-one assignment problem that can be solved in $O(N^2 \log N)$. In addition, several multiscale approaches and sparse approximation approaches have recently been introduced to improve the computational performance of linear programming solvers [37], [46].

Entropy-regularized solution

Cuturi's work [14] provides a fast and easy-to-implement variation of the Kantorovich problem, obtained by considering the transportation problem from a maximum-entropy perspective. The idea is to regularize the Wasserstein metric by the entropy of the transport plan. This modification simplifies the problem and enables much faster numerical schemes, with complexity $O(N^2)$ [14] or $O(N \log N)$ using the convolutional Wasserstein distance presented in [47] (compared to $O(N^3)$ for the linear programming methods), where $N$ is the number of delta masses in each of the measures. The disadvantage is that it is difficult to obtain high-accuracy approximations of the optimal transport plan. The entropy-regularized $p$-Wasserstein distance, also known as the Sinkhorn distance, between PDFs $I_0$ and $I_1$ defined on the metric space $(\Omega, d)$ is defined as

$$W_{p,\lambda}^p(I_0, I_1) = \inf_{\gamma \in MP} \int_{\Omega \times \Omega} d^p(x, y)\, \gamma(x, y)\, dx\, dy + \lambda \int_{\Omega \times \Omega} \gamma(x, y) \ln(\gamma(x, y))\, dx\, dy, \quad (10)$$

where the regularizer is the negative entropy of the plan. We note that this is not a true metric, since $W_{p,\lambda}^p(I_0, I_0) > 0$. Since the entropy term is strictly concave, the overall optimization in (10) becomes strictly convex. It is shown in [14] that the entropy-regularized $p$-Wasserstein distance in (10) can be reformulated as

$$W_{p,\lambda}^p(I_0, I_1) = \lambda \inf_{\gamma \in MP} \mathrm{KL}(\gamma\,|\,K_\lambda),$$

where $K_\lambda(x, y) = \exp(-d^p(x, y)/\lambda)$ and $\mathrm{KL}(\gamma\,|\,K_\lambda)$ is the Kullback–Leibler (KL) divergence between $\gamma$ and $K_\lambda$. In short, the regularizer enforces the plan to be within a $1/\lambda$ radius, in the KL-divergence sense, of the transport plan $\gamma_\infty(x, y) = I_0(x)\, I_1(y)$.

Cuturi shows that the optimal transport plan $\gamma$ in (10) is of the form $D_v K_\lambda D_w$, where $D_v$ and $D_w$ are diagonal matrices with diagonal entries $v, w \in \mathbb{R}^N$ [14]; therefore, the number of unknowns in the regularized formulation is reduced from $N^2$ to $2N$. The new problem can then be solved through computationally efficient algorithms such as the iterative proportional fitting procedure (IPFP) or, alternatively, the Sinkhorn–Knopp algorithm.
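The diagonal scalings $v$ and $w$ are found by alternately matching the two marginals, which is the essence of the Sinkhorn–Knopp iteration. The following sketch is a minimal example of ours (not Cuturi's reference implementation); it recovers the regularized plan for two discrete 1-D densities:

```python
import numpy as np

def sinkhorn(p, q, C, lam, n_iter=2000):
    """Entropy-regularized OT (10): the optimal plan is D_v K D_w with
    K = exp(-C / lam); v and w are found by alternate marginal matching."""
    K = np.exp(-C / lam)
    v = np.ones_like(p)
    for _ in range(n_iter):
        w = q / (K.T @ v)                  # enforce column marginals q
        v = p / (K @ w)                    # enforce row marginals p
    gamma = v[:, None] * K * w[None, :]    # transport plan D_v K D_w
    return gamma, np.sum(gamma * C)

x = np.linspace(0, 1, 200)
p = np.exp(-0.5 * ((x - 0.3) / 0.05) ** 2); p /= p.sum()
q = np.exp(-0.5 * ((x - 0.7) / 0.10) ** 2); q /= q.sum()
C = (x[:, None] - x[None, :]) ** 2
gamma, cost = sinkhorn(p, q, C, lam=5e-3)
# For these (nearly) Gaussian marginals, the cost tends to
# W2^2 = 0.4^2 + 0.05^2, about 0.163, as lam -> 0; larger lam blurs
# the plan and biases the cost upward.
print(cost)
```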

Flow minimization (AHT)

Angenent, Haker, and Tannenbaum (AHT) [2] proposed a flow minimization scheme to obtain the optimal transport map from the Monge problem. The method has been used in several image-registration applications [22], in pattern recognition [27], [50], and in computer vision [26]. A brief review of the method is provided here.

Let $I_0: X \to \mathbb{R}^+$ and $I_1: Y \to \mathbb{R}^+$ be continuous probability densities defined on convex domains $X, Y \subset \mathbb{R}^d$. To find the optimal transport map, $f^*$, AHT start with an initial transport map, $f_0: X \to Y$, calculated from the Knothe–Rosenblatt coupling [49]. They then update $f_0$ to minimize the transport cost while constraining it to remain a transport map from $I_0$ to $I_1$. The update equation for finding the optimal transport map in AHT is calculated to be

$$f_{k+1}(x) = f_k(x) + \epsilon\,\frac{1}{I_0}\, Df_k\,\big(f_k - \nabla \Delta^{-1} \mathrm{div}(f_k)\big),$$

where $\epsilon$ is the step size, $Df_k$ is the Jacobian matrix, and $\Delta^{-1}$ denotes the Poisson solver with Neumann boundary conditions. AHT show that, for an infinitesimal step size $\epsilon$, $f_k(x)$ converges to the optimal transport map. For a detailed derivation of the preceding equation, see [2] and [24].

The AHT method is, in essence, a gradient descent method on the Monge formulation of the optimal transport problem. Chartrand, Wohlberg, Vixie, and Bollt (CWVB) [11] proposed an alternative gradient-descent method, based on Kantorovich's dual formulation of the transport problem, that updates the transport potential field, $\eta(x)$, where $f(x) = \nabla\eta(x)$. Figure 6 presents the iterations of the CWVB method for two face images taken from the YaleB face database.

FIGURE 6. A visualization of the iterative update of the transport potential and, correspondingly, the transport displacement map through CWVB iterations ($k = 0, 20, 40, 60, 80$). (Face portraits courtesy of the public Extended Yale Face Database B.)

Monge–Ampère equation

The Monge–Ampère PDE is defined as

$$\det(H_\phi) = h(x, \phi, \nabla\phi)$$

for some functional $h$, where $H_\phi$ is the Hessian matrix of $\phi$. The Monge–Ampère PDE is closely related to the Monge problem for the quadratic cost function. According to Brenier's theorem (discussed in the "Basic properties" section), when $I_0$ and $I_1$ are absolutely continuous PDFs defined on sets $X, Y \subset \mathbb{R}^n$, the optimal transport map that minimizes the 2-Wasserstein metric is uniquely characterized as the gradient of a convex function $\phi$, with $\nabla\phi: X \to Y$. Moreover, we showed that the mass-preserving constraint of the Monge problem can be written as $\det(Df)\, I_1(f) = I_0$. Combining these results, one has

$$\det(D\nabla\phi(x)) = \frac{I_0(x)}{I_1(\nabla\phi(x))}, \quad (11)$$

where $D\nabla\phi = H_\phi$; therefore, the equation shown above is in the form of the Monge–Ampère PDE. Now, if $\phi$ is a convex function on $X$ satisfying $\nabla\phi(X) = Y$ and solving (11), then $f^* = \nabla\phi$ is the optimal transportation map from $I_0$ to $I_1$.

The geometrical constraint on this problem is rather unusual in PDEs and is often referred to as the optimal transport boundary conditions. Several authors have proposed numerical methods to obtain the optimal transport map through solving the Monge–Ampère PDE in (11) [7], [33]. In particular, the scheme in [7] is monotone, has complexity $O(N)$ (up to logarithms), and is provably convergent. We conclude by remarking that several regularity results on the optimal transport maps were established through the Monge–Ampère equation (see [24] for references).

Semidiscrete approximation

Several works [31], [34] have considered the problem in which one PDF, $I_0$, has a continuous form while the other, $I_1$, is discrete, $I_1(y) = \sum_i q_i\, \delta(y - y_i)$. It turns out that there exist weights $w_i$ such that the optimal transport map $f: X \to Y$ can be described via a power diagram. More precisely, the set of $x$ mapping to $y_i$ is the following cell of the power diagram:

$$PD_{w_i}(y_i) = \{x : |x - y_i|^2 - w_i \leq |x - y_j|^2 - w_j,\ \forall j\}.$$

The main observation is that the weights $w_i$ are minimizers of the following unconstrained convex functional:

$$\sum_i \Big( q_i w_i - \int_{PD_{w_i}(y_i)} \big(|x - y_i|^2 - w_i\big)\, I_0(x)\, dx \Big).$$

Works by Mérigot [34] and Lévy [31] use Newton-based schemes and multiscale approaches to minimize the functional. The need to integrate over the power diagram makes the implementation somewhat geometrically delicate. Nevertheless, a recent implementation by Lévy [31] gives impressive results in terms of speed. This approach provides the transportation mapping (not just an approximation of a plan).

Applications

Image retrieval

One of the earliest applications of the optimal transport problem was in image retrieval. Rubner et al. [44] employed the discrete Wasserstein metric, which they denoted the Earth mover's distance, to measure the dissimilarity between image signatures. In image-retrieval applications, it is common practice to first extract features (i.e., color features, texture features, shape features, and so on) and then generate high-dimensional histograms or signatures (histograms with dynamic/adaptive binning) to represent images. The retrieval task then simplifies to finding images with similar representations (e.g., a small distance between their histograms/signatures). The Wasserstein metric is specifically suitable for such applications because it can compare histograms/signatures of different sizes (histograms with different binning). This unique capability turns the Wasserstein metric into an attractive candidate for image-retrieval applications [32], [44]. In [44], the Wasserstein metric was compared with common metrics, such as Jeffrey's divergence, the $\chi^2$ statistic, the $L_1$ distance, and the $L_2$ distance, in an image-retrieval task, and it was shown that the Wasserstein metric achieves the highest precision/recall performance among all.

Speed of computation is an important practical consideration in image-retrieval applications. For almost a decade, the high computational cost of the optimal transport problem overshadowed its practicality in large-scale image-retrieval applications. Recent advancements in numerical methods, including the work of Mérigot [34] and Cuturi [14], among many others, have reinvigorated optimal transport-based distances as a feasible and appealing candidate for large-scale image-retrieval problems.

Registration and morphing

Image registration deals with finding a common geometric reference frame between two or more images. It plays an important role in analyzing images obtained at different times or using different imaging modalities. Image registration and, more specifically, biomedical image registration are active areas of research. Registration methods find a transformation $f$ that maximizes the similarity between two or more image representations (e.g., image intensities and image features). Among the plethora of registration methods, nonrigid registration methods are especially important given their numerous applications in biomedical problems. They can be used to quantify the morphology of different organs, correct for physiological motion, and allow for comparison of image intensities in a fixed coordinate space (atlas). Generally speaking, nonrigid registration is a nonconvex and nonsymmetric problem, with no guarantee of the existence of a globally optimal transformation.

Various works in the literature deploy the Monge problem for image warping and elastic registration. Utilizing the Monge problem in an image-warping/registration setting has a number of advantages. First, the existence and uniqueness of the global transformation (the optimal transport map) is known. Second, the problem is symmetric, meaning that the optimal transport map for warping $I_0$ to $I_1$ is the inverse of the optimal transport map for warping $I_1$ to $I_0$. Last, it provides a landmark-free and parameter-free registration scheme with a built-in mass-preservation constraint. These advantages motivated several follow-up works to investigate the application of the Monge problem in image registration and warping [21], [22].

In addition to images, the optimal mass transport problem has also been used in point cloud and mesh registration [29] (see [24] for more references), which has various applications in shape analysis and graphics. In these applications, shape images (2-D or 3-D binary images) are first represented using either sets of weighted points (e.g., point clouds), obtained through clustering techniques such as K-means or fuzzy C-means, or meshes. Then a regularized variation of the optimal transport problem is solved to match such representations. The regularization of the transportation problem is often imposed to enforce that neighboring points (or vertices) remain near each other after the transformation.

Color transfer and texture synthesis

Texture mixing and color transfer are appealing applications of the optimal transport framework in image analysis, graphics, and computer vision. Here, we briefly discuss these applications.

Color transfer

The purpose of color transfer is to change the color palette of an image to impose the feel and look of another image. Color transfer is generally performed through finding a map that morphs the color distribution of the first image into the second one. For grayscale images, the color-transfer problem simplifies to a histogram-matching problem, which is solved through the 1-D optimal transport formulation [16]. In fact, the classic problem of histogram equalization is a 1-D transport problem [16]. The color-transfer problem, on the other hand, is concerned with pushing the three-dimensional color distribution of the first image into the second one. This problem can also be formulated as an optimal transport problem, as demonstrated in [41] (see [24] for more references).

A complication that occurs in color transfer on real images, however, is that a perfect match between the color distributions of the images is often not satisfying, because a color-transfer map may not transfer the colors of neighboring pixels in a coherent manner and may introduce artifacts in the color-transferred image. Therefore, the color-transfer map is often regularized to make the transfer map spatially coherent [41]. Figure 7 shows a simple example of gray-value and color transfer via the optimal transport framework. It can be seen that the cumulative distributions of the gray-value- and color-transferred images are similar to that of the input image.

FIGURE 7. (a) Gray-value and (b) color transfer via optimal transportation, showing the input, target, and transferred images together with the cumulative distributions of their gray values and RGB values. RGB: red, green, blue.
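For grayscale images, the transfer map is exactly the 1-D transport map between the two pixel-intensity histograms, i.e., classic histogram matching. The sketch below is our own minimal version of this idea (random arrays stand in for real images, and gray_transfer is a hypothetical helper):

```python
import numpy as np

def gray_transfer(source, target):
    """Gray-value transfer as 1-D optimal transport: remap each source
    pixel v to F_tgt^{-1}(F_src(v)), the monotone map pushing the source
    intensity histogram onto the target's."""
    s_vals, s_inv, s_counts = np.unique(source.ravel(), return_inverse=True,
                                        return_counts=True)
    F_src = np.cumsum(s_counts) / source.size       # source CDF at s_vals
    t_sorted = np.sort(target.ravel())
    F_tgt = np.arange(1, target.size + 1) / target.size
    matched = np.interp(F_src, F_tgt, t_sorted)     # F_tgt^{-1}(F_src)
    return matched[s_inv].reshape(source.shape)

rng = np.random.default_rng(0)
src = rng.integers(0, 120, size=(64, 64))    # dark image
tgt = rng.integers(100, 256, size=(64, 64))  # bright image
out = gray_transfer(src, tgt)
# The transferred image's intensity histogram now matches the target's.
print(src.mean(), tgt.mean(), round(out.mean(), 1))
```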

Texture synthesis and mixing

Texture synthesis is the problem of synthesizing a texture image that is visually similar to an exemplar input-texture image, and it has various applications in computer graphics and image processing. Many methods have been proposed for texture synthesis, such as synthesis by recopy and synthesis by statistical modeling. Texture mixing, however, considers the problem of synthesizing a texture image from a collection of input-texture images in a way that the synthesized texture provides a meaningful integration of the colors and textures of the input-texture images. Metamorphosis is one of the successful approaches in texture mixing; it performs the mixing by identifying correspondences between elementary features (i.e., textons) among the input textures and progressively morphing between the shapes of elements. In other approaches, texture images are first parameterized through a tight frame (often steerable wavelets), and statistical modeling is performed on the parameters.

Other successful approaches include random phase and spot noise texture modeling [18], which model textures as stationary Gaussian random fields. These models are based on the assumption that visual texture perception is driven by the spectral magnitude of the texture image. Therefore, utilizing the spectral magnitude of an input image and randomizing its phase will lead to a new synthetic texture image that is visually similar to the input image. Ferradans et al. [18] utilized this assumption together with Wasserstein geodesics to interpolate between the spectral magnitudes of texture images and provide synthetic mixed-texture images. Figure 8 shows an example of texture mixing via the Wasserstein geodesic between the spectral magnitudes of the input-texture images; the in-between images are synthetically generated using the random-phase technique.

FIGURE 8. An example of texture mixing via optimal transport using the method presented in Ferradans et al. [18].

Image denoising and restoration

The optimal transport problem has also been used in several image-denoising and -restoration problems [30]. The goal in these applications is to restore or reconstruct an image from noisy or incomplete observations. Lellmann et al. [30] utilized the Kantorovich–Rubinstein discrepancy term together with a total variation (TV) term in the context of image denoising. They called their method Kantorovich–Rubinstein-TV (KR-TV) denoising. Note that the KR metric is closely related to the 1-Wasserstein metric (for 1-D signals they are equivalent). The KR term in their proposed functional provides a fidelity term for denoising, while the TV term enforces a piecewise-constant reconstruction.

Transport-based morphometry

Given their suitability for comparing mass distributions, transport-based approaches for performing pattern recognition of morphometry encoded in image intensity values have also lately emerged. Recently described approaches for transport-based morphometry (TBM) [4], [27], [50] work by computing transport maps or plans between a set of images and a reference or template image. The transport plans/maps are then utilized as an invertible feature/transform onto which pattern recognition algorithms, such as PCA or LDA, can be applied. In effect, TBM utilizes the LOT framework described in the "The linear optimal transportation framework" section. These techniques have recently been employed to decode differences in cell and nuclear morphology for drug screening [4] and for cancer detection in histopathology [39] and cytology images, as well as in applications such as the analysis of galaxy morphologies [27].

Deformation-based methods have long been used in analyzing biomedical images. TBM, however, is different from those deformation-based methods in that it has numerically exact, uniquely defined solutions for the transport plans or maps used; i.e., images can be matched with little perceptible error. The same is not true of methods that rely on registration via the computation of deformations, given the significant topology differences commonly found in medical images. Moreover, TBM allows for comparison of the entire intensity information present in the images (shapes and textures), while deformation-based methods are usually employed to deal with shape differences. Figure 9 shows a schematic of the TBM steps applied to a cell nuclei data set. It can be seen that TBM is capable of modeling the variation in the data set. In addition, it enables one to visualize the classifier that discriminates between image classes (in this case, malignant versus benign).

FIGURE 9. A schematic of the TBM framework. (a) The optimal transport maps between input images $I_1, \ldots, I_N$ and a template image $I_0$ are calculated, $u_i = f_i^* - \mathrm{id}$. (b) and (c) Next, linear statistical modeling, such as PCA, LDA, and canonical correlation analysis, is performed on the optimal transport maps. The resulting transport maps obtained from the statistical modeling step are then applied to the template image to visualize the results of the analysis in the image space.

Superresolution

Superresolution is the process of reconstructing a high-resolution image from one or several corresponding low-resolution images. Superresolution algorithms can be broadly categorized into two major classes, multiframe superresolution and single-frame superresolution, based on the number of low-resolution images they require to reconstruct the corresponding high-resolution image. The TBM approach was used for single-frame superresolution in [26] to reconstruct high-resolution faces from very low-resolution input face images. The authors utilized TBM in combination with subspace learning techniques to learn a nonlinear model for the high-resolution face images in the training set.

In short, the method consists of a training and a testing phase. In the training phase, it uses high-resolution face images and morphs them to a template high-resolution face through optimal transport maps. Next, it learns a subspace for the calculated optimal transport maps. A transport map in this subspace can then be applied to the template image to synthesize a high-resolution face image. In the testing phase, the goal is to reconstruct a high-resolution image from the low-resolution input image. The method searches for a synthetic high-resolution face image (generated from the transport subspace) that provides a corresponding low-resolution image that is similar to the input low-resolution image. Figure 10 shows the steps used in this method and demonstrates reconstruction results.

Machine learning and statistics

The optimal transport framework has recently attracted ample attention from the machine-learning and statistics communities [12], [19], [25], [28], [36]. Some applications of optimal transport in these arenas include various transport-based learning methods [19], [28], [36], [48]; domain adaptation; Bayesian inference [12], [13]; and hypothesis testing [15], [42], among others. Here, we provide a brief overview of the recent developments of transport-based methods in machine learning and statistics.

FIGURE 10. (a) In the training phase, optimal transport maps that morph the template image to the high-resolution training face images are calculated. (b) Statistical modeling of the transport maps: PCA is used to learn a linear subspace of transport maps, in which a linear combination of the obtained eigenmaps, f(x) = x + aiVi(x) + ajVj(x), can be applied to the template image, via det(Df(x))I0(f(x)), to obtain synthetic face images. (c) A geometric interpretation of the problem and (d) reconstruction results of transport-based single-frame superresolution. (Face portraits courtesy of the public Extended Yale Face Database B.)
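The synthesis step in (b) admits a simple one-dimensional illustration: on the line, the pushforward det(Df(x))I0(f(x)) reduces to f′(x)I0(f(x)). The eigenmaps V1 and V2 below are stand-ins for PCA modes learned from training transport maps, so this is a hedged sketch of the idea in [26], not its two-dimensional implementation.

```python
# A 1-D sketch of transport-subspace synthesis: draw a map from the
# learned subspace, f(x) = x + sum_k a_k V_k(x), and apply it to the
# template via I_synth(x) = f'(x) * I0(f(x)). V1, V2 are illustrative
# stand-ins for eigenmaps obtained by PCA on training transport maps.
import numpy as np

def synthesize(I0, x, modes, coeffs):
    """Push the template density I0 through f(x) = x + sum_k a_k V_k(x)."""
    f = x + sum(a * V for a, V in zip(coeffs, modes))
    df = np.gradient(f, x)               # f'(x) on the grid
    return df * np.interp(f, x, I0)      # f'(x) * I0(f(x))

x = np.linspace(0.0, 1.0, 200)
I0 = np.exp(-((x - 0.5) ** 2) / 0.02)    # template profile
V1 = 0.05 * np.sin(np.pi * x)            # stand-in eigenmap 1
V2 = 0.05 * np.sin(2 * np.pi * x)        # stand-in eigenmap 2
I_synth = synthesize(I0, x, [V1, V2], [1.0, -0.5])
print("Mass before/after:", I0.sum() * (x[1] - x[0]),
      I_synth.sum() * (x[1] - x[0]))     # transport preserves total mass
```

In the testing phase, the coefficients ak would be optimized so that a downsampled version of the synthesized image matches the low-resolution input.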

Machine learning and statistics
The optimal transport framework has recently attracted ample attention from the machine-learning and statistics communities [12], [19], [25], [28], [36]. Some applications of optimal transport in these arenas include various transport-based learning methods [19], [28], [36], [48], domain adaptation, Bayesian inference [12], [13], and hypothesis testing [15], [42], among others. Here, we provide a brief overview of the recent developments of transport-based methods in machine learning and statistics.

Learning
Transport-based distances have recently been used in several works as a loss function for regression, classification, and other techniques. Montavon, Müller, and Cuturi [36], for instance, utilized the dual formulation of the entropy-regularized Wasserstein distance to train restricted Boltzmann machines (RBMs). Boltzmann machines are probabilistic graphical models (Markov random fields) that can be categorized as stochastic neural networks and are capable of extracting hierarchical features at multiple scales. RBMs are bipartite graphs that are special cases of Boltzmann machines; they define parameterized probability distributions over a set of d binary input variables (observations) whose states are represented by h binary output variables (hidden variables). The parameters of RBMs are often learned through information-theoretic divergences such as the KL divergence. Montavon et al. [36] proposed an alternative approach through a scalable entropy-regularized Wasserstein distance estimator for RBMs and showed the practical advantages of this distance over the commonly used information divergence-based loss functions.

In another approach, Frogner et al. [19] used the entropy-regularized Wasserstein loss for multilabel classification. They proposed a relaxation of the transport problem to deal with unnormalized measures by replacing the equality constraints in (6) with soft penalties with respect to the KL divergence. In addition, Frogner et al. [19] provided statistical bounds on the expected semantic distance between the prediction and the ground truth. In yet another approach, Kolouri et al. [28] utilized the sliced-Wasserstein metric and provided a family of positive definite kernels, denoted sliced-Wasserstein kernels, and showed the advantages of learning with such kernels. The sliced-Wasserstein kernels were shown to be effective in various machine-learning tasks, including classification, clustering, and regression.

Solomon et al. [48] considered the problem of graph-based semisupervised learning, in which graph nodes are partially labeled and the task is to propagate the labels throughout the nodes. Specifically, they considered a problem in which the labels are histograms. This problem arises, for example, in traffic density prediction, in which the traffic density is observed at a few stoplights over 24 h in a city and the city is interested in predicting the traffic density at the unobserved stoplights. They pose the problem as an optimization of a Dirichlet energy for distribution-valued maps based on the 2-Wasserstein distance and present a Wasserstein propagation scheme for semisupervised distribution propagation along graphs.

More recently, Arjovsky et al. [3] compared various distances, i.e., the total variation (TV) distance, the KL divergence, the Jensen–Shannon divergence, and the Wasserstein distance, in training generative adversarial networks (GANs). They demonstrated (theoretically and numerically) that the Wasserstein distance leads to superior performance compared with the other dissimilarity measures. They specifically showed that their proposed Wasserstein GAN does not suffer from common issues in such networks, including instability and mode collapse.
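The entropy-regularized Wasserstein distance that several of these losses build on can be computed with Sinkhorn's alternating matrix-scaling iterations [14]. The following is a generic, self-contained sketch of that primitive; the function name, regularization strength, and iteration count are illustrative choices, not the implementations used in [19] or [36].

```python
# A self-contained sketch of the entropy-regularized Wasserstein cost
# between two histograms p, q with ground cost C, computed via Sinkhorn
# iterations. Generic textbook version; all parameters are illustrative.
import numpy as np

def sinkhorn(p, q, C, reg=0.05, n_iter=200):
    """Return <P, C> for the entropy-regularized optimal plan P."""
    K = np.exp(-C / reg)                  # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iter):               # alternating scaling updates
        v = q / (K.T @ u)
        u = p / (K @ v)
    P = u[:, None] * K * v[None, :]       # regularized transport plan
    return np.sum(P * C)

x = np.linspace(0.0, 1.0, 100)
C = (x[:, None] - x[None, :]) ** 2        # squared ground cost
p = np.exp(-((x - 0.3) ** 2) / 0.01); p /= p.sum()
q = np.exp(-((x - 0.7) ** 2) / 0.01); q /= q.sum()
print("Regularized W2^2 estimate:", sinkhorn(p, q, C))
```

The entropy term makes the objective smooth and reduces the computation to matrix-vector products, which is what makes such losses practical inside learning pipelines.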

Domain adaptation
Domain adaptation is one of the fundamental problems in machine learning and has gained considerable attention from the machine-learning research community in the past decade. It is the task of transferring knowledge from classifiers trained on available labeled data to unlabeled test domains with data distributions that differ from that of the training data. The optimal transport framework was recently presented as a potential major player in domain adaptation problems [12], [13]. Courty et al. [12], for instance, assumed that there exists a nonrigid transformation between the source and target distributions, and they find this transformation using an entropy-regularized optimal transport problem. They also proposed a label-aware version of the problem in which the transport plan is regularized so that a given target point (testing exemplar) is associated only with source points (training exemplars) belonging to the same class. Courty et al. [12] showed that domain adaptation via regularized optimal transport outperforms state-of-the-art results in several challenging domain adaptation problems.

Bayesian inference
Another interesting and emerging application of the optimal transport problem is in Bayesian inference [17]. In Bayesian inference, one critical step is the evaluation of expectations with respect to a posterior probability function, which leads to complex multidimensional integrals. These integrals are commonly solved through Monte Carlo numerical integration, which requires independent sampling from the posterior distribution. In practice, sampling from a general posterior distribution might be difficult; therefore, the sampling is performed via a Markov chain that converges to the posterior probability after a certain number of steps. This leads to the celebrated Markov chain Monte Carlo (MCMC) method. The downside of the MCMC method is that the samples are not independent and, hence, the convergence of the empirical expectation is slow. El Moselhy and Marzouk [17] proposed a transport-based method that evades the need for Markov chain simulation by allowing direct sampling from the posterior distribution. The core idea in their work is to find a transport map (via a regularized Monge formulation) that pushes forward the prior measure to the posterior measure. Then, sampling the prior distribution and applying the transport map to the samples yields a sampling scheme for the posterior distribution. Figure 11 shows the basic idea behind these methods.

FIGURE 11. (a) The prior distribution p, the posterior distribution q, and the corresponding transport map f that pushes p into q, so that f′(x)q(f(x)) = p(x). One million samples, xi, were generated from distribution p. (b) The empirical distribution of these samples, denoted as p̂, and (c) the empirical distribution of the transformed samples, yi = f(xi), denoted as q̂.
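In one dimension, the map is again a composition of CDFs, f = Fq^{-1} ∘ Fp, which gives a quick, hedged illustration of the idea. The Gaussian prior/posterior pair and the grid construction below are assumptions made for demonstration; [17] computes such maps in higher dimensions by solving an optimization problem.

```python
# A 1-D sketch of transport-based posterior sampling in the spirit of
# [17]: build the map f pushing the prior to the posterior, then turn
# i.i.d. prior samples into posterior samples without any Markov chain.
# The Gaussian prior/posterior pair is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 1000)
prior = np.exp(-x**2 / 2)                     # N(0, 1), unnormalized
post = np.exp(-(x - 1.0)**2 / (2 * 0.25))     # N(1, 0.5^2), say, after data

F_prior = np.cumsum(prior); F_prior /= F_prior[-1]
F_post = np.cumsum(post); F_post /= F_post[-1]
f = np.interp(F_prior, F_post, x)             # f = F_post^{-1}(F_prior(x))

samples = rng.standard_normal(100000)         # direct draws from the prior
pushed = np.interp(samples, x, f)             # apply the transport map
print("Pushed mean/std:", pushed.mean(), pushed.std())  # ~1.0, ~0.5
```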
Hypothesis testing
The Wasserstein distance is used for goodness-of-fit testing in [15] and for two-sample testing in [42]. Ramdas et al. [42] presented connections between the entropy-regularized Wasserstein distance, the multivariate energy distance, and the kernel maximum mean discrepancy and provided a "distribution-free" univariate Wasserstein test statistic. These and other applications of transport-related concepts show the promise of this mathematical modeling technique in the design of statistical data-analysis methods for tackling modern learning problems.
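For intuition, the empirical univariate statistic is easy to state: for equal sample sizes, Wp^p is the mean pth power of the gaps between sorted samples. The sketch below pairs that statistic with a generic permutation calibration; it conveys the flavor of such tests rather than the exact procedure of [42].

```python
# A sketch of a univariate Wasserstein two-sample test: for equal sample
# sizes, the empirical W_p^p is a mean of pth powers of sorted-sample
# gaps. The permutation calibration is a generic recipe, assumed here
# for illustration.
import numpy as np

def wasserstein_pp(a, b, p=2):
    """Empirical W_p^p between two equal-size 1-D samples."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)) ** p)

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 500)
b = rng.normal(0.3, 1.0, 500)                 # shifted alternative

obs = wasserstein_pp(a, b)
pooled = np.concatenate([a, b])
null = []
for _ in range(999):                          # permutation null distribution
    rng.shuffle(pooled)
    null.append(wasserstein_pp(pooled[:500], pooled[500:]))
p_value = (1 + np.sum(np.array(null) >= obs)) / 1000
print(f"W_2^2 = {obs:.4f}, permutation p-value = {p_value:.3f}")
```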

Finally, note that, in the interest of brevity, a number of other important applications of transport-related techniques were not discussed above but are certainly interesting in their own right. For a more detailed discussion and more references, please refer to [24].

Summary and conclusions
Transport-related methods and applications have come a long way. Although earlier applications focused primarily on civil engineering and economics problems, they have recently begun to be employed in a wide variety of problems related to signal and image analysis and pattern recognition. In this article, seven main areas of application were reviewed: image retrieval, registration and morphing, color transfer and texture analysis, image restoration, TBM, image superresolution, and machine learning and statistics. Transport and related techniques have gained increased interest in recent years. Overall, researchers have found that the application of transport-related concepts can be helpful in solving problems in diverse applications. Given recent trends, it seems safe to expect that the number of application areas will continue to grow.

In their most general form, the transport-related techniques reviewed in this article can be thought of as mathematical models for signals, images, and data distributions in general. Transport-related metrics involve calculating differences not only of pixel or distribution intensities but also of where they are located in the corresponding coordinate space (a pixel coordinate in an image or a particular axis in some arbitrary feature space). As such, the geometry (e.g., geodesics) induced by such metrics can give rise to dramatically different algorithms and data interpretation results. The interesting performance improvements recently obtained could motivate the search for a more rigorous mathematical understanding of transport-related metrics and applications.

The emergence of numerically precise and efficient ways of computing transport-related metrics and geodesics, as presented in the "Numerical Methods" section, also serves as an enabling mechanism. Coupled with the fact that several mathematical properties of transport-based metrics have been extensively studied, we believe that the foundation is set for their increased use as tools or building blocks on which complex computational systems can be built. The confluence of these emerging ideas may spur a significant amount of innovation in a world where sensor and other data are becoming abundant and computational intelligence to analyze them is in high demand. We believe transport-based models will become an important component of the ever-expanding tool set available to modern signal-processing and data-science experts.

Acknowledgments
We gratefully acknowledge funding from the National Science Foundation (NSF) (CCF 1421502) and the National Institutes of Health (GM090033, CA188938) for contributing to a portion of this work. Dejan Slepčev also acknowledges funding by the NSF (DMS-1516677).

Authors
Soheil Kolouri ([email protected]) received his B.S. degree in electrical engineering from Sharif University of Technology, Tehran, Iran, in 2010 and his M.S. degree in electrical engineering in 2012 from Colorado State University, Fort Collins. He received his doctorate degree in biomedical engineering from Carnegie Mellon University, Pittsburgh, Pennsylvania, in 2015, where his thesis, "Transport-Based Pattern Recognition and Image Modelling," won the Best Thesis Award. He is currently with HRL Laboratories, LLC, Malibu, California.

Se Rim Park ([email protected]) received her B.S. degree in electrical and electronic engineering from Yonsei University, Seoul, South Korea, in 2011 and is currently a doctoral candidate in the Electrical and Computer Engineering Department at Carnegie Mellon University, Pittsburgh, Pennsylvania. She is mainly interested in signal processing and machine learning, especially designing new signal and image transforms and developing novel systems for pattern recognition.

Matthew Thorpe ([email protected]) received his B.S., M.S., and Ph.D. degrees in mathematics from the University of Warwick, United Kingdom, in 2009, 2012, and 2015, respectively, and his M.S.Tech. degree in mathematics from the University of New South Wales, Australia, in 2010. He is currently a postdoctoral associate in the Mathematics Department at Carnegie Mellon University, Pittsburgh, Pennsylvania.

Dejan Slepčev received his B.S. degree in mathematics from the University of Novi Sad, Serbia, in 1995, his M.A. degree in mathematics from the University of Wisconsin–Madison in 2000, and his Ph.D. degree in mathematics from the University of Texas at Austin in 2002. He is currently an associate professor in the Department of Mathematical Sciences at Carnegie Mellon University, Pittsburgh, Pennsylvania.

Gustavo K. Rohde ([email protected]) received his B.S. degrees in physics and mathematics in 1999 and his M.S. degree in electrical engineering in 2001 from Vanderbilt University, Nashville, Tennessee. He received his doctorate in applied mathematics and scientific computation in 2005 from the University of Maryland, College Park. He is currently an associate professor of biomedical engineering and electrical engineering at the University of Virginia, Charlottesville.

References
[1] L. Ambrosio, N. Gigli, and G. Savaré, Gradient Flows in Metric Spaces and in the Space of Probability Measures, 2nd ed. (Lectures in Mathematics ETH Zürich). Basel: Birkhäuser Verlag, 2008.
[2] S. Angenent, S. Haker, and A. Tannenbaum, "Minimizing flows for the Monge–Kantorovich problem," SIAM J. Math. Anal., vol. 35, no. 1, pp. 61–97, 2003.
[3] M. Arjovsky, S. Chintala, and L. Bottou. (2017). Wasserstein GAN. arXiv. [Online]. Available: https://arxiv.org/abs/1701.07875
[4] S. Basu, S. Kolouri, and G. K. Rohde, "Detecting and visualizing cell phenotype differences from microscopy images using transport-based morphometry," Proc. Nat. Acad. Sci. U.S.A., vol. 111, no. 9, pp. 3448–3453, 2014.
[5] J.-D. Benamou and Y. Brenier, "A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem," Numer. Math., vol. 84, no. 3, pp. 375–393, 2000.
[6] J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré, "Iterative Bregman projections for regularized transportation problems," SIAM J. Sci. Comput., vol. 37, no. 2, pp. A1111–A1138, 2015.
[7] J.-D. Benamou, B. D. Froese, and A. M. Oberman, "Numerical solution of the optimal transportation problem using the Monge–Ampère equation," J. Comput. Phys., vol. 260, pp. 107–126, 2014.
[8] N. Bonneel, J. Rabin, G. Peyré, and H. Pfister, "Sliced and Radon Wasserstein barycenters of measures," J. Math. Imaging Vision, vol. 51, no. 1, pp. 22–45, 2015.
[9] Y. Brenier, "Polar factorization and monotone rearrangement of vector-valued functions," Comm. Pure Appl. Math., vol. 44, no. 4, pp. 375–417, 1991.
[10] L. A. Caffarelli, "The regularity of mappings with a convex potential," J. Am. Math. Soc., vol. 5, no. 1, pp. 99–104, Jan. 1992.
[11] R. Chartrand, K. Vixie, B. Wohlberg, and E. Bollt, "A gradient descent solution to the Monge–Kantorovich problem," Appl. Math. Sci., vol. 3, no. 22, pp. 1071–1080, 2009.
[12] N. Courty, R. Flamary, and D. Tuia, "Domain adaptation with regularized optimal transport," in Machine Learning and Knowledge Discovery in Databases. New York: Springer Berlin Heidelberg, 2014, pp. 274–289.
[13] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. (2016). Optimal transport for domain adaptation. arXiv. [Online]. Available: https://arxiv.org/abs/1507.00504
[14] M. Cuturi, "Sinkhorn distances: Lightspeed computation of optimal transport," in Advances in Neural Information Processing Systems, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds. Red Hook, NY: Curran Associates, 2013, pp. 2292–2300.
[15] E. del Barrio, J. A. Cuesta-Albertos, C. Matrán, and J. M. Rodríguez-Rodríguez, "Tests of goodness of fit based on the L2-Wasserstein distance," Ann. Stat., vol. 27, no. 4, pp. 1230–1239, 1999.
[16] J. Delon, "Midway image equalization," J. Math. Imaging Vision, vol. 21, no. 2, pp. 119–134, 2004.
[17] T. A. El Moselhy and Y. M. Marzouk, "Bayesian inference with optimal maps," J. Comput. Phys., vol. 231, no. 23, pp. 7815–7850, 2012.
[18] S. Ferradans, G.-S. Xia, G. Peyré, and J.-F. Aujol, "Static and dynamic texture mixing using optimal transport," in Proc. Scale Space and Variational Methods in Computer Vision: 4th Int. Conf. (SSVM 2013), Leibnitz, Austria, June 2–6, 2013, pp. 137–148. doi: 10.1007/978-3-642-38267-3_12.
[19] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio, "Learning with a Wasserstein loss," in Advances in Neural Information Processing Systems, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Red Hook, NY: Curran Associates, 2015, pp. 2044–2052.
[20] W. Gangbo and R. J. McCann, "The geometry of optimal transportation," Acta Math., vol. 177, no. 2, pp. 113–161, 1996.
[21] E. Haber, T. Rehman, and A. Tannenbaum, "An efficient numerical method for the solution of the L2 optimal mass transfer problem," SIAM J. Sci. Comput., vol. 32, no. 1, pp. 197–211, 2010.
[22] S. Haker, L. Zhu, A. Tannenbaum, and S. Angenent, "Optimal mass transport for registration and warping," Int. J. Comput. Vision, vol. 60, no. 3, pp. 225–240, 2004.
[23] L. V. Kantorovich, "On translation of mass" (in Russian), Dokl. AN SSSR, vol. 37, pp. 199–201, 1942.
[24] S. Kolouri, S. Park, M. Thorpe, D. Slepčev, and G. K. Rohde. (2016). Transport-based analysis, modeling, and learning from signal and data distributions. arXiv. [Online]. Available: https://arxiv.org/abs/1609.04767
[25] S. Kolouri, S. R. Park, and G. K. Rohde, "The Radon cumulative distribution transform and its application to image classification," IEEE Trans. Image Process., vol. 25, no. 2, pp. 920–934, 2016.
[26] S. Kolouri and G. K. Rohde, "Transport-based single frame super resolution of very low resolution face images," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4876–4884.
[27] S. Kolouri, A. B. Tosun, J. A. Ozolek, and G. K. Rohde, "A continuous linear optimal transport approach for pattern analysis in image datasets," Pattern Recognit., vol. 51, pp. 453–462, Mar. 2016.
[28] S. Kolouri, Y. Zou, and G. K. Rohde, "Sliced Wasserstein kernels for probability distributions," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 5258–5267.
[29] R. Lai and H. Zhao. (2014). Multi-scale non-rigid point cloud registration using robust Sliced–Wasserstein distance via Laplace–Beltrami eigenmap. arXiv. [Online]. Available: https://arxiv.org/abs/1406.3758
[30] J. Lellmann, D. A. Lorenz, C. Schönlieb, and T. Valkonen, "Imaging with Kantorovich–Rubinstein discrepancy," SIAM J. Imaging Sci., vol. 7, no. 4, pp. 2833–2859, 2014.
[31] B. Lévy, "A numerical algorithm for L2 semi-discrete optimal transport in 3D," ESAIM Math. Model. Numer. Anal., vol. 49, no. 6, pp. 1693–1715, 2015.
[32] P. Li, Q. Wang, and L. Zhang, "A novel earth mover's distance methodology for image matching with Gaussian mixture models," in Proc. IEEE Int. Conf. Computer Vision (ICCV), 2013, pp. 1689–1696.
[33] G. Loeper and F. Rapetti, "Numerical solution of the Monge–Ampère equation by a Newton's algorithm," Comptes Rendus Math., vol. 340, no. 4, pp. 319–324, 2005.
[34] Q. Mérigot, "A multiscale approach to optimal transport," Comput. Graph. Forum, vol. 30, no. 5, pp. 1583–1592, 2011.
[35] G. Monge, Mémoire sur la théorie des déblais et des remblais. Paris, France: De l'Imprimerie Royale, 1781.
[36] G. Montavon, K.-R. Müller, and M. Cuturi. (2015). Wasserstein training of Boltzmann machines. arXiv. [Online]. Available: https://arxiv.org/abs/1507.01972
[37] A. M. Oberman and Y. Ruan. (2015). An efficient linear programming method for optimal transportation. arXiv. [Online]. Available: https://arxiv.org/abs/1509.03668
[38] F. Otto, "The geometry of dissipative evolution equations: The porous medium equation," Comm. Partial Differential Equations, vol. 26, no. 1–2, pp. 101–174, 2001.
[39] J. A. Ozolek, A. B. Tosun, W. Wang, C. Chen, S. Kolouri, S. Basu, H. Huang, and G. K. Rohde, "Accurate diagnosis of thyroid follicular lesions from nuclear morphology using supervised learning," Med. Image Anal., vol. 18, no. 5, pp. 772–780, 2014.
[40] S. R. Park, S. Kolouri, S. Kundu, and G. K. Rohde, "The cumulative distribution transform and linear pattern classification," Appl. Comput. Harmonic Anal., 2017, in press.
[41] J. Rabin, S. Ferradans, and N. Papadakis, "Adaptive color transfer with relaxed optimal transport," in Proc. 2014 IEEE Int. Conf. Image Processing (ICIP), 2014, pp. 4852–4856.
[42] A. Ramdas, N. Garcia, and M. Cuturi. (2015). On Wasserstein two sample testing and related families of nonparametric tests. arXiv. [Online]. Available: https://arxiv.org/abs/1509.02237
[43] G. K. Rohde et al. Transport and other Lagrangian transforms for signal analysis and discrimination. [Online]. Available: http://faculty.virginia.edu/rohde/transport
[44] Y. Rubner, C. Tomasi, and L. J. Guibas, "The earth mover's distance as a metric for image retrieval," Int. J. Comput. Vision, vol. 40, no. 2, pp. 99–121, 2000.
[45] F. Santambrogio, Optimal Transport for Applied Mathematicians. New York: Birkhäuser, 2015.
[46] B. Schmitzer, "A sparse multiscale algorithm for dense optimal transport," J. Math. Imaging Vision, vol. 56, no. 2, pp. 238–259. doi: 10.1007/s10851-016-0653-9.
[47] J. Solomon, F. de Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas, "Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains," presented at the Special Interest Group on Computer Graphics and Interactive Techniques (SIGGRAPH) Conf., 2015.
[48] J. Solomon, R. Rustamov, L. Guibas, and A. Butscher, "Wasserstein propagation for semi-supervised learning," in Proc. 31st Int. Conf. Machine Learning (ICML), 2014, pp. 306–314.
[49] C. Villani, Optimal Transport: Old and New, vol. 338. Berlin: Springer-Verlag, 2008.
[50] W. Wang, D. Slepčev, S. Basu, J. A. Ozolek, and G. K. Rohde, "A linear optimal transportation framework for quantifying and visualizing variations in sets of images," Int. J. Comput. Vision, vol. 101, no. 2, pp. 254–269, 2013.
