Arxiv:2002.01615V3 [Stat.ML] 10 Feb 2021 Yt University Kyoto Eli Aiu Xeietlstig Ta at Settings Experimental Various in Well Itself
Total Page:16
File Type:pdf, Size:1020Kb
Fast and Robust Comparison of Probability Measures in Heterogeneous Spaces Ryoma Sato Marco Cuturi Makoto Yamada Hisashi Kashima Kyoto University CREST-ENSAE Kyoto University Kyoto University RIKEN AIP Google Brain RIKEN AIP RIKEN AIP Abstract 1 Introduction Comparing two probability measures sup- Wasserstein distances have proved useful to compare ported on heterogeneous spaces is an two probability distributions when they are both increasingly important problem in machine supported in the same metric space. This is ex- learning. Such problems arise when compar- emplified by its several applications, notably in im- ing for instance two populations of biological age (Ni et al., 2009; Rabin et al., 2011; De Goes et al., cells, each described with its own set of 2012; Schmitz et al., 2018) and natural language pro- features, or when looking at families of cessing (Kusner et al., 2015; Rolet et al., 2016), neu- word embeddings trained across different roimaging (Janati et al., 2020), single-cell biology corpora/languages. For such settings, the (Schiebinger et al., 2019; Yang et al., 2020), and when Gromov Wasserstein (GW) distance is training generative models (Arjovsky et al., 2017; often presented as the gold standard. GW Salimans et al., 2018; Genevay et al., 2018). Yet, is intuitive, as it quantifies whether one when the two distributions live in two different and measure can be isomorphically mapped to seemingly unrelated spaces, simple Wasserstein dis- the other. However, its exact computation tances fail and one must to resort to the more in- is intractable, and most algorithms that volved framework of Gromov-Wasserstein (GW) dis- claim to approximate it remain expensive. tances (Mémoli, 2011). Building on Mémoli (2011), who proposed Generality of GW. The GW distance replaces the to represent each point in each distribution linear objective appearing in optimal transport (OT) as the 1D distribution of its distances to all with a quadratic function of the transportation plan other points, we introduce in this paper the that quantifies some form of metric distortion when Anchor Energy (AE) and Anchor Wasser- transporting points from one space to another. GW stein (AW) distances, which are respectively is not only an elegant answer to this problem, but the energy and Wasserstein distances in- it is also well grounded in theory (Sturm, 2012) arXiv:2002.01615v3 [stat.ML] 10 Feb 2021 stantiated on such representations. Our and has been successfully applied to shape matching main contribution is to propose a sweep (Mémoli, 2007; Solomon et al., 2016), machine trans- line algorithm to compute AE exactly in lation (Alvarez-Melis and Jaakkola, 2018; Grave et al., log-quadratic time, where a naive implemen- 2019), and graph matching (Xu et al., 2019b,a). The tation would be cubic. This is quasi-linear GW distance compares two distributions supported on w.r.t. the description of the problem itself. spaces that are endowed with a metric, or more gen- Our second contribution is the proposal of erally, a cost structure. This “(distribution, metric)” robust variants of AE and AW that uses pair is called a measured metric spaces (MMS), which rank statistics rather than the original dis- reduces, in a discrete setting to a probability vector of tances. We show that AE and AW perform size n paired with an n × n cost matrix. well in various experimental settings at a fraction of the computational cost of popular GW approximations. Although well grounded in GW approximations. Code is available at theory, the GW geometry is not tractable computa- https://github.com/joisino/anchor-energy. tionally: since its exact computation requires solv- ing an NP-hard quadratic assignment problem (QAP) (Mémoli, 2007), practical applications rely on approx- imation schemes to approach GW. Aflalo et al. (2015) Fast and Robust Comparison of Probability Measures in Heterogeneous Spaces relaxed the problem into a convex quadratic program, We also consider the entropic regularized Wasserstein while Kezurer et al. (2015) relaxed it to a semidefinite distance of the distributions of anchor 1D distributions, program. Although tractable for small n, these relax- an call this variant the Anchor Wasserstein (AW) dis- ations have scalability issues that prevent their use be- tance. Our second contribution is to introduce a rank- yond a few hundred points. Solomon et al. (2016) and based approach in which raw distance values in a MMS Peyré et al. (2016) proposed to modify the QAP using are replaced by their ranks, relative to all pairwise entropic regularization. When comparing two MMSs distances described in the MMS. We show that the supported on n points, this results in an algorithm AE and AW distances improve on entropic GW in 3D that alternates between a local linearization of the GW shape comparison and node assignment tasks, while be- objective (requiring two matrix multiplications at a ing order of magnitudes faster and applicable to larger 2n3 cost) followed by Sinkhorn iterations (for a O(n2) scales. cost). This combination of outer/inner loops can prove too costly for large n, and can suffer from numerical 2 Optimal Transport and Anchors instability due to the fact that the scale of the cost matrix can change at each linearization, making the Sinkhorn kernel application numerically unstable. To We call any measure µ on a measurable space X R bring down computational costs further, Vayer et al. endowed with a cost function X×X → a mea- (2019) proposed to restrict the GW problem to mea- sured metric space (MMS). Equivalently, a discrete sures supported on Euclidean spaces, and to general- measured metric set (MMSet) of size n is a pair Rn×n ize the slicing approach (Rabin et al., 2011) to obtain S = (a, C) ∈ Σn × , where, for a given n ≥ 1, a Rn a a cheaper O(n log n) computational price, pending ad- Σn = { ∈ + | i i = 1} is the n-probability sim- plex. Intuitively an MMSet is a probability vector a, ditional efforts to rotate/register point clouds. This P lower complexity if obtained by placing restrictions on putting mass on points seen exclusively from the lens the data, since it cannot handle non-Euclidean data or of an n × n pairwise cost matrix C. Such pairs of costs (graphs, geodesic distances of shapes, and more weights/cost matrices typically arise when describing general arbitrary kernels and costs). a weighted point cloud, along with the shortest path distance matrix induced from a graph on that cloud, or From Points to Distributions of Costs. An ef- more generally any other cost function (Solomon et al., fective way to compare two points living in different 2016; Peyré et al., 2016). We assume no knowledge on MMSs could be achieved by giving them a common the points that constitute an MMSet, and only work feature representation. Such a representation can be from C (the SGW framework assumes that C is a obtained by mapping each point from an MMS onto squared-Euclidean distance matrix). Before describ- the 1D distribution of all its distances/costs to all ing the GW distance between two MMSets, we review other points in its MMS (hence the characterization the standard Wasserstein distance and quadratic as- of such points as “anchors”). This approach was in- signments. troduced in computer graphics and vision to compare 3D shapes (Osada et al., 2002; Hamza and Krim, 2003; Wasserstein Distance. Given two measurable R Gelfand et al., 2005; Manay et al., 2006) and linked spaces X and Y, a cost function c: X×Y→ , and with the GW problem by Mémoli (2011). As a re- two measures µ ∈ P(X ) and ν ∈ P(Y), the optimal sult, two points from homogeneous sources can be transport problem between µ and ν can be written as compared through their respective cost distributions E R OT(µ,ν) = min (X,Y )∼γ[c(X, Y )], in P( ). Consequently, MMSs can be represented as γ∈Π(µ,ν) elements of P(P(R)) and compared with a suitable metric on that space. where Π(µ,ν) is the set of joint couplings on X × Y Contributions. We focus first on the energy distance that have marginals µ,ν. When X = Y and the cost c (Székely and Rizzo, 2013; Sejdinovic et al., 2013) com- is a metric on X raised to the power p (with p ≥ 1), the 1/p puted between measures in P(P(R)), using the 1- optimum OT(µ,ν) is called the p-Wasserstein met- Wasserstein distance as a negative definite kernel to ric between µ and ν. We consider next two subcases compare atoms in P(R) (Wasserstein distances are that are relevant for the remainder of this paper. R 1 2 indeed n.d. in P( ), as remarked by Kolouri et al. When a , a ∈ Σn (i.e., discrete), a cost matrix C ∈ (2016)). We call this approach the Anchor Energy Rn×n suffices to instantiate the OT problem: (AE) distance, and our first contribution is to propose an efficient approach to compute it in O(n2 log n) time, 1 2 OT(a , a ) = min Pij Cij , where P∈U(a1,a2) resulting in quasi-linear complexity w.r.t. the input ij 2 X size (an MMS is described with an n cost matrix). 1 2 n×m 1 ⊤ 2 ½ R ½ U(a , a )= {P ∈ + | P m = a , P n = a }, Ryoma Sato, Marco Cuturi, Makoto Yamada, Hisashi Kashima is the transportation polytope and ½m is the m-vector distances as SGW, but without relying on a projection of ones. When µ,ν ∈ P(R) and c(x, y) = |x − y|p, step. p ≥ 1, one has that (Santambrogio, 2015, §2) Points as Anchors. To compare two MMSets, we 1 map each point from each MMSet to the distribution −1 −1 p OTp(µ,ν)= |Hµ (u) − Hν (u)| du, of its (weighted) ground cost or distance to all other Z0 points within that MMSet.