arXiv:2002.01615v3 [stat.ML] 10 Feb 2021 yt University Kyoto eli aiu xeietlstig ta at settings experimental various in well itself. problem the of description the w.r.t. nhrEeg A)adAco Wasser- Anchor and (AE) Energy Anchor different across trained embeddings word Wapoiain.Cd saalbeat available is Code https://github.com/joisino/anchor-energy popular approximations. of GW cost computational the perform of AW fraction and dis- AE uses that original that show the We AW than tances. and rather of statistics AE proposal rank of the variants is robust contribution quasi-linear second is Our This cubic. be implemen- would naive tation a where time, log-quadratic ancnrbto st rps sweep AE a compute propose to Our to algorithm is line in- representations. contribution such main Wasserstein on and stantiated energy respectively are the which distances, (AW) the stein paper this all in to introduce distances we its points, of distribution other distribution each 1D in the proposed point as who each expensive. represent (2011), that to remain Mémoli algorithms on it Building approximate most to and computation claim exact intractable, its one to However, is mapped whether other. isomorphically quantifies the be it GW can is as measure standard. intuitive, gold the is as (GW) presented the often Wasserstein settings, of such families Gromov For of at set corpora/languages. looking own when its or with described features, each biological of an cells, populations two compar- instance is when for arise ing problems machine spaces Such in learning. problem heterogeneous important sup- increasingly on measures probability ported two Comparing ym aoMroCtr aooYmd iah Kashima Hisashi Yamada Makoto Cuturi Marco Sato Ryoma IE AIP RIKEN atadRbs oprsno rbblt esrsin Measures Probability of Comparison Robust and Fast Abstract CREST-ENSAE ogeBrain Google eeoeeu Spaces Heterogeneous exactly in . Mml,20) rcia plctosrl napprox- on rely applications practical 2007), (Mémoli, Mml,20;Slmne l,21) ahn trans- machine 2016), al., et Solomon 2007; (Mémoli, Sheigre l,21;Yn ta. 00,adwhen and 2020), al., et Yang 2019; al., et (Schiebinger asrti itne aepoe sflt compare to useful proved have distances Wasserstein ihaqartcfnto ftetasotto plan transportation the of function quadratic a with hntetodsrbtoslv ntwo in live distributions two the when 09,adgahmthn X ta. 09,) The 2019b,a). al., et (Xu matching graph and 2019), dis- (GW) Gromov-Wasserstein of framework volved 02 cmt ta. 08 n aua agaepro- language natural and 2018) al., et Schmitz 2012; mto cee oapoc W fll ta.(2015) al. et Aflalo GW. approach to schemes solv- imation (QAP) requires problem computation assignment quadratic computa- exact NP-hard tractable an its ing not since is geometry tionally: GW the theory, approximations. GW w rbblt itiuin hnte r both are they the when in distributions supported probability two Introduction 1 pcsta r noe ihamti,o oegen- a more This called or is pair structure. , cost a a with endowed erally, are that on supported spaces distributions two compares distance GW al., 2012) et Grave (Sturm, 2018; Jaakkola, and matching theory (Alvarez-Melis shape lation in but to applied problem, grounded successfully this well been to has also answer and GW elegant is an another. it only to when space not distortion one is metric from of points form transporting some quantifies (OT) that transport optimal in appearing objective linear in- GW. more of the Generality to resort 2011). (Mémoli, to tances must one dis- and Wasserstein fail simple tances spaces, unrelated seemingly eue,i iceestigt rbblt etrof vector probability a to setting size discrete a in reduces, aiase l,21;Gnvye l,21) Yet, 2018). 2017; al., et Genevay al., et 2018; (Arjovsky al., models et biology Salimans generative single-cell neu- training 2020), 2016), al., al., et et Rolet (Janati 2015; al., roimaging al., et et Goes (Kusner De 2011; cessing im- al., et in Rabin notably 2009; al., applications, et (Ni several age its by emplified yt University Kyoto n IE AIP RIKEN ardwt an with paired esrdmti spaces metric measured same n h Wdsac elcsthe replaces distance GW The × ercsae hsi ex- is This space. metric n lhuhwl ruddin grounded well Although otmatrix. cost (itiuin metric)” “(distribution, yt University Kyoto IE AIP RIKEN MS,which (MMS), different and Fast and Robust Comparison of Probability Measures in Heterogeneous Spaces

relaxed the problem into a convex quadratic program, We also consider the entropic regularized Wasserstein while Kezurer et al. (2015) relaxed it to a semidefinite distance of the distributions of anchor 1D distributions, program. Although tractable for small n, these relax- an call this variant the Anchor Wasserstein (AW) dis- ations have scalability issues that prevent their use be- tance. Our second contribution is to introduce a rank- yond a few hundred points. Solomon et al. (2016) and based approach in which raw distance values in a MMS Peyré et al. (2016) proposed to modify the QAP using are replaced by their ranks, relative to all pairwise entropic regularization. When comparing two MMSs distances described in the MMS. We show that the supported on n points, this results in an algorithm AE and AW distances improve on entropic GW in 3D that alternates between a local linearization of the GW shape comparison and node assignment tasks, while be- objective (requiring two matrix multiplications at a ing order of magnitudes faster and applicable to larger 2n3 cost) followed by Sinkhorn iterations (for a O(n2) scales. cost). This combination of outer/inner loops can prove too costly for large n, and can suffer from numerical 2 Optimal Transport and Anchors instability due to the fact that the scale of the cost matrix can change at each linearization, making the Sinkhorn kernel application numerically unstable. To We call any measure µ on a measurable space X R bring down computational costs further, Vayer et al. endowed with a cost function X×X → a mea- (2019) proposed to restrict the GW problem to mea- sured (MMS). Equivalently, a discrete sures supported on Euclidean spaces, and to general- measured metric set (MMSet) of size n is a pair Rn×n ize the slicing approach (Rabin et al., 2011) to obtain S = (a, C) ∈ Σn × , where, for a given n ≥ 1, a Rn a a cheaper O(n log n) computational price, pending ad- Σn = { ∈ + | i i = 1} is the n-probability sim- plex. Intuitively an MMSet is a probability vector a, ditional efforts to rotate/register point clouds. This P lower complexity if obtained by placing restrictions on putting mass on points seen exclusively from the lens the data, since it cannot handle non-Euclidean data or of an n × n pairwise cost matrix C. Such pairs of costs (graphs, geodesic distances of shapes, and more weights/cost matrices typically arise when describing general arbitrary kernels and costs). a weighted point cloud, along with the shortest path distance matrix induced from a graph on that cloud, or From Points to Distributions of Costs. An ef- more generally any other cost function (Solomon et al., fective way to compare two points living in different 2016; Peyré et al., 2016). We assume no knowledge on MMSs could be achieved by giving them a common the points that constitute an MMSet, and only work feature representation. Such a representation can be from C (the SGW framework assumes that C is a obtained by mapping each point from an MMS onto squared-Euclidean distance matrix). Before describ- the 1D distribution of all its distances/costs to all ing the GW distance between two MMSets, we review other points in its MMS (hence the characterization the standard Wasserstein distance and quadratic as- of such points as “anchors”). This approach was in- signments. troduced in computer graphics and vision to compare 3D shapes (Osada et al., 2002; Hamza and Krim, 2003; Wasserstein Distance. Given two measurable R Gelfand et al., 2005; Manay et al., 2006) and linked spaces X and Y, a cost function c: X×Y→ , and with the GW problem by Mémoli (2011). As a re- two measures µ ∈ P(X ) and ν ∈ P(Y), the optimal sult, two points from homogeneous sources can be transport problem between µ and ν can be written as compared through their respective cost distributions E R OT(µ,ν) = min (X,Y )∼γ[c(X, Y )], in P( ). Consequently, MMSs can be represented as γ∈Π(µ,ν) elements of P(P(R)) and compared with a suitable metric on that space. where Π(µ,ν) is the set of joint couplings on X × Y Contributions. We focus first on the energy distance that have marginals µ,ν. When X = Y and the cost c (Székely and Rizzo, 2013; Sejdinovic et al., 2013) com- is a metric on X raised to the power p (with p ≥ 1), the 1/p puted between measures in P(P(R)), using the 1- optimum OT(µ,ν) is called the p-Wasserstein met- Wasserstein distance as a negative definite kernel to ric between µ and ν. We consider next two subcases compare atoms in P(R) (Wasserstein distances are that are relevant for the remainder of this paper. R 1 2 indeed n.d. in P( ), as remarked by Kolouri et al. When a , a ∈ Σn (i.e., discrete), a cost matrix C ∈ (2016)). We call this approach the Anchor Energy Rn×n suffices to instantiate the OT problem: (AE) distance, and our first contribution is to propose an efficient approach to compute it in O(n2 log n) time, 1 2 OT(a , a ) = min Pij Cij , where P∈U(a1,a2) resulting in quasi-linear complexity w.r.t. the input ij 2 X

size (an MMS is described with an n cost matrix). 1 2 n×m 1 ⊤ 2

½ R ½ U(a , a )= {P ∈ + | P m = a , P n = a }, Ryoma Sato, Marco Cuturi, Makoto Yamada, Hisashi Kashima

is the transportation polytope and ½m is the m-vector distances as SGW, but without relying on a projection of ones. When µ,ν ∈ P(R) and c(x, y) = |x − y|p, step. p ≥ 1, one has that (Santambrogio, 2015, §2) Points as Anchors. To compare two MMSets, we 1 map each point from each MMSet to the distribution −1 −1 p OTp(µ,ν)= |Hµ (u) − Hν (u)| du, of its (weighted) ground cost or distance to all other Z0 points within that MMSet. This idea was success- fully used in computer graphics (Gelfand et al., 2005; where Hµ is the empirical or cumulative density func- −1 Mémoli, 2011) under the name of local distribution of tion (CDF) of measure µ, and therefore Hµ is its distances, spanning several successful approaches for quantile function. We will write in that case Wp = 1/p 3D shape comparison. Specifically, the cost distribu- (OTp) for the p-Wasserstein distance. Note further that for p =1, integrating differences in CDF is equal tion from the anchor indexed by i within S = (a, C) to integrating differences in quantile functions: is simply the 1D empirical distribution of the cost of line i in C, weighted by a: ∞ n OT1(µ,ν)= |Hµ(u) − Hν (u)|du. (1) ρ(i, a, C) := aj δC ∈P(R), Z−∞ ij j=1 X Gromov Wasserstein Distance. The Gromov where for x ∈ R, δx is the Dirac mass at x. Notice Wasserstein (GW) problem between two MMSs that this can be interpreted, for continuous measures, (µ,c1), (ν,c2) is defined as follows: as the push-forward (c(x, ·))♯µ for an anchor x ∈ X . We represent an MMSet as a distribution of local dis- E ′ ′ 2 GW(µ,ν)= min (X,Y )∼γ (c1(X, Y ) − c2(X , Y )) . tributions of distances γ∈Π(µ,ν) ′ ′ (X ,Y )∼γ n   A(a, C) := a δ ∈P(P(R)), 1 a1 C1 i ρ(i,a,C) When instantiated on two MMSets, S = ( , ) and i=1 S2 = (a2, C2), this problem reduces to X which we call the anchor feature representation of an 1 2 1 2 2 MMSet. Although the anchor feature map is not in- GW(S ,S ) = min PikPjl(Cij − Ckl) . P∈U(a1,a2) jective (Mémoli, 2011) (i.e., there exists two MMSets ijkl X S1 6= S2 such that A(S1) = A(S2)), many powerful (2) features for 3D shape comparison, such as the global As recalled in §1, the GW distance has found success in distribution of distances (Osada et al., 2002) and several applications, such as shape matching, machine eccentricity (Hamza and Krim, 2003; Manay et al., translation, and graph matching. However, computing 2006), can be computed from the anchor feature of it is NP-hard (Mémoli, 2007), and even simply evalu- a 3D shape. See also (Boutin and Kemper, 2004) for ating it knowing the optimal γ beforehand would have conditions under which a point cloud can be recon- an O(n3) computational pricetag (Vayer et al., 2019). structed from its anchor features.

Slicing Approach. As mentioned above, the Wasser- 3 Anchor Distances and Plans stein distance between two univariate distributions (namely when C = |x − x | for values x ∈ R) ij i j i We build on the several ideas introduced in the previ- can be solved by computing CDFs, and therefore only ous section to present several tools to handle MMSets requires sorting with an O(n log n) time complexity. in practice, when regarded as measures in P(P(R)). The sliced Wasserstein distance (Rabin et al., 2011) We first start with a simple entropic regularization of leverages this feature by projecting measures onto ran- a previous proposal by (Mémoli, 2011), follow with dom 1-dimensional lines, to sum up these Wasserstein instantiating an energy distance (which we show can distances. Recently, Vayer et al. (2019) proposed the be computed in quadratic time in §4) and introduce a sliced Gromov Wasserstein (SGW) distance, which ex- way to recover a plan from it. We conclude this section ploits this idea when comparing two distributions lying with a simple proposal to robustify our quantities that in two Euclidean spaces. To be efficient, this method has a quantifiable experimental impact. requires an additional “realignment” step which com- putes a linear transform of one of the input measures. Anchor Wasserstein (AW). The first anchor-based The SGW approach assumes that both MMSets are distances that we propose is the entropic-regularized embedded in Rd, and can only consider Euclidean ge- Wasserstein distance on P(P(R)) using the Wasser- ometry. That method cannot, therefore, be general- stein distance on P(R) as the ground metric. No- ized to more generic families of costs. In this paper, tice that this “hierarchy” of Wasserstein distance de- we do exploit the fast computation of 1D Wasserstein fined using Wasserstein distances has proved useful Fast and Robust Comparison of Probability Measures in Heterogeneous Spaces

in related tasks (Yurochkin et al., 2019; Dukler et al., when the anchor feature is the distribution of node fea- 2019). This yields in our case, ture distributions. Thus our proposed algorithm can compare graph generative models efficiently as well as 1 2 1 2 AWp(S ,S ) := OTε(A(S ), A(S )) measures in heterogeneous spaces. We confirm this p 1 2 by statistical tests of graph generative models in the = min Pij OTp(ρi ,ρj ) − εH(P) P∈U(a1,a2) ij experiments. X 1 2 p Each term of Eq. 3 can be explicitly expressed as = min Pij min Qkl |Cik − Cjl| − εH(P), P∈U Q∈U ij follows: X Xkl E p 1 2 p where U = U(a1, a2) is the set of transportation plans, h∼A(S1)[OTp(h,g)] = ai aj OTp(h,g) k k k g∼A(S2) ij ρi = ρ(i, a , C ) is the local distribution of distances X 1 2 1 2 p of the i-th node in either the first or second MMSet, = ai aj min Pkl|Cik − Cjl| . (4) P∈U(a1a2) H(P) = − P (log P − 1) is the entropy, and ij ij ij ij X Xkl ε> 0 is the regularization coefficient. This formula is similar to thatP of the GW distance (Eq. 2), but AW op- The last representation looks similar to the GW dis- timizes global and local assignments separately. In the tance (Eq. 2), but the AE distance calculates an aver- case of ε =0, this is equal to the lower bound (TLB) of age of the local OT of all pairs of points, while the GW the GW distance introduced in (Mémoli, 2011). How- distance computes a global OT. The AE distance can ever, ε =0 requires more computational cost for solv- be seen as the simplified version of the AW distance ing a linear programming (Mémoli, 2011). Thanks (thus of the TLB (Mémoli, 2011)) that fixes the local to the entropic regularization, the Sinkhorn algorithm global assignment to the marginalized transportation. (Cuturi, 2013) can speed up the computation. Namely, The AE distance requires cubic time when computed p 1 2 naively because computing an OT of sorted 1D distri- the cost matrix Cij = OTp (ρi ,ρj ) can be computed in O(nm(n + m)) time, and the OT matrix P∗ can butions takes linear time. We show below that the AE distance can be, in fact, computed in quasi-quadratic be computed in quadratic time by the Sinkhorn algo- 2 2 rithm. The total time complexity is cubic, but unlike O((n + m ) log(nm)) time in Section 4. We highlight GW the cost matrix is computed only once and the en- two other contributions in this section: the AE dis- ergy is convex. This cubic computational complexity tance can also be used to devise an efficient transporta- might be, however, prohibitive in some applications. tion plan, and anchor distances can be “robustified” by C We switch to a simpler tool to reach a quasi-quadratic replacing costs in by their ranks. complexity. Anchor Energy Plan (AEP). A strong appeal of Anchor Energy (AE). The second variant OT and GW-based distances is that they also provide of the anchor-based distances is energy dis- assignments. We propose the AE plan (AEP) to fill in tance (Székely and Rizzo, 2013) using the 1D that gap, Wasserstein kernel. Specifically, 1 2 1 2 1 2 p AEP(S ,S )= ai aj argmin Pkl|Cik − Cjl| . P∈U(a1a2) 1 2 E p ij kl AEp(S ,S ) :=2 h∼A(S1)[OTp(h,g)] X X g∼A(S2) This assignment takes an average of the local OTs p E 1 of all pairs of points as the AE distance, while the − h1∼A(S )[OTp(h1,h2)] (3) 1 h2∼A(S ) assignment by the GW distance computes a global p E 2 OT. Therefore, AEP cannot distinguish detailed struc- − g1∼A(S )[OTp(g1,g2)]. 2 g2∼A(S ) ture or take interactions between points into consider- ation, but AEP can provide robust assignment that 2 OT1 and OT2 are conditionally negative definite be- preserves the role of points. For example, suppose cause turning a measure into a quantile function there exists an outlier point in an MMSet. The GW is a feature map onto a Hilbert space of functions assignment is influenced by this outlier point. In (Kolouri et al., 2016). Therefore, AEp is a valid energy contrast, a small fraction of local OTs are affected distance for p =1, 2 (Sejdinovic et al., 2013) and pro- by this outlier, and affected transportation plans do vides a metric structure to the anchor feature space. In not take too large values because all transportation particular, this is a pseudometric in the original mea- plans must be doubly-stochastic. Therefore, trans- sure space. Note that the energy distance can be inter- portation plans of inlier points are dominant terms preted as maximum mean discrepancy (MMD) via the in AEP. We empirically confirm that AEP can pro- distance-induced kernel (Sejdinovic et al., 2013). One vide a good assignment that preserves roles of nodes can also see that Eq. 3 corresponds to the MMD for in networks in the experiments. AEP can be com- comparing graph generative models (You et al., 2018) puted in O(nm(n + m)+ n2 log n + m2 log m) time, Ryoma Sato, Marco Cuturi, Makoto Yamada, Hisashi Kashima

that AE and SL-AE are equal, and that AE refers to a naive summation approach that has cubic time, while SL-AE refers to a mover evolved approach to compute the same quantity. SL-AE computes the AE distance in O((n2+m2) log(nm)) time for any integer p ≥ 1. We only this value changes introduce the case of p =1 for simplicity, but the idea can be naturally extended to general integers p. SL- AE uses a technique inspired from the sweep line algo- Figure 1: Overview of SL-AE: The AE distance is rithm (Shamos and Hoey, 1976; Fortune, 1987), which the integration of the total variations f(l) of CDFs. SL- was originally proposed in the computational geometry AE sums up the total variations f(1),f(2),...,f(n2 + field. Figure 1 illustrates the algorithm behind SL-AE. m2 −1) sweeping a vertical line from left to right. The Summary of SL-AE: The AE distance requires at key idea is only terms that involve a single CDF change least cubic time if each OT is computed separately, in each segment, and these total variations can be man- because the AE distance involves nm OT evaluations, aged efficiently by balanced trees. and each OT evaluation takes at least linear time. SL- AE computes the summation of all OT distances at and an assignment of a single node is computed in once to speed up computations. There are n “red” O(nm log(nm)+m2 log m) time by naive computation. CDFs from the first distribution and m “blue” CDFs from the second distribution. The AE distance is the Robust Anchor Representation. All of our algo- sum of areas between all nm pairs of red and blue 1 2 rithms operate by considering input values C and C . functions because the 1-Wasserstein distance of 1D Because these values may take very different ranges, distributions is equal to the area between the CDFs the problem of rescaling in a suitable way is crucial in of the distributions (Eq. 1). SL-AE integrates these most applications. Yet, the anchor features we have areas from left to right at once rather than comput- introduced are not robust to scaling since A(a, C) will ing each area separately. This change of viewpoint likely differ from A(a, λC) when λ 6=1. This issue is a is important, yet another important insight is that common one when using GW distances. Although di- each CDF is piecewise constant, and the CDF of viding the distance by the largest value in the distance 1 1 ρ(i, a , C ) changes at only n points Ci1, Ci2,..., Cin, matrix can alleviate this problem, if anomaly points, 2 2 which makes L = n + m change points s1,...,sL in noise, or clutters are contained in the MMSets, for ex- total. We assume all change points are unique with- ample due to the scanning process of 3D shapes, it is out loss of generality. SL-AE cuts the real line at these difficult to settle this problem. To overcome this issue, change points, which makes (n2 +m2 −1) segments. In we propose a robust variant of the anchor feature. In- each segment, all CDFs are constant. Thus, the sum of stead of using matrix C directly, we pre-process them the areas in each segment l is the total variation f(l) of to output their normalized rank-based statistics: red and blue CDFs multiplied by the length (sl+1 − sl) of the segment. In addition, adjacent segments are al- 1 most the same, and only one CDF changes. Although C = reshape (σ (vec(C)) , (n,n)). n2 argsort the difference (f(l+1)−f(l)) involves O(n+m) terms, efficient data structures, such as B-tree and Fenwick e This translates into rank-based anchor feature distri- tree, offer a logarithmic time algorithm to compute the butions (Appendix A). The rank-based anchor features partial summation. In total, there are (n2 + m2 − 1) can be computed through a simple argsort of the cost segments, and it takes logarithmic time to update the matrices, in O(n2 log n) time. Our experiments show total variation in each segment; thus, the time com- that the distances using the robust anchor feature per- plexity of SL-AE is O((n2 + m2) log(nm)). form better when the scales of the entries in cost matri- ces of MMSets differ significantly. We also confirm in We can now describe formally our algorithm. Let L = 2 2 the appendix that this modification also “robustifies” n +m be the number of elements in the cost matrices GW. and (il, jl, kl) ∈ {1,...,n}×{1,...,m}×{1, 2} (l = 1, 2,...,L) be indices of the cost matrices in increasing Ck1 Ck2 order of their values. Thus i1j1 ≤ i2j2 ≤ ··· ≤ 4 Sweep Line Anchor Energy. CkL Ckl k iLjL . Let sl = iljl (l =1,...,L) and let Hi (x) be the CDF of ρ(i, ak, Ck). In this section, we propose an efficient computation algorithm for the AE distance, using the sweep line Proposition 1. algorithm. We refer to this quantity as Sweep Line E 1 2 Anchor Energy (SL-AE), but one should keep in mind h1∼A(S1),h2∼A(S2)[OT1(h ,h )] Fast and Robust Comparison of Probability Measures in Heterogeneous Spaces

 Algorithm 1: Sweep Line Anchor Energy  Initialize W , f, S1, and T1 to zero.  Cubic Sort the segments in increasing order.  for l =1, 2,...,L − 1 do 

Subtract the previous values (i.e., the first line    in Prop. 2) from f by querying S and T .  l l   Quadratic  Update Sl and Tl to obtain Sl+1 and Tl+1 (sec) consumption Time based on Eq. 5 and 6.   Add the next values (i.e., the second line in #points Prop. 2) to f by querying Sl+1 and Tl+1. Figure 2: Scalability w.r.t. number of points of various W ← W + (sl+1 − sl) · f distances for MMSets, using cat shapes from Fig. 3. end return W 2 requires O(n + m) time naively because Sk and T k involve O(n + m) terms. This summation can be sped n2+m2−1 up to O(log(nm)) time by balanced trees. The pseudo = (s − s )f(l), l+1 l code of SL-AE is shown in Algorithm 1, and more Xl=1 details are described in the Appendix. where f(l)= n m a1a2 |H1(s ) − H2(s )| is the i=1 j=1 i j i l j l Time Complexity: We analyze the complexity of total variation in x ∈ [sl,sl+1]. P P SL-AE briefly. The sorting of {si}i=1,...n2+m2 requires 2 2 2 2 The key idea of SL-AE is hat in each iteration l, only O((n +m ) log(nm)) time. The loop iterates n +m − kl k k 1 times. In each iteration, there are constant numbers Hil changes (i.e., Hi (sl+1) = Hi (sl) for i 6= il or k k of updates and queries for Ti and Si. If we use balanced k 6= kl). Let S (u, v) and T (u, v) be the partial sums of weights and weighted CDFs: trees as the data structures for Ti and Si, each query and update requires O(log(nm)) times. Therefore, the k k 2 2 Sl (u, v)= ai , (5) total time complexity is O((n + m ) log(nm)). k i : u≤HX(i,sl)

query node matched nodes

cat dog man center node

AE (n=500) A

AW peripheral node

GW GOT

+0.31 +0.25 Figure 3: 3D Shape Comparison: (Top) Examples of 3D shapes. (Bottom) Trade-off between the speed and performance of each method.

scaled + noise Figure 5: Node Assignment: (Top) Red nodes in the right indicate that the nodes are matched, and the gray nodes in the right indicate that the nodes are not matched. This shows that the center node matches original scaled query with the center nodes and the peripheral node matches in database with the peripheral nodes. (Bottom) Correlation coef- Figure 4: Robust 3D shape comparison: (Left) Ex- ficients of the node indices of the matched nodes. amples of original shapes as well as scaled and scaled + noisy queries. (Right) Accuracy of 1-NN classification a single pair of shapes for each distance when we vary using each distance. the regularization coefficient ε. Each point represents one configuration of hyperparameters. This figure also reports the accuracy of the AE distance with n = 500 gives the orders of magnitude to execute each method, points and GOT (Maretic et al., 2019) for illustrative highlighting the quasi-quadratic complexity of SL-AE. purposes. This result highlights that the AE distance performs comparably to the AW and GW distances Trade-off. We assess the trade-off between the speed even though the AE distance works fast and has no and performance of the AE and AW distances via 3D hyperparameters. shape comparisons. As the regularization coefficient ε tends to zero, the number of Sinkhorn iterations Robustness. We apply the robust versions of the AE needed for AW and GW to converge grows. Balancing and AW distances to the SHREC10 robustness bench- a good trade-off between the speed and performance mark (Bronstein et al., 2010), which contains shapes is important in practice. We use the non-rigid world in the database and transformed query shapes for each dataset (Bronstein et al., 2006), which contains 148 shape. We use scaling transformation queries and scal- 3D shapes of 12 classes, to assess the trade-off between ing + noise transformation queries in this experiment. the performance and speed. Figure 3 (Top) shows ex- Each noised query has a noise level from 1 (weak) to amples of 3D shapes. We use the uniform weights and 5 (strong) in this dataset. We use the normal and ro- geodesic distances on 3D shapes as the cost function. bust versions of the AE and AW distances to retrieve We randomly sample n = 100 vertices to construct cost a shape in the database corresponding to the query matrices so that the GW distance is feasible, although shape by the nearest neighbor retrieval. We use the the SL-AE algorithm works efficiently even for larger training data, which contain 12 shapes and 120 query shapes. 148 · 147/2 = 10878 pairs of shapes are com- shapes, for evaluation because no training phase is in- pared in total to conduct 1-NN classification. Figure 3 volved in our classifier. Figure 4 (Right) reports the (Bottom) shows the Pareto optimal points of the 1-NN accuracy for each noise level as well as the accuracy of classification accuracy and average speed to compare the GW distance as a reference record. This indicates Fast and Robust Comparison of Probability Measures in Heterogeneous Spaces

that the robust versions of the AE and AW distances deg. clus. can handle the difference of scaling, whereas normal 0.4 1.5 1.399* distances suffer from a drop in accuracy as the noise 0.330*

level increases. We confirm that our rank-based statis- 0.3 0.285* 0.228 tics can make GW robust as well in Appendix C. 0.227 1.0 0.913* Assignment. In this experiment, we assess perfor- 0.2 0.612 mance on the node assignment task. We use the 0.567 0.5 Barabasi-Albert (BA) model (Barabasi and Albert, 0.1 1999) in this task. The BA graphs resemble real-world Train. G-RNN ER BA Train. G-RNN ER BA networks, such as citations networks and social net- 0.0 0.0 works, thanks to its preferential attachment mecha- Figure 6: Energy distance between the test dataset nism. The latent order of each node in the BA model and each model. ∗ means the distance is significant. represents the order in which the node participates in the network. Therefore, smaller order nodes play s is the number of samples and n is the number of more central roles than larger order nodes. We com- nodes. Our proposed method enables exact compu- pute node assignments of two BA graphs using a vari- tation of the energy distance in O(sn log(sn)) time. ety of graph matching algorithms. A good matching Thus our method is efficient when the number of algorithm is expected to match each node with a node samples is large. We use the ENZYMES dataset with a similar latent order (i.e., similar role). We com- (Schomburg et al., 2004; Borgwardt et al., 2005) to pute the correlation coefficient between the orders of train a GraphRNN (You et al., 2018), Erdős-Rényi matched nodes. High correlation coefficients indicate (ER) model (Erdős and Rényi, 1959), and BA model. that the assignment preserves the role of nodes. We then compute the energy distance between the test dataset and training dataset, GraphRNN, ER, We compute the AEP, GW matching, and optimal and BA model using our proposed method and con- ∗ global assignment P of the AW distance. We use duct two-sample tests by the permutation test with an open implementation (Bunne et al., 2019) to tune α = 0.05. Figure 6 shows that the weak baselines hyperparameters for the GW matching in addition to (i.e., ER and BA) are rejected while the GraphRNN various hyperparameters. We also use GWL (Xu et al., model is not rejected. In other words, the GraphRNN 2019b) and GOT (Maretic et al., 2019), which are model resembles the real distribution in terms of the state-of-the-art OT-based methods for graph match- energy distance, while the ER and BA models are far ing, and the various centrality-based assignments, from the real distribution. This result indicates that where nodes are matched in order of centrality mea- our propose method can detect the effectiveness of sure. We use the geodesic distances as the ground GraphRNN against baselines. This result is consistent metric in this task. The examples of AEP in Figure 5 with You et al. (2018). (Top) show that the center node matches with center nodes and the peripheral node matches with peripheral nodes. Figure 5 (Bottom) reports the average correla- 6 Conclusion tion coefficients of ten pairs of graphs. This indicates that the AEP preserves the roles of nodes better than We introduced in this paper several tools to com- the other methods. The GW and AW matching fail pared two measured metric sets lying in heterogeneous this task because they keenly fit noise signal, while spaces. Our approaches build on the anchor features the AEP is robust to noise thanks to averaging trans- of both measures to represent these measures, which portation plans. are originally lying in different spaces, in the same space of probability measures on probability measures Graph Generative Model Comparison. Our pro- on the line. The AE distance is the energy distance posed method can evaluate graph generative mod- and AW distance is the entropic-regularized Wasser- els efficiently. A graph generative model receives stein distance of the anchor features. We proposed an a seed and outputs a graph. The statistics of the exact algorithm to compute the AE distance exactly generated graphs, such as the degree and clustering in O((n2 + m2) log(nm)) time, shaving a linear term coefficient distributions, of a good generative model w.r.t a naive implementation. We also proposed AEP, should resemble that of the real graphs. We com- a novel assignment method inspired by the AE dis- pare graph distributions using energy distance with tance. We experimentally showed that our sweeping the Wasserstein kernel of the degree and clustering line approach scales quasi-quadratically w.r.t. the size coefficient distributions, following (You et al., 2018). of the support, and it is faster than other OT-based However, naive computation of the energy distance is 2 metrics. We also showed that our proposed methods time consuming due to its O(s n) complexity, where perform well in shape comparison and node assignment Ryoma Sato, Marco Cuturi, Makoto Yamada, Hisashi Kashima tasks, leveraging a simple robustified variant of these track: Robust shape retrieval. In Eurographics methods. While AE and AW distances cannot claim Workshop on 3D Object Retrieval, pages 71–78, 2010. to capture structural correspondences at the level that Charlotte Bunne, David Alvarez-Melis, Andreas can be reached by the GW framework, they do of- Krause, and Stefanie Jegelka. Learning generative fer stable (convex) and efficient numerical approaches models across incomparable spaces. In Proceedings to reach the same goal. We argue that for some of the 36th International Conference on Machine tasks, such shape comparison or social network anal- Learning, ICML, pages 851–861, 2019. ysis, this tradeoff is favorable to AE distance, which reaches comparable performance to that of GW with Marco Cuturi. Sinkhorn distances: Lightspeed compu- simpler computations and concepts. Code is available tation of optimal transport. In Advances in Neural at https://github.com/joisino/anchor-energy. Information Processing Systems 26: Annual Con- ference on Neural Information Processing Systems References 2013, NIPS, pages 2292–2300, 2013. Fernando De Goes, Katherine Breeden, Victor Ostro- Yonathan Aflalo, Alexander Bronstein, and Ron Kim- moukhov, and Mathieu Desbrun. Blue noise through mel. On convex relaxation of graph isomorphism. optimal transport. ACM Transactions on Graphics Proceedings of the National Academy of Sciences, (TOG), 31(6):1–11, 2012. 112(10):2942–2947, 2015. David Alvarez-Melis and Tommi S. Jaakkola. Gromov- Yonatan Dukler, Wuchen Li, Alex Tong Lin, and wasserstein alignment of word embedding spaces. In Guido Montúfar. Wasserstein of wasserstein loss for Proceedings of the 2018 Conference on Empirical learning generative models. In Proceedings of the Methods in Natural Language Processing, EMNLP, 36th International Conference on , pages 1881–1890, 2018. ICML, pages 1716–1725, 2019. Martín Arjovsky, Soumith Chintala, and Léon Bottou. Paul Erdős and Alfréd Rényi. On random graphs I. Wasserstein generative adversarial networks. In Pro- Publicationes Mathematicae, 6:290–297, 1959. ceedings of the 34th International Conference on Ma- Peter M. Fenwick. A new data structure for cumulative chine Learning, ICML, pages 214–223, 2017. frequency tables. Softw., Pract. Exper., 24(3):327– Albert-Laszlo Barabasi and Reka Albert. Emergence 336, 1994. of scaling in random networks. Science, 286(5439): Steven Fortune. A sweepline algorithm for voronoi 509–512, 1999. diagrams. Algorithmica, 2(1-4):153, 1987. Rudolf Bayer and Edward M. McCreight. Organiza- tion and maintenance of large ordered indices. Acta Natasha Gelfand, Niloy J. Mitra, Leonidas J. Guibas, Inf., 1:173–189, 1972. and Helmut Pottmann. Robust global registration. In Third Eurographics Symposium on Geometry Pro- Karsten M Borgwardt, Cheng Soon Ong, Stefan Schö- cessing, pages 197–206, 2005. nauer, SVN Vishwanathan, Alex J Smola, and Hans- Peter Kriegel. Protein function prediction via graph Aude Genevay, Gabriel Peyré, and Marco Cuturi. kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005. Learning generative models with sinkhorn diver- gences. In Proceedings of the 21st International Con- Mireille Boutin and Gregor Kemper. On reconstruct- ference on Artificial Intelligence and Statistics, AIS- ing n-point configurations from the distribution of TATS, pages 1608–1617, 2018. distances or areas. Adv. Appl. Math., 32(4):709–735, 2004. Edouard Grave, Armand Joulin, and Quentin Berthet. Sergey Brin and Lawrence Page. The anatomy of a Unsupervised alignment of embeddings with wasser- large-scale hypertextual web search engine. Com- stein procrustes. In Proceedings of the 22nd Inter- puter Networks, 30(1-7):107–117, 1998. national Conference on Artificial Intelligence and Statistics, AISTATS, pages 1880–1890, 2019. Alexander M. Bronstein, Michael M. Bronstein, and Ron Kimmel. Efficient computation of isometry- A. Ben Hamza and Hamid Krim. Geodesic object rep- invariant distances between surfaces. SIAM J. Sci- resentation and recognition. In Discrete Geometry entific Computing, 28(5):1812–1836, 2006. for Computer Imagery, 11th International Confer- ence, DGCI, pages 378–387, 2003. Alexander M. Bronstein, Michael M. Bronstein, Um- berto Castellani, Bianca Falcidieno, Andrea Fusiello, Hicham Janati, Thomas Bazeille, Bertrand Thirion, Afzal Godil, Leonidas J. Guibas, Iasonas Kokkinos, Marco Cuturi, and Alexandre Gramfort. Multi- Zhouhui Lian, Maks Ovsjanikov, Giuseppe Patanè, subject meg/eeg source imaging with sparse multi- Michela Spagnuolo, and Roberto Toldo. Shrec’10 task regression. NeuroImage, page 116847, 2020. Fast and Robust Comparison of Probability Measures in Heterogeneous Spaces

Itay Kezurer, Shahar Z. Kovalsky, Ronen Basri, and Yang, Zachary DeVito, Martin Raison, Alykhan Te- Yaron Lipman. Tight relaxation of quadratic match- jani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, ing. Comput. Graph. Forum, 34(5):115–128, 2015. Junjie Bai, and Soumith Chintala. Pytorch: An Jon M. Kleinberg. Authoritative sources in a hyper- imperative style, high-performance deep learning li- linked environment. J. ACM, 46(5):604–632, 1999. brary. In Advances in Neural Information Process- ing Systems 32: Annual Conference on Neural In- Soheil Kolouri, Yang Zou, and Gustavo K Rohde. formation Processing Systems 2019, NeurIPS, pages Sliced wasserstein kernels for probability distribu- 8024–8035, 2019. tions. In Proceedings of the IEEE Conference on Gabriel Peyré, Marco Cuturi, and Justin Solomon. Computer Vision and Pattern Recognition, pages Gromov-wasserstein averaging of kernel and dis- 5258–5267, 2016. tance matrices. In Proceedings of the 33rd Inter- Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian national Conference on Machine Learning, ICML, Weinberger. From word embeddings to document pages 2664–2672, 2016. distances. In Proceedings of the 32nd International Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Conference on Machine Learning, ICML, pages 957– Bernot. Wasserstein barycenter and its application 966, 2015. to texture mixing. In Proceedings of Scale Space Yann LeCun, Léon Bottou, Yoshua Bengio, and and Variational Methods in Computer Vision, pages Patrick Haffner. Gradient-based learning applied to 435–446, 2011. document recognition. Proceedings of the IEEE, 86 Antoine Rolet, Marco Cuturi, and Gabriel Peyré. Fast (11):2278–2324, 1998. dictionary learning with a smoothed wasserstein loss. Siddharth Manay, Daniel Cremers, Byung-Woo Hong, In Artificial Intelligence and Statistics, pages 630– Anthony J. Yezzi, and Stefano Soatto. Integral in- 638, 2016. variants for shape matching. IEEE Trans. Pattern Tim Salimans, Han Zhang, Alec Radford, and Dimitris Anal. Mach. Intell., 28(10):1602–1618, 2006. Metaxas. Improving GANs using optimal transport. Hermina Petric Maretic, Mireille El Gheche, Giovanni In Proceedings of the Sixth International Conference Chierchia, and Pascal Frossard. GOT: an optimal on Learning Representations, ICLR, 2018. transport framework for graph comparison. In Ad- Filippo Santambrogio. Optimal transport for applied vances in Neural Information Processing Systems mathematicians. Birkhauser, 2015. 32: Annual Conference on Neural Information Pro- cessing Systems 2019, NeurIPS, pages 13876–13887, Geoffrey Schiebinger, Jian Shu, Marcin Tabaka, 2019. Brian Cleary, Vidya Subramanian, Aryeh Solomon, Joshua Gould, Siyan Liu, Stacie Lin, Peter Berube, Facundo Mémoli. On the use of gromov-hausdorff dis- et al. Optimal-transport analysis of single-cell gene tances for shape comparison. In Proceedings of the expression identifies developmental trajectories in re- Symposium on Point Based Graphics, pages 81–90, programming. Cell, 176(4):928–943, 2019. 2007. Morgan A Schmitz, Matthieu Heitz, Nicolas Bon- Facundo Mémoli. Gromov–Wasserstein distances and neel, Fred Ngole, David Coeurjolly, Marco Cuturi, the metric approach to object matching. Founda- Gabriel Peyré, and Jean-Luc Starck. Wasserstein tions of Computational Mathematics, 11(4):417–487, dictionary learning: Optimal transport-based unsu- 2011. pervised nonlinear dictionary learning. SIAM Jour- Mark Newman. Networks. Oxford university press, nal on Imaging Sciences, 11(1):643–678, 2018. 2018. Ida Schomburg, Antje Chang, Christian Ebeling, Mar- Kangyu Ni, Xavier Bresson, Tony F. Chan, and Selim ion Gremse, Christian Heldt, Gregor Huhn, and Di- Esedoglu. Local histogram based segmentation us- etmar Schomburg. Brenda, the enzyme database: ing the wasserstein distance. International Journal updates and major new developments. Nucleic acids of Computer Vision, 84(1):97–111, 2009. research, 32(suppl_1):D431–D433, 2004. Robert Osada, Thomas A. Funkhouser, Bernard Dino Sejdinovic, Bharath Sriperumbudur, Arthur Chazelle, and David P. Dobkin. Shape distributions. Gretton, Kenji Fukumizu, et al. Equivalence of ACM Trans. Graph., 21(4):807–832, 2002. distance-based and rkhs-based statistics in hypoth- Adam Paszke, Sam Gross, Francisco Massa, Adam esis testing. The Annals of Statistics, 41(5):2263– Lerer, James Bradbury, Gregory Chanan, Trevor 2291, 2013. Killeen, Zeming Lin, Natalia Gimelshein, Luca Michael Ian Shamos and Dan Hoey. Geometric inter- Antiga, Alban Desmaison, Andreas Köpf, Edward section problems. In Proceedings of the 17th An- Ryoma Sato, Marco Cuturi, Makoto Yamada, Hisashi Kashima

nual Symposium on Foundations of Computer Sci- ence, FOCS, pages 208–215, 1976. Justin Solomon, Gabriel Peyré, Vladimir G. Kim, and Suvrit Sra. Entropic metric alignment for correspon- dence problems. ACM Trans. Graph., 35(4):72:1– 72:13, 2016. Karl-Theodor Sturm. The space of spaces: curvature bounds and gradient flows on the space of metric measure spaces. arXiv preprint arXiv:1208.0434, 2012. Gábor J Székely and Maria L Rizzo. Energy statistics: A class of statistics based on distances. Journal of statistical planning and inference, 143(8):1249–1272, 2013. Titouan Vayer, Rémi Flamary, Romain Tavenard, Laetitia Chapel, and Nicolas Courty. Sliced gromov- wasserstein. In Advances in Neural Information Processing Systems 32: Annual Conference on Neu- ral Information Processing Systems 2019, NeurIPS, 2019. Hongteng Xu, Dixin Luo, and Lawrence Carin. Scal- able gromov-wasserstein learning for graph parti- tioning and matching. In Advances in Neural Infor- mation Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, pages 3046–3056, 2019a. Hongteng Xu, Dixin Luo, Hongyuan Zha, and Lawrence Carin. Gromov-wasserstein learning for graph matching and node embedding. In Proceed- ings of the 36th International Conference on Ma- chine Learning, ICML, pages 6932–6941, 2019b. Karren Dai Yang, Karthik Damodaran, Saradha Venkatachalapathy, Ali C Soylemezoglu, GV Shiv- ashankar, and Caroline Uhler. Predicting cell lin- eages using autoencoders and optimal transport. PLoS computational biology, 16(4):e1007828, 2020. Jiaxuan You, Rex Ying, Xiang Ren, William L. Hamil- ton, and Jure Leskovec. Graphrnn: Generating re- alistic graphs with deep auto-regressive models. In Proceedings of the 35th International Conference on Machine Learning, ICML, pages 5694–5703, 2018. Mikhail Yurochkin, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, and Justin M. Solomon. Hier- archical optimal transport for document representa- tion. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Informa- tion Processing Systems 2019, NeurIPS, pages 1599– 1609, 2019. Fast and Robust Comparison of Probability Measures in Heterogeneous Spaces

Algorithm 2: Detailed pseudo code of SL-AE. W ← 0, f[0] ← 0 H[0][i] ← 0 (i =0,...,n) H[1][j] ← 0 (j =0,...,m) 1 t[i · n + j] ← (Cij , i, j, 1, 2) (i, j =1,...,n) 2 2 t[n + i · n + j] ← (Cij , i, j, 2, 1) (i, j =1,...,m) ′ 2 2 (sl,il, jl, kl, kl) ← sorted(t)[l] (l =1,...,n + m ) Initialize S1, S2, T 1, and T 2 to 0 for l =1,...n2 + m2 − 1 do c ← H[kl][il] f[l] ← f[l − 1] k ′ ′ a l kl kl f[l] ← f[l] − il · (S (−∞,c) · c − T (−∞,c)) k ′ ′ a l kl kl f[l] ← f[l] − il (T (c, ∞) − S (c, ∞) · c) kl akl Add(S ,H[kl][il], − il ) Add( kl akl ) T ,H[kl][il], − il H[kl][il] H[k ][i ] ← H[k ][i ]+ akl l l l l jl kl akl Add(S ,H[kl][il], il ) kl akl Add(T ,H[kl][il], il H[kl][il]) c′ ← H[k ][i ] l l ′ ′ kl ′ ′ ′ a kl kl f[l] ← f[l]+ il · (S (−∞,c ) · c − T (−∞,c )) ′ ′ kl ′ ′ ′ a kl kl f[l] ← f[l]+ il (T (c , ∞) − S (c , ∞) · c ) W ← W + (sl+1 − sl) · f[l] end return W A Robust Anchor Feature

The robust anchor feature is defined as: 1 p(i, j, C) := #{(k,l) | C < C } ∈ [0, 1], n2 kl ij n

ρR(i, a, C) := aj δp(i,j,C) ∈P([0, 1]), j=1 X n

AR(a, C) := aiδρR(i,a,C) ∈P(P([0, 1])), i=1 X where # denotes the number of elements in the set. Example 3. When the cost matrix is 0 4.5 4.1 4.3 4.5 0 0.5 0.4 C = , 4.1 0.5 0 0.5 4.3 0.4 0.5 0    then the robust anchor feature corresponds to the standard anchor feature with the cost matrix 0 15/16 11/16 13/16 15/16 0 7/16 5/16 C′ = . 11/16 7/16 0 7/16  13/16 5/16 7/16 0      B Experimental Details

We use the Fenwick trees (Fenwick, 1994) for the implementation of Sk and T k. We use the Sinkhorn algorithm with entropic regularization proposed by Solomon et al. (2016) (Algorithm 1 in (Solomon et al., 2016)) to com- pute the GW distance. Note that Solomon et al. (2016) used α for the symbol of the regularization coefficient, Ryoma Sato, Marco Cuturi, Makoto Yamada, Hisashi Kashima

AE (n=500) AE (n=500) AE AE

AW AW

GW GOT GW GOT

Figure 7: Top-3 accuracy of 1-NN. Figure 8: Top-5 accuracy of 1-NN. but we use ε instead. We set the momentum term to η =1 because we found this did not affect the result much in the initial experiments. The Sinkhorn algorithm for the AW distance and the inner loop of the GW distance stops when the relative change in the transportation plan becomes less than 10−6. The outer loop of the GW distance stops when the relative change in the transportation plan or the objective value is less than 10−6. We measure the speeds of algorithms with a single core of an Intel Xeon E7-4830 CPU.

B.1 Scalability (Q1)

We compute the distances with the geodesic distance costs and uniform distribution. Specifically, we first compute the geodesic distance matrices D1 and D2 of the two cat shapes. For n = 32, 64,..., 2048, we sample n points uniformly and randomly for each shape and sample the corresponding rows and columns of the distance 1 2 Rn×n 1 2 matrix to make Cn and Cn ∈ + . We compute the AE, AW, and GW distances of Cn and Cn with the uniform distribution. We do not include the time consumption of preprocessing (i.e., the distance calculation and sampling) in the result. We set ε = 10−5 for the AW distance and ε = 10 for the GW distance. The regularization constants are different because they need to adjust to the scale of the ground costs used within AW and GW, which are themselves different (one uses W1, the other the original ground metric).

B.2 Trade-off (Q2)

In this experiment, we exclude the “shark” class from the dataset because this class contains only one shape. We first compute the geodesic distance matrix for each shape. For each shape, we then sample n = 100 points uniformly and randomly and sample the corresponding rows and columns of the distance matrix to make Rn×n C ∈ + . We compute the AE, AW, and GW distances of each pair of sampled distance matrices with uniform distribution. We compute distances using a large-scale computer cluster in parallel. The time consumption is the average time to compute the distance of a pair of shapes with a single core of Intel Xeon E7-4830 CPU for 100 random pairs. We use the official implementation of GOT (Maretic et al., 2019) https://github.com/Hermina/GOT and vary the number of iterations to assess the trade-off in this experiment. We use negative exponential transformation Wij = exp(−Cij /σ) to construct the weight matrix W from the cost matrix C. We use σ = 10 chosen from 2 2 σ = 5, 10, 20 in validation. We also used the Gaussian kernel Wij = exp(−Cij /σ ) but did not observe any improvement in early experiments. We also measure top-3 and top-5 accuracy of the 1-NN classification. Figures 7 and 8 plot the Pareto optimal points of the accuracy and speed and show similar tendencies to Figure 3.

B.3 Robustness (Q3)

We set ε = 10−6 for the AW distance, ε = 10−8 for the robust AW distance, and ε = 100 for the GW distance. The regularization constants are different because they need to adjust to the scale of the ground costs used within Fast and Robust Comparison of Probability Measures in Heterogeneous Spaces

1.0 Robust AW 0.8 Robust GW

0.6

0.4 GW Robust AE Accuracy AW

0.2 AE

0.0 1 2 3 4 5 Noise Level

Figure 9: Robust GW. AW, robust AW, and GW, which are themselves different. We found these hyperparameters did not affect the performance much in the early experiments.

B.4 Assignment (Q4)

We use the Barabasi Albert model with n = 200 nodes and m =2 attachments in this experiment. The centrality- based baseline methods in Figure 5 are as follows. Closeness: The closeness centrality (Newman, 2018, §7.1.6). Between: betweenness centrality (Newman, 2018, §7.1.7). Degree: degree centrality (Newman, 2018, §7.1.1). Eigenvector: eigenvector centrality (Newman, 2018, §7.1.2). Katz: Katz centrality (Newman, 2018, §7.1.3). PageRank: PageRank score (Brin and Page, 1998). HITS(Hub): hub score of the HITS algorithm (Kleinberg, 1999). HITS(Auth): authority score of the HITS algorithm (Kleinberg, 1999). We use the official implementa- tion of GOT (Maretic et al., 2019) https://github.com/Hermina/GOT and the official implementation of GWL (Xu et al., 2019b) https://github.com/HongtengXu/gwl. It should be noted that this task is more difficult than the task in the GWL paper (Xu et al., 2019b) because we match different graphs (generated by the same model), whereas Xu et al. (2019b) matched a graph with the same graph with noise.

B.5 Graph Model Comparison (Q5)

We use the official implementation of GraphRNN https://github.com/JiaxuanYou/graph-generation for the GraphRNN, ER model, and BA model. The parameters of the ER model and BA model are estimated by the rule-based method (i.e., the “general” option in the implementation). Note that the authors of the original GraphRNN paper (You et al., 2018) used the exponential negative Wasserstein distance as the kernel of MMD, while we use the raw Wasserstein distance as the ground distance of the energy distance. The theory of the energy distance (Sejdinovic et al., 2013) shows that our energy distance can be interpreted as MMD and used for statistical hypothesis testing because the 1-dimensional Wasserstein distance is conditionally negative. Ryoma Sato, Marco Cuturi, Makoto Yamada, Hisashi Kashima

Data AE AW

Figure 10: Barycenters of 3D digits: Barycenters of 3D digits in 2-dimensional Euclidean space w.r.t. the AE and AW geometries. Fast and Robust Comparison of Probability Measures in Heterogeneous Spaces

C Additional Experiments

Robust GW: Can we apply our robustification technique to GW? We apply our proposed robust technique (i.e., rank-based statistics) to the GW distance and conduct an experiments with the same setting as Q3. We would like to emphasize that applying the rank-based statistics to GW is also a part of our contribution. Figure 9 shows that our technique makes the GW distance robust as well. Since the GW distance is used in many existing works, our proposed method is shown to have all the more significance in this field. Barycenter: Do AE and AW barycenters provide good summaries of datasets from different domains? We compute barycenters of 3D shapes generated from the MNIST handwritten digits dataset (LeCun et al., 1998) into 2-dimensional Euclidean space with respect to the AE and AW geometries. It is difficult to summarize them because the data are lying in a different space (i.e., 3-dimensional Euclidean space) from the barycenter (2-dimensional Euclidean space). The AE and AW distances can be applied to these settings since they can compare measures lying in different spaces. The shapes are rotated in 3-dimensional spaces. Specifically, we generate the dataset by the following process: (1) we sample 100 points {pi}i=1,...,100 in the 2-dimensional lattice {1, 2,..., 28}×{1, 2,..., 28} with a probability proportional to the brightness of a pixel, (2) we sample a random 3-dimensional rotation matrix M uniformly, (3) we rotate points {pi} by M, and then (4) we add i.i.d. Gaussian noise N (0, 0.25I) to each point. We optimize the sum of AE and AW distances from the data shapes by gradient descent. The gradients are computed by the auto-gradient package PyTorch (Paszke et al., 2019). Figure 10 shows the data and barycenters. The AE barycenters fails to provide good summaries of digits 2 and 3, whereas the AW barycenters preserve the characteristics of all digits thanks to more precise global matching.

D Proofs

k Proof of Proposition 1. Since Hi (x) is a piecewise-constant function, n m E p 1 2 1 2 p 1 1 2 2 h1∼A(S1),h2∼A(S2)[OTp(h ,h )] = ai aj OTp(ρ(i, a , C ),ρ(j, a , C )) i=1 j=1 X X n m ∞ 1 2 1 2 = ai aj |Hi (x) − Hj (x)|dx i=1 j=1 0 X X Z n m K−1 1 2 1 2 = ai aj (sl+1 − sl)|Hi (sl) − Hj (sl)| i=1 j=1 X X Xl=1 K−1 n m 1 2 1 2 = (sl+1 − sl) ai aj |Hi (sl) − Hj (sl)| i=1 j=1 Xl=1 X X K−1 = (sl+1 − sl)f(l). Xl=1

1 Proof of Proposition 2. We assume n1 = n2 = n without loss of generality by appropriately zero-padding a and a2.

n ′ ′ akl akl kl kl f(l + 1) − f(l)= − il x |Hil (sl) − Hx (sl)| x=1 ! X n ′ ′ akl akl kl kl + il x |Hil (sl+1) − Hx (sl+1)| x=1 ! X n ′ ′ akl akl kl = − il x |c − Hx (sl)| x=1 ! X n ′ ′ akl akl ′ kl + il x |c − Hx (sl+1)| x=1 ! X Ryoma Sato, Marco Cuturi, Makoto Yamada, Hisashi Kashima

′ kl kl kl kl = − a a c − a Hx (s )  il x x l  ′ k l  x : HXx (sl)