SEQUENCES OF WELL-DISTRIBUTED VERTICES ON GRAPHS AND SPECTRAL BOUNDS ON OPTIMAL TRANSPORT

LOUIS BROWN

Abstract. Given a graph G = (V,E), suppose we are interested in selecting a sequence of vertices $(x_j)_{j=1}^{n}$ such that $\{x_1, \dots, x_k\}$ is ‘well-distributed’ uniformly in k. We describe a greedy algorithm motivated by potential theory and corresponding developments in the continuous setting. The algorithm performs nicely on graphs and may be of use for sampling problems. We can interpret the algorithm as trying to greedily minimize a negative Sobolev norm; we explain why this is related to Wasserstein distance by establishing a purely spectral bound on the Wasserstein distance on graphs that mirrors R. Peyre’s estimate in the continuous setting. We illustrate this with many examples and discuss several open problems.

1. Introduction

1.1. Introduction. The purpose of this paper is to discuss a basic problem on finite graphs G = (V,E) that is already difficult on the unit interval [0, 1]. Given a metric space (say, the unit interval, a compact manifold or a finite graph), how does one construct a sequence $x_1, x_2, \dots$ of elements in the space such that their distribution is uniformly good? By this we mean that if one takes the first k elements $\{x_1, \dots, x_k\}$, then this set is very nearly as evenly distributed on the space as any set of k elements would be. We have not clearly defined what notion of ‘well-distributed’ we mean; this will depend on the actual setting, and the question is frequently interesting for several different such notions. There is not even a canonical answer on the unit interval [0, 1], but the van der Corput sequence

\[ \frac{1}{2}, \frac{1}{4}, \frac{3}{4}, \frac{1}{8}, \frac{5}{8}, \frac{3}{8}, \frac{7}{8}, \dots \]

constructed via an inverse binary digit expansion (see [10]), is a good example.

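arXiv:2008.11296v2 [math.CO] 4 Mar 2021

[Figure 1: the points 1/8, 1/4, 3/8, 1/2, 5/8, 3/4, 7/8 marked on [0, 1] and labeled 4, 2, 6, 1, 5, 3, 7, their order of appearance in the sequence.]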

Figure 1. The first 7 elements of the van der Corput sequence.

2010 Mathematics Subject Classification. 05C35, 05C85, 05C90, 31B10, 44A99, 49Q20.
Key words and phrases. Sampling, Graph, Manifold, Inverse Laplacian, Potential Theory.
This paper is part of the author’s PhD thesis at Yale University and has been partially supported by NSF (DMS-1763179).

We see that the first k elements of the sequence are not as uniformly distributed as k equispaced points but they are always fairly uniformly distributed independently of what k is. There are many different reasons why one could be interested in such sequences: they are natural sampling points for functions (especially for on-line selection and in cases where one does not know in advance in how many points one can sample) but there is also an obvious combinatorial question (‘How well distributed can sequences be? What is the unavoidable degree of irregularity?’).
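The inverse binary digit expansion behind the van der Corput sequence can be made concrete with a short sketch. The function name `van_der_corput` is ours and purely illustrative: the k-th element is obtained by writing k in binary and reflecting its digits about the binary point.

```python
from fractions import Fraction

def van_der_corput(k: int) -> Fraction:
    """k-th element (k >= 1): reflect the binary digits of k about the binary point."""
    value = Fraction(0)
    weight = Fraction(1, 2)          # weight of the next reflected digit
    while k > 0:
        k, digit = divmod(k, 2)      # peel off the lowest binary digit of k
        value += digit * weight
        weight /= 2
    return value

first_seven = [van_der_corput(k) for k in range(1, 8)]
print([str(x) for x in first_seven])
# ['1/2', '1/4', '3/4', '1/8', '5/8', '3/8', '7/8']
```

Exact rational arithmetic is used only so that the output can be compared directly with the fractions listed above.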

Figure 2. The first k (here k = 2, 3, 4) elements of the sequence are nearly as evenly distributed as any set of k vertices could be.

Historically, the question has led to remarkable connections to other fields of mathematics such as Number Theory, Harmonic Analysis and even Probability Theory (we will discuss several of these connections in §6). We will now state one informal version of the main problem before stating a precise version further below.

Main Problem (informal version). Given a finite graph G = (V,E), how would one select a sequence of vertices that are uniformly good? In what metric would one measure the ‘goodness’ of such a sequence?

Figure 3. The Frucht Graph (see §4.3) and the enumeration obtained by the algorithm when starting with vertex 1.

Fig. 2 contains a simple such example: taking the Truncated Tetrahedral Graph (see §4.2), in which order should one select the vertices so as to obtain a sequence that is uniformly evenly distributed? Fig. 3 showcases another example, on the Frucht Graph (see §4.3). In Fig. 5, we display an example on the Nauru Graph, a 3-regular graph with 24 vertices and 36 edges. Even before making the notion of quality precise, we can get some intuition from these simple examples. The enumeration of the vertices in each case was generated by the algorithm discussed below, in §2.2.

The organization of this article is as follows: in §1.2 we introduce the Wasserstein distance, and state Kantorovich-Rubinstein duality in the setting of finite graphs. In §2, we introduce a novel algorithm for selecting evenly distributed vertices on graphs, and in §2.3 we provide a Theorem quantitatively demonstrating the even distribution of these vertices. In §3 we explore the connection between optimal transport cost and spectral properties of the Laplacian, providing in §3.2 a Theorem bounding the former in terms of the latter on graphs. In §3.4 and §3.5 we conduct a case study of cycle graphs and torus grid graphs, respectively, in the context of this Theorem. In §4, we provide numerics displaying the performance of the algorithm on a variety of different graphs. In §5, we present proofs of both Theorems. In §6, we list connections to results in the continuous setting.

Figure 4. $W_1\big((\delta_{x_1} + \delta_{x_4})/2,\ \sum_{k=1}^{6} \delta_{x_k}/6\big) = 2/3$ on the 6-cycle $x_1, \dots, x_6$: We can transport 1/6 units of mass from $x_1$ to each of $x_2$ and $x_6$, and similarly from $x_4$ to $x_3$ and $x_5$, incurring a total cost of 4 × 1/6.

1.2. Wasserstein Distance. Wasserstein Distance is a notion of distance on probability distributions introduced by Wasserstein in 1969 [38]. Its simplest instance is $W_1(\mu, \nu)$, also known as Earth Mover’s Distance, which is defined as the minimal cost required to transport one distribution µ to match another distribution ν. Here, cost is defined as mass × distance: transporting ε units of mass over a distance of δ has a cost of εδ. The notion of Wasserstein distance can be immediately applied to finite graphs: we will work only in the simple case of unweighted graphs where the distance between two vertices is given as the length of the shortest path connecting them; extensions to the weighted case are conceivable. In short, transporting ε mass over a single edge has a $W_1$ cost of ε (see Fig. 4). More formally, on an abstract metric space X equipped with a metric d, we define

\[ W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu,\nu)} \int_{X \times X} d(x, y)^p \, d\gamma(x, y) \right)^{1/p}, \]

where Γ(µ, ν) denotes the collection of all measures on X × X with marginals µ and ν, respectively (also called the set of all couplings of µ and ν). Note that µ, ν need not be probability measures. We will, throughout this paper, work exclusively with the Earth Mover’s Distance $W_1$ (although extensions to more general $W_p$ are certainly conceivable). The Earth Mover’s Distance is particularly nice to work with: by Kantorovich-Rubinstein duality (see e.g., [39])

\[ W_1(\mu, \nu) = \sup \left\{ \int_X f \, d\mu - \int_X f \, d\nu : f \text{ is 1-Lipschitz} \right\}. \]

Because of this, $W_1$ is translation invariant: for positive measures $\mu, \nu, \mu'$, we have

\[ W_1(\mu, \nu) = W_1(\mu + \mu', \nu + \mu'). \]

Thus, if we have a positive measure µ decomposed into signed (not necessarily positive) measures $\mu_1, \mu_2$ as $\mu = \mu_1 + \mu_2$, we may use this translation invariance to write

\[ W_1(\mu, \nu) = W_1(\mu_1^+ + \mu_2^+, \ \nu + \mu_1^- + \mu_2^-), \]

where $\mu_i^+ = \max\{\mu_i, 0\}$ is the positive part of $\mu_i$ and $\mu_i^- = \max\{-\mu_i, 0\}$ is the negative part.

Now that we have defined the Wasserstein distance, we may make precise our original question regarding evenly distributed vertices:

Main Problem (formal version). Given a finite graph G = (V,E), how would one select a sequence of vertices such that

\[ W_1\left( \frac{1}{k} \sum_{j=1}^{k} \delta_{x_j}, \ dx \right) \quad \text{is small for all } k, \]

where dx is normalized counting measure having weight $|V|^{-1}$ on each vertex of G.

How small one could expect this quantity to be will depend on the particular geometry of the graph. For simplicity of exposition, consider $M = \mathbb{T}^d$, the d-dimensional torus with normalized volume measure dx: it is a basic exercise to show that the Earth Mover’s Distance satisfies

\[ W_1\left( \frac{1}{n} \sum_{j=1}^{n} \delta_{x_j}, \ dx \right) \ge c_d \, n^{-1/d} \]

for all sets of points $\{x_1, \dots, x_n\} \subset M$. This clearly shows that the geometry (here: the dimension d) plays a role in what we can expect. We also emphasize that it is almost surely the case that our main question (as asked in §1.1) is of interest also for many other ways of making the notion of distribution quantitative, and Wasserstein distance may be one of many (though certainly a rather canonical one).

We conclude our short introduction to Wasserstein distance by stating Kantorovich-Rubinstein duality, mentioned above, when X is a finite graph.

Proposition (Kantorovich-Rubinstein, see e.g., [17, 27]). Let G = (V,E) be a finite, simple graph, let f : V → R and let W ⊂ V be a subset of vertices. Then

\[ \left| \frac{1}{|V|} \sum_{x \in V} f(x) - \frac{1}{|W|} \sum_{x \in W} f(x) \right| \le W_1\left( \frac{1}{|W|} \sum_{x \in W} \delta_x, \ dx \right) \max_{x_i \sim x_j} |f(x_i) - f(x_j)|. \]

This shows that our notion of uniform distribution of a subset of vertices has a natural connection to the question of sampling on graphs (i.e., reconstructing the average value of a ‘smooth’ function by sampling in a subset of the vertices). The theory of sampling on graphs is in its infancy but rapidly developing; we refer to [16, 20, 25, 26, 33].
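Both the transport cost described in Figure 4 and the sampling inequality of the Proposition can be checked by hand on the 6-cycle. The following is a minimal sketch, not the linear program used later in the paper: it relies on the fact that on a cycle graph, transport is a flow along the edges that is determined up to one free constant, and the cost is minimized when that constant is a median of the partial sums of µ − ν. The helper `w1_on_cycle` and the test function `f` are our own illustrative choices.

```python
from fractions import Fraction

def w1_on_cycle(mu, nu):
    """Exact W1 between two measures of equal total mass on the cycle graph C_n.

    If flow[i] is the net mass sent from vertex i to vertex i+1 (mod n), then
    conservation forces flow[i] = c + partial[i], where partial[i] is the i-th
    partial sum of mu - nu; the cost sum(|flow[i]|) is minimized at c = -median.
    """
    partial, running = [], Fraction(0)
    for a, b in zip(mu, nu):
        running += a - b
        partial.append(running)
    median = sorted(partial)[len(partial) // 2]
    return sum(abs(p - median) for p in partial)

n = 6
uniform = [Fraction(1, n)] * n
sample = [0, 3]                      # x1 and x4 of Figure 4, indexed from 0
mu = [Fraction(1, len(sample)) if i in sample else Fraction(0) for i in range(n)]
cost = w1_on_cycle(mu, uniform)
print(cost)                          # 2/3

f = [0, 1, 3, 2, 2, 1]               # hypothetical function on the vertices
lhs = abs(Fraction(sum(f), n) - Fraction(sum(f[i] for i in sample), len(sample)))
lipschitz = max(abs(f[i] - f[(i + 1) % n]) for i in range(n))
print(lhs, "<=", cost * lipschitz)   # 1/2 <= 4/3
```

The median trick is specific to cycles; on a general graph one would solve the transport linear program instead.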

Figure 5. The Nauru Graph on 24 vertices: algorithm starting at 1.

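2. The Algorithm

2.1. Setup. We recall that for a finite graph G = (V,E), we can define the adjacency matrix

\[ A = (a_{ij})_{i,j=1}^{|V|} \quad \text{where} \quad a_{ij} = \begin{cases} 1 & \text{if } x_i \sim_E x_j \\ 0 & \text{otherwise} \end{cases} \]

as well as the degree matrix

\[ D = (d_{ij})_{i,j=1}^{|V|} \quad \text{where} \quad d_{ij} = \begin{cases} \deg(x_i) & \text{if } i = j \\ 0 & \text{otherwise.} \end{cases} \]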

With these definitions, we can define a notion of a Laplacian via L = D − A.
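As a small illustration of these definitions, the following sketch builds A, D and L = D − A for a hypothetical 4-vertex graph (the edge list is our own example) and confirms that every column of L sums to zero.

```python
def laplacian(n, edges):
    """Return (A, D, L) as nested lists for a simple undirected graph on vertices 0..n-1."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1                      # undirected: symmetric adjacency
    D = [[sum(A[i]) if i == j else 0 for j in range(n)] for i in range(n)]
    L = [[D[i][j] - A[i][j] for j in range(n)] for i in range(n)]
    return A, D, L

edges = [(0, 1), (1, 2), (2, 3), (1, 3)]           # hypothetical example graph
A, D, L = laplacian(4, edges)
for row in L:
    print(row)
column_sums = [sum(L[i][j] for i in range(4)) for j in range(4)]
print(column_sums)                                 # [0, 0, 0, 0]
```

Since L is symmetric, the rows sum to zero as well, which is another way of saying that the constant vector lies in the kernel of L.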

We denote the eigenvectors of L by $\phi_i$ with corresponding eigenvalues $\lambda_i$, i.e., $L\phi_i = \lambda_i \phi_i$. Note that L has all its eigenvalues in $[0, 2\max_{v \in V} \deg(v)]$. Since the columns of L sum to 0, we have for all measures µ that Lµ has no net mass, i.e., is orthogonal to the constant vector, or has mean 0. Since L is symmetric, it is diagonalizable and all pairs of eigenvectors with distinct eigenvalues are orthogonal. The induced ordering of eigenvectors is analogous to the continuous case: a small eigenvalue means slow oscillation, and the larger the eigenvalue, the more oscillation there is. In particular, $\phi_1$ is constant with $\lambda_1 = 0$. Since we assume G is connected, there is only one instance of the trivial eigenvalue. As L is diagonalizable, we can also take arbitrary powers and define the fractional Laplacian $L^{\alpha}$ for α > 0. Setting n = |V|,

\[ L^{\alpha} v = \sum_{i=1}^{n} \langle v, \phi_i \rangle \lambda_i^{\alpha} \phi_i. \]

To define $L^{-\alpha}$, we need to adjust this definition slightly: since $\lambda_1 = 0$, we first shift v down by its mean to avoid dividing by 0. (In other words, we simply ignore the component of v in the direction of the constant eigenvector $\phi_1$.) That is,

\[ L^{-\alpha} v = \sum_{i=2}^{n} \langle v, \phi_i \rangle \lambda_i^{-\alpha} \phi_i. \]

Of course, multiplying a vector by a matrix can be interpreted as applying an operator to a function, since vectors indexed by vertices are simply functions V → R. Note that $AD^{-1} = I - LD^{-1}$ is a diffusion operator: each vertex splits its mass uniformly among its neighbors, and hence mass is preserved. This operator has eigenvalues in [−1, 1]. If we instead apply the transpose $D^{-1}A$, this corresponds to each vertex taking an equal portion of each of its neighbors’ masses, which, in general, is not mass-preserving. If G is k-regular (each vertex has equal degree k), then D = kI is scalar and these two notions coincide and equal $\frac{1}{k} A$. We refer to [8, 14] for a good introduction to these notions and many references.

2.2. Description of the Algorithm.
We present an algorithm, parametrized by 0 < α < 1, for greedily picking well-distributed vertices $x_k$ on graphs. (If $\alpha \gg 1$ the algorithm degenerates and repeatedly selects the same small subset of distant vertices.) First, $x_1$ is chosen arbitrarily. Then, vertices are picked recursively according to the following:

\[ x_{k+1} = \arg\min_{x \in V} \left( L^{-\alpha} \sum_{j=1}^{k} \delta_{x_j} \right)(x), \]

breaking ties arbitrarily (Figures 2, 3, 5, and 9-11 display applications of this algorithm to various graphs). If we write out the algorithm explicitly in terms of the spectrum of the Laplacian operator L = D − A, it becomes

\[ x_{k+1} = \arg\min_{x \in V} \sum_{i=2}^{n} \sum_{j=1}^{k} \frac{\phi_i(x_j)}{\lambda_i^{\alpha}} \phi_i(x), \]

where the $\phi_i$ are normalized with $\|\phi_i\|_{L^2} = 1$, and we skip the constant eigenvector $\phi_1$ since it has eigenvalue 0 (note that this choice, while seemingly arbitrary, has no impact on the algorithm, as any contribution from the constant vector can be ignored when computing arg min: the algorithm is independent of the choice of right inverse of the Laplacian). That is, we add up the projections of the indicator vector of the current vertex set, $\sum_{j=1}^{k} \delta_{x_j}$, onto each eigenvector, scaling down by the α power of the respective eigenvalue.

Consider the case of a cycle graph: if the number of vertices is sufficiently large, this is well approximated by a torus. Setting α = 1/2 and identifying the torus with $[0, 2\pi)/\!\sim \ := \mathbb{T}$, we have a simple explicit formula for the inverse Laplacian of a point mass (see §3.4 for the derivation):

\[ L^{-1/2}(\delta_{x_k}) = -\frac{1}{\pi} \ln |2 \sin((x - x_k)/2)|. \]

Note that $|2 \sin((x - x_k)/2)|$ is precisely the Euclidean distance between the points at angles $x_k$ and x on the unit circle (i.e., $|e^{ix} - e^{ix_k}|$). Thus, the algorithm is simply maximizing the product of distances between points on the circle, by setting

\[ x_{k+1} = \arg\min_{x \in \mathbb{T}} \left( -\frac{1}{\pi} \sum_{j=1}^{k} \ln |2 \sin((x - x_j)/2)| \right). \]

The arising sequence appears to behave on par with provably optimally regular sequences, and Steinerberger recently proved strong results on the regularity of such a sequence in [36], using techniques which are specific to this setting and unlikely to generalize to other graphs. (We explore this example in more detail in §3.4.) Nonetheless, these remarkable results on the torus and cycle graph give us hope that the algorithm may work comparably well on graphs more generally.
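The greedy rule can be sketched in a few lines on the cycle graph $C_n$, where the spectrum of L = D − A is explicit (eigenvalues $2 - 2\cos(2\pi k/n)$ with Fourier modes as eigenvectors), so no numerical eigensolver is needed. This is an illustration under that assumption only; the names `kernel` and `greedy_sequence` are ours, and ties are broken by lowest index rather than arbitrarily.

```python
import math

def kernel(n, alpha):
    """g[x] = (L^{-alpha} delta_0)(x) on the cycle graph C_n, via the explicit spectrum."""
    g = []
    for x in range(n):
        total = 0.0
        for k in range(1, n):                      # skip the constant mode k = 0
            lam = 2.0 - 2.0 * math.cos(2.0 * math.pi * k / n)
            total += math.cos(2.0 * math.pi * k * x / n) / lam ** alpha
        g.append(total / n)
    return g

def greedy_sequence(n, length, alpha=0.5, start=0):
    """Pick vertices of C_n greedily by minimizing L^{-alpha} applied to the chosen point masses."""
    g = kernel(n, alpha)
    chosen = [start]
    # potential[x] = (L^{-alpha} sum_j delta_{x_j})(x); C_n is vertex-transitive,
    # so the potential of a point mass at y is the kernel shifted by y.
    potential = [g[(x - start) % n] for x in range(n)]
    while len(chosen) < length:
        nxt = min(range(n), key=potential.__getitem__)
        chosen.append(nxt)
        for x in range(n):
            potential[x] += g[(x - nxt) % n]
    return chosen

sequence = greedy_sequence(8, 4)
print(sequence)
```

On $C_8$ the second pick is the antipodal vertex 4, and the first four picks are the equally spaced vertices 0, 2, 4, 6 in some order, which mirrors the behavior of the van der Corput sequence on the interval.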

2.3. A Theoretical Guarantee. We prove a theoretical guarantee, for any finite graph G, that these sequences do exhibit at least a certain degree of regularity.

Theorem 1. Let G be a simple connected graph and let

\[ \mu_k = \frac{1}{k} \sum_{j=1}^{k} \delta_{x_j}, \]

where vertices $x_j$ are selected as above. Then, for all 1 ≤ k ≤ n,

\[ \sum_{i=2}^{n} \frac{|\langle \mu_k, \phi_i \rangle|^2}{\lambda_i^{2\alpha}} \le \left( \max_{j \le k} \left\| L^{-2\alpha} \delta_{x_j} \right\|_{\ell^2}^2 \right) k^{-1}. \]

Remark 1. Observe for the sake of comparison that, when the $x_j$ are distinct,

\[ \sum_{i=2}^{n} |\langle \mu_k, \phi_i \rangle|^2 = \|\mu_k\|_{\ell^2}^2 - |\langle \mu_k, \phi_1 \rangle|^2 = \frac{1}{k} - \frac{1}{n}. \]

Thus, it is natural that the bound in the Theorem should be $\sim k^{-1}$. However, $\lambda_i$ (and thus $\lambda_i^{2\alpha}$) may be arbitrarily close to 0, scaling up the terms in the sum substantially. The only way to prevent this is for $\mu_k$ to be almost orthogonal to low-frequency eigenfunctions (which is, cf. Erdős-Turán [12, 13], a natural way of defining regularity, as it means that $\mu_k$ is concentrated at high frequencies).

Remark 2. Note that

\[ \max_{j \le k} \left\| L^{-2\alpha} \delta_{x_j} \right\|_{\ell^2}^2 \le \max_{x \in V} \left\| L^{-2\alpha} (\delta_x) \right\|_{\ell^2}^2. \]

The term on the right side is an interesting quantity in itself, and there may be good bounds for it in terms of the geometry of the graph. As can be seen from the expansion into eigenfunctions, this quantity measures, implicitly, how much low-frequency eigenfunctions concentrate in a particular vertex. In vertex-transitive graphs like cycle graphs and torus grid graphs we see that the quantity is actually independent of the vertex x.
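The identity in Remark 1 is Parseval's identity with the constant mode removed, and it can be checked numerically. The sketch below does so on a cycle graph, assuming the complex Fourier modes $\phi_k(x) = n^{-1/2} e^{2\pi i k x / n}$ as orthonormal eigenvectors; the vertex list is an arbitrary set of distinct vertices chosen by us, not the output of the algorithm.

```python
import cmath
from math import pi

def nonconstant_energy(n, vertices):
    """Sum over k = 1..n-1 of |<mu, phi_k>|^2 for mu = average of point masses on C_n."""
    m = len(vertices)
    total = 0.0
    for k in range(1, n):                          # k = 0 is the constant eigenvector
        coeff = sum(cmath.exp(-2j * pi * k * x / n) for x in vertices) / (m * n ** 0.5)
        total += abs(coeff) ** 2
    return total

n, vertices = 12, [0, 6, 3, 9, 1]                  # hypothetical distinct vertices
k = len(vertices)
energy = nonconstant_energy(n, vertices)
print(energy, 1 / k - 1 / n)                       # both approximately 0.11667
```

The identity requires the vertices to be distinct; repeating a vertex increases the squared norm of the measure above 1/k.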

3. Spectral Bounds on Transport Distances

3.1. Motivation. The motivation behind the algorithm is two-fold:

(1) The greedy algorithm tries to minimize a Sobolev norm

\[ \left\| \frac{1}{k} \sum_{j=1}^{k} \delta_{x_j} \right\|_{\dot{H}^{-1}}. \]

(2) A recent result of R. Peyre [28] shows that, in the continuous setting,

\[ W_2(\mu, dx) \lesssim \|\mu\|_{\dot{H}^{-1}}. \]

The purpose of this section is to establish a connection between problems of optimal transport and spectral properties of the Laplacian. This is known to hold in the continuous case; we recall the following bound:

Theorem (Carroll, Massaneda, Ortega-Cerdà [7]). Let (M, g) be a compact Riemannian manifold with normalized volume measure dx and ∂M = ∅. If $-\Delta_g \phi = \lambda \phi$ on M, then, for some constant C > 0 depending only on (M, g),

\[ W_1\left( \phi^+ dx, \ \phi^- dx \right) \le \frac{C}{\sqrt{\lambda}} \|\phi\|_{L^1(M)}. \]

This inequality is sharp. We recall the basic intuition that a Laplacian eigenfunction may, at scale $\lambda^{-1/2}$ (the wavelength), be understood as a random wave. This suggests that one has to move mass at least a distance comparable to the wavelength, and examples on the torus $\mathbb{T}^d$ or the sphere $\mathbb{S}^d$ show that this is indeed the case.

3.2. Spectral Bounds on Transport. The purpose of this section is to show that a variation of this result exists on finite graphs; we will prove this for the Earth Mover’s Distance p = 1.

Theorem 2. Let $M = I - AD^{-1}$ and let $M\phi_k = \lambda_k \phi_k$. Then $0 \le \lambda_k \le 2$ and

\[ W_1(\phi_k^+, \phi_k^-) \le \frac{1}{1 - |1 - \lambda_k|} \|\phi_k\|_{\ell^1}. \]

Note that, since $\phi_k = \phi_k^+ - \phi_k^-$ has mean 0, the measures $\phi_k^+$ and $\phi_k^-$ have the same mass, and thus one can be transported to the other. When we consider the asymptotic behavior of $W_1(\phi_k^+, \phi_k^-)$ on cycle graphs of increasing size, we see that this bound is a natural analog of Peyre’s result [28] to graphs: it scales sharply with respect to $\dot{H}^{-1}(\phi_k)$ (see §3.4). We observe that this bound degenerates if $|\lambda_k - 1|$ is close to 1, and this is a consequence of the proof. We also note that we always have the trivial transport inequality

\[ W_1(\phi_k^+, \phi_k^-) \le \operatorname{diam}(G) \, \|\phi_k^+\|_{\ell^1} = \frac{\operatorname{diam}(G)}{2} \|\phi_k\|_{\ell^1}, \]

and thus the bound in the Theorem is preferable to the trivial bound only when

\[ |1 - \lambda_k| < 1 - \frac{2}{\operatorname{diam}(G)}. \]

In many of the interesting cases for applications (graphs with good mixing properties), we can expect a spectral gap that quantitatively bounds $|\lambda_2 - 1| < 1$.
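To see how the spectral bound and the trivial diameter bound compare in practice, one can evaluate both on a cycle graph, where $M = I - AD^{-1} = I - A/2$ has the cosine eigenvectors recalled in §3.4 and where $W_1$ can be computed exactly by the median-of-partial-sums formula for flows on a cycle. The following is a sketch under those assumptions; the parameters n = 60 and k = 6 are arbitrary choices of ours.

```python
import math

def w1_on_cycle(mu, nu):
    """Exact W1 on C_n: the optimal edge flow is the partial sums of mu - nu minus a median."""
    partial, running = [], 0.0
    for a, b in zip(mu, nu):
        running += a - b
        partial.append(running)
    median = sorted(partial)[len(partial) // 2]
    return sum(abs(p - median) for p in partial)

n, k = 60, 6
phi = [math.sqrt(2.0 / n) * math.cos(2 * math.pi * k * x / n) for x in range(n)]
lam = 1.0 - math.cos(2 * math.pi * k / n)          # eigenvalue of M = I - A/2
pos = [max(v, 0.0) for v in phi]                   # positive part of phi
neg = [max(-v, 0.0) for v in phi]                  # negative part of phi
l1 = sum(abs(v) for v in phi)

transport = w1_on_cycle(pos, neg)
spectral = l1 / (1.0 - abs(1.0 - lam))             # bound from Theorem 2
trivial = (n // 2) / 2 * l1                        # diam(C_n) = floor(n/2)
print(transport, spectral, trivial)
```

For these parameters $|1 - \lambda_k| = \cos(36°) \approx 0.81$ lies below $1 - 2/\operatorname{diam}(G) \approx 0.93$, so the spectral bound is the smaller of the two; for very small k the ordering reverses, as discussed above.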

3.3. Applying the Theorem to obtain Transport Bounds. For an arbitrary distribution µ, we may use this bound to measure the Wasserstein distance to the uniform distribution on a graph with n vertices. The observation above motivates splitting µ into mid-range and extreme-frequency components,

\[ \mu_1 = \sum_{|1 - \lambda_k| < 1 - 2/\operatorname{diam}(G)} \langle \mu, \phi_k \rangle \, \phi_k \]

and

\[ \mu_2 = \sum_{|1 - \lambda_k| \ge 1 - 2/\operatorname{diam}(G)} \langle \mu, \phi_k \rangle \, \phi_k. \]

We then transport $\mu_1$ by propagating infinitely, and bound $\mu_2$ with a diameter bound:

\[ \begin{aligned} W_1(\mu, dx) &= W_1(\mu_1^+ + \mu_2^+, \ \mu_1^- + \mu_2^- + dx) \\ &\le W_1\left( \mu_1^+, \mu_1^- \right) + W_1\left( \mu_2^+, \mu_2^- + dx \right) \\ &\le \sum_{i=0}^{\infty} \left\| A^i D^{-i} \mu_1 \right\|_{\ell^1} + \frac{\operatorname{diam}(G)}{2} \left\| \mu_2 - dx \right\|_{\ell^1}. \end{aligned} \]

While the above spectral bound is not guaranteed to be smaller than the diameter bound, empirical evidence suggests that it is in general a stronger bound, particularly for graphs with large diameter. We can test the quality of this bound by using linear programming [27] to compute exact Wasserstein distances on graphs. In §4, we display the results of doing so with

\[ \mu = \frac{1}{k} \sum_{j=1}^{k} \delta_{x_j} \]

using the algorithm to pick the $x_j$:

\[ x_{k+1} = \arg\min_{x \in V} \left( L^{-1/2} \sum_{j=1}^{k} \delta_{x_j} \right)(x), \]

compared against picking vertices $x_k$ uniformly at random (without repetition) and averaging over 1000 Wasserstein distances obtained in this manner. This approach yields promising computational results across a number of large graphs.

3.4. A Case Study: Cycle Graphs. In this subsection, we will look carefully at the behavior of the cycle graphs $C_n$ in the context of the above result. We recall the eigenvectors of $M = \frac{1}{2} L$ on $C_n$ are

\[ \phi_k(x) = (n/2)^{-1/2} \cos\left( \frac{2\pi k x}{n} \right) \quad \text{and} \quad \phi_{n-k}(x) = (n/2)^{-1/2} \sin\left( \frac{2\pi k x}{n} \right), \]

for 0 < k < n/2, with $\phi_0(x) \equiv n^{-1/2}$ and, if n is even, $\phi_{n/2}(x) = n^{-1/2} (-1)^x$, and corresponding eigenvalues

\[ \lambda_k = 1 - \cos\left( \frac{2\pi k}{n} \right). \]

Note that $\phi_0$ is the constant eigenvector here and, since cosine is even, $\lambda_k = \lambda_{n-k}$ for all k ≠ 0. We have elected to use the real eigenvectors in order to apply our arguments, though it is worth noting that we can change basis for the dimension 2 eigenspaces and simply write

\[ \phi_k(x) = n^{-1/2} \exp\left( \frac{2\pi i k x}{n} \right), \]

where 0 ≤ k < n with all $\lambda_k$ the same as above. Then,

\[ \frac{1}{1 - |1 - \lambda_k|} = \frac{1}{1 - \cos\left( \frac{2\pi k}{n} \right)} \approx 2 \left( \frac{n}{2\pi k} \right)^2 \]

for small k, by Taylor expansion. Further,

\[ \|\phi_k\|_{\ell^1} \approx \|\phi_{n-k}\|_{\ell^1} \approx (n/2)^{-1/2} \int_0^n \left| \cos\left( \frac{2\pi k x}{n} \right) \right| dx = \frac{\sqrt{8n}}{\pi}. \]

On the other hand, $W_1(\phi_k^+, \phi_k^-) \approx W_1(\phi_{n-k}^+, \phi_{n-k}^-)$ can be approximated by the continuous analogue, where it is clear from symmetry that the optimal way to transport the sine wave is sending all mass to the nearest zero, where the positive and negative mass will cancel. This incurs a cost of

\[ (n/2)^{-1/2} \, 4k \int_0^{n/4k} x \sin\left( \frac{2\pi k x}{n} \right) dx = (n/2)^{-1/2} \, 4k \left( \frac{n}{2\pi k} \right)^2, \]

integrating by parts. Putting it all together, this yields

\[ W_1(\phi_k^+, \phi_k^-) \approx \frac{\pi k}{n} \cdot \frac{1}{1 - |1 - \lambda_k|} \|\phi_k\|_{\ell^1}. \]

We may let k ≤ n/100 so that k is small enough for the Taylor expansion to be good, but nonetheless on the order of n: then we see the bound in Theorem 2 is sharp up to constants. Observe that when we apply the fractional inverse Laplacian $L^{-1/2}$ to the point mass $\delta_0$, we get

\[ \begin{aligned} L^{-1/2}(\delta_0) = \sum_{k=1}^{n-1} \frac{\phi_k(0)}{\lambda_k^{1/2}} \phi_k &= \sum_{k=1}^{n-1} \frac{1}{n \left( 2 - 2\cos\frac{2\pi k}{n} \right)^{1/2}} \exp\left( \frac{2\pi i k x}{n} \right) \\ &= 2 \sum_{k=1}^{\lfloor n/2 \rfloor} \frac{1}{n \left( 2 - 2\cos\frac{2\pi k}{n} \right)^{1/2}} \cos\left( \frac{2\pi k x}{n} \right) \\ &\approx \frac{1}{\pi} \sum_{k=1}^{\lfloor n/2 \rfloor} \frac{1}{k} \cos\left( \frac{2\pi k x}{n} \right), \end{aligned} \]

for all odd n, again approximating with Taylor expansion. (If n is even, we will get an extra $(-1)^x / (n\sqrt{2})$ term corresponding to $\phi_{n/2}$, but this will vanish in the limit we are about to take.) We caution the reader that the $\lambda_k$ above are the eigenvalues of L = 2M, and are thus double the eigenvalues of M referred to in the preceding computations. Rescaling with x = nθ/2π, we have

\[ L^{-1/2}(\delta_0) \approx \frac{1}{\pi} \sum_{k=1}^{\lfloor n/2 \rfloor} \frac{1}{k} \cos(k\theta). \]

Fixing θ and letting n → ∞, this is simply the Fourier series for

\[ \lim_{n \to \infty} L^{-1/2}(\delta_0) = -\frac{1}{\pi} \ln |2 \sin(\theta/2)|. \]

Taking this limit and rescaling really is just transitioning us to the continuous setting: the eigenfunctions of the Laplacian on the unit circle $\mathbb{S}^1$ identified with $[0, 2\pi)/\!\sim$ are $(2\pi)^{-1/2} \exp(ikx)$, with eigenvalue $k^2$, for k ∈ Z. So the fractional inverse Laplacian $L^{-1/2}$ of a point mass on the circle is

\[ L^{-1/2}(\delta_0) = \sum_{k \ne 0} \frac{\phi_k(0)}{\lambda_k^{1/2}} \phi_k = \sum_{k \ne 0} \frac{1}{2\pi |k|} \exp(ikx) = \frac{1}{\pi} \sum_{k=1}^{\infty} \frac{1}{k} \cos(kx) = -\frac{1}{\pi} \ln |2 \sin(x/2)|, \]

precisely our function in the discrete case.

Figure 6. The difference between $L^{-1/2}(\delta_0)$ on $\mathbb{S}^1$ and the fractional Laplacian $L^{-1/2}(\delta_0)$ on $C_{150}$ is small.

In Figure 6 we show the difference between the inverse fractional Laplacian on a point mass $L^{-1/2}(\delta_0)$ in the continuous and discrete settings. The two outputs are almost identical and thus their difference is quite small (this holds even for very small values of n). It is worth recalling the recent Theorem of Pausinger, which proves that this algorithm belongs to a large class which produces the van der Corput sequence on the torus (and thus achieves optimal discrepancy, up to constants) [24]. Further, Steinerberger’s recent result [36] indicates that this algorithm performs extremely well on the torus, and by extension large cycle graphs, and that the sequence of points rapidly becomes very evenly distributed.
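A comparison of this kind can be reproduced with a short sketch, assuming only the explicit spectrum of $L$ on $C_n$ used above. The function names are ours, and we restrict to vertices at distance at least 10 from the point mass, where the continuous kernel is far from its logarithmic singularity.

```python
import math

def discrete_kernel(n, x):
    """(L^{-1/2} delta_0)(x) on C_n, with L = D - A and orthonormal Fourier modes."""
    return sum(
        math.cos(2 * math.pi * k * x / n) / math.sqrt(2 - 2 * math.cos(2 * math.pi * k / n))
        for k in range(1, n)                       # skip the constant mode k = 0
    ) / n

def continuous_kernel(theta):
    """(L^{-1/2} delta_0)(theta) on the circle [0, 2 pi)."""
    return -math.log(abs(2 * math.sin(theta / 2))) / math.pi

n = 150
gap = max(
    abs(discrete_kernel(n, x) - continuous_kernel(2 * math.pi * x / n))
    for x in range(10, n - 9)                      # vertices 10, ..., 140
)
print(gap)
```

The rescaling x = nθ/2π from the derivation above is what matches vertex x of $C_n$ with the angle 2πx/n on the circle.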

3.5. A Case Study: Torus Grid Graphs. In this subsection, we examine another class of graphs: torus grid graphs. The m × n torus grid graph $T_{m,n}$ is the Cartesian product of $C_m$ and $C_n$, so we can apply many of our computations from the previous subsection here. In particular, the eigenvectors for $T_{m,n}$ are the Kronecker products of pairs of eigenvectors $(\phi_j, \phi_k)$ from $C_m$ and $C_n$, respectively, with corresponding eigenvalues under $M = \frac{1}{4} L$ of

\[ \lambda_{j,k} = \frac{\lambda_j + \lambda_k}{2} = 1 - \frac{1}{2} \left( \cos\left( \frac{2\pi j}{m} \right) + \cos\left( \frac{2\pi k}{n} \right) \right). \]

(The factor of 1/2 appears because cycle graphs are 2-regular while torus grid graphs are 4-regular, and vanishes if we instead use L here.) Then we have

\[ \frac{1}{1 - |1 - \lambda_{j,k}|} = \frac{1}{1 - \frac{1}{2} \left( \cos\left( \frac{2\pi j}{m} \right) + \cos\left( \frac{2\pi k}{n} \right) \right)} \approx \frac{4}{\left( \frac{2\pi j}{m} \right)^2 + \left( \frac{2\pi k}{n} \right)^2} \]

for small j, k, by Taylor expansion. Further,

\[ \|\phi_{j,k}\|_{\ell^1} = \|\phi_j\|_{\ell^1} \|\phi_k\|_{\ell^1} \approx \frac{8\sqrt{nm}}{\pi^2}. \]

To find $W_1(\phi_{j,k}^+, \phi_{j,k}^-)$, note that the support of $\phi_{j,k}^+$ consists of checkerboarded rectangles, and the most efficient way to transport the positive mass to the negative mass will be along the higher frequency direction, that is, horizontally if m/j < n/k, and vertically otherwise. Without loss of generality, let us suppose we are in the former case. Then this incurs a cost of

\[ \sum_{i=0}^{n-1} |\phi_k(i)| \, W_1(\phi_j^+, \phi_j^-) = \|\phi_k\|_{\ell^1} W_1(\phi_j^+, \phi_j^-) \approx \frac{\sqrt{8n}}{\pi} \cdot (m/2)^{-1/2} \, 4j \left( \frac{m}{2\pi j} \right)^2. \]

Applying our assumption that m/j < n/k to our earlier estimate, we see

\[ \frac{1}{1 - |1 - \lambda_{j,k}|} \approx \frac{4}{\left( \frac{2\pi j}{m} \right)^2 + \left( \frac{2\pi k}{n} \right)^2} \le \frac{4}{\left( \frac{2\pi k}{n} \right)^2 + \left( \frac{2\pi k}{n} \right)^2} = 2 \left( \frac{n}{2\pi k} \right)^2. \]

Putting it all together,

\[ \frac{\sqrt{8n}}{\pi} (m/2)^{-1/2} \, 4j \left( \frac{m}{2\pi j} \right)^2 \approx W_1(\phi_{j,k}^+, \phi_{j,k}^-) \le \frac{1}{1 - |1 - \lambda_{j,k}|} \|\phi_{j,k}\|_{\ell^1} \le 2 \left( \frac{n}{2\pi k} \right)^2 \cdot \frac{8\sqrt{nm}}{\pi^2}, \]

and thus, taking the quotient of the two sides in the above inequality, our bound is off by (at most) a factor of $\pi k^2 m (n^2 j)^{-1}$. Note that when k/n = j/m (i.e., when the horizontal and vertical components of $\phi_{j,k}$ have the same frequency), this simplifies to πk/n, precisely our result on cycle graphs.
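The eigenvalue formula for $T_{m,n}$ can be verified directly, without forming any matrices: apply $M = I - A/4$ to a product of cosines and compare with $\lambda_{j,k}$ times the same vector. The sketch below does this; the function name and the parameters m = 6, n = 10, j = 1, k = 2 are our own choices, and the eigenvector is left unnormalized since normalization does not affect the eigenvalue equation.

```python
import math

def torus_eigencheck(m, n, j, k):
    """Return lambda_{j,k} and the max deviation of M phi - lambda phi on T_{m,n}, M = I - A/4."""
    def phi(a, b):
        # Kronecker product of cosine eigenvectors of C_m and C_n (unnormalized)
        return math.cos(2 * math.pi * j * a / m) * math.cos(2 * math.pi * k * b / n)

    lam = 1 - 0.5 * (math.cos(2 * math.pi * j / m) + math.cos(2 * math.pi * k / n))
    worst = 0.0
    for a in range(m):
        for b in range(n):
            neighbors = (phi((a + 1) % m, b) + phi((a - 1) % m, b)
                         + phi(a, (b + 1) % n) + phi(a, (b - 1) % n))
            applied = phi(a, b) - neighbors / 4    # (M phi)(a, b), every vertex has degree 4
            worst = max(worst, abs(applied - lam * phi(a, b)))
    return lam, worst

lam, worst = torus_eigencheck(6, 10, 1, 2)
print(lam, worst)
```

The check assumes m, n ≥ 3 so that the torus grid graph is simple and 4-regular.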

4. Numerics

Below we provide numerics on a variety of graphs demonstrating the performance of the algorithm and the bound from Theorem 2. In particular, we compare the performance of vertices selected according to our algorithm against that of randomly selected vertices. The differences between our vertex sequences and random vertex sequences may seem marginal, but this is partly due to the fact that the diameter of some of these graphs is quite small. For instance, the Truncated Tetrahedral graph, with diameter 3, only has 12 vertices, so we will hardly be able to distinguish the performance of the algorithm’s vertices from randomly selected vertices on such a small set; it is impressive that we see a difference at all. We see that for the Faulkner-Younger Graph and the Level 2 Menger Sponge the difference becomes significantly more drastic. Many of the graphs are quite well connected, which makes the transport problem easier than on sparse graphs (e.g., on complete graphs it makes no difference at all which vertices are selected, the transport cost only depends on the number of vertices).

In all the tables in this section, the $x_j$ were computed directly using the recursive definition for the algorithm (α = .5) given in Section 2 with any ties broken randomly, and, for the “Algorithm” row, the exact Wasserstein distances

\[ W_1\left( \frac{1}{k} \sum_{j=1}^{k} \delta_{x_j}, \ dx \right) \]

were subsequently computed using the dual linear program in [27]. For graphs which are not vertex-transitive, the performance of the algorithm depends upon the arbitrary initial vertex chosen, and thus all choices of initial vertex were attempted and the transport costs averaged. In the “Random” row, 1000 uniformly randomly selected sets of k distinct vertices were taken, and the corresponding Wasserstein distances were averaged.

Figure 7. The Menger Sponge, whose 400 cubes form the vertices of a connectivity graph.

Figure 8. The tightness of the bound in Theorem 2, applied to the eigenfunctions of the Level 2 Menger Sponge Connectivity Graph (computed as a quotient of the right and left sides).

4.1. Connectivity Graph of Level 2 Menger Sponge. The Level 2 Menger sponge is the object obtained beginning with a cube and drilling out the middle square of each face (viewed as a three by three grid of squares), and then iterating this process one more time on the smaller cubes (see Fig. 7). We can then generate a connectivity graph of the remaining 400 smaller cubes (each one ninth the side length of the original cube). Note that this is not a regular graph. In Figure 8, we see that, on the Level 2 Menger Sponge Connectivity Graph, the Theorem 2 bound is tightest for mid-range eigenvalues. This is to be expected, due to the blow-up of the 1/(1 − |1 − λ|) term at the extremes, where a diameter bound is tighter (see §3.2). But we see here that, even for very small eigenvalues, the bound is fairly tight.

No. of vertices   1     3     5     10    15    20    25    30
Algorithm         9.94  6.48  5.01  3.54  2.92  2.55  2.31  2.11
Random            9.94  6.73  5.52  4.25  3.63  3.20  2.90  2.69

Table 1. W1(µ, dx) for the Connectivity Graph of a Level 2 Menger Sponge

Figure 9. The sequence of vertices picked by the algorithm on the Truncated Tetrahedral Graph.

Figure 10. The sequence of vertices picked by the algorithm on the Frucht Graph.

4.2. Truncated Tetrahedral Graph. The Truncated Tetrahedral Graph (see Fig. 9) is a 3-regular, vertex-transitive graph on 12 vertices. It is the 1-skeleton of the Archimedean solid formed by truncating each vertex of a tetrahedron.

No. of vertices   1     2     3     4     5     6     7     8     9     10
Algorithm         1.92  1.17  0.83  0.67  0.58  0.50  0.42  0.33  0.28  0.23
Random            1.92  1.35  1.01  0.84  0.72  0.58  0.52  0.43  0.34  0.27

Table 2. W1(µ, dx) for the Truncated Tetrahedral Graph

4.3. Frucht Graph. The Frucht Graph (see Fig. 10) is a 3-regular graph on 12 vertices, and has trivial automorphism group despite being degree-regular.

No. of vertices   1     2     3     4     5     6     7     8     9     10
Algorithm         1.93  1.17  0.86  0.67  0.59  0.50  0.42  0.34  0.29  0.23
Random            1.93  1.34  1.04  0.85  0.73  0.59  0.52  0.43  0.35  0.27

Table 3. W1(µ, dx) for the Frucht Graph

4.4. Faulkner-Younger Graph. The Faulkner-Younger Graph on 44 vertices (see Fig. 11) is a 3-regular non-Hamiltonian graph (that is, there is no cycle along its edges that visits every vertex exactly once).

Figure 11. The first ten vertices picked by the algorithm on the Faulkner-Younger Graph. Each label is above and to the right of the corresponding vertex.

No. of vertices   1     2     3     4     5     6     7     8     9     10
Algorithm         4.17  2.67  2.05  1.71  1.52  1.34  1.23  1.15  1.05  0.97
Random            4.17  3.08  2.57  2.24  2.01  1.83  1.68  1.56  1.46  1.37

Table 4. W1(µ, dx) for the Faulkner-Younger Graph

4.5. Erdős-Rényi Random Graphs. The Erdős-Rényi model for random graphs G(n, p) is given by including an edge between each pair of the n vertices independently with probability p. Here we display the performance of the algorithm on two such graphs, one taken from G(100,.06) (a sparse graph, see Fig. 12) and another from G(100,.2) (a dense graph, see Fig. 13).

No. of vertices           1     3     5     10    15    20    25    30
Algorithm, Sparse Graph   2.72  2.61  2.13  1.52  1.22  1.02  0.85  0.74
Random, Sparse Graph      2.72  2.11  1.82  1.44  1.22  1.06  0.93  0.83
Algorithm, Dense Graph    1.79  1.53  1.31  1.01  0.86  0.80  0.75  0.70
Random, Dense Graph       1.79  1.46  1.26  0.99  0.88  0.81  0.75  0.70

Table 5. W1(µ, dx) for Erdős-Rényi Random Graphs

It is no surprise that the dense graph exhibits little variation between the transport cost of random vertices and that of the algorithm’s vertices: after all, any pair of vertices has many short paths between them. In fact, this particular graph has diameter 3. Thus, for sufficiently dense graphs it is largely irrelevant which vertices are selected: the transport cost will be low. The sparse graph displayed has diameter 6 and is thus more interesting: while random vertices initially outperform the algorithm, by 15 vertices selected they are matched, after which the algorithm surpasses the random vertices. That is, even in highly irregular graphs such as this one where random vertices perform well at first, the algorithm nonetheless manages to catch up even with a relatively small number of vertices.

Figure 12. A sparse Erdős-Rényi Graph from G(100,.06).

Figure 13. A dense Erdős-Rényi Graph from G(100,.2).

4.6. Complete 3-ary Tree. The complete 3-ary tree of depth 4 is a rooted tree, where each vertex has 3 children, except for the fourth generation of vertices which all have no children, yielding a total of 40 vertices (see Fig. 14).

Figure 14. The 3-ary tree of depth 4

No. of vertices   1     2     3     4     5     6     7     8     9     10
Algorithm         4.25  3.58  2.75  2.83  2.55  2.07  1.98  1.73  1.34  1.37
Random            4.25  3.62  3.12  2.93  2.74  2.48  2.33  2.17  1.99  1.86

Table 6. W1(µ, dx) for the 3-ary tree of depth 4

5. Proofs 5.1. Proof of Theorem 1.

Proof. Note first that, since, for all v ∈ Rn, L−2αv has mean 0 (being spanned by φi, i > 1, and thus orthogonal to the constant φ1), we will always have

 k  −2α X min L δxj  (x) < 0. x∈V j=1 Using the `2 norm n 2 X 2 kvk`2 = vi , i=1 17 we observe

2 2 2 L−α(kµ ) = L−α ((k − 1)µ ) + L−α (δ ) k `2 k−1 `2 xk `2 −α −α + 2 L ((k − 1)µk−1),L (δxk ) 2 2 = L−α ((k − 1)µ ) + L−α (δ ) k−1 `2 xk `2 −2α + 2 L ((k − 1)µk−1), δxk , since L−α is self-adjoint. Rewriting the inner product term,

 k−1  −2α −2α X L ((k − 1)µk−1), δxk = L δxj  (xk). j=1

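But $x_k$ was chosen by the algorithm specifically to minimize that quantity; thus, it is certainly less than the average value of 0, and so

\[
\left\| L^{-\alpha}(k\mu_k) \right\|_{\ell^2}^2 \leq \left\| L^{-\alpha}((k-1)\mu_{k-1}) \right\|_{\ell^2}^2 + \left\| L^{-\alpha}(\delta_{x_k}) \right\|_{\ell^2}^2.
\]

Then, by induction, we obtain the desired inequality:

\[
\sum_{i=2}^{n} \frac{|\langle \mu_k, \phi_i \rangle|^2}{\lambda_i^{\alpha}} = \left\| L^{-\alpha}(\mu_k) \right\|_{\ell^2}^2 \leq \Big( \max_{j \leq k} \left\| L^{-\alpha} \delta_{x_j} \right\|_{\ell^2}^2 \Big) k^{-1}.
\]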
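The induction above can be checked numerically on a small graph. The following is a minimal sketch under stated assumptions, not the author's implementation: it takes $L = D - A$ (the combinatorial Laplacian, chosen here because it is symmetric), defines $L^{-s}$ through the eigendecomposition on the complement of the constants, and uses a cycle on 30 vertices with $\alpha = 1/2$. The names `cycle_laplacian`, `inverse_power` and `greedy_sequence` are hypothetical.

```python
import numpy as np

def cycle_laplacian(n):
    """Combinatorial Laplacian L = D - A of the cycle graph on n vertices."""
    A = np.zeros((n, n))
    for v in range(n):
        A[v, (v + 1) % n] = A[(v + 1) % n, v] = 1.0
    return np.diag(A.sum(axis=1)) - A

def inverse_power(L, s, tol=1e-10):
    """L^{-s} restricted to the orthogonal complement of the constants."""
    lam, phi = np.linalg.eigh(L)
    keep = lam > tol                      # drop the zero eigenvalue
    return (phi[:, keep] * lam[keep] ** (-s)) @ phi[:, keep].T

def greedy_sequence(L, alpha, steps, first=0):
    """x_k minimises (L^{-2 alpha} sum_{j<k} delta_{x_j})(x) over all vertices x."""
    K = inverse_power(L, 2 * alpha)
    xs = [first]
    potential = K[:, first].copy()
    for _ in range(steps - 1):
        x = int(np.argmin(potential))
        xs.append(x)
        potential += K[:, x]
    return xs

n, alpha, steps = 30, 0.5, 12
L = cycle_laplacian(n)
xs = greedy_sequence(L, alpha, steps)
M = inverse_power(L, alpha)

for k in range(1, steps + 1):
    mu_k = np.bincount(xs[:k], minlength=n) / k           # empirical measure
    lhs = np.sum((M @ mu_k) ** 2)                         # ||L^{-alpha} mu_k||^2
    rhs = max(np.sum(M[:, x] ** 2) for x in xs[:k]) / k   # max_j ||L^{-alpha} delta_{x_j}||^2 / k
    print(k, round(lhs, 4), round(rhs, 4))
```

Each printed row compares the two sides of the norm inequality; the left column should never exceed the right one, since every cross term in the expansion is non-positive by the choice of $x_k$.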



5.2. Proof of Theorem 2.

Proof. We recall that $AD^{-1}$ can be interpreted as the propagator of the random walk on the graph G = (V,E). Moreover, we have

\[
AD^{-1} \phi_k = (1 - \lambda_k) \phi_k
\]

and observe that $|1 - \lambda_k| \leq 1$. We proceed in a similar manner to [34] and interpret diffusion on the graph as one of many ways to transport mass. In particular, we will apply the elementary estimate

\[
W_1(AD^{-1} v, v) \leq \|v\|_{\ell^1} = \sum_{j=1}^{n} |v_j|
\]

to $v = ((AD^{-1})^i \phi_k)^{\pm}$,

\[
W_1\big( AD^{-1} ((AD^{-1})^i \phi_k)^{\pm}, ((AD^{-1})^i \phi_k)^{\pm} \big) \leq \left\| ((AD^{-1})^i \phi_k)^{\pm} \right\|_{\ell^1} = \frac{|1 - \lambda_k|^i}{2} \|\phi_k\|_{\ell^1}.
\]

In particular, we will transport $\phi_k$ to $(AD^{-1})^m \phi_k$ through its positive and negative parts after each diffusion. For large $m$, this measure almost vanishes since

\[
(AD^{-1})^m \phi_k = (1 - \lambda_k)^m \phi_k.
\]

We perform this operation until some arbitrary $m$ and then use the trivial bound on the remaining measure. This shows that the total transport can be bounded by

\begin{align*}
W_1(\phi_k^{+}, \phi_k^{-})
&\leq \frac{|1 - \lambda_k|^m \operatorname{diam}(G)}{2} \|\phi_k\|_{\ell^1} + \sum_{i=0}^{m-1} |1 - \lambda_k|^i \|\phi_k\|_{\ell^1} \\
&= \left( \frac{|1 - \lambda_k|^m \operatorname{diam}(G)}{2} + \frac{1 - |1 - \lambda_k|^m}{1 - |1 - \lambda_k|} \right) \|\phi_k\|_{\ell^1} \\
&= \left( |1 - \lambda_k|^m \left( \frac{\operatorname{diam}(G)}{2} - \frac{1}{1 - |1 - \lambda_k|} \right) + \frac{1}{1 - |1 - \lambda_k|} \right) \|\phi_k\|_{\ell^1}.
\end{align*}

We observe that this bound is monotonic in $m$, with direction depending on the sign of the bracketed expression. If

\[
|1 - \lambda_k| \geq 1 - \frac{2}{\operatorname{diam}(G)},
\]

the bound is monotonically increasing and we set $m = 0$, recovering the initial diameter bound. If

\[
|1 - \lambda_k| < 1 - \frac{2}{\operatorname{diam}(G)},
\]

we have a monotonically decreasing bound, and take the limit as $m \to \infty$, yielding the desired bound of

\[
\frac{1}{1 - |1 - \lambda_k|} \|\phi_k\|_{\ell^1}.
\]
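The monotonicity in $m$ can be seen by evaluating the bound for sample values. The following is a small arithmetic sketch: the diameter 6 and the two values of $|1 - \lambda_k|$ (written `r`) are hypothetical inputs chosen to fall on either side of the threshold $1 - 2/\operatorname{diam}(G)$, and the function names are illustrative only.

```python
def transport_bound(r, diam, m):
    """Bound on W_1(phi_k^+, phi_k^-) / ||phi_k||_1 after m diffusion steps.

    r stands for |1 - lambda_k|; the first term is the diameter bound on the
    remaining mass, the sum is the cost of the m diffusion steps.
    """
    return r ** m * diam / 2 + sum(r ** i for i in range(m))

def closed_form(r, diam, m):
    """Same quantity, written as r^m (diam/2 - 1/(1-r)) + 1/(1-r)."""
    return r ** m * (diam / 2 - 1 / (1 - r)) + 1 / (1 - r)

diam = 6                       # hypothetical diameter; threshold is 1 - 2/6 = 2/3
for r in (0.5, 0.9):           # one value below and one above the threshold
    values = [transport_bound(r, diam, m) for m in range(8)]
    print(r, [round(v, 3) for v in values], "limit:", round(1 / (1 - r), 3))
```

For `r = 0.5` the values decrease from diam/2 = 3 towards the limit 2, so letting $m \to \infty$ is the better choice; for `r = 0.9` they increase from 3, so $m = 0$ (the diameter bound) is the better choice.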

6. Connection to other Results

6.1. Low-discrepancy point sets. A classical problem in the study of irregularities of distribution is to construct sequences $(x_n)_{n=1}^{\infty}$ on the unit interval [0, 1] such that $\{x_1, \dots, x_n\}$ is fairly evenly distributed over the unit interval for all $n \in \mathbb{N}$. The problem has now been solved completely: by a result of Schmidt [32], for any sequence on [0, 1] there exist infinitely many $n \in \mathbb{N}$ such that

\[
\max_{0 \leq x \leq 1} \left| \frac{\#\{1 \leq i \leq n : x_i \leq x\}}{n} - x \right| \geq \frac{1}{100} \frac{\log n}{n}.
\]

The quantity on the left-hand side is also known as discrepancy (or extreme discrepancy, $L^{\infty}$-discrepancy); we refer to the textbook of Dick & Pillichshammer [10]. Steinerberger recently established [35, 36] that a greedy sequence defined via

\[
x_{n+1} = \arg\min_{x \in [0,1]} \Big( L^{-1/2} \sum_{k=1}^{n} \delta_{x_k} \Big)(x)
\]

has excellent distribution properties for all $n \in \mathbb{N}$: numerical examples suggest that the arising sequences are remarkably close to the best possible bound $n^{-1} \log n$ (down to the level of the constant). The argument is somewhat different and uses the Koksma-Hlawka inequality [18] and classical Fourier Analysis. In particular, this result is stronger than what is guaranteed by Theorem 1.
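The discrepancy on the left-hand side of Schmidt's bound is easy to evaluate for a concrete sequence. The following is a minimal sketch using the van der Corput sequence mentioned in the introduction; the function names are hypothetical, and the supremum over $x$ is computed from the sorted points (the largest deviation occurs at, or just before, one of the points).

```python
import math

def van_der_corput(n):
    """n-th element (n >= 1): reflect the binary digits of n about the radix point."""
    x, denom = 0.0, 1.0
    while n > 0:
        denom *= 2.0
        n, bit = divmod(n, 2)
        x += bit / denom
    return x

def extreme_discrepancy(points):
    """sup over x in [0,1] of | #{i : x_i <= x}/n - x | for points in [0,1]."""
    ys = sorted(points)
    n = len(ys)
    return max(max((i + 1) / n - y, y - i / n) for i, y in enumerate(ys))

for n in (7, 100, 1000):
    pts = [van_der_corput(j) for j in range(1, n + 1)]
    print(n, extreme_discrepancy(pts), math.log(n) / n)
```

For the first 7 elements (the points 1/8, ..., 7/8) the discrepancy is 1/8; the larger values of $n$ allow a rough comparison with the rate $\log n / n$.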

6.2. Greedy minimization. Greedy minimizations such as the one explored in this paper are well-behaved in general. In [5], Steinerberger and the author showed that for any $f : \mathbb{T} \to \mathbb{R}$ with $\hat{f}(k) \geq c|k|^{-2}$, the sequence defined by

\[
x_n = \arg\min_{x \in \mathbb{T}} \sum_{k=1}^{n-1} f(x - x_k)
\]

is well-distributed. Note that this algorithm is summing shifted copies of $f$ and finding the smallest value, essentially "filling in the gaps" in the point set. Independently of the initial choice of $\{x_1, \dots, x_k\}$, we have

\[
W_2\Big( \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}, dx \Big) \lesssim \frac{c}{\sqrt{n}}.
\]

In higher dimensions, the picture is even nicer: defining an appropriate analogue of the $\hat{f}(k) \geq c|k|^{-2}$ condition, we find ourselves minimizing the energy of Green's function-like kernels, and now have, for $d \geq 3$,

\[
W_2\Big( \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}, dx \Big) \lesssim \frac{1}{n^{1/d}}.
\]

The exception is dimension $d = 2$, where the bound weakens to $\sqrt{\log n}/n^{1/2}$. It is unknown if this bound is sharp.
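The "filling in the gaps" behaviour on the one-dimensional torus can be illustrated with a short experiment. The following is a minimal sketch under assumptions not made in the text: the kernel is taken to be $f(x) = \{x\}^2 - \{x\} + 1/6$ (the second Bernoulli polynomial, whose Fourier coefficients are $1/(2\pi^2 k^2)$ for $k \neq 0$), and the arg min is restricted to a uniform grid of 1024 points rather than all of $\mathbb{T}$. The function names are hypothetical.

```python
import numpy as np

def bernoulli_kernel(x):
    """f(x) = {x}^2 - {x} + 1/6, whose Fourier coefficients are 1/(2 pi^2 k^2), k != 0."""
    t = np.mod(x, 1.0)
    return t * t - t + 1.0 / 6.0

def greedy_torus_sequence(initial, total, grid_size=1024):
    """Extend `initial` greedily; the arg min is taken over a uniform grid only."""
    grid = np.arange(grid_size) / grid_size
    xs = list(initial)
    energy = sum(bernoulli_kernel(grid - x) for x in xs)
    while len(xs) < total:
        x_new = float(grid[np.argmin(energy)])
        xs.append(x_new)
        energy = energy + bernoulli_kernel(grid - x_new)
    return xs

points = greedy_torus_sequence(initial=[0.0], total=8)
print(points)
```

Starting from the single point 0, the second point lands at 1/2 and the next two at 1/4 and 3/4 (in an order decided by tie-breaking), so each new point falls into a largest gap of the current set.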

6.3. Leja points. Leja points can be defined, in the utmost level of generality, for any symmetric kernel $k : X \times X \to \mathbb{R} \cup \{\infty\}$ on a compact Hausdorff space. We remain on smooth compact manifolds $M$; a natural example for the kernel is

\[
k(x, y) = \frac{1}{d_g(x, y)^s},
\]

where $d_g(x, y)$ is the geodesic distance and $s > 0$.

We can then define, in an iterative fashion, for a given initial point $x_1 \in M$, a sequence $(x_k)_{k=1}^{\infty}$ in such a way that

\[
\sum_{k=1}^{n-1} k(x_n, x_k) = \inf_{x \in M} \sum_{k=1}^{n-1} k(x, x_k).
\]

Put differently, we add a new point $x_n$ in a greedy fashion in such a way that the total energy

\[
\sum_{\substack{k,\ell=1 \\ k \neq \ell}}^{n} k(x_k, x_\ell)
\]

is as small as possible. These sets were introduced by Edrei [11] and intensively studied by Leja [19], after whom they are named. The most commonly used kernel is $k(x, y) = -\log |x - y|$ (such that minimizing the sum is the same as maximizing the product of the distances). Leja points have a number of applications in numerical analysis [2, 6, 22, 29, 30]. Pausinger [24] recently gave a very precise description of Leja sequences on $\mathbb{T}$ for fairly general kernel functions and established a connection to binary digit expansion. For the Riesz kernel $k(x, y) = |x - y|^{-s}$, it is known that Leja sequences are asymptotically uniformly distributed [21]. We are not aware of any study of Leja vertices on graphs; while one could take existing kernels, for example $k(x, y) = |x - y|^{-s}$, and consider them on graphs, there is little reason to assume that such vertices will have many special properties: graphs are simply too flexible. We can summarize the approach in this paper as stating that there is a very good reason to believe (see the Figures in this paper) that considering $k(x, y)$ to be the Green's function of the inverse Laplacian leads to well-distributed sets of vertices. Moreover, we are able to analyze the continuous limit of manifolds and are able to obtain a quantitative bound showing that the points are more regularly distributed than simply exhibiting uniform distribution.

6.4. Riesz points. Riesz points refer, at great level of generality, to point sets minimizing energy expressions of the following form

\[
\arg\min_{x \in M^n} \sum_{\substack{k,\ell=1 \\ k \neq \ell}}^{n} \frac{1}{\|x_k - x_\ell\|^s}.
\]

The problem was first stated on $\mathbb{S}^2$ with $s = 1$ by Thomson [37] in 1904 and has since inspired a lot of work; we refer to [3, 9, 15, 31] and references therein. We make a connection with two contributions in particular.
The first is due to Beltran, Corral and Criado del Rey [1]: they show that if we consider sets of $n$ points on a compact manifold chosen to attain minimal energy

\[
\min_{x \in M^n} \sum_{\substack{k,\ell=1 \\ k \neq \ell}}^{n} G(x_k, x_\ell),
\]

where $G$ is the Green's function of the Laplacian on $M$, then the sequence of measures converges weakly to the uniform measure

\[
\frac{1}{n} \sum_{k=1}^{n} \delta_{x_k} \rightharpoonup dx.
\]

This can be considered the static analogue (since one finds the minimal arrangement for all $n$ points) of our problem (keeping the previous $n - 1$ points fixed and then adding the point so as to minimize energy). The second contribution that we highlight is very recent and due to Marzo & Mas [23]. They studied the specific problem of minimizing the $s$-Riesz energy

\[
E_s = \sum_{\substack{k,\ell=1 \\ k \neq \ell}}^{n} \frac{1}{\|x_k - x_\ell\|^s} \quad \text{on } \mathbb{S}^d
\]

and estimating the spherical cap discrepancy of the minimizing point set: in short, if the points are uniformly distributed, then we would expect the number of points in each spherical cap to be proportional to the volume of the cap; the largest discrepancy is known as spherical cap discrepancy. They use ideas dating back to Wolff: the Riesz energy $E_s$ is comparable to a negative Sobolev norm and, more precisely, there exists a constant $C_{s,d} > 1$ such that for all $f \in L^2(\mathbb{S}^d)$,

\[
C_{s,d}^{-1} \|f\|_{H^{(s-d)/2}}^2 \leq \int_{\mathbb{S}^d \times \mathbb{S}^d} \frac{f(x) f(y)}{\|x - y\|^s} \, dx \, dy \leq C_{s,d} \|f\|_{H^{(s-d)/2}}^2.
\]

This, while not directly related to our approach, is at least philosophically related: we estimate the Wasserstein distance in negative Sobolev spaces and use the underlying $L^2$-structure.

References

[1] C. Beltran, N. Corral and J. Criado del Rey, Discrete and Continuous Green Energy on Compact Manifolds, Journal of Approximation Theory 237, 160–185 (2019).
[2] L. Bialas-Ciez, J.-P. Calvi, Pseudo Leja sequences, Annali di Matematica 191, 53–75 (2012).
[3] J. Brauchart, Optimal logarithmic energy points on the unit sphere, Math. Comp. 77, 1599–1613 (2008).
[4] L. Brown, S. Steinerberger, On the Wasserstein Distance between Classical Sequences and the Lebesgue Measure, Trans. Amer. Math. Soc. 373, 8943–8962 (2020).
[5] L. Brown, S. Steinerberger, Positive-definite Functions, Exponential Sums and the Greedy Algorithm: a Curious Phenomenon, Journal of Complexity (2020).
[6] D. Calvetti, L. Reichel, D. Sorensen, An implicitly restarted Lanczos method for large symmetric eigenvalue problems, Electronic Transactions on Numerical Analysis 2, 1–21 (1994).
[7] T. Carroll, X. Massaneda, J. Ortega-Cerda, An enhanced uncertainty principle for the Vaserstein distance, Bull. London Math. Soc. 52, 1158–1173 (2020).
[8] F. Chung, Spectral Graph Theory, CBMS Regional Conference Series in Mathematics 92, American Mathematical Society (1997).
[9] B. Dahlberg, Regularity Properties of Riesz Potentials, Indiana University Mathematics Journal 28, 257–268 (1979).
[10] J. Dick and F. Pillichshammer, Digital nets and sequences. Discrepancy theory and quasi-Monte Carlo integration, Cambridge University Press, Cambridge (2010).
[11] A. Edrei, Sur les déterminants récurrents et les singularités d'une fonction donnée par son développement de Taylor, Compositio Mathematica 7, 20–88 (1940).
[12] P. Erdős and P. Turán, On a problem in the theory of uniform distribution. I. Nederl. Akad. Wetensch. 51, 1146–1154 (1948).
[13] P. Erdős and P. Turán, On a problem in the theory of uniform distribution. II. Nederl. Akad. Wetensch. 51, 1262–1269 (1948).
[14] A. Grigor'yan, Introduction to Analysis on Graphs, University Lecture Series 71, American Mathematical Society (2018).
[15] D. Hardin and E. Saff, Discretizing manifolds via minimum energy points, Not. Amer. Math. Soc. (2004).
[16] P. Hu and W. C. Lau, A Survey and Taxonomy of Graph Sampling, arXiv:1308.5865 (2013).
[17] L. V. Kantorovich, On the Translocation of Masses, J. Math. Sci. 133, 1381–1382 (2006).
[18] J. Koksma, Een algemeene stelling uit de theorie der gelijkmatige verdeeling modulo 1, Mathematica B (Zutphen) 11, 7–11 (1942/43).
[19] F. Leja, Sur certaines suites liées aux ensembles plans et leur application à la représentation conforme, Annales Polonici Mathematici 4, 8–13 (1957).
[20] G. Linderman and S. Steinerberger, Numerical Integration on Graphs: where to sample and how to weigh, Math. Comp. 89 (2020).
[21] G. A. López and E. B. Saff, Asymptotics of greedy energy points, Math. Comp. 79 (272), 2287–2316 (2010).
[22] S. De Marchi and G. Elefante, Quasi-Monte Carlo integration on manifolds with mapped low-discrepancy points and greedy minimal Riesz s-energy points, Applied Numerical Mathematics 127, 110–124 (2018).
[23] J. Marzo and A. Mas, Discrepancy of Minimal Riesz Energy Points, arXiv:1907.04814 (2019).
[24] F. Pausinger, Greedy energy minimization can count in binary: point charges and the van der Corput sequence, Annali di Matematica 200, 165–186 (2021).
[25] I. Pesenson, Sampling in Paley-Wiener spaces on combinatorial graphs, Trans. Amer. Math. Soc. 360, 5603–5627 (2010).
[26] I. Pesenson and M. Pesenson, Sampling, filtering and sparse approximations on combinatorial graphs, J. Fourier Anal. Appl. 16, 921–942 (2010).
[27] G. Peyré and M. Cuturi, Computational optimal transport, Foundations and Trends in Machine Learning 11, no. 5-6, 355–607 (2019).
[28] R. Peyre, Comparison between $W_2$ distance and $\dot{H}^{-1}$ norm, and Localization of Wasserstein distance, ESAIM: COCV 24, 1489–1501 (2018).
[29] L. Reichel, Newton Interpolation at Leja points, BIT 30, 332–346 (1990).
[30] L. Reichel, The Application of Leja Points to Richardson Iteration and Polynomial Preconditioning, Linear Algebra and its Applications 154–156, 389–414 (1991).
[31] E. Saff and V. Totik, Logarithmic potentials with external fields, Springer (2013).
[32] W. M. Schmidt, Irregularities of distribution. VII. Acta Arith. 21, 45–50 (1972).
[33] D. Shuman, S. Narang, P. Frossard, A. Ortega and P. Vandergheynst, The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains, IEEE Signal Process. Mag. 30(3), 83–98 (2013).
[34] S. Steinerberger, Wasserstein Distance, Fourier Series and Applications, arXiv:1803.08011 (2018).
[35] S. Steinerberger, Dynamically Defined Sequences with Small Discrepancy, arXiv:1902.03269 (2019).
[36] S. Steinerberger, Polynomials with Zeros on the Unit Circle: Regularity of Leja Sequences, arXiv:2006.10708 (2020).
[37] J. J. Thomson, On the Structure of the Atom: an Investigation of the Stability and Periods of Oscillation of a number of Corpuscles arranged at equal intervals around the Circumference of a Circle; with Application of the Results to the Theory of Atomic Structure, Philosophical Magazine Series 6, Volume 7, Number 39, 237–265 (1904).
[38] L. N. Vasershtein, Markov processes on a countable product space describing large systems of automata, Problemy Peredavci Informacii, 3, 64–72 (1969).
[39] C. Villani, Topics in Optimal Transportation, Graduate Studies in Mathematics, American Mathematical Society (2003).

Department of Mathematics, Yale University, New Haven, CT 06511, USA
Email address: [email protected]