arXiv:2011.02466v1 [cs.DS] 4 Nov 2020

Algorithms and Hardness for Linear Algebra on Geometric Graphs∗

Josh Alman†        Timothy Chu‡        Aaron Schild§        Zhao Song¶

∗ A preliminary version of this paper appeared in the Proceedings of the 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS 2020).
† Harvard University. Supported by a Michael O. Rabin Postdoctoral Fellowship. Part of the work done while the author was a graduate student at MIT.
‡ Carnegie Mellon University.
§ University of Washington. Part of the work done while the author was a graduate student at UC Berkeley. Supported in part by ONR #N00014-17-1-2429, a Packard fellowship, and the University of Washington.
¶ Columbia University, Princeton University and Institute for Advanced Study. Part of the work done while the author was a graduate student at UT-Austin. Supported in part by Ma Huateng Foundation, Schmidt Foundation, Simons Foundation, NSF, DARPA/SRC, Google and Amazon.

Abstract

For a function K : ℝ^d × ℝ^d → ℝ_{≥0} and a set P = {x_1, ..., x_n} ⊂ ℝ^d of n points, the K-graph G_P of P is the complete graph on n nodes where the weight between nodes i and j is given by K(x_i, x_j). In this paper, we initiate the study of when efficient spectral graph theory is possible on these graphs. We investigate whether or not it is possible to solve the following problems in n^{1+o(1)} time for a K-graph G_P when d < n^{o(1)}:

• Multiply a given vector by the adjacency matrix or Laplacian matrix of G_P,
• Find a spectral sparsifier of G_P, and
• Solve a Laplacian system in G_P's Laplacian matrix.

For each of these problems, we consider all functions of the form K(u, v) = f(‖u − v‖_2^2) for a function f : ℝ → ℝ. We provide algorithms and comparable hardness results for many such K, including the Gaussian kernel, Neural tangent kernels, and more. For example, in dimension d = Ω(log n), we show that there is a parameter associated with the function f for which low parameter values imply n^{1+o(1)}-time algorithms for all three of these problems and high parameter values imply the nonexistence of subquadratic-time algorithms assuming the Strong Exponential Time Hypothesis (SETH), given natural assumptions on f.

As part of our results, we also show that the exponential dependence on the dimension d in the celebrated fast multipole method of Greengard and Rokhlin cannot be improved, assuming SETH, for a broad class of functions K. To the best of our knowledge, this is the first formal limitation proven about fast multipole methods.

Contents

1 Introduction
  1.1 High-dimensional results
    1.1.1 Multiplication
    1.1.2 Sparsification
    1.1.3 Laplacian solving
  1.2 Our Techniques
    1.2.1 Multiplication
    1.2.2 Sparsification
  1.3 Brief summary of our results in terms of p_f
  1.4 Summary of our Results on Examples
  1.5 Other Related Work

2 Summary of Low Dimensional Results
  2.1 Multiplication
  2.2 Sparsification
  2.3 Laplacian solving

3 Preliminaries
  3.1 Notation
  3.2 Graph and Laplacian Notation
  3.3 Spectral Sparsification via Random Sampling
  3.4 Woodbury Identity
  3.5 Tail Bounds
  3.6 Fine-Grained Hypotheses
  3.7 Dimensionality Reduction
  3.8 Nearest Neighbor Search
  3.9 Geometric Laplacian System

4 Equivalence of Matrix-Vector Multiplication and Solving Linear Systems
  4.1 Solving Linear Systems Implies Matrix-Vector Multiplication
  4.2 Matrix-Vector Multiplication Implies Solving Linear Systems
  4.3 Lower bound for high-dimensional linear system solving

5 Matrix-Vector Multiplication
  5.1 Equivalence between Adjacency and Laplacian Evaluation
  5.2 Approximate Degree
  5.3 'Kernel Method' Algorithms
  5.4 Lower Bound in High Dimensions
  5.5 Lower Bounds in Low Dimensions
  5.6 Hardness of the n-Body Problem
  5.7 Hardness of Kernel PCA

6 Sparsifying Multiplicatively Lipschitz Functions in Almost Linear Time
  6.1 High Dimensional Sparsification
  6.2 Low Dimensional Sparsification

7 Sparsifiers for |⟨x, y⟩|
  7.1 Existence of large expanders in inner product graphs
  7.2 Efficient algorithm for finding sets with low effective resistance diameter
  7.3 Using low-effective-resistance clusters to sparsify the unweighted IP graph
  7.4 Sampling data structure
  7.5 Weighted IP graph sparsification
    7.5.1 (ζ, κ)-cover for unweighted IP graphs
    7.5.2 (ζ, κ)-cover for weighted IP graphs on bounded-norm vectors
    7.5.3 (ζ, κ)-cover for weighted IP graphs on vectors with norms in the set [1, 2] ∪ [z, 2z] for any z > 1
    7.5.4 (ζ, κ)-cover for weighted IP graphs with polylogarithmic dependence on norm
    7.5.5 Desired (ζ, κ, δ)-cover
    7.5.6 Proof of Lemma 7.1

8 Hardness of Sparsifying and Solving Non-Multiplicatively-Lipschitz Laplacians

9 Fast Multipole Method
  9.1 Overview
  9.2 K(x, y) = exp(−‖x − y‖_2^2), fast Gaussian transform
    9.2.1 Estimation
    9.2.2 Algorithm
    9.2.3 Result
  9.3 Generalization
  9.4 K(x, y) = 1/‖x − y‖_2^2

10 Neural Tangent Kernel

References

1 Introduction

Linear algebra has a myriad of applications throughout computer science and physics. Consider the following seemingly unrelated tasks:

1. n-body simulation (one step): Given n bodies X located at points in ℝ^d, compute the gravitational force on each body induced by the other bodies.

2. Spectral clustering: Given n points X in ℝ^d, partition X by building a graph G on the points in X, computing the top k eigenvectors of the Laplacian matrix L_G of G for some k ≥ 1 to embed X into ℝ^k, and running k-means on the resulting points.

3. Semi-supervised learning: Given n points X in ℝ^d and a function g : X → ℝ whose values on some of X are known, extend g to the rest of X.

Each of these tasks has seen much work throughout numerical analysis, theoretical computer science, and machine learning. The first task is a celebrated application of the fast multipole method of Greengard and Rokhlin [GR87, GR88, GR89], voted one of the top ten algorithms of the twentieth century by the editors of Computing in Science and Engineering [DS00]. The second task is spectral clustering [NJW02, LWDH13], a popular algorithm for clustering data. The third task is to label a full set of data given only a small set of partial labels [Zhu05b, CSZ09, ZL05], which has seen increasing use in machine learning. One notable method for performing semi-supervised learning is the graph-based Laplacian regularizer method [LSZ+19b, ZL05, BNS06, Zhu05a].

Popular techniques for each of these problems benefit from primitives in spectral graph theory on a special class of dense graphs called geometric graphs. For a function K : ℝ^d × ℝ^d → ℝ and a set of points X ⊆ ℝ^d, the K-graph on X is a graph with vertex set X and edges with weight K(u, v) for each pair u, v ∈ X. Adjacency matrix-vector multiplication, spectral sparsification, and Laplacian system solving in geometric graphs are directly relevant to each of the above problems, respectively:

1. n-body simulation (one step): For each i ∈ {1, 2, ..., d}, make a weighted graph G_i on the points in X, in which the weight of the edge between the points u, v ∈ X in G_i is K_i(u, v) := (G_grav · m_u · m_v / ‖u − v‖_2^2) · ((v_i − u_i)/‖u − v‖_2), where G_grav is the gravitational constant and m_x is the mass of the point x ∈ X. Let A_i denote the weighted adjacency matrix of G_i. Then A_i · 1 is the vector of ith coordinates of the force vectors. In particular, the gravitational force can be computed by doing O(d) adjacency matrix-vector multiplications, where each adjacency matrix is that of the K_i-graph on X for some i.

2. Spectral clustering: Make a K-graph G on X. In applications, K(u, v) = f(‖u − v‖_2^2), where f is often chosen to be f(z) = e^{−z} [vL07, NJW02]. Instead of directly running a spectral clustering algorithm on L_G, one popular method is to construct a sparse matrix M approximating L_G and run spectral clustering on M instead [CFH16, CSB+11, KMT12]. Standard sparsification methods in the literature are heuristical, and include the widely used Nystrom method, which uniformly samples rows and columns from the original matrix [CJK+13]. If H is a spectral sparsifier of G, it has been suggested that spectral clustering with the top k eigenvectors of L_H performs just as well in practice as spectral clustering with the top k eigenvectors of L_G [CFH16]. One justification is that since H is a spectral sparsifier of G, the eigenvalues of L_H are at most a constant factor larger than those of L_G, so cuts with similar conductance guarantees are produced.
Moreover, spectral clustering using sparse matrices like L_H is known to be faster than spectral clustering on dense matrices like L_G [CFH16, CJK+13, KMT12].

3. Semi-supervised learning: An important subroutine in semi-supervised learning is completion based on ℓ_2-minimization [Zhu05a, Zhu05b, LSZ+19b]. Specifically, given values g_v for v ∈ Y, where Y is a subset of X, find the vector g ∈ ℝ^n (variable over X \ Y) that minimizes

    Σ_{u,v ∈ X, u ≠ v} K(u, v) (g_u − g_v)^2.

The vector g can be found by solving a Laplacian system on the K-graph for X.

In the first, second, and third tasks above, a small number of calls to matrix-vector multiplication, spectral sparsification, and Laplacian system solving, respectively, were made on geometric graphs. One could solve these problems by first explicitly writing down the graph G and then using near-linear time algorithms [SS11, CKM+14] to multiply, sparsify, and solve systems. However, this requires a minimum of Ω(n^2) time, as G is a dense graph.

In this paper, we initiate a theoretical study of the geometric graphs for which efficient spectral graph theory is possible. In particular, we attempt to determine for which (a) functions K and (b) dimensions d there is a much faster, n^{1+o(1)}-time algorithm for each of (c) multiplication, sparsification, and Laplacian solving. Before describing our results, we elaborate on the choices of (a), (b), and (c) that we consider in this work.

We start by discussing the functions K that we consider (part (a)). Our results primarily focus on the class of functions of the form K(u, v) = f(‖u − v‖_2^2) for a function f : ℝ_{≥0} → ℝ and u, v ∈ ℝ^d. Study of these functions dates back at least eighty years, to the early work of Bochner, Schoenberg, and John Von Neumann on metric embeddings into Hilbert spaces [Boc33, Sch37, NS41]. These choices of K are ubiquitous in applications, like the three described above, since they naturally capture many kernel functions K from statistics and machine learning, including the Gaussian kernel (e^{−‖u−v‖_2^2}), the exponential kernel (e^{−‖u−v‖_2}), the power kernel (‖u − v‖_2^q) for both positive and negative q, the logarithmic kernel (log(‖u − v‖_2 + c)), and more [Sou10, Zhu05a, BTB05a]. See Section 1.5 below for even more popular examples. In computational geometry, many transformations of distance functions are also captured by such functions K, notably in the case when K(u, v) = ‖u − v‖_2^q [Llo82, AS14, ACX19, CMS20].

We would also like to emphasize that many kernel functions which do not at first appear to be of the form f(‖u − v‖_2^2) can be rearranged appropriately to be of this form. For instance, in Section 10 below we show that the recently popular Neural Tangent Kernel is of this form, so our results apply to it as well. That said, to emphasize that our results are very general, we will mention later how they also apply to some functions of the form K(u, v) = f(⟨u, v⟩), including K(u, v) = |⟨u, v⟩|.

Next, we briefly elaborate on the problems that we consider (part (c)). For more details, see Section 3. The points in X are assumed to be real numbers stated with polylog(n) bits of precision. Our algorithms and hardness results pertain to algorithms that are allowed some degree of approximation. For an error parameter ε > 0, our multiplication and Laplacian system solving algorithms produce solutions with ε-additive error, and our sparsification algorithms produce a graph H for which the Laplacian quadratic form (1 ± ε)-approximates that of G.

Matrix-vector multiplication, spectral sparsification, and Laplacian system solving are very natural linear algebraic problems in this setting, and have many applications beyond the three we have focused on (n-body simulation, spectral clustering, and semi-supervised learning). See Section 1.5 below, where we expand on more applications.
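To make the baseline concrete, here is a minimal NumPy sketch (ours, not taken from the paper) of the Ω(n^2 d)-time approach mentioned above: it materializes the K-graph for K(u, v) = f(‖u − v‖_2^2) explicitly and applies its Laplacian to a vector. The algorithms studied in this paper aim to avoid ever writing the dense graph down.

```python
import numpy as np

def k_graph_laplacian_matvec(X, f, y):
    """Naive O(n^2 d) baseline: apply the K-graph Laplacian to y,
    where K(u, v) = f(||u - v||_2^2) and X is an n x d point set."""
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    A = f(D2)                      # weighted adjacency matrix of the K-graph
    np.fill_diagonal(A, 0.0)       # no self-loops
    deg = A.sum(axis=1)            # weighted degrees
    return deg * y - A @ y         # L_G y = D y - A y

# Example: Gaussian kernel f(z) = exp(-z) on random points.
X = np.random.randn(100, 3)
y = np.random.randn(100)
out = k_graph_laplacian_matvec(X, lambda z: np.exp(-z), y)
```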
Finally, we discuss dependencies on the dimension d and the accuracy ε for which n^{1+o(1)} algorithms are possible (part (b)). Define α, a measure of the 'diameter' of the point set and f, as

    α := max_{u,v ∈ X} f(‖u − v‖_2^2) / min_{u,v ∈ X} f(‖u − v‖_2^2) + max_{u,v ∈ X} ‖u − v‖_2^2 / min_{u,v ∈ X} ‖u − v‖_2^2.

It is helpful to have the following two questions in mind when reading our results:

• (High-dimensional algorithms, e.g. d = Θ(log n)) Is there an algorithm which runs in time poly(d, log(nα/ε)) n^{1+o(1)} for multiplication and Laplacian solving? Is there an algorithm which runs in time poly(d, log(nα)) n^{1+o(1)} for sparsification when ε = 1/2?

• (Low-dimensional algorithms, e.g. d = o(log n)) Is there an algorithm which runs in time (log(nα/ε))^{O(d)} n^{1+o(1)} for multiplication and Laplacian solving? Is there a sparsification algorithm which runs in time (log(nα))^{O(d)} n^{1+o(1)} when ε = 1/2?

We will see that there are many important functions K for which there are such efficient low-dimensional algorithms, but no such efficient high-dimensional algorithms. In other words, these functions K suffer from the classic 'curse of dimensionality.' At the same time, other functions K will allow for efficient low-dimensional and high-dimensional algorithms, while others won't allow for either.

We now state our results. We will give very general classifications of functions K for which our results hold, but afterwards in Section 1.4 we summarize the results for a few particular functions K of interest. The main goal of our results is as follows:

Goal: For each problem of interest (part (c)) and dimension d (part (b)), find a natural parameter p_f > 0 associated with the function f for which the following dichotomy holds:

• If p_f is high, then the problem cannot be solved in subquadratic time assuming SETH on points in dimension d.
• If p_f is low, then the problem of interest can be solved in almost-linear time (n^{1+o(1)} time) on points in dimension d.

As we will see shortly, the two parameters p_f which will characterize the difficulties of our problems of interest in most settings are the approximate degree of f, and a parameter related to how multiplicatively Lipschitz f is. We define both of these in the next section.

1.1 High-dimensional results

We begin in this subsection by stating our results about which functions have poly(d, log α, log(1/ε)) · n^{1+o(1)}-time algorithms for multiplication and Laplacian solving and poly(d, log α, 1/ε) · n^{1+o(1)}-time algorithms for sparsification. When reading these results, it is helpful to think of d = Θ(log n), α = 2^{polylog(n)}, ε = 1/2^{polylog(n)} for multiplication and Laplacian solving, and ε = 1/2 for sparsification. With these parameter settings, poly(d) n^{1+o(1)} is almost-linear time, while 2^{O(d)} n^{1+o(1)} time is not. For results about algorithms with runtimes that are exponential in d, see Section 2.

1.1.1 Multiplication

In high dimensions, we give a full classification of when the matrix-vector multiplication problems are easy for kernels of the form K(u, v) = f(‖u − v‖_2^2) for some function f : ℝ_{≥0} → ℝ_{≥0} that is analytic on an interval. We show that the problem can be efficiently solved only when K is very well-approximated by a simple polynomial kernel. That is, we let p_f denote the minimum degree of a polynomial that ε-additively-approximates the function f.

Theorem 1.1 (Informal version of Theorem 5.14 and Corollary 5.13). For any function f : ℝ_+ → ℝ_+ which is analytic on an interval (0, δ) for any δ > 0, and any 0 < ε < 2^{−polylog(n)}, consider the following problem: given as input x_1, ..., x_n ∈ ℝ^d with d = Θ(log n) which define a K-graph G via K(x_i, x_j) = f(‖x_i − x_j‖_2^2), and a vector y ∈ {0, 1}^n, compute an ε-additive-approximation to L_G · y.

• If f can be ε-additively-approximated by a polynomial of degree at most o(log n), then the problem can be solved in n^{1+o(1)} time.
• Otherwise, assuming SETH, the problem requires time n^{2−o(1)}.

The same holds for L_G, the Laplacian matrix of G, replaced by A_G, the adjacency matrix of G.

While Theorem 1.1 yields a parameter p_f that characterizes hardness of multiplication in high dimensions, it is somewhat cumbersome to use, as it can be challenging to show that a function is far from a polynomial. We also show Theorem 5.15, which shows that if f has a single point with large Θ(log n)-th derivative, then the problem requires time n^{2−o(1)} assuming SETH. The Strong Exponential Time Hypothesis (SETH) is a common assumption in fine-grained complexity regarding the difficulty of solving the Boolean satisfiability problem; see Section 3.6 for more details. Theorem 1.1 informally says that assuming SETH, the curse of dimensionality is inherent in performing adjacency matrix-vector multiplication. In particular, we directly apply this result to the n-body problem discussed at the beginning:

Corollary 1.2. Assuming SETH, in dimension d = Θ(log n), one step of the n-body problem requires time n^{2−o(1)}.

The fast multipole method of Greengard and Rokhlin [GR87, GR89] solves one step of this n-body problem in time (log(n/ε))^{O(d)} n^{1+o(1)}. Our Corollary 1.2 shows that assuming SETH, such an exponential dependence on d is required and cannot be improved. To the best of our knowledge, this is the first time such a formal limitation on fast multipole methods has been proved. This hardness result also applies to fast multipole methods for other popular kernels, like the Gaussian kernel K(u, v) = exp(−‖u − v‖_2^2), as well.

1.1.2 Sparsification

We next show that sparsification can be performed in almost-linear time in high dimensions for kernels that are "multiplicatively Lipschitz" functions of the ℓ_2-distance. We say f : ℝ_{≥0} → ℝ_{≥0} is (C, L)-multiplicatively Lipschitz for C > 1, L > 1 if for all x ∈ ℝ_{≥0} and all ρ ∈ (1/C, C),

    C^{−L} f(x) ≤ f(ρx) ≤ C^{L} f(x).

Here are some popular functions that are helpful to think about in the context of our results:

1. f(z) = z^L for any positive or negative constant L. This function is (C, |L|)-multiplicatively Lipschitz for any C > 1.
2. f(z) = e^{−z}. This function is not (C, L)-multiplicatively Lipschitz for any L > 1 and C > 1. We call this the exponential function.
3. The piecewise function f(z) = e^{−z} for z ≤ L and f(z) = e^{−L} for z > L. This function is (C, O(L))-multiplicatively Lipschitz for any C > 1. We call this a piecewise exponential function.
4. The piecewise function f(z) = 1 for z ≤ k and f(z) = 0 for z > k, where k ∈ ℝ_{≥0}. This function is not (C, L)-multiplicatively Lipschitz for any C > 1 or L > 1. This is a threshold function.

We show that multiplicatively Lipschitz functions can be sparsified in n^{1+o(1)} poly(d) time:

Theorem 1.3 (Informal version of Theorem 6.3). For any function f such that f is (2, L)-multiplicatively Lipschitz, building a (1 ± ε)-spectral sparsifier of the K-graph on n points in ℝ^d, where K(u, v) = f(‖u − v‖_2^2), with O(n log n / ε^2) edges, can be done in time

    O(nd √(L log n)) + n log n · 2^{O(√(L log n))} · (log α)/ε^2.

This theorem applies even when d = Ω(log n). When L is constant, the running time simplifies to O(nd √(log n) + n^{1+o(1)} log α / ε^2). This covers the case when f(x) is any rational function with non-negative coefficients, like f(z) = z^L or f(z) = z^{−L}.

It may seem more natural to instead define L-multiplicatively Lipschitz functions, without the parameter C, as functions with ρ^{−L} f(x) ≤ f(ρx) ≤ ρ^{L} f(x) for all ρ and x. Indeed, an L-multiplicatively Lipschitz function is also (C, L)-multiplicatively Lipschitz for any C > 1, so our results show that efficient sparsification is possible for such functions. However, the parameter C is

necessary to characterize when efficient sparsification is possible. Indeed, as in Theorem 1.3 above, it is sufficient for f to be (C, L)-multiplicatively Lipschitz for a C that is bounded away from 1.

To complement this result, we also show a lower bound for sparsification for any function f which is not (C, L)-multiplicatively Lipschitz for any L and sufficiently large C:

Theorem 1.4 (Informal version of Theorem 8.3). Consider an L > 1. There is some sufficiently large value C_L > 1 depending on L such that for any decreasing function f : ℝ_{≥0} → ℝ_{≥0} that is not (C_L, L)-multiplicatively Lipschitz, no O(n 2^{L^{.48}})-time algorithm for constructing an O(1)-spectral sparsifier of the K-graph of a set of n points in O(log n) dimensions exists assuming SETH, where K(x, y) = f(‖x − y‖_2^2).

For example, when L = Θ(log^{2+δ} n) for some constant δ > 0, Theorem 1.4 shows that there is a C for which, whenever f is not (C, L)-multiplicatively Lipschitz, the sparsification problem cannot be solved in time n^{1+o(1)} assuming SETH. Bounding C in terms of L above is important. For example, if C is small enough that C^L = 2, then f could be close to constant. Such K-graphs are easy to sparsify by uniformly sampling edges, so one cannot hope to show hardness for such functions.

Theorem 1.4 shows that geometric graphs for threshold functions, the exponential function, and the Gaussian kernel do not have efficient sparsification algorithms. Furthermore, this hardness result essentially completes the story of which decreasing functions can be sparsified in high dimensions, modulo a gap of L^{.48} versus L in the exponent. The tractability landscape is likely much more complicated for non-decreasing functions. That said, many of the kernels used in practice, like the Gaussian kernel, are decreasing functions of distance, so our dichotomy applies to them.

We also show that our techniques for sparsification extend beyond kernels that are functions of ℓ_2 norms; specifically, K(u, v) = |⟨u, v⟩|:

Lemma 1.5 (Informal version of Lemma 7.1). The K(u, v) = |⟨u, v⟩|-graph on n points in ℝ^d can be ε-approximately sparsified in n^{1+o(1)} poly(d)/ε^2 time.

1.1.3 Laplacian solving

Laplacian system solving has a similar tractability landscape to that of adjacency matrix multiplication. We prove the following algorithmic result for solving Laplacian systems:

Theorem 1.6 (Informal version of Corollary 5.12 and Proposition 3.9). There is an algorithm that takes n^{1+o(1)} poly(d, log(nα/ε)) time to ε-approximately solve Laplacian systems on n-vertex K-graphs, where K(u, v) = f(‖u − v‖_2^2) for some (nonnegative) polynomial f.¹

We show that this theorem is nearly tight via two hardness results. The first applies to multiplicatively Lipschitz kernels, while the second applies to kernels that are not multiplicatively Lipschitz. The second hardness result only works for kernels that are decreasing functions of ℓ_2 distance. We now state our first hardness result:

Corollary 1.7. Consider a function f that is (2, o(log n))-multiplicatively Lipschitz for which f cannot be (ε = 2^{−poly(log n)})-approximated by a polynomial of degree at most o(log n). Then, assuming SETH, there is no n^{1+o(1)} poly(d, log(αn/ε))-time algorithm for ε-approximately solving Laplacian systems in the K-graph on n points, where K(u, v) = f(‖u − v‖_2^2).

¹ f is a nonnegative function if f(x) ≥ 0 for all x ≥ 0.

In Section 4, we will see, using an iterative refinement approach, that if a K-graph can be efficiently sparsified, then there is an efficient Laplacian multiplier for K-graphs if and only if there is an efficient Laplacian system solver for K-graphs (see the sketch at the end of this subsection). Corollary 1.7 then follows using this connection: it describes functions which we have shown have efficient sparsifiers but not efficient multipliers. Corollary 1.7, which is the first of our two hardness results in this setting, applies to slowly-growing functions that do not have low-degree polynomial approximations, like f(z) = 1/(1 + z). Next, we state our second hardness result:

Theorem 1.8 (Informal version of Theorem 8.7). Consider an L > 1. There is some sufficiently large value C_L > 1 depending on L such that for any decreasing function f : ℝ_{≥0} → ℝ_{≥0} that is not (C_L, L)-multiplicatively Lipschitz, no O(n 2^{L^{.48}} log α)-time algorithm exists for solving Laplacian systems 2^{−poly(log n)}-approximately in the K-graph of a set of n points in O(log n) dimensions assuming SETH, where K(u, v) = f(‖u − v‖_2^2).

This yields a quadratic time hardness result when L = Ω(log^2 n). By comparison, the first hardness result, Corollary 1.7, only applied for L = o(log n). In particular, this shows that for non-multiplicatively-Lipschitz functions like the Gaussian kernel, the problem of solving Laplacian systems and, in particular, doing semi-supervised learning, cannot be done in almost-linear time assuming SETH.
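Returning to the iterative-refinement connection invoked after Corollary 1.7 (and detailed in Section 4), here is a hedged NumPy sketch of the solver side of that equivalence: if L_H is a (1 ± ε)-spectral sparsifier of L_G with ε < 1/2, the preconditioned Richardson iteration x ← x + L_H^†(b − L_G x) converges geometrically. The dense pseudoinverse below is a stand-in for a fast sparse Laplacian solver, and dense NumPy matrices are used purely for illustration.

```python
import numpy as np

def refine_solve(L_G, L_H, b, iters=60):
    """Preconditioned Richardson iteration: approximately solve L_G x = b
    (with b orthogonal to the all-ones vector) using a spectral sparsifier
    L_H as a preconditioner.  Applying L_H^+ via a dense pseudoinverse is
    for illustration only; in the intended setting it would be a
    near-linear-time sparse Laplacian solve."""
    L_H_pinv = np.linalg.pinv(L_H)
    x = np.zeros_like(b, dtype=float)
    for _ in range(iters):
        x = x + L_H_pinv @ (b - L_G @ x)   # contraction factor roughly eps/(1 - eps)
    return x
```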

1.2 Our Techniques

1.2.1 Multiplication

Our goal in matrix-vector multiplication is, given points P = {x_1, ..., x_n} ⊂ ℝ^d and a vector y ∈ ℝ^n, to compute a (1 ± ε)-approximation to the vector L_G · y, where L_G is the Laplacian matrix of the K-graph on P, for ε = n^{−Ω(1)} (see Definition 4.1 for the precise error guarantees on ε). We call this the K Laplacian Evaluation (KLapE) problem. A related problem, in which the Laplacian matrix L_G is replaced by the adjacency matrix A_G, is the K Adjacency Evaluation (KAdjE) problem.

We begin by showing a simple, generic equivalence between KLapE and KAdjE for any K: an algorithm for either one can be used as a black box to design an algorithm for the other with only negligible blowups to the running time and error. It thus suffices to design algorithms and prove lower bounds for KAdjE.
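One direction of this equivalence is simple enough to state in a few lines: since L_G = D − A_G with D = diag(A_G · 1), a single adjacency matvec with the all-ones vector recovers the weighted degrees, after which Laplacian matvecs reduce to adjacency matvecs. A small sketch of ours, given any adjacency-matvec oracle:

```python
import numpy as np

def laplacian_matvec(adj_matvec, y):
    """Reduce KLapE to KAdjE: L_G y = diag(A_G 1) y - A_G y, so two calls to
    an adjacency-matvec oracle suffice (the degree vector A_G 1 can be
    computed once and cached across queries)."""
    deg = adj_matvec(np.ones_like(y))   # weighted degrees A_G * 1
    return deg * y - adj_matvec(y)
```

The reverse direction, recovering adjacency matvecs from Laplacian matvecs, is only slightly more involved and is the subject of Section 5.1.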

Algorithmic Techniques  We use two primary algorithmic tools: the Fast Multipole Method (FMM), and a 'kernel method' for approximating A_G by a low-rank matrix.

FMM is an algorithmic technique for computing aggregate interactions between n bodies which has applications in many different areas of science. Indeed, when the interactions between bodies are described by our function K, the problem solved by FMM coincides with our KAdjE problem. Most past work on FMM either considers the low-dimensional case, in which d is a small constant, or else the low-error case, in which ε is a constant. Thus, much of the literature does not consider the simultaneous running time dependence of FMM on ε and d. In order to solve KAdjE, we need to consider the high-dimensional, high-error case. We thus give a clean mathematical overview and detailed analysis of the running time of FMM in Section 9, following the seminal work of Greengard and Strain [GS91], which may be of independent interest. As discussed in Section 1.1 above, the running time of FMM depends exponentially on d, and so it is most useful in the low-dimensional setting.

Our main algorithmic tool in high dimensions is a low-rank approximation technique: we show that when f(x) can be approximated by a sufficiently low-degree polynomial (e.g. any degree o(log n) suffices in dimension Θ(log n)), then we can quickly find a low-rank approximation of the adjacency matrix A_G, and use this to efficiently multiply by a vector. Although this seems fairly simple, in Theorem 1.1 we show it is optimal: when such a low-rank approximation is not possible in high dimensions, then SETH implies that n^{2−o(1)} time is required for KAdjE.

The simplest way to show that f(x) can be approximated by a low-degree polynomial is by truncating its Taylor series. In fact, the FMM also requires that a truncation of the Taylor series of f gives a good approximation to f. By comparison, the FMM puts more lenient restrictions on what degree the series must be truncated to in low dimensions, but in exchange adds other constraints on f, including that f must be monotone. See Section 9.3 and Corollary 5.13 for more details.
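The following is a toy sketch of ours (with hypothetical parameter choices, not the paper's exact algorithm) of the low-rank 'kernel method' matvec described above. Given the coefficients of a degree-q polynomial approximating f, it expands f(‖x_i − x_j‖_2^2) via the multinomial theorem into terms ⟨x_i, x_j⟩^c = ⟨x_i^{⊗c}, x_j^{⊗c}⟩, so the kernel matrix factors through explicit feature maps and a matvec costs roughly n · d^q instead of n^2.

```python
import numpy as np
from math import factorial

def poly_kernel_matvec(X, coeffs, y):
    """Approximate A_G y where A_G[i, j] = f(||x_i - x_j||_2^2) (no self-loops),
    assuming f(t) ~ sum_k coeffs[k] * t^k for a small degree q.  Uses
    ||x_i - x_j||^2 = s_i + s_j - 2<x_i, x_j> and <x_i, x_j>^c =
    <x_i^{(tensor c)}, x_j^{(tensor c)}> to avoid ever forming A_G."""
    n, d = X.shape
    s = np.sum(X**2, axis=1)                     # s_i = ||x_i||^2
    q = len(coeffs) - 1
    # T[c][j] = flattened c-fold tensor power of x_j (length d^c)
    T = [np.ones((n, 1))]
    for c in range(1, q + 1):
        T.append(np.einsum('nr,nd->nrd', T[-1], X).reshape(n, -1))
    out = np.zeros(n)
    for k, ck in enumerate(coeffs):
        for a in range(k + 1):
            for b in range(k + 1 - a):
                c = k - a - b
                mult = factorial(k) // (factorial(a) * factorial(b) * factorial(c))
                z = T[c].T @ (s**b * y)          # one pass over the data
                out += ck * mult * (-2.0)**c * (s**a) * (T[c] @ z)
    return out - coeffs[0] * y                   # remove the diagonal f(0) * y

# e.g. a truncated Taylor series of f(z) = exp(-z): coeffs = [1, -1, 1/2, -1/6]
```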

Lower Bound Techniques  We now sketch the proof of Theorem 1.1, our lower bound for KAdjE for many functions K in high enough dimensions (typically d = Ω(log n)), assuming SETH. Although SETH is a hardness hypothesis about the Boolean satisfiability problem, a number of recent results [AW15, Rub18, Che18, SM19] have shown that it implies hardness for a variety of nearest neighbor search problems. Our lower bound approach is hence to show that KAdjE is useful for solving nearest neighbor search problems.

The high-level idea is as follows. Suppose we are given as input points X = {x_1, ..., x_n} ⊂ {0, 1}^d, and our goal is to find the closest pair of them. For each ℓ ∈ {1, 2, ..., d}, let c_ℓ denote the number of pairs of distinct points x_i, x_j ∈ X with distance ‖x_i − x_j‖_2^2 = ℓ. Using an algorithm for KAdjE for our function K(x, y) = f(‖x − y‖_2^2), we can estimate

    1^⊤ A_G 1 = Σ_{i ≠ j} K(x_i, x_j) = Σ_{ℓ=1}^{d} c_ℓ · f(ℓ).

Similarly, for any nonnegative reals a, b ≥ 0, we can take an appropriate affine transformation of X so that an algorithm for KAdjE can estimate

    Σ_{ℓ=1}^{d} c_ℓ · f(a · ℓ + b).     (1)

Suppose we pick real values a_1, ..., a_d, b_1, ..., b_d ≥ 0 and define the d × d matrix M by M[i, ℓ] = f(a_i · ℓ + b_i). By estimating the sum (1) for each pair (a_i, b_i), we get an estimate of the matrix-vector product Mc, where c ∈ ℝ^d is the vector of the c_ℓ values. We show that if M has a large enough determinant relative to the magnitudes of its entries, then one can recover an estimate of c itself from this, and hence solve the nearest neighbor problem.

The main tool we need for this approach is a way to pick a_1, ..., a_d, b_1, ..., b_d for a function f which cannot be approximated by a low degree polynomial so that M has large determinant. We do this by decomposing det(M) in terms of the derivatives of f using the Cauchy-Binet formula, and then noting that if f cannot be approximated by a polynomial, then many of the contributions in this sum must be large. The specifics of this construction are quite technical; see Section 5.4 for the details.
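A toy version of this reduction (ours; `kadje_sum` is a hypothetical oracle obtained by running the KAdjE algorithm on affinely transformed points and summing 1^⊤ A_G 1) makes the linear-algebraic recovery step explicit:

```python
import numpy as np

def recover_distance_counts(kadje_sum, f, d, a_vals, b_vals):
    """Sketch of the hardness reduction: query the oracle
    kadje_sum(a, b) ~ sum_l c_l * f(a*l + b), form M[i, l] = f(a_i*l + b_i),
    and solve for the vector c of distance counts.  The smallest l with
    c_l > 0 then gives the closest-pair distance.  Robust recovery needs
    det(M) to be large (Section 5.4); this toy version just inverts M."""
    ells = np.arange(1, d + 1)
    M = np.array([[f(a * l + b) for l in ells] for a, b in zip(a_vals, b_vals)])
    v = np.array([kadje_sum(a, b) for a, b in zip(a_vals, b_vals)])
    return np.linalg.solve(M, v)
```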

Comparison with Previous Lower Bound Techniques  Prior work (e.g. [CS17], [BIS17], [BCIS18]) has shown SETH-based fine-grained complexity results for matrix-related computations. For instance, [BIS17] showed hardness results for exact algorithms for many machine-learning related tasks, like kernel PCA and gradient computation in training neural networks, while [CS17] and [BCIS18] showed hardness results for kernel density estimation. In all of this work, the authors are only able to show hardness for a limited set of kernels. For example, [BIS17] shows hardness for kernel PCA only for Gaussian kernels. These limitations arise from the technique used. To show hardness, [BIS17] exploits the fact that the Gaussian kernel decays rapidly to obtain a gap between the completeness and soundness cases in approximate nearest neighbors, just as we do for functions f like f(x) = (1/x)^{Ω(log n)}. The hardness results of [CS17] and [BCIS18] employ a similar idea.

As discussed in Lower Bound Techniques, we circumvent these limitations by showing that applying the multiplication algorithm for one kernel a small number of times and linearly combining the results is enough to solve Hamming closest pair. This idea is enough to give a nearly tight characterization of the analytic kernels for which subquadratic-time multiplication is possible in dimension d = Θ(log n). As a result, by combining with reductions similar to those from past work, our lower bound also applies to a variety of similar problems, including kernel PCA, for a much broader set of kernels than previously known; see Section 5.7 below for the details.

Our lower bound is also interesting when compared with the Online Matrix-Vector Multiplication (OMV) Conjecture of Henzinger et al. [HKNS15]. In the OMV problem, one is given an n × n matrix M to preprocess, then afterwards one is given a stream v_1, ..., v_n of length-n vectors, and for each v_i, one must output M × v_i before being given v_{i+1}. The OMV Conjecture posits that one cannot solve this problem in total time n^{3−Ω(1)} for a general matrix M. At first glance, our lower bound may seem to have implications for the OMV Conjecture: for some kernels K, our lower bound shows that for an input set of points P and corresponding adjacency matrix A_G, and input vector v_i, there is no algorithm running in time n^{2−Ω(1)} for multiplying A_G × v_i, so perhaps multiplying by n vectors cannot be done in time n^{3−Ω(1)}. However, this is not necessarily the case, since the OMV problem allows O(n^{2.99}) time for preprocessing A_G, which our lower bound does not incorporate. More broadly, the matrices A_G which we study, which have very concise descriptions compared to general matrices, are likely not the best candidates for proving the OMV Conjecture. That said, perhaps our results can lead to a form of the OMV Conjecture for geometric graphs with concise descriptions.

1.2.2 Sparsification

Algorithmic techniques  Our algorithm for constructing high-dimensional sparsifiers for K(x, y) = f(‖x − y‖_2^2), when f is a (2, L)-multiplicatively Lipschitz function, involves using three classic ideas: the Johnson-Lindenstrauss lemma of random projection [JL84, IM98], the notion of well-separated pair decomposition from Callahan and Kosaraju [CK93, CK95], and spectral sparsification via oversampling [SS11, KMP10]. Combining these techniques carefully gives us the bounds in Theorem 1.3.

To overcome the 'curse of dimensionality', we use the Johnson-Lindenstrauss lemma to project onto √(L log n) dimensions. This preserves all pairwise distances up to a distortion of at most 2^{O(√(log n / L))}. Then, using a 1/2-well-separated pair decomposition, we partition the set of projected distances into bicliques, such that each biclique has edges that are no more than a 2^{O(√(log n / L))} factor larger than the smallest edge in the biclique. This ratio will upper bound the maximum leverage score of an edge in this biclique in the original K-graph.

Each biclique in the set of projected distances has a one-to-one correspondence to a biclique in the original K-graph. Thus, to sparsify our K-graph, we sparsify each biclique in the K-graph by uniform sampling, and take the union of the resulting sparsified bicliques. Due to the (2, L)-multiplicatively Lipschitz nature of our function, it is guaranteed that the longest edge in any biclique (measured using K(x, y)) is at most 2^{O(√(L log n))} times the shortest one. This upper bounds the maximum leverage score of an edge in this biclique with respect to the K-graph, which then can be used to upper bound the number of edges we need to sample from each biclique via uniform sampling. We take the union of these sampled edges over all bicliques, which gives our results for high-dimensional sparsification summarized in Theorem 1.3. Details can be found in the proof of Theorem 6.3 in Section 6. When L is constant, we get almost-linear time sparsification algorithms.
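For concreteness, the first step of this pipeline — the random projection — could look as follows; the target dimension and the distortion constants in the comment are indicative, not the paper's exact choices.

```python
import numpy as np

def jl_project(X, L):
    """Project n points onto roughly sqrt(L log n) dimensions with a random
    Gaussian map.  With high probability this distorts every pairwise
    distance by at most a 2^{O(sqrt(log n / L))} factor, which is what the
    multiplicatively Lipschitz analysis of the bicliques needs."""
    n, d = X.shape
    k = max(1, int(np.ceil(np.sqrt(L * np.log(n)))))
    G = np.random.randn(d, k) / np.sqrt(k)
    return X @ G
```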

For low-dimensional sparsification, we skip the Johnson-Lindenstrauss step, and use a (1 + 1/L)-well-separated pair decomposition. This gives us a nearly linear time algorithm for sparsifying (C, L)-multiplicatively Lipschitz functions when (2L)^{O(d)} is small, which covers the case when d is constant and L = n^{o(1)}. See Theorem 6.9 for details.

For K(u, v) = |⟨u, v⟩|, our sparsification algorithm is quite different from the multiplicatively Lipschitz setting. In particular, the fact that K(u, v) = 0 on a large family of pairs u, v ∈ ℝ^d presents challenges. Luckily, though, this kernel does have some nice structure. For simplicity, just consider defining the K-graph on a set of unit vectors. The weight of any edge in this graph is at most 1 by Cauchy-Schwarz. The key structural property of this graph is that for every set S with |S| > d + 1, there is a pair u, v ∈ S for which the u-v edge has weight at least Ω(1/d). In other words, the unweighted graph consisting of edges with weight between Ω(1/d) and 1 does not have independent sets with size greater than d + 1. It turns out that all such graphs are dense (see Proposition 7.6) and that all dense graphs have an expander subgraph consisting of a large fraction of the vertices (see Proposition 7.7). Thus, if this expander could be found in O(n) time, we could partition the graph into expander clusters, sparsify the expanders via uniform sampling, and sparsify the edges between expanders via uniform sampling. It is unclear to us how to identify this expander efficiently, so we instead identify clusters with effective resistance diameter O(poly(d log n)/n). This can be done via uniform sampling and Johnson-Lindenstrauss [SS11]. As part of the proof, we prove a novel Markov-style lower bound on the probability that effective resistances deviate too low in a randomly sampled graph, which may be of independent interest (see Lemma 7.10).

Lower bound techniques  To prove lower bounds on sparsification for decreasing functions that are not (C_L, L)-multiplicatively Lipschitz, we reduce from exact bichromatic nearest neighbors on two sets of points A and B. In high dimensions, nearest neighbors is hard even for Hamming distance [Rub18], so we may assume that A, B ⊆ {0, 1}^d. In low dimensions, we may assume that the coordinates of points in A and B consist of integers on at most O(log n) bits. In both cases, the set of possible distances between points in A and B is discrete. We take advantage of the discrete nature of these distance sets to prove a lower bound. In particular, C_L is set so that C_L is the smallest ratio between any two possible distances between points in A and B. To see this in more detail, see Lemma 8.4.

Let x_f be a point at which the function f is not (C_L, L)-multiplicatively Lipschitz, and suppose that we want to solve the decision problem of determining whether or not min_{a ∈ A, b ∈ B} ‖a − b‖_2^2 ≤ k. We can do this using sparsification by rescaling the points in A and B so that squared distance k maps to x_f, sparsifying the K-graph on the resulting points, and thresholding based on the total weight of the resulting A-B cut. If there is a pair with distance at most k, there is an edge crossing the cut with weight at least f(x_f), because f is a decreasing function. Therefore, the sparsifier has total weight at least f(x_f)/(1 + ε) ≥ f(x_f)/2 crossing the A-B cut, by the cut sparsification approximation guarantee. If there is no pair with distance at most k, then no edge crossing the cut has weight larger than f(C_L x_f) ≤ C_L^{−L} f(x_f) ≤ (1/n^{10}) · f(x_f), by choice of C_L. Therefore, the total weight of the A-B cut is at most (1/n^8) · f(x_f), which means that the weight of the A-B cut in the sparsifier is at most ((1 + ε)/n^8) · f(x_f), allowing the two cases to be distinguished.
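The decision procedure just described can be phrased in a few lines of Python (ours; `sparsify` is a hypothetical cut-sparsifier oracle returning a weighted edge list over the combined point set):

```python
import numpy as np

def has_close_pair(A, B, k, x_f, f, sparsify, eps=0.5):
    """Decide whether some a in A, b in B have ||a - b||_2^2 <= k, given a
    sparsifier oracle for the K-graph with K(u, v) = f(||u - v||_2^2).
    Points are rescaled so that squared distance k maps to x_f, the point
    where f fails to be multiplicatively Lipschitz; the A-B cut weight in
    the sparsifier then differs by a poly(n) factor between the two cases."""
    P = np.vstack([A, B]) * np.sqrt(x_f / k)
    n_A = A.shape[0]
    edges = sparsify(P, f)                         # iterable of (i, j, weight)
    cut = sum(w for i, j, w in edges if (i < n_A) != (j < n_A))
    return cut >= f(x_f) / (1.0 + eps)
```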

1.3 Brief summary of our results in terms of p_f

Before proceeding to the body of the paper, we summarize our results. Recall that we consider three linear-algebraic problems in this paper, along with two different dimension settings (low and high). This gives six different settings to consider. We now define p_f in each of these settings. In all

high-dimensional settings, we have found a definition of p_f that characterizes the complexity of the problem. In some low-dimensional settings, we do not know of a suitable definition for p_f and leave this as an open problem. For simplicity, we focus here only on decreasing functions f, although all of our algorithms, and most of our hardness results, hold for more general functions as well.

Dimension            Multiplication     Sparsification     Solving
d = poly(log n)      f_1                f_1, f_2           f_1
d = c^{log* n}       f_1, f_2, f_3      f_1, f_2, f_3      f_1, f_2, f_3

Table 1: Functions among f_1, f_2, f_3, f_4 that have almost-linear time algorithms

1. Adjacency matrix-vector multiplication
   (a) High dimensions: p_f is the minimum degree of any polynomial that 1/2^{poly(log n)}-additively approximates f. p_f > Ω(log n) implies subquadratic-time hardness (Theorem 1.1, part 2), while p_f < o(log n) implies an almost-linear time algorithm (Theorem 1.1, part 1).
   (b) Low dimensions: Not completely understood. The fast multipole method yields an almost-linear time algorithm for some functions, like the Gaussian kernel (Theorem 2.1), but functions exist that are hard in low dimensions (Proposition 5.31).

2. Sparsification. In both settings, p_f is the minimum value for which f is (C, p_f)-multiplicatively Lipschitz, where C = 1 + 1/p_f^c for some constant c > 0 independent of f.
   (a) High dimensions: If p_f > Ω(log^2 n) and f is nonincreasing, then no subquadratic time algorithm exists (Theorem 1.4). If p_f < o(log n), then an almost-linear time algorithm for sparsification exists (Theorem 1.3).
   (b) Low dimensions: There is some constant t > 1 such that if p_f > Ω(n^t) and f is nonincreasing, then no subquadratic time algorithm exists (Theorem 2.5). If p_f < n^{o(1/d)}, then an almost-linear time algorithm exists (Theorem 2.4).

3. Laplacian system solving
   (a) High dimensions: p_f is the maximum of the p_f values in the Adjacency matrix-vector multiplication and Sparsification settings, with hardness occurring for decreasing functions f if p_f > Ω(log^2 n) (Corollary 1.7 combined with Theorem 1.8) and an algorithm existing when p_f < o(log n) (Theorem 1.6).
   (b) Low dimensions: Not completely understood, as in the low-dimensional multiplication setting. As in the sparsification setting, we are able to show that there is a constant t such that if f is nonincreasing and p_f > Ω(n^t), where p_f is defined as in the Sparsification setting, then no subquadratic time algorithm exists (Theorem 8.6).

Many of our results are not tight for two reasons: (a) some of the hardness results only apply to decreasing functions, and (b) there are gaps in p_f values between the upper and lower bounds. However, neither of these concerns is important in most applications, as (a) weight often decreases as a function of distance and (b) p_f values for natural functions are often either very low or very high. For example, p_f > Ω(polylog(n)) for all problems for the Gaussian kernel (f(x) = e^{−x}), while p_f = O(1) for sparsification and p_f > Ω(polylog(n)) for multiplication for the gravitational potential (f(x) = 1/x). Resolving the gap may also be difficult, as for intermediate values of p_f, the true best running time is likely an intermediate running time of n^{1+c+o(1)} for some constant 0 < c < 1.

1.4 Summary of our Results on Examples

To understand our results better, we illustrate how they apply to some examples. For each of the functions f_i given below, make the K_i-graph, where K_i(u, v) = f_i(‖u − v‖_2^2):

1. f_1(z) = z^k for a positive integer constant k.
2. f_2(z) = z^c for a negative constant or a positive non-integer constant c.
3. f_3(z) = e^{−z} (the Gaussian kernel).
4. f_4(z) = 1 if z ≤ θ and f_4(z) = 0 if z > θ, for some parameter θ > 0 (the threshold kernel).

In Table 1, we summarize for which of the above functions there are efficient algorithms and for which we have hardness results. There are six regimes, corresponding to three problems (multiplication, sparsification, and solving) and two dimension regimes (d = poly(log n) and d = c^{log* n}). A function is placed in a table cell if an almost-linear time algorithm exists, where runtimes are n^{1+o(1)} (log(αn/ε))^t in the case of multiplication and system solving and n^{1+o(1)} (log(αn))^t / ε^2 in the case of sparsification, for some t ≤ O(log^{1−δ} n) for some δ > 0. Moreover, for each of these functions f_1, f_2, f_3, and f_4, if it does not appear in a table cell, then we show a lower bound that no subquadratic time algorithm exists in that regime assuming SETH.

1.5 Other Related Work

Linear Program Solvers  Linear programming is a fundamental problem in convex optimization. There is a long list of work focused on designing fast algorithms for linear programs [Dan47, Kha80, Kar84, Vai87, Vai89, LS14, LS15, CLS19, LSZ19a, Son19, Bra20, BLSS20, SY20, JSWZ20]. For dense input matrices, the state-of-the-art algorithm [JSWZ20] takes n^{max{ω, 2+1/18}} log(1/ε) time, where ω is the exponent of matrix multiplication [AW21]. The solver can run faster when the matrix A has structure, e.g., when it is a Laplacian matrix.

Laplacian System Solvers  It is well understood that a Laplacian linear system can be solved in time Õ(m log(1/ε)), where m is the number of edges in the graph generating the Laplacian [ST04, KMP10, KMP11, KOSZ13, LS13, CKM+14, KLP+16, KS16]. This algorithm is very efficient when the graph is sparse. However, in our setting, where the K-graph is dense but succinctly describable by only n points in ℝ^d, we aim for much faster algorithms.

Algorithms for Kernel Density Function Approximation  A recent line of work by Charikar et al. [CS17, BCIS18] also studies the algorithmic KDE problem. They show, among other things, that kernel density functions for "smooth" kernels K can be estimated in time which depends only polynomially on the dimension d, but which also depends polynomially on 1/ε. We are unfortunately unable to use their algorithms in our setting, where we need to solve KAdjE with ε = n^{−Ω(1)}, and the algorithms of Charikar et al. do not run in subquadratic time. We instead design and make use of algorithms whose running times have only polylogarithmic dependences on ε, but often have exponential dependences on d.

Kernel Functions  Kernel functions are useful functions in data analysis, with applications in physics, machine learning, and computational biology [Sou10]. There are many kernels studied and applied in the literature; we list here most of the popular examples.

The following kernels are of the form K(x, y) = f(‖x − y‖_2^2), which we study in this paper: the Gaussian kernel [NJW02, RR08], exponential kernel, Laplace kernel [RR08], rational quadratic kernel, multiquadric kernel [BG97], inverse multiquadric kernel [Mic84, Mar12], circular kernel [BTFB05], spherical kernel, power kernel [FS03], log kernel [BG97, Mar12], Cauchy kernel [RR08], and generalized T-Student kernel [BTF04].

For these next kernels, it is straightforward that their corresponding graphs have low-rank adjacency matrices, and so efficient linear algebra is possible using the Woodbury Identity (see Section 3.4 below): the linear kernel [SSM98, MSS+99, Hof07, Shl14] and the polynomial kernel [CV95, GE08, BHOS+08, CHC+10].

Finally, the following relatively popular kernels are not of the form we directly study in this paper, and we leave extending our results to them as an important open problem: the Hyperbolic tangent (Sigmoid) kernel [HS97, BTB05a, JKL09, KSH12, ZSJ+17, ZSD17, SSB+17, ZSJD19], spline kernel [Gun98, Uns99], B-spline kernel [Ham04, Mas10], Chi-Square kernel [VZ12], and the histogram intersection kernel and generalized histogram intersection kernel [BTB05b].

More interestingly, our results can also be applied to the Neural Tangent Kernel [JGH18], which plays a crucial role in recent work on the convergence of neural network training [LL18, DZPS19, AZLS19b, AZLS19a, SY19, BPSW21, LSS+20, JMSZ20]. For more details, we refer the readers to Section 10.

Acknowledgements The authors would like to thank Lijie Chen for helpful suggestions in the hardness section and explanation of his papers. The authors would like to thank Sanjeev Arora, Simon Du, and Jason Lee for useful discussions about the neural tangent kernel.

2 Summary of Low Dimensional Results

In the results we've discussed so far, we show that in high-dimensional settings, the curse of dimensionality applies to a wide variety of functions that are relevant in applications, including the Gaussian kernel and inverse polynomial kernels. Luckily, in many settings, the points supplied as input are very low-dimensional. In the classic n-body problem, for example, the input points are 3-dimensional. In this section, we discuss our results pertaining to whether algorithms with runtimes exponential in d exist; such algorithms can still be efficient in low dimensions d = o(log n).

2.1 Multiplication

The prior work on the fast multipole method [GR87, GR88, GR89] yields algorithms with runtime (log(n/ε))^{O(d)} n^{1+o(1)} for ε-approximate adjacency matrix-vector multiplication for a number of functions K, including when K(u, v) = 1/‖u − v‖_2^c for a constant c and when K(u, v) = e^{−‖u−v‖_2^2}. In order to explain what functions K the fast multipole methods work well for, and to clarify dependencies on d in the literature, we give a complete exposition of how the fast multipole method of [GS91] works on the Gaussian kernel:

Theorem 2.1 (fast Gaussian transform [GS91], exposition in Section 9). Let K(x, y) = exp(−‖x − y‖_2^2). Given a set of points P ⊂ ℝ^d with |P| = n, let G denote the K-graph on P. For any vector u ∈ ℝ^n and accuracy parameter ε, there is an algorithm that runs in n log^{O(d)}(‖u‖_1/ε) time to approximate A_G · u within ε additive error.

The fast multipole method is fairly general, and so similar algorithms also exist for a number of other functions K; see Section 9.3 for further discussion. Unlike in the high-dimensional case, we do not have a characterization of the functions for which almost-linear time algorithms exist in near-constant dimension. We leave this as an open problem. Nonetheless, we are able to show lower bounds, even in barely super-constant dimension d = exp(log*(n)), on adjacency matrix-vector multiplication for kernels that are not multiplicatively Lipschitz:

Theorem 2.2 (Informal version of Proposition 5.31). For some constant c > 1, any function f that is not (C, L)-multiplicatively Lipschitz for any constants C > 1, L > 1 does not have an n^{1+o(1)}-time adjacency matrix-vector multiplication (up to 2^{−poly(log n)} additive error) algorithm in c^{log* n} dimensions assuming SETH, when K(u, v) = f(‖u − v‖_2^2).

Theorem 2.3 (Informal version of Corollary 5.28). For some constant c > 1, assuming SETH, poly(log n) log∗ n adjacency matrix-vector multiplication (up to 2− additive error) in c dimensions cannot be done in subquadratic time in dimension d = exp(log∗(n)) when K(u, v)= u, v . |h i| 2.2 Sparsification We are able to give a characterization of the decreasing functions for which sparsification is possible in near-constant dimension. We show that a polynomial dependence on the multiplicative Lipschitz constant is allowed, unlike in the high-dimensional setting:

Theorem 2.4 (Informal version of Theorem 6.9). Let f be a (1 + 1/L, L)-multiplicatively Lipschitz function and let K(u, v) = f(‖u − v‖_2^2). Then a (1 ± ε)-spectral sparsifier for the K-graph on n points can be found in n^{1+o(1)} L^{O(d)} (log α)/ε^2 time.

Thus, geometric graphs for piecewise exponential functions with L = n^{o(1)} can be sparsified in almost-linear time when d is constant, unlike in the case when d = Ω(log n). In particular, spectral clustering can be done in O(k n^{1+o(1)}) time for k clusters in low dimensions. Unfortunately, not all geometric graphs can be sparsified, even in nearly constant dimensions:

Theorem 2.5 (Informal version of Theorem 8.2). There are constants c' ∈ (0, 1), c > 1 and a value C_L given L > 1 for which any decreasing function f that is not (C_L, L)-multiplicatively Lipschitz does not have an O(n L^{c'})-time sparsification algorithm for K-graphs on c^{log* n}-dimensional points, where K(u, v) = f(‖u − v‖_2^2).

This theorem shows, in particular, that geometric graphs of threshold functions are not sparsifiable in subquadratic time even for low-dimensional pointsets. These two theorems together nearly classify the decreasing functions for which efficient sparsification is possible, up to the exponent on L.

2.3 Laplacian solving

As in the case of multiplication, we are unable to characterize the functions for which solving Laplacian systems can be done in almost-linear time in low dimensions. That said, we still have results for many functions K, including most kernel functions of interest in applications. We prove most of these using the aforementioned connection from Section 4: if a K-graph can be efficiently sparsified, then there is an efficient Laplacian multiplier for K-graphs if and only if there is an efficient Laplacian system solver for K-graphs.

For the kernels K(u, v) = 1/‖u − v‖_2^c for constants c and the piecewise exponential kernel, we have almost-linear time algorithms in low dimensions by Theorems 6.3 and 6.9 respectively. Furthermore, the fast multipole method yields almost-linear time algorithms for multiplication. Therefore, there are almost-linear time algorithms for solving Laplacian systems in geometric graphs for these kernels.

A similar approach also yields hardness results. Lemma 1.5 above implies that an almost-linear time algorithm for solving Laplacian systems on K-graphs for K(u, v) = |⟨u, v⟩| yields an almost-linear time algorithm for K-adjacency multiplication. However, no such algorithm exists assuming SETH by Theorem 2.3 above. Therefore, SETH implies that no almost-linear time algorithm for solving Laplacian systems in this kernel can exist.

We directly (i.e. without using a sparsifier algorithm) show an additional hardness result for solving Laplacian systems for kernels that are not multiplicatively Lipschitz, like threshold functions of ℓ_2-distance:

Theorem 2.6 (Informal version of Theorem 8.6). Consider an L > 1. There is some sufficiently large value C_L > 1 depending on L such that for any decreasing function f : ℝ_{≥0} → ℝ_{≥0} that is not (C_L, L)-multiplicatively Lipschitz, no O(n L^{c'} log α)-time algorithm exists for solving Laplacian systems 2^{−poly(log n)}-approximately in the K-graph of a set of n points in c^{log* n} dimensions for some constants c > 1, c' ∈ (0, 1), assuming SETH, where K(u, v) = f(‖u − v‖_2^2).

² Here, log*(n) denotes the very slowly growing iterated logarithm of n.

3 Preliminaries

Our results build off of algorithms and hardness results from many different areas of theoretical computer science. We begin by defining the relevant notation and describing the important past work.

3.1 Notation

For an n ∈ ℕ_+, let [n] denote the set {1, 2, ..., n}.

For any function f, we write Õ(f) to denote f · log^{O(1)}(f). In addition to O(·) notation, for two functions f, g, we use the shorthand f ≲ g (resp. ≳) to indicate that f ≤ Cg (resp. ≥) for an absolute constant C. We use f ≍ g to mean cf ≤ g ≤ Cf for constants c, C.

For a matrix A, we use ‖A‖_2 to denote the spectral norm of A. Let A^⊤ denote the transpose of A. Let A^† denote the Moore-Penrose pseudoinverse of A. Let A^{−1} denote the inverse of a full rank square matrix.

We say a matrix A is positive semi-definite (PSD) if A = A^⊤ and x^⊤ A x ≥ 0 for all x ∈ ℝ^n. We use ⪰, ⪯ to denote the semidefinite ordering, e.g. A ⪰ 0 denotes that A is PSD, and A ⪰ B means A − B ⪰ 0. We say a matrix A is positive definite (PD) if A = A^⊤ and x^⊤ A x > 0 for all x ∈ ℝ^n − {0}. A ≻ B means A − B is PD.

For a vector v, we denote by ‖v‖_p the standard ℓ_p norm. For a vector v and a PSD matrix A, we let ‖v‖_A = (v^⊤ A v)^{1/2}.

The iterated logarithm log* : ℝ → ℤ is given by

    log*(n) = 0,                    if n ≤ 1;
              1 + log*(log n),      otherwise.
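As a quick illustration of the definition just given, a direct Python implementation of the iterated logarithm (taking logarithms base 2, a convention the text does not fix) is:

```python
import math

def log_star(n: float) -> int:
    """Iterated logarithm: log*(n) = 0 if n <= 1, else 1 + log*(log n).
    Base-2 logarithms are used here; changing the base shifts the value
    by at most a small additive constant."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

# log_star(16) == 3 and log_star(65536) == 4.
```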

We use G_grav to denote the gravitational constant.

We define α slightly differently in different sections. Note that both are less than the value of α used in Theorem 1.3:

Table 2

Notation    Meaning                                                        Location
α           max_{i,j} f(‖x_i − x_j‖_2^2) / min_{i,j} f(‖x_i − x_j‖_2^2)    Section 4
α           max_{i,j} ‖x_i − x_j‖_2 / min_{i,j} ‖x_i − x_j‖_2              Section 6

3.2 Graph and Laplacian Notation

Let $G = (V, E, w)$ be a connected weighted undirected graph with $n$ vertices, $m$ edges, and edge weights $w_e > 0$. We say $r_e = 1/w_e$ is the resistance of edge $e$. If we give a direction to the edges of $G$ arbitrarily, we can write its Laplacian as $L_G = B^\top W B$, where $W \in \mathbb{R}^{m \times m}$ is the diagonal matrix with $W(e,e) = w_e$ and $B \in \mathbb{R}^{m \times n}$ is the signed edge-vertex incidence matrix, defined as
$$B(e, v) = \begin{cases} 1, & \text{if } v \text{ is } e\text{'s head}; \\ -1, & \text{if } v \text{ is } e\text{'s tail}; \\ 0, & \text{otherwise.} \end{cases} \qquad (2)$$
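As a concrete illustration of this notation, the following self-contained Python/NumPy sketch (our own addition, not part of the paper) builds $B$, $W$, and $L_G = B^\top W B$ for a small weighted graph and checks that it matches the usual degree-minus-adjacency form of the Laplacian.

import numpy as np

def incidence_and_laplacian(n, edges):
    """edges: list of (u, v, w) with an arbitrary orientation u -> v."""
    m = len(edges)
    B = np.zeros((m, n))          # signed edge-vertex incidence matrix
    W = np.zeros((m, m))          # diagonal edge-weight matrix
    for e, (u, v, w) in enumerate(edges):
        B[e, u] = 1.0             # u is e's head
        B[e, v] = -1.0            # v is e's tail
        W[e, e] = w
    L = B.T @ W @ B               # graph Laplacian L_G = B^T W B
    return B, W, L

# small example: a weighted triangle plus a pendant vertex
edges = [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 0.5), (2, 3, 3.0)]
B, W, L = incidence_and_laplacian(4, edges)

# sanity check against the degree-minus-adjacency definition
A = np.zeros((4, 4))
for u, v, w in edges:
    A[u, v] = A[v, u] = w
L_alt = np.diag(A.sum(axis=1)) - A
assert np.allclose(L, L_alt)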

A useful notion related to Laplacian matrices is the effective resistance of a pair of nodes:

Definition 3.1 (Effective resistance). The effective resistance of a pair of vertices $u, v \in V_G$ is defined as
$$\mathrm{Reff}_G(u, v) = b_{u,v}^\top L_G^\dagger b_{u,v},$$
where $b_{u,v} \in \mathbb{R}^{|V_G|}$ is the all-zero vector except for entries of $1$ at $u$ and $-1$ at $v$.

Using effective resistance, we can define the leverage score:

Definition 3.2 (Leverage score). The leverage score of an edge $e = (u, v) \in E_G$ is defined as
$$l_e = w_e \cdot \mathrm{Reff}_G(u, v).$$

We also define the notion of electrical flow:

Definition 3.3 (Electrical flow). Let $B \in \mathbb{R}^{m \times n}$ be defined as in Eq. (2). For a given demand vector $d \in \mathbb{R}^n$, we define the electrical flow $f \in \mathbb{R}^m$ as follows:
$$f = \arg\min_{f : B^\top f = d} \sum_{e \in E_G} f_e^2 / w_e.$$

We let $d(i)$ denote the degree of vertex $i$. For any set $S \subseteq V$, we define the volume of $S$: $\mu(S) = \sum_{i \in S} d(i)$. It is obvious that $\mu(V) = 2|E|$. For any two sets $S, T \subseteq V$, let $E(S, T)$ be the set of edges connecting a vertex in $S$ with a vertex in $T$. We call $\Phi(S)$ the conductance of a set of vertices $S$, formally defined as
$$\Phi(S) = \frac{|E(S, V \setminus S)|}{\min(\mu(S), \mu(V \setminus S))}.$$

We next define the conductance of a graph, which is standard in the literature of graph partitioning and graph clustering [ST04, KVV04, ACL06, AP09, LRTV11, ZLM13, OZ14].
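Before stating Definition 3.4, here is a quick numerical illustration of Definitions 3.1 and 3.2 (our own sketch, far slower than the data structures discussed later). It computes all effective resistances from the pseudoinverse and checks the standard fact that the leverage scores of a connected graph sum to $n - 1$.

import numpy as np

def effective_resistances(L):
    """All-pairs effective resistances from the Laplacian pseudoinverse."""
    n = L.shape[0]
    Lpinv = np.linalg.pinv(L)
    R = np.zeros((n, n))
    for u in range(n):
        for v in range(n):
            b = np.zeros(n)
            b[u], b[v] = 1.0, -1.0          # the vector b_{u,v}
            R[u, v] = b @ Lpinv @ b         # Reff_G(u, v) = b^T L^+ b
    return R

# weighted triangle plus a pendant vertex (same example as above)
edges = [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 0.5), (2, 3, 3.0)]
n = 4
A = np.zeros((n, n))
for u, v, w in edges:
    A[u, v] = A[v, u] = w
L = np.diag(A.sum(axis=1)) - A

R = effective_resistances(L)
leverage = [w * R[u, v] for u, v, w in edges]   # l_e = w_e * Reff_G(u, v)
assert abs(sum(leverage) - (n - 1)) < 1e-9      # leverage scores sum to n - 1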

Definition 3.4 (Conductance). The conductance of a graph G is defined as follows:

$$\Phi_G = \min_{S \subset V} \Phi(S).$$

Lemma 3.5 ([ST04, AALG18]). A graph $G$ with minimum conductance $\Phi_G$ has the property that for every pair of vertices $u, v$,

$$\mathrm{Reff}_G(u, v) \le O\left(\left(\frac{1}{c_u} + \frac{1}{c_v}\right) \cdot \frac{1}{\Phi_G^2}\right),$$
where $c_u$ is the sum of the weights of edges incident with $u$. Furthermore, for every pair of vertices $u, v$,
$$\mathrm{Reff}_G(u, v) \ge \max(1/c_u, 1/c_v).$$

For a function $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, the K-graph on a set of points $X \subseteq \mathbb{R}^d$ is the graph with vertex set $X$ and edge weights $K(u, v)$ for $u, v \in X$. For a function $f : \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$, the f-graph on a set of points $X$ is defined to be the K-graph on $X$ for $K(u, v) = f(\|u - v\|_2)$.

3.3 Spectral Sparsification via Random Sampling

Here, we state some well-known results from previous work on spectral sparsification via random sampling. The theorems below are essential for our results on sparsifying geometric graphs quickly.

Theorem 3.6 (Oversampling [KMP11]). Consider a graph $G = (V, E)$ with edge weights $w_e > 0$, probabilities $p_e \in (0, 1]$ assigned to each edge, and parameters $\delta \in (0,1)$, $\varepsilon \in (0,1)$. Generate a reweighted subgraph $H$ of $G$ with $q$ edges, with each edge $e$ sampled with probability $p_e / t$ and added to $H$ with weight $w_e t / (p_e q)$, where $t = \sum_{e \in E} p_e$. If

1. $q \ge C \cdot \varepsilon^{-2} \cdot t \log t \cdot \log(1/\delta)$, where $C > 1$ is a sufficiently large constant,
2. $p_e \ge w_e \cdot \mathrm{Reff}_G(u, v)$ for all edges $e = \{u, v\}$ in $G$,

then $(1 - \varepsilon) L_G \preceq L_H \preceq (1 + \varepsilon) L_G$ with probability at least $1 - \delta$.

Algorithm 1
1: procedure Oversampling(G, w, p, ε, δ)   ⊲ Theorem 3.6
2:   t ← Σ_{e∈E} p_e
3:   q ← C · ε^{-2} · t log t · log(1/δ)
4:   Initialize H to be an empty graph
5:   for i = 1 → q do
6:     Sample one e ∈ E with probability p_e / t
7:     Add that edge with weight w_e t / (p_e q) to graph H
8:   end for
9:   return H
10: end procedure
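A direct Python transcription of Algorithm 1 (our own sketch; the constant C and the probability inputs are left to the caller, and repeated samples of the same edge are simply accumulated) might look as follows.

import math
import random
from collections import defaultdict

def oversampling(edges, p, eps, delta, C=10.0):
    """edges: list of (u, v, w); p[i]: sampling probability assigned to edge i.

    Returns the reweighted sampled edges of H as a dict {(u, v): weight},
    following Algorithm 1 / Theorem 3.6.
    """
    m = len(edges)
    t = sum(p[i] for i in range(m))
    q = int(math.ceil(C * eps ** -2 * t * math.log(max(t, 2.0)) * math.log(1.0 / delta)))
    probs = [p[i] / t for i in range(m)]

    H = defaultdict(float)
    for _ in range(q):
        i = random.choices(range(m), weights=probs, k=1)[0]
        u, v, w = edges[i]
        H[(u, v)] += w * t / (p[i] * q)   # each sample contributes weight w_e t / (p_e q)
    return dict(H)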

Theorem 3.7 ([SS11], effective resistance data structure). There is a $\widetilde{O}(m (\log \alpha)/\varepsilon^2)$ time algorithm which, on input $\varepsilon > 0$ and $G = (V, E, w)$ with $\alpha = w_{\max}/w_{\min}$, computes a $(24 \log n / \varepsilon^2) \times n$ matrix $\widetilde{Z}$ such that with probability at least $1 - 1/n$,
$$(1 - \varepsilon)\,\mathrm{Reff}_G(u, v) \le \|\widetilde{Z} b_{uv}\|_2^2 \le (1 + \varepsilon)\,\mathrm{Reff}_G(u, v)$$
for every pair of vertices $u, v \in V$.

The following is an immediate corollary of Theorems 3.6 and 3.7:

Corollary 3.8 ([SS11]). There is a $\widetilde{O}(m (\log \alpha)/\varepsilon^2)$ time algorithm which, on input $\varepsilon > 0$ and $G = (V, E, w)$ with $\alpha = w_{\max}/w_{\min}$, produces a $(1 \pm \varepsilon)$-approximate sparsifier for $G$.

3.4 Woodbury Identity

Proposition 3.9 ([Woo49, Woo50]). The Woodbury matrix identity is
$$(A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1},$$
where $A$, $U$, $C$ and $V$ all denote matrices of the correct (conformable) sizes: for integers $n$ and $k$, $A$ is $n \times n$, $U$ is $n \times k$, $C$ is $k \times k$ and $V$ is $k \times n$.

The Woodbury identity is useful for solving linear systems in a matrix $M$ which can be written as the sum of a diagonal matrix $A$ and a low-rank matrix $UV$ for $k \ll n$ (setting $C = I$).

3.5 Tail Bounds

We will use several well-known tail bounds from probability theory.

Theorem 3.10 (Chernoff Bounds [Che52]). Let $X = \sum_{i=1}^n X_i$, where $X_i = 1$ with probability $p_i$ and $X_i = 0$ with probability $1 - p_i$, and all $X_i$ are independent. Let $\mu = \mathbb{E}[X] = \sum_{i=1}^n p_i$. Then
1. $\Pr[X \ge (1 + \delta)\mu] \le \exp(-\delta^2 \mu / 3)$, $\forall \delta > 0$;
2. $\Pr[X \le (1 - \delta)\mu] \le \exp(-\delta^2 \mu / 2)$, $\forall\, 0 < \delta < 1$.

Theorem 3.11 (Hoeffding bound [Hoe63]). Let $X_1, \cdots, X_n$ denote $n$ independent bounded variables in $[a_i, b_i]$. Let $X = \sum_{i=1}^n X_i$; then we have
$$\Pr\big[|X - \mathbb{E}[X]| \ge t\big] \le 2 \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$

3.6 Fine-Grained Hypotheses

Strong Exponential Time Hypothesis. Impagliazzo and Paturi [IP01] introduced the Strong Exponential Time Hypothesis (SETH) to address the complexity of CNF-SAT. Although it was originally stated only for deterministic algorithms, it is now common to extend SETH to randomized algorithms as well.

Hypothesis 3.12 (Strong Exponential Time Hypothesis (SETH)). For every $\varepsilon > 0$ there exists an integer $k \ge 3$ such that CNF-SAT on formulas with clause size at most $k$ (the so-called $k$-SAT problem) and $n$ variables cannot be solved in $O(2^{(1-\varepsilon)n})$ time, even by a randomized algorithm.

Orthogonal Vectors Conjecture. The Orthogonal Vectors (OV) problem asks: given $n$ vectors $x_1, \cdots, x_n \in \{0,1\}^d$, are there $i, j$ such that $\langle x_i, x_j \rangle = 0$ (where the inner product is taken over $\mathbb{Z}$)? It is easy to see that $O(n^2 d)$ time suffices for solving OV, and slightly subquadratic-time algorithms are known in the case of small $d$ [AWY15, CW16]. It is conjectured that there is no OV algorithm running in $n^{1.99}$ time when $d = \omega(\log n)$.

Conjecture 3.13 (Orthogonal Vectors Conjecture (OVC) [Wil05, AWW14]). For every $\varepsilon > 0$, there is a $c \ge 1$ such that OV cannot be solved in $n^{2-\varepsilon}$ time on instances with $d = c \log n$.

In particular, it is known that SETH implies OVC [Wil05]. SETH and OVC are the most common hardness assumptions in fine-grained complexity theory, and they are known to imply tight lower bounds for a number of algorithmic problems throughout computer science. See, for instance, the survey [Wil18] for more background.
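For reference, the trivial $O(n^2 d)$ OV algorithm mentioned above is just a double loop (a minimal sketch of ours):

def has_orthogonal_pair(vectors):
    """Brute-force OV: O(n^2 d) check for a pair with zero inner product over Z."""
    n = len(vectors)
    for i in range(n):
        for j in range(i + 1, n):
            if sum(a * b for a, b in zip(vectors[i], vectors[j])) == 0:
                return True
    return False

# example: the first and last vectors are orthogonal
print(has_orthogonal_pair([[1, 0, 1], [1, 1, 0], [0, 1, 0]]))  # True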

3.7 Dimensionality Reduction

We make use of the following binary version of the Johnson-Lindenstrauss lemma due to Achlioptas [Ach03]:

Theorem 3.14 ([JL84, Ach03]). Given fixed vectors v , . . . , v Rd and ε > 0, let Q Rk d be 1 n ∈ ∈ × a random 1/√k matrix (i.e. independent Bernoulli entries) with k 24(log n)/ε2. Then with ± ≥ probability at least 1 1/n, − (1 ε) v v 2 Qv Qv 2 (1 + ε) v v 2 − k i − jk ≤ k i − jk ≤ k i − jk for all pairs i, j [n]. ∈ We will also use the following variant of Johnson Lindenstrauss for Euclidean space, for random projections onto o(log n) dimensions:
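A quick numerical illustration of Theorem 3.14 (our own sketch; the constant 24 is taken from the statement above, and the printed distortion is just an observed value):

import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 100, 5000, 0.5
V = rng.standard_normal((n, d))

k = int(np.ceil(24 * np.log(n) / eps ** 2))
Q = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)   # random +-1/sqrt(k) matrix
W = V @ Q.T                                             # projected points Q v_i

# largest relative distortion of squared pairwise distances
worst = 0.0
for i in range(n):
    for j in range(i + 1, n):
        orig = np.sum((V[i] - V[j]) ** 2)
        proj = np.sum((W[i] - W[j]) ** 2)
        worst = max(worst, abs(proj / orig - 1.0))
print(f"k = {k}, worst observed distortion = {worst:.3f} (should be at most about {eps})")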

Lemma 3.15 (Ultralow Dimensional Projection [JL84, DG03], see Theorem 8.2 in [She17] for example). For k = o(log n), with high probability the maximum distortion in pairwise distance obtained from projecting n points into k dimensions (with appropriate scaling) is at most nO(1/k).

3.8 Nearest Neighbor Search

Our results will make use of a number of prior results, both algorithms and lower bounds, for nearest neighbor search problems.

Nearest Neighbor Search Data Structures

Problem 3.16 ([And09, Raz17], data-structure ANN). Given an $n$-point dataset $P$ in $\mathbb{R}^d$ with $d = n^{o(1)}$, the goal is to preprocess it to answer the following queries. Given a query point $q \in \mathbb{R}^d$ such that there exists a data point within $\ell_p$ distance $r$ from $q$, return a data point within $\ell_p$ distance $cr$ from $q$.

Theorem 3.17 ([AI06]). There exists a data structure that returns a $2c$-approximation to the nearest neighbor distance in $\ell_2$, with preprocessing time and space $O_c(n^{1 + 1/c^2 + o_c(1)} + nd)$ and query time $O_c(d \cdot n^{1/c^2 + o_c(1)})$.

Hardness for Approximate Hamming Nearest Neighbor Search. We provide the definition of the Approximate Nearest Neighbor search problem:

Problem 3.18 (monochromatic ANN). The monochromatic Approximate Nearest Neighbor (ANN) problem is defined as: given a set of $n$ points $x_1, \cdots, x_n \in \mathbb{R}^d$ with $d = n^{o(1)}$, the goal is to compute an $\alpha$-approximation of $\min_{i \ne j} \mathrm{dist}(x_i, x_j)$.

Theorem 3.19 ([SM19]). Let $\mathrm{dist}(x,y)$ be the $\ell_p$ distance. Assuming SETH, for every $\delta > 0$, there is an $\varepsilon > 0$ such that the monochromatic $(1+\varepsilon)$-ANN problem for dimension $d = \Omega(\log n)$ requires time $n^{1.5 - \delta}$.

Problem 3.20 (bichromatic ANN). Let $\mathrm{dist}(\cdot, \cdot)$ denote some distance function and let $\alpha > 1$ denote some approximation factor. The bichromatic Approximate Nearest Neighbor (ANN) problem is defined as: given two sets $A, B$ of vectors in $\mathbb{R}^d$, the goal is to compute an $\alpha$-approximation of $\min_{a \in A, b \in B} \mathrm{dist}(a, b)$.

Theorem 3.21 ([Rub18]). Let $\mathrm{dist}(x,y)$ be any of the Euclidean, Manhattan, Hamming ($\|x - y\|_0$), or edit distances. Assuming SETH, for every $\delta > 0$, there is an $\varepsilon > 0$ such that the bichromatic $(1+\varepsilon)$-ANN problem for dimension $d = \Omega(\log n)$ requires time $n^{2-\delta}$.

By comparison, the best known algorithm for $d = \Omega(\log n)$ for each of these distance measures other than edit distance runs in time about $dn + n^{2 - \Omega(\varepsilon^{1/3}/\log(1/\varepsilon))}$ [ACW16].

Hardness for Z-Max-IP

Problem 3.22. For $n, d \in \mathbb{N}$, the Z-MaxIP problem for dimension $d$ asks: given two sets $A, B$ of vectors from $\mathbb{Z}^d$, compute

$$\max_{a \in A,\, b \in B} \langle a, b \rangle.$$

Z-MaxIP is known to be hard even when the dimension $d$ is barely superconstant:

Theorem 3.23 (Theorem 1.14 in [Che18]). Assuming SETH (or OVC), there is a constant $c$ such that any exact algorithm for Z-MaxIP for $d = c^{\log^* n}$ dimensions requires $n^{2-o(1)}$ time, with vectors of $O(\log n)$-bit entries.

It is believed that Z-MaxIP cannot be solved in truly subquadratic time even in constant dimension [Che18]. Even for $d = 3$, the best known algorithm runs in $O(n^{4/3})$ time and has not been improved for decades:

Theorem 3.24 ([Mat92, AESW91, Yao82]). Z-MaxIP for $d = 3$ can be solved in time $O(n^{4/3})$. For general $d$, it can be solved in $n^{2 - \Theta(1/d)}$ time.

The closely related problem of ℓ2-nearest neighbor search is also hard in barely superconstant dimension:

Theorem 3.25 (Theorem 1.16 in [Che18]). Assuming SETH (or OVC), there is a constant $c$ such that any exact algorithm for bichromatic $\ell_2$-closest pair for $d = c^{\log^* n}$ dimensions requires $n^{2-o(1)}$ time, with vectors of $c_0 \log n$-bit entries, for some constants $c > 1$ and $c_0 > 1$.

3.9 Geometric Laplacian System

Building off of a long line of work on Laplacian system solving [ST04, KMP10, KMP11, KOSZ13, CKM+14], we study the problem of solving geometric Laplacian systems:

Problem 3.26. Let $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. We are given a set of points $x_1, \cdots, x_n \in \mathbb{R}^d$, a vector $b \in \mathbb{R}^n$, and an accuracy parameter $\varepsilon$. Let $G$ denote the graph that has $n$ vertices and in which each edge $(i,j)$'s weight is $K(x_i, x_j)$, and let $L_G$ denote the Laplacian matrix of $G$. The goal is to output a vector $u \in \mathbb{R}^n$ such that

$$\|u - L_G^\dagger b\|_{L_G} \le \varepsilon \|L_G^\dagger b\|_{L_G},$$
where $L_G^\dagger$ denotes the pseudo-inverse of $L_G$ and the matrix norm is defined as $\|c\|_A = \sqrt{c^\top A c}$.
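To make Problem 3.26 concrete, the following sketch (ours; it uses a dense pseudoinverse, so it is only a correctness reference, not an efficient solver) builds the K-graph Laplacian for the Gaussian kernel as one example and measures the $L_G$-norm error of a candidate solution.

import numpy as np

def kernel_laplacian(X, K):
    """Laplacian of the complete K-graph on the rows of X."""
    n = X.shape[0]
    W = np.array([[K(X[i], X[j]) if i != j else 0.0 for j in range(n)]
                  for i in range(n)])
    return np.diag(W.sum(axis=1)) - W

def lg_norm(c, L):
    return np.sqrt(c @ L @ c)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
K = lambda u, v: np.exp(-np.sum((u - v) ** 2))      # Gaussian kernel

L = kernel_laplacian(X, K)
b = rng.standard_normal(50)
b -= b.mean()                                       # put b in the range of L_G
u_star = np.linalg.pinv(L) @ b                      # exact solution L_G^+ b
u_approx = u_star + 1e-6 * rng.standard_normal(50)  # a perturbed candidate solution

err = lg_norm(u_approx - u_star, L) / lg_norm(u_star, L)
print("relative error in the L_G norm:", err)       # this is the epsilon of Problem 3.26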

4 Equivalence of Matrix-Vector Multiplication and Solving Linear Systems

In this section, we show that for linear systems with sparse preconditioners, approximately solving them is equivalent to approximate matrix multiplication. We begin by formalizing our notion of approximation. Definition 4.1. Given a matrix M and a vector x, we say that b is an ε-approximate multiplication of Mx if

$$(b - Mx)^\top M^\dagger (b - Mx) \le \varepsilon \cdot x^\top M x.$$
Given a vector $d$, we say that a vector $y$ is an $\varepsilon$-approximate solution to $My = d$ if $y$ is an $\varepsilon$-approximate multiplication of $M^\dagger d$.

Before stating the desired reductions, we state a folklore fact about the Laplacian norm:

Proposition 4.2 (property of Laplacian norm). Consider a $w$-weighted $n$-vertex connected graph $G$ and let $w_{\min} = \min_{u,v \in G} w_{uv}$, $w_{\max} = \max_{u,v \in G} w_{uv}$, and $\alpha = w_{\max}/w_{\min}$. Then, for any vector $x \in \mathbb{R}^n$ that is orthogonal to the all ones vector,
$$\frac{w_{\min}}{2n^4\alpha^2} \|x\|_\infty^2 \le \|x\|_{L_G}^2 \le n^2 w_{\max} \|x\|_\infty^2,$$
and for any vector $b \in \mathbb{R}^n$ orthogonal to all ones,
$$\frac{1}{n^2 w_{\max}} \|b\|_\infty^2 \le \|b\|_{L_G^\dagger}^2 \le \frac{2n^4\alpha^2}{w_{\min}} \|b\|_\infty^2.$$

Proof. Lower Bound for LG: Let λmin and λmax denote the minimum nonzero and maximum 1/2 1/2 eigenvalues of DG− LGDG− respectively, where DG is the diagonal matrix of vertex degrees. 2 2 Since G is connected, all cuts have conductance at least wmin/(n wmax) = 1/(n α). Therefore, by Cheeger’s Inequality [Che70], λ 1/(2n4α2). It follows that, min ≥ 2 x = x⊤L x k kLG G 4 2 (x⊤D x)/(2n α ) ≥ G 2 4 2 x wmin/(2n α ) ≥ k k∞ as desired. Upper bound for L : λ 1. Therefore, G max ≤ 2 2 2 x L x⊤DGx n wmax x k k G ≤ ≤ k k∞ as desired.

Lower bound for LG† : 2 2 2 2 b † (1/λmax) b D−1 1/(n wmax) b k kLG ≥ k k G ≥ k k∞

Upper bound for LG† : 2 2 4 2 2 b † (1/λmin) b D−1 (2n α /wmin) b k kLG ≤ k k G ≤ k k∞

20 Proposition 4.2 implies the following equivalent definition of ε-approximate multiplication:

Corollary 4.3. Let G be a connected n-vertex graph with edge weights we e E(G) with wmin = { } ∈ n mine E(G) we, wmax = maxe E(G) we, and α = wmax/wmin and consider any vectors b, x R . If ∈ ∈ ∈

b LGx εwmax x k − k∞ ≤ k k∞ 3 2 ,b⊤1 = 0, and x⊤1 = 0, then b is an 2n α ε-approximate multiplication of LGx.

Proof. Since b 1 = 0, (b L x) 1 = 0 also. By the upper bound for L† -norms in Proposition 4.2, ⊤ − G ⊤ G 4 2 2 2n α 2 4 4 2 2 b L x † b L x 2n α ε w x G L G min k − k G ≤ wmin k − k∞ ≤ k k∞

Since x⊤1 = 0, x has both nonnegative and nonpositive coordinates. Therefore, since G is connected, there exists vertices a, b in G for which a, b is an edge and for which xa xb x /n. { } | − | ≥ k k∞ Therefore,

2 2 2 x⊤LGx wab(xa xb) (wmin/n ) x ≥ − ≥ k k∞ Substitution shows that

2 4 4 2 2 b LGx † 2n α ε (n x⊤LGx) k − kLG ≤ This is the desired result by definition of ε-approximate multiplication.

Corollary 4.4. Let G be a connected n-vertex graph with edge weights we e E(G) with wmin = { } ∈ Rn mine E(G) we, wmax = maxe E(G) we, and α = wmax/wmin and consider any vectors b, x . If b ∈ 3 2 ∈ ∈ is an ε/(2n α )-approximate multiplication of LGx and b⊤1 = 0, then

b LGx εwmin x k − k∞ ≤ k k∞ Proof. Since b 1 = 0, (b L x) 1 = 0 as well. By the lower bound for L† -norms in Proposition ⊤ − G ⊤ G 4.2 and the fact that b is an approximate multiplication for LGx,

2 2 2 2 ε wmin b LGx n αwmin b LGx † 4 3 x⊤LGx. k − k∞ ≤ k − kLG ≤ 4n α Notice that

2 2 2 x⊤LGx = wab(xa xb) n wmax(4 x ). − ≤ k k∞ a,b E(G) { X}∈ Therefore, by substitution,

2 2 ε wmin 2 2 2 2 2 b LGx ( )(n αwmin(4 x )) wminε x . k − k∞ ≤ 4n4α3 k k∞ ≤ k k∞ Taking square roots gives the desired result.

21 4.1 Solving Linear Systems Implies Matrix-Vector Multiplication

Lemma 4.5. Consider an n-vertex w-weighted graph G, let wmin = mine G we, wmax = maxe G we, ∈ ∈ α = wmax/wmin, and H be a known graph for which

(1 1/900)L L (1 + 1/900)L . − G  H  G Suppose that H has at most Z edges and suppose that there is a (n, δ)-time algorithm SolveG(b, δ) T that, when given a vector b Rn and δ (0, 1), returns a vector x Rn with ∈ ∈ ∈

x L† b δ L† b . k − G kLG ≤ · k G kLG Then, given a vector x Rn and an ε (0, 1), there is a ∈ ∈ O((Z + (n, ε/(n2α))) log(Znα/ε)) T -time algorithm MultiplyG(x, εe) (Algorithm 2) that returns a vector b Rn for which ∈

b LGx ε wmin x . k − k∞ ≤ · · k k∞ The algorithm MultiplyG (Algorithm 2) uses standard preconditioned iterative refinement. It is applied in the opposite from the usual way. Instead of using iterative refinement to solve a linear system given matrix-vector multiplication, we use iterative refinement to do matrix-vector multiplication given access to a linear system solver.

Algorithm 2 MultiplyG and MultiplyGAdditive
1: procedure MultiplyG(x, ε)   ⊲ Lemma 4.5, Theorem 4.6
2:   Given: x ∈ R^n, ε ∈ (0,1), the sparsifier H for G, and a system solver for G
3:   Returns: an approximation b to L_G x
4:   return MultiplyGAdditive(x, ε‖x‖_∞ / log^2(αn/ε))
5: end procedure
6: procedure MultiplyGAdditive(x, τ)
7:   if ‖x‖_∞ ≤ τ/(n^{10} α^5) then
8:     return 0
9:   end if
10:  x_main ← SolveG(L_H x, τ√w_min/(n^{10} α^5))
11:  x_res ← x − x_main
12:  b_res ← MultiplyGAdditive(x_res, τ)
13:  return L_H x + b_res
14: end procedure
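For intuition, here is a compact Python rendering of Algorithm 2 (our own sketch, not the paper's code). It treats the solver as a black box; the toy demo below uses H = G and an exact pseudoinverse solver purely to check correctness, and it assumes the input vector is orthogonal to the all-ones vector, as in the proof.

import numpy as np

def multiply_via_solver(solve_G, L_H, x, alpha, eps):
    """Approximate L_G @ x given only a solver for L_G and a sparsifier L_H of G.

    Iterative transcription of MultiplyGAdditive: peel off L_H x, recurse on the residual.
    """
    n = len(x)
    tau = eps * np.max(np.abs(x)) / np.log2(alpha * n / eps) ** 2
    b = np.zeros(n)
    while np.max(np.abs(x)) > tau / (n ** 10 * alpha ** 5):
        b += L_H @ x
        x = x - solve_G(L_H @ x)        # residual x - L_G^+ L_H x shrinks geometrically
    return b

# tiny demo
rng = np.random.default_rng(0)
n = 30
A = np.triu(rng.uniform(0.5, 2.0, (n, n)) * (rng.random((n, n)) < 0.3), 1)
for i in range(n - 1):                   # a path keeps the graph connected
    A[i, i + 1] = max(A[i, i + 1], 1.0)
A = A + A.T
L = np.diag(A.sum(1)) - A
alpha = A[A > 0].max() / A[A > 0].min()

x = rng.standard_normal(n)
x -= x.mean()                            # the reduction assumes 1^T x = 0
b = multiply_via_solver(lambda r: np.linalg.pinv(L) @ r, L, x, alpha, eps=1e-3)
print(np.max(np.abs(b - L @ x)))         # should be tiny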

Proof. In this proof, assume that 1⊤x = 0. If this is not the case, then shifting x so that it is orthogonal to 1 only decreases its ℓ2-norm, which means that the ℓ norm only increases by a factor of √n, so the error only increases by O(log n) additional solves.∞ Reduction in residual and iteration bound: First, we show that

xres (1/14) x k kLG ≤ k kLG

22 Since (I L L† )L (I L† L ) 3(1/900)L , − H G G − G H  G

xres = x xmain k kLG k − kLG x L† L x + L† L x xmain ≤ k − G H kLG k G H − kLG (1/15) x + (1/10000) L† L x ≤ k kLG k G H kLG (1/14) x ≤ k kLG

Let xfinal be the lowest element of the call stack and let k be the number of recursive calls to MultiplyGAdditive (Algorithm 2). By Proposition 4.2,

2 x L x √wmax n and xfinal L xfinal √wmin/(2n α). k k G ≤ k k∞ · k k G ≥ k k∞

By definition of xfinal,

τ ε√wmin x xfinal = k k∞ . k k∞ ≥ (14n10α5) 28n12α5 log(αn/ε)

Therefore,

2 k log ( x / xfinal ) log (αn/ε) ≤ 14 k kLG k kLG ≤ as desired.

Error: We start by bounding error in the LG† norm. Let b be the output of MultiplyGAdditive(x,τ) (Algorithm 2). We bound the desired error recursively:

b LGx L† = LH x + bres LGx L† k − k G k − k G

= bres LG(x LG† LH x) L† k − − k G

= bres LG(x xmain) LG(xmain LG† LH x) L† k − − − − k G

bres LG(x xmain) L† + LG(xmain LG† LH x) L† ≤ k − − k G k − k G

† † = bres LGxres L + xmain LGLH x LG k − k G k − k 10 5 bres LGxres L† + √wminτ/(n α ) ≤ k − k G

Because 0= MultiplyGAdditive(xfinal,τ) (Algorithm 2),

10 5 b LGx L† xfinal LG + k√wminτ/(n α ) k − k G ≤ k k 10 5 n√wmax xfinal + k√wminτ/(n α ) ≤ k k∞ √w τ/(n8α4) ≤ min 8 4 ε√wmin x /(n α ) by τ ε x ≤ k k∞ ≤ k k∞

23 By Proposition 4.2 applied to LG† , b LGx L† b LGx /(n√wmax). Therefore, k − k G ≥ k − k∞

b LGx n√wmax b LGx L† k − k∞ ≤ k − k G 1 n√wmaxε√wmin x ≤ n8α4 k k∞ 1 εwmin x by α = wmax/wmin ≤ k k∞ · n7α3.5 εwmin x ≤ k k∞ as desired. Runtime: There is one call to SolveG and one multiplication by LH per call to MultiplyGAdditive (Algorithm 2). Each multiplication by LH takes O(Z) time. As we have shown, there are only k O(log(nα/ε)) calls to MultiplyGAdditive (Algorithm 2). Therefore, we are done. ≤ 4.2 Matrix-Vector Multiplication Implies Solving Linear Systems The converse is well-known to be true [ST04, KMP11]:

Lemma 4.6 ([ST04]). Consider an n-vertex w-weighted graph G, let wmin = mine G we, wmax = ∈ maxe G we, α = wmax/wmin, and H be a known graph with at most Z edges for which ∈ (1 1/900)L L (1 + 1/900)L . − G  H  G Suppose that, given an ε (0, 1), there is a (n, ε)-time algorithm MultiplyG(x, ε) (Algorithm 2) ∈ T that, given a vector x Rn, returns a vector b Rn for which ∈ ∈ b LGx ε wmin x . k − k∞ ≤ · · k k∞ Then, there is an algorithm SolveG(b, δ) that, when given a vector b Rn and δ (0, 1), returns ∈ ∈ a vector x Rn with ∈ x L† b δ L† b . k − G kLG ≤ · k G kLG in O(Z + (n,δ/(n4α2))) log(Znα/δ) T time. e 4.3 Lower bound for high-dimensional linear system solving We have shown in this section that if a K graph can be efficiently sparsified, then there is an efficient Laplacian multiplier for K graphs if and only if ther is an efficient Laplacian system solver for K graphs. Here we give one example of how this connection can be used to prove lower bounds for Laplacian system solving: Corollary 4.7 (Restatement of Corollary 1.7). Consider a function f that is (2, o(log n))-multiplicatively Lipschitz for which f cannot be ε-approximated by a polynomial of degree at most o(log n). Then, assuming SETH, there is no poly(d log(αn/ε))n1+o(1)-time algorithm for ε-approximately solving Laplacian systems in the K-graph on n points, where K(u, v)= f( u v 2). k − k2 Proof. By Theorem 6.3, there is a poly(d log(α))n1+o(1)-time algorithm for sparsifying the K- graph on n points. Since sparsification is efficient, Lemma 4.5 implies that the existence of a poly(d log(αn/ε))n1+o(1)-time Laplacian solver yields access to a poly(d log(αn/ε))n1+o(1)-time Laplacian multiplier. The existence of this multiplier contradicts Theorem 5.14 assuming SETH, as desired.

5 Matrix-Vector Multiplication

Recall the adjacency and Laplacian matrices of a K graph: for any function $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, and any set $P = \{x_1, \ldots, x_n\} \subseteq \mathbb{R}^d$ of $n$ points, define the matrix $A_{K,P} \in \mathbb{R}^{n \times n}$ by

$$A_{K,P}[i,j] = \begin{cases} K(x_i, x_j), & \text{if } i \ne j; \\ 0, & \text{if } i = j. \end{cases}$$

Similarly, define the matrix $L_{K,P} \in \mathbb{R}^{n \times n}$ by

$$L_{K,P}[i,j] = \begin{cases} -K(x_i, x_j), & \text{if } i \ne j; \\ \sum_{a \in [n] \setminus \{i\}} K(x_i, x_a), & \text{if } i = j. \end{cases}$$

AK,P and LK,P are the adjacency matrixP and Laplacian matrix, respectively, of the complete weighted graph on n nodes where the weight between node i and node j is K(xi,xj). In this section, we study the algorithmic problem of computing the linear transformations defined by these matrices: Problem 5.1 (K Adjacency Evaluation). For a given function K : Rd Rd R, the K Adjacency × →d Evaluation (KAdjE) problem asks: Given as input a set P = x1,...,xn R with P = n and a n n { }⊆ | | vector y R , compute a vector b R such that b AK,P y ε wmax y . ∈ ∈ k − · k∞ ≤ · · k k∞ Problem 5.2 (K Laplacian Evaluation). For a given function K : Rd Rd R, the K Laplacian × →d Evaluation (KLapE) problem asks: Given as input a set P = x1,...,xn R with P = n and a n n { }⊆ | | vector y R , compute a vector b R such that b LK,P y ε wmax y . ∈ ∈ k − · k∞ ≤ · · k k∞ We make a few important notes about these problems:

• In both of the above, problems, wmax := maxu,v P K(u, v) . ∈ | | polylogn • We assume ε = 2− when it is omitted in the above problems. As discussed in Section 4, this is small enough error so that we can apply such an algorithm for K Laplacian Evaluation to solve Laplacian systems, and furthermore, if we can prove hardness for any such ε, it implies hardness for solving Laplacian systems.

polylogn • Note, by Corollary 4.3, that when ε = 2− , the result of KAdjE is an ε-approximate multiplication of AK y (see Definition 4.1), and the result of KLapE is an ε-approximate ,P · multiplication of LK y. ,P · • We will also sometimes discuss the f KAdjE and f KLapE problems for a single-input function f : R R. In this case, we implicitly pick K(u, v)= f( u v 2). → k − k2 Suppose the function K can be evaluated in time T (in this paper we’ve been assuming T = O(1)). Then, both the KAdjE and KLapE problems can be solved in O(Tn2) time, by computing all n2 entries of the matrix and then doing a straightforward matrix-vector multiplication. However,e since the input size to the problem is only O(nd) real numbers, we can hope for much faster algorithms when d = o(n). In particular, we will aim for n1+o(1) time algorithms when d = no(1). For some functions K, like K(x,y) = x y 2, we will show that a running time of n1+o(1) is k − k2 possible for all d = no(1). For others, like K(x,y) = 1/ x y 2 and K(x,y) = exp( x y 2), we k − k2 −k − k2 will show that such an algorithm is only possible when d log(n). More precisely, for these K: ≪ 1. When d = O(log(n)/ log log(n)), we give an algorithm running in time n1+o(1), and

25 2 o(1) 2. For d = Ω(log n), we prove a conditional lower bound showing that n − time is necessary. Finally, for some functions like K(x,y) = x,y , we will show a conditional lower bound showing 2 δ |h Ω(logi| ∗ n) that Ω(n − ) time is required even when d = 2 is just barely super-constant. In fact, assuming SETH, we will characterize the functions f for which the KAdjE and KLapE problems can be efficiently solved in high dimensions d = Ω(log n) in terms of the approximate degree of f (see subsection 5.2 below). The answer is more complicated in low dimensions d = o(log n), and for some functions f we make use of the Fast Multipole Method to design efficient algorithms (in fact, we will see that the Fast Multipole Method solves a problem equivalent to our KAdjE problem).

5.1 Equivalence between Adjacency and Laplacian Evaluation Although our goal in this section is to study the K Laplacian Evaluation problem, it will make the details easier to instead look at the K Adjacency Evaluation problem. Here we show that any running time achievable for one of the two problems can also be achieved for the other (up to a log n factor), and so it will be sufficient in the rest of this section to only give algorithms and lower bounds for the K Adjacency Evaluation problem.

Proposition 5.3. Suppose the KAdjE (Problem 5.1) can be solved in (n, d, ε) time. Then, the T KLapE (Problem 5.2) can be solved in O( (n, d, ε/2)) time. T Proof. We use the K Adjacency Evaluation algorithm with error ε/2 twice, to compute s := AK y ,P · and g := AK ~1, where ~1 is the all-1s vector of length n. We then output the vector z Rn given ,P · ∈ by z = g y s , which can be computed in O(n)= O( (n, d, ε/2)) time. i i · i − i T Proposition 5.4. Suppose the KLapE (Problem 5.2) can be solved in (n, d, ε) time, and that T T satisfies (n +n , d, ε) (n , d, ε)+ (n , d, ε) for all n ,n , d, ε. Then, the KAdjE (Problem 5.1) T 1 2 ≥T 1 T 2 1 2 can be solved in O( (n log n, d, 0.5ε/ log n)) time. T Proof. We will show that the K Adjacency Evaluation problem can be solved in

log n O(2i (n/2i, d, 0.5ε/ log n)) ·T Xi=0 time, and then apply the superadditive identity for T to get the final running time. For a fixed d, we proceed by strong induction on n, and assume the K Adjacency Evaluation problem can be solved in this running time for all smaller values of n. Let a , a Rn be the vectors given by a = y and a = 0 when i [n/2], and a = 0 and ′ ′′ ∈ i′ i i′′ ∈ i′ ai′′ = yi when i>n/2. We first compute z Rn and z Rn as follows: ′ ∈ ′′ ∈

z′ := LK a′ and z′′ := LK a′′ ,P · ,P · in O( (n, d, 0.5ε/ log n)) time. T Next, let y ,y Rn/2 be the vectors given by y = y and y = y for all i [n/2], and let ′ ′′ ∈ i′ i i′′ n/2+i ∈ P , P Rd be given by P = x ,...,x and P = x ,...,x . ′ ′′ ⊆ ′ { 1 n/2} ′′ { n/2+1 n} We recursively compute r , r Rn/2 given by ′ ′′ ∈

r′ = LK ′ y′ and r′′ = LK ′′ y′′. ,P · ,P ·

26 Finally, we can output the vector z Rn given by z = z + r and z = z + r for all ∈ i i′′ i′ n/2+i n/′ 2+i i′′ i [n/2]. Each of the two recursive calls took time ∈

log2(n) i 1 i O(2 − (n/2 , d, 0.5ε/ log n)), ·T Xi=1 and our two initial calls took time O( (n, d, 0.5ε/ log n)), leading to the desired running time. Each T output entry is ultimately the sum of at most 2 log n terms from calls to the given algorithm, and hence has error ε (since we perform all recursive calls with error 0.5ε/ log n and the additive error guarantees in recursive calls can only be more stringent).

Remark 5.5. In both Proposition 5.3 and Proposition 5.4, if the input to the KLapE (resp. KAdjE) problem is a 0, 1 vector, then we only apply the given KAdjE (KLapE) algorithm on 0, 1 vectors. { } { } Hence, the two problems are equivalent even in the special case where the input vector y must be a 0, 1 vector. { } 5.2 Approximate Degree We will see in this section that the key property of a function f : R R for determining whether → f KAdjE is easy or hard is how well it can be approximated by a low-degree polynomial. Definition 5.6. For a positive integer k and a positive real number ε > 0, we say a function f : [0, 1] R is ε-close to a polynomial of degree k if there is a polynomial p : [0, 1] R of degree → → at most k such that, for every x [0, 1], we have f(x) p(x) ε. ∈ | − |≤ The Stone-Weierstrass theorem says that, for any ε > 0, and any continuous function f which is bounded on [0, 1], there is a positive integer k such that f is ε-close to a polynomial of degree k. That said, k can be quite large for some natural and important continuous functions f. For some examples: ℓ ℓ Example 5.7. For the function f(x) = 1/(1 + x), we have f(x)= ∞ ( 1) x for all x [0, 1). ℓ=0 − ∈ Truncating this series to degree O(log(1/ε)) gives a ε approximation on the interval [0, 1/2]. The P following proposition shows that this is optimal up to constant factors. Proposition 5.8. Any polynomial p(x) such that p(x) 1/(1 + x) ε for all x [0, 1/2] has | − | ≤ ∈ degree at least Ω(log(1/ε)). Proof. For such a polynomial p(x), define q(x) := 1 x p(x 1). Thus, the polynomial q has − · − the two properties that q(x) < ε for all x [1, 3/2], and q(0) = 1. By standard properties of | | ∈ the Chebyshev polynomials (see e.g. [SV14, Proposition 2.4]), the polynomial q with those two properties of minimum degree is an appropriately scaled and shifted Chebyshev polynomial, which requires degree Ω(log(1/ε)).

Example 5.9. For the function $f(x) = e^{-x}$, we have $f(x) = \sum_{\ell=0}^{\infty} (-1)^\ell x^\ell/\ell!$ for all $x \in \mathbb{R}_+$. Truncating this series to degree $O(\log(1/\varepsilon)/\log\log(1/\varepsilon))$ gives an $\varepsilon$-approximation on any interval $[0, a]$ for constant $a > 0$. Such a dependence is believed to be optimal, and is known to be optimal if we must approximate $f(x)$ on the slightly larger interval $[0, \log^2(1/\varepsilon)/\log^2\log(1/\varepsilon)]$ [SV14, Section 5].
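A quick numerical check of Examples 5.7 and 5.9 (our own sketch; the printed degrees are observed values for the plain truncated Taylor series, not the constants in the $O(\cdot)$ statements):

import math
import numpy as np

def min_degree_for(f, coeff, xs, eps, max_deg=200):
    """Smallest truncation degree k with max_x |f(x) - sum_{l<=k} coeff(l) x^l| <= eps."""
    vals = f(xs)
    partial = np.zeros_like(xs)
    for k in range(max_deg + 1):
        partial = partial + coeff(k) * xs ** k
        if np.max(np.abs(vals - partial)) <= eps:
            return k
    return None

xs_half = np.linspace(0.0, 0.5, 1000)   # interval from Example 5.7
xs_one = np.linspace(0.0, 1.0, 1000)    # interval from Example 5.9

for eps in (1e-2, 1e-4, 1e-8):
    k1 = min_degree_for(lambda x: 1.0 / (1.0 + x), lambda l: (-1.0) ** l, xs_half, eps)
    k2 = min_degree_for(lambda x: np.exp(-x), lambda l: (-1.0) ** l / math.factorial(l), xs_one, eps)
    print(f"eps={eps:g}: 1/(1+x) needs degree {k1}, exp(-x) needs degree {k2}")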

Ω(log4 n) In both of the above settings, for error ε = n− , the function f is only ε-close to a polynomial of degree ω(log n). We will see in Theorem 5.14 below that this implies that, for each of these functions f, the ε-approximate f KAdjE problem in dimension d = Ω(log n) requires time 2 o(1) n − assuming SETH.

27 5.3 ‘Kernel Method’ Algorithms

Lemma 5.10. For any integer $q \ge 0$, let $K(u,v) = (\|u - v\|_2^2)^q$. The KAdjE problem (Problem 5.1) can be solved exactly (with 0 error) in time $\widetilde{O}\left(n \cdot \binom{2d + 2q - 1}{2q}\right)$.

Proof. The function

$$K(u, v) = \left(\sum_{i=1}^d (u_i - v_i)^2\right)^q$$
is a homogeneous polynomial of degree $2q$ in the variables $u_1, \ldots, u_d, v_1, \ldots, v_d$. Let

V = u , . . . , u , v , . . . , v , { 1 d 1 d} and let T be the set of functions t : V 0, 1, 2,... 2q such that t(v) = 2q. → { } v V We can count that ∈ P 2d + 2q 1 T = − . | | 2q   Hence, there are coefficients c R for each t T such that t ∈ ∈ K(u, v)= c vt(v). (3) t · t T v V X∈ Y∈ Let V = u , . . . , u and V = V V . Define φ : Rd R T by, for t T , u { 1 d} v \ u u → | | ∈ φ (u , . . . , u ) = c u t(ui). u 1 d t t · i u Vu Yi∈ Similarly define φ : Rd R T by, for t T , v → | | ∈

t(vi) φv(v1, . . . , vd)t = vi . v Vv iY∈ It follows from (3) that, for all u, v Rd, we have K(u, v)= φ (u), φ (v) . ∈ h u v i Our algorithm thus constructs the matrix M Rn T whose rows are the vectors φ (x ) for u ∈ ×| | u i i [n], and the matrix M R T n whose columns are the vectors φ (x ) for i [n]. Then, on ∈ v ∈ | |× v i ∈ input y Rn, it computes y := M y R T in O(n T ) time, then z := M y Rn, again in ∈ ′ v · ∈ | | · | | u · ′ ∈ O(n T ) time, and outputs z. Since Mu Mv = AK, x1,...,xn , it follows that the vector we output · | | · { } is the desired z = AK, x1,...,xn y. e { } · e d+q 1 Remark 5.11. The running time in Lemma 5.10 can be improved to O(n − ) with more · q careful work, by noting that each monomial has either ‘x-degree’ or ‘y-degree’ at most d, but we  omit this here since the difference is negligible for our parameters of interest.e
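To see the feature-map idea of Lemma 5.10 in action, here is a sketch (ours) for the simplest case $q = 1$, where $K(u,v) = \|u - v\|_2^2 = \|u\|^2 - 2\langle u, v \rangle + \|v\|^2$, so $A_{K,P}\, y$ can be computed from a constant number of matrix-vector products with the $n \times d$ point matrix; we check it against the brute-force quadratic baseline.

import numpy as np

def adj_multiply_sqdist(X, y):
    """Compute A_{K,P} y exactly for K(u, v) = ||u - v||_2^2 in O(nd) time.

    Uses ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>; the diagonal
    term is zero since ||x_i - x_i||^2 = 0, so no correction is needed.
    """
    sq = np.sum(X ** 2, axis=1)              # ||x_i||^2
    return sq * y.sum() + sq @ y - 2.0 * (X @ (X.T @ y))

# brute-force check against the definition of A_{K,P}
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
y = rng.standard_normal(300)

D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)   # all pairwise ||x_i - x_j||^2
np.fill_diagonal(D2, 0.0)
assert np.allclose(D2 @ y, adj_multiply_sqdist(X, y))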

2(d+q) o(1) Corollary 5.12. Let q, d be positive integers which may be functions of n, such that 2q

28 • when d = Θ(log n) and q < o(log n). If f : R R is a polynomial of degree at most q, and we define K(u, v) := f( u v 2), then the → k − k2 KAdjE problem in dimension d can be solved exactly in n1+o(1) time. Proof. This follows by applying Lemma 5.10 separately to each monomial of f, and summing the results. When d = o(log n) and q O(log n), then we can write d = 1 log n for some a(n) = ω(1), ≤ a(n) and q = b(n) log n for some b(n)= O(1). It follows that · 2(d + q) 2(d + q) = 2q 2d     O(q) = by d = O(q) O(d)   O(q/d)O(d) ≤ = 2O(d log(q/d))

O( log(ab) ) log n log n = 2 a · by d = ,q = b log n a O( log(a) ) log n 2 a · by b = O(1) ≤ < no(1). by a = ω(1) The other cases are similar.

Corollary 5.13. Suppose f : R R is ε/n-close to a polynomial of degree q (Definition 5.6), → 2(d+q) o(1) where q, d are positive integers such that 2q

5.4 Lower Bound in High Dimensions We now prove that in the high dimensional setting, where d = Θ(log n), the algorithm from Corol- lary 5.13 is essentially tight. In that algorithm, we showed that (recalling Definition 5.6) functions f which are ε-close to a polynomial of degree o(log n) have efficient algorithms; here we show a lower bound if f is not ε-close to a polynomial of degree O(log n). Theorem 5.14. Let f : [0, 1] R be an analytic function on [0, 1] and let κ : N [0, 1] be a → → nonincreasing function. Suppose that, for infinitely many positive integers k, f is not κ(k)-close to a polynomial of degree k. 2 Then, assuming SETH, the KAdjE problem for K(x,y)= f( x y 2) in dimension d and error O(d4) d 2 o(1) k − k (κ(d + 1)) on n = 1.01 points requires time n − . This theorem will be a corollary of another result, which is simpler to use in proving lower bounds: Theorem 5.15. Let f : [0, 1] R be an analytic function on [0, 1] and let κ : N [0, 1] be → → a nonincreasing function. Suppose that, for infinitely many positive integers k, there exists an (k+1) xk [0, 1] for which f (xk) > κ(k). ∈ | | 2 Then, assuming SETH, the KAdjE problem for K(x,y)= f( x y 2) in dimension d and error O(d4) d 2 o(1) k − k (κ(d + 1)) on n = 1.01 points requires time n − .

29 To better understand Theorem 5.14 in the context of our dichotomy, think about the following example:

k3 Example 5.16. Consider a function f that is exactly κ(k) = 2− -far from the closest polynomial 1/3 of degree k for every k N. By Corollary 5.13, there is an dlog nn 0 assuming SETH (Theorem 3.21), this suffices. To solve Hamming nearest neighbors a a pair of sets S and T with S = T = n, we set up d + 1 different f-graph adjacency matrix multiplication | | | | problems. In problem i, we scale the points in S and T by a factor of ζi for some ζ0 and translate them by c [0, 1] so that they are in the interval I. Then, with one adjacency multiplication, one ∈ can evaluate an expression Z , where Z = f(c + i2ζ2 x y 2). For each distance i [d], i i x S,y T k − k2 ∈ let u = x S,y T : x y 2 = j . To∈ solve∈ bichromatic nearest neighbors, it suffices to j |{ ∈ ∈ k − k2 }|P compute all of the ujs. This can be done by setting up a linear system in the ujs, where there is one equation for each Zi. The matrix for this linear system has high determinant because f has high (k + 1)-th derivatives on I (Lemma 5.23). Cramer’s Rule can be used to bound the ≤ error in our estimate of the ujs that comes from the error in the multiplication oracle (Lemma 5.24). Therefore, O(d) calls to a multiplication oracle suffices for computing the number of pairs of vertices in S T that are at each distance value. Returning the minimum distance i for which × ui > 0 solves bichromatic nearest neighbors, as desired. In Lemma 5.23, we will make use of the Cauchy-Binet formula:

Lemma 5.17 (Cauchy-Binet formula for infinite matrices). Let k be a positive integer, and for functions A : [k] N R and B : N [k] R, define the matrix C Rk k by, for i, j [k], × → × → ∈ × ∈

∞ C := A B , ij iℓ · ℓj Xℓ=0 and suppose that the sum defining Cij converges absolutely for all i, j. Then,

det(C)= det(A[ℓ ,ℓ , ,ℓ ]) det(B[ℓ ,ℓ , ,ℓ ]), 1 2 · · · k · 1 2 · · · k 1 ℓ1<ℓ2< <ℓ ≤ X··· k

30 where A[ℓ ,ℓ , ,ℓ ] Rk k denotes the matrix whose i, j entry is given by A and B[ℓ ,ℓ , ,ℓ ] 1 2 · · · k ∈ × iℓj 1 2 · · · k denotes the matrix whose i, j entry is given by Bℓij. In this section, we exploit the following property of analytic functions: Proposition 5.18. Consider a function f : [0, 1] R that is analytic. Then, there is a constant → B > 0 depending on f such that for all k 0 and all x [0, 1], f (k)(x)

∞ f(y)= f (i)(x)(y x)i/i!. − Xi=0 Let y0 = argmaxa 0,1 a x . Note that y0 x 1/2. Since f(y0) is a convergent series, there ∈{ } | − | | − |≥ exists a constant Nx dependent on x such that for all i > Nx, the absolute value of the i-th term of the series for f(y ) is at most 1/2. Therefore, for all i > N , f (i)(x) < 2(2i)i!. For all i N , 0 x | | ≤ x f (i)(x) is a constant depending on x, so we are done with this part. Next, we show that f (k)(x) < B4kk! for all x [0, 1] and some constant B depending only | | ∈ on f. Let x i/8 8 be the point that minimizes x x . Note that x x < 1/16. Taylor 0 ∈ { }i=0 | − 0| | − 0| expand f around x0:

∞ f(x)= f (i)(x )(x x )i/i!. 0 − 0 Xi=0 Take derivatives for some k 0 and use the triangle inequality: ≥ ∞ f (k)(x) f (i+k)(x ) x x i/i! | |≤ | 0 || − 0| Xi=0 By the first part, f (i+k)(x ) B 2i+k(i + k)!, so | 0 |≤ x0 ∞ f (k)(x) B 2i+k((i + k)!/i!)(1/16)(i+k) | |≤ x0 Xi=0 Note that (i + k)!/i! (2i)k for i > k (we are done for i k). There is some constant C for which k i ≤ ≤ i∞=0 i 4− = C, so letting B = Bx0 C suffices, as desired. P We now move on to proving the main results of this section, which consists of several steps. Lemma 5.19 (Step 1: high (k + 1)-derivative). Let f : [0, 1] R be an analytic function on → [0, 1] and let κ : N [0, 1] be a nonincreasing function. Suppose that, for some positive integer → k, f is not κ(k)-close to a polynomial of degree k. Then, there exists some x [0, 1] for which ∈ f (k+1)(x) > κ(k). | | k f (ℓ)(0) xℓ Proof. Define the function g : [0, 1] R by g(x) = f(x) · . We claim that, for each → − ℓ=0 ℓ! i 0, 1,...,k + 1 , there is an x [0, 1] such that g(i)(x ) > κ(k). Since g(k+1)(x)= f (k+1)(x), ∈ { } i ∈ | iP| plugging in i = k + 1 into this will imply our desired result. We prove this by induction on i. For the base case i = 0: note that g is the difference between f and a polynomial of degree k, and so by our assumption that f is not κ(k)-close to a polynomial of degree k, there must be an x [0, 1] such that g(x ) > κ(k). 0 ∈ | 0 |

31 (i 1) For the inductive step, consider an integer i [k + 1]. Notice that g − (0) = 0 by definition of ∈ (i 1) g since i k + 1. By the inductive hypothesis, there is an xi 1 [0, 1] with g − (xi 1) > κ(k). ≤ − ∈ | − | Hence, by the mean value theorem, there must be an xi [0,xi 1] such that ∈ − (i) (i 1) (i 1) (i 1) g (xi) g − (xi 1) g − (0) /xi 1 g − (xi 1) > κ(k), | | ≥ | − − | − ≥ | − | as desired.

Proof of Theorem 5.14 given Theorem 5.15. For each of the infinitely many ks for which f is κ(k)- far from a degree k polynomial, Lemma 5.19 implies the existance of an x for which f (k+1)(x ) > k | k | κ(xk). Thus, f satisfies the input condition of Theorem 5.15, so applying Theorem 5.15 proves Theorem 5.14 as desired.

Now, we focus on Theorem 5.15:

Lemma 5.20 (Step 2: high (k+1)-derivative on interval). Let f : [0, 1] R be an analytic function → on [0, 1] and let κ : N [0, 1] be a nonincreasing function. There is a constant B > 0 depending → only on f such that the following holds. Suppose that, for some sufficiently large positive integer k, there is an x [0, 1] for which ∈ f (k+1)(x) > κ(k). Then, there exists an interval [a, b] [0, 1] with the property that both b a | | ⊆ − ≥ κ(k)/(32Bk4k k!) and, for all y [a, b], f (k+1)(y) > κ(k)/2. · ∈ | | Proof. Since f is analytic on [0, 1], Proposition 5.18 applies and it follows that there is a constant B > 0 dependent on f such that for every y [0, 1] and every nonnegative integer m, we have ∈ f (m)(y) Bm4m m!. In particular, for all y [0, 1], we have f (k+2)(y) 16Bk4k k!. Let | | ≤ · ∈ | | ≤ · δ = κ(k)/(32Bk4k k!), then let a = max 0,x δ , and b = min 1,x + δ . We have b a δ = · { − } { } − ≥ κ(k)/(32Bk4k k!), since when k is large enough, we get that δ < 1/2, so we cannot have both a = 0 · and b = 1. Meanwhile, for any y [a, b], we have as desired that ∈ (k+1) (k+1) (k+2) f (y) f (x) x y sup f (y′) | | ≥ | | − | − | · y′ [a,b] | | ∈ > κ(k) δ 16Bk4k k! − · · = κ(k)/2.

Lemma 5.21 (Step 3: high lower derivatives on subintervals). Let f : [0, 1] R be an analytic → function on [0, 1] and let κ : N [0, 1] be a nonincreasing function. There is a constant B > 1 → depending only on f such that the following holds. Suppose that, for some sufficiently large positive integer k, there exists an x [0, 1] for which ∈ f (k+1)(x) > κ(k). Then, there exists an interval [c, d] [0, 1] with the property that both d c > | | ⊆ − k+2 i κ(k)/(128Bk42k+1 k!) and, for all y [c, d] and all i k+1, f (i)(y) > κ(k)2/(64Bk42k+1 k!) − . · ∈ ≤ | | · Proof. We will prove that, for all integers 0 i k + 1, there is an interval [ci, di] [0, 1] such 2k+1 i ≤ ≤ ⊆ that di ci > κ(k)/(32Bk4 − k!), and for all integers i i′ k + 1 and all y [ci, di] we have ′ − · k+2 i′ ≤ ≤ ∈ f (i )(y) > κ(k)2/(64Bk42k+1 k!) − . Plugging in i = 0 gives the desired statement. We will | | · prove this by induction on i, from i = k + 1 to i = 0. The base case i = k + 1 is given (with slightly   better parameters) by Lemma 5.20. For the inductive step, suppose the statement is true for i + 1. We will pick [ci, di] to be a subin- terval of [c , d ], so the inductive hypothesis says that for every y [c , d ] and every integer i< i+1 i+1 ∈ i i

32 ′ (i′) 2 2k+1 k+2 i i′ k + 1 we have f (y) > κ(k) /(64Bk4 k!) − . It thus remains to show that we can ≤ | | · k+2 i further pick c and d such that d c 1 (d c ) and f (i)(y) > κ(k)2/(64Bk42k+1 k!) − i i  i − i ≥ 4 i+1 − i+1  | | · for all y [ci, di]. ∈ k+1 i   Recall that f (i+1)(y) > κ(k)2/(64Bk42k+1 k!) − for all y [c , d ]. Since f (i+1) is | | · ∈ i+1 i+1 continuous, we must have   k+1 i • either f (i+1)(y) > κ(k)2/(64Bk42k+1 k!) − for all such y, ·   k+1 i • or f (i+1)(y) > κ(k)2/(64Bk42k+1 k!) − for all such y. − · Let us assume we are in the first case; the second case is nearly identical. Let δ = (d c )/4, i+1 − i+1 and consider the four subintervals

[ci+1, ci+1 + δ], [ci+1 + δ, ci+1 + 2δ], [ci+1 + 2δ, ci+1 + 3δ], and [ci+1 + 3δ, ci+1 + 4δ].

k+1 i Since f (i+1)(y) > κ(k)2/(64Bk42k+1 k!) − for all y in each of those intervals, we know that · for each of the intervals, letting c denote its left endpoint and d denote its right endpoint, we have  ′  ′ k+1 i (i) (i) 2 2k+1 − f (d′) f (c′) δ κ(k) /(64Bk4 k!) − ≥ · · h i k+1 i 2k+1 i 2 2k+1 − κ(k)/(32Bk4 − k!) κ(k) /(64Bk4 k!) /4 ≥ · · · h ik+2h i i κ(k)2/(64Bk42k+1 k!) − . ≥ · h i (i) In particular, f is increasing on the interval [ci+1, di+1], and if we look at the five points (i) y = ci+1 + a δ for a 0, 1, 2, 3, 4 which form the endpoints of our four subintervals, f increases · ∈ { } k+2 i by more than κ(k)2/(64Bk42k+1 k!) − from each to the next. It follows by a simple case · analysis (on where in our interval f (i) has a root) that there must be one of our four subintervals with  k+2 i f (i)(y) > κ(k)2/(64Bk42k+1 k!) − for all y in the subinterval. We can pick that subinterval | | · as desired.   k+2 To simplify notation in the rest of the proof, we will let ρ(k)= κ(k)2/(64Bk42k+1 k!) . We · now use these properties of f to reason about a certain matrix connected to f that can be used to   count the number of pairs of points at each distance.

Definition 5.22 (Counting matrix). For a function f : R R, an integer k 1, and a function → ≥ ρ : N R , let [c, d] be the interval from Lemma 5.21. Define the counting matrix be the k k → >0 × matrix M for which

M = f(c + (ρ(k)/(B(200k)k ))10k i j/k2). ij · · Lemma 5.23 (Step 4: determinant lower bound for functions f that are far from polynomials using Cauchy-Binet). Let f : [0, 1] R be an analytic function on [0, 1], let κ : N [0, 1] be a → ℓ+2 → nonincreasing function, and let ρ(ℓ)= κ(ℓ)2/(64Bℓ42ℓ+1 ℓ!) (as discussed before). Let k be an · integer for which there exists an x [0, 1] with f (k+1)(x) > κ(k). ∈  | |  Let M be the counting matrix (Definition 5.22) for f, k, and ρ. Then

3 det(M) > (ρ(k))k(ρ(k)/(Bk2(200k)k))10k . | |

33 Proof. Since f is analytic on [c, d], we can Taylor expand it around c:

∞ f (ℓ)(c) f(x)= (x c)ℓ ℓ! − Xℓ=0 Let δ = (ρ(k)/(B(200k)k ))10k/k2. Note that for all values of i, j [k], the input to f in M is in ∈ ij the interval [c, d] by the lower bound on d c in Lemma 5.21. In particular, for all i, j [k], − ∈ ∞ f (ℓ)(c) M = δℓiℓjℓ ij ℓ! Xℓ=0 Define two infinite matrices A : [k] Z 0 R and C : Z 0 [k] R as follows: × ≥ → ≥ × → f (ℓ)(c) A = δℓiℓ iℓ ℓ!

ℓ Cℓj = j for all i, j [k] and ℓ Z 0. Then ∈ ∈ ≥

∞ Mij = AiℓCℓj Xℓ=0 for all i, j [k] and converges, so we may apply Lemma 5.17. By Lemma 5.17, ∈

det(M)= det(A[ℓ1,ℓ2,...,ℓk]) det(C[ℓ1,ℓ2,...,ℓk])

0 ℓ1<ℓ2<...<ℓ ≤ X k To lower bound det(M) , we | | (1) lower bound the contribution of the term for the tuple (ℓ ,ℓ ,...,ℓ ) = (0, 1,...,k 1), 1 2 k − (2) upper bound the contribution of every other term,

(3) show that the lower bound dominates the sum.

(ℓ) ℓ We start with part (1). Let D and P be k k matrices, with D diagonal, D = f (c)δ , and × ℓℓ ℓ! P = iℓ for all ℓ 0, 1,...,k and i [k]. Then A[0, 1,...,k 1] = P D, which means that iℓ ∈ { } ∈ − det(A[0, 1,...,k 1]) = det(P ) det(D) − · P and C[0, 1,...,k 1] are Vandermonde matrices, so their determinants has a closed form and, in − particular, have the property that det(P ) 1 and det(C[0, 1,...,k 1]) 1. By Lemma 5.21, | |≥ | − |≥ D > δℓρ(ℓ) δℓρ(k) for all ℓ 0, 1 ...,k 1 . Therefore, | ℓℓ| ≥ ∈ { − } 1+2+...+(k 1) k det(A[0, 1,...,k 1]) det(C[0, 1,...,k 1]) > δ − ρ(k) | − | · | − | k k = δ(2)ρ(k) .

This completes part (1). Next, we do part (2). Consider a k-tuple 0 ℓ < ℓ <...<ℓ and a ≤ 1 2 k permutation σ : [k] [k]. By Proposition 5.18, there is some constant B > 0 depending on f for → which f (ℓ)(c) B10ℓ(ℓ!) B(10ℓ)ℓ for all ℓ. Therefore, | |≤ ≤

34 k k k Pi=1 ℓi Aiℓσ(i) B (10δk) ≤ i=1 Y We also get that

k Pk ℓ C k j=1 j ℓσ(j)j ≤ j=1 Y

Summing over all k! permutations σ yields an upper bound on the determinants of the blocks of A and C, excluding the top block:

det(A[ℓ ,ℓ ,...,ℓ ]) det(C[ℓ ,ℓ ,...,ℓ ]) | 1 2 k || 1 2 k | 0 ℓ1<ℓ2<...<ℓ ,ℓ =k 1 ≤ X k k6 − 2 k 2 Pk ℓ (k!) B (10δk ) i=1 i ≤ 0 ℓ1<ℓ2<...<ℓ ,ℓ =k 1 ≤ X k k6 − ∞ τ k(k!)2Bk(10δk2)τ ≤ τ=1+2+...+(k 3)+(k 2)+k X− − 2τ k(k!)2Bk(10δk2)τ0 , ≤ 0 k where τ0 = 2 + 1. This completes part (2). Now, we do part (3). By Lemma 5.17,  det(M) det(A[0, 1,...,k 1]) det(C[0, 1,...,k 1]) | | ≥ | − || − | det(A[ℓ ,...,ℓ ]) det(C[ℓ ,...,ℓ ]) − | 1 k || 1 k | 0 ℓ1<ℓ2<...<ℓ ,ℓ =k 1 ≤ X k k6 − Plugging in the part (1) lower bound and the part (2) upper bound yields

τ0 1 k k 2 k 2 τ0 det(M) δ − ρ(k) 2τ (k!) B (10δk ) | |≥ − 0 τ0 1 k k 2 k 2 τ0 = δ − ρ(k) 2δτ (k!) B (10k ) − 0 τ0 1  k  > δ − ρ(k) /2 3 > ρ(k)k(ρ(k)/(Bk2(200k)k))10k as desired.

Lemma 5.24 (Step 5: Cramer’s rule-based bound on error in linear system). Let M be an invertible k by k matrix with Mij B for all i, j [k]. Let b be a k-dimensional vector for which bi ε | | ≤1 k ∈ | | ≤ for all i [k]. Then, M − b εk!B / det(M) . ∈ k k∞ ≤ | | Proof. Cramer’s rule says that, for each i [k], the entry i of the vector M 1b is given by ∈ −

1 det(Mi) (M − b) = , i det(M)

35 where Mi is the matrix which one gets by replacing column i of M by b. Let us upper-bound det(M ) . We are given that each entry of M in column i has magnitude at most ε, and each entry | i | i in every other column has magnitude at most B. Hence, for any permutation σ S on [k], we ∈ k have k k 1 (M ) ε B − , i j,σ(j) ≤ · j=1 Y and so k k 1 det(M ) (M ) ε B − k!. | i |≤ i j,σ(j) ≤ · · σ Sk j=1 X∈ Y

It follows from Cramer’s rule that (M 1b) ε Bk 1 k!/ det(M) , as desired. | − i |≤ · − · | | Lemma 5.25 (Step 6: reduction). Let f : [0, 1] R be an analytic function on [0, 1], let κ : → ℓ+2 N [0, 1] be a nonincreasing function, and let ρ(ℓ)= κ(ℓ)2/(64Bℓ42ℓ+1 ℓ!) for any ℓ N (as → · ∈ discussed before). Suppose that, for infinitely many positive integers k, there exists an x for which   k f (k+1)(x) > κ(k). | | Suppose that there is an algorithm for ε-approximate matrix-vector multiplication by an n n × f-matrix for points in [0, 1]d in T (n, d, ε) time. Then, there is a

4 n poly(d + log(1/κ(d + 1))) + O(d T (2n, d + 1, (κ(d + 1))O(d )))) · · time algorithm for exact bichromatic Hamming nearest neighbors on n-point sets in dimension d.

d Proof. Let x1,...,xn 0, 1 be the input to the Hamming nearest neighbors problem, so our goal ∈ { } d+1 is to compute min1 i 0 to be picked later. In order to recover t, we will make d + 1 calls to our given algorithm. For ℓ 0, 1,...,d , the goal of call ℓ is to compute ∈ { } a value uℓ which is an approximation, with additive error at most ε, of entry ℓ of the vector Mt, where M is the counting matrix defined above. Let us explain why this is sufficient to recover t. Suppose we have computed this vector u. We 1 claim that if we compute M − u, and round each entry to the nearest integer, the result is the vector 1 t. Indeed, by Lemma 5.24, each entry of M − u differs from the corresponding entry of t by at most an additive ε Bk 1 k!/ det(M) , where the constant B is from Proposition 5.18 (since f(z) B · − · | | | |≤ for all z [0, 1]). Substituting our lower bound on det(M) from Lemma 5.23, we see this additive ∈ | | error is at most 1/3 as long as we’ve picked a sufficiently large constant α> 0. Thus, rounding each entry to the nearest integer recovers t, as desired. It remains to show how to compute uℓ, an approximation with additive error at most ε of entry ℓ of the vector Mt. In other words, we need to approximate the sum

d M t = f(c + (ρ(k)/(B(200k)k ))10k ℓ x x 2/k2). ℓ,p · p · · k i − jk2 p=0 1 i

36 for all i, j [n], and apply our given algorithm with error ε to these points. For i [n], let ∈ ∈ x = (ρ(k)/(B(200k)k ))5k √ℓ x /k be a rescaling of x . We pick y to equal x in the first d entries i′ · · i i i i′ and 0 in the last entry, and zi to equal xi′ in the first d entries and √c in the last entry. These points have the desired distances, completing the proof.

2 Step 7: proof of Theorem 5.15. If the KAdjE problem for K(x,y) = f( x y 2) in dimension d O(d4) d k − 2 kδ and error (κ(d + 1)) on n = 1.01 points could be solved in time time n − for any constant δ > 0, then one could immediately substitute this into Lemma 5.25 to refute SETH in light of Theorem 3.21.

5.5 Lower Bounds in Low Dimensions The landscape of algorithms available in low dimensions d = o(log n) is a fair bit more complex. The Fast Multipole Method allows us to solve KAdjE for a number of functions f, including K(x,y) = exp( x y 2) and K(x,y) = 1/ x y 2, for which we have a lower bound in high dimensions. −k − k2 k − k2 Classifying when these multipole methods apply to a function f seems quite difficult, as researchers have introduced more and more tools to expand the class of applicable functions. See Section 9, below, in which we give a much more detailed overview of these methods. That said, in this subsection, we prove lower bounds for a number of functions K of interest. We show that for a number of functions K with applications to geometry and statistics, the KAdjE problem seems to become hard even in dimension d = 3 (see the end of this subsection for a list of such K). We begin with the function K(x,y) = x,y . Here, the K Adjacency Evaluation problem |h i| becomes hard even for very small d. We give an n1+o(1) time algorithm only for d 2. For d = 3 we ≤ show that such an algorithm would lead to a breakthrough in algorithms for the Z-MaxIP problem, Ω(log∗ n) 2 ε and for the only slightly super-constant d = 2 , we show that a n − time algorithm would refute SETH.

Lemma 5.26. For the function K(x,y) = x,y , the KAdjE problem (Problem 5.1) can be solved |h i| exactly in time n1+o(1) when d = 2.

Proof. Given as input x ,...,x R2 and y Rn, our goal is to compute z Rn given by 1 n ∈ ∈ ∈ z := x ,x y . We will first compute z := x ,x y , and then subtract x ,x y i j=i |h i ji| · j i′ j |h i ji| · j |h i ii| · i for each i6 to get z . P i P We first sort the input vectors by their polar coordinate angle, and relabel so that x1,...,xn are in sorted order. Let φi be the polar coordinate angle of xi. We will maintain two vectors x+,x R2. We will ‘sweep’ an angle θ from 0 to 2π, and maintain that x+ is the sum of the y x − ∈ i · i with x ,θ > 0, and x is the sum of the other y x . Initially let θ = 0 and let x+ be the sum of h i i − i · i the y x with φ [ π/2, π/2), and x be the sum of the remaining y x . As we sweep, whenever i · i i ∈ − − i · i θ is in the direction of an x we can set z = x ,x+ x . Whenever θ is orthogonal to an x , we i i′ h i − −i i swap y x from one of x+ or x to the other. Over the whole sweep, each point is swapped at i · i − most twice, so the total running time is indeed n1+o(1).

Lemma 5.27. For the function K(x,y) = x,y , if the KAdjE problem (Problem 5.1) with er- |h i| ror 1/nω(1) can be solved in time (n, d), and n > d + 1, then Z-MaxIP (Problem 3.22) with d- T dimensional vectors of integer entries of bit length O(log n) can be solved in time O( (n, d) log2 n + T nd).

d O(1) Proof. Let x1,...,xn Z be the input vectors. Let M n be the maximum magnitude of ∈ ≤ 2 2 any entry of any input vector. Thus, maxi=j xi,xj is an integer in the range [ dM , dM ]. We 6 h i −

37 will binary search for the answer in this interval. The total number of binary search steps will be O(log(dM 2)) O(log n). ≤ We now show how to do each binary search step. Suppose we are testing whether the answer is a, i.e. testing whether x ,x a for all i = j. Let S ,...,S 1,...,n be subsets such ≤ h i ji ≤ 6 1 log n ⊆ { } that for each i = j, there is a k such that S i, j = 1. For each k 1,..., log n we will show 6 | k ∩ { }| ∈ { } how to test whether there are i, j with S i, j = 1 such that x ,x a, which will complete | k ∩ { }| h i ji≤ the binary search step. Define the vectors x ,...,x Zd+1 by (x ) = (x ) for j d, and (x ) = a 1 if i P , 1′ n′ ∈ i′ j i j ≤ i′ d+1 − ∈ k and (x ) = 1 if i / P . Hence, for i, j with S i, j = 1 we have x ,x = x ,x a + 1, i′ d+1 − ∈ k | k ∩ { }| h i′ j′ i h i ji− and so our goal is to test whether there are any such i, j with x ,x > 0. h i′ j′ i Let v 0, 1 n be the vector with (v ) = 1 when i P and (v ) = 0 when i / P . Use the k ∈ { } k i ∈ k k i ∈ k given algorithm to vector v , we can compute a (a n ω(1)) approximation to k ± −

s := x′ ,x′ 1 |h i ji| i Pk j /P X∈ X∈ k in time O( (n, d)). Similarly, using the fact that the corresponding matrix has rank d by definition, T we can exactly compute

s := x′ ,x′ 2 h i ji i Pk j /P X∈ X∈ k in time O(nd). Our goal is to determine whether s = s . Since each s and s is a polynomially- 1 − 2 1 2 bounded integer, and we have a superpolynomially low error approximation to each, we can deter- mine this as desired.

Combining Lemma 5.27 with Theorem 3.23 we get:

Corollary 5.28. Assuming SETH, there is a constant c such that for the function K(x,y)= x,y , ∗ |h i| the KAdjE problem (Problem 5.1) with error 1/nω(1) and dimension d = clog n vectors of O(log n) 2 o(1) bit entries requires time n − . Similarly, combining with Theorem 3.24 we get:

Corollary 5.29. For the function K(x,y)= x,y , if the KAdjE problem (Problem 5.1) with error ω(1) |h i| 4/3 O(1) 1/n and dimension d = 3 vectors of O(log n) bit entries can be solved in time n − , then we would get a faster-than-known algorithm for Z-MaxIP (Problem 3.22) in dimension d = 3.

The same proof, but using Theorem 3.25 instead of Theorem 3.23, can also show hardness of thresholds of distance functions:

Corollary 5.30. Corollary 5.28 also holds for the function K(x,y) = TH( x y 2), where TH is k − k2 any threshold function (i.e. TH(z) = 1 for z θ and TH(z) = 0 otherwise, for some θ R ). ≥ ∈ >0 Using essentially the same proof as for Lemma 8.4 in Section 8, we can further extend Corol- lary 5.30 to any non-Lipschitz functions f:

Proposition 5.31. Suppose f : R R is any function which is not (C,L)-multiplicatively Lip- + → schitz for any constants C,L 1. Then, assuming SETH, the KAdjE problem (Problem 5.1) with ω(1) ≥ log∗ n 2 o(1) error 1/n and dimension d = c requires time n − .

38 5.6 Hardness of the n-Body Problem We now prove Corollary 1.2 from the Introduction, showing that our hardness results for the KAdjE problem (Problem 5.1) also imply hardness for the n-body problem. Corollary 5.32 (Restatement of Corollary 1.2). Assuming SETH, there is no poly(d, log(α)) n1+o(1) · -time algorithm for one step of the n-body problem. Proof. We reduce from the K graph Laplacian multiplication problem, where K(u, v)= f( u v 2) k − k2 and f(z) = 1 . A K graph Laplacian multiplication instance consists of the K graph G on a (1+z)3/2 set of n points X Rd and a vector y 0, 1 n for which we wish to compute L y. Think of y ⊆ ∈ { } G as vector with coordinates in the set X. Compute this multiplication using an n-body problem as follows:

1. For each b 0, 1 , let X = x X y = b . Let Z Rd+1 be the set of all (x, 0) for ∈ { } b { ∈ | x } ⊆ x X and (x, 1) for x X . ∈ 0 ∈ 1 2. Solve the one-step n-body problem on Z with unit masses. Let z Rn be the vector of the ∈ (d + 1)-th coordinate of these forces, with forces negated for coordinates x X . ∈ 0 3. Return z/Ggrav (Note that Ggrav is the Gravitational constant) − 1 b We now show that z = LGy. For x Xb, (LGy)x = ( 1) − ′ K(x,x′). We now ∈ − x X1−b check that z is equal to this by going through pairs x,x individually. Note∈ that the (d + 1)-th x { ′} P coordinate of the force between (x, b) and (x′, b) is 0. The (d + 1)-th coordinate of the force exerted by (x′, 1) on (x, 0) is

$$\frac{G_{\mathrm{grav}} \cdot (1 - 0)}{\|(x,0) - (x',1)\|_2^2 \cdot \|(x,0) - (x',1)\|_2} = G_{\mathrm{grav}} \cdot K(x, x').$$
Negating this gives the force exerted by $(x', 0)$ on $(x, 1)$. All of these contributions agree with the corresponding contributions to the sum $(L_G y)_x$, so $-z/G_{\mathrm{grav}} = L_G y$ as desired.

The runtime of this reduction is $O(n)$ plus the runtime of the $n$-body problem. However, Theorem 1.1 shows that no almost-linear time algorithm exists for $K$-Laplacian multiplication, since $f$ is not approximable by a polynomial of degree less than $\Theta(\log n)$. Therefore, assuming SETH, no almost-linear time algorithm exists for the $n$-body problem either, as desired.
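The identity underlying this reduction is easy to check numerically. The sketch below (our own illustration, with $G_{\mathrm{grav}}$ set to $1$ and forces computed by brute force rather than by a fast $n$-body solver) builds the lifted point set $Z$ and verifies the core identity that the $(d+1)$-th coordinates of the net gravitational forces equal $-G_{\mathrm{grav}} \cdot L_G y$ for $f(z) = (1+z)^{-3/2}$; the per-side sign adjustments in the reduction depend on the paper's n-body output convention and are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, G_grav = 10, 3, 1.0
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n).astype(float)

# K-graph Laplacian applied to y, with K(u, v) = (1 + ||u - v||^2)^(-3/2)
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = (1.0 + D2) ** -1.5
np.fill_diagonal(K, 0.0)
L_y = K.sum(axis=1) * y - K @ y

# Lift to Z = {(x, y_x)} and compute the (d+1)-th coordinate of the net
# gravitational force on each lifted point (unit masses, brute force)
Z = np.hstack([X, y[:, None]])
diff = Z[None, :, :] - Z[:, None, :]            # diff[i, j] = z_j - z_i
r = np.sqrt((diff ** 2).sum(-1))
np.fill_diagonal(r, np.inf)
F_last = G_grav * (diff[:, :, -1] / r ** 3).sum(axis=1)

# Core identity behind the reduction: these force coordinates are -G_grav * (L_G y)
assert np.allclose(-F_last / G_grav, L_y)
print("reduction identity verified")
```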

5.7 Hardness of Kernel PCA

For any function $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, and any set $P = \{x_1, \ldots, x_n\} \subseteq \mathbb{R}^d$ of $n$ points, define the matrix $K_{K,P} \in \mathbb{R}^{n \times n}$ by
$$K_{K,P}[i,j] = K(x_i, x_j).$$

$A_{K,P}$ and $K_{K,P}$ differ only on their diagonal entries, so an $n^{1+o(1)}$-time algorithm for multiplying by one can easily be converted into such an algorithm for the other. Kernel PCA studies:

Problem 5.33 ($K$ PCA). For a given function $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, the $K$ PCA problem asks: given as input a set $P = \{x_1, \ldots, x_n\} \subseteq \mathbb{R}^d$ with $|P| = n$, output the $n$ eigenvalues of the matrix $(I_n - J_n) \times K_{K,P} \times (I_n - J_n)$, where $J_n$ is the $n \times n$ matrix whose entries are all $1/n$. In $\varepsilon$-approximate $K$ PCA, we want to return a $(1 \pm \varepsilon)$-multiplicative approximation to each eigenvalue.

We can now show a general hardness result for $K$ PCA:

Theorem 5.34 (Approximate). For every function $f : \mathbb{R} \to \mathbb{R}$ which is equal to a Taylor expansion $f(x) = \sum_{i=0}^{\infty} c_i x^i$ on an interval $(0,1)$, if $f$ is not $\varepsilon$-approximated by a polynomial of degree $O(\log n)$ (Definition 5.6) on the interval $(0,1)$ for $\varepsilon = 2^{-\log^4 n}$, then, assuming SETH, the $\varepsilon$-approximate $K$ PCA problem (Problem 5.33) in dimension $d = O(\log n)$ requires time $n^{2-o(1)}$.

Proof Sketch. Theorem 5.34 follows almost directly from Theorem 5.14 when combined with the reduction from [BCIS18, Section 5]. The idea is as follows: suppose we are able to estimate the $n$ eigenvalues of $(I_n - J_n) \times K_{K,P} \times (I_n - J_n)$. Then, in particular, we can estimate their sum, which is equal to:

$$\mathrm{tr}((I_n - J_n) \times K_{K,P} \times (I_n - J_n)) = \mathrm{tr}(K_{K,P} \times (I_n - J_n)^2) = \mathrm{tr}(K_{K,P} \times (I_n - J_n)) = \mathrm{tr}(K_{K,P}) - S(K_{K,P})/n,$$
where $S(K_{K,P})$ denotes the sum of the entries of $K_{K,P}$. We can compute $\mathrm{tr}(K_{K,P})$ exactly in time $n^{1+o(1)}$, so we are able to get an approximation to $S(K_{K,P})$. However, in the proof of Theorem 5.14, we showed hardness for approximating $S(K_{K,P})$, which concludes our proof sketch.
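The chain of equalities above relies only on $J_n$ being a projection ($J_n^2 = J_n$) and on the cyclic property of the trace. A quick numerical sanity check (our own illustration, using a Gaussian kernel as an example $K$) is below.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
P = rng.normal(size=(n, d))

# Example kernel matrix K_{K,P} for K(x, y) = exp(-||x - y||^2)
D2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)
Kmat = np.exp(-D2)

I, J = np.eye(n), np.full((n, n), 1.0 / n)
centered = (I - J) @ Kmat @ (I - J)

lhs = np.linalg.eigvalsh(centered).sum()     # sum of the K PCA eigenvalues
rhs = np.trace(Kmat) - Kmat.sum() / n        # tr(K) - S(K)/n
assert np.isclose(lhs, rhs)
print(lhs, rhs)
```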

6 Sparsifying Multiplicatively Lipschitz Functions in Almost Linear Time

Given n points, let α denote

$$\alpha = \frac{\max_{u,v \in P} \|u - v\|_2^2}{\min_{u,v \in P} \|u - v\|_2^2}.$$
In this section, we give an algorithm to compute sparsifiers for a large class of kernels $K$ in time almost linear in $nd$, with logarithmic dependence on $\alpha$ and $1/\varepsilon^2$ dependence on $\varepsilon$. When $d = \log n$, our algorithm runs in time almost linear in $n$. To formally state our main theorem, we define multiplicatively Lipschitz functions:

Definition 6.1. Let $C \geq 1$ and $L \geq 1$. A function $f : \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$ is $(C,L)$-multiplicatively Lipschitz iff for all $x \in \mathbb{R}_{\geq 0}$ and all $c \in [1/C, C]$, we have:

$$C^{-L} \leq \frac{f(cx)}{f(x)} \leq C^{L}.$$

Examples: Any polynomial with non-negative coefficients and maximum degree $q$ is $(1+\varepsilon, q)$-multiplicatively Lipschitz for any $\varepsilon > 0$. The function $f(x) = 1$ for $x < 1$ and $f(x) = 2$ for $x \geq 1$ is $(2,1)$-multiplicatively Lipschitz.
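As a concrete illustration (our own, not from the paper), the following sketch numerically spot-checks the first example: for a polynomial with non-negative coefficients and maximum degree $q$, the ratio $f(cx)/f(x)$ stays within $[C^{-L}, C^{L}]$ for $C = 1 + \varepsilon$ and $L = q$.

```python
import numpy as np

def is_mult_lipschitz(f, C, L, xs, cs):
    """Spot-check the (C, L)-multiplicative Lipschitz condition on sample points."""
    for x in xs:
        for c in cs:
            ratio = f(c * x) / f(x)
            if not (C ** -L <= ratio <= C ** L):
                return False
    return True

# f(x) = 3x^4 + x^2 + 5: non-negative coefficients, degree q = 4
f = lambda x: 3 * x**4 + x**2 + 5
eps, q = 0.25, 4
xs = np.linspace(0.01, 100, 500)
cs = np.linspace(1 / (1 + eps), 1 + eps, 50)
print(is_mult_lipschitz(f, C=1 + eps, L=q, xs=xs, cs=cs))   # True
```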

The following lemma is a simple consequence of our definition of multiplicatively Lipschitz functions:

Lemma 6.2. Let $C \geq 1$ and $L > 0$. Any function $f : \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$ that is $(C,L)$-multiplicatively Lipschitz satisfies, for all $c \in (0, 1/C) \cup [C, +\infty)$:
$$\frac{1}{c^{2L}} < \frac{f(cx)}{f(x)} < c^{2L}.$$

We now state the core theorem of this section:

Theorem 6.3. Let $C \geq 1$ and $L \geq 1$. Consider any $(C,L)$-multiplicatively Lipschitz function $f : \mathbb{R} \to \mathbb{R}$, and let $K(x,y) = f(\|x - y\|_2^2)$. Let $P$ be a set of $n$ points in $\mathbb{R}^d$. For any $k \in [\Omega(1), O(\log n)]$ such that $C = n^{O(1/k)}$, there exists an algorithm Sparsify-K-Graph($P, n, d, K, k, L$) (Algorithm 3) that runs in time:

$$O(ndk) + \varepsilon^{-2} n^{1+O(L/k)} \cdot 2^{O(k)} \cdot \log n \cdot \log \alpha$$
and outputs an $\varepsilon$-spectral sparsifier $H$ of the $K$ graph with $|E_H| = O(n \log n/\varepsilon^2)$.

We give a corollary of this theorem.

Corollary 6.4. Consider a $K$ graph where $K(x,y) = f(\|x - y\|_2^2)$, and $f$ is $(2,L)$-multiplicatively Lipschitz. Let $G$ denote the $K$ graph on $n$ points in $\mathbb{R}^d$. There is an algorithm that runs in time

$$O(nd\sqrt{L \log n}) + \varepsilon^{-2} n \cdot 2^{O(\sqrt{L \log n}\, \log\log n)} \cdot \log \alpha$$
and outputs an $\varepsilon$-spectral sparsifier $H$ of $G$ with $|E_H| = O(n \log n/\varepsilon^2)$. If $L = o(\log n/(\log\log n)^2)$, this runs in time

$$o(nd\log n) + \varepsilon^{-2} n^{1+o(1)} \log \alpha.$$

Proof. Set k = √L log n, and the corollary follows from Theorem 6.3.

This implies that if $f$ is a polynomial with non-negative coefficients, then sparsifiers of the corresponding $K$-graph can be found in almost linear time. The same result holds if $f$ is the reciprocal of a polynomial with non-negative coefficients. We will need a few geometric preliminaries in order to present Sparsify-K-Graph, the core algorithm of this section.

Definition 6.5 ($\varepsilon$-well separated pair). Given two sets of points $A$ and $B$, we say $(A, B)$ is an $\varepsilon$-well separated pair if the diameters of $A$ and $B$ are both at most $\varepsilon$ times the distance between $A$ and $B$.

Definition 6.6 ($\varepsilon$-well separated pair decomposition [CK95]). An $\varepsilon$-well separated pair decomposition ($\varepsilon$-WSPD) of a given point set $P$ is a family of pairs $\mathcal{F} = \{(A_1, B_1), \ldots, (A_s, B_s)\}$ with $A_i, B_i \subset P$ such that:

• for all $i \in [s]$, $(A_i, B_i)$ is an $\varepsilon$-well separated pair (Definition 6.5);

• for any pair $p, q \in P$, there is a unique $i \in [s]$ such that $p \in A_i$ and $q \in B_i$.

A famous theorem of Callahan and Kosaraju [CK95] states:

Theorem 6.7 (Callahan and Kosaraju [CK95]). Given any point set $P \subset \mathbb{R}^d$ and $0 \leq \varepsilon \leq 9/10$, an $\varepsilon$-WSPD (Definition 6.6) of size $O(n/\varepsilon^d)$ can be found in $2^{O(d)} \cdot (n\log n + n/\varepsilon^d)$ time. Moreover, each vertex participates in at most $2^{O(d)} \cdot \log\alpha$ $\varepsilon$-well separated pairs.

Well-separated pairs can be interpreted as complete bipartite graphs on the vertex set, or bicliques. The biclique associated with a well-separated pair is the bipartite graph connecting all vertices on one side of the pair to the other. This concludes our definitions on well-separated pairs.

We now give names to some algorithms from past work, which will be used in our algorithm Sparsify-K-Graph. We define the algorithm GenerateWSPD($P, \varepsilon$) to output an $\varepsilon$-WSPD (Definition 6.6) of $P$. We define the algorithm RandomProject($P, k$) to be a random projection of $P$ onto $k$ dimensions.

Let Biclique($K, P, A, B$) be the complete biclique on the $K$-graph of $P$ with one side of the biclique having vertices corresponding to points in $A$, and the other side having vertices corresponding to points in $B$. We store this biclique implicitly as $(A, B)$ rather than as a collection of edges. Let RandSample($G, s$) be an algorithm that uniformly at random samples $O(s)$ edges from $G$, where the big-$O$ hides the same constant as the big-$O$ in the $n^{O(1/k)}$ from Lemma 3.15. Let SpectralSparsify($G, \varepsilon$) be any nearly linear time spectral sparsification algorithm that outputs a $(1+\varepsilon)$ spectral sparsifier with $O(n\log n/\varepsilon^2)$ edges, such as that in Theorem 3.7 from [SS11].

Algorithm 3
1: procedure Sparsify-K-Graph($P, n, d, K, k, L, \varepsilon$)  ⊲ Theorem 6.3
2:   Input: A point set $P$ with $n$ points in dimension $d$, a kernel function $K(x,y) = f(\|x - y\|_2^2)$, an integer variable $\Omega(1) \leq k \leq O(\log n)$, a variable $L$, and error $\varepsilon$.
3:   Output: A candidate sparsifier of the $K$-graph on $P$.
4:   $P' \leftarrow$ RandomProject($P, k$)
5:   $H \leftarrow$ empty graph with $n$ vertices.
6:   $\{(A'_1, B'_1), \ldots, (A'_t, B'_t)\} \leftarrow$ GenerateWSPD($P', n, d, 1/2$)  ⊲ $t = n \cdot 2^d$
7:   for $i = 1 \to t$ do
8:     Find $(A_i, B_i)$ corresponding to $(A'_i, B'_i)$, where $A_i, B_i \subset P$.
9:     $Q \leftarrow$ BiClique($K, P, A_i, B_i$).
10:    $s \leftarrow \varepsilon^{-2} n^{O(L/k)}(|A_i| + |B_i|)\log(|A_i| + |B_i|)$
11:    $Q \leftarrow$ RandSample($Q, s$)
12:    $Q \leftarrow$ $Q$ with each edge scaled by $|A_i||B_i|/s$.
13:    $H \leftarrow H + Q$
14:  end for
15:  $H \leftarrow$ SpectralSparsify($H, \varepsilon$)  ⊲ Corollary 3.8
16:  Return $H$
17: end procedure
18: procedure GenerateWSPD($P, n, d, \varepsilon$)  ⊲ Theorem 6.7
19:   ...  ⊲ See details in [CK95]
20: end procedure
21: procedure RandSample($G, s$)
22:   Sample $O(s)$ edges from $G$ and generate a new graph $G'$
23:   return $G'$
24: end procedure
25: procedure RandomProject($P, d, k$)
26:   $P' \leftarrow \emptyset$
27:   Choose a JL matrix $S \in \mathbb{R}^{k \times d}$
28:   for $x \in P$ do
29:     $x' \leftarrow S \cdot x$
30:     $P' \leftarrow P' \cup \{x'\}$
31:   end for
32:   return $P'$
33: end procedure
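For concreteness, here is a minimal Python sketch (ours) of two ingredients of Algorithm 3: the Johnson–Lindenstrauss-style RandomProject step and the per-biclique uniform edge sampling with rescaling of lines 10–12. The WSPD construction and the final spectral sparsification call are omitted, and all helper names are ours rather than the paper's.

```python
import numpy as np

def random_project(P, k, rng):
    """RandomProject: multiply each point by a k x d Gaussian JL matrix."""
    d = P.shape[1]
    S = rng.normal(size=(k, d)) / np.sqrt(k)
    return P @ S.T

def sample_biclique(A_idx, B_idx, K, s, rng):
    """Uniformly sample s edges of the biclique A x B, rescaling each sampled
    weight by |A||B|/s so the sampled graph matches the biclique in expectation."""
    scale = len(A_idx) * len(B_idx) / s
    edges = []
    for _ in range(s):
        i, j = rng.choice(A_idx), rng.choice(B_idx)
        edges.append((i, j, scale * K(i, j)))
    return edges

# toy usage with K(x, y) = f(||x - y||^2), f(z) = 1 / (1 + z)
rng = np.random.default_rng(0)
P = rng.normal(size=(100, 20))
P_proj = random_project(P, k=5, rng=rng)
K = lambda i, j: 1.0 / (1.0 + np.sum((P[i] - P[j]) ** 2))
edges = sample_biclique(np.arange(0, 50), np.arange(50, 100), K, s=200, rng=rng)
print(P_proj.shape, len(edges), edges[0])
```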

The rest of this section is devoted to proving Theorem 6.3.

6.1 High Dimensional Sparsification

We are nearly ready to prove Theorem 6.3. We start with a lemma:

Lemma 6.8 ($(C,L)$-multiplicatively Lipschitz functions don't distort a graph's edge weights much). Consider a complete graph $G$ and a complete graph $G'$, where vertices of $G$ are identified with vertices of $G'$ (which induces an identification between edges). Let $K \geq 1$. Suppose each edge in $G$ satisfies:
$$\frac{1}{K} \cdot w_{G'}(e) \leq w_G(e) \leq K \cdot w_{G'}(e).$$
If $f$ is a $(C,L)$-multiplicatively Lipschitz function, $f(G)$ refers to the graph $G$ where $f$ is applied to each edge length, and $C < K$, then:

$$\frac{1}{K^{2L}} \cdot w_{f(G')}(e) \leq w_{f(G)}(e) \leq K^{2L} \cdot w_{f(G')}(e).$$
Proof. This follows from Lemma 6.2.

Proof (of Theorem 6.3). The algorithm Sparsify-K-Graph starts by performing a random projection of the point set $P$ into $k$ dimensions. Call the new point set $P'$. This runs in time $O(ndk)$, and incurs distortion $n^{O(1/k)}$, as seen in Lemma 3.15. Next, our algorithm computes a $1/2$-WSPD on $P'$. As seen in Theorem 6.7, this runs in time $O(n\log n + n 2^{O(k)})$.

We view each well-separated pair $(A'_i, B'_i)$ on $P'$ as a biclique, where the edge length between any two points in $P'$ corresponds to the edge length between those two points in the original $K$-graph. By the guarantees of Theorem 6.7, the longest edge divided by the shortest edge between the two sides of a well-separated pair in $P'$ is at most $2$. Thus, the longest edge divided by the shortest edge within any induced bipartite graph on the $K$-graph is $2 \cdot n^{O(1/k)}$, by Lemma 6.8.

For each such biclique, the leverage score of each edge is overestimated by $n^{O(L/k)} \cdot (|A'_i| + |B'_i|)/(|A'_i||B'_i|)$. This comes from first applying Lemma 6.8 to upper bound the ratio of the longest edge to the shortest edge in a biclique; this ratio comes out to be $2 n^{O(L/k)}$. Now recall the definition of the leverage score of an edge in a graph as $w_e R_e$, where $R_e$ is the effective resistance between the endpoints of $e$ assuming conductances $w_e$ on the graph, and $w_e$ is the edge weight. Here, $w_e$ is upper bounded by the longest edge weight, and $R_e$ is upper bounded by the effective resistance of a biclique supported on the same edges in which all edge weights are equal to the shortest edge weight (this is an underestimate of effective resistance due to Rayleigh monotonicity; see [Chu97] for details). Therefore, $n^{O(L/k)} \cdot (|A'_i| + |B'_i|)/(|A'_i||B'_i|)$ is a leverage score overestimate, as claimed. The union of these sampled graphs is a spectral sparsifier of our $K$-graph.

Finally, our algorithm samples $n^{O(L/k)}(|A'_i| + |B'_i|)\log(|A'_i| + |B'_i|)$ edges uniformly at random from each biclique, scaling each sampled edge's weight so that the expected value of the sampled graph is equal to the original biclique. Each vertex participates in at most $2^{O(k)}\log\alpha$ bicliques (see Theorem 6.7). Thus, this uniform sampling procedure's running time is bounded above by $n^{1+O(L/k)} 2^{O(k)}\log n \cdot \log\alpha$. Finally, our algorithm runs a spectral sparsification algorithm on the graph obtained after uniform sampling, which brings the edge count of the final graph down to $O(n\log n/\varepsilon^2)$. This completes the proof of Theorem 6.3.

6.2 Low Dimensional Sparsification

We now present a result on sparsification in low dimensions, when $d$ is assumed to be small or constant.

Theorem 6.9. Let $L \geq 1$. Consider a $K$-graph with $n$ vertices arising from a point set in $d$ dimensions, and let $\alpha$ be the ratio of the maximum Euclidean distance to the minimum Euclidean distance in the point set. Let $f$ be a $(1 + 1/L, L)$-multiplicatively Lipschitz function. Then an $\varepsilon$-spectral sparsifier of the $K$-graph can be found in time

$$\varepsilon^{-2} \cdot n\log n \cdot \log\alpha \cdot (2L)^{O(d)}.$$
Proof. We roughly follow the proof of Theorem 6.3, except without the projection onto low dimensions. Now, on the $d$-dimensional data, we create a $1/L$-WSPD. This takes time

$$O(n\log n) + n \cdot (2L)^{O(d)}.$$
Since $f$ is $(1 + 1/L, L)$-multiplicatively Lipschitz, it follows that within each biclique of the $K$-graph induced by the WSPD (Definition 6.6), the maximum edge length divided by the minimum edge length is bounded above by $(1 + 1/L)^{O(L)} = O(1)$. Therefore, performing scaled and reweighted uniform sampling on each biclique takes $O(s\log s)$ time if there are $s$ vertices in the biclique, and gives a sparsified biclique with $O(s\log s)$ edges. Taking the union of this number over all bicliques gives an algorithm that runs in time

$$\varepsilon^{-2} \cdot n\log n \cdot \log\alpha \cdot (2L)^{O(d)}$$
as desired.

7 Sparsifiers for $|\langle x, y \rangle|$

In this section, we construct sparsifiers for kernels of the form $|\langle x, y \rangle|$.

Lemma 7.1 (sparsification algorithm for the inner product kernel). Given a set of vectors $X \subseteq \mathbb{R}^d$ with $|X| = n$ and an accuracy parameter $\varepsilon \in (0, 1/2)$, there is an algorithm that runs in $\varepsilon^{-2} n \cdot \mathrm{poly}(d, \log n)$ time and outputs a graph $H$ that satisfies both of the following properties with probability at least $1 - 1/\mathrm{poly}(n)$:

1. $(1-\varepsilon) L_G \preceq L_H \preceq (1+\varepsilon) L_G$;

2. $|E(H)| \leq \varepsilon^{-2} \cdot n \cdot \mathrm{poly}(d, \log n)$,

where $G$ is the $K$-graph on $X$ with $K(x,y) = |\langle x, y \rangle|$.

Throughout this section, we specify various values $C_i$. For each subscript $i$, it is the case that $1 \leq C_i \leq \mathrm{poly}(d, \log n)$.

7.1 Existence of large expanders in inner product graphs

We start by showing that certain graphs related to unweighted versions of $K$-weighted graphs contain large expanders:

Definition 7.2 ($k$-dependent graphs). For a positive integer $k > 1$, call an unweighted graph $G$ $k$-dependent if no independent set of size at least $k+1$ exists in $G$.

We start by observing that inner product graphs are $(d+1)$-dependent.

Definition 7.3 (inner product graphs). For a set of points $X \subseteq \mathbb{R}^d$, the unweighted inner product graph for $X$ is a graph $G$ with vertex set $X$ and unweighted edges $\{u, v\} \in E(G)$ if and only if $|\langle u, v \rangle| \geq \frac{1}{d+1}\|u\|_2\|v\|_2$. The weighted inner product graph for $X$ is a complete graph $G$ with edge weights $w_e$ for which $w_{uv} = |\langle u, v \rangle|$.
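For concreteness, here is a small sketch (ours) that builds both graphs of Definition 7.3 as adjacency/weight matrices.

```python
import numpy as np

def inner_product_graphs(X):
    """Return (unweighted adjacency, weighted adjacency) of the inner product
    graphs of Definition 7.3 for the rows of X."""
    n, d = X.shape
    ip = np.abs(X @ X.T)                          # |<u, v>|
    norms = np.linalg.norm(X, axis=1)
    thresh = np.outer(norms, norms) / (d + 1)     # ||u|| ||v|| / (d + 1)
    A_unweighted = (ip >= thresh).astype(int)
    A_weighted = ip.copy()
    np.fill_diagonal(A_unweighted, 0)             # no self-loops
    np.fill_diagonal(A_weighted, 0.0)
    return A_unweighted, A_weighted

X = np.random.default_rng(0).normal(size=(6, 3))
A, W = inner_product_graphs(X)
print(A)
```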

We now show that these graphs are $(d+1)$-dependent:

Lemma 7.4. Suppose $M \in \mathbb{R}^{n \times n}$ is such that $M[i,i] = 1$ for all $i \in [n]$, and $|M[i,j]| < 1/n$ for all $i \neq j$. Then $M$ has full rank.

Proof. Assume for contradiction that $M$ does not have full rank. Then there are values $c_1, \ldots, c_{n-1} \in \mathbb{R}$ such that for all $i \in [n]$ we have $\sum_{j=1}^{n-1} c_j M[i,j] = M[i,n]$. First, note that there must be a $j$ with $|c_j| \geq 1 + 1/n$. Otherwise, we would have
$$1 = |M[n,n]| = \left|\sum_{j=1}^{n-1} c_j M[n,j]\right| < \frac{1}{n}\sum_{j=1}^{n-1} |c_j| < \frac{n-1}{n}(1 + 1/n) < 1.$$

Assume without loss of generality that $|c_1| \geq |c_j|$ for all $j \in \{2, 3, \ldots, n-1\}$, so in particular $|c_1| \geq 1 + 1/n$. Letting $c_n = -1$, this means that $\sum_{j=1}^{n} c_j M[1,j] = 0$, and so $M[1,1] = -\sum_{j=2}^{n} (c_j/c_1) M[1,j]$. Thus,

$$1 = |M[1,1]| = \left|\sum_{j=2}^{n} \frac{c_j}{c_1} M[1,j]\right| \leq \sum_{j=2}^{n} \left|\frac{c_j}{c_1}\right| |M[1,j]| < \sum_{j=2}^{n} |M[1,j]| < (n-1)\cdot\frac{1}{n} < 1,$$

a contradiction, as desired.

Proposition 7.5. The unweighted inner product graph for $X \subseteq \mathbb{R}^d$ is $(d+1)$-dependent.

Proof. For an independent set $S$ in the unweighted inner product graph $G$ for $X$, define an $|S| \times |S|$ matrix $M$ with $M[i,j] = \langle s_i, s_j \rangle$, where $S = \{s_1, s_2, \ldots, s_{|S|}\}$. Then Lemma 7.4, coupled with the definition of edge presence in $G$, shows that $M$ is full rank. However, $M$ has rank at most $d$ because it is the matrix of inner products of $d$-dimensional vectors. Therefore, $d \geq |S|$, so no independent set has size greater than $d$.

Next, we show that k-dependent graphs are dense:

Proposition 7.6 (dependent graphs are dense). Any $k$-dependent graph $G$ has at least $n^2/(2k^2)$ edges.

Proof. Consider any $(k+1)$-tuple of vertices in $G$. There are $\binom{n}{k+1}$ such $(k+1)$-tuples. By the definition of $k$-dependence, there must be some edge with both endpoints in any $(k+1)$-tuple. The number of $(k+1)$-tuples that any given edge can be a part of is at most $\binom{n-2}{k-1}$. Therefore, the number of edges in the graph is at least
$$\frac{\binom{n}{k+1}}{\binom{n-2}{k-1}} \geq \frac{n(n-1)}{(k+1)k} \geq \frac{n^2}{2k^2}$$
as desired.

Next, we argue that unweighted inner product graphs have large expanders:

Proposition 7.7 (every dense graph has a large expander). Consider an unweighted graph $G$ with $n$ vertices and at least $n^2/c$ edges for some $c > 1$. Then, there exists a set $S \subseteq V(G)$ with the following properties:

1. (Size) $|S| \geq n/(40c)$

2. (Expander) $\Phi_{G[S]} \geq 1/(100 c\log n)$

3. (Degree) The degree of each vertex in $S$ within $G[S]$ is at least $n/(10000c)$

Proof. We start by partitioning the graph as follows:

1. Initialize $\mathcal{F} = \{V(G)\}$.

2. While there exists a set $U \in \mathcal{F}$ with (a) a partition $U = U_1 \cup U_2$ whose cut has conductance at most $1/(100 c\log n)$, or (b) a vertex $u$ with degree less than $n/(10000c)$ in $G[U]$:

   (a) If (a), replace $U$ in $\mathcal{F}$ with $U_1$ and $U_2$.

   (b) Else if (b), replace $U$ in $\mathcal{F}$ with $U \setminus \{u\}$ and $\{u\}$.

We now argue that when this procedure stops,

$$\sum_{U \in \mathcal{F}} |E(G[U])| \geq n^2/(2c).$$
To prove this, think of each splitting of $U$ into $U_1$ and $U_2$ as deleting the edges in $E(U_1, U_2)$ from $G$. Design a charging scheme that assigns deleted edges due to (a) steps to edges of $G$ as follows. Let $c_e$ denote the charge assigned to an edge $e$ and initialize each charge to $0$. When $U$ is split into $U_1$ and $U_2$, let $U_1$ denote the set with $|E(U_1)| \leq |E(U_2)|$. When $U$ is split, increase the charge $c_e$ for each $e \in E(U_1) \cup E(U_1, U_2)$ by $|E(U_1, U_2)|/|E(U_1) \cup E(U_1, U_2)|$.

We now bound the charge assigned to each edge at the end of the algorithm. By construction,

$\sum_{e \in E(G)} c_e$ is the number of edges deleted over the course of the algorithm due to type (a) steps. Each edge is assigned charge at most $\log|E(G)| \leq 2\log n$ times, because $|E(U_1)| \leq \frac{|E(U_1)| + |E(U_2)|}{2} \leq \frac{|E(U)|}{2}$ when charge is assigned to edges in $U_1$. Furthermore, the amount of charge assigned each time is the conductance of the cut deleted, which is at most $1/(100 c\log n)$. Therefore,

$$c_e \leq \frac{2\log n}{100 c\log n} = \frac{1}{50c}$$
for all edges in $G$, which means that the total number of edges deleted due to type (a) steps is at most $n^2/(100c)$. Each type (b) deletion reduces the number of edges in $G$ by at most $n/(10000c)$, so the total number of type (b) edge deletions is at most $n^2/(10000c)$. Therefore, the total number of edges remaining is at least

$$\frac{n^2}{c} - \frac{n^2}{100c} - \frac{n^2}{10000c} > \frac{n^2}{2c},$$
as desired. By the stopping condition of the algorithm, each connected component of $G$ after the edge deletions is a graph in which every cut has conductance at least $1/(100 c\log n)$ and every vertex has degree at

least $n/(10000c)$. Next, we show that some set in $\mathcal{F}$ has at least $n/(40c)$ vertices. If this is not the case, then

$$\sum_{U \in \mathcal{F}} |E(G[U])| \leq \sum_{U \in \mathcal{F}} |U|^2 \leq \frac{n}{40c}\sum_{U \in \mathcal{F}} |U| \leq \frac{n^2}{40c} < \frac{n^2}{2c}$$
(the second inequality by assuming all $|U| \leq n/(40c)$, and the third by $\sum_{U \in \mathcal{F}} |U| \leq n$), which leads to a contradiction. Therefore, there must be some connected component with at least $n/(40c)$ vertices. Let $S$ be this component. By definition, $S$ satisfies the Size guarantee. By the stopping condition for $\mathcal{F}$, $S$ satisfies the other two guarantees as well, as desired.

Proposition 7.7 does not immediately lead to an efficient algorithm for finding $S$. Instead, we give an algorithm for finding a weaker but sufficient object:

Proposition 7.8 (algorithm for finding sets with low effective resistance diameter). For any set of vectors $X \subseteq \mathbb{R}^d$ with $n = |X|$ and unweighted inner product graph $G$ for $X$, there is an algorithm LowDiamSet($X$) that runs in time

$\mathrm{poly}(d, \log n)\cdot n$ and returns a set $Q$ that has the following properties with probability at least $1 - 1/\mathrm{poly}(n)$:

1. (Size) $|Q| \geq n/C_1$, where $C_1 = 320 d^2$

2. (Low effective resistance diameter) For any pair of vertices $u, v \in Q$, $\mathrm{Reff}_G(u,v) \leq \frac{C_2}{n}$, where $C_2 = (10 d\log n)^{10}$

We prove this proposition in Section 7.2.

7.2 Efficient algorithm for finding sets with low effective resistance diameter

In this section, we prove Proposition 7.8. We implement LowDiamSet by picking a random vertex $v$ and checking whether or not it belongs to a set with low enough effective resistance diameter. We know that such a set exists by Proposition 7.7 and the fact that dense expander graphs have low effective resistance diameter. One could check that $v$ lies in this set in $\widetilde{O}(n^2)$ time by using the effective resistance data structure of [SS11]. Verifying that $v$ is in such a set in $\mathrm{poly}(d, \log n)\cdot n$ time is challenging given that $G$ is dense. We instead use the following data structure, which is implemented by uniformly sampling sparse subgraphs of $G$ and using [SS11] on those subgraphs:

Proposition 7.9 (Sampling-based effective resistance data structure). Consider a set $X \subseteq \mathbb{R}^d$, let $n = |X|$, let $G$ be the unweighted inner product graph for $X$, and let $S \subseteq X$ be a set with the following properties:

1. (Size) $|S| \geq n/C_{3a}$, where $C_{3a} = 40 \cdot 8d^2$

2. (Expander) $\Phi_{G[S]} \geq 1/C_{3b}$, where $C_{3b} = 800 d^2\log n$

3. (Degree) The degree of each vertex in $S$ within $G[S]$ is at least $n/C_{3c}$, where $C_{3c} = 10000 \cdot 8d^2$.

There is a data structure that, when given a pair of query points $u, v \in X$, outputs a value $\mathrm{ReffQuery}(u,v) \in \mathbb{R}_{\geq 0}$ that satisfies the following properties with probability at least $1 - 1/\mathrm{poly}(n)$:

1. (Upper bound) For any pair $u, v \in S$, $\mathrm{ReffQuery}(u,v) \leq C_4 \cdot \mathrm{Reff}_G(u,v)$, where $C_4 = 10$

2. (Lower bound) For any pair $u, v \in X$, $\mathrm{ReffQuery}(u,v) \geq \mathrm{Reff}_G(u,v)/2$.

The preprocessing method ReffPreproc($X$) takes $\mathrm{poly}(d, \log n)\cdot n$ time and ReffQuery takes $\mathrm{poly}(d, \log n)$ time.

We now implement this data structure. ReffPreproc uniformly samples $O(\log n)$ subgraphs of $G$ and builds an effective resistance data structure for each one using [SS11]. ReffQuery queries each data structure and returns the maximum:

1: procedure ReffPreproc($X$)
2:   Input: $X \subseteq \mathbb{R}^d$ with unweighted inner product graph $G$
3:   $C_5 \leftarrow 1000\log n$
4:   $C_6 \leftarrow 80000 C C_{3a}^2 C_{3b}^2 C_{3c}^2$, where $C$ is the constant in Theorem 3.6
5:   for $i$ from 1 through $C_5$ do
6:     $H_i \leftarrow$ uniformly random subgraph of $G$, sampled by picking $C_6 n$ uniformly random pairs $(u,v) \in X \times X$ and adding the edge $e = \{u,v\}$ to $H_i$ if and only if $\{u,v\} \in E(G)$ (that is, when $|\langle u, v \rangle| \geq \frac{1}{d+1}\|u\|_2\|v\|_2$).
7:     For each edge $e \in E(H_i)$, let $w_e = \frac{|E(G)|}{|E(H_i)|}$.
8:     $f_i \in \mathbb{R}^{X \times X} \leftarrow$ approximation to $\mathrm{Reff}_{H_i}(u,v)$ given by the data structure of Theorem 3.7 for $\varepsilon = 1/6$.
9:   end for
10: end procedure
11: procedure ReffQuery($u, v$)
12:   Input: A pair of points $u, v \in X$
13:   Output: An estimate for the $u$-$v$ effective resistance in $G$
14:   return $\max_{i=1}^{C_5} f_i(u,v)$
15: end procedure
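ReffQuery approximates effective resistances in $G$. The quantity it targets can be computed exactly (though slowly) from the Laplacian pseudoinverse, $\mathrm{Reff}_G(u,v) = (\chi_u - \chi_v)^\top L_G^\dagger (\chi_u - \chi_v)$, as in the following sketch (ours), which is useful for testing small instances.

```python
import numpy as np

def effective_resistance(A):
    """Exact pairwise effective resistances of a weighted adjacency matrix A,
    via the Laplacian pseudoinverse. O(n^3): for testing only, not for large n."""
    L = np.diag(A.sum(1)) - A
    Lp = np.linalg.pinv(L)
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2 * Lp     # R[u, v] = Lp[u,u] + Lp[v,v] - 2 Lp[u,v]

# toy check on the path graph 0 - 1 - 2: Reff(0, 2) = 2
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
R = effective_resistance(A)
assert np.isclose(R[0, 2], 2.0)
print(R)
```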

Bounding the runtime of these two routines is fairly straightforward. We now outline how we prove the approximation guarantee for ReffQuery. To obtain the upper bound, we use Theorem 3.6 to show that $H_i$ contains a sparsifier for $G[S]$, so effective resistances are preserved within $S$. To obtain the lower bound, we use the following novel Markov-style bound on effective resistances:

Lemma 7.10. Let $G$ be a $w$-weighted graph with vertex set $X$ and assign numbers $p_e \in [0,1]$ to each edge. Sample a reweighted subgraph $H$ of $G$ by independently and identically selecting $q$ edges, with an edge chosen with probability proportional to $p_e$ and added to $H$ with weight $t w_e/(p_e q)$, where $t = \sum_{e \in G} p_e$. Fix a pair of vertices $u, v$. Then for any $\kappa > 1$,
$$\Pr\left[\mathrm{Reff}_H(u,v) \leq \mathrm{Reff}_G(u,v)/\kappa\right] \leq 1/\kappa.$$

Proof. For two vertices $u, v$, define the (folklore) effective conductance between $u$ and $v$ in a graph $I$ to be
$$\mathrm{Ceff}_I(u,v) := \min_{q \in \mathbb{R}^n : q_u = 0, q_v = 1} \sum_{\text{edges } \{x,y\} \in I} w_{xy}(q_x - q_y)^2.$$

It is well-known that $\mathrm{Ceff}_I(u,v) = 1/\mathrm{Reff}_I(u,v)$. Let

$$q^* = \arg\min_{q \in \mathbb{R}^n : q_u = 0, q_v = 1} \sum_{\{x,y\} \in E(G)} w_{xy}(q_x - q_y)^2.$$

Using q∗ as a feasible solution in the CeffH optimization problem shows that

$$\mathbb{E}[\mathrm{Ceff}_H(u,v)] \leq \mathbb{E}\left[\sum_{\{x,y\} \in E(G)} w^H_{xy}(q^*_x - q^*_y)^2\right] = \sum_{\{x,y\} \in E(G)} \mathbb{E}[w^H_{xy}](q^*_x - q^*_y)^2 = \sum_{\{x,y\} \in E(G)} \sum_{i=1}^{q} \frac{p_{xy}}{t}\cdot\frac{t w^G_{xy}}{q p_{xy}}(q^*_x - q^*_y)^2 = \sum_{\{x,y\} \in E(G)} w^G_{xy}(q^*_x - q^*_y)^2 = \mathrm{Ceff}_G(u,v),$$

where $w^I_e$ denotes the weight of the edge $e$ in the graph $I$. Finally, we have

$$\Pr[\mathrm{Reff}_H(u,v) < \mathrm{Reff}_G(u,v)/\kappa] = \Pr[\mathrm{Ceff}_H(u,v) > \kappa\cdot\mathrm{Ceff}_G(u,v)] \leq 1/\kappa,$$
where the first step follows from $\mathrm{Reff}_G(u,v) = 1/\mathrm{Ceff}_G(u,v)$, and the last step follows from Markov's inequality.
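Lemma 7.10 is easy to probe empirically. The sketch below (our own illustration) samples reweighted subgraphs of a small random graph exactly as in the lemma ($q$ i.i.d. edge draws with probability proportional to $p_e$, each added with weight $t w_e/(p_e q)$) and estimates the left-hand probability, which should stay below $1/\kappa$; samples that disconnect $u$ from $v$ have infinite resistance and are counted as non-events.

```python
import numpy as np

def reff(A, u, v):
    L = np.diag(A.sum(1)) - A
    Lp = np.linalg.pinv(L)
    return Lp[u, u] + Lp[v, v] - 2 * Lp[u, v]

def connected(A, u, v):
    seen, stack = {u}, [u]
    while stack:
        x = stack.pop()
        for y in np.nonzero(A[x] > 0)[0]:
            if y not in seen:
                seen.add(int(y)); stack.append(int(y))
    return v in seen

rng = np.random.default_rng(0)
n = 8
W = np.triu(rng.uniform(0.5, 2.0, size=(n, n)), 1); W += W.T        # weighted complete graph
edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
p = rng.uniform(0.2, 1.0, size=len(edges))
t, q, kappa, u, v = p.sum(), 30, 3.0, 0, n - 1
R_G = reff(W, u, v)

hits, trials = 0, 2000
for _ in range(trials):
    H = np.zeros((n, n))
    for idx in rng.choice(len(edges), size=q, p=p / t):              # q i.i.d. draws, Pr ~ p_e
        i, j = edges[idx]
        H[i, j] += t * W[i, j] / (p[idx] * q)                        # weight t * w_e / (p_e * q)
        H[j, i] = H[i, j]
    if connected(H, u, v) and reff(H, u, v) <= R_G / kappa:
        hits += 1
print(hits / trials, "should be at most", 1 / kappa)
```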

We now prove Proposition 7.9:

Proof of Proposition 7.9. Runtime. We start with preprocessing. Sampling C6n pairs and checking if each pair satisfies the edge presence condition for G takes O(dC6n) = poly(d, log n)n time. Pre- processing for the function fi takes poly(d, log n)n time by the preprocessing guarantee of Theorem 3.7. Since there are C poly(d, log n) different is, the total preprocessing time is poly(d, log n)n, 5 ≤ as desired. Next, we reason about query time. This follows immediately from the query time bound of Theorem 3.7, along the the fact that C5 poly(d, log n). upper ≤ 2 Upper bound. Let pe = C /n for each edge e E(G[S]), where C = 2C C . We 7 ∈ 7 3a 3b apply Theorem 3.6 to argue that Hi[S] is a sparsifier for G[S] for each i. By choice of C7 and upper 2 the Size condition on S , pu,v Φ2 S . By Lemma 3.5 and the Expander condition on G[S], | | ≥ G[S]| | 2 Reff Φ2 S G[S](u, v) for each pair u, v S. Therefore, the second condition of Theorem 3.6 is G[S]| | ≥ ∈ 2 2 satisfied by the probabilities pe. Let q = 10000C(n /(C3aC3c)) log(n /(C3aC3c))(C7/n), where C is the constant in Theorem 3.6. This value of q satisfies the first condition of Theorem 3.6. To apply Theorem 3.6, we just need to show that Hi[S] has at least q edges with probability at least 1 1/poly(n). First, note that for uniformly random u, v X − ∈

49 Pr [ u, v E(G[S])] = Pr[ u, v E(G[S]), u, v S] u,v X { }∈ { }∈ ∈ ∈ = Pr[ u, v E(G[S]) u, v S]Pr[u S]Pr[v S] { }∈ | ∈ ∈ ∈ 1 1 2 ≥ 2C3c C3a

by the Degree and Size conditions on S. Since at least C6n pairs in X X are chosen, E[ E(Hi[S]) ] 1 1 × | | ≥ C6n( 2C )( 2 ) 2q. By Chernoff bounds, this means that E(Hi[S]) q with probability at 3c C3a ≥ | | ≥ least 1 1/poly(n). Therefore, Theorem 3.6 applies and shows that H [S] with edge weights − i E(G[S]) / E(H [S]) is a (1/6)-sparsifier for G[S] with probability at least 1 1/poly(n). By Cher- | | | i | − noff bounds, E(G[S]) / E(H [S]) E(G) /(2 E(H ) ), so Reff (u, v) Reff (u, v)(1 + | | | i | ≥ | | | i | Hi[S] ≤ G[S] 1/6)2 C Reff (u, v) for all u, v S by the sparsification accuracy guarantee. Since this holds ≤ 4 G[S] ∈ for each i, the maximum over all i satisfies the guarantee as well, as desired. Lower bound. Let plower = 1/ E(G) for each edge e. Note that t = 1, so with q = E(H ) , all e | | | i | edges in the sampled graph should have weight tw /(p q)= E(G) / E(H ) . Therefore, by Lemma e e | | | i | 7.10, for a pair u, v X ∈ 4 Pr[Reff (u, v) 4Reff (u, v)/5] Hi ≤ G ≤ 5 for each i. Since the His are chosen independently and identically,

4 C5 1 Pr[max ReffH (u, v) 4ReffG(u, v)/5] i i ≤ ≤ 5 ≤ n100   Union bounding over all pairs shows that

$$\mathrm{ReffQuery}(u,v) > (6/7)(4/5)\,\mathrm{Reff}_G(u,v) > \mathrm{Reff}_G(u,v)/2$$
for all $u, v \in X$ with probability at least $1 - 1/n^{98} = 1 - 1/\mathrm{poly}(n)$, as desired.

We now describe the algorithm LowDiamSet. This algorithm simply picks random vertices $v$ and queries the effective resistance data structure to check whether $v$ is in a set with the desired properties:

1: procedure LowDiamSet($X$)
2:   ReffPreproc($X$)
3:   while true do
4:     Pick a uniformly random vertex $v$ from $X$
5:     $Q_v \leftarrow \{u \in X : \mathrm{ReffQuery}(u,v) \leq C_2/(2n)\}$
6:     Return $Q_v$ if $|Q_v| \geq n/C_1$
7:   end while
8: end procedure

We now prove that this algorithm suffices:

Proof of Proposition 7.8. Size and Low effective resistance diameter. These follow immediately from the return condition and the fact that, for every $u \in Q_v$ for the returned $Q_v$,

$$\mathrm{Reff}_G(u,v) \leq 2\,\mathrm{ReffQuery}(u,v) \leq C_2/n$$
by the Lower bound guarantee of Proposition 7.9.

Runtime. We start by showing that the while loop terminates after $\mathrm{poly}(d, \log n)$ iterations with probability at least $1 - 1/\mathrm{poly}(n)$. By Proposition 7.5, $G$ is $(d+1)$-dependent. By Proposition 7.6, $G$ has at least $n^2/(8d^2)$ edges. By Proposition 7.7, $G$ has a set of vertices $S$ for which $|S| \geq n/(320 d^2)$, $\Phi_{G[S]} \geq 1/(800 d^2\log n)$, and for which the minimum degree of $G[S]$ is at least $n/(80000 d^2)$. By the first condition on $S$,

$$\Pr[v \in S] \geq 1/(320 d^2),$$
so the algorithm picks a $v$ in $S$ with probability at least $1 - 1/\mathrm{poly}(n)$ after at most $32000 d^2\log n \leq \mathrm{poly}(d, \log n)$ iterations. By Lemma 3.5 applied to $S$,

$$\mathrm{Reff}_G(x,y) \leq \mathrm{Reff}_{G[S]}(x,y) \leq \left(\frac{1}{d_S(x)} + \frac{1}{d_S(y)}\right)\frac{1}{\Phi^2_{G[S]}}$$
for any $x, y \in S$, where $d_S(w)$ denotes the degree of the vertex $w$ in $G[S]$. By the third property of $S$, $d_S(x) \geq n/(80000 d^2)$ and $d_S(y) \geq n/(80000 d^2)$. By this and the second property of $S$,
$$\mathrm{Reff}_G(x,y) \leq 102400000000\, d^6(\log^2 n)/n \leq C_2/(2 C_4 n)$$
for any $x, y \in S$. $S$ satisfies the conditions required of Proposition 7.9 by the choice of the values $C_{3a}$, $C_{3b}$, and $C_{3c}$. Therefore, by the Upper bound guarantee of Proposition 7.9,

$$\mathrm{ReffQuery}(u,v) \leq C_2/(2n)$$
for every $u \in S$ if $v \in S$. Since $|S| \geq n/C_1$ by choice of $C_1$, the return statement returns $Q_v$ with probability at least $1 - 1/\mathrm{poly}(n)$ when $v \in S$. Therefore, the algorithm returns a set $Q_v$ with probability at least $1 - 1/\mathrm{poly}(n)$ after at most $\mathrm{poly}(d, \log n)$ iterations.

Each iteration consists of $O(n)$ ReffQuery calls and $O(n)$ additional work. Therefore, the total work done by the while loop is $\mathrm{poly}(d, \log n)\cdot n$ with probability at least $1 - 1/\mathrm{poly}(n)$. ReffPreproc($X$) takes $\mathrm{poly}(d, \log n)\cdot n$ time by Proposition 7.9. Thus, the total runtime is $\mathrm{poly}(d, \log n)\cdot n$, as desired.

7.3 Using low-effective-resistance clusters to sparsify the unweighted IP graph

In this section, we prove the following result:

Proposition 7.11 (Unweighted inner product sparsification). There is a $\mathrm{poly}(d, \log n)\cdot n/\varepsilon^2$-time algorithm for constructing an $\varepsilon$-sparsifier with $O(n/\varepsilon^2)$ edges for the unweighted inner product graph of a set of $n$ points $X \subseteq \mathbb{R}^d$.

To sparsify an unweighted inner product graph, it suffices to apply Proposition 7.8 repeatedly to partition the graph into $\mathrm{poly}(d, \log n)$ clusters, each with low effective resistance diameter. We can use this structure to get a good bound on the leverage scores of edges between clusters:

Proposition 7.12 (bound on leverage scores of edges between clusters). For a $w$-weighted graph $G$ with vertex set $X$ and $n = |X|$, let $S_1, S_2 \subseteq X$ be two sets of vertices, let $R_1 = \max_{u,v \in S_1}\mathrm{Reff}_G(u,v)$, and let $R_2 = \max_{u,v \in S_2}\mathrm{Reff}_G(u,v)$. Then, for any $u \in S_1$ and $v \in S_2$,
$$\mathrm{Reff}_G(u,v) \leq 3R_1 + 3R_2 + \frac{3}{\sum_{x \in S_1, y \in S_2} w_{xy}},$$
where $w_{xx} = \infty$ for all $x \in X$.

51 Proof. Let χ Rn be the vector with χ = 1, χ = 1, and χ = 0 for all x X with x = u and ∈ u v − x ∈ 6 x = v. For each vertex x S , let s = w . For each vertex y S , let s = w . Let 1 x y S2 xy 2 y x S1 xy 6 ∈ ∈ ∈ (1) (12)∈ (2) n τ = sx = sy = wxy. Write χ as a sum of three vectors d , d , d R x S1 y S2 x S1,y SP2 P ∈ as follows:∈ ∈ ∈ ∈ P P P 1 sx x = u − τ d(1) = sx x S u x − τ ∈ 1 \ { } 0 otherwise

 sx 1 x = v τ − d(2) = sx x S v x  τ ∈ 2 \ { } 0 otherwise

 sx x S τ ∈ 1 d(12) = sx x S x − τ ∈ 2 0 otherwise

Notice that d(1) + d(2) + d(12) = χ. Furthermore, notice that d(1) = p χ(ux) and d(2) =  x S1 x ∈ q χ(yv), where χ(ab) is the signed indicator vector of the edge from a to b and p = 1, y S2 y x S1 x ∈ P ∈ qy = 1, and px 0 and qy 0 for all x S1,y S2. The function f(d)= d⊤L†d is convex, Py S2 ≥ ≥ ∈ ∈ P so by∈ Jensen’s Inequality, P (1) (1) (ux) (ux) (d )⊤L† (d ) p (χ )⊤L† χ G ≤ x G x S1 X∈ p R ≤ x 1 x S1 X∈ R ≤ 1

(2) (2) E(G) wxy and (d ) L† (d ) R . Let f R be the vector with f = for all x S ,y S and ⊤ G ≤ 2 ∈ | | xy τ ∈ 1 ∈ 2 let f = 0 for all other e E(G). By definition of the s s, f is a feasible flow for the electrical flow e ∈ u optimization problem for the demand vector d(12). Therefore,

2 (12) (12) fxy (d )⊤LG† d ≤ wxy x S1,y S2 ∈ X∈ w = xy τ 2 x S1,y S2 ∈ X∈ 1 = τ so we can upper bound ReffG(u, v) in the following way

Reff (1) (2) (12) (1) (2) (12) G(u, v) = (d + d + d )⊤LG† (d + d + d ) (1) (1) (2) (2) (12) (12) 3(d )⊤L† d + 3(d )⊤L† d + 3(d )⊤L† d ≤ G G G 3R + 3R + 3/τ ≤ 1 2 as desired.

7.4 Sampling data structure

In this section, we give a data structure for efficiently sampling pairs of points $(u, v) \in \mathbb{R}^d \times \mathbb{R}^d$ with probability proportional to a constant-factor approximation of $|\langle u, v \rangle|$:

Lemma 7.13. Given a pair of sets $S_1, S_2 \subseteq \mathbb{R}^d$, there is a data structure that can be constructed in $\widetilde{O}(d(|S_1| + |S_2|))$ time that, in $\mathrm{poly}(d\log(|S_1| + |S_2|))$ time per sample, independently samples pairs $u \in S_1, v \in S_2$ with probability $p_{uv}$, where
$$\frac{1}{2}\cdot\frac{|\langle u, v \rangle|}{\sum_{a \in S_1, b \in S_2}|\langle a, b \rangle|} \leq p_{uv} \leq 2\cdot\frac{|\langle u, v \rangle|}{\sum_{a \in S_1, b \in S_2}|\langle a, b \rangle|}.$$
Furthermore, it is possible to query the probability $p_{uv}$ in $\mathrm{poly}(d\log(|S_1| + |S_2|))$ time.

To produce this data structure, we use the following algorithm for sketching $\ell_1$-norms:

Theorem 7.14 (Theorem 3 in [Ind06]). An efficiently computable, $\mathrm{poly}(\log d, 1/\varepsilon)$-space linear sketch exists for the $\ell_1$ norm. That is, given $d \in \mathbb{Z}_{\geq 1}$, $\delta \in (0,1)$, and $\varepsilon \in (0,1)$, there is a matrix $C = \mathrm{SketchMatrix}(d, \delta, \varepsilon) \in \mathbb{R}^{\ell \times d}$ and an algorithm $\mathrm{RecoverNorm}(s, d, \delta, \varepsilon)$ with the following properties:

1. (Approximation) For any vector $v \in \mathbb{R}^d$, with probability at least $1 - \delta$ over the randomness of SketchMatrix, the value $r = \mathrm{RecoverNorm}(Cv, d, \delta, \varepsilon)$ satisfies:

$$(1 - \varepsilon)\|v\|_1 \leq r \leq (1 + \varepsilon)\|v\|_1$$

2. $\ell = (c/\varepsilon^2)\log(1/\delta)$ for some constant $c > 1$

3. (Runtime) SketchMatrix and RecoverNorm take $\widetilde{O}(\ell d)$ and $\mathrm{poly}(\ell)$ time, respectively.

We use this sketching algorithm to obtain the desired sampling algorithm via the following subroutine:

Corollary 7.15. Given a set $S \subseteq \mathbb{R}^d$, an $\varepsilon \in (0,1)$, and a $\delta \in (0,1)$, there exists a data structure which, when given a query point $u \in \mathbb{R}^d$, returns a $(1 \pm \varepsilon)$-multiplicative approximation to $\sum_{v \in S}|\langle u, v \rangle|$ with probability at least $1 - \delta$. This data structure can be computed in $O(\ell d|S|)$ preprocessing time and takes $\mathrm{poly}(\ell d)$ time per query, where $\ell = O(\varepsilon^{-2}\log(1/\delta))$.

Proof. Let $n = |S|$ and $C = \mathrm{SketchMatrix}(n, \delta, \varepsilon)$. We will show that the following algorithm returns the desired estimate with probability at least $1 - \delta$:

1. Preprocessing:

(a) Index the rows of $C$ by integers between $1$ and $\ell$. Index the columns of $C$ by points $v \in S$.

(b) Compute the vector $x^{(i)} = \sum_{v \in S} C_{iv}\cdot v$ for each $i \leq \ell$, where $\ell = (c/\varepsilon^2)\log(1/\delta)$ for the constant $c$ in Theorem 7.14.

2. Given a query point $u \in \mathbb{R}^d$,

(a) Let $y \in \mathbb{R}^{\ell}$ be the vector with $y_i = \langle u, x^{(i)} \rangle$ for each $i \in [\ell]$.

(b) Return $\mathrm{RecoverNorm}(y, n, \delta, \varepsilon)$.

Approximation. Let $w \in \mathbb{R}^n$ be the vector with $w_v = \langle u, v \rangle$ for each $v \in S$. It suffices to show that the number that a query returns is a $(1 \pm \varepsilon)$-approximation to $\|w\|_1$. By definition, $y_i = \sum_{v \in S} C_{iv} w_v$ for all $i \in [\ell]$, so $y = Cw$. Therefore, by the Approximation guarantee of Theorem 7.14, $\mathrm{RecoverNorm}(y, n, \delta, \varepsilon)$ returns a $(1 \pm \varepsilon)$-approximation to $\|w\|_1$ with probability at least $1 - \delta$, as desired.

Preprocessing time. Computing the matrix $C$ takes $\widetilde{O}(n\ell)$ time by the Runtime guarantee of Theorem 7.14. Computing the vectors $x^{(i)}$ for all $i \in [\ell]$ takes $O(d\ell n)$ time. These are all of the preprocessing steps, so the total preprocessing time is $O(d\ell n)$, as desired.

Query time. Each query consists of $\ell$ inner products of $d$-dimensional vectors and one call to RecoverNorm, for a total of $O(\ell d + \mathrm{poly}(\ell))$ work, as desired.
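The data structure above is easy to prototype. The sketch below (ours) follows the same outline but substitutes a standard 1-stable (Cauchy) sketch with a median estimator for Indyk's SketchMatrix/RecoverNorm pair, which is one common way such an $\ell_1$ sketch is realized; it is a rough illustration rather than the construction with the exact guarantees of Theorem 7.14.

```python
import numpy as np

class InnerProductL1Estimator:
    """Estimates sum over v in S of |<u, v>| for query points u.

    Preprocessing stores x^(i) = sum_v C[i, v] * v for a random sketch matrix C,
    so each query needs only ell inner products in R^d."""

    def __init__(self, S, ell=400, seed=0):
        rng = np.random.default_rng(seed)
        C = rng.standard_cauchy(size=(ell, len(S)))   # 1-stable sketch matrix
        self.X = C @ S                                # rows are the vectors x^(i)

    def query(self, u):
        y = self.X @ u        # y_i = <u, x^(i)> = (C w)_i, where w_v = <u, v>
        # Each y_i is Cauchy with scale ||w||_1, and the median of |Cauchy(0, s)| is s.
        return np.median(np.abs(y))

rng = np.random.default_rng(3)
S = rng.normal(size=(500, 10))
u = rng.normal(size=10)
print(InnerProductL1Estimator(S).query(u), np.abs(S @ u).sum())
```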

We use this corollary to obtain a sampling algorithm as follows, where $n = |S_1| + |S_2|$:

1. Preprocessing:

(a) Use the Corollary 7.15 data structure to $(1 \pm 1/(100\log n))$-approximate $\sum_{v \in S_2}|\langle u, v \rangle|$ for each $u \in S_1$. Let $t_u$ be this estimate for each $u \in S_1$ (one preprocessing call with $S \leftarrow S_2$, and $|S_1|$ queries).

(b) Form a balanced binary tree $\mathcal{T}$ of subsets of $S_2$, with $S_2$ at the root, the elements of $S_2$ at the leaves, and the property that for every parent-child pair $(P, C)$, $|C| \leq 2|P|/3$.

(c) For every node $S$ in the binary tree, construct a $(1 \pm 1/(100\log n))$-approximate data structure for $S$.

2. Sampling query:

(a) Sample a point $u \in S_1$ with probability $t_u/(\sum_{a \in S_1} t_a)$.

(b) Initialize $S \leftarrow S_2$. While $|S| > 1$:

  i. Let $P_1$ and $P_2$ denote the two children of $S$ in $\mathcal{T}$.

  ii. Let $s_1$ and $s_2$ be the $(1 \pm 1/(100\log n))$-approximations to $\sum_{v \in P_1}|\langle u, v \rangle|$ and $\sum_{v \in P_2}|\langle u, v \rangle|$, respectively, obtained from the data structures computed during preprocessing.

  iii. Reset $S$ to $P_1$ with probability $s_1/(s_1 + s_2)$; otherwise reset $S$ to $P_2$.

(c) Return the single element in $S$.

3. $p_{uv}$ query:

(a) Return the product of the $O(\log n)$ probabilities attached to the ancestor nodes of the node $\{v\}$ in $\mathcal{T}$ for $u$, obtained as during the sampling query.

$s_i/(s_1 + s_2)$ is a $(1 \pm 1/(100\log n))^2$-approximation to $\Pr[v \in P_i \mid v \in S]$ for each $i \in \{1,2\}$. A point $v \in S_2$ is sampled with probability proportional to the product of these conditional probabilities over its ancestors, for which $p_{uv}$ is a $(1 + 1/(100\log n))^{2\log n + 1} \leq 2$-approximation. The total preprocessing time is proportional to the sizes of all sets in the tree, which is at most $O(|S_2|\log|S_2|)$. The total query time is also $\mathrm{polylog}(|S_1| + |S_2|)\cdot\mathrm{poly}(d)$ due to the logarithmic depth of the query binary tree. We now expand on this intuition to prove Lemma 7.13:

Proof of Lemma 7.13. We show that the algorithm given just before this proof satisfies this lemma: Probability guarantee. Consider a pair u S1, v S2. Let A0 = S2, A1,...Ak 1, Ak = v ∈ ∈ − { } denote the sequence of ancestor sets of the singleton set v in . For each i [k], let B be { } T ∈ i

54 the child of Ai 1 besides Ai (unique because is binary). For a node X of , let sX be the − T T (1 1/(100 log n))-approximation to u, v used by the algorithm. The sampling probability ± v X |h i| p is the following product of probabilities:∈ uv P k tu sAi puv = a S1 ta sAi + sBi ∈ Yi=1 s + s is a (1 1/(100 log n))-approximationP to u, v by the approximation guarantee Ai Bi v A − ± ∈ i 1 |h i| of Corollary 7.15. sA is a (1 1/(100 log n))-approximation to u, v by the approximation i ± P v Ai |h i| guarantee of Corollary 7.15. By these guarantees and the approximation∈ guarantees for the t s for P a a S , p is a (1 1/(100 log n))2k+2 (1 1/2)-approximation to ∈ 1 uv ± ≤ ± u, b k u, b b S2 b Ai u, v ∈ |h i| ∈ |h i| = |h i| a, b u, b a, b Pa S1,b S2 i=1 Pb Ai−1 a S1,b S2 ∈ ∈ |h i| Y ∈ |h i| ∈ ∈ |h i| as desired, since k Plog n. P P ≤ Preprocessing time. Let δ = 1/n1000 and ℓ = (c/ε2) log(1/δ), where c is the constant from Theorem 7.14. Approximating all t s takes O(ℓd S )+ S poly(ℓd)= npoly(ℓd) time by Corollary u | 2| | 1| 7.15. Preprocessing the data structures for each node S takes O(ℓd S ) time in total. ∈ T S | | Each member of S is in at most O(log n) setse in , since is a balanced∈T binary tree. Therefore, 2 T T P O(ℓd S ) O(nℓd), so the total preprocessing time is O(nℓd), as desired. S | | ≤ Query∈T time. O(log n) of the data structures attached to sets in are queried per sample P T query, for a total ofeO(log n)poly(ℓd) time by Corollary 7.15,e as desired.
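A compact prototype of the $S_2$ half of this binary-tree sampler is below (our own illustration). For clarity it uses exact subtree sums $\sum_{v}|\langle u, v \rangle|$ in place of the $(1 \pm 1/(100\log n))$-approximate data structures, so the sampling probabilities it realizes are exactly proportional to $|\langle u, v \rangle|$ rather than 2-approximately so.

```python
import numpy as np

class TreeSampler:
    """Samples v in S2 with probability proportional to |<u, v>| by walking a
    balanced binary tree of subsets of S2, as in the scheme above (exact sums)."""

    def __init__(self, S2):
        self.S2 = S2
        self.tree = self._build(np.arange(len(S2)))

    def _build(self, idx):
        if len(idx) == 1:
            return (idx, None, None)
        mid = len(idx) // 2
        return (idx, self._build(idx[:mid]), self._build(idx[mid:]))

    def sample(self, u, rng):
        node, prob = self.tree, 1.0
        while node[1] is not None:
            left, right = node[1], node[2]
            s1 = np.abs(self.S2[left[0]] @ u).sum()    # stand-in for the approximate structure
            s2 = np.abs(self.S2[right[0]] @ u).sum()
            p1 = s1 / (s1 + s2)
            if rng.random() < p1:
                node, prob = left, prob * p1
            else:
                node, prob = right, prob * (1 - p1)
        return node[0][0], prob                        # (sampled index v, probability p_uv)

rng = np.random.default_rng(0)
S2 = rng.normal(size=(16, 5))
u = rng.normal(size=5)
v, p_uv = TreeSampler(S2).sample(u, rng)
print(v, p_uv, np.abs(S2 @ u)[v] / np.abs(S2 @ u).sum())   # p_uv equals the exact proportion
```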

We use this data structure, via a simple reduction, to implement the following data structure, which suffices for our applications:

Proposition 7.16. Given a family $\mathcal{G}$ of sets in $\mathbb{R}^d$, a collection of positive real numbers $\{\gamma_S\}_{S \in \mathcal{G}}$, and a set $S_2 \subseteq \mathbb{R}^d$, there is a data structure that can be constructed in $\widetilde{O}(d(|S_2| + \sum_{S \in \mathcal{G}}|S|))$ time that, in $\mathrm{poly}(d\log(|S_2| + \sum_{S \in \mathcal{G}}|S|))$ time per sample, independently samples pairs $u \in S$ for some $S \in \mathcal{G}$, $v \in S_2$ with probability $p_{uv}$, where
$$\frac{\gamma_S|\langle u, v \rangle|}{\sum_{A \in \mathcal{G}}\left(\gamma_A\sum_{a \in A, b \in S_2}|\langle a, b \rangle|\right)}$$
is a 2-approximation to $p_{uv}$. Furthermore, it is possible to query the number $p_{uv}$ in $\mathrm{poly}(d\log(|S_2| + \sum_{S \in \mathcal{G}}|S|))$ time.

55 1. Sample:

(a) Sample a pair (x, (y, 0log n)) from , where y S and x Rd+log n. D ∈ 2 ∈ (b) Since the function id outputs values that are not proportional to one other, the function f is injective. 1 (c) Return the pair (f − (x),y). (d) (For the proof, let w = f 1(x) and let S be the unique set for which w S) − ∈ This data structure has preprocessing time O(d( S + S )) = O(d( S + S )) and sample | 2′ | | 1| | 2| S | | query time poly(d log n) by Lemma 7.13, as desired. Therefore, we just need to show∈G that it samples P a pair (w,y) with the desired probability. Bye the probability guaranteee for Lemma 7.13 and the injectivity of the mapping f combined over all S , p is 2-approximated by ∈ G wy

x, (y, 0log n) γ w,y |h i| = S|h i| ′ a S1,b S a, b A p A,q S2 p,q ∈ ∈ 2 |h i| ∈G ∈ ∈ |h i| P P P as desired.

7.5 Weighted IP graph sparsification

In this section, we use the tools developed in the previous sections to sparsify weighted inner product graphs. To modularize the exposition, we define a partition of the edge set of a weighted inner product graph:

Definition 7.17. For a $w$-weighted graph $G$ and three functions $\zeta, \kappa, \delta$ on pairs of vertex sets, a collection $\mathcal{F}$ of (vertex set family, vertex set) pairs is called a $(\zeta, \kappa, \delta)$-cover for $G$ iff the following property holds:

1. (Coverage) For any $e = \{u, v\} \in E(G)$, there exists a pair $(\mathcal{G}, S_1) \in \mathcal{F}$ and an $S_0 \in \mathcal{G}$ for which $u \in S_0, v \in S_1$ or $u \in S_1, v \in S_0$, and
$$\mathrm{Reff}_G(u,v) \leq \frac{\delta(S_0, S_1)}{w_{uv}} + \frac{\kappa(S_0, S_1)}{\max_{x \in S_0, y \in S_1} w_{xy}} + \frac{\zeta(S_0, S_1)}{\sum_{x \in S_0, y \in S_1} w_{xy}}.$$

A $(\zeta, \kappa, \delta)$-cover is said to be $s$-sparse if $\sum_{(\mathcal{G}, S_1) \in \mathcal{F}}\sum_{S_0 \in \mathcal{G}}\big((\delta(S_0,S_1) + \kappa(S_0,S_1))|S_0||S_1| + \zeta(S_0,S_1)\big) \leq s$. A $(\zeta, \kappa)$-cover is said to be $w$-efficient if $\sum_{(\mathcal{G}, S_1) \in \mathcal{F}}\big(|S_1| + \sum_{S_0 \in \mathcal{G}}|S_0|\big) \leq w$. When $\delta = 0$, we simplify notation and refer to $(\zeta, \kappa)$-covers instead.

Given a $(\zeta, \kappa)$-cover for a weighted or unweighted inner product graph, one can sparsify it using Theorem 3.6 and the sampling data structure from Proposition 7.16:

Proposition 7.18. Given a set $X \subseteq \mathbb{R}^d$ with $n = |X|$, an $s$-sparse, $w$-efficient $(\zeta, \kappa, \delta)$-cover $\mathcal{F}$ for the weighted inner product graph $G$ on $X$, and $\varepsilon, \delta \in (0,1)$, there is a
$$\mathrm{poly}(d, \log n, \log s, \log w, \log(1/\delta))\cdot(s + w + n)/\varepsilon^4$$
time algorithm for constructing a $(1 \pm \varepsilon)$-sparsifier for $G$ with $O(n\log n/\varepsilon^2)$ edges with probability at least $1 - \delta$.

Algorithm 4
1: procedure OversamplingWithCover($X, \mathcal{F}, \varepsilon, \delta$)
2:   (for analysis only: define $r_{uv}$ for each pair $u, v \in X$ as in the proof)
3:   Construct the Proposition 7.16 data structure $\mathcal{D}_{(\mathcal{G},S_1)}$ with $\gamma_S = \zeta(S, S_1)$ for each $S \in \mathcal{G}$
4:   $t \leftarrow \sum_{u,v \in X} r_{uv}$
5:   $q \leftarrow C\cdot\varepsilon^{-2}\cdot t\log t\cdot\log(1/\delta)$
6:   Initialize $H$ to be an empty graph
7:   for $i = 1 \to q$ do
8:     Sample one $e = \{u,v\} \in X \times X$ with probability $r_e/t$, by sampling $\{u,v\}$ uniformly or from some data structure $\mathcal{D}_{(\mathcal{G},S_1)}$ (see proof for details)
9:     Add that edge with weight $w_e t/(r_e q)$ to the graph $H$ (note: $r_e$ can be computed in $\mathrm{poly}(d, \log w)$ time by the $p_{uv}$ query time in Proposition 7.16)
10:  end for
11:  return Spielman–Srivastava [SS11] applied to $H$
12: end procedure

Proof. Filling in algorithm details (the bolded parts). We start by filling in the details in the algorithm OversamplingWithCover. First, we define r for each pair of distinct u, v X. u, v is a uv ∈ { } weighted edge in G with weight wuv. Define

( ,S1) ruv = 2 δ(S0,S1)+ κ(S0,S1)+ ζ(A, S1) puvG ( ,S1) S0 :u S0,v S1 or v S0,u S1 A ! ! G X∈F ∈G ∈ ∈X ∈ ∈ X∈G

( ,S1) ( ,S ) where puvG is the probability p defined for the data structure 1 in Proposition 7.16. Next, uv D G we fully describe how to sample pairs u, v with probability proportional to r . Notice that t can { } uv be computed in O(w) time because

t = 2 ruv u,v X X∈

( ,S1) = 2 δ(S0,S1)+ κ(S0,S1)+ ζ(A, S1) puvG ( ,S1) S0 u S0,v S1 A ! ! G X∈F X∈G ∈ X∈ X∈G

= 2 ζ(A, S ) + S S (δ(S ,S )+ κ(S ,S ))  1 | 0|| 1| 0 1 0 1  ( ,S1) A ! S0 G X∈F X∈G X∈G   can be computed in O(w) time. Sample a pair u, v with probability equal to r /t as follows: { } uv 1. Sample a pair:

(a) Sample a Bernoulli b Bernoulli 1 ζ(A, S ) . t ( ,S1) A 1 ∼ G ∈F ∈G (b) If b = 1  P P 

57 i. Sample a pair ( ,S1) with probability proportional to A ζ(A, S1). G ∈ F ∈G ii. Sample the pair (u, v) using the data structure ( ,S1). D G P (c) Else

i. Sample a pair ( ,S1) with probability proportional to S0 S1 (δ(S0,S1)+ G ∈ F S0 | || | κ(S ,S )). ∈G 0 1 P ii. Sample an S with probability proportional to S (δ(S ,S )+ κ(S ,S )). 0 ∈ G | 0| 0 1 0 1 iii. Sample (u, v) S S uniformly. ∈ 0 × 1 All sums in the above sampling procedure can be precomputed in poly(d)w time. After doing this precomputation, each sample from the above procedure takes poly(d, log n, log w) time by to Proposition 7.16 for the last step in the if statement and uniform sampling from [0, 1] with intervals otherwise. Sparsifier correctness. By the Coverage guarantee of and the approximation guarantee for F the p s in Proposition 7.16, w Reff (u, v) r for all u, v X. Therefore, Theorem 3.6 applies uv uv G ≤ uv ∈ and shows that the graph H returned is a (1 ε)-sparsifier for G with probability at least 1 δ. ± − Spielman-Srivastava only worsens the approximation guarantee by a (1 + ε) factor, as desired. Number of edges in H. It suffices to bound q. In turn, it suffices to bound t. Recall from above that

t = 2 ζ(A, S ) + S S (δ(S ,S )+ κ(S ,S ))  1 | 0|| 1| 0 1 0 1  ( ,S1) A ! S0 G X∈F X∈G X∈G   = 2 ( S S (δ(S ,S )+ κ(S ,S )) + ζ(S ,S ))  | 0|| 1| 0 1 0 1 0 1  ( ,S1) S0 G X∈F X∈G 2s   ≤ since is s-sparse. Therefore, q poly(d, log s, log 1/δ)s/ε2. F ≤ Runtime. We start by bounding the runtime to produce H. Constructing the data structure

( ,S1) takes poly(d, log n, log w)( S1 + S S0 ) time by Proposition 7.16. Therefore, the total D G | | 0 | | time to construct all data structures is at most∈G poly(d, log n, log w)w. Computing t and q, as dis- P cussed above, takes O(w) time. q poly(d, log s, log 1/δ)s/ε2 as discussed above and each iteration ≤ of the for loop takes poly(d, log n, log w) time by the query complexity bounds of Proposition 7.16. Therefore, the total time required to produce H is at most poly(d, log n, log s, log w, log 1/δ)(s + w)/ε2. Running Spielman-Srivastava requires an additional poly(log s, log n)(s/ε2 + n)/ε2 time, for a total of poly(d, log n, log s, log w, log 1/δ)(s + w)/ε4 time, as desired.

Therefore, to sparsify weighted inner product graphs, it suffices to construct a $\mathrm{poly}(d, \log n)\cdot n$-sparse, $\mathrm{poly}(d, \log n)\cdot n$-efficient $(\zeta, \kappa)$-cover. We break this construction up into a sequence of steps:

7.5.1 $(\zeta, \kappa)$-cover for unweighted IP graphs

We start by constructing covers for unweighted inner product graphs. The algorithm repeatedly peels off sets constructed using LowDiamSet and returns all pairs of such sets. The sparsity of the cover is bounded due to Proposition 7.12. The efficiency of the cover is bounded thanks to a $\mathrm{poly}(d, \log n)$ bound on the number of while loop iterations, which in turn follows from the Size guarantee of Proposition 7.8.

Proposition 7.19 (Cover for unweighted graphs). Given a set $X \subseteq \mathbb{R}^d$ with $|X| = n$, there is a $\mathrm{poly}(d, \log n)\cdot n$-time algorithm UnweightedCover($X$) that, with probability at least $1 - 1/\mathrm{poly}(n)$, produces a $\mathrm{poly}(d, \log n)\cdot n$-sparse, $\mathrm{poly}(d, \log n)\cdot n$-efficient $(\zeta, \kappa)$-cover for the unweighted inner product graph $G$ on $X$.

Algorithm 5
1: procedure UnweightedCover($X$)
2:   Input: $X \subseteq \mathbb{R}^d$
3:   Output: A sparse, efficient $(\zeta, \kappa)$-cover for the unweighted inner product graph $G$ on $X$
4:   $\mathcal{U} \leftarrow \emptyset$
5:   $Y \leftarrow X$
6:   while $Y \neq \emptyset$ do  ⊲ Finding expanders
7:     Let $Q \leftarrow$ LowDiamSet($Y$)
8:     Add the set $Q$ to $\mathcal{U}$
9:     Remove the vertices $Q$ from $Y$
10:  end while
11:  return $\{(\mathcal{U}, S) : \forall S \in \mathcal{U}\}$
12: end procedure

Proof. Let = UnweightedCover(X) and define the functions ζ, κ as follows: ζ(S ,S ) = 3 F 0 1 and κ(S ,S ) = C 3 + 3 for any pair of sets S ,S X. Recall that C is defined in the 0 1 2 S0 S1 0 1 2 | | | | ⊆ statement of Proposition 7.8.  Number of while loop iterations. We start by showing that there are at most poly(d, log n) while loop iterations with probability at least 1 1/poly(n). By the Size guarantee of Proposition − 7.8, when LowDiamSet succeeds (which happens with probability 1 1/poly(n)), Y decreases in − size by a factor of at least 1 1/p(d, log n) for some fixed constant degree polynomial p. Therefore, − after p(d, log n) log n = poly(d, log n) iterations, Y will be empty, as desired. Therefore, |U| ≤ poly(d, log n). Runtime. The runtime follows immediately from the bound on the number of while loop iterations and the runtime bound on LowDiamSet from Proposition 7.8. C2 Coverage. For each S , maxu,v S ReffG(u, v) by the Low effective resistance diameter ∈U ∈ ≤ S guarantee of Proposition 7.8. Plugging this into Proposition| | 7.12 immediately shows that is a F (ζ, κ)-cover for G. Sparsity bound. We bound the desired quantity directly using the fact that poly(d, log n): |U| ≤

(κ(S ,S ) S S + ζ(S ,S )) poly(d, log n) ( S + S + 1) 0 1 | 0|| 1| 0 1 ≤ | 0| | 1| ( ,S1) S0 ( ,S1) S0 U X∈F X∈U U X∈F X∈U poly(d, log n)n ≤ as desired.

Efficiency bound. Follows immediately from the bound on $|\mathcal{U}|$.

7.5.2 $(\zeta, \kappa)$-cover for weighted IP graphs on bounded-norm vectors

Given a cover for unweighted inner product graphs, it is easy to construct covers for weighted inner product graphs on bounded-norm vectors, simply by removing edge weights and producing the cover. Edge weights only differ by a factor of $O(d)$ between these two graphs, so effective resistances also differ by at most that amount. Note that the following algorithm also works for vectors with norms between $z$ and $2z$ for any real number $z$.

Proposition 7.20 (Weighted bounded norm cover). Given a set $X \subseteq \mathbb{R}^d$ with $1 \leq \|u\|_2 \leq 2$ for all $u \in X$ and $|X| = n$, there is an $n\cdot\mathrm{poly}(d, \log n)$-time algorithm BoundedCover($X$) that produces a $\mathrm{poly}(d, \log n)\cdot n$-sparse, $\mathrm{poly}(d, \log n)\cdot n$-efficient $(\zeta, \kappa)$-cover for the weighted inner product graph $G$ on $X$.

Proof. Let G0 be the unweighted inner product graph on X. Let G1 be the weighted graph G with all edges that are not in G deleted. Let be the (ζ , κ )-cover given by Proposition 7.19 for G . 0 F 0 0 0 Let this cover be the output of BoundedCover(X). It suffices to show that is a (ζ, κ)-cover for F G, where ζ = 8dζ0 and κ = 8dκ0. By Rayleigh monotonicity,

Reff (u, v) Reff (u, v) G ≤ G1 for all u, v X. Let w denote the weight of the edge e in G . For all edges e in G , 1 w 4 ∈ e 1 1 d+1 ≤ e ≤ by the norm condition on X. Therefore, for all u, v X, ∈ Reff (u, v) (d + 1)Reff (u, v) G1 ≤ G0 By the Coverage guarantee on , there exists a pair ( ,S ) and an S for which u S , v S F G 1 0 ∈ G ∈ 0 ∈ 1 or v S , u S and ∈ 0 ∈ 1 ζ (S ,S ) Reff (u, v) (d + 1) κ (S ,S )+ 0 0 1 G1 ≤ 0 0 1 S S  | 0|| 1|  By the upper bound on the edge weights for G1,

$$\mathrm{Reff}_{G_1}(u,v) \leq 4(d+1)\left(\frac{\kappa_0(S_0, S_1)}{\max_{x \in S_0, y \in S_1} w_{xy}} + \frac{\zeta_0(S_0, S_1)}{\sum_{x \in S_0, y \in S_1} w_{xy}}\right).$$
Since $4(d+1) \leq 8d$, $\mathcal{F}$ is a $(\zeta, \kappa)$-cover for $G$ as well, as desired.

7.5.3 $(\zeta, \kappa)$-cover for weighted IP graphs on vectors with norms in the set $[1,2] \cup [z, 2z]$ for any $z > 1$

By the previous subsection, it suffices to cover the pairs $(u, v)$ for which $\|u\|_2 \in [1,2]$ and $\|v\|_2 \in [z, 2z]$. This can be done by clustering using LowDiamSet on the $[z, 2z]$-norm vectors. For each cluster $S_1$, let $\mathcal{G} = \{\{u\} : \forall u \in X \text{ with } \|u\|_2 \in [1,2]\}$. This cover is sparse because the clusters have low effective resistance diameter. It is efficient because of the small number of clusters.

Proposition 7.21 (Two-scale cover). Given a set $X \subseteq \mathbb{R}^d$ for which $|X| = n$ and $\|u\|_2 \in [1,2] \cup [z, 2z]$ for all $u \in X$, there is a $\mathrm{poly}(d, \log n)\cdot n$-time algorithm TwoBoundedCover($X$) that produces a $\mathrm{poly}(d, \log n)\cdot n$-sparse, $\mathrm{poly}(d, \log n)\cdot n$-efficient $(\zeta, \kappa)$-cover for the weighted inner product graph $G$ on $X$.

Algorithm 6
1: procedure TwoBoundedCover($X$)
2:   Input: $X \subseteq \mathbb{R}^d$, where $\|u\|_2 \in [1,2] \cup [z, 2z]$ for all $u \in X$
3:   Output: A sparse, efficient $(\zeta, \kappa)$-cover for the weighted inner product graph $G$ on $X$
4:   $X^{\mathrm{low}} \leftarrow \{u \in X : \|u\|_2 \in [1,2]\}$
5:   $X^{\mathrm{high}} \leftarrow \{u \in X : \|u\|_2 \in [z, 2z]\}$
6:   $\mathcal{U} \leftarrow \emptyset$
7:   $Y \leftarrow X^{\mathrm{high}}$
8:   while $Y \neq \emptyset$ do
9:     Let $Q \leftarrow$ LowDiamSet($Y$)
10:    Add the set $Q$ to $\mathcal{U}$
11:    Remove the vertices $Q$ from $Y$
12:  end while
13:  return $\{(\{\{u\} : \forall u \in X^{\mathrm{low}}\}, S_1) : \forall S_1 \in \mathcal{U}\}$
14:        $\cup\ \mathrm{BoundedCover}(X^{\mathrm{low}}) \cup \mathrm{BoundedCover}(X^{\mathrm{high}})$
15: end procedure

Proof. Suppose that the BoundedCovers returned for Xlow and Xhigh are (ζlow, κlow) and (ζhigh, κhigh)- covers respectively. Recall the value C poly(d, log n) from the statement of Proposition 7.8. Let 2 ≤ w = u, v denote the weight of the u-v edge in G. Let = TwoBoundedCover(X) and uv |h i| F define the functions ζ, κ as follows:

ζlow(S ,S ) if S ,S Xlow 0 1 0 1 ⊆ ζhigh(S0,S1) if S0,S1 Xhigh ζ(S0,S1)=  ⊆ 3 if S0 Xlow and S1 Xhigh  ⊆ ⊆ otherwise ∞  κlow(S ,S ) if S ,S Xlow 0 1 0 1 ⊆ κhigh(S0,S1) if S0,S1 Xhigh κ(S ,S )= ⊆ 0 1  24dC2  if S Xlow and S Xhigh  S1 0 1  | | ⊆ ⊆ otherwise ∞  Number of while loop iterations . A poly(d, log n)-round bound follows from the Size bound of Proposition 7.8. For more details, see the same part of the proof of Proposition 7.19, which used the exact same algorithm for producing . U Runtime. Follows immediately from the runtime bounds of LowDiamSet, BoundedCover, and the number of while loop iterations. Coverage. Consider a pair u, v X. We break the analysis up into cases: ∈ Case 1: u, v Xlow. In this case, the Coverage property of BoundedCover(Xlow) implies that ∈ the pair (u, v) is covered in by Rayleigh monotonicity (since Glow is a subgraph of G, where Glow F is the weighted inner product graph for Xlow). Case 2: u, v Xhigh. In this case, the Coverage property of BoundedCover(Xhigh) implies ∈ that the pair (u, v) is covered in by Rayleigh monotonicity. F Case 3: u Xlow and v Xhigh. Since is a partition of Xhigh, there is a unique pair ( ,S ) ∈ ∈ U G 1 ∈ for which u and v S . Let H denote the unweighted inner product graph on Xhigh. By F { } ∈ G ∈ 1

61 2 2 Rayleigh monotonicity, the fact that z z w for all x,y E(H), and the Low effective 2d ≤ d+1 ≤ xy { } ∈ resistance diameter guarantee of Proposition 7.8,

Reff (x,y) Reff (x,y) G ≤ Ghigh 2dReffH (x,y) ≤ z2 2dC2 ≤ z2 S | 1| for any x,y S . Since z > 1, w 4z2 for all x,y X. Therefore, ∈ 1 xy ≤ ∈ 8dC2 ReffG(x,y) ≤ maxp X ,q S1 wpq ∈ low ∈ for any x,y S . Therefore, Proposition 7.12 implies the desired Coverage bound in this case. ∈ 1 Sparsity bound. We use the fact that poly(d, log n) along with sparsity bounds for |U| ≤ BoundedCover from Proposition 7.20 to bound the sparsity of as follows: F

(κ(S ,S ) S S + ζ(S ,S )) = Sparsity(BoundedCover(Xlow)) 0 1 | 0|| 1| 0 1 ( ,S1) S0 G X∈F X∈G + Sparsity(BoundedCover(Xhigh)) + (κ( u ,S ) S + ζ( u ,S )) { } 1 | 1| { } 1 u X ,S1 ∈ lowX ∈U poly(d, log n)n + Xlow (24dC + 3) ≤ | ||U| 2 poly(d, log n)n ≤ as desired. Efficiency bound. We use the efficiency bounds of Proposition 7.20 along with the bound on : |U|

S + S = Efficiency(BoundedCover(Xlow)) | 1| | 0| ( ,S1) S0 G X∈F X∈G   + Efficiency(BoundedCover(Xhigh))

+ ( S + Xlow ) | 1| | | S1 X∈U poly(d, log n)n + Xhigh + Xlow ≤ | | |U|| | poly(d, log n)n ≤ as desired.

7.5.4 $(\zeta, \kappa)$-cover for weighted IP graphs with polylogarithmic dependence on norm

We now apply the subroutine from the previous subsection to produce a cover for weighted inner product graphs on vectors with arbitrary norms. However, we allow the sparsity and efficiency of the cover to depend on the ratio $\tau$ between the maximum and minimum norms of points in $X$. To obtain this cover, we bucket vectors by norm and call TwoBoundedCover on all pairs of buckets.

Proposition 7.22 (Log-dependence cover). Given a set of vectors $X \subseteq \mathbb{R}^d$ with $\tau = \frac{\max_{x \in X}\|x\|_2}{\min_{x \in X}\|x\|_2}$ and $n = |X|$, there is a $\mathrm{poly}(d, \log n, \log\tau)\cdot n$-time algorithm LogCover($X$) that produces a $\mathrm{poly}(d, \log n, \log\tau)\cdot n$-sparse, $\mathrm{poly}(d, \log n, \log\tau)\cdot n$-efficient $(\zeta, \kappa)$-cover for the weighted inner product graph $G$ on $X$.

Algorithm 7
1: procedure LogCover($X$)
2:   Input: $X \subseteq \mathbb{R}^d$
3:   Output: A sparse, efficient $(\zeta, \kappa)$-cover for the weighted inner product graph $G$ on $X$
4:   $d_{\min} \leftarrow \min_{x \in X}\|x\|_2$, $d_{\max} \leftarrow \max_{x \in X}\|x\|_2$
5:   for $i \in \{0, 1, \ldots, \log\tau\}$ do
6:     $X_i \leftarrow \{x \in X : \|x\|_2 \in [2^i d_{\min}, 2^{i+1} d_{\min})\}$
7:   end for
8:   $\mathcal{F} \leftarrow \emptyset$
9:   for each pair $i, j \in \{0, 1, \ldots, \log\tau\}$ do
10:    Add TwoBoundedCover($X_i \cup X_j$) to $\mathcal{F}$
11:  end for
12:  return $\mathcal{F}$
13: end procedure

Proof. Coverage. For any edge u, v E(G), there exists a pair i, j 0, 1,..., log τ for which { }∈ ∈ { } u, v X X . Therefore, the Coverage property for TwoBoundedCover(X X ) (which is ∈ i ∪ j i ∪ j part of ) implies that the pair u, v is covered by . F { } F Runtime, efficiency, and sparsity. Efficiency and sparsity of are at most the sum of the F efficiencies and sparsities respectively of the constituent TwoBoundedCovers, each of which are at most poly(d, log n)n by Proposition 7.21. There are O(log2 τ) such covers in , so the efficiency F and sparsity of is at most poly(d, log n) log2 τn poly(d, log n, log τ)n, as desired. Runtime is F ≤ also bounded due to the fact that there are at most log2 τ for loop iterations.

7.5.5 Desired $(\zeta,\kappa,\delta)$-cover

Now, we obtain a cover for all $X$ with sparsity, efficiency, and runtime $\mathrm{poly}(d,\log n)$. To do this, we break up the pairs $\{u,v\} \in X \times X$ to cover into two types. Without loss of generality, suppose that $\|u\|_2 \le \|v\|_2$. The first type consists of pairs for which $\|v\|_2 \le (dn)^{1000}\|u\|_2$. These pairs are covered using several LogCovers. The total efficiency, sparsity, and runtime required for these covers is $\mathrm{poly}(d,\log n)\,n$ due to the fact that each vector is in at most $\mathrm{poly}(d,\log n)$ of these covers.

The second type consists of all other pairs, i.e. those with $\|v\|_2 > (dn)^{1000}\|u\|_2$. These pairs are taken care of via a clustering argument. We cluster all vectors in $X$ into at most $d+1$ clusters in a greedy fashion. Specifically, we sort the vectors in decreasing order by norm and create a new cluster for a vector $x \in X$ if $|\langle x,y\rangle| < \frac{1}{d+1}\|x\|_2\|y\|_2$ for the first vector $y$ of every existing cluster. Otherwise, we assign $x$ to an arbitrary cluster whose first vector $y$ satisfies $|\langle x,y\rangle| \ge \frac{1}{d+1}\|x\|_2\|y\|_2$. (A sketch of this clustering step appears below.) We then cover the pair $\{u,v\}$ using the pair of sets $(\{u\},C)$, where $C$ is the cluster containing $v$. To argue that this satisfies the Coverage property, we exploit the norm condition on the pair $\{u,v\}$. To bound efficiency, sparsity, and runtime, it suffices to bound the number of clusters, which is at most $d+1$ by Proposition 7.5.

In order to define this algorithm, we use the notion of an interval family, which is exactly the same as the one-dimensional interval tree from computational geometry.
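For intuition, here is a minimal Python sketch of the greedy clustering step just described (assuming the points are given as rows of a matrix); the cluster "first vectors" play the role of the set $B$ in Algorithm 8, and by Proposition 7.5 at most $d+1$ clusters are created. The function name is illustrative, not from the paper.

```python
import numpy as np

def greedy_clusters(X, d):
    """Scan rows of X in decreasing norm order; open a new cluster when the point has
    normalized inner product < 1/(d+1) with every existing cluster's first vector."""
    order = np.argsort(-np.linalg.norm(X, axis=1))   # indices in decreasing norm order
    centers, clusters = [], []                        # centers ~ the set B in the text
    for i in order:
        x = X[i]
        placed = False
        for c, ctr in enumerate(centers):
            y = X[ctr]
            if abs(x @ y) >= np.linalg.norm(x) * np.linalg.norm(y) / (d + 1):
                clusters[c].append(i)                 # assign to an arbitrary valid cluster
                placed = True
                break
        if not placed:
            centers.append(i)                         # x becomes the first vector of a new cluster
            clusters.append([i])
    return centers, clusters
```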

Definition 7.23. For a set $X$ and a function $f : X \to \mathbb{R}$, define the interval family for $X$, denoted $\mathcal{X} = \mathrm{IntervalFamily}(X)$, to be a family of sets produced recursively by initializing $\mathcal{X} = \{X\}$ and repeatedly taking an element $S \in \mathcal{X}$, splitting it evenly into two subsets $S_0$ and $S_1$ for which $\max_{x\in S_0} f(x) \le \min_{x\in S_1} f(x)$, and adding $S_0$ and $S_1$ to $\mathcal{X}$, until $\mathcal{X}$ contains all singleton subsets of $X$. $\mathcal{X}$ has the property that for any set $S \subseteq X$ consisting of all $x \in X$ for which $a \le f(x) \le b$ for two $a,b \in \mathbb{R}$, $S$ is the disjoint union of $O(\log|X|)$ sets in $\mathcal{X}$. Furthermore, each element in $X$ is in at most $O(\log|X|)$ sets in $\mathcal{X}$.

Proposition 7.24 (Desired cover). Given a set $X \subseteq \mathbb{R}^d$ with $n = |X|$, there is a $\mathrm{poly}(d,\log n)\,n$-time algorithm DesiredCover($X$) that produces a $\mathrm{poly}(d,\log n)\,n$-sparse, $\mathrm{poly}(d,\log n)\,n$-efficient $(\zeta,\kappa,\delta)$-cover for the weighted inner product graph $G$ on $X$.

Proof. We start by defining the functions $\zeta$, $\kappa$, and $\delta$. Let $\zeta_i$ and $\kappa_i$ denote the functions for which LogCover($Y_i$) is a $(\zeta_i,\kappa_i)$-cover for the weighted inner product graph on $Y_i$, where $Y_i = X_i \cup X_{i+1} \cup \cdots \cup X_{i+\lceil\log\xi\rceil}$ for all $i \le \log(d_{max}/d_{min}) - \lceil\log\xi\rceil$. Let

$$
\zeta(S_0,S_1) = \begin{cases}
\sum_{i:\,S_0,S_1\subseteq Y_i,\;\zeta_i(S_0,S_1)\ne\infty} \zeta_i(S_0,S_1) & \text{if there exists } i \text{ for which } S_0,S_1\subseteq Y_i \\
2 & \text{if } S_1\in\mathcal{C}_y \text{ for some } y\in B \text{ and } S_0=\{x\} \text{ for some } x \text{ with } \|x\|_2 \le \min_{a\in S_1}\|a\|_2/\xi \\
\infty & \text{otherwise}
\end{cases}
$$

$$
\kappa(S_0,S_1) = \begin{cases}
\sum_{i:\,S_0,S_1\subseteq Y_i,\;\kappa_i(S_0,S_1)\ne\infty} \kappa_i(S_0,S_1) & \text{if there exists } i \text{ for which } S_0,S_1\subseteq Y_i \\
0 & \text{if } S_1\in\mathcal{C}_y \text{ for some } y\in B \text{ and } S_0=\{x\} \text{ for some } x \text{ with } \|x\|_2 \le \min_{a\in S_1}\|a\|_2/\xi \\
\infty & \text{otherwise}
\end{cases}
$$

$$
\delta(S_0,S_1) = \begin{cases}
0 & \text{if there exists } i \text{ for which } S_0,S_1\subseteq Y_i \\
\frac{1}{|S_1|} & \text{if } S_1\in\mathcal{C}_y \text{ for some } y\in B \text{ and } S_0=\{x\} \text{ for some } x \text{ with } \|x\|_2 \le \min_{a\in S_1}\|a\|_2/\xi \\
\infty & \text{otherwise}
\end{cases}
$$

Before proving that the required guarantees are satisfied, we bound some important quantities.

Bound on $w_{yw}$ in terms of $w_{uy}$ for $y\in C_w$ if $\|y\|_2 \ge \xi\|u\|_2$. By definition of $C_w$, $w_{yw} \ge \frac{1}{d+1}\|y\|_2\|w\|_2$ for any $y\in C_w$. $w$ was the first member added to $C_w$, so $\|w\|_2 \ge \|y\|_2$. By the norm assumption on $y$, $\|y\|_2 \ge \xi\|u\|_2$. By Cauchy-Schwarz, $\|y\|_2\|u\|_2 \ge |\langle y,u\rangle| = w_{uy}$. Therefore,

$$w_{yw} \ge \frac{\xi}{d+1}\,w_{uy}$$

64 Algorithm 8 1: procedure DesiredCover(X) 2: Input: X Rd ⊆ 3: Output: An sparse, efficient (ζ, κ, δ)-cover for the weighted inner product graph G on X, where ζ, κ, and δ are defined in the proof of Proposition 7.24. 4: F ← ∅ 5: ξ (dn)1000 ← 6: dmin minx X x 2, dmax maxx X x 2 ← ∈ k k ← ∈ k k ⊲ Cover nearby norm pairs 7: for i 0, 1,..., log(d /d ) log ξ do ∈ { max min − ⌈ ⌉ 8: X x X : x [d 2i, d 2i+1) i ← { ∈ k k2 ∈ min min 9: Add LogCover(Xi Xi+1 . . . Xi+ log ξ ) to ∪ ∪ ∪ ⌈ ⌉ F 10: end for ⊲ Create approximate basis for spread pairs B ← ∅ 11: for x X in decreasing order by x do ∈ k k2 12: if there does not exist y B for which x,y x y /(d + 1) then ∈ |h i| ≥ k k2k k2 13: Add x to B and initialize a cluster C = x x { } 14: else 15: Add x to Cy for an arbitrary choice of y satisfying the condition 16: end if 17: end for ⊲ Cover the spread pairs 18: for w B do ∈ 19: IntervalFamily(C ) Cw ← w 20: for each set C do ∈Cw 21: the family of all singletons of x X for which the disjoint union of O(log n) GC ← ∈ sets for y C : y ξ u obtained from contains C { ∈ w k k2 ≥ k k2} Cw 22: Add ( ,C) to GC F 23: end for 24: end for return F 25: end procedure

Bound on Reff (u, w) for w B. We start by bounding the effective resistance between u X G ∈ ∈ and any w B for which w >ξ u . Recall that w = x,y for any x,y X. Consider any ∈ k k2 k k2 xy |h i| ∈ C w for which mina C a 2 >ξ u 2. We show that ∈C ∈ k k k k 2 ReffG(u, w) ≤ y C wuy ∈ Consider all 2-edge paths of the form u-y-w for y PC. By assumption on C, y 2 ξ u 2 for any ∈ k k ≥ k k y C. Therefore, the bound on w applies: ∈ yw ξ w w yw ≥ d + 1 uy for any y C. By series-parallel reductions, the u-w effective resistance is at most ∈

65 Reff 1 G(u, w) 1 ≤ y C 1/w +1/w ∈ uy yw P 1 1 ≤ y C (1+(d+1)/ξ)/w ∈ uy P 2 ≤ y C wuy ∈ P as desired. Bound on Reff (u, y) for y C. Any y C has the property that y ξ u . Therefore, G ∈ ∈ k k2 ≥ k k2 for y C, Reff (y, w) d+1 1 . By the triangle inequality for effective resistance, G ξwuy C wuy ∈ ≤ ≤ | | 1 2 ReffG(u, y) ReffG(u, w)+ ReffG(w,y) + ≤ ≤ C wuy a C wua | | ∈ Coverage. For any pair u, v for which there exists i with u, v PYi, u, v is still covered { } ∈ { } by by the Coverage property of LogCover(Y ). Therefore, we may assume that this is not the F i case. Without loss of generality, suppose that v u . Then, by assumption, v ξ u . k k2 ≥ k k2 k k2 ≥ k k2 By definition of the C s, there exists a w B for which v C . By the first property of interval w ∈ ∈ w families, the set x C : x ξ u is the disjoint union of O(log n) sets in . Let C be the { ∈ w k k2 ≥ k k2} Cw unique set among these for which v C. By our effective resistance bound, ∈

1 2 ReffG(u, v) + ≤ C wuv x C wux | | ∈ δ( u ,C) κ( u ,C) ζ( u ,C) = { } +P { } + { } wuv maxx C wux x C wux ∈ ∈ P so the coverage property for the pair u, v is satisfied within by the pair ( ,C), as desired. { } F GC Efficiency. The efficiency of is at most the efficiency of the LogCovers and the remaining F part for spread pairs. We start with the LogCovers. By Proposition 7.22,

Efficiency(LogCover(Y )) poly(d, log Y , log ξ) Y i ≤ | i| | i| Xi Xi poly(d, log n) Y ≤ | i| Xi (log ξ + 1)poly(d, log n) X ≤ | i| Xi poly(d, log n)n ≤

Therefore, we just need to bound the efficiency of the remainder of . The efficiency of is at F F most

66 Efficiency( )= ((δ(S ,S )+ κ(S ,S )) S S + ζ(S ,S )) F 0 1 0 1 | 0|| 1| 0 1 ( ,S1) S0 G X∈F X∈G = Efficiency(LogCover(Yi)) Xi + (δ( u ,C) C + ζ( u ,C)) { } | | { } w B C w u X∈ X∈C { X}∈GC poly(d, log n)n + 3 ≤ |GC | w B C w X∈ X∈C

By the first property of interval families, each x X is present as a singleton in at most O(log n) ∈ s for C that are a subset of a given C . Therefore, GC w Efficiency( ) poly(d, log n)n + O(log n)n F ≤ w B X∈ By Proposition 7.5, B d + 1. Therefore, Efficiency( ) poly(d, log n)n, as desired. | |≤ F ≤ Sparsity. By Proposition 7.22,

Sparsity(LogCover(Y )) poly(d, log n, log ξ) Y poly(d, log n)n i ≤ | i|≤ Xi Xi Therefore, we may focus on the remaining part for spread pairs. In particular,

Sparsity( )= ( S + S ) F | 1| | 0| ( ,S1) S0 G X∈F X∈G poly(d, log n)n ≤ + ( C + ) | | |GC | w B C w X∈ X∈C poly(d, log n)n + C ≤ | | w B C w X∈ X∈C where the last inequality follows from the first property of interval families. By the second prop- erty of interval families, each element of Cw is in at most O(log n) sets in w, so C C C w | | ≤ O(log n) C . Since B d + 1, Sparsity( ) poly(d, log n)n, as desired. ∈C | w| | |≤ F ≤ P Runtime. The first for loop takes poly(d, log n) Y poly(d, log n)n by Proposition 7.22. i | i| ≤ The second for loop takes O(d B n) poly(d, log n)n by the bound on B . The third for loop takes | | ≤ P | | poly(d, log n)n by the runtime for IntervalFamily and the two properties of interval families. Therefore, the total runtime is poly(d, log n)n, as desired.

7.5.6 Proof of Lemma 7.1

Proof of Lemma 7.1. Follows immediately from constructing the cover $\mathcal{F}$ given by Proposition 7.24 and plugging that into Proposition 7.18.

8 Hardness of Sparsifying and Solving Non-Multiplicatively-Lipschitz Laplacians

We now define some terms to state our hardness results:

Definition 8.1. For a decreasing function $f$ that is not $(C,L)$-multiplicatively Lipschitz, there exists a point $x$ for which $f(Cx) \le C^{-L} f(x)$. Let $x_0$ denote one such point. A set of real numbers $S \subseteq \mathbb{R}_{\ge 0}$ is called $\rho$-discrete for some $\rho > 1$ if for any pair $a,b \in S$ with $b > a$, we have $b \ge \rho a$. A set of points $X \subseteq \mathbb{R}^d$ is called $\rho$-spaced for some $\rho > 1$ if there is some $\rho$-discrete set $S \subseteq \mathbb{R}_{\ge 0}$ with the property that for any pair $x,y \in X$, $\|x-y\|_2 \in S$. $S$ is called the distance set for $X$.

Dim.   Thm.   d             g(p)   ρ                                       Time
Low    8.2    c^{log* n}    p      1 + 16 log(10 L^{1/(4c_0)})/L           O(n L^{1/(8c_0)})
High   8.3    log n         e^p    1 + 2 log(10 · 2^{L^{0.48}})/L          O(n 2^{L^{0.48}})

Table 3: Sparsification Hardness

In this section, we show the following two hardness results:

Theorem 8.2 (Low-dimensional sparsification hardness). Consider a decreasing function $f:\mathbb{R}_{\ge 0}\to\mathbb{R}_{\ge 0}$ that is not $(\rho,L)$-multiplicatively Lipschitz for some $L>1$, where $c$ and $c_0$ are the constants given in Theorem 3.25 and $\rho = 1 + 2\log(10 L^{1/(4c_0)})/L$. There is no algorithm that, given a set of $n$ points $X$ in $d = c^{\log^* n}$ dimensions, returns a sparsifier of the $f$-graph for $X$ in less than $O(n L^{1/(8c_0)})$ time, assuming SETH.

Theorem 8.3 (High-dimensional sparsification hardness). Consider a decreasing function $f:\mathbb{R}_{\ge 0}\to\mathbb{R}_{\ge 0}$ that is not $(\rho,L)$-multiplicatively Lipschitz for some $L>1$, where $\rho = 1 + 2\log(10\cdot 2^{L^{0.48}})/L$. There is no algorithm that, given a set of $n$ points $X$ in $d = O(\log n)$ dimensions, returns a sparsifier of the $f$-graph for $X$ in less than $O(n 2^{L^{0.48}})$ time, assuming SETH.

Both of these results follow from the following reduction:

Lemma 8.4. Consider a decreasing function $f:\mathbb{R}_{\ge 0}\to\mathbb{R}_{\ge 0}$ that is not $(\rho,L)$-multiplicatively Lipschitz for some $L>1$, where $\rho = 1 + 2\log(10n)/L$ and $n>1$. Suppose that there is an algorithm $\mathcal{A}$ that, when given a set of $n$ points $X \subseteq \mathbb{R}^d$, returns a 2-approximate sparsifier for the $f$-graph of $X$ with $O(n)$ edges in $\mathcal{T}(n,L,d)$ time. Then, there is an algorithm (Algorithm 9) that, given two sets $A,B \subseteq \mathbb{R}^d$ for which $A\cup B$ is $\rho$-spaced with distance set $S$, $k\in S$, and $|A\cup B| = n$, returns whether or not $\min_{a\in A,\,b\in B}\|a-b\|_2 \le k$ in

$$O\big(\mathcal{T}(|A\cup B|, L, d) + |A\cup B|\big)$$

time.

The reduction starts by scaling the points in $A$ and $B$ by a factor of $\sqrt{x_0}/k$ to obtain $\tilde{A}$ and $\tilde{B}$ respectively. Then, it sparsifies the $f$-graph for $\tilde{A}\cup\tilde{B}$. Finally, it computes the weight of the edges in the $\tilde{A}$-$\tilde{B}$ cut. Because $f$ is not multiplicatively Lipschitz and the distance set for $A\cup B$ is spaced, thresholding this cut weight suffices for solving the $A\times B$ nearest neighbor problem.

Proof of Lemma 8.4. Consider the following algorithm, BichromaticNearestNeighbor (Algorithm 9), given below:

Algorithm 9
1: procedure BichromaticNearestNeighbor($A$, $B$, $k$)   ⊲ Lemma 8.4
2:   Given: $A,B \subset \mathbb{R}^d$ with the property that $A\cup B$ is $\rho$-spaced, where $\rho = 1 + 2(\log(10n))/L$, and $k\in S$
3:   Returns: whether there are $a\in A$, $b\in B$ for which $\|a-b\|_2 \le k$
4:   $\tilde{A} \leftarrow \{a\cdot\sqrt{x_0}/k \mid \forall a\in A\}$   ⊲ $\tilde{A} \subset \mathbb{R}^d$
5:   $\tilde{B} \leftarrow \{b\cdot\sqrt{x_0}/k \mid \forall b\in B\}$   ⊲ $\tilde{B} \subset \mathbb{R}^d$
6:   $H \leftarrow \mathcal{A}(\tilde{A}\cup\tilde{B})$   ⊲ $H$ is a 2-approximate sparsifier for the $f$-graph $G$ of $\tilde{A}\cup\tilde{B}$
7:   if the total weight of edges between $\tilde{A}$ and $\tilde{B}$ in $H$ is at least $f(x_0)/2$ then
8:     return true
9:   else
10:    return false
11:  end if
12: end procedure
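For intuition, here is a hedged Python rendering of the same decision rule (a sketch, not the paper's implementation): the sparsifier is passed in as a black-box callable standing in for the assumed algorithm $\mathcal{A}$, and edge weights follow the paper's convention $w_{uv} = f(\|u-v\|_2^2)$.

```python
import numpy as np

def bichromatic_nn_decision(A, B, k, x0, f, sparsify):
    """Decide whether min_{a in A, b in B} ||a - b||_2 <= k.
    `sparsify(points, f)` is an assumed 2-approximate O(n)-edge sparsifier
    returning edges as (i, j, weight); `x0` is the point from Definition 8.1."""
    pts = np.vstack([A, B]) * (np.sqrt(x0) / k)     # scale so that distance k maps to sqrt(x0)
    n_a = len(A)
    edges = sparsify(pts, f)
    cut_weight = sum(w for i, j, w in edges
                     if (i < n_a) != (j < n_a))     # total weight crossing the A-B cut
    return cut_weight >= f(x0) / 2                  # threshold test from line 7 of Algorithm 9
```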

We start by bounding the runtime of this algorithm. Constructing $\tilde{A}$ and $\tilde{B}$ and calculating the total weight of edges between $\tilde{A}$ and $\tilde{B}$ takes $O(n)$ time since $H$ has $O(n)$ edges. Since the sparsification algorithm is only called once, the total runtime is therefore $\mathcal{T}(n,L,d) + O(n)$, as desired. For the rest of the proof, we may therefore focus on correctness.

First, suppose that $\min_{a\in A,\,b\in B}\|a-b\|_2 \le k$. There exists a pair of points $\tilde{a}\in\tilde{A}$, $\tilde{b}\in\tilde{B}$ with $\|\tilde{a}-\tilde{b}\|_2 \le \sqrt{x_0}$. Since $f$ is a decreasing function, the edge between $\tilde{a}$ and $\tilde{b}$ in $G$ has weight at least $f(x_0)$, which means that the total weight of edges in the $\tilde{A}$-$\tilde{B}$ cut in $G$ is at least $f(x_0)$. Since $H$ is a 2-approximate sparsifier for $G$, the total weight of edges in the $\tilde{A}$-$\tilde{B}$ cut of $H$ is at least $f(x_0)/2$. This means that true is returned, as desired.

Next, suppose that $\min_{a\in A,\,b\in B}\|a-b\|_2 > k$. Since $A\cup B$ is $\rho$-spaced with distance set $S$ and $k\in S$, $\|a-b\|_2 \ge \rho\cdot k$ for all $a\in A$ and $b\in B$. Therefore, $\|\tilde{a}-\tilde{b}\|_2 \ge \rho\cdot\sqrt{x_0}$ for all $\tilde{a}\in\tilde{A}$ and $\tilde{b}\in\tilde{B}$. Since $f$ is decreasing and not $(\rho,L)$-multiplicatively Lipschitz, the weight of any edge between $\tilde{A}$ and $\tilde{B}$ in $G$ is at most

$$f(\rho\cdot x_0) \le f(x_0)/(100n^2).$$

The total weight of edges between $\tilde{A}$ and $\tilde{B}$ is therefore at most

$$n^2\cdot\big(f(x_0)/(100n^2)\big) = f(x_0)/100.$$

Since $H$ is a 2-approximate sparsifier for $G$, the total weight between $\tilde{A}$ and $\tilde{B}$ in $H$ is at most $f(x_0)/4 < f(x_0)/2$, so the algorithm returns false, as desired.

We now prove the theorems:

Proof of Theorem 8.2. Consider an instance of $\ell_2$-bichromatic closest pair for $n = L^{1/(4c_0)}$ and $d = c^{\log^* n}$, where $c$ is the constant given in the dimension bound of Theorem 3.25 and $c_0$ is such that integers have bit length $c_0\log n$ in Theorem 3.25. This consists of two sets of points $A,B\subseteq\mathbb{R}^d$ with $|A\cup B| = n$ for which we wish to compute $\min_{a\in A,\,b\in B}\|a-b\|_2$. By Theorem 3.25, the coordinates of points in $A$ and $B$ are $c_0\log n$-bit integers. Therefore, the set $S$ of possible $\ell_2$ distances between points in $A$ and $B$ is a set of square roots of integers with $\log d + c_0\log n \le 2c_0\log n$ bits. Therefore, $S$ is $\rho$-discrete (recall $\rho = 1 + (2\log(10n))/L$), since $1 + 1/n^{2c_0} > 1 + (2\log(10n))/L$. Furthermore, note that $f$ is not $(\rho,L)$-multiplicatively Lipschitz by assumption.

We now describe an algorithm for solving $\ell_2$-closest pair on $A\times B$. Use binary search on the values in $S$ to compute the minimum distance between points in $A$ and $B$ (see the sketch below). For each query point $k\in S$, by Lemma 8.4, there is a

$$\mathcal{T}(n,L,d) = O\big(\mathcal{T}(L^{1/(4c_0)}, L, c^{\log^* L}) + L^{1/(4c_0)}\big)$$

-time algorithm for determining whether or not the closest pair has distance at most $k$. Therefore, there is a

$$O\big(\log|S|\cdot(\mathcal{T}(L^{1/(4c_0)}, L, c^{\log^* L}) + L^{1/(4c_0)})\big) = \widetilde{O}(L^{1/(4c_0)}\,L^{1/(8c_0)}) < O(n^{3/2})$$

time algorithm for solving $\ell_2$-closest pair on pairs of sets with $n$ points. But this is impossible given SETH by Theorem 3.25, a contradiction. This completes the result.
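A short Python sketch of the binary search used in both proofs (names are placeholders): `decide(k)` stands for the threshold test of Lemma 8.4, e.g. the sketch after Algorithm 9, and `S` is the sorted list of candidate distances.

```python
def min_distance_via_decisions(S, decide):
    """Binary search over sorted candidate distances S, given a monotone decision
    oracle decide(k) = [some A-B pair is at distance <= k]. Assumes decide(S[-1]) is True."""
    lo, hi = 0, len(S) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if decide(S[mid]):
            hi = mid           # a close pair exists at this threshold; search smaller values
        else:
            lo = mid + 1
    return S[lo]               # smallest candidate distance admitting a close pair
```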

Proof of Theorem 8.3. Consider an instance of bichromatic Hamming nearest neighbor search for $n = 2^{L^{0.49}}$ and $d = c_1\log n$ for the constant $c_1$ in the dimension bound in Theorem 3.21. This consists of two sets of points $A,B\subseteq\mathbb{R}^d$ with $|A\cup B| = n$ for which we wish to compute $\min_{a\in A,\,b\in B}\|a-b\|_2$. The coordinates of points in $A$ and $B$ are 0-1. Therefore, the set $S$ of possible $\ell_2$ distances between points in $A$ and $B$ is the set of square roots of integers between 0 and $c_1\log n$, which differ by a factor of at least $1 + 1/(2c_1\log n) > \rho$ (recall $\rho = 1 + (2\log(10n))/L$). Therefore, $A\cup B$ is $\rho$-spaced. Note that $f$ is not $(\rho,L)$-multiplicatively Lipschitz by definition.

We now give an algorithm for solving $\ell_2$-closest pair on $A\times B$. Use binary search on $S$. For each query $k\in S$, Lemma 8.4 implies that one can check if there is a pair with distance at most $k$ in

$$\mathcal{T}(n,L,d) = O\big(\mathcal{T}(2^{L^{0.49}}, L, c_1 L^{0.49}) + 2^{L^{0.49}}\big) \le O(2^{L^{0.49} + L^{0.48}}) = n^{1+o(1)}$$

time on pairs of sets with $n$ points. But this is impossible given SETH by Theorem 3.21. This completes the result.

Next, we prove hardness results for solving Laplacian systems. In these hardness results, we insist that kernels are bounded:

Definition 8.5. Call a function $f:\mathbb{R}_{\ge 0}\to\mathbb{R}_{\ge 0}$ $g(p)$-bounded for a function $g:\mathbb{R}_{\ge 0}\to\mathbb{R}_{\ge 0}$ iff for any pair $a,b>0$ with $b>a$, $f(b) \ge f(a)/g(b/a)$. Call a set $S\subseteq\mathbb{R}_{\ge 0}$ $\gamma$-bounded iff $\max_{s\in S} s \le \gamma\cdot\min_{s\in S,\,s\ne 0} s$. A set of points $X\subseteq\mathbb{R}^d$ is called $\gamma$-boxed iff the set of distances between points in $X$ is $\gamma$-bounded.

Dim.   Thm.   d             g(p)   ρ                                    Time                             ε
Low    8.6    c^{log* n}    p      1 + 16 log(10 L^{1/(4c_0)})/L        n log(g(γ)) L^{1/(64c_0)}        1/(g(γ)^3 2^{poly(log n)})
High   8.7    log n         e^p    1 + 2 log(10 · 2^{L^{0.48}})/L       n log(g(γ)) 2^{L^{0.48}}         1/(g(γ)^3 2^{poly(log n)})

Table 4: Linear System Hardness

We show the following results:

Theorem 8.6 (Partial low-dimensional linear system hardness). Consider a decreasing $g(p) = p$-bounded function $f:\mathbb{R}_{\ge 0}\to\mathbb{R}_{\ge 0}$ that is not $(\rho,L)$-multiplicatively Lipschitz for some $L>1$, where $c$ and $c_0$ are the constants given in Theorem 3.25 and $\rho = 1 + 16\log(10 L^{1/(32c_0)})/L$. Assuming SETH, there is no algorithm that, given a $\gamma$-boxed set of $n$ points $X$ in $d = c^{\log^* n}$ dimensions with $f$-graph $G$ and a vector $b\in\mathbb{R}^n$, returns an $\varepsilon = 1/(g(\gamma)^3 2^{\mathrm{poly}(\log n)})$-approximate solution $x\in\mathbb{R}^n$ to the geometric Laplacian system $L_G x = b$ in less than $O(n\log(g(\gamma))\,L^{1/(64c_0)})$ time.

Theorem 8.7 (Partial high-dimensional linear system hardness). Consider a decreasing $g(p) = e^p$-bounded function $f:\mathbb{R}_{\ge 0}\to\mathbb{R}_{\ge 0}$ that is not $(\rho,L)$-multiplicatively Lipschitz for some $L>1$, where $\rho = 1 + 16\log(10\cdot 2^{L^{0.48}})/L$. Assuming SETH, there is no algorithm that, given a $\gamma$-boxed set of $n$ points $X$ in $d = O(\log n)$ dimensions with $f$-graph $G$ and a vector $b\in\mathbb{R}^n$, returns an $\varepsilon = 1/(g(\gamma)^3 2^{\mathrm{poly}(\log n)})$-approximate solution $x\in\mathbb{R}^n$ to the geometric Laplacian system $L_G x = b$ in less than $O(n\log(g(\gamma))\,2^{L^{0.48}})$ time.

To prove these theorems, we use the following reduction from bichromatic nearest neighbors:

Lemma 8.8. Consider a decreasing $g(p)$-bounded function $f:\mathbb{R}_{\ge 0}\to\mathbb{R}_{\ge 0}$ that is not $(\rho,L)$-multiplicatively Lipschitz for some $L>1$, where $\rho = 1 + 16(\log(10n))/L$ and $n>1$. Suppose that there is an algorithm $\mathcal{A}$ that, given a $\gamma$-boxed set of $n$ points $X$ in $d$ dimensions with $f$-graph $G$ and a vector $b\in\mathbb{R}^n$, returns an $\varepsilon = 1/(g(\gamma)^3 2^{\mathrm{poly}(\log n)})$-approximate solution $x\in\mathbb{R}^n$ to the geometric Laplacian system $L_G x = b$ in $\mathcal{T}(n,L,g(\gamma),d)$ time. Then, there is an algorithm that, given two sets $A,B\subseteq\mathbb{R}^d$ for which $A\cup B$ is $\rho$-spaced with $|S|^{O(1)}$-bounded distance set $S$, $k\in S$, and $|A\cup B| = n$, returns whether or not $\min_{a\in A,\,b\in B}\|a-b\|_2 \le k$ in

$$O\big((\log n)\,\mathcal{T}(n,L,g(|S|^{O(1)}),d) + |A\cup B|\big)$$

time.

This reduction works in a similar way to the effective resistance data structure of Spielman and Srivastava [SS11], but with minor differences due to the fact that their data structure requires multiplication by the incidence matrix of the graph, which in our case is dense. Our reduction uses Johnson-Lindenstrauss to embed the points $v_s = L^\dagger b_s$ for vertices $s$ in the graph $G$ into $O(\log n)$ dimensions in a way that distorts the distances $b_{st}^\top (L^\dagger)^2 b_{st}$ for vertices $s,t$ in $G$ by a factor of at most 2. After computing this embedding, we build an $O(\log n)$-approximate nearest neighbor data structure on the resulting points. This allows us to determine whether or not a vertex in $A$ has a high-weight edge in $G$ to $B$ in almost-constant time. After looping through all of the vertices in $A$ in total time $n^{1+o(1)}$, we determine whether or not there are any high-weight edges between $A$ and $B$ in $G$, allowing us to answer the bichromatic nearest neighbors decision problem.

We start by proving a result that links norms of $v_s$ to effective resistances:

Proposition 8.9. In an $n$-vertex graph $G$ with vertices $s$ and $t$,

$$0.5\cdot(\mathrm{Reff}_G(s,t))^2 \le \|v_s - v_t\|_2^2 \le n\cdot(\mathrm{Reff}_G(s,t))^2$$

71 n Proof. Lower bound. Let x = v v = L† b R . By definition, t − s G st ∈ v v 2 = x 2 k s − tk2 k k2 (x )2 + (x )2 ≥ s t 2 2 = (x ) + (x b⊤L† b ) s s − st G st 2 (b⊤L† b ) /2, ≥ st G st where third step follows from x = x b L† b . t s − st⊤ G st Thus we complete the proof of the lower bound. Upper bound. Next, we prove the upper bound. The maximum and minimum coordinates of x are xt and xs respectively. By definition of the pseudoinverse, image(LG† )= image(LG). Therefore, 1 x = 0, x 0, and x 0. x 0 implies that for all i [n], ⊤ s ≤ t ≥ s ≤ ∈ x x + b⊤L† b b⊤L† b . i ≤ s st G st ≤ st G st x 0 implies that for all i [n], t ≥ ∈ x x b⊤L† b b⊤L† b . i ≥ t − st G st ≥− st G st Therefore, x b L† b = Reff (s,t) for all i [n]. Summing across i [n] yields the desired | i| ≤ st⊤ G st G ∈ ∈ upper bound.

Furthermore, the minimum effective resistance of an edge across a cut is related to the maximum weight edge across the cut: Proposition 8.10. In an m-edge graph G with vertex set S,

min ReffG(s,t) min(1/we) m min ReffG(s,t). s S,t/S ≤ e ∂S ≤ s S,t/S ∈ ∈ ∈ ∈ ∈ Proof. Lower bound. The lower bound on mine 1/we follows immediately from the fact that for any edge e = s,t , Reff (s,t) r . { } G ≤ e Upper bound. For the upper bound, recall that

2 ReffG(s,t)= min f /we m ⊤ e f R :B f=bst ∈ e E(G) ∈X 2 min f /we m ⊤ e ≥ f R :B f=bst ∈ e ∂S X∈ 2 min f min 1/we m ⊤ e ≥ f R :B f=bst e ∂S ∈ e ∂S !  ∈  X∈ 2 1 min fe min 1/we m ⊤ ≥ ∂S f R :B f=bst | | e ∂S ∈ e ∂S !  ∈  | | X∈ where the first step follows from definition of effective resistance, the third step follows from taking w out, and the last step follows from Cauchy-Schwarz.

Since s S and t / S, e ∂S fe 1. Therefore, ∈ ∈ ∈ | |≥ P 1 ReffG(s,t) min 1/we ≥ m e ∂S ∈ completing the upper bound.
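The quantities appearing in Propositions 8.9 and 8.10 can be checked numerically on small graphs. The sketch below computes $\mathrm{Reff}_G(s,t) = b_{st}^\top L_G^\dagger b_{st}$ directly from the weighted Laplacian with numpy's pseudoinverse; this is only for illustration and not part of the near-linear-time reduction.

```python
import numpy as np

def effective_resistance(W, s, t):
    """Reff_G(s, t) = b_st^T L^+ b_st for a symmetric nonnegative weight matrix W."""
    L = np.diag(W.sum(axis=1)) - W            # weighted graph Laplacian
    b = np.zeros(len(W))
    b[s], b[t] = 1.0, -1.0
    return b @ np.linalg.pinv(L) @ b

# Tiny check: on a unit-weight triangle, the resistance between adjacent vertices is 2/3
# (a unit edge in parallel with a two-edge path of resistance 2).
W = np.ones((3, 3)) - np.eye(3)
assert abs(effective_resistance(W, 0, 1) - 2 / 3) < 1e-9
```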

Proposition 8.11. Consider an $n$-vertex connected graph $G$ with edge weights $\{w_e\}_{e\in G}$, two matrices $Z,\widetilde{Z}\in\mathbb{R}^{k\times n}$ with rows $\{z_i\}_{i=1}^k$ and $\{\widetilde{z}_i\}_{i=1}^k$ respectively for $k\le n$, and $\varepsilon\in(1/n,1)$. Suppose that both of the following properties hold:

1. $\|z_i - \widetilde{z}_i\|_{L_G} \le 0.01\,n^{-12}\,w_{\min}^2\,w_{\max}^{-2}\,\|z_i\|_{L_G}$ for all $i\in[k]$, where $w_{\min}$ and $w_{\max}$ are the minimum and maximum weights of edges in $G$ respectively;

2. $(1-\varepsilon/10)\cdot\|L_G^\dagger b_{st}\|_2 \le \|Z b_{st}\|_2 \le (1+\varepsilon/10)\cdot\|L_G^\dagger b_{st}\|_2$ for any vertices $s,t$ in $G$.

Then for any vertices $s,t$ in $G$,

(1 ε) L† b Zb (1 + ε) L† b . − · k G stk2 ≤ k stk2 ≤ · k G stk2 Proof. We start by bounding e

2 ((z z )⊤b ) i − i st for each i [k]. Since G is connected, there is a path from s to t consisting of edges e , e ,...,e in ∈ e 1 2 ℓ that order, where ℓ n. By the Cauchy-Schwarz inequality, ≤ ℓ 2 2 ((z z )⊤b ) n ((z z )⊤b ) i − i st ≤ i − i ej Xj=1 e (n/w ) ez z 2 ≤ min · k i − ikLG 23 3 4 2 0.01n− wminwmax− zi LG ≤ e · k k where the last step from property 1 in proposition statement. 2 2 2 By the upper bound on Zb ′ ′ for any vertices s ,t in G, (z b ) (1 + ε) b (L† ) b for all k s t k2 ′ ′ i⊤ e ≤ e⊤ G e edges e in G, so

23 3 4 2 23 3 4 2 0.01n− w w− z = 0.01n− w w− w ((z )⊤b ) min max · k ikLG min max · e i e e E(G) ∈X 23 3 3 2 0.01n− w w− ((z )⊤b ) ≤ min max · i e e E(G) ∈X 23 3 3 2 2 0.01n− w w− (1 + ε) b⊤(L† ) b ≤ min max · e G e e E(G) ∈X 21 3 3 2 2 0.01n− wminwmax− (1 + ε) max (be⊤(LG† ) be) ≤ e E(G) ∈ 21 3 3 2 0.04n− wminwmax− max (be⊤(LG† ) be). ≤ e E(G) ∈ 2 where the first step follows from zi = z⊤webeb⊤zi, the second step follows from wmax = k kLG e E i e ∈ 2 2 maxe G we, and the third step follows from (zi⊤be) (1 + ε) be⊤(L† )be, the forth step follows from ∈ P ≤ G summation has at most n2 terms, the last step follows from (1 + ε)2 4, ε (0, 1). ≤ ∀ ∈ L† b is a vector that is maximized and minimized at the endpoints of e. Furthermore, 1 G e ∈ kernel(LG† ) by definition of the pseudoinverse. Thus, we know LG† be has both positive and negative coordinates and that LG† be maxi=j (LG† be)i (LG† be)j be⊤LG† be . k k∞ ≤ 6 | − |≤

73 We have

21 3 3 2 0.04n− wminwmax− max be⊤(LG† ) be · e E(G) ∈ 20 3 3 2 0.04n− wminwmax− max (be⊤LG† be) ≤ · e E(G) ∈ 20 3 0.4n− w w− ≤ min max 16 1 2 0.4n− w w− (b⊤L† b ) ≤ min max · st G st 16 1 2 0.8n− w w− L† b ≤ min max · k G stk2 2 2 2 where the first step follows from maxe E(G) be⊤(LG† ) be = maxe E LG† be 2 n maxe E LG† be ∈ ∈ ∈ 2 k 2 k ≤ 2 k k∞ ≤ n maxe E(G)(be⊤LG† be) , the second step follows from maxe(be⊤LG† be) wmin− (the lower bound ∈ 2 4 ≤ 2 in Proposition 8.10), and the third step follows from wmax− n (bstLG† bst) (the upper bound in 2 ≤ 2 Proposition 8.10), and the last step follows from (b L† b ) 2 L† b (Since L† b is maximized st⊤ G st ≤ k G stk2 G st and minimized at t and s respectively). Combining these inequalities shows that

2 16 1 2 ((z z )⊤b ) 0.8n− w w− L† b i − i st ≤ min max · k G stk2 Summing over all i and using the fact that k n, ε> 1/n, and w w shows that e ≤ min ≤ max k 2 2 (Z Z)b = ((z z )⊤b ) k − stk2 i − i st Xi=1 16 1 2 e 0.8kn− we w− L† b ≤ min max · k G stk2 2 13 1 2 0.8ε n− w w− L† b ≤ min max · k G stk2 2 13 2 0.8ε n− L† b ≤ · k G stk2 Combining this with the given upper and lower bounds on Zb using the triangle inequality k stk2 yields the desired result.

Proof of Lemma 8.8. Consider the algorithm BichromaticNearestNeighborProj (Algorithm 10), given below.

First, we bound the runtime of BichromaticNearestNeighborProj (Algorithm 10). Computing the sets $C$, $D$ and the matrix $P$ trivially takes $O(n)$ time. Computing the matrix $\widetilde{Z}$ takes $O(\log n)\cdot\mathcal{T}(n,L,g(|S|^{O(1)}),d)$ time, since the point set $C\cup D$ is $|S|^{O(1)}$-boxed. Computing $\widehat{C}$ and $\widehat{D}$ takes $O(n)$ time, as computing $\widetilde{Z}\,b_i$ takes $O(\log n)$ time for each $i\in C\cup D$ since $b_i$ is supported on just one vertex. Computing $\widehat{t}$ takes $n^{1+o(1)}$ time by Theorem 3.17. In particular, one computes $\widehat{t}$ by preprocessing an $O(\log n)$-approximate nearest neighbors data structure on $\widehat{D}$ (which takes $n^{1+o(1)}$ time), querying the data structure on all points in $\widehat{C}$ (which takes $n\cdot n^{o(1)} = n^{1+o(1)}$ time), and returning the minimum of all of the queries. The subsequent if statement takes constant time. Therefore, the reduction takes

$$O\big((\log n)\cdot\mathcal{T}(n,L,g(|S|^{O(1)}),d) + n^{1+o(1)}\big)$$

time overall, as desired.

74 Algorithm 10 1: procedure BichromaticNearestNeighborProj(A,B,k) ⊲ Lemma 8.8 2: Given: A, B Rd with the property that A B is ρ-spaced with S O(1)-bounded distance ∈ ∪ | | set , where ρ = 1 + 16(log(10n))/L, and k S ∈ 3: Returns: whether there are a A, b B for which a b k ∈ ∈ k − k2 ≤ 4: C a √x /k a A ← { · 0 | ∀ ∈ } 5: D b √x /k b B ← { · 0 | ∀ ∈ } 6: ℓ 200 log n ← 7: Construct matrix P Rℓ d, where each entry is 1/√ℓ with prob 1/2 and 1/√ℓ with prob ∈ × − 1/2 ℓ d ℓ n 8: Zi, (C D, Pi, ) for each i [ℓ] ⊲ Pi, , Zi, denotes row i of P R × , Z R × ∗ ←A ∪ ∗ ∈ ∗ ∗ ∈ ∈ 9: C Z b c C ⊲ b Rn denotes the indicator vector of the vertex c, i.e. (b ) = 1 ← { · c | ∀ ∈ } c ∈ c c ande(b ) = 0 for all i = c e e c i 6 10: Db Ze b c D ← { · c | ∀ ∈ } 11: t (log n)-approximation to closest C-D ℓ -distance using Theorem 3.17 ← 2 12: ifb t 3e√n(log n)/f(x ) then ≤ 0 13: return true b b 14: else 15: return false 16: end if 17: end procedure
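A toy version of the embedding step (lines 6-10 of Algorithm 10), given as a sketch only: the exact pseudoinverse below stands in for the approximate Laplacian solver $\mathcal{A}$ assumed by Lemma 8.8, so this illustrates the random projection $Z = P\cdot L^\dagger$ but not the fast reduction itself.

```python
import numpy as np

def resistance_embedding(W, ell=None, seed=0):
    """Return one embedding row per vertex so that pairwise l2 distances between rows
    approximate ||L^+ b_st||_2 (up to the JL distortion)."""
    n = len(W)
    ell = ell or max(1, int(200 * np.log(n)))                    # projection dimension
    rng = np.random.default_rng(seed)
    P = rng.choice([-1.0, 1.0], size=(ell, n)) / np.sqrt(ell)    # random sign matrix
    L_pinv = np.linalg.pinv(np.diag(W.sum(axis=1)) - W)          # stand-in for the solver A
    Z = P @ L_pinv                                               # Z b_c is simply column c of Z
    return Z.T                                                   # row c = embedding of vertex c

# Algorithm 10 feeds these embedded points to an O(log n)-approximate
# nearest neighbor data structure and thresholds the returned distance.
```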

Next, suppose that there exists a A and b B for which a b k. We show that the ∈ ∈ k − k2 ≤ reduction returns true with probability at least 1 1/n. Let G denote the f-graph on C D. By − ∪ definition of C and D and the fact that f is decreasing, there exists a pair of points in C and D with edge weight at least f(x0). By the lower bound on Proposition 8.10, p C,q D s.t. Reff (p,q) 1/f(x ). ∃ ∈ ∈ G ≤ 0 By the upper bound of Proposition 8.9,

2 2 2 b⊤ (L† ) b n (Reff (p,q)) n/f(x ) . pq G pq ≤ · G ≤ 0 ℓ n Let Z R be defined as Z = P L† . By Theorem 3.14 with ε = 1/2 applied to the collection ∈ × · G of vectors L† bs s C D and projection matrix P , { G } ∈ ∪ 1 3 L† b Zb L† b 2k G stk2 ≤ k stk2 ≤ 2k G stk2 for all pairs s,t C D with high probability. Therefore, the second input guarantee of Proposition ∈ ∪ 8.11 is satisfied with high probability. Furthermore, for each i [ℓ], z , the ith row of Z, satisfies ∈ i the first input guarantee by the output error guarantee of the algorithm . Therefore, Proposition A 8.11 applies and shows that e e 9 Z b L† b 3√n/f(x ). k · pqk2 ≤ 4k G · pqk2 ≤ 0 This means that there exists ofe vectors a C, b D with ∈ ∈ a b 3√n/f(x ). k − kb2 ≤ b 0

75 By the approximation guarantee of the nearest neighbors data structure, t (3√n/f(x )) log n and ≤ 0 the reduction returns true with probability at least 1 1/n, as desired. − Next, suppose that there do not exist a A and b B for which a b k. We show that ∈ ∈ k − k2 ≤ the reduction returns false with probability at least 1 1/n. Since k S and A B is ρ-spaced, − ∈ ∪ a b ρ k for all a A and b B. Therefore, all edges between C and D in the f-graph G k − k2 ≥ · ∈ ∈ for C D have weight at most f(ρx ) f(x0) since f is not (ρ, L)-multiplicatively Lipschitz. ∪ 0 ≤ 100n16 By the upper bound of Proposition 8.10,

14 Reff 1 100n G(s,t) 2 ≥ n f(ρx0) ≥ f(x0) for any pair of vertices s C,t D. ∈ ∈ By the lower bound of Proposition 8.9,

14 2 Reff 2 50n 2 bst⊤(LG† ) bst ( G(s,t)) /2 ( ) ≥ ≥ f(x0) for all s C,t D. ∈ ∈ Recall from the discussion of the a b k case that Theorem 3.14 and Proposition 8.11 apply. k − k≤ Therefore, with probability at least 1 1/n, by the lower bound of Proposition 8.11, − 1 12n14 Zbst 2 LG† bst 2 k k ≥ 4k k ≥ f(xf ) for any s C,t D. e ∈ ∈ Therefore, by the approximate nearest neighbors guarantee, 12n14 3√n(log n) t > , ≥ (log n) f(x ) f(x ) · 0 0 so the algorithm returns false with probability at least 1 1/n, as desired. − We now prove the theorems:

1/(32c0) Proof of Theorem 8.6. Consider an instance of bichromatic ℓ2-closest pair for n = L and log∗ n d = c , where c is the constant given in the dimension bound of Theorem 3.25 and c0 is such that integers have bit length c log n in Theorem 3.25. This consists of two sets of points A, B Rd 0 ⊆ with A B = n for which we wish to compute mina A,b B a b 2. By Theorem 3.25, the | ∪ | ∈ ∈ k − k coordinates of points in A are also c0 log n bit integers. Therefore, the set S of possible ℓ2 distances between points in A and B is a set of square roots of integers with log d + c log n 2c log n bits. 0 ≤ 0 Therefore, S is a ρ-discrete (recall ρ = 1 + (16 log(10n))/L) since 1 + 1/n2c0 > 1 + (16 log n)/L. Furthermore, note that f is not (ρ, L)-multiplicatively Lipschitz by assumption. We now describe an algorithm for solving ℓ -closest pair on A B. Use binary search on the 2 × values in S to compute the minimum distance between points in A, B. S consists of integers between 1 and nc0 (pairs with distance 0 can be found in linear time), so γ nc0 . For each query point ≤ k S, by Lemma 8.8, there is a ∈ ∗ O((log n) (n,L,g(γ), d)+ n)= O((log n) (L1/(32c0),L,L1/32, clog L)+ L1/(32c0)) ·T ·T -time algorithm for determining whether or not the closest pair has distance at most k. Therefore, there is a

∗ O((log n) (L1/(32c0),L,L1/32, clog L)+ L1/(32c0))= O(L1/(32c0)L1/(64c0) log(L1/32)) < O(n3/2) T e e 76 time algorithm for solving ℓ2-closest pair on pairs of sets with n points. But this is impossible given SETH by Theorem 3.25, a contradiction. This completes the result.

Proof of Theorem 8.7. Consider an instance of bichromatic Hamming nearest neighbor search for L0.49 n = 2 and d = c1 log n for the constant c1 in the dimension bound in Theorem 3.21. This consists d of two sets of points A, B R with A B = n for which we wish to compute mina A,b B a b 2. ⊆ | ∪ | ∈ ∈ k − k The coordinates of points in A and B are 0-1. Therefore, the set S of possible ℓ2 distances between points in A and B is the set of square roots of integers between 0 and c1 log n, which differ by a factor of at least 1 + 1/(16c log n) > ρ (recall ρ = 1 + (16 log(10n))/L). Therefore, A B is 1 ∪ ρ-spaced. Note that f is not (ρ, L)-multiplicatively Lipschitz by assumption. Furthermore, A B ∪ is √c log n L.25-boxed. 1 ≤ We now give an algorithm for solving ℓ -closest pair on A B. Use binary search on S. For 2 × each query k S, Lemma 8.8 implies that one can check if there is a pair with distance at most k ∈ in

.49 .25 .49 .49 (n,L,d)= O( (2L ,L,eL , 2c1L ) + 2L ) T T .49 .48 O(2L +L ) ≤ = n1+o(1) time on pairs of sets with n points. But this is impossible given SETH by Theorem 3.21. This completes the result.

9 Fast Multipole Method

The fast multipole method (FMM) was described as one of the top-10 most important algorithms of the 20th century [DS00]. It is a numerical technique that was developed to speed up calculations of long-range forces in the $n$-body problem in physics. FMM was first introduced in 1987 by Greengard and Rokhlin [GR87], based on the multipole expansion of the vector Helmholtz equation. By treating the interactions between far-away basis functions using the FMM, the corresponding matrix elements do not need to be explicitly computed or stored. This technique allows us to improve the naive $O(n^2)$ matrix-vector multiplication time to $o(n^2)$. Since Greengard and Rokhlin invented FMM, the topic has attracted researchers from many different fields, including physics, math, and computer science [GR87, Gre88, GR88, GR89, Gre90, GS91, EMRV92, Gre94, GR96, BG97, Dar00, YDGD03, YDD04, Mar12].

We first give a quick overview of the high-level ideas of FMM in Section 9.1. In Section 9.2, we provide a complete description and proof of correctness for the fast Gaussian transform, where the kernel function is the Gaussian kernel. Although a number of researchers have used FMM in the past, most of the previous papers about FMM focus on either the low-dimensional or the low-error case. We therefore focus on the superconstant-error, high-dimensional case, and carefully analyze the joint dependence of the running time on $\varepsilon$ and $d$. We believe that our presentation of the original proof in Section 9.2 is thus of independent interest to the community. In Section 9.4, we give analogous results for other kernel functions used in this paper.

9.1 Overview

We begin with a description of the high-level ideas of the Fast Multipole Method (FMM). Let $K:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$ denote a kernel function. The inputs to the FMM are $N$ sources $s_1,s_2,\cdots,s_N\in\mathbb{R}^d$ and $M$ targets $t_1,t_2,\cdots,t_M\in\mathbb{R}^d$. For each $i\in[N]$, the source $s_i$ has a strength $q_i$. Suppose all the sources are in a 'box' $\mathcal{B}$ and all the targets are in a 'box' $\mathcal{C}$. The goal is to evaluate

$$u_j = \sum_{i=1}^{N} K(s_i,t_j)\,q_i, \quad \forall j\in[M].$$

Intuitively, if $K$ has some nice property (e.g. smoothness), we can hope to approximate $K$ in the following sense:

$$K(s,t) \approx \sum_{p=0}^{P-1} B_p(s)\,C_p(t), \quad s\in\mathcal{B},\ t\in\mathcal{C},$$

where $P$ is a small positive integer, usually called the interaction rank in the literature. Now, we can construct an approximation $\widetilde{u}_j$ to $u_j$ in two steps:

$$v_p = \sum_{i\in\mathcal{B}} B_p(s_i)\,q_i, \quad p = 0,1,\cdots,P-1,$$

and

$$\widetilde{u}_j = \sum_{p=0}^{P-1} C_p(t_j)\,v_p, \quad \forall j\in[M].$$

Intuitively, as long as $\mathcal{B}$ and $\mathcal{C}$ are well-separated, $\widetilde{u}_j$ is a very good estimate of $u_j$ even for small $P$, i.e., $|u_j - \widetilde{u}_j| < \varepsilon$.

Recall that, at the beginning of this section, we assumed that all the sources are in the same box $\mathcal{B}$ and all the targets are in the same box $\mathcal{C}$. This is not true in general. To deal with this, we can discretize the continuous space into a batch of boxes $\mathcal{B}_1,\mathcal{B}_2,\cdots$ and $\mathcal{C}_1,\mathcal{C}_2,\cdots$. For a box $\mathcal{B}_{l_1}$ and a box $\mathcal{C}_{l_2}$, if they are very far apart, then the interaction between points within them is small, and we can ignore it. If the two boxes are close, then we deal with them efficiently by truncating the high order expansion terms in $K$ (only keeping the first $\log^{O(d)}(1/\varepsilon)$). For each box, we will see that the number of nearby relevant boxes is at most $\log^{O(d)}(1/\varepsilon)$.
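The two-step accumulation above can be written down directly once a separated expansion $K(s,t)\approx\sum_p B_p(s)C_p(t)$ is fixed. The Python sketch below takes the truncated expansion as a pair of generic callables `B_funcs`, `C_funcs` (hypothetical placeholders, since the concrete expansion depends on the kernel) and shows the $O(NP + MP)$ work that replaces the naive $O(NM)$ sum for a single well-separated pair of boxes.

```python
import numpy as np

def far_field_apply(sources, strengths, targets, B_funcs, C_funcs):
    """Approximate u_j = sum_i K(s_i, t_j) q_i via u_j ~= sum_p C_p(t_j) v_p."""
    # Step 1: accumulate source moments v_p = sum_i B_p(s_i) q_i   (O(N * P) work)
    v = np.array([sum(Bp(s) * q for s, q in zip(sources, strengths)) for Bp in B_funcs])
    # Step 2: evaluate the short expansion at every target          (O(M * P) work)
    return np.array([sum(Cp(t) * v[p] for p, Cp in enumerate(C_funcs)) for t in targets])
```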

9.2 $K(x,y) = \exp(-\|x-y\|_2^2)$, Fast Gaussian transform

Given $N$ vectors $s_1,\cdots,s_N\in\mathbb{R}^d$, $M$ vectors $t_1,\cdots,t_M\in\mathbb{R}^d$ and a strength vector $q\in\mathbb{R}^N$, Greengard and Strain [GS91] provided a fast algorithm for evaluating the discrete Gauss transform

$$G(t_i) = \sum_{j=1}^{N} q_j\,e^{-\|t_i-s_j\|^2/\delta}$$

for $i\in[M]$ in $O(M+N)$ time. In this section, we re-prove the algorithm described in [GS91], and determine the exact dependence of the running time on $\varepsilon$ and $d$. By shifting the origin and rescaling $\delta$, we can assume that the sources $s_j$ and targets $t_i$ all lie in the unit box $\mathcal{B}_0 = [0,1]^d$.

Let $t$ and $s$ lie in $d$-dimensional Euclidean space $\mathbb{R}^d$, and consider the Gaussian

$$e^{-\|t-s\|_2^2} = e^{-\sum_{i=1}^d (t_i-s_i)^2}.$$

We begin with some definitions.

Definition 9.1 (one-dimensional Hermite polynomial). The Hermite polynomial $\widetilde{h}_n:\mathbb{R}\to\mathbb{R}$ is defined as follows:

$$\widetilde{h}_n(t) = (-1)^n\,e^{t^2}\,\frac{d^n}{dt^n}\,e^{-t^2}.$$

Definition 9.2 (one-dimensional Hermite function). The Hermite function $h_n:\mathbb{R}\to\mathbb{R}$ is defined as follows:

$$h_n(t) = e^{-t^2}\,\widetilde{h}_n(t).$$
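The polynomials $\widetilde{h}_n$ defined above are the physicists' Hermite polynomials, so they can be generated by the standard three-term recurrence $\widetilde{h}_{n+1}(t) = 2t\,\widetilde{h}_n(t) - 2n\,\widetilde{h}_{n-1}(t)$ (a known fact, not proved in this paper); multiplying by $e^{-t^2}$ gives the Hermite functions $h_n$. A short numerical sketch:

```python
import numpy as np

def hermite_polynomials(t, P):
    """Return [h~_0(t), ..., h~_{P-1}(t)] via h~_{n+1} = 2t h~_n - 2n h~_{n-1},
    with h~_0 = 1 and h~_1 = 2t."""
    vals = [np.ones_like(t), 2 * t]
    for n in range(1, P - 1):
        vals.append(2 * t * vals[n] - 2 * n * vals[n - 1])
    return vals[:P]

def hermite_functions(t, P):
    """h_n(t) = e^{-t^2} h~_n(t), as in Definition 9.2."""
    return [np.exp(-t * t) * h for h in hermite_polynomials(t, P)]
```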

We use the following fact to simplify $e^{-(t-s)^2/\delta}$.

Fact 9.3. For $s_0\in\mathbb{R}$ and $\delta>0$, we have

$$e^{-(t-s)^2/\delta} = \sum_{n=0}^{\infty} \frac{1}{n!}\left(\frac{s-s_0}{\sqrt{\delta}}\right)^n h_n\!\left(\frac{t-s_0}{\sqrt{\delta}}\right)$$

and

$$e^{-(t-s)^2/\delta} = e^{-(t-s_0)^2/\delta}\sum_{n=0}^{\infty} \frac{1}{n!}\left(\frac{s-s_0}{\sqrt{\delta}}\right)^n \widetilde{h}_n\!\left(\frac{t-s_0}{\sqrt{\delta}}\right).$$

Proof.

$$e^{-(t-s)^2/\delta} = e^{-((t-s_0)-(s-s_0))^2/\delta} = \sum_{n=0}^{\infty} \frac{1}{n!}\left(\frac{s-s_0}{\sqrt{\delta}}\right)^n h_n\!\left(\frac{t-s_0}{\sqrt{\delta}}\right) = e^{-(t-s_0)^2/\delta}\sum_{n=0}^{\infty} \frac{1}{n!}\left(\frac{s-s_0}{\sqrt{\delta}}\right)^n \widetilde{h}_n\!\left(\frac{t-s_0}{\sqrt{\delta}}\right).$$

Using Cramer’s inequality, we have the following standard bound. Lemma 9.4. For any constant K 1.09, we have ≤ 2 h (t) K 2n/2 √n! et /2 | n |≤ · · · and e n/2 t2/2 h (t) K 2 √n! e− . | n |≤ · · · Next, we will extend the above definitions and observations to the high dimensional case. To simplify the discussion, we define multi-index notation. A multi-index α = (α , α , , α ) is a 1 2 · · · d d-tuple of nonnegative integers, playing the role of a multi-dimensional index. For any multi-index α Rd and any t Rt, we write ∈ ∈ d d α!= (α !), tα = tαi , Dα = ∂α1 ∂α2 ∂αd . i i 1 2 · · · d Yi=1 Yi=1 d where ∂i is the differentiatial operator with respect to the i-th coordinate in R . For integer p, we say α p if α p, i [d]. ≥ i ≥ ∀ ∈ We can now define:

Definition 9.5 (multi-dimensional Hermite polynomial). We define the function $\widetilde{H}_\alpha:\mathbb{R}^d\to\mathbb{R}$ as follows:

$$\widetilde{H}_\alpha(t) = \prod_{i=1}^d \widetilde{h}_{\alpha_i}(t_i).$$

Definition 9.6 (multi-dimensional Hermite function). We define the function $H_\alpha:\mathbb{R}^d\to\mathbb{R}$ as follows:

$$H_\alpha(t) = \prod_{i=1}^d h_{\alpha_i}(t_i).$$

It is easy to see that $H_\alpha(t) = e^{-\|t\|_2^2}\cdot\widetilde{H}_\alpha(t)$.

The Hermite expansion of a Gaussian in $\mathbb{R}^d$ is

$$e^{-\|t-s\|_2^2} = \sum_{\alpha\ge 0} \frac{(t-s_0)^\alpha}{\alpha!}\,H_\alpha(s-s_0). \qquad (4)$$

Cramer's inequality generalizes to

Lemma 9.7 (Cramer's inequality). Let $K < (1.09)^d$. Then

$$|\widetilde{H}_\alpha(t)| \le K\cdot e^{\|t\|_2^2/2}\cdot 2^{\|\alpha\|_1/2}\cdot\sqrt{\alpha!}$$

and

$$|H_\alpha(t)| \le K\cdot e^{-\|t\|_2^2/2}\cdot 2^{\|\alpha\|_1/2}\cdot\sqrt{\alpha!}.$$

The Taylor series of $H_\alpha$ is

$$H_\alpha(t) = \sum_{\beta\ge 0} \frac{(t-t_0)^\beta}{\beta!}\,(-1)^{\|\beta\|_1}\,H_{\alpha+\beta}(t_0). \qquad (5)$$

9.2.1 Estimation

We first give a definition.

Definition 9.8. Let $\mathcal{B}$ denote a box with center $s_B$ and side length $r\sqrt{2\delta}$ with $r<1$. If source $s_j$ is in box $\mathcal{B}$, we say $j\in\mathcal{B}$. The Gaussian evaluation from the sources in box $\mathcal{B}$ is

$$G(t) = \sum_{j\in\mathcal{B}} q_j\cdot e^{-\|t-s_j\|_2^2/\delta}.$$

The Hermite expansion of $G(t)$ is

$$G(t) = \sum_{\alpha\ge 0} A_\alpha\cdot H_\alpha\!\left(\frac{t-s_B}{\sqrt{\delta}}\right), \qquad (6)$$

where the coefficients $A_\alpha$ are defined by

$$A_\alpha = \frac{1}{\alpha!}\sum_{j\in\mathcal{B}} q_j\left(\frac{s_j-s_B}{\sqrt{\delta}}\right)^\alpha. \qquad (7)$$

The rest of this section presents a batch of lemmas that bound the error of the function truncated at a certain degree of its Taylor and Hermite expansions.

Lemma 9.9. Let p denote an integer, let ErrH (p) denote the error after truncating the series G(t) (as defined in Def. 9.8) after pd terms, i.e., t s ErrH (p)= Aα Hα − B . · √δ α p   X≥ Then we have 1 d/2 rp+1 d Err (p) K q , | H |≤ · | j| · p! · 1 r j     X∈B − where K = (1.09)d. Proof. Using Eq. (4) to expand each Gaussian (see Definition 9.8) in the

t s 2/δ G(t)= q e−k − j k2 j · j X∈B into a Hermite series about s B α 1 sj s t s qj − B Hα − B α! · √δ  √δ α 0 j     X≥ X∈B   and swap the summation over α and j to obtain α 1 sj s t s qj − B Hα − B α! · √δ · √δ j α 0     X∈B X≥ The truncation error bound follows from Cramer’s inequality (Lemma 9.7) and the formula for the tail of a geometric series.

The next Lemma shows how to convert a Hermite expansion about s into a Taylor expansion B about t . The Taylor series converges rapidly in a box of side length r√2δ about t , where r< 1. C C Lemma 9.10. The Hermite expansion of G(t) is t s G(t)= Aα Hα − B · √δ α 0   X≥ has the following Taylor expansion, at an arbitrary point t0 :

β t t0 G(t)= Bβ − . (8) √δ β 0   X≥ where the coefficients Bβ are defined as

β ( 1)| | s t0 Bβ = − Aα Hα+β B − . (9) β! · √δ α 0   X≥

81 Let Err (p) denote the error by truncating the Taylor series after pd terms, in the box with center T C t and side length r√2δ, i.e., C t t β ErrT (p)= Bβ − C √δ β p   X≥ Then 1 d/2 rp+1 d Err (p) K Q . | T |≤ · B · p! 1 r    −  Proof. Each Hermite function in Eq. (6) can be expanded into a Taylor series by means of Eq. (5). The expansion in Eq. (8) is obtained by swapping the order of summation. The truncation error bound can be proved as follows. Using Eq. (7) for Aα, we can rewrite Bβ:

β ( 1)| | s t Bβ = − AαHα+β B − C β! √δ α 0   X≥ β α ( 1)| | 1 sj s s t = − qj − B Hα+β B − C β! α! √δ  √δ α 0 j     X≥ X∈B β  α  ( 1)| | 1 sj s s t = − qj − B Hα+β B − C β! α! √δ · √δ j α 0     X∈B X≥

By Eq. (5), the inner sum is the Taylor expansion of Hβ((sj t )/√δ). Thus − C β 1 ( 1)k k sj t Bβ = − qj Hβ − C β! · √δ j   X∈B and Cramer’s inequality implies

β 1/2 1 β 1/2 2k k B K QB2k k β! KQB | β|≤ β! · ≤ √β! p The truncation error follows from summation the tail of a geometric series.

For the purpose of designing our algorithm, we’d like to make a variant of Lemma 9.10 in which the Hermite series is truncated before converting it to a Taylor series. This means that in addition to truncating the Taylor series itself, we are also truncating the finite sum formula in Eq. (9) for the coefficients.

Lemma 9.11. Let G(t) be defined as Def 9.8. For an integer p, let Gp(t) denote the Hermite expansion of G(t) truncated at p, t s Gp(t)= AαHα − B . √δ α p   X≤ The function Gp(t) has the following Taylor expansion about an arbitrary point t0:

β t t0 Gp(t)= Cβ − , · √δ β 0   X≥

82 where the the coefficients Cβ are defined as

β 1 ( 1)k k s t Cβ = − Aα Hα+β B − C . (10) β! · √δ α p   X≤ Let Err (p) denote the error in truncating the Taylor series after pd terms, in the box with center T C t and side length r√2δ, i.e., C t t β ErrT (p)= Cβ − C . √δ β p   X≥ Then, we have

1 d/2 rp+1 d Err (p) K′ Q | T |≤ · B p! 1 r    −  where K 2K and r 1/2. ′ ≤ ≤ Proof. We can write Cβ in the following way:

β 1 α ( 1)k k 1 sj s s t Cβ = − qj − B Hα+β B − C β! α! √δ · √δ j α p     X∈B X≤ β 1 α ( 1)k k 1 sj s s t = − qj − B Hα+β B − C β!  −  α! √δ · √δ j α 0 α>p     X∈B X≥ X β 1   α ( 1)k k 1 sj s s t = Bβ − qj − B Hα+β B − C − β! α! √δ · √δ j α>p     X∈B X = B + (C B ) β β − β Next, we have

t t β t t β ErrT (p) Bβ − C + (Cβ Bβ) − C (11) | |≤ √δ − · √δ β p   β p   X≥ X≥

Using Lemma 9.10, we can upper bound the first term in the Eq. (11) by, 1 d/2 rp+1 d K Q · B p! · 1 r    −  To bound the second term in Eq. (11), we can do the following

t t β (Cβ Bβ) − C − · √δ β p   X≥ β α t t 1 1 sj s s t Q B − C − B Hα+β B − C ≤ · √δ · β! α! √δ · √δ β p α>p   X≥   X  

r α 1 (α + β)! r β 1 KQ k k k k ≤ B √ · α!β! · √ α>p α! s β! X β>pX

83 Finally, the proof is complete since we know that

(α + β)! α+β 1 2k k . α!β! ≤

The proof of the following Lemma is almost identical. We omit the details here.

Lemma 9.12. Let Gsj (t) be defined as

t s 2/δ G (t)= q e−k − j k2 sj j · has the following Taylor expansion at t C t t β Gs (t)= β − C , j B √δ β 0   X≥ where the coefficients Bβ is defined as

β 1 ( 1)k k sj t B = q − H − C β j · β! · β √  δ  and the error in truncation after pd terms is

t t β 1 d/2 rp+1 d ErrT (p) = Bβ − C K qj | | √δ ≤ · · p! · 1 r β p      −  X≥ for r< 1.

9.2.2 Algorithm

The algorithm is based on subdividing $\mathcal{B}_0$ into smaller boxes with sides of length $r\sqrt{2\delta}$ parallel to the axes, for a fixed $r\le 1/2$. We can then assign each source $s_j$ to the box $\mathcal{B}$ in which it lies and each target $t_i$ to the box $\mathcal{C}$ in which it lies. (A sketch of this bookkeeping appears after the list below.)

For each target box $\mathcal{C}$, we need to evaluate the total field due to sources in all boxes. Since the boxes $\mathcal{B}$ have side length $r\sqrt{2\delta}$, only a fixed number of source boxes $\mathcal{B}$ can contribute more than $Q\varepsilon$ to the field in a given target box $\mathcal{C}$, where $Q = \|q\|_1$ and $\varepsilon$ is the precision parameter. If we cut off the sum over all $\mathcal{B}$ after including the $(2k+1)^d$ nearest boxes to $\mathcal{C}$, we incur an error which can be upper bounded as follows:

$$\sum_{j:\,\|t-s_j\|_\infty\ge kr\sqrt{2\delta}} |q_j|\cdot e^{-\|t-s_j\|_2^2/\delta} \le \sum_{j:\,\|t-s_j\|_\infty\ge kr\sqrt{2\delta}} |q_j|\cdot e^{-\|t-s_j\|_\infty^2/\delta} \le \sum_{j:\,\|t-s_j\|_\infty\ge kr\sqrt{2\delta}} |q_j|\cdot e^{-(kr\sqrt{2\delta})^2/\delta} \le Q\cdot e^{-2r^2k^2}, \qquad (12)$$

where the first step follows from $\|\cdot\|_2\ge\|\cdot\|_\infty$, the second step follows from $\|t-s_j\|_\infty\ge kr\sqrt{2\delta}$, and the last step follows from a straightforward calculation.

For a box $\mathcal{B}$ and a box $\mathcal{C}$, there are several possible ways to evaluate the interaction between $\mathcal{B}$ and $\mathcal{C}$. We mainly need the following three techniques:

1. $N_{\mathcal{B}}$ Gaussians, accumulated in a Taylor series via the definition of $B_\beta$ in Lemma 9.12.
2. Hermite series, directly evaluated.
3. Hermite series, accumulated in a Taylor series via Lemma 9.11.

Essentially, having any two of the above three techniques is sufficient to give an algorithm that runs in $(M+N)\log^{O(d)}(\|q\|_1/\varepsilon)$ time. In the next few paragraphs, we explain the details of the three techniques.
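To make the interaction-list bookkeeping from the start of this subsection concrete, the following sketch (illustrative names, not from the paper) assigns points in $[0,1]^d$ to a grid of boxes of side $r\sqrt{2\delta}$ and enumerates, for a target box, the $(2k+1)^d$ nearby source boxes that survive the cutoff; all farther boxes are discarded at the cost of the $Q\,e^{-2r^2k^2}$ error of Eq. (12).

```python
import numpy as np
from itertools import product
from collections import defaultdict

def assign_to_boxes(points, side):
    """Map each point of [0,1]^d (rows of `points`) to the integer index of its grid box."""
    boxes = defaultdict(list)
    for i, p in enumerate(points):
        boxes[tuple((np.asarray(p) // side).astype(int))].append(i)
    return boxes

def nearby_boxes(box, k):
    """The (2k+1)^d box indices within k grid steps of `box` in the sup-norm."""
    d = len(box)
    return [tuple(int(b + o) for b, o in zip(box, off))
            for off in product(range(-k, k + 1), repeat=d)]
```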

Technique 1. Consider a fixed source box . For each target box within range, we must B C compute pd Taylor series coefficients β ( 1)| | sj tC Cβ( )= − Hβ − . B β! √δ j   X∈B Each coefficient requires O(N ) work to evaluate, resulting in a net cost O(pdN ). Since there are at B B most (2k+1)d boxes within range, the total work for forming all the Taylor series is O((2k+1)dpdN). d Now, for each target ti, one must evaluate the p -term Taylor series corresponding to the box in which ti lies. The total running time of algorithm is thus O((2k + 1)dp2N)+ O(pdM).

Technique 2. We form a Hermite series for each box and evaluate it at all targets. Using B Lemma 9.9, we can rewrite G(t) as

t s 2/δ G(t)= q e−k − j k2 j · j XB X∈B t s = Aα(B)Hα − B + ErrH (p) √δ α 0   XB X≥ where Err (p) ε and | H |≤ α 1 sj s Aα( )= qj − B . (13) B α! · √δ j   X∈B d To compute each Aα( ) costs O(N ) time, so forming all the Hermite expansions takes O(p N) B d B d d time. Evaluating at most (2k +1) expansions at each target ti costs O((2k +1) p ) time per target, so this approach takes O(pdN)+ O((2k + 1)dpdM) time in total.

Technique 3. Let N(B) denote the number of boxes. Note that N(B) min((r√2δ) d/2, M). ≤ − Suppose we accumulate all sources into truncated Hermite expansions and transform all Hermite expansions into Taylor expansions via Lemma 9.11. Then we can approximate the function G(t) by

t s 2/δ G(t)= q e−k − j k2 j · j XB X∈B t t β = Cβ − C + ErrT (p)+ErrH (p) √δ β p   X≤

85 where Err (p) + Err (p) Q ε, | H | | T |≤ ·

β 1 ( 1)k k s t Cβ = − Aα( )Hα+β B − C β! B √δ α p   XB X≤ and the coefficients A ( ) are defined as Eq. (13). Recall in Part 2, it takes O(pdN) time to α B compute all the Hermite expansions, i.e., to compute the coefficients A ( ) for all α p and all α B ≤ sources boxes . B Making use of the large product in the definition of Hα+β, we see that the time to compute the pd coefficients of C is only O(dpd+1) for each box in the range. Thus, we know for each target β B box , the running time is C O((2k + 1)ddpd+1).

Finally we need to evaluate the appropriate Taylor series for each target ti, which can be done in O(pdM) time. Putting it all together, this technique 3 takes time

O((2k + 1)ddpd+1N(B)) + O(pdN)+ O(pdM).

9.2.3 Result

Finally, in order to get $\varepsilon$ additive error for each coordinate, we choose $k = O(\log(\|q\|_1/\varepsilon))$ and $p = O(\log(\|q\|_1/\varepsilon))$.

Theorem 9.13 (fast Gaussian transform). Given $N$ vectors $s_1,\cdots,s_N\in\mathbb{R}^d$, $M$ vectors $t_1,t_2,\cdots,t_M\in\mathbb{R}^d$, a number $\delta>0$, and a vector $q\in\mathbb{R}^N$, let the function $G:\mathbb{R}^d\to\mathbb{R}$ be defined as $G(t) = \sum_{i=1}^N q_i\cdot e^{-\|t-s_i\|_2^2/\delta}$. There is an algorithm that runs in

$$O\big((M+N)\log^{O(d)}(\|q\|_1/\varepsilon)\big)$$

time, and outputs $M$ numbers $x_1,\cdots,x_M$ such that for each $j\in[M]$,

$$G(t_j) - \varepsilon \le x_j \le G(t_j) + \varepsilon.$$

The proof of the fast Gaussian transform also implies a result for the online version:

Theorem 9.14 (online version). Given $N$ vectors $s_1,\cdots,s_N\in\mathbb{R}^d$, a number $\delta>0$, and a vector $q\in\mathbb{R}^N$, let the function $G:\mathbb{R}^d\to\mathbb{R}$ be defined as $G(t) = \sum_{i=1}^N q_i\cdot e^{-\|t-s_i\|_2^2/\delta}$. There is an algorithm that takes

$$O\big(N\log^{O(d)}(\|q\|_1/\varepsilon)\big)$$

time to do preprocessing, and then for each $t\in\mathbb{R}^d$, takes $O(\log^{O(d)}(\|q\|_1/\varepsilon))$ time to output a number $x$ such that

$$G(t) - \varepsilon \le x \le G(t) + \varepsilon.$$

9.3 Generalization

The fast multipole method described in the previous section works not only for the Gaussian kernel, but also for $K(u,v) = f(\|u-v\|_2^2)$ for many other functions $f$. As long as $f$ has the following properties, the result of Theorem 9.13 also holds for $f$:

• $f:\mathbb{R}\to\mathbb{R}_+$.
• $f$ is non-increasing, i.e., if $x\ge y\ge 0$, then $f(x)\le f(y)$.
• $f$ decreases quickly, i.e., for any $\varepsilon\in(0,1)$, we have $f(\Theta(\log(1/\varepsilon)))\le\varepsilon$.
• $f$'s Hermite and Taylor expansions are truncatable: if we only keep $\log^d(1/\varepsilon)$ terms of the polynomial for $K$, then the error is at most $\varepsilon$.

Let us now sketch how each of these properties is used in discretizing the continuous domain into a finite number of boxes. First, note that Eq. (12) holds more generally for any function $f$ with these properties. Indeed, we can bound the error as follows (note that $Q = \|q\|_1$):

$$\sum_{j:\,\|t-s_j\|_\infty\ge kr\sqrt{2\delta}} |q_j|\cdot f(\|t-s_j\|_2/\sqrt{\delta}) \le \sum_{j:\,\|t-s_j\|_\infty\ge kr\sqrt{2\delta}} |q_j|\cdot f(\|t-s_j\|_\infty/\sqrt{\delta}) \le \sum_{j:\,\|t-s_j\|_\infty\ge kr\sqrt{2\delta}} |q_j|\cdot f(\sqrt{2}kr) \le Q\cdot f(\sqrt{2}kr) \le \varepsilon, \qquad (14)$$

where the first step follows from $\|\cdot\|_2\ge\|\cdot\|_\infty$ and the fact that $f$ is non-increasing, the second step follows from $\|t-s_j\|_\infty\ge kr\sqrt{2\delta}$ and the fact that $f$ is non-increasing, and the last step follows from the fact that $f$ decreases quickly, choosing $k = O(\log(Q/\varepsilon)/r)$.

We next give an example of how the truncatable-expansions property is used. Here we only show how to generalize Definition 9.8 to Definition 9.15 and Lemma 9.9 to Definition 9.16; the other lemmas in the proof can be extended in a similar way.

Definition 9.15. Let denote a box with center s and side length r√2δ with r< 1. If source sj B B is in box , we say j . Then the Gaussian evaluation from the sources in box is, B ∈B B G(t)= q f( t s /√δ). j · k − jk2 j X∈B The Hermite expansion of G(t) is

t s G(t)= Aα hα − B , (15) · √δ α 0   X≥ where the coefficients Aα are defined by α 1 sj s Aα = qj − B (16) α! · √δ j   X∈B

Definition 9.16. Let $p$ denote an integer, and let $\mathrm{Err}_H(p)$ denote the error after truncating the series $G(t)$ (as defined in Def. 9.15) after $p^d$ terms, i.e.,

$$\mathrm{Err}_H(p) = \sum_{\alpha\ge p} A_\alpha\cdot H_\alpha\!\left(\frac{t-s_B}{\sqrt{\delta}}\right).$$

We say $f$ is Hermite truncatable if

$$|\mathrm{Err}_H(p)| \le p^{-\Omega(pd)}.$$

9.4 $K(x,y) = 1/\|x-y\|_2^2$

Similar ideas yield the following algorithm:

Theorem 9.17 (FMM, [BG97, Mar12]). Given $n$ vectors $x_1,x_2,\cdots,x_n\in\mathbb{R}^d$, let the matrix $A\in\mathbb{R}^{n\times n}$ be defined as $A_{i,j} = 1/\|x_i-x_j\|_2^2$. For any vector $h\in\mathbb{R}^n$, in time $O(n\log^{O(d)}(\|u\|_1/\varepsilon))$, we can output a vector $u$ such that

$$(Ah)_i - \varepsilon \le u_i \le (Ah)_i + \varepsilon.$$

10 Neural Tangent Kernel

In this section, we show that the popular Neural Tangent Kernel $K$ from theoretical deep learning can be rearranged into the form $K(x,y) = f(\|x-y\|_2^2)$ for an appropriate analytic function $f$, so our results in this paper apply to it. We first define the kernel.

Definition 10.1 (Neural Tangent Kernel, [JGH18]). Given $n$ points $x_1,x_2,\cdots,x_n\in\mathbb{R}^d$, and any activation function $\sigma:\mathbb{R}\to\mathbb{R}$, the neural tangent kernel matrix $K\in\mathbb{R}^{n\times n}$ can be defined as follows, where $\mathcal{N}(0,I_d)$ denotes the Gaussian distribution:

$$K_{i,j} := \int_{\mathcal{N}(0,I_d)} \sigma'(w^\top x_i)\,\sigma'(w^\top x_j)\,x_i^\top x_j\,dw.$$

In the literature on convergence results for deep neural networks [LL18, DZPS19, AZLS19b, AZLS19a, SY19, BPSW21, LSS+20], it is natural to assume that all the data points are on the unit sphere, i.e., for all $i\in[n]$ we have $\|x_i\|_2 = 1$, and that the data are separable, i.e., for all $i\ne j$, $\|x_i-x_j\|_2\ge\delta$. One of the most standard and commonly used activation functions in neural network training is the ReLU activation, $\sigma(x) = \max\{x,0\}$. Using Lemma 10.2, we can work out the corresponding kernel function. By Theorem 5.14, the multiplication task for neural tangent kernels is hard. In neural network training, multiplication can potentially be used to speed up the training procedure (see [SY19]). In the following lemma, we compute the kernel function for the ReLU activation function.

Lemma 10.2. If $\sigma(x) = \max\{x,0\}$, then the Neural Tangent Kernel can be written as $K(x,y) = f(\|x-y\|_2^2)$ for

$$f(x) = \frac{1}{\pi}\big(\pi - \cos^{-1}(1-0.5x)\big)\cdot(1-0.5x).$$

Proof. First, since $\|x_i\|_2 = \|x_j\|_2 = 1$, we know that

$$\|x_i-x_j\|_2^2 = \|x_i\|_2^2 - 2\langle x_i,x_j\rangle + \|x_j\|_2^2 = 2 - 2\langle x_i,x_j\rangle.$$

By definition of $\sigma$, we know

$$\sigma'(x) = \begin{cases} 1, & \text{if } x>0; \\ 0, & \text{otherwise.} \end{cases}$$

Using properties of the Gaussian distribution $\mathcal{N}(0,I_d)$, we have

$$\int_{\mathcal{N}(0,I_d)} \sigma'(w^\top x_i)\,\sigma'(w^\top x_j)\,dw = \frac{1}{\pi}\big(\pi - \cos^{-1}(x_i^\top x_j)\big) = \frac{1}{\pi}\big(\pi - \cos^{-1}(1 - 0.5\|x_i-x_j\|_2^2)\big).$$

We can rewrite $K_{i,j}$ as follows:

$$K_{i,j} = \int_{\mathcal{N}(0,I_d)} \sigma'(w^\top x_i)\,\sigma'(w^\top x_j)\,x_i^\top x_j\,dw = (1 - 0.5\|x_i-x_j\|_2^2)\cdot\int_{\mathcal{N}(0,I_d)} \sigma'(w^\top x_i)\,\sigma'(w^\top x_j)\,dw = (1 - 0.5\|x_i-x_j\|_2^2)\cdot\frac{1}{\pi}\big(\pi - \cos^{-1}(1 - 0.5\|x_i-x_j\|_2^2)\big) = f(\|x_i-x_j\|_2^2).$$
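As a small illustration of Lemma 10.2 (a sketch, not part of the paper), the ReLU NTK matrix for unit-norm points can be assembled directly from the closed form $f(\|x_i-x_j\|_2^2)$:

```python
import numpy as np

def relu_ntk_matrix(X):
    """K_{i,j} = f(||x_i - x_j||^2) for rows x_i of X on the unit sphere,
    with f(x) = (1/pi) * (pi - arccos(1 - x/2)) * (1 - x/2) as in Lemma 10.2."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise ||x_i - x_j||^2
    c = np.clip(1 - 0.5 * sq, -1.0, 1.0)                         # equals <x_i, x_j> on the sphere
    return (np.pi - np.arccos(c)) * c / np.pi

# Example: random unit vectors; the diagonal equals f(0) = 1.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))
X /= np.linalg.norm(X, axis=1, keepdims=True)
K = relu_ntk_matrix(X)
assert np.allclose(np.diag(K), 1.0)
```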

Lemma 10.3. Given $n$ data points $x_1,\ldots,x_n\in\mathbb{R}^d$ on the unit sphere and any activation function $\sigma:\mathbb{R}\to\mathbb{R}$, the corresponding Neural Tangent Kernel $K(x_i,x_j)$ is a function of $\|x_i-x_j\|_2^2$.

Proof. Note that $w\sim\mathcal{N}(0,I_d)$, so we know $(x_i^\top w, x_j^\top w)\sim\mathcal{N}(0,\Sigma_{i,j})$, where the covariance matrix is

$$\Sigma_{i,j} = \begin{pmatrix} x_i^\top x_i & x_i^\top x_j \\ x_j^\top x_i & x_j^\top x_j \end{pmatrix} = \begin{pmatrix} 1 & x_i^\top x_j \\ x_i^\top x_j & 1 \end{pmatrix} \in\mathbb{R}^{2\times 2},$$

since $\|x_i\|_2 = \|x_j\|_2 = 1$. Thus,

$$K(x_i,x_j) = \mathbb{E}_{(a,b)\sim\mathcal{N}(0,\Sigma_{i,j})}[\sigma'(a)\sigma'(b)]\cdot x_i^\top x_j = g(x_i^\top x_j)$$

for some function $g$. Note that $x_i^\top x_j = -\frac{1}{2}\|x_i-x_j\|_2^2 + 1$, so $K(x_i,x_j) = f(\|x_i-x_j\|_2^2)$ for some function $f$, which completes the proof.

References

[AALG18] Vedat Levi Alev, Nima Anari, Lap Chi Lau, and Shayan Oveis Gharan. Graph clus- tering using effective resistance. In ITCS, 2018.

[Ach03] Dimitris Achlioptas. Database-friendly random projections: Johnson-lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671–687, 2003.

89 [ACL06] Reid Andersen, Fan Chung, and Kevin Lang. Local graph partitioning using pager- ank vectors. In IEEE 47th Annual Symposium on Foundations of Computer Science (FOCS), 2006.

[ACW16] Josh Alman, Timothy M Chan, and Ryan Williams. Polynomial representations of threshold functions and algorithmic applications. In IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 467–476, 2016.

[ACX19] Pankaj K. Agarwal, Hsien-Chih Chang, and Allen Xiao. Efficient algorithms for geo- metric partial matching. In 35th International Symposium on Computational Geometry (SoCG), pages 6:1–6:14, 2019.

[AESW91] Pankaj K Agarwal, Herbert Edelsbrunner, Otfried Schwarzkopf, and Emo Welzl. Eu- clidean minimum spanning trees and bichromatic closest pairs. Discrete & Computa- tional Geometry, 6(3):407–422, 1991.

[AI06] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 459–468, 2006.

[And09] Alexandr Andoni. Nearest neighbor search: the old, the new, and the impossible. PhD thesis, Massachusetts Institute of Technology, 2009.

[And09] Alexandr Andoni. Nearest neighbor search: the old, the new, and the impossible. PhD thesis, Massachusetts Institute of Technology, 2009.

[AP09] Reid Andersen and Yuval Peres. Finding sparse cuts locally using evolving sets. In Proceedings of the forty-first annual ACM symposium on Theory of computing (STOC), 2009.

[AS14] Pankaj K Agarwal and R Sharathkumar. Approximation algorithms for bipartite matching with metric and geometric costs. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing (STOC), pages 555–564, 2014.

[AW15] Josh Alman and Ryan Williams. Probabilistic polynomials and hamming nearest neigh- bors. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), pages 136–150, 2015.

[AW21] Josh Alman and Virginia Vassilevska Williams. A refined laser method and faster matrix multiplication. In 32nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2021.

[AWW14] Amir Abboud, Virginia Vassilevska Williams, and Oren Weimann. Consequences of faster alignment of sequences. In International Colloquium on Automata, Languages, and Programming (ICALP), pages 39–51. Springer, 2014.

[AWY15] Amir Abboud, Ryan Williams, and Huacheng Yu. More applications of the polynomial method to algorithm design. In Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms (SODA), pages 218–230, 2015.

[AZLS19a] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning (ICML), 2019.

90 [AZLS19b] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[BCIS18] Arturs Backurs, Moses Charikar, Piotr Indyk, and Paris Siminelakis. Efficient density evaluation for smooth kernels. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 615–626, 2018.

[BG97] Rick Beatson and Leslie Greengard. A short course on fast multipole methods. Wavelets, multilevel methods and elliptic PDEs, 1:1–37, 1997.

[BHOS+08] Asa Ben-Hur, Cheng Soon Ong, Sören Sonnenburg, Bernhard Schölkopf, and Gun- nar Rätsch. Support vector machines and kernels for computational biology. PLoS computational biology, 4(10):e1000173, 2008.

[BIS17] Arturs Backurs, Piotr Indyk, and Ludwig Schmidt. On the fine-grained complexity of empirical risk minimization: Kernel methods and neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 4308–4318, 2017.

[BLSS20] Jan van den Brand, Yin Tat Lee, Aaron Sidford, and Zhao Song. Solving tall dense linear programs in nearly linear time. In 52nd Annual ACM Symposium on Theory of Computing (STOC), 2020.

[BNS06] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res., 7:2399–2434, December 2006.

[Boc33] Salomon Bochner. Monotone funktionen, stieltjessche integrale und harmonische analyse. Mathematische Annalen, 108:378–410, 1933.

[BPSW21] Jan van den Brand, Binghui Peng, Zhao Song, and Omri Weinstein. Training (over-parametrized) neural networks in near-linear time. In The 12th Innovations in Theoretical Computer Science Conference (ITCS), 2021.

[Bra20] Jan van den Brand. A deterministic linear program solver in current matrix multipli- cation time. In SODA, 2020.

[BTB05a] Sabri Boughorbel, J-P Tarel, and Nozha Boujemaa. Conditionally positive definite kernels for svm based image recognition. In 2005 IEEE International Conference on Multimedia and Expo, pages 113–116, 2005.

[BTB05b] Sabri Boughorbel, J-P Tarel, and Nozha Boujemaa. Generalized histogram intersection kernel for image recognition. In IEEE International Conference on Image Processing 2005, volume 3, pages III–161. IEEE, 2005.

[BTF04] Sabri Boughorbel, Jean-Philippe Tarel, and Francois Fleuret. Non-mercer kernels for svm object recognition. In BMVC, pages 1–10, 2004.

[BTFB05] Sabri Boughorbel, Jean-Philippe Tarel, François Fleuret, and Nozha Boujemaa. The gcs kernel for svm-based image recognition. In International Conference on Artificial Neural Networks, pages 595–600. Springer, 2005.

[CFH16] Alireza Chakeri, Hamidreza Farhidzadeh, and Lawrence O Hall. Spectral sparsification in spectral clustering. In 23rd international conference on pattern recognition (ICPR), pages 2301–2306. IEEE, 2016.

[CHC+10] Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, and Chih-Jen Lin. Training and testing low-degree polynomial data mappings via linear svm. Journal of Machine Learning Research, 11(Apr):1471–1490, 2010.

[Che52] Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, pages 493–507, 1952.

[Che70] Jeff Cheeger. A lower bound for the smallest eigenvalue of the laplacian. In Gunning, Robert C., Problems in analysis (Papers dedicated to Salomon Bochner, 1969), pages 195–199. Princeton Univ. Press, Princeton, N.J., 1970.

[Che18] Lijie Chen. On the hardness of approximate and exact (bichromatic) maximum inner product. In Computational Complexity Conference (CCC), 2018.

[Chu97] Fan Chung. Spectral graph theory, volume 92. American Mathematical Soc., 1997.

[CJK+13] Anna Choromanska, Tony Jebara, Hyungtae Kim, Mahesh Mohan, and Claire Monteleoni. Fast spectral clustering via the nyström method. In International Conference on Algorithmic Learning Theory, pages 367–381. Springer, 2013.

[CK93] Paul B Callahan and S Rao Kosaraju. Faster algorithms for some geometric graph problems in higher dimensions. In SODA, pages 291–300, 1993.

[CK95] Paul B Callahan and S Rao Kosaraju. A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields. Journal of the ACM, 42(1):67–90, 1995.

[CKM+14] Michael B Cohen, Rasmus Kyng, Gary L Miller, Jakub W Pachocki, Richard Peng, Anup B Rao, and Shen Chen Xu. Solving sdd linear systems in nearly m log^{1/2} n time. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing (STOC), pages 343–352, 2014.

[CLS19] Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In Proceedings of the 51st Annual Symposium on the Theory of Computing (STOC), 2019.

[CMS20] Timothy Chu, Gary L. Miller, and Donald Sheehy. Exact computation of a manifold metric, via lipschitz embeddings and shortest paths on a graph. In SODA, 2020.

[CS17] Moses Charikar and Paris Siminelakis. Hashing-based-estimators for kernel density in high dimensions. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 1032–1043, 2017.

[CSB+11] W. Chen, Y. Song, H. Bai, C. Lin, and E. Y. Chang. Parallel spectral clustering in distributed systems. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 33(3):568–586, March 2011.

[CSZ09] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-supervised learning. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.

[CV95] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.

[CW16] Timothy M Chan and Ryan Williams. Deterministic apsp, orthogonal vectors, and more: Quickly derandomizing razborov-smolensky. In Proceedings of the twenty-seventh annual ACM-SIAM symposium on Discrete algorithms (SODA), pages 1246–1255, 2016.

[Dan47] George B Dantzig. Maximization of a linear function of variables subject to linear inequalities. Activity analysis of production and allocation, 13:339–347, 1947.

[Dar00] Eric Darve. The fast multipole method: numerical implementation. Journal of Computational Physics, 160(1), 2000.

[DG03] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of johnson and lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.

[DS00] Jack Dongarra and Francis Sullivan. Guest editors’ introduction: The top 10 algorithms. Computing in Science & Engineering, 2(1):22, 2000.

[DZPS19] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In ICLR, 2019.

[EMRV92] Nader Engheta, William D. Murphy, Vladimir Rokhlin, and Marius Vassiliou. The fast multipole method for electromagnetic scattering computation. IEEE Transactions on Antennas and Propagation, 40:634–641, 1992.

[FS03] François Fleuret and Hichem Sahbi. Scale-invariance of support vector machines based on the triangular kernel. In 3rd International Workshop on Statistical and Computational Theories of Vision, pages 1–13, 2003.

[GE08] Yoav Goldberg and Michael Elhadad. splitsvm: fast, space-efficient, non-heuristic, polynomial kernel computation for nlp applications. Proceedings of ACL-08: HLT, Short Papers, pages 237–240, 2008.

[GR87] Leslie Greengard and Vladimir Rokhlin. A fast algorithm for particle simulations. Journal of computational physics, 73(2):325–348, 1987.

[GR88] Leslie Greengard and Vladimir Rokhlin. The rapid evaluation of potential fields in three dimensions. Vortex Methods. Springer, Berlin, Heidelberg, pages 121–141, 1988.

[GR89] Leslie Greengard and Vladimir Rokhlin. On the evaluation of electrostatic interactions in molecular modeling. Chemica scripta, 29:139–144, 1989.

[GR96] Leslie Greengard and Vladimir Rokhlin. An improved fast multipole algorithm in three dimensions, 1996.

[Gre88] Leslie Greengard. The rapid evaluation of potential fields in particle systems. MIT press, 1988.

[Gre90] Leslie Greengard. The numerical solution of the n-body problem. Computers in physics, 4(2):142–152, 1990.

[Gre94] Leslie Greengard. Fast algorithms for classical physics. Science, 265(5174):909–914, 1994.

[GS91] Leslie Greengard and John Strain. The fast gauss transform. SIAM Journal on Scientific and Statistical Computing, 12(1):79–94, 1991.

[Gun98] Steve R Gunn. Support vector machines for classification and regression. ISIS technical report, 14(1):5–16, 1998.

[Ham04] Bart Hamers. Kernel models for large scale applications, 2004.

[HKNS15] Monika Henzinger, Sebastian Krinninger, Danupon Nanongkai, and Thatchaphol Saranurak. Unifying and strengthening hardness for dynamic problems via the online matrix-vector multiplication conjecture. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 21–30, 2015.

[Hoe63] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

[Hof07] Heiko Hoffmann. Kernel pca for novelty detection. Pattern recognition, 40(3):863–874, 2007.

[HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[IM98] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing (STOC), 1998.

[Ind06] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM), 53(3):307–323, 2006.

[IP01] Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-sat. Journal of Computer and System Sciences, 62(2):367–375, 2001.

[JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems (NeurIPS), pages 8571–8580, 2018.

[JKL09] Kevin Jarrett, Koray Kavukcuoglu, and Yann LeCun. What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th International Conference on Computer Vision (ICCV), pages 2146–2153, 2009.

[JL84] William B Johnson and Joram Lindenstrauss. Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics, 26(189-206):1, 1984.

[JMSZ20] Shunhua Jiang, Yunze Man, Zhao Song, and Danyang Zhuo. Graph neural network acceleration via matrix dimension reduction. In Openreview. https://openreview.net/forum?id=8IbZUle6ieH, 2020.

[JSWZ20] Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. Faster dynamic matrix inverse for faster lps. arXiv preprint arXiv:2004.07470, 2020.

[Kar84] Narendra Karmarkar. A new polynomial-time algorithm for linear programming. In Proceedings of the sixteenth annual ACM symposium on Theory of computing (STOC), pages 302–311, 1984.

[Kha80] Leonid G Khachiyan. Polynomial algorithms in linear programming. USSR Computational Mathematics and Mathematical Physics, 20(1):53–72, 1980.

[KLP+16] Rasmus Kyng, Yin Tat Lee, Richard Peng, Sushant Sachdeva, and Daniel A. Spielman. Sparsified cholesky and multigrid solvers for connection laplacians. In STOC, 2016.

[KMP10] Ioannis Koutis, Gary L Miller, and Richard Peng. Approaching optimality for solving sdd linear systems. In FOCS, 2010.

[KMP11] Ioannis Koutis, Gary L Miller, and Richard Peng. A nearly-m log n time solver for sdd linear systems. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science (FOCS), pages 590–598, 2011.

[KMT12] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling methods for the nyström method. Journal of Machine Learning Research, 13(Apr):981–1006, 2012.

[KOSZ13] Jonathan A Kelner, Lorenzo Orecchia, Aaron Sidford, and Zeyuan Allen Zhu. A simple, combinatorial algorithm for solving sdd systems in nearly-linear time. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing (STOC), pages 911–920, 2013.

[KS16] Rasmus Kyng and Sushant Sachdeva. Approximate gaussian elimination for laplacians: Fast, sparse, and simple. In 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 573–582, 2016.

[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (NIPS), pages 1097–1105, 2012.

[KVV04] Ravi Kannan, Santosh Vempala, and Adrian Vetta. On clusterings: Good, bad and spectral. In Journal of the ACM (JACM), pages 497–515, 2004.

[LL18] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[Llo82] Stuart Lloyd. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):129–137, 1982.

[LRTV11] Anand Louis, Prasad Raghavendra, Prasad Tetali, and Santosh Vempala. Algorithmic extensions of cheeger’s inequality to higher eigenvalues and partitions. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 315–326, 2011.

[LS13] Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In FOCS, 2013.

[LS14] Yin Tat Lee and Aaron Sidford. Path finding methods for linear programming: Solving linear programs in O(√rank) iterations and faster algorithms for maximum flow. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (FOCS), pages 424–433. IEEE, 2014.

[LS15] Yin Tat Lee and He Sun. Constructing linear-sized spectral sparsification in almost-linear time. In FOCS, 2015.

[LSS+20] Jason D. Lee, Ruoqi Shen, Zhao Song, Mengdi Wang, and Zheng Yu. Generalized leverage score sampling for neural network. In NeurIPS, 2020.

[LSZ19a] Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization in the current matrix multiplication time. In COLT, 2019.

[LSZ+19b] Xuanqing Liu, Si Si, Xiaojin Zhu, Yang Li, and Cho-Jui Hsieh. A unified framework for data poisoning attack to graph-based semi-supervised learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[LWDH13] J. Liu, C. Wang, M. Danilevsky, and J. Han. Large-scale spectral clustering on graphs. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.

[Mar12] Per-Gunnar Martinsson. Encyclopedia entry on fast multipole methods. University of Colorado at Boulder, 2012.

[Mas10] Peter Massopust. Interpolation and approximation with splines and fractals. Oxford University Press, Inc., 2010.

[Mat92] Jiří Matoušek. Efficient partition trees. Discrete & Computational Geometry, 8(3):315–334, 1992.

[Mic84] Charles A Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. In Approximation theory and spline functions, pages 143–145. Springer, 1984.

[MSS+99] Sebastian Mika, Bernhard Schölkopf, Alex J Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch. Kernel pca and de-noising in feature spaces. In Advances in neural information processing systems (NIPS), pages 536–542, 1999.

[NJW02] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems, pages 849–856, 2002.

[NS41] J. Von Neumann and I. J. Schoenberg. Fourier integrals and metric geometry. Transactions of the American Mathematical Society, 50(2):226–251, 1941.

[OZ14] Lorenzo Orecchia and Zeyuan Allen Zhu. Flow-based algorithms for local graph clustering. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1267–1286, 2014.

[Raz17] Ilya Razenshteyn. High-dimensional similarity search and sketching: algorithms and hardness. PhD thesis, Massachusetts Institute of Technology, 2017.

[RR08] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems (NIPS), pages 1177–1184, 2008.

[Rub18] Aviad Rubinstein. Hardness of approximate nearest neighbor search. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1260–1268, 2018.

[Sch37] I. J. Schoenberg. On certain metric spaces arising from euclidean spaces by a change of metric and their imbedding in hilbert space. Annals of Mathematics, 38(4):787–793, 1937.

[She17] Jonah Sherman. Generalized preconditioning and network flow problems. In SODA, pages 772–780, 2017.

[Shl14] Jonathon Shlens. A tutorial on principal component analysis. arXiv preprint arXiv:1404.1100, 2014.

[SM19] Karthik C. S. and Pasin Manurangsi. On closest pair in euclidean metric: Monochromatic is as hard as bichromatic. In ITCS, 2019.

[Son19] Zhao Song. Matrix Theory : Optimization, Concentration and Algorithms. PhD thesis, The University of Texas at Austin, 2019.

[Sou10] César R Souza. Kernel functions for machine learning applications. Creative Commons Attribution-Noncommercial-Share Alike, 3:29, 2010.

[SS11] Daniel A Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM Journal on Computing, 40(6):1913–1926, 2011.

[SSB+17] Hojjat Salehinejad, Sharan Sankar, Joseph Barfett, Errol Colak, and Shahrokh Valaee. Recent advances in recurrent neural networks. arXiv preprint arXiv:1801.01078, 2017.

[SSM98] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural computation, 10(5):1299–1319, 1998.

[ST04] Daniel A Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC, 2004.

[SV14] Sushant Sachdeva and Nisheeth K Vishnoi. Faster algorithms via approximation theory. Foundations and Trends in Theoretical Computer Science, 9(2):125–210, 2014.

[SY19] Zhao Song and Xin Yang. Quadratic suffices for over-parametrization via matrix chernoff bound. arXiv preprint arXiv:1906.03593, 2019.

[SY20] Zhao Song and Zheng Yu. Oblivious sketching-based central path method for solving linear programming problems. In Openreview. https://openreview.net/forum?id=fGiKxvF-eub, 2020.

[Uns99] Michael Unser. Splines: A perfect fit for signal and image processing. IEEE Signal processing magazine, 16(6):22–38, 1999.

[Vai87] Pravin M Vaidya. An algorithm for linear programming which requires O(((m + n)n^2 + (m + n)^{1.5}n)L) arithmetic operations. In FOCS, 1987.

[Vai89] Pravin M Vaidya. Speeding-up linear programming using fast matrix multiplication. In 30th Annual Symposium on Foundations of Computer Science (FOCS), pages 332–337. IEEE, 1989.

[vL07] Ulrike von Luxburg. A tutorial on spectral clustering, 2007.

[VZ12] Andrea Vedaldi and Andrew Zisserman. Efficient additive kernels via explicit feature maps. IEEE transactions on pattern analysis and machine intelligence, 34(3):480–492, 2012.

[Wil05] Ryan Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theoretical Computer Science, 348(2-3):357–365, 2005.

[Wil18] Virginia Vassilevska Williams. On some fine-grained questions in algorithms and complexity. In Proceedings of the International Congress of Mathematicians (ICM), 2018.

[Woo49] Max A Woodbury. The stability of out-input matrices. Chicago, IL, 9, 1949.

[Woo50] Max A Woodbury. Inverting modified matrices. Memorandum report, 42(106):336, 1950.

[Yao82] Andrew Chi-Chih Yao. On constructing minimum spanning trees in k-dimensional spaces and related problems. SIAM Journal on Computing, 11(4):721–736, 1982.

[YDD04] Changjiang Yang, Ramani Duraiswami, and Larry Davis. Efficient kernel machines using the improved fast gauss transform. In NIPS, 2004.

[YDGD03] Changjiang Yang, Ramani Duraiswami, Nail A. Gumerov, and Larry Davis. Improved fast gauss transform and efficient kernel density estimation. In Proceedings Ninth IEEE International Conference on Computer Vision (ICCV), 2003.

[Zhu05a] Xiaojin Zhu. Semi-supervised learning with graphs. PhD thesis, Carnegie Mellon University, language technologies institute, school of Computer Science, 2005.

[Zhu05b] Xiaojin Jerry Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2005.

[ZL05] Xiaojin Zhu and John Lafferty. Harmonic mixtures: combining mixture models and graph-based methods for inductive and scalable semi-supervised learning. In Proceedings of the 22nd international conference on Machine learning (ICML), pages 1052–1059, 2005.

[ZLM13] Zeyuan Allen Zhu, Silvio Lattanzi, and Vahab Mirrokni. A local algorithm for finding well-connected clusters. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 396–404, 2013.

[ZSD17] Kai Zhong, Zhao Song, and Inderjit S Dhillon. Learning non-overlapping convolutional neural networks with multiple kernels. arXiv preprint arXiv:1711.03440, 2017.

[ZSJ+17] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 4140–4149, 2017.

[ZSJD19] Kai Zhong, Zhao Song, Prateek Jain, and Inderjit S Dhillon. Provable non-linear inductive matrix completion. In Advances in Neural Information Processing Systems (NeurIPS), pages 11439–11449, 2019.
