Methods for Estimating the Diagonal of Matrix Functions

Jesse Harrison Laeuchli

Williamsburg, Virginia

Bachelor of Science, University of Notre Dame, 2007
Master of Science, College of William and Mary, 2012

A Dissertation presented to the Graduate Faculty of the College of William and Mary in Candidacy for the Degree of Doctor of Philosophy

Department of Computer Science

The College of William and Mary
May 2016

© Copyright by Jesse Harrison Laeuchli 2016

ABSTRACT

Many applications, such as path integral evaluation in Lattice Quantum Chromodynamics (LQCD), variance estimation of least squares solutions and spline fits, and centrality measures in network analysis, require computing the diagonal of a function of a matrix, Diag(f(A)), where A is a sparse matrix and f is some function. Unfortunately, when A is large, this can be computationally prohibitive. Because of this, many applications resort to Monte Carlo methods. However, Monte Carlo methods tend to converge slowly.

One method for dealing with this shortcoming is probing. Probing assumes that nodes that have a large distance between them in the graph of A have only a small weight connecting them in f(A). To determine the distances between nodes, probing forms A^k. Coloring the graph of this matrix groups together nodes that have a large distance between them, and thus a small connection in f(A). This enables the construction of certain vectors, called probing vectors, that can capture the diagonal of f(A). One drawback of probing is that in many cases it is too expensive to compute and store A^k for the k that adequately determines which nodes have a strong connection in f(A). Additionally, it is unlikely that the set of probing vectors required for A^k is a subset of the probing vectors needed for A^{k+1}. This means that if more accuracy in the estimation is required, all previously computed work must be discarded.

In the case where the underlying problem arises from a discretization of a partial differential equation (PDE) onto a lattice, we can make use of our knowledge of the geometry of the lattice to quickly create hierarchical colorings for the graph of A^k. A hierarchical coloring is one in which the colors for A^{k+1} are created by splitting groups of nodes sharing a color in A^k. The hierarchical property ensures that the probing vectors used to estimate Diag(f(A)) are nested subsets, so if the results are inaccurate the estimate can be improved without discarding the previous work.

If we do not have knowledge of the intrinsic geometry of the matrix, we propose two new classes of methods that improve on the results of probing. One method seeks to determine structural properties of the matrix f(A) by obtaining random samples of the columns of f(A). The other method leverages ideas arising from similar problems in graph partitioning, and makes use of the eigenvectors of f(A) to form effective hierarchical colorings.

Our methods have thus far seen successful use in computational physics, where they have been applied to compute observables arising in LQCD. We hope that the refinements presented in this work will enable interesting applications in many other fields.

TABLE OF CONTENTS

Acknowledgments
Dedication
List of Tables
List of Figures

1 Introduction
  1.1 Motivation
    1.1.1 Prior Work and New Approach
  1.2 Overview

2 Prior Work and Applications
  2.1 Applications
    2.1.1 Statistical Applications
    2.1.2 Lattice Quantum Chromodynamics
    2.1.3 Network Centrality
  2.2 Prior Work
    2.2.1 Statistical Methods
    2.2.2 Non-Statistical Methods

3 Estimation of Diag(f(A)) on toroidal lattices
  3.1 Lattices with dimensions consisting only of powers of 2
    3.1.1 Introduction
    3.1.2 Preliminaries
    3.1.3 Lattice QCD problems
    3.1.4 The Monte Carlo method for Tr(A^{-1})
    3.1.5 Probing
    3.1.6 Hadamard vectors
    3.1.7 Overcoming probing limitations
    3.1.8 Hierarchical coloring
    3.1.9 Hierarchical coloring on lattices
    3.1.10 Splitting color blocks into conformal d-D lattices
    3.1.11 Facilitating bit reversal in higher dimensions
    3.1.12 Lattices with different sizes per dimension
    3.1.13 Coloring lattices with non-power of two sizes
    3.1.14 Generating the probing basis
    3.1.15 Removing the deterministic bias
    3.1.16 Numerical experiments
    3.1.17 Comparison with classical probing
    3.1.18 Comparison with random-noise Monte Carlo
    3.1.19 A large QCD problem
    3.1.20 Conclusions
  3.2 Lattices of arbitrary dimensions
    3.2.1 Introduction and Preliminaries
  3.3 Lattices as spans of sublattices
  3.4 Coloring sublattices
    3.4.1 Hierarchical Permutations of Lattices with Equal Sides
    3.4.2 Hierarchical Permutations of Lattices with Unequal Sides
    3.4.3 Generating Probing Vectors Quickly
  3.5 Probing Vectors for Hierarchical Coloring on General Graphs
  3.6 Performance Testing
  3.7 Conclusion

4 Estimation of Diag(f(A)) in the general case
  4.1 Graph Coloring
  4.2 Statistical Considerations
  4.3 Structural Methods
  4.4 Spectral Methods
    4.4.1 Spectral k-partitioning for the matrix inverse
  4.5 Experimental Results
  4.6 Conclusions

5 Conclusion and future work
  5.1 Methods for Lattices
  5.2 Methods for General Matrices

ACKNOWLEDGMENTS

It is difficult to convey my deep gratitude and respect for my advisor Andreas Stathopoulos, without whom this work would not have been possible. His keen perception, mathematical insight, and deep understanding of the field carried us through many difficult problems. To me he is the ideal computer scientist and mentor. I also wish to thank all of the members of my dissertation committee for their thoughtful comments and guidance, which greatly improved this work.

As part of the Computational Science research group at William and Mary, I had the good fortune to work with Lingfei Wu, and Eloy Romero Alcalde. I am grateful for their support and friendship.

During the course of this research I had several mentors at different internships, whose guidance and support contributed greatly to this work. In particular I would like to thank Chris Long, Lance Ward, and Geoff Sanders. They provided the spark for many ideas, and a great working environment to explore them.

For their many stimulating conversations on Computer Science, Mathematics and other topics, I would like to thank Philip Bozek, Douglas Tableman, and Walter McClean.

Finally, I would like to thank the Tan family for their moral support during the production of this thesis.

To My Parents, Samuel and Elizabeth Laeuchli

LIST OF TABLES

2.1 Convergence rates of different methods

3.1 Table showing run times of the new algorithm compared to the original. Results obtained on an Intel i7 860 clocked at 2.8 GHz

LIST OF FIGURES

2.1 The area zeroed out by using Hadamard vectors
2.2 An example of probing a 4-colorable graph
2.3 An example of wasted probing vectors in non-hierarchical probing

3.1 Visualizing a 4-colorable matrix permuted such that all rows corresponding to color 1 appear first, rows for color 2 appear second, and so on. Each diagonal block is a diagonal matrix. The four probing vectors with 1s in the corresponding blocks are shown on the right.
3.2 Crossed out nodes have their contribution to the error canceled by the Hadamard vectors used. Left: the first two, natural order Hadamard vectors do not cancel errors in some distance-1 neighbors in the lexicographic ordering of a 2-D uniform lattice. Right: if the grid is permuted with the red nodes first, the first and the middle Hadamard vectors completely cancel variance from nearest neighbors and correspond to the distance-1 probing vectors.
3.3 When doubling the probing distance (here from 1 to 2) we first split the 2-D grid into four conformal 2-D subgrids. Red nodes split to two 2 x 2 grids (red and green), and similarly black nodes split to blues and blacks. The smaller 2-D grids can then be red-black ordered.
3.4 Error in the Tr(A^{-1}) approximation using the MC method with various deterministic vectors. Classic probing requires 2, 16, 62, and 317 colors for probing distances 1, 2, 4, and 8, respectively. Left: classic probing approximates the trace better than the same number of Hadamard vectors taken in their natural order. Going to a higher distance k requires discarding previous work. Right: perform distance-k probing, then apply Hadamard in natural order within each color. This performs well, but hierarchical probing performs even better.
3.5 Left: the hierarchical coloring algorithm is stopped after 1, 2, 3, 4, 5 levels, corresponding to distances 2, 4, 8, 16, 32. The ticks on the x-axis show the number of colors for each distance. Trace estimation is effective up to the stopped level; beyond that the vectors do not capture the remaining large elements of A^{-1}. Compare the results with classical probing in Figure 3.4, which requires only a few fewer colors for the same distance. Right: when the matrix is shifted to have a high condition number, the lack of structure in A^{-1} causes all methods to produce similar results.
3.6 Convergence history of the Z2 random estimator, Hadamard vectors in natural order, and hierarchical probing, the latter two with bias removed as in (3.9). Because of the small condition number, A^{-1} has a lot of structure, making hierarchical probing clearly superior to the standard estimator. As expected, Hadamard vectors in natural order are not competitive. The markers on the plot of the hierarchical probing method designate the number of vectors required for a particular distance coloring to complete. It is on these markers that structure is captured and error minimized.
3.7 Convergence history of the three estimators as in Figure 3.6 for a larger condition number O(10^4). As the structure of A^{-1} becomes less prominent, the differences between methods reduce. Still, hierarchical probing has a clear advantage.
3.8 Convergence history of the three estimators as in Figure 3.6 for a high condition number O(10^6). Even with no prominent structure in A^{-1} to discover, hierarchical probing is as effective as the standard method.
3.9 Statistics over 100 random vectors z0, used to modify the sequence of 2048 hierarchical probing vectors as in (3.9). At every step, the variance of quadratures from the 100 different runs is computed, and confidence intervals are reported around the hierarchical probing convergence. Note that for the standard noise MC estimator confidence intervals are computed differently and thus are not directly comparable.
3.10 (a) Left: the variance of the hierarchical probing trace estimator as a function of the number of vectors (s) used. The minima appear when s is a power of two. The places where the colors complete are marked with the cyan markers. These minima become successively deeper as we progress from 2 to 32 to 512 vectors. (b) Right: speed-up of the LQCD trace calculation over the standard Z2 MC estimator. The cyan markers show where colors complete. The maximal speed-up is observed at s = 512. In both cases the uncertainties are estimated using the Jackknife procedure on a sample of 253 noise vectors, except for s = 256 and 512, where 37 vectors were used.
3.11 The decomposition of a 6x6 lattice into 3^2 sublattices L(3I)c0
3.12 Affine sublattices with x, y coordinates
3.13 The circled nodes constitute the C lattice of offsets. Note how C tiles the entire lattice, and that its coloring reflects the coloring of each sublattice L(bI)c. Since b = 3, each line of colors is the same as the previous line, shifted by 1 mod 3.
3.14 Comparison of the two methods on a 2D lattice with common factors 2 x 2 x 3 x 3 x 5. For the common factors of two, the methods are the same, but once these are exhausted the improved method has much lower error.

4.1 Probing vs coloring the structure of L† directly, where the percentage of the weight of L† retained varies from 0.1 to 0.5 in 0.05 increments. As the number of colors increases, probing struggles to capture the structure of L†.
4.2 Sampling 4 columns from A and then shifting them to detect if they share an off-diagonal
4.3 Approximation of the pseudo-inverse of a Laplacian with periodic boundary conditions with 10, 100, and 1000 vectors in the vv^T approximation. Here the vectors v are not sparsified; the figure shows how, even in the best case of an unsparsified v, vv^T contains mostly local structure until a significant number of vectors v are supplied.
4.4 Lattice Graph Results
4.5 Scale Free Graph Results
4.6 Wiki-Vote Graph Results
4.7 P2P-GNU Graph Results
4.8 Gre512 Matrix Results
4.9 Orseg Matrix Results
4.10 Mhd416 Matrix Results
4.11 Nos6 Matrix Results
4.12 Bcsstk07 Matrix Results
4.13 Af23560 Matrix Results

Chapter 1

Introduction

1.1 Motivation

In this work we study the problem of computing Diag(f(A)) and Σ_{i=1}^{N} Diag(f(A))_i = Tr(f(A)), where A is a sparse matrix of size N and f is some function. Some useful examples are f(A) = A^{-1}, or f(A) = exp(A). When A is small, this can be computed directly, which is an O(N^3) approach. When A is of intermediate size and is properly structured, recursive factorization methods allow for an O(N^2) solution [70]. However, in many problems of interest, the size of A can be such that even O(N^2) solutions are impractical. Because of this we abandon exact computation, and attempt to approximate the desired diagonals. There are two main methods for approximating the desired results.

The first of these is based on Monte Carlo methods [45, 22]. The second is based on matrix sparsification [62], where we hope to ignore unimportant parts of the matrix in order to speed up the computation. Our work takes advantage of the features of both types of approximations in order to produce better algorithms.
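As an illustration only (not code from the dissertation), the following minimal Python/NumPy sketch shows the direct O(N^3) baseline mentioned above: form f(A) through an eigendecomposition and read off its diagonal. The test matrix and functions are arbitrary choices for the example.

```python
import numpy as np
from scipy.linalg import eigh

# Small symmetric test matrix: a shifted 1-D Laplacian, kept invertible by the shift.
N = 200
A = 2.1 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)

# Direct O(N^3) evaluation of Diag(f(A)) via the eigendecomposition A = V E V^T.
E, V = eigh(A)

def diag_f(f):
    # Diag(f(A)) = diag(V f(E) V^T), assembled from the eigendecomposition.
    return np.einsum('ij,j,ij->i', V, f(E), V)

diag_inv = diag_f(lambda e: 1.0 / e)   # Diag(A^{-1})
diag_exp = diag_f(np.exp)              # Diag(exp(A))
print(diag_inv.sum(), np.trace(np.linalg.inv(A)))   # the two traces agree
```

This is exactly the computation that becomes infeasible at the matrix sizes discussed below, which motivates the approximation methods studied in this work.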

The need to apply efficient solutions to this problem is a result of ever increasing matrix sizes in many diverse areas. Several examples are Statistics [45, 67], Lattice Quantum

Chromodynamics (LQCD) [31], Material Science [28], and Network Analysis [68]. For example, in LQCD increasing the size of the lattice improves the physical accuracy of the

simulation. Thus, there is significant interest in increasing the size of the lattices beyond what is currently computationally feasible. The same is true in the field of social network analysis. As social networks expand and represent ever more interconnected networks, performing analysis requires the use of increasingly large matrices. Finally, as large data sets become ever more prevalent, many statistical processes require matrices that continue to increase in size, necessitating faster methods.

1.1.1 Prior Work and New Approach

This problem of computing Diag(f(A)) differs from common numerical analysis problems in a key way. Frequently, when similar problems are encountered, the problem is rewritten to be an optimization problem on a convex function. The problem can then be approached using optimization methods such as Newton’s method, Gradient Descent,

Conjugate Gradient and Non-Linear Conjugate Gradient, to converge to the value we are seeking. With this problem, no such optimization process can be undertaken. Because of this, statistical methods must be used.

Since this problem was first studied by Hutchinson [45], several statistical methods for the problem have been proposed. While these methods have the attractive feature that they provide statistical error estimates, they converge slowly and do not take advantage of information that the user may have about the matrix. The main goal of this work is to take advantage of anything that is known about a matrix in order to obtain a better estimate of Diag(f(A)) in less time. Although the matrix f(A) is normally dense, if

the smallest elements of f(A) are dropped, structure in this sparsified version of f(A) will emerge. Since exactly obtaining this structure is no easier than solving the original problem, alternative information is used to approximate it. Once this structure is known, an estimate for Diag(f(A)) can be obtained. This was the idea behind the method of

[60] known as probing. Probing uses the powers of a matrix A to obtain an estimate of the structure of f(A). For A−1 in particular this is based on the assumption that the

Neumann series of A converges to f(A). For different functions of A, other polynomials

could be considered. Once an estimate for the structure of f(A) is obtained, the authors of [62] show how to create probing vectors that allow for the recovery of the diagonal.

However, this requires taking high powers of the matrix A, which are expensive to compute and may be impractical to store. Ideally, knowledge of the matrix A that is less expensive to obtain should be used. Further, the method in [60] provides no way to tell how accurate the estimate is. Since this is important for many applications, this is a significant drawback.

Our proposed methods take advantage of several major areas of knowledge in order to obtain the structure of f(A), in ways that are cheaper than finding powers of A. The first is geometric information. Many of our target applications arise from partial differential equations (PDE) that are discretized onto lattices. For a PDE given by g(u) = y, the solution at u is often given in terms of the Green’s function G(u, u′), where u is the point we are obtaining a solution for, and u′ is some other point not equal to u. For many PDEs, as ‖u − u′‖ increases, G(u, u′) decays quickly, since the physical forces the Green’s function is attempting to model fall off rapidly with distance. Because of this, only connections between nodes that have short distances in the graph will have a large connection, or to put this another way, only elements in f(A) corresponding to links between close nodes will be large. If we can determine which points have short distances between them in the graph, we can use this distance information to obtain an estimate for which are the large elements of f(A). This is the previously mentioned approach of [60], where they use successively higher powers of A to compute the distance between nodes. In lattices, because of the known geometry, we can cheaply compute the distances between nodes without computing the powers of A.

For more general matrices where geometric information is lacking, we would prefer to bypass polynomial approximations, which are expensive to compute, and work more directly with the structure of f(A). This approach allows us to deal with matrices that do not exhibit the decay in interaction between distant nodes seen in many PDEs. We term this family of algorithms inverse probing. Inverse probing works by computing a subset of the columns v of the matrix f(A). These columns can then be used by several algorithms to build an approximation to f(A). The two main approaches we take are to form an approximation vv^T ≈ f(A), and to examine the values of v at different lags to try to predict the location of major off-diagonals of f(A). This allows for the estimation of the magnitude of the connection between nodes directly.

A final method for obtaining information on the structure of f(A) is based on examining the spectrum of f(A). If probing were used to determine all the distance-k connections of the i-th node, the algorithm would take the matrix-vector product A^k e_i. However, this is similar to the process of obtaining a single iterate of the power iteration method, which would take the product A^k r / ‖A^k r‖, where r is an appropriately chosen random vector. The power method is known to converge to the eigenvector of largest modulus of A. This suggests there could be a connection between probing and the largest eigenvector of A. The eigenvector holds the distance information that would be obtained by probing if it were taken an infinite number of steps. To state this slightly differently, the largest eigenvector holds similar structural information about f(A) to that which probing could obtain. We present a heuristic that explores the connection between these two ideas and lays the basis for future research in this area.

The geometric, structural, and spectral information that our new algorithms make use of are all more computationally and storage efficient than the high powers of A required for probing. Additionally, in many cases we observe that they provide more accurate results than probing, because they obtain a more accurate representation of the structure of f(A).

The final contribution of our work is to combine the idea of exploiting known but deterministic information about a matrix with statistical methods in order to provide the user with an improved as well as unbiased estimator. We provide a framework to analyze when the information obtained by our methods will provide meaningful improvements in the error estimates of our methods. By merging the strengths of the statistical and deterministic approaches we provide algorithms that are more robust.

1.2 Overview

The rest of this dissertation is structured as follows.

Chapter 2 We discuss in more detail the applications where Diag(f(A)) is needed. We also examine the prior approaches, and determine the areas in which they are insufficient.

Chapter 3 We introduce a method for computing Diag(f(A)) when A is a matrix arising from a toroidal lattice, that is, a lattice where the boundary conditions are periodic. Our method works by exploiting the geometry of the lattice. We show how the problem can be solved in the special case when the lattice has dimensions that are powers of 2, and in the more general case where the dimensions are arbitrarily sized. Finally we show how these methods can be combined with statistical approaches to provide unbiased estimators.

Chapter 4 We discuss the more general class of matrices, where the matrix is not a lattice but still has some structure that can be exploited. We examine structural and spectral types of information, and provide a framework for analyzing when enough structure exists for our algorithms to outperform previous approaches.

Conclusion We summarize our discussion.

Chapter 2

Prior Work and Applications

In this chapter we discuss related applications and prior work. Most prior applications are related to either computing Tr(A^{-1}) or Diag(exp(A)). Given an N × N matrix A with an eigendecomposition A = VEV^{-1}, we have Tr(A^{-1}) = Σ_{i=1}^{N} Diag(A^{-1})_i = Σ_{i=1}^{N} 1/E_ii. If the matrix is small, one could solve the problem directly by performing an LU decomposition [19], and then solving N linear equations for the diagonals of A^{-1}. Alternatively, one could obtain the eigenvalues of A, for example by using the QR method [19], and sum their reciprocals. Unfortunately, these approaches require O(N^3) work, and so are impractical for the matrices that occur in the applications we are interested in.

The exponential of a matrix is given by exp(A) = V exp(E) V^{-1} = Σ_{k=0}^{∞} A^k / k!. If the matrix is small enough, the eigendecomposition of the matrix can be computed, but in most cases of interest this is not practical. There are many methods proposed for forming exp(A) explicitly [69], but all of them require raising A to a high power. This can be computationally difficult as well as requiring impractical amounts of storage, since in many cases A^k will become dense quickly.

2.1 Applications

2.1.1 Statistical Applications

We briefly consider applications of Diag(f(A)) and Tr(f(A)), to motivate our research.

Such applications appear in physics, social network analysis, and statistics among others.

Computing Diag(f(A)) shows up in several statistical problems. The simplest of these is computing the variance of a least squares solution [67]. In the case of a least squares problem, one would like to solve min_x ‖Bx − y‖, where B is some arbitrarily sized matrix representing the observations we are trying to fit, and is likely singular. A solution can be found by computing the normal equations x = A^{-1}z, where A = B^T B and z = B^T y. While x can be found using an iterative method, one would like to know what the variance of the solution is. It can be shown that given B, the covariance matrix of x is (B^T B)^{-1} σ², where σ² is the variance of the error of our fit. We do not have this variance available, but we can estimate it. Define X_{n,p} as our n observations of the p variables we are attempting to fit. Then as an estimator for σ² we have σ² = ‖y − Bx‖² / (n − p) = Σ_{i=1}^{n} e_i² / (n − p), where the e_i are the residuals e_i = y_i − X_{i,1}x_1 − ... − X_{i,p}x_p. The variance of the individual component x_i is then computed as Diag((B^T B)^{-1})_i σ². When B is large, representing many observations, inverting B^T B to obtain Diag((B^T B)^{-1}) is difficult.
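A minimal sketch of this computation (in Python/NumPy, on a small, hypothetical regression problem; the sizes and noise level are arbitrary) makes the role of Diag((B^T B)^{-1}) explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 8                          # n observations of p variables
B = rng.standard_normal((n, p))
x_true = rng.standard_normal(p)
y = B @ x_true + 0.3 * rng.standard_normal(n)

x_hat, *_ = np.linalg.lstsq(B, y, rcond=None)
resid = y - B @ x_hat
sigma2_hat = resid @ resid / (n - p)   # estimator of the error variance

# Variance of each fitted coefficient: Diag((B^T B)^{-1}) * sigma^2.
# Forming (B^T B)^{-1} explicitly is exactly what becomes infeasible when p is large.
var_coeff = np.diag(np.linalg.inv(B.T @ B)) * sigma2_hat
print(var_coeff)
```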

Another area of statistics in which this problem arises, and which originally motivated Hutchinson [45] to develop his method, is fitting a spline to a set of multidimensional noisy data z at irregularly spaced points x. This is done by choosing the function f that minimizes Σ_{i=1}^{n} (z_i − f(x_i))² + ρJ(f), where n is the number of data points, and J(f) is a rotation invariant measure of the roughness of f. This roughness is defined in terms of the partial derivatives of f. The value ρ is a positive value controlling the degree of smoothing of the data, and is chosen to minimize the generalized cross validation function [71], which is defined as GCV = (1/n)‖(I − A)z‖² / ((1/n)Tr(I − A))², where A is the n × n symmetric influence matrix which takes the data values to their fitted values [72]. Forming this influence matrix requires inverting the spline matrix B. Because of this, the main expense of the validation is obtaining Tr(I − A).

7 2.1.2 Lattice Quantum Chromodynamics

Similar statistical issues arise in many areas of physics, and in particular in Lattice Quantum Chromodynamics (LQCD) [31]. QCD is the theory of the behavior of the fundamental force known as the strong interaction, which describes the interactions among quarks, the building blocks of Hadrons. LQCD is a method for simulating these interactions. Since it is a non-perturbative method, it can be used to compute physical properties such as the masses of the various quarks, as well as the observables governing the coupling of the particles [31].

Unfortunately QCD gives rise to path integrals that are difficult to compute directly.

If the system is discretized onto a 4D lattice, they can be approximated using Monte Carlo

Integration. Normally this is done in two stages: by generating gauge fields according to a particular probability distribution, then evaluating a correlation function that depends on these fields. The physical properties of interest are determined by a Monte Carlo average of the correlation functions generated by the ensemble of gauge fields [54].

This approach requires the computation of the trace of A^{-1}. Since the systems arising in this simulation are normally very large, most approaches in this area are based on iterative methods. Aside from the size of the matrices, they are also poorly conditioned, making their solution difficult. In particular, as the simulation parameters are tuned so that they more accurately represent the physical system of interest, the matrix starts to become singular. Therefore, it is important to minimize the number of systems of equations that have to be solved to obtain an estimate for the trace, since each solution may take many iterations to converge. Despite these difficulties, LQCD has been a very successful approach, and our methods have wide applicability to it.

2.1.3 Network Centrality

Finally, we consider an application where the required f(A) is not A^{-1}, but is instead exp(A), as in [68]. The authors are interested in computing the node centrality in a network,

8 a metric of how important a particular node is in a given network. This question arises in social network analysis, as well as network design. In [68], the authors begin by defining a path as a list of distinct vertices connecting two nodes, and define a path that starts and stops at the same node as a closed path. They assume that nodes that have more closed paths are more important. Further, they give closed paths of differing lengths different weights, assigning to shorter paths a higher weight. If we let k(i)j be the number of paths of distance i for node j, and weight the paths with the inverse of their factorial distance,

then we obtain the centrality metric Σ_{i=0}^{∞} k(i)_j / i!. If one recalls that (A^i)_{jj}, which is the j-th diagonal element of the i-th power of A, gives the number of round trip paths of length i for node j, then the desired quantity is Diag(Σ_{i=0}^{∞} A^i / i!) = Diag(exp(A)).
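A small, illustrative sketch of this centrality measure (not from [68]; the toy graph is an arbitrary choice) computes Diag(exp(A)) directly for a tiny adjacency matrix:

```python
import numpy as np
from scipy.linalg import expm

# Toy undirected graph: a ring of 10 nodes plus one chord between nodes 0 and 5.
n = 10
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
A[0, 5] = A[5, 0] = 1.0

centrality = np.diag(expm(A))      # Diag(exp(A)): factorially weighted closed paths
print(np.argsort(-centrality))     # nodes with more closed paths should rank highest
```

For graphs of this size the dense exp(A) is trivial to form; the methods in this work target the regime where it is not.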

2.2 Prior Work

2.2.1 Statistical Methods

Many of the applications shown in the prior section require extremely large matrices. Further, as computational resources expand, the applications will want to increase the size of the matrices in order to achieve more accurate results. This means that it is unlikely that attempts to solve this problem directly through matrix decomposition or eigensolvers will ever be the best choice for most applications. Instead, methods that attempt to statistically estimate it are needed.

The first attempt at such a statistical solution was made by Hutchinson [45], who was interested in calculating Tr(A^{-1}) in order to compute splines. He showed that for a set of random vectors z_i, where each vector element is drawn independently from a Rademacher distribution, in which each element has a 1/2 chance to be 1 or −1,

    Tr(f(A)) = E[z^T f(A) z] ≈ (1/n) Σ_{i=1}^{n} z_i^T f(A) z_i .    (2.1)

In his case, f(A) = A^{-1}. Taking advantage of the fact that A^{-1}z = y can be computed by rewriting it in the form Ay = z and using an iterative solver, it is then possible to estimate the trace of A^{-1} even for very large matrices.
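As an illustration (not code from [45]), a minimal Python/SciPy sketch of the estimator (2.1) for f(A) = A^{-1}, using one sparse solve per Rademacher vector on a small test matrix:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

# Sparse SPD test matrix (shifted 1-D Laplacian) small enough to check the exact trace.
N = 400
A = diags([-1.0, 2.5, -1.0], [-1, 0, 1], shape=(N, N), format='csc')
exact = np.trace(np.linalg.inv(A.toarray()))

rng = np.random.default_rng(1)
n_samples, acc = 200, 0.0
for _ in range(n_samples):
    z = rng.choice([-1.0, 1.0], size=N)   # Rademacher (Z_2 noise) vector
    acc += z @ spsolve(A, z)              # z^T A^{-1} z via one linear solve
print(acc / n_samples, exact)
```

The sample size and matrix here are arbitrary; in the applications above each solve is itself expensive, which is why the slow Monte Carlo convergence matters.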

Following Hutchinson’s work, the authors of [22] investigated several variations on Hutchinson’s method, and proved bounds on their statistical variance, as well as on the number of samples needed to achieve a given accuracy, which can be seen in Table 2.1. Instead of taking the z_i in (2.1) to be random vectors with elements from the Rademacher distribution, they examined the cases where the elements of z_i are Gaussian, where they are selected so that z_i^H z_i = N, which they term the Rayleigh-quotient estimator, and the case where the z_i are random unit vectors, e_i = [0 ... 0 1 0 ... 0], where the 1 is in the i-th location. Additionally, they consider a variation on the scheme of taking z_i as unit vectors. Using the unit vectors directly computes a particular set of diagonal elements, and then attempts to extrapolate the missing diagonal elements from them. However, in cases where the values of the diagonal elements vary widely, this will work poorly. To counteract this, they instead compute (1/n) Σ_{i=1}^{n} z_i^T D^T A D z_i, where D is either the Discrete Fourier Transform (DFT) matrix or the Hadamard matrix. The DFT matrix is generated by D = FFT(I), and the Hadamard matrix [73] is formed recursively, as

    H_1 = [1],   H_2 = [1  1; 1  −1],   H_{2^k} = [H_{2^{k−1}}  H_{2^{k−1}}; H_{2^{k−1}}  −H_{2^{k−1}}] = H_2 ⊗ H_{2^{k−1}} .    (2.2)

The Hadamard matrices have the disadvantage of only having sizes that are powers of two, but avoid the use of complex arithmetic, which is a requirement of the DFT matrix. Since normally we will not need all N columns, we instead generate the matrix column by column, a process which can be done efficiently [66].
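One way to generate a single Hadamard column without storing the matrix (a sketch only; the specific algorithm of [66] is not reproduced here) uses the bit-product formula for the entries, which reappears later as (3.2):

```python
import numpy as np

def hadamard_column(j, N):
    # j-th column (0-indexed) of the N x N Hadamard matrix, N a power of two.
    # Entry i is (-1)^{<binary digits of i, binary digits of j>}, so one column
    # costs O(N log N) bit operations and the full matrix is never formed.
    parities = np.array([bin(i & j).count("1") & 1 for i in range(N)])
    return 1 - 2 * parities            # parity 0 -> +1, parity 1 -> -1

N = 16
H = np.column_stack([hadamard_column(j, N) for j in range(N)])
assert np.array_equal(H @ H.T, N * np.eye(N, dtype=int))   # H H^T = N I
```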

Because these matrices are unitary, D^T D = I, so Tr(D^T A D) = Tr(D^T D A) = Tr(A), but the transformation has the effect of smoothing out the elements of the matrix A, thus making it less likely that an important diagonal element will be missed by the estimator.

While [22] derives upper bounds for the number of vectors needed to achieve the desired accuracy with a given probability, these bounds are not tight. In practice the authors observe that the various methods perform almost identically. These methods converge slowly and cannot be improved without additional information about the matrix. However, it is seldom the case that no useful knowledge of the matrix is available, or that none can be extracted by approximation techniques.

Table 2.1: Convergence rates of different methods.

Estimator | Variance of one sample | Bound on number of samples for an (ε,δ)-approx | Random bits per sample
Gaussian | 2‖A‖_F² | 20 ε⁻² ln(2/δ) | infinite; Θ(n) in floating point
Normalized Rayleigh-quotient | - | (1/2) ε⁻² n⁻² rank²(A) ln(2/δ) κ_f²(A) | -
Hutchinson's | 2(‖A‖_F² − Σ_{i=1}^{n} A_ii²) | 6 ε⁻² ln(2 rank(A)/δ) | Θ(n)
Unit Vector | n Σ_{i=1}^{n} A_ii² − Tr²(A) | (1/2) ε⁻² ln(2/δ) r_D(A), where r_D(A) = n max_i A_ii / Tr(A) | Θ(log n)
Mixed Unit Vector (DFT/Hadamard) | - | 8 ε⁻² ln(4n²/δ) ln(4/δ) | Θ(log n)

2.2.2 Non-Statistical Methods

Several deterministic methods have been proposed that solve this problem exactly, in the case where f(A) = A^{-1}, by performing some form of matrix factorization, and avoid the problem of slow convergence that the statistical methods have. These approaches have serious drawbacks, however. In [49], the authors introduce a method which works by finding a hierarchy of Schur complements of matrices arising from grids, but the run time of this method is of order O(N^{3/2}) and O(N^2) for the 2D and 3D cases respectively. Thus, for sufficiently large N, or for higher dimensional problems, this approach is not practical. Further, it does not address the case of matrices that do not arise from PDEs.

The method introduced in [70] performs an LU factorization and computes the last

diagonal entry of the inverse directly from this factorization. It then reorders the nodes so

that each diagonal is in turn the last element of the LU matrix. In order to avoid computing

a unique LU factorization for every reordering, they decompose each LU factorization into

partial LU factorizations. This method has similar drawbacks to those in [49], requiring

that the matrix arise from a PDE, and has a run time of O(N^2) for a 2D matrix.

Proposed in [67] is a method based on Takahashi’s equations, which allow a subset

of the elements of A−1 to be recursively computed, using only the elements of an LDU

decomposition of A, and the previously generated elements of A−1, with the first step of

the recursive process requiring only the elements of the LDU decomposition to compute.

11 The subset of elements which can be computed in this manner are those elements that are non-zero in the LDU decomposition. Given a sparse LDU decomposition, it follows that the number of elements needed to compute Diag(A−1) is small. However, while the

subset of elements of A^{-1} needed to compute Diag(A^{-1}) may be smaller than that needed

to compute the entire matrix A−1, it can still be quite large.

Alternative approaches have been developed that avoid computing the result directly,

which is infeasible for large problems, while still making use of any information that is

known about the problem. The main idea behind them is to create the vectors in (2.1) in

such a way that any available structure is exploited. This method was first introduced in

[28] . The main insight is that many matrices in practice have an inverse with a periodic

and decaying structure where the magnitude of the elements falls off away from the main

diagonal of the matrix. Therefore we can set the zis to be such that they zero out as many diagonals of the matrix as possible. If the contribution to the error from the diagonals

that have not been zeroed out is small, the results will be very accurate. As more zis are used, more diagonals are zeroed out, and the solution becomes more accurate. Further, we show later how, if this is paired with statistical methods, this will reduce the variance, because there will be fewer elements contributing to the sums in Table 2.1. To achieve this zeroing out effect, the vectors of the Hadamard matrix are used in their natural order, since they zero all contributions to the error except those elements from an increasingly small subset of diagonals, as can be seen in Figure 2.1.

Another useful method supplied in [28] is how to calculate the diagonal of A instead

of simply the trace. They show that the following estimator will converge to the diagonal

    Diag(A) ≈ (Σ_{i=1}^{n} z_i ⊙ A z_i) ⊘ (Σ_{i=1}^{n} z_i ⊙ z_i) ,    (2.3)

where ⊙ is componentwise multiplication and ⊘ is componentwise division, and the z_i are random vectors. This has the same drawback as (2.1), in that while in expectation this will yield the correct answer, in practice convergence can be very slow. Therefore the

12 0 0

20 20

40 40

60 60

80 80

100 100

120 120

0 20 40 60 80 100 120 0 20 40 60 80 100 120 nz = 1024 nz = 512

(a) 16 Hadamard vectors (b) 32 Hadamard vectors

Figure 2.1: The area zeroed out by using Hadamard vectors. As the number of vectors increases, the number of diagonals that contribute to the error decease, and become further from the main diagonal.

question of picking the zis to exploit the structure of A is the same as for estimating the trace.
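For reference, a minimal sketch of the diagonal estimator (2.3) with Rademacher vectors (illustrative only; the matrix and sample count are arbitrary), in which ⊙ and ⊘ become elementwise NumPy operations:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 300
A = rng.standard_normal((N, N))
A = A + A.T                               # any square matrix works; symmetric here

num = np.zeros(N)
den = np.zeros(N)
for _ in range(2000):
    z = rng.choice([-1.0, 1.0], size=N)
    num += z * (A @ z)                    # componentwise product z .* (A z)
    den += z * z                          # componentwise product z .* z
diag_est = num / den                      # componentwise division, as in (2.3)
print(np.max(np.abs(diag_est - np.diag(A))))   # slowly decreasing Monte Carlo error
```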

While the Hadamard based method of [28] works well for a specific class of matrices which are generally those arising from a PDE with a Green’s function describing a force that decays with distance, matrices without this diagonal structure do not benefit as much. An attempt to exploit less regularly ordered structure is behind the idea of probing.

Probing has been a useful technique with a long history in the context of approximating the Jacobian matrix [17, 38], or other matrices [18]. Its use for approximating the diagonal of A^{-1} was proposed in [60] because it finds the most important areas of A^{-1} rather than the fixed structure removed by the Hadamard approach. Probing recovers the diagonals of a sparse matrix by finding the coloring of its associated graph. Coloring a graph involves assigning a color to each vertex in such a way that no two connected vertices share a color.

Unfortunately, finding the optimal coloring in the sense of using the least number of colors is an NP-Complete problem. However, for many graphs a greedy algorithm performs well

[61], and is the approach used in probing. When the rows and columns of a matrix are arranged so that all nodes that share the same color are adjacent, a zero block diagonal

structure will result, as can be seen in Figure 2.2. This structure is due to the fact that since these nodes share a color, they must have no connection (otherwise this would be an invalid coloring).

This block-diagonal zero structure can be exploited to recover the diagonal of the matrix by creating probing vectors. Given a coloring C for A, with c total colors, we will need only c vectors to recover the diagonal. We generate a probing vector for each color m, and set the i-th element of that vector to be 1 if the i-th node of the graph of A was assigned the m-th color, and zero elsewhere, as seen in (2.4). These vectors can then be used with (2.1) or (2.3).

    p^m_i = 1 if i ∈ C_m, and 0 otherwise.    (2.4)

Figure 2.2: Probing a four-colorable graph. The diagonal elements can be recovered using the probing vectors shown.

Unfortunately, f(A) is normally dense. Because of this the associated graph of f(A) is fully connected, and every node will be assigned a unique color. To avoid this, a probing structure for f(A) must be induced by sparsification. If the smallest magnitude elements of f(A) are dropped, then structure in f(A) will appear, which can be exploited by probing. Of course, computing f(A) and then sparsifying it is not easier than the original problem of obtaining

Diag(f(A)). To counteract this issue, [62] introduced the idea of probing using a matrix

polynomial q_n(A). They find a polynomial of the matrix A such that q_n(A) ≈ A^{-1} as the order n of the polynomial increases. In their paper they use the Neumann approximation to A^{-1}, which is A^{-1} ≈ (Σ_{j=0}^{n} (M^{-1}Q)^j) M^{-1}, where A = M − Q and M = Diag(A). However, they are interested only in the structure, and not the values, of q_n(A). Since in the case of the Neumann approximation the nonzero structure of q_n(A) is a subset of the nonzero structure of q_{n+1}(A), they need only color A^n to obtain an approximation to the structure of A^{-1} after sparsification. This method has the drawback that q_n(A) quickly becomes denser and thus expensive to compute for even small n, when A is large.
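A minimal sketch of this classical probing scheme (not the implementation of [62], and not the hierarchical method introduced later): greedily color the boolean pattern of A^k, build the indicator probing vectors of (2.4), and verify that they reproduce Tr(A) exactly. The pattern of A^k is formed explicitly here, which is exactly the expense the remark above warns about.

```python
import numpy as np
from scipy.sparse import diags

def greedy_coloring(S):
    # Greedy distance-1 coloring of the graph whose (sparse) structure is S.
    n = S.shape[0]
    colors = -np.ones(n, dtype=int)
    for i in range(n):
        taken = {colors[j] for j in S[i].indices if colors[j] >= 0}
        c = 0
        while c in taken:
            c += 1
        colors[i] = c
    return colors

# 1-D periodic Laplacian.  The pattern of A^k connects all nodes within distance k,
# so a distance-1 coloring of that pattern is a distance-k coloring of A.
N, k = 32, 2
S = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(N, N), format='lil')
S[0, N - 1] = S[N - 1, 0] = -1.0            # periodic wrap-around links
A = S.tocsr()
pattern_k = (abs(A) ** k).tocsr()           # structure of A^k (values irrelevant)
colors = greedy_coloring(pattern_k)

# Probing vectors as in (2.4): one 0/1 indicator vector per color.
m = colors.max() + 1
Z = np.zeros((N, m))
Z[np.arange(N), colors] = 1.0
trace_est = sum(Z[:, c] @ (A @ Z[:, c]) for c in range(m))
print(m, "colors;", trace_est, "vs exact", A.diagonal().sum())
# Exact for Tr(A); for Tr(A^{-1}) the same vectors only remove error from near neighbors.
```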

An additional drawback of probing is that once the coloring for a particular q_n(A) is obtained, and probing vectors are created, it is possible that the estimate it produces will not be sufficiently accurate. However, it is unlikely that the probing vectors of q_m(A), m < n, will be a subset of those created for q_n(A). In this case, all the previous work computing f(A)z will have to be discarded. Since these results are produced by solving

large linear systems iteratively, this is a serious shortcoming. An example illustrating this

may be seen in Figure 2.3. In the first case, the associated graph of q_1(A) was colored with three colors, but four colors were required to color the graph of q_2(A). Unfortunately the resultant probing vectors do not span the original probing vectors. If

on the other hand the colors had split as seen in the second example, the initial probing

vectors would have been spanned by the new ones, meaning the work of computing f(A)z

need not have gone to waste.

Figure 2.3: An example of wasted vectors in probing, versus an example where the vectors can be reused.

Chapter 3

Estimation of Diag(f(A)) on toroidal lattices

In this chapter, we discuss how to compute Diag(f(A)) in the case where A represents a

toroidal lattice. In this case the geometric information can be leveraged to provide high

quality probing vectors very quickly. In addition, these vectors can be constructed in

such a way that if the accuracy of the result is insufficient, the process can continue by

reusing our previously computed results, a process we term Hierarchical Probing. While

our solution works for lattices of arbitrary dimensions, we also present an even faster

method for lattices which have dimensions of powers of two.

3.1 Lattices with dimensions consisting only of powers of 2

3.1.1 Introduction

Two methods that have previously attempted to address this problem by exploiting the

structure of the matrix A are the approach of using Hadamard vectors [28], and the method

of probing [62]. We combine ideas from both of these methods in order to overcome their

respective shortcomings.

The approach based on the use of Hadamard vectors discussed in Section 2.2.2 borrows ideas from coding theory and selects deterministic vectors for the Monte Carlo (MC) estimate given in (2.1) as columns of a Hadamard matrix [28]. These vectors are orthogonal and, although they produce the exact answer in N steps, their benefit stems from systematically capturing certain diagonals of the matrix, as in Figure 2.1. For example, if we use the first 2^m Hadamard vectors, the error in the trace approximation comes only from non-zero elements on the (k·2^m)-th matrix diagonals, k = 1, ..., N/2^m. Thus, the MC iteration continues annihilating more diagonals with more Hadamard vectors, until it achieves the required accuracy. However, in most practical problems the matrix bandwidth is too large, the non-zero diagonals do not fall on the required positions, or the matrix is not even sparse (which is typically the case for A^{-1}).

Probing [62] attempts to select vectors that annihilate the error contribution from the heaviest elements of A^{-1}. For a large class of sparse matrices, elements of A^{-1} decay exponentially away from the non-zero structure of A. By this we mean that the magnitude of the element A^{-1}_{i,j} relates to the distance of the shortest path between nodes i and j in the graph of A. Assume that the graph of A has a distance-k coloring (or distance-1 coloring of the graph of A^k) with m colors. Then, if we define the vectors z_j, j = 1, ..., m, with z_j(i) = 1 if color(i) = j, and z_j(i) = 0 otherwise, we obtain Tr(A) = Σ_{j=1}^{m} z_j^T A z_j. For Tr(A^{-1}) the equation is not exact, but it does not include errors from any elements of A^{-1} that correspond to paths between vertices that are within distance k in A. The probing technique has been used for decades in the context of approximating the Jacobian matrix [35, 38] or other matrices [57]. Its use for approximating the diagonal of A^{-1} in

[62] (see also [28]) is promising as it selects the important areas of A−1 rather than the predetermined structure dictated by Hadamard vectors. However, the accuracy of the trace estimate obtained through a specific distance-k probing can only be improved by applying Monte Carlo, using random vectors that follow the structure of each probing vector. To take advantage of a higher distance probing, all previous work has to be discarded, and the method rerun for a larger k. We discuss this in Sections 3.1.5 and

3.1.7.

We introduce hierarchical probing, which avoids the problems of the previous two methods. It annihilates error stemming from the heaviest parts of A^{-1}, and it does so incrementally, until the required accuracy is met. To achieve this, we relax the requirement of a distance-k coloring of the entire graph. The idea is to obtain recursively a (suboptimal) distance-2^{i+1} coloring by independently computing distance-1 colorings of the subgraphs corresponding to each color from the distance-2^i coloring. The recursion stops when all the color-subgraphs are dense, i.e., we have covered all distances up to the diameter of the graph. We call this method “hierarchical coloring”. For regular, toroidal lattices each subgroup has the same number of colors, which enables an elegant, hierarchical basis for probing based on an appropriate ordering of the Hadamard and/or Fourier vectors: the first m such vectors constitute a basis for the corresponding m probing vectors. We call this method “hierarchical probing”. It can be implemented using only bit arithmetic, independently on each lattice site. We also address the issue of statistical bias by viewing hierarchical probing as a method to create a hierarchical basis starting from any vector, including random.

Subsection 3.2.2 presents some background for these methods and describes the limitations of classical probing. In Subsection 3.2.3, we introduce the idea of hierarchical coloring and, for the case of uniform grids and tori, we develop a hierarchical coloring method that uses only local coordinate information and bit operations. In Subsection 3.2.4, we use this coloring to produce a sequence of hierarchical probing vectors. In Subsection 3.2.5, we provide several experiments for typical lattices and problems from LQCD that show that MC with hierarchical probing has much smaller variance than random vectors and performs equally well or better than the expensive, classical probing method.

3.1.2 Preliminaries

We use vector subscripts to denote the order of a sequence of vectors, and parentheses to

denote the index of the entries of a vector. We use MATLAB notation to refer to row or

column numbers and ranges. The matrix A, of size N ×N, is assumed to have a symmetric

18 structure (undirected graph).

3.1.3 Lattice QCD problems

Lattice Quantum ChromoDynamics (LQCD) is a formulation of Quantum Chromo-Dynamics

(QCD) that allows for numerical calculations of properties of strongly interacting matter

(Hadron Physics) [64]. These calculations are performed through Monte Carlo computations of the discretized theory on a finite 4-dimensional Euclidean lattice. Physical results are obtained after extrapolation of the lattice spacing to zero. Hence calculations on multiple lattice sizes are required for taking the continuum and infinite volume limits. In this formulation, a large sparse matrix D called the Dirac matrix plays a central role. This matrix depends explicitly on the gauge fields U. The physical observables in a LQCD calculation are computed as averages over the ensemble of gauge field configurations. In various stages of the computation one needs, among other things, to estimate the determinant as well as the trace of the inverse of this matrix. The dimensionality of the matrix is 3 × 4 × L_s³ × L_t, where L_s and L_t are the dimensions of the spatial and temporal directions of the space-time lattice, 3 is the dimension of an internal space named “color”, and 4 is the dimension of the space associated with the spin and particle/antiparticle degrees of freedom. Typical lattice sizes in today's calculations have L_s = 32 and L_t = 64 (already a matrix dimension of 3 · 4 · 32³ · 64 ≈ 2.5 × 10⁷), and the largest calculations performed on leadership class machines at DOE or NSF supercomputing centers have L_s = 64 and L_t = 128. As computational resources increase and precision requirements grow, lattices will become even bigger.

3.1.4 The Monte Carlo method for Tr(A−1)

Hutchinson introduced the standard MC method for estimating the trace of A and proved the following [45].

Lemma 3.1 Let A be a matrix of size N × N and denote by Ã = A − Diag(A). Let z be a Z2 random vector (i.e., one whose entries are i.i.d. Rademacher random variables, Pr(z(i) = ±1) = 1/2). Then z^T A z is an unbiased estimator of Tr(A), i.e.,

    E(z^T A z) = Tr(A),

and

    var(z^T A z) = 2 ‖Ã‖_F² = 2 (‖A‖_F² − Σ_{i=1}^{N} A(i, i)²) .

The MC method converges with rate √(var(z^T A z)/s), where s is the sample size of the estimator (number of random vectors). Thus, the MC converges in one step for diagonal matrices, and very fast for strongly diagonally dominant matrices. More relevant to our Tr(A^{-1}) problem is that large off-diagonal elements of A^{-1} contribute more to the variance 2‖A^{-1} − Diag(A^{-1})‖_F², and thus to slower convergence.

Computationally, z^T A^{-1} z (often regarded as a Gaussian quadrature) can be computed using the Lanczos method [24, 39, 58]. This method also produces upper and lower bounds on the quadrature, which are useful for terminating the process. The alternative of solving the linear system A^{-1}z is not recommended for non-Hermitian systems because of worse floating point behavior [59], but for Hermitian systems it can be as effective if we stop the solver earlier. Specifically, the quadrature error in Lanczos converges as the square of the system residual norm [39], and therefore we need only let the residual converge to the square root of the required tolerance. A potential advantage of solving A^{-1}z is that the result can be reused when computing multiple correlation functions involving bilinear forms y^T A^{-1} z (e.g., in LQCD).
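For concreteness, a minimal sketch of one quadrature z^T A^{-1} z computed with a plain CG solve (illustrative only; it is not the Lanczos quadrature described above, and the test matrix is an arbitrary SPD example):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

# One quadrature z^T A^{-1} z for a Hermitian positive definite A, via the solve A y = z.
N = 1000
A = diags([-1.0, 2.05, -1.0], [-1, 0, 1], shape=(N, N), format='csr')
rng = np.random.default_rng(3)
z = rng.choice([-1.0, 1.0], size=N)

# Per the discussion above, the solver tolerance only needs to reach roughly the
# square root of the accuracy wanted in the quadrature (keyword names vary by SciPy version).
y, info = cg(A, z)
assert info == 0
print(z @ y)          # estimate of z^T A^{-1} z, one sample of the MC estimator
```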

3.1.5 Probing

Probing has been used extensively for the estimation of sparse Jacobians [35, 38], for preconditioning [57], and in Density Functional Theory for approximating the diagonal of a dense projector whose elements decay away from the main diagonal [28, 62]. The idea is to expose the structure and recover the non-zero entries of a matrix by multiplying it with a small, specially chosen set of vectors. For example, we can recover the elements

of a diagonal matrix through a matrix-vector multiplication with the vector of N ones, 1_N = [1, ..., 1]^T. Similarly, a banded matrix of bandwidth b can be found by matrix-vector multiplications with the vectors z_k, k = 1, ..., b, where

    z_k(i) = 1, for i = k : b : N, and z_k(i) = 0 otherwise.

To find the trace (or more generally the main diagonal) of a matrix, the methods are based on the following proposition [28].

Proposition 3.1 Let Z ∈ R^{N×m} with rows Z(i, :) and columns z_k. If Z(i, :) Z(j, :)^T = δ_{ij} for all i, j such that A(i, j) ≠ 0, then Tr(A) = Tr(Z^T A Z) = Σ_{k=1}^{m} z_k^T A z_k.

In the above example of a banded matrix, we choose the vectors zk such that their rows only overlap for structurally orthogonal rows of A (i.e., for rows farther than b apart).
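A minimal sketch of this banded example (illustrative only; here the convention is that A(i, j) = 0 whenever |i − j| ≥ b) shows that b such vectors recover the main diagonal exactly:

```python
import numpy as np

# Recover the main diagonal of a banded matrix with b probing vectors
# z_k(i) = 1 for i = k, k+b, k+2b, ...
N, b = 30, 3
rng = np.random.default_rng(4)
A = np.triu(np.tril(rng.standard_normal((N, N)), b - 1), -(b - 1))   # A(i,j)=0 for |i-j|>=b

diag_rec = np.zeros(N)
for k in range(b):
    z = np.zeros(N)
    z[k::b] = 1.0
    w = A @ z
    diag_rec[k::b] = w[k::b]   # rows sharing one vector are structurally orthogonal
print(np.allclose(diag_rec, np.diag(A)))   # True: the diagonal is recovered exactly
```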

Thus, by proposition 3.1, the trace computed with these zk is exact. If A is not banded but its sparsity pattern is known, graph coloring can be used to

identify the structurally orthogonal segments of rows, and derive the appropriate probing

vectors [62]. Assume the graph of A is colorable with m colors, each color having n(k) number of vertices, k = 1, . . . , m. The coloring is best visualized by letting q be the permutation vector which reorders identically colored vertices together. Then A(q, q) has m blocks along the diagonal, the k-th block is of dimension n(k), and each block is a diagonal matrix. Figure 3.1 shows an example of the sparsity structure of a permuted

4-colorable matrix. Computationally, permuting A is not needed. If we define the vectors:

    z_k(i) = 1 if color(i) = k, and 0 otherwise, for k = 1, ..., m,    (3.1)

we see that Proposition 3.1 applies, and therefore Tr(A) = Σ_{k=1}^{m} z_k^T A z_k.

Figure 3.1: Visualizing a 4-colorable matrix permuted such that all rows corresponding to color 1 appear first, rows for color 2 appear second, and so on. Each diagonal block is a diagonal matrix. The four probing vectors with 1s in the corresponding blocks are shown on the right.

When the matrix is dense and all its elements are of similar magnitude, there is no structure to be exploited by probing. The inverse of a sparse matrix is typically dense,

21 1 0 0 0 0 . . 0 . . . 1 0 . 0 0 1 0 1 0 . 0 1 . 0 . . . 0 . . . 1 0 0 0 1 0 0 0 0 1

Figure 3.1: Visualizing a 4-colorable matrix permuted such that all rows corresponding to color 1 appear first, for color 2 appear second, and so on. Each diagonal block is a diagonal matrix. The four probing vectors with 1s in the corresponding blocks are shown on the right. but, for many applications, its elements decay on locations that are farther from the locations of the non-zero elements of A. Such small elements of A−1 can be dropped, and

the remaining A−1 is sparse and thus colorable. Diagonal dominance of the matrix is a

sufficient (but not necessary) condition for the decay to occur [35, 62]. This property is

exploited by approximate inverse preconditioners and can be explained from various points

of view, including Green’s function for differential operators, the power series expansion

of A−1, or a purely graph theoretical view [29, 30, 34, 44]. In the context of probing, we

drop elements A−1(i, j) where the vertices i and j are farther than k links apart in the

graph of A. Because this graph corresponds to the matrix Ak, our required distance-k

coloring is simply the distance-1 coloring of the matrix Ak [38, 62]. Computing Ak for

large k, however, is time and/or memory intensive.

The effectiveness of probing depends on the decay properties of the elements of A−1,

and the choice of k in the distance-k coloring. The problem is that k depends both on

the structure and the numerical properties of the matrix. If elements of A−1 exhibit slow

decay, choosing k too small does not produce sufficiently accurate estimates because large

elements of A−1 (linking vertices that are farther than k apart) contribute to the variance

in Lemma 3.1. Choosing k too large increases the number of quadratures unnecessarily,

and more importantly, makes the coloring of Ak prohibitive. This problem has also been

identified in [62] but no solution proposed.

22 A conservative approach is to use probing for a small distance (typically 1 or 2) to remove the variance associated only with the largest, off-diagonal parts of the matrix.

Then, for each of the resulting m probing vectors, we generate s random vectors that

follow the non-zero structure of the corresponding probing vector, and perform s MC

steps (requiring ms quadratures). In LQCD this method is called dilution. In its most

common form it performs a 2-color (red-black) ordering on the uniform lattice and uses

the MC estimator to compute two partial traces: one restricted on the red sites, the

other on the black sites of the lattice [25, 37, 51]. Therefore, all variance caused by the

direct red-black connections of A−1 is removed. The improvement is modest, however, so

additional “dilution” is required [23, 51].

3.1.6 Hadamard vectors

An N × N matrix H is a Hadamard matrix of order N if it has entries H(i, j) = ±1 and

HH^T = NI, where I is the identity matrix of order N [28, 43]. N must be 1, 2, or a multiple of 4. We restrict our attention to Hadamard matrices whose order is a power of

2, and can be recursively obtained as:

H_2 = [1 1; 1 −1],   H_{2n} = [H_n H_n; H_n −H_n] = H_2 ⊗ H_n.

For powers of two, Hn is also symmetric, and its elements can be obtained directly as

H_n(i, j) = (−1)^{Σ_{k=1}^{log N} i_k j_k},   (3.2)

where (i_{log N}, . . . , i_1)_2 and (j_{log N}, . . . , j_1)_2 are the binary representations of i−1 and j−1, respectively. We also use the following notation to denote Hadamard columns (vectors):

hj = Hn(:, j + 1), j = 0, . . . , n − 1. Hadamard matrices are often called the integer version of the discrete Fourier matrices,

F_n(j, k) = e^{2π(j−1)(k−1)√−1/n}.   (3.3)

For n = 2, H2 = F2, but for n > 2, Fn are complex. These matrices have been studied extensively in coding theory where the problem is to design a code (a set of s < N vectors

Z) for which ZZ^T is as close to identity as possible [28]. H_n and F_n vectors satisfy the well-known Welch bounds but H_n vectors do not achieve equality [63]. Moreover, F_n are not restricted to powers of two. Still, Hadamard matrices involve only real arithmetic,

which is important for efficiency and interoperability with real codes, and it is easy to

identify the non-zero pattern they generate. Later, we will view the Hadamard matrix as

a methodical way to build an orthogonal basis.
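As a quick illustration of (3.2) (a sketch using 0-based indices i, j, i.e., i−1 and j−1 in the 1-based notation above), a single Hadamard entry can be computed from the parity of the bitwise AND of its indices:

    import numpy as np

    def hadamard_entry(i, j):
        """Entry of the natural-order Hadamard matrix via the parity of i AND j."""
        return -1.0 if bin(i & j).count("1") % 2 else 1.0

    n = 8
    H = np.array([[hadamard_entry(i, j) for j in range(n)] for i in range(n)])
    assert np.allclose(H @ H.T, n * np.eye(n))      # H H^T = N I
    assert np.allclose(H, H.T)                      # symmetric for powers of two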

Consider the first 2^k columns of a Hadamard matrix Z = H(:, 1 : 2^k). The non-zero pattern of the matrix ZZ^T consists of the i·2^k upper and lower diagonals, i = 0, 1, . . . [28].

Because Tr(Z^T A^{-1} Z) = Tr(A^{-1} ZZ^T) and because of Lemma 3.1 and Proposition 3.1, the error in the MC estimation of the trace is induced only by the off-diagonal elements of A^{-1} that appear on the same locations as the non-zero diagonals of ZZ^T. If the matrix is banded or its diagonals do not coincide with the ones of ZZ^T, the trace estimation is exact. When the off-diagonal elements of A^{-1} decay exponentially away from the main diagonal, increasing the number of Hadamard vectors achieves a consistent (if not monotonic) reduction of the error. We note that this special structure of ZZ^T is achieved only when the number of vectors, s, is a power of two. For 2^k < s < 2^{k+1}, the structure of

ZZ^T is dense in general, but the weight of ZZ^T elements is largest on the main diagonal

(equal to s) and decreases between diagonals i·2^k and (i + 1)·2^k. Thus, estimation accuracy improves with s, even for dense matrices. However, to annihilate a certain sparsity structure of a matrix, the estimates at only s = 2^k should be considered. Similar properties apply for the F_n matrices.

3.1.7 Overcoming probing limitations

We seek to construct special vectors for the MC estimator that perform at least as well as Z2 noise vectors, but can also exploit the structure of the matrix, when such structure exists. Although Hadamard vectors seem natural for banded matrices, they cannot handle


Figure 3.2: Crossed out nodes have their contribution to the error canceled by the Hadamard vectors used. Left: the first two, natural order Hadamard vectors do not cancel errors in some distance-1 neighbors in the lexicographic ordering of a 2-D uniform lattice. Right: if the grid is permuted with the red nodes first, the first and the middle Hadamard vectors completely cancel variance from nearest neighbors and correspond to the distance-1 probing vectors.

deviations from this structure. For example, the first two Hadamard vectors compute the exact trace of a tridiagonal matrix. For the matrix that corresponds to a 2-D uniform lattice of size 2^n × 2^n with periodic boundary conditions and lexicographic ordering, producing the exact trace requires the first 2^{n+1} Hadamard vectors. However, if we consider the red-black ordering of the same matrix, only two Hadamard vectors, h_0 and h_{2^{2n-1}} (the middle column), are sufficient, as shown in Figure 3.2.

The previous example shows that although Hadamard vectors are a useful tool, probing

is the method that discovers matrix structure. Therefore, we turn to the problem of how

to perform probing efficiently on Ak and for large k. Ideally, a method should start with

a small k and increase it until it achieves sufficient accuracy. However, the colorings,

and thus the probing vectors, for two different k’s are not related in general. Thus, in

addition to the expense of the new coloring, all previously performed quadratures must

be discarded.

First let us persuade the reader that work from a previous distance-k probing cannot

be reused in general. Assume the distance-1 coloring of a matrix of size 6 produced three

colors: color 1 has rows 1 and 2, color 2 has rows 3 and 4, color 3 has rows 5 and 6. Next

we perform a distance-2 coloring of A, and assume there are four colors: color 1 has row 1,

color 2 has rows 2 and 3, color 3 has rows 4 and 5, color 4 has row 6. As in Figure 3.1, the

distance-1 and distance-2 probing vectors, Z^{(1)} and Z^{(2)} respectively, are the following:

 1 0 0   1 0 0 0   1 0 0   0 1 0 0  (1)  0 1 0  (2)  0 1 0 0  Z =   ,Z =   .  0 1 0   0 0 1 0   0 0 1   0 0 1 0  0 0 1 0 0 0 1

Unfortunately, the three computed quadratures Z^{(1)T} A^{-1} Z^{(1)} (or solutions to A^{-1} Z^{(1)}) cannot be used to avoid recomputation of the four quadratures Z^{(2)T} A^{-1} Z^{(2)}.

Consider now a matrix of size 8 with two colors in its distance-1 coloring. Assume that its distance-2 coloring produces four colors, and that all rows with the same color belong also to the same color group for distance-1. Then the subspace of the corresponding probing vectors is spanned by certain Hadamard vectors:

Z^{(1)} = [1 0; 1 0; 1 0; 1 0; 0 1; 0 1; 0 1; 0 1] ∈ span([1 1; 1 1; 1 1; 1 1; 1 −1; 1 −1; 1 −1; 1 −1]),

Z^{(2)} = [1 0 0 0; 1 0 0 0; 0 1 0 0; 0 1 0 0; 0 0 1 0; 0 0 1 0; 0 0 0 1; 0 0 0 1] ∈ span([1 1 1 1; 1 1 1 1; 1 1 −1 −1; 1 1 −1 −1; 1 −1 1 −1; 1 −1 1 −1; 1 −1 −1 1; 1 −1 −1 1]).

The four Hadamard vectors are h0, h4, h2, h6. More interesting than the equality of the spans is that the two bases are an orthogonal transformation of each other. Specifically,

Z^{(1)} = (1/2)[h_0, h_4]H_2 and Z^{(2)} = (1/4)[h_0, h_4, h_2, h_6]H_4. Because the trace is invariant under orthogonal transformations, we can use the Hadamard vectors instead (as we implicitly

did in Figure 3.2 for the lattice). Clearly, for this case, the quadratures of the first two

vectors can be reused so that the distance-2 probing will need computations for only two

additional vectors.
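The 8×8 example above can be checked numerically. The short sketch below (assuming SciPy's natural-order Hadamard matrix) verifies that both sets of probing vectors lie in the span of the stated Hadamard columns:

    import numpy as np
    from scipy.linalg import hadamard

    H = hadamard(8)                                      # natural (Sylvester) order
    Z1 = np.kron(np.eye(2), np.ones((4, 1)))             # distance-1 probing vectors
    Z2 = np.kron(np.eye(4), np.ones((2, 1)))             # distance-2 probing vectors
    for Z, cols in [(Z1, [0, 4]), (Z2, [0, 4, 2, 6])]:
        B = H[:, cols].astype(float)
        P = B @ np.linalg.inv(B.T @ B) @ B.T             # projector onto span(B)
        assert np.allclose(P @ Z, Z)                     # Z is contained in span(B)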

A key difference between the two examples is the nesting of colors between successive

colorings. In general, such nesting cannot be expected and thus an incremental approach

to probing will necessarily discard prior work. A second difference is that all color groups

are split into the same number of colors in the successive coloring. To achieve these desired

characteristics, we develop first a hierarchical, all-distance coloring, and then represent its probing basis through a convenient set of vectors. We explain the general hierarchical coloring idea next.

3.1.8 Hierarchical coloring

The key idea is to enforce nesting of colors in successive distance colorings by looking at each color group independently and coloring its nodes for the next higher distance using only local information from that color group. We apply this recursively until every node of the graph is colored uniquely.

We begin this recursive approach by finding a distance-1 coloring of the graph, thus partitioning its nodes into separate color groups. In subsequent levels of the hierarchy, the nodes in each group will never share a color with the nodes in another group. To move to the next level, we use the graph at the current level to produce a distance-2 connectivity among the nodes within each color group. Then we apply the algorithm recursively to the induced subgraphs for each color group independently. It is a straightforward inductive observation that at every level we generate a distance-2^i coloring, for all i = 0, 1, . . ., until the distance covers the diameter of the graph.
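The recursion can be sketched for a general sparse matrix as follows (an illustrative Python sketch only; the function names are ours, and a simple greedy coloring is used within each group):

    import numpy as np
    import scipy.sparse as sp

    def _greedy_color(G):
        """Greedy distance-1 coloring of the graph of the sparse matrix G."""
        G = G.tocsr()
        out = np.zeros(G.shape[0], dtype=int)
        for v in range(G.shape[0]):
            taken = {out[u] for u in G.indices[G.indptr[v]:G.indptr[v + 1]] if u < v}
            c = 0
            while c in taken:
                c += 1
            out[v] = c
        return out

    def hierarchical_coloring(A, max_levels=8):
        n = A.shape[0]
        labels = [() for _ in range(n)]                  # nested color of each node
        groups = [np.arange(n)]
        G = (sp.csr_matrix(A) != 0).astype(float)
        for _ in range(max_levels):
            if all(len(g) == 1 for g in groups):
                break
            new_groups = []
            for g in groups:
                cols = _greedy_color(G[g][:, g])          # recolor this group only
                for v, c in zip(g, cols):
                    labels[v] = labels[v] + (c,)          # nested (hierarchical) colors
                new_groups += [g[cols == k] for k in range(cols.max() + 1)]
            groups = new_groups
            G = G @ G                                      # double the covered distance
            G.data[:] = 1.0
        return labels

Because colors are only ever refined within a group, the tuple assigned to a node at level i is a prefix of its tuple at level i + 1, which is exactly the nesting property needed to reuse previous quadratures.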

Hierarchical coloring produces more colors at distance-2^i than classical coloring of the graph of A^{2^i}. If the task were to approximate the trace of the matrix A^{2^i}, the extra colors would be redundant and the additional probing vectors would represent unnecessary work. However, we approximate the trace of A^{-1}, which is dense. Thus, the larger number of hierarchical probing vectors at distance-2^i will also approximate some elements that represent node distances larger than 2^i, yielding a larger variance reduction than the corresponding A^{2^i} classical probing method.

3.1.9 Hierarchical coloring on lattices

Uniform d-D lattices allow for an extremely efficient implementation of the hierarchical

coloring approach, based entirely on bit-arithmetic.

Consider first the 1-D case, where the lattice has N = 2^k points, with k = log_2 N, which guarantees the 2-colorability of the 1-D torus. Any point has a coordinate 0 ≤ x ≤

N − 1 with a binary representation: [bk, bk−1, . . . , b1] = dec2bin(x). At the first level, the distance-1 coloring is simply red-black (we associate red with 0 and black with 1), and x gets the color of its least significant bit (LSB), b1. In the coloring permutation, we order first the N/2 red nodes. At the second level, we consider red and black points separately and split each color again, but now based on the second bit b2. Thus, points [∗ ∗ ... ∗ ∗00] and [∗∗...∗∗10] take different colors, and by construction all colors are given hierarchically.

The second level permutation will not mix nodes between the first two halves of the first level, but will permute nodes within the respective halves, i.e., points with 0 in the LSB always appear in the first half of the permutation. The process is repeated recursively for each color, until all points have a different color.

The binary tree built by the recursive algorithm splits the points of a subtree in half at the i-th level based on bi. Thus, to find the final permutation we trace the path from the root to a leaf, producing the binary string: [b1b2 . . . bk], which is the bit reversed string for x. Denote by P the final permutation array such that node x = 0,...,N − 1 in the original ordering is found in location P (x) of the final permutation. Then,

P (x) = bin2dec(bitreverse(dec2bin(x))) (3.4) and the computation is completely independent for any coordinate.
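Equation (3.4) translates directly into code; for example (a sketch, with 0-based coordinates):

    # A direct rendering of (3.4), assuming N = 2^k lattice points.
    def hier_perm_1d(x, k):
        """Location of point x in the 1-D hierarchical (bit-reversed) permutation."""
        bits = format(x, "0{}b".format(k))     # dec2bin with k digits
        return int(bits[::-1], 2)              # bitreverse, then bin2dec

    # Example for N = 8: the red points (LSB 0) occupy the first half.
    print([hier_perm_1d(x, 3) for x in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]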

Extending to torus lattices of d dimensions, where N = ∏_{j=1}^d 2^{k_j}, has three complications: First, the subgraph of the same color nodes is not a conformal uniform lattice.

Second, the geometry does not allow a simple bit reversal algorithm. Third, not all dimensions have the same size (k_j ≠ k_i). The following sections address these.

3.1.10 Splitting color blocks into conformal d-D lattices

Consider a point with d coordinates (x_1, x_2, . . . , x_d). Let [b^j_{k_j}, . . . , b^j_2, b^j_1] be the binary representation of coordinate x_j with 0 ≤ x_j < 2^{k_j}. We know that uniform lattices are 2-colorable, so at the first level, red-black ordering involves the least significant bit of all coordinates. The color assigned to the point is mod(Σ_{j=1}^d b^j_1, 2). However, the red partition, which is half of the lattice points, is not a regular d-dimensional torus. Every

red point is distance-2 away from any red neighbor, and therefore it has more neighbors

(e.g., in case of 2-D it is connected with 8 neighbors, in 3-D with 18, and so on). To facilitate a recursive approach, we observe that the reds can be split into 2^{d-1} d-dimensional sublattices, if we consider them in groups of every other row in each dimension. Similarly for the blacks. For the 2-D case this is shown in Figure 3.3.


Figure 3.3: When doubling the probing distance (here from 1 to 2) we first split the 2-D grid to four conformal 2-D subgrids. Red nodes split to two 2 × 2 grids (red and green), and similarly black nodes split to blue and black. Smaller 2-D grids can then be red-black ordered.

This partitioning is obtained based on the value of the binary string: [b^1_1, b^2_1, . . . , b^d_1]. For each value, the resulting sublattice contains all points with the given least significant bits in its d coordinates. Because each coordinate loses one bit, the size of each sublattice is

∏_{j=1}^d 2^{k_j - 1}. At this second level, each of the 2^d sublattices can be 2-colored independently. Each sublattice will receive a distinct coloring, which will be hierarchical, as long as we track the original color of the nodes.
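For illustration, the first-level split can be computed directly from the least significant bits of the coordinates (a sketch; the bit order within the index is a convention we choose here):

    def first_level_sublattice(x):
        """Index of the 2^d conformal sublattice containing point x (d coordinates)."""
        return sum((xj & 1) << j for j, xj in enumerate(x))

    print(first_level_sublattice((3, 2)))   # LSBs (1, 0) -> sublattice 0b01 = 1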

3.1.11 Facilitating bit reversal in higher dimensions

The above splitting based on the LSBs from the d coordinates does not order the adjacent colors together. For example, the partitioning at the first level of d = 2 gives four sub-

lattices (00,01,10,11) of which the 00 and 11 are reds while 01 and 10 are blacks. We can

recursively continue partitioning and coloring the sublattices. However, if we concatenate

at every level the new 2 bits from the 2 coordinates, as in the bit reversed pattern in the

1-D case, the resulting ordering is not hierarchical. In our example, all red points in the

first level are ordered in the first half, but at the second level, the colors associated with

the 00 reds will be in the first quarter of the ordering, while the colors associated with

the 11 reds will be in the fourth quarter of the ordering. Since the hierarchical ordering

is critical for reusing previous work, we order the four sublattices not in the natural order

(00,01,10,11) but in a red black order: (00 11 01 10). Algorithm 1 produces this Red-Black

reordering in d dimensions.

A more computationally convenient way to obtain the RB permutation is based on

the fact that every point on the stencil has neighbors of opposite color. In other words,

color([x1, . . . , xd]) = ¬color([x1, . . . , xd] ± ej), where ej is the unit row-vector in the j dimension, j = 1, . . . , d, and ¬ is the logical not. With two points per dimension, in

one dimension the colors are c1 = [0, 1]. Inductively, if the colors in dimension d − 1 are cd−1, the second d − 1 plane in dimension d will have the opposite colors, and thus:

cd = [cd−1, ¬cd−1]. Therefore, we can create the RB with only a check per point instead of counting coordinate bits. This is shown in Algorithm 2.

Algorithm 1 Red-Black order of the 2^d torus (slow)
    RB = bitarray(2^d, d)
    reds = 0, blacks = 2^{d-1}
    for i = 0 : 2^d - 1
        if dec2bin(i) has an even number of 1 bits
            newbits = dec2bin(reds, d)
            reds = reds + 1
        else
            newbits = dec2bin(blacks, d)
            blacks = blacks + 1
        RB(i, :) = newbits

Algorithm 2 Red-Black order of the 2^d torus (fast)
    c_0 = 0
    for j = 1 : d
        c_j = [c_{j-1}, ¬c_{j-1}]
    RB = bitarray(2^d, d)
    reds = 0, blacks = 2^{d-1}
    for i = 0 : 2^d - 1
        if c_d(i) == 0
            newbits = dec2bin(reds, d)
            reds = reds + 1
        else
            newbits = dec2bin(blacks, d)
            blacks = blacks + 1
        RB(i, :) = newbits
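For reference, a Python transcription of Algorithm 2 follows (a sketch; variable names follow the pseudocode, and the returned RB stores the new index of each corner as an integer rather than as a bit string):

    import numpy as np

    def red_black_order(d):
        """RB array: RB[i] is the new d-bit index of corner i of the 2^d torus."""
        c = np.zeros(1, dtype=int)                  # c_0 = [0]
        for _ in range(d):
            c = np.concatenate([c, 1 - c])          # c_j = [c_{j-1}, not c_{j-1}]
        RB = np.zeros(2 ** d, dtype=int)
        reds, blacks = 0, 2 ** (d - 1)
        for i in range(2 ** d):
            if c[i] == 0:
                RB[i], reds = reds, reds + 1
            else:
                RB[i], blacks = blacks, blacks + 1
        return RB

    print(red_black_order(2))    # [0 2 3 1]: reds (00, 11) ordered first, blacks (01, 10) second

For d = 2 this reproduces the ordering (00 11 01 10) discussed above, and it agrees with Lemma 3.4.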

We are now ready to combine the Red-Black reordering with the bit-reversing scheme to address the d-dimensional case. First, assume that the lattice has the same size in each dimension, i.e., kj = k, ∀j = 1, . . . , d. Then the needed permutation is given by Algorithm 3.

3.1.12 Lattices with different sizes per dimension

At every recursive level, our algorithm splits the size of each dimension in half (removing

one bit), until there is only 1 node per dimension. When the dimensions do not all have

the same size, some of the dimensions reach 1 node first and beyond that point they are

Algorithm 3 Hierarchical permutation of the lattice – case k_j = k

% Input:
%   the coordinates of a point x = (x_1, x_2, . . . , x_d)
%   the global RB array produced by Algorithm 2
% Output:
%   The location in which x is found in the hierarchical permutation
function loc = LatticeHierPermutation0((x_1, x_2, . . . , x_d))
    % Make a d × k table of all the coordinate bits
    for j = 1 : d
        (b^j_k, . . . , b^j_2, b^j_1) = dec2bin(x_j)
    % Accumulate bit-reversed order. Start from LSB
    loc = [ ]
    for i = 1 : k
        % A vertical section of bits. Take the i-th bit of all coordinates
        % and permute it to the corresponding red-black order
        (s^1, . . . , s^d) = RB(bin2dec(b^1_i, b^2_i, . . . , b^d_i))
        % Append this string to create the reverse order string
        loc = [loc, (s^1, . . . , s^d)]
    return bin2dec(loc)

not subdivided. If m out of d dimensions have reached size 1, the above algorithm should continue as in a d−m dimension lattice, at every level concatenating only the active d−m bits in loc. In this case, however, the red-black permutation RB should correspond to that of a d−m dimensional lattice. The following three lemmas and the theorem, whose proofs are in the Appendix, allow us to avoid computing and storing RBj for each j = 1, . . . , d.

As before, we consider cd the array of 0/1 colors of the 2-point, d-dimensional torus.

Lemma 3.2 For any d > 0, c_d(2i) = c_{d-1}(i), i = 0, . . . , 2^{d-1} - 1.

Lemma 3.3 For any d > 0, c_d(2i) = ¬c_d(2i + 1), i = 0, . . . , 2^{d-1} - 1.

Lemma 3.4 For any d > 0 the values of RB_d(i), i = 0, . . . , 2^d - 1, are given by:

RB_d(i) = { ⌊i/2⌋, if c_d(i) = 0;   ⌊i/2⌋ + 2^{d-1}, if c_d(i) = 1 }.

We can now show how RBm, m < d, can be obtained from RBd.

Theorem 3.2 Let RB_d be the permutation array that groups together the same colors in a red-black ordering of the two-point, d-dimensional lattice, as produced by Algorithm 2.

For any 0 < m < d,

RB_m(i) = ⌊RB_d(i·2^{d-m}) / 2^{d-m}⌋,   i = 0, . . . , 2^m - 1.

The theorem says that given RB_d in bit format, RB_m is obtained as the left (most significant) m bits of every 2^{d-m}-th number in RB_d. We now have all the pieces needed to modify Algorithm 3 to produce the permutation of the hierarchical coloring of a d-dimensional lattice torus of size N = ∏_{j=1}^d 2^{k_j}.

For d > 1, Algorithm 4 differs from the general approach discussed in the beginning of Section 3.1.8 because it pre-splits groups of colors into conformal lattices before it induces the new connectivity. The difference in the actual number of colors, however, is small. At level i = 0, 1, . . ., Algorithm 4 performs a distance-(2^{i+1} - 1) coloring and produces 2^{d(i+1)} colors.

For classical probing, the minimum number of colors required for a distance-(2^{i+1} - 1) coloring of lattices is not known for d > 2 [32]. An obvious lower bound is the number of lattice sites in the “unit sphere” of graph diameter 2^i. If \binom{d}{j} denotes the binomial coefficient, with a 0 value if d < j, the lower bound is given by [26, Theorem 2.7]:

Σ_{j=0}^{d} \binom{d}{j} \binom{d - j - 1 + 2^{i-1}}{d}.

For sufficiently large distances, this is O(2^{3i-1}/3) for d = 3, and O(2^{4i-3}/3) for d = 4.

Thus, we can bound asymptotically how many more colors our method gives:

(Number of colors in hierarchical probing) / (Number of colors in classical probing) < 12 if d = 3, and < 48 if d = 4.

Algorithm 4 Hierarchical permutation of the lattice – case 2^{k_i} ≠ 2^{k_j}

% Input:
%   the coordinates of a point x = (x_1, x_2, . . . , x_d)
%   the global RB array produced by Algorithm 2
% Output:
%   The location in which x is found in the hierarchical permutation
function loc = LatticeHierPermutation((x_1, x_2, . . . , x_d))
    % Make a d × max(k_j) table of all the coordinate bits
    % Dimensions with smaller sizes only have up to k_j bits set
    for j = 1 : d
        (b^j_{k_j}, . . . , b^j_2, b^j_1) = dec2bin(x_j)
    % Accumulate bit-reversed order. Start from LSB
    loc = [ ]
    for i = 1 : max(k_j)
        % A vertical section of bits. Take the i-th bit of all coordinates
        % in dimensions that can still be subdivided (i ≤ k_j).
        % Record number of such dimensions
        activeDims = 0
        bits = [ ]
        for j = 1 : d
            if (i ≤ k_j)
                bits = [bits, b^j_i]
                activeDims = activeDims + 1
        % permute it to the corresponding red-black order using RB_i
        index = bin2dec(bits) · 2^{d - activeDims}
        (s^1, . . . , s^{activeDims}) = ⌊RB(index) / 2^{d - activeDims}⌋
        % Append this string to create the reverse order string
        loc = [loc, (s^1, . . . , s^{activeDims})]
    return bin2dec(loc)

In practice, we have observed ratios of 2–3. On the other hand, because hierarchical probing uses more vectors, the variance reduction it achieves when a certain distance coloring completes, i.e., after 2^{d(i+1)} quadratures in the MC estimator, is typically better

than classical probing for the same distance.

In terms of computational cost, this algorithm is very efficient, especially when com-

pared with classical probing. As an example, producing the hierarchical permutation of

a 64^4 lattice takes about 6 seconds on a MacBook Pro with a 2.8 GHz Intel Core 2 Duo.

More importantly, the permutation of each coordinate is obtained independently which

facilitates parallel computing.

3.1.13 Coloring lattices with non-power of two sizes

Consider a lattice of size N = ∏_{j=1}^d n_j. Sometimes, LQCD may generate lattices where one or more n_j are not powers of two. In this case, it is typical that n_j = 2^m p, where p ≠ 2 is a small prime number. Our hierarchical coloring method works up to m levels, but then the remaining subgrids are of odd size in the j-th dimension, causing coloring

conflicts because of wrap-around. In the next theorem, whose proof is in Appendix, we

show that such a lattice is three colorable.

Theorem 3.3 A toroidal, uniform lattice of size N = ∏_{j=1}^d n_j, where one or more n_j are odd, admits a three-coloring with point x = (x_1, . . . , x_d) receiving color:

C(x) = ( Σ_{j=1}^d x_j + Σ_{j=1}^d δ(x_j) ) mod 3,  where δ(x_j) = 1 if (x_j = n_j - 1) and (n_j - 1 mod 3 = 0), and δ(x_j) = 0 everywhere else.
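Theorem 3.3 is easy to check numerically on a small torus (a sketch; the lattice size below is arbitrary):

    import itertools

    def three_color(x, n):
        """Theorem 3.3: C(x) = (sum_j x_j + sum_j delta(x_j)) mod 3."""
        delta = sum(1 for xj, nj in zip(x, n) if xj == nj - 1 and (nj - 1) % 3 == 0)
        return (sum(x) + delta) % 3

    n = (4, 9)                                        # toroidal lattice with an odd dimension
    for x in itertools.product(*[range(nj) for nj in n]):
        for j in range(len(n)):
            y = list(x); y[j] = (y[j] + 1) % n[j]     # wrap-around neighbour in dimension j
            assert three_color(x, n) != three_color(tuple(y), n)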

After the three-coloring is produced, further hierarchical colorings are not practical,

since the coloring yields blocks of nodes that are not conformal lattices, and are of irregular shapes. This prevents the use of a splitting method similar to the one described in the previous sections.

Because of this, for lattices which have dimensions with factors other than two, we can

proceed only one level further after the factors of two have been exhausted by the hier-

35 archical coloring algorithm. This is not a shortcoming in LQCD, since, by construction, lattices have dimensions with at most one odd factor. Finally, note that the number of hi- erarchical probing vectors produced before exhausting the powers of two in each dimension is typically large, obviating the need for a last three-coloring step.

3.1.14 Generating the probing basis

Assume for the moment each color group at any level of the hierarchical method is colored with exactly two colors. Consider also the permuted matrix A(perm, perm) so that colors

appear in the block diagonal. In Section 3.1.7, we saw the Hadamard vectors required for

probing the first two levels of this recursion for an 8×8 matrix: [h_0, h_4] and [h_0, h_4, h_2, h_6].

If ⊗ denotes the Kronecker product and 1_k = [1, . . . , 1]^T the vector of k ones, these can be written as:

[h_0, h_4] = H_2 ⊗ 1_4,
[h_0, h_4, h_2, h_6] = [H_2 ⊗ H_2(:, 1), H_2 ⊗ H_2(:, 2)] ⊗ 1_2.

This pattern extends to any recursion level i = 1, 2, . . . , log_2 N. If we denote by Z^{(i)} the Hadamard vectors that span the i-th level probing vectors, these are obtained by the following recursion:

Z̃^{(1)} = H_2,
Z̃^{(i)} = [Z̃^{(i-1)} ⊗ H_2(:, 1), Z̃^{(i-1)} ⊗ H_2(:, 2)],
Z^{(i)} = Z̃^{(i)} ⊗ 1_{N/2^i}.   (3.5)

Intuitively, this says that at every level, we should repeat the pattern devised in the previous level to double the domains for the first 2^{i-1} vectors (Kronecker product with [1, 1]^T), and then should split each basic subdomain in two opposites (Kronecker product with [1, -1]^T).

The hierarchy Z^{(i-1)} = Z^{(i)}(:, 1 : 2^{i-1}) implies that quadratures performed with Z^{(i-1)} can be reused if we need to increase the probing level. To obtain the m-th probing vector, therefore, we can consider the m-th vector of Z^{(log_2 N)}. Its rows can be constructed piece by piece recursively through (3.5) and without constructing all of Z^{(log_2 N)}. In fact, we can even avoid the recursive construction and compute any arbitrary element Z^{(log_2 N)}(j, m) directly. This is useful in parallel computing where each processor generates only the local rows of this vector. The reason is that recursion (3.5) produces a known permutation of the natural order of the Hadamard matrix, specifically the column indices are:

0,N/2,N/4, 3N/4,N/8, 5N/8, 3N/8, 7N/8,.... (3.6)

We can compute a-priori this column permutation array, Hperm, for all N, or for as many vectors as we plan to use in the MC estimator. Then by using the inverse permutation

(iperm) associated with the given hierarchical coloring, the j-th element of the m-th probing vector can be computed directly through (3.2) as:

z_m(j) = H_N(iperm(j), Hperm(m)).   (3.7)
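A short sketch of (3.6)–(3.7) follows (0-based indices; the column sequence (3.6) coincides with the bit-reversal permutation of the column index, which the code uses, and iperm is whatever inverse hierarchical permutation is in effect, here the 1-D one of (3.4)):

    def bitrev(x, k):
        return int(format(x, "0{}b".format(k))[::-1], 2)

    def hadamard_entry(i, j):                         # equation (3.2), 0-based
        return -1 if bin(i & j).count("1") % 2 else 1

    def probing_element(j, m, iperm, k):
        """z_m(j) of (3.7): entry (iperm[j], Hperm[m]) of H_N with N = 2^k."""
        return hadamard_entry(iperm[j], bitrev(m, k))

    # For the 1-D lattice, the inverse of the bit-reversal permutation (3.4) is
    # again bit reversal, so the second vector is the red-black +1/-1 vector.
    k = 3
    iperm = [bitrev(x, k) for x in range(2 ** k)]
    print([probing_element(j, 1, iperm, k) for j in range(8)])   # [1, -1, 1, -1, 1, -1, 1, -1]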

We observe now that the assumption that each subgroup is colored with exactly two

colors is not necessary. The ordering given in (3.6) is the same if each subgroup is colored

by any power of two colors, which could be different at different levels. The sequence (3.6)

is built on the smallest increment of powers of two and thus subsumes any higher powers.

We can extend the above ideas to generate the probing basis for arbitrary N, when

at every level each color block is split into exactly the same (possibly non-power of two)

colors. For example, at the first level we split the graph into 3 colors, at level two, each of

the 3 color blocks is colored with exactly 5 colors, at level three, each of the 5 color blocks

is colored with exactly 2 colors, and so on. The problem is that Hadamard matrices do not

exist for arbitrary dimensions. For example, for 3 probing vectors, there is no orthogonal

basis Z of ±1 elements, such that ZZ^T = I. In this general case, we must resort to the

N-th roots of unity, i.e., the Fourier matrices Fn. Assume that the number of colors at level i is c(i) for all blocks at that level, then the probing basis is constructed recursively as:

Z̃^{(1)} = F_{c(1)},
Z̃^{(i)} = [Z̃^{(i-1)} ⊗ F_{c(i)}(:, 1), . . . , Z̃^{(i-1)} ⊗ F_{c(i)}(:, c(i))],
Z^{(i)} = Z̃^{(i)} ⊗ 1_{N/γ_i},  where γ_i = ∏_{j=1}^i c(j).   (3.8)

By construction, the vectors of Z(i−1) are contained in Z(i), and any arbitrary vector can

be generated with a simple recursive algorithm. However, we have introduced complex

arithmetic which doubles the computational cost for real matrices. On the other hand,

if a c(i) is a power of two, its Fc(i) can be replaced by Hc(i). This can be useful when the non-power of two colors appear only at later recursion levels for which the number of probing vectors is large and may not be used, or when only one or two Fc(i) will suffice. To summarize, we have provided an inexpensive way to generate, for any matrix size, an arbitrary vector of the hierarchical probing sequence through (3.8), as long as the number of colors is the same within the same level for each subgraph. If, in addition, the matrix size and the color numbers are powers of two, (3.6–3.7) provide an even simpler way to generate the probing sequence. In LQCD, many of the lattices fall in this last category.
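Recursion (3.8) can be sketched with small DFT blocks as follows (illustrative only; the names and the example block-color counts are ours):

    import numpy as np

    def dft(n):
        w = np.exp(2j * np.pi / n)
        return np.array([[w ** (j * k) for k in range(n)] for j in range(n)])

    def probing_basis(c, N):
        """Build Z^(i) of (3.8) for block-color counts c = [c(1), ..., c(i)]."""
        Zt = dft(c[0])
        for ci in c[1:]:
            F = dft(ci)
            Zt = np.hstack([np.kron(Zt, F[:, [k]]) for k in range(ci)])
        gamma = int(np.prod(c))
        return np.kron(Zt, np.ones((N // gamma, 1)))   # Z^(i) = Ztilde^(i) x 1_{N/gamma_i}

    N = 12
    Z2 = probing_basis([3, 2], N)
    Z1 = probing_basis([3], N)
    assert np.allclose(Z2[:, :Z1.shape[1]], Z1)        # nested: level-1 vectors are reused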

3.1.15 Removing the deterministic bias

The probing vectors produced in Section 3.1.14 are deterministic and, even though they give better approximations than random vectors, they introduce a bias. To avoid this, we can view formula (3.5) not as a sequence of vectors but as a process of generating an orthogonal basis starting from any vector and following a particular pattern. Therefore,

consider a random vector z_0 ∈ Z_2^N, and [z_1, z_2, . . . , z_m] the sequence of vectors produced by (3.5). If ⊙ is the element-wise Hadamard product, the vectors built as

V = [z_0 ⊙ z_1, z_0 ⊙ z_2, . . . , z_0 ⊙ z_m]   (3.9)

have the same properties as Z, i.e., V^T V = Z^T Z and V V^T has the same non-zero pattern as ZZ^T (V V^T = (z_0 z_0^T) ⊙ ZZ^T), but one can easily show that the expected value of the trace estimate over all z_0 is the matrix trace, yielding an unbiased estimator.
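A minimal sketch of the bias-removal step (3.9) and the resulting MC estimate follows (natural-order Hadamard columns are used here for brevity; in practice the hierarchically permuted sequence would be used, and the test matrix below is arbitrary):

    import numpy as np
    from scipy.linalg import hadamard

    rng = np.random.default_rng(0)
    N, s = 64, 16
    A = np.diag(np.arange(2.0, 2.0 + N)) + 0.1 * rng.standard_normal((N, N))
    A = (A + A.T) / 2                                   # a generic symmetric test matrix
    Ainv = np.linalg.inv(A)

    Z = hadamard(N)[:, :s].astype(float)                # deterministic vectors
    z0 = rng.choice([-1.0, 1.0], size=(N, 1))           # random Z_2 vector
    V = z0 * Z                                          # (3.9): elementwise product per column
    est = np.trace(V.T @ Ainv @ V) / s                  # average of the s quadratures
    print(est, np.trace(Ainv))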

3.1.16 Numerical experiments

We present a set of numerical examples on control test problems and on a large QCD calculation in order to show the effectiveness of hierarchical probing over classical probing, and over standard noise Monte Carlo estimators for Tr(A−1). We also study the effect of the unbiased estimator.

Our standard control problem is the discretization of the Laplacian on a uniform lattice with periodic boundary conditions. We control the dimensions (3-D or 4-D), the size per dimension, and the conditioning (and thus the decay of the elements of the inverse) by applying a shift to the matrix. Most importantly, for these matrices we know the trace of the inverse analytically. We will refer to such problems as Laplacian, with their size implying their dimensionality.

3.1.17 Comparison with classical probing

For this set of experiments we consider a 64^3 Laplacian, shifted so that its condition number is on the order of 10^2. Therefore, its A^{-1} exhibits dominant features on and close to (in a graph theoretical sense) the non-zero structure of A, with decay away from it.

The decay rate depends on the conditioning of A. Our methods should be able to pick this structure effectively.

Figure 3.4 shows the performance of classical probing, which is a natural benchmark for our methods. The left graph shows that for larger distance colorings, probing performs extremely well. For example, with 317 probing vectors, which correspond to a distance-8 coloring, we achieve more than two orders of magnitude reduction in the error. Of course, if the approximation is not good enough, this work must be discarded, and the algorithm must be repeated for higher distances. Hadamard vectors, in their natural order, do not capture well the nonzero structure of this A.

The right graph in Figure 3.4 shows one way to improve accuracy beyond a certain

probing distance. After using [0, . . . , 0, 1_{r(m)}^T, 0, . . . , 0]^T as the probing vector for color m, we continue building a Hadamard matrix in its natural order only for the r(m) coordinates of that color. If probing has captured the most important parts of the matrix, the remaining parts could be sufficiently approximated by natural order Hadamard vectors. This is confirmed by the results in the graph, if one knows what initial probing distance to pick.

On the other hand, hierarchical probing, which considers all possible levels, achieves better performance than all other combinations.

In Figure 3.5, left graph, we stop our recursive algorithm at various levels and use the resulting permutation to generate the vectors for the trace computation. It is clearly beneficial to allow the recursion to run for all levels. We also point out that stopping at intermediate levels behaves similarly to classical probing with the corresponding distance.

On the right graph of Figure 3.5, we observe no difference between methods for highly ill-conditioned matrices. The reason is that the eigenvector of the smallest eigenvalue of A is the vector of all ones, 1_N. The more ill-conditioned A is, the more A^{-1} is dominated by 1_N 1_N^T, which has absolutely no variation or pattern. We point out that the experiments in this subsection did not use the unbiased estimator of Section 3.1.15. This has a severe effect for the Laplacian matrix because the first vector of our Hadamard sequences is h_0 = 1_N, the lowest eigenvector. Even for a well conditioned Laplacian, starting with h_0 guarantees that the first trace estimate will have no contribution from other eigenvectors, and thus will have a large error. From a statistical

[Two plots of trace error versus number of quadratures for the 64 × 64 × 64 Laplacian with condition number 1e+02, comparing probing with k = 1, 2, 4, 8 against natural-order and hierarchical Hadamard vectors.]

Figure 3.4: Error in the Tr(A−1) approximation using the MC method with various deterministic vectors. Classic probing requires 2,16,62, and 317 colors for probing dis- tances 1,2,4, and 8, respectively. Left: Classic probing approximates the trace better than the same number of Hadamard vectors taken in their natural order. Going to higher distance-k requires discarding previous work. Right: Perform distance-k probing, then ap- ply Hadamard in natural order within each color. Performs well, but hierarchical performs even better.

point of view, h0 is the worst starting vector for Laplacians, but it better exposes the rate at which various methods reduce error.

3.1.18 Comparison with random-noise Monte Carlo

Having established that hierarchical probing discovers matrix structure as well as classical

probing, we turn to gauge its improvements over the standard Z2 noise MC estimator. First, we show three sets of graphs for increasing condition numbers of the Laplacian. We

use the 64^3, 32^4, and 64 × 128^2 lattices, and plot the convergence of the trace estimates for

hierarchical probing, natural order Hadamard, and for the standard Z2 random estimator. Both Hadamard sequences employ the bias removing technique (3.9). As it is typical, the

random estimator includes error bars designating the two standard deviation confidence

intervals, ±2(Var/s)^{1/2}, where Var is the variance estimator.

Figure 3.6 shows the convergence history of the three estimators for well conditioned shifted Laplacians, which therefore have prominent structure in A−1. Hierarchical probing

[Two plots of trace error versus number of quadratures for the 64 × 64 × 64 Laplacian: left, condition number 1e+02, with the hierarchical coloring stopped at levels 1–5 and compared against natural-order and full hierarchical Hadamard; right, condition number 1e+06, with probing k = 1, 2, 4, 8 and hierarchical Hadamard.]

Figure 3.5: Left: The hierarchical coloring algorithm is stopped after 1, 2, 3, 4, 5 levels corresponding to distances 2, 4, 8, 16, 32. The ticks on the x-axis show the number of colors for each distance. Trace estimation is effective up to the stopped level; beyond that the vectors do not capture the remaining areas of large elements in A−1. Compare the results with classical probing in Figure 3.4, which requires only a few less colors for the same distance. Right: When the matrix is shifted to have high condition number, the lack of structure in A−1 causes all methods to produce similar results. exploits this structure, and thus performs much better than the other methods. Note that the problem on the left graph is identical to the one used in the previous section. The far better performance of the Hadamard sequences in this case is due to avoiding the eigenvector h0 as the starting vector. Figures 3.7 and 3.8 show results as the condition number of the problems increase. As expected, the advantage of hierarchical probing wanes as the structure of A−1 disappears, but there is still no reason not to use it as diminishing improvement remain evident. We have included 4-D lattices in our experiments, first because of their use in LQCD, and second because their structure is more difficult to exploit than lower dimensional lattices.

For 1-D or 2-D lattices which we do not show, hierarchical probing was significantly more efficient.

Once we use a random vector z_0 to modify our sequence as in (3.9), hierarchical probing becomes a stochastic process, whose statistical properties must be studied. Thus, we generate z_0^{(i)}, i = 1 : 100, Z_2 random vectors, and for each one we produce a modified

[Two plots of the trace estimate versus number of quadratures for the 64 × 64 × 64 Laplacian (cond = 1e+02, left) and the 32 × 32 × 32 × 32 Laplacian (cond = 2e+02, right), comparing the Z2 random estimator, Z2 ⊙ natural-order Hadamard, and Z2 ⊙ hierarchical Hadamard against the exact trace.]

Figure 3.6: Convergence history of Z2 random estimator, Hadamard vectors in natural order, and hierarchical probing, the latter two with bias removed as in (3.9). Because of small condition number, A−1 has a lot of structure, making hierarchical probing clearly superior to the standard estimator. As expected, Hadamard vectors in natural order are not competitive. The markers on the plot of the hierarchical probing method designate the number of vectors required for a particular distance coloring to complete. It is on these markers that structure is captured and error minimized.

sequence of the hierarchical probing vectors. Then, we use the 100 values x_m^T A^{-1} x_m, where x_m = z_0^{(i)} ⊙ z_m, at every step of the 100 MC estimators to calculate confidence intervals.

These are shown in Figure 3.9. We emphasize that the confidence intervals for the Z2 random estimator are computed differently, based on the V ar estimator of the preceding

MC steps, so they may not be accurate initially. Even on a 4-D problem, hierarchical probing provides a clear variance improvement.

3.1.19 A large QCD problem

The methodology presented in this paper has the potential of improving a multitude of

LQCD calculations. In this section, we focus on the calculation of C = Tr(D−1), where

the Dirac matrix D is a non-symmetric complex sparse matrix. This is representative of

a larger class of calculations usually called “disconnected diagrams”. The physical observ-

able C is related to an important property of QCD called spontaneous chiral symmetry breaking.

[Two plots of the trace estimate versus number of quadratures for the 64 × 128 × 128 Laplacian (cond = 1e+04, left) and the 32 × 32 × 32 × 32 Laplacian (cond = 2e+04, right).]

Figure 3.7: Convergence history of the three estimators as in Figure 3.6 for a larger condition number O(104). As the structure of A−1 becomes less prominent, the differences between methods reduce. Still, hierarchical probing has a clear advantage.

Our goal is to compare the standard MC approach of computing the trace with our hierarchical probing method. Our test was performed on a single gauge field configuration using the Dirac matrix that corresponds to the “strange” quark, resulting from the Clover-Wilson fermion discretization [56]. The strange quark is the third lightest quark flavor in nature. The gauge configuration had dimensions of 32^3 × 64 with a lattice spacing of a = 0.11 fm, for a problem size of 24 million.

First, we used an ensemble of n = 253 noise vectors to estimate the variance of the standard MC method, with complete probing (dilution) of the internal color-spin space of dimension 12 to completely eliminate the variance due to connections in this space. Then, for each of these noise vectors, we modified as in (3.9) a sequence of hierarchical probing vectors which were generated based on space-time connections. As with the standard

MC estimator, full dilution of the color-spin space was performed. This procedure was performed in order to statistically estimate the variance of hierarchical probing, similarly to the test in Figure 3.9. In Figure 3.10(a), we present the variance of the hierarchical probing estimator as a function of the number of space-time probing vectors in the sequence. The main feature in this plot is that the variance drops as more vectors are used. Local minima occur at numbers of vectors that are powers of 2, where all connections of a given

[Two plots of the trace estimate versus number of quadratures for the 64 × 128 × 128 Laplacian (cond = 1e+06, left) and the 32 × 32 × 32 × 32 Laplacian (cond = 2e+06, right).]

Figure 3.8: Convergence history of the three estimators as in Figure 3.6 for a high con- dition number O(106). Even with no prominent structure in A−1 to discover, hierarchical probing is as effective as the standard method.

Manhattan distance are eliminated from the variance. The uncertainty of the variance,

represented by the errorbars in the plot, is estimated using the Jackknife resampling

procedure of our noise vector ensemble.

In addition to the variance, we estimate the speed-up ratio of the hierarchical probing

estimator over the standard MC estimator. We define speed-up ratio as:

R_s = V_stoc / (V_hp(s) × s),

where V_hp(s) is the variance over the n different runs when the s-th hierarchical probing vector is used, and V_stoc is the variance of the standard MC estimator as estimated from n = 253 samples. The rescaling factor of s is there to account for the fact that if one had been using a pure stochastic noise with n × s vectors, the variance would be smaller by a factor of s. Thus, the variance comparison is performed on an equal amount of computation for both methods. In Figure 3.10(b) we present the speed-up ratio R_s as a function of s. The errorbars on R_s are estimated using Jackknife resampling from our ensemble of starting noise vectors. The peaks in this plot occur at the points where s is a power of 2, as in the variance case. A maximum overall speed-up factor of about 10 is observed at s = 512.

[Plot of the trace estimate versus number of quadratures for the 16 × 16 × 16 × 16 Laplacian (cond = 2e+02), showing the exact trace, the Z2 random estimator, and Z2 ⊙ hierarchical Hadamard with ±2σ confidence intervals.]

Figure 3.9: Providing statistics over 100 random vectors z0, used to modify the sequence of 2048 hierarchical probing vectors as in (3.9). At every step, the variance of quadratures from the 100 different runs is computed, and confidence intervals reported around the hi- erarchical probing convergence. Note that for the standard noise MC estimator confidence intervals are computed differently and thus they are not directly comparable.

Note that the color completion points for this experiment are at s = 2, s = 32 and s = 512

vectors.

Finally, we report on a comparison with classical probing for this large QCD problem.

There is a variety of approaches for efficient distance-2 coloring in the literature [38, 33],

but we have not found any standard approaches for distance-k coloring. On lattices,

however, the distance-k neighborhood of a node is explicitly known geometrically. We

implemented a coloring algorithm that visits only this neighborhood for each node, thus

achieving the minimum possible complexity for this problem [33]. Specifically, for each

node, we make a list of colors previously assigned to its distance-k neighbors, and pick

the smallest color number not appearing in the list. The distance-4 coloring of our LQCD

lattice produced 123 colors and took 457 seconds on an Intel Xeon X5672, 3.2GHz server.

Using four random vectors with the structure of each of these colors (so that the total

number of quadratures is similar to our hierarchical probing), we ran 50 sets of experi-

ments, and measured the variance of classical probing. We found that its variance was

2.16 times larger than our hierarchical probing, or in other words, our method was 2.16

46 6 10

5 1 10 10

4 ) 10 s

3 variance 10 speed up (R

2 0 10 10

1 10 1 2 4 8 16 32 64 128 256 512 1 2 4 8 16 32 64 128 256 512 N N hadamard hadamard

Figure 3.10: (a) Left: The variance of the hierarchical probing trace estimator as a function of the number of vectors (s) used. The minima appear when s is a power of two. The places where the colors complete are marked with the cyan circle. These minima become successively deeper as we progress from 2 to 32 to 512 vectors. (b) Right: Speed- up of the LQCD trace calculation over the standard Z2 MC estimator. The cyan circles mark where colors complete. The maximal speed up is observed at s = 512. In both cases the uncertainties are estimated using the Jackknife procedure on a sample of 253 noise vectors, except for s = 256 and 512 where 37 vectors were used. times faster. This is expected as we explained earlier. Finally, note that computing the quadratures took 4 hours on four GPUs, on a dedicated machine for LQCD calculations.

Even though classic probing with distance-4 is feasible for this problem, computing the distance-8 coloring requires 5377 seconds, which becomes comparable to the time for com- puting the quadratures. Contrast that to the 2 seconds needed to compute the hierarchical probing.

3.1.20 Conclusions

The motivation for this work comes from our need to compute Tr(A−1) for very large

sparse matrices and LQCD lattices. Current methods are based on Monte Carlo and do

not sufficiently exploit the structure of the matrix. Probing is an attractive technique but

cannot be used incrementally, and becomes expensive for ill conditioned problems. Our

research has addressed these issues.

We have introduced the idea of hierarchical probing that produces suboptimal but nested distance-2^i colorings recursively, for all distances up to the diameter of the graph.

We have adapted this idea to uniform lattices of any dimension in a very efficient and parallelizable way.

To generate probing vectors that follow the hierarchical permutation and can be used incrementally to improve accuracy, we have developed an algorithm that produces a spe- cific permutation of the Hadamard vectors. This algorithm is limited to cases where the number of colors produced at every level is a power of two. We have also provided a re- cursive algorithm based on Fourier matrices that provides the appropriate sequence under the weaker assumption of having the same number of colors per block within a single level.

These conditions are satisfied on toroidal lattices. Finally, we proposed an inexpensive technique to avoid deterministic bias while using the above sequences of vectors.

We have performed a set of experiments in the context of computing Tr(A−1), and have shown that providing a hierarchical coloring for all possible distances is to be preferred over classical probing for a specific distance. We also showed that our methods provide significant speed-ups over the standard Monte Carlo approach.

Currently we are working to extend the idea of hierarchical coloring to general sparse matrices, and to combine it with other variance reduction techniques, in particular defla- tion type methods.

APPENDIX

Lemma 3.2. Proof: We use induction on d. For d = 2, c_2 = [0, 1, 1, 0] and the result holds. Assume the result holds for any dimension d − 1 or lower. Then for d dimensions, since the first half of c_d is the same as c_{d-1}, for i = 0, . . . , 2^{d-1} − 1, we have

c_d(2i) = ¬c_d(2i − 2^{d-1}) = ¬c_{d-1}(2i − 2^{d-1})   (recursive definition of c_d)
        = ¬c_{d-2}(i − 2^{d-2})                           (inductive hypothesis)
        = ¬(¬c_{d-1}(i)) = c_{d-1}(i)                     (recursive definition of c_d).

48 

Lemma 3.3. Proof: Because c_d are the colors of the two-point, d-dimensional torus, every even point 2i is the beginning of a new 1-D line and thus has a different color from its neighbor 2i + 1. It can also be proved inductively, since by construction 2i and 2i + 1

cannot be split across c_{d-1} and c_d. □

Lemma 3.4. Proof: Because of Lemma 3.3, after every pair of indices (2i, 2i + 1) is considered, the number of reds or blacks increases only by 1. Algorithm 2 sends all red (c_d(i) = 0) points i to the first half of the permutation in the order they are considered, which increases by 1 every two steps. Hence the first part of the equation. Black colors are sent to the second half, which completes the proof. □

Theorem 3.2. Proof: We show first for m = d − 1. Because of Lemma 3.2, we consider the even points in RB_d. Assume first c_d(2i) = c_{d-1}(i) = 0. From Lemma 3.4 we have,

RB_d(2i) = ⌊2i/2⌋ = i. Then, RB_{d-1}(i) = ⌊i/2⌋ = ⌊RB_d(2i)/2⌋. Now assume c_d(2i) = c_{d-1}(i) = 1. From Lemma 3.4 we have, RB_d(2i) = 2^{d-1} + ⌊2i/2⌋ = 2^{d-1} + i, and therefore RB_{d-1}(i) = 2^{d-2} + ⌊i/2⌋ = 2^{d-2} + ⌊(RB_d(2i) − 2^{d-1})/2⌋ = ⌊RB_d(2i)/2⌋, which proves the formula for both colors. A simple inductive argument proves the result for any

m = 1, . . . , d − 2. □

Theorem 3.3. Proof: We show that C(x) ≠ C(x′) for any two points x, x′ with ‖x − x′‖_1 = 1. These two points differ by one coordinate, j, since otherwise they are no longer unit length apart. So, C(x) − C(x′) = ( Σ_{i=1}^d (x_i − x′_i) + Σ_{i=1}^d (δ(x_i) − δ(x′_i)) ) mod 3 = ( x_j − x′_j + δ(x_j) − δ(x′_j) ) mod 3. We consider the following cases.

If neither x nor x′ lies on the boundary of the j-th dimension, x_j ≠ n_j − 1, then δ(x_j) = δ(x′_j) = 0, and C(x) − C(x′) = (x_j − x′_j) mod 3 = ±1 mod 3 ≠ 0.

Since x_j, x′_j both vary along the j-th dimension, only one of these points can lie on the boundary point of that dimension; consequently, only one of the two deltas can be equal to one. Without loss of generality, we assume that x_j is on the boundary of the j-th dimension, so C(x) − C(x′) = (x_j − x′_j + δ(x_j)) mod 3. In this case x_j − x′_j = 1, or in the wrap-around case, where x′_j = 0, x_j − x′_j = n_j − 1. There are two subcases:

1. δ(x_j) = 0, then x_j = n_j − 1 with n_j − 1 mod 3 ≠ 0, so C(x) − C(x′) = 1, or C(x) − C(x′) = (n_j − 1 mod 3) and thus is non-zero.

2. δ(x_j) = 1, then x_j = n_j − 1 and n_j − 1 mod 3 = 0, so C(x) − C(x′) is equal to (1 + δ(x_j)) mod 3 = (1 + 1) mod 3 ≠ 0, or C(x) − C(x′) is equal to (n_j − 1 + δ(x_j)) = (0 + δ(x_j)) mod 3 = 1 mod 3 ≠ 0. □



3.2 Lattices of arbitrary dimensions

3.2.1 Introduction and Preliminaries

In the previous section, we introduced an algorithm which produced hierarchical probing vectors for lattices with dimension lengths that are powers of two. The main idea behind the algorithm was to find a way to split a lattice quickly into a collection of sublattices recursively, as long as the length of the lattice was a power of two. This was achieved by splitting each lattice into 2^d sublattices, where d is the dimensionality of the lattice.

Since points which share a sublattice are all distance 2 away from each other, if they are assigned the same color, a valid distance-1 coloring of that sublattice will result. Further, if the lattice dimensions are a power of two, the lattice will split evenly, so each color will be divided into the same number of colors, and thus will be hierarchical. This process can then be continued recursively, until the desired number of colors is reached, or until every node has a unique coloring. Unfortunately, if the lattice's dimensions are not powers of two, then the sublattice cannot be split evenly into 2^d sublattices. At each level the dimensions of the sublattice being split will be reduced by a factor of two. At the level at which the dimensions of the sublattice are no longer divisible by two, the algorithm will be unable to continue, since it cannot then split further into even sublattices.

In this section we extend the hierarchical probing algorithm and the associated theory to sublattices of arbitrary dimensions, as long as those dimensions share some common

factors. Our algorithm bases the number of splits of the original lattice on the prime factors of the lengths of its dimensions. By using these factors the lattice can be divided up into sublattices of equal sizes, allowing the process to be continued recursively. When the factors of the lattice dimensions contain factors of 2, the original method using fast binary arithmetic can still be used. It should be noted that the number of colors produced at each hierarchical level m by our method is larger than the minimum number needed to produce a distance-m coloring of the lattice. In most cases this is an acceptable trade-off in order to ensure that the hierarchical property of the generated colorings is maintained.

Moreover, these additional vectors reduce the error further than the optimal coloring for the same distance.

3.3 Lattices as spans of sublattices

Formally, a lattice is a discrete additive subgroup of R^n. Intuitively, it is a collection of points such that adding the locations of any two points together gives the coordinates of

another valid point, and there is a minimum distance between the closest two points. A

good example of a d-dimensional lattice would be the Cartesian product of the integers,

which is the canonical d-dimensional regular grid. One can contrast this with a vector space such as the d-dimensional Cartesian product of the reals. A lattice need not be infinite, it can be formed on any finite group that has the required properties. In this paper we are mainly interested in finite lattices that have a periodic boundary condition.

Similarly to vector spaces, one can write down the basis for a lattice. We then say the

lattice is generated by B, a set of d vectors of R^d, as

L(B) = { Σ_{i=1}^d x_i b_i : b_i ∈ B, x_i ∈ Z } = { Bx, x ∈ Z^d }.   (3.10)

Note how the requirement that the coefficients of the basis vectors be integer enforces a

minimum distance between two points, in contrast to a vector space. Returning to the

51 example of the regular grid, we write the generating function for this lattice as

L(I) = { Σ_{i=1}^d x_i e_i : e_i ∈ I, x_i ∈ Z } = { Ix, x ∈ Z^d },   (3.11)

where I is the d-dimensional identity matrix and ei is the i-th column of I. Just as vector spaces may contain closed sub-spaces, lattices may contain closed sub- lattices. In this paper we are interested in the sublattices of L(I). In particular, we want to determine which sublattices of L(I) a point lies in. For example, the sublattice L(bI)

can describe only 1/b^d of the points of L(I), with spacing b. Consider also the concept of an affine lattice, which we define as

L(B)_c = { Σ_{i=1}^d x_i b_i + c : b_i ∈ B, x_i ∈ Z, c ∈ Z^d } = { Bx + c, x, c ∈ Z^d }.   (3.12)

We can use these to decompose L(I) into a union of affine sublattices, since non-affine sublattices represent only sublattices centered at zero. For a given sublattice spacing b,

any point in L(I) lies in one of b^d affine sublattices L(bI)_c. These sublattices can be said to span L(I). An example of this can be seen in Figure 3.11, where the 6×6 lattice is spanned by 3^2 affine sublattices of spacing 3.

(a) Sublattices with offsets (0, 0), (2, 1), (1, 2).  (b) Sublattices with offsets (1, 0), (0, 1), (2, 2).  (c) Sublattices with offsets (2, 0), (1, 1), (0, 2).

Figure 3.11: The decomposition of a 6×6 lattice into 3^2 sublattices L(3I)_{c_i}.

More formally, given b ∈ Z, the sublattices that span the entire lattice L(I) are:

52  0   1   b − 1   0   0   b − 1  L(bI)ci = L(bI) + ci, with c0 =  .  , c1 =  .  ,..., cbd−1 =  .  . (3.13)  .   .   .  0 0 b − 1

As there are b distinct options for each of the d elements of an offset c, there are b^d distinct lattice bases that span L(I). Based on the b-radix representation of integers, we can find a one-to-one function that maps the integers 0 ≤ i ≤ b^d − 1 to each offset vector c, allowing each c to be associated with a unique sublattice number. The function that maps c to i is

i = Σ_{j=1}^d c_j b^{j-1}.   (3.14)

Its inverse function that maps i to a particular offset c is computed by Algorithm 5. This gives the following general equation for the i-th affine sublattice basis

L(bI)_{c_i} = L(bI) + [ ⌊r_{d-1}/b^0⌋, . . . , ⌊r_1/b^{d-2}⌋, ⌊i/b^{d-1}⌋ ]^T.   (3.15)

Algorithm 5 c = ConvertIndexToOffset(i, b, d)
% Find the affine offset c, given its integer reference number i
% Input: i: integer lattice reference
% Output: The offset vector c
1: for m = d → 1 do
2:     c(m) ← ⌊i / b^{m-1}⌋
3:     r_m ← i (mod b^{m-1})
4:     i ← r_m
5: end for
return c

Because the sublattices span the lattice, the coordinates of any lattice point x can be represented as xi ∗ b + ci, xi, ci ∈ Z, 0 ≤ ci ≤ (b − 1), or bx + c for some offset vector c in

(3.13). Therefore, taking each coordinate mod b yields the offsets c = (ci), which determine

through (3.14) which sublattice the point lies in. Consider the example in Figure 3.12. The top left sublattice consists of the points (0, 0), (0, 3), (3, 0), (3, 3) ≡ (0, 0) (mod 3). From

(3.14), i = 0∗b1+0∗b0 = 0, indicating these points are in the 0-th sublattice. Alternatively, the points (2, 1),(5, 1),(2, 4),(5, 4) ≡ (2, 1) (mod 3). Since i = 1 ∗ 3 + 2 = 5, these points are in the 5-th sublattice. Finally, points (1, 2),(1, 5),(4, 5),(4, 2) ≡ (1, 2) (mod 3), which means these points are in the 7-th sublattice (i = 2 ∗ 3 + 1 = 7).

Figure 3.12: Affine sublattices with x, y coordinates; the 6×6 lattice nodes are labeled (0,0) through (5,5).

Consider now a finite lattice with d dimensions of length $d_i$, i = 1, . . . , d. Let $F_i = \mathrm{factor}(d_i)$ be the sorted list of the integers resulting from the prime factorization of $d_i$. Then define the list of common factors as $F = \mathrm{sort}(\bigcap_{i=1}^{d} F_i)$. In the example of Figure

3.11, F1 = F2 = {2, 3}, and so F = {2, 3}. More interestingly, consider a lattice of dimensions 60 × 140. Then F1 = {2, 2, 3, 5},F2 = {2, 2, 5, 7}, and F = {2, 2, 5}. We can use the list of common factors F to split the lattice L(I) into a hierarchy of spanning sublattices. We start with the smallest b = F(1) and obtain the sublattices in

(3.13). Then we split every sublattice into its own set of spanning sublattices based on the next common factor F(2). The process continues recursively until all common factors have been exhausted. Using the fact that for any point p in a lattice, its sublattice offset after a split is (p mod b), Algorithm 6 computes the sublattice offset vectors of p for all levels. Because of the equivalence between offsets and indices, Algorithm 6 returns only the index of the offsets through (3.14). Note that after splitting L(I) with b = F(1), the

$L(bI)_{c_i}$ have common factors F(2 : end), and all have the same size with dimensions $d_i/b$. The coordinates of p in its sublattice are $\lfloor p/b \rfloor$.

Algorithm 6 [i(1), . . . , i(f)] = SublatticeIndicesOfPoint(p, F)
% Determine which sublattice a point lies in at each splitting level
% Input: p lattice point coordinates
% Input: F the common prime factors F(1) ≤ · · · ≤ F(f)
% Output: i(m) the index corresponding to offset c_m, m = 1, . . . , f of the m-th split
1: for m = 1 → size(F) do    % At each level use splitting spacing b = F(m)
2:   c_m ← p (mod F(m))    % determine p's affine offset (i.e., sublattice) at level m
3:   i(m) ← Convert c_m to index through (3.14)
4:   p ← ⌊p/F(m)⌋    % point coordinates in this sublattice
5: end for
return i(m) for all m
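Algorithm 6 admits an equally short sketch. The following Python fragment (again a 0-based illustration with a hypothetical name, not the implementation used for the experiments) computes the sublattice index of a point at every splitting level:

def sublattice_indices_of_point(p, F):
    """At each level, convert the point's offset (p mod b) to an index via (3.14)."""
    p = list(p)
    indices = []
    for b in F:                                       # level m uses spacing b = F(m)
        c = [coord % b for coord in p]                # affine offset at this level
        indices.append(sum(ci * b**j for j, ci in enumerate(c)))   # eq. (3.14)
        p = [coord // b for coord in p]               # coordinates inside the sublattice
    return indices

# Example: point (5, 4) on the 6x6 lattice of Figure 3.11 with F = [2, 3] gives [1, 8].
print(sublattice_indices_of_point((5, 4), [2, 3]))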

3.4 Coloring sublattices

The distance between any two points $p_1, p_2 \in L(bI)_c$ is $\|p_1 - p_2\|_1$. The minimum distance of these points is b, the spacing of the sublattice. More formally, from (3.12),

there exist $x_1, x_2 \in \mathbb{Z}^d$, such that $p_1 = bx_1 + c$, $p_2 = bx_2 + c$, and, if $p_1, p_2$ are distinct,

$\|p_1 - p_2\|_1 = b\|x_1 - x_2\|_1 \ge b$. Thus, we may assign the same color to all points in $L(bI)_c$

and still have a valid distance b − 1 coloring of the points within L(bI)c. However, the minimum distance between points in two different sublattices is deter-

mined by the distance of their offsets. Using (3.12) again, if $p_1 \in L(bI)_{c_i}$ and $p_2 \in L(bI)_{c_j}$,

$\|p_1 - p_2\|_1 = \|b(x_1 - x_2) + c_i - c_j\|_1 \ge \|c_i - c_j\|_1$, since we can pick $x_1 = x_2$. For example, points $[0,\ldots,0]^T \in L(bI)_{c_0}$ and $[1, 0,\ldots,0]^T \in L(bI)_{c_1}$ are distance 1 apart. Therefore, if

the nodes in L(bI)c0 and in L(bI)c1 are all assigned the same color, we cannot achieve a valid distance 1 coloring on the entire L(I).

The problem is equivalent to coloring the finite toroidal lattice

C = {p mod b, p ∈ L(I)} , (3.16)

whose points are the $b^d$ offset vectors $c_i$ in (3.13). Note that C can be used to tile the original lattice as seen in Figure 3.13. Different coloring strategies of C achieve different dis-

tances between ci, cj with the same color, and hence between the points of L(bI)ci , L(bI)cj . More formally we have the following.

Lemma 3.5 Assume that each p ∈ L(I) is assigned a color, color(p), and that all points

in each L(bI)c are assigned the same color, i.e., ∀pi, pj ∈ L(bI)c, color(pi) = color(pj). Then color(p mod b) = color(p).

Proof: Since ∪cL(bI)c = L(I), p = bx + c. Then color(bx + c mod b) = color(c), and

since p, c ∈ L(bI)c, both have the same color. 

To take advantage of the b spacing of the points within each L(bI)c, one obvious

strategy is to assign every ci in C (equivalently each sublattice) a unique color. This guarantees a distance b − 1 coloring for the entire lattice L(I). In the context of our

recursive splitting algorithm, the first split with $b_1 \in F$ uses $b_1^d$ colors, and achieves a distance $b_1 - 1$ coloring. At the second recursive level with $b_2 \in F$, each $L(b_1 I)_c$ is split into $b_2^d$ sublattices, each with a unique color, for a total of $(b_1 b_2)^d$ colors. Points in $L(b_2 I)_c$ are at

least b2 hops apart, but these hops are edges in the L(b1I)c lattice. Thus the minimum

distance achieved by this coloring at the second level is b1b2 − 1. A simple inductive argument shows the following.

Lemma 3.6 If at every level of the recursive splitting algorithm each sublattice is assigned

a unique color, then at level m we have used $(b_1 \cdots b_m)^d$ colors and have achieved a distance

b1 ··· bm − 1 coloring.

The algorithm increases the effective distance exponentially with each level, and prob-

ing with the corresponding vectors should be very effective. However, the number of colors

(and of probing vectors) used increases rapidly too. This is not an efficiency problem but

rather an evaluation problem. After level m−1, probing cannot fully annihilate elements of

distance $b_1 \cdots b_m - 1$ until all $(b_1 \cdots b_m)^d$ colors have been used. Thus, we cannot properly

evaluate its progress for intermediate numbers of colors. Moreover, the number of probing vectors in the next level may not be affordable computationally. For example, if $b_i = 2$ and d = 4, we can only guarantee meaningful results at color numbers 16, 256, 4096, . . . .

Therefore, it would be desirable to maintain the effectiveness of the method when each node in C is uniquely colored, but also have one or more intermediate evaluation points where smaller distance colorings complete.

A problem with more than one intermediate point is that the colorings of C at two

different distances must be hierarchical, i.e., two lattice nodes that have different colors

at distance j cannot have the same color at larger distance colorings. Also, to facilitate

the generation of probing vectors on-demand, each color should have the same number of

nodes (see later discussion). Since b is prime, we can only consider colorings with b, or $b^2$, . . . , or $b^d$ colors. For example, a red-black coloring of C is not a valid distance 1 coloring for any odd b. The periodic connection links two nodes (and thus sublattices) with the same color. We will see experimentally that the errors from ignoring these connections can be significant. Instead, three colors can provide a valid distance 1 coloring of any toroidal lattice [66]. However, for $b \ne 3$, the three color subsets would not have the same number of nodes.

We are not aware of a method that produces optimal distance colorings of C for any b and d. For small values of b, d we have identified heuristics that, using b colors, produce a valid distance $O(b^{1/d})$ coloring of C. For practical problems as in LQCD, d ≤ 4 and b ≤ 7, so the effective distance achieved is not better than distance 1. Besides, an optimal coloring is not necessary as the nodes in the same colors will be hierarchically colored too so that nearby permuted nodes eventually have large distances. Therefore, we focus our attention on the following simpler method as our only intermediate point between two recursive levels. This coloring strategy produces a valid distance 1 coloring with b colors, and ensures that each color appears the same number of times. If p ∈ L(I), with

p mod b = c, define its color:

$$\mathrm{color}(p) = \Big( \sum_{i=1}^{d} p(i) \Big) \bmod b. \qquad (3.17)$$

Because of Lemma 3.5, we also have $\mathrm{color}(p) = \mathrm{color}(c) = \big(\sum_{i=1}^{d} c(i)\big) \bmod b$. Note that when we reorder the nodes of C based on this coloring, we also consider the sublattices

L(bI)c in the same order.

Figure 3.13: The circled nodes constitute the C lattice of offsets. Note how C tiles the entire lattice, and that its coloring reflects the coloring of each sublattice $L(bI)_c$. Since b = 3, each line of colors is the same as the previous line, shifted by 1 mod 3.

Lemma 3.7 The strategy (3.17) is a valid distance-1 coloring of L(I).
Proof: Let $p_1, p_2 \in L(I)$, with $\|p_1 - p_2\|_1 = 1$. This means they share all but one coordinate, say the i-th. If their connection is not due to the periodic boundary, their i-th coordinate will differ by one. Thus, $|p_1(i) - p_2(i)| \bmod b = 1$, implying that $\mathrm{color}(p_1) \ne \mathrm{color}(p_2)$. If

both points lie on a boundary and connect via the toroidal property, then $|p_1(i) - p_2(i)| =$

$(b-1) - 0 \not\equiv 0 \pmod{b}$, and therefore $\mathrm{color}(p_1) \ne \mathrm{color}(p_2)$. Since there are no other

cases possible, the result holds. 

The coloring strategy (3.17) has an efficient recursive implementation. Note that the

colors in the i-th (d − 1)-dimensional slice of C are the colors of the (i − 1)-th (d − 1)-

dimensional slice shifted by 1 mod b. This can be seen in Figure 3.13 for d = 2. Since the

0-th slice is the same as the coloring of C in d − 1 dimensions, we can build the colorings for all dimensions in the following recursive manner.

Let $c_{d,b}$¹ be the array of all $b^d$ colors of the corresponding d-dimensional C in natural ordering. This C has b (d − 1)-dimensional slices each corresponding to the (d − 1)-

dimensional lattice of offsets. Let $c_{d-1,b}$ be the array of the $b^{d-1}$ colors of these (d − 1)-

dimensional lattices. Then, cd,b is a concatenation of b shifted cd−1,b arrays,

$$c_{d,b} = \{\, c_{d-1,b},\ c_{d-1,b} + 1 \bmod b,\ \ldots,\ c_{d-1,b} + (b-1) \bmod b \,\}. \qquad (3.18)$$

Each shift applies to all the elements of the array cd−1,b. In Figure 3.13, for example,

c1,3 = {0, 1, 2} are the colors of a one dimensional lattice, but also the first row of the two

dimensional C. The colors of C are then c2,3 = {{0, 1, 2}, {0, 1, 2} + 1 mod 3, {0, 1, 2} + 2 mod 3} = {0, 1, 2, 1, 2, 0, 2, 0, 1}.

Algorithm 7 implements this recursive coloring of C starting with c0,b = 0, and then generates the permutation, Perm, that reorders sublattices of the same color together.

Since there is the same number of sublattices for each of the b colors, the first sublattice

of color i is assigned to the $i\,b^{d-1}$ location, and the index to the next available free spot for

this color stored in ColorIndex is incremented by one. We note that this coloring incurs

negligible computational cost, even for very large lattices, while it enables an additional

intermediate step between splits where the trace estimation may be monitored. This low

cost is an advantage over other colorings that could be used to define different intermediate

steps.

We now have a method that at each level m = 1, . . . , f recursively splits a sublattice

into $F(m)^d$ sublattices, giving each a different color. But before each sublattice is assigned

its own color at level m, we have an intermediate coloring that groups together $F(m)^{d-1}$

sublattices in the same color. According to Lemma 3.6, after the intermediate step before

level m we have $(b_1 \cdots b_{m-1})^d\, b_m$ colors ensuring a distance $b_1 \cdots b_{m-1}$ coloring, and after

1The notation of this array is not to be confused with the notation of offsets c which are in bold.

Algorithm 7 Perm(0, . . . , b^d − 1) = GenOffsetPermutation(b, d)
% Generate the b-coloring permutation of C which reorders the sublattices (offsets)
% Input: prime factor b, lattice dimension d
% Output: Perm, the b-color permutation of the b^d sublattices
1: % Generate the coloring using (3.18)
2: c_{0,b} ← {0}
3: for j = 1 → d do
4:   c_{j,b} ← {c_{j−1,b}, c_{j−1,b} + 1 mod b, ..., c_{j−1,b} + b − 1 mod b}
5: end for
6: % Initialize index showing where the next sublattice of color i should go
7: for i = 0 → b − 1 do
8:   ColorIndex(i) ← i ∗ b^{d−1}
9: end for
10: for i = 0 → b^d − 1 do
11:   Color ← c_{d,b}(i)    % Look up the color of sublattice i in array c_{d,b}
12:   Perm(i) ← ColorIndex(Color)    % The new location of sublattice i
13:   ColorIndex(Color) ← ColorIndex(Color) + 1
14: end for
return Perm
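As an illustration, the coloring (3.18) and the reordering of Algorithm 7 can be sketched in Python as follows (a 0-based toy version; gen_offset_permutation is our own name for it):

def gen_offset_permutation(b, d):
    """Build the coloring c_{d,b} recursively as in (3.18), then group same-colored sublattices."""
    colors = [0]                                        # c_{0,b}
    for _ in range(d):                                  # c_{j,b} from c_{j-1,b}, eq. (3.18)
        colors = [(c + shift) % b for shift in range(b) for c in colors]
    perm = [0] * (b ** d)
    color_index = [i * b ** (d - 1) for i in range(b)]  # next free slot for each color
    for i, color in enumerate(colors):
        perm[i] = color_index[color]
        color_index[color] += 1
    return perm

# For b = 3, d = 2 the colors come out as {0,1,2,1,2,0,2,0,1}, as in the text below Figure 3.13.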

the m-th level we have $(b_1 \cdots b_m)^d$ colors for a distance $b_1 \cdots b_m - 1$ coloring. Next, we describe the global hierarchical permutation and in particular how to find the location in this permutation of an arbitrary node. This will be used to efficiently generate the hierarchical probing vectors.

3.4.1 Hierarchical Permutations of Lattices with Equal Sides

The final permutation can be obtained recursively by applying the coloring permutation from level m on the permuted index from level m−1. This ordering ensures that the closer the nodes are geometrically, the farther they are ordered in the permutation. Orderings of lower levels provide no additional information, since nodes never move closer together in subsequent levels, so need not be stored. Moreover, we can avoid the above recursion by determining directly where in the final permutation a node will lie.

Consider the example in Figure 3.11. At level 0, the 3-coloring of the lattice permutes the nodes into three groups as shown below.

Level 0, after 3-coloring:
  color 0: 0 3 8 11 13 16 18 21 26 29 31 34
  color 1: 1 4 6 9 14 17 19 22 24 27 32 35
  color 2: 2 5 7 10 12 15 20 23 25 28 30 33
Level 1, after splitting to 3^2 sublattices:
  offsets 0,5,7: 0 3 18 21 8 11 26 29 13 16 31 34
  offsets 1,3,8: 1 4 19 22 6 9 24 27 14 17 32 35
  offsets 2,4,6: 2 5 20 23 7 10 25 28 12 15 30 33

At level 1, since b = 3, we split the top level lattice to 9 sublattices. Three of those lattices

(offsets 0, 5, 7) contain only the red nodes from the intermediate coloring and need to be ordered first. Notice how the actual permutation of the red nodes at level 0 is not needed as it is present in the ordering of the sublattices by Algorithm 7. An exception would be after the last level, if we decide to 3-color the remaining sublattices that do not share common prime factors (as in our previous work [66]).

Algorithm 8 shows how we can construct the final hierarchical permutation. Given the coordinates of a node, p, Algorithm 6 generates the indices [i(1), . . . , i(f)] of the sublattices that p lies in at each level. Algorithm 7 permutes these to $[\hat i(1), \ldots, \hat i(f)] = \mathrm{Perm}([i(1), \ldots, i(f)])$.

Because the permutation preserves the hierarchy, the location of the node p is determined by all the nodes that belong to sublattices that appear prior to its own sublattices in the final permutation. For example, there are exactly $\hat i(1)$ sublattices preceding it at the first level, $\hat i(2)$ sublattices preceding it at the second level, and so on. At every level m the size of each sublattice reduces by a factor of $F(m)^d$. Thus, given the lattice size $L = \prod_{i=1}^{d} d_i$, we have

$$\mathrm{Location}(p) = \sum_{m=1}^{f} \hat i(m)\, \frac{L}{\prod_{l=1}^{m} F(l)^d}. \qquad (3.19)$$

3.4.2 Hierarchical Permutations of Lattices with Unequal Sides

Algorithm 8 relies on having each lattice split into an equal number of sublattices of the

same dimensionality. However, it is possible that one dimension of the lattice may be

smaller than the others, leading to that dimension being exhausted before the others.

If the rest of the dimensions share factors, the algorithm can continue but in a lower

dimensionality lattice that removes exhausted dimensions. Let d¯(m) be the number of

active dimensions at level m. For example, the lattice 6×6×2 has factors (2, 3), (2, 3), (2), so for the first level $\bar d(1) = 3$, while for the second level $\bar d(2) = 2$, since at the second level all sublattices will be two-dimensional. Thus, $\bar d(m)$ can be computed simply by counting the number of common factors in each dimension that has not been exhausted.

Algorithm 8 Location = HPpermutation(p, d, F)
% Compute the Hierarchical Probing permutation of a node p when d_1 = ··· = d_d
% Input: point coordinates p, lattice dimension d, common prime factors F
% Output: the location of p in the HP permutation
1: [i(1), . . . , i(f)] = SublatticeIndicesOfPoint(p, F)    (Algorithm 6)
2: subLatticeSize = ∏_{i=1}^d d_i
3: Location = 0
4: for m = 1 → f do
5:   Perm = GenOffsetPermutation(F(m), d)    (Algorithm 7)
6:   subLatticeSize = subLatticeSize / F(m)^d
7:   Location = Location + Perm(i(m)) ∗ subLatticeSize
8: end for
return Location

Then, the new location of a node is given by (3.20),²

$$\mathrm{Location}(p) = \sum_{m=1}^{f} \hat i(m)\, \frac{L}{\prod_{l=1}^{m} F(l)^{\bar d(l)}}. \qquad (3.20)$$

We can avoid computing and storing coloring permutations for lattices with reducing dimensionalities by reusing the previously computed permutations for C of a higher dimensionality in (3.16), as long as the spacing b is the same. First, recall that for a given

C of dimensionality d and spacing b, the coloring cd,b in (3.18) is created recursively. This means that the color of a particular node in C can be given in terms of either a higher or a lower dimensional C plus a correctional offset as below,

$$c_{d,b}(k) = c_{d-1,b}(k \bmod b^{d-1}) + \Big\lfloor \frac{k}{b^{d-1}} \Big\rfloor \bmod b, \quad \forall k < b^d, \qquad (3.21)$$
$$c_{d-1,b}(k \bmod b^{d-1}) = c_{d,b}(k) - \Big\lfloor \frac{k}{b^{d-1}} \Big\rfloor \bmod b, \quad \forall k < b^d. \qquad (3.22)$$

²It is worth noting that just as [66] interpreted this process as representing the node number in binary and then permuting the digits, we can represent each node in mixed radix, where the radix list is the color numbers used to color the sublattices at each level, and then permute these digits.

Additionally, we shall make use of the definition of mod for positive integers

$$k \bmod b = k - b\Big\lfloor \frac{k}{b} \Big\rfloor. \qquad (3.23)$$

Lemma 3.8 For any prime b, $c_{d,b}(ib) = c_{d-1,b}(i)$, $\forall i = 0, \ldots, b^{d-1} - 1$.

Proof: We proceed by induction on d. For the base case, $c_{1,b} = \{0, \ldots, b-1\}$, and $c_{2,b} = \{c_{1,b}, c_{1,b} + 1 \bmod b, \ldots, c_{1,b} + b - 1 \bmod b\}$. Then by construction, every $c_{2,b}(ib) = 0 + i = c_{1,b}(i)$. Assume that $c_{d-1,b}(ib) = c_{d-2,b}(ib/b) = c_{d-2,b}(i)$. Then,

$$\begin{aligned}
c_{d,b}(ib) &= c_{d-1,b}(ib \bmod b^{d-1}) + \lfloor ib/b^{d-1} \rfloor \bmod b && \text{(by 3.21)}\\
&= c_{d-2,b}\big((ib \bmod b^{d-1})/b\big) + \lfloor ib/b^{d-1} \rfloor \bmod b && \text{(by the I.H.)}\\
&= c_{d-2,b}\big((ib - b^{d-1}\lfloor ib/b^{d-1} \rfloor)/b\big) + \lfloor ib/b^{d-1} \rfloor \bmod b && \text{(by 3.23)}\\
&= c_{d-2,b}(i \bmod b^{d-2}) + \lfloor ib/b^{d-1} \rfloor \bmod b && \text{(by 3.23)}\\
&= c_{d-1,b}(i) - \lfloor i/b^{d-2} \rfloor + \lfloor ib/b^{d-1} \rfloor \bmod b && \text{(by 3.22)}\\
&= c_{d-1,b}(i) - \lfloor i/b^{d-2} \rfloor + \lfloor i/b^{d-2} \rfloor \bmod b = c_{d-1,b}(i).
\end{aligned}$$



Lemma 3.9 For any prime b, $c_{d,b}(ib) = c_{d,b}(ib + q) - q \bmod b$, $\forall i = 0, \ldots, b^{d-1} - 1$, $\forall q = 0, \ldots, b - 1$.

Proof: We proceed by induction on d. For the base case, by construction we have $c_{2,b}(ib + q) \bmod b = c_{1,b}(i) + q \bmod b$. Then, $c_{2,b}(ib + q) - q \bmod b = (c_{1,b}(i) + q \bmod b - q) \bmod b = c_{1,b}(i) \bmod b = c_{1,b}(i) = c_{2,b}(ib)$ (by Lemma 3.8). We now assume $c_{d-1,b}(ib) = c_{d-1,b}(ib + q \bmod b^{d-1}) - q \bmod b$. Then,

$$\begin{aligned}
c_{d,b}(ib) &= c_{d-1,b}(ib \bmod b^{d-1}) + \lfloor ib/b^{d-1} \rfloor \bmod b && \text{(by 3.21)}\\
&= c_{d-1,b}(ib + q \bmod b^{d-1}) - q + \lfloor ib/b^{d-1} \rfloor \bmod b && \text{(by the I.H.)}\\
&= c_{d,b}(ib + q \bmod b^{d}) - q \bmod b && \text{(by 3.22)}\\
&= c_{d,b}(ib + q) - q \bmod b. && \text{(since } ib + q < b^d\text{)}
\end{aligned}$$



Lemma 3.10 Let $\mathrm{Perm}_d$ be the permutation created by Algorithm 7 associated with dimension d. Then, for any prime b and $i = 0, \ldots, b^d - 1$, $\mathrm{Perm}_d(i) = \lfloor i/b \rfloor + c_{d,b}(i)\, b^{d-1}$.

Proof: Because of Lemma 3.9, when any b-tuple of indices $(bi, bi + 1, \ldots, bi + b - 1)$ is considered, the number of nodes in every color increases by 1. Since Algorithm 7 will send

the i-th color to the $i\,b^{d-1}$-th section, the equation holds.

Theorem 3.4 For any 0 < m < d, Permm can be obtained directly as follows,

$$\mathrm{Perm}_m(i) = \big\lfloor \mathrm{Perm}_d(i\,b^{d-m})/b^{d-m} \big\rfloor, \quad i = 0, \ldots, b^m - 1.$$

Proof: Since $i \le b^m$, we can apply Lemma 3.8 recursively,

$$c_{d,b}(i\,b^{d-m}) = c_{d-1,b}(i\,b^{d-m-1}) = c_{d-2,b}(i\,b^{d-m-2}) = \cdots = c_{m,b}(i).$$

Using this and Lemma 3.10 we have

$$\Big\lfloor \frac{\mathrm{Perm}_d(i\,b^{d-m})}{b^{d-m}} \Big\rfloor = \Big\lfloor \frac{\lfloor i\,b^{d-m}/b \rfloor + c_{d,b}(i\,b^{d-m})\,b^{d-1}}{b^{d-m}} \Big\rfloor = \Big\lfloor \frac{i}{b} \Big\rfloor + c_{m,b}(i)\,b^{m-1} = \mathrm{Perm}_m(i).$$



Based on Theorem 3.4, Algorithm 9 shows how to reuse previously generated cd,b to compute permutations for lower dimensional lattices when the smaller lattice dimensions are exhausted.

Algorithm 9 Location = HPpermutation_general(p, d, d_i, F)
% Compute the Hierarchical Probing permutation of a node p for general d_i
% Input: point coordinates p, lattice dimension d, common prime factors F
% Output: the location of p in the HP permutation
1: [i(1), . . . , i(f)] = SublatticeIndicesOfPoint(p, F)    % Algorithm 6
2: subLatticeSize = ∏_{i=1}^d d_i
3: Location = 0
4: for m = 1 → f do
5:   d̄(m) = setActiveDims()
6:   maxBdim = 0
7:   for i = d downto d̄(m) do    % Search and retrieve the highest dimensional
8:     Perm = getHash(b, i)    % stored permutation for a b = F(m) split
9:     if Perm != empty then    % if found one, reuse it
10:      maxBdim = i
11:      break
12:    end if
13:  end for
14:  if maxBdim then
15:    for i = 1 → b^{d̄(m)} do
16:      newperm(i) = ⌊Perm(i · b^{maxBdim − d̄(m)}) / b^{maxBdim − d̄(m)}⌋    % Theorem 3.4
17:    end for
18:    Perm = newperm
19:  else
20:    Perm = GenOffsetPermutation(F(m), d̄(m))    % Algorithm 7
21:  end if
22:  setHash(b, d̄(m)) = Perm    % Store this permutation
23:  subLatticeSize = subLatticeSize / F(m)^{d̄(m)}    % Equation (3.20)
24:  Location = Location + Perm(i(m)) ∗ subLatticeSize
25: end for
return Location
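The reuse rule of Theorem 3.4 that Algorithm 9 exploits is easy to state in code. The following Python fragment (an illustrative sketch; perm_from_higher_dim is a hypothetical helper) derives the permutation of a lower-dimensional offset lattice from a stored higher-dimensional one:

def perm_from_higher_dim(perm_d, b, d, m):
    """Perm_m(i) = floor(Perm_d(i*b^(d-m)) / b^(d-m)), i = 0, ..., b^m - 1 (Theorem 3.4)."""
    step = b ** (d - m)
    return [perm_d[i * step] // step for i in range(b ** m)]

# With gen_offset_permutation from the earlier sketch one can verify, e.g. for b = 3, that
# perm_from_higher_dim(gen_offset_permutation(3, 3), 3, 3, 2) == gen_offset_permutation(3, 2).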

3.4.3 Generating Probing Vectors Quickly

In [66], we introduce the following recursive method for generating probing vectors for a colored lattice,

$$\tilde Z_{(1)} = F_{z(1)}, \qquad (3.24)$$
$$\tilde Z_{(i)} = \big[ \tilde Z_{(i-1)} \otimes F_{z(i)}(:,1),\ \ldots,\ \tilde Z_{(i-1)} \otimes F_{z(i)}(:,z(i)) \big], \qquad (3.25)$$
$$Z_{(i)} = \tilde Z_{(i)} \otimes \mathbf{1}_{N/\gamma_i}, \quad \text{where } \gamma_i = \prod_{j=1}^{i} z(j). \qquad (3.26)$$

Here Fz(i) is the Fourier transform of the identity matrix Iz(i), z(i) is the number of

colors each sublattice is split into at level i, 1s is the vector of s ones, and ⊗ is the Kronecker product. Essentially, these vectors recursively build a basis for the probing

vectors. At each level, we probe inside each color (i.e., sublattice) by smaller probing

vectors hierarchically, which are all assembled into a basis through the Kronecker products.

Instead of generating the whole matrix, however, we produce each probing vector one at

a time, hence requiring the same memory as the Hutchinson method.
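To make the construction (3.24)–(3.26) concrete, the following numpy sketch builds the full probing basis explicitly (for illustration only; as noted above, in practice we generate one probing vector at a time rather than the whole matrix):

import numpy as np

def probing_matrix(z, N):
    """Full probing basis for color counts z = [z(1), ..., z(f)]; N must be divisible by prod(z)."""
    Ztilde = np.fft.fft(np.eye(z[0]))                    # F_{z(1)}, eq. (3.24)
    for zi in z[1:]:
        F = np.fft.fft(np.eye(zi))
        # Z~(i) = [Z~(i-1) kron F(:,1), ..., Z~(i-1) kron F(:,z(i))], eq. (3.25)
        Ztilde = np.hstack([np.kron(Ztilde, F[:, [k]]) for k in range(zi)])
    gamma = int(np.prod(z))
    return np.kron(Ztilde, np.ones((N // gamma, 1)))     # eq. (3.26)

# Example: z = [2, 2] and N = 16 gives 4 probing vectors of length 16.
print(probing_matrix([2, 2], 16).shape)   # (16, 4)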

To produce the k-th vector of the probing matrix, we first need to identify the maximum

level i needed such that γi−1 < k ≤ γi. Then at every lower recursive level of (3.25) we

only need two vectors; one vector from Z(i−1) and one from Fz(i). By (3.25), the matrix

Z(i) is divided into z(i) blocks, with each block forming a Kronecker product with a

different column of $F_{z(i)}$. Since each block has z(i − 1) columns, the k-th vector is in block $= \lfloor k/z(i-1) \rfloor$, and thus we can generate directly the desired column $F_{z(i)}(:, \mathrm{block})$.

This should be paired with the (k mod z(i − 1)) vector of Z(i−1) which we find recursively with the above procedure.3

When F(i) = 2, i = 1, . . . , k, the sublattices in the first k levels are red-black colorable,

and thus use only F2. Because F2 is equal to the Hadamard matrix H2, all vectors in the

3We note that this process can also be described in terms of radix conversion. Let the z(i)s be taken as the radix list. If k is converted to this mixed radix form [? ], the vectors of the Fourier transforms Fz(i) needed at each level will be the digits of this representation.

first k levels can be created using real arithmetic, which yields substantial savings over complex arithmetic. Moreover, we can use the fast bit-based method we introduced in

[66] for producing the required Hadamard vectors, leading to an additional performance gain. This approach can be seen in Algorithm 10.

Algorithm 10 ProbingVector = GenerateProbingVector(k, z, N)
% Compute the k-th probing vector
% Input: k, the number of colors at each level z(i), the matrix size N
% Output: The probing vector
1: for j = size(z) downto 2 do    % Compute indices of the needed F_{z(j)} vectors
2:   block(j) = ⌊k/z(j − 1)⌋;  k = k mod z(j − 1)
3: end for
4: block(1) ← k
5: % Find the initial 2-colorable sublattices, which can be probed quickly as in [66]
6: j ← 1; while z(j) == 2 do j ← j + 1; end while
7: fastLevels ← j − 1
8: ProbingVector ← [1]
9: ProbingVector ← FastHadamardMethod(1:fastLevels)
10: % The rest of the levels are built through Fourier vectors
11: for j = fastLevels+1 → size(z) do
12:   % Create the needed Fourier vector of F_{z(j)}
13:   f ← 2π/z(j)
14:   w ← [0 : f : (2π − f/2)] ∗ √−1
15:   F_{z(j)}(:, block(j)) ← exp(−w ∗ block(j))
16:   ProbingVector ← ProbingVector ⊗ F_{z(j)}(:, block(j))
17: end for
return ProbingVector

Since Algorithm 10 is based on coloring strategy (3.17), the first γi probing vectors in (3.26) correspond to what we termed as the intermediate coloring between splitting levels i − 1 and i. However, there are other numbers k, with γi−1 < k < γi, for which Z(:, 1 : k) are the probing vectors for a different valid coloring, most importantly the one in Lemma

3.6 where each sublattice gets a unique color. This is because the coloring is hierarchical, i.e., (3.17) is applied independently on each sublattice. Let us return to the example in

Figure 3.11, where we first consider the factor b = 3 and then b = 2. The results after the 3-coloring and the first level sublattice split are shown in Section 3.4.1. Here we focus only on color 0:

Level 0, after 3-coloring — color 0: 0 3 8 11 13 16 18 21 26 29 31 34
Level 1, after splitting to 3^2 sublattices — offsets 0,5,7: 0 3 18 21 8 11 26 29 13 16 31 34

Level 1, after 2-coloring each sublattice we have a total of 2 × 3^2 colors; original color 0 now contains 6 = 2 × 3 colors: 0 21 3 18 8 29 11 26 13 34 16 31

Notice therefore that by construction the first nine indices in our final permutation (which is used to generate the probing vectors Z(:, 1 : 9)) correspond to the coloring at level 1 where each sublattice has a different color.

3.5 Probing Vectors For Hierarchical Coloring on General Graphs

The above method for generating probing vectors assumes each color splits into the same number of colors at a given level. This is the case with our methods in Section 3.4.

With an arbitrary coloring method that does not assign the same number of sublattices to each color (e.g., the 3-coloring method on a lattice not divisible by three), a different way to generate the probing vectors is needed. A simple but not as efficient solution is to create the required canonical probing vectors, and then orthogonalize them against previous vectors in the sublattice as well as each other with Gram-Schmidt. We introduce a more efficient and elegant method that works with uneven color splits and thus generalizes probing to any graph with hierarchical coloring.

The method is described better through an example. Consider a graph with seven nodes (each node could be generalized to be a subgraph). Suppose at the first level the graph is assigned three colors. After the corresponding permutation, the first color contains nodes 1–3, the second color nodes 4 and 5, and the third color nodes 6 and 7. To probe at the first level we use the following probing vectors, which are a variation of $F_3$ in

(3.24) and (3.26) to allow for a different number of nodes per color,

$$Z_{(1,0)} = \begin{bmatrix} F_3(1,:) \otimes \mathbf{1}_3 \\ F_3(2,:) \otimes \mathbf{1}_2 \\ F_3(3,:) \otimes \mathbf{1}_2 \end{bmatrix} \in \mathbb{C}^{7\times 3}. \qquad (3.27)$$

Suppose now that at the second level, the first color splits into three colors and the others

into two. Clearly, the next level of probing vectors cannot be created by (3.25) because

of uneven splitting. Each color block of the first level has to be probed independently.

Thus, we could probe the first block using F3 for the elements inside the first block (with zeros everywhere else in the probing vector). Similarly for the last two blocks, but using

F2. These seven probing vectors are shown in (3.28) —note that 0k = zeros(k, 1). The problem is that using seven vectors would be wasting the solutions of linear systems with

the three probing vectors (3.27) in the first step.

$$\begin{bmatrix} F_3(:,1) \\ \mathbf{0}_2 \\ \mathbf{0}_2 \end{bmatrix},\ \begin{bmatrix} F_3(:,2) \\ \mathbf{0}_2 \\ \mathbf{0}_2 \end{bmatrix},\ \begin{bmatrix} F_3(:,3) \\ \mathbf{0}_2 \\ \mathbf{0}_2 \end{bmatrix}, \quad \begin{bmatrix} \mathbf{0}_3 \\ F_2(:,1) \\ \mathbf{0}_2 \end{bmatrix},\ \begin{bmatrix} \mathbf{0}_3 \\ F_2(:,2) \\ \mathbf{0}_2 \end{bmatrix}, \quad \begin{bmatrix} \mathbf{0}_3 \\ \mathbf{0}_2 \\ F_2(:,1) \end{bmatrix},\ \begin{bmatrix} \mathbf{0}_3 \\ \mathbf{0}_2 \\ F_2(:,2) \end{bmatrix}. \qquad (3.28)$$

The key to remedying this problem is to note that the three first vectors of the new

color blocks,

$$I = \begin{bmatrix} F_3(:,1) & \mathbf{0}_3 & \mathbf{0}_3 \\ \mathbf{0}_2 & F_2(:,1) & \mathbf{0}_2 \\ \mathbf{0}_2 & \mathbf{0}_2 & F_2(:,1) \end{bmatrix} = \begin{bmatrix} \mathbf{1}_3 & \mathbf{0}_3 & \mathbf{0}_3 \\ \mathbf{0}_2 & \mathbf{1}_2 & \mathbf{0}_2 \\ \mathbf{0}_2 & \mathbf{0}_2 & \mathbf{1}_2 \end{bmatrix}$$

are spanned by the vectors of $Z_{(1,0)}$, since $F_3$ is a basis of $\mathbb{C}^3$. More formally, if $a \in \mathbb{C}^{3\times 3}$, from (3.27) and basic properties of the Kronecker product we have that the following

matrix equation

$$Z_{(1,0)}\, a = \begin{bmatrix} F_3(1,:)\,a \otimes \mathbf{1}_3 \\ F_3(2,:)\,a \otimes \mathbf{1}_2 \\ F_3(3,:)\,a \otimes \mathbf{1}_2 \end{bmatrix} = I,$$

is equivalent to F3a = I3, which has a unique solution a = ifft(I3), i.e., the inverse Fourier

transform of the identity. Therefore, if we saved $P = A^{-1} Z_{(1,0)}$, we can recover the probing result for the vectors in I as $A^{-1}I = P a$. Thus, we only need to apply $A^{-1}$ on

the remaining four probing vectors, exactly as in our hierarchical probing method on the

lattice. Finally, if each node represents a subgraph, each color block in (3.28) involves

Kronecker products of the rows of its Fourier matrix with columns of ones, each sized to

the cardinality of the subgraph. Thus, each block has the same form as (3.27) and the idea can be applied recursively.

To generalize we need the following definitions. First, assume a hierarchical coloring at levels i = 0, 1,..., and let li−1 be the number of colors at level i − 1. The nodes belonging

to one of these colors are called a block at the next level i. There are li−1 blocks at the i-th level. Let s(j, i) be the number of colors the j-th block splits into at level i. Thus,

$\sum_{j=1}^{l_{i-1}} s(j, i) = l_i$. For each color in block j, let n(j, i, k), k = 1, . . . , s(j, i) be the number of nodes in that color. Thus, $\sum_{k=1}^{s(j,i)} n(j, i, k)$ is the number of nodes in the j-th block.

For each j = 1, . . . , $l_{i-1}$ block, define the Fourier transform $F_{s(j,i)} = \mathrm{fft}(I_{s(j,i)})$, and $Z_{(j,i)}$ the set of probing vectors as

$$Z_{(j,i)} = \begin{bmatrix} \mathbf{0} \\ F_m(1,:) \otimes \mathbf{1}_{n(j,i,1)} \\ \vdots \\ F_m(m,:) \otimes \mathbf{1}_{n(j,i,m)} \\ \mathbf{0} \end{bmatrix}, \quad \text{where } m = s(j, i). \qquad (3.29)$$

The 0 zero matrices have m columns and rows that overlap with all other blocks. At level

0, there is only one block (with s(1, 0) colors), so the 0 matrices are empty. E.g., the first

three vectors in (3.28) are $Z_{(1,1)}$, and $I = [Z_{(1,1)}(:, 1),\ Z_{(2,1)}(:, 1),\ Z_{(3,1)}(:, 1)]$. We assume that the results of the inversions have been saved, $P = A^{-1} Z_{(j,i-1)}$, for all

blocks $j = 1, \ldots, l_{i-2}$ at level i − 1. At the i-th level, the probing results for the first vector of each block can be determined as follows:

$$a = \mathrm{ifft}(I_{l_{i-1}}), \qquad (3.30)$$
$$A^{-1} Z_{(j,i)}(:, 1) = P\, a(:, j), \quad j = 1, \ldots, l_{i-1}, \qquad (3.31)$$

or equivalently note that

$$P a = \mathrm{ifft}(P^H)^T. \qquad (3.32)$$

Systems for the rest of the probing vectors in the blocks are solved explicitly. At the end of level i, we have inversions for all the probing vectors $Z_{(i)} = [Z_{(1,i)} \ldots Z_{(l_{i-1},i)}]$. If further levels are needed, the process continues as described in Algorithm 11. We emphasize that our new method fuses the generation of probing vectors and the solution of linear systems needed for the trace computation. However, Algorithm 11 depicts only the generation of the vectors.

We note that our method requires memory to store the vectors P at the previous level. When generating the P for level i, Algorithm 11 carefully implements this by first permuting the implicitly computed vectors P a to their new positions and then solving the linear systems for the new P vectors. Because of the tree structure, the total storage is $l_{i_{\max}-1}$ vectors, which is always less than half the final number of probing vectors at level $i_{\max}$ — more accurately, it would be less than $l_{i_{\max}}/\min_j s(j, i_{\max})$.

Computationally, at level i, we have avoided the solution of $l_{i-1}$ systems of equations at the expense of $l_{i-1}$ inverse FFTs in (3.32), or a $O(N l_{i-1} \log l_{i-1})$ cost. Moreover, this is

more elegant and less expensive than a brute-force Gram-Schmidt, which costs $O(N l_{i-1}^2)$ at level i. Finally we remind the reader that this method works for hierarchical colorings on arbitrary graphs.

3.6 Performance Testing

We investigate the performance of our algorithm for lattices in two ways. First we show that the increased cost of the algorithmic extensions is not excessive. Second, we show that the probing vectors produced by our algorithm for lattices whose dimensions have sizes that are not powers of 2 provide better trace estimation than the vectors from the original algorithm in [66].

Our experimental results shown in Table 3.1 indicate that the increased computation that our method requires over the original algorithm is reasonable, given the short running times involved even for very large lattices. At the same time, the algorithm is still

Algorithm 11 GenerateAndPerformProbingVector_general(s, l, n, i_max)
% Input: s(j, i): number of colors the j-th block splits into at the i-th level,
% n(j, i, k): the number of nodes in the color k subgraph of block j
% l_{i−1}: the number of colors at level i − 1, also the number of blocks at level i
% i_max: maximum desired level
% Output: The probing vectors Z at level i_max
1: Z ← [ ], P ← [ ]
2: F_{s(1,0)} ← fft(I_{s(1,0)})
3: Build Z_{(1,0)} using (3.29) and the coloring permutation
4: P ← [A^{−1} Z_{(1,0)}]
5: for i = 1 → i_max do    % Level i
6:   P ← ifft(P^H)^T
7:   newpos(1) = 1
8:   for j = 2 → l_{i−1} do    % block j
9:     newpos(j) = newpos(j − 1) + s(j − 1, i)    % new positions of P a at level i
10:  end for
11:  P(:, newpos) = P(:, 1 : l_{i−1})    % Permute to new positions
12:  for j = 1 → l_{i−1} do    % block j
13:    F_{s(j,i)} ← fft(I_{s(j,i)})
14:    Build Z_{(j,i)} using (3.29) and the coloring permutation
15:    for k = 2 → s(j, i) do    % color k in block j
16:      P(:, newpos(j) + k − 1) ← A^{−1} Z_{(j,i)}(:, k)
17:    end for
18:  end for
19: end for

embarrassingly parallel, since each point can be reordered independently. Given the low runtimes we obtain compared to the cost of solving the linear systems during probing, we have not investigated this option. Further, we observe that the dimensionality of the lattice does not impact the performance of the algorithm.

Lattice     Original Method Time (ms)   Extended Method Time (ms)   Time ratio
8^4                 12                          39                     3.3
16^4               187                         673                     3.5
32^4              3141                       12156                     3.8
4096^2           56228                      277045                     4.9
256^3            55435                      266676                     4.8
64^4             54598                      252687                     4.6

Table 3.1: Run times of the new algorithm compared to the original. Results obtained on an Intel i7 860 clocked at 2.8 GHz.

Finally, we examine the trace estimate produced on a model lattice problem, and compare it with the trace estimate obtained using a truncated permuted Hadamard matrix produced by the original hierarchical algorithm. This is essentially the same as applying an incorrect red-black coloring to the sublattices at each level, ignoring the links which are miscolored at the borders of each sublattice. While this will cancel the error from the most important parts of the structure of $A^{-1}$, it still leaves out important connections between sublattices. Therefore, we expect larger trace estimate errors, especially as the algorithm goes further down the coloring hierarchy. As we can see in Figure 3.14, the new algorithm does indeed perform significantly better, providing a much better trace estimate than the original method.

3.7 Conclusion

We have provided several extensions to the algorithm for hierarchically coloring and probing lattices. By formalizing the use of sublattices in the algorithm, we have made the algorithm easier to reason about. This allowed us to improve its flexibility, enabling the

Figure 3.14: Comparison of the two methods on a 2D lattice with common factors 2 × 2 × 3 × 3 × 5 (relative trace error versus number of probing vectors, for the new and original methods). For the common factors of two, the methods are the same, but once these are exhausted the improved method has much lower error.

algorithm to handle lattices with arbitrary dimensions, as long as the sizes of those dimensions share common prime factors. These improvements come at minimal computational cost and retain the ease of parallelization that was an attractive feature of the original algorithm. Finally, we have introduced a method of creating probing vectors, both for the case where the colorings split evenly into the same number of colors, and for the case where the coloring does not split evenly. We note that these methods of creating probing vectors can be applied to any matrix, not just those arising from lattices.

Chapter 4

Estimation of diag(f(A)) in the general case

In Chapter 3 we introduced methods to compute Diag(f(A)) in the case where the A in

question arises from PDEs with certain geometric properties. Unfortunately, there are

many interesting applications such as those discussed in 2.1.1 and 2.1.3, which give rise

to matrices that do not have these useful properties.

This forces us to return to the original probing approaches for discovering structure in

the matrix [62]. As discussed in Section 2.2.2, probing exploits the structure of f(A) to achieve an accurate result more quickly than statistical methods, yet can be used on matrices with an unknown structure. The basis of the method is to take some polynomial approximation $q_n(A) \approx f(A)$. If the convergence of this polynomial approximation is fast, this yields information about the most important elements of f(A). The original approach involved using a Neumann series polynomial for $A^{-1}$, which implies that, if the diagonal elements

of A are nonzero, the nonzero structure of $q_n(A)$ is the same as that of $A^n$. The graph coloring of $A^n$ can then be used as an approximate coloring for the graph of $A^{-1}$, in the

sense that the coloring constraints will only be violated by edges of small weights. This

provides useful information about the structure of $A^{-1}$, which can be used as in Figure

2.3. Unfortunately, probing requires raising A to high powers to achieve more accurate

results. This is not feasible for many matrices, since it can require a significant amount of computation and, perhaps even more importantly, a significant amount of storage.

In this chapter we propose methods based on the structural and spectral characteristics of f(A) directly. These methods can be combined with the statistical methods of [22] to achieve the attractive features of these statistical approaches, as well as the benefits of probing.

4.1 Graph Coloring

Probing as introduced in Chapter 2.2.2 works by attempting to find the most important elements of $A^{-1}$, or equivalently, a sparsified version of $A^{-1}$ matching the largest magnitude elements of $A^{-1}$. It does this by taking a polynomial approximation $q_n(A) \approx A^{-1}$. In the original proposal of probing, the polynomial was chosen as the Neumann series. If the polynomial converges, then its graph approximates a sparsified structure of $A^{-1}$ where the most important (large in magnitude) elements are kept. The goal is to find a degree for which the polynomial graph is sufficiently sparse to be colored with a reasonable number of colors. This is because if the graph associated with a matrix has a k-coloring,

then the diagonal of A can be recovered with k matrix-vector multiplications. To see how this recovery is possible, consider the structure of the matrix if all the nodes that share a color are reordered to be adjacent. Since sharing a color implies a lack of communication between the nodes, the matrix can be permuted into a block diagonal form, as shown in

Figure 2.3. Then, if we create the $p_m$ probing vector as in (2.4), then $A p_m$ holds the

diagonal elements in the positions of the m-th block.
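The recovery step can be sketched as follows. The Python fragment below is an illustration that uses A directly in place of f(A) and takes the probing vectors to be the indicator vectors of the colors; it is not meant to reproduce (2.4) exactly:

import numpy as np

def diag_by_probing(A, colors):
    """colors[i] is the color (0..k-1) of node i; recover the diagonal with k matvecs."""
    N = A.shape[0]
    k = max(colors) + 1
    diag = np.zeros(N)
    for m in range(k):
        p = (np.array(colors) == m).astype(float)   # probing vector of color m
        Ap = A @ p
        diag[p == 1] = Ap[p == 1]                    # exact when the coloring is valid
    return diag

# Toy check: a tridiagonal matrix is distance-1 colorable with 2 colors (even/odd nodes).
A = np.diag(np.arange(1.0, 7.0)) + np.diag(np.ones(5), 1) + np.diag(np.ones(5), -1)
print(diag_by_probing(A, [i % 2 for i in range(6)]))   # [1. 2. 3. 4. 5. 6.]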

Since coloring algorithms only consider if two nodes are connected or not, the weights

of the edges are not used. Thus instead of the entire Neumann series $q_n(A)$, only $A^n$ need be computed, since the structure of the two matrices is the same. Since $A^n$ shows all

the connections between nodes after n hops, we can either find a distance 1 coloring of

$A^n$, or a distance n coloring of A. Both operations have similar complexity. Once the

probing vectors $p_m$ have been obtained, we use an iterative method to solve $Ax = p_m$.

Since iterative methods applied for large matrices may be expensive and slow to converge

to the desired accuracy, we need to minimize the number of colors used in order to keep

the number of probing vectors produced low.

Problems arise when raising A to a high enough power n for $\|q_n(A) - A^{-1}\|$ to be small. Even if A is sparse, $A^n$ can become dense quickly, which increases both the computational

and the storage requirements of the method. Additionally, while probing is intended

to exploit the structure of A, it throws out interesting structural information by not

considering the strength of the connections. Finally, the colors produced by probing at

different levels are unlikely to be nested subsets of each other. Nodes which previously

were given separate colors may later on be assigned the same color, which means that

results with previously generated probing vectors cannot be reused if increased accuracy

is desired; the entire computation must be repeated from scratch.

In this chapter we investigate two types of methods for providing better colorings than

probing. The first set of methods are similar to probing in that they are structural; they

try to directly detect and exploit the structure of f(A).

The second type of method attempts to use computed spectral information of A to

find an appropriate coloring for f(A)—similar to the well known techniques of spectral

clustering. While the algorithm we have developed is not very efficient, it serves as an

interesting way to examine the issues involved with the design of spectral methods, and a

starting point for future research into these types of approaches. It also provides an upper

bound on how well such methods could work.

We combine both types of methods with the statistical methods of [22] discussed in

Chapter 2.2.1, observing in many experiments that we can obtain a more accurate error

bound for Diag(f(A)) than would be possible using only statistical methods.

4.2 Statistical Considerations

While both [62] and [27] consider probing and probing-like ideas from the standpoint of a deterministic process, in practice our experiments show the error from the statistical methods in [22] to be much smaller than the deterministic error reported by probing.

Additionally, many applications (such as QCD) require information about the error and require the estimator to be unbiased. Fortunately, it is possible to combine statistical methods with probing, as shown in [66]. By generating a random vector $\zeta_m$ for each color block m and performing elementwise multiplication with each probing vector to form $p_m \odot \zeta_m$, we make our estimator statistically unbiased. In addition, if probing correctly identifies the smallest elements of $A^{-1}$ and groups them into colors, then the variance and thus the error estimate will be reduced, while providing the added benefit of a statistical confidence interval for the results.
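As a sketch of this combination (using A directly in place of f(A), and a single Rademacher vector for simplicity), the following Python fragment forms the vectors $p_m \odot \zeta$ and accumulates an unbiased diagonal estimate:

import numpy as np
rng = np.random.default_rng(0)

def unbiased_probing_diag(A, colors):
    """Probing vectors masked by Rademacher noise; E[est] = diag(A)."""
    N = A.shape[0]
    colors = np.asarray(colors)
    zeta = rng.choice([-1.0, 1.0], size=N)           # Rademacher noise
    est = np.zeros(N)
    for m in range(colors.max() + 1):
        x = np.where(colors == m, zeta, 0.0)         # p_m elementwise-multiplied by zeta
        est += x * (A @ x)                           # zeta_i^2 = 1 on the color block
    return est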

As noted in Table 2.1, the variance of Hutchinson's method is $2(\|A\|_F^2 - \sum_{i=1}^N A_{ii}^2)$. If a set of s random vectors $\zeta^k$ is chosen according to [22], and $\zeta_j^k$ refers to the j-th component of the k-th sample, then the diagonal estimator is given in [27] as

$$T_{\mathrm{diag}}A_i = A_{ii} + \sum_{j=1, j\ne i}^{N} a_{ij}\, \frac{\sum_{k=1}^{s} \zeta_i^k \zeta_j^k}{\sum_{k=1}^{s} (\zeta_i^k)^2}. \qquad (4.1)$$
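In practice this estimator is computed by accumulating elementwise products of the samples with their matrix-vector products, as in the following Python sketch (shown for A itself; for f(A) the product A ζ would be replaced by an application of f(A)):

import numpy as np
rng = np.random.default_rng(1)

def stochastic_diag(A, s):
    """Diagonal estimator of the form (4.1) with s Rademacher samples."""
    N = A.shape[0]
    num = np.zeros(N)
    den = np.zeros(N)
    for _ in range(s):
        zeta = rng.choice([-1.0, 1.0], size=N)
        num += zeta * (A @ zeta)      # A_ii + sum_{j != i} a_ij zeta_i zeta_j per sample
        den += zeta * zeta            # equals s for Rademacher vectors
    return num / den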

If the ζ are i.i.d. drawn from the Rademacher distribution, then for $j \ne i$ the expectation $E[\sum_{k=1}^s \zeta_i^k \zeta_j^k] = \sum_{k=1}^s E[\zeta_i^k \zeta_j^k] = 0$, implying that $E(T_{\mathrm{diag}}A_i) = A_{ii}$. We are interested in computing the variance $E[T_{\mathrm{diag}}A_i^2] - E[T_{\mathrm{diag}}A_i]^2 = E[T_{\mathrm{diag}}A_i^2] - A_{ii}^2$. For this, we notice that $E[(\sum_{k=1}^s \zeta_i^k \zeta_j^k)(\sum_{l=1}^s \zeta_i^l \zeta_m^l)] = s$ only when $k = l$ and $j = m \ne i$. Since $\sum_{k=1}^s (\zeta_i^k)^2 = s$, we have

$$E[T_{\mathrm{diag}}A_i^2] = a_{ii}^2 + 2 a_{ii} \sum_{j=1, j\ne i}^{N} a_{ij}\, E\!\left[ \frac{\sum_{k=1}^s \zeta_i^k \zeta_j^k}{\sum_{k=1}^s (\zeta_i^k)^2} \right] + \sum_{\substack{m,j=1 \\ m,j\ne i}}^{N} a_{ij} a_{im}\, E\!\left[ \frac{\sum_{k=1}^s \zeta_i^k \zeta_j^k}{\sum_{k=1}^s (\zeta_i^k)^2} \cdot \frac{\sum_{l=1}^s \zeta_i^l \zeta_m^l}{\sum_{l=1}^s (\zeta_i^l)^2} \right] = a_{ii}^2 + \frac{1}{s} \sum_{j=1, j\ne i}^{N} a_{ij}^2. \qquad (4.2)$$

Thus, $\mathrm{Var}(T_{\mathrm{diag}}A_i) = \frac{1}{s}\sum_{j=1, j\ne i}^{N} a_{ij}^2$, which means the estimator of each element depends on the off-diagonal elements of its row.

We can now bound how large the standard deviation is relative to the trace, which

provides information on the number of digits that can be achieved using these estimators.

For symmetric positive definite matrices and a given number of samples s, we have

$$\frac{\sqrt{2(\|A\|_F^2 - \sum_i A_{ii}^2)/s}}{\mathrm{Tr}(A)} \le \frac{\sqrt{2\|A\|_F^2/s}}{\mathrm{Tr}(A)} = \frac{\sqrt{2\,\mathrm{Tr}(A^2)/s}}{\mathrm{Tr}(A)} = \frac{\sqrt{2\sum_i \lambda_i^2/s}}{\sum_i \lambda_i} \le \frac{\sqrt{2(\sum_i \lambda_i)^2/s}}{\sum_i \lambda_i} = \frac{\sqrt{2}}{\sqrt{s}}. \qquad (4.3)$$

Although the bound may be pessimistic, it indicates that we should not have problems obtaining good relative estimates. On the other hand, for the diagonal estimator,

$$\frac{\sqrt{\sum_{j=1, j\ne i}^{N} a_{ij}^2}}{\sqrt{s}\,|a_{ii}|} \le \frac{\|A(i,:)\|}{\sqrt{s}\,|a_{ii}|} \le \frac{\|A\|}{\sqrt{s}\,|a_{ii}|}, \qquad (4.4)$$

so the relative error can be quite large for some diagonal entries. One exception, which is often useful in practice, is the case of diagonally dominant matrices. Then,

$$\frac{\sqrt{\sum_{j=1, j\ne i}^{N} a_{ij}^2}}{\sqrt{s}\,|a_{ii}|} \le \frac{\sum_{j=1, j\ne i}^{N} |a_{ij}|}{\sqrt{s}\,|a_{ii}|} \le \frac{1}{\sqrt{s}}, \qquad (4.5)$$

and the stochastic estimator can provide good relative estimates.

It is possible to investigate how well probing performs compared to statistical methods

such as Hutchinson's by modeling the magnitude of the elements removed by probing. As

noted in Table 2.1 the variance of Hutchinson's method is $2(\|A\|_F^2 - \sum_{i=1}^n A_{ii}^2)$. Since in the Hutchinson method the diagonals do not contribute to the variance, in our analysis we need only consider what happens to the off-diagonal part of the matrix. Because we will refer to this portion of the matrix frequently, we define $\tilde A$ as the matrix A with the diagonal removed.

When using Hutchinson's method, the variance comes from the entire $\tilde A$ matrix. In contrast, when using probing with a statistical estimator as above, the variance comes only from inside the block diagonals, such as those seen in Figure 2.3. This is because the contributions from outside the block diagonals are zeroed out. Since the block diagonals are the only parts of the matrix contributing to the variance estimation, we can simply compute $2\|\tilde A_i\|_F^2$ for each block, and then sum the results for all k blocks. This is equivalent to considering k independent Hutchinson methods, one for each color, with variance $2\|\tilde A_i\|_F^2$. Suppose that the k color blocks are all of equal size. Then, if N is the dimension of A, there are $k\,\frac{N^2}{k^2} - N = \frac{N^2}{k} - N$ elements of $\tilde A$ contributing to the variance, i.e., only those in the block diagonal. Equivalently, the k-coloring discards $\frac{k-1}{k} N^2$ elements from contributing to the variance. Assume in addition that we have sorted all the elements of $\tilde A$ in monotonically decreasing order and that increasing the number of colors discards off-diagonal elements starting from the largest in the list. More specifically, $G = \mathrm{sort}(\tilde A_{ij}^2)$ would be an array of $N^2$ elements with the zero diagonal elements at the end. We model

this array as a monotonically decreasing function g(x), where the input $x = \frac{k-1}{k} \in [0, 1]$ is the percentage of the discarded elements.

We can begin to model the variance that is removed given certain assumptions about

g(x). First, we assume that probing finds the largest elements of A˜ in monotonically

decreasing manner and removes them. Then, with 2 probing vectors we have $\mathrm{Var}_2 = \int_{1/2}^{1} g(x)\,dx$, since with two probing vectors half the elements will be discarded. In general, with k probing vectors (i.e., with k colors), $\mathrm{Var}_k = \int_{1-1/k}^{1} g(x)\,dx$. If g(x) is constant, i.e., all the elements of $\tilde A$ are the same, then we have $\mathrm{Var}_k/\|\tilde A\|_F^2 \sim O(\frac{1}{k})$. On the other hand, if $g(x) \sim (1 - x)$, then we have that $\mathrm{Var}_k/\|\tilde A\|_F^2 \sim O(\frac{1}{k^2})$. This analysis implies that even if there is no structure to be exploited, as long as the nodes are divided up into color blocks of equal size, probing will do at least as well as the purely statistical methods of Hutchinson. Therefore algorithms should attempt to make the color sizes as equal as possible. Moreover, as long as significant structure exists in the matrix, probing should be able to outperform statistical methods.

4.3 Structural Methods

As previously discussed, the major problem with probing arises when the series qn(A) converges slowly to f(A). To see this issue in action consider the example in Figure 4.1. In this example, we consider a 4D Laplacian with periodic boundary conditions and compare probing using polynomials of order ranging from 1–8 against computing the pseudo-inverse

$A^\dagger$ directly, dropping the elements of the smallest magnitude (that is, sparsifying the matrix), and then coloring the resultant matrix. We take the pseudo-inverse because this operator has one zero-valued eigenvalue. This coloring of the sparsified version of $A^\dagger$, while not practical, provides information on how close the coloring provided by probing is to the optimal coloring. As we can see, probing is close to the optimal when few colors are considered, but as the number of colors increases the difference between the two methods becomes significant.

To address this, we seek to capture the structure of f(A) more directly. We propose here two methods to achieve this. The first method attempts to capture the off-diagonals of f(A). Many matrices of interest have an inverse with a banded structure. This implies that we should be able to find a coloring that captures most of the structure of f(A).

Indeed, this was behind the original idea of probing, where the authors note that if the coloring distance is increased far enough, then the major off-diagonals of the matrix will be captured. Similarly, the authors of [27] note that if a large enough number of columns of the

81 0

10 Probing Coloring of Sparsified A†

−1 10

−2 10 Relative Trace Error

−3

10 0 1 2 3 4 10 10 10 10 10 Colors

Figure 4.1: Probing vs Coloring the structure of L† directly, where the percentage of the weight of L† retained varies from .1 to .5. in .05 increments. As the number of colors increases probing struggles to capture the structure of L†.

Hadamard matrix is used, the off-diagonals of interest will be removed from contributing

to the error estimate. However, these are both indirect methods of achieving the goal

of discovering which nodes of the underlying graph of f(A) are connected via an off-

diagonal with large values. To gain insight into how it might be possible to discover

such connections, consider the sparsity plot of the matrix containing several off-diagonals

below. If we take a sampling of the columns of the matrix and assume that the line the off-diagonals follow has a slope of 1, then, for any two columns sampled at a distance k from each other, shifting one of the samples up by k makes the non-zero structure of the two samples the same, as can be seen in Figure 4.2, where we sample four columns.

Of course, in the case of f(A), the output is likely to be dense, so matching up samples

of f(A) and seeing where the non-zeros match is not possible. However, if we obtain some

columns of f(A) using an iterative method, and then sparsify them by dropping some

percentage of the elements with the smallest absolute values, we can then compare the

structure of the samples, and check to see where the non-zero values line up. If a significant

number of the sparsified samples share the same non-zero structure, it is likely there is

an off-diagonal at that location. We can then use these off-diagonals to compute an

approximate coloring for f(A). This approach is shown in Algorithm 12.

Figure 4.2: Sampling 4 columns from A and then shifting to detect if they share an off-diagonal.
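The shift-and-match idea behind Algorithm 12 (below) can be sketched in numpy as follows. This is a simplified illustration that uses a cyclic shift and our own function names:

import numpy as np

def detect_offdiags(V, cols, eps):
    """V[:, k] is a sparsified sample of column cols[k] of f(A); return detected offsets."""
    shifted = np.zeros_like(V)
    for k, col in enumerate(cols):
        shifted[:, k] = np.roll(V[:, k], -col)     # align entries on a common off-diagonal
    row_weight = np.abs(shifted).sum(axis=1)       # rows where many samples agree
    return np.nonzero(row_weight >= eps)[0]        # offsets of likely off-diagonals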

The second method we consider is based on the matrix-matrix multiplication approximation method proposed in [2]. They note that if some sampled columns $v = A(:, i)$ of a matrix are obtained, then $vv^T \approx A^2$. Since we have obtained the columns v of f(A) to estimate the off-diagonals of f(A), we have $vv^T \approx f(A)^2$. Since we want $ww^T = f(A)$,

we perform a QR factorization of v. Then $v = QR$ and $(vv^T)^{1/2} = (QRR^TQ^T)^{1/2} =$

$Q(RR^T)^{1/2}Q^T$. We can obtain $(RR^T)^{1/2}$ using the SVD, since R is a small dense matrix.

This procedure is shown in Algorithm 13.
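A numpy sketch of this construction follows, under the assumption (suggested by the derivation above) that the SVD is applied to the small factor R; compare Algorithm 13 below:

import numpy as np

def create_w(V):
    """V holds sampled columns of f(A); return w with w w^T = Q (R R^T)^(1/2) Q^T."""
    Q, R = np.linalg.qr(V)                 # V = Q R (reduced QR)
    U, s, Vt = np.linalg.svd(R)            # R = U diag(s) Vt  (assumption: SVD of R, not of V)
    return Q @ U @ np.diag(np.sqrt(s)) @ Vt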

The approximation $ww^T$, however, is still dense. To ensure a sparse approximation to f(A), we again sparsify w by dropping the smallest magnitude elements. While it is likely that $\|ww^T - f(A)\|$ will be large, what is important for our application is not the difference in the value of the elements of the two matrices, but how close the orderings of the elements of the matrices are, since these are the elements that probing will discard.

To measure this, consider the two sets $o_1$ and $o_2$ that contain the index pairs (i, j)

of the non-zero elements of the matrices f(A) and $ww^T$, respectively. Define $p_1$ to be the permutation of $o_1$ that sorts the corresponding elements of f(A) from largest to

smallest magnitude. Define also $p_2$, the permutation of $o_2$ that sorts the elements of $ww^T$ from largest to smallest magnitude. Since $ww^T$ is a sparse matrix, it has fewer elements, $z = \mathrm{size}(o_2) \le \mathrm{size}(o_1)$. We then want to check how closely the index pairs (not the matrix

Algorithm 12 [D] = DetectOffdiags(v, L, ε)
% Input: v columns of f(A), L array containing column locations, a tolerance ε
% Output: D a matrix of important off-diagonals
1: v ← Sparsify(v)
2: newv ← [ ]
3: for i = 1 → size(v, 2) do newv ← [newv shift(v(:, i), −L(i))]
4: end for
5: newvsums ← sum(abs(newv))
6: diaglocations ← [ ]
7: for i = 1 → size(v, 1) do
8:   if newvsums(i) ≥ ε then diaglocations(i) ← 1
9:   end if
10: end for
11: D ← diag(diaglocations)
12: return D

Algorithm 13 [w] = CreateW(v, ε)
% Input: v columns of f(A)
% Output: w for use in the approximation $ww^T$
1: v ← sparsify(v, ε)
2: [Q, R] ← QR(v)
3: [U, S, V] ← SVD(R)
4: w ← Q U S^{1/2} V^T
5: return w

elements) in $o_2(p_2)$ match the first z index pairs in $o_1(p_1)$ by computing $o_1(p_1) \cap o_2(p_2)$. If the size of the intersection is equal to z, then we have the best possible approximation over

all matrices with that number of non-zero elements, in the sense that we have captured

the z most important elements of f(A).
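This overlap measure can be sketched as follows (an illustrative Python fragment; for simplicity it compares the z largest-magnitude index pairs of the two matrices rather than restricting to declared non-zeros):

import numpy as np

def top_index_pairs(M, z):
    """The z index pairs of M with the largest magnitudes."""
    flat = np.argsort(np.abs(M), axis=None)[::-1][:z]
    return set(zip(*np.unravel_index(flat, M.shape)))

def structure_overlap(fA, wwT, z):
    """Fraction of the z most important index pairs of f(A) captured by w w^T (1.0 is best)."""
    return len(top_index_pairs(fA, z) & top_index_pairs(wwT, z)) / z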

It is interesting to contrast the parts of the structure of f(A) that the two methods

find. Algorithm 12 attempts to detect the global structure of the matrix, while Algorithm

13 detects more local structure. Figure 4.3 shows the effect of progressively increasing the

number of columns from f(A). As the number of columns increases, the local areas of the

matrix f(A) that are well approximated expand. Our experiments reveal that there is a

transition point with this method, where after a certain number of columns, the coloring

produced starts to be extremely effective, although for large matrices this point may come

too late to be practical. A similar transition point can be seen with the density, where

as the sparsity of v decreases, there is a point where the coloring becomes substantially

better, although again, this point can come too late to be practical.

Figure 4.3: Approximation of the pseudo-inverse of a Laplacian with periodic boundary conditions with 10, 100, and 1000 vectors in the $vv^T$ approximation. Here the vectors v are not sparsified; the figure shows how, in the best case of an unsparsified v, $vv^T$ contains mostly local structure until a significant number of vectors v are supplied.

A significant drawback of these two methods is that it is difficult to provide any bounds on the error associated with using the colorings they produce. Matrices which have a significant number of off-diagonals or other highly ordered structure will likely have this structure captured by these methods. Less structured matrices may be more challenging, and it may be hard to tell the difference a priori. Another drawback is that it is difficult to tell what level of sparsification should be applied to the vectors used by these methods.

One possible method is for the user to supply a maximum number of colors, representing the limit on the number of systems f(A)z to be solved, and then perform a bisection on the

densities of the two methods, doubling or halving the value until the required number of

colors is obtained. However, this still leaves open the question of the starting densities

to be used. While for some classes of matrices the user may have a good idea what will

work best, in the general case it is not possible to know the optimal value. A possible

heuristic is to examine the sampled columns v, and determine where the sharpest drop is

in the value of the sorted components. This point can then be used as a starting point

which can be refined by the previously described bisection method.

While we leave further investigation for future work, we note the similarities between

this method and the Nyström method [3] for approximating symmetric positive semi-def-

inite matrices. In the classic Nyström method a certain number of columns C are ran-

domly selected from the matrix A for which an approximation is desired. The intersec-

tion matrix W is then formed by finding the intersection of the rows and columns of C, or C = A(:, cols); W = C(cols, :); in MATLAB notation. The pseudoinverse $W^\dagger$ is then formed. If many columns of C are selected, it may not be feasible to find the pseudoinverse

of W, in which case the best rank-k approximation $W_k^\dagger$ is formed. Then $A \approx C W^\dagger C^T$. There are several variations on this method which are possible, such as taking an ensemble of such approximations or replacing the intersection matrix W with the matrix

$C^\dagger A (C^\dagger)^T$. The question then is how to best select C. If additional columns of C could be selected adaptively, this method could allow the derivation of error bounds, in contrast to our current method for which it is difficult to prove any bounds. If the bounds are tight

enough, it might even make sense to directly compute $\mathrm{Tr}(C W^\dagger C^T)$ or $\mathrm{Diag}(C W^\dagger C^T)$,

and dispense with coloring entirely, relying on the theoretical error bounds instead of the

statistical error measurements for guarantees for the accuracy.
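For concreteness, a minimal MATLAB sketch of the basic (non-adaptive) Nyström construction described above might look as follows; the column count and the uniform sampling are illustrative choices, not the adaptive scheme of [3].

% Basic Nystrom sketch for a symmetric positive semi-definite matrix A.
n    = size(A, 1);
c    = 50;                     % number of sampled columns (illustrative)
cols = randperm(n, c);         % uniform sampling without replacement
C    = A(:, cols);
W    = C(cols, :);             % intersection matrix
Anys = C * pinv(W) * C';       % A is approximated by C W^+ C^T
d    = diag(Anys);             % Diag(C W^+ C^T), as discussed above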

4.4 Spectral Methods

Spectral methods for finding communities in a graph have been well understood for some

time [6]. The general basis for these methods comes from the graph partitioning problem.

Given a graph G = (V,E) and associated graph Laplacian L,

L_{i,j} := \begin{cases} \deg(v_i) & \text{if } i = j, \\ -1 & \text{if } i \neq j \text{ and } v_i \text{ is adjacent to } v_j, \\ 0 & \text{otherwise,} \end{cases} \qquad (4.6)

the nodes must be divided into groups of nearly equal size, S and its complement \bar{S}, in such a way that the weight of the edges between the groups is minimized. We can formalize this problem as follows. Define φ(S) as

\phi(S) = \frac{|E(S, \bar{S})|}{\min(|S|, |\bar{S}|)} = \frac{\sum_{(i,j) \in E} (x_i - x_j)^2}{\sum_{i<j} (x_i - x_j)^2}. \qquad (4.7)

Then define the isoperimetric number of a graph as the value of the minimum cut

\phi_{opt} = \min_{S \subseteq V} \phi(S). \qquad (4.8)

Unfortunately, this problem is NP-Complete. However, if instead of requiring the elements of the solution vector x to be in {−1, 1}, the problem is relaxed by allowing the solution elements to take real values, then we can obtain a solution as follows

\phi_{opt} \approx \min_{x \in \mathbb{R}^n} \frac{\sum_{(i,j) \in E} (x_i - x_j)^2}{\sum_{i<j} (x_i - x_j)^2} = \min_{x \in \mathbb{R}^n} \frac{\sum_{(i,j) \in E} (x_i - x_j)^2}{n \sum_{i=1}^{n} x_i^2} = \min_{x \in \mathbb{R}^n} \frac{x^T L x}{n\, x^T x}. \qquad (4.9)

By the Courant–Fischer theorem, this is minimized by v_2, the eigenvector of the second smallest eigenvalue λ_2 of L, known as the Fiedler vector. While this relaxation is sufficient for regular graphs, the normalized cut is frequently preferred for irregular graphs. If we define vol(S) = \sum_{v_i \in S} \deg(v_i), then the normalized edge cut is defined as

\hat{\phi}(S) = \frac{|E(S, \bar{S})|}{\min(\mathrm{vol}(S), \mathrm{vol}(\bar{S}))}. \qquad (4.10)

In contrast to (4.7), (4.10) divides by the total degree of the nodes in each part instead of simply the number of vertices. This generally turns out to be a more robust measure

[6]. In this case, the solution of the relaxed problem is related to the spectrum of the normalized graph Laplacian, \mathcal{L}, defined as in (4.11), where D is the diagonal matrix of

the degrees of the nodes of the graph, and L is the graph Laplacian of (4.6),

\mathcal{L} = D^{-1/2} L D^{-1/2}. \qquad (4.11)

By performing the same relaxation as in (4.9), we obtain the Cheeger bounds on \hat{\phi}_{opt}, where

λ_2 is the second smallest eigenvalue of \mathcal{L},

\frac{\hat{\phi}_{opt}^{2}}{2} \le \lambda_2 \le 2\hat{\phi}_{opt}. \qquad (4.12)

It is possible to continue this process of finding an approximation to the best edge cut recursively by computing the Fiedler vector of the induced subgraph of each partition.

This allows for an arbitrary number of partitions of the graph to be obtained. This process is known as recursive spectral bisection [7].
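A minimal MATLAB sketch of one bisection step based on (4.6) and (4.11) is shown below; W is assumed to be a symmetric adjacency matrix with no isolated vertices, and a full eigendecomposition is used only because this is a sketch.

% One step of spectral bisection using the normalized Laplacian (4.11).
d  = sum(W, 2);
L  = diag(d) - W;                                  % graph Laplacian (4.6)
Ln = diag(1 ./ sqrt(d)) * L * diag(1 ./ sqrt(d));  % normalized Laplacian
[V, E]   = eig(full(Ln));
[~, ord] = sort(diag(E));                          % ascending eigenvalues
fiedler  = V(:, ord(2));                           % second smallest eigenvector
S    = find(fiedler >= median(fiedler));           % one side of the cut
Sbar = find(fiedler <  median(fiedler));           % its complement
% Recursing on W(S,S) and W(Sbar,Sbar) gives recursive spectral bisection.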

While recursive bisection is the simplest method of spectral clustering, there are many other proposals. In fact, [8] lists five distinct classes of algorithms for finding k-clusters, based on the lower part of the spectrum of L. The first of these is described as linear ordering, where the nodes are reordered based on the smallest eigenvector recursively, i.e., recursive spectral bisection. We investigate a heuristic based on this idea, and leave

a more thorough investigation of ways to adapt algorithms for spectral clustering to our problem of interest for future work.

In seeking to exploit the structure of a graph for trace estimation, probing tries to solve the opposite problem to that of community detection. Instead of trying to find communities with a large amount of intra-community interactions, probing seeks to find groups of nodes that have no or few interactions between each other. In this case, [9] has shown that the function we should seek to optimize is

\bar{\phi}(S) = \frac{2|E(S, \bar{S})|}{\mathrm{vol}(S) + \mathrm{vol}(\bar{S})}, \qquad (4.13)

a measure analogous to the normalized cut in (4.10). As in the community detection problem, we seek to optimize this function as

\bar{\phi}_{opt} = \max_{S} \frac{2|E(S, \bar{S})|}{\mathrm{vol}(S) + \mathrm{vol}(\bar{S})}. \qquad (4.14)

While this is again an intractable problem, the relaxed version admits a solution related to the spectrum of \mathcal{L}, with the solution vector being the eigenvector of the largest eigenvalue of the normalized graph Laplacian. It is possible to bound the error of the relaxed version of this problem using an adjusted version of the Cheeger bounds [9],

\frac{(1 - \bar{\phi}_{opt})^{2}}{2} \le 2 - \lambda_N \le 2(1 - \bar{\phi}_{opt}). \qquad (4.15)

The relationship between (4.14) and the relaxed quadratic form is more difficult to see in this case than it is for (4.10). The derivation of the upper and lower bounds in (4.15) is rather involved and can be found in [9]. Here we present the derivation only of the upper bound. In [9] the authors partition the graph nodes into three sets: V_1 and V_2 are the two nearly bipartite sets, and V_3 = V \setminus (V_1 \cup V_2) contains the rest of the nodes, which cannot be placed in either bipartite set. They then note that vol(V_i) = \sum_{j=1}^{3} |E(V_i, V_j)|. With this, (4.14) can be rewritten as

\bar{\phi}_{opt} = \max_{V_1, V_2} \frac{2|E(V_1, V_2)|}{\sum_{j=1}^{3} |E(V_1, V_j)| + \sum_{j=1}^{3} |E(V_2, V_j)|}. \qquad (4.16)

Hence, the connection to a quadratic form based on the eigenvectors of \mathcal{L} follows,

\begin{aligned}
\lambda_N &= \max_{x \in \mathbb{R}^n} \frac{x^T \mathcal{L} x}{x^T x} \\
&\ge \frac{\left(\frac{1}{\mathrm{vol}(V_1)} + \frac{1}{\mathrm{vol}(V_2)}\right)^{2} |E(V_1,V_2)| + \frac{1}{\mathrm{vol}(V_1)^2}\,|E(V_1,V_3)| + \frac{1}{\mathrm{vol}(V_2)^2}\,|E(V_2,V_3)|}{\frac{1}{\mathrm{vol}(V_1)} + \frac{1}{\mathrm{vol}(V_2)}} \\
&\ge \frac{(\mathrm{vol}(V_1) + \mathrm{vol}(V_2))^{2}}{2\,\mathrm{vol}(V_1)\,\mathrm{vol}(V_2)} \cdot \frac{2|E(V_1,V_2)|}{\mathrm{vol}(V_1) + \mathrm{vol}(V_2)} + \frac{\min(\mathrm{vol}(V_1),\mathrm{vol}(V_2))}{\max(\mathrm{vol}(V_1),\mathrm{vol}(V_2))} \cdot \frac{|E(V_1 \cup V_2, V_3)|}{\mathrm{vol}(V_1) + \mathrm{vol}(V_2)} \\
&\ge 2\bar{\phi}_{opt} + \frac{\min(\mathrm{vol}(V_1),\mathrm{vol}(V_2))}{\max(\mathrm{vol}(V_1),\mathrm{vol}(V_2))} \cdot \frac{|E(V_1 \cup V_2, V_3)|}{\mathrm{vol}(V_1) + \mathrm{vol}(V_2)} \\
&\ge 2\bar{\phi}_{opt}.
\end{aligned}

Here the second line is obtained by evaluating the quotient at x = D^{1/2} y, where y takes the value 1/vol(V_1) on V_1, -1/vol(V_2) on V_2, and 0 on V_3.
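To make the quantities in (4.13)–(4.16) concrete, a small MATLAB sketch that extracts a near-bipartite split from the top eigenvector of the normalized Laplacian and evaluates the dual cut measure might read as follows (Ln is the normalized Laplacian of (4.11) and W the adjacency matrix; both are assumed given).

% Near-bipartite split from the eigenvector of the largest eigenvalue of Ln.
[V, E]   = eig(full(Ln));
[~, ord] = sort(diag(E), 'descend');
vtop = V(:, ord(1));                       % eigenvector of the largest eigenvalue
S    = find(vtop >= 0);                    % one side of the split
Sbar = find(vtop <  0);                    % the other side
volS    = sum(sum(W(S, :)));               % vol(S)
volSbar = sum(sum(W(Sbar, :)));            % vol(Sbar)
phibar  = 2 * sum(sum(W(S, Sbar))) / (volS + volSbar);   % dual cut (4.13)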

4.4.1 Spectral k-partitioning for the matrix inverse

Since our goal is to find the structure of the matrix inverse, we turn our attention to L†, instead of L. We need the pseudoinverse because the graph Laplacian is singular, L D^{1/2}\mathbf{1} = 0. Note that L† D^{1/2}\mathbf{1} = 0 as well, and that L† has the same eigenvectors as L, only ordered in the opposite order. Let us consider L† as a weighted graph with adjacency matrix,

Γ = L† − Diag(L†), and ignore the fact that some weights may be negative. Then for the

(weighted) Laplacian of Γ,

L_{\Gamma} = \mathrm{Diag}(\Gamma \mathbf{1}) - \Gamma = \mathrm{Diag}\!\left(L^{\dagger}\mathbf{1} - \mathrm{Diag}(L^{\dagger})\mathbf{1}\right) - L^{\dagger} + \mathrm{Diag}(L^{\dagger}) = -\mathrm{Diag}(L^{\dagger}) - L^{\dagger} + \mathrm{Diag}(L^{\dagger}) = -L^{\dagger}. \qquad (4.17)

This implies that the graph Laplacian of the pseudoinverse shares the same eigenvectors as L, in the same order. Thus, the same partition is a solution to the relaxed version

of (4.16) for both matrices.
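As a quick numerical sanity check of (4.17), the identity can be verified on a small random graph (a throwaway MATLAB sketch; any connected graph will do):

% Verify numerically that the Laplacian of Gamma equals -pinv(L).
n = 20;
W = double(rand(n) > 0.7);  W = triu(W, 1);  W = W + W';  % random adjacency
L = diag(sum(W, 2)) - W;                                   % graph Laplacian
Ldag   = pinv(L);                                          % pseudoinverse
Gamma  = Ldag - diag(diag(Ldag));                          % adjacency of (4.17)
LGamma = diag(Gamma * ones(n, 1)) - Gamma;                 % its weighted Laplacian
norm(LGamma + Ldag, 'fro')                                 % should be close to 0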

This suggests a connection between spectral methods and probing. Probing

considers the powers A^n of A in order to learn about the structure of f(A). Similarly, the power method takes matrix-vector products with A on a starting vector v_0, computing A^n v_0, which is known to converge to the direction of the largest eigenvector. Thus, spectral methods short-cut the intermediate steps of the low order polynomial representations, and skip straight to the same information that the highest order polynomials provide.
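In MATLAB terms, this observation is just power iteration (a schematic fragment; A is assumed to be any symmetric matrix with a dominant eigenvalue):

% A^n * v0 aligns with the dominant eigenvector of A.
v = randn(size(A, 1), 1);
for k = 1:100
    v = A * v;
    v = v / norm(v);       % normalized matrix-vector products
end
% v now approximates the eigenvector of the largest-magnitude eigenvalue.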

The idea of finding communities with a low number of intra-community links using the top eigenvectors suggests that many of the algorithmic ideas used for spectral clustering might be adapted to the opposite problem. An obvious first algorithm is to obtain the largest eigenvector of the graph, and use it to divide the graph into two groups that are as close to bipartite as possible. Then we apply the method recursively on the induced subgraph for each group. We continue until a maximum number of groups is reached, or until the largest eigenvalue of the Laplacian of the partitioned matrices is too small to allow for a good bipartitioning. This method has the obvious advantage that the colors we produce will be nested subsets of each other, allowing us to continue probing without discarding previous results.

Unfortunately, this method has limited practicality when applied to f(A) = L†. While the eigenvectors needed at the first level are easily obtainable since they match those of

L, the required eigenvectors at the next level are more difficult to compute. The recursive process requires the eigenvectors of submatrices of L†, but we do not explicitly have these submatrices available. We could use iterative methods to compute their eigenvectors, but each iteration would need a matrix-vector product with the submatrix we do not have.

Therefore, for each matrix-vector product and for each submatrix, we must solve a linear system with L. With even a few color blocks, this may cause the process to become infeasible. Further, the diagonal elements of the submatrices do not quite correspond to the elements of the graph Laplacians of the subgraphs of L†, meaning we will obtain partitions which are not quite correct.

Nevertheless, we report this algorithm for the following reasons. First, if the discrepancy between the diagonal elements of L† and the appropriate Laplacian is not large, which is the case in our observations, then the algorithm has the same theoretical support as recursive spectral bisection for partitioning. In fact, for a given number of colors, the algorithm finds very good quality colorings, so it serves as a proxy for an upper bound on what spectral methods can achieve. Second, while the approach is not practical for f(A) = L†, it may be practical for other f(A), such as f(A) = A^n. We show our approach

in Algorithm 14.

Algorithm 14 c ← SpectralBisection(L, k)
% Input: Laplacian matrix L, desired number of partitions k
% Output: a coloring c
1: [evecs, evals] ← eigs(L)
2: n ← size(L, 1)
3: % Get a permutation p that sorts the leading eigenvector v = evecs(:, 1)
4: [vs, p] ← sort(evecs(:, 1))
5: ip(p) ← [1 : n]  % inverse permutation
6: m ← floor(n/2)
7: if k == 0 then
8:   c(p(1 : m)) ← 1
9:   c(p(m+1 : n)) ← 2
10: else
11:   L1 ← L(p(1 : m), p(1 : m))  % principal submatrices in the sorted ordering
12:   L2 ← L(p(m+1 : n), p(m+1 : n))
13:   smallc1 ← SpectralBisection(L1, k − 1)
14:   smallc2 ← SpectralBisection(L2, k − 1)
15:   nextC ← [smallc1, smallc2 + max(smallc1)]
16:   % Return coloring to the original ordering
17:   c ← nextC(ip)
18: end if
19: return c
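As a usage sketch (assuming the listing above is implemented as a MATLAB function of the same name), nested colorings are obtained simply by increasing the recursion depth:

% Hypothetical driver: each deeper call refines the previous coloring.
c2 = SpectralBisection(L, 0);   % 2 colors
c4 = SpectralBisection(L, 1);   % 4 colors, splitting each 2-color group
c8 = SpectralBisection(L, 2);   % 8 colors, splitting each 4-color group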

4.5 Experimental Results

In this section we present two sets of experimental results. The first set are Laplacian matrices from various graphs, chosen to help test our spectral approach. For these matrices, we form the pseudoinverse L†, since the matrices are singular, and then use this inverse to directly compute the variance and the error. The second set are various matrices selected from the Matrix Market sparse matrix collection [12], chosen because they were used as test cases for [27]. Since these matrices were all chosen to be invertible, we form A^{-1} directly, and then use this inverse to compute the variance and error.

[Figure: variance (left panel) and relative trace error (right panel) versus number of colors, comparing the hierarchical spectral, probing, structural, and pure statistical methods.]

Figure 4.4: The variance and relative trace error of the trace estimator for the 8^4 Laplacian with periodic boundary conditions. (Density range .1–.5; 50 vectors v; 20 diagonals; probing power 9.)

While only the graph Laplacians have results for our spectral method, we show the results of our structural approaches for all the graphs. For each matrix we take 50 random columns v, and then apply the sparsity listed with each figure. The number of columns in v is held constant, but the density is varied. We select a starting percentage of sparsification using the heuristics we described in Section 4.3. We then increase this density every step. While we initially increase the density by increments of 10 percent, if this proves to be either too little or too much change, we use recursive bisection to adjust the increment. This process continues until the maximum color budget is reached. This maximum budget varies from matrix to matrix, because some of the examples are small, but is always set to be less than one thousand, which is the maximum we expect users to employ in real-world scenarios. We also use our heuristics to pick the number of diagonals the model uses, which is then held constant throughout the process. For the first set of

93 3 −2 10 10 Hierarchical Spectral Hierarchical Spectral Probing Probing 2 Structural Methods Structural Methods 10 Pure Statistical Pure Statistical

−3 10 1 10

0

Variance 10 −4 10

−1 Relative Trace Error 10

−2 −5

10 0 2 4 10 0 2 4 10 10 10 10 10 10 Colors Colors Density Range Vs Probing Powers .1 − .47 50 8 4

Figure 4.5: The variance and relative trace error of the trace estimator for a randomly generated scale free graph. experiments we show only the statistical approach described in Section 4.2, forming new probing vectors as ζ P , where ζ is a noise vector. For the second set of results we show in addition the results with the original Hadamard probing vectors without the statistical approach, since these matrices were originally chosen to show how Hadamard vectors can remove specific diagonals. Finally, we also show how much the Hadamard method reduces the variance, which indicates how well it would preform if used as the starting point of a

Monte Carlo method.

The first graph we examine in Figure 4.4 is the 4D Laplacian differential operator with periodic boundary conditions on a regular lattice. These types of graphs occur in

LQCD. This is the graph Laplacian test case where the structural methods and classical probing perform the best, remaining competitive with the spectral method. This is most likely due to the highly structured nature of the lattice, where certain off-diagonals take up most of the weight of A^{-1}. Next we examine a synthetic scale-free graph with 10000 nodes, generated using the CONTEST MATLAB toolbox [13], the results of which we

see in Figure 4.5.

[Figure: variance (left panel) and relative trace error (right panel) versus number of colors, comparing the hierarchical spectral, probing, structural, and pure statistical methods.]

Figure 4.6: The variance and relative trace error of the trace estimator for the wiki-vote graph. (Density range .1–.68; 50 vectors v; 1 diagonal; probing power 3.)

A scale-free graph is a network which has a degree distribution following a power law, P(k) \sim k^{-\gamma}, where k is the degree and γ is a constant chosen

to best fit the observed data. These types of graphs are of interest because many real

world networks such as social networks, the internet, and semantic networks [14, 15, 16],

are thought to be networks of this type. Here the spectral method outperforms all other

approaches, with the structural methods performing the worst. The final two graphs in Figures 4.6 and 4.7 are small social network graphs, showing a Wikipedia voting network,

and a p2p gnutella file sharing network. In both social network experiments we see the

structural methods perform poorly, while the hierarchical spectral method works well for

both graphs.

All these graphs have the advantage that they are small enough to be directly inverted

so that the actual error and variance may be determined. In all the example graphs the

spectral methods tend to perform well, while probing and other structural methods tend to perform poorly on the less structured graphs, especially when it comes to variance

reduction. More importantly, the variance of the spectral method tends to reduce faster than the Monte Carlo method, which shows promise for developing approximate spectral algorithms that have a similar effect.

[Figure: variance (left panel) and relative trace error (right panel) versus number of colors, comparing the hierarchical spectral, probing, structural, and pure statistical methods.]

Figure 4.7: The variance and relative trace error of the trace estimator for the p2p-gnutella05 graph. (Density range .1–.92; 50 vectors v; 1 diagonal; probing power 4.)

Our structural methods have better success on the second set of matrices than they do on the graph Laplacian test cases. In general, they provide an improvement over probing and purely statistical methods. In the gre512 and orsreg test cases, the structural methods perform as well as or slightly better than probing, probably because both methods are finding the same dominant off-diagonals. The mhd416 matrix achieves similar results as probing for the range of colors it explores, but terminates early, because even with very low amounts of sparsification, the algorithm can detect very little structure. The nos6 and bcsstk07 matrices start off as being comparable to probing, but start to surpass it significantly as the density increases. The af23560 matrix is unique among the test cases because the structural methods never manage to surpass probing. However, the af23560 matrix clearly has little structure to make use of, since statistical methods, probing, Hadamard, and our method all perform the same.

[Figure: variance (left panel) and relative trace error (right panel) versus number of colors, comparing the Hadamard, probing, structural, and pure statistical methods.]

Figure 4.8: The variance and relative trace error of the trace estimator for the gre512 matrix. (Density range .1–.9; 50 vectors v; 418 diagonals; probing power 16.)

Overall, our results indicate that our proposed structural methods can provide an improvement over probing at relatively low cost, even for matrices where probing should be expected to perform very well, that is, matrices with a sharp decay away from the diagonal.

4.6 Conclusions

We have presented two classes of methods for estimating the trace of the inverse of an arbitrary matrix, which we have experimentally shown to be superior to both probing and purely statistical methods. The first class of these methods seeks to detect and exploit structural features of the matrix, and the second class seeks to use spectral information of the matrix to leverage ideas from algorithms designed for community detection and partitioning. We have also identified several promising areas of future research. In the

[Figure: variance (left panel) and relative trace error (right panel) versus number of colors, comparing the Hadamard, probing, structural, and pure statistical methods.]

Figure 4.9: The variance and relative trace error of the trace estimator for the orsreg matrix. (Density range .1–.99; 50 vectors v; 235 diagonals; classical probing power 16.)

structural arena, the prospect of leveraging research in adaptive sampling for use with the

Nyström method may make these types of approaches more useful, due to the potential to use some of the theoretical error bounds these ideas provide, making the structural results more reliable. In the spectral arena, the challenge is to find algorithms that approximate the recursive spectral bisection of Algorithm 14 efficiently. Moreover, there are many other classes of spectral clustering algorithms which we have not yet experimented with that could yield improved algorithms for finding good colorings.

[Figure: variance (left panel) and relative trace error (right panel) versus number of colors, comparing the Hadamard, probing, structural, and pure statistical methods.]

Figure 4.10: The variance and relative trace error of the trace estimator for the mhd416 matrix. (Density range .1–.999; 50 vectors v; 115 diagonals; probing power 16.)

[Figure: variance (left panel) and relative trace error (right panel) versus number of colors, comparing the Hadamard, probing, structural, and pure statistical methods.]

Figure 4.11: The variance and relative trace error of the trace estimator for the nos6 matrix. (Density range .1–.999; 50 vectors v; 181 diagonals; probing power 16.)

[Figure: variance (left panel) and relative trace error (right panel) versus number of colors, comparing the Hadamard, probing, structural, and pure statistical methods.]

Figure 4.12: The variance and relative trace error of the trace estimator for the bcsstk07 matrix. (Density range .1–.999; 50 vectors v; 248 diagonals; probing power 16.)

[Figure: variance (left panel) and relative trace error (right panel) versus number of colors, comparing the Hadamard, probing, structural, and pure statistical methods.]

Figure 4.13: The variance and relative trace error of the trace estimator for the af23560 matrix. (Density range .3–.9; 50 vectors v; 2623 diagonals; probing power 16.)

Chapter 5

Conclusion and future work

In this work we have investigated ways in which Diag(f(A)) may be computed when information about the structure of A is known, and we have explored algorithms for discovering such information when it is not known a priori. Specifically, we have made use of geometric, structural, and spectral methods to improve on the results of probing. We have also developed a method for combining this information with the statistical methods of [22] in order to obtain unbiased methods that also provide statistical error bounds on the estimates. Further, we have provided a framework for analyzing which matrices are likely to see improvement over purely statistical methods, based on how rapidly the largest elements of the matrix decay. Finally, we have performed experiments on many different test cases, ranging from applications in LQCD, social network graphs, engineering, and PDEs. In some cases we obtained an order of magnitude speedup compared to previously known methods.

5.1 Methods for Lattices

Our methods greatly improved prior methods for estimating Diag(f(A)) in matrices arising from PDEs discretized onto lattices. We have introduced two methods: an extremely efficient algorithm based on binary arithmetic, which works when the dimensions of the matrix can be factored solely into powers of two, and another method which is more general. For real-world cases arising in LQCD, we demonstrate significant improvements over prior methods for challenging cases, speeding up the statistical process by an order of magnitude.

5.2 Methods for General Matrices

Further improvement to classical probing appears difficult, due to the limitations of relying on a polynomial of A to approximate the structure of f(A). To circumvent this issue we introduced several methods that seek to approximate the structure of f(A) directly, bypassing the need to compute high order matrix polynomials. There are two main classes of information we use to discover the structure of f(A). The first of these is structural information. By solving for random columns v of f(A), we are able to form two approximations to f(A). First, by looking at the cross-correlation of the vectors at specific lags, we can locate the prominent off-diagonals of f(A). Second, we can form a rough approximation to f(A) by finding vv^T. In many experiments we have found that these two methods taken together provide more accurate colorings than classical probing.

Finally, we also make use of the spectrum of A to find good colorings of f(A) in a process analogous to that used to find clusters in graphs.

Both these methods suggest future avenues of research. The structural methods have close parallels with the Nyström algorithm for sparse matrix approximation. Leveraging some of the approaches for sampling from this method may provide significant improvements to our algorithm. It may also be possible to apply the Nyström algorithm directly, an approach that deserves further study. The spectral methods also have many interesting open questions. First, significant research is still needed to make this algorithm practical. Additionally, there are four other algorithmic ideas used for the graph clustering problem which we have not yet tried to adapt to our problem. Finally, while we have achieved good results with this method for graph Laplacians, it remains to be seen if there is a way to extend this approach to more general classes of matrices.

Bibliography

[1] M. Benzi, E. Estrada, and C. Klymko, Ranking Hubs and Authorities Using Matrix Functions, Linear Algebra and its Applications, SIAM J. Matrix Anal. Appl., 36 (2), pp. 686–706 (2015)

[2] P. Drineas, R. Kannan, and M. W. Mahoney,Fast monte carlo algorithms for

matrices I: approximating matrix multiplication, SIAM Journal on Computing, 36(1),

pp. 132- 157 (2006)

[3] Shusen Wang and Zhihua Zhang, Improving CUR Matrix Decomposition and

the Nystrom Approximation via Adaptive Sampling, Journal of Machine Learning

Research (JMLR), 14: 2729-2769, (2013)

[4] J. Barlow J. Demmel, Computing Accurate Eigensystems Of Scaled Diagonally Dominant Matrices SIAM J. Numer. Anal., v. 27, n. 3, pp. 762-791, (1990)

[5] Y. Hou, Bounds for the least Laplacian eigenvalue of a signed graph Acta Mathe-

matica Sinica, Volume 21, Issue 4, pp 955-960 (2005)

[6] Fan Chung, Spectral Graph Theory, CBMS Regional Conference Series in Mathematics, No. 92, (1996)

[7] A. Pothen, H. Simon, and K. Liou, Partitioning Sparse Matrices with Eigenvectors of Graphs, SIAM J. Matrix Anal. Appl., 11(3), pp. 430–452 (1990)

[8] C. Alpert, A. Kahng, and S. Yao, Spectral partitioning with multiple eigenvectors, Discrete Appl. Math., 90, 1-3 (January 1999)

[9] F. Bauer and J. Jost, Bipartite and neighborhood graphs and the spectrum of the normalized graph Laplacian, Comm. Anal. Geom., 21, no. 4, pp. 787–845 (2013)

[10] S. Liu, Multi-way dual Cheeger constants and spectral bounds of graphs, Advances in Mathematics, Volume 268, pp. 306–338 (2015)

[11] F. McSherry, Spectral partitioning of random graphs FOCS 01: Proceedings of the 42nd IEEE symposium on Foundations of Computer Science, 529, (2001)

[12] R. Boisvert, R. Pozo, K. Remington, R. Barrett, and J. Dongarra, Matrix Market: a web resource for test matrix collections, Proceedings of the IFIP TC2/WG2.5 working conference on Quality of numerical software: assessment and enhancement, pp. 125–137 (1997)

[13] A. Taylor and D.J. Higham CONTEST: A Controllable Test Matrix Toolbox for MATLAB ACM Transactions on Mathematical Software, 35 (4). 26:1-26:17, (2009)

[14] A. Mislove, M. Marcon, K. Gummadi, P. Druschel, and B. Bhattacharjee

Measurement and Analysis of Online Social Networks Proceedings of the 7th ACM SIGCOMM conference on Internet measurement (IMC ’07). ACM, New York, NY,

USA, 29-42,(2007)

[15] L. Li, D. Alderson, J. Doyle, and W. Willinger Towards a Theory of Scale-

Free Graphs: Definition,Properties, and Implications Internet Mathematics Volume

2, Number 4, 431-523, (2005)

[16] M. Steyvers and J. Tenenbaum, The Large-Scale Structure of Semantic Networks: Statistical Analyses and a Model of Semantic Growth, Cognitive Science, 29 (1), pp. 41–78, (2005)

[17] T. F. Coleman and J. J. Moré, Estimation of sparse Jacobian matrices and graph coloring problems, SIAM Journal on Numerical Analysis, 20, pp. 187–209, (1983)

[18] C. Siefert and E. de Sturler Probing methods for generalized saddle-point prob- lems , Electronic Transactions on Numerical Analysis, 22 , pp. 163183,(2006)

[19] G. Golub and C. Van Loan Matrix computations 3rd ed. Johns Hopkins Univer-

sity Press (1996)

[20] K. Ahuja, B. Clark, E. de Sturler, D. M. Ceperley, and J. Kim, Improved

scaling for Quantum Monte Carlo on insulators, (7 May 2011).

[21] P. R. Amestoy, I. S. Duff, Y. Robert, F.-H. Rouet, and B. Ucar, On

computing inverse entries of a sparse matrix in an out-of-core environment, Tech.

Rep. TR/PA/10/59, CERFACS, Toulouse, France, 2010.

[22] H. Avron and S. Toledo, Randomized algorithms for estimating the trace of an

implicit symmetric positive semi-definite matrix, Journal of the ACM, 58 (2011), Article 8.

[23] R. Babich, R. Brower, M. Clark, G. Fleming, J. Osborn, C. Rebbi, and

D. Schaich, Exploring strange nucleon form factors on the lattice, (4 May 2011).

[24] Z. Bai, M. Fahey, and G. H. Golub, Some large-scale matrix computation prob-

lems, Journal of Computational and Applied Mathematics, 74 (1996), pp. 71–89.

[25] G. S. Bali, S. Collins, and A. Schaefer, Effective noise reduction techniques

for disconnected loops in Lattice QCD, (2010).

[26] M. Beck and S. Robins, Computing the Continuous Discretely: Integer-Point Enu-

meration in Polyhedra, Springer, 2007.

[27] C. Bekas, A. Curioni, and I. Fedulova, Low cost high performance uncertainty quantification, in WHPCF '09: Proc. of the 2nd Workshop on High Performance Computational Finance, New York, NY, USA, 2009, ACM, pp. 1–8.

[28] C. Bekas, E. Kokiopoulou, and Y. Saad, An estimator for the diagonal of a

matrix, Appl. Numer. Math., 57 (2007), pp. 1214–1229.

[29] M. Benzi, P. Boito, and N. Razouk, Decay properties of spectral projectors with

applications to electronic structure, SIAM Review, (to appear).

[30] M. Benzi and G. H. Golub, Bounds for the entries of matrix functions with ap-

plications to preconditioning, BIT, 39 (1999), pp. 417–438.

[31] S. Bernardson, P. McCarty, and C. Thron, Monte Carlo methods for esti-

mating linear combinations of inverse matrix entries in lattice QCD, Comput. Phys.

Commun., 78 (1994), pp. 256–264.

[32] M. Blaum and J. Bruck, Interleaving schemes for multidimensional cluster errors,

IEEE Transactions on Information Theory, 44 (1998), pp. 730–743.

[33] D. Bozdag, U. Catalyurek, A. Gebremedhin, F. Manne, E. Boman, and

F. Ozguner, Distributed-memory parallel algorithms for distance-2 coloring and re-

lated problems in derivative computation, SIAM J. Sci. Comput, 32 (2010), pp. 2418–

2446.

[34] E. Chow and Y. Saad, Approximate inverse preconditioners via sparse-sparse it-

eration, SIAM J. Sci. Statist. Comput., 19 (1998), pp. 995–1023.

[35] T. F. Coleman and J. J. More´, Estimation of sparse Jacobian matrices and graph

coloring problems, SIAM Journal on Numerical Analysis, 20 (1983), pp. 187–209.

[36] I. S. Duff, A. M. Erisman, and J. K. Reid, Direct Methods for Sparse Matrices,

Oxford University Press, USA, 1989.

[37] J. Foley, K. J. Juge, A. O'Cais, M. Peardon, S. Ryan, and J.-I. Skullerud, Practical all-to-all propagators for lattice QCD, Comput. Phys. Commun., 172 (2005),

pp. 145–162.

[38] A. H. Gebremedhin, F. Manne, and A. Pothen, What color is your Jacobian?

Graph coloring for computing derivatives, SIAM Rev., 47 (2005), pp. 629–705.

[39] G. H. Golub and G. Meurant, Matrices, moments and quadrature, in Numerical

Analysis 1993, D. Griffiths and G. Watson, eds., vol. 303, Longman Scientific &

Technical, Pitman Research Notes in Mathematics Series, 1994.

[40] H. Guo, Computing traces of functions of matrices, Numerical Mathematics, A Jour-

nal of Chinese Universities (English series), 2 (2000), pp. 204–215.

[41] H. Guo and R. Renaut, Estimation of utf(a)v for large-scale unsymmetric matri-

ces, Numerical Linear Algebra with applications, 11 (2004), pp. 75–89.

[42] R. Gupta, Introduction to Lattice QCD. arXiv:hep-lat/9807028v1

[http://arxiv.org/abs/hep-lat/9807028], 1998.

[43] K. J. Horadam, Hadamard matrices and their applications, Princeton University

Press, 2006.

[44] T. Huckle, Approximate sparsity patterns for the inverse of a matrix and precondi-

tioning, Appl. Numer. Math., 30 (1999), pp. 291–303.

[45] M. F. Hutchinson, A stochastic estimator of the trace of the influence matrix for

Laplacian smoothing splines, J. Commun. Statist. Simula., 19 (1990), pp. 433–450.

[46] T. Iitaka and T. Ebisuzaki, Random phase vector for calculating the trace of a

large matrix, Phys. Rev. E, 69 (2004), p. 057701.

[47] I. C. Ipsen and D. J. Lee, Determinant approximations, Tech. Rep. TR 03-30,

North Carolina State University, Department of Mathematics, 2003.

[48] D. J. Lee and I. C. F. Ipsen, Zone determinant expansions for nuclear lattice

simulations, Phys. Rev. C, 68 (2003), p. 064003.

[49] L. Lin, J. Lu, L. Ying, R. Car, and W. E, Fast algorithm for extracting the diagonal of the inverse matrix with application to the electronic structure analysis of

metallic systems, Commun. Math. Sci., 7 (2009), pp. 755–777.

[50] L. Lin, C. Yang, J. C. Meza, J. Lu, L. Ying, and W. E., Selinv–an algorithm for

selected inversion of a sparse symmetric matrix, ACM Transactions on Mathematical

Software, 37 (4), Article 40, 19 pages.

[51] C. Morningstar, J. Bulava, J. Foley, K. Juge, D. Lenkner, M. Peardon,

and C. Wong, Improved stochastic estimation of quark propagation with Laplacian

Heaviside smearing in lattice QCD, Phys. Rev. D, 83 (2011).

[52] F. Pukelsheim, Optimal design of experiments, SIAM, Classics in Applied Mathe-

matics. 50., 1993.

[53] A. Reusken, Approximation of the determinant of large sparse symmetric positive

definite matrices, SIAM J. Matrix Anal. Appl., 23 (2002), pp. 799–818.

[54] H. J. Rothe, Lattice Gauge Theories: An introduction, World Scientific Publishing

Co. Pte. Ltd., 2005.

[55] Y. Saad, Iterative methods for sparse linear systems, SIAM, 2nd edition, Philadel-

phia, PA, USA, 2003.

[56] B. Sheikholeslami and R. Wohlert, Improved Continuum Limit Lattice Action

for QCD with Wilson Fermions, Nucl.Phys., B259 (1985), p. 572.

[57] C. Siefert and E. de Sturler, Probing methods for generalized saddle-point prob-

lems, Electronic Transactions on Numerical Analysis, 22 (2006), pp. 163–183.

[58] Z. Strakos and G. H. Golub, Estimates in quadratic formulas, Numerical Algo-

rithms, 8 (1994), pp. 241–268.

[59] Z. Strakos and P. Tichy, On efficient numerical approximation of the bilinear

form c^* A^{-1} b, SIAM J. Sci. Comput., 33 (2011), pp. 565–587.

[60] J. Tang and Y. Saad, Domain-decomposition-type methods for computing the di-

agonal of a matrix inverse, Report UMSI 2010/114.

[61] Y. Saad, Iterative Methods for Sparse Linear Systems, Society for Industrial and

Applied Mathematics (2003)

[62] , A probing method for computing the diagonal of the matrix inverse, Report

UMSI 2010/42.

[63] L. R. Welch, Lower bounds on the maximum cross correlation of signals, IEEE

Trans. on Info. Theory, 20 (May 1974), pp. 397–399.

[64] K. G. Wilson, Confinement of quarks, Phys. Rev., D10 (1974), pp. 2445–2459.

[65] M. N. Wong, F. J. Hickernell, and K. I. Liu, Computing the trace of a function

of a sparse matrix via Hadamard-like sampling, Tech. Rep. 377(7/04), Hong Kong

Baptist University, 2004.

[66] A. Stathopoulos, J. Laeuchli, and K. Orginos, Hierarchical probing for es-

timating the trace of the matrix inverse on toroidal lattices , SIAM J. Sci. Com-

put.,35(5) (2013), pp. 299–322.

[67] F. Rouet, Calcul partiel de l’inverse d’une matrice creuse de grande taille - appli-

cation en astrophysique, Master’s Thesis (10/09), Institut National Polytechnique de

Toulouse, 2009.

[68] E. Estrada, N. Hatano, Statistical-mechanical approach to subgraph centrality in

complex networks, Chemical Physics Letters, 439, (5/07), pp. 247-251.

[69] N. Higham, Functions of Matrices: Theory and Computation, Society for Industrial

and Applied Mathematics, 2008.

[70] S. Li, E. Darve, Extension and optimization of the FIND algorithm: Computing

Green’s and less-than Green’s functions,J. Comput. Physics 231(4), (5/12), pp. 1121-

1139.

[71] S. Li, E. Darve, Some new mathematical methods for variational objective weather

analysis using splines and cross-validation,Mon. Weath. Rev. 108, (6/80), pp. 1122-

1143

[72] S. Li, E. Darve, Smoothing noisy data with spline functions,Numer. Math. 31,

(2/79), pp. 377-403

[73] S. W. Golomb and L. D. Baumert, The search for Hadamard matrices, Amer.

Math. Monthly, 70 pp. 12-17 (1963)

[74] P. Drineas , R. Kannan , M. Mahoney, Fast Monte Carlo algorithms for ma-

trices I: Approximating matrix multiplication,SIAM Journal on Computing 36,

(9/04), pp. 132157

[75] D. Chen, S. Toledo, Vaidya’s preconditioners: Implementation and experimental

study,Electronic Transactions on Numerical Analysis, 16,(9/03), pp. 30-

49
