Methods for Estimating the Diagonal of Matrix Functions
Jesse Harrison Laeuchli
Williamsburg, Virginia
Bachelor of Science, University of Notre Dame, 2007
Master of Science, College of William and Mary, 2012
A Dissertation presented to the Graduate Faculty of the College of William and Mary in Candidacy for the Degree of Doctor of Philosophy
Department of Computer Science
The College of William and Mary
May 2016
© Copyright by Jesse Harrison Laeuchli 2016
ABSTRACT
Many applications, such as path integral evaluation in Lattice Quantum Chromodynamics (LQCD), variance estimation of least squares solutions and spline fits, and centrality measures in network analysis, require computing the diagonal of a function of a matrix, Diag(f(A)), where A is a sparse matrix and f is some function. Unfortunately, when A is large, this can be computationally prohibitive. Because of this, many applications resort to Monte Carlo methods. However, Monte Carlo methods tend to converge slowly.
One method for dealing with this shortcoming is probing. Probing assumes that nodes that have a large distance between them in the graph of A have only a small-weight connection in f(A). To determine the distances between nodes, probing forms A^k. Coloring the graph of this matrix groups together nodes that have a high distance between them, and thus a small connection in f(A). This enables the construction of certain vectors, called probing vectors, that can capture the diagonal of f(A). One drawback of probing is that in many cases it is too expensive to compute and store A^k for the k that adequately determines which nodes have a strong connection in f(A). Additionally, it is unlikely that the set of probing vectors required for A^k is a subset of the probing vectors needed for A^{k+1}. This means that if more accuracy in the estimation is required, all previously computed work must be discarded.
In the case where the underlying problem arises from a discretization of a partial differential equation (PDE) onto a lattice, we can make use of our knowledge of the geometry of the lattice to quickly create hierarchical colorings for the graph of A^k. A hierarchical coloring is one in which the colors for A^{k+1} are created by splitting groups of nodes sharing a color in A^k. The hierarchical property ensures that the probing vectors used to estimate Diag(f(A)) are nested subsets, so if the results are inaccurate the estimate can be improved without discarding the previous work.
If we do not have knowledge of the intrinsic geometry of the matrix, we propose two new classes of methods that improve on the results of probing. One method seeks to determine structural properties of the matrix f(A) by obtaining random samples of the columns of f(A). The other method leverages ideas arising from similar problems in graph partitioning, and makes use of the eigenvectors of f(A) to form effective hierarchical colorings.
Our methods have thus far seen successful use in computational physics, where they have been applied to compute observables arising in LQCD. We hope that the refinements presented in this work will enable interesting applications in many other fields.

TABLE OF CONTENTS
Acknowledgments
Dedication
List of Tables
List of Figures
1 Introduction
  1.1 Motivation
    1.1.1 Prior Work and New Approach
  1.2 Overview

2 Prior Work and Applications
  2.1 Applications
    2.1.1 Statistical Applications
    2.1.2 Lattice Quantum Chromodynamics
    2.1.3 Network Centrality
  2.2 Prior Work
    2.2.1 Statistical Methods
    2.2.2 Non-Statistical Methods

3 Estimation of Diag(f(A)) on toroidal lattices
  3.1 Lattices with dimensions consisting only of powers of 2
    3.1.1 Introduction
    3.1.2 Preliminaries
    3.1.3 Lattice QCD problems
    3.1.4 The Monte Carlo method for Tr(A^-1)
    3.1.5 Probing
    3.1.6 Hadamard vectors
    3.1.7 Overcoming probing limitations
    3.1.8 Hierarchical coloring
    3.1.9 Hierarchical coloring on lattices
    3.1.10 Splitting color blocks into conformal d-D lattices
    3.1.11 Facilitating bit reversal in higher dimensions
    3.1.12 Lattices with different sizes per dimension
    3.1.13 Coloring lattices with non-power of two sizes
    3.1.14 Generating the probing basis
    3.1.15 Removing the deterministic bias
    3.1.16 Numerical experiments
    3.1.17 Comparison with classical probing
    3.1.18 Comparison with random-noise Monte Carlo
    3.1.19 A large QCD problem
    3.1.20 Conclusions
  3.2 Lattices of arbitrary dimensions
    3.2.1 Introduction and Preliminaries
  3.3 Lattices as spans of sublattices
  3.4 Coloring sublattices
    3.4.1 Hierarchical Permutations of Lattices with Equal Sides
    3.4.2 Hierarchical Permutations of Lattices with Unequal Sides
    3.4.3 Generating Probing Vectors Quickly
  3.5 Probing Vectors For Hierarchical Coloring on General Graphs
  3.6 Performance Testing
  3.7 Conclusion

4 Estimation of Diag(f(A)) in the general case
  4.1 Graph Coloring
  4.2 Statistical Considerations
  4.3 Structural Methods
  4.4 Spectral Methods
    4.4.1 Spectral k-partitioning for the matrix inverse
  4.5 Experimental Results
  4.6 Conclusions

5 Conclusion and future work
  5.1 Methods for Lattices
  5.2 Methods for General Matrices
ACKNOWLEDGMENTS
It is difficult to convey my deep gratitude and respect for my advisor Andreas Stathopoulos, without whom this work would not have been possible. His keen perception, mathematical insight, and deep understanding of the field carried us through many difficult problems. To me he is the ideal computer scientist and mentor. I also wish to thank all of the members of my dissertation committee for their thoughtful comments and guidance, which greatly improved this work.
As part of the Computational Science research group at William and Mary, I had the good fortune to work with Lingfei Wu and Eloy Romero Alcalde. I am grateful for their support and friendship.
During the course of this research I had several mentors at different internships, whose guidance and support contributed greatly to this work. In particular I would like to thank Chris Long, Lance Ward, and Geoff Sanders. They provided the spark for many ideas, and a great working environment to explore them.
For their many stimulating conversations on Computer Science, Mathematics and other topics, I would like to thank Philip Bozek, Douglas Tableman, and Walter McClean.
Finally, I would like to thank the Tan family for their moral support during the production of this thesis.
To My Parents, Samuel and Elizabeth Laeuchli
LIST OF TABLES
2.1 Convergence rates of different methods.

3.1 Table showing run times of the new algorithm compared to the original. Results obtained on an Intel i7 860 clocked at 2.8 GHz.
LIST OF FIGURES
2.1 The area zeroed out by using Hadamard vectors
2.2 An example of probing a 4-colorable graph
2.3 An example of wasted probing vectors in non-hierarchical probing

3.1 Visualizing a 4-colorable matrix permuted such that all rows corresponding to color 1 appear first, for color 2 second, and so on. Each diagonal block is a diagonal matrix. The four probing vectors with 1s in the corresponding blocks are shown on the right.
3.2 Crossed out nodes have their contribution to the error canceled by the Hadamard vectors used. Left: the first two, natural order Hadamard vectors do not cancel errors in some distance-1 neighbors in the lexicographic ordering of a 2-D uniform lattice. Right: if the grid is permuted with the red nodes first, the first and the middle Hadamard vectors completely cancel variance from nearest neighbors and correspond to the distance-1 probing vectors.
3.3 When doubling the probing distance (here from 1 to 2) we first split the 2-D grid into four conformal 2-D subgrids. Red nodes split to two 2 × 2 grids (red and green), and similarly black nodes split to blues and black. Smaller 2-D grids can then be red-black ordered.
3.4 Error in the Tr(A^-1) approximation using the MC method with various deterministic vectors. Classic probing requires 2, 16, 62, and 317 colors for probing distances 1, 2, 4, and 8, respectively. Left: classic probing approximates the trace better than the same number of Hadamard vectors taken in their natural order. Going to higher distance-k requires discarding previous work. Right: perform distance-k probing, then apply Hadamard in natural order within each color. Performs well, but hierarchical performs even better.
3.5 Left: the hierarchical coloring algorithm is stopped after 1, 2, 3, 4, 5 levels, corresponding to distances 2, 4, 8, 16, 32. The ticks on the x-axis show the number of colors for each distance. Trace estimation is effective up to the stopped level; beyond that the vectors do not capture the remaining areas of large elements in A^-1. Compare the results with classical probing in Figure 3.4, which requires only a few fewer colors for the same distance. Right: when the matrix is shifted to have high condition number, the lack of structure in A^-1 causes all methods to produce similar results.
3.6 Convergence history of the Z2 random estimator, Hadamard vectors in natural order, and hierarchical probing, the latter two with bias removed as in (3.9). Because of the small condition number, A^-1 has a lot of structure, making hierarchical probing clearly superior to the standard estimator. As expected, Hadamard vectors in natural order are not competitive. The markers on the plot of the hierarchical probing method designate the number of vectors required for a particular distance coloring to complete. It is on these markers that structure is captured and error minimized.
3.7 Convergence history of the three estimators as in Figure 3.6 for a larger condition number O(10^4). As the structure of A^-1 becomes less prominent, the differences between methods reduce. Still, hierarchical probing has a clear advantage.
3.8 Convergence history of the three estimators as in Figure 3.6 for a high condition number O(10^6). Even with no prominent structure in A^-1 to discover, hierarchical probing is as effective as the standard method.
3.9 Statistics over 100 random vectors z_0, used to modify the sequence of 2048 hierarchical probing vectors as in (3.9). At every step, the variance of quadratures from the 100 different runs is computed, and confidence intervals are reported around the hierarchical probing convergence. Note that for the standard noise MC estimator confidence intervals are computed differently and thus are not directly comparable.
3.10 (a) Left: the variance of the hierarchical probing trace estimator as a function of the number of vectors (s) used. The minima appear when s is a power of two. The places where the colors complete are marked with cyan circles. These minima become successively deeper as we progress from 2 to 32 to 512 vectors. (b) Right: speed-up of the LQCD trace calculation over the standard Z2 MC estimator. The cyan circles mark where colors complete. The maximal speed-up is observed at s = 512. In both cases the uncertainties are estimated using the Jackknife procedure on a sample of 253 noise vectors, except for s = 256 and 512 where 37 vectors were used.
3.11 The decomposition of a 6 × 6 lattice into 3^2 sublattices L(3I)_{c_0}
3.12 Affine sublattices with x, y coordinates
3.13 The circled nodes constitute the C lattice of offsets. Note how C tiles the entire lattice, and that its coloring reflects the coloring of each sublattice L(bI)_c. Since b = 3, each line of colors is the same as the previous line, shifted by 1 mod 3.
3.14 Comparison of the two methods on a 2D lattice with common factors 2 × 2 × 3 × 3 × 5. For the common factors of two, the methods are the same, but once these are exhausted the improved method has much lower error.

4.1 Probing vs coloring the structure of L† directly, where the percentage of the weight of L† retained varies from 0.1 to 0.5 in 0.05 increments. As the number of colors increases, probing struggles to capture the structure of L†.
4.2 Sampling 4 columns from A and then shifting to detect if they share an off-diagonal
4.3 Approximation of the pseudo-inverse of a Laplacian with periodic boundary conditions with 10, 100, and 1000 vectors in the vv^T approximation. Here the vectors v are not sparsified; the figure shows how, even in the best case of an unsparsified v, vv^T contains mostly local structure until a significant number of vectors v are supplied.
4.4 Lattice Graph Results
4.5 Scale Free Graph Results
4.6 Wiki-Vote Graph Results
4.7 P2P-GNU Graph Results
4.8 Gre512 Matrix Results
4.9 Orseg Matrix Results
4.10 Mhd416 Matrix Results
4.11 Nos6 Matrix Results
4.12 Bcsstk07 Matrix Results
4.13 Af23560 Matrix Results
Chapter 1
Introduction
1.1 Motivation
In this work we study the problem of computing Diag(f(A)), and Σ_{i=1}^N Diag(f(A))_i = Tr(f(A)), where A is a sparse matrix of size N and f is some function. Some useful examples are f(A) = A^-1, or f(A) = exp(A). When A is small, this can be computed directly, which is an O(N^3) approach. When A is of intermediate size and is properly structured, recursive factorization methods allow for an O(N^2) solution [70]. However, in many problems of interest, the size of A can be such that even O(N^2) solutions are impractical. Because of this we abandon exact computation, and attempt to approximate the desired diagonals. There are two main methods for approximating the desired results.
The first of these is based on Monte Carlo methods [45, 22]. The second is based on matrix sparsification [62], where we hope to ignore unimportant parts of the matrix in order to speed up the computation. Our work takes advantage of the features of both types of approximations in order to produce better algorithms.
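Before turning to those approximations, it is worth seeing the direct baseline they replace. The following minimal numpy sketch, using a hypothetical well-conditioned random matrix rather than one from the applications above, forms f(A) = A^-1 explicitly and reads off Diag(f(A)) and Tr(f(A)); forming the dense inverse is the O(N^3) step that becomes prohibitive as N grows.

```python
import numpy as np

# Hypothetical small example; forming the dense inverse is the O(N^3) step.
N = 100
rng = np.random.default_rng(0)
A = np.diag(rng.uniform(1.0, 2.0, N)) + 0.01 * rng.standard_normal((N, N))

f_of_A = np.linalg.inv(A)      # f(A) = A^{-1}, formed explicitly
diag_fA = np.diag(f_of_A)      # Diag(f(A))
trace_fA = diag_fA.sum()       # Tr(f(A))
```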
The need to apply efficient solutions to this problem is a result of ever increasing matrix sizes in many diverse areas. Several examples are Statistics [45, 67], Lattice Quantum
Chromodynamics (LQCD) [31], Material Science [28], and Network Analysis [68]. For example, in LQCD increasing the size of the lattice improves the physical accuracy of the simulation. Thus, there is significant interest in increasing the size of the lattices beyond what is currently computationally feasible. The same is true in the field of social network analysis. As social networks expand and represent ever more interconnected networks, performing analysis requires the use of increasingly large matrices. Finally, as large data sets become ever more prevalent, many statistical processes require matrices that continue to increase in size, necessitating faster methods.
1.1.1 Prior Work and New Approach
This problem of computing Diag(f(A)) differs from common numerical analysis problems in a key way. Frequently, when similar problems are encountered, the problem is rewritten as an optimization problem on a convex function. The problem can then be approached using optimization methods such as Newton's method, Gradient Descent, Conjugate Gradient, and Non-Linear Conjugate Gradient, to converge to the value we are seeking. With this problem, no such optimization process can be undertaken. Because of this, statistical methods must be used.
Since this problem was first studied by Hutchinson [45], several statistical methods for the problem have been proposed. While these methods have the attractive feature that they provide statistical error estimates, they converge slowly and do not take advantage of information that the user may have about the matrix. The main goal of this work is to take advantage of anything that is known about a matrix in order to obtain a better estimate of Diag(f(A)) in less time. Although the matrix f(A) is normally dense, if
the smallest elements of f(A) are dropped, structure in this sparsified version of f(A) will emerge. Since exactly obtaining this structure is no easier than solving the original problem, alternative information is used to approximate it. Once this structure is known, an estimate for Diag(f(A)) can be obtained. This was the idea behind the method of
[60] known as probing. Probing uses the powers of a matrix A to obtain an estimate of the structure of f(A). For A−1 in particular this is based on the assumption that the
Neumann series of A converges to f(A). For different functions of A, other polynomials
could be considered. Once an estimate for the structure of f(A) is obtained, the authors of [62] show how to create probing vectors that allow for the recovery of the diagonal.
However, this requires taking high powers of the matrix A, which are expensive to compute and may be impractical to store. Ideally, knowledge of the matrix A that is less expensive to obtain should be used. Further, the method in [60] provides no way to tell how accurate the resulting estimate is. Since such error information is important for many applications, this is a significant drawback.
Our proposed methods take advantage of several major areas of knowledge in order to obtain the structure of f(A), in ways that are cheaper than finding powers of A. The first is geometric information. Many of our target applications arise from partial differential equations (PDE) that are discretized onto lattices. For a PDE given by g(u) = y, the solution at point u is often given in terms of the Green's function G(u, u'), where u is the point we are obtaining a solution for, and u' is some other point not equal to u.
For many PDEs, as ‖u − u'‖ increases, G(u, u') decays quickly, since the physical forces
the Green’s function is attempting to model fall off rapidly with distance. Because of
this, only connections between nodes that have short distances in the graph will have a
large connection, or to put this another way, only elements in f(A) corresponding to links
between close nodes will be large. If we can determine which points have short distances
between them in the graph, we can use this distance information to obtain an estimate for
which are the large elements of f(A). This is the previously mentioned approach of [60], where they use successively higher powers of A to compute the distance between nodes. In
lattices, because of the known geometry, we can cheaply compute the distances between
nodes without computing the powers of A.
For more general matrices where geometric information is lacking, we would prefer
to bypass polynomial approximations which are expensive to compute, and work more
directly with the structure of f(A). This approach allows us to deal with matrices that
do not exhibit the decay in interaction between distant nodes seen in many PDEs. We
term this family of algorithms inverse probing. Inverse probing works by computing a
subset of the columns v of the matrix f(A). These columns can then be used by several algorithms to build an approximation to f(A). The two main approaches we take are to form an approximation vv^T ≈ f(A), and to examine the values of v at different lags to try to predict the location of major off-diagonals of f(A). This allows for the estimation of the magnitude of the connection between nodes directly.
A final method for obtaining information on the structure of f(A) is based on examining the spectrum of f(A). If probing were used to determine all the distance-k connections of the i-th node, the algorithm would take the matrix-vector product A^k e_i. However, this is similar to the process of obtaining a single iterate of the power iteration method, which would take the product A^k r / ‖A^k r‖, where r is an appropriately chosen random vector. The power method is known to converge to the eigenvector of largest modulus of A. This suggests there could be a connection between probing and the largest eigenvector of A. The
eigenvector holds the distance information that would be obtained by probing if it were
taken an infinite number of steps. To state this slightly differently, the largest eigenvector
holds similar structural information about f(A) to that which probing could obtain. We present a heuristic that explores the connection between these two ideas and lays the basis for future research in this area.
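As a concrete illustration of the iterate mentioned above, here is a minimal numpy sketch of the power method; the matrix size, iteration count, and starting vector are hypothetical choices, not part of the proposed heuristic itself.

```python
import numpy as np

def power_iteration(A, iters=100, seed=0):
    """Repeatedly apply A and normalize; after k steps the iterate is
    A^k r / ||A^k r||, which converges to the eigenvector of largest
    modulus of A."""
    rng = np.random.default_rng(seed)
    r = rng.standard_normal(A.shape[0])
    for _ in range(iters):
        r = A @ r
        r /= np.linalg.norm(r)
    return r
```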
The geometric, structural, and spectral information that our new algorithms make use of are all more computationally and storage efficient than the high powers of A required
for probing. Additionally, in many cases we observe that they provide more accurate
results than probing, because they obtain a more accurate representation of the structure
of f(A).
The final contribution of our work is to combine the idea of exploiting known but
deterministic information about a matrix with statistical methods in order to provide the
user with an improved as well as unbiased estimator. We provide a framework to analyze
when the information obtained by our methods will provide meaningful improvements
in the error estimates of our methods. By merging the strengths of the statistical and
deterministic approaches we provide algorithms that are more robust.
1.2 Overview
The rest of this dissertation is structured as follows.
Chapter 2 We discuss in more detail the applications where Diag(f(A)) is needed. We also examine the prior approaches, and determine the areas in which they are insufficient.

Chapter 3 We introduce a method for computing Diag(f(A)) when A is a matrix arising from a toroidal lattice, that is, a lattice where the boundary conditions are periodic. Our method works by exploiting the geometry of the lattice. We show how the problem can be solved in the special case when the lattice has dimensions that are powers of 2, and in the more general case where the dimensions are arbitrarily sized. Finally we show how these methods can be combined with statistical approaches to provide unbiased estimators.

Chapter 4 We discuss the more general class of matrices, where the matrix is not a lattice but still has some structure that can be exploited. We examine structural and spectral types of information, and provide a framework for analyzing when enough structure exists for our algorithms to outperform previous approaches.

Conclusion We summarize our discussion.
Chapter 2
Prior Work and Applications
In this chapter we discuss related applications and prior work. Most prior applications are related to either computing Tr(A^-1) or Diag(exp(A)). Given an N × N matrix A with an eigendecomposition A = V E V^-1, we have Tr(A^-1) = Σ_{i=1}^N Diag(A^-1)_i = Σ_{i=1}^N 1/E_ii. If the matrix is small, one could solve the problem directly by performing an LU decomposition
[19], and then solving N linear equations for the diagonal entries of A^-1. Alternatively, one could obtain the eigenvalues of A, for example by using the QR method [19], and sum their reciprocals. Unfortunately, these approaches require O(N^3) work, and so are impractical for the matrices that occur in the applications we are interested in.
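To make the direct approach concrete, here is a minimal scipy sketch (the function name is ours): factor A once with an LU decomposition, then recover each diagonal entry of A^-1 with one solve against a unit vector. This is exactly the O(N^3) cost the statistical methods are designed to avoid.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def trace_inverse_direct(A):
    """O(N^3) direct approach: LU-factor A, then recover each diagonal
    entry of A^{-1} from one solve against a unit vector."""
    N = A.shape[0]
    lu, piv = lu_factor(A)
    diag = np.empty(N)
    for i in range(N):
        e = np.zeros(N)
        e[i] = 1.0
        diag[i] = lu_solve((lu, piv), e)[i]   # (A^{-1})_{ii}
    return diag.sum()                          # Tr(A^{-1})
```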
The exponential of a matrix is given by exp(A) = V exp(E) V^-1 = Σ_{k=0}^∞ A^k / k!. If the matrix is small enough, the eigendecomposition of the matrix can be computed, but in
most cases of interest this is not practical. There are many methods proposed for forming
exp(A) explicitly [69], but all of them require raising A to a high power. This can be computationally difficult, as well as requiring impractical amounts of storage, since in many cases A^k will become dense quickly.
2.1 Applications
2.1.1 Statistical Applications
We briefly consider applications of Diag(f(A)) and Tr(f(A)), to motivate our research.
Such applications appear in physics, social network analysis, and statistics among others.
Computing Diag(f(A)) shows up in several statistical problems. The simplest of these
is computing the variance of a least squares problem [67]. In the case of a least squares
problem, one would like to solve min_x ‖Bx − y‖, where B is some arbitrarily sized matrix representing the observations we are trying to fit, and which is likely singular. A solution can be found by computing the normal equations solution x = A^-1 z, where A = B^T B and z = B^T y. While x can be found using an iterative method, one would like to know what the variance of the solution is. It can be shown that given B, the covariance matrix of x is (B^T B)^-1 σ^2, where σ^2 is the variance of the error of our fit. We do not have this variance available, but we can estimate it. Define X_{n,p} as our n observations of p variables we are attempting to fit. Then as an estimator for σ^2 we have σ^2 = (1/(n−p)) ‖y − Bx‖^2 = (1/(n−p)) Σ_{i=1}^n e_i^2, where the e_i are the residuals e_i = y_i − X_{i,1} x_1 − ... − X_{i,p} x_p. The variance of the individual components x_i is then computed as Diag((B^T B)^-1)_i σ^2. Where B is large, representing many observations, inverting B^T B to obtain Diag((B^T B)^-1) is difficult.
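A small numpy sketch of this variance computation on synthetic data; the sizes and the data-generating model here are hypothetical, and the dense inversion of B^T B is the step that becomes infeasible for large B.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5                                    # n observations, p variables
B = rng.standard_normal((n, p))
y = B @ np.ones(p) + 0.1 * rng.standard_normal(n)

x, *_ = np.linalg.lstsq(B, y, rcond=None)        # solve min_x ||Bx - y||
sigma2 = np.sum((y - B @ x) ** 2) / (n - p)      # estimator for sigma^2
var_x = np.diag(np.linalg.inv(B.T @ B)) * sigma2 # Diag((B^T B)^{-1}) sigma^2
```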
Another area of statistics that this problem arises in, and which originally motivated
Hutchinson [45] to develop his method, is fitting a spline to a set of multidimensional noisy data z at irregularly spaced points x. This is done by choosing the function f that minimizes Σ_{i=1}^n (z_i − f(x_i))^2 + ρ J(f), where n is the number of data points, and J(f) is a rotation invariant measure of the roughness of f. This roughness is defined in terms of the partial derivatives of f. The value ρ is a positive value controlling the degree of smoothing of the data, and is chosen to minimize the generalized cross validation function [71], which is defined as GCV = ((1/n) ‖(I − A)z‖^2) / ((1/n) Tr(I − A))^2, where A is the n × n symmetric influence matrix which takes the data values to their fitted values [72]. Forming this influence matrix requires inverting the spline matrix B. Because of this, the main expense of the validation is obtaining Tr(I − A).
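For small n, the GCV score can be evaluated directly; a minimal numpy sketch, with a hypothetical influence matrix passed in, is below. At scale, the Tr(I − A) term is the quantity that must be estimated stochastically.

```python
import numpy as np

def gcv(A_influence, z):
    """Generalized cross validation score for an n x n influence matrix:
    ((1/n) ||(I - A) z||^2) / ((1/n) Tr(I - A))^2."""
    n = len(z)
    resid = z - A_influence @ z                   # (I - A) z
    num = np.linalg.norm(resid) ** 2 / n
    den = ((n - np.trace(A_influence)) / n) ** 2  # (Tr(I - A) / n)^2
    return num / den
```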
2.1.2 Lattice Quantum Chromodynamics
Similar statistical issues arise in many areas of physics, and in particular in Lattice Quantum Chromodynamics (LQCD) [31]. QCD is the theory of the behavior of the fundamental force known as the strong interaction, which describes the interactions among quarks, the building blocks of hadrons. LQCD is a method for simulating these interactions. Since it is a non-perturbative method, it can be used to compute physical properties such as the masses of the various quarks, as well as the observables governing the coupling of the particles [31].
Unfortunately QCD gives rise to path integrals that are difficult to compute directly.
If the system is discretized onto a 4D lattice, they can be approximated using Monte Carlo
Integration. Normally this is done in two stages, by generating gauge fields according to a particular probability distribution, then evaluating a correlation function that depends on these fields. The physical properties of interest are determined by a Monte Carlo average of the correlation functions generated by the ensemble of gauge fields [54].
This approach requires the computation of the trace of A^-1. Since the systems arising in this simulation are normally very large, most approaches in this area are based on iterative methods. Aside from their size, these systems are also poorly conditioned, making their solution difficult. In particular, as the simulation parameters are tuned so that they more accurately represent the physical system of interest, the matrix starts to become singular. Therefore, it is important to minimize the number of systems of equations that have to be solved to obtain an estimate for the trace, since each solution may take many iterations to converge. Despite these difficulties, LQCD has been a very successful approach, and our methods have wide applicability to it.
2.1.3 Network Centrality
Finally, we consider an application where the required f(A) is not A^-1, but is instead exp(A), as in [68]. The authors are interested in computing the node centrality in a network, a metric of how important a particular node is in a given network. This question arises in social network analysis, as well as network design. In [68], the authors begin by defining a path as a list of distinct vertices connecting two nodes, and define a path that starts and stops at the same node as a closed path. They assume that nodes that have more closed paths are more important. Further, they give closed paths of differing lengths different weights, assigning to shorter paths a higher weight. If we let k(i)_j be the number of paths of distance i for node j, and weight the paths with the inverse of their factorial distance, then we obtain the centrality metric Σ_{i=0}^∞ k(i)_j / i!. If one recalls that A^i_{jj}, the j-th diagonal element of the i-th power of A, gives the number of round trip paths of length i for node j, then the desired equation is Diag(Σ_{i=0}^∞ A^i / i!) = Diag(exp(A)).
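For a network small enough to treat directly, this centrality can be read off the diagonal of the matrix exponential. A minimal scipy sketch on a hypothetical 5-node cycle graph:

```python
import numpy as np
from scipy.linalg import expm

# Adjacency matrix of a 5-node cycle (hypothetical example).
n = 5
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

centrality = np.diag(expm(A))   # Diag(exp(A)): one centrality score per node
```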
2.2 Prior Work
2.2.1 Statistical Methods
Many of the applications shown in the prior section require extremely large matrices.
Further, as computational resources expand, the applications will want to increase the
size of the matrices in order to achieve more accurate results. This means that it is
unlikely that attempts to solve this problem directly through matrix decompositions or eigensolvers will ever be the best choice for most applications. Instead, methods that attempt to statistically estimate it are needed.
The first attempt at such a statistical solution was made by Hutchinson [45], who was
interested in calculating Tr(A^-1) in order to compute splines. He showed that for a set of random vectors z_i, where each vector element is drawn independently from a Rademacher distribution, in which each element has a 1/2 chance to be 1 or −1,

    Tr(f(A)) = E[z^T f(A) z] = (Σ_{i=1}^n z_i^T f(A) z_i) / n.    (2.1)

In his case, f(A) = A^-1. Taking advantage of the fact that A^-1 z = y can be obtained by rewriting it in the form Ay = z and using an iterative solver, it is then possible to estimate the trace of A^-1 even for very large matrices.
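A minimal numpy sketch of estimator (2.1) for f(A) = A^-1; the function name is ours, and a dense solve stands in for the iterative solver that would be used at scale.

```python
import numpy as np

def hutchinson_trace_inv(A, n_samples=100, seed=0):
    """Estimate Tr(A^{-1}) as in (2.1): average z^T A^{-1} z over
    Rademacher vectors z, each requiring one solve of A y = z."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    total = 0.0
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=N)
        y = np.linalg.solve(A, z)   # y = A^{-1} z
        total += z @ y              # z^T A^{-1} z
    return total / n_samples
```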
Following Hutchinson’s work, the authors of [22] investigated several variations on
Hutchinson’s method, and proved bounds on their statistical variance, as well as on the number of samples needed to achieve a given accuracy, which can be seen in Table 2.1.
Instead of taking z_i from (2.1) to be random vectors with elements from the Rademacher distribution, they examined the case where the elements of z_i are Gaussian, normalized so that z_i^H z_i = N, which they term the Rayleigh-quotient estimator, and the case where the z_i are random unit vectors e_i = [0 ... 0 1 0 ... 0], where the 1 is in the i-th location. Additionally, they consider a variation on the scheme of taking the z_i as unit vectors. Using these unit vectors directly computes a particular set of diagonal elements, and then attempts to extrapolate the missing diagonal elements from them. However, in cases where the values of the diagonal elements vary widely, this will work poorly. To counteract this, they instead compute (Σ_{i=1}^n z_i^T D^T A D z_i) / n, where D is either the Discrete Fourier Transform (DFT) matrix or the Hadamard matrix. The DFT matrix is generated by D = FFT(I), and the Hadamard matrix [73] is formed recursively, as

    H_1 = [1],   H_2 = [1, +1; 1, −1],   H_{2^k} = [H_{2^{k-1}}, +H_{2^{k-1}}; H_{2^{k-1}}, −H_{2^{k-1}}] = H_2 ⊗ H_{2^{k-1}}.    (2.2)
The Hadamard matrices have the disadvantage of only having sizes that are powers of two, but avoid the use of complex arithmetic, which is a requirement of using the DFT matrix. Since normally we will not need all N columns, we instead generate each matrix column by column, a process which can be done efficiently [66].
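One way to generate a single column without forming the whole matrix uses the closed form H[i, j] = (−1)^popcount(i AND j) for the natural (Sylvester) ordering, which is consistent with the recursion (2.2); this particular implementation is our sketch, not necessarily the scheme of [66].

```python
import numpy as np

def hadamard_column(n, j):
    """Column j of the natural-order Hadamard matrix H_n (n a power of 2),
    using H[i, j] = (-1)^{popcount(i & j)}, consistent with (2.2)."""
    x = np.arange(n) & j
    bits = np.zeros(n, dtype=np.int64)
    while x.any():                 # portable popcount over the array
        bits += x & 1
        x >>= 1
    return np.where(bits % 2 == 0, 1.0, -1.0)
```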
Because these matrices are unitary, D^T D = I, so Tr(D^T A D) = Tr(D^T D A) = Tr(IA) = Tr(A); the transformation has the effect of smoothing out the elements of the matrix A, thus making it less likely that an important diagonal element will be missed by the estimator.
While [22] derives upper bounds for the number of vectors needed to achieve the desired accuracy with a given probability, these bounds are not tight. In practice the authors observe that the various methods perform almost identically (see Table 2.1). These methods converge slowly and cannot be improved without additional information about the matrix. However, it is seldom the case that no useful knowledge of the matrix is available, or that none can be extracted by approximation techniques.

Table 2.1: Convergence rates of different methods.

Estimator                        | Variance of the sample               | Samples for an (ε, δ)-approximation                             | Random bits per sample
Gaussian                         | 2‖A‖_F^2                             | 20 ε^-2 ln(2/δ)                                                 | infinite; Θ(n) in floating point
Normalized Rayleigh-quotient     | -                                    | (1/2) ε^-2 n^-2 rank^2(A) ln(2/δ) κ_f^2(A)                      | -
Hutchinson's                     | 2(‖A‖_F^2 − Σ_{i=1}^n A_ii^2)        | 6 ε^-2 ln(2 rank(A)/δ)                                          | Θ(n)
Unit Vector                      | n Σ_{i=1}^n A_ii^2 − Tr^2(A)         | (1/2) ε^-2 ln(2/δ) r_D^2(A), with r_D(A) = n max_i A_ii / Tr(A) | Θ(log n)
Mixed Unit Vector (DFT/Hadamard) | -                                    | 8 ε^-2 ln(4n^2/δ) ln(4/δ)                                       | Θ(log n)
2.2.2 Non-Statistical Methods
Several deterministic methods have been proposed that solve this problem exactly, in the case where f(A) = A^-1, by performing some form of matrix factorization, and thus avoid the slow convergence of the statistical methods. These approaches have serious drawbacks, however. In [49], the authors introduce a method which works by
finding a hierarchy of Schur complements of matrices arising from grids, but the run time of this method is of order O(N^{3/2}) and O(N^2) for the 2D and 3D cases respectively. Thus
for sufficiently large N, or for higher dimensional problems, this approach is not practical.
Further, it does not address the case of matrices that do not arise from PDEs.
The method introduced in [70] performs an LU factorization and computes the last
diagonal entry of the inverse directly from this factorization. It then reorders the nodes so
that each diagonal is in turn the last element of the LU matrix. In order to avoid computing
a unique LU factorization for every reordering, they decompose each LU factorization into
partial LU factorizations. This method has similar drawbacks to those in [49], requiring
that the matrix arise from a PDE, and it has a run time of O(N^2) for a 2D matrix.
Proposed in [67] is a method based on Takahashi’s equations, which allow a subset
of the elements of A−1 to be recursively computed, using only the elements of an LDU
decomposition of A, and the previously generated elements of A−1, with the first step of
the recursive process requiring only the elements of the LDU decomposition to compute.
The subset of elements which can be computed in this manner are those elements that are non-zero in the LDU decomposition. Given a sparse LDU decomposition, it follows that the number of elements needed to compute Diag(A^-1) is small. However, while the subset of elements of A^-1 needed to compute Diag(A^-1) may be smaller than that needed to compute the entire matrix A^-1, it can still be quite large.
Alternative approaches have been developed that avoid computing the result directly,
which is infeasible for large problems, while still making use of any information that is
known about the problem. The main idea behind them is to create the vectors in (2.1) in
such a way that any available structure is exploited. This method was first introduced in
[28]. The main insight is that many matrices in practice have an inverse with a periodic
and decaying structure where the magnitude of the elements falls off away from the main
diagonal of the matrix. Therefore we can choose the z_i such that they zero out as many diagonals of the matrix as possible. If the contribution to the error from the diagonals that have not been zeroed out is small, the results will be very accurate. As more z_i are used, more diagonals are zeroed out, and the solution becomes more accurate. Further, we show later how, if this is paired with statistical methods, the variance will be reduced, because there will be fewer elements contributing to the sums in Table 2.1. To achieve this zeroing-out effect, the vectors of the Hadamard matrix are used in their natural order, since they zero all contributions to the error except those from an increasingly small subset of diagonals, as can be seen in Figure 2.1.
Another useful method supplied in [28] is how to calculate the diagonal of A instead
of simply the trace. They show that the following estimator will converge to the diagonal
    Diag(A) ≈ (Σ_{i=1}^n z_i ⊙ A z_i) ⊘ (Σ_{i=1}^n z_i ⊙ z_i),    (2.3)

where ⊙ is componentwise multiplication and ⊘ is componentwise division, and the z_i are random vectors. This has the same drawback as (2.1), in that while in expectation this will yield the correct answer, in practice convergence can be very slow.

[Figure 2.1: The area zeroed out by using Hadamard vectors. (a) 16 Hadamard vectors; (b) 32 Hadamard vectors. As the number of vectors increases, the number of diagonals that contribute to the error decreases, and the remaining ones lie further from the main diagonal.]

Therefore the question of picking the z_i to exploit the structure of A is the same as for estimating the trace.
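A minimal numpy sketch of estimator (2.3) with Rademacher vectors z_i (the function name is ours):

```python
import numpy as np

def diag_estimator(A, n_samples=1000, seed=0):
    """Estimate Diag(A) as in (2.3): componentwise, the numerator
    accumulates z .* (A z) and the denominator z .* z."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    num = np.zeros(N)
    den = np.zeros(N)
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=N)
        num += z * (A @ z)   # componentwise multiplication
        den += z * z
    return num / den         # componentwise division
```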
While the Hadamard based method of [28] works well for a specific class of matrices which are generally those arising from a PDE with a Green’s function describing a force that decays with distance, matrices without this diagonal structure do not benefit as much. An attempt to exploit less regularly ordered structure is behind the idea of probing.
Probing has been a useful technique with a long history in the context of approximating the Jacobian matrix [17, 38], or other matrices [18]. Its use for approximating the diagonal of A^-1 was proposed in [60], because it finds the most important areas of A^-1 rather than the fixed structure removed by the Hadamard approach. Probing recovers the diagonals of a sparse matrix by finding a coloring of its associated graph. Coloring a graph involves assigning a color to each vertex in such a way that no two connected vertices share a color.
Unfortunately, finding the optimal coloring in the sense of using the least number of colors is an NP-Complete problem. However, for many graphs a greedy algorithm performs well
[61], and is the approach used in probing. When the rows and columns of a matrix are arranged so that all nodes that share the same color are adjacent, a zero block-diagonal structure will result, as can be seen in Figure 2.2. This structure is due to the fact that since these nodes share a color, they must have no connection (otherwise this would be an invalid coloring).
This block-diagonal zero structure can be exploited to recover the diagonal of the matrix by creating probing vectors. Given a coloring C for A, with c total colors, we will need only c vectors to recover the diagonal. We generate a probing vector for each color m, and set the i-th element of that vector to be 1 if the i-th node of the graph of A was assigned the m-th color and zero elsewhere as seen in (2.4). These vectors can then be used with (2.1) or (2.3).
    p_i^m = { 1, if i ∈ C_m;  0, otherwise }.    (2.4)
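As a sketch of the whole construction, assuming the graph is given as adjacency lists over nodes 0, ..., N−1: a first-fit greedy coloring stands in for the greedy heuristic of [61], and the probing vectors then follow (2.4). Both helper names are ours.

```python
import numpy as np

def greedy_coloring(adj):
    """First-fit greedy coloring: assign each node the smallest color not
    used by an already-colored neighbor."""
    colors = {}
    for v in sorted(adj):
        used = {colors[u] for u in adj[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors

def probing_vectors(colors, N):
    """Probing vectors of (2.4): one vector per color, with a 1 in
    position i when node i was assigned that color."""
    c_total = max(colors.values()) + 1
    P = np.zeros((N, c_total))
    for i in range(N):
        P[i, colors[i]] = 1.0
    return P
```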