Solving Large Top-K Graph Eigenproblems with a Memory and Compute-Optimized FPGA Design

Solving Large Top-K Graph Eigenproblems with a Memory and Compute-optimized FPGA Design Francesco Sgherzi∗, Alberto Parraviciniy, Marco Siracusa∗ Marco D. Santambrogioy Politecnico di Milano, DEIB, Milan, Italy ∗ffrancesco1.sgherzi, [email protected] yfalberto.parravicini, [email protected], 0 1 2 3 4 Abstract—Large-scale eigenvalue computations on sparse ma- e1 e2 trices are a key component of graph analytics techniques based Graph G:= 0 1 1 0.5 0 1.6 λ1 on spectral methods. In such applications, an exhaustive com- 2 1 1 1 1 0.5 -0.5 -1 λ putation of all eigenvalues and eigenvectors is impractical and 2 unnecessary, as spectral methods can retrieve the relevant prop- 0 2 1 1 0.5 0.5 eigenvalues eigenvectors 3 erties of enormous graphs using just the eigenvectors associated 4 1 0.3 0 3 with the Top-K largest eigenvalues. 4 1 0.3 0.5 Top-K In this work, we propose a hardware-optimized algorithm to Top-K Matrix representation of G K=2 (usually < 100) approximate a solution to the Top-K eigenproblem on sparse matrices representing large graph topologies. We prototype our Fig. 1: Top-K eigencomputation of a graph G, represented as a algorithm through a custom FPGA hardware design that exploits sparse matrix. While G can have millions of vertices, we often HBM, Systolic Architectures, and mixed-precision arithmetic. We need just the Top-K eigenvectors (K = 2 in this example). achieve a speedup of 6.22x compared to the highly optimized ARPACK library running on an 80-thread CPU, while keeping high accuracy and 49x better power efficiency. that can outperform traditional architectures in raw perfor- I. INTRODUCTION mance and power efficiency, given how applications on-top of Research in information retrieval and recommender systems Top-K eigenproblem are mostly encountered in data centers. In has spiked novel interest in spectral methods [1], a class of Ma- this work, we tackle both problems by presenting a new Top- chine Learning algorithms able to detect communities in large K sparse eigensolver whose building blocks are specifically social and e-commerce graphs, and compute the similarity of optimized for high-performance hardware designs. graph elements such as users or products [2]. At the core of We introduce a novel algorithm to address the Top-K sparse many spectral methods lies the Top-K eigenproblem for large- eigenproblem and prototype it through a custom hardware scale sparse matrices, i.e. the computation of the eigenvectors design on an FPGA accelerator card; to the best of our knowl- associated with the largest eigenvalues (in modulo) of a matrix edge this is the first FPGA-based Top-K sparse eigensolver. that stores only non-zero entries (Figure 1). For example, the Our algorithm is a 2-step procedure that combines the Lanczos famous Spectral Clustering algorithm boils down to computing algorithm (to reduce the problem size) [6] with the Jacobi the largest eigenvalues of a sparse matrix representing the algorithm (to compute the final eigencomponents) [7], as graph topology [3]. Despite the rise of theoretical interests for shown in Figure 2. The Lanczos algorithm, often encountered spectral methods, little research has focused on improving the in Top-K sparse eigensolvers [8], has never been combined performance and scalability of the Top-K sparse eigenproblem with the Jacobi algorithm. Part of the reason lies in their solvers, making them applicable to large-scale graphs. different computational bottlenecks: the Lanczos algorithm de- Most existing high-performance implementations of eigen- mands large memory bandwidth, while the Jacobi algorithm is arXiv:2103.10040v1 [cs.AR] 18 Mar 2021 problem algorithms operate on dense matrices and are com- strongly compute-bound. Our approach exploits the strengths pletely unable to process matrices with millions of rows of FPGA accelerator cards and overcomes the limitations of and columns (each encoding, for example, the user’s friends traditional architectures in this class of algorithms. in a social network graph) [4]. Even the highly optimized First, the Lanczos algorithm presents a Sparse Matrix- multi-core implementation of LAPACK requires more than 3 Vector Multiplication (SpMV) as its main bottleneck, an minutes to solve the full eigenproblem on a small graph with extremely memory-intensive computation with indirect and ∼ 104 vertices and ∼ 50 · 104 edges on a Xeon 6248, as the fully random memory accesses (Figure 2 B ). Optimizing eigenproblem complexity scales at least quadratically with the SpMV requires high peak memory bandwidth and fine-grained number of vertices in the graph. Many implementations that control over memory accesses, without passing through the support sparse matrices, on the other hand, are either forced traditional caching policies of general-purpose architectures. to compute all the eigenvalues or require an expensive matrix Our hardware design features an iterative dataflow SpMV with inversion before solving the eigenproblem [5]. multiple Compute Units (CUs), leveraging every High Band- The need for high-performance Top-K sparse eigenproblem width Memory (HBM) channels through a custom memory algorithms goes hand in hand with custom hardware designs subsystem that efficiently handles indirect memory accesses. Then, we introduce a Systolic Array (SA) design for the Lanczos Algorithm Jacobi Algorithm Jacobi eigenvalue algorithm, a computationally-intensive op- A B C D eration that operates on reduced-size inputs (K ×K) (Figure 2 Lanczos Lanczos SpMV T D ). The Jacobi algorithm maps naturally to a SA that ensures Operations 1 Operations 2 O(log(K)) convergence, while traditional architectures do K iterations Inputs: O(log(K)) iterations not ensure the same degree of performance. CPUs cannot - graph G (as a sparse matrix) Output 1: Output 2: Lanczos vectors eigenvectors V guarantee that all the data are kept in L1 cache and are unlikely - K V - v (1 per iteration) and eigenvalues to have enough floating-point arithmetic units to parallelize the 1 T computation. This results in Ω(K2) computational complexity Fig. 2: Steps of our novel Top-K sparse eigenproblem solver, and execution times more than 50 times higher than a FPGA which combines the Lanczos algorithm with a Systolic Array (Section V). Instead, GPUs cannot fill all their Stream Proces- formulation for the Jacobi eigenvalue algorithm. sors, as the input size is much smaller than what is required to utilize the GPU parallelism fully [9]. Moreover, our FPGA-based hardware design employs can solve large scale sparse eigenproblems required by spectral highly-optimized mixed-precision arithmetic, partially replac- methods. The MAGMA library [17] solves the Top-K sparse ing traditional floating-point computations with faster fixed- eigenproblem through the alternative LOBPCG algorithm [5], precision arithmetic. While high numerical accuracy is usually which requires multiple iterations (each containing at least demanded in eigenproblem algorithms, we employ fixed- one SpMV) to compute even a single eigenvector, to the precision arithmetic in parts of the design that are not critical contrary of the Lanczos algorithm. Other GPU-based Top-K to the overall accuracy and resort to floating-point arithmetic eigensolvers are domain-specific, do not support large-scale when required to guarantee precise results. inputs, or do not leverage features of modern GPUs such In summary, we present the following contributions: as HBM memory or mixed-precision arithmetic [18], [19]. • A novel algorithm for approximate resolution of large- Eigensolvers for dense matrices are more common on GPUs, scale Top-K sparse eigenproblems (Section III), opti- as they easily exploit the enormous memory bandwidth of mized for custom hardware designs. these architectures: Myllykoski et al. [4] focus on accelerating 5 • A modular mixed-precision FPGA design for our algo- the case of dense high-dimensional matrices (around 10 rows) rithm that efficiently exploits the available programmable while Cosnuau [9] operates on multiple small input matrices. logic and the bandwidth of DDR and HBM (Section IV). Clearly, none of the techniques that operate on dense matrices • A performance evaluation of our Top-K eigendecompo- can easily scale to matrices with millions of rows as simply sition algorithm against the multi-core ARPACK CPU storing them requires terabytes of memory. library, showing a speedup of 6.22x and a power effi- Specialized hardware designs for eigensolvers are limited ciency gain of 49x, with a reconstruction error due to to resolving the full eigenproblem on small dense matrices, mixed-precision arithmetic as good as 10−3 (Section V). through the QR-Householder Decomposition and Jacobi eigenvalue algorithm. Most formulations of the Jacobi algorithm II. RELATED WORK [20], [21] leverage Systolic Array, a major building block To the best of our knowledge, no prior work optimizes Top- of high performance domain-specific architectures from their K sparse eigenproblem with custom FPGA hardware designs. inception [22] to more recent results [23]–[25]. However, The most well-known large-scale Top-K sparse eigenprob- hardware designs of the Jacobi algorithm based on SA cannot lem solver on CPU is the ARPACK library [10], a multi-core scale to large matrices, as the resource utilization scales Fortran library that is also available in SciPy and MATLAB linearly with the size of the matrix. Implementations of the through thin software wrappers. ARPACK implements the QR-Houseolder algorithm face similar problems [26], [27] Implicitly Restarted Arnoldi Method (IRAM), a variation of as they also leverage systolic architectures, although research the Lanczos algorithm that supports non-Hermitian matrices. research about resource-efficient designs do exist [28]. Other sparse eigensolvers provide techniques optimized for specific domains or matrix types, although none is as common III. SOLVING THE TOP-K SPARSE EIGENPROBLEM as ARPACK [11]–[14]. Algorithms like Spectral Clustering contain as their core On GPUs, the cuSOLVER [15] library by Nvidia provides step a Top-K sparse eigenproblem, i.e.

Solving Large Top-K Graph Eigenproblems with a Memory and Compute-Optimized FPGA Design

Deflation by Restriction for the Inverse-Free Preconditioned Krylov Subspace Method

A Generalized Eigenvalue Algorithm for Tridiagonal Matrix Pencils Based on a Nonautonomous Discrete Integrable System

Analysis of a Fast Hankel Eigenvalue Algorithm

Introduction to Mathematical Programming

Inexact and Nonlinear Extensions of the FEAST Eigenvalue Algorithm

Direct Solvers for Symmetric Eigenvalue Problems

Structured Eigenvalue Problems – Structure-Preserving Algorithms, Structured Error Analysis Draft of Chapter for the Second Edition of Handbook of Linear Algebra

5. the Symmetric Eigenproblem and Singular Value Decomposition

AM 205: Lecture 22

Krylov Type Methods for Large Scale Eigenvalue Computations

PFEAST: a High Performance Sparse Eigenvalue Solver Using Distributed-Memory Linear Solvers

Eigenvalue Computation in the 20Th Century Gene H