Accelerating the LOBPCG Method on GPUs Using a Blocked Sparse Matrix Vector Product

Hartwig Anzt, Stanimire Tomov, and Jack Dongarra
Innovative Computing Lab, University of Tennessee, Knoxville, USA
Email: [email protected], [email protected], [email protected]

Abstract—This paper presents a heterogeneous CPU-GPU algorithm design and optimized implementation for an entire sparse iterative eigensolver – the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) – starting from low-level GPU data structures and kernels to the higher-level algorithmic choices and overall heterogeneous design. Most notably, the eigensolver leverages the high performance of a new GPU kernel developed for the simultaneous multiplication of a sparse matrix and a set of vectors (SpMM). This is a building block that serves as a backbone not only for block-Krylov methods, but also for other methods relying on blocking for acceleration in general. The heterogeneous LOBPCG developed here reveals the potential of this type of eigensolver by highly optimizing all of its components, and can be viewed as a benchmark for other SpMM-dependent applications. Compared to non-blocked algorithms, we show that the performance speedup factor of SpMM- vs. SpMV-based algorithms is up to six on GPUs like NVIDIA's K40. In particular, a typical SpMV performance range in double precision is 20 to 25 GFlop/s, while the SpMM is in the range of 100 to 120 GFlop/s. Compared to highly-optimized CPU implementations, e.g., the SpMM from MKL on two eight-core Intel Xeon E5-2690s, our kernel is 3 to 5× faster on a K40 GPU. For comparison to other computational loads, the same GPU-to-CPU performance acceleration is observed for the SpMV product, as well as for dense linear algebra, e.g., matrix-matrix multiplication and factorizations like LU, QR, and Cholesky. Thus, the modeled GPU (vs. CPU) acceleration for the entire solver is also 3 to 5×. In practice though, currently available CPU implementations are much slower due to missed optimization opportunities, as we show.

I. INTRODUCTION

The main challenges often associated with numerical linear algebra are the fast and efficient solution of large, sparse linear systems, and the appertaining eigenvalue problems. While linear solvers often serve as a backbone of simulation algorithms based on the discretization of partial differential equations, eigensolvers play a central role, e.g., in quantum mechanics, where eigenstates and molecular orbitals are defined by eigenvectors, or in principal component analysis. With increasing system size and sparsity, dense linear algebra routines, usually based on direct solvers like LU factorization, or, in the case of an eigenvalue problem, Hessenberg decomposition [1], become less suitable, as the memory demand and computational cost may exceed the available resources. Iterative methods providing solution approximations often become the method of choice. However, as their performance is, at least in the case of sparse linear systems, usually memory bound, leveraging the computing power of today's supercomputers, often accelerated by coprocessors like graphics processing units (GPUs), becomes challenging.

While there exist numerous efforts to adapt iterative linear solvers like Krylov subspace methods to coprocessor technology, sparse eigensolvers have so far remained outside the main focus. A possible explanation is that many of those combine sparse and dense linear algebra routines, which makes porting them to accelerators more difficult. Aside from the power method, algorithms based on the Krylov subspace idea are among the most commonly used general eigensolvers [1]. When targeting symmetric positive definite eigenvalue problems, the recently developed Locally Optimal Block Preconditioned Conjugate Gradient method (LOBPCG, see [2]) belongs to the most efficient algorithms. LOBPCG is based on maximizing the Rayleigh quotient, taking the gradient as the search direction in every iteration step. Iterating several approximate eigenvectors simultaneously, in a block, in a similar locally optimal fashion, results in the full block version of LOBPCG. Applying this algorithm efficiently to multi-billion size problems served as the backbone of two Gordon Bell Prize finalists that ran many-body simulations on the Japanese Earth Simulator [3], [4]. One of the performance-crucial key elements is a kernel generating the Krylov search directions by computing the product of the sparse system matrix and a set of vectors. With the sparse matrix vector product performance traditionally limited by memory bandwidth, LOBPCG, depending on this routine, has long been considered unsuitable for GPU acceleration.
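To make the bandwidth argument concrete, the sketch below (our illustration, not part of the paper's GPU implementation; matrix size, density, and block size are arbitrary) contrasts the blocked product of a sparse matrix with a tall-and-skinny block of vectors against the equivalent sequence of single-vector products. The blocked variant streams the sparse matrix from memory once for all vectors, which is the source of the SpMM advantage quantified in the abstract.

    import numpy as np
    import scipy.sparse as sp

    # Illustrative only: compare one blocked SpMM against k consecutive SpMVs.
    n, k = 50_000, 16                        # problem size and block size (arbitrary)
    A = sp.random(n, n, density=1e-4, format='csr')
    A = A + A.T + n * sp.identity(n)         # symmetric, diagonally dominant (hence SPD)
    X = np.random.rand(n, k)                 # tall-and-skinny block of vectors

    Y_spmm = A @ X                           # one sweep over A serves all k vectors
    Y_spmv = np.column_stack([A @ X[:, j] for j in range(k)])   # A is re-read k times

    assert np.allclose(Y_spmm, Y_spmv)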
In this paper we present an LOBPCG implementation for graphics processing units that is able to efficiently leverage the accelerator's computing power. For this purpose, we employ a sophisticated sparse matrix data layout, and develop a kernel specifically designed to efficiently compute the product of a sparse matrix and a tall-and-skinny dense matrix composed of the block of eigenvector approximations. As this kernel is also an integral part of other block-Krylov solvers, the significance of its performance carries beyond the example integration into LOBPCG we present in this paper. We benchmark the routine against similar implementations provided in Intel's MKL [5] and NVIDIA's cuSPARSE [6] library, and analyze the improvement it renders to the performance of the LOBPCG GPU implementation. Finally, we also compare it to state-of-the-art multi-threaded CPU implementations of LOBPCG based on the BLOPEX [7] code, for which the software libraries PETSc and hypre provide an interface [8]. For matrices taken from the University of Florida Matrix Collection, we achieve significant acceleration when computing a set of the respective eigenstates.

The rest of the paper is structured as follows. We first revisit the LOBPCG algorithm, which serves not only as a motivation to develop a building block generating the product of a sparse matrix and multiple vectors, but also as a framework revealing the performance improvements. After listing some key characteristics of the hardware platform and the test matrices we target in Section IV, we provide a detailed description of the sparse matrix storage format and the processing idea of the kernel in Section V. We investigate the performance depending on the number of vectors and compare against both the MKL and cuSPARSE equivalents, as well as against implementations based on consecutive sparse matrix vector products. In Section VI we compare the runtime of the LOBPCG GPU implementation using either consecutive matrix vector products or the developed blocked variant. We also include a brief performance comparison against the state-of-the-art CPU implementation available in the BLOPEX software package. We conclude in Section VII by summarizing the findings and listing algorithms that can potentially be accelerated in a similar fashion.

II. RELATED WORK

Blocked sparse matrix vector product: As there exists a significant need for blocked sparse matrix vector products, NVIDIA's cuSPARSE library provides this routine for the CSR format [6]. Aside from a straightforward implementation that assumes the set of vectors is stored in column-major order, the library also contains an optimized version that takes the block of vectors as a row-major matrix; combined with a preprocessing step that transposes the matrix, it achieves significantly higher performance [9]. The blocked sparse matrix vector product we propose in this paper not only outperforms the cuSPARSE implementations for our test cases, but its detailed description also allows porting it to other architectures.
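To illustrate why the storage order of the vector block matters, the minimal CSR-based SpMM loop below (a didactic NumPy sketch of ours, unrelated to both the cuSPARSE routine and the GPU kernel proposed here) reads, for every nonzero A[i, col], the k values X[col, :]. If X is stored row-major, these k values are contiguous in memory, which is the access pattern the row-major cuSPARSE variant exploits.

    import numpy as np
    import scipy.sparse as sp

    def csr_spmm(A, X):
        # Didactic row-by-row CSR sparse-times-dense-block product.
        # For each nonzero A[i, col] the whole row X[col, :] is read; with X in
        # row-major (C) order these k values are contiguous, so a single sweep
        # over the sparse matrix serves all k vectors at once.
        n, k = A.shape[0], X.shape[1]
        Y = np.zeros((n, k))
        for i in range(n):
            for idx in range(A.indptr[i], A.indptr[i + 1]):
                Y[i, :] += A.data[idx] * X[A.indices[idx], :]
        return Y

    # Sanity check against SciPy's built-in product.
    A = sp.random(1000, 1000, density=0.01, format='csr')
    X = np.ascontiguousarray(np.random.rand(1000, 8))    # row-major block of 8 vectors
    assert np.allclose(csr_spmm(A, X), A @ X)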
The imple- acteristics of the hardware platform and the test matrices we mentation, realized in fortran90, benefits from utilizing the target in Section IV, we provide a detailed description of the generic linear algebra routines available in the CUDA [23] and sparse matrix storage format and the processing idea of the MAGMA [12], [13] GPU libraries. More recently, NVIDIA kernel in Section V. We investigate the performance depending anounced that LOBPCG will be included in the GPU- on the number of vectors and compare against both the MKL accelerated Algebraic Multigrid Accelerator AmgX 1. and cuSPARSE equivalent, and implementations based on consecutive sparse matrix vector products. In Section VI we compare the runtime of the LOBPCG GPU implementation III. LOBPCG using either consecutive matrix vector products or the devel- LOBPCG stands for Locally Optimal Block Preconditioned oped blocked variant. We also include a brief performance Conjugate Gradient method [2], [24]. It is designed to find m comparison against the state-of-the-art CPU implementation of the smallest (or largest) eigenvalues λ and corresponding available in the BLOPEX software package. We conclude in eigenvectors x of a symmetric and positive definite eigenvalue Section VII by summarizing the findings and listing algorithms problem: that can potentially be accelerated in a similar fashion. Ax = λx. Similarly to other CG-based methods, this is accomplished by II. RELATED WORK the iterative minimization of the Rayleigh quotient: Blocked sparse matrix vector product: As there exists xT Ax significant need for blocked sparse matrix vector products, ρ(x) = T ; NVIDIA’s cuSPARSE library provides this routine for the x x CSR format [6]. Aside from a straight-forward implementation which results in finding the smallest eigenstates of the original assuming the set of vectors being stored in column-major problem. In the LOBPCG method the minimization at each order, the library also contains an optimized version taking step is done locally, in the subspace of the current approxi- the block of vectors as row-major matrix as input, that can mation xi, the previous approximation xi−1, and the precon- be used in combination with a preprocessing step transposing ditioned residual P (Axi − λixi), where P is a preconditioner the matrix to achieve significantly higher performance [9]. The for A. The subspace minimization is done by the RayleighRitz blocked sparse matrix vector product we propose in this paper method. This is summarized by the pseudo-code on Figure 1. not only outperforms the cuSPARSE implementations for our test cases, but the detailed description also allows porting it to 1 do i=1, niter other architectures. 2 R = P (AXi − λXi) 3 check convergence criteria Orthogonalizations for GPUs: Orthogonalization of vec- 4 [Xi; λ] = Rayleigh-Ritz on spanfXi;Xi−1;Rg 5 end do tors is a fundamental operation for both linear systems and eigenproblem solvers, and many applications. Therefore there has been extensive research on both its acceleration and stabil- Fig. 1: LOBPCG algorithm ity. Besides the classical and modified Gram-Schmidt orthog- onalizations [10] and orthogonalizations based on LAPACK (xGEQRF + xUNGQR ) [11] and correspondingly MAGMA for GPUs [12], [13], recent work includes communication- Note that the operations in the algorithm are blocked avoiding QR [14], also developed for GPUs [15], [16]. For and therefore can be very efficient on modern architectures. 
Note that the operations in the algorithm are blocked and therefore can be very efficient on modern architectures. Indeed, the AXi is the SpMM kernel, and the bulk of the