Evaluating the Performance of NVIDIA’s A100 Ampere GPU for Sparse Linear Algebra Computations
Preprint, compiled August 20, 2020

Yuhsiang Mike Tsai, Karlsruhe Institute of Technology, Germany, [email protected]
Terry Cojean, Karlsruhe Institute of Technology, Germany, [email protected]
Hartwig Anzt, Karlsruhe Institute of Technology, Germany, and University of Tennessee, Knoxville, USA, [email protected]

Abstract
GPU accelerators have become an important backbone for scientific high performance computing, and the performance advances obtained from adopting new GPU hardware are significant. In this paper we take a first look at NVIDIA’s newest server line GPU, the A100 architecture, part of the Ampere generation. Specifically, we assess its performance for sparse linear algebra operations that form the backbone of many scientific applications and evaluate the performance improvements over its predecessor.

Keywords: Sparse Linear Algebra · Sparse Matrix Vector Product · NVIDIA A100 GPU

1 Introduction

Over the last decade, Graphics Processing Units (GPUs) have seen an increasing adoption in high performance computing platforms, and in the June 2020 TOP500 list, more than half of the fastest 10 systems feature GPU accelerators [1]. At the same time, the June 2020 edition of the TOP500 is the first edition listing a system equipped with NVIDIA’s new A100 GPU, the HPC line GPU of the Ampere generation. As the scientific high performance computing community anticipates this GPU to be the new flagship architecture in NVIDIA’s hardware portfolio, we take a look at the performance we achieve on the A100 for sparse linear algebra operations.

Specifically, we first benchmark in Section 2 the bandwidth of the A100 GPU for memory-bound vector operations and compare against NVIDIA’s A100 predecessor, the V100 GPU. In Section 3, we review the sparse matrix-vector product (SpMV), a central kernel for sparse linear algebra, and outline the processing strategy used in some popular kernel realizations. In Section 4, we evaluate the performance of SpMV kernels on the A100 GPU for more than 2,800 matrices available in the Suite Sparse Matrix Collection [22]. The SpMV kernels we consider in this performance evaluation are taken from NVIDIA’s latest release of the cuSPARSE library and the Ginkgo linear algebra library [2]. In Section 5, we compare the performance of the A100 against its predecessor for complete iterations of Krylov solvers, which are popular methods for iterative sparse linear system solves. In Section 6, we summarize the performance assessment results and draw some preliminary conclusions on what performance we may achieve for sparse linear algebra on the A100 GPU.

We emphasize that with this paper, we do not intend to provide another technical specification of NVIDIA’s A100 GPU, but instead focus on the reporting of performance we observe on this architecture for sparse linear algebra operations. Still, for convenience, we append a table from NVIDIA’s whitepaper on the NVIDIA A100 Tensor Core GPU Architecture [20] that lists some key characteristics and compares against the predecessor GPU architectures. For further information on the A100 GPU, we refer to the whitepaper [20], and encourage the reader to digest the performance results we present side-by-side with these technical details.

2 Memory Bandwidth Assessment

The performance of sparse linear algebra operations on modern hardware architectures is usually limited by the data access rather than the compute power. In consequence, for sparse linear algebra, the performance-critical hardware characteristics are the memory bandwidth and the access latency. For main memory access, both metrics are typically somewhat intertwined, in particular on processors operating in streaming mode like GPUs. In this section, we assess the memory access performance by means of the Babel-STREAM benchmark [10].

We show the Babel-STREAM benchmark results for both an NVIDIA V100 GPU (Figure 1a) and an NVIDIA A100 GPU (Figure 1b). The figures reflect a significant bandwidth improvement for all operations on the A100 compared to the V100. For an array of size 8.2 GB, the V100 reaches for all operations a performance between 800 and 840 GB/s, whereas the A100 reaches a bandwidth between 1.33 and 1.4 TB/s.

In Figure 2, we present the data as a ratio between the A100 and V100 bandwidth performance for all operations. In Figure 2a, we show the performance ratio for increasing array sizes and observe the A100 providing a lower bandwidth than the V100 GPU for arrays of small size. For large array sizes, the A100 has a significantly higher bandwidth, converging towards a speedup factor of 1.7 for most memory access benchmarks, see Figure 2b.
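To make the memory-bound character of these operations concrete, the following is a minimal CUDA sketch of a triad-style kernel (a[i] = b[i] + scalar * c[i]) in the spirit of the Babel-STREAM benchmark. It is not the benchmark code itself; the kernel name, array size, launch configuration, and single timed repetition are simplifying assumptions.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Triad: a[i] = b[i] + scalar * c[i]; two loads and one store per index.
    __global__ void triad(double* a, const double* b, const double* c,
                          double scalar, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] = b[i] + scalar * c[i];
    }

    int main()
    {
        const int n = 1 << 27;   // about 134 million elements, roughly 1 GB per array
        double *a, *b, *c;
        cudaMalloc(&a, n * sizeof(double));
        cudaMalloc(&b, n * sizeof(double));
        cudaMalloc(&c, n * sizeof(double));
        cudaMemset(b, 0, n * sizeof(double));
        cudaMemset(c, 0, n * sizeof(double));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        triad<<<(n + 255) / 256, 256>>>(a, b, c, 2.0, n);   // warm-up run
        cudaEventRecord(start);
        triad<<<(n + 255) / 256, 256>>>(a, b, c, 2.0, n);   // timed run
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // Three arrays of n doubles are streamed through main memory.
        double gbs = 3.0 * n * sizeof(double) / (ms * 1e-3) / 1e9;
        std::printf("Triad bandwidth: %.1f GB/s\n", gbs);

        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

Dividing the bytes moved by the measured time gives the achievable bandwidth such a benchmark reports; as Figure 2a shows, the A100 only reaches its full advantage over the V100 for sufficiently large arrays.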

Figure 1: Performance of the Babel-STREAM benchmark on both V100 and A100 NVIDIA GPUs. Panels: (a) V100, (b) A100; both report the bandwidth in GB/s for the Add, Copy, Dot, Mul, and Triad functions as a function of the array size (MB).

Figure 2: Ratio of the performance improvement of A100 compared to V100 NVIDIA GPUs for the Babel-STREAM benchmark. Panels: (a) evolution of the A100/V100 ratio for different array sizes, (b) A100/V100 ratio for an array of size 8.2 GB; both report the bandwidth ratio for the Add, Copy, Dot, Mul, and Triad functions.

3 Sparse Matrix Vector Product

The sparse matrix-vector product (SpMV) is a heavily-used operation in many scientific applications. The spectrum ranges from the power iteration [8], an iterative algorithm for finding eigenpairs used in Google’s Page Rank algorithm [16], to iterative linear system solvers like Krylov solvers that form the backbone of many finite element simulations. Given this importance, we put particular focus on the performance of the SpMV operation on NVIDIA’s A100 GPU. However, the performance not only depends on the hardware and the characteristics of the sparse matrix, but also on the specific SpMV format and processing strategy.

Generally, all SpMV kernels aim at reducing the memory access cost (and computational cost) by storing only the nonzero matrix values [6]. Some formats additionally store a moderate amount of zero elements to enable faster processing when computing matrix-vector products [7]. But independent of the specific strategy, since SpMV kernels store only a subset of the elements, they share the need to accompany these values with information that allows to deduce their location in the original matrix [3]. We here recall in Figure 3 some widespread sparse matrix storage formats and kernel parallelization techniques.

A straightforward idea is to accompany the nonzero elements with the respective row and column indexes. This storage format, known as coordinate (COO [6]) format, allows determining the original position of any element in the matrix without processing other entries, see the first row in Figure 3. A standard parallelization of this approach assigns the matrix rows to the distinct processing elements (cores). However, if few rows contain a significant portion of the overall nonzeros, this can result in a significant imbalance of the kernel and poor performance. A workaround is to distribute the nonzeros across the parallel resources, see the right-hand side in Figure 3. However, as this can result in several processing elements contributing partial sums to the same output vector entry, sophisticated techniques for lightweight synchronization are needed to avoid write conflicts [11].

Starting from the COO format, further reduction of the storage cost is possible if the elements are sorted row-wise, and with increasing column order in every row. (The latter assumption is technically not required, but it usually results in better performance.) Then, this Compressed Sparse Row (CSR [6]) format can replace the array containing the row indexes with a pointer to the beginning of the distinct rows, see the second row in Figure 3. While this generally reduces the data volume, the CSR format requires extra processing to determine the row location of a certain element. For a standard parallelization over the matrix rows, the row location is implicitly given, and no additional information is needed. However, similar to the COO format, better load balancing is available if parallelizing across the nonzero elements. This requires the matrix row information and sophisticated atomic writes to the output vector [12].
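As an illustration of the row-parallel CSR strategy described above, the following is a minimal CUDA sketch that assigns one thread per row. It is a simplified example rather than the cuSPARSE or Ginkgo kernel (the kernel name and launch assumptions are illustrative), and it inherits the load imbalance discussed above when a few rows are much longer than the rest.

    // Minimal scalar CSR SpMV: y = A * x with one thread per row.
    // rowptr has nrows + 1 entries; colidx and values hold the nonzeros row by row.
    __global__ void csr_spmv(int nrows, const int* rowptr, const int* colidx,
                             const double* values, const double* x, double* y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < nrows) {
            double sum = 0.0;
            for (int k = rowptr[row]; k < rowptr[row + 1]; ++k) {
                sum += values[k] * x[colidx[k]];   // indirect access to the input vector
            }
            y[row] = sum;   // each output entry is written by exactly one thread
        }
    }

For the 6 × 6 example matrix shown in Figure 3, this kernel would be called with rowptr = [0 2 7 8 10 11 12], colidx = [0 1 0 1 3 4 5 2 0 3 4 5], and the corresponding value array.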

Figure 3: Overview over sparse matrix formats and SpMV kernel design. For a 6 × 6 example matrix, the figure illustrates how the COO format (value, column-index, and row-index arrays), the CSR format (value and column-index arrays plus a row pointer to the first element in each row), the ELL format (value and column-index arrays padded to a uniform row length), and the sliced-ELL format (padding applied per row-block, with a row pointer to the first element in each row-block) store the nonzero entries. It also contrasts parallelizing the SpMV over the matrix rows (left) with parallelizing over the nonzero elements (right), including the resulting access patterns to the input vector x and the output vector y.
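To complement the figure, here is a minimal CUDA sketch of the nonzero-parallel COO strategy: one thread per stored element, with atomic additions resolving the write conflicts on the output vector mentioned above. It is a simplified illustration (the kernel name is ours) and not the load-balancing kernels of [11, 12].

    // Minimal COO SpMV: y += A * x with one thread per stored nonzero.
    // rowidx, colidx, and values each hold nnz entries; y must be zero-initialized.
    // Requires native double-precision atomicAdd (compute capability 6.0 or higher,
    // which both the V100 and the A100 provide).
    __global__ void coo_spmv(int nnz, const int* rowidx, const int* colidx,
                             const double* values, const double* x, double* y)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        if (k < nnz) {
            // Several threads may contribute to the same output row, so the
            // partial sums are accumulated with atomic additions.
            atomicAdd(&y[rowidx[k]], values[k] * x[colidx[k]]);
        }
    }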

A strategy that reduces the row indexing information even further is to pad all rows to the same number of nonzero elements and accompany the values only with the column indexes. In this ELL format [7] (third row in Figure 3), the row index of an element can be deduced from its location in the array storing the values in consecutive order and the information of how many elements are stored in each row. While this format is attractive for vector processors as it allows to execute in SIMD fashion with coalesced memory access, its efficiency heavily depends on the matrix characteristics: for well-balanced matrices, it optimizes memory access cost and operation count; but if one or a few rows contain many more nonzero elements than the others, ELL introduces a significant padding overhead and quickly becomes inefficient [3].

An attractive strategy to reduce the padding overhead the ELL format introduces for unbalanced matrices is to decompose the original matrix into blocks containing multiple rows, and to store the distinct blocks in ELL format. Rows of the same block contain the same number of nonzero elements, while rows in distinct blocks can differ in the number of nonzero elements. In this sliced ELL format (SELL [15], see the fourth row in Figure 3), the row pointer cannot be omitted completely like in the ELL case, but a row pointer to the beginning of every block is needed. In this sense, the SELL format is a trade-off between the ELL format on the one side and the CSR format on the other side. Indeed, choosing a block size of 1 results in the CSR format, and choosing a block size equal to the matrix size results in the ELL format. In practice, the block size is adjusted to the matrix properties and the characteristics of the parallel hardware, i.e., the SIMD width [5].

Another strategy to balance between the effectiveness of the ELL SpMV kernel and the more general CSR/COO SpMV kernels is to combine the formats in a “hybrid” SpMV kernel [3]. The concept behind it is to store the balanced part of the matrix in ELL format and the unbalanced part in CSR or COO format. The SpMV operation then invokes two kernels, one for the balanced part and one for the unbalanced part.

For all these kernel strategies, there have been significant SpMV research efforts to optimize their performance on GPU architectures, see, e.g., [3, 9, 18, 17, 14, 21] and references therein. We here focus on the SpMV kernels of the cuSPARSE and Ginkgo libraries that are well-known to belong to the most efficient implementations available.
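For the padded formats discussed above, a minimal CUDA sketch of an ELL kernel could look as follows. It assumes the padded value and column-index arrays are stored column-major and that padded entries hold a zero value and a valid dummy column index; again, this is a simplified illustration rather than Ginkgo’s ELL or SELL-P kernel.

    // Minimal ELL SpMV: y = A * x with one thread per row.
    // values and colidx are dense nrows x max_nnz_per_row arrays stored column-major,
    // so consecutive threads access consecutive memory locations (coalescing).
    // Padded entries hold the value 0.0 and an arbitrary valid column index.
    __global__ void ell_spmv(int nrows, int max_nnz_per_row, const int* colidx,
                             const double* values, const double* x, double* y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < nrows) {
            double sum = 0.0;
            for (int j = 0; j < max_nnz_per_row; ++j) {
                int idx = j * nrows + row;          // column-major index of element (row, j)
                sum += values[idx] * x[colidx[idx]];
            }
            y[row] = sum;
        }
    }

The column-major layout is what enables coalesced accesses; the loop bound max_nnz_per_row is where the padding overhead for unbalanced matrices enters.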
4 SpMV Performance Assessment

For the SpMV kernel performance assessment, we consider more than 2,800 matrices from the Suite Sparse Matrix Collection [22]. Specifically, we consider all real matrices, except for Figure 4b where we require the matrix to contain more than 1e5 nonzero elements. On the A100 GPU, we run the SpMV kernels from NVIDIA’s cuSPARSE library in version 11.0 (Release Candidate) and the Ginkgo open source library version 1.2 [2]. On the older V100 GPU, we run the same version of the Ginkgo library but use cuSPARSE version 10.1¹.

¹ cusparse_csr and cusparse_coo are part of CUDA version 11.0; CUDA 10.1 contains the conventional CSR SpMV routine.

Figure 4: Left (a): SpMV kernel performance on the A100 GPU considering 2,800 test matrices from the Suite Sparse Matrix Collection. Right (b): corresponding performance profile for all SpMV kernels considered.

On the left-hand side of Figure 4, we initially show the performance of all considered SpMV kernels on the A100 GPU for all test matrices: each dot represents the performance one of the SpMV kernels achieves for one test matrix. While it is impossible to draw strong conclusions, we can observe that some CSR-based SpMV kernels exceed 200 GFLOP/s. This is consistent with the roofline model [23]: if we assume every entry of a CSR matrix needs 12 bytes (8 bytes for the fp64 value, 4 bytes for the column index, and ignoring the row pointer), assume all vector entries are cached (ignoring access to the input and output vectors), and a peak memory bandwidth of 1.4 TB/s, we end up with (2 · nnz flops) / (12 B · nnz) · 1,400 GB/s ≈ 230 GFLOP/s.

On the right-hand side of Figure 4, we show a performance profile [13] considering all SpMV kernels available in either cuSPARSE or Ginkgo for real matrices containing more than 1e5 nonzero elements. As we can see, the cusparse_gcsr2 kernel has the largest share in terms of being the fastest kernel for a SpMV problem. However, cusparse_gcsr2 does not generalize well: cusparse_gcsr2 is more than 1.5× slower than the fastest kernel for 20% of the problems, and more than 3× slower than the fastest kernel for 10% of the problems.
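As a reading aid for Figure 4b, a performance profile in the sense of [13] can be computed from per-matrix timings as sketched below (host-side code; the function name and the timing data are assumptions of this sketch). For each kernel, the profile value at a factor tau is the fraction of matrices on which the kernel is at most tau times slower than the fastest kernel for that matrix.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // times[k][m] is the runtime of kernel k on matrix m (any common unit).
    // Returns, for each kernel, the fraction of matrices on which it is at most
    // a factor tau slower than the fastest kernel for that matrix.
    std::vector<double> performance_profile(
        const std::vector<std::vector<double>>& times, double tau)
    {
        const std::size_t num_kernels = times.size();
        const std::size_t num_matrices = times.front().size();
        std::vector<double> profile(num_kernels, 0.0);
        for (std::size_t m = 0; m < num_matrices; ++m) {
            double best = times[0][m];
            for (std::size_t k = 1; k < num_kernels; ++k) {
                best = std::min(best, times[k][m]);
            }
            for (std::size_t k = 0; k < num_kernels; ++k) {
                if (times[k][m] <= tau * best) {
                    profile[k] += 1.0 / num_matrices;
                }
            }
        }
        return profile;
    }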

Although the Ginkgo CSR SpMV kernel is not the fastest choice for as many problems as the cusparse_csr and the cusparse_gcsr2 kernels, Ginkgo CSR generalizes better, and is virtually never more than 2.5× slower than the fastest kernels among all matrices. The kernels providing the best performance portability across all matrix problems are the COO SpMV kernels from Ginkgo and cuSPARSE: only for 10% of the problems are they more than 1.4× slower than the fastest kernel. As expected, the SELLP SpMV kernel does not generalize well, but it is the fastest kernel for 20% of the problems.

Figure 5: Performance improvement assessment of the A100 GPU over the V100 GPU for SpMV kernels from NVIDIA’s cuSPARSE library and Ginkgo. Panels: (a) cuSPARSE CSR SpMV, (b) Ginkgo CSR SpMV, (c) cuSPARSE COO SpMV, (d) Ginkgo COO SpMV, (e) Ginkgo ELL SpMV, (f) Ginkgo hybrid SpMV.

We then evaluate the performance improvements when comparing the SpMV kernels on the newer NVIDIA A100 GPU and the older NVIDIA V100 GPU. In Figure 5, we visualize the speedup factors for NVIDIA’s cuSPARSE library (left-hand side) and the Ginkgo library (right-hand side). In the first row of Figure 5 we focus on the CSR performance. As expected, the CSR SpMV achieves higher performance on the newer A100 GPU. For both libraries, cuSPARSE and Ginkgo, the CSR kernels achieve for most matrices a 1.7× speedup on the A100 GPU – which reflects the bandwidth increase. However, for many matrices, the speedup exceeds 1.7×. For Ginkgo, the acceleration of up to 8× might be related to larger caches on the A100 GPU allowing for more efficient caching of the input vector entries. The even larger speedups of up to 40× for the cuSPARSE CSR kernel are likely only in part coming from the newer hardware: cuSPARSE 11.0 introduces new sparse linear algebra kernels (generic API) that are different in design and interface compared to the traditional kernels of cuSPARSE 10.2 [19]. The analysis of the performance improvement for the COO kernels is presented in the second row of Figure 5. For matrices with less than 500,000 nonzeros, the performance improvements are about 1.3× for the Ginkgo library and 2.1× for the cuSPARSE library. For matrices with more than 500,000 nonzero elements, we observe a sudden increase of the speedup to more than 5× for the cuSPARSE COO kernel, which is likely linked to an algorithmic improvement.

In the third row of Figure 5, we visualize the performance improvements for Ginkgo’s ELL and hybrid formats that do not have a direct counterpart in cuSPARSE 11.0. Again, we see that the A100 provides for most matrices about 1.4× higher performance; however, especially for the ELL SpMV kernel, there are several matrices where the V100 achieved higher performance.

To investigate these singularities further, we correlate in Figure 6 the performance improvement to the coefficient of variance of the nonzero-per-row metric, that is, the ratio between the variance of the nonzero-per-row metric and the mean of the nonzero-per-row metric. Given the strategy with which Ginkgo’s ELL kernel balances the work, larger coefficients of variance tend to introduce small data reads that perform poorly on the A100 GPU. Even more visible is this effect for the CSR_C SpMV kernel, the classical row-parallelized CSR SpMV kernel, see the right-hand side in Figure 6: irregular sparsity patterns, resulting in frequent loads of small data arrays and reflected in a large coefficient of variance, result in poor performance on the A100 GPU. Remarkably, the V100 can better deal with the frequent access to small data arrays. Ginkgo’s CSR SpMV automatically chooses between the classical CSR_C kernel, providing good performance for regular sparsity patterns, and the load-balancing CSR_I kernel [12], providing good performance for unbalanced sparsity patterns.

Figure 6: Performance improvement of the A100 GPU over the V100 GPU for the Ell and classical Csr SpMV kernels from Ginkgo, plotted against the coefficient of variance of the nonzero-per-row metric. Panels: (a) Ginkgo Ell SpMV, (b) Ginkgo classical Csr SpMV.

5 Krylov Solver Performance Assessment

Krylov methods are among the most efficient algorithms for solving large and sparse linear systems. When applied to a linear system Ax = b (with the sparse coefficient matrix A, right-hand side b, and unknown x), Krylov solvers started with an initial guess x_0 produce a sequence of vectors x_1, x_2, x_3, ... that, in general, progressively reduce the norm of the residuals r_k = b − A x_k, eventually yielding an acceptable approximation to the solution of the system [4].

Algorithmically, every iteration of a Krylov solver is composed of a (sparse) matrix vector product to generate the new search direction, an orthogonalization procedure, and the update of the approximate solution and the residual vector. In practice, Krylov methods are often enhanced with preconditioners to improve robustness and convergence [4]. Ignoring the preconditioner, every iteration can be composed of level-1 BLAS routines (vector operations) and a level-2 BLAS operation in the form of a sparse matrix vector product (SpMV). All these components – and in virtually all cases also the preconditioner application – are memory bound operations.
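To make this composition concrete, the following host-side sketch spells out an unpreconditioned conjugate gradient iteration in terms of these building blocks. It is a schematic reference, not Ginkgo’s solver; on the GPU, each step maps to one or more kernels such as the SpMV sketches in Section 3, and all helper names are ours.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    using vec = std::vector<double>;

    // Level-1 BLAS building blocks (memory bound).
    double dot(const vec& a, const vec& b)
    {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    void axpy(double alpha, const vec& x, vec& y)   // y += alpha * x
    {
        for (std::size_t i = 0; i < y.size(); ++i) y[i] += alpha * x[i];
    }

    // Sparse matrix-vector product in CSR format (memory bound).
    void csr_spmv(const std::vector<int>& rowptr, const std::vector<int>& colidx,
                  const vec& values, const vec& x, vec& y)
    {
        for (std::size_t row = 0; row + 1 < rowptr.size(); ++row) {
            double sum = 0.0;
            for (int k = rowptr[row]; k < rowptr[row + 1]; ++k) {
                sum += values[k] * x[colidx[k]];
            }
            y[row] = sum;
        }
    }

    // Unpreconditioned conjugate gradient: every iteration performs one SpMV,
    // two dot products, and three vector updates.
    void cg(const std::vector<int>& rowptr, const std::vector<int>& colidx,
            const vec& values, const vec& b, vec& x, int max_iters, double tol)
    {
        const std::size_t n = b.size();
        x.assign(n, 0.0);        // zero initial guess, so r_0 = b
        vec r = b;
        vec p = r;               // first search direction
        vec Ap(n);
        double rr = dot(r, r);
        for (int it = 0; it < max_iters && std::sqrt(rr) > tol; ++it) {
            csr_spmv(rowptr, colidx, values, p, Ap);    // level-2: image of the search direction
            const double alpha = rr / dot(p, Ap);       // level-1: dot product
            axpy(alpha, p, x);                          // level-1: update approximate solution
            axpy(-alpha, Ap, r);                        // level-1: update residual
            const double rr_new = dot(r, r);            // level-1: dot product
            const double beta = rr_new / rr;
            for (std::size_t i = 0; i < n; ++i) {
                p[i] = r[i] + beta * p[i];              // level-1: new search direction
            }
            rr = rr_new;
        }
    }

Every iteration streams the matrix once (the SpMV) and the vectors a handful of times (the dot products and updates), which is why, as discussed next, the expected solver speedups largely track the bandwidth and SpMV improvements of Sections 2 and 4.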
When upgrading from the V100 to the A100 GPU, for the vector updates and reduction operations, we can expect performance improvements corresponding to the bandwidth improvements observed in Section 2. For the basis-generating SpMV kernel, the improvements are problem-dependent and may even exceed the 1.7× bandwidth improvement, see Section 4. The total solver speedup then depends on how the Krylov method composes the SpMV and vector operations, and how these components contribute to the overall runtime.

In Figure 7, we visualize the performance improvement observed when upgrading from the V100 to the A100 GPU. We select 10 test matrices that are large in terms of size and nonzero elements, different in their characteristics, and representative for different real-world applications. The Krylov solvers are all taken from the Ginkgo library [2]; the SpMV kernel we employ inside the solvers is Ginkgo’s COO SpMV kernel. For most test problems, we actually observe larger performance improvements than what the bandwidth ratios suggest. We also observe that if focusing exclusively on a single test problem, the speedup factors for the Krylov methods based on short recurrences (i.e., bicg, cg, cgs, fcg) are all almost identical. These methods are all very similar in design, and the SpMV kernel takes a similar runtime fraction. The gmres algorithm is not based on short recurrences but builds up a complete search space, and every new search direction has to be orthogonalized against all previous search directions. As this increases the cost of the orthogonalization step – realized via a classical Gram-Schmidt algorithm – the SpMV accounts for a smaller fraction of the algorithm runtime. As a result, the speedup for gmres is often different than the speedup for the other methods. However, also for gmres the speedup can exceed the 1.7× bandwidth improvement, as the A100 features larger caches that can be a significant advantage for the orthogonalization kernel. Overall, we observe that for most scenarios we tested, the Krylov solver executes on the A100 GPU more than 1.8× faster than on the V100 GPU.

Figure 7: Krylov solver acceleration when upgrading from the V100 GPU to the A100 GPU.

6 Conclusion

In this paper, we assessed the performance NVIDIA’s new A100 GPU achieves for sparse linear algebra operations. As most of these algorithms are memory bound, we initially presented results for the STREAM bandwidth benchmark, then provided a detailed assessment of the sparse matrix vector product performance for both NVIDIA’s cuSPARSE library and the Ginkgo open source library, and ultimately ran complete Krylov solvers combining vector operations with sparse matrix vector products and orthogonalization routines. Compared to its predecessor, the A100 GPU provides a 1.7× higher memory bandwidth, achieving almost 1.4 TB/s for large input sizes. The larger caches on the A100 allow for even higher performance improvements in complex applications like the sparse matrix vector product. Ginkgo’s Krylov iterative solvers run in most cases more than 1.8× faster on the A100 GPU than on the V100 GPU.

Acknowledgments

This work was supported by the “Impuls und Vernetzungsfond” of the Helmholtz Association under grant VH-NG-1241 and by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. The authors would like to thank the Steinbuch Centre for Computing (SCC) of the Karlsruhe Institute of Technology for providing access to an NVIDIA A100 GPU.

References

[1] The TOP500 List. http://www.top500.org/.

[2] Hartwig Anzt, Terry Cojean, Goran Flegar, Fritz Göbel, Thomas Grützmacher, Pratik Nayak, Tobias Ribizel, Yuhsiang Mike Tsai, and Enrique S. Quintana-Ortí. Ginkgo: A modern linear operator algebra framework for high performance computing, 2020.

[3] Hartwig Anzt, Terry Cojean, Chen Yen-Chen, Jack Dongarra, Goran Flegar, Pratik Nayak, Stanimire Tomov, Yuhsiang M. Tsai, and Weichung Wang. Load-Balancing Sparse Matrix Vector Product Kernels on GPUs. ACM Trans. Parallel Comput., 7(1), March 2020.

[4] Hartwig Anzt, Mark Gates, Jack Dongarra, Moritz Kreutzer, Gerhard Wellein, and Martin Köhler. Preconditioned Krylov solvers on GPUs. Parallel Computing, 68:32–44, October 2017.

[5] Hartwig Anzt, Stanimire Tomov, and Jack Dongarra. Implementing a Sparse Matrix Vector Product for the SELL-C/SELL-C-σ formats on NVIDIA GPUs. Technical Report ut-eecs-14-727, University of Tennessee, March 2014.

[6] Richard Barrett, Michael Berry, Tony F. Chan, James Demmel, June Donato, Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and Henk Van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. SIAM, Philadelphia, PA, 1994.

[7] Nathan Bell and Michael Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, pages 18:1–18:11, New York, NY, USA, 2009. ACM.

[8] Gianna M. Del Corso. Estimating an eigenvector by the power method with a random start. SIAM J. Matrix Anal. Appl., 18(4):913–937, October 1997.

[9] Steven Dalton, Sean Baxter, Duane Merrill, Luke Olson, and Michael Garland. Optimizing sparse matrix operations on GPUs using merge path. In 2015 IEEE International Parallel and Distributed Processing Symposium, pages 407–416, May 2015.

[10] Tom Deakin, James Price, Matt Martineau, and Simon McIntosh-Smith. GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models. In Michela Taufer, Bernd Mohr, and Julian M. Kunkel, editors, High Performance Computing, pages 489–507, Cham, 2016. Springer International Publishing.

[11] Goran Flegar and Hartwig Anzt. Overcoming load imbalance for irregular sparse matrices. In Proceedings of the Seventh Workshop on Irregular Applications: Architectures and Algorithms, IA3’17, pages 2:1–2:8, New York, NY, USA, 2017. ACM.

[12] Goran Flegar and Enrique S. Quintana-Ortí. Balanced CSR sparse matrix-vector product on graphics processors. In Francisco F. Rivera, Tomás F. Pena, and José C. Cabaleiro, editors, Euro-Par 2017: Parallel Processing, pages 697–709, Cham, 2017. Springer International Publishing.

[13] Nicholas Gould and Jennifer Scott. A note on performance profiles for benchmarking software. ACM Trans. Math. Softw., 43(2), August 2016.

[14] Changwan Hong, Aravind Sukumaran-Rajam, Israt Nisa, Kunal Singh, and P. Sadayappan. Adaptive sparse tiling for sparse matrix multiplication. In Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2019, Washington, DC, USA, February 16-20, 2019, pages 300–314, 2019.

[15] Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, and Alan R. Bishop. A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM J. Scientific Computing, 36(5):C401–C423, 2014.

[16] Amy N. Langville and Carl D. Meyer. Google’s PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton, NJ, USA, 2012.

[17] Duane Merrill and Michael Garland. Merge-based parallel sparse matrix-vector multiplication. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’16, pages 58:1–58:12, Piscataway, NJ, USA, 2016. IEEE Press.

[18] Duane Merrill, Michael Garland, and Andrew S. Grimshaw. High-performance and scalable GPU graph traversal. TOPC, 1(2):14:1–14:30, 2015.

[19] NVIDIA. CUDA 11.0 Release Notes. https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#title-new-features, June 2020.

[20] NVIDIA. NVIDIA A100 Tensor Core GPU Architecture. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf, June 2020.

[21] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the Web. In Proceedings of the 7th International World Wide Web Conference, pages 161–172, Brisbane, Australia, 1998.

[22] SuiteSparse Matrix Collection. https://sparse.tamu.edu, 2018. Accessed in April 2018.

[23] Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun. ACM, 52(4):65–76, April 2009.

Appendix: NVIDIA V100/A100 Tensor Core GPU Architecture

Features                  V100                A100
GPU Architecture          NVIDIA Volta        NVIDIA A100
SMs                       80                  108
TPCs                      40                  54
FP32 Cores/SM             64                  64
FP64 Cores/SM             32                  32
INT32 Cores/SM            64                  64
GPU Boost Clock           1530 MHz            1410 MHz
Peak FP16 TFLOPS          31.4                78
Peak FP32 TFLOPS          15.7                19.5
Peak FP64 TFLOPS          7.8                 9.7
Texture Units             320                 432
Memory Interface          4096-bit HBM2       5120-bit HBM2
Memory Data Rate          877.5 MHz DDR       1215 MHz DDR
Memory Bandwidth          900 GB/sec          1555 GB/sec
L2 Cache                  6144 KB             40960 KB
Shared Memory Size/SM     96 KB               164 KB
Register File Size        256 KB              256 KB
Transistors               21.1 billion        54.2 billion

Table 1: Technical characteristics of NVIDIA’s A100 GPU architecture in comparison to its predecessor, the V100 [20].
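For reference, several of the characteristics in Table 1 can be queried at runtime through the CUDA runtime API. The following minimal sketch (assuming device 0) prints the SM count, cache sizes, and the theoretical memory bandwidth derived from the memory clock and bus width; it is a convenience helper, not part of any of the evaluated libraries.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // Theoretical bandwidth: 2 (DDR) x memory clock x bus width in bytes.
        double bw = 2.0 * prop.memoryClockRate * 1e3       // kHz -> Hz
                  * (prop.memoryBusWidth / 8.0) / 1e9;     // bits -> bytes, B/s -> GB/s
        std::printf("Device:           %s\n", prop.name);
        std::printf("SMs:              %d\n", prop.multiProcessorCount);
        std::printf("L2 cache:         %d KB\n", prop.l2CacheSize / 1024);
        std::printf("Shared memory/SM: %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
        std::printf("Peak bandwidth:   %.0f GB/s\n", bw);
        return 0;
    }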