A Guide For Achieving High Performance With Very Small Matrices on GPU: A Case Study of Batched LU and Cholesky Factorizations

Azzam Haidar∗, Ahmad Abdelfattah∗, Mawussi Zounon‡, Stanimire Tomov∗, Jack Dongarra∗†‡
{haidar,ahmad,tomov,dongarra} [email protected], [email protected]
∗Innovative Computing Laboratory, University of Tennessee, Knoxville, TN, USA
†Oak Ridge National Laboratory, Oak Ridge, TN, USA
‡University of Manchester, Manchester, UK

Abstract—We present high-performance GPU kernels that deliver a substantial speedup over vendor libraries for very small matrix computations. In addition, we discuss the main challenges that hinder the design of efficient GPU kernels for small matrix algorithms. We propose an algorithmic analysis for harnessing the full power of a GPU, together with strategies for predicting the performance of a kernel before committing to an implementation. We develop a theoretical analysis and a methodology for building high-performance linear solvers for very small matrices. As test cases, we take the Cholesky and LU factorizations and show how the proposed methodology enables us to achieve performance close to the theoretical upper bound of the hardware. This work investigates and proposes novel algorithms for designing highly optimized GPU kernels for solving batches of hundreds of thousands of small Cholesky and LU factorizations. Our focus on efficient batched Cholesky and batched LU kernels is motivated by the increasing need for these kernels in scientific simulations (e.g., astrophysics applications). Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. The proposed GPU kernels achieve speedups over CUBLAS of up to 6× for the factorizations, using double-precision arithmetic on an NVIDIA Pascal P100 GPU.

Index Terms—batched computation, GPUs, variable small sizes

1 INTRODUCTION

Although it might seem like an attractive idea to focus the efforts of the high-performance computing (HPC) community on addressing large-scale problems, the experience of the research community over the last few years has clearly shown that applications that use many small matrices or tensors cannot make efficient use of modern HPC systems and the associated vendor-optimized linear algebra libraries. The existing libraries have been designed for large matrices, and—historically—the scope of vendor libraries has been too narrow to address this class of problems. Consequently, the performance that these libraries deliver tends to be inadequate. Moreover, there are good reasons to believe that neither improved compiler technology nor autotuning will make any significant headway on this problem. This lack of coverage by the current library infrastructure is especially alarming because of the number of applications from important fields that fit this profile, including deep learning [9], data mining [32], astrophysics [24], image and signal processing [5], [25], hydrodynamics [11], quantum chemistry [6], and computational fluid dynamics (CFD) with the resulting partial differential equations (PDEs) solved through direct and multifrontal solvers [42], to name a few. Dramatically better performance on these applications can be achieved by using software that can repetitively execute small matrix/tensor operations grouped together in "batches." However, the near-total lack of such capabilities in existing software has forced users to rely on a variety of inefficient solutions.

For example, in the NWChem package used in chemistry problems, the computation of the two-electron integrals and the Fock matrix becomes a bottleneck when many integrals of small size have to be computed independently, which shows the necessity of optimized batched libraries. Moreover, in [35], the authors discussed the optimization of NWChem for Intel's MIC architecture and highlighted the need for tensor computations on about 200–2,000 matrices of size 10 × 10 to 40 × 40. In his dissertation, David Ozog discussed NWChem's Tensor Contraction Engine (TCE) and revealed how strongly it relies on the performance of general matrix-matrix multiplication (GEMM) in the computation of tensor contractions.
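Since such tensor contractions reduce to many small, independent GEMMs, the batched-execution idea can be made concrete with a short example. The sketch below groups a large number of small matrix multiplications into a single call to cuBLAS's cublasDgemmBatched. It is a minimal sketch rather than code from the paper: the matrix size, the batch count, and the way the per-matrix pointer arrays are built are illustrative assumptions.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Sketch: compute C_i = A_i * B_i for `batch` independent n x n matrices
 * with one call to cublasDgemmBatched (error checking omitted). */
int main(void) {
    const int n = 32;           /* assumed (very small) matrix size       */
    const int batch = 100000;   /* assumed number of independent problems */
    const double alpha = 1.0, beta = 0.0;
    size_t bytes = (size_t)n * n * batch * sizeof(double);

    /* One contiguous slab per operand; one pointer per small matrix. */
    double *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    cudaMemset(dA, 0, bytes);   /* real code would fill A_i, B_i with data */
    cudaMemset(dB, 0, bytes);

    double **hA = (double**)malloc(batch * sizeof(double*));
    double **hB = (double**)malloc(batch * sizeof(double*));
    double **hC = (double**)malloc(batch * sizeof(double*));
    for (int i = 0; i < batch; ++i) {
        hA[i] = dA + (size_t)i * n * n;
        hB[i] = dB + (size_t)i * n * n;
        hC[i] = dC + (size_t)i * n * n;
    }
    double **dPtrA, **dPtrB, **dPtrC;
    cudaMalloc((void**)&dPtrA, batch * sizeof(double*));
    cudaMalloc((void**)&dPtrB, batch * sizeof(double*));
    cudaMalloc((void**)&dPtrC, batch * sizeof(double*));
    cudaMemcpy(dPtrA, hA, batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dPtrB, hB, batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dPtrC, hC, batch * sizeof(double*), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* All `batch` tiny GEMMs are submitted in one launch, which keeps the
     * GPU saturated even though each individual problem is very small. */
    cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &alpha, (const double**)dPtrA, n,
                               (const double**)dPtrB, n,
                       &beta,  dPtrC, n, batch);

    cublasDestroy(handle);
    cudaFree(dPtrA); cudaFree(dPtrB); cudaFree(dPtrC);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}

Note that an interface of this kind fixes a single size n for the entire batch; workloads with variable small sizes, such as those targeted in this paper, need per-problem dimension information instead.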
In the summary of the work done in [20], the authors noted that small matrix computations constitute a severe bottleneck and that specialized libraries are required. The need for efficient libraries for thousands of small matrix computations was also discussed at NVIDIA's GPU Technology Conference in 2016 by Mniszewski et al. [34] in the context of their research on quantum molecular dynamics, where the dense matrix computation can be performed as an independent set of computations. Another important motivation is that the matrix polynomial can be computed in a batched fashion as a set of small, independent computations [27].

Deep learning communities have also shown significant interest in computations involving many small matrices. NVIDIA [10] already highlighted the need for a batched GEMM routine and has started providing some batched routines (e.g., GEMM, triangular solve [TRSM], LU, QR, and inversion) for fixed sizes in its cuBLAS library. Nervana [26], one of the pioneers of deep learning, demonstrated the critical need for batched matrix computation kernels in high-performance deep learning software. Intel has also provided batched GEMM and batched TRSM routines for fixed matrix sizes.

For the block mass-weighted Hessian in molecular dynamics simulations [36], computing the eigendecomposition reduces to computing the eigendecompositions of many small, independent matrices, which can be viewed as a batched eigendecomposition. In addition to the packages cited above, we note that in the GROMACS package [2], because the particle-particle and mesh interactions are independent, batched computation can be used to speed up and overlap the expensive global communication in the particle-mesh Ewald (PME) method. Also, in [31], the authors noted the need for optimized and hardware-aware basic linear algebra subprograms (BLAS) to perform many independent computations inside the GROMACS MD simulation.
The approach behind Flash Principal Component Analysis (FlashPCA) performs a large number of eigendecompositions across many samples. Also, in combustion and astrophysics supernova applications [7], [8], [18], [24], [33], the study of thermonuclear reaction networks (the XNet package) requires the solution of many sparse linear systems of size around 150 × 150. Furthermore, the need for batched routines can be illustrated in radar signal processing [5], where a batch of 200 × 200 QR decompositions is needed, as well as in hydrodynamic simulations [11], where thousands of matrix-matrix and matrix-vector (GEMV) products on matrices of size around 100 × 100 are needed.

Although the brief introduction above shows some viable approaches, it mostly highlights the keen awareness of the need for batched libraries that can enable small matrix applications to finally start exploiting the full power of current and future hybrid platforms. Some vendors have started to provide batched functionality in their numerical libraries (e.g., NVIDIA's CUBLAS and Intel's Math Kernel Library [MKL]). Additionally, open-source libraries from the HPC community (e.g., the Matrix Algebra on GPU and Multicore Architectures [MAGMA] library [37]) have also started to deliver batched routines [12], [13].

2 CONTRIBUTIONS

This work develops batched kernels, and a methodology for designing them, that harness the full power of the GPU and achieve performance close to the theoretical peak of the hardware. Our primary contributions to this end are listed below.

• In addition to efficient implementations that exhibit good speedups, we provide a detailed analysis of optimization techniques. We also present a collection of best practices to help users understand and develop batched computation kernels in a simple and efficient fashion.
• We propose a methodology for designing a performance model that provides insight into the performance spectrum of batched kernels. The main advantage of this model is that it helps predict the achievable performance of the kernels with increased accuracy.
• We investigate a model that simplifies the autotuning process by considering hardware information and representative experiments that enable us to considerably reduce the configuration/parametrization search space. We expect this contribution to drastically reduce the significant autotuning time required by complex GPU kernels.
• We propose a modular design that relies on standard data formats and interfaces to enable a straightforward connection with mainstream application code.

3 RELATED WORK

At the time of writing, there is no standard specification for the computation of batched small linear algebra problems. Consequently, the Linear Algebra PACKage (LAPACK), which is the de facto standard library for linear algebra problems, does not provide such functionality. However, at the request of users, NVIDIA added a few batched kernels to CUBLAS 6.5 [28]. These additions include batched versions of both BLAS and LAPACK routines. For the BLAS kernels, batched GEMM and batched TRSM have been released. More effort has been put into the LAPACK kernels, resulting in batched versions of the LU and QR factorizations, matrix inversion, and a least-squares solver. Similarly, in response to customer demand, Intel's MKL team released a batched GEMM, while—at the time of writing—AMD does not provide any batched operations. Vendor efforts on batched BLAS may have a tremendous impact and enhance the portability of HPC applications, much like optimized BLAS implementations have done for traditional dense linear algebra software.
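For reference, the fixed-size batched LU factorization that CUBLAS provides, and against which the kernels in this paper are compared, is invoked as in the following minimal sketch. The pointer-array setup is assumed to follow the earlier GEMM sketch, and the wrapper function name is illustrative.

#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Sketch of CUBLAS's fixed-size batched LU factorization. dPtrA is a
 * device array of `batch` pointers, each addressing an n x n
 * column-major matrix; each A_i is overwritten by its L_i/U_i factors. */
void lu_batched(cublasHandle_t handle, double **dPtrA, int n, int batch) {
    int *dPivots, *dInfo;
    cudaMalloc((void**)&dPivots, (size_t)n * batch * sizeof(int));
    cudaMalloc((void**)&dInfo, batch * sizeof(int));

    /* One call factors every matrix in the batch with partial pivoting. */
    cublasDgetrfBatched(handle, n, dPtrA, n, dPivots, dInfo, batch);

    /* dInfo[i] == 0 on success; a positive value flags a zero pivot
     * (a singular U_i) in matrix i. */
    cudaFree(dPivots);
    cudaFree(dInfo);
}

As with batched GEMM, the routine accepts only a single size n for the whole batch, which is one reason the variable-size case studied in this paper requires purpose-built kernels.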