clSPARSE: A Vendor-Optimized Open-Source Sparse BLAS Library

Joseph L. Greathouse†   Kent Knox†   Jakub Poła‡§   Kiran Varaganti†   Mayank Daga†
†Advanced Micro Devices, Inc.   ‡University of Wrocław   §Vratis Ltd.
{Joseph.Greathouse, Kent.Knox, Kiran.Varaganti, Mayank.Daga}@amd.com   [email protected]

ABSTRACT

Sparse linear algebra is a cornerstone of modern computational science. These algorithms ignore the zero-valued entries found in many domains in order to work on much larger problems at much faster rates than dense algorithms. Nonetheless, optimizing these algorithms is not straightforward. Highly optimized algorithms for multiplying a sparse matrix by a dense vector, for instance, are the subject of a vast corpus of research and can be hundreds of times longer than naïve implementations. Optimized sparse linear algebra libraries are thus needed so that users can build applications without enormous effort.

Hardware vendors release proprietary libraries that are highly optimized for their devices, but they limit interoperability and promote vendor lock-in. Open libraries often work across multiple devices and can quickly take advantage of new innovations, but they may not reach peak performance. The goal of this work is to provide a sparse linear algebra library that offers both of these advantages.

We thus describe clSPARSE, a permissively licensed open-source sparse linear algebra library that offers state-of-the-art optimized algorithms implemented in OpenCL™. We test clSPARSE on GPUs from AMD and Nvidia and show performance benefits over both the proprietary cuSPARSE library and the open-source ViennaCL library.

CCS Concepts

• Mathematics of computing → Computations on matrices; Mathematical software performance;
• Computing methodologies → Linear algebra algorithms; Graphics processors;
• Software and its engineering → Software libraries and repositories;

Keywords

clSPARSE; OpenCL; Sparse Linear Algebra; GPGPU

1. INTRODUCTION

Modern high-performance computing systems, including those used for physics [1] and engineering [5], rely on sparse data containing many zero values. Sparse data is best serviced by algorithms that store and rely only on the non-zero values. Such algorithms enable faster solutions by skipping many needless calculations and allow larger problems to fit into memory. Towards this end, the Basic Linear Algebra Subprograms (BLAS) specification dedicates a chapter to sparse linear algebra [3].

The description of these sparse algorithms is straightforward. While a simple serial implementation of sparse matrix-dense vector multiplication (SpMV) can be described in six lines of pseudocode [6], highly optimized implementations require a great deal more effort. Langr and Tvrdík cite over fifty SpMV implementations from the recent literature, each ranging from hundreds to thousands of lines of code [8]. Su and Keutzer found that the best algorithm depended both on the input and on the target hardware [12].
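For reference, a serial CSR SpMV of the kind cited above can be sketched in a handful of lines of C. This is our own minimal illustration (the array names are ours), not code from [6] or from any of the libraries discussed here:

#include <stddef.h>

/* Minimal serial SpMV, y = A*x, with A stored in CSR form: the
 * non-zeros of row i live in vals[row_ptr[i] .. row_ptr[i+1]-1],
 * and col_idx holds the column index of each stored value. */
void csr_spmv(size_t num_rows, const int *row_ptr, const int *col_idx,
              const float *vals, const float *x, float *y)
{
    for (size_t i = 0; i < num_rows; ++i) {
        float sum = 0.0f;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += vals[j] * x[col_idx[j]];
        y[i] = sum;
    }
}

The gap between this loop nest and a fast GPU implementation (load balancing across rows of wildly varying length, memory coalescing, scratchpad use) is precisely what the optimized libraries below address.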
Good sparse BLAS implementations must thus understand not just the data structures and algorithms but also the underlying hardware. Towards this end, hardware vendors offer optimized sparse BLAS libraries, such as Nvidia's cuSPARSE. These libraries are proprietary and offer limited or no support for hardware from other vendors. Their closed-source nature prevents even minor code or data structure modifications that would result in faster or more maintainable code. In addition, the speed of advancement for these libraries can be limited because the community has no control over their code. Few (if any) of the algorithms described by Langr and Tvrdík made their way into closed-source libraries, and researchers do not know what optimizations or algorithms the vendors use. While proprietary, vendor-optimized libraries offer good performance for a small selection of hardware, they limit innovation and promote vendor lock-in.

Existing open-source libraries, such as ViennaCL [10], help alleviate these issues. ViennaCL offers support for devices from multiple vendors, demonstrating the benefits of open-source software in preventing vendor lock-in [11]. As an example of the benefits of open-source innovation, ViennaCL added support for the new CSR-Adaptive algorithm within days of its publication [4], while additionally adding the new SELL-C-σ algorithm [7]. Nonetheless, as a library that must sometimes trade performance for portability, ViennaCL can sometimes leave performance on the table [2]. There are potential performance benefits to vendor-optimized libraries, including the ability to preemptively add optimizations for hardware that is not yet released.

In summary, existing sparse BLAS libraries offer either limited closed-source support with high performance or broad open-source support that is not necessarily completely optimized. This work sets out to solve these problems. We describe and test clSPARSE, a vendor-optimized, permissively licensed, open-source sparse BLAS library built using OpenCL™. clSPARSE was built in a collaboration between Advanced Micro Devices, Inc. and Vratis Ltd., and it is available at: http://github.com/clMathLibraries/clSPARSE/

Table 1: Test Platforms

             AMD                      Nvidia
GPU          Radeon™ Fury X           GeForce GTX TITAN X
CPU          Intel Core i5-4690K      Intel Core i7-5960X
DRAM         16 GB Dual-channel       64 GB Quad-channel
             DDR3-2133                DDR4-2133
OS           Ubuntu 14.04.4 LTS       Ubuntu 14.04.4 LTS
Driver       fglrx 15.302             352.63
SDK          AMD APP SDK 3.0          CUDA 7.5
Library      ViennaCL v1.7.1          cuSPARSE v7.5
             clSPARSE v0.11           clSPARSE v0.11

2. CLSPARSE

As part of AMD's GPUOpen initiative to bring strong open-source library support to GPUs, we recently released clSPARSE, an OpenCL™ sparse BLAS library optimized for GPUs. Like the clBLAS, clFFT, and clRNG libraries, clSPARSE is permissively licensed, works on GPUs from multiple vendors, and includes numerous performance optimizations, both from engineers within AMD and elsewhere. clSPARSE is built to work on both Linux® and Microsoft Windows® operating systems.

clSPARSE started as a collection of OpenCL routines developed by AMD and Vratis Ltd. to perform format conversions, level 1 vector operations, general SpMV, and solvers such as conjugate gradient and biconjugate gradient. Further development led to the integration of AMD's CSR-Adaptive algorithm for SpMV [4, 2], a CSR-Adaptive-based sparse matrix-dense matrix multiplication algorithm, and a high-performance sparse matrix-sparse matrix multiplication developed by the University of Copenhagen [9].
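To give a sense of what such optimizations involve, the following C sketch is our own simplified reconstruction of the host-side row-blocking pass described in the CSR-Adaptive publications [4, 2]: consecutive rows are greedily grouped so that each group's non-zeros fit a fixed scratchpad budget, and any single row that exceeds the budget gets a block of its own for a long-row kernel. This illustrates the published idea under our assumptions (the names and the budget are ours); it is not the library's actual code.

#include <stddef.h>

/* Group consecutive CSR rows into blocks whose combined non-zeros fit
 * an on-chip scratchpad budget (e.g., 1024 values). The device can
 * then run a streaming kernel on multi-row blocks and a vector-style
 * kernel on single-row blocks that exceed the budget. row_blocks must
 * hold up to num_rows + 1 entries; returns the number written. */
size_t compute_row_blocks(const int *row_ptr, size_t num_rows,
                          int *row_blocks, int nnz_budget)
{
    size_t nblocks = 0;
    int nnz_sum = 0;
    row_blocks[nblocks++] = 0;
    for (size_t i = 0; i < num_rows; ++i) {
        nnz_sum += row_ptr[i + 1] - row_ptr[i];
        if (nnz_sum == nnz_budget) {
            row_blocks[nblocks++] = (int)(i + 1); /* exactly full */
            nnz_sum = 0;
        } else if (nnz_sum > nnz_budget) {
            if ((int)(i + 1) - row_blocks[nblocks - 1] > 1) {
                /* Overflowed with more than one row gathered: close
                 * the block before row i and reconsider row i afresh. */
                row_blocks[nblocks++] = (int)i;
                i--;
            } else {
                /* A single row larger than the budget becomes its own
                 * block, destined for the long-row kernel. */
                row_blocks[nblocks++] = (int)(i + 1);
            }
            nnz_sum = 0;
        }
    }
    if (row_blocks[nblocks - 1] != (int)num_rows)
        row_blocks[nblocks++] = (int)num_rows;
    return nblocks;
}

Because this pass reads only the row pointers, it runs once per matrix and its cost is amortized across the repeated SpMV calls typical of iterative solvers.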
clSPARSE has numerous benefits over proprietary libraries such as Nvidia's cuSPARSE. First and foremost, it works across devices from multiple vendors. As we will demonstrate in Section 3, clSPARSE achieves high performance on GPUs from both AMD and Nvidia, and the algorithms are portable to other OpenCL devices. In addition, the algorithms in clSPARSE, due to their open-source nature, are visible to users and easy to change. Even if users do not directly link against clSPARSE, these implementations can be directly used and modified by application developers. Such problem-specific kernel modifications have proven beneficial for techniques such as kernel fusion [11].

Compared to other open-source libraries, such as ViennaCL, clSPARSE offers significant performance benefits. For example, while ViennaCL uses an early version of the CSR-Adaptive algorithm for SpMV, the newer version used in clSPARSE performs better across a wider range of matrices [2]. We demonstrate this in Section 3.

If performance benefits alone do not outweigh application porting costs, the algorithms in clSPARSE could instead be directly brought into libraries like ViennaCL due to its permissive open-source licensing. Other libraries, such as ViennaCL and SciPy, are more general and include algorithms beyond sparse linear algebra. We feel that clSPARSE is a more focused effort dedicated to high-performance sparse linear algebra; it could easily serve as the basis for sparse BLAS in these libraries. In addition, unlike ViennaCL, clSPARSE includes a C interface to ease linking with complex C and FORTRAN software.

3. PERFORMANCE COMPARISON

This section shows performance comparisons of our library against both a popular proprietary library, cuSPARSE, and a popular open-source library, ViennaCL. Our test platforms are described in Table 1. We test ViennaCL and clSPARSE on an AMD Radeon™ Fury X GPU, currently AMD's top-of-the-line consumer device. To show that clSPARSE works well on GPUs from multiple vendors, we also test an Nvidia GeForce GTX TITAN X, currently Nvidia's top-of-the-line consumer device. Due to space constraints, we only test this GPU on clSPARSE and cuSPARSE.

The inputs used in our tests are shown in Tables 2 and 3. Some of the matrices include explicit zero values from the input files. We measure runtime by taking a timestamp before calling the library's API, waiting for the work to finish, and then taking another timestamp. We execute each kernel 1000 times on each input and take an average of the runtime. GFLOPs are calculated using the widely accepted metric of estimating 2 FLOPs for every output value.
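As a concrete illustration of this methodology, the C sketch below shows one way such a measurement loop can be written. The run_spmv() wrapper is a hypothetical stand-in for whichever library call is under test (it is not a clSPARSE or cuSPARSE API), and clFinish() is what ensures the device work has completed before the second timestamp is taken:

#include <time.h>
#include <CL/cl.h>

/* Hypothetical wrapper around the library call being measured, and
 * the command queue it enqueues work on; both defined elsewhere. */
extern void run_spmv(void);
extern cl_command_queue queue;

/* Average the wall-clock runtime of ITERS kernel executions, waiting
 * for the device to finish before taking the second timestamp. */
double average_runtime_sec(void)
{
    enum { ITERS = 1000 };
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERS; ++i)
        run_spmv();                 /* enqueue the kernel(s) */
    clFinish(queue);                /* wait for all work to finish */
    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) * 1e-9;
    return elapsed / ITERS;
}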
Figure 1 shows the performance benefit of the clSPARSE SpMV compared to the cuSPARSE csrmv() algorithm. This data shows that clSPARSE on the AMD GPU yields an average performance increase of 5.4× over cuSPARSE running on the Nvidia GPU. While this demonstrates the benefits of AMD-optimized software on AMD hardware, it is also interesting to note that clSPARSE results in a performance increase of 4.5× on the Nvidia GPU when compared against the proprietary Nvidia library. This shows that clSPARSE is beneficial across vendors.

Figure 2 illustrates the performance of the SpMV algorithm in clSPARSE versus the ViennaCL SpMV algorithm on the AMD GPU. It is worth noting that ViennaCL uses an earlier version of the CSR-Adaptive algorithm used in clSPARSE. However, clSPARSE includes both new algorithmic changes [2] and numerous minor performance tuning changes. Through these changes, clSPARSE is able to achieve 2.5× higher performance than ViennaCL.

Figures 3 and 4 show similar comparisons between libraries and cards, but using sparse matrix-sparse matrix multiplication (SpMSpM) algorithms. These results show that, while cuSPARSE yields better SpMSpM performance for some input matrices (like Protein), the average performance between clSPARSE and cuSPARSE is comparable because of inputs like amazon0312. In addition, we see that clSPARSE is 27% faster than the open-source ViennaCL.

These results demonstrate that the performance of clSPARSE compares favorably to proprietary libraries while maintaining the benefits of open-source libraries. Contributions to clSPARSE from the community are welcome, as are feedback, feature requests, and bug reports.

[Chart: SP GFLOPs; series: cuSPARSE-NV, clSPARSE-NV, clSPARSE-AMD.]
Figure 1: Performance comparison of single-precision SpMV on an Nvidia GeForce GTX TITAN X GPU (using both cuSPARSE and clSPARSE) and an AMD Radeon™ Fury X GPU (using clSPARSE).

[Chart: SP GFLOPs; series: ViennaCL-AMD, clSPARSE-AMD.]

Figure 2: Performance comparison of single-precision SpMV between the open source ViennaCL library and clSPARSE on an AMD Radeon™ Fury X GPU.

[Chart: SP GFLOPs; series: cuSPARSE-NV, clSPARSE-NV, clSPARSE-AMD.]

Figure 3: Performance comparison of single-precision SpMSpM on an Nvidia GeForce GTX TITAN X GPU (using both cuSPARSE and clSPARSE) and an AMD Radeon™ Fury X GPU (using clSPARSE).

[Chart: SP GFLOPs; series: ViennaCL-AMD, clSPARSE-AMD.]

Figure 4: Performance comparison of single-precision SpMSpM between the open source ViennaCL library and clSPARSE on an AMD Radeon™ Fury X GPU.

Table 2: Matrices used for the SpMV evaluation

Name               Size               NNZ
m133-b3            200K * 200K        800,800
Circuit            171K * 171K        958,936
FEM/Accelerator    121K * 121K        2,624,331
cit-Patents        3,774K * 3,774K    16,518,948
web-Google         916K * 916K        5,105,039
wiki-Vote          8,297 * 8,297      103,689
majorbasis         160K * 160K        1,750,416
FEM/Harbor         47K * 47K          2,374,001
Webbase            1,000K * 1,000K    3,105,536
email-Enron        37K * 37K          367,662
poisson3Da         13.5K * 13.5K      352,762
mario002           390K * 390K        2,101,242
Protein            36K * 36K          4,344,765
WindTunnel         218K * 218K        11,634,424
ca-CondMat         23K * 23K          186,936
2cubes_sphere      101K * 101K        1,647,264
Economics          207K * 207K        1,273,389
filter3D           106K * 106K        2,707,179
FEM/Ship           141K * 141K        7,813,404
hood               221K * 221K        10,768,436
cage12             130K * 130K        2,032,536
offshore           260K * 260K        4,242,673
FEM/Cantilever     62K * 62K          4,007,383
Epidemiology       526K * 526K        2,100,225
roadNet-CA         1,971K * 1,971K    5,533,214
FEM/Spheres        83K * 83K          6,010,480
amazon0312         401K * 401K        3,200,440

Table 3: Matrices used for the SpMSpM evaluation

Name               Size                 NNZ       NNZ/row   Max NNZ/row
Dense8             8K * 8K              8 M       8,000     8,000
Protein            36K * 36K            4.3 M     119       204
FEM/Spheres        83K * 83K            6.0 M     72        81
FEM/Cantilever     62K * 62K            4.0 M     64        78
Wind Tunnel        218K * 218K          11.6 M    53        180
FEM/Harbor         47K * 47K            2.4 M     51        145
FEM/Ship           141K * 141K          7.8 M     55        102
Economics          207K * 207K          1.3 M     6         44
Epidemiology       526K * 526K          2.1 M     4         4
FEM/Accelerator    121K * 121K          2.6 M     22        81
Circuit            171K * 171K          0.96 M    6         353
Webbase            1,000K * 1,000K      3.1 M     3         4.7 K
LP                 4K * 1,097K          11.3 M    2,634     56 K
circuit5M          5,558K * 5,558K      59.5 M    11        1.29 M
eu-2005            863K * 863K          19.2 M    22        4.2 K
Ga41As41H72        268K * 268K          18.4 M    69        702
in-2004            1,383K * 1,383K      16.9 M    12        7.8 K
mip1               66K * 66K            10.4 M    156       66 K
Si41Ge41H72        186K * 186K          15.0 M    81        662
ASIC_680k          683K * 683K          3.9 M     6         395 K
dc2                117K * 117K          0.77 M    7         114 K
FullChip           2,897K * 2,897K      26.6 M    9         2.3 M
ins2               309K * 309K          2.8 M     9         309 K
bone010            986K * 986K          47.9 M    48        81
crankseg_2         121K * 121K          2.6 M     22        3.4 K
ldoor              952K * 952K          42.4 M    45        77
rajat31            4,690K * 4,690K      20.3 M    4         1.3 K
Rucci1             1,978K * 109K        7.8 M     4         5
boyd2              466K * 466K          1.5 M     3         93 K
sls                1,748K * 63K         6.8 M     4         5
transient          179K * 179K          0.96 M    5         60 K

AMD, the AMD Arrow logo, AMD Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. OpenCL is a trademark of Apple, Inc. used by permission by Khronos. Windows is a registered trademark of Microsoft Corporation. Linux is a registered trademark of Linus Torvalds. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

4. REFERENCES

[1] H. M. Aktulga, A. Buluç, S. Williams, and C. Yang. Optimizing Sparse Matrix-Multiple Vectors Multiplication for Nuclear Configuration Interaction Calculations. In Proc. of the Int'l Parallel and Distributed Processing Symposium (IPDPS), 2014.
[2] M. Daga and J. L. Greathouse. Structural Agnostic SpMV: Adapting CSR-Adaptive for Irregular Matrices. In Proc. of the Int'l Conf. on High Performance Computing (HiPC), 2015.
[3] I. S. Duff, M. A. Heroux, and R. Pozo. An Overview of the Sparse Basic Linear Algebra Subprograms: The New Standard from the BLAS Technical Forum. ACM Trans. on Mathematical Software, 28(2):239–267, 2002.
[4] J. L. Greathouse and M. Daga. Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format. In Proc. of the Int'l Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2014.
[5] W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith. Towards Realistic Performance Bounds for Implicit CFD Codes. In Proc. of the Int'l Parallel Computational Fluid Dynamics Conf. (PARCFD), 1999.
[6] A. Kaiser, S. Williams, K. Madduri, K. Ibrahim, D. H. Bailey, J. W. Demmel, and E. Strohmaier. TORCH Computational Reference Kernels: A Testbed for Computer Science Research. Technical Report LBNL-4172E, Lawrence Berkeley National Laboratory, 2010.
[7] M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop. A Unified Sparse Matrix Data Format for Modern Processors with Wide SIMD Units. SIAM Journal on Scientific Computing, 36(5):C401–C423, 2014.
[8] D. Langr and P. Tvrdík. Evaluation Criteria for Sparse Matrix Storage Formats. IEEE Trans. on Parallel and Distributed Systems, 27(2):428–440, Feb. 2016.
[9] W. Liu and B. Vinter. An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data. In Proc. of the Int'l Parallel and Distributed Processing Symp. (IPDPS), 2014.
[10] K. Rupp, F. Rudolf, and J. Weinbub. ViennaCL - A High Level Linear Algebra Library for GPUs and Multi-Core CPUs. In Int'l Workshop on GPUs and Scientific Applications (GPUScA), 2010.
[11] K. Rupp, J. Weinbub, A. Jüngel, and T. Grasser. Pipelined Iterative Solvers with Kernel Fusion for Graphics Processing Units. CoRR, abs/1410.4054, 2014.
[12] B.-Y. Su and K. Keutzer. clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs. In Proc. of the Int'l Conf. on Supercomputing (ICS), 2012.