SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library

Mark Gates, Jakub Kurzak, Ali Charara, Asim YarKhan, and Jack Dongarra
University of Tennessee, Knoxville, TN
(Jack Dongarra is also affiliated with Oak Ridge National Laboratory and the University of Manchester.)

ABSTRACT
The SLATE (Software for Linear Algebra Targeting Exascale) library is being developed to provide fundamental dense linear algebra capabilities for current and upcoming distributed high-performance systems, both accelerated CPU–GPU based and CPU based. SLATE will provide coverage of existing ScaLAPACK functionality, including the parallel BLAS; linear systems using LU and Cholesky; least squares problems using QR; and eigenvalue and singular value problems. In this respect, it will serve as a replacement for ScaLAPACK, which after two decades of operation, cannot adequately be retrofitted for modern accelerated architectures. SLATE uses modern techniques such as communication-avoiding algorithms, lookahead panels to overlap communication and computation, and task-based scheduling, along with a modern C++ framework. Here we present the design of SLATE and initial reports of several of its components.

CCS CONCEPTS
• Mathematics of computing → Computations on matrices; • Computing methodologies → Massively parallel algorithms; Shared memory algorithms;

KEYWORDS
dense linear algebra, distributed computing, GPU computing

ACM Reference Format:
Mark Gates, Jakub Kurzak, Ali Charara, Asim YarKhan, and Jack Dongarra. 2019. SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library. In The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '19), November 17–22, 2019, Denver, CO, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3295500.3356223

1 INTRODUCTION
The objective of SLATE is to provide fundamental dense linear algebra capabilities for the high-performance computing (HPC) community at large. The ultimate objective of SLATE is to replace ScaLAPACK [7], which has become the industry standard for dense linear algebra operations in distributed memory environments. After two decades of operation, ScaLAPACK is past the end of its life cycle and overdue for a replacement, as it can hardly be retrofitted to support hardware accelerators, which are an integral part of today's HPC hardware infrastructure. In the past two decades, HPC has witnessed tectonic shifts in hardware technology, software technology, and a plethora of algorithmic innovations in scientific computing that are not reflected in ScaLAPACK. At the same time, while several libraries provide functionality similar to ScaLAPACK, no replacement for ScaLAPACK has emerged to gain widespread adoption in applications. SLATE aims to be this replacement, boasting superior performance and scalability in modern, heterogeneous, distributed-memory HPC environments.

Primarily, SLATE aims to extract the full performance potential and maximum scalability from modern, many-node HPC machines with large numbers of cores and multiple hardware accelerators per node. For typical dense linear algebra workloads, this means getting close to the theoretical peak performance and scaling to the full size of the machine (i.e., thousands to tens of thousands of nodes).
SLATE currently implements parallel BLAS (PBLAS); linear systems using partially pivoted LU, Cholesky, and block Aasen's for symmetric indefinite systems [6]; mixed-precision iterative refinement; least squares using communication-avoiding QR [15]; and matrix inversion. We are actively developing communication-avoiding two-stage symmetric eigenvalue and singular value decompositions. Future work will include non-symmetric eigenvalue problems and randomized algorithms.

1.1 Motivation
In the United States, the plan for achieving exascale relies heavily on the use of GPU-accelerated machines, similar to the Summit and Sierra systems at Oak Ridge National Laboratory (ORNL) and Lawrence Livermore National Laboratory (LLNL), respectively, which currently occupy positions #1 and #2 on the TOP500 list. The urgency of developing multi-GPU accelerated, distributed-memory software is underscored by the architectures of these systems. The Summit system contains three NVIDIA V100 GPUs per POWER9 CPU. The peak double-precision floating-point performance of the CPUs is 22 cores × 24.56 GFLOP/s = 0.54 TFLOP/s. The peak performance of the GPUs is 3 devices × 7.8 TFLOP/s = 23.4 TFLOP/s. That is, 97.7% of the performance is on the GPU side, and only 2.3% is on the CPU side. This means that, for all practical purposes, the CPUs can be ignored as floating-point hardware, with their main purpose being to keep the GPUs busy.
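For concreteness, the 97.7%/2.3% split quoted above follows directly from the peak numbers in the text; this is only a back-of-the-envelope check, where P_CPU and P_GPU are labels introduced here for the quoted peaks of one Summit POWER9 CPU and its three attached V100 GPUs:

\[
P_{\mathrm{CPU}} = 22 \times 24.56~\mathrm{GFLOP/s} \approx 0.54~\mathrm{TFLOP/s}, \qquad
P_{\mathrm{GPU}} = 3 \times 7.8~\mathrm{TFLOP/s} = 23.4~\mathrm{TFLOP/s},
\]
\[
\frac{P_{\mathrm{GPU}}}{P_{\mathrm{CPU}} + P_{\mathrm{GPU}}} = \frac{23.4}{23.94} \approx 97.7\%, \qquad
\frac{P_{\mathrm{CPU}}}{P_{\mathrm{CPU}} + P_{\mathrm{GPU}}} = \frac{0.54}{23.94} \approx 2.3\%.
\]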
1.2 Related work
Numerous libraries provide dense linear algebra routines. Since its initial release nearly 30 years ago, LAPACK [5] has become the de facto standard library for dense linear algebra on a single node. It leverages vendor-optimized BLAS for node-level performance, including shared-memory parallelism. ScaLAPACK [7] built upon LAPACK by extending its routines to distributed computing, relying on both the Parallel BLAS (PBLAS) and explicit distributed-memory parallelism. Some attempts [18, 19] have been made to adapt the ScaLAPACK library for accelerators, but these efforts have shown the need for a new framework.

More recently, the DPLASMA [9] and Chameleon [2, 3] libraries both build a task dependency graph and launch tasks as their dependencies are fulfilled. This eliminates the artificial synchronizations inherent in ScaLAPACK's design and allows for overlap of communication and computation. DPLASMA relies on the PaRSEC runtime to schedule tasks, while Chameleon uses the StarPU runtime. If the dataflow is known at compile time, an algorithm can be expressed using the efficient and scalable Parameterized Task Graph (PTG) interface. Algorithms that make runtime decisions based on data, such as partial pivoting in LU, are easier to express using the less efficient but more intuitive Insert Task interface, similar to OpenMP tasks. Both DPLASMA and Chameleon are accelerated on GPU-based architectures. Instead of leveraging C++ templates, they auto-generate all precisions.

Communication-avoiding algorithms have been derived for QR and LU factorization [15], as well as for matrix-matrix multiplication [14, 42]. There are also numerous node-level linear algebra libraries. Eigen [32] and Armadillo [41] use C++ templates extensively to make precision-independent code. MAGMA [16] provides accelerated linear algebra, with extensive coverage of LAPACK's functionality. ViennaCL [40] provides accelerated linear algebra, but lacks most dense linear algebra aside from BLAS and no-pivoting LU. NVIDIA cuBLAS [38] and AMD rocBLAS [4] provide native BLAS operations on GPU accelerators. Vendor libraries such as Cray Scientific Libraries (LibSci) [13] and IBM Engineering and Scientific Subroutine Library (ESSL) [29] provide accelerated versions of BLAS and LAPACK for GPU accelerators. These accelerated node-level libraries may give some limited benefit to ScaLAPACK, which is built on top of BLAS and LAPACK. However, this benefit is limited by the bulk-synchronous, fork-join design of ScaLAPACK. The Intel Math Kernel Library (MKL) [31] provides BLAS, LAPACK, and ScaLAPACK for CPUs and Xeon Phi accelerators.

1.3 Goals
Ultimately, a distributed linear algebra library built from the ground up to support accelerators, and incorporating state-of-the-art parallel algorithms, will give the best performance. In this manuscript, we describe how we have designed SLATE to achieve this and how it fulfills the following design goals.
• Targets modern HPC hardware consisting of a large number of nodes with multicore processors and several hardware accelerators per node.
• Achieves portable high performance by relying on vendor-optimized standard BLAS, batch BLAS, and LAPACK, and on standard parallel programming technologies such as MPI and OpenMP. Using the OpenMP runtime puts less of a burden on applications to integrate SLATE than adopting a proprietary runtime would.
• Provides scalability by employing proven techniques in dense linear algebra, such as 2D block cyclic data distribution and communication-avoiding algorithms (a brief sketch of the block-cyclic mapping appears after this list), as well as modern parallel programming approaches, such as dynamic scheduling and communication overlapping.
• Facilitates productivity by relying on the intuitive Single Program, Multiple Data (SPMD) programming model and
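To make the 2D block cyclic data distribution mentioned in the scalability goal concrete, the following is a minimal, hypothetical sketch, not SLATE's or ScaLAPACK's actual API: the function name tile_owner, the 2 × 3 grid, and the row-major rank ordering are illustrative assumptions. It shows the standard rule that deals matrix tiles out cyclically over a p-by-q process grid.

    #include <cstdio>

    // Owner of tile (i, j) under a 2D block-cyclic layout: tiles are dealt out
    // cyclically over a p-by-q process grid, so tile (i, j) lands on grid position
    // (i mod p, j mod q). Row-major ordering of grid positions into ranks is assumed.
    int tile_owner(int i, int j, int p, int q) {
        return (i % p) * q + (j % q);
    }

    int main() {
        const int p = 2, q = 3;          // 6 processes viewed as a 2 x 3 grid
        const int nt = 4;                // a 4 x 4 grid of matrix tiles
        for (int i = 0; i < nt; ++i) {   // print the owning rank of every tile
            for (int j = 0; j < nt; ++j)
                std::printf("%2d ", tile_owner(i, j, p, q));
            std::printf("\n");
        }
        return 0;
    }

Because each process owns tiles scattered across the whole matrix, the work stays balanced even as a factorization's active submatrix shrinks, which is one reason this distribution scales well on large process counts.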