DD2358 – Introduction to HPC Linear Algebra Libraries & BLAS

Stefano Markidis, KTH Royal Institute of Technology
2021-02-22

After this lecture, you will be able to
• Understand the importance of numerical libraries in HPC
• List a series of key numerical libraries, including BLAS
• Describe which kinds of operations BLAS supports
• Experiment with OpenBLAS and perform a matrix-matrix multiply using BLAS

Numerical Libraries are the Foundation for Application Development
• While HPC applications are used in a wide variety of very different disciplines, their underlying computational algorithms are very similar to one another.
• Application developers do not have to waste time redeveloping supercomputing software that has already been developed elsewhere.
• Libraries targeting numerical linear algebra operations are the most common, given the ubiquity of linear algebra in scientific computing algorithms.

Libraries are Tuned for Performance
• Numerical libraries have been highly tuned for performance, often for more than a decade
  – This makes it difficult for an application developer to match a library's performance with a homemade equivalent.
• Libraries are relatively easy to use and deliver highly tuned performance across a wide range of HPC platforms
  – As a result, the use of scientific computing libraries as software dependencies in computational science applications has become widespread.

HPC Community Standards
• Apart from acting as a repository for software reuse, libraries serve the important role of providing a knowledge base for specific computational science domains.
• These libraries become community standards and serve as ways for members of the community to communicate with one another.
  – Words like dgemm and saxpy are standard terminology in the HPC community

Numerical Libraries in HPC – An Overview
[Figure: numerical libraries in HPC and their dependencies]
• Blue → Core linear algebra libraries
• Red → Small sample of widely used application frameworks with dependencies on these libraries
• Green → Dependencies

BASIC LINEAR ALGEBRA SUBPROGRAMS – BLAS
The most important HPC library
• BLAS provides a standard interface to vector, matrix-vector, and matrix-matrix routines that have been optimized for various computer architectures.
• Implementations include a reference implementation, which provides both Fortran 77 and C interfaces, and the Automatically Tuned Linear Algebra Software (ATLAS) project
  – There are multiple vendor-provided BLAS libraries optimized for their respective hardware.
    > Intel Math Kernel Library (MKL) BLAS is among the best known
  – GotoBLAS → OpenBLAS
  – The Boost libraries provide a C++ template class library with BLAS functionality called uBLAS.
  – For GPUs, NVIDIA provides cuBLAS

A Little Bit of BLAS History
• BLAS design and implementation was handled by Charles Lawson, Richard Hanson, F. Krogh, D.R. Kincaid, and Jack Dongarra, beginning in the 1970s.
• The genesis of the idea for BLAS is credited to Lawson and Hanson while they were working at NASA's Jet Propulsion Laboratory.
• BLAS development coincided with development of the Linpack package
  – Linpack was the first major package to incorporate the BLAS library.
https://dl.acm.org/doi/10.1145/355841.355847

BLAS Level 1
• The first BLAS routines developed were limited to vector operations, including inner products, norms, adding vectors, and scalar multiplication. An example of such operations:
    y ← αx + y
• where x and y are vectors and α is a scalar value.
• These vector-vector operations are referred to as BLAS Level 1.

BLAS - Level 2 and 3
• In 1987, about 10 years after BLAS Level 1 was released, routines for matrix-vector operations became available, followed by matrix-matrix operations in 1989.
  These later additions are the Level 2 (matrix-vector) and Level 3 (matrix-matrix) BLAS operations:
    Level 2: y ← αAx + βy
    Level 3: C ← αAB + βC
• Here x and y are vectors, α and β are scalars, and A, B, and C are matrices.

BLAS Naming Convention
• Each routine in BLAS has a specific naming convention that specifies the precision of the operation, the type of matrix (if any) involved, and the operation to perform.
• BLAS is natively written in Fortran 77, but C bindings to BLAS are available via CBLAS.
• For BLAS Level 1 operations there is no matrix involved, so the matrix-type part of the name is omitted.
• In CBLAS, each routine name begins with cblas_
• After cblas_, a precision prefix is placed before the operation name.

BLAS Precision
[Table: precision prefixes used in BLAS routine names]

BLAS Level 1
BLAS Level 1 operations can be subdivided into four different subgroups:
1. vector rotations
2. vector operations without a dot product
3. vector operations with a dot product
4. vector norms

BLAS Level 1 - Vector Operations Without a Dot Product
[Table: Level 1 vector operations without a dot product]
What is saxpy?

BLAS Level 2 and 3
• BLAS Level 2 and Level 3 operations involve matrices and indicate the type of matrix they support in their name.
• Level 2 and 3 names are of the form cblas_pmmoo, where
  – p indicates the precision
  – mm indicates the matrix type
  – oo indicates the operation.
• Apart from general matrices, all other matrix types come in three storage scheme flavors: dense (default), banded (indicated by a "b" in the name), and packed (indicated by a "p" in the name).
• Dense storage schemes use either row-based or column-based storage in a contiguous memory array.

Kind of Matrices and Operations Supported
[Tables: kind of matrix; kind of operation]
What is the dgemm operation?

dgemm
• As an example, the name of the BLAS Level 3 routine cblas_dgemm indicates that this routine performs a double-precision (d) dense matrix-matrix multiplication (mm) on general matrices (ge).
• DGEMM is also the name of the matrix-matrix multiplication benchmark in the HPC Challenge.
  https://icl.utk.edu/hpcc/

Arguments of cblas_dgemm:
• Order indicates the storage layout as either row major or column major. This input is either CblasRowMajor or CblasColMajor.
• TransA indicates whether to transpose matrix A. This input is either CblasNoTrans, CblasTrans, or CblasConjTrans, indicating no transpose, transpose, or complex conjugate transpose, respectively. TransB indicates the same for matrix B.
• M indicates the number of rows in matrices A and C. N indicates the number of columns in matrices B and C. K indicates the number of columns in matrix A and the number of rows in matrix B.
• alpha is the scaling factor for A*B. beta is the scaling factor for matrix C.
• lda is the size of the first (leading) dimension of matrix A. ldb is the size of the first dimension of matrix B. ldc is the size of the first dimension of matrix C.
• A is the pointer to matrix A data. B is the pointer to matrix B data. C is the pointer to matrix C data.

Installing OpenBLAS
• https://www.openblas.net/
• Available both as prebuilt binaries and as source to build
  – https://github.com/xianyi/OpenBLAS

An Example Code
[Code listing: blas_example.c]

Compiling
gcc -I/usr/local/opt/openblas/include -L/usr/local/opt/openblas/lib blas_example.c -o blas_example -lopenblas
