Introduction to Numerical Libraries for HPC

Bilel Hadri [email protected] Computational Scientist KAUST Supercomputing Lab

Numerical Libraries – Application Areas

• Most used libraries/software in HPC!
• Linear Algebra
  – Systems of equations
    • Direct, iterative, and multigrid solvers
    • Sparse and dense systems
  – Eigenvalue problems
  – Least squares
• Signal processing
  – FFT
• Numerical integration
• Random number generators

Numerical Libraries – Motivations

• Don't reinvent the wheel!
• Improve productivity!
• Get better performance!
  – Faster and better algorithms

Faster (Better Code)

• Achieving the best performance requires creating very processor- and system-specific code.
• Example: dense matrix-matrix multiply

– Simple to express:

    do i = 1, n
      do j = 1, n
        do k = 1, n
          c(i,j) = c(i,j) + a(i,k) * b(k,j)
        enddo
      enddo
    enddo

Performance

• How fast should this run?
  – Our matrix-matrix multiply algorithm has 2n^3 floating point operations
    • 3 nested loops, each with n iterations
    • 1 multiply, 1 add in each inner iteration
  – For n=100: 2×10^6 operations, about 1 ms on a 2 GHz processor
  – For n=1000: 2×10^9 operations, or about 1 s

• Reality:
  – n=100: 1.1 ms
  – n=1000: 6 s
  → The obvious expression of the algorithm is not transformed into leading performance (an optimized library call, sketched below, is the usual remedy).
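The entire triple loop can be replaced by a single call to an optimized BLAS routine. A minimal sketch, assuming double-precision arrays and a linked BLAS library (the loop computes C := C + A·B, which maps to DGEMM with alpha = beta = 1):

    ! Drop-in replacement for the triple loop: C := 1.0*A*B + 1.0*C
    ! ('N','N' = no transpose on A or B)
    call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 1.0d0, c, n)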

Numerical Libraries – Packages

• Linear Algebra
  – BLAS / LAPACK / ScaLAPACK / PLASMA / MAGMA / HiCMA
  – PETSc
  – HYPRE
  – TRILINOS
• Signal processing
  – FFTW
• Numerical integration
  – GSL
• Random number generators
  – SPRNG

Others

MUMPS 4.9.2. MUMPS (MUltifrontal Massively Parallel sparse direct Solver) is a package of parallel sparse direct linear system solvers based on a multifrontal algorithm. http://graal.ens-lyon.fr/MUMPS/

SuperLU 4.3. SuperLU is a sequential sparse direct solver (the sequential counterpart of SuperLU_DIST) and also provides a sequential incomplete-LU preconditioner that can accelerate the convergence of Krylov subspace iterative solvers. http://crd.lbl.gov/~xiaoye/SuperLU/

ParMETIS 4.0.2. ParMETIS (Parallel Graph Partitioning and Fill-reducing Matrix Ordering) is a library of routines that partition unstructured graphs and meshes and compute fill-reducing orderings of sparse matrices. http://glaros.dtc.umn.edu/gkhome/views/metis/

• SUNDIALS 2.5.0 (SUite of Nonlinear and DIfferential/Algebraic equation Solvers) consists of 5 solvers: CVODE, CVODES, IDA, IDAS, and KINSOL. In addition, SUNDIALS provides a MATLAB interface to CVODES, IDAS, and KINSOL that is called sundialsTB. https://computation.llnl.gov/casc/sundials/main.html

• Scotch 5.1.12b. Scotch is a software package and set of libraries for sequential and parallel graph partitioning, static mapping, block ordering, and sequential mesh and hypergraph partitioning. http://www.labri.fr/perso/pelegrin/scotch

Note: On Shaheen, they are all grouped into cray-tpsl.

• Freely Available Software for Linear Algebra http://www.netlib.org/utk/people/JackDongarra/la-sw.html

BLAS (Basic Linear Algebra Subprograms)

The BLAS functionality is divided into three levels:

• Level 1: contains vector operations of the form y ← αx + y, plus dot products, norms, etc.

• Level 2: contains matrix-vector operations of the form y ← αAx + βy

• Level 3: contains matrix-matrix operations of the form C ← αAB + βC

• Several implementations exist for different languages
  – Reference implementation: http://www.netlib.org/blas/

BLAS: naming conventions

• Each routine has a name which specifies the operation, the type of matrix involved, and the precision. Names are in the form PMMOO.
  – Each operation is defined for four precisions (P):
    • S: single real
    • D: double real
    • C: single complex
    • Z: double complex
  – The types of matrices are (MM):
    • GE: general, GB: general band
    • SY: symmetric, SB: symmetric band, SP: symmetric packed
    • HE: hermitian, HB: hermitian band, HP: hermitian packed
    • TR: triangular, TB: triangular band, TP: triangular packed
  – Some of the most common operations (OO):
    • DOT: scalar product, x^T y
    • AXPY: vector sum, αx + y
    • MV: matrix-vector product, Ax
    • SV: matrix-vector solve, inv(A) x
    • MM: matrix-matrix product, AB
    • SM: matrix-matrix solve, inv(A) B

• Examples:
  – SGEMM stands for "single-precision general matrix-matrix multiply".
  – DGEMM stands for "double-precision general matrix-matrix multiply".

BLAS Level 1 routines

• Vector operations (xROT, xSWAP, xCOPY etc.)
• Scalar dot products (xDOT etc.)
• Vector norms and index of the largest element (xNRM2, xASUM, IxAMAX etc.)

A minimal calling sketch follows.
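A minimal Level 1 sketch, assuming double precision and a linked BLAS library (DAXPY and the DDOT function):

    program blas1_demo
      implicit none
      integer, parameter :: n = 5
      real(8) :: x(n), y(n)
      real(8), external :: ddot        ! BLAS dot-product function
      x = 1.0d0
      y = 2.0d0
      call daxpy(n, 3.0d0, x, 1, y, 1) ! y := 3*x + y
      print *, 'y   =', y
      print *, 'x.y =', ddot(n, x, 1, y, 1)
    end program blas1_demo

Build with, e.g., gfortran blas1_demo.f90 -lblas (or ftn on a Cray system, where LibSci is linked by default).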

BLAS Level 2 routines

• Matrix-vector operations (xGEMV, xGBMV, xHEMV, xHBMV etc.)
• Rank-1 and rank-2 updates (xGER, xHER etc.)
• Solving Tx = y for x, where T is triangular (xTRSV etc.) (see the sketch below)
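A minimal Level 2 sketch, assuming double precision (DGEMV computes y := alpha*A*x + beta*y):

    program blas2_demo
      implicit none
      integer, parameter :: m = 4, n = 3
      real(8) :: a(m,n), x(n), y(m)
      call random_number(a)
      x = 1.0d0
      ! y := 1.0*A*x + 0.0*y  ('N' = no transpose)
      call dgemv('N', m, n, 1.0d0, a, m, x, 1, 0.0d0, y, 1)
      print *, 'y =', y
    end program blas2_demo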

BLAS Level 3 routines

• Matrix-matrix operations (xGEMM etc.)
• Triangular matrix-matrix multiply and solve (xTRMM, xTRSM)
• Widely used matrix-matrix multiplies (xSYMM, xGEMM) (see the sketch below)
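A minimal Level 3 sketch, assuming double precision: DTRSM solves a triangular system with many right-hand sides in a single call:

    program blas3_demo
      implicit none
      integer, parameter :: m = 3, n = 2
      real(8) :: a(m,m), b(m,n)
      integer :: i
      call random_number(a)
      do i = 1, m
        a(i,i) = a(i,i) + dble(m)  ! strengthen the diagonal so A is safely nonsingular
      end do
      b = 1.0d0
      ! Solve A*X = 1.0*B with A lower triangular; X overwrites B.
      ! Arguments: left side, lower triangle, no transpose, non-unit diagonal.
      call dtrsm('L', 'L', 'N', 'N', m, n, 1.0d0, a, m, b, m)
      print *, 'X =', b
    end program blas3_demo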

LAPACK (Linear Algebra PACKage)

• Linear Algebra PACKage
  – http://www.netlib.org/lapack/
  – Provides routines for:
    • solving systems of simultaneous linear equations,
    • least-squares solutions of linear systems of equations,
    • eigenvalue problems,
    • QR decomposition of a matrix via Householder transformations, and
    • singular value problems.

– Was initially designed to run efficiently on shared-memory vector machines
– Depends on BLAS
– Has been extended for distributed-memory systems as ScaLAPACK (Scalable Linear Algebra PACKage)

A minimal LAPACK solver call is sketched below.
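A minimal sketch, assuming double precision: DGESV solves Ax = b by LU factorization with partial pivoting:

    program lapack_demo
      implicit none
      integer, parameter :: n = 3
      integer :: ipiv(n), info
      real(8) :: a(n,n), b(n)
      call random_number(a)   ! a random (almost surely nonsingular) matrix
      b = 1.0d0
      call dgesv(n, 1, a, n, ipiv, b, n, info)  ! on exit, b holds the solution x
      if (info /= 0) print *, 'dgesv failed, info =', info
      print *, 'x =', b
    end program lapack_demo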

LAPACK naming conventions

• Very similar to BLAS: XYYZZZ

• X: data type
  – S: REAL
  – D: DOUBLE PRECISION
  – C: COMPLEX
  – Z: COMPLEX*16 or DOUBLE COMPLEX

• YY: matrix type (more matrix types than BLAS)
  – BD: bidiagonal
  – DI: diagonal
  – GB: general band
  – GE: general (i.e., unsymmetric, in some cases rectangular)
  – GG: general matrices, generalized problem (i.e., a pair of general matrices)
  – GT: general tridiagonal
  – HB: (complex) Hermitian band
  – HE: (complex) Hermitian
  – HG: upper Hessenberg matrix, generalized problem (i.e., a Hessenberg and a triangular matrix)
  – HP: (complex) Hermitian, packed storage
  – HS: upper Hessenberg
  – OP: (real) orthogonal, packed storage
  – OR: (real) orthogonal
  – PB: symmetric or Hermitian positive definite band
  – PO: symmetric or Hermitian positive definite
  – PP: symmetric or Hermitian positive definite, packed storage
  – PT: symmetric or Hermitian positive definite tridiagonal
  – SB: (real) symmetric band
  – SP: symmetric, packed storage
  – ST: (real) symmetric tridiagonal
  – SY: symmetric
  – TB: triangular band
  – TG: triangular matrices, generalized problem (i.e., a pair of triangular matrices)
  – TP: triangular, packed storage
  – TR: triangular (or in some cases quasi-triangular)
  – TZ: trapezoidal
  – UN: (complex) unitary
  – UP: (complex) unitary, packed storage

• ZZZ: performed computation
  – Linear systems
  – Factorizations
  – Eigenvalue problems
  – Singular value decomposition
  – Etc.

LAPACK routines

http://www.icl.utk.edu/~mgates3/docs/lapack-summary.pdf

Numerical Libraries packages

(Figure credit: Dongarra/ICL)

• Vendor libraries provide implementations of BLAS, LAPACK, and ScaLAPACK optimized for their processors and platforms.

LAPACK & ScaLAPACK

• ScaLAPACK is a library with a subset of LAPACK routines running on distributed-memory machines.
• ScaLAPACK is designed for heterogeneous computing and is portable to any computer that supports MPI or PVM.
• http://www.netlib.org/scalapack/scalapack_home.html

Overview of ScaLAPACK

Why use LAPACK or ScaLAPACK?

• Solving systems of:
  – Linear equations: Ax = b
  – Least squares: min ||Ax − b||_2
  – Eigenvalue problems: Ax = λx
  – Singular value problems: A = USV^T

An eigenvalue call is sketched below.
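A minimal eigenvalue sketch, assuming double precision and a symmetric matrix: DSYEV computes all eigenvalues and, optionally, eigenvectors:

    program eig_demo
      implicit none
      integer, parameter :: n = 3
      integer :: info
      real(8) :: a(n,n), w(n), work(3*n - 1)
      ! A symmetric tridiagonal test matrix (column-major constructor)
      a = reshape([2d0, 1d0, 0d0,  1d0, 2d0, 1d0,  0d0, 1d0, 2d0], [n, n])
      ! 'V' = also compute eigenvectors (returned in a); 'U' = upper triangle holds A
      call dsyev('V', 'U', n, a, n, w, work, size(work), info)
      print *, 'eigenvalues =', w
    end program eig_demo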

Reference BLAS vs Tuned

• The reference BLAS and LAPACK libraries are reference implementations of the BLAS and LAPACK standards. They are neither optimized nor multi-threaded, so not much performance should be expected. These libraries are available for download at http://www.netlib.org/blas and http://www.netlib.org/lapack

• The Automatically Tuned Linear Algebra Software (ATLAS) automatically chooses, at compile time, the algorithms delivering the best performance. ATLAS does not contain all LAPACK functionality; it can be downloaded from http://www.netlib.org/atlas

• The Goto BLAS is an implementation of the level 3 BLAS aimed at high efficiency. The Goto BLAS is available for download from http://www.tacc.utexas.edu/resources/software

Optimized vendor libraries for BLAS/LAPACK

• Highly efficient versions
• Hand-tuned assembly by hardware vendors
• Provide near-peak performance
• Several vendors provide libraries optimized for their architecture (AMD, Fujitsu, IBM, Intel, NEC, …)
  – Intel → MKL
  – Cray → LibSci
  – AMD → ACML
  – IBM → ESSL

• USE them! (Speedups of 10× or more)

ACML / MKL

• ACML (AMD Core Math Library)
  – LAPACK, BLAS, and extended BLAS (sparse); FFTs (single- and double-precision, real and complex data types)
  – APIs for both Fortran and C
  – https://developer.amd.com/amd-open64-software-development-kit/building-with-acml/

• MKL (Intel Math Kernel Library)
  – LAPACK, BLAS, and extended BLAS (sparse); FFTs (single- and double-precision, real and complex data types)
  – APIs for both Fortran and C
  – www.intel.com/software/products/mkl/
  – Use the MKL advisor page to link your code with it: https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor

Example with SGEMM

Source: http://homepage.ntu.edu.tw/~wttsai/fortran/

Fortran example
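The code listing from the slide did not survive extraction; below is a minimal stand-in, assuming single precision and the standard SGEMM interface (the file name test_sgemm.f90 matches the compilation demos later in this section):

    program test_sgemm
      implicit none
      integer, parameter :: n = 1000
      real, allocatable :: a(:,:), b(:,:), c(:,:)
      allocate(a(n,n), b(n,n), c(n,n))
      call random_number(a)
      call random_number(b)
      c = 0.0
      ! C := 1.0*A*B + 0.0*C  ('N','N' = no transpose on A or B)
      call sgemm('N', 'N', n, n, n, 1.0, a, n, b, n, 0.0, c, n)
      print *, 'c(1,1) =', c(1,1)
    end program test_sgemm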

Source available at http://homepage.ntu.edu.tw/~wttsai/fortran/

Linking examples

Library: LIBSCI on Cray
  – Cray compiler: environment by default; compile without adding flags
  – GNU: compile without adding flags
  – Intel: compile without adding flags

Library: ACML
  – GNU: /opt/acml/4.4.0/gfortran64_mp/lib/libacml_mp.a -fopenmp
  – Intel: /opt/acml/4.4.0/ifort64_mp/lib/libacml_mp.a -openmp -lpthread

Library: MKL
  – PGI:
      -Wl,--start-group
      /opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a
      /opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_pgi_thread.a
      /opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_core.a
      -Wl,--end-group -mp -lpthread
  – GNU:
      -Wl,--start-group
      /opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a
      /opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_gnu_thread.a
      /opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_core.a
      -Wl,--end-group -L/opt/intel/Compiler/11.1/038/lib/intel64/ -liomp5
  – Intel:
      -Wl,--start-group
      /opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a
      /opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_intel_thread.a
      /opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_core.a
      -Wl,--end-group -openmp -lpthread

• Use the MKL link-line advisor:
  http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor

Compilation demos

• On Shaheen: use -Wl,-ysgemm_ to check which optimized library is used.
  – With Cray LibSci:
      ftn -o exe_libsci test_sgemm.f90
  – With Intel MKL (first unload cray-libsci):
      ftn -o exe_mkl test_sgemm.f90 -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -liomp5 -lpthread -lm -ldl
• On Ibex:
  – With the reference BLAS:
      module load blas/3.7.1/gnu-6.4.0
      gfortran test_sgemm.f90 -lblas
  – With Intel MKL:
      module load intel/2017
      ifort -o exe_mkl test_sgemm.f90 -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -liomp5 -lpthread -lm -ldl
  – With ACML:
      module load acml/6.1.0.31-gfortran64
      gfortran -o exe_acml test_sgemm.f90 -lacml_mp

Python fans!

• You can speed up your Python scripts:
  – By using the scientific libraries NumPy and SciPy
  – By building Python with the vendor-optimized library
    • Available with python/2.7.11 on Shaheen
    • You can build your own by following the instructions: https://software.intel.com/en-us/node/696338
    • With cray-libsci: available next month on Shaheen (cray-python/17.09)
• Check the installation:
    import numpy as np
    np.show_config()

Python NumPy: check installation

>>> import numpy as np
>>> np.show_config()
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016//mkl/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
openblas_lapack_info:
    NOT AVAILABLE
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']

Python NumPy and SciPy demo

Performance formulae

• Performance is measured in floating point operations per second, FLOPS or FLOP/s.
• Current processors deliver an Rpeak in the GFLOPS (10^9 FLOPS) range.

• The Rpeak of a system can be computed by:

    Rpeak = nCPU · ncore · nFPU · f

  where:
  – nCPU is the number of CPUs in the system,
  – ncore is the number of computing cores per CPU,
  – nFPU is the number of floating point units per core,
  – f is the clock frequency.
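As an illustrative worked example (the node configuration here is an assumption, not a description of any specific machine): a node with nCPU = 2, ncore = 16, cores completing 16 DP floating point operations per cycle (e.g., two 4-wide FMA units, as on Haswell below), and f = 2.3 GHz gives

    Rpeak = 2 · 16 · 16 · 2.3×10^9 ≈ 1.18 TFLOPS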

FLOPs counts for recent processor microarchitectures

• Intel Core 2 and Nehalem:
  – 4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
  – 8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
• Intel Sandy Bridge/Ivy Bridge:
  – 8 DP FLOPs/cycle: 4-wide AVX addition + 4-wide AVX multiplication
  – 16 SP FLOPs/cycle: 8-wide AVX addition + 8-wide AVX multiplication
• Intel Haswell:
  – 16 DP FLOPs/cycle: two 4-wide FMA (fused multiply-add) instructions
  – 32 SP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
• Intel Skylake / Knights Landing (AVX-512):
  – 32 DP FLOPs/cycle
  – 64 SP FLOPs/cycle
• Intel MIC (Xeon Phi), per core (supports 4 hyperthreads):
  – 16 DP FLOPs/cycle: 8-wide FMA every cycle
  – 32 SP FLOPs/cycle: 16-wide FMA every cycle
• Intel MIC (Xeon Phi), per thread:
  – 8 DP FLOPs/cycle: 8-wide FMA every other cycle
  – 16 SP FLOPs/cycle: 16-wide FMA every other cycle
• AMD K10:
  – 4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
  – 8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
• AMD Bulldozer:
  – 8 DP FLOPs/cycle: 4-wide FMA
  – 16 SP FLOPs/cycle: 8-wide FMA
• ARM Cortex-A15:
  – 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
  – 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add
• IBM PowerPC A2 (Blue Gene/Q):
  – 8 DP FLOPs/cycle: 4-wide QPX FMA
  – SP elements are extended to DP and processed on the same units

Rooflines

• The roofline is a performance model used to bound the performance of various numerical methods and operations running on processor architectures; its basic form is given below.
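In its common formulation (a standard statement of the model, not taken from these slides), attainable performance is bounded by

    P = min(Rpeak, AI × Bpeak)

where AI is the arithmetic intensity of the kernel (FLOPs per byte of memory traffic) and Bpeak is the peak memory bandwidth: kernels with low arithmetic intensity are memory-bound, while dense Level 3 operations like DGEMM have high intensity and can approach Rpeak.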

(Figure credit: Lorena A. Barba, Rio Yokota)

Best Practices

• The Numerical Recipes books DO NOT provide optimized code.
  – (Libraries can be 100× faster.)
• Don't reinvent the wheel.
• Use optimized libraries!
• It's not only for C/C++/Fortran.
  – Python has an interface to BLAS (check with NumPy/SciPy)
  – R, MATLAB, Cython, …
• Don't forget the environment variables (e.g., OMP_NUM_THREADS for threaded BLAS implementations)!
• The efficient use of numerical libraries can yield significant performance benefits.
  – It should be one of the first things to investigate when optimizing codes.
  – The best library implementation often varies depending on the individual routine and possibly even the size of the input data.
  – READ the manual and/or attend the tutorials/workshops!

THANKS
