Introduction to Numerical Libraries for HPC

Bilel Hadri [email protected] Computational Scientist KAUST Supercomputing Lab

Numerical Libraries – Application Areas

• Most used libraries/software in HPC!
• Linear Algebra
  – Systems of equations
    • Direct, iterative, and multigrid solvers
    • Sparse and dense systems
  – Eigenvalue problems
  – Least squares
• Signal processing
  – FFT
• Numerical integration
• Random number generators

Numerical Libraries – Motivations

• Don't reinvent the wheel!
• Improve productivity!
• Get better performance!
  – Faster and better algorithms

Faster (Better Code)

• Achieving the best performance requires creating very processor- and system-specific code.
• Example: dense matrix-matrix multiply

– Simple to express:

    do i = 1, n
      do j = 1, n
        do k = 1, n
          c(i,j) = c(i,j) + a(i,k) * b(k,j)
        enddo
      enddo
    enddo

Performance

• How fast should this run?
  – Our matrix-matrix multiply algorithm has 2n^3 floating point operations
    • 3 nested loops, each with n iterations
    • 1 multiply, 1 add in each inner iteration
  – For n=100: 2×10^6 operations, about 1 ms on a 2 GHz processor
  – For n=1000: 2×10^9 operations, or about 1 s

• Reality:
  – n=100: 1.1 ms
  – n=1000: 6 s
  → The obvious expression of the algorithm is not transformed into leading performance (an optimized library call, sketched below, is the usual remedy).
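The entire triple loop can be replaced by a single call to an optimized BLAS routine. A minimal sketch, assuming double-precision arrays and a linked BLAS library (the loop computes C := C + A·B, which maps to DGEMM with alpha = beta = 1):

    ! Drop-in replacement for the triple loop: C := 1.0*A*B + 1.0*C
    ! ('N','N' = no transpose on A or B)
    call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 1.0d0, c, n)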

Numerical Libraries – Packages

• Linear Algebra
  – BLAS / LAPACK / ScaLAPACK / PLASMA / MAGMA / HiCMA
  – PETSc
  – HYPRE
  – TRILINOS
• Signal processing
  – FFTW
• Numerical integration
  – GSL
• Random number generators
  – SPRNG

Others

MUMPS 4.9.2. MUMPS (MUltifrontal Massively Parallel sparse direct Solver) is a package of parallel sparse direct linear system solvers based on a multifrontal algorithm. http://graal.ens-lyon.fr/MUMPS/

SuperLU 4.3. SuperLU is a sequential sparse direct solver (the sequential counterpart of SuperLU_DIST) and also provides a sequential incomplete-LU preconditioner that can accelerate the convergence of Krylov subspace iterative solvers. http://crd.lbl.gov/~xiaoye/SuperLU/

ParMETIS 4.0.2. ParMETIS (Parallel Graph Partitioning and Fill-reducing Matrix Ordering) is a library of routines that partition unstructured graphs and meshes and compute fill-reducing orderings of sparse matrices. http://glaros.dtc.umn.edu/gkhome/views/metis/

• SUNDIALS 2.5.0 (SUite of Nonlinear and DIfferential/Algebraic equation Solvers) consists of 5 solvers: CVODE, CVODES, IDA, IDAS, and KINSOL. In addition, SUNDIALS provides a MATLAB interface to CVODES, IDAS, and KINSOL that is called sundialsTB. https://computation.llnl.gov/casc/sundials/main.html

• Scotch 5.1.12b. Scotch is a software package and set of libraries for sequential and parallel graph partitioning, static mapping, block ordering, and sequential mesh and hypergraph partitioning. http://www.labri.fr/perso/pelegrin/scotch

Note: On Shaheen, they are all grouped into cray-tpsl.

• Freely Available Software for Linear Algebra http://www.netlib.org/utk/people/JackDongarra/la-sw.html

BLAS (Basic Linear Algebra Subprograms)

The BLAS functionality is divided into three levels:

• Level 1: contains vector operations of the form y ← αx + y, plus dot products, norms, etc.

• Level 2: contains matrix-vector operations of the form y ← αAx + βy

• Level 3: contains matrix-matrix operations of the form C ← αAB + βC

• Several implementations exist for different languages
  – Reference implementation: http://www.netlib.org/blas/

BLAS: naming conventions

• Each routine has a name which specifies the operation, the type of matrix involved, and the precision. Names are in the form PMMOO.
  – Each operation is defined for four precisions (P):
    • S: single real
    • D: double real
    • C: single complex
    • Z: double complex
  – The types of matrices are (MM):
    • GE: general, GB: general band
    • SY: symmetric, SB: symmetric band, SP: symmetric packed
    • HE: hermitian, HB: hermitian band, HP: hermitian packed
    • TR: triangular, TB: triangular band, TP: triangular packed
  – Some of the most common operations (OO):
    • DOT: scalar product, x^T y
    • AXPY: vector sum, αx + y
    • MV: matrix-vector product, Ax
    • SV: matrix-vector solve, inv(A) x
    • MM: matrix-matrix product, AB
    • SM: matrix-matrix solve, inv(A) B

• Examples:
  – SGEMM stands for "single-precision general matrix-matrix multiply".
  – DGEMM stands for "double-precision general matrix-matrix multiply".

BLAS Level 1 routines

• Vector operations (xROT, xSWAP, xCOPY etc.)
• Scalar dot products (xDOT etc.)
• Vector norms and index of the largest element (xNRM2, xASUM, IxAMAX etc.)

A minimal calling sketch follows.
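A minimal Level 1 sketch, assuming double precision and a linked BLAS library (DAXPY and the DDOT function):

    program blas1_demo
      implicit none
      integer, parameter :: n = 5
      real(8) :: x(n), y(n)
      real(8), external :: ddot        ! BLAS dot-product function
      x = 1.0d0
      y = 2.0d0
      call daxpy(n, 3.0d0, x, 1, y, 1) ! y := 3*x + y
      print *, 'y   =', y
      print *, 'x.y =', ddot(n, x, 1, y, 1)
    end program blas1_demo

Build with, e.g., gfortran blas1_demo.f90 -lblas (or ftn on a Cray system, where LibSci is linked by default).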

BLAS Level 2 routines

• Matrix-vector operations (xGEMV, xGBMV, xHEMV, xHBMV etc.)
• Rank-1 and rank-2 updates (xGER, xHER etc.)
• Solving Tx = y for x, where T is triangular (xTRSV etc.) (see the sketch below)
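A minimal Level 2 sketch, assuming double precision (DGEMV computes y := alpha*A*x + beta*y):

    program blas2_demo
      implicit none
      integer, parameter :: m = 4, n = 3
      real(8) :: a(m,n), x(n), y(m)
      call random_number(a)
      x = 1.0d0
      ! y := 1.0*A*x + 0.0*y  ('N' = no transpose)
      call dgemv('N', m, n, 1.0d0, a, m, x, 1, 0.0d0, y, 1)
      print *, 'y =', y
    end program blas2_demo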

BLAS Level 3 routines

• Matrix-matrix operations (xGEMM etc.)
• Triangular matrix-matrix multiply and solve (xTRMM, xTRSM)
• Widely used matrix-matrix multiplies (xSYMM, xGEMM) (see the sketch below)
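A minimal Level 3 sketch, assuming double precision: DTRSM solves a triangular system with many right-hand sides in a single call:

    program blas3_demo
      implicit none
      integer, parameter :: m = 3, n = 2
      real(8) :: a(m,m), b(m,n)
      integer :: i
      call random_number(a)
      do i = 1, m
        a(i,i) = a(i,i) + dble(m)  ! strengthen the diagonal so A is safely nonsingular
      end do
      b = 1.0d0
      ! Solve A*X = 1.0*B with A lower triangular; X overwrites B.
      ! Arguments: left side, lower triangle, no transpose, non-unit diagonal.
      call dtrsm('L', 'L', 'N', 'N', m, n, 1.0d0, a, m, b, m)
      print *, 'X =', b
    end program blas3_demo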

LAPACK (Linear Algebra PACKage)

• Linear Algebra PACKage
  – http://www.netlib.org/lapack/
  – Provides routines for:
    • solving systems of simultaneous linear equations,
    • least-squares solutions of linear systems of equations,
    • eigenvalue problems,
    • QR decomposition of a matrix via Householder transformations, and
    • singular value problems.

– Was initially designed to run efficiently on shared-memory vector machines
– Depends on BLAS
– Has been extended for distributed-memory systems as ScaLAPACK (Scalable Linear Algebra PACKage)

A minimal LAPACK solver call is sketched below.
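A minimal sketch, assuming double precision: DGESV solves Ax = b by LU factorization with partial pivoting:

    program lapack_demo
      implicit none
      integer, parameter :: n = 3
      integer :: ipiv(n), info
      real(8) :: a(n,n), b(n)
      call random_number(a)   ! a random (almost surely nonsingular) matrix
      b = 1.0d0
      call dgesv(n, 1, a, n, ipiv, b, n, info)  ! on exit, b holds the solution x
      if (info /= 0) print *, 'dgesv failed, info =', info
      print *, 'x =', b
    end program lapack_demo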

LAPACK naming conventions

• Very similar to BLAS: XYYZZZ

• X: data type
  – S: REAL
  – D: DOUBLE PRECISION
  – C: COMPLEX
  – Z: COMPLEX*16 or DOUBLE COMPLEX

• YY: matrix type (more matrix types than BLAS)
  – BD: bidiagonal
  – DI: diagonal
  – GB: general band
  – GE: general (i.e., unsymmetric, in some cases rectangular)
  – GG: general matrices, generalized problem (i.e., a pair of general matrices)
  – GT: general tridiagonal
  – HB: (complex) Hermitian band
  – HE: (complex) Hermitian
  – HG: upper Hessenberg matrix, generalized problem (i.e., a Hessenberg and a triangular matrix)
  – HP: (complex) Hermitian, packed storage
  – HS: upper Hessenberg
  – OP: (real) orthogonal, packed storage
  – OR: (real) orthogonal
  – PB: symmetric or Hermitian positive definite band
  – PO: symmetric or Hermitian positive definite
  – PP: symmetric or Hermitian positive definite, packed storage
  – PT: symmetric or Hermitian positive definite tridiagonal
  – SB: (real) symmetric band
  – SP: symmetric, packed storage
  – ST: (real) symmetric tridiagonal
  – SY: symmetric
  – TB: triangular band
  – TG: triangular matrices, generalized problem (i.e., a pair of triangular matrices)
  – TP: triangular, packed storage
  – TR: triangular (or in some cases quasi-triangular)
  – TZ: trapezoidal
  – UN: (complex) unitary
  – UP: (complex) unitary, packed storage

• ZZZ: performed computation
  – Linear systems
  – Factorizations
  – Eigenvalue problems
  – Singular value decomposition
  – Etc.

LAPACK routines

http://www.icl.utk.edu/~mgates3/docs/lapack-summary.pdf

Numerical Libraries packages

(Figure credit: Dongarra/ICL)

• Vendor libraries provide implementations of BLAS, LAPACK, and ScaLAPACK optimized for their processors and platforms.

LAPACK & ScaLAPACK

• ScaLAPACK is a library with a subset of LAPACK routines running on distributed-memory machines.
• ScaLAPACK is designed for heterogeneous computing and is portable to any computer that supports MPI or PVM.
• http://www.netlib.org/scalapack/scalapack_home.html

Overview of ScaLAPACK

Why use LAPACK or ScaLAPACK?

• Solving systems of:
  – Linear equations: Ax = b
  – Least squares: min ||Ax − b||_2
  – Eigenvalue problems: Ax = λx
  – Singular value problems: A = USV^T

An eigenvalue call is sketched below.
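A minimal eigenvalue sketch, assuming double precision and a symmetric matrix: DSYEV computes all eigenvalues and, optionally, eigenvectors:

    program eig_demo
      implicit none
      integer, parameter :: n = 3
      integer :: info
      real(8) :: a(n,n), w(n), work(3*n - 1)
      ! A symmetric tridiagonal test matrix (column-major constructor)
      a = reshape([2d0, 1d0, 0d0,  1d0, 2d0, 1d0,  0d0, 1d0, 2d0], [n, n])
      ! 'V' = also compute eigenvectors (returned in a); 'U' = upper triangle holds A
      call dsyev('V', 'U', n, a, n, w, work, size(work), info)
      print *, 'eigenvalues =', w
    end program eig_demo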

Reference BLAS vs Tuned

• The reference BLAS and LAPACK libraries are reference implementations of the BLAS and LAPACK standards. They are neither optimized nor multi-threaded, so not much performance should be expected. These libraries are available for download at http://www.netlib.org/blas and http://www.netlib.org/lapack

• The Automatically Tuned Linear Algebra Software (ATLAS) automatically chooses, at compile time, the algorithms delivering the best performance. ATLAS does not contain all LAPACK functionality; it can be downloaded from http://www.netlib.org/atlas

• The Goto BLAS is an implementation of the level 3 BLAS aimed at high efficiency. The Goto BLAS is available for download from http://www.tacc.utexas.edu/resources/software

Optimized vendor libraries for BLAS/LAPACK

• Highly efficient versions
• Hand-tuned assembly by hardware vendors
• Provide near-peak performance
• Several vendors provide libraries optimized for their architecture (AMD, Fujitsu, IBM, Intel, NEC, …)
  – Intel → MKL
  – Cray → LibSci
  – AMD → ACML
  – IBM → ESSL

• USE them! (Speedups of 10× or more)

ACML / MKL

• ACML (AMD Core Math Library)
  – LAPACK, BLAS, and extended BLAS (sparse); FFTs (single- and double-precision, real and complex data types)
  – APIs for both Fortran and C
  – https://developer.amd.com/amd-open64-software-development-kit/building-with-acml/

• MKL (Intel Math Kernel Library)
  – LAPACK, BLAS, and extended BLAS (sparse); FFTs (single- and double-precision, real and complex data types)
  – APIs for both Fortran and C
  – www.intel.com/software/products/mkl/
  – Use the MKL advisor page to link your code with it: https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor

Example with SGEMM

Source: http://homepage.ntu.edu.tw/~wttsai/fortran/

Fortran example
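The code listing from the slide did not survive extraction; below is a minimal stand-in, assuming single precision and the standard SGEMM interface (the file name test_sgemm.f90 matches the compilation demos later in this section):

    program test_sgemm
      implicit none
      integer, parameter :: n = 1000
      real, allocatable :: a(:,:), b(:,:), c(:,:)
      allocate(a(n,n), b(n,n), c(n,n))
      call random_number(a)
      call random_number(b)
      c = 0.0
      ! C := 1.0*A*B + 0.0*C  ('N','N' = no transpose on A or B)
      call sgemm('N', 'N', n, n, n, 1.0, a, n, b, n, 0.0, c, n)
      print *, 'c(1,1) =', c(1,1)
    end program test_sgemm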

Source available at http://homepage.ntu.edu.tw/~wttsai/fortran/

Linking examples

Library: LIBSCI on Cray
  – Cray compiler: environment by default; compile without adding flags
  – GNU: compile without adding flags
  – Intel: compile without adding flags

Library: ACML
  – GNU: /opt/acml/4.4.0/gfortran64_mp/lib/libacml_mp.a -fopenmp
  – Intel: /opt/acml/4.4.0/ifort64_mp/lib/libacml_mp.a -openmp -lpthread

Library: MKL
  – PGI:
      -Wl,--start-group
      /opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a
      /opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_pgi_thread.a
      /opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_core.a
      -Wl,--end-group -mp -lpthread
  – GNU:
      -Wl,--start-group
      /opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a
      /opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_gnu_thread.a
      /opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_core.a
      -Wl,--end-group -L/opt/intel/Compiler/11.1/038/lib/intel64/ -liomp5
  – Intel:
      -Wl,--start-group
      /opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a
      /opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_intel_thread.a
      /opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_core.a
      -Wl,--end-group -openmp -lpthread

• Use the MKL link-line advisor:
  http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor

Compilation demos

• On Shaheen: use -Wl,-ysgemm_ to check which optimized library is used.
  – With Cray LibSci:
      ftn -o exe_libsci test_sgemm.f90
  – With Intel MKL (first unload cray-libsci):
      ftn -o exe_mkl test_sgemm.f90 -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -liomp5 -lpthread -lm -ldl
• On Ibex:
  – With the reference BLAS:
      module load blas/3.7.1/gnu-6.4.0
      gfortran test_sgemm.f90 -lblas
  – With Intel MKL:
      module load intel/2017
      ifort -o exe_mkl test_sgemm.f90 -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -liomp5 -lpthread -lm -ldl
  – With ACML:
      module load acml/6.1.0.31-gfortran64
      gfortran -o exe_acml test_sgemm.f90 -lacml_mp

Python fans!

• You can speed up your Python scripts:
  – By using the scientific libraries NumPy and SciPy
  – By building Python with the vendor-optimized library
    • Available with python/2.7.11 on Shaheen
    • You can build your own by following the instructions: https://software.intel.com/en-us/node/696338
    • With cray-libsci: available next month on Shaheen (cray-python/17.09)
• Check the installation:
    import numpy as np
    np.show_config()

Python NumPy: check installation

>>> import numpy as np
>>> np.show_config()
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016//mkl/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
openblas_lapack_info:
    NOT AVAILABLE
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']

Python NumPy and SciPy demo

Performance formulae

• Performance is measured in floating point operations per second, FLOPS or FLOP/s.
• Current processors deliver an Rpeak in the GFLOPS (10^9 FLOPS) range.

• The Rpeak of a system can be computed by:

    Rpeak = nCPU · ncore · nFPU · f

  where:
  – nCPU is the number of CPUs in the system,
  – ncore is the number of computing cores per CPU,
  – nFPU is the number of floating point units per core,
  – f is the clock frequency.
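As an illustrative worked example (the node configuration here is an assumption, not a description of any specific machine): a node with nCPU = 2, ncore = 16, cores completing 16 DP floating point operations per cycle (e.g., two 4-wide FMA units, as on Haswell below), and f = 2.3 GHz gives

    Rpeak = 2 · 16 · 16 · 2.3×10^9 ≈ 1.18 TFLOPS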

FLOPs counts for recent processor microarchitectures

• Intel Core 2 and Nehalem:
  – 4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
  – 8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
• Intel Sandy Bridge/Ivy Bridge:
  – 8 DP FLOPs/cycle: 4-wide AVX addition + 4-wide AVX multiplication
  – 16 SP FLOPs/cycle: 8-wide AVX addition + 8-wide AVX multiplication
• Intel Haswell:
  – 16 DP FLOPs/cycle: two 4-wide FMA (fused multiply-add) instructions
  – 32 SP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
• Intel Skylake / Knights Landing (AVX-512):
  – 32 DP FLOPs/cycle
  – 64 SP FLOPs/cycle
• Intel MIC (Xeon Phi), per core (supports 4 hyperthreads):
  – 16 DP FLOPs/cycle: 8-wide FMA every cycle
  – 32 SP FLOPs/cycle: 16-wide FMA every cycle
• Intel MIC (Xeon Phi), per thread:
  – 8 DP FLOPs/cycle: 8-wide FMA every other cycle
  – 16 SP FLOPs/cycle: 16-wide FMA every other cycle
• AMD K10:
  – 4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
  – 8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
• AMD Bulldozer:
  – 8 DP FLOPs/cycle: 4-wide FMA
  – 16 SP FLOPs/cycle: 8-wide FMA
• ARM Cortex-A15:
  – 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
  – 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add
• IBM PowerPC A2 (Blue Gene/Q):
  – 8 DP FLOPs/cycle: 4-wide QPX FMA
  – SP elements are extended to DP and processed on the same units

Rooflines

• The roofline is a performance model used to bound the performance of various numerical methods and operations running on processor architectures; its basic form is given below.
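In its common formulation (a standard statement of the model, not taken from these slides), attainable performance is bounded by

    P = min(Rpeak, AI × Bpeak)

where AI is the arithmetic intensity of the kernel (FLOPs per byte of memory traffic) and Bpeak is the peak memory bandwidth: kernels with low arithmetic intensity are memory-bound, while dense Level 3 operations like DGEMM have high intensity and can approach Rpeak.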

(Figure credit: Lorena A. Barba, Rio Yokota)

Best Practices

• The Numerical Recipes books DO NOT provide optimized code.
  – (Libraries can be 100× faster.)
• Don't reinvent the wheel.
• Use optimized libraries!
• It's not only for C/C++/Fortran.
  – Python has an interface to BLAS (check with NumPy/SciPy)
  – R, MATLAB, Cython, …
• Don't forget the environment variables (e.g., OMP_NUM_THREADS for threaded BLAS implementations)!
• The efficient use of numerical libraries can yield significant performance benefits.
  – It should be one of the first things to investigate when optimizing codes.
  – The best library implementation often varies depending on the individual routine and possibly even the size of the input data.
  – READ the manual and/or attend the tutorials/workshops!

THANKS
