Parallel Mathematical Libraries
STIMULATE Training Workshop 2018

December 2018, I. Gutheil, JSC

Outline

- A Short History
- Sequential Libraries
- Parallel Libraries and Application Systems:
  - Threaded Libraries
  - MPI parallel Libraries
  - Libraries for GPU usage
- Usage of ScaLAPACK, Elemental, and FFTW

A Short History: Libraries for Dense Linear Algebra

- Starting in the early 1970s, written in FORTRAN
- LINPACK (LINear algebra PACKage) for digital computers, supercomputers of the 1970s and early 1980s: dense linear system solvers, factorization and solution, real and complex, single and double precision
- EISPACK (EIgenSolver PACKage), written around 1972–1973: eigensolvers for real and complex, symmetric and non-symmetric matrices, solvers for the generalized eigenvalue problem and singular value decomposition
- LAPACK, initial release 1992, still in use, now Fortran 90, also C and C++ interfaces, tuned for single-core performance on modern supercomputers; threaded versions available from vendors, e.g. MKL

A Short History, continued: BLAS (Basic Linear Algebra Subprograms for Fortran Usage)

- First step 1979: vector operations, now called BLAS 1; widely used kernels of linear algebra routines; idea: tuning in assembly on each machine, leading to portable performance
- Second step 1988: matrix-vector operations, now called BLAS 2; optimization on the matrix-vector level necessary for vector computers; building blocks for LINPACK and EISPACK
- Third step 1990: matrix-matrix operations, now called BLAS 3; memory access became slower than the CPU, so optimization was now necessary at the matrix-matrix level; building blocks for LAPACK, the successor of LINPACK and EISPACK

Why you should use BLAS 3 if possible

1. Standardized interface, on most computers also for C; readable code
2. Optimized for data re-use; memory access is usually one order of magnitude slower than the CPU. "Data re-use factor":

       r := (floating point operations) / (memory accesses)

   Examples:

       AXPY:  r ≈ 2n  / 2n   = 1
       GEMV:  r ≈ 2n² / n²   = 2
       GEMM:  r ≈ 2n³ / 3n²  = (2/3) n

   Only GEMM comes close to peak performance (see the DGEMM sketch below).
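
To illustrate the standardized interface from item 1, a minimal sketch of a BLAS 3 call through the CBLAS C interface (as shipped, e.g., with MKL or OpenBLAS; the matrix size and values here are arbitrary):

    #include <cblas.h>      // CBLAS interface to BLAS; MKL users may include <mkl.h> instead
    #include <vector>

    int main() {
        const int n = 512;                                  // square matrices for simplicity
        std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

        // C := 1.0 * A * B + 0.0 * C  (BLAS 3: O(n^3) operations on O(n^2) data)
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, A.data(), n, B.data(), n,
                    0.0, C.data(), n);
        return 0;
    }

The corresponding Fortran call, DGEMM, reappears on the PDGEMM slide later in this talk.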

Performance of matrix-matrix multiplication, JULIA, gfortran, comparison with MKL (runtimes; lower is better)

    Loop order      -O0         -O2         -O3
    ijk-loop        4.9249      0.98752     0.99167
    ikj-loop        5.5986      2.5144      2.5146
    jik-loop        5.3677      1.0795      1.0792
    jki-loop        4.5858      0.67161     0.43218
    kij-loop        5.5317      2.5506      2.5492
    kji-loop        4.6007      0.68648     0.46044
    MKL (Fortran)   0.045247    0.043458    0.043392

Performance of matrix-matrix multiplication, JULIA, ifort, comparison with MKL (runtimes; lower is better)

    Loop order      -O0         -O2         -O3
    ijk-loop        11.124      1.0156      0.44883
    ikj-loop        12.165      0.22118     0.089086
    jik-loop        11.888      0.95409     0.38558
    jki-loop        10.474      0.22997     0.23005
    kij-loop        11.948      0.22124     0.089038
    kji-loop        10.528      0.22985     0.23035
    MKL (Fortran)   0.044419    0.042375    0.042731

Sequential Libraries

Vendor-specific library:
- MKL (Intel(R) Math Kernel Library); usage see https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor

Public domain libraries:
- LAPACK (Linear Algebra PACKage), part of MKL or libopenblas.so
- ARPACK (Arnoldi PACKage), iterative solver for sparse eigenvalue problems
- GSL (GNU Scientific Library, C library)
- GMP (GNU Multiple Precision Arithmetic Library)

Contents of Intel(R) MKL 11.*

- BLAS, Sparse BLAS, CBLAS
- LAPACK
- Iterative Sparse Solvers, Trust Region Solver
- Vector Math Library
- Vector Statistical Library
- Fourier Transform Functions
- Trigonometric Transform Functions
- GMP routines
- Poisson Library
- Interface for FFTW

Contents of GSL (not complete)

- CBLAS
- Linear algebra, linear systems and eigenproblems
- FFT and other transformations
- Interpolation
- Integration and numerical differentiation
- Statistics
- Ordinary differential equations

Parallel Libraries: Threaded Parallelism

- MKL is multi-threaded or at least thread-safe
  - usage as with the sequential routines
  - if OMP_NUM_THREADS is not set, the maximum possible number of threads is used
  - link line example:
        ifort name.f -o name -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
- FFTW 3.3 (Fastest Fourier Transform in the West)
  - sequential, threaded, and OpenMP versions (see the sketch below); an additional version in MKL
  - Cray-intelmpi version on JULIA
  - http://www.fftw.org
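
A minimal sketch of the threaded FFTW 3.3 interface (assuming an FFTW build with thread support, linked with -lfftw3_omp or -lfftw3_threads in addition to -lfftw3; the transform length is arbitrary):

    #include <fftw3.h>
    #include <omp.h>

    int main() {
        fftw_init_threads();                             // must precede all other FFTW calls
        fftw_plan_with_nthreads(omp_get_max_threads());  // plans created afterwards use this many threads

        const int n = 1 << 20;                           // 1d transform length (arbitrary)
        fftw_complex *data = fftw_alloc_complex(n);
        for (int i = 0; i < n; ++i) { data[i][0] = i % 7; data[i][1] = 0.0; }

        fftw_plan plan = fftw_plan_dft_1d(n, data, data, FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_execute(plan);                              // multi-threaded in-place DFT

        fftw_destroy_plan(plan);
        fftw_free(data);
        fftw_cleanup_threads();
        return 0;
    }

The MKL FFTW wrappers are called the same way; their thread count is then controlled by the usual MKL/OpenMP settings such as OMP_NUM_THREADS.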

Parallel Libraries: MPI Parallelism, dense linear algebra

- ScaLAPACK (Scalable Linear Algebra PACKage), Fortran 77; the public domain version now contains the BLACS; http://netlib.org/scalapack
- ELPA (Eigenvalue SoLvers for Petaflop-Applications), Fortran 2003; https://elpa.mpcdf.mpg.de
- Elemental, C++ framework for parallel dense linear algebra; http://libelemental.org/

Parallel Libraries: MPI Parallelism, sparse linear algebra

- MUMPS (MUltifrontal Massively Parallel sparse direct Solver); http://mumps.enseeiht.fr/index.php?page=home
- PARPACK (Parallel ARPACK), now ARPACK-NG, eigensolver; https://github.com/opencollab/arpack-ng
- hypre (high performance preconditioners and solvers); https://computation.llnl.gov/projects/hypre-scalable-linear-solvers-multigrid-methods/software

Parallel Libraries: MPI Parallelism, tools and differential equations

Tools:
- FFTW (Fastest Fourier Transform in the West)
- ParMETIS (Parallel Graph Partitioning); http://glaros.dtc.umn.edu/gkhome/views/metis
- SPRNG (Scalable Parallel Random Number Generator); http://www.sprng.org/

Ordinary differential equations:
- SUNDIALS (SUite of Nonlinear and DIfferential/ALgebraic equation Solvers); https://computation.llnl.gov/projects/sundials/sundials-software

Parallel Systems, MPI Parallelism: PETSc

- Portable, Extensible Toolkit for Scientific Computation
- Numerical solution of partial differential equations
- Can make use of many other libraries
- Solver and preconditioner can be chosen with command line arguments (see the sketch below)
- Comes with lots of examples
- https://www.mcs.anl.gov/petsc/
- Very active mailing list, good support via the mailing list
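
A minimal sketch of a PETSc linear solve (PETSc C API; a 1d Laplacian is assembled here as an assumed test matrix, and error checking is omitted). KSPSetFromOptions defers the choice of solver and preconditioner to the command line, e.g. -ksp_type cg -pc_type jacobi:

    #include <petscksp.h>

    int main(int argc, char **argv) {
        PetscInitialize(&argc, &argv, NULL, NULL);

        const PetscInt n = 100;                 // global problem size (arbitrary)
        Mat A; Vec x, b; KSP ksp;

        // Assemble a distributed tridiagonal (1d Laplacian) test matrix
        MatCreate(PETSC_COMM_WORLD, &A);
        MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
        MatSetFromOptions(A);
        MatSetUp(A);
        PetscInt istart, iend;
        MatGetOwnershipRange(A, &istart, &iend);
        for (PetscInt i = istart; i < iend; ++i) {
            if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
            if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
            MatSetValue(A, i, i, 2.0, INSERT_VALUES);
        }
        MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
        MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

        MatCreateVecs(A, &x, &b);
        VecSet(b, 1.0);

        KSPCreate(PETSC_COMM_WORLD, &ksp);
        KSPSetOperators(ksp, A, A);
        KSPSetFromOptions(ksp);                 // solver/preconditioner picked at run time
        KSPSolve(ksp, b, x);

        KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
        PetscFinalize();
        return 0;
    }

Run, for example, with mpiexec -n 4 ./solve -ksp_type gmres -pc_type bjacobi to switch the Krylov method and preconditioner without recompiling.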

Contents of parallel libraries, dense linear algebra: ScaLAPACK and ELPA

ScaLAPACK:
- Parallel BLAS 1-3, PBLAS version 2
- Dense linear system solvers
- Banded linear system solvers
- Solvers for the linear least squares problem
- Singular value decomposition
- Eigenvalues and eigenvectors of dense symmetric/Hermitian matrices

ELPA: eigensolver only, uses ScaLAPACK

Contents of parallel libraries, dense linear algebra: Elemental (incomplete list)

- Dense and sparse-direct (generalized) least squares problems
- High-performance pseudospectral computation and visualization
- LU and Cholesky with full pivoting
- Column-pivoted QR and interpolative/skeleton decompositions
- Many algorithms for singular-value soft-thresholding (SVT)
- Tall-skinny QR decompositions
- Hermitian matrix functions
- Prototype spectral divide and conquer Schur decomposition and Hermitian EVD
- Sign-based Lyapunov/Riccati/Sylvester solvers
- Convex optimization

Contents of parallel libraries, sparse linear algebra: MUMPS and ParMETIS

MUMPS (MUltifrontal Massively Parallel sparse direct Solver):
- Solution of linear systems with symmetric positive definite matrices, general symmetric matrices, and general unsymmetric matrices
- Real or complex
- Parallel factorization and solve phase, iterative refinement and backward error analysis
- F90 and MPI
- Graph partitioning used for symbolic factorization with reduced fill-in

ParMETIS: Parallel Graph Partitioning and Fill-reducing Matrix Ordering, developed in the Karypis Lab at the University of Minnesota

Contents of parallel libraries, sparse linear algebra: PARPACK

- Reverse communication interface: the user has to supply the parallel matrix-vector multiplication
- Standard or generalized problems
- Single and double precision complex arithmetic versions for standard or generalized problems
- Routines for banded matrices, standard or generalized problems
- Routines for the singular value decomposition

Contents of parallel libraries, parallel tools

FFTW3, version 3.3:
- Discrete Fourier transform (DFT) in one or more dimensions
- Real and complex data
- Arbitrary input size

SPRNG, the Scalable Parallel Random Number Generators Library for ASCI Monte Carlo Computations:
- Version ≥ 2.0: various random number generators in one library
- Version 1.0: separate library for each random number generator

Contents of parallel libraries, ordinary differential equations: SUNDIALS

- CVODE: initial value problems for ODEs
- CVODES: ODE systems with sensitivity analysis capabilities
- ARKODE: initial value ODE problems with additive Runge-Kutta methods
- IDA: initial value problems for differential-algebraic equation systems (DAEs)
- IDAS: DAE systems with sensitivity analysis capabilities
- KINSOL: nonlinear algebraic systems

Libraries for GPU usage

- cuBLAS, cuSPARSE, cuSOLVER: linear algebra using CUDA, from NVIDIA (see the cuBLAS sketch below)
- cuRAND, cuFFT: random numbers and FFT for CUDA
- CUDA Math Library: standard mathematical function library
- cuDNN: primitives for deep neural networks (deep learning)
- All of them and more: https://developer.nvidia.com/gpu-accelerated-libraries
- MAGMA, linear algebra library for GPUs, similar to LAPACK; http://icl.utk.edu/magma/
- Other libraries come with CUDA kernels, for example ELPA
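
As an illustration of the cuBLAS interface, a minimal sketch of a DGEMM on the GPU (assuming a CUDA toolkit with cuBLAS; matrix size arbitrary, error checking omitted). Note that cuBLAS, like the Fortran BLAS, expects column-major storage:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>

    int main() {
        const int n = 512;
        const size_t bytes = sizeof(double) * n * n;
        std::vector<double> hA(n * n, 1.0), hB(n * n, 2.0), hC(n * n, 0.0);

        double *dA, *dB, *dC;
        cudaMalloc((void **)&dA, bytes);
        cudaMalloc((void **)&dB, bytes);
        cudaMalloc((void **)&dC, bytes);
        cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        const double alpha = 1.0, beta = 0.0;
        // C := alpha * A * B + beta * C on the device (column-major, leading dimension n)
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);

        cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }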

Usage of ScaLAPACK: Background

- Scalable version of a subset of LAPACK, redesigned for distributed memory MIMD parallel computers
- Calls as similar to those of LAPACK as possible
- Based on PBLAS instead of BLAS
- BLACS (Basic Linear Algebra Communication Subprograms) for communication
- The user has to take care of the data distribution on their own

Usage of ScaLAPACK: Calling sequence of PDGEMM

Sequential DGEMM:

    ...
    CALL DGEMM(TRANSA, TRANSB, M, N, K,          &
               alpha, A(1,1), LDA, B(1,1), LDB,  &
               beta, C(1,1), LDC)
    ...

Parallel PDGEMM: descriptors instead of LDA, global starting indices of the global matrices are necessary:

    ...
    CALL PDGEMM(TRANSA, TRANSB, M, N, K,               &
                alpha, A, 1, 1, DESCA, B, 1, 1, DESCB, &
                beta, C, 1, 1, DESCC)
    ...

Usage of ScaLAPACK: 2-dimensional processor grid

    CALL BLACS_GRIDINIT(ictxt, 'Row', MP, NP)

Processor grid with MP rows and NP columns, row-first, with BLACS context ictxt:

    P(0,0)     P(0,1)     ...  P(0,NP-1)
    P(1,0)     P(1,1)     ...  P(1,NP-1)
    ...
    P(MP-1,0)  P(MP-1,1)  ...  P(MP-1,NP-1)

Usage of ScaLAPACK: Distribution of matrices

Block-cyclic distribution of blocks with mb = 3 rows and nb = 2 columns to a 2 by 2 processor grid:

    a11 a12 | a13 a14 | a15 a16 | a17 a18 | a19
    a21 a22 | a23 a24 | a25 a26 | a27 a28 | a29
    a31 a32 | a33 a34 | a35 a36 | a37 a38 | a39
    --------+---------+---------+---------+----
    a41 a42 | a43 a44 | a45 a46 | a47 a48 | a49
    a51 a52 | a53 a54 | a55 a56 | a57 a58 | a59
    a61 a62 | a63 a64 | a65 a66 | a67 a68 | a69
    --------+---------+---------+---------+----
    a71 a72 | a73 a74 | a75 a76 | a77 a78 | a79
    a81 a82 | a83 a84 | a85 a86 | a87 a88 | a89

- The blue matrix elements (rows 1-3 and 7-8; columns 1-2, 5-6, and 9) are assigned to the processor in row 0 and column 0
- The red matrix elements (rows 1-3 and 7-8; columns 3-4 and 7-8) are assigned to the processor in row 0 and column 1
- The magenta matrix elements (rows 4-6; columns 1-2, 5-6, and 9) are assigned to the processor in row 1 and column 0
- The green matrix elements (rows 4-6; columns 3-4 and 7-8) are assigned to the processor in row 1 and column 1

Usage of ScaLAPACK: Distribution of matrices, local parts of processors

Local part of processor (0,0), 5 rows, 5 columns:

    a11 a12 a15 a16 a19
    a21 a22 a25 a26 a29
    a31 a32 a35 a36 a39
    a71 a72 a75 a76 a79
    a81 a82 a85 a86 a89

Local part of processor (1,1), 3 rows, 4 columns:

    a43 a44 a47 a48
    a53 a54 a57 a58
    a63 a64 a67 a68

Usage of Elemental

- Two-dimensional distribution of data, elementwise
- The user provides local data column-first in a distributed manner, similar to ScaLAPACK, but with mb = nb = 1
- Built-in datatypes:
  - const El::Grid mygrid(comm); builds the processor grid from the communicator comm
  - El::DistMatrix X(mygrid); matrix X distributed over the processor grid mygrid
- Built-in function El::Length() to get the local number of rows and columns of a matrix
- X.Attach to fill local data into the global DistMatrix X(mygrid)
- Generic functions, e.g. Gemm( NORMAL, NORMAL, 1., X, Y, 0., Z ); all information about the data distribution is contained in the DistMatrix

Usage of Elemental: Features of the grid

- A default grid is created from comm, as square as possible (see the usage sketch below)
- grid.Height() gives the number of processor rows in the constructed grid
- grid.Width() gives the number of processor columns in the constructed grid
- grid.Row() gives the processor row coordinate of the current processor
- grid.Col() gives the processor column coordinate of the current processor
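
A minimal Elemental sketch combining the grid and DistMatrix features above (assuming a recent Elemental release with the El::Environment, El::Grid, and El::DistMatrix API; the 8 by 9 matrix size matches the example on the following slides, the entries are random):

    #include <El.hpp>

    int main(int argc, char *argv[]) {
        El::Environment env(argc, argv);              // initializes MPI and Elemental
        El::mpi::Comm comm = El::mpi::COMM_WORLD;

        const El::Grid grid(comm);                    // default grid, as square as possible
        if (El::mpi::Rank(comm) == 0)
            El::Output("grid: ", grid.Height(), " x ", grid.Width());

        const El::Int m = 8, n = 9;
        El::DistMatrix<double> X(grid), Y(grid), Z(grid);
        El::Uniform(X, m, n);                         // resize to m x n and fill with random entries
        El::Uniform(Y, n, m);
        El::Zeros(Z, m, m);

        // Z := 1.0 * X * Y + 0.0 * Z; all distribution details travel with the DistMatrix objects
        El::Gemm(El::NORMAL, El::NORMAL, 1.0, X, Y, 0.0, Z);
        return 0;
    }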

Usage of Elemental: Example as with ScaLAPACK

grid.Height() = MP = 2, grid.Width() = NP = 2, matrix X with m = 8, n = 9 (elementwise, i.e. mb = nb = 1, cyclic distribution):

    X00 X01 X02 X03 X04 X05 X06 X07 X08
    X10 X11 X12 X13 X14 X15 X16 X17 X18
    X20 X21 X22 X23 X24 X25 X26 X27 X28
    X30 X31 X32 X33 X34 X35 X36 X37 X38
    X40 X41 X42 X43 X44 X45 X46 X47 X48
    X50 X51 X52 X53 X54 X55 X56 X57 X58
    X60 X61 X62 X63 X64 X65 X66 X67 X68
    X70 X71 X72 X73 X74 X75 X76 X77 X78

- The blue matrix elements (even row indices, even column indices) are assigned to the processor in row 0 and column 0
- The red matrix elements (even row indices, odd column indices) are assigned to the processor in row 0 and column 1
- The magenta matrix elements (odd row indices, even column indices) are assigned to the processor in row 1 and column 0
- The green matrix elements (odd row indices, odd column indices) are assigned to the processor in row 1 and column 1

Usage of Elemental: Distribution of matrices, local parts of processors

Local part of processor (0,0), 4 rows, 5 columns:

    X00 X02 X04 X06 X08
    X20 X22 X24 X26 X28
    X40 X42 X44 X46 X48
    X60 X62 X64 X66 X68

Local part of processor (1,1), 4 rows, 4 columns:

    X11 X13 X15 X17
    X31 X33 X35 X37
    X51 X53 X55 X57
    X71 X73 X75 X77

Usage of FFTW

- MPI parallelization for the 2d FFT
- fftw_mpi_local_size_2d returns the local part of the data in the first dimension and the local start of the data in the first dimension
- The user has to fill the local data block with function data
- Computations can be done in-place or with output different from the input (see the sketch below)
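
A minimal sketch of a 2d MPI-parallel FFTW transform (in-place, complex-to-complex; the global size 256 by 256 and the fill pattern are arbitrary; link with -lfftw3_mpi -lfftw3):

    #include <fftw3-mpi.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        fftw_mpi_init();

        const ptrdiff_t n0 = 256, n1 = 256;     // global 2d transform size
        ptrdiff_t local_n0, local_0_start;

        // Rows of the global array owned by this rank, and where they start
        ptrdiff_t alloc_local = fftw_mpi_local_size_2d(n0, n1, MPI_COMM_WORLD,
                                                       &local_n0, &local_0_start);
        fftw_complex *data = fftw_alloc_complex(alloc_local);

        fftw_plan plan = fftw_mpi_plan_dft_2d(n0, n1, data, data, MPI_COMM_WORLD,
                                              FFTW_FORWARD, FFTW_ESTIMATE);

        // Fill the local block: global rows local_0_start .. local_0_start + local_n0 - 1
        for (ptrdiff_t i = 0; i < local_n0; ++i)
            for (ptrdiff_t j = 0; j < n1; ++j) {
                data[i * n1 + j][0] = (double)(local_0_start + i);  // real part
                data[i * n1 + j][1] = 0.0;                          // imaginary part
            }

        fftw_execute(plan);                     // distributed in-place 2d DFT

        fftw_destroy_plan(plan);
        fftw_free(data);
        MPI_Finalize();
        return 0;
    }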
