Parallel Mathematical Libraries STIMULATE Training Workshop 2018
December 2018, I. Gutheil, JSC

Outline
  A Short History
  Sequential Libraries
  Parallel Libraries and Application Systems:
    Threaded Libraries
    MPI parallel Libraries
    Libraries for GPU usage
  Usage of ScaLAPACK, Elemental, and FFTW

A Short History
  Libraries for Dense Linear Algebra
  Starting in the early 1970s, written in FORTRAN
  LINPACK (LINear algebra PACKage) for digital computers, targeting the supercomputers of the 1970s and early 1980s
    Dense linear system solvers, factorization and solution, real and complex, single and double precision
  EISPACK (EIgenSystem PACKage), written around 1972–1973
    Eigensolvers for real and complex, symmetric and non-symmetric matrices
    Solvers for the generalized eigenvalue problem and the singular value decomposition
  LAPACK, initial release 1992, still in use
    Now Fortran 90, also C and C++ interfaces
    Tuned for single-core performance on modern supercomputers
    Threaded versions available from vendors, e.g. MKL

A Short History, continued
  BLAS (Basic Linear Algebra Subprograms for Fortran Usage)
  First step 1979: vector operations, now called BLAS 1
    Widely used kernels of linear algebra routines
    Idea: tuning in assembly on each machine, leading to portable performance
  Second step 1988: matrix-vector operations, now called BLAS 2
    Optimization on the matrix-vector level was necessary for vector computers
    Building blocks for LINPACK and EISPACK
  Third step 1990: matrix-matrix operations, now called BLAS 3
    Memory access became slower than the CPU, so optimization was now necessary at the matrix-matrix level
    Building blocks for LAPACK, the successor of LINPACK and EISPACK

Why you should use BLAS 3 if possible
  1 Standardized interface, on most computers also for C, readable code
  2 Optimized for data re-use, memory access is usually one order of magnitude slower than the CPU
  "Data re-use factor":
    $r := \frac{\text{floating point operations}}{\text{memory accesses}}$
  Example:
    AXPY: $r \approx \frac{2n}{2n} = 1$
    GEMV: $r \approx \frac{2n^2}{n^2} = 2$
    GEMM: $r \approx \frac{2n^3}{3n^2} = \frac{2}{3} n$
  Only GEMM gets close to peak performance

Performance of matrix-matrix multiplication
JULIA, gfortran, comparison with MKL

  Comp. option   O0         O2         O3
  ijk-loop       4.9249     0.98752    0.99167
  ikj-loop       5.5986     2.5144     2.5146
  jik-loop       5.3677     1.0795     1.0792
  jki-loop       4.5858     0.67161    0.43218
  kij-loop       5.5317     2.5506     2.5492
  kji-loop       4.6007     0.68648    0.46044
  MKL Fortran    0.045247   0.043458   0.043392

Performance of matrix-matrix multiplication
JULIA, ifort, comparison with MKL

  Comp. option   O0         O2         O3
  ijk-loop       11.124     1.0156     0.44883
  ikj-loop       12.165     0.22118    0.089086
  jik-loop       11.888     0.95409    0.38558
  jki-loop       10.474     0.22997    0.23005
  kij-loop       11.948     0.22124    0.089038
  kji-loop       10.528     0.22985    0.23035
  MKL Fortran    0.044419   0.042375   0.042731

Sequential Libraries
  Vendor specific Library
    MKL, Intel® Math Kernel Library
    Usage: see https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor
  Public domain Libraries
    LAPACK (Linear Algebra PACKage), part of MKL or libopenblas.so
    ARPACK (Arnoldi PACKage), iterative solver for sparse eigenvalue problems
    GSL (GNU Scientific Library, C library)
    GMP (GNU Multiple Precision Arithmetic Library)

Contents of Intel® MKL 11.*
  BLAS, Sparse BLAS, CBLAS
  LAPACK
  Iterative Sparse Solvers, Trust Region Solver
  Vector Math Library
  Vector Statistical Library
  Fourier Transform Functions
  Trigonometric Transform Functions
  GMP routines
  Poisson Library
  Interface for FFTW

Contents of GSL (not complete)
  CBLAS
  Linear Algebra, linear systems and eigenproblems
  FFT and other transformations
  Interpolation
  Integration and numerical differentiation
  Statistics
  Ordinary differential equations

Parallel Libraries
Threaded Parallelism
  MKL
    Is multi-threaded or at least thread-safe
    Usage as with sequential routines
    If OMP_NUM_THREADS is not set, the maximum possible number of threads is used
    ifort name.f -o name -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
  FFTW 3.3 (Fastest Fourier Transform in the West)
    Sequential, threaded, and OpenMP version
    Additional version in MKL
    Cray-intelmpi version on JULIA
    http://www.fftw.org

Parallel Libraries
MPI Parallelism, dense linear algebra
  ScaLAPACK (Scalable Linear Algebra PACKage), Fortran77
    Public domain version now contains BLACS
    http://netlib.org/scalapack
  ELPA (Eigenvalue SoLvers for Petaflop-Applications), Fortran2003
    https://elpa.mpcdf.mpg.de
  Elemental, C++ framework for parallel dense linear algebra
    http://libelemental.org/

MPI Parallelism, sparse linear algebra
  MUMPS (MUltifrontal Massively Parallel sparse direct Solver)
    http://mumps.enseeiht.fr/index.php?page=home
  PARPACK (Parallel ARPACK), now ARPACK-NG, eigensolver
    https://github.com/opencollab/arpack-ng
  hypre (high performance preconditioners)
    https://computation.llnl.gov/projects/hypre-scalable-linear-solvers-multigrid-methods/software

MPI Parallelism, tools and differential equations
  Tools
    FFTW (Fastest Fourier Transform in the West)
    ParMETIS (Parallel Graph Partitioning)
      http://glaros.dtc.umn.edu/gkhome/views/metis
    SPRNG (Scalable Parallel Random Number Generator)
      http://www.sprng.org/
  Ordinary differential equations
    SUNDIALS (SUite of Nonlinear and DIfferential/ALgebraic equation Solvers)
      https://computation.llnl.gov/projects/sundials/sundials-software

Parallel Systems, MPI Parallelism, PETSc
  Portable, Extensible Toolkit for Scientific Computation
  Numerical solution of partial differential equations
  Can make use of many other libraries
  Solver and preconditioner can be chosen with command line arguments
  Comes with lots of examples
  https://www.mcs.anl.gov/petsc/
  Very active mailing list, good support via mailing list

Contents of parallel libraries, dense linear algebra
Contents of ScaLAPACK and ELPA
  ScaLAPACK
    Parallel BLAS 1-3, PBLAS Version 2
    Dense linear system solvers
    Banded linear system solvers
    Solvers for the Linear Least Squares Problem
    Singular value decomposition
    Eigenvalues and eigenvectors of dense symmetric/Hermitian matrices
  ELPA
    Eigensolver only, uses ScaLAPACK

Contents of parallel libraries, dense linear algebra
Contents of Elemental (incomplete list)
  Dense and sparse-direct (generalized) Least Squares problems
  High-performance pseudospectral computation and visualization
  LU and Cholesky with full pivoting
  Column-pivoted QR and interpolative/skeleton decompositions
  Many algorithms for Singular-Value soft-Thresholding (SVT)
  Tall-skinny QR decompositions
  Hermitian matrix functions
  Prototype Spectral Divide and Conquer Schur decomposition and Hermitian EVD
  Sign-based Lyapunov/Riccati/Sylvester solvers
  Convex optimization

Contents of parallel libraries, sparse linear algebra
MUMPS and ParMETIS
  MUMPS: MUltifrontal Massively Parallel sparse direct Solver
    Solution of linear systems with symmetric positive definite matrices, general symmetric matrices, general unsymmetric matrices
    Real or complex
    Parallel factorization and solve phase, iterative refinement and backward error analysis
    F90 and MPI
    Graph partitioning used for symbolic factorization with reduced fill-in
  ParMETIS: Parallel Graph Partitioning and Fill-reducing Matrix Ordering
    Developed in the Karypis Lab at the University of Minnesota

Contents of parallel libraries, sparse linear algebra
PARPACK
  Reverse communication interface, the user has to supply the parallel matrix-vector multiplication
  Standard or generalized problems
  Single and double precision complex arithmetic versions for standard or generalized problems
  Routines for banded matrices, standard or generalized problems
  Routines for the singular value decomposition
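The reverse communication idea from the PARPACK slide can be illustrated with a small sketch. This is not PARPACK's calling sequence: the class `PowerIterationRC`, its `step` method, the `"matvec"`/`"done"` return values, and the helpers `dominant_eigenvalue` and `tridiag_matvec` are all hypothetical names chosen for illustration, and a plain power iteration stands in for the Arnoldi method. What it shows is the control flow: the solver never sees the matrix, it hands control back to the caller each time it needs a product y = A x. In a parallel code the caller would do that product on distributed data.

```python
import math


class PowerIterationRC:
    """Toy reverse-communication eigensolver (hypothetical, not the PARPACK API)."""

    def __init__(self, n, tol=1e-12, max_iter=500):
        self.x = [1.0 / math.sqrt(n)] * n  # normalized start vector
        self.y = [0.0] * n                 # the caller stores A*x here
        self.tol = tol
        self.max_iter = max_iter
        self.eigenvalue = 0.0
        self.iterations = 0
        self._waiting = False              # True once a product has been requested

    def step(self):
        """Return "matvec" if the caller must set self.y = A*self.x, else "done"."""
        if self._waiting:
            # Rayleigh quotient x^T A x (x has unit length)
            lam = sum(a * b for a, b in zip(self.x, self.y))
            norm = math.sqrt(sum(b * b for b in self.y))
            change = abs(lam - self.eigenvalue)
            self.eigenvalue = lam
            self.iterations += 1
            if (norm == 0.0
                    or change <= self.tol * max(1.0, abs(lam))
                    or self.iterations >= self.max_iter):
                return "done"
            self.x = [b / norm for b in self.y]  # next iterate
        self._waiting = True
        return "matvec"


def dominant_eigenvalue(matvec, n):
    """Caller-side loop: supply the matrix-vector product whenever asked."""
    solver = PowerIterationRC(n)
    while solver.step() == "matvec":
        solver.y = matvec(solver.x)
    return solver.eigenvalue


def tridiag_matvec(x):
    """y = A*x for the 3x3 matrix tridiag(-1, 2, -1), written without storing A."""
    return [2 * x[0] - x[1], -x[0] + 2 * x[1] - x[2], -x[1] + 2 * x[2]]


if __name__ == "__main__":
    print(dominant_eigenvalue(tridiag_matvec, 3))
```

The design point is that the matrix format stays entirely on the caller's side, so the same solver works for dense, sparse, or matrix-free operators.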
Contents of parallel libraries, parallel tools
  FFTW3, Version 3.3
    Discrete Fourier transform (DFT) in one or more dimensions
    Real and complex data
    Arbitrary input size
  SPRNG
    The Scalable Parallel Random Number Generators Library for ASCI Monte Carlo Computations
    Version ≥ 2.0: various random number generators in one library
    Version 1.0: separate library for each random number generator

Contents of parallel libraries, ordinary differential equations
SUNDIALS
  CVODE: initial value problems, ODEs
  CVODES: ODE systems and sensitivity analysis capabilities
  ARKODE: initial value ODE problems with additive Runge-Kutta methods
  IDA: initial value problems, differential-algebraic equation systems (DAE)
  IDAS: DAE systems and sensitivity analysis capabilities
  KINSOL: nonlinear algebraic systems

Libraries for GPU usage
  cuBLAS, cuSPARSE, cuSOLVER: linear algebra using CUDA from NVIDIA
  cuRAND, cuFFT: random numbers and FFT for CUDA
  CUDA Math Library: standard mathematical function library
  cuDNN: primitives for deep neural networks (deep learning)
  All of them and more: https://developer.nvidia.com/gpu-accelerated-libraries
  MAGMA, linear algebra library for GPUs, similar to LAPACK
    http://icl.utk.edu/magma/
  Other libraries come with CUDA kernels, for example ELPA

Usage of ScaLAPACK
Background
  Scalable version of a subset of LAPACK, redesigned for distributed memory MIMD parallel computers
  Calls are as similar to those of LAPACK as possible
  Based on PBLAS instead of BLAS
  BLACS (Basic Linear Algebra Communication Subroutines) for communication
  The user is responsible for the data distribution
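Because ScaLAPACK leaves the data distribution to the user, it helps to see the index arithmetic of the two-dimensional block-cyclic layout it expects for dense matrices. The sketch below assumes 0-based indices and that the first block lives on process 0. The function names `owner_and_local`, `local_count`, and `owner_2d` are hypothetical helpers, not library routines; `local_count` follows the same counting logic as ScaLAPACK's NUMROC tool routine under those assumptions.

```python
def owner_and_local(g, nb, nprocs):
    """Map a 0-based global index g to (owning process, 0-based local index)
    for a 1D block-cyclic distribution with block size nb."""
    block = g // nb                      # global block that contains g
    return block % nprocs, (block // nprocs) * nb + g % nb


def local_count(n, nb, iproc, nprocs):
    """Number of the n global indices that are stored on process iproc."""
    nblocks = n // nb                    # number of complete blocks
    count = (nblocks // nprocs) * nb     # full rounds over all processes
    extra = nblocks % nprocs             # complete blocks left after the rounds
    if iproc < extra:
        count += nb                      # one more complete block
    elif iproc == extra:
        count += n % nb                  # the trailing partial block, if any
    return count


def owner_2d(i, j, mb, nb, nprow, npcol):
    """Process grid coordinates and local indices of the global entry (i, j)."""
    prow, li = owner_and_local(i, mb, nprow)
    pcol, lj = owner_and_local(j, nb, npcol)
    return (prow, pcol), (li, lj)


if __name__ == "__main__":
    # Owner map of an 8x8 matrix with 2x2 blocks on a 2x2 process grid.
    n, nb, nprow, npcol = 8, 2, 2, 2
    for i in range(n):
        print(" ".join("%d,%d" % owner_2d(i, j, nb, nb, nprow, npcol)[0]
                       for j in range(n)))
```

Rows and columns are distributed independently, so the 2D case is two applications of the 1D mapping. Each process allocates its local array from `local_count` in each dimension.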