HPC libraries

Outline
Introduction to HPC libraries
Numerical libraries
– Linear algebra
– FFT
Other
– Domain-specific libraries
– I/O libraries
Frameworks

INTRODUCTION

Motivation for using libraries
No need to reinvent the wheel!
Libraries tend to have good performance
Libraries tend to have fewer bugs
Can be called from high-level languages
– C++, Python, Perl, Matlab, Mathematica
Makes the code more portable
– Even to GPUs!

Motivation: matrix multiply
At 4096x4096 the gap is dramatic: optimized DGEMM = 9.5 GFlop/s, Matmul = 0.75 GFlop/s, naive loops = 0.014 GFlop/s

Choosing libraries
Factors to keep in mind
– Performance
– Accuracy
– Licensing & pricing
– Supported languages & platforms & portability
– Ease of use
– Parallelization & scalability

NUMERICAL LIBRARIES

Dense linear algebra
The "Netlib" libraries have become a de facto standard for dense linear algebra
– Most notably: BLAS, LAPACK, BLACS, PBLAS, ScaLAPACK
– Developed at the University of Tennessee
– Reference (F77!) implementations available freely from http://www.netlib.org/
A number of tuned implementations are available (use these whenever possible!)
– ATLAS, GotoBLAS, ACML, MKL, LibSci, etc.

Netlib ecosystem
(Diagram: ScaLAPACK builds on PBLAS; PBLAS on LAPACK, BLAS and BLACS; BLACS on MPI. Global vs. local addressing; platform-independent vs. platform-specific layers.)

BLAS
Basic Linear Algebra Subroutines
Basic dense matrix operations
– Multiplication, addition, rank update, triangular solve
Divided into 3 levels

Level  Type             Example              # mem refs
1      Vector – Vector  DAXPY (y = ax + y)   3n
2      Matrix – Vector  DGEMV (y = Ax + y)   n**2
3      Matrix – Matrix  DGEMM (C = AB + C)   4n**2

Try to use the highest-level calls
– E.g. DGEMM instead of n DGER calls (a minimal DGEMM call is sketched after the LAPACK slide below)

LAPACK
Linear Algebra PACKage
Matrix solvers
– Simultaneous linear equations, least-squares solutions, eigenvalue problems
– Matrix factorizations (LU, Cholesky, QR, SVD, Schur)
Relies heavily on BLAS for computation
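As a hedged illustration of the level-3 BLAS call recommended above, the sketch below multiplies two matrices with DGEMM through the CBLAS C interface. The header cblas.h, the link flag, the matrix size and fill values are assumptions for the example, not part of the course material; any tuned BLAS (MKL, OpenBLAS, LibSci, ...) provides the same call.

    /* Minimal sketch: C = A*B with level-3 BLAS through the CBLAS interface.
     * Assumes cblas.h from an optimized BLAS (e.g. link with -lopenblas).
     * Matrix size and fill values are arbitrary examples. */
    #include <stdlib.h>
    #include <cblas.h>

    int main(void)
    {
        const int n = 1024;
        double *A = malloc((size_t)n * n * sizeof *A);
        double *B = malloc((size_t)n * n * sizeof *B);
        double *C = malloc((size_t)n * n * sizeof *C);
        for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        /* C := 1.0*A*B + 0.0*C, row-major storage, leading dimension n */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);

        free(A); free(B); free(C);
        return 0;
    }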
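LAPACK driver routines build on such BLAS calls. A minimal sketch of solving a dense linear system with DGESV via the LAPACKE C interface follows; lapacke.h is assumed to be available, and the 3x3 system is only an illustrative example.

    /* Minimal sketch: solve A x = b with LAPACK's DGESV (LU factorization
     * plus solve) through the LAPACKE C interface. */
    #include <stdio.h>
    #include <lapacke.h>

    int main(void)
    {
        double A[9] = { 4.0, 1.0, 0.0,
                        1.0, 4.0, 1.0,
                        0.0, 1.0, 4.0 };  /* row-major 3x3 matrix */
        double b[3] = { 1.0, 2.0, 3.0 };  /* right-hand side, overwritten with x */
        lapack_int ipiv[3];               /* pivot indices from the LU factorization */

        lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 3, 1, A, 3, ipiv, b, 1);
        if (info != 0) {
            fprintf(stderr, "dgesv failed, info = %d\n", (int)info);
            return 1;
        }
        printf("x = %g %g %g\n", b[0], b[1], b[2]);
        return 0;
    }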
ScaLAPACK
Scalable LAPACK
BLACS
– Message-passing interface for linear algebra
– 2D arrays on 2D process grids
– Generation of process grids
– Communication of matrices
Parallel LAPACK and BLAS routines (through PBLAS)
Extends the BLAS / LAPACK interface
– Addition of a P prefix to the routine names
– For example PDGEMM

Interesting implementations
PLASMA
– Manycore-optimized linear algebra routines
– Potential successor of LAPACK
– http://icl.cs.utk.edu/plasma/
MAGMA & CULA
– LAPACK-type routines for GPUs
– http://icl.cs.utk.edu/magma/
– http://www.culatools.com/

Sparse matrices
PDEs, network simulation, ...
"Enough zeroes that it pays to take advantage of them" (J. Wilkinson)
– Only store the non-zero elements
Density and sparsity pattern affect performance
– The optimal storage format and algorithm differ
– No single general-purpose solution
– Many different solvers available
  . Both direct and iterative methods

Sparse linear systems
Solving sparse linear systems is more difficult than solving dense systems
– Using direct solvers is often unfeasible because the decompositions of sparse systems are not sparse
– Iterative solvers rely extensively on sparse matrix–vector multiplication (SpMV)
– The memory layout of sparse matrix formats is difficult to optimize

Sparse system libs
Hypre (preconditioners)
KLU, Paraklete, AztecOO (parts of Trilinos)
– Direct and iterative solvers for sparse systems
SuperLU, UMFPACK, MUMPS
– Direct solvers for sparse systems
CUSP, cuSPARSE
– Solvers for use with NVIDIA GPUs
Many, many more exist...

FFT libraries
No standardized API
Implementations tend to have similar functionality
– Create a "plan"
  . A combination of codelets that compute the FFT
  . Depends on the architecture & FFT parameters
– Execute the FFT(s)
  . The plan can be reused if the parameters are identical
Transform sizes that are powers of a single small factor perform well

FFTW
Fastest Fourier Transform in the West
1D, 2D and 3D complex and real FFTs
Two versions commonly in use
– Version 2: MPI support
– Version 3: much better performance, MPI support in development
Planning has multiple levels of comprehensiveness
– ESTIMATE, MEASURE, PATIENT, EXHAUSTIVE
– Plans can be saved in a "wisdom file" and reused (a small sketch follows below)
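To make the plan/execute model concrete, here is a minimal FFTW 3 sketch. It assumes fftw3.h and a recent FFTW (the wisdom-file helpers appeared in version 3.3); the transform size and the wisdom file name are arbitrary examples.

    /* Minimal sketch of FFTW's plan/execute model (link with -lfftw3). */
    #include <fftw3.h>

    int main(void)
    {
        const int n = 1024;
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

        /* Reuse previously accumulated wisdom if the file exists (optional). */
        fftw_import_wisdom_from_filename("fftw.wisdom");

        /* Planning: FFTW_MEASURE times candidate algorithms; planning may
         * overwrite the input array, so initialize the data afterwards. */
        fftw_plan plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

        for (int i = 0; i < n; ++i) { in[i][0] = (double)i; in[i][1] = 0.0; }

        fftw_execute(plan);   /* the same plan can be executed many times */

        fftw_export_wisdom_to_filename("fftw.wisdom");
        fftw_destroy_plan(plan);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }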
Other FFT implementations
Cray CRAFFT
– Dynamically selects the best FFT routine
– Serial and parallel routines
– See the intro_crafft manual page
AMD ACML
Intel MKL and IPP
CuFFT
– FFT routines for NVIDIA GPUs

OTHER LIBRARIES

Domain-specific libraries
OpenMM
– Molecular dynamics
– Supports GPUs
Libint
– Two-body molecular integrals
Overture
– PDE solvers designed for CFD
Libgenometools
– Genome analysis functions

Decomposition
How to decompose a problem efficiently?
– Load balancing (static/dynamic), optimal communication, simplicity
Linear algebra
– BLACS
Mesh manipulation and load balancing
– (Par)METIS, Jostle, Party, Chaco, Zoltan, ...
Generic
– Global Arrays

Partitioning: METIS
Graph and mesh partitioning
Graph repartitioning
– Adaptively refined meshes
Partitioning refinement
– Improves the quality of existing partitions
Matrix reordering
– Fill reduction for sparse systems
ParMETIS for MPI-parallel meshes

Zoltan
Library of data management services for unstructured, dynamic and/or adaptive computations
– Provides different partitioning methods (geometric, graph, hypergraph)
– Part of the Trilinos framework, can also be used as a stand-alone library
– Callable from C/C++ and Fortran
Data-structure-neutral design

I/O libraries
How should HPC data be stored?
– Large, complex, heterogeneous, esoteric, metadata ...
– Parallel and random access
Traditional relational databases are a poor fit
– Cannot handle large objects
– Many features unnecessary for HPC data
Need a better solution
– NetCDF
– HDF5

I/O: HDF5
A data model, library, and file format for storing and managing multidimensional data
Can store complex data objects and metadata
The file format and files are portable
Possibility for parallel I/O on top of MPI-IO
Fortran, C, C++, and Java interfaces
The HDF5 data model and library are complex

I/O: NetCDF
Used extensively by geoscientists
Optimized for access to subsets of large datasets
– Direct access instead of sequential access
Version 4 is implemented on top of HDF5
– HDF5 file compatibility
Parallel NetCDF (pNetCDF)
– Supports MPI-IO

FRAMEWORKS

Frameworks
Frameworks provide a large collection of routines for solving numerical problems
– Not a single library, but many of them for different tasks
  . Datatypes for parallel matrices, linear and nonlinear solvers, parallel communication, I/O, etc.
– The different parts are usually well integrated, so the user does not have to worry about the different datatypes or matrix representations

Frameworks: pros and cons
Pros
– Many common operations are readily available
  . Can speed up development considerably
– Easy to test different methods provided by the framework
  . Preconditioners, solvers, etc.
Cons
– Many frameworks are non-trivial to compile or port
– Strong dependency on the selected framework

PETSc
Library for parallel sparse linear systems
– Targets the solution of PDEs
– Suite of data structures and routines
Defines a programming model (a minimal solver sketch is given at the end of this outline)
– Can be linked with external solvers
C and Fortran interfaces
Available on Cray XC
– Load the module cray-petsc to use it and see the intro_petsc manual page

PETSc architecture
(Diagram of the PETSc component layers; not reproduced here.)

Trilinos
http://trilinos.sandia.gov
Comprehensive collection of loosely connected packages developed by Sandia and different research groups
Designed from the bottom up for distributed memory
Includes over 50 modules for different tasks
– Linear solvers, mesh partitioning, utilities for FEM solvers, optimization, automatic differentiation, etc.
Most packages are self-contained (can be compiled without the others)
Very liberal licensing: LGPL and BSD
Available on Cray XC
– 50 packages included
– Load the module cray-trilinos to use it
– See the intro_trilinos manual page

Summary
Libraries should be used whenever possible
– Optimized libraries, if you have a choice
– Use common, standardized interfaces
Familiarize yourself with potential caveats
– Numerical accuracy!
Libraries often depend on other libraries
– BLAS in particular is fundamental
No single library, framework, or API covers everything.
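To close, here is the minimal solver sketch referenced in the PETSc slide: assemble a distributed sparse matrix and vectors, then solve Ax = b with a KSP solver. Exact call signatures vary slightly between PETSc versions (this follows the 3.x style), error checking is omitted, and the 1D Laplacian system is only an illustrative example.

    /* Rough sketch of the PETSc programming model: sparse matrix assembly
     * followed by a Krylov solve. Compile with the PETSc makefiles and run
     * under MPI; solver and preconditioner can be picked at run time with
     * options such as -ksp_type cg -pc_type jacobi. */
    #include <petscksp.h>

    int main(int argc, char **argv)
    {
        PetscInitialize(&argc, &argv, NULL, NULL);

        PetscInt n = 100;
        Mat A;
        Vec x, b;
        KSP ksp;

        /* Sparse matrix: PETSc chooses a parallel AIJ format and distribution. */
        MatCreate(PETSC_COMM_WORLD, &A);
        MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
        MatSetFromOptions(A);
        MatSetUp(A);

        PetscInt istart, iend;
        MatGetOwnershipRange(A, &istart, &iend);
        for (PetscInt i = istart; i < iend; ++i) {
            if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
            if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
            MatSetValue(A, i, i, 2.0, INSERT_VALUES);
        }
        MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
        MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

        MatCreateVecs(A, &x, &b);
        VecSet(b, 1.0);

        /* Krylov solver object; operators and options are attached, then solve. */
        KSPCreate(PETSC_COMM_WORLD, &ksp);
        KSPSetOperators(ksp, A, A);
        KSPSetFromOptions(ksp);
        KSPSolve(ksp, b, x);

        KSPDestroy(&ksp);
        VecDestroy(&x);
        VecDestroy(&b);
        MatDestroy(&A);
        PetscFinalize();
        return 0;
    }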