High Performance Computing: Concepts, Methods & Means
HPC Libraries

Hartmut Kaiser, PhD
Center for Computation & Technology
Louisiana State University
April 19th, 2007

Outline

• Introduction to High Performance Libraries • Linear Algebra Libraries (BLAS, LAPACK) • PDE Solvers (PETSc) • Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) • Special purpose libraries (FFTW) • General purpose libraries (C++: Boost) • Summary – Materials for test


3 Puzzle of the Day

#include <stdio.h>

int main()
{
    int a = 10;
    switch (a) {
    case '1':
        printf("ONE\n");
        break;
    case '2':
        printf("TWO\n");
        break;
    defa1ut:
        printf("NONE\n");
    }
    return 0;
}

If you expect the output of the above program to be NONE, I would request you to check it out!

4 Application domains

• Linear algebra – BLAS, ATLAS, LAPACK, ScaLAPACK, Slatec, pim • Ordinary and partial Differential Equations – PETSc • Mesh manipulation and Load Balancing – METIS, ParMETIS, CHACO, JOSTLE, PARTY • Graph manipulation – Boost.Graph library • Vector/Signal/Image processing – VSIPL, PSSL. • General parallelization – MPI, pthreads • Other domain specific libraries – NAMD, NWChem, Fluent, Gaussian, LS-DYNA

5 Application Domain Overview

• Linear Algebra Libraries
  – Provide optimized methods for constructing sets of linear equations, performing operations on them (matrix-matrix products, matrix-vector products) and solving them (factoring, forward & backward substitution).
  – Commonly used libraries include BLAS, ATLAS, LAPACK, ScaLAPACK, PLAPACK
• PDE Solvers
  – General-purpose, parallel numerical PDE libraries
  – Usual toolsets include manipulation of sparse data structures, iterative linear system solvers, preconditioners, nonlinear solvers and time-stepping methods.
  – Commonly used libraries for solving PDEs include SAMRAI, PETSc, PARASOL, Overture, among others.

6 Application Domain Overview

• Mesh manipulation and Load Balancing – These libraries help in partitioning meshes in roughly equal sizes across processors, thereby balancing the workload while minimizing size of separators and communication costs. – Commonly used libraries for this purpose include METIS, ParMetis, Chaco, JOSTLE among others. • Other packages: – FFTW: features highly optimized Fourier transform package including both real and complex multidimensional transforms in sequential, multithreaded, and parallel versions. – NAMD: molecular dynamics library available for Unix/Linux, Windows, OS X – Fluent: computational fluid dynamics package, used for such applications as environment control systems, propulsion, reactor modeling etc.

7 Outline

• Introduction to High Performance Libraries • Linear Algebra Libraries (BLAS , LAPACK) • PDE Solvers (PETSc) • Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) • Special purpose libraries (FFTW) • General purpose libraries (C++: Boost) • Summary – Materials for test

8 BLAS

• (Updated set of) Basic Linear Algebra Subprograms

• The BLAS functionality is divided into three levels:
  – Level 1: contains vector operations of the form $y \leftarrow \alpha x + y$, as well as scalar dot products and vector norms
  – Level 2: contains matrix-vector operations of the form $y \leftarrow \alpha A x + \beta y$, as well as solving $Tx = y$ for $x$ with $T$ being triangular
  – Level 3: contains matrix-matrix operations of the form $C \leftarrow \alpha A B + \beta C$, as well as solving $B \leftarrow \alpha T^{-1} B$ for triangular matrices $T$. This level contains the widely used General Matrix Multiply (GEMM) operation.

9 BLAS

• Several implementations for different languages exist
  – Reference implementation (F77 and C) http://www.netlib.org/blas/
  – ATLAS, highly optimized for particular processor architectures
  – A generic C++ template class library providing BLAS functionality: uBLAS http://www.boost.org
  – Several vendors provide libraries optimized for their architecture (AMD, HP, IBM, Intel, NEC, NVIDIA, Sun)

10 BLAS: F77 naming conventions

11 BLAS: C naming conventions

• F77 routine name is changed to lowercase and prefixed with cblas_
• All routines which accept two dimensional arrays have a new additional first parameter specifying the matrix memory layout (row major or column major)
• Character parameters are replaced by corresponding enum values
• Input arguments are declared const
• Non-complex scalar input parameters are passed by value
• Complex scalar input arguments are passed using a void*
• Arrays are passed by address
• Output scalar arguments are passed by address
• Complex functions become subroutines which return the result via an additional last parameter ( void* ), appending _sub to the name

12 BLAS Level 1 routines

• Vector operations (xROT, xSWAP, xCOPY etc.)
• Scalar dot products (xDOT etc.)
• Vector norms (xNRM2, xASUM) and index of the largest absolute value (IxAMAX)

13 BLAS Level 2 routines

• Matrix-vector operations (xGEMV, xGBMV, xHEMV, xHBMV etc.)
• Solving Tx = y for x, where T is triangular (xTRSV, xTBSV, xTPSV)
• Rank-1 and rank-2 updates (xGER, xHER etc.)

14 BLAS Level 3 routines

• Matrix-matrix operations (xGEMM etc.)
• Solving for triangular matrices (xTRSM)
• Widely used matrix-matrix multiply (xSYMM, xGEMM)

15 Demo 1

• Shows solving a matrix multiplication problem using BLAS expressed in FORTRAN, C, and C++ • Shows genericity of uBLAS, by comparing generic and banded matrix versions • Shows newmat, a C++ matrix library which uses operator overloading

16 Outline

• Introduction to High Performance Libraries • Linear Algebra Libraries (BLAS, LAPACK ) • PDE Solvers (PETSc) • Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) • Special purpose libraries (FFTW) • General purpose libraries (C++: Boost) • Summary – Materials for test

17 LAPACK

• Linear Algebra PACKage
  – http://www.netlib.org/lapack/
  – Written in F77
  – Provides routines for
    • Solving systems of simultaneous linear equations,
    • Least-squares solutions of linear systems of equations,
    • Eigenvalue problems,
    • Householder transformation to implement QR decomposition on a matrix and
    • Singular value problems
  – Was initially designed to run efficiently on shared memory vector machines
  – Depends on BLAS
  – Has been extended for distributed-memory systems (ScaLAPACK and PLAPACK)

18 LAPACK (Architecture)

19 LAPACK naming conventions

20 Demo 2

• Shows how using a library might speed up the computation considerably

21 Outline

• Introduction to High Performance Libraries • Linear Algebra Libraries (BLAS, LAPACK) • PDE Solvers (PETSc) • Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) • Special purpose libraries (FFTW) • General purpose libraries (C++: Boost) • Summary – Materials for test

22 PETSc (pronounced PET-see)

• Portable, Extensible Toolkit for Scientific Computation (http://www-unix.mcs.anl.gov/petsc/petsc-as/ ) – Suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations (PDEs) – Employs the MPI standard for all message-passing communication – Intended for use in large-scale application projects – Includes a large suite of parallel linear and nonlinear equation solvers – Easily used in application codes written in C, C++, Fortran and Python • Good introduction: http://www-unix.mcs.anl.gov/petsc/petsc-as/documentation/tutorials/nersc02/nersc02.ppt

23 PETSc (general features)

• Features include: – Parallel vectors • Scatters (handles communicating ghost point information) • Gathers – Parallel matrices • Several sparse storage formats • Easy, efficient assembly. – Scalable parallel preconditioners – Krylov subspace methods – Parallel Newton-based nonlinear solvers – Parallel time stepping (ODE) solvers

24 PETSc (Architecture)

PETSc: Module architecture and layers of abstraction

25 PETSc: Component details

• Vector operations (Vec) : Provides the vector operations required for setting up and solving large-scale linear and nonlinear problems. Includes easy-to-use parallel scatter and gather operations, as well as special-purpose code for handling ghost points for regular data structures. • Matrix operations (Mat) : A large suite of data structures and code for the manipulation of parallel sparse matrices. Includes four different parallel matrix data structures, each appropriate for a different class of problems. • Preconditioners (PC) : A collection of sequential and parallel preconditioners, including – (sequential) ILU(k) (incomplete factorization), – LU (lower/upper decomposition), – both sequential and parallel block Jacobi, overlapping additive Schwarz methods • Time stepping ODE solvers (TS) : Code for the time evolution of solutions of PDEs. In addition, provides pseudo-transient continuation techniques for computing steady-state solutions.

26 PETSc: Component details

• Krylov subspace solvers (KSP) : Parallel implementations of many popular Krylov subspace iterative methods, including
  – GMRES (Generalized Minimal Residual method),
  – CG (Conjugate Gradient),
  – CGS (Conjugate Gradient Squared),
  – Bi-CG-Stab (BiConjugate Gradient Stabilized),
  – two variants of TFQMR (transpose free QMR),
  – CR (Conjugate Residuals),
  – LSQR (sparse least-squares solver).
  All are coded so that they are immediately usable with any preconditioners and any matrix data structures, including matrix-free methods.
• Non-linear solvers (SNES) : Data-structure-neutral implementations of Newton-like methods for nonlinear systems. Includes both line search and trust region techniques with a single interface. Employs by default the above data structures and linear solvers. Users can set custom monitoring routines, convergence criteria, etc.

27 Outline

• Introduction to High Performance Libraries • Linear Algebra Libraries (BLAS, LAPACK) • PDE Solvers (PETSc) • Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) • Special purpose libraries (FFTW) • General purpose libraries (C++: Boost) • Summary – Materials for test

28 Mesh libraries

• Introduction – Structured/unstructured meshes – Examples • Mesh decomposition

29 Introduction to Meshes and Grids

• Mesh/Grid: 2D or 3D representation of the computational domain.
• Common 2D meshes are composed of triangular or quadrilateral elements
• Common 3D meshes are composed of hexahedral, tetrahedral or pyramidal elements

[Figure: 2D mesh elements (quadrilateral, triangle); 3D mesh elements (hexahedron, prism, tetrahedron)]

30 Structured/Unstructured Meshes

Structured Grids (Meshes)
• Cartesian grids, logically rectangular grids
• Mesh info accessed implicitly using grid point indices
  – Efficient in both computation and storage
• Typically use finite difference discretization

Unstructured Meshes
• Mesh connectivity information must be stored
  – Incurs additional memory and computational cost
• Handles complex geometries and grid adaptivity
• Typically use finite volume or finite element discretization
• Mesh quality becomes a concern

31 Mesh examples

32 Meshes are used for Computation

33 Mesh Decomposition

• Goal is to maximize interior while minimizing connections between subdomains. That is, minimize communication . • Such decomposition problems have been studied in load balancing for parallel computation. • Lots of choices: • METIS , ParMETIS -- University of Minnesota. • PARTI -- University of Maryland, • CHACO -- Sandia National Laboratories, • JOSTLE -- University of Greenwich, • PARTY -- University of Paderborn, • SCOTCH -- Université Bordeaux, • TOP/DOMDEC -- NAS at NASA Ames Research Center.

http://www.hlrs.de 34 Mesh Decomposition

• Load balancing – Distribute elements evenly across processors. – Each processor should have equal share of work. • Communication costs should be minimized. – Minimize sub-domain boundary elements. – Minimize number of neighboring domains. • Distribution should reflect machine architecture. – Communication versus calculation. – Bandwidth versus latency. • Note that optimizing load balance and communication cost simultaneously is an NP-hard problem.

http://www.epcc.ed.ac.uk/epcc-tec/documents/meshdecomp-slides/MeshDecomp-13.html

35 Mesh decomposition

36 http://www.hlrs.de 36 Static and Dynamic Meshes

Static Grids (Meshes)
• Decomposition need only be carried out once
• Static decomposition may therefore be carried out as a preprocessing step, often done in serial

Dynamic Meshes
• Decomposition must be adapted as underlying mesh or processor load changes.
• Dynamic decomposition therefore becomes part of the calculation itself and cannot be carried out solely as a pre-processing step.

http://www.epcc.ed.ac.uk/epcc-tec/documents/meshdecomp-slides/MeshDecomp-14.html

37 HP J6700 1 CPU Solve Time: 13:26 Baseline Time

src : Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt 38 Linux Cluster 2 CPU’s Solve Time: 5:20 Speed-Up: 2.5X

src : Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt 39 Linux Cluster 4 CPU’s Solve Time: 3:07 Speed-Up: 4.3X

src : Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt 40 Linux Cluster 8 CPU’s Solve Time: 1:51 Speed-Up: 7.3X

src : Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt 41 Linux Cluster 16 CPU’s Solve Time: 1:03 Speed-Up: 12.8X

src : Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt 42 Speedup due to decomposition

# CPUs    Run time (s)
1         806
2         320
4         187
8         111
16        63

43 Jostle and Metis

44 http://www.hlrs.de 44 Jostle

45 http://www.hlrs.de 45 Jostle

46 http://www.hlrs.de 46 Jostle

47 http://www.hlrs.de 47 Metis

48 http://www.hlrs.de 48 ParMetis

49 http://www.hlrs.de 49 Metis (serial)

50 http://www.hlrs.de 50 Comparison

51 http://www.hlrs.de 51 Outline

• Introduction to High Performance Libraries • Linear Algebra Libraries (BLAS, LAPACK) • PDE Solvers (PETSc) • Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) • Special purpose libraries (FFTW) • General purpose libraries (C++: Boost) • Summary – Materials for test

52 FFTW

• Fastest Fourier Transform in the West

• Portable C subroutine library for computing the discrete Fourier transform (DFT) and related transforms such as the discrete cosine/sine transforms (DCT/DST)
• Computes arbitrary size discrete Fourier and Hartley transforms on real or complex data, in one or more dimensions
• Optimized for speed through application of special-purpose compiler genfft (codelet generator), written in OCaml; performance comparable even with vendor optimized libraries
• Free software, distributed under GPL; also available under a commercial license from MIT
• Developed at MIT by Matteo Frigo and Steven G. Johnson
• Won J. H. Wilkinson Prize for Numerical Software in 1999
• Most recent stable version is 3.1.2 ( http://www.fftw.org )

53 Main FFTW Features

• C and FORTRAN interfaces, C++ wrappers available • Speed, including support for SSE, SSE2, 3dNow! and Altivec • Arbitrary size transforms with complexity of O(n·log(n)) (sizes which can be factored to 2, 3, 5 and 7 are most efficient by default, but a custom code can be also generated for other sizes if required) • Even/odd data (DCT/DST), types I-IV • Can produce pure real output, or process pure real input data • Efficient handling of multiple, strided transforms (e.g. transformation of multiple arrays at once; one dimension of multi-dimensional array; one field of multi-component array) • Parallel code supporting Cilk, SMP platforms with threads, or MPI • Ability to save and restore plans optimized for a given platform (through wisdom mechanism) • Portable to any platform with a working C compiler

54 FFTW Sample Code

Computing 1-D complex DFT

#include <fftw3.h>
...
{
    fftw_complex *in, *out;
    fftw_plan p;
    ...
    in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    /* populate in[] with input data */
    ...
    p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    ...
    fftw_execute(p); /* repeat as needed */
    /* transform now available in out[] */
    ...
    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
}

Source: http://www.fftw.org/fftw3.pdf 55 Outline

• Introduction to High Performance Libraries • Linear Algebra Libraries (BLAS, LAPACK) • PDE Solvers (PETSc) • Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) • Special purpose libraries (FFTW) • General purpose libraries (C++: Boost) • Summary – Materials for test

56 The Boost Libraries

• What’s Boost – What’s important – Other stuff

57 What is Boost?

• Data Structures, Containers, Iterators, and Algorithms • String and Text Processing • Function Objects and Higher-Order Programming • Generic Programming and Template Metaprogramming • Math and Numerics • Input/Output • Miscellaneous

• Mostly header only

58 What’s important

• OS abstraction
  – Thread : OS independent kernel level thread interface
  – Asio : asynchronous input output
  – Filesystem : file system operations such as file copy, delete, directory create, file path handling
  – System : OS error code abstraction and handling
  – Program options : handling of command line arguments and parameters
  – Streams : build your own C++ streams
  – DateTime : Handling of dates, times and time periods
  – Timer : simple timer object

59 What’s important

• Data types, Container types, all extending STL
  – Pointer containers : allow for pointers in STL containers: vector<T*> → ptr_vector<T>
  – Multi index : data structures with multiple indices
  – Constant sized arrays : array, acts like vector or plain 'C' array
  – Any : can hold values of any type (if you need polymorphism)
  – Variant : can hold values of any of the types specified at compile time ('C' equivalent is discriminated union)
  – Optional : can hold a value or nothing
  – Tuple : like a vector or array, but every element may have a different type (similar to plain struct)
  – Graph library : very sophisticated collection of graph related data structures and algorithms
    • Parallel version exists (using MPI)

60 What’s important

• Helper classes – Smart pointers : working with pointers without having to worry about memory management – Memory pools : specialized memory allocation for containers – Iterator library : write your own iterator classes with ease (non trivial otherwise)

61 Other stuff in Boost

• String and Text processing
  – Regex, parsing, format, conversion etc.
• Algorithms
  – String algos, FOR_EACH, minmax etc.
• Math and numerics
  – Conversion, interval, random, octonion, quaternion, special functions, rational, uBLAS
• Functional and higher order programming
  – Bind, lambda, function, ref, signals etc.
• Generic and template metaprogramming
  – Proto, mpl, fusion, phoenix, enable_if etc.
• Testing
  – Unit tests, concept checks, static_assert

62 Conclusion

• Look at Boost first if you need something not available in the Standard Library
• Even if it's not in Boost, look around: there are a lot of libraries in preparation for Boost (Boost Sandbox, File Vault)

63 Links

• Boost, current release V1.33.1 – Web: http://www.boost.org – CVS: http://sourceforge.net/projects/boost • Boost Sandbox – CVS: http://sourceforge.net/projects/boost-sandbox – File Vault: http://boost-consulting.com/vault/ • Boost mailing lists – http://www.boost.org/more/mailing_lists.htm

64 Outlook

Elliptic PDE discretized by Finite Volume

Functional specification with a Domain Specific Embedded Language (DSEL)

equation = sum [ sumf(0.0, _e) [ pot * orient(_e, _1) ] * A / d * eps ] - V * rho

References: [1]

65 References

1. Rene Heinzl, Modern Application Design using Modern Programming Paradigms and a Library-Centric Software Approach, OOPSLA 2006, Workshop on Library Centric Software Design, Portland, Oregon, October 2006.

66 Outline

• Introduction to High Performance Libraries • Linear Algebra Libraries (BLAS, LAPACK) • PDE Solvers (PETSc) • Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) • Special purpose libraries (FFTW) • General purpose libraries (C++: Boost) • Summary – Materials for test

67 Summary – Material for the Test

• High performance libraries: 5, 6, 7
• Linear algebra libraries: BLAS: 9, 11, 12
• Linear algebra libraries: LAPACK: 18
• PDE Solvers: 23, 24, 26, 27
• Mesh decomposition & load balancing: 30, 31, 34, 35, 37, 44, 45, 46, 48, 49
• FFTW: 53, 54
• Boost: 58, 59, 60, 61, 62