High Performance Computing: Concepts, Methods & Means HPC Libraries
Hartmut Kaiser, PhD
Center for Computation & Technology, Louisiana State University
April 19th, 2007
Outline
• Introduction to High Performance Libraries
• Linear Algebra Libraries (BLAS, LAPACK)
• PDE Solvers (PETSc)
• Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE)
• Special purpose libraries (FFTW)
• General purpose libraries (C++: Boost)
• Summary – Materials for test
3 Puzzle of the Day
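#include <stdio.h>

int main(void)
{
    int a = 10;                /* assumed: declaration lost in extraction */
    switch (a) {               /* assumed: switch header lost in extraction */
        /* earlier case labels lost in extraction */
        case '2': printf("TWO\n"); break;
        defa1ut : printf("NONE\n");   /* spelled exactly as on the slide */
    }
    return 0;
}

If you expect the output of the above program to be NONE, I would request you to check it out!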
4 Application domains
• Linear algebra
  – BLAS, ATLAS, LAPACK, ScaLAPACK, Slatec, pim
• Ordinary and partial differential equations
  – PETSc
• Mesh manipulation and load balancing
  – METIS, ParMETIS, CHACO, JOSTLE, PARTY
• Graph manipulation
  – Boost.Graph library
• Vector/signal/image processing
  – VSIPL, PSSL
• General parallelization
  – MPI, pthreads
• Other domain specific libraries
  – NAMD, NWChem, Fluent, Gaussian, LS-DYNA
5 Application Domain Overview
• Linear Algebra Libraries
  – Provide optimized methods for constructing sets of linear equations, performing operations on them (matrix-matrix products, matrix-vector products) and solving them (factoring, forward & backward substitution).
  – Commonly used libraries include BLAS, ATLAS, LAPACK, ScaLAPACK, PLAPACK
• PDE Solvers
  – General-purpose, parallel numerical PDE libraries
  – Usual toolsets include manipulation of sparse data structures, iterative linear system solvers, preconditioners, nonlinear solvers and time-stepping methods.
  – Commonly used libraries for solving PDEs include SAMRAI, PETSc, PARASOL, Overture, among others.
6 Application Domain Overview
• Mesh manipulation and Load Balancing
  – These libraries help in partitioning meshes into roughly equal sizes across processors, thereby balancing the workload while minimizing the size of separators and communication costs.
  – Commonly used libraries for this purpose include METIS, ParMETIS, Chaco, JOSTLE, among others.
• Other packages
  – FFTW: highly optimized Fourier transform package including both real and complex multidimensional transforms in sequential, multithreaded, and parallel versions.
  – NAMD: molecular dynamics library available for Unix/Linux, Windows, OS X
  – Fluent: computational fluid dynamics package, used for such applications as environment control systems, propulsion, reactor modeling etc.
8 BLAS
• (Updated set of) Basic Linear Algebra Subprograms
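• The BLAS functionality is divided into three levels:
  – Level 1: contains vector operations of the form
      $\mathbf{y} \leftarrow \alpha \mathbf{x} + \mathbf{y}$
    as well as scalar dot products and vector norms
  – Level 2: contains matrix-vector operations of the form
      $\mathbf{y} \leftarrow \alpha A \mathbf{x} + \beta \mathbf{y}$
    as well as solving $T\mathbf{x} = \mathbf{y}$ for $\mathbf{x}$, with $T$ being triangular
  – Level 3: contains matrix-matrix operations of the form
      $C \leftarrow \alpha A B + \beta C$
    as well as solving $B \leftarrow \alpha T^{-1} B$ for triangular matrices $T$. This level contains the widely used General Matrix Multiply (GEMM) operation.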
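The three levels can be illustrated with plain reference loops. This is a minimal sketch, not the BLAS API: the function names `axpy`, `gemv` and `gemm` and the row-major storage are assumptions made for the example, and an optimized BLAS would block and vectorize these loops rather than run them as written.

```c
#include <assert.h>
#include <stddef.h>

/* Level 1 (vector-vector): y <- alpha*x + y. */
void axpy(size_t n, double alpha, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Level 2 (matrix-vector): y <- alpha*A*x + beta*y.
   A is m-by-n, stored row-major (an assumption of this sketch). */
void gemv(size_t m, size_t n, double alpha, const double *a,
          const double *x, double beta, double *y)
{
    for (size_t i = 0; i < m; ++i) {
        double sum = 0.0;
        for (size_t j = 0; j < n; ++j)
            sum += a[i * n + j] * x[j];
        y[i] = alpha * sum + beta * y[i];
    }
}

/* Level 3 (matrix-matrix): C <- alpha*A*B + beta*C.
   A is m-by-k, B is k-by-n, C is m-by-n, all row-major. */
void gemm(size_t m, size_t n, size_t k, double alpha, const double *a,
          const double *b, double beta, double *c)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p)
                sum += a[i * k + p] * b[p * n + j];
            c[i * n + j] = alpha * sum + beta * c[i * n + j];
        }
}
```

The work per data element grows with the level: Level 1 does O(n) operations on O(n) data, Level 3 does O(n^3) operations on O(n^2) data, which is why Level 3 routines benefit most from cache-aware tuning.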
9 BLAS
• Several implementations for different languages exist
  – Reference implementation (F77 and C): http://www.netlib.org/blas/
  – ATLAS, highly optimized for particular processor architectures
  – A generic C++ template class library providing BLAS functionality: uBLAS, http://www.boost.org
  – Several vendors provide libraries optimized for their architecture (AMD, HP, IBM, Intel, NEC, NVIDIA, Sun)
10 BLAS: F77 naming conventions
11 BLAS: C naming conventions
• F77 routine name is changed to lowercase and prefixed with cblas_
• All routines which accept two dimensional arrays have a new additional first parameter specifying the matrix memory layout (row major or column major)
• Character parameters are replaced by corresponding enum values
• Input arguments are declared const
• Non-complex scalar input parameters are passed by value
• Complex scalar input arguments are passed using a void*
• Arrays are passed by address
• Output scalar arguments are passed by address
• Complex functions become subroutines which return the result via an additional last parameter (void*), appending _sub to the name
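What the layout parameter changes is only how an element (i, j) maps to a memory offset. The sketch below shows that mapping; the enum `layout` and the helpers `elem_offset` and `elem_get` are hypothetical names chosen for illustration and are not part of the C BLAS interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout tag, modeled on the enum-style parameter described above. */
enum layout { ROW_MAJOR, COL_MAJOR };

/* Offset of element (i, j) in a matrix stored with leading dimension ld.
   Row-major: rows are contiguous, ld >= number of columns.
   Column-major (the Fortran convention): columns are contiguous,
   ld >= number of rows. */
size_t elem_offset(enum layout order, size_t ld, size_t i, size_t j)
{
    return order == ROW_MAJOR ? i * ld + j : j * ld + i;
}

/* Read element (i, j). The array is passed by address and declared const
   because it is an input argument. */
double elem_get(enum layout order, const double *a, size_t ld,
                size_t i, size_t j)
{
    assert(a != NULL);
    return a[elem_offset(order, ld, i, j)];
}
```

For the 2-by-3 matrix with rows (1, 2, 3) and (4, 5, 6), the row-major array is {1, 2, 3, 4, 5, 6} with ld = 3 and the column-major array is {1, 4, 2, 5, 3, 6} with ld = 2; both return 6 for element (1, 2).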
12 BLAS Level 1 routines
• Vector operations (xROT, xSWAP, xCOPY etc.)
• Scalar dot products (xDOT etc.)
• Vector norms (xNRM2, xASUM) and index of the largest-magnitude element (IxAMAX)
13 BLAS Level 2 routines
• Matrix-vector operations (xGEMV, xGBMV, xHEMV, xHBMV etc.)
• Rank-1 updates (xGER, xHER etc.)
• Solving Tx = y for x, where T is triangular (xTRSV, xTBSV, xTPSV)
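A triangular system needs no factorization: each unknown follows from the ones already computed. The sketch below shows forward substitution for a lower-triangular T. The name `lower_tri_solve` and the row-major storage are assumptions of this example; it is not the BLAS routine itself.

```c
#include <assert.h>
#include <stddef.h>

/* Solve T*x = y for x, where T is an n-by-n lower-triangular matrix stored
   row-major with a nonzero diagonal. x may be the same array as y. */
void lower_tri_solve(size_t n, const double *t, const double *y, double *x)
{
    for (size_t i = 0; i < n; ++i) {
        double sum = y[i];
        /* Subtract the contributions of the unknowns solved so far. */
        for (size_t j = 0; j < i; ++j)
            sum -= t[i * n + j] * x[j];
        assert(t[i * n + i] != 0.0);   /* singular T is not handled */
        x[i] = sum / t[i * n + i];
    }
}
```

With T = [[2, 0], [1, 4]] and y = (4, 10) the result is x = (2, 2). The cost is O(n^2), the same order as a matrix-vector product.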
14 BLAS Level 3 routines
• Matrix-matrix operations (xGEMM etc.)
• Solving for triangular matrices (xTRSM)
• Widely used matrix-matrix multiply (xSYMM, xGEMM)
15 Demo 1
• Shows solving a matrix multiplication problem using BLAS expressed in FORTRAN, C, and C++
• Shows genericity of uBLAS, by comparing generic and banded matrix versions
• Shows newmat, a C++ matrix library which uses operator overloading
17 LAPACK
• Linear Algebra PACKage
  – http://www.netlib.org/lapack/
  – Written in F77
  – Provides routines for
    • Solving systems of simultaneous linear equations,
    • Least-squares solutions of linear systems of equations,
    • Eigenvalue problems,
    • Householder transformations to implement QR decomposition of a matrix, and
    • Singular value problems
  – Was initially designed to run efficiently on shared memory vector machines
  – Depends on BLAS
  – Has been extended for distributed-memory systems (ScaLAPACK and PLAPACK)
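The first item in the list, solving a dense linear system, can be sketched as Gaussian elimination with partial pivoting followed by back substitution. This is an unblocked textbook version under assumed names (`dense_solve`, row-major storage); LAPACK's own solvers are organized around blocked factorizations built on BLAS calls, which this sketch does not attempt to reproduce.

```c
#include <assert.h>
#include <stddef.h>

static double abs_val(double v) { return v < 0.0 ? -v : v; }

/* Solve A*x = b in place by Gaussian elimination with partial pivoting.
   a: n-by-n, row-major, overwritten; b: overwritten with x.
   Returns 0 on success, -1 if a zero pivot is met (singular matrix). */
int dense_solve(size_t n, double *a, double *b)
{
    assert(a != NULL && b != NULL);
    for (size_t k = 0; k < n; ++k) {
        /* Pick the row with the largest entry in column k as pivot. */
        size_t piv = k;
        for (size_t i = k + 1; i < n; ++i)
            if (abs_val(a[i * n + k]) > abs_val(a[piv * n + k]))
                piv = i;
        if (a[piv * n + k] == 0.0)
            return -1;
        if (piv != k) {
            for (size_t j = 0; j < n; ++j) {
                double tmp = a[k * n + j];
                a[k * n + j] = a[piv * n + j];
                a[piv * n + j] = tmp;
            }
            double tmp = b[k]; b[k] = b[piv]; b[piv] = tmp;
        }
        /* Eliminate column k below the pivot. */
        for (size_t i = k + 1; i < n; ++i) {
            double f = a[i * n + k] / a[k * n + k];
            for (size_t j = k; j < n; ++j)
                a[i * n + j] -= f * a[k * n + j];
            b[i] -= f * b[k];
        }
    }
    /* Back substitution on the resulting upper-triangular system. */
    for (size_t i = n; i-- > 0; ) {
        double sum = b[i];
        for (size_t j = i + 1; j < n; ++j)
            sum -= a[i * n + j] * b[j];
        b[i] = sum / a[i * n + i];
    }
    return 0;
}
```

The pivot search is what lets the routine handle a zero in the leading position: for A = [[0, 2], [1, 1]] and b = (4, 3) the rows are swapped first and the solution is x = (1, 2).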
18 LAPACK (Architecture)
19 LAPACK naming conventions
20 Demo 2
• Shows how using a library might speed up the computation considerably
22 PETSc (pronounced PET-see)
• Portable, Extensible Toolkit for Scientific Computation (http://www-unix.mcs.anl.gov/petsc/petsc-as/)
  – Suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations (PDEs)
  – Employs the MPI standard for all message-passing communication
  – Intended for use in large-scale application projects
  – Includes a large suite of parallel linear and nonlinear equation solvers
  – Easily used in application codes written in C, C++, Fortran and Python
• Good introduction: http://www-unix.mcs.anl.gov/petsc/petsc-as/documentation/tutorials/nersc02/nersc02.ppt
23 PETSc (general features)
• Features include:
  – Parallel vectors
    • Scatters (handles communicating ghost point information)
    • Gathers
  – Parallel matrices
    • Several sparse storage formats
    • Easy, efficient assembly
  – Scalable parallel preconditioners
  – Krylov subspace methods
  – Parallel Newton-based nonlinear solvers
  – Parallel time stepping (ODE) solvers
24 PETSc (Architecture)
PETSc: Module architecture and layers of abstraction
25 PETSc: Component details
• Vector operations (Vec): Provides the vector operations required for setting up and solving large-scale linear and nonlinear problems. Includes easy-to-use parallel scatter and gather operations, as well as special-purpose code for handling ghost points for regular data structures.
• Matrix operations (Mat): A large suite of data structures and code for the manipulation of parallel sparse matrices. Includes four different parallel matrix data structures, each appropriate for a different class of problems.
• Preconditioners (PC): A collection of sequential and parallel preconditioners, including
  – (sequential) ILU(k) (incomplete factorization),
  – LU (lower/upper decomposition),
  – both sequential and parallel block Jacobi, overlapping additive Schwarz methods
• Time stepping ODE solvers (TS): Code for the time evolution of solutions of PDEs. In addition, provides pseudo-transient continuation techniques for computing steady-state solutions.
26 PETSc: Component details
• Krylov subspace solvers (KSP): Parallel implementations of many popular Krylov subspace iterative methods, including
  – GMRES (Generalized Minimal Residual method),
  – CG (Conjugate Gradient),
  – CGS (Conjugate Gradient Squared),
  – Bi-CG-Stab (BiConjugate Gradient Stabilized),
  – two variants of TFQMR (transpose free QMR),
  – CR (Conjugate Residuals),
  – LSQR (least squares method).
  All are coded so that they are immediately usable with any preconditioners and any matrix data structures, including matrix-free methods.
• Non-linear solvers (SNES): Data-structure-neutral implementations of Newton-like methods for nonlinear systems. Includes both line search and trust region techniques with a single interface. Employs by default the above data structures and linear solvers. Users can set custom monitoring routines, convergence criteria, etc.
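To make the idea of a Krylov subspace method concrete, here is a minimal serial sketch of the conjugate gradient iteration from the list above. It is not PETSc code: the name `cg_solve`, the dense row-major matrix, the fixed scratch size `CG_MAX_N` and the squared-residual stopping test are assumptions made to keep the example short, and no preconditioner is applied.

```c
#include <assert.h>
#include <stddef.h>

#define CG_MAX_N 64   /* fixed scratch size keeps the sketch free of malloc */

static double dot(size_t n, const double *u, const double *v)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) s += u[i] * v[i];
    return s;
}

/* Unpreconditioned conjugate gradient for A*x = b, with A symmetric positive
   definite, n-by-n, row-major. x holds the initial guess on entry and the
   approximate solution on exit. Stops when ||r||^2 <= tol2 or after
   max_iter iterations. Returns the number of iterations performed. */
int cg_solve(size_t n, const double *a, const double *b, double *x,
             double tol2, int max_iter)
{
    double r[CG_MAX_N], p[CG_MAX_N], ap[CG_MAX_N];
    assert(n <= CG_MAX_N);

    for (size_t i = 0; i < n; ++i) {          /* r = b - A*x, p = r */
        double s = 0.0;
        for (size_t j = 0; j < n; ++j) s += a[i * n + j] * x[j];
        r[i] = b[i] - s;
        p[i] = r[i];
    }
    double rr = dot(n, r, r);
    int it = 0;
    while (it < max_iter && rr > tol2) {
        for (size_t i = 0; i < n; ++i) {      /* ap = A*p */
            double s = 0.0;
            for (size_t j = 0; j < n; ++j) s += a[i * n + j] * p[j];
            ap[i] = s;
        }
        double alpha = rr / dot(n, p, ap);    /* step length along p */
        for (size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * ap[i];
        }
        double rr_new = dot(n, r, r);
        double beta = rr_new / rr;            /* weight of the old direction */
        for (size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
        ++it;
    }
    return it;
}
```

The only place the matrix appears is the product A*p, which is why such methods work with any matrix data structure, including matrix-free operators that supply only that product.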
28 Mesh libraries
• Introduction
  – Structured/unstructured meshes
  – Examples
• Mesh decomposition
29 Introduction to Meshes and Grids
• Mesh/Grid: 2D or 3D representation of the computational domain.
• Common 2D meshes are composed of triangular or quadrilateral elements
• Common 3D meshes are composed of hexahedral, tetrahedral or pyramidal elements
[Figure: 2D mesh elements (quadrilateral, triangle); 3D mesh elements (hexahedron, prism, tetrahedron)]
30 Structured/Unstructured Meshes
Structured Grids (Meshes)
• Cartesian grids, logically rectangular grids
• Mesh info accessed implicitly using grid point indices
  – Efficient in both computation and storage
• Typically use finite difference discretization
Unstructured Meshes
• Mesh connectivity information must be stored
  – Incurs additional memory and computational cost
• Handles complex geometries and grid adaptivity
• Typically use finite volume or finite element discretization
• Mesh quality becomes a concern
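The storage difference between the two mesh kinds can be sketched in a few lines. On a structured grid the neighbours of a point are computed from its indices; on an unstructured mesh each element must list its vertices. All names here (`grid_index`, `grid_east`, `struct triangle`, `shared_vertices`) are illustrative assumptions, not part of any mesh library.

```c
#include <assert.h>
#include <stddef.h>

/* Structured grid: the linear index of point (i, j) on a grid with nx points
   per row is computed; no connectivity is stored. */
size_t grid_index(size_t nx, size_t i, size_t j)
{
    return j * nx + i;
}

/* East neighbour of (i, j), or the point itself at the right boundary. */
size_t grid_east(size_t nx, size_t i, size_t j)
{
    assert(i < nx);
    return i + 1 < nx ? grid_index(nx, i + 1, j) : grid_index(nx, i, j);
}

/* Unstructured mesh: connectivity is stored explicitly. Here each triangle
   lists the indices of its three vertices. */
struct triangle { size_t v[3]; };

/* Count the vertices two triangles have in common (2 means a shared edge). */
int shared_vertices(const struct triangle *a, const struct triangle *b)
{
    int count = 0;
    for (int p = 0; p < 3; ++p)
        for (int q = 0; q < 3; ++q)
            if (a->v[p] == b->v[q])
                ++count;
    return count;
}
```

Finding a neighbour costs one addition on the structured grid, while on the unstructured mesh it requires searching or precomputing adjacency, which is the additional memory and computational cost noted above.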
31 Mesh examples
32 Meshes are used for Computation
33 Mesh Decomposition
• Goal is to maximize interior while minimizing connections between subdomains. That is, minimize communication.
• Such decomposition problems have been studied in load balancing for parallel computation.
• Lots of choices:
  – METIS, ParMETIS: University of Minnesota
  – PARTI: University of Maryland
  – CHACO: Sandia National Laboratories
  – JOSTLE: University of Greenwich
  – PARTY: University of Paderborn
  – SCOTCH: Université Bordeaux
  – TOP/DOMDEC: NAS at NASA Ames Research Center
http://www.hlrs.de 34 Mesh Decomposition
• Load balancing
  – Distribute elements evenly across processors.
  – Each processor should have equal share of work.
• Communication costs should be minimized.
  – Minimize sub-domain boundary elements.
  – Minimize number of neighboring domains.
• Distribution should reflect machine architecture.
  – Communication versus calculation.
  – Bandwidth versus latency.
• Note that optimizing load balance and communication cost simultaneously is an NP-hard problem.
http://www.epcc.ed.ac.uk/epcc-tec/documents/meshdecomp-slides/MeshDecomp-13.html
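The two competing goals, even load and little communication, are usually measured by the load imbalance and the edge cut of a partition. The sketch below computes both for a given assignment of vertices to parts. The function names and the limit `MAX_PARTS` are assumptions of this example; it evaluates a partition and does not compute one.

```c
#include <assert.h>
#include <stddef.h>

/* Number of graph edges whose endpoints lie in different parts, a common
   proxy for communication cost. edges holds 2*n_edges vertex indices. */
size_t edge_cut(size_t n_edges, const size_t *edges, const int *part)
{
    size_t cut = 0;
    for (size_t e = 0; e < n_edges; ++e)
        if (part[edges[2 * e]] != part[edges[2 * e + 1]])
            ++cut;
    return cut;
}

/* Load imbalance: largest part size divided by the ideal size n/n_parts.
   1.0 means perfectly balanced. Assumes n_parts <= MAX_PARTS. */
#define MAX_PARTS 64
double load_imbalance(size_t n, const int *part, int n_parts)
{
    size_t count[MAX_PARTS] = {0};
    size_t largest = 0;
    assert(n > 0 && n_parts > 0 && n_parts <= MAX_PARTS);
    for (size_t v = 0; v < n; ++v) {
        assert(part[v] >= 0 && part[v] < n_parts);
        ++count[part[v]];
    }
    for (int p = 0; p < n_parts; ++p)
        if (count[p] > largest)
            largest = count[p];
    return (double)largest * n_parts / (double)n;
}
```

On a 2-by-2 grid graph with edges (0,1), (2,3), (0,2), (1,3), the split {0, 0, 1, 1} and the split {0, 1, 1, 0} are both perfectly balanced, but the first cuts 2 edges and the second cuts all 4, so balance alone does not identify a good decomposition.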
35 Mesh decomposition
http://www.hlrs.de 36 Static and Dynamic Meshes
Static Grids (Meshes)
• Decomposition need only be carried out once
• Static decomposition may therefore be carried out as a preprocessing step, often done in serial
Dynamic Meshes
• Decomposition must be adapted as underlying mesh or processor load changes.
• Dynamic decomposition therefore becomes part of the calculation itself and cannot be carried out solely as a pre-processing step.
http://www.epcc.ed.ac.uk/epcc-tec/documents/meshdecomp-slides/MeshDecomp-14.html
37 HP J6700, 1 CPU
Solve time: 13:26 (baseline)
src: Amy Apon, http://www.csce.uark.edu/~aapon/courses/concurrent/notes/marc-ddm.ppt
38 Linux cluster, 2 CPUs
Solve time: 5:20, speed-up: 2.5X
39 Linux cluster, 4 CPUs
Solve time: 3:07, speed-up: 4.3X
40 Linux cluster, 8 CPUs
Solve time: 1:51, speed-up: 7.3X
41 Linux cluster, 16 CPUs
Solve time: 1:03, speed-up: 12.8X
42 Speedup due to decomposition
# CPUs   Run-time (s)
1        806
2        320
4        187
8        111
16       63
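The speed-up figures follow directly from the run-times in the table: speed-up is the baseline time divided by the parallel time, and efficiency is speed-up divided by the number of CPUs. A small sketch, with the function names chosen for this example:

```c
#include <assert.h>

/* Speed-up S(p) = T(1) / T(p). */
double speedup(double t_serial, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency E(p) = S(p) / p. */
double efficiency(double t_serial, double t_parallel, int n_cpus)
{
    assert(n_cpus > 0);
    return speedup(t_serial, t_parallel) / n_cpus;
}
```

With the table's values, 806 s against 63 s on 16 CPUs gives a speed-up of about 12.8 and an efficiency of about 0.80. For 2 and 4 CPUs the efficiency comes out above 1; the baseline was measured on a different machine (the HP J6700) than the Linux cluster, so these figures compare two systems rather than isolating the effect of decomposition.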
43 Jostle and Metis
http://www.hlrs.de 44 Jostle
45 Jostle
46 Jostle
47 Metis
48 ParMetis
49 Metis (serial)
50 Comparison
52 FFTW
• Fastest Fourier Transform in the West
• Portable C subroutine library for computing the discrete Fourier transform (DFT), including discrete cosine/sine transforms (DCT/DST)
• Computes arbitrary size discrete Fourier and Hartley transforms on real or complex data, in one or more dimensions
• Optimized for speed through a special-purpose compiler, genfft (a codelet generator), originally written in OCaml; performance is comparable even with vendor-optimized libraries
• Free software, distributed under the GPL; also available from MIT under a commercial license
• Developed at MIT by Matteo Frigo and Steven G. Johnson
• Won the J. H. Wilkinson Prize for Numerical Software in 1999
• Most recent stable version is 3.1.2 (http://www.fftw.org)
53 Main FFTW Features
• C and FORTRAN interfaces, C++ wrappers available
• Speed, including support for SSE, SSE2, 3DNow! and AltiVec
• Arbitrary size transforms with complexity of O(n·log(n)) (sizes whose only prime factors are 2, 3, 5 and 7 are most efficient by default, but custom code can also be generated for other sizes if required)
• Even/odd data (DCT/DST), types I-IV
• Can produce pure real output, or process pure real input data
• Efficient handling of multiple, strided transforms (e.g. transformation of multiple arrays at once; one dimension of a multi-dimensional array; one field of a multi-component array)
• Parallel code supporting Cilk, SMP platforms with threads, or MPI
• Ability to save and restore plans optimized for a given platform (through the wisdom mechanism)
• Portable to any platform with a working C compiler
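Because sizes built only from the factors 2, 3, 5 and 7 are the efficient ones, a common practice is to round a transform length up to the next such size when zero-padding is acceptable (for example in convolution, where padding does not change the result of interest). The helpers below are a sketch of that check; `is_efficient_size` and `next_efficient_size` are hypothetical names and not FFTW functions.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if n has no prime factors other than 2, 3, 5 and 7. */
int is_efficient_size(size_t n)
{
    static const size_t primes[] = {2, 3, 5, 7};
    if (n == 0)
        return 0;
    /* Divide out every allowed factor; only 1 may remain. */
    for (size_t k = 0; k < 4; ++k)
        while (n % primes[k] == 0)
            n /= primes[k];
    return n == 1;
}

/* Smallest efficient size >= n, e.g. as a zero-padded transform length. */
size_t next_efficient_size(size_t n)
{
    assert(n > 0);
    while (!is_efficient_size(n))
        ++n;
    return n;
}
```

For example, 1001 = 7 · 11 · 13 is not an efficient size, and the next one is 1008 = 2^4 · 3^2 · 7. Padding changes the length of the transform, so it is only an option when the application tolerates that.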
54 FFTW Sample Code
Computing 1-D complex DFT
#include <fftw3.h>
Source: http://www.fftw.org/fftw3.pdf
56 The Boost Libraries
• What’s Boost – What’s important – Other stuff
57 What is Boost?
• Data Structures, Containers, Iterators, and Algorithms
• String and Text Processing
• Function Objects and Higher-Order Programming
• Generic Programming and Template Metaprogramming
• Math and Numerics
• Input/Output
• Miscellaneous
• Mostly header only
58 What’s important
• OS abstraction
  – Thread: OS independent kernel level thread interface
  – Asio: asynchronous input/output
  – Filesystem: file system operations such as file copy, delete, directory create, file path handling
  – System: OS error code abstraction and handling
  – Program options: handling of command line arguments and parameters
  – Streams: build your own C++ streams
  – DateTime: handling of dates, times and time periods
  – Timer: simple timer object
59 What’s important
• Data types, Container types, all extending STL – Pointer containers : allow for pointers in STL containers: vector
60 What’s important
• Helper classes
  – Smart pointers: working with pointers without having to worry about memory management
  – Memory pools: specialized memory allocation for containers
  – Iterator library: write your own iterator classes with ease (non trivial otherwise)
61 Other stuff in Boost
• String and text processing
  – Regex, parsing, format, conversion etc.
• Algorithms
  – String algos, FOR_EACH, minmax etc.
• Math and numerics
  – Conversion, interval, random, octonion, quaternion, special functions, rational, uBLAS
• Functional and higher order programming
  – Bind, lambda, function, ref, signals etc.
• Generic and template metaprogramming
  – Proto, mpl, fusion, phoenix, enable_if etc.
• Testing
  – Unit tests, concept checks, static_assert
62 Conclusion
• Look at Boost first if you need something not available in the Standard Library
• Even if it's not in Boost, look around: there are a lot of libraries in preparation for Boost (Boost Sandbox, File Vault)
63 Links
• Boost, current release V1.33.1
  – Web: http://www.boost.org
  – CVS: http://sourceforge.net/projects/boost
• Boost Sandbox
  – CVS: http://sourceforge.net/projects/boost-sandbox
  – File Vault: http://boost-consulting.com/vault/
• Boost mailing lists
  – http://www.boost.org/more/mailing_lists.htm
64 Outlook
Elliptic PDE discretized by Finite Volume
Functional specification with a Domain Specific Embedded Language (DSEL)
equation = sum
] - V * rho
References: [1]
65 References
1. Rene Heinzl, Modern Application Design using Modern Programming Paradigms and a Library-Centric Software Approach, OOPSLA 2006, Workshop on Library Centric Software Design, Portland, Oregon, October 2006.
67 Summary – Material for the Test
• High performance libraries: 5, 6, 7
• Linear algebra libraries: BLAS: 9, 11, 12
• Linear algebra libraries: LAPACK: 18
• PDE Solvers: 23, 24, 26, 27
• Mesh decomposition & load balancing: 30, 31, 34, 35, 37, 44, 45, 46, 48, 49
• FFTW: 53, 54
• Boost: 58, 59, 60, 61, 62