Mathematics 18.337, Computer Science 6.338, SMA 5505 Applied Parallel Computing Spring 2004

Lecturer: Alan Edelman1 MIT

1Department of Mathematics and Laboratory for Computer Science. Room 2-388, Massachusetts Institute of Technology, Cambridge, MA 02139, Email: [email protected], http://math.mit.edu/~edelman

Contents

1 Introduction
  1.1 The machines
  1.2 The software
  1.3 The Reality of High Performance Computing
  1.4 Modern Algorithms
  1.5 Compilers
  1.6 Scientific Algorithms
  1.7 History, State-of-Art, and Perspective
    1.7.1 Things that are not traditional supercomputers
  1.8 Analyzing the top500 List Using Excel
    1.8.1 Importing the XML file
    1.8.2 Filtering
    1.8.3 Pivot Tables
  1.9 Parallel Computing: An Example
  1.10 Exercises

2 MPI, OpenMP, MATLAB*P
  2.1 Programming style
  2.2 Message Passing
    2.2.1 Who am I?
    2.2.2 Sending and receiving
    2.2.3 Tags and communicators
    2.2.4 Performance, and tolerance
    2.2.5 Who's got the floor?
  2.3 More on Message Passing
    2.3.1 Nomenclature
    2.3.2 The Development of Message Passing
    2.3.3 Machine Characteristics
    2.3.4 Active Messages
  2.4 OpenMP for Shared Memory Parallel Programming
  2.5 STARP

3 Parallel Prefix
  3.1 Parallel Prefix
  3.2 The “Myth” of lg n
  3.3 Applications of Parallel Prefix
    3.3.1 Segmented Scan
    3.3.2 Csanky's Matrix Inversion
    3.3.3 Babbage and Carry Look-Ahead Addition
  3.4 Parallel Prefix in MPI

4 Dense Linear Algebra
  4.1 Dense Matrices
  4.2 Applications
    4.2.1 Uncovering the structure from seemingly unstructured problems
  4.3 Records
  4.4 Algorithms, and mapping matrices to processors
  4.5 The memory hierarchy
  4.6 Single processor considerations for dense linear algebra
    4.6.1 LAPACK and the BLAS
    4.6.2 Reinventing dense linear algebra optimization
  4.7 Parallel computing considerations for dense linear algebra
  4.8 Better load balancing
    4.8.1 Problems

5 Sparse Linear Algebra
  5.1 Cyclic Reduction for Structured Sparse Linear Systems
  5.2 Sparse Direct Methods
    5.2.1 LU Decomposition and Gaussian Elimination
    5.2.2 Parallel Factorization: the Multifrontal Algorithm
  5.3 Basic Iterative Methods
    5.3.1 SuperLU-dist
    5.3.2 Jacobi Method
    5.3.3 Gauss-Seidel Method
    5.3.4 Splitting Matrix Method
    5.3.5 Weighted Splitting Matrix Method
  5.4 Red-Black Ordering for parallel Implementation
  5.5 Conjugate Gradient Method
    5.5.1 Parallel Conjugate Gradient
  5.6 Preconditioning
  5.7 Symmetric Supernodes
    5.7.1 Unsymmetric Supernodes
    5.7.2 The Column Elimination Tree
    5.7.3 Relaxed Supernodes
    5.7.4 Supernodal Numeric Factorization
  5.8 Efficient sparse matrix algorithms
    5.8.1 Scalable algorithms
    5.8.2 Cholesky factorization
    5.8.3 Distributed sparse Cholesky and the model problem
    5.8.4 Parallel Block-Oriented Sparse Cholesky Factorization
  5.9 Load balance with cyclic mapping
    5.9.1 Empirical Load Balance Results
  5.10 Heuristic Remapping
  5.11 Scheduling Local Computations

6 Parallel Machines
    6.0.1 More on private versus shared addressing
    6.0.2 Programming Model
    6.0.3 Machine Topology
    6.0.4 Homogeneous and heterogeneous machines
    6.0.5 Distributed Computing on the Internet and Akamai Network

7 FFT
  7.1 FFT
    7.1.1 Data motion
    7.1.2 FFT on parallel machines
    7.1.3 Exercises
  7.2 Matrix Multiplication
  7.3 Basic Data Communication Operations

8 Domain Decomposition
  8.1 Geometric Issues
    8.1.1 Overlapping vs. Non-overlapping regions
    8.1.2 Geometric Discretization
  8.2 Algorithmic Issues
    8.2.1 Classical Iterations and their block equivalents
    8.2.2 Schwarz approaches: additive vs. multiplicative
    8.2.3 Substructuring Approaches
    8.2.4 Accellerants
  8.3 Theoretical Issues
  8.4 A Domain Decomposition Assignment: Decomposing MIT

9 Particle Methods
  9.1 Reduce and Broadcast: A function viewpoint
  9.2 Particle Methods: An Application
  9.3 Outline
  9.4 What is N-Body Simulation?
  9.5 Examples
  9.6 The Basic Algorithm
    9.6.1 Finite Difference and the Euler Method
  9.7 Methods for Force Calculation
    9.7.1 Direct force calculation
    9.7.2 Potential based calculation
    9.7.3 Poisson Methods
    9.7.4 Hierarchical methods
  9.8 Quadtree (2D) and Octtree (3D): Data Structures for Canonical Clustering
  9.9 Barnes-Hut Method (1986)
    9.9.1 Approximating potentials
  9.10 Outline
  9.11 Introduction
  9.12 Multipole Algorithm: An Overview
  9.13 Multipole Expansion
  9.14 Taylor Expansion
  9.15 Operation No.1 — SHIFT
  9.16 Operation No.2 — FLIP
  9.17 Application on Quad Tree
  9.18 Expansion from 2-D to 3-D
  9.19 Parallel Implementation

10 Partitioning and Load Balancing
  10.1 Motivation from the Parallel Sparse Matrix Vector Multiplication
  10.2 Separators
  10.3 Spectral Partitioning – One way to slice a problem in half
    10.3.1 Electrical Networks
    10.3.2 Laplacian of a Graph
    10.3.3 Spectral Partitioning
  10.4 Geometric Methods
    10.4.1 Geometric Graphs
    10.4.2 Geometric Partitioning: Algorithm and Geometric Modeling
    10.4.3 Other Graphs with small separators
    10.4.4 Other Geometric Methods
    10.4.5 Partitioning Software
  10.5 Load-Balancing N-body Simulation for Non-uniform Particles
    10.5.1 Hierarchical Methods of Non-uniformly Distributed Particles
    10.5.2 The Communication Graph for N-Body Simulations
    10.5.3 Near-Field Graphs
    10.5.4 N-body Communication Graphs
    10.5.5 Geometric Modeling of N-body Graphs

11 Mesh Generation
  11.1 How to Describe a Domain?
  11.2 Types of Meshes
  11.3 Refinement Methods
    11.3.1 Hierarchical Refinement
    11.3.2 Delaunay Triangulation
    11.3.3 Delaunay Refinement
  11.4 Working With Meshes
  11.5 Unsolved Problems

12 Support Vector Machines and Singular Value Decomposition
  12.1 Support Vector Machines
    12.1.1 Learning Models
    12.1.2 Developing SVMs
    12.1.3 Applications
  12.2 Singular Value Decomposition

Lecture 1

Introduction

This book strives to study that elusive mix of theory and practice so important for understanding modern high performance computing. We try to cover it all, from the engineering aspects of computer science (parallel architectures, vendors, parallel languages, and tools) to the mathematical understanding of numerical algorithms, and also certain aspects from theoretical computer science. Any understanding of the subject should begin with a quick introduction to the current scene in terms of machines, vendors, and performance trends. This sets a realistic expectation of maximal performance and an idea of the monetary price. Then one must quickly introduce at least one, though possibly two or three, software approaches so that no time is wasted in using a computer. Then one has the luxury of reviewing interesting algorithms, understanding the intricacies of how architecture influences performance and how this has changed historically, and also obtaining detailed knowledge of software techniques.

1.1 The machines

We begin our introduction with a look at machines available for high performance computation. We list four topics worthy of further exploration:

The top500 list: We encourage readers to explore the data on http://www.top500.org. Important concepts are the types of machines currently on the top 500 and what benchmark is used. See Section 1.8 for a case study on how this might be done.

The “grass-roots” machines: We encourage readers to find out what machines one can buy in the 10k–300k range. These are the sorts of machines that a small group or team might have available. Such machines can be built as a do-it-yourself project or they can be purchased as pre-built rack-mounted machines from a number of vendors. We created a web-site at MIT, http://beowulf.lcs.mit.edu/hpc.html, to track the machines available at MIT.

Special interesting architectures: At the time of writing, the Japanese Earth Simulator and the Virginia Tech Mac cluster are of special interest. It will not be long before they move into the next category:

Interesting historical architectures and trends: To get an idea of history, consider the popular 1990 Michael Crichton novel Jurassic Park and the 1993 movie. The novel has the dinosaur theme park controlled by a Cray vector supercomputer. The 1993 movie shows the CM-5 in the background of the control center and even mentions the Thinking Machines Computer, but you could easily miss it if you do not pay attention. The first decade of supercomputing architecture is captured by the Crichton novel: vector supercomputers. We recommend the Cray supercomputers as interesting examples. The second decade is characterized by MPPs: massively parallel supercomputers. We recommend the CM2 and CM5 for their historical interest. The third decade, that of the cluster, has seen a trend toward ease of availability, deployment and use. The first cluster dates back to 1994, consisting of 16 commodity computers connected by ethernet. Historically, the first Beowulf is worthy of study.

When studying architectures, issues of interconnect, processor type and speed, and other nitty-gritty issues arise. Sometimes low-level software issues are also worthy of consideration when studying hardware.

1.2 The software

The three software models that we introduce quickly are:

MPI—The message passing interface. This is the de facto standard for parallel computing, though perhaps it is the lowest common denominator. We believe it was originally meant to be the high performance low-level language that libraries and compilers would reduce to. In fact, because it is portable and universally available it has become very much the language of parallel computing.

OpenMP—This less successful language (really a language extension) has become popular for so-called shared memory implementations. Those are implementations where the user need not worry about the location of data.

Star-P—Our homegrown software that we hope will make parallel computing significantly easier. It is based on a server which currently uses a MATLAB front end, and either OCTAVE or MATLAB compute engines as well as library calls on the back end.

We also mention that there are any number of software libraries available for special purposes.

Mosix and OpenMosix are two technologies which allow for automatic load balancing between nodes of a Linux cluster. The difference between the two is that OpenMosix is released under the GNU General Public License, while Mosix is proprietary software. Mosix and OpenMosix are installed as kernel patches (so there is the somewhat daunting task of patching, recompiling, and installing the patched Linux kernel). Once installed on a cluster, processes are automatically migrated from node to node to achieve load balancing. This allows for an exceptionally simple way to run embarrassingly parallel jobs, by simply backgrounding them with the ampersand (&) in the shell. For example:

#! /bin/sh
for i in 1 2 3 4 5 6 7 8 9 10
do
  ./monte-carlo $i &
done
wait
echo "All processes done."

Although all ten monte-carlo processes (each executing with a different command-line parameter) initially start on the same processor, the Mosix or OpenMosix system will automatically migrate the processes to different nodes of the cluster by capturing the entire state of the running program, sending it over the network, and restarting it from where it left off on a different node. Unfortunately, interprocess communication is difficult. It can be done through the standard Unix methods, for example, with sockets or via the file system.

The Condor Project, developed at the University of Wisconsin at Madison, is a batch queueing system with an interesting feature.

[Condor can] effectively harness wasted CPU power from otherwise idle desktop workstations. For instance, Condor can be configured to only use desktop machines where

the keyboard and mouse are idle. Should Condor detect that a machine is no longer available (such as a key press detected), in many circumstances Condor is able to transparently produce a checkpoint and migrate a job to a different machine which would otherwise be idle.1

1 http://www.cs.wisc.edu/condor/description.html

Condor can run parallel computations across multiple Condor nodes using PVM or MPI, but (for now) using MPI requires dedicated nodes that cannot be used as desktop machines.

1.3 The Reality of High Performance Computing

There are a number of important issues regarding the reality of parallel computing that all too often are learned the hard way, and that you will not often find in textbooks. Parallel computers may not give a speedup of p, but you probably will be happy to be able to solve a larger problem in a reasonable amount of time. In other words, if your computation can already be done on a single processor in a reasonable amount of time, you probably cannot do much better. If you are deploying a machine, worry first about heat, power, space, and noise, not speed and performance.

1.4 Modern Algorithms

This book covers some of our favorite modern algorithms. One goal of high performance computing is very well defined: to find faster solutions to larger and more complex problems. This is a highly interdisciplinary area ranging from numerical analysis and algorithm design to programming languages, computer architectures, software, and real applications. The participants include engineers, computer scientists, applied mathematicians, physicists, and many others. We will concentrate on well-developed and more research-oriented parallel algorithms in science and engineering that have “traditionally” relied on high performance computing, particularly parallel methods for differential equations, linear algebra, and simulations.

1.5 Compilers

[move to a software section?] Most parallel languages are defined by adding parallel extensions to well-established sequential languages, such as C and Fortran. Such extensions allow the user to specify various levels of parallelism in an application program, to define and manipulate parallel data structures, and to specify message passing among parallel computing units.

Compilation has become more important for parallel systems. The purpose of a compiler is not just to transform a parallel program in a high-level description into machine-level code; the role of compiler optimization is more important. We will always have discussions about: are we developing methods and algorithms for a parallel machine, or are we designing parallel machines for algorithms and applications? Compilers are meant to bridge the gap between algorithm design and machine architectures. Extension of compiler techniques to run time libraries will further reduce users' concerns in parallel programming.

Software libraries are important tools in the use of computers. The purpose of libraries is to enhance productivity by providing preprogrammed functions and procedures. Software libraries provide even higher level support to programmers than high-level languages. Parallel scientific libraries embody expert knowledge of computer architectures, compilers, operating systems, numerical analysis, parallel data structures, and algorithms. They systematically choose a set of basic functions and parallel data structures, and provide highly optimized routines for these functions that carefully consider the issues of data allocation, data motion, load balancing, and numerical stability. Hence scientists and engineers can spend more time and be more focused on developing efficient computational methods for their problems. Another goal of scientific libraries is to make parallel programs portable from one parallel machine platform to another. Because of the lack, until very recently, of non-proprietary parallel programming standards, the development of portable parallel libraries has lagged far behind the need for them. There is good evidence now, however, that scientific libraries will be made more powerful in the future and will include more functions for applications to provide a better interface to real applications.

Due to the generality of scientific libraries, their functions may be more complex than needed for a particular application. Hence, they are less efficient than the best codes that take advantage of the special structure of the problem. So a programmer needs to learn how to use scientific libraries. A pragmatic suggestion is to use functions available in the scientific library to develop the first prototype and then to iteratively find the bottleneck in the program and improve the efficiency.

1.6 Scientific Algorithms

The multidisciplinary aspect of scientific computing is clearly visible in algorithms for scientific problems. A scientific algorithm can be either sequential or parallel. In general, algorithms in scientific software can be classified as graph algorithms, geometric algorithms, and numerical algorithms, and most scientific software calls on algorithms of all three types. Scientific software also makes use of advanced data structures and up-to-date user interfaces and graphics.

1.7 History, State-of-Art, and Perspective

The supercomputer industry started when Cray Research developed the vector-processor-powered Cray-1, in 1975. The massively parallel processing (MPP) industry emerged with the introduction of the CM-2 by Thinking Machines in 1985. Finally, 1994 brought the first Beowulf cluster, or “commodity supercomputing” (parallel computers built out of off-the-shelf components by independent vendors or do-it-your-selfers).

1.7.1 Things that are not traditional supercomputers

There have been a few successful distributed computing projects using volunteers across the Internet. These projects represent the beginning attempts at “grid” computing.

• Seti@home uses thousands of Internet-connected computers to analyze radio telescope data in a search for extraterrestrial intelligence. The principal computation is FFT of the radio signals, and approximately 500,000 computers contribute toward a total computing effort of about 60 teraflops.

• distributed.net works on RSA Laboratories' RC5 cipher decryption contest and also searches for optimal Golomb rulers.

• mersenne.org searches for Mersenne primes, primes of the form 2^p − 1.

• chessbrain.net is a distributed chess computer. On January 30, 2004, a total of 2,070 machines participated in a match against a Danish grandmaster. The game resulted in a draw by repetition.

Whereas the above are examples of benevolent or harmless distributed computing, there are also other sorts of distributed computing which are frowned upon, either by the entertainment industry in the first example below, or universally in the latter two.

• Peer-to-peer file-sharing (the original Napster, followed by Gnutella, KaZaA, Freenet, and the like) can be viewed as a large distributed supercomputer, although the resource being parallelized is storage rather than processing. KaZaA itself is notable because the client software contains an infamous hook (called Altnet) which allows a KaZaA corporate partner (Brilliant Digital Entertainment) to load and run arbitrary code on the client computer. Brilliant Digital has been quoted as saying they plan to use Altnet as “the next advancement in distributed bandwidth, storage and computing.”2 Altnet has so far only been used to distribute ads to KaZaA users.

2 Brilliant Digital Entertainment's Annual Report (Form 10KSB), filed with the SEC April 1, 2002.

• Distributed Denial of Service (DDoS) attacks harness thousands of machines (typically compromised through a security vulnerability) to attack an Internet host with so much traffic that it becomes too slow or otherwise unusable by legitimate users.

• Spam e-mailers have also increasingly turned to hundreds or thousands of compromised machines to act as remailers, as a response to the practice of “blacklisting” machines thought to be spam mailers. The disturbing mode of attack is often an e-mail message containing a trojan attachment, which when executed opens a backdoor that the spammer can use to send more spam.

Although the class will not focus on these non-traditional supercomputers, the issues they have to deal with (communication between processors, load balancing, dealing with unreliable nodes) are similar to the issues that will be addressed in this class.

1.8 Analyzing the top500 List Using Excel

The following is a brief tutorial on analyzing data from the top500 website using Excel. Note that while the details provided by the top500 website will change from year to year, the ability for you to analyze this data should always be there. If the listed .xml file no longer exists, try searching the website yourself for the most current .xml file.

1.8.1 Importing the XML file

Open up Microsoft Office Excel 2003 on Windows XP. Click on the File menubar and then Open (or type Ctrl-O). As shown in Figure 1.1, for the File name: text area, type in:

http://www.top500.org/lists/2003/11/top500.200311.xml

Click Open. A small window pops up asking how you would like to open the file. Leave the default as is and click OK (Figure 1.2).

Figure 1.1: Open dialog

Figure 1.2: Opening the XML file

Figure 1.3: Loaded data

Figure 1.4: Filtering available for each column

1.8.2 Filtering

Click OK again. Something as in Figure 1.3 should show up. If you don't see the bolded column titles with the arrows next to them, go to the menubar and select Data → Filter → AutoFilter (make sure it is checked). Type Ctrl-A to highlight all the entries. Go to the menubar and select Format → Column → Width. You can put in any width you'd like, but 10 would work.

You can now filter the data in numerous ways. For example, if you want to find out all SGI installations in the US, you can click on the arrow in column K (country) and select United States. Then, click on the arrow in column F (manufacturer) and select SGI. The entire data set is now sorted to those select machines (Figure 1.5). If you continue to click on other arrows, the selection will become more and more filtered.

If you would like to start over, go to the menubar and select Data → Filter → Show All. Assuming that you've started over with all the data, we can try to see all machines in both Singapore and Malaysia. Click on the arrow in column K (country) again and select (Custom...). You should see something as in Figure 1.6. On the top right side, pull down the arrow and select Singapore. On the lower left, pull down the arrow and select equals. On the lower right, pull down the arrow and select Malaysia (Figure 1.6). Now click OK. You should see a blank filtered list because there are no machines that are in both Singapore and Malaysia (Figure 1.7).

If you want to find out all machines in Singapore or India, you have to start off with all the data again. You perform the same thing as before except that in the Custom AutoFilter screen, you should click on the Or toggle button. You should also type in India instead of Malaysia (Figure 1.8).


Figure 1.5: All SGI installations

Figure 1.6: Custom filter

Figure 1.7: Machines in Singapore and Malaysia (none)

Click OK and you should see something as in Figure 1.9.

1.8.3 Pivot Tables

Now let's try experimenting with a new feature called pivot tables. First, go to the menubar and click on Data → Filter → Show All to bring back all the entries. Type Ctrl-A to select all the entries. Go to the menubar and click on Data → PivotTable and PivotChart Report... Figure 1.10 shows the 3 wizard screens that you'll see. You can click Finish in the first screen unless you want to specify something different. You get something as in Figure 1.11.

Let's try to find out the number of machines in each country. Scroll down the PivotTable Field List, find ns1:country, and use your mouse to drag it to where it says “Drop Row Fields Here”.

Figure 1.8: Example filter: Singapore or India

Figure 1.9: Machines in Singapore or India

Figure 1.10: Pivot Table wizard

Figure 1.11: Empty Pivot Table

Now scroll down the PivotTable Field List again, find ns1:computer, and use your mouse to drag it to where it says “Drop Data Items Here” (Figure 1.12). Notice that at the top left of the table it says “Count of ns1:computer”. This means that it is counting up the number of machines for each individual country.

Now let's find the highest rank of each machine. Click on the gray ns1:country column heading and drag it back to the PivotTable Field List. Now from the PivotTable Field List, drag ns1:computer to the column where the country list was before. Click on the gray Count of ns1:computer column heading and drag it back to the PivotTable Field List. From the same list, drag ns1:rank to where the ns1:computer information was before (it says “Drop Data Items Here”). The upper left column heading should now be: Sum of ns1:rank. Double click on it and the resulting pop up is shown in Figure 1.13. To find the highest rank of each machine, we need the data to be sorted by Min. Click on Min and click OK (Figure 1.14). The data now shows the highest rank for each machine.

Let's find the number of machines worldwide for each vendor. Following the same procedures as before, we replace the Row Fields and Data Items with ns1:manufacturer and ns1:computer, respectively. We see that the upper left column heading says “Count of ns1:computer”, so we know it is finding the sum of the machines for each vendor. The screen should look like Figure 1.15. Now let's find the minimum rank for each vendor. By now we should all be experts at this! The tricky part here is that we actually want the Max of the ranks. The screen you should get is shown in Figure 1.16.

Note that it is also possible to create 3D pivot tables. Start out with a blank slate by dragging everything back to the Pivot Table Field List. Then, from that list, drag ns1:country to where it says “Drop Row Fields Here”. Drag ns1:manufacturer to where it says “Drop Column Fields Here”. Now click on the arrow under the ns1:country title, click on Show All (which releases all the checks), Australia, Austria, Belarus, and Belgium. Click on the arrow under the ns1:manufacturer title, click on Show All, Dell, Hewlett-Packard, and IBM.

Figure 1.12: Number of machines in each country

Figure 1.13: Possible functions on the rank

Figure 1.14: Minimum ranked machine from each country

Figure 1.15: Count of machines for each vendor

Figure 1.16: Minimum ranked machine from each vendor

Drag ns1:rank from the Pivot Table Field List to where it says “Drop Data Items Here”. Double-click on the Sum of ns1:rank and select Count instead of Sum. Finally, drag ns1:year to where it says “Drop Page Fields Here”. You have successfully completed a 3D pivot table that looks like Figure 1.17. As you can imagine, there are infinite possibilities for auto-filter and pivot table combinations. Have fun with it!

1.9 Parallel Computing: An Example

Here is an example of a problem that one can easily imagine being performed in parallel:

A rudimentary parallel problem: Compute

x_1 + x_2 + · · · + x_P,

where x_i is a floating point number and, for purposes of exposition, let us assume P is a power of 2.

Obviously, the sum can be computed in log P (base 2 is understood) steps, simply by adding neighboring pairs recursively. (The algorithm has also been called pairwise summation and cascade summation.) The data flow in this computation can be thought of as a binary tree. (Illustrate on tree.) Nodes represent values, either input or the result of a computation. Edges communicate values from their definition to their uses. This is an example of what is often known as a reduce operation. We can replace the addition operation with any associative operator to generalize. (Actually, floating point addition is not

Figure 1.17: 3D pivot table

associative, leading to interesting numerical questions about the best order in which to add numbers. This is studied by Higham [49, See p. 788].)

There are so many questions though. How would you write a program to do this on P processors? Is it likely that you would want to have such a program on P processors? How would the data (the x_i) get to these processors in the first place? Would they be stored there or result from some other computation? It is far more likely that you would have 10000P numbers on P processors to add.

A correct parallel program is often not an efficient parallel program. A flaw in a parallel program that causes it to get the right answer slowly is known as a performance bug. Many beginners will write a working parallel program, obtain poor performance, and prematurely conclude that either parallelism is a bad idea, or that it is the machine that they are using which is slow.

What are the sources of performance bugs? We illustrate some of them with this little, admittedly contrived example. For this example, imagine four processors, numbered zero through three, each with its own private memory, and able to send and receive messages to/from the others. As a simple approximation, assume that the time consumed by a message of size n words is A + Bn.

Three Bad Parallel Algorithms for Computing 1 + 2 + 3 + · · · + 10^6 on Four Processors, and generalization

1. Load imbalance. One processor has too much to do, and the others sit around waiting for it. For our problem, we could eliminate all the communication by having processor zero do all the work; but we won't see any parallel speedup with this method!

2. Excessive communication. Processor zero has all the data, which is divided into four equally sized parts, each with a quarter of a million numbers. It ships three of the four parts to each of

the other processors for proper load balancing. The results are added up on each processor, and then processor 0 adds the final four numbers together. True, there is a tiny bit of load imbalance here, since processor zero does those few last additions. But that is nothing compared with the cost it incurs in shipping out the data to the other processors. In order to get that data out onto the network, it incurs a large cost that does not drop with the addition of more processors. (In fact, since the number of messages it sends grows like the number of processors, the time spent in the initial communication will actually increase.)

3. A sequential bottleneck. Let's assume the data are initially spread out among the processors; processor zero has the numbers 1 through 250,000, etc. Assume that the owner of i will add it to the running total. So there will be no load imbalance. But assume, further, that we are constrained to add the numbers in their original order! (Sounds silly when adding, but other algorithms require exactly such a constraint.) Thus, processor one may not begin its work until it receives the sum 0 + 1 + · · · + 250,000 from processor zero!

We are thus requiring a sequential computation: our binary summation tree is maximally unbalanced, and has height 10^6. It is always useful to know the critical path—the length of the longest path in the dataflow graph—of the computation being parallelized. If it is excessive, change the algorithm!
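By contrast, the balanced binary-tree summation sketched earlier takes only log P communication steps. The following is a minimal illustration of that idea; it is our sketch, not part of the original notes, written with the MPI calls introduced in the next chapter, and it assumes P is a power of two, at least two processes, and that each process already holds one local partial sum.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int myrank, nprocs, step;
    double local, incoming;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    local = myrank + 1.0;   /* stand-in for this process's partial sum */

    /* Pairwise (tree) summation, assuming nprocs is a power of two:
       at step s, every process whose rank is a multiple of 2s receives
       from rank+s and adds; the others send their value once and drop
       out.  log2(P) steps in all.                                      */
    for (step = 1; step < nprocs; step *= 2) {
        if (myrank % (2 * step) == 0) {
            MPI_Recv(&incoming, 1, MPI_DOUBLE, myrank + step, 0,
                     MPI_COMM_WORLD, &stat);
            local += incoming;
        } else {
            MPI_Send(&local, 1, MPI_DOUBLE, myrank - step, 0,
                     MPI_COMM_WORLD);
            break;
        }
    }

    if (myrank == 0)
        printf("sum = %g\n", local);   /* 1 + 2 + ... + P */
    MPI_Finalize();
    return 0;
}

In practice one would simply call the collective MPI_Reduce described in Section 2.2.5 rather than writing the tree by hand.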

Some problems look sequential, such as Fibonacci: F_{k+1} = F_k + F_{k-1}, but looks can be deceiving. Parallel prefix will be introduced later, which can be used to parallelize the Fibonacci computation.

1.10 Exercises

1. Compute the sum of 1 through 1,000,000 using HPF. This amounts to a “hello world” program on whichever machine you are using. We are not currently aware of any free distributions of HPF for workstations, so your instructor will have to suggest a computer to use.

2. Download MPI to your machine and compute the sum of 1 through 1,000,000 using C or Fortran with MPI. A number of MPI implementations may be found at http://www.mcs.anl.gov/Projects/mpi/implementations.html. The easy-to-follow ten step quick start on http://www.mcs.anl.gov/mpi/mpiinstall/node1.html#Node1 worked very easily on the MIT mathematics department's SUN network.

3. In HPF, generate 1,000,000 real random numbers and sort them. (Use RANDOM_NUMBER and GRADE_UP.)

4. (Extra Credit) Do the same in C or Fortran with MPI.

5. Set up the excessive communication situation described as the second bad parallel algorithm. Place the numbers 1 through one million in a vector on one processor. Using four processors, see how quickly you can get the sum of the numbers 1 through one million.

Lecture 2

MPI, OpenMP, MATLAB*P

A parallel language must provide mechanisms for implementing parallel algorithms, i.e., to specify various levels of parallelism and define parallel data structures for distributing and sharing information among processors.

Most current parallel languages add parallel constructs to standard sequential languages. Different parallel languages provide different basic constructs. The choice largely depends on the parallel computing model the language means to support. There are at least three basic parallel computation models other than the vector and pipeline models: data parallel, message passing, and shared memory task parallel.

2.1 Programming style

Data parallel vs. message passing. Explicit communication can be somewhat hidden if one wants to program in a data parallel style; it's something like SIMD: you specify a single action to execute on all processors. Example: if A, B and C are matrices, you can write C=A+B and let the compiler do the hard work of accessing remote memory locations, partitioning the matrix among the processors, etc.

By contrast, explicit message passing gives the programmer careful control over the communication, but programming requires a lot of knowledge and is much more difficult (as you probably understood from the previous section).

There are uncountable other alternatives, still academic at this point (in the sense that no commercial machine comes with them bundled). A great deal of research has been conducted on multithreading; the idea is that the programmer expresses the computation graph in some appropriate language, the machine executes the graph, and potentially independent nodes of the graph can be executed in parallel. Example in Cilk (multithreaded C, developed at MIT by Charles Leiserson and his students):

thread int fib(int n) {
  if (n<2) return n;
  else {
    cont int x, y;
    x = spawn fib (n-2);
    y = spawn fib (n-1);


    sync;
    return x+y;
  }
}

Actually Cilk is a little bit different right now, but this is the way the program will look when you read these notes. The whole point is that the two computations of fib(n-1) and fib(n-2) can be executed in parallel. As you might have expected, there are dozens of multithreaded languages (functional, imperative, declarative) and implementation techniques; in some implementations the thread size is a single instruction long, and special processors execute this kind of program. Ask Arvind at LCS for details.

Writing a good HPF compiler is difficult and not every manufacturer provides one; actually for some time TMC machines were the only machines available with it. The first HPF compiler for the Intel Paragon dates to December 1994.

Why SIMD is not necessarily the right model for data parallel programming. Consider the following Fortran fragment, where x and y are vectors:

where (x > 0)
  y = x + 2
elsewhere
  y = -x + 5
endwhere

A SIMD machine might execute both cases, and discard one of the results; it does twice the needed work (see why? there is a single flow of instructions). This is how the CM-2 operated. On the other hand, an HPF compiler for a SIMD machine can take advantage of the fact that there will be many more elements of x than processors. It can execute the where branch until there are no positive elements of x that haven’t been seen, then it can execute the elsewhere branch until all other elements of x are covered. It can do this provided the machine has the ability to generate independent memory addresses on every processor. Moral: even if the programming model is that there is one processor per data element, the programmer (and the compiler writer) must be aware that it’s not true.

2.2 Message Passing

In the SISD, SIMD, and MIMD computer taxonomy, SISD machines are conventional uniprocessors, SIMD are single instruction stream machines that operate in parallel on ensembles of data, like arrays or vectors, and MIMD machines have multiple active threads of control (processes) that can operate on multiple data in parallel. How do these threads share data and how do they synchronize? For example, suppose two processes have a producer/consumer relationship. The producer generates a sequence of data items that are passed to the consumer, which processes them further. How does the producer deliver the data to the consumer? If they share a common memory, then they agree on a location to be used for the transfer. In addition, they have a mechanism that allows the consumer to know when that location is full, i.e. it has a valid datum placed there for it by the producer, and a mechanism to read that location and change its state to empty. A full/empty bit is often associated with the location for this purpose. The hardware feature that is often used to do this is a “test-and-set” instruction that tests a bit in memory and simultaneously sets it to one. The producer has the obvious dual mechanisms.
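As a rough illustration of the full/empty idea (our sketch, not part of the original text), the following C program uses C11 atomics and POSIX threads as a software stand-in for the hardware full/empty bit and test-and-set instruction described above. One producer and one consumer share a single slot.

#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>

static int slot;                 /* the agreed-upon transfer location */
static atomic_int full = 0;      /* 0 = empty, 1 = full               */

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 1; i <= 5; i++) {
        while (atomic_load(&full) != 0)   /* wait for the slot to empty */
            ;
        slot = i;                         /* deposit the datum          */
        atomic_store(&full, 1);           /* mark the slot full         */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    for (int i = 1; i <= 5; i++) {
        while (atomic_load(&full) != 1)   /* wait for the slot to fill  */
            ;
        printf("consumed %d\n", slot);    /* use the datum              */
        atomic_store(&full, 0);           /* mark the slot empty again  */
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

The busy-wait loops stand in for whatever waiting mechanism the hardware or runtime actually provides; a real system would block or use the test-and-set bit directly rather than spinning.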

Many highly parallel machines have been, and still are, just collections of independent computers on some sort of a network. Such machines can be made to have just about any data sharing and synchronization mechanism; it just depends on what software is provided by the operating system, the compilers, and the runtime libraries. One possibility, the one used by the first of these machines (the Caltech Cosmic Cube, from around 1984) is message passing. (So it's misleading to call these “message passing machines”; they are really multicomputers with message passing library software.)

From the point of view of the application, these computers can send a message to another computer and can receive such messages off the network. Thus, a process cannot touch any data other than what is in its own, private memory. The way it communicates is to send messages to and receive messages from other processes. Synchronization happens as part of the process, by virtue of the fact that both the sending and receiving process have to make a call to the system in order to move the data: the sender won't call send until its data is already in the send buffer, and the receiver calls receive when its receive buffer is empty and it needs more data to proceed.

Message passing systems have been around since the Cosmic Cube, about ten years. In that time, there has been a lot of evolution, improved efficiency, better software engineering, improved functionality. Many variants were developed by users, computer vendors, and independent software companies. Finally, in 1993, a standardization effort was attempted, and the result is the Message Passing Interface (MPI) standard. MPI is flexible and general, has good implementations on all the machines one is likely to use, and is almost certain to be around for quite some time. We'll use MPI in the course.

On one hand, MPI is complicated considering that there are more than 150 functions and the number is still growing. But on the other hand, MPI is simple because there are only six basic functions: MPI_Init, MPI_Finalize, MPI_Comm_rank, MPI_Comm_size, MPI_Send and MPI_Recv. In print, the best MPI reference is the handbook Using MPI, by William Gropp, Ewing Lusk, and Anthony Skjellum, published by MIT Press, ISBN 0-262-57104-8. The standard is on the World Wide Web. The URL is http://www.mcs.anl.gov/mpi/mpi-report/mpi-report.html. An updated version is at ftp://ftp.mcs.anl.gov/pub/mpi/mpi-1.jun95/mpi-report.ps

2.2.1 Who am I?

On the SP-2 and other multicomputers, one usually writes one program which runs on all the processors. In order to differentiate its behavior (like producer and consumer), a process usually first finds out at runtime its rank within its process group, then branches accordingly. The calls

MPI_Comm_size(MPI_Comm comm, int *size)

sets size to the number of processes in the group specified by comm and the call

MPI_Comm_rank(MPI_Comm comm, int *rank)

sets rank to the rank of the calling process within the group (from 0 up to n − 1, where n is the size of the group). Usually, the first thing a program does is to call these using MPI_COMM_WORLD as the communicator, in order to find out the answer to the big questions, “Who am I?” and “How many other ‘I’s are there?”.

Okay, I lied. That's the second thing a program does. Before it can do anything else, it has to make the call

MPI_Init(int *argc, char ***argv)

where argc and argv should be pointers to the arguments provided by UNIX to main(). While we're at it, let's not forget that one's code needs to start with

#include "mpi.h"

The last thing the MPI code does should be

MPI_Finalize()

No arguments. Here’s an MPI multi-process “Hello World”:

#include <stdio.h>
#include "mpi.h"

main(int argc, char** argv)
{
  int i, myrank, nprocs;
  double a = 0, b = 1.1, c = 0.90;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  printf("Hello world! This is process %d out of %d\n", myrank, nprocs);
  if (myrank == 0)
    printf("Some processes are more equal than others.");
  MPI_Finalize();
} /* main */

which is compiled and executed on the SP-2 at Ames by

babbage1% mpicc -O3 example.c
babbage1% a.out -procs 2

and produces (on the standard output)

0:Hello world! This is process 0 out of 2
1:Hello world! This is process 1 out of 2
0:Some processes are more equal than others.

Another important thing to know about is the MPI wall clock timer:

double MPI_Wtime()

which returns the time in seconds from some unspecified point in the past.

2.2.2 Sending and receiving

In order to get started, let's look at the two most important MPI functions, MPI_Send and MPI_Recv. The call

MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

sends count items of data of type datatype starting at the location buf. In all message passing systems, the processes have identifiers of some kind. In MPI, the process is identified by its rank, an integer. The data is sent to the process whose rank is dest. Possible values for datatype are MPI_INT, MPI_DOUBLE, MPI_CHAR, etc. tag is an integer used by the programmer to allow the receiver to select from among several arriving messages in the MPI_Recv. Finally, comm is something called a communicator, which is essentially a subset of the processes. Ordinarily, message passing occurs within a single subset. The subset MPI_COMM_WORLD consists of all the processes in a single parallel job, and is predefined.

A receive call matching the send above is

MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

buf is where the data is placed once it arrives. count, an input argument, is the size of the buffer; the message is truncated if it is longer than the buffer. Of course, the receive has to be executed by the correct destination process, as specified in the dest part of the send, for it to match. source must be the rank of the sending process. The communicator and the tag must match. So must the datatype.

The purpose of the datatype field is to allow MPI to be used with heterogeneous hardware. A process running on a little-endian machine may communicate integers to another process on a big-endian machine; MPI converts them automatically. The same holds for different floating point formats. Type conversion, however, is not supported: an integer must be sent to an integer, a double to a double, etc.

Suppose the producer and consumer transact business in two word integer packets. The producer is process 0 and the consumer is process 1. Then the send would look like this:

int outgoing[2];
MPI_Send(outgoing, 2, MPI_INT, 1, 100, MPI_COMM_WORLD);

and the receive like this:

MPI_Status stat;
int incoming[2];
MPI_Recv(incoming, 2, MPI_INT, 0, 100, MPI_COMM_WORLD, &stat);
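Putting these two fragments into one complete (if tiny) program may help. The following sketch is our addition, not part of the original notes; it must be run with at least two processes, and simply has process 0 send one two-integer packet to process 1.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int myrank;
    int packet[2];
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {                 /* the producer */
        packet[0] = 42; packet[1] = 43;
        MPI_Send(packet, 2, MPI_INT, 1, 100, MPI_COMM_WORLD);
    } else if (myrank == 1) {          /* the consumer */
        MPI_Recv(packet, 2, MPI_INT, 0, 100, MPI_COMM_WORLD, &stat);
        printf("got %d %d from process %d\n",
               packet[0], packet[1], stat.MPI_SOURCE);
    }
    MPI_Finalize();
    return 0;
}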

What if one wants a process to which several other processes can send messages, with service provided on a first-arrived, first-served basis? For this purpose, we don't want to specify the source in our receive, and we use the value MPI_ANY_SOURCE instead of an explicit source. The same is true if we want to ignore the tag: use MPI_ANY_TAG. The basic purpose of the status argument, which is an output argument, is to find out what the tag and source of such a received message are. status.MPI_TAG and status.MPI_SOURCE are components of the struct status of type int that contain this information after the MPI_Recv function returns.

This form of send and receive is “blocking”, which is a technical term that has the following meaning. For the send, it means that buf has been read by the system and the data has been moved out as soon as the send returns. The sending process can write into it without corrupting the message that was sent. For the receive, it means that buf has been filled with data on return. (A call to MPI_Recv with no corresponding call to MPI_Send occurring elsewhere is a very good and often used method for hanging a message passing application.)

MPI implementations may use buffering to accomplish this. When send is called, the data are copied into a system buffer and control returns to the caller. A separate system process (perhaps using communication hardware) completes the job of sending the data to the receiver. Another implementation is to wait until a corresponding receive is posted by the destination process, then transfer the data to the receive buffer, and finally return control to the caller. MPI provides two variants of send, MPI_Bsend and MPI_Ssend, that force the buffered or the rendezvous implementation. Lastly, there is a version of send that works in “ready” mode. For such a send, the corresponding receive must have been executed previously, otherwise an error occurs. On some systems, this may be faster than the blocking versions of send. All four versions of send have the same calling sequence.

NOTE: MPI allows a process to send itself data. Don't try it. On the SP-2, if the message is big enough, it doesn't work. Here's why. Consider this code:

if (myrank == 0)
  for(dest = 0; dest < size; dest++)
    MPI_Send(sendbuf+dest*count, count, MPI_INT, dest, tag, MPI_COMM_WORLD);
MPI_Recv(recvbuf, count, MPI_INT, 0, tag, MPI_COMM_WORLD, &stat);

The programmer is attempting to send data from process zero to all processes, including process zero; 4 · count bytes of it. If the system has enough buffer space for the outgoing messages, this succeeds, but if it doesn't, then the send blocks until the receive is executed. But since control does not return from the blocking send to process zero, the receive never does execute. If the programmer uses buffered send, then this deadlock cannot occur. An error will occur if the system runs out of buffer space, however:

if (myrank == 0)
  for(dest = 0; dest < size; dest++)
    MPI_Bsend(sendbuf+dest*count, count, MPI_INT, dest, tag, MPI_COMM_WORLD);
MPI_Recv(recvbuf, count, MPI_INT, 0, tag, MPI_COMM_WORLD, &stat);

2.2.3 Tags and communicators

Tags are used to keep messages straight. An example will illustrate how they are used. Suppose each process in a group has one integer and one real value, and we wish to find, on process zero, the sum of the integers and the sum of the reals. Let's write this:

itag = 100;
MPI_Send(&intvar, 1, MPI_INT, 0, itag, MPI_COMM_WORLD);
ftag = 101;
MPI_Send(&floatvar, 1, MPI_FLOAT, 0, ftag, MPI_COMM_WORLD);

/**** Sends are done. Receive on process zero ****/

if (myrank == 0) {
  intsum = 0;
  for (kount = 0; kount < nprocs; kount++) {
    MPI_Recv(&intrecv, 1, MPI_INT, MPI_ANY_SOURCE, itag, MPI_COMM_WORLD, &stat);
    intsum += intrecv;
  }
  fltsum = 0;
  for (kount = 0; kount < nprocs; kount++) {
    MPI_Recv(&fltrecv, 1, MPI_FLOAT, MPI_ANY_SOURCE, ftag, MPI_COMM_WORLD, &stat);
    fltsum += fltrecv;
  }

}

It looks simple, but there are a lot of subtleties here! First, note the use of MPI_ANY_SOURCE in the receives. We're happy to receive the data in the order it arrives. Second, note that we use two different tag values to distinguish between the int and the float data. Why isn't the MPI type field enough? Because MPI does not include the type as part of the message “envelope”. The envelope consists of the source, destination, tag, and communicator, and these must match in a send-receive pair. Now the two messages sent to process zero from some other process are guaranteed to arrive in the order they were sent, namely the integer message first. But that does not mean that all of the integer messages precede all of the float messages! So the tag is needed to distinguish them.

This solution creates a problem. Our code, as it is now written, sends off a lot of messages with tags 100 and 101, then does the receives (at process zero). Suppose we called a library routine written by another user before we did the receives. What if that library code uses the same message tags? Chaos results. We've “polluted” the tag space. Note, by the way, that synchronizing the processes before calling the library code does not solve this problem.

MPI provides communicators as a way to prevent this problem. The communicator is a part of the message envelope. So we need to change communicators while in the library routine. To do this, we use MPI_Comm_dup, which makes a new communicator with the same processes in the same order as an existing communicator. For example

void safe_library_routine(MPI_Comm oldcomm)
{
  MPI_Comm mycomm;
  MPI_Comm_dup(oldcomm, &mycomm);

  /* library code, using mycomm for all of its communication */

  MPI_Comm_free(&mycomm);
}

The messages sent and received inside the library code cannot interfere with those sent outside.

2.2.4 Performance, and tolerance

Try this exercise. See how long a message of length n bytes takes between the time the sender calls send and the time the receiver returns from receive. Do an experiment and vary n. Also vary the rank of the receiver for a fixed sender. Does the model

Elapsed Time(n, r) = α + βn

work? (r is the receiver, and according to this model, the cost is receiver independent.) In such a model, the latency for a message is α seconds, and the bandwidth is 1/β bytes/second.

Other models try to split α into two components. The first is the time actually spent by the sending processor and the receiving processor on behalf of a message. (Some of the per-byte cost is also attributed to the processors.) This is called the overhead. The remaining component of latency is the delay as the message actually travels through the machine's interconnect network. It is ordinarily much smaller than the overhead on modern multicomputers (ones, rather than tens of microseconds).

A lot has been made about the possibility of improving performance by “tolerating” communication latency. To do so, one finds other work for the processor to do while it waits for a message to arrive. The simplest thing is for the programmer to do this explicitly. For this purpose, there are “nonblocking” versions of send and receive in MPI and other dialects.

Nonblocking send and receive work this way. A nonblocking send start call initiates a send but returns before the data are out of the send buffer. A separate call to send complete then blocks the sending process, returning only when the data are out of the buffer. The same two-phase protocol is used for nonblocking receive. The receive start call returns right away, and the receive complete call returns only when the data are in the buffer. The simplest mechanism is to match a nonblocking receive with a blocking send. To illustrate, we perform the communication operation of the previous section using nonblocking receive.

MPI_Request request;

MPI_Irecv(recvbuf, count, MPI_INT, 0, tag, MPI_COMM_WORLD, &request);
if (myrank == 0)
  for(dest = 0; dest < size; dest++)
    MPI_Send(sendbuf+dest*count, count, MPI_INT, dest, tag, MPI_COMM_WORLD);
MPI_Wait(&request, &stat);

MPI_Wait blocks until the nonblocking operation identified by the handle request completes. This code is correct regardless of the availability of buffers. The sends will either buffer or block until the corresponding receive start is executed, and all of these will be.

Before embarking on an effort to improve performance this way, one should first consider what the payoff will be. In general, the best that can be achieved is a two-fold improvement. Often, for large problems, it's the bandwidth (the βn term) that dominates, and latency tolerance doesn't help with this. Rearrangement of the data and computation to avoid some of the communication altogether is required to reduce the bandwidth component of communication time.
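Before moving on, here is one possible way to code the timing experiment suggested at the beginning of this subsection: the classic ping-pong test. The sketch is our addition, not from the original notes; the repetition count and the range of message sizes are arbitrary choices, and it must be run with at least two processes. Half of the measured round-trip time estimates α + βn for each n.

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define REPS 1000

int main(int argc, char **argv)
{
    int myrank, n, r;
    double t0, t;
    char *buf;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    for (n = 1; n <= 1 << 20; n *= 2) {      /* message sizes 1 B .. 1 MB */
        buf = malloc(n);
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (r = 0; r < REPS; r++) {
            if (myrank == 0) {               /* ping ...                  */
                MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &stat);
            } else if (myrank == 1) {        /* ... pong                  */
                MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
                MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t = (MPI_Wtime() - t0) / (2.0 * REPS);   /* one-way time estimate */
        if (myrank == 0)
            printf("n = %8d bytes, time = %g seconds\n", n, t);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}

Fitting a straight line to the (n, time) pairs gives rough values for the latency α (the intercept) and the inverse bandwidth β (the slope).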

2.2.5 Who's got the floor?

We usually think of send and receive as the basic message passing mechanism. But they're not the whole story by a long shot. If we wrote codes that had genuinely different, independent, asynchronous processes that interacted in response to random events, then send and receive would be used to do all the work. Now consider computing a dot product of two identically distributed vectors. Each processor does a local dot product of its pieces, producing one scalar value per processor. Then we need to add them together and, probably, broadcast the result. Can we do this with send and receive? Sure. Do we want to? No. No because it would be a pain in the neck to write and because the MPI system implementor may be able to provide the two necessary and generally useful collective operations (sum and broadcast) for us in a more efficient implementation.

MPI has lots of these “collective communication” functions. (And like the sum operation, they often do computation as part of the communication.) Here's a sum operation, on doubles. The variable sum on process root gets the sum of the variables x on all the processes.

double x, sum;
int root, count = 1;

MPI_Reduce(&x, &sum, count, MPI_DOUBLE, MPI_SUM, root, MPI_COMM_WORLD);

The fifth argument specifies the operation; other possibilities are MPI_MAX, MPI_LAND, MPI_BOR, ... which specify maximum, logical AND, and bitwise OR, for example.

The semantics of the collective communication calls are subtle to this extent: nothing happens except that a process stops when it reaches such a call, until all processes in the specified group reach it. Then the reduction operation occurs and the result is placed in the sum variable on the root processor. Thus, reductions provide what is called a barrier synchronization.
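For the dot product described above, the whole pattern (local work, global sum, broadcast of the result) can be expressed with a single collective. The fragment below is our illustration, not from the original notes; it assumes x and y are this process's local pieces of the two distributed vectors and nlocal is their length, and it uses MPI_Allreduce, the collective that performs the reduction and leaves the answer on every process.

double local = 0.0, global;
int i;

for (i = 0; i < nlocal; i++)      /* dot product of the local pieces */
  local += x[i] * y[i];

/* sum the per-process results and give every process the total */
MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);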

There are quite a few collective communication operations provided by MPI, all of them useful and important. We will use several in the assignment. To mention a few: MPI_Bcast broadcasts a vector from one process to the rest of its process group; MPI_Scatter sends different data from one process to each process in its group; MPI_Gather is the inverse of a scatter: one process receives and concatenates data from all processes in its group; MPI_Allgather is like a gather followed by a broadcast: all processes receive the concatenation of data that are initially distributed among them; MPI_Reduce_scatter is like a reduce followed by a scatter: the result of the reduction ends up distributed among the process group. Finally, MPI_Alltoall implements a very general communication in which each process has a separate message to send to each member of the process group.
Often the process group in a collective communication is some subset of all the processors. In a typical situation, we may view the processes as forming a grid, say a two-dimensional grid. We may want to do a reduction operation within rows of the process grid. For this purpose, we can use MPI_Reduce with a separate communicator for each row. To make this work, each process first computes its coordinates in the process grid. MPI makes this easy with

int nprocs, procdims[2] = {0, 0}, myproc_row, myproc_col;
MPI_Dims_create(nprocs, 2, procdims);   /* procdims must start at zero so both dimensions are chosen */
myproc_row = myrank / procdims[1];
myproc_col = myrank % procdims[1];

Next, one creates new communicators, one for each process row and one for each process column. The calls

MPI_Comm my_prow, my_pcol;
MPI_Comm_split(MPI_COMM_WORLD, myproc_row, 0, &my_prow);
MPI_Comm_split(MPI_COMM_WORLD, myproc_col, 0, &my_pcol);

create them, and

MPI_Comm_free(&my_prow);
MPI_Comm_free(&my_pcol);

free them. The reduce-in-rows call is then

MPI_Reduce(&x, &sum, count, MPI_DOUBLE, MPI_SUM, 0, my_prow);

which leaves the sum of the vectors x in the vector sum on the process whose rank in the row group is zero; this is the first process in the row. The general form is

int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)

As in the example above, the group associated with comm is split into disjoint subgroups, one for every different value of color; the communicator for the subgroup that this process belongs to is returned in newcomm. The argument key determines the rank of this process within newcomm; the members are ranked according to their values of key, with ties broken using the rank in comm.
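Putting the fragments above together, here is a minimal self-contained sketch (not from the original notes) that builds the row and column communicators and does a row-wise reduction. Using the column index as the key makes rank 0 in each row communicator the first process in that row.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int nprocs, myrank, procdims[2] = {0, 0};
    int myproc_row, myproc_col;
    MPI_Comm my_prow, my_pcol;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    MPI_Dims_create(nprocs, 2, procdims);        /* choose a near-square process grid */
    myproc_row = myrank / procdims[1];
    myproc_col = myrank % procdims[1];

    MPI_Comm_split(MPI_COMM_WORLD, myproc_row, myproc_col, &my_prow);
    MPI_Comm_split(MPI_COMM_WORLD, myproc_col, myproc_row, &my_pcol);

    double x = (double) myrank, rowsum;
    MPI_Reduce(&x, &rowsum, 1, MPI_DOUBLE, MPI_SUM, 0, my_prow);
    if (myproc_col == 0)                         /* first process in each row holds the result */
        printf("row %d: sum of ranks = %g\n", myproc_row, rowsum);

    MPI_Comm_free(&my_prow);
    MPI_Comm_free(&my_pcol);
    MPI_Finalize();
    return 0;
}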

2.3 More on Message Passing

2.3.1 Nomenclature

Performance:
• Latency: the time it takes to send a zero-length message (overhead)
• Bandwidth: the rate at which data can be sent

Synchrony:
• Synchronous: the sending function returns when a matching receive operation has been initiated at the destination.
• Blocking: the sending function returns when the message has been safely copied.
• Non-blocking (asynchronous): the sending function returns immediately. The message buffer must not be changed until it is safe to do so.

Miscellaneous:
• Interrupts: if enabled, the arrival of a message interrupts the receiving processor
• Polling: the act of checking for the occurrence of an event
• Handlers: code written to execute when an interrupt occurs
• Critical sections: sections of the code which cannot be interrupted safely
• Scheduling: the process of determining which code to run at a given time
• Priority: a relative statement about the ranking of an object according to some metric

2.3.2 The Development of Message Passing

Early multicomputers:
• UNIVAC
• Goodyear MPP (SIMD)
• Denelcor HEP (shared memory)

The Internet and Berkeley Unix (high latency, low bandwidth):
• Support for communication between computers is part of the operating system (sockets, ports, remote devices)
• Client/server applications appear

Parallel machines (lowest latency, high bandwidth; dedicated networks):
• Ncube
• Intel (hypercubes, Paragon)
• Meiko (CS-1, CS-2)
• TMC (CM-5)

Workstation clusters (lower latency, high bandwidth, popular networks):
• IBM (SP1, SP2): optional proprietary network

Software (libraries):
• CrOS, CUBIX, NX, Express
• Vertex
• EUI/MPL (adjusts to the machine operating system)
• CMMD

• PVM (supports heterogeneous machines; dynamic machine configuration)
• PARMACS, P4, Chameleon
• MPI (an amalgam of everything above; adjusts to the operating system)

Systems:
• Mach
• Linda (object-based system)

2.3.3 Machine Characteristics

Nodes:
• CPU, local memory, perhaps local I/O

Networks:
• Topology: hypercube, mesh, fat-tree, other
• Routing: circuit, packet, wormhole, virtual channel, random
• Bisection bandwidth (how much data can cross between the two halves of the network)
• Reliable delivery
• Flow control
• “State”: space sharing, timesharing, stateless

Network interfaces:
• Dumb: FIFO, control registers
• Smart: DMA (Direct Memory Access) controllers
• Very smart: handle protocol management

2.3.4 Active Messages [Not yet written]

2.4 OpenMP for Shared Memory Parallel Programming

OpenMP is the current industry standard for shared-memory parallel programming directives. Jointly defined by a group of major computer hardware and software vendors, the OpenMP Application Program Interface (API) supports shared-memory parallel programming in C/C++ and Fortran on Unix and Windows NT platforms. The OpenMP specifications are owned and managed by the OpenMP Architecture Review Board (ARB). For detailed information about OpenMP, we refer readers to its website at

http://www.openmp.org/.

OpenMP stands for Open specifications for Multi Processing. It specifies a set of compiler directives, library routines, and environment variables as an API for writing multithreaded applications in C/C++ and Fortran. With these directives, or pragmas, multithreaded code is generated by the compiler. OpenMP is a shared-memory model: threads communicate by sharing variables, and the cost of communication comes from the synchronization of data sharing needed to protect against data conflicts. Compiler pragmas in OpenMP take the following forms:

• for C and C++,

#pragma omp construct [clause [clause]...],

• for Fortran,

C$OMP construct [clause [clause]...]
!$OMP construct [clause [clause]...]
*$OMP construct [clause [clause]...]

For example, OpenMP is usually used to parallelize loops; a simple parallel C program with its loop split up looks like

void main() {
    double Data[10000];
#pragma omp parallel for
    for (int i = 0; i < 10000; i++) {
        task(Data[i]);
    }
}

If all the OpenMP constructs in a program are compiler pragmas, then the program can also be compiled by compilers that do not support OpenMP. OpenMP's constructs fall into five categories, which are briefly introduced as follows:

1. Parallel Regions. Threads are created with the “omp parallel” pragma. In the following example, an eight-thread parallel region is created:

double Data[10000];
omp_set_num_threads(8);
#pragma omp parallel
{
    int ID = omp_get_thread_num();
    task(ID, Data);
}

Each thread calls task(ID, Data) for ID = 0 to 7.

2. Work Sharing. The “for” work-sharing construct splits up loop iterations among the threads of a parallel region. The pragma used in our first example is the shorthand form of the “omp parallel” and “omp for” pragmas:

void main() {
    double Data[10000];
#pragma omp parallel
#pragma omp for
    for (int i = 0; i < 10000; i++) {
        task(Data[i]);
    }
}

“Sections” is another work-sharing construct which assigns different jobs (different pieces of code) to each thread in a parallel region. For example,

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        { job1(); }
        #pragma omp section
        { job2(); }
    }
}

Similarly, there is also a “parallel sections” construct.

3. Data Environment. In the shared-memory programming model, global variables are shared among threads; in C these are file-scope and static variables, and in Fortran they are common blocks, SAVE variables, and module variables. Automatic variables within a statement block are private, as are stack variables in subprograms called from parallel regions. Constructs and clauses are available to selectively change storage attributes; a short sketch using some of these clauses appears at the end of this list.

• The “shared” clause uses a shared-memory model for the variable: all threads share the same variable.
• The “private” clause gives each thread a local copy of the variable.
• “firstprivate” is like “private”, but the variable is initialized with its value before the thread creation.
• “lastprivate” is like “private”, but the value is passed back to a global variable after the thread execution.

4. Synchronization The following constructs are used to support synchronization:

• The “critical” and “end critical” constructs define a critical region, which only one thread can enter at a time.
• The “atomic” construct defines a critical region that contains only one simple statement.
• The “barrier” construct is usually implicit, for example at the end of a parallel region or at the end of a “for” work-sharing construct. The barrier construct makes threads wait until all of them arrive.
• The “ordered” construct enforces sequential order for a block of code.
• The “master” construct marks a block of code to be executed only by the master thread; the other threads just skip it.

• The “single” construct marks a block of code to be executed by only one thread.
• The “flush” construct denotes a sequence point at which a thread tries to create a consistent view of memory.

5. Runtime functions and environment variables. Besides constructs, there are clauses, library routines, and environment variables in OpenMP. We list a few of the most important below; please refer to OpenMP's web site for more details.
• The number of threads can be controlled using the runtime routines “omp_set_num_threads()”, “omp_get_num_threads()”, “omp_get_thread_num()”, and “omp_get_max_threads()”.
• The number of processors in the system is given by “omp_get_num_procs()”.
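The short sketch below (not from the original notes) ties several of these pieces together: a parallel for loop, the firstprivate clause, a critical section, and one of the runtime routines. Compile with an OpenMP-capable compiler (for example with -fopenmp); the numbers are made up.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 8;
    int base = 100;       /* firstprivate: each thread starts with a copy initialized to 100 */
    int total = 0;        /* shared: all threads update the same variable */

    #pragma omp parallel for firstprivate(base) shared(total)
    for (int i = 0; i < n; i++) {
        int local = base + i;     /* automatic variable inside the loop: private to each thread */
        #pragma omp critical
        total += local;           /* only one thread at a time executes this update */
    }

    printf("total = %d (run with up to %d threads)\n", total, omp_get_max_threads());
    return 0;
}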

2.5 STARP

Star-P is a set of extensions to Matlab aimed at making it simple to parallelize many common computations. A running instance of Star-P consists of a control Matlab process (connected to the Matlab notebook/desktop the user sees) and a slave Matlab process for each processor available to the system. It permits a Matlab user to declare distributed matrix objects, whose contents are distributed in various ways across the processors available to Star-P, and to direct each processor to apply a Matlab function to the portion of a distributed matrix it stores. It also supplies parallel versions of many of the dense (and some of the sparse) matrix operations which occur in scientific computing, such as eigenvector/eigenvalue computations. This makes Star-P especially convenient for embarrassingly parallel problems, as all one needs to do is:

1. Write a Matlab function which carries out a part of the embarrassingly parallel problem, parametrized so that it can take the inputs it needs as the rows or columns of a matrix.
2. Construct a distributed matrix containing the parameters needed for each slave process.
3. Tell each slave to apply the function to the portion of the matrix it has.
4. Combine the results from each process.

Now, some details. Star-P defines the variable np to store the number of slave processes it controls. On a machine with two apparent processors, for example, np would equal 2. Distributed matrices are declared in Star-P by appending *p to the dimension along which you want your matrix to be distributed. For example, the code below declares A to be a row-distributed matrix; each processor theoretically gets an equal chunk of rows from the matrix.

A = ones(100*p, 100);

To declare a column-distributed matrix, one simply does:

A = ones(100, 100*p);

Beware: if the number of processors you're working with does not evenly divide the size of the dimension you are distributing along, you may encounter unexpected results. Star-P also supports matrices which are distributed into blocks, by appending *p to both dimensions; we will not go into the details of this here. After declaring A as a distributed matrix, simply evaluating it will yield something like:

A =
ddense object: 100-by-100

This is because the elements of A are stored on the slave processors. To bring the contents to the control process, simply index A; for example, if you wanted to view the entire contents, you would do A(:,:). To apply a Matlab function to the chunks of a distributed matrix, use the mm command. It takes a string containing a procedure name and a list of arguments, each of which may or may not be distributed. It orders each slave to apply the function to the chunk of each distributed argument (echoing non-distributed arguments) and returns a matrix containing the results appended together. For example, mm('fft', A) (with A defined as a 100-by-100 column-distributed matrix) would apply the fft function to each processor's chunk of columns (25-column chunks on four slaves, for instance). Beware: the chunks must each be distributed in the same way, and the function must return chunks of the same size. Also beware: mm is meant to apply a function to chunks. If you want to compute the two-dimensional fft of A in parallel, do not use mm('fft2', A); that will compute (in serial) the fft2 of each chunk of A and return them in a matrix. eig(A), on the other hand, will apply the parallel algorithm for eigenstuff to A.
Communication between processors must be mediated by the control process. This incurs substantial communication overhead, as the information must first be moved to the control process, processed, then sent back. It also necessitates the use of some unusual programming idioms; one common pattern is to break up a parallel computation into steps, call each step using mm, then do matrix column or row swapping in the control process on the distributed matrix to move information between the processors. For example, given a matrix B = randn(10, 2*p) (on a Star-P process with two slaves), the command B = B(:,[2,1]) will swap elements between the two processors. Some other things to be aware of:

1. Star-P provides its functionality by overloading variables and functions from Matlab. This means that if you overwrite certain variable names (or define your own versions of certain functions), they will shadow the parallel versions. In particular, DO NOT declare a variable named p; if you do, instead of distributing matrices when *p is appended, you will multiply each element by your variable p.

2. persistent variables are often useful for keeping state across stepped computations. The first time the function is called, each persistent variable will be bound to the empty matrix []; a simple if can test this and initialize it the first time around. Its value will then be stored across multiple invocations of the function. If you use this technique, make sure to clear those variables (or restart Star-P) to ensure that state isn’t propagated across runs.

3. To execute a different computation depending on processor id, create a matrix id = 1:np and pass it as an argument to the function you call using mm. Each slave process will get a different value of id, which it can use as a unique identifier.

4. If mm doesn’t appear to find your m-files, run the mmpath command (which takes one argument - the directory you want mm to search in).

Have fun!

Lecture 3

Parallel Prefix

3.1 Parallel Prefix

An important primitive for (data) parallel computing is the scan operation, also called prefix sum, which takes an associative binary operator ⊕ and an ordered set [a_1, . . . , a_n] of n elements and returns the ordered set

    [a_1, (a_1 ⊕ a_2), . . . , (a_1 ⊕ a_2 ⊕ · · · ⊕ a_n)].

For example,

    plus_scan([1, 2, 3, 4, 5, 6, 7, 8]) = [1, 3, 6, 10, 15, 21, 28, 36].

Notice that computing the scan of an n-element array requires n − 1 serial operations.
Suppose we have n processors, each with one element of the array. If we are interested only in the last element b_n, which is the total sum, then it is easy to see how to compute it efficiently in parallel: we can just break the array recursively into two halves, and add the sums of the two halves, recursively. Associated with the computation is a complete binary tree, each internal node containing the sum of its descendant leaves. With n processors, this algorithm takes O(log n) steps. If we have only p < n processors, we can break the array into p subarrays, each with roughly ⌈n/p⌉ elements. In the first step, each processor adds its own elements. The problem is then reduced to one with p elements, so we can perform the log p time algorithm. The total time is clearly O(n/p + log p), and communication occurs only in the second step. With an architecture like a hypercube or a fat tree, we can embed the complete binary tree so that the communication is performed directly by communication links.
Now we discuss a parallel method of finding all elements [b_1, . . . , b_n] = ⊕ scan[a_1, . . . , a_n], also in O(log n) time, assuming we have n processors, each with one element of the array. The following Parallel Prefix algorithm computes the scan of an array. Function scan([a_i]):

1. Compute pairwise sums, communicating with the adjacent processor:
   c_i := a_{i−1} ⊕ a_i   (if i is even)

2. Compute the even entries of the output by recursing on the size-n/2 array of pairwise sums:
   b_i := scan([c_i])   (if i is even)

3. Fill in the odd entries of the output with a pairwise sum:
   b_i := b_{i−1} ⊕ a_i   (if i is odd)

4. Return [b_i].

(A serial sketch of this recursion appears below.)
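The following minimal C sketch (not from the original notes) simulates steps 1-4 serially, for n a power of two; in the parallel algorithm each array element would live on its own processor.

#include <stdlib.h>

/* Recursive inclusive plus-scan following steps 1-4 above (0-based indices). */
void scan(const double *a, double *b, int n) {
    if (n == 1) { b[0] = a[0]; return; }
    int m = n / 2;
    double *c = malloc(m * sizeof(double));
    double *d = malloc(m * sizeof(double));
    for (int i = 0; i < m; i++)                   /* step 1: pairwise sums */
        c[i] = a[2*i] + a[2*i + 1];
    scan(c, d, m);                                /* step 2: recurse on the half-size array */
    for (int i = 0; i < m; i++) {
        b[2*i + 1] = d[i];                        /* even (1-based) entries come from the recursion */
        b[2*i] = (i == 0) ? a[0]                  /* step 3: odd entries add one more element */
                          : d[i - 1] + a[2*i];
    }
    free(c);
    free(d);
}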


[Figure: complete binary tree on the input [1 2 3 4 5 6 7 8]. The left panel, “Up the tree”, shows the pairwise sums at each level; the right panel, “Down the tree (Prefix)”, shows the prefix values, ending with 1 3 6 10 15 21 28 36 at the leaves.]

Figure 3.1: Action of the Parallel Prefix algorithm.

[Figure: the same tree for the exclusive scan. The “Up the tree” panel is identical; the “Down the tree (Prefix Exclude)” panel ends with 0 1 3 6 10 15 21 28 at the leaves.]

Figure 3.2: The Parallel Prefix Exclude Algorithm.

An example using the vector [1, 2, 3, 4, 5, 6, 7, 8] is shown in Figure 3.1. Going up the tree, we simply compute the pairwise sums. Going down the tree, we use the updates according to points 2 and 3 above: for even positions, we use the value of the parent node (b_i); for odd positions, we add the value of the node to the left of the parent node (b_{i−1}) to the current value (a_i).
We can create variants of the algorithm by modifying the update formulas 2 and 3. For example, the excluded prefix sum

    [0, a_1, (a_1 ⊕ a_2), . . . , (a_1 ⊕ a_2 ⊕ · · · ⊕ a_{n−1})]

can be computed using the rules:

    b_i := excl_scan([c_i])   (if i is odd),   (3.1)

    b_i := b_{i−1} ⊕ a_{i−1}   (if i is even).   (3.2)

Figure 3.2 illustrates this algorithm using the same input vector as before.
The total number of ⊕ operations performed by the Parallel Prefix algorithm is (ignoring an additive constant of 1):

    T_n = n/2 + T_{n/2} + n/2        (I: pairwise sums; II: recursion; III: filling in odd entries)
        = n + T_{n/2}
        = 2n

If there is a processor for each array element, then the number of parallel operations is:

    T_n = 1 + T_{n/2} + 1        (I: pairwise sums; II: recursion; III: filling in odd entries)
        = 2 + T_{n/2}
        = 2 lg n

3.2 The “Myth” of lg n

In practice, we usually do not have a processor for each array element. Instead, there will likely be many more array elements than processors. For example, if we have 32 processors and an array of 32000 numbers, then each processor should store a contiguous section of 1000 array elements. Suppose we have n elements and p processors, and define k = n/p. Then the procedure to compute the scan is:

1. At each processor i, compute a local scan serially, for n/p consecutive elements, giving the result [d_1^i, d_2^i, . . . , d_k^i]. Notice that this step vectorizes over processors.

2. Use the parallel prefix algorithm to compute

    scan([d_k^1, d_k^2, . . . , d_k^p]) = [b_1, b_2, . . . , b_p]

3. At each processor i > 1, add b_{i−1} to all elements d_j^i.

The time taken for this will be

    T = 2 · (time to add and store n/p numbers serially) + 2 · (log p) · (communication time up and down a tree, and a few adds)

In the limiting case p ≪ n, the lg p message passes are an insignificant portion of the computational time, and the speedup is due solely to the availability of a number of processors each doing its part of the prefix operation serially.

3.3 Applications of Parallel Prefix

3.3.1 Segmented Scan

We can extend the parallel scan algorithm to perform segmented scan. In segmented scan, the original sequence is used along with an additional sequence of booleans, which are used to identify the start of a new segment. Segmented scan is simply prefix scan with the additional condition that the sum starts over at the beginning of a new segment. Thus the following inputs would produce the following result when applying segmented plus_scan on the array A and boolean array C.

    A               = [1 2 3  4 5  6 7 8  9 10]
    C               = [1 0 0  0 1  0 1 1  0  1]
    plus_scan(A, C) = [1 3 6 10 5 11 7 8 17 10]

We now show how to reduce segmented scan to simple scan. We define an operator ⊕_2 whose operands are pairs (x, y), elements of the 2-element representation of A and C, where x and y are corresponding elements from the vectors A and C. The operands of the example above are given as:

    (1,1), (2,0), (3,0), (4,0), (5,1), (6,0), (7,1), (8,1), (9,0), (10,1)

The operator ⊕_2 is defined as follows, writing the left operand as (x, f) and the right operand as (y, g):

    (x, 0) ⊕_2 (y, 0) = (x ⊕ y, 0)
    (x, 1) ⊕_2 (y, 0) = (x ⊕ y, 1)
    (x, 0) ⊕_2 (y, 1) = (y, 1)
    (x, 1) ⊕_2 (y, 1) = (y, 1)

As an exercise, we can show that the binary operator ⊕_2 defined above is associative and exhibits the segmenting behavior we want: for each vector A and each boolean vector C, let AC be the 2-element representation of A and C. For each binary associative operator ⊕, the result of ⊕_2 scan(AC) is a 2-element vector whose first row is equal to the vector computed by segmented ⊕ scan(A, C). Therefore, we can apply the parallel scan algorithm to compute the segmented scan. Notice that the method of assigning each segment to a separate processor may result in load imbalance.
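A minimal C sketch of the pair operator (not from the original notes); the serial loop below stands in for the parallel scan, simply to show that the operator reproduces the segmented plus_scan of the example above.

#include <stdio.h>

typedef struct { double x; int flag; } pair;   /* (value, segment-start flag) */

/* The operator defined above, specialized to the plus operation. */
static pair op2(pair a, pair b) {
    pair r;
    r.x = b.flag ? b.x : a.x + b.x;   /* restart the sum at a segment boundary */
    r.flag = a.flag | b.flag;
    return r;
}

int main(void) {
    double A[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int    C[10] = {1, 0, 0, 0, 1, 0, 1, 1, 0, 1};
    pair acc = {A[0], C[0]};
    printf("%g ", acc.x);
    for (int i = 1; i < 10; i++) {            /* serial inclusive scan with op2 */
        pair next = {A[i], C[i]};
        acc = op2(acc, next);
        printf("%g ", acc.x);                 /* prints 1 3 6 10 5 11 7 8 17 10 */
    }
    printf("\n");
    return 0;
}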

3.3.2 Csanky’s Matrix Inversion

The Csanky matrix inversion algorithm is representative of a number of the problems that arise in applying theoretical parallelization schemes to practical problems. The goal here is to create a matrix inversion routine that can be extended to a parallel implementation. A typical serial implementation would require the solution of O(n^2) linear equations, and the problem at first looks unparallelizable. The obvious solution, then, is to search for a parallel-prefix-type algorithm. Csanky's algorithm can be described as follows. The Cayley-Hamilton lemma states that for a given matrix A:

    p(x) = det(xI − A) = x^n + c_1 x^{n−1} + · · · + c_n,

where c_n = (−1)^n det(A). Then

    p(A) = 0 = A^n + c_1 A^{n−1} + · · · + c_{n−1} A + c_n I.

Multiplying each side by A^{−1} and rearranging yields:

    A^{−1} = −(A^{n−1} + c_1 A^{n−2} + · · · + c_{n−1} I) / c_n

The c_i in this equation can be calculated by Leverrier's lemma, which relates the c_i to the traces s_k = tr(A^k). The Csanky algorithm, then, is to calculate the powers A^i by parallel prefix, compute the trace of each A^i, calculate the c_i from Leverrier's lemma, and use these to generate A^{−1}.

Figure 3.3: Babbage’s Difference Engine, reconstructed by the Science Museum of London

While the Csanky algorithm is useful in theory, it suffers from a number of practical shortcomings. The most glaring problem is the repeated multiplication of the matrix A. Unless the coefficients of A are very close to 1, the terms of A^n are likely to increase towards infinity or decay to zero quite rapidly, making their storage as floating-point values very difficult. Therefore, the algorithm is inherently unstable.

3.3.3 Babbage and Carry Look-Ahead Addition

Charles Babbage is considered by many to be the founder of modern computing. In the 1820s he pioneered the idea of mechanical computing with his design of a “Difference Engine,” the purpose of which was to create highly accurate engineering tables.
A central concern in mechanical addition procedures is the idea of “carrying”: for example, the overflow caused by adding two digits in decimal notation whose sum is greater than or equal to 10. Carrying, as is taught to elementary school children everywhere, is inherently serial, as the two numbers are added digit by digit. However, the carrying problem can be treated in a parallel fashion by use of parallel prefix. More specifically, consider:

       c3 c2 c1 c0     Carry
       a3 a2 a1 a0     First Integer
   +   b3 b2 b1 b0     Second Integer
    s4 s3 s2 s1 s0     Sum

By algebraic manipulation, one can create a transformation matrix for computing c_i from c_{i−1}:

    ( c_i )   ( a_i + b_i   a_i b_i ) ( c_{i−1} )
    (  1  ) = (     0          1    ) (    1    )

Thus, carry look-ahead can be performed by parallel prefix: each c_i is computed by parallel prefix over these matrices, and then the s_i are calculated in parallel.
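The sketch below (not from the original notes) illustrates the associative operator hiding in these matrices, reading a_i + b_i as Boolean OR and a_i b_i as Boolean AND: combining position i and then position j gives propagate p = p_j AND p_i and generate g = g_j OR (p_j AND g_i). A serial fold is shown for clarity; a parallel prefix over the same pairs computes all carries in O(log n).

#include <stdio.h>

int main(void) {
    /* Hypothetical 4-bit inputs, least significant bit first: a = 13, b = 11. */
    int a[4] = {1, 0, 1, 1};
    int b[4] = {1, 1, 0, 1};
    int c0 = 0;                    /* carry in */
    int P = 1, G = 0;              /* identity element: propagate everything, generate nothing */
    for (int i = 0; i < 4; i++) {
        int pi = a[i] | b[i];      /* a_i + b_i in the matrix above (Boolean OR)  */
        int gi = a[i] & b[i];      /* a_i b_i   in the matrix above (Boolean AND) */
        G = gi | (pi & G);         /* fold position i into the running prefix     */
        P = pi & P;
        int carry = G | (P & c0);  /* carry out of position i, i.e. c_{i+1}       */
        printf("c%d = %d\n", i + 1, carry);
    }
    return 0;
}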

3.4 Parallel Prefix in MPI

The MPI version of “parallel prefix” is performed by MPI_Scan. From Using MPI by Gropp, Lusk, and Skjellum (MIT Press, 1999):

[MPI_Scan] is much like MPI_Allreduce in that the values are formed by combining values contributed by each process and that each process receives a result. The difference is that the result returned by the process with rank r is the result of operating on the input elements on processes with rank 0, 1, . . . , r.

Essentially, MPI_Scan operates locally on a vector and passes a result to each processor. If the operation given to MPI_Scan is MPI_SUM, the result passed to each process is the partial sum including the numbers on the current process.
MPI_Scan, upon further investigation, is not necessarily a true parallel prefix algorithm. It appears that, in the implementation examined, the partial sum from each process is passed to the next process in a serial manner; that is, the message-passing portion of MPI_Scan does not scale as lg p, but rather simply as p. However, as discussed in Section 3.2, the message-passing time is so small in large systems that it can often be neglected.
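Here is a minimal sketch (not from the original notes) of the three-step block method of Section 3.2, with MPI_Scan supplying the cross-processor step; the per-process block length K and the sample data are made up for the example.

#include <mpi.h>
#include <stdio.h>

#define K 4                                   /* hypothetical elements per process */

int main(int argc, char **argv) {
    double local[K], total = 0.0, offset;
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < K; i++) local[i] = rank * K + i + 1;   /* sample data 1, 2, 3, ... */

    for (i = 0; i < K; i++) {                 /* step 1: local serial scan */
        total += local[i];
        local[i] = total;
    }

    /* step 2: inclusive scan of the per-process totals; subtracting our own
       total turns it into the sum over all earlier processes */
    MPI_Scan(&total, &offset, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    offset -= total;

    for (i = 0; i < K; i++) local[i] += offset;            /* step 3: shift by the offset */

    printf("rank %d: last prefix sum = %g\n", rank, local[K - 1]);
    MPI_Finalize();
    return 0;
}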

Lecture 4

Dense Linear Algebra

4.1 Dense Matrices

We next look at dense linear systems of the form Ax = b. Here A is a given n × n matrix and b is a given n-vector; we need to solve for the unknown n-vector x. We shall assume that A is a nonsingular matrix, so that for every b there is a unique solution x = A^{−1}b. Before solving dense linear algebra problems, we should define the terms sparse, dense, and structured.

Definition. (Wilkinson) A sparse matrix is a matrix with enough zeros that it is worth taking advantage of them.

Definition. A structured matrix has enough structure that it is worthwhile to use it.
For example, a Toeplitz matrix is defined by 2n − 1 parameters; all entries on a diagonal are the same:

    Toeplitz matrix = ( 1  4                )
                      ( 2  1  4             )
                      (    2  1  4          )
                      (       .  .  .       )
                      (          2  1  4    )
                      (             2  1    )

Definition. A dense matrix is neither sparse nor structured.
These definitions are useful because they help us identify whether or not there is any inherent parallelism in the problem itself. It is clear that a sparse matrix does indeed have an inherent structure to it that may conceivably result in performance gains due to parallelism. We will discuss ways to exploit this in the next chapter. The Toeplitz matrix also has some structure that may be exploited to realize some performance gains. When we formally identify these structures, a central question we seem to be asking is: is it worth taking advantage of this? The answer, as in all of parallel computing, is "it depends". If n = 50, hardly: the standard O(n^3) algorithm for ordinary matrices can solve Tx = b, ignoring its structure, in under one one-hundredth of a second on a workstation. On the other hand, for large n, it pays to use one of the "fast" algorithms that run in time O(n^2) or even O(n log^2 n).
What do we mean by "fast" in this case? It certainly seems intuitive that the more we know about how the matrix entries are populated, the better we should be able to design our algorithms to exploit this knowledge. However, for the

purposes of our discussion in this chapter, this does not seem to help for dense matrices. This does not mean that any parallel dense linear algebra algorithm should be conceived or even coded in a serial manner. What it does mean, however, is that we should look particularly hard at what the matrix A represents in the practical application for which we are trying to solve the equation Ax = b. By examining the matrix carefully, we might indeed recognize some other, less obvious, 'structure' that we might be able to exploit.

4.2 Applications

There are not many applications for large dense linear algebra routines, perhaps due to the “law of nature” below.

• “Law of Nature”: Nature does not throw n^2 numbers at us haphazardly; therefore there are few dense matrix problems.

Some believe that there are no real problems that will turn up n^2 numbers to populate the n × n matrix without exhibiting some form of underlying structure. This implies that we should seek methods to identify the structure underlying the matrix, which becomes particularly important when the size of the system becomes large. What does it mean to 'seek methods to identify the structure'? Plainly speaking, the answer is not known, not just because it is inherently difficult but also because prospective users of dense linear algebra algorithms (as opposed to developers of such algorithms) have not started to identify the structure of their A matrices. Sometimes identifying the structure might involve looking beyond the traditional literature in the field.

4.2.1 Uncovering the structure from seemingly unstructured problems

For example, in communications and radar processing applications, the matrix A can often be modelled as being generated from another n × N matrix X that is in turn populated with independent, identically distributed Gaussian elements. The matrix A in such applications will be symmetric and will be obtained as A = XX^T, where (·)^T is the transpose operator. At first glance, it might seem as though this does not provide any structure that can be exploited besides the symmetry of A. However, this is not so; we simply have to dig a bit deeper.
The matrix A = XX^T is actually a very well studied example in random matrix theory. Edelman has studied these types of problems in his thesis, and what turns out to be important is that in solving Ax = b we need to have a way of characterizing the condition number of A. For matrices, the condition number tells us how 'well behaved' the matrix is. If the condition number is very high, then numerical algorithms are likely to be unstable and there is little guarantee of numerical accuracy. On the other hand, when the condition number is close to 1, the numerical accuracy is very high. It turns out that a mathematically precise characterization of the random condition number of A is possible, and it depends on the dimensions of the matrix X. Specifically, for fixed n, as N is increased (typically to at least 10n), the condition number of A will be fairly localized, i.e., its distribution will not have long tails. On the other hand, when N is about the size of n, the condition number distribution will not be localized, and as a result, when solving x = A^{−1}b, we will get poor numerical accuracy in our solution x.
This is important to remember because, as we have described all along, a central feature of parallel computing is our need to distribute the data among different computing nodes (processors, clusters, etc.), to work on each chunk by itself as much as possible, and then to rely on inter-node

Year      Size of Dense System     Machine
1950's    ≈ 100
1991      55,296
1992      75,264                   Intel
1993      75,264                   Intel
1994      76,800                   CM
1995      128,600                  Intel
1996      128,600                  Intel
1997      235,000                  Intel ASCI Red
1998      431,344                  IBM ASCI Blue
1999      431,344                  IBM ASCI Blue
2000      431,344                  IBM ASCI Blue
2001      518,096                  IBM ASCI White-Pacific
2002      1,041,216                Earth Simulator Computer
2003      1,041,216                Earth Simulator Computer

Table 4.1: Largest Dense Matrices Solved

communication to collect and form our answer. If we did not pay attention to the condition number of A, and correspondingly the condition numbers of the chunks of A that reside on different processors, our numerical accuracy for the parallel computing task would suffer.
This was just one example of how, even in a seemingly unstructured case, insights from another field (random matrix theory in this case) could potentially alter our choice or design of algorithm. Incidentally, even what we just described above has not been incorporated into any parallel applications in radar processing that we are aware of. Generally speaking, the design of efficient parallel dense linear algebra algorithms will have to be motivated by and modified based on specific applications, with an emphasis on uncovering the structure even in seemingly unstructured problems. This, by definition, is something that only users of algorithms can do. Until then, an equally important task is to make dense linear algebra algorithms and libraries that run efficiently regardless of the underlying structure while we wait for the applications to develop.
While there are not too many everyday applications that require dense linear algebra solutions, it would be wrong to conclude that the world does not need large linear algebra libraries. Medium-sized problems are most easily solved with these libraries, and a first pass at larger problems is best done with the libraries. Dense methods are the easiest to use, reliable, predictable, easiest to write, and work best for small to medium problems. For large problems, it is not clear whether dense methods are best, but other approaches often require far more work.

4.3 Records

Table 4.1 shows the largest dense matrices solved. Problems that warrant such huge systems to be solved are typically things like the Stealth bomber and large boundary element codes.¹ Another application for large dense problems arises in the “method of moments”, electromagnetic calculations used by the military.

¹Typically this method involves a transformation using Green's theorem from 3D to a dense 2D representation of the problem. This is where the large data sets are generated.

It is important to understand that space considerations, not processor speeds, are what bound the ability to tackle such large systems; memory is the bottleneck in solving these large dense systems. Only a tiny portion of the matrix can be stored inside the computer at any one time. It is also instructive to look at how technological advances change some of these considerations.
For example, in 1996 the record setter of size n = 128,600 required (2/3)n^3 = 1.4 × 10^15 arithmetic operations (or four times that many if it is a complex matrix) for its solution using Gaussian elimination. On a fast uniprocessor workstation in 1996 running at 140 MFlops/sec, that would take ten million seconds, about 16 and a half weeks; but on a large parallel machine running at 1000 times this speed, the time to solve it is only 2.7 hours. The storage requirement was 8n^2 = 1.3 × 10^13 bytes, however. Can we afford this much main memory? Again, we need to look at it in historical perspective. In 1996, with the price as low as $10 per megabyte, it would have cost $130 million to buy enough memory for the matrix. Today, however, the price of memory is much lower: at 5 cents per megabyte, the memory for the same system would be $650,000. The cost is still prohibitive, but much more realistic.
In contrast, the Earth Simulator, which can solve a dense linear algebra system with n = 1,041,216, would require (2/3)n^3 = 7.5 × 10^17 arithmetic operations (or four times that many if it is a complex matrix) for its solution using Gaussian elimination. For a 2.25 GHz Pentium 4 uniprocessor workstation available today, at a speed of 3 GFlops/sec, this would take 250 million seconds, roughly 414 weeks or about 8 years! On the Earth Simulator, running at its maximum of 35.86 TFlops/sec, or about 10000 times the speed of a desktop machine, this would take only about 5.8 hours. The storage requirement for this machine would be 8n^2 = 8.7 × 10^14 bytes, which at 5 cents a megabyte works out to about $43.5 million. This is still equally prohibitive, although the figurative 'bang for the buck' keeps getting better.
As in 1996, the actual cost of the storage was not as high as we calculated, because in 1996, when most parallel computers were specially designed supercomputers, “out of core” methods were used to store the massive amount of data. In 2004, however, with the emergence of clusters as a viable and powerful supercomputing option, network storage capability and management become an equally important factor that adds to the cost and complexity of the parallel computer. In general, however, Moore's law does seem to be helpful, because the cost per gigabyte, especially for systems with large storage capacity, keeps getting lower. Concurrently, the density of these storage media keeps increasing as well, so the amount of physical space needed to store these systems becomes smaller. As a result, we can expect that as storage systems become cheaper and denser, it becomes increasingly more practical to design and maintain parallel computers. The accompanying figures show some of these trends in storage density and price.

4.4 Algorithms, and mapping matrices to processors

There is a simple-minded view of parallel dense matrix computation that is based on these assumptions:

• one matrix element per processor
• a huge number (n^2, or even n^3) of processors
• communication is instantaneous

Figure 4.1: Storage subsystem cost trends

Figure 4.2: Trend in storage capacity

Figure 4.3: Average price per Mb cost trends

Figure 4.4: Storage density trends

[Figure: the memory hierarchy. Fastest memory: registers; fast memory: cache; slow memory: main memory.]

Figure 4.5: Matrix Operations on 1 processor

This is taught frequently in theory classes, but it has no practical application. Communication cost is critical, and no one can afford n^2 processors when n = 128,000.²
In practical parallel matrix computation, it is essential to have large chunks of each matrix on each processor. There are several reasons for this. The first is simply that there are far more matrix elements than processors! Second, it is important to achieve message vectorization: the communications that occur should be organized into a small number of large messages, because of the high message overhead. Lastly, uniprocessor performance is heavily dependent on the nature of the local computation done on each processor.

4.5 The memory hierarchy

Parallel machines are built out of ordinary sequential processors. Fast microprocessors now can run far faster than the memory that supports them, and the gap is widening. The cycle time of a current microprocessor in a fast workstation is now in the 3 – 10 nanosecond range, while DRAM memory is clocked at about 70 nanoseconds. Typically, the memory bandwidth onto the processor is close to an order of magnitude less than the bandwidth required to support the computation.

²Biological computers have this many processing elements; the human brain has on the order of 10^11 neurons.

To match the bandwidths of the fast processor and the slow memory, several added layers of memory hierarchy are employed by architects. The processor has registers that are as fast as the processing units. They are connected to an on-chip cache that is nearly that fast, but is small (a few tens of thousands of bytes). This is connected to an off-chip level-two cache made from fast but expensive static random access memory (SRAM) chips. Finally, main memory is built from the lowest-cost-per-bit technology, dynamic RAM (DRAM). A similar caching structure supports instruction accesses.
When LINPACK was designed (in the mid 1970s) these considerations were just over the horizon. Its designers used what was then an accepted model of cost: the number of arithmetic operations. Today, a more relevant metric is the number of references to memory that miss the cache and cause a cache line to be moved from main memory to a higher level of the hierarchy. To write portable software that performs well under this metric is unfortunately a much more complex task. In fact, one cannot predict how many cache misses a code will incur by examining the code; one cannot even predict it by examining the machine code that the compiler generates! The behavior of real memory systems is quite complex. But, as we shall now show, the programmer can still write quite acceptable code.
(We have a bit of a paradox in that this issue does not really arise on Cray vector computers. These computers have no cache. They have no DRAM, either! The whole main memory is built of SRAM, which is expensive, and is fast enough to support the full speed of the processor. The high-bandwidth memory technology raises the machine cost dramatically, and makes the programmer's job a lot simpler. When one considers the enormous cost of software, this has seemed like a reasonable tradeoff. Why then aren't parallel machines built out of Cray's fast technology? The answer seems to be that the microprocessors used in workstations and PCs have become as fast as the vector processors. Their usual applications do pretty well with cache in the memory hierarchy, without reprogramming. Enormous investments are made in this technology, which has improved at a remarkable rate. And so, because these technologies appeal to a mass market, they have simply priced the expensive vector machines out of a large part of their market niche.)

4.6 Single processor considerations for dense linear algebra

If software is expected to perform optimally in a parallel computing environment, the performance of the computation on a single processor must first be evaluated.

4.6.1 LAPACK and the BLAS

Dense linear algebra operations are critical to optimize, as they are very compute-bound. Matrix multiplication, with its 2n^3 operations involving 3n^2 matrix elements, is certainly no exception: there is O(n) reuse of the data. If all the matrices fit in the cache, we get high performance. Unfortunately, we use supercomputers for big problems; the definition of “big” might well be “doesn't fit in the cache.”
A typical old-style algorithm, which uses the SDOT routine from the BLAS to do the computation via inner products, is shown in Figure 4.6. This method produces disappointing performance because too many memory references are needed to do an inner product; put another way, if we use this approach we will get O(n^3) cache misses. Table 4.2 shows the data reuse characteristics of several different routines in the BLAS (Basic Linear Algebra Subprograms) library.


Figure 4.6: Matrix Multiply

Instruction                                   Operations   Memory Accesses     Ops/Mem Ref
                                                           (loads/stores)
BLAS1: SAXPY (single precision αx plus y)     2n           3n                  2/3
BLAS1: SDOT (α = x · y)                       2n           2n                  1
BLAS2: matrix-vector (y = Ax + y)             2n^2         n^2                 2
BLAS3: matrix-matrix (C = AB + C)             2n^3         4n^2                n/2

Table 4.2: Basic Linear Algebra Subroutines (BLAS)
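To make the BLAS3 row of Table 4.2 concrete, here is a minimal cache-blocked matrix-multiply sketch (not from the original notes). The block size NB is a made-up tuning parameter; the idea is that three NB-by-NB blocks fit in cache, so each block element is reused NB times rather than once, as in the inner-product formulation of Figure 4.6. Row-major storage and NB dividing n are assumed for simplicity.

/* C = C + A*B for n-by-n row-major matrices, blocked for cache reuse. */
enum { NB = 64 };                               /* hypothetical block size */

void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += NB)
        for (int kk = 0; kk < n; kk += NB)
            for (int jj = 0; jj < n; jj += NB)
                /* multiply one NB-by-NB block of A by one NB-by-NB block of B */
                for (int i = ii; i < ii + NB; i++)
                    for (int k = kk; k < kk + NB; k++) {
                        double aik = A[i*n + k];
                        for (int j = jj; j < jj + NB; j++)
                            C[i*n + j] += aik * B[k*n + j];
                    }
}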

Creators of the LAPACK software library for dense linear algebra accepted the design challenge of enabling developers to write portable software that could minimize costly cache misses on the memory hierarchy of any hardware platform.
The LAPACK designers' strategy for achieving this was to have manufacturers write fast BLAS, especially for the BLAS3; LAPACK codes then call the BLAS, and so LAPACK gets high performance. In reality, two things go wrong: manufacturers don't make much of an investment in their BLAS, and LAPACK does other things besides call the BLAS, so Amdahl's law applies.

4.6.2 Reinventing dense linear algebra optimization

In recent years, a new theory has emerged for achieving optimized dense linear algebra computation in a portable fashion. The theory is based on one of the most fundamental principles of computer science, recursion, yet it escaped experts for many years.

The Fundamental Triangle

In Section 4.5, the memory hierarchy and its effect on performance are discussed. Hardware architecture elements, such as the memory hierarchy, form just one apex of the Fundamental Triangle; the other two are represented by software algorithms and compilers. The Fundamental Triangle is a model for thinking about the performance of computer programs: a comprehensive evaluation of performance cannot be achieved without thinking about these three components and their relationship to each other. For example, as was noted earlier, algorithm designers cannot assume that memory is infinite and that communication is costless; they must consider how the algorithms they write will behave within the memory hierarchy. This section shows how a focus on the interaction between algorithm and architecture can expose optimization possibilities. Figure 4.7 shows a graphical depiction of the Fundamental Triangle.

Figure 4.7: The Fundamental Triangle

Examining dense linear algebra algorithms

Some scalar a(i, j) algorithms may be expressed with square submatrix A(I : I + NB − 1, J : J + NB − 1) algorithms. Also, dense matrix factorization is a BLAS level 3 computation consisting of a series of submatrix computations. Each submatrix computation is BLAS level 3, and each matrix operand in level 3 is used multiple times. BLAS level 3 computation performs O(n^3) operations on O(n^2) data. Therefore, in order to minimize the expense of moving data in and out of cache, the goal is to perform O(n) operations per data movement and to amortize the expense over the largest possible number of operations. The nature of dense linear algebra algorithms provides the opportunity to do just that, with the potential closeness of data within submatrices and the frequent reuse of that data.

Architecture impact

The floating-point arithmetic required for dense linear algebra computation is done on data held in the L1 cache. Operands must be located in the L1 cache in order for multiple reuse of the data to yield peak performance. Moreover, operand data must map well into the L1 cache if reuse is to be possible. Operand data is represented using Fortran/C 2-D arrays. Unfortunately, the matrices that these 2-D arrays represent, and their submatrices, do not map well into the L1 cache. Since memory is one-dimensional, only one dimension of these arrays can be contiguous: for Fortran the columns are contiguous, and for C the rows are contiguous. To deal with this issue, this theory proposes that algorithms be modified to map the input data from the native 2-D array representation to contiguous submatrices that can fit into the L1 cache.

Figure 4.8: Recursive Submatrices

Blocking and Recursion

The principle of re-mapping data to form contiguous submatrices is known as blocking. The specific advantage blocking provides for minimizing data movement depends on the size of the block: a block becomes advantageous at minimizing data movement in and out of a level of the memory hierarchy when the entire block fits in that level in its entirety. Therefore, for example, a certain block size would do well at minimizing register-to-cache data movement, and a different block size would do well at minimizing cache-to-memory data movement. However, the optimal size of these blocks is device dependent, as it depends on the size of each level of the memory hierarchy. LAPACK does do some fixed blocking to improve performance, but its effectiveness is limited because the block size is fixed. Writing dense linear algebra algorithms recursively enables automatic, variable blocking. Figure 4.8 shows how, as the matrix is divided recursively into fours, blocking occurs naturally in sizes of n, n/2, n/4, and so on. It is important to note that in order for these recursive blocks to be contiguous themselves, the 2-D data must be carefully mapped to one-dimensional storage memory. This data format is described in more detail in the next section.

The Recursive Block Format

The Recursive Block Format (RBF) maintains two-dimensional data locality at every level of the one-dimensional tiered memory structure. Figure 4.9 shows the Recursive Block Format for a triangular matrix, an isosceles right triangle of order N. Such a triangle is converted to RBF by dividing each isosceles right triangle leg by two, to get two smaller triangles and one “square” (rectangle).

Cholesky example

By utilizing the Recursive Block Format and by adopting a recursive strategy for dense linear algebra algorithms, concise algorithms emerge. Figure 4.10 shows one node in the recursion tree of a recursive Cholesky algorithm. At this node, Cholesky is applied to a matrix of size n. Note that

Figure 4.9: Recursive Block Format

n need not be the size of the original matrix, as this figure describes a node that could appear anywhere in the recursion tree, not just the root. The lower triangular matrix below the Cholesky node describes the input matrix in terms of its recursive blocks, A11, A21, and A22.

• n1 is computed as n1 = n/2, and n2 = n − n1.
• C(n1) is computed recursively: Cholesky on submatrix A11.
• When C(n1) has returned, L11 has been computed and it replaces A11.
• The DTRSM operation then computes L21 = A21 (L11^T)^{−1}.
• L21 now replaces A21.
• The DSYRK operation uses L21 to do a rank-n1 update of A22.
• C(n2), Cholesky of the updated A22, is now computed recursively, and L22 is returned.

The BLAS operations (i.e., DTRSM and DSYRK) can be implemented using matrix multiply, and the operands to these operations are submatrices of A. This pattern generalizes to other dense linear algebra computations (e.g., general matrix factorization, QR factorization). Every dense linear algebra algorithm calls the BLAS several times, and every one of these BLAS calls has all of its matrix operands equal to submatrices of the matrices A, B, ... of the dense linear algebra algorithm. This pattern can be exploited to improve performance through the use of the Recursive Block Format.
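Here is a minimal serial sketch of the recursive Cholesky pattern just described (not from the original notes): plain loops stand in for the DTRSM and DSYRK calls, the matrix is assumed symmetric positive definite and stored column-major in an ordinary 2-D array with leading dimension lda (not in the Recursive Block Format), and the lower triangle of A is overwritten by L.

#include <math.h>

/* Recursive Cholesky: overwrite the lower triangle of the n-by-n matrix A
   (column-major, leading dimension lda) with its Cholesky factor L. */
void rchol(double *A, int n, int lda) {
    if (n == 1) { A[0] = sqrt(A[0]); return; }
    int n1 = n / 2, n2 = n - n1;
    double *A11 = A;                       /* n1-by-n1 leading block   */
    double *A21 = A + n1;                  /* n2-by-n1 block below it  */
    double *A22 = A + n1 + n1*lda;         /* n2-by-n2 trailing block  */

    rchol(A11, n1, lda);                   /* C(n1): L11 replaces A11  */

    /* "DTRSM": solve L21 * L11^T = A21, so L21 replaces A21 */
    for (int j = 0; j < n1; j++)
        for (int i = 0; i < n2; i++) {
            double s = A21[i + j*lda];
            for (int k = 0; k < j; k++)
                s -= A21[i + k*lda] * A11[j + k*lda];
            A21[i + j*lda] = s / A11[j + j*lda];
        }

    /* "DSYRK": rank-n1 update A22 = A22 - L21 * L21^T (lower triangle only) */
    for (int j = 0; j < n2; j++)
        for (int i = j; i < n2; i++) {
            double s = A22[i + j*lda];
            for (int k = 0; k < n1; k++)
                s -= A21[i + k*lda] * A21[j + k*lda];
            A22[i + j*lda] = s;
        }

    rchol(A22, n2, lda);                   /* C(n2): L22 replaces the updated A22 */
}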

A note on dimension theory

The reason a theory such as the Recursive Block Format has utility for improving computational performance is the mismatch between the dimension of the data and the dimension the hardware can represent. The laws of science relate two- and three-dimensional objects, and we live in a three-dimensional world; computer storage, however, is one-dimensional. Moreover, mathematicians have proved that it is not possible to maintain closeness between points in a neighborhood unless the two objects have the same dimension. Despite this negative theorem, and the limitations it implies on the relationship between data and available computer storage hardware, recursion provides a good approximation. Figure 4.11 shows this graphically via David Hilbert's space-filling curve.

Figure 4.10: Recursive Cholesky

4.7 Parallel computing considerations for dense linear algebra

Load Balancing: We will use Gaussian elimination to demonstrate the advantage of cyclic distribution in dense linear algebra. If we carry out Gaussian elimination on a matrix with a one-dimensional block distribution, then as the computation proceeds, processors on the left-hand side of the machine become idle after all of their columns of the triangular matrices L and U have been computed. This is also the case for two-dimensional block mappings. This is poor load balancing. With cyclic mapping, we balance the load much better. In general, there are two methods to eliminate load imbalance:

• Rearrange the data for better load balancing (cost: communication).
• Rearrange the calculation: eliminate in an unusual order.

So, should we convert the data from consecutive to cyclic order, and back from cyclic to consecutive when we are done? The answer is “no”; the better approach is to reorganize the algorithm rather than the data. The idea behind this approach is to regard matrix indices as a set (not necessarily ordered) instead of an ordered sequence. In general, if you have to rearrange the data, maybe you can rearrange the calculation instead.

Figure 4.11: Hilbert Space Filling Curve

[Figure: a matrix partitioned into a 3-by-3 grid of blocks, numbered 1 through 9.]

Figure 4.12: Gaussian elimination with bad load balancing

Figure 4.13: A stage in Gaussian elimination using cyclic order, where the shaded portion refers to the zeros and the unshaded refers to the non-zero elements

Lesson of Computation Distribution: Matrix indices form a set (unordered), not a sequence (ordered). We have been taught in school to do operations in a linear order, but there is no mathematical reason to do this. As Figure 4.13 demonstrates, we can store data consecutively but do Gaussian elimination cyclically. In particular, if the block size is 10 × 10, the pivots are 1, 11, 21, 31, . . ., 2, 22, 32, . . .. We can apply the above reorganized algorithm in block form, where each processor does one block at a time and cycles through. Here we are using all of our lessons: blocking for vectorization, and rearrangement of the calculation, not the data.
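A tiny sketch (not from the original notes) of the ownership maps being contrasted here: a consecutive (block) assignment of columns to processors versus a cyclic one. The sizes are made up; with the block map, the owners of the leftmost columns fall idle early in the elimination, while the cyclic map keeps every processor involved until the end.

#include <stdio.h>

/* Which of p processors owns column j (0-based) of an n-column matrix? */
int owner_block(int j, int n, int p)  { return j / ((n + p - 1) / p); }   /* consecutive chunks */
int owner_cyclic(int j, int n, int p) { (void)n; return j % p; }          /* round-robin        */

int main(void) {
    int n = 12, p = 3;
    for (int j = 0; j < n; j++)
        printf("column %2d: block -> P%d, cyclic -> P%d\n",
               j, owner_block(j, n, p), owner_cyclic(j, n, p));
    return 0;
}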

4.8 Better load balancing

In reality, the load balancing achieved by the two-dimensional cyclic mapping is not all that one could desire. The problem comes from the fact that the work done by the processor that owns A_ij is a function of i and j, and in fact grows quadratically with i. Thus, the cyclic mapping tends to overload the processors with a larger first processor index, as these tend to get matrix rows that are lower and hence more expensive. A better method is to map the matrix rows to the processor rows using some heuristic method to balance the load. Indeed, this is a further extension of the moral above: the matrix row and column indices do not come from any natural ordering of the equations and unknowns of the linear system; equation 10 has no special affinity for equations 9 and 11.

4.8.1 Problems

1. For performance analysis of the Gaussian elimination algorithm, one can ignore the operations performed outside of the inner loop. Thus, the algorithm is equivalent to

do k = 1, n
   do j = k, n
      do i = k, n
         a(i,j) = a(i,j) - a(i,k) * a(k,j)
      enddo
   enddo
enddo

The “owner” of a(i, j) gets the task of the computation in the inner loop, for all 1 ≤ k ≤ min(i, j). Analyze the load imbalance that occurs in a one-dimensional block mapping of the columns of the matrix: n = bp, and processor r is given the contiguous set of columns (r − 1)b + 1, . . . , rb.
(Hint: Up to low-order terms, the average load per processor is n^3/(3p) inner-loop tasks, but the most heavily loaded processor gets half again as much to do.)
Repeat this analysis for the two-dimensional block mapping. Does this imbalance affect the scalability of the algorithm? Or does it just make a difference in the efficiency by some constant factor, as in the one-dimensional case? If so, what factor?
Finally, do an analysis for the two-dimensional cyclic mapping. Assume that p = q^2 and that n = bq for some blocksize b. Does the cyclic method remove load imbalance completely?

Lecture 5

Sparse Linear Algebra

The solution of a linear system Ax = b is one of the most important computational problems in scientific computing. As we showed in the previous section, these linear systems are often derived from a set of differential equations, by either a finite difference or a finite element formulation over a discretized mesh. The matrix A of a discretized problem is usually very sparse; namely, it has enough zeros that they can be taken advantage of algorithmically. Sparse matrices can be divided into two classes: structured sparse matrices and unstructured sparse matrices. A structured matrix is usually generated from a structured, regular grid, and an unstructured matrix is usually generated from a non-uniform, unstructured grid. Therefore, sparse techniques are designed, in the simplest case, for structured sparse matrices and, in the general case, for unstructured matrices.

5.1 Cyclic Reduction for Structured Sparse Linear Systems

The simplest structured linear system is perhaps the tridiagonal system of linear equations Ax = b, where A is symmetric and positive definite and of the form

        ( b_1  c_1                              )
        ( c_1  b_2  c_2                         )
    A = (        .     .     .                  )
        (        c_{n-2}  b_{n-1}  c_{n-1}      )
        (                 c_{n-1}  b_n          )

For example, the finite difference formulation of the one-dimensional model problem

    −u''(x) + σu(x) = f(x),   0 < x < 1,   σ ≥ 0                          (5.1)

subject to the boundary conditions u(0) = u(1) = 0, on a uniform discretization of spacing h, yields a tridiagonal linear system of n = 1/h variables, where b_i = 2 + σh^2 and c_i = −1 for all 1 ≤ i ≤ n.
Sequentially, we can solve a tridiagonal linear system Ax = b by factoring A into A = LDL^T, where D is a diagonal matrix with diagonal (d_1, d_2, ..., d_n) and L is of the form

        ( 1                         )
    L = ( e_1  1                    )
        (       .     .             )
        (         e_{n-1}  1        )

The factorization can be computed by the following simple algorithm.


Algorithm: Sequential Tridiagonal Solver

1. d_1 = b_1

2. e_1 = c_1/d_1

3. for i = 2 : n

   (a) d_i = b_i - e_{i-1} c_{i-1}

   (b) if i < n then e_i = c_i/d_i
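A direct transcription in Python (a minimal sketch of the algorithm above together with the accompanying solve; this code is ours, and the names tridiag_ldlt and tridiag_solve are illustrative):

import numpy as np

def tridiag_ldlt(b, c):
    """Factor the SPD tridiagonal matrix with diagonal b and off-diagonal c
    as A = L D L^T; returns d (diagonal of D) and e (subdiagonal of L)."""
    n = len(b)
    d = np.empty(n)
    e = np.empty(n - 1)
    d[0] = b[0]
    for i in range(1, n):
        e[i-1] = c[i-1] / d[i-1]
        d[i] = b[i] - e[i-1] * c[i-1]
    return d, e

def tridiag_solve(d, e, c, f):
    """Solve A x = f from the LDL^T factors: forward solve L y = f,
    diagonal solve D z = y, back solve L^T x = z (about 5n flops)."""
    n = len(d)
    y = np.array(f, dtype=float)
    for i in range(1, n):           # L y = f
        y[i] -= e[i-1] * y[i-1]
    y /= d                          # D z = y
    for i in range(n - 2, -1, -1):  # L^T x = z
        y[i] -= e[i] * y[i+1]
    return y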

The number of floating point operations is 3n, up to an additive constant. With such a factorization, we can then solve the tridiagonal linear system with an additional 5n floating point operations. However, this method is very reminiscent of the naive sequential algorithm for the prefix sum, whose computation graph has a critical path of length O(n). Cyclic reduction, developed by Golub and Hockney [?], is very similar to the parallel prefix algorithm presented in the lecture on parallel prefix, and it reduces the length of the dependency chain in the computation graph to the smallest possible. The basic idea of cyclic reduction is to first eliminate the odd-numbered variables to obtain a tridiagonal linear system of ⌈n/2⌉ equations; we then solve the smaller linear system recursively. Note that each variable appears in three equations. The elimination of the odd-numbered variables gives a tridiagonal system over the even-numbered variables, as follows:

c'_{2i-2} x_{2i-2} + b'_{2i} x_{2i} + c'_{2i} x_{2i+2} = f'_{2i},

for all 2 ≤ i ≤ n/2, where

c'_{2i-2} = -(c_{2i-2} c_{2i-1} / b_{2i-1})
b'_{2i} = b_{2i} - c_{2i-1}^2 / b_{2i-1} - c_{2i}^2 / b_{2i+1}
c'_{2i} = -c_{2i} c_{2i+1} / b_{2i+1}

f'_{2i} = f_{2i} - c_{2i-1} f_{2i-1} / b_{2i-1} - c_{2i} f_{2i+1} / b_{2i+1}

Recursively solving this smaller tridiagonal linear system, we obtain the values of x_{2i} for all i = 1, \ldots, n/2. We can then compute the values of the odd-numbered unknowns x_{2i-1} from the simple equation:

x_{2i-1} = (f_{2i-1} - c_{2i-2} x_{2i-2} - c_{2i-1} x_{2i}) / b_{2i-1}.

By a simple calculation, we can show that the total number of floating point operations is equal to 16n, up to an additive constant, so the total work is roughly double that of the sequential algorithm discussed above. But the length of the critical path is reduced to O(log n). It is worthwhile to point out that the total work of the parallel prefix sum algorithm is also double that of its sequential counterpart. Parallel computing is about the trade-off between parallel time and total work. This discussion shows that if we have n processors, then we can solve a tridiagonal linear system in O(log n) time. When the number of processors p is much less than n, similarly to the prefix sum, we hybridize cyclic reduction with the sequential factorization algorithm. We can show that the number of parallel floating point operations is bounded by 16n(n + log n)/p and that the number of rounds of communication is bounded by O(log p). The communication pattern is nearest-neighbor. Cyclic reduction has been generalized to two-dimensional finite difference systems where the matrix is a block tridiagonal matrix.
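A serial simulation of cyclic reduction in Python, useful for checking the update formulas above (this sketch is ours; on a parallel machine each level of the recursion would be performed by all processors simultaneously):

import numpy as np

def cyclic_reduction(b, c, f):
    """Solve the SPD tridiagonal system with diagonal b (length n),
    off-diagonal c (length n-1), and right-hand side f (length n)."""
    b = np.asarray(b, dtype=float); c = np.asarray(c, dtype=float)
    f = np.asarray(f, dtype=float)
    n = len(b)
    if n == 1:
        return f / b
    # 1-based padded copies; the padding lets the formulas ignore boundaries
    B = np.zeros(n + 2); B[1:n+1] = b; B[0] = B[n+1] = 1.0
    C = np.zeros(n + 2); C[1:n] = c        # C[i] couples x_i and x_{i+1}
    F = np.zeros(n + 2); F[1:n+1] = f
    m = n // 2                              # even unknowns x_2, ..., x_{2m}
    bp = np.empty(m); cp = np.zeros(max(m - 1, 0)); fp = np.empty(m)
    for i in range(1, m + 1):               # eliminate around even equation 2i
        j = 2 * i
        bp[i-1] = B[j] - C[j-1]**2 / B[j-1] - C[j]**2 / B[j+1]
        fp[i-1] = F[j] - C[j-1] * F[j-1] / B[j-1] - C[j] * F[j+1] / B[j+1]
        if i < m:
            cp[i-1] = -C[j] * C[j+1] / B[j+1]
    x_even = cyclic_reduction(bp, cp, fp)   # recurse on the half-size system
    X = np.zeros(n + 2)
    X[2:2*m+1:2] = x_even
    x = np.empty(n)
    x[1::2] = x_even
    for i in range(1, (n + 1) // 2 + 1):    # back-substitute the odd unknowns
        j = 2 * i - 1
        x[j-1] = (F[j] - C[j-1] * X[j-1] - C[j] * X[j+1]) / B[j]
    return x

# sanity check against a dense solve
n = 8
b_diag = np.full(n, 2.0); c_off = np.full(n - 1, -1.0); f = np.random.rand(n)
A = np.diag(b_diag) + np.diag(c_off, 1) + np.diag(c_off, -1)
print(np.allclose(cyclic_reduction(b_diag, c_off, f), np.linalg.solve(A, f)))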

5.2 Sparse Direct Methods

Direct methods for solving sparse linear systems are important because of their generality and robustness. For linear systems arising in certain applications, such as linear programming and some structural engineering applications, they are the only feasible methods for numerical factorization.

5.2.1 LU Decomposition and Gaussian Elimination

The basis of direct methods for linear systems is Gaussian elimination, a process in which we zero out certain entries of the original matrix in a systematic way. Assume we want to solve Ax = b where A is a sparse n × n symmetric positive definite matrix. The basic idea of the direct method is to factor A into the product of triangular matrices, A = LL^T. Such a procedure is called Cholesky factorization. The first step of the Cholesky factorization is given by the following matrix factorization:

A = \begin{pmatrix} d & v^T \\ v & C \end{pmatrix}
  = \begin{pmatrix} \sqrt{d} & 0 \\ v/\sqrt{d} & I \end{pmatrix}
    \begin{pmatrix} 1 & 0 \\ 0 & C - vv^T/d \end{pmatrix}
    \begin{pmatrix} \sqrt{d} & v^T/\sqrt{d} \\ 0 & I \end{pmatrix}

where v is (n-1) × 1 and C is (n-1) × (n-1). Note that d is positive since A is positive definite. The term C - vv^T/d is the Schur complement of d in A. This step is called elimination, and the element d is the pivot. The above decomposition is now carried out on the Schur complement recursively. We therefore have the following algorithm for the Cholesky decomposition.

for k = 1, 2, . . . , n
    a(k, k) = sqrt(a(k, k))
    a(k+1 : n, k) = a(k+1 : n, k) / a(k, k)
    a(k+1 : n, k+1 : n) = a(k+1 : n, k+1 : n) - a(k+1 : n, k) a(k+1 : n, k)^T
end

The entries on and below the diagonal of the resulting matrix are the entries of L. The main step in the algorithm is a rank-1 update to an (n-1) × (n-1) block.

Notice that some fill-in may occur when we carry out the decomposition; i.e., L may be significantly less sparse than A. An important problem in the direct solution of sparse linear systems is to find a "good" ordering of the rows and columns of the matrix to reduce the amount of fill. As we showed in the previous section, the matrix of the linear system generated by the finite element or finite difference formulation is associated with the graph given by the mesh. In fact, the nonzero structure of each matrix A can be represented by a graph, G(A), where the rows are represented by vertices and every nonzero off-diagonal element by an edge. An example of a sparse matrix and its corresponding graph is given in Figure 5.1. Note that nonzero entries are marked with a symbol, whereas zero entries are not shown. The fill-in resulting from Cholesky factorization is also illustrated in Figure 5.1. The new graph G+(A) can be computed by looping over the nodes j, in order of the row operations, and adding edges between j's higher-numbered neighbors.

In the context of parallel computation, an important parameter is the height of the elimination tree, which is the number of parallel elimination steps needed to factor with an unlimited number of processors. The elimination tree is defined as follows from the fill-in calculation described above. Let j > k. Define j >_L k if l_{jk} ≠ 0, where l_{jk} is the (j, k) entry of L, the result of the decomposition. Let the parent of k be p(k) = min{ j : j >_L k }. This defines a tree, since if β >_L α, γ >_L α, and γ > β, then γ >_L β. The elimination tree corresponding to our matrix is shown in Figure 5.2.
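A dense version of this right-looking algorithm in Python (a minimal sketch, ours; in the sparse case one would of course skip the operations that only compute zeros):

import numpy as np

def cholesky_outer_product(A):
    """Right-looking (outer-product) Cholesky on a dense SPD matrix."""
    A = A.copy()
    n = A.shape[0]
    for k in range(n):
        A[k, k] = np.sqrt(A[k, k])
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k+1:, k])   # rank-1 update
    return np.tril(A)      # entries on and below the diagonal form L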

[Figure omitted in the text version: the graph G(A) of the ten-node example matrix and the filled graph G+(A).]

Figure 5.1: Graphical Representation of Fill-in

[Figure omitted in the text version: the filled graph G+(A) and the corresponding elimination tree T(A).]

Figure 5.2: The Elimination Tree

[Figure omitted in the text version: spy(A) for a 100-by-100 semi-random symmetric matrix with nz = 300.]

Figure 5.3: Sparsity Structure of Semi-Random Symmetric Matrix

The order of elimination determines both fill and elimination tree height. Unfortunately, but inevitably, finding the best ordering is NP-complete. Heuristics are used to reduce fill-in. The following lists some commonly used ones.

• Ordering by minimum degree (this is SYMMMD in Matlab)

• Nested dissection

• Cuthill-McKee ordering

• Reverse Cuthill-McKee (SYMRCM)

• Ordering by number of non-zeros (COLPERM or COLMMD)

These ordering heuristics can be investigated in Matlab on various sparse matrices. The simplest way to obtain a random sparse matrix is to use the command A=sprand(n,m,f), where n and m denote the size of the matrix, and f is the fraction of nonzero elements. However, these matrices are not based on any physical system, and hence may not illustrate the effectiveness of an ordering scheme on a real-world problem. An alternative is to use a database of sparse matrices, one of which is available with the command/package ufget. Once we have a sparse matrix, we can view its sparsity structure with the command spy(A). An example with a randomly generated symmetric sparse matrix is given in Figure 5.3.

We now carry out Cholesky factorization of A using no ordering, and using SYMMMD. The sparsity structures of the resulting triangular matrices are given in Figure 5.4. As shown, using a heuristic-based ordering scheme results in significantly less fill-in. This effect is usually more pronounced when the matrix arises from a physical problem and hence has some associated structure.
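The same kind of experiment can be done outside Matlab. A minimal SciPy sketch (ours; it uses SuperLU's column orderings, COLAMD versus the natural ordering, rather than SYMMMD, and the test matrix is an arbitrary diagonally dominant example):

import scipy.sparse as sp
from scipy.sparse.linalg import splu

n = 400
R = sp.random(n, n, density=0.01, format='csc', random_state=0)
A = (R + R.T + n * sp.identity(n)).tocsc()     # symmetric, well conditioned

for spec in ('NATURAL', 'COLAMD'):
    lu = splu(A, permc_spec=spec)
    print(spec, 'fill: nnz(L) + nnz(U) =', lu.L.nnz + lu.U.nnz)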

[Figure omitted in the text version: spy plots of the Cholesky factor with no ordering (nz = 411) and with SYMMMD (nz = 255).]

Figure 5.4: Sparsity Structure After Cholesky Factorization

We now examine an ordering method called nested dissection, which uses vertex separators in a divide-and-conquer node ordering for sparse Gaussian elimination. Nested dissection [37, 38, 59] was originally a sequential algorithm, pivoting on a single element at a time, but it is an attractive parallel ordering as well because it produces blocks of pivots that can be eliminated independently in parallel [9, 29, 40, 61, 74].

Consider a regular finite difference grid, as in Figure 5.5. By dissecting the graph along the center lines (enclosed in dotted curves), the graph is split into four independent graphs, each of which can be solved in parallel. The connections are included only at the end of the computation, in a way analogous to the domain decomposition discussed in earlier lectures. Figure 5.6 shows how a single domain can be split up into two roughly equal-sized domains A and B, which are independent, and a smaller domain C that contains the connectivity. One can now recursively order A and B, before finally proceeding to C. More generally, begin by recursively ordering at the leaf level and then continue up the tree. The question now arises as to how much fill is generated in this process.

Figure 5.5: Nested Dissection


Figure 5.6: Vertex Separators

A recursion formula for the fill F generated by such a two-dimensional nested dissection algorithm is readily derived:

F(n) = 4 F(n/4) + (2√n)^2 / 2    (5.2)

This yields upon solution

F(n) = 2 n log(n)    (5.3)

In an analogous manner, the elimination tree height is given by:

H(n) = H(n/4) + 2√n    (5.4)

H(n) = const × √n    (5.5)

Nested dissection can be generalized to three-dimensional regular grids or other classes of graphs that have small separators. We will come back to this point in the section on graph partitioning.
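For concreteness, a minimal recursive sketch (ours) of a nested dissection ordering of a rows × cols grid; each call orders the two halves first and the separator last:

def nested_dissection_order(rows, cols):
    """Return the grid points (i, j) of a rows x cols grid in elimination
    order: both halves recursively, then the separator line."""
    if rows == 0 or cols == 0:
        return []
    if rows * cols <= 4:                 # small base case: any order is fine
        return [(i, j) for i in range(rows) for j in range(cols)]
    if cols >= rows:                     # cut along the longer dimension
        mid = cols // 2
        left  = nested_dissection_order(rows, mid)
        right = [(i, j + mid + 1)
                 for (i, j) in nested_dissection_order(rows, cols - mid - 1)]
        sep   = [(i, mid) for i in range(rows)]
        return left + right + sep
    else:
        mid = rows // 2
        top    = nested_dissection_order(mid, cols)
        bottom = [(i + mid + 1, j)
                  for (i, j) in nested_dissection_order(rows - mid - 1, cols)]
        sep    = [(mid, j) for j in range(cols)]
        return top + bottom + sep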

5.2.2 Parallel Factorization: the Multifrontal Algorithm

Nested dissection and other heuristics give the ordering. To factor in parallel, we need not only to find a good ordering in parallel, but also to perform the elimination in parallel. To achieve better parallelism and scalability in the elimination, a popular approach is to modify the algorithm so that we perform a rank-k update to an (n-k) × (n-k) block. The basic step will now be given by

A = \begin{pmatrix} D & V^T \\ V & C \end{pmatrix}
  = \begin{pmatrix} L_D & 0 \\ V L_D^{-T} & I \end{pmatrix}
    \begin{pmatrix} I & 0 \\ 0 & C - V D^{-1} V^T \end{pmatrix}
    \begin{pmatrix} L_D^T & L_D^{-1} V^T \\ 0 & I \end{pmatrix}

where C is (n-k) × (n-k), V is (n-k) × k, and D = L_D L_D^T is k × k. D can be written in this way since A is positive definite. Note that V D^{-1} V^T = (V L_D^{-T})(L_D^{-1} V^T).

The elimination tree shows where there is parallelism, since we can "go up the separate branches in parallel"; i.e., we can update a column of the matrix using only the columns below it in the elimination tree. This leads to the multifrontal algorithm. The sequential version of the algorithm is given below. For every column j there is a block \bar{U}_j (which plays the role of V D^{-1} V^T).

\bar{U}_j = - \sum_k \begin{pmatrix} l_{jk} \\ l_{i_1 k} \\ \vdots \\ l_{i_r k} \end{pmatrix}
                      \begin{pmatrix} l_{jk} & l_{i_1 k} & \cdots & l_{i_r k} \end{pmatrix}

where the sum is taken over all descendants k of j in the elimination tree, and j, i_1, i_2, \ldots, i_r are the indices of the nonzeros in column j of the Cholesky factor.

For j = 1, 2, \ldots, n: let j, i_1, i_2, \ldots, i_r be the indices of the nonzeros in column j of L, and let c_1, \ldots, c_s be the children of j in the elimination tree. Let \bar{U} = U_{c_1} \oplus \cdots \oplus U_{c_s}, where the U_{c_i}'s were defined in a previous step of the algorithm. Here \oplus is the extend-add operator, which is best explained by example. Let

R = \begin{array}{c|cc} & 5 & 8 \\ \hline 5 & p & q \\ 8 & u & v \end{array},
\qquad
S = \begin{array}{c|cc} & 5 & 9 \\ \hline 5 & w & x \\ 9 & y & z \end{array}

(The rows and columns of R correspond to rows 5 and 8 of the original matrix, etc.) Then

R \oplus S = \begin{array}{c|ccc} & 5 & 8 & 9 \\ \hline 5 & p+w & q & x \\ 8 & u & v & 0 \\ 9 & y & 0 & z \end{array}

Define

F_j = \begin{pmatrix} a_{jj} & \cdots & a_{j i_r} \\ \vdots & \ddots & \vdots \\ a_{i_r j} & \cdots & a_{i_r i_r} \end{pmatrix} \oplus \bar{U}

(This corresponds to C - V D^{-1} V^T.) Now factor F_j:

F_j = \begin{pmatrix} l_{jj} & 0 & \cdots & 0 \\ l_{i_1 j} & & & \\ \vdots & & I & \\ l_{i_r j} & & & \end{pmatrix}
      \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & & & \\ \vdots & & U_j & \\ 0 & & & \end{pmatrix}
      \begin{pmatrix} l_{jj} & l_{i_1 j} & \cdots & l_{i_r j} \\ 0 & & & \\ \vdots & & I & \\ 0 & & & \end{pmatrix}

(Note that U_j has now been defined.) We can use various BLAS kernels to carry out this algorithm.

Recently, Kumar and Karypis have shown that direct solvers can be parallelized efficiently. They have designed a parallel algorithm for the factorization of sparse matrices that is more scalable than any other known algorithm for this problem. They have shown that their parallel Cholesky factorization algorithm is asymptotically as scalable as any parallel formulation of dense matrix factorization on both mesh and hypercube architectures. Furthermore, their algorithm is equally scalable for sparse matrices arising from two- and three-dimensional finite element problems. They have also implemented and experimentally evaluated the algorithm on a 1024-processor nCUBE 2 parallel computer and a 1024-processor Cray T3D on a variety of problems. On structural engineering problems (the Boeing-Harwell set) and matrices arising in linear programming (the NETLIB set), the preliminary implementation is able to achieve 14 to 20 GFlops on a 1024-processor Cray T3D.

In its current form, the algorithm is applicable only to Cholesky factorization of sparse symmetric positive definite (SPD) matrices. SPD systems occur frequently in scientific applications and are the most benign in terms of ease of solution by both direct and iterative methods. However, there are many applications that involve solving large sparse linear systems which are not SPD. An efficient parallel algorithm for the direct solution of non-SPD sparse linear systems would be extremely valuable, because the theory of iterative methods is far less developed for general sparse linear systems than it is for SPD systems.
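Returning to the extend-add operator ⊕ used above, a minimal sketch of how it can be implemented (this code is ours; each frontal matrix is carried together with the list of global indices it corresponds to):

import numpy as np

def extend_add(R, rows_R, S, rows_S):
    """Return (T, rows) where T is the extend-add of R and S."""
    rows = sorted(set(rows_R) | set(rows_S))
    pos = {r: k for k, r in enumerate(rows)}
    T = np.zeros((len(rows), len(rows)))
    for M, idx in ((R, rows_R), (S, rows_S)):
        p = [pos[r] for r in idx]
        T[np.ix_(p, p)] += M              # scatter-add into the merged frame
    return T, rows

# Example mirroring the text: R indexed by {5, 8}, S indexed by {5, 9}.
R = np.array([[1.0, 2.0], [3.0, 4.0]])    # stands for [[p, q], [u, v]]
S = np.array([[10., 20.], [30., 40.]])    # stands for [[w, x], [y, z]]
T, rows = extend_add(R, [5, 8], S, [5, 9])
print(rows)   # [5, 8, 9]
print(T)      # [[11, 2, 20], [3, 4, 0], [30, 0, 40]]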

5.3 Basic Iterative Methods

These methods focus on the solution of the linear system Ax = b where A ∈ R^{n×n} and x, b ∈ R^n, although the theory is equally valid for systems with complex elements.

The basic outline of the iterative methods is as follows. Choose some initial guess, x_0, for the solution vector. Generate a series of solution vectors, {x_1, x_2, \ldots, x_k}, through an iterative process that takes advantage of previous solution vectors. Define x^* as the true (optimal) solution vector. Each iterative solution vector is chosen such that the absolute error, e_i = ||x^* - x_i||, is decreasing with each iteration for some defined norm. Define also the residual error, r_i = ||b - A x_i||, at each iteration. These error quantities are clearly related by a simple transformation through A.

r_i = b - A x_i = A x^* - A x_i = A e_i

5.3.1 SuperLU-dist

SuperLU-dist uses an iterative and approximate scheme for solving Ax = b. This simple algorithm eliminates the need for pivoting. The elimination of pivoting enhances parallel implementations due to the high communication overhead that pivoting imposes. The basic SuperLU-dist algorithm is as follows:

Algorithm: SuperLU-dist

1. r = b - A*x

2. backerr = max_i ( |r_i| / (|A|*|x| + |b|)_i )

3. if (backerr < ε) or (backerr > lasterr/2) then stop

4. solve: L*U*dx = r

5. x = x + dx

6. lasterr = backerr

7. loop to step 1

In this algorithm, x, L, and U are approximate while r is exact. This procedure usually converges to a reasonable solution after only 0–3 iterations, and the error is on the order of 10^{-n} after n iterations.
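A rough Python rendering of this refinement loop (ours, using SciPy's SuperLU factorization as the approximate L and U; A is assumed to be a SciPy sparse matrix, and the stopping constants are illustrative):

import numpy as np
from scipy.sparse.linalg import splu

def refine(A, b, max_iter=5, eps=np.finfo(float).eps):
    lu = splu(A.tocsc())                 # approximate factorization of A
    x = lu.solve(b)
    lasterr = np.inf
    absA = abs(A)
    for _ in range(max_iter):
        r = b - A @ x                    # residual computed with the exact A
        backerr = np.max(np.abs(r) / (absA @ np.abs(x) + np.abs(b)))
        if backerr < eps or backerr > lasterr / 2:
            break
        x = x + lu.solve(r)              # solve L U dx = r, then update x
        lasterr = backerr
    return x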

5.3.2 Jacobi Method

Perform the matrix decomposition A = D - L - U, where D is a diagonal matrix, L is a strictly lower triangular matrix, and U is a strictly upper triangular matrix. Any solution satisfying Ax = b is also a solution satisfying Dx = (L + U)x + b. This presents a straightforward iteration scheme with small computation cost at each iteration. Solving for the solution vector on the right-hand side involves inversion of a diagonal matrix. Assuming this inverse exists, the following iterative method may be used.

x_i = D^{-1} (L + U) x_{i-1} + D^{-1} b

This method presents some nice computational features. The inverse term involves only the diagonal matrix D, and the computational cost of computing this inverse is minimal. Additionally, this step may be carried out easily in parallel, since each entry in the inverse does not depend on any other entry.
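A dense Jacobi sweep in Python (a minimal sketch, ours; A is assumed to be a NumPy array with nonzero diagonal entries):

import numpy as np

def jacobi(A, b, x0, iters=100):
    """x_i = D^{-1}((L + U) x_{i-1} + b), written with R = A - D = -(L + U)."""
    D = np.diag(A)                 # diagonal entries of A
    R = A - np.diag(D)             # R = -(L + U)
    x = x0.copy()
    for _ in range(iters):
        x = (b - R @ x) / D        # each entry is independent: trivially parallel
    return x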

5.3.3 Gauss-Seidel Method

This method is similar to the Jacobi method in that any solution satisfying Ax = b is now a solution satisfying (D - L)x = Ux + b for the A = D - L - U decomposition. Assuming the inverse of (D - L) exists, the following iterative method may be used.

x_i = (D - L)^{-1} U x_{i-1} + (D - L)^{-1} b

This method is often stable in practice but is less easy to parallelize. The inverse term is now a lower triangular matrix, which presents a bottleneck for parallel operations.

This method presents some practical improvements over the Jacobi method. Consider the computation of the j-th element of the solution vector x_i at the i-th iteration. The lower triangular nature of the inverse term means that only the (j+1)-th through n-th elements of the previous iteration's solution vector x_{i-1} are used; the earlier elements come from the current, already-updated iterate. In essence, this method updates using the most recent information available.

5.3.4 Splitting Matrix Method

The previous methods are specialized cases of splitting matrix algorithms. These algorithms utilize a decomposition A = M - N for solving the linear system Ax = b. The following iterative procedure is used to compute the solution vector at the i-th iteration.

M x_i = N x_{i-1} + b

Consider the computational tradeoffs when choosing the decomposition:

• cost of computing M^{-1}

• stability and convergence rate

It is interesting to analyze the convergence properties of these methods. Consider the definitions of the absolute error, e_i = x^* - x_i, and the residual error, r_i = b - A x_i. An iteration of the above algorithm yields the following:

x_1 = M^{-1} N x_0 + M^{-1} b
    = M^{-1} (M - A) x_0 + M^{-1} b
    = x_0 + M^{-1} r_0

A similar form results from considering the absolute error.

x^* = x_0 + e_0
    = x_0 + A^{-1} r_0

This shows that the convergence of the algorithm is in some way improved if M^{-1} approximates A^{-1} with some accuracy. Consider the amount of change in the absolute error after this iteration:

e_1 = A^{-1} r_0 - M^{-1} r_0
    = e_0 - M^{-1} A e_0
    = M^{-1} N e_0

Evaluating this change for a general iteration shows the error propagation.

e_i = (M^{-1} N)^i e_0

This relationship shows a bound on the error convergence. The spectral radius (the largest eigenvalue magnitude) of the matrix M^{-1} N determines the rate of convergence of these methods. This analysis is similar to the solution of a general difference equation of the form x_k = A x_{k-1}. In either case, the spectral radius of the matrix must be less than 1 to ensure stability. The error will converge to 0 faster if all the eigenvalues are clustered near the origin.

5.3.5 Weighted Splitting Matrix Method

The splitting matrix algorithm may be modified by including a scalar weighting term. This scalar may be likened to the free scalar parameter used in practical implementations of Newton's method and steepest descent algorithms for optimization. Choose some scalar w, with 0 < w < 1, for the following iteration:

x_i = (1 - w) x_0 + w ( x_0 + M^{-1} r_0 )
    = x_0 + w M^{-1} r_0

5.4 Red-Black Ordering for parallel Implementation

The concept of ordering seeks to separate the nodes of a given domain into subdomains. Red-black ordering is a straightforward way to achieve this. The basic concept is to alternate the assignment of a "color" to each node; consider the one- and two-dimensional examples on regular grids. The iterative procedure for these types of coloring schemes solves for the variables at nodes of one color, then solves for the variables at nodes of the other color. A linear system can be formed with a block structure corresponding to the color scheme:

\begin{pmatrix} \mathrm{BLACK} & \mathrm{MIXED} \\ \mathrm{MIXED} & \mathrm{RED} \end{pmatrix}

This method can easily be extended to include more colors. A common practice is to choose the colors such that no node has neighbors of the same color. In such cases it is desirable to minimize the number of colors so as to reduce the number of iteration steps.
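As an illustration, a minimal sketch (ours) of red-black Gauss-Seidel sweeps for the five-point discretization of -Δu = f on a square grid with zero boundary values; within each sweep, all points of one color can be updated in parallel:

import numpy as np

def red_black_gauss_seidel(f, sweeps=100, h=1.0):
    n = f.shape[0]
    u = np.zeros(f.shape)
    for _ in range(sweeps):
        for parity in (0, 1):                      # red sweep, then black sweep
            for i in range(1, n - 1):
                for j in range(1, n - 1):
                    if (i + j) % 2 == parity:
                        u[i, j] = 0.25 * (u[i-1, j] + u[i+1, j] +
                                          u[i, j-1] + u[i, j+1] + h * h * f[i, j])
    return u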

5.5 Conjugate Gradient Method

The conjugate gradient method is the most prominent iterative method for solving sparse symmetric positive definite linear systems. We now examine this method from a parallel computing perspective. The following is pseudocode for the conjugate gradient algorithm.

Algorithm: Conjugate Gradient

1. x_0 = 0, r_0 = b - A x_0 = b

2. do m = 1 to n steps

(a) if m = 1, then p_1 = r_0
    else
        β = r_{m-1}^T r_{m-1} / r_{m-2}^T r_{m-2}
        p_m = r_{m-1} + β p_{m-1}
    endif

(b) α_m = r_{m-1}^T r_{m-1} / p_m^T A p_m

(c) x_m = x_{m-1} + α_m p_m

(d) r_m = r_{m-1} - α_m A p_m

When A is symmetric positive definite, solving Ax = b is equivalent to finding a solution to the following quadratic minimization problem:

min_x φ(x) = (1/2) x^T A x - x^T b.

In this setting, r = -∇φ and p_i^T A p_j = 0, i.e., p_i and p_j are conjugate with respect to A. How many iterations shall we perform, and how can we reduce the number of iterations?
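The pseudocode translates almost line for line into Python (a minimal sketch, ours; A may be a dense array or a SciPy sparse matrix):

import numpy as np

def conjugate_gradient(A, b, max_iter=None, tol=1e-10):
    n = len(b)
    max_iter = max_iter or n
    x = np.zeros(n)
    r = b.copy()                 # r_0 = b - A x_0 = b
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p               # the single matrix-vector product per iteration
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x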

Theorem 5.5.1 Let the condition number be κ(A) = λ_max(A)/λ_min(A). Since A is symmetric positive definite, for any x_0, if x^* is the solution to Ax = b, then

||x^* - x_m||_A ≤ 2 ||x^* - x_0||_A ( (√κ - 1)/(√κ + 1) )^m,

where ||v||_A = √(v^T A v). Therefore,

||e_m|| ≤ 2 ||e_0|| · ( (√κ - 1)/(√κ + 1) )^m.

Another high-order iterative method is the Chebyshev iterative method. We refer interested readers to the book by Owe Axelsson (Iterative Solution Methods, Cambridge University Press). The conjugate gradient method is a special Krylov subspace method. Other examples of Krylov subspace methods are GMRES (the Generalized Minimum Residual Method) and the Lanczos methods.

5.5.1 Parallel Conjugate Gradient

Within each iteration of the conjugate gradient algorithm a single matrix-vector product must be taken. This calculation represents a bottleneck, and the performance of conjugate gradient can be improved by parallelizing this step. First, the matrix (A), vector (x), and result vector (y) are laid out by rows across multiple processors as shown in Figure 5.7. The algorithm for the distributed calculation is then simple: on each processor j, broadcast x(j) and then compute y(j) = A(j, :) * x.

Figure 5.7: Example distribution of A, x, and b on four processors

5.6 Preconditioning

Preconditioning is important in reducing the number of iterations needed to converge in many iterative methods. Put more precisely, preconditioning makes iterative methods possible in practice. Given a linear system Ax = b a parallel preconditioner is an invertible matrix C satisfying the following:

1. The inverse C^{-1} is relatively easy to compute. More precisely, after preprocessing the matrix C, solving the linear system Cy = b' is much easier than solving the system Ax = b. Further, there are fast parallel solvers for Cy = b'.

2. Iterative methods for solving the system C^{-1} A x = C^{-1} b, such as conjugate gradient¹, should converge much more quickly than they would for the system Ax = b. Generally a preconditioner is intended to reduce κ(A).

Now the question is: how do we choose a preconditioner C? There is no definite answer to this. We list some of the popularly used preconditioning methods.

• The basic splitting matrix method and SOR can be viewed as preconditioning methods.

• Incomplete factorization preconditioning: the basic idea is to first choose a good "sparsity pattern" and perform the factorization by Gaussian elimination. The method rejects those fill-in entries that are either small enough (relative to the diagonal entries) or in positions outside the sparsity pattern. In other words, we perform an approximate factorization L*U* and use this product as a preconditioner. One effective variant is to perform a block incomplete factorization to obtain a preconditioner. The incomplete factorization methods are often effective when tuned for a particular application. They also suffer from being highly problem-dependent, and the condition number usually improves by only a constant factor.

¹In general the matrix C^{-1}A is not symmetric. Thus the formal analysis uses the matrix L^T A L, where C^{-1} = L L^T [?].

Figure 5.8: Example conversion of the graph of matrix A (G(A)) to a subgraph (G(B))

• Subgraph preconditioning: the basic idea is to choose a subgraph of the graph defined by the matrix of the linear system so that the linear system defined by the subgraph can be solved efficiently, and so that the edges of the original graph can be embedded in the subgraph with small congestion and dilation, which implies a small condition number for the preconditioned matrix. In other words, the subgraph can "support" the original graph. An example of converting a graph to a subgraph is shown in Figure 5.8. The subgraph can be factored in O(n) space and time, and applying the preconditioner takes O(n) time per iteration.

• Block diagonal preconditioning: the observation behind this method is that in many applications the matrix can be naturally partitioned into a 2 × 2 block form

A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}

Moreover, the linear system defined by A_{11} can be solved more efficiently. Block diagonal preconditioning chooses a preconditioner of the form

C = \begin{pmatrix} B_{11} & 0 \\ 0 & B_{22} \end{pmatrix}

with the condition that B_{11} and B_{22} are symmetric and

α_1 A_{11} ≤ B_{11} ≤ α_2 A_{11},
β_1 A_{22} ≤ B_{22} ≤ β_2 A_{22}.

Block diagonal preconditioning methods are often used in conjunction with domain decomposition techniques. We can generalize the 2-block formula to multiple blocks, which correspond to a multi-region partition in the domain decomposition.

• Sparse approximate inverses: a sparse approximate inverse B^{-1} ≈ A^{-1} can be computed explicitly, with the quantity ||B^{-1} A - I||_F minimized in parallel (by columns). This B^{-1} can then be used as a preconditioner. The method has the advantage of being very parallel, but suffers from poor effectiveness in some situations.
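Tying the last two sections together, a minimal sketch (ours) of preconditioned conjugate gradient; apply_Cinv applies C^{-1} to a vector, and the example at the bottom uses the simplest choice, diagonal (Jacobi) scaling — any of the preconditioners listed above could be substituted:

import numpy as np

def pcg(A, b, apply_Cinv, max_iter=200, tol=1e-10):
    x = np.zeros_like(b, dtype=float)
    r = b.copy()
    z = apply_Cinv(r)
    p = z.copy()
    rz_old = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = apply_Cinv(r)            # preconditioner application each iteration
        rz_new = r @ z
        p = z + (rz_new / rz_old) * p
        rz_old = rz_new
    return x

# Jacobi (diagonal) preconditioner: C^{-1} r = r / diag(A)
# x = pcg(A, b, lambda r: r / A.diagonal())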

5.7 Symmetric Supernodes

The following section on symmetric supernodes is an edited excerpt from "A Supernodal Approach to Sparse Partial Pivoting," by Demmel, Eisenstat, Gilbert, Li, and Liu.

The idea of a supernode is to group together columns with the same nonzero structure, so they can be treated as a dense matrix for storage and computation. In the factorization A = LL^T (or A = LDL^T), a supernode is a range (r : s) of columns of L with the same nonzero structure below the diagonal; that is, L(r : s, r : s) is full lower triangular and every row of L(s : n, r : s) is either full or zero.

All the updates from the columns of a supernode are summed into a dense vector before the sparse update is performed. This reduces indirect addressing and allows the inner loops to be unrolled. In effect, a sequence of col-col updates is replaced by a supernode-column (sup-col) update. The sup-col update can be implemented using a call to a standard dense Level 2 BLAS matrix-vector multiplication kernel. This idea can be further extended to supernode-supernode (sup-sup) updates, which can be implemented using a Level 3 BLAS dense matrix-matrix kernel. This can reduce memory traffic by an order of magnitude, because a supernode in the cache can participate in multiple column updates. Ng and Peyton reported that a sparse Cholesky algorithm based on sup-sup updates typically runs 2.5 to 4.5 times as fast as a col-col algorithm.

To sum up, supernodes as the source of updates help because of the following:

1. The inner loop (over rows) has no indirect addressing. (Sparse Level 1 BLAS is replaced by dense Level 1 BLAS.)

2. The outer loop (over columns in the supernode) can be unrolled to save memory references. (Level 1 BLAS is replaced by Level 2 BLAS.)

Supernodes as the destination of updates help because of the following:

3. Elements of the source supernode can be reused in multiple columns of the destination su- pernode to reduce cache misses. (Level 2 BLAS is replaced by Level 3 BLAS.)

Supernodes in sparse Cholesky can be determined during symbolic factorization, before the numeric factorization begins. However, in sparse LU, the nonzero structure cannot be predicted before numeric factorization, so we must identify supernodes on the fly. Furthermore, since the factors L and U are no longer transposes of each other, we must generalize the definition of a supernode.
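As a small illustration of the symmetric definition, a sketch (ours) that greedily groups contiguous columns whose below-diagonal structures are nested in the required way:

def find_supernodes(lower_struct):
    """lower_struct[j] = set of row indices i > j with L[i, j] != 0 (0-based).
    Returns a list of (first, last) column ranges, each a supernode."""
    n = len(lower_struct)
    ranges = []
    start = 0
    for j in range(1, n):
        # column j extends the current supernode iff
        # struct(j-1) = {j} union struct(j)
        if lower_struct[j - 1] == {j} | lower_struct[j]:
            continue
        ranges.append((start, j - 1))
        start = j
    ranges.append((start, n - 1))
    return ranges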

5.7.1 Unsymmetric Supernodes

There are several possible ways to generalize the symmetric definition of supernodes to unsymmetric factorization. We define F = L + U - I to be the filled matrix containing both L and U.

T1 Same row and column structures: A supernode is a range (r : s) of columns of L and rows of U, such that the diagonal block F(r : s, r : s) is full, and outside that block all the columns of L in the range have the same structure and all the rows of U in the range have the same structure. T1 supernodes make it possible to do sup-sup updates, realizing all three benefits.

T2 Same column structure in L: A supernode is a range (r : s) of columns of L with triangular diagonal block full and the same structure below the diagonal block. T2 supernodes allow sup-col updates, realizing the first two benefits.

Figure 5.9: Four possible types of unsymmetric supernodes.

T3 Same column structure in L, full diagonal block in U: A supernode is a range (r : s) of columns of L and U, such that the diagonal block F (r : s, r : s) is full, and below the diagonal block the columns of L have the same structure. T3 supernodes allow sup-col updates, like T2. In addition, if the storage for a supernode is organized as for a two-dimensional (2-D) array (for Level 2 or 3 BLAS calls), T3 supernodes do not waste any space in the diagonal block of U.

T4 Same column structure in L and U: A supernode is a range (r : s) of columns of L and U with identical structure. (Since the diagonal is nonzero, the diagonal block must be full.) T4 supernodes allow sup-col updates, and also simplify storage of L and U.

T5 Supernodes of A^T A: A supernode is a range (r : s) of columns of L corresponding to a Cholesky supernode of the symmetric matrix A^T A. T5 supernodes are motivated by the observation that (with suitable representations) the structures of L and U in the unsymmetric factorization PA = LU are contained in the structure of the Cholesky factor of A^T A. In unsymmetric LU, these supernodes themselves are sparse, so we would waste time and space operating on them. Thus we do not consider them further.

Figure 5.9 is a schematic of definitions T1 through T4.

Supernodes are only useful if they actually occur in practice. We reject T4 supernodes as being too rare to make up for the simplicity of their storage scheme. T1 supernodes allow Level 3 BLAS updates, but we can get most of their cache advantage with the more common T2 or T3 supernodes by using supernode-panel updates. Thus we conclude that either T2 or T3 is best by our criteria.

Figure 5.10 shows a sample matrix and the nonzero structure of its factors with no pivoting. Using definition T2, this matrix has four supernodes: {1, 2}, {3}, {4, 5, 6}, and {7, 8, 9, 10}. For example, in columns 4, 5, and 6 the diagonal blocks of L and U are full, and the columns of L all have nonzeros in rows 8 and 9. By definition T3, the matrix has five supernodes: {1, 2}, {3}, {4, 5, 6}, {7}, and {8, 9, 10}. Column 7 fails to join {8, 9, 10} as a T3 supernode because u_{78} is zero.

5.7.2 The Column Elimination Tree

Since our definition requires the columns of a supernode to be contiguous, we should get larger supernodes if we bring together columns of L with the same nonzero structure. But the column ordering is fixed, for sparsity, before numeric factorization; what can we do?

Figure 5.10: A sample matrix and its LU factors. Diagonal elements a_{55} and a_{88} are zero.

Figure 5.11: Supernodal structure by definition T2 of the factors of the sample matrix.

Figure 5.12: LU factorization with supernode-column updates

In symmetric Cholesky factorization, one type of supernode - the "fundamental" supernodes - can be made contiguous by permuting the matrix (symmetrically) according to a postorder on its elimination tree. This postorder is an example of what Liu calls an equivalent reordering, which does not change the sparsity of the factor. The postordered elimination tree can also be used to locate the supernodes before the numeric factorization.

We proceed similarly for the unsymmetric case. Here the appropriate analogue of the symmetric elimination tree is the column elimination tree, or column etree for short. The vertices of this tree are the integers 1 through n, representing the columns of A. The column etree of A is the (symmetric) elimination tree of the column intersection graph of A, or equivalently the elimination tree of A^T A provided there is no cancellation in computing A^T A. See Gilbert and Ng for complete definitions. The column etree can be computed from A in time almost linear in the number of nonzeros of A.

Just as a postorder on the symmetric elimination tree brings together symmetric supernodes, we expect a postorder on the column etree to bring together unsymmetric supernodes. Thus, before we factor the matrix, we compute its column etree and permute the matrix columns according to a postorder on the tree.

5.7.3 Relaxed Supernodes

For most matrices, the average size of a supernode is only about 2 to 3 columns (though a few supernodes are much larger). A large percentage of supernodes consist of only a single column, many of which are leaves of the column etree. Therefore merging groups of columns at the fringe of the etree into artificial supernodes, regardless of their row structures, can be beneficial. A parameter r controls the granularity of the merge. A good merge rule is: node i is merged with its parent node j when the subtree rooted at j has at most r nodes. In practice, the best values of r are generally between 4 and 8, and yield improvements in running time of 5% to 15%.

Artificial supernodes are a special case of relaxed supernodes. They allow a small number of zeros in the structure of any supernode, thus relaxing the condition that the columns must have strictly nested structures.

5.7.4 Supernodal Numeric Factorization

Now we show how to modify the col-col algorithm to use sup-col updates and supernode-panel updates. This section describes the numerical computation involved in the updates.

Supernode-Column Updates

Figure 5.12 sketches the sup-col algorithm. The only difference from the col-col algorithm is that all the updates to a column from a single supernode are done together. Consider a supernode (r : s) that updates column j. The coefficients of the updates are the values from a segment of column j of U, namely U(r : s, j). The nonzero structure of such a segment is particularly simple: all the nonzeros are contiguous, and follow all the zeros. Thus, if k is the index of the first nonzero row in U(r : s, j), the updates to column j from supernode (r : s) come from columns k through s. Since the supernode is stored as a dense matrix, these updates can be performed by a dense lower triangular solve (with the matrix L(k : s, k : s)) and a dense matrix-vector multiplication (with the matrix L(s + 1 : n, k : s)). The symbolic phase determines the value of k, that is, the position of the first nonzero in the segment U(r : s, j). The advantages of using sup-col updates are similar to those in the symmetric case. Efficient Level 2 BLAS matrix-vector kernels can be used for the triangular solve and matrix-vector multiply. Furthermore, all the updates from the supernodal columns can be collected in a dense vector before doing a single scatter into the target vector. This reduces the amount of indirect addressing.

Supernode-Panel Updates

We can improve the sup-col algorithm further on machines with a memory hierarchy by changing the data access pattern. The data we are accessing in the inner loop (lines 5-9 of Figure 5.12) include the destination column j and all the updating supernodes (r : s) to the left of column j. Column j is accessed many times, while each supernode (r : s) is used only once. In practice, the number of nonzero elements in column j is much less than that in the updating supernodes. Therefore, the access pattern given by this loop provides little opportunity to reuse cached data. In particular, the same supernode (r : s) may be needed to update both columns j and j + 1. But when we factor the (j+1)th column (in the next iteration of the outer loop), we will have to fetch supernode (r : s) again from memory, instead of from cache (unless the supernodes are small compared to the cache).

Panels

To exploit memory locality, we factor several columns (say w of them) at a time in the outer loop, so that one updating supernode (r : s) can be used to update as many of the w columns as possible. We refer to these w consecutive columns as a panel to differentiate them from a supernode; the row structures of these columns may not be correlated in any fashion, and the boundaries between panels may be different from those between supernodes. The new method requires rewriting the doubly nested loop as the triple loop shown in Figure 5.13.

The structure of each sup-col update is the same as in the sup-col algorithm. For each supernode (r : s) to the left of column j, if u_{kj} ≠ 0 for some r ≤ k ≤ s, then u_{ij} ≠ 0 for all k ≤ i ≤ s. Therefore, the nonzero structure of the panel of U consists of dense column segments that are row-wise separated by supernodal boundaries, as in Figure 5.13. Thus, it is sufficient for the symbolic factorization algorithm to record only the first nonzero position of each column segment.

Figure 5.13: The supernode-panel algorithm, with columnwise blocking. J = 1 : j - 1

As detailed in section 4.4, symbolic factorization is applied to all the columns in a panel at once, over all the updating supernodes, before the numeric factorization step.

In dense factorization, the entire supernode-panel update in lines 3-7 of Figure 5.13 would be implemented as two Level 3 BLAS calls: a dense triangular solve with w right-hand sides, followed by a dense matrix-matrix multiply. In the sparse case, this is not possible, because the different sup-col updates begin at different positions k within the supernode, and the submatrix U(r : s, j : j + w - 1) is not dense. Thus the sparse supernode-panel algorithm still calls the Level 2 BLAS. However, we get similar cache benefits to those from the Level 3 BLAS, at the cost of doing the loop reorganization ourselves. Thus we sometimes call the kernel of this algorithm a "BLAS-2½" method.

In the doubly nested loop (lines 3-7 of Figure 5.13), the ideal circumstance is that all w columns in the panel require updates from supernode (r : s). Then this supernode will be used w times before it is forced out of the cache. There is a trade-off between the value of w and the size of the cache. For this scheme to work efficiently, we need to ensure that the nonzeros in the w columns do not cause cache thrashing. That is, we must keep w small enough so that all the data accessed in this doubly nested loop fit in cache. Otherwise, the cache interference between the source supernode and the destination panel can offset the benefit of the new algorithm.

5.8 Efficient sparse matrix algorithms

5.8.1 Scalable algorithms

By a scalable algorithm for a problem, we mean one that maintains efficiency bounded away from zero as the number p of processors grows and the size of the data structures grows roughly linearly in p.

Notable efforts at analysis of the scalability of dense matrix computations include those of Li and Coleman [58] for dense triangular systems, and Saad and Schultz [85]; Ostrouchov et al. [73] and George, Liu, and Ng [39] have made some analyses for algorithms that map matrix columns to processors. Rothberg and Gupta [81] is an important paper for its analysis of the effect of caches on sparse matrix algorithms.

Consider any distributed-memory computation. In order to assess the communication costs analytically, it is useful to employ certain abstract lower bounds. Our model assumes that the machine topology is given. It assumes that memory consists of the memories local to processors. It assumes that the communication channels are the edges of a given undirected graph G = (W, L), and that processor-memory units are situated at some, possibly all, of the vertices of the graph. The model includes hypercube and grid-structured message-passing machines, shared-memory machines having physically distributed memory (the Tera machine), as well as tree-structured machines like the CM-5.

Let V ⊆ W be the set of all processors and L be the set of all communication links. We assume identical links. Let β be the inverse bandwidth (slowness) of a link in seconds per word. (We ignore latency in this model; most large distributed-memory computations are bandwidth limited.) We assume that processors are identical. Let φ be the inverse computation rate of a processor in seconds per floating-point operation. Let β' be the rate at which a processor can send or receive data, in seconds per word. We expect that β' and β will be roughly the same.

A distributed-memory computation consists of a set of processes that exchange information by sending and receiving messages. Let M be the set of all messages communicated. For m ∈ M, |m| denotes the number of words in m. Each message m has a source processor src(m) and a destination processor dest(m), both elements of V.

1. (Average flux)

\frac{\sum_{m \in M} |m| \cdot d(m)}{|L|} \cdot \beta.

This is the total flux of data, measured in word-hops, divided by the machine's total communication bandwidth, |L|/β.

2. (Bisection width) Given V_0, V_1 ⊆ W, with V_0 and V_1 disjoint, define

sep(V_0, V_1) ≡ min { |L'| : L' ⊆ L is an edge separator of V_0 and V_1 }

and

flux(V_0, V_1) ≡ \sum_{ m \in M : src(m) \in V_i, dest(m) \in V_{1-i} } |m|.

The bound is

\frac{flux(V_0, V_1)}{sep(V_0, V_1)} \cdot \beta.

This is the number of words that cross from one part of the machine to the other, divided by the bandwidth of the wires that link them.

3. (Arrivals/Departures (also known as node congestion))

\max_{v \in V} \sum_{dest(m) = v} |m| \cdot \beta';

\max_{v \in V} \sum_{src(m) = v} |m| \cdot \beta'.

This is a lower bound on the communication time for the processor with the most traffic into or out of it.

4. (Edge contention)

\max_{\ell \in L} \sum_{m \in M(\ell)} |m| \cdot \beta.

This is a lower bound on the time needed by the most heavily used wire to handle all its traffic.

Of course, the actual communication time may be greater than any of the bounds. In particular, the communication resources (the wires in the machine) need to be scheduled. This can be done dynamically or, when the set of messages is known in advance, statically. With detailed knowledge of the schedule of use of the wires, better bounds can be obtained. For the purposes of analysis of algorithms and assignment of tasks to processors, however, we have found this more realistic approach to be unnecessarily cumbersome. We prefer to use the four bounds above, which depend only on the integrated (i.e. time-independent) information M and, in the case of the edge-contention bound, the paths p(M). In fact, in the work below, we won’t assume knowledge of paths and we won’t use the edge contention bound.
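To make the average-flux and node-congestion bounds concrete, an illustrative sketch (ours; the message representation and parameter names are assumptions, not from the notes):

def comm_lower_bounds(messages, num_links, beta, beta_prime):
    """messages: list of dicts with keys 'size' (words), 'src', 'dest', and
    'dist' (path length in links). Returns (average-flux bound,
    node-congestion bound) in seconds."""
    flux = sum(m['size'] * m['dist'] for m in messages)
    avg_flux_bound = flux * beta / num_links

    arrivals, departures = {}, {}
    for m in messages:
        arrivals[m['dest']] = arrivals.get(m['dest'], 0) + m['size']
        departures[m['src']] = departures.get(m['src'], 0) + m['size']
    node_bound = max(max(arrivals.values(), default=0),
                     max(departures.values(), default=0)) * beta_prime
    return avg_flux_bound, node_bound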

5.8.2 Cholesky factorization

We'll use the techniques we've introduced to analyze alternative distributed-memory implementations of a very important computation, Cholesky factorization of a symmetric positive definite (SPD) matrix A. The factorization is A = LL^T where L is lower triangular; A is given, L is to be computed. The algorithm is this:

1. L := A
2. for k = 1 to N do
3.     L_kk := sqrt(L_kk)
4.     for i = k + 1 to N do
5.         L_ik := L_ik L_kk^{-1}
6.     for j = k + 1 to N do
7.         for i = j to N do
8.             L_ij := L_ij - L_ik L_jk^T

We can let the elements Lij be scalars, in which case this is the usual or “point” Cholesky algorithm. Or we can take Lij to be a block, obtained by dividing the rows into contiguous subsets and making the same decomposition of the columns, so that diagonal blocks are square. In the block case, the computation of √Lkk (Step 3) returns the (point) Cholesky factor of the SPD block Lkk. If A is sparse (has mostly zero entries) then L will be sparse too, although less so than A. In that case, only the non-zero entries in the sparse factor L are stored, and the multiplication/division in lines 5 and 8 are omitted if they compute zeros.

Mapping columns

Assume that the columns of a dense symmetric matrix of order N are mapped to processors cyclically: column j is stored in processor map(j) ≡ j mod p. Consider communication costs on two-dimensional grid or toroidal machines. Suppose that p is a perfect square and that the machine is a √p × √p grid. Consider a mapping of the computation in which the operations in line 8 are performed by processor map(j). After performing the operations in line 5, processor map(k) must send column k to all processors in { map(j) | j > k }.

Let us fix our attention on 2-D grids. There are |L| = 2p + O(1) links. A column can be broadcast from its source to all other processors through a spanning tree of the machine, a tree of total length p reaching all the processors. Every matrix element will therefore travel over p - 1 links, so the total information flux is (1/2)N^2 p and the average flux bound is (1/4)N^2 β.

Type of bound      Lower bound
Arrivals           (1/4) N^2 β'
Average flux       (1/4) N^2 β

Table 5.1: Communication Costs for Column-Mapped Full Cholesky.

Type of bound      Lower bound
Arrivals           (1/4) N^2 β' (1/p_r + 1/p_c)
Edge contention    N^2 β (1/p_r + 1/p_c)

Table 5.2: Communication Costs for Torus-Mapped Full Cholesky.

Only O(N^2/p) words leave any processor. If N ≫ p, processors must accept almost the whole (1/2)N^2 words of L as arriving columns. The bandwidth per processor is β', so the arrivals bound is (1/2)N^2 β' seconds. If N ≈ p the bound drops to half that, (1/4)N^2 β' seconds. We summarize these bounds for 2-D grids in Table 5.1.

We can immediately conclude that this is a nonscalable distributed algorithm. We may not take p > Nφ/β and still achieve high efficiency.

Mapping blocks

Dongarra, Van de Geijn, and Walker [26] have shown that on the Intel Touchstone Delta machine (p = 528), mapping blocks is better than mapping columns in LU factorization. In such a mapping, we view the machine as a p_r × p_c grid and we map elements A_ij and L_ij to processor (map_r(i), map_c(j)). We assume cyclic mappings here: map_r(i) ≡ i mod p_r, and similarly for map_c.

The analysis of the preceding section may now be done for this mapping. Results are summarized in Table 5.2. With p_r and p_c both O(√p), the communication time drops like O(p^{-1/2}). With this mapping, the algorithm is scalable even when β ≫ φ. Now, with p = O(N^2), both the compute time and the communication lower bounds agree; they are O(N). Therefore, we remain efficient when storage per processor is O(1). (This scalable algorithm for distributed Cholesky is due to O'Leary and Stewart [72].)

5.8.3 Distributed sparse Cholesky and the model problem

In the sparse case, the same holds true. To see why this must be true, we need only observe that most of the work in sparse Cholesky factorization takes the form of factorizations of dense submatrices that form during the Cholesky algorithm. Rothberg and Gupta demonstrated this fact in their work in 1992-1994.

Unfortunately, with naive cyclic mappings, block-oriented approaches suffer from poor balance of the computational load and modest efficiency. Heuristic remapping of the block rows and columns can remove load imbalance as a cause of inefficiency.

Several researchers have obtained excellent performance using a block-oriented approach, both on fine-grained, massively parallel SIMD machines [23] and on coarse-grained, highly parallel MIMD machines [82]. A block mapping maps rectangular blocks of the sparse matrix to processors. A 2-D mapping views the machine as a 2-D p_r × p_c processor grid, whose members are denoted p(i, j). To date, the 2-D cyclic (also called torus-wrap) mapping has been used: block L_ij resides at processor p(i mod p_r, j mod p_c). All blocks in a given block row are mapped to the same row of processors, and all elements of a block column to a single processor column. Communication volumes grow as the square root of the number of processors, versus linearly for the 1-D mapping; 2-D mappings also asymptotically reduce the critical path length. These advantages accrue even when the underlying machine has some interconnection network whose topology is not a grid.

A 2-D cyclic mapping, however, produces significant load imbalance that severely limits achieved efficiency. On systems (such as the Intel Paragon) with high interprocessor communication bandwidth, this load imbalance limits efficiency to a greater degree than communication or want of parallelism. An alternative, heuristic 2-D block mapping succeeds in reducing load imbalance to a point where it is no longer the most serious bottleneck in the computation. On the Intel Paragon the block mapping heuristic produces a roughly 20% increase in performance compared with the cyclic mapping. In addition, a scheduling strategy for determining the order in which available tasks are performed adds another 10% improvement.

5.8.4 Parallel Block-Oriented Sparse Cholesky Factorization

In the block factorization approach considered here, matrix blocks are formed by dividing the columns of the n × n matrix into N contiguous subsets, N ≤ n. The identical partitioning is performed on the rows. A block L_ij in the sparse matrix is formed from the elements that fall simultaneously in row subset i and column subset j. Each block L_ij has an owning processor. The owner of L_ij performs all block operations that update L_ij (this is the "owner-computes" rule for assigning work). Interprocessor communication is required whenever a block on one processor updates a block on another processor.

Assume that the processors can be arranged as a grid of p_r rows and p_c columns. In a Cartesian product (CP) mapping, map(i, j) = p(RowMap(i), ColMap(j)), where RowMap : {0..N-1} → {0..p_r-1} and ColMap : {0..N-1} → {0..p_c-1} are given mappings of rows and columns to processor rows and columns. We say that map is symmetric Cartesian (SC) if p_r = p_c and RowMap = ColMap. The usual 2-D cyclic mapping is SC.²

5.9 Load balance with cyclic mapping

Any CP mapping is effective at reducing communication. While the 2-D cyclic mapping is CP, unfortunately it is not very effective at balancing computational load. Experiment and analysis show that the cyclic mapping produces particularly poor load balance; moreover, some serious load balance difficulties must occur for any SC mapping. Improvements obtained by the use of nonsymmetric CP mappings are discussed in the following section.

²See [82] for a discussion of domains, portions of the matrix mapped in a 1-D manner to further reduce communication.

[Figure omitted in the text version: efficiency and overall load balance versus matrix number, for P = 64 and P = 100.]

Figure 5.14: Efficiency and overall balance on the Paragon system (B = 48).

Our experiments employ a set of test matrices including two dense matrices (DENSE1024 and DENSE2048), two 2-D grid problems (GRID150 and GRID300), two 3-D grid problems (CUBE30 and CUBE35), and four irregular sparse matrices from the Harwell-Boeing sparse matrix test set [27]. Nested dissection or minimum degree orderings are used. In all our experiments, we choose p_r = p_c = √P, and we use a block size of 48. All Mflops measurements presented here are computed by dividing the operation counts of the best known sequential algorithm by the parallel runtimes. Our experiments were performed on an Intel Paragon, using hand-optimized versions of the Level-3 BLAS for almost all arithmetic.

5.9.1 Empirical Load Balance Results We now report on the efficiency and load balance of the method. Parallel efficiency is given by t /(P t ), where t is the parallel runtime, P is the number of processors, and t is the seq · par par seq runtime for the same problem on one processor. For the data we report here, we measured tseq by factoring the benchmark matrices using our parallel algorithm on one processor. The overall balance of a distributed computation is given by work /(P work ), where work is the total · max total total amount of work performed in the factorization, P is the number of processors, and workmax is the maximum amount of work assigned to any processor. Clearly, overall balance is an upper bound on efficiency. Figure 1 shows efficiency and overall balance with the cyclic mapping. Observe that load balance and efficiency are generally quite low, and that they are well correlated. Clearly, load balance alone is not a perfect predictor of efficiency. Other factors limit performance. Examples include interprocessor communication costs, which we measured at 5% — 20% of total runtime, long critical paths, which can limit the number of block operations that can be performed concurrently, and poor scheduling, which can cause processors to wait for block operations on other processors to complete. Despite these disparities, the data indicate that load imbalance is an important contributor to reduced efficiency. We next measured load imbalance among rows of processors, columns of processors, and di- agonals of processors. Define work[i, j] to be the runtime due to updating of block Lij by its owner. To approximate runtime, we use an empirically calibrated estimate of the form work = operations + ω block-operations; on the Paragon, ω = 1, 000. · Define RowW ork[i] to be the aggregate work required by blocks in row i: RowW ork[i] = N 1 j=0− work[i, j]. An analogous definition applies for ColW ork, the aggregate column work. Define P Preface 81

[Figure omitted in the text version: overall, diagonal, column, and row balance versus matrix number.]

Figure 5.15: Efficiency bounds for 2-D cyclic mapping due to row, column and diagonal imbalances (P = 64, B = 48).

Define row balance by work_total/(p_r · work_rowmax), where work_rowmax = max_r \sum_{i : RowMap[i] = r} RowWork[i]. This row balance statistic gives the best possible overall balance (and hence efficiency), obtained only if there is perfect load balance within each processor row. It isolates load imbalance due to an overloaded processor row caused by a poor row mapping. An analogous expression gives column balance, and a third analogous expression gives diagonal balance. (Diagonal d is made up of the set of processors p(i, j) for which (i - j) mod p_r = d.) While these three aggregate measures of load balance are only upper bounds on overall balance, the data we present later make it clear that improving these three measures of balance will in general improve the overall load balance.

Figure 5.15 shows the row, column, and diagonal balances with a 2-D cyclic mapping of the benchmark matrices on 64 processors. Diagonal imbalance is the most severe, followed by row imbalance, followed by column imbalance.

These data can be better understood by considering dense matrices as examples (although the following observations apply to a considerable degree to sparse matrices as well). Row imbalance is due mainly to the fact that RowWork[i], the amount of work associated with a row of blocks, increases with increasing i. More precisely, since work[i, j] increases linearly with j and the number of blocks in a row increases linearly with i, it follows that RowWork[i] increases quadratically in i. Thus, the processor row that receives the last block row in the matrix receives significantly more work than the processor row immediately following it in the cyclic ordering, resulting in significant row imbalance. Column imbalance is not nearly as severe as row imbalance. The reason, we believe, is that while the work associated with blocks in a column increases linearly with the column number j, the number of blocks in the column decreases linearly with j. As a result, ColWork[j] is neither strictly increasing nor strictly decreasing. In the experiments, row balance is indeed poorer than column balance. Note that the reason for the row and column imbalance is not that the 2-D cyclic mapping is an SC mapping; rather, we have significant imbalance because the mapping functions RowMap and ColMap are each poorly chosen.

To better understand diagonal imbalance, one should note that blocks on the diagonal of the matrix are mapped exclusively to processors on the main diagonal of the processor grid. Blocks just below the diagonal are mapped exclusively to processors just below the main diagonal of the processor grid. These diagonal and sub-diagonal blocks are among the most work-intensive blocks in the matrix. In sparse problems, moreover, the diagonal blocks are the only ones that are guaranteed to be dense. (For the two dense test matrices, diagonal balance is not significantly worse than row balance.) The remarks we make about diagonal blocks and diagonal processors apply to any SC mapping, and do not depend on the use of a cyclic function RowMap(i) = i mod p_r.
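The aggregate balance statistics are easy to compute from the work estimates; an illustrative sketch (ours), following the definitions above:

import numpy as np

def balances(work, row_map, col_map, pr, pc):
    """work[i, j] = estimated work for block (i, j); row_map/col_map map block
    rows/columns to processor rows/columns. Assumes pr == pc for diagonals."""
    N = work.shape[0]
    total = work.sum()
    per_proc = np.zeros((pr, pc))
    row_load = np.zeros(pr)
    col_load = np.zeros(pc)
    diag_load = np.zeros(pr)
    for i in range(N):
        for j in range(N):
            r, c = row_map[i], col_map[j]
            per_proc[r, c] += work[i, j]
            row_load[r] += work[i, j]
            col_load[c] += work[i, j]
            diag_load[(r - c) % pr] += work[i, j]
    overall = total / (pr * pc * per_proc.max())
    row_bal = total / (pr * row_load.max())
    col_bal = total / (pc * col_load.max())
    diag_bal = total / (pr * diag_load.max())
    return overall, row_bal, col_bal, diag_bal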

5.10 Heuristic Remapping

Nonsymmetric CP mappings, which map rows independently of columns, are a way to avoid the diagonal imbalance that is automatic with SC mappings. We shall choose the row mapping RowMap to maximize the row balance, and independently choose ColMap to maximize column balance. Since the row mapping has no effect on the column balance, and vice versa, we may choose the row mapping so as to maximize row balance independent of the choice of column mapping.

The problems of determining RowMap and ColMap are each cases of a standard NP-complete problem, number partitioning [36], for which a simple heuristic is known to be good.3 This heuristic obtains a row mapping by considering the block rows in some predefined sequence. For each processor row, it maintains the total work for all blocks mapped to that row. The algorithm iterates over block rows, mapping a block row to the processor row that has received the least work thus far. We have experimented with several different sequences, the two best of which we now describe. The Decreasing Work (DW) heuristic considers rows in order of decreasing work. This is a standard approach to number partitioning; the idea is that small values toward the end of the sequence allow the algorithm to lessen any imbalance caused by large values encountered early in the sequence. The Increasing Depth (ID) heuristic considers rows in order of increasing depth in the elimination tree. In a sparse problem, the work associated with a row is closely related to its depth in the elimination tree.

The effect of these schemes is dramatic. If we look at the three aggregate measures of load balance, we find that these heuristics produce row and column balance of 0.98 or better, and diagonal balance of 0.93 or better, for test case BCSSTK31, which is typical. With ID as the row mapping we have produced better than a 50% improvement in overall balance, and better than a 20% improvement in performance, on average over our test matrices, with P = 100. The DW heuristic produces only slightly less impressive improvements. The choice of column mapping, as expected, is less important. In fact, for our test suite, the cyclic column mapping and ID row mapping gave the best mean performance.4

We also applied these ideas to four larger problems: DENSE4096, CUBE40, COPTER2 (a helicopter rotor blade model, from NASA) and 10FLEET (a linear programming formulation of an airline fleet assignment problem, from Delta Airlines). On 144 and 196 processors the heuristic (increasing depth on rows and cyclic on columns) again produces a roughly 20% performance improvement over a cyclic mapping. Peak performance of 2.3 Gflops for COPTER2 and 2.7 Gflops for 10FLEET was achieved; for the model problems CUBE40 and DENSE4096 the speeds were 3.2 and 5.2 Gflops.

In addition to the heuristics described so far, we also experimented with two other approaches to improving factorization load balance. The first is a subtle modification of the original heuristic. It begins by choosing some column mapping (we use a cyclic mapping). This approach then iterates over rows of blocks, mapping a row of blocks to a row of processors so as to minimize the amount of work assigned to any one processor. Recall that the earlier heuristic attempted to minimize the aggregate work assigned to an entire row of processors. We found that this alternative heuristic produced further large improvements in overall balance (typically 10–15% better than that of our original heuristic). Unfortunately, realized performance did not improve. This result indicates that load balance is not the most important performance bottleneck once our original heuristic is applied.

A very simple alternate approach reduces imbalance by performing cyclic row and column mappings on a processor grid whose dimensions p_c and p_r are relatively prime; this reduces diagonal imbalance. We tried this using 7 × 9 and 9 × 11 processor grids (using one fewer processor than for our earlier experiments with P = 64 and P = 100). The improvement in performance is somewhat lower than that achieved with our earlier remapping heuristic (17% and 18% mean improvement on 63 and 99 processors, versus 20% and 24% on 64 and 100 processors). On the other hand, the mapping needn't be computed.

3 Number partitioning is a well studied NP-complete problem. The objective is to distribute a set of numbers among a fixed number of bins so that the maximum sum in any bin is minimized.
4 Full experimental data has appeared in another paper [83].
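A minimal MATLAB sketch of the greedy number-partitioning heuristic described above: block rows are visited in Decreasing Work order and each is assigned to the least-loaded processor row so far. The function name and the driver values are invented for this sketch; the Increasing Depth variant would simply order the rows by elimination tree depth instead.

    function RowMap = greedy_row_mapping(RowWork, pr)
      % RowWork(i): estimated work for block row i;  pr: number of processor rows.
      N = length(RowWork);
      RowMap  = zeros(N, 1);                 % RowMap(i) = processor row assigned to block row i
      rowload = zeros(pr, 1);                % work assigned to each processor row so far
      [ignore, order] = sort(-RowWork(:));   % Decreasing Work (DW) sequence
      for k = 1:N
          i = order(k);
          [ignore, r] = min(rowload);        % least-loaded processor row so far
          RowMap(i) = r;
          rowload(r) = rowload(r) + RowWork(i);
      end
    end

For example, greedy_row_mapping(rand(100, 1), 10) should produce a maximum row load close to the mean row load.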

5.11 Scheduling Local Computations

The next questions to be addressed, clearly, are: (i) what is the most constraining bottleneck after our heuristic is applied, and (ii) can this bottleneck be addressed to further improve performance?

One potential remaining bottleneck is communication. Instrumentation of our block factorization code reveals that on the Paragon system, communication costs account for less than 20% of total runtime for all problems, even on 196 processors. The same instrumentation reveals that most of the processor time not spent performing useful factorization work is spent idle, waiting for the arrival of data.

We do not believe that the idle time is due to insufficient parallelism. Critical path analysis for problem BCSSTK15 on 100 processors, for example, indicates that it should be possible to obtain nearly 50% higher performance than we are currently obtaining. The same analysis for problem BCSSTK31 on 100 processors indicates that it should be possible to obtain roughly 30% higher performance. We therefore suspected that the scheduling of tasks by our code was not optimal. To that end we tried alternative scheduling policies:

FIFO. Tasks are initiated in the order in which the processor discovers that they are ready.

Destination depth. Ready tasks are initiated in order of the destination block's elimination tree depth.

Source depth. Ready tasks are initiated in order of the source block's elimination tree depth.

For the FIFO policy, a queue of ready tasks is used, while for the others, a heap is used. We experimented with 64 processors, using BCSSTK31, BCSSTK33, and DENSE2048. Both priority-based schemes are better than FIFO; destination depth seems slightly better than source depth. We observed a slowdown of 2% due to the heap data structure on BCSSTK33; the destination priority scheme then improved performance by 15%, for a net gain of 13%. For BCSSTK31 the net gain was 8%. For DENSE2048, however, there was no gain.

This improvement is encouraging. There may be more that can be achieved through the pursuit of a better understanding of the scheduling question.

Lecture 6

Parallel Machines

A parallel computer is a connected configuration of processors and memories. The choice space available to a computer architect includes the network topology, the node processor, the address-space organization, and the memory structure. These choices are based on the parallel computation model, the current technology, and marketing decisions. No matter what the pace of change, it is impossible to make intelligent decisions about parallel computers right now without some knowledge of their architecture. For a more advanced treatment of computer architecture we recommend Kai Hwang's Advanced Computer Architecture and Parallel Computer Architecture by Gupta, Singh, and Culler.

One may gauge which architectures are important today by the Top500 Supercomputer list1 published by Meuer, Strohmaier, Dongarra and Simon. The secret is to learn to read between the lines. There are three kinds of machines on the November 2003 Top 500 list:

• Distributed Memory Multicomputers (MPPs)

• Constellations of Symmetric Multiprocessors (SMPs)

• Clusters (NOWs and Beowulf clusters)

Vector Supercomputers, Single Instruction Multiple Data (SIMD) Machines and SMPs are no longer present on the list but used to be important in previous versions. How can one simplify (and maybe grossly oversimplify) the current situation? Perhaps by pointing out that the world’s fastest machines are mostly clusters. Perhaps it will be helpful to the reader to list some of the most important machines first sorted by type, and then by highest rank in the top 500 list. We did this in 1997 and also 2003.

1http://www.top500.org


Top 500 (1996): machine and first rank on the list, grouped by type.

Distributed Memory
    Hitachi/Tsukuba CP-PACS         1
    Fujitsu NWT                     2
    Hitachi SR2201                  3
    Intel XP/S                      4
    Cray T3D                        7
    Fujitsu VPP500                  8
    IBM SP2                         14
    TMC CM-5                        21
    Hitachi S-3800                  56
    Intel Delta                     120
    Parsytec GC Power Plus          230
    Meiko CS-2                      438
    IBM 9076                        486
    KSR2-80                         491

SMP Arrays
    SGI Power Challenge Array       95

SMP
    SGI Origin 2000                 211
    Convex SPP1200                  264
    SGI Power Challenge             288
    Digital AlphaServer 8400        296

Vector Machines
    NEC SX                          17
    Cray YMP                        54

SIMD Machines
    TMC CM-200                      196

Top 500 (2003): machine and first rank on the list, grouped by type.

Cluster
    ASCI Q - HP AlphaServer SC45                   2
    X - Selfmade Apple G5 Cluster                  3
    Tungsten - Dell PowerEdge Cluster              4
    Mpp2 - HP Integrity rx2600 Itanium2 Cluster    5
    Lightning - Linux Networx Opteron Cluster      6
    MCR Linux Xeon Cluster                         7
    IBM/Quadrics xSeries Xeon Cluster              10
    PSC - HP AlphaServer SC45                      12
    Legend DeepComp 6800 Itanium Cluster           14
    CEA - HP AlphaServer SC45                      15
    Aspen Systems Dual Xeon Cluster                17
    IBM xSeries Xeon Cluster                       22
    HP Integrity rx5670-4x256                      25
    Dell-Cray PowerEdge 1750 Cluster               26

Distributed Memory (MPPs)
    NEC Earth-Simulator                            1
    IBM ASCI White                                 8
    IBM Seaborg SP Power3                          9
    NCAR - IBM pSeries 690 Turbo                   13
    HPCx - IBM pSeries 690 Turbo                   16
    NAVOCEANO - IBM pSeries 690 Turbo              18
    US Govt. Cray X1                               19
    ORNL - Cray X1                                 20
    Cray Inc. Cray X1                              21
    ECMWF - IBM pSeries 690 Turbo                  23
    ECMWF - IBM pSeries 690 Turbo                  24
    Intel - ASCI Red                               27
    ORNL - IBM pSeries 690 Turbo                   28
    IBM Canada pSeries 690 Turbo                   29

Constellation of SMPs
    Fujitsu PRIMEPOWER HPC2500                     11
    NEC SX-5/128M8                                 88
    HP Integrity Superdome/HFabric                 117
    Sun Fire 15k/6800 Cluster                      151

The trend is clear to anyone who looks at the list. Distributed memory machines are on the way out and cluster computers are now the dominant force in supercomputers.

Distributed Memory Multicomputers:

Remembering that a computer is a processor and memory, really a processor with cache and memory, it makes sense to call a set of such “computers” linked by a network a multicomputer. Figure 6.1 shows 1) a basic computer which is just a processor and memory and also 2) a fancier computer where the processor has cache, and there is auxiliary disk memory. To the right, we picture 3) a three processor multicomputer. The line on the right is meant to indicate the network. These machines are sometimes called distributed memory multiprocessors. We can further distinguish between DMM’s based on how each processor addresses memory. We call this the private/shared memory issue:

Private versus shared memory on distributed memory machines: It is easy to see that the simplest architecture allows each processor to address only its own memory.

Figure 6.1: 1) A computer 2) A fancier computer 3) A multicomputer

When a processor wants to read data from another processor's memory, the owning processor must send the data over the network to the processor that wants it. Such machines are said to have private memory. A close analog is that in my office, I can easily find information that is located on my desk (my memory) but I have to make a direct request via my telephone (i.e., I must dial a phone number) to find information that is not in my office. And I have to hope that the other person is in his or her office, waiting for the phone to ring. In other words, I need the cooperation of the other active party (the person or processor) to be able to read or write information located in another place (the office or memory).

The alternative is a machine in which every processor can directly address every instance of memory. Such a machine is said to have a shared address space, and sometimes informally shared memory, though the latter terminology is misleading as it may easily be confused with machines where the memory is physically shared. On a shared address space machine, each processor can load data from or store data into the memory of any processor, without the active cooperation of that processor. When a processor requests memory that is not local to it, a piece of hardware intervenes and fetches the data over the network. Returning to the office analogy, it would be as if I asked to view some information that happened not to be in my office, and some special assistant actually dialed the phone number for me without my even knowing about it, and got hold of the special assistant in the other office (and these special assistants never leave to take a coffee break or do other work), who provided the information.

Most distributed memory machines have private addressing. Notable exceptions are the Cray T3D and the Fujitsu VPP500, which have shared physical addresses.

Clusters (NOWs and Beowulf Cluster):

Clusters are built from independent computers integrated through an after-market network. The idea of providing COTS (commodity off the shelf) base systems to satisfy specific computational requirements evolved as a market reaction to MPPs, with the thought that the cost might be lower. Clusters were considered to have slower communications compared to the specialized machines, but they have caught up fast and now outperform most specialized machines.

NOWs stands for Network of Workstations. Any collection of workstations is likely to be networked together: this is the cheapest way for many of us to work on parallel machines, given that networks of workstations already exist where most of us work.

The first Beowulf cluster, built by Thomas Sterling and Don Becker at the Goddard Space Flight Center in Greenbelt, Maryland, consisted of 16 DX4 processors connected by Ethernet. They named this machine Beowulf (a legendary Geatish warrior and hero of the Old English poem Beowulf). Now people use "Beowulf cluster" to denote a cluster of PCs or workstations interconnected by a private high-speed network, which is dedicated to running high-performance computing tasks. Beowulf clusters usually run a free-software operating system like Linux or FreeBSD, though Windows Beowulfs exist.

Central Memory Symmetric Multiprocessors (SMPs) and Constellation of SMPs:

Notice that we have already used the word "shared" to refer to the shared address space possible in a distributed memory computer. Sometimes the memory hardware in a machine does not obviously belong to any one processor. We then say the memory is central, though some authors may use the word "shared." Therefore, for us, the central/distributed distinction is one of system architecture, while the shared/private distinction mentioned already in the distributed context refers to addressing.

"Central" memory contrasted with distributed memory: We will view the physical memory architecture as distributed if the memory is packaged with the processors in such a way that some parts of the memory are substantially "farther" (in the sense of lower bandwidth or greater access latency) from a processor than other parts. If all the memory is nearly equally expensive to access, the system has central memory. The vector supercomputers are genuine central memory machines. A network of workstations has distributed memory.

Microprocessor machines known as symmetric multiprocessors (SMPs) are becoming typical now as mid-sized compute servers; this seems certain to continue to be an important machine design. On these machines each processor has its own cache while the main memory is central. There is no one "front-end" or "master" processor, so that every processor looks like every other processor. This is the "symmetry." To be precise, the symmetry is that every processor has equal access to the operating system. This means, for example, that each processor can independently prompt the user for input, or read a file, or see what the mouse is doing.

The microprocessors in an SMP themselves have caches built right onto the chips, so these caches act like distributed, low-latency, high-bandwidth memories, giving the system many of the important performance characteristics of distributed memory. Therefore if one insists on being precise, it is not all of the memory that is central, merely the main memory. Such systems are said to have non-uniform memory access (NUMA).

A big research issue for shared memory machines is the cache coherence problem. All fast processors today have caches. Suppose the cache can contain a copy of any memory location in the machine. Since the caches are distributed, it is possible that P2 can overwrite the value of x in P2's own cache and main memory, while P1 might not see this new updated value if P1 only looks at its own cache.


Figure 6.2: A four processor SMP (B denotes the bus between the central memory and the processor's cache).

Coherent caching means that when the write to x occurs, any cached copy of x will be tracked down by the hardware and invalidated; i.e., the copies are thrown out of their caches. Any read of x that occurs later will have to go back to its home memory location to get its new value. Maintenance of cache coherence is expensive for scalable shared memory machines. Today, only the HP Convex machine has scalable, cache coherent, shared memory. Other vendors of scalable, shared memory systems (Kendall Square Research, Evans and Sutherland, BBN) have gone out of business. Another, Cray, makes a machine (the Cray T3E) in which the caches can only keep copies of local memory locations. SMPs are often thought of as not scalable (performance peaks at a fairly small number of processors), because as you add processors, you quickly saturate the bus connecting the memory to the processors.

Whenever anybody has a collection of machines, it is always natural to try to hook them up together. Therefore any arbitrary collection of computers can become one big distributed computer. When all the nodes are the same, we say that we have a homogeneous arrangement. It has recently become popular to form high speed connections between SMPs, known as Constellations of SMPs or SMP arrays. Sometimes the nodes have different architectures, creating a heterogeneous situation. Under these circumstances, it is sometimes necessary to worry about the explicit format of data as it is passed from machine to machine.

SIMD machines: In the late 1980's, there were debates over SIMD versus MIMD. (Either pronounced as SIM-dee/MIM-dee or by reading the letters es-eye-em-dee/em-eye-em-dee.) These two acronyms, coined by Flynn in his classification of machines, refer to Single Instruction Multiple Data and Multiple Instruction Multiple Data. The second two letters, "MD" for multiple data, refer to the ability to work on more than one operand at a time. The "SI" or "MI" refer to the ability of a processor to issue instructions of its own. Most current machines are MIMD machines. They are built from microprocessors that are designed to issue instructions on their own. One might say that each processor has a brain of its own and can proceed to compute anything it likes independent of what the other processors are doing. On a SIMD machine, every processor is executing the same instruction, an add say, but it is executing on different data.

SIMD machines need not be as rigid as they sound. For example, each processor had the ability to not store the result of an operation. This was called context. If the context was false, the result was not stored, and the processor appeared not to execute that instruction. Also, the CM-2 had the ability to do indirect addressing, meaning that the physical address used by a processor to load a value for an add, say, need not be constant over the processors.

The most important SIMD machines were the Connection Machines 1 and 2 produced by Thinking Machines Corporation, and the MasPar MP-1 and MP-2. The SIMD market received a serious blow in 1992, when TMC announced that the CM-5 would be a MIMD machine. Now the debates are over. MIMD has won. The prevailing theory is that because of the tremendous investment by the personal computer industry in commodity microprocessors, it will be impossible to stay on the same steep curve of improving performance using any other processor technology. "No one will survive the attack of the killer micros!" said Eugene Brooks of the Lawrence Livermore National Lab. He was right. The supercomputing market does not seem to be large enough to allow vendors to build their own custom processors. And it is not realistic or profitable to build a SIMD machine out of these microprocessors. Furthermore, MIMD is more flexible than SIMD; there seem to be no big enough market niches to support even a single significant vendor of SIMD machines.

A close look at the SIMD argument: In some respects, SIMD machines are faster from the communications viewpoint. They can communicate with minimal latency and very high bandwidth because the processors are always in synch. The MasPar was able to do a circular shift of a distributed array, or a broadcast, in less time than it took to do a floating point addition. So far as we are aware, no MIMD machine in 1996 had a latency as small as the 24 µsec overhead required for one hop in the 1988 CM-2 or the 8 µsec latency on the MasPar MP-2. Admitting that certain applications are more suited to SIMD than others, we were among many who thought that SIMD machines ought to be cheaper to produce, in that one need not devote so much chip real estate to the ability to issue instructions. One would not have to replicate the program in every machine's memory. And communication would be more efficient in SIMD machines. Pushing this theory, the potentially fastest machines (measured in terms of raw performance if not total flexibility) should be SIMD machines.
In its day, the MP-2 was the world’s most cost-effective machine, as measured by the NAS Parallel Benchmarks. These advantages, however, do not seem to have been enough to overcome the relentless, amazing, and wonderful performance gains of the “killer micros”. Continuing with the Flynn classification (for historical purposes) Single Instruction Single Data or SISD denotes the sequential (or Von Neumann) machines that are on most of our desktops and in most of our living rooms. (Though most architectures show some amount of parallelism at some level or another.) Finally, there is Multiple Instruction Single Data or MISD, a class which seems to be without any extant member although some have tried to fit systolic arrays into this ill-fitting suit.

There have also been hybrids; the PASM Project (at Purdue University) has investigated the problem of running MIMD applications on SIMD hardware! There is, of course, some performance penalty.

Vector Supercomputers:

A vector computer today is a central, shared memory MIMD machine in which every processor has some pipelined arithmetic units and has vector instructions in its repertoire. A vector instruction is something like "add the 64 elements of vector register 1 to the 64 elements of vector register 2", or "load the 64 elements of vector register 1 from the 64 memory locations at addresses x, x + 10, x + 20, . . . , x + 630." Vector instructions have two advantages: fewer instructions fetched, decoded, and issued (since one instruction accomplishes a lot of computation), and predictable memory accesses that can be optimized for high memory bandwidth. Clearly, a single vector processor, because it performs identical operations in a vector instruction, has some features in common with SIMD machines. If the vector registers have p words each, then a vector processor may be viewed as an SIMD machine with shared, central memory, having p processors.

Hardware Technologies and Supercomputing:

Vector supercomputers have very fancy technology (bipolar ECL logic, fast but power hungry) in the processor and the memory, giving very high performance compared with other processor technologies; however, that gap has now eroded to the point that for most applications, fast microprocessors are within a factor of two in performance. Vector supercomputer processors are expensive and require unusual cooling technologies. Machines built of gallium arsenide, or using Josephson junction technology, have also been tried, and none has been able to compete successfully with the silicon CMOS (complementary metal-oxide semiconductor) technology used in PC and workstation microprocessors. Thus, from 1975 through the late 1980s, supercomputers were machines that derived their speed from uniprocessor performance, gained through the use of special hardware technologies; now supercomputer technology is the same as PC technology, and parallelism has become the route to performance.

6.0.1 More on private versus shared addressing

Both forms of addressing lead to difficulties for the programmer. In a shared address system, the programmer must ensure that any two processors that access the same memory location do so in the correct order: for example, processor one should not load a value from location N until processor zero has stored the appropriate value there (this is called a "true" or "flow" dependence); in another situation, it may be necessary that processor one not store a new value into location N before processor zero loads the old value (this is an "anti" dependence); finally, if multiple processors write to location N, its final value is determined by the last writer, so the order in which they write is significant (this is called "output" dependence). The fourth possibility, a load followed by another load, is called an "input" dependence, and can generally be ignored. Thus, the programmer can get incorrect code due to "data races". Also, performance bugs due to too many accesses to the same location (the memory bank that holds a given location becomes the sequential bottleneck) are common.2

The big problem created by private memory is that the programmer has to distribute the data. "Where's the matrix?" becomes a key issue in building a LINPACK style library for private memory machines. And communication cost, whenever there is NUMA, is also a critical issue. It has been said that the three most important issues in parallel algorithms are "locality, locality, and locality".3

One factor that complicates the discussion is that a layer of software, at the operating system level or just above it, can provide virtual shared addressing on a private address machine by using interrupts to get the help of the owning processor when a remote processor wants to load or store data to its memory. A different piece of software can also segregate the shared address space of a machine into chunks, one per processor, and confine all loads and stores by a processor to its own chunk, while using private address space mechanisms like message passing to access data in other chunks. (As you can imagine, hybrid machines have been built, with some amount of shared and private memory.)

2 It is an important problem of the "PRAM" model used in the theory of parallel algorithms that it does not capture this kind of performance bug, and also does not account for communication in NUMA machines.
3 For those too young to have suffered through real estate transactions, the old adage in that business is that the three most important factors in determining the value of a property are "location, location, and location".
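As a small, contrived illustration of these ordering constraints (written here in MATLAB purely for concreteness; the variables are invented for this sketch), consider:

    n = 8;  a = rand(n, 1);  x = zeros(n, 1);

    x(1) = a(1);
    for i = 2:n
        % flow ("true") dependence: x(i) needs the value stored into x(i-1) on the
        % previous iteration, so these iterations cannot be reordered freely
        x(i) = x(i-1) + a(i);
    end

    s = x(n);        % flow dependence: this load must follow the store in the loop
    x(n) = 0;        % anti dependence: this store must not move before the load into s
    x(n) = 1;        % output dependence: the final value of x(n) is set by the last writer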

6.0.2 Programming Model

The programming model used may seem to be natural for one style of machine; data parallel programming seems to be a SIMD shared memory style, and message passing seems to favor distributed memory MIMD. Nevertheless, it is quite feasible to implement data parallelism on distributed memory MIMD machines. For example, on the Thinking Machines CM-5, a user can program in CM-Fortran, an array data parallel language, or can write node programs in languages such as C and Fortran with message passing system calls, in the style of MIMD computing. We will discuss the pros and cons of SIMD and MIMD models in the next section when we discuss parallel languages and programming models.

6.0.3 Machine Topology

The two things processors need to do in parallel machines that they do not do when all alone are communication (with other processors) and coordination (again with other processors). Communication is obviously needed: one computes a number that the other requires, for example. Coordination is important for sharing a common pool of resources, whether they are hardware units, files, or a pool of work to be performed. The usual mechanisms for coordination, moreover, involve communication.

Parallel machines differ in their underlying hardware for supporting message passing and data routing. In a shared memory parallel machine, communication between processors is achieved by access to common memory locations. Access to the common memory is supported by a switch network that connects the processors and memory modules. The set of proposed switch networks for shared memory parallel machines includes crossbars and multistage networks such as the butterfly network. One can also connect processors and memory modules by a bus, and this is done when the number of processors is limited to ten or twenty.

An interconnection network is normally used to connect the nodes of a multicomputer as well. Again, the network topology varies from one machine to another. Due to technical limitations, most commercially used topologies are of small node degree. Commonly used network topologies include (not exclusively) linear arrays, rings, hierarchical rings, two or three dimensional grids or tori, hypercubes, and fat trees. The performance and scalability of a parallel machine in turn depend on the network topology. For example, a two dimensional grid of p nodes has diameter √p; the diameter of a hypercube or fat tree of p nodes, on the other hand, is log p. This implies that the number of physical steps to send a message from a processor to its most distant processor is √p and log p, respectively, for a 2D grid and a hypercube of p processors.

The node degree of a 2D grid is 4, while the degree of a hypercube is log p. Another important criterion for the performance of a network topology is its bisection bandwidth, which is the minimum communication capacity of a set of links whose removal partitions the network into two equal halves. Assuming unit capacity on each direct link, a 2D and a 3D grid of p nodes have bisection bandwidth √p and p^(2/3) respectively, while a hypercube of p nodes has bisection bandwidth Θ(p/ log p). (See FTL page 394.)

There is an obvious cost/performance trade-off to make in choosing machine topology. A hypercube is much more expensive to build than a two dimensional grid of the same size. An important study done by Bill Dally at Caltech showed that for randomly generated message traffic, a grid could perform better and be cheaper to build. Dally assumed that the number of data signals per processor was fixed, and could be organized into either four "wide" channels in a grid topology or log n "narrow" channels (in the first hypercubes, the data channels were bit-serial) in a hypercube. The grid won, because the average utilization of the hypercube channels was too low: the wires, probably the most critical resource in the parallel machine, were sitting idle. Furthermore, the work on routing technology at Caltech and elsewhere in the mid 80's resulted in a family of hardware routers that delivered messages with very low latency even though the length of the path involved many "hops" through the machine. The earliest multicomputers used "store and forward" networks, in which a message sent from A through B to C was copied into and out of the memory of the intermediate node B (and any others on the path); this caused very large latencies that grew in proportion to the number of hops. Later routers, including those used in today's networks, have a "virtual circuit" capability that avoids this copying and results in small latencies.

Does topology make any real difference to the performance of parallel machines in practice? Some may say "yes" and some may say "no". Due to the small size (less than 512 nodes) of most parallel machine configurations and large software overhead, it is often hard to measure the performance of interconnection topologies at the user level.

6.0.4 Homogeneous and heterogeneous machines

Another example of the cost/performance trade-off is the choice between tightly coupled parallel machines and workstation clusters, workstations that are connected by fast switches or ATMs. The networking technology enables us to connect heterogeneous machines (including supercomputers) together for better utilization. Workstation clusters may have better cost/performance trade-offs and are becoming a big market challenger to "main-frame" supercomputers.

A parallel machine can be homogeneous or heterogeneous. A homogeneous parallel machine uses identical node processors. Almost all tightly coupled supercomputers are homogeneous. Workstation clusters may often be heterogeneous. The Cray T3D is in some sense a heterogeneous parallel system which contains a vector parallel computer C90 as the front end and the massively parallel section of the T3D. (The necessity of buying the front end was evidently not a marketing plus: the T3E does not need one.) A future parallel system may contain a cluster of machines of various computational power, from workstations to tightly coupled parallel machines. The scheduling problem will inevitably be much harder on a heterogeneous system because of the different speed and memory capacity of its node processors.

More than 1000 so called supercomputers have been installed worldwide. In the US, parallel machines have been installed and used at national research labs (Los Alamos National Laboratory, Sandia National Labs, Oak Ridge National Laboratory, Lawrence Livermore National Laboratory, NASA Ames Research Center, US Naval Research Laboratory, DOE/Bettis Atomic Power Laboratory, etc.), supercomputing centers (Minnesota Supercomputer Center, Urbana-Champaign NCSA, Pittsburgh Supercomputing Center, San Diego Supercomputer Center, etc.), the US Government, commercial companies (Ford Motor Company, Mobil, Amoco), and major universities. Machines from different supercomputing companies look different, are priced differently, and are named differently. Here are the names and birthplaces of some of them.

• Cray T3E (MIMD, distributed memory, 3D torus, uses Digital Alpha microprocessors), C90 (vector), and Cray YMP, from Cray Research, Eagan, Minnesota.

• Thinking Machines CM-2 (SIMD, distributed memory, almost a hypercube) and CM-5 (SIMD and MIMD, distributed memory, Sparc processors with added vector units, fat tree), from Thinking Machines Corporation, Cambridge, Massachusetts.

• Intel Delta and Intel Paragon (mesh structure, distributed memory, MIMD), from Intel Corporation, Beaverton, Oregon. Based on the Intel i860 RISC, but new machines are based on the P6. Intel recently sold the world's largest computer (over 6,000 P6 processors) to the US Dept of Energy for use in nuclear weapons stockpile simulations.

• IBM SP-1 and SP-2 (clusters, distributed memory, MIMD, based on the IBM RS/6000 processor), from IBM, Kingston, New York.

• MasPar MP-2 (SIMD, small enough to sit next to a desk), by MasPar, Santa Clara, California.

• KSR-2 (globally addressable memory, hierarchical rings, SIMD and MIMD), by Kendall Square, Waltham, Massachusetts. Now out of business.

• Fujitsu VPX200 (multi-processor pipeline), by Fujitsu, Japan.

• NEC SX-4 (multi-processor vector, shared and distributed memory), by NEC, Japan.

• Tera MTA (MPP vector, shared memory, multithreaded, 3D torus), by Tera Computer Company, Seattle, Washington. A novel architecture which uses the ability to make very fast context switches between threads to hide the latency of access to memory.

• Meiko CS-2HA (shared memory, multistage switch network, local I/O device), by Meiko, Concord, Massachusetts and Bristol, UK.

• Cray-3 (gallium arsenide integrated circuits, multiprocessor, vector), by Cray Computer Corporation, Colorado Springs, Colorado. Now out of business.

6.0.5 Distributed Computing on the Internet and Akamai Network

Examples of distributed computing on the internet:

• seti@home: "a scientific experiment that uses Internet-connected computers to download and analyze radio telescope data". At the time of this writing, its reported performance was 26.73 teraflops.

• Distributed.net: uses the idle processing time of its thousands of member computers to solve computationally intensive problems. Its computing power is now equivalent to that of "more than 160000 PII 266MHz computers".

• Parabon: recycles computers' idle time for bioinformatic computations.

• Google Compute: runs as part of the Google Toolbar within a user's browser. It detects spare cycles on the machine and puts them to use solving scientific problems selected by Google.

The Akamai network consists of thousands of servers spread globally that cache web pages and route traffic away from congested areas. This idea originated with Tom Leighton and Danny Lewin at MIT.

Lecture 7

FFT

7.1 FFT

The Fast Fourier Transform is perhaps the most important subroutine in scientific computing. It has applications ranging from multiplying numbers and polynomials to image and signal processing, time series analysis, and the solution of linear systems and PDEs. There are tons of books on the subject, including two recent wonderful ones by Charles Van Loan and by Briggs.

The discrete Fourier transform of a vector x is y = F_n x, where F_n is the n × n matrix whose entries are (F_n)_{jk} = e^{−2πijk/n}, j, k = 0, . . . , n − 1. It is nearly always a good idea to use 0-based notation (as with the C programming language) in the context of the discrete Fourier transform. The negative exponent corresponds to Matlab's definition; indeed, in Matlab we obtain fn=fft(eye(n)). A good example is

    F_4 = [ 1   1   1   1
            1  -i  -1   i
            1  -1   1  -1
            1   i  -1  -i ] .

Sometimes it is convenient to denote (F_n)_{jk} = ω_n^{jk}, where ω_n = e^{−2πi/n}.

The Fourier matrix has more interesting properties than any matrix deserves to have. It is symmetric (but not Hermitian). It is Vandermonde (but not ill-conditioned). It is unitary except for a scale factor ((1/√n) F_n is unitary). In two ways the matrix is connected to group characters: the matrix itself is the character table of the finite cyclic group, and the eigenvectors of the matrix are determined from the character table of a multiplicative group.

The trivial way to do the Fourier transform is to compute the matrix-vector multiply, requiring n^2 multiplications and roughly the same number of additions. Cooley and Tukey gave the first O(n log n) time algorithm (actually the algorithm may be found in Gauss' work), known today as the FFT algorithm. We shall assume that n = 2^p. The Fourier matrix has the simple property that if Π_n is an unshuffle operation, then

    F_n Π_n^T = [ F_{n/2}    D_n F_{n/2}
                  F_{n/2}   -D_n F_{n/2} ] ,                (7.1)

where D_n is the diagonal matrix diag(1, ω_n, . . . , ω_n^{n/2−1}). One DFT algorithm is then simply: 1) unshuffle the vector, 2) recursively apply the FFT algorithm to the top half and the bottom half, then 3) combine elements in the top part with corresponding elements in the bottom part ("the butterfly") as prescribed by the matrix

    [ I    D_n
      I   -D_n ] .

Everybody has their favorite way to visualize the FFT algorithm. For us, the right way is to think of the data as living on a hypercube. The algorithm is then: permute the cube, perform the FFT on a pair of opposite faces, and then perform the butterfly along edges across the dimension connecting the opposite faces. We now repeat the three steps of the recursive algorithm in index notation:

• Step 1: i_{d−1} . . . i_1 i_0  →  i_0 i_{d−1} . . . i_1

• Step 2: i_0 i_{d−1} . . . i_1  →  i_0 fft(i_{d−1} . . . i_1)

• Step 3: i_0 fft(i_{d−1} . . . i_1)  →  ī_0 fft(i_{d−1} . . . i_1)

Here Step 1 is a data permutation, Step 2 refers to two FFTs, and Step 3 is the butterfly on the high order bit. In conventional notation:

    y_j = (F_n x)_j = Σ_{k=0}^{n−1} ω_n^{jk} x_k

can be cut into the even and the odd parts:

    y_j = Σ_{k=0}^{m−1} ω_n^{2jk} x_{2k} + ω_n^j ( Σ_{k=0}^{m−1} ω_n^{2jk} x_{2k+1} ) ;

since ω_n^2 = ω_m, the two sums are just FFT(x_even) and FFT(x_odd), where m = n/2. With this remark (see Fig. 7.1),

    y_j     = Σ_{k=0}^{m−1} ω_m^{jk} x_{2k} + ω_n^j Σ_{k=0}^{m−1} ω_m^{jk} x_{2k+1} ,
    y_{j+m} = Σ_{k=0}^{m−1} ω_m^{jk} x_{2k} − ω_n^j Σ_{k=0}^{m−1} ω_m^{jk} x_{2k+1} .

Then the algorithm keeps recurring; the entire "communication" needed for an FFT on a vector of length 8 can be seen in Fig. 7.2.

Figure 7.1: Recursive block of the FFT.
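The recursion above can be written directly in MATLAB. The sketch below assumes x is a column vector whose length is a power of two, and is meant only to illustrate the even/odd splitting, not to replace the built-in fft.

    function y = myfft(x)
      % Recursive radix-2 FFT following the even/odd splitting in the text.
      n = length(x);
      if n == 1
          y = x;
          return
      end
      m  = n / 2;
      ye = myfft(x(1:2:n-1));            % FFT of the even-indexed entries (x_0, x_2, ...)
      yo = myfft(x(2:2:n));              % FFT of the odd-indexed entries  (x_1, x_3, ...)
      w  = exp(-2*pi*1i*(0:m-1)'/n);     % twiddle factors omega_n^j
      y  = [ye + w.*yo; ye - w.*yo];     % the butterfly
    end

A quick check: for a random column vector x of length 2^p, myfft(x) should agree with fft(x) to roundoff.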

The number of operations for an FFT on a vector of length n equals twice the number for an FFT on length n/2 plus n/2 on the top level. As the solution of this recurrence, we get that the total number of operations is (1/2) n log n.

Now we analyze the data motion required to perform the FFT. First we assume that one element of the vector x is assigned to each processor. Later we discuss the "real-life" case when the number of processors is less than n and hence each processor holds some subset of elements. We also discuss how the FFT is implemented on the CM-2 and the CM-5. The FFT always goes from high order bit to low order bit, i.e., there is a fundamental asymmetry that is not evident in the figures below. This seems to be related to the fact that one can obtain a subgroup of the cyclic group by taking alternating elements, but not by taking, say, the first half of the elements.


Figure 7.2: FFT network for 8 elements. (Such a network is not built in practice)

7.1.1 Data motion

Let i_p i_{p−1} . . . i_2 i_1 i_0 be a bit sequence. Let us call i_0 i_1 i_2 . . . i_{p−1} i_p the bit reversal of this sequence. The important property of the FFT network is that if the i-th input is assigned to the i-th processor for i ≤ n, then the i-th output is found at the processor whose address is the bit reversal of i. Consequently, if the input is assigned to processors in bit-reversed order, then the output is in standard order. The inverse FFT reverses bits in the same way.
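The bit-reversal permutation itself is easy to generate; here is a small MATLAB fragment (illustrative only, and not assuming the Signal Processing Toolbox's bitrevorder):

    p = 3;  n = 2^p;
    idx  = (0:n-1)';
    bits = dec2bin(idx, p);          % p-bit binary address of each index, one per row
    rev  = bin2dec(fliplr(bits));    % reverse the bits of each address
    disp([idx rev])                  % e.g. index 1 (001) maps to 4 (100) when p = 3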


Figure 7.3: The output of FFT is in bit-reversed order.

To see why the FFT reverses the bit order, let us have a look at the i-th segment of the FFT network (Fig. 7.3). The input is divided into parts and the current input (top side) consists of FFTs of these parts. One "block" of the input consists of the same fixed output element of all the parts. The i − 1 most significant bits of the input address determine this output element, while the least significant bits determine the part of the original input whose transforms are at this level. The next step of the FFT computes the Fourier transform of twice larger parts; these consist of an "even" and an "odd" original part.

Figure 7.4: Left: more than one element per processor. Right: one box is a 4 × 4 matrix multiply.

Parity is determined by the i-th most significant bit. Now let us have a look at one unit of the network in Fig. 7.1; the two inputs correspond to the same even and odd parts, while the two outputs are the possible "farthest" vector elements: they differ in the most significant bit. What happens is that the i-th bit jumps first and becomes most significant (see Fig. 7.3).

Now let us follow the data motion in the entire FFT network. Let us assume that the i-th input element is assigned to processor i. Then after the second step a processor with binary address i_p i_{p−1} i_{p−2} . . . i_1 i_0 has the i_{p−1} i_p i_{p−2} . . . i_1 i_0-th data; the second bit has jumped first. Then the third, fourth, . . ., p-th bits all jump first, and finally that processor has the i_0 i_1 i_2 . . . i_{p−1} i_p-th output element.

7.1.2 FFT on parallel machines

In a realistic case, on a parallel machine some bits in the input address are local to one processor. The communication network can be seen in Fig. 7.4, left. FFT requires a large amount of communication; indeed it has fewer operations per communication than usual dense linear algebra. One way to increase this ratio is to combine some layers into one, as in Fig. 7.4, right. If s consecutive layers are combined, in one step a 2^s × 2^s matrix multiplication must be performed. Since matrix multiplication vectorizes and usually there are optimized routines to do it, such a step is more efficient than communicating all small parts. Such a modified algorithm is called the High Radix FFT.

The FFT algorithm on the CM-2 is basically a High Radix FFT. However, on the CM-5 data motion is organized in a different way. The idea is the following: if s bits of the address are local to one processor, the last s phases of the FFT do not require communication. Let 3 bits be local to one processor, say. On the CM-5 the following data rearrangement is made: the data from the

    i_p i_{p−1} i_{p−2} . . . i_3 | i_2 i_1 i_0 -th

processor is moved to the

    i_2 i_1 i_0 i_{p−3} i_{p−4} . . . i_3 | i_p i_{p−1} i_{p−2} -th!

This data motion can be arranged in a clever way; after that the next 3 steps are local to processors. Hence the idea is to perform all communication at once before the actual operations are made. [Not yet written: The six step FFT] [Not yet written: FFTW] [Good idea: Use the picture in our paper first to illustrate the notation]

7.1.3 Exercises

1. Verify equation (7.1).

2. Just for fun, find out about the FFT through Matlab. We are big fans of the phone command for those students who do not already have a good physical feeling for taking Fourier transforms. This command shows (and plays if you have a speaker!) the signal generated when pressing a touch tone telephone in the United States and many other countries. In the old days, when a pushbutton phone was broken apart, you could see that pressing a key depressed one lever for an entire row and another lever for an entire column. (For example, pressing 4 would depress the lever corresponding to the second row and the lever corresponding to the first column.) To look at the FFT matrix, in a way, plot(fft(eye(7))); axis('square').

3. In Matlab, use the flops function to obtain a flops count for FFTs of different power-of-2 sizes. Make your input complex. Guess a flop count of the form a + bn + c log n + dn log n. Remembering that Matlab's \ operator solves least squares problems, find a, b, c and d. Guess whether Matlab is counting flops or using a formula.

7.2 Matrix Multiplication

Everyone thinks that to multiply two 2-by-2 matrices requires 8 multiplications. However, Strassen gave a method which requires only 7 multiplications! Actually, compared to the 8 multiplies and 4 adds of the traditional way, Strassen's method requires only 7 multiplies but 18 adds. Nowadays, when multiplication of numbers is as fast as addition, this does not seem so important. However, when we think of block matrices, matrix multiplication is very slow compared to addition. Strassen's method gives an O(n^2.8074) algorithm for matrix multiplication, in a recursive way very similar to the FFT. First we describe Strassen's method for two block matrices:

    [ A_{1,1}  A_{1,2} ]   [ B_{1,1}  B_{1,2} ]   [ P_1 + P_4 − P_5 + P_7     P_3 + P_5             ]
    [ A_{2,1}  A_{2,2} ] × [ B_{2,1}  B_{2,2} ] = [ P_2 + P_4                 P_1 + P_3 − P_2 + P_6 ]

where

    P_1 = (A_{1,1} + A_{2,2})(B_{1,1} + B_{2,2}) ,
    P_2 = (A_{2,1} + A_{2,2}) B_{1,1} ,
    P_3 = A_{1,1} (B_{1,2} − B_{2,2}) ,
    P_4 = A_{2,2} (B_{2,1} − B_{1,1}) ,
    P_5 = (A_{1,1} + A_{1,2}) B_{2,2} ,
    P_6 = (A_{2,1} − A_{1,1})(B_{1,1} + B_{1,2}) ,
    P_7 = (A_{1,2} − A_{2,2})(B_{2,1} + B_{2,2}) .

If, as in the FFT algorithm, we assume that n = 2^p, the matrix multiply of two n-by-n matrices calls for 7 multiplications of (n/2)-by-(n/2) matrices. Hence the time required for this algorithm is O(n^{log_2 7}) = O(n^{2.8074}). Note that Strassen's idea can be improved further (of course, at the cost of many more additions and an impractically large constant); the current record is an O(n^{2.376})-time algorithm. A final note is that, again as in FFT implementations, we do not recur all the way down to 2-by-2 matrices. For some sufficient p, we stop when we reach 2^p × 2^p blocks and use a direct matrix multiply, which vectorizes well on the machine.
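A minimal recursive MATLAB sketch of Strassen's method, using the seven products above; the cutoff value is arbitrary, and the routine assumes square matrices whose size is a power of two.

    function C = strassen(A, B)
      n = size(A, 1);
      if n <= 64                      % below an (arbitrary) cutoff, use the direct multiply
          C = A * B;
          return
      end
      m = n/2;  i1 = 1:m;  i2 = m+1:n;
      A11 = A(i1,i1); A12 = A(i1,i2); A21 = A(i2,i1); A22 = A(i2,i2);
      B11 = B(i1,i1); B12 = B(i1,i2); B21 = B(i2,i1); B22 = B(i2,i2);
      P1 = strassen(A11 + A22, B11 + B22);
      P2 = strassen(A21 + A22, B11);
      P3 = strassen(A11, B12 - B22);
      P4 = strassen(A22, B21 - B11);
      P5 = strassen(A11 + A12, B22);
      P6 = strassen(A21 - A11, B11 + B12);
      P7 = strassen(A12 - A22, B21 + B22);
      C  = [P1 + P4 - P5 + P7,  P3 + P5;
            P2 + P4,            P1 - P2 + P3 + P6];
    end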

7.3 Basic Data Communication Operations

We conclude this section by listing the basic data communication operations that are commonly used in a parallel program.

• Single Source Broadcast

• All-to-All Broadcast

• All-to-All Personalized Communication

• Array Indexing or Permutation: There are two types of array indexing: the left array indexing and the right array indexing.

• Polyshift: SHIFT and EOSHIFT.

• Sparse Gather and Scatter

• Reduction and Scan

Lecture 8

Domain Decomposition

Domain decomposition is a term used by at least two different communities. Literally, the words indicate the partitioning of a region. As we will see in Chapter ?? of this book, an important computational geometry problem is to find good ways to partition a region. This is not what we will discuss here.

In scientific computing, domain decomposition refers to the technique of solving partial differential equations using subroutines that solve problems on subdomains. Originally, a domain was a contiguous region in space, but the idea has generalized to include any useful subset of the discretization points. Because of this generalization, the distinction between domain decomposition and multigrid has become increasingly blurred.

Domain decomposition is an idea that is already useful on serial computers, but it takes on a greater importance when one considers a parallel machine with, say, a handful of very powerful processors. In this context, domain decomposition is a parallel divide-and-conquer approach to solving the PDE.

To guide the reader, we quickly summarize the choice space that arises in the domain decomposition literature. As usual, a domain decomposition problem starts as a continuous problem on a region and is discretized into a finite problem on a discrete domain.

We will take as our model problem the solution of the elliptic equation ∇²u = f on a region Ω which is the union of at least two subdomains Ω_1 and Ω_2. Here ∇² is the Laplacian operator, defined by ∇²u = ∂²u/∂x² + ∂²u/∂y². Domain decomposition ideas tend to be best developed for elliptic problems, but may be applied in more general settings.



Figure 8.1: Example domain of circle and square with overlap


Figure 8.2: Domain divided into two subdomains without overlap


Figure 8.3: Example domain with many overlapping regions

Domain Decomposition Outline

1. Geometric Issues
   Overlapping or non-overlapping regions
   Geometric Discretization
   Finite Difference or Finite Element
   Matching or non-matching grids

2. Algorithmic Issues
   Algebraic Discretization
   Schwarz Approaches: Additive vs. Multiplicative
   Substructuring Approaches
   Accelerants
   Domain Decomposition as a Preconditioner
   Coarse (Hierarchical/Multilevel) Domains

3. Theoretical Considerations

8.1 Geometric Issues

The geometric issues in domain decomposition are 1) how are the domains decomposed into subregions, and 2) how is the region discretized using some form of grid or irregular mesh. We consider these issues in turn.

8.1.1 Overlapping vs. Non-overlapping regions

So as to emphasize the issue of overlap vs. non-overlap, we can simplify all the other issues by assuming that we are solving the continuous problem (no discretization) exactly on each domain (no choice of algorithm). The reader may be surprised to learn that domain decomposition methods divide neatly into either overlapping or non-overlapping methods. Though one can find much in common between these two methods, they are really rather different in flavor. When there is overlap, the methods are sometimes known as Schwarz methods, while when there is no overlap, the methods are sometimes known as substructuring. (Historically, the former name was used in the continuous case, and the latter in the discrete case, but this historical distinction has been, and even should be, blurred.) We begin with the overlapping region illustrated in Figure 8.1. Schwarz in 1870 devised an obvious alternating procedure for solving Poisson's equation ∇²u = f:

1. Start with any guess for u_2 on σ_1.

2. Solve ∇²u_1 = f on Ω_1 by taking u_1 = u_2 on σ_1 (i.e., solve in the square using boundary data from the interior of the circle).

3. Solve ∇²u_2 = f on Ω_2 by taking u_2 = u_1 on σ_2 (i.e., solve in the circle using boundary data from the interior of the square).

4. Go to 2 and repeat until convergence is reached.

The procedure above is illustrated in Figure 8.4. One of the characteristics of elliptic PDEs is that the solution at every point depends on global conditions. The information transfer between the regions clearly occurs in the overlap region.

If we "choke" the transfer of information by considering the limit as the overlap area tends to 0, we find ourselves in a situation typified by Figure 8.2. The basic Schwarz procedure no longer works. Do you see why? No matter what the choice of data on the interface, it would not be updated. The result would be that the solution would not be differentiable along the interface. An example is given in Figure 8.5.

One approach to solving the non-overlapped problem is to concentrate directly on the domain of intersection. Let g be a current guess for the solution on the interface. We can then solve ∇²u = f on Ω_1 and Ω_2 independently using the value of g as Dirichlet conditions on the interface. We can define the map

    T : g → ∂g/∂n_1 + ∂g/∂n_2 .

This is an affine map from functions on the interface to functions on the interface, defined by taking a function to the jump in the derivative. The operator T is known as the Steklov-Poincaré operator. Suppose we can find the exact solution to T g = 0. We would then have successfully decoupled the problem so that it may be solved independently on the two domains Ω_1 and Ω_2. This is a "textbook" illustration of the divide and conquer method, in that solving T g = 0 constitutes the "divide."
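To see the alternating procedure in action on a problem small enough to type in, here is a 1-D MATLAB sketch for −u'' = f on (0, 1) with u(0) = u(1) = 0, using two overlapping subintervals. The grid, right-hand side, and number of sweeps are all made up for this illustration.

    n  = 101;  h = 1/(n-1);  x = (0:n-1)'*h;
    f  = ones(n, 1);                 % right-hand side
    u  = zeros(n, 1);                % global iterate; also carries the interface data
    a2 = round(0.4/h) + 1;           % grid index of the left end of subdomain 2 (x = 0.4)
    b1 = round(0.6/h) + 1;           % grid index of the right end of subdomain 1 (x = 0.6)
    lap = @(m) (2*eye(m) - diag(ones(m-1,1),1) - diag(ones(m-1,1),-1)) / h^2;
    for sweep = 1:20
        % Solve on subdomain 1 = (0, 0.6) using the current u at x = 0.6 as Dirichlet data.
        i1 = 2:b1-1;  rhs = f(i1);  rhs(end) = rhs(end) + u(b1)/h^2;
        u(i1) = lap(length(i1)) \ rhs;
        % Solve on subdomain 2 = (0.4, 1) using the freshly updated u at x = 0.4.
        i2 = a2+1:n-1;  rhs = f(i2);  rhs(1) = rhs(1) + u(a2)/h^2;
        u(i2) = lap(length(i2)) \ rhs;
    end
    plot(x, u)                       % converges to the solution of the global problem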

8.1.2 Geometric Discretization In the previous section we contented ourselves with formulating the problem on a continuous domain, and asserted the existence of solutions either to the subdomain problems in the Schwarz case, or the Stekhlov-Poincar´e operator in the continuous case. Of course on a real computer, a discretization of the domain and a corresponding discretization of the equation is needed. The result is a linear system of equations.

Finite Differences or Finite Elements Finite differences is actually a special case of finite elements, and all the ideas in domain decom- position work in the most general context of finite elements. In finite differences, one typically imagines a square mesh. The prototypical example is the five point stencil for the Laplacian in ∂2u ∂2u two dimensions. Using this stencil, the continuous equation ∂x2 + ∂y2 = f(x, y) is transformed to a linear system of equations of the form: 1 1 ( u E + 2u u W ) ( u N + 2u u S) = 4f −h2 − i i − i − h2 − i i − i i E W N where for each ui, ui is the element to the right of ui, ui is to the left of ui, ui is above ui, S and ui is below ui. An analog computer to solve this problem would consist of a grid of one ohm resistors. In finite elements, the protypical example is a triangulation of the region, and the appropriate formulation of the PDE on these elements.

Matching vs. Non-matching grids When solving problems as in our square-circle example of Figure 8.1, it is necessary to discretize the interior of the regions with either a finite difference style grid or a finite element style mesh. The square may be nicely discretized by covering it with Cartesian graph-paper, while the circle may

Preface 107

¢¡¤£¦¥

¢¡¤£¦¥ © ¢¡¤£¦¥¨§

Figure 8.4: Schwarz’ alternating procedure 108 Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004

0.5

0.4

0.3

0.2

0.1

0 0.5

0

1 −0.5 0.5 0 −0.5

Figure 8.5: Incorrect solution for non-overlaped problem. The result is not differentiable along the boundary between the two regions. be more conveniently discretized by covering it with polar graph paper. Under such a situation, the grids do not match, and it becomes necessary to transfer points interior to Ω2 to the boundary of Ω1 and vice versa. Figure 8.6 shows an example domain with non-mathing grids. Normally, grid values are interpolated for this kind of grid line up pattern.

8.2 Algorithmic Issues

Once the domain is discretized, numerical algorithms must be formulated. There is a definite line drawn between Schwarz (overlapping) and substructuring (non-overlapping) approaches.

Figure 8.6: Example domain discretized into non-matching grids Preface 109

8.2.1 Classical Iterations and their block equivalents Let us review the basic classical methods for solving PDE’s on a discrete domain. 1. Jacobi - At step n, the neighboring values used are from step n 1 − Using Jacobi to solve the system Au=f requires using repeated applications of the iteration: 1 u (n+1) = u n + [f a u (n)] i i i a i i ij j i − j=i ∀ X6 2. Gauss-Seidel - Values at step n are used if available, otherwise the values are used from step n 1 − Gauss-Seidel uses applications the iteration:

(n+1) n 1 (n+1) (n) ui = ui + [fi aijuj aijuj ] i aii − − ∀ Xji 3. Red Black Ordering - If the grid is a checkerboard, solve all red points in parallel using black values at n 1, then solve all black points in parallel using red values at step n For the − checkerboard, this corresponds to the pair of iterations: 1 u (n+1) = u n + [f a u (n)] i even i i a i i ij j i − j=i ∀ X6 1 u (n+1) = u n + [f a u (n+1)] i odd i i a i i ij j i − j=i ∀ X6 Analogous block methods may be used on a domain that is decomposed into a number of multiple regions. Each region is thought of as an element used to solve the larger problem. This is known as block Jacobi, or block Gauss-Seidel.

1. Block Gauss-Seidel - Solve each region in series using the boundary values at n if available. 2. Block Jacobi - Solve each region on a separate processor in parallel and use boundary values at n 1. (Additive scheme) − 3. Block coloring scheme - Color the regions so that like colors do not touch and solve all regions with the same color in parallel. ( Multiplicative scheme ) The block Gauss-Seidel algorithm is called a multiplicative scheme for reasons to be explained shortly. In a corresponding manner, the block Jacobi scheme is called an additive scheme.

8.2.2 Schwarz approaches: additive vs. multiplicative

A procedure that alternates between solving an equation in Ω1 and then Ω2 does not seem to be parallel at the highest level because if processor 1 contains all of Ω1 and processor 2 contains all of Ω2 then each processor must wait for the solution of the other processor before it can execute. Figure 8.4 illustrates this procedure. Such approaches are known as multiplicative approaches because of the form of the operator applied to the error. Alternatively, approaches that allow for the solution of subproblems simultaneously are known as additive methods. The latter is illustrated in Figure 8.7. The difference is akin to the difference between Jacobi and Gauss-Seidel.

110 Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004

¢¡¤£¦¥

¢¡¤£¦¥ © ¢¡¤£¦¥¨§

Figure 8.7: Schwarz’ alternating procedure (additive) Preface 111

Overlapping regions: A notational nightmare? When the grids match it is somewhat more convenient to express the discretized PDE as a simple matrix equation on the gridpoints. Unfortunately, we have a notational difficulty at this point. It is this difficulty that is probably the single most important reason that domain decomposition techniques are not used as extensively as they can be. Even in the two domain case, the difficulty is related to the fact that we have domains 1 and 2 that overlap each other and have internal and external boundaries. By setting the boundary to 0 we can eliminate any worry of external boundaries. I believe there is only one reasonable way to keep the notation manageable. We will use subscripts to denote subsets of indices. d1 and d2 will represent those nodes in domain 1 and domain 2 respectively. b1 and b2 will represent those notes in the boundary of 1 and 2 respectively that are not external to the entire domain.

Therefore ud1 denotes the subvector of u consisting of those elements interior to domain 1, while

Au1,b1 is the rectangular subarray of A that map the interior of domain 1 to the internal boundary of domain 1. If we were to write uT as a row vector, the components might break up as follows (the overlap region is unusually large for emphasis:)

 -- d1 b1

- - b2 d2

Correspondingly, the matrix A (which of course would never be written down) has the form

The reader should find Ab1,b1 etc., on this picture. To further simplify notation, we write 1 and 2 for d1 and d2,1b and 2b for b1 and b2, and also use only a single index for a diagonal block of a matrix (i.e. A1 = A11). Now that we have leisurely explained our notation, we may return to the algebra. Numerical analysts like to turn problems that may seem new into ones that they are already familiar with. By carefully writing down the equations for the procedure that we have described so far, it is possible to relate the classical domain decomposition method to an iteration known as Richardson iteration. Richardson iteration solves Au = f by computing uk+1 = uk + M(f Auk), where M is a “good” 1 1 − approximation to A− . (Notice that if M = A− , the iteration converges in one step.) 112 Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004

Ω1

I Ω2

Problem Domain

Figure 8.8: Problem Domain

The iteration that we described before may be written algebraically as

k+1/2 k A1u1 + A1,1b u1b = f1

k+1 k+1/2 A2u2 + A2,2b u2b = f2 Notice that values of uk+1/2 updated by the first equation, specifically the values on the boundary of the second region, are used in the second equation. With a few algebraic manipulations, we have

k+1/2 k 1 1 k 1 u = u − + A− (f Au − ) 1 1 1 − 1 k+1 k+1/2 1 k+1/2 u = u + A− (f Au ) 2 2 2 − 2 This was already obviously a Gauss-Seidel like procedure, but those of you familiar with the alge- braic form of Gauss-Seidel might be relieved to see the form here. A roughly equivalent block Jacobi method has the form

k+1/2 k 1 1 k 1 u = u − + A− (f Au − ) 1 1 1 − 1 k k+1/2 1 k u = u + A− (f Au ) 2 2 2 − 2 It is possible to eliminate uk+1/2 and obtain

k+1 k 1 1 k u = u + (A− + A− )(f Au ), 1 2 − where the operators are understood to apply to the appropriate part of the vectors. It is here that 1 1 we see that the procedure we described is a Richardson iteration with operator M = A1− + A2− .

8.2.3 Substructuring Approaches Figure 8.8 shows an example domain of a problem for a network of resistors or a discretized region in which we wish to solve the Poisson equation, 2v = g. We will see that the discrete version of 5 the Steklov-Poincar´e operator has its algebraic equivalent in the form of the Schur complement. Preface 113

In matrix notation, Av = g, where

A1 0 A1I A =  0 A2 A2I  AI1 AI2 AI     One of the direct methods to solve the above equation is to use LU or LDU factorization. We will do an analogous procedure with blocks. We can rewrite A as,

I 0 0 I 0 0 A1 0 A1I A =  0 I 0   0 I 0   0 A2 A2I  1 1 AI1A− AI2A− I 0 0 S 0 0 I  1 2            where, 1 1 S = A A A− A A A− A I − I1 1 1I − I2 2 2I 1 We really want A−

1 1 A1− 0 A1− A1I I 0 0 I 0 0 1 1 − 1 A− =  0 A2− A2− A2I   0 I 0   0 I 0  (8.1) − 1 1 1 0 0 I 0 0 S− AI1A− AI2A− I      − 1 − 2        Inverting S turns out to be the hardest part.

VΩ1 V oltages in region Ω1 1 → A− V V oltages in region Ω  Ω2  → 2 VInterface V oltages at interface   →   Let us examine Equation 8.1 in detail.

In the third matrix, 1 A1− - Poisson solve in Ω1 AI1 - is putting the solution onto the interface 1 A2− - Poisson solve in Ω2 AI2 - is putting the solution onto the interface

In the second matrix, Nothing happening in domain 1 and 2 Complicated stuff at the interface.

In the first matrix we have, 1 A1− - Poisson solve in Ω1 1 A2− - Poisson solve in Ω2 1 1 A1− A1I and A2− A1I - Transferring solution to interfaces

In the above example we had a simple 2D region with neat squares but in reality we might have to solve on complicated 3D regions which have to be divided into tetrahedra with 2D regions at the interfaces. The above concepts still hold. 114 Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004

1 Getting to S− ,

a b 1 0 a b = c d c/a 1 0 d bc/a ! ! − ! where, d bc/a is the Schur complement of d. − In Block form A B 1 0 A B = C D CA 1 1 0 D CA 1B ! − ! − − ! We have 1 1 S = A A A− A A A− A I − I1 1 1I − I2 2 2I Arbitrarily break AI as 1 2 AI = AI + AI Think of A as

A1 0 A1I 0 0 0  0 0 0  +  0 A2 A2I  1 2 AI1 0 A 0 AI2 A  I   I      Schur Complements are

1 1 1 S = A A A− A I − I1 1 1I

2 2 1 S = A A A− A I − I2 2 2I and S = S1 + S2 1 A1− Poisson solve on Ω1 1 → A− Poisson solve on Ω 2 → 2 A Ω I I1 1 → A Ω I 21 2 → A I Ω 1I → 1 A2I I Ω2 Sv - Multiplying→ by the Schur Complement involves 2 Poisson solves and some cheap transfer- ring. 1 1 1 S− v should be solved using Krylov methods. People have recommended the use of S1− or S2− 1 1 or (S1− + S2− ) as a preconditioner

8.2.4 Accellerants Domain Decomposition as a Preconditioner It seems wasteful to solve subproblems extremely accurately during the early stages of the algorithm when the boundary data is likely to be fairly inaccurate. Therefore it makes sense to run a few steps of an iterative solver as a preconditioner for the solution to the entire problem. In a modern approach to the solution of the entire problem, a step or two of block Jacobi would be used as a preconditioner in a Krylov based scheme. It is important at this point not to Preface 115 lose track what operations may take place at each level. To solve the subdomain problems, one might use multigrid, FFT, or preconditioned conjugate gradient, but one may choose to do this approximately during the early iterations. The solution of the subdomain problems itself may serve as a preconditioner to the solution of the global problem which may be solved using some Krylov based scheme. The modern approach is to use a step of block Jacobi or block Gauss-Seidel as a preconditioner for use in a Krylov space based subsolver. There is not too much point in solving the subproblems exactly on the smaller domains (since the boundary data is wrong) just an approximate solution suffices domain decomposition preconditioning → Krylov Methods - Methods to solve linear systems : Au=g . Examples have names such as the Conjugate Gradient Method, GMRES (Generalized Minimum Residual), BCG (Bi Conju- gate Gradient), QMR (Quasi Minimum Residual), CGS (Conjugate Gradient Squared). For this lecture, one can think of these methods in terms of a black-box. What is needed is a subroutine that given u computes Au. This is a matrix-vector multiply in the abstract sense, but of course it is not a dense matrix-vector product in the sense one practices in undergraduate linear algebra. The other needed ingredient is a subroutine to approximately solve the system. This is known as a preconditioner. To be useful this subroutine must roughly solve the problem quickly.

Course (Hierarchical/Multilevel) Techniques

These modern approaches are designed to greatly speed convergence by solving the problem on dif- ferent sized grids with the goal of communicating information between subdomains more efficiently. Here the “domain” is a course grid. Mathematically, it is as easy to consider a contiguous domain consisting of neighboring points, as it is is to consider a course grid covering the whole region. Up until now, we saw that subdividing a problem did not directly yield the final answer, rather it simplified or allowed us to change our approach in tackling the resulting subproblems with existing methods. It still required that individual subregions be composited at each level of refinement to establish valid conditions at the interface of shared boundaries. Multilevel approaches solve the problem using a coarse grid over each sub-region, gradually accommodating higher resolution grids as results on shared boundaries become available. Ideally for a well balanced multi-level method, no more work is performed at each level of the hierarchy than is appropriate for the accuracy at hand. In general a hierarchical or multi-level method is built from an understanding of the difference between the damping of low frequency and high components of the error. Roughly speaking one can kill of low frequency components of the error on the course grid, and higher frequency errors on the fine grid. Perhaps this is akin to the Fast Multipole Method where p poles that are “well-separated” from a given point could be considered as clusters, and those nearby are evaluated more precisely on a finer grid.

8.3 Theoretical Issues

This section is not yet written. The rough content is the mathematical formulation that identifies subdomains with projection operators. 116 Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004

Figure 8.9: MIT domain

8.4 A Domain Decomposition Assignment: Decomposing MIT

Perhaps we have given you the impression that entirely new codes must be written for parallel computers, and furthermore that parallel algorithms only work well on regular grids. We now show you that this is not so. You are about to solve Poisson’s equation on our MIT domain: Notice that the letters MIT have been decomposed into 32 rectangles – this is just the right number for solving ∂2u ∂2u + = ρ(x, y) dx2 dy2 on a 32 processor machine. To solve the Poisson equation on the individual rectangles, we will use a FISHPACK library routine. (I know the French would cringe, but there really is a library called FISHPACK for solving the Poisson equation.) The code is old enough (from the 70’s) but in fact it is too often used to really call it dusty. As a side point, this exercise highlights the ease of grabbing kernel routines off the network these days. High quality numerical software is out there (bad stuff too). One good way to find it is via the World Wide Web, at http://www.netlib.org. The software you will need for this problem is found at http://www.netlib.org/netlib/fishpack/hwscrt.f. All of the rectangles on the MIT picture have sides in the ratio 2 to 1; some are horizontal while others are vertical. We have arbitrarily numbered the rectangles accoding to scheme below, you might wish to write the numbers in the picture on the first page. 4 10 21 21 22 22 23 24 24 25 26 26 27 4 5 9 10 20 23 25 27 3 5 6 8 9 11 20 28 3 6 7 8 11 19 28 2 7 12 19 29 2 12 18 29 1 13 18 30 1 13 17 30 0 14 17 31 0 14 15 15 16 16 31 Preface 117

In our file neighbor.data which you can take from ~edelman/summer94/friday we have encoded information about neighbors and connections. You will see numbers such as

1 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0

This contains information about the 0th rectangle. The first line says that it has a neighbor 1. The 4 means that the neighbor meets the rectangle on top. (1 would be the bottom, 6 would be the lower right.) We starred out a few entries towards the bottom. Figure out what they should be. In the actual code (solver.f), a few lines were question marked out for the message passing. Figure out how the code works and fill in the appropriate lines. The program may be compiled with the makefile. 118 Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004 Lecture 9

Particle Methods

9.1 Reduce and Broadcast: A function viewpoint

[This section is being rewritten with what we hope will be the world’s clearest explanation of the fast multipole algorithm. Readers are welcome to take a quick look at this section, or pass to the next section which leads up to the multipole algorithm through the particle method viewpoint] Imagine we have P processors, and P functions f1(z), f2(z), . . . , fP (z), one per processor. Our goal is for every processor to know the sum of the functions f(z) = f1(z) + . . . + fP (z). Really this is no different from the reduce and broadcast situation given in the introduction. As a practical question, how can functions be represented on the computer? Probably we should think of Taylor series or multipole expansion. If all the Taylor series or multipole expansions are centered at the same point, then the function reduction is easy. Simply reduce the corresponding coefficients. If the pairwise sum consists of functions represented using different centers, then a common center must be found and the functions must be transformed to that center before a common sum may be found.

Example: Reducing Polynomials Imagine that processor i contains the polynomial f (z) = (z i)3. The coefficients may be expanded out as f (z) = a +a z +a z2 +a z3. i − i 0 1 2 3 Each processor i contains a vector (a0, a1, a2, a3). The sum of the vectors may be obtained by a usual reduce algorithm on vectors. An alternative that may seem like too much trouble at first is that every time we make a pairwise sum we shift to a common midpoint (see Figure 9.1).

There is another complication that occurs when we form pairwise sums of functions. If the expansions are multipole or Taylor expansions, we may shift to a new center that is outside the region of convergence. The coefficients may then be meaningless. Numerically, even if we shift towards the boundary of a region of convergence, we may well lose accuracy, especially since most computations choose to fix the number of terms in the expansion to keep.

Difficulties with shifting multipole or Taylor Expansions

The fast multipole algorithm accounts for these difficulties in a fairly simple manner. Instead of computing the sum of the functions all the way up the tree and then broadcasting back, it saves the intermediate partial sums summing them in only when appropriate. The figure below indicates when this is appropriate.

119

120 Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004

¡ ¢ £ ¦ § ¨ ©

()a,a,a,a ()a,a,a,a ¥ f ¤ (z) f (z)

Figure 9.1: Pairwise Sum

9.2 Particle Methods: An Application

Imagine we want to model the basic mechanics of our solar system. We would probably start with the sun, somehow representing its mass, velocity, and position. We might then add each of the nine planets in turn, recording their own masses, velocities, and positions at a point in time. Let’s say we add in a couple of hundred of the larger asteroids, and a few of our favorite comets. Now we set the system in motion. Perhaps we would like to know where Pluto will be in a hundred years, or whether a comet will hit us soon. To solve Newton’s equations directly with more than even two bodies is intractably difficult. Instead we decide to model the system using discrete time intervals, and computing at each time interval the force that each body exerts on each other, and changing the velocities of the bodies accordingly. This is an example of an N-body problem. To solve the problem in a simple way requires O(n2) time for each time step. With some considerable effort, we can reduce this to O(n) (using the fast multipole algorithm to be described below). A relatively simple algorithm the Barnes-Hut Algorithm, (to be described below) can compute movement in O(n log(n)) time.

9.3 Outline

Formulation and applications • “The easiest part”: the Euler method to move bodies. • Direct methods for force computation. • Hierarchical methods (Barnes-Hut, Appel, Greengard and Rohklin) • 9.4 What is N-Body Simulation?

We take n bodies (or particles) with state describing the initial position ~x , ~x , . . . , ~x k and 1 2 n ∈ < initial velocities ~v ,~v , . . . ,~v k. 1 2 n ∈ < We want to simulate the evolution of such a system, i.e., to compute the trajectories of each body, under an interactive force: the force exerted on each body by the whole system at a given Preface 121

Compute Force of Interaction

Update configuration

Collect statistical information

Figure 9.2: Basic Algorithm of N-body Simulation point. For different applications we will have different interaction forces, such as gravitational or Coulombic forces. We could even use these methods to model spring systems, although the advanced methods, which assume forces decreasing with distance, do not work under these conditions.

9.5 Examples

Astrophysics: The bodies are stars or galaxies, depending on the scale of the simulation. • The interactive force is gravity.

Plasma Physics: The basic particles are ions, electrons, etc; the force is Coulombic. • Molecular Dynamics: Particles are atoms or clusters of atoms; the force is electrostatic. • Fluid Dynamics: Vortex method where particle are fluid elements (fluid blobs). • Typically, we call this class of simulation methods, the particle methods. In such simulations, it is important that we choose both spatial and temporal scales carefully, in order to minimize running time and maximize accuracy. If we choose a time scale too large, we can lose accuracy in the simulation, and if we choose one too small, the simulations will take too long to run. A simulation of the planets of the solar system will need a much larger timescale than a model of charged ions. Similarly, spatial scale should be chosen to minimize running time and maximize accuracy. For example, in applications in fluid dynamics, molecular level simulations are simply too slow to get useful results in a reasonable period of time. Therefore, researchers use the vortex method where bodies represent large aggregates of smaller particles. Hockney and Eastwood’s book Computer Simulations Using Particles,, McGraw Hill (1981), explores the applications of particle methods applications, although it is somewhat out of date.

9.6 The Basic Algorithm

Figure 9.2 illustrates the key steps in n-body simulation. The step of collecting statistical infor- mation is application dependent, and some of the information gathered at this step may be used during the next time interval. 122 Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004

We will use gravitational forces as an example to present N-body simulation algorithms. Assume there are n bodies with masses m , m , . . . , m , respectively, initially located at x~ , ..., x~ 3 with 1 2 n 1 n ∈ < velocity v~1, ..., v~n. The gravitational force exert on the ith body by the jth body is given by

m m m m F~ = G i j = G i j (x~ x~ ), ij r2 x~ x~ 3 j − i | j − i| where G is the gravitational constant. Thus the total force on the ith body is the vector sum of all these forces and is give by,

F~i = F~ij. j=i X6

Let a~i = dv~i/dt be the acceleration of the body i, where where v~i = dx~i/dt. By Newton’s second law of motion, we have F~i = mia~i = midv~i/dt. In practice, we often find that using a potential function V = φm will reduce the labor of the calculation. First, we need to compute the potential due to the N-body system position, i.e. x1, . . . , xn, at positions y1, . . . , yn. The total potential is calculated as

n Vi = φ(xi yj)mj 1 i, j n, i,j=1;i=j − ≤ ≤ X6 where φ is the potential due to gravity. This can also be written in the matrix form:

0 . . . (x y ) i − j V =  ......  φ(xj yi) . . . 0  −    In 3, < 1 φ(x) = x k k . In 2, < φ(x) = log x k k . The update of particle velocities and positions are in three steps:

1. F = π m; · 2. V = V + ∆t F ; new old · m 3. x = x + ∆t V . new old · new The first step is the most expensive part in terms of computational time. Preface 123

9.6.1 Finite Difference and the Euler Method In general, the force calculation is the most expensive step for N-body simulations. We will present several algorithms for this later on, but first assume we have already calculated the force F~i act- ing one each body. We can use a numerical method (such as the Euler method) to update the configuration. To simulate the evolution of an N-body system, we decompose the time interval into discretized time steps: t0, t1, t2, t3, .... For uniform discretizations, we choose a ∆t and let t0 = 0 and tk = k∆t. The Euler method approximates the derivative by finite difference. v~ (t ) v~ (t ∆t) a~ (t ) = F~ /m = i k − i k − i k i i ∆t x~ (t + ∆t) x~ (t ) v~ (t ) = i k − i k , i k ∆t where 1 i n. Therefore, ≤ ≤ v~i(tk) = v~i(tk 1) + ∆t(F~i/mi) (9.1) − x~i(tk+1) = x~i(tk) + ∆tv~i(tk). (9.2) From the given initial configuration, we can derive the next time step configuration using the formulae by first finding the force, from which we can derive velocity, and then position, and then force at the next time step.

F~ v (t ) x (t + ∆t) F~ . i → i k → i k → i+1 High order numerical methods can be used here to improve the simulation. In fact, the Euler method that uses uniform time step discretization performs poorly during the simulation when two bodies are very close. We may need to use non-uniform discretization or a sophisticated time scale that may vary for different regions of the N-body system. In one region of our simulation, for instance, there might be an area where there are few bodies, and each is moving slowly. The positions and velocities of these bodies, then, do not need to be sampled as frequently as in other, higher activity areas, and can be determined by extrapolation. See figure 9.3 for illustration.1 How many floating point operations (flops) does each step of the Euler method take? The velocity update (step 1) takes 2n floating point multiplications and one addition and the position updating (step 2) takes 1 multiplication and one addition. Thus, each Euler step takes 5n floating point operations. In Big-O notation, this is an O(n) time calculation with a constant factor 5. Notice also, each Euler step can be parallelized without communication overhead. In data parallel style, we can express steps (1) and (2), respectively, as

V = V + ∆t(F/M) X = X + ∆tV, where V is the velocity array; X is the position array; F is the force array; and M is the mass array. V, X, F, M are 3 n arrays with each column corresponding to a particle. The operator / is × the elementwise division. 1In figure 9.3 we see an example where we have some close clusters of bodies, and several relatively disconnected bodies. For the purposes of the simulation, we can ignore the movement of relatively isolated bodies for short periods of time and calculate more frames of the proximous bodies. This saves computation time and grants the simulation more accuracy where it is most needed. In many ways these sampling techniques are a temporal analogue of the later discussed Barnes and Hut and Multipole methods. 124 Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004

Low sampling rate

Medium sampling rate

High sampling rate

Figure 9.3: Adaptive Sampling Based on Proximity

9.7 Methods for Force Calculation

Computationally, the force calculation is the most time expensive step for N-body simulation. We now discuss some methods for computing forces.

9.7.1 Direct force calculation The simplest way is to calculate the force directly from the definition.

m m m m F~ = G i j = G i j (x~ x~ ), ij r2 x~ x~ 3 j − i | j − i| Note that the step for computing F~ takes 9 flops. It takes n flops to add F~ (1 j n). ij ij ≤ ≤ Since F~ = F~ , the total number of flops needed is roughly 5n2. In Big-O notation, this is an ij − ji O(n2) time computation. For large scale simulation (e.g., n = 100 million), the direct method is impractical with today’s level of computing power. It is clear, then, that we need more efficient algorithms. The one fact that we have to take advantage of is that in a large system, the effects of individual distant particles on each other may be insignificant and we may be able to disregard them without significant loss of accuracy. Instead we will cluster these particles, and deal with them as though they were one mass. Thus, in order to gain efficiency, we will approximate in space as we did in time by discretizing.

9.7.2 Potential based calculation For N-body simulations, sometimes it is easier to work with the (gravitational) potential rather than with the force directly. The force can then be calculated as the gradient of the potential. In three dimensions, the gravitational potential at position ~x defined by n bodies with masses m1, ..., mn at position x~1, ...., x~n, respectively is equal to

n m Φ(~x) = G i . ~x x~i Xi=1 || − || The force acting on a body with unit mass at position ~x is given by the gradient of Φ, i.e.,

F = Φ(x). −∇ Preface 125

The potential function is a sum of local potential functions

n

φ(~x) = φ~xi (~x) (9.3) Xi=1 where the local potential functions are given by G m φ (~x) = ∗ i in 3 (9.4) ~xi ~x ~x < || − i|| 9.7.3 Poisson Methods The earlier method from 70s is to use Poisson solver. We work with the gravitational potential field rather than the force field. The observation is that the potential field can be expressed as the solution of a Poisson equation and the force field is the gradient of the potential field. The gravitational potential at position ~x defined by n bodies with masses m1, ..., mn at position x~1, ...., x~n, respectively is equal to n m Φ(~x) = G i . ~x x~i Xi=1 | − | The force acting on a body with unit mass at position ~x is given by the gradient of Φ:

F~ = Φ(~x). −∇ So, from Φ we can calculate the force field (by numerical approximation). The potential field Φ satisfies a Poisson equation:

∂2Φ ∂2Φ ∂2Φ 2Φ = + + = ρ(x, y, z), ∇ ∂x2 ∂y2 ∂z2 where ρ measures the mass distribution can be determined by the configuration of the N-body system. (The function Φ is harmonic away from the bodies and near the bodies, div Φ = 2Φ is ∇ ∇ determined by the mass distribution function. So ρ = 0 away from bodies). We can use finite difference methods to solve this type of partial differential equations. In three dimensions, we discretize the domain by a structured grid. We approximate the Laplace operator 2 by finite difference and obtain from 2Φ = ρ(x, y, z) ∇ ∇ a system of linear equations. Let h denote the grid spacing. We have 1 Φ(x , y , z ) = (Φ(x + h, y , z ) + Φ(x h, y , z ) + Φ(x , y + h, z ) i j k h2 i j k i − j k i j k +Φ(x , y h, z ) + Φ(x , y , z + h) + Φ(x , y , z h) 6φ(x , y , z )) i j − k i j k i j k − − i j k = ρ(xi, yj, zk).

The resulting linear system is of size equal to the number of grid points chosen. This can be solved using methods such as FFT (fast Fourier transform), SOR (successive overrelaxation), multigrid methods or conjugate gradient. If n bodies give a relatively uniform distribution, then we can use a grid which has about n grid points. The solution can be fairly efficient, especially on parallel machines. For highly non-uniform set of bodies, hybrid methods such as finding the potential induced by bodies within near distance by direct method, and approximate the potential field induced by distant bodies by the solution of a much smaller Poisson equation discretization. More details of these methods can be found in Hockney and Eastwood’s book. 126 Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004

m particles n particles

r 1 r r2

Figure 9.4: Well-separated Clusters

9.7.4 Hierarchical methods We now discuss several methods which use a hierarchical structure to decompose bodies into clus- ters. Then the force field is approximated by computing the interaction between bodies and clusters and/or between clusters and clusters. We will refer this class of methods hierarchical methods or tree-code methods. The crux of hierarchical N-body methods is to decompose the potential at a point x, φ(x), into the sum of two potentials: φN (x), the potential induced by “neighboring” or “near-field” particles; and φF (x), the potential due to “far-field” particles [5, 44]. In hierarchical methods, φN (x) is computed exactly, while φF (x) is computed approximately. The approximation is based on a notion of well-separated clusters [5, 44]. Suppose we have two clusters A and B, one of m particles and one of n particles, the centers of which are separated by a distance r. See Figure 9.4. Suppose we want to find the force acting on all bodies in A by those in B and vice versa. A direct force calculation requires O(mn) operations, because for each body in A we need to compute the force induced by every body in B. Notice that if r is much larger than both r1 and r2, then we can simplify the calculation tremendously by replacing B by a larger body at its center of mass and replacing A by a larger body at its center of mass. Let MA and MB be the total mass of A and B, respectively. The center of mass cA and cB is given by

i A mixi cA = ∈ P MA j B mjxj cB = ∈ . P MB

We can approximate the force induced by bodies in B on a body of mass mx located at position s by viewing B as a single mass MB at location cB. That is, Gm M (x c ) F (x) x B − B . ≈ x c 3 || − B|| Such approximation is second order: The relative error introduced by using center of mass is 2 bounded by (max(r1, r2)/r) . In other words, if f(x) be the true force vector acting on a body at Preface 127

P

Pl Pr

Figure 9.5: Binary Tree (subdivision of a straight line segment) x, then max(r , r ) 2 F (x) = f(x) 1 + O 1 2 . r   !! This way, we can find all the interaction forces between A and B in O(n + m) time. The force calculations between one m particle will computed separately using a recursive construction. This observation gives birth the idea of hierarchical methods. We can also describe the method in terms of potentials. If r is much larger than both r1 and r2, i.e., A and B are “well-separated”, then we can use the pth order multipole expansion (to be given p later) to express the pth order approximation of potential due to all particles in B. Let ΦB(x) denote such a multipole expansion. To (approximately) compute the potential at particles in A, p p we simply evaluate ΦB() at each particle in A. Suppose ΦB() has g(p, d) terms. Using multipole expansion, we reduce the number of operations to g(p, d)( A + B ). The error of the multipole- | | | | expansion depends on p and the ratio max(r1, r2)/r. We say A and B are β-well-separated, for a β > 2, if max(r , r )/r 1/β. As shown in [44], the error of the pth order multipole expansion is 1 2 ≤ bounded by (1/(β 1))p. −

9.8 Quadtree (2D) and Octtree (3D) : Data Structures for Canon- ical Clustering

Hierarchical N-body methods use quadtree (for 2D) and octtree (for 3D) to generate a canonical set of boxes to define clusters. The number of boxes is typically linear in the number of particles, i.e., O(n). Quadtrees and octtrees provide a way of hierarchically decomposing two dimensional and three dimensional space. Consider first the one dimensional example of a straight line segment. One way to introduce clusters is to recursively divide the line as shown in Figure 9.5. This results in a binary tree2. In two dimensions, a box decomposition is used to partition the space (Figure 9.6). Note that a box may be regarded as a “product” of two intervals. Each partition has at most one particle in it.

2A tree is a graph with a single root node and a number of subordinate nodes called leaves or children. In a binary tree, every node has at most two children. 128 Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004

Figure 9.6: Quadtree

Figure 9.7: Octtree

A quadtree [88] is a recursive partition of a region of the plane into axis-aligned squares. One square, the root, covers the entire set of particles. It is often chosen to be the smallest (up to a constant factor) square that contains all particles. A square can be divided into four child squares, by splitting it with horizontal and vertical line segments through its center. The collection of squares then forms a tree, with smaller squares at lower levels of the tree. The recursive decomposition is often adaptive to the local geometry. The most commonly used termination condition is: the division stops when a box contains less than some constant (typically m = 100) number of particles (See Figure 9.6). In 2D case, the height of the tree is usually log2√N. This is in the order of . The complexity of the problem is N O(log(N)). · Octtree is the three-dimension version of quadtree. The root is a box covering the entire set of particles. Octtree are constructed by recursively and adaptively dividing a box into eight child- boxes, by splitting it with hyperplanes normal to each axes through its center (See Figure 9.7).

9.9 Barnes-Hut Method (1986)

The Barnes-Hut method uses these clustered data structures to represent the bodies in the simu- lation, and takes advantage of the distant-body simplification mentioned earlier to reduce compu- tational complexity to O(n log(n)). The method of Barnes and Hut has two steps.

1. Upward evaluation of center of mass Preface 129

m m 1 4

c c 1 4

m m 2 3

c c 2 3

Figure 9.8: Computing the new Center of Mass

Refer to Figure 9.6 for the two dimensional case. Treating each box as a uniform cluster, the center of mass may be hierarchically computed. For example, consider the four boxes shown in Figure 9.8. The total mass of the system is

m = m1 + m2 + m3 + m4 (9.5) and the center of mass is given by m c~ + m c~ + m c~ + m c~ ~c = 1 1 2 2 3 3 4 4 (9.6) m The total time required to compute the centers of mass at all layers of the quadtree is pro- portional to the number of nodes, or the number of bodies, whichever is greater, or in Big-O notation, O(n + v), where v is for vertex This result is readily extendible to the three dimensional case. Using this approximation will lose some accuracy. For instance, in 1D case, consider three particles locate at x = 1, 0, 1 with strength m = 1. Consider these three particles as a − cluster, the total potential is 1 1 1 V (x) = + + . x x 1 x + 1 − Expand the above equation using Taylor’s series, 3 2 2 V (x) = + + + . . . . x x3 x5 . It is seen that high order terms are neglected. This brings the accuracy down when x is close to the origin. 2. Pushing the particle down the tree Consider the case of the octtree i.e. the three dimensional case. In order to evaluate the potential at a point ~xi, start at the top of the tree and move downwards. At each node, check whether the corresponding box, b, is well separated with respect to ~xi (Figure 9.9).

Let the force at point ~xi due to the cluster b be denoted by F~ (i, b). This force may be calculated using the following algorithm: 130 Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004

_ b x

Figure 9.9: Pushing the particle down the tree

if b is “far” i.e. well separated from ~x , then • i Gm M (~x c~ ) F~ (~x ) := F~ (~x ) + x b − b in 3 (9.7) i i ~x c~ 3 < || i − b|| else if b is “close” to ~x • i for k = 1 to 8

F~ (~xi) = F~ (~xi) + F~ (i, child(b, k)) (9.8) (9.9)

The computational complexity of pushing the particle down the tree has the upper bound 9hn, where h is the height of the tree and n is the number of particles. (Typically, for more or less uniformly distributed particles, h = log4 n.)

9.9.1 Approximating potentials We now rephrase Barnes and Hut scheme in term of potentials. Let

mA = total mass of particles in A mB = total mass of particles in B c~A = center of mass of particles in A c~B = center of mass of particles in B The potential at a point ~x due to the cluster B, for example, is given by the following second order approximation: m 1 φ(~x) B (1 + ) in 3 (9.10) ≈ ~x c~ δ2 < || − B|| In other words, each cluster may be regarded as an individual particle when the cluster is sufficiently far away from the evaluation point ~x. A more advanced idea is to keep track of a higher order (Taylor expansion) approximation of the potential function induced by a cluster. Such an idea provides better tradeoff between time required and numerical precision. The following sections provide the two dimensional version of the fast multipole method developed by Greengard and Rokhlin. Preface 131

The Barnes-Hut method discussed above uses the particle-cluster interaction between two well- separated clusters. Greengard and Rokhlin showed that the cluster-cluster intersection among well-separated clusters can further improve the hierarchical method. Suppose we have k clusters p B1 ..., Bk that are well-separated from a cluster A. Let Φi () be the pth order multipole expansion of Bi. Using particle-cluster interaction to approximate the far-field potential at A, we need to perform g(p, d) A ( B1 + B2 + ... + Bk ) operations. Greengard and Rokhlin [44] showed that p | | | | | | | | p from Φi () we can efficiently compute a Taylor expansion Ψi () centered at the centroid of A that p p p approximates Φi (). Such an operation of transforming Φi () to Ψi () is called a FLIP. The cluster- p p p k p p cluster interaction first flips Φi () to Ψi (); we then compute ΨA() = i=1 Ψi () and use ΨA() to evaluate the potential at each particle in A. This reduces the number of operations to the order of P g(p, d)( A + B + B + ... + B ). | | | 1| | 2| | k| 9.10 Outline

Introduction • Multipole Algorithm: An Overview • Multipole Expansion • Taylor Expansion • Operation No. 1 — SHIFT • Operation No. 2 — FLIP • Application on Quad Tree • Expansion from 2-D to 3-D • 9.11 Introduction

For N-body simulations, sometimes, it is easier to work with the (gravitational) potential rather than with the force directly. The force can then be calculated as the gradient of the potential. In two dimensions, the potential function at zj due to the other bodies is given by

n φ(zj) = qi log(zj zi) i=1,i=j − X6 n

= φzi (zj) i=1,i=j X6 with φ (z) = q log z z zi i | − i| where z1, . . ., zn the position of particles, and q1, . . ., qn the strength of particles. The potential due to the bodies in the rest of the space is

n φ(z) = q log(z z ) i − i Xi=1 132 Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004

Z Faraway Evaluation Z1 Point Z4

Z2 Zc

Zn

Z3

Cluster of Bodies

Figure 9.10: Potential of Faraway Particle due to Cluster which is singular at each potential body. (Note: actually the potential is Re φ(z) but we take the complex version for simplicity.) With the Barnes and Hut scheme in term of potentials, each cluster may be regarded as an individual particle when the cluster is sufficiently far away from the evaluation point. The following sections will provide the details of the fast multipole algorithm developed by Greengard and Rokhlin. Many people are often mystified why the Green’s function is a logarithm in two dimensions, while it is 1/r in three dimensions. Actually there is an intuitive explanation. In d dimensions d 1 the Green’s function is the integral of the force which is proportional 1/r − . To understand the d 1 1/r − just think that the lines of force are divided “equally” on the sphere of radius r. One might wish to imagine an d dimensional ball with small holes on the boundary filled with d dimensional water. A hose placed at the center will force water to flow out radially at the boundary in a uniform manner. If you prefer, you can imagine 1 ohm resistors arranged in a polar coordinate manner, perhaps with higher density as you move out in the radial direction. Consider the flow of current out of the circle at radius r if there is one input current source at the center.

9.12 Multipole Algorithm: An Overview

There are three important concepts in the multipole algorithm:

function representations (multipole expansions and Taylor series) • operators to change representations (SHIFTs and FLIPs) • the general tree structure of the computation • 9.13 Multipole Expansion

The multipole algorithm flips between two point of views, or to be more precise, two representations for the potential function. One of them, which considers the cluster of bodies corresponding to many far away evaluation points, is treated in detail here. This part of the algorithm is often called the Multipole Expansion. In elementary calculus, one learns about Taylor expansions for functions. This power series represents the function perfectly within the radius of convergence. A multipole expansion is also Preface 133 a perfectly valid representation of a function which typically converges outside a circle rather than inside. For example, it is easy to show that

φzi (z) = qi log(z zi) − k ∞ qi zi zc = qi log(z zc) + − − − k z zc kX=1  −  where z is any complex number. This series converges in the region z z > z z , i.e., outside c | − c| | − i| of the circle containing the singularity. The formula is particularly useful if z z z z , i.e., | − c|  | − i| if we are far away from the singularity. Note that

φ (z) = q log(z z ) zi i − i = q log[(z z ) (z z )] i − c − i − c z z = q log(z z ) + log(1 i − c ) i − c − z z  − c  The result follows from the Taylor series expansion for log(1 x). The more terms in the Taylor − series that are considered, the higher is the order of the approximation. By substituting the single potential expansion back into the main equation, we obtain the multipole expansion as following

n

φ(z) = φzi (z) Xi=1 n n k ∞ 1 zi zc = qi log(z zc) + qi − − −k z zc ! Xi=1 Xi=1 kX=1  −  k ∞ 1 = Q log(z zc) + ak − z zc kX=1  −  where n q (z z )k a = i i − c k − k Xi=1 When we truncate the expansion due to the consideration of computation cost, an error is introduced into the resulting potential. Consider a p-term expansion

p 1 φp(z) = Q log(z zc) + ak k − (z zc) kX=1 − An error bound for this approximation is given by A r p φ(z) φp(z) z zc || − || ≤ −r 1 z zc − −  where r is the radius of the cluster and n A = q | i| Xi=1 134 Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004

q1

Z1 q2

Z3 ZC Faraway

q3 Z2

Z4

Cluster of Evaluation Points

Figure 9.11: Potential of Particle Cluster

This result can be shown as the following

∞ 1 Error = a k (z z )k k=p+1 c X − n k ∞ q z z = i i − c − k z z k=p+1 i=1  c  X X − n k ∞ r q ≤ | i| z z k=p+1 i=1 c X X − k ∞ r A ≤ z z k=p+1 c X − p+1 r z zc A − ≤ r 1 z zc − − p A r z zc ≤ −r 1 z zc − − 

At this moment, we are able to calculate the potential of each particle due to cluster of far away bodies, through multipole expansion.

9.14 Taylor Expansion

In this section, we will briefly discuss the other point of view for the multipole algorithm, which considers the cluster of evaluation points with respect to many far away bodies. It is called Taylor Expansion. For this expansion, each processor ”ownes” the region of the space defined by the cluster of evaluation points, and compute the potential of the cluster through a Taylor series about the center of the cluster zc. Generally, the local Taylor expansion for cluster denoted by C (with center zc) corresponding Preface 135 to some body z has the form

∞ φ (z) = b (z z )k C,zc k − c kX=0

Denote z z = (z z ) (z z ) = (z z )(1 ξ). Then for z such that z z < min(z , C), − i − c − i − c − i − c − | − c| c we have ξ < 1 and the series φ (z) converge: | | C,zc

φ (z) = q log( (z z )) + q log(1 ξ) C,zc i − i − c i − XC XC ∞ 1 k k = qi log( (zi zc)) + qi k− (zi zc)− (z zc) − − ! − − XC kX=1 XC ∞ = b + b (z z )k 0 k − c kX=1 where formulæ for coefficients are

1 k b = q log( (z z )) and b = k− q (z z )− k > 0. 0 i − i − c k i i − c XC XC

p Define the p-order truncation of local expansion φC,zc as follows

p φp (z) = b (z z )k. C,zc k − c kX=0 We have error bound

k p ∞ 1 z zc φ c (z) φ (z) = k− q − C,z − C,zc i z z k=p+1 C  i c  X X − k 1 ∞ z zc A q − = cp+1, ≤ p + 1 | i| min(z , C) (1 + p)(1 c) C k=p+1 c X X −

where A = q and c = z z / min(z , C) < 1. C | i| | − c| c By now,Pwe can also compute the local potential of the cluster through the Taylor expansion. During the process of deriving the above expansions, it is easy to see that

Both expansions are singular at the position of any body; • Multipole expansion is valid outside the cluster under consideration; • Taylor expansion converges within the space defined the cluster. • At this point, we have finished the basic concepts involved in the multipole algorithm. Next, we will begin to consider some of the operations that could be performed on and between the expansions. 136 Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004

ZC-Child1 ZC-Child3

ZC-Parent

ZC-Child2 ZC-Child4

Figure 9.12: SHIFT the Center of Reference

9.15 Operation No.1 — SHIFT

Sometimes, we need to change the location of the center of the reference for the expansion series, either the multipole expansion or the local Taylor expansion. To accomplish this goal, we will perform the SHIFT operation on the expansion series. For the multipole expansion, consider some far away particle with position z such that both series φ and φ , corresponding to different center of reference z and z , converge: z z > z0 z1 0 1 | − 0| max(z , C) and z z > max(z , C). Note that 0 | − 1| 1 z z z z = (z z ) (z z ) = (z z ) 1 0 − 1 = (z z )(1 ξ) − 0 − 1 − 0 − 1 − 1 − z z − 1 −  − 1  for appropriate ξ and if we also assume z sufficiently large to have ξ < 1, we get identity | | k k ∞ l ∞ k + l 1 l (1 ξ)− = ξ = − ξ . − ! l ! Xl=0 Xl=0 Now, we can express the SHIFT operation for multipole expansion as

φ (z) = SHIF T (φ (z), z z ) z1 z0 0 ⇒ 1 ∞ k = SHIF T a0 log(z z0) + ak(z z0)− , z0 z1 − − ⇒ ! kX=1 ∞ k k = a log(z z ) + a log(1 ξ) + a (1 ξ)− (z z )− 0 − 1 0 − k − − 1 kX=1 ∞ 1 k ∞ ∞ k + l 1 l k = a0 log(z z1) a0 k− ξ + ak − ξ (z z1)− − − l ! − kX=1 kX=1 Xl=1 l ∞ l k l 1 1 l l = a0 log(z z1) + ak(z0 z1) − − a0l− (z0 z1) (z z1)− − − k 1! − − ! − Xl=1 kX=1 −

We can represent φz1 (z) as a sequence of its coefficients ak0 :

l l k l 1 1 l a0 = a0 and al0 = ak(z0 z1) − − a0l− (z0 z1) l > 0. − k 1! − − kX=1 − Preface 137

Note that al0 depends only on a0, a1, . . . , al and not on the higher coefficients. It shows that p p given φz0 we can compute φz1 exactly, that is without any further error! In other words, operators SHIFT and truncation commute on multipolar expansions:

SHIF T (φp , z z ) = φp . z0 0 ⇒ 1 z1 Similarly, we can obtain the SHIFT operation for the local Taylor expansion, by extending the operator on the domain of local expansion, so that SHIF T (φ , z z ) produces φ . Both C,z0 0 ⇒ 1 C,z1 series converges for z such that z z < min(z , C), z z < min(z , C). Then | − 0| 0 | − 1| 1 φ (z) = SHIF T (φ (z), z z ) C,z1 C,z0 0 ⇒ 1 ∞ = b ((z z ) (z z ))k k − 1 − 0 − 1 kX=0 ∞ ∞ k l k k l l = bk ( 1) − (z0 z1) − (z z1) − l ! − − kX=0 Xl=0 ∞ ∞ k l k k l l = bk( 1) − (z0 z1) − (z z1) − l ! − ! − Xl=0 Xk=l

Therefore, formula for transformation of coefficients bk of φC,z0 to bl0 of φC,z1 are

∞ k l k k l bl0 = ak( 1) − (z0 z1) − . − l ! − Xk=l

Notice that in this case, bl0 depends only on the higher coefficients, which means knowledge of the coefficients b0, b1, . . . , bp from the truncated local expansion in z0 does not suffice to recover the coefficients b0, b10 , . . . bp0 at another point z1. We do incur an error by the SHIFT operation applied to truncated local expansion:

p p ∞ ∞ k l k l l SHIF T (φ , z0 z1) φ = bk( 1) − (z0 z1) − (z z1) C,z0 ⇒ − C,z1  − −  − l=0 k=p+1 X X  l ∞ ∞ z z b (z z )k − 1 ≤ k 1 − 0 z z k=p+1 l=0  1 0  X X −

k l ∞ 1 z 1 z0 ∞ z z1 = k− qi − − zi z0 z0 z1 k=p+1 C  −  l=0  −  X X X A cp+1, ≤ (p + 1)(1 c)(1 D) − − where A = q . c = z z / min(z , C) and D = z z / z z . C | i| | 1 − 0| 0 | − 1| | 0 − 1| At this moment, we have obtained all the information needed to perform the SHIFT operation P for both multipole expansion and local Taylor expansion. Next, we will consider the operation which can transform multipole expansion to local Taylor expansion.

9.16 Operation No.2 — FLIP

At this section, we will introduce the more powerful operation in multipole algorithm, namely the FLIP operation. For now, we will consider only the transformation in the direction from the 138 Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004

I I I

N N I

C N I

Taylor Expansion Multipole Expansion

FLIP

Figure 9.13: FLIP from Multipole to Taylor Expansion

multipole expansion φz0 (z) to the local Taylor expansion φC,z1 (z), denoted by

F LIP (φ , z z ) = φ z0 0 ⇒ 1 C,z1 For z z > max(z , C) and z z < min(z , C) both series converge. Note that | − 0| 0 | − 1| 1 z z z z = (z z )(1 − 1 ) = (z z )(1 ξ) − 0 − 0 − 1 − z z − 0 − 1 − 0 − 1 and assume also ξ < 1. Then, | |

∞ k φ (z) = a log(z z ) + a (z z )− z0 0 − 0 k − 0 kX=1 ∞ k k k = a log( (z z )) + a log(1 ξ) + a ( 1) (z z )− (1 ξ)− 0 − 0 − 1 0 − k − 0 − 1 − kX=1 ∞ 1 l ∞ k k ∞ k + l 1 l = a0 log( (z0 z1)) + a0l− ξ + ( 1) ak(z0 z1)− − ξ − − − − − l ! Xl=1 kX=1 Xl=0 ∞ k k = a0 log( (z0 z1)) + ( 1) ak(z0 z1)− + − − − − ! kX=1 ∞ 1 l ∞ k k + l 1 (k+l) l a0l− (z0 z1)− + ( 1) ak − (z0 z1)− (z z1) . − − l ! − ! − Xl=1 kX=1

Therefore coefficients ak of φz0 transform to coefficients bl of φC,z1 by the formula

∞ k k b = a log( (z z )) + ( 1) a (z z )− 0 0 − 0 − 1 − k 0 − 1 kX=1 1 l ∞ k k + l 1 (k+l) bl = a0l− (z0 z1)− + ( 1) ak − (z0 z1)− l > 0 − − l ! − kX=1 Note that FLIP does not commute with truncation since one has to know all coefficients a0, a1, . . . to compute b0, b1, . . . , bp exactly. For more information on the error in case of truncation, see Greengard and Rokhlin (1987). Preface 139

N N

C N

Figure 9.14: First Level of Quad Tree

I I I I

I I I I

N N I I

C N I I

Figure 9.15: Second Level of Quad Tree

9.17 Application on Quad Tree

In this section, we will go through the application of multipole algorithm on quad tree in detail. During the process, we will also look into the two different operations SHIF T and F LIP , and gain some experience on how to use them in real situations. We will start at the lowest level h of the tree. For every node of the tree, it computes the multipole expansion coefficients for the bodies inside, with origin located at the center of the cell. Next, it will shift all of the four centers for the children cells into the center of the parent node, which is at the h 1 level, through the SHIFT operation for the multipole expansion. Adding up − the coefficients from the four shifted expansion series, the multipole expansion of the whole parent node is obtained. And this SHIFT and ADD process will continue upward for every level of the tree, until the multipole expansion coefficients for each node of the entire tree are stored within that node. The computational complexity for this part is O(N). Before we go to the next step, some terms have to be defined first. NEIGHBOR — a neighbor N to a cell C is defined as any cell which shares either an edge • or a corner with C

INTERACTIVE — an interactive cell I to a cell C is defined as any cell whose parent is a • neighbor to parent of C, excluding those which are neighbors to C 140 Math 18.337, Computer Science 6.338, SMA 5505, Spring 2004

F F F F F F

F F F F F F

I I I I F F

I I I I F F

N N I I F F

C N I I F F

Figure 9.16: Third Level of Quad Tree

FARAWAY — a faraway cell F to a cell C is defined as any cell which is neither a neighbor • nor an interactive to C

Now we start at the top level of the tree. For each cell C, FLIP the multipole expansions of its interactive cells and combine the resulting local Taylor expansions into one expansion series. After all of the FLIP and COMBINE operations are done, SHIFT the local Taylor expansion from each node at this level to its four children in the next lower level, so that the information is passed from parent to child. Then go down to the next lower level, where the children are. For all of the cells at this level, the far-away field has already been accounted for (it is the interactive zone at the parent level), so we concentrate on the interactive zone at this level. Repeat the FLIP operation for all of the interactive cells and add the flipped multipole expansions to the Taylor expansion shifted from the parent node. Then repeat the COMBINE and SHIFT operations as before. This process continues from the top level downward until the lowest level of the tree. In the end, at the lowest level, the contributions of cells that are close enough (the near field) are added directly.

9.18 Expansion from 2-D to 3-D

For 2-D N-body simulation, the potential function is given as

$$\phi(z_j) = \sum_{i=1}^{n} q_i \log(z_j - z_i),$$

where $z_1, \ldots, z_n$ are the positions of the particles and $q_1, \ldots, q_n$ are their strengths. The corresponding multipole expansion for the cluster centered at $z_c$ is

$$\phi_{z_c}(z) = a_0 \log(z - z_c) + \sum_{k=1}^{\infty} \frac{a_k}{(z - z_c)^k}.$$

The corresponding local Taylor expansion looks like

$$\phi_{C,z_c}(z) = \sum_{k=0}^{\infty} b_k (z - z_c)^k.$$
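A small MATLAB sketch comparing direct evaluation of the 2-D potential with a truncated multipole expansion about the cluster center. The coefficient formulas $a_0 = \sum_i q_i$ and $a_k = -\sum_i q_i (z_i - z_c)^k / k$ are the standard choice and should be treated as an assumption here; the particle data are made up for illustration.

% Direct 2-D potential vs. a p-term multipole expansion at a far-away point.
n = 50;  p = 10;
z  = 0.1*(rand(n,1) + 1i*rand(n,1));    % clustered particle positions
q  = rand(n,1);                         % particle strengths
zc = mean(z);                           % expansion center
zt = 3 + 3i;                            % far-away evaluation point

phi_direct = sum(q .* log(zt - z));     % phi(zt) = sum_i q_i log(zt - z_i)

a0 = sum(q);
ak = zeros(p,1);
for k = 1:p
    ak(k) = -sum(q .* (z - zc).^k) / k; % assumed multipole coefficients
end
phi_multi = a0*log(zt - zc) + sum(ak ./ (zt - zc).^(1:p).');

disp(abs(phi_direct - phi_multi))       % truncation error; decays as p grows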

In three dimensions, the potential as well as the expansion series become much more complicated. The 3-D potential is given as

$$\Phi(x) = \sum_{i=1}^{n} \frac{q_i}{\|x - x_i\|},$$

where $x = f(r, \theta, \phi)$. The corresponding multipole expansion and local Taylor expansion are as follows:

$$\Phi_{\mathrm{multipole}}(x) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} \frac{a_n^m}{r^{n+1}}\, Y_n^m(\theta, \phi),$$
$$\Phi_{\mathrm{Taylor}}(x) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} b_n^m\, r^n\, Y_n^m(\theta, \phi),$$

where $Y_n^m(\theta, \phi)$ is the spherical harmonic function. For a more detailed treatment of 3-D expansions, see Nabors and White (1991).
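For reference, the direct sum that these 3-D expansions approximate is straightforward to evaluate; the following MATLAB sketch (with hypothetical inputs X, q, and x) simply computes $\Phi(x) = \sum_i q_i / \|x - x_i\|$.

% Direct evaluation of the 3-D potential at a point x.
% X is n-by-3 (particle positions), q is n-by-1 (strengths), x is 1-by-3.
function phi = potential3d(x, X, q)
    r = sqrt(sum((X - repmat(x, size(X,1), 1)).^2, 2));   % ||x - x_i||
    phi = sum(q ./ r);
end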

9.19 Parallel Implementation

In Lecture 10 we will discuss issues in parallel N-body implementation.

Lecture 10

Partitioning and Load Balancing

Handling a large mesh or a linear system on a supercomputer or on a workstation cluster usually requires that the data for the problem be somehow partitioned and distributed among the processors. The quality of the partition affects the speed of solution: a good partition divides the work up evenly and requires as little communication as possible. While unstructured meshes may approximate irregular physical problems with fewer mesh elements, their use increases the difficulty of programming and requires new algorithms for partitioning and mapping data and computations onto parallel machines. Partitioning is also important for VLSI circuit layout and parallel circuit simulation.

10.1 Motivation from the Parallel Sparse Matrix Vector Multiplication

Multiplying a sparse matrix by a vector is a basic step of most commonly used iterative methods (conjugate gradient, for example). We want to find a way of efficiently parallelizing this operation. Say that processor $i$ holds the value of $v_i$. To update this value, the processor needs to compute a weighted sum of the values at itself and at all of its neighbors. This means it first has to receive values from the processors holding its neighbors' values and then compute the sum. Viewing this in graph terms, this corresponds to communicating with a node's nearest neighbors. We therefore need to break up the vector (and implicitly the matrix and the graph) so that:

• We balance the computational load at each processor. This is directly related to the number of non-zero entries in its matrix block.

• We minimize the communication overhead. How many other values does a processor have to receive? This equals the number of its neighbors' values that are held at other processors.

We must come up with a proper division to reduce overhead. This corresponds to dividing up the graph of the matrix among the processors so that there are very few crossing edges. First assume that we have 2 processors, and we wish to partition the graph for parallel processing. As an easy example, take a simplistic cut such as cutting the 2D regular grid of size $n$ in half through the middle. Let's define the cut size as the number of edges whose endpoints are in different groups. A good cut is one with a small cut size. In our example, the cut size would be $\sqrt{n}$. Assume that the cost of each communication is 10 times that of a local arithmetic operation. Then the total parallel cost of performing the matrix-vector product on the grid is $(4n)/2 + 10\sqrt{n} = 2n + 10\sqrt{n}$. In general, for $p$ processors, we need to partition the graph into $p$ subgraphs.
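A small MATLAB sketch of this cost model: it builds the edge list of a k-by-k grid graph, cuts it down the middle, counts crossing edges, and evaluates the cost expression above. The grid size k and the factor of 10 are just the example values used in the text.

% Cut a k-by-k grid graph in half and evaluate the parallel cost model.
k = 8;  n = k^2;
idx = reshape(1:n, k, k);                 % linear index of grid point (i,j)
E = [reshape(idx(:,1:end-1),[],1) reshape(idx(:,2:end),[],1);     % horizontal edges
     reshape(idx(1:end-1,:),[],1) reshape(idx(2:end,:),[],1)];    % vertical edges
[I, J] = ndgrid(1:k, 1:k);
part = 1 + (J(:) > k/2);                  % split through the middle: 1 = left, 2 = right
cutsize = sum(part(E(:,1)) ~= part(E(:,2)));   % crossing edges: equals k = sqrt(n)
cost = (4*n)/2 + 10*cutsize;              % local arithmetic + 10x per communicated value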


Various methods are used to break up a graph into parts such that the number of crossing edges is minimized. Here we’ll examine the case where p = 2 processors, and we want to partition the graph into 2 halves.

10.2 Separators

The following definitions will be of use in the discussion that follows. Let G = (V, E) be an undirected graph.

• A bisection of $G$ is a division of $V$ into $V_1$ and $V_2$ such that $|V_1| = |V_2|$. (If $|V|$ is odd, the cardinalities differ by at most 1.) The cost of the bisection $(V_1, V_2)$ is the number of edges connecting $V_1$ with $V_2$.

• If $\beta \in [1/2, 1)$, a $\beta$-bisection is a division of $V$ such that $|V_{1,2}| \le \beta |V|$.

• An edge separator of $G$ is a set of edges that, if removed, would break $G$ into 2 pieces with no edges connecting them.

• A $p$-way partition is a division of $V$ into $p$ pieces $V_1, V_2, \ldots, V_p$ where the sizes of the pieces differ by at most 1. The cost of this partition is the total number of edges crossing between pieces.

• A vertex separator is a set $C$ of vertices that breaks $G$ into 3 pieces $A$, $B$, and $C$, where no edges connect $A$ and $B$. We also add the requirement that $A$ and $B$ should be of roughly equal size.

Usually, we use edge partitioning for the parallel sparse matrix-vector product and vertex partitioning for the ordering in direct sparse factorization.

10.3 Spectral Partitioning – One way to slice a problem in half

The solution of this problem is illustrated in three steps:

A. Electrical Networks (for motivating the Laplace matrix of a graph)

B. Laplacian of a Graph

C. Partitioning (that uses the spectral information)

10.3.1 Electrical Networks

As an example we choose a certain configuration of resistors (1 ohm each), combined as shown in Figure 10.1. A battery is connected to nodes 3 and 4 and provides the voltage $V_B$. To obtain the current, we have to calculate the effective resistance of the configuration:

$$I = \frac{V_B}{\text{eff. res.}} \qquad (10.1)$$

This so-called graph is by definition a collection of nodes and edges.

Figure 10.1: Resistor network with nodes and edges ($n$ nodes, $m$ edges; the battery provides voltage $V_B$)

10.3.2 Laplacian of a Graph

Kirchhoff's Law tells us

$$\nabla_G^2 \begin{pmatrix} V_1 \\ V_2 \\ V_3 \\ V_4 \\ V_5 \\ V_6 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 1 \\ -1 \\ 0 \\ 0 \end{pmatrix}, \qquad (10.2)$$

where the $6 \times 6$ coefficient matrix has the node degrees $(2, 3, 3, 3, 3, 2)$ on its diagonal and a $-1$ in position $(i,j)$ for every resistor joining nodes $i$ and $j$. The matrix contains the information about how the resistors are connected; it is called the Laplacian of the graph:

$$\nabla_G^2 = n \times n \text{ matrix}, \qquad G \ldots \text{graph}, \quad n \ldots \text{nodes}, \quad m \ldots \text{edges} \qquad (10.3)$$
$$\left(\nabla_G^2\right)_{ii} = \text{degree of node } i \qquad (10.4)$$
$$\left(\nabla_G^2\right)_{ij} = \begin{cases} 0 & \text{if no edge exists between nodes } i \text{ and } j \\ -1 & \text{if an edge exists between nodes } i \text{ and } j \end{cases} \qquad (10.5)$$

Yet there is another way to obtain the Laplacian. First we have to set up the so-called node–edge incidence matrix, an $m \times n$ matrix whose elements are either $1$, $-1$, or $0$ depending on whether the edge (row of the matrix) is incident to the node (column of the matrix) or not. We find

Figure 10.2: Partitioning of a telephone net as an example for a graph: parts A and B carry local calls, and long distance calls cross between them.

For the network of Figure 10.1, the incidence matrix

$$M_G \in \{-1, 0, 1\}^{8 \times 6} \qquad (10.6)$$

has one row per edge, with a $+1$ and a $-1$ in the columns of the two nodes joined by that edge and zeros elsewhere. It satisfies

$$M_G^T M_G = \nabla_G^2. \qquad (10.7)$$

10.3.3 Spectral Partitioning

Partitioning means that we would like to break the problem into 2 parts $A$ and $B$, where the number of connections between the two parts is as small as possible (because they are the most expensive ones); see Figure 10.2. Now we define a vector $x \in \mathbb{R}^n$ taking the values $x_i = \pm 1$: $+1$ stands for "belongs to part $A$", $-1$ means "belongs to part $B$". With some vector calculus we can show the following identity:

$$\sum_{i=1}^{n} (M_G x)_i^2 = (M_G x)^T (M_G x) = x^T M_G^T M_G x = x^T \nabla_G^2 x, \qquad (10.8)$$
$$x^T \nabla_G^2 x = 4 \times (\#\text{ edges between } A \text{ and } B).$$

In order to find the vector $x$ with the fewest connections between parts $A$ and $B$, we want to solve the minimization problem:

$$\begin{array}{ll}
\text{Min. } x^T \nabla_G^2 x & \qquad \text{Min. } x^T \nabla_G^2 x \\
x_i = \pm 1 & \qquad x \in \mathbb{R}^n \\
\sum_i x_i = 0 & \qquad \sum_i x_i^2 = n \\
 & \qquad \sum_i x_i = 0
\end{array} \qquad (10.9)$$

Figure 10.3: Tapir (Bern-Mitchell-Ruppert)

Finding an optimal $\pm 1$ solution is intractable (NP-hard) in general. In practice, we relax the integer condition and allow the solution to be real. We do require the norm squared of the solution, as in the integer case, to equal $n$ (see the RHS of eqn. (10.9)). The relaxation enlarges the solution space, hence its optimal value is no more than that of its integer counterpart. As a heuristic, we solve the relaxed version of the problem and "round" the solution to give a $\pm 1$ solution. The solution of the problem on the RHS of eqn. (10.9) gives us the second smallest eigenvalue of the Laplacian $\nabla_G^2$ and the corresponding eigenvector, which is in fact the vector $x$ with the fewest connections in the relaxed sense. To obtain a partition of the graph, we can divide its node set according to the vector $x$. We can even control the ratio of the partition. For example, to obtain a bisection, i.e., a partition into two parts of equal size, we can find the median of $x$ and put those nodes whose $x$ values are smaller than the median on one side and the remaining nodes on the other side. We can also use a similar idea to divide the graph into two parts in which one is twice as big as the other. Figure 10.3 shows a mesh together with its spectral partition. For those of you as ignorant as your professors, a tapir is defined by Webster as any of several large inoffensive chiefly nocturnal ungulates (family Tapiridae) of tropical America, Malaya, and Sumatra related to the horses and rhinoceroses. The tapir mesh is generated by a Matlab program written by Scott Mitchell based on a non-obtuse triangulation algorithm of Bern-Mitchell-Ruppert. The partition in the figure is generated by John Gilbert's spectral program in Matlab. One has to notice that the above method is merely a heuristic. The algorithm does not come with any performance guarantee. In the worst case, the partition may be far from the optimal, partially due to the round-off effect and partially due to the requirement of an equal-sized partition. In fact the partition for the tapir graph is not quite optimal. Nevertheless, the spectral method, with the help of various improvements, works very well in practice. We will cover the partitioning problem more systematically later in the semester. The most commonly used algorithm for finding the second eigenvalue and its eigenvector is the Lanczos algorithm, though it makes far better sense to compute singular values rather than eigenvalues. Lanczos is an iterative algorithm that can converge quite quickly.
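A minimal MATLAB sketch of the whole recipe, assuming the graph is given as a sparse symmetric adjacency matrix A (a hypothetical input): form the Laplacian, take the eigenvector of the second smallest eigenvalue, and round with a median split. A dense eig call is used for simplicity; a real code would use a Lanczos-type routine as noted above.

% Spectral bisection: Laplacian -> second eigenvector -> median split.
function [part, cutsize] = spectral_bisect(A)
    n = size(A, 1);
    L = diag(sum(A, 2)) - A;            % degrees on the diagonal, -1 per edge
    [V, D] = eig(full(L));              % fine for small graphs; use Lanczos for large ones
    [lam, order] = sort(diag(D));
    x = V(:, order(2));                 % eigenvector of the 2nd smallest eigenvalue
    part = ones(n, 1);
    part(x <= median(x)) = -1;          % round the relaxed solution to +/-1
    cutsize = nnz(A(part == 1, part == -1));   % number of edges between the parts
end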

Figure 10.4: Its Spectral Partition

Figure 10.5: The 2nd eigenfunction of MathWorks' logo

In general, we may need to divide a graph into more than two pieces. The most commonly used approach is to recursively apply the partitioner that divides the graph into two pieces of roughly equal size. A more systematic treatment of the partitioning problem will be covered in future lectures. The success of the spectral method in practice has a physical interpretation. Suppose now we have a continuous domain in place of a graph. The Laplacian of the domain is the continuous counterpart of the Laplacian of a graph. The $k$th eigenfunction gives the $k$th mode of vibration, and the second eigenfunction induces a cut (break) of the domain along the weakest part of the domain. (See Figure 10.5.)

10.4 Geometric Methods

The spectral partitioning method only deals with the topology of the graph; the vertex locations are not considered. For many practical problems, the graph is derived from a mesh. The geometric method developed by Miller, Teng, Thurston, and Vavasis makes use of the vertex locations, and is guaranteed to partition a well-shaped$^1$ $d$-dimensional unstructured mesh with $n$ vertices using a

$^1$For example, no grid angles can be too small/too large.

Finite Element Mesh

Figure 10.6: The input is a mesh with specified coordinates. Every triangle must be “well-shaped”, which means that no angle can be too small. Remarks: The mesh is a subgraph of the intersection graph of a set of disks, one centered at each mesh point. Because the triangles are well-shaped, only a bounded number of disks can overlap any point, and the mesh is an “alpha-overlap graph”. This implies that it has a good separator, which we proceed to find.

cut size of $O(n^{1-1/d})$, where the cut size is the number of elements in the vertex separator. Note that this bound on the cut size is of the same order as for a regular grid of the same size. Such a bound does not hold for the spectral method in general. The geometric partitioning method has a great deal of theory behind it, but the implementation is relatively simple. For this reason, we will begin with an illustrative example before discussing the method and theoretical background in more depth. To motivate the software development aspect of this approach, we use the following figures (Figures 10.6 – 10.13), generated by a Matlab implementation (written by Gilbert and Teng), to outline the steps for dividing a well-shaped mesh into two pieces. The algorithm works on meshes in any dimension, but we'll stick to two dimensions for the purpose of visualization. To recap, the geometric mesh partitioning algorithm can be summed up as follows (a more precise algorithm follows):

• Perform a stereographic projection to map the points in a $d$-dimensional plane to the surface of a $d+1$ dimensional sphere.

• (Approximately) compute the centerpoint of the points on the sphere.

• Perform a conformal map to move the centerpoint to the center of the sphere.

• Find a plane through the center of the sphere that approximately divides the nodes equally; translate this plane to obtain a more even division.

• Undo the conformal mapping and the stereographic mapping. This leaves a circle in the plane.

Figure 10.7: Let's redraw the mesh, omitting the edges for clarity.

Figure 10.8: First we project the points stereographically from the plane onto the surface of a sphere (of one higher dimension than the mesh) tangent to the plane of the mesh. A stereographic projection is done by drawing a line between a point A on the plane and the north pole of the sphere, and mapping the point A to the intersection of the surface of the sphere and the line. Now we compute a "centerpoint" for the projected points in 3-space. A centerpoint is defined such that every plane through the centerpoint separates the input points into two roughly equal subsets. (Actually it's too expensive to compute a real centerpoint, so we use a fast, randomized heuristic to find a pretty good approximation.)

Figure 10.9: Next, we conformally map the points so that the centerpoint maps to the center of the sphere. This takes two steps: First we rotate the sphere about the origin (in 3-space) so that the centerpoint is on the z axis, and then we scale the points in the plane to move the centerpoint along the z axis to the origin (this can be thought of as mapping the sphere surface back to the plane, stretching the plane, then re-mapping the plane back to the surface of the sphere). The figures show the final result on the sphere and in the plane.

10.4.1 Geometric Graphs

This method applies to meshes in both two and three dimensions. It is based on the following important observation: graphs from large-scale problems in scientific computing are often defined geometrically. They are meshes of elements in a fixed dimension (typically two and three dimensions) that are well shaped in some sense, such as having elements of bounded aspect ratio or having elements with angles that are not too small. In other words, they are graphs embedded in two or three dimensions that come with natural geometric coordinates and with structures.

We now consider the types of graphs we expect to run these algorithms on. We don't expect to get a truly random graph. In fact, Erdős, Graham, and Szemerédi proved in the 1960s that, with probability 1, a random graph with $cn$ edges does not have a two-way division with $o(n)$ crossing edges. Structured graphs usually have divisions with $\sqrt{n}$ crossing edges. The following classes of graphs usually arise in applications such as finite element and finite difference methods (see Chapter ??):

• Regular Grids: These arise, for example, from finite difference methods.

• "Quad"-tree graphs and "Meshes": These arise, for example, from finite difference methods and hierarchical N-body simulation.

• k-nearest neighbor graphs in d dimensions: Consider a set $P = \{p_1, p_2, \ldots, p_n\}$ of $n$ points in $\mathbb{R}^d$. The vertex set for the graph is $\{1, 2, \ldots, n\}$ and the edge set is $\{(i,j) : p_j \text{ is one of the } k\text{-nearest neighbors of } p_i \text{ or vice-versa}\}$. This is an important class of graphs for image processing.

Figure 10.10: Because the approximate centerpoint is now at the origin, any plane through the origin should divide the points roughly evenly. Also, most planes only cut a small number of mesh edges ($O(\sqrt{n})$, to be precise). Thus we find a separator by choosing a plane through the origin, which induces a great circle on the sphere. In practice, several potential planes are considered, and the best one accepted. Because we only estimated the location of the centerpoint, we must shift the circle slightly (in the normal direction) to make the split exactly even. The second circle is the shifted version.

Figure 10.11: We now begin "undoing" the previous steps, to return to the plane. We first undo the conformal mapping, giving a (non-great) circle on the original sphere ...

Figure 10.12: ... and then undo the stereographic projection, giving a circle in the original plane.

Figure 10.13: This partitions the mesh into two pieces with about $n/2$ points each, connected by at most $O(\sqrt{n})$ edges. These connecting edges are called an "edge separator". This algorithm can be used recursively if more divisions (ideally a power of 2) are desired.

• Disk packing graphs: If a set of non-overlapping disks is laid out in a plane, we can tell which disks touch. The nodes of a disk packing graph are the centers of the disks, and edges connect two nodes if their respective disks touch.

• Planar graphs: These are graphs that can be drawn in a plane without crossing edges. Note that disk packing graphs are planar, and in fact every planar graph is isomorphic to some disk-packing graph (Andreev and Thurston).

10.4.2 Geometric Partitioning: Algorithm and Geometric Modeling

The main ingredient of the geometric approach is a novel geometric characterization of graphs embedded in a fixed dimension that have a small separator, which is a relatively small subset of vertices whose removal divides the rest of the graph into two pieces of approximately equal size. By taking advantage of the underlying geometric structure, partitioning can be performed efficiently.

Computational meshes are often composed of elements that are well-shaped in some sense, such as having bounded aspect ratio or having angles that are not too small or too large. Miller et al. define a class of so-called overlap graphs to model this kind of geometric constraint. An overlap graph starts with a neighborhood system, which is a set of closed disks in $d$-dimensional Euclidean space and a parameter $k$ that restricts how deeply they can intersect.

Definition 10.4.1 A $k$-ply neighborhood system in $d$ dimensions is a set $\{D_1, \ldots, D_n\}$ of closed disks in $\mathbb{R}^d$, such that no point in $\mathbb{R}^d$ is strictly interior to more than $k$ of the disks.

A neighborhood system and another parameter α define an overlap graph. There is a vertex for each disk. For α = 1, an edge joins two vertices whose disks intersect. For α > 1, an edge joins two vertices if expanding the smaller of their two disks by a factor of α would make them intersect.

Definition 10.4.2 Let $\alpha \ge 1$, and let $\{D_1, \ldots, D_n\}$ be a $k$-ply neighborhood system. The $(\alpha, k)$-overlap graph for the neighborhood system is the graph with vertex set $\{1, \ldots, n\}$ and edge set
$$\{(i,j) \mid (D_i \cap (\alpha \cdot D_j) \ne \emptyset) \text{ and } ((\alpha \cdot D_i) \cap D_j \ne \emptyset)\}.$$

We make an overlap graph into a mesh in $d$-space by locating each vertex at the center of its disk. Overlap graphs are good models of computational meshes because every mesh of bounded-aspect-ratio elements in two or three dimensions is contained in some overlap graph (for suitable choices of the parameters $\alpha$ and $k$). Also, every planar graph is an overlap graph. Therefore, any theorem about partitioning overlap graphs implies a theorem about partitioning meshes of bounded aspect ratio and planar graphs.

We now describe the geometric partitioning algorithm. We start with two preliminary concepts. We let $\Pi$ denote the stereographic projection mapping from $\mathbb{R}^d$ to $S^d$, where $S^d$ is the unit $d$-sphere embedded in $\mathbb{R}^{d+1}$. Geometrically, this map may be defined as follows. Given $x \in \mathbb{R}^d$, append '0' as the final coordinate, yielding $x' \in \mathbb{R}^{d+1}$. Then compute the intersection of $S^d$ with the line in $\mathbb{R}^{d+1}$ passing through $x'$ and $(0, 0, \ldots, 0, 1)^T$. This intersection point is $\Pi(x)$. Algebraically, the mapping is defined as
$$\Pi(x) = \begin{pmatrix} 2x/\chi \\ 1 - 2/\chi \end{pmatrix},$$
where $\chi = x^T x + 1$. It is also simple to write down a formula for the inverse of $\Pi$. Let $u$ be a point on $S^d$. Then
$$\Pi^{-1}(u) = \frac{\bar{u}}{1 - u_{d+1}},$$
where $\bar{u}$ denotes the first $d$ entries of $u$ and $u_{d+1}$ is the last entry. The stereographic mapping, besides being easy to compute, has a number of important properties proved below.

A second crucial concept for our algorithm is the notion of a center point. Given a finite subset $P \subset \mathbb{R}^d$ such that $|P| = n$, a center point of $P$ is defined to be a point $x \in \mathbb{R}^d$ such that if $H$ is any open halfspace whose boundary contains $x$, then
$$|P \cap H| \le dn/(d+1). \qquad (10.10)$$

It can be shown from Helly's theorem [25] that a center point always exists. Note that center points are quite different from centroids. For example, a center point (which, in the $d = 1$ case, is the same as a median) is largely insensitive to "outliers" in $P$. On the other hand, a single distant outlier can cause the centroid of $P$ to be displaced by an arbitrarily large distance.

Geometric Partitioning Algorithm

Let $P = \{p_1, \ldots, p_n\}$ be the input points in $\mathbb{R}^d$ that define the overlap graph.

1. Given $p_1, \ldots, p_n$, compute $P' = \{\Pi(p_1), \ldots, \Pi(p_n)\}$ so that $P' \subset S^d$.

2. Compute a center point $z$ of $P'$.

3. Compute an orthogonal $(d+1) \times (d+1)$ matrix $Q$ such that $Qz = z'$, where
$$z' = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ \theta \end{pmatrix}$$
and $\theta$ is a scalar.

4. Define $P'' = QP'$ (i.e., apply $Q$ to each point in $P'$). Note that $P'' \subset S^d$, and the center point of $P''$ is $z'$.

5. Let $D$ be the matrix $[(1-\theta)/(1+\theta)]^{1/2} I$, where $I$ is the $d \times d$ identity matrix. Let $P''' = \Pi(D\Pi^{-1}(P''))$. Below we show that the origin is a center point of $P'''$.

6. Choose a random great circle $S'$ on $S^d$.

7. Transform $S'$ back to a sphere $S \subset \mathbb{R}^d$ by reversing all the transformations above, i.e., $S = \Pi^{-1}(Q^{-1}\Pi(D^{-1}\Pi^{-1}(S')))$.

8. From $S$ compute a set of vertices of $G$ that split the graph as in Theorem ??. In particular, define $C$ to be the vertices embedded "near" $S$, define $A$ to be the vertices of $G - C$ embedded outside $S$, and define $B$ to be the vertices of $G - C$ embedded inside $S$.

We can immediately make the following observation: because the origin is a center point of $P'''$, and the points are split by choosing a plane through the origin, we know that $|A| \le (d+1)n/(d+2)$ and $|B| \le (d+1)n/(d+2)$ regardless of the details of how $C$ is chosen. (Notice that the constant factor is $(d+1)/(d+2)$ rather than $d/(d+1)$ because the point set $P'$ lies in

$\mathbb{R}^{d+1}$ rather than $\mathbb{R}^d$.) Thus, one of the claims made in Theorem ?? will follow as soon as we have shown that the origin is indeed a center point of $P'''$, which we do at the end of this section.

We now provide additional details about the steps of the algorithm, and also its complexity analysis. We have already defined the stereographic projection used in Step 1; Step 1 requires $O(nd)$ operations. Computing a true center point in Step 2 appears to be a very expensive operation (involving a linear programming problem with $n^d$ constraints), but by using random (geometric) sampling, an approximate center point can be found in random constant time (independent of $n$ but exponential in $d$) [100, 48]. An approximate center point satisfies (10.10) except with $(d + 1 + \epsilon)n/(d+2)$ on the right-hand side, where $\epsilon > 0$ may be arbitrarily small. Alternatively, a deterministic linear-time sampling algorithm can be used in place of random sampling [65, 96], but one must again compute a center of the sample using linear programming in time exponential in $d$ [67, 41].

In Step 3, the necessary orthogonal matrix may be represented as a single Householder reflection; see [43] for an explanation of how to pick an orthogonal matrix to zero out all but one entry in a vector. The number of floating point operations involved is $O(d)$, independent of $n$.

In Step 4 we do not actually need to compute $P''$; the set $P''$ is defined only for the purpose of analysis. Thus, Step 4 does not involve computation. Note that $z'$ is the center point of $P''$ after this transformation, because when a set of points is transformed by any orthogonal transformation, a center point moves according to the same transformation (more generally, center points are similarly moved under any affine transformation). This is proved below.

In Step 6 we choose a random great circle, which requires time $O(d)$. This is equivalent to choosing a plane through the origin with a randomly selected orientation. (This step of the algorithm can be made deterministic; see [?].) Step 7 is also seen to require time $O(d)$. Finally, there are two possible alternatives for carrying out Step 8. One alternative is that we are provided with the neighborhood system of the points (i.e., a list of $n$ balls in $\mathbb{R}^d$) as part of the input. In this case Step 8 requires $O(nd)$ operations, and the test to determine which points belong in $A$, $B$, or $C$ is a simple geometric test involving $S$. Another possibility is that we are provided with the nodes of the graph and a list of edges. In this case we determine which nodes belong in $A$, $B$, or $C$ by scanning the adjacency list of each node, which requires time linear in the size of the graph.
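The two maps used in Steps 1, 5, and 7 are simple to code directly from the formulas above; here is a MATLAB sketch (two small function files), with a hypothetical round-trip check in the final comment.

% Stereographic projection Pi: R^d -> S^d (x is a row vector).
function u = stereo(x)
    chi = x*x' + 1;
    u = [2*x/chi, 1 - 2/chi];
end

% Inverse map Pi^{-1}: S^d -> R^d (u is a row vector on the unit sphere).
function x = stereo_inv(u)
    x = u(1:end-1) / (1 - u(end));
end

% Example (hypothetical): x = [0.3 -0.7]; norm(stereo_inv(stereo(x)) - x) is ~0.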

Theorem 10.4.1 If M is an unstructured mesh with bounded aspect ratio, then the graph of M is a subgraph of a bounded overlap graph of the neighborhood system where we have one ball for each vertex of M of radius equal to half of the distance to its nearest vertices. Clearly, this neighborhood system has ply equal to 1.

Theorem 10.4.1 (Geometric Separators [67]) Let G be an n-vertex (α, k)-overlap graph in d dimensions. Then the vertices of G can be partitioned into three sets A, B, and C, such that

• no edge joins $A$ and $B$,

• $A$ and $B$ each have at most $(d+1)n/(d+2)$ vertices,

• $C$ has only $O(\alpha k^{1/d} n^{(d-1)/d})$ vertices.

Such a partitioning can be computed by the geometric partitioning algorithm in randomized linear time sequentially, and in $O(n/p)$ parallel time when we use a parallel machine of $p$ processors.

10.4.3 Other Graphs with small separators

The following classes of graphs all have small separators:

• Lines have edge-separators of size 1. Removing the middle edge is enough.

• Trees have a 1-vertex separator with $\beta = 2/3$, the so-called centroid of the tree.

• Planar Graphs: A result of Lipton and Tarjan shows that a planar graph of bounded degree has a $\sqrt{8n}$ vertex separator with $\beta = 2/3$.

• $d$-dimensional regular grids (those used for the basic finite difference method). As folklore, they have a separator of size $n^{1-1/d}$ with $\beta = 1/2$.

10.4.4 Other Geometric Methods

Recursive coordinate bisection

The simplest form of geometric partitioning is recursive coordinate bisection (RCB) [91, 101]. In the RCB algorithm, the vertices of the graph are projected onto one of the coordinate axes, and the vertex set is partitioned around a hyperplane through the median of the projected coordinates. Each resulting subgraph is then partitioned along a different coordinate axis until the desired number of subgraphs is obtained.

Because of the simplicity of the algorithm, RCB is very quick and cheap, but the quality of the resultant separators can vary dramatically, depending on the embedding of the graph in $\mathbb{R}^d$. For example, consider a graph that is "+"-shaped. Clearly, the best (smallest) separator consists of the vertices lying along a diagonal cut through the center of the graph. RCB, however, will find the largest possible separators, in this case, planar cuts through the centers of the horizontal and vertical components of the graph.

Inertia-based slicing

Williams [101] noted that RCB had poor worst case performance, and suggested that it could be improved by slicing orthogonal to the principal axes of inertia, rather than orthogonal to the coordinate axes. Farhat and Lesoinne implemented and evaluated this heuristic for partitioning [33].

In three dimensions, let $v = (v_x, v_y, v_z)^t$ be the coordinates of vertex $v$ in $\mathbb{R}^3$. Then the inertia matrix $I$ of the vertices of a graph with respect to the origin is given by
$$I = \begin{pmatrix} I_{xx} & I_{xy} & I_{xz} \\ I_{yx} & I_{yy} & I_{yz} \\ I_{zx} & I_{zy} & I_{zz} \end{pmatrix}$$
where
$$I_{xx} = \sum_{v \in V} (v_y^2 + v_z^2), \qquad I_{yy} = \sum_{v \in V} (v_x^2 + v_z^2), \qquad I_{zz} = \sum_{v \in V} (v_x^2 + v_y^2)$$
and, for $i, j \in \{x, y, z\}$, $i \ne j$,
$$I_{ij} = I_{ji} = -\sum_{v \in V} v_i v_j.$$

The eigenvectors of the inertia matrix are the principal axes of the vertex distribution. The eigenvalues of the inertia matrix are the principal moments of inertia. Together, the principal axes and principal moments of inertia define the inertia ellipse; the axes of the ellipse are the principal axes of inertia, and the axis lengths are the square roots of the corresponding principal moments. Physically, the size of a principal moment reflects how the mass of the system is distributed with respect to the corresponding axis: the larger the principal moment, the more mass is concentrated at a distance from the axis.

Let $I_1$, $I_2$, and $I_3$ denote the principal axes of inertia corresponding to the principal moments $\alpha_1 \le \alpha_2 \le \alpha_3$. Farhat and Lesoinne projected the vertex coordinates onto $I_1$, the axis about which the mass of the system is most tightly clustered, and partitioned using a planar cut through the median. This method typically yielded a good initial separator, but did not perform as well recursively on their test mesh, a regularly structured "T"-shape.

Farhat and Lesoinne did not present any results on the theoretical properties of the inertia slicing method. In fact, there are pathological cases in which the inertia method can be shown to yield a very poor separator. Consider, for example, a "+"-shape in which the horizontal bar is very wide and sparse, while the vertical bar is relatively narrow but dense. $I_1$ will be parallel to the horizontal axis, but a cut perpendicular to this axis through the median will yield a very large separator. A diagonal cut will yield the smallest separator, but will not be generated with this method. Gremban, Miller, and Teng show how to use the moment of inertia to improve the geometric partitioning algorithm.
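A minimal MATLAB sketch of inertia-based slicing as described above, assuming the vertex coordinates are given as an n-by-3 array V (a hypothetical input): build the inertia matrix, project onto the principal axis with the smallest moment, and cut at the median.

% Inertia-based slicing: principal axis of smallest moment + median cut.
function part = inertia_slice(V)
    x = V(:,1);  y = V(:,2);  z = V(:,3);
    I = [ sum(y.^2 + z.^2),  -sum(x.*y),         -sum(x.*z);
         -sum(x.*y),          sum(x.^2 + z.^2),  -sum(y.*z);
         -sum(x.*z),         -sum(y.*z),          sum(x.^2 + y.^2) ];
    [Q, D]  = eig(I);
    [mn, k] = min(diag(D));             % axis I1 with the smallest principal moment
    proj = V * Q(:,k);                  % project vertices onto I1
    part = 1 + (proj > median(proj));   % planar cut through the median
end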

10.4.5 Partitioning Software

• Chaco: written by Bruce Hendrickson and Rob Leland. To get the code, send email to [email protected] (Bruce Hendrickson).

• Matlab Mesh Partitioning Toolbox: written by Gilbert and Teng. It includes both edge and vertex separators, recursive bipartition, nested dissection ordering, visualizations and demos, and some sample meshes. The complete toolbox is available by anonymous ftp from machine ftp.parc.xerox.com as file /pub/gilbert/meshpart.uu.

• Spectral code: Pothen, Simon, and Liou.

10.5 Load-Balancing N-body Simulation for Non-uniform Particles

The discussion of Chapter ?? was focused on particles that are more or less uniformly distributed. However, in practical simulations, particles are usually not uniformly distributed. Particles may be highly clustered in some regions and relatively scattered in others. Thus, the hierarchical tree is adaptively generated, with smaller boxes for regions of clustered particles. The computation and communication pattern of a hierarchical method becomes more complex and often is not known explicitly in advance.

10.5.1 Hierarchical Methods of Non-uniformly Distributed Particles

In this chapter, we use the following notion of non-uniformity: we say a point set $P = \{p_1, \ldots, p_n\}$ in $d$ dimensions is $\mu$-non-uniform if the hierarchical tree generated for $P$ has height $\log_{2^d}(n/m) + \mu$.

In other words, the ratio of the size of the smallest leaf-box to the root-box is $1/2^{\log_{2^d}(n/m) + \mu}$. In practice, $\mu$ is less than 100.

The Barnes-Hut algorithm can be easily generalized to the non-uniform case. We describe a version of FMM for non-uniformly distributed particles. The method uses the box-box interaction. FMM tries to maximize the number of FLIPs among large boxes and also tries to FLIP between roughly equal sized boxes, a philosophy which can be described as: let parents do as much work as possible and then do the left-over work as much as possible before passing to the next generation.

Let $c_1, \ldots, c_{2^d}$ be the set of child-boxes of the root-box of the hierarchical tree. FMM generates the set of all interaction-pairs of boxes by taking the union of Interaction-Pair($c_i$, $c_j$) for all $1 \le i < j \le 2^d$, using the Interaction-Pair procedure defined below (a code sketch of the recursion follows the procedure).

Procedure Interaction-Pair ($b_1$, $b_2$)

• If $b_1$ and $b_2$ are $\beta$-well-separated, then $(b_1, b_2)$ is an interaction-pair.

• Else, if both $b_1$ and $b_2$ are leaf-boxes, then particles in $b_1$ and $b_2$ are near-field particles.

• Else, if both $b_1$ and $b_2$ are not leaf-boxes, then, without loss of generality assuming that $b_2$ is at least as large as $b_1$ and letting $c_1, \ldots, c_{2^d}$ be the child-boxes of $b_2$, recursively decide interaction-pairs by calling Interaction-Pair($b_1$, $c_i$) for all $1 \le i \le 2^d$.

• Else, if one of $b_1$ and $b_2$ is a leaf-box, then, without loss of generality assuming that $b_1$ is a leaf-box and letting $c_1, \ldots, c_{2^d}$ be the child-boxes of $b_2$, recursively decide interaction-pairs by calling Interaction-Pair($b_1$, $c_i$) for all $1 \le i \le 2^d$.
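A sketch of this recursion in MATLAB. The helper functions well_separated, is_leaf, children, and size_of are hypothetical stand-ins for the hierarchical-tree data structure; only the control flow mirrors the procedure above.

% Generate interaction-pairs and near-field pairs for two boxes b1, b2.
function [pairs, nearfield] = interaction_pair(b1, b2, beta)
    pairs = {};  nearfield = {};
    if well_separated(b1, b2, beta)
        pairs = {{b1, b2}};                       % an interaction-pair
    elseif is_leaf(b1) && is_leaf(b2)
        nearfield = {{b1, b2}};                   % near-field particles
    else
        if ~is_leaf(b2) && (is_leaf(b1) || size_of(b2) >= size_of(b1))
            kids = children(b2);  other = b1;     % recurse into the larger (non-leaf) box
        else
            kids = children(b1);  other = b2;
        end
        for i = 1:numel(kids)
            [p, nf] = interaction_pair(other, kids{i}, beta);
            pairs = [pairs, p];  nearfield = [nearfield, nf];
        end
    end
end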

FMM for far-field calculation can then be defined as follows: for each interaction-pair $(b_1, b_2)$, letting $\Phi_i^p()$ $(i = 1, 2)$ be the multipole expansion of $b_i$, flip $\Phi_1^p()$ to $b_2$ and add it to $b_2$'s potential Taylor-expansion. Similarly, flip $\Phi_2^p()$ to $b_1$ and add it to $b_1$'s potential Taylor-expansion. Then traverse down the hierarchical tree in a preordering, shifting and adding the potential Taylor-expansion of the parent box of a box to its own Taylor-expansion. Note that FMM for uniformly distributed particles has a more direct description (see Chapter ??).

10.5.2 The Communication Graph for N-Body Simulations

In order to efficiently implement an N-body method on a parallel machine, we need to understand its communication pattern, which can be described by a graph that characterizes the pattern of information exchange during the execution of the method. The communication graph is defined on the basic computational elements of the method. The basic elements of hierarchical N-body methods are boxes and points, where points give the locations of particles and boxes are generated by the hierarchical method. Formally, the communication graph is an edge-weighted directed graph, where the edges describe the pattern of communication and the weight on an edge specifies the communication requirement along the edge.

A Refined FMM for Non-Uniform Distributions

For parallel implementation, it is desirable to have a communication graph that uses small edge-weights and has small in- and out-degrees. However, some boxes in the set of interaction-pairs defined in the last section may have large degree!

The FMM described in the last subsection has a drawback which can be illustrated by the following 2D example. Suppose the root-box is divided into four child-boxes A, B, C, and D. Assume further that boxes A, B, and C contain fewer than $m$ ($< 100$) particles, and that most particles, say $n$ of them, are uniformly distributed in D; see Figure 10.14.

Figure 10.14: A non-uniform example

In FMM, we further recursively divide D by $\log_4(n/m)$ levels. Notice that A, B, and C are not well-separated from any box in D. Hence the FMM described in the previous subsection will declare all particles of D as near-field particles of A, B, and C (and vice versa). The drawback is twofold: (1) From the computation viewpoint, we cannot take advantage of the hierarchical tree of D to evaluate potentials in A, B, and C. (2) From the communication viewpoint, boxes A, B, and C have a large in-degree in the sense that each particle in these boxes needs to receive information from all $n$ particles in D, making partitioning and load balancing harder.

Notice that in BH most boxes of D are well-separated from particles in A, B, and C. The well-separation condition is different in BH: because BH uses the particle–box interaction, the well-separation condition is measured with respect to the size of the boxes in D; thus most boxes are well-separated from particles in A, B, and C. In contrast, because FMM applies the FLIP operation, the well-separation condition must be measured against the size of the larger box. Hence no box in D is well-separated from A, B, and C.

Our refined FMM circumvents this problem by incorporating the well-separation condition of BH into the Interaction-Pair procedure: if $b_1$ and $b_2$ are not well-separated, and $b_1$, the larger of the two, is a leaf-box, then we use a well-separation condition with respect to $b_2$, instead of to $b_1$, and apply the FLIP operation directly onto the particles in the leaf-box $b_1$ rather than onto $b_1$ itself.

We will define this new well-separation condition shortly. First, we make the following observation about the Interaction-Pair procedure defined in the last subsection. We can prove, by a simple induction, the following fact: if $b_1$ and $b_2$ are an interaction-pair and both $b_1$ and $b_2$ are not leaf-boxes, then $1/2 \le \mathrm{size}(b_1)/\mathrm{size}(b_2) \le 2$. This is precisely the condition that FMM would like to maintain. For uniformly distributed particles, such a condition is always true for any interaction-pair (even if one of them is a leaf-box). However, for non-uniformly distributed particles, if $b_1$, the larger box, is a leaf-box, then $b_1$ could be much larger than $b_2$.

The new $\beta$-well-separation condition, when $b_1$ is a leaf-box, is then defined as: $b_1$ and $b_2$ are $\beta$-well-separated if $b_2$ is well-separated from all particles of $b_1$ (as in BH). Notice, however, that with the new condition we can no longer FLIP the multipole expansion of $b_1$ to a Taylor-expansion for $b_2$. Because $b_1$ has only a constant number of particles, we can directly evaluate the potential induced by these particles for $b_2$. This new condition makes the FLIP operation of this special class of interaction-pairs uni-directional: we only FLIP $b_2$ to $b_1$.

We can now describe the refined Interaction-Pair procedure, which uses the modified well-separation condition when one box is a leaf-box.

Procedure Refined Interaction-Pair ($b_1$, $b_2$)

• If $b_1$ and $b_2$ are $\beta$-well-separated and $1/2 \le \mathrm{size}(b_1)/\mathrm{size}(b_2) \le 2$, then $(b_1, b_2)$ is a bi-directional interaction-pair.

• Else, if the larger box, without loss of generality $b_1$, is a leaf-box, then the well-separation condition becomes: $b_2$ is well-separated from all particles of $b_1$. If this condition is true, then $(b_1, b_2)$ is a uni-directional interaction-pair from $b_2$ to $b_1$.

• Else, if both $b_1$ and $b_2$ are leaf-boxes, then particles in $b_1$ and $b_2$ are near-field particles.

• Else, if both $b_1$ and $b_2$ are not leaf-boxes, then, without loss of generality assuming that $b_2$ is at least as large as $b_1$ and letting $c_1, \ldots, c_{2^d}$ be the child-boxes of $b_2$, recursively decide interaction-pairs by calling Interaction-Pair($b_1$, $c_i$) for all $1 \le i \le 2^d$.

• Else, if one of $b_1$ and $b_2$ is a leaf-box, then, without loss of generality assuming that $b_1$ is a leaf-box and letting $c_1, \ldots, c_{2^d}$ be the child-boxes of $b_2$, recursively decide interaction-pairs by calling Interaction-Pair($b_1$, $c_i$) for all $1 \le i \le 2^d$.

Let $c_1, \ldots, c_{2^d}$ be the set of child-boxes of the root-box of the hierarchical tree. Then the set of all interaction-pairs can be generated as the union of Refined-Interaction-Pair($c_i$, $c_j$) for all $1 \le i < j \le 2^d$.

The refined FMM for far-field calculation can then be defined as follows: for each bi-directional interaction-pair $(b_1, b_2)$, letting $\Phi_i^p()$ $(i = 1, 2)$ be the multipole expansion of $b_i$, flip $\Phi_1^p()$ to $b_2$ and add it to $b_2$'s potential Taylor-expansion. Similarly, flip $\Phi_2^p()$ to $b_1$ and add it to $b_1$'s potential Taylor-expansion. Then traverse down the hierarchical tree in a preordering, shifting and adding the potential Taylor-expansion of the parent box of a box to its own Taylor-expansion. For each uni-directional interaction-pair $(b_1, b_2)$ from $b_2$ to $b_1$, letting $\Phi_2^p()$ be the multipole expansion of $b_2$, evaluate $\Phi_2^p()$ directly for each particle in $b_1$ and add its potential.

Hierarchical Neighboring Graphs

Hierarchical methods (BH and FMM) explicitly use two graphs: the hierarchical tree, which connects each box to its parent box and each particle to its leaf-box, and the near-field graph, which connects each box with its near-field boxes. The hierarchical tree is generated and used in the first step to compute the multipole expansion induced by each box. We can use a bottom-up procedure to compute these multipole expansions: first compute the multipole expansions at the leaf-boxes, and then SHIFT the expansions to the parent boxes and on up the hierarchical tree until the multipole expansions for all boxes in the hierarchical tree are computed. The near-field graph can also be generated by the Refined-Interaction-Pair procedure. In Section 10.5.3, we will formally define the near-field graph.

Fast-Multipole Graphs (FM)

The Fast-Multipole graph, $FM^\beta$, models the communication pattern of the refined FMM. It is a graph defined on the set of boxes and particles in the hierarchical tree. Two boxes $b_1$ and $b_2$ are connected in $FM^\beta$ iff (1) $b_1$ is the parent box of $b_2$, or vice versa, in the hierarchical tree; or (2) $(b_1, b_2)$ is an interaction-pair generated by Refined-Interaction-Pair as defined in Section 10.5.2. The edge is bi-directional for a bi-directional interaction-pair and uni-directional for a uni-directional interaction-pair. Furthermore, each particle is connected with the box that contains it. The following lemma will be useful in the next section.

Lemma 10.5.1 The refined FMM flips the multipole expansion of $b_2$ to $b_1$ if and only if (1) $b_2$ is well-separated from $b_1$ and (2) neither the parent of $b_2$ is well-separated from $b_1$ nor $b_2$ is well-separated from the parent of $b_1$.

It can be shown that both the in- and out-degrees of $FM^\beta$ are small.

Barnes-Hut Graphs (BH)

BH defines two classes of communication graphs: $BH_S^\beta$ and $BH_P^\beta$. $BH_S^\beta$ models the sequential communication pattern and $BH_P^\beta$ is more suitable for parallel implementation. The letters $S$ and $P$, in $BH_S^\beta$ and $BH_P^\beta$ respectively, stand for "Sequential" and "Parallel".

We first define $BH_S^\beta$ and show why parallel computing requires a different communication graph $BH_P^\beta$ to reduce the total communication cost.

The graph $BH_S^\beta$ of a set of particles $P$ contains two sets of vertices: $P$, the particles, and $B$, the set of boxes in the hierarchical tree. The edge set of the graph $BH_S^\beta$ is defined by the communication pattern of the sequential BH. A particle $p$ is connected with a box $b$ if, in BH, we need to evaluate $p$ against $b$ to compute the force or potential exerted on $p$; the edge is directed from $b$ to $p$. Notice that if $p$ is connected with $b$, then $b$ must be well-separated from $p$. Moreover, the parent of $b$ is not well-separated from $p$. Therefore, if $p$ is connected with $b$ in $BH_S^\beta$, then $p$ is not connected to any box in the subtree of $b$ nor to any ancestor of $b$. In addition, each box is connected directly with its parent box in the hierarchical tree and each point $p$ is connected to its leaf-box. Both types of edges are bi-directional.

Lemma 10.5.2 Each particle is connected to at most $O(\log n + \mu)$ boxes. So the in-degree of $BH_S^\beta$ is bounded by $O(\log n + \mu)$.

Notice, however, that $BH_S^\beta$ is not suitable for parallel implementation. It has a large out-degree. This major drawback can be illustrated by the example of $n$ uniformly distributed particles in two dimensions. Assume we have four processors. Then the "best" way to partition the problem is to divide the root-box into four boxes and map each box onto a processor. Notice that in the direct parallel implementation of BH, as modeled by $BH_S^\beta$, each particle needs to access the information of at least one box in each of the other processors. Because each processor has $n/4$ particles, the total communication overhead is $\Omega(n)$, which is very expensive.

The main problem with $BH_S^\beta$ is that many particles from a processor need to access the information of the same box on some other processor (which contributes to the large out-degree). We show that a combination technique can be used to reduce the out-degree. The idea is to combine the "same" information from a box and send the information as one unit to another box on a processor that needs the information. We will show that this combination technique reduces the total communication cost to $O(\sqrt{n}\log n)$ for the four-processor example, and to $O(\sqrt{pn}\log n)$ for $p$ processors. Similarly, in three dimensions, the combination technique reduces the volume of messages from $\Omega(n \log n)$ to $O(p^{1/3}n^{2/3}(\log n)^{1/3})$.

We can define a graph $BH_P^\beta$ to model the communication and computation pattern that uses this combination technique. Our definition of $BH_P^\beta$ is inspired by the communication pattern of the refined FMM. It can be shown that the communication pattern of the refined FMM can be used to guide the message combination for the parallel implementation of the Barnes-Hut method!

The combination technique is based on the following observation: Suppose $p$ is well-separated from $b_1$ but not from the parent of $b_1$. Let $b$ be the largest box that contains $p$ such that $b$ is well-separated from $b_1$, using the well-separation definition in Section 10.5.2. If $b$ is not a leaf-box, then $(b, b_1)$ is a bi-directional interaction-pair in the refined FMM. If $b$ is a leaf-box, then $(b, b_1)$ is a uni-directional interaction-pair from $b_1$ to $b$. Hence $(b, b_1)$ is an edge of $FM^\beta$. Then, any other particle $q$ contained in $b$ is well-separated from $b_1$ as well. Hence we can combine the information from $b_1$ to $p$ and $q$ and all other particles in $b$ as follows: $b_1$ sends its information (just one copy) to $b$, and $b$ forwards the information down the hierarchical tree to both $p$ and $q$ and all other particles in $b$. This combination-based communication scheme defines a new communication graph $BH_P^\beta$ for parallel BH: the nodes of the graph are the union of particles and boxes, i.e., $P \cup B(P)$. Each particle is connected to the leaf-box it belongs to. Two boxes are connected iff they are connected in the Fast-Multipole graph. However, to model the communication cost, we must introduce a weight on each edge along the hierarchical tree embedded in $BH_P^\beta$, equal to the number of data units that need to be sent along that edge.

Lemma 10.5.3 The weight on each edge in $BH_P^\beta$ is at most $O(\log n + \mu)$.

It is worthwhile to point out the difference between the comparison and communication patterns in BH. In the sequential version of BH, if $p$ is connected with $b$, then we have to compare $p$ against all ancestors of $b$ in the computation. The procedure is to first compare $p$ with the root of the hierarchical tree, and then recursively move the comparison down the tree: if the current box compared is not well-separated from $p$, then we compare $p$ against all of its child-boxes. However, in terms of force and potential calculation, we only evaluate a particle against the first box down a path that is well-separated from the particle. The graphs $BH_S^\beta$ and $BH_P^\beta$ capture the communication pattern, rather than the comparison pattern. The communication is more essential to force and potential calculation. The construction of the communication graph has been one of the bottlenecks in load balancing BH and FMM on a parallel machine.

10.5.3 Near-Field Graphs

The near-field graph is defined over all leaf-boxes. A leaf-box $b_1$ is a near-field neighbor of a leaf-box $b$ if $b_1$ is not well-separated from some particles of $b$. Thus, FMM and BH directly compute the potential at particles in $b$ induced by particles of $b_1$.

There are two basic cases: (1) if $\mathrm{size}(b_1) \le \mathrm{size}(b)$, then we call $b_1$ a geometric near-field neighbor of $b$; (2) if $\mathrm{size}(b_1) > \mathrm{size}(b)$, then we call $b_1$ a hierarchical near-field neighbor of $b$. In the example of Section 10.5.2, A, B, and C are hierarchical near-field neighbors of all leaf-boxes in D, while A, B, and C have some geometric near-field neighbors in D.

We introduce some notation. The geometric in-degree of a box $b$ is the number of its geometric near-field neighbors. The geometric out-degree of a box $b$ is the number of boxes to which $b$ is a geometric near-field neighbor. The hierarchical in-degree of a box $b$ is the number of its hierarchical near-field neighbors. We will define the hierarchical out-degree of a box shortly. It can be shown that the geometric in-degree, geometric out-degree, and hierarchical in-degree are small. However, in the example of Section 10.5.2, A, B, and C are hierarchical near-field neighbors of all leaf-boxes in D. Hence the number of leaf-boxes to which a box is a hierarchical near-field neighbor could be very large. So the near-field graph defined above can have a very large out-degree.

We can use the combination technique to reduce the degree when a box $b$ is a hierarchical near-field neighbor of a box $b_1$. Let $b_2$ be the ancestor of $b_1$ of the same size as $b$. Instead of $b$ sending its information directly to $b_1$, $b$ sends it to $b_2$ and $b_2$ then forwards the information down the hierarchical tree. Notice that $b$ and $b_2$ are not well-separated. We will refer to this modified near-field graph simply as the near-field graph, denoted by $NF^\beta$. We also define the hierarchical out-degree of a box $b$ to be the number of edges from $b$ to the set of non-leaf-boxes constructed above. We can show that the hierarchical out-degree is also small.

To model the near-field communication, similarly to our approach for BH, we introduce a weight on the edges of the hierarchical tree.

Lemma 10.5.4 The weight on each edge in $NF^\beta$ is at most $O(\log n + \mu)$.

10.5.4 N-body Communication Graphs

By abusing notation, let $FM^\beta = FM^\beta \cup NF^\beta$ and $BH_P^\beta = BH_P^\beta \cup NF^\beta$. The communication graphs so defined simultaneously support near-field and far-field communication, as well as communication up and down the hierarchical tree. Hence, by partitioning and load balancing $FM^\beta$ and $BH_P^\beta$, we automatically partition and balance the hierarchical tree, the near-field graph, and the far-field graph.

10.5.5 Geometric Modeling of N-body Graphs

Similar to well-shaped meshes, there is a geometric characterization of N-body communication graphs. Instead of using neighborhood systems, we use box-systems. A box-system in $\mathbb{R}^d$ is a set $B = \{B_1, \ldots, B_n\}$ of boxes. Let $P = \{p_1, \ldots, p_n\}$ be the centers of the boxes, respectively. For each integer $k$, the set $B$ is a $k$-ply box-system if no point $p \in \mathbb{R}^d$ is contained in more than $k$ of $\mathrm{int}(B_1), \ldots, \mathrm{int}(B_n)$.

For example, the set of all leaf-boxes of a hierarchical tree forms a 1-ply box-system. The box-system is a variant of the neighborhood system of Miller, Teng, Thurston, and Vavasis [67], where a neighborhood system is a collection of Euclidean balls in $\mathbb{R}^d$. We can show that box-systems can be used to model the communication graphs for parallel adaptive N-body simulation.

Given a box-system, it is possible to define the overlap graph associated with the system:

Definition 10.5.1 Let $\alpha \ge 1$ be given, and let $\{B_1, \ldots, B_n\}$ be a $k$-ply box-system. The $\alpha$-overlap graph for this box-system is the undirected graph with vertices $V = \{1, \ldots, n\}$ and edges
$$E = \{(i,j) : B_i \cap (\alpha \cdot B_j) \ne \emptyset \text{ and } (\alpha \cdot B_i) \cap B_j \ne \emptyset\}.$$

The edge condition is equivalent to: $(i,j) \in E$ iff the $\alpha$ dilation of the smaller box touches the larger box. As shown in [96], the partitioning algorithm and theorem of Miller et al. can be extended to overlap graphs on box-systems.

Theorem 10.5.1 Let $G$ be an $\alpha$-overlap graph over a $k$-ply box-system in $\mathbb{R}^d$. Then $G$ can be partitioned into two equal sized subgraphs by removing at most $O(\alpha k^{1/d} n^{1-1/d})$ vertices. Moreover, such a partitioning can be computed in linear time sequentially and in parallel $O(n/p)$ time with $p$ processors.

The key observation is the following theorem.

Theorem 10.5.2 Let $P = \{p_1, \ldots, p_n\}$ be a point set in $\mathbb{R}^d$ that is $\mu$-non-uniform. Then the set of boxes $B(P)$ of the hierarchical tree of $P$ is a $(\log_{2^d} n + \mu)$-ply box-system, and $FM^\beta(P)$ and $BH_P^\beta(P)$ are subgraphs of the $3\beta$-overlap graph of $B(P)$.

Therefore,

Theorem 10.5.3 Let $G$ be an N-body communication graph (either for BH or FMM) of a set of particles located at $P = \{p_1, \ldots, p_n\}$ in $\mathbb{R}^d$ ($d = 2$ or $3$). If $P$ is $\mu$-non-uniform, then $G$ can be partitioned into two equal sized subgraphs by removing at most $O(n^{1-1/d}(\log n + \mu)^{1/d})$ nodes. Moreover, such a partitioning can be computed in linear time sequentially and in parallel $O(n/p)$ time with $p$ processors.

Lecture 11

Mesh Generation

An essential step in scientific computing is to find a proper discretization of a continuous domain. This is the problem of mesh generation. Once we have a discretization, or as we sometimes say a "mesh", differential equations for flow, waves, and heat distribution are then approximated by finite difference or finite element formulations. However, not all meshes are equally good numerically. Discretization errors depend on the geometric shape and size of the elements, while the computational complexity for finding the numerical solution depends on the number of elements in the mesh and often the overall geometric quality of the mesh as well.

The most general and versatile mesh is an unstructured triangular mesh. Such a mesh is simply a triangulation of the input domain (e.g., a polygon), along with some extra vertices, called Steiner points. A triangulation is a decomposition of a space into a collection of interior disjoint simplices so that two simplices can only intersect at a lower dimensional simplex. We all know that in two dimensions a simplex is a triangle and in three dimensions a simplex is a tetrahedron. A triangulation can be obtained by triangulating a point set that forms the vertices of the triangulation.

Even among triangulations some are better than others. Numerical errors depend on the quality of the triangulation, meaning the shapes and sizes of the triangles. In order for a mesh to be useful in approximating partial differential equations, it is necessary that discrete functions generated from the mesh (such as the piecewise linear functions) be capable of approximating the solutions sought. Classical finite element theory [18] shows that a sufficient condition for optimal approximation results to hold is that the minimum angle or the aspect ratio of each simplex in the triangulation be bounded independently of the mesh used; however, Babuska [4] shows that while this is sufficient, it is not a necessary condition. See Figure ?? for a triangulation whose minimum angle is at least 20 degrees.

In summary, the properties of a good mesh are:

• Fills the space

• Non-overlapping ($\sum$ areas = total area)

• Conforming mesh (every edge shared by exactly 2 triangles)

• High quality elements (approximately equilateral triangles)

Automatic mesh generation is a relatively new field. No mathematically sound procedure for obtaining the "best" mesh distribution is available; the criteria are usually heuristic.

The input description of the physical domain has two components: the geometric definition of the domain and the numerical requirements within the domain. The geometric definition provides the boundary of the domain either in the form of a continuous model or of a discretized boundary model.


boundary of the domain, either in the form of a continuous model or of a discretized boundary model. Numerical requirements within the domain are typically obtained from an initial numerical simulation on a preliminary set of points. The numerical requirements obtained from the initial point set define an additional local spacing function restricting the final point set. An automatic mesh generator tries to add points in the interior and on the boundary of the domain so as to smooth out the mesh and concentrate mesh density where necessary, while optimizing the total number of mesh points.

Figure 11.1: A well-shaped triangulation

11.1 How to Describe a Domain?

The most intuitive and obvious structure of a domain (for modeling a scientific problem) is its geometry.

One way to describe the geometry is with a constructive solid geometry formula. In this approach, we have a set of basic geometric primitive shapes, such as boxes, spheres, half-spaces, triangles, tetrahedra, ellipsoids, polygons, etc. We then define (or approximate) a domain as finite unions, intersections, differences, and complements of primitive shapes, i.e., by a well-structured formula of finite length over the primitive shapes with operators that include union, intersection, difference, and complementation.

An alternative is to discretize the boundary of the domain and describe the domain as a polygonal (polyhedral) object, perhaps with holes. Often we convert the constructive solid geometry formula into the discretized boundary description for mesh generation.

For many computational applications, other information about the domain and the problem is equally important for quality mesh generation. The numerical spacing function, typically denoted by h(x), is usually defined at a point x by the eigenvalues of the Hessian matrix of the solution u to the governing partial differential equations (PDEs) [4, 95, 69]. Locally, u behaves like a quadratic function

u(x + dx) = u(x) + ∇u(x) · dx + (1/2) dx H dx^T,

where H is the Hessian matrix of u, the matrix of its second partial derivatives. The spacing of mesh points required by the accuracy of the discretization at a point x is denoted by h(x) and should depend on the reciprocal of the square root of the largest eigenvalue of H at x.
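As a small illustration of this rule (a sketch that is not part of the original notes; the Hessian below is made up), the required spacing at a point can be read off directly from the eigenvalues of H in MATLAB:

H = [4 1; 1 2];                 % hypothetical Hessian of u at some point x
h_x = 1 / sqrt(max(eig(H)))     % required spacing h(x) near x, up to a constant factor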

Figure 11.2: Triangulation of well-spaced point set around a singularity

When solving a PDE numerically, we estimate the eigenvalues of the Hessian at a certain set of points in the domain, based on the numerical approximation from the previous iteration [4, 95]. We then expand the spacing requirement induced by the Hessian at these points over the entire domain. For a problem with a smooth change in the solution, we can use a (more-or-less) uniform mesh where all elements are of roughly equal size. On the other hand, for problems with rapid changes in the solution, such as earthquake, wave, or shock modeling, we may use much denser gridding in the areas of high intensity; see Figure 11.2. So the information about the solution structure can be of great value for quality mesh generation.

Other types of information may arise during the process of solving a simulation problem. For example, in adaptive methods, we may start with a coarse, uniform grid. We then estimate the error of the previous step and, based on the error bound, adaptively refine the mesh, e.g., make the areas with larger error much denser for the next step of the calculation. As we shall argue later, unstructured mesh generation is more about finding the proper distribution of mesh points than about the discretization itself (this is a very personal opinion).

11.2 Types of Meshes

• Structured grids divide the domain into regular grid squares; an example is finite difference gridding on a square grid. Matrices based on structured grids are very easy and fast to assemble (see the small example after this list). Structured grids are easy to generate, and numerical formulation and solution based on structured grids are also relatively simple.

• Unstructured grids decompose the domain into simple mesh elements such as simplices, based on a density function that is defined by the input geometry or by the numerical requirements (e.g., from error estimation). The associated matrices are harder and slower to assemble than for structured grids, and the resulting linear systems are also relatively hard to solve. Most finite element meshes used in practice are of the unstructured type.

• Hybrid grids are generated by first decomposing the domain into non-regular subdomains and then decomposing each subdomain with a regular grid. Hybrid grids are often used in domain decomposition.
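As a small illustration of how easily structured-grid matrices are assembled (mentioned in the first item above), the sketch below uses MATLAB's built-in numgrid and delsq functions, which are not referenced in the original notes, to build the standard 5-point finite difference Laplacian on a square grid:

G = numgrid('S', 32);    % number the interior points of a 32-by-32 square grid
A = delsq(G);            % assemble the sparse 5-point finite difference Laplacian
spy(A)                   % the regular sparsity pattern reflects the structured grid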

Structured grids are much easier to generate and manipulate, and the numerical theory of this discretization is better understood. However, their applicability is limited to problems with simple domains and smooth changes in the solution. For problems with complex geometry whose solution changes rapidly, we need to use an unstructured mesh to reduce the problem size. For example,

when modeling an earthquake we want a dense discretization near the quake center and a sparse discretization in the regions with low activity. It would be wasteful to give regions with low activity as fine a discretization as the regions with high activity. Unstructured meshes are especially important for three-dimensional problems. The adaptability of unstructured meshes comes with new challenges, especially in 3D: the numerical theory becomes more difficult (an outstanding direction for future research), and the algorithmic design becomes much harder.

Figure 11.3: A quadtree

11.3 Refinement Methods

A mesh generator usually does two things: (1) it generates a set of points that satisfies both the geometric and the numerical conditions imposed on the physical domain; (2) it builds a robust and well-shaped mesh over this point set, e.g., a triangulation of the point set. Most mesh generation algorithms merge the two functions and generate the point set implicitly as part of the mesh generation phase. A most useful technique is to generate the point set and its discretization by iterative refinement. We now discuss hierarchical refinement and Delaunay refinement, two of the most commonly used refinement methods.

11.3.1 Hierarchical Refinement

Hierarchical refinement uses quadtrees in two dimensions and octrees in three dimensions. The basic philosophy of using quad- and octrees in mesh refinement and in hierarchical N-body simulation is the same: adaptively refining the domain by selectively and recursively dividing boxes enables us to achieve numerical accuracy with a close-to-optimal discretization. The definition of quad- and octrees can be found in Chapter ??. Figure 11.3 shows a quadtree.

In quadtree refinement of an input domain, we start with a square box encompassing the domain and then adaptively split boxes into four boxes, recursively, until each box is small enough with respect to the geometric and numerical requirements. This step is very similar to the quadtree decomposition for N-body simulation. However, in mesh generation we also need to ensure that the mesh is well-shaped; this requirement makes mesh generation different from hierarchical N-body approximation. In mesh generation, we need to generate a smoothly distributed set of points. In the context of quadtree refinement, this means that we need to balance the quadtree in the sense that no leaf box is adjacent to a leaf box of more than twice its side length.


Figure 11.4: A well-shaped mesh generated by quad-tree refinement

With adaptive hierarchical trees, we can "optimally" approximate any geometric and numerical spacing function. The proof of optimality can be found in the papers of Bern, Eppstein, and Gilbert [6] for 2D and of Mitchell and Vavasis [70] for 3D. A formal discussion of numerical and geometric spacing functions can be found in the point-generation paper of Miller, Talmor, and Teng. The following procedure describes the basic steps of hierarchical refinement; a small code sketch follows the list.

1. Construct the hierarchical tree for the domain so that the leaf boxes approximate the numerical and geometric spacing functions.

2. Balance the hierarchical tree.

3. Warping and triangulation: if a point is too close to a boundary of its leaf box, then one of the corners of the box collapses to that point; the resulting cells are then triangulated.
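The MATLAB function below is a minimal sketch of step 1 only (it is illustrative and not part of the original notes; the function name, the one-point-per-box stopping rule, and the minimum side length hmin are all assumptions): it recursively splits a box into four children until each leaf box contains at most one input point or reaches the minimum side length. Balancing (step 2) and warping (step 3) are omitted.

function boxes = refine_box(x0, y0, s, pts, hmin)
% Adaptive quadtree refinement sketch: returns the leaf boxes as rows [x0 y0 s],
% where (x0, y0) is the lower-left corner and s the side length.
  in = pts(:,1) >= x0 & pts(:,1) < x0 + s & pts(:,2) >= y0 & pts(:,2) < y0 + s;
  if nnz(in) <= 1 || s <= hmin          % box is fine enough: keep it as a leaf
      boxes = [x0 y0 s];
      return
  end
  h = s / 2;                            % otherwise split into four children
  boxes = [refine_box(x0,     y0,     h, pts(in,:), hmin);
           refine_box(x0 + h, y0,     h, pts(in,:), hmin);
           refine_box(x0,     y0 + h, h, pts(in,:), hmin);
           refine_box(x0 + h, y0 + h, h, pts(in,:), hmin)];
end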

11.3.2 Delaunay Triangulation

Suppose P = {p_1, . . . , p_n} is a point set in d dimensions. The convex hull of d + 1 affinely independent points from P forms a Delaunay simplex if the circumscribed ball of the simplex contains no point of P in its interior. The union of all Delaunay simplices forms the Delaunay diagram, DT(P). If the set P is not degenerate, then DT(P) is a simplicial decomposition of the convex hull of P. We will sometimes refer to a Delaunay simplex as a triangle or a tetrahedron.

Associated with DT(P) is a collection of balls, called Delaunay balls, one for each cell in DT(P); the Delaunay ball circumscribes its cell. When the points in P are in general position, each Delaunay simplex defines a Delaunay ball, its circumscribed ball. By definition, no point of P lies in the interior of a Delaunay ball. We denote the set of all Delaunay balls of P by DB(P).

The geometric dual of the Delaunay diagram is the Voronoi diagram, which consists of a set of polyhedra V_1, . . . , V_n, one for each point in P, called the Voronoi polyhedra. Geometrically, V_i is the set of points p ∈ IR^d whose Euclidean distance to p_i is less than or equal to that to any other point in P. We call p_i the center of the polyhedron V_i. For more discussion, see [78, 31].

The Delaunay triangulation has some very desirable properties for mesh generation. For example, among all triangulations of a point set in 2D, the DT maximizes the smallest angle, and it contains the nearest-neighbor graph and the minimum spanning tree. Thus the Delaunay triangulation is very useful for computer graphics and mesh generation in two dimensions. Moreover, discrete maximum principles will only exist for Delaunay triangulations. Chew [17] and Ruppert [84] have developed Delaunay refinement algorithms that generate provably good meshes for 2D domains.

Notice that an internal diagonal belongs to the Delaunay triangulation of four points if the sum of the two opposing angles is less than π. A 2D Delaunay triangulation can be found by the following simple FLIP algorithm:

• Find any triangulation T (this can be done in O(n lg n) time using divide and conquer).

• For each internal edge pq, let the two faces sharing the edge be prq and psq. The edge pq is not a locally Delaunay edge if the interior of the circumscribed circle of prq contains s. Interestingly, this condition also means that the interior of the circumscribed circle of psq contains r and that the sum of the angles at r and s is greater than π. We call the condition that this angle sum is no more than π the angle condition. If pq does not satisfy the angle condition, we flip it: remove edge pq from T and put in edge rs. Repeat this until all edges satisfy the angle condition. (A small in-circle test implementing the flip condition is sketched after this list.)
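The flip condition above is the classical in-circle test. The fragment below is a sketch, not part of the original notes; it assumes p, q, r, s are 1-by-2 row vectors of coordinates with p, r, q given in counterclockwise order.

% Edge pq is shared by triangles prq and psq.  With p, r, q in counterclockwise
% order, s lies strictly inside the circumcircle of prq exactly when det(M) > 0;
% in that case pq violates the angle condition and is flipped to rs.
M = [p - s, sum((p - s).^2);
     r - s, sum((r - s).^2);
     q - s, sum((q - s).^2)];
flip_needed = det(M) > 0;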

It is not too hard to show that if FLIP terminates, it outputs a Delaunay triangulation. A little additional geometric effort shows that the FLIP procedure above, fortunately, always terminates after at most O(n²) flips.

The following is an interesting observation of Guibas, Knuth, and Sharir [45]. Choose a random permutation π of {1, . . . , n} and reorder the points as p_{π(1)}, . . . , p_{π(n)}. We then incrementally insert the points into the current triangulation and perform flips as needed; the initial triangulation is the triangle formed by the first three points. It can be shown that the expected number of flips of this algorithm is O(n log n). This gives a randomized O(n log n)-time algorithm for computing the Delaunay triangulation.

11.3.3 Delaunay Refinement

Even though the Delaunay triangulation maximizes the smallest angle, the Delaunay triangulations of most point sets are still bad in the sense that some of the triangles are too "skinny". In this case, Chew and Ruppert observed that we can refine the Delaunay triangulation to improve its quality. The idea was first proposed by Paul Chew [17]; in 1992, Jim Ruppert [84] gave a procedure with a quality guarantee.

• Put a point at the circumcenter of the skinny triangle.

• If the circumcenter encroaches upon an edge of an input segment, split that edge by adding its midpoint; otherwise add the circumcenter.

• Update the Delaunay triangulation by flipping.

A point encroaches on an edge if the point is contained in the interior of the circle of which the edge is a diameter (a one-line test for this condition is sketched after the refinement loop below). We can now define two operations, Split-Triangle and Split-Segment.

Split-Triangle(T)

• Add the circumcenter C of triangle T

• Update the Delaunay triangulation DT(P ∪ C)

Split-Segment(S)

• Add the midpoint m of S

• Update the Delaunay triangulation DT(P ∪ m)

The Delaunay Refinement algorithm then becomes

• Initialize:

  – Perform a Delaunay triangulation on P.
  – IF some input segment l is not an edge of the Delaunay triangulation of P, THEN Split-Segment(l).

• Repeat the following until every angle α > 25°:

  – IF T is a skinny triangle, try to split it:
  – IF the circumcenter C of T is "close" to (encroaches on) some input segment S, THEN Split-Segment(S);
  – ELSE Split-Triangle(T).
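The encroachment test used in the loop has a particularly simple form. By Thales' theorem, a point c lies strictly inside the circle that has segment ab as a diameter exactly when the angle acb is obtuse, which gives the one-line MATLAB check below (a sketch, not from the original notes; a, b, c are assumed to be coordinate row vectors):

% c encroaches on segment ab iff it lies inside the diametral circle of ab,
% i.e. iff the vectors from c to the two endpoints make an obtuse angle.
encroaches = dot(a - c, b - c) < 0;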

The following theorem was then stated without proof; the proof was first given by Jim Ruppert in his Ph.D. thesis at UC Berkeley.

Theorem 11.3.1 Not only does Delaunay refinement produce a mesh in which every triangle has minimum angle α > 25°, but the size of the mesh it produces is no more than C ∗ Optimal(size) for some constant C.

11.4 Working With Meshes

• Smoothing: move individual points to improve the mesh.

• Laplacian smoothing: move every point towards the center of mass of its neighbors (a small sketch of one smoothing pass follows this list).

• Refinement: we have a mesh and want smaller elements in a certain region. One option is to put circumscribed triangles inside the triangles you are refining; another approach is to split the longest edge of the triangles. However, you have to be careful about hanging nodes while doing this.
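A minimal sketch of one Laplacian smoothing pass is shown below. It is illustrative only and not from the original notes; pts (an n-by-2 coordinate array), tri (an m-by-3 array of triangle vertex indices), and boundary (the indices of vertices to hold fixed) are assumed names.

n = size(pts, 1);
i = tri(:, [1 2 3]);  j = tri(:, [2 3 1]);      % the three edges of every triangle
A = sparse([i(:); j(:)], [j(:); i(:)], ones(6 * size(tri, 1), 1), n, n);
A = spones(A);                                  % 0/1 vertex adjacency matrix
deg = full(sum(A, 2));                          % number of neighbors of each vertex
newpts = (A * pts) ./ [deg deg];                % move each vertex to the centroid of its neighbors
newpts(boundary, :) = pts(boundary, :);         % hold boundary vertices fixed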

11.5 Unsolved Problems

There is still no good general algorithm for generating meshes of arbitrary shapes. There has been a fair amount of research, and this is an important topic, but at the current time the algorithms listed above are merely best efforts and require a lot of work and/or customization to produce good meshes. A good algorithm would fill the shape with quadrilaterals and run without any user input beyond the geometry to be meshed. The holy grail of meshing remains filling a 3-D shape with 3-D blocks.

Lecture 12

Support Vector Machines and Singular Value Decomposition

Andrew Hogue
April 29, 2004

12.1 Support Vector Machines

Support Vector Machines, or SVMs, have emerged as an increasingly popular algorithm for effectively learning a method to categorize objects based on a set of properties. SVMs were originally introduced by Vapnik [97], although his work was in Russian and was not translated to English for several years. Recently, more attention has been paid to them, and there are several excellent books and tutorials [14, 51, 98, 99]. We first develop the notion of learning a categorization function, then describe the SVM algorithm in more detail, and finally provide several examples of its use.

12.1.1 Learning Models

In traditional programming, a function is given a value x as input and computes a value y = h(x) as its output. In this case, the programmer has intimate knowledge of the function h(x), which he explicitly defines in his code.

An alternative approach is to learn the function h(x). There are two methods to generate this function. In unsupervised learning, the program is given a set of objects x and is asked to provide a function h(x) to categorize them without any a priori knowledge about the objects.

Support Vector Machines, on the other hand, represent one approach to supervised learning. In supervised learning, the program is given a training set of pairs (x, y), where x are the objects being classified and y are a matching set of categories, one for each x_i. For example, an SVM attempting to categorize handwriting samples might be given a training set which maps several samples to the correct character. From these samples, the supervised learner induces a function y = h(x) which attempts to return the correct category for each sample. In the future, new samples may be categorized by feeding them into the pre-learned h(x).

In learning problems, it is important to differentiate between the types of possible output of the function y = h(x). There are three main types:

Binary Classification: h(x) = y ∈ {−1, +1} (the learner classifies objects into one of two groups).

Figure 12.1: An example of a set of linearly separable objects.

Discrete Classification: h(x) = y ∈ {1, 2, 3, . . . , n} (the learner classifies objects into one of n groups).

Continuous Classification: h(x) = y ∈ IR^n (the learner classifies objects in the space of n-dimensional real numbers).

As an example, to learn a linear regression, x could be a collection of data points; the y = h(x) that is learned is then the best-fit line through the data points.

All learning methods must also beware of overspecifying the learned function. Given a set of training examples, it is often inappropriate to learn a function that fits these examples too well, as the examples often include noise. An overspecified function will often do a poor job of classifying new data.

12.1.2 Developing SVMs

A Support Vector Machine attempts to find a linear classification of a set of data. In d dimensions this classification is represented by a separating hyperplane. An example of a separating hyperplane is shown in Figure 12.1. Note that the points in Figure 12.1 are linearly separable; that is, there exists a hyperplane in IR² such that all oranges are on one side of the hyperplane and all apples are on the other.

In the case of a linearly separable training set, the perceptron model [80] is useful for finding a linear classifier. In this case, we wish to find h(x) = w^T x + b subject to the constraints that h(x_1) ≤ −1 and h(x_2) ≥ +1, where x_1 are the objects in category 1 and x_2 are the objects in category 2. For example, in Figure 12.1, we would wish to satisfy:

h(orange) ≤ −1
h(apple) ≥ +1

The SVM method for finding the best separating hyperplane is to solve the following quadratic program:

min_{w,b}  ||w||²/2    subject to                      (12.1)

y_i (w^T x_i + b) ≥ 1,  i = 1, . . . , n               (12.2)

Figure 12.2: An example of a set of objects which are not linearly separable.

This method works well for training sets which are linearly separable. However, there are also many cases where the training set is not linearly separable; an example is shown in Figure 12.2. In this case, it is impossible to find a separating hyperplane between the apples and oranges. There are several methods for dealing with this case. One method is to add slack variables ε_i to the program:

min_{w,b,ε}  ||w||²/2 + c Σ_i ε_i    subject to

y_i (w^T x_i + b) ≥ 1 − ε_i,  i = 1, . . . , n
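As a concrete illustration, this soft-margin primal can be handed directly to a quadratic programming solver. The sketch below is not from the original notes: it assumes MATLAB's quadprog (Optimization Toolbox), an n-by-d data matrix X, and an n-by-1 label vector y with entries ±1.

[n, d] = size(X);
c = 1;                                          % penalty on the slack variables
H = blkdiag(eye(d), zeros(n + 1));              % quadratic term: (1/2)||w||^2 only
f = [zeros(d + 1, 1); c * ones(n, 1)];          % linear term: c * sum_i eps_i
A = -[repmat(y, 1, d) .* X, y, eye(n)];         % y_i (w'x_i + b) >= 1 - eps_i  as  A*z <= b
b = -ones(n, 1);
lb = [-inf(d + 1, 1); zeros(n, 1)];             % slack variables must be nonnegative
z = quadprog(H, f, A, b, [], [], lb);           % z = [w; b; eps]
w = z(1:d);  b0 = z(d + 1);  slack = z(d + 2:end);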

Slack variables allow some of the xi to “move” in space to “slip” onto the correct side of the separating hyperplane. For instance, in Figure 12.2, the apples on the left side of the figure could have associated εi which allow them to move to the right and into the correct category. Another approach is to non-linearly distort space around the training set using a function Φ(xi):

min_{w,b}  ||w||²/2    subject to

y_i (w^T Φ(x_i) + b) ≥ 1,  i = 1, . . . , n

In many cases, this distortion moves the objects into a configuration that is more easily separated by a hyperplane. As mentioned above, one must be careful not to overspecify Φ(x_i), as it could create a function that is unable to cope easily with new data.

Another way to approach this problem is through the dual of the quadratic program shown in Equations (12.1) and (12.2) above. If we consider those equations to be the primal, the dual is:

max_α  α^T 1 − (1/2) α^T H α    subject to

y^T α = 0
α ≥ 0

Note that we have introduced Lagrange Multipliers αi for the dual problem. At optimality, we have

w = Σ_i y_i α_i x_i

Figure 12.3: An example of classifying handwritten numerals. The graphs on the right show the probabilities of each sample being classified as each number, 0-9.

This implies that we may find a separating hyperplane using

H_ij = y_i (x_i^T x_j) y_j

This dual problem also applies to the slack-variable version, using the constraints

y^T α = 0
c ≥ α ≥ 0

as well as to the distorted-space version:

H_ij = y_i (Φ(x_i)^T Φ(x_j)) y_j
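A matching sketch for the dual (again hypothetical, reusing the X, y, c, and n names from the primal sketch above together with MATLAB's quadprog) shows that the data enter only through the matrix H:

K = X * X';                                     % linear kernel; replace by Phi(X)*Phi(X)' for the distorted version
H = (y * y') .* K;                              % H_ij = y_i (x_i' x_j) y_j
f = -ones(n, 1);                                % quadprog minimizes, so negate alpha'*1
alpha = quadprog(H, f, [], [], y', 0, zeros(n, 1), c * ones(n, 1));
w = X' * (alpha .* y);                          % recover w from the optimality condition (linear kernel only)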

12.1.3 Applications

The type of classification provided by Support Vector Machines is useful in many applications. For example, the post office must sort hundreds of thousands of hand-written envelopes every day. To aid in this process, they make extensive use of handwriting recognition software which uses SVMs to automatically decipher handwritten numerals. An example of this classification is shown in Figure 12.3.

Military uses for SVMs also abound. The ability to quickly and accurately classify objects in a noisy visual field is essential to many military operations. For instance, SVMs have been used to identify humans or artillery against the backdrop of a crowded forest of trees.

12.2 Singular Value Decomposition

Many methods have been given to decompose a matrix into more useful elements. Recently, one of the most popular has been the singular value decomposition, or SVD. This decomposition has been


known since the early 19th century [94].

Figure 12.4: The deformation of a circle using a matrix A.

The SVD has become popular for several reasons. First, it is stable (that is, small perturbations in the input A result in small perturbations in the singular matrix Σ, and vice versa). Second, the singular values σ_i provide an easy way to approximate A. Finally, there exist fast, stable algorithms to compute the SVD [42]. The SVD is defined by

A = U Σ V^T

Both U and V are orthogonal matrices; that is, U^T U = I and V^T V = I. Σ is the singular matrix: it is zero except on the diagonal, whose entries are the singular values σ_i:

Σ = diag(σ_1, σ_2, . . . , σ_n)

There are several interesting facts associated with the SVD of a matrix. First, the SVD is well-defined for any matrix A of size m × n, even for m ≠ n. Geometrically, if a matrix A is applied to the unit hypersphere in n dimensions, it deforms it into a hyperellipse; the lengths of the semi-axes of the new hyperellipse are the singular values σ_i. An example is shown in Figure 12.4. The singular values σ_i of a matrix also have a close relation to its eigenvalues λ_i. The following table enumerates some of these relations:

Matrix Type                    Relationship
Symmetric Positive Definite    σ_i = λ_i
Symmetric                      σ_i = |λ_i|
General Case                   σ_i² = i-th eigenvalue of A^T A
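These relations are easy to check numerically. The snippet below (illustrative only, using a random matrix) verifies the general case:

>> A = randn(5, 3);
>> s = svd(A);                             % singular values, in decreasing order
>> lambda = sort(eig(A' * A), 'descend');  % eigenvalues of A'A, in the same order
>> norm(s.^2 - lambda)                     % agrees to roundoff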

These relationships often make it more useful, as well as more efficient, to work with the singular value decomposition of a matrix rather than to compute A^T A, which is an expensive operation.

Figure 12.5: The approximation of an image using its SVD.

The SVD may also be used to approximate a matrix A with n singular values:

A = σ_1 u_1 v_1^T + · · · + σ_n u_n v_n^T ≈ σ_1 u_1 v_1^T + · · · + σ_p u_p v_p^T,   p < n

where u_i is the i-th column of U and v_i is the i-th column of V. This is also known as the "rank-p approximation of A in the 2-norm or F-norm."

This approximation has an interesting application to image compression. By taking an image as a matrix of pixel values, we may find its SVD; the rank-p approximation of the image is then a compression of the image. For example, James Demmel approximated an image of his daughter for the cover of his book Applied Numerical Linear Algebra, shown in Figure 12.5. Note that successive approximations create horizontal and vertical "streaks" in the image. The following MATLAB code will load an image of a clown and display its rank-p approximation:

>> load clown;                                    % built-in demo data: image matrix X and colormap map
>> image(X);
>> colormap(map);
>> [U,S,V] = svd(X);
>> p=1; image(U(:,1:p)*S(1:p,1:p)*V(:,1:p)');     % display the rank-p approximation

The SVD may also be used to perform latent semantic indexing, or clustering of documents based on the words they contain. We build a matrix A which indexes the documents along one axis and the words along the other: A_ij = 1 if word j appears in document i, and 0 otherwise. By taking the SVD of A, we can use the singular vectors to represent the "best" subset of documents for each cluster.

Finally, the SVD has an interesting application when using the FFT matrix for parallel computations. Taking the SVD of one-half of the FFT matrix results in singular values that are approximately one-half zeros. Similarly, taking the SVD of one-quarter of the FFT matrix results in singular values that are approximately one-quarter zeros. One can see this phenomenon with the following MATLAB code:

>> f = fft(eye(100));        % the 100-by-100 FFT matrix
>> g = f(1:50,51:100);       % an off-diagonal block of the FFT matrix
>> plot(svd(g),'*');         % many of its singular values are nearly zero

These near-zero values provide an opportunity for compression when communicating parts of the FFT matrix across processors [30].

Bibliography

[1] N. Alon, P. Seymour, and R. Thomas. A separator theorem for non-planar graphs. In Proceedings of the 22nd Annual ACM Symposium on Theory of Computing, Maryland, May 1990. ACM.

[2] C. R. Anderson. An implementation of the fast multipole method without multipoles. SIAM J. Sci. Stat. Comp., 13(4):932–947, July 1992.

[3] A. W. Appel. An efficient program for many-body simulation. SIAM J. Sci. Stat. Comput., 6(1):85–103, 1985.

[4] I. Babuška and A. K. Aziz. On the angle condition in the finite element method. SIAM J. Numer. Anal., 13(2):214–226, 1976.

[5] J. Barnes and P. Hut. A hierarchical O(n log n) force calculation algorithm. Nature, 324 (1986), pp. 446–449.

[6] M. Bern, D. Eppstein, and J. R. Gilbert. Provably good mesh generation. J. Comp. Sys. Sci. 48 (1994) 384–409.

[7] M. Bern and D. Eppstein. Mesh generation and optimal triangulation. In Computing in Euclidean Geometry, D.-Z. Du and F.K. Hwang, eds. World Scientific (1992) 23–90.

[8] M. Bern, D. Eppstein, and S.-H. Teng. Parallel construction of quadtrees and quality tri- angulations. In Workshop on Algorithms and Data Structures, Springer LNCS 709, pages 188–199, 1993.

[9] G. Birkhoff and A. George. Elimination by nested dissection. Complexity of Sequential and Parallel Numerical Algorithms, J. F. Traub, Academic Press, 1973.

[10] P. E. Bjørstad and O. B. Widlund. Iterative methods for the solution of elliptic problems on regions partitioned into substructures. SIAM J. Numer. Anal., 23:1097-1120, 1986.

[11] G. E. Blelloch. Vector Models for Data-Parallel Computing. MIT-Press, Cambridge MA, 1990.

[12] J. A. Board, Z. S. Hakura, W. D. Elliott, and W. T. Ranklin. Scalable variants of multipole-based algorithms for molecular dynamics applications. In Parallel Processing for Scientific Computing, pages 295–300. SIAM, 1995.

[13] R. Bryant, Bit-level analysis of an SRT circuit, preprint, CMU (See http://www.cs.cmu.edu:8001/afs/cs.cmu.edu/user/bryant/www/home.html )


[14] C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

[15] V. Carpenter, compiler, http://vinny.csd.my.edu/pentium.html.

[16] T. F. Chan and D. C. Resasco. A framework for the analysis and construction of domain decomposition preconditioners. UCLA-CAM-87-09, 1987.

[17] L. P. Chew. Guaranteed-quality triangular meshes. TR-89-983, Cornell, 1989.

[18] P. G. Ciarlet. The Finite Element Method for Elliptic Problems. North–Holland, 1978.

[19] K. Clarkson, D. Eppstein, G. L. Miller, C. Sturtivant, and S.-H. Teng. Approximating center points with and without linear programming. In Proceedings of 9th ACM Symposium on Computational Geometry, pages 91–98, 1993.

[20] T. Coe, Inside the Pentium FDIV bug, Dr. Dobb’s Journal 20 (April, 1995), pp 129–135.

[21] T. Coe, T. Mathisen, C. Moler, and V. Pratt, Computational aspects of the Pentium affair, IEEE Computational Science and Engineering 2 (Spring 1995), pp 18–31.

[22] T. Coe and P. T. P. Tang, It takes six ones to reach a flaw, preprint.

[23] J. Conroy, S. Kratzer, and R. Lucas, Data parallel sparse LU factorization, in Parallel Processing for Scientific Computing, SIAM, Philadelphia, 1994.

[24] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, MIT Press, Cambridge, MA, 1992.

[25] L. Danzer, J. Fonlupt, and V. Klee. Helly’s theorem and its relatives. Proceedings of Symposia in Pure Mathematics, American Mathematical Society, 7:101–180, 1963.

[26] J. Dongarra, R van de Geijn, and D. Walker, A look at scalable dense linear algebra libraries, in Scalable High Performance Computer Conference, Williamsburg, VA, 1992.

[27] I. S. Duff, R. G. Grimes, and J. G. Lewis, Sparse matrix test problems, ACM TOMS, 15 (1989), pp. 1-14.

[28] A. L. Dulmage and N. S. Mendelsohn. Coverings of bipartite graphs. Canadian J. Math. 10, pp 517-534, 1958.

[29] I. S. Duff. Parallel implementation of multifrontal schemes. Parallel Computing, 3, 193–204, 1986.

[30] A. Edelman, P. McCorquodale, and S. Toledo. The future fast fourier transform. SIAM Journal on Scientific Computing, 20(3):1094–1114, 1999.

[31] H. Edelsbrunner. Algorithms in Combinatorial Geometry, volume 10 of EATCS Monographs on Theoretical CS. Springer-Verlag, 1987.

[32] D. Eppstein, G. L. Miller, and S.-H. Teng. A deterministic linear time algorithm for geometric separators and its applications. In Proceedings of 9th ACM Symposium on Computational Geometry, pages 99–108, 1993.

[33] C. Farhat and M. Lesoinne. Automatic partitioning of unstructured meshes for the parallel solution of problems in computational mechanics. Int. J. Num. Meth. Eng. 36:745-764 (1993).

[34] J. Fixx, Games for the Superintelligent.

[35] I. Fried. Condition of finite element matrices generated from nonuniform meshes. AIAA J. 10, pp 219–221, 1972.

[36] M. Garey and M. Johnson, Computers and Intractability: A Guide to the Theory of NP- Completeness, Prentice-Hall, Englewood Cliffs, NJ, 1982.

[37] J. A. George. Nested dissection of a regular finite element mesh. SIAM J. Numerical Analysis, 10: 345–363, 1973.

[38] J. A. George and J. W. H. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, 1981.

[39] A. George, J. W. H. Liu, and E. Ng, Communication results for parallel sparse Cholesky factorization on a hypercube, Parallel Comput. 10 (1989), pp. 287–298.

[40] A. George, M. T. Heath, J. Liu, E. Ng. Sparse Cholesky factorization on a local-memory multiprocessor. SIAM J. on Scientific and Statistical Computing, 9, 327–340, 1988

[41] J. R. Gilbert, G. L. Miller, and S.-H. Teng. Geometric mesh partitioning: Implementation and experiments. In SIAM J. Sci. Comp., to appear 1995.

[42] G. Golub and W. Kahan. Calculating the singular values and pseudoinverse of a matrix. SIAM Journal on Numerical Analysis, 2:205–224, 1965.

[43] G. H. Golub and C. F. Van Loan. Matrix Computations, 2nd Edition. Johns Hopkins University Press, 1989.

[44] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. J. Comp. Phys. 73 (1987) pp325-348.

[45] L. J. Guibas, D. E. Knuth, and M. Sharir. Randomized incremental construction of Delaunay and Voronoi diagrams. Algorithmica, 7:381–413, 1992.

[46] R. W. Hockney and J. W. Eastwood. Computer Simulation Using Particles. McGraw Hill, 1981.

[47] G. Hardy, J. E. Littlewood and G. Pólya. Inequalities. Second edition, Cambridge University Press, 1952.

[48] D. Haussler and E. Welzl. ε-nets and simplex range queries. Discrete and Computational Geometry, 2: 127–151, 1987.

[49] N.J. Higham, The Accuracy of Floating Point Summation SIAM J. Scient. Comput. , 14:783–799, 1993.

[50] Y. Hu and S. L. Johnsson. A data parallel implementation of hierarchical N-body methods. Technical Report TR-26-94, Harvard University, 1994.

[51] T. Joachims. Text categorization with support vector machines: learning with many relevant features. In Claire Nédellec and Céline Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pages 137–142, Chemnitz, DE, 1998. Springer Verlag, Heidelberg, DE.

[52] M. T. Jones and P. E. Plassman. Parallel algorithms for the adaptive refinement and par- titioning of unstructured meshes. Proc. Scalable High-Performance Computing Conf. (1994) 478–485.

[53] W. Kahan, A Test for SRT Division, preprint.

[54] F. T. Leighton. Complexity Issues in VLSI. Foundations of Computing. MIT Press, Cam- bridge, MA, 1983.

[55] F. T. Leighton and S. Rao. An approximate max-flow min-cut theorem for uniform multi- commodity flow problems with applications to approximation algorithms. In 29th Annual Symposium on Foundations of Computer Science, pp 422-431, 1988.

[56] C. E. Leiserson. Area Efficient VLSI Computation. Foundations of Computing. MIT Press, Cambridge, MA, 1983.

[57] C. E. Leiserson and J. G. Lewis. Orderings for parallel sparse symmetric factorization. in 3rd SIAM Conference on Parallel Processing for Scientific Computing, 1987.

[58] G. Y. Li and T. F. Coleman, A parallel triangular solver for a distributed memory multiprocessor, SIAM J. Scient. Stat. Comput. 9 (1988), pp. 485–502.

[59] R. J. Lipton, D. J. Rose, and R. E. Tarjan. Generalized nested dissection. SIAM J. on Numerical Analysis, 16:346–358, 1979.

[60] R. J. Lipton and R. E. Tarjan. A separator theorem for planar graphs. SIAM J. of Appl. Math., 36:177–189, April 1979.

[61] J. W. H. Liu. The solution of mesh equations on a parallel computer. in 2nd Langley Conference on Scientific Computing, 1974.

[62] P.-F. Liu. The parallel implementation of N-body algorithms. PhD thesis, Yale University, 1994.

[63] R. Lohner, J. Camberos, and M. Merriam. Parallel unstructured grid generation. Computer Methods in Applied Mechanics and Engineering 95 (1992) 343–357.

[64] J. Makino and M. Taiji, T. Ebisuzaki, and D. Sugimoto. Grape-4: a special-purpose computer for gravitational N-body problems. In Parallel Processing for Scientific Computing, pages 355–360. SIAM, 1995.

[65] J. Matoušek. Approximations and optimal geometric divide-and-conquer. In 23rd ACM Symp. Theory of Computing, pages 512–522. ACM, 1991.

[66] G. L. Miller. Finding small simple cycle separators for 2-connected planar graphs. Journal of Computer and System Sciences, 32(3):265–279, June 1986.

[67] G. L. Miller, S.-H. Teng, W. Thurston, and S. A. Vavasis. Automatic mesh partitioning. In A. George, J. Gilbert, and J. Liu, editors, Sparse Matrix Computations: Graph Theory Issues and Algorithms, IMA Volumes in Mathematics and its Applications. Springer-Verlag, pp57–84, 1993.

[68] G. L. Miller, S.-H. Teng, W. Thurston, and S. A. Vavasis. Finite element meshes and geometric separators. SIAM J. Scientific Computing, to appear, 1995.

[69] G. L. Miller, D. Talmor, S.-H. Teng, and N. Walkington. A Delaunay Based Numerical Method for Three Dimensions: generation, formulation, partition. In the proceedings of the twenty-sixth annual ACM symposium on the theory of computing, to appear, 1995.

[70] S. A. Mitchell and S. A. Vavasis. Quality mesh generation in three dimensions. Proc. 8th ACM Symp. Comput. Geom. (1992) 212–221.

[71] K. Nabors and J. White. A multipole accelerated 3-D capacitance extraction program. IEEE Trans. Comp. Des. 10 (1991) v11.

[72] D. P. O’Leary and G. W. Stewart, Data-flow algorithms for parallel matrix computations, CACM, 28 (1985), pp. 840–853.

[73] L.S. Ostrouchov, M.T. Heath, and C.H. Romine, Modeling speedup in parallel sparse matrix factorization, Tech Report ORNL/TM-11786, Mathematical Sciences Section, Oak Ridge National Lab., December, 1990.

[74] V. Pan and J. Reif. Efficient parallel solution of linear systems. In Proceedings of the 17th Annual ACM Symposium on Theory of Computing, pages 143–152, Providence, RI, May 1985. ACM.

[75] A. Pothen, H. D. Simon, K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal. Appl. 11 (3), pp 430–452, July, 1990.

[76] Vaughan Pratt, personal communication, June, 1995.

[77] V. Pratt, Anatomy of the Pentium Bug, TAPSOFT’95, LNCS 915, Springer-Verlag, Aarhus, Denmark, (1995), 97–107. ftp://boole.stanford.edu/pub/FDIV/anapent.ps.gz.

[78] F. P. Preparata and M. I. Shamos. Computational Geometry An Introduction. Texts and Monographs in Computer Science. Springer-Verlag, 1985.

[79] A. A. G. Requicha. Representations of rigid solids: theory, methods, and systems. In ACM Computing Survey, 12, 437–464, 1980.

[80] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organiza- tion in the brain. Psychological Review, 65:386–408, 1958.

[81] E. Rothberg and A. Gupta, The performance impact of data reuse in parallel dense Cholesky factorization, Stanford Comp. Sci. Dept. Report STAN-CS-92-1401.

[82] E. Rothberg and A. Gupta, An efficient block-oriented approach to parallel sparse Cholesky factorization, Supercomputing ’93, pp. 503-512, November, 1993.

[83] E. Rothberg and R. Schreiber, Improved load distribution in parallel sparse Cholesky factorization, Supercomputing '94, November, 1994.

[84] J. Ruppert. A new and simple algorithm for quality 2-dimensional mesh generation. Proc. 4th ACM-SIAM Symp. Discrete Algorithms (1993) 83–92.

[85] Y. Saad and M.H. Schultz, Data communication in parallel architectures, Parallel Comput. 11 (1989), pp. 131–150.

[86] J. K. Salmon. Parallel Hierarchical N-body Methods. PhD thesis, California Institute of Technology, 1990. CRPR-90-14.

[87] J. K. Salmon, M. S. Warren, and G. S. Winckelmans. Fast parallel tree codes for gravitational and fluid dynamical N-body problems. Int. J. Supercomputer Applications, 8(2):129–142, 1994.

[88] H. Samet. The quadtree and related hierarchical data structures. ACM Computing Surveys, pages 188–260, 1984.

[89] K. E. Schmidt and M. A. Lee. Implementing the fast multipole method in three dimensions. J. Stat. Phy., page 63, 1991.

[90] H.P. Sharangpani and M.L. Barton, Statistical analysis of floating point flaw in the Pentium T M Processor (1994). http://www.intel.com/product/pentium/white11.ps

[91] H. D. Simon. Partitioning of unstructured problems for parallel processing. Computing Systems in Engineering 2:(2/3), pp135-148.

[92] H. D. Simon and S.-H. Teng. How good is recursive bisection? SIAM J. Scientific Computing, to appear, 1995.

[93] J. P. Singh, C. Holt, T. Ttsuka, A. Gupta, and J. L. Hennessey. Load balancing and data locality in hierarchical N-body methods. Technical Report CSL-TR-92-505, Stanford, 1992.

[94] G. W. Stewart. On the early history of the singular value decomposition. Technical Report CS-TR-2855, 1992.

[95] G. Strang and G. J. Fix. An Analysis of the Finite Element Method. Prentice-Hall, Englewood Cliffs, New Jersey, 1973.

[96] S.-H. Teng. Points, Spheres, and Separators: a unified geometric approach to graph parti- tioning. PhD thesis, Carnegie-Mellon University, School of Computer Science, 1991. CMU- CS-91-184.

[97] V. Vapnik. Estimation of dependencies based on empirical data [in Russian]. 1979.

[98] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[99] V. Vapnik. Statistical Learning Theory. Wiley, 1998.

[100] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16: 264-280, 1971.

[101] R. D. Williams. Performance of dynamic load balancing algorithms for unstructured mesh calculations. Concurrency, 3 (1991) 457

[102] F. Zhao. An O(n) algorithm for three-dimensional n-body simulation. Technical Report TR AI Memo 995, MIT, AI Lab., October 1987.

[103] F. Zhao and S. L. Johnsson. The parallel multipole method on the Connection Machines. SIAM J. Stat. Sci. Comp., 12:1420–1437, 1991.