Introduction to Parallel Computing and the Message Passing Interface (MPI)

Rolf Kuiper

July 1st, 2008

Chapter 1

Introduction to Parallel Computing

1.1 Architectures

1.1.1 von Neumann Architecture (single machine)

CPU: sequentially performs decoded instructions (on data)

RAM (Random Access Memory): stores program instructions and data

[Diagram: a single CPU connected to RAM]

1.1.2 Flynn’s taxonomy of parallelism

                     Instruction
                     Single    Multiple
Data    Single       SISD      MISD
        Multiple     SIMD      MIMD

SISD: uni-processor machine / entirely sequential program
SIMD: the same operation is repeated over different data
MISD: rarely used, only a few specialized applications
MIMD: the most common parallel cluster architecture and the focus of this introduction

1.1.3 Parallel Architectures

Shared Memory

[Diagram: four CPUs sharing one RAM]

• Default programming language: Open Multi-Processing (OpenMP)

Distributed Memory

[Diagram: four CPUs, each with its own RAM, connected by a network]

• Default programming language: Message Passing Interface (MPI)

Hybrid Distributed-Shared Memory

[Diagram: pairs of CPUs share one RAM each; the shared-memory nodes are connected by a network]

• Default programming language:

  – Cluster OpenMP
  – MPI 2.0

1.2 Partitioning

1.2.1 Functional Decomposition

• Each processor takes charge of its own subroutine, e.g. a climate model:

[Diagram: Atmosphere Model, Ocean Model, and Surface Model exchanging data with each other]

1.2.2 Domain Decomposition

• The data is decomposed into different data sets.

• Each processor executes the same tasks on different data sets.

• Single Program Multiple Data (SPMD), e.g. grid-based hydrodynamics:

[Diagram: a grid split into two halves, one handled by Processor 0 and one by Processor 1]

1.2.3 Performance

Speedup S
Definition: The speedup is the ratio of the runtime of the serial code to the runtime of the parallel code,

S = t_1 / t_n,

with the number of processors n and the wall-clock time t_i that the program needs on i processors.

Efficiency E
Definition: The efficiency is the gained speedup per processor (in percent),

E = t_1 / (n t_n) = S / n.
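For example (numbers purely for illustration): if the serial run needs t_1 = 100 s and the run on n = 8 processors needs t_8 = 20 s, then S = 100/20 = 5 and E = S/n = 5/8 ≈ 0.63, i.e. 63%.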

Amdahl’s law

Assumptions:

• fixed problem size

• serial portion α of the code independent of n

S ≤ 1 / ((1 − P) + P/n)  −→  1 / (1 − P) = 1/α  for n → ∞,

with the parallel portion of the code P ((1 − P) is the serial portion α).

[Plot: speedup S versus number of processors n (up to 1000) for parallel portions P = 0.95, 0.96, 0.97, 0.98, 0.99, 1.0]

Gustafson’s law

S ≤ n − α(n − 1) −→ n(1 − α) = nP for n → ∞
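As a quick numerical illustration of the two laws (the serial fraction alpha = 0.05 is an arbitrary choice, not a value from the lecture), a small C sketch could tabulate both bounds:

#include <stdio.h>

// Tabulate the speedup limits of Amdahl's and Gustafson's laws
// for an assumed serial fraction alpha (parallel portion P = 1 - alpha).
int main(void)
{
    const double alpha = 0.05;        // illustrative serial fraction
    const double P     = 1.0 - alpha; // parallel portion

    for (int n = 1; n <= 1024; n *= 4) {
        double amdahl    = 1.0 / ((1.0 - P) + P / n);
        double gustafson = n - alpha * (n - 1);
        printf("n = %4d   Amdahl: %7.2f   Gustafson: %8.2f\n",
               n, amdahl, gustafson);
    }
    return 0;
}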

Load Balancing

For grid-based hydrodynamics: each processor should contain the same number of grid cells.

Granularity

Granularity = Computation / Communication

• Local domains with small surface (communication) and large volume (computation).

• Cubic domain decomposition.

e.g. a 2D, 8×8 grid (see below):

[Figure: an 8×8 grid decomposed among four processors, with local volume V (number of cells) and communicating surface A (number of boundary cells)]

Boundary domains: V/A = 16/8 = 2.0    Inner domains: V/A = 16/16 = 1.0

Total domain: V/A = 2.0

1.2.4 Ghost cells

• Allow the same computation algorithm for border cells as for inner-domain cells.

• Ghosts outside the physical domain have to be set via boundary conditions.

• Ghosts inside the global physical domain, but outside the local domain, have to be communicated.

[Diagram: a local domain surrounded by ghost cells; ghosts at the physical boundary are set via boundary conditions, ghosts at inner boundaries are filled by communication with the neighbouring domain]

1.2.5 Stencils

• The communication stencil is equal to the computation stencil.

• Standard choices are star and box stencils.

Standard 5-point stencil (2D star stencil):

[Diagram: the center cell and its four edge neighbours]

Standard 9-point stencil (2D box stencil):

[Diagram: the center cell and its eight surrounding neighbours]

Computation Stencil:

Dimension   Star stencil   Box stencil
1           2-point        2-point
2           5-point        9-point
3           7-point        27-point

Communication Stencil:

[Diagram: the corresponding communication stencil: ghost cells that have to be received from the neighbouring domains for the local update]
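As a minimal sketch of how such a stencil is applied in code (the array sizes NX, NY, the ghost-layer width of one cell, and the simple averaging operation are illustrative assumptions, not taken from the lecture):

// Minimal 5-point star-stencil update on a local domain with one ghost layer.
#define NX 8   // illustrative local domain size in x
#define NY 8   // illustrative local domain size in y

void stencil_update(double u[NX + 2][NY + 2], double u_new[NX + 2][NY + 2])
{
    // Update inner cells only; indices 0 and NX+1 / NY+1 are ghost cells.
    for (int i = 1; i <= NX; i++)
        for (int j = 1; j <= NY; j++)
            u_new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1]);
}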

Chapter 2

The Message Passing Interface (MPI)

• MPI 1.1 & MPI 2.0 standards (approx. 600 pages): http://www.mpi-forum.org/docs/docs.html

• MPI 1.1 defines the basics (which we will focus on in this introduction)

• MPI 2.0 main add-ons:

– Dynamic process creation
– Shared memory operations
– Parallel I/O

• Interface to C/C++ and Fortran

• Different implementations available (e.g. MPICH, Open MPI)

2.1 Short Glossary

Communicator: group of communicating processors in an ordered topology (default communicator containing all running processors: MPI_COMM_WORLD)

Rank: processor identifier (routine to get the rank: MPI_Comm_rank(comm, &rank))

Topology: mapping / ordering of processors (e.g. Cartesian grid)

Buffer: space for message storage at the sender or receiver (also possible: an additional user-specified application buffer)

2.2 MPI Primitive Data Types

C data types:

MPI_CHAR             signed char
MPI_SHORT            signed short int
MPI_INT              signed int
MPI_LONG             signed long int
MPI_UNSIGNED_CHAR    unsigned char
MPI_UNSIGNED_SHORT   unsigned short int
MPI_UNSIGNED         unsigned int
MPI_UNSIGNED_LONG    unsigned long int
MPI_FLOAT            float
MPI_DOUBLE           double
MPI_LONG_DOUBLE      long double
MPI_BYTE             8 binary digits

Fortran data types:

MPI_CHARACTER        character(1)
MPI_INTEGER          integer
MPI_REAL             real
MPI_DOUBLE_PRECISION double precision
MPI_COMPLEX          complex
MPI_DOUBLE_COMPLEX   double complex
MPI_LOGICAL          logical
MPI_BYTE             8 binary digits

It is also possible to define new data types (called "derived data types"), like vector, struct, ...
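As a small sketch of a derived data type (the array sizes NX, NY are illustrative assumptions), MPI_Type_vector can describe one column of a row-major C array:

#include <mpi.h>

#define NX 8   // illustrative number of rows
#define NY 8   // illustrative number of columns

int main(int argc, char *argv[])
{
    double field[NX][NY];
    MPI_Datatype column_type;

    MPI_Init(&argc, &argv);

    // NX blocks of 1 double each, separated by a stride of NY doubles:
    MPI_Type_vector(NX, 1, NY, MPI_DOUBLE, &column_type);
    MPI_Type_commit(&column_type);

    // column_type can now be used like a primitive data type, e.g.
    // MPI_Send(&field[0][0], 1, column_type, dest, tag, MPI_COMM_WORLD);

    MPI_Type_free(&column_type);
    MPI_Finalize();
    return 0;
}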

2.3 Getting started, or the Environment Management Routines

// ******************
// ** HelloWorld.c **
// ******************
//
// Compile via "gcc -o HelloWorld HelloWorld.c"
//
// Run via "./HelloWorld"
//


//
// Include Libraries:
//
#include <stdio.h>   // for printf()


//
// Main routine:
//
int main()
{
    //
    // "Hello World!"
    //
    printf("Hello World!\n");

    return 0;
}


// **********************
// ** HelloWorld_MPI.c **
// **********************
//
// Compile via "mpicc -o HelloWorld_MPI HelloWorld_MPI.c"
//
// Run via "mpiexec -n '#procs' ./HelloWorld_MPI"
//


//
// Include Libraries:
//
#include <mpi.h>     // for Message Passing Interface
#include <stdio.h>   // for printf()


//
// Main routine:
//
int main(int argc, char *argv[])
{
    //
    // Initialize MPI:
    //
    MPI_Init(&argc, &argv);

    //
    // "Hello World!"
    //
    printf("Hello World!\n");

    //
    // Finalize MPI:
    //
    MPI_Finalize();

    return 0;
}


// ************************
// ** HelloWorld_MPI_2.c **
// ************************
//
// Compile via "mpicc -o HelloWorld_MPI_2 HelloWorld_MPI_2.c"
//
// Run via "mpiexec -n '#procs' ./HelloWorld_MPI_2"
//


//
// Include Libraries:
//
#include <mpi.h>     // for Message Passing Interface
#include <stdio.h>   // for printf()


//
// Main routine:
//
int main(int argc, char *argv[])
{
    //
    // Initialize MPI:
    //
    if(MPI_Init(&argc, &argv) != MPI_SUCCESS){
        // Every call to an MPI routine returns an error code!
        fprintf(stderr, "ERROR: MPI_Init failed.\n");
        return -1;
    }

    //
    // "Hello World!"
    //
    printf("Hello World!\n");

    //
    // Finalize MPI:
    //
    MPI_Finalize();

    return 0;
}


//
// Include Libraries:
//
#include <mpi.h>     // for Message Passing Interface
#include <stdio.h>   // for printf()

//
// Parallel environment:
//
int MPI_rank; // identifier of a single processor in the communicator
int MPI_size; // number of total processors in the communicator

//
// Main routine:
//
int main(int argc, char *argv[])
{
    //
    // Initialize MPI:
    //
    if(MPI_Init(&argc, &argv) != MPI_SUCCESS){
        fprintf(stderr, "ERROR: MPI_Init failed.\n");
        return -1;
    }

    //
    // Get number of processors and own local rank:
    //
    // MPI_COMM_WORLD = communicator, which includes all processors.
    MPI_Comm_size(MPI_COMM_WORLD, &MPI_size);
    // Every processor in the communicator has its own identifier.
    MPI_Comm_rank(MPI_COMM_WORLD, &MPI_rank);

    //
    // "Hello World!"
    //
    printf("[%d / %d] Hello World!\n", MPI_rank, MPI_size);

    //
    // Finalize MPI:
    //
    MPI_Finalize();

    return 0;
}

2.4 Communication

2.4.1 Point-to-point communication

• One message from one sender process to one receiver process.

• The (default) path of the message goes from the specified memory at the sender, to the send buffer, to the receive buffer, to the specified memory at the receiver.

Blocking communication

• The process waits until the sent data can be reused safely (blocking send) or until the data required for further computation has been received (blocking receive).

• Default way of communication (MPI_Send(), MPI_Recv()).

Non-blocking communication

• After posting the send or receive request, the processor simply continues with the next instructions.

• The status of the communication can be requested later on.

• Overlapping communication and computation (!?).

• Key-letter is the capital ’I’ (MPI_Isend(), MPI_Irecv()); see the sketch below.
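A minimal sketch of such a non-blocking exchange (the neighbour ranks left and right, the tag 0, and the single double value are illustrative assumptions):

#include <mpi.h>

// Post the receive and the send, overlap with computation, then wait.
void exchange(double *send_val, double *recv_val, int left, int right)
{
    MPI_Request requests[2];

    MPI_Irecv(recv_val, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &requests[0]);
    MPI_Isend(send_val, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &requests[1]);

    // ... computation that does not depend on recv_val ...

    MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
}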

Synchronous communication

• The sending process waits until the data is received.

• Key-letter is the ’S/s’.

Asynchronous communication

• Committed data is stored in a user-specified application buffer (MPI_Buffer_attach() routine) until it is requested by the receiving process (MPI_Request handle).

• Key-letter is the ’B/b’.

Ready Send

• Blocking, immediate send.

• The programmer has to ensure that the destination process has already posted the receive.

• Key-letter is the ’R/r’.

Sendrecv

• Sends the message.

• Posts the receive.

• Then blocks until both have completed.

• MPI_Sendrecv().

Combinations and Syntax

• It is possible to combine any send method with any receive method.

• Calls:

– MPI_Send(), MPI_Recv()
– MPI_Ssend()
– MPI_Bsend()
– MPI_Rsend()
– MPI_Isend(), MPI_Irecv()
– MPI_Issend()
– MPI_Ibsend()
– MPI_Irsend()

• Main syntax:

– MPI_Send(&buffer, count, datatype, dest, tag, comm)
– MPI_Recv(&buffer, count, datatype, source, tag, comm, &status)
– MPI_Isend(&buffer, count, datatype, dest, tag, comm, &request)
– MPI_Sendrecv(&sendbuffer, sendcount, sendtype, dest, sendtag, &recvbuffer, recvcount, recvtype, source, recvtag, comm, &status)

with

– buffer = reference to the first data address
– count = the number of elements to commit
– datatype = one of the MPI data types (see prior section)
– tag = non-negative integer to identify different send/recv processes
– comm = the communicator (see prior section)
– status / request = handle for non-blocking routines, which can be used in MPI_Wait(), MPI_Test() or MPI_Probe() later on (see the MPI standards for details)
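A minimal sketch using this syntax (the tag 42 and the transferred value are arbitrary choices): rank 0 sends one double to rank 1 with blocking calls.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 3.14;
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD, &status);
        printf("[%d] received value = %f\n", rank, value);
    }

    MPI_Finalize();
    return 0;
}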

2.4.2 Collective communication

• Involves data sharing between more than two processors, normally all processors (MPI_COMM_WORLD communicator).

• Always blocking communication.

• No derived Data Types allowed.

Three reasons for collective communication:

• Synchronization (e.g. wait until all processors have reached a specified point in the program)

• Data Movement (e.g. broadcasting information from one process to the whole group)

• Collective Computation (e.g. calculate global mass or kinetic energy in Hydrodynamics)

Synchronization

• MPI_Barrier(comm);

• Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier() call, blocks until all tasks in the group have reached the same MPI_Barrier() call.

Data Movement

• MPI_Bcast(&buffer, count, datatype, root, comm)
  Broadcasts a message from root to all other processes in comm.

• MPI_Scatter(&sendbuffer, sendcount, sendtype, &recvbuffer, recvcount, recvtype, root, comm)
  Distributes a message from root to each process in comm (chop and send).

• MPI_Gather(&sendbuffer, sendcount, sendtype, &recvbuffer, recvcount, recvtype, root, comm)
  Gathers distributed messages from comm on root (receive and assemble).

• MPI_Allgather(&sendbuffer, sendcount, sendtype, &recvbuffer, recvcount, recvtype, comm)
  MPI_Gather() to all processes in comm.

• MPI_Alltoall(&sendbuffer, sendcount, sendtype, &recvbuffer, recvcount, recvtype, comm)
  MPI_Scatter() from each process to all other processes in comm in ranking order.
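A minimal data-movement sketch (the broadcast payload 123 is an arbitrary choice): rank 0 broadcasts one integer to all processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        value = 123;   // illustrative payload, set on root only

    // After the call, every process in MPI_COMM_WORLD holds value == 123:
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("[%d] value = %d\n", rank, value);

    MPI_Finalize();
    return 0;
}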

Collective Computation

• MPI_Reduce(&sendbuffer, &recvbuffer, count, datatype, operation, root, comm) with operation

  – MPI_MAX = maximum
  – MPI_MIN = minimum
  – MPI_SUM = sum
  – MPI_PROD = product
  – MPI_LAND = logical AND
  – MPI_BAND = bit-wise AND
  – MPI_LOR = logical OR
  – MPI_BOR = bit-wise OR
  – MPI_LXOR = logical XOR
  – MPI_BXOR = bit-wise XOR
  – MPI_MAXLOC = maximum + location
  – MPI_MINLOC = minimum + location

• MPI_Allreduce(&sendbuffer, &recvbuffer, count, datatype, operation, comm)
  MPI_Reduce() to all processes in comm.

• MPI_Reduce_scatter(&sendbuffer, &recvbuffer, recvcount, datatype, operation, comm)
  MPI_Reduce() on a vector + MPI_Scatter().

• MPI_Scan(&sendbuffer, &recvbuffer, count, datatype, operation, comm)
  Partial MPI_Reduce() over all processes from rank 0 up to the own rank in comm.
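A minimal collective-computation sketch in the spirit of the global-mass example above (the local contribution 1.0 + rank is purely illustrative): each process contributes one value and root 0 receives the global sum.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double local_mass, total_mass = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local_mass = 1.0 + rank;   // illustrative local contribution

    // The result is only valid on the root (rank 0);
    // use MPI_Allreduce() if every process needs the total.
    MPI_Reduce(&local_mass, &total_mass, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Total mass = %f\n", total_mass);

    MPI_Finalize();
    return 0;
}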

2.5 Group and Communicator Management Routines

• Split MPI_COMM_WORLD into distinct communicating sub-groups.

• Define new user-specified virtual topologies.

• MPI provides over 40 routines related to groups, communicators, and virtual topologies.

• And we will not address them in this introduction.

2.6 Last message (for physicists)

• Normally you don’t have to implement MPI calls directly on your own.

• User-friendly libraries on top of MPI allow

– Safe communication via easy to use interface routines.
– Good performance.
– Different add-ons, like
  ∗ Parallel grid setup (topologies) and data management (e.g. CHOMBO, Cactus, DUNE).
  ∗ Parallel linear algebra algorithms (e.g. PETSc).
– Good parallel debugging and profiling tools.

• But: it is always easier to use such interface libraries if you understand the basics of parallel MPI communication and have implemented these basics yourself as an exercise (so that you know about the possible bugs).

Chapter 3

Appendix

3.1 Useful Links

• MPI-Forum: http://www.mpi-forum.org/
• MPI standards: http://www.mpi-forum.org/docs/docs.html
• Syntax overview (German only): http://www.tu-chemnitz.de/informatik/RA/projects/mpihelp/mpihelp.html
• MPICH: http://www.mcs.anl.gov/research/projects/mpich2/
• Open MPI: http://www.open-mpi.org/
• OpenMP: http://openmp.org/wp/
• Introductions / Tutorials from Blaise Barney, Livermore Computing: https://computing.llnl.gov/tutorials/
  – Parallel programming: https://computing.llnl.gov/tutorials/parallel_comp/
  – MPI: https://computing.llnl.gov/tutorials/mpi/

3.2 Exercise class

• No exercise class next Monday.

• On Monday afternoon in two weeks we’ll parallelize the donor-cell advection scheme (with point-to-point communication, collective reduction and collective data movement).

• No prior code / attendance is needed. The serial advection scheme is available on the website of the exercise class in C and Fortran.


Thanks for your attention!