Hybrid MPI & Openmp Parallel Programming

Hybrid MPI & OpenMP Parallel Programming MPI + OpenMP and other models on clusters of SMP nodes Rolf Rabenseifner1) Georg Hager2) Gabriele Jost3) [email protected] [email protected] [email protected] 1) High Performance Computing Center (HLRS), University of Stuttgart, Germany 2) Regional Computing Center (RRZE), University of Erlangen, Germany 3) Texas Advanced Computing Center, The University of Texas at Austin, USA Tutorial M02 at SC10, November 15, 2010, New Orleans, LA, USA Hybrid Parallel Programming Slide 1 Höchstleistungsrechenzentrum Stuttgart Outline slide number • Introduction / Motivation 2 • Programming models on clusters of SMP nodes 6 8:30 – 10:00 • Case Studies / pure MPI vs hybrid MPI+OpenMP 13 • Practical “How-To” on hybrid programming 48 • Mismatch Problems 97 • Opportunities: Application categories that can 126 benefit from hybrid parallelization • Thread-safety quality of MPI libraries 136 10:30 – 12:00 • Tools for debugging and profiling MPI+OpenMP 143 • Other options on clusters of SMP nodes 150 • Summary 164 • Appendix 172 • Content (detailed) 188 Hybrid Parallel Programming Slide 2 / 169 Rabenseifner, Hager, Jost Motivation • Efficient programming of clusters of SMP nodes SMP nodes SMP nodes: cores shared • Dual/multi core CPUs memory • Multi CPU shared memory Node Interconnect • Multi CPU ccNUMA • Any mixture with shared memory programming model • Hardware range • mini-cluster with dual-core CPUs Core • … CPU(socket) • large constellations with large SMP nodes SMP board … with several sockets (CPUs) per SMP node … with several cores per socket ccNUMA node Hierarchical system layout Cluster of ccNUMA/SMP nodes • Hybrid MPI/OpenMP programming seems natural • MPI between the nodes • OpenMP inside of each SMP node Hybrid Parallel Programming Slide 3 / 169 Rabenseifner, Hager, Jost Motivation SMP node SMP node • Which programming model Socket 1 Socket 1 is fastest? MPMI PIM PI MPMI PIM PI Qpuroadc-ecosrse Qpuroadc-ecosrse • MPI everywhere? 4 xC mPUulti- 4 xC mPUulti- MPI MPI MPI MPI MtPhIr eparodceedss MtPhIr eparodceedss 8 x multi- 8 x multi- Socket 2 Socket 2 • Fully hybrid threaded threaded MPI & OpenMP? MPMI PIM PI MPMI PIM PI Qpuroadc-ecosrse Qpuroadc-ecosrse 4 xC mPUulti- 4 xC mPUulti- MPI MPI MPI MPI • Something between? threaded threaded (Mixed model) • Often hybrid programming Node Interconnect slower than pure MPI – Examples, Reasons, … ? Hybrid Parallel Programming Slide 4 / 169 Rabenseifner, Hager, Jost Goals of this tutorial • Sensitize to problems on clusters of SMP nodes see sections Case studies •Less Mismatch problems frustration • Technical aspects of hybrid programming & see sections Programming models on clusters •More Examples on hybrid programming success • Opportunities with hybrid programming with your parallel see section Opportunities: Application categories program on that can benefit from hybrid paralleliz. clusters of • Issues and their Solutions SMP nodes with sections Thread-safety quality of MPI libraries Tools for debugging and profiling for MPI+OpenMP Hybrid Parallel Programming Slide 5 / 169 Rabenseifner, Hager, Jost Outline • Introduction / Motivation • Programming models on clusters of SMP nodes • Case Studies / pure MPI vs hybrid MPI+OpenMP • Practical “How-To” on hybrid programming • Mismatch Problems • Opportunities: Application categories that can benefit from hybrid parallelization • Thread-safety quality of MPI libraries • Tools for debugging and profiling MPI+OpenMP • Other options on clusters of SMP nodes • Summary Hybrid Parallel Programming Slide 6 / 169 Rabenseifner, Hager, Jost Major Programming models on hybrid systems • Pure MPI (one MPI process on each core) • Hybrid MPI+OpenMP OpenMP inside of the – shared memory OpenMP SMP nodes – distributed memory MPI MPI between the nodes via node interconnect Node Interconnect • Other: Virtual shared memory systems, PGAS, HPF, … • Often hybrid programming (MPI+OpenMP) slower than pure MPI – why? MPI local data in each process OpenMP (shared data) Master thread, other threads Sequential data some_serial_code program on #pragma omp parallel for each core for (j=…;…; j++) block_to_be_parallelized Explicit Message Passing again_some_serial_code ••• sleeping ••• by calling MPI_Send & MPI_Recv Hybrid Parallel Programming Slide 7 / 169 Rabenseifner, Hager, Jost Parallel Programming Models on Hybrid Platforms pure MPI hybrid MPI+OpenMP OpenMP only one MPI process MPI: inter-node communication distributed virtual on each core OpenMP: inside of each SMP node shared memory No overlap of Comm. + Comp. Overlapping Comm. + Comp. MPI only outside of parallel regions MPI communication by one or a few threads of the numerical application code while other threads are computing Masteronly MPI only outside of parallel regions MPI local data in each process OpenMP (shared data) Master thread, other threads Sequential data some_serial_code program on #pragma omp parallel for each core for (j=…;…; j++) block_to_be_parallelized Explicit message transfers again_some_serial_code ••• sleeping ••• by calling MPI_Send & MPI_Recv Hybrid Parallel Programming Slide 8 / 169 Rabenseifner, Hager, Jost Pure MPI pure MPI one MPI process on each core Advantages – No modifications on existing MPI codes – MPI library need not to support multiple threads Major problems – Does MPI library uses internally different protocols? • Shared memory inside of the SMP nodes Discussed • Network communication between the nodes in detail later on – Does application topology fit on hardware topology? in the section Mismatch – Unnecessary MPI-communication inside of SMP nodes! Problems Hybrid Parallel Programming Slide 9 / 169 Rabenseifner, Hager, Jost Hybrid Masteronly Masteronly Advantages MPI only outside of parallel regions – No message passing inside of the SMP nodes – No topology problem for (iteration ….) Major Problems { – All other threads are sleeping #pragma omp parallel numerical code while master thread communicates! /*end omp parallel */ – Which inter-node bandwidth? /* on master thread only */ – MPI-lib must support at least MPI_Send (original data MPI_THREAD_FUNNELED to halo areas in other SMP nodes) Section MPI_Recv (halo data Thread-safety from the neighbors) } /*end for loop quality of MPI libraries Hybrid Parallel Programming Slide 10 / 169 Rabenseifner, Hager, Jost Overlapping Communication and Computation OMvPIe cormlampunpicaitniong b yc oonem or ma feuw nthriecadast wihoilen o tahenr tdhre cados amre pcoumptuatintgion if (my_thread_rank < …) { MPI_Send/Recv…. i.e., communicate all halo data } else { Execute those parts of the application that do not need halo data (on non-communicating threads) } Execute those parts of the application that need halo data (on all threads) Hybrid Parallel Programming Slide 11 / 169 Rabenseifner, Hager, Jost Pure OpenMP (on the cluster) OpenMP only distributed virtual shared memory • Distributed shared virtual memory system needed • Must support clusters of SMP nodes Experience: • e.g., Intel® Cluster OpenMP Mismatch – Shared memory parallel inside of SMP nodes section – Communication of modified parts of pages at OpenMP flush (part of each OpenMP barrier) i.e., the OpenMP memory and parallelization model is prepared for clusters! Hybrid Parallel Programming Slide 12 / 169 Rabenseifner, Hager, Jost Outline • Introduction / Motivation • Programming models on clusters of SMP nodes • Case Studies / pure MPI vs hybrid MPI+OpenMP – The Multi-Zone NAS Parallel Benchmarks – For each application we discuss: • Benchmark implementations based on different strategies and programming paradigms • Performance results and analysis on different hardware architectures – Compilation and Execution Summary Gabriele Jost (University of Texas,TACC/Naval Postgraduate School, Monterey CA) • Practical “How-To” on hybrid programming • Mismatch Problems • Opportunities: Application categories that can benefit from hybrid paralleli. • Thread-safety quality of MPI libraries • Tools for debugging and profiling MPI+OpenMP • Other options on clusters of SMP nodes • Summary Hybrid Parallel Programming Slide 13 / 169 Rabenseifner, Hager, Jost The Multi-Zone NAS Parallel Benchmarks Nested MPI/OpenMP MLP OpenMP Time step sequential sequential sequential MPI MLP inter-zones OpenMP Processes Processes exchange data copy+ Call MPI OpenMP boundaries sync. intra-zones OpenMP OpenMP OpenMP • Multi-zone versions of the NAS Parallel Benchmarks LU,SP, and BT • Two hybrid sample implementations • Load balance heuristics part of sample codes • www.nas.nasa.gov/Resources/Software/software.html Hybrid Parallel Programming Slide 14 / 169 Rabenseifner, Hager, Jost Using MPI/OpenMP call omp_set_numthreads (weight) subroutine zsolve(u, rsd,…) do step = 1, itmax ... call exch_qbc(u, qbc, nx,…) !$OMP PARALLEL DEFAUL(SHARED) !$OMP& PRIVATE(m,i,j,k...) do k = 2, nz-1 call mpi_send/recv !$OMP DO do j = 2, ny-1 do i = 2, nx-1 do zone = 1, num_zones do m = 1, 5 if (iam .eq. pzone_id(zone)) then u(m,i,j,k)= call zsolve(u,rsd,…) dt*rsd(m,i,j,k-1) end if end do end do end do end do end do !$OMP END DO nowait ... end do ... !$OMP END PARALLEL Hybrid Parallel Programming Slide 15 / 169 Rabenseifner, Hager, Jost Pipelined Thread Execution in SSOR subroutine ssor !$OMP PARALLEL DEFAULT(SHARED) !$OMP& PRIVATE(m,i,j,k...) subbroutine sync1 call sync1 () …neigh = iam -1 do k = 2, nz-1 do while (isync(neigh) .eq. 0) !$OMP DO !$OMP FLUSH(isync) do j = 2, ny-1 end do isync(neigh) = 0 do i = 2, nx-1 !$OMP FLUSH(isync) do m = 1, 5 rsd(m,i,j,k)= … dt*rsd(m,i,j,k-1)

Hybrid MPI & Openmp Parallel Programming

A Case for High Performance Computing with Virtual Machines

Bench - Benchmarking the State-Of- The-Art Task Execution Frameworks of Many- Task Computing

Adaptive Data Migration in Load-Imbalanced HPC Applications

Beowulf Clusters — an Overview

The Component Architecture of Open Mpi: Enabling Third-Party Collective Algorithms∗

Exploring Massive Parallel Computation with GPU

Improving MPI Threading Support for Current Hardware Architectures

Dcuda: Hardware Supported Overlap of Computation and Communication

Exascale Computing Project -- Software

On the Virtualization of CUDA Based GPU Remoting on ARM and X86 Machines in the Gvirtus Framework

Parallel Data Mining from Multicore to Cloudy Grids

HPVM: Heterogeneous Parallel Virtual Machine