Overview of MVAPICH2 and MVAPICH2-X: Latest Status and Future Roadmap
MVAPICH2 User Group (MUG) Meeting
by
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)
[Chart: number of clusters and percentage of clusters in the Top 500 list over time]
Drivers of Modern HPC Cluster Architectures
[Figure: building blocks of modern HPC clusters]
– Multi-core Processors
– High Performance Interconnects - InfiniBand: <1 usec latency, >100 Gbps bandwidth
– Accelerators / Coprocessors: high compute density, high performance/watt, >1 TFlop DP on a chip
• Multi-core processors are ubiquitous
• InfiniBand very popular in HPC clusters
• Accelerators/Coprocessors becoming common in high-end systems
• Pushing the envelope for Exascale computing
Tianhe-2 (#1), Titan (#2), Stampede (#6), Tianhe-1A (#10)
Parallel Programming Models Overview
[Figure: three programming model views, each with processes P1, P2, P3]
– Shared Memory Model (SHMEM, DSM): all processes share one memory
– Distributed Memory Model (MPI - Message Passing Interface): each process has its own memory
– Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, …): per-process memories combined into a logical shared memory
• Programming models provide abstract machine models
• Models can be mapped onto different types of systems – e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance
Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges
[Stack figure, with co-design opportunities and challenges across the various layers]
– Application Kernels/Applications
– Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.
– Communication Library or Runtime for Programming Models: Point-to-point Communication (two-sided & one-sided), Collective Communication, Synchronization & Locks, I/O & File Systems, Fault Tolerance
– Networking Technologies (InfiniBand, 10/40GigE, Aries, BlueGene), Multi-/Many-core Architectures, Accelerators (NVIDIA and MIC)
Designing (MPI+X) at Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely small memory footprint
• Hybrid programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, …) – see the MPI+OpenMP sketch after this list
• Balancing intra-node and inter-node communication for next-generation multi-core nodes (128-1024 cores/node)
  – Multiple end-points per node
• Support for efficient multi-threading
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
  – Power-aware
• Support for MPI-3 RMA Model
• Support for GPGPUs and Accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
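A minimal MPI + OpenMP hybrid sketch (illustrative, not part of the original slides) showing the programming style these designs target: OpenMP threads for intra-node parallelism and MPI for inter-node communication. It requests MPI_THREAD_MULTIPLE, one of the thread-support levels an efficient multi-threaded MPI design must provide.

```c
/* Minimal MPI + OpenMP hybrid sketch (illustrative).
 * Requests MPI_THREAD_MULTIPLE so OpenMP threads may make MPI calls;
 * the library reports the thread level it actually provides. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    /* Intra-node parallelism via OpenMP threads */
    #pragma omp parallel reduction(+:local)
    {
        local += omp_get_thread_num() + 1;
    }

    /* Inter-node communication via MPI */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("provided thread level = %d, global sum = %f\n", provided, global);

    MPI_Finalize();
    return 0;
}
```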
MVAPICH2/MVAPICH2-X Software
• http://mvapich.cse.ohio-state.edu
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Used by more than 2,077 organizations (HPC centers, industry, and universities) in 70 countries
MVAPICH Team Projects
[Timeline figure: project timelines for MVAPICH (EOL), MVAPICH2, MVAPICH2-X, and OMB, spanning Oct-00 through Aug-13]

MVAPICH/MVAPICH2 Release Timeline and Downloads
[Chart: total number of downloads over time (Sep-04 through Sep-13, approaching 200,000), annotated with release points from MV 0.9.4 and MV2 0.9.0 through MV2 1.9 and MV2 2.0a]
• Download counts from MVAPICH2 website
MVAPICH2/MVAPICH2-X Software (Cont'd)
• Available with software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
• Empowering many TOP500 clusters
  – 6th-ranked 462,462-core cluster (Stampede) at TACC
  – 19th-ranked 125,980-core cluster (Pleiades) at NASA
  – 21st-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
  – and many others
• Partner in the U.S. NSF-TACC Stampede System
Strong Procedure for Design, Development and Release
• Research is done to explore new designs
• Designs are first presented in conference/journal publications
• Best-performing designs are incorporated into the codebase
• Rigorous QA procedure before making a release
  – Exhaustive unit testing
  – Various test procedures on a diverse range of platforms and interconnects
  – Performance tuning
  – Applications-based evaluation
  – Evaluation on large-scale systems
• Even alpha and beta versions go through the above testing
MVAPICH2 Architecture (Latest Release 2.0a)
[Architecture figure]
– MPI Application
– MVAPICH2 channels:
  – CH3 (OSU enhanced): OFA-IB-CH3, OFA-iWARP-CH3, OFA-RoCE-CH3, PSM-CH3, uDAPL-CH3, TCP/IP-CH3, Shared-Memory-CH3
  – Nemesis: OFA-IB-Nemesis, TCP/IP-Nemesis, Shared-Memory-Nemesis
– Process Managers: mpirun_rsh, mpiexec, mpiexec.hydra
– All different PCI interfaces
Major Computing Platforms: IA-32, EM64T, Nehalem, Westmere, Sandybridge, Opteron, Magny, …
MVAPICH2 1.9 and MVAPICH2-X 1.9
• Released on 05/06/13
• Major Features and Enhancements
  – Based on MPICH-3.0.3
    • Support for MPI-3 features
  – Support for single-copy intra-node communication using Linux-supported CMA (Cross Memory Attach)
    • Provides flexibility for intra-node communication: shared memory, LiMIC2, and CMA
  – Checkpoint/Restart using LLNL's Scalable Checkpoint/Restart Library (SCR)
    • Support for application-level checkpointing
    • Support for hierarchical system-level checkpointing
  – Scalable UD-multicast-based designs and tuned algorithm selection for collectives
  – Improved and tuned MPI communication from GPU device memory
  – Improved job startup time
    • New runtime variable MV2_HOMOGENEOUS_CLUSTER for optimized startup on homogeneous clusters
  – Revamped build system with support for parallel builds
• MVAPICH2-X 1.9 supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models
  – Based on MVAPICH2 1.9 including MPI-3 features; compliant with UPC 2.16.2 and OpenSHMEM v1.0d

MVAPICH2 2.0a and MVAPICH2-X 2.0a
• Released on 08/24/13
• Major Features and Enhancements
  – Based on MPICH-3.0.4
  – Dynamic CUDA initialization; supports GPU device selection after MPI_Init (see the sketch after this list)
  – Support for running on heterogeneous clusters with GPU and non-GPU nodes
  – Support for MPI-3 RMA atomic operations and flush operations with the CH3-Gen2 interface
  – Exposing internal performance variables to the MPI-3 Tools information interface (MPIT)
  – Enhanced MPI_Bcast performance
  – Enhanced performance for large-message MPI_Scatter and MPI_Gather
  – Enhanced intra-node SMP performance
  – Reduced memory footprint
  – Improved job-startup performance
• MVAPICH2-X 2.0a supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models
  – Based on MVAPICH2 2.0a including MPI-3 features; compliant with UPC 2.16.2 and OpenSHMEM v1.0d
  – Improved OpenSHMEM collectives
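A minimal sketch (not from the slides) of how an application might use the dynamic CUDA initialization feature to pick its GPU after MPI_Init. It assumes the MV2_COMM_WORLD_LOCAL_RANK environment variable set by the mpirun_rsh launcher; adapt if a different launcher or local-rank mechanism is used.

```c
/* Illustrative sketch: with dynamic CUDA initialization (MVAPICH2 2.0a),
 * the application can select its GPU after MPI_Init, e.g. using the
 * node-local rank. MV2_COMM_WORLD_LOCAL_RANK is assumed to be set by
 * the launcher (mpirun_rsh); falls back to device 0 otherwise. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const char *s = getenv("MV2_COMM_WORLD_LOCAL_RANK");
    int local_rank = s ? atoi(s) : 0;

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 0)
        cudaSetDevice(local_rank % ndev);   /* one GPU per node-local rank */

    /* ... device buffers can now be passed directly to MPI calls ... */

    MPI_Finalize();
    return 0;
}
```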
One-way Latency: MPI over IB
[Charts: small-message and large-message one-way latency vs. message size for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR, MVAPICH-ConnectX3-PCIe3-FDR, and MVAPICH2-Mellanox-ConnectIB-DualFDR; annotated small-message latencies range from 0.99 us to 1.82 us]
DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, with IB switch
FDR: 2.6 GHz octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch
ConnectIB-DualFDR: 2.6 GHz octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch
Bandwidth: MPI over IB
[Charts: unidirectional and bidirectional bandwidth vs. message size (4 bytes to 1 MB) for the same six configurations; annotated peaks reach 12485 MB/s unidirectional and 21025 MB/s bidirectional with MVAPICH2-Mellanox-ConnectIB-DualFDR]
DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, with IB switch
FDR: 2.6 GHz octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch
ConnectIB-DualFDR: 2.6 GHz octa-core (Sandy Bridge) Intel, PCIe Gen3, with IB switch
MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support: LiMIC and CMA)
[Charts: intra-node performance with the latest MVAPICH2 2.0a on Intel Sandy Bridge. Latency: 0.19 us intra-socket and 0.45 us inter-socket for small messages. Bandwidth: roughly 12,000 MB/s both intra-socket and inter-socket, comparing CMA, shared-memory, and LiMIC paths]
Scalable OpenSHMEM/UPC and Hybrid (MPI, UPC and OpenSHMEM) Designs
• Based on the OpenSHMEM Reference Implementation (http://openshmem.org/) and UPC version 2.14.2 (http://upc.lbl.gov/)
  – These provide a design over GASNet
  – They do not take advantage of all OFED features
• Design scalable and high-performance OpenSHMEM & UPC directly over OFED
• Designing a hybrid MPI + OpenSHMEM/UPC model
• Current model – separate runtimes for OpenSHMEM/UPC and MPI
  – Possible deadlock if both runtimes are not progressed
  – Consumes more network resources
• Our approach – single unified runtime for MPI and OpenSHMEM/UPC (a hybrid sketch follows below)
[Figure: hybrid (UPC/OpenSHMEM + MPI) application making UPC/OpenSHMEM calls and MPI calls through the respective interfaces over a Unified Communication Runtime on an InfiniBand network]
• Hybrid MPI+OpenSHMEM/UPC available since MVAPICH2-X 1.9
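A minimal hybrid MPI + OpenSHMEM sketch (illustrative, not taken from MVAPICH2-X) of the kind of program a unified runtime is designed to support. It uses the OpenSHMEM 1.0 interface and assumes that MPI ranks and OpenSHMEM PEs refer to the same processes, which is the usual model in a hybrid run.

```c
/* Illustrative hybrid MPI + OpenSHMEM sketch.
 * One-sided OpenSHMEM traffic and two-sided MPI collectives in one program;
 * assumes MPI rank i and OpenSHMEM PE i are the same process. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    start_pes(0);                               /* OpenSHMEM 1.0 initialization */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Counter on the symmetric heap, reachable via one-sided operations */
    long *counter = (long *) shmalloc(sizeof(long));
    *counter = 0;
    shmem_barrier_all();

    shmem_long_inc(counter, 0);                 /* every PE increments PE 0's copy */
    shmem_barrier_all();

    /* Two-sided MPI collective alongside the one-sided OpenSHMEM traffic */
    long value = (rank == 0) ? *counter : 0;
    MPI_Bcast(&value, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    printf("rank %d sees counter = %ld\n", rank, value);

    shfree(counter);
    MPI_Finalize();
    return 0;
}
```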
OSU Micro-Benchmarks (OMB)
• Started in 2004 and continuing steadily
• Allows MPI developers and users to
  – Test and evaluate MPI libraries
• Has a wide range of benchmarks
  – Two-sided (MPI-1, MPI-2 and MPI-3) – a ping-pong sketch follows below
  – One-sided (MPI-2 and MPI-3)
  – RMA (MPI-3)
  – Collectives (MPI-1, MPI-2 and MPI-3)
  – Extensions for GPU-aware communication (CUDA and OpenACC)
  – UPC (Pt-to-Pt)
  – OpenSHMEM (Pt-to-Pt and Collectives)
• Widely used in the MPI community
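For illustration, a minimal ping-pong loop in the spirit of osu_latency. This is a sketch, not the actual OMB code, which adds warm-up iterations, configurable message sizes, and more.

```c
/* Minimal ping-pong latency sketch (illustrative); run with exactly 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define ITERS    1000
#define MSG_SIZE 8

int main(int argc, char **argv)
{
    char buf[MSG_SIZE];
    memset(buf, 0, sizeof(buf));

    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency = round-trip time / 2 */
        printf("%d-byte latency: %.2f us\n", MSG_SIZE,
               (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```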
Designing GPU-Aware MPI Library
• OSU started this research and development direction in 2011
• Initial support was provided in MVAPICH2 1.8a (SC '11)
• Since then, many enhancements and new designs related to GPU communication have been incorporated into the 1.8, 1.9 and 2.0a series (a device-buffer sketch follows below)
• Have also extended the OSU Micro-Benchmark Suite (OMB) to test and evaluate GPU-aware MPI communication
  – CUDA
  – OpenACC
• MVAPICH2 design for GPUDirect RDMA (GDR)
  – Available based on 1.9
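For illustration, a minimal sketch of what GPU-aware point-to-point communication looks like from the application side: device pointers are passed directly to MPI calls and the library handles staging, pipelining, or the GPUDirect path internally. This is a generic CUDA-aware MPI example, not MVAPICH2 internals; with MVAPICH2 it assumes CUDA support is enabled at runtime (e.g. MV2_USE_CUDA=1).

```c
/* Illustrative GPU-aware point-to-point sketch; run with 2 ranks,
 * each with a GPU already selected. Device buffers are passed
 * directly to MPI_Send/MPI_Recv. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t n = 1 << 20;                 /* 1M doubles */
    double *d_buf = NULL;
    cudaMalloc((void **) &d_buf, n * sizeof(double));
    cudaMemset(d_buf, 0, n * sizeof(double));

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* device buffer */
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```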
MPI Applications on MIC Clusters
• Flexibility in launching MPI jobs on clusters with Xeon Phi
[Figure: spectrum of MPI execution modes, from multi-core centric (Xeon) to many-core centric (Xeon Phi)]
– Host-only: MPI program runs on the Xeon host only
– Offload (/reverse offload): MPI program on the host with offloaded computation on the Xeon Phi
– Symmetric: MPI programs run on both the host and the Xeon Phi
– Coprocessor-only: MPI program runs on the Xeon Phi only
Data Movement on Intel Xeon Phi Clusters
• Connected as PCIe devices – Flexibility but Complexity
[Figure: two nodes, each with two CPU sockets connected by QPI, MICs attached over PCIe, and an IB adapter]
MPI process communication paths:
1. Intra-Socket
2. Inter-Socket (QPI)
3. Inter-Node
4. Intra-MIC
5. Intra-Socket MIC-MIC
6. Inter-Socket MIC-MIC
7. Inter-Node MIC-MIC
8. Intra-Socket MIC-Host
9. Inter-Socket MIC-Host
10. Inter-Node MIC-Host
11. Inter-Node MIC-MIC with IB adapter on remote socket
… and more
• Critical for runtimes to optimize data movement, hiding the complexity
MVAPICH2-MIC Design for Clusters with IB and Xeon Phi
• Supports Offload, Coprocessor-only, and Symmetric modes
• Intranode Communication
  – Coprocessor-only mode
  – Symmetric mode
• Internode Communication
  – Coprocessors-only
  – Symmetric mode
• Multi-MIC node configurations
• Based on MVAPICH2 1.9
• Being tested and deployed on TACC Stampede
MVAPICH2 – Plans for Exascale
• Performance and memory scalability toward 500K-1M cores
  – Dynamically Connected Transport (DCT) service with Connect-IB
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)
• Enhanced optimization for GPU support and accelerators
• Taking advantage of the collective offload framework
  – Including support for non-blocking collectives (MPI 3.0) – see the sketch below
  – CORE-Direct support
• Extended RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Power-aware collectives
• Extended support for the MPI Tools Interface (as in MPI 3.0)
• Extended Checkpoint-Restart and migration support with SCR
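For illustration, a minimal sketch of an MPI-3 non-blocking collective of the kind the collective-offload work targets: the reduction is overlapped with independent computation, and offload-capable HCAs (e.g. CORE-Direct) can progress it without host involvement. This is a generic MPI-3 example, not MVAPICH2-specific code.

```c
/* Illustrative MPI-3 non-blocking collective sketch: start an Iallreduce,
 * overlap independent work, then complete it with MPI_Wait. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank + 1.0, global = 0.0;
    MPI_Request req;

    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... independent computation overlapped with the reduction ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    if (rank == 0)
        printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```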
Web Pointers
NOWLAB Web Page: http://nowlab.cse.ohio-state.edu
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu