Overview of MVAPICH2 and MVAPICH2-X: Latest Status and Future Roadmap

MVAPICH2 User Group (MUG) Meeting

by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

[Chart: Number of Clusters and Percentage of Clusters in the Top 500 list over the timeline]

Drivers of Modern HPC Cluster Architectures

[Figure: Multi-core Processors; High-Performance Interconnects - InfiniBand (<1 usec latency, >100 Gbps bandwidth); Accelerators/Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip)]

• Multi-core processors are ubiquitous
• InfiniBand very popular in HPC clusters
• Accelerators/Coprocessors becoming common in high-end systems
• Pushing the envelope for Exascale computing

Tianhe-2 (#1), Titan (#2), Stampede (#6), Tianhe-1A (#10)

Parallel Programming Models Overview

[Diagram: three parallel programming models, each with processes P1, P2, P3 - Shared Memory Model (SHMEM, DSM) over a shared memory; Distributed Memory Model (MPI - Message Passing Interface) over per-process memories; Partitioned Global Address Space (PGAS: UPC, Chapel, X10, CAF, ...) over a logical shared memory]

• Programming models provide abstract machine models
• Models can be mapped onto different types of systems
  – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance

Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

[Diagram: co-design stack, with co-design opportunities and challenges spanning the various layers]
• Application Kernels/Applications
• Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Hadoop, MapReduce, etc.
• Middleware: Communication Library or Runtime for Programming Models - Point-to-point Communication (two-sided & one-sided), Collective Communication, Synchronization & Locks, I/O & File Systems, Fault Tolerance
• Networking Technologies (InfiniBand, 10/40GigE, Aries, BlueGene), Multi-/Many-core Architectures, Accelerators (NVIDIA and MIC)

Designing (MPI+X) at Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely small memory footprint
• Hybrid programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, ...) (see the sketch after this list)
• Balancing intra-node and inter-node communication for next-generation multi-core nodes (128-1024 cores/node)
  – Multiple end-points per node
• Support for efficient multi-threading
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
  – Power-aware
• Support for the MPI-3 RMA model
• Support for GPGPUs and accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
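To make the MPI+X hybrid style in the list above concrete, here is a minimal sketch that combines MPI across processes with OpenMP threads within a process. It is a generic example, not MVAPICH2-specific code; the loop bounds and the reduction are placeholders.

```c
/* Minimal MPI + OpenMP hybrid sketch (illustrative, not MVAPICH2-specific).
 * Each MPI rank spawns OpenMP threads for the intra-node portion of the work,
 * then the partial results are combined across ranks with MPI_Reduce. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* Request thread support because OpenMP threads exist alongside MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long local_sum = 0;
    /* Intra-node parallelism: OpenMP threads share the rank's memory. */
    #pragma omp parallel for reduction(+:local_sum)
    for (long i = 0; i < 1000000; i++)
        local_sum += i % 7;

    long global_sum = 0;
    /* Inter-node parallelism: combine per-rank results with MPI. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %ld\n", global_sum);

    MPI_Finalize();
    return 0;
}
```

MPI_THREAD_FUNNELED is requested because only the master thread of each rank makes MPI calls in this pattern.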

MVAPICH2/MVAPICH2-X Software

• http://mvapich.cse.ohio-state.edu
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Used by more than 2,077 organizations (HPC centers, industry, and universities) in 70 countries

MVAPICH Team Projects

[Timeline chart of MVAPICH team projects: MVAPICH (now end-of-life), MVAPICH2, MVAPICH2-X, and OMB, spanning roughly Oct-00 through Aug-13]

MVAPICH/MVAPICH2 Release Timeline and Downloads

[Chart: total number of downloads over time, annotated with releases from MV 0.9.4 and MV2 0.9.0 through MV2 1.9 and MV2 2.0a]
• Download counts from the MVAPICH2 website

MVAPICH2/MVAPICH2-X Software (Cont'd)

• Available with software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
• Empowering many TOP500 clusters
  – 6th-ranked 462,462-core cluster (Stampede) at TACC
  – 19th-ranked 125,980-core cluster (Pleiades) at NASA
  – 21st-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
  – and many others
• Partner in the U.S. NSF-TACC Stampede system

Strong Procedure for Design, Development, and Release

• Research is done to explore new designs
• Designs are first presented in conference/journal publications
• Best-performing designs are incorporated into the codebase
• Rigorous Q&A procedure before making a release
  – Exhaustive unit testing
  – Various test procedures on a diverse range of platforms and interconnects
  – Performance tuning
  – Application-based evaluation
  – Evaluation on large-scale systems
• Even alpha and beta versions go through the above testing

MVAPICH2 Architecture (Latest Release 2.0)

[Diagram: MVAPICH2 architecture]
• MPI applications run over MVAPICH2 with two channels: CH3 (OSU enhanced) and Nemesis
• CH3 interfaces: OFA-IB-CH3, OFA-iWARP-CH3, OFA-RoCE-CH3, PSM-CH3, uDAPL-CH3, TCP/IP-CH3, Shared-Memory-CH3
• Nemesis interfaces: OFA-IB-Nemesis, TCP/IP-Nemesis, Shared-Memory-Nemesis
• Process managers: mpirun_rsh, mpiexec, mpiexec.hydra
• All different PCI interfaces
• Major computing platforms: IA-32, EM64T, Nehalem, Westmere, Sandybridge, Opteron, Magny, ...

MVAPICH2 1.9 and MVAPICH2-X 1.9

• Released on 05/06/13
• Major Features and Enhancements
  – Based on MPICH-3.0.3
    • Support for MPI-3 features
  – Support for single-copy intra-node communication using Linux-supported CMA (Cross Memory Attach)
    • Provides flexibility for intra-node communication: shared memory, LiMIC2, and CMA
  – Checkpoint/Restart using LLNL's Scalable Checkpoint/Restart Library (SCR)
    • Support for application-level checkpointing
    • Support for hierarchical system-level checkpointing
  – Scalable UD-multicast-based designs and tuned algorithm selection for collectives
  – Improved and tuned MPI communication from GPU device memory
  – Improved job startup time
    • Provided a new runtime variable MV2_HOMOGENEOUS_CLUSTER for optimized startup on homogeneous clusters
  – Revamped build system with support for parallel builds
• MVAPICH2-X 1.9 supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models
  – Based on MVAPICH2 1.9 including MPI-3 features; compliant with UPC 2.16.2 and OpenSHMEM v1.0d

MVAPICH2 2.0a and MVAPICH2-X 2.0a

• Released on 08/24/13
• Major Features and Enhancements
  – Based on MPICH-3.0.4
  – Dynamic CUDA initialization; supports GPU device selection after MPI_Init
  – Support for running on heterogeneous clusters with GPU and non-GPU nodes
  – Support for MPI-3 RMA atomic operations and flush operations with the CH3-Gen2 interface (see the sketch after this list)
  – Exposing internal performance variables to the MPI-3 Tools information interface (MPI_T)
  – Enhanced MPI_Bcast performance
  – Enhanced performance for large-message MPI_Scatter and MPI_Gather
  – Enhanced intra-node SMP performance
  – Reduced memory footprint
  – Improved job-startup performance
• MVAPICH2-X 2.0a supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models
  – Based on MVAPICH2 2.0a including MPI-3 features; compliant with UPC 2.16.2 and OpenSHMEM v1.0d
  – Improved OpenSHMEM collectives
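As an illustration of the MPI-3 RMA atomic and flush support noted in the 2.0a list above, the sketch below atomically increments a counter exposed by rank 0 inside a passive-target epoch and completes the operation with MPI_Win_flush. It uses only standard MPI-3 calls; the counter and window names are illustrative, not MVAPICH2-specific.

```c
/* Minimal MPI-3 RMA sketch: atomic fetch-and-add plus flush in a
 * passive-target epoch (illustrative; not MVAPICH2-specific). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long counter = 0;                 /* exposed by rank 0 */
    MPI_Win win;
    MPI_Win_create(&counter, sizeof(long), sizeof(long),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Passive-target epoch on rank 0's window. */
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);

    long one = 1, old_value = 0;
    /* Atomic fetch-and-add on the counter at rank 0. */
    MPI_Fetch_and_op(&one, &old_value, MPI_LONG, 0, 0, MPI_SUM, win);
    /* MPI-3 flush: complete the operation at the target without
     * closing the epoch. */
    MPI_Win_flush(0, win);

    MPI_Win_unlock(0, win);

    printf("rank %d saw counter value %ld before its increment\n",
           rank, old_value);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```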

One-way Latency: MPI over IB

[Charts: Small Message Latency and Large Message Latency; x-axis Message Size (bytes), y-axis Latency (us); series: MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR, MVAPICH-ConnectX3-PCIe3-FDR, MVAPICH2-Mellanox-ConnectIB-DualFDR; annotated small-message latencies range from 0.99 us to 1.82 us]

DDR, QDR: 2.4 GHz Quad-core (Westmere) Intel, PCI Gen2, with IB switch
FDR: 2.6 GHz Octa-core (Sandybridge) Intel, PCI Gen3, with IB switch
ConnectIB-Dual FDR: 2.6 GHz Octa-core (Sandybridge) Intel, PCI Gen3, with IB switch

Bandwidth: MPI over IB

[Charts: Unidirectional Bandwidth and Bidirectional Bandwidth; x-axis Message Size (bytes), y-axis Bandwidth (MBytes/sec); same series as above; peak unidirectional bandwidths range from 1706 to 12485 MB/s, peak bidirectional bandwidths from 3341 to 21025 MB/s]

DDR, QDR: 2.4 GHz Quad-core (Westmere) Intel, PCI Gen2, with IB switch
FDR: 2.6 GHz Octa-core (Sandybridge) Intel, PCI Gen3, with IB switch
ConnectIB-Dual FDR: 2.6 GHz Octa-core (Sandybridge) Intel, PCI Gen3, with IB switch

MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support: LiMIC and CMA)

[Charts: intra-node latency and bandwidth with the latest MVAPICH2 2.0a on Intel Sandy Bridge. Latency: 0.19 us intra-socket and 0.45 us inter-socket for small messages (0 bytes to 1 KB). Bandwidth (intra-socket and inter-socket): roughly 12,000 MB/s, comparing CMA, shared-memory, and LiMIC channels]

Scalable OpenSHMEM/UPC and Hybrid (MPI, UPC and OpenSHMEM) Designs

[Diagram: a hybrid (UPC/OpenSHMEM + MPI) application issues UPC/OpenSHMEM calls and MPI calls through their respective interfaces over a Unified Communication Runtime on the InfiniBand network]

• Based on the OpenSHMEM Reference Implementation (http://openshmem.org/) and UPC version 2.14.2 (http://upc.lbl.gov/)
  – Provides a design over GASNet
  – Does not take advantage of all OFED features
• Design scalable and high-performance OpenSHMEM and UPC over OFED
• Designing a hybrid MPI + OpenSHMEM/UPC model (see the sketch after this list)
• Current model: separate runtimes for OpenSHMEM/UPC and MPI
  – Possible deadlock if both runtimes are not progressed
  – Consumes more network resources
• Our approach: a single unified runtime for MPI and OpenSHMEM/UPC
  – Available since MVAPICH2-X 1.9
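A hedged sketch of the hybrid MPI + OpenSHMEM model discussed above is shown below: each process issues both OpenSHMEM one-sided puts and MPI collectives, which a unified runtime such as MVAPICH2-X is designed to service together. The symmetric variable and the mixing of MPI_Init with the OpenSHMEM 1.0-style start_pes are illustrative assumptions; exact initialization requirements depend on the runtime.

```c
/* Minimal hybrid MPI + OpenSHMEM sketch (illustrative).
 * With a unified runtime such as MVAPICH2-X, the same processes can mix
 * MPI two-sided/collective calls with OpenSHMEM one-sided puts. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    start_pes(0);                       /* OpenSHMEM 1.0-style initialization */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    static long value = 0;              /* static, hence symmetric for OpenSHMEM */

    /* One-sided: every PE puts its rank into 'value' on the next PE. */
    long my_rank = rank;
    shmem_long_put(&value, &my_rank, 1, (rank + 1) % size);
    shmem_barrier_all();

    /* Two-sided/collective: combine the received values with MPI. */
    long sum = 0;
    MPI_Allreduce(&value, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of put values = %ld\n", sum);

    MPI_Finalize();
    return 0;
}
```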

OSU Micro-Benchmarks (OMB)

• Started in 2004 and continuing steadily
• Allows MPI developers and users to
  – Test and evaluate MPI libraries
• Has a wide range of benchmarks
  – Two-sided (MPI-1, MPI-2 and MPI-3)
  – One-sided (MPI-2 and MPI-3)
  – RMA (MPI-3)
  – Collectives (MPI-1, MPI-2 and MPI-3)
  – Extensions for GPU-aware communication (CUDA and OpenACC)
  – UPC (Pt-to-Pt)
  – OpenSHMEM (Pt-to-Pt and Collectives)
• Widely used in the MPI community
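As a rough illustration of what a two-sided latency benchmark in this suite measures, the following self-contained sketch times a ping-pong between two ranks and reports the one-way latency. It is not the actual osu_latency source; the iteration count and message size are arbitrary.

```c
/* Simple MPI ping-pong latency sketch between ranks 0 and 1
 * (illustrative; not the actual osu_latency code). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 1000
#define MSG_SIZE 8          /* bytes; arbitrary small-message size */

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(MSG_SIZE);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - start;
    if (rank == 0)   /* one-way latency = half the round-trip time */
        printf("%d-byte one-way latency: %.2f us\n",
               MSG_SIZE, elapsed * 1e6 / (2.0 * ITERS));

    free(buf);
    MPI_Finalize();
    return 0;
}
```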

Designing GPU-Aware MPI Library

• OSU started this research and development direction in 2011
• Initial support was provided in MVAPICH2 1.8a (SC '11)
• Since then, many enhancements and new designs related to GPU communication have been incorporated into the 1.8, 1.9, and 2.0a series (a minimal GPU-aware transfer is sketched below)
• The OSU Micro-Benchmark Suite (OMB) has also been extended to test and evaluate GPU-aware MPI communication
  – CUDA
  – OpenACC
• MVAPICH2 design for GPUDirect RDMA (GDR)
  – Available based on 1.9
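The key property of a GPU-aware MPI library is that buffers in GPU device memory can be passed directly to MPI calls, with the library handling staging or GPUDirect paths internally. The sketch below assumes a CUDA-aware MPI build such as MVAPICH2 configured with CUDA support; the buffer size and tag are illustrative. With MVAPICH2, GPU support is typically enabled at run time with MV2_USE_CUDA=1.

```c
/* GPU-aware MPI sketch: device pointers passed directly to MPI_Send/MPI_Recv.
 * Assumes a CUDA-aware MPI build (e.g., MVAPICH2 configured with CUDA support). */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define N (1 << 20)   /* number of floats; arbitrary */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *d_buf;
    cudaMalloc((void **)&d_buf, N * sizeof(float));   /* GPU device memory */
    cudaMemset(d_buf, 0, N * sizeof(float));

    if (rank == 0) {
        /* The device pointer goes straight into MPI; the library handles
         * host staging or the GPUDirect path internally. */
        MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d floats into GPU memory\n", N);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```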

MPI Applications on MIC Clusters

• Flexibility in launching MPI jobs on clusters with Xeon Phi

[Diagram: ways of running MPI programs across Xeon hosts and Xeon Phi coprocessors, from multi-core centric to many-core centric: Host-only MPI program; Offload (/reverse offload) - MPI program on the host with offloaded computation; Symmetric - MPI programs on both host and coprocessor; Coprocessor-only MPI program]

Data Movement on Intel Xeon Phi Clusters

• Connected as PCIe devices – Flexibility but Complexity

[Diagram: Node 0 and Node 1, each with CPUs connected via QPI, MICs attached over PCIe, and an IB adapter]

MPI process communication paths:
1. Intra-Socket
2. Inter-Socket
3. Inter-Node
4. Intra-MIC
5. Intra-Socket MIC-MIC
6. Inter-Socket MIC-MIC
7. Inter-Node MIC-MIC
8. Intra-Socket MIC-Host
9. Inter-Socket MIC-Host
10. Inter-Node MIC-Host
11. Inter-Node MIC-MIC with IB adapter on remote socket
... and more

• Critical for runtimes to optimize data movement, hiding the complexity

MVAPICH2-MIC Design for Clusters with IB and Xeon Phi

• Offload Mode
• Coprocessor-only Mode
• Symmetric Mode
• Intranode Communication
• Internode Communication
  – Coprocessors-only
  – Symmetric Mode
• Multi-MIC Node Configurations
• Based on MVAPICH2 1.9
• Being tested and deployed on TACC Stampede

MVAPICH2 – Plans for Exascale

• Performance and memory scalability toward 500K-1M cores
  – Dynamically Connected Transport (DCT) service with Connect-IB
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)
• Enhanced optimization for GPU support and accelerators
• Taking advantage of the collective offload framework
  – Including support for non-blocking collectives (MPI 3.0) (see the sketch after this list)
  – Core-Direct support
• Extended RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Power-aware collectives
• Extended support for the MPI Tools Interface (as in MPI 3.0)
• Extended Checkpoint-Restart and migration support with SCR
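Non-blocking collectives, listed in the roadmap above, allow communication to be overlapped with computation; the minimal MPI-3 sketch below posts an MPI_Ibcast, performs unrelated work, and then waits on the request. The buffer size and the placeholder work loop are illustrative.

```c
/* Minimal MPI-3 non-blocking collective sketch (illustrative):
 * overlap an MPI_Ibcast with independent computation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data[1024] = {0};
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        for (int i = 0; i < 1024; i++) data[i] = i;

    /* Post the broadcast without blocking. */
    MPI_Ibcast(data, 1024, MPI_INT, 0, MPI_COMM_WORLD, &req);

    /* ... independent computation can proceed here, overlapping with
     * the collective (e.g., progressed by a hardware offload framework). */
    double local_work = 0.0;
    for (int i = 0; i < 100000; i++) local_work += i * 0.5;

    /* Complete the collective before using 'data'. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 1)
        printf("data[42] = %d (local_work = %.1f)\n", data[42], local_work);

    MPI_Finalize();
    return 0;
}
```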

Web Pointers

NOWLAB Web Page: http://nowlab.cse.ohio-state.edu

MVAPICH Web Page: http://mvapich.cse.ohio-state.edu
