MVAPICH2 Project: Latest Status and Future Plans

Presentation at MPICH BOF (SC '11) by Dhabaleswar K. (DK) Panda
Department of Computer Science and Engineering, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Overview

• Working closely with ANL on the MVAPICH and MVAPICH2 projects for the last 11 years
• Focus is on high-performance implementations on emerging interconnects
  – InfiniBand, iWARP/10GigE and RoCE (RDMA over Converged Enhanced Ethernet)
• Both versions are available directly from OSU
  – http://mvapich.cse.ohio-state.edu
  – Public mailing list (mvapich-discuss) with archives

• Also available with the OpenFabrics Enterprise Distribution (OFED)
  – http://www.openfabrics.org
  – Public mailing list (openfabrics-ewg) with archives

• Also available from server vendors, interconnect vendors and distributors

MVAPICH/MVAPICH2 Open-Source Software Distribution

• Primary focus is on MVAPICH2
• Latest releases
  – MVAPICH2 1.7
  – MVAPICH2 1.8a1p1
• MVAPICH2 is included in the latest OFED 1.5.4
• Used by more than 1,810 organizations in 65 countries (registered with OSU directly)
• More than 85,000 downloads from the OSU site directly
• During the last year alone, 500 new organizations have registered and more than 25,000 downloads have taken place
• Empowering many TOP500 clusters (5th, 7th, 25th, 39th, ...)

Latest MVAPICH2 1.7 – Major Features

• Released on 10/14/11
• Major features (compared to MVAPICH2 1.6)
  – Hybrid UD-RC/XRC support to get the best performance on large-scale systems with a reduced/constant memory footprint
  – HugePage support
  – Optimized fence synchronization and shared-memory backed windows for one-sided communication (see the sketch after this list)
  – Improved intra-node shared-memory communication performance
  – Minimized number of connections and memory footprint
  – Enhancement/optimization of algorithms and tuning for collectives (Barrier, Bcast, Gather, Allgather, Reduce, Allreduce, Alltoall and Allgatherv)
  – Fast migration using RDMA
  – Support for large data transfers (>2GB)
  – Integrated with enhanced LiMIC2 (v0.5.5) to support intra-node large-message (>2GB) transfers
  – Improved connection management
  – Support for the Chelsio T4 adapter
  – Improved pt-to-pt communication and multi-core-aware collective support (QLogic PSM interface)
  – Based on MPICH2 1.4.1p1
  – Hwloc v1.2.2
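The optimized fence path applies to standard MPI-2 one-sided code. A minimal sketch in C; the window size, target rank and helper-function name are illustrative, not taken from MVAPICH2 itself:

/* Minimal one-sided communication sketch using window creation and
 * fence synchronization (illustrative parameters only). */
#include <mpi.h>

#define WIN_COUNT 1024

void put_with_fence(double *winbuf, double *src, int target, MPI_Comm comm)
{
    MPI_Win win;

    /* Expose winbuf for remote access; per the feature list above,
     * MVAPICH2 1.7 can back such windows with shared memory. */
    MPI_Win_create(winbuf, WIN_COUNT * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, comm, &win);

    MPI_Win_fence(0, win);                       /* open access epoch */
    MPI_Put(src, WIN_COUNT, MPI_DOUBLE, target,
            0, WIN_COUNT, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                       /* complete the Put */

    MPI_Win_free(&win);
}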

MVAPICH2-1.7 Performance with MPI-Level Two-Sided Communication – IB Mellanox QDR

[Figures: point-to-point latency, uni-directional bandwidth and bi-directional bandwidth vs. message size on ConnectX-QDR]
• Small-message latency: 1.55 us
• Uni-directional bandwidth: 3394 MB/sec
• Bi-directional bandwidth: 6537 MB/sec
• Intel Westmere (2.67 GHz), PCIe-Gen2
• Results for other platforms at http://mvapich.cse.ohio-state.edu

MVAPICH2-1.7 Intra-node Performance (Two-sided with LiMIC2)

[Figures: intra-node latency and bandwidth vs. message size, intra-socket and inter-socket]
• Small-message latencies: 0.24 us and 0.55 us (intra-socket vs. inter-socket)
• Peak bandwidths: 9976 MB/sec and 9596 MB/sec
• Intel Westmere 2.67 GHz Quad-core

Results for other platforms at http://mvapich.cse.ohio-state.edu

Shared-memory Aware Collectives (4K cores on TACC Ranger with MVAPICH2)

[Figures: MPI_Reduce and MPI_Allreduce latency (us) vs. message size (4 bytes to 8 KB) on 4096 cores, comparing the original and shared-memory aware designs]
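The collective measured above is a standard MPI_Allreduce. A minimal timing sketch in C; the iteration count and message size are chosen for illustration rather than taken from the benchmark:

/* Minimal MPI_Allreduce timing sketch (illustrative parameters only). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, iters = 1000, count = 128;      /* 128 ints, ~512-byte messages */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *sendbuf = malloc(count * sizeof(int));
    int *recvbuf = malloc(count * sizeof(int));
    for (int i = 0; i < count; i++) sendbuf[i] = rank + i;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(sendbuf, recvbuf, count, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("Average MPI_Allreduce latency: %f us\n", (t1 - t0) * 1e6 / iters);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}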

Application Example: OMEN on TACC using MVAPICH

• OMEN is a two- and three-dimensional Schrödinger-Poisson-based solver
• Used in semiconductor modeling
• Scaled to almost 60K tasks

Courtesy: Mathieu Luisier, Gerhard Klimeck, Purdue
http://www.tacc.utexas.edu/RangerImpact/pdf/Save_Our_Semiconductors.pdf

Enhanced AWM-ODP (Earthquake Modeling Application) with One-Sided Operations

• Experiments on the TACC Ranger cluster
  – 64x64x64 data grid per process
  – 25 iterations
  – 32KB messages
• On 4K processes
  – 11% improvement with Async-2sided and 12% with Async-1sided (RMA)
• On 8K processes
  – 6% improvement with Async-2sided and 10% with Async-1sided (RMA)
• Joint work with OSU, SDSC and TACC
• Gordon Bell Finalist for Supercomputing 2010
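The "Async-2sided" variant overlaps the halo exchange with computation using non-blocking point-to-point calls; the "Async-1sided" variant uses RMA (MPI_Put) inside an exposure epoch instead of the sends. A minimal two-sided overlap sketch in C; the neighbor ranks, message size and function name are illustrative, not the application's actual code:

/* Illustrative halo-exchange overlap: post nonblocking sends/receives,
 * compute on the interior, then wait before touching the halo data. */
#include <mpi.h>

#define HALO_COUNT 4096   /* 4096 doubles = 32KB per message (illustrative) */

void exchange_and_compute(double *sendbuf, double *recvbuf,
                          int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[4];

    MPI_Irecv(recvbuf,              HALO_COUNT, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(recvbuf + HALO_COUNT, HALO_COUNT, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    MPI_Isend(sendbuf,              HALO_COUNT, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Isend(sendbuf + HALO_COUNT, HALO_COUNT, MPI_DOUBLE, right, 0, comm, &reqs[3]);

    /* ... update interior grid points here, overlapping with communication ... */

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    /* ... update boundary grid points that depend on the received halos ... */
}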

Latest MVAPICH2 1.8a1p1 – Features

• Released on 11/14/11
• Major features (compared to MVAPICH2 1.7)
  – Support for MPI communication from NVIDIA GPU device memory (see the sketch after this list)
    • High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
    • High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
    • Communication with contiguous datatypes
  – Reduced memory footprint of the library
  – Enhanced one-sided communication design with reduced memory requirement
  – Enhancements and tuned collectives (Bcast and Alltoallv)
  – Update to Hwloc v1.3.0
  – Flexible HCA selection with the Nemesis interface
  – Support for iWARP interoperability between Intel NE020 and Chelsio T4 adapters
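With this support, MPI point-to-point calls accept GPU device pointers directly when the library is built with CUDA support; without it, data would have to be staged through host memory by the application. A minimal sketch in C with CUDA; the buffer size and ranks are illustrative:

/* Sketch of MPI communication directly from GPU device memory
 * (assumes an MPI library with CUDA support, e.g. MVAPICH2 1.8a). */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    const int count = 1 << 20;            /* 1M floats, illustrative */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *d_buf;
    cudaMalloc((void **)&d_buf, count * sizeof(float));

    if (rank == 0)
        MPI_Send(d_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* device pointer */
    else if (rank == 1)
        MPI_Recv(d_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}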

GPU-GPU: MVAPICH2 Communication Performance

[Figures: latency, uni-directional bandwidth and bi-directional bandwidth vs. message size for Device-Device, Device-Host and Host-Device transfers]
• MVAPICH2 1.8a1p1
• OSU MPI Micro-Benchmarks 3.5
  – Extended to support D-D, D-H and H-D operations
• Intel Westmere nodes with Mellanox IB QDR NIC, NVIDIA Tesla C2050 GPU, CUDA Toolkit 4.1RC1

MVAPICH2 – Future Plans

• Performance and memory scalability toward 500K-1M cores
• Unified support for PGAS models and languages (UPC, OpenSHMEM, etc.)
  – A prototype of unified MPI+UPC is available (presented at PGAS '10 and PGAS '11)
• Enhanced optimization for GPU support and accelerators
• Taking advantage of the Collective Offload framework in ConnectX-2
  – Including support for non-blocking collectives (see the sketch after this list)
• Extended topology-aware collectives
• Power-aware collectives
• Enhanced multi-rail designs
• Automatic optimization of collectives
  – LiMIC2, XRC, Hybrid (UD-RC/XRC) and multi-rail
• Checkpoint-restart and migration support with incremental checkpointing
• QoS-aware I/O and checkpointing
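Non-blocking collectives let an offloaded operation progress while the host computes. A sketch in C using the MPI-3 non-blocking collective interface (MPI_Iallreduce) to illustrate the overlap pattern; the prototype interface available at the time may have differed, and the function name is illustrative:

/* Illustration of overlapping a non-blocking collective with computation
 * (MPI-3 interface shown; illustrative of what collective offload enables). */
#include <mpi.h>

void overlapped_allreduce(double *local, double *global, int count,
                          MPI_Comm comm)
{
    MPI_Request req;

    /* Start the reduction; with offload-capable hardware it can progress
     * without involving the host CPU. */
    MPI_Iallreduce(local, global, count, MPI_DOUBLE, MPI_SUM, comm, &req);

    /* ... do independent computation here while the collective progresses ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* result available in 'global' */
}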


Web Pointers

MVAPICH Web Page http://mvapich.cse.ohio-state.edu/

E-mail: [email protected]
