
MVAPICH/MVAPICH2 Update

Presentation at the Open Fabrics Developers Conference (Nov. '07)

Dhabaleswar K. (DK) Panda
Department of Computer Science and Engineering
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Presentation Overview

• Overview of MVAPICH/MVAPICH2 Project
• Features of MVAPICH 1.0 and MVAPICH2 1.0
• Sample Performance Numbers
  – Point-to-point (Mellanox and QLogic)
  – Point-to-point with Intel Connects Cable
  – Point-to-point with Obsidian IB-WAN
  – Multi-core-aware Optimized Collectives
  – UD-based Design
  – Hot-spot Avoidance Mechanism (HSAM)
• Upcoming Features and Issues
  – XRC support
  – Enhanced UD-based Design
  – Asynchronous Progress
• Conclusions

MVAPICH/MVAPICH2 Software Distribution

• High-performance and scalable implementations
  – MPI-1 (MVAPICH)
  – MPI-2 (MVAPICH2)
• Both are available with OFED
• With OFED 1.3:
  – MVAPICH 1.0-beta
  – MVAPICH2 1.0
• Directly downloaded and used by more than 580 organizations in 42 countries
• Empowering many production and TOP500 clusters. Examples include:
  – 3rd-ranked TOP500 system (14,336 cores) in the Nov '07 list, delivering 126.9 TFlops (MVAPICH)
  – 8,192-core cluster at NCSA (MVAPICH2)
• More details at http://mvapich.cse.ohio-state.edu

New Features of MVAPICH 1.0

• Asynchronous progress
  – Provides better overlap between computation and communication
• Flexible message coalescing
  – Coalescing can be enabled or disabled
  – Allows varying degrees of coalescing
• UD-based support
  – Delivers good performance with a constant memory footprint for communication contexts
• Support for Automatic Path Migration (APM)
• Multi-core optimizations for collectives
• Enhanced mpirun_rsh for scalable launching
  – Provides a two-level approach (nodes, and cores within a node)
• Support for ConnectX
• Support for QLogic/PSM

New Features of MVAPICH2 1.0

• Message coalescing support
• Hot-spot avoidance mechanism for alleviating network congestion in large clusters
• Application-initiated system-level checkpoint
  – In addition to the automatic system-level checkpoint available since 0.9.8
• Automatic Path Migration (APM) support
• RDMA Read support
• Blocking mode of progress
• Multi-rail support for iWARP
• RDMA CM-based connection management (Gen2-IB and Gen2-iWARP)
• On-demand connection management for uDAPL (including Solaris)

Support for Multiple Interfaces/Adapters

• OpenFabrics/Gen2-IB
  – All IB adapters supporting Gen2
  – ConnectX
• QLogic/PSM
• uDAPL
  – Linux-IB
  – Solaris-IB
  – Other adapters such as NetEffect 10GigE
• OpenFabrics/Gen2-iWARP
  – Chelsio
• VAPI
  – All IB adapters supporting VAPI
• TCP/IP
  – Any adapter supporting the TCP/IP interface
• Shared-memory channel (MVAPICH), for running applications within a node with multi-core processors

Presentation Overview

• Overview of MVAPICH/MVAPICH2 Project
• Features of MVAPICH 1.0 and MVAPICH2 1.0
• Sample Performance Numbers
  – Point-to-point (Mellanox and QLogic)
  – Point-to-point with Intel Connects Cable
  – Point-to-point with Obsidian IB-WAN
  – Multi-core-aware Optimized Collectives
  – UD-based Design
  – Hot-spot Avoidance Mechanism (HSAM)
• Upcoming Features and Issues
  – XRC support
  – Enhanced UD-based Design
  – Asynchronous Progress
  – Passive synchronization support
• Conclusions

MPI-level Latency (One-way): IBA (Mellanox and QLogic)

[Figure: small-message and large-message one-way MPI latency on 2.33 GHz quad-core Intel systems with an IB switch; curves for MVAPICH-SDR, MVAPICH-DDR, MVAPICH/QLogic, and MVAPICH-ConnectX; small-message latencies range from about 1.5 us to about 4 us across the four configurations]

• From various papers: SC '03, Hot Interconnect '04, IEEE Micro (Jan-Feb '05), one of the best papers from HotI '04
• Also available from the `Performance' link of the MVAPICH web page
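Latency numbers such as these are gathered with a standard ping-pong microbenchmark (in the style of the OSU micro-benchmarks' osu_latency). A minimal sketch of such a test using only standard MPI calls is shown below; the message size, iteration count, and warm-up handling are illustrative assumptions, not the exact benchmark behind these slides.

/* Minimal ping-pong latency sketch (illustrative; standard MPI only).
 * Rank 0 sends a message to rank 1 and waits for the echo; half the
 * round-trip time is reported as the one-way latency. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, iters = 1000, warmup = 100;
    int size = 8;                        /* assumed message size in bytes */
    char *buf;
    double t_start = 0.0, t_end = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(size);

    for (int i = 0; i < iters + warmup; i++) {
        if (i == warmup) {               /* start timing after warm-up */
            MPI_Barrier(MPI_COMM_WORLD);
            t_start = MPI_Wtime();
        }
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t_end = MPI_Wtime();

    if (rank == 0)
        printf("One-way latency: %.2f us\n",
               (t_end - t_start) * 1e6 / (2.0 * iters));

    free(buf);
    MPI_Finalize();
    return 0;
}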

MPI-level Bandwidth (Uni-directional): IBA (Mellanox and QLogic)

[Figure: uni-directional MPI bandwidth vs. message size (4 bytes to 1 MB) on 2.33 GHz quad-core Intel systems with an IB switch; curves for MVAPICH-SDR, MVAPICH-DDR, MVAPICH/QLogic, and MVAPICH-ConnectX; peak bandwidths of 956, 970, 1389, and 1405 MillionBytes/sec appear across the four configurations]
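Bandwidth numbers like these are typically measured with a windowed streaming test (in the style of osu_bw) rather than ping-pong: the sender keeps several nonblocking sends in flight before waiting. A minimal sketch using only standard MPI calls follows; the window size, message size, and iteration count are illustrative assumptions.

/* Minimal streaming bandwidth sketch (illustrative; standard MPI only).
 * Rank 0 posts a window of nonblocking sends; rank 1 posts matching
 * receives and returns a short acknowledgement per window. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 64                 /* assumed number of in-flight messages */

int main(int argc, char **argv)
{
    int rank, iters = 100;
    int size = 1 << 20;           /* assumed message size: 1 MB */
    char *buf, ack;
    MPI_Request reqs[WINDOW];
    double t_start, t_end;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc((size_t)size * WINDOW);   /* one slot per window entry */

    MPI_Barrier(MPI_COMM_WORLD);
    t_start = MPI_Wtime();

    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf + (size_t)w * size, size, MPI_CHAR, 1, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf + (size_t)w * size, size, MPI_CHAR, 0, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }

    t_end = MPI_Wtime();
    if (rank == 0)
        printf("Bandwidth: %.1f MillionBytes/sec\n",
               (double)size * WINDOW * iters / (t_end - t_start) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}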

MPI-level Bandwidth (Bi-directional): IBA (Mellanox and QLogic)

[Figure: bi-directional MPI bandwidth vs. message size (4 bytes to 1 MB) on 2.33 GHz quad-core Intel systems with an IB switch; curves for MVAPICH-SDR, MVAPICH-DDR, MVAPICH/QLogic, and MVAPICH/ConnectX; peak bandwidths of 1560, 1841, 2628, and 2712 MillionBytes/sec appear across the four configurations]

MVAPICH-PSM Performance: AMD Opteron with HT

[Figure: small-message latency, uni-directional bandwidth, and bi-directional bandwidth for MVAPICH-InfiniPath vs. InfiniPath-MPI on AMD Opteron systems with HyperTransport; both MPI stacks achieve about 1.24 us small-message latency, about 950 MillionBytes/sec uni-directional bandwidth, and about 1888 MillionBytes/sec bi-directional bandwidth]

MPI-level Latency (One-way): iWARP with Chelsio

[Figure: one-way MPI latency vs. message size on 2.0 GHz quad-core Intel systems with a 10GigE (Fulcrum) switch; curves for MPICH2 and MVAPICH2]

MVAPICH2 gives a small-message latency of about 7.39 us, compared to about 28.5 us for MPICH2.

MPI-level Bandwidth: iWARP with Chelsio

[Figure: uni-directional and bi-directional MPI bandwidth vs. message size on 2.0 GHz quad-core Intel systems with a 10GigE (Fulcrum) switch; curves for MPICH2 and MVAPICH2]

Peak uni-directional bandwidth of about 1231 MillionBytes/sec; peak bi-directional bandwidth of about 2380 MillionBytes/sec.

Performance with Intel Connects IB Cable (MPI Latency with switch)

[Figure: MPI latency with Intel Connects IB cable through a switch; annotated latencies of 2.94 us, 3.14 us, and 3.28 us]

Performance with Intel Connects IB Cable (MPI Bandwidth)

[Figure: MPI bandwidth with Intel Connects cable, back-to-back and with a switch]

Intel Connects fiber optic cables closely match the performance of copper cables.

IB WAN: Obsidian Routers

[Diagram: Cluster A and Cluster B connected over a WAN link through two Obsidian WAN routers with variable delay]

  Delay (us)   Distance (km)
  10           2
  100          20
  1000         200
  10000        2000

[Figure: MPI bi-directional bandwidth over the IB-WAN link with varying protocol threshold, at 1 ms delay]

Up to 80% performance difference for 8 KB messages with and without protocol tuning.

MPI_Bcast Latency

[Figure: MPI_Bcast latency vs. message size on 32-node and 64-node configurations (32x4, 32x8, 64x4, 64x8 process layouts), with and without the shared-memory-based design]

- Using shared memory improves the performance of MPI_Bcast on 512 cores by more than two times
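These broadcast latencies come from a simple collective timing loop (in the style of osu_bcast). A minimal sketch using only standard MPI calls is shown below; the iteration count and message-size sweep are illustrative assumptions, and whether MVAPICH's shared-memory-based algorithm is used is decided inside the library, not by the application code.

/* Minimal MPI_Bcast timing sketch (illustrative; standard MPI only).
 * Sweeps message sizes and reports a rough average broadcast latency
 * as seen at the root. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, iters = 100;
    size_t max_size = 1 << 20;           /* assumed maximum size: 1 MB */
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(max_size);

    for (size_t size = 4; size <= max_size; size *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);     /* align all ranks before timing */
        double t_start = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Bcast(buf, (int)size, MPI_CHAR, 0, MPI_COMM_WORLD);
        double t_end = MPI_Wtime();

        if (rank == 0)
            printf("%8zu bytes: %10.2f us\n",
                   size, (t_end - t_start) * 1e6 / iters);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}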

Shared Memory-Based Collectives for Multi-core Platforms

[Figure: MPI_Allreduce latency vs. message size (small and large message ranges), with and without the shared-memory-based design (Allreduce-shmem vs. Allreduce-noshmem)]

• 64 Intel quad-core systems with dual sockets (512 cores)
• Improves performance by almost 3 times
• Similar performance improvement for MPI_Barrier

Efficient Shared Memory and RDMA based design for MPI_Allgather over InfiniBand, Amith R. Mamidala, Abhinav Vishnu and D. K. Panda, EuroPVM/MPI, September 2006

MVAPICH/PSM Collective Performance

[Figure: MPI_Bcast latency (64x8 configuration, message sizes 1 byte to 1 KB) and MPI_Barrier latency (system sizes from 1x8 to 64x8) for MVAPICH PSM 1.0 vs. InfiniPath MPI 2.1]

• 64 Intel quad-core systems with dual sockets; PCIe InfiniPath adapters
• Significant performance improvement for MPI_Bcast and MPI_Barrier

Zero-Copy over Unreliable Datagram (UD)

[Figure: uni-directional and bi-directional bandwidth vs. message size for RC-zcopy, UD-copy, and UD-zcopy]

• Using a novel technique, zero-copy transfers can be made over UD
• Performance very close to that of RC
• Supported in MVAPICH 1.0-beta

M. Koop, S. Sur and D. K. Panda, "Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram," Cluster 2007

NAS Parallel Benchmarks with UD

[Figure: normalized run time of the NAS Parallel Benchmarks (BT, CG, EP, FT, IS, LU, MG, SP; Class C, 256 processes) for RC (MVAPICH-0.9.8), UD (Progress), UD (Hybrid), and UD (Thread)]

• CFD kernels with varied communication patterns
• UD Progress shows better performance than the thread or hybrid models
• FT and IS both use large MPI_Alltoall collective calls, in which each process communicates directly with every other process
• ICM cache misses for RC
• Large improvement for UD, even with 256 processes

M. Koop, S. Sur and D. K. Panda, "High Performance MPI Design using Unreliable Datagram for Ultra-Scale InfiniBand Clusters," ICS 2007
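As noted above, FT and IS are dominated by large MPI_Alltoall exchanges in which every process communicates with every other process, which is exactly the pattern that stresses per-connection (RC) resources. A minimal sketch of such an exchange, using only standard MPI calls, is shown below; the per-destination block size is an illustrative assumption.

/* Minimal MPI_Alltoall sketch (illustrative; standard MPI only).
 * Every process sends a distinct block to every other process, so the
 * number of communicating peers grows with the job size -- the pattern
 * that favors the UD-based design over per-peer RC connections. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    int block = 64 * 1024;               /* assumed bytes per destination */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *sendbuf = malloc((size_t)block * nprocs);
    char *recvbuf = malloc((size_t)block * nprocs);

    double t_start = MPI_Wtime();
    MPI_Alltoall(sendbuf, block, MPI_CHAR,
                 recvbuf, block, MPI_CHAR, MPI_COMM_WORLD);
    double t_end = MPI_Wtime();

    if (rank == 0)
        printf("Alltoall (%d procs, %d bytes/peer): %.2f ms\n",
               nprocs, block, (t_end - t_start) * 1e3);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}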

SMG2000

[Figure: normalized run time of SMG2000 for RC (MVAPICH 0.9.8) vs. UD (Progress) at 128 to 4096 processes]

Memory consumption:

  Processes   RC (MVAPICH 0.9.8)                  UD Design
              Conn.   Buffers  Struct  Total      Conn.  Buffers  Struct  Total
  512         22.9    65.0     0.3     88.2       0      37.0     0.2     37.2
  1024        29.5    65.0     0.6     95.1       0      37.0     0.4     37.4
  2048        42.4    65.0     1.2     107.4      0      37.0     0.9     37.9
  4096        66.7    65.0     2.4     134.1      0      37.0     1.7     38.7

• Performance is enhanced considerably with UD
• Large number of communicating peers per process (992 at maximum)
• UD reduces HCA cache thrashing
• Very communication intensive
• 27 packet drops at 4K processes with 1.4 billion MPI messages
• Large difference in memory consumption, even with only 1/4 of the connections made

Hot-Spot Avoidance with MVAPICH

• The deterministic nature of InfiniBand routing leads to hot-spots in the network, even with a Fat-Tree topology
• Responsibility for path utilization is up to the MPI library
• We design HSAM (Hot-Spot Avoidance MVAPICH) to alleviate this problem
• For different FT class benchmarks, the performance improvement varies from 6-9%

[Figure: NAS Parallel Benchmarks run time (seconds) at 32x1 and 64x1 processes, Original vs. HSAM with 4 paths]

A. Vishnu, M. Koop, A. Moody, A. Mamidala, S. Narravula and D. K. Panda, "Hot-Spot Avoidance With Multi-Pathing Over InfiniBand: An MPI Perspective," CCGrid, Rio de Janeiro, Brazil, May 2007

Presentation Overview

• Overview of MVAPICH/MVAPICH2 Project
• Features of MVAPICH 1.0 and MVAPICH2 1.0
• Sample Performance Numbers
  – Point-to-point (Mellanox and QLogic)
  – Point-to-point with Intel Connects Cable
  – Point-to-point with Obsidian IB-WAN
  – Multi-core-aware Optimized Collectives
  – UD-based Design
  – Hot-spot Avoidance Mechanism (HSAM)
• Upcoming Features and Issues
  – XRC support
  – Enhanced UD-based Design
  – Asynchronous Progress
  – Passive synchronization support
• Conclusions

XRC Support with ConnectX

• XRC (eXtended Reliable Connection) is being proposed for large-scale clusters
• We have designed and implemented an initial prototype of MVAPICH with XRC support
• In-depth results will be presented during tomorrow's XRC session

MVAPICH Latency: RC and XRC

[Figure: MPI latency vs. message size (0 bytes to 4 KB) for the RC and XRC transports]

• Results for latency are nearly identical between the RC and XRC transports
• 1.49 us for RC, 1.54 us for XRC

Enhanced UD-based Design

• The current UD-based design in MVAPICH 1.0 delivers good performance
• Some overheads remain on large-scale clusters for some applications
• Working on a new hybrid UD-RC design
  – Delivers better performance than either the RC-only or UD-only design
• Will be available in future releases

Asynchronous Progress

• Have added asynchronous progress (both at the sender and the receiver) in MVAPICH 1.0
• Allows the sender/receiver to be interrupted during long computation phases so that communication can be handled
• Potential for overlap of computation and communication
• Carrying out performance evaluation with different applications to study the impact
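From the application side, this overlap is exploited through ordinary nonblocking MPI calls: with asynchronous progress enabled in the library, the transfer of a large message can complete during the compute phase rather than at MPI_Wait. A minimal sketch with standard MPI calls follows; compute_on_other_data() and the message size are illustrative placeholders, and no MVAPICH-specific API is involved.

/* Minimal computation/communication overlap sketch (illustrative).
 * With asynchronous progress in the MPI library, the large transfer
 * started by MPI_Isend/MPI_Irecv can progress while compute_on_other_data()
 * runs, instead of being delayed until MPI_Wait is reached. */
#include <mpi.h>
#include <stdlib.h>

static void compute_on_other_data(void)    /* placeholder for real work */
{
    /* ... long computation that does not touch the message buffer ... */
}

int main(int argc, char **argv)
{
    int rank;
    int count = 4 * 1024 * 1024;            /* assumed 4M doubles */
    double *buf;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(count * sizeof(double));

    if (rank == 0)
        MPI_Isend(buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    else if (rank == 1)
        MPI_Irecv(buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

    if (rank == 0 || rank == 1) {
        compute_on_other_data();            /* overlapped with the transfer */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* complete the communication */
    }

    free(buf);
    MPI_Finalize();
    return 0;
}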

Passive Synchronization for One-Sided Operations with Atomic Operations

• One-sided operations in the MPI-2 semantics have two synchronization schemes:
  – Active
  – Passive
• InfiniBand atomic operations are leveraged to implement high-performance and scalable passive synchronization
• Will be available in future releases
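For reference, passive-target synchronization in MPI-2 looks like the following at the application level: the origin locks the target's window, issues RMA operations, and unlocks, with no matching call on the target. This is a minimal standard MPI-2 sketch; the single-double window and the lone MPI_Put are illustrative, and the InfiniBand-atomics-based locking discussed above happens inside the library.

/* Minimal MPI-2 passive-target one-sided sketch (illustrative).
 * Rank 0 updates rank 1's window using lock/put/unlock; rank 1 makes
 * no matching MPI call during the access epoch. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double local = 3.14, *win_mem;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process exposes one double through the window. */
    MPI_Alloc_mem(sizeof(double), MPI_INFO_NULL, &win_mem);
    *win_mem = 0.0;
    MPI_Win_create(win_mem, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0) {
        /* Passive synchronization: only the origin calls lock/unlock. */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
        MPI_Put(&local, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
        MPI_Win_unlock(1, win);
    }

    MPI_Barrier(MPI_COMM_WORLD);   /* ensure the update is visible before use */

    MPI_Win_free(&win);
    MPI_Free_mem(win_mem);
    MPI_Finalize();
    return 0;
}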

Conclusions

• MVAPICH and MVAPICH2 are being widely used in stable production IB clusters, delivering the best performance and scalability
• Also enabling clusters with iWARP support
• The user base stands at more than 580 organizations
• New features for scalability, high performance and fault tolerance are aimed at deploying large-scale clusters (20K-50K nodes) in the near future

Acknowledgements

Our research is supported by the following organizations:

• Current funding support by [sponsor logos]

• Current equipment support by [sponsor logos]

Acknowledgements

• Current Students
  – L. Chai (Ph.D.)
  – W. Huang (Ph.D.)
  – M. Koop (Ph.D.)
  – R. Kumar (M.S.)
  – A. Mamidala (Ph.D.)
  – S. Narravula (Ph.D.)
  – R. Noronha (Ph.D.)
  – G. Santhanaraman (Ph.D.)
  – K. Vaidyanathan (Ph.D.)
  – A. Vishnu (Ph.D.)
• Past Students
  – P. Balaji (Ph.D.)
  – D. Buntinas (Ph.D.)
  – S. Bhagvat (M.S.)
  – B. Chandrasekharan (M.S.)
  – W. Jiang (M.S.)
  – S. Kini (M.S.)
  – S. Krishnamoorthy (M.S.)
  – J. Liu (Ph.D.)
  – S. Sur (Ph.D.)
  – J. Wu (Ph.D.)
  – W. Yu (Ph.D.)
• Current Programmers
  – S. Rowland
  – J. Perkins

Web Pointers

MVAPICH Web Page http://mvapich.cse.ohio-state.edu/

E-mail: [email protected]
