InfiniBand + GPU Systems (Past)
MVAPICH2 and GPUDirect RDMA
Presentation at the HPC Advisory Council, June 2013, by Dhabaleswar K. (DK) Panda, Hari Subramoni and Sreeram Potluri, The Ohio State University.
E-mail: {panda, subramon, potluri}@cse.ohio-state.edu
https://mvapich.cse.ohio-state.edu/

Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)
[Chart: number of clusters and percentage of clusters in the Top500 list over the timeline.]

Large-scale InfiniBand Installations
• 224 IB clusters (44.8%) in the November 2012 Top500 list (http://www.top500.org)
• Installations in the Top 40 (16 systems):
  – 147,456 cores (SuperMUC) in Germany (6th)
  – 204,900 cores (Stampede) at TACC (7th)
  – 77,184 cores (Curie thin nodes) at France/CEA (11th)
  – 120,640 cores (Nebulae) at China/NSCS (12th)
  – 72,288 cores (Yellowstone) at NCAR (13th)
  – 125,980 cores (Pleiades) at NASA/Ames (14th)
  – 70,560 cores (Helios) at Japan/IFERC (15th)
  – 73,278 cores (Tsubame 2.0) at Japan/GSIC (17th)
  – 138,368 cores (Tera-100) at France/CEA (20th)
  – 122,400 cores (Roadrunner) at LANL (22nd)
  – 53,504 cores (PRIMERGY) at Australia/NCI (24th)
  – 78,660 cores (Lomonosov) in Russia (26th)
  – 137,200 cores (Sunway Blue Light) in China (28th)
  – 46,208 cores (Zin) at LLNL (29th)
  – 33,664 cores (MareNostrum) at Spain/BSC (36th)
  – 32,256 cores (SGI Altix X) at Japan/CRIEPI (39th)
• More are getting installed!

MVAPICH2/MVAPICH2-X Software
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Used by more than 2,000 organizations (HPC centers, industry and universities) in 70 countries
  – More than 173,000 downloads directly from the OSU site
  – Empowering many Top500 clusters:
    • 7th-ranked 204,900-core cluster (Stampede) at TACC
    • 14th-ranked 125,980-core cluster (Pleiades) at NASA
    • 17th-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    • and many others
  – Available with the software stacks of many IB, HSE and server vendors, including Linux distributions (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede system

MVAPICH2 1.9 and MVAPICH2-X 1.9
• Released on 05/06/13
• Major features and enhancements:
  – Based on MPICH-3.0.3, with support for all MPI-3 features
  – Support for single-copy intra-node communication using Linux-supported CMA (Cross Memory Attach), providing flexibility for intra-node communication: shared memory, LiMIC2, and CMA
  – Checkpoint/Restart using LLNL's Scalable Checkpoint/Restart library (SCR), with support for application-level checkpointing and for hierarchical system-level checkpointing
  – Scalable UD-multicast-based designs and tuned algorithm selection for collectives
  – Improved and tuned MPI communication from GPU device memory
  – Improved job startup time, including a new runtime variable MV2_HOMOGENEOUS_CLUSTER for optimized startup on homogeneous clusters
  – Revamped build system with support for parallel builds
• MVAPICH2-X 1.9 supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models, as sketched below
  – Based on MVAPICH2 1.9, including MPI-3 features; compliant with UPC 2.16.2 and OpenSHMEM v1.0d
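As a concrete illustration of the hybrid MPI + PGAS model that MVAPICH2-X targets, the sketch below mixes OpenSHMEM one-sided puts with an MPI collective in one program. This is only a minimal sketch under stated assumptions: it uses OpenSHMEM 1.0-style calls (start_pes, shmalloc, shmem_int_put), it assumes MPI rank i and SHMEM PE i refer to the same process under the unified runtime, and the explicit double initialization shown is not taken from this presentation; the MVAPICH2-X user guide defines the actual requirements.

/* Hybrid MPI + OpenSHMEM sketch (assumption: MVAPICH2-X backs both models
 * with one runtime; initialization order and PE/rank correspondence should
 * be checked against the MVAPICH2-X user guide for the release you run). */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    start_pes(0);                      /* OpenSHMEM 1.0-style initialization */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Symmetric heap allocation, visible to one-sided puts/gets */
    int *counter = (int *) shmalloc(sizeof(int));
    *counter = 0;
    shmem_barrier_all();

    /* PGAS-style one-sided put of my rank into the next PE's counter
     * (assumes PE numbering matches MPI rank numbering) ... */
    int next = (rank + 1) % size;
    shmem_int_put(counter, &rank, 1, next);
    shmem_barrier_all();

    /* ... combined with an MPI collective over the same processes */
    int sum = 0;
    MPI_Allreduce(counter, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of received ranks = %d\n", sum);

    shfree(counter);
    MPI_Finalize();
    return 0;
}

The point of the hybrid model is exactly this freedom: irregular, one-sided PGAS accesses and bulk-synchronous MPI collectives can coexist in one executable over a single runtime, instead of forcing the application to pick one model.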
Outline
• MVAPICH2/MVAPICH2-X Overview
  – Efficient intra-node and inter-node communication and scalable protocols
  – Scalable and non-blocking collective communication
  – High-performance fault-tolerance mechanisms
  – Support for hybrid MPI+PGAS programming models
• MVAPICH2 for GPU Clusters
  – Point-to-point communication
  – Collective communication
  – MPI datatype processing
  – Multi-GPU configurations
  – MPI + OpenACC
• MVAPICH2 with GPUDirect RDMA
• Conclusion

One-way Latency: MPI over IB
[Charts: small- and large-message one-way latency for six HCA/library configurations — MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR, MVAPICH-ConnectX3-PCIe3-FDR and MVAPICH2-Mellanox-ConnectIB-DualFDR. Small-message latencies across the configurations range from roughly 0.99 us to 1.82 us.]

Bandwidth: MPI over IB
[Charts: unidirectional and bidirectional bandwidth for the same six configurations. Peak unidirectional bandwidth is about 12,485 MB/s and peak bidirectional bandwidth about 21,025 MB/s (ConnectIB Dual-FDR); earlier generations fall between roughly 1,700 and 6,500 MB/s unidirectional.]

Test beds for both charts: DDR and QDR results use 2.4 GHz quad-core (Westmere) Intel nodes with PCIe Gen2 and an IB switch; FDR and ConnectIB Dual-FDR results use 2.6 GHz octa-core (Sandy Bridge) Intel nodes with PCIe Gen3 and an IB switch.
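The latency figures above are produced by OSU micro-benchmark-style ping-pong measurements. The following is a minimal, generic sketch of such a measurement, not the OSU benchmark itself; the message size, warm-up count and iteration count are arbitrary choices for illustration.

/* Minimal MPI ping-pong latency sketch between ranks 0 and 1.
 * Illustrative only: the real OSU micro-benchmarks sweep message sizes
 * and handle reporting and skew much more carefully. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const int size = 8;                 /* small message, in bytes */
    const int warmup = 100, iters = 10000;
    char buf[8];
    memset(buf, 0, sizeof(buf));

    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = 0.0;
    for (int i = 0; i < warmup + iters; i++) {
        if (i == warmup) t0 = MPI_Wtime();          /* time only after warm-up */
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0) {
        /* round trips / 2 gives the one-way latency */
        double one_way_us = (MPI_Wtime() - t0) * 1e6 / (2.0 * iters);
        printf("one-way latency: %.2f us for %d-byte messages\n", one_way_us, size);
    }

    MPI_Finalize();
    return 0;
}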
eXtended Reliable Connection (XRC) and Hybrid Mode
[Charts: memory usage per process for MVAPICH2-RC vs. MVAPICH2-XRC as the number of connections grows from 1 to 16K, and normalized NAMD time on 1,024 cores for the apoa1, er-gre, f1atpase and jac datasets.]
• Memory usage for 32K processes with 8 cores per node can be 54 MB/process for connections alone
• NAMD performance improves when there is frequent communication to many peers
[Chart: UD vs. Hybrid vs. RC time at 128, 256, 512 and 1,024 processes, with improvements of 26%, 40%, 38% and 30% annotated.]
• Both UD and RC/XRC have benefits; the Hybrid mode gives the best of both
• Available since MVAPICH2 1.7 as an integrated interface
• Runtime parameters: RC (default); UD: MV2_USE_ONLY_UD=1; Hybrid: MV2_HYBRID_ENABLE_THRESHOLD=1
M. Koop, J. Sridhar and D. K. Panda, "Scalable MPI Design over InfiniBand using eXtended Reliable Connection," Cluster '08.

MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support: LiMIC and CMA)
[Charts: intra-socket and inter-socket latency and bandwidth with the latest MVAPICH2 1.9 on Intel Sandy Bridge, comparing the CMA, shared-memory and LiMIC channels. Small-message latency is about 0.19 us intra-socket and 0.45 us inter-socket; bandwidth reaches about 12,000 MB/s in both cases.]

(Outline slide repeated.)

Hardware Multicast-aware MPI_Bcast on Stampede
[Charts: MPI_Bcast latency for small and large messages at 102,400 cores, and scalability with the number of nodes for 16-byte and 32 KByte messages, comparing the Default and hardware-Multicast-based designs.]
Test bed: ConnectX-3 FDR (54 Gbps), 2.7 GHz dual octa-core (Sandy Bridge) Intel nodes with PCIe Gen3 and a Mellanox IB FDR switch.

Application Benefits with Non-Blocking Collectives based on CX-2 (ConnectX-2) Collective Offload
[Charts: HPL-Offload vs. HPL-1ring vs. HPL-Host normalized performance across HPL problem sizes (10–70% of total memory), P3DFFT run time for data sizes 512–800, and PCG-Default vs. Modified-PCG-Offload run time at 64–512 processes.]
• Modified P3DFFT with Offload-Alltoall performs up to 17% better than the default version (128 processes)
• Modified HPL with Offload-Bcast performs up to 4.5% better than the default version (512 processes)
• Modified Pre-Conjugate Gradient (PCG) solver with Offload-Allreduce performs up to 21.8% better than the default version
K. Kandalla et al., "High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT," ISC 2011.
K. Kandalla et al., "Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL," HotI 2011.
K. Kandalla et al., "Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers," IPDPS '12.
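The gains above come from overlapping an offloaded non-blocking collective with independent application computation. The sketch below shows the generic MPI-3 form of that overlap pattern with MPI_Ialltoall; it is not code from the modified P3DFFT/HPL/PCG studies cited, and the buffer sizes and the placeholder compute loop are arbitrary.

/* Overlap pattern behind the offloaded non-blocking collectives:
 * post MPI_Ialltoall, do independent work, then wait for completion.
 * Generic MPI-3 sketch; the cited studies use modified application
 * kernels with ConnectX-2 collective offload underneath. */
#include <mpi.h>
#include <stdlib.h>

static void independent_compute(double *x, int n)
{
    for (int i = 0; i < n; i++)          /* stand-in for real work that does */
        x[i] = x[i] * 1.0001 + 1.0;      /* not depend on the exchanged data */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int per_rank = 1024;           /* doubles exchanged with each peer */
    double *sendbuf = malloc(sizeof(double) * per_rank * nprocs);
    double *recvbuf = malloc(sizeof(double) * per_rank * nprocs);
    double *work    = malloc(sizeof(double) * per_rank);
    for (int i = 0; i < per_rank * nprocs; i++) sendbuf[i] = (double) i;
    for (int i = 0; i < per_rank; i++) work[i] = 1.0;

    MPI_Request req;
    MPI_Ialltoall(sendbuf, per_rank, MPI_DOUBLE,
                  recvbuf, per_rank, MPI_DOUBLE, MPI_COMM_WORLD, &req);

    independent_compute(work, per_rank); /* overlapped with the exchange */

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* recvbuf is valid after this */

    /* ... use recvbuf ... */

    free(sendbuf); free(recvbuf); free(work);
    MPI_Finalize();
    return 0;
}

With collective offload, the progress of the all-to-all is driven by the HCA rather than the host CPU, so the compute phase above is not stolen from by communication progress; that is what turns the latent overlap in this pattern into the measured application-level gains.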
(Outline slide repeated.)

Multi-Level Checkpointing with ScalableCR (SCR)
• LLNL's Scalable Checkpoint/Restart library
• Can be used for application-guided and application-transparent checkpointing
• Effective utilization of the storage hierarchy
  – Local: store checkpoint data on the node's local storage, e.g.
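For the application-guided mode, the sketch below shows roughly how a code drives SCR's checkpoint API. Function names follow SCR's classic user API (SCR_Init, SCR_Need_checkpoint, SCR_Start_checkpoint, SCR_Route_file, SCR_Complete_checkpoint), but exact signatures can differ between SCR releases, so treat this as an assumption to verify against the SCR documentation; the checkpoint file contents here are placeholders.

/* Sketch of application-guided checkpointing with SCR (assumed classic API;
 * verify names and signatures against the SCR release you build with). */
#include <mpi.h>
#include <scr.h>
#include <stdio.h>

static void write_checkpoint(int rank, int step)
{
    char name[256], path[SCR_MAX_FILENAME];
    snprintf(name, sizeof(name), "ckpt_step%d_rank%d.dat", step, rank);

    /* Ask SCR where this file should actually be written
     * (typically node-local storage, which SCR later copies/encodes). */
    SCR_Route_file(name, path);

    FILE *fp = fopen(path, "w");
    if (fp) {
        fprintf(fp, "step=%d rank=%d\n", step, rank);  /* placeholder state */
        fclose(fp);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    SCR_Init();                          /* after MPI_Init */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int step = 0; step < 100; step++) {
        /* ... application computation for this step ... */

        int need = 0;
        SCR_Need_checkpoint(&need);      /* SCR decides based on its config */
        if (need) {
            SCR_Start_checkpoint();
            write_checkpoint(rank, step);
            SCR_Complete_checkpoint(1);  /* 1: this process's files are valid */
        }
    }

    SCR_Finalize();
    MPI_Finalize();
    return 0;
}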