The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing

Presentation at the OSU Booth (SC '19) by Hari Subramoni, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~subramon

Drivers of Modern HPC Cluster Architectures

[Figure: building blocks of modern HPC clusters: multi-core processors (>1 TFlop DP on a chip); high-performance interconnects such as InfiniBand (<1 usec latency, 200 Gbps bandwidth); accelerators/coprocessors with high compute density and high performance per watt; SSD, NVMe-SSD, and NVRAM storage. Example systems: Summit, Sierra, Sunway TaihuLight, K Computer.]

• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.

Parallel Programming Models Overview

[Figure: three abstract machine models, each with processes P1-P3 and their memories: Shared Memory Model (SHMEM, DSM); Distributed Memory Model (MPI, Message Passing Interface); Partitioned Global Address Space (PGAS) with a logical shared memory (Global Arrays, UPC, Chapel, X10, CAF, ...).]

• Programming models provide abstract machine models
• Models can be mapped onto different types of systems, e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance (a minimal MPI send/receive sketch follows this slide)
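To make the distributed-memory (message-passing) model concrete, here is a minimal MPI point-to-point sketch. It is not taken from the slides; it assumes only a C compiler and an MPI library such as MVAPICH2 with its usual mpicc/mpirun wrappers, and the ranks, tag, and payload value are purely illustrative.

    /* Minimal illustration of the message-passing (distributed-memory) model:
     * rank 0 sends an integer to rank 1 over the interconnect; no memory is
     * shared between the processes. Run with at least two MPI ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

A typical build-and-run cycle would be mpicc send_recv.c -o send_recv followed by mpirun -np 2 ./send_recv (file and executable names are hypothetical). A PGAS program would instead read and write a logically shared array, with the runtime performing similar communication underneath.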
Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: co-design opportunities and challenges across various layers: Application Kernels/Applications; Middleware; Programming Models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.); Communication Library or Runtime for Programming Models (point-to-point communication, collective communication, synchronization and locks, I/O and file systems, energy-awareness, fault tolerance); Networking Technologies (InfiniBand, 40/100GigE, Aries, and Omni-Path), Multi/Many-core Architectures, and Accelerators (GPU and FPGA). Cross-cutting goals: performance, scalability, fault-resilience.]

MPI+X Programming Model: Broad Challenges at Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Scalable job start-up
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
• Balancing intra-node and inter-node communication for next-generation nodes (128-1,024 cores)
  – Multiple end-points per node
• Support for efficient multi-threading (see the MPI+OpenMP sketch after this list)
• Integrated support for GPGPUs and FPGAs
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + UPC++, MPI + OpenSHMEM, CAF, ...)
• Virtualization
• Energy-awareness
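As a concrete illustration of the MPI+X direction, the following is a minimal hybrid MPI+OpenMP sketch. It is not from the slides; it assumes an OpenMP-capable compiler and an MPI library with threading support (such as MVAPICH2), and the funneled threading level is just one possible choice.

    /* Minimal MPI+X sketch with X = OpenMP: MPI ranks handle inter-node
     * communication, OpenMP threads handle intra-node parallelism. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nthreads = 0;

        /* Request MPI_THREAD_FUNNELED: only the main thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            #pragma omp single
            nthreads = omp_get_num_threads();
            /* ... intra-node compute work goes here ... */
        }

        printf("rank %d: %d OpenMP threads, provided thread level %d\n",
               rank, nthreads, provided);

        MPI_Finalize();
        return 0;
    }

Building is typically mpicc -fopenmp hybrid.c -o hybrid (names hypothetical). The requested threading level is the interesting design point: supporting MPI_THREAD_MULTIPLE efficiently, where all threads may call MPI, is exactly the "support for efficient multi-threading" challenge listed above.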
Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1): started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 3,050 organizations in 89 countries
  – More than 614,000 (> 0.6 million) downloads directly from the OSU site
  – Empowering many TOP500 clusters (Nov '18 ranking):
    • 3rd: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
    • 5th: 448,448-core Frontera at TACC
    • 8th: 391,680-core ABCI in Japan
    • 15th: 570,020-core Nurion in South Korea, and many others
  – Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, and OpenHPC)
  – http://mvapich.cse.ohio-state.edu
• Partner in the TACC Frontera system

MVAPICH2 Release Timeline and Downloads

• Empowering Top500 systems for over a decade

[Figure: cumulative number of downloads (0 to 600,000) over the release timeline, from the early MVAPICH 0.9.x and MVAPICH2 0.9.x releases through MVAPICH2 2.3.x, including the MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-Virt, MVAPICH2-MIC, OSU INAM, MVAPICH2-Azure, and MVAPICH2-AWS releases.]

Architecture of MVAPICH2 Software Family

[Figure: layered architecture. High-performance parallel programming models (Message Passing Interface (MPI); PGAS: UPC, OpenSHMEM, CAF, UPC++; Hybrid MPI + X: MPI + PGAS + OpenMP/Cilk) run over a high-performance, scalable communication runtime offering diverse APIs and mechanisms: point-to-point primitives, collective algorithms, job startup, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, and introspection & analysis. The runtime supports modern networking technologies (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter*) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU), with transport protocols (RC, UD, DC, UMR, ODP) and modern features and transport mechanisms (SR-IOV, multi-rail, shared memory, CMA, IVSHMEM, XPMEM, NVLink, CAPI*, Optane*). (* upcoming)]

MVAPICH2 Software Family (Requirements → Library)

• MPI with IB, iWARP, Omni-Path, and RoCE → MVAPICH2
• Advanced MPI features/support, OSU INAM, PGAS, and MPI+PGAS with IB, Omni-Path, and RoCE → MVAPICH2-X
• MPI with IB, RoCE & GPU, and support for Deep Learning → MVAPICH2-GDR
• HPC Cloud with MPI & IB → MVAPICH2-Virt
• Energy-aware MPI with IB, iWARP, and RoCE → MVAPICH2-EA
• MPI energy monitoring tool → OEMT
• InfiniBand network analysis and monitoring → OSU INAM
• Microbenchmarks for measuring MPI and PGAS performance → OMB

MVAPICH2 Distributions

• MVAPICH2
  – Basic MPI support for IB, Omni-Path, iWARP, and RoCE
• MVAPICH2-X
  – Advanced MPI features and support for INAM
  – MPI, PGAS, and hybrid MPI+PGAS support for IB
• MVAPICH2-Virt
  – Optimized for HPC clouds with IB and SR-IOV virtualization
  – Support for OpenStack, Docker, and Singularity
• OSU Micro-Benchmarks (OMB)
  – MPI (including CUDA-aware MPI), OpenSHMEM, and UPC
• OSU INAM
  – InfiniBand network analysis and monitoring tool

Startup Performance on TACC Frontera

[Figure: MPI_Init time in milliseconds on Frontera vs. number of processes (56 to 57,344), comparing Intel MPI 2019 and MVAPICH2; annotated times of 4.5 s and 3.9 s at the largest scale.]

• MPI_Init takes 3.9 seconds on 57,344 processes on 1,024 nodes
• On 229,376 processes on 4,096 nodes, MPI_Init takes 195 seconds with Intel MPI, while MVAPICH2 takes 31 seconds
• All numbers reported with 56 processes per node
• New designs available in MVAPICH2-2.3.2 (a sketch for timing MPI_Init yourself follows below)
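The start-up numbers above can be approximated with a few lines of code. The sketch below is a rough illustration (not how the slide's measurements were necessarily taken): wall-clock time is read just before and after MPI_Init, and rank 0 reports the maximum across ranks.

    /* Rough sketch for measuring MPI_Init (job start-up) cost. */
    #define _POSIX_C_SOURCE 199309L
    #include <mpi.h>
    #include <stdio.h>
    #include <time.h>

    static double now_sec(void)
    {
        /* MPI_Wtime is not guaranteed before MPI_Init, so use clock_gettime. */
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(int argc, char **argv)
    {
        double t0 = now_sec();
        MPI_Init(&argc, &argv);
        double my_init = now_sec() - t0;

        int rank, size;
        double max_init;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Reduce(&my_init, &max_init, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("MPI_Init took %.3f s (max over %d processes)\n", max_init, size);

        MPI_Finalize();
        return 0;
    }

Launching this at increasing process counts (via mpirun or the batch scheduler) yields a start-up curve comparable in shape to the figure above.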
One-way Latency: MPI over IB with MVAPICH2

[Figure: small-message and large-message one-way latency in microseconds vs. message size in bytes for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, ConnectX-4-EDR, Omni-Path, and ConnectX-6 HDR; small-message latencies lie in the 1.0 to 1.19 microsecond range.]

Platform configurations:
• TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
• ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
• ConnectIB-DualFDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
• ConnectX-4-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch
• Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch
• ConnectX-6-HDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch

Bandwidth: MPI over IB with MVAPICH2

[Figure: unidirectional and bidirectional bandwidth in MBytes/sec vs. message size in bytes for the same adapters, peaking at roughly 24,532 MBytes/sec unidirectional and 48,027 MBytes/sec bidirectional with ConnectX-6 HDR; same platform configurations as the latency results above.]

These results are obtained with the OSU Micro-Benchmarks (OMB); a stripped-down ping-pong latency sketch in the same spirit follows below.
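For reference, here is a stripped-down ping-pong latency sketch in the spirit of osu_latency. It is an illustrative reimplementation rather than the actual OSU Micro-Benchmarks code; the message size and iteration counts are arbitrary.

    /* Stripped-down ping-pong one-way latency sketch (run with 2 ranks). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MSG_SIZE 8        /* message size in bytes (illustrative)    */
    #define SKIP     100      /* warm-up iterations excluded from timing */
    #define ITERS    10000    /* timed iterations                        */

    int main(int argc, char **argv)
    {
        int rank;
        double t_start = 0.0;
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = calloc(MSG_SIZE, 1);

        for (int i = 0; i < SKIP + ITERS; i++) {
            if (i == SKIP) {              /* start the clock after warm-up */
                MPI_Barrier(MPI_COMM_WORLD);
                t_start = MPI_Wtime();
            }
            if (rank == 0) {
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        if (rank == 0) {
            /* one-way latency = round-trip time / 2, in microseconds */
            double latency_us = (MPI_Wtime() - t_start) * 1e6 / (2.0 * ITERS);
            printf("%d-byte one-way latency: %.2f us\n", MSG_SIZE, latency_us);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

The real osu_latency sweeps message sizes, and osu_bw measures bandwidth using windows of non-blocking sends instead of a strict ping-pong; both ship with the OMB package listed above.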