Unified Communication X (UCX): UCF Consortium Project (UCX BoF, ISC 2019)


UCF Consortium
• Mission: collaboration between industry, laboratories, and academia to create production-grade communication frameworks and open standards for data-centric and high-performance applications.
• Projects: UCX (Unified Communication X) and Open RDMA.
• Board members: Jeff Kuehn, UCF Chairman (Los Alamos National Laboratory); Gilad Shainer, UCF President (Mellanox Technologies); Pavel Shamis, UCF Treasurer (Arm); Brad Benton (AMD); Duncan Poole (NVIDIA); Pavan Balaji (Argonne National Laboratory); Sameh Sharkawi (IBM); Dhabaleswar K. (DK) Panda (Ohio State University); Steve Poole (Open Source Software Solutions).

UCX History
• A new project based on concepts and ideas from multiple generations of HPC network stacks, drawing on decades of community and industry experience in APIs, software infrastructure, and the development and low-level optimization of HPC network software.
• Design goals: performance, scalability, efficiency, portability; modular architecture with open APIs and low-level optimizations.
• (Timeline figure: LA-MPI, PAMI, MXM, and UCCS feeding into UCX, spanning roughly 2000, 2004, 2010, 2011, and 2012.)

UCX High-level Overview
(Architecture diagram slide; no further text.)

UCX Framework
• UCX is a framework for network APIs and stacks.
• UCX aims to unify the different network APIs, protocols, and implementations into a single framework that is portable, efficient, and functional.
• UCX does not focus on supporting a single programming model; instead it provides APIs and protocols that can be used to tailor the functionality of a particular programming model efficiently.
• When different programming paradigms and applications implement their functionality over UCX, their portability increases: implementing a small set of UCX APIs on top of new hardware is enough for those applications to run seamlessly, without each of them having to support the hardware itself. (A minimal UCP initialization sketch follows the roadmap list below.)

UCX Development Status
• Annual release schedule:
– v1.6.0: April 2019
– v1.7.0: August 2019
– v1.8.0: December 2019

UCX at Scale
(Deployment figure; no further text.)

UCX Portability
• Support for x86_64, POWER 8/9, and Arm v8.
• Runs on servers, Raspberry Pi-class platforms, SmartNICs, NVIDIA Jetson platforms, etc. (examples: NVIDIA Jetson, Arm ThunderX2, BlueField SmartNIC, Odroid C2).

Recent Development: v1.5.x
• UCP emulation layer (atomics, RMA).
• Non-blocking API for all one-sided operations.
• Client/server connection establishment API.
• Malloc hooks using binary instrumentation instead of symbol override.
• Statistics for the UCT tag API.
• GPU-to-InfiniBand HCA affinity support based on PCIe locality/distance.
• GPU: support for the stream API and receive-side pipelining.

Recent Development: v1.6.x
• AMD GPU ROCm transport redesign: support for managed memory, direct copy, and ROCm GDR.
• Modular architecture for UCT transports.
• Random scheduling policy for the DC transport.
• Omni-Path support (over verbs).
• Optimized out-of-the-box settings for multi-rail.
• Support for PCI atomics with IB transports.
• Reduced UCP address size for homogeneous environments.

Coming Soon
• Python bindings, with integration with Dask and RAPIDS AI.
• Java bindings, with integration with SparkRDMA.
• Collectives.
• FreeBSD support.
• MacOS support.
• iWARP support.
• GASNet over UCX.
• TCP performance optimizations.
• Hardware tag matching optimizations.
• Moving CI to Jenkins and Azure Pipelines.
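The framework slide above argues that a small set of UCP calls is enough to bring an application or programming model onto new hardware. As a rough, hedged illustration of that API surface, the sketch below initializes a UCP context and worker with the tag-matching feature; it is based on the public UCP API of UCX 1.x rather than anything shown on the slides, and it deliberately stops short of endpoint creation and data transfer.

```c
/* Minimal UCP initialization sketch (assumes UCX 1.x headers are installed).
 * Illustrative only: error paths are collapsed and no data is transferred. */
#include <ucp/api/ucp.h>
#include <stdio.h>

int main(void)
{
    ucp_params_t        params  = {0};
    ucp_worker_params_t wparams = {0};
    ucp_config_t  *config;
    ucp_context_h  context;
    ucp_worker_h   worker;

    /* Request only the tag-matching feature for this sketch. */
    params.field_mask = UCP_PARAM_FIELD_FEATURES;
    params.features   = UCP_FEATURE_TAG;

    if (ucp_config_read(NULL, NULL, &config) != UCS_OK)
        return 1;
    if (ucp_init(&params, config, &context) != UCS_OK)
        return 1;
    ucp_config_release(config);

    wparams.field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE;
    wparams.thread_mode = UCS_THREAD_MODE_SINGLE;
    if (ucp_worker_create(context, &wparams, &worker) != UCS_OK)
        return 1;

    printf("UCP context and worker created\n");

    /* A real consumer (an MPI library or a Charm++ machine layer) would now
     * exchange worker addresses out of band (ucp_worker_get_address), create
     * endpoints (ucp_ep_create), and issue ucp_tag_send_nb / ucp_tag_recv_nb. */

    ucp_worker_destroy(worker);
    ucp_cleanup(context);
    return 0;
}
```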
UCX Machine Layer in Charm++
• Charm++ is an object-oriented, message-driven parallel programming model with an adaptive runtime system that enables high-performance applications to scale.
• The Low-level Runtime System (LRTS) is a thin software layer in the Charm++ software stack that abstracts machine-specific networking functionality; existing LRTS machine layers support uGNI, PAMI, Verbs, MPI, etc.
• UCX is a perfect fit for a Charm++ machine layer:
– Only ~1,000 lines of code are needed to implement all LRTS APIs (the MPI layer takes ~2,700 LoC, the Verbs layer ~8,000 LoC).
– UCX provides ultra-low latency and high bandwidth on top of the RDMA Verbs stack.
– UCX provides a much less intrusive, closer-to-hardware API for one-sided communication than MPI.

Charm++ over UCX (Performance Evaluations)
• UCX vs. Intel MPI: up to 63% better than Intel MPI (Charm++ ping-pong, regular send). (Chart: latency in microseconds vs. message size; lower is better.)
• UCX vs. Open MPI: up to 15% better than running over Open MPI, itself using the UCX PML (Charm++ ping-pong, regular send). (Chart: latency vs. message size.)
• UCX vs. Verbs: up to 85% better than the Verbs machine layer (Charm++ ping-pong, regular send). (Chart: latency vs. message size.)
• NAMD 2.13 (master): 4% improvement over Intel MPI (STMV public input, 40,000 steps; 16 nodes, SKL 2.0 GHz, HDR 100, 640 PPN). (Chart: wall-clock time in seconds; lower is better.)
• A minimal ping-pong sketch illustrating this style of measurement appears after the MPICH overview below.

UCX Support in MPICH
Yanfei Guo, Assistant Computer Scientist, Argonne National Laboratory. Email: [email protected]

MPICH layered structure: CH4
• MPI layer: platform-independent code (collectives, communicator management).
• CH4 core with a "generic" fallback implementation (packet headers plus handlers).
• Netmods (ucx, ofi, portals4) and shmmods (posix, xpmem, cma): provide all functionality either natively or by calling back to the "generic" implementation; they implement some mandatory functionality and can override any amount of optional functionality (e.g. a better bcast or barrier).

Benefit of using UCX in MPICH
• Separates general optimizations from device-specific optimizations:
– Lightweight, high-performance communication with native communication support.
– Simple and easy to maintain.
– MPI can benefit from new hardware more quickly.
• Better hardware support:
– Accelerated verbs with Mellanox hardware.
– Support for GPUs.
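The Charm++ comparisons above and the OSU latency and bandwidth figures on the next slide are ping-pong style measurements. As a rough illustration of what such a measurement does (not the benchmark code behind the numbers on these slides), here is a minimal MPI ping-pong latency sketch; the message size and iteration count are arbitrary assumptions.

```c
/* Minimal MPI ping-pong latency sketch in the spirit of OSU-style
 * micro-benchmarks. Run with at least two ranks; ranks beyond 0 and 1 idle. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 10000, size = 8;   /* assumed defaults */
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = calloc(size, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency is half the measured round trip */
        printf("%d-byte latency: %.2f us\n", size,
               (t1 - t0) * 1e6 / (2.0 * iters));

    free(buf);
    MPI_Finalize();
    return 0;
}
```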
MPICH/UCX with Accelerated Verbs
• UCX_TLS=rc_mlx5,cm
• Lower overhead: lower latency and higher message rate.
• OSU latency: 0.99 us; OSU bandwidth: 12,064.12 MB/s.
• Test system: Argonne JLSE Thing cluster (Intel E5-2699v3 @ 2.3 GHz, ConnectX-4 EDR, HPC-X 2.2.0, OFED 4.4-2.0.7).
• (Chart: message rate vs. message size, 1 B to 4 KB, accelerated verbs vs. regular verbs.)

MPICH/UCX with Accelerated Verbs: latency
• (Charts: point-to-point, MPI_Get, and MPI_Put latency in microseconds vs. message size, 1 B to 4 KB, accelerated verbs vs. regular verbs.)

MPICH/UCX with HCOLL
• (Chart: MPI_Allreduce latency in microseconds vs. message size, HCOLL vs. plain MPICH; Argonne JLSE Thing cluster, same configuration as above, 6 nodes, ppn=1.)

UCX Support in MPICH
• UCX netmod development: MPICH team, Mellanox, NVIDIA.
• MPICH 3.3.1 just released; includes an embedded UCX 1.5.0.
• Native path: point-to-point (with pack/unpack callbacks for non-contiguous buffers) and contiguous put/get RMA for win_create/win_allocate windows (a minimal sketch of this pattern appears at the end of this section).
• Emulation path: CH4 active messages (header plus data) layered over the UCX tagged API.
• Not yet supported: MPI dynamic processes.

Hackathons on MPICH/UCX
• Earlier hackathons with Mellanox:
– Full HCOLL and UCX integration in MPICH 3.3, including HCOLL non-contiguous datatypes.
– MPICH CUDA support using UCX and HCOLL, tested and documented: https://github.com/pmodels/mpich/wiki/MPICH-CH4:UCX-with-CUDA-support
– Support for the FP16 datatype (non-standard, MPIX).
– IBM XL and Arm HPC Compiler support.
– Extended UCX RMA functionality, under review: https://github.com/pmodels/mpich/pull/3398
• Upcoming hackathons with Mellanox and NVIDIA.

Upcoming plans
• Native UCX atomics: enabled when the user supplies certain info hints (https://github.com/pmodels/mpich/pull/3398).
• Extended CUDA support: handle non-contiguous datatypes (https://github.com/pmodels/mpich/pull/3411, https://github.com/pmodels/mpich/issues/3519).
• Better MPI_THREAD_MULTIPLE support: utilizing multiple UCX workers (Rohit is looking into this now).
• Extended FP16 support: C _Float16 is available in some compilers (MPIX_C_FLOAT16); support is still missing when the GPU/network supports FP16 but the CPU does not.

Enhancing MPI Communication using Accelerated Verbs and Tag Matching: The MVAPICH Approach
Talk at the UCX BoF (ISC '19) by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: [email protected], http://www.cse.ohio-state.edu/~panda

Introduction, Motivation, and Challenge
• HPC applications require high-performance, low-overhead data paths that provide low latency, high bandwidth, and a high message rate.
• Hardware-offloaded tag matching.
• Different families of accelerated verbs are available:
– Burst family
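The "native path" item in the MPICH slides above notes that contiguous put/get RMA on win_create/win_allocate windows goes straight to UCX. The sketch below, referenced from that item, shows an MPI program whose RMA traffic has exactly that shape: a contiguous MPI_Put into an MPI_Win_allocate window with fence synchronization. The window size, values, and synchronization pattern are illustrative assumptions, not taken from the slides.

```c
/* Minimal one-sided MPI sketch: contiguous MPI_Put into a win_allocate window. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    double *base;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank exposes one double through a window created by MPI_Win_allocate. */
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = -1.0;

    MPI_Win_fence(0, win);
    if (rank == 0) {
        /* Contiguous put of one double into every other rank's window. */
        double value = 42.0;
        for (int target = 1; target < nprocs; target++)
            MPI_Put(&value, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);

    printf("rank %d sees %.1f\n", rank, *base);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```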