UCX Community Meeting

SC’19 November 2019 Open Meeting Forum and Publicly Available Work Product

This is an open, public standards setting discussion and development meeting of UCF. The discussions that take place during this meeting are intended to be open to the general public and all work product derived from this meeting shall be made widely and freely available to the public. All information including exchange of technical information shall take place during open sessions of this meeting and UCF will not sponsor or support any closed or private working group, standards setting or development sessions that may take place during this meeting. Your participation in any non-public interactions or settings during this meeting is outside the scope of UCF's intended open-public meeting format.

© 2019 UCF Consortium 2 UCF Consortium

. Mission: • Collaboration between industry, laboratories, and academia to create production grade communication frameworks and open standards for data centric and high-performance applications

• Projects (https://www.ucfconsortium.org)
  • UCX – Unified Communication X – www.openucx.org
  • SparkUCX – www.sparkucx.org
  • Open RDMA
• Join: [email protected]

• Board members
  • Jeff Kuehn, UCF Chairman (Los Alamos National Laboratory)
  • Gilad Shainer, UCF President (Mellanox Technologies)
  • Pavel Shamis, UCF Treasurer (Arm)
  • Brad Benton, Board Member (AMD)
  • Duncan Poole, Board Member (NVIDIA)
  • Pavan Balaji, Board Member (Argonne National Laboratory)
  • Sameh Sharkawi, Board Member (IBM)
  • Dhabaleswar K. (DK) Panda, Board Member (Ohio State University)
  • Steve Poole, Board Member (Open Source Software Solutions)

© 2019 UCF Consortium 3 UCX History

https://www.hpcwire.com/2018/09/17/ucf-ucx-and-a-car-ride-on-the-road-to-exascale/

© 2019 UCF Consortium 4 © 2019 UCF Consortium 5 UCX Portability

• Support for x86_64, Power 8/9, Arm v8
• U-arch tuned code for Xeon, AMD Rome/Naples, Arm v8 (Cortex-A/N1/ThunderX2/Huawei)
• First-class support for AMD and NVIDIA GPUs
• Runs on servers, Raspberry Pi-like platforms, SmartNICs, etc.

Pictured platforms: NVIDIA Jetson, Odroid C2, Arm ThunderX2, BlueField SmartNIC, N1 SDP

© 2019 UCF Consortium 6 Annual Release Schedule 2019

• v1.6.0 - July '19 • v1.6.1 - October '19 • v1.7.0 – End of November '19

© 2019 UCF Consortium 7 RECENT DEVELOPMENT v1.6.x

v1.6.0
• AMD GPU ROCm transport redesign: support for managed memory, direct copy, ROCm GDR
• Modular architecture for UCT transports – runtime plugins
• Random scheduling policy for DC transport
• Improved support for the Verbs API
• Optimized out-of-the-box settings for multi-rail
• Support for PCI atomics with IB transports
• Reduced UCP address size for homogeneous environments

v1.6.1
• Added Bull Atos HCA device IDs
• Azure Pipelines CI infrastructure
• Clang static checker

© 2019 UCF Consortium 8 RECENT DEVELOPMENT v1.7.0 (rc1)

• Added support for multiple listening transports
• Added UCT socket-based connection manager transport
• Updated API for UCT component management
• Added API to retrieve the listening port (a sketch follows below)
• Added UCP active message API
• Removed deprecated API for querying UCT memory domains
• Refactored server/client examples
• Added support for dlopen interception in UCM
• Added support for PCIe atomics
• Updated Java API: added support for most UCP-layer operations
• Updated support for Mellanox DevX API
• Added multiple UCT/TCP transport performance optimizations
• Platform-optimized memcpy()
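To make the connection-manager and listening-port items above concrete, here is a minimal C sketch of a UCP socket-address listener that then queries the port it was bound to. It assumes a UCX build around v1.7 with the ucp_listener_query attribute API; error handling and the actual connection handling are omitted.

```c
/* Hedged sketch: create a UCP socket-address listener and query the port it
 * was bound to. Compile with: cc listener.c -lucp -lucs */
#include <ucp/api/ucp.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>

static void on_conn(ucp_conn_request_h req, void *arg)
{
    /* A real server would pass 'req' to ucp_ep_create() here. */
    (void)req; (void)arg;
}

int main(void)
{
    ucp_params_t ctx_params = { .field_mask = UCP_PARAM_FIELD_FEATURES,
                                .features   = UCP_FEATURE_TAG };
    ucp_worker_params_t wrk_params = { .field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE,
                                       .thread_mode = UCS_THREAD_MODE_SINGLE };
    struct sockaddr_in addr = { .sin_family      = AF_INET,
                                .sin_addr.s_addr = INADDR_ANY,
                                .sin_port        = 0 /* let the OS pick */ };
    ucp_context_h ctx;
    ucp_worker_h worker;
    ucp_listener_params_t lst_params;
    ucp_listener_h listener;
    ucp_listener_attr_t attr;

    ucp_init(&ctx_params, NULL, &ctx);
    ucp_worker_create(ctx, &wrk_params, &worker);

    memset(&lst_params, 0, sizeof(lst_params));
    lst_params.field_mask       = UCP_LISTENER_PARAM_FIELD_SOCK_ADDR |
                                  UCP_LISTENER_PARAM_FIELD_CONN_HANDLER;
    lst_params.sockaddr.addr    = (const struct sockaddr*)&addr;
    lst_params.sockaddr.addrlen = sizeof(addr);
    lst_params.conn_handler.cb  = on_conn;
    lst_params.conn_handler.arg = NULL;
    ucp_listener_create(worker, &lst_params, &listener);

    attr.field_mask = UCP_LISTENER_ATTR_FIELD_SOCKADDR;  /* v1.7 attribute */
    if (ucp_listener_query(listener, &attr) == UCS_OK) {
        printf("listening on port %d\n",
               ntohs(((struct sockaddr_in*)&attr.sockaddr)->sin_port));
    }

    ucp_listener_destroy(listener);
    ucp_worker_destroy(worker);
    ucp_cleanup(ctx);
    return 0;
}
```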

© 2019 UCF Consortium 9 Preliminary Release Schedule 2020

• v1.8.0 - April ‘20 • v1.9.0 - August ‘20 • v1.10.0 - December ‘20

© 2019 UCF Consortium 10 Coming soon - 2020

• UCX Python / UCX-Py
  • https://github.com/rapidsai/ucx-py
  • Already integrated with Dask and RAPIDS AI
• UCX Java
  • Java API in the official UCX release
• Spark UCX
  • www.sparkucx.org
  • https://github.com/openucx/sparkucx
• Collective API
  • For more details, see the collective API pull requests on GitHub
• Enhanced active message API
• FreeBSD and macOS support
• Improved documentation

© 2019 UCF Consortium 11 Save the date !

. UCX Hackathon: December 9-12 (Registration required) • https://github.com/openucx/ucx/wiki/UCF-Hackathon-2019 . Austin, TX

© 2019 UCF Consortium 12 Laptop Stickers

© 2019 UCF Consortium 13 Presenters

. Arm . Los Alamos National Laboratory . Mellanox . Argonne National Laboratory . Stony Brook . Charm++ . AMD . ORNL . Nvidia . OSU

© 2019 UCF Consortium 14 UCX Support for Arm

Pavel Shamis (Pasha), Principal Research Engineer

© 2019 Arm Limited UCX on ThunderX2 (arm v8) at Scale

16 © 2019 Arm Limited Recent Numbers: Arm + InfiniBand ConnectX6 2x200 Gb/s

| Configuration | Put Ping-Pong Latency, 8B (usec) | Put Injection Rate, 8B (MPP/s) |
|---|---|---|
| UCX-UCT (low level), Accelerated | 0.72 | 16.3 |
| UCX-UCT (low level), Verbs | 0.75 | 5.9 |
| Verbs, IB_WRITE_LAT/BW (Post List, 2 QPs) | 0.95 | 32 |
| UCP, Verbs | 1.07 | 5.6 |
| UCP, Accelerated | 0.81 | 15.9 |

• ConnectX-6 2 x 200 Gb/s, port 1 <-> port 2 (no switch) • PCIe Gen4 • Internal UCX version (will be upstreamed)

17 © 2019 Arm Limited UCX Mellanox Update

November 2019

© 2019 Mellanox Technologies | Confidential 18 High-level overview

Applications: HPC (MPI, SHMEM, …), Storage, RPC, AI, Web 2.0 (Spark, Hadoop)

UCP – High-Level API (Protocols)
• Transport selection, multi-rail, fragmentation
• HPC API: tag matching, active messages
• I/O API: stream, RPC, remote memory access, atomics
• Connection establishment: client/server, external

UCT – Low-Level API (Transports)
• RDMA (RC, DCT, UD, iWarp) over the OFA Verbs driver
• GPU / Accelerators: CUDA, AMD/ROCm
• Others: shared memory, TCP, OmniPath

© 2018 Mellanox Technologies 19

UCX v1.7 Advantages
• UCX is the default communication substrate for HPC-X and many open-source HPC runtimes
  • Open MPI and OSHMEM
  • HCOLL
  • BUPC (UCX conduit, WIP)
  • Charm++/UCX
  • NCCL/UCX – WIP
  • MPICH
  • SparkUCX – WIP
• Java and Python bindings
• Full support for GPUDirect RDMA, GDR copy, and CUDA IPC
• CUDA aware – three-stage pipelining protocol
• Support for HCA atomics, including new bitwise atomics
• Shared memory (CMA, KNEM, XPMEM, SysV, mmap)
• Support for non-blocking memory registration and On-Demand Paging (ODP) (a sketch follows below)
• Multithreaded support
• Support for hardware tag matching
• Support for multi-rail and Socket Direct
• Support for PCIe atomics
• Support for MEMIC – HCA memory
• In-box in many distros, and more to come
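As a small illustration of the non-blocking registration / ODP bullet above, the following hedged C sketch registers a user buffer with ucp_mem_map() using the UCP_MEM_MAP_NONBLOCK flag. The UCP context is assumed to have been created with ucp_init() as usual; whether ODP is actually used depends on the HCA and the UCX configuration.

```c
/* Hedged sketch: non-blocking registration of a user buffer with UCP.
 * Error handling is kept minimal for brevity. */
#include <ucp/api/ucp.h>
#include <string.h>

ucp_mem_h register_buffer(ucp_context_h context, void *buf, size_t len)
{
    ucp_mem_map_params_t params;
    ucp_mem_h memh;

    memset(&params, 0, sizeof(params));
    params.field_mask = UCP_MEM_MAP_PARAM_FIELD_ADDRESS |
                        UCP_MEM_MAP_PARAM_FIELD_LENGTH  |
                        UCP_MEM_MAP_PARAM_FIELD_FLAGS;
    params.address    = buf;
    params.length     = len;
    /* NONBLOCK asks UCX to defer page pinning (ODP-backed where supported),
     * so the call returns quickly even for very large regions. */
    params.flags      = UCP_MEM_MAP_NONBLOCK;

    if (ucp_mem_map(context, &params, &memh) != UCS_OK) {
        return NULL;
    }
    return memh;   /* later: ucp_rkey_pack() to expose it for RMA,
                      ucp_mem_unmap() to release it */
}
```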

© 2019 Mellanox Technologies | Confidential 20 Unified Communication X

UNIFYING Network and GPU Communication Seamlessly, Transparently, Elegantly

© 2019 Mellanox Technologies | Confidential 21 Open MPI / UCX

© 2019 Mellanox Technologies | Confidential 22 CUDA aware UCX

• CUDA-aware tag API for data movement on GPU clusters
  • Intra-node: GPU-GPU, GPU-host, host-GPU
  • Inter-node: GPU-GPU, GPU-host, host-GPU
• Optimal protocols
  • GPUDirect RDMA
  • CUDA IPC
  • GDRCopy
  • Pipelining
• Efficient memory type detection (a sketch follows below)
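A minimal sketch of what "CUDA aware" means at the API level: the application passes a cudaMalloc'ed pointer straight to the UCP tag API, and UCX's memory-type detection chooses the protocol (GPUDirect RDMA, CUDA IPC, gdrcopy, or pipelining) internally. The helper below assumes an already connected endpoint and worker and a UCX build with CUDA support; it is illustrative, not the HPC-X implementation.

```c
/* Hedged sketch: posting a tagged send directly from GPU device memory.
 * No explicit staging copy is needed in the application. */
#include <ucp/api/ucp.h>
#include <cuda_runtime.h>

static void send_cb(void *request, ucs_status_t status)
{
    (void)request; (void)status;   /* completion is polled below */
}

void send_from_gpu(ucp_worker_h worker, ucp_ep_h ep, size_t len, ucp_tag_t tag)
{
    void *gpu_buf;
    cudaMalloc(&gpu_buf, len);                       /* device memory */
    cudaMemset(gpu_buf, 0, len);

    ucs_status_ptr_t req = ucp_tag_send_nb(ep, gpu_buf, len,
                                           ucp_dt_make_contig(1), tag, send_cb);
    if (UCS_PTR_IS_PTR(req)) {
        while (ucp_request_check_status(req) == UCS_INPROGRESS) {
            ucp_worker_progress(worker);             /* drive the transfer */
        }
        ucp_request_free(req);
    }
    cudaFree(gpu_buf);
}
```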

© 2019 Mellanox Technologies | Confidential 23 Performance: System and Software description Hardware . . GPU: 6 x Tesla V100 NVLink . HCA: ConnectX-5 . CPU: PowerPC

Software . CUDA 10.1 . OpenMPI-4.0.2 . UCX 1.7 . Tuning . UCX_RNDV_SCHEME=get_zcopy . UCX_RNDV_THRESH=1

© 2019 Mellanox Technologies | Confidential 24 System Architecture

[Diagram: two POWER9 sockets connected by an inter-processor X-Bus, each with HBM2 and GPUs; 2-lane NVLink between GPU-GPU and CPU-GPU; 8-lane PCIe Gen4 to the InfiniBand HCA (ports 0 and 1); EDR InfiniBand interconnect]

© 2019 Mellanox Technologies | Confidential 25 OMB: Latency & Bandwidth

[Charts: latency (usec) and bandwidth (GB/s) vs. message size (64K-4M) for HBM2, 2-lane NVLink GPU-GPU, 2-lane NVLink CPU-GPU, and InfiniBand GPU-GPU]

© 2019 Mellanox Technologies | Confidential 26 Performance Summary

| | GPU HBM2 | 2-Lane NVLink GPU-GPU | 2-Lane NVLink CPU-GPU | IB EDR x2 GPU-GPU |
|---|---|---|---|---|
| Theoretical Peak BW | 900 GB/s | 50 GB/s | 50 GB/s | 25 GB/s |
| Available Peak BW | 723.97 GB/s | 46.88 GB/s | 46 GB/s | 23.84 GB/s |
| UCX Peak BW | 349.6 GB/s | 45.7 GB/s | 23.7 GB/s | 22.7 GB/s |
| % Peak | 48.3% | 97.5% | 51.5% | 95.2% |

© 2019 Mellanox Technologies | Confidential 27 MPICH / UCX

© 2019 Mellanox Technologies | Confidential 28 Performance: System and Software description Hardware . Summit Supercomputer . GPU: 6 x Tesla V100 NVLink . HCA: ConnectX-5 . CPU: PowerPC

Software . CUDA 10.1 . MPICH . master branch (12b9d564fa13d12d11463d4405db04d3354314f3) . UCX 1.7 . Tuning . UCX_RNDV_SCHEME=get_zcopy . UCX_RNDV_THRESH=1

© 2019 Mellanox Technologies | Confidential 29 osu_latency, osu_bandwidth

[Charts: osu_latency (usec) and osu_bandwidth (GB/s) vs. message size (64K-4M) for NVLink GPU-GPU, NVLink CPU-GPU, IB GPU-GPU, and HBM2; bandwidth data labels include 268.56, 256.35, 45.35, and 4.69 GB/s]

© 2019 Mellanox Technologies | Confidential 30 Performance Summary

| | GPU HBM2 | 2-Lane NVLink GPU-GPU | 2-Lane NVLink CPU-GPU | IB EDR x2 GPU-GPU |
|---|---|---|---|---|
| Theoretical Peak BW (GB/s) | 900 | 50 | 50 | 25 |
| Available Peak BW (GB/s) | 723.97 | 46.88 | 46 | 23.84 |
| UCX Peak BW (GB/s) | 346.6 | 46.6 | 23.7 | 22.6 |
| % Peak | 47.87% | 99.40% | 51.52% | 94.80% |

© 2019 Mellanox Technologies | Confidential 31 Collective: osu_allreduce

[Charts: osu_allreduce -cuda latency (usec) vs. number of nodes (2-256) for 4B, 4K, 32K, and 1MB messages, nccl vs. ucx_p2p; MPICH with HCOLL]

© 2019 Mellanox Technologies | Confidential 32 NCCL / UCX

© 2019 Mellanox Technologies | Confidential 33 NCCL Inter-Node P2P Communication Plugin

Key features
• Replaces inter-node point-to-point communication
• Dedicated API exposed by NCCL
• Three-phased communication: connection establishment, data transfer, connection closure

API (a sketch follows below)
• Connection establishment: Listen, Connect, Accept
• Data transfer (non-blocking): Send, Receive, Test
• Connection closure: CloseListen, CloseSend, CloseReceive
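The sketch below is only meant to visualize the three-phase plugin API listed above as a C function-pointer table. The names and signatures are simplified assumptions for explanation and do not reproduce the exact NCCL net-plugin ABI.

```c
/* Illustrative only: a hypothetical function table mirroring the three
 * phases above (establishment, non-blocking transfer, closure). */
#include <stddef.h>

typedef struct example_net_plugin {
    /* connection establishment */
    int (*listen)(int dev, void *handle, void **listen_comm);
    int (*connect)(int dev, void *handle, void **send_comm);
    int (*accept)(void *listen_comm, void **recv_comm);

    /* non-blocking data transfer: post, then poll with test() */
    int (*isend)(void *send_comm, void *data, size_t size, void **request);
    int (*irecv)(void *recv_comm, void *data, size_t size, void **request);
    int (*test)(void *request, int *done, size_t *received);

    /* closure */
    int (*close_send)(void *send_comm);
    int (*close_recv)(void *recv_comm);
    int (*close_listen)(void *listen_comm);
} example_net_plugin_t;

/* A UCX-backed plugin would implement isend/irecv on top of UCP tagged or
 * active-message operations and drive ucp_worker_progress() from test(). */
```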

© 2019 Mellanox Technologies | Confidential 34 System and Software description Hardware . 32 Nodes of Summit Supercomputer . GPU: 6 x Tesla V100 NVLink . HCA: 1 x ConnectX-5 . CPU: PowerPC

Software . Nvidia NCCL 2.4.7 . CUDA 10.1 . UCX 1.7 . Tuning . UCX_RNDV_SCHEME=get_zcopy . UCX_RNDV_THRESH=1 . NCCL UCX plugin – HPC-X v2.6 preview

© 2019 Mellanox Technologies | Confidential 35 NCCL Internal Verbs vs NCCL UCX Plugin

• The UCX plugin outperforms the NCCL Verbs implementation by up to 13% on large messages

[Chart: Allreduce bandwidth in GB/s (higher is better), verbs vs. ucx, for Allreduce sizes from 512K to 256M]

© 2019 Mellanox Technologies | Confidential 36 Unified Communication X

Scaling Azure HPC with HPC-X/UCX

© 2019 Mellanox Technologies | Confidential 37 Azure HPC HBv2 virtual machines for HPC

.State of the art VMs feature a wealth of new technology, including: .AMD EPYC 7742 CPUs (Rome) .2.45 GHz Base clock / 3.3 GHz Boost clock .480 MB L3 cache, 480 GB RAM .340 GB/s of Memory Bandwidth .200 Gbps HDR InfiniBand (SRIOV) with Adaptive Routing .900 GB SSD (NVMeDirect)

© 2019 Mellanox Technologies | Confidential 38 Star-CCM+ on HBv2
• App: Siemens Star-CCM+
• Version: 14.06.004
• Model: LeMans 100M Coupled Solver
• Configuration details: 116 MPI ranks were run in each HBv2 VM (4 ranks from each of 29 NUMA domains) in order to leave nominal resources for Linux background processes. Adaptive Routing was enabled, DCT (Dynamic Connected Transport) was used as the transport layer, and HPC-X version 2.50 (UCX v1.6) was used for MPI. The Azure CentOS HPC 7.6 image from https://github.com/Azure/azhpc-images was used.

Summary: Star-CCM+ was scaled at 81% efficiency to nearly 15,000 MPI ranks delivering an application speedup of more than 12,000x. This compares favorably to Azure’s previous best of more than 11,500 MPI ranks, which itself was a world-record for MPI scalability on the public cloud.

© 2019 Mellanox Technologies | Confidential 39 ANSYS Fluent on HBv2

• App: ANSYS Fluent
• Version: 14.06.004
• Model: External Flow over a Formula-1 Race Car (f1_racecar_140m)
• Configuration details: 60 MPI ranks were run in each HBv2 VM (2 of the 4 cores per NUMA domain) in order to leave nominal resources for Linux background processes and to give ~6 GB/s of memory bandwidth per core. Adaptive Routing was enabled, DCT (Dynamic Connected Transport) was used as the transport layer, and HPC-X version 2.50 (UCX v1.6) was used for MPI. The Azure CentOS HPC 7.6 image from https://github.com/Azure/azhpc-images was used.
• Summary: HBv2 VMs scale super-linearly (112%) up to the largest measured VM count (128). The Fluent solver rating measured at this top-end scale is 83% higher than the current leading submission in the ANSYS public database for this model (https://bit.ly/2OdAExM).

© 2019 Mellanox Technologies | Confidential 40 Thank You

© 2019 Mellanox Technologies | Confidential 41 UCX Support in MPICH

Yanfei Guo Assistant Computer Scientist Argonne National Laboratory Email: [email protected] MPICH layered structure: CH4

MPI Layer (platform-independent code)
• Collectives
• Communicator management

CH4
• CH4 "Generic": packet headers + handlers
• "Netmods" (ucx, ofi): provide all functionality either natively or by calling back to the "generic" implementation
• "Shmmods" (SHM: posix, xpmem, cma): implement some mandatory functionality; can override any amount of optional functionality (e.g. better bcast, better barrier)

UCX BoF @ SC 2019 43 Benefit of using UCX in MPICH

• Separating general optimizations and device-specific optimizations
  – Lightweight and high-performance communication
  – Native communication support
  – Simple and easy to maintain
  – MPI can benefit from new hardware more quickly
• Better hardware support
  – Accelerated verbs with Mellanox hardware
  – Support for GPUs

UCX BoF @ SC 2019 44 MPICH/UCX with Accelerated Verbs

• UCX_TLS=rc_mlx5,cm
• Lower overhead
  – Low latency
  – Higher message rate
• OSU latency: 0.99 us
• OSU bandwidth: 12064.12 MB/s
• Argonne JLSE Thing cluster: Intel E5-2699v3 @ 2.3 GHz, ConnectX-4 EDR, HPC-X 2.2.0, OFED 4.4-2.0.7

[Chart: message rate vs. message size (1B-4KB), accelerated verbs]

UCX BoF @ SC 2019 45 MPICH/UCX with Accelerated Verbs

[Charts: pt2pt latency, MPI_Get latency, and MPI_Put latency (us) vs. message size (1B-4KB), accelerated verbs]

UCX BoF @ SC 2019 46 MPICH/UCX with HCOLL

MPI_Allreduce latency; Argonne JLSE Thing cluster: Intel E5-2699v3 @ 2.3 GHz, ConnectX-4 EDR, HPC-X 2.2.0, OFED 4.4-2.0.7; 6 nodes, ppn=1

[Chart: MPI_Allreduce latency (us) vs. message size, HCOLL vs. MPICH]

UCX BoF @ SC 2019 47 UCX Support in MPICH

• UCX netmod development
  – MPICH team
  – Mellanox
  – NVIDIA
• MPICH 3.3.2 just released, 3.4a2 coming soon
  – Includes an embedded UCX 1.6.1
• Native path
  – pt2pt (with pack/unpack callbacks for non-contiguous buffers; an example follows below)
  – Contiguous put/get RMA for win_create/win_allocate windows
• Emulation path is CH4 active messages (hdr + data)
  – Layered over the UCX tagged API
• Not yet supported
  – MPI dynamic processes
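For reference, a small standard-MPI example of the kind of non-contiguous transfer that goes through the pack/unpack callbacks mentioned above (a strided MPI_Type_vector). Nothing in it is specific to the UCX netmod.

```c
/* Illustrative non-contiguous send/recv. Run with: mpiexec -n 2 ./noncontig */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, buf[16];
    MPI_Datatype col;               /* every 4th int, 4 times */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Type_vector(4, 1, 4, MPI_INT, &col);
    MPI_Type_commit(&col);

    for (int i = 0; i < 16; i++) buf[i] = rank ? -1 : i;

    if (rank == 0) {
        MPI_Send(buf, 1, col, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 1, col, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received strided elements: %d %d %d %d\n",
               buf[0], buf[4], buf[8], buf[12]);
    }

    MPI_Type_free(&col);
    MPI_Finalize();
    return 0;
}
```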

UCX BoF @ SC 2019 48 Hackathon on MPICH/UCX

• Earlier hackathons with Mellanox
  – Full HCOLL and UCX integration in MPICH 3.3, including HCOLL non-contiguous datatypes
  – MPICH CUDA support using UCX and HCOLL, tested and documented
    • https://github.com/pmodels/mpich/wiki/MPICH-CH4:UCX-with-CUDA-support
  – Support for the FP16 datatype (non-standard, MPIX)
  – IBM XL and Arm HPC Compiler support
  – Extended UCX RMA functionality, under review
    • https://github.com/pmodels/mpich/pull/3398

UCX BoF @ SC 2019 49 Upcoming plans

• Native UCX atomics
  – Enabled when the user supplies certain info hints (an example follows below)
  – https://github.com/pmodels/mpich/pull/3398
• Extended CUDA support
  – Handle non-contiguous datatypes
  – https://github.com/pmodels/mpich/pull/3411
  – https://github.com/pmodels/mpich/issues/3519
• Better MPI_THREAD_MULTIPLE support
  – Utilizing multiple workers (Rohit is looking into this now)
• Extended support for FP16
  – Support for _Float16 available in some compilers (MPIX_C_FLOAT16)
  – Missing support when the GPU/network supports FP16 but the CPU does not
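As an illustration of "user-supplied info hints" for atomics, the sketch below passes the standard MPI-3 accumulate_ops=same_op_no_op window hint before doing a fetch-and-add. Whether MPICH's UCX netmod keys its native-atomics path off this particular hint is an assumption here; the linked pull request has the authoritative details.

```c
/* Illustrative: an RMA window with the standard "accumulate_ops" hint,
 * which permits the library to use hardware atomics for the accumulates. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "accumulate_ops", "same_op_no_op");

    long *base;
    MPI_Win win;
    MPI_Win_allocate(sizeof(long), sizeof(long), info,
                     MPI_COMM_WORLD, &base, &win);
    *base = 0;                        /* initialize the local counter */
    MPI_Barrier(MPI_COMM_WORLD);      /* make the init visible before atomics */

    long one = 1, result;
    MPI_Win_lock_all(0, win);
    /* fetch-and-add on rank 0's counter; a candidate for a native atomic */
    MPI_Fetch_and_op(&one, &result, MPI_LONG, 0, 0, MPI_SUM, win);
    MPI_Win_flush(0, win);
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```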

UCX BoF @ SC 2019 50 MPICH

. http://github.com/pmodels/mpich – Submit an issue or pull request! . Schedule a hackathon with us!

UCX BoF @ SC 2019 51 UCX Networking Layer for Charm++

UCX Community BoF SC 19

Nitin Bhat Software Engineer Charmworks Inc.

52 What is Charm++?

• Charm++ is a generalized approach to writing parallel programs
  • An alternative to the likes of MPI, UPC, GA, etc.
  • But not to sequential languages such as C, C++, and Fortran
• It represents:
  • The style of writing parallel programs
  • The runtime system
  • And the entire ecosystem that surrounds it
• Three design principles: over-decomposition, migratability, asynchrony
• Enables: load balancing, shrink/expand, fault tolerance

53 Why we needed a new layer?

. Verbs layer was difficult to maintain/not working on new InfiniBand machines

. MPI layer was not scaling well

. We are interested in the runtime (and not so much on networking layers) so we wanted a portable and performant layer

. UCX offered . Portability . High performance . Ease of maintenance

54 Charm++ Architecture

Applications
Libs / Langs
Charm++ Programming Model
Converse Runtime System
Low-Level Runtime System Interface (LRTS)
Machine layers: uGNI, verbs, libfabric, MPI, TCP/IP, UCX, and more

55 UCX Machine Layer Implementation

• Init
  • Process management: simple PMI / Slurm PMI / PMIx
  • Each process: ucp_init, ucp_worker_create, ucp_ep_create
  • Prepost receive buffers: ucp_tag_recv_nb
• Regular API
  • Send: ucp_tag_send_nb
  • Recv: ucp_tag_recv_nb / ucp_tag_msg_recv_nb
• Zero-copy API
  • Send a metadata message using the regular API
  • RDMA operations using ucp_put_nb / ucp_get_nb
(an initialization sketch follows below)
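A hedged C sketch of the initialization pattern described above (UCP context, worker, pre-posted tagged receives). It is not the actual Charm++ machine-layer source: the PMI address exchange, endpoint creation to peers, and request tracking are left out.

```c
/* Hedged sketch: UCP setup plus a pool of pre-posted tagged receives so
 * incoming messages can land without an unexpected-message copy. */
#include <ucp/api/ucp.h>
#include <stdlib.h>

#define NUM_PREPOST 16
#define RECV_SIZE   65536
#define MSG_TAG     0x1ull

static void recv_cb(void *request, ucs_status_t status,
                    ucp_tag_recv_info_t *info)
{
    /* hand the message to the runtime here, then re-post the buffer */
    (void)request; (void)status; (void)info;
}

void machine_layer_init(ucp_context_h *ctx, ucp_worker_h *worker)
{
    ucp_params_t cp = { .field_mask = UCP_PARAM_FIELD_FEATURES,
                        .features   = UCP_FEATURE_TAG | UCP_FEATURE_RMA };
    ucp_worker_params_t wp = { .field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE,
                               .thread_mode = UCS_THREAD_MODE_SINGLE };

    ucp_init(&cp, NULL, ctx);
    ucp_worker_create(*ctx, &wp, worker);
    /* ucp_ep_create() to every peer happens after the PMI address exchange */

    for (int i = 0; i < NUM_PREPOST; i++) {
        void *buf = malloc(RECV_SIZE);
        /* full tag mask: match MSG_TAG exactly; the returned request handle
         * would be tracked in a real machine layer */
        ucp_tag_recv_nb(*worker, buf, RECV_SIZE, ucp_dt_make_contig(1),
                        MSG_TAG, (ucp_tag_t)-1, recv_cb);
    }
}
```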

56 Micro Benchmarks

57 Charm++ p2p Pingpong Benchmark - Frontera (TACC)

• Up to 47% better than MPI

[Chart: point-to-point pingpong one-way time (us) vs. message size (bytes), UCX vs. MPI; Intel Xeon 8280, HDR100 InfiniBand]

58 Charm++ p2p Pingpong Benchmark - Thor (HPC Advisory Council)

• Up to 63% better than MPI

[Chart: point-to-point pingpong one-way time (us) vs. message size (bytes), UCX vs. MPI; Intel Broadwell E5-2697A, HDR100 InfiniBand]

59 Charm++ p2p Pingpong Benchmark - Thor (HPC Advisory Council)

• Up to 87% better than Verbs

[Chart: point-to-point pingpong one-way time (us) vs. message size (bytes), UCX vs. Verbs; Intel Broadwell E5-2697A, HDR100 InfiniBand]

60 Application Performance

61 NAMD

• Nanoscale Molecular Dynamics (NAMD) is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems

. NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations

. NAMD is written using Charm++ parallel programming model

. It is noted for its parallel efficiency and is often used to simulate large systems (millions of atoms)

62 NAMD (STMV) - Thor . UCX Machine Layer is 4% faster than MPI Machine Layer for STMV (1M)

[Chart: NAMD (STMV) jobs/day, MPI vs. UCX, for three configurations: no-SMP N640-PPN40 (16 nodes), NP16 PPN1 WTHR39 (16 nodes), and no-SMP N880-PPN40 (22 nodes); Intel Broadwell E5-2697A, EDR InfiniBand]

63 ChaNGa

0 NO SMP N640-PPN40, 16 nodes NP16, PPN1, WTHR39, 16 nodes NO SMP N880-PPN40, 22 nodes NAMD Jobs Intel Broadwell E5-2697A, EDR InfiniBand MPI UCX 63 ChaNGa

. Cosmological simulation framework "ChaNGa" is a collaborative project with Prof. Thomas Quinn (University of Washington) supported by the NSF

. ChaNGa (Charm N-body GrAvity solver) is a code to perform collisionless N-body simulations

. ChaNGa can perform cosmological simulations with periodic boundary conditions in comoving coordinates or simulations of isolated stellar systems

• ChaNGa uses the dynamic load balancing scheme of the Charm++ runtime system to obtain good performance on parallel systems

64 ChaNGa (dwf 5M) – Thor
• The UCX machine layer provides 49% higher performance at 16 nodes
• The UCX machine layer provides 161% higher performance at 32 nodes
• Performance reduction demonstrated with the MPI machine layer beyond 16 nodes

[Chart: ChaNGa (dwf1.2048.00384) jobs/day vs. number of nodes (2-32), MPI vs. UCX; Intel Broadwell E5-2697A, EDR InfiniBand]

65 ChaNGa (dwf 50M) - Thor
• The UCX machine layer provides 104% higher performance at 8 nodes
• The UCX machine layer provides 200% higher performance at 16 nodes
• Performance reduction demonstrated with the MPI machine layer beyond 16 nodes

[Chart: ChaNGa (dwf1.6144.01472, AVX, -b=24) jobs/day vs. number of nodes (4-16), MPI vs. UCX; Intel Broadwell E5-2697A, EDR InfiniBand]

66 ChaNGa (dwf 5M) - Frontera (TACC)

• Initial results

[Chart: 7% improvement; Intel Xeon 8280, HDR100 InfiniBand]

67 ChaNGa (h148 - 550M) - Frontera (TACC)
• Milky Way with a supermassive black hole in the middle

[Chart: 37% improvement; Intel Xeon 8280, HDR100 InfiniBand]

68 Conclusions and Future work

• The UCX machine layer has proven performant, as shown by initial testing and results.

. Future work

. Testing on machines with other vendor networks (uGNI, PAMI etc.)

. Performance analysis and tuning

69 Acknowledgements

. Mellanox

. Ophir Maor

. Yong Qin

. Mikhail Brinskii

. Yossi Itigin

. Charmworks, Inc

. Evan Ramos

. Eric Bohm

. Sam White

70 Thank You Extra Slides

72 Charm++ bcast Ping All Benchmark – Frontera (TACC)

[Chart: ping-all benchmark, ping-and-reduce time (us) vs. number of nodes (1-128), for 256 B, 65 KB, 1 MB, and 16 MB messages, UCX vs. MPI]

73 Charm++ p2p Pingpong Benchmark - iForge (NCSA)

[Chart: point-to-point pingpong one-way time (us) vs. message size (bytes), UCX vs. MPI vs. Verbs]

74 ChaNGa (dwf 5M) UCX Performance Optimizations
• The UCX machine layer provides 33% higher performance when tuned compared to out-of-the-box performance

[Chart: 33% improvement; Intel Broadwell E5-2697A, EDR InfiniBand]

75 ROCM UCX INTEGRATION UCX community BoF Supercomputing 2019 ROCm Leverages OpenUCX For Scale-up and Scale-out Distributed Programming Models

. Next generation open source HPC communication framework

. Built off the foundation of MXM, UCCS, PAMI

• Broad industry support including IBM, Arm, LANL, Mellanox, NVIDIA, ORNL, SBU, UT, UH, and AMD
• Rich platform for supporting MPI, OpenSHMEM, PGAS

ROCm SOFTWARE PLATFORM: An open-source foundation for hyperscale and HPC-class GPU computing

Graphics Core Next headless Linux® 64-bit driver
• Large memory single allocation
• Peer-to-peer multi-GPU
• Peer-to-peer with RDMA
• Systems management API and tools

HSA drives rich capabilities into the ROCm hardware and software
• User-mode queues
• Architected queuing language
• Flat memory addressing
• Atomic memory transactions
• Process concurrency & preemption

Rich compiler foundation for the HPC developer
• LLVM native GCN ISA code generation
• Offline compilation support
• Standardized loader and code object format
• GCN ISA assembler and disassembler
• Full documentation of the GCN ISA

"Open Source" tools and libraries
• Rich set of "Open Source" math libraries
• Tuned "Deep Learning" frameworks
• Optimized parallel programming frameworks
• CodeXL profiler and GDB debugging

AMD EXASCALE STACK

[Diagram: AMD exascale software stack]
• Applications: HPC apps, ML frameworks (PyTorch, TensorFlow up-streamed)
• Cluster deployment: Singularity, SLURM, Docker, Kubernetes
• Tools: debugger, profiler/tracer, system validation, system management
• Portability frameworks: Kokkos, RAJA, GridTools, ONNX
• Math libraries: RNG, FFT, sparse, BLAS, Eigen, MIOpen
• Scale-out communication libraries: OpenMP for GPUs, Open MPI, UCX, MPICH, RCCL
• Programming models: OpenMP, HIP (makes CUDA portable), OpenCL™, Python
• Processors: CPU + GPU, datacenter-ready at scale
• 100% open
(Status legend: future / beta / early production)

79 SC19 Briefing | PUBLIC | 2019 ROCm for Distributed Systems

• The CPU can directly access GPU memory
  ‒ Exposes the entire GPU frame buffer as addressable memory through the PCIe BAR (LargeBar feature)
  ‒ Maps GPU pages to CPU pages, allowing the CPU to directly load/store from/to GPU memory
• The HCA can directly access GPU memory: the ROCnRDMA feature
  ‒ Leverages Mellanox's PeerDirect feature
  ‒ Allows the IB HCA to directly read/write data from/to GPU memory
  ‒ Available and enabled by default in ROCm
  ‒ Integrated into the ROCm drivers
• IPC for intra-node communication
  ‒ ROCm-IPC in UCT for inter-process communication among GPUs
  ‒ Improvement over the original CMA TL

‒ Improvement from the original CMA TL OSU Micro Benchmarks - Latency (Intranode, Device-to-Device) 1000000

100000

10000

• Improvements for ROCm support 1000

• rocm_ipc: for intra-node cross process zcopy, support ROCm or 100 Latency Latency (us) host memory 10

• rocm_cpy: for intra-process short and zcopy, support ROCm or 1

host memory 0 • rocm_gdr: use gdr_copy for fast read speed from GPU device memory Message Sizes (Bytes) • Enabled perftest and gtest for ROCm support without rocm_ipc and rocm_gdr rocm_ipc + rocm_gdr + UCX_RDNV_THRESH=8192 OSU Micro Benchmarks - Bandwidth (Intranode, Device-to-Device) • ROCM-IPC provides efficient support for large messages 40000 • 1.61 us for 16 Bytes, 36 us for 512KBytes transfer for intra-node 35000 (D-D) 30000 • 33 GB/s for 32MBytes for intranode D-D transfer over Infinity 25000 Fabric 20000 15000

Bandwidth Bandwidth (MB/s) 10000 • Test Configuration: 5000

0

0 1 2 4 8

16 32 64

1K 2K 4K 8K

1M 2M 4M 8M

128 256 512

32K 64K

• AMD Instinct MI50 GPUs on PCIe Gen3 platform 16K

16M 32M 64M

128K 256K 512K • Hip-ified OSU Micro Benchmarks Message Sizes (Bytes) without ROCm-IPC ROCm IPC (PCIe Gen3) ROCm-IPC (Infinity Fabric) OSU Micro Benchmarks - Latency (Inter-node, Device-to-Device) 1000 • Takes advantage of LargeBar capability to support eager protocols

• Eager protocols can run directly from GPU buffers 100 • Integrated ROCnRDMA to design rendezvous (RNDV) protocols

• Optimization and tuning work to be continued Latency (us) 10 • Enhanced and optimized GPU-Aware protocols for pipeline

1

• Performance shown UCX over ROCm Message Sizes (Bytes)

• 2.9 us for 4 Bytes transfer for inter-nodes D-D • Full EDR IB bandwidth achieved on large messages OSU Micro Benchmarks - Bandwidth • Expect EPYC Rome Gen4 platform to deliver full HDR IB bandwidth (Inter-node, Device-to-Device) with MI50 16000 • HDR IB, GPU device and EPYC Rome platform are PCIe Gen4 14000 capable 12000 10000

8000

• Test Configuration: 6000 Bandwidth Bandwidth (MB/s) • AMD MI50 GPUs on x86 PCIe Gen3 platform 4000 2000

• Mellanox ConnectX-5 EDR InfiniBand 0

0 1 2 4 8

16 32 64

1K 2K 4K 8K

1M 2M 4M 8M

128 256 512

16K 32K 64K

32M 64M

• HIP-ified OSU Micro Benchmarks 16M

128K 256K 512K 128M Message Sizes (Bytes)

The information contained herein is for informational purposes only, and is subject to change without notice. Timelines, roadmaps, and/or product release dates shown in these slides are plans only and subject to change. "Polaris", "Vega", "Radeon Vega", "Navi", "Zen" and "Naples" are codenames for AMD architectures, and are not product names. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD's products are as set forth in a signed agreement between the parties or in AMD's Standard Terms and Conditions of Sale. GD-18

©2019 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, , Threadripper, EPYC, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

83 | AMD RADEON INSTINCT | SUPER COMPUTING 2019 Oscar Hernandez, Matt Baker UCX @ ORNL ORNL

• UCX has been deployed at ORNL:
  – Summit: POWER9 + NVIDIA
  – DGX-2 systems: x86 + NVIDIA
  – Visualization clusters (Rhea)
  – Wombat: Arm + NVIDIA

• Performance-portable "communication API" between a "diverse" set of architectures

• Helping co-design next-generation programming models: RDMA/OpenSHMEM, Python/Dask for HPC, accelerator-based
• Helps us evaluate new systems "very fast" and efficiently

84 NVIDIA + Arm evaluation on "Wombat" (NCCS) used UCX

Applications: CoMet (comparative genomics), LAMMPS (molecular dynamics), NAMD (molecular dynamics), VMD (MD visualization), DCA++ (material science), Gromacs (molecular dynamics), Gamera (earthquake simulator), LSMS (material science)

Benchmarks & mini-apps: BabelStream (memory transfer), TeaLeaf (heat conduction), CloverLeaf (Lagrangian-Eulerian hydrodynamics), MiniSweep (radiation transport), SNAP (radiation transport), Patatrack (pixel reconstruction)

Parallel programming models & scientific libraries: Kokkos (C++ programming model), Open MPI (distributed programming model), CUDA / CUDA Fortran (GPU programming), UCX (communication framework), Magma and SLATE (scientific libraries)

Hardware: HPE Apollo 70 preproduction nodes; CPU: Arm ThunderX2, 2 sockets, 28 cores/socket, 4 threads/core, 2.0 GHz, 256 GB RAM; GPUs: NVIDIA Volta GV100 (2 per node) with 32 GB HBM2 each; 4 nodes with NVIDIA GPUs; EDR InfiniBand

Software: RHEL 8; CUDA 10.2.107 (aka "Drop 2") and PGI "dev version" installed Nov 7 (most user results are with 10.2.91, "Drop 1"); GCC 8.2.1 is the default compiler, armclang 19.3 available; Open MPI 3.1.4 and 4.0.2rc3; UCX 1.7.0

Evaluation: https://www.olcf.ornl.gov/wp-content/uploads/2019/11/ARM_NVIDIA_APP_EVALUATION-1.pdf

85 DASK + UCX on DGX-1: Helping converge HPC with data science, AI, and big data
• Science missions love Python
  • Easy to learn, free; NumPy is similar to MATLAB
• RAPIDS
  • Open-source data science Python libraries
  • Single thread, multi-accelerator
• DASK
  • Distributed tasking framework for Python
• UCX
  • Client-server model, Python bindings, efficient RDMA, etc.

Source: Matthew Rocklin, Rick Zamora, Experiments in High Performance Networking with UCX and DGX https://blog.dask.org/2019/06/09/ucx-dgx

86 NVIDIA UCX UPDATE Akshay Venkatesh, Sreeram Potluri, CJ Newburn ENABLING HPC AND DATA SCIENCES Providing a Common CUDA-aware Runtime

• MPI • OpenMPI • MPICH • Parastation MPI

• Dask • RAPIDS / CuML

88 89 DASK AND RAPIDS

• RAPIDS uses dask-distributed for data distribution over python sockets • Communication of python objects backed by GPU buffers needed • Critical to leverage IB, NVLINK, Sockets

[Diagram: Python / UCX-Py layered over Dask, RAPIDS (cuDF, cuML, cuGraph), deep learning frameworks (cuDNN), UCX, CUDA, and Apache Arrow]

NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE. 90 FEATURES IN UCX-PY AND UCX

• Python interface • Coroutine support • CUDA-array interface to move device memory-backed objects • Interoperability with coroutines • blocking progress with CUDA transport • Client-server API • support with sockets and IB

91 DEVICE MEMORY BANDWIDTH Cupy bandwidth between 2 Summit nodes

[Chart: CuPy bandwidth (GB/s) between two Summit nodes vs. message size (10 MB-50 MB), cupy vs. native; values range from about 7.3 to 10.7 GB/s]

NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE. 92 SUPPORT FOR CUDA_VISIBLE_DEVICES Leveraging P2P and NVLink

• Job managers like SLURM carve out GPUs within a node for ranks using CUDA_VISIBLE_DEVICES
• Also used by task schedulers like DASK
• CUDA 10.1 enabled CUDA IPC between devices in different visibility domains
• UCX now leverages this feature (a sketch follows below)

[Diagram: rank 0 (CVD=0) on CPU0/GPU0 and rank 1 (CVD=1) on CPU1/GPU1 within NODE0; the two GPUs are IPC-accessible to each other]
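For context, a short sketch of the underlying CUDA IPC mechanism: one process exports a device allocation with cudaIpcGetMemHandle() and a peer (typically running with a different CUDA_VISIBLE_DEVICES setting) maps it with cudaIpcOpenMemHandle(). How UCX wires this into its transports is not shown, and the out-of-band handle exchange is left abstract.

```c
/* Illustrative sketch of CUDA IPC between two processes. */
#include <cuda_runtime.h>
#include <stddef.h>

/* Exporter: allocate device memory and produce an IPC handle for it. */
void export_buffer(size_t len, void **dev_ptr, cudaIpcMemHandle_t *handle)
{
    cudaMalloc(dev_ptr, len);
    cudaIpcGetMemHandle(handle, *dev_ptr);
    /* ...send 'handle' to the peer process out of band (MPI, pipe, shm)... */
}

/* Importer: map the peer's allocation and pull it into a local device buffer. */
void import_and_copy(const cudaIpcMemHandle_t *handle, void *local_dev,
                     size_t len)
{
    void *peer_ptr;
    cudaIpcOpenMemHandle(&peer_ptr, *handle, cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(local_dev, peer_ptr, len, cudaMemcpyDeviceToDevice);
    cudaIpcCloseMemHandle(peer_ptr);
}
```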

93 PERFORMANCE ACROSS PLATFORMS: 3-stage Pipelining
• Increasing use on high-end as well as low-end servers
• GPUDirect RDMA is not performant on all platforms
  • Limited PCIe P2P performance on most CPUs
  • Efficient staging to host is required
• Need for broader platform support
• Addressed in UCX master (Mellanox contribution)

[Diagrams: two nodes with CPU+GPU each; the current rendezvous pipeline needs GDR, while the 3-stage pipeline does not]

94 3-STAGE PIPELINE PERFORMANCE

[Charts: bandwidth (MB/s) vs. message size (8K-4M) for D->D with the 3-stage pipeline, with GDR, and without the 3-stage pipeline, in two topologies: GPU and NIC connected by a PCIe switch, and GPU and NIC connected to the CPU; V100, ConnectX-6, Sandy Bridge]

95 FUTURE FEATURES

• Topology aware NIC and threshold selection • Extend 3-stage pipeline for intra-node cases and managed memory • Optimizations for cloud deployments • GPU-support with UCX Java-bindings

96 EFFECT OF GPU-HCA AFFINITY DGX-1, V100, CX-5: Inter-node osu_mbw_mr

[Chart: osu_mbw_mr bandwidth (MB/s) vs. message size (8K-2M) for 1 pair and 4 pairs of ranks, with and without GPU-HCA affinity]

97 FUTURE DIRECTIONS

• Open source reference implementation of CUDA-aware runtime • Enabling HPC and Data Science libraries/platforms • Optimize across architectures in bare metal and cloud

98 THANK YOU

99 Enhancing MPI Communication using Hardware Tag Matching: The MVAPICH Approach

Talk at UCX BoF (SC ‘19) by

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Introduction, Motivation, and Challenge
• HPC applications require high-performance, low-overhead data paths that provide
  – Low latency
  – High bandwidth
  – High message rate
  – Good overlap of computation with communication
• Hardware-offloaded tag matching
• Can we exploit the tag matching support in UCX within existing HPC middleware to extract peak performance and overlap?

Network Based Computing Laboratory (UCX BoF) SC'19

Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002 (SC '02)
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 3,050 organizations in 89 countries
  – More than 615,000 (> 0.6 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '19 ranking):
    • 3rd: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
    • 5th: 448,448 cores (Frontera) at TACC
    • 8th: 391,680 cores (ABCI) in Japan
    • 14th: 570,020 cores (Neurion) in South Korea, and many others
  – Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, and OpenHPC)
  – http://mvapich.cse.ohio-state.edu
• Partner in the 5th-ranked TACC Frontera system
• Empowering Top500 systems for over a decade

– http://mvapich.cse.ohio-state.edu Partner in the #5th TACC Frontera System • Empowering Top500 systems for over a decade Network Based Computing Laboratory (UCX BoF) SC’19 The MVAPICH Approach

High-Performance Parallel Programming Models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)

High-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy awareness, remote memory access, I/O and file systems, fault tolerance, active messages, job startup, virtualization, introspection & analysis

Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter), transport protocols (RC, XRC, UD, DC), and modern interconnect, HCA, and switch features (SR-IOV, multi-rail, UMR, ODP, tag matching, burst, poll, multicast, SHARP)

* Upcoming Network Based Computing Laboratory (UCX BoF) SC’19 Hardware Tag Matching Support

• Offloads the processing of point-to-point MPI messages from the host processor to the HCA
• Enables zero-copy MPI message transfers
  – Messages are written directly to the user's buffer without extra buffering and copies
• Provides rendezvous progress offload to the HCA
  – Increases the overlap of communication and computation (an illustrative pattern follows below)
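An illustrative (standard MPI, not MVAPICH-specific) pattern for the overlap that rendezvous offload targets: post large non-blocking transfers, compute, then complete them. With hardware tag matching the HCA can progress the rendezvous while the compute loop runs.

```c
/* Overlap pattern: non-blocking send/recv pair overlapped with computation. */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 22)   /* ~4M doubles per direction */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = malloc(N * sizeof(double));
    double *recvbuf = malloc(N * sizeof(double));
    int peer = rank ^ 1;              /* pair up neighbouring ranks */
    MPI_Request reqs[2];

    for (int i = 0; i < N; i++) sendbuf[i] = rank + i;

    if (peer < size) {
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    }

    /* Computation that overlaps with the rendezvous transfer */
    double acc = 0.0;
    for (int i = 0; i < N; i++) acc += sendbuf[i] * 0.5;

    if (peer < size) MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return (int)(acc < 0);   /* keep 'acc' observable */
}
```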

Network Based Computing Laboratory (UCX BoF) SC’19 104 Impact of Zero Copy MPI Message Passing using HW Tag Matching (Point-to-point)

[Charts: osu_latency (us) vs. message size, eager range (0-8K bytes) and rendezvous range (16K-4M bytes), MVAPICH2 vs. MVAPICH2+HW-TM; the eager chart shows a 35% improvement]

Removal of intermediate buffering/copies can lead to up to 35% improvement in the latency of medium messages on TACC Frontera

Network Based Computing Laboratory (UCX BoF) SC'19 105 Performance of MPI_Iscatterv using HW Tag Matching on Frontera

[Charts: MPI_Iscatterv latency (us) vs. message size (16K-1M bytes) on 128, 256, 512, and 1024 nodes, MVAPICH2 vs. MVAPICH2+HW-TM; speedups of 1.7x-1.8x]
• Up to 1.8x performance improvement
• Sustained benefits as system size increases

Network Based Computing Laboratory (UCX BoF) SC'19 Performance of MPI_Ialltoall using HW Tag Matching on Frontera

[Charts: MPI_Ialltoall latency (us) vs. message size (16K-1M bytes) on 8, 16, 32, and 64 nodes, MVAPICH2 vs. MVAPICH2+HW-TM; speedups of 1.5x-1.8x]
• Up to 1.8x performance improvement
• Sustained benefits as system size increases

Network Based Computing Laboratory (UCX BoF) SC'19 Overlap with MPI_Iscatterv using HW Tag Matching on Frontera

[Charts: communication/computation overlap (%) vs. message size (16K-1M bytes) on 128, 256, 512, and 1024 nodes, MVAPICH2 vs. MVAPICH2+HW-TM]
• Maximizes the overlap of communication and computation
• Sustained benefits as system size increases

Network Based Computing Laboratory (UCX BoF) SC'19 Overlap with MPI_Ialltoall using HW Tag Matching on Frontera

[Charts: communication/computation overlap (%) vs. message size (16K-1M bytes) on 8, 16, 32, and 64 nodes, MVAPICH2 vs. MVAPICH2+HW-TM]
• Maximizes the overlap of communication and computation
• Sustained benefits as system size increases

Network Based Computing Laboratory (UCX BoF) SC'19 Future Plans

• Complete designs are being worked out • Will be available in the future MVAPICH2 releases

Network Based Computing Laboratory (UCX BoF) SC'19 110 Multiple Events at SC '19
• Presentations at the OSU booth (#2094)
  – Members of the MVAPICH, HiBD, and HiDL teams
  – External speakers
• Presentations in the SC main program (tutorials, workshops, BoFs, posters, and doctoral showcase)
• Presentations at many other booths (Mellanox, Intel, Microsoft, and AWS) and satellite events
• Complete details available at http://mvapich.cse.ohio-state.edu/conference/752/talks/

Network Based Computing Laboratory (UCX BoF) SC'19

Thank You

The UCF Consortium is a collaboration between industry, laboratories, and academia to create production-grade communication frameworks and open standards for data-centric and high-performance applications.