Mellanox In-Network Computing for AI and Its Development with SHARP and NCCL
Qingchun Song, Mellanox
July 2, 2019

Data Processing Revolution – Data Centric

Compute-Centric: Von Neumann Machine
Data-Centric: DataFlow Machine

CPU-Centric HPC/AI Center

Diagram: the CPU handles everything, with the network and storage attached to it.

Data-Centric HPC/AI Center

Diagram: workloads run over a communication framework (MPI); functions are distributed across the CPU, network, and storage: In-CPU Computing, In-Network Computing, In-Storage Computing.

In-Network Computing Architecture

CPU-Centric (Onload): communication latencies of 30-40us
Data-Centric (Offload): communication latencies of 3-4us

In-Network Computing to Enable Data-Centric Data Centers

Diagram: CPUs and GPUs connected by an In-Network Computing fabric featuring the Scalable Hierarchical Aggregation and Reduction Protocol, GPUDirect RDMA, and NVMe over Fabrics.

In-Network Computing Connects the World’s Fastest HPC and AI

CORAL System, World’s Fastest HPC / AI System
▪ NVIDIA V100 GPUs + InfiniBand HCAs + In-Network Computing Fabric

GPUDirect RDMA Technology and Advantages
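The slide's diagram did not survive extraction, so as an illustration only: with a CUDA-aware MPI build, the application passes device pointers straight to MPI calls, and GPUDirect RDMA lets the HCA read and write GPU memory without staging through host buffers. A minimal sketch, assuming a CUDA-aware MPI (such as the Open MPI shipped with HPC-X) and exactly two ranks; it is not the deck's own example.

// Sketch: GPU device buffers passed directly to MPI. With CUDA-aware MPI
// and GPUDirect RDMA, the NIC accesses GPU memory directly, avoiding an
// extra copy through host memory.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float* dev_buf = nullptr;
    cudaMalloc(&dev_buf, n * sizeof(float));        // device memory, not host
    cudaMemset(dev_buf, 0, n * sizeof(float));

    if (rank == 0)
        MPI_Send(dev_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(dev_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(dev_buf);
    MPI_Finalize();
    return 0;
}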

Scalable Hierarchical Aggregation And Reduction Protocol (SHARP)

Accelerating HPC and AI Applications

Accelerating HPC Applications
▪ Significantly reduce MPI collective runtime
▪ Increase CPU availability and efficiency
▪ Enable communication and computation overlap

Enabling Artificial Intelligence Solutions to Perform Critical and Timely Decision Making
▪ Accelerating distributed machine learning
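The communication/computation overlap bullet above can be pictured with a standard non-blocking collective; a minimal sketch using MPI_Iallreduce (generic MPI, not SHARP-specific), where compute_next_batch is a hypothetical placeholder for application work.

// Overlapping an allreduce with local computation via the non-blocking
// MPI_Iallreduce. With collective offload (e.g. SHARP), the reduction
// progresses in the network while the CPU keeps computing.
#include <mpi.h>
#include <vector>

static void compute_next_batch() { /* hypothetical application work */ }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    std::vector<double> grad(1024, 1.0), sum(1024, 0.0);

    MPI_Request req;
    MPI_Iallreduce(grad.data(), sum.data(), 1024, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    compute_next_batch();               // computation overlaps the reduction
    MPI_Wait(&req, MPI_STATUS_IGNORE);  // result now available in `sum`

    MPI_Finalize();
    return 0;
}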

AllReduce Example – Trees

▪ Many2One and One2Many traffic patterns – possible network congestion
▪ Probably not a good solution for large data
▪ Large scale requires a deeper tree / larger radix
▪ Result distribution – over the tree or via multicast (MC)

Diagram: two-stage reduction tree of switches and end nodes (Stage 1, Stage 2).

AllReduce (Example) - Recursive Doubling

▪ The data is recursively divided, processed by the CPUs, and distributed
▪ The ranks’ CPUs are occupied performing the reduction algorithm
▪ The data is sent at least twice, consuming at least 2x the bandwidth

Diagram (Ranks 1-4): calculation phase – Step 1 exchanges ½ of the data, Step 2 exchanges ¼; result-sending phase – Step 3 sends ¼ of the data, Step 4 sends ½.
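The "sent at least twice" point follows from the textbook per-rank traffic count for recursive halving/doubling (a sketch using the standard analysis, not a figure from the deck): with p ranks and n bytes of data per rank,

\[
\underbrace{\tfrac{n}{2} + \tfrac{n}{4} + \dots + \tfrac{n}{p}}_{\text{calculation phase}}
\;+\;
\underbrace{\tfrac{n}{p} + \dots + \tfrac{n}{4} + \tfrac{n}{2}}_{\text{result-sending phase}}
\;=\; 2n\left(1 - \tfrac{1}{p}\right) \;\approx\; 2n ,
\]

so each rank moves roughly twice the original data volume over the wire, and the ranks' CPUs perform all of the arithmetic.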

Scalable Hierarchical Aggregation Protocol

Reliable Scalable General Purpose Primitive, Applicable to Multiple Use-cases
▪ In-network tree-based aggregation mechanism
▪ Large number of groups
▪ Multiple simultaneous outstanding operations
▪ Streaming aggregation

Accelerating HPC applications
▪ Scalable High Performance Collective Offload
▪ Barrier, Reduce, All-Reduce, Broadcast
▪ Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
▪ Integer and Floating-Point, 16 / 32 / 64 bit
▪ Up to 1KB payload size (in Quantum)
▪ Significantly reduce MPI collective runtime
▪ Increase CPU availability and efficiency
▪ Enable communication and computation overlap

Accelerating Machine Learning applications
▪ Prevents the many-to-one traffic pattern
▪ CUDA, GPUDirect RDMA

(Diagram legend: SHArP Tree, SHArP Tree Root, SHArP Tree Aggregation Node (process running on HCA), SHArP Tree Endnode (process running on HCA))

Scalable Hierarchical Aggregation Protocol

▪ SHARP Tree is a Logical Construct
▪ Nodes in the SHArP Tree are IB Endnodes
▪ Logical tree defined on top of the underlying physical fabric
▪ SHArP Tree links are implemented on top of the IB transport (Reliable Connection)
▪ Expected to follow the physical topology for performance, but not required
(Diagram legend: Switch/Router, HCA, SHArP Tree Root)

▪ SHARP Operations are executed by a SHARP Tree
▪ Multiple SHArP Trees are supported
▪ Each SHArP Tree can handle multiple outstanding SHArP Operations
▪ Within a SHArP Tree, each operation is uniquely identified by a SHArP-Tuple: GroupID and SequenceNumber
(Diagram: one SHArP Tree overlaid on the physical topology; legend: SHArP Tree Aggregation Node (process running on HCA), SHArP Tree Endnode (process running on HCA))
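The bookkeeping this implies can be pictured with a purely hypothetical sketch; the struct, field names, and map below are illustrative only and are not a Mellanox API.

// Hypothetical illustration: tracking outstanding SHARP operations per
// tree, keyed by the SHArP-Tuple (GroupID, SequenceNumber).
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

struct SharpTuple {
    uint32_t group_id;         // identifies the SHARP group
    uint32_t sequence_number;  // identifies the operation within the group
};

// Ordering so the tuple can key a std::map.
inline bool operator<(const SharpTuple& a, const SharpTuple& b) {
    return std::make_pair(a.group_id, a.sequence_number) <
           std::make_pair(b.group_id, b.sequence_number);
}

struct OutstandingOp {
    std::size_t expected_children;  // children whose partial results are pending
    std::size_t arrived;            // partial results received so far
};

// Each SHARP tree can track multiple outstanding operations at once.
using OutstandingOps = std::map<SharpTuple, OutstandingOp>;

int main() {
    OutstandingOps ops;
    ops[{/*group_id=*/7, /*sequence_number=*/42}] = {/*expected_children=*/4, /*arrived=*/0};
    return 0;
}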

SHARP Principles of Operation - Request

Diagram: aggregation requests travel from the endnodes up the tree toward the SHArP Tree Root.

SHARP Principles of Operation – Response

Diagram: the aggregation response is distributed from the SHArP Tree Root back down to the endnodes.
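From the application's point of view the offload is transparent: the collective call is unchanged. A minimal sketch of an allreduce that SHARP can accelerate when the job is launched under HPC-X; the launch line and the HCOLL_ENABLE_SHARP variable are assumptions based on common HPC-X usage and may differ between releases.

// Minimal MPI allreduce; SHARP offload (when enabled at launch time) is
// transparent to the application. Assumed launch line, for illustration:
//   mpirun -np 128 -x HCOLL_ENABLE_SHARP=3 ./allreduce_sharp
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Small payloads (up to ~1KB on Quantum) are the sweet spot for
    // SHARP's low-latency aggregation.
    std::vector<float> local(64, static_cast<float>(rank));
    std::vector<float> global(64, 0.0f);

    MPI_Allreduce(local.data(), global.data(), 64, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0) std::printf("global[0] = %f\n", global[0]);
    MPI_Finalize();
    return 0;
}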

NCCL Ring

▪ Simple
▪ Linear latency
▪ Supported in NCCL-2.3 and earlier versions
▪ Multiple rings

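The "linear latency" bullet reflects the standard ring-allreduce cost model (textbook analysis, not a figure from the deck): with p GPUs, n bytes, per-step latency \(\alpha\) and per-byte transfer time \(\beta\),

\[
T_{\text{ring}} \;=\; 2(p-1)\,\alpha \;+\; 2\,\frac{p-1}{p}\,n\,\beta .
\]

The bandwidth term is near-optimal, but the latency term grows linearly with the number of participants.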

NCCL Tree
▪ Support added in NCCL-2.4
▪ Keeps the intra-node chain
▪ Node leaders participate in the tree
▪ Double binary tree
▪ Multiple rings -> multiple trees

Diagram: node NICs connected through the network fabric.
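The double binary tree trades the ring's linear latency term for a logarithmic one; an approximate textbook-style model (an assumption, not a figure from the deck):

\[
T_{\text{tree}} \;\approx\; 2\log_2(p)\,\alpha \;+\; 2\,n\,\beta .
\]

Two complementary trees each carry half the data, and every node is an interior node in at most one of them, which keeps the bandwidth term close to the ring's while the latency term scales with \(\log_2(p)\).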

NCCL SHARP
Diagram: node NICs connected through the network fabric; inter-node aggregation is performed in the fabric.

NCCL SHARP

▪ Collective network plugin
▪ Replaces the inter-node tree with a SHARP tree
▪ Keeps the intra-node ring
▪ Aggregation in the network switch
▪ Streaming from GPU memory with GPUDirect RDMA
▪ 2x bandwidth compared to NCCL-TREE

Chart: NCCL Allreduce bandwidth (GB/s), one ring, for message sizes from 8 MB to 2048 MB, comparing SHARP, Ring, and Tree.

SHARP Enables 2X Higher Data Throughput for NCCL
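For reference, a minimal single-process multi-GPU NCCL allreduce; the application code is identical whether NCCL uses rings, trees, or the SHARP collective-network plugin. The environment variable mentioned in the comment (NCCL_COLLNET_ENABLE with the nccl-rdma-sharp-plugins package) is an assumption that depends on the plugin and NCCL version, not something stated in the deck.

// Single-process, multi-GPU NCCL allreduce. The SHARP path is typically
// selected via environment variables at launch (e.g. NCCL_COLLNET_ENABLE=1
// with the SHARP network plugin installed -- version-dependent assumption),
// not via API changes.
#include <nccl.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) return 0;

    std::vector<ncclComm_t> comms(ndev);
    std::vector<int> devs(ndev);
    for (int i = 0; i < ndev; ++i) devs[i] = i;
    ncclCommInitAll(comms.data(), ndev, devs.data());

    const size_t count = 1 << 20;
    std::vector<float*> sendbuf(ndev), recvbuf(ndev);
    std::vector<cudaStream_t> streams(ndev);
    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&sendbuf[i], count * sizeof(float));
        cudaMalloc(&recvbuf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // One allreduce per GPU, grouped so NCCL launches them together.
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(sendbuf[i]);
        cudaFree(recvbuf[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}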

SHARP AllReduce Performance Advantages (128 Nodes)

SHARP enables 75% reduction in latency
Scalable Hierarchical Aggregation and Reduction Protocol providing scalable, flat latency

SHARP AllReduce Performance Advantages (1500 Nodes, 60K MPI Ranks, Dragonfly+ Topology)

SHARP enables highest performance
Scalable Hierarchical Aggregation and Reduction Protocol

SHARP Performance – Application (OSU)

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project http://mvapich.cse.ohio-state.edu/

Source: Prof. DK Panda, Ohio State University

SHARP Performance Advantage for AI

▪ SHARP provides a 16% performance increase for deep learning (initial results)
▪ TensorFlow with Horovod running the ResNet50 benchmark, HDR InfiniBand (ConnectX-6, Quantum)

Chart: SHARP performance gains of 16% and 11%.
Setup: P100 NVIDIA GPUs, RH 7.5, Mellanox OFED 4.4, HPC-X v2.3, TensorFlow v1.11, Horovod 0.15.0

NCCL-SHARP Performance – DL Training

Chart: ResNet50 training throughput (images/sec): NCCL 18,755 vs. NCCL-SHARP 20,456.

System configuration: (4) HPE Apollo 6500 systems configured with (8) V100 SXM2 16GB, (2) HPE DL360 Gen10 Xeon-Gold 6134 (3.2 GHz/8-core/130 W) CPUs, (24) DDR4-2666 CAS-19-19-19 registered memory modules, HPE 1.6 TB NVMe SFF (2.5") SSD, ConnectX-6 HCA, IB Quantum switch (EDR speed), Ubuntu 16.04

Accelerating All Levels of HPC / AI Frameworks

Application
▪ Data Analysis
▪ Real Time
▪ Deep Learning

Communication
▪ Mellanox SHARP In-Network Computing
▪ MPI Tag Matching
▪ MPI Rendezvous
▪ Software Defined Virtual Devices

Network
▪ Network Transport Offload
▪ RDMA and GPU-Direct RDMA
▪ SHIELD (Self-Healing Network)
▪ Enhanced Adaptive Routing and Congestion Control

Connectivity
▪ Multi-Host Technology
▪ Socket-Direct Technology
▪ Enhanced Topologies

Thank You
