Mellanox In-Network Computing for AI and Its Development with NVIDIA (SHARP and NCCL)
Qingchun Song, Mellanox
July 2, 2019
Data Processing Revolution – Data-Centric
Compute-Centric: Von Neumann Machine
Data-Centric: DataFlow Machine
CPU-Centric HPC/AI Center
[Diagram: the CPU does everything; the network and storage are passive components]
Data-Centric HPC/AI Center
[Diagram: workloads run over a communication framework (MPI), with compute spread across CPU functions (In-CPU Computing), network functions (In-Network Computing), and storage functions (In-Storage Computing)]
In-Network Computing Architecture
CPU-Centric (Onload): communication latencies of 30-40 µs
Data-Centric (Offload): communication latencies of 3-4 µs
In-Network Computing to Enable Data-Centric Data Centers
▪ Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
▪ GPUDirect RDMA
▪ NVMe over Fabrics
In-Network Computing Connects the World’s Fastest HPC and AI Supercomputer
▪ Summit CORAL system, the world’s fastest HPC / AI system
▪ NVIDIA V100 GPUs + InfiniBand HCA + In-Network Computing fabric
GPUDirect RDMA Technology and Advantages
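With GPUDirect RDMA, a buffer allocated on the GPU can be registered with the HCA and used directly in RDMA operations, with no staging through host memory. Below is a minimal C sketch, assuming the NVIDIA peer-memory kernel module is loaded and a verbs protection domain (pd) has already been created by the usual setup; the helper name register_gpu_buffer is illustrative, not part of any API.

```c
/* Minimal GPUDirect RDMA sketch: register GPU memory with the HCA.
 * Assumes the nv_peer_mem / nvidia-peermem kernel module is loaded. */
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stdio.h>

struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t bytes, void **gpu_buf)
{
    /* Allocate device memory; with GPUDirect RDMA the HCA can DMA
     * directly to/from this buffer, bypassing host-memory staging. */
    if (cudaMalloc(gpu_buf, bytes) != cudaSuccess)
        return NULL;

    /* Register the GPU pointer like any host buffer; the peer-memory
     * module resolves it to GPU addresses the HCA can reach. */
    struct ibv_mr *mr = ibv_reg_mr(pd, *gpu_buf, bytes,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        fprintf(stderr, "ibv_reg_mr on GPU memory failed\n");
    return mr;  /* mr->lkey / mr->rkey are then used in RDMA work requests */
}
```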
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
Accelerating HPC and AI Applications
Accelerating HPC applications
▪ Significantly reduce MPI collective runtime
▪ Increase CPU availability and efficiency
▪ Enable communication and computation overlap (see the sketch below)

Enabling artificial intelligence solutions to perform critical and timely decision making
▪ Accelerating distributed machine learning
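The overlap point can be made concrete with standard MPI-3 nonblocking collectives. A minimal sketch follows; do_local_compute is a hypothetical placeholder for independent work. When the allreduce is offloaded to SHARP, the fabric performs the reduction while the CPU stays busy between starting the operation and waiting on it.

```c
#include <mpi.h>

static void do_local_compute(void) { /* hypothetical independent work */ }

void overlapped_allreduce(double *grad, double *sum, int n)
{
    MPI_Request req;
    /* Start the reduction; with SHARP offload the aggregation proceeds
     * in the fabric while the CPU keeps computing. */
    MPI_Iallreduce(grad, sum, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    do_local_compute();                  /* overlap window */

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* sum[] now holds the global result */
}
```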
AllReduce Example – Trees
▪ Many2One and One2Many traffic patterns – possible network congestion
▪ Probably not a good solution for large data
▪ Large scale requires a deeper tree / larger radix
▪ Result distribution – over the tree or via multicast
(A host-based tree sketch follows the diagram below.)
[Diagram: a two-stage reduction tree of switches and end nodes (Stage 1, Stage 2)]
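For reference, a host-based tree allreduce looks roughly like the C sketch below: reduce toward rank 0, then broadcast the result. Every transfer toward the root is the Many2One pattern the slide warns about, and the broadcast is the One2Many result distribution. This is a generic binomial tree, not Mellanox code.

```c
/* Two-phase tree allreduce: reduce-to-root, then broadcast. */
#include <mpi.h>
#include <stdlib.h>

void tree_allreduce(double *buf, int n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    double *tmp = malloc(n * sizeof(double));

    /* Reduce phase: at stage s, ranks with bit s set send to their
     * partner and leave the tree (Many2One toward rank 0). */
    for (int s = 1; s < size; s <<= 1) {
        if (rank & s) {
            MPI_Send(buf, n, MPI_DOUBLE, rank - s, 0, comm);
            break;
        } else if (rank + s < size) {
            MPI_Recv(tmp, n, MPI_DOUBLE, rank + s, 0, comm, MPI_STATUS_IGNORE);
            for (int i = 0; i < n; i++) buf[i] += tmp[i];  /* CPU reduces */
        }
    }
    /* One2Many result distribution; per the slide this can go over the
     * tree or via multicast. */
    MPI_Bcast(buf, n, MPI_DOUBLE, 0, comm);
    free(tmp);
}
```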
AllReduce Example – Recursive Doubling
▪ The data is recursively divided, processed by the CPUs, and redistributed
▪ The ranks’ CPUs are occupied performing the reduction
▪ The data is sent at least twice, consuming at least 2x the bandwidth
(A sketch of the algorithm follows the diagram below.)
[Diagram: four ranks. Calculation phase: Step 1 exchanges ½ of the data, Steps 2 and 3 exchange ¼ each. Result-sending phase: Step 4 distributes ½ of the data.]
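A minimal C sketch of recursive doubling, in the simpler full-buffer variant (the slide’s version halves the data each step, a reduce-scatter followed by an allgather, but the costs it highlights are the same: CPU cycles spent reducing and data crossing the wire more than once). Assumes a power-of-two number of ranks.

```c
/* Recursive-doubling allreduce: at step k each rank exchanges with the
 * partner rank XOR 2^k and reduces locally, so every rank holds the
 * global sum after log2(P) steps. */
#include <mpi.h>
#include <stdlib.h>

void recursive_doubling_allreduce(double *buf, int n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    double *tmp = malloc(n * sizeof(double));

    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;        /* pairwise exchange partner */
        MPI_Sendrecv(buf, n, MPI_DOUBLE, partner, 0,
                     tmp, n, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < n; i++)       /* the CPU does the reduction, */
            buf[i] += tmp[i];             /* which is the cost SHARP removes */
    }
    free(tmp);
}
```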
Scalable Hierarchical Aggregation Protocol
Reliable, scalable, general-purpose primitive applicable to multiple use cases
▪ In-network tree-based aggregation mechanism
▪ Large number of groups
▪ Multiple simultaneous outstanding operations
▪ Streaming aggregation
Accelerating HPC applications
▪ Scalable, high-performance collective offload
▪ Barrier, Reduce, All-Reduce, Broadcast
▪ Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
▪ Integer and floating-point, 16 / 32 / 64 bit
▪ Up to 1 KB payload size (in Quantum)
▪ Significantly reduce MPI collective runtime
▪ Increase CPU availability and efficiency
▪ Enable communication and computation overlap

Accelerating machine learning applications
▪ Proven for the many-to-one traffic pattern
▪ CUDA, GPUDirect RDMA

[Diagram: a SHArP tree with a root, aggregation nodes (processes running on HCAs), and endnodes (processes running on HCAs)]

Scalable Hierarchical Aggregation Protocol
▪ The SHArP tree is a logical construct
▪ Nodes in the SHArP tree are IB endnodes
▪ The logical tree is defined on top of the underlying physical fabric
▪ SHArP tree links are implemented on top of the IB transport (Reliable Connection)
▪ Expected to follow the physical topology for performance, but not required
▪ SHARP operations are executed by a SHARP tree
▪ Multiple SHArP trees are supported
▪ Each SHArP tree can handle multiple outstanding SHArP operations
▪ Within a SHArP tree, each operation is uniquely identified by a SHArP-Tuple (see the schematic below):
  ▪ GroupID
  ▪ SequenceNumber

[Diagram: physical topology of switches/routers and HCAs, with one SHArP tree overlaid: root, aggregation nodes (processes running on HCAs), and endnodes (processes running on HCAs)]
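To make the identification scheme concrete, here is a schematic C rendering of how an outstanding operation might be keyed. Only the GroupID / SequenceNumber pairing, the operation list, the datatypes, and the 1 KB Quantum payload limit come from the slides; the field names, widths, and layout are illustrative assumptions, not the SHARP wire format.

```c
/* Schematic only: field widths and layout are assumptions. */
#include <stdint.h>

typedef struct sharp_tuple {
    uint32_t group_id;         /* SHARP group the operation belongs to */
    uint32_t sequence_number;  /* ordinal of the operation within that group */
} sharp_tuple_t;

typedef struct sharp_op {
    sharp_tuple_t id;          /* unique per tree: (GroupID, SequenceNumber) */
    uint8_t  opcode;           /* e.g. sum/min/max, per the slide's op list */
    uint8_t  datatype;         /* integer or float, 16/32/64 bit */
    uint16_t payload_len;      /* up to 1 KB in Quantum, per the slide */
} sharp_op_t;
```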
SHARP Principles of Operation – Request
[Diagram: aggregation requests flow up the SHArP tree from the endnodes toward the root]
SHARP Principles of Operation – Response
[Diagram: the aggregation response is distributed from the root back down the SHArP tree to the endnodes]
NCCL Ring
▪ Simple
▪ Linear latency
▪ Supported in NCCL 2.3 and earlier versions
▪ Multiple rings
[Diagram: a ring of GPUs spanning two switches]
NCCL Tree
▪ Support added in NCCL 2.4
▪ Keeps the intra-node chain
▪ Node leaders participate in the tree
▪ Binary double tree
▪ Multiple rings -> multiple trees
(An allreduce example follows the diagram below.)
[Diagram: node leaders’ NICs connected through the network fabric]
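Ring vs. tree is transparent to callers; the same ncclAllReduce invocation is used either way. Below is a minimal single-process, multi-GPU sketch using the public NCCL API (device count and buffer size are arbitrary; error checking is omitted for brevity). In NCCL 2.4 the ring/tree choice could be steered with the NCCL_TREE_THRESHOLD environment variable.

```c
#include <nccl.h>
#include <cuda_runtime.h>

#define NGPUS 4
#define COUNT (1 << 20)

int main(void)
{
    int devs[NGPUS] = {0, 1, 2, 3};
    ncclComm_t comms[NGPUS];
    float *sendbuf[NGPUS], *recvbuf[NGPUS];
    cudaStream_t streams[NGPUS];

    /* One communicator per GPU in this single process. */
    ncclCommInitAll(comms, NGPUS, devs);
    for (int i = 0; i < NGPUS; i++) {
        cudaSetDevice(i);
        cudaMalloc((void **)&sendbuf[i], COUNT * sizeof(float));
        cudaMalloc((void **)&recvbuf[i], COUNT * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    /* Launches must be grouped so NCCL can schedule them together. */
    ncclGroupStart();
    for (int i = 0; i < NGPUS; i++)
        ncclAllReduce(sendbuf[i], recvbuf[i], COUNT, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < NGPUS; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(sendbuf[i]);
        cudaFree(recvbuf[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```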
NCCL SHARP

[Diagram: NICs aggregated through the network fabric via a SHARP tree]
NCCL SHARP
▪ Collective network plugin
▪ Replaces the inter-node tree with a SHARP tree
▪ Keeps the intra-node ring
▪ Aggregation in the network switch
▪ Streaming from GPU memory with GPUDirect RDMA
▪ 2x bandwidth compared to NCCL-Tree

[Chart: NCCL AllReduce bandwidth with one ring, SHARP vs. Ring vs. Tree; bandwidth (GB/s, 0–25) across message sizes from 8 MB to 2048 MB]

SHARP enables 2x higher data throughput for NCCL
SHARP AllReduce Performance Advantages (128 Nodes)
SHARP enables a 75% reduction in latency
Scalable Hierarchical Aggregation and Reduction Protocol, providing scalable, flat latency
SHARP AllReduce Performance Advantages (1500 Nodes, 60K MPI Ranks, Dragonfly+ Topology)
Scalable Hierarchical Aggregation and Reduction Protocol – SHARP enables the highest performance
SHARP Performance – Application (OSU)
Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project http://mvapich.cse.ohio-state.edu/
Source: Prof. DK Panda, Ohio State University
SHARP Performance Advantage for AI
▪ SHARP provides a 16% performance increase for deep learning (initial results)
▪ TensorFlow with Horovod running the ResNet-50 benchmark, HDR InfiniBand (ConnectX-6, Quantum)
[Chart: performance gains of 16% and 11% across the measured configurations]
P100 NVIDIA GPUs, RH 7.5, Mellanox OFED 4.4, HPC-X v2.3, TensorFlow v1.11, Horovod 0.15.0
NCCL-SHARP Performance – DL Training
[Chart: ResNet-50 training throughput (images/sec): NCCL 18755 vs. NCCL-SHARP 20456]
System configuration: (4) HPE Apollo 6500 systems configured with (8) NVIDIA Tesla V100 SXM2 16 GB GPUs, (2) HPE DL360 Gen10 Intel Xeon-Gold 6134 (3.2 GHz/8-core/130 W) CPUs, (24) DDR4-2666 CAS-19-19-19 registered memory modules, HPE 1.6 TB NVMe SFF (2.5") SSD, ConnectX-6 HCA, InfiniBand Quantum switch (EDR speed), Ubuntu 16.04

Accelerating All Levels of HPC / AI Frameworks
Application
▪ Data analysis
▪ Real time
▪ Deep learning

Communication
▪ Mellanox SHARP In-Network Computing
▪ MPI tag matching
▪ MPI rendezvous
▪ Software-defined virtual devices

Network
▪ Network transport offload
▪ RDMA and GPUDirect RDMA
▪ SHIELD (Self-Healing Network)
▪ Enhanced adaptive routing and congestion control

Connectivity
▪ Multi-Host technology
▪ Socket-Direct technology
▪ Enhanced topologies
Thank You