Mellanox In-Network Computing for AI and the Development with NVIDIA (SHARP – NCCL)
Qingchun Song, Mellanox
July 2, 2019

Data Processing Revolution – Data-Centric
▪ Compute-centric: the Von Neumann machine
▪ Data-centric: the dataflow machine

CPU-Centric HPC/AI Center
▪ Everything revolves around the CPU, with the network and storage attached as peripherals

Data-Centric HPC/AI Center
▪ Workloads communicate through a framework (MPI) that spans CPU, network, and storage functions
▪ In-CPU Computing, In-Network Computing, In-Storage Computing

In-Network Computing Architecture
▪ CPU-centric (onload): communication latencies of 30-40 us
▪ Data-centric (offload): communication latencies of 3-4 us

In-Network Computing to Enable Data-Centric Data Centers
▪ Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
▪ GPUDirect RDMA
▪ NVMe over Fabrics

In-Network Computing Connects the World's Fastest HPC and AI Supercomputer
▪ Summit CORAL system, the world's fastest HPC / AI system
▪ NVIDIA V100 GPUs + InfiniBand HCAs + In-Network Computing fabric

GPUDirect RDMA Technology and Advantages

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)

Accelerating HPC and AI Applications
Accelerating HPC applications:
▪ Significantly reduce MPI collective runtime
▪ Increase CPU availability and efficiency
▪ Enable communication and computation overlap
Enabling artificial intelligence solutions to perform critical and timely decision making:
▪ Accelerating distributed machine learning

AllReduce Example – Trees
▪ Many-to-one and one-to-many traffic patterns create possible network congestion
▪ Probably not a good solution for large data
▪ Large scale requires a taller tree / larger radix
▪ Result distribution over the tree or via multicast (MC)
[Diagram: two-stage reduction tree of switches and end nodes]

AllReduce Example – Recursive Doubling
▪ The data is recursively divided, processed by the CPUs, and distributed
▪ Each rank's CPUs are occupied performing the reduce algorithm
▪ The data is sent at least twice, consuming at least twice the bandwidth
[Diagram: four ranks; the calculation phase sends 1/2 then 1/4 of the data, the result-sending phase sends 1/4 then 1/2]
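The pattern above is worth making concrete. Below is a minimal sketch of a halving/doubling AllReduce of this shape, written against plain MPI point-to-point calls; it assumes a power-of-two communicator size and an element count divisible by it, and the function name is mine, not part of any library. It exhibits both costs the slide calls out: the host CPU performs every addition, and each element crosses the wire twice (once up in the reduce-scatter, once back in the allgather).

```cpp
// Sketch of a recursive halving/doubling allreduce (sum of doubles).
// Assumes: power-of-two number of ranks, buf.size() divisible by it.
#include <mpi.h>
#include <vector>

void allreduce_halving_doubling(std::vector<double>& buf, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    const int n = (int)buf.size();
    std::vector<double> tmp(n);

    // Calculation phase: reduce-scatter by recursive halving.
    // Each step exchanges half of the remaining data (1/2, then 1/4, ...),
    // and the received half is summed locally -- the rank's CPU does the math.
    int offset = 0, count = n;
    for (int dist = size / 2; dist >= 1; dist /= 2) {
        int partner = rank ^ dist;
        count /= 2;
        int send_off = (rank < partner) ? offset + count : offset;  // half we give away
        int keep_off = (rank < partner) ? offset : offset + count;  // half we reduce
        MPI_Sendrecv(&buf[send_off], count, MPI_DOUBLE, partner, 0,
                     tmp.data(), count, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < count; ++i)
            buf[keep_off + i] += tmp[i];        // reduction occupies the CPU
        offset = keep_off;
    }

    // Result-sending phase: allgather by recursive doubling.
    // The same volume crosses the wire again (1/4, then 1/2, ...),
    // so every element is transmitted at least twice overall.
    for (int dist = 1; dist < size; dist *= 2) {
        int partner = rank ^ dist;
        int recv_off = (rank < partner) ? offset + count : offset - count;
        MPI_Sendrecv(&buf[offset], count, MPI_DOUBLE, partner, 0,
                     &buf[recv_off], count, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        if (recv_off < offset) offset = recv_off;
        count *= 2;
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    std::vector<double> buf(1 << 20, rank + 1.0);   // 1M doubles per rank
    allreduce_halving_doubling(buf, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```

SHARP removes both costs at once: the switch performs the additions, and each element leaves the host only once.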
Scalable Hierarchical Aggregation Protocol
A reliable, scalable, general-purpose primitive, applicable to multiple use cases:
▪ In-network tree-based aggregation mechanism
▪ Large number of groups
▪ Multiple simultaneous outstanding operations
▪ Streaming aggregation
Accelerating HPC applications:
▪ Scalable, high-performance collective offload
▪ Barrier, Reduce, All-Reduce, Broadcast
▪ Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
▪ Integer and floating-point, 16 / 32 / 64 bit
▪ Up to 1KB payload size (in Quantum)
▪ Significantly reduce MPI collective runtime
▪ Increase CPU availability and efficiency
▪ Enable communication and computation overlap
Accelerating machine learning applications:
▪ Proven for the many-to-one traffic pattern
▪ CUDA, GPUDirect RDMA
[Diagram: a SHArP tree with a root, aggregation nodes, and end nodes (processes running on HCAs)]

Scalable Hierarchical Aggregation Protocol (cont.)
▪ A SHARP tree is a logical construct
▪ Nodes in the SHArP tree are InfiniBand end nodes
▪ The logical tree is defined on top of the physical underlying fabric
▪ SHArP tree links are implemented on top of the InfiniBand transport (Reliable Connection)
▪ The tree is expected to follow the physical topology for performance, but this is not required
▪ SHARP operations are executed by a SHARP tree
▪ Multiple SHArP trees are supported
▪ Each SHArP tree can handle multiple outstanding SHArP operations
▪ Within a SHArP tree, each operation is uniquely identified by a SHArP tuple: GroupID and SequenceNumber
[Diagram: one SHArP tree mapped onto the physical topology of switches/routers and HCAs]

SHARP Principles of Operation – Request
▪ Aggregation requests travel up the SHArP tree toward the root

SHARP Principles of Operation – Response
▪ The aggregated response is distributed from the root back down the tree

NCCL Ring
▪ Simple
▪ Linear latency
▪ Supported in NCCL 2.3 and earlier versions
▪ Multiple rings

NCCL Tree
▪ Support added in NCCL 2.4
▪ Keeps the intra-node chain
▪ Node leaders participate in the tree
▪ Binary double tree
▪ Multiple rings map to multiple trees

NCCL SHARP
▪ Collective network plugin
▪ Replaces the inter-node tree with a SHARP tree
▪ Keeps the intra-node ring
▪ Aggregation happens in the network switch
▪ Streaming from GPU memory with GPUDirect RDMA
▪ 2x bandwidth compared to NCCL-TREE
[Chart: NCCL AllReduce bandwidth (GB/s), one ring, message sizes 8 MB to 2048 MB – SHARP vs. Ring vs. Tree]
SHARP enables 2x higher data throughput for NCCL
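To see where the plugin slots in, here is a minimal sketch of an MPI-bootstrapped ncclAllReduce, following the pattern in the NCCL documentation. The collective call itself is unchanged; whether the inter-node step runs over rings, trees, or SHARP is a runtime decision. Setting NCCL_COLLNET_ENABLE=1 with the Mellanox nccl-rdma-sharp plugin (libnccl-net.so) on the library path is how the CollNet/SHARP path is typically enabled, but exact deployment details depend on the installed HPC-X and plugin versions, so treat the launch line below as illustrative.

```cpp
// Sketch: one GPU per MPI rank, in-place float-sum allreduce with NCCL.
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Assumes ranks are placed round-robin across each node's GPUs.
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);

    // Rank 0 creates the NCCL id; all ranks receive it over MPI.
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    ncclComm_t comm;
    ncclCommInitRank(&comm, size, id, rank);

    const size_t count = 1 << 24;             // 16M floats (64 MB)
    float* buf = nullptr;
    cudaMalloc(&buf, count * sizeof(float));
    cudaMemset(buf, 0, count * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // With the SHARP plugin active, the inter-node reduction of this call
    // is aggregated in the switch and streamed straight from GPU memory
    // via GPUDirect RDMA; the intra-node ring is unchanged.
    ncclAllReduce(buf, buf, count, ncclFloat, ncclSum, comm, stream);
    cudaStreamSynchronize(stream);

    cudaFree(buf);
    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
```

An illustrative Open MPI launch: mpirun -np 16 -x LD_LIBRARY_PATH -x NCCL_COLLNET_ENABLE=1 ./nccl_allreduce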
SHARP AllReduce Performance Advantages (128 Nodes)
▪ SHARP enables a 75% reduction in latency
▪ Scalable Hierarchical Aggregation and Reduction Protocol provides scalable, flat latency

SHARP AllReduce Performance Advantages (1500 Nodes, 60K MPI Ranks, Dragonfly+ Topology)
▪ SHARP enables the highest performance

SHARP Performance – Application (OSU)
▪ Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
▪ The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
▪ Source: Prof. DK Panda, Ohio State University

SHARP Performance Advantage for AI
▪ SHARP provides a 16% performance increase for deep learning (initial results)
▪ TensorFlow with Horovod running the ResNet-50 benchmark over HDR InfiniBand (ConnectX-6, Quantum)
▪ Configuration: NVIDIA P100 GPUs, RHEL 7.5, Mellanox OFED 4.4, HPC-X v2.3, TensorFlow v1.11, Horovod 0.15.0
[Chart: bars showing 11% and 16% gains]

NCCL-SHARP Performance – DL Training (ResNet-50)
▪ NCCL: 18,755 images/sec; NCCL-SHARP: 20,456 images/sec
▪ System configuration: (4) HPE Apollo 6500 systems, each with (8) NVIDIA Tesla V100 SXM2 16GB GPUs; (2) HPE DL360 Gen10 Intel Xeon Gold 6134 (3.2 GHz / 8-core / 130 W) CPUs; (24) DDR4-2666 CAS-19-19-19 registered memory modules; HPE 1.6 TB NVMe SFF (2.5") SSD; ConnectX-6 HCA; InfiniBand Quantum switch (EDR speed); Ubuntu 16.04

Accelerating All Levels of HPC / AI Frameworks
▪ Application: data analysis, real time, deep learning
▪ Communication: Mellanox SHARP In-Network Computing, MPI tag matching, MPI rendezvous, software-defined virtual devices
▪ Network: network transport offload, RDMA and GPUDirect RDMA, SHIELD (self-healing network), enhanced adaptive routing and congestion control
▪ Connectivity: Multi-Host technology, Socket-Direct technology, enhanced topologies

Thank You
