Mellanox In-Network Computing for AI and the Development with NVIDIA (SHARP – NCCL)
Qingchun Song, Mellanox
July 2, 2019
Data Processing Revolution – Data Centric
▪ Compute-Centric: Von Neumann machine
▪ Data-Centric: dataflow machine

CPU-Centric HPC/AI Center
▪ Everything revolves around the CPU; network and storage hang off it

Data-Centric HPC/AI Center
▪ Workloads run over a communication framework (MPI)
▪ Functions are distributed across CPU, network, and storage
▪ In-CPU Computing, In-Network Computing, In-Storage Computing

In-Network Computing Architecture
▪ CPU-Centric (Onload): communication latencies of 30-40us
▪ Data-Centric (Offload): communication latencies of 3-4us

In-Network Computing to Enable Data-Centric Data Centers
▪ Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
▪ GPUDirect RDMA
▪ NVMe over Fabrics

In-Network Computing Connects the World's Fastest HPC and AI Supercomputer
▪ Summit CORAL system, the world's fastest HPC / AI system
▪ NVIDIA V100 GPUs + InfiniBand HCAs + In-Network Computing fabric

GPUDirect RDMA Technology and Advantages

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)

Accelerating HPC and AI Applications
Accelerating HPC applications:
▪ Significantly reduce MPI collective runtime
▪ Increase CPU availability and efficiency
▪ Enable communication and computation overlap
Enabling artificial intelligence solutions to perform critical and timely decision making:
▪ Accelerating distributed machine learning

AllReduce Example – Trees
▪ Many-to-one and one-to-many traffic patterns – possible network congestion
▪ Probably not a good solution for large data
▪ Large scale requires a higher tree / larger radix
▪ Result distribution – over the tree / multicast
[Figure: two-stage reduction tree built from switches and end nodes]

AllReduce Example – Recursive Doubling
▪ The data is recursively divided, processed by the CPUs, and distributed
▪ The ranks' CPUs are occupied performing the reduce algorithm
▪ The data is sent at least twice, consuming at least twice the bandwidth
[Figure: four ranks exchanging ½ and ¼ of the data over four steps, a calculation phase followed by a result-sending phase]
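To make these bullets concrete, here is a minimal host-side sketch of the recursive-doubling pattern in C with MPI. It shows the simplest full-buffer variant; the figure on the slide shows the halving/doubling refinement that splits the payload, but both variants share the properties the slide calls out: the host CPUs perform every reduction step and the data crosses the fabric multiple times. The buffer size, data type, and power-of-two rank assumption are illustrative choices, not taken from the deck.

```c
/* Minimal recursive-doubling allreduce sketch (sum of doubles).
 * Assumes the number of ranks is a power of two.
 * Build/run sketch: mpicc rd_allreduce.c -o rd_allreduce && mpirun -np 4 ./rd_allreduce */
#include <mpi.h>
#include <stdio.h>

static void allreduce_recursive_doubling(double *buf, double *tmp,
                                         int count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Step k: exchange the buffer with the rank whose id differs in bit k,
     * then reduce on the host CPU. After log2(size) steps every rank
     * holds the full sum. */
    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;
        MPI_Sendrecv(buf, count, MPI_DOUBLE, partner, 0,
                     tmp, count, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < count; i++)   /* reduction occupies the CPU */
            buf[i] += tmp[i];
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N = 4 };
    double buf[N], tmp[N];
    for (int i = 0; i < N; i++)
        buf[i] = rank + i;                /* per-rank input data */

    allreduce_recursive_doubling(buf, tmp, N, MPI_COMM_WORLD);

    if (rank == 0)
        printf("buf[0] after allreduce = %g\n", buf[0]);

    MPI_Finalize();
    return 0;
}
```

This is exactly the work SHARP moves off the hosts: the per-element additions and the repeated traversal of the fabric are handled in the switch ASICs instead of on the ranks' CPUs.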
Scalable Hierarchical Aggregation Protocol
Reliable, scalable, general-purpose primitive, applicable to multiple use cases:
▪ In-network, tree-based aggregation mechanism
▪ Large number of groups
▪ Multiple simultaneous outstanding operations
▪ Streaming aggregation
Accelerating HPC applications:
▪ Scalable high-performance collective offload
▪ Barrier, Reduce, All-Reduce, Broadcast
▪ Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
▪ Integer and floating-point, 16 / 32 / 64 bit
▪ Up to 1KB payload size (in Quantum)
▪ Significantly reduce MPI collective runtime
▪ Increase CPU availability and efficiency
▪ Enable communication and computation overlap
Accelerating machine learning applications:
▪ Prevents the many-to-one traffic pattern
▪ CUDA, GPUDirect RDMA
[Figure: a SHArP tree, with a tree root, aggregation nodes, and end nodes (processes running on HCAs)]

Scalable Hierarchical Aggregation Protocol
▪ The SHArP tree is a logical construct
▪ Nodes in the SHArP tree are IB end nodes
▪ The logical tree is defined on top of the physical underlying fabric
▪ SHArP tree links are implemented on top of the IB transport (Reliable Connection)
▪ Expected to follow the physical topology for performance, but not required
▪ SHARP operations are executed by a SHARP tree
▪ Multiple SHArP trees are supported
▪ Each SHArP tree can handle multiple outstanding SHArP operations
▪ Within a SHArP tree, each operation is uniquely identified by a SHArP tuple: GroupID and SequenceNumber
[Figure: one SHArP tree (root, aggregation nodes, end nodes) defined on top of the physical topology of switches/routers and HCAs]

SHARP Principles of Operation – Request Aggregation
[Figure: aggregation requests flow up the tree toward the SHArP tree root]

SHARP Principles of Operation – Response Aggregation
[Figure: the aggregated response is distributed from the SHArP tree root back down the tree]

NCCL Ring
▪ Simple
▪ Linear latency
▪ Supported in NCCL 2.3 and previous versions
▪ Multiple rings

NCCL Tree
▪ Support added in NCCL 2.4
▪ Keeps the intra-node chain
▪ Node leaders participate in the tree
▪ Binary double tree
▪ Multiple rings -> multiple trees
[Figure: NICs on each node connected through the network fabric]

NCCL SHARP
▪ Collective network plugin
▪ Replaces the inter-node tree with a SHARP tree
▪ Keeps the intra-node ring
▪ Aggregation in the network switch
▪ Streaming from GPU memory with GPUDirect RDMA
▪ 2x bandwidth compared to NCCL-Tree
[Chart: NCCL AllReduce bandwidth (GB/s) vs. message size (MB) over a single ring, comparing SHARP, Ring, and Tree – SHARP enables 2X higher data throughput for NCCL]
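For comparison with the bullets above, here is a minimal NCCL allreduce sketch in C, using MPI only to bootstrap the NCCL communicator. The point it illustrates is that the application call is identical whether NCCL runs its ring, its double binary tree, or offloads the reduction to a SHARP tree through the collective-network plugin; the selection happens below the ncclAllReduce API, typically through environment variables (for example NCCL_COLLNET_ENABLE in later NCCL releases). Treat that variable name, the one-GPU-per-rank mapping via cudaSetDevice(rank), and the build line as assumptions rather than details from the deck.

```c
/* One-GPU-per-rank NCCL allreduce, bootstrapped with MPI.
 * The same code path is used for NCCL Ring, NCCL Tree, and NCCL SHARP.
 * Build sketch: mpicc nccl_allreduce.c -lnccl -lcudart -o nccl_allreduce */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Rank 0 creates the NCCL id; everyone else receives it over MPI. */
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    /* Single-node simplification: one GPU per rank, indexed by rank. */
    cudaSetDevice(rank);
    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);

    const size_t count = 1 << 20;          /* 1M floats per rank */
    float *sendbuf, *recvbuf;
    cudaMalloc(&sendbuf, count * sizeof(float));
    cudaMalloc(&recvbuf, count * sizeof(float));
    cudaMemset(sendbuf, 0, count * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* Sum-allreduce straight from GPU memory; with GPUDirect RDMA and the
     * SHARP plugin, the reduction itself can run in the switches. */
    ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
    cudaStreamSynchronize(stream);

    if (rank == 0) printf("allreduce of %zu floats done\n", count);

    cudaFree(sendbuf);
    cudaFree(recvbuf);
    cudaStreamDestroy(stream);
    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
```

Because the offload is transparent at this level, frameworks built on NCCL (Horovod, TensorFlow, PyTorch) pick up the SHARP path without application changes, which is how the benchmark results on the following slides were obtained.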
SHARP AllReduce Performance Advantages (128 Nodes)
▪ SHARP enables a 75% reduction in latency
▪ Scalable Hierarchical Aggregation and Reduction Protocol, providing scalable, flat latency

SHARP AllReduce Performance Advantages
▪ 1500 nodes, 60K MPI ranks, Dragonfly+ topology
▪ SHARP enables the highest performance

SHARP Performance – Application (OSU)
▪ Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
▪ The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
▪ Source: Prof. DK Panda, Ohio State University

SHARP Performance Advantage for AI
▪ SHARP provides a 16% performance increase for deep learning (initial results)
▪ TensorFlow with Horovod running the ResNet50 benchmark, HDR InfiniBand (ConnectX-6, Quantum)
▪ P100 NVIDIA GPUs, RHEL 7.5, Mellanox OFED 4.4, HPC-X v2.3, TensorFlow v1.11, Horovod 0.15.0
[Chart: benchmark results showing 16% and 11% gains]

NCCL-SHARP Performance – DL Training (ResNet50)
▪ NCCL: 18,755 images/sec; NCCL-SHARP: 20,456 images/sec
▪ System configuration: (4) HPE Apollo 6500 systems configured with (8) NVIDIA Tesla V100 SXM2 16GB GPUs; (2) HPE DL360 Gen10 Intel Xeon-Gold 6134 (3.2 GHz / 8-core / 130 W) CPUs; (24) DDR4-2666 CAS-19-19-19 registered memory modules; HPE 1.6 TB NVMe SFF (2.5") SSD; ConnectX-6 HCA; InfiniBand Quantum switch (EDR speed); Ubuntu 16.04

Accelerating All Levels of HPC / AI Frameworks
Application:
▪ Data analysis
▪ Real time
▪ Deep learning
Communication:
▪ Mellanox SHARP In-Network Computing
▪ MPI tag matching
▪ MPI rendezvous
▪ Software-defined virtual devices
Network:
▪ Network transport offload
▪ RDMA and GPUDirect RDMA
▪ SHIELD (self-healing network)
▪ Enhanced adaptive routing and congestion control
Connectivity:
▪ Multi-Host technology
▪ Socket-Direct technology
▪ Enhanced topologies

Thank You

© 2019 Mellanox Technologies