NVSwitch and DGX-2
NVSWITCH AND DGX-2: NVLINK-SWITCHING CHIP AND SCALE-UP COMPUTE SERVER
Alex Ishii and Denis Foley, with Eric Anderson, Bill Dally, Glenn Dearth, Larry Dennison, Mark Hummel, and John Schafer

NVIDIA DGX-2 SERVER AND NVSWITCH
16 Tesla V100 32 GB GPUs:
- FP64: 125 TFLOPS
- FP32: 250 TFLOPS
- Tensor: 2000 TFLOPS
- 512 GB of GPU HBM2
Single-server chassis:
- 10U/19-inch rack mount
- 10 kW peak TDP
- Dual 24-core Xeon CPUs
- 1.5 TB DDR4 DRAM
- 30 TB NVMe storage
12-NVSwitch network:
- Full-bandwidth fat-tree topology
- 2.4 TBps bisection bandwidth
- Global shared memory
- Repeater-less
New NVSwitch chip:
- 18 second-generation NVLink ports
- 25 GBps per port
- 900 GBps total bidirectional bandwidth
- 450 GBps total throughput

OUTLINE
- The case for "One Gigantic GPU" and NVLink review
- NVSwitch: speeds and feeds, architecture, and system applications
- DGX-2 server: speeds and feeds, architecture, and packaging
- Signal-integrity design highlights
- Achieved performance

MULTI-CORE AND CUDA WITH ONE GPU
- Users explicitly express parallel work in NVIDIA CUDA
- The GPU Driver distributes work to the available Graphics Processing Cluster (GPC)/Streaming Multiprocessor (SM) cores
- GPC/SM cores can compute on data in any of the second-generation High Bandwidth Memories (HBM2s)
- GPC/SM cores use the shared HBM2s to exchange data
[Diagram: a single GPU whose GPCs, L2 caches, HBM2 stacks, copy engines, Hub, and NVLink ports are joined by the XBAR; the CPU sends work (data and CUDA kernels) and receives results over PCIe I/O.]

TWO GPUS WITH PCIE
- Access to the HBM2 of the other GPU is at PCIe bandwidth (32 GBps bidirectional)
- Interactions with the CPU compete with GPU-to-GPU traffic
- PCIe is the "Wild West" (lots of performance bandits)

TWO GPUS WITH NVLINK
- All GPCs can access all HBM2 memories
- Access to the HBM2 of the other GPU is at multi-NVLink bandwidth (300 GBps bidirectional in V100 GPUs)
- The NVLinks are effectively a "bridge" between the two XBARs
- No collisions with PCIe traffic

THE "ONE GIGANTIC GPU" IDEAL
- Highest number of GPUs possible
- A single GPU Driver process controls all work across all GPUs
- From the perspective of the GPCs, all HBM2s can be accessed without intervention by other processes (LD/ST instructions, Copy Engine RDMA, everything "just works")
- Access to all HBM2s is independent of PCIe
- Bandwidth across the bridged XBARs is as high as possible (some NUMA is unavoidable)
[Diagram: GPU0 through GPU15 joined by a single NVLink XBAR, with two CPUs attached over PCIe.]

"ONE GIGANTIC GPU" BENEFITS
Problem size capacity:
- Problem size is limited by the aggregate HBM2 capacity of the entire set of GPUs, rather than the capacity of a single GPU
Strong scaling:
- NUMA effects are greatly reduced compared to existing solutions
- Aggregate bandwidth to HBM2 grows with the number of GPUs
Ease of use:
- Apps written for a small number of GPUs port more easily
- Abundant resources enable rapid experimentation
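To make the "everything just works" point above concrete, here is a minimal CUDA sketch (not part of the original slides; the kernel name scaleRemote and the buffer sizes are illustrative) in which a kernel launched on GPU 0 scales a buffer that physically resides in GPU 1's HBM2, using only ordinary load/store instructions once peer access is enabled through the standard CUDA runtime calls.

    // Minimal CUDA sketch (illustrative): a kernel on GPU 0 directly loads and
    // stores a buffer resident in GPU 1's HBM2 once peer access is enabled.
    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel name; the body is ordinary CUDA code.
    __global__ void scaleRemote(float* remote, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) remote[i] *= factor;   // LD/ST into the peer GPU's memory
    }

    int main() {
        const int n = 1 << 20;
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, /*device=*/0, /*peerDevice=*/1);
        if (!canAccess) { printf("peer access 0 -> 1 not available\n"); return 1; }

        // Allocate the buffer in GPU 1's HBM2.
        float* buf = nullptr;
        cudaSetDevice(1);
        cudaMalloc(&buf, n * sizeof(float));

        // Map GPU 1's memory into GPU 0's address space, then launch on GPU 0.
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(/*peerDevice=*/1, 0);
        scaleRemote<<<(n + 255) / 256, 256>>>(buf, 2.0f, n);
        cudaDeviceSynchronize();

        cudaSetDevice(1);
        cudaFree(buf);
        return 0;
    }

Whether the resulting loads and stores travel over a direct NVLink, an NVSwitch fabric, or PCIe is determined by the platform; the programming model stays the same.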
INTRODUCING NVSWITCH
- Bidirectional bandwidth per NVLink: 51.5 GBps
- NRZ lane rate (x8 lanes per NVLink): 25.78125 Gbps
- Transistors: 2 billion
- Process: TSMC 12FFN
- Die size: 106 mm^2
- Bidirectional aggregate bandwidth: 928 GBps
- NVLink ports: 18
- Management port (config, maintenance, errors): PCIe
- LD/ST bandwidth efficiency (128 B packets): 80.0%
- Copy Engine bandwidth efficiency (256 B packets): 88.9%

NVSWITCH BLOCK DIAGRAM
- A GPU-XBAR-bridging device, not a general networking device
- Packet transforms make traffic to/from multiple GPUs look like it is to/from a single GPU
- The XBAR is non-blocking
- SRAM-based buffering
- NVLink IP blocks and the XBAR design/verification infrastructure are reused from V100
[Diagram: 18 NVLink ports (PHY, DL, TL), each with port logic for classification and routing, packet transforms, error checking and statistics collection, and transaction tracking, feeding an 18 x 18 XBAR; management is over PCIe I/O.]

NVLINK: PHYSICAL SHARED MEMORY
- Virtual-to-physical address translation is done in the GPCs
- NVLink packets carry physical addresses
- NVSwitch and DGX-2 follow the same model

NVSWITCH: RELIABILITY
- Hop-by-hop error checking was deemed sufficient in the intra-chassis operating environment
- Low BER (10^-12 or better) on the intra-chassis transmission media removes the need for high-overhead NVLink protections like FEC
- Management hooks for SW-based scrubbing of table corruptions and error conditions (orphaned packets)
- Hooks for partial reset and queue draining
- HW-based consistency checks guard against "escapes" from the hop-to-hop protection: ECC on SRAMs and in-flight packets, Data Link (DL) packet tracking and retry for external links, CRC on external packets, access-control checks for mis-configured packets (either config/app errors or corruption), and configurable buffer allocations
- SW is responsible for additional consistency checks and clean-up

NVSWITCH: DIE ASPECT RATIO
- Packet-processing and switching (XBAR) logic is extremely compact
- With more than 50% of the area taken up by I/O blocks, maximizing the perimeter-to-area ratio was key to an efficient design
- Having all NVLink ports exit the die on parallel paths simplified package-substrate routing
[Diagram: an elongated die with NVLink PHYs lining the edges and the non-NVLink logic in the center.]

SIMPLE SWITCH SYSTEM
Without NVSwitch:
- Connect the GPUs directly and aggregate NVLinks into gangs for higher bandwidth
- Traffic is interleaved over the links to prevent camping
- Maximum bandwidth between two GPUs is limited to the bandwidth of the gang
With NVSwitch:
- Traffic is interleaved across all the links to support full bandwidth between any pair of GPUs
- Traffic to a single GPU is non-blocking, so long as the aggregate bandwidth of its six NVLinks is not exceeded
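The full-bandwidth claim can be sanity-checked empirically. The CUDA sketch below (illustrative, not from the slides) times a large peer-to-peer copy between two GPUs with cudaMemcpyPeerAsync and prints the achieved rate; on an NVSwitch-connected system the transfer is interleaved across the gang of links as described above.

    // Minimal CUDA sketch (illustrative): time a 1 GiB peer copy from
    // GPU 0 to GPU 1 and report the achieved bandwidth.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 1ULL << 30;
        void *src = nullptr, *dst = nullptr;

        cudaSetDevice(0);
        cudaMalloc(&src, bytes);
        cudaDeviceEnablePeerAccess(1, 0);   // allow direct 0 -> 1 traffic

        cudaSetDevice(1);
        cudaMalloc(&dst, bytes);
        cudaDeviceEnablePeerAccess(0, 0);

        cudaSetDevice(0);
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // One warm-up copy, then a timed copy on the same stream.
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, stream);
        cudaEventRecord(start, stream);
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, stream);
        cudaEventRecord(stop, stream);
        cudaStreamSynchronize(stream);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("GPU0 -> GPU1: %.1f GB/s\n", (bytes / 1.0e9) / (ms / 1.0e3));

        cudaFree(src);
        cudaSetDevice(1);
        cudaFree(dst);
        return 0;
    }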
SWITCH-CONNECTED SYSTEMS
- Add NVSwitches in parallel to support more GPUs
- An eight-GPU closed system can be built with three NVSwitches, with two NVLinks from each GPU to each switch
- Gang width is still six; traffic is interleaved across all the GPU links
- GPUs can now communicate pairwise using the full 300 GBps bidirectional bandwidth between any pair
- The NVSwitch XBAR provides a unique path from any source to any destination: non-blocking and non-interfering

NON-BLOCKING BUILDING BLOCK
- Taking this to the limit: connect one NVLink from each GPU to each of six switches
- No routing between different switch planes is required
- Eight of the 18 NVLinks available per switch are used to connect to GPUs
- Ten NVLinks are available per switch for communication outside the local group (only eight are required to support full bandwidth)
- This is the GPU baseboard configuration for DGX-2

DGX-2 NVLINK FABRIC
- Two of these building blocks together form a fully connected 16-GPU cluster
- Non-blocking and non-interfering (unless the same destination is involved)
- Regular loads, stores, and atomics just work
- Left-right symmetry simplifies physical packaging and manufacturability

NVIDIA DGX-2: SPEEDS AND FEEDS
- Number of Tesla V100 GPUs: 16
- Aggregate FP64/FP32: 125/250 TFLOPS
- Aggregate Tensor (FP16): 2000 TFLOPS
- Aggregate shared HBM2: 512 GB
- Aggregate HBM2 bandwidth: 14.4 TBps
- Per-GPU NVLink bandwidth: 300 GBps bidirectional
- Chassis bisection bandwidth: 2.4 TBps
- CPUs: dual Xeon Platinum 8168
- CPU memory: 1.5 TB DDR4
- Aggregate storage: 30 TB (8 NVMe drives)
- Peak max TDP: 10 kW
- Dimensions (H/W/D): 17.3 in (10U) / 19.0 in / 32.8 in (440.0 mm / 482.3 mm / 834.0 mm)
- Weight: 340 lbs (154.2 kg)
- InfiniBand NICs: 8 Mellanox EDR
- Cooling (forced air): 1,000 CFM

DGX-2 PCIE NETWORK
- The Xeon sockets are QPI-connected, but affinity binding keeps GPU-related traffic off QPI
- The PCIe tree has NICs connected to pairs of GPUs to facilitate GPUDirect RDMA over the InfiniBand network
- Configuration and control of the NVSwitches is carried over the PCIe network (the NVSwitch management port is PCIe)
[Diagram: each x86 socket fans out through PCIe switches to 100G NICs and pairs of V100 GPUs; the GPUs connect to the six NVSwitch planes over NVLink.]
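The affinity binding mentioned above requires knowing which CPU socket each GPU (and its paired NIC) sits behind. The sketch below is a hedged, Linux-specific illustration (the sysfs lookup is a Linux convention and is not part of the slides): it reads each GPU's PCIe bus ID from the CUDA runtime and looks up the corresponding NUMA node, which a launcher can pass to numactl --cpunodebind/--membind so that host memory and PCIe traffic stay local to that socket.

    // Minimal Linux-specific sketch (illustrative): report the NUMA node behind
    // each GPU's PCIe bus ID so workers can be pinned to the local CPU socket.
    #include <cctype>
    #include <cstdio>
    #include <fstream>
    #include <string>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            char busId[32] = {0};
            cudaDeviceGetPCIBusId(busId, static_cast<int>(sizeof(busId)), dev);

            // sysfs device names use lowercase hex digits.
            std::string id(busId);
            for (char& c : id)
                c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));

            int node = -1;   // -1 means the node could not be determined
            std::ifstream f("/sys/bus/pci/devices/" + id + "/numa_node");
            if (f) f >> node;

            std::printf("GPU %2d  %s  NUMA node %d\n", dev, id.c_str(), node);
        }
        return 0;
    }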