NVSWITCH AND DGX-2: NVLINK-SWITCHING CHIP AND SCALE-UP COMPUTE SERVER
Alex Ishii and Denis Foley, with Eric Anderson, Bill Dally, Glenn Dearth, Larry Dennison, Mark Hummel, and John Schafer

NVIDIA® DGX-2™ SERVER AND NVSWITCH™
Single-Server Chassis: 16 Tesla™ V100 32 GB GPUs; FP64: 125 TFLOPS; FP32: 250 TFLOPS; Tensor: 2000 TFLOPS; Dual 24-core Xeon CPUs; 1.5 TB DDR4 DRAM; 512 GB of GPU HBM2; 30 TB NVMe Storage; 10U/19-Inch Rack Mount; 10 kW Peak TDP
12-NVSwitch Network: Full-Bandwidth Fat-Tree Topology; 2.4 TBps Bisection Bandwidth; Global Shared Memory; Repeater-less
New NVSwitch Chip: 18 2nd-Generation NVLink™ Ports; 25 GBps per Port; 900 GBps Total Bidirectional Bandwidth; 450 GBps Total Throughput
©2018 NVIDIA CORPORATION

OUTLINE
The Case for “One Gigantic GPU” and NVLink Review
NVSwitch — Speeds and Feeds, Architecture and System Applications
DGX-2 Server — Speeds and Feeds, Architecture, and Packaging
Signal-Integrity Design Highlights
Achieved Performance
MULTI-CORE AND CUDA WITH ONE GPU
Users explicitly express parallel work in NVIDIA® CUDA
GPU Driver distributes work (data and CUDA Kernels) to Graphics Processing Cluster (GPC)/Streaming Multiprocessor (SM) cores
GPC/SM cores can compute on data in any of the second-generation High Bandwidth Memories (HBM2s)
GPC/SM cores use shared HBM2s to exchange data (results)
[Figure: single-GPU block diagram with GPCs, L2 caches, HBM2s, Copy Engines, Hub, PCIe I/O, and NVLinks joined by the XBAR]
TWO GPUS WITH PCIE

Access to the HBM2 of the other GPU is at PCIe bandwidth (32 GBps bidirectional)
Interactions with the CPU compete with GPU-to-GPU traffic
PCIe is the "Wild West" (lots of performance bandits)
[Figure: two GPUs exchanging work and results over PCIe through the CPU]
TWO GPUS WITH NVLINK

All GPCs can access all HBM2 memories
Access to the HBM2 of the other GPU is at multi-NVLink bandwidth (300 GBps bidirectional in V100 GPUs)
NVLinks are effectively a "bridge" between XBARs
No collisions with PCIe traffic
[Figure: two GPUs with their XBARs bridged by direct NVLink connections]
THE "ONE GIGANTIC GPU" IDEAL

Highest number of GPUs possible
Single GPU Driver process controls all work across all GPUs
From the perspective of the GPCs, all HBM2s can be accessed without intervention by other processes (LD/ST instructions, Copy Engine RDMA, everything "just works")
Access to all HBM2s is independent of PCIe
Bandwidth across bridged XBARs is as high as possible (some NUMA is unavoidable)
[Figure: 16 GPUs and two CPUs joined by a single NVLink XBAR]
"ONE GIGANTIC GPU" BENEFITS

Problem Size Capacity: problem size is limited by the aggregate HBM2 capacity of the entire set of GPUs, rather than the capacity of a single GPU
Strong Scaling: NUMA effects greatly reduced compared to existing solutions; aggregate bandwidth to HBM2 grows with the number of GPUs
Ease of Use: apps written for small numbers of GPUs port more easily; abundant resources enable rapid experimentation
INTRODUCING NVSWITCH
Bidirectional Bandwidth per NVLink: 51.5 GBps
NRZ Lane Rate (x8 per NVLink): 25.78125 Gbps
Transistors: 2 Billion
Process: TSMC 12FFN
Die Size: 106 mm^2
NVLink Ports: 18
Bidirectional Aggregate Bandwidth: 928 GBps
Mgmt Port (config, maintenance, errors): PCIe
LD/ST BW Efficiency (128 B pkts): 80.0%
Copy Engine BW Efficiency (256 B pkts): 88.9%
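The two efficiency figures above are consistent with a simple fixed-overhead packet model. A minimal sketch, assuming roughly 32 B of per-packet header/CRC/flow-control overhead (the actual NVLink packet format is not given on this slide):

```python
# Hedged sketch: reproduce the LD/ST and Copy Engine bandwidth-efficiency
# figures from the table, assuming a fixed 32 B per-packet overhead.

def nvlink_efficiency(payload_bytes: int, overhead_bytes: int = 32) -> float:
    """Fraction of raw link bandwidth that carries payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)

ldst = nvlink_efficiency(128)         # 128 B LD/ST packets -> 0.800
copy_engine = nvlink_efficiency(256)  # 256 B Copy Engine packets -> ~0.889
print(f"LD/ST: {ldst:.1%}, Copy Engine: {copy_engine:.1%}")
```

Larger packets amortize the fixed header cost, which is why Copy Engine transfers come closer to raw link bandwidth than LD/ST traffic.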
NVSWITCH BLOCK DIAGRAM
GPU-XBAR-bridging device; not a general networking device
Packet Transforms make traffic to/from multiple GPUs look like it is to/from a single GPU
XBAR is non-blocking
SRAM-based buffering
NVLink IP blocks and XBAR design/verification infrastructure reused from V100
[Figure: block diagram with 18 NVLink ports (PHY, TL, DL), each backed by Port Logic (Classification & Routing, Packet Transforms, Error Check & Statistics Collection, Transaction Tracking & Packet Transforms), an 18 x 18 XBAR, and a PCIe Management I/O port]
NVLINK: PHYSICAL SHARED MEMORY
Virtual-to-physical address translation is done in the GPCs
NVLink packets carry physical addresses
NVSwitch and DGX-2 follow the same model
[Figure: two-GPU NVLink diagram, with physical addressing on the links]
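Because NVLink packets carry physical addresses, switch routing can be modeled as a lookup from physical-address range to destination port. The sketch below is purely illustrative; the real NVSwitch routing-table format and address map are not described in this deck, and the 32 GB-per-GPU window is a hypothetical layout.

```python
# Illustrative only: model the "Classification & Routing" step as a range
# lookup from physical address to destination NVLink port.
import bisect

class RouteTable:
    def __init__(self, ranges):
        # ranges: list of (base_addr, limit_addr, dest_port), sorted by base
        self.bases = [r[0] for r in ranges]
        self.ranges = ranges

    def lookup(self, phys_addr: int) -> int:
        i = bisect.bisect_right(self.bases, phys_addr) - 1
        if i >= 0:
            base, limit, port = self.ranges[i]
            if phys_addr < limit:
                return port
        # a real switch would flag this as an access-control/routing error
        raise ValueError("unmapped physical address")

# Hypothetical map: each GPU's 32 GB HBM2 window routed to one port
table = RouteTable([(g << 35, (g + 1) << 35, g) for g in range(16)])
print(table.lookup(3 << 35))  # address in GPU3's window -> port 3
```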
NVSWITCH: RELIABILITY

Hop-by-hop error checking deemed sufficient in the intra-chassis operating environment
  Low BER (10^-12 or better) on intra-chassis transmission media removes the need for high-overhead protections like FEC
HW-based consistency checks guard against "escapes" from hop-to-hop coverage
  ECC on SRAMs and in-flight pkts
  Data Link (DL) pkt tracking and retry for external links
  Access-Control checks for mis-configured pkts (either config/app errors or corruption)
  CRC on pkts (external)
SW responsible for additional consistency checks and clean-up
  Management hooks for SW-based scrubbing of table corruptions and error conditions (orphaned packets)
  Hooks for partial-reset and queue draining
  Configurable buffer allocations
NVSWITCH: DIE ASPECT RATIO
[Figure: die floorplan, with NVLink PHY blocks lining the die perimeter around the non-NVLink logic]
Packet-processing and switching (XBAR) logic is extremely compact
With more than 50% of the area taken up by I/O blocks, maximizing the perimeter-to-area ratio was key to an efficient design
Having all NVLink ports exit the die on parallel paths simplified package substrate routing
SIMPLE SWITCH SYSTEM
No NVSwitch:
  Connect GPUs directly
  Aggregate NVLinks into gangs for higher bandwidth
  Traffic is interleaved over the links to prevent camping
  Max bandwidth between two GPUs is limited to the bandwidth of the gang
With NVSwitch:
  Interleave traffic across all the links to support full bandwidth between any pair of GPUs
  Traffic to a single GPU is non-blocking, so long as the aggregate bandwidth of six NVLinks is not exceeded
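The interleaving above can be sketched as deterministically spreading consecutive chunks of a transfer across the gang members so no single link is camped on. The address-based round-robin below is an assumption for illustration; the slide does not specify the actual interleave function.

```python
# Sketch of interleaving traffic across a gang of NVLinks. Selecting the
# link from physical-address bits is an assumed policy, not the real one.

GANG_WIDTH = 6  # six NVLinks per V100

def select_link(phys_addr: int, stride: int = 256) -> int:
    # Spread consecutive 256 B chunks round-robin across the gang
    return (phys_addr // stride) % GANG_WIDTH

# Consecutive chunks land on links 0..5 in turn, so a long transfer
# uses the whole gang's bandwidth rather than camping on one link
links = [select_link(a) for a in range(0, 6 * 256, 256)]
print(links)  # [0, 1, 2, 3, 4, 5]
```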
SWITCH CONNECTED SYSTEMS
Add NVSwitches in parallel to support more GPUs
An eight-GPU closed system can be built with three NVSwitches, with two NVLinks from each GPU to each switch
Gang width is still six; traffic is interleaved across all the GPU links
GPUs can now communicate pairwise using the full 300 GBps bidirectional between any pair
The NVSwitch XBAR provides unique paths from any source to any destination: non-blocking, non-interfering
[Figure: eight V100 GPUs, each with two NVLinks to each of three NVSwitches]
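The port budget for this configuration can be checked with simple arithmetic. The 50 GBps bidirectional per-link figure is inferred from 300 GBps over a six-link gang, not stated per-link on this slide.

```python
# Sanity check of the 8-GPU / 3-NVSwitch configuration: two NVLinks from
# each GPU to each switch must fit within 18 ports per switch, while each
# GPU still drives its full gang of six links.

GPUS = 8
SWITCHES = 3
LINKS_PER_GPU = 6        # V100 provides six NVLinks
PORTS_PER_SWITCH = 18
GBPS_PER_LINK = 50       # bidirectional, implied by 300 GBps over six links

links_per_gpu_per_switch = LINKS_PER_GPU // SWITCHES   # 2
ports_used = GPUS * links_per_gpu_per_switch           # 16 of 18 ports
pairwise_bw = LINKS_PER_GPU * GBPS_PER_LINK            # 300 GBps per pair

print(ports_used, pairwise_bw)
```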
NON-BLOCKING BUILDING BLOCK
Taking this to the limit: connect one NVLink from each GPU to each of six switches
No routing between different switch planes is required
Eight of the 18 NVLinks available per switch are used to connect to GPUs
Ten NVLinks are available per switch for communication outside the local group (only eight are required to support full bandwidth)
This is the GPU baseboard configuration for DGX-2
[Figure: eight V100 GPUs, each with one NVLink to each of six NVSwitches]
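The per-switch port accounting for this building block works out as follows (a restatement of the numbers above as a check):

```python
# Port accounting for the DGX-2 baseboard building block: one NVLink from
# each of eight GPUs to each of six NVSwitches, leaving trunk ports free
# for connecting to a second baseboard.

GPUS, SWITCHES, PORTS_PER_SWITCH = 8, 6, 18

gpu_facing_ports = GPUS                                 # 8 ports per switch
trunk_ports_free = PORTS_PER_SWITCH - gpu_facing_ports  # 10 remain
trunk_ports_needed = gpu_facing_ports                   # 8 carry full bandwidth

print(gpu_facing_ports, trunk_ports_free, trunk_ports_needed)
```

Because every GPU reaches every switch plane directly, a packet never needs to hop between planes, which is what keeps the fabric non-blocking.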
DGX-2 NVLINK FABRIC
Two of these building blocks together form a fully connected 16-GPU cluster
Non-blocking, non-interfering (unless the same destination is involved)
Regular loads, stores, and atomics just work
Left-right symmetry simplifies physical packaging and manufacturability
[Figure: two eight-GPU baseboards, each with six NVSwitches, joined by trunk NVLinks]
NVIDIA DGX-2: SPEEDS AND FEEDS
Number of Tesla V100 GPUs: 16
Aggregate FP64/FP32: 125/250 TFLOPS
Aggregate Tensor (FP16): 2000 TFLOPS
Aggregate Shared HBM2: 512 GB
Aggregate HBM2 Bandwidth: 14.4 TBps
Per-GPU NVLink Bandwidth: 300 GBps bidirectional
Chassis Bisection Bandwidth: 2.4 TBps
InfiniBand NICs: 8 Mellanox EDR
CPUs: Dual Xeon Platinum 8168
CPU Memory: 1.5 TB DDR4
Aggregate Storage: 30 TB (8 NVMes)
Peak Max TDP: 10 kW
Dimensions (H/W/D): 17.3" (10U) / 19.0" / 32.8" (440.0 mm / 482.3 mm / 834.0 mm)
Weight: 340 lbs (154.2 kg)
Cooling (forced air): 1,000 CFM
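Several aggregate entries in the table follow directly from per-GPU Tesla V100 32 GB figures (32 GB and 900 GBps of HBM2, 125 TFLOPS Tensor per GPU, as quoted elsewhere in the deck):

```python
# Cross-check aggregate table entries from per-GPU Tesla V100 32 GB figures.

GPUS = 16
hbm2_capacity_gb = GPUS * 32          # 512 GB aggregate shared HBM2
hbm2_bw_tbps = GPUS * 900 / 1000      # 14.4 TBps aggregate HBM2 bandwidth
tensor_tflops = GPUS * 125            # 2000 TFLOPS aggregate Tensor

print(hbm2_capacity_gb, hbm2_bw_tbps, tensor_tflops)
```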
DGX-2 PCIE NETWORK
Xeon sockets are QPI-connected, but affinity-binding keeps GPU-related traffic off QPI
The PCIe tree has NICs connected to pairs of GPUs to facilitate GPUDirect™ RDMAs over the IB network
Configuration and control of the NVSwitches is via a driver process running on the CPUs
[Figure: PCIe topology with 100G NICs and PCIe switches paired with the 16 V100 GPUs, two x86 sockets linked by QPI, and the NVSwitch planes of the NVLink fabric]
NVIDIA DGX-2: GPUS + NVSWITCH COMPLEX
Two GPU Baseboards, with eight V100 GPUs and six NVSwitches on each
Two Plane Cards carry 24 NVLinks each
No repeaters or redrivers on any of the NVLinks, which conserves board space and power
NVIDIA DGX-2: SYSTEM PACKAGING
GPU Baseboards get power and PCIe from the front midplane
The front midplane connects to the I/O-Expander PCB (with PCIe switches and NICs) and the CPU Motherboard
48 V PSUs reduce current load on distribution paths
Internal "bus bars" bring current from each PSU to the front midplane
NVIDIA DGX-2: SYSTEM COOLING
Forced-air cooling of the Baseboards, I/O Expander, and CPUs is provided by ten 92 mm fans
Four supplemental 60 mm internal fans cool the NVMe drives and PSUs
Air reaching the NVSwitches is pre-heated by the GPUs, requiring "full height" heatsinks
DGX-2: SIGNAL INTEGRITY OPTIMIZATIONS
The NVLink repeaterless topology trades off longer traces to the GPUs for shorter traces to the Plane Cards
Custom SXM and Plane Card connectors for reduced loss and crosstalk
TX and RX grouped in separate pin-fields to reduce crosstalk
[Figure: connector pin-field with all TX pins grouped apart from all RX pins]
DGX-2: ACHIEVED BISECTION BW
The test has each GPU reading data from another GPU across the bisection (from a GPU on the other Baseboard)
Raw bisection bandwidth is 2.472 TBps
The achieved Read bisection bandwidth of 1.98 TBps matches the theoretical 80% bidirectional NVLink efficiency
"All-to-all" results (each GPU reads from the eight GPUs on the other PCB) are similar
[Figure: achieved read bandwidth (GB/sec) vs. number of GPUs, 1 to 16]
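The measured number lines up with NVSwitch's 80% LD/ST packet efficiency once the raw bisection is expressed in terms of trunk links: 8 GPUs on each side, each with 6 NVLinks crossing between baseboards, at 51.5 GBps bidirectional each.

```python
# Check that 1.98 TBps achieved read bisection matches ~80% of the raw
# bisection formed by the 48 trunk NVLinks between baseboards.

raw_tbps = 8 * 6 * 51.5 / 1000    # 2.472 TBps raw bisection
achieved_tbps = 1.98              # measured read bisection

print(f"efficiency = {achieved_tbps / raw_tbps:.1%}")
```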
DGX-2: CUFFT (16K X 16K)
Results are "iso-problem instance" (more GFLOPS means shorter running time)
As the problem is split over more GPUs, it takes longer to transfer data than to calculate locally
The aggregate nature of HBM2 capacity enables larger problem sizes
[Figure: cuFFT GFLOPS vs. number of GPUs (1 to 8) for DGX-2 (All-to-All NVLink Fabric) and DGX-1™ (V100 with Hybrid Cube Mesh)]
DGX-2: ALL-REDUCE BENCHMARK
Important communication primitive in Machine-Learning apps
DGX-2 provides increased bandwidth and lower latency compared to two 8-GPU servers connected with InfiniBand
The NVLink network has efficiency superior to InfiniBand at smaller message sizes
[Figure: all-reduce bandwidth (MB/s) vs. message size (128 KB to 512 MB) for DGX-2 (NVLink Fabric) and two DGX-1 (4x 100 Gb IB), with 3X and 8X advantages annotated]
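All-reduce stresses the fabric because, in a bandwidth-optimal (e.g. ring-based) implementation, every GPU must send and receive roughly 2*(N-1)/N times the buffer size, so achieved bandwidth tracks per-GPU link bandwidth. This is an illustrative model only; the slide does not name the benchmarked all-reduce implementation.

```python
# Bytes each GPU moves in a bandwidth-optimal all-reduce (illustrative model).

def allreduce_traffic_per_gpu(n_gpus: int, buffer_bytes: int) -> float:
    """Bytes each GPU sends (and receives) in a ring-style all-reduce."""
    return 2 * (n_gpus - 1) / n_gpus * buffer_bytes

# 16 GPUs reducing a 256 MB buffer: each GPU moves 480 MB in each direction
mb_moved = allreduce_traffic_per_gpu(16, 256 * 2**20) / 2**20
print(f"{mb_moved:.0f} MB")  # 480 MB
```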
DGX-2: >2X SPEED-UP ON TARGET APPS

Two DGX-1™ Servers (V100) Compared to DGX-2, Same Total GPU Count

HPC:
  2X FASTER: Physics (MILC benchmark), 4D Grid All-to-all
  2.4X FASTER: Weather (ECMWF benchmark), All-to-all
AI Training:
  2X FASTER: Recommender (Sparse Embedding), Reduce & Broadcast
  2.7X FASTER: Language Model (Transformer with MoE), All-to-all
Two DGX-1 servers have dual-socket Xeon E5 2698v4 processors and eight Tesla V100 32 GB GPUs each, connected via four EDR IB/GbE ports. The DGX-2 server has dual-socket Xeon Platinum 8168 processors and 16 Tesla V100 32 GB GPUs.

NVSWITCH AVAILABILITY: NVIDIA HGX-2™
Eight-GPU baseboard with six NVSwitches
Two HGX-2 boards can be passively connected to realize 16-GPU systems
ODM/OEM partners build servers utilizing NVIDIA HGX-2 GPU baseboards
A design guide provides recommendations
SUMMARY
New NVSwitch:
  18-NVLink-Port, Non-Blocking NVLink Switch
  51.5 GBps per Port; 928 GBps Aggregate Bidirectional Bandwidth
DGX-2:
  "One Gigantic GPU" Scale-Up Server with 16 Tesla V100 GPUs
  125 TF (FP64) to 2000 TF (Tensor)
  512 GB Aggregate HBM2 Capacity
  NVSwitch Fabric Enables >2X Speed-Up on Target Apps