NVSWITCH AND DGX-2: NVLINK-SWITCHING CHIP AND SCALE-UP COMPUTE SERVER
Alex Ishii and Denis Foley, with Eric Anderson, Bill Dally, Glenn Dearth, Larry Dennison, Mark Hummel, and John Schafer

DGX-2™ SERVER AND NVSWITCH™

DGX-2™ Server:
- 16 Tesla™ V100 32 GB GPUs
- FP64: 125 TFLOPS; FP32: 250 TFLOPS; Tensor: 2000 TFLOPS
- 512 GB of GPU HBM2
- Single-Server Chassis, 10U/19-Inch Rack Mount
- 10 kW Peak TDP
- Dual 24-core Xeon CPUs, 1.5 TB DDR4 DRAM
- 30 TB NVMe Storage

NVLink Fabric:
- 12-NVSwitch Network, Full-Bandwidth Fat-Tree Topology
- 2.4 TBps Bisection Bandwidth
- Global Shared Memory
- Repeater-less

New NVSwitch™ Chip:
- 18 2nd-Generation NVLink™ Ports, 25 GBps per Port
- 900 GBps Total Bidirectional Bandwidth
- 450 GBps Total Throughput

OUTLINE

The Case for “One Gigantic GPU” and NVLink Review

NVSwitch — Speeds and Feeds, Architecture and System Applications

DGX-2 Server — Speeds and Feeds, Architecture, and Packaging

Signal-Integrity Design Highlights

Achieved Performance

MULTI-CORE AND CUDA WITH ONE GPU

- Users explicitly express parallel work in NVIDIA® CUDA®
- GPU Driver distributes work (data and CUDA kernels) to available Graphics Processing Cluster (GPC)/Streaming Multiprocessor (SM) cores
- GPC/SM cores can compute on data in any of the second-generation High Bandwidth Memories (HBM2s)
- GPC/SM cores use shared HBM2s to exchange data

[Diagram: single GPU with GPCs, L2 caches, HBM2 stacks, Copy Engines, Hub, PCIe I/O, and NVLink ports connected by the on-chip XBAR; work (data and CUDA kernels) arrives from the CPU and results (data) return]
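
To make the work-distribution model concrete, here is a minimal CUDA sketch of a kernel launch; the kernel name, data size, and grid shape are illustrative choices, not details from the talk.

#include <cuda_runtime.h>

// Each thread scales one element; the driver and hardware schedule the
// resulting thread blocks onto whichever GPC/SM cores are available.
__global__ void scale(float *data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));           // resides in the GPU's HBM2
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n); // parallelism expressed explicitly
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}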

TWO GPUS WITH PCIE

- Access to the HBM2 of the other GPU is at PCIe bandwidth (32 GBps bidirectional)
- Interactions with the CPU compete with GPU-to-GPU traffic
- PCIe is the “Wild West” (lots of performance bandits)

[Diagram: two V100 GPUs (GPU0, GPU1) connected to the CPU over PCIe; GPU-to-GPU traffic shares the PCIe links with work and results traffic]

TWO GPUS WITH NVLINK

- All GPCs can access all HBM2 memories
- Access to the HBM2 of the other GPU is at multi-NVLink bandwidth (300 GBps bidirectional in V100 GPUs)
- NVLinks are effectively a “bridge” between XBARs
- No collisions with PCIe traffic

[Diagram: two V100 GPUs whose XBARs are bridged by their NVLinks; PCIe carries only CPU work and results traffic]
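
A hedged sketch of how two connected GPUs expose each other's HBM2 through the standard CUDA peer-access API; whether the physical path is NVLink or PCIe depends on the system topology, and the buffer size and omission of error checking are illustrative assumptions.

#include <cuda_runtime.h>

int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   // can GPU0 map GPU1's memory?
    cudaDeviceCanAccessPeer(&can10, 1, 0);

    float *buf0, *buf1;
    const size_t bytes = 256u << 20;         // 256 MB per GPU (arbitrary)

    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);                // lives in GPU0's HBM2
    if (can01) cudaDeviceEnablePeerAccess(1, 0);

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);                // lives in GPU1's HBM2
    if (can10) cudaDeviceEnablePeerAccess(0, 0);

    // Copy Engines move data GPU1 -> GPU0 directly, without staging
    // through CPU memory, once peer access is enabled.
    cudaMemcpyPeer(buf0, 0, buf1, 1, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}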

THE “ONE GIGANTIC GPU” IDEAL

- Highest number of GPUs possible
- Single GPU Driver process controls all work across all GPUs
- From the perspective of the GPCs, all HBM2s can be accessed without intervention by other processes (LD/ST instructions, Copy Engine RDMA, everything “just works”)
- Access to all HBM2s is independent of PCIe
- Bandwidth across the bridged XBARs is as high as possible (some NUMA is unavoidable)

[Diagram: 16 GPUs (GPU0 through GPU15) and two CPUs joined by a single NVLink XBAR]
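
The “just works” LD/ST behavior can be illustrated with a hedged CUDA sketch: once peer access is enabled, a kernel running on GPU0 can dereference a pointer whose backing memory lives in GPU1's HBM2. The kernel and sizes are illustrative, not from the talk.

#include <cuda_runtime.h>

// Runs on GPU0 but loads from and stores to memory allocated on GPU1.
// The accesses traverse the bridged XBARs transparently.
__global__ void add_remote(float *remote, float v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) remote[i] += v;
}

int main() {
    const int n = 1 << 20;
    float *buf1;

    cudaSetDevice(1);
    cudaMalloc(&buf1, n * sizeof(float));    // physically in GPU1's HBM2
    cudaMemset(buf1, 0, n * sizeof(float));

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);        // map GPU1's memory into GPU0's address space
    add_remote<<<(n + 255) / 256, 256>>>(buf1, 1.0f, n);
    cudaDeviceSynchronize();

    cudaSetDevice(1);
    cudaFree(buf1);
    return 0;
}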

“ONE GIGANTIC GPU” BENEFITS

Problem Size Capacity
- Problem size is limited by the aggregate HBM2 capacity of the entire set of GPUs, rather than the capacity of a single GPU

Strong Scaling
- NUMA effects greatly reduced compared to existing solutions
- Aggregate bandwidth to HBM2 grows with the number of GPUs

Ease of Use
- Apps written for a small number of GPUs port more easily
- Abundant resources enable rapid experimentation

INTRODUCING NVSWITCH

Parameter: Spec
- Bidirectional Bandwidth per NVLink: 51.5 GBps
- NRZ Lane Rate (x8 per NVLink): 25.78125 Gbps
- Transistors: 2 Billion
- Process: TSMC 12FFN
- Die Size: 106 mm²
- NVLink Ports: 18
- Bidirectional Aggregate Bandwidth: 928 GBps
- Mgmt Port (config, maintenance, errors): PCIe
- LD/ST BW Efficiency (128 B pkts): 80.0%
- Copy Engine BW Efficiency (256 B pkts): 88.9%
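
As a cross-check, the port and aggregate numbers follow from the lane rate, and the two efficiency figures are consistent with a fixed per-packet overhead of about 32 B (the overhead value is an inference from the efficiencies, not a number stated in the talk):

\[
25.78125~\text{Gbps/lane} \times 8~\text{lanes} \approx 25.8~\text{GBps per direction}
\;\Rightarrow\; 51.5~\text{GBps bidirectional per NVLink}
\]
\[
18~\text{ports} \times 51.5~\text{GBps} \approx 928~\text{GBps aggregate bidirectional}
\]
\[
\frac{128}{128+32} = 80.0\%, \qquad \frac{256}{256+32} \approx 88.9\%
\]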

NVSWITCH BLOCK DIAGRAM

- GPU-XBAR-bridging device; not a general networking device
- Packet transforms make traffic to/from multiple GPUs look like it is to/from a single GPU
- XBAR is non-blocking
- SRAM-based buffering
- NVLink IP blocks and XBAR design/verification infrastructure reused from V100

[Block diagram: 18 NVLink ports (PHY, DL, TL), each feeding a Port Logic block (classification & routing, packet transforms, error check & statistics collection, transaction tracking & packet transforms) into an 18 x 18 XBAR; management via PCIe I/O]

NVLINK: PHYSICAL SHARED MEMORY

- Virtual-to-physical address translation is done in the GPCs
- NVLink packets carry physical addresses
- NVSwitch and DGX-2 follow the same model

[Diagram: two NVLink-connected V100 GPUs exchanging packets that use physical addressing]

NVSWITCH: RELIABILITY

- Hop-by-hop error checking deemed sufficient in the intra-chassis operating environment
- Low BER (10^-12 or better) on intra-chassis transmission media removes the need for high-overhead protections like FEC
- HW-based consistency checks guard against “escapes” from hop-to-hop checking
- SW responsible for additional consistency checks and clean-up

Per-block protections:
- Management: hooks for SW-based scrubbing of table corruptions and error conditions (orphaned packets); hooks for partial reset and queue draining
- Port Logic (NVLink): Data Link (DL) packet tracking; CRC on external packets; ECC on in-flight packets, with retry for external links; access-control checks for mis-configured packets (either config/app errors or corruption); configurable buffer allocations
- XBAR: ECC on SRAMs and in-flight packets

NVSWITCH: DIE ASPECT RATIO

[Die plot: NVLink PHY blocks occupy the die edges, surrounding the non-NVLink logic]

- Packet-processing and switching (XBAR) logic is extremely compact
- With more than 50% of the area taken up by I/O blocks, maximizing the perimeter-to-area ratio was key to an efficient design
- Having all NVLink ports exit the die on parallel paths simplified package substrate routing

SIMPLE SWITCH SYSTEM

No NVSwitch:
- Connect GPUs directly
- Aggregate NVLinks into gangs for higher bandwidth
- Traffic is interleaved over the links to prevent camping
- Max bandwidth between two GPUs is limited to the bandwidth of the gang

With NVSwitch:
- Interleave traffic across all the links to support full bandwidth between any pair of GPUs
- Traffic to a single GPU is non-blocking, so long as the aggregate bandwidth of six NVLinks is not exceeded
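
A worked example of the gang-bandwidth limit, using the nominal 25 GBps per direction per NVLink quoted on the opening slide:

\[
6~\text{NVLinks} \times 25~\text{GBps/direction} \times 2~\text{directions} = 300~\text{GBps bidirectional between the two V100s}
\]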

SWITCH-CONNECTED SYSTEMS

- Add NVSwitches in parallel to support more GPUs
- An eight-GPU closed system can be built with three NVSwitches, with two NVLinks from each GPU to each switch
- Gang width is still six; traffic is interleaved across all the GPU links
- GPUs can now communicate pairwise using the full 300 GBps bidirectional between any pair
- NVSwitch XBAR provides unique paths from any source to any destination: non-blocking, non-interfering

NON-BLOCKING BUILDING BLOCK

- Taking this to the limit: connect one NVLink from each GPU to each of six switches
- No routing between different switch planes is required
- Eight of the 18 NVLinks available per switch are used to connect to GPUs
- Ten NVLinks are available per switch for communication outside the local group (only eight are required to support full bandwidth)
- This is the GPU baseboard configuration for DGX-2
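
Link accounting for this building block, derived from figures given elsewhere in the talk (six NVLinks per V100, 24 NVLinks per Plane Card, 51.5 GBps bidirectional per NVLink):

\[
\text{GPU-facing: } 8~\text{GPUs} \times 6~\text{NVLinks} = 48 = 6~\text{switches} \times 8~\text{ports}
\]
\[
\text{Trunk: } 2~\text{Plane Cards} \times 24~\text{NVLinks} = 48 = 6~\text{switches} \times 8~\text{ports},
\qquad 48 \times 51.5~\text{GBps} \approx 2.47~\text{TBps raw bisection}
\]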

DGX-2 NVLINK FABRIC

- Two of these building blocks together form a fully connected 16-GPU cluster
- Non-blocking, non-interfering (unless the same destination is involved)
- Regular loads, stores, and atomics just work
- Left-right symmetry simplifies physical packaging and manufacturability

[Diagram: two eight-GPU baseboards, each with six NVSwitches, joined across the plane cards]

NVIDIA DGX-2: SPEEDS AND FEEDS

Parameter: DGX-2 Spec
- Number of Tesla V100 GPUs: 16
- Aggregate FP64/FP32: 125/250 TFLOPS
- Aggregate Tensor (FP16): 2000 TFLOPS
- Aggregate Shared HBM2: 512 GB
- Aggregate HBM2 Bandwidth: 14.4 TBps
- Per-GPU NVLink Bandwidth: 300 GBps bidirectional
- Chassis Bisection Bandwidth: 2.4 TBps
- InfiniBand NICs: 8 Mellanox EDR
- CPUs: Dual Xeon Platinum 8168
- CPU Memory: 1.5 TB DDR4
- Aggregate Storage: 30 TB (8 NVMes)
- Peak Max TDP: 10 kW
- Dimensions (H/W/D): 17.3" (10U) / 19.0" / 32.8" (440.0 mm / 482.3 mm / 834.0 mm)
- Weight: 340 lbs (154.2 kg)
- Cooling (forced air): 1,000 CFM
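
The aggregate memory figures are simply the per-GPU numbers scaled by the GPU count:

\[
16 \times 32~\text{GB} = 512~\text{GB of shared HBM2},
\qquad 14.4~\text{TBps} / 16 = 900~\text{GBps of HBM2 bandwidth per GPU}
\]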

DGX-2 PCIE NETWORK

- Xeon sockets are QPI-connected, but affinity binding keeps GPU-related traffic off QPI
- PCIe tree has NICs connected to pairs of GPUs to facilitate GPUDirect™ RDMAs over the IB network
- Configuration and control of the NVSwitches is via a driver process running on the CPUs

[Diagram: two x86 sockets connected by QPI; each socket fans out through PCIe switches to 100G NICs and pairs of V100 GPUs, with the GPUs also attached to the two six-NVSwitch baseboards]

NVIDIA DGX-2: GPUS + NVSWITCH COMPLEX

- Two GPU Baseboards, each with eight V100 GPUs and six NVSwitches
- Two Plane Cards carry 24 NVLinks each
- No repeaters or redrivers on any of the NVLinks, which conserves board space and power

NVIDIA DGX-2: SYSTEM PACKAGING

- GPU Baseboards get power and PCIe from the front midplane
- Front midplane connects to the I/O-Expander PCB (with PCIe switches and NICs) and the CPU Motherboard
- 48 V PSUs reduce current load on distribution paths
- Internal "bus bars" bring current from each PSU to the front midplane

NVIDIA DGX-2: SYSTEM COOLING

- Forced-air cooling of the Baseboards, I/O Expander, and CPU provided by ten 92 mm fans
- Four supplemental 60 mm internal fans cool the NVMe drives and PSUs
- Air reaching the NVSwitches is pre-heated by the GPUs, requiring "full height" heatsinks

DGX-2: SIGNAL INTEGRITY OPTIMIZATIONS

- NVLink repeater-less topology trades off longer traces to the GPUs for shorter traces to the Plane Cards
- Custom SXM and Plane Card connectors for reduced loss and crosstalk
- TX and RX grouped in separate pin-fields ("All TX" / "All RX") to reduce crosstalk

DGX-2: ACHIEVED BISECTION BW

- Test has each GPU reading data from another GPU across the bisection (from a GPU on the different Baseboard)
- Raw bisection bandwidth is 2.472 TBps
- 1.98 TBps achieved read bisection bandwidth matches the theoretical 80% bidirectional NVLink efficiency
- "All-to-all" (each GPU reads from the eight GPUs on the other PCB) results are similar

[Chart: achieved read bandwidth (GB/sec) vs. number of GPUs, 1 through 16]
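
The achieved figure is the raw bisection scaled by the 80% LD/ST efficiency quoted for NVSwitch:

\[
2.472~\text{TBps (raw)} \times 0.80 \approx 1.98~\text{TBps achieved read bisection bandwidth}
\]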

DGX-2: CUFFT (16K X 16K)

- Results are "iso-problem instance" (more GFLOPS means shorter running time)
- As the problem is split over more GPUs, it takes longer to exchange data than to calculate locally
- Aggregate nature of the HBM2 capacity enables larger problem sizes

[Chart: GFLOPS vs. number of GPUs (1 through 8) for 1/2 DGX-2 (all-to-all NVLink fabric) vs. DGX-1™ (V100 and hybrid cube mesh)]
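
For context, a hedged sketch of the cuFFT multi-GPU (cufftXt) path that a benchmark like this would typically exercise; the GPU count, sizes, and omission of error checking and data initialization are illustrative assumptions, not details from the talk.

#include <cufftXt.h>
#include <cuda_runtime.h>

int main() {
    const int nx = 16384, ny = 16384;        // 16K x 16K single-precision complex FFT
    const int nGPUs = 8;
    int gpus[nGPUs];
    for (int i = 0; i < nGPUs; ++i) gpus[i] = i;

    cufftHandle plan;
    size_t workSizes[nGPUs];
    cufftCreate(&plan);
    cufftXtSetGPUs(plan, nGPUs, gpus);                 // spread the plan across GPUs
    cufftMakePlan2d(plan, nx, ny, CUFFT_C2C, workSizes);

    // Library-managed descriptor: the data is partitioned across the GPUs' HBM2,
    // and the inter-GPU exchanges run over the GPU-to-GPU fabric.
    cudaLibXtDesc *desc;
    cufftXtMalloc(plan, &desc, CUFFT_XT_FORMAT_INPLACE);
    // ... fill desc with input data, e.g. cufftXtMemcpy(plan, desc, host_in, CUFFT_COPY_HOST_TO_DEVICE);

    cufftXtExecDescriptorC2C(plan, desc, desc, CUFFT_FORWARD);

    cufftXtFree(desc);
    cufftDestroy(plan);
    return 0;
}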

DGX-2: ALL-REDUCE BENCHMARK

- Important communication primitive in machine-learning apps
- DGX-2 provides increased bandwidth and lower latency compared to two 8-GPU servers connected with InfiniBand
- NVLink network has efficiency superior to InfiniBand on smaller message sizes

[Chart: all-reduce bandwidth (MB/s) vs. message size (128 KB to 512 MB) for DGX-2 (NVLink fabric) vs. two DGX-1 (4x 100 Gb IB), with 3X and 8X advantages marked]
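
All-reduce on a single DGX-2 is typically driven through NCCL; the sketch below is a minimal single-process example under that assumption, with the buffer size chosen arbitrarily and error checking omitted.

#include <nccl.h>
#include <cuda_runtime.h>

int main() {
    const int nDev = 16;                    // one rank per GPU in the DGX-2
    const size_t count = 32u << 20;         // 32M floats (128 MB) per GPU

    int devs[nDev];
    float *buf[nDev];
    cudaStream_t streams[nDev];
    ncclComm_t comms[nDev];

    for (int i = 0; i < nDev; ++i) {
        devs[i] = i;
        cudaSetDevice(i);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }
    ncclCommInitAll(comms, nDev, devs);     // one communicator per GPU

    // Sum-reduce in place across all 16 GPUs; NCCL uses the NVLink/NVSwitch
    // fabric for the GPU-to-GPU traffic when it is available.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
        cudaFree(buf[i]);
    }
    return 0;
}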

DGX-2: >2X SPEED-UP ON TARGET APPS

Two DGX-1™ Servers (V100) Compared to DGX-2 with the Same Total GPU Count

HPC:
- Physics (MILC benchmark), 4D grid all-to-all: 2X faster
- Weather (ECMWF benchmark), all-to-all: 2.4X faster

AI Training:
- Recommender (sparse embedding), reduce & broadcast: 2X faster
- Language Model (Transformer with MoE), all-to-all: 2.7X faster

[Chart: relative speed-up of DGX-2 with NVSwitch over two DGX-1 (V100) servers for each app]

Configurations: two DGX-1 servers, each with dual-socket Xeon E5-2698 v4 processors and eight Tesla V100 32 GB GPUs, connected via four EDR IB/GbE ports; DGX-2 server with dual-socket Xeon Platinum 8168 processors and 16 Tesla V100 32 GB GPUs.

NVSWITCH AVAILABILITY: NVIDIA HGX-2™

- Eight-GPU baseboard with six NVSwitches
- Two HGX-2 boards can be passively connected to realize 16-GPU systems
- ODM/OEM partners build servers utilizing NVIDIA HGX-2 GPU baseboards
- Design guide provides recommendations

SUMMARY

New NVSwitch:
- 18-NVLink-Port, Non-Blocking NVLink Switch
- 51.5 GBps per Port
- 928 GBps Aggregate Bidirectional Bandwidth

DGX-2:
- "One Gigantic GPU" Scale-Up Server with 16 Tesla V100 GPUs
- 125 TF (FP64) to 2000 TF (Tensor)
- 512 GB Aggregate HBM2 Capacity
- NVSwitch Fabric Enables >2X Speed-Up on Target Apps
