Inside the Ampere Architecture

Ronny Krashinsky and Olivier Giroux, GPU Architects (GTC 2020)

UNPRECEDENTED ACCELERATION AT EVERY SCALE

54 billion transistors | 3rd gen Tensor Cores | Sparsity acceleration | MIG | 3rd gen NVLink & NVSwitch

UNIFIED AI ACCELERATION

Charts: BERT-Large Training (Sequences/s): A100 is about 3X V100 for FP32 (A100 uses TF32 Tensor Cores) and about 6X V100 for FP16. BERT-Large Inference (Sequences/s): A100 is about 7X V100, with T4 at 0.6X and a single 1/7th A100 MIG instance roughly matching V100 (1X).

All results are measured. BERT-Large Training (FP32 & FP16) measures the pre-training phase and uses PyTorch, including (2/3) Phase 1 with Seq Len 128 and (1/3) Phase 2 with Seq Len 512; V100 is a DGX-1 server with 8xV100, A100 is a DGX A100 server with 8xA100; A100 uses TF32 Tensor Cores for FP32 training. BERT-Large Inference uses TRT 7.1 for T4/V100 with INT8/FP16 at batch size 256; pre-production TRT for A100 uses batch size 94 and INT8 with sparsity.

ACCELERATING HPC
Molecular Dynamics | Physics | Engineering | Geo Science

Chart: A100 speedup over V100, roughly 1.5X to 2.1X across AMBER, GROMACS, LAMMPS, NAMD, Chroma, BerkeleyGW, FUN3D, RTM, and SPECFEM3D.

All results are measured. Except for BerkeleyGW, the V100 used is a single V100 SXM2 and the A100 a single A100 SXM4. App details: AMBER based on PME-Cellulose, GROMACS with STMV (h-bond), LAMMPS with Atomic Fluid LJ-2.5, NAMD with v3.0a1 STMV_NVE, Chroma with szscl21_24_128, FUN3D with dpw, RTM with Isotropic Radius 4 1024^3, SPECFEM3D with Cartesian four-material model. BerkeleyGW based on Chi Sum and uses 8xV100 in DGX-1 vs. 8xA100 in DGX A100.

A100 TENSOR-CORE GPU
54 billion transistors in 7nm

GigaThread Engine with MIG control; Multi-Instance GPU for scale-out
108 SMs, 6912 CUDA cores
40 MB L2 cache (6.7x V100 capacity)
HBM2 at 1.56 TB/s (1.7x V100 bandwidth)
3rd gen NVLink, 12 links, 2x bandwidth, for scale-up

A100 SM

Third-generation Tensor Core: faster and more efficient, comprehensive data types, sparsity acceleration
Asynchronous data movement and synchronization
Increased L1/SMEM capacity: 192 KB L1 data cache / shared memory

1. New Tensor Core 2. Strong Scaling 3. Elastic GPU 4. Productivity

V100 TENSOR CORE

Input operands   Accumulator   TOPS   X-factor vs. FFMA
FP32 (FFMA)      FP32          15.7   1x
FP16             FP32          125    8x

V100: 125 TOPS, 8x vs. FFMA, with FP16/FP32 mixed precision.

A100 TENSOR CORE

        Input operands   Accumulator   TOPS    X-factor vs. FFMA   Sparse TOPS   Sparse X-factor
V100    FP32 (FFMA)      FP32          15.7    1x                  -             -
V100    FP16             FP32          125     8x                  -             -
A100    FP32 (FFMA)      FP32          19.5    1x                  -             -
A100    TF32             FP32          156     8x                  312           16x
A100    FP16             FP32          312     16x                 624           32x
A100    BF16             FP32          312     16x                 624           32x
A100    FP16             FP16          312     16x                 624           32x
A100    INT8             INT32         624     32x                 1248          64x
A100    INT4             INT32         1248    64x                 2496          128x
A100    BINARY           INT32         4992    256x                -             -
A100    IEEE FP64        FP64          19.5    1x                  -             -

V100 to A100, FP16/FP32 mixed precision: 2.5x TOPS and 2x TOPS/SM.
TF32 accelerates FP32 in/out data: 10x vs. V100 FP32. BFloat16 (BF16) runs at the same rate as FP16.
Inference data types: TOPS track operand width (half the width gives 2x, a quarter gives 4x).
With sparsity, another 2x; INT8/INT4 reach petaops.
IEEE FP64 at 19.5 TFLOPS: 2.5x the V100 FLOPS for HPC.
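For orientation, here is a minimal sketch of how a kernel drives the Tensor Cores through the public CUDA WMMA API (mma.h): one warp computes a single 16x16x16 tile with FP16 inputs and an FP32 accumulator. This is the generic warp-level path available since Volta rather than anything A100-specific; the kernel name and the fixed leading dimensions are illustrative.

    #include <mma.h>
    #include <cuda_fp16.h>

    using namespace nvcuda;

    // One warp computes D = A*B + C for a single 16x16x16 tile:
    // FP16 inputs, FP32 accumulation on the Tensor Cores.
    __global__ void wmma_tile_16x16x16(const half* A, const half* B,
                                       const float* C, float* D)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::load_matrix_sync(a_frag, A, 16);                      // leading dimension 16
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::load_matrix_sync(c_frag, C, 16, wmma::mem_row_major);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);             // Tensor Core MMA
        wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
    }

    // Launched with a single warp: wmma_tile_16x16x16<<<1, 32>>>(dA, dB, dC, dD);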

INSIDE A100: TensorFloat-32 (TF32)

Out-of-the-box tensor core acceleration for DL: FP32 matrices in and out; inputs are formatted to TF32 for the multiply, with FP32 accumulation.
An easy step towards maximizing tensor core performance with mixed precision (FP16, BF16).

Formats (sign / exponent / mantissa):
FP32  s / e8 / m23
TF32  s / e8 / m10
FP16  s / e5 / m10
BF16  s / e8 / m7

Range of FP32 with the precision of FP16; FP32 storage and math for everything outside the tensor cores.
FP32 input/output for all activations, gradients, etc.; up to 4x speedup on linear solvers for HPC.

→S22082: Mixed-Precision Training of Neural Networks, 5/20 2:45pm PDT
→S21681: How CUDA Math Libraries can help you unleash the power of the new NVIDIA A100 GPU (recording available)
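To illustrate the out-of-the-box point: with cuBLAS in CUDA 11, an ordinary FP32 GEMM can be routed to the TF32 Tensor Cores by switching the handle's math mode. A hedged sketch, not from the deck: the wrapper function is illustrative, error checking is omitted, and d_A/d_B/d_C are assumed to be populated column-major device buffers.

    #include <cublas_v2.h>

    // Sketch: run a plain FP32 SGEMM on TF32 Tensor Cores by opting in via the math mode.
    void sgemm_tf32(cublasHandle_t handle, int m, int n, int k,
                    const float* d_A, const float* d_B, float* d_C)
    {
        const float alpha = 1.0f, beta = 0.0f;

        // Opt in to TF32 Tensor Core math for FP32 routines on Ampere.
        cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

        // Storage and the interface stay FP32; inputs are rounded to TF32 inside
        // the Tensor Cores and products are accumulated in FP32.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k, &alpha, d_A, m, d_B, k, &beta, d_C, m);
    }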

INSIDE A100: SPARSE TENSOR CORE

2x Tensor Core throughput
Structured sparsity for efficient HW and SW
~2x reduction in weight footprint and bandwidth
~No loss in inferencing accuracy, evaluated across dozens of networks: vision, object detection, segmentation, natural language modeling, translation

Flow: dense trained weights → fine-grained structured pruning (2:4 non-zero) → fine-tuning → compressed non-zero weights plus indices. The sparse Tensor Core uses the indices to select (mux) only the input activations that correspond to non-zero weights for each dot product, producing the output activations. A conceptual sketch of the 2:4 constraint follows the session pointer below.

→ S22085: Accelerating Sparsity in the NVIDIA Ampere Architecture, 5/20 1:30pm PDT
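Not the production pruning flow referenced in the session above, just a conceptual sketch of the 2:4 constraint itself: in every group of four weights, keep the two largest-magnitude values and zero the rest. The function name is illustrative; real flows also fine-tune the network afterwards and compress the survivors together with their indices.

    #include <cmath>
    #include <utility>
    #include <vector>

    // Conceptual sketch of 2:4 fine-grained structured pruning:
    // for each group of 4 weights, zero out the 2 smallest-magnitude entries.
    void prune_2_to_4(std::vector<float>& w)
    {
        for (size_t g = 0; g + 4 <= w.size(); g += 4) {
            // Find the indices of the two largest-magnitude elements in this group.
            int keep0 = 0, keep1 = 1;
            if (std::fabs(w[g + keep1]) > std::fabs(w[g + keep0])) std::swap(keep0, keep1);
            for (int i = 2; i < 4; ++i) {
                if (std::fabs(w[g + i]) > std::fabs(w[g + keep0])) { keep1 = keep0; keep0 = i; }
                else if (std::fabs(w[g + i]) > std::fabs(w[g + keep1])) { keep1 = i; }
            }
            // Zero everything except the two survivors.
            for (int i = 0; i < 4; ++i) {
                if (i != keep0 && i != keep1) w[g + i] = 0.0f;
            }
        }
    }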

1. New Tensor Core 2. Strong Scaling 3. Elastic GPU 4. Productivity

DL STRONG SCALING

DL networks are long chains of sequentially dependent, compute-intensive layers. Each layer is parallelized across the GPU; a tile is the work for one SM. Weak scaling: a ~2.5x larger network runs in the same time. Strong scaling: a fixed network runs ~2.5x faster.

HOW TO KEEP TENSOR CORES FED?

Chart: required math bandwidth (MACs/clock/SM) and data bandwidth (A+B operands, B/clock/SM) for TF32, FP16, BF16, INT8, INT4, and binary. A100 dense Tensor Cores need roughly 2x the data bandwidth of V100, and A100 sparse Tensor Cores roughly 3x.

A100 STRONG SCALING INNOVATIONS

Improve speeds & feeds and efficiency across all levels of the compute and memory hierarchy: within the SM (Math, RF, SMEM/L1), the GPU memory system (L2, DRAM), and multi-GPU systems (NVLINK).

A100 TENSOR CORE
2x throughput vs. V100, >2x efficiency

Operand sharing across threads: FFMA has no sharing (32 MACs, 2 cycles per instruction); the V100 TC instruction shares operands across 8 threads (1024 MACs, 8 cycles); the A100 TC instruction shares operands across all 32 threads of the warp (2048 MACs, 8 cycles).

16x16x16 matrix multiply        FFMA   V100 TC   A100 TC   A100 vs. V100   A100 vs. FFMA
Thread sharing                  1      8         32        4x              32x
Hardware instructions           128    16        2         8x              64x
Register reads+writes (warp)    512    80        28        2.9x            18x
Cycles                          256    32        16        2x              16x

Tensor Cores assume FP16 inputs with an FP32 accumulator; the V100 Tensor Core instruction uses 4 hardware instructions.

A100 SM DATA MOVEMENT EFFICIENCY
3x SMEM/L1 bandwidth, 2x in-flight capacity

Diagram: V100 vs. A100 SM data paths feeding the Tensor Cores. On V100, Load-Global brings data through L1 into the register file and Store-Shared then writes it to shared memory, with RF capacity reserved for in-flight data and multiple RF reads and writes per operand. On A100, the asynchronous copy (Load-Global-Store-Shared) moves data from L2/DRAM directly into shared memory, bypassing the register file, with SMEM capacity reserved for in-flight data and fewer RF reads and writes; the Tensor Cores are fed from SMEM via Load-Shared. See the cooperative_groups sketch below.
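A hedged sketch of how this register-file bypass is exposed to kernels, using the cooperative_groups async-copy API from CUDA 11. The kernel name, the 256-element tile, and the assumption that the input is a whole number of 256-element tiles processed by 256-thread blocks are all illustrative.

    #include <cooperative_groups.h>
    #include <cooperative_groups/memcpy_async.h>

    namespace cg = cooperative_groups;

    // Stage a 256-element tile into shared memory with an asynchronous copy.
    // On A100 the copy bypasses the register file; on earlier GPUs the same
    // call falls back to a normal staged copy.
    __global__ void copy_then_scale(const float* in, float* out)
    {
        __shared__ float tile[256];
        auto block = cg::this_thread_block();

        cg::memcpy_async(block, tile, in + blockIdx.x * 256, sizeof(float) * 256);
        cg::wait(block);                     // all copies issued by this block have landed

        int i = blockIdx.x * 256 + block.thread_rank();
        out[i] = 2.0f * tile[block.thread_rank()];
    }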

A100 L2 BANDWIDTH

Output activations are parallelized across the GPU; a tile is the work for one SM. The L2 is split, with a hierarchical crossbar: a 2.3x increase in L2 bandwidth over V100, at lower latency.

Configurations: V100 = 80 SMs, V100 TC, 64 L2 slices at 32 B/clk/slice; V100++ (hypothetical) = 108 SMs, A100 TC, 64 L2 slices at 32 B/clk/slice; A100 = 108 SMs, A100 TC, 80 L2 slices at 64 B/clk/slice.

Required L2 bandwidth per SM (and share of the available L2 bandwidth):
Tile (work for 1 SM)   V100                V100++ (hypothetical)   A100
256x128                12 B/clk/SM (47%)   24 B/clk/SM (127%)      24 B/clk/SM (51%)
128x128                16 B/clk/SM (63%)   32 B/clk/SM (169%)      32 B/clk/SM (68%)
64x128                 24 B/clk/SM (94%)   48 B/clk/SM (253%)      48 B/clk/SM (101%)

Smaller tiles demand more bandwidth per SM. A hypothetical V100 scaled to 108 SMs with A100 Tensor Cores would exceed its available L2 bandwidth, while A100's split L2 keeps demand at or below roughly 100%.

A100 DRAM BANDWIDTH

Faster HBM2: 25% more pins, 38% faster clocks → 1.6 TB/s, 1.7x vs. V100
Larger and smarter L2: 40 MB (6.7x vs. V100), with L2-residency controls

Keep data resident in L2 to reduce DRAM bandwidth. Diagram: a chain of kernels handing buffers A through E from one to the next; keeping the producer/consumer buffers resident in L2 avoids round trips to DRAM. A sketch of the residency-control API follows below.

→ S21819: Optimizing Applications for NVIDIA Ampere GPU Architecture, 5/21 10:15am PDT
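A hedged sketch of the residency controls, using the CUDA 11 access-policy-window attributes: a stream attribute marks a hot buffer as persisting in L2 so that kernels launched afterwards in that stream find it resident. The helper name, buffer, and sizes are illustrative; error checking and the per-device cap on the persisting window size are ignored here.

    #include <cuda_runtime.h>

    // Sketch: keep a frequently reused buffer in the persisting portion of A100's L2.
    void make_resident(cudaStream_t stream, void* d_buf, size_t bytes)
    {
        // Reserve a portion of L2 for persisting accesses (device-wide setting).
        cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, bytes);

        cudaStreamAttrValue attr = {};
        attr.accessPolicyWindow.base_ptr  = d_buf;
        attr.accessPolicyWindow.num_bytes = bytes;
        attr.accessPolicyWindow.hitRatio  = 1.0f;                        // try to keep all of it
        attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
        attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;

        // Kernels subsequently launched into this stream treat [d_buf, d_buf + bytes)
        // as persisting in L2, reducing DRAM traffic for data reused across kernels.
        cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
    }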

A100 COMPUTE DATA COMPRESSION

Activation sparsity due to ReLU: up to 4x DRAM+L2 bandwidth and 2x L2 capacity for fine-grained unstructured sparsity.

Charts: per-layer activation sparsity and the resulting L2/DRAM bandwidth and capacity savings for ResNet-50, VGG16_BN, and ResNeXt-101.

→ S21819: Optimizing Applications for NVIDIA Ampere GPU Architecture, 5/21 10:15am PDT

A100 NVLINK BANDWIDTH

Third-generation NVLink: 50 Gbit/s per signal pair, 12 links, 25 GB/s per link in each direction, 600 GB/s total, 2x vs. V100.

Diagram: the twelve NVLink links attached to the A100 memory system.

→S21884: Under the Hood of the new DGX A100 System Architecture (recording available soon)

A100 ACCELERATES CUDA GRAPHS

Grid launches: CPU-to-GPU and GPU grid-to-grid.

With strong scaling, CPU and grid launch overheads become increasingly important (Amdahl's law). CUDA graphs provide one-shot CPU-to-GPU graph submission and graph reuse, and A100 adds microarchitecture improvements for grid-to-grid latencies.

Chart: launch latency for 32-node graphs of empty grids, DGX-1V vs. DGX A100. A sketch of the capture-and-relaunch pattern follows below.
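A hedged sketch of the one-shot submission pattern: capture a short chain of dependent grids into a graph once, instantiate it, and relaunch it repeatedly with a single call per iteration so per-grid CPU launch overhead is paid once. The kernel names and launch shapes are placeholders; error checking is omitted.

    #include <cuda_runtime.h>

    __global__ void step1(float* d) { /* ... */ }
    __global__ void step2(float* d) { /* ... */ }

    // Capture a small chain of grids into a graph, then relaunch it many times
    // with one cudaGraphLaunch per iteration.
    void run_with_graph(cudaStream_t stream, float* d_data, int iterations)
    {
        cudaGraph_t graph;
        cudaGraphExec_t graph_exec;

        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        step1<<<128, 256, 0, stream>>>(d_data);
        step2<<<128, 256, 0, stream>>>(d_data);
        cudaStreamEndCapture(stream, &graph);

        cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);

        for (int i = 0; i < iterations; ++i) {
            cudaGraphLaunch(graph_exec, stream);   // one-shot CPU-to-GPU submission
        }
        cudaStreamSynchronize(stream);

        cudaGraphExecDestroy(graph_exec);
        cudaGraphDestroy(graph);
    }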

→S21760: CUDA New Features And Beyond, 5/19 10:15am PDT

A100 STRONG SCALING INNOVATIONS
Delivering unprecedented levels of performance

A100 improvements over V100:
Math: 2.5x Tensor Core math BW (FP16); 5.0x with the Sparse Tensor Core (FP16)
RF: 2.9x effective RF BW with the A100 Tensor Core; 2.8x effective RF capacity with Async-Copy bypassing the RF
SMEM/L1: 3.0x effective SMEM BW with the A100 Tensor Core and Async-Copy; 2.3x SMEM capacity
L2: 2.3x L2 BW (up to 9.2x with Compute Data Compression); 6.7x L2 capacity plus residency control (up to 13.3x with Compute Data Compression)
DRAM: 1.7x DRAM BW (up to 6.8x with Compute Data Compression); 1.3x DRAM capacity
NVLINK: 2.0x NVLINK BW

1. New Tensor Core 2. Strong Scaling 3. Elastic GPU 4. Productivity

NVLINK: ONE BIG GPU

InfiniBand/Ethernet: travels a long distance, consistency is the responsibility of software

PCI Express: hardware consistency for I/O, not for programming language memory models

NVLINK: hardware consistency for programming language memory models, like system bus

NIC to GPU: the API implements consistency (NVSHMEM)

CPU to GPU: the API strengthens consistency (managed memory, host memory)

GPU to GPU: hardware consistency over NVLink
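One way this shows up at the CUDA level: with peer access enabled between two NVLink-connected GPUs, a kernel running on one GPU can dereference memory allocated on the other, with the hardware keeping the accesses coherent. A minimal sketch, not from the deck; device IDs, sizes, and the kernel are illustrative and error checking is omitted.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Kernel on GPU 0 reads and writes a buffer that lives in GPU 1's memory.
    __global__ void touch_remote(int* remote, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) remote[i] += 1;   // loads/stores traverse NVLink
    }

    int main()
    {
        int n = 1 << 20;
        int can01 = 0;
        cudaDeviceCanAccessPeer(&can01, 0, 1);
        if (!can01) { printf("peer access not available\n"); return 0; }

        cudaSetDevice(1);
        int* buf1 = nullptr;
        cudaMalloc(&buf1, n * sizeof(int));
        cudaMemset(buf1, 0, n * sizeof(int));

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);       // map GPU 1's memory into GPU 0's address space
        touch_remote<<<(n + 255) / 256, 256>>>(buf1, n);
        cudaDeviceSynchronize();
        return 0;
    }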

HGX A100: 3RD GEN NVLINK & SWITCH

HGX A100 4-GPU: fully-connected system with 100GB/s all-to-all BW


New NVSwitch: 6B transistors in TSMC 7FF, 36 ports, 25GB/s each, per direction

HGX A100 8-GPU: 6x NVSwitch in a fat tree topology, 2.4TB/s full-duplex bandwidth

Diagram: GPUs 0 through 7 connected through six NVSwitches, with hardware consistency across the fabric.

DGX A100: PCIE4 CONTROL & I/O

Diagram: DGX A100 system topology. Two 64-core AMD Rome CPUs (with hardware consistency between them) connect over PCIe Gen4 through PEX switches to 200G NICs and NVMe drives. The eight A100 GPUs attach to this PCIe fabric for control and I/O, and to each other through six NVSwitches with hardware consistency.

CLOUD SMALL INSTANCE USAGE

Small workloads can under-utilize GPU cloud instances, provisioned at whole GPU level

CSPs can’t use MPS for GPU space-sharing, because it doesn’t provide enough isolation

Diagram: eight GPUs, one user each. On Volta the security boundary is the whole GPU, with security and containment enforced at the NVSwitch fabric.

NEW: MULTI-INSTANCE GPU (MIG)

Up to 7 instances total, dynamically reconfigurable
Compute instances: compute/fault isolation, but share and compete for memory
GPU instances: separate and isolated paths through the entire memory system

Diagram: a single A100 (GPU 0) partitioned with MIG into multiple instances, each running its own stack (TensorFlow, PyTorch, TensorRT, Jarvis + TensorRT), with MIG security and containment per instance.

ELASTIC GPU COMPUTING

Each A100 is 1 to 7 GPUs

Each DGX A100 is 1 to 56 GPUs

Each GPU can serve a different user, with full memory isolation and QoS

→S21975: Inside NVIDIA's Multi-Instance GPU Feature (recording available)
→S21884: Under the Hood of the new DGX A100 System Architecture (recording available soon)
→S21702: Introducing NVIDIA DGX A100: The Universal AI System for Enterprise, 5/20 9:00am PDT

1. New Tensor Core 2. Strong Scaling 3. Elastic GPU 4. Productivity

COMPUTE CAPABILITY
Programming Model Development at NVIDIA

Concurrent algorithms

Managed memory

Bulk parallelism + atomics

GPU PROGRAMMING IN 2020 AND BEYOND
Math Libraries | Standard Languages | Directives | CUDA

GPU Accelerated C++ and Fortran (Standard Languages):

    std::transform(par, x, x+n, y, y,
                   [=](float x, float y){ return y + a*x; });

    do concurrent (i = 1:n)
      y(i) = y(i) + a*x(i)
    enddo

Incremental Performance Optimization with Directives (OpenACC):

    #pragma acc data copy(x,y)
    {
      ...
      std::transform(par, x, x+n, y, y,
                     [=](float x, float y){ return y + a*x; });
      ...
    }

Maximize GPU Performance with CUDA C++/Fortran:

    __global__ void saxpy(int n, float a, float *x, float *y)
    {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n) y[i] += a*x[i];
    }

    int main(void)
    {
      ...
      cudaMemcpy(d_x, x, ...);
      cudaMemcpy(d_y, y, ...);
      saxpy<<<(N+255)/256,256>>>(...);
      cudaMemcpy(y, d_y, ...);
    }

GPU Accelerated Math Libraries

→S21766: Inside the NVIDIA HPC SDK: the Compilers, Libraries and Tools for Accelerated Computing, 5/19 1:30PM PST

PROGRAMMING MODEL WANTED
Software pipelining to hide latency is hard.

    __device__ void exhibit_A1()           // data movement
    {
      memcpy(/* ... */);    //< blocks here
      /* more work */
      compute();            //< needed here
      /* more work */
    }

    __device__ void exhibit_B1()           // compute synchronization
    {
      compute_head();
      __syncthreads();      //< blocks here
      /* more work */
      compute_tail();       //< needed here
      /* more work */
    }

PROGRAMMING MODEL WANTED
Software pipelining to hide latency is hard.

The commented-out calls below mark the additional work we would like to overlap with the blocking memcpy and __syncthreads, but cannot:

    __device__ void exhibit_A2()
    {
      memcpy(/* ... */);    //< blocks here
      /* memcpy( ... ); */
      compute();            //< needed here
      /* compute(); */
    }

    __device__ void exhibit_B2()
    {
      compute_head();
      __syncthreads();      //< blocks here
      /* compute_head(); __syncthreads(); */
      compute_tail();       //< needed here
      /* compute_tail(); */
    }

NEW:

Asynchronous algorithms

Concurrent algorithms

Managed memory

Bulk parallelism

CO-DESIGNED: A100 & C++20 barrier
Key to asynchronous programming in compute_80

    #include <cuda/barrier>   // ISO C++20 conforming extension
    using barrier = cuda::barrier<cuda::thread_scope_block>;

    class barrier {           // synopsis
      // ...
      void arrive_and_wait();
      arrival_token arrive(ptrdiff_t = 1);    // nonblocking
      void wait(arrival_token &&) const;      // nonblocking
      // ...
    };

ASYNCHRONOUS COPY + BARRIER

Capability → PTX ISA → CUDA C++ API:
• Asynchronous barrier: mbarrier.{...} → cuda::barrier<…>
• Asynchronous copy: cp.async.ca + cp.async.mbarrier.arrive → cuda::memcpy_async(…)
• +Cache-bypass: cp.async.cg
• +Zero-fill ragged edge: cp.async.* … wr-size, rd-size
• +User-level tracking: cp.async.mbarrier.arrive.noinc
• +Single-threaded mode: cp.async.{commit_group, wait_group}

The extended copy modes are exposed through a CUDA 11 preview library in the experimental:: namespace.

ASYNCHRONOUS PROGRAMMING MODEL

    #include <cuda/barrier>   // ISO C++20 conforming extension
    using barrier = cuda::barrier<cuda::thread_scope_block>;

    __device__ void exhibit_A3()
    {
      __shared__ barrier b1, b2;
      // ^^initialization omitted
      cuda::memcpy_async(/* ... */, b1);
      cuda::memcpy_async(/* ... */, b2);
      b1.arrive_and_wait();
      compute();
      b2.arrive_and_wait();
      compute();
    }

    __device__ void exhibit_B3()
    {
      __shared__ barrier b1, b2;
      // ^^initialization omitted
      compute_head();
      auto t1 = b1.arrive();
      compute_head();
      auto t2 = b2.arrive();
      b1.wait(t1);
      compute_tail();
      b2.wait(t2);
      compute_tail();
    }
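The exhibits above omit barrier initialization. For completeness, here is a minimal self-contained sketch of the same pattern following CUDA 11 libcu++ usage: one thread initializes a block-scoped barrier, cuda::memcpy_async stages a tile into shared memory with its completion tracked by the barrier, and arrive_and_wait gates the compute. The kernel name, the assumed 256-thread block, the whole-tile input, and the trivial compute are illustrative.

    #include <cooperative_groups.h>
    #include <cuda/barrier>

    namespace cg = cooperative_groups;
    using barrier = cuda::barrier<cuda::thread_scope_block>;

    // Stage one 256-element tile into shared memory with cuda::memcpy_async,
    // tracked by a block-scoped barrier, then compute on it.
    __global__ void scale_tiles(const float* in, float* out)
    {
        __shared__ float tile[256];
        __shared__ barrier bar;

        auto block = cg::this_thread_block();      // assumed to be 256 threads
        if (block.thread_rank() == 0) {
            init(&bar, block.size());              // one thread initializes the barrier
        }
        block.sync();                              // make the initialized barrier visible

        // Copy completion counts toward the barrier phase; on A100 this can map
        // to the hardware asynchronous copy that bypasses the register file.
        cuda::memcpy_async(block, tile, in + blockIdx.x * 256,
                           sizeof(float) * 256, bar);

        bar.arrive_and_wait();                     // wait for the copy and all threads

        out[blockIdx.x * 256 + block.thread_rank()] = 2.0f * tile[block.thread_rank()];
    }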

MULTI-BUFFERING PIPELINES IN C++

    #include <cuda/barrier>   // ISO C++20 conforming extension
    using barrier = cuda::barrier<cuda::thread_scope_block>;

    __global__ void exhibit_C(/* ... */)
    {
      __shared__ barrier b[2];
      // ^^initialization omitted
      barrier::arrival_token t[2];

      // Prime the pipeline: start the first copy into buffer 0.
      cuda::memcpy_async(/* ... */, b[0]);
      t[0] = b[0].arrive();

      for (int step = 0, next = 1; step < steps; ++step, ++next) {
        if (next < steps) {
          b[next & 1].wait(t[next & 1]);             // previous phase of the other buffer done
          cuda::memcpy_async(/* ... */, b[next & 1]); // start the next step's copy
          t[next & 1] = b[next & 1].arrive();
        }
        b[step & 1].wait(t[step & 1]);               // this step's copy has arrived
        compute();
        t[step & 1] = b[step & 1].arrive();
      }
    }

→S21760: CUDA New Features And Beyond, 5/19 10:15am PDT

OUR PRODUCTIVITY GAINS FROM A100

Charts: Optimized tensor kernels: tens at V100 launch, hundreds six months after V100 launch, thousands at A100 launch. CUTLASS relative performance to cuBLAS (tensor FP16): 88% on V100, 95% on A100.

→S21745: Developing CUDA kernels to push Tensor Cores to the Absolute Limit on NVIDIA A100, 5/21 11:30AM PST

CLOSING

UNPRECEDENTED ACCELERATION AT EVERY SCALE

Whitepaper: NVIDIA A100 Tensor Core GPU Architecture
www.nvidia.com/nvidia-ampere-architecture-whitepaper