INSIDE VOLTA

Olivier Giroux and Luke Durant, NVIDIA, May 10, 2017

VOLTA: A GIANT LEAP FOR DEEP LEARNING

[Chart: ResNet-50 training, images per second, P100 (FP32) vs V100 (Tensor Cores): 2.4x faster. ResNet-50 inference, images per second, TensorRT at 7 ms latency, P100 (FP16) vs V100 (Tensor Cores): 3.7x faster. V100 measured on pre-production hardware.]

ROAD TO EXASCALE
Volta to Fuel Most Powerful US Supercomputer

[Chart: Volta HPC application performance, relative to Tesla P100. System config: 2x Xeon E5-2690 v4, 2.6 GHz, with 1x Tesla P100 or V100; V100 measured on pre-production hardware.]
Summit: 200+ PetaFlops, ~3,400 nodes, 10 Megawatts

INTRODUCING TESLA V100

Volta Architecture: Most Productive GPU
Improved NVLink & HBM2: Efficient Bandwidth
Volta MPS: Inference Utilization
Improved SIMT Model: New Algorithms
Tensor Core: 120 Programmable TFLOPS Deep Learning

The Fastest and Most Productive GPU for Deep Learning and HPC

TESLA V100

21B transistors, 815 mm²
80 SMs*, 5120 CUDA Cores, 640 Tensor Cores
16 GB HBM2, 900 GB/s HBM2, 300 GB/s NVLink

*The full GV100 chip contains 84 SMs

GPU PERFORMANCE COMPARISON

                         P100            V100             Ratio
Training acceleration    10 TFLOPS       120 TFLOPS       12x
Inference acceleration   21 TFLOPS       120 TFLOPS       6x
FP64/FP32                5/10 TFLOPS     7.5/15 TFLOPS    1.5x
HBM2 Bandwidth           720 GB/s        900 GB/s         1.2x
NVLink Bandwidth         160 GB/s        300 GB/s         1.9x
L2 Cache                 4 MB            6 MB             1.5x
L1 Caches                1.3 MB          10 MB            7.7x

NEW HBM2 MEMORY ARCHITECTURE
1.5x Delivered Bandwidth

[Chart: STREAM Triad delivered bandwidth, GB/s: P100 (76% DRAM utilization) vs V100 (95% DRAM utilization). V100 measured on pre-production hardware.]

VOLTA NVLINK

300 GB/sec: 50% more links, 28% faster signaling

PROGRAMMABILITY

PROGRAMMABILITY DRIVES DEEP LEARNING

Deep Learning methods developed using CUDA, 2012-2018:
AlexNet, 1-bit SGD, FFT Convolutions, cuDNN, Batch Normalization, NCCL, Winograd, FlowNet, Persistent RNNs, Sparsely Gated Mixture of Experts, Phased LSTM, Billion-Scale Similarity Search (FAISS), ?

New solvers, new layers, new scaling techniques, new applications for old techniques, and much more…

STATE OF UNIFIED MEMORY
High performance, low effort

[Chart: PGI OpenACC on Pascal P100, Unified Memory (automatic data movement for allocatables) performance vs explicit data movement. Geometric mean across all 15 SPEC ACCEL™ benchmarks: 86% with PCI-E, 91% with NVLink.]

*S7285 – Unified Memory on the Latest GPU Architectures
PGI 17.1 Compilers, OpenACC. SPEC ACCEL™ 1.1 performance measured March 2017. SPEC® and the benchmark name SPEC ACCEL™ are registered trademarks of the Standard Performance Evaluation Corporation.

PASCAL UNIFIED MEMORY

[Diagram: GPU and CPU share a single Unified Memory address space; the Page Migration Engine migrates pages between GPU memory (GPU-optimized state) and CPU memory (CPU-optimized state).]

VOLTA + PCIE CPU UNIFIED MEMORY

[Diagram: the same Unified Memory model with the Page Migration Engine, now augmented with access counters to guide page placement.]

VOLTA + NVLINK CPU UNIFIED MEMORY

[Diagram: the same Unified Memory model with the Page Migration Engine, access counters, and new NVLink features between CPU and GPU (coherence, atomics, ATS), migrating pages between the GPU-optimized and CPU-optimized states.]

GPU MULTI-PROCESS SCHEDULING
Background

[Diagram: CPU processes A, B, C sharing one GPU.]

Timeslice Scheduling: single-process throughput optimized
Multi-Process Service: multi-process throughput optimized

MULTI-PROCESS TIMESLICING

[Diagram: CPU processes A, B, and C each receive the whole Pascal GP100 in turn: A during timeslice 1, B during timeslice 2, C during timeslice 3.]

Full process isolation, peak throughput optimized for each process

PASCAL MULTI-PROCESS SERVICE

CUDA Multi-Process Service: improve GPU utilization by sharing GPU resources amongst small jobs.

[Diagram: CPU processes A, B, C submit work through a software MPS server onto a single Pascal GP100, with limited isolation between clients.]

Opt-in: limited isolation, peak throughput optimized across processes

VOLTA MULTI-PROCESS SERVICE

Volta MPS Enhancements:
• Hardware-accelerated work submission
• Reduced launch latency
• Improved launch throughput
• Improved quality of service with scheduler partitioning
• More reliable performance
• Hardware isolation
• 3x more clients than Pascal

[Diagram: CPU processes A, B, C submit work directly to the Volta GV100 via hardware-accelerated MPS control, with hardware isolation between clients.]

VOLTA MPS FOR INFERENCE
Efficient inference deployment without batching system

[Chart: ResNet-50 images/sec at 7 ms latency. A single Volta client with no batching and no MPS, vs multiple Volta clients with no batching using MPS (7x faster, reaching 60% of the performance of a Volta batching system), vs Volta with a batching system. V100 measured on pre-production hardware.]

NEW SM MICROARCHITECTURE

VOLTA GV100 SM

GV100 SM:
FP32 units: 64
FP64 units: 32
INT32 units: 64
Tensor Cores: 8
Register File: 256 KB
Unified L1/Shared memory: 128 KB
Active Threads: 2048

VOLTA GV100 SM
Redesigned for Productivity

Completely new ISA
Twice the schedulers
Simplified issue logic
Large, fast L1 cache
Improved SIMT model
Tensor acceleration
= The easiest SM to program yet

RECAP: PASCAL L1 AND SHARED MEMORY

[Diagram: Pascal SM. Load/Store units feed a 64 KB low-latency Shared Memory and a separate 24 KB streaming L1$ (unlimited cache misses in flight), backed by a 4 MB L2$.]

UNIFYING KEY TECHNOLOGIES

[Diagram: Pascal SM: Load/Store units, 64 KB low-latency Shared Memory plus a 24 KB streaming L1$, 4 MB L2$. Volta SM: Load/Store units, a combined 128 KB L1$ and Shared Memory, 6 MB L2$.]

VOLTA L1 AND SHARED MEMORY

Volta Streaming L1$:
Unlimited cache misses in flight
Low cache hit latency
4x more bandwidth
5x more capacity

Volta Shared Memory:
Unified storage with L1
Configurable up to 96 KB (see the sketch below)

[Diagram: Volta SM: Load/Store units, combined 128 KB L1$ and Shared Memory, 6 MB L2$.]
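As an illustration of the configurable carve-out (my sketch, not from the slides; kernel and buffer names are hypothetical), a kernel opts in to more than 48 KB of dynamic shared memory with the CUDA 9 attribute API:

#include <cuda_runtime.h>

__global__ void big_smem_kernel(const float *in, float *out, int n) {
    extern __shared__ float tile[];              // dynamic shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];
        __syncthreads();
        out[i] = 2.0f * tile[threadIdx.x];
    }
}

void launch(const float *in, float *out, int n) {
    // Volta allows up to 96 KB of shared memory per block; sizes above
    // 48 KB are opt-in via cudaFuncSetAttribute (CUDA 9 runtime API).
    int smem_bytes = 96 * 1024;
    cudaFuncSetAttribute(big_smem_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         smem_bytes);
    big_smem_kernel<<<(n + 255) / 256, 256, smem_bytes>>>(in, out, n);
}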

NARROWING THE SHARED MEMORY GAP
with the GV100 L1 cache

Cache vs shared:
• Easier to use
• 90%+ as good

Shared vs cache:
• Faster atomics (see the sketch below)
• More banks
• More predictable

[Chart: directed testing, shared in global. Average shared memory benefit: 70% on Pascal, 93% on Volta.]
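To make the "faster atomics" point concrete, a common pattern (my sketch, not from the slides) is a per-block histogram that accumulates with shared-memory atomics and flushes to global memory once per block:

__global__ void histogram256(const unsigned char *data, int n, unsigned int *hist) {
    __shared__ unsigned int local[256];
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);          // fast on-chip shared-memory atomic
    __syncthreads();

    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&hist[b], local[b]);           // one global atomic per bin per block
}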

INDEPENDENT THREAD SCHEDULING

VOLTA: INDEPENDENT THREAD SCHEDULING
Communicating Algorithms

Pascal: lock-free algorithms (threads cannot wait for messages)
Volta: starvation-free algorithms (threads may wait for messages)

STARVATION FREE ALGORITHM
Example

Doubly-Linked List with Fine-Grained Lock

[Diagram: node B being inserted between nodes A and C.]

__device__ void insert_after(Node *a, Node *b)
{
    Node *c;
    lock(a);
    lock(a->next);
    c = a->next;

    a->next = b;
    b->prev = a;
    b->next = c;
    c->prev = b;

    unlock(c);
    unlock(a);
}

*Not shown: lock() implementation (a possible sketch follows)

Tip! Volta can run 163,840 threads simultaneously. Minimize contention!
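The slides leave lock() out. As one possible sketch (my illustration, not the presenters' code, assuming each Node carries an int lock word), a simple spinlock is safe on Volta because independent thread scheduling lets a waiting thread spin without starving the lock holder in its own warp:

struct Node {                   // hypothetical layout; lockvar added for illustration
    Node *prev;
    Node *next;
    int   lockvar;              // 0 = free, 1 = held
};

__device__ void lock(Node *n) {
    while (atomicCAS(&n->lockvar, 0, 1) != 0) {
        // busy-wait; a backoff here would reduce contention
    }
    __threadfence();            // acquire: see the previous holder's writes
}

__device__ void unlock(Node *n) {
    __threadfence();            // release: publish our writes first
    atomicExch(&n->lockvar, 0);
}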

WARP IMPLEMENTATION

Pre-Volta: one Program Counter (PC) and Stack (S) for the whole 32-thread warp.

PASCAL WARP EXECUTION MODEL

if (threadIdx.x < 4) {
    A;
    B;
} else {
    X;
    Y;
}

[Diagram: the warp diverges at the branch, executes the two paths one after the other, and reconverges after the if/else.]

PASCAL WARP EXECUTION MODEL
No Synchronization Permitted

if (threadIdx.x < 4) {
    A;
    __syncwarp();
    B;
} else {
    X;
    __syncwarp();
    Y;
}

[Diagram: the warp diverges and reconverges as before; on Pascal, threads in the divergent branches cannot synchronize with each other.]

WARP IMPLEMENTATION

Pre-Volta: one Program Counter (PC) and Stack (S) for the whole 32-thread warp.

Volta: a PC and S per thread, plus a convergence optimizer: a 32-thread warp with independent scheduling.

VOLTA WARP EXECUTION MODEL
Synchronization may lead to interleaved scheduling!

if (threadIdx.x < 4) {
    A;
    __syncwarp();
    B;
} else {
    X;
    __syncwarp();
    Y;
}
__syncwarp();

[Diagram: the warp diverges at the branch; the __syncwarp() calls synchronize the divergent paths, so their statements may interleave in time until the final __syncwarp().]

Software synchronization also supported, e.g. locks for the doubly-linked list!

VOLTA'S EXTENDED SIMT MODEL

The SIMT model: enable thread-parallel programs to execute with vector efficiency

                     CPU     Pascal GPU         Volta GPU
Thread-parallelism   MIMD    SIMT (lock-free)   SIMT
Data-parallelism     SIMD    SIMT               SIMT

COOPERATION  SYNCHRONIZATION

See also: CUDA 9.0, S7132, Wednesday 2:30pm; Cooperative Groups, S7622, Wednesday 4pm

SHUFFLE SYNCHRONIZATION
__shfl_sync - deprecates __shfl

Warp-synchronizing operation
Lane data exchange
Result distributed across warp
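A small warp-sum sketch using the new _sync shuffle (my example, not from the slides); the full 0xffffffff mask assumes all 32 lanes participate:

__device__ float warp_sum(float val) {
    const unsigned FULL_MASK = 0xffffffffu;                // all 32 lanes participate
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(FULL_MASK, val, offset);   // warp-synchronizing exchange
    return val;                                            // lane 0 holds the warp-wide sum
}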

COMPARE SYNCHRONIZATION
__ballot_sync, __any_sync, __all_sync - deprecate their namesakes
__match_any_sync, __match_all_sync, __activemask - new

Warp-synchronizing operation
Lane data compare
Result distributed across warp
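Two small usage sketches (mine, not from the slides):

// Which lanes of the warp hold the same key as this lane, and how many?
__device__ int count_peers(int key) {
    const unsigned FULL_MASK = 0xffffffffu;
    unsigned peers = __match_any_sync(FULL_MASK, key);   // warp-synchronizing compare
    return __popc(peers);                                // includes this lane
}

// True in every lane iff the predicate holds in all 32 lanes.
__device__ bool warp_all_positive(float x) {
    const unsigned FULL_MASK = 0xffffffffu;
    return __all_sync(FULL_MASK, x > 0.0f) != 0;
}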

VOLTA TENSOR CORE

TENSOR CORE
Mixed Precision Matrix Math, 4x4 matrices

D = AB + C

A, B: FP16 4x4 matrices
C, D: FP16 or FP32 4x4 matrices

TENSOR SYNCHRONIZATION
Full Warp 16x16 Matrix Math

Warp-synchronizing operation
Composed matrix multiply and accumulate for 16x16 matrices
Result distributed across warp

VOLTA TENSOR OPERATION

FP16 storage/input
FP16 x FP16: full precision product
Sum with FP32 accumulator (more products accumulate in)
Convert to FP32 result

Also supports FP16 accumulator mode for inferencing
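A scalar sketch of that accumulation order (my illustration; the Tensor Core performs this across a 4x4 tile in hardware):

#include <cuda_fp16.h>

// One element of D = AB + C in the mixed-precision style described above:
// FP16 inputs, full-precision products, sums carried in an FP32 accumulator.
__device__ float dot4_mixed(const __half a[4], const __half b[4], float c) {
    float acc = c;                                        // FP32 accumulator
    for (int k = 0; k < 4; ++k)
        acc += __half2float(a[k]) * __half2float(b[k]);   // full-precision product
    return acc;                                           // FP32 result
}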

USING TENSOR CORES
CUDA C++ Volta-Optimized Warp-Level Matrix Operations

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Fragment shapes, types, and layouts are filled in per the CUDA 9 WMMA API;
// the slide elides the template arguments, so the layouts here are assumptions.
__device__ void tensor_op_16_16_16(
    float *d, half *a, half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);

    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
}

Volta-optimized frameworks and libraries: NVIDIA cuDNN, cuBLAS, TensorRT
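One way to drive it (my sketch, not from the slides): wrap the device function in a kernel and launch at least one full warp, since each wmma operation is collective across the warp. Buffers a, b, c, d are hypothetical 16x16 device allocations with leading dimension 16.

__global__ void wmma_kernel(float *d, half *a, half *b, float *c) {
    tensor_op_16_16_16(d, a, b, c);   // all 32 threads of the warp participate
}

// Host side: a single warp is enough for one 16x16x16 tile.
// wmma_kernel<<<1, 32>>>(d, a, b, c);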

A GIANT LEAP FOR DEEP LEARNING

[Chart: cuBLAS mixed-precision (FP16 input, FP32 compute) matrix multiply, M=N=K=2048. Relative performance: V100 with Tensor Cores (CUDA 9) is 9.3x faster than P100 (CUDA 8). V100 measured on pre-production hardware.]

INTRODUCING TESLA V100

Volta Architecture: Most Productive GPU
Improved NVLink & HBM2: Efficient Bandwidth
Volta MPS: Inference Utilization
Improved SIMT Model: New Algorithms
Tensor Core: 120 Programmable TFLOPS Deep Learning

More V100 Features: 2x L2 atomics, int8, new memory model, copy engine page migration, and more …

The Fastest and Most Productive GPU for Deep Learning and HPC
