INSIDE VOLTA
Olivier Giroux and Luke Durant, NVIDIA, May 10, 2017
VOLTA: A GIANT LEAP FOR DEEP LEARNING

[Chart: ResNet-50 training, images per second: V100 (Tensor Cores) is 2.4x faster than P100 (FP32). ResNet-50 inference with TensorRT at 7 ms latency, images per second: V100 (FP16 Tensor Cores) is 3.7x faster than P100. V100 measured on pre-production hardware.]

ROAD TO EXASCALE
Volta to Fuel Most Powerful US Supercomputers
[Chart: Volta HPC application performance relative to Tesla P100. System config: 2X Xeon E5-2690 v4 at 2.6 GHz with 1X Tesla P100 or V100; V100 measured on pre-production hardware.]

Summit Supercomputer: 200+ PetaFlops, ~3,400 nodes, 10 megawatts

INTRODUCING TESLA V100
Volta Architecture: Most Productive GPU
Improved NVLink & HBM2: Efficient Bandwidth
Volta MPS: Inference Utilization
Improved SIMT Model: New Algorithms
Tensor Core: 120 Programmable TFLOPS Deep Learning
The Fastest and Most Productive GPU for Deep Learning and HPC
TESLA V100

21B transistors, 815 mm²
80 SMs*, 5120 CUDA cores, 640 Tensor Cores
16 GB HBM2, 900 GB/s HBM2 bandwidth, 300 GB/s NVLink

*The full GV100 chip contains 84 SMs.

GPU PERFORMANCE COMPARISON
                          P100           V100            Ratio
Training acceleration     10 TOPS        120 TOPS        12x
Inference acceleration    21 TFLOPS      120 TOPS        6x
FP64/FP32                 5/10 TFLOPS    7.5/15 TFLOPS   1.5x
HBM2 bandwidth            720 GB/s       900 GB/s        1.2x
NVLink bandwidth          160 GB/s       300 GB/s        1.9x
L2 cache                  4 MB           6 MB            1.5x
L1 caches                 1.3 MB         10 MB           7.7x
NEW HBM2 MEMORY ARCHITECTURE

[Chart: 1.5x delivered bandwidth, measured as STREAM Triad GB/s, on V100 vs. P100. DRAM utilization improves from 76% (P100) to 95% (V100). V100 measured on pre-production hardware.]

VOLTA NVLINK
300 GB/sec: 50% more links, 28% faster signaling
PROGRAMMABILITY

PROGRAMMABILITY DRIVES DEEP LEARNING
Deep Learning methods developed using CUDA
[Timeline, 2012-2018: AlexNet, 1-bit SGD, FFT convolutions, cuDNN, Batch Normalization, WinoGrad, FlowNet, Persistent RNNs, NCCL, Phased LSTM, Sparsely Gated Mixture of Experts, Billion-Scale Similarity Search (FAISS), and more to come.]
New solvers, new layers, new scaling techniques, new applications for old techniques, and much more…
STATE OF UNIFIED MEMORY
High performance, low effort

[Chart: performance of PGI OpenACC Unified Memory (automatic data movement for allocatables) on Pascal P100 vs. explicit data movement, geometric mean across all 15 SPEC ACCEL™ benchmarks: 86% over PCI-E, 91% over NVLink.]

*S7285 – Unified Memory on the Latest GPU Architectures

PGI 17.1 compilers, OpenACC, SPEC ACCEL™ 1.1 performance measured March 2017. SPEC® and the benchmark name SPEC ACCEL™ are registered trademarks of the Standard Performance Evaluation Corporation.

PASCAL UNIFIED MEMORY
[Diagram: Unified Memory spans GPU memory and CPU memory; the Page Migration Engine moves pages between a GPU-optimized state and a CPU-optimized state.]
VOLTA + PCIE CPU UNIFIED MEMORY

[Diagram: as on Pascal, Unified Memory spans GPU and CPU memory with the Page Migration Engine moving pages between GPU-optimized and CPU-optimized states, now guided by access counters.]
VOLTA + NVLINK CPU UNIFIED MEMORY

[Diagram: Unified Memory with the Page Migration Engine, access counters, and new NVLink features (coherence, atomics, ATS) managing movement between GPU-optimized and CPU-optimized states.]
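From CUDA C++, Unified Memory is reached through cudaMallocManaged. A minimal sketch (the kernel, sizes, and values are illustrative, not from the slides) of a single pointer that both processors touch while pages migrate on demand:

// Minimal Unified Memory sketch: one allocation, touched by both CPU and GPU.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));     // one pointer, valid on CPU and GPU

    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // CPU touches the pages first

    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);  // pages migrate to the GPU on demand
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);               // pages migrate back on CPU access
    cudaFree(data);
    return 0;
}

On Pascal and Volta the migration is driven by the Page Migration Engine; Volta adds access counters so the driver can base migration decisions on observed access patterns.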
GPU MULTI-PROCESS SCHEDULING
Background

[Diagram: CPU processes A, B, and C share one GPU.]

Timeslice scheduling: single-process throughput optimized.
Multi-Process Service: multi-process throughput optimized.
MULTI-PROCESS TIMESLICING

[Diagram: CPU processes A, B, and C each receive the whole Pascal GP100 in turn: process A executes in timeslice 1, process B in timeslice 2, and process C in timeslice 3.]

Full process isolation; peak throughput optimized for each process.

PASCAL MULTI-PROCESS SERVICE
[Diagram: CPU processes A, B, and C submit work through software and share a Pascal GP100 concurrently, with limited isolation between them.]

CUDA Multi-Process Service: improve GPU utilization by sharing GPU resources amongst small jobs.

Opt-in: limited isolation; peak throughput optimized across processes.

VOLTA MULTI-PROCESS SERVICE
[Diagram: CPU processes A, B, and C submit work to a Volta GV100 through hardware-accelerated work submission, coordinated by the CUDA Multi-Process Service control daemon, with hardware isolation between clients.]

Volta MPS enhancements:
• Reduced launch latency
• Improved launch throughput
• Improved quality of service with scheduler partitioning
• More reliable performance
• 3x more clients than Pascal
VOLTA MPS FOR INFERENCE
Efficient inference deployment without a batching system

[Chart: ResNet-50 images/sec at 7 ms latency. Multiple Volta clients using MPS with no batching deliver 7x the throughput of a single Volta client with no batching and no MPS, and reach 60% of the performance of Volta with a batching system. V100 measured on pre-production hardware.]

NEW SM MICROARCHITECTURE
VOLTA GV100 SM

Per SM (GV100):
FP32 units                  64
FP64 units                  32
INT32 units                 64
Tensor Cores                8
Register file               256 KB
Unified L1/shared memory    128 KB
Active threads              2048
VOLTA GV100 SM
Redesigned for Productivity

Completely new ISA
Twice the schedulers
Simplified issue logic
Large, fast L1 cache
Improved SIMT model
Tensor acceleration

= The easiest SM to program yet

RECAP: PASCAL L1 AND SHARED MEMORY

[Diagram: the Pascal SM's load/store units access 64 KB of low-latency shared memory and a separate 24 KB streaming L1$ (unlimited cache misses in flight), backed by a 4 MB L2$.]
UNIFYING KEY TECHNOLOGIES

[Diagram: the Pascal SM splits 64 KB of low-latency shared memory from a 24 KB streaming L1$, backed by a 4 MB L2$; the Volta SM merges them into a single 128 KB L1$ and shared memory, backed by a 6 MB L2$.]
VOLTA L1 AND SHARED MEMORY

[Diagram: the SM's load/store units access a combined 128 KB L1$ and shared memory, backed by a 6 MB L2$.]

Volta streaming L1$: unlimited cache misses in flight, low cache hit latency, 4x more bandwidth, 5x more capacity.

Volta shared memory: unified storage with L1, configurable up to 96 KB.
NARROWING THE SHARED MEMORY GAP
with the GV100 L1 cache

[Chart: directed testing of shared-memory code rewritten to use global memory through the cache. The cache reaches roughly 70% of shared-memory performance on Pascal and 93% on Volta.]

Cache vs. shared: easier to use, 90%+ as good.
Shared vs. cache: faster atomics, more banks, more predictable.
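Because Volta's shared memory is carved out of the unified 128 KB storage, a kernel can ask for a larger carveout. A minimal sketch, assuming CUDA 9's cudaFuncSetAttribute and the cudaFuncAttributePreferredSharedMemoryCarveout attribute (the kernel and the 75% figure are illustrative):

#include <cuda_runtime.h>

__global__ void uses_shared(float *out) {
    extern __shared__ float tile[];        // dynamic shared memory
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];
}

int main() {
    // Hint that this kernel prefers a 75% shared-memory carveout of the unified
    // L1/shared storage; the driver may still pick a different split.
    cudaFuncSetAttribute(uses_shared,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 75);

    float *out;
    cudaMalloc(&out, 256 * sizeof(float));
    uses_shared<<<1, 256, 256 * sizeof(float)>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}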
INDEPENDENT THREAD SCHEDULING

VOLTA: INDEPENDENT THREAD SCHEDULING
Communicating Algorithms
Pascal, lock-free algorithms: threads cannot wait for messages.
Volta, starvation-free algorithms: threads may wait for messages.

STARVATION FREE ALGORITHM
Example: doubly-linked list with fine-grained locks

[Diagram: node B is inserted between nodes A and C.]

__device__ void insert_after(Node *a, Node *b)
{
    Node *c;
    lock(a); lock(a->next);
    c = a->next;

    a->next = b;
    b->prev = a;
    b->next = c;
    c->prev = b;

    unlock(c); unlock(a);
}
*Not shown: lock() implementation
Tip: Volta can run 163,840 threads simultaneously. Minimize contention!
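Since the talk leaves lock() and unlock() unspecified, here is one possible sketch built on global-memory atomics; the Node mutex field is an assumption. With Volta's independent thread scheduling, a thread spinning here does not starve the lock holder even when both are in the same warp:

// Hypothetical lock()/unlock() sketch (not shown in the talk); assumes each Node
// carries an int mutex field initialized to 0 (0 = unlocked, 1 = locked).
struct Node {
    Node *prev;
    Node *next;
    int   mutex;
};

__device__ void lock(Node *n) {
    while (atomicCAS(&n->mutex, 0, 1) != 0) { }  // spin until the 0 -> 1 transition succeeds
    __threadfence();                             // order protected reads after the acquire
}

__device__ void unlock(Node *n) {
    __threadfence();                             // order protected writes before the release
    atomicExch(&n->mutex, 0);
}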
WARP IMPLEMENTATION

[Diagram: pre-Volta, a single program counter (PC) and stack (S) is shared by all 32 threads of a warp.]
PASCAL WARP EXECUTION MODEL

if (threadIdx.x < 4) {
    A;
    B;
} else {
    X;
    Y;
}

[Diagram: over time, the warp diverges at the branch, executes X; Y; for one group of threads and then A; B; for the other, and reconverges afterwards.]
PASCAL WARP EXECUTION MODEL
No synchronization permitted

if (threadIdx.x < 4) {
    A;
    __syncwarp();
    B;
} else {
    X;
    __syncwarp();
    Y;
}

[Diagram: with one PC per warp, the diverged groups cannot wait for each other; this synchronization pattern is not permitted on Pascal.]
WARP IMPLEMENTATION

[Diagram: pre-Volta, one program counter (PC) and stack (S) per 32-thread warp. On Volta, every one of the 32 threads has its own PC and stack, with a convergence optimizer regrouping them: a 32-thread warp with independent scheduling.]
VOLTA WARP EXECUTION MODEL
Synchronization may lead to interleaved scheduling!

if (threadIdx.x < 4) {
    A;
    __syncwarp();
    B;
} else {
    X;
    __syncwarp();
    Y;
}
__syncwarp();

[Diagram: the warp diverges at the branch, the two groups synchronize with each other at __syncwarp(), and their subsequent execution may interleave until the final __syncwarp().]

Software synchronization also supported, e.g. locks for the doubly-linked list!
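A small illustrative kernel (ours, not from the slides) showing the pattern above used for a producer/consumer exchange inside one warp; the diverged groups meet at __syncwarp(), which also orders their shared-memory accesses:

// Assumes a 32-thread block; lanes 0-3 produce, the remaining lanes consume.
__global__ void warp_exchange(int *out) {
    __shared__ int buf[32];
    unsigned lane = threadIdx.x % 32;

    if (lane < 4) {
        buf[lane] = (int)lane * 10;   // producers write
        __syncwarp();                 // meet the consumers
    } else {
        __syncwarp();                 // wait for the producers
        out[lane] = buf[lane % 4];    // then read the produced values
    }
}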
VOLTA’S EXTENDED SIMT MODEL
The SIMT model: enable thread-parallel programs to execute with vector efficiency

                      CPU     Pascal GPU         Volta GPU
Thread-parallelism    MIMD    SIMT (lock-free)   SIMT
Data-parallelism      SIMD    SIMT               SIMT
COOPERATION SYNCHRONIZATION

See also: CUDA 9.0, S7132, Wednesday 2:30pm; Cooperative Groups, S7622, Wednesday 4pm

SHUFFLE SYNCHRONIZATION
__shfl_sync deprecates __shfl
[Diagram: a warp-synchronizing operation; lanes exchange data and the result is distributed across the warp.]
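A typical use of the new shuffle intrinsic is a warp-wide reduction. This sketch is illustrative (the helper name is ours) and assumes all 32 lanes are active, hence the full mask:

__device__ int warp_reduce_sum(int val) {
    const unsigned mask = 0xffffffffu;               // all lanes participate
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(mask, val, offset);  // add the value from lane + offset
    return val;                                      // lane 0 ends up with the warp sum
}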
COMPARE SYNCHRONIZATION
__ballot_sync, __any_sync, __all_sync deprecate their namesakes; __match_any_sync, __match_all_sync, __activemask are new

[Diagram: a warp-synchronizing operation; lanes compare data and the result is distributed across the warp.]
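A short illustrative sketch (ours, not from the slides) of the new compare intrinsics; __activemask() supplies the set of currently active lanes:

__global__ void compare_demo(const int *in, unsigned *out) {
    unsigned mask = __activemask();                  // lanes active at this point
    int v = in[threadIdx.x];

    unsigned positive = __ballot_sync(mask, v > 0);  // one bit per lane where v > 0
    unsigned peers    = __match_any_sync(mask, v);   // lanes holding the same value of v

    out[threadIdx.x] = positive & peers;             // store something derived from both
}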
VOLTA TENSOR CORE

TENSOR CORE
Mixed Precision Matrix Math on 4x4 matrices

D = A B + C, where A and B are FP16 4x4 matrices and C and D are FP16 or FP32 4x4 matrices.

TENSOR SYNCHRONIZATION
Full Warp 16x16 Matrix Math
[Diagram: a warp-synchronizing operation; a composed matrix multiply-and-accumulate over 16x16 matrices whose result is distributed across the warp.]
VOLTA TENSOR OPERATION

[Diagram: FP16 storage/input values are multiplied as a full-precision product, summed with more products into an FP32 accumulator, and converted to the FP32 result.]

Also supports FP16 accumulator mode for inferencing.
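Per element, the operation behaves like a mixed-precision fused multiply-add. A conceptual scalar sketch (illustrative only; the hardware performs this across whole tiles at once):

// FP16 operands multiplied at full precision, accumulated in FP32 (requires <cuda_fp16.h>).
__device__ float mixed_fma(__half a, __half b, float acc) {
    return fmaf(__half2float(a), __half2float(b), acc);
}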
USING TENSOR CORES

Volta-optimized frameworks and libraries: NVIDIA cuDNN, cuBLAS, TensorRT

CUDA C++ warp-level matrix operations:

// Fragment template arguments: 16x16x16 tiles with row-major layouts (one valid choice);
// requires #include <mma.h> and using namespace nvcuda.
__device__ void tensor_op_16_16_16(
    float *d, half *a, half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);

    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
}
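The operation is warp-wide: the 16x16 fragments live spread across the warp's registers, so every thread of the warp must call it together. A hypothetical usage sketch (kernel and launch configuration are ours):

__global__ void wmma_kernel(float *d, half *a, half *b, float *c) {
    tensor_op_16_16_16(d, a, b, c);   // one warp computes one 16x16x16 tile
}

// Launch one warp for a single tile:
// wmma_kernel<<<1, 32>>>(d, a, b, c);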
A GIANT LEAP FOR DEEP LEARNING

[Chart: cuBLAS mixed-precision matrix multiply (FP16 input, FP32 compute, M=N=K=2048), relative performance. V100 with Tensor Cores (CUDA 9) is 9.3x faster than P100 (CUDA 8). V100 measured on pre-production hardware.]
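In application code, this kind of mixed-precision GEMM is usually requested through the cuBLAS API rather than hand-written WMMA. A rough sketch, assuming CUDA 9's cublasSetMathMode/CUBLAS_TENSOR_OP_MATH and cublasGemmEx (the device pointers and their contents are placeholders):

#include <cublas_v2.h>
#include <cuda_fp16.h>

// C (FP32) = A (FP16) x B (FP16), n x n, column-major, letting cuBLAS use Tensor Cores.
void gemm_fp16_in_fp32_out(cublasHandle_t handle, int n,
                           const __half *d_A, const __half *d_B, float *d_C) {
    const float alpha = 1.0f, beta = 0.0f;

    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // allow Tensor Core math

    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 n, n, n,
                 &alpha,
                 d_A, CUDA_R_16F, n,
                 d_B, CUDA_R_16F, n,
                 &beta,
                 d_C, CUDA_R_32F, n,
                 CUDA_R_32F,            // FP32 compute
                 CUBLAS_GEMM_DFALT);    // default algorithm selection
}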
INTRODUCING TESLA V100

Volta Architecture: Most Productive GPU
Improved NVLink & HBM2: Efficient Bandwidth
Volta MPS: Inference Utilization
Improved SIMT Model: New Algorithms
Tensor Core: 120 Programmable TFLOPS Deep Learning
More V100 Features: 2x L2 atomics, int8, new memory model, copy engine page migration, and more …
The Fastest and Most Productive GPU for Deep Learning and HPC