THE NEW VOLTA GPU ARCHITECTURE
Alan Gray, NVIDIA
Oxford, July 28 2017

This talk is composed of material from the following NVIDIA GTC 2017 talks:
• S7798 - Inside Volta, Olivier Giroux and Luke Durant
• S7132 - CUDA 9 and Beyond, Mark Harris
Recordings of these (and many others) are available at www.gputechconf.com
INTRODUCING TESLA V100

• Volta Architecture: Most Productive GPU
• Tensor Core: 120 Programmable TFLOPS Deep Learning
• Improved NVLink & HBM2: Efficient Bandwidth
• Volta MPS: Inference Utilization
• Improved SIMT Model: New Algorithms

The Fastest and Most Productive GPU for Deep Learning and HPC
TESLA V100 GPU

TESLA V100
• 21B transistors, 815 mm2
• 80 SMs*, 5120 CUDA Cores, 640 Tensor Cores
• 16 GB HBM2, 900 GB/s HBM2 bandwidth
• 300 GB/s NVLink
*full GV100 chip contains 84 SMs

GPU PERFORMANCE COMPARISON
                          P100            V100            Ratio
Training acceleration     10 TFLOPS       120 TFLOPS      12x
Inference acceleration    21 TFLOPS       120 TFLOPS      6x
FP64/FP32                 5/10 TFLOPS     7.5/15 TFLOPS   1.5x
HBM2 Bandwidth            720 GB/s        900 GB/s        1.2x
NVLink Bandwidth          160 GB/s        300 GB/s        1.9x
L2 Cache                  4 MB            6 MB            1.5x
L1 Caches                 1.3 MB          10 MB           7.7x
NEW HBM2 MEMORY ARCHITECTURE
1.5x Delivered Bandwidth
[Chart: STREAM Triad delivered GB/s per HBM2 stack: P100 achieves 76% DRAM utilization, V100 achieves 95%]
V100 measured on pre-production hardware.

ROAD TO EXASCALE
Volta to Fuel Most Powerful US Supercomputers
Volta HPC Application Performance Relative to Tesla P100
Summit Supercomputer: 200+ PetaFlops, ~3,400 Nodes, 10 Megawatts
System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware.

VOLTA: A GIANT LEAP FOR DEEP LEARNING
ResNet-50 Training: 2.4x faster, images per second (P100 FP32 vs V100 Tensor Cores)
ResNet-50 Inference, TensorRT at 7ms latency: 3.7x faster, images per second (P100 FP16 vs V100 Tensor Cores)
V100 measured on pre-production hardware.

VOLTA NVLINK
• 300 GB/sec
• 50% more links
• 28% faster signaling
NEW SM MICROARCHITECTURE

VOLTA GV100 SM
• FP32 units: 64
• FP64 units: 32
• INT32 units: 64
• Tensor Cores: 8
• Register File: 256 KB
• Unified L1/Shared Memory: 128 KB
• Active Threads: 2048

VOLTA GV100 SM
Redesigned for Productivity
• Completely new ISA
• Twice the schedulers
• Simplified issue logic
• Large, fast L1 cache
• Improved SIMT model
• Tensor acceleration
= The easiest SM to program yet

RECAP: PASCAL L1 AND SHARED MEMORY
[Figure: Pascal SM. Load/Store Units access a low-latency 64 KB Shared Memory and a separate 24 KB streaming L1$ (unlimited cache misses in flight), backed by a 4 MB L2$]
UNIFYING KEY TECHNOLOGIES
[Figure: Pascal SM: separate 64 KB low-latency Shared Memory and 24 KB streaming L1$, over a 4 MB L2$. Volta SM: unified 128 KB L1$ and Shared Memory, over a 6 MB L2$]
VOLTA L1 AND SHARED MEMORY
[Figure: Volta SM: Load/Store Units access a unified 128 KB L1$ and Shared Memory, backed by a 6 MB L2$]

Volta Streaming L1$:
• Unlimited cache misses in flight
• Low cache hit latency
• 4x more bandwidth
• 5x more capacity

Volta Shared Memory:
• Unified storage with L1
• Configurable up to 96 KB
NARROWING THE SHARED MEMORY GAP with the GV100 L1 cache
Directed testing: global accesses hitting in cache vs. shared memory
[Chart: average benefit of shared memory over cache narrows from Pascal to Volta, with cache reaching 70% of shared-memory performance on Pascal and 93% on Volta]
Cache: easier to use, 90%+ as good as shared
Shared: faster atomics, more banks, more predictable
VOLTA TENSOR CORE

TENSOR CORE
Mixed Precision Matrix Math on 4x4 matrices:

    D = A x B + C

where A and B are FP16, and C and D are FP16 or FP32.

TENSOR SYNCHRONIZATION
Full Warp 16x16 Matrix Math
• Warp-synchronizing operation
• Composed matrix multiply and accumulate for 16x16 matrices
• Result distributed across the warp
VOLTA TENSOR OPERATION
FP16 inputs are multiplied into a full-precision product, summed into an FP32 accumulator along with further products, and converted to an FP32 result for storage or further input:

    F32 accumulator = F16 x F16 + F32 accumulator

Also supports FP16 accumulator mode for inferencing
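A complete (hypothetical) kernel and host launch around this operation, using the CUDA 9 wmma API, might look as follows; one warp computes one 16x16 tile with FP16 inputs and FP32 accumulation. The kernel name and tile layout are illustrative:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp cooperatively computes D = A*B for a single 16x16 tile
// (FP16 inputs, FP32 accumulate), matching the operation described above.
__global__ void tile_mma(float *d, const half *a, const half *b)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Dmat;

    wmma::fill_fragment(Dmat, 0.0f);          // start from C = 0
    wmma::load_matrix_sync(Amat, a, 16);      // leading dimension 16
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::mma_sync(Dmat, Amat, Bmat, Dmat);   // D = A*B + D
    wmma::store_matrix_sync(d, Dmat, 16, wmma::mem_row_major);
}

// Host launch (sketch): a single warp of 32 threads handles the tile.
// tile_mma<<<1, 32>>>(d_dev, a_dev, b_dev);
```

All 32 threads of the warp must execute these calls together: the fragments are distributed across the warp's registers, which is why the operations carry the _sync suffix.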
USING TENSOR CORES
CUDA C++ Volta-optimized warp-level matrix operations, used by frameworks and libraries (NVIDIA cuDNN, cuBLAS, TensorRT):

__device__ void tensor_op_16_16_16(
        float *d, half *a, half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);

    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
}

A GIANT LEAP FOR DEEP LEARNING
cuBLAS Mixed Precision (FP16 input, FP32 compute) Matrix Multiply (M=N=K=2048):
9.3x faster on V100 with Tensor Cores (CUDA 9) relative to P100 (CUDA 8)
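This benchmark is not shown as code on the slide; a sketch of the kind of cuBLAS call behind it, assuming CUDA 9's cublasGemmEx mixed-precision path (the wrapper function name is illustrative):

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// FP16 inputs, FP32 compute. d_A, d_B are device arrays of half and
// d_C of float, all column-major n x n matrices.
void gemm_fp16_fp32(cublasHandle_t handle, int n,
                    const half *d_A, const half *d_B, float *d_C)
{
    const float alpha = 1.0f, beta = 0.0f;
    // Allow cuBLAS to use Tensor Cores where the problem sizes permit.
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 n, n, n,
                 &alpha,
                 d_A, CUDA_R_16F, n,
                 d_B, CUDA_R_16F, n,
                 &beta,
                 d_C, CUDA_R_32F, n,
                 CUDA_R_32F,          // compute in FP32
                 CUBLAS_GEMM_DEFAULT);
}
```

Note that alpha and beta are given in FP32 because the compute type is CUDA_R_32F; the same call runs (without Tensor Cores) on earlier GPUs.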
V100 measured on pre-production hardware.

INDEPENDENT THREAD SCHEDULING
VOLTA: INDEPENDENT THREAD SCHEDULING
Communicating Algorithms
• Pascal: lock-free algorithms; threads cannot wait for messages
• Volta: starvation-free algorithms; threads may wait for messages

WARP IMPLEMENTATION
Pre-Volta: one Program Counter (PC) and one Stack (S) shared by the whole 32-thread warp
PASCAL WARP EXECUTION MODEL

if (threadIdx.x < 4) {
    A;
    B;
} else {
    X;
    Y;
}

The warp diverges at the branch: one path (A; B;) runs to completion, then the other (X; Y;), and the warp reconverges after the if/else.
PASCAL WARP EXECUTION MODEL
No synchronization permitted between divergent paths:

if (threadIdx.x < 4) {
    A;
    __syncwarp();   // not permitted on Pascal: the warp executes
    B;              // the divergent paths serially, so the other
} else {            // path never reaches its barrier
    X;
    __syncwarp();
    Y;
}
WARP IMPLEMENTATION
Pre-Volta: one Program Counter (PC) and Stack (S) per 32-thread warp
Volta: a PC and S per thread (32 per warp), managed by a Convergence Optimizer: a 32-thread warp with independent scheduling
VOLTA WARP EXECUTION MODEL
Synchronization may lead to interleaved scheduling!

if (threadIdx.x < 4) {
    A;
    __syncwarp();
    B;
} else {
    X;
    __syncwarp();
    Y;
}
__syncwarp();

The warp diverges at the branch, and statements from the two paths may interleave in time, with the paths synchronizing at each __syncwarp().
ROBUST AND EXPLICIT WARP PROGRAMMING
Adapt Legacy Code for New Execution Model

Volta Independent Thread Scheduling:
• Program familiar algorithms and data structures in a natural way
• Flexible thread grouping and synchronization
• Use explicit synchronization, don't rely on implicit convergence

CUDA 9 provides a fully explicit synchronization model
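For example, a legacy warp-synchronous shared-memory exchange that relied on implicit lockstep execution needs explicit barriers under the new model. A sketch; the function name and buffer layout are illustrative:

```cuda
// Butterfly sum across one full warp via shared memory.
// s must point to 32 ints private to this warp. Pre-Volta code often
// omitted the barriers and relied on implicit warp-lockstep execution;
// under independent thread scheduling each step must be explicit.
__device__ int warp_butterfly_sum(int *s, int lane, int val)
{
    for (int offset = 16; offset > 0; offset /= 2) {
        s[lane] = val;
        __syncwarp();               // make the store visible warp-wide
        val += s[lane ^ offset];    // read partner lane's value
        __syncwarp();               // protect s before the next store
    }
    return val;                     // every lane ends with the full sum
}
```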
COOPERATIVE GROUPS

COOPERATIVE GROUPS
Flexible and Scalable Thread Synchronization and Communication
• Define, synchronize, and partition groups of cooperating threads
• Clean composition across software boundaries
• Optimize for the hardware fast path
• Scalable from a few threads to all running threads
• Deploy everywhere: Kepler and newer GPUs*
• Supported by CUDA developer tools
* Note: Multi-Block and Multi-Device Cooperative Groups are only supported on Pascal and above GPUs

SYNCHRONIZE AT ANY SCALE
Three Key Capabilities
• FLEXIBLE GROUPS: define and synchronize arbitrary groups of threads (partition, then sync each group)
• WHOLE-GRID SYNCHRONIZATION: synchronize multiple thread blocks
• MULTI-GPU SYNCHRONIZATION: synchronize across multiple devices
COOPERATIVE GROUPS BASICS
Flexible, Explicit Synchronization

Thread groups are explicit objects in your program:
    thread_group block = this_thread_block();

You can synchronize threads in a group:
    block.sync();

Create new groups by partitioning existing groups:
    thread_group tile32 = tiled_partition(block, 32);
    thread_group tile4 = tiled_partition(tile32, 4);

Partitioned groups can also synchronize:
    tile4.sync();
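Put together in a kernel, partitioning and group-local synchronization might look like this (a sketch; the kernel name and output layout are illustrative):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void tile_demo(int *out)
{
    cg::thread_block block = cg::this_thread_block();
    cg::thread_group tile32 = cg::tiled_partition(block, 32);
    cg::thread_group tile4  = cg::tiled_partition(tile32, 4);

    int v = block.thread_rank();
    // ... each 4-thread tile works on v ...
    tile4.sync();                  // synchronize only this tile's 4 threads

    if (tile4.thread_rank() == 0)  // tile leader records a result
        out[block.thread_rank() / 4] = v;
}

// Host launch (sketch): tile_demo<<<1, 32>>>(d_out);
```

tile4.sync() is cheaper than a full block.sync(): only the four threads of the tile wait for each other.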
Note: this_thread_block(), tiled_partition(), and the group sync() calls are part of the cooperative_groups:: namespace

EXAMPLE: PARALLEL REDUCTION
Composable, Robust and Efficient
Per-Block:
    g = this_thread_block();
    reduce(g, ptr, myVal);

Per-Warp:
    g = tiled_partition<32>(this_thread_block());
    reduce(g, ptr, myVal);
__device__ int reduce(thread_group g, int *x, int val) {
    int lane = g.thread_rank();
    for (int i = g.size()/2; i > 0; i /= 2) {
        x[lane] = val;
        g.sync();
        if (lane < i) val += x[lane + i];
        g.sync();
    }
    return val;
}

LAUNCHING COOPERATIVE KERNELS
Three Synchronization Scales
• Block or Sub-Block Sync: launch with <<<>>> or cudaLaunchKernel()
• Multi-Block Sync: launch with cudaLaunchCooperativeKernel()
• Multi-Device Sync: launch with cudaLaunchCooperativeKernelMultiDevice()
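A multi-block (whole-grid) synchronization sketch: the kernel calls this_grid().sync(), and the host uses cudaLaunchCooperativeKernel in place of <<<>>>. The kernel name, grid dimensions, and phase contents are illustrative:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void two_phase(float *data)
{
    cg::grid_group grid = cg::this_grid();
    // Phase 1: each block writes its portion of data ...
    grid.sync();   // whole-grid barrier: all blocks reach this point
    // Phase 2: blocks may now read results produced by any other block ...
}

void launch(float *d_data)
{
    void *args[] = { &d_data };
    // All blocks must be co-resident on the device for grid.sync() to be
    // valid, so the grid must not exceed the device's occupancy limit
    // (check the cooperativeLaunch device attribute first).
    cudaLaunchCooperativeKernel((void *)two_phase,
                                dim3(56), dim3(256), args, 0, 0);
}
```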
ROBUST AND EXPLICIT WARP PROGRAMMING
Adapt Legacy Code for New Execution Model
• Eliminate implicit warp-synchronous programming on all architectures
• Use explicit synchronization
• Focus synchronization granularity with Cooperative Groups
• Transition to the new *_sync() primitives:
  __shfl_sync(), __ballot_sync(), __any_sync(), __all_sync(), __activemask()
• CUDA 9 deprecates the non-synchronizing __shfl(), __ballot(), __any(), __all()
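As an example of the new primitives, a warp-level sum with __shfl_down_sync; the full mask 0xffffffff assumes all 32 lanes participate (use __activemask() or a computed mask when the warp may be divergent):

```cuda
// Warp sum using the explicit-mask shuffle introduced in CUDA 9.
// The mask tells the hardware which lanes must converge and
// participate, replacing the implicit assumptions of __shfl().
__device__ int warp_sum(int val)
{
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // lane 0 ends up with the warp's total
}
```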
FUTURE COOPERATIVE GROUPS
Library of Collective Algorithms
Reductions, sorting, prefix sum (scan), etc.

// collective key-value sort using all threads in the block
cooperative_groups::sort(this_thread_block(), myValues, myKeys);

// collective scan-based allocate across block
int sz = myAllocationSize(); // amount each thread wants
int offset = cooperative_groups::exclusive_scan(this_thread_block(), sz);

Note: preliminary API sketch

INTRODUCING TESLA V100
• Volta Architecture: Most Productive GPU
• Tensor Core: 120 Programmable TFLOPS Deep Learning
• Improved NVLink & HBM2: Efficient Bandwidth
• Volta MPS: Inference Utilization
• Improved SIMT Model: New Algorithms

The Fastest and Most Productive GPU for Deep Learning and HPC