THE NEW VOLTA GPU ARCHITECTURE
Alan Gray, NVIDIA
Oxford, July 28 2017

• This talk comprises material from the following NVIDIA GTC 2017 talks:

• S7798 - Inside Volta, Olivier Giroux and Luke Durant

• S7132 - CUDA 9 and Beyond, Mark Harris

• Recordings from these (and many others) are available at www.gputechconf.com

INTRODUCING TESLA V100

• Volta Architecture: Most Productive GPU
• Tensor Core: 120 Programmable TFLOPS Deep Learning
• Improved NVLink & HBM2: Efficient Bandwidth
• Volta MPS: Inference Utilization
• Improved SIMT Model: New Algorithms

The Fastest and Most Productive GPU for Deep Learning and HPC

TESLA V100 GPU

TESLA V100

• 21B transistors, 815 mm²
• 80 SMs*, 5120 CUDA Cores, 640 Tensor Cores
• 16 GB HBM2, 900 GB/s HBM2 bandwidth, 300 GB/s NVLink

*The full GV100 chip contains 84 SMs

GPU PERFORMANCE COMPARISON

                          P100           V100            Ratio
Training acceleration     10 TFLOPS      120 TFLOPS      12x
Inference acceleration    21 TFLOPS      120 TFLOPS      6x
FP64/FP32                 5/10 TFLOPS    7.5/15 TFLOPS   1.5x
HBM2 Bandwidth            720 GB/s       900 GB/s        1.2x
NVLink Bandwidth          160 GB/s       300 GB/s        1.9x
L2 Cache                  4 MB           6 MB            1.5x
L1 Caches                 1.3 MB         10 MB           7.7x
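As a rough cross-check of where the peak throughput numbers above come from (the ~1.45 GHz boost clock is an assumption of this note, not a figure from the talk):

FP32:   80 SMs × 64 FP32 units × 2 flops per FMA × ~1.45 GHz ≈ 15 TFLOPS
FP64:   80 SMs × 32 FP64 units × 2 flops per FMA × ~1.45 GHz ≈ 7.5 TFLOPS
Tensor: 640 Tensor Cores × 64 FMAs per cycle × 2 flops × ~1.45 GHz ≈ 120 TFLOPS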

NEW HBM2 MEMORY ARCHITECTURE

1.5x delivered bandwidth (STREAM Triad, delivered GB/s):
• P100: 76% DRAM utilization
• V100: 95% DRAM utilization
V100 measured on pre-production hardware.

ROAD TO EXASCALE
Volta to Fuel Most Powerful US Supercomputer

Volta HPC application performance relative to Tesla P100.

Summit: 200+ PetaFlops, ~3,400 nodes, 10 Megawatts.
System config info: 2X Xeon E5-2690 v4, 2.6 GHz, with 1X Tesla P100 or V100. V100 measured on pre-production hardware.

VOLTA: A GIANT LEAP FOR DEEP LEARNING

• ResNet-50 training: 2.4x faster (V100 Tensor Cores vs P100 FP32, images per second)
• ResNet-50 inference, TensorRT at 7 ms latency: 3.7x faster (V100 Tensor Cores vs P100 FP16, images per second)
V100 measured on pre-production hardware.

VOLTA NVLINK

300 GB/sec: 50% more links, 28% faster signaling

NEW SM MICROARCHITECTURE

VOLTA GV100 SM

GV100 SM resources:
• FP32 units: 64
• FP64 units: 32
• INT32 units: 64
• Tensor Cores: 8
• Register file: 256 KB
• Unified L1/shared memory: 128 KB
• Active threads: 2048

VOLTA GV100 SM
Redesigned for Productivity

Completely new ISA + twice the schedulers + simplified issue logic + large, fast L1 cache + improved SIMT model + tensor acceleration = the easiest SM to program yet.

RECAP: PASCAL L1 AND SHARED MEMORY

Pascal SM memory path: load/store units → shared memory (64 KB, low latency) and a separate L1$ (24 KB, streaming: unlimited cache misses in flight) → L2$ (4 MB).

UNIFYING KEY TECHNOLOGIES

Pascal SM: load/store units → shared memory (64 KB, low latency) + separate L1$ (24 KB, streaming) → L2$ (4 MB)
Volta SM: load/store units → unified L1$ and shared memory (128 KB) → L2$ (6 MB)

VOLTA L1 AND SHARED MEMORY

Volta streaming L1$:
• Unlimited cache misses in flight
• Low cache hit latency
• 4x more bandwidth
• 5x more capacity

Volta shared memory:
• Unified storage with L1 (128 KB combined)
• Configurable up to 96 KB (see the sketch below)

L2$: 6 MB
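One way a kernel might opt into the larger shared-memory share is sketched below. The kernel, sizes, and launch configuration are illustrative assumptions; cudaFuncSetAttribute and the two attributes are the CUDA 9 runtime API.

#include <cuda_runtime.h>

// Illustrative kernel using dynamic shared memory (hypothetical example).
__global__ void stage_kernel(const float *in, float *out, int n)
{
    extern __shared__ float smem[];                 // sized at launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    smem[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // stage through shared memory
    __syncthreads();
    if (i < n) out[i] = smem[threadIdx.x] * 2.0f;
}

void launch_stage(const float *in, float *out, int n)
{
    // Prefer the largest shared-memory carveout of the unified 128 KB.
    cudaFuncSetAttribute(stage_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 100);
    // Opt in to more than the default 48 KB of dynamic shared memory per block
    // (up to 96 KB on GV100); only part of it is used in this toy example.
    int smemBytes = 64 * 1024;
    cudaFuncSetAttribute(stage_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, smemBytes);
    stage_kernel<<<(n + 255) / 256, 256, smemBytes>>>(in, out, n);
}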

NARROWING THE SHARED MEMORY GAP with the GV100 L1 cache

Directed testing, shared memory vs the same data accessed through the cache in global memory (average shared memory benefit: Pascal 70%, Volta 93%):

Cache vs shared:
• Easier to use
• 90%+ as good

Shared vs cache:
• Faster atomics
• More banks
• More predictable

VOLTA TENSOR CORE

TENSOR CORE
Mixed Precision Matrix Math on 4x4 Matrices

D = A × B + C

where A and B are 4×4 FP16 matrices, and C and D are 4×4 FP16 or FP32 matrices.

TENSOR SYNCHRONIZATION
Full Warp 16x16 Matrix Math

• Warp-synchronizing operation
• Composed matrix multiply and accumulate for 16×16 matrices
• Result distributed across the warp

VOLTA TENSOR OPERATION

FP16 storage/input → full-precision product → sum with FP32 accumulator (further products accumulated each step) → convert to FP32 result.

Each element operation is effectively F16 × F16 + F32: the FP16 product is computed at full precision and accumulated in FP32.

Also supports FP16 accumulator mode for inferencing.
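As a rough scalar sketch of the per-element arithmetic (an illustration only, not the hardware data path; the function name is hypothetical):

#include <cuda_fp16.h>

// One element of D = A*B + C: FP16 inputs, full-precision product,
// FP32 accumulation -- mirroring the mixed-precision tensor core step.
__device__ float mixed_precision_mac(half a, half b, float c)
{
    return fmaf(__half2float(a), __half2float(b), c);
}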

USING TENSOR CORES

Tensor Cores can be used through Volta-optimized frameworks and libraries (NVIDIA cuDNN, cuBLAS, TensorRT) or directly from CUDA C++ via warp-level matrix operations (wmma):

#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16x16 mixed-precision matrix multiply-accumulate.
// Row-major layouts with leading dimension 16 are assumed here; the
// accumulator is zero-initialized, so the c argument is not used.
__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);

    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
}
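A minimal sketch of how the device function above might be driven (the wrapper kernel and launch configuration are illustrative assumptions, not from the talk): one full warp must execute the wmma operations together.

// Hypothetical wrapper: a single warp computes one 16x16x16 tile product.
__global__ void wmma_tile_kernel(float *d, half *a, half *b, float *c)
{
    tensor_op_16_16_16(d, a, b, c);   // defined above; all 32 threads participate
}

// Host-side sketch: launch one warp for one tile (device pointers assumed).
//   wmma_tile_kernel<<<1, 32>>>(d_d, d_a, d_b, d_c);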

A GIANT LEAP FOR DEEP LEARNING

cuBLAS mixed-precision matrix multiply (FP16 input, FP32 compute, M = N = K = 2048): V100 with Tensor Cores (CUDA 9) delivers 9.3x the relative performance of P100 (CUDA 8).
V100 measured on pre-production hardware.

INDEPENDENT THREAD SCHEDULING

VOLTA: INDEPENDENT THREAD SCHEDULING
Communicating Algorithms

• Pascal: lock-free algorithms (threads cannot wait for messages)
• Volta: starvation-free algorithms (threads may wait for messages, as in the sketch below)
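To make the distinction concrete, here is a hedged sketch (not code from the talk) of threads in one warp taking a simple spin lock. Under Volta's independent thread scheduling the spinning threads cannot indefinitely starve the lock holder in the same warp, so the loop eventually completes; under the pre-Volta warp execution model the same code could hang.

// Illustrative only: a simple spin lock taken by individual threads.
// 'lock' is a device int initialized to 0 (0 = free, 1 = held).
__device__ void critical_update(int *lock, int *counter)
{
    // Spin until this thread acquires the lock.
    while (atomicCAS(lock, 0, 1) != 0) { }
    __threadfence();          // acquire: see writes made before the last release
    *counter += 1;            // critical section
    __threadfence();          // release: publish the update before unlocking
    atomicExch(lock, 0);
}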

WARP IMPLEMENTATION

Pre-Volta: a single program counter (PC) and stack (S) shared by the whole 32-thread warp.

PASCAL WARP EXECUTION MODEL

if (threadIdx.x < 4) {
    A;
    B;
} else {
    X;
    Y;
}

The warp diverges at the branch: one side executes A then B while the other is masked off, then the other side executes X then Y, and the warp reconverges after the branch.

PASCAL WARP EXECUTION MODEL
No Synchronization Permitted

if (threadIdx.x < 4) {
    A;
    __syncwarp();
    B;
} else {
    X;
    __syncwarp();
    Y;
}

On Pascal the diverged sides cannot synchronize with each other: the warp only reconverges once a branch completes, so a __syncwarp() inside divergent branches is not permitted.

WARP IMPLEMENTATION

Pre-Volta: a single program counter (PC) and stack (S) shared by the 32-thread warp.

Volta: every thread has its own PC and stack, with a convergence optimizer regrouping threads, giving a 32-thread warp with independent scheduling.

VOLTA WARP EXECUTION MODEL

Synchronization may lead to interleaved scheduling!

if (threadIdx.x < 4) {
    A;
    __syncwarp();
    B;
} else {
    X;
    __syncwarp();
    Y;
}
__syncwarp();

The sides diverge, synchronize with each other at the __syncwarp() calls, and may then execute interleaved until the final __syncwarp().
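A minimal sketch of what this enables (the kernel, buffer, and indexing are illustrative assumptions, not code from the talk): divergent branches of one warp exchange data through shared memory, made safe by explicit __syncwarp() on both paths.

// Illustrative only (assumes one warp per block, i.e. blockDim.x == 32):
// divergent branches communicate through shared memory with explicit
// __syncwarp() -- legal under Volta's independent thread scheduling.
__global__ void warp_exchange(int *out)
{
    __shared__ int buf[4];
    int lane = threadIdx.x;              // 0..31

    if (lane < 4) {
        buf[lane] = lane * 10;           // A: lanes 0-3 produce values
        __syncwarp();                    // synchronize with the other branch
        out[lane] = buf[lane];           // B
    } else {
        __syncwarp();                    // wait until the producers have written
        out[lane] = buf[lane % 4];       // Y: consume what lanes 0-3 wrote
    }
}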

ROBUST AND EXPLICIT WARP PROGRAMMING
Adapt Legacy Code for New Execution Model

Volta Independent Thread Scheduling:

Program familiar algorithms and data structures in a natural way

Flexible thread grouping and synchronization

Use explicit synchronization, don’t rely on implicit convergence

CUDA 9 provides a fully explicit synchronization model

COOPERATIVE GROUPS

COOPERATIVE GROUPS
Flexible and Scalable Thread Synchronization and Communication

• Define, synchronize, and partition groups of cooperating threads (thread block groups, partitioned thread groups)
• Clean composition across software boundaries
• Optimize for the hardware fast path
• Scalable from a few threads to all running threads
• Deploy everywhere: Kepler and newer GPUs*
• Supported by CUDA developer tools

* Note: Multi-Block and Multi-Device Cooperative Groups are only supported on Pascal and newer GPUs

SYNCHRONIZE AT ANY SCALE
Three Key Capabilities

• Flexible groups: define and synchronize arbitrary groups of threads (partition, sync)
• Whole-grid synchronization: synchronize multiple thread blocks
• Multi-GPU synchronization: synchronize across multiple GPUs

COOPERATIVE GROUPS BASICS
Flexible, Explicit Synchronization

Thread groups are explicit objects in your program:

    thread_group block = this_thread_block();

You can synchronize threads in a group:

    block.sync();

Create new groups by partitioning existing groups:

    thread_group tile32 = tiled_partition(block, 32);
    thread_group tile4  = tiled_partition(tile32, 4);

Partitioned groups can also synchronize:

    tile4.sync();

Note: these calls are part of the cooperative_groups:: namespace.

EXAMPLE: PARALLEL REDUCTION
Composable, Robust and Efficient

Per-block:

    auto g = this_thread_block();
    reduce(g, ptr, myVal);

Per-warp:

    auto g = tiled_partition<32>(this_thread_block());
    reduce(g, ptr, myVal);

__device__ int reduce(thread_group g, int *x, int val)
{
    int lane = g.thread_rank();
    for (int i = g.size() / 2; i > 0; i /= 2) {
        x[lane] = val;
        g.sync();                          // wait for all threads to store
        if (lane < i) val += x[lane + i];  // lower half accumulates the upper half
        g.sync();                          // wait for all threads to load
    }
    return val;                            // thread 0 holds the full sum
}
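As a sketch of how these pieces might be assembled (the kernel name, block size, and use of atomicAdd are illustrative assumptions, not from the talk), each block stages its values through shared memory, calls the reduce() helper above, and thread 0 accumulates the block result:

#include <cooperative_groups.h>
using namespace cooperative_groups;

// Illustrative kernel built around the reduce() helper defined above.
__global__ void sum_kernel(const int *in, int *total, int n)
{
    __shared__ int scratch[256];                    // assumes blockDim.x == 256
    thread_block block = this_thread_block();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int myVal = (i < n) ? in[i] : 0;

    int blockSum = reduce(block, scratch, myVal);   // per-block reduction
    if (block.thread_rank() == 0)
        atomicAdd(total, blockSum);                 // combine across blocks
}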

LAUNCHING COOPERATIVE KERNELS
Three Synchronization Scales

• Block or sub-block sync: launch with <<<>>> or cudaLaunchKernel()
• Multi-block sync: launch with cudaLaunchCooperativeKernel() (sketched below)
• Multi-device sync: launch with cudaLaunchCooperativeKernelMultiDevice()
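A minimal sketch of the multi-block case (the kernel, sizes, and two-phase structure are illustrative assumptions): grid.sync() provides a whole-grid barrier, and the kernel must be launched with cudaLaunchCooperativeKernel so that all blocks are co-resident (this also typically requires building with nvcc -rdc=true and sizing the grid to fit on the device).

#include <cooperative_groups.h>
using namespace cooperative_groups;

// Illustrative two-phase kernel that needs a barrier across all blocks.
__global__ void two_phase_kernel(float *data, int n)
{
    grid_group grid = this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) data[i] *= 2.0f;   // phase 1
    grid.sync();                  // every block reaches here before phase 2
    if (i < n) data[i] += 1.0f;   // phase 2
}

// Host-side launch sketch.
void launch_two_phase(float *d_data, int n)
{
    dim3 block(256), grid((n + 255) / 256);
    void *args[] = { &d_data, &n };
    cudaLaunchCooperativeKernel((void *)two_phase_kernel, grid, block, args, 0, 0);
}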

ROBUST AND EXPLICIT WARP PROGRAMMING
Adapt Legacy Code for New Execution Model

• Eliminate implicit warp-synchronous programming on all architectures
• Use explicit synchronization; focus synchronization granularity with Cooperative Groups
• Transition to the new *_sync() primitives: __shfl_sync(), __ballot_sync(), __any_sync(), __all_sync(), __activemask()
• CUDA 9 deprecates the non-synchronizing __shfl(), __ballot(), __any(), __all() (a conversion sketch follows below)
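For instance, a legacy warp-synchronous shuffle reduction might be converted like this (a sketch; the reduction itself is an illustrative example, not code from the talk):

// Legacy pattern (deprecated in CUDA 9): val += __shfl_down(val, offset);

// Explicitly synchronized version; assumes the full warp participates.
__device__ int warp_reduce_sum(int val)
{
    const unsigned FULL_MASK = 0xffffffffu;
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(FULL_MASK, val, offset);
    return val;    // lane 0 ends up with the sum of all 32 lanes
}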

FUTURE COOPERATIVE GROUPS
Library of Collective Algorithms

Reductions, sorting, prefix sum (scan), etc.

// collective key-value sort using all threads in the block
cooperative_groups::sort(this_thread_block(), myValues, myKeys);

// collective scan-based allocate across block
int sz = myAllocationSize();      // amount each thread wants
int offset = cooperative_groups::exclusive_scan(this_thread_block(), sz);

Note: preliminary API sketch

INTRODUCING TESLA V100

• Volta Architecture: Most Productive GPU
• Tensor Core: 120 Programmable TFLOPS Deep Learning
• Improved NVLink & HBM2: Efficient Bandwidth
• Volta MPS: Inference Utilization
• Improved SIMT Model: New Algorithms

The Fastest and Most Productive GPU for Deep Learning and HPC
