OPTIMIZING CUDA APPLICATIONS FOR THE VOLTA/TURING ARCHITECTURE
Vishal Mehta, Maxim Milakov, Oct 18, 2018

NEW FEATURES IN CUDA ECOSYSTEM

TURING AND NEW SYSTEMS: New GPU Architecture, Tensor Cores, RT Core, NVSwitch Fabric, DGX-2
CUDA PLATFORM: CUDA Graphs, Vulkan & DX12 Interop, Warp Matrix Multiply Accumulate (WMMA)

LIBRARIES (Scientific Computing): GPU-accelerated hybrid JPEG decoding, Symmetric Eigenvalue Solvers, FFT Scaling
DEVELOPER TOOLS: New Nsight Products – Nsight Systems and Nsight Compute

AGENDA

New Features:

Tensor Cores

RTcore

CUDA Graphs

Nsight Developer Tools

Optimization strategies:

Volta/Turing Execution Model

Volta/Turing Memory Subsystem

TENSOR CORES

VOLTA / TURING SM

                     V100 (Volta SM)   TU102 (Turing SM)
FP64                 32                2
INT32                64                64
FP32                 64                64
Tensor Cores         8                 8
RT Core              -                 1
Register File        256 KB            256 KB
L1 and shmem         128 KB            96 KB
Max threads          2048              1024
Compute Capability   70                75*

*Volta (cc70) code runs on Turing without JIT or recompile!

TENSOR CORES
New in Volta, Extended in Turing

GPU     SMs   Total Tensor Cores   Peak Half FLOPS   Peak INT8 OPS   Peak INT4 OPS   Peak Binary OPS
V100    80    640                  125 TFLOPS        N.A.            N.A.            N.A.
TU102   72    576                  130.5 TFLOPS      261 TOPS        522 TOPS        2088 TOPS

half precision inputs -> half / float accumulator
8-bit / 4-bit INT inputs -> 32-bit INT accumulator
1-bit binary inputs -> 32-bit INT accumulator (XOR + POPC)

Used via cuBLAS, cuDNN, CUTLASS, TensorRT
Exposed in CUDA 10 (4-bit INT and 1-bit binary are experimental)

TURING TENSOR CORE
New Warp Matrix Functions

▪ WMMA operations now include 8-bit integer along with FP16
▪ Warp Matrix Multiply Accumulate: D = A * B + C
▪ Signed & unsigned 8-bit input
▪ 32-bit integer accumulator
▪ Input/Output dimensions similar to FP16
▪ 2048 ops per cycle, per SM for 8-bit
▪ Exposed via nvcuda::wmma

Supported shapes (M x N x K):
WMMA 16x16x16: D (16x16) = A (16x16) * B (16x16) + C (16x16)
WMMA 32x8x16:  D (32x8)  = A (32x16) * B (16x8)  + C (32x8)
WMMA 8x32x16:  D (8x32)  = A (8x16)  * B (16x32) + C (8x32)
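As a concrete illustration (not from the original deck), a minimal kernel in which one warp performs a single 16x16x16 signed 8-bit WMMA; the matrices are assumed to be densely packed single tiles with leading dimension 16, and dA/dB/dC/dD in the launch comment are hypothetical device pointers:

    #include <mma.h>
    using namespace nvcuda;

    // One warp computes D = A*B + C for a single 16x16x16 tile: s8 inputs, s32 accumulator.
    // Assumes row-major A, column-major B, densely packed tiles (leading dimension 16).
    __global__ void wmma_s8_tile(const signed char *A, const signed char *B,
                                 const int *C, int *D)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, int> acc_frag;

        wmma::load_matrix_sync(a_frag, A, 16);
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major);

        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);       // D = A * B + C

        wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
    }

    // Launch with one warp: wmma_s8_tile<<<1, 32>>>(dA, dB, dC, dD);  compile with -arch=sm_75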

EXPERIMENTAL WARP MATRIX FUNCTIONS
Turing Enables Experimental Sub-Byte Tensor Core Operations

Experimental Sub-Byte Operations
▪ 4-bit signed & unsigned input
▪ 1-bit input with custom matrix operations
▪ 32-bit accumulator output

Access via special namespace: nvcuda::wmma::experimental

    namespace experimental {
        namespace precision {
            struct u4; // 4-bit unsigned
            struct s4; // 4-bit signed
            struct b1; // 1-bit
        }
        enum bmmaBitOp { bmmaBitOpXOR = 1 };
        enum bmmaAccumulateOp { bmmaAccumulateOpPOPC = 1 };
    }

Enable researchers to experiment with ultra-low precision!
Experimental: subject to API changes, not functionality.

WMMA – IMMA 4BIT
New for Turing (Experimental)

D (8-by-8 x int32) = A (8-by-32 x 4b) * B (32-by-8 x 4b) + C (8-by-8 x int32)
Each 32-element row of A and column of B packs into 128 bits.

D(i,j) = sum over k of A(i,k) * B(k,j) + C(i,j), for k = 0 .. 31
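For reference (not from the original deck), the per-element arithmetic above can be written in plain CUDA; the sketch assumes the signed 4-bit operands are bit-packed eight to a 32-bit word, low nibble first:

    // One output element of the experimental 8x8x32 4-bit IMMA:
    // D(i,j) = sum_k A(i,k) * B(k,j) + C(i,j), k = 0..31, signed 4-bit inputs.
    __device__ int imma4_element(const unsigned int a_row[4],   // 32 x s4, packed 8 per word
                                 const unsigned int b_col[4],   // 32 x s4, packed 8 per word
                                 int c)                          // previous accumulator value
    {
        int acc = c;
        for (int k = 0; k < 32; ++k) {
            // Extract the k-th nibble and sign-extend it to 32 bits.
            int a = (int)(a_row[k / 8] << (28 - 4 * (k % 8))) >> 28;
            int b = (int)(b_col[k / 8] << (28 - 4 * (k % 8))) >> 28;
            acc += a * b;
        }
        return acc;
    }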

WMMA – BINARY – XOR POPC
New for Turing (Experimental)

D (8-by-8 x int32) = popc( A (8-by-128 x 1b) XOR B (128-by-8 x 1b) ) + C (8-by-8 x int32)
Each 128-element row of A and column of B packs into 128 bits.

D(i,j) = popc(A(i,k) ^ B(k,j)) + C(i,j), accumulated over k = 0 .. 127
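For reference (not from the original deck), one output element of this binary operation can be computed in plain CUDA with the __popc intrinsic; the sketch assumes the 1-bit operands are packed into four 32-bit words per row/column:

    // One output element of the 8x8x128 binary WMMA:
    // D(i,j) = population_count(A_row_i XOR B_col_j) + C(i,j), over 128 bits.
    __device__ int bmma_element(const unsigned int a_row[4],   // 128-bit row of A, bit-packed
                                const unsigned int b_col[4],   // 128-bit column of B, bit-packed
                                int c)                          // previous accumulator value
    {
        int acc = c;
        for (int w = 0; w < 4; ++w)
            acc += __popc(a_row[w] ^ b_col[w]);                 // XOR, then population count
        return acc;
    }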

BINARY TENSOR CORE OPERATION

1-bit input signals -> bitwise XOR operation -> 128-bit population count added to accumulator -> 32-bit integer output per point.
Other row/column results are accumulated the same way: bitwise XOR + count added to the previous accumulation.

NEW TURING WARP MATRIX FUNCTIONS

Input Precision                   Output            Supported Sizes                          Max Ops/Clock/SM

Native types:
  half *                          half or float     16 x 16 x 16, 32 x 8 x 16, 8 x 32 x 16   1024
  char / unsigned char            integer (int32)   16 x 16 x 16, 32 x 8 x 16, 8 x 32 x 16   2048

Experimental:
  precision::u4 (4-bit unsigned)  integer (int32)   8 x 8 x 32                               4096
  precision::s4 (4-bit signed)    integer (int32)   8 x 8 x 32                               4096
  precision::b1 (1-bit)           integer (int32)   8 x 8 x 128                              16384

* Also available on Volta sm_70. Note: WMMA requires recompilation for sm_75 for peak performance.

CUTLASS 1.1
High-performance Matrix Multiplication in Open Source templated CUDA C++

CUTLASS GEMM Structural Model

CUTLASS 1.1
High-performance Matrix Multiplication in Open Source templated CUDA C++

▪ Turing optimized GEMMs
▪ Integer (8-bit, 4-bit and 1-bit) using WMMA
▪ Batched strided GEMM
▪ Support for CUDA 10.0
▪ Updates to documentation and more examples

[Chart: CUTLASS 1.1 on Volta (GV100) – >90% relative to peak performance across DGEMM, HGEMM, IGEMM, SGEMM, WMMA (F16) and WMMA (F32) kernels, nn/nt/tn/tt layouts]
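To give a flavor of the library (a sketch based on the CUTLASS 1.x basic SGEMM example; exact header and traits names should be checked against the repository), a column-major SGEMM can be launched like this:

    #include "cutlass/gemm/gemm.h"
    #include "cutlass/gemm/sgemm_traits.h"

    // Single-precision GEMM, column-major A and B, default 128x128x8 threadblock tile.
    cudaError_t CutlassSgemmNN(int M, int N, int K, float alpha,
                               float const *A, int lda, float const *B, int ldb,
                               float beta, float *C, int ldc)
    {
        typedef cutlass::gemm::SgemmTraits<cutlass::MatrixLayout::kColumnMajor,
                                           cutlass::MatrixLayout::kColumnMajor,
                                           cutlass::Shape<8, 128, 128> > GemmTraits;
        typedef cutlass::gemm::Gemm<GemmTraits> Gemm;

        typename Gemm::Params params;
        int result = params.initialize(M, N, K, alpha, A, lda, B, ldb, beta, C, ldc, C, ldc);
        if (result) return cudaErrorInvalidValue;

        Gemm::launch(params);                        // launches the GEMM kernel
        return cudaGetLastError();
    }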

https://github.com/NVIDIA/cutlass

TURING RTCORE

RT CORES
Turing GPU

RT Cores accelerate ray tracing.

RT Cores perform:
● Ray-BVH (Bounding Volume Hierarchy) Traversal
● Instancing: 1 Level
● Ray-Triangle Intersection

Return to SM for:
● Multi-level Instancing
● Custom Intersection
● Shading

Software vs. Hardware Ray Tracing

[Diagram: Pre-Turing, the SM tests the primitives (Tri1, Tri2, Tri3, Circle1) in software; on Turing, the RT Core handles BVH traversal and triangle intersection, returning to the SM for shading and custom primitives]

RTcore in OPTIX

• Single-ray programming model using C++
• Transparently scales across multiple GPUs
• AI-accelerated rendering
• Easy interop with CUDA

http://developer.nvidia.com/optix
http://on-demand.gputechconf.com

CUDA GRAPHS

ASYNCHRONOUS TASK GRAPHS
Execution Optimization When Workflow is Known Up-Front

Examples: Deep Neural Network Training, DL Inference, Loop & Function Offload, Linear Algebra, HPC Simulation

ALL CUDA WORK FORMS A GRAPH

Node represents operation; edge represents dependency.
Any CUDA stream can be mapped to a graph.

[Diagram: the same work expressed as CUDA work in streams (implicit dependencies via stream ordering and waits) and as a graph with explicit dependencies between nodes A, B, C, D, E, X, Y, End]

DEFINITION OF A CUDA GRAPH
Graph Nodes Are Not Just Kernel Launches

Sequence of operations, connected by dependencies. Operations are one of:
▪ Kernel Launch – CUDA kernel running on GPU
▪ CPU Function Call – Callback function on CPU
▪ Memcopy/Memset – GPU data management
▪ Sub-Graph – Graphs are hierarchical
(see the sketch below)
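As an illustration (not from the original deck), a minimal sketch of how these node types map onto the CUDA 10 graph API; myKernel, myHostCallback, kernelArgs, devPtr and N are hypothetical, and error checking is omitted:

    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    cudaGraphNode_t kernelNode, hostNode, memsetNode;

    // Kernel launch node
    cudaKernelNodeParams kp = {};
    kp.func         = (void*)myKernel;        // hypothetical __global__ function
    kp.gridDim      = dim3(32);
    kp.blockDim     = dim3(256);
    kp.kernelParams = kernelArgs;             // array of pointers to the kernel's arguments
    cudaGraphAddKernelNode(&kernelNode, graph, nullptr, 0, &kp);

    // CPU function call node, depending on the kernel node
    cudaHostNodeParams hp = {};
    hp.fn       = myHostCallback;             // hypothetical void(*)(void*) callback
    hp.userData = nullptr;
    cudaGraphAddHostNode(&hostNode, graph, &kernelNode, 1, &hp);

    // Memset node, also depending on the kernel node
    cudaMemsetParams mp = {};
    mp.dst = devPtr;  mp.value = 0;  mp.elementSize = 4;
    mp.width = N;     mp.height = 1; mp.pitch = 0;
    cudaGraphAddMemsetNode(&memsetNode, graph, &kernelNode, 1, &mp);

    // Sub-graphs are added analogously with cudaGraphAddChildGraphNode.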

NEW EXECUTION MECHANISM
Graphs Can Be Generated Once Then Launched Repeatedly

[Graph G: A -> {B, X} -> {C, D} -> {E, Y} -> End]

    for(int i=0; i<1000; i++) {
        launch_graph( G );
    }

EXECUTION OPTIMIZATIONS
Latency & Overhead Reductions

Launch latencies:

▪ CUDA 10.0 takes at least 2.2 μs of CPU time to launch each CUDA kernel

▪ A pre-defined graph allows launch of any number of kernels in a single operation

[Timeline: launching kernels A–E individually requires five CPU launches before the CPU goes idle; building and launching a graph issues A–E as one operation, so the CPU is idle much sooner]

PERFORMANCE IMPACT
Optimizations for Short-Runtime Operations

CPU launch time improvements: typically 33% faster than stream launch.
Example: small 3D FFT – 25% end-to-end improvement for a 32³ 3D FFT (16 μs with stream launch, 12 μs with graph launch).

NOTE: Performance impact is workload-dependent. Benefits are largest for short-running kernels, where launch overheads account for a larger share of total runtime.

THREE-STAGE EXECUTION MODEL

Define -> Instantiate -> Execute

Define: a single graph "template", created in host code, loaded from disk, or built up from libraries.
Instantiate: multiple "executable graphs", each a snapshot of the template; sets up & initializes GPU execution structures (create once, run many times).
Execute: executable graphs running in CUDA streams; concurrency in the graph is not limited by the stream (see later).

CONVERT CUDA STREAM INTO A GRAPH
Construct a graph from normal CUDA stream syntax

    // Start by initiating stream capture
    cudaStreamBeginCapture(stream1);

    // Build stream work as usual
    A<<< ..., stream1 >>>();
    cudaEventRecord(e1, stream1);
    B<<< ..., stream1 >>>();
    cudaStreamWaitEvent(stream2, e1, 0);
    C<<< ..., stream2 >>>();
    cudaEventRecord(e2, stream2);
    cudaStreamWaitEvent(stream1, e2, 0);
    D<<< ..., stream1 >>>();

    // Now convert the stream to a graph
    cudaStreamEndCapture(stream1, &graph);

[Diagram: work in stream1/stream2 and the resulting graph: A -> {B, C} -> D]

Capture follows inter-stream dependencies to create forks & joins in the resulting graph.

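To complete the picture (not shown on the original slides), a captured graph is then instantiated once and launched repeatedly; a minimal sketch using the CUDA 10 runtime API:

    cudaGraphExec_t graphExec;
    // Full 5-argument form: the last three arguments report instantiation errors.
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

    for (int i = 0; i < 1000; ++i)
        cudaGraphLaunch(graphExec, stream1);       // one launch call per graph execution

    cudaStreamSynchronize(stream1);
    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);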

CREATE GRAPHS DIRECTLY
Map Graph-Based Workflows Directly Into CUDA

    // Define graph of work + dependencies
    cudaGraphCreate(&graph);

    cudaGraphAddNode(graph, kernel_a, {}, ...);
    cudaGraphAddNode(graph, kernel_b, { kernel_a }, ...);
    cudaGraphAddNode(graph, kernel_c, { kernel_a }, ...);
    cudaGraphAddNode(graph, kernel_d, { kernel_b, kernel_c }, ...);

    // Instantiate graph and apply optimizations
    cudaGraphInstantiate(&instance, graph);

    // Launch executable graph 100 times
    for(int i=0; i<100; i++)
        cudaGraphLaunch(instance, stream);

[Diagram: graph from a framework: A -> {B, C} -> D]

(cudaGraphAddNode above is slide shorthand; the actual CUDA 10 entry points are per node type, e.g. cudaGraphAddKernelNode, as sketched earlier.)

GRAPH EXECUTION SEMANTICS
Order Graph Work With Other Non-Graph CUDA Work

    void launchWork(cudaGraphExec_t i1, cudaGraphExec_t i2, CPU_Func cpu, cudaStream_t stream)
    {
        A<<< 256, 256, 0, stream >>>();                   // Kernel launch
        cudaGraphLaunch(i1, stream);                      // Graph1 launch
        cudaStreamAddCallback(stream, cpu, nullptr, 0);   // CPU callback
        cudaGraphLaunch(i2, stream);                      // Graph2 launch

        cudaStreamSynchronize(stream);
    }

[Diagram: the stream orders kernel A, Graph1, the CPU callback, and Graph2 back-to-back]

If you can put it in a CUDA stream, you can run it together with a graph.

GRAPHS IGNORE STREAM SERIALIZATION RULES
Launch Stream Is Used Only For Ordering With Other Work

Branches in the graph still execute concurrently, even though the graph is launched into a stream.

[Diagram: the stream orders kernel A, the graph, and CPU work; within the graph, branches {B, X}, {C, D}, {E, Y} run concurrently before End]

CROSS-DEVICE DEPENDENCIES
Graphs May Span Multiple GPUs

CUDA is closest to the O/S and the hardware:
▪ Can optimize multi-device dependencies
▪ Can optimize heterogeneous dependencies
▪ Define locality per-node

[Diagrams: multi-device execution with graph nodes A–D split across GPU 0 and GPU 1; heterogeneous execution with nodes alternating between GPU and CPU]

NSIGHT DEVELOPER TOOLS

NSIGHT PRODUCT FAMILY

Nsight Systems – System-wide application algorithm tuning
Nsight Compute – CUDA Kernel Profiling and Debugging
Nsight Graphics – Graphics Shader Profiling and Debugging
IDE Plugins – Nsight Eclipse Edition / Visual Studio (Editor, Debugger)

NSIGHT SYSTEMS
System-wide Performance Analysis

Observe Application Behavior: CPU threads, GPU traces, Memory Bandwidth and more

Locate Optimization Opportunities: CUDA & OpenGL, UVM transfers, User Annotations using NVTX

Ready for Big Data: Fast GUI capable of visualizing in excess of 10 million events.

https://developer.nvidia.com/nsight-systems

[Nsight Systems timeline screenshot: processes and threads, thread state, thread/core migration, CUDA and OpenGL API trace, cuDNN and cuBLAS trace, kernel and memory transfer activities, multi-GPU]

NVIDIA NSIGHT COMPUTE
Next Generation Kernel Profiler

▪ Interactive CUDA API debugging and kernel profiling
▪ Fast Data Collection
▪ Improved Workflow (Baselining, Metric Data)
▪ Fully Customizable (Programmable UI/Rules)
▪ Command Line, Standalone, IDE Integration
▪ Platform Support
   ▪ OS: Linux (x86, POWER, ARM), Windows
   ▪ GPUs: Pascal, Volta, Turing

[Screenshot callouts: Kernel Profile Comparisons with Baseline, Source Correlation]

EXECUTION MODEL

CUDA BASICS
Blocks of threads, warps

Single Instruction Multiple Threads (SIMT) model

CUDA hierarchy: Grid -> Blocks -> Warps -> Threads

One warp = 32 threads.

Why does it matter? Many optimizations are based on behavior at the warp level.

CUDA BASICS
Mapping threads

Thread blocks can be 1D, 2D, 3D – only for convenience. The hardware "looks" at threads in 1D.

Consecutive 32 threads belong to the same warp

Example: 80 threads (40 threads in X, 2 rows of threads in Y) -> 3 warps (96 threads), with 16 inactive threads in the 3rd warp.

[Diagram: the 40x2 block split by consecutive linear thread IDs into warps 1, 2, 3]
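For illustration (not from the original deck), this is how the hardware's 1D view, and therefore each thread's warp and lane, can be derived from a multi-dimensional block:

    // Linearize a 1D/2D/3D thread block the way the hardware does (x fastest, then y, then z).
    __device__ int linear_thread_id()
    {
        return threadIdx.x
             + threadIdx.y * blockDim.x
             + threadIdx.z * blockDim.x * blockDim.y;
    }

    __device__ int warp_id() { return linear_thread_id() / 32; }   // which warp within the block
    __device__ int lane_id() { return linear_thread_id() % 32; }   // position within the warp

    // E.g. for a 40x2 block, thread (39, 0) has linear ID 39 -> warp 1, lane 7;
    // thread (0, 1) has linear ID 40 -> also warp 1: that warp spans both rows.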

CUDA BASICS
Control Flow

Different warps can execute different code with no impact on performance: each warp maintains its own Program Counter.

Different code paths inside the same warp? Threads that don't participate are masked out, but the whole warp executes both sides of the branch (see the example below).
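A small illustrative kernel (not from the original deck; A, B, C, D stand for arbitrary device functions) showing the branch used in the next diagram:

    __device__ void A();  __device__ void B();  __device__ void C();  __device__ void D();

    // With a 40x2 block, warp 2 contains threads from both rows, so it must
    // execute B (for its y==0 threads) and C (for its y==1 threads).
    __global__ void branchy()
    {
        A();
        if (threadIdx.y == 0)
            B();
        else
            C();
        D();
    }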

CONTROL FLOW

[Diagram: a 40x2 block (threadIdx.x 0..39, threadIdx.y 0..1) running
    A;
    if (threadIdx.y == 0)
        B;
    else
        C;
    D;
Over time, Warp 1 executes A B D, Warp 2 (which spans both rows) executes A B C D, and Warp 3 executes A C D]

CONTROL FLOW
Takeaways

Minimize thread divergence inside a warp

Divergence between warps is fine

Maximize “useful” cycles for each warp

THREADS ARE THREADS
New in Volta

Program counter: before Volta, per warp; on Volta, per thread.

Volta guarantees Forward Progress for diverged threads in a warp

Allows exchanging data between diverged threads in a warp, e.g. mutexes among warp threads. Allows writing natural code that would have deadlocked before.

THREADS ARE THREADS
Example

    lock = 0;
    while (lock == 0)
        lock = tryGetLock();
    doSomething();
    releaseLock();

Pre-Volta: the code might deadlock in the loop if the thread that holds the lock cannot make forward progress and release it.

These device functions could be implemented with atomics, or volatile pointers.
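One possible sketch of those device functions using global-memory atomics (the pointer parameter and the 0 = free / 1 = taken convention are assumptions, not from the original deck):

    // Returns 1 if the lock was acquired, 0 otherwise.
    __device__ int tryGetLock(int *lock_ptr)
    {
        // atomicCAS returns the previous value: 0 means we took the lock.
        return atomicCAS(lock_ptr, 0, 1) == 0;
    }

    __device__ void releaseLock(int *lock_ptr)
    {
        __threadfence();            // make this thread's prior writes visible first
        atomicExch(lock_ptr, 0);    // mark the lock as free again
    }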

THREADS ARE THREADS
Thread re-convergence

Don’t assume the threads in a warp are re-converged or executing in lock-step mode. Use __syncwarp() to synchronize the threads in a warp.

Shuffle and warp vote functions are deprecated. Use the new equivalent "_sync" functions. The extra parameter tells the compiler/hardware which threads are expected to participate, because they might not all reach the call at the same time. E.g. __shfl_up(value, 1) becomes __shfl_up_sync(0xffffffff, value, 1).

Full efficiency only when all the 32 threads of a warp are converged!
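As an illustration (not from the original deck), a warp-level inclusive scan written with the _sync shuffle variants; the full mask assumes all 32 lanes of the warp call the function together:

    // Inclusive prefix sum across one warp using __shfl_up_sync.
    __device__ int warp_inclusive_scan(int value)
    {
        const unsigned full_mask = 0xffffffff;   // all 32 lanes are expected to participate
        int lane = threadIdx.x & 31;

        for (int offset = 1; offset < 32; offset *= 2) {
            int n = __shfl_up_sync(full_mask, value, offset);
            if (lane >= offset)                  // lanes below 'offset' have no source lane
                value += n;
        }
        return value;
    }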

THREADS ARE THREADS
How to deal with warp-synchronous code?

Update/fix the code!

Use Cooperative Groups (GTC 2017 talk s7622)

Compile for an older architecture (disables the forward-progress guarantee):
-gencode arch=compute_60,code=sm_70 (binary), -arch=compute_60 (PTX JIT)

MEMORY SUBSYSTEM

VOLTA MEMORY SUBSYSTEM
Tesla V100

▪ 80 Streaming Multiprocessors
▪ 256 KB register file per SM (20 MB total)
▪ Unified Shared Mem / L1 Cache: 128 KB per SM, variable split (10 MB total, 14 TB/s); Volta caches L1 writes
▪ 6 MB L2 Cache, write-back
▪ 16/32 GB HBM2 DRAM (900 GB/s)
▪ PCIe and NVLINK connectivity

TURING MEMORY SUBSYSTEM
RTX 8000

▪ 72 Streaming Multiprocessors
▪ 256 KB register file per SM (18.5 MB total)
▪ Unified Shared Mem / L1 Cache: 96 KB per SM, variable split (7 MB total, 8 TB/s); Turing caches L1 writes
▪ 6 MB L2 Cache, write-back
▪ 24 GB GDDR6 DRAM (672 GB/s)
▪ PCIe and NVLINK connectivity

L1, L2 CACHES
Why do GPUs have caches?

In general, not for temporal locality

100s ~ 1000s of threads running per SM, tens of thousands of threads sharing the L2 cache

L1, L2 are small per thread

For example, at 2048 threads/SM, with 80 SMs: 64 bytes L1, 38 Bytes L2 per thread

L1, L2 CACHES
Cache Lines & Sectors

Memory access granularity = 32 Bytes = 1 sector

An L1/L2 cache line is 128 bytes, made of 4 sectors. Cache "management" granularity = 1 cache line.

[Diagram: a 128-byte cache line, aligned to a 128-byte boundary, made of Sector 0, Sector 1, Sector 2, Sector 3 (32 bytes each)]

ACCESS PATTERNS
Warps and Sectors

For each warp: How many sectors needed?

Depends on addresses, active threads, access size.

Natural element sizes = 1B, 2B, 4B, 8B, 16B.

[Diagram: threads 0–31 of a warp each access a consecutive 4-byte element starting at a 128-byte boundary -> 4 sectors needed]

ACCESS PATTERNS
Warps and Sectors

[Diagram: the same 4-byte accesses shifted off the 128-byte boundary -> 5 sectors needed]

128 bytes requested, 160 bytes read (80% efficiency)
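A minimal kernel (not from the original deck) that reproduces these two cases; with offset = 0 each warp touches 4 sectors, with offset = 1 it touches 5:

    // Copy with a configurable element offset to show aligned vs. unaligned warp accesses.
    // 'in' must hold at least n + offset elements.
    __global__ void offset_copy(float *out, const float *in, int offset, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i + offset];   // offset != 0 shifts each warp off the 128-byte boundary
    }

    // offset_copy<<<blocks, 256>>>(d_out, d_in, 0, n);   // 4 sectors per warp (100% efficiency)
    // offset_copy<<<blocks, 256>>>(d_out, d_in, 1, n);   // 5 sectors per warp (80% efficiency)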

ACCESS PATTERNS
Warps and Sectors

[Diagram: the same unaligned 4-byte access; the fifth sector overlaps the data of the next warp]

With >1 warp per block, this sector might be found in L1 or L2.

ACCESS PATTERNS
Warps and Sectors

[Diagram: all 32 threads of the warp access the same address -> 1 sector needed]

L1, L2 CACHES
Why do GPUs have caches?

Caches on GPUs can help with:

“Smoothing” irregular, unaligned access patterns

Caching common data accessed by many threads

Faster register spills, local memory

Can help in codes that don’t use shared memory

SHARED MEMORY

Scratch-pad memory on each SM. User-managed cache: the hardware does not evict data. Data written to SMEM stays there until the code overwrites it or the threadblock finishes execution.

Useful for:
▪ Storing frequently-accessed data, to reduce DRAM accesses
▪ Communication among threads of a threadblock

Performance benefits compared to DRAM: 20-40x lower latency, ~15x higher bandwidth.
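A minimal sketch (not from the original deck) that uses shared memory for both purposes at once: each element is read from DRAM once, then exchanged between threads of the block:

    // Reverse each block-sized segment in place, communicating through shared memory.
    __global__ void reverse_in_block(int *data)
    {
        __shared__ int tile[256];                       // assumes blockDim.x == 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = data[i];                    // one DRAM read per element
        __syncthreads();                                // all writes to tile are now visible

        data[i] = tile[blockDim.x - 1 - threadIdx.x];   // threads exchange data via SMEM
    }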

UNIFIED SHARED MEM / L1 CACHE
Variable split

Volta: 6 possible smem / L1 splits: 96 KB / 32 KB, 64 KB / 64 KB, 32 KB / 96 KB, 16 KB / 112 KB, 8 KB / 120 KB, 0 KB / 128 KB
Turing: 2 possible smem / L1 splits: 64 KB / 32 KB, 32 KB / 64 KB

How to specify the L1 / Smem split: cudaFuncSetAttribute (MyKernel, cudaFuncAttributePreferredSharedMemoryCarveout, carveout);

The driver usually does a pretty good job at choosing the right split.

To overcome the 48 KB per threadblock limitation, call: cudaFuncSetAttribute(MyKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, maxsize);
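For example (a sketch, not from the original deck; MyKernel and the 64 KB figure are illustrative), after opting in, the larger size is simply requested as dynamic shared memory at launch:

    int maxbytes = 64 * 1024;   // 64 KB, above the default 48 KB limit (requires the opt-in)
    cudaFuncSetAttribute(MyKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, maxbytes);

    // Third launch parameter = dynamic shared memory bytes per threadblock.
    MyKernel<<<grid, block, maxbytes, stream>>>(/* kernel arguments */);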

https://developer.nvidia.com/computeworks
http://on-demand.gputechconf.com