OPTIMIZING CUDA APPLICATIONS FOR THE VOLTA/TURING ARCHITECTURE
Vishal Mehta, Maxim Milakov, NVIDIA, Oct 18, 2018

NEW FEATURES IN CUDA ECOSYSTEM

TURING AND NEW SYSTEMS: New GPU Architecture, Tensor Cores, NVSwitch Fabric, DGX2, RTcore
CUDA PLATFORM: CUDA Graphs, Vulkan & DX12 Interop, Warp Matrix Multiply Accumulate (WMMA)
LIBRARIES: GPU-accelerated hybrid JPEG decoding, Symmetric Eigenvalue Solvers, FFT Scaling
DEVELOPER TOOLS: New Nsight Products – Nsight Systems and Nsight Compute
Scientific Computing

AGENDA
New Features:
Tensor Cores
RTcore
CUDA Graphs
Nsight Developer Tools
Optimization strategies:
Volta/Turing Execution Model
Volta/Turing Memory Subsystem

TENSOR CORES

VOLTA / TURING SM

                      V100 (Volta)   TU102 (Turing)
FP64                  32             2
INT32                 64             64
FP32                  64             64
Tensor Cores          8              8
RT Core               -              1
Register File         256 KB         256 KB
L1 and shmem          128 KB         96 KB
Max threads           2048           1024
Compute Capability    70             75*

* Volta (cc70) code runs on Turing without JIT or recompile!

TENSOR CORES
New in Volta, Extended in Turing

GPU     SMs   Total Tensor Cores   Peak Half FLOPS   Peak INT8 OPS   Peak INT4 OPS   Peak Binary OPS
V100    80    640                  125 TFLOPS        N.A.            N.A.            N.A.
TU102   72    576                  130.5 TFLOPS      261 TOPS        522 TOPS        2088 TOPS

▪ half precision inputs, half / float accumulator
▪ 8bit/4bit INT inputs, 32-bit INT accumulator
▪ 1bit Binary inputs, 32-bit INT accumulator (XOR + POPC)
Used via CUBLAS, CUDNN, CUTLASS, TensorRT Exposed in CUDA 10 (4bit INT and 1bit binary are experimental)
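
The peak figures in the table can be cross-checked with a little arithmetic. The sketch below assumes the per-SM rates quoted in this deck's warp matrix function table (1024 half-precision ops per clock per SM, 2048 for 8-bit, 4096 for 4-bit, 16384 for 1-bit). The clock rate is not stated on the slides, so it is derived from the quoted 130.5 TFLOPS rather than taken from a data sheet. Function names are illustrative only.

```cpp
#include <cassert>

// Peak throughput in tera-ops per second: SMs x ops/clock/SM x clock (GHz) / 1000.
double peakTeraOps(int sms, int opsPerClockPerSm, double clockGHz) {
    assert(sms > 0 && opsPerClockPerSm > 0 && clockGHz > 0.0);
    return sms * static_cast<double>(opsPerClockPerSm) * clockGHz / 1000.0;
}

// Clock rate (GHz) implied by a quoted peak. This is a derived value,
// not a published specification.
double impliedClockGHz(double quotedTeraOps, int sms, int opsPerClockPerSm) {
    assert(sms > 0 && opsPerClockPerSm > 0);
    return quotedTeraOps * 1000.0 / (sms * static_cast<double>(opsPerClockPerSm));
}
```

For TU102, impliedClockGHz(130.5, 72, 1024) comes out at roughly 1.77 GHz, and feeding that clock back into peakTeraOps with 2048, 4096 and 16384 ops per clock reproduces the 261, 522 and 2088 TOPS columns.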

TURING TENSOR CORE
New Warp Matrix Functions

WMMA operations now include 8-bit integer along with FP16:
▪ Warp Matrix Multiply Accumulate
▪ Signed & unsigned 8-bit input
▪ 32-bit integer accumulator
▪ Input/Output dimensions similar to FP16
▪ 2048 ops per cycle, per SM for 8bit
▪ nvcuda::wmma

Tile shapes (D = A x B + C):
WMMA 16x16x16:   D (16x16) = A (16x16) x B (16x16) + C (16x16)
WMMA 32x8x16:    D (32x8)  = A (32x16) x B (16x8)  + C (32x8)
WMMA 8x32x16:    D (8x32)  = A (8x16)  x B (16x32) + C (8x32)
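
To make the tile shapes concrete, here is a host-side reference for D = A x B + C with signed 8-bit inputs and a 32-bit accumulator. It is a plain C++ sketch for checking results, not the nvcuda::wmma API. The row-major layout and the function names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference D = A x B + C for one M x N x K tile.
// A is M x K and B is K x N (row-major, signed 8-bit); C and D are M x N int32.
std::vector<int32_t> imma8Reference(const std::vector<int8_t>& a,
                                    const std::vector<int8_t>& b,
                                    const std::vector<int32_t>& c,
                                    int m, int n, int k) {
    assert(a.size() == static_cast<std::size_t>(m) * k);
    assert(b.size() == static_cast<std::size_t>(k) * n);
    assert(c.size() == static_cast<std::size_t>(m) * n);
    std::vector<int32_t> d(c);  // start from the accumulator tile
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            for (int kk = 0; kk < k; ++kk)
                d[i * n + j] += static_cast<int32_t>(a[i * k + kk]) * b[kk * n + j];
    return d;
}

// Operations per tile, counting each multiply and each add as one operation.
int opsPerTile(int m, int n, int k) { return 2 * m * n * k; }
```

All three shapes have the same M x N x K product, so opsPerTile returns 8192 for each of them; the shapes differ only in how the work is laid out.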

EXPERIMENTAL WARP MATRIX FUNCTIONS
Turing Enables Experimental Sub-Byte Tensor Core Operations

Experimental Sub-Byte Operations
▪ 4-bit signed & unsigned input
▪ 1-bit input with custom matrix operations
▪ 32-bit accumulator output

Access via special namespace: nvcuda::wmma::experimental

namespace experimental {
    namespace precision {
        struct u4; // 4-bit unsigned
        struct s4; // 4-bit signed
        struct b1; // 1-bit
    }
    enum bmmaBitOp { bmmaBitOpXOR = 1 };
    enum bmmaAccumulateOp { bmmaAccumulateOpPOPC = 1 };
}

Enable researchers to experiment with ultra low precision!
Experimental: subject to API changes, not functionality.

WMMA – IMMA 4BIT
New for Turing (Experimental)

D = A x B + C
▪ A: 8-by-32 x 4b (each row is 128 bits)
▪ B: 32-by-8 x 4b (each column is 128 bits)
▪ C, D: 8-by-8 x int32

$D_{i,j} = \sum_{k=0}^{31} A_{i,k} \cdot B_{k,j} + C_{i,j}$
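
The 4-bit formula can be checked on the host with a small reference. The sketch below computes one output element from a packed 128-bit row of A and a packed 128-bit column of B. The packing order (even elements in the low nibble) is an assumption chosen for illustration, since the slides do not specify how fragments are laid out in memory. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend the 4-bit value stored at element index `idx` of a packed array.
// Assumption for illustration: element 2*n is the low nibble of byte n.
int unpackS4(const uint8_t* packed, int idx) {
    assert(packed != nullptr && idx >= 0);
    uint8_t byte = packed[idx / 2];
    int nibble = (idx % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
    return nibble >= 8 ? nibble - 16 : nibble;  // map 8..15 to -8..-1
}

// One output element: dot product of a 32-element row of A with a
// 32-element column of B (16 packed bytes each), added to C.
int32_t dotS4(const uint8_t* aRow, const uint8_t* bCol, int32_t c) {
    int32_t acc = c;
    for (int k = 0; k < 32; ++k)
        acc += unpackS4(aRow, k) * unpackS4(bCol, k);
    return acc;
}
```

Calling dotS4 for each of the 8 x 8 (row, column) pairs yields the full D tile.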

WMMA – BINARY - XOR POPC
New for Turing (Experimental)

D = popc(A xor B) + C
▪ A: 8-by-128 x 1b (each row is 128 bits)
▪ B: 128-by-8 x 1b (each column is 128 bits)
▪ C, D: 8-by-8 x int32

$D_{i,j} = \sum_{k=0}^{127} \mathrm{popc}(A_{i,k} \oplus B_{k,j}) + C_{i,j}$

BINARY TENSOR CORE OPERATION

[Figure: 1-bit input signal -> bitwise XOR operation -> 128-bit population count -> added to the 32-bit integer accumulator -> 32-bit integer output per point. The accumulated bitwise XOR + count is combined with the previous accumulation and with other row/column results.]
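
The XOR + POPC operation amounts to a Hamming distance between a 128-bit row of A and a 128-bit column of B, added to the accumulator. A host-side sketch of one output element follows. The representation of each 128-bit vector as four 32-bit words and the function names are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Count set bits in a 32-bit word (portable stand-in for a popcount instruction).
int popcount32(uint32_t x) {
    int n = 0;
    while (x) { x &= x - 1; ++n; }  // clear the lowest set bit each iteration
    return n;
}

// One output element of the binary operation: XOR a 128-bit row of A with a
// 128-bit column of B (four 32-bit words each), count the set bits, add C.
int32_t xorPopcDot(const uint32_t aRow[4], const uint32_t bCol[4], int32_t c) {
    assert(aRow != nullptr && bCol != nullptr);
    int32_t acc = c;
    for (int w = 0; w < 4; ++w)
        acc += popcount32(aRow[w] ^ bCol[w]);
    return acc;
}
```

Identical inputs contribute nothing to the accumulator, and fully complementary inputs contribute 128.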

NEW TURING WARP MATRIX FUNCTIONS

Input Precision                   Output            Supported Sizes                              Max Ops/Clock/SM   Type
half *                            half or float     16 x 16 x 16, 32 x 8 x 16, 8 x 32 x 16       1024               Native
char, unsigned char               integer (int32)   16 x 16 x 16, 32 x 8 x 16, 8 x 32 x 16       2048               Native
precision::u4 (4-bit unsigned),   integer (int32)   8 x 8 x 32                                   4096               Experimental
precision::s4 (4-bit signed)
precision::b1 (1-bit)             integer (int32)   8 x 8 x 128                                  16384              Experimental

* Also available on Volta sm_70.
Note: WMMA requires recompilation for sm_75 for peak performance

CUTLASS 1.1
High-performance Matrix Multiplication in Open Source templated CUDA C++

[Figure: CUTLASS GEMM Structural Model]

▪ Turing optimized GEMMs
▪ Integer (8-bit, 4-bit and 1-bit) using WMMA
▪ Batched strided GEMM
▪ Support for CUDA 10.0
▪ Updates to documentation and more examples

> 90% Relative to Peak Performance

[Figure: CUTLASS 1.1 on Volta (GV100). Bar chart of % relative to peak (0% to 100%) for the DGEMM, HGEMM, IGEMM, SGEMM, WMMA (F16) and WMMA (F32) kernels, each in the nn, nt, tn and tt layouts.]

https://github.com/NVIDIA/cutlass

TURING RTCORE

RT Cores
Turing GPU RT Cores accelerate ray tracing

RT Cores perform
● Ray-BVH (Bounding Volume Hierarchy) Traversal
● Instancing: 1 Level
● Ray-Triangle Intersection

Return to SM for
● Multi-level Instancing
● Custom Intersection
● Shading

Software vs. Hardware Ray Tracing

[Figure: ray tracing on a pre-Turing SM compared with a Turing SM, with scene primitives Tri1, Tri2, Tri3 and Circle1.]

Rtcore in OPTIX

• Single-ray shader programming model using C++
• Transparently scales across multiple GPUs
• AI Accelerated rendering
• Easy interop with CUDA

http://developer.nvidia.com/optix
http://on-demand.gputechconf.com

CUDA GRAPHS

ASYNCHRONOUS TASK GRAPHS
Execution Optimization When Workflow is Known Up-Front

Example workloads: Deep Neural Network Training, DL Inference, Loop & Function offload, Linear Algebra, HPC Simulation

ALL CUDA WORK FORMS A GRAPH

▪ Node represents operation
▪ Edge represents dependency
▪ Any CUDA stream can be mapped to a graph

[Figure: CUDA work in streams (A, B, C, D, E in one stream and X, Y in another, ordered by "Wait" events, i.e. implicit dependencies) next to the equivalent graph with explicit dependencies, terminating in an End node.]

DEFINITION OF A CUDA GRAPH
Graph Nodes Are Not Just Kernel Launches

Sequence of operations, connected by dependencies. Operations are one of:

Kernel Launch        CUDA kernel running on GPU
CPU Function Call    Callback function on CPU
Memcopy/Memset       GPU data management
Sub-Graph            Graphs are hierarchical

[Figure: example graph with nodes A, B, X, C, D, E, Y and End.]

NEW EXECUTION MECHANISM
Graphs Can Be Generated Once Then Launched Repeatedly

for (int i = 0; i < 1000; i++) {
    launch_graph( G );
}

[Figure: graph G with nodes A, B, X, C, D, E, Y and End.]

EXECUTION OPTIMIZATIONS
Latency & Overhead Reductions

Launch latencies:
▪ CUDA 10.0 takes at least 2.2us CPU time to launch each CUDA kernel on Linux
▪ Pre-defined graph allows launch of any number of kernels in one single operation

[Figure: timelines. With stream launches, the CPU issues Launch A, Launch B, Launch C, Launch D and Launch E one after another before going idle while kernels A to E run. With a graph, the CPU builds the graph, issues a single Launch Graph, and is then idle while A to E run.]

PERFORMANCE IMPACT
Optimizations for Short-Runtime Operations

CPU launch time improvements
▪ Typical: 33% faster than stream launch

Example: Small 3D FFT
▪ 25% end-to-end improvement for a 32^3 3D-FFT (16us with stream launch, 12us with graph launch)

NOTE: Performance impact is workload-dependent. Graphs especially benefit short-running kernels, where overheads account for more of the runtime.
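
The numbers on these two slides follow from simple arithmetic, sketched below. The function names are illustrative, and the fixed per-launch cost is a simplifying assumption (the slide only gives a lower bound of 2.2us).

```cpp
#include <cassert>

// CPU time (microseconds) spent issuing `kernels` separate launches,
// assuming a fixed cost per launch.
double streamLaunchCpuTimeUs(int kernels, double perLaunchUs) {
    assert(kernels >= 0 && perLaunchUs >= 0.0);
    return kernels * perLaunchUs;
}

// Fractional reduction in runtime when going from `beforeUs` to `afterUs`.
double relativeImprovement(double beforeUs, double afterUs) {
    assert(beforeUs > 0.0);
    return (beforeUs - afterUs) / beforeUs;
}
```

With the FFT figures, relativeImprovement(16.0, 12.0) gives 0.25, the 25% quoted above. Issuing five kernels individually costs at least streamLaunchCpuTimeUs(5, 2.2), about 11us of CPU time, which is why short-running kernels benefit most from a single graph launch.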

THREE-STAGE EXECUTION MODEL

Define: Single Graph “Template”
▪ Created in host code, or loaded from disk, or built up from libraries

Instantiate: Multiple “Executable Graphs”
▪ Snapshot of template
▪ Sets up & initializes GPU execution structures (create once, run many times)

Execute: Executable Graphs Running in CUDA Streams
▪ Concurrency in graph is not limited by stream (see later)

CONVERT CUDA STREAM INTO A GRAPH
Construct a graph from normal CUDA stream syntax

// Start by initiating stream capture
// (the stream is passed by value; CUDA 10.1 and later also take a capture mode argument)
cudaStreamBeginCapture(stream1);

// Build stream work as usual
A<<< ..., stream1 >>>();
cudaEventRecord(e1, stream1);
B<<< ..., stream1 >>>();
cudaStreamWaitEvent(stream2, e1);
C<<< ..., stream2 >>>();
cudaEventRecord(e2, stream2);
cudaStreamWaitEvent(stream1, e2);
D<<< ..., stream1 >>>();

// Now convert the stream to a graph
cudaStreamEndCapture(stream1, &graph);

[Figure: stream1 runs A, B and D, stream2 runs C, with Wait events between them; the resulting graph is A -> {B, C} -> D.]

Capture follows inter-stream dependencies to create forks & joins.

CREATE GRAPHS DIRECTLY
Map Graph-Based Workflows Directly Into CUDA

// Simplified pseudocode: the runtime API adds nodes with typed calls such as
// cudaGraphAddKernelNode, which take node handles and parameter structs.

// Define graph of work + dependencies
cudaGraphCreate(&graph, 0);

cudaGraphAddNode(graph, kernel_a, {}, ...);
cudaGraphAddNode(graph, kernel_b, { kernel_a }, ...);
cudaGraphAddNode(graph, kernel_c, { kernel_a }, ...);
cudaGraphAddNode(graph, kernel_d, { kernel_b, kernel_c }, ...);

// Instantiate graph and apply optimizations
cudaGraphInstantiate(&instance, graph);

// Launch executable graph 100 times
for (int i = 0; i < 100; i++)
    cudaGraphLaunch(instance, stream);

[Figure: graph from framework, A -> {B, C} -> D.]
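
The node-and-dependency structure can be modelled on the host without any CUDA calls. The sketch below is a hypothetical TaskGraph type, not the CUDA runtime API; it mirrors the shape of the pseudocode and computes, for each node, the length of the longest dependency chain leading to it. Nodes with the same level have no path between them, so a scheduler is free to run them concurrently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Minimal host-side model of a task graph (illustrative only).
struct TaskGraph {
    std::vector<std::string> names;
    std::vector<std::vector<int>> deps;  // deps[n] = nodes that must finish before n

    // Add a node that depends on previously added nodes; returns its id.
    int addNode(const std::string& name, const std::vector<int>& dependsOn) {
        for (int d : dependsOn) {
            assert(d >= 0 && d < static_cast<int>(names.size()));
        }
        names.push_back(name);
        deps.push_back(dependsOn);
        return static_cast<int>(names.size()) - 1;
    }

    // Level of a node = length of the longest dependency chain leading to it.
    // Because nodes may only depend on earlier ids, ids are already in
    // topological order and one forward pass suffices.
    std::vector<int> levels() const {
        std::vector<int> level(names.size(), 0);
        for (std::size_t n = 0; n < names.size(); ++n)
            for (int d : deps[n])
                level[n] = std::max(level[n], level[d] + 1);
        return level;
    }
};
```

For the four-kernel example, kernel_a is at level 0, kernel_b and kernel_c share level 1 (the fork), and kernel_d sits at level 2 (the join).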

GRAPH EXECUTION SEMANTICS
Order Graph Work With Other Non-Graph CUDA Work

void launchWork(cudaGraphExec_t i1, cudaGraphExec_t i2, CPU_Func cpu, cudaStream_t stream) {
    A <<< 256, 256, 0, stream >>>();                 // Kernel launch
    cudaGraphLaunch(i1, stream);                     // Graph1 launch
    cudaStreamAddCallback(stream, cpu, nullptr, 0);  // CPU callback (no user data, flags = 0)
    cudaGraphLaunch(i2, stream);                     // Graph2 launch

    cudaStreamSynchronize(stream);
}

If you can put it in a CUDA stream, you can run it together with a graph

[Figure: the stream executes kernel A, then Graph1, then the CPU callback, then Graph2.]

GRAPHS IGNORE STREAM SERIALIZATION RULES
Launch Stream Is Used Only For Ordering With Other Work

Branches in graph still execute concurrently even though graph is launched into a stream

[Figure: a stream containing kernel A, a graph (A, B, X, C, D, E, Y, End) and a CPU callback; the graph's branches run side by side.]

CROSS-DEVICE DEPENDENCIES
Graphs May Span Multiple GPUs

CUDA is closest to the O/S and the hardware
▪ Can optimize multi-device dependencies
▪ Can optimize heterogeneous dependencies
▪ Define locality per-node

[Figure: Multi-Device Execution (graph nodes split across GPU 0 and GPU 1) and Heterogeneous Execution (graph nodes A, B, C, D assigned to GPU and CPU).]

NSIGHT DEVELOPER TOOLS

NSIGHT PRODUCT FAMILY

Nsight Systems     System-wide application algorithm tuning
Nsight Compute     CUDA Kernel Profiling and Debugging
Nsight Graphics    Graphics Shader Profiling and Debugging
IDE Plugins        Nsight Eclipse Edition/Visual Studio (Editor, Debugger)

NSIGHT SYSTEMS
System-wide Performance Analysis
Observe Application Behavior: CPU threads, GPU traces, Memory Bandwidth and more
Locate Optimization Opportunities: CUDA & OpenGL APIs, UVM transfers, User Annotations using NVTX
Ready for Big Data: Fast GUI capable of visualizing in excess of 10 million events.
https://developer.nvidia.com/nsight-systems

[Figure: Nsight Systems screenshot, with callouts:]
▪ Processes and threads
▪ Thread state
▪ Thread/core migration
▪ CUDA and OpenGL API trace
▪ cuDNN and cuBLAS trace
▪ Kernel and memory transfer activities
▪ Multi-GPU

NVIDIA NSIGHT COMPUTE
Next Generation Kernel Profiler

▪ Interactive CUDA API debugging and kernel profiling
▪ Fast Data Collection
▪ Improved Workflow and Fully Customizable (Baselining, Programmable UI/Rules)
▪ Command Line, Standalone, IDE Integration

Platform Support
▪ OS: Linux (x86, POWER, ARM), Windows
▪ GPUs: Pascal, Volta, Turing

[Figure: Nsight Compute screenshot, with callouts: Kernel Profile Comparisons with Baseline, Metric Data, Source Correlation.]

EXECUTION MODEL

CUDA BASICS
Blocks of threads, warps

Single Instruction Multiple Threads (SIMT) model
CUDA hierarchy: Grid -> Blocks -> Warps -> Threads
One warp = 32 threads.
Why does it matter? Many optimizations are based on behavior at the warp level

CUDA BASICS
Mapping threads

Thread blocks can be 1D, 2D, 3D. This is only for convenience: hardware “looks” at threads in 1D.

Consecutive 32 threads belong to the same warp.

Example: a block of 80 threads (40 threads in X, 2 rows of threads in Y) occupies 3 warps (96 threads), leaving 16 inactive threads in the 3rd warp.
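
The mapping from a multi-dimensional thread index to a warp can be written out as a few lines of host-side arithmetic. The sketch uses the standard linearization (x fastest, then y, then z); the function names are illustrative.

```cpp
#include <cassert>

const int kWarpSize = 32;

// Linear thread id within a block: x varies fastest, then y, then z.
int linearThreadId(int x, int y, int z, int dimX, int dimY) {
    assert(x >= 0 && x < dimX && y >= 0 && y < dimY && z >= 0);
    return x + y * dimX + z * dimX * dimY;
}

// Warp (0-based) that a thread falls into.
int warpIndex(int x, int y, int z, int dimX, int dimY) {
    return linearThreadId(x, y, z, dimX, dimY) / kWarpSize;
}

// Number of warps needed for a block, rounded up.
int warpsPerBlock(int dimX, int dimY, int dimZ) {
    assert(dimX > 0 && dimY > 0 && dimZ > 0);
    return (dimX * dimY * dimZ + kWarpSize - 1) / kWarpSize;
}

// Lanes left unused in the last, partially filled warp.
int inactiveLanes(int dimX, int dimY, int dimZ) {
    return warpsPerBlock(dimX, dimY, dimZ) * kWarpSize - dimX * dimY * dimZ;
}
```

For the 40 x 2 block, warpsPerBlock(40, 2, 1) is 3 and inactiveLanes(40, 2, 1) is 16. Thread (0, 1) has linear id 40, so it shares the second warp with threads 32 to 39 of the first row.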

CUDA BASICS
Control Flow

Different warps can execute different code
▪ No impact on performance
▪ Each warp maintains its own Program Counter

Different code path inside the same warp?
▪ Threads that don’t participate are masked out, but the whole warp executes both sides of the branch
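
CONTROL FLOW

Block of 40 x 2 threads: threadIdx.x = 0 .. 39, threadIdx.y = 0 .. 1, mapped onto Warp 1, Warp 2 and Warp 3.

A;
if (threadIdx.y == 0)
    B;
else
    C;
D;

Instructions executed over time:
Warp 1 (lanes 0 .. 31):   A B D
Warp 2 (lanes 0 .. 31):   A B C D
Warp 3 (lanes 0 .. 31):   A C D

CONTROL FLOW
Takeaways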
Minimize thread divergence inside a warp
Divergence between warps is fine
Maximize “useful” cycles for each warp
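
The cost of divergence in the 40 x 2 example can be reproduced with a small host-side model. The sketch below checks whether a warp contains threads from both sides of the `threadIdx.y == 0` branch and counts instruction slots for the sequence A; if/else; D, assuming each of A, B, C and D costs one slot. Warps are numbered from 0 here, and the function names are illustrative.

```cpp
#include <cassert>

// True if the given warp of a dimX x dimY block holds threads from both
// sides of the branch `if (threadIdx.y == 0) ... else ...`.
bool warpDivergesOnYZero(int warp, int dimX, int dimY) {
    assert(warp >= 0 && dimX > 0 && dimY > 0);
    bool takesIf = false, takesElse = false;
    int total = dimX * dimY;
    for (int lane = 0; lane < 32; ++lane) {
        int tid = warp * 32 + lane;  // linear thread id
        if (tid >= total) break;     // inactive lanes take no branch
        int y = tid / dimX;          // recover threadIdx.y
        if (y == 0) takesIf = true; else takesElse = true;
    }
    return takesIf && takesElse;
}

// Instruction slots a warp spends on A; if/else; D.
// A and D always cost one slot each; a diverged warp pays for both B and C.
int instructionSlots(int warp, int dimX, int dimY) {
    return warpDivergesOnYZero(warp, dimX, dimY) ? 4 : 3;
}
```

Only the middle warp of the 40 x 2 block diverges and spends 4 slots. With a 32 x 2 block, each row fills exactly one warp and no warp diverges.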

THREADS ARE THREADS
New in Volta

Program counter:
▪ Before Volta: Per warp
▪ Volta: Per thread

Volta guarantees Forward Progress for diverged threads in a warp.

This allows diverged threads in a warp to exchange data (e.g. mutexes among warp threads), and allows writing natural code that would have deadlocked before.

THREADS ARE THREADS
Example

lock = 0;
while (lock == 0)
    lock = tryGetLock();
doSomething;
releaseLock();

Pre-Volta: The code might deadlock in the loop, if the thread that gets the lock cannot forward-progress and release the lock.

These device functions could be implemented with atomics, or volatile pointers

THREADS ARE THREADS
Thread re-convergence

Don’t assume the threads in a warp are re-converged or executing in lock-step mode. Use __syncwarp() to synchronize the threads in a warp.

The legacy shuffle and warp vote functions are deprecated. Use the new equivalent “_sync” functions. The extra parameter tells the compiler/hardware which threads are expected to participate, because they might not all reach the call at the same time. E.g.: __shfl_up(value, 1) becomes __shfl_up_sync(0xffffffff, value, 1)
Full efficiency only when all the 32 threads of a warp are converged!

THREADS ARE THREADS
How to deal with warp-synchronous code?

▪ Update/fix the code!
▪ Use Cooperative Groups (GTC 2017 talk s7622)
▪ Compile for an older architecture (disable forward progress):
    -arch=compute_60 -code=sm_70 (binary)
    -arch=compute_60 (PTX JIT)

MEMORY SUBSYSTEM

VOLTA MEMORY SUBSYSTEM
Tesla V100

▪ 80 Streaming Multiprocessors
▪ Registers: 256KB register file per SM (20 MB total)
▪ Unified Shared Mem / L1 Cache: 128KB per SM, variable split (10MB Total, 14 TB/s). Volta caches L1 writes
▪ 6 MB L2 Cache, L2 is write back; the L2 also connects to PCIe and NVLINK
▪ DRAM: 16/32 GB HBM2 (900 GB/s)

TURING MEMORY SUBSYSTEM
Quadro RTX 8000

▪ 72 Streaming Multiprocessors
▪ Registers: 256KB register file per SM (18.5 MB)
▪ Unified Shared Mem / L1 Cache: 96KB per SM, variable split (7MB Total, 8 TB/s). Turing caches L1 writes
▪ 6 MB L2 Cache, L2 is write back; the L2 also connects to PCIe and NVLINK
▪ DRAM: 24 GB GDDR6 (672 GB/s)

L1, L2 CACHES
Why do GPUs have caches?
In general, not for temporal locality
100s ~ 1000s of threads running per SM, tens of thousands of threads sharing the L2 cache
L1, L2 are small per thread
For example, at 2048 threads/SM, with 80 SMs: 64 bytes L1, 38 Bytes L2 per thread
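
The per-thread figures are a straightforward division, sketched below. The sketch assumes the full 128 KB unified array is configured as L1 and the 6 MB L2 is shared by 80 SMs at 2048 resident threads each; the function name is illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Bytes of cache available per resident thread (integer division, rounded down).
int64_t cacheBytesPerThread(int64_t cacheBytes, int64_t residentThreads) {
    assert(cacheBytes >= 0 && residentThreads > 0);
    return cacheBytes / residentThreads;
}
```

cacheBytesPerThread(128 * 1024, 2048) gives 64 bytes of L1 per thread, and cacheBytesPerThread(6 * 1024 * 1024, 2048 * 80) gives 38 bytes of L2 per thread, far too little to rely on temporal locality.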

L1, L2 CACHES
Cache Lines & Sectors

Memory access granularity = 32 Bytes = 1 sector

An L1/L2 cache line is 128 Bytes, made of 4 sectors. Cache ”management” granularity = 1 cache line

[Figure: a 128-Byte cache line with 128-Byte alignment, split into Sector 0, Sector 1, Sector 2 and Sector 3.]

ACCESS PATTERNS
Warps and Sectors

For each warp: How many sectors needed?
Depends on addresses, active threads, access size.
Natural element sizes = 1B, 2B, 4B, 8B, 16B.

[Figures: lanes 0 .. 31 of a warp mapped onto memory addresses 0 .. 352, marked in 32-byte steps.]

▪ 4-Byte element access: 4 sectors
▪ 4-Byte access, unaligned: 5 sectors. 128 bytes requested, 160 bytes read (80% efficiency)
▪ 4-Byte access, unaligned, followed by the next warp: 5 sectors. With >1 warp per block, the extra sector might be found in L1 or L2
▪ Same address: 1 sector
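
The sector counts for each access pattern can be reproduced on the host. The sketch below counts the distinct 32-byte sectors touched by one warp-wide access and the resulting efficiency. It models only address arithmetic, not caching, and the function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

const int kSectorBytes = 32;

// Number of distinct 32-byte sectors touched by one warp-wide access.
// `addresses` holds the byte address used by each active thread;
// inactive threads are simply left out.
int sectorsTouched(const std::vector<uint64_t>& addresses, int accessBytes) {
    assert(accessBytes > 0);
    std::set<uint64_t> sectors;
    for (uint64_t addr : addresses) {
        uint64_t first = addr / kSectorBytes;
        uint64_t last = (addr + accessBytes - 1) / kSectorBytes;
        for (uint64_t s = first; s <= last; ++s) sectors.insert(s);
    }
    return static_cast<int>(sectors.size());
}

// Addresses for 32 threads reading consecutive elements starting at `base`.
std::vector<uint64_t> consecutive(uint64_t base, int accessBytes) {
    std::vector<uint64_t> a;
    for (int lane = 0; lane < 32; ++lane)
        a.push_back(base + static_cast<uint64_t>(lane) * accessBytes);
    return a;
}

// Fraction of the fetched bytes that were actually requested.
double efficiency(int activeThreads, int accessBytes, int sectors) {
    assert(sectors > 0);
    return static_cast<double>(activeThreads) * accessBytes / (sectors * kSectorBytes);
}
```

An aligned 4-byte access (base 0) touches 4 sectors, shifting the base by 4 bytes touches 5 and drops efficiency to 128/160 = 80%, and 32 threads reading the same address touch a single sector.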

L1, L2 CACHES
Why do GPUs have caches?
Caches on GPUs can help with:
“Smoothing” irregular, unaligned access patterns
Caching common data accessed by many threads
Faster register spills, local memory
Can help in codes that don’t use shared memory

SHARED MEMORY

Scratch-pad memory on each SM
▪ User-managed cache, hardware does not evict data
▪ Data written to SMEM stays there until the code overwrites it or the threadblock finishes execution

Useful for:
▪ Storing frequently-accessed data, to reduce DRAM accesses
▪ Communication among threads of a threadblock

Performance benefits compared to DRAM:
▪ 20-40x lower latency
▪ ~15x higher bandwidth

UNIFIED SHARED MEM / L1 CACHE
Variable split

Possible smem / L1 splits:

Volta (6 possible)    Turing (2 possible)
96KB / 32KB           64KB / 32KB
64KB / 64KB           32KB / 64KB
32KB / 96KB
16KB / 112KB
8KB / 120KB
0KB / 128KB

How to specify the L1 / Smem split:
    cudaFuncSetAttribute(MyKernel, cudaFuncAttributePreferredSharedMemoryCarveout, carveout);

The driver usually does a pretty good job at choosing the right split.

To overcome the 48 KB per threadblock limitation, call:
    cudaFuncSetAttribute(MyKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, maxsize);
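
Choosing a carveout value involves a small calculation, sketched here on the host. The sketch assumes, following the CUDA programming guide's description, that the carveout is an integer percentage of the maximum shared-memory capacity and that a request falling between supported sizes is rounded up to the next one. The supported sizes are taken from the split table above; the function names are illustrative.

```cpp
#include <cassert>
#include <vector>

// Smallest supported shared-memory capacity (KB) that is >= the requested size.
// `supportedKB` must be sorted ascending; returns -1 if the request does not fit.
int roundUpToSupportedSmemKB(int requestedKB, const std::vector<int>& supportedKB) {
    assert(requestedKB >= 0);
    for (int kb : supportedKB)
        if (kb >= requestedKB) return kb;
    return -1;
}

// Carveout as an integer percentage of the maximum shared-memory capacity,
// rounded up so the request is never under-provisioned.
int carveoutPercent(int requestedKB, int maxSmemKB) {
    assert(maxSmemKB > 0 && requestedKB >= 0 && requestedKB <= maxSmemKB);
    return (requestedKB * 100 + maxSmemKB - 1) / maxSmemKB;
}
```

On Volta, a kernel needing 48 KB would pass carveoutPercent(48, 96), i.e. 50, and under the stated assumption would end up with the 64 KB / 64 KB split, since 48 KB is not one of the supported capacities.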

https://developer.nvidia.com/computeworks
http://on-demand.gputechconf.com