OPTIMIZING CUDA APPLICATIONS FOR THE VOLTA/TURING ARCHITECTURE
Vishal Mehta, Maxim Milakov, Oct 18, 2018

NEW FEATURES IN CUDA ECOSYSTEM

TURING AND NEW SYSTEMS: New GPU Architecture, Tensor Cores, RT Core, NVSwitch Fabric, DGX-2
CUDA PLATFORM: CUDA Graphs, Vulkan & DX12 Interop, Warp Matrix Multiply Accumulate (WMMA)

LIBRARIES (Scientific Computing): GPU-accelerated hybrid JPEG decoding, Symmetric Eigenvalue Solvers, FFT Scaling
DEVELOPER TOOLS: New Nsight Products – Nsight Systems and Nsight Compute

AGENDA

New Features:

Tensor Cores

RTcore

CUDA Graphs

Nsight Developer Tools

Optimization strategies:

Volta/Turing Execution Model

Volta/Turing Memory Subsystem

TENSOR CORES

VOLTA / TURING SM

                     V100 (Volta SM)   TU102 (Turing SM)
FP64                 32                2
INT32                64                64
FP32                 64                64
Tensor Cores         8                 8
RT Core              -                 1
Register File        256 KB            256 KB
L1 and shmem         128 KB            96 KB
Max threads          2048              1024
Compute Capability   70                75*

*Volta (cc70) code runs on Turing without JIT or recompile!

TENSOR CORES
New in Volta, Extended in Turing

GPU     SMs   Total Tensor Cores   Peak Half FLOPS   Peak INT8 OPS   Peak INT4 OPS   Peak Binary OPS
V100    80    640                  125 TFLOPS        N.A.            N.A.            N.A.
TU102   72    576                  130.5 TFLOPS      261 TOPS        522 TOPS        2088 TOPS

half precision inputs -> half / float accumulator
8-bit / 4-bit INT inputs -> 32-bit INT accumulator
1-bit binary inputs -> 32-bit INT accumulator (XOR + POPC)

Used via cuBLAS, cuDNN, CUTLASS, TensorRT
Exposed in CUDA 10 (4-bit INT and 1-bit binary are experimental)

TURING TENSOR CORE
New Warp Matrix Functions

▪ WMMA operations now include 8-bit integer along with FP16
▪ Warp Matrix Multiply Accumulate: D = A * B + C
▪ Signed & unsigned 8-bit input
▪ 32-bit integer accumulator
▪ Input/Output dimensions similar to FP16
▪ 2048 ops per cycle, per SM for 8-bit
▪ Exposed via nvcuda::wmma

Supported shapes (M x N x K):
WMMA 16x16x16: D (16x16) = A (16x16) * B (16x16) + C (16x16)
WMMA 32x8x16:  D (32x8)  = A (32x16) * B (16x8)  + C (32x8)
WMMA 8x32x16:  D (8x32)  = A (8x16)  * B (16x32) + C (8x32)
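As a concrete illustration (not from the original deck), a minimal kernel in which one warp performs a single 16x16x16 signed 8-bit WMMA; the matrices are assumed to be densely packed single tiles with leading dimension 16, and dA/dB/dC/dD in the launch comment are hypothetical device pointers:

    #include <mma.h>
    using namespace nvcuda;

    // One warp computes D = A*B + C for a single 16x16x16 tile: s8 inputs, s32 accumulator.
    // Assumes row-major A, column-major B, densely packed tiles (leading dimension 16).
    __global__ void wmma_s8_tile(const signed char *A, const signed char *B,
                                 const int *C, int *D)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, int> acc_frag;

        wmma::load_matrix_sync(a_frag, A, 16);
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major);

        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);       // D = A * B + C

        wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
    }

    // Launch with one warp: wmma_s8_tile<<<1, 32>>>(dA, dB, dC, dD);  compile with -arch=sm_75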

EXPERIMENTAL WARP MATRIX FUNCTIONS
Turing Enables Experimental Sub-Byte Tensor Core Operations

Experimental Sub-Byte Operations
▪ 4-bit signed & unsigned input
▪ 1-bit input with custom matrix operations
▪ 32-bit accumulator output

Access via special namespace: nvcuda::wmma::experimental

    namespace experimental {
        namespace precision {
            struct u4; // 4-bit unsigned
            struct s4; // 4-bit signed
            struct b1; // 1-bit
        }
        enum bmmaBitOp { bmmaBitOpXOR = 1 };
        enum bmmaAccumulateOp { bmmaAccumulateOpPOPC = 1 };
    }

Enable researchers to experiment with ultra-low precision!
Experimental: subject to API changes, not functionality.

WMMA – IMMA 4BIT
New for Turing (Experimental)

D (8-by-8 x int32) = A (8-by-32 x 4b) * B (32-by-8 x 4b) + C (8-by-8 x int32)
Each 32-element row of A and column of B packs into 128 bits.

D(i,j) = sum over k of A(i,k) * B(k,j) + C(i,j), for k = 0 .. 31
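For reference (not from the original deck), the per-element arithmetic above can be written in plain CUDA; the sketch assumes the signed 4-bit operands are bit-packed eight to a 32-bit word, low nibble first:

    // One output element of the experimental 8x8x32 4-bit IMMA:
    // D(i,j) = sum_k A(i,k) * B(k,j) + C(i,j), k = 0..31, signed 4-bit inputs.
    __device__ int imma4_element(const unsigned int a_row[4],   // 32 x s4, packed 8 per word
                                 const unsigned int b_col[4],   // 32 x s4, packed 8 per word
                                 int c)                          // previous accumulator value
    {
        int acc = c;
        for (int k = 0; k < 32; ++k) {
            // Extract the k-th nibble and sign-extend it to 32 bits.
            int a = (int)(a_row[k / 8] << (28 - 4 * (k % 8))) >> 28;
            int b = (int)(b_col[k / 8] << (28 - 4 * (k % 8))) >> 28;
            acc += a * b;
        }
        return acc;
    }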

WMMA – BINARY – XOR POPC
New for Turing (Experimental)

D (8-by-8 x int32) = popc( A (8-by-128 x 1b) XOR B (128-by-8 x 1b) ) + C (8-by-8 x int32)
Each 128-element row of A and column of B packs into 128 bits.

D(i,j) = popc(A(i,k) ^ B(k,j)) + C(i,j), accumulated over k = 0 .. 127
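For reference (not from the original deck), one output element of this binary operation can be computed in plain CUDA with the __popc intrinsic; the sketch assumes the 1-bit operands are packed into four 32-bit words per row/column:

    // One output element of the 8x8x128 binary WMMA:
    // D(i,j) = population_count(A_row_i XOR B_col_j) + C(i,j), over 128 bits.
    __device__ int bmma_element(const unsigned int a_row[4],   // 128-bit row of A, bit-packed
                                const unsigned int b_col[4],   // 128-bit column of B, bit-packed
                                int c)                          // previous accumulator value
    {
        int acc = c;
        for (int w = 0; w < 4; ++w)
            acc += __popc(a_row[w] ^ b_col[w]);                 // XOR, then population count
        return acc;
    }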

BINARY TENSOR CORE OPERATION

1-bit input signals -> bitwise XOR operation -> 128-bit population count added to accumulator -> 32-bit integer output per point.
Other row/column results are accumulated the same way: bitwise XOR + count added to the previous accumulation.

NEW TURING WARP MATRIX FUNCTIONS

Input Precision                   Output            Supported Sizes                          Max Ops/Clock/SM

Native types:
  half *                          half or float     16 x 16 x 16, 32 x 8 x 16, 8 x 32 x 16   1024
  char / unsigned char            integer (int32)   16 x 16 x 16, 32 x 8 x 16, 8 x 32 x 16   2048

Experimental:
  precision::u4 (4-bit unsigned)  integer (int32)   8 x 8 x 32                               4096
  precision::s4 (4-bit signed)    integer (int32)   8 x 8 x 32                               4096
  precision::b1 (1-bit)           integer (int32)   8 x 8 x 128                              16384

* Also available on Volta sm_70. Note: WMMA requires recompilation for sm_75 for peak performance.

CUTLASS 1.1
High-performance Matrix Multiplication in Open Source templated CUDA C++

CUTLASS GEMM Structural Model

CUTLASS 1.1
High-performance Matrix Multiplication in Open Source templated CUDA C++

▪ Turing optimized GEMMs
▪ Integer (8-bit, 4-bit and 1-bit) using WMMA
▪ Batched strided GEMM
▪ Support for CUDA 10.0
▪ Updates to documentation and more examples

[Chart: CUTLASS 1.1 on Volta (GV100) – >90% relative to peak performance across DGEMM, HGEMM, IGEMM, SGEMM, WMMA (F16) and WMMA (F32) kernels, nn/nt/tn/tt layouts]
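To give a flavor of the library (a sketch based on the CUTLASS 1.x basic SGEMM example; exact header and traits names should be checked against the repository), a column-major SGEMM can be launched like this:

    #include "cutlass/gemm/gemm.h"
    #include "cutlass/gemm/sgemm_traits.h"

    // Single-precision GEMM, column-major A and B, default 128x128x8 threadblock tile.
    cudaError_t CutlassSgemmNN(int M, int N, int K, float alpha,
                               float const *A, int lda, float const *B, int ldb,
                               float beta, float *C, int ldc)
    {
        typedef cutlass::gemm::SgemmTraits<cutlass::MatrixLayout::kColumnMajor,
                                           cutlass::MatrixLayout::kColumnMajor,
                                           cutlass::Shape<8, 128, 128> > GemmTraits;
        typedef cutlass::gemm::Gemm<GemmTraits> Gemm;

        typename Gemm::Params params;
        int result = params.initialize(M, N, K, alpha, A, lda, B, ldb, beta, C, ldc, C, ldc);
        if (result) return cudaErrorInvalidValue;

        Gemm::launch(params);                        // launches the GEMM kernel
        return cudaGetLastError();
    }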

https://github.com/NVIDIA/cutlass

TURING RTCORE

RT CORES
Turing GPU

RT Cores accelerate ray tracing.

RT Cores perform:
● Ray-BVH (Bounding Volume Hierarchy) Traversal
● Instancing: 1 Level
● Ray-Triangle Intersection

Return to SM for:
● Multi-level Instancing
● Custom Intersection
● Shading

Software vs. Hardware Ray Tracing

[Diagram: Pre-Turing, the SM tests the primitives (Tri1, Tri2, Tri3, Circle1) in software; on Turing, the RT Core handles BVH traversal and triangle intersection, returning to the SM for shading and custom primitives]

RTcore in OPTIX

• Single-ray programming model using C++
• Transparently scales across multiple GPUs
• AI-accelerated rendering
• Easy interop with CUDA

http://developer.nvidia.com/optix
http://on-demand.gputechconf.com

CUDA GRAPHS

ASYNCHRONOUS TASK GRAPHS
Execution Optimization When Workflow is Known Up-Front

Examples: Deep Neural Network Training, DL Inference, Loop & Function Offload, Linear Algebra, HPC Simulation

ALL CUDA WORK FORMS A GRAPH

Node represents operation; edge represents dependency.
Any CUDA stream can be mapped to a graph.

[Diagram: the same work expressed as CUDA work in streams (implicit dependencies via stream ordering and waits) and as a graph with explicit dependencies between nodes A, B, C, D, E, X, Y, End]

DEFINITION OF A CUDA GRAPH
Graph Nodes Are Not Just Kernel Launches

Sequence of operations, connected by dependencies. Operations are one of:
▪ Kernel Launch – CUDA kernel running on GPU
▪ CPU Function Call – Callback function on CPU
▪ Memcopy/Memset – GPU data management
▪ Sub-Graph – Graphs are hierarchical
(see the sketch below)
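As an illustration (not from the original deck), a minimal sketch of how these node types map onto the CUDA 10 graph API; myKernel, myHostCallback, kernelArgs, devPtr and N are hypothetical, and error checking is omitted:

    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    cudaGraphNode_t kernelNode, hostNode, memsetNode;

    // Kernel launch node
    cudaKernelNodeParams kp = {};
    kp.func         = (void*)myKernel;        // hypothetical __global__ function
    kp.gridDim      = dim3(32);
    kp.blockDim     = dim3(256);
    kp.kernelParams = kernelArgs;             // array of pointers to the kernel's arguments
    cudaGraphAddKernelNode(&kernelNode, graph, nullptr, 0, &kp);

    // CPU function call node, depending on the kernel node
    cudaHostNodeParams hp = {};
    hp.fn       = myHostCallback;             // hypothetical void(*)(void*) callback
    hp.userData = nullptr;
    cudaGraphAddHostNode(&hostNode, graph, &kernelNode, 1, &hp);

    // Memset node, also depending on the kernel node
    cudaMemsetParams mp = {};
    mp.dst = devPtr;  mp.value = 0;  mp.elementSize = 4;
    mp.width = N;     mp.height = 1; mp.pitch = 0;
    cudaGraphAddMemsetNode(&memsetNode, graph, &kernelNode, 1, &mp);

    // Sub-graphs are added analogously with cudaGraphAddChildGraphNode.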

NEW EXECUTION MECHANISM
Graphs Can Be Generated Once Then Launched Repeatedly

[Graph G: A -> {B, X} -> {C, D} -> {E, Y} -> End]

    for(int i=0; i<1000; i++) {
        launch_graph( G );
    }

EXECUTION OPTIMIZATIONS
Latency & Overhead Reductions

Launch latencies:

▪ CUDA 10.0 takes at least 2.2 μs of CPU time to launch each CUDA kernel

▪ A pre-defined graph allows launch of any number of kernels in a single operation

[Timeline: launching kernels A–E individually requires five CPU launches before the CPU goes idle; building and launching a graph issues A–E as one operation, so the CPU is idle much sooner]

PERFORMANCE IMPACT
Optimizations for Short-Runtime Operations

CPU launch time improvements: typically 33% faster than stream launch.
Example: small 3D FFT – 25% end-to-end improvement for a 32³ 3D FFT (16 μs with stream launch, 12 μs with graph launch).

NOTE: Performance impact is workload-dependent. Benefits are largest for short-running kernels, where launch overheads account for a larger share of total runtime.

THREE-STAGE EXECUTION MODEL

Define -> Instantiate -> Execute

Define: a single graph "template", created in host code, loaded from disk, or built up from libraries.
Instantiate: multiple "executable graphs", each a snapshot of the template; sets up & initializes GPU execution structures (create once, run many times).
Execute: executable graphs running in CUDA streams; concurrency in the graph is not limited by the stream (see later).

CONVERT CUDA STREAM INTO A GRAPH
Construct a graph from normal CUDA stream syntax

    // Start by initiating stream capture
    cudaStreamBeginCapture(stream1);

    // Build stream work as usual
    A<<< ..., stream1 >>>();
    cudaEventRecord(e1, stream1);
    B<<< ..., stream1 >>>();
    cudaStreamWaitEvent(stream2, e1, 0);
    C<<< ..., stream2 >>>();
    cudaEventRecord(e2, stream2);
    cudaStreamWaitEvent(stream1, e2, 0);
    D<<< ..., stream1 >>>();

    // Now convert the stream to a graph
    cudaStreamEndCapture(stream1, &graph);

[Diagram: work in stream1/stream2 and the resulting graph: A -> {B, C} -> D]

Capture follows inter-stream dependencies to create forks & joins in the resulting graph.

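To complete the picture (not shown on the original slides), a captured graph is then instantiated once and launched repeatedly; a minimal sketch using the CUDA 10 runtime API:

    cudaGraphExec_t graphExec;
    // Full 5-argument form: the last three arguments report instantiation errors.
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

    for (int i = 0; i < 1000; ++i)
        cudaGraphLaunch(graphExec, stream1);       // one launch call per graph execution

    cudaStreamSynchronize(stream1);
    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);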

CREATE GRAPHS DIRECTLY
Map Graph-Based Workflows Directly Into CUDA

    // Define graph of work + dependencies
    cudaGraphCreate(&graph);

    cudaGraphAddNode(graph, kernel_a, {}, ...);
    cudaGraphAddNode(graph, kernel_b, { kernel_a }, ...);
    cudaGraphAddNode(graph, kernel_c, { kernel_a }, ...);
    cudaGraphAddNode(graph, kernel_d, { kernel_b, kernel_c }, ...);

    // Instantiate graph and apply optimizations
    cudaGraphInstantiate(&instance, graph);

    // Launch executable graph 100 times
    for(int i=0; i<100; i++)
        cudaGraphLaunch(instance, stream);

[Diagram: graph from a framework: A -> {B, C} -> D]

(cudaGraphAddNode above is slide shorthand; the actual CUDA 10 entry points are per node type, e.g. cudaGraphAddKernelNode, as sketched earlier.)

GRAPH EXECUTION SEMANTICS
Order Graph Work With Other Non-Graph CUDA Work

    void launchWork(cudaGraphExec_t i1, cudaGraphExec_t i2, CPU_Func cpu, cudaStream_t stream)
    {
        A<<< 256, 256, 0, stream >>>();                   // Kernel launch
        cudaGraphLaunch(i1, stream);                      // Graph1 launch
        cudaStreamAddCallback(stream, cpu, nullptr, 0);   // CPU callback
        cudaGraphLaunch(i2, stream);                      // Graph2 launch

        cudaStreamSynchronize(stream);
    }

[Diagram: the stream orders kernel A, Graph1, the CPU callback, and Graph2 back-to-back]

If you can put it in a CUDA stream, you can run it together with a graph.

GRAPHS IGNORE STREAM SERIALIZATION RULES
Launch Stream Is Used Only For Ordering With Other Work

Branches in the graph still execute concurrently, even though the graph is launched into a stream.

[Diagram: the stream orders kernel A, the graph, and CPU work; within the graph, branches {B, X}, {C, D}, {E, Y} run concurrently before End]

CROSS-DEVICE DEPENDENCIES
Graphs May Span Multiple GPUs

CUDA is closest to the O/S and the hardware:
▪ Can optimize multi-device dependencies
▪ Can optimize heterogeneous dependencies
▪ Define locality per-node

[Diagrams: multi-device execution with graph nodes A–D split across GPU 0 and GPU 1; heterogeneous execution with nodes alternating between GPU and CPU]

NSIGHT DEVELOPER TOOLS

NSIGHT PRODUCT FAMILY

Nsight Systems – System-wide application algorithm tuning
Nsight Compute – CUDA Kernel Profiling and Debugging
Nsight Graphics – Graphics Shader Profiling and Debugging
IDE Plugins – Nsight Eclipse Edition / Visual Studio (Editor, Debugger)

NSIGHT SYSTEMS
System-wide Performance Analysis

Observe Application Behavior: CPU threads, GPU traces, Memory Bandwidth and more

Locate Optimization Opportunities: CUDA & OpenGL, UVM transfers, User Annotations using NVTX

Ready for Big Data: Fast GUI capable of visualizing in excess of 10 million events.

https://developer.nvidia.com/nsight-systems

[Nsight Systems timeline screenshot: processes and threads, thread state, thread/core migration, CUDA and OpenGL API trace, cuDNN and cuBLAS trace, kernel and memory transfer activities, multi-GPU]

NVIDIA NSIGHT COMPUTE
Next Generation Kernel Profiler

▪ Interactive CUDA API debugging and kernel profiling
▪ Fast Data Collection
▪ Improved Workflow (Baselining, Metric Data)
▪ Fully Customizable (Programmable UI/Rules)
▪ Command Line, Standalone, IDE Integration
▪ Platform Support
   ▪ OS: Linux (x86, POWER, ARM), Windows
   ▪ GPUs: Pascal, Volta, Turing

[Screenshot callouts: Kernel Profile Comparisons with Baseline, Source Correlation]

EXECUTION MODEL

CUDA BASICS
Blocks of threads, warps

Single Instruction Multiple Threads (SIMT) model

CUDA hierarchy: Grid -> Blocks -> Warps -> Threads

One warp = 32 threads.

Why does it matter? Many optimizations are based on behavior at the warp level.

CUDA BASICS
Mapping threads

Thread blocks can be 1D, 2D, 3D – only for convenience. The hardware "looks" at threads in 1D.

Consecutive 32 threads belong to the same warp

Example: 80 threads (40 threads in X, 2 rows of threads in Y) -> 3 warps (96 threads), with 16 inactive threads in the 3rd warp.

[Diagram: the 40x2 block split by consecutive linear thread IDs into warps 1, 2, 3]
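For illustration (not from the original deck), this is how the hardware's 1D view, and therefore each thread's warp and lane, can be derived from a multi-dimensional block:

    // Linearize a 1D/2D/3D thread block the way the hardware does (x fastest, then y, then z).
    __device__ int linear_thread_id()
    {
        return threadIdx.x
             + threadIdx.y * blockDim.x
             + threadIdx.z * blockDim.x * blockDim.y;
    }

    __device__ int warp_id() { return linear_thread_id() / 32; }   // which warp within the block
    __device__ int lane_id() { return linear_thread_id() % 32; }   // position within the warp

    // E.g. for a 40x2 block, thread (39, 0) has linear ID 39 -> warp 1, lane 7;
    // thread (0, 1) has linear ID 40 -> also warp 1: that warp spans both rows.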

CUDA BASICS
Control Flow

Different warps can execute different code with no impact on performance: each warp maintains its own Program Counter.

Different code paths inside the same warp? Threads that don't participate are masked out, but the whole warp executes both sides of the branch (see the example below).
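A small illustrative kernel (not from the original deck; A, B, C, D stand for arbitrary device functions) showing the branch used in the next diagram:

    __device__ void A();  __device__ void B();  __device__ void C();  __device__ void D();

    // With a 40x2 block, warp 2 contains threads from both rows, so it must
    // execute B (for its y==0 threads) and C (for its y==1 threads).
    __global__ void branchy()
    {
        A();
        if (threadIdx.y == 0)
            B();
        else
            C();
        D();
    }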

CONTROL FLOW

[Diagram: a 40x2 block (threadIdx.x 0..39, threadIdx.y 0..1) running
    A;
    if (threadIdx.y == 0)
        B;
    else
        C;
    D;
Over time, Warp 1 executes A B D, Warp 2 (which spans both rows) executes A B C D, and Warp 3 executes A C D]

CONTROL FLOW
Takeaways

Minimize thread divergence inside a warp

Divergence between warps is fine

Maximize “useful” cycles for each warp

THREADS ARE THREADS
New in Volta

Program counter: before Volta, per warp; on Volta, per thread.

Volta guarantees Forward Progress for diverged threads in a warp

Allows exchanging data between diverged threads in a warp, e.g. mutexes among warp threads. Allows writing natural code that would have deadlocked before.

THREADS ARE THREADS
Example

    lock = 0;
    while (lock == 0)
        lock = tryGetLock();
    doSomething();
    releaseLock();

Pre-Volta: the code might deadlock in the loop if the thread that holds the lock cannot make forward progress and release it.

These device functions could be implemented with atomics, or volatile pointers.
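One possible sketch of those device functions using global-memory atomics (the pointer parameter and the 0 = free / 1 = taken convention are assumptions, not from the original deck):

    // Returns 1 if the lock was acquired, 0 otherwise.
    __device__ int tryGetLock(int *lock_ptr)
    {
        // atomicCAS returns the previous value: 0 means we took the lock.
        return atomicCAS(lock_ptr, 0, 1) == 0;
    }

    __device__ void releaseLock(int *lock_ptr)
    {
        __threadfence();            // make this thread's prior writes visible first
        atomicExch(lock_ptr, 0);    // mark the lock as free again
    }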

THREADS ARE THREADS
Thread re-convergence

Don’t assume the threads in a warp are re-converged or executing in lock-step mode. Use __syncwarp() to synchronize the threads in a warp.

Shuffle and warp vote functions are deprecated. Use the new equivalent "_sync" functions. The extra parameter tells the compiler/hardware which threads are expected to participate, because they might not all reach the call at the same time. E.g. __shfl_up(value, 1) becomes __shfl_up_sync(0xffffffff, value, 1).

Full efficiency only when all the 32 threads of a warp are converged!
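As an illustration (not from the original deck), a warp-level inclusive scan written with the _sync shuffle variants; the full mask assumes all 32 lanes of the warp call the function together:

    // Inclusive prefix sum across one warp using __shfl_up_sync.
    __device__ int warp_inclusive_scan(int value)
    {
        const unsigned full_mask = 0xffffffff;   // all 32 lanes are expected to participate
        int lane = threadIdx.x & 31;

        for (int offset = 1; offset < 32; offset *= 2) {
            int n = __shfl_up_sync(full_mask, value, offset);
            if (lane >= offset)                  // lanes below 'offset' have no source lane
                value += n;
        }
        return value;
    }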

THREADS ARE THREADS
How to deal with warp-synchronous code?

Update/fix the code!

Use Cooperative Groups (GTC 2017 talk s7622)

Compile for an older architecture (disables the forward-progress guarantee):
-gencode arch=compute_60,code=sm_70 (binary), -arch=compute_60 (PTX JIT)

MEMORY SUBSYSTEM

VOLTA MEMORY SUBSYSTEM
Tesla V100

▪ 80 Streaming Multiprocessors
▪ 256 KB register file per SM (20 MB total)
▪ Unified Shared Mem / L1 Cache: 128 KB per SM, variable split (10 MB total, 14 TB/s); Volta caches L1 writes
▪ 6 MB L2 Cache, write-back
▪ 16/32 GB HBM2 DRAM (900 GB/s)
▪ PCIe and NVLINK connectivity

TURING MEMORY SUBSYSTEM
RTX 8000

▪ 72 Streaming Multiprocessors
▪ 256 KB register file per SM (18.5 MB total)
▪ Unified Shared Mem / L1 Cache: 96 KB per SM, variable split (7 MB total, 8 TB/s); Turing caches L1 writes
▪ 6 MB L2 Cache, write-back
▪ 24 GB GDDR6 DRAM (672 GB/s)
▪ PCIe and NVLINK connectivity

L1, L2 CACHES
Why do GPUs have caches?

In general, not for temporal locality

100s ~ 1000s of threads running per SM, tens of thousands of threads sharing the L2 cache

L1, L2 are small per thread

For example, at 2048 threads/SM, with 80 SMs: 64 bytes L1, 38 Bytes L2 per thread

L1, L2 CACHES
Cache Lines & Sectors

Memory access granularity = 32 Bytes = 1 sector

An L1/L2 cache line is 128 bytes, made of 4 sectors. Cache "management" granularity = 1 cache line.

[Diagram: a 128-byte cache line, aligned to a 128-byte boundary, made of Sector 0, Sector 1, Sector 2, Sector 3 (32 bytes each)]

ACCESS PATTERNS
Warps and Sectors

For each warp: How many sectors needed?

Depends on addresses, active threads, access size.

Natural element sizes = 1B, 2B, 4B, 8B, 16B.

[Diagram: threads 0–31 of a warp each access a consecutive 4-byte element starting at a 128-byte boundary -> 4 sectors needed]

ACCESS PATTERNS
Warps and Sectors

[Diagram: the same 4-byte accesses shifted off the 128-byte boundary -> 5 sectors needed]

128 bytes requested, 160 bytes read (80% efficiency)
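A minimal kernel (not from the original deck) that reproduces these two cases; with offset = 0 each warp touches 4 sectors, with offset = 1 it touches 5:

    // Copy with a configurable element offset to show aligned vs. unaligned warp accesses.
    // 'in' must hold at least n + offset elements.
    __global__ void offset_copy(float *out, const float *in, int offset, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i + offset];   // offset != 0 shifts each warp off the 128-byte boundary
    }

    // offset_copy<<<blocks, 256>>>(d_out, d_in, 0, n);   // 4 sectors per warp (100% efficiency)
    // offset_copy<<<blocks, 256>>>(d_out, d_in, 1, n);   // 5 sectors per warp (80% efficiency)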

ACCESS PATTERNS
Warps and Sectors

[Diagram: the same unaligned 4-byte access; the fifth sector overlaps the data of the next warp]

With >1 warp per block, this sector might be found in L1 or L2.

ACCESS PATTERNS
Warps and Sectors

[Diagram: all 32 threads of the warp access the same address -> 1 sector needed]

L1, L2 CACHES
Why do GPUs have caches?

Caches on GPUs can help with:

“Smoothing” irregular, unaligned access patterns

Caching common data accessed by many threads

Faster register spills, local memory

Can help in codes that don’t use shared memory

SHARED MEMORY

Scratch-pad memory on each SM. User-managed cache: the hardware does not evict data. Data written to SMEM stays there until the code overwrites it or the threadblock finishes execution.

Useful for:
▪ Storing frequently-accessed data, to reduce DRAM accesses
▪ Communication among threads of a threadblock

Performance benefits compared to DRAM: 20-40x lower latency, ~15x higher bandwidth.
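A minimal sketch (not from the original deck) that uses shared memory for both purposes at once: each element is read from DRAM once, then exchanged between threads of the block:

    // Reverse each block-sized segment in place, communicating through shared memory.
    __global__ void reverse_in_block(int *data)
    {
        __shared__ int tile[256];                       // assumes blockDim.x == 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = data[i];                    // one DRAM read per element
        __syncthreads();                                // all writes to tile are now visible

        data[i] = tile[blockDim.x - 1 - threadIdx.x];   // threads exchange data via SMEM
    }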

UNIFIED SHARED MEM / L1 CACHE
Variable split

Volta: 6 possible smem / L1 splits: 96 KB / 32 KB, 64 KB / 64 KB, 32 KB / 96 KB, 16 KB / 112 KB, 8 KB / 120 KB, 0 KB / 128 KB
Turing: 2 possible smem / L1 splits: 64 KB / 32 KB, 32 KB / 64 KB

How to specify the L1 / Smem split: cudaFuncSetAttribute (MyKernel, cudaFuncAttributePreferredSharedMemoryCarveout, carveout);

The driver usually does a pretty good job at choosing the right split.

To overcome the 48 KB per threadblock limitation, call: cudaFuncSetAttribute(MyKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, maxsize);
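For example (a sketch, not from the original deck; MyKernel and the 64 KB figure are illustrative), after opting in, the larger size is simply requested as dynamic shared memory at launch:

    int maxbytes = 64 * 1024;   // 64 KB, above the default 48 KB limit (requires the opt-in)
    cudaFuncSetAttribute(MyKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, maxbytes);

    // Third launch parameter = dynamic shared memory bytes per threadblock.
    MyKernel<<<grid, block, maxbytes, stream>>>(/* kernel arguments */);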

https://developer.nvidia.com/computeworks
http://on-demand.gputechconf.com