HARDWARE TRENDS DRIVEN BY ML AND DATA ANALYTICS
Peter Messmer, Sr. Manager Compute Devtech, 9/6/2018
THE AI COMPUTING COMPANY

Computer Graphics | GPU Computing | Artificial Intelligence

DEEP LEARNING LEADS TO SUPER-HUMAN CAPABILITIES

[Chart: image-classification accuracy over time; machine accuracy crosses the human baseline]

Machine outperforms humans at classification accuracy.

THE BIG BANG IN MACHINE LEARNING

DNN + BIG DATA + GPU

“ Google’s AI engine also reflects how the world of computer hardware is changing. (It) depends on machines equipped with GPUs… And it depends on these chips more than the larger tech universe realizes.”

OK, MACHINES ARE GREAT AT IMAGE RECOGNITION… what else does this enable?

Robots: controlled collisions
Medical imaging: segmentation
Voice recognition
Autonomous vehicles: collision avoidance
Generative networks

“Map your problem to images and you’re in business”

Deep Learning for Climate Modeling
ANOMALY DETECTION IN CLIMATE DATA

Identifying “extreme” weather events in multi-decadal datasets with a 5-layer convolutional neural network, reaching 99.98% detection accuracy (Kim et al., 2017).

Dataset: visualization of historic cyclones from the JTWC hurricane reports, 1979 to 2016, arranged in an MNIST-like structure. A systematic framework for detection and localization of extreme climate events.

DOI: 10.1109/ICDMW.2017.476

Deep Learning for Fluid Mechanics
CONVOLUTIONAL NEURAL NETWORKS FOR STEADY FLOW APPROXIMATION

A quick, general CNN-based approximation model for predicting the velocity field of non-uniform steady laminar flow, by Guo et al. (2016).

CNN-based approximation model trained on LBM (Lattice Boltzmann Method) simulation results.

SDF (signed distance function) data is used as input, and the prediction error is used as the loss function to train the convolutional neural networks.

[Figures: CNN-based CFD surrogate model architecture; comparison of results between the LBM model and the CNN-based surrogate model]
http://www.kdd.org/kdd2016/papers/files/adp1175-guoA.pdf

ANATOMY OF A DEEP NEURAL NETWORK

[Diagram: image in → layers → probability of classes out; example: Google Inception network]

Each output is a weighted sum of its inputs passed through a nonlinearity:

$y = f\left(\sum_{k=0}^{n} x_k a_k\right)$

What happens in the layers? How to choose the weights?
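A minimal sketch of this formula in CUDA (assuming a ReLU for f and a single fully-connected layer; all names are illustrative, not from the slides):

```cuda
// One thread per output neuron: y[j] = f(sum_k x[k] * a[j][k]).
// 'a' holds one row of n_in weights per output neuron, row-major.
__global__ void dense_forward(const float *x, const float *a, float *y,
                              int n_in, int n_out)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n_out) return;

    float sum = 0.0f;
    for (int k = 0; k < n_in; ++k)
        sum += x[k] * a[j * n_in + k];

    y[j] = fmaxf(sum, 0.0f);   // f = ReLU
}
```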

LOOKING INSIDE A NEURAL NET
Different layers are sensitive to different features

[Visualization: features detected at Layer 1]

1-SLIDE INTRO TO CONVOLUTIONAL NEURAL NETS
Forward/Backward Propagation

[Diagram: Input → Convolution → Fully-connected → Classification (Softmax) → Loss function (Cross Entropy). Forward propagation runs left to right; backward propagation computes gradients (∇) right to left; an SGD optimization algorithm applies the weight updates.]

All layers are differentiable.
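The weight-update step itself is tiny; a minimal sketch of plain SGD (no momentum; the gradients are assumed already computed by back-propagation):

```cuda
// Plain SGD step, one thread per weight: w <- w - lr * dL/dw.
__global__ void sgd_update(float *w, const float *grad, float lr, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        w[i] -= lr * grad[i];
}
```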

RISE OF GPU COMPUTING

[Chart: performance, 1980 to 2025. Single-threaded CPU performance grows 1.1X per year; GPU-computing performance grows 1.5X per year, reaching 1000X by 2025. Stack labels: APPLICATIONS, ALGORITHMS, SYSTEMS, CUDA, ARCHITECTURE]

Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data collected for 2010-2015 by K. Rupp.

HOW GPU ACCELERATION WORKS

Application code is split: compute-intensive functions run on the GPU, while the rest of the sequential code stays on the CPU.
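A minimal sketch of this offload model (SAXPY; sizes and names are illustrative): the data-parallel loop moves into a GPU kernel, everything around it remains ordinary CPU code.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Compute-intensive function, offloaded to the GPU: y = a*x + y.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));   // unified memory, visible to CPU and GPU
    cudaMallocManaged(&y, n * sizeof(float));

    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }  // sequential CPU code

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);            // GPU does the heavy lifting
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);                               // back on the CPU
    cudaFree(x); cudaFree(y);
    return 0;
}
```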

LOW LATENCY OR HIGH THROUGHPUT?

CPU architecture must minimize latency within each thread; GPU architecture hides latency with computation from other threads (warps).

[Diagram: a CPU core (low-latency processor) processes threads T1…Tn back to back; a GPU streaming multiprocessor (high-throughput processor) context-switches among warps W1…W4, processing whichever warp is ready while the others wait for data.]
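The practical consequence: launch far more threads than there are cores, so the scheduler always has ready warps. A common sketch is the grid-stride loop (illustrative, not from the slides):

```cuda
// Grid-stride loop: each thread handles multiple elements, so the same
// launch configuration oversubscribes the GPU for any problem size and
// keeps enough warps in flight to hide memory latency.
__global__ void scale(float *data, float s, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= s;
}
```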

TESLA V100

21B transistors, 815 mm²
80 SMs*, 5120 CUDA cores, 640 Tensor Cores
16 GB HBM2, 900 GB/s HBM2 bandwidth, 300 GB/s NVLink

*The full GV100 chip contains 84 SMs.

INTRODUCING TESLA V100

Volta architecture: most productive GPU
Tensor Core: 120 programmable TFLOPS for deep learning
Improved NVLink & HBM2: efficient bandwidth
Volta MPS: inference utilization
Improved SIMT model: new algorithms

The Fastest and Most Productive GPU for Deep Learning and HPC

SPECIAL-PURPOSE VS FLEXIBILITY

[Timeline: fixed-function pipelines → programmable pixel and vertex shaders (GeForce 3 series, 2001) → CUDA-capable GPUs (GeForce 8 series, 2006) → “Kepler”, 7B transistors, Unreal Engine 4 (2012) → video compressor, Tensor Cores, RT Cores]

TENSOR CORE
Mixed-Precision Matrix Math on 4x4 matrices

New CUDA TensorOp instructions & data formats: a 4x4x4 matrix processing array computes D[FP32] = A[FP16] * B[FP16] + C[FP32].
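In CUDA C++ this operation is exposed at warp level through the WMMA API (CUDA 9+, compute capability 7.0+). A minimal sketch for a single 16x16x16 tile (the WMMA granularity; the hardware decomposes it into 4x4x4 Tensor Core ops), with row-major data and C taken as zero:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16x16 tile: D = A*B + C,
// FP16 inputs, FP32 accumulator. Launch as tensor_tile<<<1, 32>>>(...).
__global__ void tensor_tile(const half *a, const half *b, float *d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);          // C = 0 in this sketch
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc, a_frag, b_frag, acc);
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}
```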

cuBLAS GEMMS FOR DEEP LEARNING
V100 Tensor Cores + CUDA 9: over 9x faster matrix-matrix multiply

[Charts: relative GEMM performance vs. matrix size (M=N=K = 512…4096). cuBLAS single precision (FP32): V100 (CUDA 9) up to ~1.8x P100 (CUDA 8). cuBLAS mixed precision (FP16 input, FP32 compute): V100 Tensor Cores (CUDA 9) up to ~9.3x P100 (CUDA 8).]

Note: pre-production Tesla V100 and pre-release CUDA 9, versus the CUDA 8 GA release.
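Through cuBLAS, the Tensor Core path is a math-mode switch plus a mixed-precision GEMM call; a minimal sketch with the CUDA 9 era API (handle creation and allocation omitted; dimensions illustrative):

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// C[FP32] = A[FP16] * B[FP16], accumulated in FP32 on Tensor Cores.
void gemm_tensor_cores(cublasHandle_t handle, int m, int n, int k,
                       const half *A, const half *B, float *C)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // allow Tensor Cores
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha, A, CUDA_R_16F, m,              // FP16 inputs
                         B, CUDA_R_16F, k,
                 &beta,  C, CUDA_R_32F, m,              // FP32 output
                 CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```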

AI INFERENCING IS EXPLODING

PERSONALIZATION: 2 trillion messages per day on LinkedIn
SPEECH: 500M daily active users of iFlyTek
TRANSLATION: 140 billion words per day translated by Google
VIDEO: 60 billion video frames per day uploaded to YouTube

TENSOR CORES IN TURING
New in Volta, upgraded in Turing

GPU             SMs   Tensor Cores (total)   Peak Half FLOPS   Peak INT8 OPS   Peak INT4 OPS   Peak B1 OPS
V100            80    640                    -                 N.A.            N.A.            N.A.
RTX 6000/8000   72    576                    130.5 TFLOPS*     260 TOPS*       521 TOPS*       2087 TOPS*

Matrix multiplication pipeline:
● Half-precision inputs → half / float accumulator
● 8-bit / 4-bit INT inputs → 32-bit INT accumulator
● 1-bit binary inputs → 32-bit INT accumulator (XOR + POPC)

Used in cuBLAS, cuDNN, CUTLASS. Exposed in CUDA 10 (4-bit INT and 1-bit binary are experimental).
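The integer path mirrors the FP16 WMMA sketch above, with integer fragment types; a minimal sketch assuming CUDA 10 and the 8-bit input / 32-bit accumulator configuration:

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp: 16x16x16 tile, INT8 inputs, INT32 accumulator (CUDA 10).
__global__ void int8_tile(const signed char *a, const signed char *b, int *d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> acc;

    wmma::fill_fragment(acc, 0);
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc, a_frag, b_frag, acc);
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}
```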

* Using 1.77 GHz boost clock

NVIDIA TURING: GRAPHICS REINVENTED
Built to Revolutionize the Work of Creative Professionals

Streaming Multiprocessors | RT Cores | Tensor Cores | CUDA Cores

[Diagram: Turing SM. L1 instruction cache on top; processing blocks each with an L0 instruction cache, a warp scheduler (32 thread/clk), a dispatch unit (32 thread/clk), a 16,384 x 32-bit register file, FP32 and INT32 cores, Tensor Cores, an SFU, and LD/ST units; shared 96KB L1 data cache / shared memory; Tex units.]

PERFORMANCE PROJECTION
Comprehensive benchmark results expected in September

[Charts: estimated gen-to-gen performance, Pascal vs. Turing, across graphics, DL inferencing, and ray tracing (bars labeled ~30%, ~2.5X, ~3X); rendering performance, CPU vs. Turing: 20X. *Actual performance TBD]

TURING
Accelerated Ray Tracing

RT Cores perform:
● Ray-BVH traversal
● Instancing: 1 level
● Ray-triangle intersection (a software sketch of this test follows below)

Return to the SM for:
● Multi-level instancing
● Custom intersection
● Shading
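For intuition about what the fixed-function hardware replaces, here is the classic Möller-Trumbore ray-triangle test written as ordinary CUDA device code (illustrative only; the actual RT core implementation is not public):

```cuda
#include <cuda_runtime.h>

__device__ float3 sub3(float3 a, float3 b) { return make_float3(a.x-b.x, a.y-b.y, a.z-b.z); }
__device__ float3 cross3(float3 a, float3 b) {
    return make_float3(a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x);
}
__device__ float dot3(float3 a, float3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Moller-Trumbore ray-triangle intersection: the inner loop of ray tracing
// that RT cores execute in hardware while the SM does other work.
__device__ bool ray_triangle(float3 orig, float3 dir,
                             float3 v0, float3 v1, float3 v2, float *t)
{
    const float EPS = 1e-7f;
    float3 e1 = sub3(v1, v0), e2 = sub3(v2, v0);
    float3 p  = cross3(dir, e2);
    float det = dot3(e1, p);
    if (fabsf(det) < EPS) return false;       // ray parallel to triangle
    float inv = 1.0f / det;
    float3 s  = sub3(orig, v0);
    float u   = dot3(s, p) * inv;             // first barycentric coordinate
    if (u < 0.0f || u > 1.0f) return false;
    float3 q  = cross3(s, e1);
    float v   = dot3(dir, q) * inv;           // second barycentric coordinate
    if (v < 0.0f || u + v > 1.0f) return false;
    *t = dot3(e2, q) * inv;                   // hit distance along the ray
    return *t > EPS;
}
```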

PLATFORM
World's Leading Data Center Platform for Accelerating HPC and AI

CUSTOMER USE CASES: Consumer Internet (Speech, Translate, Recommender); Enterprise Applications (Healthcare, Manufacturing, Engineering); Supercomputing (Molecular Simulations, Weather Forecasting, Seismic Mapping)

INDUSTRY FRAMEWORKS & APPLICATIONS: Amber, CHROMA, LAMMPS, NAMD, +550 applications

NVIDIA SDK & LIBRARIES: cuBLAS, cuDNN, cuFFT, cuRAND, cuSPARSE, DeepStream, NCCL, TensorRT

CUDA

TESLA GPUs & SYSTEMS: Tesla GPU, NVIDIA DGX family, NVIDIA HGX system, OEM, cloud

HARDWARE SOFTWARE CO-DESIGN

Hardware is not designed in a vacuum: performance comes through specialization, adoption through flexibility. And hardware alone is not enough; it needs a comprehensive software stack.

Volta and Turing GPUs: unprecedented performance for training and inference.

Look at new features through the eyes of your problem. Map your problem to the features.
