HARDWARE TRENDS DRIVEN BY ML AND DATA ANALYTICS
Peter Messmer, Sr. Manager Compute Devtech, 9/6/2018
THE AI COMPUTING COMPANY

Computer Graphics | GPU Computing | Artificial Intelligence

DEEP LEARNING LEADS TO SUPER-HUMAN CAPABILITIES

[Chart: image-classification accuracy over time; machine accuracy crosses the human baseline]

Machine outperforms humans at classification accuracy.

THE BIG BANG IN MACHINE LEARNING

DNN + BIG DATA + GPU

“ Google’s AI engine also reflects how the world of computer hardware is changing. (It) depends on machines equipped with GPUs… And it depends on these chips more than the larger tech universe realizes.”

OK, MACHINES ARE GREAT AT IMAGE RECOGNITION… what else does this enable?

Robots: controlled collisions
Medical imaging: segmentation
Voice recognition
Autonomous vehicles: collision avoidance
Generative networks

“Map your problem to images and you’re in business”

Deep Learning for Climate Modeling
ANOMALY DETECTION IN CLIMATE DATA

Identifying “extreme” weather events in multi-decadal datasets with a 5-layer convolutional neural network, reaching 99.98% detection accuracy (Kim et al., 2017).

Dataset: visualization of historic cyclones from the JTWC hurricane reports, 1979 to 2016, arranged in an MNIST-like structure. A systematic framework for detection and localization of extreme climate events.

DOI: 10.1109/ICDMW.2017.476

Deep Learning for Fluid Mechanics
CONVOLUTIONAL NEURAL NETWORKS FOR STEADY FLOW APPROXIMATION

A quick, general CNN-based approximation model for predicting the velocity field of non-uniform steady laminar flow, by Guo et al. (2016).

CNN-based approximation model trained on LBM (Lattice Boltzmann Method) simulation results.

SDF (signed distance function) data is used as input, and the prediction error is used as the loss function to train the convolutional neural networks.

[Figures: CNN-based CFD surrogate model architecture; comparison of results between the LBM model and the CNN-based surrogate model]
http://www.kdd.org/kdd2016/papers/files/adp1175-guoA.pdf

ANATOMY OF A DEEP NEURAL NETWORK

[Diagram: image in → layers → probability of classes out; example: Google Inception network]

Each output is a weighted sum of its inputs passed through a nonlinearity:

$y = f\left(\sum_{k=0}^{n} x_k a_k\right)$

What happens in the layers? How to choose the weights?
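A minimal sketch of this formula in CUDA (assuming a ReLU for f and a single fully-connected layer; all names are illustrative, not from the slides):

```cuda
// One thread per output neuron: y[j] = f(sum_k x[k] * a[j][k]).
// 'a' holds one row of n_in weights per output neuron, row-major.
__global__ void dense_forward(const float *x, const float *a, float *y,
                              int n_in, int n_out)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n_out) return;

    float sum = 0.0f;
    for (int k = 0; k < n_in; ++k)
        sum += x[k] * a[j * n_in + k];

    y[j] = fmaxf(sum, 0.0f);   // f = ReLU
}
```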

LOOKING INSIDE A NEURAL NET
Different layers are sensitive to different features

[Visualization: features detected at Layer 1]

1-SLIDE INTRO TO CONVOLUTIONAL NEURAL NETS
Forward/Backward Propagation

[Diagram: Input → Convolution → Fully-connected → Classification (Softmax) → Loss function (Cross Entropy). Forward propagation runs left to right; backward propagation computes gradients (∇) right to left; an SGD optimization algorithm applies the weight updates.]

All layers are differentiable.
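The weight-update step itself is tiny; a minimal sketch of plain SGD (no momentum; the gradients are assumed already computed by back-propagation):

```cuda
// Plain SGD step, one thread per weight: w <- w - lr * dL/dw.
__global__ void sgd_update(float *w, const float *grad, float lr, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        w[i] -= lr * grad[i];
}
```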

RISE OF GPU COMPUTING

[Chart: performance, 1980 to 2025. Single-threaded CPU performance grows 1.1X per year; GPU-computing performance grows 1.5X per year, reaching 1000X by 2025. Stack labels: APPLICATIONS, ALGORITHMS, SYSTEMS, CUDA, ARCHITECTURE]

Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data collected for 2010-2015 by K. Rupp.

HOW GPU ACCELERATION WORKS

Application code is split: compute-intensive functions run on the GPU, while the rest of the sequential code stays on the CPU.
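A minimal sketch of this offload model (SAXPY; sizes and names are illustrative): the data-parallel loop moves into a GPU kernel, everything around it remains ordinary CPU code.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Compute-intensive function, offloaded to the GPU: y = a*x + y.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));   // unified memory, visible to CPU and GPU
    cudaMallocManaged(&y, n * sizeof(float));

    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }  // sequential CPU code

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);            // GPU does the heavy lifting
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);                               // back on the CPU
    cudaFree(x); cudaFree(y);
    return 0;
}
```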

LOW LATENCY OR HIGH THROUGHPUT?

CPU architecture must minimize latency within each thread; GPU architecture hides latency with computation from other threads (warps).

[Diagram: a CPU core (low-latency processor) processes threads T1…Tn back to back; a GPU streaming multiprocessor (high-throughput processor) context-switches among warps W1…W4, processing whichever warp is ready while the others wait for data.]
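The practical consequence: launch far more threads than there are cores, so the scheduler always has ready warps. A common sketch is the grid-stride loop (illustrative, not from the slides):

```cuda
// Grid-stride loop: each thread handles multiple elements, so the same
// launch configuration oversubscribes the GPU for any problem size and
// keeps enough warps in flight to hide memory latency.
__global__ void scale(float *data, float s, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= s;
}
```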

TESLA V100

21B transistors, 815 mm²
80 SMs*, 5120 CUDA cores, 640 Tensor Cores
16 GB HBM2, 900 GB/s HBM2 bandwidth, 300 GB/s NVLink

*The full GV100 chip contains 84 SMs.

INTRODUCING TESLA V100

Volta architecture: most productive GPU
Tensor Core: 120 programmable TFLOPS for deep learning
Improved NVLink & HBM2: efficient bandwidth
Volta MPS: inference utilization
Improved SIMT model: new algorithms

The Fastest and Most Productive GPU for Deep Learning and HPC

SPECIAL-PURPOSE VS FLEXIBILITY

[Timeline: fixed-function pipelines → programmable pixel and vertex shaders (GeForce 3 series, 2001) → CUDA-capable GPUs (GeForce 8 series, 2006) → “Kepler”, 7B transistors, Unreal Engine 4 (2012) → video compressor, Tensor Cores, RT Cores]

TENSOR CORE
Mixed-Precision Matrix Math on 4x4 matrices

New CUDA TensorOp instructions & data formats: a 4x4x4 matrix processing array computes D[FP32] = A[FP16] * B[FP16] + C[FP32].
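In CUDA C++ this operation is exposed at warp level through the WMMA API (CUDA 9+, compute capability 7.0+). A minimal sketch for a single 16x16x16 tile (the WMMA granularity; the hardware decomposes it into 4x4x4 Tensor Core ops), with row-major data and C taken as zero:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16x16 tile: D = A*B + C,
// FP16 inputs, FP32 accumulator. Launch as tensor_tile<<<1, 32>>>(...).
__global__ void tensor_tile(const half *a, const half *b, float *d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);          // C = 0 in this sketch
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc, a_frag, b_frag, acc);
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}
```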

cuBLAS GEMMS FOR DEEP LEARNING
V100 Tensor Cores + CUDA 9: over 9x faster matrix-matrix multiply

[Charts: relative GEMM performance vs. matrix size (M=N=K = 512…4096). cuBLAS single precision (FP32): V100 (CUDA 9) up to ~1.8x P100 (CUDA 8). cuBLAS mixed precision (FP16 input, FP32 compute): V100 Tensor Cores (CUDA 9) up to ~9.3x P100 (CUDA 8).]

Note: pre-production Tesla V100 and pre-release CUDA 9, versus the CUDA 8 GA release.
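Through cuBLAS, the Tensor Core path is a math-mode switch plus a mixed-precision GEMM call; a minimal sketch with the CUDA 9 era API (handle creation and allocation omitted; dimensions illustrative):

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// C[FP32] = A[FP16] * B[FP16], accumulated in FP32 on Tensor Cores.
void gemm_tensor_cores(cublasHandle_t handle, int m, int n, int k,
                       const half *A, const half *B, float *C)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // allow Tensor Cores
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha, A, CUDA_R_16F, m,              // FP16 inputs
                         B, CUDA_R_16F, k,
                 &beta,  C, CUDA_R_32F, m,              // FP32 output
                 CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```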

AI INFERENCING IS EXPLODING

PERSONALIZATION: 2 trillion messages per day on LinkedIn
SPEECH: 500M daily active users of iFlyTek
TRANSLATION: 140 billion words per day translated by Google
VIDEO: 60 billion video frames per day uploaded to YouTube

TENSOR CORES IN TURING
New in Volta, upgraded in Turing

GPU             SMs   Tensor Cores (total)   Peak Half FLOPS   Peak INT8 OPS   Peak INT4 OPS   Peak B1 OPS
V100            80    640                    -                 N.A.            N.A.            N.A.
RTX 6000/8000   72    576                    130.5 TFLOPS*     260 TOPS*       521 TOPS*       2087 TOPS*

Matrix multiplication pipeline:
● Half-precision inputs → half / float accumulator
● 8-bit / 4-bit INT inputs → 32-bit INT accumulator
● 1-bit binary inputs → 32-bit INT accumulator (XOR + POPC)

Used in cuBLAS, cuDNN, CUTLASS. Exposed in CUDA 10 (4-bit INT and 1-bit binary are experimental).
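The integer path mirrors the FP16 WMMA sketch above, with integer fragment types; a minimal sketch assuming CUDA 10 and the 8-bit input / 32-bit accumulator configuration:

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp: 16x16x16 tile, INT8 inputs, INT32 accumulator (CUDA 10).
__global__ void int8_tile(const signed char *a, const signed char *b, int *d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> acc;

    wmma::fill_fragment(acc, 0);
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc, a_frag, b_frag, acc);
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}
```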

* Using 1.77 GHz boost clock

NVIDIA TURING: GRAPHICS REINVENTED
Built to Revolutionize the Work of Creative Professionals

Streaming Multiprocessors | RT Cores | Tensor Cores | CUDA Cores

[Diagram: Turing SM. L1 instruction cache on top; processing blocks each with an L0 instruction cache, a warp scheduler (32 thread/clk), a dispatch unit (32 thread/clk), a 16,384 x 32-bit register file, FP32 and INT32 cores, Tensor Cores, an SFU, and LD/ST units; shared 96KB L1 data cache / shared memory; Tex units.]

PERFORMANCE PROJECTION
Comprehensive benchmark results expected in September

[Charts: estimated gen-to-gen performance, Pascal vs. Turing, across graphics, DL inferencing, and ray tracing (bars labeled ~30%, ~2.5X, ~3X); rendering performance, CPU vs. Turing: 20X. *Actual performance TBD]

TURING
Accelerated Ray Tracing

RT Cores perform:
● Ray-BVH traversal
● Instancing: 1 level
● Ray-triangle intersection (a software sketch of this test follows below)

Return to the SM for:
● Multi-level instancing
● Custom intersection
● Shading
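For intuition about what the fixed-function hardware replaces, here is the classic Möller-Trumbore ray-triangle test written as ordinary CUDA device code (illustrative only; the actual RT core implementation is not public):

```cuda
#include <cuda_runtime.h>

__device__ float3 sub3(float3 a, float3 b) { return make_float3(a.x-b.x, a.y-b.y, a.z-b.z); }
__device__ float3 cross3(float3 a, float3 b) {
    return make_float3(a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x);
}
__device__ float dot3(float3 a, float3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Moller-Trumbore ray-triangle intersection: the inner loop of ray tracing
// that RT cores execute in hardware while the SM does other work.
__device__ bool ray_triangle(float3 orig, float3 dir,
                             float3 v0, float3 v1, float3 v2, float *t)
{
    const float EPS = 1e-7f;
    float3 e1 = sub3(v1, v0), e2 = sub3(v2, v0);
    float3 p  = cross3(dir, e2);
    float det = dot3(e1, p);
    if (fabsf(det) < EPS) return false;       // ray parallel to triangle
    float inv = 1.0f / det;
    float3 s  = sub3(orig, v0);
    float u   = dot3(s, p) * inv;             // first barycentric coordinate
    if (u < 0.0f || u > 1.0f) return false;
    float3 q  = cross3(s, e1);
    float v   = dot3(dir, q) * inv;           // second barycentric coordinate
    if (v < 0.0f || u + v > 1.0f) return false;
    *t = dot3(e2, q) * inv;                   // hit distance along the ray
    return *t > EPS;
}
```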

PLATFORM
World's Leading Data Center Platform for Accelerating HPC and AI

CUSTOMER USE CASES: Consumer Internet (Speech, Translate, Recommender); Enterprise Applications (Healthcare, Manufacturing, Engineering); Supercomputing (Molecular Simulations, Weather Forecasting, Seismic Mapping)

INDUSTRY FRAMEWORKS & APPLICATIONS: Amber, CHROMA, LAMMPS, NAMD, +550 applications

NVIDIA SDK & LIBRARIES: cuBLAS, cuDNN, cuFFT, cuRAND, cuSPARSE, DeepStream, NCCL, TensorRT

CUDA

TESLA GPUs & SYSTEMS: Tesla GPU, NVIDIA DGX family, NVIDIA HGX system, OEM, cloud

HARDWARE SOFTWARE CO-DESIGN

Hardware is not designed in a vacuum: performance comes through specialization, adoption through flexibility. And hardware alone is not enough; it needs a comprehensive software stack.

Volta and Turing GPUs: unprecedented performance for training and inference.

Look at new features through the eyes of your problem. Map your problem to the features.
