Why are GPUs green?

Prof. Esteban Walter Gonzalez Clua, Dr. – CUDA Fellow
Computer Science Department, Universidade Federal Fluminense, Rio de Janeiro – Brazil
Framework for solving cosmological particle propagation.

1 Million Watts

DGX A100 DATA CENTER

5 DGX A100 systems for AI training and inference: $1M, 28 kW, 1 rack – 1/10th the cost, 1/20th the power.

0.8 PFLOPS for HPC, 50 PFLOPS for Tensor Cores = 160 Laurences for HPC, 8,000 Laurences for Tensor – 35× less energy.

NVIDIA DGX A100 SYSTEM SPECS

GPUs: 8x NVIDIA A100 Tensor Core GPUs
GPU Memory: 320 GB total
Performance: 5 petaFLOPS AI | 10 petaOPS INT8
NVIDIA NVSwitches: 6
CPU: Dual AMD Rome, 128 cores total, 2.25 GHz (base), 3.4 GHz (max boost)
System Memory: 1 TB
Networking: 9x Mellanox ConnectX-6 VPI HDR InfiniBand/200GigE (10th dual-port ConnectX-6 optional)
Storage: OS: 2x 1.92 TB M.2 NVMe drives | Internal: 15 TB (4x 3.84 TB) U.2 NVMe drives
System Power Usage: 6.5 kW max
System Weight: 271 lbs (123 kg)
System Dimensions: height 10.4 in (264.0 mm), width 19.0 in (482.3 mm) max, length 35.3 in (897.1 mm) max
Rack Units: 6 RU
Operating Temperature: 5 ºC to 30 ºC (41 ºF to 86 ºF)
Cooling: Air

GPU vs. CPU

[Die diagrams: Intel i7 Bloomfield (CPU) vs. Kepler K10 (GPU); CPU functional units F1–F4 vs. a GPU kernel launched over many threads.]

Only ~1% of a CPU's die is dedicated to computation; the other 99% is devoted to moving and storing data to combat latency.

The SIMT model (a minimal sketch follows below)

Or: the three things you must learn by heart from this talk…

Why did GPUs become as powerful (and indispensable) for Deep Learning as they are for Rendering?
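To make the SIMT model concrete, here is a minimal kernel sketch (my illustration, not from the original slides): every thread executes the same instruction stream, but each one picks its own element from its block and thread indices. The host-side allocation and launch code appears later in the GPU Computing Flow sketch.

    #include <cuda_runtime.h>

    // SIMT: all threads run the same code; each computes one element.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global index
        if (i < n)                                       // guard the last, partially filled block
            c[i] = a[i] + b[i];
    }

    // Launch: one thread per element, e.g. for n elements:
    //   int threads = 256;
    //   int blocks  = (n + threads - 1) / threads;      // round up
    //   vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);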

Tensor Cores

Each Tensor Core computes D = A × B + C on 4 × 4 × 4 matrices, where A and B are FP16 and C, D are FP16 or FP32.

64 FP operations per clock → the full product is processed in 1 clock cycle.

8 Tensor Cores per SM → 1024 FP operations per clock per SM.
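From CUDA C++, these units are usually programmed through the warp-level WMMA API (nvcuda::wmma), which exposes the operation at 16×16×16 fragment granularity rather than the hardware's 4×4×4 step. A minimal sketch (my illustration, assuming FP16 inputs with FP32 accumulation; compile for sm_70 or newer and launch with at least one warp, e.g. <<<1, 32>>>):

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes a 16x16 tile of D = A*B + C on Tensor Cores.
    // A (row-major) and B (col-major) are FP16; the accumulator is FP32.
    __global__ void wmma_gemm_16x16(const half *A, const half *B, float *D) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        wmma::fill_fragment(acc_frag, 0.0f);                  // C = 0 in this sketch
        wmma::load_matrix_sync(a_frag, A, 16);                // load the A tile (ld = 16)
        wmma::load_matrix_sync(b_frag, B, 16);                // load the B tile (ld = 16)
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);   // D = A*B + C
        wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
    }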

Mixed Precision

“Deep learning [researchers] have found that deep neural network architectures have a natural resilience to errors due to the backpropagation algorithm used in training them, and some developers have argued that 16-bit floating point (half precision, or FP16) is sufficient for training neural networks.”

Memory bandwidth matters!

GPU Computing Flow

[Diagram: CPU and its memory connected to the GPU and its memory over the PCI bus.]

1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory

(This slide is credited to Mark Harris, NVIDIA. A CUDA sketch of these steps follows below.)
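A minimal, self-contained CUDA program following exactly these three steps (the scale kernel, sizes and values are mine, for illustration only):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *x, float alpha, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= alpha;
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_x = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

        float *d_x;
        cudaMalloc(&d_x, bytes);

        // 1. Copy input data from CPU memory to GPU memory
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

        // 2. Load GPU program and execute, caching data on chip for performance
        scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);

        // 3. Copy results from GPU memory to CPU memory
        cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);

        printf("h_x[0] = %f\n", h_x[0]);   // expect 2.0
        cudaFree(d_x);
        free(h_x);
        return 0;
    }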

[Diagram repeated with rates annotated: on the order of 20 TFLOPS of compute on the GPU side versus roughly 224 GB/s (56 Gfloats/s) of data bandwidth along the PCI-bus data path – data movement, not arithmetic, dominates steps 1–3 unless data is reused on chip.]

Closer to Unified Memory

Both the CPU and the GPU access the same data (a minimal sketch follows below).

#3 – One kernel, lots of threads…
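A minimal sketch of CPU and GPU sharing one allocation through cudaMallocManaged (my illustration; the kernel and sizes are placeholders): the same pointer is valid on both sides, and the driver migrates pages on demand.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void increment(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    int main() {
        const int n = 1024;
        int *data;
        cudaMallocManaged(&data, n * sizeof(int));     // visible to CPU and GPU

        for (int i = 0; i < n; ++i) data[i] = i;       // CPU writes the data

        increment<<<(n + 255) / 256, 256>>>(data, n);  // GPU updates it in place
        cudaDeviceSynchronize();                       // wait before the CPU reads

        printf("data[0] = %d, data[n-1] = %d\n", data[0], data[n - 1]);
        cudaFree(data);
        return 0;
    }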

How things work: GPU vs. CPU

[Diagram: CPU functional units F1–F4 handling scattered tasks vs. one GPU kernel spread across many threads; again, only ~1% of the CPU die computes while 99% moves and stores data to combat latency.]

SM evolution

Compute capability

INTRODUCING NVIDIA A100

Greatest Generational Leap – 20X Volta

Peak performance vs. Volta:
FP32 training: 312 TFLOPS (20X)
INT8 inference: 1,248 TOPS (20X)
FP64 HPC: 19.5 TFLOPS (2.5X)
Multi-Instance GPU: 7 GPU instances (7X)

54B transistors | 826 mm² | TSMC 7N | 40 GB Samsung HBM2 | 600 GB/s NVLink

9X MORE PERFORMANCE IN 4 YEARS
Beyond Moore's Law with full-stack innovation

[Chart: throughput speedup (geometric mean) across AMBER, Chroma, GROMACS, MILC, NAMD, PyTorch, Quantum Espresso, Random Forest, TensorFlow and VASP, rising from 1X on P100 (2016) through V100 (2017–2019) to 9X on A100 (2020).]

Geometric mean of application speedups vs. P100. Benchmark applications: Amber [PME-Cellulose_NVE], Chroma [szscl21_24_128], GROMACS [ADH Dodec], MILC [Apex Medium], NAMD [stmv_nve_cuda], PyTorch [BERT Large Fine Tuner], Quantum Espresso [AUSURF112-jR], Random Forest FP32 [make_blobs (160000 x 64 : 10)], TensorFlow [ResNet-50], VASP 6 [Si Huge]. GPU node: dual-socket CPUs with 4x P100, V100, or A100 GPUs.

NVIDIA A100 DETAILED SPECS

Transistor Count: 54 billion
Die Size: 826 mm²
Streaming Multiprocessors: 108
FP64 CUDA Cores: 3,456
FP32 CUDA Cores: 6,912
Tensor Cores: 432
Peak FP64: 9.7 teraFLOPS
Peak FP64 Tensor Core: 19.5 teraFLOPS
Peak FP32: 19.5 teraFLOPS
Peak TF32 Tensor Core: 156 teraFLOPS | 312 teraFLOPS*
Peak BFLOAT16 Tensor Core: 312 teraFLOPS | 624 teraFLOPS*
Peak FP16 Tensor Core: 312 teraFLOPS | 624 teraFLOPS*
Peak INT8 Tensor Core: 624 TOPS | 1,248 TOPS*
Peak INT4 Tensor Core: 1,248 TOPS | 2,496 TOPS*
GPU Memory: 40 GB
Interconnect: NVLink 600 GB/s | PCIe Gen4 64 GB/s
Multi-Instance GPU: various instance sizes with up to 7 MIGs @ 5 GB
Form Factor: 4/8/16 SXM GPUs in HGX A100
Max Power: 400 W (SXM)
* Includes sparsity

5 MIRACLES OF A100

Ampere – world's largest 7 nm chip: 54B transistors, HBM2
3rd Gen Tensor Cores – faster, flexible, easier to use: 20x AI performance with TF32
New sparsity acceleration – harness sparsity in AI models: 2x AI performance
New Multi-Instance GPU – optimal utilization with right-sized GPUs: 7 simultaneous instances per GPU
3rd Gen NVLINK and NVSWITCH – efficient scaling to enable a super GPU: 2x more bandwidth

CUDA KEY INITIATIVES

Hierarchy: programming and running systems at every scale of memory & processing
Asynchrony: creating concurrency at every level of the hierarchy
Latency: overcoming Amdahl with lower overheads
Language: supporting and evolving Standard Languages
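As a small illustration of the asynchrony theme (my example, not NVIDIA's): CUDA streams let copies and kernels be enqueued without blocking the CPU, so independent work can overlap. The kernel and sizes below are placeholders.

    #include <cuda_runtime.h>

    __global__ void work(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 2.0f + 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *h;
        cudaMallocHost(&h, n * sizeof(float));   // pinned host memory enables async copies
        for (int i = 0; i < n; ++i) h[i] = 1.0f;
        float *d;
        cudaMalloc(&d, n * sizeof(float));

        cudaStream_t s;
        cudaStreamCreate(&s);

        // Enqueue copy, kernel and copy-back in one stream: the CPU does not block,
        // and work submitted to other streams could overlap with these operations.
        cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, s);
        work<<<(n + 255) / 256, 256, 0, s>>>(d, n);
        cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, s);

        cudaStreamSynchronize(s);                // wait only when the result is needed
        cudaStreamDestroy(s);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }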

CUDA ON ARM – Technical Preview Release (available for download)

HPC app and vis containers (NGC): LAMMPS, GROMACS, MILC, NAMD, TensorFlow, HOOMD-blue, VMD, ParaView, CUDA base containers
CUDA-X libraries – math: cuBLAS, cuSOLVER, cuSPARSE, cuFFT, cuRAND, Thrust, libcu++, Math API; graphics: NVIDIA IndeX; comms: NCCL, CUDA-aware MPI
CUDA Toolkit – compilers: GCC 8.3, Arm C/C++, nvc++ (PGI); debugger and profilers: Nsight Systems, Nsight Compute, CUPTIv2 tracing APIs and metrics
Operating systems: RHEL 8.0 for Arm, Ubuntu 18.04.3 LTS
OEM systems: HPE Apollo 70, Gigabyte R281 | GPUs: Tesla V100

NEW MULTI-INSTANCE GPU (MIG)
Divide a single GPU into multiple instances, each with isolated paths through the entire memory system

Up to 7 GPU instances in a single A100: full software stack enabled on each instance, with dedicated SMs, memory, L2 cache and bandwidth.
Simultaneous workload execution with guaranteed quality of service: all MIG instances run in parallel with predictable throughput and latency, and fault and error isolation.
Diverse deployment environments: supported with bare metal, Docker, Kubernetes Pods, and virtualized environments.

[Diagram: USER0–USER6, each assigned one GPU instance (GPU Instance 0–6) with its own SMs, L2 slice, system pipe, data crossbar, and DRAM/control path.]

FINE-GRAINED SYNCHRONIZATION
The NVIDIA Ampere GPU architecture allows creation of arbitrary barriers

[Diagram: two thread blocks, each synchronizing groups of threads on an explicit barrier object rather than only the block-wide __syncthreads().]
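One way to express such a barrier from CUDA C++ is libcu++'s cuda::barrier (available since CUDA 11). A minimal sketch under my assumptions (a 256-thread block), where an explicit barrier object replaces the block-wide __syncthreads() between two phases:

    #include <cuda/barrier>

    __global__ void two_phase(float *out) {        // launch with <<<grid, 256>>>
        __shared__ float tile[256];
        __shared__ cuda::barrier<cuda::thread_scope_block> bar;

        if (threadIdx.x == 0)
            init(&bar, blockDim.x);                // expected number of arriving threads
        __syncthreads();                           // publish the initialized barrier

        int t = threadIdx.x;
        tile[t] = (float)t;                        // phase 1: each thread fills its slot

        bar.arrive_and_wait();                     // arrive() and wait() can also be split

        // phase 2: safely read a neighbour's slot written before the barrier
        out[blockIdx.x * blockDim.x + t] = tile[t] + tile[(t + 1) % blockDim.x];
    }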

A100 GPU-ACCELERATED MATH LIBRARIES IN CUDA 11.0

cuBLAS: BF16, TF32 and FP64 Tensor Cores
cuSOLVER: BF16, TF32 and FP64 Tensor Cores
cuTENSOR: BF16, TF32 and FP64 Tensor Cores
CUTLASS: BF16, TF32 and FP64 Tensor Cores
cuSPARSE: increased memory BW, shared memory & L2
cuFFT: increased memory BW, shared memory & L2
CUDA Math API: BF16 & TF32 support
nvJPEG: hardware decoder

For more information see: S21681 – How CUDA Math Libraries Can Help You Unleash the Power of the New NVIDIA A100 GPU
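As one concrete, illustrative example of how these libraries expose the new Tensor Cores without changing data types: in cuBLAS 11 an ordinary FP32 GEMM can be allowed to run on TF32 Tensor Cores by opting in through the handle's math mode. A minimal sketch (matrices left uninitialized for brevity; link with -lcublas):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    int main() {
        const int n = 512;                         // C = alpha*A*B + beta*C, all n x n, FP32
        float *A, *B, *C;
        cudaMalloc(&A, n * n * sizeof(float));
        cudaMalloc(&B, n * n * sizeof(float));
        cudaMalloc(&C, n * n * sizeof(float));

        cublasHandle_t handle;
        cublasCreate(&handle);

        // Opt in: permit FP32 GEMMs to use TF32 Tensor Cores (A100 and later).
        cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, A, n, B, n, &beta, C, n);
        cudaDeviceSynchronize();

        cublasDestroy(handle);
        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }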

CUTLASS – TENSOR CORE PROGRAMMING MODEL
Warp-level GEMM and reusable components for linear algebra kernels in CUDA

CUTLASS 2.2: optimal performance on the NVIDIA Ampere microarchitecture; new floating-point types: nv_bfloat16, TF32, double; deep software pipelines with async memcopy
CUTLASS 2.1: BLAS-style host API
CUTLASS 2.0: significant refactoring using modern C++11 programming

    using Mma = cutlass::gemm::warp::DefaultMmaTensorOp<
        GemmShape<64, 64, 16>,
        half_t, LayoutA,      // GEMM A operand
        half_t, LayoutB,      // GEMM B operand
        float, RowMajor       // GEMM C operand
    >;

    __shared__ ElementA smem_buffer_A[Mma::Shape::kM * GemmK];
    __shared__ ElementB smem_buffer_B[Mma::Shape::kN * GemmK];

    // Construct iterators into SMEM tiles
    Mma::IteratorA iter_A({smem_buffer_A, lda}, thread_id);
    Mma::IteratorB iter_B({smem_buffer_B, ldb}, thread_id);

    Mma::FragmentA frag_A;
    Mma::FragmentB frag_B;
    Mma::FragmentC accum;
    Mma mma;

    accum.clear();

    #pragma unroll 1
    for (int k = 0; k < GemmK; k += Mma::Shape::kK) {

        // Load fragments from A and B matrices
        iter_A.load(frag_A);
        iter_B.load(frag_B);

        // Advance along GEMM K to next tile in A and B matrices
        ++iter_A;
        ++iter_B;

        // Compute matrix product
        mma(accum, frag_A, frag_B, accum);
    }

For more information see: S21745 – Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit

NVIDIA A100: GREATEST GENERATIONAL LEAP – 20X VOLTA

[Chart: peak throughput (TOPS), V100 vs. A100, with and without sparsity]
FP64: V100 8 | A100 20
FP32 / TF32: V100 FP32 16 | A100 TF32 155 | A100 TF32 sparse 310 (up to 20X)
FP16: V100 125 | A100 310 | A100 sparse 625
INT8: V100 60 | A100 625 | A100 sparse 1250 (up to 20X)

Peak performance in trillions of operations per second (TOPS) of A100 compared to V100. V100 rounded to the nearest whole number; A100 rounded to the nearest 5.

NEW TF32 TENSOR CORES

Floating-point formats (sign | range/exponent | precision/mantissa):
FP32: 8 bits range, 23 bits precision
TENSOR FLOAT 32 (TF32): 8 bits range, 10 bits precision
FP16: 5 bits range, 10 bits precision
BFLOAT16: 8 bits range, 7 bits precision

➢ Range of FP32 and precision of FP16
➢ Input in FP32 and accumulation in FP32
➢ No code change needed for a training speed-up
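To see what the 10-bit mantissa means in practice, here is a small host-side sketch (my illustration, assuming TF32 simply keeps the top 10 of FP32's 23 mantissa bits; the real hardware rounds rather than truncates):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Emulate TF32 storage by zeroing the low 13 mantissa bits of an FP32 value
    // (23-bit mantissa -> 10-bit mantissa; the 8-bit exponent is unchanged).
    float to_tf32_truncated(float x) {
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof(bits));
        bits &= 0xFFFFE000u;                 // clear mantissa bits [12:0]
        std::memcpy(&x, &bits, sizeof(bits));
        return x;
    }

    int main() {
        float v = 0.1f;
        printf("FP32 : %.9f\n", v);
        printf("TF32 : %.9f\n", to_tf32_truncated(v));  // coarser, same dynamic range
        return 0;
    }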

6X OUT-OF-THE-BOX SPEEDUP WITH TF32 FOR AI TRAINING

[Chart: BERT pre-training throughput in sequences/sec (axis 0–1,500); V100 with FP32 = 1X, A100 with TF32 = 6X.]

BERT pre-training throughput using PyTorch, including (2/3) Phase 1 and (1/3) Phase 2 | Phase 1 seq len = 128, Phase 2 seq len = 512 | V100: DGX-1 server with 8x V100 using FP32 precision | A100: DGX A100 server with 8x A100 using TF32 precision

STRUCTURAL SPARSITY BRINGS ADDITIONAL SPEEDUPS

[Chart: BERT Large inference, relative execution speed; dense A100 = 1x, sparse A100 = 1.5x faster. Diagram: dense matrix → sparse matrix → A100 Tensor Core.]

➢ Structured sparsity: half the values are zero
➢ Skip half of the compute and memory fetches
➢ Compute at up to 2x the rate vs. non-sparse

BERT Large inference | precision = INT8 with and without sparsity | batch sizes – no sparsity: bs256, with sparsity: bs49 | A100 with 7 MIGs
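To make "half the values are zero" concrete: A100's structured sparsity uses a 2:4 pattern, meaning at most two of every four consecutive weights are non-zero. A tiny host-side check of that pattern (illustrative only, not the actual sparse Tensor Core API):

    #include <cstdio>

    // True if w (length n, n a multiple of 4) satisfies the 2:4 pattern:
    // at most 2 non-zero values in each group of 4 consecutive elements.
    bool is_2_of_4_sparse(const float *w, int n) {
        for (int g = 0; g < n; g += 4) {
            int nonzero = 0;
            for (int j = 0; j < 4; ++j)
                if (w[g + j] != 0.0f) ++nonzero;
            if (nonzero > 2) return false;
        }
        return true;
    }

    int main() {
        float ok[8]  = {0.5f, 0.0f, 0.0f, -1.2f,  0.0f, 3.0f, 0.7f, 0.0f};
        float bad[8] = {0.5f, 0.1f, 0.2f, -1.2f,  0.0f, 3.0f, 0.7f, 0.0f};
        printf("ok  : %s\n", is_2_of_4_sparse(ok, 8)  ? "2:4 sparse" : "not 2:4");
        printf("bad : %s\n", is_2_of_4_sparse(bad, 8) ? "2:4 sparse" : "not 2:4");
        return 0;
    }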

NEW MULTI-INSTANCE GPU (MIG)
Optimize GPU utilization, expand access to more users with guaranteed quality of service

Up to 7 GPU instances in a single A100: dedicated SMs, memory, L2 cache and bandwidth for hardware QoS and isolation
Simultaneous workload execution with guaranteed quality of service: all MIG instances run in parallel with predictable throughput and latency
Right-sized GPU allocation: different-sized MIG instances based on target workloads, with the flexibility to run any type of workload on a MIG instance
Diverse deployment environments: supported with bare metal, Docker, Kubernetes, virtualized environments

[Diagram: seven GPU instances, each with its own GPU memory slice, running independent workloads (e.g. Amber).]

MULTI-INSTANCE GPU (MIG) ON DGX A100
More users and better GPU utilization

Flexible utilization: configure GPUs for vastly different workloads, with GPU instances that are fault-isolated.

GPU instance size | Number of instances available per GPU | GPU memory
1 GPU slice | 7 | 5 GB
2 GPU slice | 3 | 10 GB
3 GPU slice | 2 | 20 GB
4 GPU slice | 1 | 20 GB
7 GPU slice | 1 | 40 GB

With its 8 GPUs (GPU 1–8), 1 DGX A100 = up to 56 users, e.g. Jupyter notebooks, batch training with NGC containers, and inference with TensorRT.

JARVIS – Framework for Multimodal Conversational AI Services

[Diagram: video and audio inputs feed Vision, Speech and NLU services (multi-speaker transcription, gesture recognition, look-to-talk, chatbot), coordinated by a Dialog Manager; pre-trained models are retrained with transfer learning in NeMo and packaged with the Service Maker, running on NVIDIA GPU Cloud with the NVIDIA AI Toolkit and the Triton Inference Server.]

End-to-end multimodal conversational AI services
Pre-trained SOTA models – 100,000 hours of DGX
Retrain with NeMo; deploy services with one line of code
Interactive response: 150 ms on A100 versus 25 s on CPU
Sign up for Early Access: developer.nvidia.com/nvidia-jarvis


REFERENCES
Deep dive into any of the topics you've seen by following these links:

Whitepaper: https://www.nvidia.com/nvidia-ampere-architecture-whitepaper
Developer Blog: https://devblogs.nvidia.com/introducing-low-level-gpu-virtual-memory-management/
S21730 – Inside the NVIDIA Ampere Architecture
S22043 – CUDA Developer Tools: Overview and Exciting New Features
S21975 – Inside NVIDIA's Multi-Instance GPU Feature
S21170 – CUDA on NVIDIA GPU Ampere Architecture: Taking Your Algorithms to the Next Level of Performance
S21819 – Optimizing Applications for NVIDIA Ampere GPU Architecture
S22082 – Mixed-Precision Training of Neural Networks
S21681 – How CUDA Math Libraries Can Help You Unleash the Power of the New NVIDIA A100 GPU
S21745 – Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit
S21766 – Inside the NVIDIA HPC SDK: the Compilers, Libraries and Tools for Accelerated Computing
S21262 – The CUDA C++ Standard Library
S21771 – Optimizing CUDA Kernels Using Nsight Compute

Complete GPU programming course (subtitled in Portuguese): http://www2.ic.uff.br/~gpu/kit-de-ensino-gpgpu/

Deep Learning on GPUs course (in Portuguese): http://www2.ic.uff.br/~gpu/learn-gpu-computing/deep-learning/