Why GPUs Are Green?
Prof. Dr. Esteban Walter Gonzalez Clua, NVIDIA CUDA Fellow
Computer Science Department, Universidade Federal Fluminense, Rio de Janeiro, Brazil

Framework for solving cosmological particle propagation.

DGX A100 DATA CENTER
Compared with a conventional data center drawing on the order of 1 million watts, 5 DGX A100 systems for AI training and inference cost $1M, occupy 1 rack and draw 28 kW: roughly 1/10th the cost and 1/20th the power. That is 0.8 PFLOPS for HPC and 50 PFLOPS for Tensor Cores (= 160 Laurences for HPC, 8,000 Laurences for Tensor), with 35x less energy.

NVIDIA DGX A100 SYSTEM SPECS
Components:
- GPUs: 8x NVIDIA A100 Tensor Core GPUs
- GPU Memory: 320 GB total
- NVIDIA NVSwitch: 6
- Performance: 5 petaFLOPS AI; 10 petaOPS INT8
- CPU: dual AMD Rome, 128 cores total, 2.25 GHz (base), 3.4 GHz (max boost)
- System Memory: 1 TB
- Networking: 9x Mellanox ConnectX-6 VPI HDR InfiniBand/200GigE; optional 10th dual-port ConnectX-6
- Storage: OS on 2x 1.92 TB M.2 NVMe drives; internal storage 15 TB (4x 3.84 TB) U.2 NVMe drives
Power and physical dimensions:
- System Power Usage: 6.5 kW max
- Total System Weight: 271 lbs (123 kg)
- Rack Units: 6 RU
- System Dimensions: height 10.4 in (264.0 mm), width 19.0 in (482.3 mm) max, length 35.3 in (897.1 mm) max
- Operating Temperature: 5 ºC to 30 ºC (41 ºF to 86 ºF)
- Cooling: air

GPU vs. CPU (Intel i7 Bloomfield vs. Kepler K10)
Only ~1% of a CPU is dedicated to computation; 99% goes to moving and storing data to combat latency. The GPU instead spends its die area on compute and expresses the work as a kernel executed by many threads.

The SIMT model
Or: the three things you must learn by heart in this talk.
Why did GPUs become as powerful (and indispensable) for deep learning as they are for rendering?

Tensor Cores (FP16/FP32)
D = A × B + C on 4 × 4 × 4 tiles, with A and B in FP16: 64 FP operations per clock, so the full operation completes in 1 clock cycle. With 8 Tensor Cores per SM, that is 1,024 FP operations per clock per SM (see the WMMA sketch below).

Mixed Precision
"Deep learning researchers have found that deep neural network architectures have a natural resilience to errors due to the backpropagation algorithm used in training them, and some developers have argued that 16-bit floating point (half precision, or FP16) is sufficient for training neural networks."

Memory bandwidth matters!

GPU Computing Flow (slide credited to Mark Harris, NVIDIA; see the SAXPY sketch below):
1. Copy input data from CPU memory to GPU memory (across the PCI bus)
2. Load the GPU program and execute it, caching data on chip for performance
3. Copy results from GPU memory back to CPU memory
The imbalance is the point: on the order of 20 TFLOPS of compute on the GPU, fed by only about 224 GB/s (56 Gfloats/s) of data movement.
Closer: unified memory, with both CPU and GPU accessing the same data.

#3 – One kernel, lots of threads…
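A minimal sketch (not from the slides) of that three-step flow and of the "one kernel, lots of threads" model: a SAXPY in which every element gets its own thread. The kernel name and sizes are illustrative; with unified memory (cudaMallocManaged) the explicit copies in steps 1 and 3 become implicit.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;                  // ~1M elements -> ~1M threads
    const size_t bytes = n * sizeof(float);

    float* h_x = (float*)malloc(bytes);
    float* h_y = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);

    // 1. Copy input data from CPU memory to GPU memory (across the PCI bus)
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    // 2. Launch one kernel with lots of threads: one thread per element
    const int block = 256;
    const int grid = (n + block - 1) / block;
    saxpy<<<grid, block>>>(n, 2.0f, d_x, d_y);

    // 3. Copy results from GPU memory back to CPU memory
    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %.1f\n", h_y[0]);        // expect 4.0

    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}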
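The Tensor Core operation described above (D = A × B + C with FP16 inputs and FP32 accumulation) is exposed to CUDA C++ through the warp-level WMMA API in mma.h, which operates on 16 × 16 × 16 tiles built from the hardware's small matrix op. A minimal sketch, not taken from the talk; names and sizes are illustrative.

#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile of D = A * B + C,
// with A and B in FP16 and accumulation in FP32.
__global__ void wmma_tile(const half* A, const half* B, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);       // C = 0 in this sketch
    wmma::load_matrix_sync(a_frag, A, 16);     // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);   // Tensor Core multiply-accumulate
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}

int main() {
    half *A, *B; float *D;
    cudaMalloc(&A, 16 * 16 * sizeof(half));
    cudaMalloc(&B, 16 * 16 * sizeof(half));
    cudaMalloc(&D, 16 * 16 * sizeof(float));
    cudaMemset(A, 0, 16 * 16 * sizeof(half));
    cudaMemset(B, 0, 16 * 16 * sizeof(half));
    wmma_tile<<<1, 32>>>(A, B, D);             // one warp; build with -arch=sm_70 or newer
    cudaDeviceSynchronize();
    cudaFree(A); cudaFree(B); cudaFree(D);
    return 0;
}

In practice most users reach the Tensor Cores through cuBLAS, cuDNN or CUTLASS rather than writing WMMA directly.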
How things work: GPU vs. CPU
Again, only ~1% of the CPU is dedicated to computation and 99% to moving/storing data to combat latency; on the GPU the work is expressed as a kernel run by many threads.

SM evolution / compute capability.

INTRODUCING NVIDIA A100 — Greatest Generational Leap: 20X Volta
Peak vs. Volta:
- FP32 training: 312 TFLOPS (20X)
- INT8 inference: 1,248 TOPS (20X)
- FP64 HPC: 19.5 TFLOPS (2.5X)
- Multi-Instance GPU: 7X
GPU: 54B transistors | 826 mm² | TSMC 7N | 40 GB Samsung HBM2 | 600 GB/s NVLink

9X MORE PERFORMANCE IN 4 YEARS — Beyond Moore's Law with full-stack innovation
(Chart: geometric mean of application throughput speedups vs. P100, rising from 1X on P100 (2016) through V100 (2017-2019) to 9X on A100 (2020).) Benchmark applications: Amber [PME-Cellulose_NVE], Chroma [szscl21_24_128], GROMACS [ADH Dodec], MILC [Apex Medium], NAMD [stmv_nve_cuda], PyTorch [BERT Large Fine Tuner], Quantum Espresso [AUSURF112-jR], Random Forest FP32 [make_blobs (160000 x 64 : 10)], TensorFlow [ResNet-50], VASP 6 [Si Huge]; GPU node with dual-socket CPUs and 4x P100, V100, or A100 GPUs.

NVIDIA A100 DETAILED SPECS
- Transistor count: 54 billion
- Die size: 826 mm²
- FP64 CUDA cores: 3,456
- FP32 CUDA cores: 6,912
- Tensor Cores: 432
- Streaming Multiprocessors: 108
Peak performance:
- FP64: 9.7 teraFLOPS; FP64 Tensor Core: 19.5 teraFLOPS
- FP32: 19.5 teraFLOPS; TF32 Tensor Core: 156 teraFLOPS | 312 teraFLOPS*
- BFLOAT16 Tensor Core: 312 teraFLOPS | 624 teraFLOPS*
- FP16 Tensor Core: 312 teraFLOPS | 624 teraFLOPS*
- INT8 Tensor Core: 624 TOPS | 1,248 TOPS*
- INT4 Tensor Core: 1,248 TOPS | 2,496 TOPS*
- GPU memory: 40 GB
- Interconnect: NVLink 600 GB/s; PCIe Gen4 64 GB/s
- Multi-Instance GPU: various instance sizes, up to 7 MIGs @ 5 GB
- Form factor: 4/8/16 SXM GPUs in HGX A100
- Max power: 400 W (SXM)
* includes sparsity

5 MIRACLES OF A100
1. Ampere: world's largest 7 nm chip, 54B transistors, HBM2
2. 3rd-gen Tensor Cores: faster, more flexible, easier to use; 20x AI performance with TF32
3. New sparsity acceleration: harness sparsity in AI models; 2x AI performance
4. New Multi-Instance GPU: optimal utilization with a right-sized GPU; 7 simultaneous instances per GPU
5. 3rd-gen NVLink and NVSwitch: efficient scaling to enable a super GPU; 2X more bandwidth

CUDA KEY INITIATIVES
- Hierarchy: programming and running systems at every scale of memory & processing
- Asynchrony: creating concurrency at every level of the hierarchy
- Latency: overcoming Amdahl with lower overheads
- Language: supporting and evolving standard languages

CUDA ON ARM — technical preview release, available for download
- HPC app and vis containers (NGC): LAMMPS, GROMACS, MILC, NAMD, TensorFlow, HOOMD-blue, VMD, ParaView, CUDA base containers
- CUDA-X libraries — math: cuBLAS, cuSOLVER, cuSPARSE, cuFFT, cuRAND, Thrust, libcu++, Math API; graphics: NVIDIA IndeX; comms: NCCL, CUDA-aware MPI
- CUDA Toolkit — compilers: GCC 8.3, Arm C/C++, nvc++ (PGI); debugger/profilers: Nsight Systems, Nsight Compute, CUPTIv2 tracing APIs and metrics
- Operating systems: RHEL 8.0 for Arm, Ubuntu 18.04.3 LTS
- OEM systems: HPE Apollo 70, Gigabyte R281; GPUs: Tesla V100

NEW MULTI-INSTANCE GPU (MIG)
Divide a single GPU into multiple GPU instances — up to 7 GPU instances in a single A100 — with isolated paths through the entire memory system. (Diagram: each GPU instance, USER0 through USER6, gets its own Sys Pipe, Control Xbar, Data Xbar, L2 slice and DRAM partition alongside its SMs.)
- Simultaneous workload execution with guaranteed quality of service: all MIG instances run in parallel with predictable throughput & latency and fault & error isolation
- Full software stack enabled on each instance, with dedicated SM, memory, L2 cache & bandwidth
- Diverse deployment environments: supported with bare metal, Docker, Kubernetes Pod and virtualized environments
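Each MIG instance is handed to a CUDA process as an ordinary device (typically selected via CUDA_VISIBLE_DEVICES with a MIG UUID), so existing code does not change. A minimal enumeration sketch, not from the slides:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);              // MIG slices show up as regular devices
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, %d SMs, %.1f GB\n",
               d, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}

Run inside a job confined to a single 5 GB slice of an A100, the same program would be expected to report only that slice's SMs and memory, which is the isolation the slide describes.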
FINE-GRAINED SYNCHRONIZATION
The NVIDIA Ampere GPU architecture allows the creation of arbitrary barriers within a thread block, beyond the single block-wide __syncthreads() barrier (a code sketch follows at the end of this section).

A100-ACCELERATED MATH LIBRARIES IN CUDA 11.0
- cuBLAS, cuSOLVER, cuTENSOR, CUTLASS: BF16, TF32 and FP64 Tensor Cores
- cuSPARSE, cuFFT: increased memory bandwidth, shared memory & L2
- CUDA Math API: BF16 & TF32 support
- nvJPEG: hardware decoder
For more information see: S21681 – How CUDA Math Libraries Can Help You Unleash the Power of the New NVIDIA A100 GPU

CUTLASS – TENSOR CORE PROGRAMMING MODEL
Warp-level GEMM and reusable components for linear algebra kernels in CUDA.
- CUTLASS 2.2: optimal performance on the NVIDIA Ampere microarchitecture; new floating-point types: nv_bfloat16, TF32, double; deep software pipelines with async memcopy
- CUTLASS 2.1: BLAS-style host API
- CUTLASS 2.0: significant refactoring using modern C++11 programming

The warp-level Tensor Core example from the slide:

using Mma = cutlass::gemm::warp::DefaultMmaTensorOp<
    GemmShape<64, 64, 16>,
    half_t, LayoutA,    // GEMM A operand
    half_t, LayoutB,    // GEMM B operand
    float, RowMajor     // GEMM C operand
>;

__shared__ ElementA smem_buffer_A[Mma::Shape::kM * GemmK];
__shared__ ElementB smem_buffer_B[Mma::Shape::kN * GemmK];

// Construct iterators into SMEM tiles
Mma::IteratorA iter_A({smem_buffer_A, lda}, thread_id);
Mma::IteratorB iter_B({smem_buffer_B, ldb}, thread_id);

Mma::FragmentA frag_A;
Mma::FragmentB frag_B;
Mma::FragmentC accum;
Mma mma;

accum.clear();

#pragma unroll 1
for (int k = 0; k < GemmK; k += Mma::Shape::kK) {
    // Load fragments from A and B matrices
    iter_A.load(frag_A);
    iter_B.load(frag_B);

    // Advance along GEMM K to the next tile in the A and B matrices
    ++iter_A; ++iter_B;

    // Compute matrix product
    mma(accum, frag_A, frag_B, accum);
}

For more information see: S21745 – Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit

NVIDIA A100 — GREATEST GENERATIONAL LEAP: 20X VOLTA
(Chart: relative compute of A100 vs. V100 for FP64, FP32/TF32, FP16 and INT8, each with and without sparsity.) Peak performance in trillion operations per second (TOPS) of A100 compared to V100 | V100 rounded to the nearest whole number | A100 rounded to the nearest 5.
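As a sketch of the fine-grained synchronization described above: the cuda::barrier type from libcu++ (CUDA 11) lets each thread split "arrive" from "wait", unlike the all-or-nothing __syncthreads(). The example below is mine, not from the slides, and follows the split arrive/wait pattern in the CUDA programming guide; names and sizes are illustrative.

#include <cuda/barrier>
#include <cuda_runtime.h>
#include <cooperative_groups.h>
#include <utility>

namespace cg = cooperative_groups;

__global__ void two_phase(float* data) {
    // Block-scoped barrier living in shared memory (hardware-accelerated on Ampere).
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    auto block = cg::this_thread_block();

    if (block.thread_rank() == 0)
        init(&bar, block.size());       // expected arrival count = all threads in the block
    block.sync();                       // make the initialized barrier visible to everyone

    data[block.thread_rank()] *= 2.0f;  // phase 1: produce values other threads will read

    auto token = bar.arrive();          // signal completion of phase 1 without blocking
    // ... independent work could overlap here, before the results are needed ...
    bar.wait(std::move(token));         // block only once phase-1 results are required

    // phase 2: safe to read every thread's phase-1 result
    float neighbor = data[(block.thread_rank() + 1) % block.size()];
    data[block.thread_rank()] += neighbor;
}

int main() {
    float* d;
    cudaMalloc(&d, 256 * sizeof(float));
    cudaMemset(d, 0, 256 * sizeof(float));
    two_phase<<<1, 256>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}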