Why GPUs Are Green?

Pages: 16 · File type: PDF · Size: 1,020 KB

Why GPUs are green?
Prof. Esteban Walter Gonzalez Clua, Dr., NVIDIA CUDA Fellow
Computer Science Department, Universidade Federal Fluminense, Rio de Janeiro, Brazil

Opening example: a framework for solving cosmological particle propagation.

A conventional 1-megawatt data center can be replaced by 5 DGX A100 systems for AI training and inference: for the same $1M, a single 28 kW rack delivers 1/10th the cost and 1/20th the power, with 0.8 PFLOPS for HPC and 50 PFLOPS on Tensor Cores (= 160 "Laurences" for HPC, 8,000 "Laurences" for Tensor), about 35x less energy.

NVIDIA DGX A100 SYSTEM SPECS
- GPUs: 8x NVIDIA A100 Tensor Core GPUs
- GPU memory: 320 GB total
- Performance: 5 petaFLOPS AI; 10 petaOPS INT8
- NVIDIA NVSwitch: 6
- CPU: dual AMD Rome, 128 cores total, 2.25 GHz (base), 3.4 GHz (max boost)
- System memory: 1 TB
- Networking: 9x Mellanox ConnectX-6 VPI HDR InfiniBand/200 GigE; optional 10th dual-port ConnectX-6
- Storage: OS: 2x 1.92 TB M.2 NVMe drives; internal: 15 TB (4x 3.84 TB) U.2 NVMe drives
- System power usage: 6.5 kW max
- Total system weight: 271 lbs (123 kg)
- Size: 6 rack units (RU); height 10.4 in (264.0 mm), width 19.0 in (482.3 mm), length 35.3 in (897.1 mm) max
- Operating temperature: 5 ºC to 30 ºC (41 ºF to 86 ºF)
- Cooling: air

GPU vs. CPU (Intel i7 Bloomfield vs. Kepler K10 dies): only ~1% of a CPU is dedicated to computation; the other 99% moves and stores data to combat latency. The GPU inverts that balance, spending its silicon on compute and running kernels across thousands of threads.

The SIMT model, or the 3 things you must learn by heart from this talk: why did GPUs become as powerful (and indispensable) for deep learning as they are for rendering?
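The SIMT model can be sketched on a CPU: one kernel body, many logical threads, each distinguished only by its thread index. This is a plain C++ simulation of the idea; the SAXPY kernel and the grid/block sizes are illustrative, not the CUDA API.

```cpp
// SIMT sketch: the SAME kernel body runs for every logical thread; only
// the thread index differs. On a GPU the hardware schedules these in
// warps; here the launch is emulated with nested loops on the CPU.
void saxpy_kernel(int tid, int n, float a, const float* x, float* y) {
    if (tid < n)                      // guard: the grid may overshoot n
        y[tid] = a * x[tid] + y[tid];
}

// Emulated <<<blocks, threads_per_block>>> launch.
void launch(int blocks, int threads_per_block, int n, float a,
            const float* x, float* y) {
    for (int b = 0; b < blocks; ++b)
        for (int t = 0; t < threads_per_block; ++t)
            saxpy_kernel(b * threads_per_block + t, n, a, x, y);
}
```

The point of the model: the programmer writes one scalar-looking kernel, and the hardware supplies the parallelism.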
Tensor Cores (FP16/FP32): D = A x B + C on 4x4x4 tiles, 64 FP operations per clock, with the full operation completing in one clock cycle. At 8 Tensor Cores per SM, that is 1,024 FP operations per clock per SM.

Mixed precision: "Deep learning researchers have found that deep neural network architectures have a natural resilience to errors due to the backpropagation algorithm used in training them, and some developers have argued that 16-bit floating point (half precision, or FP16) is sufficient for training neural networks."

Memory bandwidth matters! GPU Computing Flow (slide credited to Mark Harris, NVIDIA):
1. Copy input data from CPU memory to GPU memory across the PCI bus.
2. Load the GPU program and execute it, caching data on chip for performance.
3. Copy results from GPU memory back to CPU memory.

The numbers make the bottleneck obvious: the GPU computes at ~20 TFLOPS while the PCI bus moves only 224 GB/s (56 Gfloats/s). Hence the move closer to unified memory, with both CPU and GPU accessing the same data.

#3: one kernel, lots of threads. As above, only ~1% of a CPU is dedicated to computation, 99% to moving/storing data to combat latency; the GPU instead launches a kernel over many threads.
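The Tensor Core operation described above, D = A x B + C on 4x4 tiles with FP16 inputs and FP32 accumulation, amounts to 4 x 4 x 4 = 64 fused multiply-adds. A scalar C++ emulation of one such op (illustration only; a real Tensor Core does all of this in a single clock):

```cpp
// Scalar emulation of one Tensor Core op: D = A x B + C for 4x4x4.
// Inputs would be FP16 on hardware; floats stand in here. The FMA
// counter shows the 4*4*4 = 64 multiply-adds per op from the slide.
int mma_4x4x4(const float A[4][4], const float B[4][4],
              const float C[4][4], float D[4][4]) {
    int fma_count = 0;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];          // FP32 accumulator
            for (int k = 0; k < 4; ++k) {
                acc += A[i][k] * B[k][j]; // one fused multiply-add
                ++fma_count;
            }
            D[i][j] = acc;
        }
    return fma_count;                      // 64
}
```

Counting each FMA as two floating-point operations, 8 Tensor Cores x 64 FMAs x 2 gives the slide's 1,024 FP operations per clock per SM.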
SM evolution: compute capability.

INTRODUCING NVIDIA A100: greatest generational leap, 20x Volta (peak, vs. Volta):
- FP32 training: 312 TFLOPS (20x)
- INT8 inference: 1,248 TOPS (20x)
- FP64 HPC: 19.5 TFLOPS (2.5x)
- Multi-Instance GPU: 7x
- 54B transistors | 826 mm² | TSMC 7N | 40 GB Samsung HBM2 | 600 GB/s NVLink

9X MORE PERFORMANCE IN 4 YEARS: beyond Moore's law with full-stack innovation. The geometric mean of application speedups vs. P100 grows from 1x (P100, 2016) through roughly 2x-4x (V100, 2017-2019) to 9x (A100, 2020) across AMBER, Chroma, GROMACS, MILC, NAMD, PyTorch, Quantum Espresso, Random Forest, TensorFlow, and VASP. Benchmark applications: Amber [PME-Cellulose_NVE], Chroma [szscl21_24_128], GROMACS [ADH Dodec], MILC [Apex Medium], NAMD [stmv_nve_cuda], PyTorch [BERT Large Fine Tuner], Quantum Espresso [AUSURF112-jR], Random Forest FP32 [make_blobs (160000 x 64 : 10)], TensorFlow [ResNet-50], VASP 6 [Si Huge]; GPU node: dual-socket CPUs with 4x P100, V100, or A100 GPUs.

NVIDIA A100 DETAILED SPECS (peak performance; * includes sparsity)
- Transistor count: 54 billion
- Die size: 826 mm²
- FP64 CUDA cores: 3,456; FP32 CUDA cores: 6,912
- Tensor Cores: 432
- Streaming multiprocessors: 108
- FP64: 9.7 teraFLOPS; FP64 Tensor Core: 19.5 teraFLOPS
- FP32: 19.5 teraFLOPS; TF32 Tensor Core: 156 | 312* teraFLOPS
- BFLOAT16 Tensor Core: 312 | 624* teraFLOPS; FP16 Tensor Core: 312 | 624* teraFLOPS
- INT8 Tensor Core: 624 | 1,248* TOPS; INT4 Tensor Core: 1,248 | 2,496* TOPS
- GPU memory: 40 GB
- Interconnect: NVLink 600 GB/s; PCIe Gen4 64 GB/s
- Multi-Instance GPU: various instance sizes, up to 7 MIGs @ 5 GB each
- Form factor: 4/8/16 SXM GPUs in HGX A100
- Max power: 400 W (SXM)

5 MIRACLES OF A100:
1. Ampere, the world's largest 7 nm chip: 54B transistors, HBM2.
2. 3rd-gen Tensor Cores, faster, more flexible, easier to use: 20x AI performance with TF32.
3. New sparsity acceleration, harnessing sparsity in AI models: 2x AI performance.
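The FP32 and FP64 lines in the table follow directly from cores x 2 FLOPs per FMA x clock. A quick sanity check; the ~1.41 GHz boost clock is an assumption (it does not appear on the slide), chosen because it reproduces the quoted figures:

```cpp
// Peak throughput = CUDA cores x 2 FLOPs per FMA x clock.
// cores * 2 * clock_ghz is in GFLOPS; * 1e-3 converts to TFLOPS.
double peak_tflops(int cores, double clock_ghz) {
    return cores * 2.0 * clock_ghz * 1e-3;
}
```

With the assumed clock, 6,912 FP32 cores give ~19.5 TFLOPS and 3,456 FP64 cores give ~9.7 TFLOPS, matching the spec table.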
4. New Multi-Instance GPU, optimal utilization with right-sized GPU instances: 7 simultaneous instances per GPU.
5. 3rd-gen NVLink and NVSwitch, efficient scaling to enable a "super GPU": 2x more bandwidth.

CUDA KEY INITIATIVES:
- Hierarchy: programming and running systems at every scale.
- Asynchrony: creating concurrency at every level of the hierarchy.
- Latency: overcoming Amdahl with lower overheads for memory & processing.
- Language: supporting and evolving standard languages.

CUDA ON ARM, technical preview release, available for download:
- HPC app and vis containers (NGC): LAMMPS, GROMACS, MILC, NAMD, TensorFlow, HOOMD-blue, VMD, ParaView; CUDA base containers.
- CUDA-X libraries. Math: cuBLAS, cuSOLVER, cuSPARSE, cuFFT, cuRAND, Math API, Thrust, libcu++. Graphics: NVIDIA IndeX. Comms: NCCL, CUDA-aware MPI.
- CUDA Toolkit. Compilers: GCC 8.3, Arm C/C++, nvc++ (PGI). Debugger and profilers: Nsight Systems, Nsight Compute, CUPTIv2 tracing APIs and metrics.
- Operating systems: RHEL 8.0 for Arm, Ubuntu 18.04.3 LTS.
- OEM systems: HPE Apollo 70, Gigabyte R281; GPUs: Tesla V100.

NEW MULTI-INSTANCE GPU (MIG): divide a single A100 into up to 7 GPU instances (GPU instances 0-6, USER0-USER6), each with isolated paths through the entire memory system: dedicated SMs, sys pipes, control and data crossbars, L2 cache slices, and DRAM.
- Diverse deployment environments: supported with bare metal, Docker, Kubernetes pods, and virtualized environments.
- Simultaneous workload execution with guaranteed quality of service: all MIG instances run in parallel with predictable throughput and latency, fault and error isolation, and dedicated SM, memory, L2 cache, and bandwidth on each instance, with the full software stack enabled on each.

FINE-GRAINED SYNCHRONIZATION
The NVIDIA Ampere GPU architecture allows the creation of arbitrary barriers, not just the single thread-block-wide barrier of __syncthreads().

A100 GPU-ACCELERATED MATH LIBRARIES IN CUDA 11.0:
- cuBLAS, cuSOLVER, cuTENSOR: BF16, TF32, and FP64 Tensor Cores.
- cuSPARSE and cuFFT: increased memory bandwidth, shared memory & L2.
- CUTLASS: BF16, TF32, and FP64 Tensor Cores.
- CUDA Math API: BF16 & TF32 support.
- nvJPEG: hardware decoder.
For more information see: S21681 - How CUDA Math Libraries Can Help You Unleash the Power of the New NVIDIA A100 GPU.

CUTLASS: warp-level GEMM and reusable components for linear algebra kernels in CUDA. Release highlights: CUTLASS 2.2, optimal performance on the NVIDIA Ampere microarchitecture, new floating-point types (nv_bfloat16, TF32, double), deep software pipelines with async memcopy; CUTLASS 2.1, BLAS-style host API; CUTLASS 2.0, significant refactoring using modern C++11 programming.

CUTLASS Tensor Core programming model:

```cpp
using Mma = cutlass::gemm::warp::DefaultMmaTensorOp<
    GemmShape<64, 64, 16>,
    half_t, LayoutA,    // GEMM A operand
    half_t, LayoutB,    // GEMM B operand
    float, RowMajor     // GEMM C operand
>;

__shared__ ElementA smem_buffer_A[Mma::Shape::kM * GemmK];
__shared__ ElementB smem_buffer_B[Mma::Shape::kN * GemmK];

// Construct iterators into SMEM tiles
Mma::IteratorA iter_A({smem_buffer_A, lda}, thread_id);
Mma::IteratorB iter_B({smem_buffer_B, ldb}, thread_id);

Mma::FragmentA frag_A;
Mma::FragmentB frag_B;
Mma::FragmentC accum;
Mma mma;

accum.clear();

#pragma unroll 1
for (int k = 0; k < GemmK; k += Mma::Shape::kK) {
    // Load fragments from A and B matrices
    iter_A.load(frag_A);
    iter_B.load(frag_B);

    // Advance along GEMM K to next tile in A and B matrices
    ++iter_A;
    ++iter_B;

    // Compute matrix product
    mma(accum, frag_A, frag_B, accum);
}
```

For more information see: S21745 - Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit.
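Stripped of the CUTLASS types, the structure of that main loop is a K-tiled accumulation: march along the GEMM K dimension in chunks of kK, accumulating partial products. A scalar C++ sketch of the tiling idea only, not the CUTLASS API:

```cpp
// Scalar sketch of a K-tiled GEMM main loop: C += A x B, walking the K
// dimension in tiles of kK the way the warp-level fragment loop does,
// accumulating each tile's partial product into C (the "accum" role).
void gemm_k_tiled(int M, int N, int K, int kK,
                  const float* A,    // M x K, row-major
                  const float* B,    // K x N, row-major
                  float* C) {        // M x N, row-major, pre-initialized
    for (int k0 = 0; k0 < K; k0 += kK)           // one tile per trip
        for (int i = 0; i < M; ++i)
            for (int j = 0; j < N; ++j) {
                float acc = 0.0f;                // this tile's partial product
                for (int k = k0; k < k0 + kK && k < K; ++k)
                    acc += A[i * K + k] * B[k * N + j];
                C[i * N + j] += acc;             // like mma(accum, a, b, accum)
            }
}
```

On the GPU, each tile's fragments are staged through shared memory and consumed by Tensor Cores; the loop skeleton is the same.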
NVIDIA A100, GREATEST GENERATIONAL LEAP: 20X VOLTA. Peak performance in trillion operations per second (TOPS) of A100 compared to V100, with V100 rounded to the nearest whole number and A100 rounded to the nearest 5:
- FP64: V100 8 vs. A100 FP64 Tensor Core 20
- FP32: V100 16 vs. A100 TF32 155 (310 sparse)
- FP16: V100 125 vs. A100 310 (625 sparse)
- INT8: V100 60 vs. A100 625 (1,250 sparse)
Recommended publications
  • 20201130 Gcdv V1.0.Pdf
    FROM RESEARCH TO INDUSTRY. Architecture evolutions for HPC, 30 November 2020. Guillaume Colin de Verdière, Commissariat à l'énergie atomique et aux énergies alternatives (CEA), www.cea.fr / EUROfusion.

    EVOLUTION DRIVER: TECHNOLOGICAL CONSTRAINTS. The power wall, the scaling wall, the memory wall, the move towards accelerated architectures, and the main points of SC20.

    POWER WALL: P = cV²F, where P is power, V voltage, and F frequency. Responses: reduce V (reuse of embedded technologies), limit frequencies, add more cores, and spend transistors on compute rather than logic (larger SIMD units, GPU-like structures). (© Karl Rupp; https://software.intel.com/en-us/blogs/2009/08/25/why-p-scales-as-cv2f-is-so-obvious-pt-2-2)

    SCALING WALL: Moore's law comes to an end, with a probable limit around 3-5 nm and a need for new transistor structures (FinFET, carbon nanotubes). Circuit size is also limited: yield decreases as die area increases, so chiplets will dominate (© AMD). Data movement will be the most expensive operation: 1 DFMA = 20 pJ, SRAM access = 50 pJ, DRAM access = 1 nJ (source: NVIDIA); 1 nJ = 1000 pJ; Φ Si = 110 pm.

    MEMORY WALL: better bandwidth with HBM. DDR5 @ 5200 MT/s with 8 channels = 0.33 TB/s, while HBM2 with 4 stacks = 1.64 TB/s. Skylake: SMT2 (figure: threads 1-4 sharing a core).
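The P = cV²F argument can be made concrete with a few lines of arithmetic. The voltage, frequency, and capacitance values below are illustrative constants, not measurements:

```cpp
// Dynamic power: P = c * V^2 * F per core, summed over cores.
// Throughput is idealized as cores x F (core-GHz), so four 1 GHz cores
// match one 4 GHz core -- but at reduced voltage they burn less power.
double dynamic_power(double c, double volts, double f_ghz, int cores) {
    return cores * c * volts * volts * f_ghz;
}

double ideal_throughput(double f_ghz, int cores) {
    return cores * f_ghz;   // core-GHz
}
```

With c = 1: one 4 GHz core at 1.2 V burns 1.2² x 4 = 5.76 units, while four 1 GHz cores at 0.9 V burn 4 x 0.9² x 1 = 3.24 units for the same 4 core-GHz, which is the slide's "more cores, lower V, limited frequency" in miniature.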
  • NVIDIA DGX Station: The First Personal AI Supercomputer
    White Paper: NVIDIA DGX Station, The First Personal AI Supercomputer. Contents: 1.0 Introduction; 2.0 NVIDIA DGX Station Architecture; 2.1 NVIDIA Tesla V100; 2.2 Second-Generation NVIDIA NVLink™; 2.3 Water-Cooling System for the GPUs; 2.4 GPU and System Memory; 2.5 Other Workstation Components; 3.0 Multi-GPU with NVLink; 3.1 DGX NVLink Network Topology for Efficient Application Scaling; 3.2 Scaling Deep Learning Training on NVLink; 4.0 DGX Station Software Stack for Deep Learning; 4.1 NVIDIA CUDA Toolkit; 4.2 NVIDIA Deep Learning SDK; 4.3 Docker Engine Utility for NVIDIA GPUs; 4.4 NVIDIA Container
  • The NVIDIA DGX-1 Deep Learning System
    NVIDIA DGX-1 ARTIFICIAL INTELLIGENCE SYSTEM: The World's First AI Supercomputer in a Box. Get faster training, larger models, and more accurate results from deep learning with the NVIDIA® DGX-1™. This is the world's first purpose-built system for deep learning and AI-accelerated analytics, with performance equal to 250 conventional servers. It comes fully integrated with hardware, deep learning software, development tools, and accelerated analytics applications. Immediately shorten data processing time, visualize more data, accelerate deep learning frameworks, and design more sophisticated neural networks.

    Iterate and Innovate Faster: high-performance training accelerates your productivity, giving you faster insight and time to market.

    SYSTEM SPECIFICATIONS. GPUs: 8x Tesla P100; TFLOPS (GPU FP16 / CPU FP32): 170/3; GPU memory: 16 GB per GPU; CPU: dual 20-core Intel® Xeon® E5-2698 v4 2.2 GHz; NVIDIA CUDA® cores: 28,672; system memory: 512 GB 2,133 MHz DDR4; storage: 4x 1.92 TB SSD RAID 0; network: dual 10 GbE, 4x IB EDR; software: Ubuntu Server Linux OS, DGX-1 recommended GPU driver; system weight: 134 lbs; system dimensions: 866 D x 444 W x 131 H (mm); packing dimensions: 1180 D x 730 W x 284 H (mm); maximum power requirements: 3,200 W; operating temperature range: 10-30 °C.

    NVIDIA DGX-1 delivers 58x faster training: 23 hours (less than 1 day) versus 1,310 hours (54.58 days) on a CPU-only server, relative performance based on time to train. (Caffe benchmark with the VGG-D network, training 1.28M images over 70 epochs; the CPU server uses 2x Xeon E5-2699v4 CPUs.)

    NVIDIA DGX-1 delivers 34x more performance: 170 TFLOPS versus 5 TFLOPS for a CPU-only server (dual-socket Intel Xeon E5-2699v4; 170 TF is half precision, FP16). DGX-1 | DATA SHEET | Dec 16. Computing for infinite opportunities.

    NVIDIA DGX-1 Software Stack: the NVIDIA DGX-1 is the first system built with NVIDIA Pascal™-powered Tesla® P100 accelerators, with integrated deep learning frameworks.
  • Computing for the Most Demanding Users
    COMPUTING FOR THE MOST DEMANDING USERS. Artificial intelligence, the dream of computer scientists for over half a century, is no longer science fiction. And in the next few years, it will transform every industry. Soon, self-driving cars will reduce congestion and improve road safety. AI travel agents will know your preferences and arrange every detail of your family vacation. And medical instruments will read and understand patient DNA to detect and treat early signs of cancer. Where engines made us stronger and powered the first industrial revolution, AI will make us smarter and power the next. What will make this intelligent industrial revolution possible? A new computing model, GPU deep learning, that enables computers to learn from data and write software that is too complex for people to code.

    NVIDIA, INVENTOR OF THE GPU: The GPU has proven to be unbelievably effective at solving some of the most complex problems in computer science. It started out as an engine for simulating human imagination, conjuring up the amazing virtual worlds of video games and Hollywood films. Today, NVIDIA's GPU simulates human intelligence, running deep learning algorithms and acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. This is our life's work: to amplify human imagination and intelligence.

    THE NVIDIA GPU DEFINES MODERN COMPUTER GRAPHICS: Our invention of the GPU in 1999 made possible real-time programmable shading, which gives artists an infinite palette for expression. We've led the field of visual computing since.

    SIMULATING HUMAN IMAGINATION: Digital artists, industrial designers, filmmakers, and broadcasters rely on NVIDIA Quadro® pro graphics to bring their imaginations to life.
  • NVIDIA Autonomous Driving Platform
    NVIDIA AUTONOMOUS DRIVING PLATFORM. April 2017, Marcus Oh, Sr. Solution Architect. Contents: Who is NVIDIA; Deep Learning in Autonomous Driving; Training Infra & Machine / DIGIT; DRIVE PX2; DriveWorks; Use-Case Example; Next-Generation AD Platform. (NVIDIA confidential: DRIVE PX2 development platform.)

    NVIDIA: founded in 1993; CEO & co-founder: Jen-Hsun Huang; FY16 revenue: $5.01B; 9,500 employees; 7,300 patents; HQ in Santa Clara, CA. NVIDIA is "the AI computing company": GPU computing, computer graphics, artificial intelligence.

    WHAT IS DEEP LEARNING? From input to result: a deep learning framework repeats forward propagation (predict, e.g. "turtle") and backward propagation (compute weight updates to nudge the prediction from "turtle" towards "dog") over training data; the trained neural net model is then used for inference (e.g. "cat").

    REINFORCEMENT LEARNING: how does it work? A reinforcement learning agent includes state (environment), actions (controls), and reward (feedback). A value function predicts the future reward of performing actions in the current state; given the recent state, the action with the maximum estimated future reward is chosen for execution. For agents with complex state spaces, deep networks are used as Q-value approximators, and a numerical solver (gradient descent) optimizes the network on the fly based on reward inputs. github.com/dusty-nv/jetson-reinforcement

    SELF-DRIVING CARS ARE AN AI CHALLENGE: perception AI, localization, and driving AI, all built on deep learning. The NVIDIA AI system for autonomous driving spans mapping (KALDI), localization, and DriveNet: training on NVIDIA DGX-1, driving with NVIDIA DRIVE PX 2 and DriveWorks.

    170X SPEED-UP OVER COTS SERVER: Microsoft Cognitive Toolkit supercharged on NVIDIA DGX-1, 170x faster (AlexNet images/sec): 13,000 versus 78. 8x Tesla P100 | 170 TF FP16 | NVLink hybrid cube mesh. AlexNet training batch size 128; dual-socket E5-2699v4, 44 cores; CNTK 2.0b2 for CPU.
  • Investigations of Various HPC Benchmarks to Determine Supercomputer Performance Efficiency and Balance
    Investigations of Various HPC Benchmarks to Determine Supercomputer Performance Efficiency and Balance. Wilson Lisan, August 24, 2018. MSc in High Performance Computing, The University of Edinburgh, Year of Presentation: 2018.

    Abstract: This dissertation project is based on participation in the Student Cluster Competition (SCC) at the International Supercomputing Conference (ISC) 2018 in Frankfurt, Germany, as part of the four-member Team EPCC from The University of Edinburgh. There are two main projects: a team-based project and a personal project. The team-based project focuses on optimisations and tweaks of the HPL, HPCG, and HPCC benchmarks to meet the competition requirements. At the competition, Team EPCC suffered hardware issues that left the cluster an asymmetrical system with mixed hardware; extreme tuning measures successfully drove the cluster back to its ideal performance. The personal project focuses on running the SCC benchmarks to evaluate performance efficiency and system balance on several HPC systems. The ratio of HPCG's fraction of peak to HPL's was used to determine system performance efficiency from peak versus actual performance. Analysis with the HPCC benchmark showed that this fraction-of-peak ratio can reveal memory and network balance relative to raw processor or GPU performance, and point to likely memory or network bottlenecks. Contents: Chapter 1, Introduction.
  • Fabric Manager for NVIDIA NVSwitch Systems
    Fabric Manager for NVIDIA NVSwitch Systems. User Guide / Virtualization / High Availability Modes. DU-09883-001_v0.7 | January 2021.

    Document history: v0.1 (Oct 25, 2019, SB): initial beta release. v0.2 (Mar 23, 2020, SB): updated error handling and bare-metal mode. v0.3 (May 11, 2020, YL): updated the Shared NVSwitch APIs section with new API information. v0.4 (Jul 7, 2020, SB): updated MIG interoperability and high-availability details. v0.5 (Jul 17, 2020, SB): updated running-as-non-root instructions. v0.6 (Aug 3, 2020, SB): updated installation instructions based on the CUDA repo and updated SXid error details. v0.7 (Jan 26, 2021, GT, CC): updated with the vGPU multitenancy virtualization mode.

    Table of contents: Chapter 1, Overview (1.1 Introduction; 1.2 Terminology; 1.3 NVSwitch Core Software Stack; 1.4 What is Fabric Manager?); Chapter 2, Getting Started With Fabric Manager (2.1 Basic Components)
  • NVIDIA DGX-1 System Architecture White Paper
    White Paper: NVIDIA DGX-1 System Architecture, The Fastest Platform for Deep Learning. Table of contents: 1.0 Introduction; 2.0 NVIDIA DGX-1 System Architecture; 2.1 DGX-1 System Technologies; 3.0 Multi-GPU and Multi-System Scaling with NVLink and InfiniBand; 3.1 DGX-1 NVLink Network Topology for Efficient Application Scaling; 3.2 Scaling Deep Learning Training on NVLink; 3.3 InfiniBand for Multi-System Scaling of DGX-1 Systems; 4.0 DGX-1 Software; 4.1 NVIDIA CUDA Toolkit; 4.2 NVIDIA Docker; 4.3 NVIDIA Deep Learning SDK; 4.4 NCCL; 5.0 Deep Learning Frameworks for DGX-1; 5.1 NVIDIA Caffe
  • NVIDIA A100 Tensor Core GPU Architecture: Unprecedented Acceleration at Every Scale
    NVIDIA A100 Tensor Core GPU Architecture: Unprecedented Acceleration at Every Scale, V1.0. Table of contents: Introduction; Introducing NVIDIA A100 Tensor Core GPU, our 8th-Generation Data Center GPU for the Age of Elastic Computing; NVIDIA A100 Tensor Core GPU Overview; Next-generation Data Center and Cloud GPU; Industry-leading Performance for AI, HPC, and Data Analytics; A100 GPU Key Features Summary; A100 GPU Streaming Multiprocessor (SM); 40 GB HBM2 and 40 MB L2 Cache; Multi-Instance GPU (MIG); Third-Generation NVLink; Support for NVIDIA Magnum IO™ and Mellanox Interconnect Solutions; PCIe Gen 4 with SR-IOV; Improved Error and Fault Detection, Isolation, and Containment; Asynchronous Copy; Asynchronous Barrier; Task Graph Acceleration; NVIDIA A100 Tensor Core GPU Architecture In-Depth; A100 SM Architecture; Third-Generation NVIDIA Tensor Core; A100 Tensor Cores Boost Throughput; A100 Tensor Cores Support All DL Data Types; A100 Tensor Cores Accelerate HPC; Mixed Precision Tensor Cores for HPC; A100 Introduces Fine-Grained Structured Sparsity; Sparse Matrix Definition; Sparse Matrix Multiply-Accumulate (MMA) Operations; Combined L1 Data Cache and Shared Memory; Simultaneous Execution of FP32 and INT32 Operations; A100 HBM2 and L2 Cache Memory Architectures; A100 HBM2 DRAM Subsystem; ECC Memory Resiliency; A100 L2 Cache; Maximizing Tensor Core Performance and Efficiency for Deep Learning Applications; Strong Scaling Deep Learning
  • NVIDIA GPU Computing: A Journey from PC Gaming to Deep Learning (Stuart Oberman, October 2017)
    NVIDIA GPU COMPUTING: A JOURNEY FROM PC GAMING TO DEEP LEARNING. Stuart Oberman | October 2017. NVIDIA accelerated computing spans gaming, pro visualization, enterprise, data center, and auto.

    GEFORCE: PC gaming. 200M GeForce gamers worldwide; the most advanced technology; a gaming ecosystem that is more than just chips; amazing experiences and imagery. NINTENDO SWITCH: powered by NVIDIA Tegra. GEFORCE NOW: amazing games anywhere. AAA titles delivered at 1080p 60 fps, streamed to the SHIELD family of devices, and streaming to Mac (beta): https://www.nvidia.com/en-us/geforce/products/geforce-now/mac-pc/

    GPU COMPUTING application domains: drug design (molecular dynamics), seismic imaging (reverse time migration), automotive design (computational fluid dynamics), and medical imaging (computed tomography), with 15x, 14x, and 30-100x speed-ups; astrophysics (n-body), options pricing (Monte Carlo), product development (finite difference time domain), and weather forecasting (atmospheric physics), with 20x speed-up.

    GPU: 2017. TESLA VOLTA V100 specifications: 21B transistors; 815 mm²; 80 SMs (the full GV100 chip contains 84 SMs); 5,120 CUDA cores; 640 Tensor Cores; 16 GB HBM2 at 900 GB/s; 300 GB/s NVLink.

    HOW DID WE GET HERE? NVIDIA GPUs, 1999 to now: https://youtu.be/I25dLTIPREA

    SOUL OF THE GRAPHICS PROCESSING UNIT. The GPU changes everything: accelerate computationally intensive applications. NVIDIA introduced the GPU in 1999 as a single-chip processor to accelerate PC gaming and 3D graphics. The goal: approach the image quality of movie-studio offline rendering farms, but in real time; instead of hours per frame, more than 60 frames per second.
  • NVIDIA DGX OS 5.0
    NVIDIA DGX OS 5.0 User Guide. DU-10211-001 _v5.0.0 | September 2021. Table of contents: Chapter 1, Introduction to the NVIDIA DGX OS 5 User Guide (1.1 Additional Documentation; 1.2 Customer Support); Chapter 2, Preparing for Operation (2.1 Software Installation and Setup; 2.2 Connecting to the DGX System); Chapter 3, Installing the DGX OS (Reimaging the System) (3.1 Obtaining the DGX OS ISO; 3.2 Installing the DGX OS Image Remotely through the BMC; 3.3 Installing the DGX OS Image from a USB Flash Drive or DVD-ROM; 3.3.1 Creating a Bootable USB Flash Drive by Using the dd Command; 3.3.2 Creating a Bootable USB Flash Drive by Using Akeo Rufus; 3.4 Installation Options)
  • NVIDIA Topology-Aware GPU Selection 0.1.0 (Early Access)
    NVIDIA Topology-Aware GPU Selection 0.1.0 (Early Access) User Guide. DU-09998-001_v0.1.0 (Early Access) | July 2020. Table of contents: Chapter 1, Introduction; Chapter 2, Getting Started (2.1 Prerequisites; 2.2 Installing NVTAGS); Chapter 3, Using NVTAGS (3.1 NVTAGS Tune Mode; 3.1.1 Tune with Profiling; 3.1.2 Run NVTAGS in Tune with Profiling Mode; 3.2 Tune NVTAGS without Profiling Mode; 3.2.1 Run NVTAGS in Tune without Profiling Mode; 3.3 NVTAGS Run Mode)