Introduction to GPU Computing
INTRODUCTION TO HETEROGENEOUS/HYBRID COMPUTING
François Courteille | Senior Solutions Architect, NVIDIA | [email protected]

AGENDA
1. Introduction: Heterogeneous Computing & GPUs
2. CPU versus GPU architecture
3. Accelerated computing roadmap 2015

Introduction: Heterogeneous Computing & GPUs

RACING TOWARD EXASCALE

ACCELERATORS SURGE IN WORLD'S TOP SUPERCOMPUTERS
- 100+ accelerated systems now on the Top500 list
- 1/3 of total FLOPS powered by accelerators
- NVIDIA Tesla GPUs sweep 23 of 24 new accelerated supercomputers
- Tesla supercomputers growing at 50% CAGR over the past five years
(Chart: Top500 count of accelerated supercomputers, 2013-2015.)

NEXT-GEN SUPERCOMPUTERS ARE GPU-ACCELERATED
- SUMMIT and SIERRA: U.S. Dept. of Energy pre-exascale supercomputers for science
- NOAA: new supercomputer for next-generation weather forecasting
- IBM Watson: breakthrough natural language processing for cognitive computing

WHAT IS HETEROGENEOUS COMPUTING?
Application execution is split across the two processors: phases with high data parallelism run on the GPU, while phases that need high serial performance run on the CPU (CPU + GPU). A minimal code sketch of this split appears at the end of this introduction section.

The World Leader in Visual Computing
- Gaming (PC)
- Enterprise (design, virtualization)
- HPC & cloud service providers (data center)
- Autonomous machines (mobile)

Evolution of GPUs
RIVA 128 (3M transistors) -> GeForce 256 (23M) -> GeForce 3 (60M) -> GeForce FX (250M) -> GeForce 8800 (681M) -> "Kepler" (7B), spanning 1995 to 2012: from fixed function, to programmable shaders, to CUDA.

Performance Lead Continues to Grow
(Charts: peak double-precision GFLOPS and peak memory bandwidth in GB/s, 2008-2014, comparing NVIDIA GPUs M1060, M2090, K20, K40, K80 against x86 CPUs Westmere, Sandy Bridge, Ivy Bridge, Haswell.)

GPU Motivation: Peak Flops & Memory BW
(Chart: CPU versus GPU.)

Add GPUs: Accelerate Applications

GPUs Enable Faster Deployment of Improved Algorithms
Schedule pull-in due to GPUs: seismic imaging algorithms (KPrSTM, KPrSDM, CAZ, Gaussian Beam, Shot WEM, WEM (VTI), WEM (TTI), Wave Equation, Reverse Time Migration, Elastic Waveform Inversion) reach production sooner on GPUs than on CPUs alone.

ACCELERATING DISCOVERIES
Using a supercomputer powered by 3,000 Tesla processors, University of Illinois scientists performed the first all-atom simulation of the HIV virus and discovered the chemical structure of its capsid, "the perfect target for fighting the infection." Without GPUs, the supercomputer would need to be 5x larger for similar performance.

GOOGLE DATACENTER vs. STANFORD AI LAB: ACCELERATING INSIGHTS
"Now You Can Build Google's $1M Artificial Brain on the Cheap"
- Google datacenter: 1,000 CPU servers, 2,000 CPUs (16,000 cores), 600 kW, $5,000,000
- Stanford AI Lab: 3 GPU-accelerated servers, 12 GPUs (18,432 cores), 4 kW, $33,000
"Deep learning with COTS HPC systems", A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro, ICML 2013
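To make the CPU/GPU split described above concrete, here is a minimal CUDA C++ sketch (added for this write-up, not a slide from the deck): serial setup and control flow stay on the CPU, while the data-parallel loop is offloaded to the GPU as a kernel. The kernel name, problem size, and the use of cudaMallocManaged (unified memory, chosen only for brevity) are assumptions.

#include <cstdio>
#include <cuda_runtime.h>

// Data-parallel work: each GPU thread scales one array element.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x;
    // Managed memory keeps the sketch short; the slides later show explicit copies.
    cudaMallocManaged(&x, n * sizeof(float));

    // High serial performance: setup and control flow stay on the CPU.
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    // High data parallelism: offloaded to the GPU.
    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);
    cudaDeviceSynchronize();

    printf("x[0] = %.1f\n", x[0]);  // expect 2.0
    cudaFree(x);
    return 0;
}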
Power is the Problem
120 petaflops | 376 megawatts: enough to power all of San Francisco.

CSCS Piz Daint: Top 10 in Top500 and Green500

Top500 (rank, TFLOP/s, site):
1.  33,862.7  National Super Computer Centre, Guangzhou
2.  17,590.0  Oak Ridge National Lab (#1 USA)
3.  17,173.2  DOE, United States
4.  10,510.0  RIKEN Advanced Institute for Computational Science
5.   8,586.6  Argonne National Lab
6.   6,271.0  Swiss National Supercomputing Centre (CSCS) (#1 Europe)
7.   5,168.1  University of Texas
8.   5,008.9  Forschungszentrum Juelich
9.   4,293.3  DOE, United States
10.  3,143.5  Government

Green500 (rank, MFLOPS/W, site):
1.  4,389.82  GSIC Center, Tokyo Tech
2.  3,631.70  Cambridge University
3.  3,517.84  University of Tsukuba
4.  3,459.46  SURFsara
5.  3,185.91  Swiss National Supercomputing Centre (CSCS)
6.  3,131.06  ROMEO HPC Center
7.  3,019.72  CSIRO
8.  2,951.95  GSIC Center, Tokyo Tech
9.  2,813.14  Eni
10. 2,629.10  (Financial Institution)
16. 2,495.12  Mississippi State (top non-NVIDIA)
59. 1,226.60  ICHEC (top x86 cluster)

Piz Daint: 5,272 nodes, each with 1x Xeon E5-2670 (Sandy Bridge) CPU and 1x NVIDIA K20X GPU; Cray XC30 at 2.0 MW.

CUDA: World's Most Pervasive Parallel Programming Model
- 700+ university courses in 62 countries
- 14,000 institutions with CUDA developers
- 2,000,000 CUDA downloads
- 487,000,000 CUDA GPUs shipped
- 370 GPU-accelerated applications (www.nvidia.com/appscatalog)

70% OF TOP HPC APPS ACCELERATED
Intersect360 survey of top applications ("HPC Application Support for GPU Computing", Intersect360, Nov 2015):
- Top 10 HPC apps: 90% accelerated
- Top 50 HPC apps: 70% accelerated
Top 25 apps in the survey: GROMACS, LAMMPS, SIMULIA Abaqus, NWChem, NAMD, LS-DYNA, AMBER, Schrodinger, ANSYS Mechanical, Exelis IDL, Gaussian, MSC NASTRAN, GAMESS, ANSYS Fluent, ANSYS CFX, WRF, Star-CD, VASP, CCSM, OpenFOAM, COMSOL, CHARMM, Star-CCM+, Quantum Espresso, BLAST.
(Legend: all popular functions accelerated / some popular functions accelerated / in development / not supported.)

CORAL: Built for Grand Scientific Challenges
- Fusion energy: role of material disorder, statistics, and fluctuations in nanoscale materials and systems
- Climate change: study climate change adaptation and mitigation scenarios; realistically represent detailed features
- Biofuels: search for renewable and more efficient energy sources
- Astrophysics: radiation transport, critical to astrophysics, laser fusion, atmospheric dynamics, and medical imaging
- Combustion: combustion simulations to enable next-generation diesel and biofuels to burn more efficiently
- Nuclear energy: unprecedented high-fidelity radiation transport calculations for nuclear energy applications

CPU versus GPU architecture
Low Latency or High Throughput?
CPU: optimised for low-latency access to cached data sets
- Control logic for out-of-order and speculative execution
- 10's of threads
GPU: optimised for data-parallel, throughput computation
- Architecture tolerant of memory latency
- Massive fine-grained threaded parallelism
- More transistors dedicated to computation
- 10,000's of threads

GPU Architecture: Two Main Components
Global memory
- Analogous to RAM in a CPU server
- Accessible by both the GPU and the CPU
- Currently up to 12 GB per GPU
- Bandwidth currently up to ~288 GB/s (Tesla products)
- ECC on/off (Quadro and Tesla products)
Streaming Multiprocessors (SMs)
- Perform the actual computations
- Each SM has its own control units, registers, execution pipelines, and caches

GPU Architecture
SM-0, SM-1, ..., SM-N share an L2 cache and GPU DRAM; the GPU connects to system memory through the host interface.

GPU Memory Hierarchy
Registers -> L1 cache / shared memory (SMEM) per SM (~1 TB/s) -> L2 cache (~150 GB/s) -> global memory.

Scientific Computing Challenge: Memory Bandwidth
NVIDIA technology addresses the memory bandwidth challenge with a deep hierarchy (approximate bandwidths): registers ~10.8 TB/s, shared memory / L2 cache ~1.3 TB/s, GPU memory ~177-280 GB/s, PCI-Express ~6.4 GB/s.

Kepler GK110 Block Diagram
- 7.1B transistors
- 15 SMX units
- > 1 TFLOP FP64
- 1.5 MB L2 cache
- 384-bit GDDR5

GPU SM Architecture (Kepler SM)
- Functional units = CUDA cores: 192 single-precision and 64 double-precision floating-point operations per clock
- Register file (256 KB)
- Shared memory (16-48 KB)
- L1 cache (16-48 KB)
- Read-only cache (48 KB)
- Constant cache (8 KB)

SIMT Execution Model
- Thread: sequential execution unit; all threads execute the same sequential program; threads execute in parallel
- Thread block: a group of threads; threads within a block can cooperate (lightweight synchronization, data exchange)
- Grid: a collection of thread blocks; thread blocks do not synchronize with each other, and communication between blocks is expensive

SIMT Execution Model: Software to Hardware Mapping
- Threads are executed by CUDA cores
- Thread blocks are executed on multiprocessors; thread blocks do not migrate, and several concurrent thread blocks can reside on one multiprocessor, limited by its resources (shared memory and register file)
- A kernel is launched as a grid of thread blocks on the device

SIMT Execution Model: Warps
- Threads are organized into groups of 32 threads called "warps"
- All threads within a warp execute the same instruction simultaneously

Simple Processing Flow
1. Copy input data from CPU memory/NIC to GPU memory over the PCI bus
2. Load the GPU program and execute
3. Copy results from GPU memory back to CPU memory/NIC
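The processing flow above and the SIMT model can be tied together with a short CUDA C++ sketch (added here as an illustration; the kernel and buffer names are invented for the example): the host copies inputs to GPU global memory, launches a grid of thread blocks, and copies the result back, while each thread derives its global index from blockIdx, blockDim, and threadIdx.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one element: global index = block offset + thread offset.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host (CPU) buffers.
    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device (GPU) global-memory buffers.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // 1. Copy input data from CPU memory to GPU memory (PCI bus).
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 2. Load the GPU program and execute: a grid of thread blocks, 256 threads each.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // 3. Copy results from GPU memory back to CPU memory.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %.1f\n", h_c[0]);  // expect 3.0
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}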
Accelerator Fundamentals
- We must expose enough parallelism to saturate the device
- Accelerator threads are slower than CPU threads, but accelerators have orders of magnitude more threads
- Fine-grained parallelism is good; coarse-grained parallelism is bad
(Diagram: fine-grained versus coarse-grained assignment of work to threads t0-t15.)

Best Practices: Optimize Data Locality: GPU
- Minimize data transfers between the CPU (system memory) and the GPU (GPU memory)

Best Practices: Optimize Data Locality: SM
- Minimize redundant accesses to L2 and DRAM
- Store intermediate results in registers instead of global memory
- Use shared memory for data frequently used within a thread block
- Use const __restrict__ to take advantage of the read-only cache
(A code sketch illustrating the last two points follows the Resources slide below.)

3 Ways to Accelerate Applications
- Libraries: easy to use, most performance
- Compiler directives: easy to use, portable code
- Programming languages: most performance, most flexibility

Resources
- Learn more about GPUs
- CUDA resource center: http://docs.nvidia.com/cuda
- GTC on-demand and
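As a hedged illustration of the SM-level data-locality advice (an added sketch, not code from the deck; the kernel, block size, and stencil are assumptions): the kernel below stages each block's working set in shared memory so neighbouring threads reuse data instead of re-reading DRAM, and its input pointer is declared const __restrict__ so reads can be served by the read-only cache.

#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK  256
#define RADIUS 1

// 3-point moving average. Each block stages its tile plus a halo in shared
// memory (reused by neighbouring threads within the block), and the
// const __restrict__ input can be routed through the read-only cache.
// For simplicity this sketch assumes n is a multiple of BLOCK.
__global__ void smooth(const float * __restrict__ in, float * __restrict__ out, int n)
{
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + RADIUS;                   // index within the shared tile

    tile[l] = in[g];
    if (threadIdx.x < RADIUS) {                     // load the halo, clamped at the ends
        tile[l - RADIUS] = in[max(g - RADIUS, 0)];
        tile[l + BLOCK]  = in[min(g + BLOCK, n - 1)];
    }
    __syncthreads();                                // lightweight block-level synchronization

    out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}

int main()
{
    const int n = 1 << 20;                          // a multiple of BLOCK
    size_t bytes = n * sizeof(float);

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemset(d_in, 0, bytes);                     // real input would be copied from the host

    smooth<<<n / BLOCK, BLOCK>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}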