INTRODUCTION TO HETEROGENEOUS/HYBRID COMPUTING
François Courteille | Senior Solutions Architect, NVIDIA | [email protected]

AGENDA
1. Introduction: Heterogeneous Computing & GPUs
2. CPU versus GPU architecture
3. Accelerated computing roadmap 2015

Introduction: Heterogeneous Computing & GPUs
RACING TOWARD EXASCALE
ACCELERATORS SURGE IN WORLD’S TOP SUPERCOMPUTERS
[Chart: Top500 count of accelerated supercomputers, 2013–2015]
- 100+ accelerated systems now on the Top500 list
- 1/3 of total FLOPS powered by accelerators
- NVIDIA Tesla GPUs sweep 23 of 24 new accelerated supercomputers
- Tesla supercomputers growing at 50% CAGR over the past five years

NEXT-GEN SUPERCOMPUTERS ARE GPU-ACCELERATED
- SUMMIT and SIERRA, U.S. Dept. of Energy: pre-exascale supercomputers for science
- NOAA: new supercomputer for next-gen weather forecasting
- IBM Watson: breakthrough natural language processing for cognitive computing

WHAT IS HETEROGENEOUS COMPUTING?
[Diagram: Application execution splits between the CPU (high serial performance) and the GPU (high data parallelism); CPU + GPU work together]

GAMING | DESIGN | ENTERPRISE VIRTUALIZATION | HPC & CLOUD SERVICE PROVIDERS | AUTONOMOUS MACHINES
PC | DATA CENTER | MOBILE
The World Leader in Visual Computing

Evolution of GPUs
[Chart: GPU transistor counts over time: RIVA 128 (3M xtors), GeForce 256 (23M), GeForce 3 (60M), GeForce FX (250M), GeForce 8800 (681M), “Kepler” (7B); timeline 1995–2012, from fixed function, to programmable shaders, to CUDA]

Performance Lead Continues to Grow
[Charts: peak double-precision GFLOPS and peak memory bandwidth (GB/s), 2008–2014. NVIDIA GPU line: M1060, M2090, K20, K40, K80, reaching roughly 3,000 GFLOPS and 500 GB/s; x86 CPU line: Westmere, Sandy Bridge, Ivy Bridge, Haswell, far below]

GPU Motivation: Peak Flops & Memory BW
Add GPUs: Accelerate Applications
GPUs enable faster deployment of improved algorithms; schedule pull-in due to GPUs.
[Timeline: seismic imaging algorithms moving from CPUs to GPUs: KPrSTM, KPrSDM, CAZ, Gaussian Beam, Shot WEM (VTI), WEM (TTI), Wave Equation, Reverse Time Migration, Elastic Waveform Inversion]

ACCELERATING DISCOVERIES
USING A SUPERCOMPUTER POWERED BY 3,000 TESLA PROCESSORS, UNIVERSITY OF ILLINOIS SCIENTISTS PERFORMED THE FIRST ALL-ATOM SIMULATION OF THE HIV VIRUS AND DISCOVERED THE CHEMICAL STRUCTURE OF ITS CAPSID — “THE PERFECT TARGET FOR FIGHTING THE INFECTION.”
WITHOUT GPU, THE SUPERCOMPUTER WOULD NEED TO BE 5X LARGER FOR SIMILAR PERFORMANCE.

ACCELERATING INSIGHTS
“Now You Can Build Google’s $1M Artificial Brain on the Cheap”
- Google Datacenter: 1,000 CPU servers (2,000 CPUs, 16,000 cores), 600 kW, $5,000,000
- Stanford AI Lab: 3 GPU-accelerated servers (12 GPUs, 18,432 cores), 4 kW, $33,000
Source: “Deep learning with COTS HPC systems,” A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro, ICML 2013

Power is the Problem
120 petaflops | 376 megawatts: enough to power all of San Francisco.

CSCS Piz Daint: Top 10 in Top500 and Green500

Top500 (TFLOP/s):
 1   33,862.7   National Super Computer Centre, Guangzhou
 2   17,590.0   Oak Ridge National Lab (#1 USA)
 3   17,173.2   DOE, United States
 4   10,510.0   RIKEN Advanced Institute for Computational Science
 5    8,586.6   Argonne National Lab
 6    6,271.0   Swiss National Supercomputing Centre (CSCS) (#1 Europe)
 7    5,168.1   University of Texas
 8    5,008.9   Forschungszentrum Juelich
 9    4,293.3   DOE, United States
10    3,143.5   Government

Green500 (MFLOPS/W):
 1   4,389.82   GSIC Center, Tokyo Tech
 2   3,631.70   Cambridge University
 3   3,517.84   University of Tsukuba
 4   3,459.46   SURFsara
 5   3,185.91   Swiss National Supercomputing Centre (CSCS)
 6   3,131.06   ROMEO HPC Center
 7   3,019.72   CSIRO
 8   2,951.95   GSIC Center, Tokyo Tech
 9   2,813.14   Eni
10   2,629.10   (Financial Institution)
16   2,495.12   Mississippi State (top non-NVIDIA)
59   1,226.60   ICHEC (top x86 cluster)

Piz Daint: 5,272 nodes, each with 1x Xeon E5-2670 (SB) CPU and 1x NVIDIA K20X GPU; Cray XC30 at 2.0 MW.

CUDA: World’s Most Pervasive Parallel Programming Model
- 700+ university courses in 62 countries
- 14,000 institutions with CUDA developers
- 2,000,000 CUDA downloads
- 487,000,000 CUDA GPUs shipped
- 370 GPU-accelerated applications (www.nvidia.com/appscatalog)

70% OF TOP HPC APPS ACCELERATED
INTERSECT360 SURVEY OF TOP APPS
Top 10 HPC apps: 90% accelerated. Top 50 HPC apps: 70% accelerated.
Top 25 apps in survey: GROMACS, LAMMPS, SIMULIA Abaqus, NWChem, NAMD, LS-DYNA, AMBER, Schrodinger, ANSYS Mechanical, Exelis IDL, Gaussian, MSC NASTRAN, GAMESS, ANSYS Fluent, ANSYS CFX, WRF, Star-CD, VASP, CCSM, OpenFOAM, COMSOL, CHARMM, Star-CCM+, Quantum Espresso, BLAST
(Survey legend: all popular functions accelerated / some popular functions accelerated / in development / not supported.)
Source: Intersect360, “HPC Application Support for GPU Computing,” Nov 2015

CORAL: Built for Grand Scientific Challenges
- Fusion Energy: role of material disorder, statistics, and fluctuations in nanoscale materials and systems
- Climate Change: study climate change adaptation and mitigation scenarios; realistically represent detailed features
- Biofuels: search for renewable and more efficient energy sources
- Astrophysics: radiation transport, critical to astrophysics, laser fusion, atmospheric dynamics, and medical imaging
- Combustion: combustion simulations to enable next-gen diesel/biofuels to burn more efficiently
- Nuclear Energy: unprecedented high-fidelity radiation transport calculations for nuclear energy applications

CPU versus GPU architecture
Low Latency or High Throughput?
CPU:
- Optimised for low-latency access to cached data sets
- Control logic for out-of-order and speculative execution
- 10s of threads
GPU:
- Optimised for data-parallel, throughput computation
- Architecture tolerant of memory latency
- Massive fine-grained threaded parallelism
- More transistors dedicated to computation
- 10,000s of threads
GPU Architecture: Two Main Components
Global memory:
- Analogous to RAM in a CPU server
- Accessible by both GPU and CPU
- Currently up to 12 GB per GPU
- Bandwidth currently up to ~288 GB/s (Tesla products)
- ECC on/off (Quadro and Tesla products)
Streaming Multiprocessors (SMs):
- Perform the actual computations
- Each SM has its own control units, registers, execution pipelines, and caches
[Block diagram: SMs, L2 cache, GigaThread engine, host interface, DRAM interfaces]
GPU Architecture
[Diagram: SM-0, SM-1, …, SM-N share an L2 cache and GPU DRAM; the GPU connects to system memory]

GPU Memory Hierarchy
[Diagram: each SM (SM-0 … SM-N) has its own registers and L1 cache/shared memory (~1 TB/s); all SMs share an L2 cache (~150 GB/s) in front of global memory]

Scientific Computing Challenge: Memory Bandwidth
NVIDIA Technology Solves Memory Bandwidth Challenges
[Diagram: on-GPU bandwidth hierarchy: registers ~10.8 TB/s; shared memory ~1.3 TB/s; L2 cache; GPU memory 177–280 GB/s; PCI-Express to the host ~6.4 GB/s]

Kepler GK110 Block Diagram
Architecture: 7.1B transistors, 15 SMX units, >1 TFLOP FP64, 1.5 MB L2 cache, 384-bit GDDR5

GPU SM Architecture: Kepler SM
Functional units = CUDA cores
- 192 SP FP operations/clock
- 64 DP FP operations/clock
- Register file (256 KB)
- Shared memory (16–48 KB)
- L1 cache (16–48 KB)
- Read-only cache (48 KB)
- Constant cache (8 KB)

SIMT Execution Model
Thread: sequential execution unit
- All threads execute the same sequential program
- Threads execute in parallel
Thread block: a group of threads
- Threads within a block can cooperate: lightweight synchronization, data exchange
Grid: a collection of thread blocks
- Thread blocks do not synchronize with each other
- Communication between blocks is expensive
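As an illustrative sketch of the cooperation described above (kernel name and tile size are assumptions, not from the slides): threads in one block exchange data through shared memory, separated by the lightweight __syncthreads() barrier, while no block ever waits on another block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block reverses its own 64-element tile in place. Threads cooperate
// through shared memory; __syncthreads() is the lightweight intra-block
// barrier. Blocks never synchronize with each other here.
__global__ void reverse_tiles(float *data)
{
    __shared__ float tile[64];                  // visible to the whole block
    int local  = threadIdx.x;                   // position within the block
    int global = blockIdx.x * blockDim.x + local;

    tile[local] = data[global];                 // stage the tile
    __syncthreads();                            // wait until every thread wrote
    data[global] = tile[63 - local];            // read another thread's value
}

int main()
{
    const int n = 256;                          // 4 blocks of 64 threads
    float *d;
    cudaMallocManaged(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) d[i] = (float)i;

    reverse_tiles<<<n / 64, 64>>>(d);
    cudaDeviceSynchronize();

    printf("%f %f\n", d[0], d[63]);             // tile 0 reversed: d[0]==63, d[63]==0
    cudaFree(d);
    return 0;
}
```

Within a block this barrier is cheap; there is no comparably cheap grid-wide barrier, which is why inter-block communication is expensive.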
Software-to-hardware mapping:
- Threads are executed by CUDA cores
- Thread blocks are executed on multiprocessors (SMs)
  - Thread blocks do not migrate
  - Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)
- A kernel is launched as a grid of thread blocks; the grid executes on the device
Threads are organized into groups of 32 threads called “warps”. All threads within a warp execute the same instruction simultaneously.
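To ground the thread hierarchy and the 32-thread warp in code, here is a minimal SAXPY kernel; a sketch that assumes a CUDA-capable device and uses unified memory purely for brevity (names are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// SAXPY: each thread handles exactly one element. The hardware issues the
// same instruction across each 32-thread warp; the programmer only writes
// per-thread code, so warps stay implicit.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread id
    // threadIdx.x / 32 would give this thread's warp within its block.
    if (i < n)                                      // guard the tail of the grid
        y[i] = a * x[i] + y[i];
}

int main()
{
    int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));       // unified memory for brevity
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int block = 256;                                // 8 warps per block
    int grid  = (n + block - 1) / block;            // enough blocks to cover n
    saxpy<<<grid, block>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);                    // 2*1 + 2 = 4
    cudaFree(x); cudaFree(y);
    return 0;
}
```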
Simple Processing Flow (over the PCI bus)
1. Copy input data from CPU memory/NIC to GPU memory
2. Load GPU program and execute
3. Copy results from GPU memory to CPU memory/NIC

Accelerator Fundamentals
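The three steps map one-to-one onto CUDA runtime calls. A hedged sketch: the array size and the doubling kernel are placeholders for a real workload.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;          // stand-in for the real computation
}

int main()
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);   // CPU (host) buffer
    for (int i = 0; i < n; ++i) h[i] = (float)i;
    float *d;
    cudaMalloc(&d, bytes);               // GPU (device) buffer

    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // 1. copy input over PCIe
    process<<<(n + 255) / 256, 256>>>(d, n);           // 2. load program and execute
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // 3. copy results back

    printf("%f\n", h[1]);                // each element was doubled in place
    cudaFree(d);
    free(h);
    return 0;
}
```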
We must expose enough parallelism to saturate the device:
- Accelerator threads are slower than CPU threads
- Accelerators have orders of magnitude more threads
- Fine-grained parallelism is good; coarse-grained parallelism is bad
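The fine- versus coarse-grained point can be made concrete with two illustrative kernels (sketches, not from the slides): the first exposes one thread per element, the second mimics a CPU-style division of work among a few threads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Fine-grained: one thread per element. A million in-flight threads keep
// every SM busy and hide DRAM latency.
__global__ void scale_fine(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Coarse-grained: a few threads each march through a large share of the
// data, the way a multicore CPU code would. Almost the whole GPU sits idle.
__global__ void scale_coarse(float *x, int n, int nthreads)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = t; i < n; i += nthreads)
        x[i] *= 2.0f;
}

int main()
{
    int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    scale_fine<<<(n + 255) / 256, 256>>>(x, n);   // ~4096 blocks, 1M threads
    scale_coarse<<<1, 4>>>(x, n, 4);              // 4 threads total: a bad fit
    cudaDeviceSynchronize();

    printf("%f\n", x[0]);                         // 1 * 2 * 2 = 4
    cudaFree(x);
    return 0;
}
```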
[Diagram: fine-grained mapping assigns one thread per element (t0–t15, each handling one element); coarse-grained mapping assigns a few threads (t0–t3), each handling four elements]

Best Practices: Optimize Data Locality: GPU
- Minimize data transfers between CPU and GPU
[Diagram: system memory and GPU memory linked by the PCIe bus]

Best Practices: Optimize Data Locality: SM
- Minimize redundant accesses to L2 and DRAM
- Store intermediate results in registers instead of global memory
- Use shared memory for data frequently used within a thread block
- Use const __restrict__ to take advantage of the read-only cache
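Several of these practices combine naturally in a classic tiled 1-D stencil; the following is a sketch (the stencil itself is an assumed example workload) showing shared-memory staging, a register accumulator, and const __restrict__ for the read-only cache.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define RADIUS 3
#define BLOCK  256

// Tiled 1-D stencil. Shared memory stages each tile (plus halos) once, so
// neighbours are not re-fetched from L2/DRAM; the accumulator lives in a
// register; const __restrict__ marks the input read-only so loads can go
// through the read-only cache.
__global__ void stencil_1d(const float * __restrict__ in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int g = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int l = threadIdx.x + RADIUS;                    // index inside the tile

    tile[l] = (g < n) ? in[g] : 0.0f;                // bulk of the tile
    if (threadIdx.x < RADIUS) {                      // first RADIUS threads load halos
        int left  = g - RADIUS;
        int right = g + BLOCK;
        tile[l - RADIUS] = (left  >= 0) ? in[left]  : 0.0f;
        tile[l + BLOCK]  = (right <  n) ? in[right] : 0.0f;
    }
    __syncthreads();                                 // tile fully populated

    float acc = 0.0f;                                // register, not global memory
    for (int k = -RADIUS; k <= RADIUS; ++k)
        acc += tile[l + k];
    if (g < n) out[g] = acc;                         // one write to DRAM per thread
}

int main()
{
    int n = 1 << 16;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    stencil_1d<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("%f\n", out[n / 2]);                      // interior point: 7 ones summed
    cudaFree(in); cudaFree(out);
    return 0;
}
```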
3 Ways to Accelerate Applications
[Diagram: three approaches for accelerating applications]
- Libraries: easy to use, most performance
- Compiler directives: easy to use, portable code
- Programming languages: most performance, most flexibility

Resources: Learn more about GPUs
- CUDA resource center: http://docs.nvidia.com/cuda
- GTC on-demand and webinars: http://on-demand-gtc.gputechconf.com and http://www.gputechconf.com/gtc-webinars
- Parallel Forall blog: http://devblogs.nvidia.com/parallelforall
- Self-paced labs: http://nvlabs.qwiklab.com
Accelerated Computing Roadmap

GPU ARCHITECTURE ROADMAP

TESLA ACCELERATES DISCOVERY AND INSIGHT
SIMULATION | MACHINE LEARNING | VISUALIZATION

TESLA ACCELERATED COMPUTING: A NEW PLATFORM FOR NEW WORKLOADS
VIDEO TRANSCODING | DATA ANALYTICS | MEDIA PROCESSING | MACHINE LEARNING
http://blogs.parc.com/blog/2015/11/the-new-kid-on-the-block-gpu-accelerated-big-data-analytics/

TESLA ACCELERATOR LINE-UP FOR 2015
[Table: 2015 Tesla accelerator line-up]
- Best in class (K40 → K80): seismic, data analytics, HPC labs, defense; multi-GPU accelerated apps performance; single- and double-precision workloads (K20X → K40)
- Mid-range (K20): higher ed, data analytics, HPC labs, defense; double-precision workloads
- K10
TESLA KEPLER GPU PRODUCT FAMILY
- Tesla K80: 24 GB, 480 GB/s
- Tesla K40: 12 GB, 288 GB/s

5x Faster AMBER Performance
[Chart: simulation time in days, dual-CPU server vs. Tesla K80 server; simulation time drops from 1 month to 1 week]

TESLA K80: World’s Fastest Accelerator for HPC
- CUDA cores: 2496
- Peak DP: 1.9 TFLOPS (2.9 TFLOPS with boost)
- GDDR5 memory: 24 GB
- Bandwidth: 480 GB/s
- Power: 300 W
AMBER benchmark: PME-JAC-NVE simulation for 1 microsecond. CPU: E5-2698 v3 @ 2.3 GHz, 64 GB system memory, CentOS 6.2.

TESLA K80 BOOSTS DATA CENTER THROUGHPUT
TESLA K80: 5X FASTER
[Chart: speed-up vs. dual-CPU server for QMCPACK, LAMMPS, CHROMA, NAMD, and AMBER; K80 delivers up to ~15x]

1/3 OF NODES ACCELERATED, 2X SYSTEM THROUGHPUT
[Chart: CPU-only system, 100 jobs per day; accelerated system, 220 jobs per day]
CPU: dual E5-2698 v3 @ 2.3 GHz, 64 GB system memory, CentOS 6.2. GPU: single Tesla K80, boost enabled.

QUESTIONS?