INTRODUCTION TO HETEROGENEOUS/HYBRID COMPUTING

François Courteille | Senior Solutions Architect | [email protected]

AGENDA
1. Introduction: Heterogeneous Computing & GPUs
2. CPU versus GPU architecture
3. Accelerated computing roadmap 2015

Introduction: Heterogeneous Computing & GPUs

RACING TOWARD EXASCALE

ACCELERATORS SURGE IN WORLD’S TOP SUPERCOMPUTERS

- 100+ accelerated systems now on the Top500 list
- 1/3 of total FLOPS powered by accelerators
- GPUs sweep 23 of 24 new accelerated supercomputers
- Tesla supercomputers growing at 50% CAGR over the past five years

[Chart: number of accelerated supercomputers on the Top500 list, 2013-2015]

NEXT-GEN SUPERCOMPUTERS ARE GPU-ACCELERATED

- SUMMIT and SIERRA (U.S. Dept. of Energy): pre-exascale supercomputers for science
- NOAA: new supercomputer for next-gen weather forecasting
- IBM Watson: breakthrough natural language processing for cognitive computing

WHAT IS HETEROGENEOUS COMPUTING?

Application Execution

- CPU: high serial performance
- GPU: high data parallelism
- CPU + GPU: each part of the application runs where it fits best

The World Leader in Visual Computing
- Markets: GAMING | ENTERPRISE (DESIGN, VIRTUALIZATION) | HPC & CLOUD (SERVICE PROVIDERS) | AUTONOMOUS MACHINES
- Platforms: PC | DATA CENTER | MOBILE

Evolution of GPUs
From fixed function, to programmable, to CUDA (timeline 1995-2012):
- RIVA 128: 3M transistors
- GeForce 256: 23M transistors
- GeForce 3: 60M transistors
- GeForce FX: 250M transistors
- GeForce 8800: 681M transistors
- “Kepler”: 7B transistors

Performance Lead Continues to Grow

[Charts: peak double-precision GFLOPS and peak memory bandwidth (GB/s), 2008-2014; NVIDIA GPUs (M1060, M2090, K20, K40, K80) versus x86 CPUs (Westmere, Sandy Bridge, Ivy Bridge, Haswell)]

GPU Motivation: Peak Flops & Memory BW

Add GPUs: Accelerate Applications

GPUs Enable Faster Deployment of Improved Algorithms
- Schedule pull-in due to GPUs across seismic imaging algorithms: Elastic Waveform Inversion, Reverse Time Migration, Wave Equation, KPrSDM, Shot WEM (TTI), Gaussian Beam, KPrSTM, CAZ WEM (VTI)

ACCELERATING DISCOVERIES

USING A SUPERCOMPUTER POWERED BY 3,000 TESLA PROCESSORS, UNIVERSITY OF ILLINOIS SCIENTISTS PERFORMED THE FIRST ALL-ATOM SIMULATION OF THE HIV VIRUS AND DISCOVERED THE CHEMICAL STRUCTURE OF ITS CAPSID — “THE PERFECT TARGET FOR FIGHTING THE INFECTION.”

WITHOUT GPU, THE SUPERCOMPUTER WOULD NEED TO BE 5X LARGER FOR SIMILAR PERFORMANCE.

ACCELERATING INSIGHTS

“Now You Can Build Google’s $1M Artificial Brain on the Cheap”

                 Google Datacenter            Stanford AI Lab
Servers          1,000 CPU servers            3 GPU-accelerated servers
Processors       2,000 CPUs / 16,000 cores    12 GPUs / 18,432 cores
Power            600 kW                       4 kW
Cost             $5,000,000                   $33,000

“Deep learning with COTS HPC systems”, A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro, ICML 2013

Power is the Problem

120 petaflops | 376 megawatts (enough to power all of San Francisco)

CSCS Piz Daint: Top 10 in Top500 and Green500

Top500
Rank  TFLOP/s   Site
1     33,862.7  National Super Computer Centre, Guangzhou
2     17,590.0  Oak Ridge National Lab (#1 USA)
3     17,173.2  DOE, United States
4     10,510.0  RIKEN Advanced Institute for Computational Science
5      8,586.6  Argonne National Lab
6      6,271.0  Swiss National Supercomputing Centre (CSCS) (#1 Europe)
7      5,168.1  University of Texas
8      5,008.9  Forschungszentrum Juelich
9      4,293.3  DOE, United States
10     3,143.5  Government

Green500
Rank  MFLOPS/W  Site
1     4,389.82  GSIC Center, Tokyo Tech
2     3,631.70  Cambridge University
3     3,517.84  University of Tsukuba
4     3,459.46  SURFsara
5     3,185.91  Swiss National Supercomputing Centre (CSCS)
6     3,131.06  ROMEO HPC Center
7     3,019.72  CSIRO
8     2,951.95  GSIC Center, Tokyo Tech
9     2,813.14  Eni
10    2,629.10  (Financial Institution)
16    2,495.12  Mississippi State (top non-NVIDIA)
59    1,226.60  ICHEC (top x86 cluster)

Piz Daint: 5,272 nodes, each with 1x Xeon E5-2670 (SB) CPU and 1x NVIDIA K20X GPU; Cray XC30 at 2.0 MW

CUDA: World’s Most Pervasive Parallel Programming Model

- 14,000 institutions with CUDA developers
- 700+ university courses in 62 countries
- 2,000,000 CUDA downloads
- 487,000,000 CUDA GPUs shipped
- 370 GPU-accelerated applications (www.nvidia.com/appscatalog)

70% OF TOP HPC APPS ACCELERATED

INTERSECT360 SURVEY OF TOP APPS

Top 25 apps in survey: GROMACS, LAMMPS, SIMULIA Abaqus, NWChem, NAMD, LS-DYNA, AMBER, Schrodinger, ANSYS Mechanical, Exelis IDL, Gaussian, MSC NASTRAN, GAMESS, ANSYS Fluent, ANSYS CFX, WRF, Star-CD, VASP, CCSM, OpenFOAM, COMSOL, CHARMM, Star-CCM+, Quantum Espresso, BLAST

- Top 10 HPC apps: 90% accelerated
- Top 50 HPC apps: 70% accelerated

App status categories: all popular functions accelerated; some popular functions accelerated; in development; not supported.
Source: Intersect360, “HPC Application Support for GPU Computing”, Nov 2015

CORAL: Built for Grand Scientific Challenges

- Fusion Energy: role of material disorder, statistics, and fluctuations in nanoscale materials and systems
- Climate Change: study climate change adaptation and mitigation scenarios; realistically represent detailed features
- Biofuels: search for renewable and more efficient energy sources
- Astrophysics: radiation transport, critical to astrophysics, laser fusion, atmospheric dynamics, and medical imaging
- Combustion: combustion simulations to enable next-gen diesel and biofuels to burn more efficiently
- Nuclear Energy: unprecedented high-fidelity radiation transport calculations for nuclear energy applications

CPU versus GPU architecture

Low Latency or High Throughput?

CPU: optimised for low-latency access to cached data sets
- Control logic for out-of-order and speculative execution
- 10s of threads

GPU: optimised for data-parallel, throughput computation
- Architecture tolerant of memory latency
- Massive fine-grained threaded parallelism
- More transistors dedicated to computation
- 10,000s of threads

GPU Architecture: Two Main Components

Global memory
- Analogous to RAM in a CPU server
- Accessible by both GPU and CPU
- Currently up to 12 GB per GPU
- Bandwidth currently up to ~288 GB/s (Tesla products)
- ECC on/off (Tesla products)

Streaming Multiprocessors (SMs)
- Perform the actual computations
- Each SM has its own control units, registers, execution pipelines, and caches

[Block diagram: host interface, GigaThread engine, SMs, L2 cache, DRAM interfaces]

GPU Architecture

[Diagram: SM-0 through SM-N sharing an on-chip L2 cache, connected to GPU DRAM and, across the bus, to system memory]

GPU Memory Hierarchy

- Per SM: registers and L1 cache / shared memory (SMEM), ~1 TB/s
- Shared across SMs: L2 cache (~150 GB/s to global memory)
- Global memory (GPU DRAM)

Scientific Computing Challenge: Memory Bandwidth

NVIDIA Technology Solves Memory Bandwidth Challenges

Approximate bandwidths at each level (Kepler-generation Tesla):
- Registers: ~10.8 TB/s
- L2 cache: ~1.3 TB/s
- GPU memory (GDDR5): ~177-280 GB/s
- PCI-Express (host link): ~6.4 GB/s

Kepler GK110 Block Diagram

- 7.1B transistors
- 15 SMX units
- >1 TFLOP FP64
- 1.5 MB L2 cache
- 384-bit GDDR5

GPU SM Architecture: Kepler SM
- Functional units = CUDA cores
- 192 single-precision FP operations/clock
- 64 double-precision FP operations/clock
- Register file (256 KB)
- Shared memory (16-48 KB)
- L1 cache (16-48 KB)
- Read-only cache (48 KB)
- Constant cache (8 KB)

SIMT Execution Model

Thread: sequential execution unit
- All threads execute the same sequential program
- Threads execute in parallel

Thread block: a group of threads
- Threads within a block can cooperate: lightweight synchronization and data exchange

Grid: a collection of thread blocks
- Thread blocks do not synchronize with each other
- Communication between blocks is expensive

A minimal kernel illustrating this hierarchy is sketched below.
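To make the thread/block/grid hierarchy concrete, here is a minimal CUDA sketch (the kernel and the 256-thread launch configuration are illustrative, not from the original slides): each thread computes one element of a vector sum, deriving its global index from its block and thread coordinates.

```cuda
#include <cuda_runtime.h>

// Each thread handles exactly one element: its global index is
// built from the block index, block size, and thread index.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the grid may be larger than n
        c[i] = a[i] + b[i];
}

// Launch: a grid of enough 256-thread blocks to cover n elements.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```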

SIMT Execution Model: Software to Hardware
- Threads are executed by CUDA cores
- Thread blocks are executed on multiprocessors
- Thread blocks do not migrate
- Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)
- A kernel is launched as a grid of thread blocks

SIMT Execution Model

Threads are organized into groups of 32 threads called “warps”.
All threads within a warp execute the same instruction simultaneously; a sketch of the performance implication follows.
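Because a warp issues one instruction for all 32 of its threads, a data-dependent branch inside a warp forces the two paths to execute one after the other on Kepler-class hardware. A hedged sketch of the effect; the `branchy` kernel is illustrative, not from the slides.

```cuda
// Both sides of the branch are executed by the warp, with threads
// masked off on the path they did not take; full throughput returns
// only when the two paths reconverge.
__global__ void branchy(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 32 < 16)   // splits every warp: divergent
        x[i] *= 2.0f;
    else
        x[i] += 1.0f;
}
// Branching on a warp-uniform value (e.g. blockIdx.x % 2) does not
// diverge, because all 32 threads of a warp take the same path.
```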

Simple Processing Flow

1. Copy input data from CPU memory/NIC to GPU memory (over the PCI bus)
2. Load GPU program and execute
3. Copy results from GPU memory to CPU memory/NIC

A minimal host-side sketch of these three steps follows.
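A minimal sketch of the three steps using the standard CUDA runtime API; error checking is omitted for brevity, and the sizes and the `vecAdd` kernel (from the earlier sketch) are illustrative.

```cuda
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n); // earlier sketch

void run(const float *h_a, const float *h_b, float *h_c, int n)
{
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // 1. Copy input data from CPU memory to GPU memory
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 2. Load GPU program and execute
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    // 3. Copy results from GPU memory back to CPU memory
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
```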

Accelerator Fundamentals

- We must expose enough parallelism to saturate the device
- Accelerator threads are slower than CPU threads
- Accelerators have orders of magnitude more threads
- Fine-grained parallelism is good; coarse-grained parallelism is bad

[Diagram: fine-grained decomposition, one element per thread (t0-t15), versus coarse-grained decomposition, several elements per thread (t0-t3)]

A grid-stride loop (sketched below) keeps the decomposition fine-grained for any problem size.
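One common way to expose enough fine-grained parallelism while still covering arbitrarily large inputs is the grid-stride loop. This is a standard CUDA idiom rather than something taken from the slides; the `scale` kernel is illustrative.

```cuda
// Fine-grained: every thread starts at its own global index and
// strides by the total number of threads in the grid, so one launch
// configuration can saturate the device for any problem size.
__global__ void scale(float *x, float s, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] *= s;
}
```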

Best Practices: Optimize Data Locality (GPU)

- Minimize data transfers between CPU and GPU (system memory to GPU memory)

A sketch of keeping data resident on the device across kernel launches follows.
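In practice this means chaining kernels on device-resident data and copying back only the final result. A hedged sketch; the `step1`/`step2` pipeline stages are hypothetical placeholders for real kernels.

```cuda
#include <cuda_runtime.h>

__global__ void step1(float *x, int n);   // hypothetical pipeline stages
__global__ void step2(float *x, int n);

void pipeline(float *h_x, int n)
{
    size_t bytes = n * sizeof(float);
    float *d_x;
    cudaMalloc(&d_x, bytes);

    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice); // upload once
    int blocks = (n + 255) / 256;
    step1<<<blocks, 256>>>(d_x, n);   // intermediate result stays on the GPU
    step2<<<blocks, 256>>>(d_x, n);   // consumed without a host round trip
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost); // download once

    cudaFree(d_x);
}
// Copying d_x back to the host after step1 and re-uploading it before
// step2 would put the ~6 GB/s PCIe link on the critical path.
```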

Best Practices: Optimize Data Locality (SM)

- Minimize redundant accesses to L2 and DRAM
- Store intermediate results in registers instead of global memory
- Use shared memory for data frequently used within a thread block
- Use const __restrict__ to take advantage of the read-only cache

A sketch combining these recommendations follows.
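A hedged sketch combining the recommendations: a 1-D stencil that stages a tile of its input in shared memory, accumulates in a register, and qualifies the input with `const __restrict__` so its loads are eligible for the read-only cache. The kernel itself is illustrative, not from the slides.

```cuda
#define RADIUS 3
#define BLOCK  256

__global__ void stencil1d(const float * __restrict__ in,
                          float * __restrict__ out, int n)
{
    __shared__ float tile[BLOCK + 2 * RADIUS];       // per-block scratchpad
    int g = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int l = threadIdx.x + RADIUS;                    // local index in tile

    tile[l] = (g < n) ? in[g] : 0.0f;                // each thread loads one
    if (threadIdx.x < RADIUS) {                      // edge threads load halos
        tile[l - RADIUS] = (g >= RADIUS) ? in[g - RADIUS] : 0.0f;
        int r = g + BLOCK;
        tile[l + BLOCK] = (r < n) ? in[r] : 0.0f;
    }
    __syncthreads();                                 // tile is now complete

    if (g < n) {
        float acc = 0.0f;                            // accumulate in a register
        for (int k = -RADIUS; k <= RADIUS; ++k)
            acc += tile[l + k];                      // reuse from shared memory
        out[g] = acc;                                // one global store per thread
    }
}
```

Each input element is fetched from DRAM once per block instead of 2*RADIUS+1 times, which is exactly the redundant-access reduction the slide recommends.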

3 Ways to Accelerate Applications

- Libraries: easy to use, most performance
- Compiler directives: easy to use, portable code
- Programming languages: most performance, most flexibility

A library-path sketch using cuBLAS follows.
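For the libraries path, a drop-in call can replace a hand-written kernel entirely. A hedged sketch using cuBLAS SAXPY (y = alpha*x + y) on device arrays; compile with -lcublas. The wrapper function is illustrative.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// y = alpha * x + y on the GPU, with no user-written kernel:
// the tuned implementation comes from the cuBLAS library.
void saxpy(int n, float alpha, const float *d_x, float *d_y)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // library kernel launch
    cublasDestroy(handle);
}
```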

Resources: Learn More About GPUs
- CUDA resource center: http://docs.nvidia.com/cuda
- GTC on-demand and webinars: http://on-demand-gtc.gputechconf.com | http://www.gputechconf.com/gtc-webinars
- Parallel Forall blog: http://devblogs.nvidia.com/parallelforall
- Self-paced labs: http://nvlabs.qwiklab.com

Accelerated Computing Roadmap

GPU ARCHITECTURE ROADMAP

TESLA ACCELERATES DISCOVERY AND INSIGHT

SIMULATION | MACHINE LEARNING | VISUALIZATION

TESLA ACCELERATED COMPUTING: A NEW PLATFORM FOR NEW WORKLOADS
http://blogs.parc.com/blog/2015/11/the-new-kid-on-the-block-gpu-accelerated-big-data-analytics/

New workloads: VIDEO TRANSCODING | DATA ANALYTICS | MEDIA PROCESSING | MACHINE LEARNING

TESLA ACCELERATOR LINE-UP FOR 2015

2015 line-up:
- Best in class (Seismic, Data Analytics, HPC Labs, Defense), multi-GPU accelerated apps performance: K40 to K80
- Single and double precision workloads: K20X to K40
- Mid-range (Higher Ed, Data Analytics, HPC Labs, Defense), double precision workloads: K20
- K10

TESLA KEPLER GPU PRODUCT FAMILY

- TESLA K80: 24 GB, 480 GB/s
- TESLA K40: 12 GB, 288 GB/s

5x Faster AMBER Performance

TESLA K80: simulation time cut from 1 month (dual-CPU server) to 1 week.

[Chart: days to complete simulation, dual-CPU server vs. Tesla K80 server, 0-30 days]

Tesla K80: World’s Fastest Accelerator for HPC
- CUDA cores: 2496
- Peak DP: 1.9 TFLOPS
- Peak DP with boost: 2.9 TFLOPS
- GDDR5 memory: 24 GB
- Bandwidth: 480 GB/s
- Power: 300 W

AMBER benchmark: PME-JAC-NVE simulation for 1 microsecond. CPU: E5-2698 v3 @ 2.3 GHz, 64 GB system memory, CentOS 6.2.

TESLA K80 BOOSTS DATA CENTER THROUGHPUT

TESLA K80: 5X FASTER
- With 1/3 of nodes accelerated: 2x system throughput (100 jobs per day CPU-only vs. 220 jobs per day accelerated)

[Chart: speed-up vs. dual-CPU node, 0x-15x, for QMCPACK, LAMMPS, CHROMA, NAMD, AMBER]

CPU: dual E5-2698 v3 @ 2.3 GHz (3.6 GHz turbo), 64 GB system memory, CentOS 6.2. GPU: single Tesla K80, boost enabled.

QUESTIONS?