INTRODUCTION TO PARALLEL AND GPU

Carlo Nardone, Sr. Solution Architect, EMEA

AGENDA
1 Intro
2 Processors Trends
3 Multiprocessors
4 GPU Computing

2 , ILLUSTRATED

3 FROM MULTICORE TO MANYCORE

4 PARALLEL COMPUTING IS OLD

Source: Tim Mattson

6 MOORE’S LAW

Source: U.Delaware CISC879 / J.Dongarra

7 FREQUENCY TREND

source: CA5 Hennessy/Patterson

8 THE POWER WALL

9 PERFORMANCE TREND

source: CA5 Hennessy/Patterson

11 THE PROCESSOR-MEMORY GAP

source: CA5 Hennessy & Patterson

13 ARITHMETIC INTENSITY

Arithmetic intensity is the number of floating-point operations needed to run the program divided by the number of bytes accessed in main memory [Williams et al. 2009]. Some kernels, such as dense matrix operations, have an arithmetic intensity that scales with problem size, but many kernels have an arithmetic intensity that is independent of problem size.

Source: CA5 Hennessy & Patterson
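The definition above can be made concrete with a small sketch. It compares a vector update (SAXPY, y = a*x + y) with a dense matrix multiply. The function names, the 4-byte word size, and the idealised assumption that every operand crosses the memory bus exactly once are illustrative choices, not taken from the slides.

```python
def saxpy_intensity(n: int, bytes_per_word: int = 4) -> float:
    """FLOPs per byte for y = a*x + y on n-element vectors."""
    flops = 2 * n                      # one multiply and one add per element
    traffic = 3 * n * bytes_per_word   # read x, read y, write y
    return flops / traffic


def matmul_intensity(n: int, bytes_per_word: int = 4) -> float:
    """FLOPs per byte for C = A @ B on n x n matrices.

    Assumption: each matrix is moved to or from main memory exactly once
    (perfect on-chip reuse), which is an upper bound on real behaviour.
    """
    flops = 2 * n ** 3                     # n^3 multiply-add pairs
    traffic = 3 * n * n * bytes_per_word   # read A, read B, write C
    return flops / traffic


if __name__ == "__main__":
    for n in (256, 1024, 4096):
        print(n, round(saxpy_intensity(n), 3), round(matmul_intensity(n), 1))
```

Under these assumptions SAXPY stays at 1/6 FLOP per byte for any n, while the matrix multiply grows as n/6, which is the contrast the caption describes.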

14 BANDWIDTH AND THE ROOFLINE MODEL

Source: HPCA5
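As a rough sketch of the roofline idea, attainable performance is the smaller of the compute peak and the product of memory bandwidth and arithmetic intensity. The machine figures below (1000 GFLOP/s peak, 200 GB/s bandwidth) are hypothetical values chosen for round numbers, not measurements of any real processor.

```python
def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline bound: min(compute roof, bandwidth slope * intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_gflops / bandwidth_gbs


if __name__ == "__main__":
    PEAK, BW = 1000.0, 200.0   # hypothetical machine
    print("ridge point:", ridge_point(PEAK, BW), "FLOP/byte")
    for ai in (0.25, 1.0, 5.0, 20.0):
        bound = attainable_gflops(ai, PEAK, BW)
        regime = "memory-bound" if ai < ridge_point(PEAK, BW) else "compute-bound"
        print(ai, bound, regime)
```

Kernels to the left of the ridge point are limited by bandwidth, so raising the compute peak alone does not help them.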

16 FLYNN’S TAXONOMY: SISD

source: CA5 Hennessy & Patterson

17 FLYNN’S TAXONOMY: SIMD

source: CA5 Hennessy & Patterson

18 FLYNN’S TAXONOMY: SIMD (2)

source: CA5 Hennessy & Patterson

19 FLYNN’S TAXONOMY

source: CA5 Hennessy & Patterson

20 FLYNN’S TAXONOMY: MISD (?)

source: CA5 Hennessy & Patterson

21 FLYNN’S TAXONOMY: MIMD

source: CA5 Hennessy & Patterson

22 MULTIPROCESSORS

source: CA5 Hennessy & Patterson

23 MULTIPROCESSOR SMP CLASS

source: CA4 Hennessy & Patterson

24 MULTIPROCESSOR DISTRIBUTED MEM CLASS

source: CA4 Hennessy & Patterson

25 ARCHITECTURE Intel Nehalem

27 28 LOW LATENCY OR HIGH THROUGHPUT?

CPU
• Optimised for low-latency access to cached data sets
• Control logic for out-of-order and speculative execution

GPU
• Optimised for data-parallel, throughput computation
• Architecture tolerant of memory latency
• More transistors dedicated to computation

29 30 GPU UNIFIED DESIGN

31 GPGPU COMPUTING, B.C. ERA

32 WHY ARE GPUS ATTRACTIVE FOR HPC?

• Massively multithreaded manycore chips
• High FLOPS count (SP and later DP)
• High memory bandwidth, (later) including ECC
• Lots of programming languages
• Lots of programming tools

• Widely used in Computational Physics
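The "massively multithreaded" point can be illustrated with a pure-Python emulation of the data-parallel style, in which each logical thread handles one array element. This is a serial stand-in, not GPU code: the names `saxpy_kernel` and `launch`, and the block size of 4, are hypothetical and only mimic the block/thread indexing scheme.

```python
def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    """Work done by one logical thread: a single element of a*x + y."""
    i = block_idx * block_dim + thread_idx   # global element index
    if i < len(x):                           # guard: the last block may be partly idle
        out[i] = a * x[i] + y[i]


def launch(kernel, n, block_dim, *args):
    """Serial stand-in for a launch: visit every (block, thread) pair."""
    grid_dim = (n + block_dim - 1) // block_dim   # ceil(n / block_dim) blocks
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)
    return grid_dim


if __name__ == "__main__":
    n = 10
    x = [float(i) for i in range(n)]
    y = [1.0] * n
    out = [0.0] * n
    blocks = launch(saxpy_kernel, n, 4, 2.0, x, y, out)
    print(blocks, out)
```

On a real GPU the two loops in `launch` disappear: the hardware runs the kernel body for many indices concurrently, which is where the throughput comes from.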

33 GPGPU COMPUTING, A.C. ERA

(Fermi)

34 GROWTH IN GPU COMPUTING

                            2008       2015
CUDA Downloads              150,000    3 Million
CUDA Apps                   27         319
Universities Teaching       60         800
Academic Papers             4,000      60,000
Tesla GPUs                  6,000      450,000
Supercomputing Teraflops    77         54,000

36 ACCELERATED, 10x Performance & 5x Energy Efficiency for HPC

GPU Accelerator: Optimized for Parallel Tasks
CPU: Optimized for Serial Tasks

37 GEFORCE 8: 1ST MODERN GPU ARCH

38 NVIDIA KEPLER GK110 PROCESSOR

40 41 KEPLER SMX DIAGRAM

42 GPUs Accelerate Science

Speedup   Application            Institution
146X      Medical Imaging        U of Utah
36X       Molecular Dynamics     U of Illinois, Urbana
18X       Video Transcoding      Elemental Tech
50X       Matlab Computing       AccelerEyes
100X      Astrophysics           RIKEN
149X      Financial Simulation   Oxford
47X       Linear Algebra         Universidad Jaime
20X       3D Ultrasound          Techniscan
130X      Quantum Chemistry      U of Illinois, Urbana
30X       Gene Sequencing        U of Maryland

43 44 SUPERCOMPUTING ERA

45 46 JETSON TK1 THE WORLD’S 1st EMBEDDED

Development Platform for Embedded Computer Vision, Robotics, Medical

• Tegra K1 SoC
• Quad core A15 + Kepler GPU
• 192 CUDA Enabled cores
• 326 Gflops @ 5 Watt
• $192
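The figures quoted for the Jetson TK1 can be turned into a few derived ratios. The sketch below uses only the numbers on the slide, plus one labelled assumption: that each CUDA core retires one fused multiply-add (2 FLOPs) per cycle, which is used to back out an implied GPU clock.

```python
CUDA_CORES = 192        # from the slide
PEAK_GFLOPS = 326.0     # from the slide
POWER_WATTS = 5.0       # from the slide
PRICE_USD = 192.0       # from the slide
FLOPS_PER_CORE_PER_CYCLE = 2   # assumption: one fused multiply-add per cycle


def gflops_per_watt() -> float:
    return PEAK_GFLOPS / POWER_WATTS


def gflops_per_dollar() -> float:
    return PEAK_GFLOPS / PRICE_USD


def implied_clock_ghz() -> float:
    """Clock rate that reproduces the quoted peak under the FMA assumption."""
    return PEAK_GFLOPS / (CUDA_CORES * FLOPS_PER_CORE_PER_CYCLE)


if __name__ == "__main__":
    print(f"{gflops_per_watt():.1f} Gflops/W")
    print(f"{gflops_per_dollar():.2f} Gflops/$")
    print(f"{implied_clock_ghz():.3f} GHz implied GPU clock")
```

Under that assumption the quoted peak corresponds to a clock of roughly 0.85 GHz and an efficiency of about 65 Gflops per watt.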

47 JETSON TK1: UNLOCKING NEW APPLICATIONS

• Computer Vision
• Robotics
• Automotive
• Medicine
• Avionics

48 US TO BUILD TWO FLAGSHIP

SUMMIT: 150-300 PFLOPS Peak Performance
SIERRA: > 100 PFLOPS Peak Performance

IBM POWER9 CPU + NVIDIA Volta GPU

NVLink High Speed Interconnect

>40 TFLOPS per Node, >3,400 Nodes

2017

Major Step Forward on the Path to Exascale
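As a quick sanity check on the node figures above, the per-node and node-count lower bounds can be multiplied into a system-level lower bound. This sketch assumes both bounds describe the same machine, which the slide does not state explicitly, and takes exascale to mean 1000 PFLOPS.

```python
def system_peak_pflops(nodes: int, tflops_per_node: float) -> float:
    """Aggregate peak in PFLOPS (1 PFLOPS = 1000 TFLOPS)."""
    return nodes * tflops_per_node / 1000.0


if __name__ == "__main__":
    lower_bound = system_peak_pflops(3400, 40.0)   # ">3,400 nodes" x ">40 TFLOPS"
    print(f"lower bound: {lower_bound:.0f} PFLOPS")
    print(f"fraction of exascale: {lower_bound / 1000.0:.1%}")
```

The product of the two lower bounds is 136 PFLOPS, the same order of magnitude as the quoted peak ranges and still well short of 1000 PFLOPS.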

49 GPU ROADMAP

[Chart: SGEMM/W (axis from 0 to 72) by year (2008 to 2018) for the Tesla, Fermi, Kepler, Maxwell, Pascal and Volta GPU generations. Annotations: Pascal 2x SGEMM/W; Pascal features Mixed Precision, 3D Memory and NVLink.]

50 PASCAL GPU FEATURING NVLINK AND STACKED MEMORY

NVLINK
• GPU high speed interconnect
• 80-200 GB/s

3D Stacked Memory
• 4x Higher Bandwidth (~1 TB/s)
• 3x Larger Capacity
• 4x More Energy Efficient per bit

51 THANK YOU

[email protected] +39 335 5828197