INTRODUCTION TO PARALLEL AND GPU COMPUTING
Carlo Nardone, Sr. Solution Architect, NVIDIA EMEA

AGENDA
1 Intro
2 Processors Trends
3 Multiprocessors
4 GPU Computing
2 PARALLEL COMPUTING, ILLUSTRATED
3 FROM MULTICORE TO MANYCORE
4 PARALLEL COMPUTING IS OLD
Source: Tim Mattson
6 MOORE’S LAW
Source: U.Delaware CISC879 / J.Dongarra
7 MICROPROCESSOR FREQUENCY TREND
source: CA5 Hennessy/Patterson
8 THE POWER WALL
9 MICROPROCESSOR PERFORMANCE TREND
source: CA5 Hennessy/Patterson
11 THE PROCESSOR-MEMORY GAP
source: CA5 Hennessy & Patterson
13 ARITHMETIC INTENSITY
Arithmetic intensity is the number of floating-point operations a program performs divided by the number of bytes it accesses in main memory [Williams et al. 2009]. For some kernels, such as dense matrix operations, arithmetic intensity scales with problem size, but for many kernels it is independent of problem size.
Source: CA5 Hennessy & Patterson
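The definition above can be made concrete with a minimal sketch. The two kernels, the function names, and the traffic counts are illustrative assumptions, not taken from the slides: operands are double precision (8 bytes), and each array moves between main memory and the processor exactly once, which is an idealized best case.

```python
BYTES_PER_DOUBLE = 8  # assumption: double-precision operands


def arithmetic_intensity(flops, bytes_accessed):
    """Floating-point operations per byte of main-memory traffic."""
    return flops / bytes_accessed


def axpy_intensity(n):
    # y = a*x + y on vectors of length n: one multiply and one add
    # per element (2n flops); reads x and y, writes y (3n doubles).
    return arithmetic_intensity(2 * n, 3 * n * BYTES_PER_DOUBLE)


def matmul_intensity(n):
    # C = A @ B on n x n matrices: 2*n**3 flops; idealized traffic of
    # reading A and B once and writing C once (3*n**2 doubles).
    return arithmetic_intensity(2 * n**3, 3 * n**2 * BYTES_PER_DOUBLE)


for n in (1_200, 12_000):
    print(n, axpy_intensity(n), matmul_intensity(n))
```

Under these assumptions the vector update stays at 1/12 flop per byte for any n, while the dense matrix product grows as n/12, which matches the two cases the slide distinguishes.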
14 BANDWIDTH AND THE ROOFLINE MODEL
Source: HPCA5
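As a rough sketch of the roofline idea in its usual formulation, attainable performance is the smaller of the peak floating-point rate and the product of peak memory bandwidth and arithmetic intensity. The machine figures below (1000 GFLOP/s, 200 GB/s) are hypothetical values chosen for illustration, and the function names are not from the slides.

```python
def attainable_gflops(peak_gflops, peak_bandwidth_gbs, intensity):
    """Roofline bound: min(compute roof, bandwidth * flops-per-byte)."""
    return min(peak_gflops, peak_bandwidth_gbs * intensity)


def ridge_point(peak_gflops, peak_bandwidth_gbs):
    """Arithmetic intensity at which a kernel stops being memory bound."""
    return peak_gflops / peak_bandwidth_gbs


# Hypothetical machine, not a real product specification.
PEAK_GFLOPS = 1000.0
PEAK_BANDWIDTH_GBS = 200.0

for intensity in (0.25, 1.0, 5.0, 10.0):
    bound = attainable_gflops(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS, intensity)
    print(f"{intensity:5.2f} flop/byte -> {bound:7.1f} GFLOP/s")

print("ridge point:", ridge_point(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS))
```

On this hypothetical machine, kernels below 5 flop/byte are limited by memory bandwidth and kernels above it by the compute peak.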
16 FLYNN’S TAXONOMY: SISD
source: CA5 Hennessy & Patterson
17 FLYNN’S TAXONOMY: SIMD
source: CA5 Hennessy & Patterson
18 FLYNN’S TAXONOMY: SIMD (2)
source: CA5 Hennessy & Patterson
19 FLYNN’S TAXONOMY
source: CA5 Hennessy & Patterson
20 FLYNN’S TAXONOMY: MISD (?)
source: CA5 Hennessy & Patterson
21 FLYNN’S TAXONOMY: MIMD
source: CA5 Hennessy & Patterson
22 MULTIPROCESSORS
source: CA5 Hennessy & Patterson
23 MULTIPROCESSOR SMP CLASS
source: CA4 Hennessy & Patterson
24 MULTIPROCESSOR DISTRIBUTED MEM CLASS
source: CA4 Hennessy & Patterson
25 X86 ARCHITECTURE: Intel Nehalem
28 LOW LATENCY OR HIGH THROUGHPUT?
CPU
• Optimised for low-latency access to cached data sets
• Control logic for out-of-order and speculative execution
GPU
• Optimised for data-parallel, throughput computation
• Architecture tolerant of memory latency
• More transistors dedicated to computation
30 GPU UNIFIED DESIGN
31 GPGPU COMPUTING, B.C. ERA
32 WHY ARE GPUS ATTRACTIVE FOR HPC?
• Massively multithreaded manycore chips
• High flops count (SP and later DP)
• High memory bandwidth, later with ECC
• Lots of programming languages
• Lots of programming tools
• Widely used in Computational Physics
33 GPGPU COMPUTING, A.C. ERA
(Fermi)
34 GROWTH IN GPU COMPUTING

Metric                     2008      2015
CUDA Downloads             150,000   3 Million
CUDA Apps                  27        319
Universities Teaching      60        800
Academic Papers            4,000     60,000
Tesla GPUs                 6,000     450,000
Supercomputing Teraflops   77        54,000
36 ACCELERATED, HETEROGENEOUS COMPUTING
10x Performance & 5x Energy Efficiency for HPC
GPU Accelerator: Optimized for Parallel Tasks
CPU: Optimized for Serial Tasks
37 GEFORCE 8: 1ST MODERN GPU ARCH
38 NVIDIA KEPLER GK110 PROCESSOR
41 KEPLER SMX DIAGRAM
42 GPUs Accelerate Science

Speedup   Application            Institution
146X      Medical Imaging        U of Utah
36X       Molecular Dynamics     U of Illinois, Urbana
18X       Video Transcoding      Elemental Tech
50X       Matlab Computing       AccelerEyes
100X      Astrophysics           RIKEN
149X      Financial Simulation   Oxford
47X       Linear Algebra         Universidad Jaime
20X       3D Ultrasound          Techniscan
130X      Quantum Chemistry      U of Illinois, Urbana
30X       Gene Sequencing        U of Maryland

44 SUPERCOMPUTING ERA
46 JETSON TK1: THE WORLD’S 1ST EMBEDDED SUPERCOMPUTER
Development Platform for Embedded Computer Vision, Robotics, Medical
Tegra K1 SoC
• Quad-core A15 + Kepler GPU
• 192 CUDA-enabled cores
• 326 Gflops @ 5 Watt
• $192
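The headline figures on this slide can be turned into efficiency ratios with a few lines of arithmetic. This is only a sketch: it takes the quoted 326 Gflops, 5 W, and 192 cores at face value, and the variable names are illustrative.

```python
# Figures as quoted on the slide.
PEAK_GFLOPS = 326.0
POWER_WATTS = 5.0
CUDA_CORES = 192

# Energy efficiency: peak throughput per watt.
gflops_per_watt = PEAK_GFLOPS / POWER_WATTS

# Peak throughput contributed by each CUDA core.
gflops_per_core = PEAK_GFLOPS / CUDA_CORES

print(f"{gflops_per_watt:.1f} GFLOPS/W")      # 65.2 GFLOPS/W
print(f"{gflops_per_core:.2f} GFLOPS/core")   # 1.70 GFLOPS/core
```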
47 JETSON TK1: UNLOCKING NEW APPLICATIONS
• Computer Vision
• Robotics
• Automotive
• Medicine
• Avionics
48 US TO BUILD TWO FLAGSHIP SUPERCOMPUTERS
SUMMIT: 150-300 PFLOPS Peak Performance
SIERRA: > 100 PFLOPS Peak Performance
• IBM POWER9 CPU + NVIDIA Volta GPU
• NVLink High Speed Interconnect
• >40 TFLOPS per Node, >3,400 Nodes
• 2017
Major Step Forward on the Path to Exascale
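A quick consistency check on the node figures, as a sketch: multiplying the quoted per-node and node-count lower bounds gives a floor on aggregate peak performance. The slide does not say which of the two systems those figures describe, so applying them to a single machine is an assumption; the helper name is illustrative.

```python
TFLOPS_PER_PFLOPS = 1_000


def aggregate_pflops(tflops_per_node, nodes):
    """Aggregate peak in PFLOPS from per-node TFLOPS and node count."""
    return tflops_per_node * nodes / TFLOPS_PER_PFLOPS


# Both inputs are quoted as "greater than", so the result is a floor.
floor_pflops = aggregate_pflops(40, 3_400)
print(floor_pflops)  # 136.0

# Nodes needed at exactly 40 TFLOPS each to reach 150 PFLOPS.
nodes_for_150 = 150 * TFLOPS_PER_PFLOPS / 40
print(nodes_for_150)  # 3750.0
```

The floor of 136 PFLOPS sits just under the 150 PFLOPS low end quoted for Summit, which is compatible with both inputs being strict lower bounds.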
49 GPU ROADMAP
[Chart: SGEMM/W by GPU generation, 2008 to 2018 (vertical axis 0 to 72). Generations shown: Tesla, Fermi, Kepler, Maxwell, Pascal (Mixed Precision, 3D Memory, NVLink), Volta. Callout: Pascal 2x SGEMM/W.]
50 PASCAL GPU FEATURING NVLINK AND STACKED MEMORY
NVLINK
• GPU high speed interconnect
• 80-200 GB/s
3D Stacked Memory
• 4x Higher Bandwidth (~1 TB/s)
• 3x Larger Capacity
• 4x More Energy Efficient per bit
51 THANK YOU
[email protected] +39 335 5828197