INTRODUCTION to PARALLEL and GPU COMPUTING Carlo Nardone Sr

INTRODUCTION TO PARALLEL AND GPU COMPUTING Carlo Nardone Sr. Solution Architect, NVIDIA EMEA 1 Intro 2 Processors Trends AGENDA 3 Multiprocessors 4 GPU Computing 2 PARALLEL COMPUTING, ILLUSTRATED 3 FROM MULTICORE TO MANYCORE 4 PARALLEL COMPUTING IS OLD Source: Tim Mattson 5 1 Intro 2 Processors Trends AGENDA 3 Multiprocessors 4 GPU Computing 6 MOORE’S LAW Source: U.Delaware CISC879 / J.Dongarra 7 MICROPROCESSOR FREQUENCY TREND source: CA5 Hennessy/Patterson 8 THE POWER WALL 9 MICROPROCESSORS PERFORMANCE TREND source: CA5 Hennessy/Patterson 11 THE PROCESSOR-MEMORY GAP source: CA5 Hennessy & Patterson 13 ARITHMETIC INTENSITY Arithmetic intensity, specified as the number of floating-point operations to run the program divided by the number of bytes accessed in main memory [Williams et al. 2009]. Some kernels have an arithmetic intensity that scales with problem size, such as dense matrix, but there are many kernels with arithmetic intensities independent of problem size. Source: CA5 Hennessy & Patterson 14 BANDWIDTH AND THE ROOFLINE MODEL Source: HPCA5 15 1 Intro 2 Processors Trends AGENDA 3 Multiprocessors 4 GPU Computing 16 FLYNN’S TAXONOMY: SISD source: CA5 Hennessy & Patterson 17 FLYNN’S TAXONOMY: SIMD source: CA5 Hennessy & Patterson 18 FLYNN’S TAXONOMY: SIMD (2) source: CA5 Hennessy & Patterson 19 FLYNN’S TAXONOMY source: CA5 Hennessy & Patterson 20 FLYNN’S TAXONOMY: MISD (?) source: CA5 Hennessy & Patterson 21 FLYNN’S TAXONOMY: MIMD source: CA5 Hennessy & Patterson 22 MULTIPROCESSORS source: CA5 Hennessy & Patterson 23 MULTIPROCESSOR SMP CLASS source: CA4 Hennessy & Patterson 24 MULTIPROCESSOR DISTRIBUTED MEM CLASS source: CA4 Hennessy & Patterson 25 X86 ARCHITECTURE Intel Nehalem 26 1 Intro 2 Processors Trends AGENDA 3 Multiprocessors 4 GPU Computing 27 28 LOW LATENCY OR HIGH THROUGHPUT? CPU GPU Optimised for low-latency Optimised for data-parallel, access to cached data sets throughput computation Control logic for out-of-order Architecture tolerant of and speculative execution memory latency More transistors dedicated to computation 29 30 GPU UNIFIED DESIGN 31 GPGPU COMPUTING, B.C. ERA 32 WHY GPU ARE ATTRACTIVE FOR HPC? • Massive multithreaded manycore chips • High flops count (SP and later DP) • High memory bandwidth, (later) including ECC • Lots of programming languages • Lots of programming tools • Widely used in Computational Physics 33 GPGPU COMPUTING, A.C. ERA (Fermi) 34 GROWTH IN GPU COMPUTING 2008 2015 150,000 3 Million CUDA Downloads CUDA Downloads 27 319 CUDA Apps CUDA Apps 60 800 Universities Universities Teaching Teaching 4,000 60,000 Academic Academic Papers Papers 6,000 450,000 Tesla GPUs Tesla GPUs 77 54,000 Supercomputing Supercomputing Teraflops Teraflops 35 GROWTH IN GPU COMPUTING 2008 2015 150,000 3 Million CUDA Downloads CUDA Downloads 27 319 CUDA Apps CUDA Apps 60 800 Universities Universities Teaching Teaching 4,000 60,000 Academic Academic Papers Papers 6,000 450,000 Tesla GPUs Tesla GPUs 77 54,000 Supercomputing Supercomputing Teraflops Teraflops 36 ACCELERATED, HETEROGENEOUS COMPUTING 10x Performance & 5x Energy Efficiency for HPC GPU Accelerator CPU Optimized for Optimized for Parallel Tasks Serial Tasks 37 GEFORCE 8: 1ST MODERN GPU ARCH 38 NVIDIA KEPLER GK110 PROCESSOR 40 41 KEPLER SMX DIAGRAM 42 146X 36X 18X 50X 100X Medical Imaging Molecular Dynamics Video Transcoding Matlab Computing Astrophysics U of Utah U of Illinois, Urbana Elemental Tech AccelerEyes RIKEN GPUs Accelerate Science 149X 47X 20X 130X 30X Financial Simulation Linear Algebra 3D Ultrasound Quantum Chemistry Gene Sequencing Oxford Universidad Jaime Techniscan U of Illinois, Urbana U of Maryland 43 44 SUPERCOMPUTING ERA 45 46 JETSON TK1 THE WORLD’S 1st EMBEDDED SUPERCOMPUTER Development Platform for Embedded Computer Vision, Robotics, Medical Tegra K1 SoC Quad core A15 + Kepler GPU 192 CUDA Enabled cores 326 Gflops @ 5 Watt $192 47 JETSON TK1: UNLOCKING NEW APPLICATIONS Computer Vision Robotics Automotive Medicine Avionics 48 US TO BUILD TWO FLAGSHIP SUPERCOMPUTERS SUMMIT SIERRA 150-300 PFLOPS > 100 PFLOPS Peak Performance Peak Performance IBM POWER9 CPU + NVIDIA Volta GPU NVLink High Speed Interconnect >40 TFLOPS per Node, >3,400 Nodes 2017 Major Step Forward on the Path to Exascale 49 GPU ROADMAP Pascal 2x SGEMM/W 72 Volta 60 48 Pascal Mixed Precision 36 3D Memory NVLink SGEMM SGEMM / W 24 Maxwell 12 Kepler Tesla Fermi 0 2008 2010 2012 2014 2016 2018 50 PASCAL GPU FEATURING NVLINK AND STACKED MEMORY NVLINK GPU high speed interconnect 80-200 GB/s 3D Stacked Memory 4x Higher Bandwidth (~1 TB/s) 3x Larger Capacity 4x More Energy Efficient per bit 51 THANK YOU [email protected] +39 335 5828197 52 .

INTRODUCTION to PARALLEL and GPU COMPUTING Carlo Nardone Sr

CUDA by Example

Bitfusion Guide to CUDA Installation Bitfusion Guides Bitfusion: Bitfusion Guide to CUDA Installation

2.5 Classification of Parallel Computers

Parallel Architectures and Algorithms for Large-Scale Nonlinear Programming

(GPU) Computing

Computer Hardware Architecture Lecture 4

Massively Parallel Computing with CUDA

CUDA Flux: a Lightweight Instruction Profiler for CUDA Applications

GPU: Power Vs Performance

An Introduction to Gpus, CUDA and Opencl

Cuda C Programming Guide

CUDA 11 and A100 - WHAT’S NEW? Markus Hrywniak, 23Rd June 2020 TOPICS for TODAY