GPU Introduction – JSC OpenACC Course 2019

28 October 2019, Andreas Herten, Forschungszentrum Jülich

Outline

Introduction
  GPU History
  Architecture Comparison
  Jülich Systems
  App Showcase
Platform
  3 Core Features: Memory, Asynchronicity, SIMT
  High Throughput
Programming GPUs
  Libraries
  GPU programming models
  CUDA
Summary

History of GPUs – A short but parallel story

1999: Graphics computation pipeline implemented in dedicated graphics hardware; computations using the OpenGL graphics library [2]; »GPU« coined by NVIDIA [3]
2001: NVIDIA GeForce 3 with programmable pipeline (instead of fixed) and floating-point support; 2003: DirectX 9 at ATI
2007: CUDA
2009: OpenCL
2019: Top500: 25 % with NVIDIA GPUs (#1, #2) [4]; Green500: 8 of top 10 with GPUs [5]
2021: Aurora: first (?) US exascale supercomputer, based on Intel GPUs; Frontier: first (?) US more-than-exascale supercomputer, based on AMD GPUs

Status Quo Across Architectures

[Figure: Theoretical Peak Performance, Double Precision. GFLOP/sec (log scale) over end of year, 2008–2018, for INTEL Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon GPUs, and INTEL Xeon Phis. Graphic: Rupp [6]]

[Figure: Theoretical Peak Memory Bandwidth Comparison. GB/sec (log scale) over end of year, 2008–2018, for the same devices. Graphic: Rupp [6]]

JURECA – Jülich's Multi-Purpose Supercomputer
- 1872 nodes with Intel Xeon E5 CPUs (2 × 12 cores)
- 75 nodes with 2 NVIDIA Tesla K80 cards (each card hosts two GPUs, so a node looks like 4 GPUs)
- JURECA Booster: 1640 nodes with Intel Xeon Phi Knights Landing
- 1.8 (CPU) + 0.44 (GPU) + 5 (KNL) PFLOP/s peak performance (Top500: #44)
- Mellanox EDR InfiniBand

JUWELS – Jülich's New Scalable System
- 2500 nodes with Intel Xeon CPUs (2 × 24 cores)
- 48 nodes with 4 NVIDIA Tesla V100 cards
- 10.4 (CPU) + 1.6 (GPU) PFLOP/s peak performance (Top500: #26)
- Next: Booster!

Getting GPU-Acquainted (TASK)

Some applications: GEMM, N-Body, Mandelbrot, Dot Product

Location of code: 1-Introduction-…/Tasks/getting-started/

See Instructions.rst for hints.

Getting GPU-Acquainted (TASK): Benchmark Results

[Figure: four benchmark panels comparing CPU and GPU. DGEMM benchmark: GFLOP/s over size of square matrix. N-Body benchmark: GFLOP/s over number of particles for 1, 2, and 4 GPUs in single and double precision. Mandelbrot benchmark: MPixel/s over width of image. DDot benchmark: performance over vector length (log scale).]

Platform

CPU vs. GPU – A matter of specialties
[Images: transporting one (motorbike) vs. transporting many (coach). Graphics: Lee [7] and Shearings Holidays [8]]

CPU vs. GPU: Chip

[Diagram: a CPU die spends its area on a few powerful ALUs, large control logic, and large caches; a GPU die is dominated by many small ALUs with little control logic and cache. Each attaches to its own DRAM.]

GPU Architecture Overview

Aim: Hide Latency – everything else follows
- SIMT
- Asynchronicity
- Memory
→ High Throughput

Memory – GPU memory ain't no CPU memory
- GPU: accelerator / extension card → separate device from CPU with its own memory
- Separate memory, but Unified Virtual Addressing (UVA) and Unified Memory (UM)
- Host ↔ device link: PCIe 3 (<16 GB/s) or NVLink (≈80 GB/s)
- Memory transfers need special consideration! Do as little as possible!
- Formerly: explicitly copy data to/from GPU; now: done automatically (performance…?) – see the sketch below
- Device memory (HBM2): P100: 16 GB RAM, 720 GB/s; V100: 32 GB RAM, 900 GB/s
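A minimal sketch of the Unified Memory style (assuming a CUDA toolchain; the kernel and sizes are illustrative, not from the course material): one pointer is valid on host and device, and the runtime migrates the data on demand.

    #include <cuda_runtime.h>
    #include <stdio.h>

    // Illustrative kernel: double each element.
    __global__ void scale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main(void) {
        const int n = 1024;
        float *data;
        cudaMallocManaged(&data, n * sizeof(float)); // one pointer, visible to CPU and GPU
        for (int i = 0; i < n; i++) data[i] = 1.0f;  // host writes directly
        scale<<<(n + 255) / 256, 256>>>(data, n);    // device works on the same pointer
        cudaDeviceSynchronize();                     // wait before the host reads results
        printf("data[0] = %f\n", data[0]);
        cudaFree(data);
        return 0;
    }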

Processing Flow: CPU → GPU → CPU

1. Transfer data from CPU memory to GPU memory; transfer program
2. Load GPU program, execute on SMs, get (cached) data from memory; write back
3. Transfer results back to host memory (code sketch below)

[Diagram: CPU and CPU memory connected via the interconnect to the GPU's scheduler, L2 cache, and DRAM]
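The same three steps in code, with explicit copies (a sketch under assumptions, not the course's reference implementation; the kernel and launch sizes are illustrative):

    #include <cuda_runtime.h>

    // Illustrative kernel: double each element in place.
    __global__ void twice(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    void run_on_gpu(float *h_x, int n) {         // h_ marks host, d_ device memory
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMemcpy(d_x, h_x, n * sizeof(float),
                   cudaMemcpyHostToDevice);      // 1: host → device
        twice<<<(n + 255) / 256, 256>>>(d_x, n); // 2: execute on SMs
        cudaMemcpy(h_x, d_x, n * sizeof(float),
                   cudaMemcpyDeviceToHost);      // 3: device → host (synchronizes)
        cudaFree(d_x);
    }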


Async – Following different streams
- Problem: memory transfer is comparably slow
- Solution: do something else in the meantime (computation)! → overlap tasks
- Copy and compute engines run separately (streams; see the sketch below)

[Diagram: serial copy → compute → copy → compute timeline vs. overlapping copy and compute in separate streams]

- GPU needs to be fed: schedule many computations
- CPU can do other work while the GPU computes; synchronization needed
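A sketch of two streams overlapping copies and kernels; this is a fragment under assumptions: `twice` is the kernel above, `d_x` is a device buffer, `h_x` a host buffer (pinned via cudaMallocHost so the copies are truly asynchronous), split into two `chunk`-sized halves.

    cudaStream_t s[2];
    for (int i = 0; i < 2; i++) cudaStreamCreate(&s[i]);
    for (int i = 0; i < 2; i++) {                // each half travels in its own stream
        float *dst = d_x + i * chunk, *src = h_x + i * chunk;
        cudaMemcpyAsync(dst, src, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);            // copy engine
        twice<<<(chunk + 255) / 256, 256, 0, s[i]>>>(dst, chunk); // compute engine
        cudaMemcpyAsync(src, dst, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();                     // wait for both streams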


SIMT
SIMT = SIMD ⊕ SMT

[Diagram: a vector unit computing A + B = C on four elements at once (SIMD); two cores running two threads each (SMT)]

- CPU: Single Instruction, Multiple Data (SIMD) plus Simultaneous Multithreading (SMT)
- GPU: Single Instruction, Multiple Threads (SIMT)
- CPU core ≊ GPU multiprocessor (SM)
- Working unit: set of threads (32, a warp)
- Fast switching of threads (large register file)
- Branching (if) is possible, but threads of a warp that diverge are serialized (see the sketch below)

[Graphic: Tesla V100 multiprocessor. Nvidia Corporation [9]]
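An illustrative kernel for the branching point (my example, not from the deck): within one warp, both sides of the if/else are executed one after the other, with inactive lanes masked off, so divergence costs throughput.

    // Even and odd lanes of a warp take different branches:
    // the hardware runs both paths serially, masking inactive lanes.
    __global__ void divergent(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)
            x[i] = x[i] * 2.0f; // even lanes first...
        else
            x[i] = x[i] + 1.0f; // ...then odd lanes
    }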

Low Latency vs. High Throughput – Maybe GPU's ultimate feature

- CPU: minimizes latency within each thread
- GPU: hides latency with computations from other thread warps

[Diagram: a CPU core processes threads T1–T4 back-to-back with low latency and costly context switches; a GPU streaming multiprocessor keeps warps W1–W4 in flight and switches to a ready warp whenever the current one waits]

CPU vs. GPU – Let's summarize this!

CPU: optimized for low latency
+ Large main memory
+ Fast clock rate
+ Large caches
+ Branch prediction
+ Powerful ALU
− Relatively low memory bandwidth
− Cache misses costly
− Low performance per watt

GPU: optimized for high throughput
+ High-bandwidth main memory
+ Latency tolerant (parallelism)
+ More compute resources
+ High performance per watt
− Limited memory capacity
− Low per-thread performance
− Extension card

Programming GPUs

Summary of Acceleration Possibilities
Application:
- Libraries → drop-in acceleration
- Directives → easy acceleration
- Programming languages → flexible acceleration

Libraries – Programming GPUs is easy: Just don't!

Use applications & libraries: cuBLAS, cuSPARSE, cuFFT, cuRAND, CUDA Math, Theano, …
[Wizard graphic: Breazell [10]]
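For instance, the SAXPY shown later in this deck is a one-liner with cuBLAS (a sketch; link with -lcublas, and x and y must live in device or managed memory):

    #include <cublas_v2.h>

    // y = a*x + y via the library instead of a hand-written kernel.
    void saxpy_cublas(int n, float a, float *x, float *y) {
        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &a, x, 1, y, 1); // strides of 1: contiguous vectors
        cublasDestroy(handle);
    }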


! Parallelism

Libraries are not enough?

You need to write your own GPU code?

Primer on Parallel Scaling – Amdahl's Law

Possible maximum speedup for N parallel processors:
- Total time: t = t_serial + t_parallel = t_s + t_p
- Time with N processors: t(N) = t_s + t_p/N
- Speedup: s(N) = t/t(N) = (t_s + t_p)/(t_s + t_p/N)

[Figure: speedup over number of processors (1 to 4096, log scale) for parallel portions of 50 %, 75 %, 90 %, 95 %, and 99 %]
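A quick worked limit shows why the serial part dominates: for N → ∞, s(N) → (t_s + t_p)/t_s. With a 95 % parallel portion this caps the speedup at 1/0.05 = 20, no matter how many processors are added.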

! Parallelism

Parallel programming is not easy!

Things to consider:
- Is my application computationally intensive enough?
- What are the levels of parallelism?
- How much data needs to be transferred?
- Is the gain worth the pain?

Possibilities

Different levels of closeness to the GPU when GPU programming, which can ease the pain…
- Directives: OpenACC, OpenMP (see the sketch below)
- C++ abstraction libraries: Thrust, Kokkos, SYCL
- Python: PyCUDA, CuPy, Numba
- Other alternatives (for completeness): CUDA Fortran, HIP, OpenCL
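Since this course is about OpenACC, here is a minimal directive sketch (assuming an OpenACC-capable C compiler; data clauses are omitted where the compiler's defaults suffice):

    // Minimal OpenACC sketch: the pragma asks the compiler to offload
    // the loop to the GPU and to manage data movement for x and y.
    void saxpy(int n, float a, const float * restrict x, float * restrict y) {
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }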

CUDA SAXPY
SAXPY: y = a·x + y (single precision)

    __global__ void saxpy_cuda(int n, float a, float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // ID variables
        if (i < n)                                     // guard against too many threads
            y[i] = a * x[i] + y[i];
    }

    float a = 42.0f;
    int n = 10;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float)); // allocate GPU-capable memory
    cudaMallocManaged(&y, n * sizeof(float));
    // fill x, y
    saxpy_cuda<<<2, 5>>>(n, a, x, y);         // call kernel: 2 blocks, 5 threads each
    cudaDeviceSynchronize();                  // wait for kernel to finish
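To build and run it (a typical invocation; module names and flags depend on the system): nvcc saxpy.cu -o saxpy && ./saxpy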

CUDA Threading Model – Warp the kernel, it's a thread!

Methods to exploit parallelism:
- Thread → block
- Block → grid
- Threads & blocks in up to 3D
- Execution entity: threads – lightweight → fast switching!
- 1000s of threads execute simultaneously → order is non-deterministic!
- OpenACC takes care of threads and blocks for you! → block configuration is just an optimization (see the sketch below)
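For reference, this is what an explicit launch configuration looks like in CUDA (names and sizes illustrative; with OpenACC the compiler picks them for you):

    // A 2D grid of 2D blocks; each thread derives its global coordinates.
    dim3 threads(16, 16);                               // 256 threads per block
    dim3 blocks((width + 15) / 16, (height + 15) / 16); // enough blocks to cover the image
    my_kernel<<<blocks, threads>>>(/* arguments */);

    // Inside my_kernel:
    //   int x = blockIdx.x * blockDim.x + threadIdx.x;
    //   int y = blockIdx.y * blockDim.y + threadIdx.y;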

Summary of Acceleration Possibilities
Application:
- Libraries → drop-in acceleration
- Directives → easy acceleration (this course!)
- Programming languages → flexible acceleration

Conclusions

- GPUs achieve performance by specialized hardware → threads
- Faster time-to-solution, lower energy-to-solution
- GPU acceleration can be done by different means
- Libraries are the easiest, CUDA the fullest; OpenACC is a good compromise

Thank you for your attention!
[email protected]

Appendix: Glossary, References

Glossary I

AMD Manufacturer of CPUs and GPUs.
API A programmatic interface defined by well-defined functions. Short for application programming interface.
ATI Canada-based GPU manufacturing company; bought by AMD in 2006.

CUDA Computing platform for GPUs from NVIDIA. Provides, among others, CUDA C/C++.

JSC Jülich Supercomputing Centre, the supercomputing institute of Forschungszentrum Jülich, Germany.
JURECA A multi-purpose supercomputer with 1872 nodes at JSC.
JUWELS Jülich's new supercomputer, the successor of JUQUEEN.

Glossary II

NVIDIA US technology company creating GPUs.

OpenACC Directive-based programming model, primarily for many-core machines.
OpenCL The Open Computing Language. Framework for writing code for heterogeneous architectures (CPU, GPU, DSP, FPGA). The alternative to CUDA.
OpenGL The Open Graphics Library, an API for rendering graphics across different hardware architectures.
OpenMP Directive-based programming model, primarily for multi-threaded machines.

SAXPY Single-precision A × X + Y. A simple code example: scale a vector and add another.

Tesla The GPU product line for general-purpose computing of NVIDIA.

Glossary III

Thrust A parallel algorithms library for (among others) GPUs. See https://thrust.github.io/.

References: Images, Graphics I

[1] Igor Ovsyannykov. Yarn. Freely available at Unsplash. URL: https://unsplash.com/photos/hvILKk7SlH4.
[6] Karl Rupp. Pictures: CPU/GPU Performance Comparison. URL: https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/.
[7] Mark Lee. Picture: kawasaki ninja. URL: https://www.flickr.com/photos/pochacco20/39030210/.
[8] Shearings Holidays. Picture: Shearings coach 636. URL: https://www.flickr.com/photos/shearings/13583388025/.

References: Images, Graphics II

[9] Nvidia Corporation. Pictures: Volta GPU. Volta Architecture Whitepaper. URL: https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf.
[10] Wes Breazell. Picture: Wizard. URL: https://thenounproject.com/wes13/collection/its-a-wizards-world/.

References: Literature I

[2] Kenneth E. Hoff III et al. “Fast Computation of Generalized Voronoi Diagrams Using Graphics Hardware”. In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. SIGGRAPH ’99. New York, NY, USA: ACM Press/Addison-Wesley Publishing Co., 1999, pp. 277–286. ISBN: 0-201-48560-5. DOI: 10.1145/311535.311567.
[3] Chris McClanahan. “History and Evolution of GPU Architecture”. In: A Survey Paper (2010). URL: http://mcclanahoochie.com/blog/wp-content/uploads/2011/03/gpu-hist-paper.pdf.
[4] Jack Dongarra et al. TOP500. Nov. 2016. URL: https://www.top500.org/lists/2016/11/.

References: Literature II

[5] Jack Dongarra et al. Green500. Nov. 2016. URL: https://www.top500.org/green500/lists/2016/11/.
