
GPU Introduction
JSC OpenACC Course 2019, 28 October 2019
Andreas Herten, Forschungszentrum Jülich
Member of the Helmholtz Association

Outline
- Introduction: GPU History, Architecture Comparison, Jülich Systems, App Showcase
- Platform: 3 Core Features (Memory, Asynchronicity, SIMT), High Throughput
- Programming GPUs: Libraries, GPU programming models, CUDA
- Summary

History of GPUs – A short but parallel story
- 1999: Graphics computation pipeline implemented in dedicated graphics hardware; computations done through the OpenGL graphics library [2]; "GPU" coined by NVIDIA [3]
- 2001: NVIDIA GeForce 3 with programmable shaders (instead of a fixed pipeline) and floating-point support; 2003: DirectX 9 at ATI
- 2007: CUDA
- 2009: OpenCL
- 2019: Top 500: 25 % of systems with NVIDIA GPUs (#1, #2) [4]; Green 500: 8 of the top 10 with GPUs [5]
- 2021: Aurora, first (?) US exascale supercomputer, based on Intel GPUs; Frontier, first (?) US more-than-exascale supercomputer, based on AMD GPUs

Status Quo Across Architectures – Peak Performance
[Figure: Theoretical peak performance, double precision (GFLOP/s), by end of year 2008–2018, for Intel Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon GPUs, and Intel Xeon Phis (up to Tesla V100, MI60, Platinum 9282). Graphic: Rupp [6]]

Status Quo Across Architectures – Memory Bandwidth
[Figure: Theoretical peak memory bandwidth (GB/s), by end of year 2008–2018, for the same processor lines. Graphic: Rupp [6]]

JURECA – Jülich's Multi-Purpose Supercomputer
- 1872 nodes with Intel Xeon E5 CPUs (2 × 12 cores)
- 75 nodes with 2 NVIDIA Tesla K80 cards each (appearing as 4 GPUs per node)
- JURECA Booster: 1640 nodes with Intel Xeon Phi Knights Landing
- 1.8 (CPU) + 0.44 (GPU) + 5 (KNL) PFLOP/s peak performance (Top500: #44)
- Mellanox EDR InfiniBand

JUWELS – Jülich's New Scalable System
- 2500 nodes with Intel Xeon CPUs (2 × 24 cores)
- 48 nodes with 4 NVIDIA Tesla V100 cards each
- 10.4 (CPU) + 1.6 (GPU) PFLOP/s peak performance (Top500: #26)
- Next: Booster!

Getting GPU-Acquainted (TASK)
- Some applications: GEMM, N-Body, Mandelbrot, Dot Product
- Location of code: 1-Introduction-…/Tasks/getting-started/
- See Instructions.rst for hints.
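To give a flavour of what one of these mini-apps does on the GPU, below is a minimal CUDA dot-product sketch. It is an illustration only, not the code shipped in Tasks/getting-started/; the kernel name, the block size of 256, and the use of Unified Memory (cudaMallocManaged) are choices made here for brevity.

```cuda
// Minimal dot-product sketch (illustration only, not the course's task code).
// Each thread multiplies one pair of elements; a block-wide reduction in shared
// memory sums the partial products, and atomicAdd combines the block results.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dot(const float *x, const float *y, float *result, int n) {
    __shared__ float partial[256];                       // one slot per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? x[i] * y[i] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(result, partial[0]);                   // combine per-block sums
}

int main() {
    const int n = 1 << 20;
    float *x, *y, *result;
    cudaMallocManaged(&x, n * sizeof(float));            // Unified Memory: usable on host and device
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&result, sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    *result = 0.0f;
    dot<<<(n + 255) / 256, 256>>>(x, y, result, n);
    cudaDeviceSynchronize();
    printf("dot = %f (expected %f)\n", *result, 2.0f * n);
    cudaFree(x); cudaFree(y); cudaFree(result);
    return 0;
}
```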
Getting GPU-Acquainted (TASK) – Benchmarks
[Figure: DGEMM benchmark (GFLOP/s vs. size of square matrix), N-Body benchmark (GFLOP/s vs. number of particles, for 1/2/4 GPUs in single and double precision), Mandelbrot benchmark (MPixel/s vs. width of image), and DDot benchmark (GFLOP/s vs. vector length), each comparing CPU and GPU]

Platform

CPU vs. GPU – A matter of specialties
- Transporting one (fast) vs. transporting many (high capacity)
- Graphics: Lee [7] and Shearings Holidays [8]

CPU vs. GPU Chip
[Figure: Schematic chip layouts. CPU: large control logic and cache, a few powerful ALUs, DRAM. GPU: many small ALUs, little control logic and cache, DRAM]

GPU Architecture Overview
- Aim: hide latency → high throughput
- Everything else follows: Memory, Asynchronicity, SIMT

Memory – GPU memory ain't no CPU memory
- GPU: accelerator / extension card → separate device from the CPU with separate memory, but Unified Virtual Addressing (UVA) and Unified Memory (UM)
- Memory transfers need special consideration! Do as little as possible!
- Formerly: explicitly copy data to/from the GPU; now: done automatically (performance…?); see the sketch after the Processing Flow slide
- Host ↔ device link: PCIe 3 with ≤16 GB/s, or NVLink with ≈80 GB/s
- Device memory: HBM2 with up to 900 GB/s; P100: 16 GB RAM, 720 GB/s; V100: 32 GB RAM, 900 GB/s

Processing Flow: CPU → GPU → CPU
1. Transfer data from CPU memory to GPU memory, transfer program
2. Load GPU program, execute on the SMs, get (cached) data from memory, write back
3. Transfer results back to host memory
[Figure: CPUs and CPU memory connected via the interconnect to the GPU's scheduler, L2 cache, and DRAM]
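The sketch below walks through these three steps with the CUDA runtime API. It is a minimal, hypothetical example (kernel, sizes, and names are chosen here, not taken from the course). The explicit cudaMemcpy calls correspond to the "formerly" scheme from the Memory slide; with cudaMallocManaged, Unified Memory would handle the transfers automatically.

```cuda
// Sketch of the three-step processing flow with a hypothetical vector-scaling kernel:
// 1. copy input to the GPU, 2. run the kernel on the SMs, 3. copy the result back.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;       // one thread per element (SIMT)
    if (i < n)
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);                   // host memory
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);                               // separate device memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);     // step 1: host -> device
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);         // step 2: execute on the SMs
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);     // step 3: device -> host (synchronizes)

    printf("h[0] = %f\n", h[0]);                         // expected: 2.0
    cudaFree(d);
    free(h);
    return 0;
}
```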
Asynchronicity – Following different streams
- Problem: memory transfer is comparably slow
- Solution: do something else (computation) in the meantime → overlap tasks
- Copy and compute engines run separately (streams); see the stream sketch at the end of this section
[Figure: timelines of alternating copy and compute phases, serialized vs. overlapped across streams]
- The GPU needs to be fed: schedule many computations
- The CPU can do other work while the GPU computes; synchronization needed

SIMT: SIMT = SIMD ⊕ SMT
- CPU: Single Instruction, Multiple Data (SIMD), e.g. vector units computing A + B = C on four elements at once; Simultaneous Multithreading (SMT), several threads per core
- GPU: Single Instruction, Multiple Threads (SIMT)
- CPU core corresponds to GPU multiprocessor (SM)
- Working unit: a set of 32 threads (a warp)
- Fast switching of threads (large register file)
- Branching within a warp is costly (divergent branches are serialized)
[Figure: Tesla V100 multiprocessor. Graphics: Nvidia Corporation [9]]

Low Latency vs. High Throughput – Maybe the GPU's ultimate feature
- CPU: minimizes latency within each thread
- GPU: hides latency with computations from other thread warps
[Figure: a CPU core (low latency) processes threads T1–T4 one after another; a GPU streaming multiprocessor (high throughput) context-switches between warps W1–W4 whenever one is waiting]

CPU vs. GPU – Let's summarize this!
CPU: optimized for low latency
+ Large main memory
+ Fast clock rate
+ Large caches
+ Branch prediction
+ Powerful ALU
− Relatively low memory bandwidth
− Cache misses are costly
− Low performance per watt

GPU: optimized for high throughput
+ High-bandwidth main memory
+ Latency tolerant (through parallelism)
+ More compute resources
+ High performance per watt
− Limited memory capacity
− Low per-thread performance
− Extension card

Programming GPUs
Summary of Acceleration
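The Programming GPUs part of the course covers libraries, directive-based models, and CUDA itself. As a foretaste, and tying back to the Asynchronicity slide above, here is a minimal, hypothetical CUDA streams sketch (not course material) of overlapping transfers and kernels: the data is split into chunks, each chunk's copies and kernel go into their own stream, so the copy and compute engines can work on different chunks concurrently.

```cuda
// Copy/compute overlap with CUDA streams (hypothetical sketch).
#include <cuda_runtime.h>

__global__ void increment(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main() {
    const int nStreams = 4;
    const int chunk = 1 << 20;
    const int n = nStreams * chunk;

    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // pinned host memory, required for async copies
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;              // each stream works on its own chunk
        cudaMemcpyAsync(d + offset, h + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        increment<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + offset, chunk);
        cudaMemcpyAsync(h + offset, d + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();                 // CPU waits here; before this it can do other work

    for (int s = 0; s < nStreams; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```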