GPU: Platform and Programming
Georgian-German Science Bridge QualiStartup
12 September 2019, Andreas Herten, Forschungszentrum Jülich (Handout Version)

About, Outline
- Topics: Motivation; Platform (Hardware Features, High Throughput, Vendor Comparison); Programming GPUs (Libraries, About GPU Programming, Directives, Languages, Abstraction Libraries/DSL, Tools); Summary
- Jülich Supercomputing Centre: operation of supercomputers, application support, research
- Me: all things GPU
- Slides: http://bit.ly/ggsb-gpu

Status Quo: A short but parallel story
- 1999: Graphics computation pipeline implemented in dedicated graphics hardware; computations using the OpenGL graphics library [2]; "GPU" coined by NVIDIA [3]
- 2001: NVIDIA GeForce 3 with programmable shaders (instead of a fixed pipeline) and floating-point support; 2003: DirectX 9 at ATI
- 2007: CUDA
- 2009: OpenCL
- 2019: Top 500: 25 % with NVIDIA GPUs (#1, #2) [4]; Green 500: 8 of the top 10 with GPUs [5]
- 2021: Aurora, first (?) US exascale supercomputer, based on Intel GPUs; Frontier, first (?) US more-than-exascale supercomputer, based on AMD GPUs

[Figure: Theoretical peak performance, double precision (GFLOP/s) vs. end of year, 2008–2018, for Intel Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon GPUs, and Intel Xeon Phis. Graphic: Rupp [6]]

[Figure: Theoretical peak memory bandwidth (GB/s) vs. end of year, 2008–2018, for the same processor families. Graphic: Rupp [6]]

JURECA – Jülich's Multi-Purpose Supercomputer
- 1872 nodes with Intel Xeon E5 CPUs (2 × 12 cores)
- 75 nodes with 2 NVIDIA Tesla K80 cards each (look like 4 GPUs)
- JURECA Booster: 1640 nodes with Intel Xeon Phi Knights Landing
- 1.8 (CPU) + 0.44 (GPU) + 5 (KNL) PFLOP/s peak performance
- Mellanox EDR InfiniBand

JUWELS – Jülich's New Scalable System
- 2500 nodes with Intel Xeon CPUs (2 × 24 cores)
- 48 nodes with 4 NVIDIA Tesla V100 cards each
- 10.4 (CPU) + 1.6 (GPU) PFLOP/s peak performance
- Next: Booster!

Platform

CPU vs. GPU: A matter of specialties
Transporting one (CPU) vs. transporting many (GPU). Graphics: Lee [7] and Shearings Holidays [8]

CPU vs. GPU: Chip
[Figure: CPU die with large control logic and cache, a few ALUs, and DRAM, next to a GPU die with many small ALUs, little control logic and cache, and DRAM]
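The hardware characteristics listed above (number of SMs, memory size, bus width) can be read out at runtime. Below is a minimal CUDA C++ sketch, not from the slides, using the standard runtime call cudaGetDeviceProperties; which fields to print is an arbitrary choice.

```cpp
// Minimal device-query sketch (CUDA runtime API).
// Prints a few of the properties discussed above for every visible GPU.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Streaming Multiprocessors : %d\n", prop.multiProcessorCount);
        printf("  Global memory             : %.1f GiB\n",
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        printf("  Memory bus width          : %d bit\n", prop.memoryBusWidth);
        printf("  Compute capability        : %d.%d\n", prop.major, prop.minor);
    }
    return 0;
}
```

On a JUWELS GPU node, for example, such a query would report the four Tesla V100 devices mentioned above.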
High Throughput: GPU Architecture Overview
Aim: hide latency. Everything else follows: SIMT, asynchronicity, memory.

Memory: GPU memory ain't no CPU memory
- GPU: accelerator / extension card → separate device from the CPU, separate memory, but Unified Virtual Addressing (UVA) and Unified Memory (UM)
- Host–device link: PCIe 3 (<16 GB/s) or NVLink (≈80 GB/s)
- Memory transfers need special consideration! Do as little as possible!
- Formerly: explicitly copy data to/from the GPU; now: done automatically (performance…?)
- Device memory: HBM2, <900 GB/s; P100: 16 GB RAM, 720 GB/s; V100: 32 GB RAM, 900 GB/s

Processing Flow: CPU → GPU → CPU
[Figure: host (CPUs, memory) connected via the interconnect to the GPU (scheduler, L2 cache, DRAM)]
1. Transfer data from CPU memory to GPU memory, transfer program
2. Load GPU program, execute on SMs, get (cached) data from memory; write back
3. Transfer results back to host memory

Async: Following different streams
- Problem: memory transfer is comparably slow
- Solution: do something else in the meantime (computation)! → overlap tasks
- Copy and compute engines run separately (streams)
- GPU needs to be fed: schedule many computations
- CPU can do other work while the GPU computes; synchronization
- Also: fast switching of contexts to keep the GPU busy (KGB)
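The three-step processing flow above maps directly onto the CUDA runtime API. The following sketch is illustrative and not part of the slides; the toy kernel scale and the buffer size are made up.

```cpp
#include <cuda_runtime.h>

// Toy kernel: each thread scales one element (illustrative only).
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *h = new float[n];                       // host buffer
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));    // device memory

    // 1: transfer data from CPU memory to GPU memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // 2: execute on the SMs
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);

    // 3: transfer results back to host memory
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d);
    delete[] h;
    return 0;
}
```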
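With Unified Memory (the "done automatically" case on the Memory slide), the explicit copies disappear and pages migrate on demand. Again a hedged sketch, reusing the hypothetical scale kernel from above.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged((void **)&x, n * sizeof(float)); // one pointer, valid on host and device
    for (int i = 0; i < n; ++i) x[i] = 1.0f;           // initialize on the host

    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);       // pages migrate to the GPU on demand
    cudaDeviceSynchronize();                           // wait before touching the data on the host

    printf("x[0] = %.1f\n", x[0]);                     // pages migrate back automatically
    cudaFree(x);
    return 0;
}
```

Whether this matches the performance of explicit copies depends on the access pattern, hence the "(performance…?)" caveat on the slide.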
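The copy/compute overlap from the Async slide can be sketched as follows: the data is split into chunks, each chunk gets its own stream, and transfers are issued with cudaMemcpyAsync from pinned host memory so the copy and compute engines can run concurrently. Chunk count, sizes, and the kernel are invented for illustration.

```cpp
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int nStreams = 4, chunk = 1 << 20, n = nStreams * chunk;

    float *h, *d;
    cudaMallocHost((void **)&h, n * sizeof(float));  // pinned host memory (needed for real overlap)
    cudaMalloc((void **)&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Pipeline: while one chunk is computing, the next one is being copied.
    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, 2.0f, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();                         // the CPU could do other work before this point

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```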
SIMT: Of threads and warps
- Vector: A + B = C, element by element (A0 + B0 = C0, A1 + B1 = C1, …)
- CPU: Single Instruction, Multiple Data (SIMD); Simultaneous Multithreading (SMT)
- GPU: Single Instruction, Multiple Threads (SIMT)
- CPU core ≈ GPU multiprocessor (SM)
- Working unit: set of threads (32, a warp)
- Fast switching of threads (large register file) → hide latency
- Branching: if (threads of a warp may diverge)
[Figure: Tesla V100 multiprocessor (SM) with cores and Tensor Cores. Graphics: Nvidia Corporation [9]]

New: Tensor Cores
- New in Volta: 8 Tensor Cores per Streaming Multiprocessor (SM), 640 in total for the V100
- Performance: 125 TFLOP/s (half precision)
- Calculate A × B + C = D (4 × 4 matrices; A, B: half precision FP16, accumulation in FP32)
- → 64 floating-point FMA operations per clock (mixed precision)

Low Latency vs. High Throughput: Maybe GPU's ultimate feature
- CPU: minimizes latency within each thread
- GPU: hides latency with computations from other thread warps
[Figure: timeline of a CPU core running threads T1–T4 with context switches (low latency) vs. a GPU streaming multiprocessor interleaving warps W1–W4, switching to whichever warp is ready (high throughput); legend: processing, context switch, ready, waiting]

CPU vs. GPU: Let's summarize this!
- CPU: optimized for low latency; + large main memory
- GPU: optimized for high throughput
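The SIMT slide's element-wise A + B = C example written as a CUDA kernel (a sketch, not from the handout): each thread computes one element, the hardware groups 32 consecutive threads into a warp, and the bounds check is a small instance of branching inside a warp. Host-side allocation and launch work as in the processing-flow sketch earlier.

```cpp
#include <cuda_runtime.h>

// One thread per element; threads are grouped into warps of 32 by the hardware.
__global__ void vadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)               // branch: warp threads that fail the test simply idle
        c[i] = a[i] + b[i];
}

// Hypothetical launch, assuming d_a, d_b, d_c are device buffers of length n:
// vadd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```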
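Tensor Cores are accessible from CUDA C++ through the warp-level WMMA API (nvcuda::wmma), which exposes 16 × 16 × 16 tiles built from the 4 × 4 hardware operations described above. The kernel below is a minimal, hedged sketch of one D = A × B + C tile in mixed precision (FP16 inputs, FP32 accumulate); matrix layouts and leading dimensions are arbitrary choices for illustration.

```cpp
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of D = A * B + C.
// A, B are half precision (FP16); the accumulator C/D is single precision (FP32).
__global__ void wmma_tile(const half *a, const half *b, const float *c, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::load_matrix_sync(a_frag, a, 16);                      // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(c_frag, c, 16, wmma::mem_row_major);

    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);             // D = A*B + C on Tensor Cores

    wmma::store_matrix_sync(d, c_frag, 16, wmma::mem_row_major);
}

// Hypothetical launch with one full warp: wmma_tile<<<1, 32>>>(d_a, d_b, d_c, d_d);
// Requires compute capability 7.0 or newer (Volta), e.g. nvcc -arch=sm_70.
```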