GPU PROGRAMMING 101 GRIDKA SCHOOL 2019

29 August 2019 Andreas Herten Forschungszentrum Jülich Handout Version

About, Outline
Me: All things GPU. Jülich Supercomputing Centre: operation of supercomputers, application support, research.
Topics: Motivation, Platform (Hardware Features, Summary), Programming GPUs (Libraries, Directives, Languages, Abstraction Libraries/DSL, Tools), Conclusions

Status Quo: A short but parallel story

1999  Graphics computation pipeline implemented in dedicated graphics hardware; computations using the OpenGL graphics library [2]; »GPU« coined by NVIDIA [3]
2001  NVIDIA GeForce 3 with programmable (instead of fixed) pipeline and floating-point support; 2003: DirectX 9 at ATI
2007  CUDA
2009  OpenCL
2019  Top 500: 25 % with NVIDIA GPUs (#1, #2) [4]; Green 500: 8 of top 10 with GPUs [5]
2021  Aurora: first (?) US exascale supercomputer, based on Intel GPUs; Frontier: first (?) US more-than-exascale supercomputer, based on AMD GPUs

Status Quo: Peak performance, double precision
[Figure: Theoretical peak performance (double precision), GFLOP/sec over end of year 2008–2018, for NVIDIA Tesla GPUs, AMD GPUs, Intel Xeon CPUs, and Intel Xeon Phis. Graphic: Rupp [6]]

Status Quo: Peak memory bandwidth
[Figure: Theoretical peak memory bandwidth comparison, GB/sec over end of year 2008–2018, for the same device families. Graphic: Rupp [6]]

Status Quo: JUWELS
Next: Booster!

But why?! Let's find out!

Platform

CPU vs. GPU: A matter of specialties
[Figure: Transporting one vs. transporting many. Graphics: Lee [7] and Shearings Holidays [8]]

CPU vs. GPU: Chip

[Figure: Schematic chip layouts: CPU with control logic, a few large ALUs, large cache, and DRAM; GPU with many small ALUs and DRAM]

GPU Architecture Overview
Aim: Hide latency; everything else follows
SIMT + Asynchronicity + Memory → High Throughput


Memory: GPU memory ain't no CPU memory
GPU: accelerator / extension card → separate device from the CPU
Separate memory, but Unified Virtual Addressing (UVA) and Unified Memory (UM)
Host-to-device link: PCIe 3 (<16 GB/s) or NVLink (≈80 GB/s)
Device memory: HBM2, <900 GB/s (P100: 16 GB RAM, 720 GB/s; V100: 32 GB RAM, 900 GB/s)
Memory transfers need special consideration! Do as little as possible!
Formerly: explicitly copy data to/from GPU
Now: done automatically (performance…?)
[Figure: Host (Control, ALUs, Cache, DRAM) connected via PCIe 3 / NVLink to the GPU device with its own DRAM/HBM2]
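To make "formerly vs. now" concrete, here is a minimal sketch of the two transfer styles in CUDA C; kernel, blocks, and threads are placeholders, not code from the slides:

// Formerly: explicit device allocation and copies
float *d_x;
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
kernel<<<blocks, threads>>>(n, d_x);
cudaMemcpy(x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_x);

// Now: managed (unified) memory, migrated between host and device on demand
float *x_m;
cudaMallocManaged(&x_m, n * sizeof(float));
kernel<<<blocks, threads>>>(n, x_m);
cudaDeviceSynchronize();   // wait before touching x_m on the host again
cudaFree(x_m);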


Async: Following different streams

Problem: Memory transfer is comparably slow
Solution: Do something else in the meantime (computation)! → Overlap tasks
Copy and compute engines run separately (streams)
[Figure: Serial copy/compute timeline vs. overlapping copy and compute in separate streams]
GPU needs to be fed: schedule many computations
CPU can do other work while GPU computes; synchronization required
Also: fast switching of contexts to keep GPU busy (KGB)
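A minimal sketch of overlapping transfers and kernels with CUDA streams; the chunking, kernel, and launch configuration are placeholders, not taken from the slides (pinned host memory is assumed for truly asynchronous copies):

cudaStream_t stream[2];
for (int s = 0; s < 2; s++)
    cudaStreamCreate(&stream[s]);

for (int c = 0; c < nChunks; c++) {
    int s = c % 2;   // alternate between the two streams
    cudaMemcpyAsync(d_x + c * chunk, h_x + c * chunk, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, stream[s]);
    kernel<<<blocks, threads, 0, stream[s]>>>(chunk, d_x + c * chunk);
    cudaMemcpyAsync(h_x + c * chunk, d_x + c * chunk, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, stream[s]);
}
cudaDeviceSynchronize();   // wait until all streams are done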


SIMT: Of threads and warps
Vector example: A + B = C, element-wise (A0+B0=C0, A1+B1=C1, …)
CPU: Single Instruction, Multiple Data (SIMD); Simultaneous Multithreading (SMT): threads on cores
GPU: Single Instruction, Multiple Threads (SIMT)
CPU core ≊ GPU multiprocessor (SM)
Working unit: set of threads (32, a warp)
Fast switching of threads (large register file) → hide latency
Branching: if
[Figure: Tesla V100 multiprocessor (SM) with cores and Tensor Cores. Graphics: Nvidia Corporation [9]]
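Branching illustrated: when threads of one warp take different sides of an if, both paths are executed one after the other with the inactive threads masked off. A small sketch; the condition and arrays are made up for illustration:

__global__ void branchy(int n, const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f)
        out[i] = 2.0f * in[i];   // first path: other threads of the warp idle
    else
        out[i] = 0.0f;           // second path: first group idles
}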

New: Tensor Cores (new in Volta)
8 Tensor Cores per Streaming Multiprocessor (SM), 640 in total for V100
Performance: 125 TFLOP/s (half precision)
Calculate A × B + C = D (4 × 4 matrices; A, B: half precision) → 64 floating-point FMA operations per clock (mixed precision)

[Figure: D (FP16 or FP32) = A (FP16) × B (FP16) + C (FP16 or FP32)]
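The slide describes the hardware operation; from CUDA C++, Tensor Cores are programmed through the WMMA API on 16 × 16 × 16 tiles. A minimal sketch, not from the slides; leading dimensions and row-major layout are assumptions, and the kernel must be launched with at least one full warp on compute capability 7.0 or newer:

#include <mma.h>
using namespace nvcuda;

// One warp computes D = A * B + C for a single 16x16x16 tile
__global__ void wmma_tile(const half *A, const half *B, const float *C, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c, d;
    wmma::load_matrix_sync(a, A, 16);                      // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::load_matrix_sync(c, C, 16, wmma::mem_row_major);
    wmma::mma_sync(d, a, b, c);                            // runs on the Tensor Cores
    wmma::store_matrix_sync(D, d, 16, wmma::mem_row_major);
}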

CPU vs. GPU: Let's summarize this!

CPU: optimized for low latency
+ Large main memory
+ Fast
+ Large caches
+ Branch prediction
+ Powerful ALU
− Relatively low memory bandwidth
− Cache misses costly
− Low performance per watt

GPU: optimized for high throughput
+ High-bandwidth main memory
+ Latency tolerant (parallelism)
+ More compute resources
+ High performance per watt
− Limited memory capacity
− Low per-thread performance
− Extension card

Programming GPUs

Preface: CPU
A simple CPU program as reference!

SAXPY: y = a·x + y (vectors x, y; scalar a), with single precision
Part of BLAS Level 1 (the layer underneath LAPACK)

void saxpy(int n, float a, float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int a = 42;
int n = 10;
float x[n], y[n];
// fill x, y

saxpy(n, a, x, y);

Libraries: Programming GPUs is easy: Just don't!
Use applications & libraries! (Wizard: Breazell [10])
cuBLAS / rocBLAS, cuSPARSE / rocSPARSE, cuDNN / rocDNN, cuFFT / rocFFT, cuRAND / rocRAND, CUDA Math, Theano, …

BLAS on GPU: Parallel algebra

cuBLAS: GPU-parallel BLAS (all 152 routines) by NVIDIA; single, double, complex data types; constant competition with Intel's MKL; multi-GPU support
→ https://developer.nvidia.com/cublas
→ http://docs.nvidia.com/cuda/cublas

rocBLAS: AMD BLAS implementation
→ https://github.com/ROCmSoftwarePlatform/rocBLAS
→ https://rocblas.readthedocs.io/en/latest/

cuBLAS: Code example

int n = 10;
float a = 42.0f;
float x[n], y[n];
// fill x, y

cublasHandle_t handle;
cublasCreate(&handle);                            // initialize

float *d_x, *d_y;
cudaMallocManaged(&d_x, n * sizeof(x[0]));        // allocate GPU memory
cudaMallocManaged(&d_y, n * sizeof(y[0]));

cublasSetVector(n, sizeof(x[0]), x, 1, d_x, 1);   // copy data to GPU
cublasSetVector(n, sizeof(y[0]), y, 1, d_y, 1);

cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);       // call BLAS routine

cublasGetVector(n, sizeof(y[0]), d_y, 1, y, 1);   // copy result to host

cudaFree(d_x);                                    // finalize
cudaFree(d_y);
cublasDestroy(handle);


Thrust: Iterators! Iterators everywhere!

Thrust ≈ STL for CUDA: C++ template library, based on iterators
Data-parallel primitives (scan(), sort(), reduce(), …)
Fully compatible with plain CUDA C (comes with the CUDA Toolkit)
Great with [](){} lambdas!

→ http://thrust.github.io/ http://docs.nvidia.com/cuda/thrust/

AMD backend available: https://github.com/ROCmSoftwarePlatform/Thrust

Thrust: Code example

int a = 42;
int n = 10;
thrust::host_vector<float> x(n), y(n);
// fill x, y

thrust::device_vector<float> d_x = x, d_y = y;

using namespace thrust::placeholders;
thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(), a * _1 + _2);

y = d_y;

One more example with classical lambdas in the appendix!

Programming GPUs: Directives

GPU Programming with Directives: Keepin' you portable
Annotate usual source code by directives:

#pragma acc loop
for (int i = 0; i < n; i++) {}

Also: generalized API functions, e.g. acc_copy()
Compiler interprets directives, creates according instructions

Pro:
+ Portability: other compiler? No problem!
+ Different target architectures from same code
+ Easy to program

Con:
− Compiler support limited
− To the compiler, it's a serial program: raw power hidden
− Somewhat harder to debug

GPU Programming with Directives: The power of… two.

OpenMP: Standard for multithreaded programming on CPU; GPU since 4.0, better since 4.5

#pragma omp target map(tofrom:y), map(to:x)
#pragma omp teams num_teams(10) num_threads(10)
#pragma omp distribute
for (…) {
    #pragma omp parallel for
    for (…) {
        // …
    }
}

OpenACC: Similar to OpenMP, but more specifically for GPUs; might eventually be re-merged into the OpenMP standard
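For comparison with the OpenACC example on the following slides, a complete OpenMP 4.5 target-offload SAXPY; this sketch is not from the original slides, and the combined directive is only one of several possible spellings:

void saxpy_omp(int n, float a, float *x, float *y) {
    // offload the loop to the device; copy x in, copy y in and back out
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}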

OpenACC: Code example

void saxpy_acc(int n, float a, float *x, float *y) {
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int a = 42;
int n = 10;
float x[n], y[n];
// fill x, y

saxpy_acc(n, a, x, y);

→ GPU tutorial this afternoon!


Programming GPUs: Languages, finally

Programming GPU Directly: Finally…

OpenCL (2009): Open Computing Language by Khronos Group (Apple, IBM, NVIDIA, …). Platform: programming language (OpenCL C/C++), API, and compiler. Targets CPUs, GPUs, FPGAs, and other many-core machines. Fully open source.
CUDA (2007): NVIDIA's GPU platform. Platform: drivers, programming language (CUDA C/C++), API, compiler, tools, … Only NVIDIA GPUs. Compilation with nvcc (free, but not open); clang has CUDA support, but CUDA is needed for the last step. Also: CUDA Fortran.
HIP (2016+): AMD's new unified programming model for AMD (via ROCm) and NVIDIA GPUs.
Choose what flavor you like, what colleagues/collaboration is using.
Hardest: Come up with a parallelized algorithm.

CUDA Threading Model: Warp the kernel, it's a thread!
Methods to exploit parallelism:
Thread → Block
Block → Grid
Threads & blocks in 3D
Parallel function: kernel
    __global__ void kernel(int a, float *b) { }
Access own ID by global variables threadIdx.x, blockIdx.y, …
Execution entity: threads; lightweight → fast switching!
1000s of threads execute simultaneously → order non-deterministic!
⇒ SAXPY! (An indexing sketch follows below.)
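How thread and block IDs combine into one global index, and how a launch configuration covering n elements can be chosen; a sketch with a commonly used block size of 256, not taken from the slides:

__global__ void scale(int n, float *b) {
    // global index: block offset plus thread offset within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        b[i] = 2.0f * b[i];
}

// launch: enough blocks of 256 threads each to cover all n elements
int threads = 256;
int blocks = (n + threads - 1) / threads;
scale<<<blocks, threads>>>(n, b);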

CUDA SAXPY: With runtime-managed data transfers

__global__ void saxpy(int n, float a, float *x, float *y) {    // specify kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;             // ID variables
    if (i < n)                                                 // guard against too many threads
        y[i] = a * x[i] + y[i];
}

int a = 42;
int n = 10;
float *x, *y;
// fill x, y

cudaMallocManaged(&x, n * sizeof(float));                      // allocate GPU-capable memory
cudaMallocManaged(&y, n * sizeof(float));

saxpy<<<2, 5>>>(n, a, x, y);                                   // call kernel: 2 blocks, each 5 threads

cudaDeviceSynchronize();                                       // wait for kernel to finish
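A kernel launch does not return an error by itself; a common pattern (not shown on the slides) is to query CUDA's error state after launching and synchronizing:

cudaError_t err = cudaGetLastError();          // errors from the launch itself
if (err != cudaSuccess)
    printf("Launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();                 // errors from kernel execution
if (err != cudaSuccess)
    printf("Kernel failed: %s\n", cudaGetErrorString(err));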

HIP SAXPY: From CUDA to HIP

Starting point, the CUDA version:

__global__ void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int a = 42;
int n = 10;
float *x, *y;
// fill x, y

cudaMallocManaged(&x, n * sizeof(float));
cudaMallocManaged(&y, n * sizeof(float));
saxpy<<<2, 5>>>(n, a, x, y);
cudaDeviceSynchronize();

The HIP version; works on AMD and NVIDIA GPUs!

#include "hip/hip_runtime.h"

__global__ void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int a = 42;
int n = 10;
float *x, *y;
// fill x, y

hipMallocManaged(&x, n * sizeof(float));
hipMallocManaged(&y, n * sizeof(float));
hipLaunchKernelGGL(saxpy, 2, 5, 0, 0, n, a, x, y);
hipDeviceSynchronize();
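The translation above can largely be automated; ROCm ships hipify tools for converting CUDA sources to HIP. A possible invocation (file names are placeholders):

$ hipify-perl saxpy.cu > saxpy.hip.cpp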

Programming GPUs: Abstraction Libraries/DSL

Abstraction Libraries & DSLs
Libraries with ready-programmed abstractions; partly a compiler/transpiler is necessary
Different backends to choose from for the targeted accelerator
Abstraction level between Thrust, OpenACC, and CUDA
Examples: SYCL, Kokkos, Alpaka, Futhark, C++AMP, …

An Alternative: Kokkos (from Sandia National Laboratories)

C++ library for performance portability
Data-parallel patterns, architecture-aware memory layouts, …

Kokkos::View<double*> x("X", length);
Kokkos::View<double*> y("Y", length);
double a = 2.0;

// Fill x, y

Kokkos::parallel_for(length, KOKKOS_LAMBDA (const int& i) {
    x(i) = a * x(i) + y(i);
});

→ https://github.com/kokkos/kokkos/
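For the snippet to run, it has to be framed by Kokkos initialization and finalization; a minimal sketch (the backend, e.g. CUDA, is chosen when Kokkos is built):

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        // Views and parallel_for from above live in this scope,
        // so they are destroyed before finalize()
    }
    Kokkos::finalize();
}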

Member of the Helmholtz Association 29 August 2019 Slide 32 41 Another Alternative: SYCL

Extension of/upon OpenCL
With buffers, queues, accessors, lambdas, …
Main programming model for Aurora's Intel GPUs
→ khronos.org/sycl/

class mySaxpy;
std::vector<float> h_x(length), h_y(length);
// Fill x, y

cl::sycl::buffer d_x(h_x), d_y(h_y);
cl::sycl::queue queue;

queue.submit([&] (cl::sycl::handler& cgh) {
    auto x_acc = d_x.get_access<cl::sycl::access::mode::read>(cgh);
    auto y_acc = d_y.get_access<cl::sycl::access::mode::read_write>(cgh);
    cgh.parallel_for<mySaxpy>(cl::sycl::range<1>(length), [=] (cl::sycl::id<1> idx) {
        y_acc[idx] = a * x_acc[idx] + y_acc[idx];
    });
});

Programming GPUs: Tools

GPU Tools: The helpful helpers helping helpless (and others)

NVIDIA:
cuda-gdb: GDB-like command line utility for debugging
cuda-memcheck: like Valgrind's memcheck, for checking errors in memory accesses
Nsight: IDE for GPU developing, based on Eclipse (Linux, OS X) or Visual Studio (Windows)
nvprof: command line profiler, including detailed performance counters
Visual Profiler: timeline profiling and annotated performance experiments
New: Nsight Systems (timeline), Nsight Compute (kernel analysis)
OpenCL/HIP:
CodeXL: debugging, profiling
ROCm GDB: AMD's GDB symbolic debugger
Radeon Compute Profiler: profiler for OpenCL and ROCm
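Typical invocations of the command line debugging tools (the application name is a placeholder):

$ cuda-memcheck ./app
$ cuda-gdb ./app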

nvprof: Command that line

Usage: nvprof ./app

$ nvprof ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
==37064== Profiling application: ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
==37064== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.19%  262.43ms       301  871.86us  863.88us  882.44us  void matrixMulCUDA(float*, float*, float*, int, int)
  0.58%  1.5428ms         2  771.39us  764.65us  778.12us  [CUDA memcpy HtoD]
  0.23%  599.40us         1  599.40us  599.40us  599.40us  [CUDA memcpy DtoH]

==37064== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 61.26%  258.38ms         1  258.38ms  258.38ms  258.38ms  cudaEventSynchronize
 35.68%  150.49ms         3  50.164ms  914.97us  148.65ms  cudaMalloc
  0.73%  3.0774ms         3  1.0258ms  1.0097ms  1.0565ms  cudaMemcpy
  0.62%  2.6287ms         4  657.17us  655.12us  660.56us  cuDeviceTotalMem
  0.56%  2.3408ms       301  7.7760us  7.3810us  53.103us  cudaLaunch
  0.48%  2.0111ms       364  5.5250us     235ns  201.63us  cuDeviceGetAttribute
  0.21%  872.52us         1  872.52us  872.52us  872.52us  cudaDeviceSynchronize


With metrics: nvprof --metrics flop_sp_efficiency ./app

$ nvprof --metrics flop_sp_efficiency ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
[Matrix Multiply Using CUDA] - Starting...
==37122== NVPROF is profiling process 37122, command: ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
GPU Device 0: "Tesla P100-SXM2-16GB" with compute capability 6.0

MatrixA(1024,1024), MatrixB(1024,1024)
Computing result using CUDA Kernel...
==37122== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==37122== Replaying kernel "void matrixMulCUDA(float*, float*, float*, int, int)" (0 of 2)... done
Performance= 26.61 GFlop/s, Time= 80.697 msec, Size= 2147483648 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
==37122== Profiling application: ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
==37122== Profiling result:
==37122== Metric result:
Invocations   Metric Name         Metric Description             Min      Max      Avg
Device "Tesla P100-SXM2-16GB (0)"
    Kernel: void matrixMulCUDA(float*, float*, float*, int, int)
        301   flop_sp_efficiency  FLOP Efficiency(Peak Single)  22.96%   23.40%   23.15%

Visual Profiler: Your new favorite tool

Conclusions

Summary of Acceleration Possibilities

[Figure: Ways to accelerate an application: Libraries (drop-in acceleration), Directives such as OpenACC (easy acceleration; tutorial this afternoon), Programming Languages (flexible acceleration)]

Omitted: There's so much more! What I did not talk about:
Atomic operations
Shared memory
Pinned memory
Managed memory
Debugging
Overlapping streams
Multi-GPU programming (intra-node; MPI)
Cooperative groups
Independent thread progress (per-thread program counter (PC) and call stack (S))
Half precision (FP16)

Summary & Conclusion

GPUs can improve your performance many-fold, for a fitting, parallelizable application
Libraries are easiest
Direct programming (plain CUDA, HIP) is most powerful
OpenACC/OpenMP is somewhere in between (and portable)
Many abstraction layers available (mostly using C++)
There are many tools helping the programmer
→ See it in action this afternoon at the OpenACC tutorial
Thank you for your attention!
[email protected]

APPENDIX

Appendix: Further Reading & Links, GPU Performances, Supplemental: Thrust, Glossary, References

Further Reading & Links: More!

A discussion of SIMD, SIMT, SMT by Y. Kreinin
NVIDIA's documentation: docs.nvidia.com
NVIDIA's Parallel For All blog
SYCL Hello World, SYCL Vector Addition

Member of the Helmholtz Association 29 August 2019 Slide 3 13 GV100 GPU Hardware Architecture In-Depth

Volta Performance
Table 1. Comparison of NVIDIA Tesla GPUs

Tesla Product             Tesla K40       Tesla M40        Tesla P100      Tesla V100
GPU                       GK180 (Kepler)  GM200 (Maxwell)  GP100 (Pascal)  GV100 (Volta)
SMs                       15              24               56              80
TPCs                      15              24               28              40
FP32 Cores / SM           192             128              64              64
FP32 Cores / GPU          2880            3072             3584            5120
FP64 Cores / SM           64              4                32              32
FP64 Cores / GPU          960             96               1792            2560
Tensor Cores / SM         NA              NA               NA              8
Tensor Cores / GPU        NA              NA               NA              640
GPU Boost Clock           810/875 MHz     1114 MHz         1480 MHz        1462 MHz
Peak FP32 TFLOPS¹         5               6.8              10.6            15
Peak FP64 TFLOPS¹         1.7             0.21             5.3             7.5
Peak Tensor TFLOPS¹       NA              NA               NA              120
Texture Units             240             192              224             320
Memory Interface          384-bit GDDR5   384-bit GDDR5    4096-bit HBM2   4096-bit HBM2
Memory Size               Up to 12 GB     Up to 24 GB      16 GB           16 GB
L2 Cache Size             1536 KB         3072 KB          4096 KB         6144 KB
Shared Memory Size / SM   16/32/48 KB     96 KB            64 KB           Configurable up to 96 KB
Register File Size / SM   256 KB          256 KB           256 KB          256 KB
Register File Size / GPU  3840 KB         6144 KB          14336 KB        20480 KB
TDP                       235 Watts       250 Watts        300 Watts       300 Watts
Transistors               7.1 billion     8 billion        15.3 billion    21.1 billion
GPU Die Size              551 mm²         601 mm²          610 mm²         815 mm²
Manufacturing Process     28 nm           28 nm            16 nm FinFET+   12 nm FFN

¹ Peak TFLOPS rates are based on GPU Boost Clock

Figure: Tesla V100 performance characteristics in comparison [9]


Thrust: Code example with lambdas

#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>

constexpr int gGpuThreshold = 10000;

void saxpy(float *x, float *y, float a, int N) {
    auto r = thrust::counting_iterator<int>(0);

    auto lambda = [=] __host__ __device__ (int i) {
        y[i] = a * x[i] + y[i];
    };

    if (N > gGpuThreshold)
        thrust::for_each(thrust::device, r, r + N, lambda);
    else
        thrust::for_each(thrust::host, r, r + N, lambda);
}

Appendix: Glossary & References

Glossary I

AMD: Manufacturer of CPUs and GPUs.
API: A programmatic interface through well-defined functions. Short for application programming interface.
ATI: Canada-based GPU manufacturing company; bought by AMD in 2006.

CUDA: Computing platform for GPUs from NVIDIA. Provides, among others, CUDA C/C++.

DSL: A Domain-Specific Language is a specialization of a more general language to a specific domain.

Glossary II

HIP: GPU programming model by AMD to target their own and NVIDIA GPUs with one combined language. Short for Heterogeneous-compute Interface for Portability.

MPI: The Message Passing Interface, an API definition for multi-node computing.

NVIDIA: US technology company creating GPUs.

OpenACC: Directive-based programming, primarily for many-core machines.
OpenCL: The Open Computing Language. Framework for writing code for heterogeneous architectures (CPU, GPU, DSP, FPGA). The alternative to CUDA.

Glossary III

OpenGL: The Open Graphics Library, an API for rendering graphics across different hardware architectures.
OpenMP: Directive-based programming, primarily for multi-threaded machines.

POWER: CPU architecture from IBM, earlier: PowerPC. See also POWER8.
POWER8: Version 8 of IBM's POWER processor, available also under the OpenPOWER Foundation.

ROCm: AMD software stack and platform to program AMD GPUs. Short for Radeon Open Compute (Radeon is the GPU product line of AMD).

SAXPY: Single-precision A × X + Y. A simple code example of scaling a vector and adding an offset.

Glossary IV

Thrust: A parallel algorithms library for (among others) GPUs. See https://thrust.github.io/.

Volta: GPU architecture from NVIDIA (announced 2017).

References I

[2] Kenneth E. Hoff III et al. "Fast Computation of Generalized Voronoi Diagrams Using Graphics Hardware". In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. SIGGRAPH '99. New York, NY, USA: ACM Press/Addison-Wesley Publishing Co., 1999, pp. 277–286. ISBN: 0-201-48560-5. DOI: 10.1145/311535.311567. URL: http://dx.doi.org/10.1145/311535.311567.
[3] Chris McClanahan. "History and Evolution of GPU Architecture". In: A Survey Paper (2010). URL: http://mcclanahoochie.com/blog/wp-content/uploads/2011/03/gpu-hist-paper.pdf.
[4] Jack Dongarra et al. TOP500. June 2019. URL: https://www.top500.org/lists/2019/06/.

References II

[5] Jack Dongarra et al. Green500. June 2019. URL: https://www.top500.org/green500/lists/2019/06/.
[6] Karl Rupp. Pictures: CPU/GPU Performance Comparison. URL: https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/.
[10] Wes Breazell. Picture: Wizard. URL: https://thenounproject.com/wes13/collection/its-a-wizards-world/.

References: Images, Graphics I

[1] Igor Ovsyannykov. Yarn. Freely available at Unsplash. URL: https://unsplash.com/photos/hvILKk7SlH4.
[7] Mark Lee. Picture: kawasaki ninja. URL: https://www.flickr.com/photos/pochacco20/39030210/.
[8] Shearings Holidays. Picture: Shearings coach 636. URL: https://www.flickr.com/photos/shearings/13583388025/.
[9] Nvidia Corporation. Pictures: Volta GPU. Volta Architecture Whitepaper. URL: https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf.
