OPENACC TUTORIAL GRIDKA SCHOOL 2018

30 August 2018 Andreas Herten Forschungszentrum Jülich

Member of the Helmholtz Association Outline
The GPU Platform: Introduction, Threading Model, App Showcase, Parallel Models
OpenACC: History, OpenMP, Modus Operandi, OpenACC's Models, Portability (Now: Download and install PGI Community Edition)
OpenACC by Example: Workflow, Identify Parallelism, Parallelize Loops (parallel, loops, kernels, pgprof), Data Transfers (GPU Memory Spaces, Clause: copy, Visual Profiler, Data Locality, Analyse Flow, data, enter data), Optimize Levels of Parallelism (Clause: gang, Memory Coalescing, Pinned)
Interoperability: The Keyword, Tasks (Task 1, Task 2, Task 3, Task 4)
Conclusions
List of Tasks
The GPU Platform

CPU vs. GPU A matter of specialties: transporting one vs. transporting many. Graphics: Lee [2] and Shearings Holidays [3]

Member of the Helmholtz Association 30 August 2018 Slide 3 114 CPU vs. GPU Chip

[Chip schematics: CPU with control logic, a few ALUs, a large cache, and DRAM vs. GPU with many ALUs and DRAM]

Processing Flow CPU → GPU → CPU

1 Transfer data from CPU memory to GPU memory, transfer program
2 Load GPU program, execute on SMs, get (cached) data from memory; write back
3 Transfer results back to host memory

Old: Manual data transfer invocations – UVA
New: Driver automatically transfers data – UM

[Diagram: CPU with CPU Memory, connected via Interconnect to the GPU with Scheduler, L2 cache, and DRAM]

Member of the Helmholtz Association 30 August 2018 Slide 4 114 CUDA Threading Model Warp the kernel, it’s a !

Methods to exploit parallelism:

Thread → Block
Block → Grid
Threads & blocks in 3D

Execution entity: threads. Lightweight → fast switching! 1000s of threads execute simultaneously → order non-deterministic! Parallel function: kernel

Member of the Helmholtz Association 30 August 2018 Slide 5 114 Getting GPU-Acquainted Preparations Task 0⋆: Setup

Login to JURON: Visit https://jupyter-jsc.fz-juelich.de/, sign in / log in with your train account, fill out and accept the usage agreement, start Jupyter Lab on a JURON login node
Use the Terminal of Jupyter; use Jupyter's file editing capabilities
Directory of tasks: $HOME/GPU/Tasks/Tasks/
Solutions are always given, you decide when to look ($HOME/GPU/Tasks/Solutions/)
Load required modules: module load pgi [cuda]

Fallback: QR Code


Member of the Helmholtz Association 30 August 2018 Slide 6 114 Getting GPU-Acquainted TASK 0 Some Applications

GEMM N-Body

Task 0: Getting Started

Change to GPU/Tasks/Task0/ directory Read Instructions.rst

Mandelbrot Dot Product

Member of the Helmholtz Association 30 August 2018 Slide 7 114 Getting GPU-Acquainted TASK 0 Some Applications

[Benchmark plots: DGEMM (GFLOP/s vs. size of square matrix), N-Body (GFLOP/s vs. number of particles, SP/DP on 1, 2, 4 GPUs), Mandelbrot (MPixel/s vs. width of image), DDot (MFLOP/s vs. length of vector); CPU vs. GPU in each case]

Member of the Helmholtz Association 30 August 2018 Slide 7 114 Primer on Parallel Scaling Amdahl’s Law

Possible maximum speedup for N parallel processors:
Total time t = t_serial + t_parallel
Time with N processors: t(N) = t_s + t_p/N
Speedup: s(N) = t/t(N) = (t_s + t_p)/(t_s + t_p/N)
For N → ∞ the speedup is bounded by (t_s + t_p)/t_s; e.g. a 95 % parallel portion gives at most a 20× speedup.
[Plot: speedup vs. number of processors (1 to 4096) for parallel portions of 50 %, 75 %, 90 %, 95 %, 99 %]

Member of the Helmholtz Association 30 August 2018 Slide 8 114 Primer on Parallel Scaling II Gustafson-Barsis’s Law

"[…] speedup should be measured by scaling the problem to the number of processors, not fixing problem size." – John Gustafson
[Plot: speedup vs. number of processors (256 to 4096) for serial portions of 1 %, 10 %, 50 %, 75 %, 90 %, 99 %]

Member of the Helmholtz Association 30 August 2018 Slide 9 114 ! Parallelism

Parallel programming is not easy!

Things to consider: Is my application computationally intensive enough? What are the levels of parallelism? How much data needs to be transferred? Is the gain worth the pain?

Member of the Helmholtz Association 30 August 2018 Slide 10 114 Possibilities

Different levels of closeness to the GPU when GPU programming, which can ease the pain… OpenACC, OpenMP, Thrust, PyCUDA, CUDA, OpenCL

Member of the Helmholtz Association 30 August 2018 Slide 11 114 Primer on GPU Computing

Application

Libraries → Drop-in Acceleration | OpenACC Directives → Easy Acceleration | Programming Languages → Flexible Acceleration

Member of the Helmholtz Association 30 August 2018 Slide 12 114 OpenACC History

2011 OpenACC 1.0 specification is released (by Cray, NVIDIA, PGI, CAPS)
2013 OpenACC 2.0: More functionality, portability
2015 OpenACC 2.5: Enhancements, clarifications
2017 OpenACC 2.6: Deep copy, …

→ https://www.openacc.org/ (see also: Best practice guide ƥ)

Support Compilers: PGI, GCC, Cray, Sunway Languages: C/C++, Fortran

Member of the Helmholtz Association 30 August 2018 Slide 13 114 Open{MP↔ACC} Everything’s connected

OpenACC modeled after OpenMP … … but specific for accelerators Might eventually be absorbed into OpenMP But OpenMP >4.0 also has offloading feature OpenACC more descriptive, OpenMP more prescriptive Basic principle same: Fork/join model Master thread launches parallel child threads; merge after execution

[Fork/join diagrams: a master thread forks a parallel region and joins back to the master thread, in OpenMP and in OpenACC]
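To see the connection in code, here is a minimal sketch (not from the slides; it assumes arrays x, y of length N and a scalar a) of the same loop annotated with each model:

    // OpenMP: master thread forks threads on the (multi-core) host
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    // OpenACC: the same loop, offloaded to an accelerator
    #pragma acc parallel loop
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];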

Member of the Helmholtz Association 30 August 2018 Slide 14 114 Modus Operandi Three-step program

1 Annotate code with directives, indicating parallelism 2 OpenACC-capable compiler generates accelerator-specific code 3 $uccess

Member of the Helmholtz Association 30 August 2018 Slide 15 114 1 Directives pragmatic

Compiler directives state intent to the compiler

C/C++:
    #pragma acc kernels
    for (int i = 0; i < 23; i++)
        // ...

Fortran:
    !$acc kernels
    do i = 1, 24
        ! ...
    end do
    !$acc end kernels

Ignored by compiler which does not understand OpenACC High level programming model for many-core machines, especially accelerators OpenACC: Compiler directives, library routines, environment variables Portable across host systems and accelerator architectures

Member of the Helmholtz Association 30 August 2018 Slide 16 114 2 Compiler Simple and abstracted

Compiler support PGI Best performance, great support, free GCC Beta, limited coverage, OSS Cray ??? Trust compiler to generate intended parallelism; always check status output! No need to know ins’n’outs of accelerator; leave it to expert compiler engineers⋆ One code can target different accelerators: GPUs, or even multi-core CPUs → Portability

⋆: Eventually you want to tune for device; but that’s possible

Member of the Helmholtz Association 30 August 2018 Slide 17 114 Expose 3 $uccess Parallelism Iteration is key

Iteration cycle: Expose Parallelism → Compile → Measure → repeat
Serial to parallel: fast. Serial to fast parallel: more time needed. Start simple → refine ⇒ Productivity
Because of generality: sometimes the last bit of hardware performance is not accessible
But: Use OpenACC together with other accelerator-targeting techniques (CUDA, libraries, …)

Member of the Helmholtz Association 30 August 2018 Slide 18 114 OpenACC Accelerator Model For computation and memory spaces

Main program executes on host; device code is transferred to the accelerator and execution on the accelerator is started
Host waits until return (except: async)
Two separate memory spaces; data transfers back and forth are hidden from the programmer
Memories not coherent! Compiler helps; GPU runtime helps
[Diagram: host (start main program, wait for code, finish) and device (transfer code, run code, return to host); Host Memory ↔ DMA Transfers ↔ Device Memory]

Member of the Helmholtz Association 30 August 2018 Slide 19 114 OpenACC Programming Model A binary perspective

OpenACC interpretation needs to be activated as compile flag PGI pgcc -acc [-ta=tesla|-ta=multicore] GCC gcc -fopenacc → Ignored by incapable compiler! Additional flags possible to improve/modify compilation -ta=tesla:cc60 Use compute capability 6.0 -ta=tesla:lineinfo Add source code correlation into binary -ta=tesla:managed Use unified memory -fopenacc-dim=geom Use geom configuration for threads

Member of the Helmholtz Association 30 August 2018 Slide 20 114 A Glimpse of OpenACC

#pragma acc data copy(x[0:N], y[0:N])
{
    #pragma acc parallel loop
    for (int i = 0; i < N; i++) {
        // ...
    }
}

Member of the Helmholtz Association 30 August 2018 Slide 21 114 OpenACC by Example

Member of the Helmholtz Association 30 August 2018 Slide 22 114 Parallelization Workflow Identify available parallelism

Parallelize loops with OpenACC

Optimize data locality

Optimize loop performance

Member of the Helmholtz Association 30 August 2018 Slide 23 114 Jacobi Solver Algorithmic description

Example for acceleration: Jacobi solver. Iterative solver, converges to correct value. Each iteration step: compute average of neighboring points. Example: 2D Poisson equation: ∇²A(x, y) = B(x, y)
[Stencil figure: data point A_i,j with neighbors A_i−1,j, A_i+1,j, A_i,j+1, A_i,j−1; boundary points]

A_{k+1}(i, j) = −1/4 · (B(i, j) − (A_k(i−1, j) + A_k(i, j+1) + A_k(i+1, j) + A_k(i, j−1)))

Jacobi Solver Source code

while ( error > tol && iter < iter_max ) {        // Iterate until converged
    error = 0.0;
    for (int ix = ix_start; ix < ix_end; ix++){    // Iterate across matrix elements
        for (int iy = iy_start; iy < iy_end; iy++){
            // Calculate new value from neighbors
            Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] -
                ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
            // Accumulate error
            error = fmaxr(error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
    }}
    // Swap input/output
    for (int iy = iy_start; iy < iy_end; iy++){
        for( int ix = ix_start; ix < ix_end; ix++ ){
            A[iy*nx+ix] = Anew[iy*nx+ix];
    }}
    // Set boundary conditions
    for (int ix = ix_start; ix < ix_end; ix++){
        A[0*nx+ix] = A[(ny-2)*nx+ix];
        A[(ny-1)*nx+ix] = A[1*nx+ix];
    }
    // same for iy
    iter++;
}

Parallelization Workflow Identify available parallelism

Parallelize loops with OpenACC

Optimize data locality

Optimize loop performance

Member of the Helmholtz Association 30 August 2018 Slide 26 114 Profiling Profile

[…] premature optimization is the root of all evil. Yet we should not pass up our [optimization] opportunities […] – Donald Knuth [6]

Investigate hot spots of your program! → Profile! Many tools, many levels: perf, PAPI, Score-P, Advisor, NVIDIA Visual Profiler,… Here: Examples from PGI

Member of the Helmholtz Association 30 August 2018 Slide 27 114 Identify Parallelism TASK 1 Generate Profile

Use pgprof to analyze unaccelerated version of Jacobi solver Investigate! Task 1: Analyze Application

Change to Task1/ directory Compile: make task1 Usually, compile just with make (but this exercise is special) Submit profiling run to the batch system: make task1_profile Study bsub call and pgprof call; try to understand ??? Where is hotspot? Which parts should be accelerated?

Member of the Helmholtz Association 30 August 2018 Slide 28 114 Profile of Application Info during compilation

$ pgcc -DUSE_DOUBLE -Minfo=all,intensity -fast -Minfo=ccff -Mprof=ccff poisson2d_reference.o poisson2d.c -o poisson2d poisson2d.c: main: 68, Generated vector code for the loop FMA (fused multiply-add) instruction(s) generated 98, FMA (fused multiply-add) instruction(s) generated 105, Loop not vectorized: data dependency 123, Loop not fused: different loop trip count Loop not vectorized: data dependency Loop unrolled 8 times

Automated optimization of compiler, due to -fast Vectorization, FMA, unrolling

Member of the Helmholtz Association 30 August 2018 Slide 29 114 Profile of Application Info during run

$ pgprof --cpu-profiling on [...] ./poisson2d ======CPU profiling result (flat): Time(%) Time Name 77.52% 999.99ms main (poisson2d.c:148 0x6d8) 9.30% 120ms main (0x704) 7.75% 99.999ms main (0x718) 0.78% 9.9999ms main (poisson2d.c:128 0x348) 0.78% 9.9999ms main (poisson2d.c:123 0x398) 0.78% 9.9999ms __xlmass_expd2 (0xffcc011c) 0.78% 9.9999ms __c_mcopy8 (0xffcc0054) 0.78% 9.9999ms __xlmass_expd2 (0xffcc0034) ======Data collected at 100Hz frequency

78 % in main(). Since everything is in main – limited helpfulness. Let's look into main!

Code Independency Analysis Independence is key

while ( error > tol && iter < iter_max ) {         // Data dependency between iterations
    error = 0.0;
    for (int ix = ix_start; ix < ix_end; ix++){     // Independent loop iterations
        for (int iy = iy_start; iy < iy_end; iy++){
            Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] -
                ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
            error = fmaxr(error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
    }}
    for (int iy = iy_start; iy < iy_end; iy++){     // Independent loop iterations
        for( int ix = ix_start; ix < ix_end; ix++ ){
            A[iy*nx+ix] = Anew[iy*nx+ix];
    }}
    for (int ix = ix_start; ix < ix_end; ix++){     // Independent loop iterations
        A[0*nx+ix] = A[(ny-2)*nx+ix];
        A[(ny-1)*nx+ix] = A[1*nx+ix];
    }
    // same for iy
    iter++;
}

Member of the Helmholtz Association 30 August 2018 Slide 31 114 Parallelization Workflow Identify available parallelism

Parallelize loops with OpenACC

Optimize data locality

Optimize loop performance

Member of the Helmholtz Association 30 August 2018 Slide 32 114 Parallel Loops: Parallel Maybe the second most important directive

Programmer identifies block containing parallelism → compiler generates parallel code (kernel) Program launch creates gangs of parallel threads on parallel device Implicit barrier at end of parallel region Each gang executes same code sequentially

Ģ OpenACC: parallel

#pragma acc parallel [clause, [, clause] ...] newline {structured block}

Member of the Helmholtz Association 30 August 2018 Slide 33 114 Parallel Loops: Parallel Clauses

Diverse clauses to augment the parallel region:
private(var) A copy of variable var is made for each gang
firstprivate(var) Same as private, except var will be initialized with the value from the host
if(cond) Parallel region will execute on the accelerator only if cond is true
reduction(op:var) Reduction is performed on variable var with operation op; supported: + * max min …
async[(int)] No implicit barrier at end of parallel region
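A small sketch of how some of these clauses can be combined on one region (illustrative only; N, x, and scale are assumed to exist in the surrounding code):

    double scale = 2.0;
    double sum = 0.0;
    #pragma acc parallel loop firstprivate(scale) reduction(+:sum) if(N > 1000)
    for (int i = 0; i < N; i++) {
        x[i] = scale * x[i];   // scale: private copy per gang, initialized from the host
        sum += x[i];           // partial sums are combined by the reduction
    }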

Member of the Helmholtz Association 30 August 2018 Slide 34 114 Parallel Loops: Loops Maybe the third most important directive

Programmer identifies loop eligible for parallelization Directive must be directly before loop Optional: Describe type of parallelism

Ģ OpenACC: loop

#pragma acc loop [clause, [, clause] ...] newline {structured block}

Member of the Helmholtz Association 30 August 2018 Slide 35 114 Parallel Loops: Loops Clauses

independent Iterations of loop are data-independent (implied if in parallel region and no seq or auto is given)
collapse(int) Collapse int tightly-nested loops
seq This loop is to be executed sequentially (not parallel)
tile(int[,int]) Split loops into loops over tiles of the full size
auto Compiler decides what to do
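For instance, collapse can merge the two tightly nested Jacobi loops into one larger iteration space; a sketch based on the example code shown earlier:

    #pragma acc parallel loop collapse(2) reduction(max:error)
    for (int iy = iy_start; iy < iy_end; iy++) {
        for (int ix = ix_start; ix < ix_end; ix++) {
            Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] -
                ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
            error = fmaxr(error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
        }
    }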

Member of the Helmholtz Association 30 August 2018 Slide 36 114 Parallel Loops: Parallel Loops Maybe the most important directive

Combined directive: a shortcut, because it's used so often. Any clause that is allowed on parallel or loop is allowed. Restriction: May not appear in the body of another parallel region

Ģ OpenACC: parallel loop

#pragma acc parallel loop [clause, [, clause] ...]

Member of the Helmholtz Association 30 August 2018 Slide 37 114 Parallel Loops Example

double sum = 0.0;
#pragma acc parallel loop
for (int i = 0; i < N; i++) {
    // ...
}

Member of the Helmholtz Association 30 August 2018 Slide 38 114 Parallel Jacobi TASK 2 Add parallelism

Add OpenACC parallelism to main double loop in Jacobi solver source code Profile code → Congratulations, you are a GPU developer! Task 2: A First Parallel Loop

Change to Task2/ directory Compile: make Submit parallel run to the batch system: make run Adapt the bsub call and run with other number of iterations, matrix sizes Profile: make profile pgprof or nvprof is prefix to call to poisson2d

Member of the Helmholtz Association 30 August 2018 Slide 39 114 Parallel Jacobi Source Code

110 #pragma acc parallel loop reduction(max:error)
111 for (int ix = ix_start; ix < ix_end; ix++)
112 {
113     for (int iy = iy_start; iy < iy_end; iy++)
114     {
115         Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
116                          + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix] ));
117         error = fmaxr( error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
118     }
119 }

Member of the Helmholtz Association 30 August 2018 Slide 40 114 Parallel Jacobi Compilation result

$ make pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60,managed poisson2d.c poisson2d_reference.o -o poisson2d poisson2d.c: main: 109, Accelerator kernel generated Generating Tesla code 109, Generating reduction(max:error) 110, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */ 112, #pragma acc loop seq 109, Generating implicit copyin(A[:],rhs[:]) Generating implicit copyout(Anew[:]) 112, Complex loop carried dependence of Anew-> prevents parallelization Loop carried dependence of Anew-> prevents parallelization Loop carried backward dependence of Anew-> prevents vectorization

Member of the Helmholtz Association 30 August 2018 Slide 41 114 Parallel Jacobi Run result

$ make run PGI_ACC_POOL_ALLOC=0 bsub -gpu 'num=1:j_exclusive=yes' -Is -U gridka ./poisson2d Job <38143> is submitted to default queue . <> <> Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh Calculate reference solution and time with serial CPU execution. 0, 0.249999 100, 0.249760 200, 0... Calculate current execution. 0, 0.249999 100, 0.249760 200, 0... 2048x2048: Ref: 54.8888 s, This: 7.6253 s, speedup: 7.20

Member of the Helmholtz Association 30 August 2018 Slide 42 114 pgprof / nvprof NVIDIA’s command line profiler

Profiles applications, mainly for NVIDIA GPUs, but also CPU code GPU: CUDA kernels, API calls, OpenACC pgprof vs nvprof: Twins with other configurations Generate concise performance reports, full timelines; measure events and metrics (hardware counters) ⇒ Powerful tool for GPU application analysis → http://docs.nvidia.com/cuda/profiler-users-guide/

Member of the Helmholtz Association 30 August 2018 Slide 43 114 Profile of Jacobi With pgprof

$ make profile ==116606== PGPROF is profiling 116606, command: ./poisson2d 10 ==116606== Profiling application: ./poisson2d 10 Jacobi relaxation calculation: max 10 iterations on 2048 x 2048 mesh Calculate reference solution and time with serial CPU execution. 2048x2048: Ref: 0.8378 s, This: 0.2716 s, speedup: 3.08 ==116606== Profiling result: Time(%) Time Calls Avg Min Max Name 99.96% 129.82ms 10 12.982ms 11.204ms 20.086ms main_109_gpu 0.02% 30.560us 10 3.0560us 2.6240us 3.8720us main_109_gpu_red 0.01% 10.304us 10 1.0300us 960ns 1.2480us [CUDA memcpy HtoD] 0.00% 6.3680us 10 636ns 608ns 672ns [CUDA memcpy DtoH]

==116606== Unified Memory profiling result: Device "Tesla P100-SXM2-16GB (0)" Count Avg Size Min Size Max Size Total Size Total Time Name 3360 204.80KB 64.000KB 960.00KB 672.0000MB 25.37254ms Host To Device 3200 204.80KB 64.000KB 960.00KB 640.0000MB 30.94435ms Device To Host 2454 - - - - 66.99111ms GPU Page fault groups Total CPU Page faults: 2304

Only one function is parallelized! Let's do the rest!

Member of the Helmholtz Association 30 August 2018 Slide 44 114 More Parallelism: Kernels More freedom for compiler

Kernels directive: second way to expose parallelism Region may contain parallelism Compiler determines parallelization opportunities → More freedom for compiler Rest: Same as for parallel

Ģ OpenACC: kernels

#pragma acc kernels [clause, [, clause] ...]

Member of the Helmholtz Association 30 August 2018 Slide 45 114 Kernels Example

double sum = 0.0;
#pragma acc kernels
{
    for (int i = 0; i < N; i++) {
        // ...
    }
}

Member of the Helmholtz Association 30 August 2018 Slide 46 114 kernels vs. parallel

Both approaches equally valid; can perform equally well
kernels: Compiler performs parallel analysis. Can cover a large area of code with a single directive. Gives the compiler additional leeway.
parallel: Requires parallel analysis by the programmer. Will also parallelize what the compiler may miss. More explicit. Similar to OpenMP.
Both: Regions may not contain other kernels/parallel regions. No branching into or out. Program must not depend on order of evaluation of clauses. At most one if clause.

Member of the Helmholtz Association 30 August 2018 Slide 47 114 Parallel Jacobi II TASK 3 Add more parallelism

Add OpenACC parallelism to the other loops of the while loop (L:123 – L:141). Use either kernels or parallel. Do they perform equally well? Task 3: More Parallel Loops

Change to Task3/ directory Compile: make Study the compiler output! Submit parallel run to the batch system: make run ? What’s your speed-up?

Parallel Jacobi Source Code

while ( error > tol && iter < iter_max ) {
    error = 0.0;
    #pragma acc parallel loop reduction(max:error)
    for (int ix = ix_start; ix < ix_end; ix++){
        for (int iy = iy_start; iy < iy_end; iy++){
            Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] -
                ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
            error = fmaxr(error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
    }}
    #pragma acc parallel loop
    for (int iy = iy_start; iy < iy_end; iy++){
        for( int ix = ix_start; ix < ix_end; ix++ ){
            A[iy*nx+ix] = Anew[iy*nx+ix];
    }}
    #pragma acc parallel loop
    for (int ix = ix_start; ix < ix_end; ix++){
        A[0*nx+ix] = A[(ny-2)*nx+ix];
        A[(ny-1)*nx+ix] = A[1*nx+ix];
    }
    // same for iy
    iter++;
}

Member of the Helmholtz Association 30 August 2018 Slide 49 114 Parallel Jacobi II Compilation result

$ make pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60,managed poisson2d_reference.c -o poisson2d_reference.o poisson2d.c: main: 109, Accelerator kernel generated Generating Tesla code 109, Generating reduction(max:error) 110, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */ 112, #pragma acc loop seq 109, ... 121, Accelerator kernel generated Generating Tesla code 124, #pragma acc loop gang /* blockIdx.x */ 126, #pragma acc loop vector(128) /* threadIdx.x */ 121, Generating implicit copyin(Anew[:]) Generating implicit copyout(A[:]) 126, Loop is parallelizable 133, Accelerator kernel genera...

Member of the Helmholtz Association 30 August 2018 Slide 50 114 Parallel Jacobi II Run result

$ make run PGI_ACC_POOL_ALLOC=0 bsub -gpu 'num=1:j_exclusive=yes' -Is -U gridka ./poisson2d Job <38144> is submitted to default queue . <> <> Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh Calculate reference solution and time with serial CPU execution. 0, 0.249999 100, 0.249760 200, 0... Calculate current execution. 0, 0.249999 100, 0.249760 200, 0... 2048x2048: Ref: 66.9851 s, This: 0.3772 s, speedup: 177.60

Done?!

Member of the Helmholtz Association 30 August 2018 Slide 51 114 OpenACC by Example Data Transfers

Member of the Helmholtz Association 30 August 2018 Slide 52 114 Automatic Data Transfers

Up to now we did not care about data transfers; the compiler and runtime take care of them. Magic keyword: -ta=tesla:managed. Only a feature of (recent) NVIDIA GPUs!

Member of the Helmholtz Association 30 August 2018 Slide 53 114 CUDA 4.0 Unified Virtual Addressing: pointer from same address pool, but data copy manual CUDA 6.0 Unified Memory*: Data copy by driver, but whole data at once CUDA 8.0 Unified Memory (truly): Data copy by driver, page faults on-demand initiate data migrations (Pascal) Future Address Translation Service: Omit page faults

CPU and GPU Memory Scheduler Location, location, location

At the Beginning CPU and GPU memory very distinct, own addresses

...

Interconnect

CPU L2

CPU Memory DRAM

Member of the Helmholtz Association 30 August 2018 Slide 54 114 CPU and GPU Memory Scheduler Location, location, location

At the beginning: CPU and GPU memory very distinct, own addresses
CUDA 4.0 Unified Virtual Addressing: pointers from the same address pool, but data copy manual
CUDA 6.0 Unified Memory*: data copy by driver, but whole data at once
CUDA 8.0 Unified Memory (truly): data copy by driver, on-demand page faults initiate data migrations (Pascal)
Future Address Translation Service: omit page faults

Member of the Helmholtz Association 30 August 2018 Slide 54 114 Portability

Managed memory: only an NVIDIA GPU feature. One of OpenACC's great features: portability → Code should also be fast without -ta=tesla:managed! Let's remove it from the compile flags!

$ make pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60 poisson2d_reference.c -o poisson2d_reference.o poisson2d.c: PGC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Could not find allocated-variable index for symbol (poisson2d.c: 110) ... PGC/power Linux 17.4-0: compilation completed with severe errors

Member of the Helmholtz Association 30 August 2018 Slide 55 114 Copy Statements

Compiler implicitly created copy clauses to copy data to device

134, Generating implicit copyin(A[:]) Generating implicit copyout(A[nx*(ny-1)+1:nx-2])

It couldn’t determine length of copied data … …but before: no problem – Unified Memory! Now: Problem! We need to give that information! (see also later)

Ģ OpenACC: copy

#pragma acc parallel copy(A[start:length]) Also: copyin(B[start:length]) copyout(C[start:length]) present(D[start:length]) create(E[start:length])

Member of the Helmholtz Association 30 August 2018 Slide 56 114 Data Copies TASK 4 Get that data!

Add copy clause to parallel regions Check correctness with Visual Profiler Task 4: Data Copies

Change to Task4/ directory Work on TODOs Compile: make Submit parallel run to the batch system: make run Generate profile with make profile_tofile ? What’s your speed-up?

Member of the Helmholtz Association 30 August 2018 Slide 57 114 Data Copies Compiler Output

$ make pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60 poisson2d.c poisson2d_reference.o -o poisson2d poisson2d.c: main: 109, Generating copy(A[:ny*nx],Anew[:ny*nx],rhs[:ny*nx]) ... 121, Generating copy(Anew[:ny*nx],A[:ny*nx]) ... 131, Generating copy(A[:ny*nx]) Accelerator kernel generated Generating Tesla code 132, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */ 137, Generating copy(A[:ny*nx]) Accelerator kernel generated Generating Tesla code 138, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

Member of the Helmholtz Association 30 August 2018 Slide 58 114 Data Copies Run Result

$ make run
bsub -gpu 'num=1:j_exclusive=yes' -Is -U gridka ./poisson2d
<> Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
Calculate reference solution and time with serial CPU execution.
0, 0.249999
100, 0.249760
200, 0...
Calculate current execution.
0, 0.249999
100, 0.249760
200, 0...
2048x2048: Ref: 102.1903 s, This: 24.9096 s, speedup: 4.10

Slower?! Why?

Member of the Helmholtz Association 30 August 2018 Slide 59 114 PGI/NVIDIA Visual Profiler

GUI tool accompanying pgprof / nvprof PGI Start pgprof without parameters NVIDIA Start nvvp Timeline view of all things GPU → Study stages and interplay of application Interactive or with input from command line profilers View launch and run configurations Guided and unguided analysis → https://developer.nvidia.com/nvidia-visual-profiler

Member of the Helmholtz Association 30 August 2018 Slide 60 114 PGI/NVIDIA Visual Profiler Overview

Member of the Helmholtz Association 30 August 2018 Slide 61 114 PGI/NVIDIA Visual Profiler Zoom in to kernel calls

Member of the Helmholtz Association 30 August 2018 Slide 61 114 Parallelization Workflow Identify available parallelism

Parallelize loops with OpenACC

Optimize data locality

Optimize loop performance

Analyze Jacobi Data Flow In code

while (error > tol && iter < iter_max) {
    error = 0.0;
    // A, Anew resident on host
    #pragma acc parallel loop       // copy: A, Anew now resident on device
    for (int ix = ix_start; ix < ix_end; ix++){
        for (int iy = iy_start; iy < iy_end; iy++){
            // ...
    }}
    // A, Anew resident on device, copied back: A, Anew resident on host
    iter++;
}

Copies are done in each iteration!


Member of the Helmholtz Association 30 August 2018 Slide 63 114 Analyze Jacobi Data Flow Summary

By now, the whole algorithm is using the GPU. At the beginning of the while loop, data is copied to the device; at the end of the loop, it is copied back to the host. Depending on the type of parallel regions in the while loop, data is copied between regions as well. Slow! Data copies are expensive!

Member of the Helmholtz Association 30 August 2018 Slide 64 114 Data Regions To manually specify data locations

Defines region of code in which data remains on device Data is shared among all kernels in region Explicit data transfers

Ģ OpenACC: data

#pragma acc data [clause, [, clause] ...]

Member of the Helmholtz Association 30 August 2018 Slide 65 114 Data Regions Clauses

Clauses to augment the data regions:
copy(var) Allocates memory of var on GPU, copies data to GPU at beginning of region, copies data to host at end of region; size of var is specified as var[lowerBound:size]
copyin(var) Allocates memory of var on GPU, copies data to GPU at beginning of region
copyout(var) Allocates memory of var on GPU, copies data to host at end of region
create(var) Allocates memory of var on GPU
present(var) Data of var is not copied automatically to GPU but considered present

Member of the Helmholtz Association 30 August 2018 Slide 66 114 Data Region Example

#pragma acc data copyout(y[0:N]) create(x[0:N])
{
    double sum = 0.0;
    #pragma acc parallel loop
    for (int i = 0; i < N; i++) {
        // ...
    }
}

Member of the Helmholtz Association 30 August 2018 Slide 67 114 Data Regions II Looser regions: enter data directive

Define data regions, but not for structured block Closest to cudaMemcpy() Still, explicit data transfers

Ģ OpenACC: enter data

#pragma acc enter data [clause, [, clause] ...] #pragma acc exit data [clause, [, clause] ...]
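A possible unstructured variant of the Jacobi data management (a sketch assuming A, Anew, and rhs as in the example code; not the task solution):

    #pragma acc enter data copyin(A[0:nx*ny], rhs[0:nx*ny]) create(Anew[0:nx*ny])
    while (error > tol && iter < iter_max) {
        // ... parallel loops work on the device copies ...
        iter++;
    }
    #pragma acc exit data copyout(A[0:nx*ny]) delete(Anew[0:nx*ny], rhs[0:nx*ny])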

Member of the Helmholtz Association 30 August 2018 Slide 68 114 Data Region TASK 5 More parallelism, Data locality

Add data regions such that all data resides on device during iterations Optional: See your success in Visual Profiler Task 5: Data Region

Change to Task5/ directory Work on TODOs Compile: make Submit parallel run to the batch system: make run ? What’s your speed-up? Generate profile with make profile_tofile

Member of the Helmholtz Association 30 August 2018 Slide 69 114 Parallel Jacobi II Source Code

105 #pragma acc data copy(A[0:nx*ny]) copyin(rhs[0:nx*ny]) create(Anew[0:nx*ny])
106 while ( error > tol && iter < iter_max )
107 {
108     error = 0.0;
109
110     // Jacobi kernel
111     #pragma acc parallel loop reduction(max:error)
112     for (int ix = ix_start; ix < ix_end; ix++)
113     {
114         for (int iy = iy_start; iy < iy_end; iy++)
115         {
116             Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
117                              + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix] ));
118             error = fmaxr( error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
119         }
120     }
121
122     // A <-> Anew
123     #pragma acc parallel loop
124     for (int iy = iy_start; iy < iy_end; iy++)
125     // …
126 }

Member of the Helmholtz Association 30 August 2018 Slide 70 114 Data Region Compiler Output

$ make pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60 poisson2d.c poisson2d_reference.o -o poisson2d poisson2d.c: main: 104, Generating copyin(rhs[:ny*nx]) Generating create(Anew[:ny*nx]) Generating copy(A[:ny*nx]) 110, Accelerator kernel generated Generating Tesla code 110, Generating reduction(max:error) 111, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */ 113, #pragma acc loop seq ...

Member of the Helmholtz Association 30 August 2018 Slide 71 114 Data Region Run Result

$ make run
<> Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
Calculate reference solution and time with serial CPU execution.
0, 0.249999
100, 0.249760
200, 0...
Calculate current execution.
0, 0.249999
100, 0.249760
200, 0...
2048x2048: Ref: 118.5645 s, This: 0.3580 s, speedup: 331.21

Wow! But can we be even better?

Member of the Helmholtz Association 30 August 2018 Slide 72 114 OpenACC by Example Optimize Loop Performance

Member of the Helmholtz Association 30 August 2018 Slide 73 114 Parallelization Workflow Identify available parallelism

Parallelize loops with OpenACC

Optimize data locality

Optimize loop performance

Member of the Helmholtz Association 30 August 2018 Slide 74 114 Understanding Compiler Output

110, Accelerator kernel generated Generating Tesla code 110, Generating reduction(max:error) 111, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */ 114, #pragma acc loop seq 114, Complex loop carried dependence of Anew-> prevents parallelization

110 #pragma acc parallel loop reduction(max:error)
111 for (int ix = ix_start; ix < ix_end; ix++)
112 {
113     // Inner loop
114     for (int iy = iy_start; iy < iy_end; iy++)
115     {
116         Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                             + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix] ));
117         error = fmaxr( error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
118     }
119 }

Understanding Compiler Output

110, Accelerator kernel generated Generating Tesla code 110, Generating reduction(max:error) 111, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */ 114, #pragma acc loop seq 114, Complex loop carried dependence of Anew-> prevents parallelization

Outer loop: Parallelism with gang and vector Inner loop: Sequentially per thread (#pragma acc loop seq) Inner loop was never parallelized! Rule of thumb: Expose as much parallelism as possible

Member of the Helmholtz Association 30 August 2018 Slide 75 114 OpenACC Parallelism 3 Levels of Parallelism

Vector: threads work in lockstep (SIMD/SIMT parallelism)
Worker: has one or more vectors; workers share common resources (e.g. cache)
Gang: has one or more workers; multiple gangs work independently from each other

Member of the Helmholtz Association 30 August 2018 Slide 76 114 CUDA Parallelism CUDA Execution Model Software Hardware

Thread: executed by a scalar processor (CUDA core)
Thread block: executed on a multiprocessor (SM); does not migrate; several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (register file, shared memory)
Grid: a kernel is launched as a grid of thread blocks; blocks and grids can have multiple dimensions

Member of the Helmholtz Association 30 August 2018 Slide 77 114 From OpenACC to CUDA map(||acc,||<<<>>>)

In general: Compiler free to do what it thinks is best Usually gang Mapped to blocks (coarse grain) worker Mapped to threads (fine grain) vector Mapped to threads (fine SIMD/SIMT) seq No parallelism; sequential Exact mapping compiler dependent Performance tips Use vector size divisible by 32 Block size: num_workers × vector_length

Member of the Helmholtz Association 30 August 2018 Slide 78 114 Declaration of Parallelism Specify configuration of threads

Three clauses of parallel region (parallel, kernels) for changing distribution/configuration of group of threads Presence of keyword: Distribute using this level Optional size: Control size of parallel entity

Ģ OpenACC: gang worker vector

#pragma acc parallel loop gang vector Also: worker Size: num_gangs(n), num_workers(n), vector_length(n)
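A sketch with explicit sizes (the numbers are illustrative, not tuned values):

    #pragma acc parallel loop gang num_gangs(256) vector_length(128)
    for (int iy = iy_start; iy < iy_end; iy++) {
        #pragma acc loop vector
        for (int ix = ix_start; ix < ix_end; ix++) {
            A[iy*nx+ix] = Anew[iy*nx+ix];
        }
    }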

Member of the Helmholtz Association 30 August 2018 Slide 79 114 Understanding Compiler Output II

110, Accelerator kernel generated Generating Tesla code 110, Generating reduction(max:error) 111, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */ 114, #pragma acc loop seq 114, Complex loop carried dependence of Anew-> prevents parallelization

Compiler reports configuration of parallel entities Gang mapped to blockIdx.x Vector mapped to threadIdx.x Worker not used Here: 128 threads per block; as many blocks as needed 128 seems to be default for Tesla/NVIDIA

Member of the Helmholtz Association 30 August 2018 Slide 80 114 More Parallelism TASK 6 Unsequentialize inner loop

Add vector clause to inner loop Study result with profiler Task 6: More Parallelism Change to Task6/ directory Work on TODOs Compile: make Submit to the batch system: make run Generate profile with make profile_tofile ? What’s your speed-up?

Member of the Helmholtz Association 30 August 2018 Slide 81 114 More Parallelism Compiler Output

$ make pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70 poisson2d.c poisson2d_reference.o -o poisson2d poisson2d.c: main: 104, Generating create(Anew[:ny*nx]) Generating copyin(rhs[:ny*nx]) Generating copy(A[:ny*nx]) 110, Accelerator kernel generated Generating Tesla code 110, Generating reduction(max:error) 111, #pragma acc loop gang /* blockIdx.x */ 114, #pragma acc loop vector(128) /* threadIdx.x */ ...

Member of the Helmholtz Association 30 August 2018 Slide 82 114 Data Region Run Result

$ make run
srun --partition=gpus --gres=gpu:1 ./poisson2d
Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
Calculate reference solution and time with serial CPU execution.
0, 0.249999
100, 0.249760
200, 0...
Calculate current execution.
0, 0.249999
100, 0.249760
200, 0...
2048x2048: Ref: 100.3450 s, This: 0.8920 s, speedup: 112.49

Actually slower! Why?

Member of the Helmholtz Association 30 August 2018 Slide 83 114 Memory Coalescing Memory in batch

Coalesced access good Threads of warp (group of 32 contiguous threads) access adjacent words Few transactions, high utilization Uncoalesced access bad Threads of warp access scattered words Many transactions, low utilization Best performance: threadIdx.x should access contiguously
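A sketch of the two patterns (fragments assume an enclosing parallel region, arrays in and out of size nx*ny, and ix/iy indices as in the Jacobi code):

    // Coalesced: consecutive vector lanes access consecutive addresses
    #pragma acc loop vector
    for (int ix = 0; ix < nx; ix++)
        out[iy*nx + ix] = in[iy*nx + ix];

    // Uncoalesced: consecutive vector lanes access addresses nx elements apart
    #pragma acc loop vector
    for (int iy = 0; iy < ny; iy++)
        out[iy*nx + ix] = in[iy*nx + ix];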


Member of the Helmholtz Association 30 August 2018 Slide 84 114 Jacobi Access Pattern A coalescion of data

Improve memory access pattern: loop order in the main loop

#pragma acc parallel loop reduction(max:error)
for (int ix = ix_start; ix < ix_end; ix++){
    #pragma acc loop vector
    for (int iy = iy_start; iy < iy_end; iy++){
        Anew[ iy*nx + ix ] = -0.25 * (rhs[iy*nx+ix] -
            ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
            + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
        //...

ix: outer run index; accesses consecutive memory locations
iy: inner run index; accesses offset memory locations
→ Change order to optimize pattern!

Member of the Helmholtz Association 30 August 2018 Slide 85 114 Jacobi Access Pattern A coalescion of data

Improve memory access pattern: loop order in the main loop (after interchange)

#pragma acc parallel loop reduction(max:error)
for (int iy = iy_start; iy < iy_end; iy++){
    #pragma acc loop vector
    for (int ix = ix_start; ix < ix_end; ix++){
        Anew[ iy*nx + ix ] = -0.25 * (rhs[iy*nx+ix] -
            ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
            + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
        //...

ix is now the inner (vector) run index and accesses consecutive memory locations; iy is the outer run index
→ Optimized access pattern!

Member of the Helmholtz Association 30 August 2018 Slide 85 114 Fixing Access Pattern TASK 7 Loop change

Interchange loop order for Jacobi loops Also: Compare to loop-fixed CPU reference version

Task 7: Loop Ordering

Change to Task7/ directory Work on TODOs Compile: make Submit to the batch system: make run ? What’s your speed-up?

Member of the Helmholtz Association 30 August 2018 Slide 86 114 Fixing Access Pattern Compiler output (unchanged)

$ make pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60 poisson2d.c poisson2d_reference.o -o poisson2d poisson2d.c: main: 104, Generating create(Anew[:ny*nx]) Generating copyin(rhs[:ny*nx]) Generating copy(A[:ny*nx]) 110, Accelerator kernel generated Generating Tesla code 110, Generating reduction(max:error) 111, #pragma acc loop gang /* blockIdx.x */ 114, #pragma acc loop vector(128) /* threadIdx.x */ ...

Member of the Helmholtz Association 30 August 2018 Slide 87 114 Fixing Access Pattern Run Result

$ make run
bsub -gpu 'num=1:j_exclusive=yes' -Is -U gridka ./poisson2d
Job <38149> is submitted to default queue . <> <>
Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
Calculate reference solution and time with serial CPU execution.
0, 0.249999
100, 0.249760
200, 0...
Calculate current execution.
0, 0.249999
100, 0.249760
200, 0...
2048x2048: Ref: 107.7144 s, This: 0.2540 s, speedup: 424.11

Again, with the proper CPU version!

Member of the Helmholtz Association 30 August 2018 Slide 88 114 Fixing Access Pattern Run Result II

$ make run
bsub -gpu 'num=1:j_exclusive=yes' -Is -U gridka ./poisson2d
Job <38148> is submitted to default queue . <> <>
Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
Calculate reference solution and time with serial CPU execution.
0, 0.249999
100, 0.249760
200, 0...
Calculate current execution.
0, 0.249999
100, 0.249760
200, 0...
2048x2048: Ref: 6.7942 s, This: 0.2533 s, speedup: 26.83

26× is great!

Member of the Helmholtz Association 30 August 2018 Slide 89 114 Page-Locked Memory Pageability

Host memory allocated with malloc() is pageable: memory pages can be moved by the kernel, e.g. swapped to disk (an additional indirection). NVIDIA GPUs can allocate page-locked memory (pinned memory):
+ Faster (safety guards are skipped)
+ Interleaving of execution and copy (asynchronous)
+ Directly mapped into GPU memory
− Scarce resource; OS performance could degrade
OpenACC: Very easy to use pinned memory: -ta=tesla:pinned

Member of the Helmholtz Association 30 August 2018 Slide 90 114 Page-Locked Memory TASK 7’ Loop change

Compare performance with and without pinned memory Also test unified memory again

Task 7’: Pinned Memory

Like in Task 7, but change compilation to include pinned or managed Submit to the batch system: make run

Member of the Helmholtz Association 30 August 2018 Slide 91 114 Parallelization Workflow Identify available parallelism

Parallelize loops with OpenACC

Optimize data locality

Optimize loop performance

Member of the Helmholtz Association 30 August 2018 Slide 92 114 Parallelization Workflow Identify available parallelism

Parallelize loops with OpenACC

Optimize data locality

Optimize loop performance

Member of the Helmholtz Association 30 August 2018 Slide 92 114 Interoperability

Member of the Helmholtz Association 30 August 2018 Slide 93 114 Interoperability

OpenACC can operate together with Applications Libraries CUDA Both directions possible: Call OpenACC from others, call others from OpenACC

Member of the Helmholtz Association 30 August 2018 Slide 94 114 The Keyword OpenACC’s Rosetta Stone host_data use_device

Background GPU and CPU are different devices, have different memory → Distinct address spaces OpenACC hides handling of addresses from user For every chunk of accelerated data, two addresses exist One for CPU data, one for GPU data OpenACC uses appropriate address in accelerated kernel But: Automatic handling not working when out of OpenACC (OpenACC will default to host address) → host_data use_device uses the address of the GPU device data for scope

Member of the Helmholtz Association 30 August 2018 Slide 95 114 The host_data Construct Example

Usage:
double* foo = new double[N];          // foo on Host
#pragma acc data copyin(foo[0:N])     // foo on Device
{
    // ...
    #pragma acc host_data use_device(foo)
    some_lfunc(foo);                   // Device: OK!
    // ...
}
The directive can be used for a structured block as well

Member of the Helmholtz Association 30 August 2018 Slide 96 114 The Inverse: deviceptr When CUDA is involved

For the inverse case: data has been copied by CUDA or a CUDA-using library, and a pointer to data residing on the device is returned → use this data in an OpenACC context. The deviceptr clause declares data to be on the device. Usage:

float * x;
int n = 4223;
cudaMalloc((void**)&x, (size_t)n*sizeof(float));
// ...
#pragma acc kernels deviceptr(x)
for (int i = 0; i < n; i++){
    x[i] = i;
}

Member of the Helmholtz Association 30 August 2018 Slide 97 114 Interoperability Tasks

Member of the Helmholtz Association 30 August 2018 Slide 98 114 Task 1 Introduction to BLAS

Use case: Anything linear algebra BLAS: Basic Linear Algebra Subprograms Vector-vector, vector-matrix, matrix-matrix operations Specification of routines Examples: SAXPY, DGEMV, ZGEMM → http://www.netlib.org/blas/ cuBLAS: NVIDIA’s linear algebra routines with BLAS interface, readily accelerated → http://docs.nvidia.com/cuda/cublas/ Task 1: Use cuBLAS for vector addition, everything else with OpenACC

Member of the Helmholtz Association 30 August 2018 Slide 99 114 Task 8-1 cuBLAS OpenACC Interaction

cuBLAS routine used: cublasDaxpy(cublasHandle_t handle, int n, const double *alpha, const double *x, int incx, double *y, int incy) handle capsules GPU auxiliary data, needs to be created and destroyed with cublasCreate and cublasDestroy x and y point to addresses on device! cuBLAS library needs to be linked with -lcublas
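Put together, the call could look roughly like this (a sketch; x, y, and n are assumed to live in an enclosing OpenACC data region):

    #include <cublas_v2.h>

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0;
    #pragma acc host_data use_device(x, y)
    {
        // inside this block, x and y refer to the device copies
        cublasDaxpy(handle, n, &alpha, x, 1, y, 1);
    }
    cublasDestroy(handle);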

Member of the Helmholtz Association 30 August 2018 Slide 100 114 Task 8-1 TASK 8-1 Vector Addition with cuBLAS

Use cuBLAS for vector addition

Task 8-1: OpenACC+cuBLAS

Change to Task8-1/ directory Work on TODOs in vecAddRed.c Use host_data use_device to provide correct pointer Check cuBLAS documentation for details on cublasDaxpy() Compile: make Submit to the batch system: make run

Member of the Helmholtz Association 30 August 2018 Slide 101 114 Task 8-2 CUDA Need-to-Know

Use case: working on legacy code, or needing the raw power (/flexibility) of CUDA
CUDA need-to-knows:
Thread → Block → Grid; the total number of threads should map to your problem; threads are always given per block
A kernel is called from every thread on the GPU device
Number of kernel threads: triple chevron syntax kernel<<<blocks, threads>>>(arg1, arg2, ...)
Kernel: function with __global__ prefix
Aware of its index by built-in variables, e.g. threadIdx.x
→ http://docs.nvidia.com/cuda/
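A minimal sketch of such a kernel and its launch (illustrative only; d_a, d_b, d_c are assumed device pointers, n the vector length; the actual task uses the wrapper file described below):

    __global__ void vecAdd(const double *a, const double *b, double *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)
            c[i] = a[i] + b[i];
    }

    // Launch: enough blocks of 128 threads to cover n elements
    vecAdd<<<(n + 127) / 128, 128>>>(d_a, d_b, d_c, n);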

Member of the Helmholtz Association 30 August 2018 Slide 102 114 Task 8-2 TASK 8-2 Vector Addition with CUDA Kernel

CUDA kernel for vector addition, rest OpenACC Marrying CUDA C and OpenACC: All direct CUDA interaction wrapped in wrapper file cudaWrapper.cu, compiled with nvcc to object file (-c) vecAddRed.c calls external function from cudaWrapper.cu (extern) Task 8-2: OpenACC+CUDA

Change to Task8-2/ directory Work on TODOs in vecAddRed.c and cudaWrapper.cu Use host_data use_device to provide the correct pointer Implement the computation in the kernel, implement the call of the kernel Compile: make; Submit to the batch system: make run

Member of the Helmholtz Association 30 August 2018 Slide 103 114 Thrust Iterators! Iterators everywhere!

Thrust: the "STL for CUDA", a C++ template library. Based on iterators, but also works with plain C pointers. Data-parallel primitives (scan(), sort(), reduce(), …); algorithms

→ http://thrust.github.io/ http://docs.nvidia.com/cuda/thrust/

Member of the Helmholtz Association 30 August 2018 Slide 104 114 Thrust Code example

int a = 42;
int n = 10;
thrust::host_vector<int> x(n), y(n);
// fill x, y
thrust::device_vector<int> d_x = x, d_y = y;
using namespace thrust::placeholders;
thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(), a * _1 + _2);
y = d_y;   // copy result back to the host

Member of the Helmholtz Association 30 August 2018 Slide 105 114 Task 8-3 8-3 Vector Addition with Thrust

Use Thrust for reduction, everything else of vector addition with OpenACC

Task 8-3: OpenACC+Thrust

Change to Task8-3/ directory Work on TODOs in vecAddRed.c and thrustWrapper.cu Use host_data use_device to provide correct pointer Implement call to thrust::reduce using c_ptr Compile: make Submit to the batch system: make run
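One way the reduction call behind the TODO might be wrapped (a sketch under the assumption that a raw device pointer d_x of length n is passed in, e.g. obtained via host_data use_device; the function name is hypothetical):

    #include <thrust/device_ptr.h>
    #include <thrust/reduce.h>

    double sumOnDevice(const double *d_x, int n)
    {
        // Wrap the raw device pointer so Thrust treats it as device data
        thrust::device_ptr<const double> p(d_x);
        return thrust::reduce(p, p + n, 0.0);
    }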

Member of the Helmholtz Association 30 August 2018 Slide 106 114 Task 8-4 Stating the Problem

We want to solve the Poisson equation

ΔΦ(x, y) = −ρ(x, y)

with periodic boundary conditions in x and y Needed, e.g., for finding electrostatic potential Φ for a given charge distribution ρ Model problem

ρ(x, y) = cos(4πx) sin(2πy), (x, y) ∈ [0, 1)²

Analytically known: Φ(x, y) = Φ0 cos(4πx) sin(2πy) Let’s solve the Poisson equation with a Fourier Transform!

Member of the Helmholtz Association 30 August 2018 Slide 107 114 Task 8-4 Introduction to Fourier Transforms

Discrete Fourier Transform and re-transform:
f̂_k = Σ_{j=0}^{N−1} f_j · exp(−2πi·k·j/N)  ⇔  f_j = Σ_{k=0}^{N−1} f̂_k · exp(2πi·j·k/N)

Time for all f̂_k: O(N²). Fast Fourier Transform: recursive splitting → O(N log N)
Find derivatives in Fourier space:
f′_j = Σ_{k=0}^{N−1} i·k·f̂_k · exp(2πi·j·k/N)
It's just multiplying by ik!

Member of the Helmholtz Association 30 August 2018 Slide 108 114 Task 8-4 Plan for FFT Poisson Solution

Start with charge density ρ
1 Fourier-transform ρ: ρ̂ ← F(ρ)  (cuFFT)
2 Integrate ρ̂ in Fourier space twice: φ̂ ← ρ̂ / (k_x² + k_y²)  (OpenACC)
3 Inverse Fourier-transform φ̂: φ ← F⁻¹(φ̂)  (cuFFT)

Member of the Helmholtz Association 30 August 2018 Slide 109 114 Task 8-4 cuFFT

cuFFT: NVIDIA’s (Fast) Fourier Transform library 1D, 2D, 3D transforms; complex and real data types Asynchronous execution Modeled after FFTW library (API) Part of CUDA Toolkit → https://developer.nvidia.com/cufft cufftDoubleComplex *src, *tgt; // Device data! cufftHandle plan; // Setup 2d complex-complex trafo w/ dimensions (Nx, Ny) cufftCreatePlan(plan, Nx, Ny, CUFFT_Z2Z); cufftExecZ2Z(plan, src, tgt, CUFFT_FORWARD); // FFT cufftExecZ2Z(plan, tgt, tgt, CUFFT_INVERSE); // iFFT // Inplace trafo ^----^ cufftDestroy(plan); // Clean-up

Member of the Helmholtz Association 30 August 2018 Slide 110 114 Task 8-4 Synchronizing cuFFT

CUDA streams enable interleaving of computational tasks. cuFFT uses streams for asynchronous execution. cuFFT runs in the default CUDA stream; OpenACC does not → trouble ⇒ force cuFFT onto the OpenACC stream:

#include <openacc.h>
// Obtain the OpenACC default stream id
cudaStream_t accStream = (cudaStream_t) acc_get_cuda_stream(acc_async_sync);
// Execute all cuFFT calls of this plan on this stream
cufftSetStream(plan, accStream);
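Combined with host_data use_device, a forward transform of OpenACC-managed data could then be sketched like this (illustrative; field and fieldHat are assumed to be cufftDoubleComplex arrays kept on the device by a data region):

    cufftSetStream(plan, accStream);   // run cuFFT on the OpenACC stream
    #pragma acc host_data use_device(field, fieldHat)
    {
        cufftExecZ2Z(plan, field, fieldHat, CUFFT_FORWARD);
    }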

Member of the Helmholtz Association 30 August 2018 Slide 111 114 Task 8-4 TASK 8-4 OpenACC and cuFFT

Use case: Fourier transforms Use cuFFT and OpenACC to solve Poisson’s Equation Task 8-4: OpenACC+cuFFT

Change to Task8-4/ directory Work on TODOs in poisson.c solveRSpace Force cuFFT on correct stream; implement data handling with host_data use_device solveKSpace Implement data handling and parallelism Compile: make Submit to the batch system: make run

Member of the Helmholtz Association 30 August 2018 Slide 112 114 Conclusions

Member of the Helmholtz Association 30 August 2018 Slide 113 114 Conclusions

OpenACC directives and clauses: #pragma acc parallel loop copyin(A[0:N]) reduction(max:err) vector
Start easy, optimize from there
The PGI / NVIDIA Visual Profiler helps to find bottlenecks
OpenACC is interoperable with other GPU programming models
Don't forget the CPU version!
Thank you for your attention! [email protected]

Member of the Helmholtz Association 30 August 2018 Slide 114 114 Appendix List of Tasks Backup: SSH Keys Glossary References

Member of the Helmholtz Association 30 August 2018 Slide 1 8 List of Tasks

Task 0⋆: Setup Task 1: Analyze Application Task 2: A First Parallel Loop Task 3: More Parallel Loops Task 4: Data Copies Task 5: Data Region Task 6: More Parallelism Task 7: Loop Ordering Task 7': Pinned Memory Task 8-1: OpenACC+cuBLAS Task 8-2: OpenACC+CUDA Task 8-3: OpenACC+Thrust Task 8-4: OpenACC+cuFFT

Member of the Helmholtz Association 30 August 2018 Slide 2 8 SSH Keys URL

Back to Setup Slide

bit.ly/gridka18gpu

Member of the Helmholtz Association 30 August 2018 Slide 3 8 Glossary I

API A programmatic interface to software by well-defined functions. Short for application programming interface. 52

CUDA Parallel computing platform and programming model for GPUs from NVIDIA. Provides, among others, CUDA C/C++. 10, 20, 27, 52, 65, 66, 93, 111, 114, 119, 120, 127, 128, 133, 136

GCC The GNU Compiler Collection, the collection of open source compilers, among others for C and Fortran. 26, 29

NVIDIA US technology company creating GPUs. 22, 52, 64, 67, 72, 73, 74, 106, 116, 127, 131, 135, 136, 137

Member of the Helmholtz Association 30 August 2018 Slide 4 8 Glossary II OpenACC Directive-based programming, primarily for many-core machines. 2, 20, 21, 22, 23, 24, 25, 27, 28, 29, 30, 31, 32, 35, 41, 42, 44, 46, 48, 52, 55, 58, 63, 67, 68, 75, 79, 82, 87, 88, 91, 93, 94, 106, 108, 109, 111, 112, 114, 116, 117, 118, 120, 123, 126, 128, 129, 131, 133 OpenCL The Open Computing Language. Framework for writing code for heterogeneous architectures (CPU, GPU, DSP, FPGA). The alternative to CUDA. 20 OpenMP Directive-based programming, primarily for multi-threaded machines. 2, 20, 23, 57

PAPI The Performance API, a C/C++ API for querying performance counters. 36 Pascal GPU architecture from NVIDIA (announced 2016). 65, 66 perf Part of the Linux kernel which facilitates access to performance counters; comes with command line utilities. 36

Member of the Helmholtz Association 30 August 2018 Slide 5 8 Glossary III

PGI Compiler creators. Formerly The Portland Group, Inc.; since 2013 part of NVIDIA. 2, 26, 29, 36

Thrust A parallel algorithms library for (among others) GPUs. See https://thrust.github.io/. 20, 121, 122, 123, 133

CPU Central processing unit. 4, 5, 6, 7, 8, 9, 26, 52, 65, 66, 102, 104, 112, 131, 136

GPU Graphics processing unit. 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 20, 21, 26, 28, 48, 52, 64, 65, 66, 67, 72, 78, 80, 106, 112, 117, 119, 131, 135, 136, 137

Member of the Helmholtz Association 30 August 2018 Slide 6 8 References I

[4] John L. Gustafson. “Reevaluating Amdahl’s Law”. In: Commun. ACM 31.5 (May 1988), pp. 532–533. ISSN: 0001-0782. DOI: 10.1145/42411.42415. URL: http://doi.acm.org/10.1145/42411.42415. [6] Donald E. Knuth. “Structured Programming with Go to Statements”. In: ACM Comput. Surv. 6.4 (Dec. 1974), pp. 261–301. ISSN: 0360-0300. DOI: 10.1145/356635.356640. URL: http://doi.acm.org/10.1145/356635.356640 (page 36).

Member of the Helmholtz Association 30 August 2018 Slide 7 8 References: Images, Graphics

[1] SpaceX. SpaceX Launch. Freely available at Unsplash. URL: https://unsplash.com/photos/uj3hvdfQujI. [2] Mark Lee. Picture: kawasaki ninja. URL: https://www.flickr.com/photos/pochacco20/39030210/ (page 4). [3] Shearings Holidays. Picture: Shearings coach 636. URL: https://www.flickr.com/photos/shearings/13583388025/ (page 4). [5] Setyo Ari Wibowo. Ask. URL: https://thenounproject.com/term/ask/1221810.

Member of the Helmholtz Association 30 August 2018 Slide 8 8