PROGRAMMING GPU-ACCELERATED OPENPOWER SYSTEMS WITH OPENACC GPU TECHNOLOGY CONFERENCE 2018

26 March 2018 Andreas Herten Forschungszentrum Jülich Handout Version

Member of the Helmholtz Association Overview, Outline

What you will learn today
  What's special about GPU-equipped POWER systems
  Parallelization strategies with OpenACC
  OpenACC on CPU, GPU, and multiple GPUs
  All in 120 minutes

What you will not learn today
  Analyze a program in detail
  Strategies for complex programs
  How to leave the matrix

Session plan (lecture, hands-on, extra tasks): Introduction POWER · Login · Introduction OpenACC · OpenACC on CPU · OpenACC: GPU Optimizations · OpenACC with GPUs · MPI 101 · OpenACC, GPUs, and MPI

Member of the Helmholtz Association 26 March 2018 Slide 1 77 Overview, Outline

Introduction: OpenPOWER · Minsky, POWER8 · Newell, POWER9 · Using JURON
OpenACC: About OpenACC · Modus Operandi · OpenACC's Models · Parallelization Workflow
First Steps in OpenACC: Example Program · Identify Parallelism · Parallelize Loops (parallel, loop, kernels)
OpenACC on the GPU: Compiling on GPU · Data Locality (copy, data, enter data)
OpenACC on Multiple GPUs: MPI 101 · Jacobi MPI Strategy · Asynchronous
Conclusions, Summary
Appendix: List of Tasks

Member of the Helmholtz Association 26 March 2018 Slide 1 77 Jülich Jülich Supercomputing Centre

Forschungszentrum Jülich: one of the largest research centers in Europe
Jülich Supercomputing Centre: host of, and research in, supercomputers
  JUQUEEN †: BlueGene/Q system (until March 2018; successor: JUWELS)
  JURECA: Intel x86 system; some GPUs, many KNLs
  etc.: DEEP, QPACE, JULIA, JURON
Me: physicist, now at the POWER Acceleration and Design Centre and the NVIDIA Application Lab

Member of the Helmholtz Association 26 March 2018 Slide 2 77 OpenPOWER Foundation

Platform for collaboration around POWER processor architecture Started by IBM, NVIDIA, many more (now > 250 members) Objectives Licensing of processor architecture to partners Collaborate on system extension Open-Source Software Example technology: NVLink, fast GPU-CPU interconnect → https://openpowerfoundation.org/

Member of the Helmholtz Association 26 March 2018 Slide 3 77 Minsky System

IBM's S822LC server, codename Minsky 2 IBM POWER8NVL CPUs, 4 NVIDIA Tesla P100 GPUs

[Block diagram: each POWER8 CPU connects to its system memory at 115 GB/s and to two P100 GPUs via NVLink at 2 × 40 GB/s; each P100 reaches its GPU memory at 720 GB/s.]
Member of the Helmholtz Association 26 March 2018 Slide 4 77 System Core Numbers
POWER8 CPU: 2 sockets, each 10 cores, each 8× SMT; 2.5 GHz to 5 GHz; 8 FLOP/cycle/core; 256 GB memory (115 GB/s); L4 cache per socket: 4 × 16 MB (buffer chip); L3/L2/L1 cache per core: 8 MB / 512 kB / 64 kB; NVLink (40 GB/s); 0.5 TFLOP/s
P100 GPU: 56 Streaming Multiprocessors (SMs); 64 FLOP/cycle/SM; 16 GB memory (720 GB/s); L2 cache: 4 MB; shared memory: 64 kB; NVLink (40 GB/s); 5 TFLOP/s


Member of the Helmholtz Association 26 March 2018 Slide 5 77 JURON JURON (Juelich + Neuron) 18 Minsky nodes (≈350 TFLOP/s) For Human Brain Project (HBP), but not only Prototype system, together with JULIA (KNL-based) Access via Jupyter Hub or SSH

[Node overview: compute nodes juronc01 … juronc18 plus login/admin node juron1-adm.]

Member of the Helmholtz Association 26 March 2018 Slide 6 77 Newell

Successor of Minsky (AC922 instead of S822LC) POWER9 instead of POWER8, 3 (2) Voltas instead of 2 Pascals, NVLink 2 instead of NVLink 1 → Faster memory bandwidths, more FLOP/s, smarter NVLink

[Block diagram: each POWER9 CPU connects to its system memory at 120 GB/s, to the other CPU at 64 GB/s, and to three V100 GPUs via NVLink 2 at 2 × 50 GB/s; each V100 reaches its GPU memory at 900 GB/s.]

Tesla V100: 80 SMs; FP32/FP64 cores per SM same as Pascal ⇒ 7.5 TFLOP/s (FP64); 8 Tensor Cores per SM ⇒ 120 TFLOP/s (FP16)
NVLink 2: Cache coherence, …; CPU Address Translation Service → Appendix

Member of the Helmholtz Association 26 March 2018 Slide 7 77 Summit

New at Oak Ridge National Lab 4600 Newell-like nodes > 200 PFLOP/s performance Maybe the world’s fastest supercomputer! Also: Sierra at Lawrence Livermore National Laboratory

Member of the Helmholtz Association 26 March 2018 Slide 8 77 Using JURON TASK 1 A gentle start

Task 1: JURON
  Website of lab: http://bit.ly/gtc18-openacc
  Log in to JURON via http://jupyter-jsc.fz-juelich.de (access via Jupyter Lab; no notebooks, but a terminal)
  Login from slip of paper (»Workshop password«)
  Click through to launch a Jupyter Lab instance on JURON
  Start a terminal, browse to the source files, view slides, …
  Directory of tasks: cd $HOME/Tasks/Tasks/
  Solutions are always given! You decide when to look
  Edit files with Jupyter's source code editor (just open the .c file)
  ? How many cores are on a compute node? How many CUDA cores? See README.md


Member of the Helmholtz Association 26 March 2018 Slide 9 77 Using JURON So many cores!

$ make run
bsub -Is -U gtc lscpu
[...]
CPU(s): 160
[...]
module load cuda cuda-samples && \
  bsub -Is -R "rusage[ngpus_shared=1]" -U gtc deviceQuery
[...]
Device 0: "Tesla P100-SXM2-16GB"
  CUDA Driver Version / Runtime Version:        9.1 / 9.1
  CUDA Capability Major/Minor version number:  6.0
  Total amount of global memory:               16276 MBytes (17066885120 bytes)
  (56) Multiprocessors, ( 64) CUDA Cores/MP:   3584 CUDA Cores
[...]

→ Total number of (totally different) cores: 160 + (4 × 3584) = 14 496

Member of the Helmholtz Association 26 March 2018 Slide 10 77 OpenACC Introduction

Member of the Helmholtz Association 26 March 2018 Slide 11 77 Primer on GPU Computing

Application

Libraries: drop-in acceleration
OpenACC directives: easy acceleration
Programming languages: flexible acceleration

Member of the Helmholtz Association 26 March 2018 Slide 12 77 About OpenACC History 2011 OpenACC 1.0 specification is released  NVIDIA, Cray, PGI, CAPS 2013 OpenACC 2.0: More functionality, portability  2015 OpenACC 2.5: Enhancements, clarifications  2017 OpenACC 2.6: Deep copy, …  → https://www.openacc.org/ (see also: Best practice guide )

Support Compiler: PGI, GCC, Cray, Sunway Languages: C/C++, Fortran

Member of the Helmholtz Association 26 March 2018 Slide 13 77 Open{MP↔ACC} Everything’s connected

OpenACC modeled after OpenMP …
… but specific for accelerators
Might eventually be absorbed into OpenMP, but OpenMP > 4.0 also has an offloading feature
OpenACC is more descriptive, OpenMP more prescriptive
Basic principle is the same: fork/join model
Master thread launches parallel child threads; merge after execution

[Fork/join diagram: a master thread forks a parallel region and joins back to the master; the picture is the same for OpenMP and OpenACC.]
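As a small illustration of that similarity, here is a sketch (not from the slides) of the same loop with both models; N, a, x, and y are assumed to be defined elsewhere:

// OpenMP: fork/join of host CPU threads
#pragma omp parallel for
for (int i = 0; i < N; i++)
    y[i] = a * x[i] + y[i];

// OpenACC: same pattern, but the parallel threads run on an accelerator
#pragma acc parallel loop
for (int i = 0; i < N; i++)
    y[i] = a * x[i] + y[i];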

Member of the Helmholtz Association 26 March 2018 Slide 14 77 Modus Operandi Three-step program

1 Annotate code with directives, indicating parallelism
2 OpenACC-capable compiler generates accelerator-specific code
3 $uccess

Member of the Helmholtz Association 26 March 2018 Slide 15 77 1 Directives pragmatic

Compiler directives state intent to the compiler

C/C++:
    #pragma acc kernels
    for (int i = 0; i < 23; i++)
        // ...

Fortran:
    !$acc kernels
    do i = 1, 24
        ! ...
    !$acc end kernels

Ignored by compiler which does not understand OpenACC High level programming model for many-core machines, especially accelerators OpenACC: Compiler directives, library routines, environment variables Portable across host systems and accelerator architectures

Member of the Helmholtz Association 26 March 2018 Slide 16 77 2 Compiler Simple and abstracted

Compiler support
  PGI: best performance, great support, free
  GCC: beta, limited coverage, OSS
  Cray: ???
Trust the compiler to generate the intended parallelism; always check its status output!
No need to know the ins'n'outs of the accelerator; leave it to expert compiler engineers⋆
One code can target different accelerators: GPUs, or even multi-core CPUs → portability

⋆: Eventually you want to tune for device; but that’s possible

Member of the Helmholtz Association 26 March 2018 Slide 17 77 3 $uccess Iteration is key Expose Parallelism

Iteration cycle: expose parallelism → compile → measure → refine
Serial to parallel: fast
Serial to fast parallel: more time needed
Start simple → refine ⇒ productivity
Because of its generality: sometimes the last bit of hardware performance is not accessible
But: OpenACC can be used together with other accelerator-targeting techniques (CUDA, libraries, …)

Member of the Helmholtz Association 26 March 2018 Slide 18 77 OpenACC Accelerator Model For computation and memory spaces

Main program executes on host
Device code is transferred to the accelerator and execution on the accelerator is started
Host waits until return (except: async)
Two separate memory spaces; data is transferred back and forth
Transfers are hidden from the programmer; memories are not coherent!
Compiler helps; GPU runtime helps
[Diagram: host starts the main program, waits for the device code to run and finish, then continues after return; host memory and device memory exchange data via DMA transfers.]

Member of the Helmholtz Association 26 March 2018 Slide 19 77 OpenACC Programming Model A binary perspective

OpenACC interpretation needs to be activated as a compile flag
  PGI: pgcc -acc [-ta=tesla|-ta=multicore]
  GCC: gcc -fopenacc
→ Ignored by an incapable compiler!
Additional flags possible to improve/modify compilation
  -ta=tesla:cc60      Use compute capability 6.0
  -ta=tesla:lineinfo  Add source code correlation into binary
  -ta=tesla:managed   Use unified memory
  -fopenacc-dim=geom  Use geom configuration for threads
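For orientation, a compile line might look like the following sketch; the file name saxpy.c is made up, the flags are the ones listed above:

$ pgcc -acc -ta=tesla:cc60,managed -Minfo=accel -fast saxpy.c -o saxpy   # PGI, GPU target
$ pgcc -acc -ta=multicore -Minfo=accel -fast saxpy.c -o saxpy            # PGI, multicore CPU target
$ gcc -fopenacc -O3 saxpy.c -o saxpy                                     # GCC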

Member of the Helmholtz Association 26 March 2018 Slide 20 77 A Glimpse of OpenACC

#pragma acc data copy(x[0:N], y[0:N])
{
    #pragma acc parallel loop
    for (int i = 0; i < N; i++) {
        // ...
    }
}

Member of the Helmholtz Association 26 March 2018 Slide 21 77 Parallelization Workflow Identify available parallelism

Parallelize loops with OpenACC

Optimize data locality

Optimize loop performance

Member of the Helmholtz Association 26 March 2018 Slide 22 77 First Steps in OpenACC

Member of the Helmholtz Association 26 March 2018 Slide 23 77 Jacobi Solver Algorithmic description

Example for acceleration: Jacobi solver
Iterative solver, converges to the correct value
Each iteration step: compute average of neighboring points
Example: 2D Poisson equation: ∇²A(x, y) = B(x, y)
[Stencil figure: data point A_{i,j} with neighbors A_{i-1,j}, A_{i+1,j}, A_{i,j-1}, A_{i,j+1}; boundary points at the edge of the grid]

A_{k+1}(i,j) = -\frac{1}{4} \left( B(i,j) - \left( A_k(i-1,j) + A_k(i,j+1) + A_k(i+1,j) + A_k(i,j-1) \right) \right)

Member of the Helmholtz Association 26 March 2018 Slide 24 77 Jacobi Solver Source code

// Iterate until converged
while ( error > tol && iter < iter_max ) {
    error = 0.0;

    // Iterate across matrix elements
    for (int ix = ix_start; ix < ix_end; ix++) {
        for (int iy = iy_start; iy < iy_end; iy++) {
            // Calculate new value from neighbors
            Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                                     + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
            // Accumulate error
            error = fmaxr(error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
        }
    }

    // Swap input/output
    for (int iy = iy_start; iy < iy_end; iy++) {
        for (int ix = ix_start; ix < ix_end; ix++) {
            A[iy*nx+ix] = Anew[iy*nx+ix];
        }
    }

    // Set boundary conditions
    for (int ix = ix_start; ix < ix_end; ix++) {
        A[0*nx+ix]      = A[(ny-2)*nx+ix];
        A[(ny-1)*nx+ix] = A[1*nx+ix];
    }
    // same for iy

    iter++;
}

Member of the Helmholtz Association 26 March 2018 Slide 25 77 Parallelization Workflow Identify available parallelism

Parallelize loops with OpenACC

Optimize data locality

Optimize loop performance

Member of the Helmholtz Association 26 March 2018 Slide 26 77 Profiling Profile

[…] premature optimization is the root of all evil. Yet we should not pass up our [optimization] opportunities […] – Donald Knuth [10]

Investigate hot spots of your program! → Profile! Many tools, many levels: perf, PAPI, Score-P, Intel Advisor, NVIDIA Visual Profiler,… Here: Examples from PGI

Member of the Helmholtz Association 26 March 2018 Slide 27 77 Profile of Application Info during compilation

$ pgcc -DUSE_DOUBLE -Minfo=all,intensity -fast -Minfo=ccff -Mprof=ccff poisson2d_reference.o poisson2d.c -o poisson2d
poisson2d.c:
main:
     68, Generated vector simd code for the loop
         FMA (fused multiply-add) instruction(s) generated
     98, FMA (fused multiply-add) instruction(s) generated
    105, Loop not vectorized: data dependency
    123, Loop not fused: different loop trip count
         Loop not vectorized: data dependency
         Loop unrolled 8 times

Automated optimization of compiler, due to -fast Vectorization, FMA, unrolling

Member of the Helmholtz Association 26 March 2018 Slide 28 77 Profile of Application Info during run

$ pgprof --cpu-profiling on [...] ./poisson2d
====== CPU profiling result (flat):
Time(%)   Time        Name
 77.52%   999.99ms    main (poisson2d.c:148 0x6d8)
  9.30%   120ms       main (0x704)
  7.75%   99.999ms    main (0x718)
  0.78%   9.9999ms    main (poisson2d.c:128 0x348)
  0.78%   9.9999ms    main (poisson2d.c:123 0x398)
  0.78%   9.9999ms    __xlmass_expd2 (0xffcc011c)
  0.78%   9.9999ms    __c_mcopy8 (0xffcc0054)
  0.78%   9.9999ms    __xlmass_expd2 (0xffcc0034)
====== Data collected at 100Hz frequency

78 % in main()
Since everything is in main: limited helpfulness
Let's look into main!

Member of the Helmholtz Association 26 March 2018 Slide 29 77 Code Independency Analysis Independence is key

while ( error > tol && iter < iter_max ) {          // Data dependency between iterations
    error = 0.0;
    for (int ix = ix_start; ix < ix_end; ix++) {     // Independent loop iterations
        for (int iy = iy_start; iy < iy_end; iy++) {
            Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                                     + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
            error = fmaxr(error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
        }
    }
    for (int iy = iy_start; iy < iy_end; iy++) {     // Independent loop iterations
        for (int ix = ix_start; ix < ix_end; ix++) {
            A[iy*nx+ix] = Anew[iy*nx+ix];
        }
    }
    for (int ix = ix_start; ix < ix_end; ix++) {     // Independent loop iterations
        A[0*nx+ix]      = A[(ny-2)*nx+ix];
        A[(ny-1)*nx+ix] = A[1*nx+ix];
    }
    // same for iy
    iter++;
}

Member of the Helmholtz Association 26 March 2018 Slide 30 77 Parallelization Workflow Identify available parallelism

Parallelize loops with OpenACC

Optimize data locality

Optimize loop performance

Member of the Helmholtz Association 26 March 2018 Slide 31 77 Parallel Loops: Parallel Maybe the second most important directive

Programmer identifies block containing parallelism → compiler generates parallel code (kernel) Program launch creates gangs of parallel threads on parallel device Implicit barrier at end of parallel region Each gang executes same code sequentially

 OpenACC: parallel

#pragma acc parallel [clause, [, clause] ...] newline {structured block}

Member of the Helmholtz Association 26 March 2018 Slide 32 77 Parallel Loops: Parallel Clauses

Diverse clauses to augment the parallel region:
  private(var)        A copy of variable var is made for each gang
  firstprivate(var)   Same as private, except var is initialized with the value from the host
  if(cond)            Parallel region will execute on the accelerator only if cond is true
  reduction(op:var)   Reduction is performed on variable var with operation op; supported: + * max min …
  async[(int)]        No implicit barrier at end of parallel region
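A minimal sketch (not from the slides) combining a few of these clauses on the combined construct introduced below; x, scale, and n are assumed to exist:

double sum = 0.0;
// reduction: per-gang partial sums are combined at the end
// firstprivate: each gang gets its own copy of scale, initialized from the host value
// if: fall back to host execution for small problem sizes
#pragma acc parallel loop reduction(+:sum) firstprivate(scale) if(n > 10000)
for (int i = 0; i < n; i++)
    sum += scale * x[i];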

Member of the Helmholtz Association 26 March 2018 Slide 33 77 Parallel Loops: Loops Maybe the third most important directive

Programmer identifies loop eligible for parallelization Directive must be directly before loop Optional: Describe type of parallelism

 OpenACC: loop

#pragma acc loop [clause, [, clause] ...] newline {structured block}

Member of the Helmholtz Association 26 March 2018 Slide 34 77 Parallel Loops: Loops Clauses

  independent      Iterations of the loop are data-independent (implied if in a parallel region and no seq or auto)
  collapse(int)    Collapse int tightly-nested loops
  seq              This loop is to be executed sequentially (not parallel)
  tile(int[,int])  Split loops into loops over tiles of the full size
  auto             Compiler decides what to do
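For instance, collapse could be applied to the Jacobi loop nest of this tutorial to expose both loops as one iteration space; this is only a sketch, not the task solution:

#pragma acc parallel loop collapse(2) reduction(max:error)
for (int iy = iy_start; iy < iy_end; iy++) {
    for (int ix = ix_start; ix < ix_end; ix++) {
        Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                                 + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix] ));
        error = fmaxr(error, fabsr(Anew[iy*nx+ix] - A[iy*nx+ix]));
    }
}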

Member of the Helmholtz Association 26 March 2018 Slide 35 77 Parallel Loops: Parallel Loops Maybe the most important directive

Combined directive: shortcut, because it's used so often
Any clause that is allowed on parallel or loop is allowed
Restriction: may not appear in the body of another parallel region

 OpenACC: parallel loop

#pragma acc parallel loop [clause, [, clause] ...]

Member of the Helmholtz Association 26 March 2018 Slide 36 77 Parallel Loops Example

double sum = 0.0;
#pragma acc parallel loop
for (int i = 0; i < N; i++) {
    // ...
}
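The handout truncates the example after the loop header. A self-contained sketch of what such an example could look like; the SAXPY-style body and the second loop are assumptions, not the original slide code:

#include <stdio.h>
#define N 1000000

int main(void) {
    static double x[N], y[N];
    double a = 2.0, sum = 0.0;

    #pragma acc parallel loop                   // initialize the arrays in parallel
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    #pragma acc parallel loop reduction(+:sum)  // compute and accumulate a checksum
    for (int i = 0; i < N; i++) {
        y[i] = a * x[i] + y[i];
        sum += y[i];
    }

    printf("sum = %f\n", sum);
    return 0;
}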

Member of the Helmholtz Association 26 March 2018 Slide 37 77 Parallel Jacobi TASK 2 Add parallelism

Add OpenACC parallelism to main loop in Jacobi solver source code (CPU parallelism) → Congratulations, you are a parallel developer! Task 2: A First Parallel Loop

Change to Task2/ directory
Compile: make; see README.md
Submit run to the batch system: make run
Adapt the bsub call and run with other numbers of iterations and matrix sizes
Change the number of CPU threads via $ACC_NUM_CORES or $OMP_NUM_THREADS
? What's your speed-up? What's the best configuration for cores?
E Compare it to OpenMP

Member of the Helmholtz Association 26 March 2018 Slide 38 77 Parallel Jacobi Source Code

110 #pragma acc parallel loop reduction(max:error)
111 for (int ix = ix_start; ix < ix_end; ix++)
112 {
113     for (int iy = iy_start; iy < iy_end; iy++)
114     {
115         Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
116                                  + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix] ));
117         error = fmaxr( error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
118     }
119 }

Member of the Helmholtz Association 26 March 2018 Slide 39 77 Parallel Jacobi Compilation result

$ make
pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=multicore poisson2d.c poisson2d_reference.o -o poisson2d
poisson2d.c:
main:
    110, Generating Multicore code
         111, #pragma acc loop gang
    110, Generating reduction(max:error)
    113, Accelerator restriction: size of the GPU copy of A,rhs,Anew is unknown
         Complex loop carried dependence of Anew-> prevents parallelization
         Loop carried dependence of Anew-> prevents parallelization
         Loop carried backward dependence of Anew-> prevents vectorization

Member of the Helmholtz Association 26 March 2018 Slide 40 77 Parallel Jacobi Run result

$ make run
bsub -I -R "rusage[ngpus_shared=1]" -U gtc ./poisson2d
Job <4444> is submitted to default queue .
<>
<>
Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
Calculate reference solution and time with serial CPU execution.
    0, 0.249999
  100, 0.249760
  200, 0...
Calculate current execution.
    0, 0.249999
  100, 0.249760
  200, 0...
2048x2048: Ref: 56.6275 s, This: 19.9486 s, speedup: 2.84

Member of the Helmholtz Association 26 March 2018 Slide 41 77 Parallel Jacobi: OpenMP E

The OpenMP pragma is quite similar:

#pragma acc parallel loop reduction(max:error)
#pragma omp parallel for reduction(max:error)
for (int ix = ix_start; ix < ix_end; ix++) {
    ...
}

PGI's compiler output is a bit different (but states the same)

$ pgcc -DUSE_DOUBLE -Minfo=mp -fast -mp poisson2d.c poisson2d_reference.o -o poisson2d
poisson2d.c:
main:
    112, Parallel region activated
         Parallel loop activated with static block schedule
    123, Parallel region terminated
         Begin critical section
         End critical section
         Barrier

Run time should be very similar!

Member of the Helmholtz Association 26 March 2018 Slide 42 77 More Parallelism: Kernels More freedom for compiler

Kernels directive: second way to expose parallelism Region may contain parallelism Compiler determines parallelization opportunities → More freedom for compiler Rest: Same as for parallel

 OpenACC: kernels

#pragma acc kernels [clause, [, clause] ...]

Member of the Helmholtz Association 26 March 2018 Slide 43 77 Kernels Example

double sum = 0.0;
#pragma acc kernels
{
    for (int i = 0; i < N; i++) {
        // ...
    }
}
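Again the handout cuts the code off; a sketch of the kernels variant, under the same assumptions as the parallel loop example above (x, y, a, N defined elsewhere):

double sum = 0.0;
#pragma acc kernels   // compiler analyzes the region and generates one kernel per loop
{
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }
    for (int i = 0; i < N; i++) {
        y[i] = a * x[i] + y[i];
        sum += y[i];
    }
}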

Member of the Helmholtz Association 26 March 2018 Slide 44 77 kernels vs. parallel Both approaches equally valid; can perform equally well kernels Compiler performs parallel analysis Can cover large area of code with single directive Gives compiler additional leeway parallel Requires parallel analysis by programmer Will also parallelize what compiler may miss More explicit Similar to OpenMP Both regions may not contain other kernels/parallel regions No braunching into or out Program must not depend on order of evaluation of clauses At most: One if clause

Member of the Helmholtz Association 26 March 2018 Slide 45 77 OpenACC on the GPU

Member of the Helmholtz Association 26 March 2018 Slide 46 77 Changes for GPU-OpenACC Immensely complicated changes

Necessary for previous code to run on GPU: -ta=tesla instead of -ta=multicore ⇒ That’s it!

But we can optimize!

Member of the Helmholtz Association 26 March 2018 Slide 47 77 Parallelization Workflow Identify available parallelism

Parallelize loops with OpenACC

Optimize data locality

Optimize loop performance

Member of the Helmholtz Association 26 March 2018 Slide 48 77 Automatic Data Transfers

Up to now: we did not care about data transfers
Compiler and runtime take care of them
Data can be copied automatically to the GPU via Managed Memory; magic keyword: -ta=tesla:managed (more in Appendix!)
Be more explicit for full portability and full performance
[Diagram: host (control, ALUs, cache, DRAM) and device (DRAM) with their separate memories]

Member of the Helmholtz Association 26 March 2018 Slide 49 77 Copy Clause

Explicitly inform OpenACC compiler about data intentions Use data which is already on GPU; only copy parts of it; …

 OpenACC: copy

#pragma acc parallel copy(A[start:count])
Also: copyin(B[s:c]) copyout(C[s:c]) present(D[s:c]) create(E[s:c])
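A sketch (array names are placeholders) of how the different clauses map onto typical data intentions:

// B is only read on the GPU:    copyin  (transfer in, never back)
// C is only written on the GPU: copyout (allocate, transfer back at the end)
// E is scratch space:           create  (allocate on the GPU, never transfer)
#pragma acc parallel loop copyin(B[0:n]) copyout(C[0:n]) create(E[0:n])
for (int i = 0; i < n; i++) {
    E[i] = 2.0 * B[i];
    C[i] = E[i] + B[i];
}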

Member of the Helmholtz Association 26 March 2018 Slide 50 77 Data Regions To manually specify data locations: data construct

Defines region of code in which data remains on device Data is shared among all kernels in region Explicit data transfers

 OpenACC: data

#pragma acc data [clause, [, clause] ...]

Member of the Helmholtz Association 26 March 2018 Slide 51 77 Data Regions Clauses

Clauses to augment the data regions:
  copy(var)     Allocates memory for var on the GPU, copies data to the GPU at the beginning of the region, copies data back to the host at the end of the region
  copyin(var)   Allocates memory for var on the GPU, copies data to the GPU at the beginning of the region
  copyout(var)  Allocates memory for var on the GPU, copies data back to the host at the end of the region
  create(var)   Allocates memory for var on the GPU
  present(var)  Data of var is not copied automatically to the GPU but considered present
The subarray notation specifies the size of var: var[lowerBound:size]

Member of the Helmholtz Association 26 March 2018 Slide 52 77 Data Region Example

#pragma acc data copyout(y[0:N]) create(x[0:N])
{
    double sum = 0.0;
    #pragma acc parallel loop
    for (int i = 0; i < N; i++) {
        // ...
    }
}
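The data-region example is truncated as well; a sketch under the same assumptions as before, where only y is copied back and x lives purely on the device:

#pragma acc data copyout(y[0:N]) create(x[0:N])
{
    double sum = 0.0;

    #pragma acc parallel loop             // x and y are initialized on the device, no transfer
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    #pragma acc parallel loop reduction(+:sum)
    for (int i = 0; i < N; i++) {
        y[i] = a * x[i] + y[i];
        sum += y[i];
    }
}                                         // y is copied back to the host here; x is simply freed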

Member of the Helmholtz Association 26 March 2018 Slide 53 77 Data Regions II Looser regions: enter data directive

Define data regions, but not for structured block Closest to cudaMemcpy() Still, explicit data transfers

 OpenACC: enter data

#pragma acc enter data [clause, [, clause] ...] #pragma acc exit data [clause, [, clause] ...]
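A minimal sketch of an unstructured data lifetime, e.g. wrapped around allocation and deallocation of an array; the surrounding code is assumed:

double *A = (double*) malloc(n * sizeof(double));
#pragma acc enter data create(A[0:n])   // allocate the device copy; it stays until exit data

// ... kernels that use present(A[0:n]) run here ...

#pragma acc exit data copyout(A[0:n])   // copy the result back and free the device copy
free(A);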

Member of the Helmholtz Association 26 March 2018 Slide 54 77 Parallel Jacobi II TASK 3 More parallelism, Data locality

Add OpenACC parallelism to the other loops of the while loop (L:123 – L:141)
Use either kernels or parallel
Add data regions such that all data resides on the device during the iterations

Task 3: More Parallel Loops
  Change to Task3/ directory
  Change source code; see README.md
  Compile: make
  Submit parallel run to the batch system: make run
  ? What's your speed-up?
  E Change the order of the for loops!

Change to Task3/ directory Change source code; see README.md Compile: make Submit parallel run to the batch system: make run ? What’s your speed-up? E Change order of for loop!

Member of the Helmholtz Association 26 March 2018 Slide 55 77 Parallel Jacobi II Source Code

105 #pragma acc data copy(A[0:nx*ny]) copyin(rhs[0:nx*ny]) create(Anew[0:nx*ny])
106 while ( error > tol && iter < iter_max )
107 {
108     error = 0.0;
109
110     // Jacobi kernel
111     #pragma acc parallel loop reduction(max:error)
112     for (int ix = ix_start; ix < ix_end; ix++)
113     {
114         for (int iy = iy_start; iy < iy_end; iy++)
115         {
116             Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
117                                      + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix] ));
118             error = fmaxr( error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
119         }
120     }
121
122     // A <-> Anew
123     #pragma acc parallel loop
124     for (int iy = iy_start; iy < iy_end; iy++)
125         // …
126 }

Member of the Helmholtz Association 26 March 2018 Slide 56 77 Parallel Jacobi II Compilation result

$ make
pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60,managed poisson2d_reference.c -o poisson2d_reference.o
poisson2d.c:
main:
    105, Generating copyin(rhs[:ny*nx])
         Generating create(Anew[:ny*nx])
         Generating copy(A[:ny*nx])
    111, Accelerator kernel generated
         Generating Tesla code
        111, Generating reduction(max:error)
        112, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
        114, #pragma acc loop seq
    114, Complex loop carried dependence of Anew-> prevents parallelization
         Loop carried dependence of Anew-> prevents parallelization

Member of the Helmholtz Association 26 March 2018 Slide 57 77 Parallel Jacobi II Run result

$ make run
bsub -I -R "rusage[ngpus_shared=1]" ./poisson2d
Job <4444> is submitted to default queue .
<>
<>
Jacobi relaxation calculation: max 500 iterations on 2048 x 2048 mesh
Calculate reference solution and time with serial CPU execution.
    0, 0.249999
  100, 0.249760
  200, 0...
Calculate current execution.
    0, 0.249999
  100, 0.249760
  200, 0...
2048x2048: Ref: 53.7294 s, This: 0.3775 s, speedup: 142.33

Member of the Helmholtz Association 26 March 2018 Slide 58 77 Parallelization Workflow Identify available parallelism

Parallelize loops with OpenACC

Optimize data locality

Optimize loop performance

Member of the Helmholtz Association 26 March 2018 Slide 59 77 Parallel Jacobi II+ E Expert Task

Improve the memory access pattern: loop order in the main loop
  ix accesses consecutive memory locations
  iy accesses offset memory locations
→ Change the loop order so that ix becomes the inner (vector) loop; more on OpenACC thread configuration in the Appendix

#pragma acc parallel loop reduction(max:error)
for (int iy = iy_start; iy < iy_end; iy++) {
    #pragma acc loop vector
    for (int ix = ix_start; ix < ix_end; ix++) {
        Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                                 + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix]));
        // ...
    }
}

Run result with the GPU loop order fixed:

$ make run
[...]
2048x2048: Ref: 69.0022 s, This: 0.2680 s, speedup: 257.52

Also fix the loop order in the CPU reference version:

$ make run
[...]
2048x2048: Ref: 20.3076 s, This: 0.2602 s, speedup: 78.04

Member of the Helmholtz Association 26 March 2018 Slide 60 77 Aside: Data Transfer with NVLink

One feature of Minsky not showcased in the tutorial: NVLink between CPU and GPU
PCI-E: < 16 GB/s; NVLink: < 40 GB/s

Task 3 on P100 + PCI-E:

$ nvprof ./poisson2d
2048x2048: Ref: 73.1076 s, This: 0.4600 s, speedup: 158.93
Device "Tesla P100-PCIE-12GB (0)"
Count  Avg Size   Min Size   Max Size   Total Size   Total Time   Name
657    149.63KB   4.0000KB   0.9844MB   96.00000MB   9.050452ms   Host To Device
193    169.78KB   4.0000KB   0.9961MB   32.00000MB   2.679974ms   Device To Host

Task 3 on P100 + NVLink:

2048x2048: Ref: 49.7252 s, This: 0.5574 s, speedup: 89.21
Device "Tesla P100-SXM2-16GB (0)"
Count  Avg Size   Min Size   Max Size   Total Size   Total Time   Name
480    204.80KB   64.000KB   960.00KB   96.00000MB   3.325184ms   Host To Device
160    204.80KB   64.000KB   960.00KB   32.00000MB   1.102954ms   Device To Host

Member of the Helmholtz Association 26 March 2018 Slide 61 77 Parallelization Workflow Identify available parallelism

Parallelize loops with OpenACC

Optimize data locality

Optimize loop performance


Member of the Helmholtz Association 26 March 2018 Slide 62 77 OpenACC on Multiple GPUs

Member of the Helmholtz Association 26 March 2018 Slide 63 77 Message Passing Interface Introduction

MPI: Message Passing Interface
Standardized API to communicate data across processes and nodes
Various implementations: OpenMPI, MPICH, MVAPICH, vendor-specific versions
Standard in parallel and distributed High Performance Computing
Unrelated to OpenACC, but works well together!
→ www.open-mpi.org/doc/

Member of the Helmholtz Association 26 March 2018 Slide 64 77 MPI API Examples

Configuration calls
  MPI_Comm_size()  Get number of total processes
  MPI_Comm_rank()  Get current process number
Point-to-point routines
  MPI_Send()       Send data to another process
  MPI_Recv()       Receive data from another process
  MPI_Sendrecv()   Do both in one call
Collective routines
  MPI_Bcast()      Broadcast data from one process to all others
  MPI_Reduce()     Reduce (e.g. sum) values on all processes
  MPI_Allgather()  Gather data from all processes, distribute to all
And many, many more!

Member of the Helmholtz Association 26 March 2018 Slide 65 77 MPI Skeleton

#include <mpi.h>

int main(int argc, char *argv[]) {
    // Initialize MPI
    MPI_Init(&argc, &argv);

    int rank, size;
    // Get current rank ID
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // Get total number of ranks
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Do something (call MPI routines, ...)
    ...

    // Shutdown MPI
    MPI_Finalize();
    return 0;
}
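As a sketch of what the "do something" part could be (this is not part of the lab code): a minimal ring exchange with MPI_Sendrecv(), where every rank sends its rank number to its right neighbor; add #include <stdio.h> for the printf.

int right = (rank + 1) % size;
int left  = (rank - 1 + size) % size;
int send  = rank, recv = -1;
// send to the right neighbor and receive from the left one in a single call
MPI_Sendrecv(&send, 1, MPI_INT, right, 0,
             &recv, 1, MPI_INT, left,  0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
printf("Rank %d received %d from rank %d\n", rank, recv, left);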

Member of the Helmholtz Association 26 March 2018 Slide 66 77 Using MPI

Compile with MPI compiler (wrapper around usual compiler)

$ mpicc -o myapp myapp.c

Run with MPI launcher mpirun (takes care about configuration, $VARS, …)

$ mpirun -np 4 ./myapp

Rank 0 Rank 1 Rank 2 Rank 3 myapp myapp myapp myapp

Member of the Helmholtz Association 26 March 2018 Slide 67 77 MPI Strategy for Jacobi Solver

Goal: extend the parallelization from GPU threads to multiple GPUs
Distribute the grid of points across the GPUs [Diagram: domain split into horizontal stripes on GPU 0 … GPU 3]
Halo points need special consideration; that's what makes things interesting here
An evaluated point needs data from neighboring points → at the border, that data might be on a different GPU: halos!
For every iteration step: update the halo from the other GPU
⇒ Regular MPI communication to the top and bottom neighbors

MPI_Sendrecv( A+iy_start*nx+1,     nx-2, MPI_DOUBLE, top,    0,
              A+iy_end*nx+1,       nx-2, MPI_DOUBLE, bottom, 0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MPI_Sendrecv( A+(iy_end-1)*nx+1,   nx-2, MPI_DOUBLE, bottom, 0,
              A+(iy_start-1)*nx+1, nx-2, MPI_DOUBLE, top,    0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE);

Member of the Helmholtz Association 26 March 2018 Slide 68 77 Determining GPU ID Affinity on nodes with multiple GPUs

Problem: Usually, nodes have more than one GPU How would MPI know how to distribute the load? Select active GPU with #pragma acc set device_num(ID) Alternative and more in appendix
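A sketch of how this could look in an MPI+OpenACC program; taking the rank modulo the number of GPUs per node is an assumption for the single-node case (the appendix shows the more general API-based variant):

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
// 4 GPUs per Minsky node: map ranks round-robin onto the devices
#pragma acc set device_num(rank % 4)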

Member of the Helmholtz Association 26 March 2018 Slide 69 77 Parallel Jacobi III TASK 4 Multi-GPU parallelism, asynchronous execution

Implement domain decomposition for 4 GPUs Task 4: Multi-GPU Usage

Change to Task4/ directory
Change source code; see README.md
Compile: make
Submit parallel run to the batch system: make run
? What's your speed-up?
E Implement asynchronous halo communication; see README.md in Task4E/!

Member of the Helmholtz Association 26 March 2018 Slide 70 77 Parallel Jacobi III Source Code

#pragma acc set device_num(rank)
// ...
int iy_start = rank * chunk_size;
int iy_end   = iy_start + chunk_size;
// ...
MPI_Sendrecv( A+iy_start*nx+ix_start,     (ix_end-ix_start), MPI_REAL_TYPE, top,    0,
              A+iy_end*nx+ix_start,       (ix_end-ix_start), MPI_REAL_TYPE, bottom, 0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );
MPI_Sendrecv( A+(iy_end-1)*nx+ix_start,   (ix_end-ix_start), MPI_REAL_TYPE, bottom, 0,
              A+(iy_start-1)*nx+ix_start, (ix_end-ix_start), MPI_REAL_TYPE, top,    0,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE );

Member of the Helmholtz Association 26 March 2018 Slide 71 77 Parallel Jacobi III MPI run result

$ make run
$ bsub -env "all" -n 4 -I -R "rusage[ngpus_shared=1]" mpirun --npersocket 2 -bind-to core -np 4 ./poisson2d 1000 4096
Job <15145> is submitted to queue .
Jacobi relaxation calculation: max 1000 iterations on 4096 x 4096 mesh
Calculate reference solution and time with MPI-less 1 device execution.
    0, 0.250000
  100, 0.249940
[...]
Calculate current execution.
    0, 0.250000
[...]
Num GPUs: 4.
4096x4096: 1 GPU: 1.8621 s, 4 GPUs: 0.6924 s, speedup: 2.69, efficiency: 67.23%
MPI time: 0.1587 s, inter GPU BW: 0.77 GiB/s

Member of the Helmholtz Association 26 March 2018 Slide 72 77 Overlap Communication and Computation Disentangling

[Timeline: without overlap, each iteration processes the entire domain and then does MPI communication; with overlap, the boundary is processed first and the MPI communication runs concurrently with processing of the inner domain → gain.]

Member of the Helmholtz Association 26 March 2018 Slide 73 77 Overlap Communication and Computation E OpenACC keyword

OpenACC: enable asynchronous execution with the async keyword
The runtime will execute the async'ed region at the same time
Barrier: wait

#pragma acc parallel loop present(A, Anew)
for ( ... ) { }   // Process boundary

#pragma acc parallel loop present(A, Anew) async
for ( ... ) { }   // Process inner domain

#pragma acc host_data use_device(A)
{
    MPI_Sendrecv(A+iy_start*nx+1,     nx-2, MPI_DOUBLE, top,    0,
                 A+iy_end*nx+1,       nx-2, MPI_DOUBLE, bottom, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(A+(iy_end-1)*nx+1,   nx-2, MPI_DOUBLE, bottom, 1,
                 A+(iy_start-1)*nx+1, nx-2, MPI_DOUBLE, top,    1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
#pragma acc wait   // Wait for inner domain to finish processing

Member of the Helmholtz Association 26 March 2018 Slide 74 77 Parallel Jacobi III+ E MPI async run result

$ make run
$ bsub -env "all" -n 4 -I -R "rusage[ngpus_shared=1]" mpirun --npersocket 2 -bind-to core -np 4 ./poisson2d 1000 4096
Job <15145> is submitted to queue .
Jacobi relaxation calculation: max 1000 iterations on 4096 x 4096 mesh
Calculate reference solution and time with MPI-less 1 device execution.
    0, 0.250000
  100, 0.249940
[...]
Calculate current execution.
    0, 0.250000
[...]
Num GPUs: 4.
4096x4096: 1 GPU: 1.8656 s, 4 GPUs: 0.6424 s, speedup: 2.90, efficiency: 72.61%
MPI time: 0.2455 s, inter GPU BW: 0.50 GiB/s

Member of the Helmholtz Association 26 March 2018 Slide 75 77 Conclusions, Summary

Member of the Helmholtz Association 26 March 2018 Slide 76 77 Conclusions & Summary We’ve learned a lot today!

Minsky nodes are fat nodes: 2 POWER8NVL CPUs (2 × 10 cores), 4 P100 GPUs (4 × 56 SMs)
OpenACC can be used to efficiently exploit parallelism
  … on the CPU, similar to OpenMP,
  … on the GPU, for which it is specially designed,
  … on multiple GPUs, working well together with MPI.
There are still many more tuning possibilities and keywords not mentioned (time…)
→ Great online resources to deepen your knowledge (see Further Reading)
Thank you for your attention! ([email protected])
Please submit the feedback form!

Member of the Helmholtz Association 26 March 2018 Slide 77 77 APPENDIX

Member of the Helmholtz Association 26 March 2018 Slide 1 34 Appendix List of Tasks Supplemental: POWER9 Structure Diagrams Supplemental: JURON Login via SSH Supplemental: Summitdev Login Supplemental: NVIDIA GPU Memory Spaces Supplemental: Leveraging OpenACC Threads Supplemental: MPI Further Reading Glossary References

Member of the Helmholtz Association 26 March 2018 Slide 2 34 List of Tasks

Task 1: JURON Task 2: A First Parallel Loop Task 3: More Parallel Loops Task 4: Multi-GPU Usage

Member of the Helmholtz Association 26 March 2018 Slide 3 34 Supplemental: POWER9 Structure Diagrams

Member of the Helmholtz Association 26 March 2018 Slide 4 34 POWER9 Structure Diagram

[Figure 2-3: POWER9 chip external connectivity, showing the aggregated bandwidths of the DDR4 memory channels, the SMP interconnect, the PCIe Gen4 lanes, and the NVLink 2.0 Bricks]
Member of the Helmholtz Association 26 March 2018 Slide 5 34

Faster DDR4 memory DIMMs at 2666 MHz are connected to two memory controllers via eight channels with a total bandwidth of 120 GB/s. Symmetric Multiprocessing chip-to-chip interconnect is done via a four channel SMP bus with 64 GB/s bidirectional bandwidth.

The latest PCIe Gen4 interconnect doubles the channel bandwidth of the previous PCIe Gen3 generation, allowing the 48 PCIe channels to drive a total of 192 GB/s of bidirectional bandwidth between I/O adapters and the POWER9 chip.

The connection between GPUs and between CPUs and GPUs is done via a link called NVLINK 2.0, developed by IBM, NVIDIA and the OpenPOWER Foundation. This link provides up to 5x more communication bandwidth between CPUs and GPUs (when compared to traditional PCIe Gen3) and allows for faster data transfer between memory and GPUs and between GPUs. Complex and data hungry algorithms like the ones used in machine learning can benefit from having these enlarged pipelines for data transfer once the amount of data needed to be processed is many times larger than the GPU internal memory. For more information about NVLINK 2.0 please see 2.4.5, “NVLINK 2.0” on page 28.

Each POWER9 CPU and each GPU have six NVLINK channels, called Bricks, each one delivering up to 50 GB/s bi-directional bandwidth. These channels can be aggregated to allow for more bandwidth or more peer to peer connections.

The Figure 2-4 on page 13 compares the POWER9 implementation of NVLINK 2.0 with traditional processor chips using PCIe and NVLINK.

Newell Structure Diagram

[Figure: Power AC922 logical system diagram. Two POWER9 CPUs connected via a 64 GB/s X bus; each CPU with eight DDR4 DIMMs (15 GB/s per channel) and three NVIDIA Volta GPUs attached via NVLink 2.0 Bricks (50 GB/s per channel, 100 GB/s aggregated bandwidth over 2 Bricks); PCIe Gen4 slots (partly CAPI-capable), internal storage, BMC, USB, and Ethernet.]

Figure 2-6 The Power AC922 server model 8335-GTW logical system diagram Member of the Helmholtz Association 26 March 2018 Slide 6 34

2.2 Processor subsystem

This section introduces the latest processor in the Power Systems product family and describes its main characteristics and features in general.

The POWER9 processor in the Power AC922 server is the latest generation of the POWER processor family. Based on the 14 nm FinFET Silicon-On-Insulator (SOI) architecture, the chip size is 685 mm x 685 mm and contains eight billion transistors.

2.2.1 POWER9 processor overview

The POWER9 chip has four variations, depending on the server scalability and whether they are being used in Linux ecosystem designed servers or PowerVM ecosystem servers.

The main differences reside in the scalability, maximum core count, SMT capability, and memory connection. Table 2-1 on page 16 compares the chip variations.

Supplemental: JURON Login via SSH

Member of the Helmholtz Association 26 March 2018 Slide 7 34 JURON Login via SSH

Download SSH key from http://bit.ly/gtc18-openacc; OpenSSH (Linux, Mac, Windows) and PuTTY (Windows) keys provided
Set the right permissions on the key: chmod 600 mykey
Unlock the key with the password from the slip of paper
Log in to JURON: ssh -i id_train0XX [email protected]
Prevent entering the passphrase multiple times: add the key to the SSH agent
  eval "$(ssh-agent -s)"
  ssh-add train0XX

Member of the Helmholtz Association 26 March 2018 Slide 8 34 Supplemental: Summitdev Login

Member of the Helmholtz Association 26 March 2018 Slide 9 34 Using Summitdev

Summitdev: access via RSA tokens
Log in first to home.ccs.ornl.gov, then to summitdev (docs)
  First login: connect with the key on the RSA token; set a PIN (4-6 digits); confirm with PIN followed by RSA passphrase
  All other logins: connect with PIN followed by RSA passphrase
Check out the lab repository: git clone -b summitdev https://gitlab.version.fz-juelich.de/herten1/gtc18-openacc.git
Load required modules: module load pgi/18.1 cuda
Allocate resources on compute nodes: bsub -nnodes 1 -W 120 -P "TRN001" -Is SHELL
Run jobs: jsrun -n1 [...] ./prog (docs)

Member of the Helmholtz Association 26 March 2018 Slide 10 34 Supplemental: NVIDIA GPU Memory Spaces

Member of the Helmholtz Association 26 March 2018 Slide 11 34 CUDA 4.0 Unified Virtual Addressing: pointer from same address pool, but data copy manual CUDA 6.0 Unified Memory*: Data copy by driver, but whole data at once (Kepler) CUDA 8.0 Unified Memory (truly): Data copy by driver, page faults on-demand initiate data migrations (Pascal)

NVIDIA GPU Memory Spaces: Location, location, location

At the beginning: CPU and GPU memory are very distinct, with their own addresses
[Diagram: GPU (SMs with scheduler, L2 cache) and CPU (L2 cache, CPU memory DRAM) connected by an interconnect; with Unified Memory both address one unified memory]

Member of the Helmholtz Association 26 March 2018 Slide 12 34 Supplemental: Leveraging OpenACC Threads

Member of the Helmholtz Association 26 March 2018 Slide 13 34 Understanding Compiler Output

110, Accelerator kernel generated
     Generating Tesla code
    110, Generating reduction(max:error)
    111, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    114, #pragma acc loop seq
114, Complex loop carried dependence of Anew-> prevents parallelization

110 #pragma acc parallel loop reduction(max:error)
111 for (int ix = ix_start; ix < ix_end; ix++)
112 {
113     // Inner loop
114     for (int iy = iy_start; iy < iy_end; iy++)
115     {
116         Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                                     + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix] ));
117         error = fmaxr( error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
118     }
119 }

Member of the Helmholtz Association 26 March 2018 Slide 14 34 Understanding Compiler Output


Outer loop: parallelism with gang and vector
Inner loop: sequential per thread (#pragma acc loop seq)
The inner loop was never parallelized!
Rule of thumb: expose as much parallelism as possible

Member of the Helmholtz Association 26 March 2018 Slide 14 34 OpenACC Parallelism 3 Levels of Parallelism

Vector: vector threads work in lockstep (SIMD/SIMT parallelism)
Worker: has 1 or more vectors; workers share a common resource (e.g. cache)
Gang: has 1 or more workers; multiple gangs work independently from each other
[Diagram: a gang contains workers, each worker contains vector lanes]

Member of the Helmholtz Association 26 March 2018 Slide 15 34 CUDA Parallelism CUDA Execution Model Software Hardware

Scalar threads: executed by scalar processors (CUDA cores)
Thread blocks: executed on multiprocessors (SMs); do not migrate; several concurrent thread blocks can reside on one multiprocessor (limit: multiprocessor resources such as register file and shared memory)
Kernel: launched as a grid of thread blocks; blocks and grids can have multiple dimensions

Member of the Helmholtz Association 26 March 2018 Slide 16 34 From OpenACC to CUDA map(||acc,||<<<>>>)

In general: the compiler is free to do what it thinks is best
Usually:
  gang    Mapped to blocks (coarse grain)
  worker  Mapped to threads (fine grain)
  vector  Mapped to threads (fine SIMD/SIMT)
  seq     No parallelism; sequential
Exact mapping is compiler dependent
Performance tips: use a vector size divisible by 32; block size = num_workers × vector_length

Member of the Helmholtz Association 26 March 2018 Slide 17 34 Declaration of Parallelism Specify configuration of threads

Three clauses of parallel region (parallel, kernels) for changing distribution/configuration of group of threads Presence of keyword: Distribute using this level Optional size: Control size of parallel entity

 OpenACC: gang worker vector

#pragma acc parallel loop gang vector Also: worker Size: num_gangs(n), num_workers(n), vector_length(n)
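Applied to the Jacobi loop nest, this could look like the following sketch; the concrete sizes here are chosen for illustration only:

#pragma acc parallel loop gang num_gangs(512) vector_length(128)
for (int iy = iy_start; iy < iy_end; iy++) {
    #pragma acc loop vector
    for (int ix = ix_start; ix < ix_end; ix++) {
        Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                                 + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix] ));
    }
}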

Member of the Helmholtz Association 26 March 2018 Slide 18 34 Understanding Compiler Output II

110, Accelerator kernel generated Generating Tesla code 110, Generating reduction(max:error) 111, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */ 114, #pragma acc loop seq 114, Complex loop carried dependence of Anew-> prevents parallelization

Compiler reports configuration of parallel entities Gang mapped to blockIdx.x Vector mapped to threadIdx.x Worker not used Here: 128 threads per block; as many blocks as needed 128 seems to be default for Tesla/NVIDIA

Member of the Helmholtz Association 26 March 2018 Slide 19 34 More Parallelism Compiler Output

$ make
pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc60 poisson2d.c poisson2d_reference.o -o poisson2d
poisson2d.c:
main:
    104, Generating create(Anew[:ny*nx])
         Generating copyin(rhs[:ny*nx])
         Generating copy(A[:ny*nx])
    110, Accelerator kernel generated
         Generating Tesla code
        110, Generating reduction(max:error)
        111, #pragma acc loop gang /* blockIdx.x */
        114, #pragma acc loop vector(128) /* threadIdx.x */
...

Member of the Helmholtz Association 26 March 2018 Slide 20 34 Memory Coalescing Memory in batch

Coalesced access: good
  Threads of a warp (group of 32 contiguous threads) access adjacent words
  Few transactions, high utilization
Uncoalesced access: bad
  Threads of a warp access scattered words
  Many transactions, low utilization
Best performance: threadIdx.x should access contiguously
[Diagram: warp lanes 0, 1, …, 31 accessing adjacent words vs. scattered words]
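A sketch contrasting the two patterns for a row-major C array (out, in, nx, ny assumed); which index ends up on threadIdx.x is exactly what the loop-order optimization in the main part exploits:

// Coalesced: consecutive vector lanes (ix) touch consecutive addresses
#pragma acc parallel loop gang
for (int iy = 0; iy < ny; iy++) {
    #pragma acc loop vector
    for (int ix = 0; ix < nx; ix++) {
        out[iy*nx + ix] = in[iy*nx + ix];
    }
}

// Uncoalesced: consecutive vector lanes (iy) touch addresses nx elements apart
#pragma acc parallel loop gang
for (int ix = 0; ix < nx; ix++) {
    #pragma acc loop vector
    for (int iy = 0; iy < ny; iy++) {
        out[iy*nx + ix] = in[iy*nx + ix];
    }
}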

Member of the Helmholtz Association 26 March 2018 Slide 21 34 Supplemental: MPI

Member of the Helmholtz Association 26 March 2018 Slide 22 34 Handling Multi-GPU Hosts The alternative

Use the OpenACC API to select a GPU:

#if _OPENACC
acc_device_t device_type = acc_get_device_type();  // Get device type
int ngpus = acc_get_num_devices(device_type);      // Get number of devices
int devicenum = rank % ngpus;                      // Compute active device number based on rank
acc_set_device_num(devicenum, device_type);
#endif /*_OPENACC*/

Get the rank ID
  MPI API: MPI_Comm_rank()
  Environment variables (int rank = atoi(getenv(...))):
    OpenMPI:  $OMPI_COMM_WORLD_LOCAL_RANK
    MVAPICH2: $MV2_COMM_WORLD_LOCAL_RANK

Member of the Helmholtz Association 26 March 2018 Slide 23 34 Further Reading

Member of the Helmholtz Association 26 March 2018 Slide 24 34 Further Resources on OpenACC

www.openacc.org: Official home page of OpenACC
developer.nvidia.com/openacc-courses: OpenACC courses, upcoming (live) and past (recorded)
https://nvidia.qwiklab.com/quests/3: Qwiklabs for OpenACC; various levels
Book: Chandrasekaran and Juckeland, OpenACC for Programmers: Concepts and Strategies, https://www.amazon.com/OpenACC-Programmers-Strategies-Sunita-Chandrasekaran/dp/0134694287 [11]
Book: Farber, Parallel Programming with OpenACC, https://www.amazon.com/Parallel-Programming-OpenACC-Rob-Farber/dp/0124103979 [12]

Member of the Helmholtz Association 26 March 2018 Slide 25 34 Glossary I

API A programmatic interface to software by well-defined functions. Short for application programming interface. 75, 76, 113

CUDA Computing platform for GPUs from NVIDIA. Provides, among others, CUDA C/C++. 25, 100, 101, 107

GCC The GNU Compiler Collection, the collection of open source compilers, among others for C and Fortran. 24, 27

JULIA One of the two HBP pilot system in Jülich; name derived from Juelich and Glia. 9 JURON One of the two HBP pilot system in Jülich; name derived from Juelich and Neuron. 3, 4, 9, 12, 13, 14, 15, 16, 17

Member of the Helmholtz Association 26 March 2018 Slide 26 34 Glossary II

MPI The Message Passing Interface, an API definition for multi-node computing. 75, 76, 77, 78, 79, 80, 83, 86, 88, 90, 112, 113

NVIDIA US technology company creating GPUs. 4, 5, 6, 20, 90, 99, 100, 101, 116 NVLink NVIDIA’s communication protocol connecting CPU ↔ GPU and GPU ↔ GPU with high bandwidth. 5, 7, 8, 10, 71, 116

OpenACC Directive-based programming, primarily for many-core machines. 2, 3, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 33, 38, 39, 41, 43, 45, 50, 54, 55, 57, 58, 61, 62, 66, 67, 68, 69, 70, 72, 73, 75, 85, 88, 90, 102, 105, 107, 108, 113, 115 OpenMP Directive-based programming, primarily for multi-threaded machines. 21, 45, 49, 52, 88

Member of the Helmholtz Association 26 March 2018 Slide 27 34 Glossary III

P100 A large GPU with the Pascal architecture from NVIDIA. It employs NVLink as its interconnect and has fast HBM2 memory. 6, 7, 8, 71, 88 PAPI The Performance API, a C/C++ API for querying performance counters. 34 Pascal GPU architecture from NVIDIA (announced 2016). 10, 100, 101, 116 perf Part of the Linux kernel which facilitates access to performance counters; comes with command line utilities. 34 PGI Compiler creators. Formerly The Portland Group, Inc.; since 2013 part of NVIDIA. 24, 27, 34, 49 POWER CPU architecture from IBM, earlier: PowerPC. See also POWER8. 2, 3, 4, 5, 7, 8, 10, 90, 92, 93, 116

Member of the Helmholtz Association 26 March 2018 Slide 28 34 Glossary IV

POWER8 Version 8 of IBM’s POWERprocessor, available also under the OpenPOWER Foundation. 10, 88, 116

Tesla The GPU product line for general purpose computing computing of NVIDIA. 6

Volta GPU architecture from NVIDIA (announced 2017). 10

CPU Central Processing Unit. 2, 3, 5, 6, 7, 8, 10, 24, 45, 56, 69, 71, 88, 100, 101, 116

GPU Graphics Processing Unit. 2, 3, 4, 5, 6, 7, 8, 19, 24, 26, 54, 56, 57, 59, 71, 79, 80, 81, 88, 90, 99, 100, 101, 113, 116

HBP Human Brain Project. 9, 116

Member of the Helmholtz Association 26 March 2018 Slide 29 34 Glossary V

SM Streaming Multiprocessor. 7, 10, 88 SMT Simultaneous Multithreading. 7

Member of the Helmholtz Association 26 March 2018 Slide 30 34 References I

[7] The Next Platform. Power9 To The People. POWER9 Performance Data. URL: https://www.nextplatform.com/2017/12/05/power9-to-the-people/. [8] Alexandre Bicas Caldeira. IBM Power System AC922: Introduction and Technical Overview. IBM Redbooks. URL: http://www.redbooks.ibm.com/redpieces/pdfs/redp5472.pdf (pages 93, 94). [10] Donald E. Knuth. “Structured Programming with Go to Statements”. In: ACM Comput. Surv. 6.4 (Dec. 1974), pp. 261–301. ISSN: 0360-0300. DOI: 10.1145/356635.356640. URL: http://doi.acm.org/10.1145/356635.356640 (page 34).

Member of the Helmholtz Association 26 March 2018 Slide 31 34 References II

[11] Sunita Chandrasekaran and Guido Juckeland. OpenACC for Programmers: Concepts and Strategies. Addison-Wesley Professional, 2017. ISBN: 0134694287. URL: https://www.amazon.com/OpenACC-Programmers-Strategies-Sunita- Chandrasekaran/dp/0134694287 (page 115). [12] Rob Farber. Parallel Programming with OpenACC. Morgan Kaufmann, 2016. ISBN: 0124103979. URL: https://www.amazon.com/Parallel-Programming-OpenACC- Rob-Farber/dp/0124103979 (page 115).

Member of the Helmholtz Association 26 March 2018 Slide 32 34 References: Images, Graphics I

[1] SpaceX. SpaceX Launch. Freely available at Unsplash. URL: https://unsplash.com/photos/uj3hvdfQujI. [2] Forschungszentrum Jülich. Hightech made in 1960: A view into the control room of DIDO. URL: http://historie.fz-juelich.de/60jahre/DE/Geschichte/1956- 1960/Dekade/_node.html (page 4). [3] Forschungszentrum Jülich. Forschungszentrum Bird’s Eye. (Page 4). [4] Forschungszentrum Jülich. JUQUEEN Supercomputer. URL: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/ JUQUEEN/JUQUEEN_node.html (page 4).

Member of the Helmholtz Association 26 March 2018 Slide 33 34 References: Images, Graphics II

[5] Rob984 via Wikimedia Commons. Europe orthographic Caucasus Urals boundary (with borders). URL: https://commons.wikimedia.org/wiki/File: Europe_orthographic_Caucasus_Urals_boundary_(with_borders).svg (page 4). [6] IBM AIXpert Blog. IBM Minsky Picture. URL: https://www.ibm.com/developerworks/community/blogs/aixpert/entry/ OpenPOWER_IBM_S822LC_for_HPC_Minsky_First_Look?lang=en (page 6). [9] Setyo Ari Wibowo. Ask. URL: https://thenounproject.com/term/ask/1221810.

Member of the Helmholtz Association 26 March 2018 Slide 34 34