OpenCL in Scientific High Performance Computing — The Good, the Bad, and the Ugly

Matthias Noack [email protected]

Zuse Institute Berlin, Distributed Algorithms and Supercomputing

2017-05-18, IWOCL 2017, https://doi.org/10.1145/3078155.3078170

Context

ZIB hosts part of the HLRN (Northern German Supercomputing Alliance) facilities. Systems:
• 2× Cray XC40 (#118 and #150 in the TOP500, year 4/5 of their lifetime) with Xeon CPUs
• 80-node Xeon Phi (KNL) Cray XC40 test-and-development system
• 2× 32-node InfiniBand cluster with Nvidia K40
• 2 test-and-development systems with AMD FirePro W8100

What I do:
• computer science research (methods)
• development of HPC codes
• evaluation of upcoming technologies
• consulting/training with system users

Why OpenCL? (aka: The Good)

Scientific HPC in a Nutshell
• tons of legacy code (FORTRAN) authored by domain experts ⇒ rather closed community ⇒ decoupled from computer science (ask a student about FORTRAN)
• highly conservative code owners ⇒ modern software engineering advances are picked up very slowly
• intra-node parallelism dominated by OpenMP (e.g. Intel) and CUDA (Nvidia) ⇒ vendor and tool dependencies ⇒ limited portability ⇒ multiple diverging code branches ⇒ hard to maintain
• inter-node communication = MPI
• hardware lifetime: 5 years
• software lifetime: multiple tens of years ⇒ outlives systems by far ⇒ aim for portability

Why OpenCL? (aka: The Good)

Goal for new code: Do not contribute to that situation!
• portability first (≠ performance portability) ⇒ OpenCL has the largest hardware coverage for intra-node programming ⇒ library only ⇒ no tool dependencies
• use modern techniques with a broad community (beyond HPC) ⇒ modern C++ for host code
• develop code interdisciplinarily ⇒ domain experts design the model ... ⇒ ... computer scientists the software

Target Hardware

vendor  | architecture     | device             | compute      | memory
Intel   | Haswell          | 2× Xeon E5-2680v3  | 0.96 TFLOPS  | 136 GiB/s
Intel   | Knights Landing  | Xeon Phi 7250      | 2.61 TFLOPS* | 490/115 GiB/s
AMD     | Hawaii           | FirePro W8100      | 2.1 TFLOPS   | 320 GiB/s
Nvidia  | Kepler           | Tesla K40          | 1.31 TFLOPS  | 480 GiB/s

* calculated with max. AVX frequency of 1.2 GHz: 2611.2 GFLOPS = 1.2 GHz × 68 cores × 8 SIMD × 2 VPUs × 2 FMA

COSIM - A Predictive Cometary Coma Simulation

Solve dust dynamics:

$$\vec{a}_{\mathrm{dust}}(\vec{r}) = \vec{a}_{\mathrm{gas\text{-}drag}} + \vec{a}_{\mathrm{grav}} + \vec{a}_{\mathrm{Coriolis}} + \vec{a}_{\mathrm{centrifugal}}
= \tfrac{1}{2}\, C\, \alpha\, N_{\mathrm{gas}}(\vec{r})\, m_{\mathrm{gas}}\, (\vec{v}_{\mathrm{gas}} - \vec{v}_{\mathrm{dust}})\,|\vec{v}_{\mathrm{gas}} - \vec{v}_{\mathrm{dust}}| - \nabla\phi(\vec{r}) - 2\,\vec{\omega}\times\vec{v}_{\mathrm{dust}} - \vec{\omega}\times(\vec{\omega}\times\vec{r})$$

Compare with data of 67P/Churyumov–Gerasimenko from spacecraft observations (corr 0.90, time 12:12).

[Figure: Panels 1-2: OSIRIS NAC image; Panels 3-4: simulation results; right image: ESA / C. Carreau/ATG medialab, CC BY-SA 3.0-igo]

HEOM - The Hierarchical Equations of Motion

Model for Open Quantum Systems
• understand the energy transfer in photo-active molecular complexes ⇒ e.g. photosynthesis
• millions of coupled ODEs
• hierarchical graph of matrices

$$\frac{\mathrm{d}\sigma_u}{\mathrm{d}t} = \frac{i}{\hbar}[H,\sigma_u]
- \sigma_u \sum_{b=1}^{B}\sum_{k}^{K-1} n_{u,(b,k)}\,\gamma(b,k)
- \sum_{b=1}^{B}\Big[\frac{2\lambda_b}{\beta\hbar^2\nu_b} - \sum_{k}^{K-1}\frac{c(b,k)}{\hbar\gamma(b,k)}\Big] V^{\times}_{s(b)} V^{\times}_{s(b)}\,\sigma_u
+ \sum_{b=1}^{B}\sum_{k}^{K-1} i\,V^{\times}_{s(b)}\,\sigma^{+}_{u,b,k}
+ \sum_{b=1}^{B}\sum_{k}^{K-1} n_{u,(b,k)}\,\theta_{MA(b,k)}\,\sigma^{-}_{(u,b,k)}$$

[Figure: hierarchical graph of auxiliary matrices. 0th level: ρ_{0,0}; 1st level: ρ_{1,0}, ρ_{0,1}; 2nd level: ρ_{2,0}, ρ_{1,1}, ρ_{0,2}; ...]

[Image by University of Copenhagen Biology Department]

M. Noack, F. Wende, K.-D. Oertel, OpenCL: There and Back Again, in High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches, Vol. 2, 2015

Interdisciplinary Workflow (domain experts → computer scientists)

Mathematical Model (ODEs, PDEs, graphs, ...)
→ High Level Prototype (Mathematica): domain scientist's tool, high level, symbolic solvers, arbitrary precision, very limited performance
→ OpenCL kernel within Mathematica: replace some code with OpenCL

Mathematica and OpenCL

(* Load OpenCL support *)
Needs["OpenCLLink`"]

(* Create an OpenCLFunction from source, kernel name, signature *)
doubleFun = OpenCLFunctionLoad["
  __kernel void doubleVec(__global mint* in, mint length) {
    /* mint: special OpenCL typedef matching the Mathematica integer type */
    int index = get_global_id(0);
    if (index < length)          /* NDRange can be larger than length */
      in[index] = 2 * in[index];
  }", "doubleVec", {{_Integer}, _Integer}, 256]

(* Create some input *)
vec = Range[20];

(* Call the function *)
doubleFun[vec, 20]  (* NDRange deduced from args and work-group size *)

https://reference.wolfram.com/language/OpenCLLink/guide/OpenCLLink.html

Interdisciplinary Workflow (domain experts → computer scientists)

Mathematical Model (ODEs, PDEs, graphs, ...)
→ High Level Prototype (Mathematica): domain scientist's tool, high level, symbolic solvers, arbitrary precision, very limited performance
→ OpenCL kernel within Mathematica: replace some code with OpenCL, compare results, figure out numerics, use accelerators in MM
→ C++ Host Application: start single node, OpenCL 1.2 for hotspots, modern C++ 11/14, CMake for building

OpenCL SDKs and Versions

name              | version | OpenCL version        | supported devices
Intel OpenCL SDK  | 16.1.1  | 1.2 (CPU), 2.1 (GPU)  | CPUs (up to AVX2), Intel GPUs
Intel OpenCL SDK  | 14.2    | 1.2                   | Xeon Phi (KNC)
Nvidia OpenCL     | CUDA 8  | 1.2 (exp. 2.0)        | Nvidia GPU
AMD APP SDK       | 3.0     | 2.0 (GPU), 1.2 (CPU)  | GPU, CPUs (AVX, FMA4, XOP)
PoCL              | 0.14    | 2.0                   | CPUs (LLVM, AVX-512)

Vendors seem not to be too enthusiastic about OpenCL:
• portable OpenCL still means version 1.2 (released Nov. 2011)
• Xeon Phi implementation discontinued by Intel, no AVX-512 support (yet?)
• partial OpenCL 2.0 support by Nvidia, introduced rather silently


Installation/Linking/Running

Platform and device selection:
⇒ simple, deterministic way: oclinfo tool ⇒ platform/device index (sketched below)

ICD loader mechanism (libOpenCL.so):
• OpenCL typically not pre-installed in HPC environments
• adding ICD files to /etc/OpenCL/ requires root ⇒ not all loaders support the OPENCL_VENDOR_PATH environment variable
• different ICDs report different platform/device order
• different order from different API paths (with or without context creation)
• SDK installation order matters
⇒ use the reference ICD loader, or avoid the ICD mechanism and link directly against the libraries listed in /etc/OpenCL/vendors/*.icd (libamdocl64.so, libintelocl.so, ...)
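Not from the slides: a minimal sketch of deterministic selection by platform/device index (the oclinfo tool itself is not shown here), assuming the Khronos C++ wrapper header CL/cl2.hpp.

#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/cl2.hpp>
#include <iostream>
#include <vector>

int main()
{
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);                         // order depends on installed ICDs

    for (std::size_t p = 0; p < platforms.size(); ++p) {   // print the enumeration once
        std::vector<cl::Device> devices;
        platforms[p].getDevices(CL_DEVICE_TYPE_ALL, &devices);
        for (std::size_t d = 0; d < devices.size(); ++d)
            std::cout << p << "/" << d << ": "
                      << platforms[p].getInfo<CL_PLATFORM_NAME>() << ", "
                      << devices[d].getInfo<CL_DEVICE_NAME>() << "\n";
    }

    // then select by fixed, user-provided indices (reproducible choice)
    const std::size_t platform_index = 0, device_index = 0;
    std::vector<cl::Device> devices;
    platforms.at(platform_index).getDevices(CL_DEVICE_TYPE_ALL, &devices);
    cl::Device device = devices.at(device_index);
    std::cout << "selected: " << device.getInfo<CL_DEVICE_NAME>() << "\n";
}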

Compilation

OpenCL header files: ⇒ avoid trouble: use the reference headers and ship them with the project

CMake: find_package(OpenCL REQUIRED)
• the OpenCL CMake module only works in some scenarios ⇒ the magic line:

mkdir build.intel_16.1.1
cd build.intel_16.1.1

cmake -DCMAKE_BUILD_TYPE=Release -DOpenCL_FOUND=True \
      -DOpenCL_INCLUDE_DIR=../../thirdparty/include/ \
      -DOpenCL_LIBRARY=/opt/intel/opencl_runtime_16.1.1/opt/intel/opencl-1.2-6.4.0.25/lib64/libintelocl.so ..

make -j

Handling Kernel Source Code

a) loading source files at runtime:
   ✓ no host-code recompilation
   ✓ #include directives
b) embedded source as string constant:
   ✓ self-contained executable for production use

[Diagram: header.cl is #include'd by kernel_source.cl; resolve_includes.sh and cl_to_hpp.sh generate kernel_source.hpp (tracked as a CMake dependency) by wrapping the source in a raw string literal, R"str_not_in_src( ... )str_not_in_src"; kernel_source.hpp is #include'd by the kernel wrapper class (.hpp/.cpp), which can alternatively load the .cl file at runtime via an alternative ctor. Both options are sketched below.]
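Not from the slides: a minimal sketch of the two options above, with illustrative names (embedded_kernel_source, doubleVec, load_kernel_source are assumptions, not the project's actual identifiers).

#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// option b): a header generated by something like cl_to_hpp.sh can embed the kernel
// source as a C++11 raw string literal (delimiter chosen so it cannot occur in the source)
static const char* embedded_kernel_source = R"str_not_in_src(
__kernel void doubleVec(__global int* in, int length) {
    int index = get_global_id(0);
    if (index < length) in[index] = 2 * in[index];
}
)str_not_in_src";

// option a): load the (include-resolved) .cl file at runtime, e.g. from an alternative ctor
std::string load_kernel_source(const std::string& filename)
{
    std::ifstream file(filename);
    if (!file)
        throw std::runtime_error("could not open kernel file: " + filename);
    std::ostringstream oss;
    oss << file.rdbuf();   // read the whole file
    return oss.str();      // hand the string to cl::Program / clCreateProgramWithSource
}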

Interdisciplinary Workflow (domain experts → computer scientists)

Mathematical Model (ODEs, PDEs, graphs, ...)
→ High Level Prototype (Mathematica): domain scientist's tool, high level, symbolic solvers, arbitrary precision, very limited performance
→ OpenCL kernel within Mathematica: replace some code with OpenCL, compare results, figure out numerics, use accelerators in MM
→ C++ Host Application: start single node, OpenCL 1.2 for hotspots, modern C++ 11/14, CMake for building
→ Distributed Host Application: scale to multiple nodes; partitioning, balancing, neighbour exchange, ...; wrapped MPI 3.0

OpenCL and Communication/MPI

Design Recommendation:
• keep both aspects as independent as possible
• design code to be agnostic to whether it works on a complete problem instance or on a partition
• implement hooks for communication between kernel calls (see the sketch below)
• wrap the needed part of MPI in a thin, exchangeable abstraction layer

Trade-offs:
• communication introduces local host-device transfers ⇒ scaling starts slowly, e.g. two nodes might be slower than one
• a single process might not be able to saturate the network ⇒ multiple processes per node sharing a device (CPU device: set the CPU mask)
• pick one: zero-copy buffers or overlapping compute and communication ⇒ either host (comm.) or device (comp.) owns the memory at any point in time ⇒ overlapping requires copies again
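Not from the slides: a minimal sketch of the "hooks for communication between kernel calls" idea; the type and function names are assumptions, and the actual codes call a thin MPI wrapper rather than MPI directly.

#include <functional>

struct time_stepper {
    // hook called between kernel invocations; a single-node run passes a no-op,
    // a distributed run passes a neighbour/halo exchange built on the wrapped MPI layer
    using exchange_hook = std::function<void()>;

    void run(int steps, const exchange_hook& exchange)
    {
        for (int i = 0; i < steps; ++i) {
            enqueue_step_kernel();   // OpenCL kernel works on whatever partition it was given
            exchange();              // host-device transfers and communication live here
        }
    }

private:
    void enqueue_step_kernel() { /* clEnqueueNDRangeKernel(...) on the local data */ }
};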

Data Transfer Paths

[Diagram: on each node, data moves between the OpenCL application code, the OpenCL driver, the fabric driver, device memory, and host memory: host buffer → pinned memory (cpy) → device memory via DMA, and pinned memory → fabric → RDMA to the remote node's pinned memory, and from there up to the remote device. The pinned-memory copies can be avoided in some cases with OpenCL (see the mapping sketch below); CUDA offers GPU-Direct RDMA directly between device memory and the fabric.]
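Not from the slides: a minimal sketch of how the extra pinned-buffer copy can be avoided in some cases, by letting the OpenCL runtime allocate host-accessible (typically pinned) memory and mapping it for the communication phase. As noted above, either the host or the device owns the memory at any point in time.

#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/cl2.hpp>
#include <cstddef>

// buffer created once elsewhere with runtime-allocated host-accessible memory, e.g.:
//   cl::Buffer buffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, bytes);
void exchange_via_mapped_buffer(cl::CommandQueue& queue, cl::Buffer& buffer, std::size_t bytes)
{
    // map: the host owns the memory now, the device must not touch it
    void* ptr = queue.enqueueMapBuffer(buffer, CL_TRUE /*blocking*/,
                                       CL_MAP_READ | CL_MAP_WRITE, 0, bytes);
    // ... MPI send/receive directly from/into 'ptr' ...
    queue.enqueueUnmapMemObject(buffer, ptr);   // hand ownership back to the device
}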

Benchmark Results: COSIM load imbalance (Xeon)

[Figure: COSIM runtime per iteration [s] vs. particles per compute node (0-1000) on 2× Xeon (Haswell) for the Intel OpenCL SDK, AMD APP SDK, and PoCL, at full scale and zoomed in. Annotated: steps every 32 particles; 384 workitems = 16 × 24 cores.]

Benchmark Results: COSIM load imbalance (Xeon Phi)

[Figure: COSIM runtime per iteration [s] vs. particles per compute node (0-1500) on Xeon Phi (KNL) for the Intel OpenCL SDK, AMD APP SDK, and PoCL, at full scale and zoomed in. Annotated: no PoCL data for multiples of 32 in this range.]

Benchmark Results: COSIM node imbalance, all

[Figure: COSIM runtime per iteration [s] vs. particles per compute node (0-1000) for all combinations of SDK (Intel OpenCL SDK, AMD APP SDK, PoCL) and hardware (2× Xeon, Xeon Phi), at full scale and zoomed in.]

Interdisciplinary Workflow (domain experts → computer scientists)

Mathematical Model (ODEs, PDEs, graphs, ...)
→ High Level Prototype (Mathematica): domain scientist's tool, high level, symbolic solvers, arbitrary precision, very limited performance
→ OpenCL kernel within Mathematica: replace some code with OpenCL, compare results, figure out numerics, use accelerators in MM
→ C++ Host Application: start single node, OpenCL 1.2 for hotspots, modern C++ 11/14, CMake for building
→ Distributed Host Application: scale to multiple nodes; partitioning, balancing, neighbour exchange, ...; wrapped MPI 3.0
→ Optimisation / Production Runs: always collect perf. data, profile/tune code, add performance tweaks, use device-specific kernel variants if needed

HEOM Benchmark Results: CPU SDK comparison

[Figure: OpenCL CPU SDK comparison on 2× Xeon (HSW) and on Xeon Phi (KNL): average kernel runtime [ms] for the fmo_22baths_d3.cfg and lhcii_1bath_d8.cfg workloads with the Intel, AMD, and PoCL SDKs.]

HEOM Benchmarks: Workitem Granularity on CPUs

[Figure: impact of work-item granularity (matrix vs. element) on 2× Xeon (HSW) and Xeon Phi (KNL): average kernel runtime [ms] for fmo_22baths_d3.cfg and lhcii_1bath_d8.cfg.]

HEOM Benchmarks: Workitem Granularity on GPUs

[Figure: impact of work-item granularity (matrix vs. element) on Tesla K40 and FirePro W8100: average kernel runtime [ms] for fmo_22baths_d3.cfg and lhcii_1bath_d8.cfg.]

HEOM Benchmarks: Performance Portability

[Figure: runtime comparison on different hardware (2× Xeon (HSW), Xeon Phi (KNL), Tesla K40, FirePro W8100) and performance portability relative to the Xeon, including the runtimes expected from FLOPS and from GiB/s: average kernel runtime [ms] for fmo_22baths_d3.cfg and lhcii_1bath_d8.cfg.]

Interdisciplinary Workflow (domain experts → computer scientists)

Mathematical Model (ODEs, PDEs, graphs, ...)
→ High Level Prototype (Mathematica): domain scientist's tool, high level, symbolic solvers, arbitrary precision, very limited performance
→ OpenCL kernel within Mathematica: replace some code with OpenCL, compare results, figure out numerics, use accelerators in MM
→ C++ Host Application: start single node, OpenCL 1.2 for hotspots, modern C++ 11/14, CMake for building
→ Distributed Host Application: scale to multiple nodes; partitioning, balancing, neighbour exchange, ...; wrapped MPI 3.0
→ Optimisation / Production Runs: always collect perf. data, profile/tune code, add performance tweaks, use device-specific kernel variants if needed


Conclusion

OpenCL:
• highest portability of the available programming models
• integrates well into the interdisciplinary workflow
• runtime compilation allows compiler optimisation with runtime constants (sketched below)
• performance portability is not for free, but ...
  ⇒ ... better to have two kernels than two programming models

HPC Wishlist:
• zero-copy buffers with shared ownership
• equivalents to CUDA's GPU-Direct and CUDA-aware MPI
• no way to specify memory alignment beyond the data type size of a kernel parameter
• @vendors: please keep up with the standard
• @Intel: AVX-512 / Xeon Phi support would be highly appreciated
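Not from the slides: a minimal sketch of the runtime-compilation point, with assumed kernel and macro names (MATRIX_DIM, heom_step). Values known only at runtime are passed as -D options to the OpenCL compiler, which can then treat them as compile-time constants.

#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/cl2.hpp>
#include <cstddef>
#include <string>

cl::Kernel build_kernel(const cl::Context& context, const cl::Device& device,
                        const std::string& source, std::size_t matrix_dim)
{
    cl::Program program(context, source);
    // runtime constants become compile-time constants inside the kernel (loop bounds, unrolling)
    const std::string options = "-DMATRIX_DIM=" + std::to_string(matrix_dim)
                              + " -cl-mad-enable";
    program.build({device}, options.c_str());
    return cl::Kernel(program, "heom_step");   // kernel name is illustrative
}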

EoP

Thank you.

Feedback? Questions? Ideas?

[email protected]

The author would like to thank the domain experts from the HEOM and COSIM teams for the many fruitful discussions on the OpenCL user experience and for their reports on struggles with the different implementations and target systems. This project was supported by the German Research Foundation (DFG), project RE 1389/8, and the North-German Supercomputing Alliance (HLRN).

https://doi.org/10.1145/3078155.3078170