OpenCL in Scientific High Performance Computing

Matthias Noack [email protected]

Zuse Institute Berlin, Distributed Algorithms and Supercomputing

2019-09-24, IXPUG Annual Conference 2019

Audience Survey

• Who has a rough idea what OpenCL is?
• Who has hands-on experience with OpenCL?
• Who is using OpenCL in a real-world code?
• . . . Why?
• . . . Why not?
• . . . What are you using instead?

The Situation with Scientific HPC

• tons of legacy code (FORTRAN) authored by domain experts
  ⇒ rather closed community, decoupled from computer science (ask a CS student about FORTRAN)
• highly conservative code owners
  ⇒ modern software engineering advances are picked up very slowly
• intra-node parallelism dominated by OpenMP (e.g. Intel) and CUDA (Nvidia)
  ⇒ vendor and tool dependencies ⇒ limited portability
  ⇒ multiple diverging code branches ⇒ hard to maintain
• inter-node communication = MPI
• hardware lifetime: 5 years, software lifetime: multiple tens of years
  ⇒ software outlives the systems by far ⇒ aim for portability

Do not contribute to that situation! What can we do better?

• put portability first (≠ performance portability)
  ⇒ OpenCL has the largest hardware coverage for intra-node programming
    (CPUs, GPUs, AI accelerators, FPGAs, . . . )
  ⇒ library only ⇒ no tool dependencies
• use modern techniques with a broad community (beyond HPC)
  ⇒ e.g. modern C++ for host code
  ⇒ e.g. CMake for building
• develop code interdisciplinarily
  ⇒ domain experts design the model . . .
  ⇒ . . . computer scientists the software

Big Picture: OpenCL, SYCL, SPIR-V and Intel One API

[Layer diagram, top to bottom:]
Application
Intel One API / Data Parallel C++ (DPC++)
SYCL
SPIR-V
OpenCL
CPU | GPU | FPGA | AI | ...

SYCL:
• higher-level programming model for OpenCL
• single source, standard C++ ⇒ SYCL compiler needed

SPIR-V:
• Standard Portable Intermediate Representation
• standardised intermediate language ⇒ based on LLVM-IR
• device-independent binaries for OpenCL
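
To illustrate what "single source, standard C++" means one layer above OpenCL, here is a minimal SYCL 1.2.1-style sketch (illustrative, not from the talk; the kernel name and setup are assumptions):

#include <CL/sycl.hpp>
#include <vector>

int main() {
    namespace sycl = cl::sycl;
    std::vector<int> data(1024, 1);

    sycl::queue queue{sycl::default_selector{}};   // pick some device
    {
        sycl::buffer<int, 1> buf(data.data(), sycl::range<1>(data.size()));
        queue.submit([&](sycl::handler& cgh) {
            auto acc = buf.get_access<sycl::access::mode::read_write>(cgh);
            // kernel and host code live in the same C++ translation unit
            cgh.parallel_for<class double_vec>(sycl::range<1>(data.size()),
                [=](sycl::id<1> i) { acc[i] *= 2; });
        });
    } // buffer destruction copies the data back to the host
    return 0;
}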

OpenCL (Open Computing Language) in a Nutshell

• open, royalty-free standard for cross-platform, parallel programming
• targets personal computers, servers, mobile devices and embedded platforms
• maintained by the Khronos Group
• first released: 2009-08-28

https://www.khronos.org/opencl/

OpenCL Platform and Memory Model

[Platform model: a Host controls one or more Compute Devices (CD); each Compute Device contains Compute Units (CU); each Compute Unit contains Processing Elements (PE).]

Memory Model:
• CD has device memory with global/constant address space
• CU has local memory address space
• PE has private memory address space
⇒ relaxed consistency
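
As a small illustration of these address spaces (a generic OpenCL C sketch, not from the talk), a kernel can qualify its memory accordingly:

__kernel void scale_and_sum(__global const float* in,   /* device/global memory      */
                            __global float* partial_sums,
                            __local float* scratch,      /* per-work-group local mem. */
                            float factor)                /* private (per work-item)   */
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    float x = in[gid] * factor;      /* private variable                    */
    scratch[lid] = x;                /* stage in local memory               */
    barrier(CLK_LOCAL_MEM_FENCE);    /* local memory consistent only after a barrier */

    if (lid == 0) {
        float sum = 0.0f;
        for (size_t i = 0; i < get_local_size(0); ++i)
            sum += scratch[i];
        partial_sums[get_group_id(0)] = sum;
    }
}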


OpenCL Machine Model Mapping

OpenCL Platform        CPU Hardware       GPU Hardware
Compute Device         Processor/Board    GPU device
Compute Unit           Core (thread)      Streaming Multiprocessor
Processing Element     SIMD Lane          CUDA Core
global/const. memory   DRAM               DRAM
local memory           DRAM               Shared Memory
private memory         Register/DRAM      Private Memory/Register

⇒ write code for this abstract machine model
⇒ a device-specific OpenCL compiler and runtime maps it to the actual hardware
⇒ library-only implementation: no toolchain, many language bindings
⇒ currently: widest practical portability of parallel programming models


OpenCL Host Program and Kernel Execution

⇒ a host program uses OpenCL API calls
⇒ kernels are written in the OpenCL C/C++ kernel language
⇒ kernels are compiled at runtime for a specific device
⇒ kernels are executed over an NDRange of work-items, organised as work-groups of cooperating threads

[Figure: a 2D NDRange (NDRange size_x × NDRange size_y) is divided into work-groups (work-group size_x × work-group size_y) of work-items; the NDRange maps to a Compute Device, work-groups to Compute Units, and work-items to Processing Elements.]
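
A minimal host-side sketch of this flow (OpenCL 1.2 C API; error handling omitted, kernel and sizes are illustrative, not from the talk):

#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    const char* src =
        "__kernel void scale(__global float* v, float f) {"
        "    v[get_global_id(0)] *= f;"
        "}";

    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, nullptr);

    // runtime compilation for the selected device
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
    clBuildProgram(prog, 1, &device, "", nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "scale", nullptr);

    std::vector<float> data(1024, 1.0f);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                data.size() * sizeof(float), data.data(), nullptr);
    float factor = 2.0f;
    clSetKernelArg(kernel, 0, sizeof(buf), &buf);
    clSetKernelArg(kernel, 1, sizeof(factor), &factor);

    // NDRange of 1024 work-items in work-groups of 64
    size_t global = data.size(), local = 64;
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, &local, 0, nullptr, nullptr);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, data.size() * sizeof(float), data.data(),
                        0, nullptr, nullptr);
    std::printf("data[0] = %f\n", data[0]);
    return 0;
}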

Selected Target Hardware:

Device Name (architecture)       [TFLOPS]   [GiB/s]      [FLOP/Byte]
2× Intel Xeon Gold 6138 (SKL)    2.56       238          10.8
2× Intel Xeon E5-2680v3 (HSW)    0.96       136          7.1
Intel Xeon Phi 7250 (KNL)        2.61 (a)   490/115 (b)  5.3/22.7 (b)
Nvidia Tesla K40 (Kepler)        1.31       480          2.7
AMD FirePro W8100 (Hawaii)       2.1        320          6.6

(a) calculated with max. AVX frequency of 1.2 GHz: 2611.2 GFLOPS = 1.2 GHz × 68 cores × 8 SIMD × 2 VPUs × 2 FMA

COSIM - A Predictive Cometary Coma Simulation

Solve dust dynamics:

\vec{a}_{\mathrm{dust}}(\vec{r}) = \vec{a}_{\mathrm{gas\,drag}} + \vec{a}_{\mathrm{grav}} + \vec{a}_{\mathrm{Coriolis}} + \vec{a}_{\mathrm{centrifugal}}
  = \tfrac{1}{2} C_d \,\alpha\, N_{\mathrm{gas}}(\vec{r})\, m_{\mathrm{gas}} (\vec{v}_{\mathrm{gas}} - \vec{v}_{\mathrm{dust}}) |\vec{v}_{\mathrm{gas}} - \vec{v}_{\mathrm{dust}}| - \nabla\phi(\vec{r}) - 2\vec{\omega} \times \vec{v}_{\mathrm{dust}} - \vec{\omega} \times (\vec{\omega} \times \vec{r})

Compare with data of 67P/Churyumov–Gerasimenko from the Rosetta spacecraft (corr. 0.90, time 12:12).

[Figure: Panels 1-2: OSIRIS NAC image, Panels 3-4: simulation results; right image: ESA – C. Carreau/ATG medialab, CC BY-SA 3.0 IGO]

DM-HEOM: Computing the Hierarchical Equations Of Motion

Model for Open Quantum Systems
• understand the energy transfer in photo-active molecular complexes
  ⇒ e.g. photosynthesis . . . but also quantum computing
• millions of coupled ODEs
• hierarchical graph of complex matrices (auxiliary density operators, ADOs)
  ⇒ dim: N_sites × N_sites
  ⇒ count: exponential in the hierarchy depth d

\frac{d\sigma_u}{dt} = \frac{i}{\hbar}[H, \sigma_u]
  - \sigma_u \sum_{b=1}^{B} \sum_{k}^{K-1} n_{u,(b,k)}\, \gamma(b,k)
  - \sum_{b=1}^{B} \Big[ \frac{2\lambda_b}{\beta\hbar^2\nu_b} - \sum_{k}^{K-1} \frac{c(b,k)}{\hbar\gamma(b,k)} \Big] V^{\times}_{s(b)} V^{\times}_{s(b)}\, \sigma_u
  + i \sum_{b=1}^{B} \sum_{k}^{K-1} V^{\times}_{s(b)}\, \sigma^{+}_{u,b,k}
  + \sum_{b=1}^{B} \sum_{k}^{K-1} n_{u,(b,k)}\, \theta_{MA(b,k)}\, \sigma^{-}_{(u,b,k)}

[Image by University of Copenhagen Biology Department]

M. Noack, A. Reinefeld, T. Kramer, Th. Steinke, DM-HEOM: A Portable and Scalable Solver-Framework for the Hierarchical Equations of Motion, IPDPS/PDSEC'18, 10.1109/IPDPSW.2018.00149

Interdisciplinary Workflow (domain experts and computer scientists)

Mathematical Model → High-Level Prototype (Mathematica) → OpenCL kernel within Mathematica

Mathematical Model:
- ODEs, PDEs, graphs, ...

High-Level Prototype (Mathematica):
- the domain scientist's tool
- high level, symbolic solvers, arbitrary precision
- very limited performance

OpenCL kernel within Mathematica:
- replace some code with OpenCL

Mathematica and OpenCL

(* Load OpenCL support *)
Needs["OpenCLLink`"]

(* Create an OpenCLFunction from source, kernel name and signature. *)
(* mint is a special OpenCL typedef matching Mathematica's machine integer type. *)
doubleFun = OpenCLFunctionLoad["
  __kernel void doubleVec(__global mint* in, mint length) {
    int index = get_global_id(0);
    if (index < length)          /* the NDRange can be larger than length */
      in[index] = 2*in[index];
  }", "doubleVec", {{_Integer}, _Integer}, 256]

(* Create some input *)
vec = Range[20];

(* Call the function *)
doubleFun[vec, 20] (* NDRange deduced from the arguments and the work-group size *)

https://reference.wolfram.com/language/OpenCLLink/guide/OpenCLLink.html

Interdisciplinary Workflow

Mathematical Model → High-Level Prototype (Mathematica) → OpenCL kernel within Mathematica → C++ Host Application

OpenCL kernel within Mathematica:
- replace some code with OpenCL
- compare results, figure out the numerics
- use accelerators from within Mathematica

C++ Host Application:
- start single node
- OpenCL 1.2 for hotspots
- modern C++ 11/14/17
- CMake for building

OpenCL SDKs and Versions

name           version     OpenCL version                 supported devices
Intel OpenCL   18.1        CPU: 2.1, GPU: 2.1             CPUs (AVX-512), Intel GPUs, no KNL
Intel OpenCL   ...         CPU: 1.2 (exp. 2.1), GPU: 2.1  CPUs (up to AVX2), Intel GPUs
Intel OpenCL   14.2        1.2                            Xeon Phi (KNC, IMCI SIMD)
Nvidia OpenCL  CUDA 10.1   1.2 (exp. 2.0)                 Nvidia GPUs
AMD-APP SDK    ≥ 18.8.1    2.0 (GPU), 1.2 (CPU)           GPUs, CPUs (AVX, FMA4, XOP)
PoCL           1.3         2.0                            CPUs (AVX-512), GPUs, ARM

⇒ Intel rediscovered OpenCL for HPC (One API)
⇒ OpenCL 2.x is mostly supported now, but 1.2 is still the lowest common denominator
⇒ many more SDKs exist, e.g. the FPGA SDKs by Intel (Altera) and Xilinx

The OpenCL Installable Client Driver (ICD) Loader
• allows multiple OpenCL implementations to be installed and used next to each other
• applications link with a generic libOpenCL.so

/
  usr/lib/
    libOpenCL.so ......... runtime loader, looks into /etc/OpenCL/vendors
  etc/OpenCL/vendors/
    intel64.icd .......... text file with the actual library name/path
    amdocl64.icd ......... text file with the actual library name/path
  opt/intel_ocl/lib/
    libintelocl.so ....... actual Intel implementation
  opt/AMDAPPSDK/lib/
    libamdocl64.so ....... actual AMD implementation

• applications typically link to the loader, a direct link is also possible
• only some loaders support the OPENCL_VENDOR_PATH environment variable
⇒ problematic for user installations (modifying /etc/ requires root)

Compilation

OpenCL Header Files:
⇒ avoid trouble: use the reference headers and ship them with the project
⇒ https://github.com/KhronosGroup/OpenCL-Headers

CMake: "find_package(OpenCL REQUIRED)"
• the OpenCL CMake module only works in some scenarios
⇒ the magic line (optional, shown here: bypassing libOpenCL.so):

mkdir build.intel_16.1.1
cd build.intel_16.1.1
cmake -DCMAKE_BUILD_TYPE=Release \
      -DOpenCL_FOUND=True \
      -DOpenCL_INCLUDE_DIR=../../thirdparty/include/ \
      -DOpenCL_LIBRARY=/opt/intel/opencl_runtime_16.1.1/opt/intel/opencl-1.2-6.4.0.25/lib64/libintelocl.so \
      ..
make -j


Platform and Device Selection
• OpenCL API: lots of device properties can be queried
• simple and pragmatic: oclinfo tool ⇒ platform/device index

Platform 0:
  NAME: AMD Accelerated Parallel Processing
  VERSION: OpenCL 2.0 AMD-APP (1912.5)
  Device 0:
    NAME: Hawaii
    VENDOR: Advanced Micro Devices, Inc.
    VERSION: OpenCL 2.0 AMD-APP (1912.5)
  Device 1:
    NAME: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
    VENDOR: GenuineIntel
    VERSION: OpenCL 1.2 AMD-APP (1912.5)
Platform 1:
  NAME: Intel(R) OpenCL
  VERSION: OpenCL 1.2 LINUX
  Device 0:
    NAME: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
    VENDOR: Intel(R) Corporation
    VERSION: OpenCL 1.2 (Build 57)

https://github.com/noma/ocl
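
A minimal sketch of the kind of query such a tool performs (OpenCL C API; the output format is illustrative):

#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, nullptr, &num_platforms);
    std::vector<cl_platform_id> platforms(num_platforms);
    clGetPlatformIDs(num_platforms, platforms.data(), nullptr);

    for (cl_uint p = 0; p < num_platforms; ++p) {
        char name[256];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(name), name, nullptr);
        std::printf("Platform %u: %s\n", p, name);

        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 0, nullptr, &num_devices);
        std::vector<cl_device_id> devices(num_devices);
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, num_devices, devices.data(), nullptr);

        for (cl_uint d = 0; d < num_devices; ++d) {
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, nullptr);
            std::printf("  Device %u: %s\n", d, name);
        }
    }
    return 0;
}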

Handling Kernel Source Code

a) loading source files at runtime:
   ✓ no host-code recompilation
   ✓ #include directives work
b) embedded source as string constant:
   ✓ self-contained executable for production use

[Build flow: a kernel_source.cl file (plus any header.cl it #includes) is run through resolve_includes.sh and cl_to_hpp.sh, which wrap the resolved source in a raw string literal (R"str_not_in_src( ... )str_not_in_src") and generate a kernel_source.hpp; CMake tracks the dependency. A kernel wrapper class either #includes this generated header (embedded source) or loads the .cl file at runtime via an alternative constructor.]

https://github.com/noma/ocl
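
A condensed sketch of the two approaches (generic C++ code, not the noma/ocl API; file and kernel names are illustrative):

#include <fstream>
#include <sstream>
#include <string>

// b) embedded source: a generated header wraps the kernel in a raw string literal
static const std::string embedded_kernel_source = R"str_not_in_src(
__kernel void heom_ode(__global float* sigma) {
    /* ... kernel body, #includes resolved at build time ... */
}
)str_not_in_src";

// a) runtime loading: read the .cl file when the program starts
std::string load_kernel_source(const std::string& filename) {
    std::ifstream file(filename);
    std::stringstream ss;
    ss << file.rdbuf();          // no recompilation of the host code needed
    return ss.str();
}

// either way, the resulting string is passed to clCreateProgramWithSource()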

Example OpenCL Runtime Configuration File

[opencl] # use first device of second platform platform_index=1 device_index=0

# enable zero copy buffers for CPU devices zero_copy_device_types={cpu}

# pass a custom include path to the OpenCL compiler compile_options=-I../cl

# load kernel source from file at runtime kernel_file_heom_ode=../cl/heom_ode.cl kernel_name_heom_ode=heom_ode

# unset option, load embedded source #kernel_file_rk_weighted_add= #kernel_name_rk_weighted_add=

https://github.com/noma/ocl

Interdisciplinary Workflow

Mathematical Model → High-Level Prototype (Mathematica) → OpenCL kernel within Mathematica → C++ Host Application → Distributed Host Application

Distributed Host Application:
- scale to multiple nodes
- partitioning, balancing, neighbour exchange, ...
- wrapped MPI 3.0

OpenCL and Communication/MPI

Design recommendations:
• keep both aspects as independent as possible
• design the code to be agnostic to whether it works on a complete problem instance or on a partition
• provide hooks to trigger communication in-between kernel calls
• wrap the needed parts of MPI in a thin, exchangeable abstraction layer

Current trade-offs:
• communication introduces additional logical host-device transfers
  ⇒ scaling starts slowly, e.g. two nodes might be slower than one
• a single process might not be able to saturate the network
  ⇒ multiple processes per node share a device (CPU device: set the CPU mask)
• pick one: zero-copy buffers or overlapping compute and communication
  ⇒ either the host (communication) or the device (computation) owns the memory at any point in time
  ⇒ overlapping requires copies again
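
A sketch of the hook pattern described above (illustrative names, blocking OpenCL transfers and a simple MPI halo exchange; not the DM-HEOM implementation):

#include <CL/cl.h>
#include <mpi.h>

void time_step(cl_command_queue queue, cl_kernel kernel, cl_mem device_buf,
               float* halo_send, float* halo_recv, int halo_count,
               size_t global_size, int left, int right, MPI_Comm comm)
{
    const size_t halo_bytes = halo_count * sizeof(float);

    // 1. compute on the local partition only
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global_size, nullptr,
                           0, nullptr, nullptr);

    // 2. communication hook between kernel calls:
    //    read the boundary data back (blocking), exchange it with the neighbours,
    //    push the received halo back to the device
    clEnqueueReadBuffer(queue, device_buf, CL_TRUE, 0, halo_bytes, halo_send,
                        0, nullptr, nullptr);
    MPI_Sendrecv(halo_send, halo_count, MPI_FLOAT, right, 0,
                 halo_recv, halo_count, MPI_FLOAT, left,  0,
                 comm, MPI_STATUS_IGNORE);
    clEnqueueWriteBuffer(queue, device_buf, CL_TRUE, 0, halo_bytes, halo_recv,
                         0, nullptr, nullptr);
}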

Data Transfer Paths

[Figure: data path between two nodes: application code → OpenCL driver → device memory, DMA to pinned host memory, host-memory copies into pinned fabric buffers, RDMA over the fabric, and the same chain in reverse on the receiving node. Some of the host-side copies can be avoided in some cases with OpenCL; CUDA additionally offers GPU-Direct RDMA straight from device memory.]


Benchmark Results: COSIM load imbalance (Xeon)

[Figure: COSIM runtime per iteration (s) vs. particles per compute node (2× Xeon, Haswell) for the Intel OpenCL SDK, AMD APP SDK and PoCL; one panel over the full range (up to ~125 s) and one zoomed panel (up to ~6 s). PoCL shows an outlier every 32 particles; 384 work-items correspond to 16 × 24 cores.]

• different performance characteristics across OpenCL implementations
• highest per-node efficiency at 384 work-items per node with the Intel SDK
  ⇒ 16 = logical SIMD width required by the Intel OpenCL vectoriser
• work-items per node can dramatically affect job runtime
  ⇒ ±1 work-item on a single node can more than double the job runtime
⇒ benchmark, adapt the job size, pad work-items to n × 16
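
A host-side sketch of such padding (illustrative; assumes the kernel masks out excess work-items with an index guard, as in the Mathematica example earlier):

#include <CL/cl.h>

// pad the global work size to a multiple of the logical SIMD width (16 here)
size_t padded_global_size(size_t num_particles, size_t simd_width = 16) {
    return ((num_particles + simd_width - 1) / simd_width) * simd_width;
}

void launch(cl_command_queue queue, cl_kernel kernel, size_t num_particles) {
    size_t global = padded_global_size(num_particles);  // e.g. 1000 -> 1008
    size_t local  = 16;                                  // one work-group per SIMD chunk
    // excess work-items must be masked out inside the kernel (if (gid < n) ...)
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, &local,
                           0, nullptr, nullptr);
}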


Interdisciplinary Workflow

Mathematical Model → High-Level Prototype (Mathematica) → OpenCL kernel within Mathematica → C++ Host Application → Distributed Host Application → Optimisation / Production Runs

Optimisation / Production Runs:
- always collect performance data
- profile/tune the code
- add performance tweaks
- use device-specific kernel variants if needed

DM-HEOM Benchmarks: Work-item Granularity

[Figure: Impact of the work-item granularity. Average solver step runtime (ms) for the fmo_22baths_d3.cfg and lhcii_1bath_d8.cfg configurations on 2× SKL, 2× HSW, KNL, K40 and W8100, comparing Matrix and Element granularity.]

⇒ CPUs: 1.2× to 1.35× speedup for Matrix granularity
⇒ GPUs: up to 6.7× (K40) and 7.2× (W8100) speedup for Element granularity
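
To illustrate what the two granularities mean for the NDRange, a schematic OpenCL C sketch (not the actual DM-HEOM kernels): with Matrix granularity one work-item updates a whole ADO, with Element granularity one work-item updates a single matrix element.

/* Matrix granularity: one work-item per ADO, global size = num_ados */
__kernel void update_matrix_granularity(__global float2* ados, int n_sites) {
    size_t ado = get_global_id(0);
    for (int i = 0; i < n_sites; ++i)
        for (int j = 0; j < n_sites; ++j) {
            size_t idx = (ado * n_sites + i) * n_sites + j;
            /* ... update ados[idx] ... */
        }
}

/* Element granularity: one work-item per matrix element,
   global size = num_ados * n_sites * n_sites */
__kernel void update_element_granularity(__global float2* ados, int n_sites) {
    size_t idx = get_global_id(0);
    /* ... update ados[idx] ... */
}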


DM-HEOM Benchmarks: Memory Layout

[Figure: Impact of the configurable memory layout. Average solver step runtime (ms) for the fmo_22baths_d3.cfg and lhcii_1bath_d8.cfg configurations on 2× SKL, 2× HSW and KNL, comparing the AoS and AoSoA layouts.]

⇒ SKL and HSW: 1.3× to 2.4× speedup with AoSoA
⇒ KNL: 1.6× to 2.8× speedup with AoSoA
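
A schematic sketch of the two layouts (plain C, not the DM-HEOM data structures): AoSoA packs a SIMD-width worth of elements per field, so neighbouring work-items access contiguous memory.

/* AoS: one struct per complex element, fields interleaved in memory */
typedef struct { float re; float im; } complex_aos;
/* element i:  aos[i].re, aos[i].im */

/* AoSoA: VEC_LEN elements per field, so SIMD lanes read contiguous data */
#define VEC_LEN 16   /* logical SIMD width, e.g. for AVX-512 */
typedef struct { float re[VEC_LEN]; float im[VEC_LEN]; } complex_aosoa;
/* element i:  aosoa[i / VEC_LEN].re[i % VEC_LEN],
               aosoa[i / VEC_LEN].im[i % VEC_LEN] */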


DM-HEOM Benchmarks: Performance Portability

[Figure: Performance portability relative to the Xeon (SKL) system. Average solver step runtime (ms) for the fmo_22baths_d3.cfg and lhcii_1bath_d8.cfg configurations on 2× Xeon (SKL), 2× Xeon (HSW), Xeon Phi (KNL), Tesla K40 and FirePro W8100; gray bars show the runtime expected from peak FLOPS.]

⇒ SKL (Xeon) is the reference
⇒ gray bars are expected runtimes extrapolated from peak FLOPS
⇒ the older Haswell Xeon exceeds expectations, due to better OpenCL support
⇒ good result: within 30 % of the expectation
⇒ KNL and K40 are sensitive to the irregular accesses caused by the extreme coupling in this scenario


Interdisciplinary Workflow (complete)

Mathematical Model → High-Level Prototype (Mathematica) → OpenCL kernel within Mathematica → C++ Host Application → Distributed Host Application → Optimisation / Production Runs

Mathematical Model: ODEs, PDEs, graphs, ...
High-Level Prototype (Mathematica): the domain scientist's tool, high level, symbolic solvers, arbitrary precision, very limited performance
OpenCL kernel within Mathematica: replace some code with OpenCL, compare results, figure out the numerics, use accelerators from within Mathematica
C++ Host Application: start single node, OpenCL 1.2 for hotspots, modern C++ 11/14/17, CMake for building
Distributed Host Application: scale to multiple nodes; partitioning, balancing, neighbour exchange, ...; wrapped MPI 3.0
Optimisation / Production Runs: always collect performance data, profile/tune the code, add performance tweaks, use device-specific kernel variants if needed

Conclusion

General:
• work interdisciplinarily
• put portability first

OpenCL:
• highest portability of the available parallel programming models
• integrates well into an interdisciplinary workflow
• runtime compilation allows compiler optimisation with runtime constants
• performance portability is not for free
  ⇒ e.g. via configurability of work-item granularity and memory layout
  ⇒ worst case: multiple kernels, still better than multiple programming models

Caveats:
• vendors are slow in implementing new standards ⇒ use OpenCL 1.2 + complain
• interoperability with communication APIs is not addressed by the current standard
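
A sketch of the runtime-constant compilation mentioned above (generic example; macro names and options are illustrative): problem sizes known only at runtime are passed as preprocessor defines when the kernel is built.

#include <CL/cl.h>
#include <cstdio>
#include <string>

// build the same kernel source with values that are only known at runtime,
// so the OpenCL compiler can treat them as compile-time constants
cl_program build_with_runtime_constants(cl_context ctx, cl_device_id dev,
                                        const char* src, int n_sites, int n_ados)
{
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
    std::string options = "-DN_SITES=" + std::to_string(n_sites)
                        + " -DN_ADOS=" + std::to_string(n_ados)
                        + " -cl-fast-relaxed-math";   // further device-side optimisation
    cl_int err = clBuildProgram(prog, 1, &dev, options.c_str(), nullptr, nullptr);
    if (err != CL_SUCCESS)
        std::fprintf(stderr, "clBuildProgram failed: %d\n", err);
    return prog;
}
// inside the kernel, N_SITES/N_ADOS are ordinary preprocessor constants,
// enabling constant folding and full loop unrolling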

31 / 32 EoP

Thank you.

Feedback? Questions? Ideas? [email protected]

This work was funded by the Deutsche Forschungsgemeinschaft (DFG) project RE 1389/8-1 and the "Research Center for Many-Core HPC" at ZIB, an Intel Parallel Computing Center. The authors acknowledge the North-German Supercomputing Alliance (HLRN) for providing compute resources.
