OpenCL in Scientific High Performance Computing
Matthias Noack [email protected]
Zuse Institute Berlin, Distributed Algorithms and Supercomputing
2019-09-24, IXPUG Annual Conference 2019

Audience Survey
• Who has a rough idea what OpenCL is?
• Who has hands-on experience with OpenCL?
• Who is using OpenCL in a real-world code?
• . . . Why?
• . . . Why not?
• . . . What are you using instead?
The Situation with Scientific HPC

• tons of legacy code (FORTRAN) authored by domain experts
⇒ rather closed community, decoupled from computer science (ask a CS student about FORTRAN)
• highly conservative code owners
⇒ modern software-engineering advances are picked up very slowly
• intra-node parallelism dominated by OpenMP (e.g. Intel) and CUDA (Nvidia)
⇒ vendor and tool dependencies ⇒ limited portability ⇒ multiple diverging code branches ⇒ hard to maintain
• inter-node communication = MPI
• hardware lifetime: 5 years; software lifetime: multiple tens of years
⇒ software outlives systems by far ⇒ aim for portability
Do not contribute to that situation!

What can we do better?
• put portability first (≠ performance portability)
⇒ OpenCL has the largest hardware coverage for intra-node programming (CPUs, GPUs, AI accelerators, FPGAs, . . . )
⇒ library only ⇒ no tool dependencies
• use modern techniques with a broad community (beyond HPC)
⇒ e.g. modern C++ for host code
⇒ e.g. CMake for building
• develop code in interdisciplinary teams
⇒ domain experts design the model . . .
⇒ . . . computer scientists the software
Big Picture: OpenCL, SYCL, SPIR-V and Intel One API

Layered stack: Application → Intel One API / Data Parallel C++ (DPC++) → SYCL → SPIR-V → OpenCL → hardware (CPU, GPU, FPGA, AI, . . . )

SYCL:
• higher-level programming model for OpenCL
• single source, standard C++ ⇒ SYCL compiler needed

SPIR-V:
• Standard Portable Intermediate Representation
• standardised intermediate language, based on LLVM-IR
• device-independent binaries for OpenCL
OpenCL (Open Computing Language) in a Nutshell
• open, royalty-free standard for cross-platform, parallel programming
• for personal computers, servers, mobile devices and embedded platforms
• maintained by the Khronos Group
• first released: 2009-08-28
https://www.khronos.org/opencl/
OpenCL Platform and Memory Model

Platform model:
• a Host is connected to one or more Compute Devices (CDs)
• each Compute Device contains Compute Units (CUs)
• each Compute Unit contains Processing Elements (PEs)

Memory model:
• CD has device memory with global/constant address space
• CU has local memory address space
• PE has private memory address space
⇒ relaxed consistency
OpenCL Machine Model Mapping

OpenCL Platform       CPU Hardware       GPU Hardware
Compute Device        Processor/Board    GPU device
Compute Unit          Core (thread)      Streaming Multiprocessor
Processing Element    SIMD Lane          CUDA Core
global/const. memory  DRAM               DRAM
local memory          DRAM               Shared Memory
private memory        Register/DRAM      Private Memory/Register

⇒ write code for this abstract machine model
⇒ a device-specific OpenCL compiler and runtime maps it to the actual hardware
⇒ library-only implementation: no toolchain, many language bindings
⇒ currently: widest practical portability of parallel programming models
OpenCL Host Program and Kernel Execution
⇒ a host program uses OpenCL API calls
⇒ kernels are written in the OpenCL C/C++ kernel language
⇒ kernels are compiled at runtime for a specific device
⇒ kernels are executed over an NDRange of work-items, grouped into work-groups of cooperating threads

[Figure: a 2-D NDRange (NDRange size x × NDRange size y) is divided into work-groups (work-group size x × work-group size y) of work-items; work-groups map to Compute Units, individual work-items to Processing Elements.]
Selected Target Hardware:

Device Name (architecture)       [TFLOPS]   [GiB/s]    [FLOP/Byte]
2× Intel Xeon Gold 6138 (SKL)    2.56       238        10.8
2× Intel Xeon E5-2680v3 (HSW)    0.96       136        7.1
Intel Xeon Phi 7250 (KNL)        2.61*      490/115    5.3/22.7
Nvidia Tesla K40 (Kepler)        1.31       480        2.7
AMD FirePro W8100 (Hawaii)       2.1        320        6.6
* calculated with max. AVX frequency of 1.2 GHz: 2611.2 GFLOPS = 1.2 GHz × 68 cores × 8 SIMD × 2 VPUs × 2 FMA

COSIM - A Predictive Cometary Coma Simulation

Solve dust dynamics:
\vec{a}_{\mathrm{dust}}(\vec{r}) = \vec{a}_{\text{gas-drag}} + \vec{a}_{\mathrm{grav}} + \vec{a}_{\mathrm{Coriolis}} + \vec{a}_{\mathrm{centrifugal}}
= \tfrac{1}{2} C_d\, \alpha\, N_{\mathrm{gas}}(\vec{r})\, m_{\mathrm{gas}}\, (\vec{v}_{\mathrm{gas}} - \vec{v}_{\mathrm{dust}})\,|\vec{v}_{\mathrm{gas}} - \vec{v}_{\mathrm{dust}}| - \nabla\phi(\vec{r}) - 2\vec{\omega}\times\vec{v}_{\mathrm{dust}} - \vec{\omega}\times(\vec{\omega}\times\vec{r})
Compare with data of 67P/Churyumov–Gerasimenko from the Rosetta spacecraft:

[Figure: Panels 1-2: OSIRIS NAC image, Panels 3-4: simulation results (corr. 0.90, time 12:12); right image: ESA – C. Carreau/ATG medialab, CC BY-SA 3.0-igo]
DM-HEOM: Computing the Hierarchical Equations Of Motion
Model for Open Quantum Systems
• understand the energy transfer in photo-active molecular complexes
⇒ e.g. photosynthesis . . . but also quantum computing
• millions of coupled ODEs
• hierarchical graph of complex matrices (auxiliary density operators, ADOs)
⇒ dim: N_sites × N_sites
⇒ count: exponential in hierarchy depth d

\frac{d\sigma_u}{dt} = -\frac{i}{\hbar}[H, \sigma_u]
- \sigma_u \sum_{b=1}^{B} \sum_{k}^{K-1} n_{u,(b,k)}\, \gamma(b,k)
- \sum_{b=1}^{B} \Big[ \frac{2\lambda_b}{\beta\hbar^2\nu_b} - \sum_{k}^{K-1} \frac{c(b,k)}{\hbar\gamma(b,k)} \Big] V^{\times}_{s(b)} V^{\times}_{s(b)}\, \sigma_u
+ i \sum_{b=1}^{B} \sum_{k}^{K-1} V^{\times}_{s(b)}\, \sigma^{+}_{u,b,k}
- \sum_{b=1}^{B} \sum_{k}^{K-1} n_{u,(b,k)}\, \theta_{MA(b,k)}\, \sigma^{-}_{(u,b,k)}
M. Noack, A. Reinefeld, T. Kramer, Th. Steinke: DM-HEOM: A Portable and Scalable Solver-Framework for the Hierarchical Equations of Motion, IPDPS/PDSEC'18, DOI 10.1109/IPDPSW.2018.00149
Model for Open Quantum Systems • understand the energy transfer in photo-active molecular complexes ⇒ e.g. photosynthesis . . . but also quantum computing • millions of coupled ODEs • hierarchical graph of complex matrices (auxiliary density operators, ADOs)
⇒ dim: Nsites × Nsites ⇒ count: exp. in hierarchy depth d
M. Noack, A. Reinefeld, T. Kramer, Th. Steinke, DM-HEOM: A Portable and Scalable Solver-Framework
for the Hierarchical Equations of Motion, IPDPS/PDSEC’18, 10.1109/IPDPSW.2018.00149 12 / 32 DM-HEOM: Computing the Hierarchical Equations Of Motion
Model for Open Quantum Systems • understand the energy transfer in photo-active molecular complexes ⇒ e.g. photosynthesis . . . but also quantum computing • millions of coupled ODEs • hierarchical graph of complex matrices (auxiliary density operators, ADOs)
⇒ dim: Nsites × Nsites ⇒ count: exp. in hierarchy depth d
M. Noack, A. Reinefeld, T. Kramer, Th. Steinke, DM-HEOM: A Portable and Scalable Solver-Framework
for the Hierarchical Equations of Motion, IPDPS/PDSEC’18, 10.1109/IPDPSW.2018.00149 12 / 32 DM-HEOM: Computing the Hierarchical Equations Of Motion
Model for Open Quantum Systems • understand the energy transfer in photo-active molecular complexes ⇒ e.g. photosynthesis . . . but also quantum computing • millions of coupled ODEs • hierarchical graph of complex matrices (auxiliary density operators, ADOs)
⇒ dim: Nsites × Nsites ⇒ count: exp. in hierarchy depth d
M. Noack, A. Reinefeld, T. Kramer, Th. Steinke, DM-HEOM: A Portable and Scalable Solver-Framework
for the Hierarchical Equations of Motion, IPDPS/PDSEC’18, 10.1109/IPDPSW.2018.00149 12 / 32 Interdisciplinary Workflow domain experts computer scientists
Mathematical Model
13 / 32 Interdisciplinary Workflow domain experts computer scientists
Mathematical Model - ODEs - PDEs - Graphs -...
13 / 32 Interdisciplinary Workflow domain experts computer scientists
Mathematical High-Level Prototype Model (Mathematica) - ODEs - PDEs - Graphs -...
13 / 32 Interdisciplinary Workflow domain experts computer scientists
Mathematical High-Level Prototype Model (Mathematica)
- ODEs - domain scientist’s tool - PDEs - high level - Graphs - symbolic solvers -... - arbitrary precision - very limited performance
13 / 32 Interdisciplinary Workflow domain experts computer scientists
Mathematical High-Level Prototype OpenCL kernel Model (Mathematica) within Mathematica
- ODEs - domain scientist’s tool - PDEs - high level - Graphs - symbolic solvers -... - arbitrary precision - very limited performance
Interdisciplinary Workflow: domain experts → computer scientists

Mathematical Model → High-Level Prototype (Mathematica) → OpenCL kernel within Mathematica

• Mathematical Model: ODEs, PDEs, graphs, . . .
• High-Level Prototype (Mathematica): domain scientist's tool, high level, symbolic solvers, arbitrary precision, very limited performance
• OpenCL kernel within Mathematica: replace some code with OpenCL
Mathematica and OpenCL

(* Load OpenCL support *)
Needs["OpenCLLink`"]

(* Create an OpenCLFunction from source, kernel name, signature, work-group size *)
(* mint is a special OpenCL typedef matching Mathematica's machine integer type *)
doubleFun = OpenCLFunctionLoad["
  __kernel void doubleVec(__global mint* in, mint length) {
    int index = get_global_id(0);
    if (index < length)  /* the NDRange can be larger than length */
      in[index] = 2*in[index];
  }", "doubleVec", {{_Integer}, _Integer}, 256]

(* Create some input *)
vec = Range[20];

(* Call the function *)
doubleFun[vec, 20]  (* NDRange deduced from the arguments and work-group size *)

https://reference.wolfram.com/language/OpenCLLink/guide/OpenCLLink.html
Interdisciplinary Workflow: domain experts → computer scientists

Mathematical Model → High-Level Prototype (Mathematica) → OpenCL kernel within Mathematica → . . . → C++ Host Application

• Mathematical Model: ODEs, PDEs, graphs, . . .
• High-Level Prototype (Mathematica): domain scientist's tool, high level, symbolic solvers, arbitrary precision, very limited performance
• OpenCL kernel within Mathematica: replace some code with OpenCL, compare results, figure out numerics, use accelerators in Mathematica
• C++ Host Application: start single node, OpenCL 1.2 for hotspots, modern C++ 11/14/17, CMake for building
OpenCL SDKs and Versions

name           version     OpenCL version                 supported devices
Intel OpenCL   18.1        CPU: 2.1, GPU: 2.1             CPUs (AVX-512), Intel GPUs, no KNL
Intel OpenCL   ...         CPU: 1.2 (exp. 2.1), GPU: 2.1  CPUs (up to AVX2), Intel GPUs
Intel OpenCL   14.2        1.2                            Xeon Phi (KNC, IMCI SIMD)
Nvidia OpenCL  CUDA 10.1   1.2 (exp. 2.0)                 Nvidia GPUs
AMD-APP SDK    ≥ 18.8.1    2.0 (GPU), 1.2 (CPU)           GPUs, CPUs (AVX, FMA4, XOP)
PoCL           1.3         2.0                            CPUs (AVX-512), GPUs, ARM
⇒ Intel rediscovered OpenCL for HPC (One API)
⇒ OpenCL 2.x mostly supported now, but 1.2 is still the lowest common denominator
⇒ many more, e.g. FPGA SDKs by Intel (Altera) and Xilinx
The OpenCL Installable Client Driver (ICD) Loader
• allows multiple OpenCL installations to be installed and used next to each other
• applications link with a generic libOpenCL.so

/usr/lib/libOpenCL.so ................ runtime loader, looks into /etc/OpenCL/vendors
/etc/OpenCL/vendors/intel64.icd ...... text file with the actual library name/path
/etc/OpenCL/vendors/amdocl64.icd ..... text file with the actual library name/path
/opt/intel_ocl/lib/libintelocl.so .... actual Intel implementation
/opt/AMDAPPSDK/lib/libamdocl64.so .... actual AMD implementation

• applications typically link to the loader, a direct link is also possible
• only some loaders support the OPENCL_VENDOR_PATH environment variable
⇒ problematic for user installations (modifying /etc/ requires root)
Compilation

OpenCL header files:
⇒ avoid trouble: use the reference headers and ship them with the project
⇒ https://github.com/KhronosGroup/OpenCL-Headers

CMake: find_package(OpenCL REQUIRED)
• the OpenCL CMake module only works in some scenarios
⇒ the magic line (optional, shown here: bypass libOpenCL.so):

mkdir build.intel_16.1.1
cd build.intel_16.1.1
cmake -DCMAKE_BUILD_TYPE=Release -DOpenCL_FOUND=True -DOpenCL_INCLUDE_DIR=../../thirdparty/include/ -DOpenCL_LIBRARY=/opt/intel/opencl_runtime_16.1.1/opt/intel/opencl-1.2-6.4.0.25/lib64/libintelocl.so ..
make -j
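For the cases where the CMake module does work, the hookup is only a few lines; a minimal sketch (project and target names are placeholders):

```cmake
# Sketch: standard OpenCL hookup via CMake's FindOpenCL module.
cmake_minimum_required(VERSION 3.7) # 3.7+ provides the imported target
project(ocl_demo CXX)

find_package(OpenCL REQUIRED)       # defines OpenCL::OpenCL when found

add_executable(ocl_demo main.cpp)
# imported target carries include dirs and the library location
target_link_libraries(ocl_demo PRIVATE OpenCL::OpenCL)
```

The `OpenCL::OpenCL` imported target replaces the manual `-DOpenCL_INCLUDE_DIR`/`-DOpenCL_LIBRARY` juggling whenever the module finds a usable installation.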
Platform and Device Selection
• OpenCL API: lots of device properties can be queried
• simple and pragmatic: oclinfo tool ⇒ platform/device index

Platform 0:
  NAME: AMD Accelerated Parallel Processing
  VERSION: OpenCL 2.0 AMD-APP (1912.5)
  Device 0:
    NAME: Hawaii
    VENDOR: Advanced Micro Devices, Inc.
    VERSION: OpenCL 2.0 AMD-APP (1912.5)
  Device 1:
    NAME: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
    VENDOR: GenuineIntel
    VERSION: OpenCL 1.2 AMD-APP (1912.5)
Platform 1:
  NAME: Intel(R) OpenCL
  VERSION: OpenCL 1.2 LINUX
  Device 0:
    NAME: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
    VENDOR: Intel(R) Corporation
    VERSION: OpenCL 1.2 (Build 57)

https://github.com/noma/ocl
Handling Kernel Source Code

a) loading source files at runtime:
✓ no host-code recompilation
✓ #include directives
b) embedded source as string constant:
✓ self-contained executable for production use

Pipeline: a header .cl file is merged into the kernel source .cl file by resolve_includes.sh; cl_to_hpp.sh then wraps the result in a raw string literal (R"str_not_in_src( ... )str_not_in_src") and generates a kernel source .hpp (tracked as a CMake dependency), which is #included by a kernel wrapper class (.hpp/.cpp); alternatively, the wrapper class loads the .cl file at runtime (via an alternative ctor).

https://github.com/noma/ocl
Example OpenCL Runtime Configuration File
[opencl]
# use first device of second platform
platform_index=1
device_index=0

# enable zero-copy buffers for CPU devices
zero_copy_device_types={cpu}

# pass a custom include path to the OpenCL compiler
compile_options=-I../cl

# load kernel source from file at runtime
kernel_file_heom_ode=../cl/heom_ode.cl
kernel_name_heom_ode=heom_ode

# unset option, load embedded source
#kernel_file_rk_weighted_add=
#kernel_name_rk_weighted_add=
https://github.com/noma/ocl
Mathematical High-Level Prototype OpenCL kernel ... Model (Mathematica) within Mathematica
- ODEs - domain scientist’s tool - replace some code - PDEs - high level with OpenCL - Graphs - symbolic solvers - compare results -... - arbitrary precision - figure out numerics - very limited performance - use accelerators in MM
... C++ Host Application
- start single node - OpenCL 1.2 for hotspots - modern C++ 11/14/17 - CMake for building
22 / 32 Interdisciplinary Workflow domain experts computer scientists
Mathematical High-Level Prototype OpenCL kernel ... Model (Mathematica) within Mathematica
- ODEs - domain scientist’s tool - replace some code - PDEs - high level with OpenCL - Graphs - symbolic solvers - compare results -... - arbitrary precision - figure out numerics - very limited performance - use accelerators in MM
... C++ Host Distributed Application Host Application
- start single node - OpenCL 1.2 for hotspots - modern C++ 11/14/17 - CMake for building
22 / 32 Interdisciplinary Workflow domain experts computer scientists
Mathematical High-Level Prototype OpenCL kernel ... Model (Mathematica) within Mathematica
- ODEs - domain scientist’s tool - replace some code - PDEs - high level with OpenCL - Graphs - symbolic solvers - compare results -... - arbitrary precision - figure out numerics - very limited performance - use accelerators in MM
... C++ Host Distributed Application Host Application
- start single node - scale to multiple nodes - OpenCL 1.2 for hotspots - partitioning, balancing, - modern C++ 11/14/17 neighbour exchange, . . . - CMake for building - wrapped MPI 3.0
22 / 32

OpenCL and Communication/MPI
Design recommendation:
• keep both aspects as independent as possible
• design code to be agnostic to whether it works on a complete problem instance or on a partition
• provide hooks to trigger communication in-between kernel calls
• wrap the needed parts of MPI in a thin, exchangeable abstraction layer
Current trade-offs:
• communication introduces additional logical host-device transfers ⇒ scaling starts slowly, e.g. two nodes might be slower than one
• a single process might not be able to saturate the network ⇒ run multiple processes per node sharing a device (CPU device: set the CPU mask)
• pick one: zero-copy buffers or overlapping compute and communication ⇒ at any point in time, either the host (communication) or the device (computation) owns the memory ⇒ overlapping requires copies again
23 / 32
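The recommended decoupling can be sketched as follows — all names are hypothetical, not the actual application's API: the solver is agnostic to whether it owns the whole problem or a partition, and an injected hook triggers communication (e.g. a halo exchange via a thin MPI wrapper) in-between kernel calls.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Solver that is unaware of MPI: communication is injected as a hook.
class Solver {
public:
    using comm_hook = std::function<void(std::vector<double>&)>;

    explicit Solver(std::vector<double> state, comm_hook hook = {})
        : state_(std::move(state)), hook_(std::move(hook)) {}

    void step()
    {
        run_kernel();             // e.g. enqueue an OpenCL kernel, read back
        if (hook_) hook_(state_); // neighbour exchange, distributed runs only
    }

    const std::vector<double>& state() const { return state_; }

private:
    void run_kernel()
    {
        for (double& x : state_) x *= 2.0; // stand-in for the device kernel
    }

    std::vector<double> state_;
    comm_hook hook_;
};
```

Single-node runs construct the solver without a hook; distributed runs inject the MPI wrapper, so the solver code itself never includes mpi.h and the communication layer stays exchangeable.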
Current trade-offs: • communication introduces additional logical host-device transfers ⇒ scaling starts slowly, e.g. two nodes might be slower than one • a single process might not be able to saturate the network ⇒ multiple processes per node sharing a device (CPU device: set CPU mask) • pick one: zero-copy buffers or overlapping compute and communication ⇒ either host (comm.) or device (comp.) own the memory at any point in time ⇒ overlapping requires copies again 23 / 32 Data Transfer Paths
[Figure: data-transfer path between two nodes — on each side: device memory → (DMA) → pinned host-memory buffer → (copy) → application buffer → (copy) → pinned fabric-driver buffer → RDMA across the fabric, and the reverse on the receiving side]
• the intermediate host-side copies can be avoided in some cases with OpenCL (pinned/zero-copy buffers)
• CUDA additionally offers GPU-Direct RDMA: the fabric accesses device memory directly
24 / 32
Benchmark Results: COSIM load imbalance (Xeon)
[Figure: COSIM runtime per iteration [s] vs. particles per compute node (2x Xeon, Haswell), for the Intel OpenCL SDK, AMD APP SDK, and PoCL; full view (0–125 s) and zoomed view (0–6 s)]
• different performance characteristics across OpenCL implementations
• highest per-node efficiency with 384 work-items per node with the Intel SDK ⇒ 384 work-items = 16 × 24 cores, with 16 being the logical SIMD width required by the Intel OpenCL vectoriser
• the number of work-items per node can dramatically affect job runtime ⇒ ± 1 work-item on a single node can more than double the job runtime
• PoCL: outlier every 32 particles
⇒ benchmark, adapt the job size, pad work-items to n × 16
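The padding recommendation above amounts to rounding the global work size up to the next multiple of the logical SIMD width — a minimal sketch (the helper name is hypothetical); the padded surplus work-items must then skip their excess work inside the kernel.

```cpp
#include <cstddef>

// Round the global work size up to the next multiple of the logical SIMD
// width (16 for the Intel OpenCL vectoriser, as observed in the benchmark).
constexpr std::size_t pad_work_items(std::size_t items,
                                     std::size_t simd_width = 16)
{
    return ((items + simd_width - 1) / simd_width) * simd_width;
}

static_assert(pad_work_items(384) == 384, "already a multiple of 16");
static_assert(pad_work_items(385) == 400, "rounded up to the next multiple");
```

The padded size is what gets passed as the global work size to clEnqueueNDRangeKernel; an `if (get_global_id(0) >= real_size) return;` guard in the kernel discards the padding.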
25 / 32

Interdisciplinary Workflow (domain experts ↔ computer scientists)
[Figure: workflow stages, from domain experts towards computer scientists]
• Mathematical Model: ODEs, PDEs, graphs, …
• High-Level Prototype (Mathematica): the domain scientist's tool; high level; symbolic solvers; arbitrary precision; very limited performance
• OpenCL kernel within Mathematica: replace some code with OpenCL; compare results; figure out the numerics; use accelerators from within Mathematica
• C++ Host Application: start on a single node; OpenCL 1.2 for hotspots; modern C++ 11/14/17; CMake for building
• Distributed Host Application: scale to multiple nodes; partitioning, balancing, neighbour exchange, …; wrapped MPI 3.0
• Optimisation / Production Runs: always collect performance data; profile/tune the code; add performance tweaks; use device-specific kernel variants if needed
26 / 32

DM-HEOM Benchmarks: Work-item Granularity
[Figure: impact of work-item granularity (Matrix vs. Element) on the average solver step runtime [ms], for the fmo_22baths_d3.cfg and lhcii_1bath_d8.cfg workloads on 2x SKL, 2x HSW, KNL, K40, and W8100]
⇒ CPUs: 1.2× to 1.35× speedup with Matrix granularity
⇒ GPUs: up to 6.7× (K40) and 7.2× (W8100) speedup with Element granularity
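The two granularities differ in how much work a single work-item performs; this index arithmetic sketches the idea (the matrix dimension N and the index decoding are hypothetical illustrations, not DM-HEOM's actual scheme).

```cpp
#include <cstddef>

constexpr std::size_t N = 7; // hypothetical matrix dimension

// Matrix granularity: one work-item integrates a whole N x N matrix
// (fewer, coarser work-items -- favoured by the CPU runtimes above).
constexpr std::size_t global_size_matrix(std::size_t num_matrices)
{
    return num_matrices;
}

// Element granularity: one work-item handles a single matrix element
// (massive fine-grained parallelism -- favoured by the GPUs above).
constexpr std::size_t global_size_element(std::size_t num_matrices)
{
    return num_matrices * N * N;
}

// Inside an Element-granularity kernel, a work-item recovers its matrix,
// row, and column indices from the global id:
struct ElementIndex { std::size_t matrix, row, col; };
constexpr ElementIndex decode(std::size_t gid)
{
    return { gid / (N * N), (gid % (N * N)) / N, gid % N };
}
```

Making the granularity a configuration option lets one kernel source serve both device classes.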
27 / 32

DM-HEOM Benchmarks: Memory Layout
[Figure: impact of the configurable memory layout (AoS vs. AoSoA) on the average solver step runtime [ms], for fmo_22baths_d3.cfg and lhcii_1bath_d8.cfg on 2x SKL, 2x HSW, and KNL]
⇒ SKL and HSW: 1.3× to 2.4× speedup with AoSoA
⇒ KNL: 1.6× to 2.8× speedup with AoSoA
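The two layouts can be illustrated as follows (the element type and the block width VW are hypothetical choices, not DM-HEOM's actual data structures): AoSoA keeps VW consecutive values of each field contiguous, enabling unit-stride vector loads instead of the strided accesses an AoS layout forces on vectorised code.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t VW = 8; // e.g. 8 doubles = one AVX-512 vector

struct ComplexAoS { double re, im; };   // array-of-structures element

struct ComplexBlock {                   // AoSoA building block
    std::array<double, VW> re;
    std::array<double, VW> im;
};

// AoS access: re values are interleaved with im values (stride 2 doubles).
double aos_re(const std::vector<ComplexAoS>& v, std::size_t i)
{
    return v[i].re;
}

// AoSoA access: element i lives in block i / VW at lane i % VW, so VW
// neighbouring re values are contiguous in memory.
double aosoa_re(const std::vector<ComplexBlock>& v, std::size_t i)
{
    return v[i / VW].re[i % VW];
}
```

Making the layout configurable (e.g. via accessor macros in the kernel source, compiled at runtime) allows picking the faster variant per device without duplicating kernels.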
28 / 32

DM-HEOM Benchmarks: Performance Portability
[Figure: average solver step runtime [ms] relative to 2x Xeon (SKL), for fmo_22baths_d3.cfg and lhcii_1bath_d8.cfg on SKL, HSW, KNL, K40, and W8100; gray bars show the runtimes expected from peak FLOPS]
⇒ SKL (Xeon) is the reference
⇒ gray bars are the expected runtimes extrapolated from peak FLOPS
⇒ the older Haswell Xeon exceeds expectations, due to better OpenCL support
⇒ good: within 30 % of the expectation
⇒ KNL and K40 are sensitive to the irregular accesses caused by the extreme coupling in this scenario
29 / 32

Interdisciplinary Workflow (recap): Mathematical Model → High-Level Prototype (Mathematica) → OpenCL kernel within Mathematica → C++ Host Application → Distributed Host Application → Optimisation / Production Runs
30 / 32

Conclusion
General
• work interdisciplinary
• put portability first
OpenCL
• highest portability of the available parallel programming models
• integrates well into an interdisciplinary workflow
• runtime compilation allows compiler optimisation with runtime constants
• performance portability is not for free ⇒ achieve it e.g. via configurable work-item granularity and memory layout ⇒ worst case: multiple kernel variants, still better than multiple programming models
Caveats
• vendors are slow in implementing new standards ⇒ use OpenCL 1.2 + complain
• interoperability with communication APIs is not addressed by the current standard
31 / 32

EoP
Thank you.
Feedback? Questions? Ideas? [email protected]
This work was funded by the Deutsche Forschungsgemeinschaft (DFG) project RE 1389/8-1 and the "Research Center for Many-Core HPC" at ZIB, an Intel Parallel Computing Center. The authors acknowledge the North-German Supercomputing Alliance (HLRN) for providing compute resources.
32 / 32