OpenCL in Scientific High Performance Computing — The Good, the Bad, and the Ugly
Matthias Noack, [email protected]
Zuse Institute Berlin, Distributed Algorithms and Supercomputing
2017-05-18, IWOCL 2017
https://doi.org/10.1145/3078155.3078170

Context
ZIB hosts part of the HLRN (Northern German Supercomputing Alliance) facilities.

Systems:
• 2× Cray XC40 (#118 and #150 in the Top500, year 4/5 of lifetime) with Xeon CPUs
• 80-node Xeon Phi (KNL) Cray XC40 test-and-development system
• 2× 32-node InfiniBand cluster with Nvidia K40
• 2 test-and-development systems with AMD FirePro W8100

What I do:
• computer science research (methods)
• development of HPC codes
• evaluation of upcoming technologies
• consulting/training with system users

Why OpenCL? (aka: The Good)

Scientific HPC in a Nutshell
• tons of legacy code (FORTRAN) authored by domain experts ⇒ rather closed community ⇒ decoupled from computer science (ask a student about FORTRAN)
• highly conservative code owners ⇒ modern software engineering advances are picked up very slowly
• intra-node parallelism dominated by OpenMP (e.g. Intel) and CUDA (Nvidia) ⇒ vendor and tool dependencies ⇒ limited portability ⇒ multiple diverging code branches ⇒ hard to maintain
• inter-node communication = MPI
• hardware lifetime: 5 years
• software lifetime: multiple tens of years ⇒ outlives systems by far ⇒ aim for portability

Goal for new code: do not contribute to that situation!
• portability first (≠ performance portability) ⇒ OpenCL has the largest hardware coverage for intra-node programming ⇒ library only ⇒ no tool dependencies
• use modern techniques with a broad community (beyond HPC) ⇒ modern C++ for host code (see the sketch below)
• develop code interdisciplinarily ⇒ domain experts design the model, computer scientists the software
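Because the host API is a plain library, no special compiler or build tooling is needed; any modern C++ compiler plus the OpenCL loader suffices. As a minimal sketch (illustration only, not code from the talk; it assumes the Khronos C++ bindings, whose header name varies between versions), enumerating the available platforms and devices looks like this:

```cpp
// Minimal sketch (not from the talk): list OpenCL platforms and devices via the
// Khronos C++ bindings. Build with an ordinary C++ compiler and link -lOpenCL.
#include <CL/cl2.hpp>   // older binding versions ship this as CL/cl.hpp
#include <iostream>
#include <vector>

int main() {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);   // all platforms exposed by the ICD loader

    for (const auto& platform : platforms) {
        std::cout << "Platform: " << platform.getInfo<CL_PLATFORM_NAME>() << '\n';

        std::vector<cl::Device> devices;
        platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);
        for (const auto& device : devices) {
            std::cout << "  Device: " << device.getInfo<CL_DEVICE_NAME>()
                      << " (" << device.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>()
                      << " compute units)\n";
        }
    }
}
```

The same source builds unchanged on CPU, Xeon Phi, and GPU systems and picks up whatever OpenCL implementations the ICD loader finds, which is the portability argument made above.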
Target Hardware

vendor   architecture      device               compute        memory
Intel    Haswell           2× Xeon E5-2680v3    0.96 TFLOPS    136 GiB/s
Intel    Knights Landing   Xeon Phi 7250        2.61 TFLOPS*   490/115 GiB/s
AMD      Hawaii            FirePro W8100        2.1 TFLOPS     320 GiB/s
Nvidia   Kepler            Tesla K40            1.31 TFLOPS    480 GiB/s

* calculated with the max. AVX frequency of 1.2 GHz: 2611.2 GFLOPS = 1.2 GHz × 68 cores × 8 SIMD × 2 VPUs × 2 FMA

COSIM - A Predictive Cometary Coma Simulation

Solve the dust dynamics:

    \vec{a}_{\mathrm{dust}}(\vec{r})
      = \vec{a}_{\text{gas-drag}} + \vec{a}_{\mathrm{grav}} + \vec{a}_{\mathrm{Coriolis}} + \vec{a}_{\mathrm{centrifugal}}
      = \tfrac{1}{2} C_d\, \alpha\, N_{\mathrm{gas}}(\vec{r})\, m_{\mathrm{gas}}\,
        (\vec{v}_{\mathrm{gas}} - \vec{v}_{\mathrm{dust}})\, |\vec{v}_{\mathrm{gas}} - \vec{v}_{\mathrm{dust}}|
        - \nabla\phi(\vec{r})
        - 2\vec{\omega} \times \vec{v}_{\mathrm{dust}}
        - \vec{\omega} \times (\vec{\omega} \times \vec{r})

Compare with data of 67P/Churyumov–Gerasimenko from the Rosetta spacecraft:

[Figure: panels 1-2: OSIRIS NAC image; panels 3-4: simulation results (corr 0.90, time 12:12); right image: ESA – C. Carreau/ATG medialab, CC BY-SA 3.0-igo]
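To make the structure of that right-hand side concrete, here is a direct transcription into C++. This is a sketch only: the Vec3 type, the parameter names, and the numbers in the usage example are placeholders for illustration, not COSIM's actual data structures or values.

```cpp
// Sketch of the dust-acceleration right-hand side above (illustration only).
#include <cmath>
#include <iostream>

struct Vec3 { double x, y, z; };

Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 operator*(double s, Vec3 a) { return {s * a.x, s * a.y, s * a.z}; }
Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
double norm(Vec3 a) { return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z); }

// a_dust = a_gas-drag + a_grav + a_Coriolis + a_centrifugal
Vec3 dust_acceleration(Vec3 r, Vec3 v_dust,
                       double c_d, double alpha, double m_gas,
                       Vec3 v_gas, double n_gas,   // gas state at r
                       Vec3 grad_phi,              // gradient of the gravity potential at r
                       Vec3 omega)                 // rotation vector of the nucleus frame
{
    Vec3 dv          = v_gas - v_dust;
    Vec3 drag        = 0.5 * c_d * alpha * n_gas * m_gas * norm(dv) * dv;
    Vec3 grav        = -1.0 * grad_phi;
    Vec3 coriolis    = -2.0 * cross(omega, v_dust);
    Vec3 centrifugal = -1.0 * cross(omega, cross(omega, r));
    return drag + grav + coriolis + centrifugal;
}

int main() {
    // trivial usage example with made-up numbers
    Vec3 a = dust_acceleration({1e3, 0, 0}, {0, 10, 0},
                               2.0, 0.5, 4.65e-26,
                               {500, 0, 0}, 1e18,
                               {0, 0, 0}, {0, 0, 1e-4});
    std::cout << a.x << ' ' << a.y << ' ' << a.z << '\n';
}
```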
HEOM - The Hierarchical Equations of Motion Model for Open Quantum Systems
• understand the energy transfer in photo-active molecular complexes ⇒ e.g. photosynthesis
• millions of coupled ODEs
• hierarchical graph of matrices

    \frac{d\sigma_u}{dt}
      = \frac{i}{\hbar}[H, \sigma_u]
        - \sum_{b=1}^{B} \sum_{k}^{K-1} \sigma_u\, n_{u,(b,k)}\, \gamma(b,k)
        - \sum_{b=1}^{B} \Big[ \frac{2\lambda_b}{\beta\hbar^2\nu_b}
            - \sum_{k}^{K-1} \frac{c(b,k)}{\hbar\gamma(b,k)} \Big]
          V^{\times}_{s(b)} V^{\times}_{s(b)} \sigma_u
        + \sum_{b=1}^{B} \sum_{k}^{K-1} i\, V^{\times}_{s(b)}\, \sigma^{+}_{u,b,k}
        + \sum_{b=1}^{B} \sum_{k}^{K-1} n_{u,(b,k)}\, \theta_{MA(b,k)}\, \sigma^{-}_{(u,b,k)}

[Figure: hierarchy of auxiliary matrices: 0th level: ρ_{00}; 1st level: ρ_{10}, ρ_{01}; 2nd level: ρ_{20}, ρ_{11}, ρ_{02}]
[Image by University of Copenhagen Biology Department]

M. Noack, F. Wende, K.-D. Oertel, OpenCL: There and Back Again, in High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches, Vol. 2, 2015

Interdisciplinary Workflow (domain experts → computer scientists)
• Mathematical Model: ODEs, PDEs, graphs, ...
• High-Level Prototype (Mathematica): the domain scientist's tool; high level; symbolic solvers; arbitrary precision; very limited performance
• OpenCL kernel within Mathematica: replace some code with OpenCL

Mathematica and OpenCL
(* Load OpenCL support *)
Needs["OpenCLLink"]
(* Create OpenCLFunction from source, kernel name, signature *)
doubleFun = OpenCLFunctionLoad["
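The OpenCLFunctionLoad call above is truncated here. For contrast with the C++ host code that the later steps of this workflow lead to, a generic sketch of the same step, compiling a kernel from a source string and invoking it via the Khronos C++ bindings, might look as follows; the kernel and all names are illustrative, not the talk's code.

```cpp
// Generic sketch (not the talk's code): build and run a kernel from a source
// string in C++ host code, the role OpenCLFunctionLoad plays in Mathematica.
#include <CL/cl2.hpp>
#include <iostream>
#include <vector>

// Illustrative kernel: double every element of a buffer.
static const char* kSource = R"(
    __kernel void doubleVec(__global float* data, const uint n) {
        size_t i = get_global_id(0);
        if (i < n) data[i] *= 2.0f;
    }
)";

int main() {
    cl::Context context(CL_DEVICE_TYPE_DEFAULT);
    cl::Program program(context, kSource);
    if (program.build() != CL_SUCCESS) {      // compile for the context's devices
        std::cerr << "kernel build failed\n";
        return 1;
    }
    cl::Kernel kernel(program, "doubleVec");

    std::vector<float> host(1024, 1.0f);
    cl::Buffer buf(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                   host.size() * sizeof(float), host.data());

    cl::CommandQueue queue(context);
    kernel.setArg(0, buf);
    kernel.setArg(1, static_cast<cl_uint>(host.size()));
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(host.size()));
    queue.enqueueReadBuffer(buf, CL_TRUE, 0, host.size() * sizeof(float), host.data());

    std::cout << host[0] << '\n';             // prints 2 if the kernel ran
}
```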