Advances in Science, Technology and Engineering Systems Journal, Vol. 5, No. 1, 119-127 (2020)
ASTES Journal, www.astesj.com, ISSN: 2415-6698
Special Issue on Multidisciplinary Sciences and Engineering

Performance Portability and Unified Profiling for Finite Element Methods on Parallel Systems

Vladyslav Kucher*,1, Jens Hunloh2, Sergei Gorlatch2

1 National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Prosp. Peremohy 37, Kyiv, 03056, Ukraine
2 University of Muenster, Einsteinstr. 62, Muenster, Germany

ARTICLE INFO

Article history:
Received: 11 November, 2019
Accepted: 20 December, 2019
Online: 24 January, 2020

Keywords:
C++
GPU programming
High-performance computing
Performance portability
Finite element methods
Portable profiling
Unified parallel programming

ABSTRACT

The currently available variety of modern, highly-parallel universal processors includes multi-core CPU and many-core GPU (Graphics Processing Units) from different vendors. Systems composed of such processors enable high-performance execution of demanding applications like numerical Finite Element Methods. However, today's application programming for parallel systems lacks performance portability: the same program code cannot achieve stable high performance on different parallel architectures. One of the main reasons for this is that parallel programs are developed by utilizing system-specific profiling interfaces of the corresponding hardware vendors. We describe a novel, portable profiling interface: its design, implementation, and evaluation within the popular framework DUNE for solving differential equations using finite element methods. Our profiler is built on top of the PACXX framework for parallel programming in C++, and it supports portable parallel programming using a single profiling tool on various target hardware platforms.

1 Introduction

Modern universal processors are becoming highly parallel: multi-core CPU and many-core GPU (Graphics Processing Units) are produced by different vendors and exemplify a variety of architecture and hardware solutions. Such processors enable building high-performance systems for computation-intensive applications like numerical finite element methods.

The current programming approaches for parallel computing systems include CUDA [1], which is restricted to GPU produced by NVIDIA, as well as more universal programming models - OpenCL [2], SYCL [3], and PACXX [4] - that follow the idea of unified programming: the same program can target different hardware without changing the source code, thus reducing the development overhead. However, existing profiling tools are still restricted to the corresponding vendor; therefore, the application programmer usually has to use several different tools to achieve the ultimate portability.

There have been several research efforts to make profiling tools more portable and flexible. CUDA Advisor [5] is a profiling interface that collects data about the performance of both the CPU and GPU parts of program code by using the LLVM infrastructure [6]. The experimental approach of [7] uses the TAU tool for profiling CUDA GPU applications, producing detailed information on communication between CPU and GPU without modifying the source code. The SASSI instrumentation tool (a "SASS Instrumentor" for NVIDIA assembly code) [8] relies on customizable metrics: it enables inserting user-defined instrumentation code (e.g., debug hooks and customized counters) to collect detailed profiling data that are usually not available. The generic tool integrated into the Score-P infrastructure [9] provides an interface for evaluating the performance of OpenCL code on different parallel architectures.

This paper aims at designing a single, cross-platform profiler that significantly improves the portability of program development. We design our profiler by extending the unified programming framework PACXX (Programming Accelerators in C++) [4] with a generic profiling interface that follows the unified programming model of PACXX. Our profiler seamlessly extends the PACXX programming system and enables collecting profiling information at different stages of the development process, for different kinds of target hardware, thus reducing the development overhead.

* Vladyslav Kucher, Prosp. Peremohy 37, Kyiv, 03056, Ukraine, [email protected]

Figure 1: The PACXX framework, extended with profiling (shaded parts)

The unified parallel programming approach of PACXX enables programming CPU/GPU systems in the newest C++14/17 standards. PACXX significantly simplifies the development of portable parallel C++ programs: the source code is only slightly modified while staying within C++. The PACXX advantages were demonstrated for application areas like linear algebra, stencils, and N-Body simulations [10].

We choose the solving of Partial Differential Equations (PDE) as an application field to illustrate and evaluate our approach to unified profiling, because PDE are broadly used in different areas of engineering and science, including simulation [11], finance [12], numerical modeling [13], and biology [14]. Most practically important PDE are solved using finite-element (FE) methods. Unfortunately, the implementation of these methods is often tied to a particular data structure representing the corresponding computation grid. C++ is often used for implementing finite-element methods because of its high performance and modularity. The Finite Element Computational Software (FEniCS) [15] is a C++-based framework for FE variational problems, relying on automatic code generation and the custom FEniCS Form Compiler [16]: it supports performance portability by employing a domain-specific language based on Python [17] and multi-staging optimization. The Firedrake approach [18] achieves performance portability by using an advanced code generator: problems are specified in a Python-embedded, domain-specific language that is lowered to C and then JIT-compiled, so target code is generated at run time. The object-oriented framework Overture [19] for solving PDE on various platforms contains C++ classes for computations in different geometries. The Feel++ framework [20] uses a C++-based domain-specific language whose implementation is compatible with third-party libraries.

In order to evaluate our approach to unified profiling, we consider the Distributed and Unified Numerics Environment (DUNE) [21] as a case study. DUNE offers a unified interface for solving PDE on parallel systems [22]: the interface combines various methods in a modular C++ library. DUNE enables using arbitrarily shaped grids and a large set of particular algorithms. While the design of DUNE is very flexible, its application performance strongly depends on the employed C++ compiler and the target hardware.

The goal of this paper is to extend the flexibility of existing approaches to parallel PDE solving. We achieve performance portability and cross-platform profiling of the finite-element methods for solving PDE by extending the PACXX framework and integrating it with the existing DUNE framework.

The paper is organized as follows. In Section 2, we describe the C++-based unified programming approach of PACXX and the design of our profiling interface. In Section 3, we show how this interface is used for portable profiling of a simple example on various target architectures. In Sections 4 and 5, we illustrate how PACXX can accelerate grid-based FE algorithms for PDE in the extended DUNE. We demonstrate that the integration of PACXX with DUNE leads to performance portability over different architectures, significantly decreasing the development overhead and improving code maintenance. We experimentally evaluate our unified profiling approach in Section 6, and we summarize our results in Section 7.

2 Programming and profiling

2.1 The PACXX Programming model

We use the PACXX framework [4] that supports a unified, C++-based model of parallel programming. In analogy with the recent OpenCL standard [23], PACXX expresses parallelism using kernels that are executed on devices. The main advantage of PACXX, however, is that an application is expressed as single-source C++ code, whereas OpenCL kernels are implemented in a special kernel language and need additional host code to be executable. The performance portability in PACXX is ensured by means of several pre-implemented back-ends.

Figure 1 shows that the compilation process of PACXX proceeds in two steps. The first step is the offline compilation that transforms the source code into an executable: the kernel code is precompiled, and the executable is integrated with the particular back-end for the target CPU or GPU. In the second step, online compilation is invoked when the executable is started. The integration of the executable with back-ends for different architectures allows the user to choose the target hardware for each execution. The efficient execution on the chosen target system is handled by PACXX transparently: all necessary code generation steps and optimizations are performed automatically. This brings the portability of PACXX applications to CPUs and GPUs of different vendors.

2.2 Profiling interface: Design and implementation

Figure 1 shows the design of our unified profiling interface integrated into PACXX: shaded are our profiling extensions to the original PACXX back-ends. These extensions comprise hardware-specific code to collect profiling data. For the user, our profiling interface can be viewed as a wrapper that chooses the corresponding profiling extension for the particular target architecture.

The advantage of our design of profiling is that no changes to the offline compiler of PACXX are necessary. The modifications are made only to the PACXX runtime and its back-ends, which are extended with vendor-specific profiling tools: the CPU back-end uses the PAPI library [24], and the back-end for NVIDIA GPUs relies on the CUPTI library [25].

2.3 The profiling extension

For different back-ends, the profiling extensions have a unified structure that includes a changed launch procedure for kernels to manage collecting performance data. We have to modify the way a kernel is launched because many existing devices cannot record several profiling metrics simultaneously (e.g., NVIDIA GPUs [25]). For some metrics, multiple reproducible runs of the same kernel may be necessary to record that metric. It therefore becomes mandatory to rebuild the device memory state automatically after each kernel execution, such that the executions are measured independently. This reconstruction of the memory state enables multiple profiling runs for the same program, so that accurate performance data are produced. Traditionally, this is done using a shadow-copy mechanism that stores and restores the device memory state transparently at each kernel execution. This solution effectively decouples the profiling from undesirable side effects: eventually this ensures exact performance data and deterministic behavior when measuring multiple kernel launches.

Figure 2: Kernel launch, modified for profiling

Figure 2 shows our modification (in the dashed box) to the launch procedure within a PACXX back-end. The new launch procedure reflects the control flow of our profiling extension: the processing steps rely on target-specific code of the concrete back-end. When profiling is enabled, a kernel launch triggers the profiling extension to store a shadow copy of the utilized part of device memory. Then the kernel is executed repeatedly, such that this shadow copy is used for every recorded performance metric. Eventually, the application proceeds using the original kernel launch procedure.
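To make the control flow of Figure 2 concrete, the following minimal, self-contained C++ sketch mimics the modified launch procedure. It is an illustration only, not the actual PACXX runtime code: the in-memory "device" and the names profiled_launch, save_shadow_copy, restore_shadow_copy, and run_kernel_once are hypothetical stand-ins for the back-end functionality that PACXX implements on top of PAPI (CPU) and CUPTI (GPU).

    #include <chrono>
    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    static std::vector<double> device_memory(4, 0.0);   // stand-in for device buffers
    static std::vector<double> shadow;                   // host-side shadow copy

    void save_shadow_copy()    { shadow = device_memory; }
    void restore_shadow_copy() { device_memory = shadow; }

    // Runs the kernel once and returns a value for the requested metric.
    // Only "kernelDuration" is modelled here; the real back-ends query the
    // vendor APIs (PAPI, CUPTI) for hardware counters instead.
    double run_kernel_once(const std::function<void()>& kernel,
                           const std::string& metric) {
      auto t0 = std::chrono::steady_clock::now();
      kernel();
      auto t1 = std::chrono::steady_clock::now();
      if (metric == "kernelDuration")
        return std::chrono::duration<double, std::nano>(t1 - t0).count();
      return 0.0;                                        // unknown metric
    }

    // Modified launch procedure (cf. Figure 2): snapshot the memory once,
    // then restore it before every measured run so that each metric is
    // recorded under identical, reproducible conditions.
    void profiled_launch(const std::function<void()>& kernel,
                         const std::vector<std::string>& metrics,
                         std::map<std::string, double>& results,
                         bool profiling_enabled) {
      if (!profiling_enabled) { kernel(); return; }      // ordinary launch
      save_shadow_copy();
      for (const auto& m : metrics) {
        restore_shadow_copy();                           // rebuild memory state
        results[m] = run_kernel_once(kernel, m);
      }
      restore_shadow_copy();                             // leave state as found,
      kernel();                                          // then do the real launch
    }

    int main() {
      std::map<std::string, double> results;
      auto kernel = [] { for (auto& x : device_memory) x += 1.0; };
      profiled_launch(kernel, {"kernelDuration"}, results, true);
      std::cout << "kernelDuration: " << results["kernelDuration"] << " ns, "
                << "device_memory[0] = " << device_memory[0] << "\n";  // prints 1
    }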

3 The profiler usage

We demonstrate the usage of our unified profiler and evaluate it on two application examples: the first is the traditional multiplication of matrices, and the second is a more elaborate PDE solver.

3.1 Example: Matrix multiplication

Figure 3 illustrates how matrix multiplication is expressed in PACXX [4]: this program is created from the C++ code by transforming nested sequential loops into parallel calls of a kernel. Arrays a and b stand for the input matrices, and array c stores the resulting matrix. The matrixMultKernel kernel (lines 15-23) is written as a C++ lambda expression to be computed in parallel for each element of c. The kernel run requires an upload of the input data (lines 3-13) and the fetching of the result (line 26) after the computation, which is analogous to CUDA and OpenCL kernels.

     1  auto& device = Executor::get(0);
     2
     3  auto& da = device.allocate(matrix_size);
     4  auto& db = device.allocate(matrix_size);
     5  auto& dc = device.allocate(matrix_size);
     6
     7  da.upload(a, matrix_size);
     8  db.upload(b, matrix_size);
     9  dc.upload(c, matrix_size);
    10
    11  auto pa = da.get();
    12  auto pb = db.get();
    13  auto pc = dc.get();
    14
    15  auto matrixMultKernel = [=](auto &config)
    16  {
    17    auto col = config.get_global(0);
    18    auto row = config.get_global(1);
    19    double val = 0;
    20    for (unsigned i = 0; i < width; ++i)
    21      val += pa[row * width + i] * pb[i * width + col];
    22    pc[row * width + col] = val;
    23  };
    24  exec.launch(matrixMultKernel,
          {{width/threads, width}, {threads, 1}, 0});
    25
    26  dc.download(c, matrix_size);

Figure 3: PACXX code example: matrix multiplication

Figure 3 shows that the code employs parallelism by dividing calculations into blocks and threads. Function launch (line 24) invokes the kernel and partitions the work in up to three dimensions, thus specifying the structure and amount of parallelism. In the range, the first component declares how many blocks are started, while the threads variable declares the number of threads in a block. At run time, each block is assigned to a processor that runs all threads of the block in the SIMT (Single Instruction, Multiple Threads) manner, which is a combination of the SIMD and multi-threading paradigms. Function get_global retrieves the id in each dimension of the range. The first dimension of the range in our example corresponds to the row, and the second to the column in the matrix.

After the invocation of the PACXX offline compiler, the executable is automatically combined with our profiler. Therefore, PACXX serves effectively as a drop-in replacement for the standard LLVM (Low-Level Virtual Machine) [6] compilation toolchain.
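For comparison, the nested sequential loops from which such a PACXX kernel is derived have the following generic shape (a plain C++ sketch, not the authors' original source): the two outer loops over rows and columns become the parallel kernel instances, while the innermost loop remains inside the kernel body.

    // Plain sequential matrix multiplication for width x width matrices.
    // PACXX replaces the two outer loops by a parallel launch: one kernel
    // instance per element of c, exactly as in matrixMultKernel above.
    void matrixMultSequential(const double* a, const double* b, double* c,
                              unsigned width) {
      for (unsigned row = 0; row < width; ++row)
        for (unsigned col = 0; col < width; ++col) {
          double val = 0;
          for (unsigned i = 0; i < width; ++i)
            val += a[row * width + i] * b[i * width + col];
          c[row * width + col] = val;
        }
    }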

3.2 Workflow of profiling

In Figure 4, the profiling of our example application is invoked with the runtime environment variable PACXX_PROF_ENABLE set, which leads to profiling each kernel invocation. At each kernel run, the runtime system executes the online compilation transparently for the user and enables the profiler, such that the profiling information is collected from the target device. The variable PACXX_DEFAULT_RT specifies the target architecture (CPU or GPU). Profiling data are written to an output file that can be either explicitly specified, as in Figure 4, or sent to the standard output.

    > PACXX_DEFAULT_RT=GPU PACXX_PROF_ENABLE=1
      PACXX_PROF_IN=custom_profiling_configuration
      PACXX_PROF_OUT=pacxx.json ./matrixMult

Figure 4: Invoking a PACXX program with profiling

Our profiler always reports one metric - runtime duration - by default. To ask for additional profiling data, the user can specify a specific configuration. The profiler is able to profile all metrics that are supported by the system vendor, e.g., memory throughput and usage, instructions per cycle, etc. All 174 possible metrics for NVIDIA GPUs are listed in [25].

3.3 Visualization of profiling

Figure 5 illustrates the profiling data collected for the matrix multiply example prepared as shown in Figure 4. The profiler produces a JSON-format file that comprises all specified metrics for each started kernel instance. For multiple kernel launches, our profiler records every launch independently, and the output data are prefixed by the kernel name. Thus, the user obtains profiling information automatically: the user does not have to explicitly select the profiling tool for a kernel and the architecture on which it runs; the appropriate profiling functionality is ensured by the PACXX runtime for the target architecture.

    "matrixMultKernel": [
      { "Metrics": {
          "cf_executed": "68157540",
          "kernelDuration": "517349264ns" } } ]

Figure 5: Results of profiling in JSON format

The use of the JSON format enables efficient visualization of the profiling results using Gnuplot or other tools, such that the code behavior on different target platforms can be conveniently compared.
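As an illustration of such post-processing, the following small, self-contained program flattens a profile like the one in Figure 5 into CSV lines that Gnuplot can plot directly. It is a sketch under assumptions: it uses the third-party nlohmann/json header, and it assumes the overall file layout follows the excerpt in Figure 5 (kernel name mapped to an array of launches, each with a "Metrics" object of string values); neither is prescribed by PACXX itself.

    #include <fstream>
    #include <iostream>
    #include <nlohmann/json.hpp>

    int main(int argc, char** argv) {
      std::ifstream in(argc > 1 ? argv[1] : "pacxx.json");
      nlohmann::json profile;
      in >> profile;

      std::cout << "kernel,launch,metric,value\n";
      // Top level (assumed): kernel name -> array of recorded launches.
      for (auto& [kernel, launches] : profile.items()) {
        int launch_id = 0;
        for (auto& launch : launches) {
          // Each launch carries a "Metrics" object with string-valued entries
          // such as "cf_executed" or "kernelDuration" (cf. Figure 5).
          for (auto& [metric, value] : launch.at("Metrics").items())
            std::cout << kernel << ',' << launch_id << ','
                      << metric << ',' << value.get<std::string>() << '\n';
          ++launch_id;
        }
      }
    }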

"matrixMultKernel": [ In this section, we show the use of our approach to profiling on { "Metrics": { the case study of the DUNE framework. We explain its efficient "cf_executed": "68157540", "kernelDuration": "517349264ns" } } ] integration with the PACXX unified programming framework, by taking into account the design of DUNE’s grid interface. Figure 5: Results of profiling in JSON format Our idea of integration is to employ PACXX as a code generator for the DUNE interface which is designed by using abstract inter- faces and template meta-programming to separate data structures The use of the JSON format enables efficient visualization of and algorithms. By reusing existing modules, DUNE offers a rich the profiling results using Gnuplot or other tools, such that the collection of functionalities. code behavior on different target platforms can be conveniently In our integration with PACXX, we use the fact that DUNE compared. contains C++ modules that provide abstract interfaces with sev- Figure 6 compares the execution time of matrix multiplication eral implementations. The use of [26] and on a CPU ( E5-1620 v2) vs. a GPU ( C++ template meta-programming [27] enables optimizations to K20c). The input comprises two 4096x4096 square matrices gen- be applied at compile time: therefore, DUNE is flexible, without erated randomly. The plot demonstrates how the execution time is introducing overhead at runtime. dependent of the parallelism amount specified by parameter threads Figure 8 illustrates that our integration comprises two levels. in the code in Figure 3. At the top (abstraction) level, a particular problem is formulated www.astesj.com 122 V. Kucher et al. / Advances in Science, Technology and Engineering Systems Journal Vol. 5, No. 1, 119-127 (2020) using extensions of a domain-specific language analogous to [28], apply optimizations based on control-flow linearization [29]. Since and is then implemented as a specific DUNE kernel. At the bottom PACXX is fully compatible with the modern C++14/17 standards, (hardware optimization) level, code is generated for a specific tar- the program developers can use existing C++ code on various target get architecture. This two-level design of the integration toolchain architectures, which significantly reduces the development effort. enables suitable optimizations at every processing level. 5 Compiling kernels: examples Mathematical Abstraction Layer This section describes the use of our integrated DUNE-PACXX High-Level Representation framework for generating code for two examples of DUNE kernels: 1) Poisson equation solving, and 2) deformation of material calcula- tion based on the linear elasticity theory. These two examples cover DUNE Framework - Implementation two scenarios: Poisson calculates with low load on a simple 2D grid, and linear elasticity is based on a 3D grid with a heavy load. Ex- ample applications of linear elasticity are in computer-aided design, Modules Grid Geometry ... and Poisson is applied in physics to calculate fields of potentials.

5.1 PACXX parallelization

We consider example kernels that are simple sequential C++ codes; they are parallelized according to the PACXX programming approach. The slight one-time adaptation effort for the kernels leads to performance portability of the resulting code: the programmer does not have to re-design the program code for each new parallel architecture.

Our two kernel examples work on a grid refined on two levels: the computation in a uniform grid of macro-elements uses a coarse-grained refinement, and each macro-element is further refined to a grid of micro-elements. We parallelize the kernels in two ways: either different macro-elements are processed in parallel, or several micro-elements within a macro-element are processed in parallel. We compare the micro- vs. macro-parallelization approaches regarding their performance against an original sequential version.
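The two strategies can be pictured with a small, self-contained C++ program. This is a hypothetical sketch, not the authors' DUNE kernels: it uses std::thread instead of PACXX kernel launches so that it runs stand-alone, and the grid sizes M and k as well as the micro_work function are made-up placeholders for the per-element finite-element computation.

    // two_level_grid_sketch.cpp -- macro- vs. micro-level parallelization of a
    // grid that is refined on two levels: M x M macro-elements, each holding
    // k x k micro-elements (compile with -pthread).
    #include <iostream>
    #include <thread>
    #include <vector>

    constexpr int M = 4;   // macro-grid resolution (per dimension)
    constexpr int k = 8;   // micro-grid resolution inside one macro-element

    // Stand-in for the per-micro-element finite-element work.
    double micro_work(int macro, int el_x, int el_y) {
      return macro + 0.001 * (el_x * k + el_y);
    }

    // Strategy 1: macro-parallel -- one thread per macro-element,
    // micro-elements processed sequentially inside it (coarse-grained).
    void macro_parallel(std::vector<double>& result) {
      std::vector<std::thread> workers;
      for (int macro = 0; macro < M * M; ++macro)
        workers.emplace_back([&result, macro] {
          double acc = 0;
          for (int el_y = 0; el_y < k; ++el_y)
            for (int el_x = 0; el_x < k; ++el_x)
              acc += micro_work(macro, el_x, el_y);
          result[macro] = acc;
        });
      for (auto& w : workers) w.join();
    }

    // Strategy 2: micro-parallel -- macro-elements processed one after another,
    // but the micro-elements of each are distributed over threads (fine-grained,
    // smaller memory footprint per step).
    void micro_parallel(std::vector<double>& result) {
      for (int macro = 0; macro < M * M; ++macro) {
        std::vector<double> partial(k, 0.0);
        std::vector<std::thread> workers;
        for (int el_y = 0; el_y < k; ++el_y)
          workers.emplace_back([&partial, macro, el_y] {
            for (int el_x = 0; el_x < k; ++el_x)
              partial[el_y] += micro_work(macro, el_x, el_y);
          });
        for (auto& w : workers) w.join();
        double acc = 0;
        for (double p : partial) acc += p;
        result[macro] = acc;
      }
    }

    int main() {
      std::vector<double> r1(M * M), r2(M * M);
      macro_parallel(r1);
      micro_parallel(r2);
      std::cout << "macro[0]=" << r1[0] << "  micro[0]=" << r2[0] << "\n";
    }

In PACXX, the same two decompositions correspond to launching one kernel instance per macro-element or one kernel instance per micro-element, respectively.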

5.2 The Poisson kernel

The sequential Poisson kernel iterates over the micro-elements of the grid:

    // iterating over micro-elements
    for (int el_y = 0; el_y < k; ++el_y)
      ...

Figure 10: Poisson: micro-grid and data dependencies

Figure 11 shows our second example, the linear elasticity kernel, which consists of three nested for loops and is thus significantly more complex than Poisson. On the one hand, this complicates the parallelization (while the scheme of parallelization remains similar to Poisson); on the other hand, it offers more potential for parallelism.

    // iterating over micro-elements
    for (int el_z = 0; el_z < k; ++el_z)
      for (int el_y = 0; el_y < k; ++el_y)
        for (int el_x = 0; el_x < k; ++el_x)
        {
          // (1) calculating
          for (int dim = 0; dim <= 2; ++dim)
            for (int index_z = 0; index_z <= 1; ++index_z)
              for (int index_y = 0; index_y <= 1; ++index_y)
                for (int index_x = 0; index_x <= 1; ++index_x)
                { /* computing the element values ... */ }
          // (2) accumulating
          for (int index_z = 0; index_z <= 1; ++index_z)
            for (int index_y = 0; index_y <= 1; ++index_y)
              for (int index_x = 0; index_x <= 1; ++index_x)
              { /* accumulation of node values ... */ }
        }

Figure 11: Linear elasticity: sequential C++ code

For correct parallelization, we must avoid race conditions. Therefore, we synchronize accesses to memory by distinguishing between two kernels: accumulation and computation. We parallelize the computation kernel over the micro-grid elements, but temporarily store the results of this kernel rather than performing write operations immediately.

Afterwards, the corresponding vertices are processed in parallel by the accumulation kernel: it reads the previously stored intermediate results and updates the adjacent vertices by simultaneous write operations. The described implementation guarantees the absence of conflicts in the patterns of memory access at the level of the micro-grid.

At the macro-layer, we parallelize computations by executing the main computation for each instance of the kernel, such that all macro-elements are processed in parallel. For this approach, DUNE's grid interface must be slightly modified, but the advantage is that the patterns of memory access are not as complicated.

The parallelization at the macro-layer implies that the workload is distributed in a coarse-grained manner; this is advantageous at large scale, since it greatly reduces the parallelization overhead. However, it brings a drawback: global memory must be allocated for all macro-elements simultaneously, so the memory requirement increases. Vice versa, the fine-grained distribution of workload for micro-layer parallelism allocates only little memory per macro-element.
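A minimal, self-contained sketch of this two-kernel scheme is given below. It is an illustration under assumptions, not the authors' DUNE code: a 1D micro-grid with two nodes per element stands in for the real finite-element grid, and the per-element contribution is a placeholder value; in the actual implementation the two phases are separate PACXX kernels running on the target device.

    // compute_accumulate_sketch.cpp -- element-wise contributions are first
    // written to private scratch storage (no races), and a second pass
    // gathers them per grid node.
    #include <iostream>
    #include <vector>

    int main() {
      constexpr int elements = 8;                // 1D micro-grid, 2 nodes/element
      constexpr int nodes = elements + 1;

      // Phase 1 ("computation kernel"): each element writes only to its own
      // two scratch slots, so all elements could run in parallel without conflicts.
      std::vector<double> scratch(2 * elements, 0.0);
      for (int e = 0; e < elements; ++e) {       // parallel over elements in PACXX
        double contribution = 1.0 + e;           // stand-in for the FE integral
        scratch[2 * e + 0] = 0.5 * contribution; // share for the left node
        scratch[2 * e + 1] = 0.5 * contribution; // share for the right node
      }

      // Phase 2 ("accumulation kernel"): each node gathers the stored
      // contributions of its adjacent elements; no two iterations write to the
      // same location, so this loop is also safely parallel.
      std::vector<double> node_values(nodes, 0.0);
      for (int n = 0; n < nodes; ++n) {          // parallel over nodes in PACXX
        if (n > 0)         node_values[n] += scratch[2 * (n - 1) + 1];
        if (n < nodes - 1) node_values[n] += scratch[2 * n + 0];
      }

      for (int n = 0; n < nodes; ++n)
        std::cout << "node " << n << ": " << node_values[n] << "\n";
    }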

6 Experimental evaluation

We conduct our measurements using two parallel processors: an Intel Xeon E5-1620 v2 CPU with 4 cores at 3.70 GHz (8 logical cores), and an NVIDIA Tesla K20c GPU with 2496 compute units. Our CPU supports AVX vector registers, so we evaluate both scalar and auto-vectorized versions of our kernels. The number of (sub)elements (the x-axis in the following plots) corresponds to the subdivision degree of the grid in each dimension: e.g., 5 sub-elements imply a 5x5x5 grid in the 3D program for linear elasticity.

Each measurement performs one iteration of the linear elasticity program, which corresponds to 16 executions of the kernel, based on the number of quadrature points. The following figures compare the three aforementioned platforms (vectorized CPU, scalar CPU, and GPU) based on various metrics.

Figure 12 and Figure 13 show the memory operations (lower is better). We observe that read operations can be coalesced quite well, while write operations less so. The GPU with its 32 memory controllers sets the target for the other hardware platforms, which they struggle to reach.

Figure 12: Number of executed memory reads across used platforms

Figure 13: Number of executed memory writes across used platforms

On the code side, Figure 14 shows the number of executed instructions. We see that CPU vectorization suffers from the low data sizes, generating too many scatter-gather instructions; Figure 15 further supports this claim, showing an order of magnitude more processed control-flow instructions in the vectorized CPU version of the kernel. The GPU can distribute the load across its numerous (2496) computing units and keep the number of instructions low.

Figure 14: Total completed instructions across used platforms

Figure 15: Number of executed control-flow instructions across used platforms

Figure 16 compares the run time of the kernel execution. The vectorized CPU version of the kernel does not show good performance for small data amounts, but it improves over the scalar CPU version by an element count of 8, matching the expectation from Figure 14. Surprisingly, the GPU version lags behind the scalar CPU version, so we look into the reasons for the low performance on the GPU side.

Figure 16: Linear elasticity kernel duration across used platforms

Figure 17 shows that the major reason for the slowdown is memory throttling, caused by the hardware limitation of issuing memory instructions only every 4 cycles. Therefore, memory requests have high divergence and, thus, cannot be fulfilled in a timely manner.

Figure 17: Execution stalls on the GPU causing low performance

7 Conclusion

We design and implement a novel cross-platform approach to programming and profiling parallel applications, which significantly improves the portability of program development. We extend the unified programming framework PACXX with a generic profiling interface that enables collecting profiling information at different stages of the development process, for different target hardware. Our approach liberates the program developer from having to use several proprietary tools when profiling the same application on different hardware. Therefore, the software development overhead is significantly reduced.

We choose the solving of Partial Differential Equations (PDE) to illustrate and evaluate our approach to unified profiling, because PDE are broadly used in different areas of engineering and science. As a case study, we use the popular DUNE framework, and we experimentally confirm that the integration of our extended PACXX with DUNE leads to performance portability over different architectures, significantly decreasing development overhead and improving code maintenance.

Acknowledgement. The authors gratefully acknowledge generous support from the German Federal Ministry of Education and Research (BMBF) within the HPC2SE project.

References

[1] J. Nickolls, I. Buck, M. Garland, K. Skadron, "Scalable Parallel Programming with CUDA," Queue, vol. 6, pp. 40–53, Mar. 2008.

[2] J. Kim, S. Seo, J. Lee, J. Nah, G. Jo, J. Lee, "OpenCL As a Unified Programming Model for Heterogeneous CPU/GPU Clusters," SIGPLAN Not., vol. 47, pp. 299–300, Feb. 2012.

[3] R. Keryell, R. Reyes, L. Howes, "Khronos SYCL for OpenCL: A Tutorial," in Proceedings of the 3rd International Workshop on OpenCL, IWOCL '15, (New York, NY, USA), pp. 24:1–24:1, ACM, 2015.

[4] M. Haidl, M. Steuwer, T. Humernbrum, S. Gorlatch, "Multi-stage Programming for GPUs in C++ Using PACXX," in Proceedings of the 9th Annual Workshop on General Purpose Processing Using Graphics Processing Unit, GPGPU '16, (New York, NY, USA), pp. 32–41, ACM, 2016.

[5] D. Shen, S. L. Song, A. Li, X. Liu, "CUDAAdvisor: LLVM-based Runtime Profiling for Modern GPUs," in Proceedings of the 2018 International Symposium on Code Generation and Optimization, CGO 2018, (New York, NY, USA), pp. 214–227, ACM, 2018.

[6] C. Lattner, V. Adve, "LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation," in Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO '04, (Washington, DC, USA), pp. 75–, IEEE Computer Society, 2004.

[7] A. D. Malony, S. Biersdorff, W. Spear, S. Mayanglambam, "An Experimental Approach to Performance Measurement of Heterogeneous Parallel Applications Using CUDA," in Proceedings of the 24th ACM International Conference on Supercomputing, ICS '10, (New York, NY, USA), pp. 127–136, ACM, 2010.

[8] M. Stephenson, S. K. S. Hari, Y. Lee, E. Ebrahimi, D. R. Johnson, D. Nellans, M. O'Connor, S. W. Keckler, "Flexible Software Profiling of GPU Architectures," SIGARCH Comput. Archit. News, vol. 43, pp. 185–197, June 2015.

[9] R. Dietrich, R. Tschüter, "A generic infrastructure for OpenCL performance analysis," in 2015 IEEE 8th International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), vol. 1, pp. 334–341, Sept 2015.

[10] M. Haidl, B. Hagedorn, S. Gorlatch, "Programming GPUs with C++14 and Just-In-Time Compilation," in Parallel Computing: On the Road to Exascale, Proceedings of the International Conference on Parallel Computing, ParCo 2015, 1-4 September 2015, Edinburgh, Scotland, UK, pp. 247–256, 2015.

[11] F. Feldmann, B. Hagemann, L. Ganzer, M. Panfilov, "Numerical simulation of hydrodynamic and gas mixing processes in underground hydrogen storages," Environmental Earth Sciences, vol. 75, p. 1165, Aug 2016.

[12] D. J. Duffy, Finite Difference Methods in Financial Engineering: A Partial Differential Equation Approach. John Wiley & Sons, 2013.

[13] I. Rucker, W. Ressel, "A numerical drainage model to simulate infiltration into porous pavements for higher road safety," in 17. Internationales Stuttgarter Symposium, (Wiesbaden), pp. 1293–1303, Springer Fachmedien Wiesbaden, 2017.

[14] A. W. Leung, Systems of Nonlinear Partial Differential Equations: Applications to Biology and Engineering, vol. 49. Springer Science & Business Media, 2013.

[15] M. S. Alnæs, J. Blechta, J. Hake, et al., "The FEniCS project version 1.5," Archive of Numerical Software, vol. 3, no. 100, pp. 9–23, 2015.

[16] A. Logg, K. B. Ølgaard, M. E. Rognes, G. N. Wells, "FFC: the FEniCS form compiler," in Automated Solution of Differential Equations by the Finite Element Method, pp. 227–238, Springer, 2012.

[17] G. R. Markall, F. Rathgeber, L. Mitchell, et al., "Performance-Portable Finite Element Assembly Using PyOP2 and FEniCS," in Supercomputing, (Berlin, Heidelberg), pp. 279–289, Springer Berlin Heidelberg, 2013.

[18] F. Rathgeber, D. A. Ham, L. Mitchell, et al., "Firedrake: automating the finite element method by composing abstractions," ACM Transactions on Mathematical Software (TOMS), vol. 43, no. 3, p. 24, 2017.

[19] D. L. Brown, W. D. Henshaw, D. J. Quinlan, "Overture: An object-oriented framework for solving partial differential equations," in International Conference on Computing in Object-Oriented Parallel Environments, pp. 177–184, Springer, 1997.

[20] C. Prud'Homme, V. Chabannes, V. Doyeux, et al., "Advances in Feel++: a domain specific embedded language in C++ for partial differential equations," in Eccomas'12 - European Congress on Computational Methods in Applied Sciences and Engineering, 2012.

[21] P. Bastian, M. Blatt, C. Engwer, et al., "The distributed and unified numerics environment (DUNE)," in Proc. of the 19th Symposium on Simulation Technique in Hannover, vol. 123, 2006.

[22] P. Bastian, M. Blatt, A. Dedner, et al., "A generic grid interface for parallel and adaptive scientific computing. Part I: abstract framework," Computing, vol. 82, no. 2-3, pp. 103–119, 2008.

[23] J. E. Stone, D. Gohara, G. Shi, "OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems," IEEE Des. Test, vol. 12, pp. 66–73, May 2010.

[24] J. Dongarra, K. London, S. Moore, P. Mucci, D. Terpstra, "Using PAPI for Hardware Performance Monitoring on Linux Systems," 08 2009.

[25] "CUPTI API." https://docs.nvidia.com/cuda/cupti/r_main.html. Accessed: 2018-07-14.

[26] P. Bastian, F. Heimann, S. Marnach, "Generic implementation of finite element methods in the distributed and unified numerics environment (DUNE)," Kybernetika, vol. 46, no. 2, pp. 294–315, 2010.

[27] M. Blatt, P. Bastian, "The iterative solver template library," in International Workshop on Applied Parallel Computing, pp. 666–675, Springer, 2006.

[28] M. S. Alnæs, A. Logg, K. B. Ølgaard, M. E. Rognes, G. Wells, "Unified Form Language: A Domain-specific Language for Weak Formulations of Partial Differential Equations," ACM Trans. Math. Softw., vol. 40, pp. 9:1–9:37, Mar. 2014.

[29] M. Haidl, S. Moll, L. Klein, H. Sun, S. Hack, S. Gorlatch, "PACXXv2+RV: An LLVM-based Portable High-Performance Programming Model," in Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC, LLVM-HPC'17, (New York, NY, USA), pp. 7:1–7:12, ACM, 2017.
