
OP2-Clang: A Source-to-Source Translator Using Clang/LLVM LibTooling

G.D. Balogh1, G.R. Mudalige2, I.Z. Reguly1, S.F. Antao3, C. Bertolli3

Abstract—Domain Specific Languages or Active Library frameworks have recently emerged as an important method for gaining performance portability, where an application can be efficiently executed on a wide range of HPC architectures without significant manual modifications. Embedded DSLs such as OP2 provide an API embedded in general purpose languages such as C/C++/Fortran. They rely on source-to-source translation and code refactorization to translate the higher-level API calls to platform specific parallel implementations. OP2 targets the solution of unstructured-mesh computations, where it can generate a variety of parallel implementations for execution on architectures such as CPUs, GPUs, distributed memory clusters and heterogeneous processors, making use of a wide range of platform specific optimizations. Compiler tool-chains supporting source-to-source translation of code written in mainstream languages currently lack the capabilities to carry out such wide-ranging code transformations. Clang/LLVM's Tooling library (LibTooling) has long been touted as having such capabilities but has only been demonstrated in simple source refactoring tasks. In this paper we introduce OP2-Clang, a source-to-source translator based on LibTooling, for OP2's C/C++ API, capable of generating target parallel code based on SIMD, OpenMP, CUDA and their combinations with MPI. OP2-Clang is designed to significantly reduce maintenance, particularly making it easy to extend in order to generate new parallelizations and optimizations for hardware platforms. In this research, we demonstrate its capabilities including (1) the use of LibTooling's AST matchers together with a simple strategy that uses parallelization templates or skeletons to significantly reduce the complexity of generating radically different and transformed target code and (2) charting the challenges and solutions to generating optimized parallelizations for OpenMP, SIMD and CUDA. Results indicate that OP2-Clang produces near-identical parallel code to that of OP2's current source-to-source translator. We believe that the lessons learnt in OP2-Clang can be readily applied to developing other similar source-to-source translators, particularly for DSLs.

Index Terms—Source-to-source translation, Clang, LibTooling, CUDA, OpenMP, automatic parallelization, DSL, OP2, unstructured-mesh

*The following names are subject to trademark: LLVM™, Clang™, P100™, CUDA™.
1 G.D. Balogh and I.Z. Reguly are with the Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Hungary. [email protected], [email protected]
2 G.R. Mudalige is with the Department of Computer Science, University of Warwick, UK. [email protected]
3 C. Bertolli and S.F. Antao are with the IBM T.J. Watson Research Center, USA. Samuel.Antao@.com, [email protected]

I. INTRODUCTION

Domain Specific Languages or Active Library frameworks provide higher-level abstraction mechanisms, using which applications can be developed by scientists and engineers. Particularly when developing numerical simulation programs for parallel high-performance computing systems, these frameworks improve both programmers' productivity and application performance by separating the concerns of programming the numerical solution from its parallel implementation. As such, they have recently emerged as an important method for gaining performance portability, where an application can be efficiently executed on a wide range of HPC architectures without significant manual modifications.

Embedded DSLs (eDSLs) such as OP2 provide a domain specific API embedded in C/C++ and Fortran [1], [2]. The API appears, to the application developer, as calls to functions in a classical library. However, OP2 can then make use of extensive source-to-source translation and code refactorization to translate the higher-level API calls to platform specific parallel implementations. In the case of OP2, which provides a high-level domain specific abstraction for the solution of unstructured-mesh applications, the source-to-source translation can generate a wide range of parallel implementations for execution on architectures such as CPUs, GPUs, distributed memory clusters and emerging heterogeneous processors.
The generated code makes use of parallel programming models/extensions including SIMD-vectorization, OpenMP 3.0/4.0, CUDA, OpenACC and their combinations with MPI. Furthermore, the generated implementations apply the best platform specific optimizations, including sophisticated orchestration of parallelizations to handle data races, optimized data layout and accesses, and embedding of parameters/flags that allow the performance to be tuned at runtime and even facilitate runtime-optimizations for gaining further performance.

Compiler tool-chains supporting source-to-source translation of code written in mainstream languages such as C/C++ or Fortran currently lack the capabilities to carry out such wide-ranging code transformations. Available tool-chains such as the ROSE compiler framework [3], [4], [5] and others have suffered from a lack of adoption by both the compiler and HPC communities. This has been a major factor in limiting the wider adoption of DSLs or active libraries with a non-conventional source-to-source translator, where there is a lack of a significant community of developers under open governance to maintain it. The result is often a highly complicated tool with a narrow focus, a steep learning curve, inflexibility for easy extension to produce different target code (e.g. new optimizations, new parallelizations for different hardware, etc.) and issues with long-term maintenance. In other words, the lack of industrial-strength tooling that can be integrated in a DevOps toolchain without any changes provokes a lack of growth in the numbers of users and developers for DSLs or active libraries, which in turn is the source of the missing tools themselves, leading to a typical catch-22 or deadlock situation. The underlying motivation of the research presented in this paper is to break this deadlock by pioneering the introduction of an industrial-strength compiler toolchain, to come up with a standard (or a methodology) that can be integrated as is in most DevOps environments.

Clang/LLVM's Tooling library (LibTooling) [6] has long been touted as facilitating source-to-source translation capabilities but has only been demonstrated in simple source refactoring and code generation tasks [7], [8] or in the transformation of a very limited subset of algorithms without hardware specific optimizations [9]. In this paper we introduce OP2-Clang [10], a source-to-source translator based on Clang's LibTooling for OP2. The broader aim is to generate platform specific parallel codes for unstructured-mesh applications written with OP2. Two options to achieve this are: (1) translating programs written with OP2's C/C++ API to C++ code parallelised with SIMD, OpenMP, CUDA and their combinations with MPI, etc., which can then be compiled by conventional platform specific compilers, and (2) compiling programs written with OP2's C/C++ API to LLVM IR. The former case, which is the subject of this paper, follows OP2's current application development process and has been shown to deliver significant performance improvements. The latter case, which will be explored in future work, opens up the chance for application-driven low level optimizations which would otherwise be unavailable to the source-to-source OP2 solution or to the same application written as a generic C++ program. The development of OP2-Clang also aims to provide additional benefits, ones which are not easy to implement in OP2's current source-to-source translation layer written in Python. These include full syntax and semantic analysis of OP2 programs with improved user development tools to diagnose errors and correct them. Much of such capabilities come for free when using Clang/LLVM. In this research, we chart the development of OP2-Clang, outlining its design and architecture. More specifically we make the following contributions:

• We demonstrate how Clang's LibTooling can be used for source-to-source translations as required by OP2, where the target code includes wide-ranging code transformations such as SIMD-vectorization, OpenMP, CUDA and their combinations with MPI. Such a wide range of source-to-source target code generation has not been previously demonstrated either using LibTooling or any other Clang/LLVM compiler tool.

• We present the design, including the use of LibTooling's AST matchers together with a novel strategy that uses parallelization templates or skeletons to reduce the complexity of the translator. The use of skeletons allows us to significantly reduce the code that needs to be generated from scratch and helps modularize the translator to be easily extensible with new parallelizations.

• OP2-Clang is used for generating parallel code for two unstructured mesh applications written using the OP2 API: (1) an industrial representative Airfoil [11] CFD application and (2) a production-grade tsunami simulation application called Volna [12], [13]. The performance of the generated parallelizations is compared to code generated by OP2's current source-to-source translator, demonstrating identical performance on a number of multi-core/many-core systems.

The work described in this paper is the first step towards an application-driven full stack optimizer toolchain based on Clang/LLVM. The rest of this paper is organized as follows. In Section II we briefly introduce the motivation driving this work, including eDSLs such as OP2 and limitations of current source-to-source translation software. In Section III we chart the design and development of OP2-Clang. In Section IV we illustrate the ease of extending the tool, charting the case for the SIMD-vectorization and CUDA code generators. In Section V we evaluate the performance of the generated parallel code and compare the results to the code generated by the current OP2 source-to-source translator. Section VI details conclusions and future work.

II. BACKGROUND AND MOTIVATION

Source-to-source transformations and code generation have been used in many previous frameworks, particularly as a means to help application developers write programs for new hardware. Some of the earliest were motivated by the emergence of NVIDIA GPUs for scientific computations. Williams et al. [4], [5] showed that with proper annotations on source written in the MINT programming model [14], the ROSE compiler tool-chain [3] can be used to generate transformations to utilize GPUs. ROSE was also previously explored as a source-to-source translator tool-chain in the initial stages of the OP2 project [15]. However, it proved to be hard to maintain and required a lot of coding effort to change the generated code or to adopt new parallelization models. Ueng et al. with CUDA-lite [16] showed that the memory usage of existing annotated CUDA codes can be optimized with source-to-source tools based on the Phoenix compiler infrastructure. Other notable work includes translators such as O2G [17], based on the Cetus compiler framework [18], which is designed to perform source-to-source transformations based on static data dependence analysis, and the hiCUDA [19] programming model, which uses a set of directives to transform C/C++ applications to CUDA using the front-end of the GNU C/C++ compiler. Another notable source-to-source transformation tool is Scout [20], which uses LLVM/Clang to vectorize loops using SIMD instructions at source level. The source-to-source transformation entails replacing expressions by their vectorized intrinsic counterparts; other parallelizations are not supported. To achieve this, Scout modifies the AST of the program directly. The tool's capabilities have been applied to production CFD codes at the German Aerospace Centre.

There are two main issues with the above works: (1) difficulties in extending them to generate new parallelizations or to generate multiple target parallel codes and (2) the underlying source-to-source translation tool relying on unmaintained software technologies due to their lack of adoption by the community. Both these issues make them difficult to use in eDSLs such as OP2, which has so far relied on tools written in Python to carry out the translation of higher level API statements to their optimized platform specific versions. In fact, Python has been and continues to be used in related DSLs and active library frameworks for target code generation. These include OPS [21], Firedrake [22], OpenSBLI [23], Devito [24] and others. However, the tools written in Python significantly lack the robustness of compiler frameworks such as Clang/LLVM or GNU. They do only limited syntax or semantic checking, have very limited error/bug reporting to ease development, and become complicated very quickly when adding different optimizations, which in turn affects their maintainability and extensibility. The present work is motivated by the need to overcome these deficiencies and aims at making use of Clang/LLVM's LibTooling for full code analysis and synthesis.

Previous work with LibTooling includes [9], [25], which demonstrate its use for translation of annotated C/C++ code to CUDA using the MINT programming model. While OP2-Clang's goals are similar, in this paper we demonstrate the use of LibTooling for not only generating CUDA code, but other parallelizations such as OpenMP, SIMD and MPI. We also demonstrate how OP2-Clang can be extended with ease for new parallelizations for OP2 and apply the tool on industry representative applications.

A. OP2

The OP2 DSL is the second version of the Oxford Parallel Library for Unstructured mesh Solvers (OPlus), aimed at expressing and automatically parallelising unstructured mesh computations. While the first version was a classical software library, facilitating MPI parallelization, OP2 can be called an "active" library, or an embedded DSL.

The library provides an API that abstracts the solution of unstructured mesh computations. Its key components consist of (1) a mesh made up of a number of sets (such as edges and vertices), (2) connections between sets (e.g. an edge is connected to two vertices), (3) data defined on sets (such as coordinates on vertices, pressure/velocity on a cell centre) and (4) computations over a given set in the mesh. The abstraction allows the declaration of various static unstructured meshes, while extensions to it are currently planned to support mesh adaptivity.

A user initially sets up a mesh and hands all the data and metadata to the library using the OP2 API. The API appears as a classical software API embedded in C/C++ or Fortran. Any data handed to OP2 can only be accessed subsequently via these API calls. Essentially, OP2 makes a private copy of the data internally and restructures its layout in any way that it sees fit to obtain the best performance for the target platform. Once the mesh is set up, computations are expressed as a parallel loop over a given set, applying a "computational kernel" at each set element that uses data that is accessed either directly on the iteration set, or via at most one level of indirection. The type of access is also described - read, write, read-write or associative increment. The algorithms that can be solved by OP2 are restricted to ones in which the order in which elements are executed may not affect the end result to within machine precision. Additionally, users may pass mesh-invariant data to the elemental computational kernel and also carry out reductions.

This formulation of parallel loops was intentionally constructed so that it lends itself to only a handful of computational and memory access patterns: (1) directly accessed, (2) indirectly read and (3) indirectly written/incremented. Due to this high level description, OP2 is free to parallelise these loops, selecting the best implementation and optimizations for a given architecture. In other words, the abstraction allows OP2 to generate code tailored to context.

The various parallelizations and the performance of production applications using OP2 have been published previously in [26], [13], demonstrating near-optimal performance on a wide range of architectures including multi-core CPUs, GPUs, clusters of CPUs and clusters of GPUs. The generated parallelizations make use of an even larger range of programming models such as OpenMP, OpenMP4.0, CUDA, OpenACC, their combinations with MPI and even simultaneous heterogeneous execution.

OP2 was one of the earliest high-level abstraction frameworks to apply this development method to production applications [26]. Related frameworks for unstructured mesh applications include FEniCS [27], Firedrake [28], [29] and PyFR [30]. Other frameworks targeting a different domain include OPS [31], Devito [32], STELLA [33] (and its successor GridTools) and PSyclone [34]. In comparison, Kokkos [35] and RAJA [36] rely on C++ templates to provide a thin portability layer for parallel loops. As such, the abstraction is not as high (or focused to a narrower domain) as OP2 and no code generation is carried out. OP2 is able to generate code that includes sophisticated orchestration of parallel executions, such as various colouring strategies (to handle data races), and modifications to the elemental kernel that the high-level application developer would not have to implement themselves. In comparison, a Kokkos or RAJA developer would have to manually implement such changes to their code, as the abstraction is at the parallel loop level. However, the advantage is that Kokkos and RAJA are able to handle a wider range of domains.

The OP2 API was constructed to make it easy for a parsing phase to extract the relevant information about each loop that describes which computation and memory access patterns will be used - this is required for code generation aimed at different architectures and parallelizations. Figure 1 shows the declaration of the mesh and the subsequent definition of an OP2 parallel loop.

/* ----- elemental kernel function in res.h ----- */
void res(const double *edge,
         double *cell0, double *cell1){
  // Computations, such as:
  *cell0 += *edge; *cell1 += *edge;
}

/* ----- in the main program file ----- */
// Declaring the mesh with OP2
// sets
op_set edges = op_decl_set(numedge, "edges");
op_set cells = op_decl_set(numcell, "cells");
// mappings - connectivity between sets
op_map edge2cell = op_decl_map(edges, cells,
                               2, etoc_mapdata, "edge2cell");
// data on sets
op_dat p_edge = op_decl_dat(edges,
                            1, "double", edata, "p_edge");
op_dat p_cell = op_decl_dat(cells,
                            4, "double", cdata, "p_cell");

// OP2 parallel loop declaration
op_par_loop(res, "res", edges,
    op_arg_dat(p_edge, -1, OP_ID,     4, "double", OP_READ),
    op_arg_dat(p_cell,  0, edge2cell, 4, "double", OP_INC),
    op_arg_dat(p_cell,  1, edge2cell, 4, "double", OP_INC));

Figure 1: Specification of an OP2 parallel loop

In this example, the loop is over the set of edges in a mesh, carrying out the computation per edge (which can be viewed as an elemental kernel) defined in the function res, accessing the data on edges p_edge directly and updating the data held on the two cells adjacent to an edge indirectly via the mapping edge2cell. The op_arg_dat provides all the details of how an op_dat's data is accessed in the loop. Its first argument is the op_dat, followed by its indirection index, the op_map used to access the data indirectly, the arity of the data in the op_dat and the type of the data. The final argument is the access mode of the data: read only, increment and others (such as read/write and write only, not shown here).

The op_par_loop call contains all the necessary information about the computational loop to perform the parallelization. It is clear that, due to the abstraction, the parallelization depends only on a handful of parameters, such as the existence of indirectly accessed data or reductions in the loop, plus the data access modes that lend themselves to optimizations.

The fact that only a few parameters define the parallelization means that, in the case of two computational loops, the generated parallel loops share the same lines of code with only small diverging code sections. The identical chunks of code in the generated parallel loops can be considered invariant to the transformation, or boilerplate code that should be generated into every parallel implementation without change. However, given that these sections largely define the structure of the generated code, they can be viewed as an important blueprint of the target code to be generated. This leads us to the idea of using a parallel implementation (with the invariant chunks) of a dummy loop and carrying out the code generation process as a refactoring or modification of this parallel loop. In other words, we use the dummy parallel loop as a skeleton (or template) and modify it to generate the required candidate computational loop. For example, Figure 2 and Figure 3 illustrate partial parallel skeletons we can extract from the generated OpenMP implementation for direct and indirect loops.

// elemental kernel function
void skeleton(double * __restrict__ d){}

void op_par_loop_skeleton(char const *name,
                          op_set set,
                          op_arg arg0){
  // number of arguments
  int nargs = 1;
  op_arg args[1] = {arg0};

  /* ------ Invariant code ------ */
  int exec_size =
      op_mpi_halo_exchanges(set, nargs, args);
  #pragma omp parallel for
  for(int n = 0; n < exec_size; n++){
    if(n == set->core_size)
      op_mpi_wait_all(nargs, args);
    /* ---------------------------- */

    // set up pointers, call elemental kernel
    skeleton(&((double *)arg0.data)[2 * n]);
  }
}

Figure 2: Skeleton for OpenMP (excerpt) - direct kernels

// elemental kernel function
void skeleton(double * __restrict__ d){}

void op_par_loop_skeleton(char const *name,
                          op_set set,
                          op_arg arg0){
  // number of arguments
  int nargs = 1; op_arg args[1] = {arg0};
  int ninds = 1; op_arg inds[1] = {0};

  /* ------ Invariant code ------ */
  int set_size =
      op_mpi_halo_exchanges(set, nargs, args);
  op_plan *Plan = op_plan_get(name, set, 256, nargs,
                              args, ninds, inds);
  int block_offset = 0;
  for(int col = 0; col < Plan->ncolors; col++){
    if(col == Plan->ncolors_core)
      op_mpi_wait_all(nargs, args);
    int nblocks = Plan->ncolblk[col];
    #pragma omp parallel for
    for(int blockIdx = 0; blockIdx < nblocks; blockIdx++){
      int blockId = Plan->blkmap[blockIdx + block_offset];
      int nelem = Plan->nelems[blockId];
      int offset_b = Plan->offset[blockId];
      for(int n = offset_b; n < offset_b + nelem; n++){
        /* ---------------------------- */
        int map0idx = arg0.map_data[n * arg0.map->dim + 0];
        // set up pointers, call elemental kernel
        skeleton(&((double *)arg0.data)[2 * map0idx]);
      }
    }
  }
}

Figure 3: Skeleton for OpenMP (excerpt) - indirect kernels

In a direct loop all iterations are independent from each other and as such the parallelization of such a loop does not have to worry about data races. However, in indirect loops there is at least one op_dat that is accessed using an indirection, i.e. via an op_map. Such indirections occur when the op_dat is not declared on the set over which the loop is iterating. In this case the op_map is used to look up the index of the set element on which to (read or write, depending on the access mode) the data. This essentially leads to an indirect access.

With indirect loops we have to ensure that no two threads write to the same data location at the same time. This is handled through the invariant code responsible for the ordering of the loop iterations in Figure 3. In this case OP2 orchestrates the execution of iterations using a colouring scheme [37].

Note how for both direct and indirect loops the skeleton includes calls to MPI halo exchanges using op_mpi_halo_exchanges() to facilitate distributed memory parallelism together with OpenMP. OP2 implements distributed memory parallelization as a classical library in the back-end. As the computation requires values from neighbouring mesh elements during a distributed memory parallel run, halo exchanges are needed to carry out the computation over the mesh elements in the boundary of each MPI partition. Additionally, data races over MPI are handled by redundant computation over the halo elements [38].

[Figure 4 is a block diagram, summarized here textually: an unstructured mesh problem is written as an OP2 application (C/C++ API); OP2-Clang processes it in (Phase 1) AST analysis and data collection and (Phase 2) target specific code generators (SIMD, OpenMP, MPI, CUDA, new targets), producing a modified OP2 application plus target specific optimized application files; these are compiled by a conventional compiler with the appropriate compiler flags (e.g. icc, nvcc, pgcc, ifort, gfortran) and linked against OP2's platform specific optimized backend libraries (CUDA, MPI+CUDA, OpenMP, MPI, new targets).]
Figure 4: The high-level architecture of OP2-Clang and its place within OP2

III. CLANG LIBTOOLING FOR OP2 CODE GENERATION

The idea of modifying target parallelization skeletons forms the basis for the design of OP2-Clang in this work. The alternative would require full target source generation from the information given in an op_par_loop. The variations to be generated for each parallelization and optimization, with such a technique, would have made the code generator prohibitively laborious to develop and even more problematic to extend and maintain. The skeletons simply allow us to reuse code and allow the code generator to concentrate on the parts which need to be customized for each loop, optimization, target architecture and so on. As such, we use multiple skeletons for each parallelization. In most cases one skeleton for direct and one for indirect kernels are used, given the considerable differences between direct and indirect loops as shown in Figure 2 and Figure 3. The aim, as mentioned before, is to avoid significant structural transformation.

Clang's Tooling library (LibTooling) provides a convenient API to perform a range of varying operations over source code.

// elemental kernel function
void skeleton(double * __restrict__ d){}

void op_par_loop_skeleton(char const *name,
                          op_set set,
                          op_arg arg0){
  // number of arguments
  int nargs = 1;
  op_arg args[1] = {arg0};

  /* ------ Invariant code ------ */
  int exec_size =
      op_mpi_halo_exchanges(set, nargs, args);
  for(int n = 0; n < exec_size; n++){
    if(n == set->core_size)
      op_mpi_wait_all(nargs, args);
    /* ---------------------------- */

    // prepare indirect accesses
    int map0idx = arg0.map_data[n*arg0.map->dim+0];

    // set up pointers, call kernel
    skeleton(&((double *)arg0.data)[2 * map0idx]);
  }
  // invariant code
  ...
}

Figure 5: Parallelization skeleton for MPI (excerpt)

// elemental kernel function
void res(double * __restrict__ edge,
         double * __restrict__ cell0,
         double * __restrict__ cell1){
  *cell0 += *edge; *cell1 += *edge;
}

void op_par_loop_res(char const *name, op_set set,
                     op_arg arg0, op_arg arg1,
                     op_arg arg2){
  // number of arguments
  int nargs = 3;
  op_arg args[3] = {arg0, arg1, arg2};

  /* ------ Invariant code ------ */
  int exec_size =
      op_mpi_halo_exchanges(set, nargs, args);
  for(int n = 0; n < exec_size; n++){
    if(n == set->core_size)
      op_mpi_wait_all(nargs, args);
    /* ---------------------------- */

    // prepare indirect accesses
    int map0idx = arg1.map_data[n*arg1.map->dim+0];
    int map1idx = arg1.map_data[n*arg1.map->dim+1];

    // set up pointers, call kernel
    res(&((double *)arg0.data)[n],
        &((double *)arg1.data)[2 * map0idx],
        &((double *)arg1.data)[2 * map1idx]);
  }
  // invariant code
  ...
}

Figure 6: Generated MPI parallelization (excerpt)

As such, its capabilities lend very well to the tasks of refactoring and source code modification of a parallelization skeleton in OP2. The starting point of source-to-source translation in OP2-Clang is to make use of Clang to parse and generate an Abstract Syntax Tree (AST) of the code with OP2 API calls. The generated AST is used to collect the required information about the sets, maps and data, including their types and sizes, and the information in each of the parallel loops that make up the application. The second phase involves transforming the skeletons with the information for each parallel loop. The two phases of code generation, and where they fit into the overall architecture of OP2, are illustrated in Figure 4. The output of OP2-Clang will be compiled by conventional compilers, linking with OP2's platform specific libraries for a given parallelization to produce the final parallel executable. Each parallel version is generated separately.

A. OP2-Clang Application Processor

The first phase of OP2-Clang is responsible for collecting all data about the parallel loops, or kernel calls, used in the application. This step will also do semantic checks based on the types of the op_dats, particularly checking whether the declared op_dats match the types declared in the op_par_loop. Such checks are not currently implemented, but will be very straightforward to add given that all the information is present in the AST. The data collection and model creation happen along the OP2 API calls. As the parser builds the AST, with the help of ASTMatchers [39], OP2-Clang keeps a record of the kernel calls, global constants and global variable declarations that are also accessible inside kernels.

During the data collection, this phase also carries out a number of modifications to the application level files that contain the OP2 API calls, essentially replacing the op_par_loop calls with the function calls that implement the specific parallel implementations, as sketched below. Of course, these implementations are yet to be generated in the second phase of code generation.
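As a concrete illustration, the rewrite performed on the application file can be sketched as the following before/after pair. This is a sketch based on the loop of Figure 1 and the op_par_loop_res signature shown in Figure 6, not verbatim OP2-Clang output:

/* Before phase 1: generic OP2 API call in the application */
op_par_loop(res, "res", edges,
    op_arg_dat(p_edge, -1, OP_ID,     4, "double", OP_READ),
    op_arg_dat(p_cell,  0, edge2cell, 4, "double", OP_INC),
    op_arg_dat(p_cell,  1, edge2cell, 4, "double", OP_INC));

/* After phase 1: the call is redirected to the loop-specific
   function that phase 2 will generate (cf. Figure 6); the kernel
   function pointer argument is dropped, since the generated
   implementation calls res directly */
op_par_loop_res("res", edges,
    op_arg_dat(p_edge, -1, OP_ID,     4, "double", OP_READ),
    op_arg_dat(p_cell,  0, edge2cell, 4, "double", OP_INC),
    op_arg_dat(p_cell,  1, edge2cell, 4, "double", OP_INC));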

B. OP2-Clang Target Specific Code Generators

The second phase is responsible for generating the target-specific optimized parallel implementations for each of the kernels whose details are now collected within OP2-Clang. Given the information of each of the op_par_loops, the parallelization we are generating code for (currently one of OpenMP, CUDA, SIMD or MPI) and whether the loop is a direct or an indirect loop, a target skeleton can be selected.

The target code generation then is a matter of changing the selected skeleton at appropriate places, using the properties of the candidate op_par_loop for which we are going to generate a parallelization. For example, consider the op_par_loop in Figure 1. This loop is an indirect loop with three arguments, one of which is a directly accessed op_dat and two of which are indirectly accessed op_dats. The loop iterates over the set of edges and the elemental computation kernel is given by res. Figure 5 shows the simplest skeleton in OP2-Clang, the skeleton for generating MPI parallelizations, and Figure 6 shows the specific code that needs to be generated by changing the skeleton for the above loop. The elemental kernel function needs to be set to res and the number of arguments set to three, while the indirections are handled by using the mappings specified for those arguments; finally, the elemental kernel should be called by passing in the appropriate arguments.

This type of transformation rhymes well with Clang's RefactoringTool [40], [41], which can apply replacements to the source code based on the AST. As such, the process of modifying the skeleton first creates the AST of the skeleton by running it through Clang. Then the AST is searched for points of interest (i.e. points where the skeleton needs to be modified). The search is again done using ASTMatchers, which in principle are descriptions of AST nodes of interest. For example, to set the specific elemental kernel function from skeleton to res in Figure 5, OP2-Clang needs to find the function call in the AST (part of which is given in Figure 7) using the matcher given in Figure 8. The definition of the matchers is trivial, given that the skeleton is an input to the above process.

FunctionDecl op_par_loop_skeleton
...
`-CallExpr 'void'
  |-DeclRefExpr 'void (double *)' lvalue
  |    Function 'skeleton'
  `-UnaryOperator 'double *' prefix '&'
    `-ArraySubscriptExpr 'double' lvalue
      (...)

Figure 7: Elemental function call in the AST of the skeleton

StatementMatcher functionCallMatcher =
    callExpr(
        callee(functionDecl(hasName("skeleton"),
                            parameterCountIs(1))),
        hasAncestor(
            functionDecl(
                hasName("op_par_loop_skeleton"))))
        .bind("function_call");

Figure 8: ASTMatcher to match the AST node for the elemental function call

const CallExpr *match =
    Result.Nodes.getNodeAs<CallExpr>("function_call");
SourceLocation begin = match->getLocStart();
SourceLocation end = match->getLocEnd();
SourceRange matchRange(begin, end);
std::string replacementStr =
    "res(&((double *)arg0.data)[n],"
    "&((double *)arg1.data)[2 * map0idx],"
    "&((double *)arg1.data)[2 * map1idx]);";
Replacement replacement(*Result.SourceManager,
                        CharSourceRange(matchRange, false),
                        replacementStr);

Figure 9: Example of creating a Replacement to replace the elemental function call. The Result object is the MatchFinder::MatchResult object created by the match of the ASTMatcher from Figure 8.

To formulate all such modifications to the skeleton, we create a set of matchers and run them through the OP2RefactoringTool, which is derived from the base class RefactoringTool in LibTooling. When OP2RefactoringTool with the specific collection of matchers is run over the AST of the skeleton, the matchers find the AST nodes of interest, create a MatchResult object containing all the information for the given match, and invoke a callback function with the MatchResult, where we can identify which matcher found the match by its keys. The identified matcher provides the AST node of interest and the corresponding source location. This allows us to generate the specific replacement code in a Replacement [42], just as shown in Figure 9. The replacement is not applied immediately, but is collected in a map. Once all the AST nodes of interest are matched and the replacement strings collected, together with the source locations, they can be applied to the source code. At this point the collection of Replacements is checked to ascertain that the replacements are independent from each other, so that they can be applied to the source without errors. This finalizes and commits the changes to the skeleton.

As can be seen, the process does not do large structural transformations. All the changes on the skeleton can be formulated as replacements of single lines or small source ranges. The adoption of skeletons leads to another implicit benefit. As the code generation requires the AST of the skeleton to be built, the skeleton must be valid C/C++ code, which, combined with the fact that we apply relatively small modifications to the code, implies that the generated code can also be guaranteed to be valid C/C++. Any errors can only come from the small chunks of generated code, which is easy for the OP2-Clang developer to debug. This is a significant benefit during the development of a new code generator. One of the key difficulties of OP2's current Python-based translator is the lack of support for such tasks to ascertain that valid C/C++ code is generated.
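To make the overall flow concrete, the following is a minimal, self-contained sketch of a RefactoringTool-based driver in the style described above, written against a 2018-era LibTooling API (e.g. Clang 6). The matcher is the one from Figure 8, while the handler class, the tool name and the replacement string are illustrative and not OP2-Clang's actual source:

// op2_sketch.cpp - illustrative only, not the OP2-Clang implementation.
#include <map>
#include <string>

#include "clang/ASTMatchers/ASTMatchFinder.h"
#include "clang/Tooling/CommonOptionsParser.h"
#include "clang/Tooling/Refactoring.h"
#include "llvm/Support/Error.h"

using namespace clang;
using namespace clang::ast_matchers;
using namespace clang::tooling;

// Callback invoked for every match of the skeleton's elemental kernel
// call; it queues a textual Replacement instead of editing the AST.
class KernelCallHandler : public MatchFinder::MatchCallback {
public:
  explicit KernelCallHandler(std::map<std::string, Replacements> &R)
      : Repls(R) {}

  void run(const MatchFinder::MatchResult &Result) override {
    const auto *Call = Result.Nodes.getNodeAs<CallExpr>("function_call");
    if (!Call)
      return;
    CharSourceRange Range =
        CharSourceRange::getTokenRange(Call->getSourceRange());
    // In OP2-Clang the replacement text is assembled from the
    // op_par_loop information collected in phase 1.
    Replacement Repl(*Result.SourceManager, Range,
                     "res(&((double *)arg0.data)[n], /* ... */)");
    llvm::consumeError(Repls[std::string(Repl.getFilePath())].add(Repl));
  }

private:
  std::map<std::string, Replacements> &Repls;
};

static llvm::cl::OptionCategory Category("op2-clang-sketch");

int main(int argc, const char **argv) {
  CommonOptionsParser Options(argc, argv, Category);
  RefactoringTool Tool(Options.getCompilations(),
                       Options.getSourcePathList());

  KernelCallHandler Handler(Tool.getReplacements());
  MatchFinder Finder;
  Finder.addMatcher(                       // the matcher of Figure 8
      callExpr(callee(functionDecl(hasName("skeleton"),
                                   parameterCountIs(1))),
               hasAncestor(functionDecl(hasName("op_par_loop_skeleton"))))
          .bind("function_call"),
      &Handler);

  // runAndSave checks the collected Replacements for independence and
  // applies them to the skeleton copy on disk.
  return Tool.runAndSave(newFrontendActionFactory(&Finder).get());
}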

IV. EXTENSIBILITY AND MODULARITY

The underlying aim of OP2 is to create a performance portable application written using the OP2 API. A further objective is to "future-proof" the higher-level science application, where only the code generation needs to be extended for the application to be translated to execute on new hardware platforms. As such, the target code generation using OP2's Python-based translator currently supports a range of parallelizations with optimizations tailored to the underlying hardware to gain near-optimal performance. The new OP2-Clang source-to-source translator will also need to support generating all these parallelizations and optimizations, but must also be easy to extend to implement new parallelizations and optimizations. In this section we present further evidence of OP2-Clang's capabilities for modular and extensible code generation for a number of architectures, carrying out different optimizations.

Since the data collection phase is independent from the target code generation, OP2-Clang only needs to add a new target-specific code generator to support generating code for a new platform or to use a new parallel programming model. As we have seen, a target-specific code generator consists of (1) a parallel skeleton (usually one skeleton for implementing direct loops and one for indirect loops), (2) a list of matchers that identify AST nodes of interest and (3) a corresponding list of Replacements that specify the changes to the code.

While the skeletons already modularize and enable reuse of code, the matchers and the corresponding Replacements can also be reused. Thus, when developing a new code generator we reuse existing ASTMatchers and Replacements as required. Only the matchers and Replacements that do not exist need to be created from scratch.

While Clang's set of ASTMatchers is extensive, there are some AST nodes that do not have the specialized matchers we required. For example, there were currently no matchers to match nodes representing OpenMP constructs. We had to extend the list of available matchers with a single matcher to match the omp parallel for pragma in order to be able to perform our translation conveniently; a sketch of such an extension is shown below. We plan to work with the community to deliver these extensions back to the mainstream Clang repository.
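The following is a minimal sketch of what such a matcher extension can look like, assuming Clang's OMPParallelForDirective AST node and following the declaration style of the stock node matchers; it is not the exact matcher shipped with OP2-Clang:

#include "clang/AST/StmtOpenMP.h"
#include "clang/ASTMatchers/ASTMatchers.h"

namespace clang {
namespace ast_matchers {

// Matches '#pragma omp parallel for' statements, declared in the same
// style as the node matchers that ship with ASTMatchers.
const internal::VariadicDynCastAllOfMatcher<Stmt, OMPParallelForDirective>
    ompParallelForDirective;

} // namespace ast_matchers
} // namespace clang

// Usage in a MatchFinder, e.g.:
//   Finder.addMatcher(ompParallelForDirective().bind("omp_pf"), &Handler);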
In this section we look at several of the other challenges that had to be overcome when supporting the various code generators for OP2.

A. CUDA

The CUDA parallelization presented a number of challenges in code generation. Again, much of the code generation using a skeleton followed a similar process to that of the OpenMP parallelization. However, the skeleton was larger and required considerably more replacements. Nevertheless, the steps taken to do the replacements were the same. For CUDA it was necessary to accurately set the device pointers and create a CUDA kernel call that encapsulates the elemental function. OP2 handles the data movement between the device and the host in the backend, copying the data to device arrays and updating host arrays if necessary; therefore this aspect does not affect the code generation.

The main challenge with CUDA was implementing a number of optimizations that significantly impact performance on GPUs, unlike the previous parallelizations that target CPUs. First among these is memory accesses: the memory access patterns of CPUs are not optimal for GPUs. For example, to gain coalesced memory accesses on GPUs, OP2 can restructure the data arrays to make neighbouring threads read data next to each other in memory. On CPUs, with large caches, it is beneficial to organize data in an Array of Structs (AoS) layout, which maximizes data reuse. However, on GPUs, the threads perform the same operation on consecutive set elements at the same time. Organizing the data such that the data needed by consecutive threads is next to each other is more beneficial; in this case we can read one cache line and use all of the data in it. This data layout is called a Structure of Arrays (SoA) layout. To use a SoA layout, OP2-Clang needs to change the indexing inside the elemental function to use a strided access pattern; a generic sketch of this indexing change is given after Figure 10. The generated elemental function for executing res in CUDA is illustrated in Figure 10.

__constant__ int direct_res_stride_OP2CONSTANT;
__constant__ int opDat1_res_stride_OP2CONSTANT;

__device__ void res_calc_gpu(
    const double * __restrict edge,
    double * __restrict cell0,
    double * __restrict cell1){
  cell0[0 * opDat1_res_stride_OP2CONSTANT] +=
      edge[0 * direct_res_stride_OP2CONSTANT];
  cell1[0 * opDat1_res_stride_OP2CONSTANT] +=
      edge[0 * direct_res_stride_OP2CONSTANT];
}

Figure 10: The modified version of the elemental function res, using strided memory accesses with the SoA data layout.
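The indexing change itself can be summarized as follows. This is an illustrative sketch, not OP2's layout code; here stride stands for the padded set size held in the *_stride_OP2CONSTANT symbols of Figure 10:

/* Reading component j of set element n from an op_dat with dim
   components per element. */
static inline double aos_read(const double *data, int n, int j, int dim) {
  return data[n * dim + j];    /* AoS: element-major, cache-friendly on CPUs */
}
static inline double soa_read(const double *data, int n, int j, int stride) {
  return data[n + j * stride]; /* SoA: consecutive threads (consecutive n)
                                  touch consecutive addresses, giving
                                  coalesced accesses on GPUs */
}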
Another way to improve the memory access patterns in CUDA is to modify the colouring strategy used for indirect kernels. The colouring in the OpenMP parallelization is done such that no two threads with the same colour write to the same data locations; iterations with the same colour can then be run in parallel. Applying this strategy to CUDA means that thread blocks need to be coloured, and no two thread blocks will write to the same data. However, for CUDA, previous work [43] has shown that a further level of colouring gives better performance. In this case the threads within a thread block are also coloured to avoid data races. This two level colouring is called hierarchical colouring. Hierarchical colouring has been shown to considerably improve data locality and data reuse inside CUDA blocks. The difference in the CUDA kernel function between the two colouring strategies is shown in Figures 11 and 12. The variations in the code to be generated can simply be captured again with a different skeleton, in this case a skeleton that does the hierarchical colouring. However, the required Replacements, including the data layout transformations (AoS to SoA), can be reused for both the one-level colouring and hierarchical colouring skeletons.

...
// CUDA kernel function
__global__ void op_cuda_res(
    const double *__restrict arg0,
    double *ind_arg0, double *ind_arg1,
    const int *__restrict opDat0Map,
    int start, int end,
    const int * __restrict col_reord,
    int set_size){
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  if(tid + start >= end)
    return;
  int n = col_reord[tid + start];
  int map0idx = opDat0Map[n + set_size * 0];
  int map1idx = opDat0Map[n + set_size * 1];

  res_calc_gpu(arg0 + n,
               ind_arg0 + map0idx * 1,
               ind_arg1 + map1idx * 1);
}
...

Figure 11: CUDA kernel with global colouring (excerpt)

...
// CUDA kernel function
__global__ void op_cuda_res(const double *__restrict arg0,
    double *ind_arg0, double *ind_arg1,
    const int *__restrict opDat0Map,
    int block_offset, int *blkmap,
    int *offset, int *nelems,
    int *ncolors, int *colors,
    int nblocks, int set_size){
  __shared__ int nelems2, ncolor, nelem, offset_b;
  extern __shared__ char shared[];

  if(blockIdx.x >= nblocks) return;

  double arg1_l[1] = {0.0}, arg2_l[1] = {0.0};
  if(threadIdx.x == 0){
    int blockId = blkmap[blockIdx.x + block_offset];
    nelem = nelems[blockId];
    offset_b = offset[blockId];
    ncolor = ncolors[blockId];
  }
  __syncthreads();

  int col2 = -1, n = threadIdx.x;
  if(n < nelem){
    res_calc_gpu(arg0 + offset_b + n,
                 arg1_l, arg2_l);
    col2 = colors[n + offset_b];
  }
  for(int col = 0; col < ncolor; col++){
    if(col2 == col){
      int map0idx = opDat0Map[n + offset_b + set_size * 0];
      int map1idx = opDat0Map[n + offset_b + set_size * 1];
      ind_arg0[0 + map0idx * 1] += arg1_l[0];
      ind_arg1[0 + map1idx * 1] += arg2_l[0];
    }
    __syncthreads();
  }
}
...

Figure 12: CUDA kernel with hierarchical colouring (excerpt)

B. SIMD vectorization

A more involved code generation task is required for SIMD vectorization on CPUs. For vectorization, OP2 attempts to parallelise over the iteration set of the loop [44]. The idea is to generate code that will be auto-vectorized when compiled using a conventional C/C++ compiler such as icpc. Figure 13 illustrates the code that needs to be generated by OP2-Clang to achieve vectorization for our example loop res. There are two key differences here compared to the non-vectorized code in Figure 6: (1) the use of gathers/scatters when indirect increments are applied and (2) the use of a modified elemental function in the vectorized loop. The first is motivated by the multiple iterations (equivalent to the SIMD vector length of the CPU) that are carried out simultaneously. In this case we need to be careful when indirect writes are performed. In each iteration we perform a gather of the required data to local arrays, then perform the computation on the local copies, and then perform a scatter to write back the updated values. The gathers and the execution of the kernel are vectorized with #pragma omp simd; the scatter cannot be vectorized due to data races and so is executed serially. Finally, the remainder of the iteration set needs to be completed.

...
// vectorized elemental function
inline void res_vec(const double edge[*][SIMD_VEC],
                    double cell0[*][SIMD_VEC],
                    double cell1[*][SIMD_VEC], int i){
  cell0[0][i] += edge[0][i];
  cell1[0][i] += edge[0][i];
}
...

#pragma novector
for(int n = 0; n < (exec_size/SIMD_VEC)*SIMD_VEC;
    n += SIMD_VEC){
  double arg0_p[1][SIMD_VEC];
  double arg1_p[1][SIMD_VEC];
  double arg2_p[1][SIMD_VEC];

  // gather data to local variables
  #pragma omp simd
  for(int i = 0; i < SIMD_VEC; i++){
    arg0_p[0][i] = (ptr0)[idx0_2 + 0];
    arg1_p[0][i] = 0.0;
    arg2_p[0][i] = 0.0;
  }
  // vectorized elemental function call
  #pragma omp simd
  for(int i = 0; i < SIMD_VEC; i++){
    res_vec(arg0_p, arg1_p, arg2_p, i);
  }
  // scatter indirect increments
  for(int i = 0; i < SIMD_VEC; i++){
    int map0idx = arg1.map_data[(n + i) *
                                arg1.map->dim + 0];
    int map1idx = arg1.map_data[(n + i) *
                                arg1.map->dim + 1];
    ((double *)arg1.data)[2 * map0idx] += arg1_p[0][i];
    ((double *)arg2.data)[2 * map1idx] += arg2_p[0][i];
  }
}
// remainder loop
for(int n = (exec_size/SIMD_VEC)*SIMD_VEC;
    n < exec_size; n++){
  int map0idx = arg1.map_data[n * arg1.map->dim + 0];
  int map1idx = arg1.map_data[n * arg1.map->dim + 1];
  res(&((double *)arg0.data)[n],
      &((double *)arg1.data)[2 * map0idx],
      &((double *)arg1.data)[2 * map1idx]);
}
...

Figure 13: Vectorized loop for res (excerpt)

Much of the code in Figure 13 can be generated as discussed previously. However, now we also need to modify the internals of the elemental function res to produce a vectorizable elemental kernel res_vec. The function signature needs to be changed as illustrated, which in turn requires modifications to the data accesses inside the function body (i.e. the computational kernel). In order to perform these transformations we introduced a further layer of refactoring which parses the elemental function and transforms it to the vectorized version. Since the elemental function consists only of the kernel that each iteration of the loop performs, the scope of these transformations is limited. Even the indexing of the arrays is done in the generated code that calls the elemental function. To perform the necessary changes to the elemental function, the function itself is passed through Clang to obtain its AST; matchers are used to identify the AST nodes in the function signature and replace them with the correct array subscript (e.g. *edge is changed to edge[*][SIMD_VEC]). Again, ASTMatchers are used to identify AST nodes within the elemental kernel, replacing them with the variable with array subscripts (edge[0] is changed to edge[0][i]); simple dereferences are replaced with indexing into the [..][SIMD_VEC] arrays. The matches and replacements here differ only by the fact that we are now modifying the elemental kernel itself and not a skeleton, as the sketch below illustrates.
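Putting the two replacement steps together, the transformation of the elemental kernel can be sketched as the following before/after pair (illustrative, mirroring res_vec in Figure 13; SIMD_VEC is the vector length chosen by the generator):

/* Before: elemental kernel as written by the user (Figure 1) */
void res(const double *edge, double *cell0, double *cell1){
  *cell0 += *edge; *cell1 += *edge;
}

/* After: each pointer parameter becomes a [..][SIMD_VEC] array and
   each data access gains the SIMD lane index i */
inline void res_vec(const double edge[][SIMD_VEC],
                    double cell0[][SIMD_VEC],
                    double cell1[][SIMD_VEC], int i){
  cell0[0][i] += edge[0][i];
  cell1[0][i] += edge[0][i];
}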

CUDA CUDA CUDA Global CUDA Global SIMD OpenMP Hierarchical Hierarchical (AoS) (SoA) (AoS) (SoA) 363.92 s 70.417 s 12.77 s Airfoil 9.58 s (-0.4%) 9.85 s (0.2%) 7.30 s (1.8%) (-0.2%) (1.2%) (-0.6%) 95.39 s 14.84 s Volna 3.00 s (0.5%) 2.33 s (0.2%) 2.32 s (1.2%) 1.97 s (1.1%) (0.3%) (-0.2%) Table I: Performance of Airfoil and Volna on the Intel Xeon E5-1660 CPU (for OpenMP and SIMD) and on an NVIDIA P100 GPU with OP2-Clang . CUDA results with two different colourings (global and hierarchical) and two data layouts (AoS and SoA) presented. The values in parenthesis are the percentage difference in run time compared to the sources generated with OP2’s current Python-based source-to-source translator (negative values means OP2-Clang has better performance). discussed previously. However, now we also need to difficult and cannot easily handle different ways a user modify the internals of the elemental function res to might write their elemental kernels. All of the above produce a vectorizable elemental kernel res_vec. The makes the Python translator error prone and difficult to function signature needs to be changed as illustrated, extend and maintain. In this section we present some which in turn requires modifications to the data accesses results from evaluating the OP2-Clang translator on two inside the function body (i.e. the computational kernel). OP2 applications by comparing code generated from it In order to perform these transformations we intro- to the performance of the code generated through OP2’s duced a further layer of refactoring which parses the current Python based translator. elemental function and transform it to the vectorized ver- Both of the applications used in these tests have sion. Since the elemental function consists of the kernel loops with indirections and indirect increments, various that each iteration of the loop performs, the scope of global reductions and use of global constants. The first, these transformations is limited. Even the indexing of the Airfoil, is a benchmark application, representative of arrays are done in the generated code that calls the ele- large industrial CFD applications utilized by users of mental function. To perform the necessary changes to the OP2. It is a non-linear 2D inviscid airfoil code that uses elemental function, the function itself is passed through an unstructured grid and a finite-volume discretization Clang to obtain its AST, matchers are used to identify the to solve the 2D Euler equations using a scalar numerical AST nodes in the function signature and replace them dissipation [11]. The mesh used in our experiments edge with the correct array subscript (e.g * is changed consists over 2.8 million nodes cells and about 5.8 edge[ ][SIMD_VEC] ASTMatchers to * ). Again are million edges. Airfoil consists of five computational used to identify AST nodes within the elemental kernel, loops and in the most computational intensive loop replacing them with the variable with array subscripts about 100 floating-point operations performed per mesh edge[0] edge[0][i] ( is changed to ). Simple deref- edge. The second application, Volna is a shallow water [0][simd_vec] erences are replaced with indexing. simulation capable of handling the complete life-cycle The match and replacements here are only different by of a tsunami (generation, propagation and run-up along the fact that we are now modifying the elemental kernel the coast) [12], [13]. The simulation algorithm works on itself and not a skeleton. 
unstructured triangular meshes and uses the finite volume V. EVALUATION AND PERFORMANCE method. For the experiments we used a mesh with about In contrast to the OP2-Clang translator, OP2’s cur- 2.4 million cells and about 3.5 million edges. rent Python based translator only parses the application The generated code were compiled and executed on source so far as to identify OP2 API calls. However, a single Intel Xeon CPU E5-1660 node (total of 8 no AST is created as a result, but simply the specifi- cores) for OpenMP and SIMD vectorization (using Intel cations and arguments in each of the op_par_loop compilers suite 17.0.3.) and a single NVIDIA P100 GPU calls are collected and stored in Python lists. When with CUDA 9.0. TableI shows the performance results it comes to generating code, the full source of what and percentage difference of runtime compared to OP2’s is to be generated is produced, using the information current Python-based translator. In all cases the perfor- gathered in these lists, for each of the parallelizations. mance difference is less than 2%. This figure was the No text replacements are done as in the OP2-Clang same when comparing run-times of each kernel. These translator. However, as the invariant code for a given results, therefore gives an initial indication that identical parallelization can be generated without change, only the code was generated by OP2-Clang. We are currently specific changes for a given op_par_loop needs to be carrying out further tests to identify any differences in produced. Again, the code generation stage does not use performance. As mentioned the absolute performance of an AST. As such, changing code within elemental ker- production applications using OP2 has been published nels (as in the SIMD Vectorization case) is significantly previously in [26], [45]. 11

VI. CONCLUSIONS AND FUTURE WORK

In this paper we introduced OP2-Clang, a source-to-source translator based on LibTooling, for OP2. OP2-Clang is capable of parsing a higher-level declarative programme written in OP2's C/C++ API and generating parallel code based on SIMD, OpenMP, CUDA and their combinations with MPI. The wide range of transformations required for generating code for each parallelization in OP2, and the variation in specific optimizations for each, are significant, going well beyond what has been previously demonstrated with LibTooling. We presented the use of parallelization skeletons to reuse code and demonstrated the use of LibTooling's ASTMatchers and Replacements to modify a skeleton to generate the necessary parallel code. Multiple levels of refactoring using LibTooling's RefactoringTool enable us to apply specific optimizations in a flexible, maintainable and extensible manner. Challenges in developing OP2-Clang were presented, discussing the generation of MPI, OpenMP and SIMD vectorized code for CPUs and CUDA code for GPUs. The OP2-Clang generated code showed near-identical performance to the code generated by OP2's current source-to-source translator (based on Python). We believe that the lessons learnt from OP2-Clang can be readily applied in developing similar source-to-source translators, particularly for DSLs.

The next stage of this work will add semantic checks over the generated code at the time of the translation to help developers debug their OP2 applications. At this point OP2-Clang will be ready to replace the existing translator in OP2. Longer term objectives of this research will look into how information about the high-level OP2 application can be propagated down to the IR level. The open research question is whether we can propagate information that we know holds true, given the OP2 abstraction, to the LLVM optimizer. This will then enable LLVM to do optimizations that it would not have identified or attempted if the same application were developed as a general purpose programme using conventional C++. This, we hope, will enable us to enforce more aggressive optimizations, resulting in higher performance. Additionally, we hope to generalize the optimizations we discover while working with OP2-Clang to all C++ programs supported by Clang/LLVM. We also believe that this work can be extended for OP2's Fortran API, if a similar infrastructure is developed around the upcoming Flang [46] front end.

The full source of OP2-Clang is available as open source software at [10]. OP2 and the Airfoil application are available at [1]. Volna is available at [45]. The authors welcome new users and developers to these projects.

ACKNOWLEDGEMENTS

The authors would like to acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work http://dx.doi.org/10.5281/zenodo.22558. István Reguly was supported by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences. Project no. PD 124905 has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the PD_17 funding scheme. The research has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.2-16-2017-00013). OP2 was developed with funding from the UK Technology Strategy Board and Rolls-Royce Plc. through the Siloet project and the UK Engineering and Physical Sciences Research Council projects EP/I006079/1, EP/I00677X/1 on 'Multi-layered Abstractions for PDEs'.

REFERENCES

[1] "OP2 github repository." https://github.com/OP2/OP2-Common.
[2] G. Mudalige, M. Giles, I. Reguly, C. Bertolli, and P. Kelly, "Op2: An active library framework for solving unstructured mesh-based applications on multi-core and many-core architectures," in Innovative Parallel Computing (InPar), 2012, pp. 1-12, IEEE, 2012.
[3] D. J. Quinlan et al., "Rose compiler project," 2012.
[4] P. Yang, F. Dong, D. Williams, J. Roerdink, B. Liu, A. Anvari-Moghaddam, G. Min, et al., "Improving utility of gpu in accelerating industrial applications with user-centred automatic code translation," IEEE Transactions on Industrial Informatics, 2017.
[5] D. Williams, V. Codreanu, P. Yang, B. Liu, F. Dong, B. Yasar, B. Mahdian, A. Chiarini, X. Zhao, and J. B. Roerdink, "Evaluation of autoparallelization toolkits for commodity gpus," in International Conference on Parallel Processing and Applied Mathematics, pp. 447-457, Springer, 2013.
[6] "Libtooling," 2018.
[7] E. Bendersky, "Modern source-to-source transformation with clang and libtooling," 2014.
[8] J. Wu, A. Belevich, E. Bendersky, M. Heffernan, C. Leary, J. Pienaar, B. Roune, R. Springer, X. Weng, and R. Hundt, "gpucc: an open-source gpgpu compiler," in Proceedings of the 2016 International Symposium on Code Generation and Optimization, pp. 105-116, ACM, 2016.
[9] M. Marangoni and T. Wischgoll, "Togpu: Automatic source transformation from c++ to cuda using clang/llvm," Electronic Imaging, vol. 2016, no. 1, pp. 1-9, 2016.
[10] "OP2-Clang github repository." https://github.com/bgd54/OP2-Clang.
[11] M. B. Giles, G. R. Mudalige, Z. Sharif, G. Markall, and P. H. Kelly, "Performance analysis and optimization of the op2 framework on many-core architectures," The Computer Journal, vol. 55, no. 2, pp. 168-180, 2011.
[12] D. Dutykh, R. Poncet, and F. Dias, "The volna code for the numerical modeling of tsunami waves: Generation, propagation and inundation," European Journal of Mechanics-B/Fluids, vol. 30, no. 6, pp. 598-615, 2011.
[13] I. Z. Reguly, D. Gopinathan, J. H. Beck, M. B. Giles, S. Guillas, and F. Dias, "The volna-op2 tsunami code (version 1.0)," Geoscientific Model Development Discussions, 2018.
[14] D. Unat, X. Cai, and S. B. Baden, "Mint: realizing cuda performance in 3d stencil methods with annotated c," in Proceedings of the international conference on Supercomputing, pp. 214-224, ACM, 2011.

[15] C. Bertolli, A. Betts, G. Mudalige, M. Giles, and P. Kelly, "Design and performance of the op2 library for unstructured mesh applications," in European Conference on Parallel Processing, pp. 191-200, Springer, 2011.
[16] S.-Z. Ueng, M. Lathara, S. S. Baghsorkhi, and W. H. Wen-mei, "Cuda-lite: Reducing gpu programming complexity," in International Workshop on Languages and Compilers for Parallel Computing, pp. 1-15, Springer, 2008.
[17] S. Lee, S.-J. Min, and R. Eigenmann, "Openmp to gpgpu: a compiler framework for automatic translation and optimization," ACM Sigplan Notices, vol. 44, no. 4, pp. 101-110, 2009.
[18] S.-I. Lee, T. A. Johnson, and R. Eigenmann, "Cetus - an extensible compiler infrastructure for source-to-source transformation," in International Workshop on Languages and Compilers for Parallel Computing, pp. 539-553, Springer, 2003.
[19] T. D. Han and T. S. Abdelrahman, "hicuda: a high-level directive-based language for gpu programming," in Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, pp. 52-61, ACM, 2009.
[20] O. Krzikalla, K. Feldhoff, R. Müller-Pfefferkorn, and W. E. Nagel, "Scout: A source-to-source transformator for simd-optimizations," in Proceedings of the 2011 International Conference on Parallel Processing - Volume 2, Euro-Par'11, (Berlin, Heidelberg), pp. 137-145, Springer-Verlag, 2012.
[21] I. Z. Reguly, G. R. Mudalige, M. B. Giles, D. Curran, and S. McIntosh-Smith, "The ops domain specific abstraction for multi-block structured grid computations," in Proceedings of the 2014 Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, WOLFHPC '14, pp. 58-67, IEEE Computer Society, 2014.
[22] F. Rathgeber, D. A. Ham, L. Mitchell, M. Lange, F. Luporini, A. T. T. McRae, G.-T. Bercea, G. R. Markall, and P. H. J. Kelly, "Firedrake: Automating the finite element method by composing abstractions," ACM Trans. Math. Softw., vol. 43, pp. 24:1-24:27, Dec. 2016.
[23] C. T. Jacobs, S. P. Jammy, and N. D. Sandham, "OpenSBLI: A framework for the automated derivation and parallel execution of finite difference solvers on a range of computer architectures," Journal of Computational Science, vol. 18, pp. 12-23, 2017.
[24] M. Lange, N. Kukreja, M. Louboutin, F. Luporini, F. Vieira, V. Pandolfo, P. Velesko, P. Kazakas, and G. Gorman, "Devito: Towards a generic finite difference dsl using symbolic python," in Proceedings of the 6th Workshop on Python for High-Performance and Scientific Computing, PyHPC '16, (Piscataway, NJ, USA), pp. 67-75, IEEE Press, 2016.
[25] N. Jacobsen, "Llvm supported source-to-source translation - translation from annotated c/c++ to cuda c/c++," Master's thesis, 2016.
[26] I. Z. Reguly, G. R. Mudalige, C. Bertolli, M. B. Giles, A. Betts, P. H. Kelly, and D. Radford, "Acceleration of a full-scale industrial cfd application with op2," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 5, pp. 1265-1278, 2016.
[27] K. B. Ølgaard, A. Logg, and G. N. Wells, "Automated Code Generation for Discontinuous Galerkin Methods," CoRR, vol. abs/1104.0628, 2011.
[28] F. Rathgeber, D. A. Ham, L. Mitchell, M. Lange, F. Luporini, A. T. T. McRae, G.-T. Bercea, G. R. Markall, and P. H. J. Kelly, "Firedrake: Automating the Finite Element Method by Composing Abstractions," ACM Transactions on Mathematical Software, 2017.
[29] C. T. Jacobs and M. D. Piggott, "Firedrake-Fluids v0.1: numerical modelling of shallow water flows using an automated solution framework," Geoscientific Model Development, vol. 8, no. 3, pp. 533-547, 2015.
[30] P. Vincent, F. Witherden, B. Vermeire, J. S. Park, and A. Iyer, "Towards green aviation with python at petascale," in SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-11, Nov 2016.
[31] I. Z. Reguly, G. R. Mudalige, M. B. Giles, D. Curran, and S. McIntosh-Smith, "The ops domain specific abstraction for multi-block structured grid computations," in 2014 Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, pp. 58-67, Nov 2014.
[32] M. Lange, N. Kukreja, M. Louboutin, F. Luporini, F. Vieira, V. Pandolfo, P. Velesko, P. Kazakas, and G. Gorman, "Devito: Towards a generic finite difference dsl using symbolic python," pp. 67-75, IEEE, 2016.
[33] T. Gysi, C. Osuna, O. Fuhrer, M. Bianco, and T. C. Schulthess, "Stella: A domain-specific tool for structured grid methods in weather and climate models," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, (New York, NY, USA), pp. 41:1-41:12, ACM, 2015.
[34] "PSyclone Project - GitHub Repository," 2018. https://github.com/stfc/PSyclone.
[35] H. Carter Edwards, C. R. Trott, and D. Sunderland, "Kokkos," J. Parallel Distrib. Comput., vol. 74, pp. 3202-3216, Dec 2014.
[36] R. D. Hornung and J. A. Keasler, "The RAJA portability layer: Overview and status," tech. rep., Lawrence Livermore National Lab. (LLNL), 9 2014.
[37] M. B. Giles, G. R. Mudalige, B. Spencer, C. Bertolli, and I. Reguly, "Designing op2 for gpu architectures," Journal of Parallel and Distributed Computing, vol. 73, no. 11, pp. 1451-1460, 2013.
[38] G. Mudalige, M. Giles, J. Thiyagalingam, I. Reguly, C. Bertolli, P. Kelly, and A. Trefethen, "Design and initial performance of a high-level unstructured mesh framework on heterogeneous parallel systems," Parallel Computing, vol. 39, no. 11, pp. 669-692, 2013.
[39] "Matching the clang ast." https://clang.llvm.org/docs/LibASTMatchers.html, 2018.
[40] "Refactoringtool class reference."
[41] E. Bendersky, "Ast matchers and clang refactoring tools," 2014.
[42] "Replacements class reference," 2018.
[43] A. A. Sulyok, G. D. Balogh, I. Z. Reguly, and G. R. Mudalige, "Improving locality of unstructured mesh algorithms on gpus," CoRR, vol. abs/1802.03749, 2018.
[44] G. R. Mudalige, I. Z. Reguly, and M. B. Giles, "Auto-vectorizing a large-scale production unstructured-mesh cfd application," in Proceedings of the 3rd Workshop on Programming Models for SIMD/Vector Processing, WPMVP '16, (New York, NY, USA), pp. 5:1-5:8, ACM, 2016.
[45] "OP2-Volna github repository." https://github.com/reguly/volna.
[46] "Flang: a fortran compiler targeting llvm," 2018.