ClangJIT: Enhancing C++ with Just-in-Time Compilation

Hal Finkel
Leadership Computing Facility
Argonne National Laboratory
Lemont, IL 60439, USA

David Poliakoff
Sandia National Laboratories
Albuquerque, NM 87123, USA

Jean-Sylvain Camier and David F. Richards
Lawrence Livermore National Laboratory
Livermore, CA 94550, USA

Abstract—C++ is not only a keystone of the high-performance-computing ecosystem but has proven to be a successful base for portable parallel-programming frameworks. As is well known, C++ programmers use templates to specialize algorithms, thus allowing the compiler to generate highly-efficient code for specific parameters, data structures, and so on. This capability has been limited to those specializations that can be identified when the application is compiled, and in many critical cases, compiling all potentially-relevant specializations is not practical. ClangJIT provides a well-integrated C++ language extension allowing template-based specialization to occur during program execution. This capability has been implemented for use in large-scale applications, and we demonstrate that just-in-time-compilation-based dynamic specialization can be integrated into applications, often requiring minimal changes (or no changes) to the applications themselves, providing significant performance improvements, programmer-productivity improvements, and decreased compilation time.

Index Terms—C++, Clang, LLVM, Just-in-Time, Specialization

I. INTRODUCTION AND RELATED WORK

The C++ programming language is well-known for its design doctrine of leaving no room below it for another portable programming language [1]. As compiler technology has advanced, however, it has become clear that using just-in-time (JIT) compilation to provide runtime specialization can provide performance benefits practically unachievable with purely ahead-of-time (AoT) compilation. Some of the benefits of runtime specialization can be realized using aggressive multiversioning, both manual and automatic, along with runtime dispatch across these pre-generated code variants (this technique has been applied for many decades; e.g., [2]). This technique, however, comes with combinatorial compile-time cost, and as a result, is practically limited. Moreover, most of this compile-time cost from multiversioning is wasted when only a small subset of the specialized variants are actually used (as is often the case in practice). This paper introduces ClangJIT, an extension to the Clang C++ compiler which integrates just-in-time compilation into the otherwise-ahead-of-time-compiled C++ programming language. ClangJIT allows programmers to take advantage of their existing body of C++ code, but critically, to defer the generation and optimization of template specializations until runtime using a relatively-natural extension to the core C++ programming language.

A significant design requirement for ClangJIT is that the runtime-compilation process not explicitly access the file system - only loading data from the running binary is permitted - which allows for deployment within environments where file-system access is either unavailable or prohibitively expensive. In addition, this requirement maintains the redistributability of binaries using the JIT-compilation features (i.e., they can run on systems where the compiler is unavailable). For example, on large HPC deployments, especially on supercomputers with distributed file systems, extreme care is required whenever many simultaneously-running processes might access many files on the file system, and ClangJIT should elide these concerns altogether.

Moreover, as discussed below, ClangJIT achieves another important design goal of maximal, incremental reuse of the state of the compiler. As is well known, compiling C++ code, especially when many templates are involved, can be a slow process. ClangJIT does not start this process afresh for the entire translation unit each time a new template instantiation is requested. Instead, it starts with the state generated by the AoT compilation process, and from there, adds to it incrementally as new instantiations are required. This minimizes the time associated with the runtime compilation process.

The hypothesis of this work is that, for production-relevant C++ libraries used for this kind of specialization, incremental JIT compilation can produce performance benefits while simultaneously decreasing compilation time. The initial evaluations presented in Section V confirm this hypothesis, and future work will explore applicability and benefits in more detail.
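To make the AoT-only alternative concrete, the following sketch (illustrative only; the kernel and function names are hypothetical and do not come from ClangJIT or the benchmarks below) shows the traditional multiversioning pattern the introduction describes: a handful of specializations are instantiated ahead of time and selected by a runtime dispatch, and any parameter value outside the pre-built set falls back to a generic version.

// A minimal sketch of AoT multiversioning plus runtime dispatch, the
// pre-ClangJIT technique described above. All names are hypothetical.
template <int N>
void kernel_fixed(double *data) {
  for (int i = 0; i < N; ++i)   // trip count known at compile time,
    data[i] *= 2.0;             // so the compiler can fully unroll
}

void kernel_generic(double *data, int n) {
  for (int i = 0; i < n; ++i)   // trip count known only at runtime
    data[i] *= 2.0;
}

void kernel(double *data, int n) {
  switch (n) {                  // runtime dispatch across the pre-built variants
  case 4:  kernel_fixed<4>(data);  break;
  case 8:  kernel_fixed<8>(data);  break;
  case 16: kernel_fixed<16>(data); break;
  default: kernel_generic(data, n); break;  // any other size stays generic
  }
}

Every additional size or type combination covered this way adds to compile time whether or not it is ever used, which is exactly the cost ClangJIT aims to avoid.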
A. Related Work

Clang, and thus ClangJIT, are built on top of the LLVM compiler infrastructure [3]. The LLVM compiler infrastructure has been specifically designed to support both AoT and JIT compilation, and has been used to implement JIT compilers for a variety of languages, both general purpose (e.g., Java [4], Haskell [5], Lua [6], Julia [7]) and domain specific (e.g., TensorFlow/XLA [8]). In addition, C++ libraries implemented using LLVM to provide runtime specialization for specific domains are not uncommon (e.g., TensorFlow/XLA, Halide [9]).

Several existing projects have been developed to add dynamic capabilities to the C++ programming language [10]. A significant number of these rely on running the compiler as a separate process in order to generate a shared library, and that shared library is then dynamically loaded into the running process (e.g., [11], [12]). Unfortunately, not only do such systems require being able to execute a compiler with access to the relevant source files at runtime, but careful management is required to constrain the cost of the runtime compilation processes, because each time the compiler is spawned it must start processing its inputs afresh. NVIDIA's CUDA Toolkit contains NVRTC [13], allowing the compilation of CUDA code provided as a string at runtime. The Easy::JIT project [14] adds a limited JIT-based runtime-specialization capability for C++ lambda functions to Clang, but this specialization is limited to parameter values, because the types are fixed during AoT compilation (a fork of Easy::JIT known as atJIT [15] adds some autotuning capabilities). LambdaJIT [16] also provides for the dynamic compilation of lambda functions, and also their automated parallelization for GPUs.
NativeJIT [17] provides an in-process JIT for a subset of C for the x86-64 architecture. The closest known work to ClangJIT is CERN's Cling [18] project. Cling also implements a JIT for C++ code using a modified version of Clang, but Cling's goals are very different from the goals of ClangJIT. Cling effectively turns C++ into a JIT-compiled scripting language, including a REPL interface. ClangJIT, on the other hand, is designed for high-performance, incremental compilation of template instantiations using only information embedded in the hosting binary. Cling provides a dynamic compiler for C++ code, while ClangJIT provides a language extension for embedded JIT compilation, and as a result, the two serve different use cases and have significantly-different implementations.

Compared to previous work, ClangJIT embodies the following technical advances: First, ClangJIT provides a C++ JIT-compilation feature that naturally integrates with the C++ language, thus providing a "single source" solution, while still allowing for dynamic type creation. Previous work relies on providing C++ source code directly to the JIT-compilation engine. Second, ClangJIT provides a fully-self-contained JIT-compilation solution - no external files are needed during program execution. Third, ClangJIT provides a fast lookup mechanism: in most cases, no underlying Clang components need to be invoked to find previously-JIT-compiled code.

The remainder of this paper is structured as follows: Section II discusses the syntax and semantics of ClangJIT's language extension, Section III describes the implementation of ClangJIT's ahead-of-time-compilation components, Section IV describes the implementation of ClangJIT's runtime components, Section V contains initial evaluation results, we discuss future work in Section VI, and the paper concludes in Section VII.

II. THE LANGUAGE EXTENSION

A key design goal for ClangJIT is natural integration with the C++ language while making JIT compilation easy to use. A user can enable JIT-compilation support in the compiler simply by using the command-line flag -fjit. Using this flag, both when compiling and when linking, is all that should be necessary to make using the JIT-compilation features possible. By itself, however, the command-line flag does not enable any use of JIT compilation. To do that, function templates can be tagged for JIT compilation by using the C++ attribute [[clang::jit]]. An attributed function template provides for additional features and restrictions. These features are:

• Instantiations of this function template will not be constructed at compile time; rather, calling a specialization of the template, or taking the address of a specialization of the template, will trigger the instantiation and compilation of the template at runtime.

• Non-constant expressions may be provided for the non-type template parameters, and these values will be used at runtime to construct the type of the requested instantiation. See Listing 1 for a simple example.

• Type arguments to the template can be provided as strings. If the argument is implicitly convertible to a const char *, then that conversion is performed, and the result is used to identify the requested type. Otherwise, if an object is provided, and that object has a member function named c_str(), and the result of that function can be converted to a const char *, then the call and conversion (if necessary) are performed in order to get a string used to identify the type. The string is parsed and analyzed to identify the type in the declaration context of the parent of the function triggering the instantiation. Whether types defined after the point in the source code that triggers the instantiation are available is not specified. See Listing 2 for a demonstration of this functionality, with Listing 3 showing some example output.

Listing 1: A JIT "hello world".

#include <iostream>
#include <cstdlib>

template <int x>
[[clang::jit]] void run() {
  std::cout << "Hello, World, I was compiled at runtime, "
               "x = " << x << "\n";
}

int main(int argc, char *argv[]) {
  int a = std::atoi(argv[1]);
  run<a>();
}

Listing 2: A JIT example demonstrating using strings as types.

#include <iostream>
#include <string>

struct F {
  int i;
  double d;
};

template <typename T, int S>
struct G {
  T arr[S];
};

template <typename T>
[[clang::jit]] void run() {
  std::cout << "I was compiled at runtime, sizeof(T) = "
            << sizeof(T) << "\n";
}

int main(int argc, char *argv[]) {
  std::string t(argv[1]);
  run<t>();
}

Listing 3: Output from the strings-as-types example in Listing 2.

$ clang++ -O3 -fjit -o /tmp/jit-t /tmp/jit-t.cpp
$ /tmp/jit-t '::F'
I was compiled at runtime, sizeof(T) = 16
$ /tmp/jit-t 'F'
I was compiled at runtime, sizeof(T) = 16
$ /tmp/jit-t 'float'
I was compiled at runtime, sizeof(T) = 4
$ /tmp/jit-t 'double'
I was compiled at runtime, sizeof(T) = 8
$ /tmp/jit-t 'size_t'
I was compiled at runtime, sizeof(T) = 8
$ /tmp/jit-t 'std::size_t'
I was compiled at runtime, sizeof(T) = 8
$ /tmp/jit-t 'G<...>'
I was compiled at runtime, sizeof(T) = 80

There are a few noteworthy restrictions:

• Because the body of the template is not instantiated at compile time, decltype(auto) and any other type-deduction mechanisms depending on the body of the function are not available.

• Because the template specializations are not compiled until runtime, they're not available at compile time for use as non-type template arguments, etc.

Explicit specializations of a JIT function template are not JIT compiled, but rather, compiled during the regular AoT compilation process.
If, at runtime, values are specified corresponding to some explicit specialization (which will have already been compiled), the template instantiation is not recompiled, but rather, the already-compiled function is used. An exception to this rule is that a JIT template with a pointer/reference non-type template parameter which is provided with a runtime pointer value will generate a different instantiation for each pointer value. If the pointer provided points to a global object, no attempt is made to map that pointer value back to the name of the global object when constructing the new type.

III. HOW IT WORKS: AHEAD-OF-TIME COMPILATION

Implementing ClangJIT required modifying Clang's semantic-analysis and code-generation components in non-trivial ways. Clang's parsing and semantic analysis was extended to allow the [[clang::jit]] attribute to appear on declarations and definitions of function templates. The most-significant modifications were to the code in Clang which determines whether a given template-argument list can be used to instantiate a given function template. When the function template in question has the [[clang::jit]] attribute, two important changes were made: For a candidate non-type template argument (e.g., an expression of type int), the candidate is allowed to match without the usual check to determine if constant evaluation is possible. For a candidate template argument that should be a type, if the candidate is instead a non-type argument, logic was added to check for conversion to const char *, first by calling a c_str() method if necessary, and if the conversion is possible, the candidate is allowed to match. Each time a JIT function template is instantiated, the instantiation is assigned a translation-unit-unique integer identifier which will be used during code generation and by the runtime library.

In order for the runtime library to instantiate templates dynamically, it requires a saved copy of Clang's abstract-syntax tree (AST), which is the internal data structure on which template instantiation is performed. Fortunately, Clang already contains the infrastructure for serializing and deserializing its AST, and moreover, embedding compressed copies of the input source files along with the AST, as part of the implementation of its modules feature. The embedded source files are important so that, should an error occur during template instantiation (e.g., a static_assert is triggered), useful messages can be produced for the user. Reconstructing the parameters used for code generation (e.g., whether use of AVX-2 vector instructions is enabled when targeting the x86_64 architecture) also requires the set of command-line parameters passed by Clang's driver to the underlying compilation invocation. In addition, in order to allow JIT-compiled code to access local variables in what would otherwise be their containing translation unit, the addresses of such potentially-required variables are saved for use by the runtime library. All of these items are embedded in the compiled object file, resulting, as illustrated by Figure 1, in a "fat" object file containing both the compiled host code as well as the information necessary to resume the compilation process during program execution.

When emitting a call to a direct callee, and when getting a function pointer to a given function, Clang's code generation needs to generate a pointer to the relevant function. For JIT-compiled function-template instantiations, Clang is enhanced to generate a call to a special runtime function, __clang_jit, which returns the required function pointer. How this runtime function works will be discussed in Section IV. The runtime-dependent non-type template parameters for the particular template instantiation are packed into an on-stack structure, an on-stack array of strings representing runtime types is formed, and the addresses of these on-stack data structures are provided as arguments to the __clang_jit call. Two additional parameters are passed: first, a mangled name for the template being instantiated (with wildcard character sequences in places where the runtime values and types are used), which we call the instantiation key, used in combination with the runtime values to identify the instantiation, and second, the translation-unit-unique identifier for this particular instantiation. This translation-unit-unique identifier is used by the runtime library to look up this particular instantiation in the serialized AST.
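As an illustration of the shape of this lowering (not the actual compiler-generated code; the entry-point signature, argument layout, and all surrounding names are assumptions made here for exposition), a call site whose template has one runtime integer non-type argument and one runtime type-string argument conceptually becomes something like the following.

#include <string>

// Illustrative sketch only: __clang_jit is ClangJIT's internal runtime entry
// point; its exact signature is an assumption for exposition.
extern "C" void *__clang_jit(const void *non_type_values,   // packed runtime values
                             const char **type_strings,     // runtime type names
                             unsigned num_type_strings,
                             const char *instantiation_key, // mangled name with wildcards
                             unsigned tu_unique_id);        // index into the serialized AST

// Conceptually, a call such as kernel<n, t>(), where n is a runtime int and
// t is a runtime std::string naming a type, is lowered to roughly this form:
void call_kernel(int n, const std::string &t) {
  struct { int n; } packed = { n };           // on-stack packed non-type arguments
  const char *types[] = { t.c_str() };        // on-stack array of type strings
  auto fn = reinterpret_cast<void (*)()>(
      __clang_jit(&packed, types, 1,
                  "<instantiation key>",      // placeholder; real keys are mangled names
                  /*tu_unique_id=*/0));
  fn();                                       // call the JIT-compiled specialization
}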

Fig. 1: When using JIT-compilation features, the object files produced by the compiler become "fat" object files containing the information needed at runtime. For all targets, a JIT-enabled "fat" object file embeds: the serialized AST (the preprocessor state and compressed source files, plus a binary encoding of the AST); the compilation command-line arguments (used to restore code-generation options); the optimized LLVM IR (used to allow inlining of pre-compiled functions into JIT-compiled functions); and the local symbol addresses (used to allow the JIT to look up non-exported symbols in the translation unit). With CUDA support, the object file additionally embeds, for each device architecture, a serialized AST for the device, the corresponding compilation command-line arguments, and the corresponding optimized LLVM IR.

Finally, Clang's driver code was updated so that use of the -fjit flag not only enables processing of the [[clang::jit]] attributes during compilation, but also causes Clang's implementation libraries to be linked with the application, and when producing dynamically-linked executables, -rdynamic is implied. -rdynamic is used so that the runtime library can use the loader's exported dynamic symbol table to find external symbols from all translation units comprising the program (the alternative would require essentially duplicating this information in the array of local variables passed to the runtime library).

A. CUDA Support

In order to support JIT compilation of CUDA kernels for applications targeting NVIDIA GPUs, ClangJIT includes support for the CUDA programming model built on Clang's native CUDA support [19]. The primary challenge in supporting CUDA in ClangJIT derives from the fact that, when compiling CUDA code, the driver invokes the compiler multiple times: once to compile for the host architecture and once for each targeted GPU architecture (e.g., sm_35). At runtime, not only must the state of the host-targeting compiler invocation be reconstructed, but also the state of one of these GPU-targeting compiler invocations (whichever most closely matches the device being used at runtime). Fortunately, Clang's CUDA compilation workflow compiles code for the GPUs first, and only once all GPU-targeting compilation is complete is the compiler targeting the host invoked. When JIT compilation and CUDA are both enabled, the driver creates a temporary LLVM bitcode file, and each GPU-targeting compilation saves the serialized AST, command-line parameters, and some additional metadata into a set of global variables in this bitcode file. The implementation takes advantage of LLVM's appending-linkage feature so that each GPU-targeting invocation can easily add pointers to its state data to an array of similar entries from all GPU-targeting invocations within the bitcode file. When the host-targeting compilation takes place, the bitcode file is loaded and linked into the LLVM module Clang is constructing, and the address of the relevant global variable (which contains pointers to all of the other device-compilation-relevant global variables) becomes an argument to calls to __clang_jit. As illustrated by Figure 1, all of this information ends up embedded in the host object file.

IV. HOW IT WORKS: RUNTIME COMPILATION

ClangJIT's runtime library is large, mainly because it includes all of Clang and LLVM, but it has only one entry point used by the compiler-generated code: __clang_jit. This function is used to instantiate function templates at runtime. The ClangJIT-specific parts of the runtime library are approximately two thousand lines of code, and an outline of the implementation is provided in Algorithm 1. The runtime library has a program-global cache of generated instantiations, but otherwise keeps separate state for each translation unit making use of JIT compilation. This per-translation-unit state is largely composed of two parts: first, a Clang compiler instance holding the AST and other data structures, and second, an in-memory LLVM IR module containing externally-available definitions. This LLVM IR module is initially populated by loading the optimized LLVM IR for the translation unit that is stored in the running binary, and marking all definitions with externally-available linkage, thus allowing the definitions to be analyzed and inlined into newly-generated code.
When the __clang_jit function is called to retrieve a pointer to a requested template instantiation, the instantiation is looked up in the cache of instantiations. This cache uses LLVM's DenseMap data structure, along with a mutex, and while these constructs are designed to have high performance, this lookup can have noticeable overhead. If the instantiation does not exist, as outlined in Algorithm 1, the instantiation is created along with any other new dependencies (e.g., a static function not otherwise used in the translation unit might now need to be emitted along with the requested instantiation). A new LLVM IR module is created by Clang, and that LLVM IR module is merged with the module containing the externally-available definitions. This combined module is optimized and JIT compiled. The newly-generated LLVM IR is then also merged, with externally-available linkage, into the module containing externally-available definitions for all previously-generated code. These externally-available definitions form a kind of cache used to enable inlining of previously-generated code into newly-generated code. This is important because if the incremental code generation associated with JIT compilation could not inline code generated in other stages, then the JIT-compiled code would likely be slower than the AoT-compiled code.

Algorithm 1: __clang_jit - Retrieve a JIT-Compiled Function-Template Instantiation

input: TranslationUnitData{SerializedAST, CommandLineArguments, OptimizedLLVMIR, LocalSymbolsArray}, NonTypeTemplateValues, TypeStringsArray, MangledName, InstantiationIdentifier
output: A pointer to the requested function
globals: Instantiations, TUCompilerInstances

1. if Instantiations.lookup{MangledName, NonTypeTemplateValues, TypeStringsArray} then return the cached function pointer;
2. if a compiler instance for the translation unit of the caller is not in TUCompilerInstances, then create a compiler instance using TranslationUnitData, loading the optimized LLVM IR and marking all definitions as externally available, and add the instance to TUCompilerInstances;
3. use InstantiationIdentifier to look up the instantiation, and then its parent template, in the AST;
4. form a new template-argument list using the instantiation in the AST combined with the values in NonTypeTemplateValues and the types in TypeStringsArray;
5. call Clang's internal InstantiateFunctionDefinition function to get a new AST function-definition node;
6. repeat
7.   emit all functions in Clang's deferred list which don't exist in the program to LLVM IR;
8.   for all LLVM IR declarations for symbols that don't exist in the program do
9.     look up the AST node for the declaration and call Clang's internal HandleTopLevelDecl function on the AST node;
10.  end
11. until no new functions were just emitted to LLVM IR;
12. call Clang's internal HandleTranslationUnit function to finalize the LLVM IR module being constructed;
13. mark nearly everything in the new LLVM IR module as having external linkage;
14. merge the externally-available LLVM IR definitions into the just-generated LLVM IR module;
15. optimize the new LLVM IR module and provide the result to LLVM's ORC JIT engine;
16. take the newly-generated LLVM IR, mark the definitions as externally available, and merge with the module containing the other externally-available definitions;
17. get the function pointer for the requested instantiation, store it in Instantiations, and return it.

For some use cases, the speed of finding a previously-JIT-compiled instantiation in the cache of instantiations is critical to the usability of the JIT-compilation feature. When a new instantiation is requested from the JIT-compilation runtime library, the runtime library is provided with a pointer to the instantiation key (as a null-terminated string), a pointer to a packed structure containing all runtime non-type template arguments, and, in addition, an array of pointers to null-terminated strings specifying the runtime type template arguments.
LLVM's DenseMap can be customized to allow lookup operations using a type containing the same relevant information as the key type, but different from the key type, and ClangJIT takes advantage of this feature to avoid unnecessarily copying the instantiation key and data into a temporary object solely for the purpose of performing a cache-lookup operation. Because the non-type parameters are in a packed structure, the data in the structure can be hashed and compared for equality without having to understand the types and alignment requirements of the non-type arguments. The type arguments are compared as strings without any consideration for the fact that, in C++, there are often different ways to name the same type (e.g., "G" and "::G" might refer to the same type). Should the cache lookup fail to find a matching instantiation solely because some types were specified using different, but semantically equivalent, names, this will be discovered once the new instantiation is requested and its mangled name is computed, and a pointer to the previously-compiled function will be returned in lieu of compiling the same function (and its dependencies) again.

On an Intel Haswell processor, for the simplest lookup involving a single template argument, the cache lookup takes approximately 350 cycles (approximately 140 ns). Should the cache lookup fail, but solely because some types were specified using different names, resolving the instantiation request to the previously-compiled function takes approximately 160 thousand cycles (approximately 65 µs). Compiling new instantiations takes, at the very least, tens of millions of cycles (a few milliseconds).

Clang already uses LLVM's virtual-file-system infrastructure, and this allows isolating it from any file-system access at runtime. As noted earlier, it is important to avoid JIT compilation triggering any file-system access, and so the Clang compiler instance created by the runtime library is provided only with an in-memory virtual-file-system provider. That provider only contains specific in-memory files with data from the running binary's compiler-created global variables.
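The "lookup without copying" idea can be illustrated with standard-library facilities; this is an analogy only (ClangJIT itself customizes LLVM's DenseMap rather than using std::unordered_map, and the key layout shown is a hypothetical simplification), but the mechanism is the same: the hash and equality functors also accept a lightweight view of the key, so the fast path builds no temporary key object. The example requires C++20 heterogeneous lookup for unordered containers.

#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <unordered_map>

struct InstKey {                  // owning key stored in the cache
  std::string key;                // instantiation key (mangled name with wildcards)
  std::string packed_args;        // raw bytes of the packed non-type arguments
};

struct InstKeyView {              // non-owning view used only for lookups
  std::string_view key;
  std::string_view packed_args;
};

struct InstHash {
  using is_transparent = void;    // enable heterogeneous lookup
  template <typename K>
  std::size_t operator()(const K &k) const {
    std::size_t h1 = std::hash<std::string_view>{}(k.key);
    std::size_t h2 = std::hash<std::string_view>{}(k.packed_args);
    return h1 ^ (h2 << 1);        // bytes hashed without interpreting the argument types
  }
};

struct InstEq {
  using is_transparent = void;
  template <typename A, typename B>
  bool operator()(const A &a, const B &b) const {
    return a.key == b.key && a.packed_args == b.packed_args;
  }
};

using InstantiationCache = std::unordered_map<InstKey, void *, InstHash, InstEq>;

// Look up a cached instantiation without copying the caller's key data.
void *lookup(const InstantiationCache &cache, const char *key,
             const void *packed, std::size_t packed_size) {
  InstKeyView view{key, {static_cast<const char *>(packed), packed_size}};
  auto it = cache.find(view);     // heterogeneous find (C++20); no allocation
  return it == cache.end() ? nullptr : it->second;
}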

1A packed structure means that there are no padding bytes in between 2The compilation time of a new instantiation depends heavily on the amount members. and complexity of the code being compiled. A. CUDA Support } } To support JIT compilation of CUDA code, which, in inter- esting cases, will require JIT compilation of CUDA kernels, void test a o t (const std::string &type, int size, i n t repeat) { dedicated logic exists in ClangJIT’s runtime library. First, i f (type ==”float”) the AoT-compilation might have compiled device code for t e s t a o t (size , repeat); e l s e if (type ==”double”) multiple device architectures (e.g., for sm 35 and sm 70). t e s t a o t (size , repeat); The CUDA runtime library is queried in order to determine e l s e if (type ==”long double”) t e s t a o t (size , repeat); the compute capability of the current device, and based on e l s e that, ClangJIT’s runtime selects which device compiler state c o u t << t y p e << ”not supported for AoT \n”; } to use for JIT compilation. As for the host, the serialized AST, command-line options, and optimized LLVM IR are loaded to create a device-targeting compiler instance. During Listing 5: An excerpt from the simple Eigen benchmark (JIT version). the execution of Algorithm 1, at a high level, whatever is done t e m p l a t e [[clang:: jit ]] void test j i t s z (int repeat) { with the host compiler instance is also done with the device Matrix I = Matrix:: Ones ( ) ; compiler instance. However, instead of taking the optimized IR Matrix m; f o r(int i=0;i < s i z e ; i ++) and handing it off to LLVM’s JIT engine, the device compiler f o r(int j=0;j < s i z e ; j ++) { instance is configured to run the NVPTX backend and generate m(i,j) = (i+size ∗ j ) ; } textual PTX code. This PTX code is then wrapped in an in- memory CUDA fatbin file3, and that fatbin file is provided to f o r(intr=0;r < repeat; ++r) { m = Matrix:: Ones ( ) the host compiler instance to embed in the generated module + T(0.00005) ∗ (m + (m∗m) ) ; in the usual manner for CUDA compilation. } }

V. INITIAL EVALUATION void test j i t (const std::string &type, int size, i n t repeat) { We present here an evaluation of ClangJIT on three bases: r e t u r n test j i t s z ( r e p e a t ) ; ease of use, compile-time reduction, and runtime-performance } improvement. First, we’ll discuss the performance of ClangJIT by making use of a microbenchmark4 which relies on the The Eigen library was chosen for this benchmark be- Eigen C++ matrix library [21]. Consider a simple benchmark cause the library supports matrices of both compile-time size which repeatedly calculates M = I +5×10−5(M +M 2) for a (specified as non-type template parameters) and runtime size matrix M of some N × N size. We’re interested here in cases (specific as constructor arguments). When using JIT compi- where N is small, and examples from real applications where lation, we can use the non-type-template-parameter method such small loop bounds occur will be presented later in this to specify sizes known only during program execution. First, section. Listing 4 excerpts the version of the benchmark where we’ll examine the compile-time advantages that ClangJIT dynamic matrix sizes are handled in the traditional manner. offers over both traditional AoT compilation and over the Listing 5 excerpts the version of the benchmark where code up-front compilation of numerous potentially-used template for dynamic matrix sizes is generated at runtime using JIT specializations. In Figure 2, we present the AoT compilation 5 compilation. time for this benchmark in various configurations . For all of these times, we subtracted a baseline compilation time of Listing 4: An excerpt from the simple Eigen benchmark. 2.58s - the time required to compile a trivial main() function #include with the Eigen header file included. The time identified by “J” indicates the AoT compilation time for the benchmark when u s i n g namespace std; u s i n g namespace Eigen; only the JIT-compilation-based implementation is present (i.e., that in Listing 5 plus the associated main function). The t e m p l a t e void test a o t (int size, int repeat) { time identified by “A1” indicates the compilation time for the Matrix I = generic version, compiled only using the scalar type double Matrix::Ones(size , size); Matrix m(size , size); (i.e., Listing 4 but with only the double case present). As f o r(int i=0;i < s i z e ; i ++) can be seen, this takes significantly longer than the AoT f o r(int j=0;j < s i z e ; j ++) { m(i,j) = (i+size ∗ j ) ; compilation time for the JIT-based version. Moreover, the JIT- } based version can be used with any scalar type. If we try to f o r(intr=0;r < repeat; ++r) { replicate that capability with the traditional AoT approach, m = Matrix::Ones(size , size) and thus instantiate the implementation for float, double, + T(0.00005) ∗ (m + (m∗m) ) ; and long double, as shown in Listing 4, then the AoT 3Unfortunately, the format of the fatbin files is not documented by NVIDIA, compilation time is nearly 7x larger, labeled “A3”, than using but given the information available in https://reviews.llvm.org/D8397 and in [20], we were able to create fatbins that are functional for this purpose. 5This benchmarking was conducted on a Intel Xeon E5-2699 using the 4The microbenchmark presented here is an adapted version of https://github. flags -march=native -ffast-math -O3, and using a ClangJIT build compiled com/eigenteam/eigen-git-mirror/blob/master/bench/benchmark.cpp. using GCC 8.2.0 with CMake’s RelWithDebInfo mode the JIT capability. 
The reported “J” time does omit the time Figure 4 shows the performance of the benchmark, using spent at runtime compiling the necessary specializations. Here type double, adapted to use CUDA6. The source code for we show the compilation time for different specializations the CUDA adaptation is straightforwardly derived from the (i.e., with both the size and type fixed), for a scalar type of original where the the matrix computation is moved into a double, where the size was 16 for “S16”, the size was 7 for kernel, and that kernel is executed using some number of GPU “S7”, the size was 3 for “S3”, and size was 1 for “S1”, and threads7. We present the relative performance when running where specializations for both sizes 16 and 7 were included one thread/block (to isolate the relative single-thread perfor- in the time “S16a7” (to demonstrate that the specialization mance) and when running 512 threads/block. As expected, it compilation times are roughly additive). It is useful to note is often the case that the JIT-compiled code provides an even that the specialization compilation time depends on the size, greater performance advantage at 512 threads/block than at but is always less than the single size-generic implementation one thread/block. This is because the JIT-compiled kernels, in “A1”. As the size becomes larger, the difference in the work in addition to their better serial performance, also generally the compiler must do to handle the specialization compared use fewer registers, allowing for greater GPU occupancy. As to handling the generic version shrinks. shown in Figure 4, for matrix sizes 3x3 through 7x7, the performance of the JIT-compiled GPU kernels is one to two 8 Fig. 2: AoT compilation time of the Eigen benchmark. See the text orders of magnitude better than the generic AoT-compiled for a description of the configurations presented. version9. As shown in Figure 5, on top of that, the smaller

JIT-compiled kernels used far fewer registers than the generic AoT-compiled version.

Fig. 4: Runtime performance of the Eigen benchmark using CUDA, using type double, for the indicated size. The time is normalized to the time of the JIT-compiled version. (Sizes shown: 3x3 through 7x7; quantities shown per size: AoT kernel time relative to JIT, speedup at 1 thread/block, and speedup at 512 threads/block.)

Figure 3 shows the performance of the benchmark, using type double, for several sizes. For small sizes, when the size is known, the compiler can unroll the loops and perform other optimizations to produce code that performs significantly better than the generic version. This performance is essentially identical to the performance of the AoT-generated specializations of the same size, but of course, the JIT-based version is more flexible. As the size gets larger, the code the compiler generates becomes increasingly similar to that generated for the generic version, and differences such as generated tail loops come to have a decreasingly-important performance impact, and so for large sizes little performance difference remains between the JIT-compiled code and the generic version. Note that the runtime overhead associated with the JIT-compilation process itself is not included, but is close to the "S16", "S7", and "S3" times in Figure 2.

Fig. 3: Runtime performance of the Eigen benchmark, using type double, for the indicated size. The time is normalized to the time of the JIT-compiled version. For small sizes, the JIT-compiled version is significantly faster. (Sizes shown: 3x3, 7x7, 16x16; bars: JIT-compiled, AoT specialization, AoT generic.)

Listing 6: An excerpt from the simple Eigen benchmark in Cling [18].

void test_cling(const std::string &type, int size, int repeat) {
  std::stringstream ss;

6 The CUDA benchmark was run on an IBM POWER8 host with an NVIDIA Tesla K80 GPU using CUDA 10.0.

7 A call to cudaThreadSetLimit(cudaLimitMallocHeapSize, ...) was inserted in the non-JIT implementation to allow dynamic allocation to work on the device.

8 For matrix sizes 1x1 and 2x2, the relative speedups were 650x and 332x respectively, at one thread/block, and 14,666x and 3,750x at 512 threads/block. These are not included in Figure 4 both to aid the visual presentation of the larger matrix sizes and because the larger matrix sizes are more representative of realistic problems. These do show, however, how the decreased register pressure can greatly magnify the serial performance advantages of the specialized code.

9 The compilation of the Eigen matrix type for sizes larger than 7x7 failed, because some required template specializations were not available, a problem that did not occur when compiling for the host, and so 7x7 is the largest size shown.

10 Repeating the experiment with type float shows the JIT-compiled kernels always use fewer registers than the AoT-compiled generic version. Loop unrolling and other compiler optimizations can increase register pressure, so while the specialized code can use many fewer registers, specialization is not guaranteed to reduce register usage.

number of registers used 1x1 2x2 3x3 4x4 5x5 6x6 7x7 allowing the optimizer to specialize the code for the particular JIT-compiled AoT generic index range provided. As shown in Listing 8, this can be done by wrapping the template using one marked for JIT compilation, and in doing so, allows using JIT compilation s s << ”test c l i n g s z <” << t y p e << ”,” without changing the interface, the forall template in this << s i z e << ”>(” << r e p e a t << ”)”; cling::Value V; gCling−>evaluate(ss.str(), V); illustration, used by the application. } Listing 8: A simple loop abstraction transparently using JIT compi- As a point of comparison, we implemented the benchmark lation. in Cling [18] such that it could take advantage of Cling’s t e m p l a t e void forall o r i g i n a l (int begin, int end, Body body) { JIT-compilation capabilities. In this case, the language inte- f o r(int x = begin; x optimize code as Clang does, and is based on an older version [[clang:: jit ]] void forall shim(Body body) { of LLVM compared to ClangJIT, so the JIT performance is f o r a l l original(begin, end, body); } only about 20% better than the AoT implementation when using Cling for a size of 3x3. In absolute terms, the code t e m p l a t e void forall(int begin, int end, Body body) { performance is much worse than using ClangJIT, and in ad- f o r a l l s h i m (body ) ; dition, Cling performs significantly more file input operations } (because, as an , it compiles everything from source This serves to illustrate how the JIT capability can be trans- at runtime). parently used by an application using a RAJA-like abstraction. In the following we present a case study on using JIT Doing this, however, exposes a number of tradeoffs. First, compilation transparently, and then two case studies where the JIT compilation itself takes time. Second, the process of ClangJIT was used to improve two open-source proxy applica- looking up already-compiled template instantiations also has tions developed by Lawrence Livermore National Laboratory an overhead that can be significant compared to a compiled (LLNL): Kripke and Laghos. These proxy applications make function doing very little computational work per invocation. use of the RAJA library [22], also developed at LLNL, In essentially all of the evaluations presented in this paper, it as do the corresponding production applications. Care was is this lookup overhead that is most important. To illustrate the taken to ensure that the techniques used to adapt these proxy interplay between these overheads and how a realistic RAJA applications to make use of ClangJIT can be applied to the abstraction around JIT compilation can be constructed, we production applications that they represent. Both compile- present a simple matrix-multiplication benchmark in Listing 9 time and runtime performance data was collected with the which uses the RAJA library abstraction in Listing 10. As assistance of the Caliper [23] tool. can be seen, this abstraction is more complicated than that A. RAJA and Transparent Usage in Listing 8, because it uses parameter packs to handle The RAJA library aims to abstract underlying parallel RAJA multi-dimensional ranges. This abstraction code would programming models, such as OpenMP and CUDA, allow- be placed in the RAJA library, however, thus absolving the ing portable applications to be created by making use of application developer from dealing with these complexities. 
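To show what "transparent" means for the application, the following usage sketch (hypothetical application code; it assumes the forall wrapper from Listing 8 is available) is written exactly as it would be without JIT compilation. Nothing here mentions JIT compilation: the library-internal shim alone decides that the runtime bounds become template arguments of a [[clang::jit]] template.

// Hypothetical application code using the forall abstraction from Listing 8.
void scale(double *data, int n, double factor) {
  forall(0, n, [=](int i) {   // written exactly as in an AoT-only build
    data[i] *= factor;        // the loop body is the lambda passed to forall
  });
}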
RAJA’s programming-model-neutral multi-dimensional-loop Listing 9: A simple RAJA matrix-multiplication benchmark. templates. It is also possible to hide the use of ClangJIT’s u s i n g namespace RAJA:: statement; JIT-compilation behind RAJA’s abstraction layer, allowing u s i n g RAJA : : seq exec ; applications to make use of JIT compilation without changes u s i n g MatMulPolicy = RAJA:: KernelPolicy>>>>>; #define MAT2D(r,c,size) r ∗ s i z e +c Listing 7: A simple loop abstraction. i n t main(int argc, char ∗ argv [ ] ) { B. Application: Kripke s t d : : s i z e t s i z e = . . . ; s t d : : s i z e t b a t c h s i z e = . . . ; Kripke [24] is a proxy application representing three- s t d : : s i z e t repeats = ...; dimensional deterministic transport applications. Such an ap- ...// Allocate and randomly initialize out m a t r i x, // input m a t r i x 1, input m a t r i x 2 plication has many nested loops which iterate over directions, f o r(long rep = 0; rep <(repeats/batch size); rep++){ zones, and energy groups. On different architectures, the a f f i n e j i t k e r n e l d i f f i c u l t ( camp : : make tuple ( optimal nesting order of those loops changes, as do the layouts RAJA:: RangeSegment(0 , batch s i z e ) , of many data structures within the application. Managing this RAJA:: RangeSegment(0, size ) , RAJA:: RangeSegment(0, size ) , in an abstract way pushes Kripke to use bleeding-edge features RAJA:: RangeSegment(0 , size ) from RAJA and modern C++. Unfortunately, constraints of ), [ = ] (const std::size t m a t r i c e s , const std::size t i , C++ make such abstractions unfriendly to write, and as shown c o n s t std::size t j , in Listing 11, Kripke has to maintain a lot of plumbing code c o n s t std::size t k ){ o u t matrix[MAT2D(i ,j , size )] += to select among all the possible layouts and loop-execution i n p u t matrix1[MAT2D(i ,k, size )] ∗ mechanisms based on runtime user selections. This also means i n p u t matrix2[MAT2D(k,j , size )]; } that every possible variant of the loop has to be compiled ahead ); of time. } With a JIT compiler, however, runtime user selections are trivial, and as illustrated in Listing 12, these selections can Listing 10: A RAJA abstraction to apply JIT compilation to a multi- be replaced with a much simpler dispatch mechanism, and dimensional loop kernel. the compilation of these selections happens only as needed. t e m p l a t e [[clang:: jit ]] void affine j i t k e r n e l full(Kernel&& kernel) { Moreover, removing the extra kernel variants speeds up the s t a t i c auto rs= AoT compilation of many files by 2-3x11, and leads to code camp : : make tuple(RAJA::RangeSegment(0,ends )...); RAJA:: kernel

(rs , std::forward( k e r n e l ) ) ; that is easier to write and maintain. The runtime performance } is essentially unchanged. t e m p l a t e Listing 11: An excerpt from Kripke’s ArchLayout.h and SteadyState- void affine j i t k e r n e l difficult(TupleLike IndexTuple , Solver.cpp, showing the original dispatching scheme. Kernel&& kernel) { t e m p l a t e // Details omitted of forwarding to affine j i t k e r n e l f u l l. void dispatchLayout(LayoutV layout v , F u n c t i o n const &fcn, } Args &&... args) { We present in Figure 6 performance data from around the s w i t c h(layout v ){ c a s e LayoutV DGZ: fcn(LayoutT DGZ{} , transition to profitability for various small sizes, for various std ::forward(args )...); batch sizes (which change based upon over how many invoca- b r ea k; c a s e LayoutV DZG: fcn(LayoutT DZG{} , tions the lookup of the already-compiled template instantiation std ::forward(args )...); is amortized), and for different numbers of total iterations b r ea k; ...// There are six lines, in total, like those above. (which change based upon over how much work the JIT com- d e f a u l t : KRIPKE ABORT(”Unknown layout v=%d\n”, pilation itself is amortized). As can be seen, the compilation (int)layout v ) ; b r e ak; time can be significant if not enough computational work } is performed by the compiled code (as is the case by the } 9 size = 2x2 runs with fewer than 10 total iterations), and t e m p l a t e moreover, the instantiation lookup adds significant overhead void dispatchArch(ArchV arch v , F u n c t i o n const &fcn, Args &&... args) (as can be seen by noting that the speedup of the runs with { smaller batch sizes is generally smaller than those with the s w i t c h(arch v ){ c a s e ArchV Sequential: fcn(ArchT S e q u e n t i a l {} , larger batch sizes). std ::forward(args )...); b r e ak; Fig. 6: Performance of the RAJA small-matrix-multiply benchmark. #ifdef KRIPKE USE OPENMP M c a s e ArchV OpenMP: fcn(ArchT OpenMP{} , The runs are labeled as bMiT where the batch size is 10 , and the std ::forward(args )...); total number of iterations is 10T . b r ea k; #endif 1.22 1.22 1.22 1.15 1.17 1.14 ...//A similar CUDA option. 1.2 1.08 d e f a u l t : KRIPKE ABORT(”Unknown arch v=%d\n”, 1 (int)arch v ) ; 0.83 0.83 b r e ak; 0.8 } 0.6 } 0.43

(Fig. 6 runs shown: b1i8, b3i8, b4i8, b3i9, b4i9, each for size = 2x2 and size = 8x8; y-axis: relative speedup.)

11 Specifically, the files Scattering.cpp (68% compilation-time decrease), LTimes.cpp (58% compilation-time decrease), and LPlusTimes.cpp (56% compilation-time decrease), from which specializations were removed and replaced by uses of JIT compilation, exhibited AoT-compilation-time improvements.

t e m p l a t e way in the future. void dispatch(ArchLayoutV al v , F u n c t i o n const &fcn, The changes applied to the Laghos were also applied to Args &&... args) { MARBL, verifying that the beneficial characteristics observed dispatchArch(al v . arch v , [&](auto arch t ){ in the proxy application were also observed in the correspond- DispatchHelper h e l p e r ; dispatchLayout(al v . l a y o u t v, helper, fcn, ing production application (as shown in Figure 8). std ::forward(args )...); } ); Listing 13: The manual explicit-instantiation-and-dispatch code in Laghos’s rMassMultAdd.cpp that ClangJIT makes obsolete. // Code like this appears several times to launch // the specialized kernels. c o n s t unsigned int id= Kripke:: dispatch(al v , SourceSdom {} , sdom id , (DIM< <16)|((NUM DOFS 1D−1)<<8)|(NUM QUAD 1D>>1); ...// Other parameters omitted. s t a t i c std::unordered map c a l l = } { {0x20001,&rMassMultAdd2D <1,2>}, {0x20101,&rMassMultAdd2D <2,2>}, Listing 12: An excerpt from Kripke’s SteadyStateSolver.cpp, adapted {0x20102,&rMassMultAdd2D <2,4>}, {0x20202,&rMassMultAdd2D <3,4>}, to use ClangJIT, showing how simple the dispatcher has become. {0x20203,&rMassMultAdd2D <3,6>}, t e m p l a t e {0x20303,&rMassMultAdd2D <4,6>}, void launcher(Body && body){ {0x20304,&rMassMultAdd2D <4,8>}, body(std :: string(”Kripke:: ArchLayoutT <” {0x20404,&rMassMultAdd2D <5,8>}, ”Kripke:: ArchT Sequential,” ...// There are approximately 64 lines, in total, ”Kripke:: LayoutT DGZ>”)); // like those above. // In the real application,a configuration file }; // provides this string. } assert(call[id]);// This solution is not as flexible // as using ClangJIT. ... call [id ](numElements ,dofToQuad ,dofToQuadD ,quadToDof , quadToDofD,op,x,y); // Code like this now appears several times to lauch // variousJIT −compiled kernels. launcher([&](const std:: string& ArchitectureLayout) { Kripke:: Kernel :: template scattering < Listing 14: An excerpt of the rMassMultAdd2D template in Laghos. ArchitectureLayout >( d a t a s t o r e ) ; t e m p l a t e // Notice that the ArchitectureLayout template parameter void rMassMultAdd2D( // isa string. c o n s t int numElements , } ); c o n s t double ∗ restrict dofToQuad, c o n s t double ∗ restrict dofToQuadD, c o n s t double ∗ restrict quadToDof, C. Application: Laghos c o n s t double ∗ restrict quadToDofD, c o n s t double ∗ restrict oper, Laghos [25] is a higher-order finite-element-method (FEM) c o n s t double ∗ restrict solIn , proxy application solving the time-dependent equations double ∗ restrict solOut) { forall (e,numElements , { of compressible gas dynamics in a moving Lagrangian frame double sol x y [NUM QUAD 1D] [NUM QUAD 1D] ; using unstructured high-order finite-element spatial discretiza- // sol x y[ ∗ ][ ∗ ]= 0; f o r(int dy = 0; dy < NUM DOFS 1D; ++dy ) { tion [26] and explicit high-order time-stepping. Higher-order double sol x [NUM QUAD 1D] ; FEM codes are unlike traditional codes solving partial differ- // sol x[ ∗ ]= 0; f o r(int dx = 0; dx < NUM DOFS 1D; ++dx ) { ential equations in that, in addition to a traditional loop over all c o n s t doubles= of the elements in a simulation, there are nested loops within solIn[ijkN(dx,dy,e,NUM DOFS 1D ) ] ; f o r(int qx = 0; qx < NUM QUAD 1D; ++qx ) { each element whose bounds are determined at runtimes and s o l x [ qx ] += often small. 
The relevant parameters (NUM DOFS 1D and dofToQuad[ijN(qx ,dx ,NUM QUAD 1D) ] ∗ s ; } NUM QUAD 1D in Listing 13) are often between two and } 32, and they form the bounds on over thirty loops in just one f o r(int qy = 0; qy < NUM QUAD 1D; ++qy ) { c o n s t double d2q = kernel, often to depths of four within the main element loop. dofToQuad[ijN(qy ,dy ,NUM QUAD 1D ) ] ; Knowing these loop bounds is critical, as shown in Figure 7: f o r(int qx = 0; qx < NUM QUAD 1D; ++qx ) { s o l xy[qy][qx] += d2q ∗ s o l x [ qx ] ; not knowing them can cause an 8x slowdown. } Currently, as shown in Listing 13, this performance op- } } portunity is realized by picking the most common orders, //a second loop nest similar to that above... instantiating a template function for each, and then dispatch- } ); } ing among these explicitly-specialized functions. Since these Fig. 7: Laghos runtime performance improvement using ClangJIT other accelerator programming models, especially HIP and over the non-specialized kernel implementations, using 10000 ele- OpenMP. HIP is a programming model for AMD GPUs, ments. The runs are labled as dNqM where the dofs parameter is N and the quads parameter is M. and its implementation in Clang is based on Clang’s CUDA implementation. Because of the similarities between HIP and 8.08 8 CUDA, we anticipate the changes necessary to add HIP 6.41 support will be minor. OpenMP accelerator support, however, 6 4.88 4.27 is more complicated. Even when targeting NVIDIA GPUs, 4 3.8 2.85 Clang’s OpenMP target-offload compilation pipeline differs 2.53 2.19 significantly from its CUDA compilation pipeline. Among

other differences, for OpenMP target offloading, the host code is compiled first, followed by the device code, and a linker script is used to integrate the various components later. Further enhancements to the scheme described in Section III are necessary in order to support a model where host compilation precedes device compilation.

(Fig. 7 configurations, left to right: d2q4, d2q2, d4q8, d2q8, d4q2, d4q4, d8q4, d8q2; y-axis: relative speedup.)

Fig. 8: Performance of the partial assembly of the mass matrix in MARBL, using ClangJIT and the pre-existing template-specialization scheme for various polynomial orders (p = 1 through p = 8). The x- and y-axes indicate the problem size (points per compute node) and performance ([DOFs x CG iterations] / [compute nodes seconds]), respectively; configuration: MFEM/cpu, host: 1 node, 1 task/node, clang, BP1. The amortized overhead of the ClangJIT implementation is slightly lower than that of the original scheme, yielding slightly better performance, and the shaded regions indicate that performance difference.

During our explorations of different application use cases, two noteworthy enhancements were identified. First, C++ non-type template parameters currently cannot be floating-point values. However, for many HPC use cases, it will be useful to create specializations given specific sets of floating-point values (e.g., for some polynomial coefficients or constant matrix elements). Currently, this can be done by casting the floating-point values to integers, using those to instantiate the templates, and then casting the integers back to floating-point values inside the JIT-compiled code. The casting, however, is a workaround that we'll investigate eliminating. Second, the current implementation does not allow JIT-compiled templates to make use of other JIT-compiled templates - in other words, once ClangJIT starts compiling code at runtime, it assumes that code will not, itself, create more places where JIT compilation might be invoked in the future. This has limited applicability to some use cases because it creates an unhelpful barrier between code that can be used during AoT compilation and code that can be used (transitively) in JIT-compiled templates.
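The floating-point workaround described above can be illustrated with a small sketch (hypothetical code, not taken from ClangJIT or the applications discussed here, and relying on ClangJIT's acceptance of non-constant non-type template arguments): the caller bit-casts a double to an integer, uses that integer as the runtime template argument of a [[clang::jit]] template, and the template bit-casts it back before use.

#include <cstdint>
#include <cstring>

// Hypothetical illustration of the integer-casting workaround for
// floating-point non-type template parameters described above.
static std::uint64_t to_bits(double x) {
  std::uint64_t bits;
  std::memcpy(&bits, &x, sizeof(bits));   // bit-preserving double -> integer
  return bits;
}

static double from_bits(std::uint64_t bits) {
  double x;
  std::memcpy(&x, &bits, sizeof(x));      // bit-preserving integer -> double
  return x;
}

template <std::uint64_t CoefficientBits>
[[clang::jit]] void scaled_axpy(double *y, const double *x, int n) {
  const double coefficient = from_bits(CoefficientBits);  // recovered constant
  for (int i = 0; i < n; ++i)
    y[i] += coefficient * x[i];
}

void run(double *y, const double *x, int n, double coefficient) {
  // The runtime value becomes a template argument via its bit pattern, so each
  // distinct coefficient gets its own JIT-compiled specialization.
  scaled_axpy<to_bits(coefficient)>(y, x, n);
}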

0.0 unhelpful barrier between code that can be used during AoT 1 4 5 7 10 102 103 10 10 106 10 Points per compute node compilation and code that can be used (transitively) in JIT- compiled templates. This work has inspired a recent proposal to the C++ VI.FUTURE WORK standards committee [27], and as discussed in that proposal, the language extension discussed here might not be the best Future enhancements to ClangJIT will allow for not just way to design such an extension to C++. As highlighted in runtime specialization, but adaptive runtime specialization. that proposal, use of an attribute such as [[clang::jit]] LLVM’s optimizer can make good use of profiling informa- might be suboptimal because: tion to guide function inlining, code placement, and other • The template is not otherwise special, it is the point of optimizations. ClangJIT can be enhanced to gather profiling instantiation that is special (and the point of instantiation information and use that information to recompile further- is what might cause compilation to fail without JIT- optimized code variants. LLVM supports two kinds of profil- compilation support). ing: instrumentation-based profiling and sampling-based pro- • The uses of the template look vaguely normal, and filing, and how to best use this capability is under investi- so places where the application might invoke the JIT gation. Moreover, profiling information can be used to guide compiler will be difficult to spot during code review. more-advanced autotuning of the JIT-compiled code, and this • The current mechanism provides no place to get out an is also under investigation. While ClangJIT already allows JIT error or provide a fall-back execution path - except that compilation to be used in a portable manner, it provides per- having the runtime throw an exception might work. formance portability only to the extent that LLVM’s optimizer Future work will explore different ways to address these is generally capable. With autotuning, however, additional potential downsides of the language extension presented here. performance potability can be realized above that generally For example, it might be better to use some syntax like: available during static compilation. We will investigate enhancing ClangJIT with support for j i t t h i s template foo (); where jit_this_template would be a new keyword. significant speedups to the AoT compilation process, which in- Clang’s ASTs can be large, and the embedded LLVM creases programmer productivity, and moreover, the resulting IR also adds non-trivially to the size of the object files code is simpler and easier to maintain, which also increases containing code making use of the JIT-compilation feature. programmer productivity. In the future, we expect to see uses For example, for the Eigen-based microbenchmark presented of JIT compilation replacing generic algorithm implementa- in Section V, AoT-only executable is 135 KB in size. The tions in HPC applications which represent so many potential executable making use of the JIT-compilation feature is 12 specializations that instantiating them during AoT compilation MB in size, and this does not include the size of the compiled is not practical. We already see this in machine-learning Clang and LLVM libraries. If these libraries are linked to frameworks (e.g., in TensorFlow/XLA) and other domain- the application statically, instead of dynamically, the size of specific libraries, and now ClangJIT makes this capability the binary increases to 75 MB. 
Clang's ASTs can be large, and the embedded LLVM IR also adds non-trivially to the size of the object files containing code making use of the JIT-compilation feature. For example, for the Eigen-based microbenchmark presented in Section V, the AoT-only executable is 135 KB in size. The executable making use of the JIT-compilation feature is 12 MB in size, and this does not include the size of the compiled Clang and LLVM libraries. If these libraries are linked to the application statically, instead of dynamically, the size of the binary increases to 75 MB. While this size increase is not a problem in practice if only a small number of source files make use of the JIT-compilation feature, many source files making use of the JIT-compilation feature can lead to a large increase in the size of the resulting executable. In the future, we will explore whether the size of the embedded ASTs can be reduced by using additional compression and/or excluding parts of the AST that cannot be relevant to the JIT-compiled code. Similarly, we will explore whether the size of the embedded LLVM IR can be reduced. This will also reduce the memory overhead of the JIT-compilation engine (which is approximately 127 MB for the microbenchmark in Section V).

Finally, we will investigate improving the error reporting and handling. While ClangJIT will do as much semantic validation as possible during AoT compilation, compilation errors can still occur at runtime. Currently, should compilation errors occur at runtime (e.g., because a static_assert is triggered or because the specified type does not exist), the corresponding error messages are printed to standard error (just as they are when running Clang from the command line) and then the program is forced to exit. Clang generally prints lines from the original source files along with its messages, and ClangJIT can do this as well because the relevant original source files are stored, in compressed form, along with the serialized AST within the object files. However, it may be desirable to allow an application to recover from these kinds of errors, and it might be desirable for an application to save the error messages to a string for later use; these capabilities may be added in the future.
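One possible shape for such a capability, shown purely as a hypothetical sketch (neither jit_error nor run_with_fallback exists in ClangJIT today), is to surface a failed runtime instantiation as an exception carrying the saved diagnostics, so that the application can log them and fall back to a generic code path:

  #include <iostream>
  #include <string>

  // Hypothetical error type: diagnostics are captured in a string rather than
  // being printed to standard error before the program exits.
  struct jit_error {
    std::string diagnostics;
  };

  template <typename Kernel, typename Fallback>
  void run_with_fallback(Kernel jit_kernel, Fallback generic_kernel) {
    try {
      jit_kernel();                 // in this sketch, a failed runtime instantiation throws
    } catch (const jit_error &e) {
      std::cerr << e.diagnostics;   // saved error text, available for logging
      generic_kernel();             // recover via a generic, AoT-compiled path
    }
  }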
VII. CONCLUSIONS

In this paper, we've demonstrated that JIT-compilation technology can be integrated into the C++ programming language, that this can be done in a way which makes using JIT compilation easy, that this can reduce compile time, making application developers more productive, and that this can be used to realize high performance in HPC applications. We investigated whether JIT compilation could be integrated into applications making use of the RAJA abstraction library without any changes to the application source code at all and found that we could. We then investigated how JIT compilation could be integrated into existing HPC applications. Kripke and Laghos were presented here, and we demonstrated that the existing dispatch schemes could be replaced with the proposed JIT-compilation mechanism. This replacement produced significant speedups to the AoT compilation process, which increases programmer productivity, and moreover, the resulting code is simpler and easier to maintain, which also increases programmer productivity. In the future, we expect to see uses of JIT compilation replacing generic algorithm implementations in HPC applications which represent so many potential specializations that instantiating them during AoT compilation is not practical. We already see this in machine-learning frameworks (e.g., in TensorFlow/XLA) and other domain-specific libraries, and now ClangJIT makes this capability available in the general-purpose C++ programming language. The way that HPC developers think about the construction of optimized kernels, unlike in the past, will increasingly include the availability of JIT compilation capabilities.

ACKNOWLEDGMENT

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation's exascale computing imperative. Additionally, this research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. Work at Lawrence Livermore National Laboratory was performed under Contract DE-AC52-07NA27344 (LLNL-CONF-772305). We would additionally like to thank the many members of the C++ standards committee who provided feedback on this concept during the 2019 committee meeting in Kona. Finally, we thank Quinn Finkel for editing and providing feedback.

The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory ("Argonne"). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan. http://energy.gov/downloads/doe-public-access-plan

REFERENCES

[1] B. Stroustrup, "Evolving a language in and for the real world: C++ 1991-2006," in Proceedings of the third ACM SIGPLAN conference on History of programming languages. ACM, 2007, pp. 4-1.
[2] M. Byler, "Multiple version loops," in Proceedings of the 1987 International Conference on Parallel Processing. New York, 1987.
[3] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization. IEEE Computer Society, 2004, p. 75.
[4] P. Reames, "Falcon: An optimizing Java JIT," in 2017 US LLVM Developers' Meeting, 2017.
[5] D. A. Terei and M. M. Chakravarty, "An LLVM backend for GHC," in ACM SIGPLAN Notices, vol. 45, no. 11. ACM, 2010, pp. 109-120.
[6] M. Pall, "The LuaJIT project," http://luajit.org, 2008.
[7] J. Bezanson, S. Karpinski, V. B. Shah, and A. Edelman, "Julia: A fast dynamic language for technical computing," arXiv preprint arXiv:1209.5145, 2012.
[8] M. Abadi, M. Isard, and D. G. Murray, "A computational model for TensorFlow: an introduction," in Proceedings of the 1st ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. ACM, 2017, pp. 1-7.
[9] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, "Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," ACM SIGPLAN Notices, vol. 48, no. 6, pp. 519-530, 2013.
[10] D. Binks, "Runtime compiled C & C++ solutions," https://github.com/RuntimeCompiledCPlusPlus/RuntimeCompiledCPlusPlus/wiki/Alternatives, 2019.
[11] D. Binks, M. Jack, and W. Wilson, "Runtime compiled C++ for rapid AI development," Game AI Pro: Collected Wisdom of Game AI Professionals, p. 201, 2013.
[12] M. Noack, F. Wende, G. Zitzlsberger, M. Klemm, and T. Steinke, "KART - a runtime compilation library for improving HPC application performance," in International Conference on High Performance Computing. Springer, 2017, pp. 389-403.
[13] NVIDIA, "NVRTC (runtime compilation): CUDA toolkit documentation," https://docs.nvidia.com/cuda/nvrtc/, 2019.
[14] J. M. M. Caamaño and S. Guelton, "Easy::Jit: compiler assisted library to enable just-in-time compilation in C++ codes," in Conference Companion of the 2nd International Conference on Art, Science, and Engineering of Programming. ACM, 2018, pp. 49-50.
[15] K. Farvardin, "atJIT: A just-in-time autotuning compiler for C++," https://github.com/kavon/atJIT, 2018.
[16] T. Lutz and V. Grover, "LambdaJIT: a dynamic compiler for heterogeneous optimizations of STL algorithms," in Proceedings of the 3rd ACM SIGPLAN workshop on Functional high-performance computing. ACM, 2014, pp. 99-108.
[17] M. Hopcroft et al., "BitFunnel/NativeJIT," https://github.com/BitFunnel/NativeJIT, 2014.
[18] V. Vasilev, P. Canal, A. Naumann, and P. Russo, "Cling - the new interactive interpreter for ROOT 6," in Journal of Physics: Conference Series, vol. 396, no. 5. IOP Publishing, 2012, p. 052071.
[19] J. Wu, A. Belevich, E. Bendersky, M. Heffernan, C. Leary, J. Pienaar, B. Roune, R. Springer, X. Weng, and R. Hundt, "gpucc: an open-source GPGPU compiler," in Proceedings of the 2016 International Symposium on Code Generation and Optimization. ACM, 2016, pp. 105-116.
[20] G. F. Diamos, A. R. Kerr, S. Yalamanchili, and N. Clark, "Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems," in Proceedings of the 19th international conference on Parallel architectures and compilation techniques. ACM, 2010, pp. 353-364.
[21] G. Guennebaud, B. Jacob et al., "Eigen," http://eigen.tuxfamily.org, 2010.
[22] R. D. Hornung and J. A. Keasler, "The RAJA portability layer: overview and status," Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States), Tech. Rep., 2014.
[23] D. Boehme, T. Gamblin, D. Beckingsale, P.-T. Bremer, A. Gimenez, M. LeGendre, O. Pearce, and M. Schulz, "Caliper: performance introspection for HPC software stacks," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press, 2016, p. 47.
[24] A. J. Kunen, T. S. Bailey, and P. N. Brown, "Kripke - a massively parallel transport mini-app," https://github.com/LLNL/Kripke, Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States), Tech. Rep., 2015.
[25] Laghos Team, "Laghos: LAGrangian High-Order Solver," https://github.com/CEED/Laghos, 2019.
[26] V. A. Dobrev, T. V. Kolev, and R. N. Rieben, "High-order curvilinear finite element methods for Lagrangian hydrodynamics," SIAM Journal on Scientific Computing, vol. 34, no. 5, pp. B606-B641, 2012.
[27] H. Finkel, "P1609R0: C++ should support just-in-time compilation," http://wg21.link/p1609r0, 2019.
APPENDIX A
ARTIFACT DESCRIPTION APPENDIX: CLANGJIT: ENHANCING C++ WITH JUST-IN-TIME COMPILATION

A. Abstract

The Eigen benchmarks were run on JLSE's "thing" machine, an Intel Haswell-EP E5-2699v3 machine. The Eigen CUDA benchmarks were run on JLSE's "firestone" machine, an IBM S822LC (OpenPower, Power8 10c, 2.92 GHz) with an Nvidia K80 GPU. The RAJA-based applications and benchmarks were run on LLNL's Genie machine, an Intel Xeon CPU E5-2695 v4 @ 2.10 GHz, using ClangJIT and compiled using -O3. ClangJIT is based on a version of Clang close to Clang v8.

B. Description

1) Check-list (artifact meta information):
• Program: Clang, LLVM, ClangJIT, Eigen, Laghos, MARBL, Kripke
• Run-time environment: Linux
• Hardware: X86_64, PowerPC, NVIDIA
• Publicly available?: Yes (except MARBL)

2) How software can be obtained (if available): ClangJIT is available from https://github.com/hfinkel/llvm-project-cxxjit, and results were generated using commit d032b2c6925cc5b5e5e2bd681df2c50dc22df634. The source code for the Eigen-based benchmarks is available from https://github.com/hfinkel/clangjitmisc-benchmarks. For Eigen, we used the code from https://github.com/eigenteam/eigen-git-mirror, commit a98e2c7c48e5e387630ad9589d023ebcb0d7b5b0. The Kripke version used is https://github.com/DavidPoliakoff/Kripke/tree/feature/clang_jit, commit 66046d8cd51f5bcf8666fd8c810322e253c4ce0e. For RAJA, the code is available from https://github.com/DavidPoliakoff/RAJA/tree/reproducibility/clang_jit/jit_test, where simple_test.cpp is our matmul example using RAJA and simple_fem_test.cpp is our FEM test, based on Laghos (https://github.com/CEED/Laghos/). https://github.com/DavidPoliakoff/SC-Clang-Jit-Performance holds the data on the RAJA-based examples.
3) Hardware dependencies: thing: Intel Haswell-EP E5-2699v3; firestone: IBM S822LC, OpenPower, Power8 10c 2.92 GHz + Nvidia K80 GPU; genie: Intel Xeon CPU E5-2695 v4 @ 2.10 GHz.

4) Software dependencies: ClangJIT compiled with gcc v8.2.0.

5) Datasets:

C. Installation

Each piece of software contains corresponding installation instructions. First, compile ClangJIT from its repository, then follow the instructions for each other piece of software using ClangJIT as the compiler.

D. Experiment workflow

The various repositories contain instructions regarding the workflow - there are two main kinds of outputs: execution time and compilation time. When collecting compilation time, remember to repeat particular compilation jobs multiple times because I/O times and other system functions can have considerable variance. Execution times are collected from the system, except that the Eigen benchmarks have their own timing output.

E. Evaluation and expected result

Two kinds of results can be obtained from the software provided for evaluation. First, JIT-specialized kernels can execute much more quickly than generic versions (e.g., in the Eigen benchmark for small sizes). The proxy and production applications already take advantage of specialization, but do so by precompiling all potentially-relevant kernels. Using the JIT-compiled kernels leads to essentially-equivalent execution time with a reduction in compile time.
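As a point of orientation for the first kind of result, the following sketch (illustrative only; it is not the benchmark source, which lives in the repositories listed above) shows the pattern the Eigen-based benchmark exercises: a fixed-size kernel instantiated at runtime through the [[clang::jit]] attribute versus a generic dynamic-size path.

  #include <Eigen/Dense>

  // Fixed-size path: N is a template parameter, so Eigen sees a compile-time size
  // and can fully unroll and vectorize the kernel when it is JIT-instantiated.
  template <int N>
  [[clang::jit]] double dot_fixed(const double *a, const double *b) {
    Eigen::Map<const Eigen::Matrix<double, N, 1>> va(a), vb(b);
    return va.dot(vb);
  }

  // Generic path: the size is only known at run time.
  double dot_generic(const double *a, const double *b, int n) {
    Eigen::Map<const Eigen::VectorXd> va(a, n), vb(b, n);
    return va.dot(vb);
  }

  double dot(const double *a, const double *b, int n) {
    return dot_fixed<n>(a, b);   // n is a runtime value; instantiated on first use
  }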