Using LLVM for Optimized Lightweight Binary Re-Writing at Runtime
Alexis Engelke, Josef Weidendorfer
Department of Informatics, Technical University of Munich, Munich, Germany
EMail: [email protected], [email protected]

Abstract—Providing new parallel programming models/abstractions as a set of library functions has the huge advantage that it allows for a relatively easy incremental porting path for legacy HPC applications, in contrast to the huge effort needed when novel concepts are only provided in new programming languages or language extensions. However, performance issues are to be expected with fine-granular usage of library functions. In previous work, we argued that binary rewriting can bridge the gap by tightly coupling application and library functions at runtime. We showed that runtime specialization at the binary level, starting from a compiled, generic stencil code, can help in approaching the performance of a manually written, statically compiled version.
In this paper, we analyze the benefits of post-processing the re-written binary code using standard compiler optimizations as provided by LLVM. To this end, we present our approach for efficiently converting x86-64 binary code to LLVM-IR. Using the mentioned generic code for arbitrary 2d stencils, we present performance numbers with and without LLVM post-processing. We find that we can now achieve the performance of variants specialized by hand.

Keywords—High Performance Computing; Dynamic Code Generation; Dynamic Optimization; Binary Transformation

I. INTRODUCTION

In High Performance Computing (HPC), the usage of programming languages which get compiled to native code, such as Fortran or C++, is common, as is the usage of low-level libraries which are relatively thin abstractions over hardware. An example for the latter is MPI [1]. The reason for favoring compiled languages and low-level libraries is the need for predictable and stable runtime behavior, which is essential for parallel application codes to scale well even when using thousands of processors. On the one hand, this low level of programming allows for sophisticated optimizations tuned for specific hardware features. But on the other hand, productivity is heavily limited compared to software development in other IT domains. For example, HPC programmers do not take advantage of memory safety as provided by managed environments (Java, .NET).

Due to the low productivity, strategies are desperately needed which allow both for higher abstraction and yet keep the ability to tune details. Promising concepts such as PGAS (partitioned global address space) were proposed. However, they often come with new programming languages or language extensions [2]–[4]. Compiler support is very helpful in reducing any overhead of the provided abstractions. But with new languages, porting of existing application code is required, which is a high burden for adoption with old code bases. Therefore, to provide abstractions such as PGAS for legacy codes, APIs implemented as libraries are proposed [5], [6]. Libraries have the benefit that they can easily be composed and can stay small and focused. However, libraries come with the caveat that fine-granular usage of library functions in inner kernels will severely limit compiler optimizations such as vectorization and thus may heavily reduce performance.

To this end, in previous work [7], we proposed a technique for lightweight code generation by re-combining existing binary code. We presented our open-source prototype DBrew (Dynamic Binary REWriting)¹. The main features are (1) tight coupling of separately compiled functions (e.g. from application code and/or different libraries) by aggressive inlining and (2) specialization of generic code with information known at runtime. For example, variable function parameters can be configured to have known values, enabling constant propagation, dead-code elimination (due to runtime knowledge), and full loop unrolling. Specialization is useful for minimizing the runtime overhead of abstractions: how to best handle different runtime properties (input data, exact target architecture with e.g. varying cache sizes, specific features of I/O devices) can be covered in generic code. This gets specialized into a concrete implementation when executed by the proposed rewriting technique².

¹ Available at https://github.com/lrr-tum/dbrew
² C++ templates allow for variants, but may have code explosion issues.
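To make the kind of generic code and runtime specialization meant here concrete, the following is a minimal illustrative sketch (the type and function names are ours, not code taken from DBrew or this paper): a 2d stencil kernel driven by a stencil description that is only known at runtime. Once the description is declared as known to the rewriter, the loop bound and the factors become constants, so the loop can be fully unrolled and the generic accesses collapse into fixed neighbor loads.

    // Illustrative only: a generic 2d stencil kernel. If the contents of *s
    // are known at rewriting time, constant propagation turns s->points and
    // the factors into immediates and the loop can be fully unrolled.
    typedef struct { int dx, dy; double factor; } StencilPoint;
    typedef struct { int points; StencilPoint p[8]; } Stencil;

    double apply_stencil(const Stencil* s, const double* grid,
                         int stride, int x, int y)
    {
        double res = 0.0;
        for (int i = 0; i < s->points; i++)
            res += s->p[i].factor *
                   grid[(y + s->p[i].dy) * stride + (x + s->p[i].dx)];
        return res;
    }

A hand-specialized Jacobi variant would hard-code the four neighbor accesses and their factors; the goal of the rewriting approach is to reach that form automatically, starting from the compiled generic kernel.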
Rewriting at the binary level has several advantages. We can inline code from existing libraries (even commercial ones) without the need for source code, which makes the technique easy to deploy in different HPC environments. Furthermore, for the supported code transformations, staying at the binary level of a given ISA is fine: either instructions can be copied over into the newly generated code, partly replacing operands with known constants, or they simply disappear if all input parameters are known. We showed that the performance of the resulting code almost matched the performance of the variant with the specialization done statically by hand.

In this paper, we study the performance implications of our binary re-combination/specialization technique in more detail. We want to understand if there are inherent overheads within the resulting code which make it impossible to approach statically compiled, manually specialized code. To this end, we forward code from DBrew into standard optimization passes of an established compiler backend, as shown in Fig. 1. For this, we use the C-API and JIT (Just-In-Time) compilation features of LLVM [8]. We use the same example as in our previous work (a generic 2d Jacobi code). We compare the performance of various variants and the effort required. While having the option to use LLVM optimization passes is nice in the scope of DBrew, we also did this experiment for two other reasons. On the one hand, we now have an infrastructure to understand, for given use cases, which of the optimization passes are essential and which can be neglected. This will guide us to further improve the backend of DBrew with simple forms of essential optimization passes without the heavy LLVM resource overhead (both in space and time). On the other hand, we wanted to understand the importance of meta information for standard compiler optimization passes, as this is not readily available in the binary code from DBrew.

Figure 1: Overview of the process of binary re-writing in combination with LLVM. The decoded and optionally specialized binary code is transformed to LLVM-IR before standard compiler optimizations of LLVM can be applied.
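To illustrate the LLVM side of Fig. 1, the following is a minimal sketch of how a module produced by the x86-64 to LLVM-IR transformation could be optimized and JIT-compiled through the LLVM C API. It is a sketch under assumptions, not the paper's actual integration code: the symbol name "rewritten_func" is a placeholder, and the legacy pass manager builder shown here is what the C API offered in LLVM versions of that time.

    #include <stddef.h>
    #include <stdint.h>
    #include <llvm-c/Core.h>
    #include <llvm-c/Target.h>
    #include <llvm-c/ExecutionEngine.h>
    #include <llvm-c/Transforms/PassManagerBuilder.h>

    // Sketch: optimize a module holding the transformed function and
    // JIT-compile it; "rewritten_func" is a placeholder symbol name.
    static void* optimize_and_jit(LLVMModuleRef mod)
    {
        // Populate a module pass manager with the standard -O3 pipeline.
        LLVMPassManagerBuilderRef pmb = LLVMPassManagerBuilderCreate();
        LLVMPassManagerBuilderSetOptLevel(pmb, 3);
        LLVMPassManagerRef pm = LLVMCreatePassManager();
        LLVMPassManagerBuilderPopulateModulePassManager(pmb, pm);
        LLVMRunPassManager(pm, mod);
        LLVMDisposePassManager(pm);
        LLVMPassManagerBuilderDispose(pmb);

        // JIT-compile the optimized module with MCJIT and look up the
        // generated function so it can be called as a drop-in replacement.
        LLVMLinkInMCJIT();
        LLVMInitializeNativeTarget();
        LLVMInitializeNativeAsmPrinter();
        LLVMExecutionEngineRef ee;
        char* err = NULL;
        if (LLVMCreateExecutionEngineForModule(&ee, mod, &err) != 0)
            return NULL; // creation failed; err holds the message
        return (void*)(uintptr_t) LLVMGetFunctionAddress(ee, "rewritten_func");
    }

The returned pointer plays the same role as the function pointer returned by DBrew's rewriter: it can be cast to the original signature and called in place of the original function.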
The paper is structured as follows: first, we shortly revisit the DBrew prototype in the next section. Then we describe in detail the required steps for converting x86-64 binary code into LLVM-IR. In Sec. IV, we describe specific aspects of the conversion in the context of specialization. Afterwards, results are presented for the 2d Jacobi case. Here, we also show excerpts of x86 code as produced by DBrew as well as the code when post-processed by LLVM. We discuss our findings, and after mentioning related work, we conclude the paper with ideas for future extensions.

II. DBREW: DYNAMIC BINARY RE-WRITING

DBrew is our prototype for the proposed lightweight code generation technique consisting of re-writing and re-combining pieces of compiled binary code. It currently works with the x86-64 ISA and Linux. The basic approach is to generate drop-in replacements of existing functions. A request for re-writing returns a function pointer with exactly the same function signature as the original code. We expect that re-writing may fail: each of the internal steps "decoding", "emulation" (if all input to an instruction is known), and "encoding" may not cover a given instruction in the instruction stream. This will trigger an internal error. DBrew has a default error handler which simply returns the original function to ensure correctness. However, the user can provide a custom error handler, which for example may iteratively enlarge the buffer space available for the generated code and restart the re-writing. Fig. 2 shows basic DBrew usage.

    int func(int a, int b) { ... }
    typedef int (*func_t)(int,int);
    func_t newfunc;

    int main() {
        // call original function
        int x = func(1,2);
        // new rewriter config for func
        dbrew_rewriter* r = dbrew_new(func);
        ... // configure rewriter
        newfunc = (func_t) dbrew_rewrite(r);
        // call rewritten version
        int x2 = (*newfunc)(1,2);
    }

Figure 2: Basic usage of DBrew.

There exist different configuration options to influence the behavior of rewriting the binary code of a compiled function. On the one hand, the amount of resources available to rewriting can be limited (e.g. for decoding, duplicated variant generation, depth of allowed inlining, or space for generated code). On the other hand, for specialization, one can configure that some values which get used in a function should be fixed to a given constant. To specify which values should be assumed to have a fixed value, DBrew allows specifying memory ranges.

    // set config and rewrite func
    r = dbrew_new(func);
    dbrew_setpar(r,1,42);
    dbrew_setmem(r,start,end);
    newfunc = (func_t) dbrew_rewrite(r);
    // par 1: uses 42 instead of 1
    int x2 = (*newfunc)(1,2);

Figure 3: Declaring known values for specialization.

[...] assembly emitted by current C compilers (GCC or Clang). For example, we assume the Linux ABI (System V) and, for floating point support, we only cover SSE instructions for now.

A. Functions

As functions are a central construct in the LLVM-IR, the signature of the function and the calling convention have to be known. This information is required to construct a mapping of the function parameters in the LLVM-IR to the register or stack address where the function expects the argument. Likewise, when the function returns, the register containing the return value has to be determined (see the calling-convention sketch below).

B. Basic Blocks
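Complementing the Functions subsection above: under the System V AMD64 ABI assumed here, the first six integer or pointer arguments arrive in RDI, RSI, RDX, RCX, R8 and R9, further ones on the stack, floating-point arguments in XMM0 through XMM7, and the integer return value in RAX (XMM0 for floating point). A small illustrative helper for the parameter-to-register mapping follows; the enum and function are ours, not DBrew's internal API.

    // Illustrative only: location of the i-th integer/pointer parameter
    // under the System V AMD64 calling convention used on Linux.
    typedef enum { ARG_RDI, ARG_RSI, ARG_RDX, ARG_RCX,
                   ARG_R8, ARG_R9, ARG_STACK } ArgLocation;

    static ArgLocation int_param_location(int index)
    {
        static const ArgLocation regs[6] =
            { ARG_RDI, ARG_RSI, ARG_RDX, ARG_RCX, ARG_R8, ARG_R9 };
        // The first six integer/pointer arguments use registers;
        // any further arguments are passed on the stack.
        return (index < 6) ? regs[index] : ARG_STACK;
    }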