An Infrastructure for Adaptive Dynamic Optimization
Derek Bruening, Timothy Garnett, and Saman Amarasinghe
Laboratory for Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139
{iye,timothyg,saman}@lcs.mit.edu

Abstract

Dynamic optimization is emerging as a promising approach to overcome many of the obstacles of traditional static compilation. But while there are a number of compiler infrastructures for developing static optimizations, there are very few for developing dynamic optimizations. We present a framework for implementing dynamic analyses and optimizations. We provide an interface for building external modules, or clients, for the DynamoRIO dynamic code modification system. This interface abstracts away many low-level details of the DynamoRIO runtime system while exposing a simple and powerful, yet efficient and lightweight, API. This is achieved by restricting optimization units to linear streams of code and using adaptive levels of detail for representing instructions. The interface is not restricted to optimization and can be used for instrumentation, profiling, dynamic translation, etc. To demonstrate the usefulness and effectiveness of our framework, we implemented several optimizations. These improve the performance of some applications by as much as 40% relative to native execution. The average speedup relative to base DynamoRIO performance is 12%.

1 Introduction

The power and reach of static analysis is diminishing for modern software, which heavily utilizes dynamic class loading, shared libraries, and runtime binding. Not only is it difficult or impossible for a static compiler to analyze the whole program, but static optimization is limited by the accuracy of its predictions of runtime program behavior. Using profile information improves the predictions but still falls short for programs whose behavior changes dynamically. Additionally, many software vendors are hesitant to ship binaries that are compiled with high levels of static optimization because they are hard to debug.

Shifting optimizations to runtime solves these problems. Dynamic optimization allows the user to improve the performance of binaries without relying on how they were compiled. Furthermore, several types of optimizations are best suited to a dynamic optimization framework. These include adaptive, architecture-specific, and inter-module optimizations.

Adaptive optimizations require instant responses to changes in program behavior. When performed statically, a single profiling run is taken to be representative of the program's behavior. Within a dynamic optimization system, ongoing profiling identifies which code is currently hot, allowing optimizations to focus only where they will be most effective.

Architecture-specific code transformations may be done statically if the resulting executable is only targeting a single processor, or by using dynamic dispatch to select among several transformations prepared for different processors. The first option unduly restricts the executable while the second bloats the executable size. Performing the optimization dynamically solves the problem by allowing the executable to remain generic and specialize itself to the processor it happens to be running on.

Inter-module optimizations cannot be done statically in the presence of shared and dynamically loaded libraries. But all code is available to a dynamic optimizer, presenting optimizations with a view of the code that cuts across the static units used by the compiler for optimization.

Copyright © 2003 by the Institute of Electrical and Electronics Engineers. Personal use of this material is permitted.
However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Dynamic optimizations have one significant disadvantage versus static optimizations. The overhead of performing the optimization must be amortized before any improvement is seen. This limits the scope of optimizations that can be done online, and makes the efficiency of the optimization infrastructure extremely critical. For this reason, while there are numerous flexible and general compiler infrastructures for developing static optimizations, there are very few for the development of dynamic optimizations.

Another important contrast with static compilation is transparency. Unlike a static compiler optimization, a dynamic optimization cannot use the same memory allocation routines or input/output buffering as the application, because the optimization's operations are interleaved with those of the application.

The main contribution of this paper is a framework for implementing dynamic analyses and optimizations. The framework is based on the DynamoRIO dynamic code modification system. We export an interface for building external modules, or clients, for DynamoRIO. With this API, custom runtime code transformations are simple to develop. Efficiency is achieved through two key principles: restricting optimization units to linear streams of code and using adaptive levels of detail for representing instructions. Our interface provides direct support for customizable traces and adaptive optimization of traces, while maintaining transparency with respect to the application. It is general enough to be used for non-optimization purposes, including instrumentation, profiling, and security [23]. The system is available to the public in binary form [16].

The paper is organized as follows. We first present in Section 2 a description of the DynamoRIO system. DynamoRIO operates on unmodified native binaries and requires no special hardware or operating system support. It is implemented for both IA-32 Windows [5] and Linux, and is capable of running large desktop applications. Section 3 describes our optimization interface in detail, and Section 4 gives some examples of dynamic optimizations written using the interface. We give experimental results in Section 5. We discuss two other dynamic code modification interfaces along with other related work in Section 6.

2 DynamoRIO

Our optimization infrastructure is built on a dynamic optimizer called DynamoRIO. DynamoRIO is the IA-32 version [5] of Dynamo [4]. It is implemented for both IA-32 Windows and Linux, and is capable of running large desktop applications.

The goal of DynamoRIO is to observe and potentially manipulate every single application instruction prior to its execution. The simplest way to do this is with an interpretation engine. However, interpretation via emulation is slow, especially on an architecture like IA-32 with a complex instruction set, as shown in Table 1. DynamoRIO uses a typical alternative: caching code so that it can be directly executed in the future.

DynamoRIO copies basic blocks (sequences of instructions ending with a single control transfer instruction) into a code cache and executes them natively. At the end of each block the application's machine state must be saved and control returned to DynamoRIO (a context switch) to copy the next basic block. If a target basic block is already present in the code cache, and is targeted via a direct branch, DynamoRIO links the two blocks together with a direct jump. This avoids the cost of a subsequent context switch.

Indirect branches cannot be linked in the same way because their targets may vary. To maintain transparency, original program addresses must be used wherever the application stores indirect branch targets (for example, return addresses for function calls). These addresses must be translated into their corresponding code cache addresses in order to jump to the target code. This translation is performed as a fast hashtable lookup.

To improve the efficiency of indirect branches, and to achieve better code layout, basic blocks that are frequently executed in sequence are stitched together into a unit called a trace. When connecting beyond a basic block that ends in an indirect branch, a check is inserted to ensure that the actual target of the branch will keep execution on the trace. This check is much faster than the hashtable lookup, but if the check fails the full lookup must be performed. The superior code layout of traces goes a long way toward amortizing the overhead of creating them and often speeds up the program [4, 32].

A flow chart showing the operation of DynamoRIO is presented in Figure 1. The figure concentrates on the flow of control in and out of the code cache, which is the bottom portion of the figure. The copied application code looks just like the original code with the exception of its control transfer instructions, which are shown with arrows in the figure.

Table 1 shows the typical performance improvement of each enhancement to the basic interpreter design. Caching is a dramatic performance improvement, and adding direct links is nearly as dramatic. The final steps of adding a fast in-cache lookup for indirect branches and building traces improve the performance significantly as well.

The Windows operating system directly invokes application code or changes the program counter for callbacks, exceptions, asynchronous procedure calls, setjmp, and the SetThreadContext API routine. These types of control flow are intercepted in order to ensure that all application code is executed under DynamoRIO [5]. Signals on Linux must be similarly intercepted.

DynamoRIO maintains thread-private code caches, each a typical