Leveraging Indirect Branch Locality in Dynamic Binary Translators

Leveraging Indirect Branch Locality in Dynamic Binary Translators Balaji Dhanasekaran Kim Hazelwood University of Virginia University of Virginia [email protected] [email protected] Abstract an application depending on how frequent a particular Dynamic Binary Translators (DBTs) have a wide range of code path is executed. It is extremely difficult to make applications like program instrumentation, dynamic opti- optimization decisions like these at compile time. But, mization, and security. One of the main issues with DBTs with the help of binary translators, we can instrument is their performance overhead. A significant part of this the application at runtime and choose the appropriate overhead is caused by indirect branch (IB) translation. In optimization based on the execution characteristics of the this paper, we show that the percentage of instructions application. DBTs provide enormous power in the hands spent in translating indirect branches is as high as 50% of of the users to analyze and optimize their applications at the total guest application’s instructions. The time spent runtime. DBTs are also used in other applications like in indirect branch translation is so high that in some ap- instrumentation (Pin [8] and DynamoRIO [2]), security plications, bounding the code cache size actually results (Strata [9]), dynamic translation (Rosetta [1]), and design in an increased performance, since code cache flushes also space exploration (Daisy [4]). remove stale indirect branch information along with the One of the main problems with DBTs is their perfor- translated code. In order to improve the performance of in- mance overhead. Figure 1 shows the relative performance direct branches, we analyze the locality of indirect branch of the SPEC2006 INT benchmarks when executed under targets and show that the locality is as high as 70%. We the control of Pin. For some benchmarks like perlbench, propose an indirect branch translation algorithm which ex- the overhead of executing under Pin can be as high as ploits this available locality. We analyze the performance 300%. One of the reasons for these high overheads is indi- of the proposed algorithm and show that the proposed al- rect branch translation. DBTs translate the guest code in gorithm achieves a hit rate of 73% compared to 46.5% units of basic blocks and traces and store them in the code with the default algorithm. cache. A basic block is a unit of instructions bounded by branch statements. A trace is a collection of basic blocks. Keywords Dynamic Binary Translators, Code Cache, In- As long as there are no branch statements, the control re- direct Branch Translation, Branch Target Locality mains within the trace. When the application encounters a branch statement, the control is transferred back to the 1. Introduction DBT. The DBT then translates the required instructions and transfers the control back to the application. In the Dynamic binary translators act as a middle layer between case of direct branches, the context switch happens only the guest application and the OS. They provide the ability during the first time the branch instruction is executed. to inspect, instrument, and translate instructions that When control is transferred to the DBT, it generates the are being executed. They allow the user to access the target trace and patches the direct branch with the tar- application’s attributes that are available only at run time. get trace’s code cache address. In the case of indirect For example, consider a compiler optimization like code branches, this branch target can vary across executions. motion, which may improve or hurt the performance of As a result, the branch target cannot be patched the way it is patched in direct branches. This causes a significant overhead in the DBT’s performance. The goal of this paper is to analyze the performance Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed impact of indirect branches and propose an appropriate for profit or commercial advantage and that copies bear this notice and the full citation solution to mitigate this overhead. We show that there on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. are performance trade-offs between indirect branch handling techniques like chaining and branch target hashing. Copyright c 2010 ACM [to be supplied]. $10.00 Figure 1. Normalized performance of the SPEC2006 INT Figure 2. Performance of perlbench as the code cache size Benchmark running under PIN. Performance is normalized is varied. with respect to native performance. Choosing the appropriate branch handling technique has a systems including Windows, Linux, and MacOS. The fol- significant impact on performance. We also show that the lowing subsections describe the code cache management indirect branch targets have a very high locality of about and branch handling in Pin. 73%. We propose an algorithm called the “Most Recently Used Branch Target” algorithm which exploits this local- 3.1 Runtime Translation and Code Caches in Pin ity by rearranging the indirect branch chain, such that the Pin translates the guest application in units of basic blocks most recently used trace is checked first. and traces and maintains the translated code in a software code cache. Thereafter, execution happens only from the 2. Motivation code cache and the original guest application is never One of the main motivations of this paper is an observa- executed directly. The code cache is bounded to 256 MB tion made by Hazelwood et al. [5] where they found that for 64-bit Pin and is unbounded for 32-bit Pin. For 32- in some cases, restricting the code cache size actually im- bit Pin, the code cache management algorithm allocates proves the performance significantly. Figure 2 shows the more memory as and when new traces are generated. performance of perlbench as we vary the code cache size. The drawback of this technique is that during the course As shown in the figure, the best case performance is not of the application’s execution, the code cache can get obtained with an unbounded code cache. On the contrary, clogged with invalid traces and the valid traces can become for some cases, restricting the code cache size actually in- scattered in the code cache. This defeats one of the creases the performance. Our analysis of this performance important advantages of using a DBT, which is better increase showed that the increase had nothing to do with instruction locality. the code cache size. Bounding the code cache increased the number of code cache flushes which also removed stale 3.2 Branch Handling in Pin indirect branch information. And, this happened to be the Branch handling constitutes a significant portion of Pin’s cause of the performance increase. The other main in- performance. A na¨ıve way of translating branch instruc- sight in this paper is the high locality available in indirect tions would be to transfer the control to the DBT when- branches. We show that there is a 70% probability that the ever a branch statement is encountered. The DBT can target of an indirect branch during its current execution is then check whether the target trace is available in the code the same as its target during the next execution. cache and generate it if the target trace is not present in the code cache. But, this approach of context switching 3. Background between the application and the DBT whenever a branch Pin is the dynamic binary translation system used in this statement is executed results in a significant performance paper. Pin is a multi platform DBT, mainly used for in- drop. As a result, Pin uses techniques like trace linking strumentation purposes. Pin is available on multiple archi- and indirect branch chaining for translating branch instruc- tectures including IA32 (32-bit x86), IA32e (64-bit x86), tions. We describe these approaches in the following sub- and Itanium. Pin is also available on multiple operating sections. 3.2.1 Direct Branch Translation The first time a direct branch is executed, control is transferred back to Pin. Pin generates the required target trace and patches the branch instruction’s target address with the target trace’s code cache address. As a result, whenever the same branch instruction is executed again, control can be transferred directly from one trace to the other, without the intervention of Pin. 3.2.2 Indirect Branch Translation In the case of indirect branches, the target address is present in a register or in a memory location. As a re- Figure 3. Indirect branch chain and the associated hash sult, the target of an indirect branch can vary between table. Register r1 holds the indirect branch’s target address. different executions of the same branch instruction. Pin t1, t2, and t3 are the trace’s target addresses and c1, c2, and uses a combination of two techniques to translate indirect c3 are the corresponding code cache addresses. branches. First, each indirect branch instruction is associated with an indirect branch chain. When the indirect branch is first encountered, the control is transferred to to analyze indirect branches. Our pintool measures the the Pin VMM which generates the target trace. Pin also number of trace traversals in the indirect branch chain and generates a compare and jump block and places it at the also the number of hash table accesses. We have modified head of the generated trace. This compare and jump block 64-bit Pin 2.8-33543 to implement our MRU algorithm. compares the current indirect branch target with the target All experiments were repeated for three iterations and the for which the trace was generated.

Leveraging Indirect Branch Locality in Dynamic Binary Translators

2. Instruction Set Architecture

Exploiting Branch Target Injection Jann Horn, Google Project Zero

Simty: Generalized SIMT Execution on RISC-V Caroline Collange

Ep 2182433 A1

Digital Design & Computer Arch. Lecture 17: Branch Prediction II

Intel Itanium 2 Processor Reference Manual

Understanding and Predicting Indirect Branch Behavior

Design of Digital Circuits Lecture 19: Branch Prediction II, VLIW, Fine-Grained Multithreading

Instruction Set Define Computer

AMD Architecture Guidelines for Indirect Branch Control

Predicated Code

SOFTWARE TECHNIQUES for MANAGING SPECULATION on AMD PROCESSORS 2 to Mitigate the Above Described Variants, There Are a Variety of Possible Techniques Software Can Use