Journal of Systems Architecture 56 (2010) 500–508


CoDBT: A multi-source dynamic binary translator using hardware–software collaborative techniques

Haibing Guan, Bo Liu *, Zhengwei Qi, Yindong Yang, Hongbo Yang, Alei Liang

Shanghai Jiao Tong University, Shanghai 200240, China

Article history: Received 20 September 2009; Received in revised form 24 June 2010; Accepted 24 July 2010; Available online 5 August 2010

Keywords: Dynamic binary translation; Hardware/software collaboration; Multi-source; Runtime profiling

Abstract

Traditional software-based implementations of dynamic binary translation suffer from significant runtime overhead and are poorly suited to additional complex optimization. This paper proposes using hardware–software collaboration techniques to create a highly efficient dynamic binary translation system, CoDBT, which emulates several heterogeneous ISAs (Instruction Set Architectures) on a host processor without changes to the existing processor. We analyze the major performance bottlenecks by evaluating the overhead of a pure-software DBT. Guidelines are provided for applying a suitable hardware–software partition process to CoDBT, as are algorithms for designing the hardware-based binary translator and code cache management. An intermediate instruction set is introduced to make multi-source translation more practicable and scalable. Meanwhile, a novel runtime profiling strategy is integrated into the infrastructure to collect program hot-spot information to support potential future optimizations. The advantages of using co-design as an implementation approach for a DBT system are assessed with several SPEC benchmarks. Our results demonstrate that significant performance improvements can be achieved with appropriate hardware support choices. CoDBT could be an efficient and cost-effective solution for situations where the usual methods of performance acceleration for dynamic binary translation are inappropriate.

© 2010 Elsevier B.V. All rights reserved.

* Corresponding author. Tel.: +86 21 34205581.
E-mail addresses: [email protected] (H. Guan), [email protected] (B. Liu), [email protected] (Z. Qi), [email protected] (Y. Yang), yanghongbo819@sjtu.edu.cn (H. Yang), [email protected] (A. Liang).

1. Introduction

Dynamic binary translation (DBT) has many attractive applications in computer system design. For instance, it can be used to support legacy binary code [1], support ISA virtualization [2], enable innovative co-designed micro-architectures [3], and many others [4–9]. An adaptable DBT system can also profile program runtime behavior and optimize blocks of frequently executed instructions [18]. However, DBT technology also comes with costs: translation overhead, emulation overhead and potentially other runtime overheads. A key consideration in designing a DBT system is the overhead resulting from translation time; any time spent translating is time not spent executing the source program. Reducing this overhead is currently an active research topic that yields insights for designing systems featuring binary translation.

Recent research has focused on various optimization algorithms and hardware acceleration methods to reduce the overheads of specific parts of the binary translation path, such as hardware support for control transfers [15], source-target binary code memory management [12] and profile information collection [22]. We observed that these hardware support strategies always offer performance advantages over existing software solutions, by identifying a particular software overhead and replacing it with hardware. However, there have been few successful systems that are oriented toward the system as a whole. In such a system the designers may render some functions in hardware and some in software, according to the product design goals and constraints. Different goals and constraints in future products may result in different hardware–software partitioning. The best known whole-system DBT based on collaborative techniques is Crusoe [3]. Compared to prior work, such systems include a number of advanced features to reduce a series of overheads and achieve high performance. Unfortunately, these systems were designed to implement a specific type of dynamic translator for a specific architecture with a modified micro-architecture, and cannot satisfy the need for multi-source support and flexibility.

In this paper, as an alternative, we present CoDBT (hardware–software Collaborative Dynamic Binary Translator). CoDBT employs hardware support, which is attractive because it allows greater flexibility for realizing hardware design innovations and can offset software overheads. Its purpose is to achieve higher DBT performance than pure software solutions. In addition, defining a set of intermediate representations makes it possible and convenient to support multi-source instructions on a single physical platform. From the viewpoint of design, CoDBT is composed
of a main software partition, which runs on the target processor, and several hardware partitions that support special functions within the DBT workflow. The hardware partition executes binary translation, which is one part of the DBT tasks, and is designed to run in conjunction with a host processor. The hardware partition is implemented in an FPGA chip in which a PowerPC processor core is embedded, and communicates with the software partition through shared memory and the on-chip bus. The software partition is a complete process-level DBT program that leaves some tasks to be executed by hardware. The overlap in functionality between the hardware and software partitions therefore allows more flexibility in deciding which functions should run in which partition, and provides the opportunity to execute in hardware and software in parallel [8]. Meanwhile, given its multi-source feature, CoDBT is able to emulate several heterogeneous ISAs (e.g. IA-32, MIPS, PowerPC).

1383-7621/$ - see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.sysarc.2010.07.008

CoDBT focuses on creating a collaborative DBT framework that is efficient and systematic enough to be worth using in almost every DBT system. To this end, we make the following contributions:

1. Co-design framework for generic dynamic binary translation system
One of our goals is to provide a whole DBT infrastructure based on hardware–software collaborative techniques. It is an adaptive translator that uses a simple basic block translator for initial code emulation. Initial results indicate that basic block translation overhead is the major component of startup overhead, and that hot-spot optimization overhead can further exacerbate execution delays. We therefore propose two hardware mechanisms for reducing the overhead. The first of these hardware assists is a hardware translation module targeted at basic block translation, and the second is a special TCache management unit. Using these two mechanisms can reduce the translation time and improve translated code management. Through experiment, we demonstrate that with basic hardware support, CoDBT can provide competitive translation performance without changes to the host processor.

2. Intermediate representation for multi-source
We designed CoDBT to adapt easily and inexpensively to changes in multiple source machines, including translations from RISC and CISC machines. When compilers and other tools support multiple target machines, researchers and developers call them re-targetable. By extension, we call a binary translator re-sourceable if it can emulate ISAs from multiple source machines at low cost. Defining a set of intermediate representations makes it easier for CoDBT to support multiple sources and to be extended. The intermediate representation is a set of machine-independent instructions. The translator first converts the source machine code into intermediate code and then, after some optimizations if necessary, translates the intermediate code into target machine code. A key component of the translator, which could be extended to support other ISAs based on the intermediate representation, is the lifting of abstraction from the machine-dependent source instructions to the machine-independent code.

3. Co-design runtime profiling
Profiling is valuable for identifying program hot spots and guiding future optimizations [13]. In traditional systems, software alone is used to gather program behavior information through binary instrumentation. Recently, hardware support for generating profiles at runtime has been an area of active research. Such hardware-only profilers achieve low-overhead profiling, but with a noticeable error rate. We present a novel profiler based on the CoDBT infrastructure that collects profile information rapidly and accurately with minimal runtime overhead. The proposed approach consists of a reduced instrumented instruction in the software partition, and profiling hardware for updating counters and collecting sufficient data to provide a foundation for future optimizations.

The remainder of this paper is structured as follows. Section 2 describes the architecture of CoDBT. Section 3 discusses the implementation of CoDBT in more detail. Section 4 evaluates the system performance. Section 5 discusses related work, and Section 6 summarizes our findings.

2. CoDBT

2.1. Overview

CoDBT is a hardware–software collaborative dynamic binary translator. The system is composed of three components: (1) the hardware accelerator, (2) the DBT software application, and (3) the control program that passes data between the other two components. In the CoDBT experimental platform, a general processor is closely coupled with the hardware accelerator, providing a complete environment within a single FPGA chip. Fig. 1 shows the components of the collaborative framework. The hardware partition executes a subset of the dynamic binary translation workflow and is designed to run in conjunction with a host processor. The hardware partition is implemented in the hardware description language Verilog, and communicates with the software partition through the on-chip bus. The overlap in functionality between the hardware and software allows more flexibility in deciding which functions should run in which partition.

[Figure: a CPU running the software execution engine, a hardware accelerator, and a memory controller with DDR memory, connected by an on-chip bus.]
Fig. 1. SoC hardware and software collaborative frame design.

2.2. Dynamic binary translation function features

There are three sub-problems that must be solved in determining the hardware–software partition of CoDBT: (1) Functional clustering: cluster the system functionality into a set of tasks; (2) Allocation: allocate the tasks to either hardware or software; (3) Scheduling: schedule the allocated tasks to determine timing correctness of the partitioned system. These problems are interdependent and must be solved simultaneously to determine an optimal solution. According to Amdahl's Law, selecting hardware regions based on their percentage of execution time yields the largest potential speedup. In this design we are mainly concerned with obtaining the maximum performance speedup. Therefore, we used a partitioning technique that uses profiling results to sort the functions of the application.

When implementing binary translation, we use the basic block as the translation unit. Fig. 2 shows the entire DBT process. The main processor first uses a Source Program Counter (SPC) to look up the corresponding Target Program Counter (TPC) in the SPC–TPC map table, which is stored in memory and cached much like a TLB. If the corresponding SPC value exists in the map table, the basic block has already been translated from the source ISA form to the target form. The target code is stored in the Target Code Cache (TCache), so the processor can jump directly to the TPC to execute the translated code. Otherwise, if the lookup results in a code cache miss, the processor translates the source code to target code and then places it in the TCache.

Fig. 2. Generic architecture for the function components in DBT system.

2.3. Hardware/software partition

In the DBT process, execution of the target code and translation of the source code are independent. Whenever control switches between the execution state and the translation state, a context switch happens. After the basic block has been translated, a context switch restores the application context and begins executing cached translated instructions on the host CPU. Thus the context switch becomes "the common case". The processor needs to save context information whenever it has to execute a translation task. This is one of the most expensive overheads, since copying data between registers and memory involves a large speed and power penalty. In a similar manner, if the results of execution are read back from hardware, the entire frame does not need to be copied back. Specifically, the method body does not need to be copied back as it never changes during execution in hardware. With the accelerator integrated, context switch time can be eliminated. Besides context switching, TCache lookup and instruction translation overheads are considered. The overall DBT running time $T_{sys}$ is computed as:

$$T_{sys} = N_{block} \times T_{block} \qquad (1)$$

We now formulate the overhead of basic block translation in the DBT system as follows:

$$T_{block} = (T_{lookup} + T_{context\,switch}) + T_{execute} + R_{miss} \times T_{translate} \qquad (2)$$

where $T_{lookup}$ is the overhead of looking up the SPC–TPC map table; $T_{context\,switch}$ is the overhead of context switching; $T_{execute}$ is the time spent executing translated code; $R_{miss}$ is the miss rate of SPC–TPC map table lookups in the TCache; and $T_{translate}$ is the cost of translating a source basic block to a target one.

The $R_{miss}$ parameter represents the miss rate of the TCache system. It is commonly determined by the cache size and the replacement strategy. The foremost factor is the cache size. Commonly, when the text section size of the source code does not exceed the TCache size, or is not much larger, the miss rate is quite low and consequently leads to high performance. Otherwise, TCache replacement occurs frequently.

Generally, if $R_{miss}$ is low enough, the whole system processing time is still dominated by $T_{execute}$. Beyond that point, the system execution time is not dominated by a single factor, and any stage may harm or benefit $T_{sys}$. Finally, when $R_{miss}$ reaches a certain level, the miss penalty becomes the dominant term. From the DBT workflow and the analysis above, we conclude that there are several main bottlenecks in a DBT system:

(1) The first bottleneck is TCache management. The corresponding overhead is represented by $T_{lookup}$ in Eq. (2). When a basic block is loaded, the processor must first check whether the basic block has been translated before. The lookup of the SPC–TPC map table is time-consuming and frequent, so implementing it in software is inefficient.
(2) The second bottleneck is the context switch between execution and translation. The DBT translation software has a different context from the emulated source program. For most basic blocks the processor must save and restore the context information. Given that a basic block has about 10 instructions on average, context switching is a common case. The cost can even exceed the execution time, especially when the processor is a RISC type with many registers.
(3) The third bottleneck is the translation cost ($T_{translate}$ in Eq. (2)). Because it takes hundreds of instructions to translate a basic block in the miss case, performance fluctuates markedly between the hit and miss cases. This bottleneck affects the startup time and the TCache miss penalty.
(4) An additional expense is the extra profile collection cost for providing an optimization foundation. Based on the hardware/software infrastructure, the translator module inserts only one instrumentation instruction into each basic block as a profiling request, and the profile hardware performs the counting and data-update operations asynchronously in available free slots. With hardware support, program execution with profiling runs with near-zero overhead.

Considering the above characteristics of the DBT system, on which the partitioning strategy can focus to achieve a suitable initial partitioning, we employ hardware to implement the binary translation function, TCache management, and profiling. The framework is shown in Fig. 3. There are three main advantages for performance improvement. (1) Because the target processor need not switch to translating source instructions, the context switch time can be eliminated. (2) Application-specific hardware replaces the software translation process, so the startup time can be reduced and the miss penalty can be improved. (3) Because the hardware translation unit translates basic blocks before the processor executes the corresponding instructions, the miss rate of the TCache can be reduced significantly.

[Figure: the software side submits an SPC; on a hit the hardware returns the TPC of the target code, and on a miss it returns the translated block.]
Fig. 3. Hardware–software partition design for DBT system. Proposed hardware support additions are shown on the right.

Table 1
The most commonly used instructions of IR.

Type                      Instructions
Register state mapping    get put
Memory access             ld st
Data transfer             mov li
Arithmetic/logic          add sub and not xor or mul div sll srl sra cmp sext zext
Control transfer          jmp branch
Special                   syscall call halt

3. Implementation

This section describes the implementation and interaction of CoDBT's various modules. It discusses the challenges of a hardware–software collaborative DBT system and how CoDBT's design addresses them. Based on the discussion in Section 2, a subset of the DBT is implemented in hardware. We employ the following approaches to alleviate the common cases.

3.1. Part 1: multi-source dynamic translation model

The full spectrum of potential DBT applications motivated a system design that supports multiple source architectures. The hardware translator is built of re-sourceable components that manipulate the source binary code in terms of intermediate representations (IR). The advantage of lifting source instruction code to IR is that the code becomes machine-independent, which allows us to decouple the translator's back end from the source machine. As shown in Fig. 4, the main steps in the translation process are decoding the source instructions, translating the instructions into IR, analyzing the IR for optimizations specific to binary translation, translating the IR to target instructions, and finally encoding the output binary file. After translation, a running target program maintains an image of the data that would have been stored in the source machine. If the source and target architectures have different endianness, translation to target instructions may require swapping bytes at each source load and store instruction.

The IR used in our infrastructure is a low-level, MIPS-like instruction set with some tradeoffs made to fit different machine language features. In essence, the IR defines a general-purpose register architecture of a virtual layer, consisting of an unbounded number of 32-bit virtual registers (v0–vn). It defines RISC-style load/store instructions to access memory, and the only addressing mode supported is displacement. The IR comprises six kinds of basic instructions to match the popular ISAs: arithmetic/logical, control transfer, data transfer, memory access, register state mapping and special instructions, as shown in Table 1.

The most difficult tradeoff we made during the design of the intermediate language was whether or not to use condition codes to express conditional jump semantics. Previous studies show that although condition codes reduce the number of explicit COMPARE and TEST instructions, they prevent aggressive scheduling by the compiler because condition codes add implicit dependences between instructions. IR instructions are expected to be easy to analyze and transform, so instruction dependences should be reduced as much as possible, and any that remain should be explicit, so as to reduce the complexity of code analysis and minimize the runtime overhead.

IR is a temporary state for translation to target code. It is visible outside the hardware partition, but non-. When CoDBT first encounters a block of source code, it translates the block to IR, stores the IR block in a local memory region, and then encodes the IR block into target architecture code. The hardware translator can operate on the IR code for further optimization if desired. In CoDBT, the IR offers better opportunities for adapting easily to support multi-source decoding. However, the design of the IR affects the quality of the generated target code and the efficiency of CoDBT, due to code bloat. Therefore, we balance performance against cost when designing the IR. On one hand, the IR is kept as simple as possible so as to reduce the cost of translation; on the other hand, the semantics of the IR should be rich enough to support the various characteristics of different ISAs. Intermediate instructions bring many benefits. First, adding other source ends and target ends becomes easier. Second, the workload and complexity of development are reduced, because intermediate instructions make the complexity of translating n kinds of source ISAs to n kinds of target ISAs drop from O(n * n) to nearly O(n) theoretically. Third, intermediate instruction blocks offer better opportunities for optimization and thus make optimization easier.

CoDBT runs in user mode in order to run processes compiled for one CPU on another. At the CPU level, CoDBT assumes that the user memory mappings are handled by the host OS. CoDBT includes a generic system call converter to handle endianness issues and 32/64 bit conversions. In addition, in CoDBT the system calls are handled by both a system call handler and system call interposition. Once a system call is encountered during the execution of a translated block, the execution is halted and control is transferred to the system call interposition, which is in charge of checking the system call. If the behavior of the system call is malicious, the executing process is terminated; otherwise, control is transferred to the system call handler, which is responsible for running the system call on the target CPU. Because parameter passing differs among ISAs, the system call handler has to convert the system call from the source platform's convention to the target's before execution.
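The IR design choices described in this section (an unbounded set of 32-bit virtual registers and an explicit compare instead of condition codes) can be illustrated with a small interpreter. This is only a sketch under stated assumptions: the mnemonics `li`, `add`, `sub`, `cmp`, `branch`, `jmp` and `halt` come from Table 1, but the operand formats, the meaning chosen here for `cmp` (write 1 to a virtual register when the operands are equal) and the function name `run_ir` are hypothetical, since the paper does not specify them.

```python
MASK32 = 0xFFFFFFFF  # virtual registers are 32 bits wide


def run_ir(program, max_steps=10_000):
    """Interpret a list of IR tuples and return the virtual register file.

    Operand layout is an assumption made for this sketch only.
    """
    regs = {}  # unbounded virtual registers v0..vn, created on first write
    pc = 0
    for _ in range(max_steps):
        op, *args = program[pc]
        pc += 1
        if op == "li":        # li vd, imm
            regs[args[0]] = args[1] & MASK32
        elif op == "add":     # add vd, va, vb
            regs[args[0]] = (regs[args[1]] + regs[args[2]]) & MASK32
        elif op == "sub":     # sub vd, va, vb
            regs[args[0]] = (regs[args[1]] - regs[args[2]]) & MASK32
        elif op == "cmp":     # cmp vd, va, vb: vd = 1 if va == vb else 0
            regs[args[0]] = int(regs[args[1]] == regs[args[2]])
        elif op == "branch":  # branch vc, target: taken when vc != 0
            if regs[args[0]] != 0:
                pc = args[1]
        elif op == "jmp":     # jmp target
            pc = args[0]
        elif op == "halt":
            return regs
        else:
            raise ValueError(f"unknown IR opcode: {op}")
    raise RuntimeError("step limit exceeded")


# Sum 5 + 4 + 3 + 2 + 1. The loop exit depends on an explicit compare
# result held in v4, not on a hidden condition-code flag.
SUM_TO_FIVE = [
    ("li", "v0", 0),            # 0: accumulator
    ("li", "v1", 5),            # 1: loop counter
    ("li", "v2", 1),            # 2: constant one
    ("li", "v3", 0),            # 3: constant zero
    ("cmp", "v4", "v1", "v3"),  # 4: v4 = (counter == 0)
    ("branch", "v4", 9),        # 5: leave the loop when the counter is zero
    ("add", "v0", "v0", "v1"),  # 6: accumulator += counter
    ("sub", "v1", "v1", "v2"),  # 7: counter -= 1
    ("jmp", 4),                 # 8: back to the compare
    ("halt",),                  # 9
]

result = run_ir(SUM_TO_FIVE)
print(result["v0"])
```

Because the branch reads `v4` directly, the dependence between the compare and the branch is visible in the instruction operands, which is the property the text argues makes IR blocks easier to analyze and reorder.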

[Figure: source binary streams (MIPS, PowerPC, IA-32) pass through Source-to-IR translation, optional IR-level optimizations, and IR-to-Target translation to produce the target binary stream.]
Fig. 4. Core translation process for a multi-source binary translator.
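The decoupling shown in Fig. 4 can be sketched as several front ends feeding one shared back end, so that supporting a new source ISA means adding one front end rather than a complete translator. The code below is a toy illustration, not the CoDBT hardware translator: it handles a single register-to-register add per ISA, and the function names, the `v_` virtual register naming and the first-use register assignment starting at `r3` are all hypothetical choices made for this example.

```python
def lift_mips_like(text):
    """Toy front end: 'addu $t0, $t1, $t2' becomes one three-operand IR add."""
    mnemonic, operands = text.split(None, 1)
    if mnemonic != "addu":
        raise ValueError(f"unsupported in this sketch: {mnemonic}")
    rd, rs, rt = (reg.strip().lstrip("$") for reg in operands.split(","))
    return [("add", f"v_{rd}", f"v_{rs}", f"v_{rt}")]


def lift_ia32_like(text):
    """Toy front end: 'add eax, ebx' (destination is also a source)."""
    mnemonic, operands = text.split(None, 1)
    if mnemonic != "add":
        raise ValueError(f"unsupported in this sketch: {mnemonic}")
    dst, src = (reg.strip() for reg in operands.split(","))
    return [("add", f"v_{dst}", f"v_{dst}", f"v_{src}")]


def emit_powerpc_like(ir_block):
    """Shared back end: map virtual registers to target registers on first use."""
    reg_of = {}

    def target_reg(vreg):
        if vreg not in reg_of:
            reg_of[vreg] = f"r{3 + len(reg_of)}"  # hypothetical allocation
        return reg_of[vreg]

    lines = []
    for op, dst, src_a, src_b in ir_block:
        lines.append(
            f"{op} {target_reg(dst)}, {target_reg(src_a)}, {target_reg(src_b)}"
        )
    return lines


# One entry per source ISA; the back end is never duplicated.
FRONT_ENDS = {"mips": lift_mips_like, "ia32": lift_ia32_like}


def translate(isa, text):
    """Source text -> IR -> target text, following the stages of Fig. 4."""
    return emit_powerpc_like(FRONT_ENDS[isa](text))


print(translate("mips", "addu $t0, $t1, $t2"))
print(translate("ia32", "add eax, ebx"))
```

With n front ends and m back ends arranged this way, n + m components cover all n × m source/target pairs, which is the reduction in development effort that Section 3.1 attributes to the intermediate instructions.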

3.2. Part 2: code cache management

Following the partition of Section 2.2, we employ hardware to implement TCache (Target Code Cache) management. The TCache manager looks up translated basic blocks in the TCache, where translated basic blocks are stored, and maintains an SPC–TPC map table. It is similar to a TLB, which implements the mapping from virtual addresses to physical addresses. In this way, the lookup overhead is reduced greatly. In terms of code size, three to five instructions implement the TCache lookup in each basic block. Once a basic block has been executed, the TCache manager needs to look up the TPC value through the SPC–TPC map to find the next basic block. The workflow of a TCache lookup is as follows. First, get the SPC of the next basic block and check whether its corresponding TPC exists in the SPC–TPC map table. If the TPC exists, which means that the basic block has already been translated and cached, jump to that TPC and execute. Otherwise, the hardware translator decodes the source block instructions, builds the translated target block, and delivers the TPC (see Fig. 5).

[Figure: the SPC is hashed to index a table of (SPC, TPC, Count) entries; the stored SPC is compared with the incoming SPC to signal a hit or a miss.]
Fig. 5. SPC–TPC mapping by hardware.

3.3. Part 3: runtime profiling

We allocate a certain memory area to the Counter Map Table in CoDBT, and add only one Store instruction (a target-architecture instruction) to each basic block that needs to be profiled. The register field of the Store instruction is a register designated by CoDBT, and the content of the designated register is the first SPC value of the profiled basic block. The memory address field of the Store instruction is the entry address of the FIFO buffer. The Store instruction triggers the FIFO buffer operation, and then the dedicated profile module performs the counter update operations. The correctness of the interaction between hardware and software is guaranteed by the SPC value of every basic block. In addition, the CoDBT software partition is in charge of assigning the allocated memory and setting some necessary auxiliary information. In the course of profiling, a Store-type instruction is inserted into every selected block. The instruction is used to pass the basic block SPC to the FIFO buffer. Prior to the transfer of those SPC values, the base address of the Counter Map Table needs to be passed to the dedicated profile module with a Store instruction; this operation is performed only once. With this method, at least a Load instruction and an Increment instruction are saved for every profiled block compared to a typical software-based solution. Therefore, it can reduce instrumentation overhead significantly.

The profile module updates the counters in the local memory. First, it gets an SPC value from the FIFO buffer and puts it into register r1. Second, it loads the SPC field of the counter map table into register r2 through a hash function. We use the least significant 16 bits of every SPC in the FIFO buffer as the value returned by the hash function, and the address of the SPC entry is the sum of the counter map table base address and the hash value. Note that the value of register r2 may be dirty. Third, it compares the contents of registers r1 and r2 to decide which operation to take. Two situations need to be discussed separately. When no hash collision occurs, that is, the least significant 16 bits of register r1 equal those of register r2, the SPC value is written into the SPC field of the counter map table and the counter value in the local memory is updated. When a hash collision occurs, the SPC value is written into the hash table using the open addressing method. Moreover, the difference between the SPC values of two adjacent basic blocks is not trivial. Therefore, the open addressing method can deal with hash collisions efficiently (see Fig. 6).

Fig. 6. Dedicated profiling hardware unit.

4. Evaluation

4.1. Experiment platform

The experimental platform uses a general-purpose PowerPC processor combined with hardware acceleration units, implemented in an FPGA, on which the hardware partition of the CoDBT system runs. The hardware partition communicates with the processor and memory through a 32/64-bit on-chip bus, and uses its local cache memory independently. The software partition includes the parts of the DBT functions, while the hardware partition duplicates a selected subset of the DBT system. The communication overhead can be reduced greatly with the help of the system bus, which also brings the CPU core and hardware accelerator closer together. Using a shared memory region in the design can also reduce the amount of data that must be copied between hardware units [23].

Hardware acceleration units are pre-designed modules used in FPGAs. On-chip buses are used to interconnect the modules within the core and for user-defined logic within the gate array. In our work, the CoreConnect bus [24], licensed by IBM Corp., was chosen because it is a high-performance bus that provides a standard interface between the processor cores and integrated bus controllers, so that a library of processor cores and bus controllers can be developed in our system designs.

Fig. 7. The relationship between TCache miss rate and TCache size in CoDBT.

4.2. Evaluation

To stress CoDBT performance and the transient phases of binary translation with hardware support, we ran the SPEC 2006 suite using reference inputs. For the studies of TCache management efficiency, we measure the TCache miss rate for TCache sizes from 16k to 2M. For the studies of profiling overhead, we compare the runtime performance of CoDBT with hardware-supported profiling against that without profiling. All evaluations measure wall-clock time, to show how CoDBT reduces specific runtime overheads.

We first show how the proposed hardware translator speeds up runtime translation, by comparing the mean time to translate a basic block with that of the software-only solution, and how the TCache management unit speeds up lookup relative to conventional software-based lookup. Then we analyze the performance of the hardware profiling assist integrated into CoDBT.

We compare TCache lookup and translation time between hardware acceleration and the software-only solution. As shown in Fig. 9, with hardware acceleration a lookup takes about twenty cycles on average, and translating a basic block takes fewer than four hundred cycles. The speedup is nearly 8 times. The main reason for the significant improvement of the lookup operation is the elimination of context switching between binary execution and translation, which is an inevitable overhead of a software-only DBT. In the meantime, the basic execution time in the TCache hit and miss cases is evaluated and compared between hardware acceleration and the software-only solution. Figs. 8 and 9 also show the performance improvement. In addition, the performance of CoDBT does not fluctuate markedly between the hit and miss cases. This characteristic is important to the system's real-time ability and startup performance.

Fig. 8. Percentage of TCache lookup speedup rate with various TCache sizes.

Fig. 9. Performance improvement of the TCache lookup and translator unit in CoDBT compared with the software-only solution. The third and fourth columns show the average cost when the TCache lookup hits or misses (shorter is better).

Figs. 7 and 8 show how the TCache size affects CoDBT performance. The statistics are gathered while running the SPEC benchmarks mcf, gzip and parser. Generally, when the cache is smaller, the chance of replacement rises and the miss rate of CoDBT grows. There is almost no cache miss when the TCache size is between 1M and 2M. Because the miss rate of the program is extremely low when the TCache is big enough, the translation unit of the hardware accelerator is idle most of the time. In the meantime, Fig. 9 shows the performance comparison between CoDBT TCache management and the pure software-based solution. It is clear from the figure that, in most cases, the hardware TCache management assist reduces the lookup overhead and enables CoDBT to translate source blocks or execute the next TPC basic block.

By providing an overlap in support between the hardware and software partitions, the context switch is no longer necessary for the DBT system. In a traditional DBT system, scheduling decisions occur at runtime and each function scheduling needs a context switch, which is a vital bottleneck, as demonstrated in Section 2.2. The cost of a context switch from one unit to another can be high because of the penalty of transferring the context data. With our new approach, since the partitioning scheme provides overlapping functionality between hardware and software, the execution flow is more flexible and more parallel. The context-switch overhead is eliminated because translation is undertaken by the additional hardware translator. From Fig. 10, we can see that CoDBT consistently performs in an acceptable range. For almost all benchmarks, the results stayed below two times relative to native execution when translating MIPS to PowerPC.

As the table in Fig. 11 shows, we compare the overhead of our hardware-supported profiling technique with that of CoDBT without profiling. The second column of the table gives the overall wall-clock cost without profiling, and the third gives the cost with hardware profiling.
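As a rough illustration of why per-block counters cost something at run time, the following Python sketch models the two counter-update schemes compared in this section: software instrumentation that spends a Load, an Increment and a Store per update, and a hardware-assisted scheme in which software issues a single Store and a profile module performs the increment. The class names, the `cpu_instructions` tally and the block addresses are hypothetical and are not taken from the CoDBT implementation; the model only counts instructions and says nothing about cycle costs.

```python
from collections import defaultdict


class SoftwareProfiler:
    """Instrumentation-style profiling: translated code updates the counter itself."""

    def __init__(self):
        self.counters = defaultdict(int)   # per-block execution counters in memory
        self.cpu_instructions = 0          # instructions the CPU spends on profiling

    def on_block_executed(self, block_id):
        value = self.counters[block_id]    # Load
        value += 1                         # Increment
        self.counters[block_id] = value    # Store
        self.cpu_instructions += 3


class HardwareProfileModule:
    """Stand-in for a hardware profile module that increments counters on its own."""

    def __init__(self):
        self.counters = defaultdict(int)

    def write(self, block_id):
        # Triggered by a store to the module; the increment happens off the CPU.
        self.counters[block_id] += 1


class HardwareAssistedProfiler:
    """Software side of the assisted scheme: one Store per executed block."""

    def __init__(self, module):
        self.module = module
        self.cpu_instructions = 0

    def on_block_executed(self, block_id):
        self.module.write(block_id)        # single Store issued by software
        self.cpu_instructions += 1


def run(profiler, trace):
    """Feed a trace of executed block addresses to a profiler, return its CPU cost."""
    for block_id in trace:
        profiler.on_block_executed(block_id)
    return profiler.cpu_instructions


# Hypothetical sequence of executed basic-block addresses.
trace = [0x400, 0x420, 0x400, 0x400, 0x440]

sw = SoftwareProfiler()
hw = HardwareAssistedProfiler(HardwareProfileModule())
sw_cost = run(sw, trace)
hw_cost = run(hw, trace)
print(sw_cost, hw_cost)
```

On this five-block trace the software scheme spends 15 instructions on profiling and the hardware-assisted scheme 5, while both end with identical counters.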

The last column of the table in Fig. 11 gives the percentage overhead of hardware profiling. It is evident from the table that adding the profile hardware raises the DBT execution time only slightly. To stress the profiling overhead, the data is again collected from SPEC 2006. As shown in Fig. 9, for almost every benchmark the execution time with hardware profiling is no more than 2.9% longer than the execution time without profiling.

Consequently, we consider how the CoDBT profiling overhead is reduced with the assistance of the hardware profile module. Instrumentation in a traditional software solution needs at least three instructions to update a counter: a Load, an Increment and a Store. In our hardware-supported profiling technique, the software part of CoDBT issues only one Store instruction, and a hardware profile module completes the counter update. The experiment in Fig. 11 shows that the slight overhead of updating counters does not affect the correctness or the overall performance of the whole system. Thus, our method saves at least one Load instruction and one Increment instruction per counter update.

Fig. 10. SPEC 2006 suite (reference inputs): performance of CoDBT translating from the MIPS ISA to PowerPC, compared with native execution. The chart at the top shows the overall CoDBT overhead normalized to native performance; the table at the bottom gives the wall-clock running time of each benchmark.

Fig. 11. Hardware profiling overhead in CoDBT.

5. Related work

Many popular dynamic binary translators exist, such as StarDBT [16], QEMU [17], and FX!32 [18]. These binary translators can be divided into two types in terms of translation pattern. The first type translates a single ISA into another specific ISA, like StarDBT (from IA32 to IA32) and FX!32 (from x86 to Alpha); in addition, the current x86 architecture already implements a hardware translation mechanism that translates x86 instructions into internal micro-operations. The second type is designed to be re-sourceable and re-targetable, to support translation between various instruction sets; examples are Strata [19], Walkabout [25], and UQDBT [20]. Typically, the UQDBT translator uses specifications to describe the guest and host architectures at various levels of abstraction, and completes binary translation dynamically by going through several intermediate representations. The significant advantage of re-sourceability and re-targetability is that the translator can be applied to a variety of hardware platforms transparently. However, binary translators of the latter type do not perform very well because of the high software overhead.

Improving the performance of DBT is a major research topic. The techniques span all aspects, including creation of more efficient source code, compiler optimization techniques and replacing software with hardware. A variety of techniques for boosting the performance of the Java virtual machine exist [10,11]. Researchers at the University of Texas at Austin propose using hardware support to perform efficient Java translation coupled with a light-weight run-time environment [12]. The additional hardware translates Java byte-codes to native code, thus eliminating much of the overhead of software translation. The proposed technique is extremely effective for short-running client workloads.

From the Crusoe [3] project, Transmeta's Crusoe VLIW processor and CMS [14] present an approach unique among commercial architectures: a microprocessor system with an internal VLIW ISA that bears little resemblance to the external ISA it presents to users. This approach allows a simple, compact, low-power microprocessor implementation; however, the internal ISA is modified between generations. From [21], a Common Language Infrastructure based upon a virtual machine requires that all instructions being executed on the CLI be translated to native machine instructions before they can be executed on the host processor. However, both of those need a whole re-design of the architecture and focus on a specific ISA.

There are also other cases of hardware acceleration to improve binary execution performance, for example hardware support for control transfers in code caches [15]. The authors studied architectural support for efficient control transfers among superblocks held in a code cache. This support takes the form of a few new instructions and some underlying hardware structures. However, the new instructions also change the existing instruction set.

In contrast to much of the previous work, CoDBT pursues a whole hardware acceleration framework for DBT. Our goal was to provide a full co-design DBT system without changing the existing architecture. The hardware–software partitioning strategy focuses on providing performance improvements. Introducing hardware acceleration makes it possible to speed up a diverse range of performance bottlenecks and to gather profile information without noticeable runtime overhead.

6. Conclusion and future work

CoDBT is a hardware/software co-designed dynamic binary translator. It combines strengths of both hardware and software to implement dynamic binary translation, and makes a PowerPC CPU fully compatible with several existing source architectures without changing the host architecture. CoDBT achieves this with an appropriate hardware–software partition. Our approach could be important in the future for building an ultimate open system, in which a single hardware chip can run multiple ISAs of other architectures. It uses a variety of techniques to deliver the high performance needed for useful evaluations, including hardware support for eliminating context switches, TCache lookup with an SPC–TPC map table, and runtime profiling operations for potential optimization. Other key features of CoDBT are its flexible and extensible architecture modeling and its intermediate instruction set definition. Our results demonstrate that CoDBT delivers high performance and achieves slowdowns as little as 1.5x over native execution.

The results to date are promising, and further research is continuing on extending and improving the work. Adding support for the remaining ISAs, such as SPARC and ARM, will be one of the main focuses of future work. In the meantime, basic block linking, an important and efficient optimization technique for just-in-time compilation, and a series of optimizations based on runtime profiling will be added to bring further performance improvements.

Acknowledgements

We thank the anonymous reviewers for their valuable comments and suggestions. This work was supported by the National Natural Science Foundation of China (Grant Nos. 60773093, 60873209, 60970107, 60970108), the Key Program for Basic Research of Shanghai (Grant No. 08JC1411800), the Ministry of Education and Joint Research Foundation (Grant No. MOE-INTEL-08-11), the Science and Technology Commission of Shanghai Municipality (09510701600), IBM SUR Funding and CRL JP Funding.

References

[1] L. Baraz, T. Devor, O. Etzion, S. Goldenberg, A. Skaletsky, Y. Wang, Y. Zemach, IA-32 Execution Layer: a two-phase dynamic translator designed to support IA-32 applications on Itanium-based systems, IEEE Micro (2003) 191–204.
[2] K. Adams, O. Agesen, A comparison of software and hardware techniques for x86 virtualization, in: Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006, pp. 2–13.
[3] A.C. Klaiber, The technology behind Crusoe processors, Transmeta Technical Brief, 2000.
[4] V. Bala, E. Duesterwald, S. Banerjia, Dynamo: a transparent dynamic optimization system, in: Proceedings of the International Conference on Programming Language Design and Implementation (PLDI), 2000, pp. 1–12.
[5] E. Borin, C. Wang, Y. Wu, G. Araujo, Software-based transparent and comprehensive control-flow error detection, in: Proceedings of the International Conference on Code Generation and Optimization (CGO), 2006, pp. 333–345.
[6] A. Coronato, G.D. Pietro, L. Gallo, An agent based platform for task distribution in virtual environments, Journal of Systems Architecture: Embedded Systems Design 54 (9) (2008) 877–882.
[7] K.B. Kent, M. Serra, R.N. Horspool, Hardware/software co-design for virtual machines, in: IEE Proceedings Computers and Digital Techniques, vol. 152 (5), September 2005, pp. 537–548.
[8] F. Qin, C. Wang, Z. Li, H. Kim, Y. Zhou, Y. Wu, LIFT: a low-overhead practical information flow tracking system for detecting security attacks, in: Proceedings of the International Symposium on Microarchitecture (MICRO), 2006, pp. 135–148.
[9] Q. Wu, M. Martonosi, D.W. Clark, V.J. Reddi, D. Connors, Y. Wu, J. Lee, D. Brooks, A dynamic compilation framework for controlling microprocessor energy and performance, in: Proceedings of the International Symposium on Microarchitecture (MICRO), 2005, pp. 271–282.
[10] T.R. Halfhill, How to soup up Java (Part I), BYTE 23 (5) (1998) 60–74.
[11] P. Wayner, How to soup up Java (Part II): Nine recipes for fast, easy Java, BYTE 23 (5) (1998) 76–80.
[12] R. Radhakrishnan, R. Bhargava, L.K. John, Improving Java performance using hardware translation, in: Proceedings of the International Conference on Supercomputing (ICS), 2001, pp. 427–439.
[13] B. Thomas, L. James, Optimally profiling and tracing programs, ACM Transactions on Programming Languages and Systems 16 (3) (1994) 1319–1360.
[14] J.C. Dehnert, B. Grant, J.P. Banning, R. Johnson, T. Kistler, A. Klaiber, J. Mattson, The Transmeta code morphing software: using speculation, recovery, and adaptive retranslation to address real-life challenges, in: Proceedings of the International Conference on Code Generation and Optimization (CGO), 2003, pp. 15–24.
[15] H. Kim, J.E. Smith, Hardware support for control transfers in code caches, in: Proceedings of the International Symposium on Microarchitecture (MICRO), 2003, pp. 253–264.
[16] C. Wang, S. Hu, H. Kim, S.R. Nair, M. Breternitz, Z. Ying, Y. Wu, StarDBT: an efficient multi-platform dynamic binary translation system, in: Proceedings of the Asia-Pacific Computer Systems Architecture Conference, 2007, pp. 4–15.
[17] F. Bellard, QEMU: a fast and portable dynamic translator, in: Proceedings of the USENIX Annual Technical Conference (USENIX), 2005, pp. 41–46.
[18] A. Chernoff, M. Herdeg, R. Hookway, C. Reeve, N. Rubin, T. Tye, S.B. Yadavalli, J. Yates, FX!32: a profile-directed binary translator, IEEE Micro 18 (2) (1998) 56–64.
[19] K. Scott, N. Kumar, S. Velusamy, B. Childers, J.W. Davidson, M.L. Soffa, Retargetable and reconfigurable software dynamic translation, in: Proceedings of the First International Symposium on Code Generation and Optimization, 2003, pp. 36–47.
[20] D. Ung, C. Cifuentes, Dynamic binary translation using run-time feedbacks, Journal of Science and Computer Programming (JSCP) 60 (2) (2000) 189–204.
[21] J.C. Libby, K.B. Kent, An embedded implementation of the Common Language Infrastructure, Journal of Systems Architecture: Embedded Systems Design 55 (2) (2009) 114–126.
[22] Y. Wu, Y. Lee, Hardware–software collaborative techniques for runtime profiling and phase transition detection, Journal of Computer Science and Technology 20 (5) (2005) 665–675.
[23] A. Astarloa, A. Zuloaga, U. Bidarte, J.L. Martín, J. Lázaro, J. Jimenez, Tornado: a self-reconfiguration control system for core-based multiprocessor CSoPCs, Journal of Systems Architecture 53 (9) (2007) 629–643.
[24] IBM, CoreConnect Bus Architecture, 1999.
[25] C. Cifuentes, B. Lewis, D. Ung, Walkabout: a retargetable dynamic binary translation framework, Technical Report 2002-106, Sun Microsystems Laboratories, 2002.

Haibing Guan received his Ph.D. degree in computer science from TongJi University (China) in 1999. He is currently a professor with the Faculty of Computer Science, Shanghai Jiao Tong University (Shanghai, China). His current research interests include, but are not limited to, computer architecture, compiling, virtualization and hardware/software co-design.

Bo Liu is currently a Ph.D. student at Shanghai Jiao Tong University (SJTU), China. He received his M.Sc. in Software Engineering in 2009 from Shanghai Jiao Tong University, China. His main research interests are computer architecture, virtualization and hardware/software co-design.

Zhengwei Qi is an assistant professor in the School of Software at Shanghai Jiao Tong University (SJTU). He received his B.Eng. and M.Eng. degrees from Northwestern Polytechnical University in 1995 and 1999, respectively. He received his Ph.D. in Computer Science and Engineering from Shanghai Jiao Tong University in 2005. His research interests are distributed computing, virtualized security, model checking, program analysis and embedded systems.

Hongbo Yang is currently a Ph.D. student at Shanghai Jiao Tong University, China. He received the M.S. degree in 1995 and received his B.S. degree in 1998 at the Institute of Airforce Meteorology, China. His main research interests are in virtual machines, computer architecture and compiling.

Yindong Yang is currently a Ph.D. student at Shanghai Jiao Tong University, China. He received the M.S. degree at the School of Computer, Electronics and Information from Guangxi University, China, in 2007. In 2004, he received his B.S. degree at the School of Information and Technology from Jiangnan University, China. His main research interests are in virtual machines, computer architecture and compiling.

Alei Liang is currently a vice professor at Shanghai Jiao Tong University (SJTU). In 1991 and 1997, he received his B.S. and M.S. in Computer Science from Hefei University of Technology (HFUT). In 2002, he received his Ph.D. in Computer Science from Shanghai Jiao Tong University. During 1991–1995, he worked as an engineer in the 41st Institute of China Electronic Ministry. He joined SJTU in 2002. His research interests are in parallel computing via swarm intelligence and virtualization computing with dynamic binary translation.