DAISY: Dynamic Compilation for 100% Architectural Compatibility
Total Page:16
File Type:pdf, Size:1020Kb
DAISY: Dynamic Compilation for 100% Architectural Compatibility Kemal EbciogluÆ and Erik R. Altman IBM Thomas J. Watson Research Center Yorktown Heights, NY 10598 g fkemal,erik @watson.ibm.com Abstract tiously) in a recent keynote speech [11], that architectures which do not run Intel x86 code may well be doomed for Although VLIW architectures offer the advantages of failure, regardless of their speed! simplicity of design and high issue rates, a major impediment To solve the compatibility problem ef®ciently, there have to their use is that they are not compatible with the existing been several proposals beyond plain or caching interpreters software base. We describe new simple hardware features for [10]. One has been the object code translation approach (e.g. a VLIW machine we call DAISY (Dynamically Architected [20, 21, 22, 23]), where a software program takes as input Instruction Set from Yorktown). DAISY is speci®cally in- an executable module generated for the old machine, and tended to emulate existing architectures, so that all existing pro®le directed feedback information from past emulations, software for an old architecture (including operating system if available. It then generates a new executable module that kernel code) runs without changes on the VLIW. Each time a can run on the new architecture (resorting to interpretation in new fragment of code is executed for the ®rst time, the code some dif®cult cases), and that gives the same results that plain is translated to VLIW primitives, parallelized and saved in a interpretation would. Although many of the nasty challenges portion of main memory not visible to the old architecture, to static object code translation (programs printing their own by a Virtual Machine Monitor (software) residing in read checksum, shared variables, self modifying code, generating only memory. Subsequent executions of the same fragment a random number and using it as a branch target address, and do not require a translation (unless cast out). We discuss the so on) have been addressed, the static object code translation architectural requirements for such a VLIW, to deal with is- solution still has some problems. sues including self-modifying code, precise exceptions, and If object code translation is used to emulate applications aggressive reordering of memory references in the presence written for one existing machine on another ([22, 23]), then of strong MP consistency and memory mapped I/O. We have many primitives may need to be generated to emulate one implemented the dynamic parallelization algorithms for the old architecture instruction, or unsafe simplifying assump- PowerPC architecture. The initial results show high degrees tions may need to be made (e.g. about ordering of shared of instruction level parallelism with reasonable translation variable accesses, or the number of bits in the ¯oating point overhead and memory usage. representation) to get more performance, in which case full Keywords: INSTRUCTION-LEVEL PARALLELISM,OBJECT compatibility is sacri®ced. This is typically because hard- CODE COMPATIBLE VLIW, DYNAMIC COMPILATION,BINARY ware features to help compatibility with an ªimportant" old TRANSLATION,SUPERSCALAR architecture were not designed into the new fast machine; compatibility was just not emphasized, or came as an after- thought. For example, the set of condition codes maintained is often quite different in different architectures. This ob- 1 Background and Motivation ject code translation approach does allow the convenience of running many important applications of the old architecture Very Long Instruction Word (VLIW) architectures of- on the new machine, but does not provide a replacement for fer the advantages of design simplicity, a potentially short the old machine in terms of speed and range of applications. clock period, and high issue rates. Unfortunately, high per- If the new architecture is fully binary compatible with the formance is not suf®cient for success. One of the major old one by hardware design, but does not run with the best impediments to using a VLIW (or any new ILP machine performance on old binaries, ([20, 21]), and the new features architecture) has been its inability to run existing binaries of the new architecture that improve performance can be uti- of established architectures. It was argued (and not face- lized only by object code translation, or recompilation, the solution is still not perfect. Rapid adoption of new architec- tural features for higher performance may be possible under certain circumstances; scienti®c and technical computing is an example. But computer designers often underestimate the strong inertia of the user community and software vendors at large, and their resistance to change. 2 The Compilation Algorithm Another approach is to translate the old architecture in- structions to a new internal representation (e.g. VLIW) at In this paper, we call the original, old architecture that Icache miss time, by hardware [9, 14, 19]. This approach we are trying to emulate, the base architecture.TheVLIW is robust in the sense that it implements the old architecture which emulates the old architecture we called the migrant completely. But the optimizations that can be performed architecture, following the terminology of [20]. The base by the hardware are limited, compared to software oppor- architecture could be any architecture, but we will be giving tunities. Also the conversion from the old architecture rep- examples mostly from the IBM PowerPC. resentation in memory to the internal Icache representation Traditional caching emulators may spend under 100 in- is complex (especially if one attempts to do re-ordering) structions to translate a typical base architecture instruction and can require substantial hardware design investment, and (depending on the architectural mismatch and complexity of VLSI real estate. the emulated machine). So caching emulators are very fast, As an alternative we present DAISY (Dynamically but do not do much optimization nor ILP extraction. Tradi- Architected Instruction Set from Yorktown). DAISY em- tional VLIW compiler techniques, on the other hand, extract ploys software translation, which is attractive because it considerable ILP at the cost of much more overhead. dispenses with the need for complex hardware whose sole The goal in DAISY is to obtain signi®cant levels of ILP purpose is to achieve compatibility with (possibly ugly) old while keeping compilation overhead to a minimum, to meet architecture(s). Given the appropriate superset of features in the severe time constraints of a virtual machine implementa- the new architecture (e.g. condition codes in x86, PowerPC, tion. Unlike traditionalVLIW scheduling, DAISY examines and S/390 format), DAISY can be dynamically architected each operation in the order it occurs in the original binary by software to ef®ciently emulate any of the old architec- code, converting each into RISC primitives (if a CISCy oper- tures. Assuming that we can begin with a clean slate for both ation). As each RISC primitive is generated, DAISY imme- hardware and emulation software, and adopt a simple de- diately ®nds a VLIW instruction in which it can be placed, sign philosophy, what architectural features and compilation while still doing VLIW global scheduling on multiple paths techniques are required to make software translation ef®cient and across loop iterations and while maintaining precise ex- and 100% compatible with existing software? We attack this ceptions. problem in the current paper. Figure 1 shows an example of PowerPC code and its While DAISY and this paper focus mainly on a VLIW conversion to VLIW code 1 . This conversion uses the al- as the new architecture, the same ideas can be applied any gorithm described in [6]. However, the discussion below new superscalar design, and potentially to other new ILP ar- should make the main points clear. We begin with four major chitectures that break binary compatibility as well. Current points: compiler techniques for attaining high ILP are unacceptably Operations 1±11 of the original PowerPC code are slow for dynamic parallelization, which requires real-time scheduled in sequence into VLIW's. It turns out that performance from a compiler, in order to make the over- two VLIW's suf®ce for these 11 instructions, yielding head imperceptible to the user. To this end, we describe a an ILP of 4, 4, and 3.5 on the three possible paths new, signi®cantly faster parallelization technique. We have through the code. implemented this technique for the PowerPC and we report the initial encouraging ILP results. Another feature of the Operations are always added to the end of the last new compilation technique is the ability to maintain pre- VLIW on the current path. If input data for an op- cise exceptions, so the original instruction responsible for eration are available prior to the end of the last VLIW, an exception can be identi®ed, whenever an exception oc- then the operation is performed as early as possible curs. While out-of-order superscalars use elaborate hardware with the result placed in a renamed register (that is not mechanisms to maintain precise exceptions, in our case this architected in the original architecture). The renamed is done by software alone. register is then copied to the original (architected) reg- The paper is organized as follows: We ®rst give an exam- ister at the end of the last VLIW. This is illustrated by ple illustrating the new fast dynamic compilation algorithm the xor instruction in step 4, whose result is renamed used by DAISY. Next, various architectural features to sup- to r63 in VLIW1, then copied to the original desti- port high performance translation are described. We then de- nation r4 in VLIW2. By having the result available scribe the dynamic translation mechanism whereby DAISY early in r63, later instructions can be moved up. For runs the old software with minimal hardware support. Next example, the cntlz in step 11 can use the result in we discuss the mapping mechanisms from the old code to r63 before it has been copied to r4.