Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters∗

Marc Berndl, Benjamin Vitale, Mathew Zaleski, and Angela Demke Brown
University of Toronto, Systems Research Lab
{berndl,matz,bv,demke}@cs.toronto.edu

∗ This research was supported by NSERC, IBM CAS and CITO.

Abstract

Direct-threaded interpreters use indirect branches to dispatch bytecode bodies, but deeply-pipelined architectures rely on branch prediction for performance. Due to the poor correlation between the virtual program's control flow and the hardware program counter, which we call the context problem, direct threading's indirect branches are poorly predicted by the hardware, limiting performance. Our dispatch technique, context threading, improves branch prediction and performance by aligning hardware and virtual machine state. Linear virtual instructions are dispatched with native calls and returns, aligning the hardware and virtual PC. Thus, sequential control flow is predicted by the hardware return stack. We convert virtual branching instructions to native branches, mobilizing the hardware's branch prediction resources. We evaluate the impact of context threading on both branch prediction and performance using interpreters for Java and OCaml on the Pentium and PowerPC architectures. On the Pentium IV, our technique reduces mean mispredicted branches by 95%. On the PowerPC, it reduces mean branch stall cycles by 75% for OCaml and 82% for Java. Due to reduced branch hazards, context threading reduces mean execution time by 25% for Java and by 19% and 37% for OCaml on the P4 and PPC970, respectively. We also combine context threading with a conservative inlining technique and find its performance comparable to that of selective inlining.

1 Introduction

Interpretation is a powerful tool for implementing programming language systems. It facilitates interactive program development and debugging, compact and portable deployment, and simple language prototyping. This combination of features makes interpreted languages attractive in many settings, but their applicability is constrained by poor performance compared to native code. Consequently, many important systems, such as Sun's HotSpot [1] and IBM's production Java virtual machine [21], run in mixed mode, compiling and executing some parts of a program while interpreting others. Baseline interpreter performance thus continues to be relevant.

Recently, Ertl and Gregg observed that the performance of otherwise efficient direct-threaded interpretation is limited by pipeline stalls and flushes due to extremely poor indirect branch prediction [8]. Modern pipelined architectures, such as the Pentium IV (P4) and the PowerPC (PPC), must keep their pipelines full to perform well. Hardware branch predictors use the native PC to exploit the highly-biased branches found in typical (native code) CPU workloads [3,12]. Direct-threaded virtual machine (VM) interpreters, however, are not typical workloads. Their branches' targets are unbiased and therefore unpredictable [8]. For an interpreted program, it is the virtual program counter (or vPC) that is correlated with control flow. We therefore propose to organize the interpreter so that the native PC correlates with the vPC, exposing virtual control flow to the hardware.

We introduce a technique based on subroutine threading, once popular in early interpreters for languages like Forth. To leverage return address stack prediction, we implement each virtual instruction as a subroutine which ends in a native return instruction. Note, however, that these subroutines are not full-fledged functions in the sense of a higher-level language such as C (no register save/restore, stack frame creation, etc.). When the instructions of a virtual program are loaded by the interpreter, we translate them to a sequence of call instructions, one per virtual instruction, whose targets are these subroutines. Virtual instructions are then dispatched simply by natively executing this sequence of calls.

The key to the effectiveness of this simple approach is that at dispatch time, the native PC is perfectly correlated with the virtual PC. Thus, for non-branching bytecodes, the return address stack in modern processors reliably predicts the address of the next bytecode to execute. Because the next dynamic instruction is not generally the next static instruction in the virtual program, branches pose a greater challenge. For these virtual instructions, we provide a limited form of specialized inlining, replacing indirect with relative branches, thus exposing virtual branches to the hardware's branch predictors.

[Figure 1. Direct Threaded VM Interpreter. The loader compiles the virtual program (push 11; push 7; mul; print; done) into the Direct Threading Table (DTT): &INST_PUSH, 11, &INST_PUSH, 7, &INST_MUL, &INST_PRINT, &INST_DONE. The vPC points into the DTT, operands are fetched from it, and every opcode body ends with the dispatch goto *vPC++.]

We review techniques for virtual instruction dispatch in interpreters, describe their performance problems, and define the context problem in Section 2. Then, we discuss other work on improving dispatch performance in Section 3. We provide relevant details of our implementations in a Java virtual machine (SableVM) and the OCaml interpreter on the Pentium IV and PowerPC architectures in Section 4. Using an extensive suite of benchmarks for both Java and OCaml, we evaluate context threading in Section 5. This paper makes the following contributions:

• We introduce a new dispatch technique for virtual machine interpreters that dramatically improves branch prediction, and demonstrate that our technique does not depend on a specific language or CPU architecture.

• We show context threading is effective. On both the P4 and the PPC970, it eliminates 25% of the mean elapsed time of Java benchmarks, with individual benchmarks running twice as fast as with direct threading. For OCaml, we achieve a 20% reduction in the mean execution time on the P4 and a 35% reduction on the PPC970, with some benchmarks achieving as much as 40% and 50%, respectively.

• We show that context threading is compatible with inlining, using a simple heuristic that we call Tiny Inlining. On OCaml, we achieve speedups relative to direct threading of at least 10% over context threading alone. On Java, we perform as well as or better than SableVM's implementation of selective inlining.

2 The Context Problem

An interpreter executes a virtual program by dispatching its virtual instructions in sequence. The current instruction is indicated by a virtual program counter, or vPC. The virtual program is a list of virtual instructions, each consisting of an opcode and zero or more operands. The exact representation depends on the dispatch technique used.

A typical switch-dispatched interpreter, implemented in C, is a for loop which fetches the opcode at vPC and then executes a switch statement, each opcode being implemented by a separate case block (the opcode body, or body). However, switch dispatch is considerably slower than the state-of-the-art, direct-threaded dispatch [7].

As shown in Figure 1, a direct-threaded interpreter represents virtual program instructions as a list of addresses. Each address points to the opcode body. We refer to this list as the Direct Threading Table, or DTT. The operands are also stored in this list, immediately after the corresponding opcode address. The vPC points into the DTT to indicate the instruction to be executed. The actual dispatch to the next instruction is accomplished by the "goto *vPC++" at the end of each opcode body, which is supported by GNU C's labels-as-values extension. In Figure 2, we show the assembly code corresponding to this dispatch statement for the Pentium IV and PowerPC architectures. The overhead of these instructions is comparable to the execution time of small bodies in some VMs.

    (a) Pentium IV assembly          (b) PowerPC assembly
        mov  eax = (r1) ; r1 is vPC      lwz   r2 = 0(r1)
        addl 4, r1                       mtctr r2
        jmp  *eax                        addi  r1, r1, 4
                                         bctr

    Figure 2. Direct Threaded Dispatch

When executing the indirect branch in Figure 2(a), the Pentium IV will speculatively dispatch instructions using a predicted target address. The PowerPC uses a different strategy for indirect branches, as shown in Figure 2(b). First the target address is loaded into a register, and then a branch is executed to this register address. Rather than speculate, the PowerPC stalls until the target address is known, although other instructions may be scheduled between the load and the branch to reduce or eliminate these stalls.

Stalling and incorrect speculation are serious pipeline hazards. To perform at full speed, modern CPUs need to keep their pipelines full by correctly predicting branch targets. Indirect branch predictors assume that the branch destination is highly correlated with the address of the indirect branch instruction itself. As observed by Ertl [8,9], this assumption is usually wrong for direct-threaded interpreter workloads.
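To make direct-threaded dispatch concrete, the following minimal sketch (ours, not the paper's actual OCaml or SableVM source; the opcode names follow Figure 1) implements the example program with GNU C labels-as-values. Each body ends with the shared indirect branch just described.

    #include <stdio.h>
    #include <stdint.h>

    /* Minimal direct-threaded interpreter (requires GNU C's
     * labels-as-values extension). The DTT interleaves body
     * addresses and immediate operands, as in Figure 1. */
    int main(void) {
        intptr_t stack[16];
        int sp = -1;

        /* DTT for: push 11; push 7; mul; print; done */
        void *dtt[] = { &&INST_PUSH, (void *)11,
                        &&INST_PUSH, (void *)7,
                        &&INST_MUL, &&INST_PRINT, &&INST_DONE };
        void **vPC = dtt;

        goto *(*vPC++);                    /* initial dispatch */

    INST_PUSH:
        stack[++sp] = (intptr_t)*vPC++;    /* operand lives in the DTT */
        goto *(*vPC++);                    /* shared indirect branch: the
                                              target depends on the vPC */
    INST_MUL:
        sp--;
        stack[sp] = stack[sp] * stack[sp + 1];
        goto *(*vPC++);
    INST_PRINT:
        printf("%ld\n", (long)stack[sp--]);
        goto *(*vPC++);
    INST_DONE:
        return 0;
    }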
In a direct-threaded implementation, there is only one jump instruction per virtual opcode implemented. For example, in Figure 1, there are two instances of INST_PUSH. In the context of vPC=0, the dispatch at the end of the INST_PUSH body results in a native indirect branch back to the start of the INST_PUSH body (since the next virtual instruction, at vPC=2, is also an INST_PUSH). However, the target of the same native indirect branch in the context of vPC=2 is determined by the address stored at vPC=4, which in this example is an INST_MUL opcode. Thus, the target of the indirect branch depends on the virtual context—the vPC—rather than the hardware PC of the branch, causing the hardware to speculate incorrectly or not at all. We refer to this lack of correlation between the native PC and the vPC as the context problem.

3 Related Work

Much of the work on interpreters has focused on the dispatch problem. Kogge [14] remains a definitive description of many threaded-code dispatch techniques. These can be divided into two broad classes: those which refine the dispatch itself, and those which alter the bodies so that there are more efficient or simply fewer dispatches. Switch and direct threading belong to the first class, as does subroutine threading, discussed next. Later, we will discuss superinstructions and replication, which are in the second class. We are particularly interested in subroutine threading and replication because they both provide context to the branch prediction hardware.

Some Forth interpreters use subroutine-threaded dispatch. Here, the program is not represented as a list of body addresses, but instead as a sequence of native calls to the bodies, which are then constructed to end with native returns. Curley [5,6] describes a subroutine-threaded Forth for the 68000 CPU. He improves the resulting code by inlining small opcode bodies, and converts virtual branch opcodes to single native branch instructions. He credits Charles Moore, the inventor of Forth, with discovering these ideas much earlier. Outside of Forth, there is little thorough literature on subroutine threading. In particular, few authors address the problem of where to store virtual instruction operands. In Section 4, we document how operands are handled in our implementation of subroutine threading.

The choice of optimal dispatch technique depends on the hardware platform, because dispatch is highly dependent on micro-architectural features. On earlier hardware, call and return were both expensive, and hence subroutine threading required two costly branches, versus one in the case of direct threading. Rodriguez [17] presents the tradeoffs for various dispatch types on several 8-bit and 16-bit CPUs. For example, he finds direct threading is faster than subroutine threading on a 6809 CPU, because the JSR and RET instructions require extra cycles to push and pop the return address. On the other hand, Curley found subroutine threading faster on the 68000 [5]. On modern hardware the cost of the call and return is much lower, due to return branch prediction hardware, while the cost of direct threading has increased due to misprediction. In Section 5 we demonstrate this effect on several modern CPUs.

Superinstructions reduce the number of dispatches. Consider the code to add a constant integer to a variable. This may require loading the variable onto the stack, loading the constant, adding, and storing back to the variable. VM designers can instead extend the virtual instruction set with a single superinstruction that performs the work of all four instructions, as sketched below. This technique is limited, however, because the virtual instruction encoding (often one byte per opcode) may allow only a limited number of instructions, and the number of desirable superinstructions grows exponentially in the number of subsumed atomic instructions. Furthermore, the optimal superinstruction set may change based on the workload. One approach uses profile feedback to select and create the superinstructions statically (when the interpreter is compiled [10]).
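As an illustration of the superinstruction idea, here is a hypothetical body (ours; the opcode name, the `locals` array and the operand layout are invented for this example) that could be dropped into the threaded loop sketched earlier. It subsumes the four-instruction sequence in a single body, so the work costs one dispatch instead of four:

    /* Hypothetical superinstruction: var += const in one body.
     * Subsumes: push addr-of-var; push const; add; store.
     * Two immediate operands follow the opcode in the DTT. */
    INST_ADD_CONST_TO_VAR: {
        intptr_t var_index = (intptr_t)*vPC++;   /* which local */
        intptr_t constant  = (intptr_t)*vPC++;   /* value to add */
        locals[var_index] += constant;
        goto *(*vPC++);                          /* one dispatch, not four */
    }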
Piumarta [15] presents selective inlining. It constructs superinstructions when the virtual program is loaded. They are created in a relatively portable way, by memcpy'ing the native code of the bodies, again using GNU C's labels-as-values. This technique was first documented earlier [19], but Piumarta's independent discovery inspired many other projects to exploit selective inlining. Like us, he applied his optimization to OCaml, and reports significant speedup on several microbenchmarks. As we discuss in Section 5.4, our technique is separate from, but supports and indeed facilitates, inlining optimizations.

Only certain classes of opcode bodies can be relocated using memcpy alone—the body must contain no pc-relative instructions (typically this excludes C function calls). Selective inlining requires that the superinstruction starts at a virtual basic block, and ends at or before the end of the block. Ertl's dynamic superinstructions [9] also use memcpy, but are applied to effect a simple native compilation by inlining bodies for nearly every virtual instruction. Ertl shows how to avoid the virtual basic block constraints, so dispatch to interpreter code is only required for virtual branches and un-relocatable bodies. Catenation [24] patches Sparc native code so that all implementations can be moved, specializes operands, and converts virtual branches to native branches, thereby eliminating the virtual program counter.

Replication—creating multiple copies of the opcode body—decreases the number of contexts in which it is executed, and hence increases the chances of successfully predicting the successor [9]. Replication implemented by inlining opcode bodies reduces the number of dispatches, and therefore the average dispatch overhead [15]. In the extreme, one could create a copy for each instruction, eliminating misprediction entirely. This technique results in significant code growth, which may [24] or may not [9] cause cache misses.

In summary, misprediction of the indirect branches used by a direct-threaded interpreter to dispatch virtual instructions limits its performance on modern CPUs because of the context problem. We have described several recent dispatch optimization techniques. Some of the techniques improve the performance of each dispatch by reducing the number of contexts in which a body is executed. Others reduce the number of dispatches, possibly to zero.

Dynamo [2] is a system for trace-based runtime optimization of arbitrary programs. Its optimizations include replacing indirect branches with guarded linear control flow. One would expect this to be highly applicable to threaded interpreters. Sullivan et al. [22] applied Dynamo to a Java VM, but found it fared poorly. This was due to the context problem—it could not distinguish between the different runtime contexts of a bytecode body. The solution was to detect traces using a tuple of vPC and native PC, instead of only the native PC. Our technique, while simpler, accomplishes the same thing. In the following section we describe how we address the context problem directly, by devirtualizing the interpreter's control flow and thus exposing virtual execution to native branch prediction resources.

4 Design and Implementation

Direct-threaded interpreters are known to have very poor branch prediction properties; however, they are also known to have a small cache footprint (for small to medium sized opcode bodies) [18]. Since both branches and cache misses are major pipeline hazards, we would like to retain the good cache behavior of direct-threaded interpreters while improving the branch behavior. The preceding section describes various techniques for improving branch prediction by replicating entire bodies. The effect of these techniques is to trade instruction cache size for better branch prediction. Ertl [9] claims that for Forth, with small opcode bodies, code growth occurs but does not cause cache-related stalls. Vitale [24] finds that, with larger opcode bodies, code growth from replication does induce cache misses. We believe it is best to avoid growing code if possible. We introduce a new technique which minimally affects code size and produces dramatically fewer branch mispredictions than either direct threading or direct threading with inlining. In this section we motivate our design in terms of aligning virtual machine context with physical machine context, and outline our implementation.

4.1 Understanding Branches

To motivate our design, first note that the virtual program may contain all the usual types of control flow: conditional and unconditional branches, indirect branches, and calls and returns. We must also consider the dispatch of straight-line virtual instructions. For direct-threaded interpreters, sequential (virtual) execution is just as expensive as handling control transfers, since all virtual instructions are dispatched with an indirect branch. Second, note that the dynamic execution path of the virtual program will contain patterns (loops, for example) that are similar in nature to the patterns found when executing native code. These control flow patterns originate in the algorithm that the virtual program implements, whether it is interpreted or compiled. Finally, note that modern microprocessors have considerable resources devoted to identifying these patterns in native code, and exploiting them to predict branches. In fact, the hardware provides different types of predictors to support different types of native branches. Unfortunately, direct threading uses only indirect branches and, due to the context problem, the patterns that exist in the virtual program are effectively hidden from the microprocessor.

The fundamental goal of our approach is to expose these virtual control flow patterns to the hardware, such that the physical execution path matches the virtual execution path. To achieve this goal, we exploit the different types of hardware prediction resources to handle the different types of virtual control flow transfers. In Section 4.2 we show how to replace straight-line dispatch with subroutine threading. In Section 4.3 we show how to inline conditional and indirect jumps, and in Section 4.4 we discuss handling virtual calls and returns with native calls and returns. We strive to maintain the property that the virtual program counter is precisely correlated with the physical program counter; in fact, with our technique there is a one-to-one mapping between them at control flow points.
4.2 Handling Linear Dispatch

The dispatch of straight-line virtual instructions is the largest single source of branches when executing an interpreter. Any technique that hopes to improve branch prediction accuracy must thus address dispatch. The obvious solution is inlining, as it eliminates the dispatch entirely for straight-line sequences of virtual instructions. Inlining also has other benefits, such as enabling optimizations across the implementations of multiple virtual instructions. The increase in code size caused by aggressive inlining, however, has the potential to overwhelm the benefits with the cost of increased instruction cache misses.

Rather than eliminate dispatch, we propose an alternative organization for the interpreter in which native call and return instructions are used. Conceptually, this approach is elegant because subroutines are a natural unit of abstraction to express the implementations of virtual instructions.

[Figure 3. Subroutine Threaded VM Interpreter. The DTT entries now point into the Context Threading Table (CTT), a sequence of native calls (CALL INST_PUSH; CALL INST_PUSH; CALL INST_MUL; CALL INST_PRINT; CALL INST_DONE). Non-branching bodies end with a native RET, while the INST_GOTO body still ends with goto *vPC++. The virtual program stack contains 11 after the first instruction executes.]

Figure 3 illustrates our implementation of subroutine threading, using the same example program as Figure 1. In this case, we show the state of the virtual machine after the first virtual instruction has been executed (note that the virtual program stack now contains the value "11"). We add a new structure to the interpreter architecture, called the Context Threading Table (CTT), which contains a sequence of native call instructions. Each native call dispatches the body for its virtual instruction. We use the term Context Threading because the hardware address of each call instruction in the CTT provides execution context to the hardware, most importantly to the branch predictors. Each non-branching opcode body now ends with a native return instruction, while opcodes that modify the virtual control flow end with an indirect jump, as in direct threading. The Direct Threading Table (DTT) is still necessary to store immediate virtual operands, and to correctly resolve virtual control transfer instructions. In direct threading, entries in the DTT point to opcode bodies, whereas in subroutine threading they refer to call sites in the CTT.

It seems counterintuitive to improve dispatch performance by calling each body. It is not obvious whether a call to a constant target is more or less expensive to execute than an indirect jump, but that is not the issue. Modern microprocessors contain specialized hardware to improve the performance of call and return—specifically, a return address stack that predicts the destination of the return to be the instruction following the corresponding call. Although the cost of subroutine threading is two control transfers, versus one for direct threading, this cost is outweighed by the benefit of eliminating a large source of unpredictable branches.
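As a rough illustration of loader-time CTT construction (our sketch, not SableVM's or OCaml's actual code: the IA-32 focus, the fixed-size code buffer, the omission of error handling, and the opcode_body table are all assumptions), each virtual instruction becomes one native call in the CTT, and its DTT slot is redirected to that call site:

    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Emit one IA-32 "call rel32" (opcode 0xE8); the displacement
     * is relative to the end of the 5-byte call instruction. */
    static uint8_t *emit_call(uint8_t *at, void *body) {
        int32_t rel = (int32_t)((uint8_t *)body - (at + 5));
        at[0] = 0xE8;
        memcpy(at + 1, &rel, sizeof rel);
        return at + 5;
    }

    /* Loader-time translation sketch: one native call per virtual
     * instruction. Bodies are assumed to end in a native "ret";
     * operand slots, which stay data-only in the DTT, are omitted. */
    void build_ctt(const int *opcodes, int n, void **dtt, void *opcode_body[]) {
        uint8_t *ctt = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        for (int i = 0; i < n; i++) {
            dtt[i] = ctt;   /* the DTT entry now names the CTT call site */
            ctt = emit_call(ctt, opcode_body[opcodes[i]]);
        }
    }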
4.3 Handling Virtual Branches

Subroutine threading handles the branches that are induced by the dispatch of straight-line virtual instructions; however, the actual control flow of the virtual program is still hidden from the hardware. That is, bodies of opcodes that affect the virtual control flow still have no context. There are two problems: one relating to shared indirect branch prediction resources, and one relating to a lack of history context for conditional branch prediction resources.

Consider the implementation of INST_GOTO in Figure 3. Even for this simple unconditional virtual branch, prediction is problematic, because all INST_GOTO instructions in the virtual program share a single indirect branch instruction (and hence have a single prediction context). Conditional virtual branches have the same problem. A simple solution is to generate replicas of the indirect branch instruction in the CTT immediately following the call to the branching opcode body. Branching opcode bodies now end with a native return, which transfers control to the replicated indirect branch in the CTT. As a consequence, each virtual branch instruction now has its own hardware context. We refer to this technique as branch replication.

Branch replication is attractive because it is simple, and produces the desired context with a minimum of replicated instructions. However, it has a number of drawbacks. First, for branching opcodes, we execute three hardware control transfers (a call to the body, a return, and the actual branch), which is an unnecessary overhead. Second, we still use the overly general indirect branch instruction, even in cases like INST_GOTO where we would prefer a simpler direct native branch. Third, by only replicating the dispatch part of the virtual instruction, we do not take full advantage of the conditional branch predictor resources provided by the hardware. Due to these limitations, we only use branch replication for indirect virtual branches and exceptions.

For all other branches we fully inline the bodies of virtual branch instructions into the CTT. We refer to this as branch inlining. In the process of inlining, we convert indirect branches into direct branches, where possible. We thus reduce pressure on the BTB, and instead exploit the conditional branch predictors. In particular, the virtual conditional branches now appear as real conditional branches to the hardware. The primary cost of branch inlining is increased code size, but this is modest because virtual branch instructions are simple and have small bodies. For instance, on the Pentium IV, most branch instructions can be inlined with no more than 10 words of additional space. Figure 4 shows an example of inlining the INST_GOTO branch instruction. The figure also illustrates how we handle virtual call/return control flow, described next.
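Continuing the CTT-generation sketch above, a virtual branch can be turned into real native control flow in the CTT. Here a hypothetical unconditional INST_GOTO becomes a direct jump whose target is the CTT call site recorded in the DTT (again an illustrative IA-32 sketch, not the paper's implementation; the conditional case would emit a compare plus a conditional jump, engaging the hardware's conditional branch predictors):

    /* Inline an unconditional virtual branch into the CTT: instead
     * of calling a shared INST_GOTO body that ends in "goto *vPC++",
     * emit a native direct jump (IA-32 0xE9, rel32) straight to the
     * CTT call site of the branch target. */
    static uint8_t *inline_goto(uint8_t *at, void *target_call_site) {
        int32_t rel = (int32_t)((uint8_t *)target_call_site - (at + 5));
        at[0] = 0xE9;                    /* jmp rel32 */
        memcpy(at + 1, &rel, sizeof rel);
        return at + 5;
    }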
[Figure 4. Context Threaded VM Interpreter: Branch and Return Inlining. The INST_GOTO body (vPC = *vPC) is inlined into the CTT and followed by a direct JMP to its destination. A virtual call site is dispatched with CALL INST_CALL followed by a CALL_INDIRECT to the callee's first CTT entry; the virtual return is dispatched with a direct JMP to the INST_RETURN body, which ends in a native RET.]

4.4 Handling Virtual Call and Return

The only significant source of control transfers that remains in the virtual program is virtual calls and returns. For successful branch prediction, the real problem is not the virtual call, but rather the virtual return, because one virtual return may go back to multiple call sites. As noted previously, the hardware already has an elegant solution to this problem for native code in the form of the return address stack. We need only to deploy this resource to predict virtual returns.

We describe our solution with reference to Figure 4. The virtual call body should effect a transfer of control to the start of the callee. We begin at a virtual call instruction (see arrow labeled "1"). The virtual call body simply sets the vPC to refer to the virtual callee and executes a native return to the next CTT location. Similar to branch replication, we insert a new native call indirect instruction at this point in the CTT to transfer control to the start of the callee (arrow "2"). This call indirect causes the next location in the CTT to be pushed onto the hardware's return address stack. The first instruction of the callee is then dispatched (arrow "3"). At the end of the callee, we modify the virtual return instruction as follows. In the CTT, we emit a native direct branch to dispatch the body of the virtual return (arrow "4"). Unlike using a native call for this dispatch, the direct branch avoids perturbing the return address stack. We modify the body of the virtual return to end with a native return instruction, which now transfers control all the way back to the instruction following the original virtual call (arrow "5"). We refer to this technique as apply/return inlining ("apply" is the name of the generalized function call opcode in OCaml).

There are, however, some practical challenges to implementing our design for apply/return inlining. First, one must take care to match the hardware stack against the virtual program stack. For instance, in OCaml, exceptions unwind the virtual machine stack; the hardware stack must be unwound in a corresponding manner. Second, some run-time environments are extremely sensitive to hardware stack manipulations, since they use or modify the machine stack pointer for their own purposes (such as handling signals). In such cases, it is possible to create a separate stack structure and swap between the two at virtual call and return points. This approach would introduce significant overhead, and is only justified if apply/return inlining provides a substantial performance benefit.

With this final step, we have a complete technique that aligns all virtual program control flow with the corresponding native flow. Having described our design and its general implementation, we now evaluate its effectiveness on real interpreters.

5 Experimental Evaluation

In this section, we evaluate the performance of context threading and compare it to direct threading and direct-threaded selective inlining. Context threading combines subroutine threading, branch inlining and apply/return inlining. We evaluate the contribution of each of these techniques to the overall impact of context threading using two virtual machines and three microprocessor architectures. We begin by describing our experimental setup in Section 5.1. We then investigate how effectively our techniques address pipeline branch hazards in Section 5.2, and the overall effect on execution time in Section 5.3. Finally, Section 5.4 demonstrates that context threading is complementary to inlining, resulting in a portable, relatively simple technique that provides performance comparable to or better than SableVM's implementation of selective inlining.

5.1 Virtual Machines, Benchmarks and Platforms

OCaml We chose OCaml as representative of a class of efficient, stack-based interpreters that use direct-threaded dispatch. The bytecode bodies of the interpreter are very efficient, and have been hand-tuned, including register allocation. The implementation of the OCaml interpreter is clean and easy to modify.

SableVM SableVM is a Java virtual machine built for quick interpretation, implementing lazy method loading and a novel bi-directional virtual function lookup table. Hardware signals are used to handle exceptions. Most importantly for our purposes, SableVM already implements multiple dispatch mechanisms, including switch, direct threading, and selective inlining (which SableVM calls inline threading) [11]. The support for multiple dispatch mechanisms makes it easy to add context threading, and allows us to compare it against a selective inlining implementation, which we believe is a more complicated technique.
There are however, some practical chal- ing, and selective inlining (which SableVM calls inline 1“apply” is the name of the (generalized) function call opcode in threading) [11]. The support for multiple dispatch mech- OCaml anisms makes it easy to add context threading, and allows Table 1. Description of OCaml benchmarks Pentium IV PowerPC 7410 PPC970 Lines Branch Branch Elapsed of Time Mispredicts Time Stalls Time Source Benchmark Description (TSC*108) (MPT*106) (Cycles*108) (Cycles*106) (sec) Code boyer Boyer theorem prover 3.34 7.21 1.8 43.9 0.18 903 fft Fast Fourier transform 31.9 52.0 18.1 506 1.43 187 fib Fibonacci by recursion 2.12 3.03 2.0 64.7 0.19 23 genlex A lexer generator 1.90 3.62 1.6 27.1 0.11 2682 kb A knowledge base program 17.9 42.9 9.5 283 0.96 611 nucleic nucleic acid’s structure 14.3 19.9 95.2 2660 6.24 3231 quicksort Quicksort 9.94 20.1 7.2 264 0.70 91 sieve Sieve of Eratosthenes 3.04 1.90 2.7 39.0 0.16 55 soli A classic peg game 7.00 16.2 4.0 158 0.47 110 takc Takeuchi function (curried) 4.25 7.66 3.3 114 0.33 22 taku Takeuchi function (tuplified) 7.24 15.7 5.1 183 0.52 21

Table 2. Description of SPECjvm98 benchmarks

    Benchmark   Description                              P4 Time        P4 Mispredicts  PPC7410 Time      PPC7410 Stalls   PPC970 Elapsed
                                                         (TSC x 10^11)  (MPT x 10^9)    (Cycles x 10^10)  (Cycles x 10^8)  Time (sec)
    compress    Modified Lempel-Ziv compression          4.48           7.13            17.0              493              127.7
    db          Performs multiple database functions     1.96           2.05            7.5               240              65.1
    jack        A Java parser generator                  0.71           0.65            2.7               67               18.9
    javac       The Java compiler from the JDK 1.0.2     1.59           1.43            6.1               160              44.7
    jess        Java Expert Shell System                 1.04           1.12            4.2               110              29.8
    mpegaudio   Decompresses MPEG Layer-3 audio files    3.72           5.70            14.0              460              106.0
    mtrt        Two-thread variant of raytrace           1.06           1.04            5.3               120              26.8
    raytrace    A raytracer rendering                    1.00           1.03            5.2               120              31.2
    scimark     Performs FFT, SOR and LU ('large')       4.40           6.32            18.0              690              118.1
    soot        Java bytecode-to-bytecode optimizer      1.09           1.05            2.7               71               35.5

OCaml Benchmarks The benchmarks in Table 1 constitute the complete standard OCaml benchmark suite (available from ftp://ftp.inria.fr/INRIA/Projects/cristal/Xavier.Leroy/benchmarks/objcaml.tar.gz). Boyer, kb, quicksort and sieve are mostly integer processing, while nucleic and fft are mostly floating-point benchmarks. Soli is an exhaustive search algorithm that solves a solitaire peg game. Fib, taku, and takc are tiny, highly-recursive programs which calculate integer values. These three benchmarks are unusual because they contain very few distinct virtual instructions, and often contain only one instance of each. These features have two important consequences. First, the indirect branch in direct-threaded dispatch is relatively predictable. Second, even minor changes can have dramatic effects (both positive and negative) because so few instructions contribute to the behavior.

SableVM Benchmarks SableVM experiments were run on the complete SPECjvm98 [20] suite (compress, db, mpegaudio, raytrace, mtrt, jack, jess and javac), one large object-oriented application (soot [23]) and one scientific application (scimark [16]). Table 2 summarizes the key characteristics of these benchmarks.

Pentium IV Measurements The Pentium IV (P4) processor aggressively dispatches instructions based on branch predictions. As discussed in Section 2, the taken indirect branches used for direct-threaded dispatch are often mispredicted due to the lack of context. Ideally, we would measure the mispredict penalty for these branches to see their effect on execution time, but the P4 does not have a counter for this purpose. Instead, we count the number of mispredicted taken branches (MPT) to show how effectively context threading improves branch prediction. We measure time on the P4 with the cycle-accurate time stamp counter (TSC) register. We count both MPT and TSC events using our own Linux kernel module, which collects complete data for the multithreaded Java benchmarks. (MPT events are counted with performance counter 8, by setting the CCCR to 0x0003b000 and the ESCR to 0xc001004 [13].)

PowerPC Measurements We need to characterize the cost of branches differently on the PowerPC than on the P4, as these processors do not typically speculate on indirect branches. Instead, split branches are used (as shown in Figure 2(b)) and the PPC stalls in the branch unit until the branch destination is known. (A "hint bit" can be used to encourage speculation in later models like the PPC970, but it is not used by default.) Hence, we would like to count the number of cycles stalled due to link and count register dependencies. Fortunately, the older PPC7410 CPU has a counter (counter 15, "stall on LR/CTR dependency") that provides exactly this information [4]. On the PPC7410, we also use the hardware counters to obtain overall execution times in terms of clock cycles. We expect that the branch stall penalty should be larger on more deeply-pipelined CPUs like the PPC970; however, we cannot count these stall cycles directly on this processor. Instead, we report only elapsed execution time for the PPC970.

Interpreting the data In presenting our results, we normalize all experiments to the direct threading case, since it is the baseline state-of-the-art dispatch technique. We give absolute execution times and branching characteristics for each benchmark and platform using direct threading in Tables 1 and 2. Bar graphs in the following sections show the contributions of each component of our technique: subroutine threading only (labeled SUB); subroutine threading plus branch inlining and branch replication for indirect branches and exceptions (labeled BRANCH); and our complete context threading implementation, which includes apply/return inlining (labeled CONTEXT). We include bars for selective inlining in SableVM (labeled SELECT) and our own simple inlining technique (labeled TINY) to facilitate comparisons, although inlining results are not discussed until Section 5.4. We do not show a bar for direct threading because it would have a height of 1.0, by definition.

5.2 Effect on Pipeline Branch Hazards

Context threading was designed to align virtual machine state with physical machine state to improve branch prediction and reduce pipeline branch hazards. We begin our evaluation by examining how well we have met this goal.
Figure 5 reports the extent to which context threading reduces pipeline branch hazards for the OCaml benchmarks, while Figure 6 reports these results for the Java benchmarks on SableVM. On the left of each figure, the graphs labeled (a) present the results on the P4, where we count mispredicted taken branches (MPT). On the right, the graphs labeled (b) present the effect on LR/CTR stall cycles on the PPC7410. The last cluster on each bar graph reports the geometric mean across all benchmarks.

[Figure 5. OCaml Pipeline Hazards Relative to Direct Threading. (a) Pentium IV MPT; (b) PPC7410 LR/CTR stall cycles. Bars SUB, BRANCH, CONTEXT and TINY for boyer, fft, fib, genlex, kb, nucleic, quicksort, sieve, soli, takc, taku and the geometric mean.]

[Figure 6. SableVM Pipeline Hazards Relative to Direct Threading. (a) Pentium IV MPT; (b) PPC7410 LR/CTR stall cycles. Bars SUB, BRANCH, CONTEXT, TINY and SELECT for compress, db, jack, javac, jess, mpeg, mtrt, raytrace, scimark, soot and the geometric mean.]

Context threading eliminates most of the mispredicted taken branches (MPT) on the Pentium IV and LR/CTR stall cycles on the PPC7410, with similar overall effects for both interpreters. Examining Figures 5 and 6 reveals that subroutine threading has the single greatest impact, reducing MPT by an average of 75% for OCaml and 85% for SableVM on the P4, and reducing LR/CTR stalls by 60% and 75%, respectively, on the PPC7410. This result matches our expectations because subroutine threading addresses the largest single source of unpredictable branches—the dispatch used for all straight-line bytecodes. Branch inlining has the next largest effect, as expected, since conditional branches are the most significant remaining pipeline hazard after applying subroutine threading. On the P4, branch inlining cuts the remaining MPTs by about 60%. On the PPC7410, the effect of branch inlining is smaller, though still important, eliminating about 25% of the remaining LR/CTR stall cycles. A notable exception to the MPT trend occurs for the tiny OCaml benchmarks fib, takc and taku. In these tiny, recursive benchmarks, de-virtualizing the conditional branches hurts prediction by a small amount. As noted previously, even minor changes in the behavior of a single instruction can have a noticeable impact for these benchmarks.

Interestingly, the same OCaml benchmarks that are a challenge for branch inlining on the P4 also reap the greatest benefit from apply/return inlining, as shown in Figure 5(a). Due to the recursive nature of these benchmarks, their performance is dominated by the behavior of virtual calls and returns. Thus, mapping these virtual operations to native calls and returns has an enormous impact. For sieve, the only non-recursive OCaml benchmark, the result of apply/return inlining is an increase in the MPT, while the overall effect is small on both platforms.

For SableVM on the P4, however, apply/return inlining is restricted by the fact that SableVM uses the processor's esp register directly. Rather than implement a complicated stack-switching technique, as discussed in Section 4.4, we allow the virtual and machine stacks to become mis-aligned when SableVM manipulates the esp register directly. This reduces the effectiveness of our apply/return inlining implementation and of the return address stack predictor, as can be seen in the bar labeled CONTEXT in Figure 6(a). On the PPC7410, the effect of apply/return inlining on LR/CTR stalls is very small for SableVM.

Having shown that our techniques can significantly reduce pipeline branch hazards, we now examine the impact of these reductions on overall execution time.

5.3 Performance

Context threading improves branch prediction, resulting in better pipeline usage on both the P4 and the PPC. However, using a native call/return pair for each dispatch increases the number of instructions executed. In this section, we examine the net result of these two effects on overall execution time. As before, all data is reported relative to direct threading.

[Figure 7. OCaml Elapsed Time Relative to Direct Threading. (a) Pentium IV (TSC); (b) PPC7410 (cycles). Bars SUB, BRANCH, CONTEXT and TINY.]

[Figure 8. SableVM Elapsed Time Relative to Direct Threading. (a) Pentium IV (TSC); (b) PPC7410 (cycles). Bars SUB, BRANCH, CONTEXT, TINY and SELECT.]

[Figure 9. PPC970 Elapsed Time Relative to Direct Threading. (a) SableVM, elapsed (real) seconds; (b) OCaml, elapsed (real) seconds.]

Figures 7 and 8 show results for the OCaml and SableVM benchmarks, respectively. They are organized in the same way as in the previous section, with P4 results on the left, labeled (a), and PPC7410 results on the right, labeled (b). Figure 9 reports results on the PPC970 CPU, with SableVM on the left, labeled (a), and OCaml on the right, labeled (b).

The geometric means (rightmost cluster) in Figures 7, 8 and 9 show that context threading significantly outperforms direct threading on both virtual machines and on all three architectures. The geometric mean execution time of the OCaml VM is about 19% lower for context threading than for direct threading on the P4, 9% lower on the PPC7410, and 39% lower on the PPC970. For SableVM, context threading, compared with direct threading, runs about 17% faster on the PPC7410 and 26% faster on both the P4 and the PPC970. Although we cannot measure the cost of LR/CTR stalls on the PPC970, the greater reductions in execution time are consistent with its more deeply-pipelined design (23 stages vs. 7 for the PPC7410).

Across interpreters and architectures, the effect of our techniques is clear. Subroutine threading has the single largest impact on elapsed time. Branch inlining has the next largest impact, eliminating an additional 3–7% of the elapsed time. In general, the reductions in execution time track the reductions in branch hazards seen in Figures 5 and 6.
The instruction overheads of our dispatch technique are most evident in the OCaml benchmarks fib and takc on the P4, where the benefits of improved branch prediction (relative to direct threading) are minor. In these cases, the opcode bodies are very small and the extra instructions executed for dispatch are the dominant factor.

The effect of apply/return inlining on execution time is minimal overall, changing the geometric mean by only ±1% with no discernible pattern. Given the limited performance benefit and added complexity, a general implementation of apply/return inlining does not seem worthwhile. Ideally, one would like to detect heavy recursion automatically, and only perform apply/return inlining when needed. We conclude that, for general usage, subroutine threading plus branch inlining provides the best trade-off.

We now demonstrate that context-threaded dispatch is complementary to inlining techniques.

5.4 Inlining

Inlining techniques address the context problem by replicating bytecode bodies and removing dispatch code. This reduces both the instructions executed and the pipeline hazards. In this section we show that, although both selective inlining and our context threading technique reduce pipeline hazards, context threading is slower because of instruction overhead. We address this issue by comparing our own tiny inlining technique with selective inlining.

In Figures 6, 8 and 9(a), the bar labeled SELECT shows our measurements of Gagnon's selective inlining implementation for SableVM [11]. From these figures, we see that selective inlining reduces both MPT and LR/CTR stalls significantly as compared to direct threading, but it is not as effective in this regard as subroutine threading alone. The larger reductions in pipeline hazards for context threading, however, do not necessarily translate into better performance over selective inlining.
Figure 8(a) illustrates that SableVM's selective inlining beats context threading on the P4 by roughly 5%, whereas on the PPC7410 and the PPC970 both techniques have roughly the same effect on execution time, as shown in Figure 8(b) and Figure 9(a), respectively. These results show that reducing the pipeline hazards caused by dispatch is not sufficient to match the performance of selective inlining. By eliminating some dispatch code, selective inlining can do the same real work with fewer instructions than context threading.

Context threading is only a dispatch technique, and can easily be combined with inlining strategies. To investigate the impact of dispatch instruction overhead, and to demonstrate that context threading is complementary to inlining, we implemented Tiny Inlining, a simple heuristic that inlines all bodies with a length less than four times the length of our dispatch code. This eliminates the dispatch overhead surrounding the smallest bodies and, as calls in the CTT are replaced with comparably-sized bodies, tiny inlining ensures that the total code growth is minimal. In fact, the smallest inlined OCaml bodies on the P4 were smaller than the length of a relative call instruction.

Table 3. Selective Inlining vs. Context+Tiny (SableVM)

    Arch      Context (C)   Selective (S)   Tiny (T)   ∆(S−C)    ∆(S−T)
    P4        0.762         0.721           0.731      −0.041    −0.010
    PPC7410   0.863         0.914           0.839       0.051     0.075
    PPC970    0.753         0.739           0.691      −0.014     0.048

Table 3 summarizes the effect of tiny inlining. On the P4, we come within 1% of SableVM's sophisticated selective inlining implementation. On PowerPC, we outperform SableVM by 7.8% for the PPC7410 and 4.8% for the PPC970.

The primary costs of direct-threaded interpretation are pipeline branch hazards, caused by the context problem. Context threading solves this problem by correctly deploying branch prediction resources, and as a result outperforms direct threading by a wide margin. Once the pipelines are full, the secondary cost of executing dispatch instructions is significant. A suitable technique for addressing this overhead is inlining, and we have shown that context threading is compatible with inlining using the "tiny" heuristic. Even with this simple approach, context threading achieves performance equivalent to, or better than, selective inlining.
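A sketch of the tiny inlining decision at CTT-generation time (ours, reusing the emit_call helper from the earlier CTT sketch; DISPATCH_LEN, the body_len parameter and the copy step are illustrative assumptions — real code must also verify that the body is relocatable, i.e., position-independent, before copying):

    /* Tiny inlining: bodies shorter than four times the dispatch
     * sequence are memcpy'd into the CTT in place of the call,
     * eliminating call/return overhead for the smallest bodies.
     * body_len is assumed to exclude the body's trailing native
     * return, which must not be copied. */
    enum { DISPATCH_LEN = 5 };   /* bytes of "call rel32" on IA-32 */

    static uint8_t *emit_virtual_insn(uint8_t *ctt, void *body,
                                      size_t body_len) {
        if (body_len < 4 * DISPATCH_LEN) {   /* the "tiny" heuristic */
            memcpy(ctt, body, body_len);     /* relocatable body assumed */
            return ctt + body_len;
        }
        return emit_call(ctt, body);         /* fall back to dispatch */
    }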
of pipeline branch hazards. We reduce mean branch mis- Springer, 2003. predictions by 95% on the P4 and reduce mean LR/CTR [12] G. Hinton, D. Sagar, M. Upton, D. Boggs, D. Carmean, branch unit stall cycles by between 76% and 82% on the A. Kyker, and P. Roussel. The microarchitecture of the Pen- PowerPC 7410. Relative to direct threading, context thread- tium 4 processor. Intel Technology Journal, Q1, 2001. ing reduces mean elapsed time by 19 to 27% on the Pentium [13] Intel Corporation. IA-32 Intel Architecture Software De- veloper’s Manual Volume 3: System Programming Guide. IV, 14% on the PowerPC 7410 and by 20 to 37% on the 2004. PowerPC 970. [14] P. M. Kogge. An architectural trail to threaded- code sys- Context threading performs better than direct threading tems. IEEE Computer, 15(3), March 1982. but worse than direct-threaded inlining. We implemented a [15] I. Piumarta and F. Riccardi. Optimizing direct threaded code simple inlining scheme in which we inline only very small by selective inlining. In SIGPLAN Proc. PLDI, pages 291– bytecodes. On the P4 we come within 1% of SableVM’s 300, 1998. sophisticated selective inlining implementation. On the [16] R. Pozo and B. Miller. SciMark: a numerical benchmark for PPC970 we outperform SableVM’s implementation of se- Java and C/C++., 1998. [17] B. Rodriguez. Benchmarks and case studies of forth kernels. lective inlining by 4.8%. The Computer Journal, 60, 1993. Context threading is easy to implement, and provides a [18] T. H. Romer, D. Lee, G. M. Voelker, A. Wolman, W. A. significant performance advantage over direct threading. It Wong, J.-L. Baer, B. N. Bershad, and H. M. Levy. The struc- addresses the main obstacle to high performance virtual in- ture and performance of interpreters. In Proc. ASPLOS 7, struction dispatch by exposing the virtual program’s con- pages 150–159, October 1996. trol flow to the hardware’s control flow predictors. Fur- [19] M. Rossi and K. Sivalingam. A survey of instruction dis- thermore, the context threading technique is simple, so the patch techniques for byte-code interpreters. Technical Re- code remains flexible and supports other optimizations such port TKO-C79, Helsinki University Faculty of Information Technology, May 1996. as inlining and code specialization. We thus conclude that [20] SPEC jvm98 benchmarks, 1998. context threading is a better technique for baseline interpre- [21] T. Suganuma, T. Ogasawara, M. Takeuchi, T. Yasue, tation and may be an attractive environment for dynamic M. Kawahito, K. Ishizaki, H. Komatsu, and T. Nakatani. optimization. Overview of the IBM Java just-in-time compiler. IBM Sys- tems Journals, Java Performance Issue, 39(1), February 2000. References [22] G. T. Sullivan, D. L. Bruening, I. Baron, T. Garnett, and S. Amarasinghe. Dynamic native optimization of inter- [1] The Java hotspot virtual machine, v1.4.1, technical white pa- preters. In Proc. of the Workshop on Interpreters, Virtual per. 2002. Machines and Emulators, 2003. [2] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: a trans- [23] R. Vall´ee-Rai, P. Co, E. Gagnon, L. Hendren, P. Lam, and parent dynamic optimization system. In SIGPLAN Confer- V. Sundaresan. Soot - a java bytecode optimization frame- ence on Programming Language Design and Implementa- work. In Proceedings of the 1999 conference of the Centre tion, pages 1–12, 2000. for Advanced Studies on Collaborative research, page 13. [3] M. Corporation. MPC7410 RISC Microprocessor Program- IBM Press, 1999. mer’s Manual. 2002. [24] B. Vitale. 
References

[1] The Java HotSpot virtual machine, v1.4.1, technical white paper, 2002.
[2] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: a transparent dynamic optimization system. In SIGPLAN Conference on Programming Language Design and Implementation, pages 1–12, 2000.
[3] Motorola Corporation. MPC7410 RISC Microprocessor Programmer's Manual, 2002.
[4] Motorola Corporation. MPC7410/MPC7400 RISC Microprocessor User's Manual, Rev. 1, 2002.
[5] C. Curley. Life in the FastForth lane. Forth Dimensions, 14(4), January–February 1993.
[6] C. Curley. Optimizing in a BSR/JSR threaded Forth. Forth Dimensions, 14(5), March–April 1993.
[7] M. A. Ertl. The structure and performance of efficient interpreters. Journal of Instruction-Level Parallelism, 2003.
[8] M. A. Ertl and D. Gregg. The behavior of efficient virtual machine interpreters on modern architectures. Lecture Notes in Computer Science, 2150, 2001.
[9] M. A. Ertl and D. Gregg. Optimizing indirect branch prediction accuracy in virtual machine interpreters. In Proc. ACM SIGPLAN PLDI, 2003.
[10] M. A. Ertl, D. Gregg, A. Krall, and B. Paysan. VMgen — a generator of efficient virtual machine interpreters. Software Practice and Experience, 32:265–294, 2002.
[11] E. Gagnon and L. J. Hendren. Effective inline-threaded interpretation of Java bytecode using preparation sequences. In Proceedings of the 12th International Conference on Compiler Construction, volume 2622 of Lecture Notes in Computer Science, pages 170–184. Springer, 2003.
[12] G. Hinton, D. Sagar, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The microarchitecture of the Pentium 4 processor. Intel Technology Journal, Q1, 2001.
[13] Intel Corporation. IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide, 2004.
[14] P. M. Kogge. An architectural trail to threaded-code systems. IEEE Computer, 15(3), March 1982.
[15] I. Piumarta and F. Riccardi. Optimizing direct threaded code by selective inlining. In SIGPLAN Proc. PLDI, pages 291–300, 1998.
[16] R. Pozo and B. Miller. SciMark: a numerical benchmark for Java and C/C++, 1998.
[17] B. Rodriguez. Benchmarks and case studies of Forth kernels. The Computer Journal, 60, 1993.
[18] T. H. Romer, D. Lee, G. M. Voelker, A. Wolman, W. A. Wong, J.-L. Baer, B. N. Bershad, and H. M. Levy. The structure and performance of interpreters. In Proc. ASPLOS 7, pages 150–159, October 1996.
[19] M. Rossi and K. Sivalingam. A survey of instruction dispatch techniques for byte-code interpreters. Technical Report TKO-C79, Helsinki University Faculty of Information Technology, May 1996.
[20] SPEC JVM98 benchmarks, 1998.
[21] T. Suganuma, T. Ogasawara, M. Takeuchi, T. Yasue, M. Kawahito, K. Ishizaki, H. Komatsu, and T. Nakatani. Overview of the IBM Java just-in-time compiler. IBM Systems Journal, Java Performance Issue, 39(1), February 2000.
[22] G. T. Sullivan, D. L. Bruening, I. Baron, T. Garnett, and S. Amarasinghe. Dynamic native optimization of interpreters. In Proc. of the Workshop on Interpreters, Virtual Machines and Emulators, 2003.
[23] R. Vallée-Rai, P. Co, E. Gagnon, L. Hendren, P. Lam, and V. Sundaresan. Soot — a Java bytecode optimization framework. In Proceedings of the 1999 Conference of the Centre for Advanced Studies on Collaborative Research, page 13. IBM Press, 1999.
[24] B. Vitale. Catenation and operand specialization for Tcl VM performance. In Proc. 2nd IVME, pages 42–50, 2004.