YETI: Gradually Extensible Trace Interpreter Mathew Zaleski, University of Toronto

[email protected]

(thesis proposal) Overview

‣Introduction • Background • Efficient Interpretation • Our Approach to Mixed-Mode Execution • Results and Discussion

Thesis Proposal Jan 2006 2 Why so few JIT compilers?

• Complex JIT infrastructure built in “big bang”, before any generated code can run. • Rather than incrementally extend the interpreter, typical JITs is built alongside. • The code generator of current JIT compilers makes little provision to reuse the interpreter. • The method-orientation of most JITs means that cold code is compiled with hot. ‣ Interpreters should be more gradually extensible to become dynamic compilers.

Thesis Proposal Jan 2006 3 Problems with current practice

• Packaging of virtual instruction bodies is: • Inefficient: Interpreters slowed by branch misprediction • Non-reusable: JIT compilers must implement all virtual instructions from scratch • Method orientation of a JIT compiler forces it to compile cold code along with hot. • Code compiled cold requires complex runtime to perform late binding if it runs. • Recompiling cold code that becomes hot requires complex recompilation infrastructure.

Thesis Proposal Jan 2006 4 Our Approach • Branch prediction problems of interpretation can be addressed by calling the virtual bodies. • Can speed up interpretation significantly. • Enables generated code to call the bodies. • JIT need not support all virtual instructions. • Complexity of compiling cold code can be side stepped by compiling dynamically selected regions that contain only hot code. • We describe how compiling traces allows us to compile only hot code and link on newly hot regions as they emerge. ‣ Enables gradual enhancement of interpreter

Thesis Proposal Jan 2006 5 Overview of Contribution

method based interpretation JIT • Callable bodies make for efficient interpretation. efficiency big bang • Reuse of callable bodies challenges development from generated code smooths “big bang”. • A trace oriented JIT reuse callable compiler is a simple and callable bodies bodies/traces promising architecture. YETI

Thesis Proposal Jan 2006 6 Gradual extension of VM

xTraces - compile all virtual instructions

✔ Traces- just integer x instructions compiled irtual instructions

V SUB Basic Traces - sub dispatch interpreter Blocks x x x Size and Complexity of Compiled Code Regions Larger regions. More instructions compiled.

Thesis Proposal Jan 2006 7 Result preview - Efficient Interpretation

• Branch misprediction dealt with by calling Subroutine Threading the bodies from region of generated code. 1.0 Threading 0.8 0.90 • Relative to Direct 0.81 0.82 Threaded VM 0.6 0.63 0.4 • Geo mean 0.2 Java SpecJVM98 0 • Relative to Direct benchmarks va/P4 ja va/PPC Ocaml benchmarks ja ocaml/P4 • ocaml/PPC

Thesis Proposal Jan 2006 8 Result preview - Trace based JIT

• Geom mean SPECJVM98 Java/PPC970 relative to Sun Hotspot JIT 6 5.9 SABVM 5.7 • 5 • Selective inlining 4 4.3 Modified JamVM. 3 • 2 TR-LINK = traces,no JIT • 1

• JIT = trace, JIT Relative to Sun Hotspot 0 • Only 50 integer JIT SABVM bytecodes TR-LINK • Promising start

Thesis Proposal Jan 2006 9 Background & Related Work

Ertl & Gregg Branch misprediction Piumarta & Riccardi Selective inlining Parrot (perl6) Callable bodies Vitale, Abdelrahman Catenation, Tcl Bala, Duesterwald, Banerjia Dynamo Bruening, Garnett ,Amarasinghe Dynamo Rio Whaley Partial methods Gal, Probst, Franz Hotpath, Trace-based JIT Suganuma,Yasue,Nkatani Region based compilation Hozle, Chambers, Ungar Self Many Java, JVM and JIT authors Java

Thesis Proposal Jan 2006 10 Overview

•Introduction ‣ Background: ‣ Dynamo & Traces ‣ Interpretation • Our Approach to Mixed-Mode Execution • Results and Discussion

Thesis Proposal Jan 2006 11 HP Dynamo

• Trace-oriented dynamic optimization system. • HP PA-8000 computers. • Counter-Intuitive approach: • Don’t execute optimized binary interpret it. • Count transits of reverse branches. • Trace-generate (next slide). • Dispatch traces when encountered. • Soon, most execution from trace cache. • faster than binary on hardware of the day!

Thesis Proposal Jan 2006 12 Trace with if-then-else

// => b2 if (c) c b1; • Trace is path else b1 b2 followed by program b2; b3; b3 • Conditional branches become trace exits. c texit b1 • Do not expect trace b2 exits to be taken. b3

Thesis Proposal Jan 2006 13 Overview

•Introduction ‣ Background: • Dynamo & Traces ‣ Interpretation • Efficient Interpretation • Our Approach • Selecting Regions • Results and Discussion

Thesis Proposal Jan 2006 14 Virtual Program

Java Java Source Bytecode

int f(boolean); int f(boolean parm){ Code: if (parm){ 0: iload_1 return 42; Javac 1: ifeq 7 }else{ compiler 4: bipush 42 return 0; 6: ireturn 7: iconst_0 } 8: ireturn }

Thesis Proposal Jan 2006 15 Interpreter

Loaded fetch execute Program Load dispatch Parms

Internal Bytecode Representation bodies Execution Cycle

Thesis Proposal Jan 2006 16 Switched Interpreter

vPC = internalRep;

while(1){ switch(*vPC++){ case iload_1: .. break;

case ifeq: .. break; //and many more..

} }; slow. Burdened by switch and loop overhead.

Thesis Proposal Jan 2006 17 Call Threaded Interpreter

void iload_1(){ //push load 1 vPC++; }

void ifeq(){ //change vPC vPC++; } static int *vPC = internalRep;

interp(){ while(1){ (*vPC)(); } }; slow. burdened by function pointer call

Thesis Proposal Jan 2006 18 Direct Threaded Interpreter

iload_1: .. goto *vPC++;

ifeq: -Execution of int f(boolean); if () vPC= Code: goto *vPC++; virtual program 0: iload_1 bipush: .. “threads” 1: ifeq 7 goto *vPC++; 4: bipush 42 ireturn: through bodies 6: ireturn .. 7: iconst_0 goto *vPC++; 8: ireturn iconst_0: .. goto *vPC++;

ireturn: .. goto *vPC++;

‣ Good: one dispatch branch taken per body

Thesis Proposal Jan 2006 19 Context Problem

iload_1: .. goto *vPC++;

ifeq: int f(boolean); if () vPC= Code: goto *vPC++; Virtual PC predicts 0: iload_1 bipush: destination. .. 1: ifeq 7 goto *vPC++; 4: bipush 42 ireturn: 6: ireturn .. Hardware PC 7: iload_1 goto *vPC++; 8: ireturn iconst_0: insufficient context .. goto *vPC++;

ireturn: .. goto *vPC++;

‣ Bad: hardware has no context to predict dispatch

Thesis Proposal Jan 2006 20 Overview

✓ Introduction ✓ Background ‣ Efficient Interpretation • Our Approach to Mixed-Mode Execution • Results and Discussion

Thesis Proposal Jan 2006 21 Virtual Program

Java Java Source Bytecode

int f(boolean); int f(boolean parm){ Code: if (parm){ 0: iload_1 return 42; Javac 1: ifeq 7 }else{ compiler 4: bipush 42 return 0; 6: ireturn 7: iconst_0 } 8: ireturn }

Thesis Proposal Jan 2006 22 Direct Threaded Interpreter

… vPC DTT iload_1: iload_1 &&iload_1 .. ifeq 7 &&ifeq goto *vPC++; bipush 42 +4 ifeq: ireturn &&bipush if () vPC= iconst_0 42 goto *vPC++; ireturn &&ireturn … &&iconst_0 bipush: &&ireturn .. goto *vPC++; Virtual DTT - Direct C implementation Program Threading Table of each body DTT maps vPC to implementation

Thesis Proposal Jan 2006 23 Context Problem

iload_1: .. goto *vPC++;

ifeq: int f(boolean); if () vPC= Code: goto *vPC++; Virtual PC predicts 0: iload_1 bipush: destination. .. 1: ifeq 7 goto *vPC++; 4: bipush 42 ireturn: 6: ireturn .. Hardware PC 7: iload_1 goto *vPC++; 8: ireturn iconst_0: insufficient context .. goto *vPC++;

ireturn: .. goto *vPC++;

‣ Bad: hardware has no context to predict dispatch

Thesis Proposal Jan 2006 24 Essence of Subroutine Threading

DTT Context Threading Table ret terminated CTT call iload_1 iload_1: 4 .. call ifeq asm(ret); call bipush 42 call ireturn ifeq: vPC=.. call iconst_0 goto *vPC; call ireturn virtual branches as points to generated direct code Package bodies as subroutines and call them

Thesis Proposal Jan 2006 25 Context Threading (CT) -- Generating specialized code in CTT

call iload_1 vPC subl inlined code movl ;pop stack for if_eq 4 cmpl ;compare 0 jne movl ; vPC = jmp ; else context for addl ; vPC += conditional call bipush branches call ireturn DTT call iconst_0 call ireturn Specialized bodies can also be generated in CTT!

Thesis Proposal Jan 2006 26 CT Performance !"#$%&'()*+',#*-.#(,#/)0

Subroutine Branch Inlining Tiny Inlining 1.00

,"+ 0.75

/0#)%*'12 0.50

0.25 !"#$%&'()*+ -'#).,+

0.00

4 c 4 c /p p /p p a /p l /p v a m l a v a m j a c a j o c o ‣ CT! 1%2*is an&#$3)'4%#* efficient interpretation'5*#66#$&'7#* technique/)8*.#)#2/9*

CThesisontext Th rProposaleading Jan 2006 24 27 Overview

✓ Introduction ✓ Background: Interpretation & traces ✓ Efficient Interpretation ‣ Our Approach to Mixed-Mode Execution • Selecting Regions • Results and Discussion

Thesis Proposal Jan 2006 28 Gradually Extensible Trace Interpreter

Three main enablers: 1. Bodies organized as callable routines so executable regions can efficiently mix compiled code and dispatched bodies. 2. The DTT can point to variously shaped execution units. 3. Efficient, flexible instrumentation.

Thesis Proposal Jan 2006 29 1. Bodies are callable

Packaging bytecode bodies as lightweight subroutines means that they can be efficiently called from generated code

iload_1: call iload_1 .. call iload_1 ret; specialized code for iadd istore: call istore_1 .. ret; ‣ Needn’t build all virtual instructions all in one shot.

Thesis Proposal Jan 2006 30 2. DTT always points to implementation

..of corresponding execution unit

vPC iload_1: .. ret; DTT call iload_1 call iload_1 JIT compiled code istore: for iadd .. call istore_1 ret; ‣ DTT indirection enables any shape of execution unit to be dispatched.

Thesis Proposal Jan 2006 31 3. Flexible, Efficient Instrumentation

A dispatcher describes an execution unit

while(1){ //dispatch loop d = vPC->dipatcher; pay = d->payload; body or generated (*d->pre)(vPC,pay,&tcs); code for region (*d->body)(); (*d->post)(vPC,pay,&tcs); payload } data specific to execution unit DTT dispatcher body preworker(){ payload } pre postworker(){ post } ‣ Our runtime active before and after each dispatch

Thesis Proposal Jan 2006 32 Overview

✓ Introduction ✓ Background: Interpretation & traces • Our Approach to Mixed-Mode Execution • Selecting Regions ‣ Basic Blocks • Traces • Results and Discussion

Thesis Proposal Jan 2006 33 Basic Block Detection

Java Java Source Bytecode

int f(boolean); int f(boolean parm){ Code: if (parm){ 0: iload_1 return 42; 1: ifeq 7 }else{ 4: bipush 42 return 0; 6: ireturn 7: iconst_0 } 8: ireturn }

Thesis Proposal Jan 2006 34 Basic Block Detection

DTT t_dispatcher while(1){ //dispatch loop body=BB0body=&&iload_1 d = vPC->dipatcher; payload=bb0payload pay = d->payload; pre (*d->pre)(vPC,pay,&tcs); post (*d->body)(); (*d->post)(vPC,pay,&tcs); t_dispatcher } body=&&ifeq payload pre 0: iload_1 post 1: ifeq 7 4: bipush 42 t_thread_context 6: ireturn historyList iload_1 ifeq 7: iconst_0 recordMode=01 8: ireturn

Thesis Proposal Jan 2006 35 Generated code for a basic block • Could JIT the bb, currently we generate “subroutine threading style” code for it.

DTT t_dispatcher body bb0 iload_1 payload pre call iload_1 bb0 ifeq post call if_eq 4 ret t_dispatcher body iconst_0 payload bb1 pre call iconst_0 bb1 ireturn post call ireturn ret Basic block is a run-time superinstruction

Thesis Proposal Jan 2006 36 C Code for interp function

static t_dispatcher init[256]={}; interp(t_dispatcher *dtt){ t_dispatcher *vPC = dtt; all in one C function • t_thread_context tcs; thread private are C iload: • //real work of iload here.. local variables vPC++; //to next instruction asm volatile(“ret”); • loader initializes iadd: //and many more bodies DTT to static dispatchers //dispatch loop while(1){ • dispatch loop calls d = vPC->dispatcher; instrumentation pay = d->payload; (*d->pre)(vPC,pay,&tcs); specific to dispatcher (*d->body)(); (*d->post)(vPC,pay,&tcs); }

Thesis Proposal Jan 2006 37 Overview

✓ Introduction ✓ Background: Interpretation & traces ✓ Our Approach to Mixed-Mode Execution • Selecting Regions ✓ Basic Blocks ‣ Traces • Results and Discussion

Thesis Proposal Jan 2006 38 Detecting Traces

• Use Dynamo’s trace detection heuristic. • Instrument reverse branches until they are hot. • Postworker of basic block dispatcher • Trace generate starting from hot reverse branch: • Much like bb’s were recorded • Postworker of each basic block region adds each bb to thread private history list. • Eventually creates new trace dispatcher • Hold off generating code until after trace has “trained” a few times..

Thesis Proposal Jan 2006 39 Trace with if-then-else

//c => b2 if (c) c b1; • Trace is path else b1 b2 followed by program b2; b3; b3 • Conditional branches become trace exits. c texit b1 • Do not expect trace b2 exits to be taken. b3

Thesis Proposal Jan 2006 40 Subroutine threading style code for a Trace

• Dispatch virtual instructions for trace

DTT iload_1 trace-b0-b1 trace exit handlers bb0 ifeq call iload_1 TEH 4 trace_exit_eq .. call iconst_0 ret iconst_0 trace_exit_iret TEH ireturn bb1 ..caller code...... ret Trace is super-super instruction

Thesis Proposal Jan 2006 41 Trace Exits

trace-b0-b1 (a) Two-way: from conditional call iload_1 branches (a) trace_exit_eq • one leg on trace call iconst_0 • other leg off trace (b) trace_exit_iret (b) Multiway: from invokes and returns ... • one leg on trace • potentially many legs off trace

Thesis Proposal Jan 2006 42 Trace Exit Handlers

• Code runs when trace exit trace-b0-b1 is taken before return to TEH call iload_1 interpreter tcs=.. trace_exit_eq ret Record which trace exit has call iconst_0 • occurred in thread context trace_exit_iret ret TEH • Return to dispatch loop tcs=.. Housekeeping roles: ret • • flush state of JIT code • Trace linking

Thesis Proposal Jan 2006 43 Trace linking

trace-1 • When trace exit is hot Hot trace exit and destination is a trace • Rewrite ret at end of TEH trace exit handler as jmp retjmp to destination trace • Only use of code Destination Trace trace-2 rewriting in system

Thesis Proposal Jan 2006 44 Trace JIT

• Generate native code for trace exits • A lot like branch inlining from CT system. • Optimize invokevirtual when call and return occur in same trace. • Naive register allocation scheme • Only handle 50 integer/object virtual instructions • Do virtual instructions one-by-one • Relatively easy debugging • Floating point should be easy.

Thesis Proposal Jan 2006 45 Overview

✓ Introduction ✓ Background: Interpretation & traces ✓ Our Approach to Mixed-Mode Execution ‣ Results and Discussion • Data • Discussion • Remaining Work in this dissertation

Thesis Proposal Jan 2006 46 Implementation

• Modify JamVM 1.1.3 to be SUB threaded • Gradually extend it to: • Detect, execute subroutine style basic blocks • Detect, execute subroutine style traces • Link traces • Compile traces.

Thesis Proposal Jan 2006 47 Region Shape • As execution units become larger • Trips around dispatch loop become less frequent • Next show data to justify “step back” approach. • Very simple experiment: • Modify dispatch loops to count iterations. Condition Description DCT Direct Call Threading BB CT-style Basic Blocks TR Traces (no link, no JIT) TR-LINK Linked Traces (no JIT)

Thesis Proposal Jan 2006 48 Region Shape Effect on Dispatch Count

Legend 1e10 DCT BB TR 1e8 TR-LINK

1e6

1e4 log dispatch count 1e2

1e0 db ray mtrt jess jack javac mpeg scitest geomean compress Spec JVM98 Benchmarks

Thesis Proposal Jan 2006 49 How efficient is profiling system?

• Run instrumentation without the JIT. • Are the intermediate versions of Java viable? • Include SUB threading in comparison: • Since it is an efficient dispatch technique. • Report elapsed time relative to distro of JamVM

Thesis Proposal Jan 2006 50 Profiling/Instrumentation Overhead

Legend 2.00

2.0 1.87 SUB DCT 1.69 BB 1.51 1.49 1.45 1.42 1.41 1.39 1.5 1.38 TR 1.33

1.22 TR-LINK 1.18 1.18 1.15 1.15 1.11 1.05 0.99 0.96 0.95 0.90 0.88 0.88 0.88 0.88 0.87 0.87 0.87

1.0 0.85 0.84 0.84 0.83 0.82 0.81 0.80 0.80 0.79 0.78 0.78 0.77 0.77 0.75 0.73 0.72 0.72 0.68 0.66 0.60 0.59 0.5

Elapsed time relative to jam-distro 0.0 db ray mtrt jess jack javac mpeg scitest geomean compress Spec JVM98 Benchmarks

Thesis Proposal Jan 2006 51 Performance of simple JIT

• Compare YETI performance WITHOUT JIT to selective inlining SableV • Compare YETI with preliminary trace based JIT to Sun’s Hotspot optimizing compiler • Not much basis for comparison to Hotpath Condition Description TR-LINK Linked Traces SABVM SableVM selective inlining JIT Traces (JIT and Link)

Thesis Proposal Jan 2006 52 JIT Performance relative to Sun Hotspot

Legend 13.73 14 SABVM 12.26 11.96 11.87 TR-LINK 11.65 11.46 12 JIT

10 9.71 8.78 8.14 7.95

8 7.20 5.94 5.73 5.69 5.49 5.21 6 5.17 4.31 4.25 4.07 3.92 3.45 3.23

4 3.06 2.86 2.59 2.43 2.31 1.93 2 1.50

Elapsed time relative to hotspot 0 db ray mtrt jess jack javac mpeg scitest geomean compress Spec JVM98 Benchmarks

Thesis Proposal Jan 2006 53 Overview

✓ Introduction ✓ Background: Interpretation & traces ✓ Our Approach ✓ Selecting Regions • Results and Discussion ✓ Data ‣ Discussion (and future work) • Remaining Work in this dissertation

Thesis Proposal Jan 2006 54 Gradual performance improvement

Geometric Mean SpecJVM98 Elapsed Time PPC970 • Performance improves as effort invested. 2 SUB very effective for Threading • 1.51 lightweight bodies 1.15 BB not viable by itself 1 • 0.78 0.80 0.83 0.75 • TR-LINK about same 0.57 as CT/branch inlining. 0 JIT preliminary. • BB TR JIT SUB DCT SABVM TR-LINK Relative to JamVM 1.3.3 Direct

Thesis Proposal Jan 2006 55 Discussion

• We have demonstrated how to build an interpreter that is simple and yet as efficient as SableVM and JamVM. • We have shown that our interpreter can be gradually extended to identify, link and compile traces. • We have shown how generated code can reuse callable virtual instruction bodies. • Our JIT, although it has no optimizer, only supports 50 Java virtual instructions, improves performance by 24%. • More instructions, better performance?

Thesis Proposal Jan 2006 56 2D vision of Incremental VM lifecycle..

xTraces - compile all virtual instructions Basic blocks - just integer virtual instructions Traces- just integer x ✔ instructions

SUB Basic interpreter Blocks Traces - sub dispatch Number of virtual instructions compiled ✔ ✔ ✔ Complexity of Compiled Code Regions Have explored this space

Thesis Proposal Jan 2006 57 Application

• If I had to build a new interpreter. • for “lightweight” bodies, so dispatch matters. • Start with bodies that can be conditionally compiled to be either direct threaded or callable. • Bring up the system using DCT because the dispatch loop makes it easier to debug. • e.g. Logging from dispatch loop is very helpful. • Primary platforms would run SUB and secondary platforms would run direct threading. • Gradually extend as described.

Thesis Proposal Jan 2006 58 Future Work

• Work could go in many different directions. • Apply to another language system • JavaScript? Python? Fortress? • Deal with polymorphic bytecodes • Extend JIT/Optimizer • Explore performance potential • Need a lot more infrastructure (e.g. IR) • Package infrastructure for others to apply.

Thesis Proposal Jan 2006 59 Overview

✓ Introduction ✓ Background: Interpretation & traces ✓ Our Approach ✓ Selecting Regions • Results and Discussion ✓ Data ✓ Discussion ‣ Remaining work in this dissertation

Thesis Proposal Jan 2006 60 Proposed Work for winter 2007.

• Infrastructure to measure: • Compile time; • Proportion of virtual instructions executed from compiled code. • Add float register class. • scimark, ray, Linpack would likely benefit. • Compile Basic Blocks • Long bb benchmarks will benefit. • Write, write, write.

Thesis Proposal Jan 2006 61 • Back 12 • Efficient 23 • Our 28 • BB 33 • TR 38 • RES 46 • CONC 54 Thesis Proposal Jan 2006 62