Thesis Proposal) Overview
Total Page:16
File Type:pdf, Size:1020Kb
YETI: Gradually Extensible Trace Interpreter Mathew Zaleski, University of Toronto [email protected] (thesis proposal) Overview ‣Introduction • Background • Efficient Interpretation • Our Approach to Mixed-Mode Execution • Results and Discussion Thesis Proposal Jan 2006 2 Why so few JIT compilers? • Complex JIT infrastructure built in “big bang”, before any generated code can run. • Rather than incrementally extend the interpreter, typical JITs is built alongside. • The code generator of current JIT compilers makes little provision to reuse the interpreter. • The method-orientation of most JITs means that cold code is compiled with hot. ‣ Interpreters should be more gradually extensible to become dynamic compilers. Thesis Proposal Jan 2006 3 Problems with current practice • Packaging of virtual instruction bodies is: • Inefficient: Interpreters slowed by branch misprediction • Non-reusable: JIT compilers must implement all virtual instructions from scratch • Method orientation of a JIT compiler forces it to compile cold code along with hot. • Code compiled cold requires complex runtime to perform late binding if it runs. • Recompiling cold code that becomes hot requires complex recompilation infrastructure. Thesis Proposal Jan 2006 4 Our Approach • Branch prediction problems of interpretation can be addressed by calling the virtual bodies. • Can speed up interpretation significantly. • Enables generated code to call the bodies. • JIT need not support all virtual instructions. • Complexity of compiling cold code can be side stepped by compiling dynamically selected regions that contain only hot code. • We describe how compiling traces allows us to compile only hot code and link on newly hot regions as they emerge. ‣ Enables gradual enhancement of interpreter Thesis Proposal Jan 2006 5 Overview of Contribution method based interpretation JIT • Callable bodies make for efficient interpretation. efficiency big bang • Reuse of callable bodies challenges development from generated code smooths “big bang”. • A trace oriented JIT reuse callable compiler is a simple and callable bodies bodies/traces promising architecture. YETI Thesis Proposal Jan 2006 6 Gradual extension of VM xTraces - compile all virtual instructions ✔ Traces- just integer x instructions compiled irtual instructions V SUB Basic Traces - sub dispatch interpreter Blocks x x x Size and Complexity of Compiled Code Regions Larger regions. More instructions compiled. Thesis Proposal Jan 2006 7 Result preview - Efficient Interpretation • Branch misprediction dealt with by calling Subroutine Threading the bodies from region of generated code. 1.0 Threading 0.8 0.90 • Relative to Direct 0.81 0.82 Threaded VM 0.6 0.63 0.4 • Geo mean 0.2 Java SpecJVM98 0 • Relative to Direct benchmarks va/P4 ja va/PPC Ocaml benchmarks ja ocaml/P4 • ocaml/PPC Thesis Proposal Jan 2006 8 Result preview - Trace based JIT • Geom mean SPECJVM98 Java/PPC970 relative to Sun Hotspot JIT 6 5.9 SABVM 5.7 • 5 • Selective inlining 4 4.3 Modified JamVM. 3 • 2 TR-LINK = traces,no JIT • 1 • JIT = trace, JIT Relative to Sun Hotspot 0 • Only 50 integer JIT SABVM bytecodes TR-LINK • Promising start Thesis Proposal Jan 2006 9 Background & Related Work Ertl & Gregg Branch misprediction Piumarta & Riccardi Selective inlining Parrot (perl6) Callable bodies Vitale, Abdelrahman Catenation, Tcl Bala, Duesterwald, Banerjia Dynamo Bruening, Garnett ,Amarasinghe Dynamo Rio Whaley Partial methods Gal, Probst, Franz Hotpath, Trace-based JIT Suganuma,Yasue,Nkatani Region based compilation Hozle, Chambers, Ungar Self Many Java, JVM and JIT authors Java Thesis Proposal Jan 2006 10 Overview •Introduction ‣ Background: ‣ Dynamo & Traces ‣ Interpretation • Our Approach to Mixed-Mode Execution • Results and Discussion Thesis Proposal Jan 2006 11 HP Dynamo • Trace-oriented dynamic optimization system. • HP PA-8000 computers. • Counter-Intuitive approach: • Don’t execute optimized binary interpret it. • Count transits of reverse branches. • Trace-generate (next slide). • Dispatch traces when encountered. • Soon, most execution from trace cache. • faster than binary on hardware of the day! Thesis Proposal Jan 2006 12 Trace with if-then-else //c => b2 if (c) c b1; • Trace is path else b1 b2 followed by program b2; b3; b3 • Conditional branches become trace exits. c texit b1 • Do not expect trace b2 exits to be taken. b3 Thesis Proposal Jan 2006 13 Overview •Introduction ‣ Background: • Dynamo & Traces ‣ Interpretation • Efficient Interpretation • Our Approach • Selecting Regions • Results and Discussion Thesis Proposal Jan 2006 14 Virtual Program Java Java Source Bytecode int f(boolean); int f(boolean parm){ Code: if (parm){ 0: iload_1 return 42; Javac 1: ifeq 7 }else{ compiler 4: bipush 42 return 0; 6: ireturn 7: iconst_0 } 8: ireturn } Thesis Proposal Jan 2006 15 Interpreter Loaded fetch execute Program Load dispatch Parms Internal Bytecode Representation bodies Execution Cycle Thesis Proposal Jan 2006 16 Switched Interpreter vPC = internalRep; while(1){ switch(*vPC++){ case iload_1: .. break; case ifeq: .. break; //and many more.. } }; slow. Burdened by switch and loop overhead. Thesis Proposal Jan 2006 17 Call Threaded Interpreter void iload_1(){ //push load 1 vPC++; } void ifeq(){ //change vPC vPC++; } static int *vPC = internalRep; interp(){ while(1){ (*vPC)(); } }; slow. burdened by function pointer call Thesis Proposal Jan 2006 18 Direct Threaded Interpreter iload_1: .. goto *vPC++; ifeq: -Execution of int f(boolean); if () vPC= Code: goto *vPC++; virtual program 0: iload_1 bipush: .. “threads” 1: ifeq 7 goto *vPC++; 4: bipush 42 ireturn: through bodies 6: ireturn .. 7: iconst_0 goto *vPC++; 8: ireturn iconst_0: .. goto *vPC++; ireturn: .. goto *vPC++; ‣ Good: one dispatch branch taken per body Thesis Proposal Jan 2006 19 Context Problem iload_1: .. goto *vPC++; ifeq: int f(boolean); if () vPC= Code: goto *vPC++; Virtual PC predicts 0: iload_1 bipush: destination. .. 1: ifeq 7 goto *vPC++; 4: bipush 42 ireturn: 6: ireturn .. Hardware PC 7: iload_1 goto *vPC++; 8: ireturn iconst_0: insufficient context .. goto *vPC++; ireturn: .. goto *vPC++; ‣ Bad: hardware has no context to predict dispatch Thesis Proposal Jan 2006 20 Overview ✓ Introduction ✓ Background ‣ Efficient Interpretation • Our Approach to Mixed-Mode Execution • Results and Discussion Thesis Proposal Jan 2006 21 Virtual Program Java Java Source Bytecode int f(boolean); int f(boolean parm){ Code: if (parm){ 0: iload_1 return 42; Javac 1: ifeq 7 }else{ compiler 4: bipush 42 return 0; 6: ireturn 7: iconst_0 } 8: ireturn } Thesis Proposal Jan 2006 22 Direct Threaded Interpreter … vPC DTT iload_1: iload_1 &&iload_1 .. ifeq 7 &&ifeq goto *vPC++; bipush 42 +4 ifeq: ireturn &&bipush if () vPC= iconst_0 42 goto *vPC++; ireturn &&ireturn … &&iconst_0 bipush: &&ireturn .. goto *vPC++; Virtual DTT - Direct C implementation Program Threading Table of each body DTT maps vPC to implementation Thesis Proposal Jan 2006 23 Context Problem iload_1: .. goto *vPC++; ifeq: int f(boolean); if () vPC= Code: goto *vPC++; Virtual PC predicts 0: iload_1 bipush: destination. .. 1: ifeq 7 goto *vPC++; 4: bipush 42 ireturn: 6: ireturn .. Hardware PC 7: iload_1 goto *vPC++; 8: ireturn iconst_0: insufficient context .. goto *vPC++; ireturn: .. goto *vPC++; ‣ Bad: hardware has no context to predict dispatch Thesis Proposal Jan 2006 24 Essence of Subroutine Threading DTT Context Threading Table ret terminated CTT call iload_1 iload_1: 4 .. call ifeq asm(ret); call bipush 42 call ireturn ifeq: vPC=.. call iconst_0 goto *vPC; call ireturn virtual branches as points to generated direct code Package bodies as subroutines and call them Thesis Proposal Jan 2006 25 Context Threading (CT) -- Generating specialized code in CTT call iload_1 vPC subl inlined code movl ;pop stack for if_eq 4 cmpl ;compare 0 jne movl ; vPC = jmp ; else context for addl ; vPC += conditional call bipush branches call ireturn DTT call iconst_0 call ireturn Specialized bodies can also be generated in CTT! Thesis Proposal Jan 2006 26 CT Performance !"#$%&'()*+',#*-.#(,#/)0 Subroutine Branch Inlining Tiny Inlining 1.00 ,"+ 0.75 /0#)%*'12 0.50 !"#$%&'()*+ 0.25 -'#).,+ 0.00 4 c 4 c /p p /p p a /p l /p v a m l a v a m j a c a j o c o ‣ CT! 1%2*is an&#$3)'4%#* efficient interpretation'5*#66#$&'7#* technique/)8*.#)#2/9* CThesisontext Th rProposaleading Jan 2006 24 27 Overview ✓ Introduction ✓ Background: Interpretation & traces ✓ Efficient Interpretation ‣ Our Approach to Mixed-Mode Execution • Selecting Regions • Results and Discussion Thesis Proposal Jan 2006 28 Gradually Extensible Trace Interpreter Three main enablers: 1. Bodies organized as callable routines so executable regions can efficiently mix compiled code and dispatched bodies. 2. The DTT can point to variously shaped execution units. 3. Efficient, flexible instrumentation. Thesis Proposal Jan 2006 29 1. Bodies are callable Packaging bytecode bodies as lightweight subroutines means that they can be efficiently called from generated code iload_1: call iload_1 .. call iload_1 ret; specialized code for iadd istore: call istore_1 .. ret; ‣ Needn’t build all virtual instructions all in one shot. Thesis Proposal Jan 2006 30 2. DTT always points to implementation ..of corresponding execution unit vPC iload_1: .. ret; DTT call iload_1 call iload_1 JIT compiled code istore: for iadd .. call istore_1 ret; ‣ DTT indirection enables any shape of execution unit to be dispatched. Thesis Proposal Jan 2006 31 3. Flexible, Efficient Instrumentation A dispatcher describes an execution unit while(1){ //dispatch