Marcelo S. Cintra PhD Thesis

UNIVERSITY OF CALIFORNIA, IRVINE

Just-in-Time Compilation Techniques for Hardware/Software Co-Designed Processors

DISSERTATION

submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY in Information and Computer Science

by Marcelo Cintra

Dissertation Committee:
Professor Dr.Sc. Michael Franz, Chair
Professor PhD. Harry Xu
Professor PhD. Ian Harris

2015

© 2015 Marcelo Cintra

DEDICATION

To my beloved wife, Monique, and my beloved children, Sofia, Lucas Anthony and Benjamin.

“Love, just like knowledge, is a treasure that the more we spread, the more we have.”
Joanna de Ângelis

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
LIST OF LISTINGS
ACKNOWLEDGMENTS
CURRICULUM VITAE
ABSTRACT OF THE DISSERTATION

1 Introduction
  1.1 Managed Programming Languages
  1.2 Hardware/Software Co-Designed CPUs
  1.3 Problem Statement
  1.4 Thesis statement
  1.5 Our Approach
  1.6 Contributions
  1.7 Thesis Organization

2 Related Work
  2.1 Virtual Machines
  2.2 Virtual Machine Services
    2.2.1 Mobile Code Security
    2.2.2 Garbage Collection
    2.2.3 Bytecode Emulation
  2.3 HW/SW Codesign Methodology
    2.3.1 HW/SW Co-design Advantages
    2.3.2 ISA Emulation
    2.3.3 Dynamic Binary Translation
  2.4 HW/SW Co-Designed Processors
    2.4.1 Hardware Support in Co-Designed Processors

3 Bytecode Acceleration Techniques
  3.1 Bytecode Acceleration with AccelDroid
  3.2 AccelDroid Infrastructure
    3.2.1 OS Acceleration Driver
    3.2.2 Modifications to the Android DVM
    3.2.3 Bytecode Acceleration in the DBT
  3.3 Bytecode Optimizations
    3.3.1 Bytecode Interpretation with LINKPIPE
    3.3.2 Bytecode Check with SPECULATION
    3.3.3 Bytecode Call and Return with FLOOK
    3.3.4 Bytecode Register Promotion
    3.3.5 Optimizing 64-Bit Loads and Stores
  3.4 Experiments
    3.4.1 Programmability
    3.4.2 Porting the Dalvik Runtime into the Efficeon DBT
    3.4.3 ISA Compatibility and Transparency
    3.4.4 CaffeineMark 3.0 Benchmark
    3.4.5 Overall Performance Benefits
    3.4.6 Bytecode-Specific Optimization Benefits
    3.4.7 Direct Interpretation and LINKPIPE Benefits
    3.4.8 Energy Benefits
  3.5 Conclusions

4 Evaluating Bytecode Performance and Energy Efficiency on Processor Simulators
  4.1 Difficulties for Bytecode Energy/Performance Evaluation on Processor Simulators
  4.2 Motivation for a Virtualized Interface for Simulation Control
    4.2.1 Cross-platform Benchmarking
    4.2.2 Controlling the Simulation on a Co-designed Processor
  4.3 Benchmarking Using Special Markers
    4.3.1 API for Bytecode Instrumentation
    4.3.2 OS Simulation Control Driver
    4.3.3 Virtual Machine Modification
    4.3.4 Modifications to the Architecture Simulator
    4.3.5 Application Instrumentation
  4.4 Evaluation
    4.4.1 Experiment Setup
    4.4.2 Workloads
  4.5 Related Work
  4.6 Conclusion and Future Work

5 Bytecode Acceleration in an Out-of-Order Co-Designed Processor
  5.1 Motivation
  5.2 Acceleration Interface Design
    5.2.1 ISA-Compatibility
    5.2.2 Virtual Machine Simplicity and Flexibility
    5.2.3 OS Transparency
  5.3 Programming for Acceleration
    5.3.1 Virtual Machine
  5.4 Internal Implementation for Acceleration
    5.4.1 Hardware Support for Acceleration Interface
  5.5 Uop Runtime
    5.5.1 Handle Multiple Virtual Machine Instances
    5.5.2 Handle HW Interrupt and Exception
    5.5.3 Handle Non-Supported Instruction
    5.5.4 Handle Floating Point Operations
  5.6 Experiment Setup
  5.7 Evaluation
    5.7.1 Energy Consumption
    5.7.2 Startup Overhead Reduction
  5.8 Conclusion

6 Future Work
  6.1 Promotion of Stack Variables to Registers
  6.2 Universal Acceleration Bytecode
  6.3 Fill the Dark Silicon with Specialized Cores for Bytecode Acceleration
  6.4 HW/SW Codesigned Security

7 Conclusion
  7.1 Discussion
  7.2 Limitations
  7.3 A Final Word

Bibliography

Appendices
  A Dalvik Virtual Machine
    A.1 Dalvik Bytecodes
    A.2 Dalvik Interpreter
    A.3 Trace Based Just-in-Time Compiler
  B Transmeta Efficeon Processor
    B.1 Hardware Support for Speculative Execution
    B.2 Bytecode Execution Time Breakdown
    B.3 Transmeta Code Morphing Software
  C Translation Example

LIST OF FIGURES

1.1 Modern CPU design
1.2 Bytecode Code Accelerator Design
2.1 Emulation Process
3.1 The Benefit of Using AccelDroid for Bytecode Execution
3.2 Android Software Stack
3.3 Flow Chart for the Acceleration Driver
3.4 Bytecode Execution through a regular VM and through AccelDroid
3.5 Translation of Call/Return Instructions with FLOOK
3.6 Speedup on CaffeineMark
3.7 AccelDroid speedup on some popular algorithms
3.8 SPEC CPU 2000 Performance Comparison
3.9 CaffeineMark Performance Comparison
3.10 Optimization Benefits
3.11 Interpretation Performance
3.12 LINKPIPE Benefit
3.13 Energy Savings without warmup overhead
3.14 Energy Savings including JIT warmup
3.15 Performance Counters for the Energy Analysis of Figure 3.14 on page 60
4.1 Application source being compiled into different binaries for execution
4.2 Simulation Control Interface on a HW/SW Co-designed System
4.3 Flow Chart for the Simulation Driver
4.4 Simulation Control Interface
4.5 Application source compilation into binary and bytecode format
4.6 Application Instrumentation and Simulation Control
4.7 Execution overhead for CaffeineMark with bytecode marker
4.8 Bytecode vs. Native Execution for CaffeineMark
5.1 Accelerated Bytecode Execution
5.2 Flow Chart for Trap Handling in Uop Runtime
5.3 CaffeineMark Performance
5.4 Instruction Reduction
5.5 Allocation Stalls
5.6 AndEBench and 0xBench Performance
5.7 CaffeineMark Energy Consumption
5.8 AndEBench and 0xBench Energy Consumption
5.9 Startup Overhead Reduction
1 Stack layout of a method with 2 parameters and 3 local variables
2 Stack layout of a method nonInlineableSeries
3 Efficeon Processor Package and Die Area
4 Interpretation and profiling of x86 Instructions
5 Translation of x86 Instructions
6 Chaining x86 Translations
7 Speculative Support for Aggressive Optimizations

LIST OF TABLES

3.1 Processor Specifications
4.1 Percentage of Dynamic Code Execution on Android Benchmarks
5.1 Important Microarchitecture Parameters
5.2 Hardware Acceleration Support
1 Execution Time Breakdown for DaCapo and Spec 2000
