YETI: A GRADUALLY EXTENSIBLE TRACE INTERPRETER

by

Mathew Zaleski

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Computer Science
University of Toronto

Copyright © 2008 by Mathew Zaleski

Library and Archives Canada
Published Heritage Branch
395 Wellington Street
Ottawa ON K1A 0N4
Canada

ISBN: 978-0-494-57946-6

NOTICE:

The author has granted a non-exclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or non-commercial purposes, in microform, paper, electronic and/or any other formats.

The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis.

While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis.

Abstract

YETI: a graduallY Extensible Trace Interpreter

Mathew Zaleski

Doctor of Philosophy

Graduate Department of Computer Science

University of Toronto

2008

The implementation of new programming languages benefits from interpretation because it is simple, flexible and portable. The only downside is speed of execution, as there remains a large performance gap between even efficient interpreters and systems that include a just-in-time (JIT) compiler. Augmenting an interpreter with a JIT, however, is not a small task. Today, Java JITs are typically method-based. To compile whole methods, the JIT must re-implement much functionality already provided by the interpreter, leading to a "big bang" development effort before the JIT can be deployed. Adding a JIT to an interpreter would be easier if we could more gradually shift from dispatching virtual instruction bodies implemented for the interpreter to running instructions compiled into native code by the JIT.

We show that virtual instructions implemented as lightweight callable routines can form the basis for a very efficient interpreter. Our new technique, interpreted traces, identifies hot paths, or traces, as a virtual program is interpreted. By exploiting the way traces predict branch destinations our technique markedly reduces branch mispredictions caused by dispatch. Interpreted traces are a high-performance technique, running about 25% faster than direct threading.

We show that interpreted traces are a good starting point for a trace-based JIT. We extend our interpreter so traces may contain a mixture of compiled code for some virtual instructions and calls to virtual instruction bodies for others. By compiling about 50 integer and object virtual instructions to machine code we improve performance by about 30% over interpreted traces, running about twice as fast as the direct threaded system with which we started.

Acknowledgements

My supervisor, Angela Demke Brown, uncomplainingly drew the short straw when she was handed me, a middle-aged electrical engineer turned businessman, as her first doctoral student. Overcoming many obstacles, she has taught me the subtle craft of systems research.

I thank my advisory committee of Michael Stumm, David Wortman and Tarek Abdelrahman for their input and guidance on our research. Special thanks are also due to Kevin Stoodley of IBM, without whose insight this research would not have been possible.

My family must at some level be to blame for my decision to interrupt (even jeopardize?) a reasonably successful and entirely enjoyable career to spend the last several years retraining as a researcher. Leonard, my father, would have been pleased had I written this dissertation twenty years ago. If I had, he would have been alive to read it. My mother, Irma, has written several books on religion while I completed my studies. Her love of knowledge, and the process of writing it down, has been impetus for my less ethereal work. My father-in-law, Harry Eastman, advised me throughout the process of returning to school. I doubt I would have had the nerve to carry it through without his support. I wish he were still with us to pull his academic hood out of mothballs and see me over the threshold one more time.

My wife, Harriet Eastman, has been a paragon of support and patience while I enjoyed the thrills and chills of the PhD program. Without her love and encouragement I would have given up long ago. Our children, Sam and Jacob, while occasionally benefiting from my flexible hours, were more often called upon to be good sports when the flexibility simply meant I worked all the time.

Contents

1 Introduction
1.1 Challenges of Method-based JIT Compilation
1.2 Challenges of Efficient Interpretation
1.3 What We Need
1.4 Overview of Our Solution
1.5 Thesis Statement
1.6 Contributions
1.7 Outline of Thesis

2 Background
2.1 High Level Language Virtual Machine
2.1.1 Overview of a Virtual Program
2.1.2 Interpretation
2.1.3 Early Just in Time Compilers
2.2 Challenges to HLL VM Performance
2.2.1 Polymorphism and the Implications of Object-oriented Programming
2.2.2 Late binding
2.3 Early Dynamic Optimization
2.3.1 Manual Dynamic Optimization
2.3.2 Application specific dynamic compilation
2.3.3 Dynamic Compilation of Manually Identified Static Regions
2.4 Dynamic Object-oriented optimization
2.4.1 Finding the destination of a polymorphic callsite
2.4.2 Smalltalk and Self
2.4.3 JIT as Dynamic Optimizer
2.4.4 JIT Compiling Partial Methods
2.5 Traces
2.6 Hotpath
2.7 Branch Prediction and General Purpose Computers
2.7.1 Dynamic Hardware Branch Prediction
2.8 Chapter Summary

3 Dispatch Techniques
3.1 Switch Dispatch
3.2 Direct Call Threading
3.3 Direct Threading
3.4 The Context Problem
3.5 Subroutine Threading
3.6 Optimizing Dispatch
3.6.1 Superinstructions
3.6.2 Selective Inlining
3.6.3 Replication
3.7 Chapter Summary

4 Design and Implementation of Efficient Interpretation
4.1 Understanding Branches
4.2 Handling Linear Dispatch
4.3 Handling Virtual Branches
4.4 Handling Virtual Call and Return
4.5 Chapter Summary

5 Evaluation of Context Threading
5.1 Experimental Set-up
5.1.1 Virtual Machines and Benchmarks
5.1.2 Performance and Pipeline Hazard Measurements
5.2 Interpreting the data
5.2.1 Effect on Pipeline Branch Hazards
5.2.2 Performance
5.3 Inlining
5.4 Limitations of Context Threading
5.4.1 Heavyweight Virtual Instruction Bodies
5.4.2 Context Threading and Profiling
5.4.3 Development using SableVM
5.5 Chapter Summary

6 Design and Implementation of YETI
6.1 Structure and Overview of Yeti
6.2 Region Selection
6.2.1 Initiating Region Discovery
6.2.2 Linear Block Detection
6.2.3 Trace Selection
6.3 Trace Exit Runtime
6.3.1 Trace Linking
6.4 Generating code for traces
6.4.1 Interpreted Traces
6.4.2 JIT Compiled Traces
6.4.3 Trace Optimization
6.5 Other implementation details
6.6 Chapter Summary

7 Evaluation of Yeti
7.1 Experimental Set-up
7.2 Effect of region shape on dispatch
7.3 Effect of region shape on performance
7.4 Early Pentium Results
7.5 Identification of Stall Cycles
7.5.1 Identifying Causes of Stall Cycles
7.5.2 Stall Cycle results
7.5.3 Trends
7.6 Chapter Summary

8 Conclusions and Future Work
8.1 Conclusions and Lessons Learned
8.2 Future work
8.2.1 Virtual instruction bodies as nested functions
8.2.2 Extension to Runtime Typed Languages
8.2.3 New shapes of region body
8.2.4 Vision for new language implementation
8.3 Summary

Bibliography

List of Tables

5.1 Description of OCaml benchmarks. Raw elapsed time and branch hazard data for direct-threaded runs.
5.2 Description of SPECjvm98 Java benchmarks. Raw elapsed time and branch hazard data for direct-threaded runs.
5.3 (a) Guide to Technique description.
5.4 Detailed comparison of selective inlining (SABLEVM) vs SUB+BI+AR and TINY. Numbers are elapsed time relative to direct threading. △context is the difference between selective inlining and SUB+BI+AR. △tiny is the difference between selective inlining and TINY (the combination of context threading and tiny inlining).
7.1 SPECjvm98 benchmarks including elapsed time for baseline JamVM (i.e., without any of our modifications), Yeti and Sun HotSpot.
7.2 Guide to labels which appear on figures and references to technique descriptions.
7.3 GPUL categories

List of Figures

2.1 Example Java Virtual Program showing source (on the left) and Java virtual instructions, or bytecodes, on the right.
2.2 Example of Java method containing a polymorphic callsite.
3.1 A switch interpreter loads each virtual instruction as a virtual opcode, or token, corresponding to the case of the switch statement that implements it. Virtual instructions that take immediate operands, like iconst, must fetch them from the vPC and adjust the vPC past the operand. Virtual instructions which do not need operands, like iadd, do not need to adjust the vPC.
3.2 A direct call-threaded interpreter packages each virtual instruction body as a function. The shaded box highlights the dispatch loop showing how virtual instructions are dispatched through a function pointer. Direct call threading requires the loaded representation of the program to point to the address of the function implementing each virtual instruction.
3.3 Direct-threaded interpreter showing how Java source compiled to Java bytecode is loaded into the Direct Threading Table (DTT). The virtual instruction bodies are written in a single C function, each identified by a separate label. The double-ampersand (&&) shown in the DTT is gcc syntax for the address of a label.
3.4 Machine instructions used for direct dispatch. On both platforms assume that some general purpose register, rx, has been dedicated for the vPC. Note that on the PowerPC indirect branches are two part instructions that first load the ctr register and then branch to its contents.
4.1 Subroutine Threaded Interpreter showing how the CTT contains one generated direct call instruction for each virtual instruction and how the first entry in the DTT corresponding to each virtual instruction points to generated code to dispatch it. Callable bodies are shown here as nested functions for illustration only. All maintenance of the vPC must be done in the bodies. Hence even virtual instructions that take no arguments, like iadd, must bump vPC past the virtual opcode. Virtual instructions, like iload, that take an argument must bump vPC past the argument as well.
4.2 Direct threaded bodies retrofitted as callable routines by inserting inline assembler return instructions. This example is for Pentium 4 and hence ends each body with a ret instruction. The asm statement is an extension to the C language, inline assembler, provided by gcc and many other compilers.
4.3 Subroutine Threading does not address branch instructions. Unlike straight line virtual instructions, virtual branch bodies end with an indirect branch, just like direct threading. (Note: When a body is called the vPC always points to the slot in the DTT corresponding to its first argument, or, if there are no operands, to the following instruction.)
4.4 Context threading with branch replication illustrating the "replicated" indirect branch (a) in the CTT. The fact that the indirect branch corresponds to only one virtual instruction gives it better prediction context. The heavy arrow from (a) to (b) is followed when the virtual branch is taken. Prediction problems remain in the code compiled from the if statement labelled (c).
4.5 Context-threaded VM Interpreter: Branch Inlining. The dashed arrow (a) illustrates the inlined conditional branch instruction, now fully exposed to the branch prediction hardware, and the heavy arrow (b) illustrates a direct branch implementing the not taken path. The generated code (shaded) assumes the vPC is in register esi and the Java expression stack pointer is in register edi. (In reality, we dedicate registers in the way shown for SableVM on the PowerPC only. On the Pentium 4, due to lack of registers, the vPC is actually stored on the stack.)
4.6 Context Threading Apply-Return Inlining on Pentium. The generated code calls the invokestatic virtual instruction body but jumps (instruction at (c) is a jmp) to the return body.
5.1 OCaml Pipeline Hazards Relative to Direct Threading
5.2 Java Pipeline Hazards Relative to Direct Threading
5.3 OCaml Elapsed Time Relative to Direct Threading
5.4 SableVM Elapsed Time Relative to Direct Threading
5.5 PPC970 Elapsed Time Relative to Direct Threading
5.6 Reproduction of [78, Figure 1] showing cycles run per virtual instruction dispatched for various Tcl and OCaml benchmarks
5.7 Elapsed time of subroutine threading relative to direct threading for OCaml on UltraSPARC III.
6.1 Virtual program loaded into Yeti showing how dispatcher structures are initially shared between all instances of a virtual instruction. The dispatch loop, shaded, is similar to the dispatch loop of direct call threading except that another level of indirection, through the dispatcher structure, has been added. Profiling instrumentation is called before and after the dispatch of the body.
6.2 Shows a region of the DTT during block recording mode. The body of each block discovery dispatcher points to the corresponding virtual instruction body (only the body for the first iload is shown). The dispatcher's payload field points to instances of instruction payload. The thread context struct is shown as TCS.
6.3 Shows a region of the DTT just after block recording mode has finished.
6.4 Schematic of a trace illustrating how the trace exit table (shaded) in the trace payload has recorded the on-trace destination of each virtual branch.
6.5 PowerPC code for a portion of a trace region body, showing details of a trace exit and trace exit handler. This code assumes that r26 has been dedicated for the vPC. In addition the generated code in the trace exit handler uses r30, the stack pointer as defined by the ABI, to store the trace exit id into the TCS.
7.1 Number of dispatches executed vs region shape. The y-axis has a logarithmic scale. Numbers above bars, in scientific notation, give the number of regions dispatched. The x-axis lists the SPECjvm98 benchmarks in alphabetical order.
7.2 Number of virtual instructions executed per dispatch for each region shape. The y-axis has a logarithmic scale. Numbers above bars are the number of virtual instructions executed per dispatch (rounded to two significant figures). SPECjvm98 benchmarks appear along the x-axis sorted by the average number of instructions executed by a LB.
7.3 Percentage trace completion rate as a proportion of the virtual instructions in a trace, and code cache size as a percentage of the virtual instructions in all loaded methods, for the SPECjvm98 benchmarks and scitest.
7.4 Performance of each stage of Yeti enhancement from DCT interpreter to trace-based JIT relative to unmodified JamVM-1.3.3 (direct-threaded) running the SPECjvm98 benchmarks (sorted by LB length).
7.5 Performance of Linear Blocks (LB) compared to subroutine-threaded JamVM-1.3.3 (SUB) relative to unmodified JamVM-1.3.3 (direct-threaded) for the SPECjvm98 benchmarks.
7.6 Performance of JamVM interpreted traces (i-TR) and selective inlined SableVM 1.1.8 relative to unmodified JamVM-1.3.3 (direct-threaded) for the SPECjvm98 benchmarks.
7.7 Performance of JamVM interpreted traces (i-TR) relative to unmodified JamVM-1.3.3 (direct-threaded) and selective inlined SableVM 1.1.8 relative to direct threaded SableVM version 1.1.8 for the SPECjvm98 benchmarks.
7.8 Elapsed time performance of Yeti with JIT compared to Sun Java 1.05.0_6_64 relative to JamVM-1.3.3 (direct threading) running SPECjvm98 benchmarks.
7.9 Performance of Gennady Pekhimenko's Pentium port relative to unmodified JamVM-1.3.3 (direct-threaded) running the SPECjvm98 benchmarks.
7.10 Cycles relative to JamVM-1.3.3 (direct threading) running SPECjvm98 benchmarks.
7.11 Stall breakdown for SPECjvm98 benchmarks relative to JamVM-1.3.3 (direct threading).

Chapter 1

Introduction

Modern computer languages are commonly implemented in two main parts – a compiler that targets a virtual instruction set, and a so-called high-level language virtual machine (or simply language VM) to run the resulting virtual program. This approach simplifies the compiler by eliminating the need for any machine dependent code generation. Tailoring the virtual instruction set can further simplify the compiler by providing operations that perfectly match the functionality of the language.

There are two ways a language VM can run a virtual program. The simplest approach is to interpret the virtual program. An interpreter dispatches a virtual instruction body to emulate each virtual instruction in turn. A more complicated, but faster, approach deploys a dynamic, or just in time (JIT), compiler to translate the virtual instructions to machine instructions and dispatch the resulting native code. Mixed-mode systems interpret some parts of a virtual program and compile others. In general, compiled code will run much more quickly than virtual instructions can be interpreted. By judiciously choosing which parts of a virtual program to JIT compile, a mixed-mode system can run much more quickly than the fastest interpreter.

Currently, although many popular languages depend on virtual machines, relatively few JIT compilers have been deployed. Notable exceptions include research languages like Self and several Java Virtual Machines (JVM). Consequently, users of important computer languages, including JavaScript, Python, and many others, do not enjoy the performance benefits of mixed-mode execution.

The primary goal of our research is to make it easier to extend an interpreter with a JIT compiler. To this end we describe a new architecture for a language VM that significantly increases the performance of interpretation at the same time as it reduces the complexity of extending it to be a mixed-mode system. Our technique has two main features.

First, our JIT identifies and compiles hot interprocedural paths, or traces. Traces are single entry multiple exit regions that are easier to compile than the methods compiled by current systems. In addition, hot traces help predict the destination of virtual branches. This means that even before traces are compiled they provide a simple way to improve the interpreted performance of virtual branches.

Second, we implement virtual instruction bodies as lightweight, callable routines, and at the same time, we closely integrate the JIT compiler and interpreter. This gives JIT developers a simple alternative to compiling each virtual instruction. Either a virtual instruction is translated to native code, or instead, a call to the corresponding body is generated. The task of JIT devel- opers is thereby simplified by making it possible to deploy a fully functional JIT compiler that compiles only a subset of virtual instructions. In addition, callable virtual instruction bodies have a beneficial effect on interpreter performance because they enable a simple interpretation technique, subroutine threading, that very efficiently executes straight-line, or non-branching, regions of a virtual program.

We prototype our ideas in Java because there exist many high-quality Java interpreters and JIT compilers with which to compare our results. We are able to determine that the performance of our prototype compares favourably with state-of-the-art interpreters like JamVM and SableVM. An obvious next step would be to apply our techniques to enhance the performance of languages that currently do not offer a JIT.

The discussion in the next few sections refers to many technical terms and techniques that are described in detail in Chapter 2, which introduces the basic concepts and related work, and Chapter 3, which provides a tutorial-like description of several interpreter techniques.

1.1 Challenges of Method-based JIT Compilation

Today, the usual approach taken by mixed-mode systems is to identify frequently executed, or hot, methods. Hot methods are passed to the JIT compiler which compiles them to native code.

Then, when the interpreter sees an invocation of a compiled method, it dispatches the native code instead.

Up Front Effort This method-oriented approach has been followed for many years, but requires a large up-front investment in effort. Such a system cannot improve the performance of a method until it can compile every feature of the language that appears in it. For significant applications this requires the JIT to compile essentially the whole language, including complicated features already implemented by high-level virtual instruction bodies, such as those for method invocation, object creation, and exception handling.

Compiling Cold Code Just because a method is frequently executed does not mean that all the instructions within it are frequently executed also. In fact, regions of a hot method may be cold, that is, they may have never executed. Compiling cold code has more implications than simply wasting compile time. Except at the very highest levels of optimization, where analyzing cold code may prove useful facts about hot regions, there is little point compiling code that never runs. A more serious issue is that cold code increases the complexity of dynamic compilation. We give three examples. First, for late binding languages such as Java, cold code likely contains references to external symbols which are not yet bound. Thus, when the cold code does eventually run, the generated code and the runtime that supports it must deal with the complexities of late binding [74]. Second, certain dynamic optimizations are not possible without runtime profiling information. Foremost amongst these is the optimization of virtual function calls. Since there is no profiling information for cold code, the JIT may have to generate relatively slow, conservative code. This issue is even more important for runtime typed languages, like Python, in which the type of the operands of a virtual instruction may not be known until run time. Without runtime information neither a static, nor a dynamic, Python compiler may be able to determine whether the inputs of simple arithmetic operations such as addition are integers, floats, or strings. Third, as execution proceeds, some of the formerly cold regions in compiled methods may become hot. The conservative assumptions made during the initial compilation may now be a drag on performance. The straightforward-sounding approach of recompiling the method containing the formerly cold code undermines the profitability of compilation. Furthermore, it is complicated by problems such as what to do about threads that are still executing in the method or that will return to the method in the future.

1.2 Challenges of Efficient Interpretation

After a virtual program is loaded by an interpreter into memory it can be executed by dispatching each virtual instruction body (or just body) in the order specified by the virtual program. From the processor's point of view, this is not a typical workload because the control transfer from one body to the next is data dependent on the sequence of instructions making up the virtual program. This makes the dispatch branches hard for a processor to predict. Ertl and Gregg observed that the performance of otherwise efficient interpretation is limited by pipeline stalls and flushes due to extremely poor branch prediction [28].

1.3 What We Need

The challenges we identified above suggest that the architecture of a gradually extensible mixed-mode virtual machine should have three important properties.

1. Virtual instruction bodies should be callable. This allows JIT implementors to compile only some instructions, and fall back on the emulation functionality already implemented by the virtual instruction bodies for others.

2. The unit of compilation must be dynamically determined and of flexible shape. This allows the JIT compiler to translate hot regions while avoiding cold code.

3. As new regions of hot code reveal themselves and are compiled, a way is needed of gracefully linking them to previously compiled hot code.

Callable Virtual Instruction Bodies Packaging bodies as callable can also address the prediction problems observed in interpreters. Any straight-line sequence of virtual instructions can be translated to a very simple sequence of generated machine instructions. Corresponding to each virtual instruction we generate a single direct call which dispatches the corresponding virtual instruction body. Executing the resulting generated code thus emulates each virtual instruction in the linear sequence in turn. No branch mispredictions occur because the destination of each direct call is explicit and the return instruction ending each body is predicted perfectly by the return branch predictor present in most modern processors.
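As a rough illustration, the following C sketch (ours, not code from any of the virtual machines discussed in this dissertation) packages a few virtual instruction bodies as ordinary callable functions and emulates the running example c = a + b + 1 with one direct call per body. For simplicity the operands are passed as C arguments rather than fetched through the vPC, and the hard-coded calls in main stand in for the straight-line native call sequence a subroutine-threaded interpreter would generate at load time.

/* Minimal sketch: virtual instruction bodies as callable C functions. */
#include <stdio.h>

static int stack[16];           /* expression stack                      */
static int *sp = stack;         /* stack pointer                         */
static int lva[4];              /* local variable array: this, a, b, c   */

static void iload(int slot)  { *sp++ = lva[slot]; }         /* push a local    */
static void iconst(int k)    { *sp++ = k; }                 /* push a constant */
static void iadd(void)       { int r = *--sp; sp[-1] += r; }/* pop, add to top */
static void istore(int slot) { lva[slot] = *--sp; }         /* pop into a local */

int main(void) {
    lva[1] = 2; lva[2] = 3;     /* a = 2, b = 3 */

    /* Generated code for c = a + b + 1 would be one direct call per
     * virtual instruction, in program order: */
    iload(1); iload(2); iconst(1); iadd(); iadd(); istore(3);

    printf("c = %d\n", lva[3]); /* prints c = 6 */
    return 0;
}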

Traces Our system compiles frequently executed, dynamically identified interprocedural paths, or traces. Traces contain no cold code, so our system leaves all the complexities of running cold code to the interpreter. Since traces are paths through the virtual program, they explicitly predict the destination of each virtual branch. As a consequence even a very simple implementation of traces can significantly improve performance by reducing branch mispredictions caused by dispatching virtual branches. This is the basis of our new technique, interpreted traces.

1.4 Overview of Our Solution

In this dissertation we describe a system that supports dynamic compilation units of varying shapes. Just as a virtual instruction body implements a virtual instruction, a region body implements a region of the virtual program. Possible region bodies include single virtual instructions, basic blocks, methods, partial methods, inlined method nests, and traces. The key idea is to package every region body as callable, regardless of the size or shape of the region of the virtual program that it implements. The interpreter can then execute the virtual program by dispatching each region body in sequence.

Region bodies corresponding to longer sequences of virtual instructions will run faster than those compiled from short ones because fewer dispatches are required. In addition, larger region bodies should offer more opportunities for optimization. However, larger region bodies are more complicated and so we expect them to require more development effort to detect and compile than short ones. This suggests that the performance of a mixed-mode VM can be gradually extended by incrementally increasing the scope of region bodies it identifies and compiles. Ultimately, the peak performance of the system should be at least as high as current method-based JIT compilers since, with basically the same engineering effort, inlined method nests could be compiled to region bodies also.

The practicality of our scheme depends on the efficiency of dispatching bodies by calling them. Thus, the first phase of our research, described in Chapters 4 and 5, was to retrofit SableVM [32], a Java virtual machine, and ocamlrun, an OCaml interpreter [14], to a new hybrid dispatch technique we call context threading. We evaluated context threading on PowerPC and Pentium 4 platforms by comparing branch predictor and runtime performance of common benchmarks to unmodified, direct-threaded versions of the virtual machines. We show that callable bodies can be dispatched more efficiently than dispatch techniques currently thought to be very efficient. For instance, on a Pentium 4, our subroutine threaded version of SableVM runs the SPECjvm98 benchmarks about 19% faster than direct threading.

In the second phase of this research, described in Chapters 6 and 7, we gradually extended JamVM, a cleanly implemented and relatively high performance Java interpreter [53], to create Yeti (graduallY Extensible Trace Interpreter). We decided to start afresh because it proved difficult to cleanly add trace detection and profiling instrumentation to our implementation of context threading. We chose JamVM as the starting point for Yeti, rather than SableVM, because it is simpler.

We built Yeti in five stages with the explicit intention of providing a design trajectory from a simple system to a high performance implementation. First, we repackaged all virtual instruction bodies as callable. Our initial implementation executed only single virtual instructions which were dispatched via an indirect call from a simple dispatch loop. This is slow compared to context threading but very easy to instrument with profiling code. Second, we identified linear blocks, or sequences of virtual instructions ending in branches. Third, we extended our system to identify and dispatch interpreted traces, or sequences of linear blocks. Traces are significantly more complex region bodies than linear blocks because they must accommodate virtual branch instructions. Fourth, we extended our trace runtime system to link traces together. In the fifth and final stage, we implemented a naive, non-optimizing compiler to compile the traces. An interesting feature of the JIT is that it performs simple compilation and register allocation for some virtual instructions but falls back on calling virtual instruction bodies for others. Our compiler currently generates PowerPC code for about 50 integer and object virtual instructions.

We chose traces as our unit of compilation because traces have several attractive properties: (i) they can extend across the invocation and return of methods, and thus have an interprocedural view of the program, (ii) they contain only hot code, (iii) they are relatively simple to compile as they are single-entry multiple-exit regions of code, and (iv) it is straightforward to generate new traces and link them onto existing ones as new hot paths reveal themselves.

Instrumentation built into our prototype shows that on the average, traces accurately predict paths taken by the Java SPECjvm98 benchmark programs. This result corroborates those reported by Bala et al. [8] and Duesterwald and Bala [26] for C and Fortran programs. Performance measurements show that the overhead of trace identification is reasonable. Even with our naive compiler, Yeti runs about twice as fast as unmodified JamVM.

1.5 Thesis Statement

The performance of a high level language virtual machine can be more easily enhanced from a simple interpreter to a high performance mixed-mode system if its design includes two main ideas. First, virtual instruction bodies should be callable. Second, the system should dynamically identify, translate into machine code, and run regions that contain no cold code. Traces should be a good choice because they predict the destination of virtual branch instructions and hence support efficient interpretation. Traces should also be simple to compile as they contain no merge points.

1.6 Contributions

We show that if virtual instruction bodies are implemented as callable routines a family of dispatch techniques becomes possible, from very simple, portable and slow, to somewhat machine dependent but much faster. Since the implementation of the virtual instruction bodies makes up a large portion of an interpreter, an attractive aspect of this approach is that there is no need to modify the bodies as more complex, and higher performing, mechanisms are implemented to dispatch them.

The simplest, and most portable, way to build an interpreter with callable bodies is to write a dispatch loop in C that dispatches each instruction via a function pointer. This technique, called direct call threading, or DCT, performs about the same as a switch threaded interpreter. DCT is a good starting point for our family of techniques because it is simple to code and as portable as gcc. Our strategy is to extend DCT by inserting profiling code into the dispatch loop. The instrumentation dynamically identifies regions of the virtual program and translates them into callable region bodies. These region bodies can then be called from the same dispatch loop, increasing performance.
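A minimal, self-contained C sketch of such a dispatch loop appears below; the program array, the toy bodies and the profile hook are illustrative names only, not the prototype's actual code.

#include <stdio.h>

typedef void (*body_t)(void);

static body_t program[8];       /* loaded representation: one body pointer */
static body_t *vPC = program;   /* per virtual instruction in this toy     */
static int vm_running = 1;
static int acc = 0;

/* Hypothetical profiling hook: in a system like Yeti this is where region
 * discovery and trace selection would be driven from. */
static void profile(body_t b) { (void)b; /* e.g. count dispatches */ }

/* Two toy virtual instruction bodies; each advances the vPC itself. */
static void v_inc(void)  { acc++; vPC++; }
static void v_halt(void) { vm_running = 0; }

int main(void) {
    program[0] = v_inc; program[1] = v_inc; program[2] = v_halt;

    while (vm_running) {        /* the DCT dispatch loop:                  */
        body_t body = *vPC;     /* fetch the next body pointer             */
        profile(body);          /* instrumentation around the dispatch     */
        body();                 /* dispatch via an indirect call           */
    }
    printf("acc = %d\n", acc);  /* prints acc = 2 */
    return 0;
}

The loop itself never has to change as profiling later substitutes pointers to larger callable region bodies into the loaded program.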

We introduce a new technique, interpreted traces, to address branch mispredictions caused by dispatch. As virtual instructions are dispatched, our profiling instrumentation uses well-known heuristics to identify hot, interprocedural paths, or traces. We say the traces are interpreted because virtual instruction bodies do all the real work. Straight-line portions of each trace are implemented using subroutine threading, whereby a direct call machine instruction is generated to call the virtual instruction body implementing each virtual instruction. We follow the dispatch of each virtual branch instruction with trace exit code that exploits the fact that traces predict the destination of virtual branches. Interpreted traces require the generation of only three machine instructions: direct call, compare immediate, and conditional jump. Thus, the machine dependency of the technique is modest.
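The following C sketch models, under assumed names, what the generated code for one interpreted trace effectively does for a tiny counting loop. The real technique emits native instructions (a direct call per body, then a compare immediate and conditional jump after each virtual branch), so plain C can only approximate their effect.

#include <stdio.h>

static int counter;             /* toy VM state                            */
static int vPC;                 /* toy virtual program counter             */

/* Callable bodies: a decrement and a virtual branch that either jumps
 * back to instruction 0 or falls through. */
static void body_dec(void)    { counter--; vPC++; }
static void body_branch(void) { vPC = (counter > 0) ? 0 : vPC + 1; }

static void trace_exit(int id) { printf("trace exit %d at vPC=%d\n", id, vPC); }

/* One interpreted trace covering the hot loop "0: dec  1: branch->0".
 * The trace predicts the branch is taken (destination 0). */
static int run_trace(void) {
    body_dec();                 /* straight-line part: direct calls        */
    body_branch();              /* dispatch the virtual branch body        */
    if (vPC != 0) {             /* trace exit: did the branch go where     */
        trace_exit(0);          /* the trace predicted?                    */
        return 0;               /* no: leave the trace                     */
    }
    return 1;                   /* yes: stay on the trace                  */
}

int main(void) {
    counter = 3; vPC = 0;
    while (run_trace())         /* keep executing the trace while its      */
        ;                       /* prediction holds                        */
    return 0;
}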

We use micro-architectural performance counter measurements to show that interpreted traces result in good branch prediction. We show that interpreted traces improve the performance of our prototype relative to direct threading about the same amount as selective inlining gains over direct threading in SableVM. This means that interpreted traces are competitive with the highest performing techniques to optimize the dispatch performance of an interpreter. We achieve this level of performance despite the fact that our system performs runtime profiling as traces are detected.

Finally, we show that interpreted traces are a good starting point for a trace-based just in time (JIT) compiler. We extend our code generator for interpreted traces such that traces may contain a mixture of compiled code for some virtual instructions and subroutine threaded dispatch for others. By compiling about 50 integer and object virtual instructions to register allocated compiled code we improve the performance of our prototype by about 30% over interpreted traces to run about twice as fast as the direct threaded system with which we started.

Taken together, direct call threading, interpreted traces, and our trace-based JIT provide a design trajectory for a language VM with a range of performance from switch threading, a very widely deployed entry level technique, to about double the performance of a direct threaded interpreter. The fact that interpreted traces are gradually extensible in this way makes them a good strategic design option for future language virtual machines.

Summary of Contributions

1. If virtual instruction bodies are implemented as callable routines, straight-line sections of virtual programs can be efficiently interpreted by load-time generated sequences of subroutine threaded code. We show that on modern processors the extra path length of the call and return instructions used by subroutine threading is more than made up for by the elimination of stalls caused by mispredicted indirect branches used by direct threading.

2. We introduce a new technique, interpreted traces, which identifies traces, or hot paths, to predict the destination of virtual branch instructions. We implement interpreted traces in JamVM, a high performance Java interpreter, and show that they outperform direct threading by 25%. This is about the same speedup achieved by SableVM's implementation of selective inlining.

3. The code generator for interpreted traces can be gradually extended to be a trace-based JIT by adding support for virtual instructions one at a time. Traces are simple to compile as they contain no cold code or merge points. Our trace-based JIT currently compiles about 50 virtual instructions and obtains a speedup of about 30% over interpreted traces.

1.7 Outline of Thesis

We describe an architecture for a virtual machine interpreter that facilitates the gradual extension to a trace-based mixed-mode JIT compiler. We demonstrate the feasibility of this approach in a prototype, Yeti, and show that performance can be gradually improved as larger program regions are identified and compiled.

In Chapters 2 and 3 we present background and related work on interpreters and JIT compilers. In Chapter 4 we describe the design and implementation of context threading. Chapter 5 describes how we evaluated context threading. The design and implementation of Yeti is described in Chapter 6. We evaluate the benefits of this approach in Chapter 7. Finally, we discuss possible avenues for future work and conclude in Chapter 8.

Chapter 2

Background

Researchers have investigated how virtual machines should execute high-level language programs for many years. The research has been focused on a few main areas. First, innovative virtual machine support can play a role in the deployment of qualitatively new and different computer languages. Second, virtual machines provide an infrastructure by which ordinary computer languages can be more easily deployed on many different hardware platforms. Third, researchers continually devise new ways to enable language VMs to run virtual programs faster.

This chapter will describe research which touches on all these issues. We will briefly discuss interpretation in preparation for a more in-depth treatment in Chapter 3. We will describe how modern object-oriented languages depend on the virtual machine to efficiently invoke methods by following the evolution of this support from the early efforts to modern speculative inlining techniques. Finally, we will briefly describe trace-based binary optimization to set the scene for Chapter 6.

2.1 High Level Language Virtual Machine

A static compiler is probably the best solution when performance is paramount, portability is not a great concern, destinations of calls are known at compile time and programs bind to external symbols before running. Thus, most third generation languages like C and FORTRAN are implemented this way. However, if the language is object-oriented, binds to external references late, and must run on many platforms, it may be advantageous to implement a compiler that targets a fictitious high-level language virtual machine (HLL VM) instead.

In Smith's taxonomy, an HLL VM is a system that provides a process with an execution environment that does not correspond to any particular hardware platform [66]. The interface offered to the high-level language application process is usually designed to hide differences between the platforms to which the VM will eventually be ported. For instance, UCSD Pascal p-code [17] and Java bytecode [52] both express virtual instructions as stack operations that take no register arguments. Gosling, one of the designers of the Java virtual machine, has said that he based the design of the JVM on the p-code machine [3]. Smalltalk [36], Self [75] and many other systems have taken a similar approach. A VM may also provide virtual instructions that support peculiar or challenging features of the language. For instance, a Java virtual machine has specialized virtual instructions (e.g., invokevirtual) in support of virtual method invocation. This allows the compiler to generate a single, relatively high-level virtual instruction instead of a sequence of complex machine and ABI dependent instructions.

This approach has benefits for the users as well. For instance, applications can be distributed in a platform neutral format. In the case of the Java class libraries or UCSD Pascal programs, the amount of virtual software far exceeds the size of the VM. The advantage is that the relatively small amount of effort required to port the VM to a new platform enables a large body of virtual applications to run on the new platform also.

There are various approaches an HLL VM can take to actually execute a virtual program. An interpreter fetches, decodes, then emulates each virtual instruction in turn. Hence, interpreters are slow but can be very portable. Faster, but less portable, a dynamic compiler can translate to native code and dispatch regions of the virtual application. A dynamic compiler can exploit runtime knowledge of program values so it can sometimes do a better job of optimizing the program than a static compiler [69].

Java Source                        Java Bytecode

int f() {                          iload a
    int a, b, c;                   iload b
    ...                            iconst 1
    c = a + b + 1;                 iadd
    ...                            iadd
}                                  istore c

              (translated by the javac compiler)

Figure 2.1: Example Java Virtual Program showing source (on the left) and Java virtual instructions, or bytecodes, on the right.

2.1.1 Overview of a Virtual Program

A virtual program, as shown in Figure 2.1, is a sequence of virtual instructions and related meta-data. The figure introduces an example program we will use as a running example, so we will briefly describe it here. First, a compiler, javac in the example, creates a class file describing the virtual program in a standardized format. (We show only one method, but any real Java example would define a whole class.) Our example consists of just one Java expression {c=a+b+1} which adds the values of two Java local variables and a constant and stores the result in a third. The compiler has translated this to the sequence of virtual instructions shown on the right. The actual semantics of the virtual instructions are not important to our example other than to note that none are virtual branch instructions.

The distinction between a virtual instruction and an instance of a virtual instruction is conceptually simple but sometimes hard to clearly distinguish in prose. We will always refer to a specific use of a virtual instruction as an "instance". For example, the first instruction in our example program is an instance of iload. On the other hand, we might also use the term virtual instruction to refer to a kind of operation, for example that the iload virtual instruction takes one parameter.

Java virtual instructions may take implicit arguments that are passed on an expression stack. For instance, in Figure 2.1, the iadd instruction pops the top two slots of the expression stack and pushes their sum. This style of instruction set is very compact because there is no need to explicitly list parameters of most virtual instructions. Consequently many virtual instructions, like iadd, consist of only the opcode. Since there are fewer than 256 Java virtual instructions, the opcode fits in a byte, and so Java virtual instructions are often referred to as bytecode.

In addition to arguments passed implicitly on the stack, certain virtual instructions take immediate operands. In our example, the iconst virtual instruction takes an immediate operand of 1. Immediate operands are also required by virtual branch instructions (the offset of the destination) and by various instructions used to access data.

The bytecode in the figure depends on a stack frame organization that distinguishes between local variables and the expression stack. Local variable array slots, or lva slots, are used to store local variables and parameters. The simple function shown in the figure needs only four local variable slots. The first slot, lva[0], stores a hidden parameter, the object handle to the invoked-upon object (the local variable known as this to Java and C++ programmers), and is not used in this example. Subsequent slots, lva[1], lva[2] and lva[3] store a, b and c respectively. The expression stack is used to store temporaries for most calculations and parameter passing. In general "load" form bytecodes push values in lva slots onto the expression stack. Bytecodes with "store" in their mnemonic typically pop the value on top of the expression stack and store it in a named lva slot.

2.1.2 Interpretation

An interpreter is the simplest way for an HLL VM to execute a guest virtual program. Whereas the persistent format of a virtual program conforms to some external specification, when it is read by an interpreter the structure of its loaded representation is chosen by the designers of the interpreter. For instance, designers may prefer a representation that word-aligns all immediate parameters regardless of their size. This would be less compact, but more portable and potentially faster to access, than the original byte code on most architectures.

An abstraction implemented by most interpreters is the notion of a virtual program counter, or vPC. It points into the loaded representation of the program and serves two main purposes. First, the vPC is used by dispatch code to indicate where in the virtual program execution has reached and hence which virtual instruction to emulate next. Second, the vPC is conventionally referred to by virtual instruction bodies to access immediate operands.
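A small C sketch, with an assumed loaded representation and a made-up opcode encoding, of how a body uses the vPC for both purposes:

#include <stdio.h>

static int loaded[] = {             /* loaded representation:              */
    /* 0 */ 1, 42,                  /*   opcode PUSH_CONST, operand 42     */
    /* 2 */ 0                       /*   opcode HALT                       */
};
static int *vPC = loaded;           /* virtual program counter             */
static int stack[8], *sp = stack;   /* expression stack                    */

static void push_const_body(void) {
    int k = vPC[1];                 /* fetch the immediate operand via vPC */
    *sp++ = k;
    vPC += 2;                       /* advance past opcode and operand     */
}

int main(void) {
    while (*vPC != 0) {             /* trivial decode: one real opcode     */
        if (*vPC == 1)
            push_const_body();
    }
    printf("top of stack = %d\n", sp[-1]);   /* prints 42 */
    return 0;
}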

Interpretation is not efficient

We do not expect interpretation to be efficient compared to executing compiled native code.

Consider Java's iadd virtual instruction. On a typical processor an integer add can be performed in one instruction. To emulate a virtual addition instruction requires three or more additional instructions to load the inputs from and store the result to the expression stack.

However, it is not just the path length of emulation that causes performance problems. Also important is the latency of the branch instructions used to transfer control to the virtual instruction body. To optimize dispatch, researchers have proposed various dispatch techniques to efficiently branch from body to body. Recently, Ertl and Gregg showed that on modern processors branch mispredictions caused by dispatch branches are a serious drain on performance [28, 29].

When emulated by most current high-level language virtual machines, the branching patterns of the virtual program are hidden from the branch prediction resources of the underlying real processor. This is despite the fact that a typical virtual machine defines roughly the same sorts of branch instructions as does a real processor and that a running virtual program exhibits similar patterns of virtual branch behaviour as does a native program running on a real CPU. In Section 3.4 we discuss in detail how our approach to dispatch deals with this issue, which we have dubbed the context problem.

2.1.3 Early Just in Time Compilers

A faster way of executing a guest virtual program is to compile its virtual instructions to native code before it is executed. This approach long predates Java, perhaps first appearing for APL on the HP3000 [48] as early as 1979. Deutsch and Schiffman built a just in time (JIT) compiler for Smalltalk in the early 1980's that ran about twice as fast as interpretation [24].

Early systems were highly memory constrained by modern standards. It was of great concern, therefore, when translated native code was found to be about four times larger than the originating bytecode. (This is less than one might fear given that on a RISC machine one typical arithmetic bytecode will be naïvely translated into two loads (pops) from the expression stack, one register-to-register arithmetic instruction to do the real work and a store (push) back to the new top of the expression stack.) Lacking virtual memory, Deutsch and Schiffman took the view that dynamic translation of bytecode was a space time trade-off. If space was tight then native code (space) could be released at the expense of re-translation (time). Nevertheless, their approach was to execute only native code. Each method had to be fetched from a native code cache or else re-translated before execution. Today a similar attitude prevails except that it has also been recognized that some code is so infrequently executed that it need not be translated in the first place. The bytecode of methods that are not hot can simply be interpreted.

A JIT can improve the performance of a JVM substantially. Relatively early Java JIT compilers from , as reported by the development team in 1997, improved the performance of the Java raytrace application by a factor of 2.2 and compress by 6.8 [19]. (These benchmarks are singled out because they eventually were adopted by the SPEC consortium to be part of the SPECjvm98 [67] benchmark suite.) More recent JIT compilers have increased the performance further [2, 4, 71]. For instance, on a modern personal computer Sun's Hotspot server dynamic compiler currently runs the entire SPECjvm98 suite more than 4 times faster than the fastest interpreter. Some experts suggest that in the not too distant future, systems based on dynamic compilers will run faster than the code generated by static compilers [69, 68].

2.2 Challenges to HLL VM Performance

Modern languages offer users powerful features that challenge VM implementors. In this section we will discuss the impact of object-oriented method invocation and late binding of external references. There are many other issues that affect Java performance which we discuss only briefly. The most important amongst them are memory management and thread synchronization.

Garbage collection refers to a set of techniques used to manage memory in Java (as in Smalltalk and Self) where unused memory (garbage) is detected automatically by the system. As a result, the programmer is relieved of any responsibility for freeing memory that he or she has allocated. Garbage collection techniques are somewhat independent of dynamic compilation techniques. The primary interaction requires that threads can be stopped in a well-defined state prior to garbage collection. So-called safe points must be defined at which a thread periodically saves its state to memory. Code generated by a JIT compiler must ensure that safe points occur frequently enough that garbage collection is not unduly delayed. Typically this means that each transit of a loop must contain at least one safe point.

Java provides explicit, built-in, support for threads. Thread synchronization refers mostly to the functionality that allows only one thread to enter certain regions of code at a time. Thread synchronization must be implemented at various points and the techniques for implementing it must be supported by code generated by the JIT compiler.

2.2.1 Polymorphism and the Implications of Object-oriented Programming

Over the last few decades, object-oriented development grew from a vision, to an industry trend, to a standard programming tool. Object-oriented techniques stressed development systems in many ways, but the one we need to examine in detail here is the challenge of polymorphic method invocation.

The destination of a callsite in an object-oriented language is not determined solely by the signature of a method, as in C or FORTRAN. Instead, it is determined at run time by a combination of the method signature and the class of the invoked-upon object. Callsites are said to be polymorphic as the invoked-upon object may turn out to be one of potentially many classes.

void sample(Object[] otab) {
    for (int i = 0; i < otab.length; i++) {
        otab[i].toString();
    }
}

Figure 2.2: Example of Java method containing a polymorphic callsite.

Most object-oriented languages categorize objects into a hierarchy of classes. Each object is an instance of a class which means that the methods and data fields defined by that class are available for the object. Each class, except the root class, has a super-class or base-class from which it inherits fields and methods.

Each class may override a method and so at run time the system must dispatch the definition of the method corresponding to the class of the invoked-upon object. In many cases it is not possible to deduce the exact type of the object at compile time.

A simple example will make the above description concrete. When it is time to debug a program almost all programmers rely on facilities to view a textual description of their data. In an object-oriented environment this suggests that each object should define a method that returns a string description of itself. This need was recognized by the designers of Java and consequently they defined a method in the root class Object:

public String toString()

to serve this purpose. The toString method (it is the text returned by toString that appears in various views of an interactive debugger) can be invoked on every Java object. Consider an array of objects in Java. Suppose we code a loop that iterates over the array and invokes the toString method on each element as in Figure 2.2.

There are literally hundreds of definitions of toString in a Java system and in many cases the compiler cannot discern which one will be the destination of the callsite. Since it is not possible to determine the destination of the callsite at compile time, it must be done when the program executes. Determining the destination taxes performance in two main ways. First, locating the method to dispatch at run time requires computation. This will be discussed in Section 2.4.1. Second, the inability to predict the destination of a callsite at compile time reduces the efficacy of interprocedural optimizations and thus results in relatively slow systems. This is discussed next.

Impact of Polymorphism on Optimization

Optimization can be stymied by polymorphic callsites. At compile time, an optimizer cannot determine the destination of a call, so obviously the target cannot be inlined. In fact, standard interprocedural optimization as carried out by an optimizing C or FORTRAN compiler is simply not possible [56].

In the absence of interprocedural information, an optimizer cannot guess what calculations are made by a polymorphic callee. Knowledge of the destination of the callsite would permit a more precise analysis of the values modified by the call. For instance, with runtime information, the optimizer may know that only one specific version of the method exists and that this definition simply returns a constant value. Code compiled speculatively under the assumption that the callsite remains monomorphic could constant propagate the return value forward and hence be much better than code compiled under the conservative assumption that other definitions of the method may be called.

Given the tendency of modern object-oriented software to be factored into many small methods which are called throughout a program, even in its innermost loops, these optimization barriers can significantly degrade the performance of the generated code. A typical example might be that common subexpression elimination cannot combine identical memory accesses separated by a polymorphic callsite because it cannot prove that all possible callees do not kill the memory location. To achieve performance comparable to procedural compiled languages, interprocedural optimization techniques must somehow be applied to regions laced with polymorphic callsites.

Section 2.4 describes various solutions to these issues.

2.2.2 Late binding

A basic design issue for any language is when external references are resolved. Java binds references very late in order to support flexible packaging in general and downloadable code in particular. (This contrasts with traditional languages like C, which rely on a link-editor to bind to external symbols before they run.) The general idea is that a Java program may start running before all the classes that it needs are locally available. In Java, binding is postponed until the last possible moment, when the virtual instruction making the reference executes for the first time. Then, during the first execution, the reference is either resolved or a software exception is raised. This means that the references a program attempts to resolve depend on the path of execution through the code.

This approach is convenient for users and challenging for language implementors. Whenever Java code is executed for the first time the system must be prepared to handle unresolved external references. An obvious, but slow, approach is to simply check whether an external reference is resolved each time the virtual instruction executes. For good performance, only the

first execution should be burdened with any binding overhead. One way to achieve this is for the virtual program to rewrite itself when an external reference is resolved. For instance, suppose a virtual instruction, vop, takes an immediate parameter that names an unresolved class or method. When the virtual instruction is first executed the external name is resolved and an internal VM data structure describing it is created. The loaded representation of the virtual instruction is then rewritten, say to vop_resolved, which takes the address of the data structure as an immediate parameter. The implementation of vop_resolved can safely assume that the external reference has been resolved successfully. Subsequently vop_resolved will execute in place of vop with no binding overhead5.
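To make the rewriting idea concrete, the following C sketch assumes a loaded representation in which each virtual instruction occupies a pair of slots (a body pointer and an operand word), as in a call-threaded interpreter; the names vop, vop_resolved, ClassInfo and resolve_class are illustrative only and are not those of any particular VM:

    #include <stdint.h>

    typedef struct ClassInfo ClassInfo;                  /* VM-internal descriptor (hypothetical) */
    extern ClassInfo *resolve_class(const char *name);   /* may raise a software exception */

    static void vop_resolved(intptr_t *slot) {
        ClassInfo *ci = (ClassInfo *) slot[1];            /* reference is already resolved */
        (void) ci;                                        /* ... emulate the instruction ... */
    }

    static void vop(intptr_t *slot) {
        /* First execution only: resolve the external name ... */
        ClassInfo *ci = resolve_class((const char *) slot[1]);
        slot[1] = (intptr_t) ci;                          /* swap the name for the descriptor */
        slot[0] = (intptr_t) vop_resolved;                /* rewrite the virtual opcode in place */
        vop_resolved(slot);                               /* ... then finish this execution */
    }

Subsequent dispatches through slot[0] reach vop_resolved directly, so only the first execution pays any resolution cost.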

The process of virtual instruction rewriting is relatively simple to carry out when instructions are being interpreted. For instance, it is possible to fall back on standard thread support libraries to protect overwriting from multiple threads racing to rewrite the instruction. It is more challenging if the resolution is being carried out by dynamically compiled native code [74].

5 This roughly describes how JamVM and SableVM handle late binding.

2.3 Early Dynamic Optimization

Early efforts to build dynamic optimizers were embedded in applications or in the run time systems of C or FORTRAN programs.

2.3.1 Manual Dynamic Optimization

Early experiments with dynamic optimization indicated that large performance improvements are possible. Typical early systems were application-specific. Rather than compile a language, they dynamically generated machine code to calculate the solution to a problem described by application specific data. Later, researchers built semi-automatic dynamic systems that would re-optimize regions of C programs at run time [51, 5, 34, 38, 37].

Although the semi-automatic systems did not enable dramatic performance improvements across the board, this may be a consequence of the performance baseline to which they compared themselves. The prevalent programming languages of the time were supported by static compilation and so it was natural to use the performance of highly optimized binaries as the baseline. The situation for modern languages like Java is somewhat different. Dynamic techniques that do not pay off relative to statically optimized C code may be beneficial when applied to code naïvely generated by a JIT. Consequently, a short description of a few early systems seems worthwhile.

2.3.2 Application specific dynamic compilation

In 1968 Ken Thompson built a dynamic compiler which accepted a textual description of a regular expression and dynamically translated it into machine code for an IBM 7094 computer [49]. The resulting code was dispatched to find matches quickly.

In 1985 Pike et al. invented an often-cited technique to generate good code for quickly copying, or bitblt’ing, regions of pixels from memory onto a display [58]. They observed that there was a bewildering number of special cases (caused by various alignments of pixels in display memory) to consider when writing a good general purpose bitblt routine. Instead they wrote a dynamic code generator that could produce a good (near optimal) set of machine instructions for each special case. At worst, their system executed only about 400 instructions to generate code for a bitblt.

2.3.3 Dynamic Compilation of Manually Identified Static Regions

In the mid-1990’s Lee and Leone [51] built FABIUS, a dynamic optimization system for the research language ML [34]. FABIUS depends on a particular use of curried functions. Curried functions take one or more functions as parameters and return a new function that is a composition of the parameters. FABIUS interprets the call of a function returned by a curried function as a clue from the programmer that dynamic re-optimization should be carried out.

Their results, which they describe as preliminary, indicate that small, special purpose, applications such as sparse matrix multiply or a network packet filter may benefit from their technique but the time and memory costs of re-optimization are difficult to recoup in general purpose code.

More recently it has been suggested that C and FORTRAN programs can benefit from dynamic optimization. Auslander et al. [5], Grant et al. [38, 37] and others have built semi-automatic systems to investigate this. Initially these systems required the user to identify regions of the program that should be dynamically re-optimized as well as the variables that are runtime constant. Later systems allowed the user to identify only the program variables that are runtime constants and could automatically identify which regions should be re-optimized at run time.

In either case, the general idea is that the user indicates regions of the program that may be beneficial to dynamically compile at run time. The dynamic region is precompiled into template code. Then, at run time, the values of runtime constants can be substituted into the template and the dynamic region re-optimized. Auslander’s system worked only on relatively small kernels like matrix multiply and quicksort. A good way to look at the results was in terms of the break-even point. In this view, the kernels reported by Auslander had to execute from about one thousand to a few tens of thousands of times before the improvement in execution time obtained by the dynamic optimization outweighed the time spent re-compiling and re-optimizing.
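To put the break-even point in rough arithmetic terms (our notation, not Auslander's): if re-compiling and re-optimizing a region costs C cycles and each optimized execution saves s cycles over the original code, the region must execute at least N = C / s times before dynamic optimization pays for itself; for the kernels above, N evidently fell between roughly one thousand and a few tens of thousands.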

Subsequent work by Grant et al. created the DyC system [38, 37]. DyC simplified the process of identifying regions and applied more elaborate optimizations at run time. This system can handle real programs, although even the streamlined process of manually designating only runtime constants is reported to be time consuming. Their methodology allowed them to evaluate the impact of different optimizations independently, including complete loop unrolling, dynamic zero and copy propagation, dynamic reduction of strength and dynamic dead assignment elimination to name a few. Their results showed that only loop unrolling had sufficient impact to speed up real programs and in fact without loop unrolling there would have been no overall speedup at all.

2.4 Dynamic Object-oriented optimization

Some of the challenges to performance discussed above are caused by new, more dynamic language features. Consequently, optimizations that have traditionally been carried out at compile time are no longer effective and must be redeployed as dynamic optimizations carried out at run time. The best example, polymorphic method invocation, will be discussed next.

2.4.1 Finding the destination of a polymorphic callsite

Locating the definition of a method for a given object at run time is a search problem. To search for a method definition corresponding to a given object the system must search the classes in the hierarchy. The search starts at the class of the object, proceeds to its super class, to the super class of its super class, and so on, until the root of the class hierarchy is reached. If each method invocation requires the search to be repeated, the process will be a significant tax on overall performance. Nevertheless, this is exactly what occurs in a naïve implementation of

Smalltalk, Self, Java, JavaScript or Python.
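The search can be pictured with the following C sketch; the Class and Method structures are hypothetical, and production VMs replace or cache this walk rather than repeating it on every invocation:

    #include <string.h>
    #include <stddef.h>

    typedef struct Method { const char *name; struct Method *next; } Method;
    typedef struct Class  { struct Class *super; Method *methods; } Class;

    /* Walk from the object's class toward the root until a definition is found. */
    static Method *lookup(Class *klass, const char *name) {
        for (Class *c = klass; c != NULL; c = c->super)
            for (Method *m = c->methods; m != NULL; m = m->next)
                if (strcmp(m->name, name) == 0)
                    return m;
        return NULL;   /* no definition found: a runtime error in most languages */
    }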

If the language permits early binding, the search may be converted to a table lookup at compile-time. For instance, in C++, all the possible destinations of a callsite are known when the program is loaded. As a result, a C++ virtual callsite can be implemented as an indirect branch via a virtual table specific to the class of the object invoked on. This reduces the cost to little more than a function pointer call in C. The construction and performance of virtual function tables have been heavily studied, for instance by Driesen [25].
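By comparison with the search above, a virtual table dispatch costs an indexed load and an indirect call, roughly as in this C sketch (the layout and slot index are illustrative; real C++ compilers lay out vtables according to their ABI):

    typedef struct Object Object;
    typedef void (*VMethod)(Object *self);

    typedef struct VTable { VMethod slot[16]; } VTable;    /* one entry per virtual method */
    struct Object { const VTable *vtable; /* instance fields follow */ };

    enum { TO_STRING_SLOT = 3 };                  /* index fixed when the program is built */

    static void invoke_toString(Object *obj) {
        obj->vtable->slot[TO_STRING_SLOT](obj);   /* load the table entry, branch indirectly */
    }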

Real programs tend to have low effective polymorphism. This means that the average callsite has very few actual destinations. In fact, most callsites are effectively monomorphic, meaning they always call the same method. Note that low effective polymorphism does not imply that a smart compiler should have been able to deduce the destination of the call. Rather, it is a statistical observation that real programs typically make less use of polymorphism than they might.

Inlined Caching and Polymorphic Inlined Caching

For late-binding languages it is seldom possible to generate efficient code for a callsite at compile time. In response, various researchers have investigated how it might be done at run time. In general, it pays to cache the destination of a callsite when the callsite is commonly executed and its effective polymorphism is low. The in-line cache, invented by Deutsch and

Schiffman [24] for Smalltalk more than 20 years ago, replaces the polymorphic callsite with the native instruction to call the cached method. The prologue of all methods is extended with fix-up code in case the cached destination is not correct. Deutsch and Schiffman reported hitting the in-line cache about 95% of the time for a set of Smalltalk programs.

Hölzle [43] extended the in-line cache to be a polymorphic in-line cache (PIC) by generating code that successively compares the class of the invoked object to a few possible destination types. The implementation is more difficult than an in-line cache because the dynamically generated native code sequence must sequentially compare and conditionally branch against several possible destinations. A PIC extends the performance benefits of an in-line cache to effectively polymorphic callsites. For example, on a SPARCstation-2 Hölzle’s lookup would cost only 8 + 2n cycles, where n is the actual polymorphism of the callsite. A PIC lookup costs little more than an in-line cache for effectively monomorphic callsites and is much faster for callsites that are effectively polymorphic.
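In effect, the generated PIC stub behaves like the following C sketch (reusing the hypothetical Object layout above; a real PIC is emitted as native code and is patched to add cases as new receiver classes are observed):

    extern const VTable *cached_class_0, *cached_class_1;   /* recorded on earlier misses */
    extern void toString_class_0(Object *), toString_class_1(Object *);
    extern void pic_miss(Object *);                /* full lookup, then extend the PIC */

    static void pic_toString(Object *obj) {
        const VTable *k = obj->vtable;             /* stands in for the receiver's class */
        if (k == cached_class_0)
            toString_class_0(obj);                 /* cached destination: no search */
        else if (k == cached_class_1)
            toString_class_1(obj);
        else
            pic_miss(obj);
    }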

2.4.2 Smalltalk and Self

Smalltalk adopted the position that essentially every software entity should be represented as an object. A fascinating discussion of the qualitative benefits anticipated from this approach appears in Goldberg’s book [35].

The designers of Self took an even more extreme position. They held that even control

flow should be expressed using object-oriented concepts6. They understood that this approach would require them to invent new ways to efficiently optimize message invocation if the performance of their system was to be reasonable. Their research program was extremely ambitious and they explicitly compared the performance of their system to optimized C code executing the same algorithms.

In addition, the Self system aimed to support the most interactive programming environment possible. Self supports debugging, editing and recompiling methods while a program is running, with no need to restart. This requires very late binding. The combination of the radically pure object-oriented approach and the ambitious goals regarding development environment made Self a sort of trial-by-fire for object-oriented dynamic compilation techniques.

6 In Self, two blocks of code are passed as parameters to an if-else message sent to a boolean object. If the object is true the first block is evaluated, otherwise the second.

Ungar, Chambers and Hölzle have published several papers [15, 44, 43, 45] that describe

how the performance of Self was increased from more than an order of magnitude slower than

compiled C to only twice as slow. A readable summary of the techniques is given by Ungar

et al. [75]. A thumbnail summary would be that effective monomorphism can be exploited

by a combination of type-checking guard code (to ensure that some object’s type really is

known) and static inlining (to expose the guarded code to interprocedural optimization). To

give the flavor of this work we will briefly describe two specific optimizations: customization

and splitting.

Customization

Customization is a relatively old object-oriented optimization introduced by Craig Chambers

in his dissertation [15] in 1988. The general idea is that a polymorphic callsite can be turned

into a static callsite (or inlined code) when the type of object on which the method is invoked

is known. The approach taken by a customizing compiler is to replicate methods with type-specialized copies so as to produce callsites where types are known.

Ungar et al. give a simple, convincing example in [75]. In Self, it is usual to write generic

code, for instance algorithms that can be shared by integer and floating point code. An example

is a method to calculate minimum. The min method is defined by a class called Magnitude.

All concrete number classes, like Integer and Float, thus inherit the min method. A customizing compiler will arrange that customized definitions of min are compiled for Integer and Float. Inlining the customized methods replaces the polymorphic call7 to < within the original min method by the appropriate arithmetic compare instructions8 in each of the customized versions of integer and float min.

7 In Self even integer comparison requires a message send.
8 I.e., the integer-customized version of min can issue an arithmetic integer compare and the float customization can issue a float comparison instruction.

Method Splitting

Oftentimes, customized code can be inlined only when protected by a type guard. The guard code is essentially an if-then-else construct where the “if” tests the type of an object, the “then” inlines the customized code and the “else” performs the original polymorphic method invocation. Chambers [15] noted that the predicate implemented by the guard establishes the type of the invoked object for one leg of the if-then-else, but following the merge point, this knowledge is lost. Hence, he suggested that following code be “split” into paths for which knowledge of types is retained. This suggests that instead of allowing control flow to merge after the guard, a splitting compiler can replicate following code to preserve type knowledge.
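In C terms, the guard-plus-inline pattern looks roughly like the sketch below (the class descriptors and helper functions are hypothetical and again reuse the Object layout sketched earlier); splitting then copies the code that follows the merge point into each leg so that the type knowledge established by the guard is not lost:

    extern const VTable *INTEGER_CLASS;             /* hypothetical class descriptor */
    extern long int_value(Object *);
    extern long send_min(Object *, Object *);       /* original polymorphic invocation */

    static long min_guarded(Object *a, Object *b) {
        if (a->vtable == INTEGER_CLASS && b->vtable == INTEGER_CLASS) {
            /* "then" leg: customized, inlined integer compare */
            long x = int_value(a), y = int_value(b);
            return x < y ? x : y;
        }
        /* "else" leg: fall back to dynamic dispatch */
        return send_min(a, b);
    }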

Incautious splitting could potentially cause exponential code size expansion. This implies that the technique is one that should only be applied to relatively small regions where it is known that polymorphic dispatch is hurting performance.

2.4.3 Java JIT as Dynamic Optimizer

The first Java JIT compilers translated methods into native instructions and improved polymorphic method dispatch by deploying techniques such as method customization and splitting invented decades previously for Smalltalk. New innovations in garbage collection and thread synchronization, not discussed in this review, were also made. Despite all this effort, Java implementations were still slow. More aggressive optimizations had to be developed to accommodate the performance challenges posed by Java’s object-oriented features, particularly the polymorphic dispatch of small methods. The writers of Sun’s Hotspot compiler white paper note:

In the Java language, most method invocations are virtual (potentially polymorphic), and are more frequently used than in C++. This means not only that method invocation performance is more dominant, but also that static compiler optimizations (especially global optimizations such as inlining) are much harder to perform for method invocations. Many traditional optimizations are most effective between calls, and the decreased distance between calls in the Java language can significantly reduce the effectiveness of such optimizations, since they have smaller sections of code to work with. [2, pp. 17]

Observations similar to the above led Java researchers to perform speculative optimizations to transform the program in ways that are correct at some point, but may be invalidated by legal computations made by the program. For instance, Pechtchanski and Sarkar speculatively generate code for a method with only one loaded definition that assumes it will never be overridden. Later, if the loader loads a class that provides another definition of the method, the speculative code may be incorrect and must not run again. In this case, the entire enclosing method (or inlined method nest) must be recompiled under more realistic assumptions and the original compilation discarded [57].

In principle, a similar approach can be taken if the speculative code is correct but turns out to be slower than it could be.

The infrastructure to replace a method is complex, but is a fundamental requirement of speculative optimization in a method-oriented dynamic compiler. It consists of roughly two parts. First, meta data must be produced when a method is optimized that allows local variables in the stack frame and registers of a running method to be migrated to a recompiled version.

This is somewhat similar to the problem of debugging optimized code [44]. Later, at run time, the meta data is used to convert the stack frame of the invalid code to that of the recompiled code. Fink and Qian describe a technique called on stack replacement (OSR) that shows how to restrict optimization so that recompilation is always possible [31]. The key idea is that values that may be dead under traditional optimization schemes must be kept alive so that a less aggressively optimized replacement method can continue.

2.4.4 JIT Compiling Partial Methods

The dynamic compilers described thus far compile entire methods or inlined method nests. The

problem with this approach is that even a hot method may contain cold code. The cold code

may never be executed or perhaps will later become hot only after being compiled.

Compiling cold code that never executes has only indirect effects. It can have a positive impact on performance, by allowing the optimizer to prove facts about the hot portions of the method that enable faster code to be produced. It can also have a negative impact, as the cold code may force the optimizer to generate more conservative, slower, code for the hot regions. Thus, various researchers have investigated how compiling cold code can be avoided.

Whaley described a prototype that compiled partial methods, skipping cold code. He modified the compiler to generate glue code stubs in the place of cold code. The glue code had

two purposes. First, to the optimizer at compile time, the glue code included annotations so

that it appeared to use the same variables as the cold code. Consequently the optimizer had a

true model of variables used in the cold regions and so generated correct code for the hot ones.

Second, when run, the glue code interacted with the runtime system to exit the code cache and

resume interpretation. Hence, if a cold region was entered, control would simply revert to the

interpreter. His results showed large compile time savings, leading to modest speedups for

certain benchmarks [79].

Suganuma et al. investigated this issue further by modifying a method-based JIT to speculatively optimize hot inlined method nests. Their technique inlines only hot regions, replacing cold code with guard code [72]. The technique is speculative because conservative assumptions in the cold code are ignored. When execution triggers guard code, it exposes the speculation as wrong and hence is a signal that continued execution of the inlined method nest may be incorrect. On stack replacement and recompilation are used to recover. They also measured a significant reduction in compile time. However, only a modest speedup was obtained, suggesting either that conservative assumptions stemming from the cold code are not a serious concern or their recovery mechanism is too costly.

2.5 Traces

HP Dynamo [8, 26, 7] is a same-ISA binary optimizer. Dynamo initially interprets a binary executable program, detecting hot interprocedural paths, or traces, through the program as it runs. These traces are then optimized and loaded into a trace cache. Subsequently, when the interpreter encounters a program location for which a trace exists, it is dispatched from the trace cache. If execution diverges from the path taken when the trace was generated then a trace exit occurs, execution leaves the trace cache and interpretation resumes. If the program follows the same path repeatedly, it will be faster to execute code generated for the trace rather than the original code. Dynamo successfully reduced the execution time of many important benchmarks. Several binary optimization systems, including DynamoRIO [13], Mojo [16],

Transmeta’s CMS [23], and others, have since used traces.

Dynamo uses a simple heuristic, called Next Executing Tail (NET), to identify traces. NET starts generating a trace from the destination of a hot reverse branch, since this location is likely to be the head of a loop, and hence a hot region of the program is likely to follow. If a given trace exit becomes hot, a new trace is generated starting from its destination.
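A counter-based sketch of this heuristic, in C, might look as follows (the threshold and helper routines are invented for illustration; Dynamo's actual bookkeeping differs in detail):

    #include <stdbool.h>
    #include <stdint.h>

    enum { HOT_THRESHOLD = 50 };                  /* illustrative value only */

    extern int  *counter_for(void *dest);         /* hypothetical per-destination counter */
    extern void  record_trace_from(void *dest);   /* grow and optimize a trace from here */

    /* Called by the interpreter whenever a branch (or trace exit) is taken. */
    static void net_profile(void *branch_pc, void *dest, bool is_trace_exit) {
        bool reverse = (uintptr_t) dest < (uintptr_t) branch_pc;  /* backward: likely a loop head */
        if (reverse || is_trace_exit) {
            int *c = counter_for(dest);
            if (++*c >= HOT_THRESHOLD)
                record_trace_from(dest);          /* this destination is now hot */
        }
    }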

Software trace caches are efficient structures for dynamic optimization. Bruening and

Duesterwald [11] compare execution time coverage and code size for three dynamic optimization units: method bodies, loop bodies, and traces. They show that method bodies require significantly more code size to capture an equivalent amount of execution time than either traces or loop bodies. This result, together with the properties outlined in Section 1.4, suggests that traces may be a good choice for a unit of compilation.

DynamoRIO Bruening describes a new version of Dynamo which runs on the Intel x86 architecture. The current focus of this work is to provide an efficient environment to instrument real world programs for various purposes such as to improve the security of legacy applications [13, 12].

One interesting application of DynamoRIO was by Sullivan et al. [73]. They ran their own tiny interpreter on top of DynamoRIO in the hope that it would be able to dynamically optimize away a significant proportion of interpretation overhead. They did not initially see the results they were hoping for because the indirect dispatch branches confounded Dynamo’s trace selection. They responded by creating a small interface by which the interpreter could programmatically give DynamoRIO hints about the relationship between the virtual pc and the hardware pc. This was their way around what we call the context problem in Section 3.4.

Whereas interpretation slowed down by almost a factor of two using regular DynamoRIO, after they had inserted calls to the hint API, they saw speedups of about 20% on a set of small benchmarks. Baron [9] reports similar performance results running a similarly modified

JVM [80].

Last Executed Iteration (LEI)

Hiniker, Hazelwood and Smith performed a simulation study evaluating enhancements to the basic Dynamo trace selection heuristics [41]. They observed two main problems with Dynamo’s NET heuristic. The first problem, trace separation, occurs when traces that turn out to often execute sequentially happen to be placed far apart in the trace cache, hurting the locality of reference of code in the instruction cache. LEI maintains a branch history mechanism as part of its trace collection system that allows it to do a better job handling loop nests, requiring fewer traces to span the nest. The second problem, excessive code duplication, occurs when many different paths become hot through a region of code. The problem is caused when a trace exit becomes hot and a new trace is generated that diverges from the preexisting trace for only one or a few blocks before rejoining its path. As a consequence, the new trace replicates blocks of the old trace from the place they rejoin to their common end. Combining several such observed traces together forms a region with multiple paths and less duplication. A simulation study suggests that using their heuristics, fewer, smaller selected traces will account for the same proportion of execution time.

2.6 Hotpath

Gal, Probst and Franz describe the Hotpath project [33]. Hotpath extends JamVM (one of the interpreters we use for our experiments) to be a trace oriented mixed-mode system. They focus on traces starting at loop headers and do not compile traces other than those in loops.

Thus, they do not attempt trace linking as described by Dynamo, but rather “merge” traces that originate from side exits leading back to loop headers. This technique allows Hotpath to compile loop nests. They describe an interesting way of modeling traces using static single assignment (SSA) [22] that exploits the constrained flow of control present in traces. This both simplifies their construction of SSA and allows very efficient optimization. Their experimental results show excellent speedup, within a factor of two of Sun’s HotSpot, for scientific style loop nests like those in the LU, SOR and Linpack benchmarks, and more modest speedup, around a factor of two over interpretation, for FFT. No results are given for tests in the SPECjvm98 suite, perhaps because their system does not yet support “trace merging across (inlined) method invocations” [33, page 151]. The optimization techniques they describe seem complementary to the overall architecture we propose in Chapter 6.

2.7 Branch Prediction and General Purpose Computers

Branch prediction is an important consideration in the design of a high level language VM because commonly used dispatch techniques represent an unusual workload for modern processors. As we shall see in Chapter 3, techniques that were efficient on older machines may be very slow on modern processors for which accurate branch prediction is essential for good performance. The basic problem is that many interpretation techniques evolved when the path length of dispatch code was the most important design consideration whereas on modern computers the predictability of dispatch branches is key.

Modern processors implement deep pipelines to achieve good performance. The main idea is to split the processing of instructions up into pipeline stages and overlap the processing of the stages to achieve high throughput. Straight-line sequences of instructions can easily be read ahead, decoded and executed. However, branches pose a challenge because their destination may not be known until well after they have been decoded. If the processor simply waited until the destination of the branch is known, performance would be poor because the pipeline would run dry. Thus, modern architectures attempt to predict the destination of branches and speculatively decode and execute the instructions there. There is a rich body of research on branch prediction, since branches are otherwise very costly on pipelined architectures. In this thesis we care only about techniques adopted by real microprocessors [54]. With a prediction in hand, the processor will proceed as if the destination of the branch is known. If the prediction turns out to be correct the results calculated by the speculatively dispatched instructions can be committed to architectural state. If the branch prediction turns out to be wrong, then the speculative work must be thrown away.

There are three kinds of branch instructions that interest us:

1. Direct Branches, including direct conditional branches;

2. Indirect branches;

3. Calls and Returns.

Direct Branches Direct branch instructions encode the destination explicitly as an operand, typically as a signed offset to be added to the program counter. Unconditional direct branches thus pose little challenge, as the destination can be calculated as soon as the instruction has been decoded. Conditional branch instructions are harder – in order to predict the destination the processor must guess whether (or not) the branch will be taken. Fortunately, techniques have been developed that allow conditional branches to be predicted accurately. Most techniques

involve memories that record the previous behavior of each branch and guess whether a branch

is taken or not based on the recorded information in combination with various aspects of the

execution context of the processor.

Indirect Branches Indirect branches take operands that identify a location in memory. The destination of the branch is the contents of the memory location. Indirect branches can be challenging to predict because they are data dependent on memory. However, it turns out that for many workloads the destination of a given indirect branch is always or mostly the same. In this case the processor simply remembers the destination last taken by an indirect branch and predicts that subsequent executions will branch to the same place.

As pointed out by Ertl and Gregg most high level language virtual machines do not behave this way because the indirect branches used to dispatch virtual instructions have many different destinations [29].

Calls and Returns Direct calls are similar to direct branches in that the destination is explicit.

Thus, the destination of a call is easy to predict. A return instruction, on the other hand, has the flavor of an indirect branch, in that on many architectures it is defined to pop its destination off a stack in memory. However, in most cases, calls and returns are perfectly matched and so the destination of each return is the instruction following the corresponding call. To handle this case processors maintain a stack of addresses. Whenever a call is decoded its address is pushed on the stack. By popping the stack the destination of the corresponding return can be predicted perfectly.

2.7.1 Dynamic Hardware Branch Prediction

The primary mechanism used to predict indirect branches on modern computers is the branch target buffer (BTB). The BTB is a hardware table in the CPU that associates the destination of a small set of branches with their address [40]. The idea is to simply remember the previous

destination of each branch. This is the same as assuming that the destination of each indirect

branch is correlated with the address in memory of the branch instruction itself.

The Pentium 4 implements a 4K entry BTB [42]. (Instead of a BTB the PowerPC 970 has

a much smaller 32 entry count cache [46].) Direct threading confounds the BTB because all

instances of a given virtual instruction compete for the same BTB slot.

Another kind of dynamic branch predictor is used for conditional branch instructions. Conditional branches are relative, or direct, branches so there are only two possible destinations.

The challenge lies in predicting whether the branch will be taken or fall through. For this purpose modern processors implement a branch history table. The PowerPC 7410, as an example, deploys a 2048 entry 2 bit branch history table [55]. Direct threading also confounds the branch history table as all the instances of each conditional branch virtual instruction compete for the same branch history table entry. In this case, the hard to predict branch is not an explicit dispatch branch but rather the result of an if statement in a virtual branch instruction body. This will be discussed in more detail in Section 4.3.
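One entry of such a table can be modelled by a two-bit saturating counter, as in this small C sketch (the table size matches the 7410's 2048 entries; the indexing is simplified):

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_SIZE 2048
    static uint8_t bht[BHT_SIZE];                 /* each entry holds a counter in 0..3 */

    static bool predict_taken(uintptr_t pc) {
        return bht[pc % BHT_SIZE] >= 2;           /* 2 or 3: predict taken */
    }

    static void train(uintptr_t pc, bool taken) {
        uint8_t *c = &bht[pc % BHT_SIZE];
        if (taken  && *c < 3) (*c)++;             /* saturate at 3 */
        if (!taken && *c > 0) (*c)--;             /* saturate at 0 */
    }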

Return instructions can be predicted perfectly using a stack of addresses pushed by call instructions. The Pentium 4 has a 16 entry return address stack [42] whereas the PPC970 uses a similar structure called the link stack [46].

2.8 Chapter Summary

In this chapter we briefly traced the development of high-level language virtual machines from interpreters to dynamic optimizing compilers. We saw that interpreter designs may perform poorly on modern, highly pipelined processors, because current dispatch mechanisms cause too many branch mispredictions. This will be discussed in more detail in Section 3.4. Later, in

Chapter 4, we describe our solution to the problem.

Currently, JIT compilers compile entire methods or inlined method nests. Since hot methods may contain cold code, this forces the JIT compiler and runtime system to support late binding. Should the cold code later become hot, a method-based JIT must recompile the containing method or inlined method nest to optimize the newly hot code. These issues add complexity to a method oriented system that could be avoided if compiled code contained no cold code. The HP Dynamo binary optimizer project defines a suitable candidate for a dynamically identified unit of compilation, namely the hot interprocedural path, or trace. In Chapter 6, we describe how a virtual machine can compile traces to incrementally compile code as it becomes hot.

Chapter 3

Dispatch Techniques

In this chapter we expand on our discussion of interpretation by examining several dispatch techniques in detail. In Chapter 2 we defined dispatch as the mechanism used by a high level language virtual machine to transfer control from the code to emulate one virtual instruction to the next. This chapter has the flavor of a tutorial as we trace the evolution of dispatch techniques from the simplest to the highest performing.

Although in most cases we will give a small C language example to illustrate the way the interpreter is structured, this should not be taken to mean that all interpreters are hand written

C programs. Precisely because so many dispatch mechanisms exist, some researchers argue that the interpreter portion of a virtual machine should be generated from some more generic representation [30, 70].

Section 3.1 describes switch dispatch, the simplest dispatch technique. Section 3.2 introduces call threading, which figures prominently in our work. Section 3.3 describes direct threading, a common technique that suffers from branch misprediction problems. Section 2.7.1 briefly describes branch prediction resources in modern processors. Section 3.4 defines the context problem, our term for the challenge to branch prediction posed by interpretation. Subroutine threading is introduced in Section 3.5. Finally, Section 3.6 describes related work that eliminates dispatch overhead by inlining or replicating virtual instruction bodies.


3.1 Switch Dispatch

Switch dispatch, perhaps the simplest dispatch mechanism, is illustrated by Figure 3.1. Although the persistent representation of a Java class is standards-defined, the representation of

a loaded virtual program is up to the VM designer. In this case we show how an interpreter

might choose a representation that is less compact than possible for simplicity and speed of

interpretation. In the figure, the loaded representation appears on the bottom left. Each virtual

opcode is represented as a full word token even though a byte would suffice. Arguments, for

those virtual instructions that take them, are also stored in full words following the opcode.

This avoids any alignment issues on machines that penalize unaligned loads and stores.

Figure 3.1 illustrates the situation just before the statement c=a+b+1 is executed. The box

on the right of the figure represents the C implementation of the interpreter. The vPC points

to the word in the loaded representation corresponding to the first instance of iload. The

interpreter works by executing one iteration of the dispatch loop for each virtual instruction it

executes, switching on the token representing each virtual instruction. Each virtual instruction

is implemented by a case in the switch statement. Virtual instruction bodies are simply the

compiler-generated code for each case.

Every instance of a virtual instruction consumes at least one word in the internal representation, namely the word occupied by the virtual opcode. Virtual instructions that take operands

are longer. This motivates the strategy used to maintain the vPC. The dispatch loop always bumps the vPC to account for the opcode and bodies that consume operands bump the vPC further, one word per operand.

Although no virtual branch instructions are illustrated in the figure, they operate by assigning a new value to the vPC for taken branches.
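For example, a virtual goto could be added to the dispatch loop of Figure 3.1 with a case along these lines (a sketch only; here the operand is taken to be an absolute slot index into a hypothetical loaded_rep array):

    case GOTO: {
        int target = *vPC;              /* operand: slot index of the branch target */
        vPC = &loaded_rep[target];      /* taken branch: overwrite the vPC */
        break;
    }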

A switch interpreter is relatively slow due to the overhead of the dispatch loop and the switch. Despite this, switch interpreters are commonly used in production (e.g. in the JavaScript and Python interpreters). Presumably this is because switch dispatch can be implemented in

ANSI standard C and so it is very portable.

Java source:           c=a+b+1;
Java bytecode:         iload a; iload b; iconst 1; iadd; iadd; istore c
Loaded representation: ILOAD a  ILOAD b  ICONST 1  IADD  IADD  ISTORE c
                       (virtual operations are identified by tokens; the vPC points
                       at the first ILOAD)

    interp(){
        int *vPC;
        while(1){
            switch(*vPC++){
            case ICONST:
                //fetch immed arg and move vPC to next opcode
                int c = *vPC++;
                //push c
                break;
            case IADD:
                //pop 2 inputs, add
                //push result
                break;
            case ILOAD:
                //push local var..
                break;
            case ISTORE:
                //pop, store to local
                break;
            }
        }
    }

Figure 3.1: A switch interpreter loads each virtual instruction as a virtual opcode, or token, corresponding to the case of the switch statement that implements it. Virtual instructions that take immediate operands, like iconst, must fetch them from the vPC and adjust the vPC past the operand. Virtual instructions which do not need operands, like iadd, do not need to adjust the vPC.

3.2 Direct Call Threading

Another portable way to organize an interpreter is to write each virtual instruction as a function and dispatch it via a function pointer. Figure 3.2 shows each virtual instruction body implemented as a C function. While the loaded representation used by the switch interpreter represents the opcode of each virtual instruction as a token, direct call threading represents each virtual opcode as the address of the function that implements it. Thus, by treating the vPC as a function pointer, a direct call-threaded interpreter can execute each instruction in turn.

In the figure, the vPC is a static variable which means the interp function as shown is not re-entrant. Our example aims only to convey the flavor of call threading. In Chapter 6 we will show how a more complex approach to direct call threading can perform about as well as switch threading (described in Section 3.1).

A variation of this technique is described by Ertl [27]. For historical reasons the name

“direct” is given to interpreters which store the address of the virtual instruction bodies in the loaded representation. Presumably this is because they can “directly” obtain the address of a body, rather than using a mapping table (or switch statement) to convert a virtual opcode to the address of the body. However, the name can be confusing as the actual machine instructions used for dispatch are indirect branches. (In this case, an indirect call).

Next we will describe direct threading, perhaps the most well-known high performance dispatch technique.

3.3 Direct Threading

Like in direct call threading, a virtual program is loaded into a direct-threaded interpreter as a list of body addresses and operands. We will refer to the list as the Direct Threading Table, or

DTT, and refer to locations in the DTT as slots.

Interpretation begins by initializing the vPC to the first slot in the DTT, and then jumping to the address stored there. A direct-threaded interpreter does not need a dispatch loop like

Loaded representation of virtual program: iload a  iload b  iconst 1  iadd  iadd  istore c
(virtual operations are identified by the addresses of the functions implementing
each virtual instruction body)

    int *vPC;

    void iload() { .. }
    void iconst(){ .. }
    void iadd()  { .. }
    void istore(){ .. }

    vPC = &dtt[0];
    interp(){
        while(1){
            (*vPC++)();
        }
    }

Figure 3.2: A direct call-threaded interpreter packages each virtual instruction body as a func- tion. The shaded box highlights the dispatch loop showing how virtual instructions are dis- patched through a function pointer. Direct call threading requires the loaded representation of the program to point to the address of the function implementing each virtual instruction.

Java source:   c=a+b+1;
Java bytecode: iload a; iload b; iconst 1; iadd; iadd; istore c
DTT (the vPC points at the first slot):
    &&iload a  &&iload b  &&iconst 1  &&iadd  &&iadd  &&istore c

Virtual instruction bodies:

    interp(){
        iload:  //push var..
                goto *vPC++;
        iconst: //push constant
                goto *vPC++;
        iadd:   //add 2 slots
        istore: //pop,store
    }

Figure 3.3: Direct-threaded Interpreter showing how Java Source code compiled to Java bytecode is loaded into the Direct Threading Table (DTT). The virtual instruction bodies are written in a single C function, each identified by a separate label. The double-ampersand (&&) shown in the DTT is gcc syntax for the address of a label.

    (a) Pentium 4 assembly:              (b) PowerPC assembly:
        mov %eax = (%rx) ; rx is vPC         lwz r2 = 0(rx)
        addl 4,%rx                           mtctr r2
        jmp (%eax)                           addi rx,rx,4
                                             bctr

Figure 3.4: Machine instructions used for direct dispatch. On both platforms assume that some general purpose register, rx, has been dedicated for the vPC. Note that on the PowerPC indirect branches are two part instructions that first load the ctr register and then branch to its contents.

direct call threading or switch dispatch. Instead, as can be seen in Figure 3.3, each body ends

with goto *vPC++, which transfers control to the next instruction. In C, bodies are identified by a label. Common C language extensions permit the address

of this label to be taken, which is used when initializing the DTT. GNU’s gcc, as well as C

compilers produced by Intel, IBM and Sun Microsystems all support the label-as-value and

computed goto extensions, making direct threading quite portable.

Direct threading requires fewer instructions and is faster than direct call threading or switch

dispatch. Assembler for the dispatch sequence is shown in Figure 3.4. When executing the

indirect branch in Figure 3.4(a) the Pentium 4 will speculatively dispatch instructions using a

predicted target address. The PowerPC uses a different strategy for indirect branches, as shown

in Figure 3.4(b). First the target address is loaded into a register, and then a branch is executed

to this register address. The PowerPC stalls until the target address is known1, although other

instructions may be scheduled between the load and the branch (like the addi in Figure 3.4) to reduce or eliminate these stalls.

3.4 The Context Problem

Mispredicted branches pose a serious challenge to modern processors because they threaten to starve the processor of instructions. The problem is that before the destination of the branch is known the pipeline may run dry. To perform at full speed, modern CPUs need to keep their pipelines full by correctly predicting branch targets.

1 In addition, the PPC970 implements a small count cache that remembers the target address for the 32 previously executed indirect branches.

Ertl and Gregg point out that the assumptions underlying the design of indirect branch

predictors are usually wrong for direct-threaded interpreters [28, 29]. In a direct-threaded

interpreter, there is only one indirect jump instruction per virtual instruction. For example,

in the fragment of virtual code illustrated in Figure 2.1, there are two instances of iload

followed by an instance of iconst. The indirect dispatch branch at the end of the iload

body will execute twice. The first time, in the context of the first instance of iload, it will branch back to the entry point of the iload body, whereas in the context of the second iload it will branch to iconst. Thus, the hardware will likely mispredict the second

execution of the dispatch branch.

The performance impact of this can be hard to predict. For instance, if a tight loop in a

virtual program happens to contain a sequence of unique virtual instructions, the BTB may

successfully predict each one. On the other hand, if the sequence contains duplicate virtual

instructions, the BTB may mispredict all of them.

This problem is even worse for direct call threading and switch dispatch. For these techniques there is only one dispatch branch and so all dispatches share the same BTB entry. Direct

call threading will mispredict all dispatches except when the same virtual instruction body is

dispatched multiple times consecutively.

Another perspective is that the destination of the indirect dispatch branch is unpredictable

because its destination is not correlated with the hardware pc. Instead, its destination is correlated to the vPC. We refer to this lack of correlation between the hardware pc and vPC as the

context problem. We choose the term context following its use in context sensitive inlining [39]

because in both cases the context of shared code (in their case methods, in our case virtual

instruction bodies) is important to consider.

3.5 Subroutine Threading

Forth is organized as a collection of callable bodies of code called words. Words can be user-defined or built into the system. Meaningful Forth words are composed of built-in and user-defined words and execute by dispatching their constituent words in turn. A Forth implementation is said to be subroutine-threaded if a word is compiled to a sequence of native call instructions, one call for each constituent word. Since a built-in Forth word is loosely analogous to a callable virtual instruction body, we could apply subroutine threading at load time to a language virtual machine that implements virtual instruction bodies as callable. In such a system the loaded representation of a virtual method would include a sequence of generated native call instructions, one to dispatch each virtual instruction in the virtual method.
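As a rough illustration, a loader for a subroutine-threaded VM targeting x86 could emit one five-byte call per virtual instruction along the following lines (operand handling, virtual branches and code-buffer management are omitted, and all names are ours; the sketch assumes the displacement fits in 32 bits and the buffer is executable):

    #include <stdint.h>

    /* Emit "call rel32" (opcode 0xE8) at 'code', targeting 'body'. */
    static uint8_t *emit_call(uint8_t *code, void (*body)(void)) {
        uint32_t rel = (uint32_t) ((intptr_t) body - ((intptr_t) code + 5));
        *code++ = 0xE8;
        for (int i = 0; i < 4; i++)
            *code++ = (uint8_t) (rel >> (8 * i));    /* little-endian displacement */
        return code;
    }

    /* Translate a straight-line sequence of virtual instructions into native calls. */
    static uint8_t *load_block(uint8_t *code, void (**bodies)(void), int n) {
        for (int i = 0; i < n; i++)
            code = emit_call(code, bodies[i]);       /* one direct call per body */
        *code++ = 0xC3;                              /* ret: end of the generated block */
        return code;
    }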

Curley [21, 20] describes a subroutine-threaded Forth for the 68000 CPU. He improves the resulting code by inlining small opcode bodies, and converts virtual branch opcodes to single native branch instructions. He credits Charles Moore, the inventor of Forth, with discovering these ideas much earlier. Outside of Forth, there is little thorough literature on subroutine threading. In particular, few authors address the problem of where to store virtual instruction operands. In Section 4.2, we document how operands are handled in our implementation of subroutine threading.

The choice of optimal dispatch technique depends on the hardware platform, because dispatch is highly dependent on micro-architectural features. On earlier hardware, call and return were both expensive and hence subroutine threading required two costly branches, versus one in the case of direct threading. Rodriguez [63] presents the trade offs for various dispatch types on several 8 and 16-bit CPUs. For example, he finds direct threading is faster than subroutine threading on a 6809 CPU, because the jsr and ret instructions require extra cycles to push and pop the return address stack. On the other hand, Curley found subroutine threading faster on the 68000 [20]. Due to return branch prediction hardware deployed in modern general purpose processors, like the Pentium and PowerPC, the cost of a return is much lower than the cost of a mispredicted indirect branch. In Chapter 5 we quantify this effect on a few modern

CPUs.

3.6 Optimizing Dispatch

Much of the work on interpreters has focused on how to optimize dispatch. In general dispatch optimizations can be divided into two broad classes: those which refine the dispatch itself, and those which alter the bodies so that they are more efficient or simply require fewer dispatches.

Switch dispatch and direct threading belong to the first class, as does subroutine threading.

Kogge's paper [50] remains a definitive description of many threaded code dispatch techniques. Next, we will discuss superinstruction formation and replication, which are in the second class.

3.6.1 Superinstructions

Superinstructions reduce the number of dispatches. Consider the code to add a constant integer to a variable. This may require loading the variable onto the expression stack, loading the constant, adding, and storing back to the variable. VM designers can instead extend the virtual instruction set with a single superinstruction that performs the work of all four virtual instructions. This technique is limited, however, because the virtual instruction encoding (often one byte per opcode) may allow only a limited number of instructions, and the number of desirable superinstructions grows large in the number of subsumed atomic instructions. Furthermore, the optimal superinstruction set may change based on the workload. One approach uses profile-feedback to select and create the superinstructions statically (when the interpreter is compiled [30]).
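In the style of the switch interpreter of Figure 3.1, such a superinstruction might be written as a single case (a sketch; the opcode name, the operand layout and the locals array are invented for illustration):

    case INC_VAR: {                        /* hypothetical superinstruction */
        int var   = *vPC++;                /* operand 1: local variable index */
        int konst = *vPC++;                /* operand 2: constant to add */
        locals[var] += konst;              /* load, add constant, store back */
        break;                             /* one dispatch instead of four */
    }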

3.6.2 Selective Inlining

Piumarta and Riccardi [61] present selective inlining2. Selective inlining constructs superinstructions when the virtual program is loaded. They are created in a relatively portable way, by

memcpy’ing the compiled code in the bodies, again using GNU C labels-as-values. The idea

is to construct (new) super instruction bodies by concatenating the virtual bodies of the virtual

instructions that make them up. This works only when the code in the virtual bodies is position

independent, meaning that the destination of any relative branch in a body is in that body also.

Typically this excludes bodies making C function calls. Like us, Piumarta and Riccardi applied

selective inlining to OCaml, and reported significant speedup on several micro benchmarks. As

we discuss in Section 5.3, our technique is separate from, but supports and indeed facilitates,

inlining optimizations.

Languages, like Java, that require runtime binding complicate the implementation of selective inlining significantly because at load time little is known about the arguments of many

virtual instructions. When a Java method is first loaded some arguments are left unresolved.

For instance, the argument of an invokevirtual instruction will initially be a string naming the callee. The argument will be re-written the first time the virtual instruction executes to point to a descriptor of the now resolved callee. At the same time, the virtual opcode is rewritten so that subsequently a “quick” form of the virtual instruction body will be dispatched. In

Java, if resolution fails, the instruction throws an exception and is not rewritten. The process of rewriting the arguments, and especially the need to point to a new virtual instruction body, complicates superinstruction formation. Gagnon describes a technique that deals with this additional complexity which he implemented in SableVM [32].

Selective inlining requires that the superinstruction starts at a virtual basic block, and ends at or before the end of the block. Ertl and Gregg’s dynamic superinstructions [29] also use memcpy, but are applied to effect a simple native compilation by inlining bodies for nearly every virtual instruction. They show how to avoid the basic block constraints, so dispatch to interpreter code is only required for virtual branches and unrelocatable bodies. Vitale and Abdelrahman describe a technique called catenation, which (i) patches Sparc native code so that all implementations can be moved, (ii) specializes operands, and (iii) converts virtual branches to native branches, thereby eliminating the virtual program counter [77].

2 About two years earlier, Rossi and Sivalingam [65] described a similar technique which they called “instruction dispatch using memcpy”. However, Piumarta and Riccardi’s independent discovery inspired many other projects to exploit selective inlining.

3.6.3 Replication

Replication (creating multiple copies of the opcode body) decreases the number of contexts in which it is executed, and hence increases the chances of successfully predicting the successor [29]. Replication combined with inlining opcode bodies reduces the number of dispatches, and therefore, the average dispatch overhead [61]. In the extreme, one could create a copy for each instruction, eliminating misprediction entirely. This technique results in significant code growth, which may [77] or may not [29] cause cache misses.

3.7 Chapter Summary

In summary, branch mispredictions caused by the context problem limit the performance of a direct-threaded interpreter on a modern processor. We have described several recent dispatch optimization techniques. Some of the techniques improve performance of each dispatch by reducing the number of contexts in which a body is executed. Others reduce the number of dispatches, possibly to zero.

In the next chapter we will describe a new technique for interpretation that deals with the context problem. Our technique, context threading, performs well compared to the interpretation techniques we have described in this chapter.

Chapter 4

Design and Implementation of Efficient Interpretation

This chapter will describe how to efficiently implement an interpreter that calls its virtual instruction bodies1. This investigation was motivated by the suggestion we made in Chapter 1, namely that such an interpreter will be easier to extend with a JIT than an interpreter that is direct-threaded or uses switch dispatch. Before tackling the design of our mixed-mode system we need to ensure that the interpreter is efficient.

An obvious, but slow, way to use callable virtual instruction bodies is to build a direct call threaded (DCT) interpreter (see Section 3.2 for a detailed description of the technique). In a

DCT interpreter all bodies are dispatched by the same indirect call instruction. The destination of the indirect call is data driven (i.e. by the sequence of virtual instructions that make up the virtual program) and thus impossible for the hardware to predict. As a result, a DCT interpreter suffers a branch misprediction for almost every dispatch.

The main realization driving our approach is that to call each body without misprediction dispatch branches must be direct call instructions. Since these can only be generated when virtual instructions are loaded, we generate them ourselves. At load time, each straight-line

1The material in this chapter and the next is derived from our CGO 2005 paper [10].

section of virtual instructions is translated to a sequence of direct call native instructions, each dispatching the corresponding virtual instruction body. The loaded program is run by jumping to the beginning of the generated sequence of native code, which then emulates the virtual program by calling each virtual instruction body in turn. This approach is very similar to a

Forth compile-time technique called subroutine threading, described in Section 3.5.

Subroutine threading dispatches straight-line sequences of virtual instructions very efficiently because no branch mispredictions occur. The generated direct calls pose no prediction challenge because each has only one explicit destination. The destination of the return ending each body is perfectly predicted by the return branch predictor stack implemented by modern processors. In the next chapter we present data showing that subroutine threading runs the SPECjvm98 suite about 20% faster than direct threading.

Subroutine threading handles straight-line virtual code efficiently, but does nothing to improve the dispatch of virtual branch instructions. We introduce context threading, which, by generating more sophisticated code for virtual branch instructions, eliminates the branch mispredictions caused by the dispatch of virtual branch instructions as well. Context threading improves the performance of the SPECjvm98 suite by about another 5% over subroutine threading.

Generating and dispatching native code obviously makes our implementation of subroutine threading less portable than many dispatch techniques. However, since subroutine threading requires the generation of only one type of machine instruction, a direct call, its hardware dependency is isolated to a few lines of code. Context threading requires much more machine dependent code generation.

In Chapter 6 we will describe another way of handling virtual branches that requires less complex, less machine dependent code generation, but requires additional runtime infrastructure to identify hot runtime interprocedural paths, or traces.

Although direct-threaded interpreters are known to have poor branch prediction properties, they are also known to have a small instruction cache footprint [64]. Since both branch mispredictions and instruction cache misses are major pipeline hazards, we would like to retain the good cache behavior of direct-threaded interpreters while improving the branch behavior. Subroutine threading minimally affects code size. This is in contrast to techniques like selective inlining, described in Section 3.6, which improve branch prediction by replicating entire bodies, in effect trading instruction cache size for better branch prediction. In Chapter 7 we will report data showing that subroutine threading causes very few additional stall cycles caused by instruction cache misses as compared to direct threading.

In Section 4.1 we discuss the challenge of virtual branch instructions in general terms. In Section 4.2 we show how to replace straight-line dispatch with subroutine threading. In Section 4.3 we show how to inline conditional and indirect jumps, and in Section 4.4 we discuss handling virtual calls and returns with native calls and returns.

4.1 Understanding Branches

Before describing our design, we start with two observations. First, a virtual program will typically contain several types of control flow: conditional and unconditional branches, indirect branches, and calls and returns. We must also consider the dispatch of straight-line virtual instructions. For direct-threaded interpreters, straight-line execution is just as expensive as handling virtual branches, since all virtual instructions are dispatched with an indirect branch.

Second, the dynamic execution path of the virtual program will contain patterns (loops, for example) that are similar in nature to the patterns found when executing native code. These control flow patterns originate in the algorithm that the virtual program implements.

As described in Section 2.7.1, modern microprocessors have considerable resources devoted to identifying these patterns in native code, and exploiting them to predict branches.

Direct threading uses only indirect branches for dispatch and, due to the context problem, the patterns that exist in the virtual program are largely hidden from the microprocessor.

The spirit of our approach is to expose these virtual control flow patterns to the hardware, such that the physical execution path matches the virtual execution path.

Figure 4.1: Subroutine Threaded Interpreter, showing how the CTT contains one generated direct call instruction for each virtual instruction and how the first entry in the DTT corresponding to each virtual instruction points to generated code to dispatch it. Callable bodies are shown in the figure as nested functions for illustration only. All maintenance of the vPC must be done in the bodies. Hence even virtual instructions that take no arguments, like iadd, must bump the vPC past the virtual opcode. Virtual instructions, like iload, that take an argument must bump the vPC past the argument as well.

To achieve this goal, we generate dispatch code at load time that enables the different types of hardware prediction resources to predict the different types of virtual control flow transfers. We strive to maintain the property that the virtual program counter is precisely correlated with the physical program counter; in fact, when all our techniques are combined, there is a one-to-one mapping between them at most control flow points.

interp(){

    iload:    //push local var
        asm ("ret");
        goto *vPC++;

    iconst:   //push constant
        asm ("ret");
        goto *vPC++;
}

Figure 4.2: Direct threaded bodies retrofitted as callable routines by inserting inline assembler return instructions. This example is for Pentium 4 and hence ends each body with a ret instruction. The asm statement is an extension to the C language, inline assembler, provided by gcc and many other compilers.

4.2 Handling Linear Dispatch

The dispatch of straight-line virtual instructions is the largest single source of branches when executing an interpreter. Any technique that hopes to improve branch prediction accuracy must address straight-line dispatch.

Rather than eliminate dispatch, we describe an alternative organization for the interpreter in which native call and return instructions are used. This approach is conceptually elegant because the subroutine is a natural unit of abstraction to express the implementation of virtual instruction bodies.

Figure 4.1 illustrates our implementation of subroutine threading, using the same example program as Figure 3.3. In this case, we show the state of the virtual machine after the first virtual instruction has been executed. We add a new structure to the interpreter architecture, called the Context Threading Table (CTT), which contains a sequence of native call instructions. Each native call dispatches the body for its virtual instruction. Although Figure 4.1 shows each body as a nested function, in fact we implement this by ending each non-branching opcode body with a native return instruction2 as shown in Figure 4.2.

The handling of immediate arguments to virtual instructions is perhaps the biggest difference between our implementation of subroutine threading and the approach used by Forth. Forth words pop all their arguments from the expression stack; there is no concept of an immediate operand. Thus, there is no need for a structure like the DTT. The virtual instruction set defined by a Java virtual machine includes many instructions which take immediate operands. Hence, in Java, we need both the direct threading table (DTT) and the CTT. (In Section 3.3 we described how the DTT is used to store immediate operands, and to correctly resolve virtual control transfer instructions.) In direct threading, entries in the DTT point to virtual instruction bodies, whereas in subroutine threading they refer to call sites in the CTT.
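To make the load-time translation concrete, the sketch below emits one native direct call per virtual instruction into the CTT and points the instruction's first DTT slot at that call site. The CodeBuf type and the emit_native_call and body_of helpers are illustrative assumptions, not the routines used in our implementation.

    /* Sketch: generating subroutine-threaded code for one straight-line
     * sequence of virtual instructions at load time. */
    typedef struct { unsigned char *next; } CodeBuf;   /* grows as code is emitted */

    extern void  emit_native_call(CodeBuf *buf, void *target); /* e.g. x86 "call"   */
    extern void *body_of(int opcode);                          /* address of a body */

    void generate_ctt(const int *opcodes, int n, void **dtt_first_slot, CodeBuf *ctt)
    {
        for (int i = 0; i < n; i++) {
            void *call_site = ctt->next;
            emit_native_call(ctt, body_of(opcodes[i])); /* one direct call per body */
            dtt_first_slot[i] = call_site;              /* DTT entry -> CTT call    */
            /* any immediate operands occupy the following DTT slots, exactly as
               they do for direct threading */
        }
    }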

It may seem counterintuitive to improve dispatch performance by calling each body because the latency of a call and return may be greater than an indirect jump. This is not the real issue.

On modern microprocessors the extra cost of the call (if any) is far outweighed by the benefit of eliminating a large source of unpredictable branches, as the data presented in the next chapter will show.

4.3 Handling Virtual Branches

Subroutine threading handles the branches that implement the dispatch of straight-line virtual instructions; however, the control flow of the virtual program is still hidden from the hardware. That is, bodies that perform virtual branches still have no context. There are two problems, the first relating to shared indirect branch prediction resources, and the second relating to a lack of history context for conditional branch prediction resources.

Figure 4.3 introduces a new Java example, this time including a virtual branch. Consider the implementation of ifeq, shaded in the figure. Prediction of the indirect branch at "(a)" may be problematic, because all instances of ifeq instructions in the virtual program share the same indirect branch instruction (and hence have a single prediction context).

2The goto's remain to work around the fact that currently C compilers like gcc do not allow branches in inline assembler. Return is a branch. This is discussed in Section 6.5.

Java source:

boolean notZero(int p1){
    if ( p1 != 0 ){
        return true;
    } else {
        return false;
    }
}

Java bytecode:

boolean notZero(int);
  Code:
    0: iload_1
    1: ifeq 6
    4: iconst_1
    5: ireturn
    6: iconst_0
    7: ireturn

Figure 4.3: Subroutine threading does not address branch instructions. The figure shows, for the notZero example above, the loaded DTT, the generated CTT, and the virtual instruction bodies. Unlike straight-line virtual instructions, virtual branch bodies end with an indirect branch, just like direct threading. (Note: when a body is called, the vPC always points to the slot in the DTT corresponding to its first argument, or, if there are no operands, to the following instruction.)

Figure 4.4: Context threading with branch replication, illustrating the "replicated" indirect branch (a) in the CTT. The fact that the indirect branch corresponds to only one virtual instruction gives it better prediction context. The heavy arrow from (a) to (b) is followed when the virtual branch is taken. Prediction problems remain in the code compiled from the if statement labelled (c).

Figure 4.4 illustrates branch replication, a simple solution to the first of these problems. The idea is to generate an indirect branch instruction in the CTT immediately following the dispatch of the virtual branch. Virtual branch bodies have been modified to end with a native return instruction; the only result of dispatching a branch body is the side effect of setting the vPC to the destination. The result is that each virtual branch instruction has its own indirect branch predictor entry. (Branch replication is an appropriate term because the indirect branch ending the branch body has been copied to potentially many places in the CTT.)
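In terms of generated code, branch replication amounts to the following sketch, reusing the illustrative CodeBuf and emitter helpers from the earlier load-time sketch; emit_jump_indirect_vpc stands in for whatever instruction sequence performs an indirect jump through the vPC.

    /* Sketch: branch replication.  The virtual branch body is still dispatched
     * with a call, but it now merely sets the vPC and returns; the per-site
     * indirect jump emitted below then transfers control, so each virtual
     * branch instruction owns its own indirect branch predictor entry. */
    extern void emit_jump_indirect_vpc(CodeBuf *buf);   /* e.g. x86 "jmp (vPC)" */

    static void generate_replicated_branch(CodeBuf *ctt, int branch_opcode)
    {
        emit_native_call(ctt, body_of(branch_opcode));  /* body sets vPC, then ret */
        emit_jump_indirect_vpc(ctt);                    /* replicated dispatch     */
    }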

Branch replication is attractive because it is simple and produces the desired context with a minimum of new generated instructions. However, it has a number of drawbacks. First, for branching opcodes, we execute three hardware control transfers (a call to the body, a return, and the replicated indirect branch), which is an unnecessary overhead. Second, we still use the overly general indirect branch instruction, even in cases like goto where we would prefer a simpler direct native branch. Third, by only replicating the dispatch part of the virtual instruction, we do not take full advantage of the conditional branch predictor resources provided by the hardware. This is because the if statement in the body, marked (c) in the figure, is shared by all instances of ifeq. Due to these limitations, we only use branch replication for indirect virtual branches and exceptions3.

Branch inlining, illustrated by Figure 4.5, is a technique that generates code for the bodies of virtual branch instructions into the CTT. In the figure we show how our system inlines the ifeq instruction. The generated native code, shaded in the figure, implements the same if-then-else logic as the original direct-threaded virtual instruction body. The inlined conditional branch instruction (jne, "(a)" in the figure) is thus fully exposed to the Pentium's conditional branch prediction hardware.

On the Pentium, branch inlining reduces pressure on the branch target buffer, or BTB, since conditional branches use the conditional branch predictors instead. The virtual conditional branches now appear as real conditional branches to the hardware. The dispatch of the body has been entirely eliminated.
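A sketch of the emitted shape follows, again with illustrative emitter helpers (reusing the CodeBuf type from the earlier sketch); the taken-path target is the CTT code for the branch destination, and both paths update the vPC just as the original body would.

    /* Sketch: branch inlining for Java's ifeq.  The compare and the native
     * conditional branch are emitted directly into the CTT, so each virtual
     * ifeq owns one entry in the conditional branch predictor. */
    extern void  emit_pop_and_compare_zero(CodeBuf *buf); /* pop operand, cmp 0   */
    extern void *emit_branch_if_nonzero(CodeBuf *buf);    /* returns patch site   */
    extern void  emit_set_vpc_from_dtt(CodeBuf *buf);     /* vPC = *vPC (taken)   */
    extern void  emit_bump_vpc(CodeBuf *buf, int slots);  /* vPC += slots         */
    extern void  emit_direct_jump(CodeBuf *buf, void *target);
    extern void  backpatch(void *branch_site, void *target);

    static void inline_ifeq(CodeBuf *ctt, void *taken_target_in_ctt)
    {
        emit_pop_and_compare_zero(ctt);
        void *skip = emit_branch_if_nonzero(ctt);   /* value != 0: branch not taken */
        emit_set_vpc_from_dtt(ctt);                 /* taken path: vPC = *vPC       */
        emit_direct_jump(ctt, taken_target_in_ctt); /* jump to CTT code for target  */
        backpatch(skip, ctt->next);                 /* not-taken path starts here   */
        emit_bump_vpc(ctt, 1);                      /* skip the operand slot        */
    }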

The primary cost of branch inlining is increased code size, but this is modest because, at least for languages like Java and OCaml, virtual branch instructions are simple and have small bodies. For instance, on the Pentium 4, most branch instructions can be inlined with no more than 10 words, at worst a few additional i-cache lines.

The obvious challenge of branch inlining, apart from the hard labor required to implement it, is that the generated code is not portable and assumes detailed knowledge of the virtual bodies it must interoperate with.

3OCaml defines explicit exception virtual instructions.

Figure 4.5: Context-threaded VM Interpreter: Branch Inlining. The dashed arrow (a) illustrates the inlined conditional branch instruction, now fully exposed to the branch prediction hardware, and the heavy arrow (b) illustrates a direct branch implementing the not-taken path. The generated code (shaded) assumes the vPC is in register esi and the Java expression stack pointer is in register edi. (In reality, we dedicate registers in the way shown for SableVM on the PowerPC only. On the Pentium 4, due to lack of registers, the vPC is actually stored on the stack.)

4.4 Handling Virtual Call and Return

The only significant source of control transfers that remains in the virtual program is virtual method invocation and return. For successful branch prediction, the real problem is not the virtual call, which has only a few possible destinations, but rather the virtual return, which potentially has many destinations, one for each callsite of the method. As noted previously, the hardware already has an elegant solution to this problem in the form of the return address stack. We need only to deploy this resource to predict virtual returns.

We describe our solution with reference to Figure 4.6. The virtual method invocation body, Java's invokestatic in the figure, must transfer control to the first virtual instruction of the callee. Our goal is to generate dispatch code so that the corresponding virtual return instruction makes use of the hardware's return branch predictors.

Figure 4.6: Context Threading Apply/Return Inlining on Pentium. The generated code calls the invokestatic virtual instruction body but jumps (the instruction at (c) is a jmp) to the return body.

We begin at the virtual call instruction (just before label "(a)" in the figure). The body of the invokestatic creates a new frame for the callee and then sets the vPC to the entry point of the callee ("(b)" in the figure) before returning back to the CTT. Similar to branch replication, we insert a new native call indirect instruction following "(a)" in the CTT to transfer control to the start of the callee, shown as a solid arrow from "(a)" to "(b)" in the figure. The call indirect has the desired side effect of pushing CTT location (a) onto the hardware's return address stack. The first instruction of the callee is then dispatched. At the end of the callee, we modify the virtual return instruction as follows. In the CTT, at "(c)", we emit a native direct jump, an x86 jmp in the figure, to dispatch the body of the virtual return. This direct branch avoids perturbing the return address stack. The body of the virtual return now returns all the way back to the instruction following the original virtual call. This is shown as the dotted arrow from "(d)" to the point following "(a)". We refer to this technique as apply/return inlining4.

4"apply" is the name of the (generalized) function call opcode in OCaml where we first implemented the technique.

With this final step, we have a complete technique that aligns all virtual program control flow with the corresponding native flow. There are, however, some practical challenges to implementing our design for apply/return inlining. First, one must take care to match the hardware stack against the virtual program stack. For instance, in OCaml, exceptions unwind the virtual machine stack; the hardware stack must be unwound in a corresponding manner. Second, some runtime environments are extremely sensitive to hardware stack manipulations, since they use or modify the machine stack pointer for their own purposes. In such cases, it is possible to create a separate stack structure and swap between the two at virtual invocation and return points. This approach would introduce significant overhead, and is only justified if apply/return inlining provides a substantial performance benefit.
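To summarize the generated-code shape, the sketch below shows what a code generator might emit at a virtual call site and at a virtual return site under apply/return inlining. It reuses the illustrative CodeBuf and emitter helpers from the earlier sketches; OP_INVOKESTATIC and OP_RETURN are stand-ins for the relevant opcodes.

    /* Sketch: apply/return inlining.  At the call site we emit a call to the
     * invoke body (which builds the frame and sets vPC to the callee's entry)
     * followed by an indirect call through the vPC; the indirect call pushes
     * the CTT continuation onto the hardware return address stack.  At the
     * return site we emit a direct jmp to the return body, whose native "ret"
     * then pops that entry and resumes in the caller's CTT. */
    enum { OP_INVOKESTATIC, OP_RETURN };              /* stand-ins for real opcodes */

    extern void emit_call_indirect_vpc(CodeBuf *buf); /* e.g. x86 "call *(vPC)"     */

    static void generate_virtual_call(CodeBuf *ctt)
    {
        emit_native_call(ctt, body_of(OP_INVOKESTATIC)); /* build frame, set vPC  */
        emit_call_indirect_vpc(ctt);                     /* pushes return address */
    }

    static void generate_virtual_return(CodeBuf *ctt)
    {
        emit_direct_jump(ctt, body_of(OP_RETURN));       /* jmp, not call: leaves  */
    }                                                    /* the RAS undisturbed    */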

4.5 Chapter Summary

The code generation described in this chapter is carried out when each virtual method is loaded.

The idea is to generate relatively simple code that exposes the dispatch branch instructions to the hardware branch predictors of the processor.

In the next chapter we present data showing that our approach is effective in the sense that branch mispredictions are reduced and performance is improved. Subroutine threading is by far more effective than branch replication, branch inlining, apply/return inlining, or tiny inlining, especially when its relative simplicity and small amount of machine dependent code are taken into account. Branch inlining is the most complicated and least portable.

Our implementation of context threading has at least two potential problems. First, effort is expended at load time for regions of code that may never execute. This could penalize performance when large amounts of cold code are present. Second, it is awkward to interpose profiling instrumentation around the virtual instruction bodies dispatched from the CTT. The difficulty stems from the fact that subroutine threading, like direct threading, does not need a dispatch loop. This means that calls to profiling code must be generated in amongst the generated dispatch code in the CTT. Removing instrumentation once it is no longer needed requires the generated code to be rewritten or regenerated.

In Chapter 6 we describe a different approach to efficient interpretation that addresses these two problems. There, we generate simple code for hot interprocedural paths, or traces. This allows us to exploit the efficacy and simplicity of subroutine threading for straight-line code while eliminating the mispredictions caused by virtual branch instructions.

Chapter 5

Evaluation of Context Threading

In this chapter we evaluate context threading by comparing its performance to direct threading and direct-threaded selective inlining. We evaluate the impact of each of our techniques on Pentium 4 and PowerPC processors by measuring the performance of a modified version of SableVM, a Java virtual machine, and ocamlrun, an OCaml interpreter. We explore the differences between context threading and SableVM's selective inlining further by measuring a simple extension of context threading we call tiny inlining. Finally, we illustrate the range of improvement possible with subroutine threading by comparing the performance of subroutine-threaded Tcl and subroutine-threaded OCaml to direct threading on Sparc.

The overall results show that dispatching virtual instructions by calling virtual instruction bodies is very effective for Java and OCaml on Pentium 4 and PowerPC platforms. In fact, subroutine threading outperforms direct threading by a healthy margin of about 20%. Context threading is almost as fast as selective inlining as implemented by SableVM. Since these are dispatch optimizations, they offer performance benefits depending on the proportion of dispatch to real work. Thus, when a Tcl interpreter is modified to be subroutine-threaded, performance relative to direct threading increases by only about 5%. Subroutine-threaded OCaml is 13% faster than direct threading on the same Sparc processor.

We begin by describing our experimental setup in Section 5.1. We investigate how effectively our techniques address pipeline branch hazards in Section 5.2.1, and the overall effect on execution time in Section 5.2.2. Section 5.3 demonstrates that context threading is complementary to inlining and results in performance comparable to SableVM's implementation of selective inlining. Finally, Section 5.4 discusses a few of the limitations of context threading by studying the performance of Vitale's subroutine-threaded Tcl [78, Figure 1] and OCaml on Sparc.

5.1 Experimental Set-up

We evaluate our techniques by modifying interpreters for Java and OCaml to run on the Pentium 4, PowerPC 7410 and PPC970. The Pentium and PowerPC are processors used by PC and Macintosh workstations and many types of servers. The Pentium and PowerPC provide different architectures for indirect branches (Figure 3.4 illustrates the differences), so we ensure our techniques work for both approaches.

Our experimental approach is to evaluate performance by measuring elapsed time. This is simple to measure and always relevant. We guard against intermittent events polluting any single run by always averaging across three executions of each benchmark.

We report pipeline hazards using the performance measurement counters of each processor. These vary widely not only between the Pentium and the PowerPC but also within each family. This is a challenge on the PowerPC, where IBM's modern PowerPC 970 is a desirable processor to measure, but has no performance counters for stalls caused by indirect branches. Thus, we use an older processor model, the PowerPC 7410, because it implements performance counters that the PowerPC 970 does not.

5.1.1 Virtual Machines and Benchmarks

We choose two virtual machines for our experiments. OCaml is a simple, very cleanly implemented interpreter. However, there is only one implementation to measure and only a few relatively small benchmark programs are available. For this reason we also modified SableVM, a Java Virtual Machine.

Table 5.1: Description of OCaml benchmarks. Raw elapsed time and branch hazard data for direct-threaded runs.

                                           ------- Pentium 4 -------   ----- PowerPC 7410 -----   PPC970    Lines of
                                           Time         Branch         Time         Branch        Elapsed   Source
                                                        Mispredicts                 Stalls        Time
Benchmark   Description                    (TSC*10^8)   (MPT*10^6)     (Cyc*10^8)   (Cyc*10^6)    (sec)     Code
boyer       Boyer theorem prover           3.34         7.21           1.8          43.9          0.18      903
fft         Fast Fourier transform         31.9         52.0           18.1         506           1.43      187
fib         Fibonacci by recursion         2.12         3.03           2.0          64.7          0.19      23
genlex      A lexer generator              1.90         3.62           1.6          27.1          0.11      2682
kb          A knowledge base program       17.9         42.9           9.5          283           0.96      611
nucleic     Nucleic acid's structure       14.3         19.9           95.2         2660          6.24      3231
quicksort   Quicksort                      9.94         20.1           7.2          264           0.70      91
sieve       Sieve of Eratosthenes          3.04         1.90           2.7          39.0          0.16      55
soli        A classic peg game             7.00         16.2           4.0          158           0.47      110
takc        Takeuchi function (curried)    4.25         7.66           3.3          114           0.33      22
taku        Takeuchi function (tuplified)  7.24         15.7           5.1          183           0.52      21

OCaml We chose OCaml as representative of a class of efficient, stack-based interpreters that use direct-threaded dispatch. The bytecode bodies of the interpreter, in C, have been hand-tuned extensively, to the point of using gcc inline assembler extensions to hand-allocate important variables to dedicated registers. The implementation of the OCaml interpreter is clean and easy to modify [14, 1].

OCaml Benchmarks The benchmarks in Table 5.1 make up the standard OCaml benchmark suite1. Boyer, kb, quicksort and sieve do mostly integer processing, while nucleic and fft are mostly floating point benchmarks. Soli is an exhaustive search algorithm that solves a solitaire peg game. Fib, taku, and takc are tiny, highly-recursive programs which calculate integer values.

Fib, taku, and takc are unusual because they contain very few distinct virtual instructions, and in some cases use only one instance of each. This has two important consequences. First, the indirect branch in direct-threaded dispatch is relatively predictable. Second, even minor changes can have dramatic effects (both positive and negative) because so few instructions contribute to the behavior.

1ftp://ftp.inria.fr/INRIA/Projects/cristal/Xavier.Leroy/benchmarks/objcaml.tar.gz

Table 5.2: Description of SPECjvm98 Java benchmarks. Raw elapsed time and branch hazard data for direct-threaded runs.

                                                       ------- Pentium 4 -------   ------ PowerPC 7410 -----   PPC970
                                                       Time          Branch        Time           Branch       Elapsed
                                                                     Mispredicts                  Stalls       Time
Benchmark   Description                                (TSC*10^11)   (MPT*10^9)    (Cyc*10^10)    (Cyc*10^8)   (sec)
compress    Modified Lempel-Ziv compression            4.48          7.13          17.0           493          127.7
db          Performs multiple database functions       1.96          2.05          7.5            240          65.1
jack        A Java parser generator                    0.71          0.65          2.7            67           18.9
javac       The Java compiler from the JDK 1.0.2       1.59          1.43          6.1            160          44.7
jess        Java Expert Shell System                   1.04          1.12          4.2            110          29.8
mpegaudio   Decompresses MPEG Layer-3 audio files      3.72          5.70          14.0           460          106.0
mtrt        Two-thread variant of raytrace             1.06          1.04          5.3            120          26.8
raytrace    A raytracer rendering                      1.00          1.03          5.2            120          31.2
scimark     Performs FFT, SOR and LU, 'large'          4.40          6.32          18.0           690          118.1
soot        Java bytecode to bytecode optimizer        1.09          1.05          2.7            71           35.5

SableVM SableVM is a Java Virtual Machine built for quick interpretation. SableVM implements multiple dispatch mechanisms, including switch, direct threading, and selective inlining (which SableVM calls inline threading [32]). The support for multiple dispatch mechanisms facilitated our work to add context threading and allows for comparisons against other techniques, like inlining, that also address branch mispredictions. Finally, as part of its own inlining infrastructure, SableVM builds tables describing which virtual instruction bodies can be safely inlined using memcpy. This made our tiny inlining implementation very simple.

Java Benchmarks SableVM experiments were run on the complete SPECjvm98 [67] suite (compress, db, mpegaudio, raytrace, mtrt, jack, jess and javac), one large object-oriented application (soot [76]) and one scientific application (scimark [62]). Table 5.2 summarizes the key characteristics of these benchmarks.

5.1.2 Performance and Pipeline Hazard Measurements

On both platforms we measure elapsed time averaged over three runs to mitigate noise caused by intermittent system events. We necessarily use platform and operating system dependent methods to estimate pipeline hazards.

Pentium 4 Measurements The Pentium 4 (P4) processor speculatively dispatches instructions based on branch predictions. As discussed in Section 3.4, the indirect branches used for direct-threaded dispatch are often mispredicted due to the lack of context. Ideally, we could measure the cycles the processor stalls due to mispredictions of these branches, but the P4 does not provide a performance counter for this purpose. Instead, we count the number of mispredicted taken branches (MPT) to measure how our techniques affect branch prediction. We measure time on the P4 with the cycle-accurate time stamp counter (TSC) register. We count both MPT and TSC events using our own Linux kernel module, which collects complete data for the multithreaded Java benchmarks2.

PowerPC Measurements We need to characterize the cost of branches differently on the PowerPC than on the P4. On the PPC architecture split branches are used (as shown in Figure 3.4(b)) and the PPC stalls until the branch destination is known3. Hence, we would like to count the number of cycles stalled due to link and count register dependencies. Unfortunately, PPC970 chips do not provide a performance counter for this purpose; however, the older PPC7410 CPU has a counter (counter 15, "stall on LR/CTR dependency") that provides exactly the information we need [55]. On the PPC7410, we also use the hardware counters to obtain overall execution times in terms of clock cycles. We expect that the branch stall penalty should be larger on more deeply-pipelined CPUs like the PPC970; however, we cannot directly verify this. Instead, we report only elapsed execution time for the PPC970.

2MPT events are counted with performance counter 8 by setting the P4 CCCR to 0x0003b000 and the ESCR to value 0xc001004 [47].

3In addition, the PPC970 implements a small count cache that remembers the branch destination for the 32 previously executed indirect branches.

5.2 Interpreting the Data

Table 5.3: (a) Guide to technique descriptions.

Technique                      Key          Description
Subroutine Threading           SUB          Section 4.2
Branch Inlining                SUB+BI       Section 4.3
Context Threading              SUB+BI+AR    Section 4.4
Tiny Inlining                  TINY         Section 5.3
Selective Inlining (SableVM)   SABLEVM      Section 3.6

(b) Guide to performance data figures.

                     P4/PPC7410                       PPC970
Interpreter          Hazards         Performance      time
OCaml                Figure 5.1      Figure 5.3       Figure 5.5 (a)
Java (SableVM)       Figure 5.2      Figure 5.4       Figure 5.5 (b)

In presenting our results, we normalize all experiments to the direct threading case, since it is considered a state-of-the-art dispatch technique. (For instance, the source distribution of OCaml configures for direct threading.) We give the absolute execution times and branch hazard statistics for each benchmark and platform using direct threading in Tables 5.1 and 5.2. Bar graphs in the following sections show the contributions of each component of our technique: subroutine threading only (labeled SUB); subroutine threading plus branch inlining and branch replication for exceptions and indirect branches (labeled SUB+BI); and our complete context threading implementation which includes apply/return inlining (labeled SUB+BI+AR). We include bars for selective inlining in SableVM (labeled SABLEVM) and our own simple inlining technique (labeled TINY) to facilitate comparisons, although inlining results are not discussed until Section 5.3. We do not show a bar for direct threading because it would, by definition, have height 1.0. Table 5.3 provides a key to the acronyms used as labels in the following graphs.

Figure 5.1: OCaml pipeline hazards relative to direct threading. (a) Pentium 4 mispredicted taken branches (MPT); (b) PPC7410 LR/CTR stall cycles. Bars show SUB, SUB+BI, SUB+BI+AR, and TINY for each OCaml benchmark; the final cluster is the geometric mean.

Figure 5.2: Java pipeline hazards relative to direct threading. (a) Pentium 4 mispredicted taken branches (MPT); (b) PPC7410 LR/CTR stall cycles. Bars show SABLEVM, SUB, SUB+BI, SUB+BI+AR, and TINY for each Java benchmark; the final cluster is the geometric mean.

5.2.1 Effect on Pipeline Branch Hazards

Context threading was designed to align virtual program state with physical machine state to improve branch prediction and reduce pipeline branch hazards. We begin our evaluation by examining how well we have met this goal.

Figure 5.1 reports the extent to which context threading reduces pipeline branch hazards for the OCaml benchmarks, while Figure 5.2 reports these results for the Java benchmarks on SableVM. At the top of both figures, the graph labeled (a) presents the results on the P4, where we count mispredicted taken branches (MPT). At the bottom of the figures, the graphs labeled (b) present the effect on LR/CTR stall cycles on the PPC7410. The last cluster of each bar graph reports the geometric mean across all benchmarks.

Context threading eliminates most of the mispredicted taken branches (MPT) on the Pentium 4 and LR/CTR stall cycles on the PPC7410, with similar overall effects for both interpreters. Examining Figures 5.1 and 5.2 reveals that subroutine threading has the single greatest impact, reducing MPT by an average of 75% for OCaml and 85% for SableVM on the P4, and reducing LR/CTR stalls by 60% and 75% on average for the PPC7410. This result matches our expectations because subroutine threading addresses the largest single source of unpredictable branches: the dispatch used for straight-line sequences of virtual instructions. Branch inlining has the next largest effect, since conditional branches are the most significant remaining pipeline hazard after applying subroutine threading. On the P4, branch inlining cuts the remaining MPTs by about 60%. On the PPC7410, branch inlining has a smaller, yet still significant effect, eliminating about 25% of the remaining LR/CTR stall cycles. A notable exception to the MPT trend occurs for the OCaml micro-benchmarks Fib, takc and taku. These tiny recursive micro-benchmarks contain few duplicate virtual instructions, and so the Pentium's branch target buffer (BTB) mostly predicts correctly and inlining the conditional branches cannot help.

Interestingly, the same three OCaml micro-benchmarks Fib, takc and taku that challenge branch inlining on the P4 also reap the greatest benefit from apply/return inlining, as shown in Figure 5.1(a). (This appears as the significant improvement of SUB+BI+AR relative to SUB+BI.) Due to the recursive nature of these benchmarks, their performance is dominated by the behavior of virtual calls and returns. Thus, we expect predicting the returns to have a significant impact.

For SableVM on the P4, however, our implementation of apply/return inlining is restricted by the fact that gcc-generated code touches the processor's esp register. Rather than implement a complicated stack switching technique, as discussed in Section 4.4, we allow the virtual and machine stacks to become misaligned and then manipulate the esp directly. This reduces the performance of our apply/return inlining implementation, presumably by somehow impeding the operation of the return address stack predictor. This can be seen in Figure 5.2(a), where adding apply/return inlining increases mispredicted branches. On the PPC7410, the effect of apply/return inlining on LR/CTR stalls is very small for SableVM.

Having shown that our techniques can significantly reduce pipeline branch hazards, we now examine the impact of these reductions on overall execution time.

5.2.2 Performance

Context threading improves branch prediction, resulting in better use of the pipelines on both the P4 and the PPC. However, using a native call/return pair for each dispatch increases instruction overhead. In this section, we examine the net result of these two effects on overall execution time. As before, all data is reported relative to direct threading.

Figures 5.3 and 5.4 show results for the OCaml and SableVM benchmarks, respectively. They are organized in the same way as the previous figures, with P4 results at the top, labeled (a), and PPC7410 results at the bottom, labeled (b). Figure 5.5 shows the performance of OCaml and SableVM on the PPC970 CPU. The geometric means (rightmost cluster) in Figures 5.3, 5.4 and 5.5 show that context threading significantly outperforms direct threading on both virtual machines and on all three architectures.

Figure 5.3: OCaml elapsed time relative to direct threading. (a) Pentium 4, TSC relative to direct; (b) PPC7410, cycles relative to direct. Bars show SUB, SUB+BI, SUB+BI+AR, and TINY for each OCaml benchmark.

Figure 5.4: SableVM elapsed time relative to direct threading. (a) Pentium 4, TSC relative to direct; (b) PPC7410, cycles relative to direct. Bars show SABLEVM, SUB, SUB+BI, SUB+BI+AR, and TINY for each Java benchmark.

Figure 5.5: PPC970 elapsed time relative to direct threading. (a) OCaml elapsed (real) seconds; (b) SableVM elapsed (real) seconds.

The geometric mean execution time of the OCaml VM is about 19% lower for context threading than direct threading on P4, 9% lower on PPC7410, and 39% lower on the PPC970. For SableVM, SUB+BI+AR, compared with direct threading, runs about 17% faster on the PPC7410 and 26% faster on both the P4 and PPC970.

Although we cannot measure the cost of LR/CTR stalls on the PPC970, the greater reductions in execution time are consistent with its more deeply-pipelined design (23 stages vs. 7 for the PPC7410).

Across interpreters and architectures, the effect of our techniques is clear. Subroutine threading has the single largest impact on elapsed time. Branch inlining has the next largest impact, eliminating an additional 3–7% of the elapsed time. In general, the reductions in execution time track the reductions in branch hazards seen in Figures 5.1 and 5.2. The longer path length of our dispatch technique is most evident in the OCaml benchmarks fib and takc on the P4, where the improvements in branch prediction (relative to direct threading) are minor.

These tiny benchmarks compile into unique instances of a few virtual instructions. This means that there is little or no sharing of BTB slots between instances and hence fewer mispredictions.

The effect of apply/return inlining on execution time is minimal overall, changing the geometric mean by only ±1% with no discernible pattern. Given the limited performance benefit and added complexity, a general deployment of apply/return inlining does not seem worthwhile. Ideally, one would like to detect heavy recursion automatically, and only perform apply/return inlining when needed. We conclude that, for general usage, subroutine threading plus branch inlining provides the best trade-off.

We now demonstrate that context-threaded dispatch is complementary to inlining techniques.

5.3 Inlining

Inlining techniques address the context problem by replicating bytecode bodies and removing dispatch code. This reduces both instructions executed and pipeline hazards. In this section we show that, although both selective inlining and our context threading technique reduce pipeline hazards, context threading is slower due to the overhead of its extra dispatch instructions. We investigate this issue by comparing our own tiny inlining technique with selective inlining.

In Figures 5.2, 5.4 and 5.5(b), the bar labeled SABLEVM shows our measurements of Gagnon's selective inlining implementation for SableVM [32]. From these figures, we see that selective inlining reduces both MPT and LR/CTR stalls significantly as compared to direct threading, but it is not as effective in this regard as subroutine threading alone. The larger reductions in pipeline hazards for context threading, however, do not necessarily translate into better performance over selective inlining. Figure 5.4(a) illustrates that SableVM's selective inlining beats context threading on the P4 by roughly 5%, whereas on the PPC7410 and the PPC970, both techniques have roughly the same execution time, as shown in Figure 5.4(b) and Figure 5.5(b), respectively. These results show that reducing pipeline hazards caused by dispatch is not sufficient to match the performance of selective inlining. By eliminating some dispatch code, selective inlining can do the same real work with fewer instructions than context threading.

Context threading is a dispatch technique, and can be easily combined with an inlining strategy. To investigate the impact of dispatch instruction overhead and to demonstrate that context threading is complementary to inlining, we implemented Tiny Inlining, a simple heuristic that inlines all bodies with a length less than four times the length of our dispatch code. This eliminates the dispatch overhead for the smallest bodies and, as calls in the CTT are replaced with comparably-sized bodies, tiny inlining ensures that the total code growth is low. In fact, the smallest inlined OCaml bodies on P4 were smaller than the length of a relative call instruction (five bytes). Table 5.4 summarizes the effect of tiny inlining. On the P4, we come within 1% of SableVM's selective inlining implementation. On PowerPC, we outperform SableVM by 7.8% for the PPC7410 and 4.8% for the PPC970.

Table 5.4: Detailed comparison of selective inlining (SABLEVM) vs. SUB+BI+AR and TINY. Numbers are elapsed time relative to direct threading. ∆context is the difference between selective inlining and SUB+BI+AR. ∆tiny is the difference between selective inlining and TINY (the combination of context threading and tiny inlining).

Arch      Context        Selective    Tiny     ∆context                 ∆tiny
          (SUB+BI+AR)    (SABLEVM)    (TINY)   (SABLEVM - SUB+BI+AR)    (SABLEVM - TINY)
P4        0.762          0.721        0.731    -0.041                   -0.010
PPC7410   0.863          0.914        0.839     0.051                    0.075
PPC970    0.753          0.739        0.691    -0.014                    0.048
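The heuristic itself is simple enough to sketch. In the sketch below, the helper names (body_length, body_is_relocatable, emit_copy) and the DISPATCH_LEN constant are illustrative assumptions rather than our actual routines, the relocatability check stands in for SableVM's tables of safely inlinable bodies, and CodeBuf, emit_native_call, and body_of are the illustrative helpers from Chapter 4's sketches.

    /* Sketch: the "tiny" inlining decision.  A body is copied into the CTT in
     * place of its dispatch call when it is shorter than four times the length
     * of the dispatch code; otherwise the usual direct call is emitted. */
    #include <stddef.h>

    enum { DISPATCH_LEN = 5 };          /* e.g. a five-byte x86 relative call */

    extern size_t body_length(int opcode);
    extern int    body_is_relocatable(int opcode);
    extern void   emit_copy(CodeBuf *buf, void *body, size_t len);

    static void dispatch_or_inline(CodeBuf *ctt, int opcode)
    {
        size_t len = body_length(opcode);
        if (body_is_relocatable(opcode) && len < 4 * DISPATCH_LEN)
            emit_copy(ctt, body_of(opcode), len);   /* inline the tiny body     */
        else
            emit_native_call(ctt, body_of(opcode)); /* fall back on subroutine  */
    }                                               /* threading dispatch       */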

5.4 Limitations of Context Threading

We discuss two limitations of our technique. The first describes how our technique, like most dispatch optimizations, can have only limited impact on virtual machines that implement large virtual instructions. The second issue describes the difficulty we experienced adding profiling to our implementation of context threading.

5.4.1 Heavyweight Virtual Instruction Bodies

The techniques described in this chapter address dispatch and hence have greater impact as the frequency of dispatch increases relative to the real work carried out. A key design decision for any virtual machine is the specific mix of virtual instructions. A computation may be carried out by many lightweight virtual instructions or fewer heavyweight ones. Figure 5.6 shows that a Tcl interpreter typically executes an order of magnitude more cycles per dispatched virtual instruction than OCaml. Another perspective is that OCaml executes proportionately more dispatch because its work is carved up into smaller virtual instructions. In the figure, we see that many OCaml benchmarks average only tens of cycles per dispatched instruction.

Thus, the time OCaml spends executing a typical body is of the same order of magnitude as the branch misprediction penalty of a modern CPU. On the other hand, most Tcl benchmarks execute hundreds of cycles per dispatch, many times the misprediction penalty.

Figure 5.6: Reproduction of [78, Figure 1] showing cycles run per virtual instruction dispatched (log scale) for various Tcl and OCaml benchmarks.

Thus, we expect subroutine threading to speed up Tcl much less than OCaml. Figure 5.7 reports the performance of subroutine-threaded OCaml on an UltraSPARC III4. As shown in the figure, subroutine threading speeds up OCaml on the UltraSPARC by about 13%. In contrast, the geometric mean of 500 Tcl benchmarks speeds up by only 5.4% [78].

Another issue raised by the Tcl implementation was that about 12% of the 500-program benchmark suite slowed down. Very few of these dispatched more than 10,000 virtual instructions. Most were tiny programs that executed as little as a few dozen dispatches. This suggests that for programs that execute only a small number of virtual instructions, the load time overhead of generating code in the CTT may be too high.

4We leveraged Vitale's Tcl infrastructure, which only runs on Sparc, to implement subroutine threading. Thus, to compare to Tcl we ported our subroutine-threaded OCaml to Sparc also.

Figure 5.7: Elapsed time of subroutine threading relative to direct threading for OCaml on UltraSPARC III.

5.4.2 Context Threading and Profiling

Our original scheme for extending our context-threaded interpreter with a JIT was to detect hot paths of the virtual program by generating calls to profiling instrumentation amongst the dispatch code in the CTT. We persevered for some time with this approach, and successfully implemented a system that identified traces [81]. The resulting implementation, though efficient, was fragile and required the generation of more machine-specific code for profiling than we considered desirable. In the next chapter we describe a much more convenient approach based on dispatch loops.

5.4.3 Development using SableVM

SableVM is a very well engineered interpreter. For instance, SableVM's infrastructure for identifying un-relocatable virtual instruction bodies made implementing our TINY inlining experiment simple. However, its heavy use of m4 and cpp macros, used to implement multiple dispatch mechanisms and achieve a high degree of portability, makes debugging awkward. In addition, our efforts to add profiling instrumentation to context threading made many changes that we subsequently realized were ill-advised. Hence, we decided to start from clean sources.

For the next stage of our experiment, our trace-based JIT, we decided to abandon SableVM in favour of JamVM5.

5.5 Chapter Summary

Our experimentation with subroutine threading has established that calling virtual instruction bodies is an efficient way of dispatching virtual instructions. Subroutine threading is particularly effective at eliminating branch mispredictions caused by the dispatch of straight-line regions of virtual instructions. Branch inlining, though labor intensive to implement, eliminates the branch mispredictions caused by most virtual branches. Once the pipelines are full, the latency of dispatch instructions becomes significant. A suitable technique for addressing this overhead is inlining, and we have shown that context threading is compatible with our "tiny" inlining heuristic. With this simple approach, context threading achieves performance roughly equivalent to, and occasionally better than, selective inlining.

Our experiments also resulted in some warnings. First, our attempts to finesse the implementation of virtual branch instructions using branch replication (Section 4.3) and apply/return inlining (Section 4.4) were not successful. It was only when we resorted to the much less portable branch inlining that we improved the performance of virtual branches significantly.

Second, the slowdown observed amongst a few Tcl benchmarks (which dispatched very few virtual instructions) raises the concern that even the load time overhead of subroutine threading may be too high. This suggests that we should investigate lazy approaches so we can delay generating code until it is needed.

These results inform our design of a gradually extensible interpreter, to be presented next. We suggested, in Chapter 1, that a JIT compiler would be simpler to build if its code generator has the option of falling back on calling virtual instruction bodies. The resulting fall-back code is very similar to the code generated at load time by a subroutine-threaded interpreter. In this chapter we have seen that linear sequences of virtual instructions can be efficiently dispatched using subroutine threading. This suggests that there would be little or no performance penalty, relative to interpretation, when a JIT falls back on calling sequences of virtual instructions that it chooses not to compile.

5We planned to build our prototype JIT for PPC only. Thus, the organization of SableVM, so highly geared towards portability, was more infrastructure than we needed.

We have shown that dispatching virtual branch instructions efficiently can gain 5% or more in performance. We have shown that branch inlining, though not portable, is an effective way of reducing branch mispredictions. However, our experience has been that branch inlining is time-consuming to implement. In the next chapter we will show that identifying hot interprocedural paths, or traces, at runtime enables a much simpler way of dealing with virtual branches that performs as well as branch inlining.

Chapter 6

Design and Implementation of YETI

This chapter describes our gradually extensible trace interpreter, or Yeti for short. The main goal of this part of our research is to design and implement a language VM that allows for a simple, efficient interpreter and yet can be conveniently, and gradually, extended with a JIT compiler1.

As we argued in Chapter 1, we believe the key ingredients for this are threefold. First, the system should implement callable virtual instruction bodies that can be dispatched both by the interpreter and from JIT compiler generated code. Second, the system should compile, then run, dynamically identified regions of code that contain only hot code. We pointed out that hot interprocedural paths, or traces, seem like a good choice. Third, the JIT compiler should be able to fall back on generating dispatch code to virtual instruction bodies when it encounters virtual instructions that it does not fully support. The combination of these features enables a gradual style of JIT development where compiler support for virtual instructions can be added one instruction at a time.

A similar argument can be made that the code generated for each hot region of the virtual program should also be callable and should update interpreter state before returning, so that interpretation may resume immediately. We call this a region body because it essentially is a generated virtual instruction body for a newly created, runtime identified, virtual instruction. Region bodies are called with the interpreter state as the first virtual instruction in the region would have seen it, and return with the interpreter state as the last virtual instruction would have left it. Within the region body, interpreter state need not be kept up-to-date. A region body can have multiple return points due to exceptions (in straight-line code) or trace exits.

1The material in this chapter and the next is derived from our VEE 2007 paper [82].
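Viewed as a calling convention, a region body is simply callable code over the interpreter state; the sketch below illustrates that idea with hypothetical field names, not Yeti's actual layout.

    /* Sketch: a region body expects the interpreter state (at least the vPC and
     * the expression stack pointer) to be current on entry, and leaves it
     * current again on exit so interpretation can resume at the next virtual
     * instruction. */
    typedef struct {
        void **vPC;        /* virtual program counter into the DTT */
        long  *sp;         /* expression stack pointer             */
    } InterpState;

    typedef void (*RegionBody)(InterpState *state);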

Packaging generated code as callable also aims to support an incremental style of development, in this case allowing new and presumably larger or more highly optimized regions of the virtual program to be identified, compiled and dispatched. Currently, Yeti dispatches single virtual instruction bodies, subroutine-threaded region bodies for straight-line sections of code, and interpreted and compiled traces.

Section 6.1 gives an overview of our implementation. Section 6.2 describes how regions are identified. The runtime environment of a trace is described in Section 6.3. Section 6.4 describes how region bodies are generated for interpreted and JIT compiled traces. Finally, Section 6.5 describes ways in which our implementation is challenged by the software environment in which it is implemented.

6.1 Structure and Overview of Yeti

Our system starts operating as a simple direct call threaded (DCT) interpreter, as discussed in Section 3.2. After each instruction has run once, instrumentation called from the dispatch loop identifies straight-line sections of the virtual program. Simple subroutine-threaded region bodies are generated. These are installed by overwriting the DTT slot corresponding to the first virtual instruction in the region with the entry point of the new region body. Subsequently, the subroutine-threaded code executes. The system, up to this point, is operating as a lazily loaded subroutine-threaded interpreter. This alone can speed up programs with long linear blocks (like compress and mpeg) relative to direct-threaded performance.

As the program executes, profiling associates and updates event counters in a payload structure corresponding to each region. Eventually, hot traces are identified and translated to region bodies. We will describe two ways traces are compiled. Interpreted traces, described in Section 6.4.1, implement traces in the simplest way we could conceive of, whereas JIT compiled traces, described in Section 6.4.2, compile the virtual instructions in each trace to register-allocated native code.

A novel aspect of our JIT is that it compiles only a subset of virtual instructions while falling back on dispatch for the remainder. Currently, our system generates code for about 50 integer and object virtual instructions, including all of Java’s conditional branch instructions. We have invested no effort in classical optimizations apart from a relatively simple variation on inlining when the invocation and return of a method occur in the same trace.

Ordinarily, DCT is slow, because it suffers a branch misprediction penalty for almost every iteration of the dispatch loop, but this turns out not to be a performance problem for Yeti. As hot region bodies are identified, installed, dispatched, and linked together, execution shifts almost entirely to within the region bodies and consequently the overhead of the dispatch loop becomes negligible.

Initial Load Figure 6.1 shows how our running example (Figure 2.1) is loaded by Yeti. In the figure, the bodies are the same C coded virtual instruction bodies we show in Figure 4.2. Initially all instances of an instruction, like the two instances of iload in the figure, point to the same shared region bodies. This makes the initial load lightweight, since no code needs to be generated and a small (static) set of region bodies and associated profiling payloads is shared by all instances of virtual instructions.

Like direct threading and regular DCT, Yeti loads each virtual instruction into one or more slots in the DTT when the virtual program is loaded. Arguments to virtual instructions are handled exactly as in DCT or direct threading. However, we have enhanced the representation of the virtual opcode significantly. In Yeti, we add a level of indirection – the first DTT slot of each instruction points to an instance of a dispatcher structure instead of the address of a virtual instruction body.

Dispatcher It is the need to efficiently associate the vPC with both the body (for dispatch) and the payload (for profiling) that motivates the extra indirection in our design. The alternative would be to maintain a side table associating the payload and vPC. We chose the current arrangement over a hash table because it is simpler.

The dispatcher structure contains four key fields. The region body to be dispatched is stored in the body field. The preworker and postworker fields store the addresses of instrumentation functions to be called before and after the dispatch of the region body respectively. Finally, the dispatcher has a payload field, which is a region of profiling or other data that the instrumentation needs to associate with the region body. The most obvious use of the payload is to count events associated with each region body. We define specialized payload structures to describe virtual instructions, linear blocks, and traces.

When a dispatcher is created, specific preworker and postworker functions are chosen depending on the type of region body the dispatcher describes. The design is object-based in the sense that the choice of a given preworker and postworker determines the behavior of the instrumentation for the given region body. In our design, the workers assume that they are always associated with a specific type of payload.

Dispatch Loop The dispatch loop, shaded in Figure 6.1, requires an extra level of indirection to call each body. The overhead of the extra indirection is of little concern as any given instruction will be executed only a few times using this generic mechanism.

Figure 6.1 also illustrates how instrumentation code for the region is called before (the preworker) and after (the postworker) the instruction body is executed. Initially instrumentation is interposed around the dispatch of each virtual instruction. This is convenient as it puts the runtime in control when the destination of each virtual branch has been determined but before it is dispatched. Later, as larger region bodies are installed, instrumentation is dispatched before and after the execution of the region body (no longer after each instruction).
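The dispatcher and the generic dispatch loop can be sketched in C roughly as follows. The struct layouts, type names and the DTT declaration are assumptions made to keep the sketch self-contained; only the four dispatcher fields and the preworker/body/postworker calling sequence come from the text and from Figure 6.1.

    /* Sketch only: not Yeti's actual declarations. */
    typedef struct tcs       t_thread_context;   /* thread context struct (TCS), described below */
    typedef union  dtt_slot  Instruction;        /* one slot of the DTT                          */

    typedef struct dispatcher {
        void (*body)(void);                       /* callable region body to dispatch            */
        void (*pre)(Instruction *, void *, t_thread_context *);   /* preworker                   */
        void (*post)(Instruction *, void *, t_thread_context *);  /* postworker                  */
        void  *payload;                           /* profiling data associated with this body    */
    } dispatcher_t;

    union dtt_slot {
        dispatcher_t *dispatcher;                 /* first slot of each virtual instruction      */
        long          operand;                    /* remaining slots hold immediate operands     */
    };

    struct tcs {                                  /* thread-private data; fields sketched below  */
        int   recordMode;
        void *historyList;
    };

    extern Instruction dtt[];                     /* DTT of the loaded virtual program (assumed) */
    Instruction *vPC;                             /* advanced by the instruction bodies; the
                                                     sketch ignores threading concerns           */

    void interp(void) {
        t_thread_context tcs = {0, 0};
        vPC = &dtt[0];
        while (1) {                               /* generic dispatch loop, as in Figure 6.1     */
            dispatcher_t *d   = vPC->dispatcher;
            void         *pay = d->payload;
            (*d->pre)(vPC, pay, &tcs);            /* instrumentation before dispatch             */
            (*d->body)();                         /* call the region body                        */
            (*d->post)(vPC, pay, &tcs);           /* instrumentation after dispatch              */
        }
    }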


Figure 6.1: Virtual program loaded into Yeti showing how dispatcher structures are initially shared between all instances of a virtual instruction. The dispatch loop, shaded, is similar to the dispatch loop of direct call threading except that another level of indirection, through the dispatcher structure, has been added. Profiling instrumentation is called before and after the dispatch of the body.

An interesting feature omitted from the figure is that Yeti actually has several specialized dispatch loops. For instance, when a trace is dispatched the only remaining event to monitor is the emergence of a hot trace exit. Overhead can be significantly reduced by providing a specialized dispatch loop exclusively for traces that inlines only the required instrumentation. In general, profiling can be optimized, or turned off altogether, by changing dispatch loops.

Thread Context Structure Modern virtual machines support multiple threads of execution. Our design, like many modern interpreters, requires that each new interpreter thread runs in a separate pthread starting with a new invocation of the interp function. This means that any local variables declared in interp are thread-private data. The DTT, dispatchers and region bodies, on the other hand, are shared by all threads.

Yeti needs a small additional amount of thread-private data for its own purposes. To keep all thread-private data together, we have added a new structure to the interp function called the thread context structure, or TCS. The TCS contains only a few fields, mostly in support of region identification and trace exit profiling. For instance, in support of region identification, the TCS provides the recordMode bit, which indicates whether the current thread is actively recording a region, and the history list, which records region bodies as they are executed.

Section 6.4.2 describes the role played by the TCS in profiling trace exits.

A pointer to the TCS is passed to the preworker and postworker each time they are called. For simplicity, the TCS was omitted from Figure 6.1 but appears in Figure 6.2, where it is the root of the history list.
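A minimal C sketch of the TCS, expanding the version used in the dispatch loop sketch above. Only recordMode, the history list, and the trace exit bookkeeping described in Section 6.3 are named in the text; the exact types and field names are assumptions.

    struct payload;                            /* per-region profiling payload (opaque here)         */
    struct trace_payload;                      /* payload of a trace dispatcher (Section 6.3)        */

    struct tcs {
        int                   recordMode;      /* is this thread actively recording a region?        */
        struct payload       *historyList;     /* payloads of region bodies executed while recording */
        struct trace_payload *texit_payload;   /* written by a trace exit handler: which trace...    */
        int                   texit_index;     /* ...and which of its trace exits was taken          */
    };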

6.2 Region Selection

Our strategy for identifying hot regions of the program is carried out by preworkers and postworkers in conjunction with state information passed in the TCS. When the profiling instrumentation discovers the beginning of a new region to be compiled into a region body it sets the recordMode bit in the TCS. As described below, this may be done by the preworker (as for linear blocks) or the postworker (as for traces). Once the recordMode bit is set, the thread is actively collecting a region of the program. In this mode the preworker appends the payload of each region body about to be executed to the thread-private history list in the TCS.

Eventually a preworker or postworker recognizes that execution has reached the end of the region to be collected and clears recordMode. At this point a new region body is generated from the history list.

6.2.1 Initiating Region Discovery

We ignore the first execution of each instance of a virtual instruction before considering it for inclusion in a region body. First, as discussed in Section 3.6.2, late binding languages like Java may rewrite some virtual instructions the first time they execute. We should delay region selection until after these instructions have been rewritten. Second, some virtual instructions, for instance static class initialization blocks in Java, only execute once. This suggests that we should always wait until the second execution before considering a virtual instruction.

The obvious way of implementing this is to increment a counter the first time an instruction executes. However, this cannot be implemented with our loading strategy because a shared dispatcher has no simple way of counting how many times a specific instance has been dispatched. For example, in Figure 6.1 both instances of iload share the same dispatcher and payload, so there is no place to maintain a counter for each instance.

Hence, after the first execution, the preworker replaces the shared dispatcher with a new, non-shared, instance of a block discovery dispatcher. The second time the instruction is dispatched, the block discovery dispatcher sets about identifying linear blocks, as described next.

6.2.2 Linear Block Detection

A linear block is a runtime approximation of a basic block, namely a straight-line section of the virtual program ending with a branch. The process of identifying linear regions of the program is carried out by the block discovery preworker based on state information it is passed in the TCS.

We start our explanation of how block discovery works with a detailed walk-through of how the block discovery preworker identifies a new linear block. Suppose a block discovery preworker is called for an instance of virtual instruction i at vPC. A block discovery dispatcher was installed for i after it executed for the first time. Hence, whenever the block discovery preworker is called there are two possibilities. If recordMode is set then i should simply be appended to the history list (in the TCS) and thus added to the linear region currently being recorded². Otherwise, if recordMode is clear, then i must begin a new linear block. (If there already was a linear region starting at vPC, then a dispatcher for that region body would have executed instead.)

²There are corner cases, for instance, if i is encountered while a trace is being collected.

Figure 6.2: Shows a region of the DTT during block recording mode. The body of each block discovery dispatcher points to the corresponding virtual instruction body (only the body for the first iload is shown). The dispatcher's payload field points to instances of instruction payload. The thread context struct is shown as TCS.

The preworker recognizes the end of the linear region when it encounters a virtual branch instruction. At this point recordMode is cleared, and a new subroutine-threaded region body is generated from the instructions on the history list. Figure 6.2 illustrates an intermediate stage during the identification of the linear block of our running example. The preworker has appended the payload of each instruction onto the thread's history list, rooted in the TCS. In the figure, a branch instruction, a goto, will end the current linear block.
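The walk-through amounts to roughly the following logic in the block discovery preworker. This is an illustrative sketch; the helper functions are placeholders, the type declarations repeat the Section 6.1 sketch, and the corner cases mentioned in the footnote are omitted.

    typedef union dtt_slot Instruction;                                 /* as in Section 6.1 */
    typedef struct tcs { int recordMode; void *historyList; } t_thread_context;

    extern void append_to_history(t_thread_context *, void *payload);
    extern int  ends_with_virtual_branch(void *payload);
    extern void generate_linear_block(t_thread_context *);   /* build and install region body */

    void block_discovery_pre(Instruction *vPC, void *payload, t_thread_context *tcs) {
        (void)vPC;
        if (!tcs->recordMode)
            tcs->recordMode = 1;              /* nothing being recorded: this instruction begins
                                                 a new linear block                              */
        append_to_history(tcs, payload);      /* add this instruction to the region in progress  */
        if (ends_with_virtual_branch(payload)) {
            generate_linear_block(tcs);       /* subroutine-threaded region body built from the
                                                 history list; a linear block dispatcher is
                                                 installed at the block's entry point            */
            tcs->recordMode = 0;              /* end of the linear block                         */
        }
    }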

Figure 6.3 illustrates the situation just after the collection of the linear block. The dispatcher corresponding to the entry point of the linear block has been replaced by a new linear block dispatcher whose job it will be to search for traces. The linear block dispatcher points to a new payload created from the history list; its body field points to a subroutine-threading-style region body that has been generated for the linear block. Note that linear blocks are not basic blocks because they do not end at labels. If the virtual program later branches to a virtual address that happens to be in the middle of a linear block, our system will create a new linear block that replicates the tail of the original.


Figure 6.3: Shows a region of the DTT just after block recording mode has finished.

6.2.3 Trace Selection

The postworker of a linear block dispatcher is called after the last virtual instruction of the linear block has executed. Since, by definition, linear blocks end with branches, after executing the last instruction the vPC has been set to the destination of the branch and hence points to one of the successors of the linear block. The postworker runs at exactly the right moment to profile edges of the control flow graph, namely after each branch destination is known, and yet before the destination is executed.

If the vPC of the destination is less than the vPC of the virtual branch instruction itself, this is a reverse branch – a likely candidate for the latch of a loop. According to the heuristics developed by Dynamo (see Section 2.5), hot reverse branches are good places to start the search for hot code. Accordingly, when our system detects a reverse branch that has executed 100 times³ it enters trace recording mode. In trace recording mode, similar to linear block recording mode, the postworker adds each linear block payload to the thread's history list. The situation is very similar to that illustrated in Figure 6.2, except the history list describes linear blocks instead of virtual instructions. Our system, like Dynamo, ends a trace (i) when it reaches a reverse branch or finds a cycle, or (ii) when it contains too many (currently 100) linear blocks.

³Performance does not seem sensitive to the particular value, so we chose a round number in the vicinity of the value used by Dynamo. The rationale is that the probability of staying on trace falls as the number of trace exits grows. Perhaps the trace end condition should count blocks ending with conditional branches instead of simply blocks.
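In outline, the postworker of a linear block dispatcher implements this heuristic roughly as follows. The payload fields and helper functions are illustrative; only the policy (start recording at the target of a reverse branch that has executed 100 times, stop at a reverse branch or cycle, or after 100 blocks) comes from the text.

    typedef union dtt_slot Instruction;                                 /* as in Section 6.1 */
    typedef struct tcs { int recordMode; void *historyList; } t_thread_context;

    enum { HOT_REVERSE_BRANCH = 100, MAX_TRACE_BLOCKS = 100 };

    typedef struct lb_payload {              /* linear block payload (illustrative layout)     */
        Instruction *branch_vPC;             /* vPC of the virtual branch ending this block    */
        long         latch_count;            /* times this block's reverse branch has executed */
        /* ... list of the block's virtual instructions, region body entry point, ...          */
    } lb_payload;

    extern void append_to_history(t_thread_context *, void *payload);
    extern int  reverse_branch_or_cycle(t_thread_context *, Instruction *vPC);
    extern int  history_length(t_thread_context *);
    extern void generate_trace(t_thread_context *);   /* build and install a trace dispatcher  */

    void linear_block_post(Instruction *vPC, lb_payload *lb, t_thread_context *tcs) {
        if (tcs->recordMode) {                         /* collecting a trace: add this block    */
            append_to_history(tcs, lb);
            if (reverse_branch_or_cycle(tcs, vPC) || history_length(tcs) >= MAX_TRACE_BLOCKS) {
                generate_trace(tcs);
                tcs->recordMode = 0;
            }
        } else if (vPC < lb->branch_vPC                /* destination before the branch: latch  */
                   && ++lb->latch_count >= HOT_REVERSE_BRANCH) {
            tcs->recordMode = 1;                       /* start recording a trace at the target */
        }
    }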

When trace generation ends, a new trace dispatcher is created and installed. This is quite similar to Figure 6.3 apart from the need to support trace exits. The payload of a trace dispatcher includes a table of trace exit descriptors, one for each linear block in the trace. See Figure 6.4.

Although code could be generated for the trace at this point, we postpone code generation until the trace has run a few times, currently five, in trace training mode⁴. Trace training mode uses a specialized dispatch loop that calls additional instrumentation before and after dispatching each virtual instruction in the trace. The instrumentation is passed pointers to various interpreter variables (top of the expression stack, a description of the currently executing method, etc). In principle, almost any detail of the virtual machine's state can be recorded. Currently, we record the class of every Java object upon which a virtual method is invoked.

Once the trace has been trained, we generate and install a region body. We have implemented two different mechanisms for generating code for a trace. Early in the project we implemented a simple approach, interpreted traces, that generates very simple subroutine-threaded style code for each trace. Then, with a great deal more effort, we implemented our trace-based JIT compiler. Both approaches are described in Section 6.4.

Before we discuss code generation, we need to describe the runtime of the trace system and especially the operation of trace exits.

⁴As almost all the callsites in the SPECjvm98 benchmarks are monomorphic, a smaller number of training runs would have been sufficient but unrealistic.

6.3 Trace Exit Runtime

One of the properties that make traces a desirable shape of region body is that they predict hot paths through the virtual program. If the predictions are good, and the Dynamo results suggest that they are, we assume that most trace exits are not taken. The trace exits that are taken, however, quickly become hot and hence new traces must be generated and linked. This means that it will likely pay to burden the implementation of a trace exit with some extra overhead if this makes the path through the trace more efficient.

We use a combination of code generation (in the region body for the trace) and runtime profiling instrumentation (in the postworker called after each trace returns to the dispatch loop) to detect which trace exits are occurring and what to do about it.

Trace exits occur when execution diverges from the path collected during trace generation, or in other words, when the destination of a virtual branch instruction in the trace is different from what was recorded during trace generation. Generated trace exit code in the trace detects the divergence and branches to a trace exit handler. Generated code in the trace exit handler records which trace exit has occurred by storing, into the TCS, the address of the trace payload (to identify the trace) and the index of the trace exit (to identify the specific branch). The trace exit handler then returns to the dispatch loop, which, as usual, calls the postworker. The postworker uses the information in the TCS to update the trace exit profiling information in the trace payload.

This scheme minimizes overhead for traces that complete or link at the expense of cold trace exits. Conceptually, the postworker has only a few alternatives to choose from:

1. If the trace exit is still cold, increment the counter corresponding to the trace exit in the trace payload.

2. Notice that the counter has crossed the hot threshold and arrange to generate a new trace.

3. Notice that a trace already exists at the destination and link the trace exit handler to the destination trace.

Alternative 1 is trivial: the postworker increments a counter and returns. Alternative 2 is also simple: the postworker sets the recordMode bit in the TCS and the destination trace will start being collected immediately. Alternative 3 is more challenging and will be described in the next section.
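A sketch of how the trace postworker might choose between the three alternatives. The trace payload layout, the hot threshold and the helper functions are illustrative; only the decision structure follows the text above.

    typedef union dtt_slot Instruction;                                 /* as in Section 6.1 */

    typedef struct trace_exit_descriptor {
        long  count;                           /* how often this trace exit has been taken        */
        void *handler;                         /* its trace exit handler, rewritten when linking  */
    } trace_exit_descriptor;

    typedef struct trace_payload {
        long                  count;           /* dispatch count for the whole trace              */
        trace_exit_descriptor exit_table[1];   /* one descriptor per linear block in the trace    */
    } trace_payload;

    typedef struct tcs { int recordMode; void *historyList;
                         trace_payload *texit_payload; int texit_index; } t_thread_context;

    enum { HOT_EXIT_THRESHOLD = 100 };          /* placeholder; the thesis does not give the value */

    extern trace_payload *find_trace_at(Instruction *vPC);     /* trace starting at vPC, if any    */
    extern void link_trace_exit(trace_exit_descriptor *, trace_payload *dest);

    void trace_post(Instruction *vPC, trace_payload *tp, t_thread_context *tcs) {
        trace_exit_descriptor *te   = &tp->exit_table[tcs->texit_index];
        trace_payload         *dest = find_trace_at(vPC);
        if (dest != NULL)
            link_trace_exit(te, dest);          /* alternative 3: rewrite the handler to jump there */
        else if (++te->count >= HOT_EXIT_THRESHOLD)
            tcs->recordMode = 1;                /* alternative 2: start collecting the destination  */
        /* alternative 1: otherwise the counter increment above is all that happens                 */
    }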

6.3.1 Trace Linking

The goal of trace linking is to rewrite the trace exit handler of a hot trace exit to branch directly to the destination trace rather than return to the dispatch loop. The actual mechanism we use depends on the underlying virtual branch instruction. There are two main cases, branches with only one off-trace destination and branches with multiple off-trace destinations.

Regular conditional branches, like Java’s if_icmp, are quite simple. The branch has only two destinations, one on the trace and the other off. When the trace exit becomes hot a new trace is generated starting with the off-trace destination. Then, the next time the trace exit occurs, the postworker links the trace exit handler to the new trace by rewriting the branch instruction in the trace exit handler to jump directly to the destination trace instead of returning to the dispatch loop. Subsequently, execution stays in the code cache for both paths of the program.

Multiple destination branches, like method invocation and return, are more complex. When a trace exit originating from a multi-way branch occurs, we are faced with two additional challenges. First, profiling multiple destinations is more expensive than just maintaining one counter. Second, when one or more of the possible destinations are also traces, the trace exit handler needs some mechanism to jump to the right one.

The first challenge we essentially ignore. We use a simple counter and trace generate all destinations of a hot trace exit that arise. The danger of this strategy is that we could trace generate superfluous cold destinations and waste trace generation time and code cache memory.

The second challenge concerns the efficient selection of a destination trace to which to link, and the mechanism used to branch there.


Figure 6.4: Schematic of a trace illustrating how the trace exit table (shaded) in the trace payload records the on-trace destination of each virtual branch.

To choose a destination, we follow the heuristic developed by Dynamo for regular branches – that is, we link to destinations in the order they are encountered. The rationale is that the highest probability trace exits will occur first⁵. At link time, we rewrite the code in the trace exit handler with code that checks the value of the vPC. If it equals the vPC of a linked trace, we branch directly to that trace; otherwise we return to the dispatch loop. Because the specific values of the vPC for each destination trace are visible to the postworker, we can hard-wire the comparand in the generated code. In fact, we can generate a sequence of compares checking for each of the multiple destinations in turn.

Eventually, a sufficiently long cascade would perform no better than a trip around the dispatch loop. Currently we limit ourselves to two linked destinations per trace exit. This mechanism is similar to the technique used for interpreted traces, described next.

⁵This is also the reasoning behind Dynamo's handling of indirect branches.
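For illustration, a linked trace exit handler for a multi-way branch behaves like the following C. In Yeti this is generated machine code with the comparands hard-wired at link time, and the destination traces are entered by branching rather than calling; the names below are illustrative.

    typedef union dtt_slot Instruction;                /* as in the Section 6.1 sketch            */
    extern Instruction *vPC;                           /* already set by the virtual branch body  */
    extern Instruction *linked_vpc_a, *linked_vpc_b;   /* hard-wired at link time (illustrative)  */
    extern void trace_a_body(void), trace_b_body(void);

    void linked_trace_exit_handler(void) {
        if (vPC == linked_vpc_a)
            trace_a_body();                    /* jump directly to the first linked trace         */
        else if (vPC == linked_vpc_b)
            trace_b_body();                    /* or the second (at most two are linked)          */
        /* otherwise fall through and return to the dispatch loop as before */
    }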

6.4 Generating code for traces

Generating code for a trace is made up of two main tasks, generating the main body of the trace and generating a trace exit handler for each trace exit. After trace selection the TCS history list contains a list of linear block payloads that were selected. By traversing the list we can visit each virtual instruction in the trace.

We describe two different strategies for compiling a trace. Both schemes use the same runtime and carry out trace linking identically. Interpreted traces, described next, represent our simplest approach to generating code for a trace. JIT compiled traces, described in Section 6.4.2, contain a mixture of compiled code and dispatch.

Figure 6.4 gives a schematic for a hypothetical trace. As shown in the figure, the dispatcher is the root of the data structure and points to the payload and the entry point of the region body. The payload contains a counter (not shown in the figure) and a trace exit table. The trace exit table is an array of trace exit descriptors, one for each trace exit in the trace. Each trace exit descriptor contains a counter (not shown) and a pointer to the trace exit handler for each trace exit. The counter is used to determine when a trace exit becomes hot. The pointer to the trace exit handler is used to mark the location that will be rewritten for trace linking.

6.4.1 Interpreted Traces

Interpreted traces require only slightly more complex code generation than subroutine threading, but are about as effective as branch inlining (see Section 4.3) at reducing the overhead of dispatching virtual branch instructions. We call them interpreted because no virtual instruction bodies are compiled in-line; rather, an interpreted trace dispatches all virtual instruction bodies including virtual branches.

The trace payload identifies each linear block in the trace and each linear block payload lists every virtual instruction. Hence, by iterating over the linear block payloads the straight line portions of a trace can be easily implemented as regions of subroutine-threaded code.

Trace exits require only slightly more complicated code generation. A trace is a hot path through the virtual program, or put another way, a trace predicts the value of the vPC after each of its constituent virtual branch instructions has executed. Taking this view, the purpose of each trace exit is to ensure that the branch it guards has set the vPC to the on-trace destination.

The on-trace destination of each virtual branch is recorded in the trace payload as the trace is generated. Hence, the simplest possible implementation of a trace exit must do three things. First, it dispatches the virtual branch body. Second, it compares the value of the vPC, the destination of the branch, to the on-trace vPC predicted by the trace. A compare immediate can be used, since the on-trace value of the vPC is known and is constant. Third, it conditionally branches to the trace exit handler if the comparison fails.
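Written as C for clarity, the code generated for one such trace exit behaves as follows; in the real system this is generated subroutine-threaded code, the predicted vPC is a constant baked into a compare immediate, and the names below are purely illustrative.

    typedef union dtt_slot Instruction;          /* as in the Section 6.1 sketch                 */
    extern Instruction *vPC;                     /* set by the virtual branch body               */
    extern Instruction *predicted_vpc;           /* on-trace destination recorded at generation  */
    extern void if_icmpge_body(void);            /* callable body of the guarded virtual branch  */
    extern void trace_exit_handler_0(void);      /* handler for this trace exit                  */

    void interpreted_trace_exit_0(void) {
        if_icmpge_body();                        /* 1. dispatch the virtual branch body          */
        if (vPC != predicted_vpc)                /* 2. compare against the predicted vPC         */
            trace_exit_handler_0();              /* 3. conditional branch to the exit handler    */
        /* fall through: code for the next linear block of the trace follows */
    }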

This code is somewhat reminiscent of the branch replication technique we described in Section 4.3 except that instead of following the dispatch of the virtual branch body with an expensive indirect branch we generate a compare immediate followed by a direct conditional branch to the trace exit handler. We expect this technique to be quite easy for the branch predictors of the underlying processor to predict because the direct conditional branch is fully exposed to the branch history predictors. As we shall show in the next chapter, interpreted traces achieve a level of performance similar to subroutine threading plus branch inlining.

6.4.2 JIT Compiled Traces

Our JIT does not perform any classical optimizations and does not build any internal representation before compiling a trace. As traces contain no merge points, we perform a single pass through each trace allocating expression stack slots to registers and generating code.

An important aspect of our JIT design is that it can generate code for a trace before it supports all virtual instructions. Our JIT generates register allocated machine code for contiguous sequences of virtual instructions it recognizes. When an unfamiliar virtual instruction is encountered, code is generated to flush any temporary values held in registers back to the Java expression stack. Then, the bodies of any uncompilable or unfamiliar virtual instructions are dispatched using subroutine threading. This significantly eases development as the compiler can be extended one virtual instruction at a time. The same tactics can be used for virtual instructions that the JIT partially supports. When the compiler encounters an awkward corner case it can simply give up and fall back to subroutine dispatch instead.

Expression stack slots are assigned to registers, freeing the generated code from maintaining the expression stack. Immediate arguments to virtual instructions, normally loaded from the DTT, are loaded into registers using load immediate instructions whenever possible. This frees the generated code from maintaining the vPC.

Machine code generation is performed using the ccg [59] runtime assembler.

Dedicated Registers

The code generated by Yeti must be able to load and store values to the same Java expression stack and local variable array referred to by the C code implementing the virtual instruction bodies. Our current PowerPC implementation side-steps this difficulty by dedicating hardware registers for the values that must be shared between our generated code and C generated bodies.

At present we dedicate registers for the vPC, the top of the Java expression stack and the pointer to the base of the local variables. Code is generated to adjust the value of the dedicated registers as part of the flush sequence, described below.

On targets with fewer registers, notably Intel’s Pentium, there may not be enough general purpose registers to dedicate three of them for our own purposes. There, we plan to generate code that accesses the variables in memory.

Register Allocation

Java virtual instructions, and those of many other virtual machines, pop arguments off and push results onto an expression stack (see Section 2.1.1). Naive compilation of the pushes and pops would result in many redundant loads, stores and adjustments of the pointer to the top of the expression stack. Our JIT assigns the temporary values to registers instead.

Our register allocator and code generator are combined and perform only one pass. As we examine each virtual instruction we maintain a compile time structure we call the shadow stack. The shadow stack associates each value in an expression stack slot with the register to which it has been assigned. Whenever a virtual instruction would pop one of its inputs we first check if there already is a register for that value in the corresponding shadow stack slot. If so, we use the register instead of generating any code to pop the expression stack. Similarly, whenever a virtual instruction would push a new value onto the expression stack we assign a new register to the value and push this on the shadow. We forgo generating any code to push the value onto the expression stack.

A convenient property of this approach is that every value assigned to a register always has a home location on the expression stack. If we run out of registers we simply spill the register whose home location is deepest on the shadow stack (as all the shallower values will be needed sooner [60]).
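The shadow stack bookkeeping can be sketched as follows. The array representation, the register numbering and the emit helpers are assumptions for illustration; initialization of the shadow stack at trace entry and the search for the deepest slot are folded into the helpers.

    enum { STACK_MAX = 64 };

    static int shadow[STACK_MAX];   /* register holding each expression stack slot, or -1 if the
                                       value lives only in its home slot on the real stack       */
    static int depth;               /* current expression stack depth (set up at trace entry)    */

    extern int  alloc_reg_or_spill(void);      /* free register, spilling the register whose home
                                                  slot is deepest on the shadow stack if needed   */
    extern void emit_load(int reg, int slot);  /* emit a load of a slot's home location           */

    /* A virtual instruction pops an operand: reuse its register when we have one. */
    int pop_operand(void) {
        depth--;
        if (shadow[depth] < 0) {                /* value not in a register, e.g. it was already on
                                                   the expression stack at trace entry            */
            shadow[depth] = alloc_reg_or_spill();
            emit_load(shadow[depth], depth);
        }
        return shadow[depth];                   /* no pop code is emitted                         */
    }

    /* A virtual instruction pushes a result: give it a register and a home slot. */
    int push_result(void) {
        int reg = alloc_reg_or_spill();
        shadow[depth++] = reg;                  /* home location is this expression stack slot    */
        return reg;                             /* no push code is emitted                        */
    }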

Flushing Registers to Expression Stack

The simple strategy for assigning expression stack slots to registers we have described assumes that execution remains on the trace and that all instructions have been compiled. However, when a trace exit is taken, or when the JIT needs to fall back to calling a virtual instruction body, all values in registers must be saved back to the expression stack.

Flush code is generated by scanning the shadow stack to find every expression stack slot currently assigned to a register. A store is generated to store each such live register to its home location on the expression stack. Then, the shadow stack is reinitialized to empty and all registers are marked as free.

Generated code typically does not need to maintain the dedicated registers, for instance the top of the expression stack, or the vPC, until it is about to return to the interpreter. Generated flush code updates the values held by the dedicated registers as well.

Trace Exits and Trace Exit Handlers

The virtual branch instruction ending each block is compiled into a trace exit. We follow two different strategies for trace exits. In the first case, regular conditional branch virtual instructions are compiled by our JIT into machine code that conditionally branches to a trace exit handler when execution would leave the trace. The generated code implements the semantics of the virtual instruction body, and compares and conditionally branches on the values in registers. It does not access the vPC. PowerPC code for this case appears in Figure 6.5. The sense of the conditional branch is adjusted so that the branch is always not-taken for the on-trace path. In the second case, more complex virtual branch instructions, such as method invocation and return, which may have multiple destinations, are handled as for interpreted traces. (Polymorphic method dispatch is also handled this way if it cannot be optimized as described in Section 6.4.3.)

Trace exit handlers have two further roles. First, since compiled traces contain compiled code, it may be necessary to flush values held in registers and update the values of dedicated registers. For instance, in Figure 6.5, the trace exit handler adjusts the vPC. Flush code is the only difference between trace exit handlers for interpreted and compiled traces. Second, trace linking is achieved by overwriting code in the trace exit handler. (This is the only situation in which we rewrite code.) To link traces, the tail of the trace exit handler is rewritten to branch to the destination trace rather than return to the dispatch loop.

The trace link branch occurs after the flush code, which means that registers are flushed only to be reloaded by the destination trace. We have not yet implemented any optimization to address this redundancy. However, if the shadow stack at the trace exit were to be saved aside, it could be used to prime the compilation of the destination. Then, the trace link could be inserted before the flush code.

Most trace exit handlers are reached only when a conditional trace exit is taken. The only exception occurs when a trace executes to completion. Then, control must return to the dispatch loop. To implement this, each trace ends with an in-line trace exit handler. Like any other trace exit handler, it may later be linked to its destination trace if one becomes hot.

    // virtual code in the DTT: OPC_ILOAD_3, OPC_ILOAD_2, OPC_IF_ICMPGE +121
    lwz   r3,12(r27)      // compiled from iload (x)
    lwz   r4,8(r27)       // compiled from iload (y)
    cmpw  r3,r4           // compiled from if_icmpge
    bge   teh0            // trace exit: branch to the trace exit handler
    ...

    teh0:                 // trace exit handler
    addi  r26,r26,112     // adjust vPC upon leaving JIT compiled region
    li    r0,0
    stw   r0,916(r30)     // store trace exit number (0) into the thread context struct
    lis   r0,1090
    ori   r0,r0,11488
    stw   r0,912(r30)     // store hardwired address of the trace payload into the TCS
    blr                   // return to dispatch loop; if this trace exit becomes hot,
                          // trace linking overwrites this instruction with a branch
                          // to the destination trace

Figure 6.5: PowerPC code for a portion of a trace region body, showing details of a trace exit and trace exit handler. This code assumes that r26 has been dedicated for the vPC. In addition, the generated code in the trace exit handler uses r30, the stack pointer as defined by the ABI, to store the trace exit id into the TCS.

6.4.3 Trace Optimization

We describe two optimizations here: how loops are handled and how the training data can be used to optimize method invocation.

Inner Loops

An intrinsic property of Dynamo’s trace selection heuristic is that the innermost loops of a program are often selected into a single trace ending with the loop closing reverse branch. This occurs because trace generation starts at the target of reverse branches and ends whenever it reaches a reverse branch. Note that there may be many branches, including calls and returns, along the way. When the trace is compiled, the loop is trivial to find because the last virtual instruction in the trace is a virtual conditional branch back to its entry.

Inner loops expose a problem with the way we end a trace. Normally, a trace exit is compiled as a branch taken to the trace exit handler for the off-trace path and a fall-through for the on-trace path. If this approach were followed, each iteration of a hot inner loop would execute to the inline trace exit handler at the end of the trace and return to the dispatch loop. Soon this trace exit would become hot and trace linking would rewrite the inline trace exit to branch back to the head of the trace. To avoid the extra branch and pointless trace linking, the trace JIT compiles a reverse branch differently – reversing the sense of the trace exit and generating a reverse conditional branch back to the entry point of the trace.

Thus far, we have not exploited this information to optimize the body of the trace. For example, it would be relatively easy to detect loop invariant instructions and move them to a newly constructed loop preheader. However, the flow graph of the resulting unit of compilation would then include a merge point because the head of the loop would have two inbound edges (the back edge and the edge from the preheader). The register allocation scheme we have described does not support merge points.

Virtual Method Invocation

So far, all the trace exits we have described have been translations of virtual branch instructions. However, a trace exit can be used to guard other speculative optimizations as well. Our strategy for optimizing virtual method invocation is to generate a guard trace exit that is much cheaper than a full method dispatch. If the guard code falls through, we know execution should continue along the trace.

Specifically, if the class of the invoked-upon object is different than recorded when the trace was generated, a trace exit must occur. At trace generation time we know the on-trace destination of each call. From the training profile, we know the class of each invoked-upon object. Thus, we can easily generate a virtual invoke guard that branches to the trace exit handler if the class of the object on top of the expression stack is not the same as recorded during training. Then, we can generate code to perform a faster, stripped down version of method invocation.

The savings are primarily the work associated with looking up the destination given the class of the receiver. This technique was independently invented by Gal et al. [33].
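In outline, the generated guard behaves like the following C. The helpers, the recorded class and the stripped-down invocation are illustrative stand-ins; in the real system the training-time class and destination are hard-wired constants in generated code.

    struct class_t;                                       /* Java class metadata (opaque here)       */
    extern struct class_t *class_of_top_of_stack(void);   /* class of the receiver on the stack      */
    extern struct class_t *recorded_class;                /* class observed during trace training    */
    extern void trace_exit_handler_k(void);               /* trace exit taken when the guard fails   */
    extern void fast_invoke_recorded_target(void);        /* stripped-down invoke of the known target */

    void virtual_invoke_guard(void) {
        if (class_of_top_of_stack() != recorded_class) {
            trace_exit_handler_k();                        /* speculation failed: leave the trace     */
            return;
        }
        fast_invoke_recorded_target();                     /* skip the method lookup entirely         */
    }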

Inlining

Traces are agnostic towards method invocation and return, treating them like any other multiple-destination virtual branch instructions. However, when a return corresponds to an invoke in the same trace, the trace compiler can sometimes remove almost all method invocation overhead. Consider when the code between a method invocation and the matching return is relatively simple; for instance, it does not touch the callee's stack frame (other than the expression stack), it cannot throw an exception and it makes no further method invocations. Then, we can eliminate the invoke altogether, and the only method invocation overhead that remains is the virtual invoke guard. If the inlined method body contains any trace exits, the situation is slightly more complex. In this case, in order to prepare for a return somewhere off-trace, the trace exit handlers for the trace exits in the inlined code must modify the expression stack exactly as the (optimized away) method invocation would have done.

6.5 Other implementation details

Our system, as described in this chapter, generates code that coexists with virtual instruction bodies written in C. Consequently, the generated code must be able to access a few interpreter variables like the vPC, the top of the expression stack, and the base of the local variable array.

For these heavily used interpreter variables, on machines with sufficient general purpose registers, we take the obvious approach of assigning the variables to dedicated registers. Dedicating these registers might even improve the quality of code generated by the compiler for the interpreter. We note that on the PowerPC, OCaml dedicates registers for the vPC and a few other commonly used values, presumably because it performs better this way.

A related challenge arises in our implementation of trace exit handlers. We want on-trace execution to be free of trace exit related overhead. At the same time, we need a way of recording which trace exit has occurred so that we can determine which trace exits are hot. This means that each trace exit handler, which is a region of code specific to a trace exit generated by Yeti, must have a way of writing into the TCS. On the PowerPC we could dedicate yet another register to point to the TCS. However, this could only hurt the performance of the virtual instruction bodies, since they never refer to the TCS. Instead, we indulge in some unwarranted chumminess with gcc. Using a trick invented by Vitale, we use gcc inline asm statements to obtain a string containing the assembler gcc would generate to access the desired field in the TCS [78]. Then, we parse the string and extract all the information we need to generate code to access the field.

Our use of a dispatch loop, similar to Figure 6.2, in conjunction with making virtual bodies callable by inserting inlined assembler return instructions, results in a control flow graph that is not apparent to the optimizer. First, the optimizer cannot know that the label at the head of each virtual instruction body can be reached by the function pointer call in the dispatch loop. (The compiler assumes, quite reasonably, that the function pointer call only reaches the entry point of functions.) Second, the optimizer does not know that control flows from the inlined return instruction back to the dispatch loop. We work around these difficulties by inserting computed gotos (which never actually execute) to simulate the missing edges.
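The work-around can be sketched as follows; this is illustrative gcc-flavoured C in the spirit of Figures 6.1 and 6.2, not Yeti's actual source, and the label plumbing in JamVM is considerably more involved.

    /* Sketch: a callable virtual instruction body inside the interpreter function,
       with never-executed computed gotos that expose the real control flow to gcc. */
    void interp_sketch(void) {
        static void *edges[] = { &&dispatch_loop, &&iload };   /* label addresses           */
        volatile int never = 0;                                /* never true at run time    */

    dispatch_loop:
        /* ... generic dispatch loop as in Figure 6.1: (*d->body)() can reach &&iload ... */
        if (never) goto *edges[1];      /* never executes; simulates the call edge into the body */

    iload:
        /* ... C code for the iload body: push the local variable, advance the vPC ... */
        asm volatile("ret");            /* inlined return makes the body callable               */
        if (never) goto *edges[0];      /* never executes; simulates the edge back to the loop  */
    }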

6.6 Chapter Summary

In this chapter we have described the design trajectory for a high-level language virtual machine that extends from a very simple interpreter, through a high-performance trace-based interpreter, to an extensible trace-based JIT compiled system. Our design goals are much more ambitious than in the preceding two chapters. There, we concentrated on how an interpreter can be made more efficient. In this chapter we presented a design that supports the evolution of a high-level language VM from a simple interpreter to a JIT. Thus, we favour infrastructure that supports the development of a JIT, for instance our dispatcher-based instrumentation, over infrastructure that is more narrowly focused on a specific interpretation technique.

An aspect of context threading that is somewhat unpalatable is that the effort invested implementing branch inlining, apply/return inlining and tiny inlining does nothing to facilitate the later addition of a JIT compiler. For instance, implementing branch inlining in the interpreter runs the risk of being a throw-away effort – if evolving performance requirements eventually lead to the implementation of a JIT, then a good deal of the effort spent building branch inlining will have to be duplicated.

In contrast to this, Yeti builds its advanced interpretation techniques on top of infrastructure that is intended to facilitate the addition of a JIT. For instance, interpreted traces require trace-based profiling that is also required to support the trace-based JIT. As we will show in the next chapter, interpreted traces perform just as well as branch inlining.

With the resources at our disposal, it is not feasible to show that the performance potential of our trace-based JIT compiler is equal to an optimizing method-based JIT like those deployed by Sun or IBM. Our design is intended to support any shape of region body, so in a sense, the peak performance of traces is not a limiting factor, since with sufficient engineering effort, peak performance could always be achieved by compiling inlined method nests.

Instead, we concentrated our JIT compiler design efforts on how to support only a subset of virtual instructions, added one at a time. We found this was a convenient way to work, much easier than bringing up a regular compiler, since interactions between code generation bugs were much reduced. Currently our JIT consists of only about 2000 statements of C source code, about half machine dependent, and compiles about 50 integer virtual instructions. Nevertheless, as we will show in the next chapter, our JIT improves the performance of the SPECjvm98 benchmarks by about 24% over interpreted traces.

The main problem with the implementation of our prototype is that our generated code depends too heavily on gcc. There are two main issues. First, our generated code occasionally needs to access interpreter values. On the PowerPC we were able to side-step the potential difficulties by dedicating registers for key interpreter variables, but clearly another approach will be necessary for 32 bit Intel processors, which have too few general purpose registers to dedicate to any interpreter variables. Second, the way we have packaged virtual instruction bodies, and called them via a function pointer (Figure 6.1), hides the true control flow of the interpreter from the C optimizer. We will discuss how this might be avoided by packaging bodies as nested functions in Chapter 8.

Next, in Chapter 7, we will evaluate the performance of our prototype.

Chapter 7

Evaluation of Yeti

In this chapter we evaluate Yeti from three main perspectives. First, we evaluate the effectiveness of traces for capturing the execution of regions of Java programs, and verify that the frequency of dispatching region bodies does not burden overall performance. Second, we confirm that the performance of the simplest, entry level, version of our system is reasonable, and that performance improves as more sophisticated shapes of region bodies are identified and effort is invested in compiling them. The goal here is to determine whether the first few stages of our extensible system are viable deployment candidates for an incrementally evolving system. Third, we attempt to measure the extent to which our technique is affected by various pipeline hazards, especially branch mispredictions and instruction cache misses.

We prototyped Yeti in a Java VM (rather than a language that does not have a JIT) in order to compare our techniques against high-quality implementations on well-known benchmarks. We show that through four stages of extending our system, from a simple direct call-threaded (DCT) interpreter to a trace based JIT compiler, performance improves steadily. Moreover, at each stage, the performance of our system is comparable to other Java implementations based on different, more specific techniques. Thus, DCT, the entry level of Yeti, is roughly comparable to switch threading. Interpreted traces are faster than direct threading and our trace based JIT is 27% faster than selective inlining in SableVM.


These results indicate that our design for Yeti is a good starting point for an extensible infrastructure whose performance can be incrementally improved, in contrast to techniques like those described in Chapters 3 and 4 which are end points with little infrastructure to support the next step up in performance.

Section 7.1 describes the experimental set-up. We report the extent to which different shapes of region enable execution to stay within the code cache in Section 7.2. Section 7.3 reports how the performance of Yeti is affected by different region shapes. Section 7.4 describes preliminary performance results on the Pentium. Finally, Section 7.5 studies the effect of various pipeline hazards on performance.

7.1 Experimental Set-up

The experiments described in this section are simpler than those described in Chapter 5 because we have modified only one Java virtual machine, JamVM. Almost all our performance measurements are made on the same PowerPC machine, except for a preliminary look at interpreted traces on Pentium.

We took a different tack to investigating the micro-architectural impact of our techniques than the approach presented in Chapter 5. There, we measured specific performance monitoring counters, for instance, the number of mispredicted taken branches that occurred during the execution of a benchmark. Here, we evaluate Yeti's impact on the pipelines using a much more sophisticated infrastructure which determines the causes of various stall cycles.

Virtual Machines Yeti is a modified version of Robert Lougher's JamVM 1.1.3, which is a very neatly written Java Virtual Machine [53]. On all platforms (OSX 10.4, PowerPC and Pentium Linux) we built both our modifications to JamVM and JamVM as distributed using gcc 4.0.1.

We compare the performance of Yeti to several other JVM configurations:

Table 7.1: SPECjvm98 benchmarks including elapsed time for baseline JamVM (i.e., without any of our modifications), Yeti and Sun HotSpot.

                                              Elapsed Time (sec)
    Benchmark   Description              JamVM 1.1.3          Yeti          HotSpot 1.05_6_64
                                          (direct threaded)    (trace JIT)   (optimizing JIT)
    compress    Lempel-Ziv                     98                44               8.0
    db          Database functions             56                35              23
    jack        Parser generator               22                14               5.4
    javac       JDK 1.0.2                      33                24               9.9
    jess        Expert Shell System            29                19               4.4
    mpeg        read MPEG-3                    87                36               4.6
    mtrt        Two thread raytracer           30                25               2.1
    raytrace    raytracer renderer             29                17               2.3
    scimark     FFT, SOR, LU, 'large'         145                58              16

Table 7.2: Guide to labels which appear on figures and references to technique descriptions.

    Technique                             Label on Figures   Section describing Technique
    Subroutine Threading                  SUB                Section 4.2
    Direct Call Threading                 DCT                Section 6.1
    Linear Blocks                         LB                 Section 6.2.2
    Interpreted Traces                    i-TR               Section 6.4.1
    Interpreted Traces with linking OFF   i-TR-nolink        as above
    Yeti - Trace JIT                      TR-JIT             Section 6.4.2
    SableVM 1.1.8                         SABLEVM            Section 3.6.2

1. JamVM configured for direct threading (its default configuration) is our baseline because direct threading is a commonly deployed high performance dispatch technique.

2. JamVM configured to be switch threaded as an example of an entry-level interpretation technique. Many production language virtual machines have been usefully deployed using switch threading.

3. A subroutine threaded version of JamVM.

4. SableVM with selective inlining as an example of an advanced interpreter technique.

5. Sun's Hotspot JVM version 1.05 as a state of the art Java JIT.

Elapsed Time Data Elapsed time performance data was collected on a dual CPU 2 GHz PowerPC 970 processor with 512 MB of memory running Apple OSX 10.4. Pentium performance was measured on an Intel Core 2 Duo E6600 2.40GHz 4M with 2GB of memory under Linux 2.6.9. Performance is reported as the average of three measurements of elapsed time, as printed by the time command.

Benchmarks Table 7.1 briefly describes each SPECjvm98 benchmark [67] and scimark, a scientific program. Since the rest of the figures in this chapter will report performance relative to unmodified JamVM 1.1.3, Table 7.1 includes, for each benchmark, the raw elapsed time for JamVM, Yeti (running our JIT), and version 1.05.0_6_64 of Sun Microsystems' Java HotSpot JIT. (We provide the elapsed time here because below we will report performance relative to direct threaded JamVM.)

Table 7.2 provides a key to the acronyms used as labels in the following graphs and indicates the section of this thesis in which each technique is discussed.

Pipeline Hazards In Section 7.5 we describe how Yeti is affected by common processor pipeline hazards, such as branch mispredictions and instruction cache misses. We use a new infrastructure, built and operated by Azimi et al., colleagues from the Electrical Engineering Computer Group, that heuristically attributes stall cycles to various causes [6]. Livio Soares provided us with a port of their infrastructure to Linux 2.6.18. We collected the stall data on a slightly different model of PowerPC, a 2.3 GHz PowerPC 970FX (Apple G5 Xserve) running Linux version 2.6.18. The 970FX part is a 90nm implementation of the 130nm 970, more power efficient but identical architecturally. The platform change was forced upon us because Soares' port requires a system running the new FX version of the processor.


Figure 7.1: Number of dispatches executed vs region shape. The y-axis has a logarithmic scale. Numbers above bars, in scientific notation, give the number of regions dispatched. The X axis lists the SPECjvm98 benchmarks in alphabetical order.

7.2 Effect of region shape on dispatch

In this section we report data obtained by modifying Yeti’s instrumentation to keep track of how many virtual instructions are executed from each region body and how often region bodies are dispatched. These data will help us understand to what extent execution remains in the code cache for differently shaped regions of the program.

For a JIT to be effective, execution must spend most of its time in compiled code. We can easily count how many virtual instructions are executed from interpreted traces and so we can calculate what proportion of all virtual instructions executed come from traces. For jack, traces account for 99.3% of virtual instructions executed. For all the remaining benchmarks, traces account for 99.9% or more.

A remaining concern is how often execution enters and leaves the code cache. In our system, execution enters the code cache whenever a region body is called from a dispatch loop. It is an easy matter to instrument the dispatch loops to count how many iterations occur, and hence how many dispatches are made. These numbers are reported by Figure 7.1. The figure shows how direct call threading (DCT) compares to linear blocks (LB), interpreted traces with no linking (i-TR-nolink) and linked interpreted traces (i-TR). Note that the y-axis has a logarithmic scale.

Figure 7.2: Number of virtual instructions executed per dispatch for each region shape. The y-axis has a logarithmic scale. Numbers above bars are the number of virtual instructions executed per dispatch (rounded to two significant figures). SPECjvm98 benchmarks appear along the X axis sorted by the average number of instructions executed by a LB.

DCT dispatches each virtual instruction body individually, so the DCT bars on Figure 7.1 report how many virtual instructions were executed by each benchmark. For each benchmark, the ratio of DCT to LB shows the dynamic average linear block length (e.g., for compress the average linear block executed 1.25 × 10^10 / 1.27 × 10^9 = 9.9 virtual instructions). In general, the height of each bar on Figure 7.1 divided by the height of the DCT bar gives the average number of virtual instructions executed per dispatch of that region shape. Figure 7.2 also presents the same data in terms of virtual instructions executed per dispatch, but sorts the benchmarks along the x axis by the average LB length. Hence, for compress, the LB bar shows 9.9 virtual instructions executed on the average.

Scientific benchmarks appear on the right of Figure 7.2 because they tend to have longer linear blocks. For instance, the average block in scitest has about 24 virtual instructions whereas javac, jess and jack average about 4 instructions. Comparing the geometric mean across benchmarks, we see that LB reduces the number of dispatches relative to DCT by a factor of 6.3. On long basic block benchmarks, we expect that the performance of LB will approach that of direct threading for two reasons. First, fewer trips around the dispatch loop are required. Second, we showed in Chapter 5 that subroutine threading is better than direct threading for linear regions of code.

Traces do predict paths taken through the program. The rightmost cluster on Figure 7.2 shows that, even without trace linking (i-TR-nolink), the average trace executes about 5.7 times more virtual instructions per dispatch than a LB. The improvement can be dramatic. For instance, javac executes, on average, about 22 virtual instructions per trace dispatch. This is much longer than its dynamic average linear block length of 4 virtual instructions. This means that for javac, on the average, the fourth or fifth trace exit is taken. Or, putting it another way, for javac a trace typically correctly predicts the destination of 5 or 6 virtual branches.

This behavior confirms the assumptions behind our approach to handling virtual branch instructions in general and the design of interpreted trace exits in particular. We expect that most of the trace exits, four fifths in the case of javac, will not be taken. Hence, we generate code for interpreted trace exits that should be easily predicted by the processor's branch history predictors. In the next section we will show that this improves performance, and in Section 7.5 we show that it also reduces branch mispredictions.

Adding trace linking completes the interpreted trace (i-TR) technique. Trace linking makes the greatest single contribution, reducing the number of times execution leaves the trace cache

by between one and 3.7 orders of magnitude. Trace linking has so much impact because it links traces together around loops. A detailed discussion of how inner loops depend on trace linking appears in Section 6.4.3.

Figure 7.3: Percentage trace completion rate, measured as a proportion of the virtual instructions in a trace, and code cache size, measured as a percentage of the virtual instructions in all loaded methods, for the SPECjvm98 benchmarks and scitest.

Although this data shows that execution is overwhelmingly from the trace cache, it gives no indication of how effectively code cache memory is being used by the traces. A thorough treatment of this, like the one done by Bruening and Duesterwald [11], is beyond the scope of this thesis. Nevertheless, we can relate a few anecdotes based on data that our profiling system already collects.

Figure 7.3 describes two aspects of traces. First, in the figure, the %complete bars report the extent to which traces typically complete, measured as a percentage of the virtual instructions in a trace. For instance, for raytrace, the average trace exit occurs after executing 59% of the virtual instructions in the trace. Second, the %loaded bars report the size of the traces in the code cache as a percentage of the virtual instructions in all the loaded methods. For raytrace we see that the traces contain, in total, 131% of the code in the underlying loaded methods. Closer examination of the trace cache shows that raytrace contains a few very long traces which reach the 100 block limit (see Section 6.2.3). In fact, there are nine such traces generated for raytrace, one of which is dispatched 500 thousand times. (This is the only 100 block long trace executed more than a few thousand times generated by any of our benchmarks.) The hot trace was generated from code performing a complex calculation with many calls to small accessor methods that were inlined into the trace, hence the high block count. Interestingly, this trace has several hot trace exits, the last significantly hot exit occurring at the 68th block.

We observe that for an entire run of the scitest benchmark, all generated traces contain only 24% of the virtual instructions contained in all loaded methods. This is a good result for traces, suggesting that a trace-based JIT needs to compile fewer virtual instructions than a method-based JIT. Also, we see that for scitest, the average trace executes almost to completion, exiting after executing 99% of the virtual instructions in the trace. This is what one would expect for a program that is dominated by inner loops with no conditional branches; the typical trace will execute until the reverse branch at its end.

On the other hand, for javac we find the reverse, namely that the traces bloat the code cache – almost four times as many virtual instructions appear in traces as are contained in the loaded methods. In Section 7.5 we shall discuss the impact of this on the instruction cache.

Nevertheless, traces in javac complete only modestly less often than in the other benchmarks.

This suggests that javac has many more hot paths than the other benchmarks. What we are not in a position to measure at this point is the temporal distribution of the execution of the hot paths.

7.3 Effect of region shape on performance

In this section we report the elapsed time required to execute each benchmark. One of our main goals is to create an architecture for a high level machine that can be gradually extended from a simple interpreter to a high performance JIT augmented system. Here, we evaluate the performance of various stages of Yeti's enhancement from a direct call-threaded interpreter to a trace-based mixed-mode system.

Figure 7.4: Performance of each stage of Yeti enhancement from DCT interpreter to trace-based JIT relative to unmodified JamVM-1.3.3 (direct-threaded) running the SPECjvm98 benchmarks (sorted by LB length).

Figure 7.4 shows how performance varies as differently shaped regions of the virtual program are executed. The figure shows elapsed time relative to the unmodified JamVM distribution, which uses direct-threaded dispatch. The raw performance of unmodified JamVM and

TR-JIT is given in Table 7.1. The first four bars in each cluster represent the same stage of

Yeti’s enhancement as those in Figure 7.1. The fifth bar, TR-JIT, gives the performance of Yeti with our JIT enabled.

Direct Call Threading Our simplest technique, direct call threading (DCT), is slower than

JamVM, as distributed, by about 50%.

Although this seems serious, we note that many production interpreters are not direct threaded but rather use the slower and simpler switch threading technique. When JamVM is configured to run switch threading we find that its performance is within 1% of DCT.

Figure 7.5: Performance of Linear Blocks (LB) compared to subroutine-threaded JamVM-1.3.3 (SUB) relative to unmodified JamVM-1.3.3 (direct-threaded) for the SPECjvm98 benchmarks.

This suggests that the performance of DCT is well within the useful range.

Linear Blocks As can be seen in Figure 7.4, linear blocks (LB) run roughly 30% faster than

DCT, matching the performance of direct threading for benchmarks with long basic blocks like compress and mpeg. On the average, LB runs only 15% more slowly than direct threading.

The region bodies identified at run time by LB are very similar to the code generated by subroutine threading (SUB) at load time, so one might expect the performance of the two techniques to be the same. However, as shown by Figure 7.5, LB is, on the average, about 43% slower.

This is because virtual branches are much more expensive for LB. In SUB, the virtual branch body is called from the CTT (see Section 3.5); then, instead of returning, it executes an indirect branch directly to the destination CTT slot. In contrast, in LB a virtual branch instruction sets the vPC and returns to the dispatch loop to call the destination region body. In addition, each iteration of the dispatch loop must look up the destination body in the dispatcher structure (through an extra level of indirection compared to SUB).

Figure 7.6: Performance of JamVM interpreted traces (i-TR) and selective inlined SableVM 1.1.8 relative to unmodified JamVM-1.3.3 (direct-threaded) for the SPECjvm98 benchmarks.
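The contrast can be made concrete with a small sketch. The fragment below is illustrative only: the dispatcher table, branch_target operand and body names are assumptions, not Yeti's actual definitions. It shows why LB pays a table lookup and an indirect call for every region body, whereas SUB's virtual branch bodies jump directly to the destination's CTT slot.

    /* Minimal sketch of LB-style dispatch (assumed names, not Yeti's code). */
    typedef void (*region_body_t)(void);

    extern unsigned int   vPC;            /* virtual program counter                */
    extern unsigned int   branch_target;  /* operand of the current virtual branch  */
    extern region_body_t  dispatcher[];   /* region body for each vPC (assumed)     */

    /* A virtual branch body under LB: set the destination and return. */
    void virtual_goto_body(void) {
        vPC = branch_target;
    }

    /* The generic LB dispatch loop: one table lookup plus one indirect call
     * per region body executed. */
    void lb_dispatch_loop(void) {
        for (;;)
            dispatcher[vPC]();
    }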

Interpreted Traces Just as LB reduces dispatch and performs better than DCT, so link-disabled interpreted traces (i-TR-nolink) further reduce dispatch and run 38% faster than LB.

Interpreted traces implement virtual branch instructions better than LB or SUB. As described in Section 6.4.1, i-TR generates a trace exit for each virtual branch. The trace exit is implemented as a direct conditional branch that is not taken when execution stays on trace.

As we have seen in the previous section, execution typically remains on trace for several trace exits. Thus, on the average, i-TR replaces costly indirect calls (from the dispatch loop) with relatively cheap not-taken direct conditional branches. Furthermore, the conditional branches are fully exposed to the branch history prediction facilities of the processor.

Figure 7.7: Performance of JamVM interpreted traces (i-TR) relative to unmodified JamVM-1.3.3 (direct-threaded) and selective inlined SableVM 1.1.8 relative to direct-threaded SableVM 1.1.8 for the SPECjvm98 benchmarks.

Trace linking, though it eliminates many more dispatches, achieves only a modest further speedup because the specialized dispatch loop for traces is much less costly than the generic dispatch loop that runs LB.

We compare the performance of selective inlining, as implemented by SableVM, and interpreted traces in two different ways. First, in Figure 7.6, we compare the performance of both techniques relative to the same baseline, in this case JamVM with direct threading. Second, in Figure 7.7, we show the speedup of each VM relative to its own implementation of direct threading, that is, we show the speedup of i-TR relative to JamVM direct threading and selective inlining relative to SableVM direct threading.

Overall, Figure 7.6 shows that i-TR and SableVM perform almost the same, with i-TR about 3% faster than selective inlining. SableVM wins on programs with long basic blocks,

like mpeg and scitest, because selective inlining eliminates dispatch from long sequences of simple virtual instructions. However, i-TR wins on shorter block programs like javac and jess by improving branch prediction. Nevertheless, Figure 7.7 shows that selective inlining results in a 2% larger speedup over direct threading for SableVM than i-TR. Both techniques result in very similar overall effects even though i-TR is focused on improving virtual branch performance and selective inlining on eliminating dispatch within basic blocks.

Subroutine threading again emerges as a very effective interpretation technique, especially given its simplicity. SUB runs only 6% more slowly than i-TR and SableVM.

The fact that i-TR runs exactly the same runtime profiling instrumentation as TR-JIT makes it qualitatively a very different system than SUB or SableVM. SUB and SableVM are both tuned interpreters that generate a small amount of code at load time to optimize dispatch.

Neither includes any profiling infrastructure. In contrast to this, i-TR runs all the infrastructure needed to identify hot traces at run time. As we shall see in Section 7.5, the improved virtual branch performance of interpreted traces has made it possible to build a profiling system that runs faster than most interpreters.

JIT Compiled traces The rightmost bar in each cluster of Figure 7.4 shows the performance of our best-performing version of Yeti (TR-JIT). Comparing geometric means, we see that

TR-JIT is roughly 24% faster than interpreted traces. Despite supporting only 50 integer and object virtual instructions, our trace JIT improves the performance of integer programs such as compress significantly. With our most ambitious optimization, virtual method invocation,

TR-JIT improved the performance of raytrace by about 35% over i-TR. Raytrace is written in an object-oriented style with many small methods invoked to access object fields.

Hence, even though it is a floating-point benchmark, it is greatly improved by devirtualizing and inlining these accessor methods.

Figure 7.8 compares the performance of TR-JIT to Sun Microsystems’ Java HotSpot JIT.

Our current JIT runs the SPECjvm98 benchmarks 4.3 times slower than HotSpot.

Figure 7.8: Elapsed time performance of Yeti with JIT compared to Sun Java 1.05.0_6_64 relative to JamVM-1.3.3 (direct threading) running SPECjvm98 benchmarks.

Results range from 1.5 times slower for db, to 12 times slower for mtrt. Not surprisingly, we do worse on

floating-point intensive benchmarks since we do not yet compile the float bytecodes.

7.4 Early Pentium Results

As illustrated earlier, in Figure 3.4, Intel's Pentium architecture takes a different approach to indirect branches and calls than does the PowerPC. On the PowerPC, we have shown that the two-part indirect call used in Yeti's dispatch loops performs well. However, the Pentium relies on its BTB to predict the destination of its indirect call instruction. As we saw in Chapter 5, when the prediction is wrong, many stall cycles may result. Conceivably, on the Pentium, the unpredictability of the dispatch loop indirect call could lead to very poor performance.

Gennady Pekhimenko, a fellow graduate student at the University of Toronto, ported i-TR to the Pentium platform. Figure 7.9 gives the performance of his prototype. The results are roughly comparable to our PowerPC results, though i-TR outperforms direct threading a little less on the Pentium. The average test case ran in 83% of the time taken by direct threading whereas it needed 75% on the PowerPC.

7.5 Identification of Stall Cycles

We have shown that Yeti performs well compared to existing interpreter techniques. However, much of our design is motivated by micro-architectural considerations. In this section, we use a new set of tools to measure the stall cycles experienced by Yeti as it runs.

The purpose of this analysis is twofold. First, we would like to confirm that we understand why Yeti performs well. Second, we would like to discover any source of stalls we did not anticipate, and perhaps find some guidance on how we could do better.

Figure 7.9: Performance of Gennady Pekhimenko's Pentium port relative to unmodified JamVM-1.3.3 (direct-threaded) running the SPECjvm98 benchmarks.

category name   Description
i-cache         Instruction cache misses
br_misp         Branch mispredictions
compl           Completed instructions (cycles in which an instruction did complete)
other_stall     Miscellaneous stalls
fxu             Fixed point execution unit
fpu             Floating point execution unit
d-cache         Data cache
basic_lsu       Basic load and store unit stalls

Table 7.3: GPUL categories

7.5.1 Identifying Causes of Stall Cycles

Azimi et al. [6] describe a system that uses a statistical heuristic to attribute stall cycles in a

PowerPC 970 processor. They define a stall cycle as a cycle for which there is no instruction that can be completed. Practically speaking, on a PowerPC 970, this occurs when the processor's completion queue is empty because instructions are held up, or stalled. Their approach, implemented for a PPC970 processor running K42, a research operating system [18], exploits performance monitoring hardware in the PowerPC that recognizes when the processor's instruction completion queue is empty. Then, the next time an instruction does complete, they attribute, heuristically and imperfectly, all the intervening stall cycles to the functional unit of the completed instruction. Azimi et al. show statistically that their heuristic estimates the true causes of stall cycles well.

The Linux port runs only on a PowerPC 970FX processor2. This is slightly different from the PowerPC 970 processor we have been using up to this point. The only acceptable machine we have access to is an Apple Xserve system, which was also slightly faster than our machine, running at 2.3 GHz rather than 2.0 GHz.

2We suspect that the actual requirement is the interrupt controller that Apple packages in newer systems.

7.5.2 Stall Cycle Results

Figure 7.10 shows the breakdown of stall cycles for various runs of the SPECjvm98 benchmarks as measured by the tools of Azimi et al. Five bars appear for each benchmark. From left to right, the stacked bars represent subroutine-threaded JamVM 1.3.3 (SUB), JamVM 1.3.3 (direct-threaded as distributed, hence DISTRO) and three configurations of Yeti: i-TR-nolink, i-TR and TR-JIT. The y-axis, like many of our performance graphs, reports performance relative to JamVM. The height of the DISTRO bar is thus 1.0 by definition. Figure 7.11 reports the same data as Figure 7.10, but, in order to facilitate pointing out specific trends, zooms in on four specific benchmarks.

Each histogram column is split vertically into a stack of bars which illustrates how executed cycles break down by category. Only cycles listed as “compl” represent cycles in which an instruction completed. All the other categories represent stalls, or cycles in which the processor was unable to complete an instruction. The “other_stall” category represents stalls to which the tool was not able to attribute a cause. Unfortunately, the other_stall category includes a source of stalls that is important to our discussion, namely the stalls caused by the data dependency between the two instructions of the PowerPC architecture's two-part indirect branch mechanism3.

See Figure 3.4 for an illustration of two-part branches.

The total cycles executed by each benchmark do not correlate perfectly with the elapsed time measurements reported earlier in this chapter.

For instance, in Figure 7.4, i-TR runs scitest in 60% of the time of direct threading, whereas in Figure 7.11(c) it takes 80%. There are a few important differences between the runs, namely the differences between the PowerPC 970FX and PowerPC 970, the different clock speed (2.3

GHz vs 2.0 GHz) and differences between Linux (with modifications made by Azimi et al.) and

3In earlier models of the PowerPC, for instance the 7410, these cycles were called “LR/CTR stall cycles”, as reported in Figure 5.1(b).

Figure 7.10: Cycles relative to JamVM-1.3.3 (direct threading) running SPECjvm98 benchmarks. Each bar is broken down into the stall categories of Table 7.3 (i-cache, br_misp, other_stall, fxu, fpu, d-cache, basic_lsu, compl).

Figure 7.11: Stall breakdown for SPECjvm98 benchmarks relative to JamVM-1.3.3 (direct threading). Panels: (mpeg) – long int blocks; (jess) – short blocks; (scitest) – long float blocks; (javac) – trace cache bloat.

OSX 10.4. We use the data qualitatively to characterize pipeline hazards and not to measure absolute performance.

7.5.3 Trends

Several interesting trends emerge from our examination of the cycle reports.

1. Interpreted traces reduce branch mispredictions caused by virtual branch instructions.

2. Simple code we generated for interpreted trace exits stresses the fixed-point execution

unit (fxu).

3. Our JIT (TR-JIT) does little to reduce lsu stalls, which is a surprise since many loads and

stores to the expression stack are eliminated by the register allocator.

4. As we reduce pipeline hazards caused by dispatch, new kinds of stalls arise.

5. Trace bloat, as we observed for javac, can lead to significant stalls due to instruction

cache misses.

Each of these issues will be discussed in turn.

Branch misprediction

In Figure 7.11(mpeg) we see how our techniques affect mpeg, which has a few very hot, very long basic blocks. The blocks contain many duplicate virtual instructions. Hence, direct threading encounters difficulty due to the context problem, as discussed in Section 3.4. (This is plainly evident in the solid red br_misp stack on the DISTRO bar in all four subfigures.)

SUB significantly reduces the mispredictions that occur running mpeg – presumably the ones caused by linear regions. Yeti's i-TR technique effectively eliminates the branch mispredictions for mpeg altogether. Both techniques also reduce other_stall cycles relative to direct threading. These are probably being caused by the PowerPC's two-part indirect branches which are used by DISTRO to dispatch all virtual instructions and by SUB to dispatch virtual branches. SUB eliminates the delays for straight-line code and i-TR further eliminates the stalls for virtual branches. Figures 7.11(javac) and 7.11(jess) show that traces cannot predict all branches and some stalls due to branch mispredictions remain for i-TR and TR-JIT.

Overhead of Interpreted Trace Exits

In all four subfigures of Figure 7.11 we see that fxu stalls decrease or stay the same relative to

DISTRO for SUB whereas for i-TR they increase. Note also that the fxu stalls decrease again for the TR-JIT condition. This suggests that the fxu stalls are not caused by the overhead of profiling (since TR-JIT runs exactly the same instrumentation as i-TR). Rather, they are caused by the overhead of the simple-minded trace exit code we generate for interpreted traces.

Recall that interpreted traces generate a compare immediate of the vPC followed by a conditional branch. The comparand is the destination vPC, a 32 bit number. On a PowerPC, there is no form of the compare immediate instruction that takes a 32 bit immediate parameter.

Thus, we generate two fixed point load immediate instructions to load the immediate argument into a register. Presumably it is these fixed point instructions that are causing the extra stalls.
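For concreteness, a trace exit of this form might look like the following PowerPC fragment. This is a hedged sketch: the register numbers, the example vPC constant, and the stub label are assumptions, not the exact code Yeti emits.

    # Two fixed-point immediate instructions build the 32-bit destination vPC,
    # then a register-register compare and a conditional branch form the exit.
    lis   r12, 0x0040            # high 16 bits of the expected destination vPC (example)
    ori   r12, r12, 0x12a8       # low 16 bits (example)
    cmpw  cr0, r27, r12          # compare against the interpreter's vPC (assumed in r27)
    bne-  trace_exit_stub        # taken only when execution leaves the trace

When execution stays on trace the bne- simply falls through, which is the easily predicted not-taken branch described above, while the two extra fixed-point instructions plausibly account for the additional fxu pressure.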

TR-JIT and the Expression Stack

Yeti’s compiler works hard to eliminate loads and stores to and from Java’s expression stack.

In Figure 7.11(mpeg), TR-JIT makes a large improvement over i-TR by reducing the number of completed instructions. However, it was surprising to learn that basic_lsu stalls were in fact not much affected. (This pattern holds across all the other subfigures also.) Presumably the pops from the expression stack hit the matching pushes in the PPC970's store pending queue and hence were not stalling in the first place.

Exposing stalls from workload

In Figure 7.11(scitest) we see an increase in stalls due to the FPU for SUB and i-TR. Our infrastructure makes no use of the FPU at all – so presumably this happens because stalls waiting on the real work of the application are exposed as we eliminate other pipeline hazards.

This effect makes it hard to draw conclusions about any increase in stalls that occurs, for instance the increase in fxu stalls caused by i-TR described in the previous section, because it might also be caused by the application.

Trace bloat

The javac compiler is a big benchmark. The growth of the blue hatched (i-cache) bars at the top of Figure 7.11(javac) shows how i-TR and TR-JIT make instruction cache misses significantly worse. Even SUB, which only generates one additional 4 byte call per virtual instruction, increases i-cache misses.

In the figure, i-TR stalls on instruction cache as much as direct threading stalls on mispredicted branches.

As we pointed out in Section 7.2, Dynamo's trace selection heuristic does not work well for javac, selecting traces representing four times as many virtual instructions as appear in all the loaded methods. This happens when many long but slightly different paths through a body of code are hot. Part of the problem is that the probability of reaching the end of a long trace under these conditions is very low. As trace exits become hot, more traces are generated and replicate even more code. As more traces are generated, the trace cache grows huge.

Figure 7.11(javac) shows that simply setting aside a large trace cache is not a good solu- tion. The replicated code in the traces makes the working set of the program larger than the instruction cache can hold.

Our system does not, at the moment, implement any mechanism for reclaiming memory that has been assigned to a region body. An obvious candidate would be reactive flushing (see Section 2.5), which occasionally flushes the trace cache entirely. This may result in better locality of reference after the traces are regenerated anew. Counter-intuitively, reducing the

size of the trace cache and implementing a very simple trace cache flushing heuristic may lead

to better instruction cache behavior than setting aside a large trace cache.
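A minimal sketch of what such a reactive policy might look like, assuming a bump-allocated code cache; the cache size, the allocator, and flush_all_traces are hypothetical and not part of Yeti:

    /* Reactive flushing sketch (hypothetical; Yeti does not implement this).
     * When a new trace would overflow a deliberately small code cache, discard
     * every trace and let the hot ones be regenerated, ideally with better
     * locality of reference. */
    #include <stddef.h>

    #define CODE_CACHE_BYTES (256 * 1024)         /* assumed, intentionally small */

    extern void *code_cache_alloc(size_t bytes);  /* hypothetical bump allocator   */
    extern void  flush_all_traces(void);          /* hypothetical: unlink and free */

    static size_t cache_used;

    void *allocate_trace(size_t bytes) {
        if (cache_used + bytes > CODE_CACHE_BYTES) {
            flush_all_traces();
            cache_used = 0;
        }
        cache_used += bytes;
        return code_cache_alloc(bytes);
    }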

Hiniker et al. [41] have suggested several changes to the trace selection heuristic that im-

prove locality and reduce replication between traces.

7.6 Chapter Summary

We have shown that traces, and especially linked traces, are an effective shape for region bod-

ies. The trace selection heuristic of the HP Dynamo project, described in Section 2.5,

results in execution from the code cache for an average of 2000 virtual instructions between

dispatches. This reduces the overhead of region body dispatch to a negligible level. The amount

of code cache memory required to achieve this seems to vary widely by program, from a very

parsimonious 24% of the virtual instructions in the loaded methods for scitest to a rather bloated 380% for javac.

We have measured the performance of four stages in the evolution of Yeti: DCT, LB, i-

TR, and TR-JIT. Performance has steadily improved as larger region bodies are identified and translated. Traces have proven to be an effective shape for region bodies for two main reasons.

First, interpreted traces offer a simple and efficient way to dispatch both straight-line code and virtual branch instructions. Second, compiling traces is straightforward – in part because the JIT can fall back on our callable virtual instruction bodies, but also because traces contain no merge points, which simplifies code generation.

Interpreted traces are a fast way of interpreting a program, despite the fact that runtime profiling is required to identify traces. We show that interpreted traces are just as fast as inline-threading, SableVM's implementation of selective inlining. Selective inlining eliminates dispatch within basic blocks at runtime (Section 3.6) whereas interpreted traces eliminate branch mispredictions caused by the dispatch of virtual branch instructions. This suggests that a hybrid, namely inlining bodies into interpreted traces (reminiscent of our TINY inlining heuristic

of Section 5.3) may be an interesting intermediate technique between interpreted traces and the

trace-based JIT.

Yeti provides a design trajectory by which a high level language virtual machine can be

extended from a simple interpreter to a sophisticated mixed-mode virtual machine with a trace-based JIT compiler. Our strategy is based on two key assumptions. First, that stepping back to

a relatively slow dispatch technique, direct call threading (DCT), is worthwhile. Second, that

identifying dynamic regions of the program at runtime, traces, should be done early in the life

of a system because it enables high performance interpretation.

In this chapter we have shown that both these assumptions are reasonable. Our implementation of DCT performs no worse than switch threading, commonly used in production, and the combination of trace profiling and interpreted traces is competitive with high-performance interpreter optimizations. This is in contrast to context threading, selective inlining, and other dispatch optimizations, which perform about the same as interpreted traces but do nothing to facilitate the development of a JIT compiler.

A significant remaining challenge is how best to implement callable virtual instruction bodies. The approach we follow, as illustrated by Figure 4.2, is efficient but depends on C language extensions and hides the true control flow of the interpreter from the compiler that builds it. A possible solution to this will be touched upon in Chapter 8.

The cycle level performance counter infrastructure provided by Azimi et al. has enabled us to learn why our technique does well. As expected, we find that traces make it easier for the branch prediction hardware to do its job, and thus stalls due to branch mispredictions are markedly reduced. To be sure, some paths are still hard to predict and traces do not eliminate all mispredicted branches. We find that the extra path length of interpreted trace exits does matter but, on balance, it reduces stall cycles from mispredicted branches more than enough to improve performance overall.

Yeti is early in its evolution at this point. Given the robust performance increases we obtained compiling the first 50 integer instructions, we believe much more performance can be easily obtained just by compiling more kinds of virtual instructions. For instance, floating point multiplication (FMUL or DMUL) appears amongst the most frequently executed virtual instructions in four benchmarks (scitest, ray, mpeg and mtrt). We expect that our gradual approach will allow these virtual instructions to be compiled next with commensurate performance gains.

Chapter 8

Conclusions and Future Work

Interpreters play an important role in the implementation of computer languages. Initially, language implementors need a language VM to be simple and flexible in order to support the evolution of their language. Later, as their language increases in popularity, performance may become more of a concern.

Today, commonly implemented interpreter designs do not anticipate the need for more performance, and just in time (JIT) compiler designs, though capable of very high performance, require a great deal of up-front development. These factors conspire to prevent, or at least delay, important language implementations from improving performance by deploying a JIT.

In this dissertation we have responded to this challenge by describing a design for a language

VM that explicitly maps out a trajectory of staged deployments, providing gradually increasing performance as development effort is invested.

8.1 Conclusions and Lessons Learned

Our approach is different from most interpreter designs because we intentionally start out running a simple dispatch mechanism, direct call threading (DCT). DCT is an appropriate choice not because it is particularly fast – it runs about the same speed as a regular switch threaded interpreter – but because it is the simplest way to dispatch callable virtual instruction bodies and because it is easy to augment with profiling. This makes the early versions of a language

VM simple to deploy.

To gain performance in later releases, the DCT interpreter can be extended by inserting profiling into the dispatch loop and identifying interpreted traces. When more performance is required, interpreted traces can be enhanced by JIT compiling a subset of virtual instructions.
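As a rough illustration of how little is involved, a DCT dispatch loop with profiling hooks might look like the sketch below; the table layout and hook names are assumptions rather than Yeti's actual interfaces.

    /* Sketch of a DCT dispatch loop instrumented for profiling (illustrative;
     * names are assumptions).  Each virtual instruction body is a callable
     * routine; the hooks give the profiler a chance to record block and trace
     * boundaries before and after every dispatch. */
    typedef void (*body_t)(void);

    extern unsigned int vPC;              /* virtual program counter                 */
    extern unsigned int program[];        /* loaded virtual program, opcode per vPC  */
    extern body_t       body_table[];     /* one callable body per virtual opcode    */

    extern void profile_before_dispatch(unsigned int vpc, unsigned int opcode);
    extern void profile_after_dispatch(unsigned int vpc);

    void profiling_dispatch_loop(void) {
        for (;;) {
            unsigned int opcode = program[vPC];
            profile_before_dispatch(vPC, opcode);
            body_table[opcode]();         /* body advances vPC as a side effect */
            profile_after_dispatch(vPC);
        }
    }

A second, uninstrumented loop of the same shape makes it easy to turn the profiling on and off.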

Our approach is motivated by a few observations:

1. We realized that callable bodies can be very efficiently dispatched by the old technique of

subroutine threading now that processors commonly implement return branch predictors.

This is effective for straight-line sections of the virtual program.

2. We realized that although the overhead of a dispatch loop is high for dispatching single

virtual instruction bodies, it may be perfectly reasonable for dispatching callable region

bodies generated from dozens or hundreds of virtual instructions. The basic idea behind

Yeti’s extensibility is that development effort should be invested in identifying and com-

piling larger and more complex regions of the virtual program which are then dispatched

from a profiled dispatch loop.

3. Optimizing the dispatch of virtual branch instructions, for instance by selective inlining,

is typically carried out by an interpreter when a method is loaded. Instead, we identify

traces at run time using profiling instrumentation called from the dispatch loop. Hot

traces predict paths through the virtual program which we exploit to generate simple

trace exit code in otherwise subroutine-threaded interpreted traces.

4. When even better performance is needed, we show how a trace-based JIT can be built

to eliminate dispatch and replace the expression stack with register-to-register compiled

code. The novel aspect of our JIT is that it exploits the fact that Yeti’s virtual instruction

bodies are callable. Unsupported virtual instructions, or difficult compiler corner cases

can be side-stepped by dispatching virtual instruction bodies instead. This allows support

for virtual instructions to be added one at a time. The importance of the latter point is

hard to quantify, but seemed to reduce the difficulty of debugging the back end of the

compiler significantly.

Most of the elements of our approach are plausible as soon as it has been proved that callable bodies can be efficiently dispatched. However, actual performance improvements depend on a subtle trade-off between the overhead of runtime profiling and the reduction of stalls caused by branch mispredictions. The only way to determine that our ideas were viable was to build a fairly complete prototype. We chose to build a prototype in Java because there are commonly accepted benchmark programs to measure and many high quality implementations to compare ourselves to.

In the process we learned a number of interesting things:

1. Calling virtual instruction bodies can be very efficient on modern CPUs. Our implemen-

tation of subroutine threading (SUB) is very simple and eliminates most of the branch

mispredictions caused by switch or direct threading, namely those caused by dispatch-

ing straight-line code. SUB outperforms direct threading by about 20%. However, SUB

does not address mispredictions caused by dispatching virtual branch instructions. Also,

it is difficult to interpose runtime instrumentation into subroutine threaded execution.

2. Direct Call Threading (DCT) is simpler than SUB, but much slower, running about 40%

slower than direct threading. This, however, is not worse than switch, which is widely

implemented by heavily used languages like Python and JavaScript. DCT is very easy

to augment with profiling, since instrumentation can simply be called from the dispatch

loop before and after dispatching each body. Furthermore, by providing multiple dis-

patch loops it is easy to turn instrumentation on and off.

3. Branch inlining, our initial approach to improving the virtual branch performance of

SUB, is labor intensive and non-portable. It improves the performance of subroutine

threading by about 5%.

4. Interpreted traces are a powerful interpretation technique. They perform well, as fast

as SableVM’s inline-threading, running Java benchmarks about 25% faster than direct

threading on a PowerPC 970. This performance includes the cost of the runtime profiling

to identify traces. A system running interpreted traces has already implemented the

infrastructure to identify hot regions of a running program, an essential ingredient of a

JIT. This makes interpreted traces a good strategic option for language virtual machines

that may eventually need to be extended with a JIT.

5. Our trace compiler was easy to build, and we attribute this primarily to two factors.

First, traces contain no merge points, so it is easy to track where expression temporary

values are on the expression stack and assign them to registers. Second, callable virtual

instruction bodies enabled us to add compiler support for virtual instructions one at a

time. By compiling about 50 integer virtual instructions in this way the performance of

Yeti was increased to about double the performance of direct threading.

The primary weakness of our prototype is the specific mechanism we used to implement callable virtual instruction bodies. Our approach, as illustrated by Figure 4.2, hides the return branch from the compiler. This means that the optimizer does not properly understand the control flow graph of the interpreter. The workaround, suitable only for a prototype, is to

“fake” the missing control flow by adding computed goto statements that are never executed immediately following each inline return instruction. Nested functions, a relatively commonly implemented extension to C, are a promising alternative that will be discussed in the next section.

8.2 Future work

Substantial additional performance gains are no doubt possible by extending our trace-based

JIT to handle more types of instructions (such as the floating point bytecodes) and by applying classical optimizations such as common subexpression elimination. Improving the performance of compiled code by applying classical optimizations is relatively well understood.

Hence, on its own, such an effort seems to have relatively little to contribute to research. Moreover, it would require significant engineering work and likely could only be undertaken by a well-funded project.

We will discuss four avenues for further research. First, a way to package virtual instruction bodies as nested functions. Second, how the approach we describe in Section 6.4.3 to optimize virtual method invocation could be adapted for runtime typed languages. Third, we comment on how new shapes of region bodies could be derived from linked traces. Fourth, we describe our vision of how our design could be used by the implementors of a new language.

8.2.1 Virtual instruction bodies as nested functions

A better option for implementing callable virtual instruction bodies might be to define them as nested functions. Nested functions are a common extension to C, implemented by gcc and other C compilers, that allow one function to be declared within another. The idea is that each virtual instruction body is declared as a separate nested function, with all bodies nested within the main interpreter function. Important interpreter variables, like the vPC, are defined, as currently, as local variables in the main interpreter function but can be used from the nested function implementing each virtual instruction body as well.

The approach is elegant, since functions are a natural way to express virtual instruction bodies, and it is also well supported by the tool chain, including the debugger. However, our first attempts in this direction did not perform well. In short, when a nested function is called via a function pointer, as from our DCT dispatch loop, gcc adds an extra level of indirection and calls the nested function via a runtime-generated trampoline. As a result, the DCT dispatch loop runs very slowly.
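A sketch of the idea, using the gcc extension; the opcodes and program layout here are invented for illustration and are not Yeti's or OCaml's actual definitions.

    /* Virtual instruction bodies as gcc nested functions (illustrative).
     * Each body is nested inside the interpreter function so it can touch
     * the interpreter's local state (vPC, the expression stack) directly. */
    void interpret(unsigned int *program) {
        unsigned int vPC = 0;
        long stack[64];
        int  sp = -1;

        void iconst_body(void) {              /* opcode 0: push an integer constant */
            stack[++sp] = (long) program[vPC + 1];
            vPC += 2;
        }
        void iadd_body(void) {                /* opcode 1: add the top two values */
            stack[sp - 1] += stack[sp];
            sp--;
            vPC += 1;
        }

        /* Taking the address of a nested function is what makes gcc build a
         * runtime trampoline, the source of the slowdown described above. */
        void (*bodies[])(void) = { iconst_body, iadd_body };

        while (program[vPC] != 2)             /* opcode 2: halt (assumed) */
            bodies[program[vPC]]();           /* DCT-style dispatch loop */
    }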

We investigated the possible performance of nested functions by hand-modifying the assembler generated by gcc to short-circuit the trampoline. In this way, we created a one-off version of OCaml that declares each virtual instruction body in its own nested function and runs a simple DCT dispatch loop like the one illustrated by Figure 3.2. On the PowerPC this

DCT interpreter runs the same OCaml benchmarks used in Chapter 5 about 22% more slowly than switch threading.

Further improvements to nested function performance should be investigated, possibly including modifications to gcc to create a variant of nested functions more suitable for implementing virtual instruction bodies.

8.2.2 Extension to Runtime Typed Languages

An exciting possibility is to create new speculative dynamic optimizations based on the runtime profile data collected while training a trace (see Section 6.2.3). The basic realization is that a mechanism very similar to a trace exit can be used to guard almost any speculative optimization. As a specific example we consider the optimization of arithmetic operations in a runtime typed language.

A runtime typed language is a language that does not force the user to declare the types of variables but instead discovers types at run time. A typical implementation compiles expressions to sequences of virtual instructions that are not type specific. For instance, in Tcl or Python the virtual body for addition will work for integers, floating point numbers or even strings. Performance tends to be poor as each virtual instruction body must check the type of each input before actually calculating its result.

We believe the same profiling infrastructure that we use to optimize callsites in Java (Section 6.4.3) could be used to improve arithmetic bytecodes in a runtime typed language. Whereas the destination of a Java method invocation depends only upon the type of the invoked-upon object, the operation carried out by a polymorphic virtual instruction may depend on the type of each input. For instance, suppose that a specific instance of the addition instruction in Tcl, Python or JavaScript has integer type. (We would know this if its inputs were observed to be integers during trace training.) We could generate one or more trace exits, or guards, to ensure that the inputs are actually integers. Following the guards we could generate specialized integer code, or dispatch a version of the addition virtual instruction body specialized for integers.
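A minimal sketch of such a guarded specialization, with an invented value representation and helper names (nothing here is Yeti's or any particular language's actual runtime):

    /* Type-guarded specialized addition for a runtime typed language
     * (illustrative).  The guard plays the same role as a trace exit: if the
     * speculation fails, control leaves the optimized code. */
    typedef enum { TY_INT, TY_FLOAT, TY_STRING } vtype_t;
    typedef struct { vtype_t type; long ival; } value_t;

    extern value_t stack[];
    extern int     sp;
    extern void    trace_exit(int exit_id);    /* hypothetical side exit */

    void guarded_int_add(void) {
        if (stack[sp].type != TY_INT || stack[sp - 1].type != TY_INT) {
            trace_exit(7);                     /* guard failed: fall back off the fast path */
            return;
        }
        stack[sp - 1].ival += stack[sp].ival;  /* specialized code, no per-operand type checks */
        sp--;
    }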

8.2.3 New shapes of region body

Just as basic blocks are collected into traces, so traces could be collected into yet larger regions for optimization. An obvious possibility would be to identify loop nests amongst the linked traces, and use these as a higher level unit of compilation.

The data recorded by our trace region payload structures already includes the information necessary to build a flow graph of the program in the code cache. It remains to adapt classical

flow graph algorithms to detect nested loops and create a strategy for compiling the resulting code.

There seems to be little point, however, in detecting loop nests without any capability of optimizing them. Thus, this extension of our work would only make sense for a system that plans to build an optimizer.

8.2.4 Vision for new language implementation

Our vision for a new language implementation would be to start by building a direct call threaded interpreter. Until the issues with nested functions have been dealt with, the virtual bodies would have to be packaged as we described in Chapter 6. The level of performance would be roughly the same as a switch-threaded interpreter.

Then, as more performance is called for, we would add linear blocks, interpreted traces, and trace linking. It would be natural to make these extensions in separate releases of our implementation. We believe that much of the runtime profiling infrastructure we built for

Yeti could be reused as is. Finally, when performance requirements demand it, a JIT compiler could be built. Like Yeti, the first implementation would compile only a subset of the virtual instructions, perhaps only the ones needed to address specific performance issues with a given application.

8.3 Summary

We have described a design trajectory by which a high level language virtual machine can be deployed in a sequence of stages, starting with a simple entry-level direct call threaded interpreter, followed by interpreted traces and finally a trace-based just in time compiler.

We have shown that it is beneficial to implement virtual instruction bodies as callable routines, both from the perspective of efficient interpretation and because it allows bodies to be reused by the JIT. We recognized that on modern computers subroutine threading is a very efficient way to dispatch straight-line sequences of virtual instructions. For branches we introduce a new technique, interpreted traces. Our technique exploits the power of traces to predict branch destinations and hence reduce mispredictions caused by the dispatch of virtual branches. Interpreted traces are a state-of-the-art technique, running about 25% faster than direct threading. This is about the same speedup as achieved by inline-threading, SableVM's implementation of selective inlining.

We show how interpreted traces can be gradually enhanced with a trace-based JIT compiler. An attractive property of our approach is that compiler support can be added one virtual instruction at a time. Our trace-based JIT currently compiles about 50 integer virtual instructions, running about 30% faster than interpreted traces, or about double the performance of direct threading.

Our hope is this work will enable more language implementations to deploy better interpreters and JIT compilers and hence deliver better computer language performance to more users.

Bibliography

[1] OCaml. http://www.ocaml.org.

[2] The Java HotSpot virtual machine, v1.4.1, technical white paper. 2002.

[3] Eric Allman. A conversation with James Gosling. ACM Queue Magazine, 2(5), July/August 2004.

[4] Bowen Alpern, Dick Attanasio, John Barton, Michael Burke, Perry Cheng, Jong-Deok Choi, Anthony Cocchi, Stephen Fink, David Grove, Michael Hind, Susan Flynn Hummel, Derek Lieber, Vassily Litvinov, Ton Ngo, Mark Mergen, Vivek Sarkar, Mauricio Serrano, Janice Shepherd, Stephen Smith, VC Sreedhar, Harini Srinivasan, and John Whaley. The Jalapeno virtual machine. In IBM Systems Journals, Java Performance Issue, pages 211–238, 2000.

[5] Joel Auslander, Matthai Philipose, Craig Chambers, Susan J. Eggers, and Brian N. Bershad. Fast, effective dynamic compilation. In PLDI ’96: Proc. of the ACM SIGPLAN 1996 Conf. on Prog. Language Design and Impl., pages 149–159, 1996.

[6] Reza Azimi, Michael Stumm, and Robert W. Wisniewski. Online performance analysis by statistical sampling of microprocessor performance counters. In ICS ’05: Proceedings of the 19th annual international conference on Supercomputing, pages 101–110, New York, NY, USA, 2005. ACM Press.

[7] Vasanth Bala, Evelyn Duesterwald, and Sanjeev Banerjia. Transparent dynamic optimization: The design and implementation of Dynamo. Technical report, Hewlett Packard, 1999. 102 pages. Available from: http://www.hpl.hp.com/techreports/1999/HPL-1999-78.html.

[8] Vasanth Bala, Evelyn Duesterwald, and Sanjeev Banerjia. Dynamo: A transparent dynamic optimization system. In PLDI ’00: Proc. of the ACM SIGPLAN 2000 Conf. on Prog. Language Design and Impl., pages 1–12, Jun. 2000.

[9] Iris Baron. Dynamic optimization of interpreters using DynamoRIO. Master’s thesis, MIT, 2003. 112 pages.

[10] Marc Berndl, Benjamin Vitale, Mathew Zaleski, and Angela Demke Brown. Context threading: A flexible and efficient dispatch technique for virtual machine interpreters. In CGO-2005: Proc. of the 3rd Intl. Symp. on Code Generation and Optimization, pages 15–26, Mar. 2005.

[11] Derek Bruening and Evelyn Duesterwald. Exploring optimal compilation unit shapes for an embedded just-in-time compiler. In FDDO-3: Proc. of the 3rd ACM Workshop on Feedback-Directed and Dynamic Optimization, pages 13–20, Dec. 2000.

[12] Derek Bruening, Evelyn Duesterwald, and Saman Amarasinghe. Design and implementation of a dynamic optimization framework for Windows. In FDDO-4: Proc. of the 4th ACM Workshop on Feedback-Directed and Dynamic Optimization, Dec. 2000.

[13] Derek Bruening, Timothy Garnett, and Saman Amarasinghe. An infrastructure for adaptive dynamic optimization. In CGO-2003: Proc. of the 1st Intl. Symp. on Code Generation and Optimization, pages 265–275, Mar. 2003.

[14] Emmanuel Chailloux, Pascal Manoury, and Bruno Pagano. Developing Applications With Objective Caml. O’Reilly France, 2000.

[15] Craig Chambers. The Design and Implementation of the Self Compiler, an Optimizing Compiler for Object-Oriented Programming Languages. PhD thesis, Stanford University, 1988. 227 pages.

[16] Wen-Ke Chen, Sorin Lerner, Ronnie Chaiken, and David Gillies. Mojo: A dynamic optimization system. In FDDO-3: Proc. of the 3rd ACM Workshop on Feedback-Directed and Dynamic Optimization, Dec. 2000.

[17] Randy Clark and Stephen Koehler. The UCSD Pascal Handbook. Prentice-Hall, 1982.

[18] IBM Corporation. K42 research operating system [online]. 2006. Available from: http://www.research.ibm.com/k42.

[19] Timothy Cramer, Richard Friedman, Terrence Miller, David Seberger, Robert Wilson, and Mario Wolczko. Compiling Java just in time. IEEE Micro, 17(3):36–43, 1997.

[20] Charles Curley. Life in the FastForth lane. Forth Dimensions, 14(4):6–12, January-February 1993.

[21] Charles Curley. Optimizing in a BSR/JSR threaded forth. Forth Dimensions, 14(6):21–27, March-April 1993.

[22] Ron Cytron, Jean Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, 13(4):451–490, 1991.

[23] James C. Dehnert, Brian K. Grant, John P. Banning, Richard Johnson, Thomas Kistler, Alexander Klaiber, and Jim Mattson. The Transmeta code morphing software: Using speculation, recovery, and adaptive retranslation to address real-life challenges. In CGO-2003: Proc. of the 1st Intl. Symp. on Code Generation and Optimization, pages 15–24, Mar. 2003.

[24] Peter L. Deutsch and A. M. Schiffman. Efficient implementation of the Smalltalk-80 system. In Conference Record of the Eleventh Annual ACM Symposium on Principles of Programming Languages, pages 297–302, Salt Lake City, Utah, Jan. 1984.

[25] Karel Driesen. Efficient Polymorphic Calls. Kluwer Academic Publishers, 2001.

[26] Evelyn Duesterwald and Vasanth Bala. Software profiling for hot path prediction: less is more. In ASPLOS-IX: Proceedings of the ninth international conference on Architectural support for programming languages and operating systems, pages 202–211, New York, NY, USA, 2000. ACM.

[27] M. Anton Ertl. Stack caching for interpreters. In PLDI ’95: Proc. of the ACM SIGPLAN 1995 Conf. on Prog. Language Design and Impl., pages 315–327, June 1995.

[28] M. Anton Ertl and David Gregg. The behavior of efficient virtual machine interpreters on modern architectures. In Proceedings of the 7th International Euro-Par Conference, pages 403–412, 2001.

[29] M. Anton Ertl and David Gregg. Optimizing indirect branch prediction accuracy in virtual machine interpreters. In PLDI ’03: Proc. of the ACM SIGPLAN 2003 Conf. on Prog. Language Design and Impl., pages 278–288, June 2003.

[30] M. Anton Ertl, David Gregg, Andreas Krall, and Bernd Paysan. VMgen — a generator of efficient virtual machine interpreters. Software Practice and Experience, 32:265–294, 2002.

[31] S. Fink and F. Qian. Design, implementation, and evaluation of adaptive recompilation with on-stack replacement. In CGO-2003: Proc. of the 1st Intl. Symp. on Code Generation and Optimization, March 2003.

[32] Etienne Gagnon and Laurie Hendren. Effective inline threading of Java bytecode using preparation sequences. In Proc. of the 12th Intl. Conf. on Compiler Construction, volume 2622 of Lecture Notes in Computer Science, pages 170–184. Springer, Apr. 2003.

[33] Andreas Gal, Christian W. Probst, and Michael Franz. HotpathVM: an effective JIT compiler for resource-constrained devices. In VEE ’06: Proc. of the 2nd international conference on Virtual execution environments, pages 144–153, 2006.

[34] Stephen Gilmore. Programming in standard ML ’97: A tutorial introduction. 1997. Available from: http://www.dcs.ed.ac.uk/home/stg.

[35] A. Goldberg. Smalltalk-80: The Interactive Programming Environment. Addison-Wesley, 1984.

[36] Adele Goldberg and David Robson. Smalltalk-80: The Language and its Implementation. Addison-Wesley, 1983.

[37] Brian Grant, Markus Mock, Matthai Philipose, Craig Chambers, and Susan J. Eggers. DyC: an expressive annotation-directed dynamic compiler for C. Theoretical Computer Science, 248(1–2):147–199, 2000.

[38] Brian Grant, Matthai Philipose, Markus Mock, Craig Chambers, and Susan J. Eggers. An evaluation of staged run-time optimizations in DyC. In PLDI ’99: Proc. of the ACM SIGPLAN 1999 Conf. on Prog. Language Design and Impl., pages 293–304, 1999.

[39] David Grove and Craig Chambers. A framework for call graph construction algorithms. ACM Transactions on Programming Languages and Systems, 23(6):685–746, 2001.

[40] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 1990.

[41] David Hiniker, Kim Hazelwood, and Michael D. Smith. Improving region selection in dynamic optimization systems. In Proc. of the 38th Intl. Symp. on Microarchitecture, pages 141–154, Nov. 2005.

[42] Glenn Hinton, Dave Sagar, Mike Upton, Darrell Boggs, Doug Carmean, Alan Kyker, and Patrice Roussel. The microarchitecture of the Pentium 4 processor. Intel Technology Journal, 5, 2001. Available from: http://www.intel.com/technology/itj/q12001.htm.

[43] Urs Hölzle. Adaptive Optimization for Self: Reconciling High Performance with Exploratory Programming. PhD thesis, Stanford University, 1994. 181 pages.

[44] Urs Hölzle, C. Chambers, and D. Ungar. Debugging optimized code with dynamic deoptimization. In PLDI ’92: Proc. of the ACM SIGPLAN 1992 Conf. on Prog. Language Design and Impl., pages 32–43, 1992.

[45] Urs Hölzle and David Ungar. A third-generation Self implementation: Reconciling responsiveness with performance. In OOPSLA ’94: Proc. of ACM SIGPLAN Conf. on Object-Oriented Programming, Systems, Languages, and Applications, 1994.

[46] IBM Corporation. IBM PowerPC 970FX RISC Microprocessor, version 1.6. 2005.

[47] Intel Corporation. IA-32 Intel Architecture Software Developer’s Manual Volume 3: System Programming Guide. 2004.

[48] Ronald L. Johnston. The dynamic incremental compiler of APL 3000. In Proceedings of the international conference on APL: part 1, pages 82–87, 1979.

[49] Ken Thompson. Regular expression search algorithm. Comm. of the ACM, 11(6):419–422, June 1968.

[50] Peter M. Kogge. An architectural trail to threaded-code systems. IEEE Computer, 15(3):22–32, March 1982.

[51] Peter Lee and Mark Leone. Optimizing ML with run-time code generation. In PLDI

’96: Proc. of the ACM SIGPLAN 1996 Conf. on Prog. Language Design and Impl., pages

137–148, 1996.

[52] Tim Lindholm and Frank Yellin. The Java Virtual Machine Specification. Addison-

Wesley, 1996.

[53] Robert Lougher. JamVM [online]. Available from: http://jamvm.sourceforge.

net/.

[54] Scott McFarling. Combining branch predictors. Technical Report WRL Technical Note

TN-36. Available from: citeseer.ist.psu.edu/334547.html.

[55] Motorola Corporation. MPC7410/MPC7400 RISC Microprocessor User’s Manual, Rev.

1. 2002.

[56] Steven S Muchnick. Advanced Compiler Design and Construction. Morgan Kaufman,

1997.

[57] Igor Pechtchanski and Vivek Sarkar. Dynamic optimistic interprocedural analysis: A

framework and an application. In OOPSLA ’01: Proc. of ACM SIGPLAN Conf. on

Object-Oriented Programming, Systems, Languages, and Applications, pages 195–210,

Oct. 2001.

[58] Rob Pike, Bart Locanthi, and John Reiser. Hardware/software trade-offs for bitmap

graphics on the blit. Software - Practice and Experience, 15(2):131–151, 1985.

[59] Ian Piumarta. Ccg: A tool for writing dynamic code generators. In OOPSLA’99 Work-

shop on simplicity, performance and portability in virtual machine design, Nov. 1999.

Available from: http://piumarta.com/ccg.

[60] Ian Piumarta. The virtual processor: Fast, architecture-neutral dynamic code generation.

In 2004 USENIX Java Virtual Machine Symposium, 2004.

[61] Ian Piumarta and Fabio Riccardi. Optimizing direct-threaded code by selective inlining.

In PLDI ’98: Proc. of the ACM SIGPLAN 1998 Conf. on Prog. Language Design and

Impl., pages 291–300, June 1998.

[62] R. Pozo and B. Miller. SciMark: a numerical benchmark for Java and C/C++, 1998.

[63] Brad Rodriguez. Benchmarks and case studies of Forth kernels. The Computer

Journal, 60, 1993. Available from: http://www.zetetics.com/bj/papers/

moving2.htm.

[64] Theodore H. Romer, Dennis Lee, Geoffrey M. Voelker, Alec Wolman, Wayne A. Wong,

Jean-Loup Baer, Brian N. Bershad, and Henry M. Levy. The structure and performance

of interpreters. In ASPLOS-VII: Proc. of the seventh international conference on Ar-

chitectural support for programming languages and operating systems, pages 150–159,

October 1996.

[65] Markku Rossi and Kengatharan Sivalingam. A survey of instruction dispatch techniques

for byte-code interpreters. Technical Report TKO-C79, Helsinki University of Technology, Faculty of

Information Technology, May 1996.

[66] James E. Smith and Ravi Nair. The architecture of virtual machines. IEEE Computer,

38(5):32–38, May 2005.

[67] SPECjvm98 benchmarks [online]. 1998. Available from: http://www.spec.org/

osg/jvm98/.

[68] Kevin Stoodley. Productivity and performance: Future directions in compilers

[online]. 2006. Available from: http://www.cgo.org/cgo2006/html/

StoodleyKeynote.ppt.

[69] Mark Stoodley, Kenneth Ma, and Marius Lut. Real-time Java, part 2: Comparing

compilation techniques [online]. 2007. Available from: http://www.ibm.com/

developerworks/java/library/j-rtj2/index.html.

[70] Dan Sugalski. Implementing an interpreter [online]. Available from: http://www.

sidhe.org/%7Edan/presentations/Parrot%20Implementation.ppt.

Notes for slide 21.

[71] Toshio Suganuma, Takeshi Ogasawara, Mikio Takeuchi, Toshiaki Yasue, Motohiro

Kawahito, Kazuaki Ishizaki, Hideaki Komatsu, and Toshio Nakatani. Overview of

the IBM Java just-in-time compiler. IBM Systems Journal, Java Performance Issue,

39(1):175–193, Feb. 2000.

[72] Toshio Suganuma, Toshiaki Yasue, and Toshio Nakatani. A region-based compilation

technique for dynamic compilers. ACM Trans. Program. Lang. Syst., 28(1):134–174,

2006.

[73] Gregory T. Sullivan, Derek L. Bruening, Iris Baron, Timothy Garnett, and Saman Ama-

rasinghe. Dynamic native optimization of interpreters. In IVME ’03: Proc. of the 2003

Workshop on Interpreters, Virtual Machines and Emulators, pages 50–57, 2003.

[74] V. Sundaresan, D. Maier, P. Ramarao, and M. Stoodley. Experiences with multi-threading

and dynamic class loading in a Java just-in-time compiler. In CGO-2006: Proc. of the

4th Intl. Symp. on Code Generation and Optimization, pages 87–97, Mar. 2006.

[75] David Ungar, Randall B. Smith, Craig Chambers, and Urs Hölzle. Object, message, and

performance: how they coexist in Self. IEEE Computer, 25(10):53–64, Oct. 1992.

[76] Raja Vallée-Rai, Phong Co, Etienne Gagnon, Laurie Hendren, Patrick Lam, and Vijay

Sundaresan. Soot - a Java bytecode optimization framework. In CASCON ’99: Proceed-

ings of the 1999 conference of the Centre for Advanced Studies on Collaborative research,

page 13. IBM Press, 1999.

[77] Benjamin Vitale and Tarek S. Abdelrahman. Catenation and operand specialization for

Tcl VM performance. In IVME ’04: Proc. of the 2004 Workshop on Interpreters, Virtual

Machines and Emulators, pages 42–50, 2004.

[78] Benjamin Vitale and Mathew Zaleski. Alternative dispatch techniques for the Tcl VM

interpreter. In Proceedings of Tcl’2005: The 12th Annual Tcl/Tk Conference, Oc-

tober 2005. Available from: http://www.cs.toronto.edu/syslab/pubs/

tcl2005-vitale-zaleski.pdf.

[79] John Whaley. Partial method compilation using dynamic profile information. In OOP-

SLA ’01: Proc. of ACM SIGPLAN Conf. on Object-Oriented Programming, Systems,

Languages, and Applications, pages 166–179, Oct. 2001.

[80] Tom Wilkinson. The Kaffe Java virtual machine [online]. Available from: http://

www.kaffe.org/.

[81] Mathew Zaleski, Marc Berndl, and Angela Demke Brown. Mixed mode execution with

context threading. In CASCON ’05: Proceedings of the 2005 conference of the Centre

for Advanced Studies on Collaborative research. IBM Press, 2005.

[82] Mathew Zaleski, Angela Demke Brown, and Kevin Stoodley. Yeti: a gradually extensible

trace interpreter. In VEE ’07: Proceedings of the 3rd international conference on Virtual

execution environments, pages 83–93, New York, NY, USA, 2007. ACM.

CVS Version numbers of files

Version numbers of all the LyX and EPS files that make up this dissertation.

File CVS Version

matzDissertation.lyx 1.18

intro.lyx 1.46

background-related.lyx 1.49

background.lyx 1.42

efficient-interpretation.lyx 1.43

eval-efficient-interpretation.lyx 1.22

implementation-yeti.lyx 1.46

eval-yeti.lyx 1.30

concl.lyx 1.20

figs-cgo/bInline.eps 1.1

figs-cgo/ct.eps 1.1

figs-cgo/directThread.eps 1.1

figs-cgo/dt.eps 1.1

figs-depth/dynamoDLLSimple.eps 1.2

figs/dynamoFigure42.eps 1.1

figs/gradualBb.eps 1.2

figs/gradualBbRecordMode.eps 1.4

figs/javaApplyReturnInlining.eps 1.8


figs/javaBranchInline.eps 1.5

figs/javaBranchRepl.eps 1.5

figs/javaContextThread.eps 1.3

figs/javaDirectCallThreading.eps 1.3

figs/javaDirectCallThreadingExt.eps 1.3

figs/javaDirectThread.eps 1.2

figs/javaDirectThreadInlineRet.eps 1.7

figs/javaRunningExample.eps 1.4

figs/javaSubBranch.eps 1.8

figs/javaSubThread.eps 1.7

figs/javaSwitch.eps 1.5

figs/traceExit.eps 1.4

figs/traceRegion.eps 1.8

graphs-cgo/objcaml_breakdown_mpt_normalized_direct.ps 1.3

graphs-cgo/objcaml_ppc7410_stalls_normalized_direct.ps 1.3

graphs-cgo/sablevm_mpt_normalized_direct.ps 1.3

graphs-cgo/sable_ppc7410_stalls_normalized_direct.ps 1.3

graphs-cgo/objcaml_breakdown_tsc_normalized_direct.ps 1.3

graphs-cgo/objcaml_ppc7410_cycles_normalized.ps 1.3

graphs-cgo/sable_tsc_normalized_direct.ps 1.3

graphs-cgo/sable_ppc7410_cycles_normalized.ps 1.3

graphs-cgo/objcaml_ppc970_time_normalized_direct.ps 1.3

graphs-cgo/sable_ppc970_time_normalized_direct.ps 1.3

graphs-tcl/cyclesPerDispatch.eps 1.1

graphs-tcl/cycles_per_disp1000-legend.eps 1.1

graphs/jam-dloop-no-mutex-PPC970.ps 1.7

graphs/instrPerDispatch.ps 1.2

graphs/gpul-PPC970FX-javac.ps 1.3

graphs/gpul-PPC970FX-jess.ps 1.3

graphs/gpul-PPC970FX-linux.ps 1.4

graphs/gpul-PPC970FX-mpeg.ps 1.3

graphs/gpul-PPC970FX-scitest.ps 1.3

graphs/trace-details.ps 1.1

graphs/jam-jit-PPC970.ps 1.4

graphs/jam-nosub-PPC970.ps 1.2

graphs/jam-sub-stack-cache-PPC970.ps 1.8

graphs/-Pentium4.ps 1.3

graphs/logDispatchCount.ps 1.10

graphs/sablevm-PPC970.ps 1.4

graphs/sablevm-sel_vs_jam-i-tr.xls.ps 1.1