Combining Speculative Optimizations with Flexible Scheduling of Side-effects

DISSERTATION submitted in partial fulfillment of the requirements for the academic degree Doktor der Technischen Wissenschaften in the Doctoral Program in Engineering Sciences

Submitted by: Gilles Marie Duboscq

At the: Institute for System Software

Accepted on the recommendation of: Univ.-Prof. Dipl.-Ing. Dr. Dr. h.c. Hanspeter Mössenböck and Dr. Laurence Tratt

Linz, April 2016

Oracle, Java, HotSpot, and all Java-based trademarks are trademarks or registered trademarks of Oracle in the United States and other countries. All other product names mentioned herein are trademarks or registered trademarks of their respective owners.

Eidesstattliche Erklärung

I declare under oath that I have written this dissertation independently and without outside help, that I have not used any sources or aids other than those indicated, and that all passages taken verbatim or in substance from other sources are marked as such. This dissertation is identical to the electronically submitted text document.

Linz, April 2016

Gilles Marie Duboscq

Abstract

Speculative optimizations make it possible to optimize code based on assumptions that cannot be verified at compile time. Taking advantage of the specific run-time situation opens up more optimization possibilities. Speculative optimizations are key to the implementation of high-performance language runtimes. Using them requires cooperation between the just-in-time compiler and the runtime system, and it influences the design and the implementation of both. New speculative optimizations, as well as their application to more dynamic languages, stress these systems far more than current implementations were designed for. We first quantify the run-time and memory footprint caused by their usage. We then propose a compiler structure that separates the compilation process into two stages and helps to deal with these issues without giving up on other traditional optimizations. In the first stage, floating guards can be inserted for speculative optimizations; at the end of this stage, the guards are fixed in the control flow at appropriate positions. In the second stage, side-effecting instructions can be moved or reordered. Using this framework, we present two optimizations that help reduce the run-time costs and the memory footprint. We study the effects of both stages, as well as the effects of these two optimizations, in the Graal compiler. We evaluate this on classical benchmarks targeting the JVM: SPECjvm2008, DaCapo, and Scala-DaCapo. We also evaluate JavaScript benchmarks running on the Truffle platform, which uses the Graal compiler. We find that combining both stages can bring up to 84 % improvement in performance (9 % on average), and that our memory footprint optimization can bring memory usage down by 27 % to 92 % (45 % on average).

Kurzfassung

Speculative optimizations are optimizations that are based on assumptions which cannot be evaluated at compile time. They can optimize specific run-time situations that are not optimizable in the general case. Speculative optimizations are essential for the performance of a runtime environment for high-level programming languages. These optimizations require the just-in-time compiler and the runtime environment to cooperate closely, which must be taken into account in the design and implementation of both components. New speculative optimizations and their application in highly dynamic languages can put a heavy load on the compiler and the runtime environment, since these were originally not designed for such optimizations. First, the effects of such speculative optimizations on run time and memory behavior are quantified. Then a new compiler design is presented in which the compilation of a program happens in two stages. This avoids the problems that arise with the new optimizations without negatively affecting existing optimizations. In the first stage, run-time guards for the speculative optimizations can be introduced, which can still change their position in the control flow. At the end of the first stage, these guards are fixed in the control flow. In the second stage, instructions with side effects can be moved or reordered. Based on this framework, two optimizations are presented which reduce the run time and the memory consumption, respectively. This thesis examines the effects of the stages described above and the effects of the two optimizations in the context of the Graal compiler. The following benchmarks were used for the evaluation: SPECjvm2008, DaCapo, and Scala-DaCapo. In addition, JavaScript benchmarks are used that run on the Truffle platform, which in turn uses the Graal compiler. When both stages are combined, the run time can be reduced by up to 84 % (9 % on average). The memory optimization can reduce memory consumption by 27 % to 92 % (45 % on average).

Contents

1 Introduction 1
  1.1 Research Context ...... 1
  1.2 Problem Statement ...... 1
    1.2.1 Existing Solutions ...... 2
    1.2.2 Current Problems ...... 3
    1.2.3 Proposed Solution ...... 3
  1.3 Scientific Contributions ...... 3
  1.4 Project Context ...... 4
  1.5 Structure of this Thesis ...... 5

2 Background 7
  2.1 Java and the Java Virtual Machine ...... 7
  2.2 The HotSpot Java Virtual Machine ...... 8
    2.2.1 Deoptimization ...... 9
  2.3 The Graal Compiler ...... 10
    2.3.1 Overall Design ...... 10
    2.3.2 Intermediate Representation ...... 12
    2.3.3 Snippets ...... 19
  2.4 Truffle ...... 22

3 Opportunities for Speculative Optimizations 25
  3.1 Traditional Usages of Speculation ...... 25
    3.1.1 Exception Handling ...... 25
    3.1.2 Unreached Code ...... 27
    3.1.3 Type Assumptions ...... 28
    3.1.4 Loop Safepoint Checks Elimination ...... 29
  3.2 Advanced Usages of Speculation ...... 29
    3.2.1 Speculative Alias Analysis ...... 30
    3.2.2 Speculative Store Checks ...... 30
    3.2.3 Speculative Guard Motion ...... 31
    3.2.4 Truffle ...... 31

4 The Costs of Speculation 35
  4.1 Runtime Costs ...... 35
    4.1.1 Assumptions vs. Guards ...... 36
  4.2 Memory Footprint ...... 38
    4.2.1 Experimental Data ...... 38
  4.3 Managing Deoptimization Targets ...... 43
    4.3.1 Java Memory Model Constraints ...... 43
    4.3.2 Deoptimizing to the Previous Side-effect ...... 45
    4.3.3 Fixed Deoptimizations ...... 46
  4.4 Conclusion ...... 47

5 Optimization Stages 49
  5.1 First Stage: Optimizing Guards ...... 50
    5.1.1 Representation ...... 50
    5.1.2 Optimizations ...... 50
  5.2 Stage Transition ...... 52
    5.2.1 Guard Lowering ...... 52
    5.2.2 FrameState Assignment ...... 53
  5.3 Second Stage: Optimizing Side-effecting Nodes ...... 56

6 Case Studies 59
  6.1 Speculative Guard Motion ...... 59
    6.1.1 Rewriting Bounds Checks in Loops ...... 60
    6.1.2 Speculation Log ...... 62
    6.1.3 Policy ...... 64
    6.1.4 Processing Order ...... 64
  6.2 Deoptimization Grouping ...... 67
    6.2.1 Relation with Deoptimization Metadata Compression ...... 68
  6.3 Vectorization ...... 69

7 Evaluation 73
  7.1 Methodology ...... 73
    7.1.1 Benchmarks ...... 73
    7.1.2 Baseline ...... 75
  7.2 Compilation Stages ...... 75
    7.2.1 Effects of the First Stage ...... 75
    7.2.2 Effects of the Second Stage ...... 81
  7.3 Speculative Guard Motion ...... 83
  7.4 Comparison to Other Compilers ...... 87
    7.4.1 C1 ...... 88
    7.4.2 C2 ...... 88
  7.5 Deoptimization Grouping ...... 91
    7.5.1 Debug Information Sharing ...... 91
  7.6 Deoptimization Grouping without Debug Information Sharing ...... 94
    7.6.1 Combining Debug Information Sharing and Deoptimization Grouping ...... 94

8 Related Work 97
  8.1 Intermediate Representations ...... 97
  8.2 Deoptimization ...... 98
    8.2.1 Exception Handling ...... 99
  8.3 Speculative Guard Motion ...... 100
  8.4 Deoptimization Data ...... 100
    8.4.1 Delta Encoding ...... 100
    8.4.2 Deoptimization Metadata Size ...... 101
    8.4.3 Deoptimizing with Metadata vs. with Specialized Code ...... 101

9 Summary 103
  9.1 Future Work ...... 103
  9.2 Conclusion ...... 104

List of Figures 107

List of Tables 109

List of Listings 111

List of Algorithms 113

Bibliography 115


Acknowledgments

I thank my advisor, Professor Hanspeter Mössenböck, for his valuable feedback, for welcoming me at the Institute for System Software, and for his patience throughout the process. I also thank my second advisor, Dr. Laurence Tratt, for his comments. Both their efforts helped greatly in improving the quality of my thesis. I thank Oracle Labs for funding my work and supporting many other academic projects at the Institute for System Software. In particular, I thank the VM research group, which allowed me to work on such an interesting subject. I was extremely lucky to be part of the Graal project from the start and to see it take shape. I thank the people I met at Oracle Labs and at the Institute for System Software: Thomas Würthinger, without whom this project would not have been possible, but also Lukas Stadler, Doug Simon, Bernhard Urban, Tom Rodriguez, Stefan Marr, Andreas Wöß, Christian Wimmer, and Felix Kaser, for sharing ideas and enriching discussions. I thank my parents, Bertrand and Béatrice, and my family, in particular my godmother Lise, for supporting and encouraging me throughout my life and in these studies. Finally, I thank Aga for her support and patience during this process.


Chapter 1

Introduction

1.1 Research Context

Programming language implementations can divide their work between a preparation step that runs before the program starts and an execution step that happens when the program is running. Unless the program is trivial, there is always work to be done at run time: at the very least, the program must run. However, the amount of work done ahead of run time varies a lot. Some language implementations do as much work as possible during the preparation step and compile the program down to a binary that can be run by the target computer. On the other hand, some language implementations do not perform any preparation at all and do everything at run time. This is often the case for implementations of programming languages that use a lot of dynamic features.

Program optimization is almost ubiquitous in implementations that concentrate on the preparatory phase. Optimizing compilers use extensive static analysis to try to produce a binary form of the program that executes as fast as possible. Language implementations that delegate more work to the runtime have also been an active area of research. Just-in-time (JIT) compilers are used to generate optimized native code for fast execution. While compiling a program at run time leaves less time for static analysis, it allows for more adaptive optimizations based on knowledge of the program's execution profile. Such dynamic optimizations are an active field of research, and their increased usage has prompted our work.

1.2 Problem Statement

High-level languages such as Java offer type and memory safety. These features often require checks to be performed at runtime. Languages with very dynamic semantics such as JavaScript, Ruby or R require even more runtime checks.


Such checks can be, for example, a null check when accessing a field of a Java object, or type checks when using JavaScript's + operator. Naive implementations of run-time checks incur a significant overhead during execution. These costs often force software developers to sacrifice safety and ease of development for high performance. Minimizing the costs of safety and dynamic features is important to allow their usage in high-performance software.
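
As a concrete illustration (our own sketch, not taken from any particular runtime; the class and method names are invented), the following Java methods make such checks explicit:

final class NaiveChecks {
    // A Java field access must reject a null receiver.
    static int readField(MyObject receiver) {
        if (receiver == null) {          // run-time null check
            throw new NullPointerException();
        }
        return receiver.field;
    }

    // A JavaScript-style '+' must dispatch on the run-time operand types.
    static Object jsAdd(Object a, Object b) {
        if (a instanceof Integer && b instanceof Integer) {
            return (Integer) a + (Integer) b;             // numeric addition
        }
        if (a instanceof String || b instanceof String) {
            return String.valueOf(a) + String.valueOf(b); // string concatenation
        }
        throw new UnsupportedOperationException("other cases omitted");
    }

    static final class MyObject { int field; }
}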

1.2.1 Existing Solutions

An efficient way of implementing run-time checks is to predict their outcome. While this does not remove the need to perform the checks, it allows runtime systems and compilers to provide an optimized fast path based on the assumed outcome. A common way to predict outcomes is for the runtime to monitor the execution of the program and to gather statistics. The runtime also has the unique chance to observe the current state of the system. Based on the assumption that the state of the system will not change too much, speculative optimizations can be applied. An optimization is speculative if it uses assumptions, i.e., hypotheses that were not verified at compile time. For example, a speculation may be that the definition of a function will not change or that the data types of the operands of an operation will remain the same. The runtime can then use the JIT compiler to emit native code that takes advantage of these speculations.

For some of these speculations, it is possible to be notified when the assumption is invalidated. For example, one could maintain a list of all compiled code that depends on a function not being redefined. If the function happens to be redefined, one can invalidate all compiled code that depends on the assumption that no longer holds. For other speculations, a check needs to be done by the compiled code that uses the assumption. For example, if the compiler assumed that the input to a function is always of a specific type, it needs to insert a type check in that function to be sure that the assumption holds.

In both cases, execution must exit the code that was compiled with the now invalid assumption and proceed in a non-speculative version. In order to do that, the state of the program (e.g., its local variables) must be transferred from the state used by the compiled code to the state used by the new execution system (different compiled code, the interpreter, etc.). This process is called deoptimization.

Deoptimization can be done using metadata that describes the mapping between those states. It was first applied to allow debugging of optimized programs [40]. During the early stages of the compilation, before any optimization has taken place, the compiler selects locations where this mapping needs to be possible. The compiler then makes sure that it keeps the mapping up to date at these points. The mappings are then stored alongside the compiled code, and the runtime can use them whenever necessary.
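
As an illustration of the first, notification-based kind of speculation, here is a minimal sketch of such bookkeeping (all names are invented; a real VM tracks compiled methods and their dependencies internally rather than strings):

import java.util.*;

final class AssumptionRegistry {
    private final Map<String, List<CompiledCode>> dependents = new HashMap<>();

    // Compiled code registers its dependency on "functionName is not redefined".
    void registerDependency(String functionName, CompiledCode code) {
        dependents.computeIfAbsent(functionName, k -> new ArrayList<>()).add(code);
    }

    // Redefining the function breaks the speculation: every dependent
    // compilation must be thrown away (deoptimized).
    void onRedefine(String functionName) {
        for (CompiledCode code : dependents.getOrDefault(functionName, List.of())) {
            code.invalidate();
        }
        dependents.remove(functionName);
    }

    interface CompiledCode { void invalidate(); }
}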

1.2.2 Current Problems

While speculative techniques are already used in existing compilers and language runtimes, these systems were not necessarily designed to accommodate the increasing usage of speculative optimizations. Also, since the compiler has to select the locations where it can exit optimized code before performing any optimizations, it needs to predict those exit locations very early in the compilation process. Finally, the necessary amount of metadata grows with the number of exits from the optimized code. This can lead to the output of the compiler being composed mostly of metadata, significantly increasing the footprint of compiled code.
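
To see why this metadata can dominate the compiler's output, consider a minimal sketch of what a single deoptimization point must record (a simplification with invented field names; real formats additionally cover inlined frames, locks, and similar state):

final class DeoptMetadataEntry {
    final int nativeCodeOffset;  // deoptimization point in the machine code
    final int bytecodeIndex;     // where the interpreter resumes execution
    final int[] localLocations;  // register/stack slot holding each local variable
    final int[] stackLocations;  // register/stack slot holding each operand-stack slot

    DeoptMetadataEntry(int nativeCodeOffset, int bytecodeIndex,
            int[] localLocations, int[] stackLocations) {
        this.nativeCodeOffset = nativeCodeOffset;
        this.bytecodeIndex = bytecodeIndex;
        this.localLocations = localLocations;
        this.stackLocations = stackLocations;
    }
}

One such entry is needed for every possible exit from the optimized code, so aggressive speculation multiplies the number of entries.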

1.2.3 Proposed Solution

After studying the current situation, this thesis sets out to ease the use of speculative optimizations in compilers and to reduce their run-time and compile-time costs without compromising existing optimizations. We propose to use two different models for the representation of runtime checks and deoptimization metadata during a compilation. Each model allows some optimizations that would conflict with the other model. We then show how to use these two models during a single compilation in order to perform all optimizations. To show how this framework can help to reduce the costs of speculative optimizations, we present two optimizations. First, we propose a speculative optimization that improves the ability of the compiler to hoist runtime checks out of loops. Since programs usually spend most of their time inside loops, this helps to reduce the run-time costs of these checks. Second, we propose a compiler optimization that reduces the amount of metadata necessary for deoptimization. Since it does not require special support from the runtime, it can easily be adapted to many runtime systems to reduce the memory costs of speculative optimizations.
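
As a source-level preview of the first optimization (our own sketch in Java; deoptimize() stands in for the transfer back to the interpreter):

final class BoundsCheckHoisting {
    // Naive version: an implicit bounds check runs on every iteration.
    static int sumChecked(int[] a, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += a[i]; // checks 0 <= i < a.length each time
        }
        return sum;
    }

    // With the check hoisted speculatively: one guard before the loop covers
    // the whole iteration range. If it fails, execution deoptimizes and the
    // interpreter re-executes the loop with the original per-iteration checks.
    static int sumSpeculative(int[] a, int n) {
        if (n > a.length) {
            deoptimize();
        }
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += a[i]; // bounds check now provably redundant
        }
        return sum;
    }

    private static void deoptimize() {
        throw new IllegalStateException("stand-in: execution would resume in the interpreter");
    }
}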

1.3 Scientific Contributions

In this thesis we study the time and memory impact of speculative optimizations. In particular, we study the behavior of current techniques supporting speculative optimization in the context of heavy speculation.

We propose a new way of structuring the compiler and the intermediate representation that allows us to combine the heavy usage of speculative optimizations with traditional compiler optimizations that need to reorder side effects (e.g., vectorization).

Using this structure and intermediate representation we propose two optimizations. One improves the runtime performance of loops when using speculative optimizations. The other reduces the memory footprint of speculative optimizations.

We study the effects of our optimizations using an implementation of these concepts and optimizations in the Graal compiler. This allows us to study the effects on Scala and Java running on a Java virtual machine (HotSpot) and on JavaScript.

1.4 Project Context

This research was done in the context of a long-term research collaboration between the Johannes Kepler University of Linz and Oracle Labs. The collaboration started with Sun Microsystems in 2001 and continues with Oracle since Sun's acquisition in 2010. It is focused on language runtime systems and in particular on Oracle's Java platform implementation, i.e., Oracle's HotSpot Java virtual machine and Oracle's Graal JIT compiler. The experience accumulated on the Java platform, and in particular on the HotSpot Java virtual machine, has helped in devising and implementing the ideas developed in this thesis. The ongoing research in high-performance language implementation has also triggered the work of this thesis, because speculative optimizations put unprecedented pressure on the abilities of the JIT compiler and the runtime platform to handle an increasing number of runtime checks and deoptimization points. In particular, the collaboration with Oracle yielded the following research results which are directly related to this thesis:

Linear-scan register allocation Its usage for JIT compilation has been studied and implemented in HotSpot [51, 52, 76] and was later extended to be applied directly on the SSA form of the IR [78]. The Graal compiler uses this register allocator.

Escape analysis Escape Analysis has been studied and implemented in HotSpot [44, 45, 46]. Partial Escape Analysis [66, 69] takes Escape Analysis further and was implemented as part of the Graal compiler.

Trace Compilation A trace compiler was implemented in HotSpot. Trace compilation techniques and policies were studied in [37, 39, 38]. The experience accumulated during this work has been used to design and tune the inlining heuristics used in Graal.


Array bounds check elimination This optimization was studied and implemented in the context of HotSpot [82, 83]. This work showed the importance of moving some dynamic guards out of loops and was a precursor of some of the optimizations discussed in this thesis.

JIT compilation policies and optimizations The effects of various optimization phases of the Graal JIT compiler were studied in the context of Scala [68]. Various JIT compilation policies were also compared [67].

Language implementation framework The Truffle high-performance language implementation framework was developed as part of this collaboration [85, 84, 31, 33, 81, 41]. This framework was extended to allow high-performance language interoperability [34, 36, 32, 35]. This framework motivated some of the work of this thesis.

In this context, the author also published on some of the subjects covered in this thesis, in particular on the intermediate representation used in the Graal compiler [18, 20] as well as on the "deoptimization grouping" optimization [19] presented in Section 6.2.

1.5 Structure of this Thesis

In Chapter 2 we first describe the background of our work. This includes the Java platform, the HotSpot VM, and the Graal compiler, which form the environment in which we implemented and tested our ideas. We then describe the opportunities for speculative optimization in a JIT compiler in Chapter 3. Once we have seen these opportunities, we analyze their costs in Chapter 4. In particular, we look at the problems that can appear in systems where speculation is used very aggressively. In Chapter 5 we then explain how splitting the compilation process into two distinct stages allows us to be more aggressive in the usage of speculative optimizations while still being able to perform advanced optimizations of side effects. We then present two optimizations in Chapter 6. The first one runs in the first compilation stage and helps us to lower the costs of speculation by moving some runtime checks out of loops. The second one runs in the second compilation stage and helps us to reduce the memory footprint of using speculative optimizations. We evaluate the effects of these techniques on various benchmarks in Chapter 7. Finally, we compare different aspects of this thesis to related work in Chapter 8 before concluding.


Chapter 2

Background

To validate the ideas presented in this thesis, we implemented them in a just-in-time compiler called Graal that is used on top of the HotSpot Java virtual machine. In this chapter we describe this environment and why it is an interesting platform to study speculative optimizations. We also present an overview of the Graal compiler's structure and intermediate representation.

2.1 Java and the Java Virtual Machine

The Java platform is a software layer and a set of specifications that allow developers to build Java applications. Typical applications developed on the Java platform range from server-side web applications to desktop applications. The Java platform is based on two main components: the Java language [28] and the Java virtual machine [49].

The Java language is an object-oriented, imperative programming language. It offers a static type system with class- and interface-based inheritance. Its specification also guarantees type safety.

The Java Virtual Machine (JVM) specifies an abstract computation model. It defines a class file format that describes data types and code. Code is defined through a bytecode format for which the JVM specification describes the execution model. The standard Java compiler targets the JVM and compiles Java source code to class files that contain both the description of the program's data types and the program's methods compiled to Java bytecode. This allows applications built on the Java platform to be portable to a variety of hardware platforms and operating systems for which a JVM was implemented.

The Java platform supports execution threads sharing a common memory heap. The Java memory model [50] defines the semantics of memory and concurrency operations independently of the memory model of the hardware that hosts the virtual machine.

Security is an important aspect of the Java platform. In particular, it offers the possibility to run code from an untrusted source in a restricted context.
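
For example, a trivial Java method and the stack-machine bytecode it compiles to (as rendered by javap -c; constant-pool details omitted):

class Arithmetic {
    int add(int a, int b) {
        return a + b;
    }
}

// javap -c Arithmetic shows for add(int, int):
//   iload_1   // push local variable 1 (a) onto the operand stack
//   iload_2   // push local variable 2 (b)
//   iadd      // pop both operands, push their sum
//   ireturn   // return the top of the operand stack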


The security features are built on top of the memory safety, type safety, and encapsulation guarantees that are offered by the Java platform. For example, type casts are always checked at run time and object member accesses follow strict rules. Some of these safety rules are enforced at load time: when a class is loaded, a number of checks are performed, including a verification of the bytecode. Further checks are then performed at run time. Minimizing the run-time impact of these checks has been an important reason for the development of speculative optimizations in JVM implementations.

2.2 The HotSpot Java Virtual Machine

The Java HotSpot VM is the reference implementation of the Java Virtual Machine. It is an industry-scale implementation that offers both high performance and a range of productivity features such as debugging and profiling. It is available for various operating systems and hardware architectures.

The HotSpot VM uses mixed-mode execution: methods are initially interpreted and are scheduled for JIT compilation as soon as they have been executed frequently enough. Thus execution starts in the interpreter, which is slow but has low startup and footprint costs. Profiling information is also gathered during interpretation. When the VM decides that a method should be compiled, it makes a request to the compiler, which can base its optimizations on the collected profiling information.

The HotSpot VM has two JIT compilers: the C1 compiler (sometimes also called the "Client" compiler) and the C2 compiler (sometimes also called the "Server" compiler). The C1 compiler [47] aims at fast compilation speed, while the C2 compiler [55] aims at better optimizations at the expense of slower compilation speed. The HotSpot VM comes in different "flavors": the Client VM only contains the C1 compiler and is tuned for fast start-up and low footprint. The Server VM contains both the C1 and C2 compilers and is optimized for high performance. It is tuned for software that runs for a long time, where a longer warmup phase is acceptable if it pays off in higher peak performance in the long run.

By default, the Server VM operates in tiered mode: once a method has been executed frequently enough in the interpreter, it is first compiled by the C1 compiler. If this compiled method is still executed frequently, it is scheduled for compilation by the C2 compiler. Profiling information is still gathered by the C1-compiled code. The details of the tiered policy are more involved and include the possibility to enable or disable profiling for code generated by the C1 compiler. Overall, it aims at providing a faster warmup than what would be achieved with the C2 compiler alone, while still obtaining the long-term peak performance benefits.


2.2.1 Deoptimization

Hölzle, Chambers, and Ungar [40] introduced the concept of deoptimization, which allows compiled code to jump back to the interpreter when needed. They implemented this concept in the Self VM [74], on which the HotSpot VM is based. In this VM, deoptimization was mainly used to make it possible to debug optimized code. Most code could then run in its optimized compiled version, and only when a breakpoint was hit or a thread was stopped would the VM use the deoptimization mechanism to transfer execution back to the interpreter.

However, the deoptimization mechanism can also be used by a compiler to make assumptions that it cannot prove (whether for lack of time, lack of decidability, or lack of a whole-program view). When such an assumption turns out to be wrong at run time, deoptimization occurs, which transfers execution back to the interpreter or to a different version of the compiled code that does not make this assumption. In order to perform deoptimization, the runtime system needs to know where execution should be resumed. In a system like the HotSpot JVM, this means that every location in compiled code where deoptimization is possible needs to be associated with a location in the Java bytecode where execution should be resumed.

Performing compiler optimizations based on optimistic assumptions is called speculation. Supporting speculation involves both the runtime and the compiler. In particular, deoptimization needs to be supported by the compiler's intermediate representation (IR) and by its optimization phases. The compiler needs to have enough information in its IR that the state of the interpreter can be rebuilt during deoptimization.

Deoptimization can be triggered by a check inserted in the compiled code, e.g., by checking at run time that a variable is of the expected profiled type. If the check fails, the compiled code can call the runtime to signal that the current frame needs to be deoptimized. Deoptimization can also be triggered externally if some event in the VM invalidates an assumption that the compiler made, for example when the VM loads a new class that invalidates the results of a previous class hierarchy analysis. In this case, threads are stopped at safepoints (i.e., at locations where deoptimization metadata is available) and the stack frames of methods that depend on the invalidated assumption are deoptimized. To support this, the compiled code contains "safepoint checks" which test whether a stop at a safepoint was requested by the VM and pause execution if necessary. Safepoint checks are usually inserted at loop back-edges and non-leaf method returns to ensure that all threads eventually reach a safepoint. Safepoints are associated with the necessary metadata to be able to perform deoptimization if needed.

The compiler or runtime can also specify which action should be taken when deoptimization occurs. For example, the compiled code can be discarded immediately or not, and the method can be recompiled on the next invocation or only after a new profiling period, which can be useful to gather new profiling data and optimize according to the new execution patterns. Along with this action, a deoptimization reason can be passed to the runtime, which is used to remember which types of speculation have failed in previous executions. Both the C1 and the C2 compiler of the HotSpot VM use speculative optimizations and deoptimization.

The deoptimization capabilities of the HotSpot VM provide a stable and proven framework to experiment with speculative optimizations. This is valuable since we concentrate our work on the usage and the impact of speculative optimizations in optimizing compilers and not on the implementation of deoptimization in the runtime.
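
As a source-level sketch of the two trigger mechanisms (our illustration; the real checks are emitted as machine code, and safepointRequested stands in for the VM's polling mechanism):

final class CompiledFastPath {
    static Class<?> profiledType = String.class; // type observed during profiling
    static volatile boolean safepointRequested;  // stand-in for the VM's safepoint poll

    static int speculatedLength(Object value) {
        // Check-based trigger: the guard tests the speculation directly.
        if (value.getClass() != profiledType) {
            deoptimize("type speculation failed");
        }
        int result = ((String) value).length(); // devirtualized fast path

        // External trigger: a safepoint check (e.g., at a loop back-edge or a
        // method return) lets the VM stop this thread and deoptimize the frame
        // if an assumption such as a class hierarchy analysis was invalidated.
        if (safepointRequested) {
            pauseAtSafepoint();
        }
        return result;
    }

    private static void deoptimize(String reason) {
        throw new IllegalStateException("would resume in the interpreter: " + reason);
    }

    private static void pauseAtSafepoint() {
        // The runtime would inspect the frame using the safepoint's metadata.
    }
}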

2.3 The Graal Compiler

2.3.1 Overall Design

Predecessors

The Maxine project [77] aimed at creating a meta-circular Java VM: a JVM written in Java itself. While Maxine was not the first of its kind (it was created long after the Jikes RVM [2, 58], for example), it used the latest version of the Java language, with the aim of providing an easily customizable and maintainable VM that can be used as a research vehicle. During this effort, a number of JIT compilers were used inside the Maxine VM, at run time as well as while compiling the VM itself. In particular, HotSpot's C1 compiler was ported to Java to be used in the Maxine VM. This port was called the C1X [73] compiler. Since the Maxine VM needed to support a variety of compilers, it had a clear compiler–runtime interface. The C1X compiler was then re-integrated into the HotSpot VM as a meta-circular compiler, "C1X4HotSpot". This also helped validate the other side of the compiler–runtime interface, since the C1X compiler had to work with both HotSpot and Maxine through this interface.

The Graal Project

The Graal project and compiler started from this point with the aim of creating a compiler producing high-performance code for the HotSpot VM. The major difference from the original C1 compiler is the choice of intermediate representation [18, 20]. The Graal compiler also supports late inlining, which enables inlining after the full method has been parsed and other optimizations have already been applied, as opposed to inlining "on the fly" during bytecode parsing. It also includes optimizations that were not part of the C1 compiler, for example Partial Escape Analysis [69]. The Graal project is an open source project that is part of OpenJDK [30].


By participating in the Graal project from the beginning, we were able to make important modifications to its intermediate representation and to the structure of its optimization phases. These modifications allowed us to make the Graal compiler well suited for the optimizations proposed in this thesis. This would have been more difficult in an existing compiler, where such modifications have a big impact on existing optimization phases.

Modular framework

The compiler is built in a modular fashion in order to clearly define interfaces between different parts of the compiler. This is important in order to be able to reuse the compiler for different runtimes, hardware platforms, and operating systems. Each module contains a set of Java packages. Dependencies between modules are explicitly declared, and the dependency graph of modules is checked to ensure that it does not contain any cycles. The modules are then verified at build time by building them separately and allowing each module to access only classes from its declared dependencies. Extensibility and modularity are also implemented by using service lookup mechanisms: a service is defined by a Java interface, and service lookup finds the available implementations. This modularity helps to ensure that the overall design is applicable in other cases as well (different language runtimes, different target architectures, etc.). Working in this context was important to make sure that our ideas can be applied more broadly.

Optimization phases

The Graal compiler is structured into distinct phases. A phase receives the current intermediate representation, inspects it, and can transform it. A phase can, for example, apply an optimization or refine the representation of some operations. Phases can also be nested: one phase can apply another one, which allows phases to be easily reused. Phases also provide natural boundaries where the IR can be verified and dumped for inspection in a visualization tool. This design, while classical, made sure that we could split the optimization phases into multiple stages, which is critical for the approach developed in this thesis.
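
A minimal sketch of such a phase architecture (illustrative names only, not Graal's actual classes):

interface Graph {
    void verify(); // hook for consistency checks and IR dumping
}

interface Phase {
    void apply(Graph graph);
}

// Phases compose: a tier is itself a phase that runs nested phases in order,
// verifying the IR at each phase boundary.
final class PhaseSuite implements Phase {
    private final Phase[] phases;

    PhaseSuite(Phase... phases) {
        this.phases = phases;
    }

    @Override
    public void apply(Graph graph) {
        for (Phase phase : phases) {
            phase.apply(graph);
            graph.verify(); // natural boundary for verification/inspection
        }
    }
}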

Configurability

The Graal compiler is configurable so that it can be used in different contexts. For example, it can be configured with various levels of optimization so that it can be used for different compilation tiers (fast with few optimizations, or slower but with better optimizations). The compiler can also be used for a variety of purposes that require different configurations, such as a configuration for a JIT compiler, for an ahead-of-time (AOT) compiler, or for a static analysis tool. It is important that multiple configurations can exist side by side, since different configurations are needed inside a single VM. For example, some low-level "stubs" used in the HotSpot VM require a different compilation than standard Java code. Different configurations can also be used side by side when executing on a heterogeneous platform (e.g., CPU & GPU).

2.3.2 Intermediate Representation

Structure

The Graal IR is based on a directed graph structure. Each node in the graph produces at most one value, and the graph is in static single assignment (SSA) form [3, 59, 12]. To represent data flow, a node has data-flow edges (also called input or use-def edges) pointing "upwards" to the nodes that produce its operands. To represent control flow, a node has control-flow edges pointing "downwards" to its possible successors. Note that the two kinds of edges point in opposite directions.

Figure 2.1: Example IR graph with control-flow and data-flow edges, for the source fragment "if (cond) { foo = v1 + v2; } else { foo = v2; }"

In the example in Figure 2.1, the If node has one data-flow input edge pointing "upwards" for the condition and two successor edges pointing "downwards" for the true and false branches. This mirrors the edges usually found in a data-flow graph and a control-flow graph, respectively. In summary, the IR graph is a superposition of two directed graphs: the data-flow graph and the control-flow graph. Data-flow edges are further refined into various categories. For example, in the figure, a gray edge represents the association between the Phi node and the Merge node. A description of all these categories is given later in this section. Contrary to many types of IR, in Graal's IR some nodes (such as the Add node) do not seem to belong to any basic block; we will come back to this later in this section.

The IR automatically maintains reverse edges for all node connections. Therefore usage edges (also called def-use edges; the reverse of input edges) and predecessor edges (the reverse of successor edges) can also be used to traverse the graph. Unlike the direct edges, these reverse edges are maintained implicitly and are not ordered, which means that they can only be observed as an unordered set of usages or predecessors.

Each block of the control-flow graph begins with a Begin node. Two or more control-flow branches can merge at a Merge node, which is also a Begin node. These Merge nodes need to know the order of their predecessors so that an SSA Phi node can select the correct input value for each of the merge predecessors. As predecessor edges are not ordered, Merge nodes are connected to their predecessors using input edges pointing to End nodes. These End nodes are at the end of the control flow of the Merge node's predecessors. The Phi nodes are attached to their Merge node through a special input edge. This structure is illustrated in Figure 2.1. For clarity, in these figures, the edges between Merge nodes and End nodes are represented using control-flow edges from End to Merge, even though in practice they are implemented using input edges pointing in the other direction, as explained above.

For simplicity, the IR only represents reducible loops, i.e., loops with a single entry point. Methods containing irreducible loops are not compiled and are left for the interpreter to execute. In Java bytecode produced from a Java source program without obfuscation, there can never be a method with irreducible loops. The IR models loops explicitly: the loop header is a LoopBegin node. The back-edges of a loop are represented with LoopEnd nodes, which are attached to their LoopBegin node through a special input edge. Since the LoopBegin node merges the control flow of the loop header's predecessor and the backward edges, Phi nodes can be attached to LoopBegin nodes. This structure is illustrated in Figure 2.2. For the transparent management of LoopBegin nodes (which merge control flow), LoopBegin is a subclass of Merge and LoopEnd is a subclass of End.
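
A simplified sketch of this node structure in Java (Graal's real node classes are far richer; this only mirrors the edge kinds described above):

import java.util.*;

// Every node has ordered input (data-flow) edges pointing "upwards" and
// ordered successor (control-flow) edges pointing "downwards"; the reverse
// edges (usages, predecessors) are kept as unordered sets.
abstract class Node {
    final List<Node> inputs = new ArrayList<>();     // use-def edges, ordered
    final List<Node> successors = new ArrayList<>(); // control flow, ordered
    final Set<Node> usages = new HashSet<>();        // implicit reverse edges
}

class MergeNode extends Node {
    // Input edges to End nodes, ordered so that Phi nodes can select the
    // correct value for each predecessor.
    final List<Node> ends = new ArrayList<>();
}

class PhiNode extends Node {
    MergeNode merge; // association edge to the merge this phi belongs to
    // inputs.get(i) is the value flowing in from merge.ends.get(i)
}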

Floating Nodes and Code Motion

Nodes are not necessarily fixed to a specific point in the control flow. The control-flow splits and merges provide a backbone around which most other nodes are floating. For example, the Add node in Figure 2.1 is floating. Those floating nodes are not part of the control-flow graph but only part of the data-flow graph. The data-flow dependencies describe scheduling constraints that are later used to assign a position to those floating nodes before register allocation and code emission. As described by Click [9], using floating nodes in combination with global value numbering allows the concern of code motion to be separated from many optimizations. Optimizations do not need to preserve a schedule anymore; they only need to make sure that the data-flow dependencies of floating nodes are not invalidated. Making these dependencies as loose as possible while maintaining correctness then gives the scheduler more freedom. This means that some optimizations do not need to find the optimal placement for an operation but only need to optimize the dependencies between nodes. Nodes are scheduled in a similar way to what Click [8] describes in his thesis: they are generally scheduled at the latest possible position in order to push them into those branches where they are actually used, with the exception of loops, where nodes are scheduled before the loop if possible.

Figure 2.2: Example IR graph with a loop, for the source fragment "for (int i = 0; i < n; i++) { ... }"
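
A small source-level example of the effect of late scheduling (our own example): the product below floats and is sunk into the only branch that uses it, so it is not computed on the other path.

final class LateScheduling {
    static int example(int a, int b, boolean c) {
        int x = a * b; // floating node: held in place only by data-flow edges
        if (c) {
            return x;  // latest possible position: the scheduler emits the
                       // multiplication here, inside the taken branch
        }
        return 0;      // x is never computed on this path
    }
}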

Node Types

Nodes in the IR each have a type that defines their semantics. There are also categories of nodes that are used to group types that share some characteristics. Some of the important node categories and types are:


Value nodes are those that return a value.

Logic nodes are a special type of value node: they return boolean values.

Fixed nodes are those that are part of the control flow. Amongst fixed nodes, ControlSplit nodes have multiple successors and Merge nodes have multiple predecessors. Begin nodes start a control-flow block and End nodes finish it. LoopBegin nodes are a special type of Merge node which denotes a loop header. They are associated with at least one LoopEnd node, a special type of End node that is used for control-flow back-edges.

Floating nodes are those that are not part of the control flow. As explained above, they are only part of the data flow.

Side-effecting nodes are those that may have a side effect when they are executed.

Deoptimizing nodes are those that may trigger deoptimization when they are executed.

Guarding nodes are those that can be used by other nodes to ensure that a condition is true at runtime. For example, Guard nodes are floating nodes that check a condition and deoptimize if it is not true. A Begin node that follows a ControlSplit node can also be used as a guarding node; the condition it checks depends on the ControlSplit node and on which successor of this ControlSplit node it is (e.g., the "false" successor of an If node guards that the If node's condition is false).

FrameState nodes represent a state of the virtual machine. They contain a bytecode location and have the values of the current locals and stack slots as inputs.

Edge Types

Since every node produces at most one value, different edge types can be used in cases where there could be an ambiguity. For example, an Invoke node produces a value if it calls a non-void function, but it may also have side effects that need to be modeled. This means that an Invoke node can be referred to by both a value edge and a memory edge (see below). So far, we have only distinguished between control-flow edges and data-flow edges. We now refine the data-flow edges so that they can represent different kinds of data flow:

Control-Flow This edge points to a successor of the current node. It can only be used by fixed nodes. (Control-flow edges are represented with red lines in figures.)


Value This edge consumes the value produced by the node it points to. This value will exist at runtime (in a register or on the stack). (Value edges are represented with black lines in figures.)

Memory This edge consumes the memory state produced by the node it points to. It can only be used to point to nodes that may have an observable side effect. This edge ensures that, for example, read-after-write dependencies are respected in the schedule. (Memory edges are represented with green lines in figures.)

Condition This edge consumes the condition produced by the node it points to. A condition edge can only point to a Logic node. This is used instead of a value edge because on many architectures the result of comparisons is not reified in a standard register but in a special-purpose condition register. (Condition edges are represented with black lines in figures.)

State This edge points to a FrameState node which represents a state of the virtual machine. (State edges are represented with dashed black lines in figures.)

Guard This edge points to a guarding node. The guarding node checks a condition that must be satisfied before the nodes that point to it can safely be executed. (Guard edges are represented with blue lines in figures.)

Anchor This edge points to a fixed node above which the referring node should not float. For example, it can be used to anchor a guard into a branch by making the Guard node point to the Begin node of the branch it should be executed in. (Anchor edges are represented with dashed blue lines in figures.)

Association This edge represents an association between two nodes, for example, between a Phi and a Merge node or a LoopEnd and LoopBegin node. (Association edges are represented with gray lines in figures.)

Granularity

After a method has been parsed, the Graal IR contains high-level nodes whose operations have a granularity similar to that of Java bytecodes. For example, there are LoadField and InstanceOf nodes. These operations have semantics that are similar to those of the corresponding bytecodes. A LoadField node first checks the receiver for null and then accesses the field. It does not yet describe the details of how fields are retrieved in a particular runtime. A graph containing such a high-level LoadField node is illustrated in Figure 2.3. Keeping the graph at a high level in the beginning of the compilation allows the compiler to optimize at a high level first (coarse granularity).

Figure 2.3: High-level LoadField node before lowering

The IR is then progressively lowered towards a finer granularity. The high-level concepts are first broken down into runtime-specific operations, which allows lower-level optimizations to happen. For example, for HotSpot, a LoadField node is lowered to a Read node that reads memory at a certain offset from the object's base. The Read node is given a Guard node that checks that the receiver object is not null. This structure is illustrated in Figure 2.4. We can see that the Guard node is anchored in the current branch by an anchor edge to the Begin node. This is needed because the condition under which this branch is taken can be related to the object not being null in non-trivial ways. If the guard were not anchored to the branch, it could be executed before we know whether this branch is taken, which could result in unnecessary deoptimizations.

Figure 2.4: Low-level nodes after lowering of a LoadField node
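
A sketch of such a lowering step (all classes here are simplified stand-ins for Graal's real node API, only to show the shape of the transformation):

final class LoweringSketch {
    static final class Node {
        final String kind;
        final Node[] inputs;
        Node(String kind, Node... inputs) {
            this.kind = kind;
            this.inputs = inputs;
        }
    }

    // Lower a high-level field load into a guarded low-level memory read.
    static Node lowerLoadField(Node object, int fieldOffset, Node branchBegin) {
        Node isNull = new Node("IsNull", object);
        // The guard deoptimizes if the object is null; its anchor input to the
        // branch's Begin node keeps it from floating above the branch.
        Node guard = new Node("Guard(negated)", isNull, branchBegin);
        // The read depends on the guard, so it cannot be scheduled before it.
        return new Node("Read@" + fieldOffset, object, guard);
    }
}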


Lowering is primarily used to apply optimizations at the most appropriate level of abstraction, but it also has an impact on the size of graphs: graphs contain fewer nodes before lowering than after lowering. This affects compilation speed, since the complexity of many optimizations depends on the number of nodes and edges in the graph. Keeping graphs small at the beginning of the optimization process improves the performance of the optimizations applied before lowering.

Floating Reads

In order to benefit from code motion and global value numbering applied to floating nodes, Read nodes, which read data from memory, are transformed from fixed nodes into floating nodes in the course of the compilation. This is done by explicitly describing the dependencies between Read nodes and nodes that have observable side effects. The model that Graal uses is similar to the one of Click and Paleczny [10]: nodes that have side effects produce a memory state that can be used by nodes that read memory (e.g., Read nodes). This dependency is specified by a memory edge. It is important to note that in this model, write-after-read edges are not necessary. Conceptually, each side effect produces an immutable snapshot of the memory that read operations can use. Since taking such snapshots is not technically feasible, the scheduler must schedule a read of a certain memory state before this memory state can be changed by a side effect. While this adds some complexity in the scheduler compared to having explicit write-after-read edges, it reduces the number of edges in the graph.

In order to optimize these dependencies and provide more freedom to the scheduler, the memory is not always considered as a whole but is divided into smaller alias classes [48]. An alias class is a group of names that may alias, i.e., names that may refer to the same location in memory. Graal supports a simple model of alias classes. There is a global one called ANY_LOCATION, which aliases with everything. It is used, for example, for a call to a method that may have any kind of side effect. Any number of disjoint classes can then be created: for example, each Java field has its own class. Finally, there can be immutable classes which do not alias with anything, not even ANY_LOCATION: for example, the length of a Java array is immutable. Graal only inserts dependencies between memory reads and side-effecting nodes if they may alias according to this system. As a result, different Java fields will not interfere, and thus writing a field inside a loop does not prevent the read of an array's length inside the same loop from "floating" (i.e., being scheduled) outside the loop.

For example, in the graph from Figure 2.5, the read of field a does not interfere with the write of field b. Figure 2.6 illustrates the same graph after the transition to FloatingRead nodes and the introduction of memory edges. The read of field a is done on the state produced by the Invoke node above the loop. It can thus be scheduled before the loop. On the other hand, the read of field b is done on the state produced by a memory Phi node (i.e., a Phi node that merges memory states). This Phi node merges the state from the Invoke node before the loop and from the Write node on the loop's back-edge. Its presence models the fact that the alias class of b can be written to inside the loop, and the read of b can thus not be scheduled outside the loop. Before the introduction of the memory graph, these dependencies are implicit and only implied by the order of memory operations. The introduction of explicit dependencies makes it possible to turn the Read nodes into floating nodes and to schedule them better. This explicit memory graph can also be used by optimizations.

Figure 2.5: Example of IR for a loop with Read nodes. To simplify the figure, Location, End and Begin nodes were omitted.
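
A source-level example of what the alias classes buy (our own example, with hypothetical fields a and b):

final class AliasClasses {
    static final class Holder { int a; int b; }

    static int example(int[] data, Holder holder) {
        int sum = 0;
        // data.length is in an immutable alias class: its FloatingRead can be
        // scheduled once, before the loop.
        for (int i = 0; i < data.length; i++) {
            // Field a is never written in this loop, and fields a and b are
            // distinct alias classes, so the read of a may also float out.
            sum += data[i] * holder.a;
            // The write to b forces any reads of b to stay inside the loop,
            // ordered through the memory Phi on the loop's back-edge.
            holder.b = sum;
        }
        return sum;
    }
}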

2.3.3 Snippets

The Graal compiler uses so-called snippets [64] to express low-level operations. A snippet is Java code that expresses the low-level implementation of a high-level or complex operation. For example, type checks and fast-path allocation are implemented with snippets in Graal for HotSpot. Snippets can be used during lowering (e.g., for type checks), to specify compiler intrinsics (e.g., for Java's System.arraycopy), and for some runtime stubs.


Figure 2.6: Example of IR for a loop after the Read nodes have been transformed into FloatingRead nodes and the memory graph has been introduced

Snippets are used instead of manual IR construction because they are easier to read and to maintain. For example, Listing 2.1 shows the snippet used in HotSpot to perform an "exact" type check. An exact type check checks for exactly one type instead of a full type hierarchy. This is possible, for example, when checking for final classes in Java. The snippet shows how this type check is done: first, the object is checked for null. If it is null, the type check fails. Otherwise, the "hub" (i.e., the pointer to the object's type metadata) of the object is loaded and compared to the hub of the expected type.

Since snippets are used to express low-level semantics, they need access to some low-level operations that may not be available in Java. This is done using node intrinsics. A node intrinsic is a special method whose signature matches that of a specific node's constructor. Graal automatically replaces calls to a node intrinsic with the corresponding node in the graph. Node intrinsics are special cases of the usual compiler intrinsics in that they translate to a single IR node and can only be used inside of snippets. For example, in Listing 2.2, the snippet used to implement the Math.cos compiler intrinsic uses a node intrinsic to do a foreign call (i.e., a call to a native function). Calls to the callDouble1 node intrinsic will be replaced with a ForeignCall node. Note that one of the arguments used to initialize


/**
 * A test against a final type.
 */
@Snippet
public static Object instanceofExact(Object object, KlassPointer exactHub,
                Object trueValue, Object falseValue) {
    if (probability(NOT_FREQUENT_PROBABILITY, object == null)) {
        isNull.inc();
        return falseValue;
    }
    GuardingNode anchorNode = SnippetAnchorNode.anchor();
    KlassPointer objectHub = loadHubIntrinsic(object, anchorNode);
    if (probability(NOT_FREQUENT_PROBABILITY, objectHub.notEqual(exactHub))) {
        exactMiss.inc();
        return falseValue;
    }
    exactHit.inc();
    return trueValue;
}

Listing 2.1: Snippet of the instanceofExact type check

public static double cos(double x) {
    if (Math.abs(x) < PI_4) {
        return AMD64MathIntrinsicNode.compute(x, Operation.COS);
    } else {
        return callDouble1(ARITHMETIC_COS, x);
    }
}

@NodeIntrinsic(value = ForeignCallNode.class, setStampFromReturnType = true)
private static native double callDouble1(
                @ConstantNodeParameter ForeignCallDescriptor descriptor,
                double value);

Listing 2.2: Snippet of the Math.cos intrinsic

the ForeignCall node has a @ConstantNodeParameter annotation. This annotation tells the compiler that this argument must be a compile-time constant and must be evaluated at compile time before being passed to the ForeignCallNode constructor. The other argument, on the other hand, does not have this annotation. This means that it is a runtime value, and the ValueNode representing this SSA value during compilation must be passed to the ForeignCallNode constructor. Another node intrinsic is used in this snippet to produce a node that can use the x87 FCOS instruction.


if (probability(NOT_FREQUENT_PROBABILITY, object == null)) {
    isNull.inc();
    if (!nullSeen) {
        DeoptimizeNode.deopt(InvalidateReprofile, OptimizedTypeCheckViolated);
    }
    return falseValue;
}

Listing 2.3: Snippet of the instanceofExact type check using a profile for the “null” case

Deoptimization can be used in snippets, for example to take advantage of profiling information. Deoptimization is performed using a node intrinsic that is translated to a Deoptimize node. For example, the null check used in the previous instanceofExact type-check snippet could be parametrized by whether null has already been seen during profiling. This is done in Listing 2.3, which deoptimizes if the input is null although this has never been observed during profiling. A similar idea can be applied to the type check itself if a single type has been observed during profiling.

2.4 Truffle

Truffle [85, 84, 81, 41] is a Java-based language implementation framework which relies on a self-optimizing abstract syntax tree interpreter and on Graal as a highly optimizing just-in-time compiler. In order to implement a language using Truffle, an abstract syntax tree (AST) interpreter of this language must be written. This AST interpreter can use run-time profiling and specialization to customize itself to the current usage patterns. For example, the addition node of a JavaScript interpreter must handle many different cases based on the run-time types of its operands (which can be numbers, strings, or objects). Instead of having a single kind of "add" node that handles all these cases, different kinds of "add" nodes can be used, each of them handling operands of specific types. At the start of the program, an "uninitialized" node will be used. This node is not biased towards specific operand types, but it can also not handle any input yet. The first time it is called, it rewrites itself to a node that can handle the types of the actual inputs. The node is now biased towards specific operand types. If other operand types appear later at this point in the program, the node rewrites itself to a more generic node that can handle more types. Every AST node has its own mini-interpreter, i.e., an execute() method that calls the execute() methods of its children and performs the node-specific operation on their results. An example of a Truffle interpreter is shown in Listing 2.4. The test method builds an AST for a function containing something akin to "return 21 + 21;" and executes it.

public class ExampleRootNode extends RootNode {
    private final @Children ExampleStatementNode[] body;

    @ExplodeLoop
    @Override
    public Object execute(VirtualFrame frame) {
        try {
            for (ExampleStatementNode node : body) {
                node.execute();
            }
        } catch (ReturnException e) {
            return e.getValue();
        }
        return null;
    }
}

public class ReturnException extends ControlFlowException {
    private final Object value;

    public Object getValue() {
        return value;
    }
}

public class ReturnNode extends ExampleStatementNode {
    private final @Child ExampleExpressionNode valueNode;

    public void execute() {
        throw new ReturnException(valueNode.execute());
    }
}

public class AddNode extends ExampleExpressionNode {
    private @Child ExampleExpressionNode left, right;

    public Object execute() {
        return (int) left.execute() + (int) right.execute();
    }
}

public class ConstantNode extends ExampleExpressionNode {
    private final int value;

    public Object execute() {
        return value;
    }
}

public Object test() {
    ExampleRootNode rootNode = new ExampleRootNode(new ExampleStatementNode[] {
            new ReturnNode(
                    new AddNode(
                            new ConstantNode(21),
                            new ConstantNode(21)))
    });
    return Truffle.getRuntime().createCallTarget(rootNode).call();
}

Listing 2.4: Example Truffle interpreter. The test method builds an AST for a function containing something akin to “return 21 + 21;” and executes it. For simplicity, constructors and some classes have been omitted.
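Listing 2.4 shows a stable AST and does not demonstrate the node rewriting described above. The following sketch illustrates how a self-optimizing add node could specialize itself. It is a minimal illustration, not the actual Truffle API: for brevity, the different node kinds are folded into a single class with a state field, whereas a real Truffle interpreter would rewrite the node object itself.

abstract class ExprNode {
    abstract Object execute();
}

final class SpecializingAddNode extends ExprNode {
    private enum State { UNINITIALIZED, INT, GENERIC }
    private State state = State.UNINITIALIZED; // starts out unbiased
    private final ExprNode left, right;

    SpecializingAddNode(ExprNode left, ExprNode right) {
        this.left = left;
        this.right = right;
    }

    Object execute() {
        Object l = left.execute();
        Object r = right.execute();
        if (state == State.UNINITIALIZED) {
            // First execution: bias the node towards the observed operand types.
            state = (l instanceof Integer && r instanceof Integer) ? State.INT : State.GENERIC;
        }
        if (state == State.INT) {
            if (l instanceof Integer && r instanceof Integer) {
                return (int) l + (int) r; // fast path, biased towards integers
            }
            state = State.GENERIC; // unexpected operand types: generalize
        }
        return String.valueOf(l) + r; // stand-in for the generic (e.g., string) case
    }
}

In the actual framework, rewriting a node that was already part of compiled code additionally triggers deoptimization, as discussed in Section 3.2.4.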


When the AST has been stable for a period of time, machine code specialized to the current state will be emitted using the Graal compiler. During this JIT compilation the execute() methods of the specialized AST are inlined at their call sites and the resulting code is compiled. This is a case of partial evaluation because the compiled code is specialized to the observed input (in this case a specific AST). This allows the Truffle framework to produce compiled code without needing a specific compiler for the language that is being implemented. Since the assumption that the AST will remain stable is a speculation, deoptimization is used in case the AST changes again. Section 3.2.4 contains more details and examples about how Truffle uses speculation. The example in Listing 2.4 does not use speculation beyond stability of the AST. Using the example AST built in method test, when considering fields annotated with @Child and @Children as constants and repeatedly inlining and simplifying everything in ExampleRootNode.execute, the code finally simplifies to “return 42;”. Realistic interpreters are much more complex and rely heavily on speculation. Note that the Graal compiler does not know about Truffle ASTs or about methods from the language implemented using Truffle. The optimizing Truffle runtime adds some special compiler intrinsics to the Graal compiler and uses a special inlining policy. But from the Graal compiler’s point of view, all compilations coming from the Truffle runtime compile the same root Java method: the execute() method of the root node type. The Truffle runtime replaces the parameter of this method that corresponds to the AST’s root node with a constant in the IR. Similarly, the underlying HotSpot runtime only sees different compiled code versions of this method but does not know about Truffle ASTs or methods from the language implemented using Truffle.

Chapter 3

Opportunities for Speculative Optimizations

3.1 Traditional Usages of Speculation

To better understand our speculative optimizations, we will first present traditional usages of speculation which usually occur in language runtimes that support speculative optimizations. For example, such optimizations are used in the standard compilers of the HotSpot virtual machine. For some of them we explain how they are implemented in the Graal compiler and also discuss minor improvements over their traditional implementation.

3.1.1 Exception Handling

When compiling a language supporting exception handling, exception edges have to be added to the control-flow graph to account for the control-flow transitions that can happen when an exception is raised. In a language such as Java, a significant number of instructions may throw an exception and, as a result, a large number of exception edges is needed. Using speculation, we can instead take the optimistic assumption that exceptions are not thrown and can thus completely eliminate most of the exception edges. This simplifies further analyses and optimizations by reducing the size and the complexity of the IR graph. Also, if an exception handler is not reached by any exception edge then this exception handler does not need to be parsed and compiled, which improves compilation time and reduces the size of the compiled code. When an exception is thrown at a position where the compiled code does not expect it, the runtime uses deoptimization to go back to the interpreter. The interpreter then re-executes the code and throws the exception. To implement this in the Graal compiler, we use Guard nodes that check the assumption

that no exception needs to be thrown. Any further analysis can rely on the assumption without taking any special care. For example, before loading a field with the getfield bytecode, we first need to check that the receiver is not null. A null check guard is inserted in the graph and is used to guard the memory access. When the guard encounters a null value, deoptimization is triggered and the interpreter throws the NullPointerException. Since deoptimization is costly, we need to handle exceptions efficiently in programs that use exception handling for modeling normal control-flow. Thus, if an exception is thrown frequently for a certain node, we insert an explicit exception check instead of the guard. An explicit check is simply an If node followed by an explicit exception edge for the case when the check fails. On this edge, the exception object can even be a pre-allocated object, eliminating the allocation cost.
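The difference between the two strategies can be sketched at the source level (this is only an illustration of the two compiled-code shapes, not actual compiler output; deoptimize() and CACHED_NPE are placeholders for the runtime’s deoptimization mechanism and a pre-allocated exception object):

final class NullCheckShapes {
    static final NullPointerException CACHED_NPE = new NullPointerException();

    // Shape 1: speculative guard, no exception edge in the compiled code.
    static int guardedLoad(Holder receiver) {
        if (receiver == null) {
            deoptimize(); // back to the interpreter, which throws the NullPointerException
        }
        return receiver.field; // memory access guarded by the null check
    }

    // Shape 2: explicit check, chosen when the profile shows frequent exceptions.
    static int explicitLoad(Holder receiver) {
        if (receiver == null) {
            throw CACHED_NPE; // pre-allocated exception object: no allocation cost
        }
        return receiver.field;
    }

    static void deoptimize() { throw new IllegalStateException("deoptimize"); }

    static final class Holder { int field; }
}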

try {
    foo();
    bar();
} catch (MyException s) {
    // catch handler
}

Figure 3.1: Exception handling at Invoke nodes and a simple exception dispatch chain

In the Graal IR, two different node types are used for method invocation: Invoke nodes do not expect an exception to be raised while InvokeWithException nodes act as a control-flow split between normal control-flow and exceptional control-flow. On its exception edge, an exception-aware invoke is followed by a special ExceptionObject node that represents the exception raised during the method call. This structure can be seen in Figure 3.1. In this example, the profile indicates that the call to foo has never thrown any exception, so a simple Invoke node is used. On the other hand, the call to bar has already thrown exceptions, so an InvokeWithException node is used.


When the runtime is unwinding the stack because of an exception, it directly maps the program counter of the invoke throwing the exception to the program counter of the exception branch. If such a mapping does not exist, it means that the invoke did not have an exception edge, and a mapping to the proper deoptimization metadata allows the runtime to handle the exception in the interpreter. In this case the exception has already been thrown and is in a “pending” state, so that the interpreter restarts execution at the bytecode of the invoke and immediately handles the exception. The explicit exception edges lead to a chain of If nodes that check the type of the exception object to find the right exception handler. At the end of this chain, if the type of the exception does not match any of the exception handlers, control-flow goes to an Unwind node that forwards the exception to the next method on the call stack. A simple example is illustrated in Figure 3.1. Explicitly modeling exception handling chains for points where exceptions actually happen allows us to apply the same optimizations to the exception handling code as to the rest of the code. For example, if the exact type of an exception object is discovered through inlining or other transformations, the exception handling chain can be optimized away and the exception-throwing instruction can jump directly to the correct handler.

3.1.2 Unreached Code

While branches that are explicitly used for exceptional behavior are an obvious target for speculative optimization, it is also common for parts of the normal control-flow to handle unlikely cases. These may be corner cases that do not appear very often or simply cases that may not appear depending on the input data. Such behavior can be observed using branch profiles. For a control-flow split, a branch profile tells us how many times each successor has been selected. This data is often collected to improve low-level block ordering such that the most likely path through the control-flow graph forms a straight sequence of contiguous blocks. When the profile says that a successor has never been selected, we can speculate that it will never be used in the future. We replace this successor with a deoptimization and thus remove unreached code from the compilation unit. By reducing the size of the compilation unit, we improve both compilation speed and code locality. This size reduction also allows more budget for inlining. More importantly, by cutting the unlikely branch off, we prevent it from merging back into the control-flow. This is important because merge points in the control-flow usually lead to merge points (Phi nodes) in the data-flow. These data-flow merge points are detrimental to many optimizations. For example, they dilute the dynamic types of merged reference values by forcing a union operation, potentially preventing further optimizations such

as devirtualization. They can also force the materialization of an object that would otherwise be scalar-replaced by escape analysis (see [69]).
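At the source level, the transformation can be sketched as follows (an illustration of the compiled-code shape, not actual compiler output; slowPath() and deoptimize() are placeholders):

final class UnreachedCodeShapes {
    static int compute(int x) {
        if (x >= 0) {
            return 2 * x;       // always taken according to the branch profile
        } else {
            return slowPath(x); // never reached during profiling
        }
    }

    // Compiled speculatively, the method behaves as if it were written as:
    static int computeCompiled(int x) {
        if (x < 0) {
            deoptimize();       // the unreached branch is cut off: neither parsed nor compiled
        }
        return 2 * x;           // straight-line code, no merge point and no Phi node
    }

    static int slowPath(int x) { return -x; } // placeholder for the rare case
    static void deoptimize() { throw new IllegalStateException("deoptimize"); }
}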

3.1.3 Type Assumptions

Different kinds of speculations can be applied for operations based on the dynamic type of an object. First, profiling data can be exploited. For example, for a dynamic type check such as Java’s instanceof or cast operation, if the profile tells the compiler that a single concrete type has been seen at this location, we can speculate that this will also be the case in the future. In this case, an exact type check can be used by the compiler. Instead of performing the full sub-class check, the type of the object can be directly compared to this single concrete type, which is usually faster than the full sub-class check. In case of success, the compiler already knows the outcome for this concrete type and after the exact type check, the object’s exact type is known statically. If the exact type check does not succeed at run-time, deoptimization can be used to let the interpreter perform the full sub-class check and continue execution. A similar strategy can be used to devirtualize virtual calls. If a single concrete type has been seen for the receiver, an exact type check can be performed in order to make the call monomorphic and thus to devirtualize it. Such a call is then a candidate for inlining. When the type check fails, deoptimization can be used. This can also be extended to polymorphic cases where several types are observed during profiling by doing a cascade of exact type checks which finishes with a deoptimization. For example, Graal uses this technique to apply polymorphic inlining for up to 8 concrete types. This kind of speculation is important because calls are usually a barrier to optimizations. Speculative inlining allows the compiler to inline methods also in cases where static devirtualization cannot be applied. Finally, a range of techniques can be applied by speculating that the class hierarchy will not change [14]. For example, one can find the single implementation of an abstract method or the single concrete implementation of an abstract type. This kind of analysis can be used to improve type checks and to devirtualize calls. However, since Java supports dynamic class loading, the class hierarchy can change at run-time. In order to speculate on the state of the class hierarchy, instead of using a guard, an assumption is registered. This assumption is linked with the compiled code. If class loading later invalidates an assumption, the corresponding code can be evicted from the code cache. However, it is not enough to evict a method from the code cache as a thread could be currently executing that code. The VM first needs to pause all threads and to inspect the code that they are running. However, it cannot stop the threads at any location. Rather, it uses a safepoint mechanism (described by Hölzle, Chambers, and Ungar [40] as “interrupt points”) that ensures that all threads are stopped at safe positions. When the threads are stopped, the VM walks their stacks and inspects all the methods that need to be deoptimized and are currently on any of the threads’ stacks. Method activations at the top of a stack are deoptimized in place. Other activations are patched in such a way that they will be deoptimized when execution returns to them.
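Returning to the profile-based exact type check, the monomorphic devirtualization case can be sketched as follows (an illustration of the compiled-code shape, not actual Graal output; deoptimize() is a placeholder):

final class DevirtShapes {
    interface Shape { int area(); }

    static final class Square implements Shape {
        int side;
        public int area() { return side * side; }
    }

    // Source: a virtual call whose profile only ever saw Square receivers.
    static int area(Shape s) {
        return s.area();
    }

    // Compiled with a monomorphic type speculation.
    static int areaCompiled(Shape s) {
        if (s.getClass() != Square.class) {
            deoptimize();         // unexpected receiver type: back to the interpreter
        }
        Square sq = (Square) s;   // the exact type is now statically known
        return sq.side * sq.side; // the call is devirtualized and inlined
    }

    static void deoptimize() { throw new IllegalStateException("deoptimize"); }
}

The polymorphic case chains several such exact type checks, with a deoptimization at the end of the cascade.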

int limit = ...
for (int i = 0; i < limit; i += stride) {
    // ...
}

Listing 3.1: Simple loop structure


3.1.4 Loop Safepoint Checks Elimination

Virtual machines that use a safepoint mechanism often need to insert safepoint checks in loops. These checks can be removed if the loop is guaranteed to terminate in a “reasonable” amount of time (the thread will then be stopped after the loop and not in the loop). The Graal compiler uses a heuristic to decide if that is the case. It assumes that leaf loops that do not contain calls to other methods terminate fast enough if their number of iterations is statically known to be in the range of a 32 bit integer. This works well for a lot of loops that have the structure depicted in Listing 3.1. Indeed, if the stride is 1 then this loop’s iteration count is limit, which is a 32 bit integer. However, if the stride is larger than 1, the counter (i) could overflow and the loop might not terminate at all. In this case, we can speculate that limit ≤ MAX_INTEGER − stride and that overflow thus cannot happen. This condition can be checked once above the loop using a Guard node. We can then eliminate the safepoint check from the loop. This is beneficial if the loop’s iteration count is high enough such that the gains of not executing the safepoint check can cover the extra condition checked before the loop.
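At the source level, the transformation can be sketched as follows (illustrative only; deoptimize() stands in for the guard’s deoptimization):

final class SafepointCheckElimination {
    // With stride > 1 the counter i could overflow, so without further knowledge
    // a safepoint check has to be kept in the loop body.
    static long loop(int limit, int stride) {
        long sum = 0;
        for (int i = 0; i < limit; i += stride) {
            sum += i; // a safepoint poll would be emitted in here
        }
        return sum;
    }

    // With the speculation: a single guard above the loop proves that i cannot
    // overflow, so the loop terminates quickly and needs no safepoint check.
    static long loopCompiled(int limit, int stride) {
        if (!(limit <= Integer.MAX_VALUE - stride)) {
            deoptimize();
        }
        long sum = 0;
        for (int i = 0; i < limit; i += stride) {
            sum += i; // no safepoint poll in the body
        }
        return sum;
    }

    static void deoptimize() { throw new IllegalStateException("deoptimize"); }
}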

3.2 Advanced Usages of Speculation

Speculation can also be used to perform more advanced optimizations. We also explain how the Truffle API offers language implementers a generic framework for speculation. This allows them to implement new speculative optimizations for high-performance language interpreters without having to modify the Graal compiler.


3.2.1 Speculative Alias Analysis

Alias analysis [16] is used to decide if two accesses can refer to the same location at run-time. This is useful for various optimizations in the compiler. For example, it can be used to find out whether memory accesses can be reordered. It can also be used during vectorization of loops to find out if and how the memory writes of one iteration affect the memory reads of others. Statically proving an absence of aliasing between two variables is often hard or impossible if they are both of the same type. However, we can insert dynamic checks to see if the two variables actually point to the same concrete object: if they do not, then we have (dynamically) guaranteed that they do not alias.
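Such a dynamic check can be sketched as follows (illustrative only; deoptimize() is a placeholder). Note that in Java two distinct arrays never overlap, so a single reference comparison is enough to establish the absence of aliasing:

final class AliasCheckShape {
    static void addOne(int[] src, int[] dst) {
        if (src == dst) {
            deoptimize(); // speculation failed: both variables refer to the same array
        }
        // src and dst are now dynamically guaranteed to be distinct objects, so the
        // loads from src and the stores to dst can be reordered or vectorized.
        int n = Math.min(src.length, dst.length);
        for (int i = 0; i < n; i++) {
            dst[i] = src[i] + 1;
        }
    }

    static void deoptimize() { throw new IllegalStateException("deoptimize"); }
}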

3.2.2 Speculative Store Checks

The Java Virtual Machine specification [49, §6.5 aastore] requires a “store check” when storing an object reference into a reference array. The Java language and bytecode verification already ensure that the declared type of the stored reference is assignable to the declared element type of the array. However, the dynamic type of the array’s elements could be a subclass of their declared type. To be sure that this operation is type-safe, a run-time “store check” must happen. The store check ensures that the dynamic type of the reference being stored is assignable to the dynamic type of the array elements. This type of check does not only appear in the Java platform; it appears as soon as a language has covariant arrays. For example, the C# language requires the same type of checks. In general, such type checks can be complex and can expand to a large subgraph in the intermediate representation. A full type check even requires a loop [11]. However, in most cases the dynamic type of the array is the same as the declared type of the array. If we speculate that this is true, then, because of the properties already checked during compilation and bytecode verification, the store check is not necessary anymore. This speculation is also relatively cheap to check since it is an exact type check. This exact type check tests that the dynamic type of the array is exactly the declared type and not some subtype. This can be performed by a direct comparison of the array’s dynamic type with a constant. Another advantage is that the check does not depend on the type of the reference being stored, so if multiple references are stored, a single check suffices. This speculation works particularly well for Java’s collection framework. For example, Java’s ArrayList is backed by an Object array. While it would be possible to statically prove that it is exactly an Object array and not an array of subclass objects, this would require an extensive whole-class analysis (or even whole-program because of Java reflection). The speculative optimization,

on the other hand, can be done locally in the context of a single compilation unit.
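The resulting compiled-code shape can be sketched as follows (illustrative only; deoptimize() is a placeholder):

final class StoreCheckShape {
    static void fill(Object[] array, Object value) {
        // Speculate that the array's dynamic type is exactly Object[]: a single
        // comparison with a constant replaces the per-store sub-type check.
        if (array.getClass() != Object[].class) {
            deoptimize();
        }
        for (int i = 0; i < array.length; i++) {
            array[i] = value; // store checks elided: any reference fits into an Object[]
        }
    }

    static void deoptimize() { throw new IllegalStateException("deoptimize"); }
}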

3.2.3 Speculative Guard Motion

To avoid unnecessary deoptimizations, run-time checks performed to test the validity of a speculation are usually not executed unless the speculation will be used. For the compiler, this means that Guard nodes are always scheduled at a position where it is guaranteed that at least one of the guarded nodes will be executed. Lifting this restriction gives the compiler more freedom in the scheduling of guards at the price of potential unnecessary deoptimizations. This optimization is described in detail in Section 6.1.

3.2.4 Truffle

Truffle API

As described in Section 2.4, the Truffle framework can be used to create high-performance language implementations. This can be done at a high level using node rewriting through the Truffle DSL [41]. The Truffle framework then uses the Graal compiler to apply partial evaluation to the language interpreter by speculating that the AST of a method will not change. This partial evaluation coupled with aggressive and predictable inlining allows us to leverage the interpreter to automatically produce compiled code. This is also called the 1st Futamura projection [24]. If a later rewrite changes the AST, deoptimization is used to continue execution with the new AST and eventually another partial evaluation will be applied to the new AST. The Truffle API also exposes a lower-level API to take advantage of speculation. Some annotations (such as @Child or @CompilationFinal) allow the programmer to specify fields which should be considered constant during partial evaluation. Here the programmer explicitly speculates that these values do not change in the future. If these values need to be changed, the optimized code must first be exited with an explicit deoptimization.

Explicit Deoptimization

The transferToInterpreterAndInvalidate() method allows the language developer to explicitly request deoptimization. The compiled code of the currently executing method is discarded and the AST that was considered constant during the previous partial evaluation can be mutated. After that, another partial evaluation and compilation can take place. For example, in Listing 3.2, the @CompilationFinal annotation and the transferToInterpreterAndInvalidate() method are used to create a branch profile that is used to speculate on whether a branch has ever been entered. By default the visited field is false. During partial evaluation,


public final class BranchProfile extends NodeCloneable {
    @CompilationFinal private boolean visited = false;

    public void enter() {
        if (!visited) {
            CompilerDirectives.transferToInterpreterAndInvalidate();
            visited = true;
        }
    }
}

public class SomeNode extends Node {
    private final BranchProfile profile;
    ...
    private boolean isComplexCase(...) { ... }

    public void execute(...) {
        if (isComplexCase(...)) {
            profile.enter();
            // handle rare complex case
            ...
        }
        ...
    }
}

Listing 3.2: BranchProfile class using the Truffle API for speculation

enter is inlined into execute and the load of the visited field is replaced with its current value, which allows the compiler to simplify the if (!visited) statement. If this value was still false (i.e., if isComplexCase was always false so far), then the enter method always calls transferToInterpreterAndInvalidate(). The call to transferToInterpreterAndInvalidate() is replaced by an explicit deoptimization and the branch that contains this deoptimization will be removed from the compilation unit and replaced with a guard. Due to the inlining performed during partial evaluation, this means that the branch handling the complex case is eliminated and replaced by a guard checking that isComplexCase remains false. If that condition becomes true, the compiled code is deoptimized, visited is set to true and further partial evaluations will include this branch.

Assumptions

Another type of speculation can be done with Assumption objects of the Truffle API. When assumption objects are created, they are considered to be in a “valid” state. They will remain valid until they are explicitly invalidated by calling their invalidate() method. When running code that relies on such an assumption, the check() method is called to verify that it is still valid.

public class AddNode extends Node {
    private static final Assumption basicAdd;
    private @Child Node left, right;

    public int executeInt() {
        try {
            basicAdd.check();
        } catch (InvalidAssumptionException e) {
            // rewrite
            ...
        }
        return left.executeInt() + right.executeInt();
    }

    public static void notifyAddRedefined() {
        basicAdd.invalidate();
    }
}

Listing 3.3: A node using a Truffle Assumption to speculate

This exposes the ability to deoptimize code externally by requesting a safepoint [13]. Indeed, during partial evaluation, calls to the check() method are completely eliminated. Instead, a dependency is registered between the compiled code and the assumption object. When the invalidate() method is called for an assumption, a safepoint is used to deoptimize all compiled code that has a dependency on this assumption object. For example, in Listing 3.3 an AST node implementing basic integer addition assumes that the language’s addition operation has not been redefined1. This is done through a global assumption. Every time an addition is done, the assumption is checked. In terms of compiled code, this just means that any code using the default addition registers a dependency on this assumption object. No code is emitted for the check. If integer addition is redefined, the assumption is invalidated using notifyAddRedefined() and compiled code depending on it is invalidated and deoptimized. Any AST that contains this basic AddNode will eventually execute the check in the interpreter. The check will fail, the exception will be thrown, and the node will be rewritten to one that executes the redefined addition. Since this is a completely different node, this case does not need to be handled in the AddNode. By combining this generic assumptions API with the possibility to explicitly trigger deoptimization, the Truffle API makes it possible to use speculation for anything language implementers may need. In order to choose the right type of speculation, it is useful to understand the associated costs. In Chapter 4, we will look into these costs.

1This is possible for example in Ruby where the + operator of Fixnum can be redefined


Chapter 4

The Costs of Speculation

While speculation allows many optimizations, it comes at a cost. In this chapter we look at these costs. We first look at the run-time costs of checking whether speculations hold. We then look at the memory footprint costs associated with speculation. This aspect is less commonly studied and in particular we look at the effects when speculation is heavily used. Finally, we bring forward a problem that is usually not discussed in the literature: the consequences of speculation on the compiler IR and how different representations of speculation in the IR can constrain possible optimizations.

4.1 Runtime Costs

Speculation causes run-time costs in two different ways: when a run-time check must be performed and when deoptimization occurs. Using assumptions or dependencies allows moving the costs of run-time checks from where the speculation is used to where the speculation may be invalidated. The compiler and the runtime always try to use speculative optimizations only when their costs are lower than their gains. However, the increased use of speculation results in an increased percentage of time spent on checking their validity. Even if there are still gains, the gains get smaller. For example, in Figure 4.1 we discuss the performance of three hypothetical compilers. Compiler 1 uses only a few speculative optimizations, for which it has to insert a few run-time checks. Compiler 2 uses more speculative optimizations, thus the resulting code can complete the same amount of work in fewer cycles. However, because of the additional speculations, more cycles are spent on run-time checks. These checks now make up a bigger proportion of the cycles necessary to complete the work. It would be beneficial to spend some compilation time on optimizing those run-time checks. This is what has been done in compiler 3, which results in even better performance. When deoptimization occurs, it causes certain run-time costs. Part of this overhead comes from the fact that execution is suspended while the VM



Figure 4.1: Hypothesis on cycles spent on the same workload by the code produced by three hypothetical compilers. Compiler 1 uses almost no speculative optimizations, compiler 2 uses speculative optimizations, and compiler 3 uses speculative optimizations and additionally optimizes the run-time guards introduced by these speculative optimizations.

rebuilds the interpreter state. The other reason is that execution restarts in the interpreter, which is slower than compiled code. Figure 4.2 illustrates a possible life cycle of a method. First, the method is executed in the interpreter (1) until the VM triggers its compilation. During the compilation (2), the code continues to run in the interpreter. When compilation finishes and the compiled code has been installed, execution continues (3) at a much higher speed. When a deoptimization event occurs, execution is suspended (4) while the interpreter frames are being rebuilt. Execution then restarts in the interpreter and new profiling information is gathered (5). Finally, a new compilation is requested and runs in the background (6). During this compilation the speculation that failed before and caused deoptimization is not used any more. The new compiled code is installed and execution continues at high speed again, although a bit slower than before because the failing speculation was not used (7). Overusing speculative optimizations therefore does not improve peak performance but severely worsens warm-up time.

4.1.1 Assumptions vs. Guards

The run-time checks necessary for speculation can be performed at two positions: either in the compiled code, just before using the speculation, or in the runtime, when changing something that might invalidate speculations made by some compiled code. For example, if a function speculates that its parameter is of a


specific type, this could either be checked in compiled code when entering the function, or in the runtime when loading new code into the VM that contains a call to this function, by static type-checking of the parameters passed during this call.

Figure 4.2: Execution speed over time with a deoptimization event

Assumptions

Assumptions (also called “dependencies” in the HotSpot VM) are speculations for which the check is performed when changing some global state. Their advantage is that they do not require any instruction in the compiled code where the speculation is used. However, they require maintaining a list of compiled code that uses them. They also always require the VM to use a safepointing mechanism that stops all threads before performing the deoptimization. This means that while they have no run-time cost as long as deoptimization is not necessary, they are rather costly when deoptimization does happen.

Guards

Guards are speculations for which the check is performed directly in the compiled code. They can be used when there is no global state that could be watched with an assumption. Since they need to be executed just before the code they guard, their condition needs to remain relatively simple. They offer more possibilities than assumptions when deoptimization happens: since

other threads executing the same code are also going to execute the guard, it is often not necessary to use a safepoint to atomically deoptimize all threads.

4.2 Memory Footprint

While deoptimization is a rare event, the information needed to deoptimize is always present alongside the optimized code. In the HotSpot VM, metadata is associated with the program locations where deoptimization might be needed. This metadata is then used by the VM to rebuild the state of the interpreter from the state of the machine. Since compilers can use inlining, a single frame from compiled code can correspond to multiple interpreter frames. For example, on the JVM, the metadata of a location in inlined code contains a stack of JVM frames that need to be rebuilt. Each frame is an activation of a specific JVM method at a specific bytecode index (“BCI”). For each frame there is also a mapping from JVM stack and local values to their locations in the native registers and stack. This metadata is present for all program locations that may cause deoptimization. Thus, the more deoptimization is used, the more metadata is produced by the compiler and needs to be stored for the lifetime of the compiled code. The amount of produced metadata is significant; in fact, there is often more metadata than code to be stored after a compilation. In environments where memory is scarce this can be an obstacle to an extensive use of deoptimization. Memory can be an issue both on small low-end devices as well as on server systems where a lower memory usage means a higher service density.
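The shape of this metadata can be sketched as follows (a hypothetical illustration: the names are ours and do not correspond to HotSpot’s actual data structures):

// One entry per program location that may deoptimize.
record DeoptSite(long codeOffset, InterpreterFrame[] frames) {} // innermost frame first

record InterpreterFrame(
        String method,            // the JVM method of this activation
        int bci,                  // bytecode index at which the interpreter resumes
        ValueLocation[] locals,   // where each JVM local currently lives
        ValueLocation[] stack) {} // same for the operand stack

// A value is found in a register, in a stack slot of the compiled frame,
// or is a compile-time constant.
sealed interface ValueLocation permits Register, StackSlot, Constant {}
record Register(int number) implements ValueLocation {}
record StackSlot(int offset) implements ValueLocation {}
record Constant(Object value) implements ValueLocation {}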

4.2.1 Experimental Data

We slightly extended existing instrumentation from the HotSpot VM to gather statistics about the memory footprint of metadata. The information is collected when the result of a compilation is inserted in the VM’s code cache. Thus, it shows how much metadata has been generated by the compiler over the whole lifetime of the VM and not statistics about the contents of the code cache at a specific point in time. These contents could be different at a specific point in time because unused and deoptimized code can be evicted from the code cache.

Java Benchmarks

In Figure 4.3, we show the amount of code and metadata installed into the VM’s code cache for a number of Java benchmarks when using the Graal compiler without any optimization targeted at reducing metadata footprint. We used the DaCapo [6] and Scala-DaCapo [62] benchmarks, which simulate typical applications running on the JVM. We divided the memory footprint of

installed code into four categories. Code corresponds to machine code. Deoptimization metadata is data that is used only for deoptimization: it encodes the state of each frame that needs to be reconstructed during deoptimization. The Constants category is for constants that can be referenced either from deoptimization metadata or from the code. Finally, the Other category accounts for relocation information as well as other data needed for the VM’s bookkeeping of the code cache. We can already see that the footprint of deoptimization metadata is not negligible, as it occupies 1.7 to 1.8 times as much space as the code itself. This makes it the main contributor to the memory footprint of JIT installed code.


Figure 4.3: Amount of data installed into the VM’s code cache for various Java benchmarks. Code corresponds to machine code. Deoptimization metadata is data that is used only for deoptimization. Constants is for constant values that can be referenced either from deoptimization metadata or from the code. Other is for relocation information as well as other bookkeeping data.

We ran the same measurement for other compilers of the HotSpot VM. The results are summarized in Table 4.1 where we show the average memory footprint per compiled method. We found that for the Server compiler metadata

occupies 1.4 to 2.0 times as much space as code; for the Client compiler this ratio is between 1.3 and 1.5. The Graal compiler typically produces more deoptimization metadata than the two other compilers. Some of the differences can be explained by the different typical sizes of the compilation units due to different inlining policies. Indeed, we can see that on average, the Graal compiler produces bigger compilation units than the Server compiler (1.5× to 2.1× bigger) which in turn produces larger compilation units than the Client compiler (1.3× bigger). With deeper inlining, most deoptimization sites need to store metadata for more frames. Some of the differences can also be explained by the intensive usage of deoptimization in the Graal compiler. Overall, these numbers are similar in magnitude to the numbers put forward by Hölzle, Chambers, and Ungar [40] for the Self VM.

Truffle

Languages implemented using Truffle rely heavily on speculation (see Section 3.2.4). In Figure 4.4 we look at the amount of code and metadata installed in the code cache for some JavaScript benchmarks from the Octane suite. To run these, we use TruffleJS, a JavaScript implementation using Truffle. Overall, much more data is installed in the code cache and most of it is deoptimization metadata. In Table 4.1, we have summarized these numbers under the name Octane. For this benchmark suite, we could only collect numbers for the Graal compiler as it is the only compiler that is supported for Truffle’s partial evaluation. We can see that the code cache contains more than 83 times as much metadata as it contains code. In particular, much more deoptimization metadata is produced than for the Java benchmarks. This can be explained by multiple factors. First, by design, Truffle interpreters use deoptimization a lot. But the size of deoptimization metadata is also influenced by the depth of the inlining and the number of objects that are replaced by scalars after escape analysis, and Truffle relies heavily on both. We can also see that, compared to the Java benchmarks, there is an unusually large amount of metadata categorized as “constant”. Since Truffle relies heavily on partial evaluation, many things are constant at compile time (AST, profile, inline caches, etc.). At the end of the compilation, these constants can be found in the metadata. Overall, we can see that in the context of Truffle, reducing the memory footprint of deoptimization metadata becomes very important.



Figure 4.4: Amount of data installed into the VM’s code cache for various JavaScript benchmarks from the Octane suite running on TruffleJS. This shows that for Truffle, almost the entire memory overhead is used for deoptimization metadata.

Table 4.1: Memory footprint of deoptimization metadata in bytes for different benchmarks and compilers (DaCapo, Scala-DaCapo and SPECjvm2008 for the Graal, Client and Server compilers, and Octane for the Graal compiler). For each configuration, the table reports the number of compiled methods and the average memory footprint per compiled method, split into code, deoptimization metadata, constants, other metadata, and the total. All results are given ± the width of the 99 % confidence interval.


4.3 Managing Deoptimization Targets

Finally, besides runtime and footprint costs of speculation, we want to discuss a cost that is often overlooked. In this section we look at how the modeling of speculation in compiler IRs can affect the possibilities for optimizations. Some models limit the possibilities for speculative optimizations or increase their costs. Other models offer more freedom for speculative optimizations but prevent other traditional compiler optimizations.

4.3.1 Java Memory Model Constraints

The deoptimization process must respect the Java Memory Model (JMM) [50, 28] and its rules that dictate which program transformations are valid. The JMM allows some reordering of observable side-effects1, but in general, side-effects cannot be repeated or ignored. There are two consequences of this:

Reordering of side-effects Two side-effecting nodes can only be reordered if there is no deoptimizing node between them. Indeed, if there is a deoptimizing node between two side-effecting nodes (or if one could be added later), these side-effecting nodes cannot be reordered even if otherwise allowed by the JMM. In Figure 4.5, we can see an example of this situation. After parsing a program that contains two side-effecting nodes (in this case memory writes), a hypothetical optimization swaps those nodes. Later, a deoptimizing node is inserted between the two writes. At this point, the only possible deoptimization targets would be:

• Either after putfield a, which results in writing field b once or twice and never writing field a (traces 2 and 3).

• Or before putfield a, which results in writing field b twice and writing field a in between (trace 1).

None of these targets leads to valid executions. As a consequence, optimizations that may reorder side-effecting nodes cannot be performed until the final positions of all possible deoptimizations are known.

Deoptimization targets If there is a deoptimizing node between two side-effecting nodes, the deoptimization target must be between the corresponding side-effecting bytecodes. Indeed, using a similar reasoning as for the previous point, deoptimizing to a target outside of this range would lead to either repeating the first side-effect or omitting the second one.

1An “observable side-effect” is one of the JMM’s write, synchronization or external actions.



Figure 4.5: Example of parsing and transformations leading to a deoptimization between two reordered side-effecting nodes and the resulting execution possibilities

In Figure 4.6, we can see that the traces that deoptimize to a location before the first side-effecting bytecode (1) or after the second one (3) are not valid. On the other hand, the trace that deoptimizes to a location between the side-effecting bytecodes (2) is valid. As a result, there is a range of valid deoptimization targets, bounded by side-effecting bytecodes.

This means that if a deoptimization has an associated deoptimization target, swapping it with a side-effecting node invalidates that deoptimization target.



Figure 4.6: Deoptimization target possibilities relative to the surrounding side-effecting nodes

4.3.2 Deoptimizing to the Previous Side-effect

If we want to keep the possibility to move and insert deoptimizing nodes, we need to know all the possible deoptimization targets, even if no deoptimization currently targets them. As we have seen before, for each deoptimizing node, all bytecode positions between the surrounding side-effecting bytecodes are valid targets.

Creating Deoptimization Metadata

The deoptimization process needs to recreate the full program state at the deoptimization target. The information necessary to recreate this state is the deoptimization metadata (FrameState nodes in the Graal IR). Besides the location of this target, the bulk of this metadata is the mapping from values of the unoptimized program to values in the optimized program. During parsing of the bytecode the compiler can keep track of this mapping to be able to create deoptimization metadata when necessary. However, once parsing is finished, keeping this mapping for all nodes in the control-flow is impractical: all of these nodes would need data-flow edges to all the values necessary to recreate the deoptimized program state. The number of edges would have negative consequences for the memory footprint and the compilation speed. More importantly, these data-flow edges

would introduce a lot of ordering constraints between otherwise unrelated nodes, hindering optimization opportunities. It is also not possible to recreate this mapping later, as that would require the compilation process to be entirely deterministic, which is not possible since we want to use speculative optimizations that depend on profiling data that changes from one compilation to another. As a result, it is necessary to keep this mapping only at selected locations in the IR. Using the observation that deoptimization targets must be between the locations of the surrounding side-effecting nodes, we only keep information about deoptimization targets right after every side-effecting node in the IR. For a specific side-effecting node, this deoptimization target can then be used for all deoptimizing nodes until the next side-effecting node, which in turn provides a new deoptimization target for the deoptimizing nodes that follow. This model makes it possible to insert and move deoptimizing nodes during compilation since they are not associated with any particular deoptimization target. They can float anywhere and, after the final schedule, they will just use the deoptimization target attached to the previous side-effecting node. However, using this model has the drawback that side-effecting nodes cannot be re-ordered, which prevents certain optimizations. This model is used in parts of the Graal compiler (as we will see in Chapter 5) and in the Crankshaft compiler of the JavaScript V8 engine [23].
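Schematically, the rule can be pictured as follows (pseudo-IR, not actual Graal syntax):

write a      // records FrameState F1: the state after "putfield a"
guard c1     // on failure, deoptimizes to F1
guard c2     // also deoptimizes to F1: guards share the previous state
write b      // records FrameState F2: the state after "putfield b"
guard c3     // on failure, deoptimizes to F2

Guards can be freely inserted or moved between the two writes; whatever their final position, their target is simply the FrameState of the closest preceding side-effect.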

4.3.3 Fixed Deoptimizations

Some compilers take a different approach to the management of deoptimization targets and assign the deoptimization target immediately when creating the deoptimizing nodes. This means that deoptimizing nodes can only be created during bytecode parsing while there is still enough information to assign deoptimization targets. In this model, since deoptimizing nodes have a target, they cannot be reordered. However, it lets the compiler reorder side-effecting nodes. This is the model used in the existing compilers of the HotSpot VM: C1 and C2. It is also used in parts of the Graal compiler (as described in Chapter 5). Figure 4.7 illustrates the types of reordering possible when deoptimizing nodes are fixed. In this example, both the Deoptimize node and the ForeignCall node are deoptimizing nodes: the Deoptimize node always deoptimizes and the ForeignCall node may deoptimize due to a safepoint. The Write node for field y was moved below the write to field x since side-effecting nodes can be reordered. However, while this side-effecting node is pushed down, it must be pushed down into all the branches. When reaching the control-split, the compiler must ensure that the side-effect will be performed regardless of which branch is taken and it creates a copy of the side-effecting node for each branch. When the Deoptimize node is encountered, the side-effecting node



cannot be pushed down further and must be committed. For the same reason, the side-effecting node cannot be pushed below the ForeignCall node on the other branch. This transformation is interesting because if x and y are consecutive locations in memory and the writes are directed to the same base object, both writes could be performed in a single instruction. For example, two 32 bit fields can be written in a single 64 bit store.

Figure 4.7: Moving a guard check before a side-effecting node when deoptimizing nodes are fixed

Note that if two branches merge without intervening deoptimizing nodes, side-effecting instructions can be moved without being duplicated in the branches. Side-effecting instructions from inside the branches can also be merged if they just differ by their input values. This is illustrated in Figure 4.8.

4.4 Conclusion

The capabilities of both models are important in an optimizing compiler using speculative optimizations. Being able to insert deoptimizing nodes after bytecode parsing gives more freedom in the structure of the compiler and makes it possible to use more aggressive optimizations. It also allows optimizations to make more global decisions taking into account the whole compilation unit. The ability to move deoptimizing nodes is important in order to reduce the run-time costs of speculation. It allows the compiler to easily group and merge run-time checks as well as schedule them optimally. Finally, being able to re-order side-effecting nodes is important for some advanced optimizations such as vectorization.



Figure 4.8: Moving and reordering side-effecting nodes when deoptimizing nodes are fixed. The Write nodes for field x were merged below the control-flow diamond. The one for field y was moved through the control-flow diamond and below the Write to x.

Chapter 5

Optimization Stages

In Section 4.3 we presented two models for the management of speculation in the compiler IR. In particular, we discussed the advantages and drawbacks of both models. To be able to profit from the benefits of both models, we now propose to transition from one model to the other during compilation. The compilation starts out with side-effecting nodes being fixed in the control-flow, while speculative optimizations can insert and move deoptimizing nodes, which are still floating. Then, deoptimizing nodes are fixed to their final locations and a deoptimization target is assigned to them. At this point, the Graal compiler could already emit machine code and the compilation would be complete. However, we instead propose to continue optimizations. In this second stage, the compiler can optimize side-effecting nodes. It can also lower (see Section 2.3.2) operations which need to be atomic with regard to deoptimization. The overall process is illustrated in Figure 5.1.

1st Stage: speculative optimizations (floating guards, FrameStates at side-effects)
Guard Lowering
FrameState Assignment
2nd Stage: optimizations of the schedule of side-effecting nodes (fixed deoptimizations, FrameStates at deoptimizations)

Figure 5.1: Organization of the optimization stages


5.1 First Stage: Optimizing Guards

5.1.1 Representation

In the first compilation stage, side-effecting IR nodes are fixed in the control-flow. They may be removed by an optimization but they cannot be reordered. Each side-effecting node is associated with deoptimization metadata in the form of a FrameState node. Deoptimizing nodes on the other hand can be modeled as floating nodes and can be reordered with each other and with side-effecting nodes. They are not yet associated with any deoptimization target. New deoptimizing nodes can also be inserted by speculative optimizations. The IR provides Guard nodes that can be used by speculative optimizations to check their assumptions at run time. Guard nodes are floating nodes that take the condition that needs to be checked as input. An additional input, called the guard’s anchor, is used to denote the branch in which the condition needs to be checked. This scheduling restriction is not used for correctness reasons but because of performance concerns. The deoptimization process has high performance costs in itself since it involves saving live values, tearing down the compiled frames, and creating interpreter frames. It also causes the rest of the method to be executed in the interpreter, and if the method is still hot, a new compilation is necessary. For these reasons, it is generally not desirable to execute guards which are not required. The additional scheduling constraint for guards ensures that they are executed only if what they guard is guaranteed to be executed. Without this constraint, global value numbering of guards could cause them to be executed speculatively in a dominating block. The representation used at this stage is illustrated in Figure 5.2. The side-effecting Write node is associated with a FrameState representing the interpreter’s state after the corresponding putfield bytecode has been executed. The Read node uses a floating Guard node to check the compiler’s speculation that obj2 is not null and thus the y field can be read and no exception needs to be thrown1.

5.1.2 Optimizations

In the remainder of this section, we discuss some of the optimizations suitable for this stage. First, all the optimizations described in Chapter 3 should be performed here. They take advantage of the possibility to introduce Guard nodes for speculation. In this stage, optimizations can also take advantage of the possibility to move Guard nodes, by changing their anchor.

1A similar null check on obj1 for the Write node has been omitted for simplicity


obj1.x = 1;
if (c) {
    t = obj2.y;
} else {
    ...
}

Figure 5.2: Example of IR during the first stage. Floating Guard nodes are used and side-effecting nodes have an associated FrameState node.

Guard Anchor Optimization

In order to give the scheduler the most freedom to move guards, the guard’s anchor should be as high as possible in the control-flow. Optimally, guards should always be anchored at the beginning of the block where they are guaranteed to be needed (i.e., the highest block that is post-dominated by at least one of their usages). The first compiler stage makes sure that this is the case. Also, if all branches of a control-split require the same Guard node, these Guard nodes are replaced with a single Guard node in the branch above the control-split. This optimization was partially described by the author in [20, §4.2].

Speculative Guard Motion

This optimization tries to execute Guard nodes speculatively regardless of the possible false positives described above and their performance implications. The main idea is to be able to speculatively execute Guard nodes outside of loops if their condition is loop invariant. This optimization is speculative because it moves Guard nodes that are control-dependent on nodes in the loop. We speculate that there is no correlation between the condition that is checked by the Guard node and the condition under which it is executed in the loop. This optimization is described in more detail in Section 6.1.
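At the source level, the effect can be sketched like this (illustrative only; deoptimize() is a placeholder):

final class GuardMotionShape {
    // Before: the bounds guard is control-dependent on c[i] and sits inside the loop.
    static int sumBefore(int[] a, boolean[] c, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            if (c[i]) {
                if (i >= a.length) {
                    deoptimize(); // bounds guard executed on every matching iteration
                }
                sum += a[i];
            }
        }
        return sum;
    }

    // After speculative guard motion: the loop-invariant condition n <= a.length is
    // checked once above the loop, even though no c[i] might ever be true.
    static int sumAfter(int[] a, boolean[] c, int n) {
        if (n > a.length) {
            deoptimize(); // may be a false positive if the branch is never taken
        }
        int sum = 0;
        for (int i = 0; i < n; i++) {
            if (c[i]) {
                sum += a[i]; // no guard left in the loop body
            }
        }
        return sum;
    }

    static void deoptimize() { throw new IllegalStateException("deoptimize"); }
}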


5.2 Stage Transition

Once the optimizations from the first stage have been performed, the compila- tion can proceed to the second stage. This is done by transforming the IR to conform to the model used in the second stage. This is a two-step process.

5.2.1 Guard Lowering

In order to reach the second stage, all floating guards are first assigned a position in the control-flow graph by scheduling the floating nodes. These guards are then transformed into a control-flow split and an explicit deoptimization node.


Figure 5.3: Guard lowering: a floating Guard node is transformed into fixed control-flow using an If node and a Deoptimize node.

Guarded nodes still keep their guard edges, which now point to the Begin node instead of the Guard node. This is important so that the guarded node is still constrained to be scheduled after the guarding check has succeeded. This is illustrated in Figure 5.3. The Read node that was pointing to the Guard now points to the non-deoptimizing successor of the corresponding If node. Note that the anchor edge of the Guard node has disappeared. It only constrained the position at which the guard should be checked and is not needed anymore since the corresponding If node is now part of the control-flow.


5.2.2 FrameState Assignment

Once all deoptimizing nodes are fixed, a deoptimization target is assigned to them by using the FrameState from the closest previous side-effecting node as illustrated in Figure 5.4. The algorithm used for this assignment is described in detail later in this section.


Figure 5.4: FrameState assignment: FrameState nodes are transferred from side-effecting nodes to deoptimizing nodes

FrameState Nodes at Control-Flow Merges

An important requirement of FrameState assignment is that all control-flow merge points must be considered as side-effecting nodes and thus have an associated FrameState node during the first compilation stage. Indeed, if any of the merging branches contains a side-effecting node (as illustrated in Figure 5.5), any deoptimization triggered after the Merge node must deoptimize to the state associated with this Merge node. There is no other correct deoptimization target: deoptimizing to a location before the control-flow merge would mean deoptimizing to its dominator block and would risk re-executing the side-effect; on the other hand, deoptimizing to a location after the last



side-effecting node inside a branch is not possible because we do not statically know which branch to choose.

Figure 5.5: Example of a situation where a Merge node requires a FrameState node

While it could be tempting to relax this requirement and only keep FrameState nodes at control-flow merges if one of the incoming branches contains a side-effecting node, this is not a good idea in practice. Indeed, some optimizations, such as tail-duplication, could move the control-flow merge point and introduce side-effecting nodes into the branches. At this point, since it is not possible to create new FrameState nodes after bytecode parsing, the IR would become invalid: it would contain a control-flow merge without an associated FrameState node but with side-effecting nodes in its incoming branches. As a result, in the Graal IR, all Merge nodes have an associated FrameState node in the first compilation stage.

Algorithm

FrameState assignment is done using Algorithm 5.1. This algorithm traverses the IR’s fixed nodes, which describe control-flow, in reverse postorder, i.e., each node is visited before its successors except for loop back-edges (LoopEnd nodes). During this traversal, it remembers the state associated with the last side-effecting node (StateSplit nodes) and assigns it to deoptimizing nodes. Both DeoptimizingBefore nodes and DeoptimizingAfter nodes are deoptimizing nodes; the difference is only relevant when they are also side-effecting nodes: DeoptimizingBefore nodes can deoptimize before applying their side-effects, and DeoptimizingAfter nodes can deoptimize after applying their side-effects.


Data: An IR graph where all deoptimizing nodes are fixed to the control-flow and FrameState nodes are still associated with side-effecting nodes
Result: The graph is modified in-place and deoptimizing nodes are associated with their final deoptimization target

    initialize Q to an empty queue
    visitedEnds ← ∅
1:  enqueue((graph.start, null), Q)
    while Q is not empty do
        (node, state) ← dequeue(Q)
        while node is a FixedWithNext do
            if node is a DeoptimizingBefore then
                node.deoptBeforeState ← state
            end
            if node is a StateSplit then
2:              state ← node.stateAfter
                node.stateAfter ← null
            end
            if node is a DeoptimizingAfter then
                node.deoptAfterState ← state
            end
            node ← node.next
        end
        if node is an End ∧ node is not a LoopEnd then
            visitedEnds ← visitedEnds ∪ {node}
            merge ← node.merge
            ends ← {e ∈ merge.ends : e is not a LoopEnd}
            if ends ∩ visitedEnds = ends then
                /* all incoming ends have been visited */
3:              enqueue((merge, null), Q)
            end
        else if node is a ControlSplit then
            foreach s ∈ node.successors do
                enqueue((s, state), Q)
            end
        end
    end

Algorithm 5.1: FrameState assignment

When reaching a control-flow merge, the algorithm does not need to merge the incoming states since all control-flow merges already have a FrameState attached. Also, since the start node and control-flow merge nodes are StateSplit nodes but not DeoptimizingBefore nodes, no state needs to be associated with them in the work queue (Lines 1 and 3); their state will be picked up when processing them (Line 2). Note that all side-effecting nodes are seen during this traversal since, in the first stage, all side-effecting nodes need to be fixed nodes.

Since, after this assignment, all deoptimizing nodes have a target, the edges between side-effecting nodes and FrameState nodes are not needed anymore and can be cleared to simplify the graph. This simplification removes constraints in the graph so that the subsequent schedules have more freedom. Not all FrameState nodes will be used after this transformation. The dead FrameState nodes can be removed from the graph. Some code can also become dead if it was only used by those FrameState nodes.

Note that after this assignment, in practice, a lot of deoptimizing nodes share the same deoptimization target. This is because, in typical compilation units, there are more deoptimizing nodes than side-effecting nodes and because, when fixing guards into the control-flow, they tend to cluster at the beginning of blocks.

5.3 Second Stage: Optimizing Side-effecting Nodes

As a result, in the second stage of the compilation, all deoptimizing nodes have a fixed position in the control-flow graph. They cannot be moved or added anymore. Also, deoptimizing nodes have been assigned their final targets in the form of FrameState nodes, which are not attached to side-effecting nodes anymore. The second stage can now be used to optimize side-effecting nodes. As explained in Section 4.3.3, as long as it is allowed by the JMM, side-effecting nodes can be reordered among each other since the locations of all deoptimizing nodes in the control-flow are known. They can still not be reordered with deoptimizing nodes. However, since guards have been transformed into explicit control-flow splits, many deoptimizing nodes are now in slow-paths that branch off the main execution path. A few optimizations that benefit from being able to reorder side-effecting nodes and can only be done in this second stage are described in the remainder of this section.
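As a small source-level sketch of this constraint (an illustration, not an example from the Graal sources): the Java Memory Model allows the two plain field stores below to be reordered among each other, but neither may move across the potentially deoptimizing array access between them once its position and FrameState are fixed.

    class ReorderingExample {
        int x;
        int y;

        void update(int[] a, int i) {
            x = 1;          // plain stores to distinct fields: under the JMM
            y = 2;          // these two may be reordered among each other...
            int v = a[i];   // ...but not across this bounds-checked access,
            x = v;          // whose FrameState fixes which effects the
        }                   // interpreter must observe on deoptimization
    }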

Effect Sinking This optimization tries to move side-effecting nodes to blocks that are executed less frequently. For example, side-effecting nodes in a loop can be moved to a place after the loop if they do not need to be observed during the execution of the loop.

    void minmax(Point p, double[] a) {
        for (int i = 0; i < a.length; ++i) {
            if (p.x < a[i]) { p.x = a[i]; }
            if (p.y > a[i]) { p.y = a[i]; }
        }
    }

Listing 5.1: Example where effect sinking can be beneficial

    void minmax(Point p, double[] a) {
        double x = p.x;
        double y = p.y;
        for (int i = 0; i < a.length; ++i) {
            if (x < a[i]) { x = a[i]; }
            if (y > a[i]) { y = a[i]; }
        }
        p.x = x;
        p.y = y;
    }

Listing 5.2: Example of the result of effect sinking

This is the case in Listing 5.1: after effect sinking (Listing 5.2), the stores to the fields x and y are done only once. Effect sinking moves nodes with observable side-effects and can thus only be done in the second compilation stage.

Advanced Loop Optimizations Loop tiling, interchange and fusion require re-ordering observable side-effects and thus can only be done in the second compilation stage. This type of optimization is very important for high-performance computing [4]. These optimizations require re-ordering side-effects because they often rely on applying the computations in a different order, either to use CPU caches more efficiently or to avoid the creation of intermediate results. The optimizations supported by the Graal compiler are described in more detail in Section 6.3.
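For illustration, loop interchange is shown below as a generic, hypothetical Java example (interchange itself is not claimed here to be implemented in Graal): swapping the loops changes the order in which the stores, which are observable side-effects, are applied, but makes the innermost accesses walk memory sequentially.

    class LoopInterchangeExample {
        // Column-major traversal of a (non-empty, rectangular) matrix:
        // strided accesses with poor cache behavior.
        static void fillColumnMajor(double[][] m) {
            for (int j = 0; j < m[0].length; j++) {
                for (int i = 0; i < m.length; i++) {
                    m[i][j] = i + j;   // store order: (0,0), (1,0), (2,0), ...
                }
            }
        }

        // After interchange: the same stores happen in row-major order,
        // so the innermost loop walks each row sequentially.
        static void fillRowMajor(double[][] m) {
            for (int i = 0; i < m.length; i++) {
                for (int j = 0; j < m[i].length; j++) {
                    m[i][j] = i + j;   // store order: (0,0), (0,1), (0,2), ...
                }
            }
        }
    }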


Deoptimization Grouping After FrameState assignment, because of the procedure described in Section 5.2.2, many deoptimizing nodes share the same FrameState. We can take advantage of this to reduce the amount of metadata that needs to be stored to implement deoptimization by merging the control-flow that leads to deoptimization with the same FrameState. We call this deoptimization grouping. Since deoptimization grouping requires that FrameStates have already been assigned to deoptimizing nodes, it can only be done during the second compilation stage. This optimization is described in more detail in Section 6.2.

Chapter 6

Case Studies

We have implemented our new concepts presented in the previous chapter in the Graal compiler. To show how this framework can support advanced speculative optimizations, we now describe two original optimizations that we implemented. The first one helps to reduce the run-time cost of speculation guards by moving them out of loops and is applied in the first stage. The second one is designed to reduce the memory footprint of the deoptimization metadata necessary for speculative optimizations and is applied in the second stage. We also briefly present a vectorization optimization that others have implemented for Graal, taking advantage of our concepts in the second compilation stage.

6.1 Speculative Guard Motion

Most programs spend the majority of their time in loops. Loops are thus a profitable target for optimizations. In this chapter, we discuss an optimization targeting guards in loops. Since it needs Guard nodes to still be floating, it is applied in the first stage of the compilation (see Section 5.1). Because the Graal IR uses floating nodes, in general, there is no need for a phase dedicated to loop-invariant code hoisting: as long as a floating node does not depend on loop-variant code, the scheduler will place this node above the loop. Speculative guard motion tries to move Guard nodes out of loops even if their anchor edge would normally keep them inside the loop. This is done by changing the anchor of the Guard node to the branch above the loop if the Guard node’s condition is loop-invariant. As illustrated in Figure 6.1, changing the Guard node’s anchor can mean that the Guard node is now evaluated in cases where it was not before. Before the optimization, the Guard node was only executed if both If nodes in the loop took their right branch. After the optimization, the Guard node can be scheduled outside the loop and is executed only once, independently of the branches taken by the If nodes inside the loop. This only works if the condition that is checked by the guard can be computed outside the loop, i.e., it does not depend on loop-variant values.


Note that as soon as a guard’s condition does not depend on loop-variant values, there cannot be any correctness issue about executing it outside the loop. Any check or precondition required for the evaluation of the guard’s condition is captured by dependencies in the IR: either they are loop-invariant and thus can also be hoisted, or they are loop-variant and then the condition itself cannot be considered loop-invariant. Also, it is never a correctness issue to deoptimize earlier than strictly necessary since, in the first stage, guards are not attached to a particular deoptimization target but will be assigned one during FrameState assignment that is correct at their final position.

This optimization is speculative because there could be a correlation between the Guard’s condition and the conditions under which it was originally anchored. In the case of Figure 6.1, there could be a correlation between c and a or b. Since we cannot prove that there is no correlation in general, we apply the optimization speculatively. The possible gains of not having to execute the Guard in the loop usually outweigh the performance risks as long as we are able to remedy undue speculations. Just like any other speculation, if it fails, the code will be deoptimized and this optimization will not be attempted again in the next compilation. Note that avoiding guard checks in the loop is not the only gain of this optimization: if the guard is hoisted out of the loop, other nodes that depend on it might also be hoisted. Also, if all guards can be scheduled outside of the loop and no more deoptimizing nodes remain in the loop, the loop will be amenable to more optimizations (such as vectorization) in the second stage.

We rely on previous optimizations having removed the most trivial cases where speculative guard motion should not be performed. For example, if the Guard node’s condition is the same as the condition of one of its dominating If nodes, the Guard node can either be eliminated1 (when the If node guarantees that the condition is true) or replaced by an unconditional Deoptimize node (when the If node guarantees that the condition is false). In these cases, speculative guard motion should not be performed because the guard can be optimized better at its current position in the loop.
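The speculation can be pictured with the following self-contained Java sketch (a hypothetical example; guard stands in for a Guard node and deoptimizes by throwing): the condition is loop-invariant, but the guard was anchored under the branch on cond[i], so hoisting it means it is evaluated even in runs where that branch is never taken.

    class GuardMotionExample {
        // Hypothetical stand-in for a Guard node: deoptimizes on failure.
        static void guard(boolean condition) {
            if (!condition) throw new RuntimeException("deoptimize");
        }

        // Before: the guard is only evaluated when cond[i] is true.
        static int before(boolean[] cond, int[] table, int limit) {
            int sum = 0;
            for (int i = 0; i < cond.length; i++) {
                if (cond[i]) {
                    guard(limit >= 0 && limit < table.length); // loop-invariant
                    sum += table[limit];
                }
            }
            return sum;
        }

        // After speculative guard motion: the guard runs once before the loop,
        // speculating that its condition is uncorrelated with cond[i].
        static int after(boolean[] cond, int[] table, int limit) {
            guard(limit >= 0 && limit < table.length);
            int sum = 0;
            for (int i = 0; i < cond.length; i++) {
                if (cond[i]) {
                    sum += table[limit];
                }
            }
            return sum;
        }
    }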

6.1.1 Rewriting Bounds Checks in Loops

In order to increase the number of loop-invariant conditions, we rewrite comparisons of loop variables in bounds checks: instead of checking whether a loop variable is within the bounds of an array, we check whether its minimum and maximum values are within these bounds. If those values are loop-invariant, then the whole check becomes loop-invariant and can be hoisted out of the loop.

1In this case, nodes that used to depend on the Guard node will be modified to depend on the Begin node in which the guarded condition is guaranteed to hold.



Figure 6.1: Speculative guard motion: the anchor of a Guard node with a loop-invariant condition is moved to the branch dominating the loop.

One of the most probable causes of failed speculations is that a hoisted guard fails although the loop would not have been entered. In particular, this is often the case for bounds checks. Using loop inversion2 would provide space before the loop where the Guard nodes can be hoisted to and are only executed if the loop is entered. However, the compiler might not provide loop inversion; for example, at the time we implemented speculative guard motion in the Graal compiler, loop inversion was not available. To work around this problem, when rewriting bounds checks on loop variables, we not only check whether the variables’ minimum and maximum values are within the bounds but also whether the loop is entered at all. An example of a loop where a bounds check can be rewritten is shown in Listing 6.1. Listing 6.2 shows the same loop after the bounds check has been rewritten. This is done only for counted loops, where we can compute the minimum and maximum values of induction variables. In the current implementation, this affects loops that start with an If comparing an induction variable to a loop-invariant value.

2Loop inversion transforms a loop of the form “while(c) {...}” into one of the form “if(c){do{...} while(c);}”

    for (int i = start; i < end; i += stride) {
        guard((unsigned) i < a.length);
        a[i] = ...;
    }

Listing 6.1: Example of a loop where a bounds check can be rewritten to apply speculative guard motion

    guard(start >= end ||                   // loop not entered
          ((unsigned) start < a.length &&   // check minimum value
           (unsigned) end <= a.length));    // check maximum value
    for (int i = start; i < end; i += stride) {
        a[i] = ...;
    }

Listing 6.2: Example of a loop where a bounds check has been rewritten and moved outside the loop.

One of the successors of this If node must exit the loop, and the induction variables must be incremented or decremented by a loop-invariant amount.

6.1.2 Speculation Log

If a speculative optimization fails at run time, the compiler should not attempt to apply it again in the same context. For most of the speculative optimizations in the Graal compiler, this is achieved by gathering new profiling information after deoptimization occurred. This profiling data contains information about the case that made the speculation fail. The next compilation will then be based on this profiling data and the unwarranted speculation will not be repeated. In the case of speculative guard motion, however, the profiling data does not contain any information that would help us make a better decision. Therefore we added another data structure which records when a speculation fails. This speculation log contains a list of those speculations that should not be attempted again because they have already caused deoptimization. When code is installed in the VM’s code cache, a speculation log can be associated with it. Deoptimization sites can optionally be associated with a “speculation” object (see the next section). If deoptimization occurs at such a site and the site has a speculation object, the failing speculation is added to the speculation log associated with the compiled code which triggered deoptimization.


The Graal compiler associates a speculation log with the root method of the compilation unit so that it can be retrieved and examined during the next compilation.
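A minimal sketch of such a log is shown below. The class and method names are hypothetical and only illustrate the contract described above; they are not the actual Graal API.

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch of a speculation log: records failed speculations
    // so that the next compilation of the same method can avoid them.
    class SpeculationLog {
        private final Set<Object> failedSpeculations = ConcurrentHashMap.newKeySet();

        // Called by the deoptimization handler when a site with an attached
        // speculation object triggers deoptimization.
        void speculationFailed(Object speculation) {
            failedSpeculations.add(speculation);
        }

        // Queried by the compiler before applying a speculative optimization.
        boolean maySpeculate(Object speculation) {
            return !failedSpeculations.contains(speculation);
        }
    }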

Speculation Objects

The compiler can use any kind of object to represent a speculation as long as it is later able to compare this object to check whether a similar speculation has already failed. This flexibility allows the compiler to use varying degrees of precision to describe the speculation. For example, we can describe the guard motion speculation by a stateless object that merely denotes the fact that speculative guard motion was used. In this case we only know that speculative guard motion has failed before for a method. If we want to be a bit more precise, we can identify the loop from which guards were moved by specifying the BCI of the loop header. We can also add more context by adding information about the inlining context in which this loop appears. To be even more precise, we can also include the reason of the deoptimization in the speculation object. A coarse-grained description of the speculation will ensure that only a few deoptimizations are needed to prevent any further speculation and will thus ensure faster warmup. A more fine-grained description will prevent a single failed speculation from inhibiting many other speculations that might have worked and will thus increase peak performance. However, in the worst case, many deoptimizations are needed before the compiler figures out that none of the speculations can be applied. This can have a negative impact on warmup time, so that it takes longer to achieve peak performance.
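Continuing the sketch above, a guard motion speculation object could be modeled as a small value class whose equals method defines the granularity. The fields shown here (root method, loop header BCI, deoptimization reason) are one possible choice for illustration, not the exact encoding used in Graal.

    import java.util.Objects;

    // Hypothetical speculation object: two instances denote the same
    // speculation iff they describe guard motion out of the same loop
    // for the same reason.
    final class GuardMotionSpeculation {
        final String rootMethod;   // inlining context could be added here
        final int loopHeaderBci;   // identifies the loop guards were hoisted from
        final String reason;       // deoptimization reason, for finer granularity

        GuardMotionSpeculation(String rootMethod, int loopHeaderBci, String reason) {
            this.rootMethod = rootMethod;
            this.loopHeaderBci = loopHeaderBci;
            this.reason = reason;
        }

        @Override public boolean equals(Object o) {
            return o instanceof GuardMotionSpeculation
                    && ((GuardMotionSpeculation) o).rootMethod.equals(rootMethod)
                    && ((GuardMotionSpeculation) o).loopHeaderBci == loopHeaderBci
                    && ((GuardMotionSpeculation) o).reason.equals(reason);
        }

        @Override public int hashCode() {
            return Objects.hash(rootMethod, loopHeaderBci, reason);
        }
    }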

Truffle

Graal achieves high performance for Truffle interpreters by applying partial evaluation to the interpreter method given a specialized AST. From the Graal compiler’s point of view, Truffle ASTs are just like any other Java objects, so this partial evaluation is done by compiling the execute method of the root node type of the AST and replacing the parameter that corresponds to the AST by a constant in the Graal IR. This means that, in practice, all Truffle compilations have the same Java method as compilation root. Also, the loop constructs of the language implemented using Truffle will be implemented using Java loops in the execute method of the corresponding AST node types. This means that all the loops in the language implemented using Truffle will map to only a few Java loops. As a result, as soon as a guard motion speculation fails for one loop of the Truffle language, it would prevent guard motion for all of the loops. To avoid this problem, when doing a Truffle compilation, Graal does not use a speculation log associated with the Java compilation root but rather a speculation log associated with the Truffle AST.


6.1.3 Policy

In order to further improve the profitability of speculative guard motion, it is only performed under certain conditions. First, the relative frequencies of the block referenced by the original guard anchor and of the block before the loop are taken into account. If, according to profiling information, the block of the original anchor executes less often than the block before the loop, the optimization will not be performed. This can happen, for example, for loops that are rarely entered. Also, if the guard is a bounds check that needs to be rewritten to a more complex condition, the extra cost of executing the rewritten guard is taken into account when comparing the relative frequencies. The extra cost is set to make a rewritten bounds check cost three times more than the original one, as the rewritten bounds check might contain up to three checks instead of one. Finally, the optimization is only performed if the speculation log does not indicate that the same optimization was already attempted and failed.
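As a worked example of this policy (with invented numbers): if profiling reports that the block holding the original anchor runs with relative frequency 12 per loop entry while the block before the loop has frequency 1, a plain guard has a cost factor of 1/12 ≈ 0.08 and is hoisted. A rewritten bounds check costs 3 × 1/12 = 0.25, which is still below 1, so it is hoisted as well. If the anchor’s block only ran with frequency 2, the rewritten check would score 3 × 1/2 = 1.5 > 1 and the guard would stay in the loop.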

6.1.4 Processing Order

Hoisting a guard out of a loop might cause other dependent nodes to be hoisted as well. These dependent nodes could be the condition of some other guard, which means that it might only be possible to hoist some guard if some other guard has been hoisted before. We can still apply all the possible speculative guard motions in one pass. This is done by integrating speculative guard motion into a partial scheduling pass. The scheduling algorithm is presented as Algorithm 6.1. It tries to find the earliest block into which a node can be placed. It works recursively by first finding the earliest schedule of the node’s inputs. Recursion stops at fixed nodes since they are already part of a block. When the earliest block for a node has been found, it is memorized in order to reduce complexity. This algorithm results in a partial schedule since it only finds a block for the necessary nodes and their transitive inputs.

The scheduling is modified for Guard nodes to apply speculative guard motion. This can be seen in Function 6.2. First, the guard’s condition is checked to see if it can be rewritten as described in Section 6.1.1. Then, the loop nest in which the guard was originally found is explored, from the innermost loop to the outermost one. At each step, the speculation log is checked to see if a previous attempt to hoist such a guard out of a specific loop has already been made. The policy from Section 6.1.3 is also used to see if speculative guard motion should be applied. The search stops at the first loop in which the condition (rewritten if possible) is not loop-invariant anymore. At the end, the guard’s anchor and condition edges are rewired as necessary to hoist it above the best loop selected by the policy, and the guard is scheduled in the block above this loop. If no such loop was found, the guard is scheduled like any other node and remains inside the loop.


Data: An IR graph
Result: The graph is modified in-place to apply speculative guard motion

    initialize earliestCache to an empty map
    guards ← {n ∈ graph : n is a Guard}
    foreach g ∈ guards do
        earliestBlock(g)
    end

    Function earliestBlock(n : Node)
        if earliestCache[n] is defined then
            return earliestCache[n]
        end
        if n is a Fixed then
            return n.block
        end
        if n is a Guard then
            earliest ← computeEarliestBlockForGuard(n)
        else
            earliest ← computeEarliestBlock(n)
        end
        earliestCache[n] ← earliest
        return earliest
    end

    Function computeEarliestBlock(n : Node)
        dominators ← ∅
        earliest ← null
        foreach i ∈ n.inputs do
            block ← earliestBlock(i)
            if block ∉ dominators then
                dominators ← dominators ∪ block.dominators ∪ {block}
                earliest ← block
            end
        end
        if earliest = null then
            return graph.start.block
        end
        return earliest
    end

Algorithm 6.1: Speculative guard motion


    Function computeEarliestBlockForGuard(g : Guard)
        rewrittenCondition ← tryRewriteCompare(g.condition)
        conditionBlock ← earliestBlock(g.condition)
        rewrittenConditionBlock ← earliestBlock(rewrittenCondition)
        anchorBlock ← earliestBlock(g.anchor)
        minCostFactor ← 1.0
        optimizedAnchorBlock ← null
        l ← anchorBlock.loop
        while l ≠ null ∧ rewrittenConditionBlock ∉ l.blocks ∧ canSpeculate(l, g) do
            needsRewrite ← conditionBlock ∈ l.blocks
            rewriteCostFactor ← 3 if needsRewrite, 1 otherwise
            costFactor ← rewriteCostFactor × l.dominator.probability / anchorBlock.probability
            if costFactor < minCostFactor then
                minCostFactor ← costFactor
                optimizedAnchorBlock ← l.dominator
                optimizedAnchorNeedsRewrite ← needsRewrite
            end
            l ← l.parent
        end
        if optimizedAnchorBlock ≠ null then
            g.anchor ← optimizedAnchorBlock.block
            if optimizedAnchorNeedsRewrite then
                g.condition ← rewrittenCondition
            end
            return optimizedAnchorBlock
        else
            return computeEarliestBlock(g)
        end
    end

Function 6.2: computeEarliestBlockForGuard(g)


6.2 Deoptimization Grouping

After FrameState assignment, many deoptimizing nodes will share the same FrameState (see Section 5.2.2). While this means that they deoptimize to the same frame, the low-level deoptimization metadata that needs to be stored is not necessarily the same. Indeed, the physical locations of the values may be different. For example, as illustrated in Figure 6.2, two Deoptimize nodes use the same FrameState. At both Deoptimize nodes, the same values need to be restored in the interpreter for the locals and the expression stack but they are found in different registers or stack slots. This is important because the deoptimization metadata that needs to be produced by the VM is low-level data that contains references to the physical machine state. As a result, two different frame descriptions will need to be stored for these two deoptimization sites.

    ScopeDesc(offset=156): Test::foo@5 (line 14)
      Locals
        - l0: stack[0]
        - l1: stack[4]
      Expression stack
        - @0: reg rax

    ScopeDesc(offset=173): Test::foo@5 (line 14)
      Locals
        - l0: reg rsi
        - l1: reg rdx
      Expression stack
        - @0: reg rax

Figure 6.2: Low-level representation of the same FrameState used at two different positions. Note that the locals are in different physical locations.

The idea of deoptimization grouping is to combine all Deoptimize nodes that share the same FrameState into a single Deoptimize node, which will then result in a single deoptimization site in the generated code that needs only a single low-level translation of this FrameState.

As a result, we expect a reduction of the size of the deoptimization metadata stored along with the generated code. This is achieved in the following way: for every group of Deoptimize nodes that share the same FrameState, a Merge node is created, followed by a single Deoptimize node. All branches that previously led to Deoptimize nodes of this group now flow into this Merge node (see Figure 6.3).


Figure 6.3: Merging deoptimization control-flow (high-level IR)

Multiple deoptimizations sharing the same FrameState might want to pass different actions, reasons and speculation objects to the runtime call that performs the deoptimization. These values are used to communicate to the runtime what to do with the current code and why deoptimization happened, so that the same speculative optimization is not attempted again once it has failed. To still be able to group deoptimizations with different actions, reasons or speculation objects, Phi nodes are created to merge the data flow of those values. This transformation only covers cases where there is explicit control-flow leading to a Deoptimize node in the IR. However, there are other types of deoptimizing nodes such as Invoke nodes or SafePoint nodes for which the control-flow to a potential deoptimization site is implicit. Experiments showed that 38 % of all FrameState nodes are associated with Deoptimize nodes that are reached via an explicit control-flow edge and are thus amenable to this optimization.
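At the source level, the effect of grouping, including the Phi node that merges differing values, can be pictured by the following hypothetical sketch, where deoptimize again stands in for the runtime call:

    class GroupingExample {
        static RuntimeException deoptimize(String reason) {
            return new RuntimeException("deoptimize: " + reason);
        }

        static int get(int[] a, int i, int j) {
            String reason = null;                 // Phi merging the reasons
            if (i >= a.length) {
                reason = "BoundsCheck(i)";
            } else if (j >= a.length) {
                reason = "BoundsCheck(j)";
            }
            if (reason != null) {
                throw deoptimize(reason);         // single site, one FrameState
            }
            return a[i] + a[j];
        }
    }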

6.2.1 Relation with Deoptimization Metadata Compression

While deoptimization grouping is purely a compiler optimization, the underlying runtime can also optimize the storage of deoptimization metadata. In particular, since we run the Graal compiler in the HotSpot VM, we need to know what it does in this area to understand how it can interact with deoptimization grouping.

The HotSpot VM can compress low-level deoptimization metadata while storing it. This is normally done only when some debugging features of the VM are used and thus more deoptimization metadata is produced by the compiler. Compression works by finding common byte sequences in the serialized deoptimization metadata and sharing them. In HotSpot, this compression is called “Debug Information Sharing” [65]. It is done for specific chunks of serialized metadata, namely the list of all values of the expression stack, the list of all values of the locals, the list of all locked monitors, and full frames. This ad-hoc compression is rather effective. In particular, since it separates the expression stack from the locals, it is able to share information at a smaller granularity than the full stack of frames. However, it cannot find common sequences when the values have moved to different physical locations between two usages of the same FrameState. Also, it has quadratic complexity since every time a new chunk is recorded, it needs to be compared with the previous ones. In order to limit the time taken by this search, HotSpot only looks at the 50 previous chunks. In contrast, since Graal maintains skip-lists for some node types and def-use edges [18], deoptimization grouping has linear complexity in the number of Deoptimize nodes.

    double[] b = new double[a.length];
    for (int i = 0; i < a.length; ++i) {
        b[i] = a[i] * 2;
    }
    double[] c = new double[a.length];
    for (int i = 0; i < a.length; ++i) {
        c[i] = b[i] + 1;
    }

Listing 6.3: Example of loops that can be fused

6.3 Vectorization

Using our two-stage optimization framework, others have successfully implemented vectorization optimizations in the Graal compiler. The second stage allows the compiler to optimize side-effecting nodes; this is important for many loop vectorization optimizations, which can change the order of the execution of side-effects. The current vectorization optimizations include support for simple “map” operations. A map operation on a vector creates a new vector by applying the same function to all elements of the input vector. For example, the code in Listing 6.4 is a mapping operation that applies the function x ↦ 2 × x + 1.

    double[] c = new double[a.length];
    for (int i = 0; i < a.length; ++i) {
        c[i] = a[i] * 2 + 1;
    }

Listing 6.4: Result of fusing the loops from Listing 6.3

Current optimizations also include loop fusion. Loop fusion allows the compiler to avoid the creation of some temporary results by merging loops together. For example, the loops of Listing 6.3 can be fused together and the resulting code is shown in Listing 6.4. This is done by using Graal IR that models the vector operations directly rather than representing the loops that perform those operations. In the case of the previous example, this would mean the transformation of Graal IR illustrated in Figure 6.4.


Figure 6.4: IR transformation for loop fusion

With this example, we can see why vectorization optimizations can only be done while deoptimizing nodes are fixed: if a deoptimizing node could still be scheduled in the fused loop, it could not have a valid FrameState because, in the original program, there is no location where results have been partially written to the final array (c) while the intermediate array (b) does not exist. This vectorization implementation does not only take advantage of the second stage, it also uses speculative alias analysis. During the first stage it detects patterns of memory dependencies in loops that could later prevent vectorization but that disappear when it is known that two arrays are distinct.

When such a pattern is detected, a guard node that checks at run time that the arrays are distinct is inserted in the IR and is associated with another node that represents the possible memory graph optimization if the guard holds. During vectorization, if this speculative memory edge permits vectorization while the conservative one does not, the speculative one is used. If some speculative memory edges end up unused, the associated guards are removed from the IR to avoid unnecessary costs.
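A source-level picture of this speculation is sketched below (hypothetical and simplified to reference equality of the array parameters):

    class AliasSpeculationExample {
        static RuntimeException deoptimize() {
            return new RuntimeException("deoptimize: aliasing speculation failed");
        }

        // Fusing the two passes is only valid if the arrays cannot alias;
        // the inserted guard checks distinctness at run time.
        static void scaleThenOffset(double[] a, double[] b, double[] c) {
            if (a == b || a == c || b == c) {
                throw deoptimize();      // guard: arrays must be distinct
            }
            for (int i = 0; i < a.length; ++i) {
                b[i] = a[i] * 2;         // fused body, single pass
                c[i] = b[i] + 1;
            }
        }
    }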


Chapter 7

Evaluation

In order to measure the impact of our optimizations, we evaluated them using benchmarks targeting the JVM as well as JavaScript benchmarks running on a JavaScript implementation under Truffle. After establishing a baseline, we looked at the impact of the first and the second compilation stage on peak performance. We also reviewed in detail the impact of speculative guard motion. Finally, we evaluated the impact of deoptimization grouping on the memory footprint.

7.1 Methodology

7.1.1 Benchmarks

JVM Benchmarks Since Graal is a Java JIT compiler, we used a set of standard Java benchmarks to evaluate our work.

SPECjvm2008 This is an industry-standard benchmark suite [70] for Java VMs. It contains several workloads such as typical single-server or workstation applications or libraries focusing on specific aspects such as XML processing. These benchmarks try to avoid relying on disk I/O and never use network I/O. The scores of these benchmarks are given in operations per minute. A higher score means faster execution. We ran the benchmarks for 2 minutes of warmup and 4 minutes of measurement. Each benchmark in the suite was run in its own VM. We excluded the “compiler.sunflow” benchmark because it does not run on JDK 8.

DaCapo This is a commonly used benchmark suite [6] designed to represent real-world applications. We used version 9.12 “Bach” of the suite. Each benchmark was executed in its own VM and consisted of 50 iterations. The result of each benchmark is the time in milliseconds to run the final iteration. A lower score means faster execution. We excluded the “eclipse” benchmark because it does not run on JDK 8.


We also did not run the “tradesoap” and “tradebeans” benchmarks because their network usage was incompatible with the cluster used for our measurements.

Scala-DaCapo This benchmark suite [62, 61] was inspired by DaCapo but it consists of programs written in the Scala programming language. Scala code usually compiles to Java bytecode and executes directly on the JVM. Similar to DaCapo, we ran 50 iterations per benchmark and the result is the wall time of the last iteration in milliseconds.

We ran all those benchmarks with large heaps (64GB) in order to minimize GC pressure and thus to focus on the performance of the compiled code.

JavaScript Benchmarks Languages implemented under Truffle and compiled with the Graal compiler heavily rely on speculative optimizations. In order to benchmark Truffle-based language implementations, we used TruffleJS [81, 36], a JavaScript implementation under Truffle. We used the following JavaScript benchmarks.

Octane This benchmark suite [26] was developed by Google to test their V8 JavaScript engine. We used benchmarks from version 2.0 of the suite. The benchmark results of this suite are expressed as “scores”. A benchmark’s score is inversely proportional to the time it takes to run a single iteration of the benchmark. A higher score means faster execution. We ran each benchmark in its own VM. We excluded the “Code loading” and “zlib” benchmarks that focus on JavaScript code parsing and compilation since we are more interested in code execution. Similarly, we did not run “Splay”, which focuses on garbage collection. We do not present results of “GB Emulator” because measurements for this benchmark were too unstable and the very high variance of the results did not produce any statistically significant results. This was due to a bug in the version of TruffleJS used for the benchmarks.

Kraken This benchmark suite [53] developed by Mozilla includes real-world JavaScript applications and libraries. We included some of the benchmarks from this suite that focus more on small computational kernels. We adapted them for use with the Octane harness.

NBody This benchmark [29] is another small computational kernel. We adapted it for use with the Octane harness.

For all JavaScript benchmarks, warmup iterations are run first and then multiple measurement iterations are run to compute the score. We ran all JVM and TruffleJS benchmarks 10 times and present averages over those 10 runs.

In the benchmarks where we compare two configurations, we did an independent two-sample t-test with p = 0.01 to check whether the difference is significant. The benchmarks were run on computers equipped with two Intel® Xeon® E5-2699 v3 CPUs @ 2.30GHz and 384GB RAM. To achieve more stable results, Intel® Turbo Boost was disabled. For the various benchmarks, manual inspection of the warmup curves was used to ensure that the benchmarks had reached their peak performance.

7.1.2 Baseline

Since our work is fundamental to the structure of the Graal compiler, there is no version of Graal that does not contain what we try to evaluate. Comparing Graal to a completely different compiler would not give a lot of insight because such a compiler would have a completely different design and a different set of optimizations. To evaluate our work, we rather take the full Graal compiler as a baseline, with all optimizations enabled. We later compare this baseline to configurations that disable some optimizations or limit the effects of certain aspects of the design. For reference, we also compare this baseline against the other compilers of the HotSpot VM (see Section 7.4). To establish this baseline, we ran several benchmarks with the Graal compiler and recorded benchmark scores and compilation times. This baseline is shown in Tables 7.1 and 7.2. In this configuration, both the first stage and the second stage of the compiler are enabled, and speculative guard motion as well as vectorization are enabled.

7.2 Compilation Stages

7.2.1 Effects of the First Stage

To evaluate the effects of the first compilation stage, we compare the baseline with a compiler configuration where fixed guards are used instead of floating ones. In this configuration, optimizations that need to insert or move guards cannot be applied, while other optimizations are still applied. Speculations are still used, but they remain at fixed positions based on decisions taken during bytecode parsing. We recorded benchmark scores and compilation times for this configuration.

Performance

We first look at the performance of the compiled code. The results for the SPECjvm2008 benchmark suite are presented in Figure 7.1. We can see that the performance of many computationally intensive benchmarks such as “crypto”, “mpegaudio”, “lu” and “sparse” is better in the baseline, which contains floating guards (first stage). Speculative guard motion is part of the first stage and needs to move guards, so it is disabled in this configuration.


Benchmark                      Score                 Compilation time/s

SPECjvm2008 (ops/m)
  compiler.compiler            5009.5 ±   92.55      365.4 ± 31.64
  compress                     1876.8 ±    9.60       51.1 ±  1.50
  crypto.aes                    630.7 ±    1.86       72.2 ±  5.70
  crypto.rsa                   2927.3 ±   24.41       58.2 ±  2.00
  crypto.signverify            2767.8 ±    7.16       69.4 ±  3.66
  derby                        2829.1 ±   76.18      128.1 ±  4.08
  mpegaudio                    1280.6 ±    8.37      127.0 ± 10.49
  scimark.fft.large             303.0 ±    2.13       51.4 ±  2.66
  scimark.fft.small            3044.1 ±   36.37       52.6 ±  2.21
  scimark.lu.large               39.0 ±    3.98       47.9 ±  2.15
  scimark.lu.small             3925.1 ±   46.52       61.0 ±  4.26
  scimark.monte_carlo           868.9 ±    9.77      141.8 ± 28.53
  scimark.sor.large             256.5 ±    2.34       44.6 ±  1.84
  scimark.sor.small            1819.0 ±   10.06       48.1 ±  2.67
  scimark.sparse.large          149.2 ±    2.44       44.9 ±  1.65
  scimark.sparse.small         2365.7 ±    3.52       49.5 ±  1.78
  serial                       1318.4 ±   13.76       73.1 ±  2.24
  sunflow                       778.5 ±   13.37      110.4 ±  4.80
  xml.transform                3197.1 ±   32.81      145.9 ±  4.31
  xml.validation               3809.4 ±  108.26       97.8 ±  3.16

Kraken
  ai-astar                      153.1 ±    3.21        9.4 ±  0.60
  audio-dft                     180.2 ±    1.34        4.9 ±  0.47
  audio-oscillator             1130.9 ±    1.45        9.2 ±  1.05
  imaging-desaturate          61 817.9 ±  326.94       1.6 ±  0.11
  imaging-gaussian-blur          42.4 ±    1.92        8.3 ±  1.00

NBody
  nbody                       31 849.5 ±   97.72       5.3 ±  0.61

Octane
  box2d                       39 584.1 ±  251.81      46.9 ±  1.89
  crypto                      14 488.2 ±  211.72      37.9 ±  2.93
  deltablue                   33 694.5 ±  775.27      21.7 ±  0.35
  earley-boyer                20 612.3 ±  439.75      92.4 ±  1.31
  mandreel                    13 501.4 ±   71.48      56.2 ±  3.65
  navier-stokes               21 801.1 ±   47.20      10.2 ±  0.64
  pdfjs                       18 283.2 ±  243.61      88.8 ±  1.84
  raytrace                    50 470.8 ± 1305.08      18.4 ±  0.71
  richards                    22 509.3 ±   99.26       8.0 ±  0.47

Table 7.1: Performance baseline for SPECjvm2008, Kraken and Octane. Results are given ± the width of the 99 % confidence interval.


Benchmark              Time/ms               Compilation time/s

DaCapo
  avrora                3299.7 ± 114.34       42.5 ±  1.17
  batik                 1277.8 ±   9.98       73.6 ±  1.12
  fop                    216.4 ±   3.81       74.1 ±  1.72
  h2                    5835.0 ± 164.47      116.9 ±  6.01
  jython                1712.6 ±  40.03      292.2 ± 10.59
  luindex                644.3 ±  52.62       45.0 ±  1.09
  lusearch              1178.3 ± 138.87      122.2 ±  6.13
  pmd                   2018.5 ±  31.77      143.4 ±  4.81
  sunflow                441.8 ±  26.04      117.5 ±  6.44
  tomcat                 990.8 ±  16.24      212.5 ±  2.97
  xalan                  382.3 ±  10.89      157.2 ±  8.94

Scala-DaCapo
  actors                5704.2 ±  92.20       76.2 ±  2.09
  apparat               9800.2 ± 117.37      104.9 ±  4.36
  factorie            15 262.8 ± 444.41       43.3 ±  1.70
  kiama                  322.8 ±  12.90       68.1 ±  2.55
  scalac                1296.4 ±  30.30      272.2 ± 12.71
  scaladoc              1388.9 ±  27.82      193.3 ±  3.99
  scalap                 136.0 ±   5.57       50.0 ±  1.34
  scalariform            473.0 ±  16.33       88.0 ±  3.87
  scalatest             1029.4 ±  43.61      191.3 ± 60.34
  scalaxb                389.5 ±  64.20       64.8 ±  1.80
  specs                 1622.6 ±  17.49      100.0 ±  4.02
  tmt                   6574.3 ±  48.08       70.3 ±  6.40

Table 7.2: Performance baseline for DaCapo and Scala-DaCapo. Results are given ± the width of the 99 % confidence interval.

Further analysis of the effects of speculative guard motion alone in Section 7.3 shows that the entire performance difference for “lu” and “sparse” can be explained by speculative guard motion. While this optimization also has an impact on “crypto” and “mpegaudio”, it does not entirely explain the speedups there. For these benchmarks, we believe that the speedups come from the better scheduling possibilities due to floating guards. These results show that speculative guard motion is not the only important optimization in the first stage. In the case of “monte_carlo” and “fft”, however, we see a slight performance regression introduced by the floating guards of the first stage. This regression is due to sub-optimal scheduling, which has more freedom when guards are floating. While overall this freedom seems to bring benefits in other benchmarks, poor scheduling in the innermost loops of those two benchmarks costs up to 3.7 % of performance.



Figure 7.1: SPECjvm2008 performance results without the first stage. Normalized using the baseline from Tables 7.1 and 7.2. Higher is better. Error bars indicate the 99 % confidence interval.


Figure 7.2: DaCapo and Scala-DaCapo performance results without the first stage. Normalized using the baseline from Tables 7.1 and 7.2. Higher is better. Error bars indicate the 99 % confidence interval.


For the DaCapo and Scala-DaCapo suites, we make a similar comparison. Note that while these suites’ results are times in milliseconds, in order to harmonize the comparison with the other suites, we compare the inverse of those times. As a result, all the graphs in Figure 7.2 are to be interpreted as “higher is better”: a higher number means that more operations run in the same amount of time. For these suites, we see less impact of the first stage, with only a few benchmarks showing statistically significant different scores: “batik”, “apparat” and “tmt” run from 2 % to 6 % faster when the first stage is used.


Figure 7.3: Kraken and Octane performance results without the first stage. Normalized using the baseline from Table 7.1. Higher is better. Error bars indicate the 99 % confidence interval.

Finally, the results for the JavaScript benchmarks are shown in Figure 7.3. The effects of the first stage are very noticeable for these benchmarks. Almost all benchmarks show a strong improvement (up to +76 %) when enabling the first stage. Since the TruffleJS implementation relies heavily on speculative optimizations, the flexibility of floating guards in the first stage brings clear benefits.

Compilation Time Regarding compilation time for the Java benchmarks, the results are presented in Figure 7.4. Here, since we compare times, the graphs are to be interpreted as “lower is better”: a lower value means less time is spent in the compilation process. The impact of the first stage is generally low, but enabling it can add up to 17 % compilation time in the case of the “lu” benchmark.



Figure 7.4: SPECjvm2008, DaCapo and Scala-DaCapo compilation time results without the first stage. Normalized using the baseline from Tables 7.1 and 7.2. Lower is better. Error bars indicate the 99 % confidence interval.

For most benchmarks, the presence of the first stage has no impact on compilation time: of these 23 benchmarks, only 6 show a statistically significant difference. The biggest impact is on the Scala-DaCapo suite, where “apparat”, “factorie” and “scalac” have up to 7.5 % longer compilation times when the first stage is enabled.


Figure 7.5: Kraken and Octane compilation time results without the first stage. Normalized using the baseline from Table 7.1. Lower is better. Error bars indicate the 99 % confidence interval.

For the JavaScript benchmarks, the results are shown in Figure 7.5. For these benchmarks, the impact on compilation time is bigger than for the JVM benchmarks, with the compiler taking up to 130 % more time when the first stage is enabled. Fortunately, most benchmarks show a smaller compilation overhead. Since these are also the benchmarks that derive the highest benefit from the first stage, the additional compilation time seems well invested. Overall, these results show that the first stage allows the compiler to generate code with better peak performance. It helps both computationally intensive code as well as code that relies heavily on speculative optimizations.

7.2.2 Effects of the Second Stage

The second compilation stage allows the reordering of side-effecting nodes. Currently in the Graal compiler, only vectorization uses this capability. To evaluate the impact of the second stage we disable vectorization and compare this configuration to the baseline. We compare both benchmark scores and compilation times.



Figure 7.6: SPECjvm2008 performance results without vectorization. Normalized using the baseline from Table 7.1. Higher is better. Error bars indicate the 99 % confidence interval.


Figure 7.7: SPECjvm2008 compilation time results without vectorization. Normalized using the baseline from Table 7.1. Lower is better. Error bars indicate the 99 % confidence interval.


The results for SPECjvm2008 are in Figure 7.6. We can see that Graal’s vectorization only has a noticeable effect on the “lu” benchmark. When vectorization is enabled, this benchmark is 34 % faster. However, vectorization does not come for free: it can cost up to 31 % more compilation time (see Figure 7.7). This effect is limited, though: for most benchmarks, there was no statistically significant difference in compilation time. Overall, while we can see that it is possible to get performance improvements from optimizations such as vectorization, more work is needed to gain performance from the second stage.

7.3 Speculative Guard Motion

Amongst the optimizations that run in the first stage, we presented speculative guard motion. To analyze its impact, we disabled it and compared the results to the baseline. We first discuss the scores and the compilation times.


Figure 7.8: SPECjvm2008 performance results without speculative guard motion. Normalized using the baseline from Table 7.1. Higher is better. Error bars indicate the 99 % confidence interval.

The results for SPECjvm2008 are found in Figure 7.8. When comparing these results with Figure 7.1, we can see that all the benchmarks that benefit from speculative guard motion were also affected by the first stage. The speedups obtained by speculative guard motion are always at most as big as those obtained with the first compilation stage. This makes sense because speculative guard motion is part of the first compilation stage.

In particular, disabling speculative guard motion has the same impact as disabling the complete first compilation stage for the “lu.small” and “sparse.small” benchmarks. This means that for these benchmarks, speculative guard motion is the only important optimization in the first stage. For other benchmarks such as “rsa”, “signverify” or “mpegaudio”, we can see that while speculative guard motion contributes to the gains of the first stage, it is not the only optimization that is beneficial: while some performance is lost when disabling speculative guard motion alone, performance is even lower when disabling the whole first compilation stage. For “lu”, we also tried to disable both speculative guard motion and vectorization. In this case we get the same results as when we only disable speculative guard motion. The reason is that without speculative guard motion, vectorization cannot trigger because deoptimizing nodes remain in the loops and thus the loops cannot be vectorized. In Figure 7.9 we can see that for most of the SPECjvm2008 benchmarks, speculative guard motion does not increase compilation time significantly. However, we can see an increase of about 15 % for “mpegaudio” and “lu”.


Figure 7.9: SPECjvm2008 compilation time results without speculative guard motion. Normalized using the baseline from Table 7.1. Lower is better. Error bars indicate the 99 % confidence interval.

The results for Octane and Kraken on TruffleJS are shown in Figure 7.10. Speculative guard motion is effective for many benchmarks. When enabled, we observe up to 34 % improvement in performance. As for SPECjvm2008, in some of the benchmarks, speculative guard motion accounts for most of the performance difference observed when enabling the full first stage. We see this behavior for both “imaging” benchmarks from Kraken as well as for “NBody”.



Figure 7.10: Octane and Kraken performance results without speculative guard motion. Normalized using the baseline from Table 7.1. Higher is better. Error bars indicate the 99 % confidence interval.

The impact on compilation time can be seen in Figure 7.11. For those benchmarks, it is greater than for SPECjvm2008. This can be explained by the fact that IR graphs produced by partial evaluation in Truffle are usually larger and contain many guards that need to be processed by speculative guard motion. These results show that speculative guard motion is an important optimization for cases where a lot of guards are used. It can also be important to enable other optimizations such as vectorization.

For a detailed look, we also instrumented each guard so that we can obtain the dynamic count of guards executed. We captured this data both for our baseline with all optimizations enabled and for executions with speculative guard motion disabled. We then look at the number of guards executed for a fixed amount of time. The reason we use this metric rather than the raw number of guards executed per run is that the benchmark harness of SPECjvm2008 runs for a fixed amount of time rather than running a fixed amount of work. The results in Table 7.3 show that enabling speculative guard motion causes a significant decrease in the number of guards executed per second. This means that, for a fixed amount of time, when speculative guard motion is enabled, fewer guards are executed and CPU cycles can be used for other things. However, this does not always directly translate into performance improvements.


                               Million guards/s
Benchmark                 Without guard motion   With guard motion   Difference

SPECjvm2008
  compiler.compiler          203.4 ±  3.4          168.2 ±  2.2       −17.3 %
  compress                   261.9 ±  1.0          279.2 ±  1.8        +6.6 %
  crypto.aes                 330.9 ±  0.1          321.9 ±  2.1        −2.7 %
  crypto.rsa                 362.6 ±  3.2          132.9 ±  3.3       −63.3 %
  crypto.signverify          260.4 ±  1.2           43.5 ±  0.2       −83.3 %
  derby                      237.4 ±  4.4          218.2 ±  3.2        −8.1 %
  mpegaudio                  291.6 ±  1.8          232.4 ±  1.7       −20.3 %
  scimark.fft.large          178.8 ±  6.2          169.5 ±  5.7        −5.2 %
  scimark.fft.small          253.2 ± 16.5          226.6 ± 10.8       −10.5 %
  scimark.lu.large           384.4 ±  0.0           10.7 ±  0.2       −97.2 %
  scimark.lu.small           350.3 ±  2.6           65.8 ±  0.8       −81.2 %
  scimark.monte_carlo        174.8 ±  4.0          175.0 ±  3.0        +0.2 %
  scimark.sor.large          377.0 ±  0.0            2.2 ±  0.0       −99.4 %
  scimark.sor.small          376.3 ±  0.0            5.1 ±  0.0       −98.6 %
  scimark.sparse.large       377.8 ±  0.0          384.8 ±  0.0        +1.8 %
  scimark.sparse.small       362.5 ±  0.0          380.2 ±  0.0        +4.9 %
  serial                     246.5 ±  1.7          188.0 ±  2.1       −23.7 %
  sunflow                    657.9 ±  2.5          657.4 ± 14.6        −0.1 %
  xml.transform              315.8 ±  2.1          256.3 ±  2.4       −18.8 %
  xml.validation             334.3 ±  4.5          286.2 ±  6.5       −14.4 %

Kraken
  ai-astar                   221.1 ±  7.2          142.1 ±  4.8       −35.7 %
  audio-dft                   69.4 ±  1.0           39.5 ±  0.3       −43.2 %
  audio-oscillator           159.1 ±  1.0          104.9 ±  0.5       −34.1 %
  imaging-desaturate         419.1 ±  2.7          152.5 ±  2.3       −63.6 %
  imaging-gaussian-blur     1021.0 ±  8.2          876.3 ± 29.8       −14.2 %

NBody
  nbody                       44.4 ±  1.0           36.0 ±  1.6       −19.1 %

Octane
  box2d                       40.3 ±  0.4           37.0 ±  0.4        −8.1 %
  crypto                     157.1 ±  4.9           68.7 ±  3.8       −56.3 %
  deltablue                  169.9 ±  3.8          148.8 ±  3.1       −12.4 %
  earley-boyer                27.6 ±  0.2           27.5 ±  0.5        −0.4 %
  mandreel                   180.0 ±  2.5          140.5 ±  1.2       −21.9 %
  navier-stokes              463.9 ±  3.2          159.2 ±  0.6       −65.7 %
  pdfjs                       14.0 ±  0.2           13.0 ±  0.1        −7.5 %
  raytrace                    75.7 ±  1.4           76.5 ±  1.5        +1.0 %
  richards                    98.5 ±  1.0           97.2 ±  2.5        −1.3 %

Table 7.3: Number of guards executed per second for SPECjvm2008, Kraken and Octane. Results are given ± the width of the 99 % confidence interval.



Figure 7.11: Kraken and Octane compilation time results without speculative guard motion. Normalized using the baseline from Table 7.1. Lower is better. Error bars indicate the 99 % confidence interval.

In certain cases, such as “sor”, while the number of guards executed per second dropped dramatically, there was no performance impact. This means that the execution of guards was not the bottleneck there. We can observe this even better on the “lu” benchmark. We already know that speculative guard motion improves the performance significantly for “lu.small”, but there is no such effect for “lu.large”. However, we can confirm that speculative guard motion has an effect on the number of guards executed for both sizes of “lu”. This discrepancy can be explained by the fact that the large workload is cache-bound and execution time is dominated by cache misses. In this case, the presence of guards even in the innermost loops does not really influence the performance. We also checked that the number of failed speculations due to speculative guard motion stays low. In all benchmarks, deoptimizations due to failed speculations accounted for no more than a few tens of events. On average, deoptimization was due to speculative guard motion in only 3.2 % of the cases. This helps us confirm that speculative guard motion is not too eager in its speculations.

7.4 Comparison to Other Compilers

In order to have some points of reference for the performance results, we now compare the Graal compiler to the two existing compilers in the HotSpot VM: the C1 and C2 compilers, which we briefly presented in Section 2.2.


comparisons we use the Java benchmark suites: SPECjvm2008, DaCapo and Scala-DaCapo.

7.4.1 C1

To benchmark the C1 compiler, we use HotSpot in its “client” configuration, where C1 is the only JIT compiler. Since C1 aims at fast warmup rather than peak performance, we expect Graal to be faster overall. This is confirmed by the results presented in Figure 7.12: Graal has better peak performance for almost all benchmarks. The only exception is “scalatest”, where C1 outperforms Graal by 33 %. As pointed out by Sewe [61], this benchmark executes many small methods only once. In this context, a compiler with a lower compilation threshold and faster warmup such as C1 is more efficient than Graal.

7.4.2 C2

HotSpot’s second compiler, C2, has goals similar to Graal’s: peak performance in the long run. It also uses speculative optimizations and is overall more comparable to Graal than C1. To benchmark C2, we use HotSpot in its “server” configuration. The results, presented in Figure 7.13, show that for most benchmarks the performance of C2 and Graal is very similar. Graal is faster than C2 for two of the Scala-DaCapo benchmarks: “factorie” (40 % faster) and “tmt” (26 % faster). C2 is faster than Graal on a few benchmarks; in particular, for the “monte_carlo” benchmark, C2 is more than twice as fast as Graal. Overall, we believe that on Java workloads, with its current design, Graal can achieve at least performance similar to C2. This has been confirmed by the evolution of Graal’s performance since these experiments were run: for example, later versions of Graal have completely caught up with C2 on “signverify” and have halved the gap on “monte_carlo”.



Figure 7.12: SPECjvm2008, DaCapo and Scala-DaCapo performance results for the C1 compiler. Normalized using the baseline for Graal from Tables 7.1 and 7.2. Higher is better. Error bars indicate the 99 % confidence interval.



Figure 7.13: SPECjvm2008, DaCapo and Scala-DaCapo performance results for the C2 compiler. Normalized using the baseline for Graal from Tables 7.1 and 7.2. Higher is better. Error bars indicate the 99 % confidence interval.


7.5 Deoptimization Grouping

Using deoptimization grouping allows us to reduce the amount of low-level deoptimization metadata that the compiler produces. To measure the effect of this technique, we count the number of code positions with associated deoptimization metadata. We do that for our benchmarks by enabling or disabling deoptimization grouping in the Graal compiler. Table 7.4 reports those numbers and shows the differences between the two configurations. When deoptimization grouping is enabled, the number of positions with deoptimization metadata is reduced by about 26 % for JVM programs. This reduction rises to almost 50 % for the Octane benchmarks running on TruffleJS. This confirms our intuition that a significant number of Deoptimize nodes use the same FrameState nodes.

Benchmark        Grouping disabled   Grouping enabled   Change
SPECjvm2008      937,947             694,378            −25.97 %
Octane           113,275              56,672            −49.97 %
DaCapo           1,197,277           880,136            −26.49 %
Scala-DaCapo     928,651             695,735            −27.74 %

Table 7.4: Effect of deoptimization grouping on the number of code positions with associated deoptimization metadata in the Graal compiler.
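The mechanism behind these numbers can be pictured with a minimal sketch in Java. The Deoptimize and FrameState types and the redirectTo helper are illustrative stand-ins rather than Graal’s actual API: every deoptimizing site that restores an already-seen FrameState is redirected to the first site with that state, so only one code position per state needs low-level metadata.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative stand-ins for the IR classes involved in grouping.
    interface FrameState { }

    interface Deoptimize {
        FrameState state();                // state to restore on deoptimization
        void redirectTo(Deoptimize other); // merge this site's control-flow into another site
    }

    final class DeoptimizationGrouping {
        // Keep one representative Deoptimize per FrameState and redirect all
        // duplicates to it, leaving a single metadata-carrying code position.
        static void group(List<Deoptimize> deopts) {
            Map<FrameState, Deoptimize> representatives = new HashMap<>();
            for (Deoptimize d : deopts) {
                Deoptimize first = representatives.putIfAbsent(d.state(), d);
                if (first != null && first != d) {
                    d.redirectTo(first);
                }
            }
        }
    }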

7.5.1 Debug Information Sharing

Before looking at the impact of deoptimization grouping in more detail, we want to see the effects of the existing deoptimization metadata compression techniques of the underlying platform. As described in Section 6.2.1, the HotSpot VM supports a compression technique for deoptimization metadata. While it is usually disabled, we enabled it and recorded code cache usage. Table 7.5 shows the change in memory usage with debug information sharing enabled and deoptimization grouping disabled. The categories used in this table are the same as the ones described in Section 4.2.1. Debug information sharing is very effective at compressing deoptimization metadata and results in a 40 % to 50 % decrease in the size of this metadata for JVM benchmarks. This number goes up to 89 % for the Octane benchmarks. However, since this sharing is done while deoptimization metadata is recorded, after the compilation has finished, it has no influence on the other categories of metadata in Table 7.5.¹

¹ The observed variations for the other categories of metadata are not significant: they lie within the 99 % confidence interval.

Overall, for the JVM benchmarks, debug information sharing results in an 11 % to 25 % decrease in code cache occupation. This brings the overhead of metadata over machine code down to between 0.9× and 1.3× for all compilers, and to between 1.1× and 1.2× for the Graal compiler in particular. For the Octane benchmarks, the total code cache occupation goes down by 85 % and the metadata overhead goes down to 14.2×.

Table 7.5: Reduction of code and metadata sizes for different benchmarks and compilers (Graal, Client, Server) when HotSpot’s debug information sharing feature is enabled; deoptimization grouping is not enabled. The metadata is broken down into the deoptimization, constants and other categories; the Total column corresponds to the size of code and all deoptimization metadata. Results are given ± the width of the 99 % confidence interval. The percentages are relative to the baseline established in Table 4.1.
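The sharing itself can be sketched as follows (hypothetical names; HotSpot’s actual recorder works on its serialized scope descriptors): two sites whose debug information serializes to identical bytes end up referencing a single recorded blob.

    import java.nio.ByteBuffer;
    import java.util.HashMap;
    import java.util.Map;

    final class DebugInfoSharing {
        private final Map<ByteBuffer, Integer> known = new HashMap<>();
        private int nextOffset = 0;

        // Returns the offset of this serialized debug information in the
        // metadata section, reusing an identical, previously recorded blob
        // when possible. Assumes the caller does not mutate the array later.
        int record(byte[] serialized) {
            ByteBuffer key = ByteBuffer.wrap(serialized);
            Integer existing = known.get(key);
            if (existing != null) {
                return existing; // shared: no new metadata is emitted
            }
            known.put(key, nextOffset);
            int offset = nextOffset;
            nextOffset += serialized.length; // append a new blob
            return offset;
        }
    }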


7.6 Deoptimization Grouping without Debug Information Sharing

Now, we look at the impact of deoptimization grouping alone, when debug information sharing is disabled. In Table 7.6, the “Debug Information Sharing Disabled” section shows the effect of deoptimization grouping on the memory overhead of deoptimization metadata. We can see that the reduction in positions with deoptimization metadata translates almost directly into a proportional reduction of the deoptimization metadata size. It also translates into a slight decrease in code size, which we expected because of the lower number of deoptimization calls.² Interestingly, it also leads to a decrease in “other” metadata, which contains the relocation metadata for those deoptimization call sites.

7.6.1 Combining Debug Information Sharing and Deoptimization Grouping

We now want to study how debug information sharing and deoptimization grouping work together. Table 7.6 shows the change in memory usage of deoptimization metadata when grouping is enabled. The results show that deoptimization grouping is still able to reduce the code and metadata sizes on top of the gains from debug information sharing. Compared to debug information sharing alone, grouping further reduces the size of deoptimization metadata by an additional 22 % to 25 % for the JVM benchmarks and 8 % for the Octane benchmarks. For the other categories, on which sharing alone has no effect, grouping has the same effect as when studied alone. While combining grouping and sharing does not reduce the size of deoptimization metadata for the Octane benchmarks much further than sharing alone, it substantially reduces the size of “constants” metadata. This is important because, as we have seen in Section 4.2.1, the Octane benchmarks produce a lot of “constants” metadata. This shows that both techniques combine well and can be used together. Overall, the overhead of metadata is now down to between 0.9× and 1.1× for the Graal compiler for JVM benchmarks and 6.4× for the Octane benchmarks. The total code cache size is reduced by about 31 % for the JVM benchmarks and by 92 % for the Octane benchmarks.

² The observed decrease in code size is significant: it is outside the 99 % confidence interval.

Table 7.6: Reduction of code and metadata sizes for different benchmarks when Graal’s deoptimization grouping feature is enabled, with debug information sharing either disabled or enabled. The metadata is broken down into the deoptimization, constants and other categories; the Total column corresponds to the size of code and all metadata. Results are given ± the width of the 99 % confidence interval. The percentages are relative to the baseline established in Table 4.1.


Chapter 8

Related Work

8.1 Intermediate Representations

The Graal IR is related to the graph IR presented by Click [10, 8], where nodes are not necessarily fixed to a specific point in the control-flow. This kind of IR, often called “sea of nodes”, retains ideas from the Program Dependence Graph (PDG) of Ferrante, Ottenstein, and Warren [21], where only the dependencies necessary to express the program semantics are kept. The IR of the HotSpot Server compiler is almost exactly the IR described by Click. A notable difference between the Graal IR and Click’s IR is the different direction of edges for control-flow and data-flow. This property allows our IR to avoid projection nodes in the control-flow graph. In Click’s IR, such nodes are necessary in order to be able to distinguish between the successors of a control split. In our IR, projection nodes are not needed because a control split points to its successors. In Click’s approach, a lot of the optimization work is already done during bytecode parsing. Parsing also directly produces rather low-level code, which means the IR is rather large for the whole duration of the compilation. In Graal, there are different levels of IR. Some optimizations such as inlining or escape analysis are done on a (relatively small) high-level IR, which is then “lowered” to a more low-level (and larger) IR that allows optimizations such as partial redundancy elimination or global value numbering to be performed on the more fine-grained low-level instructions. Overall, in the Graal IR, more nodes are fixed into the control-flow. For example, memory writes are fixed nodes in the Graal IR, while they are floating and only ordered through their memory dependencies in Click’s IR. As a result, the control-flow graph that forms the backbone of the IR is more prominent in Graal, where we try to find a sweet spot between the more familiar IRs purely based on a control-flow graph and a sea of nodes.


Also, in Click’s IR, the edges of a node are referenced with integer indexes. We use named edges in order to improve the maintainability of the IR itself as well as of the code that uses the IR.
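As a schematic illustration (simplified class shapes, not Graal’s actual node hierarchy), a control split with named, typed edge fields might look as follows. Because the split points to its successors by name, no projection nodes are needed to tell the branches apart:

    // Simplified sketch of named edges; Graal additionally marks such fields
    // with annotations so that the graph framework can traverse them generically.
    abstract class Node { }

    abstract class ValueNode extends Node { }

    final class IfNode extends Node {
        ValueNode condition;  // data-flow input, named instead of indexed
        Node trueSuccessor;   // control-flow edge to the taken branch
        Node falseSuccessor;  // control-flow edge to the not-taken branch
    }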

8.2 Deoptimization

Deoptimization was first used in the implementation of the Self language [40]. From there, it was transferred to the HotSpot VM, where it is used in the Client compiler [47], in the Server compiler [55] and in Graal. In HotSpot, deoptimization transfers the execution to the interpreter. The Jikes RVM [2, 58] also uses deoptimization. However, it has no interpreter, so deoptimization is done by transferring the execution to a less optimized version of the compiled code. This is called On-Stack Replacement (OSR) and was proposed for the Jikes RVM by Fink and Qian [22]. In the HotSpot VM, the term OSR is only used for the transfer of execution from the interpreter to compiled code in long-running loops [47]. The RPython tracing JIT [60] also supports guards and handles them in a similar fashion to the HotSpot VM. In all these compilers, unlike in Graal, nodes that can trigger deoptimization maintain their own deoptimization metadata throughout the complete compilation process. Compared to our approach, it is as if these compilers went directly to the second compilation stage, where guards have to remain fixed, which adds strong ordering constraints for guards. This difficulty is well illustrated by Odaira and Hiraki [54] and by Sundaresan, Stoodley, and Ramarao [72], who want to move instructions that may throw exceptions. Using our IR, we can move and insert Guard nodes in the first compilation stage without having to employ dynamic code patching or other techniques to overcome the inflexible placement of guards. The Crankshaft compiler of the V8 JavaScript engine [27, 23] handles deoptimization using a model similar to the one we use in the first stage of the compilation. While its IR does not have floating nodes, JavaScript-specific nodes that can cause deoptimization are manually moved during the compilation process. Before emitting the low-level IR used for register allocation, deoptimization targets are associated with deoptimization instructions. This is similar to FrameState assignment but does not leave room for a flexible reordering of side-effecting nodes. In contrast to our approach, the Crankshaft compiler does not have our second compilation stage. Binary translators such as Transmeta’s Code Morphing Software (CMS) [15] also support speculative optimizations. In CMS, this is done by using a rollback mechanism that triggers when an assumption fails. Effects similar to deoptimization can also be achieved via code duplication and modification of the control-flow: a predicate can be checked to dispatch the code between one version that makes an assumption and another that does not. This has been done, for example, by Sias et al. [63]. This method can make

the global control-flow very complex if a large number of assumptions needs to be made. Thus, it would be impractical for cases where we want to use speculative optimizations heavily. Special control-flow for specific assumptions has been studied by Djoudi, Acquaviva, and Barthou [17] in the context of nested loops, but their method cannot be applied more generally. Another technique for speculative optimizations has been proposed by Kelsey et al. [43]. Their model targets sequential code, where blocks of speculatively optimized code and safe code run in parallel. The execution of the safe code is used to verify the execution of the speculatively optimized code. The speedup comes from the fact that the execution of a safe block can start as soon as the speculatively optimized execution of the previous block has finished. This allows overlap in the execution of the safe code. Even if reasonable speedups are achieved, this technique only works for sequential programs and requires the developer to annotate the source code. Also, it only works if one can allocate multiple cores to run a sequential program.

8.2.1 Exception Handling

One of the important usages of speculative optimizations in the Graal compiler is exception handling. To mitigate the high number of exception edges, some compilers, such as the compilers of the Jikes RVM [7] as well as HotSpot’s Client compiler [47], use a factored control-flow graph where exception edges are implicit and are summarized for each basic block. But this still requires the compiler’s analyses to take the instructions that may throw an exception into account while iterating over the instructions during optimizations. Our implementation using guards simply makes the assumption that exceptions do not happen and can then optimize as if this assumption had been verified. Our approach follows the general principle that if some code inside a method has never been executed during profiling, it can be speculatively left out during compilation. This has been studied for general control-flow by Whaley [75]. The HotSpot Server compiler [55] also uses deoptimization to handle unlikely exceptions. It also uses profiling information to exclude explicit exception edges when possible, but it cannot exclude exception edges for invokes. Our evaluation has shown that invokes account for a large number of the exception edges. Another approach to reducing the overhead of exceptions is described by Su and Lipasti [71]. They compile regions of code under the assumption that exceptions are never triggered. At run time, this assumption is checked using special hardware support. If the assumption is invalidated, the changes that happened in the failing region are rolled back using hardware support. The region is then recompiled without assumptions and is re-executed. This approach is similar to ours in that no time is spent compiling exception

handlers that are never executed. Our technique has a small run-time overhead that does not exist when assumptions are monitored by hardware. But the necessity of hardware support can be prohibitive since it requires features that do not exist on common platforms. Also, adding a new kind of assumption could require new hardware support, which is costly, while in our model a new kind of assumption can be introduced very easily.
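In concrete terms, the guard-based handling of an implicit bounds check can be sketched in plain Java (the deoptimize helper is an illustrative stand-in for the runtime call that exits compiled code): the array access is compiled without any exception edge, and the rare failing case leaves compiled code entirely, so the interpreter throws the actual exception.

    final class SpeculativeBoundsCheck {
        // Stand-in for the transfer to the interpreter; always throws so the
        // compiler-generated fast path needs no local exception handler.
        static RuntimeException deoptimize() {
            throw new IllegalStateException("deoptimize");
        }

        static int readSpeculatively(int[] a, int i) {
            if (i < 0 || i >= a.length) {
                throw deoptimize(); // speculation failed: no exception edge here
            }
            return a[i]; // compiled as if it could never throw
        }
    }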

8.3 Speculative Guard Motion

Speculative guard motion hoists certain guards out of loops. Loop inversion can achieve a similar effect for some guards: if a guard is directly anchored under the loop condition that is inverted, guards with loop-invariant conditions can be hoisted into the new block created before the loop header. In a system that supports speculation, speculative guard motion is simpler to implement than loop inversion. Its only modification to the IR is the change of the anchor position of the guard node that needs to be hoisted, while loop inversion requires modifying the control-flow graph. More importantly, speculative guard motion applies to guards that are anchored anywhere in the loop, while with loop inversion only guards anchored after the loop’s condition can be hoisted. Some trace compilers also move guards speculatively. For example, Bebenita et al. [5] propose guard strengthening in a trace compiler called SPUR. This optimization replaces a guard that is implied by a later guard with this later guard, which causes some guards to move up. It is speculative because the trace may be exited earlier than necessary. The main difference from speculative guard motion is that an implied guard needs to exist for this movement to happen, which limits the applicability of guard strengthening.
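In source-level terms, the transformation can be sketched like this (the guard helper stands in for a Guard node; the real optimization rewrites the IR, in the spirit of Listings 6.1 and 6.2): the per-iteration bounds guard is replaced by one loop-invariant guard that speculates that the whole range is in bounds.

    final class GuardMotionSketch {
        // Stand-in for a Guard node: deoptimizes when its condition fails.
        static void guard(boolean condition) {
            if (!condition) {
                throw new IllegalStateException("deoptimize");
            }
        }

        // Before: the guard executes on every iteration.
        static int sumBefore(int[] a, int n) {
            int sum = 0;
            for (int i = 0; i < n; i++) {
                guard(0 <= i && i < a.length);
                sum += a[i];
            }
            return sum;
        }

        // After speculative guard motion: a single strengthened guard before
        // the loop. If it ever fails, the method deoptimizes and the
        // speculation is not attempted again on recompilation.
        static int sumAfter(int[] a, int n) {
            int sum = 0;
            guard(n <= a.length);
            for (int i = 0; i < n; i++) {
                sum += a[i];
            }
            return sum;
        }
    }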

8.4 Deoptimization Data

8.4.1 Delta encoding

We have already described the compression techniques present in the HotSpot VM in Section 6.2.1. Different compression schemes have been proposed for other VMs. Based on the insight that deoptimization metadata is the result of the abstract interpretation of the compiled program, some VMs use delta encoding for storing deoptimization metadata. The idea is that the delta between two instances of deoptimization metadata that follow each other in the control-flow should be rather small. This was described by Schneider and Bolz [60] for RPython and is particularly effective for trace compilers, where the compilation unit is essentially linear with side exits. In RPython, the serialized deoptimization metadata is split into two parts: the “resume data”, which corresponds to Graal’s high-level FrameState nodes,

and the “back-end maps”, which give the low-level information by mapping values to their physical locations. As we have seen with deoptimization grouping, the same deoptimization metadata is used at many deoptimization sites, but with different physical locations for the values. Thus, the separation used by RPython makes delta encoding more efficient: the high-level resume data does not change much from one site to another, compared to the low-level back-end maps. The LuaJIT compiler [56, 57], like most trace compilers, uses deoptimization intensively for trace exits. It stores deoptimization metadata in “snapshots”. Similar to Graal’s handling of FrameState nodes, snapshots are only taken if there has been a side-effect since the last snapshot or if an exit is likely to be taken. Similar to RPython, snapshots reference the IR values and are thus independent of the values’ physical locations. Instead of using back-end maps, LuaJIT, which keeps the IR in memory, scans the IR during deoptimization for special IR nodes that indicate that the register allocator moved a value from one physical location to another.
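The core of the delta encoding idea can be sketched with illustrative types (RPython’s actual format is more involved): a frame state is stored as the set of slots that differ from the previous state along the control-flow, which is typically a small fraction of the full state.

    import java.util.HashMap;
    import java.util.Map;

    final class DeltaEncoding {
        // Encode 'current' relative to 'previous': only changed or new slots
        // are stored. Assumes slots are never removed between two states.
        static Map<Integer, Object> encode(Map<Integer, Object> previous,
                                           Map<Integer, Object> current) {
            Map<Integer, Object> diff = new HashMap<>();
            for (Map.Entry<Integer, Object> e : current.entrySet()) {
                if (!e.getValue().equals(previous.get(e.getKey()))) {
                    diff.put(e.getKey(), e.getValue());
                }
            }
            return diff;
        }

        // Decoding replays the diff on top of the previous state.
        static Map<Integer, Object> decode(Map<Integer, Object> previous,
                                           Map<Integer, Object> diff) {
            Map<Integer, Object> state = new HashMap<>(previous);
            state.putAll(diff);
            return state;
        }
    }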

8.4.2 Deoptimization Metadata Size

The Self VM [74, 40] is an ancestor of the HotSpot VM and contains a lot of the infrastructure that was later used for deoptimization in the HotSpot VM. Its authors report that deoptimization metadata takes almost as much space (1.2×) as the compiled code for their baseline compiler and more than twice as much space (2.3×) as the compiled code for their optimizing compiler. Using the delta encoding described above, the RPython implementation is able to compress “resume data” (the biggest part of RPython’s deoptimization metadata) by up to 82 %. After compression, the total deoptimization metadata takes 2.6× to 5× as much space as the machine code. In comparison, the Graal compiler produces between 0.9× and 1.1× as much metadata as compiled code for standard Java programs, and this number goes up to 6.4× for JavaScript programs running on Truffle.

8.4.3 Deoptimizing with Metadata vs. with Specialized Code

While most systems that support deoptimization do so by using metadata, it is also possible to support it by using only code. Instead of emitting metadata and calling a centralized deoptimization handler to rebuild the deoptimized state, specialized code which does not need the metadata can be emitted for each deoptimization site. For example, with the Hotpath VM, Gal, Probst, and Franz [25] have explored both options and report that using specialized code is better for performance. However, they do not report any findings in terms of memory overhead.


Chapter 9

Summary

9.1 Future Work

Some of the possibilities established by our compiler structure and the optimizations proposed in this thesis can still be explored further. In particular, future work would be interesting in the following areas:

Reordering of Side-effecting Instructions We have mentioned vectorization as an example of an optimization that reorders side-effecting nodes and was implemented in the second compilation stage, but other such optimizations could be implemented as well. For example, more complex loop optimizations such as loop tiling [42, 80] and loop interchange [1, 79] could be added. While these optimizations are already well documented, they have not yet been explored in the context of speculation. Indeed, their complexity often comes from having to establish complex properties of inter-iteration dependencies that are hard to prove at compile time. Exploring the opportunities of speculative optimization to reduce this complexity would be interesting; a textbook example of loop interchange is sketched below.
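As a reminder of what such a transformation looks like, here is the textbook form of loop interchange in Java (our own illustration, not an implementation from the thesis): interchanging the loops turns a strided traversal of a row-major array into a contiguous one. Proving that the interchange is legal requires exactly the kind of inter-iteration dependence reasoning mentioned above.

    final class LoopInterchangeSketch {
        // Before: the inner loop strides across rows, touching a different
        // row (and usually a different cache line) on every access.
        static void scaleColumnOrder(double[][] m, double f) {
            for (int j = 0; j < m[0].length; j++) {
                for (int i = 0; i < m.length; i++) {
                    m[i][j] *= f;
                }
            }
        }

        // After interchange: the inner loop walks each row contiguously.
        // Legal here because all iterations are independent; in general this
        // must be proven at compile time or, as suggested above, speculated.
        static void scaleRowOrder(double[][] m, double f) {
            for (int i = 0; i < m.length; i++) {
                for (int j = 0; j < m[i].length; j++) {
                    m[i][j] *= f;
                }
            }
        }
    }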

Speculative Guard Motion without Loops The speculative guard motion optimization that we presented concentrates on hoisting guards out of loops. However, guards could also be moved speculatively out of their branch even if there is no loop, namely in cases where a guard can be combined with another one. While hoisting guards out of loops leads to higher performance gains, there might be cases where speculative guard motion is still profitable without loops.

Further Deoptimization Grouping When we group deoptimizing nodes, we only consider those that have the same FrameState target as well as the same values to be restored. This could be extended to sites with the same target but different values. Phi nodes could be created to merge those values.


It would be interesting to investigate the additional costs and benefits of this kind of grouping.

Lowering all Deoptimization Sites to Control-flow Splits In our current approach for the second compilation stage, while all deoptimizing nodes are fixed, some are still located on the fast path. This is the case, for example, for Invoke nodes or loop safepoints, which may deoptimize. Having all deoptimizing nodes on separate, slow control-flow paths would give the compiler even more freedom to reorder side-effecting nodes and to group deoptimizations. In some cases, such as loop safepoints, this would require support from the runtime to be able to divert the control-flow to another branch instead of deoptimizing directly.

9.2 Conclusion

In this thesis, we presented a new structure for compilers that use speculative optimizations. This structure divides the compilation process into two stages. It helps the implementation and use of speculative optimizations by allowing free insertion and movement of deoptimizing nodes during the first stage of the compilation. The second stage then enables other types of optimizations that need to be able to move side-effecting nodes and can assume that deoptimizing nodes stay at fixed positions. Using this framework, we presented two optimizations that help compilers in situations where speculative optimizations are heavily used or in performance-critical code. Speculative guard motion improves the performance of programs by speculatively hoisting guards out of loops. Deoptimization grouping helps the compiler reduce the memory footprint of the metadata necessary to support speculative optimizations. Our new compiler structure and its optimizations have been implemented in the Graal compiler. This allowed us to show their feasibility for optimizing languages targeting the JVM (e.g., Java and Scala) as well as languages implemented on top of Truffle (e.g., JavaScript). We evaluated this implementation using SPECjvm2008 and DaCapo for Java, Scala-DaCapo for Scala, and Kraken and Octane for JavaScript. While the optimizations used in the two stages usually have moderate costs in terms of compilation time, they increase the peak performance of Java applications by up to 84 % (4.7 % on average) and reduce the memory footprint of just-in-time compilation for Java by 27 % to 37 % (31 % on average). For JavaScript, they increase the peak performance by up to 76 % (23 % on average) and reduce the memory footprint of JIT-compiled code by 75 % to 96 % (90 % on average).


List of Figures

2.1 Example IR graph with control-flow and data-flow edges ...... 12
2.2 Example IR graph with a loop ...... 14
2.3 High level LoadField node before lowering ...... 17
2.4 Low-level nodes after lowering of a LoadField node ...... 17
2.5 Example of IR for a loop with Read nodes ...... 19
2.6 FloatingRead nodes and the memory graph ...... 20

3.1 Exception handling at Invoke nodes with exception dispatch . . . 26

4.1 Cycles spent by the code produced by 3 hypothetical compilers ...... 36
4.2 Execution speed over time ...... 37
4.3 Data installed in the code cache for Java benchmarks ...... 39
4.4 Data installed into the code cache for JavaScript benchmarks ...... 41
4.5 Deoptimization between two reordered side-effecting nodes ...... 44
4.6 Deoptimization target possibilities ...... 45
4.7 Reordering possibilities when deoptimizing nodes are fixed ...... 47
4.8 Further reordering possibilities when deoptimizing nodes are fixed ...... 48

5.1 Organization of the optimization stages ...... 49
5.2 Example of IR during the first stage ...... 51
5.3 Guard lowering ...... 52
5.4 FrameState assignment ...... 53
5.5 Example where a Merge node requires a FrameState node ...... 54

6.1 Speculative guard motion ...... 61
6.2 Low-level representation of the same FrameState used at two different positions ...... 67
6.3 Merging deoptimization control-flow (high-level IR) ...... 68
6.4 IR transformation for loop fusion ...... 70

7.1 SPECjvm2008 performance results without the first stage ...... 78
7.2 DaCapo and Scala-DaCapo performance results without the first stage ...... 78
7.3 Kraken and Octane performance results without the first stage ...... 79


7.4 SPECjvm2008, DaCapo and Scala-DaCapo compilation time results without the first stage ...... 80
7.5 Kraken and Octane compilation time results without the first stage ...... 81
7.6 SPECjvm2008 performance results without vectorization ...... 82
7.7 SPECjvm2008 compilation time results without vectorization ...... 82
7.8 SPECjvm2008 performance results without speculative guard motion ...... 83
7.9 SPECjvm2008 compilation time results without speculative guard motion ...... 84
7.10 Octane and Kraken performance results without speculative guard motion ...... 85
7.11 Kraken and Octane compilation time results without speculative guard motion ...... 87
7.12 SPECjvm2008, DaCapo and Scala-DaCapo performance results for C1 ...... 89
7.13 SPECjvm2008, DaCapo and Scala-DaCapo performance results for C2 ...... 90

List of Tables

4.1 Memory footprint of deoptimization metadata ...... 42

7.1 Baseline for SPECjvm2008, Kraken and Octane ...... 76
7.2 Baseline for DaCapo and Scala-DaCapo ...... 77
7.3 Guards executed per second for SPECjvm2008, Kraken and Octane ...... 86
7.4 Effect of deoptimization grouping on the number of positions with associated deoptimization metadata ...... 91
7.5 Effects of debug information sharing ...... 93
7.6 Effects of deoptimization grouping on code and metadata sizes ...... 95


List of Listings

2.1 Snippet of the instanceofExact type check ...... 21
2.2 Snippet of the Math.cos intrinsic ...... 21
2.3 Snippet of the instanceofExact type check using a profile ...... 22
2.4 Example Truffle interpreter ...... 23

3.1 Simple loop structure ...... 29
3.2 BranchProfile class using the Truffle API for speculation ...... 32
3.3 A node using a Truffle Assumption to speculate ...... 33

5.1 Example where effect sinking can be beneficial ...... 57
5.2 Example of the result of effect sinking ...... 57

6.1 Example of loop where a bounds check can be rewritten ...... 62
6.2 Example of loop where a bounds check has been rewritten ...... 62
6.3 Example of loops that can be fused ...... 69
6.4 Result of fusing the loops from Listing 6.3 ...... 70


List of Algorithms

5.1 FrameState assignment ...... 55

6.1 Speculative guard motion ...... 65
6.2 Function computeEarliestBlockForGuard(n) ...... 66


Bibliography

[1] John R. Allen and Ken Kennedy. “Automatic Loop Interchange.” In: Proceedings of the International Conference on Compiler Construction. ACM Press, 1984, pp. 233–246. isbn: 0-89791-139-3. doi: 10.1145/502874.502897.
[2] Bowen Alpern, S. Augart, Stephen M. Blackburn, M. Butrico, A. Cocchi, P. Cheng, J. Dolby, S. Fink, D. Grove, M. Hind, K. S. McKinley, M. Mergen, J. E. B. Moss, T. Ngo, and V. Sarkar. “The Jikes Research Virtual Machine project: Building an open-source research community.” In: IBM Systems Journal 44.2 (2005), pp. 399–417. doi: 10.1147/sj.442.0399.
[3] Bowen Alpern, Mark N. Wegman, and Frank Kenneth Zadeck. “Detecting Equality of Variables in Programs.” In: Proceedings of the ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. ACM Press, 1988, pp. 1–11. isbn: 0-89791-252-7. doi: 10.1145/73560.73561.
[4] David F. Bacon, Susan L. Graham, and Oliver J. Sharp. “Compiler Transformations for High-performance Computing.” In: ACM Computing Surveys 26.4 (Dec. 1994), pp. 345–420. issn: 0360-0300. doi: 10.1145/197405.197406.
[5] Michael Bebenita, Florian Brandner, Manuel Fahndrich, Francesco Logozzo, Wolfram Schulte, Nikolai Tillmann, and Herman Venter. “SPUR: A Trace-based JIT Compiler for CIL.” In: Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications. ACM Press, 2010, pp. 708–725. isbn: 978-1-4503-0203-6. doi: 10.1145/1869459.1869517.
[6] Stephen M. Blackburn et al. “The DaCapo Benchmarks: Java Benchmarking Development and Analysis.” In: Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications. Portland, OR, USA: ACM Press, Oct. 2006, pp. 169–190. doi: 10.1145/1167473.1167488.


[7] Jong-Deok Choi, David Grove, Michael Hind, and Vivek Sarkar. “Efficient and precise modeling of exceptions for the analysis of Java programs.” In: Proceedings of the ACM SIGPLAN-SIGSOFT workshop on Program Analysis for Software Tools and Engineering. Toulouse, France: ACM Press, 1999, pp. 21–31. isbn: 1-58113-137-2. doi: 10.1145/316158.316171.
[8] Cliff Click. “Combining analyses, combining optimizations.” PhD thesis. Rice University, 1995.
[9] Cliff Click. “Global code motion/global value numbering.” In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. La Jolla, California, United States: ACM Press, 1995, pp. 246–257. isbn: 0-89791-697-2. doi: 10.1145/207110.207154.
[10] Cliff Click and Michael Paleczny. “A Simple Graph-based Intermediate Representation.” In: Papers from the ACM SIGPLAN Workshop on Intermediate Representations. ACM Press, 1995, pp. 35–49. isbn: 0-89791-754-5. doi: 10.1145/202529.202534.
[11] Cliff Click and John Rose. “Fast Subtype Checking in the HotSpot JVM.” In: Proceedings of the Joint ACM-ISCOPE Conference on Java Grande. ACM Press, 2002, pp. 96–107. isbn: 1-58113-599-8. doi: 10.1145/583810.583821.
[12] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. “Efficiently Computing Static Single Assignment Form and the Control Dependence Graph.” In: Transactions on Programming Languages and Systems 13.4 (Oct. 1991), pp. 451–490. issn: 0164-0925. doi: 10.1145/115372.115320.
[13] Benoit Daloze, Chris Seaton, Daniele Bonetta, and Hanspeter Mössenböck. “Techniques and Applications for Guest-Language Safepoints.” In: Proceedings of the Workshop on Implementation, Compilation, Optimization of Object-Oriented Languages, Programs and Systems. 2015.
[14] Jeffrey Dean, David Grove, and Craig Chambers. “Optimization of Object-Oriented Programs Using Static Class Hierarchy Analysis.” In: Proceedings of the European Conference on Object-Oriented Programming. Springer, 1995, pp. 77–101. isbn: 3-540-60160-0.
[15] James C. Dehnert, Brian K. Grant, John P. Banning, Richard Johnson, Thomas Kistler, Alexander Klaiber, and Jim Mattson. “The Transmeta Code Morphing™ Software: using speculation, recovery, and adaptive retranslation to address real-life challenges.” In: Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization. San Francisco, California: IEEE, 2003, pp. 15–24. isbn: 0-7695-1913-X.


[16] Alain Deutsch. “Interprocedural May-alias Analysis for Pointers: Beyond K-limiting.” In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM Press, 1994, pp. 230–241. isbn: 0-89791-662-X. doi: 10.1145/178243.178263.
[17] Lamia Djoudi, Jean-Thomas Acquaviva, and Denis Barthou. “Compositional approach applied to loop specialization.” In: Concurrency and Computation: Practice and Experience 21.1 (Jan. 2009), pp. 71–84. issn: 1532-0626. doi: 10.1002/cpe.v21:1.
[18] Gilles Duboscq, Lukas Stadler, Thomas Würthinger, Doug Simon, Christian Wimmer, and Hanspeter Mössenböck. “Graal IR: An Extensible Declarative Intermediate Representation.” In: Proceedings of the Asia-Pacific Programming Languages and Compilers Workshop. 2013.
[19] Gilles Duboscq, Thomas Würthinger, and Hanspeter Mössenböck. “Speculation Without Regret: Reducing Deoptimization Meta-data in the Graal Compiler.” In: Proceedings of the International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools. ACM Press, 2014, pp. 187–193. isbn: 978-1-4503-2926-2. doi: 10.1145/2647508.2647521.
[20] Gilles Duboscq, Thomas Würthinger, Lukas Stadler, Christian Wimmer, Doug Simon, and Hanspeter Mössenböck. “An Intermediate Representation for Speculative Optimizations in a Dynamic Compiler.” In: Proceedings of the ACM workshop on Virtual Machines and Intermediate Languages. ACM Press, 2013, pp. 1–10. isbn: 978-1-4503-2601-8. doi: 10.1145/2542142.2542143.
[21] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. “The program dependence graph and its use in optimization.” In: Transactions on Programming Languages and Systems 9.3 (July 1987), pp. 319–349. issn: 0164-0925. doi: 10.1145/24039.24041.
[22] Stephen J. Fink and Feng Qian. “Design, implementation and evaluation of adaptive recompilation with on-stack replacement.” In: Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization. San Francisco, California: IEEE, 2003, pp. 241–252. isbn: 0-7695-1913-X.
[23] Olivier Flückiger. “Compiled Compiler Templates for V8. or: How I Learned to Stop Worrying and Love JavaScript.” MA thesis. University of Bern, Faculty of Science, 2014.
[24] Yoshihiko Futamura. “Partial computation of programs.” In: RIMS Symposia on Software Science and Engineering. Ed. by Eiichi Goto, Koichi Furukawa, Reiji Nakajima, Ikuo Nakata, and Akinori Yonezawa. Vol. 147. Lecture Notes in Computer Science. Springer, 1983, pp. 1–35. isbn: 978-3-540-11980-7. doi: 10.1007/3-540-11980-9_13.


[25] Andreas Gal, Christian W. Probst, and Michael Franz. “HotpathVM: An Effective JIT Compiler for Resource-constrained Devices.” In: Proceedings of the ACM/USENIX International Conference on Virtual Execution Environments. Ottawa, Ontario, Canada: ACM Press, 2006, pp. 144–153. isbn: 1-59593-332-8. doi: 10.1145/1134760.1134780.
[26] Google. Octane. 2014. url: https://developers.google.com/octane/.
[27] Google. V8 JavaScript Engine. 2012. url: http://code.google.com/p/v8/.
[28] James Gosling, Bill Joy, Guy Steele, Gilad Bracha, and Alex Buckley. The Java Language Specification. Java SE 8 Edition. 1st. Addison-Wesley, 2014. isbn: 978-0133900699.
[29] Isaac Gouy. n-body JavaScript V8 program, Computer Language Benchmarks Game. 2014. url: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=nbody&lang=v8&id=1.
[30] Graal Project. OpenJDK Community. url: http://openjdk.java.net/projects/graal/ (visited on Oct. 2, 2014).
[31] Matthias Grimmer. “A Runtime Environment for the Truffle/C VM.” MA thesis. Institute for System Software, Johannes Kepler University Linz, 2013.
[32] Matthias Grimmer. “High-performance Language Interoperability in Multi-language Runtimes.” In: Proceedings of the Companion Publication of the ACM SIGPLAN Conference on Systems, Programming, and Applications: Software for Humanity. ACM Press, 2014, pp. 17–19. isbn: 978-1-4503-3208-8. doi: 10.1145/2660252.2660256.
[33] Matthias Grimmer, Manuel Rigger, Roland Schatz, Lukas Stadler, and Hanspeter Mössenböck. “TruffleC: Dynamic Execution of C on a Java Virtual Machine.” In: Proceedings of the International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools. ACM Press, 2014, pp. 17–26. isbn: 978-1-4503-2926-2. doi: 10.1145/2647508.2647528.
[34] Matthias Grimmer, Manuel Rigger, Lukas Stadler, Roland Schatz, and Hanspeter Mössenböck. “An Efficient Native Function Interface for Java.” In: Proceedings of the International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools. ACM Press, 2013, pp. 35–44. isbn: 978-1-4503-2111-2. doi: 10.1145/2500828.2500832.
[35] Matthias Grimmer, Chris Seaton, Thomas Würthinger, and Hanspeter Mössenböck. “Dynamically Composing Languages in a Modular Way: Supporting C Extensions for Dynamic Languages.” In: Proceedings of the 14th International Conference on Modularity. ACM Press, 2015, pp. 1–13. isbn: 978-1-4503-3249-1. doi: 10.1145/2724525.2728790.


[36] Matthias Grimmer, Thomas Würthinger, Andreas Wöß, and Hanspeter Mössenböck. “An Efficient Approach for Accessing C Data Structures from JavaScript.” In: Proceedings of the International Workshop on Implementation, Compilation, Optimization of Object-Oriented Languages, Programs and Systems PLE. ACM Press, 2014, 1:1–1:4. isbn: 978-1-4503-2914-9. doi: 10.1145/2633301.2633302.
[37] Christian Häubl and Hanspeter Mössenböck. “Trace-based Compilation for the Java HotSpot Virtual Machine.” In: Proceedings of the International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools. ACM Press, 2011, pp. 129–138. isbn: 978-1-4503-0935-6. doi: 10.1145/2093157.2093176.
[38] Christian Häubl, Christian Wimmer, and Hanspeter Mössenböck. “Context-sensitive trace inlining for Java.” In: Computer Languages, Systems & Structures 39.4 (2013), pp. 123–141. issn: 1477-8424. doi: 10.1016/j.cl.2013.04.002.
[39] Christian Häubl, Christian Wimmer, and Hanspeter Mössenböck. “Evaluation of Trace Inlining Heuristics for Java.” In: Proceedings of the ACM Symposium on Applied Computing. ACM Press, 2012, pp. 1871–1876. isbn: 978-1-4503-0857-1. doi: 10.1145/2245276.2232084.
[40] Urs Hölzle, Craig Chambers, and David Ungar. “Debugging optimized code with dynamic deoptimization.” In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM Press, 1992, pp. 32–43. isbn: 0-89791-475-9. doi: 10.1145/143095.143114.
[41] Christian Humer, Christian Wimmer, Christian Wirth, Andreas Wöß, and Thomas Würthinger. “A Domain-specific Language for Building Self-optimizing AST Interpreters.” In: Proceedings of the International Conference on Generative Programming: Concepts and Experiences. ACM Press, 2014, pp. 123–132. isbn: 978-1-4503-3161-6. doi: 10.1145/2658761.2658776.
[42] François Irigoin and Remi Triolet. “Supernode Partitioning.” In: Proceedings of the ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. ACM Press, 1988, pp. 319–329. isbn: 0-89791-252-7. doi: 10.1145/73560.73588.
[43] Kirk Kelsey, Tongxin Bai, Chen Ding, and Chengliang Zhang. “Fast Track: A Software System for Speculative Program Optimization.” In: Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization. IEEE, 2009, pp. 157–168. isbn: 978-0-7695-3576-0. doi: 10.1109/CGO.2009.18.


[44] Thomas Kotzmann. “Escape Analysis in the Context of Dynamic Compilation and Deoptimization.” PhD thesis. Institute for System Software, Johannes Kepler University Linz, 2005.
[45] Thomas Kotzmann and Hanspeter Mössenböck. “Escape Analysis in the Context of Dynamic Compilation and Deoptimization.” In: Proceedings of the ACM/USENIX International Conference on Virtual Execution Environments. ACM Press, 2005, pp. 111–120. isbn: 1-59593-047-7. doi: 10.1145/1064979.1064996.
[46] Thomas Kotzmann and Hanspeter Mössenböck. “Run-Time Support for Optimizations Based on Escape Analysis.” In: Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization. IEEE, Mar. 2007, pp. 49–60. isbn: 0-7695-2764-7. doi: 10.1109/CGO.2007.34.
[47] Thomas Kotzmann, Christian Wimmer, Hanspeter Mössenböck, Thomas Rodriguez, Kenneth Russell, and David Cox. “Design of the Java HotSpot™ client compiler for Java 6.” In: ACM Transactions on Architecture and Code Optimization 5.1 (May 2008), 7:1–7:32. issn: 1544-3566. doi: 10.1145/1369396.1370017.
[48] William Landi and Barbara G. Ryder. “A Safe Approximate Algorithm for Interprocedural Aliasing.” In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM Press, 1992, pp. 235–248. isbn: 0-89791-475-9. doi: 10.1145/143095.143137.
[49] Tim Lindholm, Frank Yellin, Gilad Bracha, and Alex Buckley. The Java Virtual Machine Specification. Java SE 8 Edition. 1st. Addison-Wesley, 2014. isbn: 978-0133905908.
[50] Jeremy Manson, William Pugh, and Sarita V. Adve. “The Java Memory Model.” In: Proceedings of the ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. New York, NY, USA: ACM Press, 2005, pp. 378–391. isbn: 1-58113-830-X. doi: 10.1145/1040305.1040336.
[51] Hanspeter Mössenböck. Adding Static Single Assignment Form and a Graph Coloring Register Allocator to the Java HotSpot™ Client Compiler. Tech. rep. 15. Institute for Practical Computer Science, Johannes Kepler University Linz, 2000.
[52] Hanspeter Mössenböck and Michael Pfeiffer. “Linear Scan Register Allocation in the Context of SSA Form and Register Constraints.” In: Proceedings of the International Conference on Compiler Construction. Springer, 2002, pp. 229–246. isbn: 3-540-43369-4.
[53] Mozilla. Kraken JavaScript benchmarks. 2014. url: http://hg.mozilla.org/projects/kraken/.


[54] Rei Odaira and Kei Hiraki. “Sentinel PRE: Hoisting beyond Exception Dependency with Dynamic Deoptimization.” In: Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization. IEEE, 2005, pp. 328–338. isbn: 0-7695-2298-X. doi: 10.1109/CGO.2005.32.
[55] Michael Paleczny, Christopher Vick, and Cliff Click. “The Java HotSpot™ Server Compiler.” In: Proceedings of the Symposium on Java Virtual Machine Research and Technology. USENIX, 2001. isbn: 1-880446-11-1.
[56] Mike Pall. LuaJIT 2.0 intellectual property disclosure and research opportunities. Lua community Mailing list. 2009. url: http://lua-users.org/lists/lua-l/2009-11/msg00089.html.
[57] Mike Pall. src/lj_snap.c. Source code. url: http://luajit.org/git/luajit-2.0.git.
[58] Ian Rogers and Dave Grove. “The Strength of Metacircular Virtual Machines: Jikes RVM.” In: Beautiful Architecture. Ed. by Diomidis Spinellis and Georgios Gousios. O’Reilly, 2009. Chap. 10.
[59] Barry K. Rosen, Mark N. Wegman, and Frank Kenneth Zadeck. “Global Value Numbers and Redundant Computations.” In: Proceedings of the ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. ACM Press, 1988, pp. 12–27. isbn: 0-89791-252-7. doi: 10.1145/73560.73562.
[60] David Schneider and Carl Friedrich Bolz. “The efficient handling of guards in the design of RPython’s tracing JIT.” In: Proceedings of the ACM workshop on Virtual Machines and Intermediate Languages. Tucson, Arizona, USA: ACM Press, 2012, pp. 3–12. isbn: 978-1-4503-1633-0. doi: 10.1145/2414740.2414743.
[61] Andreas Sewe. “Design and Analysis of a Scala Benchmark Suite for the Java Virtual Machine.” PhD thesis. Technische Universitaet Darmstadt, 2013.
[62] Andreas Sewe, Mira Mezini, Aibek Sarimbekov, and Walter Binder. “Da Capo con Scala: design and analysis of a Scala benchmark suite for the Java virtual machine.” In: Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications. ACM Press, 2011, pp. 657–676.
[63] John W. Sias, Sain-zee Ueng, Geoff A. Kent, Ian M. Steiner, Erik M. Nystrom, and Wen-mei W. Hwu. “Field-testing IMPACT EPIC research results in Itanium 2.” In: Proceedings of the International Symposium on Computer architecture. München, Germany: IEEE, 2004, pp. 26–. isbn: 0-7695-2143-6.


[64] Doug Simon, Christian Wimmer, Bernhard Urban, Gilles Duboscq, Lukas Stadler, and Thomas Würthinger. “Snippets: Taking the High Road to a Low Level.” In: ACM Transactions on Architecture and Code Optimization 12.2 (June 2015), 20:20:1–20:20:25. issn: 1544-3566. doi: 10.1145/2764907.
[65] src/share/vm/code/debugInfoRec.cpp. Source code. url: http://hg.openjdk.java.net/jdk8/jdk8/hotspot.
[66] Lukas Stadler. “Partial Escape Analysis and Scalar Replacement for Java.” PhD thesis. Institute for System Software, Johannes Kepler University Linz, 2014.
[67] Lukas Stadler, Gilles Duboscq, Hanspeter Mössenböck, and Thomas Würthinger. “Compilation Queuing and Graph Caching for Dynamic Compilers.” In: Proceedings of the ACM workshop on Virtual Machines and Intermediate Languages. ACM Press, 2012, pp. 49–58. isbn: 978-1-4503-1633-0. doi: 10.1145/2414740.2414750.
[68] Lukas Stadler, Gilles Duboscq, Hanspeter Mössenböck, Thomas Würthinger, and Doug Simon. “An Experimental Study of the Influence of Dynamic Compiler Optimizations on Scala Performance.” In: Proceedings of the Workshop on Scala. ACM Press, 2013, 9:1–9:8. isbn: 978-1-4503-2064-1. doi: 10.1145/2489837.2489846.
[69] Lukas Stadler, Thomas Würthinger, and Hanspeter Mössenböck. “Partial Escape Analysis and Scalar Replacement for Java.” In: Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization. ACM Press, 2014, 165:165–165:174. isbn: 978-1-4503-2670-4. doi: 10.1145/2544137.2544157.
[70] Standard Performance Evaluation Corporation. SPECjvm2008. 2014. url: http://www.spec.org/jvm2008/.
[71] Lixin Su and Mikko H. Lipasti. “Speculative optimization using hardware-monitored guarded regions for java virtual machines.” In: Proceedings of the ACM/USENIX International Conference on Virtual Execution Environments. San Diego, California, USA: ACM Press, 2007, pp. 22–32. isbn: 978-1-59593-630-1. doi: 10.1145/1254810.1254814.
[72] Vijay Sundaresan, Mark Stoodley, and Pramod Ramarao. “Removing redundancy via exception check motion.” In: Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization. Boston, MA, USA: ACM Press, 2008, pp. 134–143. isbn: 978-1-59593-978-4. doi: 10.1145/1356058.1356077.


[73] Ben L. Titzer, Thomas Würthinger, Doug Simon, and Marcelo Cintra. “Improving compiler-runtime separation with XIR.” In: Proceedings of the ACM/USENIX International Conference on Virtual Execution Environments. ACM Press, 2010, pp. 39–50. isbn: 978-1-60558-910-7. doi: 10.1145/1735997.1736005.
[74] David Ungar and Randall B. Smith. “Self: The Power of Simplicity.” In: Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications. Orlando, Florida, USA: ACM Press, 1987, pp. 227–242. isbn: 0-89791-247-0. doi: 10.1145/38765.38828.
[75] John Whaley. “Partial method compilation using dynamic profile information.” In: Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications. Tampa Bay, FL, USA: ACM Press, 2001, pp. 166–179. isbn: 1-58113-335-9. doi: 10.1145/504282.504295.
[76] Christian Wimmer. “Linear Scan Register Allocation for the Java HotSpot™ Client Compiler.” MA thesis. Institute for System Software, Johannes Kepler University Linz, 2004.
[77] Christian Wimmer, Michael Haupt, Michael L. Van De Vanter, Mick Jordan, Laurent Daynès, and Douglas Simon. “Maxine: An Approachable Virtual Machine for, and in, Java.” In: ACM Transactions on Architecture and Code Optimization 9.4 (Jan. 2013), 30:1–30:24. issn: 1544-3566. doi: 10.1145/2400682.2400689.
[78] Christian Wimmer and Hanspeter Mössenböck. “Optimized Interval Splitting in a Linear Scan Register Allocator.” In: Proceedings of the ACM/USENIX International Conference on Virtual Execution Environments. ACM Press, 2005, pp. 132–141. isbn: 1-59593-047-7. doi: 10.1145/1064979.1064998.
[79] Michael Wolfe. “Advanced loop interchanging.” In: Proceedings of the International Conference on Parallel Processing. IEEE, 1986, pp. 536–543.
[80] Michael Wolfe. “More Iteration Space Tiling.” In: Proceedings of the ACM/IEEE Conference on Supercomputing. ACM Press, 1989, pp. 655–664. isbn: 0-89791-341-8. doi: 10.1145/76263.76337.
[81] Andreas Wöß, Christian Wirth, Daniele Bonetta, Chris Seaton, Christian Humer, and Hanspeter Mössenböck. “An Object Storage Model for the Truffle Language Implementation Framework.” In: Proceedings of the International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools. ACM Press, 2014, pp. 133–144. isbn: 978-1-4503-2926-2. doi: 10.1145/2647508.2647517.


[82] Thomas Würthinger, Christian Wimmer, and Hanspeter Mössenböck. “Array Bounds Check Elimination for the Java HotSpot™ Client Compiler.” In: Proceedings of the International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools. ACM Press, 2007, pp. 125–133. isbn: 978-1-59593-672-1. doi: 10.1145/1294325.1294343.
[83] Thomas Würthinger, Christian Wimmer, and Hanspeter Mössenböck. “Array bounds check elimination in the context of deoptimization.” In: Science of Computer Programming 74.5–6 (2009), pp. 279–295. issn: 0167-6423. doi: 10.1016/j.scico.2009.01.002.
[84] Thomas Würthinger, Christian Wimmer, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Christian Humer, Gregor Richards, Doug Simon, and Mario Wolczko. “One VM to Rule Them All.” In: Proceedings of the ACM international symposium on New ideas, new paradigms, and reflections on programming & software. ACM Press, 2013, pp. 187–204. isbn: 978-1-4503-2472-4. doi: 10.1145/2509578.2509581.
[85] Thomas Würthinger, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Doug Simon, and Christian Wimmer. “Self-optimizing AST Interpreters.” In: Proceedings of the Symposium on Dynamic languages. ACM Press, 2012, pp. 73–82. isbn: 978-1-4503-1564-7. doi: 10.1145/2384577.2384587.

Curriculum Vitæ

Gilles Marie Duboscq
Born on April 5, 1988
[email protected]

Education

2011–2016: Doctorate candidate Johannes Kepler University, Linz – Austria Research on the topic of Advanced Techniques for Speculative Optimization in a Dynamic Compiler. This is done in the framework of the Graal compiler, an optimizing, metacircular Just-In-Time compiler written in Java.

2010–2011: Master Université Paris Sud (Paris XI), Orsay (Paris) – France This Master had a research specialization in Computer Science. Elective courses in Compilers and Optimization and in Fault Tolerant Distributed Systems.

2008–2011: Master École Supérieure d’Électricité (Supélec), Gif-sur-Yvette (Paris) – France A top graduate school of engineering, specialized in computer science, electronics, telecommunications and energy. Graduated with a specialization in Computer Science & Software Engineering.

2006–2008: Classe préparatoire aux grandes écoles Lycée Michel Montaigne, Bordeaux – France Two years of advanced courses in mathematics, physics and computer science for preparation of the nationwide competitive entrance exams to top French graduate engineering and science schools.

2006: Baccalauréat Lycée de la mer, Gujan Mestras – France Baccalauréat with a major in maths, physics and engineering, with honors.


Experience

Since 2015: Senior Member of Technical Staff Oracle, Zürich – Switzerland Research and development around the Graal project. Working on the Graal compiler and the associated Truffle language implementation framework.

2011–2014: Research Assistant Johannes Kepler University, Linz – Austria Development of the Graal compiler in collaboration with Oracle Labs, working on most aspects of the compiler with a particular focus on the support and implementation of advanced speculative optimizations. Design, implementation, and maintenance of the continuous testing and benchmarking infrastructure used to track the progress of the Graal compiler and of other projects based on it.

April–September 2011: Intern Oracle, Grenoble – France Internship in the Maxine Java Virtual Machine project. Worked on the implementation of the Graal compiler as an evolution of the C1X compiler: changing its intermediate representation and implementing optimizations.

June–September 2010: Intern IBM, La Gaude – France eXtreme Blue internship on Smarter Rail. Worked in a team of four to build a prototype of an automated customer-service application for rail transportation companies, using J2EE technologies.

2010: Contract Job Junior Supélec Stratégie, Gif-sur-Yvette (Paris) – France In a team of two, worked on the design and implementation of a Drupal (PHP) website for a student group managing Supélec’s recruitment forum. The project included the development of several custom modules, covering customer relationship management, contract management, invoicing, and event planning & registration.

July 2009: Intern IBM, Montpellier – France Prototyping of cloud-computing dashboards in Java. Automated the gathering of data from the heterogeneous software of the cloud infrastructure and presented it on dashboards.

2009: Contract Job Junior Supélec Stratégie, Gif-sur-Yvette (Paris) – France Authored Java web-based management applications used by a leading French construction and public works company for their on-site medical clinics.

2008: Contract Job Junior Supélec Stratégie, Gif-sur-Yvette (Paris) – France Authored a Java-powered tool used by a leading aerospace, defense, and security company in France. The application generates human-readable reports by aggregating source-code audit dumps from multiple external tools.

Publications

Gilles Duboscq, Lukas Stadler, Thomas Würthinger, Doug Simon, Christian Wimmer, and Hanspeter Mössenböck. “Graal IR: An Extensible Declarative Intermediate Representation.” In: Proceedings of the Asia-Pacific Programming Languages and Compilers Workshop. 2013.

Gilles Duboscq, Thomas Würthinger, and Hanspeter Mössenböck. “Speculation Without Regret: Reducing Deoptimization Meta-data in the Graal Compiler.” In: Proceedings of the International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools. ACM Press, 2014, pp. 187–193. isbn: 978-1-4503-2926-2. doi: 10.1145/2647508.2647521.

Gilles Duboscq, Thomas Würthinger, Lukas Stadler, Christian Wimmer, Doug Simon, and Hanspeter Mössenböck. “An Intermediate Representation for Speculative Optimizations in a Dynamic Compiler.” In: Proceedings of the ACM Workshop on Virtual Machines and Intermediate Languages. ACM Press, 2013, pp. 1–10. isbn: 978-1-4503-2601-8. doi: 10.1145/2542142.2542143.

Doug Simon, Christian Wimmer, Bernhard Urban, Gilles Duboscq, Lukas Stadler, and Thomas Würthinger. “Snippets: Taking the High Road to a Low Level.” In: ACM Transactions on Architecture and Code Optimization 12.2 (June 2015), 20:1–20:25. issn: 1544-3566. doi: 10.1145/2764907.

Lukas Stadler, Gilles Duboscq, Hanspeter Mössenböck, and Thomas Würthinger. “Compilation Queuing and Graph Caching for Dynamic Compilers.” In: Proceedings of the ACM Workshop on Virtual Machines and Intermediate Languages. ACM Press, 2012, pp. 49–58. isbn: 978-1-4503-1633-0. doi: 10.1145/2414740.2414750.


Lukas Stadler, Gilles Duboscq, Hanspeter Mössenböck, Thomas Würthinger, and Doug Simon. “An Experimental Study of the Influence of Dynamic Compiler Optimizations on Scala Performance.” In: Proceedings of the Workshop on Scala. ACM Press, 2013, 9:1–9:8. isbn: 978-1-4503-2064-1. doi: 10.1145/2489837.2489846.

Thomas Würthinger, Christian Wimmer, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Christian Humer, Gregor Richards, Doug Simon, and Mario Wolczko. “One VM to Rule Them All.” In: Proceedings of the ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software. ACM Press, 2013, pp. 187–204. isbn: 978-1-4503-2472-4. doi: 10.1145/2509578.2509581.

Thomas Würthinger, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Doug Simon, and Christian Wimmer. “Self-optimizing AST Interpreters.” In: Proceedings of the Symposium on Dynamic Languages. ACM Press, 2012, pp. 73–82. isbn: 978-1-4503-1564-7. doi: 10.1145/2384577.2384587.
