Simulation-Based Code Duplication in a Dynamic Compiler

Doctoral Thesis to obtain the academic degree of Doktor der technischen Wissenschaften in the Doctoral Program Technische Wissenschaften

Submitted by: DI David Leopoldseder, BSc.

Submitted at: Institute for System Software

Supervisor and First Examiner: o. Univ.-Prof. DI Dr. Dr. h. c. Hanspeter Mössenböck
Second Examiner: Prof. Michael O'Boyle
Co-Supervisor: Dr. Lukas Stadler

Linz, August 2019

JOHANNES KEPLER UNIVERSITY LINZ, Altenbergerstraße 69, 4040 Linz, Österreich, www.jku.at, DVR 0093696

Oracle, Java, HotSpot, and all Java-based trademarks are trademarks or registered trademarks of Oracle in the United States and other countries. All other product names mentioned herein are trademarks or registered trademarks of their respective owners.


Statutory Declaration

I hereby declare that the thesis submitted is my own unaided work, that I have not used other than the sources indicated, and that all direct and indirect sources are acknowledged as references.

This printed thesis is identical with the electronic version submitted.

Linz, August 14, 2019

Abstract

Dynamic compilers perform a wealth of optimizations to improve the performance of the generated machine code. They inline functions; unroll, peel, and vectorize loops; remove allocations; perform register allocation and instruction scheduling; and duplicate code. All optimizations serve the goal of improving the performance of the generated code along as many success metrics as possible, including latency, throughput, memory usage, cache behavior, micro-code usage, security, and many others. In this process of transforming a source program to optimized machine code, a typical compiler makes a multitude of decisions when applying optimizations. Many optimizations not only have positive impacts on a compilation unit, but can also have negative effects on any of the success metrics. Since it is infeasible for a compiler to produce the optimal code for a given source program on a given system, compilers resort to modeling optimization decisions via heuristic functions that are typically hand-tuned to a given set of benchmarks, in order to produce the fastest possible artifact.

Duplicating code into different control-flow branches opens the potential for applying context-specific optimizations, which would not be possible otherwise. Duplication-based optimizations, including tail-duplication and loop unrolling, can have negative impacts on the performance of the generated machine code of a program; however, in many cases they are still able to improve performance significantly. This imposes a challenge on modern compilers: duplicating instructions at every control-flow merge is not possible, because it leads to uncontrolled code growth and compile-time increases. Yet, not duplicating code and missing out on performance increases is also not a suitable option. Therefore, compilers need to determine the performance and code-size impacts of a duplication transformation before performing it. Answering the question of the impact of a single duplication transformation on the optimization potential of an entire compilation unit typically requires compile-time-intensive analysis unsuitable for dynamic compilation. Consequently, dynamic compilers commonly resort to simple heuristics modeling beneficial and harmful impacts of a duplication optimization. However, heuristics are never complete and often miss modeling aspects crucial for the performance of a program.

To tackle the shortcomings of duplication-based optimizations in a dynamic compiler, we propose simulation-based code duplication, a three-tier optimization scheme that allows a compiler to (1) find duplication optimization candidates, (2) trade off their expected impacts between different candidates, and (3) perform only those duplication transformations that are considered beneficial. Simulation-based code duplication is precise, meaning that all simulated performance improvements are applicable after duplication. Additionally, it is complete, meaning that it makes it possible to simulate the effect of any given duplication-dependent optimization on a compilation unit after duplication.

We implemented our simulation-based code duplication approach on top of the Graal Virtual Machine and applied it to two code-duplication-based optimizations: tail-duplication and loop unrolling for non-counted loops.

We show that our simulation-based code duplication scheme outperforms hard-coded heuristics and can significantly improve performance of the generated machine code.

Large parts of our work have been integrated into Oracle Labs' Graal Virtual Machine and are commercially available.

Kurzfassung

To improve the performance of generated code, dynamic compilers perform a multitude of optimizations, such as the inlining of functions and the unrolling of loops. They schedule the order of instructions to achieve optimal pipeline utilization, assign registers to temporary values to reduce memory accesses to a minimum, and they duplicate code. All of this serves to increase the efficiency of the generated code along multiple metrics, such as latency, throughput, memory, cache, and micro-code usage, and many others. In the course of translation and optimization, a compiler has to make numerous decisions, since not all transformations automatically improve the performance of a program. Some transformations interact: while they have positive effects on one particular metric, they can have negative effects on another. An optimal solution to a compilation problem is technically infeasible. Therefore, to produce the fastest possible code, compilers resort to heuristics that are manually tuned on benchmark programs.

Code duplication allows a compiler to perform context-sensitive optimizations that would otherwise not be possible. However, code-duplicating optimizations, including classical tail-duplication and loop unrolling, can have negative effects on the performance of generated machine code. In many cases they can nevertheless lead to significant performance improvements. This fact poses a problem for optimizing compilers: duplicating code at every control-flow merge is not possible, since it leads to uncontrolled code growth and increased compilation times. On the other hand, it is not desirable to forgo code duplication per se, since potential performance gains could be missed. Therefore, optimizing compilers must determine the potential effects of a duplication with respect to code size and performance gain. Determining the effects of a single duplication on an entire compilation unit requires complex and expensive data-flow and control-flow analyses, which are normally not applicable in a dynamic compilation context. Compilers therefore model such positive and negative effects of duplications with heuristics. Heuristics, however, are often not complete and do not model all performance-relevant concepts of a program.

To eliminate the deficits of duplicating optimizations in a dynamic compiler, we propose simulation-based code duplication, an optimization scheme that allows a compiler to (1) find duplication candidates amenable to optimization, (2) weigh their effects against each other, and (3) perform only beneficial transformations.

Simulation-based duplication is precise, meaning that all effects simulated beforehand can actually be optimized later. Additionally, our approach is complete, meaning that it allows a compiler to simulate the effects of arbitrary transformations on the optimization potential of a program.

We implemented simulation-based duplication on top of GraalVM for two optimizations: classical code duplication and the unrolling of non-counted loops.

In this thesis we show that simulation-based duplication outperforms manual heuristics and can significantly improve the performance of generated code.

Large parts of our work have been integrated into GraalVM by Oracle Labs and are commercially available.

Contents

1 Introduction
   1.1 Problem Setting
   1.2 Problem Statement
   1.3 State-of-the-Art
   1.4 Remaining Challenges
   1.5 Novel Solution
   1.6 Scientific Contributions
       1.6.1 Publications
       1.6.2 Technical Contributions
   1.7 Applicability
   1.8 Project Context
   1.9 Structure of this Thesis

2 Terminology
   2.1 Compilation
   2.2 Intermediate Representation
       2.2.1 Control Flow Graph
             2.2.1.1 Dominance
   2.3 Static Single Assignment Form

3 GraalVM System Overview
   3.1 Java
       3.1.1 HotSpot JVM
       3.1.2 Graal Compiler
             3.1.2.1 Graal IR
       3.1.3 Truffle
       3.1.4 GraalVM

4 Simulation-Based Code Duplication
   4.1 Problem Statement
       4.1.1 Code Duplication Triangle
   4.2 Solution
       4.2.1 Finding Optimization Opportunities after Duplication
             4.2.1.1 Heuristics
             4.2.1.2 Backtracking
             4.2.1.3 Simulation
             4.2.1.4 Comparison
   4.3 Necessities: Cost Model

5 Node Cost Model
   5.1 Problems of existing Cost Models
   5.2 Cost Model Requirements
   5.3 Node Cost Model
       5.3.1 Code-Size Estimation
       5.3.2 Relative Performance Prediction
       5.3.3 Discussion
   5.4 Implementation Aspects

6 Dominance-Based Duplication Simulation
   6.1 Optimization Opportunities after Duplication
       6.1.1 Canonicalizations
       6.1.2 Read Elimination
       6.1.3 Conditional Elimination
       6.1.4 Partial Escape Analysis and Scalar Replacement
       6.1.5 Lock Coarsening
       6.1.6 Devirtualization
   6.2 Simulation-based Duplication of Control Flow Merges
       6.2.1 DBDS Algorithm
       6.2.2 AC(m, pi, oi) in Graal
   6.3 Trade-off Functions

7 Fast-Path Loop Unrolling of Non-Counted Loops
   7.1 Counted-loop Unrolling
   7.2 A Word on Unrolling Non-Counted Loops
       7.2.1 Non-Counted Loop Construct
   7.3 Optimization Opportunities
       7.3.1 Safepoint Poll Reduction
       7.3.2 Canonicalization
       7.3.3 Loop-Carried Dependency
   7.4 Fast-Path Unrolling of Non-Counted Loops
       7.4.1 Fast-Path Loop Creation
       7.4.2 Algorithm
       7.4.3 Discussion
       7.4.4 Non-counted Loop Unrolling
       7.4.5 Fast-Path Loop Unrolling Algorithm
       7.4.6 Unrolling Trade-Off Heuristic
   7.5 Loop-Wide Lock Coarsening
       7.5.1 Use Cases
       7.5.2 Fast-Path Tiling
       7.5.3 Loop-Wide Lock Coarsening Algorithm
       7.5.4 Loop-Wide Lock Coarsening Heuristics
       7.5.5 Safepoint Tiling

8 Evaluation
   8.1 Evaluation Methodology
       8.1.1 Hardware
       8.1.2 Software
       8.1.3 Benchmarks
             8.1.3.1 Java SPECjvm2008
             8.1.3.2 Java DaCapo
             8.1.3.3 ScalaBench
             8.1.3.4 Renaissance
             8.1.3.5 JavaScript Octane
             8.1.3.6 JavaScript jetstream asm.js
       8.1.4 Benchmark Configuration
       8.1.5 Metrics
       8.1.6 Warmup & Metacircularity
   8.2 Experiments
       8.2.1 Dominance-Based Duplication Simulation (DBDS)
       8.2.2 Simulation vs. Heuristic-Based Solutions
             8.2.2.1 Interpretation
       8.2.3 Fast-Path Loop Unrolling
             8.2.3.1 Discussion
       8.2.4 Loop-wide Lock Coarsening
       8.2.5 Node Cost Model

9 Related Work
   9.1 Code Duplication
       9.1.1 Comparison
   9.2 Loop Unrolling
   9.3 Compiler Cost Models

10 Future Work
   10.1 Generalization of Simulation-based Optimizations
        10.1.1 Transactional Simulation-based IRs
        10.1.2 Optimization Tier Improvements
   10.2 Compiler Cost Model
        10.2.1 Machine Learning Cost Models
        10.2.2 Cost Model Success Functions
   10.3 Loop Unrolling

11 Conclusion

List of Tables

List of Figures

Listings

List of Algorithms

Glossary

A Fast-Path Loop Creation Algorithm

B DBDS Algorithm

C Evaluation Appendix
   C.1 Detailed Performance Plots
       C.1.1 Simulation vs. Heuristical Solutions
       C.1.2 DBDS
       C.1.3 Fast-Path Loop Unrolling
       C.1.4 Loop-wide Lock Coarsening
       C.1.5 Node Cost Model
       C.1.6 Combined Performance Impact
   C.2 Performance Significance Analysis
       C.2.1 Interpretation

Bibliography

Chapter 1

Introduction

In this chapter we discuss the current state of the art of duplication-based optimizations and the open challenges in the research domain. Then, we propose a novel solution to determine the impact of a duplication-based optimization prior to performing a transformation. This enables a compiler to apply advanced reasoning about duplication impacts.

1.1 Problem Setting

Dynamic compilers [25; 57; 92; 213; 218]1 perform optimizations at run time, concurrently with the execution of the application. They utilize profiling information [196] to apply speculative optimizations [56; 57; 58; 221] and only optimize those parts of a program that are frequently executed and thus important.

Compiling at run time imposes unique challenges on a dynamic execution system, since the process of compilation competes with the application workload for CPU resources. This makes it particularly challenging to perform complex optimizations [7], as they typically require compile-time-intensive control-flow and data-flow analysis of the program. However, over the last decades, dynamic compilers closed the performance gap between statically compiled and dynamically compiled languages [80; 129] by applying speculative optimizations. Dynamic compilers today [93; 140; 157] contain many optimizations specifically designed to generate highly efficient code for dynamic high-level programming language constructs. Additionally, dynamic compilers perform optimizations originally developed for statically compiled programming languages like C and C++, since high-level languages typically allow programmers to express programs in low-level semantics as well2. The number of optimizations applied by dynamic compilers is therefore typically very high. This manifests itself in the paradox of dynamic compilers: since compilation is performed at run time, dynamic compilers have to adhere to the requirement of low compilation times, while at the same time they often need to apply more complex optimizations in order to remove abstractions imposed by dynamic language constructs. In order to solve this dilemma, dynamic compilers often only optimize those parts of a program that are executed frequently. They typically use several execution tiers3 to only optimize the important (so-called hot) parts of a program. Dynamic compilers implement the premise "make the common case fast"4 in order to focus the optimization effort on the commonly executed parts of a program. On the other hand, infrequently executed code is often handled by a slower execution tier, e.g., an interpreter or baseline compiler.

1 We use the term dynamic compilation throughout the rest of this thesis to describe an execution system that optimizes code at run time utilizing dynamic context information, i.e., profiling information, for optimization. Other approaches often refer to dynamic compilation with the term just-in-time (JIT) compilation. We refrain from using this term, as just-in-time does not specify whether compilation uses any dynamic context information to optimize the generated machine code.
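To make the tiering idea concrete, the following minimal Java sketch shows how an execution system might detect hot methods; the class name, counter, and threshold are hypothetical and do not reflect HotSpot's actual mechanism, which additionally weighs loop back-edge counts:

    // Hedged sketch: hot-method detection via an invocation counter.
    // All names and the threshold value are made up for illustration.
    final class MethodProfile {
        private static final int COMPILE_THRESHOLD = 10_000; // hypothetical value
        private int invocations;

        // Called by the slow tier (e.g., the interpreter) on every invocation;
        // returns true once the method should be handed to the optimizing compiler.
        boolean recordInvocationAndCheckHot() {
            return ++invocations >= COMPILE_THRESHOLD;
        }
    }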

1.2 Problem Statement

During compilation, many optimization decisions have to be made. This includes complex questions like the inlining problem [156]5 as well as simpler decisions like which loops should be unrolled [7]. In many cases such optimization decisions cannot be made optimally, for several platform-, architecture-, and run-time-system-dependent reasons. Compilers do not control the entire execution stack, i.e., the operating system and the hardware vary, and it is impossible to provide optimal optimization decisions for each combination of both. Additionally, modern-day CPUs are based on complex hardware that is hard to model precisely and completely in a dynamic compiler, where compilation time must be as low as possible because compilation happens concurrently with the execution of the user program.

Therefore, compilers typically resort to modeling these optimization decisions as heuristic trade-off functions that try to guide decisions towards a more optimal program. Heuristic trade-off functions base their reasoning on limited knowledge of the compilation unit, sometimes causing transformations that heavily increase code size or even decrease performance.

2 For example, consider all possible implementations that iterate over a linked list in Java. Programmers can write index-based loops, for-each style loops, or they can use lambda expressions to evaluate a function per list element.
3 For example, the HotSpot Java virtual machine (JVM) [93] uses one interpreter and two dynamic compilers (C1 & C2) that are used with different configurations collecting profiling information, effectively resulting in 5 different tier configurations.
4 A principle used during the development of the MIPS [89] instruction set architecture.
5 The inlining problem can be trivially formulated as finding a set of inlining decisions that minimize the overall execution time of a program without having accurate predictions for run-time decrease and code-size increase of all involved candidate methods [156].

Code duplication6 is an optimization that can significantly improve the run-time performance of a program by allowing a compiler to specialize the duplicated code to the values and types used in predecessor branches7. Consider the code in Listing 1.1a, which shows a trivial Java [124] function foo that effectively returns, based on its input argument x, either the constant 2 or a computation 2 + y where y == x. In order to avoid writing the return statement twice, the programmer combined both cases in a control-flow merge block in the instruction return 2 + y. However, this introduces the unnecessary addition 2 + 0 in the case that x <= 0. Code duplication can optimize this program by taking the code in the merge block and duplicating it into both predecessor branches, resulting in the code in Listing 1.1b. This code can be further optimized to the code in Listing 1.1c. After duplication the compiler is capable of optimizing the instruction 2 + 0 away via constant folding [7], i.e., evaluating the addition at compilation time. In this example the compiler was able to apply code duplication to remove one unnecessary addition and one jump instruction, which, if the else branch is executed at run time, decreases execution time by reducing the number of executed instructions.

(a) Initial program:

    int foo(int x) {
        final int y;
        if (x > 0) {
            y = x;
        } else {
            y = 0;
        }
        ... δ ...
        return 2 + y;
    }

(b) After duplication:

    int foo(int x) {
        final int y;
        if (x > 0) {
            y = x;
            ... δ ...
            return 2 + y;
        } else {
            y = 0;
            ... δ ...
            return 2 + y;
        }
    }

(c) After duplication and optimization:

    int foo(int x) {
        if (x > 0) {
            ... δ ...
            return 2 + x;
        } else {
            ... δ ...
            return 2;
        }
    }

Listing 1.1: Sample program.

However, code duplication can also have negative effects on various success metrics of compiled code and on the compilation process itself. A compiler can decide to duplicate code excessively, resulting in code-size increases that can lead to performance degradation of any success metric.

6 Other approaches apply the theory of code duplication but use a different nomenclature. In general, many approaches duplicate code in order to improve certain success metrics of a compilation unit. The following is a list of the most common ones: duplication [13], replication [136; 137], tail duplication [26], inlining [7], procedure cloning [34], splitting (in the context of inlining) [25; 170], trace compilation [99], and advanced scheduling for Very-Long-Instruction-Word (VLIW) processors including trace scheduling [66], superblock scheduling [96], and hyperblock scheduling [127].
7 Whenever we use the term code duplication in this thesis we mean duplication into predecessor blocks. However, other optimizations, e.g., partial redundancy elimination, apply duplication of instructions into successor blocks of control-flow splits.

Excessive duplication leads to exponential code growth [36] and is considered harmful. To illustrate this, consider the example from Listing 1.1a and the code after duplication in Listing 1.1c. Before duplication the function contained a piece of code in the merge block marked as δ. δ represents code that cannot be specialized for each value of the variable y. If δ accounts for a significant portion of the code size of method foo, duplicating it can nearly double the method's code size. While some optimizations might justify large code-size increases, the constant folding of an addition most likely does not. However, for this to be decided on a conceptual level, a compiler needs to be aware of a duplication transformation's impact, i.e., the transformation's benefit in terms of performance increase and its cost in terms of code-size increase.
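A minimal Java sketch of such an impact check follows; the penalty weight and the probability-weighted benefit formula are hypothetical and only illustrate the shape of the decision, not the thesis's actual trade-off functions (see Chapter 6):

    // Hedged sketch: duplicate only if the probability-weighted cycle savings
    // outweigh a code-size penalty; the weighting factor is made up.
    static boolean worthDuplicating(double cyclesSaved, double branchProbability,
                                    int codeSizeIncreaseBytes) {
        double benefit = cyclesSaved * branchProbability;
        double cost = codeSizeIncreaseBytes * 0.01; // hypothetical penalty weight
        return benefit > cost;
    }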

Because of the potential negative implications of duplication, compilers need to carefully decide which parts of a program should be duplicated and which should not. In order to do so, a compiler needs to find beneficial duplication candidates, i.e., those transformation candidates that will result in the best enabled optimizations. Therefore, compilers need to find all optimization opportunities after a duplication transformation.

A remaining problem is to discover optimization opportunities after code duplication, which is itself a non-trivial task. It requires global knowledge about the data-flow and the control-flow of a program that can only be obtained by complex analysis. There are different approaches to find such opportunities. Yet, there is no unified approach that finds all kinds of optimizations that can be applied by a dynamic compiler.

1.3 State-of-the-Art

Many compilers implement some kind of duplication, and a large number of scientific approaches has been devised to study the impacts of duplication-based optimizations on the performance of generated machine code.

We group approaches proposed by related work into the following categories: explicit duplication approaches that use (tail-)duplication as an optimization (1), approaches utilizing duplication as an enabler for other optimizations (2), and approaches that design their compilation process around duplication in order to optimize specific success metrics (3). This section gives a short outline of each group to motivate the remaining challenges and the novel solution presented in this thesis. For a detailed reflection on related work we refer to Chapter 9, where we take an in-depth look into all mentioned approaches and why we believe they are not suitable in a dynamic compilation setting.

1) Explicit (Tail-)Duplication These approaches apply code duplication as a first-class optimization to remove trivial basic blocks [111], remove unnecessary jumps or reduce the number of jumps [36], and remove conditional and unconditional statements [136; 137]. The main idea is to use the direct effect of duplication as the optimization. E.g., the duplication of the return statement from Listing 1.1a effectively removed the control-flow merge from the control-flow graph (CFG). This removes one unconditional jump from the resulting machine code. Approaches in this category typically do not reason about advanced optimization impacts of a duplication and thus only model a fraction of the enabled optimizations.

2) Duplication to Enable Other Optimizations Such approaches utilize the indirect, i.e., enabling, effect of a code duplication to perform other optimizations. To illustrate this, we refer to Listing 1.1b, which shows the example from Listing 1.1a after duplication but before optimization. The direct effect of the duplication is the removal of the merge block. The indirect effect, however, is the constant folding the compiler can later perform to obtain the code in Listing 1.1c. This category includes work towards using duplication as a way to apply specialization. Prominent approaches in this group are inlining [7], procedure cloning [34], splitting in the context of specialized compilation for Self [25], and approaches that use duplication to perform special optimizations like complete partial-redundancy elimination [13]. Approaches in this category reason about the enabling effect of a duplication. However, they do not combine multiple enabled optimizations into one optimization model, i.e., they merely target one optimization or one class of optimizations. Additionally, such approaches do not allow the compiler to perform performance and code-size estimations in advance, incurring the risk of uncontrolled code growth due to excessive duplication8.

3) Advanced Scheduling This category combines approaches that use duplication as a supporting paradigm in their compilation pipeline. Advanced scheduling approaches [66; 96; 127] have been developed in this context to realize better code generation for VLIW processors, which require increased instruction-level parallelism (ILP) in their compilation units to utilize the higher number of computational units on VLIW architectures. Such approaches apply (tail-)duplication as an optimization to increase the ILP of their compilation units. Yet, they do not reason about the enabling capabilities of duplication for subsequent optimizations, nor do they model a code-size versus performance trade-off.

8 See Table 9.1 in Chapter 9.

In general, all of the presented approaches utilize code duplication in some way to perform optimizations. However, none of them applies a cost-benefit analysis prior to the actual transformation in order to only duplicate beneficial candidates. This prevents us from using them in the context of dynamic compilation to combine the direct and indirect effects of duplicating code while keeping code size and compile time at a minimal and controllable level.

1.4 Remaining Challenges

In order to support fine-grained code duplication in a dynamic compiler, we derived the following challenges, which we believe are not yet solved by related work and for which we present a solution in this thesis.

1) Heuristic-based optimization decisions are imprecise, i.e., they cannot reason about the impact of a transformation on the performance of the generated machine code. Therefore, a precise duplication approach is required that allows a compiler to reason about the impact of a single duplication transformation along multiple success metrics.

2) Excessive code duplication leads to uncontrolled code growth and unwanted side effects that often have negative impacts on performance. Therefore, performing code duplication without a code-size trade-off is not suitable for dynamic compilation, where a minimal compile time and code size is crucial for performance of the overall execution system.

3) A compiler should only perform those duplication transformations that improve performance.

4) In order to find suitable duplication candidates, the compiler needs to know the indirect impact of each single duplication transformation to be able to perform a realistic trade-off between a transformation’s code-size increase and its performance impact.

5) Therefore, the compiler needs to find all possible optimizations that are enabled by a code duplication.

6) To make this process suitable for dynamic compilation, the analysis has to be sufficiently fast.

1.5 Novel Solution

In order to tackle the remaining challenges, we propose a novel approach to perform duplication-based optimizations in a dynamic compiler: simulation-based code duplication, which allows a compiler to determine, prior to performing the actual transformation, what the impact of a duplication on the optimization potential of a compilation unit is.

[Figure: a program passes through three tiers (Simulation: several candidate simulations; Trade-Off: rank and evaluate success metrics; Optimization: apply the best duplications), yielding the optimized program.]

Figure 1.1: Simulation-based code duplication.

This allows a compiler to reason about which duplication operations should be performed in order to increase performance, without exploding code size or compilation time. Figure 1.1 depicts a schematic sketch of the approach, which is based on three tiers that work as follows:

1) Simulation: The simulation tier discovers optimization opportunities after code duplication by simulating the effect of a duplication on the compilation unit. The result is a simulation-based version of the program that, optimization-wise, acts like the original program. The compiler then performs partial optimizations on the simulated program, i.e., optimizations that behave like the real ones, however without the compile-time-intensive task of maintaining correct data- and control-flow dependencies.

2) Trade-Off: The trade-off tier fits those opportunities into an optimization cost model that tries to maximize peak performance while minimizing code size and compilation time. The outcome is a set of duplication transformations that should be performed as they lead to sufficient peak performance improvements. Transformations that are unnecessary for performance or that explode code size are discarded in this tier.

3) Optimization: The optimization tier then performs the selected duplications together with the subsequent optimizations whose potential was detected by the simulation tier. The result is the optimized program.
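The following self-contained Java sketch illustrates how tiers 2 and 3 could consume the candidates produced by the simulation tier: rank them by benefit per cost and apply them greedily within a code-size budget. All names, numbers, and the greedy policy are illustrative assumptions, not Graal's actual trade-off model (see Chapter 6):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Hedged sketch of the trade-off tier; names and policy are hypothetical.
    public class TradeOffSketch {
        // One simulated duplication candidate: estimated run-time benefit and
        // estimated code-size increase, both produced by the simulation tier.
        record Candidate(String merge, double benefit, double codeSizeIncrease) {}

        // Rank candidates by benefit per code-size unit and select greedily
        // until a code-size budget is exhausted; harmful candidates are discarded.
        static List<Candidate> select(List<Candidate> simulated, double sizeBudget) {
            List<Candidate> ranked = new ArrayList<>(simulated);
            ranked.sort(Comparator.comparingDouble(
                    (Candidate c) -> c.benefit() / c.codeSizeIncrease()).reversed());
            List<Candidate> chosen = new ArrayList<>();
            double used = 0;
            for (Candidate c : ranked) {
                if (c.benefit() > 0 && used + c.codeSizeIncrease() <= sizeBudget) {
                    chosen.add(c);
                    used += c.codeSizeIncrease();
                }
            }
            return chosen; // the optimization tier would now duplicate these merges
        }

        public static void main(String[] args) {
            List<Candidate> sims = List.of(
                    new Candidate("merge1", 8.0, 40.0),
                    new Candidate("merge2", 1.0, 200.0),
                    new Candidate("merge3", 5.0, 10.0));
            System.out.println(select(sims, 100.0)); // merge3 and merge1 fit the budget
        }
    }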

The simulation step enables a compiler to evaluate different success metrics per potential duplication. Using this information, the compiler can then select the most promising candidates for optimization. Additionally, our approach maps duplication candidates into an optimization cost model that allows the compiler to trade off between different success metrics including peak performance, code size, and compile time. To do so, we developed an architecture-agnostic cost model for a dynamic compiler that can be used by optimizations to make static performance and code-size predictions of compiler IR.
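As a foretaste of Chapter 5, the following hedged Java sketch shows the general shape of such a cost model: each IR node kind carries a relative latency estimate and a relative code-size estimate. The node kinds and numbers here are invented for illustration and are not the values of Graal's node cost model:

    // Hedged sketch: per-node-kind cost estimates; values are illustrative only.
    enum NodeCost {
        ADD(1, 1), MUL(2, 1), DIV(32, 1), LOAD(4, 1), STORE(4, 1), BRANCH(2, 2);

        final int relativeCycles; // architecture-agnostic latency estimate
        final int relativeSize;   // architecture-agnostic machine-code size estimate

        NodeCost(int relativeCycles, int relativeSize) {
            this.relativeCycles = relativeCycles;
            this.relativeSize = relativeSize;
        }
    }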

Simulation-based code duplication enables dynamic compilers to perform a fine-grained performance versus code-size trade-off in duplication-based optimizations. We implemented our approach in two optimizations and show that it outperforms heuristics by producing faster and often smaller code, while consuming only slightly more compilation time than heuristic solutions. In our implementation we show that simulation-based code duplication significantly increases the performance of Java-based applications without sacrificing code size or compilation time.

1.6 Scientific Contributions

In this section we summarize the contributions of this thesis by listing publications and other artifacts produced in the context of the simulation-based code duplication approach.

1.6.1 Publications

This thesis contributes the following novel aspects to the state of the art in compiler research:

• A three-tier simulation-based optimization scheme to first simulate the result of a duplication transformation, then trade off the various simulation results against each other, and finally perform only those duplication transformations that carry sufficient performance increases. Published in [116; 117; 120].

• An application of this scheme to tail-duplication of regular control-flow merges, including a dominance-based algorithm to simulate the effects of a duplication on the structure of a control-flow graph. We argue why simulation is favorable over backtracking to achieve a precise effect determination while still being complete in the number of supported optimizations. Published in [116; 120].

• The application of the three-tier simulation-based optimization scheme on duplication-based loop unrolling to show that the idea also applies to other duplication-based optimizations. Published in [117].

• An optimization to split loops into hot and cold paths, called fast-path loop creation (FP-loop creation), allowing partially aliasing memory operations to be pulled out of hot paths of loops by scheduling them in dominating blocks. This effectively lifts memory anti-dependencies into less frequently executed basic blocks. This work was motivated by our work on simulation-based loop unrolling. Published in [117].

• An algorithm, used in our simulation-based unrolling approach, to partially unroll the hot path of a non-counted loop using FP-loop creation. Published in [117].

• A set of simulation-based unrolling strategies building on the proposed three-tier duplication scheme to selectively apply loop unrolling of non-counted loops to improve peak performance. Published in [117].

• A platform- and architecture-agnostic cost model for a graph-based intermediate representation (IR) in a dynamic compiler that enables optimizations to perform code-size and latency estimations. Published in [118].

• A novel optimization to reduce the locking overhead of synchronization-heavy Java loops. High-level idea published in [158].

Additionally, the author of this thesis contributed to different components of the Graal compiler and collaborated on research in the GraalVM ecosystem. This research is presented in various other publications and is not part of the contributions of this thesis. Subsequently, a non-exhaustive list of the author's work in the project is given:

• A novel approach for source code generation from unstructured compiler IR [119]. This approach was later used in Graal AOT JS, the first Java bytecode to JavaScript compiler in the Graal ecosystem, published in [115].

• An extensive analysis of the impact of dynamic compiler optimizations on Scala collection performance [157].

• An extensive analysis of the usage of x86-64 inline assembly in C programs [165].

• A novel approach for parallel trace register allocation in a dynamic compiler [62].

• A novel algorithm for inline substitution in a JIT compiler [156].

• A novel benchmark suite for concurrency benchmarks on the JVM [158].

• A novel algorithm for loop detection in a bytecode-based partial evaluation algorithm (under review [132]).

1.6.2 Technical Contributions

Large parts of the contributions presented in this thesis have been integrated into Oracle Labs’ Graal Virtual Machine (GraalVM) and are running in production. The artifacts deployed into GraalVM are:

Dominance-Based Duplication Simulation (DBDS) DBDS is integrated into GraalVM and enabled by default in production builds.

Node-cost-model (NCM) The NCM is integrated into the open-source part of the Graal compiler and is also used in GraalVM. Several other optimizations migrated to use the cost model, for example Graal's conditional move [126] optimization.

1.7 Applicability

The optimizations presented in this thesis have been implemented in the Graal compiler [55; 57; 182; 185; 213; 219], which is part of the OpenJDK [139] project. Graal has been an experimental JIT compiler in HotSpot [93] since the Java 10 [142] release. Therefore, we have been able to thoroughly test all optimization implementations presented in this thesis with standard JVM workloads including Java DaCapo [11], ScalaBench [176], Java SPECjvm2008 [189], Renaissance [158] and many other real-world Java applications. Additionally, parts of the concepts presented in this thesis are implemented in GraalVM [140; 148], allowing us to execute the Truffle framework9, which builds on top of Graal. This allows the execution of various language implementations on top of GraalVM and enabled us to test our optimizations with languages other than Java, including JavaScript [59] via the GraalJS JavaScript runtime [147], Ruby [32] via the TruffleRuby [152] runtime, R [70] via the FastR runtime [145] and Python 3 [69] via the Graal Python project [146].

Many concepts presented in this thesis are specific to dynamic compilation, including the need for low compilation times. However, the algorithms and optimizations investigated in this thesis are applicable to any kind of compiler that supports the concept of CFGs and a notion of dominance [39]. All algorithms and optimizations that use profiling information for their optimization decisions do so by feeding it as an additional input into mathematical trade-off functions. We tested this applicability by disabling the usage of profiling information in our optimizations. While this can have some influence on the performance increases for pathological patterns, the general performance improvements generated by our approach can still be replicated.

9 See Chapter 3.

1.8 Project Context

GraalVM & Oracle Labs The work presented in this thesis was conducted together with Oracle Labs in the context of an on-going research collaboration between the Institute for System Software [179] at the Johannes Kepler University Linz [103] and Oracle Labs (formerly Sun Microsystems) [150]. The collaboration originated from a sabbatical Prof. Hanspeter Mössenböck, the head of the Institute for System Software, took in 2000 to develop a novel IR in static-single-assignment form (SSA) for the HotSpot client compiler [109] in order to implement a graph-coloring register allocator [133]. This work was the foundation for successful future collaborations, leading to several Bachelor's, Master's and PhD students working on enhancements for the Java HotSpot VM as well as on the two compilers in it. In the following, we give a non-exhaustive list of the most important publications resulting from this collaboration in chronological order:

• Mössenböck [133] added a new SSA intermediate representation to HotSpot's client compiler (C1) and implemented a graph-coloring register allocator [24].

• Mössenböck and Pfeiffer [134] developed a novel algorithm for linear scan register allocation, an algorithm especially suitable for low compile-time requirements in dynamic compilers.

• Wimmer and Mössenböck [203] improved upon the previously developed algorithm and implemented it in the HotSpot client compiler. Later Wimmer and Franz [202] extended this work to apply linear scan register allocation to an SSA-based IR to utilize the properties of only having one definition of a variable to simplify the linear scan data flow analysis.

• Kotzmann et al. [108; 109] continued the work on C1 and implemented a novel escape analysis algorithm for it.

• Wimmer and Mössenböck [204; 205] proposed the idea of automatic object co-location and developed it further to automatic object and array inlining [206; 207].

• Würthinger et al. [214; 216] developed a novel algorithm for array-bounds-check-elimination for HotSpot’s C1 compiler.

• Würthinger et al. [215] worked on the visualization of program dependence graphs. The resulting software was the first tool to give a scalable visual representation of HotSpot’s server compiler (C2) sea-of-nodes-based intermediate representation [30].

• Würthinger et al. [210; 211; 212; 220] developed dynamic code evolution for HotSpot, a mechanism that allows arbitrary re-definitions of Java classes at run time without shutting down the VM.

• Stadler et al. [184; 186] worked on extending HotSpot with continuations and co-routines, allowing the creation of thousands of continuations in one VM.

• Häubl et al. [85; 86; 87; 88] developed a novel trace-based dynamic compiler for HotSpot.

The preceding list of research led to the inception of the Graal project. Graal started as a project to develop a dynamic compiler for Java that was itself written in Java. This was the successful start of a novel compiler project that led to research in the area of dynamic compilation and programming language implementation. Subsequently, we summarize the most important research milestones grouped into two categories: research on the novel dynamic compiler Graal and research on the abstract-syntax-tree implementation framework Truffle running on top of Graal:

Graal

• Stadler et al. [181] studied methods for efficient compilation queuing and graph caching during compilation.

• Duboscq et al. [55; 57] developed a novel sea-of-nodes based IR for the Graal compiler that allows the compiler to flexibly combine the scheduling of side effects with the maintenance of deoptimization meta data [58].

• Stadler et al. [182] worked on analyzing the performance characteristics of Scala code with respect to dynamic compiler optimizations.

• Stadler et al. [180; 185] proposed partial escape analysis, a novel algorithm to solve the all-or-nothing approach of pre-existing escape analysis algorithms by moving allocations, if possible, into less-frequently executed branches.

• Simon et al. [178] presented Snippets, a high-level and architecture-independent way to represent low-level IR semantics in a dynamic compiler.

• Eisl et al. [61; 62; 63] developed trace-register allocation for the Graal compiler. A novel register allocation approach that divides a program into traces and does register allocation on a per-trace basis in descending execution probability, i.e., more important code first.

Truffle

• Würthinger et al. [213; 218; 219] proposed the Truffle framework for practical partial evaluation, the foundation on which various programming languages have been implemented on the GraalVM.

• Humer et al. [95] proposed the Truffle DSL, a domain-specific language for the implementation of AST-based Truffle interpreters that supports the specialization of operations based on input types and additional properties.

• Wöß et al. [209] introduced the object-storage model for Truffle, an object model that can be used by language implementers to develop Truffle-based language runtimes. It is language-agnostic, portable and supports object representations for dynamic languages.

• Grimmer et al. [81; 82; 83] proposed a language interoperability mechanism for Truffle [213].

• Rigger et al. [160; 161; 162; 163; 164; 166; 167; 168; 169] developed a managed execution environment called Sulong to execute unmanaged languages on the JVM. Rigger additionally studied the usage of non-standardized elements in C projects such as inline assembly and compiler built-ins.

• Daloze et al. [40; 41; 42] devised a thread-safe object model and efficient thread-safe data structures for dynamic languages such as Ruby.

• Stadler et al. [183] proposed aggressive speculative optimizations for the R language.

• Van-De-Vanter et al. [195] proposed the Truffle Instrumentation Framework, a language-agnostic extension to the Truffle framework that allows easy debugging and tool support for Truffle language implementations.

Graal supports code generation for x86-64 and SPARC. Additionally, a backend for arm64 is under experimental development. Over the years, experimental backends for PTX & HSAIL, OpenCL and CUDA have also been developed.

1.9 Structure of this Thesis

This thesis is structured as follows:

• Chapter 2 introduces theoretical concepts that lay the foundation of this thesis. It discusses terminology in the domain of compiler research necessary to reason about code duplication.

• Chapter 3 introduces the Java language, its virtual machine and basic execution paradigms. It gives an overview of the concepts of dynamic compilation and speculation. In the second part of this chapter we take a closer look at the Graal compiler and its ecosystem. All optimizations presented in this thesis have been designed for the Graal compiler, so we explain its paradigms, IR and characteristics.

• Chapter 4 introduces the problems of excessive code duplication in detail and motivates the non-heuristic solution to the problem. We propose a simulation-based solution to this problem and argue what is needed for this paradigm to work. We also motivate the need for a compiler cost model to perform more accurate decisions in duplication simulation.

• Chapter 5 introduces the node-cost model, a language- and architecture-agnostic cost model for the graph-based intermediate representation of the Graal compiler. This cost model is a fundamental requirement for the simulation-based code duplication approach to function properly.

• Chapter 6 presents our design and implementation of the simulation-based duplication scheme for traditional tail-duplication. We introduce a novel algorithm for duplication simulation called Dominance-Based Duplication Simulation.

• In Chapter 7 we propose an extension to the simulation-based duplication approach to apply simulation-based unrolling of non-counted loops via two novel algorithms: fast-path loop creation and fast-path loop unrolling.

• Chapter 8 contains an extensive performance evaluation of the algorithms and optimizations presented in this thesis. We conducted several experiments with different configurations of the compiler and the VM to show that our optimizations can improve the performance of modern Java applications as well as dynamic languages like JavaScript at moderate compile time and code-size increases.

• Chapter 9 discusses related work in the domain of code duplication, loop unrolling and compiler cost models. It presents a comparison of simulation-based code duplication with duplication-based optimizations proposed by related work.

• Chapter 10 contains an outlook to future work. We believe that simulation-based code duplication is a foundation for fruitful future research in the direction of specialized IRs for duplication and machine learning of cost-model parameters.

• We conclude the thesis in Chapter 11.


Chapter 2

Terminology

In this chapter we define a common terminology for the rest of this thesis by summarizing all theoretical concepts needed in order to understand the simulation-based code duplication approach.

While compiler construction is one of the oldest research domains in computer science, a common terminology is still vital for the exchange of ideas. Therefore, we define a minimal set of terms, concepts and algorithms we use throughout the rest of this thesis to explain simulation-based code duplication. We describe them in a semi-formal fashion and refrain from formal definitions, as they have been given by related work and are not part of the contribution of this thesis.

2.1 Compilation

Before we go into the basic concepts of intermediate representations, we define the basic terms of compilation that are used throughout the rest of this thesis.

Definition 1 (Compiler). A compiler C is a program that takes an input program p in input format S denoted as S(p) and produces a semantically equivalent program p′ in output format O denoted as O(p). Hence, a compiler is a function C(p): S(p) → O(p).

Definition 2 (Intermediate representation (IR)). During compilation C(p) the compiler constructs an IR of program p, denoted as IR(p). IR(p) is semantically equivalent to p and S(p); however, it is in a different format that is more suitable for optimizations.

Definition 3 (Optimization). During compilation C(p) the compiler performs optimizations (denoted as OP(IR)) on IR(p) that do not change the semantics of p. However, OP(IR) can change success metrics of the execution E(O(p)), including run-time performance, latency, memory usage, cache behavior, etc. The original semantics of p must be preserved by OP(IR).

Historically, a classical compiler implementation of C(p): S(p) → O(p) is a static compiler from a source language to machine code, e.g., C(p): SourceCode(p) → MachineCode(p). However, nowadays this is no longer generally valid, as there are arbitrarily many implementations of S and O that differ from source or machine code. Since we propose simulation-based code duplication for a Java compiler, we focus on one particular instantiation of C(p): a Java bytecode [124] to machine code compiler.

Definition 4 (Dynamic Java Compiler). A dynamic Java compiler is a function JavaDynCompiler(p): JavaBytecode(p) → MachineCode(p).

Every time we use the term compiler throughout the rest of this thesis we mean Definition 4.

2.2 Intermediate Representation

Compilers transform a program from its source representation S(p) into an IR IR(p) in order to be able to optimize1 it more easily. Source code or bytecode typically has many abstractions that are unnecessary or unsuitable for optimization. There are many reasons to use an intermediate representation, as well as many requirements a particular IR(p) must fulfill. Generally speaking, advanced optimizations like inlining [156] or duplication [120] require detailed information about the data and control flow of a program. Every optimization has different requirements when it comes to the kind of analysis it performs or the context information it requires. Therefore, modern-day Java compilers use multiple IRs to satisfy the many different needs of optimizations: JavaDynCompiler(p): JavaBytecode(p) → IR1(p) → IR2(p) → … → IRn(p) → MachineCode(p).

2.2.1 Control Flow Graph

In this thesis we focus on dynamic compilation in the Graal compiler, thus we focus on the IRs used by Graal. Graal uses two main intermediate representations for optimizations, a control-flow graph (CFG) and a special graph-based IR called GraalIR: GraalCompiler(p): JavaBytecode(p) → GraalIR(p) → CFG_LIR(p) → MachineCode(p).

We will explain the details of GraalIR in Chapter 3 and will now give a short definition of a CFG and some basic properties necessary to understand GraalIR. GraalIR, even though it is a special intermediate representation, has an equivalent 1:1 mapping to a CFG after scheduling2.

1 See Definition 3.
2 See Paragraph 3.1.2.1.1.

A control-flow graph is a standard intermediate representation used by many compilers like C1 [109], GCC [78], LLVM [111] and Graal [60; 201].

Definition 5 (Instruction). An instruction, denoted as i, is the smallest building block of a program. Depending on the IR, in which the program is represented, it typically represents a single operation as defined by the IR.

Definition 6 (Basic Block). A basic block, denoted as bb, is the longest possible sequence of instructions bb(i0, …, in) such that instructiontype3(i0…(n−1)) ≠ branch ∧ pred4(i1…n) ∈ bb ∧ ∀k ∈ [1, n]: pred(ik) = ik−1. That is, all instructions inside bb are branch-free instructions except the last instruction, and there is no outer branch instruction merging into bb. Therefore, bb is the longest possible sequence of branch-free instructions (except for the last one) without external branches leading to the instructions i1 … in.

Definition 7 (Edge). An edge is a 2-tuple (bbout, bbin) of two basic blocks. That is, an edge is a branch instruction between two basic blocks for which the following holds: successor(ilast(bbout)) = predecessor(ifirst(bbin)).

Definition 8 (Control Flow Graph). A control flow graph (CFG) is a directed graph composed of a 2-tuple CFG ≡ (Blocks, Edges), where Blocks is a set of basic blocks and Edges is a set of edges connecting those blocks. Per definition [3; 35; 76; 114; 159] a CFG has exactly one entry basic block, i.e., a block without a predecessor, and 1 to n exit blocks without a successor.
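Definitions 5 to 8 map almost directly onto data types. The following minimal Java sketch does so; the class and field names are illustrative and are not Graal's IR classes:

    import java.util.ArrayList;
    import java.util.List;

    // Hedged sketch of Definitions 5-8 as data types; names are illustrative.
    final class BasicBlock {
        final List<String> instructions = new ArrayList<>(); // only the last may branch
        final List<BasicBlock> successors = new ArrayList<>();   // outgoing edges
        final List<BasicBlock> predecessors = new ArrayList<>(); // incoming edges
    }

    final class ControlFlowGraph {
        BasicBlock entry; // exactly one block without predecessors
        final List<BasicBlock> blocks = new ArrayList<>();

        // An edge (bbOut, bbIn) connects the branch at the end of bbOut
        // with the first instruction of bbIn.
        void addEdge(BasicBlock out, BasicBlock in) {
            out.successors.add(in);
            in.predecessors.add(out);
        }
    }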

[Figure: the Java source of foo (reproduced below), its control-flow graph in three-address pseudo code with basic blocks bb0 to bb3, and the corresponding dominator tree.]

    static int S;
    static int foo(int a) {
        for (int i = 0; i < a; i++) {
            S += i * a;
        }
        return S;
    }

Figure 2.1: Sample program foo.

Figure 2.1 shows a simple Java method foo with its associated CFG in three-address high-level pseudo code. The CFG of foo consists of 4 basic blocks:

3 instructiontype(i) denotes the category a certain instruction belongs to, e.g., branching, expression and so on.
4 pred(i) denotes the predecessor instruction of i.

• The entry block bb0 of the method. Execution starts at instruction 0.

• The loop header bb1 that performs the comparison and a branch out of the loop in case the loop exit condition is reached.

• The loop body bb3, which contains the loop's instructions.

• The loop and method exit block bb2, which returns the variable S.

2.2.1.1 Dominance

A very important concept for optimizing compilers is the dominator tree [3; 35; 39; 76; 114; 159] that is based on the dominance relation of basic blocks.

Definition 9 (Dominance Relation). The dominance relation dominates(a, b) is a binary relation between the basic blocks of a program, dominates ⊆ Blocks × Blocks = {(a, b) | a ∈ Blocks, b ∈ Blocks}, and is defined by the following term: dominates(a, b) → ∀ paths p from bstart to b: a ∈ p, i.e., every execution path from the start block to block b has to go through a. The dominance relation is transitive, and a block trivially dominates itself. Additionally, direct dominance describes the dominance relation between two basic blocks a, b where a ≠ b.

Definition 10 (Dominator Tree). A dominator tree, denoted as domTree(CFG), is a 2-tuple tree domTree ≡ (Blocks, Edges), where Blocks is the set of basic blocks of the CFG and the Edges represent the direct dominance relation between all blocks.

We show a simple example of a dominator tree on the right-hand side of Figure 2.1 for program foo. An arrow in the dominator tree means that the source block of the arrow directly dominates the target block (which is also true transitively).
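To make the dominance relation concrete, the following Java sketch computes dominator sets with the classic iterative data-flow scheme (dom(b) = {b} united with the intersection of dom(p) over all predecessors p of b), reusing the ControlFlowGraph sketch from above. This quadratic formulation is for illustration only; production compilers typically use faster algorithms such as Lengauer-Tarjan or Cooper-Harvey-Kennedy:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Hedged sketch: iterative dominator-set computation; assumes all
    // blocks are reachable from the entry block.
    final class Dominance {
        static Map<BasicBlock, Set<BasicBlock>> dominators(ControlFlowGraph cfg) {
            Map<BasicBlock, Set<BasicBlock>> dom = new HashMap<>();
            for (BasicBlock b : cfg.blocks) {
                dom.put(b, new HashSet<>(cfg.blocks)); // start pessimistically: "all blocks"
            }
            dom.get(cfg.entry).clear();
            dom.get(cfg.entry).add(cfg.entry); // the entry is dominated only by itself

            boolean changed = true;
            while (changed) {
                changed = false;
                for (BasicBlock b : cfg.blocks) {
                    if (b == cfg.entry) continue;
                    Set<BasicBlock> next = new HashSet<>(cfg.blocks);
                    for (BasicBlock p : b.predecessors) {
                        next.retainAll(dom.get(p)); // intersect predecessor dominator sets
                    }
                    next.add(b); // a block always dominates itself
                    if (!next.equals(dom.get(b))) {
                        dom.put(b, next);
                        changed = true;
                    }
                }
            }
            return dom;
        }
    }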

The dominator tree is heavily used during various optimizations and during the building of static single assignment form5. The dominance relation is especially important for duplication, as all simulation-based algorithms presented in this thesis are based on the dominator tree.

2.3 Static Single Assignment Form

In the last section of this chapter special emphasis is put on static single assignment (SSA) form [28; 39; 133; 134; 202], a special property of intermediate representations.

5 See Section 2.3.

Definition 11 (SSA Form). An intermediate representation IR(p) of program p is in SSA form iff for every variable there is only a single instruction in which it is defined, i.e., for every assignment to a local variable in the source program, an artificial variable with a different name is introduced.

SSA form requires a special handling of variable assignments. Consider the following piece of code:

    a = 1
    if (c) {
        a = 2
    }
    use(a)

The variable a is assigned the constant 1 before the statement if (c) and the constant 2 in the true branch. In order to represent this program in SSA form, a compiler cannot assign to a twice. Therefore, it introduces a new version of a for every assignment, as seen below:

    a1 = 1
    if (c) {
        a2 = 2
    }
    // which a to use? a1 or a2
    use(...)

After the control flow merge the program uses a. The question is which version of a should be used: a1 or a2. Since the actual value of a after the if depends on which branch is executed at run time, the compiler cannot perform this decision. In SSA form, this is handled by introducing so called phi functions, denoted as ϕ, that model the union of all possible predecessor values of a variable at a control flow merge. Depending on which predecessor was executed the value of a ϕ is the value assigned in the respective predecessor. Therefore, to properly represent the example from before in SSA form the compiler needs to introduce the artificial ϕ instruction as seen below:

    a1 = 1
    if (c) {
        a2 = 2
    }
    a3 = ϕ(a1, a2)
    use(a3)
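In a CFG-based IR, a ϕ can be represented as a node of the merge block whose k-th input corresponds to the merge's k-th predecessor. The following Java sketch shows one possible shape; the class names are illustrative (reusing the BasicBlock sketch from Section 2.2.1) and do not mirror Graal's actual node classes:

    // Hedged sketch of a ϕ node; names are illustrative.
    interface Value {} // placeholder for IR values

    final class PhiNode implements Value {
        final BasicBlock merge; // the control-flow merge this ϕ belongs to
        final Value[] inputs;   // inputs[k] flows in over predecessor k of the merge

        PhiNode(BasicBlock merge, Value[] inputs) {
            this.merge = merge;
            this.inputs = inputs;
        }

        // The run-time value of a ϕ is the input that corresponds to the
        // predecessor over which control reached the merge.
        Value inputFor(int predecessorIndex) {
            return inputs[predecessorIndex];
        }
    }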

We illustrate the idea of SSA form in Figure 2.2, which shows the sample program foo from Figure 2.1 without SSA form and after rewriting it to SSA form (marked with the red instructions). The original local variable i, which is assigned twice in the program, once before the loop and once on the backedge of the loop, has been replaced with a ϕ function based on the values of i0 and i1.

[Figure 2.2 shows the control flow graph of foo twice: on the left without SSA form, where i is assigned in bb0 (i = 0) and on the backedge in bb3 (i = tmp3), and on the right in SSA form, where the assignments are split into i0 (bb0) and i1 (bb3) and the loop header bb1 contains the ϕ function i2 = ϕ(i0, i1); the rewritten instructions are marked in red.]

Figure 2.2: Sample program foo in SSA form.

There are several optimizations in Graal that utilize SSA form, including inlining, duplication, common subexpression elimination, loop invariant code motion, unrolling, scheduling, and many more.

Duplication especially benefits from SSA form, as our proposed simulation-based duplication approach6 utilizes the dominance relation and its value-flow properties to simulate the effects of duplication transformations.

6See Chapters 4, 6 and 7.

Chapter 3

GraalVM System Overview

This chapter introduces GraalVM: The Java virtual machine we used as our implementation platform for the algorithms and optimizations presented in this thesis. We present some major design principles of the Java programming language, the HotSpot JVM, JVMCI, the Graal compiler, the Truffle language implementation framework and Substrate VM. We reflect on the design and properties of the Graal compiler, its IR, optimization tiers, phases and capabilities.

Java [124] is a managed, general-purpose programming language maintained by Oracle [149]1 which gained industry popularity shortly after its first release in 1996. Java is, and has been for many years, a very popular programming language [192].

3.1 Java

Historically, Java has been a so-called interpreted language. Java programs are executed by a Java runtime environment (JRE). For this process, Java source code is first compiled to a platform- and architecture-agnostic intermediate representation called Java bytecode [124]. This is done to ensure that Java programs are platform independent. The original motto employed by the design of the Java Virtual Machine specification, which also contains the specification of the bytecode, was "Write once, run anywhere" [38]. A JRE, among many other components, typically contains a JVM which executes the Java bytecode. While Java bytecode is platform- and architecture-agnostic, JVMs are highly platform- and architecture-specific pieces of software. Over the decades, multiple JVMs have been deployed. Some evolved while others were discarded after several releases. Today the most notable ones are IBM's J9 [97], Jikes RVM [4], Sun Microsystems' HotSpot [93] and Oracle Labs'

1It was originally proposed and deployed by Sun Microsystems, which was acquired by Oracle in 2010.

GraalVM [140; 148]. We propose simulation-based code duplication for the Graal compiler which is the dynamic compiler used in GraalVM (and in HotSpot from Java 10 [142]). Therefore, we will focus on these two JVMs for the rest of this chapter.

3.1.1 HotSpot JVM

[Figure 3.1 shows a block diagram of the HotSpot JVM: a class loader parses .class files and .jar archives containing Java bytecode; the verifier and linker prepare the loaded classes; the interpreter executes the bytecode and produces profiling info, while the C1 and C2 compilers, and any JVMCI compiler implementing the compiler interface, produce compiled code that is installed in the code cache; the heap holds application data and class data and is managed via the GC interface by the garbage collector implementations Serial GC, Parallel GC, G1, ZGC and Shenandoah.]

Figure 3.1: HotSpot JVM overview.

HotSpot [30; 93; 109; 144; 153], the currently leading JVM on the market, is a complex piece of software comprising millions of lines of C++ code. However, over the decades it has proven to be the fastest and most reliable JVM implementation. Most notably, HotSpot closed the performance gap between statically compiled languages and interpreted languages like Java [80; 129]. Traditionally, Java bytecode has been executed by an interpreter upon invocation. The first releases of HotSpot did not contain any dynamic or just-in-time compiler; compilers were added incrementally over decades of releases. Current state-of-the-art JVM implementations apply multi-level dynamic compilation to optimize for latency and throughput.

Figure 3.1 shows a schematic overview of the HotSpot JVM and the major components that are necessary to execute Java with near native speed. Upon invocation of the JVM with a Java main method, the JVM parses the bytecode of the main method’s enclosing class file and creates the necessary VM data structures for the type definitions, static memory, etc. Then the interpreter2 starts executing the bytecode. During execution it collects profiling [143; 196] data for several bytecode instructions:

• Invocation Counts: It records the number of times a method has been executed.

• Loop Backedge Count: It records the number of times a loop backedge has been taken.

• Type Checks: It records a fixed number of different types for a type-check’s operand.

• Receiver Types: For virtual calls it records a fixed number of different receiver types and an entry for all other types.

• Branch Probabilities: For every branching instruction it records how often a conditional branch was taken, effectively recording branch probabilities.

This profiling information is used to guide compilation and optimization. Based on the execution counts of methods and loop backedges3, the interpreter queues currently executed methods for compilation. Depending on the compilation policy, one of the compiler implementations picks up a compilation task, performs the compilation and installs the generated code in a memory region called the code cache. The next invocation of this method will not call the interpreter but the compiled version of the method. During compilation, the compilers in HotSpot use the profiling information collected by the interpreter to apply speculative optimizations [56; 57; 58]. Speculative optimizations utilize the profiling information to only optimize the common parts of a program4. If speculative optimizations performed during compilation turn out to be wrong later, HotSpot applies deoptimization [47; 56; 92; 108; 216; 221] in order to revert the already optimized version of the code back to an un-optimized baseline version. This baseline version is the original bytecode, which will be executed by the interpreter again. HotSpot has a complex policy for the different levels of optimization called tiered compilation [151].

2The predominant interpreter implementation in HotSpot is a template-based handwritten assembler interpreter for every bytecode. 3Backedge counter overflow triggers on-stack-replacement [47] compilation where an interpreted stack frame is swapped to a compiled version during execution of a loop. 4The HotSpot JVM has its name from only optimizing important parts of a program. These important parts, i.e., the methods for which the execution counters are high, are called hotspots and get compiled.
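As a purely illustrative sketch of the counter-based compilation trigger described above (all names and thresholds are hypothetical, not HotSpot's actual values or code):

final class CompilationPolicy {
    static final int INVOCATION_THRESHOLD = 10_000; // hypothetical threshold
    static final int BACKEDGE_THRESHOLD = 60_000;   // hypothetical threshold

    void onMethodInvoked(MethodProfile m, CompileQueue queue) {
        if (++m.invocationCount >= INVOCATION_THRESHOLD && !m.compileRequested) {
            m.compileRequested = true;
            queue.enqueue(m); // a compiler thread picks this up asynchronously
        }
    }

    void onLoopBackedge(MethodProfile m, CompileQueue queue) {
        if (++m.backedgeCount >= BACKEDGE_THRESHOLD && !m.compileRequested) {
            m.compileRequested = true;
            queue.enqueue(m); // may trigger on-stack replacement (OSR)
        }
    }
}

final class MethodProfile {
    int invocationCount;
    int backedgeCount;
    boolean compileRequested;
}

interface CompileQueue {
    void enqueue(MethodProfile m);
}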

[Figure 3.2 shows the tiered execution scheme: execution starts in the interpreter, methods are first optimized by the C1 compiler and then by the top-tier compiler (C2 or a JVMCI compiler, i.e., Graal), while deoptimization transitions lead from compiled code back to the interpreter.]

Figure 3.2: HotSpot Tiered compilation.

We show a simplified scheme of tiered compilation in Figure 3.2. Execution starts for every method in the interpreter. Once the invocation counters of a method overflow, the interpreter triggers compilation of this method with the client compiler5, which is one of the two compilation tiers in HotSpot. C1 [109] is a simple dynamic compiler that targets fast warmup. It applies a few selective optimizations, but its main responsibility is to produce machine code quickly. C1-compiled code performs limited profiling collection of, e.g., invocation counters. If a C1-compiled method gets called very often, a final top-tier compilation will be scheduled by the VM. Depending on the configuration of the VM [142], the top-tier compiler is either the server compiler [29; 30; 144; 153]6 or the Graal compiler [55; 56; 57; 116; 117; 120; 156; 157; 181; 182; 185; 218]. From within every intermediate execution tier, code can be deoptimized back to the un-optimized version that is executed in the interpreter.

3.1.2 Graal Compiler

The Graal compiler project [55; 56; 57; 116; 117; 120; 156; 157; 181; 182; 185; 218] started as an endeavor to develop a novel dynamic compiler for the HotSpot JVM. It was originally designed as a replacement for the C2 compiler. However, in contrast to C2, Graal is itself implemented in Java. With the introduction of the Java Virtual Machine Compiler Interface [141] (JVMCI), it became possible to replace C2 with a dynamic compiler implemented in Java. As this thesis targets optimizations for the Graal compiler, for the rest of this thesis we will focus on the Graal compiler and GraalVM, the ecosystem developed by Oracle Labs around the compiler.

Choosing Java instead of C++ as the implementation language has many advantages, including automatic memory management, heavily optimized polymorphism, expressiveness of the language and tool support.

5Also called the C1 compiler. 6Also called the C2 compiler.

[Figure 3.3 shows the Graal compilation pipeline: bytecode is parsed into the high-level IR (HIR generation); the frontend then runs the high tier (constant folding, GVN, inlining, partial escape analysis, read elimination, conditional elimination, duplication, loop optimizations, ...), a lowering, the mid tier (memory optimizations, guard optimizations, lock optimizations, ...), another lowering, and the low tier (dead code elimination, scheduling, ...); after LIR generation, the backend performs LIR-level optimizations such as CFG optimizations, null check removal and PRE, and finally generates machine code. The frontend operates on the bytecode level and a platform-independent machine level, the backend on the platform-dependent machine level.]

Figure 3.3: Graal compiler schematic.

We present an overview of the Graal compiler in Figure 3.3. During compilation, Graal uses several layers of abstraction in order to allow optimizations to be expressed at different levels. The basic compilation process is divided into a frontend and a backend. Both have a dedicated intermediate representation: in the frontend Graal uses a graph-based intermediate representation (called Graal IR [55; 57]) that is based on the sea-of-nodes IR [29; 30], originally proposed for the C2 compiler. In the backend, Graal uses a CFG-based intermediate representation called LIR [201] (for low-level IR) that was originally proposed and deployed in the C1 compiler. During compilation, the bytecodes of a method are parsed and the high-level intermediate representation is generated. Then three frontend tiers, i.e., three different levels of conceptual abstraction, are exercised on the IR. Between each tier a so-called lowering is performed that de-sugars abstractions from the IR and prepares it for the next level of optimizations, converging towards a machine- and platform-dependent IR before LIR generation. Many optimizing compilers, including Graal, group optimizations into so-called phases. Phases analyze a program for optimizable patterns and perform the associated optimization transformation. Graal contains many such phases in all tiers. Below we summarize the most important ones per tier; a simplified sketch of the phase concept follows after the list:

1) High-Tier: The high-tier contains all high-level optimizations utilizing Java bytecode semantics, including inlining [156], global value numbering (GVN), constant folding [7], partial escape analysis [185], read elimination, conditional elimination, duplication [116; 117; 118; 120] and many more. Operations are typically represented as one IR node per bytecode. E.g., a field load is represented by one IR node modeling both the null check (as required by the JVM specification [124]) and the actual memory access.

2) Mid-Tier: Between the high- and mid-tier the IR is lowered to platform-independent machine- like operations. Null check semantics are expanded to guard nodes, Java heap accesses are modeled as memory accesses with address semantics, etc. In the mid-tier, Graal applies optimizations that are specific to Java memory semantics. Additionally, it applies lock optimizations as well as standard optimizations like GVN and constant folding along the way.

3) Low-Tier: The low-tier performs a limited number of optimizations and lowers platform-independent machine level semantics to platform-dependent machine level semantics. E.g., address operations are lowered to the associated architecture, e.g., amd64 or SPARC. The last phase schedules the IR graph and finally prepares it for LIR generation.
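As a hypothetical, heavily simplified sketch of the phase concept (this is not the actual Graal phase API; all types are our own stubs), consider a phase that scans the graph for additions of two constants and folds them:

interface Node {}
record ConstantNode(int value) implements Node {}
record AddNode(Node x, Node y) implements Node {}

interface Graph {
    Iterable<Node> nodes();                   // assumed to iterate a snapshot
    void replace(Node oldNode, Node newNode); // rewires all usages
}

interface Phase {
    void run(Graph graph);
}

// A toy constant-folding phase: analyze for a pattern, then transform.
final class ConstantFoldAddPhase implements Phase {
    @Override
    public void run(Graph graph) {
        for (Node n : graph.nodes()) {
            if (n instanceof AddNode add
                    && add.x() instanceof ConstantNode cx
                    && add.y() instanceof ConstantNode cy) {
                // Replace the add with a freshly computed constant.
                graph.replace(add, new ConstantNode(cx.value() + cy.value()));
            }
        }
    }
}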

We will focus on the frontend of Graal, because simulation-based code duplication is a high-level compiler optimization.

3.1.2.1 Graal IR

Simulation-based code duplication is performed in the high-tier of the Graal compiler and thus works on Graal's high-level IR [55; 58]. In this section we explain the IR as well as its properties in detail. We illustrate Graal IR with a simple code sample in Figure 3.4, which shows a source program foo along with its bytecode and the associated Graal IR. Graal IR is a sea-of-nodes-based [29; 30; 153] directed graph in SSA form [39]. Each IR node produces at most one value. Data flow is represented with upward edges and control flow with downward edges. The IR is a superposition of the control-flow and the data-flow graph. Control-flow nodes represent side-effecting instructions that are never re-ordered, whereas data-flow nodes are floating. Their final position in the generated code is purely determined by their data dependencies. Loop backedges and loop exits are modeled explicitly as instructions in the IR. In order to support deoptimization [47; 56; 92; 108; 216; 221] to the interpreter tier, Graal needs to model the interpreter state in the IR [58]. The interpreter state is defined [124] as the set of all local variables of a method, the set of all values on the expression stack, as well as the set of all locks held. Additionally, in order to support re-materialization of escape-analyzed objects [185] in the interpreter, the compiled code needs to maintain a mapping from virtualized objects to the IR nodes representing an object's fields [180; 185]. Graal models the interpreter state with so-called FrameState nodes as first-class nodes in the IR that have inputs representing locals, stack values, locks, and virtual object mappings for re-materialization. State nodes are introduced during bytecode parsing for all nodes with a visible side effect. In order to illustrate the different concepts of the IR we explain them with the example from Figure 3.4. In Figure 3.4, we first want to put emphasis on the bytecode. After compilation from source code to bytecode, the original high-level control flow constructs have been replaced with conditional jumps.

[Figure 3.4 shows the source code of foo, its Java bytecode annotated with the symbolic contents of the expression stack and the local variables at every bytecode index, and the corresponding Graal IR: control-flow nodes (Start, If, LoopBegin, LoopEnd, LoopExit, StoreField, Return), floating data-flow nodes (constants, Param a, Add, Mul, LoadField, and the ϕ for i), and FrameState nodes at bytecode indices 0, 2, 17 and 23 attached to side-effecting nodes via state edges.]

Figure 3.4: Graal IR example.

Each bytecode instruction has an index and a variable width7. In addition to the bytecode, we also show the stack and the local variables during execution of the bytecode with their symbolic values. This is important to understand how the frame state is represented in the Graal IR. In the IR we have the regular control-flow nodes, i.e., all nodes that are typically not re-ordered during optimization, as well as the data-flow (floating) nodes that represent side-effect-free operations in the program. For example, the addition and the multiplication at bytecode indices 12 & 13 are floating nodes, i.e., their final position in the generated code only depends on their inputs. The variable i from the original source program, in bytecode the local variable at index 1, is represented by a ϕ node in Graal as it is an SSA value. The Graal IR for foo has several frame states (state nodes)8 at different bytecode indices. The semantics of state nodes are defined as follows: the state node attached to a control-flow instruction is the state after the instruction, i.e., the state of the stack and locals in the interpreter at the associated bytecode index after executing the node that points to it. For example, the state after the start node is described by the state node at bytecode index 0. This means that if a deoptimization happens right after the start node, the correct interpreter state is represented by all inputs to the state node, i.e., only the local variable at index 09.

7This depends on the type of the instruction. 8We mark them with green borders in Figure 3.4. Nodes can have a state-after edge pointing to a frame state. Those edges are also marked in green. 9This is the parameter a.

The next bytecode instruction to be executed by the interpreter is iconst_0 at index 0. More interesting is the state node after the StoreField node in the loop: after storing the new value of S, deoptimization would force the interpreter to continue at bytecode index 17, which is the first instruction after the putstatic. The associated state node models the locals in the interpreter, i.e., the parameter a as well as the loop variable i which corresponds to the ϕ in the IR. Note that this is exactly the state of the interpreter at that bytecode index, i.e., there are no values on the stack and the locals are a and i. We want to point out that the modeling of interpreter state in the compiler has an influence on dead code elimination. This means that certain values deduced by optimizations to be effectively dead might still be needed by the interpreter, which executes un-optimized bytecode. However, given that frame states are not optimized, they naturally force all values needed by the interpreter to be kept alive.
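As a simplified illustration of the description above, a FrameState can be thought of as the following data structure (a sketch under our own naming assumptions, not Graal's actual class):

interface Node {}
interface VirtualObjectMapping {} // maps a virtualized object to its field values

final class FrameState {
    final int bci;       // bytecode index to resume at ("state after" semantics)
    final Node[] locals; // values of all local variable slots
    final Node[] stack;  // values on the expression stack
    final Node[] locks;  // monitors held at this point
    final VirtualObjectMapping[] virtualObjects; // for re-materialization

    FrameState(int bci, Node[] locals, Node[] stack, Node[] locks,
               VirtualObjectMapping[] virtualObjects) {
        this.bci = bci;
        this.locals = locals;
        this.stack = stack;
        this.locks = locks;
        this.virtualObjects = virtualObjects;
    }
}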

3.1.2.1.1 Scheduling

Graal IR is based on the sea-of-nodes [29; 30] approach originally proposed for HotSpot's C2 (server) compiler. Floating nodes in the IR represent side-effect-free instructions that can be executed (and re-executed) anywhere in the generated code as long as they respect the topological ordering of their inputs. This means that a floating node can be executed as soon as all its inputs are available. A valid position (called a schedule) for a floating node is anywhere in the generated program after its inputs have been computed. The process of assigning valid positions in the generated code to floating nodes is called scheduling [29]. There are many valid schedules, yet only a few of them are considered suitable performance-wise. Compilers try to schedule nodes as late as possible, i.e., directly before their usages or in the latest common dominator block of all usages. Floating nodes allow a compiler like Graal to easily apply very complex code motion, as operations scheduled as late as possible are only executed if they are actually needed, which reduces run time whenever they are not. However, for analyses and compiler optimizations it is often a complex burden to maintain correct, schedulable graphs, even more so when always modeling the interpreter state with state nodes. Many analysis algorithms become more complex without a schedule, but often also faster, as the process of scheduling a graph is very compile-time-intensive.
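A minimal sketch of the late-scheduling rule described above (hypothetical types; a real scheduler also has to respect loops and memory effects): a floating node is placed in the lowest block of the dominator tree that still dominates all of its usages.

final class Block {
    Block idom;   // immediate dominator
    int domDepth; // depth in the dominator tree (start block = 0)
}

final class Scheduler {
    // Latest legal position: the common dominator of all usage blocks.
    static Block latestLegalBlock(Iterable<Block> usageBlocks) {
        Block result = null;
        for (Block usage : usageBlocks) {
            result = (result == null) ? usage : commonDominator(result, usage);
        }
        return result; // null only if the node has no usages (it is dead)
    }

    // Walk up the dominator tree until both blocks meet.
    static Block commonDominator(Block a, Block b) {
        while (a != b) {
            if (a.domDepth >= b.domDepth) {
                a = a.idom;
            } else {
                b = b.idom;
            }
        }
        return a;
    }
}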

Graal implements special begin- and end-nodes in the IR that mark the beginning and the end of a basic block. The fixed nodes in the IR represent a CFG10 without a schedule for the floating nodes. After scheduling, every floating node is assigned a basic block as well as an index in the list of nodes of that block. Thus, there is a direct mapping from Graal IR after scheduling to a standard CFG.

10See Chapter 2.

3.1.3 Truffle

Truffle [209; 213; 218; 219] is a self-optimizing abstract syntax tree (AST) interpreter framework on top of GraalVM. Truffle language implementations apply partial evaluation [74] to AST programs, which are then compiled by the Graal compiler to reach near-native performance of the generated code. Truffle AST interpreters are themselves implemented in Java. Before compilation, their ASTs are combined into one compilation unit via partial evaluation, which effectively inlines the logic for each AST operation into the root AST node. This is performed on the Java bytecode level. Therefore, dynamic languages executed on top of Truffle produce Java compilation units themselves. However, Truffle compilation units are typically larger and more dynamic than classical Java workloads, requiring more elaborate compiler optimizations to reach near-native performance [120]. In our evaluation in Chapter 8 we present several experiments using benchmarks executed with GraalJS [147], Oracle Labs' JavaScript implementation on top of Truffle.

3.1.4 GraalVM

We conclude this chapter with Figure 3.5, which shows an overview of GraalVM. GraalVM is the virtual machine and ecosystem developed by Oracle Labs around the Graal compiler.

[Figure 3.5 shows the GraalVM ecosystem: JVM languages such as Java, Scala, Kotlin and Groovy run directly on the Graal compiler; Python, Ruby, R and JavaScript are implemented on top of Truffle; C and C++ are supported via Sulong (LLVM); deployment targets are the HotSpot JVM (Solaris, Windows, Linux and Mac on SPARC, x86-32bit, amd64 and arm) and the Substrate VM (Windows, Linux and Mac on amd64 and arm).]

Figure 3.5: Graal ecosystem.

To this day, there are several languages that are directly executed on GraalVM, i.e., those that compile to JVM bytecode. On top of Truffle [209; 213; 218; 219], there are language implementations for Python [146], Ruby [152], R [145], and JavaScript [147]. On top of Sulong [160; 162; 168; 169] there are language implementations for C/C++.

Normally, the Graal compiler and Truffle run on HotSpot. However, SubstrateVM [187; 188] is a different, ahead-of-time-compiled Java VM that also allows deployment of Graal and Truffle. Its advantages are instant start-up times and a low memory footprint.

Chapter 4

Simulation-Based Code Duplication

This chapter discusses the problems and implications of code duplication in a general way and then proposes a solution: simulation. We argue why simulation is favorable over other solutions and what a compiler must implement in order to support this paradigm.

Every optimization that increases code size [33; 36] for the sake of performance is potentially subject to alignment issues [18], cache issues, compile-time increases and many other harmful effects. Therefore, compilers are obliged to control the size of the produced code and keep it at a reasonable level. Yet, the definition of reasonable depends on the performance increase delivered by the optimization. If the negative effects are justified by a peak performance increase, a compiler may decide to accept them. However, this implies that compilers have the information to perform this kind of trade-off: they must be able to estimate the performance benefit of a single optimizing transformation. This can be done with heuristics or in a precise way.

This thesis proposes a novel approach for code duplication that allows a compiler to perform a fine-grained trade-off between code size and performance impacts. This is done in order to keep the negative impacts of duplication at a minimum. Why this is necessary is clear to domain experts, but for the general audience we summarize the problems in this chapter and motivate them with detailed examples.

4.1 Problem Statement

Excessive code duplication can lead to exponential code growth [36] and is considered harmful. However, duplication can also improve performance. In order for a compiler to decide how much code-size increase is justified by a single duplication transformation, it needs to be able to trade off the expected performance increase against the inevitable code-size increase. Subsequently, we discuss the remaining questions that define the problem space of code duplication.

Question 1 (Q1): How can code duplication improve performance?

Code duplication can increase the performance of the generated machine code by using the paradigm of optimization via specialization1. Every time a program contains a control flow merge2, the predecessors of the merge contain special semantics guarded via conditional logic. A merge has, by definition, more than one control-flow predecessor. In order for a program to split control flow, it needs to implement conditional logic. Conditional logic implies that there are paths in a program that carry more precise information than others. In practice this means that for every successor of a control flow split, a compiler can safely assume more specialized information for a given piece of code than for the other successors, because the branch is guarded by a condition. Code duplication utilizes this fact to perform optimizations. In order to optimize the code dominated by a control-flow merge, a compiler cannot use the guarded context information of a single predecessor branch; it can only use the union of the information of two or more predecessors.

Consider the example code in Figure 4.1, which shows a trivial program foo. On the right-hand side of the figure we show the knowledge the compiler has about the values of each variable in the program at any point when processing a basic block. In both branches the compiler not only has information about the variable y (because it is assigned first) but also about the variable x because of the conditional statement. In the true branch x has a value in the interval [1, MaxInt] as guarded by the condition, and in the false branch x must be smaller than or equal to 0, i.e., it falls in the interval [MinInt, 0]. The compiler can use the information for both variables in order to optimize the respective branch. However, after the control flow merge the knowledge about y is the union of the data for y across all predecessor branches3, because the compiler does not know which branch is dynamically executed. Thus, the information the compiler can assume about y after the merge is the union of both predecessor value ranges, i.e.,

ValueRange_TrueBranch ∪ ValueRange_FalseBranch ≡ [1, MaxInt] ∪ [0] = [0, MaxInt].

1This paradigm is used by many other optimizations that apply a form of code duplication including for example inlining. 2This also includes loop headers which merge the forward predecessor of a loop and a loop's backedges. 3In SSA form the union of both can only be as precise as the least precise ϕ input.

[Figure 4.1 shows the program foo (int foo(int x) { final int y; if (x > 0) { y = x; } else { y = 0; } return 2 + y; }) together with the compiler's knowledge of the values of x and y at every point: at the entry x ∈ [MinInt, MaxInt] and y is unassigned; in the true branch (y = x) both x and y are in [1, MaxInt]; in the false branch (y = 0) x ∈ [MinInt, 0] and y ∈ [0]; after the merge, at return 2 + y, x ∈ [MinInt, MaxInt] and y ∈ [1, MaxInt] ∪ [0] = [0, MaxInt].]

Figure 4.1: Control flow merge union of information.
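The union at the merge can be illustrated with a small sketch of an interval domain (our own illustration, not Graal's actual stamp implementation):

record Interval(int lo, int hi) {
    // Smallest interval containing both inputs: the join a compiler
    // applies to its value knowledge at a control flow merge.
    Interval join(Interval other) {
        return new Interval(Math.min(lo, other.lo), Math.max(hi, other.hi));
    }
}

// Example from Figure 4.1:
// new Interval(1, Integer.MAX_VALUE).join(new Interval(0, 0))
// yields [0, Integer.MAX_VALUE], the knowledge about y after the merge.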

int foo(int x) {
  final int y;
  if (x > 0) y = x;
  else y = 0;
  return 2 + y;
}

(a) Initial program.

int foo(int x) {
  if (x > 0) {
    return 2 + x;
  } else {
    return 2 + 0;
  }
}

(b) After duplication.

int foo(int x) {
  if (x > 0) {
    return 2 + x;
  } else {
    return 2;
  }
}

(c) After optimization.

Listing 4.1: Constant folding (CF) optimization opportunity.

Code duplication works by utilizing a compiler's knowledge about the values in predecessor branches and the code in the merge block. If instructions from the merge block were moved into one of its predecessor blocks, this might open new optimization potential based on the increased knowledge that is available there. Consider the sample program in Listing 4.1, where it is easy to see how code duplication can help to optimize the code. In the initial program in Listing 4.1a, the variable y has the value 0 in the false branch. The merge block contains an instruction that can be optimized if the compiler can assume y == 0: the instruction 2 + 0. Therefore, if the compiler lifts the return instruction into the predecessor blocks by duplicating it, both additions can be specialized to the value of y in the respective branch (Listing 4.1b), which allows the compiler to apply constant folding, yielding the code in Listing 4.1c.

This leads to the definition of the first observation of code duplication:

Observation 1 Control flow merges are optimization boundaries.

Observation 1 summarizes the root problem that is solved by code duplication. A compiler is always limited in its optimization potential if a program contains control flow merges, because merges always produce unions of the value information for all variables4. In SSA [39] form it is trivial to identify values for which a compiler has different knowledge in predecessor branches, because such variables always produce ϕ functions.

This leads to Observation 2 that summarizes the optimization potential for code duplication. All instructions dominated by a merge are potentially optimizable by code duplication and thus subject to code duplication and its transformations.

Observation 2 Every instruction dominated by a control flow merge is potentially optimizable if, after duplication, it is no longer dominated by the control flow merge but by one of the merge's direct predecessors.

Until now, this thesis discussed code duplication in the context of control flow merges after control flow splits. However, everything said so far also applies to loop headers. Loop headers are conceptually nothing more than control flow merges, i.e., they merge the non-backedge predecessor of a loop with the backedge predecessors. We illustrate this in Figure 4.2, which shows a simple loop program on the left. The body of the loop in block b2 contains a section of code Ω that represents the real computation. In the original program, the loop body block b2 is dominated by the control flow merge, i.e., the loop header b1. However, after unrolling one iteration of the loop5, the original, non-unrolled iteration of the loop in b2 also dominates the unrolled iteration in b1' and b2'.

From a duplication perspective, the optimization opportunities inside the duplicated blocks are important for a compiler. Opportunities are created because there is no basic block boundary between a merge predecessor and a merge after duplication. This is straightforward for classical tail duplication: a merge predecessor block is extended with the merge block (and its dominated blocks, depending on how much code is actually duplicated). The same applies, conceptually, to loop unrolling: unrolling a loop iteration has a similar effect on the dominator tree (from the perspective of optimization opportunities after duplication) as performing a normal code duplication.

4This also applies to the memory effects of a program and not only to the data-flow graph. Compilers like Graal model memory dependencies with explicit edges inside a graph [58], meaning that if there are different consumers for a memory location in different predecessors, Graal generates memory phi functions in control flow merge blocks. Memory phi nodes themselves can have usages that can be optimized if a more precise memory location can be assumed for a memory-effecting instruction. 5See the right side of Figure 4.2.

[Figure 4.2 shows a simple loop program p on the left: block b0 initializes i = 0, the loop header b1 merges the forward predecessor b0 and the backedge from the loop body b2 (which executes Ω and i++), and b3 returns i. On the right, the program after unrolling one iteration is shown together with both dominator trees: the original body b2 now dominates the unrolled header b1' and the unrolled body b2'.]

Figure 4.2: Loop header control flow merge: Unrolling as duplication.

A loop unrolling changes the dominator tree: it extends a merge predecessor (the basic block containing a backedge) with the merge (the loop header block) and its dominated blocks (the loop body again). Therefore, tail duplication and loop unrolling share a similar concept: enable optimizations by extending a merge predecessor basic block with code from the merge block (and potentially code dominated by the merge block). For tail duplication the extended blocks are a merge's direct predecessors, for loop unrolling they are the basic blocks containing the loop backedges.

This leads us to Observation 3 which expresses the optimization opportunities for instructions inside loops via unrolling.

Observation 3 Every instruction inside a loop (trivially dominated by the loop headera) is potentially optimizable if it is not dominated by the loop header but by the backedge predecessor block. In SSA form this trivially means that every instruction that has a loop-ϕ instruction as input is potentially optimizable if it is duplicated to a point where it is dominated by the predecessor, i.e., the body itself in a prior iteration. This can be achieved via loop unrolling.

aSee Chapter 2.
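To substitute for the graphical example in Figure 4.2, the following source-level sketch (our own illustration, not compiler output) shows the effect of unrolling one iteration: the body of iteration n then dominates the body of iteration n + 1, so facts established in one iteration can specialize the next one.

class LoopExample {
    int p(int a) {
        int i = 0;
        while (i < a) { // loop header b1: merge of entry b0 and the backedge
            /* Ω */     // loop body b2
            i++;
        }
        return i;       // b3
    }

    int pUnrolled(int a) {
        int i = 0;
        while (i < a) {
            /* Ω */ // iteration n: now dominates the unrolled iteration below
            i++;
            if (!(i < a)) {
                break;
            }
            /* Ω */ // unrolled iteration n + 1 (b2'), dominated by the code above
            i++;
        }
        return i;
    }
}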

For the rest of this chapter we will simply use the term code duplication. However, we always mean a broader set of duplication optimizations including classical (tail) duplication as well as loop unrolling.

Now that we discussed the reasons for code duplication (Q1), the next question sets the focus on the code-size impact of code duplication.

Question 2 (Q2): What is the code-size impact of a single code duplication transformation?

Every time a compiler duplicates code, it copies instructions and potentially changes the CFG. In general, every duplication transformation increases the code size by the amount of code that was duplicated. However, later transformations might reduce code size to a point where it is less than before the duplication. In order to reason about these transitive effects, a compiler requires complex analysis to determine the real code-size impact. Consider the code in Listing 4.2a, which shows a function k that assigns a value to the variable phi depending on the value of the parameter a. After the first control-flow merge (line 8 in Listing 4.2a), the program tests the value of phi and either returns 0 or 1. This example perfectly illustrates the code-size effects of duplication. After duplicating the entire control-flow diamond from Listing 4.2a (lines 8-12), the compiler produces the code in Listing 4.2b, which is significantly larger than the original program. After removing the trivial conditions via optimization in both branches, the compiler produces the code in Listing 4.2c, which is significantly smaller than the original program from Listing 4.2a6.

 1 int k(int a) {
 2   int phi;
 3   if (a > 0) {
 4     phi = 0;
 5   } else {
 6     phi = 1;
 7   }
 8   if (phi == 0) {
 9     return 0;
10   } else {
11     return 1;
12   }
13 }

(a) Initial program.

 1 int k(int a) {
 2   int phi;
 3   if (a > 0) {
 4     phi = 0;
 5     if (phi == 0) {
 6       return 0;
 7     } else {
 8       return 1;
 9     }
10   } else {
11     phi = 1;
12     if (phi == 0) {
13       return 0;
14     } else {
15       return 1;
16     }
17   }
18 }

(b) After duplication.

 1 int k(int a) {
 2   if (a > 0) {
 3     return 0;
 4   } else {
 5     return 1;
 6   }
 7 }

(c) After optimization.

Listing 4.2: Duplication code-size increase example7.

6Listing 4.2 is a trivial example that perfectly illustrates the code-size effect of a duplication transformation. For illustration purposes the example is presented in source code. However, we cannot easily reason about the real code size of the generated machine code for this example, as it depends on compiler optimizations, run-time support, etc. Note that, depending on the platform and the architecture, the machine code generated for Listing 4.2c can be larger than the machine code generated for Listing 4.2a: code generation for HotSpot [93], for example, needs to emit safepoint checks on every return.

This leads to Observation 4.

Observation 4 The program produced by a single code duplication is always equal in size to or larger than the original program. However, subsequent optimizations can reduce the program size again, to a size that is equal to or smaller than that of the original program.

Observation 4 motivates why reasoning about the code-size impact of a duplication transformation is non-trivial, requiring complex analyses by the compiler.

Therefore, in order to avoid a false-positive classification of the code-size impact of a single duplication transformation, a compiler needs to semantically understand the effect of subsequent optimizations on the code-size dimension of a duplicated program. We formulated Problem Implication 1 which states that a compiler will apply false reasoning if it does not understand the semantic impact of a duplication transformation on the optimization opportunities of a CFG.

Problem Implication 1

If a compiler reasons about the direct code-size impact without subsequent optimizations it always deduces that duplication transformations result in a code-size increase. This can result in a false-positive classification of duplication transformations that produce smaller code than the original program.

As stated before, code-size increase can always be a problem for a compiler, or is at least not a benefit. Other approaches have identified the problems of excessive code duplication and exponential code growth [36] as well. In order to be complete in the problem definition used throughout the rest of this thesis, we now explicitly define the problems of code-size increase in our context.

Question 3 (Q3): What are the negative implications of code-size increase in a dynamic compiler?

In order to answer this question, we use the original example from Chapter 1 shown in Listing 4.3. We slightly modified the original code by replacing concrete instructions with symbolic placeholders for a more generic reasoning about the sample.

7While it can be argued that this is not code programmers typically write, such patterns often occur as a result of inlining [156] or loop unrolling [117].

1 final int phi;
2 if (x > 0) {
3   phi = x;
4 } else {
5   phi = 0;
6 }
7 ... δ ...
8 ... Θ ...

(a) Initial program.

 1 final int phi;
 2 if (x > 0) {
 3   phi = x;
 4   ... δ ...
 5   ... Θ ...
 6 } else {
 7   phi = 0;
 8   ... δ ...
 9   ... Θ ...
10 }

(b) After duplication.

1 if (x > 0) {
2   ... δ ...
3   ... Θ′ ...
4 } else {
5   ... δ ...
6   ... Θ′ ...
7 }

(c) After duplication and optimization.

Listing 4.3: Sample program.

Before we discuss the implications of code-size increase, we look at Listing 4.3 again, this time from a compiler's perspective on the effect chain of the program. We assume that the compiler deduced that it can optimize the code in line 8, Θ, to a faster instruction Θ′ by duplicating the code from the merge block (lines 7-8) into both predecessor branches. This produces the code in Listing 4.3b, which can be further optimized8 to the code in Listing 4.3c. However, between the start of the merge block and Θ there is a section of code dominating Θ called δ. In order to duplicate Θ into the predecessors and to optimize it, the compiler also needs to duplicate δ, as it dominates Θ. If both δ and Θ are side-effecting instructions, the order of their observable effects must not be modified, as specified by the Java Memory Model (JMM) [128]. The compiler must not change the side-effect chain in the merge block via duplication; it has to preserve the semantics of the program. Therefore, the compiler is forced to duplicate δ in order to duplicate Θ. Note that this directly relates to how a compiler represents side effects in its IR and how it moves them during scheduling. If there is no side effect in δ, the compiler can schedule Θ before δ to limit the code-size increase during duplication. For sea-of-nodes-based IRs this is slightly easier, as they can decide it purely based on the effect dependency (edge)9 in their IR. This edge type directly chains together all side-effecting instructions. Deciding what to duplicate for non-side-effecting instructions is more complex, as they might require the duplication of side-effecting instructions in order to be optimizable when they are pulled into the set of duplicated instructions: a floating node10 is only duplicated if it either has a dependency on a fixed node or a transitive usage of a fixed node in the duplicated region of code.

A compiler often needs to duplicate code (δ) that cannot be optimized after duplication in order to optimize a section of code (Θ) which is optimizable after duplication. This is the root cause of the code-size increase problems of code duplication. In general, we refer to Θ as the target instruction for optimization after duplication. Depending on the code size of δ, the compiler might significantly increase the code size of the snippet after duplication just for the sake of optimizing Θ.

8For example, copy propagation can remove the ϕ instruction. 9Note that for Graal IR these are the control-flow edges, also called fixed nodes, which are marked in red in the figures of this thesis. 10See Chapter 3 about floating instructions in Graal IR.
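The δ/Θ constraint described above can be sketched as follows (hypothetical IR types): to duplicate Θ, all fixed (side-effecting) instructions between the merge and Θ — the δ prefix — end up in the duplication set, because their order must be preserved.

import java.util.ArrayList;
import java.util.List;

interface FixedNode {
    FixedNode next(); // successor in the chain of fixed instructions
}

final class DuplicationSet {
    // Collect every fixed instruction from the start of the merge block up to
    // and including theta; these are the instructions that must be duplicated.
    static List<FixedNode> instructionsToDuplicate(FixedNode mergeBlockFirst,
                                                   FixedNode theta) {
        List<FixedNode> result = new ArrayList<>();
        for (FixedNode cur = mergeBlockFirst; cur != null; cur = cur.next()) {
            result.add(cur); // the delta prefix dominating theta
            if (cur == theta) {
                return result;
            }
        }
        throw new IllegalArgumentException("theta not in merge block");
    }
}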

After discussing the reason for the code-size increase of a duplication transformation, we take a look at the implications of code-size increase. The implications of increased code size are:

• Increased compilation time: If a compiler duplicates a lot of code that is never relevant for performance it effectively spends time optimizing and compiling code that is never needed. This is particularly a problem for dynamic compilers that perform compilation at run time. Optimizing this code consumes compilation time, delaying the compilation of subsequent code and causing increased warmup time. We call this compilation-time increase direct compile-time increase as it is a direct product of the duplication transformation itself.

• Increased compilation time of subsequent compiler phases: Many optimization phases are dependent on the size of a program. If duplication increases the number of instructions, this indirectly increases the overall compilation time of all phases. This can increase application latency, as the time until peak performance [9] is reached, the so-called warmup time, is increased. For applications where performance relies on low compilation latency this can mean that overall application performance is reduced due to increased compilation time11. We call this compilation-time increase indirect compile-time increase as it is a product of the subsequent optimizations of the duplication transformation.

• Increased code cache memory usage: The JVM manages compiled code in the code cache12. If the code cache is full, this can lead to reduced compilation frequency, as compiled code must be "garbage collected", or the VM may abort execution13.

• Duplicating control flow can increase the complexity of a control flow graph in a way that subsequent compiler optimizations are inhibited: This is often a problem in real world compilers and indirectly relates to the phase-ordering problem [194]. If duplication changes the CFG, it can prohibit subsequent optimizations, but probably also enable others that are not modeled in advance. If duplication produces code that prohibits further optimization due to missing heuristics, this has to be taken into account.

11We want to mention function-as-a-service (FaaS)-based systems as applications from this category. A FaaS-based system often relies on low compilation time in order to reach good performance. This is the case as a FaaS runtime system often executes code, i.e., functions, that has not yet been seen by the dynamic compiler. In a dynamic compilation system this means that the unseen code has to become hot first before it is compiled. Thus, overall system performance, defined as latency until the result, is affected by compilation time. 12See Chapter 3. 13This depends on the configuration of the JVM. HotSpot's [93] default configuration [31] in Java 8 tries to sweep, i.e., remove compiled methods once the code cache is full and in the worst-case scenario disables compilation completely.

• Maximum compilation unit size: Certain run-time systems have restrictions on the maximum compilation unit size. Reasons for these restrictions are direct and indirect immediate jump patching, code relocation, etc. For example, compilers for HotSpot have a configurable maximum code size14 that must not be exceeded, otherwise the virtual machine (VM) will not install the code for a compilation unit.

• Increased image size: If Graal is deployed with SubstrateVM15, excessive duplication increases the size of the generated binary, which can prohibit embedded deployment on resource-constrained devices.

Given the above list of negative implications of code-size increase, a compiler must decide very precisely what to duplicate and why. It is obliged to know the implications of increasing code size. Therefore, we define Observation 5, which makes a statement about the minimal code-size increase of duplication, which is generally not decidable.

Observation 5 Code duplication always leads to code-size increases. However, determining the minimal code-size increase that is necessary in order to deliver a specific performance increase is infeasible in practice. It requires knowing the complete rest of the compilation pipeline. This includes knowing where the code is scheduled, how the branch predictor behaves, all the information about all callers, all parameter values (function inputs), and many other factors. In practice, a compiler would need a closed-world assumption to answer the question of the minimal possible code size given a required performance outcome, which is not given in a dynamic system like Java [124] that supports dynamic loading of classes at run timea.

a Defining the optimal solution to an optimization problem in the presence of multiple optimization phases is proven to be undecidable for real hardware [175; 194]. Thus, we refrain from trying to achieve an optimal solution and focus on realistic paradigms applicable in a dynamic compiler.

Given that we cannot define the minimal code size that is necessary to reach a certain performance, we use the code-size metrics defined by a given compiler in its default configuration, i.e., we measure code size without duplication and with duplication in order to reason about a duplication-based optimization's impact offline. Code duplication must not increase this number excessively, but in a predictable and parameterizable fashion. Observation 5 leads to the definition of Problem Implication 2.

14Note that this is configurable with the option -XX:JVMCINMethodSizeLimit= and is currently set to 80000 words. 15See Chapter 3.

Problem Implication 2

A compiler performing code-duplication-based optimizations needs to be aware of the code-size increase it causes in order to monitor unwanted side effects introduced via excessive code-size increase. If a compiler does not control the amount of code it creates, it will inevitably run into code explosion issues, at the latest when compiling a pathological source program.

In addition to the questions derived above, we define some smaller implementation-level questions for code duplication that depend on the IR of a given compiler but still have an influence on the code-size increase of the optimization.

• What is the granularity of code duplication? In general, a compiler can decide to duplicate only entire basic blocks or single instructions, where duplicating entire basic blocks may be more complex than duplicating one instruction at a time. A follow-up question is whether a compiler can only duplicate a single merge block or also other post-dominating code.

• Relating to the phase ordering problem [194] it is an interesting engineering problem to decide when in the compilation pipeline code duplication is most useful. This again heavily depends on the IR of a compiler as well as on its general phase plan design.

The above questions and observations mark the important problems as well as their implications for code duplication.

4.1.1 Code Duplication Triangle

To summarize the problem space and success metric dimensions of duplication-based optimizations, we propose the simplified model of the code duplication triangle (as seen in Figure 4.3), summarizing the three major success metrics of a dynamic compiler when it comes to code duplication and the possible solutions to the problem. We borrow the idea of a problem-space triangle, a notion originally proposed for the CAP theorem16 [14], which states that any shared-data system can only have two of three desired properties. The problem space of duplication is similar. Based on the "pick two" notion, it is straightforward to derive a code duplication optimization that optimizes for two success metrics. However, optimizing for all three means that a trade-off is necessary that explicitly sacrifices one dimension for another.

16Consistency, availability, partition tolerance: a theorem used for shared-data systems.

[Figure 4.3 shows the code duplication triangle: the corners are the three success metrics peak performance, code size and compile time; the edges mark the failure modes of optimizing for only two of them: large code, high compilation time, or no speedup.]

Figure 4.3: Code duplication success metric triangle.

Optimizing for two success metrics can be achieved by17:

• Optimizing for peak performance and reducing compile time to the absolute minimum by only performing duplications having a positive impact on peak performance. This reduces compile time to the application of heuristics for finding optimizable candidates. However, code size might grow.

• Optimizing for peak performance while keeping the code-size increase at a minimum by performing compile-time-intensive analysis to find those candidates that potentially improve performance with just a small code-size increase.

• Optimizing for code size18 and compilation time: this can be achieved by not performing any duplications at all. However, this will not result in any performance improvements.

Given Figure 4.3, there is no existing approach that unifies those three success metrics, i.e., a single duplication approach that tries to generate the maximum performance increase at the minimal possible code-size and compile-time increase.

17Note that duplication-based optimizations require some kind of decision function shouldDuplicate(merge) that decides if a piece of code should be duplicated. Even the simplest implementation of shouldDuplicate consumes compilation time. Therefore, when we talk about "optimizing for compile time" we actually mean keeping the compile-time increase to the absolute minimum, i.e., performing only those transformations justified by a sufficient peak performance increase. Reasoning about the minimal code size increase necessary to deliver a specific performance increase is generally not decidable, therefore the lower bound of shouldDuplicate is unknown. 18Note that duplication can enable dead code elimination [7] and can thus be used to actually decrease code size of the generated machine code. See Chapters 5 and 8 for details.

4.2 Solution

In order to satisfy the remaining challenges and to solve all presented problems defined in Section 4.1, a duplication approach is required that:

• Produces a solution comparable in performance to the optimal19 one, i.e., a solution which performs all duplications that significantly improve performance (relates to Q1, O1-3).

• Spends as little time as possible analyzing the program for optimizable patterns (relates to Q2, O4, Q3, I1).

• Performs as few code transformations as possible in order to keep the compilation-time increase and control flow complexity at a minimum (relates to Q3, O5, I2).

• Only performs duplication transformations for which it knows that a performance increase in a success metric is possible (relates to Q1, O1-3, I2).

• Allows a configurable trade-off between code-size and performance increase (relates to I2).

In order to solve all listed problems, we propose simulation-based code duplication, an optimization scheme that allows a compiler to perform only those duplication transformations for which it knows that they lead to a sufficient performance increase without a code-size or compile-time explosion. The approach is based on a three-tier algorithm. Each tier performs analysis and generates a result that is then consumed by the subsequent tier.

1) The simulation tier discovers optimization opportunities after code duplication.

2) The trade-off tier fits those opportunities into an optimization cost-model that tries to maximize peak performance while minimizing code size and compilation time. The outcome is a set of duplication transformations that should be performed as they lead to sufficient peak performance improvements.

3) The optimization tier then performs those duplications together with the subsequent opti- mizations whose potential was detected by the simulation tier.

Figure 4.4 illustrates the overall idea of the approach. Since there is no compile-time-efficient algorithm to analyze the impact of a duplication transformation, it is too costly to analyze the impact of every duplication transformation on the concrete program. Therefore, we propose to reason about duplication impacts on a simulated program.

19Note that deriving the optimal performance for a program on a given hardware is not decidable. However, ideally compiler optimizations produce code that converges towards a (local) optimum.

[Figure 4.4 illustrates the two layers of the approach: on the abstract layer, the compiler builds a simulated program, optimizes it, and captures the optimization potential in n simulation results; the trade-off step sorts and filters these results into the set of beneficial opportunities; on the concrete layer, the program is then optimized based on these results, yielding the optimized program.]

Figure 4.4: Simulation-based code duplication approach.

The compiler can use simulation to reason about the impact of a duplication transformation and to generate a program on an abstract layer that, with respect to the optimization potential after duplication, behaves like the original program. Analyzing the simulated program on the abstract layer is faster than optimizing the concrete program20. Once the compiler has found all optimization opportunities in the simulation step, it can trade them off against each other in a separate step before it uses this information to optimize the concrete program in the final tier of the approach.
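The three tiers can be summarized in the following high-level sketch (all types and method names are hypothetical; the concrete simulation algorithms are the subject of Chapters 6 and 7):

import java.util.ArrayList;
import java.util.List;

abstract class SimulationBasedDuplication {
    void optimize(Graph graph, CostModel costModel) {
        // Tier 1: simulation -- find optimization opportunities per merge.
        List<SimulationResult> results = new ArrayList<>();
        for (Merge merge : graph.merges()) {
            results.add(simulate(merge)); // abstract layer, no IR mutation
        }
        // Tier 2: trade-off -- keep only opportunities whose expected benefit
        // justifies their code-size and compile-time cost.
        List<SimulationResult> beneficial = costModel.sortAndFilter(results);
        // Tier 3: optimization -- duplicate and apply the enabled optimizations.
        for (SimulationResult r : beneficial) {
            r.merge().duplicateIntoPredecessors();
            r.applyEnabledOptimizations();
        }
    }

    abstract SimulationResult simulate(Merge merge);
}

interface Graph { Iterable<Merge> merges(); }
interface Merge { void duplicateIntoPredecessors(); }
interface SimulationResult { Merge merge(); void applyEnabledOptimizations(); }
interface CostModel { List<SimulationResult> sortAndFilter(List<SimulationResult> results); }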

For this simulation-based approach to be successful, it has to generate faster code than heuristic solutions, while ideally producing less code and not being much slower in compilation time. Luckily, code size and compilation time can be optimized together by only duplicating those parts of the program that can really be optimized later. In order to illustrate the compile-time complexity of a duplication-based optimization, we define the parts of the optimization that consume time:

• Optimization Applicability Check: This check determines whether a certain piece of code is optimizable after duplication. We denote this step as AC(m) for a single merge m and AC for all merges, therefore AC = Σ_{∀ merges m} AC(m).

20This depends on the concrete duplication optimization. We propose two concrete implementations of the simulation-based duplication approach, for tail duplication in Chapter 6 and for loop unrolling in Chapter 7. For each, we present algorithms to implement the simulation step efficiently in the respective chapter.

• Duplication: This part of the transformation actually duplicates the instructions into the preceding branches, recomputes SSA form and fixes the CFG of the program in the IR. We denote this step as D(m) for a single merge m and D for all merges, therefore D = Σ_{∀ merges m} D(m).

• Optimization: This part comes after the duplication step for each piece of code and performs all optimizations enabled by the duplication. This step must be done after every duplication, since subsequent optimizations can remove the need for later duplications. For example, consider the code from Listing 4.2 where a subsequent optimization changed the CFG. Such patterns are quite frequent. We denote this step as Opt(m) for a single merge m and Opt for all merges, therefore Opt = Σ_{∀ merges m} Opt(m).

This leads to the definition of the total compilation time for duplication of a given program with m control flow merges:

$CompileTime_{Duplication} = m \cdot AC + m_{optimizable} \cdot (D + Opt)$

The compile-time-intensive parts are $D$ and $Opt$21. In order to reduce the overall compile time of the optimization, $m_{optimizable}$ must be as small as possible, i.e., it should only contain duplication candidates that are definitely optimizable after duplication. This means the optimization applicability checks $AC$ must be as restrictive as possible, only allowing a small number of duplications. Assuming that the compiler would only allow a single duplication, the overall compilation time for this duplication would be:

$m_{optimizable} = 1 \Rightarrow CompileTime_{Duplication} = m \cdot AC + D + Opt$

However, the question remains what the compile-time cost of $AC$ is and how small it can be, given that the minimal compile time and code size for the optimization cannot be decided. If $AC$ were a constant-time check, its overall complexity would be negligible. Yet, to the best of our knowledge, there is no constant-time way of answering the question what the impact of a duplication on the optimization potential of a compilation unit is. This in turn means that reducing the number of identified optimizable merges, denoted by $m_{optimizable}$, only helps in reducing the overall compilation time if $AC$ is sufficiently fast. If $AC$ has the same or higher complexity than $D + Opt$, reducing $m_{optimizable}$ is irrelevant for the overall compilation time, since the complexity of $AC$ dominates the overall complexity of $CompileTime_{Duplication}$.
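To make this argument concrete, consider purely illustrative (not measured) per-merge costs:

\[ m = 10000, \quad AC = 1, \quad D + Opt = 100, \quad m_{optimizable} = 100 \]
\[ CompileTime_{Duplication} = 10000 \cdot 1 + 100 \cdot 100 = 20000 \]

compared to $10000 \cdot 100 = 1000000$ if the compiler duplicated and optimized at every merge. If, however, $AC$ itself cost $100$ per merge, the first term alone would already be $1000000$, and reducing $m_{optimizable}$ would gain nothing.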

21 For example, the duplication algorithm for the SSA-based sea-of-nodes IR in Graal has a worst-case asymptotic complexity of $O(n^2)$ over the number of duplicated nodes.

We classify the quality of $AC$ functions according to whether they are complete, i.e., all optimization capabilities after duplication can be found, and whether they are precise, i.e., all optimizations determined by $AC$ are actually possible after duplication. Heuristics often produce false positives, causing duplications that turn out not to be optimizable after duplication, due to the incomplete model of the real program in the heuristic.

In reality, if $AC$ is implemented heuristically, the size of $m_{optimizable}$ compared to a complete and precise implementation varies heavily. The reasons for that are simple (we denote a heuristic implementation of $AC$ as $AC_h$):

• If $AC_h$ is implemented in a very simplistic way, $m_{optimizable}$ is very small, as the heuristic does not find many optimization candidates. This in turn means that $CompileTime_{Duplication}$ is low. However, peak performance increases might also be low or nonexistent.

• If $AC_h$ is implemented in a more sophisticated way, capturing more possible opportunities, $m_{optimizable}$ can be small or large, depending on the heuristic. However, this implies that the time for $AC_h$ itself is increased, thus consuming more compilation time.

• If $AC_h$ is not complete, $m_{optimizable}$ will always be smaller than with a complete approach, because not all opportunities are found. This means that a heuristic approach will not require the same $CompileTime_{Duplication}$ as a complete approach (assuming that the relation $O(AC_h) < O(D + Opt)$ holds).

If $m_{optimizable}$ is large, the run time of the entire optimization is high. In our proposed simulation scheme, we try to reduce $m_{optimizable}$ as much as possible, with a near-constant-time implementation of $AC$ (making the complexity of $m \cdot AC$ dependent on $m$). We devised a dominance-based algorithm (both for tail duplication and loop unrolling) that allows us to reduce the asymptotic complexity of $AC$ to the number of instructions $i(m_{Block})$ in the merge block22.

In reality, the number of merges for which a compiler should23 optimize code after duplication ($m_{optimizable}$) is orders of magnitude smaller than the overall number of merges ($m$). This means that performing as few duplications as possible ($D(m)$) is always a compile-time improvement, considering that $AC$ is much faster than a duplication with a subsequent optimization: $O(AC) \ll O(D + Opt)$.

22 In reality, our approach estimates performance increases during simulation by analyzing the instructions in the rest of the compilation unit dominated by the merge. This is done as long as sufficient optimization potential is found, given the accumulating code-size increase while iterating the dominated code. 23 We do not say "can" here, as many duplications potentially improve performance but cause harmful code-size increases and should thus not be performed.

What if all merges $m$ can be optimized after duplication? In that case, the compiler cannot save the time $D(m)$ for performing the associated transformations. If every merge is optimizable after duplication, the compile time is $m \cdot (AC + D + Opt)$. However, just because an optimization is possible after duplication does not mean that it justifies the code-size increase induced by the duplication24. Therefore, in addition to implementing $AC$ completely and precisely, a compiler must also be able to reason about the trade-off between performance increase and code-size increase. For pathological cases, our approach allows the compiler to trade off the code-size impact against the performance impact in order to decide if the time $D(m) + Opt(m)$ is justified. The compiler can sort all optimizable merges $m_{optimizable}$ by the estimated performance increase and the lowest code size, and only allow a maximum factor of code-size increase per compilation unit. This effectively limits the maximum code-size increase for all compilation units, including pathological cases. A minimal sketch of such a budget-based selection is shown below.
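All types and names in the following sketch (Candidate, capGrowth, the growth factor) are hypothetical illustrations, not part of the Graal implementation:

import java.util.ArrayList;
import java.util.List;

// Hypothetical duplication candidate: estimated benefit (e.g., saved cycles)
// and cost (estimated code-size increase), as produced by the simulation tier.
record Candidate(double benefit, int cost) { }

final class CodeSizeBudget {
    /**
     * Walks candidates that are already sorted by decreasing benefit and keeps
     * only those that fit into a per-compilation-unit growth budget, e.g.,
     * maxGrowthFactor = 1.5 allows at most 50% estimated code-size increase.
     */
    static List<Candidate> capGrowth(List<Candidate> sortedByBenefit,
                                     int originalSize, double maxGrowthFactor) {
        int budget = (int) (originalSize * (maxGrowthFactor - 1.0));
        List<Candidate> accepted = new ArrayList<>();
        int spent = 0;
        for (Candidate c : sortedByBenefit) {
            if (spent + c.cost() <= budget) { // skip candidates that do not fit
                spent += c.cost();
                accepted.add(c);
            }
        }
        return accepted;
    }
}

Such a cap bounds the code-size increase of every compilation unit, including pathological ones in which nearly all merges are optimizable.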

4.2.1 Finding Optimization Opportunities after Duplication

The two root tasks of every code duplication optimization are finding optimizations after duplication25 and performing the real duplication step, i.e., duplicating all instructions, rewiring data flow and recomputing SSA form26. Both tasks are costly in time, space and code complexity. Therefore, we want to optimize those steps as far as possible. However, the fundamental duplication step cannot be optimized away, simply because at some point in time the compiler really needs to duplicate the code. The step that finds optimization opportunities, on the other hand, is subject to research to this day. For the rest of this thesis we will refer to the task of finding optimization opportunities after duplication as $AC(m, p_i)$ for a given control flow merge $m$ and one single predecessor $p_i$. This means that $AC(m) = \sum_{\forall \text{predecessors } p_i \text{ of } m} AC(m, p_i)$ and therefore $AC = \sum_{\forall \text{merges } m} AC(m) = \sum_{\forall \text{merges } m} \sum_{\forall \text{predecessors } p_i \text{ of } m} AC(m, p_i)$.

The implementation of $AC$ is discussed in the literature, and depending on the target of the duplication optimization, multiple possible implementations of $AC$ exist. Subsequently, we discuss the most common ones, namely heuristics and backtracking-based algorithms, which both suffer from shortcomings. We then propose simulation for implementing $AC$ and finally argue why we believe that it solves the task better than the other approaches.

24 See Problem Implication 2 in Section 4.1 for details. 25 Denoted as applicability check AC from before. 26 Denoted as D above.

4.2.1.1 Heuristics

Implementing the $AC$ step for a single predecessor of a merge, $AC(m, p_i)$, with heuristics is non-trivial and requires domain knowledge about the IR of a compiler and the suitable duplication patterns that occur frequently in a given source language. Algorithm 1 depicts a duplication optimization algorithm using a very simple heuristic for $AC$ (marked in color). It iterates over all merge blocks of a CFG. For each merge, it iterates over all instructions $i$ and checks if any of them has a ϕ-instruction as its input. If so, this opens the possibility for optimization after duplication, because $i$ then has the input of the ϕ-instruction at $p_i$ as input instead of the ϕ. Algorithm 1 is simple and fast. The asymptotic complexity for the entire duplication optimization using this heuristic is $m \cdot AC + m_{optimizable} \cdot (D + Opt)$ where $AC = 1$, yielding $m + m_{optimizable} \cdot (D + Opt)$. However, it also duplicates code that is not optimizable after duplication. Furthermore, it only considers the data flow of the program. In Chapters 6 and 7 we will look at all the specific optimizations enabled by a duplication, including control-flow-based and data-flow-based optimizations. We can conclude that Algorithm 1 is neither precise nor complete. Yet, it might be good enough for compilers where code size and compile time are not considered important.

Algorithm 1: Simple heuristic-based duplication.
Data: ControlFlowGraph cfg
Result: Optimized ControlFlowGraph cfg

for BasicBlock merge in cfg.mergeBlocks() do
    for BasicBlock pred in merge.predecessors() do
        for Instruction i in merge.instructions() do
            for PhiInstruction phi in merge.phis() do
                if i.inputs() contains phi then
                    duplicate(merge, pred);
                end
            end
        end
    end
end

Based on the simple heuristic from Algorithm 1, we derived Algorithm 2, which contains a more complex heuristic-based implementation of $AC$ (marked in color). It iterates over all merge blocks of a program and all predecessor blocks of each merge. For each instruction in the merge block that has a ϕ-instruction of the merge as input, it copies the instruction and replaces the ϕ-input of the copy with the value of the ϕ at the respective branch. It then runs a set of common optimizations on the copied instruction to see if the instruction can be optimized given its new input27. Algorithm 2 is precise, i.e., all optimizations found for the copy of instruction $i$ are definitely applicable after duplication. However, it is still not complete, as it does not cover all optimizations: optimizations that require knowledge about the dominance relation of the program after duplication cannot be covered, since the heuristic has no notion of the impact of the duplication on the dominator tree of the program.

27 For the simplicity of this example we assume all optimizations can be applied at instruction-level granularity.

Algorithm 2: Precise heuristic-based duplication.
Data: ControlFlowGraph cfg
Result: Optimized ControlFlowGraph cfg

bool progressMade ← true;
outer: while progressMade do
    progressMade ← false;
    for BasicBlock merge in cfg.mergeBlocks() do
        for BasicBlock pred in merge.predecessors() do
            for Instruction i in merge.instructions() do
                for PhiInstruction phi in merge.phis() do
                    if i.inputs() contains phi then
                        Instruction copy ← i.copy();
                        copy.replaceInput(phi, phi.valueAtPredecessor(pred));
                        for opt o in {CF, PEA, ReadElim, StrengthRedux} do
                            if o.apply(copy) then
                                duplicate(merge, pred);
                                progressMade ← true;
                                continue outer;
                            end
                        end
                    end
                end
            end
        end
    end
end

However, Algorithm 2 still has an acceptable compile time if we look at the complexity of $AC$. The overall compile time is $m \cdot AC + m_{optimizable} \cdot (D + Opt)$ where $O(AC) = O(Opt)$, i.e., the asymptotic complexity of $AC$ depends on the complexity of the optimizations it applies to the copied instruction.

While Algorithm 2 is already a fairly complex and well-performing heuristic, it still suffers from two major drawbacks:

• It does not reason about the semantics of the program in the presence of the optimizations applied after duplication (Q2-3, O3-4, I1-2). If an optimization after duplication changes the CFG of the program, Algorithm 2 cannot reason about the transitive impact of those transformations. Consider the example from Listing 4.2: after duplication, the compiler can completely optimize away the condition in both branches by subsequent optimizations. These transitive effects cannot be modeled with the heuristic from Algorithm 2.

• It does not reason about the code-size increase (or subsequent decrease, as Listing 4.2 illustrated) after duplication. Thus, it does not allow a trade-off between code-size increase and performance increase, permitting duplication transformations that explode code size (Q3, O4, I2).

4.2.1.2 Backtracking

One possible way of determining what to duplicate is to tentatively perform a duplication at a merge and then backtrack if no improvement was possible. Consider the illustration of a backtracking-based approach in Algorithm 3. To determine if progress was made, the compiler first copies the control-flow graph (CFG) as a backup and performs a duplication on the original CFG. Then it applies a set of selected optimizations on it. If any of those optimizations is able to optimize the IR, it restarts with the changed CFG. If no improvement was made, the compiler backtracks to the copied (original) CFG and advances to another merge.

Backtracking is as complete and precise as a duplication-based optimization can be. After duplication, it performs all optimizations potentially enabled by a duplication. Additionally, after optimization, the approach has full knowledge of the code-size impact and the performance impact. However, its compile-time impact is enormous. Given that a backtracking-based solution28 does not implement a real $AC(m, p_i)$ but a holistic transformation, the compilation time is $O(Duplication) = m \cdot (D + Opt) + m_{notOptimizable} \cdot BT$, where $BT$ marks the time needed for backtracking.

Backtracking has three disadvantages:

• Copying the CFG for every predecessor-merge pair is compile-time intensive and thus typically not suited for a dynamic compiler29.

• The compile-time impact of a duplication is not known in advance.

28 We marked the part of the algorithm that decides if a duplication was beneficial in color. 29 We did experiments in our implementation in Graal: the copy operation increased compilation time by a factor of 10. The main problem of the copy operation is that we need to copy the entire IR and not only the portions that are relevant for duplication. This is the case because we do not know which parts of the IR are changed by subsequent optimizations. The compile-time cost of backtracking in the IR could be reduced by, e.g., the application of transactional IRs that support shallow operations, only rewiring data flow if necessary. We consider this interesting for future work and reflect on it in Chapter 10.

• Large compilation units (>100k instructions) contain thousands of control-flow merges. Applying optimization phases30 after every duplication to determine if an optimization triggered is therefore not feasible. Even though many optimization opportunities31 can be computed in (nearly) linear time over the number of instructions in the IR, processing the full IR every time the compiler performs a duplication is not acceptable for JIT compilation.

Algorithm 3: Backtracking-based duplication.
Data: ControlFlowGraph cfg
Result: Optimized ControlFlowGraph cfg

bool progressMade ← true;
outer: while progressMade do
    progressMade ← false;
    for BasicBlock b in cfg.mergeBlocks() do
        bool change ← false;
        ControlFlowGraph copy ← cfg.copy();
        for BasicBlock p in b.predecessors() do
            duplicate(cfg, p, b);
            for opt o in {CE, CF, PEA, ReadElim, StrengthRedux} do
                change ← change or o.apply(cfg);
            end
        end
        if change then
            /* The CFG and basic block list changed, thus we need to restart. */
            progressMade ← true;
            continue outer;
        else
            /* Backtrack and advance one merge. */
            cfg ← copy;
        end
    end
end

4.2.1.3 Simulation

We argue that simulation-based approaches do not suffer from the problems of backtracking-based or heuristic-based approaches. The main idea of simulation-based duplication is to determine the impact of a duplication (in terms of optimization potential on the whole compilation unit) before performing any code transformation. This allows a compiler to only perform those transformations that promise a sufficient peak performance increase (benefit)32. The main requirement for this to be practical is that simulating a duplication is sufficiently less complex in compilation time than performing the actual transformation. Duplication simulation should avoid the complex part of the duplication transformation, maintaining the semantic correctness of all data dependencies, while still allowing valid decisions about the optimization potential of the involved instructions after duplication, i.e., being precise and complete.

30 We solve this problem with partial optimizations, i.e., optimizations on the simulated program. We present the idea behind such transformations in Chapter 6. 31 The detailed optimization opportunities are presented for tail duplication in Chapter 6 and for loop unrolling in Chapter 7. 32 See Q1, Q2 for details.

Algorithm 4 outlines the basic idea. Before performing any duplication, the compiler simulates each duplication and performs partial optimizations on the simulation result. Those optimizations are local to the merge and the predecessor blocks of the merge and are thus faster than a full optimization over the entire IR33. For each partial optimization, the compiler saves the optimization potential34. In addition, it stores an estimated code-size increase for the duplication. The compiler performs the simulation for each predecessor-merge pair and stores all of the results in a list. The results are sorted by descending benefit (optimization potential, i.e., expected performance increase) and ascending cost (code-size increase), so that the most promising candidates are optimized first. This is important in case not all candidates are duplicated due to code-size restrictions. The compiler then iterates over each simulation result and decides if the related transformation is worth it.

Algorithm 4: Simulation-based duplication.
Data: ControlFlowGraph cfg
Result: Optimized ControlFlowGraph cfg

simResults ← [];
for BasicBlock b in cfg.mergeBlocks() do
    for BasicBlock p in b.predecessors() do
        simCFG ← CFG after simulated duplication of b into p;
        simResult ← result of partial opt (CE, CF, ...) applied to simCFG;
        simResults.add(simResult);
    end
end
sort simResults ascending by cost;
sort simResults descending by benefit;
for SimResult s in simResults do
    if s.worthDuplicating() then
        duplicate(s.merge, s.predecessor);
        for opt o in {CE, CF, PEA, ReadElim, StrengthRedux} do
            o.apply(s.predecessor);
        end
    end
end
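The two consecutive stable sorts in Algorithm 4 produce an ordering by descending benefit with ties broken by ascending cost. The following sketch expresses the same ordering as a single two-key comparison; the SimResult record and its accessors are hypothetical stand-ins for the algorithm's data structure:

import java.util.Comparator;
import java.util.List;

// Hypothetical simulation result carrying the estimated benefit and cost.
record SimResult(double benefit, double cost) { }

final class SimResultOrder {
    // Equivalent to Algorithm 4's "sort ascending by cost" followed by a
    // stable "sort descending by benefit": primary key benefit (descending),
    // secondary key cost (ascending).
    static void order(List<SimResult> simResults) {
        simResults.sort(Comparator.comparingDouble(SimResult::benefit).reversed()
                                  .thenComparingDouble(SimResult::cost));
    }
}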

4.2.1.4 Comparison

In order to illustrate the capabilities of the different implementation possibilities of the AC step we compare them in Table 4.1 based on the following properties:

33 These partial optimizations have the same effect as global ones, as the CFG of the program after duplication only changes for blocks dominated by the merge, but not for dominating ones. The only exceptions are duplications inside loops that can impact loop backedges; these need to be handled specially, see Chapter 7 for details. 34 See Chapter 6.

• Completeness with respect to the number of supported optimizations.

• Preciseness with respect to the real applicability of the opportunities found during the AC step.

• Estimated run-time complexity.

• Extensibility to new optimizations.

• Estimation of code size increase.

Approach     | Complete | Precise | Complexity AC‡                         | Extensible$ | Code Size Estimation
Heuristics   | ✗        | ✗       | $O(AC) = O(m)$                         | ✗           | ✗
Backtracking | ✓        | ✓       | $O(AC) = O(m \cdot (D + Opt))$         | ✓           | ✓
Simulation   | ✓        | ✓       | $O(AC) = O(m \cdot (S + Opt_{sim}))$†  | ✓           | ✓

‡ With respect to the number of merges $m$ in the program. $ With respect to adding more optimizations in the future. † Where $S$ denotes the time needed to simulate a duplication and $Opt_{sim}$ the time for optimizing the simulation results with partial optimizations.

Table 4.1: Comparison of approaches to derive optimization capabilities.

Table 4.1 shows that simulation has similar characteristics to backtracking, while maintaining a less cost-intensive AC implementation.

In the rest of this chapter we summarize additional necessities in order to support a fine-grained performance vs. code-size trade-off. In Chapters 6 and 7 we introduce dominance-based duplication simulation and fast-path unrolling of non-counted loops as the two optimizations we implemented based on the three-tier simulation-based duplication scheme.

4.3 Necessities: Cost Model

Up to now, we presented algorithms and techniques for finding optimization opportunities after code duplication, i.e., possible implementations of the $AC(m, p_i)$ step. We presented them as binary analyses, e.g., after duplication, a constant folding is possible (true) or not (false). However, in order to perform a fine-grained performance vs. code-size trade-off, a duplication optimization needs to implement logic that is smarter than binary decisions.

Optimizations, although applying them is a binary decision, carry a continuous performance impact and code-size effect. For example, performing a simple constant folding of the addition a + 0 to the variable a is a binary decision from the compiler’s perspective, i.e., either perform it or not.

Yet, from the performance perspective, the reduced run time of the program, stemming from the fact that the generated code has to perform one instruction fewer, is a continuous value. Assuming a and the constant 0 are both in registers, a typical addition operation on x86 takes about 1 CPU cycle of latency [68; 118].

Since duplication often requires copying the instructions dominating the target instruction, the code-size impact varies for each single duplication transformation. Therefore, we refer to the performance increase of a single duplication transformation as its benefit, for example measured in reduced CPU cycles of latency, and to the code-size increase as its cost, typically measured in additional machine code size. In order for the compiler to apply a proper trade-off between cost and benefit in a duplication-based optimization, we propose the introduction of a cost model for code duplication. This cost model can be used in the trade-off tier of the simulation-based duplication approach. The cost model must allow a compiler to make precise estimations of code-size increase and performance increase. Furthermore, it has to allow the compiler to trade off different duplication candidates against each other, selecting the most promising ones for duplication.

In order to motivate the need for a proper cost model that allows a compiler to trade off between different success metrics, we show a sample program in Figure 4.5. The program contains a control flow merge with three predecessors, followed by a merge with two predecessors. Given these predecessors, there are 5 possible ways to duplicate code into a predecessor in this simple program. However, some duplications are better than others and require the compiler to track code-size metrics in order to make a useful decision. We plotted two possible duplications: the duplication of the code dominated by the first merge into the first predecessor and into the second predecessor35. Both options have optimization potential after duplication, as marked by the red colored source code. When duplicating into the first predecessor, the addition x + y can be optimized to 0 since y == -x. When duplicating into the second predecessor, the statement s2 = x + y; can be optimized to a plain assignment s2 = x; and one branch of the control flow diamond folds away completely. When duplicating into the first predecessor, code size increases significantly, as the if does not fold. However, when duplicating into the second predecessor, only 2 new instructions are effectively needed. Duplicating into the first predecessor might not be a problem in terms of code-size increase for such a small program. Yet, considering a full program scope with thousands of functions36 and thousands of control flow merges, such a code-size increase would be questionable. Therefore, a proper cost model allowing the compiler to trade off the estimated performance increase against the estimated code-size increase is essential for a good performance of a code duplication optimization.

35 Note that duplication on the IR level will not produce a control flow merge for the second predecessor after duplication, but will automatically generate a return. Therefore, we added a return instruction on the source level. 36 For example, the Graal compiler performs around 80000 compilations across all Java DaCapo [11] and ScalaBench [176] benchmarks [60; 61; 63].

[Figure 4.5 shows the source program foo(int x, int z), which merges three control flow paths before the statements assigning s1 and s2, together with the two highlighted duplication variants: duplicating the merge code into the first predecessor (where x + y folds to 0 because y == -x) and into the second predecessor (where s2 = x + y becomes s2 = x and one branch of the diamond folds away), each followed by its optimized form.]

Figure 4.5: Sample program duplication cost model motivation.


Chapter 5

Node Cost Model

In this chapter we propose the node cost model, a novel cost model for optimizations on the graph-based IR of the Graal compiler. We show that it allows a compiler like Graal to perform fine-grained trade-offs between the code size and the performance increase of a transformation.

Duplication-based optimizations, while having a positive impact on one aspect (e.g., run-time performance), can have a negative impact on another (e.g., code size) [120; 157] (Chapter 4). Therefore, a compiler needs to find a good compromise between benefits, such as performance, and negative effects, such as code size. In order to estimate any effect on the quality of the produced code, duplication-based optimizations typically apply abstract reasoning on models of a compilation unit in order to estimate a transformation’s benefits and costs [7]. Such reasoning is typically hand-crafted and highly specific to one particular optimization1. Thus, state-of-the-art compilers often contain multiple cost models (up to one for each optimization) applying specific trade-off functions. Changes to the structure of the compiler’s IR or the ordering of optimizations, as well as new optimizations and instruction types, are rarely accounted for correctly in all trade-off functions, leading to unnecessary misclassifications of transformations. Additionally, over time this increases maintenance costs.

1 For example, the inlining heuristics of the HotSpot [93] server compiler [153] in http://hg.openjdk.java.net/jdk10/hs/file/d85284ccd1bd/src/hotspot/share/opto/bytecodeInfo.cpp, see InlineTree::should_inline.

5.1 Problems of Existing Cost Models

In compilers for low-level languages, performance prediction models have been devised that precisely estimate a program’s performance at run time [172; 198]2. In theory, those models could be used to guide dynamic compilers in their optimizations’ trade-off decisions. However, they cannot be applied to dynamic compilation of high-level languages for several reasons, including:

• Dynamic compilers for high-level languages apply optimizations on abstraction levels where no notion of the architecture and platform is present in a compiler’s intermediate representation. This prohibits the compiler from using precise performance prediction approaches that model all hardware and software peculiarities of a system.

• High-level optimizations are typically executed early in the compilation pipeline, and the compiler can still significantly change the code by applying various high-level compiler optimizations [194] afterwards. This makes it difficult to predict the performance of the generated code.

• Precise performance prediction is often not suitable for dynamic compilation as it is run-time-intensive.

Consequently, we believe that precise performance models [198] are not suitable for use in high-level compiler optimizations. Optimizations often require cost models that allow them to compare IR instructions against each other on an abstract, relative level, without reasoning about absolute code-size or run-time metrics (e.g., cycles). Yet, such models should enable an optimization to make valid assumptions about the relative run time of two code fragments.

5.2 Cost Model Requirements

Graal performs many optimizations while a program is still in a platform- and architecture-agnostic, high-level representation. Optimizations at this level leverage knowledge about the semantics of the JVM specification. Prominent examples of such optimizations are inlining [157], partial escape analysis [185] and dominance-based duplication simulation [120]. While high-level optimizations can significantly improve program performance, they often lack knowledge of how a given piece of IR is eventually translated to machine code, especially when taking into account how subsequent compiler optimizations could transform it. Thus, dynamic compilers, including Graal, typically resort to heuristics when making optimization decisions. Since code duplication can lead to unintended code growth, the compiler requires a proper cost model that enables it to perform deterministic estimations of code size on the IR level, correlating with the final machine code size, in order to control code-size expansion. We summarize these needs as Problem 1 (P1).

2 LLVM [111] uses instruction costs in its vectorization cost model, https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Analysis/TargetTransformInfo.h, line 134.

Metric: Bytecode Size
Compilers using this metric: A metric used by many compilers, mostly for inlining [6]. In HotSpot, the C1 [109] and C2 [153] compilers are using it.
Disadvantages: Bytecode is not optimized. This means that debugging code, assertions, logging, etc., as well as redundant computations that are easily eliminated with global value numbering [7], also contribute to code size. This can result in a misclassification of code size.

Metric: IR Node Count
Compilers using this metric: The C1, C2 and Graal compiler use(d) this metric.
Disadvantages: IRs typically contain meta instructions / nodes [29; 55; 109] that are needed by the compiler to annotate certain instructions with additional information. However, such nodes often do not result in generated code, for example Graal’s nodes for partial escape analysis [185], which are required to represent virtualized objects. They are only needed to support re-materialization during deoptimization. Additionally, there are large differences in the number of machine code bytes generated for different types of IR nodes. For example, a field access typically lowers to a memory read of 4 byte size, whereas the invocation from Listing 5.1 can take up to 80 bytes of machine code. Notably, both operations are represented with a single IR node and thus count as 1.

Table 5.1: Compiler code size quantification.

P1 Machine Code Size Prediction
High-level, platform- and architecture-agnostic compiler optimizations cannot reason about the system-specific details that are necessary to predict which code will be generated for high-level language constructs (a). Additionally, optimizations cannot infer the final machine code, and thus the size of a compilation unit, if there are other optimizations executed in between.

(a) For example, see the code generated for the call instruction in Listing 5.1.

interface F { int foo(); }
abstract class C implements F { int result; }

void foo(C c, int a) {
    int r;
    if (c.result > 0) {
        r = a;
    } else {
        r = c.foo();
    }
    c.result = r;
}

Listing 5.1: Trivial Java program.

Compilers use different static and dynamic metrics extracted from a compilation unit to estimate code size. We summarize the metrics that are most common in current compilers in Table 5.1. Prior to our work, the predominant metric used in the Graal compiler was IR node count, which was often sufficient. However, as outlined in Table 5.1 and verified by structured experiments, it can also lead to a misclassification of the code size of a method, mainly because of nodes that expand to multiple machine instructions and because of compiler intrinsics. Intrinsics are code snippets which the compiler inserts to efficiently implement known methods of the standard library. For example, Graal uses intrinsics for array operations. They are represented by a node in the IR that later expands to the real operation, consisting of a large number of instructions inlined into the IR, or even to a call to a stub of code implementing the intrinsic.3 Thus, we formulate Problem 2 (P2).

P2 High-Level Instructions

IR nodes capturing abstract language semantics and compiler intrinsics can expand to multiple instructions during code generation. Thus, they require special treatment when classifying code size in a compiler.

In addition to the above problems, we also consider structural canonicalizations as a source of misclassification, i.e., transformations that are applied by a compiler to bring a program into a canonical4 form. Many compilers, including Graal, combine real optimizations such as constant folding [7] with structural canonicalizations such as enforcing a certain order of constant inputs on expression nodes. If compiler optimizations (i.e., enabling optimizations such as duplication) base their optimization decisions on the optimization potential of other optimizations, and the potential is of a structural nature, this can cause a misclassification of the optimization potential. For example, an optimization might perform a transformation (enabling a structural canonicalization) that will not cause a performance benefit. This requires optimizations to understand which transformations are performance-increasing and which are not (Problem 3 (P3)).

P3 Performance Estimation
Compiler optimizations typically cannot infer the performance increase of a particular transformation.

For the purpose of code duplication in the Graal compiler, our cost model has to be applicable to dynamic compilation, i.e., the worst-case asymptotic complexity should be at most linear over the number of nodes.

3 For example, in Graal, every method call to Arrays.equals([] a, [] b) in compiled code is intrinsified by the compiler with a special IR node expanding to more than 64 machine instructions (depending on the used CPU features). 4 Canonical in the sense that it is understood by all optimizations.

5.3 Node Cost Model

In order to solve the presented problems, we propose the introduction of an IR cost model for the Graal compiler, called the node cost model, that specifies costs for IR nodes. These costs are abstract metrics for code size (NodeSize) and operation latency (NodeCycles). Costs are specified together with the definition of an IR node.

NodeSize is a relative metric that we define as the average number of generated instructions on an abstract architecture after lowering the operation to machine code. We base the definition roughly on a double-word instruction format (32 bit) without modeling different architectures. For example, our assumption is that an addition always generates one machine code instruction, irrespective of the abstraction level. The addition operation is our baseline; we specify other operations relative to it. We deliberately keep the definition loose in order to avoid overspecification. For high-level optimizations, it is not important to have byte or word precision.

NodeCycles is an estimation of the latency of an instruction relative to an addition operation. We assume the latency of an addition to be one cycle on every platform and specify all other instruction latencies relative to it. The same rules as for the NodeSize metric apply. If a high-level operation (e.g., a field access) can expand to multiple instructions, we assume the most likely path through the compilation pipeline and specify the cycles metric as the average latency of the operation. We gathered the values for the latencies of instructions relative to the addition from several sources, mostly by empirical evaluation and by instruction classification approaches [68].

5.3.1 Code-Size Estimation

The cost model can be used to solve P1 from Section 5.2 by performing code-size estimations in optimizations. The compiler can use these estimates in duplication-based optimizations to trade off between several optimization candidates. Consider Figure 5.1, which shows a simple Java method with the associated Graal IR together with the resulting x86-64 Intel assembly code [101]. The compiler overestimates the NodeSize for the method writeField to be 8 machine instructions for two reasons:

1) It pessimistically assumes that the parameters reside on the stack and therefore calculates the costs of a stack access; however, in this case the parameters can be accessed in registers due to the calling convention.

2) The compiler assumes that an If node generates 2 jump instructions, namely one from the true branch to the merge and one from the comparison to the false branch; however, if both branches end in control flow sinks and never merge again, one jump is saved.

class Clazz { int field; }
void writeField(Clazz c, int a) {
    if (c != null) c.field = a;
}

[Figure 5.1 shows the Graal IR of writeField with per-node NodeSize values (Start, Param0, Param1, IsNull, If, StoreField and two Returns), summing to a size estimation of 8, next to the generated machine code with a real size of 5 instructions:]

test rsi, rsi
jz   L1
mov  [rsi + 12], edx
ret
L1: ret

Figure 5.1: Code-size estimation: red numbers represent NodeSize values. HotSpot-specific prologue and epilogue instructions are removed for simplicity.
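A minimal sketch of such an estimation, assuming a hypothetical node interface that exposes the NodeSize metric (the actual Graal API differs):

import java.util.List;

// Hypothetical minimal IR node interface exposing the NodeSize metric.
interface SizedNode {
    int nodeSize(); // abstract instruction count; 0 for IGNORED nodes
}

final class CodeSizeEstimator {
    // Estimates the code size of a compilation unit by summing the NodeSize
    // values of all IR nodes. The result is a relative estimate in abstract
    // instructions, not in machine code bytes.
    static int estimateSize(List<SizedNode> nodes) {
        int size = 0;
        for (SizedNode n : nodes) {
            size += n.nodeSize();
        }
        return size;
    }
}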

5.3.2 Relative Performance Prediction

The second major use case of the model is the relative performance prediction of optimization candidates, solving P3 from Section 5.2. The node cost model can be used to compare two pieces of code based on their relative performance estimation. This need is motivated by our overall work on code duplication. Many optimization patterns are hard-coded in compilers; this includes optimizations like algebraic simplifications and strength reductions. However, simplifications do not guarantee peak performance increases. Therefore, comparing two pieces of code by summing up their relative performance predictions, normalized to their profiled execution frequencies, allows compilers to perform better performance trade-offs.
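As a sketch, such a comparison can sum NodeCycles values weighted by profiled execution frequency; the interface and names here are hypothetical, not the Graal API:

import java.util.List;

// Hypothetical IR node interface exposing the NodeCycles metric.
interface CycledNode {
    double nodeCycles();      // relative latency; an addition counts as 1
    boolean latencyUnknown(); // e.g., calls, whose latency is opaque
}

final class RelativePerformance {
    // Sums the relative latencies of a block's nodes, weighted by the block's
    // profiled execution frequency. Nodes with unknown latency contribute
    // nothing, so they never create optimization potential on their own.
    static double weightedCycles(List<CycledNode> block, double frequency) {
        double cycles = 0;
        for (CycledNode n : block) {
            if (!n.latencyUnknown()) {
                cycles += n.nodeCycles();
            }
        }
        return cycles * frequency;
    }
}

Two duplication candidates can then be compared by the difference between their weighted cycle estimates before and after the simulated optimization.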

[Figure 5.2 shows Graal IR for a merge with three predecessors (Pred[0], Pred[1], Pred[2]) and a ϕ with the inputs Const 3331, Const 0 and Param1; the ϕ feeds an Add with Param0, followed by a Store and a Return, each annotated with its NodeCycles value. The right-hand side shows the three possible add operations after duplication into each predecessor.]

Figure 5.2: Cycles estimation: Blue numbers represent NodeCycles values.

Consider the example in Figure 5.2, which shows Graal IR for a control flow merge with three predecessors. In SSA form, a variable is only assigned once; to model multiple possible assignments, ϕ functions are used. Figure 5.2 contains one ϕ at the merge with 3 inputs. Code duplication tries to determine how the code on each path looks after duplication, in order to find the best candidate for optimizations. If the code is duplicated into pred[0], the resulting add operation with a constant and a parameter input cannot be optimized further. Yet, Graal would still swap the two inputs to have a canonical representation of arithmetic operations for global value numbering. This is not a performance-increasing optimization, but a structural canonicalization5. Graal combines such transformations with actual optimizations, making it impractical for an optimization to know the difference if it is not hard-coded, unless the optimization resorts to using the node cost model to calculate the relative performance improvement of a transformation. In this case, the structural canonicalization will not result in a performance gain. Duplicating into pred[1], however, would lead to an addition with 0 that can be folded away completely. This requires the optimization to understand what the real benefit of a transformation is in terms of performance. Using the node cost model, the compiler can precisely see that only the duplication into pred[1] can generate a real improvement in performance. This is important as the store and return instructions are duplicated as well, which increases the code size. The compiler has to trade off the benefit against the code size. In this example, only the duplication into pred[1] generates a sufficient improvement to justify the increase in code size.

5.3.3 Discussion

The node cost model is not a precise performance prediction framework. We do not model pipelined architectures, because we have no notion of the concrete hardware during our high-level optimizations. Therefore, we also make no distinction between throughput and latency. We avoid modeling out-of-order processors, cache misses, data misalignments and exceptions in our cost model. We assume a single-threaded, in-order execution environment with perfectly cached memory accesses. This is desired, since Graal is a multi-platform compiler, and modeling all possible kinds of hardware with all their peculiarities would involve too much implementation and compile-time effort.6 We assume that a precise estimation is not necessary to support compiler optimizations in their trade-off decisions. This claim is verified in Chapter 8. For code-size specifications, we also explicitly refrain from modeling debug information that is emitted by the code generator. Some instructions, for example implicit null checks, require the compiler to generate debug information for the VM during code generation. We also ignore register allocation effects by assuming an infinite number of virtual registers.

5 See P3 from Section 5.2. 6 Our implementation allows for architecture-specific NodeSize and NodeCycles values if a particular architecture has wildly different characteristics. However, we have not implemented architecture-specific characteristics so far.

5.4 Implementation Aspects

We specified NodeSize and NodeCycles for over 430 nodes in the Graal compiler. Graal IR nodes are implemented as extensible Java classes. Classes representing nodes in the IR must be annotated with a special NodeInfo annotation, to which we added additional fields for an abstract estimation of code size and run-time latency7.

public @interface NodeInfo {
    NodeCycles cycles() default CYCLES_UNSET;
    NodeSize size() default SIZE_UNSET;
}

Listing 5.2: Excerpt of the NodeInfo annotation.
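For illustration, a node class then declares its costs directly in the annotation. The enum constants follow the naming scheme of the excerpt above, but the node class itself is a simplified sketch, not the actual Graal source:

// Simplified sketch of a field-load node declaring its abstract costs
// (cf. Table 5.2: LoadField has NodeCycles 2 and NodeSize 1).
@NodeInfo(cycles = NodeCycles.CYCLES_2, size = NodeSize.SIZE_1)
final class LoadFieldNode extends FixedNode {
    // field access state and lowering logic elided for brevity
}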

[Figure 5.3 shows two histograms, Cycles and Size, of the percentage of nodes assigned each value (0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 1024) as well as the special values IGNORED and UNKNOWN.]

Figure 5.3: Node cost model value distributions.

Figure 5.3 shows the distribution of the NodeSize and NodeCycles values in the Graal compiler. We can see that a significant number of nodes are ignored in the cost model, as they do not generate code or do not consume CPU cycles8. Those nodes would be a source of misclassification for optimizations, as they are noise to the real value distributions of the code. Additionally, we solved P2 from Section 5.2 by also annotating IR nodes representing intrinsic operations with a code-size and latency estimation (if possible). However, two problems remain for the cost model properties. The first problem is that there are nodes for which we cannot make a static estimation. For example, Graal supports grouping allocations of multiple objects together. This is expressed as a group allocation node whose latency depends on the number of allocations and the allocation sizes; thus, the latency has to be inferred dynamically9. The second problem is that there are nodes for which we simply cannot make an estimation, because their code size / latency is completely unknown. There are two kinds of nodes for which we cannot predict latency at compile time:

7 See Listing 5.2. 8 They are assigned the value IGNORED. 9 This is implemented by modeling costs as dynamically dispatched getter functions in derived classes.

Call Nodes The latency of invocations is opaque to the compiler. The compiler would need to parse every possible callable method to estimate its run time. Thus, we ignore the latency of calls in the cost model10. We annotated call nodes with estimations for their NodeSize, but their run-time latency is set to UNKNOWN.

Deoptimization Nodes Deoptimization [92] causes the program to continue execution in the interpreter and is thus out of scope of any performance prediction.

The concept of unknown values is a potential threat of misclassification of code. If optimizations make decisions based on control paths containing UNKNOWN nodes, those decisions are potentially based on a misclassification of size or latency. However, in practice such cases do not happen often. Deoptimizations are rare and are treated as control flow sinks11 by all optimizations (i.e., a path leading to a deoptimization should not be optimized aggressively), and the compiler has an estimation for the code size of a deoptimization. Invokes are similar: the compiler can infer the generated code for an invocation, but the latency is unknown. Invokes themselves can have arbitrary side effects, so they can never be removed by the compiler. The latency estimation for a caller remains the same if we exclude the latency of the callee from the computation: the latency of a callee can be treated as a constant scaling the latency of the caller and is thus irrelevant for the performance estimation of the caller. The compiler uses the latency estimations in order to perform optimizations. Since it cannot estimate the latency of a callee, the callee’s actual run time is ignored in trade-off calculations. However, this does not have an impact on the calculated optimization potential for the caller, as the caller’s run time stays the same with or without the callee.

Node               | NodeCycles | NodeSize
AddNode            | 1          | 1
MulNode            | 2          | 1
LoadField          | 2          | 1
DivNode            | 32         | 1
Call (Static Call) | Unknown    | 2

Table 5.2: Important node cost model nodes.

Table 5.2 shows some important nodes in the Graal compiler together with their cost model values. The addition node is the baseline for the rest of the instructions. For example, a multiplication usually needs twice the latency of an addition. LoadField nodes access memory, and thus we estimate the time needed to mov a value from cached memory to a register. A field access in Java also requires a null check [128]. This check can be done implicitly or explicitly. However, after bytecode parsing the compiler does not yet know if it can emit an implicit null check, thus it estimates the latency to be 2 cycles: an explicit null check typically requires 1 instruction and 1 cycle. Integer division (div) is an instruction whose latency depends on the size of the operands and the value ranges, thus we take the worst-case latency estimation of the instruction as a reference value. Static calls spend an UNKNOWN amount of cycles in the callee, but typically take up to two machine instructions of code size.

10 This is correct as the latency metric is only used in optimizations. If calls are excluded from trade-off functions, the compiler never performs optimizations in favor of optimizing call instructions. In all cases, the latency of a call is ignored, never resulting in any optimization potential when doing trade-off calculations. 11 Nodes in the IR that end a particular control path.

We performed a detailed evaluation of the cost model and present it together with the overall performance evaluation of our simulation-based duplication approach in Chapter 8.

Chapter 6

Dominance-Based Duplication Simulation

This chapter presents DBDS, an application of the simulation-based duplication scheme for tail duplication. We first present optimization opportunities after duplication. Then we introduce our approach to find those opportunities, before we discuss the application in the Graal compiler and implementation aspects.

Duplication-based optimizations work under the premise of optimization via specialization. As discussed in Chapter 4, the general idea of code duplication is to optimize programs by duplicating code from merge blocks into predecessor blocks, where it can then be optimized for the types and values used in the predecessor blocks.

In this chapter we present our main application of the simulation-based duplication scheme: classical (tail-)duplication of control flow merges1. For implementing the applicability step $AC(m, p_i)$ in a simulation-based fashion, we present a novel algorithm to simulate the effects of a duplication on the dominator tree of a program, called dominance-based duplication simulation (DBDS).

6.1 Optimization Opportunities after Duplication

Before introducing our approach for code duplication at regular control flow merges, we present a complete list of optimization opportunities after duplication. The following sections discuss basic optimization opportunities enabled by a code duplication (denoted Opt in Chapter 4). We present each optimization opportunity in simplified source code as well as in Graal IR2, the IR of our implementation platform.

1 See Chapter 4 for details. 2 For the definition of the IR we refer to Chapter 2.

6.1.1 Canonicalizations

The Graal compiler groups simple optimizations, including constant folding, strength reduction [7] and algebraic simplifications [17], under the term canonicalizations3. Canonicalization optimizations (called canonicalizations from now on) typically work on a single instruction (or IR node) and only require knowledge about the direct data-flow or control-flow inputs of a node in order to optimize it. The common canonicalizations enabled by a duplication stem from the fact that data-flow inputs pointing to ϕ functions are replaced by the ϕ’s input at the respective predecessor.

int canon(int x) {
  final int phi;
  if (x > 0) {
    phi = x;
  } else {
    phi = 0;
  }
  return 2 + phi;
}

(a) Initial program.

int canon(int x) {
  if (x > 0) {
    return 2 + x;
  } else {
    return 2 + 0;
  }
}

(b) After duplication.

int canon(int x) {
  if (x > 0) {
    return 2 + x;
  } else {
    return 2;
  }
}

(c) After duplication and optimization.

[Graal IR of canon before duplication, after duplication, and after optimization; the ϕ at the merge is replaced by its per-predecessor inputs, allowing the Add of Const 2 and Const 0 to fold.]

(d) Graal IR canonicalization opportunity.

Figure 6.1: Canonicalization opportunity.

The target instruction4 of a duplication can be optimized because a data-flow input is replaced with a more concrete node: the input of the ϕ, which becomes the new input of the target instruction, is more precise in its value or type information than the ϕ itself. Typical target instructions are mathematical operations. We illustrate this in Figure 6.1, where we see a simple Java function canon in Figure 6.1a. After duplication of the merge into both predecessors, the compiler generates the code in Figure 6.1b, which can easily be optimized to the code in Figure 6.1c. Figure 6.1d shows the function canon in Graal IR before duplication, after duplication, and in the final optimized form.

3 As these optimizations attempt to transform a piece of IR into its canonical form. 4 See Section 4.1 for details. The target instruction is the instruction that can be optimized after duplication.

6.1.2 Read Elimination

Read Elimination is an optimization that eliminates redundant reads and writes in a program. Due to control-flow restrictions, only fully redundant reads [13] can be eliminated5. However, partially redundant6 reads can be promoted to fully redundant ones by duplication. Consider the example in Figure 6.2a: Read2 is redundant if the true branch (i > 0) is taken.

class A { int x; }
static int s;
int readE(A a, int i) {
  if (i > 0) {
    // Read1
    s = a.x;
  } else {
    s = 0;
  }
  // Read2
  return a.x;
}

(a) Initial program. (Note that s is a static variable, thus the compiler cannot remove it.)

class A { int x; }
static int s;
int readE(A a, int i) {
  if (i > 0) {
    // Read1
    s = a.x;
    // Read2
    return a.x;
  } else {
    s = 0;
    // Read2
    return a.x;
  }
}

(b) After duplication.

class A { int x; }
static int s;
int readE(A a, int i) {
  if (i > 0) {
    // Read1
    int tmp = a.x;
    s = tmp;
    return tmp;
  } else {
    s = 0;
    // Read2
    return a.x;
  }
}

(c) After duplication and optimization.

[Graal IR of readE before duplication, after duplication, and after optimization; after duplication, the second LoadField in the true branch is fully redundant and is replaced by the value of the first.]

(d) Graal IR read elimination opportunity.

Figure 6.2: Read elimination opportunity.

By duplicating Read2 into both predecessors, it becomes fully redundant in the true branch and can be eliminated. Figure 6.2b shows the program after duplication, where we can easily see that Read2 is fully redundant in the true branch. The compiler can optimize the second read of a.x away by keeping the read value in a virtual register from that point on, as seen in Figure 6.2c.

5 An operation is fully redundant if the value computed by the operation is available on all paths of the program. 6 An operation is partially redundant if the value computed by the operation is already available on some paths of a program, but not all [13].

6.1.3 Conditional Elimination

Conditional Elimination (CE) [182], i.e., the process of symbolically proving conditional statements, also profits from duplication. Consider the example in Figure 6.3a. In case the first condition does not hold and i <= 0, the second condition p > 12 is known to be true. The compiler can duplicate the rest of the method into the predecessor branches and eliminate the condition in the else branch, producing the code seen in Figure 6.3b, which can be optimized to the code in Figure 6.3c. We also group guard instructions into this optimization opportunity.

int ce(int i) {
  int p;
  if (i > 0) {
    p = i;
  } else {
    p = 13;
  }
  // merge block m
  if (p > 12) {
    return 12;
  }
  return i;
}

(a) Initial program.

int ce(int i) {
  int p;
  if (i > 0) {
    p = i;
    // merge block m
    if (p > 12) {
      return 12;
    }
    return i;
  } else {
    p = 13;
    // merge block m
    if (p > 12) {
      return 12;
    }
    return i;
  }
}

(b) After duplication.

int ce(int i) {
  if (i > 0) {
    if (i > 12) {
      return 12;
    }
    return i;
  } else {
    return 12;
  }
}

(c) After duplication and optimization.

[Graal IR of ce before duplication, after duplication, and after optimization; after duplication, the comparison of the ϕ-value Const 13 against Const 12 in the else branch is a tautology and folds away.]

(d) Graal IR conditional elimination opportunity.

Figure 6.3: Conditional elimination opportunity.

Guards are nodes that ensure that a certain condition is true and deoptimize the generated code otherwise. Guards are used to implement speculative optimizations. Guards in Graal have three stages in their lifecycle [58]. For optimizing guards via duplication we are only interested in the first stage, where guards start out as single instructions in the high tier. In the second stage they are fixed to control flow, but new deoptimizing nodes can still be added to the IR. In the last stage the compiler associates a frame state7 with every guard and disallows the introduction of new guards from that point on. In the first stage, guards have a condition input, i.e., the condition they are guarding. This condition can possibly be implied by dominating conditions. Thus, guards can also be optimized after duplication if a duplication transformation can establish a dominance relation that implies that a certain guard condition is a tautology or a contradiction.

7 See Chapter 3 for details.
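To make the guard case concrete, the following is a minimal sketch of this applicability check; the Condition interface and all names are ours for illustration and not Graal API. A guard can be folded on a path if some condition collected along the dominator-tree path implies either the guard's condition (tautology) or its negation (contradiction).

    import java.util.Set;

    // Sketch only: a symbolic condition that can report implication.
    interface Condition {
        boolean implies(Condition other);
        Condition negate();
    }

    final class GuardFolding {
        // A guard can be folded if a dominating condition on the current
        // traversal path proves it always true or always false.
        static boolean canFoldGuard(Condition guard, Set<Condition> dominating) {
            for (Condition c : dominating) {
                if (c.implies(guard) || c.implies(guard.negate())) {
                    return true;
                }
            }
            return false;
        }
    }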

6.1.4 Escape Analysis and Scalar Replacement

Escape analysis is a compiler analysis that determines the scope, i.e., the lifetime, of object allocations. If an object or pointer is known to only live inside a well-defined scope, i.e., a method, the compiler can optimize the respective allocation by removing it and replacing it with the object's fields. The most common optimization for non-escaping object allocations in Java compilers is scalar replacement. Scalar replacement removes non-escaping object allocations by replacing an object with its scalars, i.e., field values, which can be held in registers instead of the Java heap. Traditionally, escape analysis and scalar replacement are done on whole-method scope, i.e., objects are only replaced with their scalars if an allocation never escapes the entire method.

(a) Initial program:

    class A {
        int x;
        A(int x) { this.x = x; }
    }
    int ea(A a) {
        A p;
        if (a == null) {
            p = new A(0);
        } else {
            p = a;
        }
        return p.x;
    }

(b) After duplication:

    int ea(A a) {
        A p;
        if (a == null) {
            p = new A(0);
            return p.x;
        } else {
            p = a;
            return p.x;
        }
    }

(c) After duplication and optimization:

    int ea(A a) {
        if (a == null) {
            return 0;
        } else {
            return a.x;
        }
    }

(d) Graal IR escape analysis opportunity. [Figure: IR graphs of the initial program, after duplication, and after optimization; after duplication the allocation no longer flows into a φ and can be scalar-replaced.]

Figure 6.4: Escape analysis & scalar replacement opportunity.

Partial Escape Analysis (PEA) and Scalar Replacement [185] is an extension to escape analysis that allows scalar replacement to be performed on individual branches. Thus, it is typically able to remove, or defer, more allocations than whole-method escape analysis. PEA and scalar replacement opportunities arise when an object has to be materialized through its usage in a ϕ instruction [39]8. This frequently happens in Java and Scala because of auto-boxing [124]. Figure 6.4a illustrates an opportunity for PEA after duplication. There would be no need for the A object to be allocated except that it is used by a ϕ instruction. The allocation of A in the true branch cannot be removed via scalar replacement, as the else branch's object has already escaped. This forces the compiler to materialize the allocation in the true branch to ensure both operands of the ϕ have the same type. After duplication (Figure 6.4b), (P)EA can deduce that the allocation is no longer needed. Thus, scalar replacement can reduce the code to that in Figure 6.4c.
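Since auto-boxing is named above as a frequent source of such ϕ materializations, the following illustrative Java snippet (ours, not from the thesis) has the same shape as Figure 6.4: the boxed Integer created in one branch only has to be materialized because it meets an already-escaped object in a ϕ.

    // Illustrative analogue of Figure 6.4 using auto-boxing (hypothetical example).
    static int boxExample(Integer cached, int i) {
        Integer p;
        if (cached == null) {
            p = i;        // auto-boxing: Integer.valueOf(i) allocates
        } else {
            p = cached;   // object that has already escaped (caller-provided)
        }
        return p;         // unboxing after the merge; p is a phi of both branches
    }

After duplicating the unboxing return into both predecessors, the true branch reduces to return i;, removing the allocation, analogous to Figure 6.4c.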

6.1.5 Lock Coarsening

Java expresses high-level synchronization for concurrent programming with the synchronized statement. Code inside a synchronized block (including synchronized methods) can only be executed by one thread at a time. Inside the JVM, synchronized blocks are represented with a bytecode called monitorenter to enter the critical region and a bytecode called monitorexit to exit the block. Every synchronized block has a monitor, i.e., every Java object owns a lock that can be used to execute a critical region under mutual exclusion. Therefore, on the JVM level, monitorenter and monitorexit have a parameter that is the monitor of the critical region. Lock Coarsening [49; 158] is an optimization that reduces locking overhead by merging adjacent critical regions delimited by monitorenter and monitorexit [124]. We illustrate the optimization with the code in Figure 6.5. The associated Graal IR in Figure 6.5d contains the monitorenter and monitorexit nodes representing the critical region. Duplication can make non-adjacent synchronized regions adjacent to each other. In Figure 6.5a we see a snippet of Java code that has a synchronized critical region in the false branch and in the merge block. In case the false branch is taken, object a is locked twice. However, no code is executed between the end of the else branch and the merge. The unlocking of the monitor of a is immediately followed by re-locking it. Duplication can optimize this code by duplicating the merge block into both predecessors (Figure 6.5b), so that the two consecutive synchronized blocks are now adjacent to each other in the false branch. The compiler can now coarsen the lock in the false branch to obtain the optimized code in Figure 6.5c.

8 ϕ functions force object materialization since Graal's type system cannot express types of materialized and virtualized objects at the same time. If a ϕ has one non-virtualized input, all other virtualized inputs have to be materialized.

(a) Initial program:

    if (...) {
        ...
    } else {
        synchronized (a) {
            ...
        }
    }
    synchronized (a) {
        ...
    }

(b) After duplication:

    if (...) {
        ...
        synchronized (a) {
            ...
        }
    } else {
        synchronized (a) {
            ...
        }
        synchronized (a) {
            ...
        }
    }

(c) After duplication and optimization:

    if (...) {
        ...
        synchronized (a) {
            ...
        }
    } else {
        synchronized (a) {
            ...
            ...
        }
    }

(d) Graal IR lock coarsening opportunity. [Figure: IR graphs with MonitorEnter and MonitorExit nodes for the initial program, after duplication, and after optimization; the adjacent MonitorExit/MonitorEnter pair in the false branch is removed.]

Figure 6.5: Lock coarsening opportunity.

It can be argued that locking can be nearly a no-op in optimized code considering techniques like biased locking [171]. However, if a lock is heavily contended [91]9, reducing the number of redundant release/acquire operations can significantly improve performance10.

6.1.6 Devirtualization

Devirtualization [102] is the task of finding the concrete receiver type of a virtual method dispatch. In languages such as Java that support virtual dispatch, the concrete receiver type of a virtual dispatch is often only known at run time. In order for the compiler to better optimize a virtual dispatch via, e.g., inlining, the concrete receiver type must be inferred via context information inside a compilation unit. Consider the code in Listing 6.1a: the call to foo (line 17) can either call A::foo, B::foo, or a method of a type that is not yet loaded [121]11.

9 Threads contend for a lock if one thread has already acquired the lock and executes the critical region while others are blocked until the owner of the lock releases it.
10 See Section 7.5.
11 A JVM-based compiler may use global class-hierarchy [46] assumptions [109] to assume that only A and B are currently loaded.

(a) Initial program:

    1  interface I {
    2      void foo();
    3  }
    4  class A implements I {
    5      void foo() { ... }
    6  }
    7  class B implements I {
    8      void foo() { ... }
    9  }
    10 void callTest(I i) {
    11     if (i.getClass() == A.class) {
    12         ...
    13     } else {
    14         ...
    15     }
    16     // invokeinterface
    17     i.foo();
    18 }

(b) After duplication:

    1  void callTest(I i) {
    2      if (i.getClass() == A.class) {
    3          ...
    4          // invokeinterface
    5          i.foo();
    6      } else {
    7          ...
    8          // invokeinterface
    9          i.foo();
    10     }
    11 }

(c) After duplication and optimization:

    1  void callTest(I i) {
    2      if (i.getClass() == A.class) {
    3          ...
    4          // invoke direct
    5          A::foo(this=i);
    6      } else {
    7          ...
    8          // invokeinterface
    9          i.foo();
    10     }
    11 }

Listing 6.1: Devirtualization opportunity.

However, if the true branch is taken, the compiler knows that the type of i is A (and not a subtype). Thus, lifting the call into both predecessors via duplication (Listing 6.1b) can devirtualize the virtual call, and the compiler can replace the invokeinterface call with a direct invocation (Listing 6.1c) that can later be inlined. While devirtualization would be a suitable opportunity for code duplication, Graal's inliner [156] already performs complex transformations, including type-guarded inlining [156], which is later optimized by removing type guards via duplication12. It does so by introducing type-check switches at polymorphic callsites and duplicating invoke instructions for the checked types. Thus, we refrain from modeling devirtualization in our duplication implementation. However, other approaches heavily apply duplication to enable devirtualization, including, e.g., the Self compiler [25]13.

12 See Section 6.1.3 for details.
13 See Chapter 9.
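For illustration, the type-check switch that such type-guarded inlining introduces at a polymorphic callsite has roughly the following source-level shape (our sketch, not the actual IR transformation):

    // Illustrative shape (ours) of a type-check switch with duplicated invokes.
    void callTest(I i) {
        if (i.getClass() == A.class) {
            ((A) i).foo();   // checked type: can be treated as a direct call to A::foo
        } else if (i.getClass() == B.class) {
            ((B) i).foo();   // checked type: direct call to B::foo
        } else {
            i.foo();         // fallback: virtual dispatch (or deoptimization)
        }
    }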

6.2 Simulation-based Duplication of Control Flow Merges

In order to perform beneficial code duplications of control flow merges, we propose an implementation of the simulation-based duplication scheme from Chapters 1 and 4. We base the implementation of the AC(m, pi) step on a dominance-based algorithm called dominance-based duplication simulation (DBDS). The basic idea of DBDS follows Algorithm 4 from Chapter 4 and is depicted in Figure 6.6. DBDS is one of the two concrete implementations of the simulation-based duplication scheme from Figure 1.1 in Chapter 1.

[Figure: the simulation tier runs duplication simulation traversals over the dominator tree and produces a simulation result per predecessor-merge pair; the trade-off tier sorts the results by cycles saved, code size, and probability and filters them with shouldDuplicate(simResult); the optimization tier duplicates and optimizes the chosen candidates.]

Figure 6.6: DBDS algorithm schematic.

The approach works in the following way: The simulation tier discovers optimization opportunities after code duplication. The trade-off tier fits those opportunities into an optimization cost model that tries to maximize peak performance while minimizing code size and compilation time. The outcome is a set of duplication transformations that should be performed as they lead to sufficient peak performance improvements. The optimization tier then performs those duplications together with the subsequent optimizations whose potential was detected by the simulation tier.
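A minimal sketch of the three tiers as a driver loop is shown below; all class and method names are ours, not Graal API. The simulation tier fills the list of simulation results, the trade-off tier sorts and filters them, and the optimization tier performs the surviving duplications.

    import java.util.Comparator;
    import java.util.List;

    // Sketch only: drives trade-off and optimization over simulation results.
    final class DuplicationDriver {
        // One <predecessor, merge> pair with its simulated effects.
        static final class SimulationResult {
            double cyclesSaved;   // benefit, estimated via the node cost model
            int codeSize;         // code-size increase of the duplication
            double probability;   // relative probability of the predecessor
            void duplicateAndOptimize() { /* optimization tier */ }
        }

        void run(List<SimulationResult> results) {
            // Trade-off tier: sort by probability-weighted benefit, best first.
            results.sort(Comparator
                    .comparingDouble((SimulationResult r) -> r.cyclesSaved * r.probability)
                    .reversed());
            for (SimulationResult r : results) {
                if (shouldDuplicate(r)) {
                    r.duplicateAndOptimize(); // optimization tier
                }
            }
        }

        boolean shouldDuplicate(SimulationResult r) {
            return false; // trade-off function, see Section 6.3
        }
    }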

6.2.1 DBDS Algorithm

In order to illustrate the basic idea of DBDS we use the outline from Figure 6.6 and the example program f from Figure 6.7, which is given in Graal IR [55; 57; 119]. Additionally, we present the combined pseudo code for the algorithm and traversal in Appendix B in Algorithms 9 and 10.

In order to find optimization opportunities, the compiler simulates duplication operations for each predecessor-merge pair of the CFG and applies partial optimizations on them. Partial optimizations are local to the merge block14; they are AC(m, pi) implementations for all optimizations presented at the beginning of this chapter. If a partial optimization triggers during simulation, the compiler remembers the optimization potential for the predecessor-merge pair.

14 This includes blocks dominated by the merge block and is configurable by limiting the simulation depth.

[Figure: program f in Graal IR. Block bs evaluates a > b at an If; the merge block bm contains a φ that receives a from bp1 and the constant 2 from bp2, divides x by the φ, and returns the result.]

Figure 6.7: Example program f.

The result of the simulation is the optimization potential for each potential duplication. The entire approach is based on a depth-first traversal15 of the dominator tree16 [16], as outlined in the simulation part of Figure 6.6.

Simulation is only faster than backtracking if it avoids the expensive part of backtracking, namely the real duplication step17. Therefore, DBDS refrains from actually duplicating code and instead uses the dominance relation to act as if a certain merge block had been duplicated.

[Figure: on the left, a CFG in which a control-flow split dominates a merge, which in turn dominates the duplication target; on the right, the CFG after duplicating merge and target into both predecessors.]

Figure 6.8: Duplication effect on the dominance relation. Red arrows denote the dominance relation for the merge block(s) before and after duplication.

We illustrate the general paradigm that is the foundation of DBDS in Figure 6.8, which shows an arbitrary CFG for a program on the left. Before duplication, the merge instruction is dominated by the control-flow split. The merge dominates the target instruction for duplication. After duplicating the merge and the target instruction into both predecessors (as seen on the right), the dominance relation changes: the merge and the target in the respective branch are now dominated by the successor of the split in each branch instead of by the control-flow split. This changes the data flow of the program, allowing the compiler to optimize the target instruction after duplication. Our DBDS algorithm builds upon this idea: instead of actually duplicating code in order to determine if the target instruction is optimizable after duplication, we act as if the merge block were already dominated by one of the predecessor blocks in the respective branches. This way we can simulate how values flow after duplication, without actually changing the program. In the rest of this section we go into the details of the dominator tree traversal and the application of AC(m, pi, oi) to find optimizable candidates.

15 The depth-first traversal of the CFG can be seen on line 13 in function visit in the Appendix in Algorithms 9 and 10.
16 See Chapter 2.
17 See Chapter 4 for details.

We use a simplified depth-first traversal over the dominator tree of the program. In a normal depth-first traversal of the dominator tree, the compiler processes a block before it processes its dominated blocks, and this is done recursively. This means that for every instruction in a program, all inputs of an instruction are visited18 before the instruction itself. Additionally, as every definition must dominate all its usages, the depth-first traversal stack implicitly holds all type, value and condition information of the current basic block's control flow path. This allows our algorithm to use the information of dominating conditions for optimization. Every split in the control-flow graph narrows the information about a dominating condition's operands. For example, an instruction if (a != null) has two successors in the dominator tree: the true and the false branch. In a depth-first traversal of the true branch we know that (a != null) holds. This additional control-flow-sensitive information is used for optimization.
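The following sketch (types and names are ours, not Graal API) shows how the traversal can maintain this path information: conditions are pushed when the traversal descends into a conditional successor and popped when the subtree is left.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    // Sketch only: dominator-tree traversal carrying dominating conditions.
    interface Block {
        List<Block> dominatedBlocks();
        String impliedCondition(); // e.g. "a != null" in the true branch, or null
    }

    final class DominatorTraversal {
        final Deque<String> pathConditions = new ArrayDeque<>();

        void visit(Block block) {
            boolean pushed = false;
            if (block.impliedCondition() != null) {
                // Entering e.g. the true successor of if (a != null): the
                // condition holds for everything this block dominates.
                pathConditions.push(block.impliedCondition());
                pushed = true;
            }
            // ... process the block's instructions using pathConditions ...
            for (Block dominated : block.dominatedBlocks()) {
                visit(dominated);
            }
            if (pushed) {
                pathConditions.pop(); // the fact is invalid outside the subtree
            }
        }
    }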

During the depth-first traversal, every time the compiler processes a basic block bpi which has a merge block successor bm in the CFG (as indicated by the gray background in Figure 6.6), it pauses the current traversal and starts a so-called duplication simulation traversal (DST). DSTs follow the same depth-first traversal order as the normal traversal of a program. The DST re-uses the context information of the paused depth-first traversal and starts a new depth-first traversal at block bpi.

However, in the DST the compiler processes block bm directly after block bpi, as if bpi dominated bm. In other words, it extends the block bpi with the instructions of block bmi. The index i indicates a specialization of bm for the predecessor bpi. This way the compiler simulates a duplication operation: in the original CFG, duplicating instructions from bmi into bpi effectively appends them to the block bpi. As every block trivially dominates itself, the first instruction of bpi dominates its last instruction and therefore also the duplicated ones. Consider the sample program f in Figure 6.7 and its dominator tree in Figure 6.9.

18 See SSA in Chapter 2.

[Figure: the dominator tree of f; bs directly dominates bp1, bp2, and bm.]

Figure 6.9: Program f dominator tree.

Program f consists of four basic blocks: the start block bs, the true branch bp1, the false branch bp2 and the merge block bm. The compiler simulates a duplication operation by starting two DSTs at blocks bp1 and bp2. This can be seen in Figure 6.10 by the dashed arrows. It processes bmi in both DSTs as if it were dominated by bpi.

[Figure: the dominator tree of f extended with simulation edges; the DSTs process bm1 below bp1 and bm2 below bp2.]

Figure 6.10: Program f Duplication Simulation.

The compiler performs each DST until the first instruction after the next merge or split instruction19.

19 This is a restriction of Graal's duplication implementation. The current algorithm to perform tail duplication cannot duplicate over post-dominating merges.

During the DSTs the compiler needs to determine which optimizations are enabled after duplication. This represents the implementation of the AC(m, pi) step from Chapter 4. Therefore, we split our optimization phases into two parts: the precondition and the action step. This scheme was presented by Chang et al. [26]. The precondition is a predicate that holds if a given IR pattern can be optimized. The associated action step performs the actual transformation.

The precondition is a per-optimization implementation of the entire AC(m, pi) step. Based on the preconditions we derive boolean applicability checks that determine if a precondition holds on a given IR pattern for a single optimization oi. We denote this as AC(m, pi, oi). We build AC(m, pi, oi) and action steps for all optimizations presented in Sections 6.1.1 to 6.1.5. We denote the action step as Opt(m, pi). Additionally, we modify Opt(m, pi) to not change the underlying IR but to return new (sub)graphs containing the result of the optimization. We use the result of the Opt(m, pi) steps to estimate a peak-performance increase and a code-size increase as explained in Chapter 5. We compare the resulting IR nodes of the Opt(m, pi) step with the original IR. There are several possible results of the Opt(m, pi) step:

• Empty: Opt(m, pi) is able to eliminate the node(s).


• New Node: Opt(m, pi) returns new nodes that represent the semantics of the old ones and will replace them.

• Redundant Node: Opt(m, pi) deduces that an instruction in a dominating block has already computed the same result, so the redundant node can be eliminated.

Based on this comparison we compute a numeric peak performance increase estimation by using the node cost model from Chapter 5 for each IR node. The node cost model returns a cycle estimation as well as the node size as a code-size estimation for each IR node. From the difference between the original and the optimized IR we compute a cycles saved (CS) measure which tells us if a given Opt(m, pi) step might increase peak performance. We compute the code-size increase in a similar fashion.
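A minimal sketch of this estimation, assuming a node cost model interface in the spirit of Chapter 5 (all names are ours):

    import java.util.List;

    // Sketch only: benefit and size estimation over sets of IR nodes.
    final class CostEstimator {
        interface Node {
            int estimatedCycles(); // cycle estimate from the node cost model
            int estimatedSize();   // size estimate from the node cost model
        }

        static int cyclesSaved(List<Node> original, List<Node> optimized) {
            return sumCycles(original) - sumCycles(optimized);
        }

        static int codeSizeDifference(List<Node> original, List<Node> optimized) {
            return sumSize(optimized) - sumSize(original);
        }

        private static int sumCycles(List<Node> nodes) {
            return nodes.stream().mapToInt(Node::estimatedCycles).sum();
        }

        private static int sumSize(List<Node> nodes) {
            return nodes.stream().mapToInt(Node::estimatedSize).sum();
        }
    }

For the division example discussed later in this section, cyclesSaved would evaluate to 32 − 1 = 31.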

As stated, we want to avoid copying any code during a DST. However, the code in bm still contains ϕ instructions rather than the inputs of those ϕs on the respective branch. Therefore, we introduce the concept of so-called synonym mappings. A synonym map maps a ϕ node to its input on the respective DST predecessor of bmi. Before the compiler descends into bmi, it creates the synonym map for all ϕ instructions in bmi based on their inputs from bpi. Such a mapping can be seen in Figure 6.11, which shows the algorithm during DBDS for both DSTs of the blocks bs → bp1 → bm1 and bs → bp2 → bm2.

The synonym-of relation in Figure 6.11 shows a mapping from the constant 2 to the ϕ node.

In DST1: bs → bp1 → bm1 the synonym map holds the value a as a synonym for the ϕ node on the predecessor bp1. In DST2: bs → bp2 → bm2 the synonym map holds the constant 2 for the ϕ on the predecessor bp2, and a synonym for the division node, which can be optimized to a shift after duplication. This is the second use case for the synonym maps: holding new nodes returned from Opt(m, pi). These new nodes are registered as synonyms for the original ones and propagated as replacement nodes for subsequent invocations of Opt(m, pi).
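The synonym mechanism can be sketched as a simple map with chained lookup (names are ours, not Graal API):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch only: synonym mapping used during a DST.
    final class SynonymMap {
        private final Map<Object, Object> synonyms = new HashMap<>();

        // Before descending into bm_i: map each phi of bm to its input on bp_i.
        void registerPhiInput(Object phi, Object inputOnPredecessor) {
            synonyms.put(phi, inputOnPredecessor);
        }

        // When Opt(m, pi) produces a replacement node, it becomes a synonym.
        void registerOptimizedNode(Object original, Object replacement) {
            synonyms.put(original, replacement);
        }

        // Resolve a node to its current synonym (following chains), so that
        // AC(m, pi, oi) sees values as they would be after duplication.
        Object resolve(Object node) {
            Object current = node;
            while (synonyms.containsKey(current)) {
                current = synonyms.get(current);
            }
            return current;
        }
    }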

During the simulation traversal the compiler iterates over all instructions (nodes) of the block bm and applies AC(m, pi, oi) to them. If the AC(m, pi, oi) of an optimization triggers, the compiler performs the associated action step Opt(m, pi) of the optimization, computes the cycles saved (CS) and saves them in a data structure associated with the block pair <bpi, bmi>. Additionally, the compiler saves new IR nodes as synonyms for the original nodes. In order for AC(m, pi, oi) to work with the synonym mappings, AC(m, pi, oi) implementations must accept replacement nodes, i.e., they must answer the question whether a node is optimizable under the assumption that its original inputs are replaced by their synonyms. We outline this idea in Listing 6.2, which shows a simplified version of the AC(m, pi, oi) and Opt(m, pi) steps for a constant folding [7] optimization20 of arithmetic nodes. The AC(m, pi, oi) pseudo code in Listing 6.2 returns true if both inputs (the replacement inputs x and y) are constant, or if the operation is a division and y is 0, because in this case the entire operation will be folded to a deoptimization21.

20 Denoted as acArithConstFold in Listing 6.2.

[Figure: the IR of f during both DSTs. At the end of DST1 (bs → bp1 → bm1) the synonym map is {φ → a}; at the end of DST2 (bs → bp2 → bm2) it is {φ → 2, Div → >>}, since x / 2 can be strength-reduced to x >> 1.]

Figure 6.11: Example during duplication simulation. Orange arrows indicate the DSTs and how the traversal acts as if bm were dominated by a predecessor bpi. For both DSTs we also present the synonym maps at the end of the traversal. Gray nodes in the DSTs in bmi indicate that they were not created during Opt(m, pi) but were part of the IR graph before the simulation.

The Opt(m, pi) step22 in Listing 6.2 returns a new constant node which holds the constant-folded result of both constant operands x and y. Additionally, the Opt(m, pi) step also models the division-by-zero case and returns a deoptimization node in that case.

It could be argued that implementing AC(m, pi, oi) and Opt(m, pi) for all instructions/nodes and optimizations in a compiler implies a tremendous amount of engineering in order for DBDS to function properly. However, this is not the case, because it often requires just API changes. Consider the last part of Listing 6.2, where the constant folding optimization of a single addition node is outlined in lines 24-30: existing optimizations only have to be adapted slightly in order to re-use AC(m, pi, oi) and Opt(m, pi) for the real optimizations.

21 Graal does not handle division-by-zero exceptions in the generated code unless the division is in the fast path of a method, which we deliberately ignore for the sake of simplicity in this example.
22 Denoted as optArithConstFold in Listing 6.2.

    1  static boolean acArithConstFold(ArithNode a, Node x, Node y) {
    2      boolean xConst = x.isConstant();
    3      boolean yConst = y.isConstant();
    4      if (a.operation() == Division && yConst && y.asConstant() == 0) {
    5          return true;
    6      }
    7      return xConst && yConst;
    8  }
    9  static Node optArithConstFold(ArithNode a, Node x, Node y) {
    10     Operation op = a.operation();
    11     if (op == Division && y.asConstant() == 0) {
    12         return new DeoptimizeNode(DivByZero);
    13     }
    14     int constValX = x.asConstant();
    15     int constValY = y.asConstant();
    16     return new ConstantNode(op.fold(constValX, constValY));
    17 }
    18 ...
    19 class AddNode extends ArithNode {
    20     ...
    21     Node x; // x input of addition
    22     Node y; // y input of addition
    23     ...
    24     Node constantFold() {
    25         if (acArithConstFold(this, this.x, this.y)) {
    26             return optArithConstFold(this, this.x, this.y);
    27         }
    28         // no constant folding possible
    29         return this;
    30     }
    31     ...
    32 }

Listing 6.2: AddNode constant folding AC(m, pi, oi) & Opt(m, pi).

They just need to outsource the logic that checks if an optimization is possible into a dedicated method for the AC(m, pi, oi) step and the actual optimization into a method representing the Opt(m, pi) step23.

As in the original example from Figure 6.7, Figure 6.11 shows the IR and data structures of the algorithm during the traversal. The compiler saves value and type information for each involved IR node and updates it regularly via the synonym mapping. Eventually, it iterates over the division operation (x / ϕ) in bm2 and applies a strength reduction [7] AC to it. Since the AC (which is an implementation of AC(m, pi, oi) for the DivNode) returns true, the compiler performs the action step, which returns a new instruction (x >> 1) that is saved as a synonym for the division node. The node cost model yields that the original division needs 32 cycles to execute while the shift only takes 1 cycle. Therefore, the cycles saved (CS) is computed as 32 − 1 = 31, i.e., we estimate that performing the duplication and the subsequent optimizations reduces the run time of the program by 31 cycles.

23 In our implementation we performed these API changes in Graal for all optimizations presented in Section 6.1, many of which already supported a kind of AC(m, pi, oi) step. See Section 6.2.2 for details.

For completeness, we illustrate the optimized program f in Figure 6.12, which shows that all optimizations found during simulation are indeed performed after duplication.

[Figure: program f after duplication and optimization; bs still evaluates a > b, bp1 returns x / a, and bp2 returns x >> 1 instead of x / 2.]

Figure 6.12: Example after duplication.

The result of the simulation tier is a list of simulation results capturing the code-size effect and the optimization potential of each possible duplication in the compilation unit24. Then the trade-off tier checks if a simulation result is promising enough and, if so, performs the duplication and the subsequent optimization.

6.2.2 AC(m, pi, oi) in Graal

Graal already supported a limited form of ACs through the Canonicalizable interface, which implements simple optimizations such as constant folding and strength reduction as operations on IR nodes. We extended those ACs with the ones presented in Sections 6.1.1 to 6.1.5. Based on the result of the AC(m, pi, oi) step, the simulation result data structure in Algorithm 9 is updated with the associated cyclesSaved and codeSize estimations.

6.3 Trade-off Functions

We want to avoid code explosion and unnecessary duplication transformations that do not result in optimization potential. Thus, we consider the benefits of the duplication candidates discovered during the simulation tier. Based on their optimization potential (benefit) and their cost, we select the most promising ones for duplication. This can be seen in the middle part of Figure 6.6. We take the candidates from the simulation tier and sort them by benefit, cost and probability. We then decide for each candidate if it is beneficial enough to perform the associated duplication.

The decision is made by a trade-off function that tries to maximize peak performance while minimizing code-size increase. The trade-off function is based on cost and benefit.

24 See the trade-off part in the middle of Figure 6.6.

We formulate it as the boolean function shouldDuplicate(bpi, bm, benefit, cost), which decides if a given duplication transformation should be performed. All duplication candidates for which shouldDuplicate(bpi, bm, benefit, cost) returns true are passed to the optimization tier. The optimization tier performs the duplication of bm into bpi and performs all Opt(m, pi) steps on the result. Note that regarding compile time it would be better to perform the Opt(m, pi) steps on all duplicated pairs <bm, bpi> only after all duplications have been made. However, an Opt(m, pi) step can significantly change the CFG of a program, e.g., if the AC(m, pi, oi) of a conditional elimination (Section 6.1.3) triggers. In such cases, simulation results that became invalid have to be dismissed, because their associated parts of the CFG are no longer semantically equivalent. This is a general problem of optimizations, including duplication: the number of possible orderings of optimizations and subsequently enabled ones grows exponentially. Thus, it is infeasible for a compiler to model the entire possible transformation space. We consider this problem interesting for future work, where we plan to explore the simulation of subsequent duplications. Our simulation-tier implementation of DBDS is capable of recursively simulating subsequent duplications; however, we currently do not consider this in our trade-off and optimization tiers.

The most important part of our algorithm is the trade-off tier, as it decides what to duplicate. We developed a function shouldDuplicate(bpi, bm, benefit, cost) that decides, for one duplication at a time, if it should be performed. During the development of the DBDS algorithm, three factors turned out to be most important for the trade-off decision:

1) Compilation Unit Maximum Size: HotSpot limits the size of the installed code per compilation unit.25 Therefore, we cannot increase code size arbitrarily.

2) Code-Size Increase Budget: In theory, code-size increase is only limited by the number of duplication opportunities. However, we do not want to increase the code size beyond a certain threshold. More code always increases the workload for subsequent compiler phases.

3) Probability: We decided to use profiling information to guide duplication. We compute the product of the relative probability of an instruction with respect to the entire compilation unit and the estimated cycles saved.

We sort duplication candidates based on these values and optimize the most likely and most beneficial ones first.

25 This is configurable with the option -XX:JVMCINMethodSizeLimit, which is 655360 bytes by default.

Based on the above observations we derived the following trade-off heuristic:

    c  ... cost (in node cost model size)
    b  ... benefit (in cycles saved)
    p  ... probability of bpi
    cs ... compilation unit size (in node cost model size)
    is ... compilation unit initial size
    IB ... code-size increase budget = 0.5
    MS ... maximum compilation unit size
    BS ... benefit scale factor = 128

    shouldDuplicate := (b × p × BS) > c ∧ (cs < MS) ∧ (cs + c < is × IB)

We decide upon the value of duplications by computing their cost/benefit ratio. We allow the cost to be up to 128× higher than the probability-weighted benefit. We derived the constant 128 (the benefit scale factor BS) through empirical evaluation; it turned out to generate the best results for the benchmarks presented in Chapter 8. The increase budget IB (currently set to 0.5, i.e., 50%) represents the maximum code-size increase. Note that the values of BS and IB have a significant impact on the compile time and the performance of the entire duplication optimization. In addition to the overall performance evaluation, we also present different values for IB in Chapter 8 by testing different budget configurations.
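A direct transcription of the heuristic above into Java might look as follows; the parameter and constant names are ours, and the three conjuncts correspond one-to-one to the formula as given:

    // Sketch only: the shouldDuplicate trade-off function.
    final class TradeOff {
        static final double IB = 0.5; // code-size increase budget (50%)
        static final int    BS = 128; // benefit scale factor

        static boolean shouldDuplicate(double benefit, int cost, double probability,
                                       int unitSize, int initialSize, int maxUnitSize) {
            return (benefit * probability * BS) > cost   // scaled benefit beats cost
                    && unitSize < maxUnitSize            // HotSpot code-size limit
                    && unitSize + cost < initialSize * IB; // budget check as given above
        }
    }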

Chapter 7

Fast-Path Loop Unrolling of Non-Counted Loops

In this chapter we present fast-path loop unrolling of non-counted loops, the second implementation of the simulation-based duplication scheme we propose. We first motivate the need to optimize non-counted loops, as they are commonly ignored under the premise that they are difficult to optimize and unimportant for performance. We then present optimization opportunities that arise after unrolling them via duplication. Finally, we show how a compiler can apply simulation in order to estimate the impact of loop unrolling in terms of code size and peak performance.

Generating fast machine code for loops depends on a selective application of different loop optimizations to the main optimizable parts of a loop: the loop's exit condition(s), the back edges and the loop body. All these parts must be optimized to generate optimal code for a loop. Therefore, a multitude of loop optimizations has been proposed to reduce iteration counts, remove loop conditions [7], hoist invariant computations out of loop bodies [29], vectorize instructions [64], reverse iteration spaces and schedule loop code to utilize pipelined architectures [2].

7.1 Counted-loop Unrolling

In order to motivate the problems and challenges of unrolling non-counted loops, we first give a brief overview of counted-loop unrolling. Loop unrolling is a classical compiler optimization performed by nearly every optimizing compiler. The idea is simple, and we illustrate it in Listing 7.1. Counted loops evaluate their exit condition in every iteration of the loop. At the end of the body the induction variables are incremented and the generated code jumps to the loop header and evaluates the condition with the new induction variables.

(a) Counted loop:

    for (i = 0; i < 100; i++) {
        use(i)
    }

(b) Unrolled loop:

    for (i = 0; i < 100; i = i + 4) {
        use(i)
        use(i + 1)
        use(i + 2)
        use(i + 3)
    }

Listing 7.1: Counted loop unrolling. bounds of the loop, i.e., start and end values as well as the increment for all induction variables, it can unroll iterations of a loop by duplicating the body of the loop and appending it after each other. We can see such an unrolling in Listing 7.1b: The compiler unrolled 4 iterations of this loop by duplicating the body and using the respective values of i in between. It can do so, as it knows the bounds of the loop and increments. During the unrolled iteration there is no need to re-check the condition every time again, since the compiler knows that only in the last iteration the condition will fail. Thus, it can remove intermediate trip checks. This is also one of the main advantages of unrolling: the compiler can remove intermediate trip checks which results in less executed code, larger basic blocks for the loop body and better data locality if the body of a loop is accessing consecutive memory. However, a compiler can only remove intermediate trip checks if the unrolled loop is counted. We will focus on this dilemma in the rest of this chapter.

7.2 A Word on Unrolling Non-Counted Loops . . .

Non-counted loops [43], that is, loops for which a compiler cannot statically reason about induction variables and iteration count, are less amenable to optimization than counted loops. Optimizing such loops could potentially lead to large performance gains, given that many programs contain non-counted loops.

Java programs can contain non-counted loops. However, state-of-the-art compilers do not aggressively optimize them, since unrolling non-counted loops often involves duplicating a loop's exit condition as well, which thus only improves run-time performance if subsequent compiler optimizations can optimize the unrolled code.

As a generic way to optimize non-counted loops, we propose to unroll them without attempting to remove their exit conditions. However, this does not necessarily improve performance, as the number of loop exit condition checks is not reduced and existing compiler optimizations often cannot optimize them away. Therefore, an approach is required that determines other optimization opportunities enabled by unrolling a non-counted loop. We identified such optimization opportunities already for regular control flow merges in Chapter 6, showing significant increases in run-time performance if applied selectively. Loop unrolling is in fact a sequential code duplication transformation: the body of a loop is duplicated in front of itself. We developed an algorithm to duplicate the body of a non-counted loop, thus effectively unrolling the loop and enabling subsequent compiler optimizations. To do so, we developed a set of simulation-based unrolling strategies that analyze a loop for compiler optimizations enabled by loop unrolling.

7.2.1 Non-Counted Loop Construct

(a) Non-numerical exit condition:

    while (mem[...] != null) {
        ...
    }

(b) Loop body aliasing with condition:

    while (mem[x] > a) {
        ...
        // x and y might alias
        mem[y] = ...
        ...
    }

(c) Incompatible induction variables:

    while (a > 0) {
        if (...) {
            a++;
        } else {
            a -= b;
        }
    }

Listing 7.2: Non-Counted loops.

Non-counted loops are loops for which the number of loop iterations can be determined neither at compile time nor at run time. This can be due to several reasons; the most common ones are shown in Listing 7.2: the loop exit conditions are not numerical1, the loop body contains side-effecting instructions2 aliasing with the loop condition3, or loop analysis [208] cannot guarantee that the loop contains induction variables that lead to the termination of the loop4.

1 See Listing 7.2a. Counted loops, as their name suggests, always count from a start value to an end value, and the exit condition is a boolean expression over one of the following ordering relations: <, <=, ==, !=, >, >=. While condition expressions that do not involve an ordering relation can never be used in a counted loop, a compiler cannot automatically assume a loop is counted just because the condition expression has one of the previous forms. E.g., Listings 7.2b and 7.2c cannot be deduced to be counted, even though their exit conditions are mathematical ordering relations.
2 The term side-effect describes operations that can have an observable effect on the execution system / environment. Such operations include function calls, memory writes, sun.misc.Unsafe [130] usage, native calls and object locking [91]. In this thesis we typically use the term side-effect to describe operations that (potentially) write to the Java heap.
3 See Listing 7.2b.
4 See Listing 7.2c.

We informally observed that a significant amount of Java code contains non-counted loops. To further test this hypothesis, we instrumented the Graal compiler to count the number of counted and non-counted loops5 in the Java DaCapo [11] benchmark suite. The results in Table 7.1 demonstrate that non-counted loops are more frequent than counted loops, which provides high incentives for compiler developers to optimize them.

    Benchmark         # Counted Loops   # Non-Counted Loops   % Non-Counted Loops
    lusearch          219               449                   67%
    jython            1779              3333                  64%
    h2                538               4473                  89%
    pmd               1456              3379                  69%
    avrora            169               593                   77%
    luindex           364               660                   64%
    fop               653               1289                  66%
    xalan             510               1022                  66%
    batik             723               1857                  71%
    sunflow           249               389                   59%
    Arithmetic Mean                                           69.8%

Table 7.1: Number of counted and non-counted loops in the Java DaCapo benchmark suite.

Loop unrolling [43; 44; 45; 50; 94; 173] of non-counted loops [43] is only possible in a general way if a loop's exit conditions are unrolled together with the body of the loop. Previous work proposed to avoid unrolling non-counted loops together with their exit conditions; e.g., Huang and Leng [94] used the weakest pre-condition calculus [48] to insert a loop body's weakest pre-condition into a given loop's initial condition to maintain the loop's semantics after unrolling it. However, side-effects cause problems for optimizing compilers when trying to unroll non-counted loops. If the body of a loop contains side-effecting instructions, deriving the weakest pre-condition requires modeling the state of the virtual machine (VM), which we consider impractical for dynamic compilation. As the compiler cannot generally infer the exact number of loop iterations for non-counted loops, it cannot speculatively unroll them and generate code containing side-effects that are unknown. Removing intermediate trip checks that potentially interfere with the body of the loop is generally not possible. This is a general problem, as side-effects and multi-threaded applications6 prevent compilers in managed execution systems like the Java virtual machine (JVM) from statically reasoning about the state of the memory, thus preventing the implementation of generic loop unrolling approaches.

Consider the code in Listing 7.3. The loop's condition uses the value in memory at index x. However, the body of the loop contains a memory write at index y. If alias analysis [51; 122] fails to prove that x and y do not alias, there is no way, without duplicating the exit condition as well, to unroll the loop without violating the memory effect established by the read-after-write dependency of the read in iteration n on the write in iteration n − 1.

5 Graal's loop optimizations analyze every loop during compilation and try to deduce whether it is counted. This is done because many optimizations require additional information for counted loops. We used this analysis and instrumented the compiler to record how many of all compiled loops are counted. We counted the loops after inlining because the number of loop iterations of counted loops is often only inferable after this optimization.
6 Unknown side-effects are less of a problem in single-threaded systems, as the compiler can optimize global memory under the assumption that it is never changed by other threads.

    while (mem[x] != null) {
        mem[y] = ...
    }

Listing 7.3: Non-counted loop with side effects.

Unrolling the loop requires the compiler to respect the memory constraints of the loop. Traditional loop unrolling would duplicate the body of the loop and append it to the original body. We illustrate the unrolling of one iteration of the loop from Listing 7.3 in Listing 7.4. However, unrolling this loop as seen in Listing 7.4 violates the memory constraints of the loop and executes code that would not have been executed in the original version: the first iteration of the loop can write to mem[y]; if x and y alias, this causes the next check of the loop condition to evaluate to false, effectively exiting the loop. In the unrolled version of Listing 7.3, as seen in Listing 7.4, the loop would execute another iteration before exiting. Therefore, for such loops, a compiler cannot remove intermediate trip checks during unrolling.

    while (mem[x] != null) {
        mem[y] = ...
        // unrolled body; y and x might alias, thus mem[y] could be null after the assignment
        mem[y] = ...
    }

Listing 7.4: Non-counted loop with side effects after unrolling. Unrolling violates the memory constraints of the original loop from Listing 7.3.

    while (mem[x] != null) {
        mem[y] = ...
        if (mem[x] != null) {
            // unrolled body; y and x might alias, thus mem[y] could be null after the assignment
            mem[y] = ...
        }
    }

Listing 7.5: Non-counted loop with side effects after correct unrolling.

The only proper unrolling of Listing 7.3 is shown in Listing 7.5, where the compiler also duplicates the trip check during unrolling.

[Figure: memory dependency graph over two iterations of the loop from Listing 7.3; within each iteration a write-after-read dependency leads from the read of mem[x] to the write of mem[y], and a read-after-write dependency leads from the write of mem[y] in iteration 0 to the read of mem[x] in iteration 1.]

Figure 7.1: Possible side-effects in Listing 7.3: the directed arrows show memory dependencies between read and write operations in two iterations of the loop.

We again want to emphasize the memory effects of the loop: Figure 7.1 shows the memory dependencies of the loop from Listing 7.3 in a graph illustrating memory reads and writes over two loop iterations. It shows that the condition cannot be re-ordered with the loop's body without violating the memory constraints of the loop. Therefore, a compiler cannot derive a weakest pre-condition with respect to the body of the loop, as the post-condition of the loop body is unknown and the effect of the body of the loop on the state of the VM is unknown.

We propose to unroll non-counted loops with code duplication [120]. Duplicating the loop body together with its loop condition enables the unrolling of general loops. We can represent every loop as a while (true) loop with the exit condition moved into the loop body7.

    while (true) {
        if (...) {
            body
        } else {
            break
        }
    }

Listing 7.6: General loop construct. be unrolled by duplicating the body of the loop and inserting it before the initial loop body. Listing 7.7 shows the example from Listing 7.6 after unrolling one iteration. When applied to the source example from Listing 7.3, the unrolled source loop looks like Listing 7.8. In contrast to general loop unrolling [7], we cannot remove the second exit path (lines 3, 5, 6) of the loop as it is non-counted. Unrolling loops via duplication does not create traditional unrolling optimization opportunities, because loops are unrolled with their loop exit conditions. Therefore, there is no reduction in execution time due to a reduced number of loop exit condition checks. We propose to tackle this issue with the simulation-based scheme (proposed in Chapter 4) enabling other optimizations via unrolling. Loop unrolling can enable subsequent compiler optimizations in the

7 See Listing 7.6.

    while (true) {
        if (...) {
            body unrolled
            if (...) {
                original body
            } else {
                break
            }
        } else {
            break
        }
    }

Listing 7.7: General loop construct unrolled.

    1 while (mem[x] != null) {
    2     mem[y] = ...
    3     if (mem[x] != null) {
    4         mem[y] = ...
    5     } else {
    6         break;
    7     }
    8 }

Listing 7.8: Side-effect-full loop after unrolling. same way as dominance-based duplication simulation (Chapter 6). This enables optimizations such as constant folding, strength reduction [7], conditional elimination [182], and so forth. To determine the effects of unrolling a loop on the optimization potential of the compilation unit, we simulate the unrolling prior to the actual transformation; therefore, we can precisely find all profitable unrolling candidates and decide which ones are profitable enough to be unrolled. 94 Fast-Path Loop Unrolling of Non-Counted Loops

7.3 Optimization Opportunities

In this section we present some of the most important optimization opportunities that we consider for non-counted loop unrolling, namely safepoint poll reduction, canonicalization8, and loop-carried dependencies.

7.3.1 Safepoint Poll Reduction

HotSpot's [93] execution system relies on safepoints [41; 123] to perform operations that require all mutator threads to be in a well-defined state with respect to heap accesses. If the JVM requests a safepoint, it marks a dedicated memory page as non-readable9. Mutators periodically poll this safepoint page and segfault in case the VM requested a safepoint. The JVM handles the segfault using a segfault handler registered for that purpose, which then invokes the safepoint handler that performs the desired operation. Typical safepoint operations are garbage collections [113], class re-definitions [217], lock unbiasing, and monitor deflation. As the safepoint protocol is cooperative, it forces the code generated by the JIT compilers to periodically poll the safepoint page (the interpreter polls it after every interpreted bytecode). Generated code typically performs these polls at specific program locations:

• Method returns: Safepoint polls are performed at every method return.

• Loop back edges: Safepoint polls are performed at every loop back edge.

Safepoint polls inside loops typically impose many problems on optimizing compilers. They require compilers to balance run-time performance against the latency of safepoint operations. Optimizing compilers in the JVM aim to generate fast code; a safepoint poll in a hot loop is an additional memory read in every iteration, which leaves space for further optimization. For example, the safepoint poll can be optimized away if the iteration count of a loop is known and low enough not to have a negative effect on application latency. However, if a loop is long-running and the safepoint poll on the back edge is optimized away, the entire VM may be forced to wait for the loop to finish until the next safepoint is hit. In the worst case a full garbage collection (GC) cannot be performed because a mutator loop does not poll the safepoint page on its back edge. Stalling a full GC can crash the VM as it can, for example, run out of memory. There are multiple solutions to this dilemma; for example, the Graal compiler currently removes safepoint polls on a loop's back edge if it can prove that the loop is counted and the loop's iteration count is in the range of a 32-bit integer. For non-counted loops and for loops in the range of 64-bit integers (Java long), Graal only removes safepoint polls on back edges if the path leading to the back edge is guaranteed to perform a safepoint poll, for example, via a method call.

8 See Section 6.1.1.
9 Note that this mechanism is for Unix-based operating systems.

Otherwise, Graal does not remove safepoint polls for non-counted loops, which leaves further optimization opportunities. Consider the code in Listing 7.9. The non-counted loop has two back edges, one from the true and one from the false branch. The compiler removes the safepoint poll on the false branch, as there is a call inside which will poll on return. However, for the true branch the compiler does not remove the safepoint poll. Thus, if the true branch is the fast path, one additional memory read is performed in every iteration of the loop. There are multiple ways to optimize this safepoint poll without violating the implications of the safepoint protocol. The simplest solution is to unroll the loop u times, which reduces the number of safepoint polls from n to n/u.

    while (...) {
        if (...) {
            // fast-path
            // no call
            ...
            safepointPoll;
        } else {
            // safepoint poll on call return
            call();
        }
    }

Listing 7.9: Unrolling opportunity: Safepoint poll reduction.
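To illustrate, the following sketch (ours, in the pseudo-code style of Listing 7.9) shows the shape of that loop after unrolling once (u = 2): the fast path of two consecutive iterations performs a single safepoint poll, halving the poll count without violating the safepoint protocol.

    while (...) {
        if (...) {
            // fast-path, first unrolled iteration: poll elided
            ...
        } else {
            call();            // safepoint poll on call return
        }
        if (...) {             // duplicated loop condition
            if (...) {
                // fast-path, second iteration
                ...
                safepointPoll; // one poll covers both fast-path iterations
            } else {
                call();        // safepoint poll on call return
            }
        } else {
            break;
        }
    }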

7.3.2 Canonicalization

Instructions having loop ϕs [39] as inputs are potentially optimizable [120]. They can often be optimized by replacing their ϕ input with one of the ϕ's inputs. To optimize loop ϕs we simulate the unrolling of one iteration by checking if an instruction is optimizable under the assumption that it has the back edge input of a loop ϕ as input instead of the ϕ itself. Listing 7.10a shows a simple loop that follows this pattern. The loop ϕ loopPhi has three inputs: the unknown value a on the forward predecessor, the constant 0 on the first back edge and the unknown value b on the second back edge. In the body of the loop we check if (loopPhi == 0), which cannot be proven true in general. After unrolling the loop once, the loop ϕ has four inputs instead of three; one back edge was added by the unrolled loop body. Line 15 in Listing 7.10b was back edge 1 in the body of the original loop (Listing 7.10a). However, peeling the fast path of the loop in front of the original loop replaces the loop ϕ with the constant 0, which was the ϕ's value on the original back edge 1. Therefore, the check if (loopPhi == 0) in the original iteration of the loop becomes if (0 == 0): the if instruction can be eliminated and the call to doSth can be executed unconditionally.

(a) Before unrolling:

    1  loopPhi = a
    2  while (c)
    3  // loop phi: 3 inputs
    4  //   forward predecessor: a
    5  //   back edge 1: 0
    6  //   back edge 2: b
    7  loopPhi = ϕ(a, 0, b)
    8  {
    9      if (loopPhi == 0) {
    10         doSth();
    11     }
    12     if (...) {
    13         // back edge 1
    14         loopPhi = 0
    15         continue;
    16     } else {
    17         // back edge 2
    18         loopPhi = b
    19         continue;
    20     }
    21 }

(b) Loop after unrolling. The inner if (c) block corresponds to the original iteration of the loop (grey-shaded in the original figure). Line 17 shows the enabled optimization opportunity. Lines 14-15 show the replacement of a loop ϕ with the back edge value along the fast path:

    1  loopPhi = a
    2  while (c)
    3  // loop phi: 4 inputs
    4  //   forward predecessor: a
    5  //   back edge 1: 0
    6  //   back edge 2: b
    7  //   back edge 3: b
    8  loopPhi = ϕ(a, 0, b, b)
    9  {
    10     if (loopPhi == 0) {
    11         doSth();
    12     }
    13     if (...) {
    14         // original back edge 1: 0 replaces loopPhi
    15         loopPhi = 0
    16         if (c) {
    17             if (loopPhi == 0 /* 0 == 0 */) {
    18                 doSth();
    19             }
    20             if (...) {
    21                 // back edge 1
    22                 loopPhi = 0
    23                 continue;
    24             } else {
    25                 // back edge 2
    26                 loopPhi = b
    27                 continue;
    28             }
    29         } else {
    30             break;
    31         }
    32     } else {
    33         // back edge 3
    34         loopPhi = b
    35         continue;
    36     }
    37 }

Listing 7.10: Unrolling opportunity: canonicalization.

7.3.3 Loop-Carried Dependency

Loop-carried dependencies [27; 190] are dependencies between different iterations of a loop. Typically, they appear in array access patterns inside loops. Consider the example in Listing 7.11.

    while (...) {
        a[i] = a[i] * a[i + 1]
    }

Listing 7.11: Unrolling opportunity: Loop Carried Dependency.

In every iteration of the loop, we read the two array locations a[i] and a[i+1]. Unrolling one iteration of this non-counted loop allows us to avoid re-reading a[i+1] in the next iteration: in the first iteration, we read a[0] and a[1]; in the second iteration, we read a[1] and a[2]. If we unroll this loop once and there is no aliasing side-effect in the body of the loop, we can eliminate one redundant load.
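The following sketch (ours, in the pseudo-code style of Listing 7.11, assuming i advances by one per original iteration) shows the unrolled loop: the value loaded for a[i + 1] in the first copy is re-used as the a[i] of the second copy.

    while (...) {
        t1 = a[i]
        t2 = a[i + 1]
        a[i] = t1 * t2
        if (...) {             // duplicated exit condition
            t3 = a[i + 2]
            a[i + 1] = t2 * t3 // t2 re-used instead of re-reading a[i + 1]
        } else {
            break
        }
        // i is advanced by 2 per unrolled iteration
    }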

We implemented a simulation-based unrolling analysis that estimates the effects of loop unrolling and combined it with a read elimination optimization to determine if there are loop-carried dependencies in the original loop body that can be optimized.

In order to find loop-carried dependencies, we propose to compute a superblock [96]10 through the loop. For each memory location written and read in the superblock, the compiler can determine the associated instruction. It maps memory reads to virtual registers and tracks their values over all instructions in the superblock. The compiler first iterates over the superblock and records all memory reads and writes. Then, it iterates over all loop ϕ instructions of the loop and establishes a mapping from loop ϕ values to their back edge values in the superblock11. Finally, the compiler iterates over the superblock again, replacing the ϕ node inputs in the loop header according to the established back edge value mapping. For every instruction in the superblock, we try to eliminate memory accesses by re-using an already computed value. We repeat this process until no further eliminations are possible. Figure 7.2 illustrates our read elimination simulation. We track the values read from memory in registers and update them (as well as the memory locations) according to the instructions seen. We then replace i with i + 1, which is the value of the ϕ on the back edge, and repeat the simulation for the loop body.

10 We include structured control-flow diamonds if profiling information indicates that a split's successor paths are taken with equal probability.
11 The superblock ends in a loop back edge.

[Figure: read elimination simulation for a[i] = a[i] * a[i + 1]. In iteration i, a[0] and a[1] are read into registers r1 and r2, and r1 * r2 is written to a[0]; in iteration i + 1, the read of a[1] re-uses r2 and only a[2] has to be read into r3 before r2 * r3 is written to a[1].]

Figure 7.2: Loop-carried dependency read elimination simulation.

If we find a read or a write that is redundant, we use a static performance estimator and record the estimated run-time cycles saved for this instruction. We then trade off the cycles saved (via unrolling) against the overall size of the superblock and perform the unrolling if the benefit is sufficiently high12.
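A minimal sketch of this simulation as a location-to-value map (all names are ours; real IR memory locations are of course not strings):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch only: tracks memory locations seen in the superblock; a read
    // whose location already has a tracked value is redundant.
    final class ReadEliminationSimulation {
        private final Map<String, String> memToValue = new HashMap<>();
        private int cyclesSaved = 0;

        void simulateRead(String location, String resultValue, int readCost) {
            String known = memToValue.get(location);
            if (known != null) {
                cyclesSaved += readCost; // redundant load: re-use the known value
            } else {
                memToValue.put(location, resultValue);
            }
        }

        void simulateWrite(String location, String value) {
            // A known write updates the tracked value; an unknown side-effect
            // would instead have to invalidate the whole map.
            memToValue.put(location, value);
        }

        int estimatedCyclesSaved() {
            return cyclesSaved;
        }
    }

For Listing 7.11, after the ϕ mapping replaces i with i + 1 on the back edge, the simulated read of a[i + 1] in the second iteration hits the value tracked in the first iteration and is recorded as saved cycles.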

7.4 Fast-Path Unrolling of Non-Counted Loops

In this section, we present fast-path loop unrolling, our approach to simulation-based loop unrolling of non-counted loops. First, we propose a novel algorithm to perform loop unrolling via duplication and peeling, called fast-path loop creation, which simulates the unrolling of a loop and analyzes the unrolled loop for optimization opportunities. We do this prior to the actual code transformation and unroll only those loops whose optimization potential is high. We use the node cost model13, the same static performance estimator we used in Chapter 6, to estimate the entire run time of the fast path of a loop and compare it with the version computed during simulation. If the optimization potential is sufficiently high, we perform the unrolling transformation14.

7.4.1 Fast-Path Loop Creation

12See Section 7.4.6 for details. 13See Chapter 5 for details. 14See Section 7.4.5.

Before exploring the unrolling of non-counted loops, we started to experiment with a novel transformation that we call fast-path loop creation, an optimization for multi-back-edge loops. Multi-back-edge loops are very common in many applications; for example, they can arise from conditional statements at the end of a loop's body or from continue statements in loops. Additionally, a compiler often creates them as a result of inlining function calls into loops15. We devised this optimization for loops with multiple back edges because the probability of their back edges is often not equally distributed. If profiling information [196] indicates that a set of back edges is taken with a very high likelihood, the other back edges may hinder the optimization of the loop for two main reasons:

• A side-effect can influence the scheduling of anti-dependencies16 in the loop [77].

• A back-edge value flowing into a loop ϕ [39] can make a loop non-counted.

Consider the code in Listing 7.12. The loop-invariant read [29] in line 4 cannot be hoisted out of the loop, because the memory location may alias with the write in line 7. If alias analysis [51; 122] cannot prove that the loop variable i is different from k for all possible values of i and k, the compiler has to assume a write-after-read dependency from the read to the write within one iteration, which results in a read-after-write dependency from the write in iteration n to the read in iteration n + 1.

1 double result = 0;
2 for (int i = 0; i < arr.length; i++) {
3     if (...) {
4         result += arr[k];
5     } else {
6         // write that can alias
7         arr[i] = result;
8     }
9 }

Listing 7.12: Multi-back-edge Loop.

Assuming that, for example, the false branch in the loop has a low probability in comparison to the true branch, the read has to be performed in every iteration although the write is rarely executed. One possibility to still hoist the read out of the loop is to use fast-path loop creation. The general idea of fast-path loop creation is to create an outer, slow-path loop for all back edges of the loop that have a low probability. The inner, fast-path loop then only consists of frequently taken back edges. The outer loop is a while(true) loop that is exited under the same conditions as the inner loop. Listing 7.13 shows the loop from Listing 7.12 after fast-path loop creation in pseudo code.

15Callees with multiple returns can cause multiple loop back edges if a callsite is inside the body of a loop. Loop back edges can always be merged to a single one; however, this requires more jumps in the generated code, thus many compilers model loops with multiple back edges. 16Anti-dependencies are also known under the term write-after-read dependency and represent forward dependencies from read operations to writes. For example, a read has to be performed before an unknown side-effect that may write to the location of the read. Side-effects are always anti-dependencies for dominating reads inside a loop's body.

double result = 0;
int i = 0;
outer: while (true) {
    // initial read
    tmp = arr[k];
    inner: for (; i < arr.length; i++) {
        if (...) {
            result += tmp;
            continue inner;
        } else {
            // write that can alias
            arr[i] = result;
            // continue outer, perform read again
            continue outer;
        }
    }
    break outer;
}

Listing 7.13: Multi-back-edge loop after fast-path loop creation.

Once the compiler has created the outer loop, it can hoist the read from the inner loop to the outer loop. This promotes the write-after-read dependency from the inner fast-path loop to the outer slow-path loop. The slow-path loop is only executed if the else branch in the inner loop is taken; therefore, its probability is equal to that of the else branch.

7.4.2 Algorithm

We present a graphical representation of the fast-path creation algorithm in Figure 7.3, which shows the IR of a generic loop during the transformation. First, the compiler creates the outer loop header, which becomes the slow-path loop header. Then, the compiler re-routes the slow-path loop back edges to the slow-path (outer) loop header.

[Figure: IR of a generic loop in three stages: the original loop, the creation of the slow-path (outer) loop header, and the re-routing of the slow-path back edges to the new outer loop header; loop exits and loop ends are preserved. Legend: control-flow node, control-flow edge, fast path, slow path, loop header association.]

Figure 7.3: Fast-path loop creation example.

We present the detailed algorithm for fast-path loop creation in the Appendix in Chapter 11 in Algorithm 8.

7.4.3 Discussion

Fast-path loop creation has the advantage that the compiler does not need to duplicate code in order to optimize may-aliasing memory accesses. Although code duplication can often improve performance, it inherently increases code size, which a compiler tries to minimize. Fast-path loop creation is a useful optimization on its own. Although we did not evaluate the approach in detail, we determined peak performance improvements of up to 50% for micro-benchmarks. However, in our work, we focused on the unrolling of non-counted loops and used this optimization as a starting point for a fast-path loop unrolling algorithm described in Section 7.4.4.

7.4.4 Non-Counted Loop Unrolling via Fast-Path Loop Creation

We can extend the algorithm for fast-path loop creation to perform fast-path unrolling of non-counted loops. First, we identify those loops that should be unrolled. The compiler then creates the slow-path outer loop for these loops.17 After the slow-path loop is created, the compiler uses a loop peeling transformation [7] to duplicate the body of the inner fast-path loop (including its loop exit conditions) u times, where u denotes the number of desired unrollings. After the peeling step, the inner fast-path loop is removed by deleting inner loop exits and re-routing inner loop back edges to the outer loop. The loop ϕs of the inner loop are replaced with inputs coming from the single non-back-edge predecessor. In a final step, the compiler removes the inner loop header. Figure 7.4 shows the example from Figure 7.3 after fast-path loop creation, during peeling and inner loop removal. Unrolling loops with the proposed technique has one major advantage over general and partial unrolling of loops: as we re-route all back edges that we consider part of the slow path to the outer loop header, only the fast-path back edges remain connected to the inner, fast-path loop. Peeling the fast-path loop and removing it afterwards effectively unrolls only the fast path of the original code and not its slow path. Therefore, we keep the code-size increase at a minimum, assuming the correctness of the profiling information.

17This also works for loops with only one back edge. In this case, the inner loop temporarily has no back edge and is degenerated until the transformation is finished.

[Figure: the loop from Figure 7.3 after fast-path loop creation, shown first with one fast-path iteration peeled and then with the inner fast-path loop removed; the peeled iteration's loop exits are merged with those of the remaining loop. Legend: control-flow node, control-flow edge, fast path, slow path, loop header association.]

Figure 7.4: Path-based unrolling via fast-path loop creation and peeling.

We perform fast-path loop unrolling independently of the fast-path loop creation optimization. If a loop should be unrolled, we perform the unrolling transformation. Later in the compilation pipeline, we perform a dedicated analysis to detect optimizable patterns like the one from Listing 7.12, for which we create the fast-path loop independently of any prior unrollings. The fast-path loop creation optimization and fast-path loop unrolling share IR transformations, but semantically they are two separate optimizations.

7.4.5 Fast-Path Loop Unrolling Algorithm

In the following, we explain the fast-path unrolling algorithm in detail. Graal only performs a limited set of optimizations on non-counted loops18; however, it performs a multitude of optimizations on counted loops. Fast-path loop unrolling also works for counted loops without any changes, but we do not want to interfere with existing optimizations, that is, we want to prevent creating IR patterns that are not optimizable by other transformations. Therefore, we ignore counted loops as well as loops for which the compiler can infer the iteration count after hoisting loop-invariant instructions out of the loop's body. We use the notion of unrolling opportunities19: opportunities model optimizations enabled by unrolling a loop. We implemented a set of analysis algorithms determining optimization opportunities that can be enabled by loop unrolling20.
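A minimal sketch of this abstraction could look as follows (the names Loop and UnrollingOpportunity are our stand-ins mirroring the driver in Algorithm 5; the actual Graal interfaces differ):

// Stand-in for the compiler's loop data structure.
interface Loop { /* loop header, back edges, body, profiling info, ... */ }

// One unrolling opportunity bundles the applicability check and the
// simulation: it simulates unrolling the given loop and returns how
// many unrollings enable the optimization it models, or 0 if none do.
interface UnrollingOpportunity {
    int shouldUnroll(Loop loop);
}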

We follow the simulation-based duplication approach proposed in Chapters 1, 4 and 6 and group the unrolling optimization into the three simulation-based duplication steps:

1) Simulation: We simulate the unrolling of a loop and identify the optimization potential.

2) Trade-Off: We trade off the expected performance gains against the code-size increase of an unrolling transformation and decide which loops should be unrolled and how many times.

3) Optimization: Finally, we perform the unrolling transformations followed by the identified optimizations.

We first identify a loop's optimization potential after unrolling by simulating the transformation and then decide whether we want to unroll it. Algorithm 5 shows the non-counted loop unrolling optimization. We filter out all loops that are counted and therefore ignored by our approach. Then, we check a set of implemented opportunities on a loop. Each opportunity op performs a simulation of the loop after unrolling it (op.shouldUnroll(loop)) and analyzes the loop for optimizable patterns in unrolled iterations. This follows the notion established in Chapter 4: each opportunity op has an applicability check AC(m = LoopHeader, pi = loopBackedge) and an optimization step Opt(m = LoopHeader). If such an optimization is found, we use the node cost model to compute the overall run time of the loop's body and the overall run time of the loop's first (unrolled) iteration. If the estimator predicts that the unrolled iteration has a lower run time than the original loop's body, we judge whether the expected performance increase justifies the expected code-size increase21. We derived the upper bound of unrolling iterations via empirical evaluation. We performed our experiments from Chapter 8 with an upper limit of 2 to 64 unrollings. Although some special code patterns are sensitive to a high number of unrollings, we could not generally measure noticeable differences with an upper limit higher than 4. Thus, we unroll a loop by a maximum unrolling factor of 4. That means we first create the fast-path loop (Section 7.4.1) and then peel the inner fast-path loop at most four times. Finally, we remove the inner fast-path loop. Unrolling only those loops for which we know that there is sufficient optimization potential after unrolling allows us to keep the code-size and compile-time increase at a moderate level22.

18Graal performs loop peeling and inversion on non-counted loops. 19We established this term in Chapter 6 for optimization opportunities after duplication/unrolling. 20See Section 7.3. 21See Section 7.4.6.

Algorithm 5: Path-based unrolling algorithm.
Data: ControlFlowGraph cfg
Result: Optimized ControlFlowGraph cfg
[1]  outer: for Loop loop in cfg.loops () do
[2]      if isCounted (loop) then
[3]          continue outer;
[4]      for Opportunity op in UnrollingOpportunities do
[5]          int unrollings ← op.shouldUnroll (loop);
[6]          if unrollings > 0 then
[7]              createFastPathLoop (loop);
[8]              for i in 0 . . . unrollings do
[9]                  peelIteration (loop);
[10]             end
[11]             removeFastPathLoop (loop);
[12]     end
[13] end
[14] runCleanUpOptimizations (cfg);

7.4.6 Unrolling Trade-Off Heuristic

The final decision whether to unroll a loop is made by a trade-off function that takes several variables relevant to the unrolling decision into account:

• Maximum compilation unit size: The virtual machine imposes an upper bound for code size per compilation unit.

• Initial graph size: The heuristic must scale dynamically with the initial size of the compilation unit.

• Cycles saved: The number of estimated cycles saved inside the loop body by unrolling it. This value is calculated during the simulation of the unrolling of the loop. The compiler estimates the overall cycles of the loop and the cycles of the loop after unrolling and computes the difference.

22See Chapter 8. 7.4 Fast-Path Unrolling of Non-Counted Loops 105

• Code-Size Increase Budget: Prior work [120] on the Graal compiler derived constants for the maximal code-size increase of single optimizations. Therefore, all non-inlining-based optimizations in the Graal compiler are limited in their code-size increase to 50%. The reason is that code-size increases also increase compile time, as the workload for subsequent compiler optimizations grows.

• Byte per cycle spending: In order to relate an estimated benefit with its cost we compute the ratio between code-size increase and run-time improvement. Thus, we need to specify how much code-size increase (in bytes) we are willing to accept for every saved cycle. This is currently configured to be 512 bytes per saved cycle. We derived this value via a structured experiment running our benchmarks from Chapter 8 with all powers of 2 between 2 and 1024. The value 512 generated the best results for those benchmarks.

All operands are computed by the optimization opportunities23.

The final unrolling decision is made by the algorithm given as pseudo code in Algorithm 6 and is based on the following trade-off heuristic:

l ... Loop
u ... Unrollings
l.s ... Loop Size
l.cp ... Cycles saved per loop iteration
l.oc ... Overall cycles of a loop including condition
cs ... Compilation Unit Size
is ... Compilation Unit Initial Size
IB ... Code-size Increase Budget = 1.5
MS ... Max Compilation Unit Size
MU ... Max Unrollings = 4
BS ... Bytes/Cycle Spending = 512

canUnroll(loop) ↦ is ∗ IB < MS ∧ cs + u ∗ l.s < MS
shouldUnroll(loop) ↦ l.cp ∗ BS > l.s
nrOfUnrollings(loop) ↦ u ≡ min(MU, l.cp / l.oc ∗ MU ∗ 10)

23See Section 7.4.

Algorithm 6: Unrolling decision algorithm.
    /* Size Restriction */
[1] if canUnroll (loop) then
        /* Trade-off Heuristic */
[2]     if shouldUnroll (loop) then
            /* Compute Final Unrolling Factor */
[3]         return nrOfUnrollings (loop);
[4] return 0;

The compiler unrolls a loop nrOfUnrollings times if it has sufficient optimization potential. It determines this by comparing the cycles saved per iteration, multiplied by a constant factor expressing how much code size we are willing to spend for one cycle of run-time reduction, with the size of the loop.
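For illustration, consider a hedged Java transcription of the heuristic (the constants mirror the symbols above; the Loop record and all field names are our own stand-ins). With hypothetical numbers, a loop of l.s = 600 bytes that saves l.cp = 2 cycles per iteration passes shouldUnroll, since 2 ∗ 512 = 1024 > 600, and with l.oc = 20 overall cycles it receives min(4, 2/20 ∗ 4 ∗ 10) = 4 unrollings:

// Hedged transcription of the trade-off heuristic from Section 7.4.6.
final class UnrollingTradeOff {
    static final double IB = 1.5; // code-size increase budget
    static final int MU = 4;      // maximum number of unrollings
    static final int BS = 512;    // bytes spent per saved cycle

    // size in bytes, cycles saved per iteration, overall cycles per iteration
    record Loop(int size, int cyclesSaved, int overallCycles) {}

    static boolean canUnroll(Loop l, int u, int unitSize, int initialSize, int maxUnitSize) {
        return initialSize * IB < maxUnitSize
            && unitSize + u * l.size() < maxUnitSize;
    }

    static boolean shouldUnroll(Loop l) {
        return l.cyclesSaved() * BS > l.size();
    }

    static int nrOfUnrollings(Loop l) {
        return Math.min(MU, l.cyclesSaved() * MU * 10 / l.overallCycles());
    }
}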

7.5 Loop-Wide Lock Coarsening

Based on the work we did for benchmarking concurrent applications on the JVM [158] and our novel approach for duplication24 and unrolling of non-counted loops, we stumbled across a special code pattern in synchronized Java code within loops. We illustrate such an example in Listing 7.14, which shows a simple Java loop containing a synchronized statement. The critical region is entered in every iteration of the loop. However, the monitor of the synchronized block is loop-invariant, i.e., it does not depend on induction variables or memory effects of the loop and is thus scheduled before the loop.

Object monitor = ... /* loop invariant object */ ...;
for (...) {
    synchronized (monitor) {
        ... critical region ...
    }
}

Listing 7.14: Synchronization loop in Java.

The semantics of Listing 7.14 are clear: in every iteration of the loop a different thread has the chance to acquire the lock. However, the lock can also be held by a single thread throughout the whole loop, since the synchronized statement has no fairness guarantees [158]. Heavy contention on the lock can cause performance penalties because of the locking overhead of the synchronized statement. For short loop bodies, the locking overhead can be significant, especially if the lock is contended and inflated [90].

7.5.1 Use Cases

We observed that many standard Java workloads execute code similar to the example in Listing 7.14. Java's synchronization mechanisms25 allow methods to be marked with the synchronized keyword, which sequentializes concurrent access to these methods just like the synchronized statement does.

24See Chapter 6. 25See Section 6.1.5.

 1 class SynchronizedList {
 2     ...
 3     // synchronizes on the this object
 4     public synchronized void add(Object o) {
 5         ...
 6     }
 7     ...
 8 }
 9
10 SynchronizedList foo() {
11     Object[] objects = ... some source ...;
12     SynchronizedList l = new SynchronizedList();
13     for (Object o : objects) {
14         l.add(o);
15     }
16     return l;
17 }
18
19 SynchronizedList fooAfterInlining() {
20     Object[] objects = ... some source ...;
21     SynchronizedList l = new SynchronizedList();
22     for (Object o : objects) {
23         monitorenter(l);
24         ... critical region ...
25         monitorexit(l);
26     }
27     return l;
28 }

Listing 7.15: Synchronized list.

Synchronized methods are heavily used in synchronized Java collection libraries such as java.util.Vector or the lists returned by java.util.Collections.synchronizedList. These collections typically synchronize on the list object itself. We observed code like Listing 7.15, where a synchronized method is called inside a loop. After inlining the call of add into foo, the code in lines 23-25 is produced, exhibiting a critical region inside a loop with a loop-invariant monitor.

7.5.2 Fast-Path Tiling

In order to reduce the synchronization pressure on the monitor of such loops, we propose to hoist the synchronized block out of the loop. However, in order to avoid reducing the fairness of the lock too much, we propose to tile the iteration space of such loops with fast-path loop creation (Section 7.4.1). Based on fast-path loop creation we developed an optimization that allows us to pull a lock out of a loop. In order to explain the basic idea, we refer to Listing 7.16, which shows a minimal loop exhibiting the optimization potential. To reduce the synchronization overhead of performing the locking in every iteration of the loop, we want to hold the lock for multiple iterations, thus only performing a fraction of the lock-unlock operations. We propose to create an outer, slow-path loop without deleting the inner loop. This allows the compiler to pull the monitorenter and monitorexit out of the inner loop and to promote them into the outer slow-path loop. That way, the code does not need to acquire the lock in every iteration, but can keep it for a higher number of iterations. Consider the code in Listing 7.17, which shows the loop from Listing 7.16 after creating an outer slow-path loop, entered every 32nd iteration, around the otherwise unchanged inner fast-path loop. Every 32nd26 iteration, the generated code exits the inner loop, releases the lock and performs the back edge of the outer loop, leading to a re-acquisition of the lock before re-entering the inner loop.

for (...) {
    monitorenter(loopInvariantObject)
    ... critical region ...
    monitorexit(loopInvariantObject)
}

Listing 7.16: Loop-wide lock coarsening before.

int roundTrips = 0
while (true) {
    monitorenter(loopInvariantObject)
    roundTrips = 0 // start a new tile of 32 iterations
    for (...) {
        ... critical region ...
        if (roundTrips > 32) {
            break; // end of tile: release and re-acquire the lock
        }
        roundTrips++;
    }
    monitorexit(loopInvariantObject)
    // (the exit of the outer loop once the inner loop is done is elided)
}

Listing 7.17: Loop-wide lock coarsening after.

7.5.3 Loop-Wide Lock Coarsening Algorithm

We present our approach for loop-wide lock coarsening in Algorithm 7. First, the algorithm determines for which loops it is valid to have their locks coarsened, before coarsening heuristics decide if it is beneficial to do so. Finally, the actual transformation is carried out using the fast-path loop creation algorithm from Algorithm 8. It is important to mention that not all locks can be pulled out of a loop. There are a few restrictions that must be fulfilled for a compiler to perform the transformation:

• Number of loop back edges: All back edges of a loop must lock and unlock the same objects in the same order. This is necessary as the order of lock and unlock must be preserved when pulling a lock out of a loop.

26This is a constant we determined via empirical evaluation. It turned out to generate the best results for our benchmarks. For the evaluation we refer to Chapter 8.

• Deoptimization: Loop headers are control-flow merges. They merge the forward predecessor and the back edges of a loop. Thus, Graal maintains frame states27 at loop headers to support deoptimization. For deoptimization, the correct frame states must be maintained for the body of a loop. If a deoptimization happens inside a loop before or after a synchronized region, the interpreter state must be preserved. This imposes a challenge when pulling locks outside of a loop: if there is no interpreter state in which the code inside the loop is executed while a lock of the synchronized region is on the evaluation stack, the interpreter cannot continue execution, as it does not know that the optimization pulled the lock outside the loop and enlarged the scope of the lock. HotSpot run-time support for releasing the lock during deoptimization would be required, which is currently not implemented. Thus, we only pull locks out of loops that synchronize an entire loop body (see lines 5-6 in Algorithm 7).

Algorithm 7: Loop-wide lock coarsening algorithm.
Data: ControlFlowGraph cfg
Result: Optimized ControlFlowGraph cfg
     /* Find loops */
[1]  Loops[] candidates ← [];
[2]  Loops loops ← cfg.computeLoops ();
[3]  for Loop l in loops do
         /* See if there are monitors in the loop */
[4]      if l.instructions () contains MonitorEnter then
             /* The monitorenter must be the first instruction in the loop */
[5]          if l.instructions ()[0] isa MonitorEnter then
                 /* The exit must be the last instruction in the loop */
[6]              if l.instructions ()[l.instructions ().length - 1] isa MonitorExit then
                     /* Monitor enter and exit must operate on the same object */
[7]                  MonitorEnter enter ← l.instructions ()[0];
[8]                  MonitorExit exit ← l.instructions ()[l.instructions ().length - 1];
[9]                  if enter.object () == exit.object () then
[10]                     candidates.add (l);
[11] end
[12] for Loop l in candidates do
         /* See if the heuristics decide the loop should be tiled */
[13]     if heuristics.shouldTile (l) then
             /* Create the inner loop exit based on the tiling factor */
[14]         createExitCheck (l);
             /* Create the fast-path loop from Algorithm 8 */
[15]         createFastPathLoop (l);
             /* Perform the actual coarsening from fast-path to slow-path loop */
[16]         moveEnterExitToSlowPathLoop (l);
[17] end
[18] return cfg;

27See Chapter 3 for details.

7.5.4 Loop-Wide Lock Coarsening Heuristics

We implemented a very simple heuristic to decide if a lock inside a loop should be pulled outside via loop-wide lock coarsening. The basic idea can be seen in Listing 7.1828 and follows the assumption that the body of a loop must be sufficiently more expensive in latency than the additional roundTrips check that is added to each iteration of the loop29.

 1 boolean shouldTile(Loop l) {
 2     // Do not tile if the loop is never executed
 3     if (l.frequency() < nodeCostModel.latencyForType(IfInstruction)) return false;
 4     int latencySum = 0;
 5     // Collect a latency estimation for one iteration of the loop
 6     for (Instruction i : l.instructions()) {
 7         latencySum += nodeCostModel.latencyFor(i);
 8     }
 9     // The latency of the additional if added to the fast-path loop
10     int latencyTileCheck = nodeCostModel.latencyForType(Condition) +
11                            nodeCostModel.latencyForType(IfNode);
12     // Real heuristic: the body of the loop must be long enough to dominate the added
13     // tile check
14     return latencySum > 2 * latencyTileCheck;
15 }

Listing 7.18: Lock coarsening tiling heuristic.

We implemented the loop-wide lock coarsening algorithm on top of GraalVM and evaluated its performance in Chapter 8, where we give a detailed evaluation of the optimization, showing that it can increase performance by up to 50% for selected benchmarks.

7.5.5 Safepoint Tiling

Based on the loop-wide lock coarsening, we applied the same idea to safepoints. As mentioned before in Section 7.3, loops for which the compiler cannot deduce that their trip count is within 32-bit bounds require the compiler to generate safepoint polls in order to respect throughput-based service level agreements. This always applies to non-counted loops, as the compiler does not know how many iterations of a non-counted loop will be executed. Therefore, Graal generates safepoint polls on every loop back edge. However, in order to avoid reading the safepoint page in every iteration of a loop, the compiler could hoist the safepoint out into a slow-path loop and perform the actual safepoint poll only from time to time, depending on the tiling factor. Consider Listing 7.19, which shows a generic Java loop. The compiler is forced to generate a safepoint poll on the loop back edge, as it cannot prove that the loop will finish in acceptable time, thus demanding a safepoint in order for the application to be responsive.

28For simplicity we assume the body of a loop just contains a single block for which the compiler can iterate all instructions (lines 6-8). In our actual implementation we build a superblock through the body of the loop, merge all loop back edges in this process, and iterate the superblock. 29What is currently missing in the heuristic is a trade-off between the cost of the synchronization code itself and the cost of the added trip check. If a lock inside a loop is biased [90], replacing it with an additional check and a branch may be more expensive than the original biased lock. Since HotSpot and GraalVM currently do not implement profiling of the type of a lock, it is infeasible for the compiler to decide if a lock is biased, fast-path accessed or inflated at run time. Therefore, we refrain from modeling this in our heuristics for now.

while (a) {
    ... /* loop body */ ...
    safepointPoll()
}

Listing 7.19: Safepoint tiling opportunity before optimization.

However, the safepoint poll consumes cycles itself. We propose to use the same scheme as for loop-wide lock coarsening and apply fast-path loop creation to the loop to tile the iteration space and pull the safepoint poll into the slow-path loop. The compiler could apply this optimization and produce the code in Listing 7.20, which only performs the safepoint poll in every nth iteration.

int roundTrips = 0
while (true) {
    roundTrips = 0 // start a new tile of 32 iterations
    while (a) {
        ... /* loop body */ ...
        if (roundTrips > 32) {
            break; // end of tile: perform a safepoint poll
        }
        roundTrips++;
    }
    safepointPoll()
    if (!a) break; // leave once the original loop condition no longer holds
}

Listing 7.20: Safepoint tiling opportunity after optimization.

We have not evaluated this optimization. However, we consider it interesting for future work on applications of the fast-path unrolling/creation optimization.

Chapter 8

Evaluation

In this chapter we present an extensive performance evaluation of all the algorithms and optimizations presented in this thesis. We show that simulation-based code duplication can significantly improve performance of Java applications and that a simulation-based solution outperforms hard-coded heuristics.

Evaluating the functionality of compiler optimizations is vital to understand the implications created by optimizing a program during compilation. Optimizations are traditionally justified by the performance improvements they generate. Dynamic compilers, however, have additional requirements, namely to keep compilation time and code size small. Thus, multiple success metrics become relevant for the performance of a dynamic compiler. In our evaluation we seek to determine whether simulation-based duplication can improve the performance of Java applications and if the simulation scheme can outperform hard-coded heuristics. We evaluated our approach and our implementation of simulation-based duplication on top of the GraalVM by running and analyzing a set of industry-standard benchmarks. We present detailed evaluations for simulation-based duplication (Chapter 4), DBDS (Chapter 6), fast-path loop unrolling (Chapter 7) and loop-wide lock coarsening (Section 7.5).

8.1 Evaluation Methodology

In order to reduce external influences on our experiments, we followed state-of-the-art benchmarking methodology for dynamic compilers. All experiments were conducted with GraalVM [140; 148; 213], a modified version of the HotSpot [93] JVM; thus, we also try to adhere to common JVM benchmarking methodology [75]. In this section we present our experimental setup, including hardware configuration, software configuration, the measured metrics, the benchmarks and the benchmark setup.

8.1.1 Hardware

For our evaluation we used a dedicated desktop machine equipped with a desktop-class Intel i5 processor1 with 4 cores, featuring 16 GB of DDR RAM @ 800 MHz and a core speed of 3.4 GHz. We disabled frequency scaling and turbo boost [100] in the BIOS of the machine to obtain more stable results.

8.1.2 Software

The benchmark machine was running Fedora 29 with a 4.19.6-300 Linux kernel. The file system was standard ext4. The machine was running the minimal tiling-based window manager i3. We disabled Bluetooth, Ethernet and WLAN to reduce external influences. No software other than the minimal system requirements was running. We wrote a BASH harness that calls each benchmark individually and cleans up any remaining files (if there were any) in the working directory. The BASH harness called a Python-based benchmark harness that captured STDOUT and STDERR of each benchmark to extract performance numbers. On shutdown of each benchmark, an aggregated JSON result file was written by the Python harness and stored in a common results directory. For benchmark analysis we used R [70] with the tidyverse [191] data processing framework.

8.1.3 Benchmarks

In this section we present the set of industry-standard Java and JavaScript benchmarks we used for our experiments. For each benchmark we give a short description as well as some notes on the configuration of the benchmark in the experimental setting.

8.1.3.1 Java SPECjvm2008

Java SPECjvm2008 [189] is a Java benchmark suite designed to measure the performance of the Java runtime environment. It does so by combining real-world applications testing Java core features including JIT compilation, garbage collection, the memory subsystem, etc. The benchmark harness comprises a warmup and a measurement period, during each of which the benchmark workload is executed as often as possible. The warmup period is used to stabilize the application code with respect to dynamic compilation and deoptimization. Results of the warmup duration are discarded. The result of the benchmark is the number of finished iterations of the workload within the measurement duration. In addition to the standard workloads of the suite, SPECjvm2008 also contains workloads in which the benchmark workload is executed once in order to measure the warmup time. We removed those benchmarks from our measurements as we are interested in peak performance. We configured the suite to use 120 seconds for warmup and 120 seconds for measurement. The standard execution harness executes all benchmarks within one execution of the virtual machine. However, this introduces the problem of profile pollution [143]. HotSpot does not support context-sensitive profiling [5; 87]; therefore, executing the set of heterogeneous workloads inside one VM is sub-optimal to judge the compiler's effectiveness at optimizing a program. Thus, we executed every benchmark workload in a dedicated invocation of the JVM.

1Model: Intel Core i5 750.

8.1.3.2 Java DaCapo

The Java DaCapo suite was originally proposed by Blackburn et al. [11]. It was released because of the need for a Java benchmark suite that makes use of the rich and complex features of the Java programming language and its ecosystem. Benchmark suites such as SPECjvm2008 behave more like traditional C and C++ applications that do not heavily exercise dynamic features of the JVM. Java DaCapo comprises selected real-world applications, including a wide variety of different applications from PDF processing to server applications. The suite heavily tests the performance of JIT compilation, garbage collection, file I/O and networking. The original suite contains 14 benchmarks. However, we used version 9.12, for which we excluded the benchmarks eclipse, tomcat, tradebeans and tradesoap, as they are no longer supported by JDK versions > 8u92. There are newer DaCapo releases that fixed these issues. However, in order to be comparable with the numbers collected at the beginning of the work for this thesis, we refrained from updating.

All DaCapo benchmarks report performance as run time in milliseconds per iteration of the workload. The user of the suite is responsible for choosing sensible values for the number of iterations and for computing any final performance result. In order to ensure that the compiler as well as the application code are warmed up, we selected the number of warmup iterations based on the behavior of a benchmark with respect to JIT compilation. We define the number of warmup iterations w to be the number of iterations necessary for the compilation frequency to drop below a minimal stable rate. This is the point where we assume warmup to be over. We cannot warm up all benchmarks until all compilations are done, as several benchmarks perform bytecode generation, forcing compilation in every iteration of the benchmark.

Since Java DaCapo is a run-time-based benchmark suite, we use a fixed number of iterations to gather performance data from the benchmark execution. For the final measurement iterations, we took m iterations of the benchmark, where m = 2 ∗ w. We computed the arithmetic mean of those m iterations and used it as one result for a given benchmark. We executed every benchmark several times, each time with a dedicated execution of the VM. So the final results for one benchmark are n measurements, where each measurement ni is the arithmetic mean of the last m in-process iterations of a benchmark.
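As a hedged sketch of this aggregation (our own code, not the actual harness), one VM invocation is reduced to a single data point as follows:

import java.util.Arrays;

// One DaCapo data point: the arithmetic mean of the last m = 2 * w
// in-process iterations of a single VM invocation; earlier (warmup)
// iterations are discarded.
final class DaCapoAggregation {
    static double dataPoint(double[] iterationMillis, int w) {
        int m = 2 * w;
        double[] measured = Arrays.copyOfRange(
                iterationMillis, iterationMillis.length - m, iterationMillis.length);
        return Arrays.stream(measured).average().orElseThrow();
    }
}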

8.1.3.3 ScalaBench

ScalaBench, originally proposed by Sewe et al. [176], is the DaCapo benchmark suite for the Scala [65] programming language. The JVM was originally designed to execute bytecode [124] generated by the javac compiler, which compiles Java source code to bytecode. However, over time other languages started to compile to JVM bytecode. The most prominent one besides Java is Scala. Scala [138] is a statically-typed programming language that combines object-oriented and functional programming in one high-level language. This is the reason why Scala workloads typically differ from Java workloads in their type and class hierarchy behavior (as described by Stadler et al. [182]). Scala applications tend to be much more dynamic than Java applications. ScalaBench contains several real-world applications selected from open-source projects in the Scala community. The process to derive the number of warmup and measurement iterations was similar to that for the Java DaCapo benchmark suite.

8.1.3.4 Renaissance

The Renaissance [158]2 benchmark suite was developed because other benchmark suites lacked inherently parallel and concurrent Java and Scala workloads. Renaissance contains several benchmarks that exercise different parallel programming paradigms, different synchronization mechanisms and primitives, as well as atomic operation performance. In addition to the Renaissance benchmarks, we used a few other Java and Scala benchmarks addressing novel JVM features (since Java 8) such as streams and lambdas. The process of determining warmup and measurement iterations was the same as for Java DaCapo and ScalaBench.

8.1.3.5 JavaScript Octane

JavaScript Octane [22] was a widely used JavaScript benchmark suite containing workloads ranging from 500 LOC to 300 000 LOC. The suite addresses JIT compilation, garbage collection, code loading and parsing of the executing JavaScript VM. The suite was retired in 2017 by the V8 team [79] and is no longer actively maintained. The usage of the suite is controversial, as it primarily tests the VM, garbage collector and compiler and not how the VM behaves in the presence of a user interface3. However, for our use case the suite is still a good fit, especially because it puts a lot of pressure on the compiler. We measured Octane performance using Graal JS [213], the JavaScript implementation on top of Truffle, which is comparable in peak performance to Google's V8 [79]. Graal JS is on average 17% slower than JavaScript on the V8 VM [213]. The JavaScript Octane benchmarks are throughput-based. We run a warmup period to ensure that the application is compiled and compute the throughput for the suite in the measurement duration.

2Note that we proposed and developed loop-wide lock coarsening from Chapter 7 originally during our work on [158].

8.1.3.6 JavaScript jetstream asm.js

JavaScript jetstream asm.js [155] is a benchmark suite originally developed for the WebKit [98] JavaScript VM to also include novel features such as asm.js in a JavaScript benchmarking suite. It contains most of the Octane benchmarks together with new ones. Therefore, we excluded all non-asm.js [135] benchmarks from the jetstream suite. JetStream benchmarks are also throughput-based.

8.1.4 Benchmark Configuration

We derived a uniform setup for all benchmarks and executed them sequentially on our benchmark machine. Each benchmark was configured to use a start and max heap size of 10 GB. The benchmarks invoke System.gc() between iterations to remove the data of one iteration before doing another iteration. We also wanted to keep the overall GC impact as small as possible. Thus, we chose the Serial Collector, which only uses one GC thread.

8.1.5 Metrics

In order to have valid arguments about the impact of our optimizations, we measured three different metrics in our experiments: peak performance, code size and compile time.

Performance is reported by each benchmark suite itself. Java DaCapo, ScalaBench and Renaissance report performance in milliseconds per iteration. Java SPECjvm2008, JavaScript Octane and JavaScript jetstream asm.js measure throughput performance and report a score after execution.

3The requirements of UI-based applications are different from those of server applications, i.e., for UI-based systems low latency is more important than the highest possible peak performance.

Compile Time was measured programmatically in the compiler itself. Graal supports timing statements that are used throughout the compiler.

Code Size was measured programmatically with a counter in the compiler.

For each configuration presented in the subsequent sections we executed every benchmark suite 20 times4.

8.1.6 Warmup & Metacircularity

GraalVM is a metacircular system, i.e., the compiler is itself implemented in Java. We used the standard configuration of tiered compilation [10; 151]. This means that upon its first invocation, every method is executed by a template-based, handwritten assembler interpreter. During interpretation, code is profiled, and hot functions are first compiled by the baseline compiler C1 [109] and then by Graal. In this configuration, Graal is itself not compiled with Graal but only with C1, in order to reduce warmup time [9].

8.2 Experiments

In this section, we present a detailed description of all experiments, together with the result data and an interpretation. We report two different kinds of results: condensed results and raw data. For the sake of brevity, we present condensed results in the form of tables in Tables 8.1 to 8.8, 8.10 and 8.11, and we present the raw data in the form of boxplots [72] in the appendix (Figures C.1 to C.30). Additionally, we present data for the combined performance impact of the work of this thesis in Figures C.31 to C.36. The table-based results show the combined impact of compiler configurations on an entire benchmark suite. We normalized our experiment data to a baseline configuration of GraalVM with all optimizations proposed in this thesis disabled. To aggregate the data for a table-based representation, we computed the median5 of each benchmark and configuration from the normalized data and report the geometric mean [67] as well as the minimum and maximum value for compile time, code size and performance6 across an entire benchmark suite. Since all data has been normalized to the default configuration of GraalVM without any of the proposed optimizations enabled, a value of 1 means that the respective metric (code size, compile time, throughput or run time) is unchanged with respect to the unoptimized baseline. In order to properly plot the raw data of our experiments in Figures C.1 to C.30, we removed outliers that disturbed the scale of the plots. We therefore inspected all benchmarks and configurations and manually determined whether a data point is an outlier. For benchmarks where certain configurations produce data orders of magnitude higher or lower, we did not remove such points from the raw data. While we believe that this filtering is necessary to properly plot the data, outlier filtering may always be subject to bias and thus may contain errors.

4This accumulated to more than two months of machine time. 5See Section 8.1: For each benchmark and configuration we have 20 data points. For run-time-based benchmarks a single data point is computed as the arithmetic mean of the measurement iterations of a benchmark. For throughput-based benchmarks a data point is the throughput result of the measurement period. Most of the data samples are normally distributed, but not all of them; thus we used non-parametric statistics. 6Throughput and run time.
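A hedged sketch of this aggregation step (our own code; the actual analysis was done in R with tidyverse): each benchmark's median is normalized to the baseline median, and a suite is summarized by the geometric mean of those normalized medians:

import java.util.Arrays;

// Condensed-table aggregation: per-benchmark normalized medians,
// summarized across a suite with the geometric mean.
final class SuiteAggregation {
    static double median(double[] xs) {
        double[] s = xs.clone();
        Arrays.sort(s);
        int n = s.length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    // configRuns[b] and baselineRuns[b] hold the 20 data points of
    // benchmark b for the tested configuration and the baseline.
    static double suiteScore(double[][] configRuns, double[][] baselineRuns) {
        double logSum = 0;
        for (int b = 0; b < configRuns.length; b++) {
            logSum += Math.log(median(configRuns[b]) / median(baselineRuns[b]));
        }
        return Math.exp(logSum / configRuns.length);
    }
}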

In addition to the condensed analysis and the raw data, we conducted a significance analysis of the results in the appendix in Tables C.1 to C.6, which shows that simulation-based code duplication can significantly improve the performance of Java and JavaScript applications.

8.2.1 DBDS

In our first experiment, we want to demonstrate that simulation-based code duplication is useful and can significantly improve the performance of Java applications while maintaining a moderate code-size and compile-time increase. Additionally, we want to show that our proposed trade-off heuristics from Section 6.3 work as expected and deliver smaller code, with smaller compilation times, than an implementation of DBDS without any code-size or performance trade-off. Therefore, we tested all our benchmarks with the following two configurations:

• DBDS-0.25: This configuration enables DBDS from Chapter 6 with a code-size increase budget of 25%.

• DBDS-No-Model: This configuration implements the DBDS algorithm from Chapter 6 without any performance versus code-size trade-off from Section 6.3. In this configuration the optimization will perform transformations as long as it sees any kind of benefit.

The results of our experiments can be seen in Tables 8.1 and 8.2 which show the performance of both configurations along the three tested success metrics.

Interpretation The results show that the DBDS-0.25 configuration outperforms the DBDS-No-Model configuration in all benchmarks except Renaissance and JavaScript jetstream asm.js. In Renaissance, DBDS-No-Model is on average only 0.3% faster than DBDS-0.25, whereas in JavaScript jetstream asm.js DBDS-No-Model can optimize a read inside a loop that later can become loop invariant. This is an indirect effect enabled by DBDS-No-Model and is not intended; thus, we consider it to be a "lucky punch". Our proposed trade-off implementation also outperforms the DBDS-No-Model configuration in code size and compilation time for every single benchmark, always resulting in less code and smaller compilation times.

Benchmarks     Configuration    Compile Time ↓           Code Size ↓              Run Time ↓
                                Mean   Min    Max        Mean   Min    Max        Mean   Min    Max
Java DaCapo    DBDS-0.25        1.233  1.135  1.542      1.142  1.082  1.316      0.992  0.952  1.015
               DBDS-No-Model    1.996  0.978  2.833      1.573  0.663  2.09       1.021  0.948  1.085
ScalaBench     DBDS-0.25        1.136  1.078  1.224      1.089  1.021  1.137      0.988  0.92   1.02
               DBDS-No-Model    1.356  1.297  1.407      1.299  1.162  1.399      1.019  0.899  1.088
Renaissance    DBDS-0.25        1.161  1.049  1.309      1.096  0.825  1.243      0.995  0.904  1.028
               DBDS-No-Model    1.724  1.29   1.959      1.624  1.028  1.958      0.992  0.605  1.162

Table 8.1: DBDS performance run-time benchmarks. Best value of a given average metric is underlined and bold.

Benchmarks                   Configuration    Compile Time ↓           Code Size ↓              Throughput ↑
                                              Mean   Min    Max        Mean   Min    Max        Mean   Min    Max
Java SPECjvm2008             DBDS-0.25        1.147  1.078  1.452      1.101  1.026  1.438      1.009  0.979  1.092
                             DBDS-No-Model    1.693  1.293  2.188      1.666  1.272  2.319      0.943  0.722  1.057
JavaScript Octane            DBDS-0.25        1.204  1.123  1.299      1.135  1.086  1.2        1.065  0.817  1.324
                             DBDS-No-Model    1.61   1.412  1.797      1.508  1.317  1.676      0.931  0.263  1.423
JavaScript jetstream asm.js  DBDS-0.25        1.17   1.139  1.287      1.117  1.08   1.173      1.014  0.986  1.045
                             DBDS-No-Model    1.725  1.373  2.358      1.494  1.28   1.789      1.227  0.91   3.538

Table 8.2: DBDS performance throughput benchmarks. Best value of a given average metric is underlined and bold.

Compile-time increases for DBDS-0.25 range from 14% to 23% and code-size increases from 8% to 14%. We consider these increases to be moderate and a success. Performance-wise, the mean performance impact ranges from 0% to 6%, while no benchmark suite is on average slower with DBDS enabled. The best single performance increase, as can be seen in the min column for run-time-based benchmarks and the max column for throughput-based benchmarks, is about 32% for one of the JavaScript Octane benchmarks. We consider the results to be in favor of our trade-off tier. They clearly indicate that DBDS can improve the performance of Java applications and also show that a proper code-size versus compile-time trade-off is necessary. In Section 8.2.5 we present different parameterizations of the trade-off from Section 6.3 to show that it enables the compiler to perform a fine-grained code-size versus performance trade-off.

8.2.2 Simulation vs. Heuristic-Based Solutions

Our second category of experiments tries to validate the claim that simulation-based duplication outperforms heuristics, as presented in Chapter 4. We use DBDS (Chapter 6) as an optimization instance of simulation-based duplication and evaluate it against a very complex heuristic approach (Section 4.2.1.1) called path duplication. Path duplication is a tail-duplication optimization that implements an even more advanced form of a complex heuristic than the ones described in Section 4.2.1.1. Path duplication is based on the idea of precise heuristic-based duplication as presented in Section 4.2.1.1. In addition to the optimization opportunities presented in the complex heuristic in Section 4.2.1.1, path duplication also computes control-flow-dependent optimization opportunities such as partial-escape opportunities (Section 6.1.4) and conditional eliminations. Path duplication does not perform a precise or transitive code-size analysis. It only estimates code-size effects via the number of fixed nodes7 in the merge block. The implementation of the optimization in Graal was tuned on a large set of benchmarks8 to deliver performance improvements at acceptable code-size and compile-time trade-offs. Therefore, it was tuned over the years to not duplicate pathological cases that explode code size. This means that path duplication is already fairly good at keeping compile time and code size to a minimum. Additionally, path duplication was tuned to enable Graal-specific optimizations by analyzing optimizable patterns known to trigger in subsequent optimization phases. However, its duplication heuristic does not analyze transitive effects (like the ones from Chapter 4). Prior to our work on simulation-based duplication, it had been the predominant duplication optimization used in Graal. Therefore, we consider it to be a suitable candidate for evaluating the code-size and performance trade-offs of simulation-based duplication. We evaluated several configurations of DBDS and path duplication against a baseline version of GraalVM. In the baseline version, path duplication as well as all optimizations presented in this thesis were disabled. The configurations tested are:

• DBDS-0.05: This configuration uses DBDS from Chapter 6 with a code-size increase budget of 5%.

• DBDS-0.1: This configuration uses DBDS from Chapter 6 with a code-size increase budget of 10%.

• DBDS-0.25: This configuration uses DBDS from Chapter 6 with its default code-size increase budget of 25%.

• DBDS-1.0: This configuration uses DBDS from Chapter 6 with a code-size increase budget of 100%.

• PathDup: This configuration uses path duplication as a replacement for DBDS.

We normalized all results to a no-optimization baseline configuration. We evaluated all configurations from above with all benchmarks and report compile time, code size and performance. For the experiments we consider one of the following results to support our claims:

7See Chapter 3. 8Including all benchmarks presented in this chapter.

• PathDup: Should generate observably more code, because it applies code duplication. Therefore, it will also result in higher compilation times. However, it should also have a visible performance impact (positive or negative) for those benchmarks where duplication triggers.

• DBDS 0.05-1.0: For simulation-based duplication we expect the following (depending on the parameterization of the budget):

– Code Size: We expect code sizes to be generally smaller than for path duplication (depending on the configuration). However, this depends on the workload. For workloads with a limited number of duplication opportunities we expect DBDS to generate less code, because it only performs beneficial duplications. For workloads where DBDS finds more opportunities than path duplication (because it is aware of transitive effects), we expect the code size to be in the range of the maximum code-size increase budget (e.g. 25% increase for DBDS-0.25).

– Performance: We expect the performance to be as good as with path duplication and even better for benchmarks where path duplication misses opportunities or is unaware of transitive effects.

– Compilation Time: We expect the compilation time for DBDS to depend on the budget configuration. Simulation takes time, which means that large budgets can result in higher compilation time than path duplication.

We present the results of these experiments in Tables 8.3 and 8.4, which show the mean, min and max values for each configuration, each metric, and each benchmark. Detailed performance results for all benchmarks can be seen in the appendix in Figures C.1 to C.6.

8.2.2.1 Interpretation

We marked the best mean value of each configuration and metric in Tables 8.3 and 8.4 by underlining the respective value. Additionally, we marked the best performance, since duplication is an aggressive compiler optimization targeting long-running server applications.

Performance DBDS configurations are on par with or faster than PathDup in every benchmark suite except one, namely ScalaBench9. For ScalaBench, PathDup creates on average 1.1% faster code than the next-fastest DBDS configuration. For the other run-time-based benchmark suites,

9For ScalaBench, PathDup creates the best single performance increase. This can be seen in the column min.

PathDup is on par with the fastest DBDS configuration of the respective benchmark. Looking at individual benchmark results10, DBDS-0.25 outperforms PathDup by up to 14% in JavaScript Octane. The fastest DBDS configuration, DBDS-1.0, even improves performance for a single benchmark by up to 36%, while PathDup only improves performance by up to 18%. For Java DaCapo, PathDup creates code that is on average as fast as DBDS11.

Code Size Code-size effects are as expected for the DBDS configurations. DBDS-0.05 creates the smallest code-size increase of all DBDS configurations and also always creates smaller code than PathDup, while its performance is on average nearly as good as that of PathDup. Code-size increases of path duplication tend to be larger than those of the DBDS configurations up to DBDS-0.25.

Compilation Time Compile-time effects are as expected for the DBDS configurations. The slowest is DBDS-1.0, while the fastest is DBDS-0.05. This means that the trade-off function of DBDS from Section 6.3 works as expected: the more budget it has, the more duplications it allows and the longer the compilation takes. Interestingly, compilation time for PathDup is fairly good, given that its code-size increase is always larger than for DBDS-0.05. This means that the code-size trade-off, which is needed to reduce code size, requires more compilation time than not performing a precise code-size trade-off. The trade-off comes at a compile-time cost that is not alleviated by the reduced code size, i.e., reducing code size by simulation works, but consumes more compilation time than is saved via the reduced code size. This is a unique finding, and we consider it an interesting target for future work in a setting where code size is the crucial performance metric.

10Minimum and maximum values for a metric can be seen in columns min and max. 11For Java DaCapo, DBDS-1.0 creates the best single performance improvement.

Benchmarks    Configuration  Compile Time ↓           Code Size ↓              Run Time ↓
                             Mean   Min    Max        Mean   Min    Max        Mean   Min    Max
Java DaCapo   PathDup        1.132  1.067  1.17       1.156  1.092  1.254      0.99   0.954  1.005
              DBDS-0.05      1.127  1.084  1.185      1.06   1.021  1.122      0.994  0.955  1.015
              DBDS-0.1       1.179  1.107  1.335      1.098  1.05   1.215      0.994  0.955  1.01
              DBDS-0.25      1.233  1.135  1.542      1.142  1.082  1.316      0.992  0.952  1.015
              DBDS-1.0       1.28   1.178  1.71       1.183  1.12   1.41       0.99   0.939  1.02
ScalaBench    PathDup        1.082  1.026  1.238      1.099  1.028  1.255      0.977  0.904  1.024
              DBDS-0.05      1.096  1.061  1.157      1.05   1.013  1.096      0.994  0.941  1.024
              DBDS-0.1       1.123  1.074  1.198      1.069  1.019  1.12       0.989  0.915  1.014
              DBDS-0.25      1.136  1.078  1.224      1.089  1.021  1.137      0.988  0.92   1.02
              DBDS-1.0       1.16   1.071  1.266      1.112  1.034  1.181      0.99   0.933  1.02
Renaissance   PathDup        1.08   0.881  1.211      1.097  0.825  1.289      0.992  0.922  1.029
              DBDS-0.05      1.091  0.925  1.203      1.043  0.804  1.153      0.992  0.92   1.019
              DBDS-0.1       1.139  0.928  1.452      1.076  0.796  1.33       0.994  0.907  1.053
              DBDS-0.25      1.161  1.049  1.309      1.096  0.825  1.243      0.995  0.904  1.028
              DBDS-1.0       1.191  1.004  1.411      1.138  0.889  1.344      0.997  0.89   1.031

Table 8.3: Simulation performance run-time benchmarks. Best value for a given metric is underlined and bold. Best performance is marked in red. Default DBDS configuration is marked in green.

Benchmarks                   Configuration  Compile Time ↓           Code Size ↓              Throughput ↑
                                            Mean   Min    Max        Mean   Min    Max        Mean   Min    Max
Java SPECjvm2008             PathDup        1.117  0.996  1.505      1.136  1.032  1.356      1.006  0.919  1.09
                             DBDS-0.05      1.103  1.012  1.586      1.055  0.982  1.393      1.001  0.98   1.015
                             DBDS-0.1       1.137  1.064  1.562      1.083  1.002  1.365      1.007  0.979  1.05
                             DBDS-0.25      1.147  1.078  1.452      1.101  1.026  1.438      1.009  0.979  1.092
                             DBDS-1.0       1.188  1.088  1.659      1.14   1.037  1.443      0.999  0.897  1.045
JavaScript Octane            PathDup        1.092  1.059  1.166      1.121  1.064  1.172      1.038  0.812  1.183
                             DBDS-0.05      1.141  1.069  1.244      1.078  1.024  1.174      1.056  0.867  1.251
                             DBDS-0.1       1.174  1.103  1.244      1.098  1.052  1.171      1.053  0.826  1.197
                             DBDS-0.25      1.204  1.123  1.299      1.135  1.086  1.2        1.065  0.817  1.324
                             DBDS-1.0       1.229  1.146  1.316      1.159  1.086  1.254      1.062  0.789  1.364
JavaScript jetstream asm.js  PathDup        1.087  1.069  1.121      1.113  1.09   1.153      0.991  0.901  1.016
                             DBDS-0.05      1.109  1.085  1.202      1.058  1.028  1.093      1.016  1      1.043
                             DBDS-0.1       1.14   1.104  1.229      1.087  1.035  1.153      1.015  0.986  1.045
                             DBDS-0.25      1.17   1.139  1.287      1.117  1.08   1.173      1.014  0.986  1.045
                             DBDS-1.0       1.207  1.145  1.406      1.161  1.102  1.212      1.012  1      1.031

Table 8.4: Simulation performance throughput benchmarks. Best value for a given metric is underlined and bold. Best performance is marked in red. Default DBDS configuration is marked in green.

Synthesis The results show that simulation-based code duplication and PathDup are very close to each other with respect to performance. Most benchmarks for which duplication triggers show comparable performance improvements. However, on average DBDS often improves upon path duplication12, for example, when comparing path duplication with the default DBDS-0.25 configuration on specific benchmarks13: looking at individual benchmarks14, DBDS-0.25 improves upon path duplication by up to 14% in JavaScript Octane, 4% in JavaScript jetstream asm.js, and 2% in Renaissance. This also shows up in most of the average values of the configurations. We inspected the benchmarks where DBDS outperforms path duplication by a large margin. Many of the improvements stem from transitive effects, code-size trade-offs, and precise reasoning, which path duplication does not apply. Regarding complexity, path duplication is already the most complex heuristic-based implementation that does not use dedicated simulation or backtracking steps. Generating faster code than path duplication, given that it was tuned on the benchmarks presented in this thesis for a long time, is a notable success. Therefore, we conclude that simulation-based reasoning is beneficial, as it can outperform hard-coded and hand-optimized heuristics. Furthermore, we see that a precise code-size trade-off pays off, as, e.g., DBDS-0.25 creates smaller code than path duplication in 4 out of 6 benchmark suites15.

The experiments indicate that the following claims made in Chapters 4 to 6 are correct:

• Simulation-based code duplication is necessary to discover transitive effects of duplication opportunities in order to produce the best possible performance increase. Performance for PathDup and DBDS is similar for many benchmarks. Therefore, reasoning about average performance impacts is not very useful, as both optimizations trigger on benchmarks that benefit from duplication, and improvements on those benchmarks hardly show up in the averages. We therefore support this claim by inspecting the performance of individual benchmarks, i.e., the best individual performance increases. To illustrate the drawbacks of heuristics, we look for examples where the missing knowledge16 of the optimization causes missed optimization potential. Indeed, simulation-based duplication finds optimization opportunities that heuristic-based approaches like path duplication miss in many benchmarks: it outperforms path duplication in individual benchmarks by up to 14%17, while path duplication only outperforms simulation-based duplication by up to 2%18. However, this still comes at an average compile-time cost of 10% compared to path duplication and an average 2%

12 Except for the ScalaBench benchmark, for which PathDup creates the fastest single-code improvement.
13 Column min for run-time-based benchmarks and column max for throughput-based ones.
14 Columns min and max.
15 With a different parameterization for the code-size increase budget, e.g., DBDS-0.05, code size is always smaller than with PathDup.
16 For example, the missing transitive-effect determination of a heuristic results in missed optimization opportunities.
17 In the JavaScript Octane benchmark, column max.
18 See the fastest DBDS-0.25 configuration against path duplication for the benchmark ScalaBench.

code-size increase19. We marked the default configuration of DBDS in green in Tables 8.3 and 8.4. It demonstrates a reasonable trade-off between compilation time, code size, and performance. It outperforms path duplication in 3 out of 6 benchmark suites20. The default configuration also produces smaller code than path duplication in 4 out of 6 benchmark suites. Regarding compile time, DBDS-0.25 is always slower than path duplication, which is expected given that the DBDS implementation is complete and performs a trade-off step, which path duplication does not.

• A code-size trade-off is necessary in order to reduce code size to the minimal possible increase (which cannot be proven in general). This can be shown by comparing path duplication with DBDS-0.25. In the 0.25 configuration, DBDS produces faster code than path duplication on average in 3 benchmark suites21. It generates smaller code than path duplication in 4 out of 6 benchmark suites, increases code size compared to path duplication by only 1.4% in one benchmark suite, and creates on average up to 2.8% faster code22. Code size could probably be reduced even more, resulting in smaller and faster code than path duplication for every single benchmark. However, we have not yet worked on tuning our code-size trade-off from Section 6.3 towards smaller code size.

• Compilation time for simulation is higher than for heuristics, given that simulation is precise. However, code size is smaller and performance is better. This claim can be partially shown depending on the different configurations of DBDS. DBDS can be configured with different parameters; we chose to configure the budget parameter in our experiments, which can have an impact on all success metrics. The best trade-off between performance, code size, and compile time is the DBDS-0.25 configuration, as it generates on average the fastest code of all DBDS configurations, with nearly no code-size increase compared to path duplication. However, it requires more compilation time than path duplication: DBDS requires between 3% and 10% more compilation time than path duplication, while it nearly always produces the same or better performance improvements as path duplication without increasing code size. We consider this a success and a fact supporting our initial claim.

19 It can be argued that the compilation time of path duplication would also increase if it were able to find more duplication opportunities. However, we have no heuristic duplication implementation that is capable of reasoning about transitive effects, and thus we cannot prove this claim.
20 It generates the best individual performance increase (column min for run-time-based benchmarks and column max for throughput-based benchmarks) in 5 out of 6 benchmark suites compared to path duplication. If we compare the average performance of DBDS-0.25 against PathDup, DBDS-0.25 is on average 0.15% slower than PathDup for the run-time-based benchmarks and on average 1.8% faster for the throughput-based benchmarks.
21 By up to 3% on average; the other 3 are slower than path duplication by at most 1.2%.
22 With single benchmarks being up to 18% faster than path duplication.

Additionally, if we are willing to sacrifice a little peak performance, the DBDS-0.05 configuration performs very well compared to path duplication: it produces the lowest code size and only consumes up to 5% more compilation time, without being much slower than path duplication23. Additionally, if we look at the regressions produced by path duplication, DBDS-0.05 always produces fewer regressions on the benchmarks than path duplication24.

• The simulation-based duplication scheme allows for a fine-grained trade-off between code size, compilation time, and performance. This claim is nicely visible when looking at the configurations DBDS-0.05 to DBDS-1.0, where we see that increasing the budget comes with higher compilation times, more code, and (most of the time) also better performance25. However, performance improvements do not scale as linearly as code size and compile time. There is a natural plateau that is reached when it comes to performance improvements for a single optimization like duplication. This plateau seems to lie between DBDS-0.25 and DBDS-1.0. At a certain point, performing more duplication reduces performance, due to indirect effects such as phase-ordering issues and instruction-cache pollution.

In summary, the results support our initial claims: performing a trade-off is necessary to control code size and compilation time. While this might not be a problem for heuristics like path duplication, as they are limited in the possible number of their duplications, it is necessary for precise approaches like simulation-based duplication. All heuristics, including path duplication, are limited in the number of opportunities they capture, because they are hard-coded and do not analyze transitive effects. Precise approaches like DBDS, however, require profound trade-offs in order to avoid code-size explosion and compile-time increases. For example, DBDS finds many more duplication opportunities than path duplication. Based on our work, we believe that path duplication is already a very complex heuristic (with a complex implementation) that performs very well compared to DBDS. In the end, however, heuristics like path duplication are limited in their performance potential26, requiring approaches like DBDS in order to generate the best possible performance increases with duplication. If compilation time and code size are less important and performance is the most important success metric, DBDS-1.0 is a suitable configuration, as it creates the fastest single performance increase in 5 out of 6 benchmark suites.

23 The highest average slowdown relative to path duplication is 1.7%.
24 By up to 10% on JavaScript jetstream asm.js.
25 Looking at the maximum performance increases of the single benchmark suites.
26 This can be seen in the performance results, where DBDS produces performance improvements that are 2 times better than path duplication.

8.2.3 Fast-Path Loop Unrolling

In this section we evaluate the implementation of the fast-path unrolling of non-counted loops as proposed in Chapter 7. Our main hypothesis was that fast-path unrolling of non-counted loops can significantly increase the performance of Java applications. Therefore, our primary metric of interest is performance (average run time or throughput). We tested one configuration, FPU (fast-path unrolling), with our optimization enabled, against the baseline GraalVM without any of the optimizations proposed in this thesis. Tables 8.5 and 8.6 show the results of our experiments for the benchmarks. Detailed performance plots can be found in Figures C.13 to C.18.

Benchmarks    Configuration  Compile Time ↓          Code Size ↓             Run Time ↓
                             Mean   Min    Max       Mean   Min    Max       Mean   Min    Max
Java DaCapo   FPU            1.042  1.018  1.102     1.023  1.002  1.066     1      0.984  1.01
ScalaBench    FPU            1.088  1.01   1.538     1.04   0.998  1.163     1.006  0.991  1.02
Renaissance   FPU            1.025  0.821  1.138     1.005  0.749  1.121     0.936  0.5    1.019

Table 8.5: Fast-path loop unrolling performance run-time benchmarks.

Benchmarks                   Configuration  Compile Time ↓          Code Size ↓             Throughput ↑
                                            Mean   Min    Max       Mean   Min    Max       Mean   Min    Max
Java SPECjvm2008             FPU            1.053  0.991  1.281     1.032  0.98   1.297     0.995  0.971  1.007
JavaScript Octane            FPU            1.045  1.01   1.083     1.012  0.982  1.071     0.971  0.734  1.028
JavaScript jetstream asm.js  FPU            1.098  1.013  1.339     1.017  0.998  1.042     1.199  1      3.615

Table 8.6: Fast-path loop unrolling performance throughput benchmarks.

8.2.3.1 Discussion

Performance The performance impact of fast-path unrolling of non-counted loops varies over the different benchmark suites. The impact on Java DaCapo is mixed: we see slight improvements and regressions in run time of up to 2%27. The performance impact on ScalaBench is similar. Performance improvements on Renaissance are significant, with run-time reductions of up to 50%. The most significant performance improvements are reached on our subset of the JavaScript jetstream asm.js benchmark suite, with improvements of more than 300%. Our subset of the jetstream suite only contains asm.js benchmarks. Those benchmarks offer much optimization potential (mostly due to loop-carried dependencies), for example, array accesses that can be optimized if a loop is unrolled once.

27 Average performance is unchanged. Individual results in columns min and max show up to 2% changes.
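
To make this pattern concrete, the following is a hypothetical Java sketch of ours (not taken from the benchmarks) of a non-counted loop with a loop-carried dependency through an array. After unrolling one iteration, the load of a[i] in the second copy reads exactly the element the first copy just wrote, so the compiler can forward the stored value and eliminate the reload from the fast path.

    class LoopCarriedExample {
        // A non-counted loop: the exit depends on the data in a.
        static int accumulate(int[] a, int limit) {
            int i = 0;
            while (i < a.length - 1 && a[i] < limit) {
                a[i + 1] += a[i]; // writes the element the next iteration reads
                i++;
            }
            return i;
        }
    }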

Compile Time We consider the compile-time increase to be acceptable for most of the benchmarks. We measured the highest compile-time increase (a maximum of about 50%) in the ScalaBench benchmark apparat. We analyzed the benchmark in detail: several hot compilation units in this benchmark are very large. Unrolling them opens further opportunities for unrolling inner loops, which are then also unrolled by our heuristics. We consider this a suboptimal pattern that should be fixed in the future.

Code Size The code-size increase of fast-path loop unrolling is negligible in nearly all cases (except for the same problem in apparat, which is also reflected in code size: unrolling a loop causes inner-loop unrolling opportunities to open up). The low code-size increases are generally due to our trade-off function, which is configured to only optimize loops for which we know that there is sufficient optimization potential. Unrolling all loops that carry optimization potential would have a severe impact on code size; however, the loops for which the run-time impact justifies a certain code-size increase are much less frequent. Thus, the overall increase in code size is low. The current parameterization of the optimization never allows a code-size increase larger than 50%; a sketch of this cap follows below.
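
As a minimal sketch (hypothetical names; the real check operates on IR node counts and the node cost model from Chapter 5), the cap can be expressed as a simple predicate:

    class UnrollingBudget {
        // Sketch of the 50% code-size cap for fast-path unrolling.
        // currentSize and estimatedIncrease are abstract code-size
        // estimates produced by the node cost model.
        static boolean withinBudget(int currentSize, int estimatedIncrease) {
            // Reject any unrolling that would grow the compilation unit
            // by more than 50%.
            return 2 * estimatedIncrease <= currentSize;
        }
    }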

8.2.4 Loop-wide Lock Coarsening

In this experiment we show the performance results of evaluating loop-wide lock coarsening from Section 7.5 against our baseline configuration. We want to show that tiling the iteration space of loops inside locks for the sake of coarsening them can significantly improve performance. The results of the experiment can be seen in Tables 8.7 and 8.8, which show aggregated performance results of the loop-wide-lock-coarsening (LWLC) configuration against the GraalVM baseline. The shape of the transformation is sketched below.
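
The following Java sketch illustrates the shape of the transformation under evaluation; it is a simplified, hypothetical example with names of ours, not the compiler's actual IR-level rewrite. The lock around each iteration is coarsened to cover a tile of T iterations, trading lock acquisitions for longer critical sections:

    class LockCoarseningSketch {
        static final int T = 64; // hypothetical tile size; choosing it is not automated
        final Object lock = new Object();
        int work = 1_000_000;

        boolean hasWork() { return work > 0; }
        void processOne() { work--; }

        // Before: one lock acquisition per loop iteration.
        void original() {
            while (hasWork()) {
                synchronized (lock) { processOne(); }
            }
        }

        // After loop-wide lock coarsening: one acquisition per tile of T iterations.
        void coarsened() {
            while (hasWork()) {
                synchronized (lock) {
                    for (int t = 0; t < T && hasWork(); t++) { processOne(); }
                }
            }
        }
    }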

Benchmarks    Configuration  Compile Time ↓          Code Size ↓             Run Time ↓
                             Mean   Min    Max       Mean   Min    Max       Mean   Min    Max
Java DaCapo   LWLC           1.006  0.995  1.033     1      0.987  1.029     0.999  0.987  1.005
ScalaBench    LWLC           0.997  0.977  1.018     0.999  0.979  1.038     1      0.986  1.008
Renaissance   LWLC           0.997  0.88   1.062     0.986  0.87   1.037     0.971  0.549  1.018

Table 8.7: Loop-wide lock coarsening performance run-time benchmarks.

Benchmarks                   Configuration  Compile Time ↓          Code Size ↓             Throughput ↑
                                            Mean   Min    Max       Mean   Min    Max       Mean   Min    Max
Java SPECjvm2008             LWLC           1.021  0.989  1.306     1.01   0.951  1.352     1.002  0.971  1.084
JavaScript Octane            LWLC           1.01   0.987  1.043     1.002  0.98   1.072     0.983  0.755  1.016
JavaScript jetstream asm.js  LWLC           1.005  0.984  1.015     0.996  0.968  1.009     1      0.994  1.006

Table 8.8: Loop-wide lock coarsening performance throughput benchmarks.

Interpretation The average performance impact is not very high. For Java DaCapo, ScalaBench, and Java SPECjvm2008, the optimization does not create improvements or regressions above 2%. However, for the Renaissance benchmark suite, the impact is larger. This is expected, as Renaissance [158] is a concurrency-focused benchmark suite exhibiting many more optimizable synchronization patterns than Java DaCapo or ScalaBench. For Renaissance, the highest impact is on the FJ-Kmeans benchmark, with a run-time reduction of nearly 50%. This benchmark suffers from high locking overhead in a very hot loop that is contended between two threads.

Compilation-time and code-size impacts range from 0 to 30%, while the means are within 2%. We believe that applications exhibiting loops with synchronized statements can profit considerably from this optimization. However, choosing suitable values for the tiling size is currently not automated and done on a best-effort basis. We consider this an interesting target for future work, but have not yet investigated this direction.

8.2.5 Node Cost Model

In this section, we test the hypothesis that our node cost model allows optimizations to apply a finer-grained trade-off of code size versus peak performance, resulting in increased peak performance and reduced code size and compilation time. We tested this hypothesis by running our benchmarks with different compiler and cost model configurations. A change in the node cost model parameterization should be reflected in either performance, code size, or compile time. Therefore, we tested four configurations and parameterizations of the node cost model against a baseline (no-opt) with all optimizations from this thesis disabled:

GraalVM This is the default configuration of GraalVM, which is configured to deliver high performance at a medium code-size and compile-time increase. It uses the node cost model in duplication-based optimizations to perform code-size and run-time estimations.

no-model The cost model trade-off functions for DBDS and fast-path unrolling are disabled. Every time an optimization finds a potential improvement in peak performance, it performs the associated transformation without a trade-off against code size.

zeroSize In this configuration, we set all NodeSize values for IR nodes to 0. The zeroSize configuration should behave similarly to the no-model configuration, because the estimated code-size increase is always 0. However, the GraalVM configuration limits the maximum code-size increase per compilation unit to 50%, which is also applied in the zeroSize configuration. Therefore, we expect code sizes larger than with GraalVM but within the 50% limit.

zeroCycles In this configuration, we set all NodeCycles values for IR nodes to 0. The zeroCycles configuration should reduce the code size compared to the no-opt configuration, because the trade-off function of DBDS only triggers if the code-size effect of a duplication is negative (i.e., duplication can enable subsequent compiler optimizations that reduce code size, for example dead code elimination [7]).

Expected changes in performance for all configurations in relation to the baseline can be seen in Table 8.9.

Config       Performance   Code Size   Compile Time   Relative to
GraalVM      ↑             ↑           ↑              no-opt
no-model     ↑             ↑           ↑              GraalVM
zeroCycles   -             ↓           ↓              no-opt
zeroSize     -             ↑           ↑              GraalVM

Table 8.9: Expected performance impacts: upward-pointing arrows represent an increase of a given metric, downward-pointing arrows a decrease.

Benchmarks    Configuration  Compile Time ↓          Code Size ↓             Run Time ↓
                             Mean   Min    Max       Mean   Min    Max       Mean   Min    Max
Java DaCapo   GraalVM        1.279  1.164  1.61      1.151  1.079  1.332     0.991  0.942  1.02
              No-Model       2.139  1.568  3.053     1.713  1.389  2.087     1.04   0.951  1.193
              Zero-Cycles    1.05   1.031  1.097     0.999  0.981  1.034     0.997  0.969  1.01
              Zero-Size      1.378  1.199  1.851     1.243  1.115  1.452     0.997  0.958  1.05
ScalaBench    GraalVM        1.205  1.102  1.316     1.123  1.019  1.207     0.993  0.913  1.037
              No-Model       1.742  1.296  2.425     1.569  1.16   1.946     1.036  0.903  1.152
              Zero-Cycles    1.027  0.926  1.062     0.994  0.953  1.017     1.001  0.985  1.012
              Zero-Size      1.305  1.111  1.913     1.187  1.038  1.277     1      0.904  1.051
Renaissance   GraalVM        1.219  0.995  1.366     1.121  0.823  1.267     0.908  0.5    1.031
              No-Model       1.87   1.236  3.041     1.681  1.006  2.65      1.003  0.605  1.412
              Zero-Cycles    1.055  0.967  1.133     1      0.879  1.093     0.972  0.551  1.029
              Zero-Size      1.295  1.08   1.542     1.176  0.846  1.365     0.911  0.5    1.052

Table 8.10: Cost model performance run-time benchmarks.

Benchmarks                   Configuration  Compile Time ↓          Code Size ↓             Throughput ↑
                                            Mean   Min    Max       Mean   Min    Max       Mean   Min    Max
Java SPECjvm2008             GraalVM        1.241  1.131  1.867     1.146  1.06   1.647     1.003  0.971  1.059
                             No-Model       1.865  1.311  3.184     1.694  1.299  2.529     0.942  0.722  1.057
                             Zero-Cycles    1.077  1.005  1.619     1.021  0.947  1.532     0.994  0.859  1.053
                             Zero-Size      1.297  1.148  2.064     1.215  1.103  1.828     0.992  0.897  1.063
JavaScript Octane            GraalVM        1.261  1.146  1.393     1.152  1.091  1.258     1.035  0.81   1.334
                             No-Model       1.817  1.431  2.483     1.512  1.311  1.676     0.918  0.264  1.423
                             Zero-Cycles    1.103  1.031  2.078     1.08   0.991  2.524     1.006  0.989  1.038
                             Zero-Size      1.295  1.153  1.449     1.187  1.102  1.299     1.016  0.644  1.368
JavaScript jetstream asm.js  GraalVM        1.277  1.177  1.492     1.136  1.097  1.214     1.211  0.986  3.615
                             No-Model       1.726  1.376  2.356     1.492  1.281  1.776     1.227  0.91   3.538
                             Zero-Cycles    1.05   1.031  1.086     1.003  0.99   1.029     0.999  0.985  1.006
                             Zero-Size      1.297  1.172  1.501     1.163  1.082  1.254     1.204  0.99   3.615

Table 8.11: Cost model performance throughput benchmarks.

Interpretation The results of the experiments can be seen in Tables 8.10 and 8.11. We normalized the results to a mean computed from the no-opt configuration, i.e., without any node-cost-model-based optimization enabled. The results seem to confirm our hypothesis. The GraalVM configuration produces the best performance at a medium code-size and compile-time increase. For some benchmarks, GraalVM produces less optimal code than no-opt; however, this only applies to outliers28. The more interesting comparison is between GraalVM, no-model, zeroSize, and zeroCycles. The GraalVM configuration, which uses the node cost model to do relative performance estimations and code-size classification, always results in smaller code size and faster compilation than the no-model configuration. As expected, the zeroCycles configuration either produces no code-size increase or reduces the code size compared to the no-opt configuration. In benchmarks like zlib, zeroCycles results in less compile time than no-opt, because duplication can significantly reduce code size, resulting in less work for other compiler optimizations29. The zeroSize configuration also behaves as expected by producing code-size and compile-time increases between GraalVM and no-model.

28 For details about DBDS performance see [120].

The results indicate that the cost model is indeed beneficial to reduce code size and compilation time and to duplicate only the most promising candidates. The most interesting configuration is zeroCycles, as it shows that a fine-grained trade-off is possible with the proposed node cost model. In zeroCycles, each IR node has a NodeCycles value of 0. As shown in Chapter 6, DBDS only duplicates code iff benefit ∗ probability > cost. In this configuration, the left-hand side of the inequality is always 0, because the benefit is 0 (as all NodeCycles are 0). Thus, DBDS only duplicates if the cost is negative, which can happen when the impact of a duplication is an enabled dead code elimination resulting in a reduction of code size. This effectively leads to a reduction in code size compared to the no-opt configuration in some benchmarks. The zeroSize configuration behaves like the no-model configuration, except that it implements the upper code-size boundary of the GraalVM configuration for an entire compiler graph and thus produces less code than the no-model configuration.
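
As a minimal sketch of this decision logic (hypothetical names; the actual implementation operates on Graal IR and the node cost model), the zeroCycles behavior follows directly from the trade-off predicate:

    class DuplicationTradeOff {
        // benefitCycles: abstract cycles saved at the merge (sum of NodeCycles values).
        // probability:   profile-derived probability of the duplicated branch.
        // sizeCost:      estimated code-size effect; negative if duplication
        //                enables code-removing optimizations.
        static boolean shouldDuplicate(double benefitCycles, double probability,
                                       double sizeCost) {
            return benefitCycles * probability > sizeCost;
        }
    }

Under zeroCycles, benefitCycles is always 0, so the predicate only holds for negative sizeCost: duplication is performed only when it is expected to shrink the code.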

The results show that a simple cost model capturing relative instruction latencies and instruction sizes can be used in high-level optimizations to guide trade-off functions. This enabled us to implement the trade-off functions in our duplication optimizations in Graal [117; 120]. Without investing significant amounts of compile time, we enabled optimizations to perform fine-grained performance predictions at an abstract, architecture-agnostic optimization level. This allowed us to improve performance and to reduce code size and compile time.

29 Code duplication can enable dead code elimination, resulting in less code to be compiled. Therefore, code duplication can reduce compilation time in this benchmark.


Chapter 9

Related Work

In this chapter we reflect on the related work for the algorithms and optimizations proposed in this thesis.

Duplication works under the premise of optimization via specialization. Thus, there is a lot of related work in the domain of compiler optimization and performance estimation. We grouped related work into three major categories, namely code duplication, loop unrolling, and compiler cost models.

9.1 Code Duplication

A multitude of approaches have been devised to exploit the direct or indirect impact of duplication on subsequent optimization opportunities.

Mueller and Whalley [136; 137] proposed Replication: a special form of code duplication to optimize away conditional and unconditional branches. Their approach is related to simulation-based code duplication (and DBDS in particular) in that our approach also duplicates code to remove conditional branches1.

Many novel concepts have been applied in the design of the Self compiler. One of the novel optimizations developed for Self was Splitting [25]: an approach to tackle the problems of virtual dispatch [102]. Duplication can be used to devirtualize call sites for which a concrete receiver type is only known in specific contexts2. The Self compiler performs splitting to specialize code after control-flow merges to the values used in predecessor branches. This allows the splitting optimization to specialize calls to more concrete receiver types3. After splitting, virtual calls can be replaced by invocations of statically-known methods as outlined in Section 6.1.6, allowing them to be inlined. Conceptually, splitting is a special form of type-guarded inlining [156], except that no new type guards are added before invocations; instead, dominating conditions act as guards for ϕ input specialization. Splitting is related to simulation-based code duplication in many aspects, even though our approach refrains from applying devirtualization during duplication in favor of more elaborate inlining models [156]. However, the basic idea deployed by Self's splitting also applies to simulation-based code duplication. Additionally, the heuristics applied by splitting incorporate profiling information, which makes them useful in the context of dynamic compilation. Chambers [25] describes Self's splitting heuristics, which are based on the frequency of the code paths that are optimized (weight) and the cost of a duplication (code size), making splitting aware of its code-size impact and thus amenable to elaborate cost models. We extended their ideas (reluctant splitting, eager splitting, and the combination of both) by using a fast duplication-simulation algorithm (DBDS from Chapter 6) to estimate the peak performance impact of a duplication before performing it. Additionally, we improved upon their idea of weight and cost by using a static performance estimator4 to estimate the peak performance increase of a duplication transformation and by using real profiling information from the VM to only optimize those parts of the program that are frequently executed.

1 See conditional elimination opportunity in Section 6.1.3.
2 See devirtualization opportunity in Section 6.1.6.
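
To illustrate the splitting idea outlined above in source form, consider the following hypothetical Java example of ours (not taken from the Self literature; the actual transformation happens on compiler IR, not source code):

    interface Shape { double area(); }
    final class Circle implements Shape { public double area() { return Math.PI; } }
    final class Square implements Shape { public double area() { return 1.0; } }

    class SplittingExample {
        // Before splitting: the call after the merge is a virtual dispatch,
        // because the receiver type is only known per branch.
        static double merged(boolean condition) {
            Shape s = condition ? new Circle() : new Square();
            return s.area(); // receiver type unknown at the merge
        }

        // After splitting the merge: each copy sees a concrete receiver
        // type, so the calls can be statically bound and inlined.
        static double split(boolean condition) {
            if (condition) {
                return new Circle().area(); // statically known: Circle.area
            } else {
                return new Square().area(); // statically known: Square.area
            }
        }
    }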

Another approach closely related to simulation-based code duplication was developed by Bodík et al. [13]. They use duplication as a means to enable partial redundancy elimination (PRE) [15; 28; 104] and also pre-compute which duplications are necessary. We extended their ideas by computing, before duplication, the general set of optimizations enabled by it, covering all opportunities presented in Chapter 6.

Simulation-based duplication works by considering the effects on a simulated version of the program as an estimation of how the actual program can be optimized. There has been research in the direction of performance-impact estimation, most notably by Ball [8], who estimates the effects of subsequent optimizations on inlined functions by using a data-flow analysis on the inlinee to derive parameter dependency sets for all variables. Those dependency sets collect all expressions influencing the computation of a variable. The effect of an optimization can then be estimated by propagating constant parameters through the dependent computations. The approach could be adapted to work on ϕ inputs instead of parameters to estimate the impact of ϕ inputs on subsequent code. DBDS improves upon the original approach by enabling a larger set of optimizations, including optimizations that depend on control flow.

3 For example, see ϕ specialization in Chapter 4.
4 Node cost model from Chapter 5.

Advanced scheduling approaches like trace, superblock, and hyperblock scheduling [66; 96; 127] apply code duplication in order to increase the instruction-level parallelism (ILP) of a compilation unit. Those approaches differ from simulation-based code duplication in what they want to achieve. They apply tail duplication [26] in order to extend basic blocks by forming traces, superblocks, or hyperblocks. Basic blocks are sequences of instructions without any jump target (except for the first instruction) and without any jump (except for the last instruction). Superblocks are extended basic blocks: larger sequences of basic blocks that can contain jumps to outside code (merge blocks) but no control-flow merges. This means that a superblock is the longest possible sequence of instructions without control-flow merges. Superblocks increase ILP [197], which is needed to properly optimize code for VLIW processors, which in turn require elaborate compiler support to generate efficient instructions. Different heuristics have been proposed in this context, but the optimization potential established by duplication is not analyzed in advance, since it is not needed for instruction selection and scheduling.

An interesting approach that combines the ideas of tail duplication with loop unrolling was proposed by Rock and Koch [170] in what they called aggressive tail splitting (ATS). ATS combines the ideas of hyperblock generation (inside a loop) with loop unrolling; however, instead of merging the backedge with the original loop header, ATS can generate new loop headers specialized to the backedge values. They call this re-creation of a specialized loop header re-splicing. Their approach generates multi-level loops [115; 119] during re-splicing. We improved upon their idea by using fast-path loop creation instead of re-splicing, which gives us a place to schedule infrequent operations, namely the outer slow-path loop. Those infrequent operations often pollute the type and value information inside loop-ϕ functions5.

We often used the paradigm of optimization via specialization as the root idea of code duplication. There is another compiler optimization that embodies this idea even more: inlining [7; 154; 156; 174]. We consider related work in the context of inlining to be out of scope for this thesis; however, we still want to mention that approaches like procedure cloning [34] or splitting [25] employ many concepts similar to the ones that we developed or applied in our simulation-based duplication work.

Finally, we want to mention trace-based compilers [10; 85; 99] in contrast to method-based compilers like Graal. Trace-based compilers have to deal with a problem similar to code duplication: when to merge. A trace-based compiler typically records execution traces and compiles them to machine code. During the process of tracing an execution, it is necessary to detect backedges of loops to actually discover high-level constructs such as loops. In this process, a trace-based compiler can decide to create control-flow merges in a delayed fashion, effectively applying tail duplication and loop unrolling during trace creation. The question of when to merge and when to keep recording a trace is very similar to the question of when and where to duplicate code.

5 See Chapter 7.

Classification                      Profile-Driven   Impact-Aware   Opportunities                                          Code-Size Increase
Tail Duplication                    ✗                ✗              OPT-LATE1                                              Depends on heuristic
Replication                         ✓                ✓              Conditional Jumps, Unconditional Jumps, OPT-LATE       15%-50%
Trace Scheduling                    ✓                ✗              ILP2, OPT-LATE                                         15%-300%
Superblock Duplication              ✓                ✗              ILP, OPT-LATE                                          20%-450%
Hyperblock Duplication              ✓                ✗              ILP, OPT-LATE, If-Conversion                           X
PRE (with pruning)                  ✓                ✓              ILP, OPT-LATE, Complete PRE3                           14%-1300%
Simulation-Based Code Duplication   ✓                ✓              ILP, OPT-LATE, PRE, CE4, SR5, AS6, Global CSE7, PEA8   0%-25% (parameterizable)

1 OPT-LATE ≡ All optimizations enabled by a duplication which have not been identified before duplication.
2 ILP ≡ Increased instruction-level parallelism (ILP) of the generated code.
3 PRE ≡ Partial Redundancy Elimination.
4 CE ≡ Conditional Elimination.
5 SR ≡ Strength Reduction.
6 AS ≡ Algebraic Simplifications.
7 CSE ≡ Common Subexpression Elimination.
8 PEA ≡ Partial Escape Analysis and Scalar Replacement.

Table 9.1: Code duplication-based optimizations: feature comparison

9.1.1 Comparison

To summarize this section on related work in the context of tail duplication, we compiled an overview of the approaches most closely related to ours, capturing the most important success metrics of a duplication approach. The overview can be seen in Table 9.1. We classify related work based on the following criteria:

• Profile-driven: Duplication can utilize profiling information, i.e., execution frequencies or branch probabilities, to only duplicate code that is actually executed and performance relevant.

• Impact-aware: Duplication can be done heuristically, without knowing the optimization potential of the actual outcome of the transformation. Alternatively, an approach can use analysis to find out whether a piece of code is optimizable after duplication.

• Opportunities: We list the optimization opportunities enabled by a code duplication. This includes optimization opportunities that are directly enabled via the change in control flow or data flow as well as all optimizations that are accidentally enabled by a duplication. We denote the latter as OPT-LATE.

• Code-size increase: We report the code size increases produced by the duplication optimiza- tions as reported in the respective publications.

Based on the comparison in Table 9.1, we can see that simulation-based code duplication captures most of the optimization opportunities and capabilities of the other approaches while keeping the code-size increase small and parameterizable.

9.2 Loop Unrolling

Loop unrolling is a special form of code duplication, as it duplicates a loop's header along the back-edge of a loop6. Based on this fact, we proposed fast-path loop unrolling as a simulation-based loop unrolling technique for non-counted loops. In this section, we reflect on related work in the domain of loop unrolling.

Numerous approaches to generate faster code for loops have been proposed over the last decades. Loop Unrolling [43; 44; 45; 50; 94; 173] was extensively researched as a general-purpose compiler optimization. However, loop unrolling is usually only done on counted loops. There are many optimization opportunities that can be enabled by loop unrolling, including (but not limited to):

• Performance increases on pipelined architectures due to better ILP [44; 197]

• Reduction in loop trip checks [45], which leads to fewer computations

• Improved register allocation and pipelining [52]

• Improved memory locality [7]

Most related to our approach is the work of Huang and Leng [94]. They proposed a generalized unrolling of while(...) loops, based on adding the weakest precondition [48] of a loop to its condition. We take a different approach by selectively unrolling loops with arbitrary side-effecting instructions together with their loop conditions. We do not use the concept of weakest preconditions, since automatically deriving them is not possible in the presence of arbitrary side-effects. Therefore, we search for different optimization opportunities that can still be enabled by unrolling a loop, even if the loop conditions are unrolled together with the body.

6 See Chapter 7.

Loop unrolling combined with loop jamming [2] was researched by Carr et al. [20]. Their approach performs unroll-and-jam to increase the instruction-level parallelism of nested loops in order to enable better scheduling on pipelined architectures. Our approach is different in that we explicitly unroll non-counted loops to enable other compiler optimizations; we did not implement an instruction-level-parallelism-based optimization.

Knijnenburg et al. [107] applied unroll-and-jam in the context of iterative compilation, an approach that derives unrolling parameters for a given workload automatically via repeated compilation and execution. We have not investigated the application of iterative compilation to derive unrolling factors or other optimization and cost-model parameters, given that a dynamic compiler must compile at run time, which prohibits repeated compilation for the sake of performance. However, we consider the application of iterative compilation an interesting future research task and reflect on it in Chapter 10. Especially the presence of profiling information could be utilized, inspecting hot compilation units and performing online sampling to apply simplified versions of iterative compilation in a system like Graal. Long and O'Boyle [125] worked on employing instance-based learning for optimizing Java programs and combined it with online learning in order to obviate the need for iterative compilation. We believe that combining such approaches is an interesting future work direction that would also adhere to the requirements of dynamic compilers like GraalVM.

The HotSpot server compiler [153] applies various optimizations on counted and non-counted loops, including unrolling, tiling, and iteration splitting [144]. Graal implements loop optimizations similar to those of the server compiler. Our work extends Graal to also unroll non-counted loops.

Carminati et al. [19] studied a loop unrolling approach to reduce the worst-case execution time of real-time software. They use various unrolling strategies for counted and non-counted loops and use if-conversion to create branchless code. Our approach to non-counted loop unrolling allows us to effectively unroll only the fast path of a loop. Additionally, we improved upon their work by removing all restrictions (general side-effects) on the body of the target loop to be unrolled.

Krall and Lelait [110] proposed loop unrolling for SIMD-based auto-vectorization in an optimizing compiler. Counted loops are unrolled n times, where n equals the vector length. A scalar-to-SIMD transformation then transforms scalar code to its vectorized equivalent. This work is partially related to ours in that they unroll loops to enable subsequent optimizations, in their case scalar-to-SIMD transformations. However, their work only considers counted loops.

The Graal compiler already applies loop peeling [7], unswitching, partial unrolling of counted loops, and a limited form of vectorization on counted loops. Therefore, we limited our work to non-counted loops.

Stencil code generation has been extensively researched by Hagedorn et al. [84] to optimize stencil codes inside loops, which also covers many loop-carried dependencies. We could not apply or build on their approach, since most of those transformations cannot be applied in a Java compiler because of possible deoptimizations and side effects [56; 58].

Su and Wang [190] studied the effects of loop-carried dependencies on software pipelining, which was first researched by Charlesworth [27]. Many compilers apply loop unrolling together with, e.g., register renaming on superscalar systems. Such optimizations are not done in Graal; thus, we have not compared our approach to software pipelining.

Auto-vectorization for Java has been proposed for the Jikes RVM [4] by El-Shobaky et al. [64]. Unrolling and auto-vectorization go hand in hand, since many compilers first unroll a loop and then perform linear code vectorization on its body. The Graal compiler already performs auto-vectorization, however only on counted loops. Linear code vectorization in Graal could also profit from unrolling non-counted loops.

9.3 Compiler Cost Models

In the last part of this chapter, we reflect on related work in the context of compiler cost models. In Chapter 5 we proposed a novel cost model for the graph-based IR of the Graal compiler, in order to quantify duplication opportunities and to rank them based on their optimization potential, expressed as abstract cycles-saved and code-size-increase values. Various research efforts on compiler optimizations have applied cost models to reduce compilation time and code size, to increase or predict performance, or to select the best candidate instructions during code generation.

Production Compilers: Most industry-standard compilers (e.g., LLVM [111] and GCC [78]) use cost models to implement performance predictions. However, they typically use architecture-specific information in their models. In contrast, we propose an architecture-agnostic cost model for high-level optimizations.

The GNU C compiler GCC [78] applies several cost models at different compilation stages to guide optimizations, instruction selection, and code generation. However, the IR of GCC is not comparable to the sea-of-nodes IR of Graal. In the front end, GCC mostly uses SSA-based GIMPLE code, which is an abstract CFG-based representation of C programs. In the back end, a machine format (called register transfer language) is used, which precisely models instruction costs7. Both IRs are not comparable to Graal IR, and their level of abstraction is far lower than that of a graph-based IR. Thus, their ideas are not applicable to Graal.

The LLVM compiler project [111] has two cost models: one for inlining and loop unrolling, and one for vectorization [7]. We have not found publications on the considerations underlying their cost models; however, we have looked through the source code of the project8. The cost model used for vectorization precisely models all instructions of the IR. The values for the model are either derived from analysis and simulation or taken from the literature [68]. However, LLVM IR is significantly different from Graal IR; we cannot use precise models in Graal for multiple reasons specified in Chapter 5.

The HotSpot [93] JVM applies various cost models: the client compiler [109] does not apply global cost models at all. The server compiler [144], however, uses architecture description (AD) files for instruction scheduling and selection9. Those AD files contain instruction patterns for instruction selection. The patterns are defined together with a cost annotation that is used by the instruction selection algorithm to generate a minimal instruction sequence from a machine-independent instruction tree. The server compiler similarly assigns a cost of 0 to nodes that do not generate any code. We improved upon the idea of architecture description files by specifying abstract instruction costs to be used in high-level optimizations. While the server compiler does not use cost models during duplication or other high-level optimizations, it uses various predominant metrics for classification, including bytecode size, inlining depth, etc. We improved upon this work by selectively applying our node cost model in optimizations that can have potentially negative effects such as code-size increases.

The optimizing compiler of Google’s V8 [79] JavaScript engine uses a limited form of instruction costs for selection and scheduling. The usage is comparable to the one of the server compiler. We have not found published work on the cost-models used in V8’s optimizations.

(Precise) Performance Prediction: Research in performance estimation explores ways to estimate the run time of an application without executing it. Various approaches have been investigated, including precise performance prediction in compilers [21], compile-time prediction of single-to-multicore application migration [193], compile-time estimations for speculative parallelization [53], and execution-time derivations for benchmark characterization to estimate the performance of a program on a given hardware [172]. Performance prediction approaches are typically devised for compilers of statically-compiled languages, enabling a precise analysis and estimation at a late stage in the compilation pipeline. Graal optimizations cannot make performance-relevant transformation decisions under the assumption that the IR does not change any more until code generation. This prohibits modeling architecture-specific contexts, i.e., it effectively prohibits the use of such models in Graal.

7 See https://gcc.gnu.org/onlinedocs/gccint/Costs.html for the documentation of their cost-model API.
8 LLVM Git repository mirror https://github.com/llvm-mirror/llvm.
9 This is a common approach employed by code generator-generators in order to perform instruction selection [71].

In this context, we want to mention Wang [198], who worked on compile-time performance prediction for compiler transformations. Their work is similar to ours in many aspects: their approach is based on a precise run-time estimation of straight-line code. This allows compiler optimizations to trade off several possible transformations against each other and to select the ones with the largest peak performance improvement. Our approach is slightly different from theirs: we designed our cost model to support optimizations that trade off code size against peak performance. Our cost model captures code size and run-time estimations. Our run-time estimation is less precise than theirs, as we only reason about local transformations, which is necessary in a just-in-time compiler.

Fursin et al. [73] worked on finding the lower-bound execution time of scientific applications. Their algorithm applies profiling, instrumentation and re-execution of the instrumented program to derive a minimal10 execution time of memory-bound code. We could not apply their technique in our work, since the virtual machine does not allow us to heavily instrument user code. Given the potential side-effecting nature of most Java programs, we cannot re-execute a piece of code after instrumentation.

Instruction Scheduling & Selection: Other approaches use instruction cost models to find minimum-cost instruction sequences during code generation11. We improved upon this idea by using relative instruction latencies and size estimations for abstract high-level compiler optimizations. Instruction selection [12] optimizations typically try to reduce a given cost function by selecting the best12 candidate instructions for a given IR pattern. A typical specification of cost is instruction latency (similar to the cycles annotation of our node cost model). Thus, a minimal instruction selection for a given IR pattern results in a minimal execution time and therefore in the highest peak performance.

Miscellaneous: In the rest of this section on related work in the context of compiler cost models we discuss approaches that are related to our node cost model, but for which we could not re-use their assumptions/ideas, since our requirements are specialized to dynamic compilation and low compile times.

Ceng et al. [23] computed the cost of an application execution by using IR, trace, and static information. We cannot apply trace information in our cost model. Yet, we also apply profiling information to guide our optimizations, using the cost model for more advanced reasoning.

10 The lower bound.
11 For example, in the HotSpot server compiler [153].
12 The lowest overall cost.

Tolubaeva [193] worked on compile-time cost models for static compilation to support the per- formance estimation of programs that are migrated from single-core machines to multi-core ma- chines. While multi-threaded modeling is promising for parallel workloads like Renaissance [158], we currently avoid modeling complex properties in our cost model for the sake of low compilation times.

Dou and Cintra [53] worked on static compile-time estimations for speculative parallelization combining estimated overheads, scheduling restrictions and so on. Parallel cost models require elaborate modeling, prohibiting their usage in the current form in Graal. Thus, we designed our cost model to be single-threaded.

Cascaval et al. [21] worked on performance prediction of arbitrary programs at compile time. They built several models of the CPU, the memory hierarchy, and the I/O behavior of the application. While this is very promising, we cannot apply precise models during dynamic compilation due to compile-time constraints.

Saavedra and Smith [172] derived execution-time models to characterize benchmarks and target machines, which allows them to estimate the execution time of a given program on a given hardware. While the basic ideas overlap, execution-time modeling cannot be applied during dynamic compilation on the IR of a compiler, where we do not control the rest of the compilation pipeline, i.e., other optimizations can still change the code.

Cooper et al. [37] used objective success functions to optimize their code for given properties such as code size. However, they did not apply IR-based cost models. We believe that a cost model that captures latencies and code size is a suitable success function for peak-performance-relevant code.

Chapter 10

Future Work

In this chapter we give an outline of possible future work building on the concepts presented in this thesis.

The simulation-based duplication scheme we proposed in Chapter 4, and for which we showed implementations for tail duplication1 and loop unrolling of non-counted loops2, constitutes a solid foundation for interesting future work. Every optimization that not only has positive impacts on a compilation unit but can also have negative impacts is subject to heuristics and trade-offs. Thus, we believe that the simulation-based duplication scheme is an interesting basis for future work.

10.1 Generalization of Simulation-based Optimizations

We showed that the simulation-based duplication scheme can be applied to tail duplication and to loop unrolling. In the future, we want to explore its applicability to other optimizations that increase code size. This includes loop unswitching, peeling, unrolling of counted loops, splitting, and inlining. All those optimizations can have negative impacts on the code size of a compilation unit. Thus, a sophisticated simulation step prior to performing the actual transformation is an interesting target for future research. We want to investigate whether it is possible to generalize the simulation-based duplication scheme to all optimizations, applying a simulation and a trade-off step prior to actually performing a transformation. All optimizations that currently implement heuristics could benefit from simulation to properly estimate the effects of a transformation.

1 See Chapter 6.
2 See Chapter 7.

10.1.1 Transactional Simulation-based IRs

We believe that simulation-based duplication and optimization could benefit from the idea of transactional intermediate representations. The idea of transactional memory [1] can be extended to support operations on an intermediate representation. We believe that transactions can be implemented for compiler optimizations by grouping sets of related optimizations together. Such a set of optimizations can be exercised on the IR, potentially in a simulation-based fashion. The transformations apply actions on a model of the IR that are, once they are finished, either committed to the actual IR or rolled back if the outcome, i.e., the performance increase, does not justify the actual application of a transformation. We believe that this is an interesting possibility to save compilation time by only committing final results to the IR and keeping intermediate data-flow rewiring to a minimum. This would work nicely together with our simulation scheme, as simulation could be combined with the transactional IR, allowing more steps of the compilation pipeline to be simulated. Transactional IRs would enable a compiler to apply a cheap form of backtracking, also during simulation.
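
A speculative Java sketch of this idea follows; all names are hypothetical, and it deliberately abstracts from Graal IR. Edits are buffered against a graph and only replayed if the transaction is committed:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;

    // Speculative sketch of a transactional IR: edits are recorded and only
    // applied to the real graph on commit; rollback simply drops the buffer.
    final class IRTransaction<Graph> {
        private final List<Consumer<Graph>> pendingEdits = new ArrayList<>();

        void record(Consumer<Graph> edit) {
            pendingEdits.add(edit); // buffered; the real IR is untouched
        }

        void commit(Graph graph) {
            pendingEdits.forEach(edit -> edit.accept(graph)); // replay edits
            pendingEdits.clear();
        }

        void rollback() {
            pendingEdits.clear(); // cheap backtracking: nothing to undo
        }
    }

An optimization would simulate its transformations on a model of the IR, record the corresponding edits, and commit only if the estimated benefit justifies the cost; otherwise, rollback is free.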

10.1.2 Optimization Tier Improvements

In the future we want to tune our optimization tier. The current optimization tier implementation cannot duplicate across multiple merges along paths, although the simulation tier can simulate along paths. We want to conduct experiments evaluating how complex a path-based implementation would be and whether we can increase peak performance even further.

10.2 Compiler Cost Model

We believe that the node cost model proposed in Chapter 5 is an interesting target for future work. First, we want to explore whether our cost model can be used in other optimizations of the Graal compiler. This mainly includes inlining [156], since this optimization is strongly dependent on precise metrics for code size and IR size. If so, we want to migrate the other optimizations in Graal step by step so that they benefit from our cost model. We want to do an iterative migration, refining the cost model along the way.

10.2.1 Machine Learning Cost Models

The current implementation of the node cost model is based on domain experts' knowledge, data from instruction vendors, and implementation aspects of the IR. Defining appropriate node costs is a tedious and error-prone process and requires holistic knowledge of the IR. While the node cost model is the first attempt in Graal to have an architecture-agnostic cost model, assigning node costs by hand is still a non-optimal solution. While some operations in the IR have architecture-specific particularities, many do not. Those that behave alike on most architectures are amenable to automated, machine-learning-based derivation. For example, the size annotation of a node could be derived by a multi-variable regression that inspects a node, its surroundings, maybe its control-flow complexity, and the generated machine code for that node, and tries to learn, via regression, the function producing the code size for a given node. Other approaches apply this idea to static compilation. There is ongoing research on this topic for LLVM [111] by Mendis et al. [131], who try to apply machine learning techniques to learn cost model parameterizations for the backend of the LLVM optimizer. We believe that such work is vital and will be an advancement for the static compilation community. However, for dynamic compilers the story is even more complex, given the different levels on which optimizations are executed and the expressiveness of sea-of-nodes [30] IRs. Therefore, we believe that automatically learning the cost model parameters (which could be done offline) is an interesting future research direction.
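
As an illustration only, our hypothetical simplification of the idea reduces it to a single feature: fitting a node-size estimate from observed machine-code sizes could start from ordinary least squares, before moving to the multi-variable setting described above.

    // Hypothetical sketch: learn a linear code-size estimate for a node
    // from observations (featureValue, emittedBytes) by least squares.
    final class SizeModelFitter {
        // Returns {slope, intercept} for emittedBytes ≈ slope * feature + intercept.
        static double[] fit(double[] feature, double[] emittedBytes) {
            int n = feature.length;
            double sx = 0, sy = 0, sxx = 0, sxy = 0;
            for (int i = 0; i < n; i++) {
                sx += feature[i];
                sy += emittedBytes[i];
                sxx += feature[i] * feature[i];
                sxy += feature[i] * emittedBytes[i];
            }
            double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
            double intercept = (sy - slope * sx) / n;
            return new double[] { slope, intercept };
        }
    }

A real model would use many features per node (node class, surrounding control flow, target ISA) and would be trained offline, as discussed above.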

10.2.2 Cost Model Success Functions

We introduced the node cost model in order to attribute benefit and cost to duplication transformations and to properly rank them in our optimizations. However, every duplication decision is based on local benefit and cost, i.e., the impact on the merge block and the performance relation to the overall compilation unit. For example, one duplication leads to a code-size increase of 4 bytes while on average decreasing run time by 2 cycles. However, the overall impact of a single transformation on the performance of the whole compilation unit is hard to derive, given that we have no notion of evaluating a transformation with respect to application-wide performance. In order to empirically derive proper trade-off functions for our duplication-based optimizations, we believe that the combination of machine learning and iterative compilation [105; 106; 107] would be an interesting future research question. Iterative compilation is an approach to automatically select parameters of optimization phases as well as the ordering of phases along the way. The approach works by iteratively generating versions of a program that are evaluated against each other to find the best one. Iterative compilation has been shown to work over a wide variety of compilers and workloads. However, its application in Java-based systems is problematic, as the re-execution of user code with observable side-effects can cause program crashes and infinite runs. We believe that modified versions of iterative compilation, e.g., performing iterative compilations but not executing them, instead using our cost model to derive performance labels for the results, would be an interesting research direction; a sketch of this idea follows below. This would allow the generation of a lot of data that in turn could be used to guide the work on trade-off heuristics. Such an approach could be combined with knowledge bases as applied, for example, by instance-based learning [125] to accumulate a transformation-impact knowledge base over time. Alternatively, it can be combined with feature selection [112; 199] for performance-impact prediction [54]. In general, there are different possible directions to solve the task of finding duplication candidates that tend to have large overall performance impacts. Those could be derived implicitly via feature detection or explicitly via improving duplication trade-offs by learning them over time.
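
The following sketch (a hypothetical API of ours) shows the shape of such a cost-model-labeled search: candidate parameterizations are compiled, and the node cost model, rather than execution, supplies the fitness label:

    import java.util.Comparator;
    import java.util.List;
    import java.util.function.Function;
    import java.util.function.ToDoubleFunction;

    // Speculative sketch: iterative compilation without re-executing user
    // code; the cost model estimates cycles for each compiled candidate.
    final class CostModelSearch<Params, Graph> {
        Params best(List<Params> candidates,
                    Function<Params, Graph> compile,
                    ToDoubleFunction<Graph> estimatedCycles) {
            // Label every candidate with the cost model and pick the cheapest.
            return candidates.stream()
                    .min(Comparator.comparingDouble(
                            p -> estimatedCycles.applyAsDouble(compile.apply(p))))
                    .orElseThrow(IllegalArgumentException::new);
        }
    }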

10.3 Loop Unrolling

In Chapter 7 we presented novel algorithms for fast-path loop creation and fast-path unrolling of non-counted loops; we believe that many interesting transformations are possible on the resulting fast-path loop construct.

As part of our future work, we want to add more optimizations based on analyzing more non- counted loops in Java programs. Additionally, we are currently working on our implementation to further reduce compile time.

In the work we presented, we focused on non-counted loops, as Graal already performs many advanced optimizations on counted loops. However, we believe that fast-path unrolling can be extended to counted loops in order to allow a fully generic partial-unrolling optimization that captures both counted and non-counted loops. For fast-path unrolling it makes no difference whether a loop is counted or not. While non-counted loops always have to be unrolled with their trip check, the trip check of unrolled iterations of counted loops can be removed. We believe that the removal of the intermediate trip check can be expressed as a dedicated optimization, allowing the unrolling algorithm to be fully agnostic of the type of loop it is working on.
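The following hand-written Java sketch illustrates the intermediate trip check. It is a source-level rendering of what the compiler would do on the IR, not actual compiler output: the generic version must re-check the loop condition before every unrolled iteration, while for a counted loop the check between unrolled iterations can be removed.

// Original loop (counted if limit is loop-invariant and i only increments).
static int sum(int[] data, int limit) {
    int sum = 0;
    for (int i = 0; i < limit; i++) {
        sum += data[i];
    }
    return sum;
}

// Unrolled by 2 with the intermediate trip check kept: required for
// non-counted loops, where the condition may change in any iteration.
static int sumUnrolledGeneric(int[] data, int limit) {
    int sum = 0;
    int i = 0;
    while (i < limit) {
        sum += data[i++];
        if (i < limit) {        // intermediate trip check
            sum += data[i++];
        }
    }
    return sum;
}

// Unrolled by 2 for a counted loop: the main loop runs while two more
// iterations are provably in bounds, so the intermediate check disappears.
static int sumUnrolledCounted(int[] data, int limit) {
    int sum = 0;
    int i = 0;
    for (; i + 1 < limit; i += 2) { // two iterations per trip, no inner check
        sum += data[i];
        sum += data[i + 1];
    }
    for (; i < limit; i++) {        // post-loop for a remaining iteration
        sum += data[i];
    }
    return sum;
}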

Chapter 11

Conclusion

In this thesis we proposed simulation-based code duplication for a dynamic compiler, a novel approach to perform duplication-based optimizations such as tail duplication and loop unrolling.

We started out with a discussion of the main problem of duplication-based optimizations, i.e., the code-size increase. Along the way, we reflected on why inherently complex optimizations like duplication can hardly be fully expressed with heuristics that only model portions of the real world. We looked at backtracking, the most precise and complete duplication approach possible. Yet, we also discussed why it cannot be deployed in production by a dynamic compiler.

We proposed dominance-based duplication simulation, one of two applications of the simulation-based duplication scheme, and showed that it allows a compiler to make precise duplication decisions while modeling transitive effects of subsequent optimizations. For this to work, we presented a novel cost model for the graph-based intermediate representation of the Graal compiler and showed why it outperforms hard-coded, heuristic-based solutions.

Then, we presented our application of the simulation-based code duplication approach to loop unrolling for non-counted loops. We presented two novel algorithms: the fast-path loop concept, which splits the hot and cold parts of a loop into different loops, and a fast-path unrolling algorithm built on top of this idea. We showed that optimizing non-counted loops in Java can significantly improve performance, even though many compilers still fail to optimize them under the premise that they are not important. We also presented ongoing work on using the fast-path loop for lock and safepoint optimizations.

We implemented all concepts in the Graal compiler and tested them extensively with Java, JavaScript, R, Python, and Ruby programs, before we deployed our implementation of tail duplication as well as our cost model into production. They are now part of the GraalVM. Thus, our optimizations are heavily used and tested by the Graal project's users and developers.

We presented an extensive evaluation of all optimizations proposed in this thesis. We showed that our simulation-based implementation of tail duplication (DBDS) can outperform a hand-crafted, hard-coded tail-duplication optimization that was specifically tuned for the workloads we benchmarked. We also presented an evaluation of the cost model, validating the claim that a cost model allows duplication to make fine-grained code-size versus performance trade-offs.

Simulation-based code duplication is a novel approach to perform duplication-based optimizations that removes the need for hand-tuned, hard-coded heuristics, as it gives a compiler full control over a duplication optimization's impact. The basic idea of the approach is simple and generic and can be implemented in every state-of-the-art compiler. It outperforms hard-coded heuristics, is extensible, and shows significant performance improvements.

List of Tables

4.1 Optimization capability approaches ...... 55

5.1 Compiler code size quantification...... 61
5.2 Important node cost model nodes...... 67

7.1 Number of counted and non-counted loops in the Java DaCapo Benchmark suite. 90

8.1 DBDS performance run-time benchmarks...... 120
8.2 DBDS performance throughput benchmarks...... 120
8.3 Simulation performance run-time benchmarks...... 124
8.4 Simulation performance throughput benchmarks...... 124
8.5 Fast-path loop unrolling performance run-time benchmarks...... 128
8.6 Fast-path loop unrolling performance throughput benchmarks...... 128
8.7 Loop-wide lock coarsening performance run-time benchmarks...... 130
8.8 Loop-wide lock coarsening performance throughput benchmarks...... 130
8.9 NodeCost model expected performance impacts...... 131
8.10 Cost model performance run-time benchmarks...... 132
8.11 Cost model performance throughput benchmarks...... 132

9.1 Code duplication-based optimizations: feature comparison ...... 138

C.1 Significance Tests for Java DaCapo...... 186
C.2 Significance Tests for ScalaBench...... 187
C.3 Significance Tests for Renaissance...... 188
C.4 Significance Tests for Java SPECjvm2008...... 189
C.5 Significance Tests for JavaScript Octane...... 190
C.6 Significance Tests for JavaScript jetstream asm.js...... 190
C.7 Significance & Performance Results...... 191


List of Figures

1.1 Simulation-based code duplication...... 7

2.1 Sample program foo...... 19
2.2 Sample program foo in SSA form...... 22

3.1 HotSpot JVM overview...... 24
3.2 HotSpot tiered compilation...... 26
3.3 Graal compiler schematic...... 27
3.4 Graal IR example...... 29
3.5 Graal ecosystem...... 31

4.1 Control flow merge union of information...... 35
4.2 Loop header control flow merge: Unrolling as duplication...... 37
4.3 Code duplication success metric triangle...... 44
4.4 Simulation-based code duplication approach...... 46
4.5 Sample program duplication cost model motivation...... 57

5.1 Node cost model code-size estimation...... 64
5.2 Cycles estimation...... 64
5.3 Node cost model value distributions...... 66

6.1 Canonicalization opportunity...... 70
6.2 Read elimination opportunity...... 71
6.3 Conditional elimination opportunity...... 72
6.4 Escape analysis & scalar replacement opportunity...... 73
6.5 Lock coarsening opportunity...... 75
6.6 DBDS algorithm schematic...... 77
6.7 Example program f...... 78
6.8 Duplication effect on the dominance relation. Red arrows denote the dominance relation for the merge block(s) before and after duplication...... 78
6.9 Program f dominator tree...... 80
6.10 Program f duplication simulation...... 80
6.11 Example duplication simulation: DBDS...... 82

6.12 Example after duplication...... 84

7.1 Loop memory graph...... 92
7.2 Loop-carried dependency read elimination simulation...... 98
7.3 Fast-path loop creation example...... 101
7.4 Path-based unrolling via fast-path loop creation and peeling...... 102

C.1 Simulation Performance Java DaCapo...... 168
C.2 Simulation Performance ScalaBench...... 168
C.3 Simulation Performance Renaissance...... 169
C.4 Simulation Performance Java SPECjvm2008...... 169
C.5 Simulation Performance JavaScript Octane...... 170
C.6 Simulation Performance JavaScript jetstream asm.js...... 170
C.7 DBDS Performance Java DaCapo...... 171
C.8 DBDS Performance ScalaBench...... 171
C.9 DBDS Performance Renaissance...... 172
C.10 DBDS Performance Java SPECjvm2008...... 172
C.11 DBDS Performance JavaScript Octane...... 173
C.12 DBDS Performance JavaScript jetstream asm.js...... 173
C.13 Unrolling Performance Java DaCapo...... 174
C.14 Unrolling Performance ScalaBench...... 174
C.15 Unrolling Performance Renaissance...... 175
C.16 Unrolling Performance Java SPECjvm2008...... 175
C.17 Unrolling Performance JavaScript Octane...... 176
C.18 Unrolling Performance JavaScript jetstream asm.js...... 176
C.19 Lock Coarsening Performance Java DaCapo...... 177
C.20 Lock Coarsening Performance ScalaBench...... 177
C.21 Lock Coarsening Performance Renaissance...... 178
C.22 Lock Coarsening Performance Java SPECjvm2008...... 178
C.23 Lock Coarsening Performance JavaScript Octane...... 179
C.24 Lock Coarsening Performance JavaScript jetstream asm.js...... 179
C.25 Cost Model Performance Java DaCapo...... 180
C.26 Cost Model Performance ScalaBench...... 180
C.27 Cost Model Performance Renaissance...... 181
C.28 Cost Model Performance Java SPECjvm2008...... 181
C.29 Cost Model Performance JavaScript Octane...... 182
C.30 Cost Model Performance JavaScript jetstream asm.js...... 182
C.31 Overall Performance Impact Java DaCapo...... 183
C.32 Overall Performance Impact ScalaBench...... 183
C.33 Overall Performance Impact Renaissance...... 184
C.34 Overall Performance Impact Java SPECjvm2008...... 184

C.35 Overall Performance Impact JavaScript Octane...... 185
C.36 Overall Performance Impact JavaScript jetstream asm.js...... 185


Listings

1.1 Sample program...... 3

4.1 Constant folding (CF) optimization opportunity...... 35
4.2 Duplication code-size increase example...... 38
4.3 Sample program...... 40

5.1 Trivial Java program...... 61
5.2 Excerpt of the NodeInfo annotation...... 66

6.1 Devirtualization opportunity...... 76

6.2 AddNode constant folding AC(m, pi, oi) & Opt(m, pi)...... 83

7.1 Counted loop unrolling...... 88
7.2 Non-counted loops...... 89
7.3 Non-counted loop with side effects...... 91
7.4 Non-counted loop with side effects after unrolling. Unrolling violates the memory constraints of the original loop from Listing 7.3...... 91
7.5 Non-counted loop with side effects after correct unrolling...... 91
7.6 General loop construct...... 92
7.7 General loop construct unrolled...... 93
7.8 Side-effect-full loop after unrolling...... 93
7.9 Unrolling opportunity: safepoint poll reduction...... 95
7.10 Unrolling opportunity: canonicalization...... 96
7.11 Unrolling opportunity: loop-carried dependency...... 97
7.12 Multi-back-edge loop...... 99
7.13 Multi-back-edge loop after fast-path loop creation...... 100
7.14 Synchronization loop in Java...... 107
7.15 Synchronized list...... 108
7.16 Loop-wide lock coarsening before...... 109
7.17 Loop-wide lock coarsening after...... 109
7.18 Lock coarsening tiling heuristic...... 111
7.19 Safepoint tiling opportunity before optimization...... 112
7.20 Safepoint tiling opportunity after optimization...... 112


List of Algorithms

1 Simple heuristic-based duplication...... 50

2 Precise heuristic-based duplication...... 51

3 Backtracking-based duplication...... 53

4 Simulation-based duplication...... 54

5 Path-based unrolling algorithm...... 104

6 Unrolling decision algorithm...... 106

7 Loop-wide lock coarsening algorithm...... 110

8 Fast-path loop creation algorithm...... 163

9 DBDS algorithm first part...... 165

10 DBDS algorithm second part...... 166


Glossary

Abbreviations

C1 client compiler

C2 server compiler

CFG control-flow graph

DBDS Dominance-Based Duplication Simulation

SBCD simulation-based code duplication

FaaS function-as-a-service

FP-loop creation fast-path loop creation

GraalVM Graal Virtual Machine

ILP instruction-level parallelism

IR intermediate representation

JIT just-in-time

JMM Java Memory Model

JRE Java runtime environment

JVM Java virtual machine

JVMCI Java virtual machine compiler interface

NCM node cost model

SSA static-single-assignment form

VLIW Very-Long-Instruction-Word

VM virtual machine

Appendix A

Fast-Path Loop Creation Algorithm

We present our algorithm for fast-path loop creation in Algorithm 8.

Algorithm 8: Fast-path loop creation algorithm.
Data: Loop loop
Result: Loop slowPathLoop

LoopHeader fastPathLoopHeader ← loop.loopHeader();
/* Create a new loop header for the outer loop and place it before the fast-path loop. */
LoopHeader slowPathLoopHeader ← new LoopHeader();
fastPathLoopHeader.insertInstructionBefore(slowPathLoopHeader);
/* Create phis for the outer loop based on the types of the inner loop's phis. */
for PhiNode innerPhi in fastPathLoopHeader.phis() do
    PhiNode emptyPhi ← new PhiNode(innerPhi.type());
    slowPathLoopHeader.addPhi(emptyPhi);
end
/* All loop exit paths of the inner loop will also exit the outer loop. */
for LoopExitNode exit in fastPathLoopHeader.exits() do
    LoopExitNode outerExit ← new LoopExitNode();
    slowPathLoopHeader.addLoopExit(exit);
    exit.insertInstructionAfter(outerExit);
end
/* Determine all fast-path ends, e.g., the n most probable loop ends. */
Set fastPathEnds ← computeFastPathEnds(loop);
/* Update the phis of the inner and the outer loop. */
for LoopEndNode end in fastPathLoopHeader.ends() do
    if fastPathEnds.contains(end) then
        continue;    /* Fast-path back edges keep jumping back to the fast-path header. */
    else
        end.setLoopHeader(slowPathLoopHeader);
        /* Move the phi inputs of this loop end to the outer loop's phis. */
        int outerPhiIndex ← 0;
        for PhiNode phi in fastPathLoopHeader.phis() do
            slowPathLoopHeader.phis().get(outerPhiIndex++).addInput(phi.removeInputAtLoopEnd(end));
        end
    end
end
/* Update the phi inputs of the inner loop for its forward predecessor (index 0). */
int outerPhiIndex ← 0;
for PhiNode phi in fastPathLoopHeader.phis() do
    phi.replaceInputAt(0, slowPathLoopHeader.phis().get(outerPhiIndex++));
end
return new Loop(slowPathLoopHeader);


Appendix B

DBDS Algorithm

We present the DBDS algorithm in pseudo-code in Algorithms 9 and 10.

Algorithm 9: DBDS algorithm first part.

List dominatorInfo ← []; Map synonyms ← []; List simResults ← [];

Function dbds(cfg: ControlFlowGraph): void is
    DomTree dom ← cfg.computeDominatorTree();    ⊲ Compute the dominator tree of the program
    visit(dom.startBlock(), false);
    for SimResult s in simResults do
        if costModel.shouldDuplicate(s) then     ⊲ Perform trade-off
            s.duplicate();
            s.optimize();
        end
    end
end

Function visit(b: Block, isSim: bool): void is
    infosBefore ← dominatorInfo.size();
    process(b, isSim);
    if isSim then                                ⊲ For simplicity, the simulation depth is 1 in this algorithm
        simResults.add(new SimResult(b, b.successor()));    ⊲ Save the simulation result for the pred-merge pair
    else
        for Block dominated in b.getDominatedBlocks() do
            visit(dominated, isSim);             ⊲ Depth-first into dominated blocks
        end
    end
    dominatorInfo.trimToSize(infosBefore);       ⊲ Discard dominator info collected below this block
end

Algorithm 10: DBDS algorithm second part.

Function process(b: Block, isSim: bool): void is
    if b.isMergePredecessor() then
        propagateSynonyms(synonyms, b, b.successor());    ⊲ Register synonyms for phis
        visit(b.successor(), true);                       ⊲ Start DST
        synonyms ← [];
    end
    for Instruction i in b.instructions() do              ⊲ Iterate all instructions in this block
        if i.predecessor() instanceof IfInstruction then
            IfInstruction ifInsn ← i.predecessor();
            dominatorInfo.add(ifInsn.dataFlowInfo(i));    ⊲ Save info of dominating conditions
        end
    end
    if isSim then
        for Instruction i in b.instructions() do          ⊲ Iterate all instructions in this block
            for Optimization o in opts do                 ⊲ Try to optimize them using synonyms
                if o.acStep(i, synonyms.getFor(i.inputs())) then    ⊲ Check optimization applicability
                    Node result ← o.optStep(i, synonyms.getFor(i.inputs()));    ⊲ Get the optimization result
                    costModel.recordCodeSize(result);     ⊲ Save code size for the trade-off
                    Benefit benefit ← costModel.computeBenefit(i, result);      ⊲ Compute the benefit
                    simResults.get(b).rememberPotential(benefit);
                    registerSynonymFor(i, result);        ⊲ Register the optimized synonym
                end
            end
        end
    end
end

Appendix C

Evaluation Appendix

C.1 Detailed Performance Plots

In the appendix we present detailed performance evaluation figures for the experiments presented in Chapter 8.

C.1.1 Simulation vs. Heuristic Solutions

[Plot: per-benchmark results for the configurations no-op, dup-only, dup-only-0.05, dup-only-0.1, dup-only-1.0, and pathDup-only.]

Figure C.1: Simulation Performance Java DaCapo; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, dup-only-0.05, dup-only-0.1, dup-only-1.0, and pathDup-only.]

Figure C.2: Simulation Performance ScalaBench; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, dup-only-0.05, dup-only-0.1, dup-only-1.0, and pathDup-only.]

Figure C.3: Simulation Performance Renaissance; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, dup-only-0.05, dup-only-0.1, dup-only-1.0, and pathDup-only.]

Figure C.4: Simulation Performance Java SPECjvm2008; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, dup-only-0.05, dup-only-0.1, dup-only-1.0, and pathDup-only.]

Figure C.5: Simulation Performance JavaScript Octane; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, dup-only-0.05, dup-only-0.1, dup-only-1.0, and pathDup-only.]

Figure C.6: Simulation Performance JavaScript jetstream asm.js; compile time (lower is better), code size (lower is better), throughput (higher is better).

C.1.2 DBDS

[Plot: per-benchmark results for the configurations no-op, dup-only, and no-model.]

Figure C.7: DBDS Performance Java DaCapo; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, and no-model.]

Figure C.8: DBDS Performance ScalaBench; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, and no-model.]

Figure C.9: DBDS Performance Renaissance; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, and no-model.]

Figure C.10: DBDS Performance Java SPECjvm2008; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, and no-model.]

Figure C.11: DBDS Performance JavaScript Octane; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, and no-model.]

Figure C.12: DBDS Performance JavaScript jetstream asm.js; compile time (lower is better), code size (lower is better), throughput (higher is better).

C.1.3 Fast-Path Loop Unrolling

[Plot: per-benchmark results for the configurations no-op and unroll-only.]

Figure C.13: Unrolling Performance Java DaCapo; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op and unroll-only.]

Figure C.14: Unrolling Performance ScalaBench; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op and unroll-only.]

Figure C.15: Unrolling Performance Renaissance; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op and unroll-only.]

Figure C.16: Unrolling Performance Java SPECjvm2008; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations no-op and unroll-only.]

Figure C.17: Unrolling Performance JavaScript Octane; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations no-op and unroll-only.]

Figure C.18: Unrolling Performance JavaScript jetstream asm.js; compile time (lower is better), code size (lower is better), throughput (higher is better).

C.1.4 Loop-wide Lock Coarsening

[Plot: per-benchmark results for the configurations no-op and tile-only.]

Figure C.19: Loop-Wide Lock Coarsening Performance Java DaCapo; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op and tile-only.]

Figure C.20: Loop-Wide Lock Coarsening Performance ScalaBench; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op and tile-only.]

Figure C.21: Loop-Wide Lock Coarsening Performance Renaissance; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op and tile-only.]

Figure C.22: Loop-Wide Lock Coarsening Performance Java SPECjvm2008; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations no-op and tile-only.]

Figure C.23: Loop-Wide Lock Coarsening Performance JavaScript Octane; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations no-op and tile-only.]

Figure C.24: Loop-Wide Lock Coarsening Performance JavaScript jetstream asm.js; compile time (lower is better), code size (lower is better), throughput (higher is better).

C.1.5 Node Cost Model

[Plot: per-benchmark results for the configurations default, no-model, no-op, zeroCycles, and zeroSize.]

Figure C.25: Node-Cost Model Performance Comparison Java DaCapo; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations default, no-model, no-op, zeroCycles, and zeroSize.]

Figure C.26: Node-Cost Model Performance Comparison ScalaBench; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations default, no-model, no-op, zeroCycles, and zeroSize.]

Figure C.27: Node-Cost Model Performance Comparison Renaissance; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations default, no-model, no-op, zeroCycles, and zeroSize.]

Figure C.28: Node-Cost Model Performance Comparison Java SPECjvm2008; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations default, no-model, no-op, zeroCycles, and zeroSize.]

Figure C.29: Node-Cost Model Performance Comparison JavaScript Octane; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations default, no-model, no-op, zeroCycles, and zeroSize.]

Figure C.30: Node-Cost Model Performance Comparison JavaScript jetstream asm.js; compile time (lower is better), code size (lower is better), throughput (higher is better).

C.1.6 Combined Performance Impact

[Plot: per-benchmark results for the configurations default and no-op.]

Figure C.31: Overall Performance Impact Java DaCapo; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations default and no-op.]

Figure C.32: Overall Performance Impact ScalaBench; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations default and no-op.]

Figure C.33: Overall Performance Impact Renaissance; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations default and no-op.]

Figure C.34: Overall Performance Impact Java SPECjvm2008; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations default and no-op.]

Figure C.35: Overall Performance Impact JavaScript Octane; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations default and no-op.]

Figure C.36: Overall Performance Impact JavaScript jetstream asm.js; compile time (lower is better), code size (lower is better), throughput (higher is better).

C.2 Performance Significance Analysis

In order to provide a complete overview of all performance implications of the optimizations proposed in this thesis, we report the impacts of simulation-based code duplication in combined result tables¹ for a configuration called all-opt that concurrently enables DBDS, fast-path loop unrolling, and lock coarsening². In order to assess the significance of the final results for the optimizations proposed in this thesis, we performed a simple significance analysis of the data for the all-opt configuration using the Wilcoxon-Mann-Whitney test [200]³. We report the results of this analysis in Tables C.1 to C.6. Each table reports, per benchmark and metric, the median for the baseline and all-opt configurations, the first and third quartiles, the difference in medians, the p-value of the Wilcoxon-Mann-Whitney test, as well as a column indicating whether the test attests statistical significance. The final column indicates whether the all-opt configuration is faster, i.e., better, than the baseline configuration.
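For reference, the following is a minimal sketch of such a per-benchmark test using the Mann-Whitney U test (the rank-sum form of the Wilcoxon-Mann-Whitney test) from Apache Commons Math; the sample arrays are made-up placeholder values, not our measurements.

import org.apache.commons.math3.stat.inference.MannWhitneyUTest;

/*
 * Sketch: compare two independent samples of benchmark results
 * (baseline vs. all-opt) without assuming a normal distribution.
 */
public final class SignificanceCheck {

    public static void main(String[] args) {
        // Placeholder run-time samples in milliseconds (not real measurements).
        double[] baselineRuns = {5180, 5130, 5210, 5175, 5198, 5165};
        double[] allOptRuns   = {5150, 5110, 5200, 5140, 5155, 5120};

        MannWhitneyUTest test = new MannWhitneyUTest();
        double pValue = test.mannWhitneyUTest(baselineRuns, allOptRuns);

        // Two-sided test at the 0.05 level: reject the null hypothesis
        // of equal distributions if p < 0.05.
        boolean significant = pValue < 0.05;
        System.out.printf("p = %.4f, significant = %b%n", pValue, significant);
    }
}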

Benchmark | Metric | Median Baseline [Q1;Q3] | Median All Opt [Q1;Q3] | D† | p-0.95∗ | S‡ | F$
avrora | codesize | 2.10e+06 [2.02e+06;2.14e+06] | 2.43e+06 [2.36e+06;2.48e+06] | 3.33e+05 | 2.17e-16 | X | ✗
avrora | compiletime | 1.45e+07 [1.41e+07;1.48e+07] | 1.87e+07 [1.84e+07;1.90e+07] | 4.16e+06 | 1.09e-16 | X | ✗
avrora | runtime | 5.18e+03 [5.13e+03;5.21e+03] | 5.15e+03 [5.11e+03;5.20e+03] | 3.00e+01 | 3.26e-01 | ✗ | X
batik | codesize | 7.47e+06 [7.36e+06;7.52e+06] | 8.80e+06 [8.65e+06;8.93e+06] | 1.33e+06 | 1.09e-16 | X | ✗
batik | compiletime | 4.15e+07 [4.10e+07;4.20e+07] | 5.38e+07 [5.24e+07;5.52e+07] | 1.23e+07 | 1.09e-16 | X | ✗
batik | runtime | 8.88e+02 [8.87e+02;8.89e+02] | 8.97e+02 [8.96e+02;8.98e+02] | 9.00e+00 | 9.90e-11 | X | ✗
fop | codesize | 7.53e+06 [7.44e+06;7.63e+06] | 8.43e+06 [8.39e+06;8.49e+06] | 9.02e+05 | 1.09e-16 | X | ✗
fop | compiletime | 4.47e+07 [4.41e+07;4.53e+07] | 5.52e+07 [5.48e+07;5.55e+07] | 1.05e+07 | 1.09e-16 | X | ✗
fop | runtime | 2.02e+02 [2.01e+02;2.04e+02] | 2.06e+02 [2.05e+02;2.07e+02] | 4.00e+00 | 1.89e-07 | X | ✗
h2 | codesize | 1.49e+07 [1.47e+07;1.50e+07] | 1.72e+07 [1.71e+07;1.74e+07] | 2.39e+06 | 1.83e-07 | X | ✗
h2 | compiletime | 8.26e+07 [8.16e+07;8.36e+07] | 1.03e+08 [1.02e+08;1.05e+08] | 2.05e+07 | 1.83e-07 | X | ✗
h2 | runtime | 3.85e+03 [3.81e+03;3.90e+03] | 3.82e+03 [3.79e+03;3.84e+03] | 3.50e+01 | 1.05e-02 | X | X
jython | codesize | 2.03e+07 [1.99e+07;2.07e+07] | 2.70e+07 [2.63e+07;2.78e+07] | 6.72e+06 | 3.21e-16 | X | ✗
jython | compiletime | 1.42e+08 [1.38e+08;1.45e+08] | 2.30e+08 [2.20e+08;2.33e+08] | 8.82e+07 | 3.21e-16 | X | ✗
jython | runtime | 1.60e+03 [1.59e+03;1.61e+03] | 1.53e+03 [1.52e+03;1.54e+03] | 7.45e+01 | 8.68e-10 | X | X
luindex | codesize | 2.25e+06 [2.17e+06;2.30e+06] | 2.48e+06 [2.39e+06;2.50e+06] | 2.28e+05 | 7.38e-10 | X | ✗
luindex | compiletime | 1.60e+07 [1.55e+07;1.63e+07] | 1.96e+07 [1.90e+07;2.04e+07] | 3.62e+06 | 1.09e-16 | X | ✗
luindex | runtime | 7.13e+02 [7.02e+02;7.36e+02] | 6.80e+02 [6.71e+02;7.13e+02] | 3.30e+01 | 5.65e-03 | X | X
lusearch | codesize | 2.36e+06 [2.32e+06;2.42e+06] | 2.69e+06 [2.63e+06;2.73e+06] | 3.24e+05 | 1.60e-10 | X | ✗
lusearch | compiletime | 1.71e+07 [1.68e+07;1.75e+07] | 2.12e+07 [2.07e+07;2.15e+07] | 4.06e+06 | 1.09e-16 | X | ✗
lusearch | runtime | 1.32e+03 [1.31e+03;1.33e+03] | 1.32e+03 [1.31e+03;1.33e+03] | 2.00e+00 | 7.94e-01 | ✗ | ✗
pmd | codesize | 1.46e+07 [1.42e+07;1.49e+07] | 1.63e+07 [1.60e+07;1.65e+07] | 1.64e+06 | 7.60e-16 | X | ✗
pmd | compiletime | 8.84e+07 [8.59e+07;9.00e+07] | 1.09e+08 [1.08e+08;1.12e+08] | 2.10e+07 | 1.09e-16 | X | ✗
pmd | runtime | 1.28e+03 [1.28e+03;1.29e+03] | 1.28e+03 [1.28e+03;1.29e+03] | 2.00e+00 | 3.41e-01 | ✗ | ✗
sunflow | codesize | 1.92e+06 [1.87e+06;2.00e+06] | 2.09e+06 [2.04e+06;2.17e+06] | 1.70e+05 | 3.59e-08 | X | ✗
sunflow | compiletime | 1.57e+07 [1.55e+07;1.63e+07] | 1.86e+07 [1.81e+07;1.91e+07] | 2.84e+06 | 1.26e-06 | X | ✗
sunflow | runtime | 1.85e+03 [1.83e+03;1.87e+03] | 1.84e+03 [1.83e+03;1.84e+03] | 1.40e+01 | 1.03e-02 | X | X
xalan | codesize | 5.19e+06 [5.13e+06;5.27e+06] | 5.97e+06 [5.87e+06;6.04e+06] | 7.82e+05 | 1.09e-16 | X | ✗
xalan | compiletime | 3.08e+07 [3.03e+07;3.12e+07] | 4.02e+07 [3.95e+07;4.08e+07] | 9.39e+06 | 1.09e-16 | X | ✗
xalan | runtime | 7.57e+02 [7.53e+02;7.63e+02] | 7.60e+02 [7.57e+02;7.64e+02] | 3.00e+00 | 1.16e-01 | ✗ | ✗
† Difference of medians. ∗ p-0.95 confidence for the Wilcoxon-Mann-Whitney test. ‡ Statistical significance. $ All-opt configuration median faster than baseline.

Table C.1: Significance Tests for Java DaCapo.

¹ The raw data can be seen in Figures C.31 to C.36.
² The entire set of optimizations proposed in this thesis.
³ We used the Wilcoxon-Mann-Whitney test since we have independent random samples (benchmark results) for which, due to potential noise in the data, we cannot guarantee a normal distribution; this prohibits the use of the standard t-test for independent random samples. Most of the data is normally distributed, but not all sample sets are. We used the Shapiro-Wilk test [177] to assess whether our experiment data is normally distributed.

Benchmark | Metric | Median Baseline [Q1;Q3] | Median All Opt [Q1;Q3] | D† | p-0.95∗ | S‡ | F$
actors | codesize | 6.52e+06 [6.36e+06;6.82e+06] | 7.01e+06 [6.89e+06;7.34e+06] | 4.89e+05 | 6.11e-06 | X | ✗
actors | compiletime | 4.60e+07 [4.42e+07;4.86e+07] | 5.25e+07 [5.09e+07;5.43e+07] | 6.57e+06 | 1.05e-07 | X | ✗
actors | runtime | 2.94e+03 [2.91e+03;2.96e+03] | 2.93e+03 [2.90e+03;2.95e+03] | 5.00e+00 | 5.47e-01 | ✗ | X
apparat | codesize | 1.66e+07 [1.51e+07;1.74e+07] | 1.73e+07 [1.69e+07;1.86e+07] | 7.44e+05 | 3.64e-03 | X | ✗
apparat | compiletime | 1.29e+08 [1.15e+08;1.36e+08] | 1.53e+08 [1.43e+08;1.63e+08] | 2.42e+07 | 8.18e-08 | X | ✗
apparat | runtime | 5.01e+03 [4.96e+03;5.06e+03] | 4.92e+03 [4.83e+03;4.96e+03] | 8.55e+01 | 2.58e-06 | X | X
factorie | codesize | 3.70e+06 [3.58e+06;3.83e+06] | 4.29e+06 [4.15e+06;4.51e+06] | 5.86e+05 | 2.37e-08 | X | ✗
factorie | compiletime | 2.97e+07 [2.88e+07;3.08e+07] | 3.91e+07 [3.74e+07;4.10e+07] | 9.38e+06 | 1.10e-14 | X | ✗
factorie | runtime | 1.00e+04 [9.88e+03;1.01e+04] | 9.12e+03 [8.98e+03;9.25e+03] | 9.03e+02 | 7.21e-09 | X | X
kiama | codesize | 4.30e+06 [4.21e+06;4.42e+06] | 5.08e+06 [4.94e+06;5.14e+06] | 7.87e+05 | 3.21e-16 | X | ✗
kiama | compiletime | 2.90e+07 [2.85e+07;2.97e+07] | 3.56e+07 [3.49e+07;3.67e+07] | 6.60e+06 | 3.21e-16 | X | ✗
kiama | runtime | 2.59e+02 [2.56e+02;2.60e+02] | 2.54e+02 [2.53e+02;2.56e+02] | 5.00e+00 | 1.26e-04 | X | X
scalac | codesize | 3.54e+07 [3.50e+07;3.57e+07] | 3.61e+07 [3.58e+07;3.64e+07] | 7.28e+05 | 1.80e-06 | X | ✗
scalac | compiletime | 2.04e+08 [2.01e+08;2.07e+08] | 2.25e+08 [2.24e+08;2.27e+08] | 2.12e+07 | 1.09e-16 | X | ✗
scalac | runtime | 1.03e+03 [1.03e+03;1.04e+03] | 1.06e+03 [1.05e+03;1.06e+03] | 2.50e+01 | 1.50e-07 | X | ✗
scaladoc | codesize | 1.87e+07 [1.84e+07;1.94e+07] | 1.97e+07 [1.95e+07;1.99e+07] | 9.72e+05 | 8.45e-07 | X | ✗
scaladoc | compiletime | 1.12e+08 [1.10e+08;1.15e+08] | 1.25e+08 [1.25e+08;1.27e+08] | 1.35e+07 | 7.60e-16 | X | ✗
scaladoc | runtime | 1.03e+03 [1.02e+03;1.04e+03] | 1.03e+03 [1.03e+03;1.04e+03] | 1.00e+00 | 6.88e-01 | ✗ | X
scalap | codesize | 3.76e+06 [3.69e+06;3.81e+06] | 4.50e+06 [4.40e+06;4.54e+06] | 7.44e+05 | 1.09e-16 | X | ✗
scalap | compiletime | 2.50e+07 [2.46e+07;2.54e+07] | 3.21e+07 [3.15e+07;3.27e+07] | 7.17e+06 | 1.09e-16 | X | ✗
scalap | runtime | 1.31e+02 [1.31e+02;1.33e+02] | 1.29e+02 [1.29e+02;1.30e+02] | 2.00e+00 | 1.50e-06 | X | X
scalariform | codesize | 8.66e+06 [8.48e+06;8.80e+06] | 9.86e+06 [9.78e+06;9.98e+06] | 1.20e+06 | 1.09e-16 | X | ✗
scalariform | compiletime | 6.40e+07 [6.27e+07;6.50e+07] | 7.48e+07 [7.38e+07;7.61e+07] | 1.08e+07 | 1.09e-16 | X | ✗
scalariform | runtime | 3.96e+02 [3.94e+02;3.98e+02] | 4.05e+02 [4.03e+02;4.06e+02] | 9.00e+00 | 3.43e-08 | X | ✗
scalatest | codesize | 1.99e+07 [1.94e+07;2.03e+07] | 2.22e+07 [2.16e+07;2.27e+07] | 2.30e+06 | 5.52e-14 | X | ✗
scalatest | compiletime | 1.51e+08 [1.48e+08;1.56e+08] | 1.81e+08 [1.76e+08;1.86e+08] | 2.92e+07 | 1.09e-16 | X | ✗
scalatest | runtime | 1.11e+03 [1.09e+03;1.12e+03] | 1.13e+03 [1.12e+03;1.16e+03] | 2.30e+01 | 1.45e-04 | X | ✗
scalaxb | codesize | 9.02e+06 [8.87e+06;9.21e+06] | 1.04e+07 [1.03e+07;1.05e+07] | 1.40e+06 | 1.09e-16 | X | ✗
scalaxb | compiletime | 6.28e+07 [6.22e+07;6.44e+07] | 7.91e+07 [7.85e+07;8.05e+07] | 1.62e+07 | 1.09e-16 | X | ✗
scalaxb | runtime | 2.80e+02 [2.79e+02;2.82e+02] | 2.77e+02 [2.75e+02;2.79e+02] | 3.00e+00 | 3.29e-05 | X | X
specs | codesize | 1.06e+07 [1.02e+07;1.08e+07] | 1.26e+07 [1.25e+07;1.28e+07] | 1.94e+06 | 1.43e-15 | X | ✗
specs | compiletime | 7.36e+07 [6.98e+07;7.54e+07] | 9.29e+07 [9.04e+07;9.49e+07] | 1.93e+07 | 1.43e-15 | X | ✗
specs | runtime | 1.35e+03 [1.35e+03;1.36e+03] | 1.35e+03 [1.35e+03;1.36e+03] | 3.50e+00 | 2.02e-01 | ✗ | ✗
tmt | codesize | 5.27e+06 [5.15e+06;5.38e+06] | 5.86e+06 [5.75e+06;6.18e+06] | 5.93e+05 | 1.05e-14 | X | ✗
tmt | compiletime | 3.49e+07 [3.42e+07;3.54e+07] | 4.20e+07 [4.12e+07;4.44e+07] | 7.08e+06 | 1.09e-16 | X | ✗
tmt | runtime | 3.63e+03 [3.62e+03;3.64e+03] | 3.57e+03 [3.56e+03;3.58e+03] | 6.00e+01 | 8.83e-10 | X | X
† Difference of medians. ∗ p-0.95 confidence for the Wilcoxon-Mann-Whitney test. ‡ Statistical significance. $ All-opt configuration median faster than baseline.

Table C.2: Significance Tests for ScalaBench.

Benchmark | Metric | Median Baseline [Q1;Q3] | Median All Opt [Q1;Q3] | D† | p-0.95∗ | S‡ | F$
AlternatingLeastSquares | codesize | 8.99e+06 [8.61e+06;9.36e+06] | 8.94e+06 [8.71e+06;9.09e+06] | 5.11e+04 | 7.91e-01 | ✗ | X
AlternatingLeastSquares | compiletime | 5.96e+07 [5.73e+07;6.09e+07] | 6.59e+07 [6.54e+07;6.72e+07] | 6.30e+06 | 3.81e-13 | X | ✗
AlternatingLeastSquares | runtime | 4.44e+03 [4.37e+03;4.51e+03] | 4.40e+03 [4.35e+03;4.49e+03] | 5.00e+01 | 4.61e-01 | ✗ | X
ChiSquare | codesize | 4.51e+06 [4.44e+06;4.57e+06] | 4.89e+06 [4.80e+06;4.99e+06] | 3.76e+05 | 1.09e-16 | X | ✗
ChiSquare | compiletime | 3.02e+07 [2.98e+07;3.07e+07] | 3.74e+07 [3.69e+07;3.82e+07] | 7.12e+06 | 1.09e-16 | X | ✗
ChiSquare | runtime | 1.40e+03 [1.38e+03;1.42e+03] | 1.38e+03 [1.37e+03;1.41e+03] | 1.70e+01 | 3.13e-02 | X | X
ClassificationDecisionTree | codesize | 1.19e+07 [1.15e+07;1.20e+07] | 1.16e+07 [1.14e+07;1.20e+07] | 2.48e+05 | 6.80e-01 | ✗ | X
ClassificationDecisionTree | compiletime | 7.97e+07 [7.85e+07;8.12e+07] | 8.53e+07 [8.48e+07;9.17e+07] | 5.63e+06 | 2.95e-14 | X | ✗
ClassificationDecisionTree | runtime | 2.53e+03 [2.49e+03;2.55e+03] | 2.47e+03 [2.45e+03;2.50e+03] | 6.10e+01 | 5.90e-02 | ✗ | X
db-shootout | codesize | 4.88e+06 [4.82e+06;4.93e+06] | 5.83e+06 [5.68e+06;5.95e+06] | 9.51e+05 | 1.09e-16 | X | ✗
db-shootout | compiletime | 3.19e+07 [3.15e+07;3.25e+07] | 4.19e+07 [4.14e+07;4.28e+07] | 9.96e+06 | 1.09e-16 | X | ✗
db-shootout | runtime | 7.56e+03 [7.47e+03;7.79e+03] | 7.63e+03 [7.50e+03;7.79e+03] | 7.00e+01 | 4.42e-01 | ✗ | ✗
Dotty | codesize | 2.29e+07 [2.26e+07;2.35e+07] | 2.58e+07 [2.54e+07;2.61e+07] | 2.82e+06 | 1.09e-16 | X | ✗
Dotty | compiletime | 1.52e+08 [1.51e+08;1.60e+08] | 1.83e+08 [1.80e+08;1.87e+08] | 3.03e+07 | 1.09e-16 | X | ✗
Dotty | runtime | 4.42e+03 [4.39e+03;4.47e+03] | 4.50e+03 [4.45e+03;4.53e+03] | 7.70e+01 | 2.31e-07 | X | ✗
finagle-chirper | codesize | 1.07e+07 [9.47e+06;1.37e+07] | 1.27e+07 [1.20e+07;1.55e+07] | 1.99e+06 | 4.64e-04 | X | ✗
finagle-chirper | compiletime | 7.66e+07 [6.51e+07;1.15e+08] | 9.52e+07 [8.68e+07;1.19e+08] | 1.85e+07 | 3.97e-03 | X | ✗
finagle-chirper | runtime | 2.22e+03 [2.16e+03;2.57e+03] | 2.27e+03 [2.21e+03;2.56e+03] | 5.30e+01 | 3.49e-02 | X | ✗
finagle-http | codesize | 3.16e+06 [3.09e+06;3.24e+06] | 3.52e+06 [3.40e+06;3.60e+06] | 3.62e+05 | 7.43e-14 | X | ✗
finagle-http | compiletime | 2.40e+07 [2.34e+07;2.47e+07] | 2.82e+07 [2.77e+07;2.90e+07] | 4.27e+06 | 1.09e-16 | X | ✗
finagle-http | runtime | 2.61e+03 [2.59e+03;2.68e+03] | 2.55e+03 [2.53e+03;2.59e+03] | 5.90e+01 | 6.11e-08 | X | X
fj-kmeans | codesize | 1.41e+06 [1.32e+06;1.54e+06] | 1.74e+06 [1.70e+06;1.79e+06] | 3.38e+05 | 7.96e-13 | X | ✗
fj-kmeans | compiletime | 1.04e+07 [9.89e+06;1.14e+07] | 1.41e+07 [1.33e+07;1.43e+07] | 3.68e+06 | 2.27e-13 | X | ✗
fj-kmeans | runtime | 5.95e+02 [5.89e+02;5.98e+02] | 3.23e+02 [3.19e+02;3.25e+02] | 2.72e+02 | 9.11e-20 | X | X
FoldLeftSum | codesize | 5.51e+05 [5.45e+05;5.62e+05] | 6.39e+05 [6.29e+05;6.51e+05] | 8.77e+04 | 1.09e-16 | X | ✗
FoldLeftSum | compiletime | 5.33e+06 [5.27e+06;5.41e+06] | 6.53e+06 [6.39e+06;6.64e+06] | 1.20e+06 | 1.09e-16 | X | ✗
FoldLeftSum | runtime | 2.88e+02 [2.88e+02;2.88e+02] | 2.88e+02 [2.88e+02;2.88e+02] | 0.00e+00 | 3.86e-01 | ✗ | X
future-genetic | codesize | 3.18e+06 [2.82e+06;3.74e+06] | 3.94e+06 [3.30e+06;4.11e+06] | 7.61e+05 | 1.09e-03 | X | ✗
future-genetic | compiletime | 2.38e+07 [2.10e+07;2.78e+07] | 3.15e+07 [2.67e+07;3.36e+07] | 7.70e+06 | 4.81e-06 | X | ✗
future-genetic | runtime | 1.40e+03 [1.37e+03;1.45e+03] | 1.45e+03 [1.42e+03;1.50e+03] | 4.60e+01 | 2.10e-05 | X | ✗
LogRegression | codesize | 7.92e+06 [7.80e+06;7.97e+06] | 8.75e+06 [8.62e+06;8.82e+06] | 8.33e+05 | 1.09e-16 | X | ✗
LogRegression | compiletime | 5.34e+07 [5.27e+07;5.41e+07] | 6.48e+07 [6.38e+07;6.59e+07] | 1.15e+07 | 1.09e-16 | X | ✗
LogRegression | runtime | 3.08e+03 [3.06e+03;3.26e+03] | 3.07e+03 [3.06e+03;3.13e+03] | 2.00e+00 | 4.50e-02 | X | X
movie-lens | codesize | 2.17e+07 [1.62e+07;2.37e+07] | 1.82e+07 [1.75e+07;2.26e+07] | 3.45e+06 | 7.23e-01 | ✗ | X
movie-lens | compiletime | 1.34e+08 [1.08e+08;1.47e+08] | 1.39e+08 [1.29e+08;1.56e+08] | 5.04e+06 | 3.26e-02 | X | ✗
movie-lens | runtime | 1.72e+04 [1.61e+04;1.82e+04] | 1.72e+04 [1.67e+04;1.81e+04] | 4.50e+01 | 5.36e-02 | ✗ | ✗
naive-bayes | codesize | 3.35e+06 [3.29e+06;3.41e+06] | 3.63e+06 [3.58e+06;3.67e+06] | 2.79e+05 | 2.12e-14 | X | ✗
naive-bayes | compiletime | 2.51e+07 [2.47e+07;2.57e+07] | 3.15e+07 [3.11e+07;3.17e+07] | 6.40e+06 | 1.09e-16 | X | ✗
naive-bayes | runtime | 1.97e+03 [1.96e+03;1.97e+03] | 1.94e+03 [1.94e+03;1.95e+03] | 2.30e+01 | 5.71e-12 | X | X
neo4j-analytics | codesize | 2.21e+07 [2.18e+07;2.23e+07] | 2.41e+07 [2.38e+07;2.44e+07] | 2.02e+06 | 1.73e-13 | X | ✗
neo4j-analytics | compiletime | 1.52e+08 [1.50e+08;1.54e+08] | 1.85e+08 [1.82e+08;1.87e+08] | 3.26e+07 | 1.09e-16 | X | ✗
neo4j-analytics | runtime | 1.21e+04 [1.18e+04;1.23e+04] | 1.16e+04 [1.16e+04;1.17e+04] | 4.13e+02 | 1.13e-05 | X | X
philosophers | codesize | 2.06e+06 [1.94e+06;2.13e+06] | 2.47e+06 [2.40e+06;2.59e+06] | 4.05e+05 | 7.28e-15 | X | ✗
philosophers | compiletime | 1.56e+07 [1.50e+07;1.62e+07] | 2.01e+07 [1.96e+07;2.12e+07] | 4.52e+06 | 4.34e-16 | X | ✗
philosophers | runtime | 1.04e+03 [1.02e+03;1.09e+03] | 1.05e+03 [1.02e+03;1.10e+03] | 6.00e+00 | 7.61e-01 | ✗ | ✗
reactors-savina | codesize | 4.84e+06 [4.65e+06;5.04e+06] | 5.61e+06 [5.39e+06;5.95e+06] | 7.74e+05 | 1.11e-10 | X | ✗
reactors-savina | compiletime | 3.26e+07 [3.12e+07;3.37e+07] | 3.91e+07 [3.75e+07;4.14e+07] | 6.53e+06 | 1.32e-13 | X | ✗
reactors-savina | runtime | 1.31e+04 [1.27e+04;1.35e+04] | 1.30e+04 [1.25e+04;1.35e+04] | 6.20e+01 | 3.82e-01 | ✗ | X
rx-scrabble | codesize | 3.00e+06 [2.89e+06;3.10e+06] | 3.44e+06 [3.27e+06;3.60e+06] | 4.48e+05 | 8.68e-12 | X | ✗
rx-scrabble | compiletime | 1.96e+07 [1.88e+07;2.01e+07] | 2.39e+07 [2.35e+07;2.48e+07] | 4.34e+06 | 4.34e-16 | X | ✗
rx-scrabble | runtime | 3.18e+02 [3.08e+02;3.29e+02] | 3.28e+02 [3.19e+02;3.35e+02] | 1.00e+01 | 2.28e-03 | X | ✗
scala-stm-bench7 | codesize | 3.67e+06 [3.60e+06;3.77e+06] | 4.64e+06 [4.55e+06;4.80e+06] | 9.74e+05 | 2.27e-13 | X | ✗
scala-stm-bench7 | compiletime | 2.62e+07 [2.55e+07;2.69e+07] | 3.59e+07 [3.51e+07;3.69e+07] | 9.67e+06 | 9.94e-14 | X | ✗
scala-stm-bench7 | runtime | 1.20e+03 [1.19e+03;1.21e+03] | 1.18e+03 [1.17e+03;1.20e+03] | 1.80e+01 | 6.30e-05 | X | X
Scrabble | codesize | 1.31e+06 [1.25e+06;1.34e+06] | 1.54e+06 [1.50e+06;1.55e+06] | 2.30e+05 | 1.09e-16 | X | ✗
Scrabble | compiletime | 1.12e+07 [1.08e+07;1.16e+07] | 1.38e+07 [1.36e+07;1.41e+07] | 2.59e+06 | 1.09e-16 | X | ✗
Scrabble | runtime | 3.70e+02 [3.65e+02;3.80e+02] | 3.61e+02 [3.58e+02;3.79e+02] | 9.00e+00 | 1.66e-03 | X | X
StreamsFoldLeftSum | codesize | 5.72e+05 [5.63e+05;5.82e+05] | 6.42e+05 [6.35e+05;6.54e+05] | 7.04e+04 | 1.09e-16 | X | ✗
StreamsFoldLeftSum | compiletime | 5.51e+06 [5.45e+06;5.59e+06] | 6.63e+06 [6.47e+06;6.79e+06] | 1.13e+06 | 1.09e-16 | X | ✗
StreamsFoldLeftSum | runtime | 8.60e+01 [8.60e+01;8.60e+01] | 4.30e+01 [4.30e+01;4.30e+01] | 4.30e+01 | 3.61e-28 | X | X
StreamsForeachSum | codesize | 5.60e+05 [5.46e+05;5.70e+05] | 6.43e+05 [6.35e+05;6.50e+05] | 8.25e+04 | 1.09e-16 | X | ✗
StreamsForeachSum | compiletime | 5.40e+06 [5.35e+06;5.43e+06] | 6.51e+06 [6.47e+06;6.60e+06] | 1.11e+06 | 1.09e-16 | X | ✗
StreamsForeachSum | runtime | 1.02e+02 [1.01e+02;1.02e+02] | 5.20e+01 [5.20e+01;5.20e+01] | 5.00e+01 | 1.85e-20 | X | X
StreamsPhoneMnemonics | codesize | 1.03e+06 [1.00e+06;1.06e+06] | 1.25e+06 [1.22e+06;1.28e+06] | 2.16e+05 | 1.09e-16 | X | ✗
StreamsPhoneMnemonics | compiletime | 8.87e+06 [8.65e+06;9.03e+06] | 1.13e+07 [1.11e+07;1.16e+07] | 2.43e+06 | 1.09e-16 | X | ✗
StreamsPhoneMnemonics | runtime | 4.49e+03 [4.46e+03;4.55e+03] | 4.08e+03 [4.02e+03;4.13e+03] | 4.11e+02 | 9.86e-20 | X | X
† Difference of medians. ∗ p-0.95 confidence for the Wilcoxon-Mann-Whitney test. ‡ Statistical significance. $ All-opt configuration median faster than baseline.

Table C.3: Significance Tests for Renaissance.
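Each row in these tables is derived from the raw per-iteration samples of the two configurations: D is the absolute difference of the two medians, p is the two-sided Wilcoxon-Mann-Whitney p-value, S marks statistical significance, and F marks whether the all-opt median is the faster one. The following sketch illustrates how such a row could be computed. It is not the evaluation script used for this thesis; the sample arrays are invented, and judging significance at the 0.05 level (the complement of the 0.95 confidence) is an assumption.

```python
# Illustrative sketch (not the thesis' evaluation script): computing one
# significance-table row (D, p, S, F) from raw per-iteration samples.
# The sample arrays and the 0.05 alpha level are assumptions.
import numpy as np
from scipy.stats import mannwhitneyu

def significance_row(baseline, all_opt, lower_is_better=True, alpha=0.05):
    base = np.asarray(baseline, dtype=float)
    opt = np.asarray(all_opt, dtype=float)
    d = abs(np.median(base) - np.median(opt))                # D: difference of medians
    _, p = mannwhitneyu(base, opt, alternative="two-sided")  # Wilcoxon-Mann-Whitney test
    s = p < alpha                                            # S: statistically significant
    if lower_is_better:   # runtime, compiletime, codesize: smaller median is better
        f = np.median(opt) < np.median(base)                 # F: all-opt median faster
    else:                 # throughput: larger median is better
        f = np.median(opt) > np.median(base)
    return d, p, s, f

# Invented runtime samples (ms) for a single benchmark:
baseline_runs = [4440, 4370, 4510, 4450, 4480, 4430]
all_opt_runs = [4400, 4350, 4490, 4380, 4410, 4360]
print(significance_row(baseline_runs, all_opt_runs))
```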

Columns: Benchmark, Metric, Median Baseline [Q1;Q3], Median All Opt [Q1;Q3], D†, p-0.95∗, S‡, F$ (X = yes, ✗ = no).
compiler.compiler codesize 1.26e+07[1.24e+07;1.27e+07] 1.46e+07[1.44e+07;1.47e+07] 2.00e+06 2.95e-13 X ✗
compiler.compiler compiletime 8.04e+07[7.95e+07;8.17e+07] 1.04e+08[1.02e+08;1.05e+08] 2.32e+07 1.09e-16 X ✗
compiler.compiler throughput 6.51e+02[6.48e+02;6.53e+02] 6.44e+02[6.40e+02;6.48e+02] 7.00e+00 9.75e-06 X ✗
compress codesize 8.37e+05[8.15e+05;8.60e+05] 9.07e+05[8.96e+05;9.28e+05] 7.02e+04 6.09e-08 X ✗
compress compiletime 7.23e+06[7.12e+06;7.35e+06] 8.39e+06[8.22e+06;8.47e+06] 1.16e+06 8.56e-10 X ✗
compress throughput 2.07e+02[2.07e+02;2.09e+02] 2.04e+02[2.03e+02;2.04e+02] 3.00e+00 2.73e-10 X ✗
crypto.aes codesize 2.83e+06[2.65e+06;3.06e+06] 3.30e+06[3.10e+06;3.73e+06] 4.71e+05 1.17e-04 X ✗
crypto.aes compiletime 2.94e+07[2.69e+07;3.23e+07] 3.54e+07[3.13e+07;3.90e+07] 5.97e+06 1.02e-04 X ✗
crypto.aes throughput 6.80e+01[6.60e+01;7.00e+01] 6.80e+01[6.60e+01;7.00e+01] 0.00e+00 9.34e-01 ✗ ✗
crypto.rsa codesize 2.12e+06[2.09e+06;2.16e+06] 2.30e+06[2.26e+06;2.34e+06] 1.79e+05 6.26e-13 X ✗
crypto.rsa compiletime 1.48e+07[1.45e+07;1.51e+07] 1.71e+07[1.70e+07;1.75e+07] 2.30e+06 1.09e-16 X ✗
crypto.rsa throughput 1.08e+03[1.07e+03;1.08e+03] 1.08e+03[1.07e+03;1.08e+03] 0.00e+00 7.43e-01 ✗ ✗
crypto.signverify codesize 2.61e+06[2.57e+06;2.66e+06] 3.00e+06[2.88e+06;3.08e+06] 3.92e+05 2.17e-16 X ✗
crypto.signverify compiletime 1.84e+07[1.81e+07;1.86e+07] 2.20e+07[2.11e+07;2.25e+07] 3.58e+06 7.43e-14 X ✗
crypto.signverify throughput 4.72e+02[4.69e+02;4.72e+02] 4.72e+02[4.71e+02;4.72e+02] 0.00e+00 9.01e-01 ✗ ✗
derby codesize 8.17e+06[8.02e+06;8.44e+06] 9.69e+06[9.44e+06;1.08e+07] 1.52e+06 1.09e-16 X ✗
derby compiletime 4.54e+07[4.50e+07;4.74e+07] 6.17e+07[6.03e+07;6.92e+07] 1.63e+07 3.21e-16 X ✗
derby throughput 4.24e+02[4.20e+02;4.26e+02] 4.25e+02[4.17e+02;4.30e+02] 1.00e+00 7.37e-01 ✗ X
mpegaudio codesize 1.54e+06[1.45e+06;1.61e+06] 1.70e+06[1.64e+06;1.74e+06] 1.60e+05 1.33e-05 X ✗
mpegaudio compiletime 1.65e+07[1.49e+07;1.73e+07] 1.90e+07[1.80e+07;2.01e+07] 2.52e+06 2.60e-06 X ✗
mpegaudio throughput 1.42e+02[1.42e+02;1.44e+02] 1.39e+02[1.39e+02;1.41e+02] 3.00e+00 3.96e-08 X ✗
scimark.fft.large codesize 5.79e+05[5.68e+05;5.89e+05] 6.75e+05[6.58e+05;6.98e+05] 9.67e+04 2.17e-16 X ✗
scimark.fft.large compiletime 7.23e+06[7.08e+06;7.35e+06] 8.89e+06[8.60e+06;9.09e+06] 1.66e+06 1.09e-16 X ✗
scimark.fft.large throughput 3.40e+01[3.40e+01;3.50e+01] 3.30e+01[3.20e+01;3.50e+01] 1.00e+00 6.88e-02 ✗ ✗
scimark.fft.small codesize 8.39e+05[8.24e+05;8.58e+05] 1.00e+06[9.89e+05;1.03e+06] 1.66e+05 1.09e-16 X ✗
scimark.fft.small compiletime 7.11e+06[6.99e+06;7.18e+06] 9.21e+06[9.10e+06;9.37e+06] 2.10e+06 1.09e-16 X ✗
scimark.fft.small throughput 1.99e+02[1.95e+02;2.22e+02] 2.14e+02[1.97e+02;2.17e+02] 1.55e+01 9.42e-01 ✗ X
scimark.lu.large codesize 5.97e+05[5.81e+05;6.12e+05] 6.42e+05[6.26e+05;6.58e+05] 4.46e+04 3.09e-12 X ✗
scimark.lu.large compiletime 7.78e+06[7.67e+06;7.95e+06] 9.01e+06[8.80e+06;9.28e+06] 1.23e+06 1.09e-16 X ✗
scimark.lu.large throughput 7.00e+00[7.00e+00;7.00e+00] 7.00e+00[7.00e+00;7.00e+00] 0.00e+00 - ✗ ✗
scimark.lu.small codesize 9.42e+05[9.19e+05;9.58e+05] 1.03e+06[1.03e+06;1.07e+06] 9.10e+04 4.05e-14 X ✗
scimark.lu.small compiletime 7.85e+06[7.76e+06;7.98e+06] 9.19e+06[9.03e+06;9.31e+06] 1.34e+06 1.09e-16 X ✗
scimark.lu.small throughput 3.13e+02[3.13e+02;3.13e+02] 3.14e+02[3.14e+02;3.14e+02] 1.00e+00 5.73e-07 X X
scimark.monte_carlo codesize 7.21e+05[7.04e+05;7.39e+05] 7.91e+05[7.76e+05;8.00e+05] 7.00e+04 9.94e-14 X ✗
scimark.monte_carlo compiletime 6.32e+06[6.22e+06;6.44e+06] 7.40e+06[7.25e+06;7.55e+06] 1.08e+06 1.09e-16 X ✗
scimark.monte_carlo throughput 2.06e+02[2.06e+02;2.07e+02] 2.15e+02[2.14e+02;2.25e+02] 9.00e+00 5.77e-11 X X
scimark.sor.large codesize 5.44e+05[5.33e+05;5.60e+05] 5.83e+05[5.56e+05;5.99e+05] 3.95e+04 8.79e-06 X ✗
scimark.sor.large compiletime 5.79e+06[5.66e+06;5.87e+06] 6.52e+06[6.39e+06;6.68e+06] 7.25e+05 4.34e-16 X ✗
scimark.sor.large throughput 4.30e+01[4.20e+01;4.30e+01] 4.30e+01[4.30e+01;4.30e+01] 0.00e+00 2.38e-01 ✗ ✗
scimark.sor.small codesize 6.40e+05[6.24e+05;6.56e+05] 6.82e+05[6.62e+05;7.01e+05] 4.14e+04 9.55e-08 X ✗
scimark.sor.small compiletime 6.02e+06[5.91e+06;6.10e+06] 6.71e+06[6.59e+06;6.84e+06] 6.97e+05 1.09e-16 X ✗
scimark.sor.small throughput 2.09e+02[2.09e+02;2.09e+02] 2.09e+02[2.09e+02;2.09e+02] 0.00e+00 6.01e-01 ✗ ✗
scimark.sparse.large codesize 5.91e+05[5.71e+05;6.12e+05] 6.31e+05[6.20e+05;6.64e+05] 3.95e+04 3.67e-05 X ✗
scimark.sparse.large compiletime 6.62e+06[6.44e+06;6.74e+06] 7.56e+06[7.33e+06;7.85e+06] 9.44e+05 4.90e-13 X ✗
scimark.sparse.large throughput 2.30e+01[2.30e+01;2.30e+01] 2.30e+01[2.20e+01;2.30e+01] 0.00e+00 5.75e-01 ✗ ✗
scimark.sparse.small codesize 7.41e+05[7.12e+05;7.49e+05] 8.11e+05[7.92e+05;8.26e+05] 6.95e+04 7.43e-14 X ✗
scimark.sparse.small compiletime 6.68e+06[6.55e+06;6.76e+06] 7.68e+06[7.51e+06;7.85e+06] 1.00e+06 1.09e-16 X ✗
scimark.sparse.small throughput 2.49e+02[2.48e+02;2.50e+02] 2.49e+02[2.48e+02;2.49e+02] 0.00e+00 7.14e-01 ✗ ✗
serial codesize 2.66e+06[2.58e+06;2.71e+06] 3.05e+06[2.98e+06;3.20e+06] 3.89e+05 1.08e-15 X ✗
serial compiletime 1.89e+07[1.86e+07;1.91e+07] 2.75e+07[2.64e+07;2.82e+07] 8.59e+06 1.08e-15 X ✗
serial throughput 1.46e+02[1.44e+02;1.47e+02] 1.46e+02[1.42e+02;1.47e+02] 0.00e+00 4.81e-01 ✗ ✗
sunflow codesize 1.95e+06[1.89e+06;2.00e+06] 2.09e+06[2.05e+06;2.13e+06] 1.40e+05 2.38e-06 X ✗
sunflow compiletime 1.58e+07[1.53e+07;1.63e+07] 1.83e+07[1.81e+07;1.86e+07] 2.51e+06 1.51e-14 X ✗
sunflow throughput 9.40e+01[9.30e+01;9.40e+01] 9.40e+01[9.40e+01;9.40e+01] 0.00e+00 2.52e-01 ✗ ✗
xml.transform codesize 9.07e+06[8.94e+06;9.17e+06] 1.04e+07[1.04e+07;1.06e+07] 1.36e+06 1.09e-16 X ✗
xml.transform compiletime 5.42e+07[5.35e+07;5.51e+07] 6.91e+07[6.88e+07;7.01e+07] 1.49e+07 1.09e-16 X ✗
xml.transform throughput 3.52e+02[3.51e+02;3.54e+02] 3.60e+02[3.59e+02;3.61e+02] 8.00e+00 2.63e-08 X X
xml.validation codesize 4.57e+06[3.91e+06;6.88e+06] 6.70e+06[4.54e+06;7.73e+06] 2.13e+06 1.40e-02 X ✗
xml.validation compiletime 2.96e+07[2.67e+07;4.68e+07] 5.10e+07[3.37e+07;5.40e+07] 2.14e+07 4.61e-03 X ✗
xml.validation throughput 7.17e+02[7.07e+02;7.21e+02] 7.29e+02[7.22e+02;7.38e+02] 1.20e+01 1.33e-05 X X
† Difference of medians. ∗ p-0.95 confidence for the Wilcoxon-Mann-Whitney test. ‡ Statistical significance. $ All-opt configuration median faster than baseline.

Table C.4: Significance Tests for Java SPECjvm2008.

Columns: Benchmark, Metric, Median Baseline [Q1;Q3], Median All Opt [Q1;Q3], D†, p-0.95∗, S‡, F$ (X = yes, ✗ = no).
Box2D codesize 3.31e+06[3.26e+06;3.36e+06] 3.84e+06[3.77e+06;3.93e+06] 5.23e+05 1.09e-16 X ✗
Box2D compiletime 3.73e+07[3.70e+07;3.77e+07] 5.03e+07[4.97e+07;5.07e+07] 1.29e+07 1.09e-16 X ✗
Box2D throughput 4.80e+04[4.78e+04;4.81e+04] 4.80e+04[4.79e+04;4.82e+04] 1.10e+01 4.71e-01 ✗ X
CodeLoad codesize 1.55e+07[1.53e+07;1.56e+07] 1.87e+07[1.86e+07;1.88e+07] 3.22e+06 1.09e-16 X ✗
CodeLoad compiletime 1.06e+08[1.05e+08;1.08e+08] 1.34e+08[1.33e+08;1.35e+08] 2.83e+07 1.09e-16 X ✗
CodeLoad throughput 8.09e+03[8.05e+03;8.13e+03] 7.99e+03[7.91e+03;8.04e+03] 1.04e+02 1.73e-04 X ✗
Crypto codesize 3.59e+06[3.52e+06;3.65e+06] 4.13e+06[4.07e+06;4.24e+06] 5.42e+05 1.09e-16 X ✗
Crypto compiletime 4.13e+07[3.99e+07;4.22e+07] 5.22e+07[5.00e+07;5.37e+07] 1.09e+07 1.09e-16 X ✗
Crypto throughput 1.89e+04[1.88e+04;1.90e+04] 1.93e+04[1.91e+04;1.94e+04] 3.41e+02 4.01e-06 X X
DeltaBlue codesize 2.37e+06[2.33e+06;2.40e+06] 2.64e+06[2.59e+06;2.70e+06] 2.76e+05 1.09e-16 X ✗
DeltaBlue compiletime 2.21e+07[2.19e+07;2.24e+07] 2.62e+07[2.59e+07;2.66e+07] 4.10e+06 3.26e-15 X ✗
DeltaBlue throughput 4.10e+04[4.09e+04;4.10e+04] 4.67e+04[4.66e+04;4.68e+04] 5.74e+03 1.60e-10 X X
EarleyBoyer codesize 3.47e+06[3.32e+06;3.77e+06] 4.28e+06[3.97e+06;4.38e+06] 8.04e+05 1.29e-11 X ✗
EarleyBoyer compiletime 3.35e+07[3.23e+07;3.51e+07] 4.60e+07[4.40e+07;4.70e+07] 1.25e+07 1.09e-16 X ✗
EarleyBoyer throughput 2.75e+04[2.74e+04;2.76e+04] 3.01e+04[3.00e+04;3.03e+04] 2.67e+03 1.60e-10 X X
Gameboy codesize 4.36e+06[4.27e+06;4.45e+06] 4.78e+06[4.73e+06;4.85e+06] 4.21e+05 2.27e-13 X ✗
Gameboy compiletime 4.03e+07[4.01e+07;4.09e+07] 4.94e+07[4.87e+07;4.98e+07] 9.04e+06 2.27e-13 X ✗
Gameboy throughput 5.77e+04[5.76e+04;5.79e+04] 7.01e+04[6.97e+04;7.03e+04] 1.23e+04 1.60e-10 X X
Mandreel codesize 4.11e+06[4.05e+06;4.14e+06] 4.57e+06[4.53e+06;4.62e+06] 4.61e+05 5.30e-13 X ✗
Mandreel compiletime 3.52e+07[3.50e+07;3.55e+07] 4.12e+07[4.08e+07;4.13e+07] 5.99e+06 5.30e-13 X ✗
Mandreel throughput 2.17e+04[2.16e+04;2.18e+04] 2.23e+04[2.22e+04;2.25e+04] 5.57e+02 1.60e-10 X X
NavierStokes codesize 1.58e+06[1.52e+06;1.61e+06] 1.83e+06[1.80e+06;1.87e+06] 2.46e+05 2.17e-16 X ✗
NavierStokes compiletime 1.61e+07[1.58e+07;1.64e+07] 2.04e+07[2.01e+07;2.06e+07] 4.32e+06 1.09e-16 X ✗
NavierStokes throughput 2.86e+04[2.85e+04;2.86e+04] 2.47e+04[2.47e+04;2.48e+04] 3.81e+03 1.58e-10 X ✗
PdfJS codesize 7.17e+06[7.07e+06;7.32e+06] 8.51e+06[8.38e+06;8.54e+06] 1.34e+06 1.09e-16 X ✗
PdfJS compiletime 7.29e+07[7.23e+07;7.40e+07] 9.75e+07[9.65e+07;9.78e+07] 2.45e+07 1.09e-16 X ✗
PdfJS throughput 2.04e+04[2.02e+04;2.05e+04] 2.18e+04[2.17e+04;2.18e+04] 1.40e+03 1.60e-10 X X
RayTrace codesize 2.12e+06[2.07e+06;2.16e+06] 2.36e+06[2.32e+06;2.43e+06] 2.33e+05 2.17e-16 X ✗
RayTrace compiletime 1.96e+07[1.94e+07;1.99e+07] 2.42e+07[2.38e+07;2.47e+07] 4.54e+06 1.09e-16 X ✗
RayTrace throughput 5.36e+04[5.28e+04;5.44e+04] 5.52e+04[5.49e+04;5.55e+04] 1.64e+03 6.64e-05 X X
RegExp codesize 4.72e+06[4.65e+06;4.81e+06] 5.69e+06[5.67e+06;5.72e+06] 9.74e+05 7.16e-16 X ✗
RegExp compiletime 1.32e+08[1.31e+08;1.33e+08] 1.69e+08[1.69e+08;1.70e+08] 3.74e+07 7.16e-16 X ✗
RegExp throughput 1.13e+03[1.12e+03;1.14e+03] 1.09e+03[1.08e+03;1.10e+03] 4.00e+01 1.30e-08 X ✗
Richards codesize 1.44e+06[1.41e+06;1.47e+06] 1.64e+06[1.59e+06;1.65e+06] 1.98e+05 1.56e-11 X ✗
Richards compiletime 1.30e+07[1.29e+07;1.32e+07] 1.51e+07[1.50e+07;1.54e+07] 2.09e+06 2.17e-16 X ✗
Richards throughput 2.47e+04[2.45e+04;2.49e+04] 2.74e+04[2.73e+04;2.74e+04] 2.68e+03 1.60e-10 X X
Splay codesize 2.30e+06[2.22e+06;2.35e+06] 2.71e+06[2.67e+06;2.78e+06] 4.14e+05 1.09e-16 X ✗
Splay compiletime 2.27e+07[2.21e+07;2.35e+07] 2.96e+07[2.91e+07;3.05e+07] 6.88e+06 1.09e-16 X ✗
Splay throughput 4.62e+03[3.94e+03;5.98e+03] 5.57e+03[5.48e+03;5.69e+03] 9.44e+02 9.60e-02 ✗ X
Typescript codesize 7.64e+06[7.46e+06;7.77e+06] 8.38e+06[8.18e+06;8.54e+06] 7.34e+05 2.12e-14 X ✗
Typescript compiletime 9.31e+07[9.23e+07;9.47e+07] 1.12e+08[1.11e+08;1.14e+08] 1.90e+07 1.09e-16 X ✗
Typescript throughput 9.12e+03[8.98e+03;9.24e+03] 7.36e+03[6.00e+03;7.47e+03] 1.76e+03 1.59e-10 X ✗
zlib codesize 2.37e+06[2.33e+06;2.42e+06] 2.70e+06[2.67e+06;2.75e+06] 3.35e+05 1.09e-16 X ✗
zlib compiletime 2.65e+07[2.63e+07;2.68e+07] 3.44e+07[3.40e+07;3.47e+07] 7.82e+06 1.09e-16 X ✗
zlib throughput 1.89e+04[1.88e+04;1.89e+04] 2.52e+04[2.51e+04;2.53e+04] 6.29e+03 1.59e-10 X X
† Difference of medians. ∗ p-0.95 confidence for the Wilcoxon-Mann-Whitney test. ‡ Statistical significance. $ All-opt configuration median faster than baseline.

Table C.5: Significance Tests for JavaScript Octane.

Columns: Benchmark, Metric, Median Baseline [Q1;Q3], Median All Opt [Q1;Q3], D†, p-0.95∗, S‡, F$ (X = yes, ✗ = no).
bigfib.cpp codesize 2.59e+06[2.54e+06;2.62e+06] 2.85e+06[2.78e+06;2.91e+06] 2.63e+05 2.95e-13 X ✗
bigfib.cpp compiletime 2.45e+07[2.41e+07;2.47e+07] 2.91e+07[2.88e+07;2.94e+07] 4.60e+06 1.09e-16 X ✗
bigfib.cpp throughput 1.57e+02[1.57e+02;1.58e+02] 1.69e+02[1.69e+02;1.69e+02] 1.20e+01 1.43e-11 X X
container.cpp codesize 4.51e+06[4.44e+06;4.56e+06] 4.97e+06[4.89e+06;5.04e+06] 4.62e+05 2.17e-16 X ✗
container.cpp compiletime 5.19e+07[5.11e+07;5.29e+07] 7.22e+07[7.08e+07;7.41e+07] 2.04e+07 1.09e-16 X ✗
container.cpp throughput 1.00e+02[9.90e+01;1.00e+02] 1.03e+02[1.03e+02;1.03e+02] 3.00e+00 1.03e-10 X X
dry.c codesize 9.00e+05[8.64e+05;9.31e+05] 1.05e+06[1.04e+06;1.09e+06] 1.50e+05 1.09e-16 X ✗
dry.c compiletime 8.35e+06[8.14e+06;8.48e+06] 9.92e+06[9.84e+06;1.02e+07] 1.57e+06 1.09e-16 X ✗
dry.c throughput 2.60e+01[2.60e+01;2.60e+01] 9.40e+01[9.40e+01;9.40e+01] 6.80e+01 3.32e-14 X X
float-mm.c codesize 1.30e+06[1.26e+06;1.31e+06] 1.44e+06[1.39e+06;1.47e+06] 1.37e+05 4.22e-07 X ✗
float-mm.c compiletime 1.05e+07[1.03e+07;1.06e+07] 1.24e+07[1.21e+07;1.27e+07] 1.85e+06 2.12e-14 X ✗
float-mm.c throughput 2.18e+02[2.18e+02;2.18e+02] 2.42e+02[2.42e+02;2.42e+02] 2.40e+01 3.17e-13 X X
gcc-loops.cpp codesize 2.51e+06[2.49e+06;2.54e+06] 2.86e+06[2.83e+06;2.95e+06] 3.46e+05 2.06e-15 X ✗
gcc-loops.cpp compiletime 2.13e+07[2.10e+07;2.17e+07] 3.18e+07[3.13e+07;3.25e+07] 1.05e+07 1.09e-16 X ✗
gcc-loops.cpp throughput 1.01e+02[1.01e+02;1.01e+02] 1.05e+02[1.05e+02;1.05e+02] 4.00e+00 1.47e-13 X X
hash-map codesize 1.73e+06[1.69e+06;1.76e+06] 1.93e+06[1.91e+06;1.96e+06] 2.00e+05 4.89e-15 X ✗
hash-map compiletime 1.51e+07[1.49e+07;1.53e+07] 1.82e+07[1.80e+07;1.84e+07] 3.11e+06 1.09e-16 X ✗
hash-map throughput 1.24e+05[1.23e+05;1.24e+05] 1.44e+05[1.35e+05;1.46e+05] 1.99e+04 2.75e-06 X X
n-body.c codesize 1.39e+06[1.35e+06;1.42e+06] 1.58e+06[1.56e+06;1.65e+06] 1.97e+05 2.17e-16 X ✗
n-body.c compiletime 1.15e+07[1.14e+07;1.18e+07] 1.52e+07[1.51e+07;1.58e+07] 3.63e+06 1.09e-16 X ✗
n-body.c throughput 8.40e+01[8.40e+01;8.40e+01] 8.80e+01[8.80e+01;8.80e+01] 4.00e+00 3.31e-14 X X
quicksort.c codesize 1.54e+06[1.49e+06;1.59e+06] 1.89e+06[1.83e+06;1.95e+06] 3.43e+05 4.34e-16 X ✗
quicksort.c compiletime 1.32e+07[1.30e+07;1.35e+07] 1.80e+07[1.77e+07;1.83e+07] 4.83e+06 1.09e-16 X ✗
quicksort.c throughput 1.60e+02[1.60e+02;1.60e+02] 1.61e+02[1.60e+02;1.61e+02] 1.00e+00 7.03e-05 X X
towers.c codesize 1.30e+06[1.26e+06;1.33e+06] 1.50e+06[1.44e+06;1.56e+06] 1.98e+05 1.03e-08 X ✗
towers.c compiletime 1.09e+07[1.07e+07;1.12e+07] 1.32e+07[1.27e+07;1.36e+07] 2.22e+06 1.09e-16 X ✗
towers.c throughput 7.30e+01[7.30e+01;7.30e+01] 7.20e+01[7.20e+01;7.30e+01] 1.00e+00 1.10e-04 X ✗
† Difference of medians. ∗ p-0.95 confidence for the Wilcoxon-Mann-Whitney test. ‡ Statistical significance. $ All-opt configuration median faster than baseline.

Table C.6: Significance Tests for JavaScript jetstream asm.js.

C.2.1 Interpretation

For all benchmark suites, the all-opt configuration produces faster code than the baseline configuration more often than it produces slower code. Table C.7 reports how often the all-opt configuration is faster or slower than the baseline. The best result is achieved on the JavaScript jetstream asm.js suite, where all-opt is significantly faster on 8 benchmarks and significantly slower on only 1. Results for Java SPECjvm2008 are mixed: 4 benchmarks are significantly faster and 3 significantly slower, while 11 benchmarks tend to be slower, though not significantly. For Renaissance, improvements are also more frequent than regressions, and the same holds for Java DaCapo and ScalaBench. Based on this data, we conclude that the optimizations presented in this thesis can significantly improve the performance of Java and JavaScript applications and are therefore suitable for use in production compilers such as GraalVM.

Counts of benchmarks per suite where the all-opt configuration is faster or slower than the baseline:

Benchmark Suite              Faster (Significant)  Faster (Not Significant)  Slower (Significant)  Slower (Not Significant)
Java DaCapo                  4                     1                         2                     3
ScalaBench                   6                     2                         3                     1
Renaissance                  11                    4                         4                     3
Java SPECjvm2008             4                     2                         3                     11
JavaScript Octane            9                     2                         4                     0
JavaScript jetstream asm.js  8                     0                         1                     0

Table C.7: Significance & Performance Results.
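The counts in Table C.7 are a straightforward aggregation of the per-benchmark S and F flags from the runtime and throughput rows of the preceding significance tables. A minimal sketch of this bucketing follows; the flag tuples are invented placeholders rather than the actual per-benchmark results.

```python
# Illustrative aggregation of per-benchmark (significant, faster) flags into
# the four categories of Table C.7; the flag tuples below are invented.
from collections import Counter

def bucket(significant: bool, faster: bool) -> str:
    speed = "faster" if faster else "slower"
    strength = "significant" if significant else "not significant"
    return f"all-opt {speed}, {strength}"

# One (S, F) pair per benchmark of a hypothetical suite:
flags = [(True, True), (True, True), (False, True), (True, False), (False, False)]
print(Counter(bucket(s, f) for s, f in flags))
```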

193

Bibliography

[1] Ali-Reza Adl-Tabatabai, Brian T. Lewis, Vijay Menon, Brian R. Murphy, Bratin Saha, and Tatiana Shpeisman. 2006. Compiler and Runtime Support for Efficient Software Transac- tional Memory. In Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’06). ACM, New York, NY, USA, 26–37. https://doi.org/10.1145/1133981.1133985

[2] Vicki H. Allan, Reese B. Jones, Randall M. Lee, and Stephen J. Allan. 1995. Software Pipelining. ACM Comput. Surv. 27, 3 (Sept. 1995), 367–432. https://doi.org/10. 1145/212094.212131

[3] Frances E. Allen. 1970. Control Flow Analysis. In Proceedings of a Symposium on Compiler Optimization. ACM, New York, NY, USA, 1–19. https://doi.org/10.1145/800028. 808479

[4] Bowen Alpern, Steve Augart, Stephen M Blackburn, Maria Butrico, Anthony Cocchi, Perry Cheng, Julian Dolby, Stephen Fink, David Grove, Michael Hind, et al. 2005. The Jikes research virtual machine project: building an open-source research community. IBM Systems Journal 44, 2 (2005), 399–417.

[5] Glenn Ammons, Thomas Ball, and James R. Larus. 1997. Exploiting Hardware Performance Counters with Flow and Context Sensitive Profiling. In Proceedings of the ACM SIGPLAN 1997 Conference on Programming Language Design and Implementation (PLDI ’97). ACM, New York, NY, USA, 85–96. https://doi.org/10.1145/258915.258924

[6] Matthew Arnold, Stephen Fink, Vivek Sarkar, and Peter F. Sweeney. 2000. A Comparative Study of Static and Profile-based Heuristics for Inlining. In DYNAMO’00. ACM, New York, NY, USA, 52–64. https://doi.org/10.1145/351397.351416

[7] David F. Bacon, Susan L. Graham, and Oliver J. Sharp. 1994. Compiler Transformations for High-performance Computing. ACM Comput. Surv. 26, 4 (1994), 345–420. https: //doi.org/10.1145/197405.197406 194 Bibliography

[8] J. Eugene Ball. 1979. Predicting the Effects of Optimization on a Procedure Body. In Proceedings of the 1979 SIGPLAN Symposium on Compiler Construction (SIGPLAN ’79). ACM, New York, NY, USA, 214–220. https://doi.org/10.1145/800229.806972

[9] Edd Barrett, Carl Friedrich Bolz-Tereick, Rebecca Killick, Sarah Mount, and Laurence Tratt. 2017. Virtual Machine Warmup Blows Hot and Cold. Proc. ACM Program. Lang. 1, OOPSLA, Article 52 (Oct. 2017), 27 pages. https://doi.org/10.1145/3133876

[10] Michael Bebenita, Mason Chang, Gregor Wagner, Andreas Gal, Christian Wimmer, and Michael Franz. 2010. Trace-based Compilation in Execution Environments Without Inter- preters. In Proceedings of the 8th International Conference on the Principles and Prac- tice of Programming in Java (PPPJ ’10). ACM, New York, NY, USA, 59–68. https: //doi.org/10.1145/1852761.1852771

[11] Stephen M. Blackburn, Robin Garner, Chris Hoffmann, Asjad M. Khang, Kathryn S. McKin- ley, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Mar- tin Hirzel, Antony Hosking, Maria Jump, Han Lee, J. Eliot B. Moss, Aashish Phansalkar, Darko Stefanović, Thomas VanDrunen, Daniel von Dincklage, and Ben Wiedermann. 2006. The DaCapo Benchmarks: Java Benchmarking Development and Analysis. In Proceed- ings of the 21st Annual ACM SIGPLAN Conference on Object-oriented Programming Sys- tems, Languages, and Applications (OOPSLA ’06). ACM, New York, NY, USA, 169–190. https://doi.org/10.1145/1167473.1167488

[12] Gabriel Hjort Blindell, Mats Carlsson, Roberto Castañeda Lozano, and Christian Schulte. 2017. Complete and Practical Universal Instruction Selection. ACM Transactions on Em- bedded Computing Systems (TECS) 16, 5s (2017), 119.

[13] Rastislav Bodík, Rajiv Gupta, and Mary Lou Soffa. 1998. Complete Removal of Redun- dant Expressions. In Proceedings of the ACM SIGPLAN 1998 Conference on Program- ming Language Design and Implementation (PLDI ’98). ACM, New York, NY, USA, 1–14. https://doi.org/10.1145/277650.277653

[14] E. Brewer. 2012. CAP Twelve Years Later: How the "Rules" Have Changed. Computer 45, 02 (feb 2012), 23–29. https://doi.org/10.1109/MC.2012.37

[15] Preston Briggs and Keith D. Cooper. 1994. Effective Partial Redundancy Elimination. In Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation (PLDI ’94). ACM, New York, NY, USA, 159–170. https://doi. org/10.1145/178243.178257 Bibliography 195

[16] Preston Briggs, Keith D. Cooper, and L. Taylor Simpson. 1997. Value Numbering. Softw. Pract. Exper. 27, 6 (June 1997), 701–724. https://doi.org/10.1002/(SICI) 1097-024X(199706)27:6<701::AID-SPE104>3.3.CO;2-S

[17] B. Buchberger and R. Loos. 1982. Algebraic Simplification. In Computer Algebra: Symbolic and Algebraic Computation, Bruno Buchberger, George Edwin Collins, and Rüdiger Loos (Eds.). Springer Vienna, Vienna, 11–43. https://doi.org/10.1007/ 978-3-7091-3406-1_2

[18] Brad Calder and Dirk Grunwald. 1994. Reducing Branch Costs via Branch Alignment. In Proceedings of the Sixth International Conference on Architectural Support for Pro- gramming Languages and Operating Systems (ASPLOS VI). ACM, New York, NY, USA, 242–251. https://doi.org/10.1145/195473.195553

[19] Andreu Carminati, Renan Augusto Starke, and Rômulo Silva de Oliveira. 2017. Combining loop unrolling strategies and code predication to reduce the worst-case execution time of real-time software. Applied Computing and Informatics 13, 2 (2017), 184 – 193. https: //doi.org/10.1016/j.aci.2017.03.002

[20] S. Carr, , and P. Sweany. 1996. Improving software pipelining with unroll-and-jam. In Proceedings of HICSS-29: 29th Hawaii International Conference on System Sciences, Vol. 1. 183–192 vol.1. https://doi.org/10.1109/HICSS.1996.495462

[21] Calin Cascaval, Luiz DeRose, David A Padua, and Daniel A Reed. 1999. Compile-time based performance prediction. In International Workshop on Languages and Compilers for Parallel Computing. Springer, 365–379. https://doi.org/10.1007/3-540-44905-1_23

[22] Stefano Cazzulani. 2012. Octane: The JavaScript benchmark suite for the modern web. Retrieved December 21 (2012), 2015. https://blog.chromium.org/2012/08/ octane-javascript-benchmark-suite-for.html

[23] J. Ceng, J. Castrillon, W. Sheng, H. Scharwächter, R. Leupers, G. Ascheid, H. Meyr, T. Is- shiki, and H. Kunieda. 2008. MAPS: An Integrated Framework for MPSoC Application Par- allelization. In Proceedings of the 45th Annual Design Automation Conference (DAC ’08). ACM, New York, NY, USA, 754–759. https://doi.org/10.1145/1391469.1391663

[24] G. J. Chaitin. 1982. Register Allocation & Spilling via Graph Coloring. In Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction (SIGPLAN ’82). ACM, New York, NY, USA, 98–105. https://doi.org/10.1145/800230.806984 196 Bibliography

[25] Craig David Chambers. 1992. The Design and Implementation of the Self Compiler, an Optimizing Compiler for Object-oriented Programming Languages. Ph.D. Dissertation. Stanford, CA, USA. UMI Order No. GAX92-21602.

[26] Pohua P. Chang, Scott A. Mahlke, and Wen-mei W. Hwu. 1991. Using Profile Information to Assist Classic Code Optimizations. Softw. Pract. Exper. 21, 12 (Dec. 1991), 1301–1321. https://doi.org/10.1002/spe.4380211204

[27] A. Charlesworth. 1981. An Approach to Scientific Array Processing: The Architectural Design of the AP-120B/FPS-164 Family. Computer 14, 09 (sep 1981), 18–27. https: //doi.org/10.1109/C-M.1981.220595

[28] Fred Chow, Sun Chan, Robert Kennedy, Shin-Ming Liu, Raymond Lo, and Peng Tu. 1997. A New Algorithm for Partial Redundancy Elimination Based on SSA Form. In Proceedings of the ACM SIGPLAN 1997 Conference on Programming Language De- sign and Implementation (PLDI ’97). ACM, New York, NY, USA, 273–286. https: //doi.org/10.1145/258915.258940

[29] Cliff Click. 1995. Global Code Motion/Global Value Numbering. In PLDI’95. ACM, New York, NY, USA, 246–257. https://doi.org/10.1145/207110.207154

[30] Cliff Click and Keith D. Cooper. 1995. Combining Analyses, Combining Optimizations. ACM Trans. Program. Lang. Syst. 17, 2 (March 1995), 181–196. https://doi.org/10. 1145/201059.201061

[31] Codecache Tuning 2018. Oracle Java SE Embedded Developer’s Guide. (2018). https:// docs.oracle.com/javase/8/embedded/develop-apps-platforms/codecache.htm

[32] Ruby Community. 2018. Ruby Programming Language. (2018). https://www.ruby-lang. org/en/

[33] Keith Cooper and Linda Torczon. 2011. Engineering a compiler. Elsevier.

[34] Keith D Cooper, Mary W Hall, and Ken Kennedy. 1993. A methodology for procedure cloning. Computer Languages 19, 2 (1993), 105–117. https://doi.org/10.1016/ 0096-0551(93)90005-L ICCL ’92.

[35] Keith D Cooper, Timothy J Harvey, and Ken Kennedy. 2001. A simple, fast dominance algorithm. Software Practice & Experience 4, 1-10 (2001), 1–8. Bibliography 197

[36] Keith D Cooper, Kathryn S Mckinley, and Linda Torczon. 1998. Compiler-Based Code- Improvement Techniques. (1998).

[37] Keith D Cooper, Devika Subramanian, and Linda Torczon. 2002. Adaptive optimizing compilers for the 21st century. The Journal of Supercomputing 23, 1 (2002), 7–22.

[38] Matt Curtin. 1998. Write once, run anywhere: Why it matters. Technical Article. http://java. sun. com/features/1998/01/wo (1998).

[39] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. 1991. Efficiently Computing Static Single Assignment Form and the Control Dependence Graph. ACM Trans. Program. Lang. Syst. 13, 4 (Oct. 1991), 40. https://doi.org/10. 1145/115372.115320

[40] Benoit Daloze, Stefan Marr, Daniele Bonetta, and Hanspeter Mössenböck. 2016. Efficient and Thread-safe Objects for Dynamically-typed Languages. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2016). ACM, New York, NY, USA, 642–659. https://doi. org/10.1145/2983990.2984001

[41] Benoit Daloze, Chris Seaton, Daniele Bonetta, and Hanspeter Mössenböck. 2015. Tech- niques and Applications for Guest-language Safepoints. In Proceedings of the 10th Work- shop on Implementation, Compilation, Optimization of Object-Oriented Languages, Pro- grams and Systems (ICOOOLPS ’15). ACM, New York, NY, USA, Article 8, 10 pages. https://doi.org/10.1145/2843915.2843921

[42] Benoit Daloze, Arie Tal, Stefan Marr, Hanspeter Mössenböck, and Erez Petrank. 2018. Parallelization of Dynamic Languages: Synchronizing Built-in Collections. Proc. ACM Program. Lang. 2, OOPSLA, Article 108 (Oct. 2018), 30 pages. https://doi.org/10. 1145/3276478

[43] Jack W Davidson and Sanjay Jinturkar. 1995. An aggressive approach to loop unrolling. Technical Report. Technical Report CS-95-26, Department of Computer Science, University of Virginia, Charlottesville.

[44] Jack W Davidson and Sanjay Jinturkar. 1995. Improving instruction-level parallelism by loop unrolling and dynamic memory disambiguation. In Microarchitecture, 1995., Proceedings of the 28th Annual International Symposium on. IEEE, 125–132. 198 Bibliography

[45] Jack W. Davidson and Sanjay Jinturkar. 1996. Aggressive loop unrolling in a retargetable, optimizing compiler. In Compiler Construction, Tibor Gyimóthy (Ed.). Springer Berlin Hei- delberg, Berlin, Heidelberg, 59–73.

[46] Jeffrey Dean, David Grove, and Craig Chambers. 1995. Optimization of Object-Oriented Programs Using Static Class Hierarchy Analysis. In Proceedings of the 9th European Confer- ence on Object-Oriented Programming (ECOOP ’95). Springer-Verlag, Berlin, Heidelberg, 77–101. http://dl.acm.org/citation.cfm?id=646153.679523

[47] Daniele Cono D’Elia and Camil Demetrescu. 2018. On-stack Replacement, Distilled. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018). ACM, New York, NY, USA, 166–180. https://doi. org/10.1145/3192366.3192396

[48] Edsger Wybe Dijkstra, Edsger Wybe Dijkstra, Edsger Wybe Dijkstra, Etats-Unis Informati- cien, and Edsger Wybe Dijkstra. 1976. A discipline of programming. Vol. 1. prentice-hall Englewood Cliffs.

[49] Pedro Diniz and Martin Rinard. 1997. Lock coarsening: Eliminating lock overhead in auto- matically parallelized object-based programs. Springer Berlin Heidelberg, Berlin, Heidelberg, 285–299. https://doi.org/10.1007/BFb0017259

[50] Thomas J. Watson IBM Research Center. Research Division, FE Allen, and J Cocke. 1971. A catalogue of optimizing transformations.

[51] Amer Diwan, Kathryn S. McKinley, and J. Eliot B. Moss. 1998. Type-based Alias Analysis. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation (PLDI ’98). ACM, New York, NY, USA, 106–117. https://doi. org/10.1145/277650.277670

[52] Lukasz Domagala, Duco van Amstel, Fabrice Rastello, and P. Sadayappan. 2016. Register Allocation and Promotion Through Combined Instruction Scheduling and Loop Unrolling. In Proceedings of the 25th International Conference on Compiler Construction (CC 2016). ACM, New York, NY, USA, 143–151. https://doi.org/10.1145/2892208.2892219

[53] Jialin Dou and Marcelo Cintra. 2007. A Compiler Cost Model for Speculative Parallelization. ACM TACO 4, 2, Article 12 (June 2007). https://doi.org/10.1145/1250727.1250732 Bibliography 199

[54] Christophe Dubach, John Cavazos, Björn Franke, Grigori Fursin, Michael F.P. O’Boyle, and Olivier Temam. 2007. Fast Compiler Optimisation Evaluation Using Code-feature Based Performance Prediction. In Proceedings of the 4th International Conference on Computing Frontiers (CF ’07). ACM, New York, NY, USA, 131–142. https://doi.org/10.1145/ 1242531.1242553

[55] Gilles Duboscq, Lukas Stadler, Thomas Würthinger, Doug Simon, Christian Wimmer, and Hanspeter Mössenböck. 2013. Graal IR: An Extensible Declarative Intermediate Representa- tion. In Proceedings of the Asia-Pacific Programming Languages and Compilers Workshop.

[56] Gilles Duboscq, Thomas Würthinger, and Hanspeter Mössenböck. 2014. Speculation With- out Regret: Reducing Deoptimization Meta-data in the Graal Compiler. In PPPJ ’14. ACM, New York, NY, USA, 7. https://doi.org/10.1145/2647508.2647521

[57] Gilles Duboscq, Thomas Würthinger, Lukas Stadler, Christian Wimmer, Doug Simon, and Hanspeter Mössenböck. 2013. An Intermediate Representation for Speculative Optimiza- tions in a Dynamic Compiler. In VMIL’13. https://doi.org/10.1145/2542142.2542143

[58] Gilles Marie Duboscq. 2016. Combining Speculative Optimizations with Flexible Scheduling of Side-effects. Ph.D. Dissertation. Linz, Upper Austria, Austria.

[59] ECMA. 2017. ECMASCRIPT 2017 JavaScript Specification. (2017). https://www. ecma-international.org/publications/standards/Ecma-262.htm

[60] Josef Eisl. 2018. Trace Register Allocation. Ph.D. Dissertation. Linz, Upper Austria, Aus- tria.

[61] Josef Eisl, Matthias Grimmer, Doug Simon, Thomas Würthinger, and Hanspeter Mössen- böck. 2016. Trace-based Register Allocation in a JIT Compiler. In PPPJ ’16. ACM, New York, NY, USA. https://doi.org/10.1145/2972206.2972211

[62] Josef Eisl, David Leopoldseder, and Hanspeter Mössenböck. 2018. Parallel Trace Register Allocation. In Proceedings of the 15th International Conference on Managed Languages & Runtimes (ManLang ’18). ACM, New York, NY, USA, Article 7, 7 pages. https: //doi.org/10.1145/3237009.3237010

[63] Josef Eisl, Stefan Marr, Thomas Würthinger, and Hanspeter Mössenböck. 2017. Trace Register Allocation Policies: Compile-time vs. Performance Trade-offs. In Proceedings of the 14th International Conference on Managed Languages and Runtimes (ManLang 2017). ACM, New York, NY, USA, 92–104. https://doi.org/10.1145/3132190.3132209 200 Bibliography

[64] Sara El-Shobaky, Ahmed El-Mahdy, and Ahmed El-Nahas. 2009. Automatic Vectoriza- tion Using Dynamic Compilation and Tree Pattern Matching Technique in Jikes RVM. In Proceedings of the 4th Workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems (ICOOOLPS ’09). ACM, New York, NY, USA, 63–69. https://doi.org/10.1145/1565824.1565833

[65] EPFL. 2018. Scala Programming Language. (2018). https://www.scala-lang.org/

[66] Joseph A. Fisher. 1995. Instruction-level Parallel Processors. IEEE Computer Society Press, Chapter Trace Scheduling: A Technique for Global Microcode Compaction, 186– 198. http://dl.acm.org/citation.cfm?id=201749.201766

[67] Philip J. Fleming and John J. Wallace. 1986. How Not to Lie with Statistics: The Correct Way to Summarize Benchmark Results. Commun. ACM 29, 3 (March 1986), 218–221. https://doi.org/10.1145/5666.5673

[68] Agner Fog. 2018. Instruction Tables for Intel, AMD and VIA CPUs. (2018). http: //www.agner.org/optimize/instruction_tables.pdf

[69] Python Software Foundation. 2018. Python 3 Programming Language. (2018). https: //www.python.org/download/releases/3.0/

[70] R Foundation. 2018. R Project. (2018). https://www.r-project.org/

[71] Christopher W. Fraser, Robert R. Henry, and Todd A. Proebsting. 1992. BURG: Fast Optimal Instruction Selection and Tree Parsing. SIGPLAN Not. 27, 4 (April 1992), 68–76. https://doi.org/10.1145/131080.131089

[72] Michael Frigge, David C. Hoaglin, and Boris Iglewicz. 1989. Some Implementations of the Boxplot. The American Statistician 43, 1 (1989), 50–54. http://www.jstor.org/ stable/2685173

[73] G. Fursin, M. F. P. O’Boyle, O. Temam, and G. Watts. 2004. A fast and accurate method for determining a lower bound on execution time. Concurrency and Computation: Practice and Experience 16, 2-3 (2004), 271–292. https://doi.org/10.1002/cpe.774

[74] Yoshihiko Futamura. 1999. Partial Evaluation of Computation Process–An Approach to a Compiler-Compiler. Higher-Order and Symbolic Computation 12, 4 (01 Dec 1999), 381– 391. https://doi.org/10.1023/A:1010095604496 Bibliography 201

[75] Andy Georges, Dries Buytaert, and Lieven Eeckhout. 2007. Statistically Rigorous Java Performance Evaluation. In Proceedings of the 22Nd Annual ACM SIGPLAN Conference on Object-oriented Programming Systems and Applications (OOPSLA ’07). ACM, New York, NY, USA, 57–76. https://doi.org/10.1145/1297027.1297033

[76] Loukas Georgiadis, Renato F. Werneck, Robert E. Tarjan, Spyridon Triantafyllis, and David I. August. 2004. Finding Dominators in Practice. (2004), 677–688.

[77] Philip B. Gibbons and Steven S. Muchnick. 1986. Efficient Instruction Scheduling for a Pipelined Architecture. In Proceedings of the 1986 SIGPLAN Symposium on Compiler Construction (SIGPLAN ’86). ACM, New York, NY, USA, 11–16. https://doi.org/10. 1145/12276.13312

[78] GNU. 2018. GCC, the GNU Compiler Collection. (2018). https://gcc.gnu.org/

[79] Google. 2012. V8 JavaScript Engine. (2012). http://code.google.com/p/v8/

[80] Isaac. Gouy. 2018. Computer Language Benchmark Game: Java vs C. (2018). https: //benchmarksgame-team.pages.debian.net/benchmarksgame/faster/java.html

[81] Matthias Grimmer, Roland Schatz, Chris Seaton, Thomas Würthinger, and Hanspeter Mössenböck. 2015. Memory-safe Execution of C on a Java VM. In Proceedings of the 10th ACM Workshop on Programming Languages and Analysis for Security (PLAS’15). ACM, New York, NY, USA, 16–27. https://doi.org/10.1145/2786558.2786565

[82] Matthias Grimmer, Chris Seaton, Roland Schatz, Thomas Würthinger, and Hanspeter Mössenböck. 2015. High-performance Cross-language Interoperability in a Multi-language Runtime. In Proceedings of the 11th Symposium on Dynamic Languages (DLS 2015). ACM, New York, NY, USA, 78–90. https://doi.org/10.1145/2816707.2816714

[83] Matthias Grimmer, Chris Seaton, Thomas Würthinger, and Hanspeter Mössenböck. 2015. Dynamically Composing Languages in a Modular Way: Supporting C Extensions for Dy- namic Languages. In Proceedings of the 14th International Conference on Modularity (MODULARITY 2015). ACM, New York, NY, USA, 1–13. https://doi.org/10.1145/ 2724525.2728790

[84] Bastian Hagedorn, Larisa Stoltzfus, Michel Steuwer, Sergei Gorlatch, and Christophe Dubach. 2018. High Performance Stencil Code Generation with Lift. In Proceedings of the 2018 International Symposium on Code Generation and Optimization (CGO 2018). ACM, New York, NY, USA, 100–112. https://doi.org/10.1145/3168824 202 Bibliography

[85] Christian Häubl and Hanspeter Mössenböck. 2011. Trace-based Compilation for the Java HotSpot Virtual Machine. In Proceedings of the International Conference on the Principles and Practice of Programming in Java. ACM Press, 129–138. https://doi.org/10.1145/ 2093157.2093176

[86] Christian Häubl, Christian Wimmer, and Hanspeter Mössenböck. 2012. Evaluation of Trace Inlining Heuristics for Java. In Proceedings of the ACM Symposium on Applied Computing. ACM Press, 1871–1876. https://doi.org/10.1145/2245276.2232084

[87] Christian Häubl, Christian Wimmer, and Hanspeter Mössenböck. 2013. Context-sensitive Trace Inlining for Java. Comput. Lang. Syst. Struct. 39, 4 (2013), 123–141. https: //doi.org/10.1016/j.cl.2013.04.002

[88] Christian Häubl, Christian Wimmer, and Hanspeter Mössenböck. 2013. Deriving Code Cov- erage Information from Profiling Data Recorded for a Trace-based Just-in-time Compiler. In Proceedings of the International Conference on the Principles and Practice of Programming in Java. ACM Press, 1–12. https://doi.org/10.1145/2500828.2500829

[89] John Hennessy, Norman Jouppi, Steven Przybylski, Christopher Rowen, Thomas Gross, For- est Baskett, and John Gill. 1982. MIPS: A microprocessor architecture. In ACM SIGMICRO Newsletter, Vol. 13. IEEE Press, 17–22.

[90] Peter Hofer. 2016. Method Profiling and Lock Contention Profiling on the Java Virtual Machine Level. Ph.D. Dissertation. Linz, Upper Austria, Austria.

[91] Peter Hofer, David Gnedt, Andreas Schörgenhumer, and Hanspeter Mössenböck. 2016. Efficient Tracing and Versatile Analysis of Lock Contention in Java Applications on the Virtual Machine Level. In Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering (ICPE ’16). ACM, New York, NY, USA, 263–274. https: //doi.org/10.1145/2851553.2851559

[92] Urs Hölzle, Craig Chambers, and David Ungar. 1992. Debugging Optimized Code with Dynamic Deoptimization. In PLDI’92. ACM, New York, NY, USA, 32–43. https://doi. org/10.1145/143095.143114

[93] HotSpot JVM 2018. Java Version History (J2SE 1.3). (2018). http://en.wikipedia. org/wiki/Java_version_history Bibliography 203

[94] Jung-Chang Huang and Tau Leng. 1999. Generalized loop-unrolling: a method for program speedup. In Application-Specific Systems and Software Engineering and Technology, 1999. ASSET’99. Proceedings. 1999 IEEE Symposium on. IEEE, 244–248.

[95] Christian Humer, Christian Wimmer, Christian Wirth, Andreas Wöß, and Thomas Würthinger. 2014. A Domain-specific Language for Building Self-optimizing AST In- terpreters. In Proceedings of the 2014 International Conference on Generative Program- ming: Concepts and Experiences (GPCE 2014). ACM, New York, NY, USA, 123–132. https://doi.org/10.1145/2658761.2658776

[96] Wen-Mei W. Hwu, Scott A. Mahlke, William Y. Chen, Pohua P. Chang, Nancy J. Warter, Roger A. Bringmann, Roland G. Quellette, Richard E. Hank, Tokuzo Kiyohara, Grant E. Haab, John G. Holm, and Daniel M. Lavery. 1995. Instruction-level Parallel Proces- sors. IEEE Computer Society Press, Chapter The Superblock: An Effective Technique for VLIW and Superscalar Compilation, 234–253. http://dl.acm.org/citation.cfm? id=201749.201774

[97] IBM. 2018. IBM J9 VM. (2018). https://www.ibm.com/support/knowledgecenter/ de/SSYKE2_8.0.0/welcome/welcome_javasdk_version.html

[98] Apple Inc. 2018. Webkit JavaScript Engine. (2018). https://webkit.org/

[99] Hiroshi Inoue, Hiroshige Hayashizaki, Peng Wu, and Toshio Nakatani. 2011. A trace-based Java JIT compiler retrofitted from a method-based compiler. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization. IEEE Computer Society, 246–256.

[100] Intel. 2018. Intel Turbo Boost. (2018). https://www.intel.com/content/www/us/en/ support/articles/000007359/processors/intel-core-processors.html

[101] Intel. 2019. Intel R 64 and IA-32 Architectures Software Developer’s Manual. (2019). https://software.intel.com/en-us/articles/intel-sdm

[102] Kazuaki Ishizaki, Motohiro Kawahito, Toshiaki Yasue, Hideaki Komatsu, and Toshio Nakatani. 2000. A Study of Devirtualization Techniques for a Java Just-In-Time Com- piler. In OOPSLA ’00. ACM, 294–310. https://doi.org/10.1145/353171.353191

[103] JKU. 2018. Johannes Kepler University Linz, Austria. (2018). https://www.jku.at/ 204 Bibliography

[104] Robert Kennedy, Sun Chan, Shin-Ming Liu, Raymond Lo, Peng Tu, and Fred Chow. 1999. Partial Redundancy Elimination in SSA Form. ACM Trans. Program. Lang. Syst. 21, 3 (May 1999), 627–676. https://doi.org/10.1145/319301.319348

[105] T. Kisuki, P. M. W. Knijnenburg, and M. F. P. O’Boyle. 2000. Combined selection of tile sizes and unroll factors using iterative compilation. In Proceedings 2000 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00622). 237– 246. https://doi.org/10.1109/PACT.2000.888348

[106] Toru Kisuki, Peter M. W. Knijnenburg, Mike F. P. O’Boyle, François Bodin, and Harry A. G. Wijshoff. 1999. A feasibility study in iterative compilation. In High Performance Computing, Constantine Polychronopoulos, Kazuki Joe Akira Fukuda, and Shinji Tomita (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 121–132.

[107] P. M. W. Knijnenburg, T. Kisuki, and M. F. P. O’Boyle. 2002. Iterative Compilation. Springer Berlin Heidelberg, Berlin, Heidelberg, 171–187. https://doi.org/10.1007/ 3-540-45874-3_10

[108] Thomas Kotzmann and Hanspeter Mössenböck. 2005. Escape Analysis in the Context of Dynamic Compilation and Deoptimization. In Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments (VEE ’05). ACM, New York, NY, USA, 111–120. https://doi.org/10.1145/1064979.1064996

[109] Thomas Kotzmann, Christian Wimmer, Hanspeter Mössenböck, Thomas Rodriguez, Ken- neth Russell, and David Cox. 2008. Design of the Java HotSpot&Trade; Client Compiler for Java 6. ACM Trans. Archit. Code Optim. 5, 1, Article 7 (May 2008), 32 pages. https://doi.org/10.1145/1369396.1370017

[110] Andreas Krall and Sylvain Lelait. 2000. Compilation Techniques for Multimedia Processors. International Journal of Parallel Programming 28, 4 (01 Aug 2000), 347–361. https: //doi.org/10.1023/A:1007507005174

[111] Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In CGO ’04 (CGO ’04). IEEE Computer Society. http://dl.acm.org/citation.cfm?id=977395.977673 Bibliography 205

[112] Hugh Leather, Edwin Bonilla, and Michael O’Boyle. 2009. Automatic Feature Generation for Machine Learning Based Optimizing Compilation. In Proceedings of the 7th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO ’09). IEEE Computer Society, Washington, DC, USA, 81–91. https://doi.org/10.1109/ CGO.2009.21

[113] Philipp Lengauer and Hanspeter Mössenböck. 2014. The Taming of the Shrew: Increasing Performance by Automatic Parameter Tuning for Java Garbage Collectors. In Proceedings of the 5th ACM/SPEC International Conference on Performance Engineering (ICPE ’14). ACM, New York, NY, USA, 111–122. https://doi.org/10.1145/2568088.2568091

[114] Thomas Lengauer and Robert Endre Tarjan. 1979. A fast algorithm for finding dominators in a flowgraph. ACM Transactions on Programming Languages and Systems (TOPLAS) 1, 1 (1979), 121–141.

[115] David Leopoldseder. 2015. Master Thesis: Graal AOT JS: A Java to JavaScript Compiler. http://epub.jku.at/obvulihs/content/titleinfo/912629. (2015).

[116] David Leopoldseder. 2017. Simulation-based Code Duplication for Enhancing Compiler Optimizations. In SPLASH Companion 2017. ACM, New York, NY, USA, 10–12. https: //doi.org/10.1145/3135932.3135935

[117] David Leopoldseder, Roland Schatz, Lukas Stadler, Manuel Rigger, and Hanspeter Mössen- böck. 2018. Fast-Path Loop Unrolling of Non-Counted Loops to Enable Subsequent Compiler Optimizations. In Proceedings of the 15th International Conference on Man- aged Languages and Runtimes (ManLang 2018). ACM, New York, NY, USA. https: //doi.org/10.1145/3237009.3237013

[118] David Leopoldseder, Lukas Stadler, Manuel Rigger, Thomas Würthinger, and Hanspeter Mössenböck. 2018. A Cost Model for a Graph-based Intermediate-representation in a Dynamic Compiler. In Proceedings of the 10th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages (VMIL 2018). ACM, New York, NY, USA, 26–35. https://doi.org/10.1145/3281287.3281290

[119] David Leopoldseder, Lukas Stadler, Christian Wimmer, and Hanspeter Mössenböck. 2015. Java-to-JavaScript Translation via Structured Control Flow Reconstruction of Compiler IR. In DLS’15. ACM, New York, NY, USA, 91–103. https://doi.org/10.1145/2816707. 2816715 206 Bibliography

[120] David Leopoldseder, Lukas Stadler, Thomas Würthinger, Josef Eisl, Doug Simon, and Hanspeter Mössenböck. 2018. Dominance-based Duplication Simulation (DBDS): Code Duplication to Enable Compiler Optimizations. In CGO 2018. ACM, New York, NY, USA, 126–137. https://doi.org/10.1145/3168811

[121] Sheng Liang and Gilad Bracha. 1998. Dynamic Class Loading in the Java Virtual Machine. In Proceedings of the 13th ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications (OOPSLA ’98). ACM, New York, NY, USA, 36–44. https://doi.org/10.1145/286936.286945

[122] Jin Lin, Tong Chen, Wei-Chung Hsu, Pen-Chung Yew, Roy Dz-Ching Ju, Tin-Fook Ngai, and Sun Chan. 2003. A Compiler Framework for Speculative Analysis and Optimizations. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI ’03). ACM, New York, NY, USA, 289–299. https://doi. org/10.1145/781131.781164

[123] Yi Lin, Kunshan Wang, Stephen M. Blackburn, Antony L. Hosking, and Michael Norrish. 2015. Stop and Go: Understanding Yieldpoint Behavior. In Proceedings of the 2015 In- ternational Symposium on Memory Management (ISMM ’15). ACM, New York, NY, USA, 70–80. https://doi.org/10.1145/2754169.2754187

[124] Tim Lindholm, Frank Yellin, Gilad Bracha, and Alex Buckley. 2015. The Java Virtual Machine Specification, Java SE 8 Edition. http://docs.oracle.com/javase/specs/ jvms/se8/jvms8.pdf

[125] Shun Long and Michael O’Boyle. 2004. Adaptive Java Optimisation Using Instance-based Learning. In Proceedings of the 18th Annual International Conference on Supercomputing (ICS ’04). ACM, New York, NY, USA, 237–246. https://doi.org/10.1145/1006209. 1006243

[126] Scott A. Mahlke, Richard E. Hank, James E. McCormick, David I. August, and Wen- Mei W. Hwu. 1995. A Comparison of Full and Partial Predicated Execution Support for ILP Processors. In Proceedings of the 22Nd Annual International Symposium on Computer Architecture (ISCA ’95). ACM, New York, NY, USA, 138–150. https://doi.org/10. 1145/223982.225965

[127] Scott A. Mahlke, David C. Lin, William Y. Chen, Richard E. Hank, and Roger A. Bringmann. 1995. Instruction-level Parallel Processors. IEEE Computer Society Press, Chapter Effective Compiler Support for Predicated Execution Using the Hyperblock, 161–170. http://dl. acm.org/citation.cfm?id=201749.201763 Bibliography 207

[128] Jeremy Manson, William Pugh, and Sarita V. Adve. 2005. The Java Memory Model. In Proceedings of the 32Nd ACM SIGPLAN-SIGACT Symposium on Principles of Program- ming Languages (POPL ’05). ACM, New York, NY, USA, 378–391. https://doi.org/ 10.1145/1040305.1040336

[129] Stefan Marr, Benoit Daloze, and Hanspeter Mössenböck. 2016. Cross-language Compiler Benchmarking: Are We Fast Yet?. In Proceedings of the 12th Symposium on Dynamic Languages (DLS 2016). ACM, New York, NY, USA, 120–131. https://doi.org/10. 1145/2989225.2989232

[130] Luis Mastrangelo, Luca Ponzanelli, Andrea Mocci, Michele Lanza, Matthias Hauswirth, and Nathaniel Nystrom. 2015. Use at Your Own Risk: The Java Unsafe API in the Wild. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2015). ACM, New York, NY, USA, 695–710. https://doi.org/10.1145/2814270.2814313

[131] Charith Mendis, Saman Amarasinghe, and Michael Carbin. 2018. Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks. PrePrint: arXiv preprint arXiv:1808.07412 (2018).

[132] Raphael Mosaner, Manuel Rigger, David Leopoldseder, Roland Schatz, and Hanspeter Mössenböck. 2018. On-Stack Replacement in Truffle Interpreters for Non-structured Lan- guages. (2018). Under Review.

[133] Hanspeter Mössenböck. 2000. Adding static single assignment form and a graph coloring register allocator to the Java HotSpotTM client compiler. Technical Report. Citeseer.

[134] Hanspeter Mössenböck and Michael Pfeiffer. 2002. Linear Scan Register Allocation in the Context of SSA Form and Register Constraints. In Compiler Construction, R. Nigel Horspool (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 229–246.

[135] Mozilla. 2018. asm.js. (2018). http://http://asmjs.org/

[136] Frank Mueller and David B. Whalley. 1992. Avoiding Unconditional Jumps by Code Replica- tion. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 322–330. https://doi.org/10.1145/143095.143144

[137] Frank Mueller and David B. Whalley. 1995. Avoiding Conditional Branches by Code Repli- cation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 56–66. https://doi.org/10.1145/207110.207116 208 Bibliography

[138] Martin Odersky, Lex Spoon, and Bill Venners. 2008. Programming in scala. Artima Inc.

[139] OpenJDK 2013. Graal Project. (2013). http://openjdk.java.net/projects/graal

[140] OpenJDK 2017. GraalVM -New JIT Compiler and Polyglot Runtime for the JVM;. (2017). http://www.oracle.com/technetwork/oracle-labs/program-languages/ overview/index-2301583.html

[141] OpenJDK. 2018. JEP 243: Java-Level JVM Compiler Interface. (2018). https: //openjdk.java.net/jeps/243

[142] OpenJDK. 2018. JEP 317: Experimental Java-Based JIT Compiler. (2018). https: //openjdk.java.net/jeps/317

[143] OpenJDK. 2018. Type Profile Pollution. (2018). https://wiki.openjdk.java.net/ display/HotSpot/MethodData#MethodData-Poll

[144] Oracle. 2015. Loop Optimizations in the Hotspot Server VM Compiler (C2). (2015). https://wiki.openjdk.java.net/pages/viewpage.action?pageId=20415918

[145] Oracle. 2018. FastR. (2018). https://github.com/oracle/fastr

[146] Oracle. 2018. Graal Python. (2018). https://github.com/graalvm/graalpython

[147] Oracle. 2018. GraalJS Repository. (2018). https://github.com/graalvm/graaljs

[148] Oracle. 2018. GraalVM. (2018). https://www.graalvm.org/

[149] Oracle. 2018. Oracle Java. (2018). https://www.oracle.com/de/java/

[150] Oracle. 2018. Oracle Labs. (2018). https://labs.oracle.com/

[151] Oracle. 2018. Tiered Compilation. (2018). https://docs.oracle.com/javase/7/docs/ technotes/guides/vm/performance-enhancements-7.html#tieredcompilation

[152] Oracle. 2018. Truffle Ruby. (2018). https://github.com/oracle/truffleruby

[153] Michael Paleczny, Christopher Vick, and Cliff Click. 2001. The Java HotSpotTM Server Compiler. In Proceedings of the Java Virtual Machine Research and Technology Symposium. USENIX, 1–12. Bibliography 209

[154] Simon Peyton Jones and Simon Marlow. 2002. Secrets of the Glasgow Haskell Compiler Inliner. J. Funct. Program. 12, 5 (July 2002), 393–434. https://doi.org/10.1017/ S0956796802004331

[155] Filip Pizlo. 2014. JetStream Benchmark Suite. (2014). http://browserbench.org/ JetStream/

[156] Aleksandar Prokopec, Gilles Duboscq, David Leopoldseder, and Thomas Würthinger. 2019. An Optimization-driven Incremental Inline Substitution Algorithm for Just-in-time Com- pilers. In Proceedings of the 2019 IEEE/ACM International Symposium on Code Gen- eration and Optimization (CGO 2019). IEEE Press, Piscataway, NJ, USA, 164–179. http://dl.acm.org/citation.cfm?id=3314872.3314893

[157] Aleksandar Prokopec, David Leopoldseder, Gilles Duboscq, and Thomas Würthinger. 2017. Making Collection Operations Optimal with Aggressive JIT Compilation. In SCALA 2017. ACM, New York, NY, USA, 29–40. https://doi.org/10.1145/3136000.3136002

[158] Aleksandar Prokopec, Andrea Rosà, David Leopoldseder, Gilles Duboscq, Petr Tůma, Mar- tin Studener, Lubomír Bulej, Yudi Zheng, Alex Villazón, Doug Simon, Thomas Würthinger, and Walter Binder. 2019. Renaissance: Benchmarking Suite for Parallel Applications on the JVM. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2019). ACM, New York, NY, USA, 31–47. https://doi.org/10.1145/3314221.3314637

[159] Ganesan Ramalingam. 2002. On Loops, Dominators, and Dominance Frontiers. ACM Trans. Program. Lang. Syst. 24, 5, 455–490. https://doi.org/10.1145/570886.570887

[160] Manuel Rigger. 2016. Sulong: Memory Safe and Efficient Execution of LLVM-Based Lan- guages. In ECOOP 2016 Doctoral Symposium (ECOOP DS 2016). http://ssw.jku.at/ General/Staff/ManuelRigger/ECOOP16-DS.pdf.

[161] Manuel Rigger. 2018. Sandboxed Execution of C and Other Unsafe Languages on the Java Virtual Machine (Extended Abstract). In Student Research Competition at the Intl. Conf. on the Art, Science, and Engineering of Programmings (Programming SRC 2018). https://doi.org/10.1145/3191697.3213795 210 Bibliography

[162] Manuel Rigger, Matthias Grimmer, and Hanspeter Mössenböck. 2016. Sulong - Execution of LLVM-based Languages on the JVM (Position Paper). In Proceedings of the 11th Workshop on Implementation, Compilation, Optimization of Object-Oriented Languages, Programs and Systems (ICOOOLPS 2016). ACM, New York, NY, USA, Article 7, 4 pages. https: //doi.org/10.1145/3012408.3012416

[163] Manuel Rigger, Matthias Grimmer, Christian Wimmer, Thomas Würthinger, and Hanspeter Mössenböck. 2016. Bringing Low-level Languages to the JVM: Efficient Execution of LLVM IR on Truffle. In Proceedings of the 8th International Workshop on Virtual Machines and Intermediate Languages (VMIL 2016). ACM, New York, NY, USA, 6–15. https://doi. org/10.1145/2998415.2998416

[164] Manuel Rigger, Stefan Marr, Bram Adams, and Hanspeter Mössenböck. 2019. Understand- ing GCC Builtins to Develop Better Tools. (2019). Under Review.

[165] Manuel Rigger, Stefan Marr, Stephen Kell, David Leopoldseder, and Hanspeter Mössen- böck. 2018. An Analysis of x86-64 Inline Assembly in C Programs. In Proceedings of the 14th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE ’18). ACM, New York, NY, USA, 84–99. https://doi.org/10.1145/3186411.3186418

[166] Manuel Rigger, Daniel Pekarek, and Hanspeter Mössenböck. 2018. Context-Aware Failure- Oblivious Computing as a Means of Preventing Buffer Overflows. In Proceedings of the 12th International Conference on Network and System Security (NSS 2018). https:// doi.org/10.1007/978-3-030-02744-5_28

[167] Manuel Rigger, Roland Schatz, Matthias Grimmer, and Hanspeter Mössenböck. 2017. Lenient Execution of C on a Java Virtual Machine: Or: How I Learned to Stop Wor- rying and Run the Code. In Proceedings of the 14th International Conference on Man- aged Languages and Runtimes (ManLang 2017). ACM, New York, NY, USA, 35–47. https://doi.org/10.1145/3132190.3132204

[168] Manuel Rigger, Roland Schatz, Jacob Kreindl, Cristian Häubl, and Hanspeter Mössenböck. 2018. Sulong, and Thanks for All the Fish (Extended Abstract). In Workshop on Modern Language Runtimes, Ecosystems, and VMs (MoreVMs 2018). ACM, New York, NY, USA, 35–44. https://doi.org/10.1145/3191697.3191726

[169] Manuel Rigger, Roland Schatz, Rene Mayrhofer, Matthias Grimmer, and Hanspeter Mössen- böck. 2018. Sulong, and Thanks for All the Bugs: Finding Errors in C Programs by Abstracting from the Native Execution Model. In Proceedings of the Twenty-Third Inter- Bibliography 211

national Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2018). ACM, New York, NY, USA, 377–391. https://doi.org/10. 1145/3173162.3173174

[170] Michael Rock and Andreas Koch. 2004. Architecture-Independent Meta-optimization by Aggressive Tail Splitting. Springer Berlin Heidelberg, Berlin, Heidelberg, 328–335. https: //doi.org/10.1007/978-3-540-27866-5_42

[171] Kenneth Russell and David Detlefs. 2006. Eliminating Synchronization-related Atomic Op- erations with Biased Locking and Bulk Rebiasing. In Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-oriented Programming Systems, Languages, and Applica- tions (OOPSLA ’06). ACM, New York, NY, USA, 263–272. https://doi.org/10.1145/ 1167473.1167496

[172] Rafael H Saavedra and Alan J Smith. 1996. Analysis of benchmark characteristics and benchmark performance prediction. ACM Transactions on Computer Systems (TOCS) 14, 4 (1996), 344–384.

[173] Vivek Sarkar. 2001. Optimized Unrolling of Nested Loops. Int. J. Parallel Program. 29, 5, 545–581. https://doi.org/10.1023/A:1012246031671

[174] Robert W. Scheifler. 1977. An Analysis of Inline Substitution for a Structured Programming Language. Commun. ACM 20, 9 (Sept. 1977), 647–654. https://doi.org/10.1145/ 359810.359830

[175] Uwe Schwiegelshohn, Franco Gasperoni, and Kemal Ebcioˇglu. 1991. On optimal paral- lelization of arbitrary loops. J. Parallel and Distrib. Comput. 11, 2 (1991), 130 – 134. https://doi.org/10.1016/0743-7315(91)90118-S

[176] Andreas Sewe, Mira Mezini, Aibek Sarimbekov, and Walter Binder. 2011. Da Capo Con Scala: Design and Analysis of a Scala Benchmark Suite for the Java Virtual Machine. In OOPSLA ’11. ACM, New York, NY, USA. https://doi.org/10.1145/2048066. 2048118

[177] Samuel Sanford Shapiro and Martin B. Wilk. 1965. An analysis of variance test for normality (complete samples). Biometrika 52, 3/4 (1965), 591–611.

[178] Doug Simon, Christian Wimmer, Bernhard Urban, Gilles Duboscq, Lukas Stadler, and Thomas Würthinger. 2015. Snippets: Taking the High Road to a Low Level. ACM Trans. Archit. Code Optim. 12, 2, Article 20 (June 2015), 25 pages. https://doi.org/10.1145/2764907

[179] SSW. 2012. Institute for System Software, Johannes Kepler University Linz, Austria. (2012). http://ssw.jku.at/

[180] Lukas Stadler. 2014. Partial Escape Analysis and Scalar Replacement for Java. Ph.D. Dissertation. Johannes Kepler University Linz, Linz, Austria.

[181] Lukas Stadler, Gilles Duboscq, Hanspeter Mössenböck, and Thomas Würthinger. 2012. Compilation Queuing and Graph Caching for Dynamic Compilers. In Proceedings of the ACM Workshop on Virtual Machines and Intermediate Languages. ACM Press, 49–58. https://doi.org/10.1145/2414740.2414750

[182] Lukas Stadler, Gilles Duboscq, Hanspeter Mössenböck, Thomas Würthinger, and Doug Simon. 2013. An Experimental Study of the Influence of Dynamic Compiler Optimizations on Scala Performance. In SCALA ’13. ACM, New York, NY, USA, Article 9, 8 pages. https://doi.org/10.1145/2489837.2489846

[183] Lukas Stadler, Adam Welc, Christian Humer, and Mick Jordan. 2016. Optimizing R Language Execution via Aggressive Speculation. In Proceedings of the 12th Symposium on Dynamic Languages (DLS 2016). ACM, New York, NY, USA, 84–95. https://doi.org/10.1145/2989225.2989236

[184] Lukas Stadler, Christian Wimmer, Thomas Würthinger, Hanspeter Mössenböck, and John Rose. 2009. Lazy Continuations for Java Virtual Machines. In Proceedings of the International Conference on the Principles and Practice of Programming in Java. ACM Press, 143–152. https://doi.org/10.1145/1596655.1596679

[185] Lukas Stadler, Thomas Würthinger, and Hanspeter Mössenböck. 2014. Partial Escape Analysis and Scalar Replacement for Java. In CGO ’14. ACM Press, 165–174. https://doi.org/10.1145/2544137.2544157

[186] Lukas Stadler, Thomas Würthinger, and Christian Wimmer. 2010. Efficient Coroutines for the Java Platform. In Proceedings of the International Conference on the Principles and Practice of Programming in Java. ACM Press, 20–28. https://doi.org/10.1145/1852761.1852765

[187] Codruţ Stancu, Christian Wimmer, Stefan Brunthaler, Per Larsen, and Michael Franz. 2014. Comparing Points-to Static Analysis with Runtime Recorded Profiling Data. In Proceedings of the 2014 International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools (PPPJ ’14). ACM, New York, NY, USA, 157–168. https://doi.org/10.1145/2647508.2647524

[188] Codruţ Stancu, Christian Wimmer, Stefan Brunthaler, Per Larsen, and Michael Franz. 2015. Safe and Efficient Hybrid Memory Management for Java. In Proceedings of the 2015 International Symposium on Memory Management (ISMM ’15). ACM, New York, NY, USA, 81–92. https://doi.org/10.1145/2754169.2754185

[189] Standard Performance Evaluation Corporation. 2008. SPECjvm2008. (2008). http://www.spec.org/jvm2008/

[190] Bogong Su and Jing Wang. 1991. Loop-carried dependence and the general URPR software pipelining approach (unrolling, pipelining and rerolling). In Proceedings of the Twenty-Fourth Annual Hawaii International Conference on System Sciences, Vol. 2. IEEE, 366–372.

[191] tidy. 2018. tidyverse: R packages for data science. (2018). https://www.tidyverse.org/

[192] TIOBE. 2018. TIOBE Programming Community Index. (2018). https://www.tiobe.com/tiobe-index/

[193] Munara Tolubaeva. 2014. Compiler Cost Model for Multicore Architectures. Ph.D. Dissertation.

[194] Sid-Ahmed-Ali Touati and Denis Barthou. 2006. On the Decidability of Phase Ordering Problem in Optimizing Compilation. In CF ’06. ACM, 10. https://doi.org/10.1145/1128022.1128042

[195] Michael L. Van-De-Vanter, Chris Seaton, Michael Haupt, Christian Humer, and Thomas Würthinger. 2018. Fast, Flexible, Polyglot Instrumentation Support for Debuggers and other Tools. Programming Journal 2, 3 (2018), 14. https://doi.org/10.22152/programming-journal.org/2018/2/14

[196] April W. Wade, Prasad A. Kulkarni, and Michael R. Jantz. 2017. AOT vs. JIT: Impact of Profile Data on Code Quality. In LCTES 2017. ACM, New York, NY, USA, 1–10. https://doi.org/10.1145/3078633.3081037

[197] David W. Wall. 1991. Limits of Instruction-level Parallelism. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV). ACM, New York, NY, USA, 176–188. https://doi.org/10.1145/106972.106991

[198] Ko-Yang Wang. 1994. Precise Compile-time Performance Prediction for Superscalar-based Computers. In PLDI ’94. ACM, New York, NY, USA, 73–84. https://doi.org/10.1145/178243.178250

[199] Z. Wang and M. O’Boyle. 2018. Machine Learning in Compiler Optimization. Proc. IEEE 106, 11 (Nov 2018), 1879–1901. https://doi.org/10.1109/JPROC.2018.2817118

[200] Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in statistics. Springer, 196–202.

[201] Christian Wimmer. 2004. Linear Scan Register Allocation for the Java HotSpot Client Compiler. Master’s Thesis, Johannes Kepler University Linz. http://www.christianwimmer.at/Publications/Wimmer04a/

[202] Christian Wimmer and Michael Franz. 2010. Linear Scan Register Allocation on SSA Form. In CGO ’10. ACM, New York, NY, USA, 170–179. https://doi.org/10.1145/1772954.1772979

[203] Christian Wimmer and Hanspeter Mössenböck. 2005. Optimized Interval Splitting in a Linear Scan Register Allocator. In Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments (VEE ’05). ACM, New York, NY, USA, 132–141. https://doi.org/10.1145/1064979.1064998

[204] Christian Wimmer and Hanspeter Mössenböck. 2006. Automatic Object Colocation Based on Read Barriers. In Proceedings of the Joint Conference on Modular Programming Languages. Springer-Verlag, 326–345. https://doi.org/10.1007/11860990_20

[205] Christian Wimmer and Hanspeter Mössenböck. 2007. Automatic Feedback-directed Object Inlining in the Java HotSpot™ Virtual Machine. In Proceedings of the International Conference on Virtual Execution Environments. ACM Press, 12–21. https://doi.org/10.1145/1254810.1254813

[206] Christian Wimmer and Hanspeter Mössenböck. 2008. Automatic Array Inlining in Java Virtual Machines. In Proceedings of the International Symposium on Code Generation and Optimization. ACM Press, 14–23. https://doi.org/10.1145/1356058.1356061

[207] Christian Wimmer and Hanspeter Mössenböck. 2010. Automatic Feedback-directed Object Fusing. ACM Transactions on Architecture and Code Optimization 7, 2, Article 7 (2010), 35 pages. https://doi.org/10.1145/1839667.1839669

[208] Michael Wolfe. 1992. Beyond Induction Variables. In Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation (PLDI ’92). ACM, New York, NY, USA, 162–174. https://doi.org/10.1145/143095.143131

[209] Andreas Wöß, Christian Wirth, Daniele Bonetta, Chris Seaton, Christian Humer, and Hanspeter Mössenböck. 2014. An Object Storage Model for the Truffle Language Implementation Framework. In Proceedings of the 2014 International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools. ACM, 133–144.

[210] Thomas Würthinger, Danilo Ansaloni, Walter Binder, Christian Wimmer, and Hanspeter Mössenböck. 2011. Safe and Atomic Run-time Code Evolution for Java and Its Application to Dynamic AOP. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications. ACM Press, 825–844. https://doi.org/10.1145/2048066.2048129

[211] Thomas Würthinger, Walter Binder, Danilo Ansaloni, Philippe Moret, and Hanspeter Mössenböck. 2010. Applications of Enhanced Dynamic Code Evolution for Java in GUI Development and Dynamic Aspect-oriented Programming. In Proceedings of the International Conference on Generative Programming and Component Engineering. ACM Press, 123–126. https://doi.org/10.1145/1868294.1868312

[212] Thomas Würthinger, Walter Binder, Danilo Ansaloni, Philippe Moret, and Hanspeter Mössenböck. 2010. Improving Aspect-oriented Programming with Dynamic Code Evolution in an Enhanced Java Virtual Machine. In Proceedings of the 7th Workshop on Reflection, AOP and Meta-Data for Software Evolution. ACM Press, Article 5, 5 pages. https://doi.org/10.1145/1890683.1890688

[213] Thomas Würthinger, Christian Wimmer, Christian Humer, Andreas Wöß, Lukas Stadler, Chris Seaton, Gilles Duboscq, Doug Simon, and Matthias Grimmer. 2017. Practical Partial Evaluation for High-performance Dynamic Language Runtimes. In PLDI 2017. ACM, New York, NY, USA, 662–676. https://doi.org/10.1145/3062341.3062381

[214] Thomas Würthinger, Christian Wimmer, and Hanspeter Mössenböck. 2007. Array Bounds Check Elimination for the Java HotSpot™ Client Compiler. In Proceedings of the International Conference on the Principles and Practice of Programming in Java. ACM Press, 125–133. https://doi.org/10.1145/1294325.1294343

[215] Thomas Würthinger, Christian Wimmer, and Hanspeter Mössenböck. 2008. Visualization of Program Dependence Graphs. In Proceedings of the International Conference on Compiler Construction. Springer-Verlag, 193–196.

[216] Thomas Würthinger, Christian Wimmer, and Hanspeter Mössenböck. 2009. Array bounds check elimination in the context of deoptimization. Science of Computer Programming 74, 5-6 (2009). https://doi.org/10.1016/j.scico.2009.01.002

[217] Thomas Würthinger, Christian Wimmer, and Lukas Stadler. 2010. Dynamic code evolution for Java. In Proceedings of the 8th International Conference on the Principles and Practice of Programming in Java. ACM, 10–19.

[218] Thomas Würthinger, Christian Wimmer, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Christian Humer, Gregor Richards, Doug Simon, and Mario Wolczko. 2013. One VM to Rule Them All. In Onward! 2013. ACM, New York, NY, USA, 187–204. https://doi.org/10.1145/2509578.2509581

[219] Thomas Würthinger, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Doug Simon, and Christian Wimmer. 2012. Self-optimizing AST interpreters. In Proceedings of the 8th symposium on Dynamic languages (DLS ’12). ACM Press, 73–82.

[220] Thomas Würthinger, Christian Wimmer, and Lukas Stadler. 2013. Unrestricted and Safe Dynamic Code Evolution for Java. Science of Computer Programming 78, 5 (May 2013), 481–498. https://doi.org/10.1016/j.scico.2011.06.005

[221] Yudi Zheng, Lubomír Bulej, and Walter Binder. 2017. An Empirical Study on Deoptimization in the Graal Compiler. In 31st European Conference on Object-Oriented Programming (ECOOP 2017) (Leibniz International Proceedings in Informatics (LIPIcs)), Peter Müller (Ed.), Vol. 74. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 30:1–30:30. https://doi.org/10.4230/LIPIcs.ECOOP.2017.30

Acknowledgments

Most of my gratitude belongs to my advisor, Prof. Hanspeter Mössenböck, who constantly challenged my thinking about compilers and computer science in general and supported me in every possible way. Next, I want to thank my second advisor, Prof. Michael O’Boyle from the University of Edinburgh, for his time and input on this thesis.

Then, I want to thank my co-advisor, Dr. Lukas Stadler. It is a pleasure coding, working, teaching, and enjoying a cold beer with you. This thesis, like all the work I did in the Graal project, would not have been possible without your constant support.

This work was funded by and performed in collaboration with Oracle Labs, specifically the GraalVM research group. I want to thank all the people who supported me and my work, first and foremost Thomas Würthinger, who provided me with all possible support from the Oracle side. Thank you for your support and for the chance to contribute to something that became as significant as Graal. In this context, I want to send special thanks to my previous manager, Doug Simon, who has been an awesome manager during my time at Oracle and beyond. I also want to mention Roland Schatz, Gilles Duboscq, Stefan Anzinger, and Tom Rodriguez, who always have an ear for even the most complex compiler graphs.

A special thank you goes out to Aleksandar Prokopec for inviting me into a broader set of research topics and projects. It is my pleasure.

I also want to thank my university and, specifically, our institute, the SSW. Special thanks to my fellow PhD students (current and previous): Josef Eisl, Manuel Rigger, Benoit Daloze, Florian Latifi, and Raphael Mosaner.

A PhD is a collaborative effort, and therefore I want to thank all the other people who were involved in and contributed to this journey.

Most importantly, I want to thank my parents, who have always been there for me. Mum, thank you for everything, for always encouraging me to keep pushing and for reminding me that this is still the thing I want to do in my life. Dad, all that is left to say is that I treasure all the time we had together and regret that you will not witness the future endeavors of my life, including this PhD of mine.

In the end this sentence is for her, she knows...