Master thesis QuickInterp - Improving interpreter performance with superinstructions

Lukas Miedema June 11, 2020


Exam committee
prof. dr. M. Huisman
dr.ir. T. van Dijk
dr.ir. A.B.J. Kokkeler

Cover design
Gerben Miedema

Abstract

The performance of Java Virtual Machine (JVM) bytecode interpreters can be severely limited by (1) the inability to perform optimizations over multiple instructions, and (2) the excessive level of branching within the interpreter loop, a result of having to do at least one jump per bytecode instruction. With the QuickInterp VM we mitigate these performance limitations of JVM interpreter design by means of superinstructions. Our interpreter supports not just the regular bytecode instructions, but also extra instructions which each perform the work of a sequence of bytecode instructions (superinstructions). At class load time, bytecode is preprocessed: sequences of bytecode instructions are replaced by equivalent superinstructions, requiring no alteration to, or loss of compatibility with, existing bytecode. The interpreter source code is generated automatically based on a profile of the running application. New instruction handlers are generated by concatenating the instruction handlers of existing instructions, removing the need for manual construction of the superinstruction handlers. Which sequences of instructions to convert into a superinstruction, and how to perform the most effective substitution of superinstructions into a loaded program, are the questions we answer in this thesis. Earlier work has shown that finding the optimal superinstruction set is NP-hard [15]. As such, we implement an iterative optimization algorithm to approximate the optimal superinstruction set. Furthermore, we design and test various runtime substitution algorithms. Our new shortest-path-based runtime substitution algorithm uses pathfinding through the input program to find the combination of superinstructions that lowers the number of required instruction dispatches as much as possible. We further enhance the shortest-path algorithm by performing substitution based on equivalence, extending the impact each individual superinstruction can make at runtime. We compare our new runtime substitution algorithms against a reimplementation of a substitution algorithm from earlier work [4]. Our results show that the superinstruction optimization is still valid in 2020, boasting a 45.6% performance improvement over baseline in a small arithmetic benchmark, and a 33.0% improvement over baseline in a larger Spring Boot-based web application benchmark. Our iterative superinstruction set construction algorithm manages to find near-optimal solutions to the NP-hard problem of constructing the superinstruction set. However, we also show that more advanced superinstruction placement algorithms do not offer the same return on investment: given enough superinstructions, each of the tested substitution algorithms is capable of achieving similar performance improvements. The code and benchmarks are available at https://github.com/LukasMiedema/QuickInterp.

Table of contents

Abstract
Table of contents
List of figures
List of listings

1 Introduction
    1.1 What is a VM
    1.2 What is an interpreter
        1.2.1 Anatomy of an interpreter
    1.3 What is a JIT Compiler
    1.4 Superinstructions
    1.5 Research goals
        1.5.1 Motivation
        1.5.2 Research question
        1.5.3 Goals and method
    1.6 Research contributions
    1.7 Overview

2 Background
    2.1 The JVM
        2.1.1 Introduction
        2.1.2 Bytecode structure
        2.1.3 Stack-based architecture
        2.1.4 Verifier, typesafety and abstract interpreters
        2.1.5 Dynamic class loading
    2.2 Modern interpreter dispatching
        2.2.1 Superscalar execution and branch prediction
        2.2.2 The interpreter
        2.2.3 Token-threaded interpreters
        2.2.4 Threaded-code interpreter
    2.3 Conclusion

3 Related work
    3.1 Introduction
    3.2 Superinstructions
        3.2.1 Workflow
        3.2.2 Superoperators
        3.2.3 vmgen
        3.2.4 Tiger
        3.2.5 Conclusion
    3.3 Other interpreter optimizations / research
        3.3.1 Static replication
        3.3.2 Instruction specialization
    3.4 Conclusion

4 Design of QuickInterp
    4.1 Introduction
        4.1.1 Design goals
        4.1.2 Overview
    4.2 Architecture and workflow overview
        4.2.1 From profile to runtime
    4.3 QuickInterp application profile
        4.3.1 Introduction
        4.3.2 Data in the profile
    4.4 QuickInterp compile time
        4.4.1 Introduction
        4.4.2 Instruction selection
        4.4.3 Static evaluation
        4.4.4 Superinstruction set construction
        4.4.5 Superinstruction generation evaluation loop
        4.4.6 Handling superinstruction operands
    4.5 QuickInterp runtime
        4.5.1 Introduction
        4.5.2 Basic runtime superinstruction placement
        4.5.3 Instruction placement using shortest path
    4.6 Equivalent superinstructions
        4.6.1 Superinstruction equivalence
        4.6.2 Discovering data dependencies
        4.6.3 Data Dependency Graph Construction
        4.6.4 Using superinstruction equivalency
    4.7 Conclusion

5 Implementing QuickInterp
    5.1 Introduction
    5.2 Implementation goals and non-goals
    5.3 QuickInterp on OpenJDK Zero
        5.3.1 Why OpenJDK Zero
        5.3.2 OpenJDK Zero class-loading pipeline
        5.3.3 OpenJDK Zero Interpreter
        5.3.4 Code stretching
        5.3.5 Superinstruction placement in Java
        5.3.6 Conclusion
    5.4 Profiling in practice
        5.4.1 Specialized Instructions
        5.4.2 The profile on disk
        5.4.3 Conclusion
    5.5 Constructing the superinstruction set
        5.5.1 Interpreter Generator tool implementation
        5.5.2 Loading the profile
        5.5.3 Optimizing the instruction set
        5.5.4 Conclusion
    5.6 Superinstruction placement
        5.6.1 Overview
        5.6.2 Algorithm implementations
        5.6.3 Conclusion
    5.7 Generating the QuickInterp interpreter
        5.7.1 Instruction primitives
        5.7.2 Instruction code as macros
        5.7.3 Code generation
        5.7.4 Superinstruction class cache
        5.7.5 Loss of the garbage collection and class verifier
        5.7.6 Wrapping up
    5.8 Conclusion
        5.8.1 Revisiting the implementation goals

6 Benchmarks and results
    6.1 Introduction
    6.2 Benchmarking goals and non-goals
    6.3 Benchmark selection
        6.3.1 Small benchmark: JMH primes benchmark
        6.3.2 Large benchmark: Spring pet clinic web app
        6.3.3 Benchmarking environment and parameters
    6.4 Results: JMH Primes Benchmark
        6.4.1 Benchmark and static evaluation scores
        6.4.2 Result interpretation
    6.5 Spring Pet Clinic benchmark
        6.5.1 Benchmark and static evaluation scores
        6.5.2 Result interpretation
        6.5.3 Interpreter size and superinstructions
    6.6 Result interpretation
        6.6.1 Best runtime substitution algorithm
        6.6.2 Static evaluation: not a perfect predictor
    6.7 Conclusion

7 Final thoughts
    7.1 Revisiting the research goals and questions
        7.1.1 Research goals
        7.1.2 Answering research questions
    7.2 Future work
        7.2.1 Production-ready implementation
        7.2.2 Better static evaluation heuristics
        7.2.3 Dynamic rewriting
        7.2.4 Dynamically-loadable superinstructions for OSGi applications
        7.2.5 Value-dependent superinstructions
        7.2.6 Superinstructions as a JIT target

Appendices

A List of bytecode instructions
    A.1 Bytecode instructions with their description
    A.2 Instruction handler flags
    A.3 Instruction pc and stack offsets

Definitions

Bibliography

List of figures

3.1 Diagram of the process described by Ertl et al. [4]

4.1 Diagram of the core architecture
4.2 Diagram of the Instruction Set Generation process as seen in Figure 4.1
4.3 Listing 4.17 represented as a tree
4.4 Listing 4.19 represented as a graph, with the instructions on the edges and the superinstructions of Listing 4.20 added
4.5 Listing 4.24 represented as a graph, with the instructions on the edges and the superinstructions of Listing 4.25 added
4.6 Listing 4.24 represented as a graph, with the instructions on the edges and the superinstructions of Listing 4.26 added
4.7 Graph showing the effect of instructions on the (relative) stack depth
4.8 Step 1: The input program is converted to a graph where all the nodes are blocks
4.9 Step 2: Barrier attributes are added to each block
4.10 Step 3: Must-happen-after edges are added based on barrier attributes, replacing the original edges

5.1 Simplified class diagram of the Interpreter Generator
5.2 Simplified class diagram of the runtime
5.3 Copy of Figure 4.4 with the direction of the edges reversed, showing a bytecode program as a graph with instructions on the edges
5.4 A DOT graph generated by our equivalence-aware runtime substitution algorithm of two equivalent sequences of code

6.1 Entity relation diagram of the Pet Clinic database
6.2 Screenshot of the Pet Clinic web application, showing information about a pet owner
6.3 Primes benchmark with cache enabled
6.4 Primes benchmark without cache enabled
6.5 Primes benchmark static evaluation score
6.6 Primes benchmark score vs static evaluation score
6.7 Number of static evaluations completed within 180 seconds for each algorithm
6.8 Spring Pet Clinic benchmark (average of all requests)
6.9 Spring Pet Clinic benchmark (average of all Open “Landing” page requests)
6.10 Spring Pet Clinic benchmark showing the duration of the individual request types (all using shortest path)
6.11 Primes benchmark static evaluation score
6.12 Spring benchmark average duration vs static evaluation score
6.13 Number of static evaluations completed within 180 seconds for each algorithm
6.14 Spring Pet Clinic benchmark libjvm.so size vs number of superinstructions
6.15 Spring Pet Clinic benchmark score vs libjvm.so size

List of listings

1.1 Basic interpreter loop
2.1 Sequence of bytecode instructions executing x = x · y
2.2 Illegal sequence of bytecode instructions with an unbounded operand stack
2.3 Basic interpreter loop
2.4 Basic interpreter loop with a control table
2.5 Basic interpreter loop in a threaded code interpreter

4.1 Sequence of bytecode instructions with a goto
4.2 An illegal superinstruction candidate derived from Listing 4.1
4.3 Sequence of bytecode instructions as found in a profile where instructions are superinstructionable
4.4 A superinstruction candidate derived from Listing 4.3
4.5 Code fragment with a jump target
4.6 First superinstruction candidate from Listing 4.5 including the instructions before the jump target
4.7 Second superinstruction candidate from Listing 4.5 with only the instructions after the jump target
4.8 Sequence of bytecode instructions computing (x · y · (z + z + y)) mod 2, where mod is the modulo operator
4.9 Another sequence of bytecode instructions computing (3 · (a + b + c)) − 2
4.10 score(p, Jp) static evaluation algorithm
4.11 theoreticalMaximum(P) static evaluation algorithm computing the maximum score
4.12 optimize(P, S) genetic optimization algorithm
4.13 Sequence of bytecode instructions executing x := x · 5
4.14 bipush handler
4.15 Listing 4.13 with a superinstruction
4.16 Example of the triplet table used for table-based substitution with three superinstructions defined
4.17 Example of three superinstruction definitions
4.18 R(I, S) tree-based runtime substitution algorithm
4.19 Short bytecode program (repeat of Listing 4.8)
4.20 A few superinstruction definitions
4.21 Listing 4.19 after superinstruction substitution by the tree-based substitution algorithm
4.22 Listing 4.19 after optimal superinstruction substitution
4.23 An example Java method set(int,int) with a conditional jump. The input score is clamped to at most MAX_SCORE.
4.24 The set(int,int) method from Listing 4.23 shown as bytecode
4.25 A few superinstruction definitions
4.26 A few superinstruction definitions
4.27 Equivalence algorithm in pseudocode
4.28 Code compiled from int x = 4; int y = 5;
4.29 Code compiled from int y = 5; int x = 4;
4.30 Code compiled from int x = 4; int y = 5;
4.31 Code compiled from int y = 5; int x = 4;
4.32 Superinstruction derived from Listing 4.30
4.33 Superinstruction derived from Listing 4.31
4.34 Superinstruction derived from Listing 4.30
4.35 Superinstruction derived from Listing 4.31
4.36 A Java expression
4.37 Bytecode compiled from 4.36
4.38 Another Java expression
4.39 Bytecode compiled from 4.38
4.40 Two Java assignments with expressions
4.41 Bytecode compiled from 4.40
4.42 Two Java assignments with expressions
4.43 Bytecode compiled from 4.42
4.44 Two mixed expressions, equivalent to Listing 4.41 and Listing 4.43
4.45 Abstract interpreter for marking expressions
4.46 Repeat of the bytecode compiled from 4.40

5.1 The iload and fload instruction handlers in OpenJDK Zero
5.2 All const_n handlers are defined by invocations of one macro, which expands to the actual definition of that instruction handler
5.3 Straightforward concatenation of the bipush and iload instruction handlers
5.4 Optimized concatenation of the bipush and iload instruction handlers by coalescing top-of-stack and program counter modifications
5.5 The fload instruction handler in OpenJDK Zero
5.6 Code implementing the ifnull instruction handler (simplified)
5.7 Snippet of an app.profile file
5.8 Invocation of the Interpreter Generator tool
5.9 Example invocation and output of the Interpreter Generator tool
5.10 Simplified example of a profile file, using line numbers instead of bytecode offsets
5.11 Bytecode corresponding to the counters from Listing 5.10
5.12 Short sequence of instructions containing incoming jumps
5.13 Superinstruction candidates derived from Listing 5.12
5.14 Generated output for the regular pop instruction
5.15 Generated output for the superinstruction aload_0-iload
5.16 Original iload instruction handler
5.17 iload instruction primitive macro definition
5.18 Interpreter loop with handlers read from an external file
5.19 Definition of the dispatch macro using a dispatch table
5.20 Definition of the dispatch table using a generated file
5.21 Content of the dispatch table in bytecodeInterpreter.jumptable.hpp
5.22 Content of the generated instruction constants file bytecodes.generated.hpp
5.23 Content of the generated definitions file bytecodes.definitions.hpp

6.1 The prime test benchmark

Chapter 1

Introduction

Long gone are the days when the only way to run software was by running it directly on the hardware. Today a plethora of mature software technologies are available which decouple the application from the target platform by means of a Virtual Machine (VM) for a number of reasons: increased portability, security, or simply to provide a richer runtime platform than what is possible directly in hardware. In this chapter an overview of this technology is presented. In section 1.1 the VM as a concept is introduced, with a special look at the interpreter in section 1.2 – a technique by which a non-native instruction set can be executed on a platform. Another technique for this purpose is discussed in section 1.3: the Just-In-Time compiler. Both have their strengths and weaknesses, which leads to the introduction of superinstructions in section 1.4 – an optimization attempting to mitigate the weaknesses of the interpreter, with some hints towards the state of the research domain of superinstructions. With the foundation laid the research goals are introduced in section 1.5, with the contribution of this work following suit in section 1.6. Finally, an overview is presented of this whole thesis in section 1.7.

1.1 What is a VM

Software is inherently linked to its target platform. After all, all software needs a machine to run on, and as such will have some adaptation for that machine. A “platform” or “machine” in this sense isn’t just the CPU or its instruction set; it encompasses the whole execution environment of the target platform. Software libraries and APIs available on the target machine are just as much part of “the platform”. The software can be executed by the target platform’s processor, and can also use function calls and syscalls to reach library code or kernel functions. This link may seem completely unavoidable, and it may even be the intended way of writing applications for a target platform. However, there are serious drawbacks when multiple distinct platforms need to be supported by a particular software application. Some differences between platforms are easily overcome: a different CPU instruction set, for example, can be dealt with by using a programming language that abstracts over the specifics of the instruction set (e.g. C++) and recompiling the software for the other CPU architecture. However, such abstractions can quickly break down when the differences become too large, and they typically do not offer much protection against differences in software APIs, which may stand in the way of reusing all code across platforms. A cryptography application may for instance choose not to use a rich cryptography library available on one platform, as an independent software implementation needs to be made available for the other platforms supported by the software anyway. And even when the decision is made to use the platform-provided encryption library on the one platform that supports it, for performance or other reasons, the developers of the application still have to support the embedded encryption algorithms for the other platforms. Now it is not only necessary to provide two different distributions of the same software application for different target platforms, it also

becomes necessary to maintain multiple versions of the codebase in the places where the programming language is unable to abstract away the differences. Ideally, the whole world would just use one type of machine with one CPU, one operating system and one set of libraries, and stick with it. Unfortunately, we do not live in that world, but with some software tricks we can close the distance. With software, it is possible to virtualize one standard platform with a Virtual Machine (VM). Applications can then be developed for this standard platform. All target platforms can, by acquiring an implementation of this VM, virtualize this standard platform and run all software available for it. In a sense, this way of running software carries the thin abstraction layer provided by a programming language like C++ into the runtime, and makes it more complete by abstracting the entire execution environment. Such a standard platform can take many forms. For example, in the world of Docker this standard platform is a Linux kernel with a selection of software libraries, and non-Linux platforms have to virtualize a whole kernel to run Docker applications (“containers”). Another, more relevant, example is the Java Virtual Machine (JVM), which will be discussed in detail in section 2.1. Given the focus on the JVM, a full list of all bytecode instructions and their definitions can be found in appendix A.1.

1.2 What is an interpreter

Not all virtual machines are the same, and while Docker and the JVM may both be virtual machines, they are different in various ways. One such way is how they abstract the CPU. The answer in the case of Docker can be summarized with “not”. An application for Docker is still bound to the host platform’s CPU and its instruction set; Docker only standardizes the software environment. This is in contrast to the JVM, which does virtualize the CPU by introducing its own instruction set with its own semantics, independent of the host platform. Such virtualization provides a new challenge for the virtual machine implementation to solve. Somehow the virtual machine needs to take instructions not compatible with the target platform’s CPU and execute them. There are multiple ways to solve this problem, but a very straightforward way is by using an interpreter. An interpreter can be thought of as a software implementation of a CPU. It keeps track of its current state (e.g. frame pointer and program counter), fetches the next instruction from the input program and performs the actions associated with that instruction. It then fetches the next instruction, following the semantics of the virtualized CPU. As such, an interpreter requires software implementations for all the different instructions in the instruction set of the VM. The interpreter, together with the rest of the VM, is written directly for the target platform, allowing it to bridge the gap between the virtualized platform and the actual host platform.

1.2.1 Anatomy of an interpreter

To better illustrate how an interpreter may be implemented, consider Listing 1.1.

void interpret(instruction* code, int* stack) {
    instruction* pc = &code[0];
    while (true) {
        switch (*pc) {
            case NOP:  pc++; continue;
            case STOP: ...;
            case JUMP: ...;
            case PUSH: ...;
            case POP:  pc++; stack--; continue;
            ...
        }
    }
}

Listing 1.1: Basic interpreter loop

This tiny program already shows the basic skeleton of an interpreter. It holds the state in a stack pointer (frame pointer) and a program counter, and iteratively processes the next instruction in an indefinite loop. In this program there are a few instructions: NOP, STOP, JUMP, PUSH and POP, each with code to implement that instruction. For brevity most implementations are omitted, but the NOP instruction with its simple implementation is shown: it simply increments the program counter (pc) and carries on. The POP instruction is also shown modifying the stack pointer; since popping decrements the stack pointer, the stack apparently grows upwards in this snippet. The program itself is represented as just an array of instruction values. The stack for the program is provided to the interpreter as an int pointer. Finally, the program counter (pc) is initialized to the start of the program. In section 2.2 this code fragment will be revisited, adding more complexity to the interpreter. However, even without that extra complexity, Listing 1.1 should shed some light on interpreter basics.

1.3 What is a JIT Compiler

The discussion regarding interpreter design did not mention the word “performance” even once, but as one might suspect, executing an instruction set by simulating a CPU is simply not nearly as fast as executing instructions directly in hardware. Another, more performant method of bridging the gap between the virtual machine and the host machine exists: the Just-In-Time (JIT) compiler. The idea is simple: convert (compile) whole swaths of virtual machine code directly to target machine code. Doing this for the whole application may, depending on the virtual machine instruction set, lead to an excessive amount of code or may simply take too long. As such, the idea is to only compile where it is necessary, and do it Just In Time – just before the execution of said code. This offers many advantages: performance can be brought closer to that of an application running directly on the hardware. Additionally, by bringing the problem of bridging the gap between the virtual machine and the target machine into the domain of compiler construction, all sorts of existing theory and optimizations can be reused. By running at runtime – while the application runs – it is also possible to “cheat” a bit when compiling the target machine code: if at a particular moment the application has never taken the else branch of an if-else, the JIT compiler can cheat by not compiling the else branch. Instead, it makes the else branch trap out of the generated code back to the JIT compiler, which generates new code if that branch does end up being taken, so as not to violate the semantics of the VM. This optimization may bring great performance improvements for the common case (where the else branch is not taken), while still providing a correct execution path for the other case. Generating code under such an extra assumption can also enable further optimizations – for example, once it is assumed that this else branch won’t be taken, other parts of the code fall away or can be simplified. These are optimizations which aren’t even possible without a VM. One obvious downside to a JIT compiler is the time it takes to start up. Code that is not often executed suffers a performance penalty, as the time to JIT compile may overshadow the total runtime that code would have had within the VM when run with just an interpreter. This is why virtual machines often combine an interpreter with a JIT compiler. The interpreter is there to provide a running start, with the JIT compiler able to optimize the “hotspots” – often executed pieces of code – leading to a best-of-both-worlds situation. JIT compilers have another downside: the cost of developing one is significantly higher than that of an interpreter. The code fragment from the previous section (Listing 1.1) is an example of the elegance and simplicity of an interpreter. Furthermore, this interpreter is written entirely in the C programming language, which offers some cross-CPU portability, making it easier to maintain an interpreter for multiple hardware platforms. A JIT compiler, in contrast, faces the same problems as a typical compiler, requiring a CPU-specific backend to generate machine code for that one CPU architecture. As such, porting it to a new architecture is considerably more expensive.
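To make this “cheating” idea concrete, a minimal sketch is shown below; the function and helper names (jitted_method, hot_path, uncommon_trap) are purely illustrative and do not correspond to any actual JIT compiler's API.

/* Sketch: generated code (shown here as C) for
 *   if (x > 0) { hotPath(); } else { coldPath(); }
 * under the assumption that the else branch has never been taken.
 * Instead of compiling the cold path, the generated code "traps" back
 * into the runtime, which can then compile or interpret that path. */
extern void hot_path(void);
extern void uncommon_trap(int bytecode_index);  /* re-enter the VM runtime */

void jitted_method(int x) {
    if (x > 0) {
        hot_path();          /* compiled normally: the common case         */
    } else {
        uncommon_trap(42);   /* rare case: bail out instead of compiling;
                                the index 42 is a made-up example          */
    }
}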

This work focuses on improving the JVM without a JIT compiler by designing and implementing optimizations for the interpreter.

1.4 Superinstructions

One of the reasons why a JIT compiler typically outperforms an interpreter is that an interpreter needs to do a lot of jumps. Looking at Listing 1.1, every instruction that is executed amounts to at least two jumps: one to the instruction handler, and one back to the start of the loop. These jumps are expensive, and especially when an instruction handler is only a few machine-code instructions long, the cost of these jumps can start to dominate the execution time of the VM. The exact mechanism behind this cost is explained later in section 2.2.1, but for now it suffices to understand that lowering the number of jumps in the interpreter can make a direct, positive impact on the performance of the interpreter. One mechanism for doing so is superinstructions [1, 3, 4, 5, 6, 15]. The idea assumes that some sequences of instructions are common. If the individual instructions in such a sequence are very short, it makes sense to adapt the virtual machine instruction set to contain a new “bulk” instruction which performs the work of the whole sequence at once. If such an instruction replaces a sequence of x instructions, it saves 2 · (x − 1) jumps in the interpreter design of Listing 1.1, which can lead to a substantial performance improvement. Such an instruction, the concatenation of multiple regular instructions, is called a superinstruction. However, adapting the virtual machine instruction set to contain common superinstructions is often not practical for an existing virtual machine. Furthermore, the ideal superinstructions may differ from one application to another. As such, the superinstruction architecture inserts superinstructions into the input program while it is being read into the virtual machine, requiring no change to the external instruction set. The ideal superinstructions depend on the kind of application being run. An application applying formulae to a large dataset may see the best performance improvements if each of its formulae were expressed as a superinstruction. A different application dealing with the conversion between two file formats may benefit the most from superinstructions spanning a sequence of LOAD and STORE instructions. As such, the superinstructions in a superinstruction architecture are typically not hand-picked. Instead, good superinstruction candidates are selected by examining the input program well ahead of the runtime. From the results of this examination the superinstructions can be generated automatically: by simply taking multiple switch cases of Listing 1.1 and concatenating them, a new handler can be created implementing all those instructions in sequence. With this technique, code generation is used to create the instruction handlers for the VM. The VM with this generated code can be compiled as normal with a C compiler (or similar), yielding a VM optimized for the application the superinstructions were selected for. This also highlights an immediate downside of the superinstruction workflow: a whole VM needs to be compiled for one particular application. This erodes the advantages a VM provides over running directly on the hardware, as now either the software vendor or the user needs to generate a VM for the application. If this responsibility falls on the software vendor, part of the portability advantage is lost: the vendor now has to provide one binary per supported platform. In the case of Java, however, a VM equipped with a JIT compiler is available for all mainstream platforms.
It is rather the rare and specialized platforms – embedded devices with unusual CPUs or operating systems – that are left without a JIT compiler. These platforms also require more specialized handling: deploying an application on one of these hardware platforms is most likely not a matter of the user downloading something from the internet. These kinds of platforms might be managed remotely, with strict security and careful vetting of the software that gets to run on them. In such an environment, we do not consider it unreasonable that the user of the software goes through the extra step of generating a VM for the application prior to deploying it. Instead of specializing for an application, the superinstruction architecture can also be tuned

towards a more generic set of code. While some patterns of instructions may be highly specific to a particular application, it is expected that some patterns occur across applications. These may be caused by sharing a common compiler, by common programming practices, or by using identical library code. Existing implementations of the superinstruction architecture [1, 3, 4, 5, 6, 15] use rather primitive algorithms to (1) construct the superinstruction set and (2) insert superinstructions into loaded program code. In this work we believe that both can be improved by using more advanced algorithms.
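To make the handler concatenation idea concrete, the sketch below extends the toy interpreter of Listing 1.1 with one generated superinstruction. The PUSH_POP opcode and its numeric value are hypothetical, and the rewritten program is assumed to keep its original instruction layout, so operands (and the now-unused POP opcode slot) remain at their original offsets.

#define NOP      0
#define STOP     1
#define PUSH     3
#define POP      4
#define PUSH_POP 5   /* hypothetical superinstruction opcode */

typedef int instruction;

/* Listing 1.1 with one generated superinstruction handler added. The
 * PUSH_POP handler is the PUSH and POP handler bodies concatenated, so
 * the pair costs a single dispatch instead of two. */
void interpret_with_super(instruction* code, int* stack) {
    instruction* pc = &code[0];
    while (1) {
        switch (*pc) {
            case NOP:  pc++; continue;
            case PUSH: *(stack++) = pc[1]; pc += 2; continue;
            case POP:  stack--; pc += 1; continue;
            case PUSH_POP:                        /* generated superinstruction */
                *(stack++) = pc[1]; pc += 2;      /* PUSH part                  */
                stack--;            pc += 1;      /* POP part                   */
                continue;
            case STOP: return;
        }
    }
}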

1.5 Research goals

1.5.1 Motivation

Much work has been done in the past on reducing the number of dispatches by implementing superinstructions [1, 3, 4, 5, 6]. However, the methods used to find sequences of instructions to concatenate are suspected to leave room for optimization. The algorithms for substituting superinstructions into loaded bytecode are not particularly sophisticated, and place extra requirements on the superinstruction set. For example, the superinstruction set of Gregg et al. [6] requires, for any superinstruction spanning more than two instructions, that its prefix – a superinstruction spanning just one fewer bytecode instruction – also be present in the superinstruction set. This decreases the number of useful superinstructions that can be available, as these prefixes occupy valuable space in the superinstruction set that could otherwise be used for more useful superinstructions. Furthermore, no earlier implementation of superinstructions was able to detect equivalences between superinstructions, further decreasing the number of useful superinstructions that can be available for a given maximum superinstruction set size. We aim to improve and broaden both with graph-based algorithms, changing (1) the way the superinstruction set is generated, and (2) the way existing bytecode is converted to this new instruction set at class load time. The work by Ogata et al. [13] on optimizing an interpreter shows that bytecode fetches are costly, and that optimizing them can bring great performance improvements. The number of bytecode fetches would be further reduced by effectively generating longer superinstructions. It should be noted that Ogata et al. conducted their experiments on a PowerPC CPU, which may behave differently from the modern amd64 (x86_64) CPUs on which we intend to validate our optimizations.

1.5.2 Research question

This leads to the general research question:

RQ How can graph-based superinstruction matching algorithms and graph-based superinstruction set construction algorithms provide a performance speedup over existing algorithms in an interpreting JVM?

This question can be divided further into subquestions:

R1 What are the performance characteristics of a superinstruction VM with graph-based instruction matching algorithms on modern hardware?

R2 How can graph-based instruction matching algorithms be tuned and optimized for particular workloads?

R3 How can graph-based superinstruction set construction algorithms be tuned and optimized for particular workloads?

Answering these questions will also answer another question:

RX How does the superinstruction architecture as implemented in earlier work perform on a modern JVM interpreter implementation?

Question RQ already attempts to re-establish whether the work of Casey et al. [3] and others remains valid on modern hardware when implemented on top of a different system, namely a modern JVM.

1.5.3 Goals and method

The following goals can be enumerated as per the research question above, and should provide results on that question.

G1 Collect, devise and adapt a set of algorithms that, when provided with profiling data of a particular application, generate an optimized superinstruction set.

G2 Collect, devise and adapt a set of algorithms that, when provided with a superinstruction set and application code, can insert appropriate superinstructions so as to maximize performance.

G3 Implement these algorithms, either on top of OpenJDK or by some other means.

G4 Evaluate performance on a set of sample applications and benchmarks. The benchmarks are discussed in section 6.3.

As per goal G3, it is clear that these goals require building an actual implementation.

1.6 Research contributions

This thesis aims to add to the body of research on improving interpreter performance, with a focus on JVM interpreters. JVM interpreters as an optimization target have fallen out of fashion for most mainstream VMs as a result of the emergence and now ubiquity of JIT compilers. However, for a select set of applications, including embedded VMs, JIT compilers remain out of reach. Furthermore, we have repeated the results reached in earlier work and established that the superinstruction optimization still makes sense on modern processor architectures and on modern JVM interpreters.

1.7 Overview

In chapter 2 we discuss the various facets of the JVM, the theory behind why superinstructions offer a speedup, and the basics of interpreter design needed to understand this thesis. In chapter 3 related work is discussed: we cover several other superinstruction implementations and some related optimizations. The design and the implementation of our interpreter – QuickInterp – are split over two chapters. Chapter 4 discusses the design in a way that is independent of the interpreter to which it is applied. Then, in chapter 5, we implement the design on top of OpenJDK Zero 11. With the design and implementation complete, chapter 6 introduces the two benchmarks (the primes benchmark and the Spring pet clinic benchmark) and uses those to test the implementation. Finally, chapter 7 concludes this thesis by revisiting the research questions and discussing possible future work.

Chapter 2

Background

The Java programming language is a language designed to be secure, performant, robust, and platform independent [18]. Section 2.1 looks at the JVM, which is the technology that allows Java to run on multiple platforms without recompilation, and the technology to which we apply the superinstruction optimization. Superinstructions can help speed up performance due to the way modern processors load and execute instructions within an interpreter, which we discuss in section 2.2.

2.1 The JVM

The Java language is designed to support the development of secure, high-performance, highly robust applications on multiple platforms [18]. The language is object-oriented and supported by a garbage collector. We focus on the “multiple platforms” part of the equation, which is the component that can benefit from superinstructions. Running on multiple platforms is enabled by the use of a Virtual Machine (VM): the Java Virtual Machine (JVM). In this section we will discuss various aspects of the JVM that are needed to understand how superinstructions may be implemented in a JVM interpreter.

2.1.1 Introduction

The JVM has an instruction set called bytecode, which we will discuss in section 2.1.2 and section 2.1.3. Java code is compiled by the Java compiler to this instruction set, which makes it similar to a machine language. However, the promise of a secure VM has led to the creation of a typesafe instruction set [18]. In JVM bytecode, each memory location has an associated type, and instructions may only operate on allowed types. Code is statically checked to not violate any of the typing rules prior to being executed, and the bytecode format is written in such a way that it allows this kind of verification, which we will discuss in section 2.1.4. Java, as a language, is strict in checking types at compile time. However, classes are lazily linked at runtime [18]. New classes can be loaded at any time as the application runs, and can even be generated by the application itself, which we will discuss in section 2.1.5.

2.1.2 Bytecode structure

The javac compiler compiles Java programs to .class files, which are the basic unit of code within the JVM. Each .class file contains information about one particular class: the fields of the class (including type information) and all the methods of the class (static methods, virtual methods, abstract methods, constructor methods and the static initializer method), together with optional metadata [7]. Each non-abstract method has a “Code” attribute containing the actual bytecode instructions. The ranges and exception types of exception handlers are stored separately from

the Code attribute, in an Exceptions attribute. More information about the class format and the other attributes can be found in the Java Virtual Machine Specification [7]. The instructions within the Code attribute are called “bytecode” as each instruction has a one-byte opcode. This restricts the number of instructions to only 2^8 = 256, and as of Java 11 there are 202 bytecode instructions [8]. A table of all 202 instructions can be found in Appendix A.1. While each opcode is only one byte, the instructions themselves can be longer than just one byte. Each instruction is followed by zero or more instruction operands. These values serve as additional input to the bytecode instructions. For example, the bipush instruction pushes a one-byte constant onto the operand stack (we will discuss the operand stack in a moment). It receives this constant as an instruction operand. The Code attribute is also called the instruction stream or bytecode stream, and the bipush instruction thus reads one additional value from the bytecode stream. To make the VM more secure, bytecode instructions do not have undefined behavior [18]. For example, the “iaload” instruction, which reads an integer from an array, will throw a java.lang.ArrayIndexOutOfBoundsException or a java.lang.NullPointerException if the index is out of bounds or if the array reference is null, respectively. Furthermore, the bytecode instructions are designed in such a way that they facilitate verification. Jump instructions read their jump offset from the bytecode stream, and do not take their jump offset as a variable. There is one exception: the “ret” instruction. However, this instruction requires that the variable it receives its jump target from is of a special “returnAddress” type, which can only be constructed by the “jsr” instruction and cannot be manipulated [9].
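As a small illustration of this encoding, the sketch below shows the two bytes that make up a single bipush instruction and a handler that consumes them. The opcode value 0x10 is bipush's opcode from the JVM specification; the handler itself is our own simplification and not an actual JVM implementation.

#include <stdint.h>

/* The two-byte encoding of "bipush 42": opcode 0x10 followed by one
 * signed operand byte read from the bytecode stream. */
uint8_t code[] = { 0x10, 42 };

/* Sketch of a handler that reads the operand, pushes it onto the operand
 * stack, and reports how many bytes the instruction occupied. */
int bipush(const uint8_t* pc, int32_t* stack, int* sp) {
    stack[(*sp)++] = (int8_t)pc[1];   /* push the sign-extended operand */
    return 2;                         /* bipush occupies two bytes      */
}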

2.1.3 Stack-based architecture

The JVM is a stack-based virtual machine [10]: it uses an operand stack to communicate values between instructions. Most instructions push values to and pop values from the operand stack, while special “load” and “store” instructions exist to read and write variables in a local variable table. Both the local variable table and the operand stack are part of the current stack frame. Each time a method is invoked, a new frame is allocated for that method with a new local variable table and operand stack, keeping the local variable table and the operand stack separated between method invocations. The frame is destroyed when the method returns. The local variable table consists of slots from which values can be read and to which values can be written. The slots are addressed by an index – the local variable table index. The name of a local variable is converted by the compiler into a local variable table index.

iload_3      // pushes x onto the operand stack
iload 4      // pushes y onto the operand stack
imul         // pops x and y, pushes the product
istore_3     // pops the product

Listing 2.1: Sequence of bytecode instructions executing x = x · y

An example sequence of bytecode instructions can be seen in Listing 2.1. Here it is assumed that the variable “x” is assigned to local variable table index 3 (slot 3), and “y” is assigned to local variable table index 4 (slot 4). Note how there is a dedicated instruction to read from slot 3, the “iload_3” instruction, while reading from slot 4 happens with the general iload instruction followed by a one-byte instruction operand. Such specializations of often-used instruction operands are common in the bytecode instruction set.
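To make the frame layout tangible, a minimal sketch is given below; the struct name, field names and fixed sizes are our own illustration and do not correspond to the actual OpenJDK frame layout.

#include <stdint.h>

/* One frame per method invocation: a local variable table addressed by
 * slot index, and an operand stack that instructions push to and pop from. */
typedef struct frame {
    int32_t locals[8];     /* local variable table (slots 0..7)     */
    int32_t operands[8];   /* operand stack storage                 */
    int     sp;            /* number of values on the operand stack */
} frame;

/* x = x * y with x in slot 3 and y in slot 4, mirroring Listing 2.1. */
static void imul_example(frame* f) {
    f->operands[f->sp++] = f->locals[3];   /* iload_3  */
    f->operands[f->sp++] = f->locals[4];   /* iload 4  */
    int32_t b = f->operands[--f->sp];
    int32_t a = f->operands[--f->sp];
    f->operands[f->sp++] = a * b;          /* imul     */
    f->locals[3] = f->operands[--f->sp];   /* istore_3 */
}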

2.1.4 Verifier, typesafety and abstract interpreters

As mentioned in the introduction, the JVM is typesafe, which is checked in part by static verification. To allow static verification, various constraints exist on what one can do with bytecode, as we will discuss in a moment. These constraints are interesting for superinstruction optimizations, as they

enable powerful analysis on the bytecode which may not be possible in other instruction sets. For example, all branch targets can be statically determined, making it easier to construct a control flow graph. Furthermore, since the JVM is typesafe, type information can always be statically determined for a given location in the code. The act of analyzing the code statically is often done with an abstract interpreter. An abstract interpreter is a bit like a regular interpreter, but instead of operating on values it operates on types. Furthermore, it takes all branches when there is an opportunity to branch. Where a regular interpreter may push the constant “10” onto the operand stack, an abstract interpreter would push “int” onto an operand type stack. The class verifier is a type of abstract interpreter that is used to verify the type correctness of the code prior to loading. Abstract interpreters can also be used to determine the types of values in the local variable table slots and on the operand stack for other purposes, like garbage collection. One of the constraints put on JVM bytecode is a limit on the depth of the operand stack. The “Code” attribute discussed in the previous section stores the largest slot index used by the method, together with the maximum operand stack depth. These values are used to determine the size of a new stack frame when calling a method, and are statically verified. The very fact that a maximum operand stack depth exists has some consequences: it is not possible for an instruction to be reachable in the code with different operand stack depths. For example, the code in Listing 2.2 is illegal.

iload_3      // pushes x onto the operand stack
goto -1      // jump back to the previous instruction

Listing 2.2: Illegal sequence of bytecode instructions with an unbounded operand stack

The code in Listing 2.2 repeatedly pushes “x” onto the operand stack, without ever popping it. In order to stay performant, the JVM does not perform checks against this kind of unbounded behavior at runtime, but will catch this illegal sequence of instructions with the class verifier discussed earlier [18].
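As a sketch of what such an abstract interpretation looks like in code, the function below tracks only the operand stack depth for the toy instruction set of Listing 2.3, rejecting straight-line programs that would underflow the stack or exceed the declared maximum depth. It is a deliberate simplification (branches are not handled) and not the actual JVM verifier.

#include <stdbool.h>

enum { NOP, STOP, JUMP, PUSH, POP };   /* toy opcodes, as in Listing 2.3 */

/* Abstract interpretation over stack depth only: instead of real values,
 * just track how many values would be on the operand stack. Returns false
 * if the program would underflow the stack or exceed max_depth. */
bool verify_stack_depth(const int* code, int length, int max_depth) {
    int depth = 0;
    for (int i = 0; i < length; ) {
        switch (code[i]) {
            case NOP:  i += 1; break;
            case STOP: return true;              /* execution ends here            */
            case PUSH: depth++; i += 2; break;   /* skip the operand               */
            case POP:  depth--; i += 1; break;
            case JUMP: return false;             /* branches omitted in this sketch */
            default:   return false;             /* unknown opcode                 */
        }
        if (depth < 0 || depth > max_depth)
            return false;
    }
    return true;
}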

2.1.5 Dynamic class loading

As discussed in section 2.1.2, the .class file is the basic unit of code within the JVM. Class files are offered to the JVM by means of a class loader. Multiple class loaders can exist at the same time, and each class loader creates its own effective namespace, enabling the loading of two classes with the same name but with different implementations under different class loaders. Furthermore, classes do not have to originate from disk: they can be loaded from a file, from the network, or even be generated by the application itself. Concrete examples of an application generating its own classes can already be found within the Java Standard Library itself. For example, when creating a lambda using the Java 8 lambda syntax, the JVM will generate a special lambda class at first invocation. This highly dynamic behavior poses interesting challenges for a superinstruction architecture: there isn’t a single “load” phase where all classes get loaded before starting the application. Instead, the process of loading classes happens concurrently with the runtime.

2.2 Modern interpreter dispatching

2.2.1 Superscalar execution and branch prediction

The core idea of superinstructions is to concatenate existing instruction handlers to create new instruction handlers that can take on the work of multiple instructions. At first glance, it may appear as if such an optimization is minimal, saving only a few jumps from having to be executed. However, modern CPUs are superscalar: they fetch and decode several instructions at a time [17]. Fetching

the next few instructions requires knowing what the next few instructions are going to be, and jumps without a fixed jump target make this harder. This brings us to the topic of branch prediction – predicting where the code will branch to. Processors may use various indicators to determine where a conditional branch instruction will branch to. If a prediction is wrong, the processor has to repeat work, negatively impacting performance. A predictor may use a table of previous branch results stored in a branch prediction table [17]. The table is typically organized as an associative cache, linking branch history to the memory address of a particular instruction. While such an approach may work well in a regular program, in an interpreter the branch target depends on the next bytecode instruction. If, in the bytecode instruction stream, the instruction sequences “a b” and “a c” are both common, the branch predictor will not be able to reliably infer which instruction comes after “a”. The superinstruction architecture helps to mitigate this problem by creating new “a-b” and “a-c” superinstructions. Since these superinstructions are executed as a whole, the processor is able to work efficiently without branch mispredictions, as there are no dispatch branches to predict within them.

2.2.2 The interpreter

To better illustrate how an interpreter may be implemented, consider Listing 2.3. This listing expands on the earlier Listing 1.1.

#define NOP  0
#define STOP 1
#define JUMP 2
#define PUSH 3
#define POP  4
...

typedef int instruction;

void interpret(instruction* code, int* stack) {
    instruction* pc = &code[0];
    while (true) {
        switch (*pc) {
            case NOP:  pc++; continue;
            case STOP: return;
            case JUMP: pc += pc[1]; continue;
            case PUSH: *(stack++) = pc[1]; pc += 2; continue;
            case POP:  stack--; pc += 1; continue;
            ...
        }
    }
}

Listing 2.3: Basic interpreter loop

In this short example we see five instructions: NOP, STOP, JUMP, PUSH and POP. Furthermore, we see a typedef for the type of instruction – it’s an int. The program is just an array of ints. The stack for the program is provided to the interpreter as an int pointer. Finally, the program counter (pc) is initialized to the start of the program. The program runs in an infinite while loop, switching over the current instruction by simply following where pc points. Within that loop, each instruction has a handler:

NOP This is a no-op instruction, and performs no work besides updating the program counter. The pc is updated by one to move the program to the next instruction.

STOP The STOP instruction shows how the interpreter may terminate. Here the interpret function is left when this instruction is encountered, exiting the interpreter.

18 JUMP This instruction reads a jump offset from the input program – it interprets the next value in the program not as an instruction, but as an instruction operand. This value is interpreted as a relative jump offset, and added to the program counter changing the control flow of the interpreted program.

PUSH This instruction retrieves an instruction operand and pushes it onto the stack. The value following the current instruction is read and written to the top of the stack, after which the stack pointer is incremented. The program counter is incremented by two as a result of the increased size of this instruction.

POP The last instruction pops the top of the stack by simply decreasing the stack pointer by one. It has no instruction operand, so the program counter is incremented by just one.

An actually usable VM may have many more instructions; however, this tiny interpreter provides a nice starting point for evaluating other interpreter designs.
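As a small usage sketch (not taken from the thesis), the program below encodes a handful of instructions for the interpreter of Listing 2.3 and runs it; the instruction sequence itself is made up for illustration and assumes the definitions from that listing.

int main(void) {
    /* push 7, push 3, pop both, stop – operands are interleaved with
     * the opcodes in the instruction stream, as the PUSH handler expects */
    instruction program[] = { PUSH, 7, PUSH, 3, POP, POP, STOP };
    int stack[16];
    interpret(program, stack);
    return 0;
}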

2.2.3 Token-threaded interpreters

The most straightforward and intuitive interpreter design might be the token-threaded interpreter, which is the kind of interpreter shown in Listing 2.3. The token-threaded interpreter directly operates on a stream of tokens (instructions) as its input – here these are stored in the code array. Every token is an instruction, and after dereferencing the program counter a decoding step is needed to actually find the corresponding instruction handler implementing that instruction. In Listing 2.3 this decoding is done by means of a switch statement, but this is not the only way to decode the tokens. Another way is by having a control table available in the interpreter, where every instruction (token) has an index in this table. Such an interpreter is shown in Listing 2.4.

typedef int instruction;

void interpret(instruction* code, int* stack) {
    /* Labels as values (a GCC extension); the table lives inside the
       function so it can reference the handler labels below. */
    static const void* control_table[] = {
        &&label_nop,
        &&label_stop,
        &&label_jump,
        &&label_push,
        &&label_pop
    };
    instruction* pc = &code[0];
    while (true) {
        goto *control_table[*pc];

    label_nop:
        pc++; continue;
    label_stop:
        return;
    label_jump:
        pc += pc[1]; continue;
    label_push:
        *(stack++) = pc[1]; pc += 2; continue;
    label_pop:
        stack--; pc += 1; continue;
    }
}

Listing 2.4: Basic interpreter loop with a control table

Referencing a label (the && operator) is not part of the C language, and requires out-of-spec compiler support (like GCC's labels-as-values extension). A switch statement, like in Listing 2.3, can't just do a simple table lookup. Instead, it has to do a range check on the input value to optionally

19 skip the whole switch statement. The control table approach gives more control over the exact dispatching mechanism and can forgo such a check if the virtualized instruction set allows it.

2.2.4 Threaded-code interpreter

Interpreters are associated with requiring little to no preprocessing of the instructions before executing them, and the ones from Listing 2.3 and Listing 2.4 can even run directly on the input program. However, with a bit of preprocessing it is possible to move the decoding step – where the right instruction handler is found for a given token – from runtime to load time. This is done by walking over the instructions at load time and replacing every instruction with a pointer to its instruction handler. An interpreter implemented like this is called a threaded-code interpreter, and an example of the interpretation loop can be seen in Listing 2.5. Note how the instruction datatype is now defined to be a void* instead of an int.

#include <stdint.h>

typedef void* instruction;

void interpret(instruction* code, int* stack) {
    instruction* pc = &code[0];
    while (true) {
        goto **pc;

    label_nop:
        pc++; continue;
    label_stop:
        return;
    label_jump:
        pc += (intptr_t)pc[1]; continue;               /* operand stored as a plain value */
    label_push:
        *(stack++) = (int)(intptr_t)pc[1]; pc += 2; continue;
    label_pop:
        stack--; pc += 1; continue;
    }
}

Listing 2.5: Basic interpreter loop in a threaded code interpreter

The conversion of the input program to threaded code is not shown, as an implementation depends on the specifics of the target platform. The threaded-code interpreter may not always be straightforward to implement. For example, the running example language we have seen in Listing 2.3 and Listing 2.4 uses an int as the instruction datatype. The exact length of an int is compiler-dependent, but on 32-bit platforms it will typically be 4 bytes. On 32-bit platforms the size of a pointer is also 32 bits, meaning that on such a platform it is possible to convert all instructions to instruction handler pointers by overwriting them in the input program in place. However, on other platforms (like amd64) a pointer may be 8 bytes, meaning that the threaded-code interpreter architecture requires copying all instructions to a new array with sufficient space for the larger datatype. Furthermore, potentially complex analysis of the code needs to be performed to ensure that the relative jumps are updated correctly. Finally, it may not always be possible to walk over all the instructions statically. The target VM may not make a distinction between the stack and the memory where the code lives (the von Neumann architecture). What is at one moment a stack write may at another moment be an instruction, making it impossible to visit all instructions ahead of the runtime. And even when the program memory is distinct and immutable, if the target architecture allows jumping “between” instructions – to the instruction operand of a particular instruction – it becomes much harder to generate the required threaded code for the interpreter. For the JVM, the complexity of walking over a program to inspect and potentially convert every instruction without executing it ties in closely with the abstract interpreter used by the verifier and the garbage collector, as discussed in section 2.1.4.
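Although the conversion step is deliberately not shown here, a minimal sketch of it for the toy instruction set is given below. It assumes the interpreter exports its handler label addresses into a table (here called handler_table), that operands are simply copied through as plain values, and that jump offsets are expressed in array slots so they need no rewriting; this sidesteps the sizing and jump-offset issues discussed above.

#include <stdint.h>

enum { NOP, STOP, JUMP, PUSH, POP };   /* token opcodes, as in Listing 2.3 */

/* Filled by the interpreter with its handler label addresses
 * (&&label_nop, &&label_stop, ...), indexed by opcode. */
extern void* handler_table[];

/* Turn a token program into threaded code: every opcode becomes the address
 * of its handler; operands are copied through as plain values. */
void to_threaded_code(const int* tokens, int length, void** threaded) {
    for (int i = 0; i < length; ) {
        int op = tokens[i];
        threaded[i] = handler_table[op];
        if (op == PUSH || op == JUMP) {                   /* one-operand instructions */
            threaded[i + 1] = (void*)(intptr_t)tokens[i + 1];
            i += 2;
        } else {
            i += 1;
        }
    }
}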

2.3 Conclusion

We discussed how Java and the JVM work. JVM bytecode – while similar – is not the same as machine code. We discussed how there are at most 256 instructions (currently 202 as of Java 11 [8]) due to the use of byte-sized opcodes. JVM bytecode is typesafe and statically verified at load time. The static verification of bytecode is part of the specification, and is made possible by certain restrictions like fixed branch targets and fixed operand stack depths. Having just 202 instructions, combined with an instruction set that lends itself well to static analysis, makes JVM bytecode a suitable candidate for superinstructions. We also looked at interpreter design. The superinstruction optimization works due to the high cost of branch mispredictions within the CPU, which are common within interpreters as the CPU may not be able to predict which instruction will be executed next. We discussed how an interpreter can be implemented, covering token-threaded interpreters (the simplest kind of bytecode interpreter) and threaded-code interpreters. Each interpreter type has its own upsides and downsides, but the number of handlers within the token-threaded interpreter is limited by the size of the token – in the case of bytecode one byte, or 256 handlers. These topics will come back in later chapters as they underpin the superinstruction optimization. In the next chapter we discuss related work, including various implementations of the superinstruction optimization and other optimizations which rely on the branch prediction behavior of the CPU.

Chapter 3

Related work

3.1 Introduction

The superinstruction optimization is not new, and various earlier implementations exist [1, 3, 4, 5, 6, 15]. In this chapter, we discuss the basics of the general superinstruction architecture in section 3.2: how profiling information is used, how the superinstruction set is constructed and how the superinstructions are placed into loaded bytecode. We discuss some of the major implementations from previous work (superoperators [15] in section 3.2.2, vmgen [4] in section 3.2.3, and Tiger [1] in section 3.2.4) and provide an overview of what they have done. In section 3.3, we also discuss two other VM optimizations that address the same branch misprediction problem targeted by superinstructions.

3.2 Superinstructions

In section 1.4 we already covered the basics of the superinstruction optimization: concatenating instruction handlers into superinstructions speeds up the interpreter as it spends less time dispatching instructions and dealing with costly branch mispredictions. Section 2.2.1 discussed how modern processors execute multiple instructions at the same time in a process called superscalar execution. However, superscalar execution requires predicting the next sequence of machine code instructions, and the superinstruction architecture helps facilitate this across bytecode instructions in a JVM interpreter, as multiple bytecode instructions are handled by the same instruction handler.

3.2.1 Workflow Section 1.4 already established that the superinstructions should be generated automatically. However, it did not cover the mechanism by which the superinstructions are generated. Generating the superinstructions involves creating new instruction handlers that end up in the VM interpreter. Figure 3.1 shows an overview of the main workflow as presented by Ertl et al. [4]. This workflow ends up generating an interpreter for a specific workload or application via an Interpreter Generator or vmgen (“VM Generator”) – a mechanism via which an optimized interpreter can be generated based on an external configuration, which is described by Ertl et al. [4] (“vmgen”) and by Casey [1] (“Tiger”). The core of the workflow and tooling developed by Ertl et al. [4] is used by the others [1, 3, 5], which is shown in Figure 3.1. The workflow is split into three phases:

Ahead of Time VM compile time, which happens long before the application is executed or is even known

Class Load Time which occurs before any code within the class is executed, but on the target system where the JVM runs

Runtime which is the phase in which the bytecode is interpreted

Figure 3.1: Diagram of the process described by Ertl et al. [4] (the vmgen workflow, adapted from multiple diagrams from Ertl et al.). Ahead of Time: profiling data feeds the Instruction Set Generator, which produces an instruction set definition; the Interpreter Generator combines this definition with the Base Instruction Set Implementation into generated code and keyhole tables, which C++ compilation turns into the interpreter. Class Load Time: bytecode loaded from class files undergoes Keyhole Conversion into a superinstruction representation. Runtime: execution, producing profiling data.

The interesting changes with respect to a standard (non-superinstruction) JVM are indicated in gray. Key here is the use of a "Base Instruction Set Implementation" which offers reference implementations for the standard bytecode instructions. These are, with automation, concatenated based upon an instruction set definition to generate the appropriate superinstructions. The work of Ertl et al. [4] calls the tables holding the mapping between base instruction sequences and superinstructions Keyhole tables, which are also compiled into the VM to be used at Class Load Time. At Class Load Time, said keyhole tables are used to perform Keyhole conversion, where the input bytecode is processed and sequences of bytecode are converted to superinstructions. These superinstructions can then be executed at runtime. The choice of which instructions to concatenate and include in a superinstruction is based on a profile of an application. The workflow used by Ertl et al. [4] can generate a profiling interpreter automatically, which can generate such a profile when run on an application. This output profile (at the very bottom of the diagram) can be used as input again for a new build of the VM (top of the diagram).

3.2.2 Superoperators Proesting [15] introduces superoperators as an optimization technique for bytecode interpreters. Superoperators are technologically similar to superinstructions, as they are automatically synthesized from smaller operations to avoid costly per-operation overheads. Proesting uses the optimization to reduce the size of ANSI C application binaries: instead of compiling the entire program to machine code, it is possible to reduce the binary size by expressing sequences of code as interpretable bytecode instructions. Note that these bytecode instructions are not Java bytecode instructions – they just refer to an instruction opcode size of one byte. Proesting indicates that such a change incurs a performance penalty: the interpreted C applications run about 8-16 times slower than unoptimized compiled code. However, with superoperators, the performance impact is brought down to a factor of 3-9, yielding a best-case performance improvement with superoperators over no superoperators of 2.6x. Generation happens with a tool called hti (hybrid translator/interpreter). The hti tool compiles C functions into a tiny amount of assembly code for the function prologue in order to stay compatible with the application binary interface. The rest of the function is compiled into bytecode instructions, which is interpreted by an included interpreter written in assembly. Since Proesting uses a bytecode instruction set, their virtual machine supports up to 256 instructions (including any superoperators). 109 base instructions are needed to cover every operation in the interpreter, leaving 147 bytecodes available for generated superoperators. The superoperators are generated using a greedy algorithm which promotes the highest scoring pair of operations to a superoperation. The new superoperation is then treated as its own operation, repeating the process until all 147 superoperators have been allocated. The scoring is based on the desired outcome: when optimizing for performance, the score is based on the number of (expected) executions. In this mode, the superoperator architecture is identical to the superinstruction architecture as described in section 3.2.1, with the exception that substitution of the superoperators into the program can already happen at compile time. The hti tool also supports another mode where it optimizes for disk space. In this mode, the score is based on the number of occurrences of each operation in the to-be-compiled program. This is not a relevant mode in the superinstruction architecture as described in section 3.2.1, as insertion of superinstructions in the program happens at class-load time and not at compile time. As such, slimming down the size of the program with superinstructions has no impact on the on-disk representation of the program.
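To make this greedy pair-promotion scheme concrete, the sketch below repeatedly merges the most frequently occurring adjacent pair of operations into a new operation until the opcode budget is spent. It is a simplified reconstruction, not Proesting's hti implementation; the names (GreedySuperoperators, selectSuperoperators) and the use of plain opcode strings are assumptions for illustration, and the execution-based score is approximated by counting pairs in profiled operation sequences.

import java.util.*;

// Sketch of greedy superoperator selection (an illustration, not Proesting's hti tool).
// Each traced sequence is a list of operation names; the most frequent adjacent pair is
// promoted to a new operation, which may itself be merged again in a later round.
final class GreedySuperoperators {

    static List<List<String>> selectSuperoperators(List<List<String>> tracedSequences, int budget) {
        // Work on a mutable copy so the caller's sequences are left untouched.
        List<List<String>> sequences = new ArrayList<>();
        for (List<String> s : tracedSequences) sequences.add(new ArrayList<>(s));

        List<List<String>> superoperators = new ArrayList<>();
        while (superoperators.size() < budget) {
            Map<List<String>, Integer> pairCount = new HashMap<>();
            for (List<String> seq : sequences) {
                for (int i = 0; i + 1 < seq.size(); i++) {
                    pairCount.merge(List.of(seq.get(i), seq.get(i + 1)), 1, Integer::sum);
                }
            }
            Optional<Map.Entry<List<String>, Integer>> best =
                    pairCount.entrySet().stream().max(Map.Entry.comparingByValue());
            if (best.isEmpty() || best.get().getValue() < 2) break; // nothing left worth merging
            List<String> pair = best.get().getKey();
            String merged = pair.get(0) + "+" + pair.get(1);
            superoperators.add(pair);
            // Treat the new superoperator as a single operation in all sequences, so that it
            // can participate in later (longer) merges.
            for (List<String> seq : sequences) {
                for (int i = 0; i + 1 < seq.size(); i++) {
                    if (seq.get(i).equals(pair.get(0)) && seq.get(i + 1).equals(pair.get(1))) {
                        seq.set(i, merged);
                        seq.remove(i + 1);
                    }
                }
            }
        }
        return superoperators;
    }
}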

3.2.3 vmgen Ertl et al. [4] presents a tool called vmgen. The tool was already discussed in the introduction (section 3.2.1) as it presents a clear example of the superinstruction architecture as also seen in

other work and that we reproduce as part of this thesis. The authors use the tool to generate two interpreters: one for the Gforth Forth language interpreter, and another for the Cacao JVM. The latter is interesting, as it targets a JVM implementation just like we do in this thesis. However, their interpreter is not a JVM bytecode interpreter; instead it interprets the (very similar) Cacao JIT-compiler intermediate representation. They saw a performance increase of 80% using superinstructions in Gforth; however, they report no such improvements when targeting the JVM. The authors explain that this is due to the large size of the generated JVM interpreter, with the JVM interpreter plus superinstructions no longer fitting in the instruction cache of the tested CPUs. The tool takes a VM instruction description file and generates C code from it for each VM instruction. It can generate profiling instructions and it can generate superinstructions based on information present in the VM instruction description file. In the VM instruction description file, each instruction definition is not written in C. Instead, it is written using a special syntax that can be transformed into C. This syntax gives the vmgen tool more information about the contents of an instruction handler, which allows it to very easily generate profiling instructions or generate instructions with other extras (like debugging). Their implementation of profiling instructions is very similar to our technique, which we will discuss in section 4.3. The vmgen tool produces optimized superinstruction handlers – they are not a simple concatenation of the machine code of multiple instruction handlers. One example the authors give is of the “sipush-iadd” superinstruction, where the intermediate value pushed by the “sipush” instruction is not written to the operand stack, but instead provided as a constant to the “iadd” instruction. However, this optimization is made possible not by the vmgen tool on its own. Instead, it relies on GCC's ability to detect, within the superinstruction handler, that writing the value to the top of the stack and reading it immediately after is equivalent to using the value as a constant. The authors show that this does indeed happen – GCC is capable of making that optimization. To determine which superinstructions to create, their approach uses profiling information to determine which instructions to concatenate, generating 2400 superinstructions in their largest run. The algorithm used to create the superinstructions does not optimize which sequences to promote to superinstructions. Instead, all sequences of instructions are turned into a superinstruction, but the maximum length of each superinstruction is limited to just 4 regular instructions. To lower the number of candidate superinstructions even further, the authors discarded superinstructions with fewer than 10,000 executions as per the profile when generating the JVM interpreter. This makes their superinstruction set construction algorithm effectively the same as the algorithm from Proesting [15], where superoperators are also selected based upon the number of executions. The runtime substitution algorithm is called peephole optimization, and uses the keyhole tables from Figure 3.1. It is a straightforward algorithm that repeatedly looks at two consecutive instructions. If the two instructions are in the keyhole table as a pair, the first instruction is replaced with the opcode of the superinstruction that is also in the table.
This process repeats, making it possible to create larger superinstructions. We reimplement the peephole optimization algorithm as part of this thesis in section 4.5.2 as triplet-based substitution, where we discuss this algorithm in more detail.
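A minimal reconstruction of this peephole idea is sketched below: scan the instruction stream and, whenever the current (possibly already substituted) instruction and the next one appear as a pair in the keyhole table, replace the pair with the superinstruction opcode. The table layout and names are assumptions for illustration, not vmgen's actual data structures, and instruction operands and bytecode offsets are ignored for brevity.

import java.util.*;

// Sketch of keyhole-table driven peephole substitution (an illustration, not vmgen's code).
// keyholeTable maps a pair (instruction, next instruction) to the opcode of the
// superinstruction replacing that pair. Because the replacement can itself be the first
// element of another pair, repeated application grows superinstructions step by step.
final class PeepholeSubstitution {

    static List<String> substitute(List<String> code, Map<List<String>, String> keyholeTable) {
        List<String> out = new ArrayList<>(code);
        int i = 0;
        while (i + 1 < out.size()) {
            String replacement = keyholeTable.get(List.of(out.get(i), out.get(i + 1)));
            if (replacement != null) {
                out.set(i, replacement); // the pair collapses into the superinstruction opcode
                out.remove(i + 1);
                // stay at i: the new superinstruction may pair up with the following instruction
            } else {
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<List<String>, String> table = Map.of(
                List.of("iload", "iload"), "iload_iload",
                List.of("iload_iload", "iadd"), "iload_iload_iadd");
        System.out.println(substitute(List.of("iload", "iload", "iadd", "istore"), table));
        // prints [iload_iload_iadd, istore]
    }
}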

3.2.4 Tiger The work from Casey [1] presents another interpreter-generator tool: Tiger. The Tiger interpreter generator specifically targets the JVM, and it supports a large number of optimization techniques. The optimizations are not implemented on an existing VM; rather, they are part of a new VM written by the author called Fastcore. The author splits the optimizations into two categories: static optimizations (which apply to all programs) and dynamic optimizations (specific to one program). Superinstructions fall into the dynamic optimizations category, together with other optimizations that try to assist modern processors in branch prediction. The combination of dynamic optimizations achieves a performance speedup of 2.76x for a broad range of programs. Casey tries several different strategies for selecting superinstructions. Superinstructions are selected both statically, based only on the program code and no runtime information, as well as dynamically by using runtime profiling information. Furthermore, Casey uses heuristics to turn

the number of occurrences (static selection) or number of executions (dynamic selection) into a score that can be optimized. Superinstructions are selected greedily based upon the highest scoring superinstruction candidates found in the program code. The heuristics used by Casey are:

1. Use the number of occurrences or the number of executions as-is

2. Divide the number of occurrences/executions by the length of each superinstruction (prefer short superinstructions over long superinstructions)

3. Multiply the number of occurrences/executions by the length of each superinstruction (prefer longer superinstructions over short superinstructions)

Each of these heuristics was tested both with a static profile and a dynamic profile, and tested with superinstruction set sizes of 8, 16, 32, 64, 128, 256 and 512. Casey reports better results with each increase in superinstruction set size. Furthermore, he also reports better results with a static profile than with a dynamic profile, as well as that preferring shorter superinstructions boosts performance. Casey observes that even slight changes in superinstruction selection can have a big impact on benchmark performance. To place superinstructions, Tiger uses a Deterministic Finite Automaton (DFA). When processing a sequence of bytecode instructions, the DFA moves to a new state taking the opcode as input. Certain states in the DFA are marked with a superinstruction opcode, indicating that, when such a state is reached, that superinstruction can be placed. This allows placing superinstructions without their prefix also being available as a superinstruction, as was required in the vmgen tool created by Ertl et al. [4]. The DFA is functionally very similar to the tree-based parsing technique introduced in this thesis in section 4.5.2. Casey also touches on the subject of greedy superinstruction placement, discussing how greedy superinstruction placement can miss better superinstructions that start later, which is a problem we touch upon in section 4.5.3. Casey implements “optimal” superinstruction placement, but does not discuss the implementation. Casey shows a slight performance improvement using optimal superinstruction placement over greedy superinstruction placement, but argues that this result can be impacted by the superinstruction selection algorithm.
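The sketch below illustrates how such an automaton-style matcher could work: the superinstruction sequences are stored in a trie, and while scanning the code the longest sequence ending in an accepting state is placed. This is a reconstruction for illustration only (Tiger's actual automaton and data structures are not described here), and operands are again ignored.

import java.util.*;

// Sketch of automaton-driven greedy placement (an illustration, not Tiger's implementation).
// The trie plays the role of the DFA: following edges labelled with opcodes, a state marked
// with a superinstruction opcode means that superinstruction can be placed at this point.
final class AutomatonPlacement {

    static final class State {
        final Map<String, State> edges = new HashMap<>();
        String superinstruction; // non-null if this state corresponds to a superinstruction
    }

    static State build(Map<List<String>, String> superinstructions) {
        State root = new State();
        superinstructions.forEach((sequence, opcode) -> {
            State s = root;
            for (String insn : sequence) s = s.edges.computeIfAbsent(insn, k -> new State());
            s.superinstruction = opcode;
        });
        return root;
    }

    static List<String> place(List<String> code, State root) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < code.size()) {
            State s = root;
            int matchedLength = 0;
            String matched = null;
            for (int j = i; j < code.size(); j++) { // follow the automaton as far as possible
                s = s.edges.get(code.get(j));
                if (s == null) break;
                if (s.superinstruction != null) {
                    matchedLength = j - i + 1;
                    matched = s.superinstruction;
                }
            }
            if (matched != null) { out.add(matched); i += matchedLength; }
            else { out.add(code.get(i)); i++; }
        }
        return out;
    }
}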

3.2.5 Conclusion Various superinstruction implementations have been discussed in this section, each of the implementations bringing something new to the table. Table 3.1 provides an overview of the previous sections, with an implementation in each column. The Superinstruction selection row indicates how sequences of instructions were promoted to superinstructions. All works used some form of frequency-based reduction of the number of superinstruction candidates, although Ertl et al. attempted to promote all candidates to a superinstruction. Only when that was not possible did they resort to reducing the number of candidates by removing candidates with fewer than 10,000 executions according to profiling data. The table also contains our results, which we will discuss in detail in the remainder of this thesis (the design is discussed in chapter 4 and the implementation in chapter 5). To summarize, we construct the superinstruction set using static evaluation, which is a heuristic discussed in section 4.4.3. The heuristic differs from the heuristics used by others. Where others use a heuristic to select candidate superinstructions directly, our heuristic can only assign a score to a particular superinstruction set as a whole. To create a superinstruction set while using this heuristic, an iterative optimization algorithm is used which starts with a random superinstruction set and iteratively improves it, optimizing the static evaluation score. To determine this static evaluation score, the runtime substitution algorithm is used, meaning that the superinstruction set is directly optimized for the runtime substitution algorithm. Note that the Results row cannot be easily compared: the different authors used different virtual machines, different hardware and different benchmarks.

Table 3.1: Overview of earlier work

Author: Proesting | Ertl et al. | Casey | Our results
Largest superinstruction set: 147 | 2400 | 512 | 1000
Interpreter: ANSI C interpreter | Cacao VM (C) | FastCore | OpenJDK Zero
Superinstruction selection: Based on frequency | No selection when possible / based on frequency | Based on frequency using heuristics | Iterative optimization using static evaluation
Runtime substitution: Unknown | Peephole (greedy) | DFA-based (greedy), Unknown (optimal) | Triplet-based (greedy), tree-based (greedy), shortest path (optimal), shortest-path with equivalence (optimal)
Results: 2.6x | none (JVM), 1.8x (Gforth) | 2.76x | 1.46x (Primes), 1.36x (Spring)

3.3 Other interpreter optimizations / research

As discussed in section 2.2.1, the superinstruction optimization hinges on the failure of a superscalar processor to predict the next instruction handler in the interpreter loop. Superinstructions help by concatenating handlers together, which makes it possible for a superscalar processor to predict the instructions ahead. There are two other optimizations which also hope to mitigate the same problem: instruction specialization and static replication, which we will discuss in this section. A core observation used by both optimizations is that the processor associates branching history with the location of the conditionally branching machine code instruction, recording if and where the jump went to. This memory is used for branch prediction based on historic results of a conditional branching instruction. However, in an interpreter, this historic data is only accurate if the same sequence of instructions is executed twice. For example, after the instructions “iload iadd” have been executed, the processor will predict that the “iload” handler will jump to the “iadd” handler next if it sees the “iload” instruction again. The iload instruction is a very common instruction that is followed by many different instructions, so this prediction is likely to cause a branch misprediction.

3.3.1 Static replication Static replication is a technique created by Casey et al. [2], and provides a speedup of up to 3.07x, compared to 3.39x with superinstructions in the same paper. Static replication works not by reducing the number of branches, but rather by giving the processor more information to help figure out where to jump to next. The idea is to “replicate” common instruction handlers multiple times within the interpreter binary, and assign a unique opcode to each instruction handler. Then, opcodes are replaced at load time with new opcodes that map to a copy of the same instruction handler. For example, the “iload” instruction handler is replicated two times within the interpreter

binary. Each copy has its own opcode, creating "iload-a" and "iload-b". Each copy does exactly the same thing as the regular iload handler. In the program "iload iadd iload imul iload iadd", the pattern "iload iadd" is identified as common (much like a superinstruction). But instead of replacing these two with a superinstruction, the iload instruction opcode in this pattern is consistently replaced with the alternative iload-a instruction. Furthermore, the iload in "iload imul" is similarly replaced, giving "iload-b imul". This creates the full program "iload-a iadd iload-b imul iload-a iadd". As iload-a is distinct from iload-b in the VM binary, the processor will associate separate branch prediction histories with both instructions. As such, the processor is able to learn that iload-a is followed by iadd, and likewise that iload-b is followed by imul.
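A load-time rewrite along these lines could look like the sketch below. It is an illustration of the idea rather than Casey et al.'s implementation: the replica for a replicated opcode is chosen per following instruction, so that each static context consistently maps to the same copy.

import java.util.*;

// Sketch of static replication at class load time (an illustration, not Casey et al.'s code).
// A replicated opcode is rewritten to one of its copies based on the instruction that
// follows it, so each copy accumulates its own branch-prediction history in the CPU.
final class StaticReplication {

    static List<String> rewrite(List<String> code, Map<String, List<String>> replicas) {
        Map<String, String> replicaForContext = new HashMap<>(); // "opcode->successor" -> replica
        Map<String, Integer> used = new HashMap<>();             // replicas handed out per opcode
        List<String> out = new ArrayList<>(code);
        for (int i = 0; i + 1 < out.size(); i++) {
            String opcode = out.get(i);
            List<String> copies = replicas.get(opcode);
            if (copies == null) continue;
            String context = opcode + "->" + out.get(i + 1);
            String replica = replicaForContext.get(context);
            if (replica == null) {
                int k = used.merge(opcode, 1, Integer::sum) - 1;
                replica = copies.get(Math.min(k, copies.size() - 1)); // reuse the last copy if exhausted
                replicaForContext.put(context, replica);
            }
            out.set(i, replica);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> replicas = Map.of("iload", List.of("iload-a", "iload-b"));
        System.out.println(rewrite(List.of("iload", "iadd", "iload", "imul", "iload", "iadd"), replicas));
        // prints [iload-a, iadd, iload-b, imul, iload-a, iadd]
    }
}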

3.3.2 Instruction specialization Instruction specialization is an entirely different optimization, and not primarily designed to help the processor with branch prediction. However, it does have this as a side-effect, as it adds more instructions to the VM. The technique is implemented in the Tiger tool created by Casey [1]. JVM bytecode has few instructions (202 as of Java 11 [8]). While the JVM could make do with even fewer, certain instructions with common operands have been specialized into their own dedicated instructions. Specializations of common instructions, like iload_1 for the iload instruction, exist to reduce the memory load and help performance. Such specializations also add extra instructions that have their own associated branch prediction information. The idea of instruction specialization is to push this approach further: why not generate significantly more of such specialized instructions. Casey reports that the performance benefit of this optimization is dominated by the reduction in branch prediction misses.

3.4 Conclusion

In this chapter we discussed three of the major existing superinstruction implementations, of which two are for the JVM and one is for ANSI C interpreters. We discussed the general superinstruction architecture using the implementation by Ertl et al. [4] as an example. Profiling data is used in an instruction set generator to generate an instruction set. This instruction set is used by an interpreter generator tool to generate code, which is compiled by a C++ (or other) compiler to create an interpreter binary. When classes are loaded, their bytecode is inspected and sequences of bytecode for which superinstructions exist are substituted with superinstructions. Finally, the bytecode with the new superinstructions is executed by the interpreter binary. While the superinstruction optimization works by reducing constant per-instruction overhead, part of this overhead is caused by branch mispredictions as explained in section 2.2.1. We also discussed two other optimizations which address branch mispredictions in another way. These optimizations create extra instructions that are replications or specializations of existing instructions. These instructions give the processor more opportunity to predict the next instruction, which lowers the number of branch mispredictions.

Chapter 4

Design of QuickInterp

4.1 Introduction

In this chapter the QuickInterp architecture design is introduced. QuickInterp is the name for an implementation of the superinstruction architecture as discussed in section 3.2, with new capabilities to satisfy the research questions. This architecture remains platform-agnostic for the purpose of this chapter – i.e. the architecture is laid out in such a way that it is not dependent on one particular VM implementation. However, it is assumed this platform is a Java Virtual Machine (JVM) with bytecode as its instruction set, with no assumptions on its interpreter design. The designed architecture should be able to satisfy research goals G1 and G2 from section 1.5, and aid in satisfying research goal G3 in the implementation:

G1 Collect, devise and adapt a set of algorithms that, when provided with profiling data of a particular application, generate an optimized superinstruction set.

G2 Collect, devise and adapt a set of algorithms that, when provided with a superinstruction set and application code, can insert appropriate superinstructions so as to maximize performance.

G3 Implement these algorithms, either on top of OpenJDK or by some other means.

G4 Evaluate performance on a set of sample applications and benchmarks. The benchmarks are discussed in section 6.3.

4.1.1 Design goals From these goals a set of design goals (DGs) can be derived for the superinstruction architecture:

DG1 Enable the VM to generate a rich application profile from a running application such that any superinstruction construction algorithm is able to prioritize which sequences of bytecodes make the most suitable superinstructions (requirements for G1, G2)

DG2 Design a superinstruction architecture that is agnostic of the chosen construction (G1) and runtime placement (G2) algorithms to also support implementations of existing algorithms (required for benchmarking, ties into goal G4).

DG3 Design a new, optimized algorithm for the construction of the superinstruction set (G1).

DG4 Design a new, optimized algorithm for the runtime placement of superinstructions (G2).

These goals make the purpose of this chapter clear: a superinstruction architecture needs to be designed ready to be implemented on some VM (DG1, DG2), and this superinstruction

architecture should support two new algorithms. These algorithms, one for the superinstruction set construction and one for the runtime placement of superinstructions, need to be designed as well for DG3 and DG4.

4.1.2 Overview In section 4.2 the workflow from section 3.2 is built upon, with some differences in approach that are more suitable to support different superinstruction algorithms. Then, a deep-dive into profiling is taken in section 4.3. Here it is discussed how a profile from a running application should be obtained, and how the requirements for such a profile dictate what kind of data goes in it. In section 4.4 the steps that have to be taken before compiling the interpreter are expanded upon – here a profile is used to construct a superinstruction set, laying out the requirements for such a construction algorithm. With this superinstruction set, the interpreter source code is generated, ready to be compiled. At runtime, covered in section 4.5, the available superinstructions need to be substituted into all classes that get loaded by a placement algorithm. The runtime placement algorithm from Gregg et al. [6] is discussed, including what dependencies it has on a superinstruction set construction algorithm. A slightly optimized version of this placement algorithm is discussed, and its limitations are outlined. In section 4.5.3 a further look is taken at how these algorithms could be improved, and what the characteristics would be of the "perfect" placement algorithm, including a design of this algorithm. An attempt at optimizing this runtime placement algorithm is made in section 4.6 – the two have an intrinsic dependency on each other and as such cannot be developed independently. Finally, in section 4.7 a second look is taken at the design goals set for the architecture design. The actual implementation of this design will have to wait for chapter 5, as the implementation of this architecture is only discussed there.

4.2 Architecture and workflow overview

In Figure 4.1 an overview of the superinstruction architecture is shown again, which is an evolution of the diagram from section 3.2 where the workflow from Ertl et al. [4] was shown. This time, various processes have been simplified and some detail has been removed – the process of generating the superinstruction set has been simplified to a single Superinstruction Set Generator component. The Superinstruction Converter – which is the runtime component capable of substituting superinstructions into classes as they get loaded – has also been transformed into a black box. Furthermore, the exact steps taken to go from an Instruction Set Definition to a runnable superinstruction-capable interpreter binary have also been hidden in a single procedure. All this hiding however reveals the constants and variables in the architecture to support goal DG2: algorithm-agnosticism towards both the superinstruction set construction algorithm (Superinstruction Set Generator) and the runtime substitution algorithm (Superinstruction Converter). An interesting aspect of this architecture immediately shows up: the superinstruction converter is already known and available to the superinstruction set generator at compile-time. This is no oversight: the QuickInterp architecture uses the superinstruction converter at compile-time to help guide the superinstruction set generator towards the ideal superinstruction set. However, how this cooperation happens and their exact synergy will be discussed later, in section 4.4.5.
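To make the algorithm-agnosticism and the compile-time availability of the converter concrete, the two variable components could be captured by interfaces along the lines of the sketch below. The names and signatures are illustrative assumptions, not the actual QuickInterp API.

import java.util.List;

// Illustrative interfaces for the two exchangeable components (names are assumptions).
interface SuperinstructionConverter {
    /** Rewrites one method's instruction sequence, substituting superinstructions from the set. */
    List<String> convert(List<String> instructions, List<List<String>> superinstructionSet);
}

interface SuperinstructionSetGenerator {
    /**
     * Builds a superinstruction set from profiling data. The converter is handed to the
     * generator so candidate sets can be evaluated with the exact substitution algorithm
     * that will later run at class load time.
     */
    List<List<String>> generate(Profile profile, SuperinstructionConverter converter, int maxSetSize);
}

// Placeholder for the profile described in section 4.3 (counters per bytecode location).
final class Profile { }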

4.2.1 From profile to runtime Before continuing into more detail regarding the architecture, please consider the workflow for creating an optimized VM for a particular application first. This workflow reveals various aspects (like the details of profiling) that are absent from the architectural overview diagram as shown in the previous section in Figure 4.1.

Figure 4.1: Diagram of the core architecture (the QuickInterp abstract workflow). Ahead of Time: profiling data feeds Instruction Set Generation (the Superinstruction Set Generator), which produces an Instruction Set Definition used, together with the Superinstruction Converter, for Interpreter Generation and Compilation. Class Load Time: bytecode loaded from class files undergoes Class Conversion into a superinstruction representation for the superinstruction-capable interpreter. Runtime: Execution.

1. Application and workflow selection At the very start of attempting to create an optimized VM, one must obtain the application that is to be optimized for, including any and all loadable modules (e.g. OSGI modules or other runtime-loadable libraries). This may seem like an obvious step, but since the VM optimizations will be very closely tied to the application and its code it is vital that all or as much of the code as possible is gathered prior to profiling. It is also important to prepare a realistic profiling run: if the application is a web application, ensure that organic HTTP requests are made by real browsers if possible. It may be tempting to send web requests via a synthetic load generation tool, but that may not trigger the same code paths during profiling as in production, leading to a bad profile.

2. Profiling The profiling step will end up producing a profile of a running application. This profile will contain detailed information about the executed code, allowing later steps to fully examine how execution moved through the code while it was run. The VM must be put into profiling mode by some means (this is an implementation detail discussed in section 5.4) and the application is run on this VM. After interacting with the application for a satisfactory amount of time the application can be closed. The profiling VM will generate a profile file or files (discussed in section 5.4.2).

Paying close attention to Figure 4.1 will reveal another difference: the profile is not shown in that diagram as a product of the Execution process. From Ertl et al. [4] it wasn't quite clear how and where in the workflow the application profile is created, but in the case of QuickInterp there is no reason to expect profiling capabilities from the same runtime image as the superinstruction VM – i.e. using superinstructions is allowed to prevent (effective) profiling. That removes the profile as a product coming out of the Execution process as shown in Figure 4.1, but shouldn't stop the VM from allowing profiling at all.

3. Instruction Set Generation With a profile generated, one must supply this profile to the Superinstruction Set Generation process, together with additional configuration selecting which implementation to use for the compile-time superinstruction construction algorithm and which runtime placement algorithm. Further implementation-level options may be specified, like the maximum size of the superinstruction set or the maximum length of the largest superinstruction. This process ends with an optimized superinstruction set for the next step to use.

4. Building the VM From the optimized superinstruction set it is possible to generate the interpreter. The interpreter handlers are generated, definitions for use within the interpreter are generated and other metadata is written to file to enable the interpreter to use the new superinstructions. Once the interpreter is generated, the VM can be compiled via a makefile or other means. After this process is finished, a binary with the VM is produced which is capable of using superinstructions.

5. Running the VM The VM is run like a normal VM – it is fully compliant and capable of loading any bytecode. At class-load time, the selected runtime superinstruction placement algorithm from step 3 is applied to the loaded classes. The interpreter has the right instruction handlers registered for these superinstructions, ideally enjoying a gain in performance due to the reduction in jumps.

4.3 QuickInterp application profile 4.3.1 Introduction The role of the application profile has been discussed in the previous section (section 4.2) and has been seen in detail in earlier work in section 3.2. The profile for QuickInterp is important, as it doesn't just reveal often-executed sequences of bytecode, it also enables evaluating a particular superinstruction set statically. Given a particular superinstruction set and a profile, the profile can help make an educated guess regarding the expected performance of this superinstruction set by examining the control flow as it is serialized in the profile. The ability to do a static evaluation of the performance of a particular superinstruction set is important in determining the optimal superinstruction set (design goal DG1) and is one of the key design elements of QuickInterp. The design of the superinstruction set construction algorithm will use this static evaluation capability in a later section (section 4.4.3).

4.3.2 Data in the profile Efficient profiling Designing the profile to hold the right data to enable static evaluation appears simple. The most trivial approach is simply to write every executed opcode to disk with one file per JVM thread. This would hold all information regarding how control moved through the program in a way that is directly usable to a superinstruction construction algorithm by finding often-occurring patterns in

these files. Furthermore, these files would also enable the static evaluation of a particular runtime superinstruction placement algorithm by letting it perform instruction placement on these files and examining the reduction in total instructions. However, early testing has revealed that the number of executed instructions is prohibitive: it very quickly runs into the tens or hundreds of billions of instructions for even small and short-lived programs, resulting in a profile size ranging from tens to hundreds of gigabytes. Not only would performing efficient static evaluation on such a large profile require extensive preprocessing to a more efficient intermediate format to prevent the runtime superinstruction placement algorithm from having to ingest the whole profile, hardware limitations would also put an upper bound on the maximum program execution time that can be profiled. As such, a smarter way of profiling is required. Applications themselves are not tens to hundreds of gigabytes: Java applications rarely exceed 100 megabytes. As such, the size of the code itself is not the problem, it's just the frequency with which code is executed that prevents us from writing every opcode to disk. This already hints at the solution we ended up implementing: instead of writing all the opcodes to disk, count how often sequences of instructions are executed with special counters and write the values of those counters to disk when the program exits. However, now a new problem emerges: what to count, and where to place these counters?

Superinstruction limitations One important observation is that there are some limitations to what can be included in a superinstruction. A superinstruction can rarely encompass an entire method – it is not a JIT compiler-with-extra-steps. For example, unconditional jump instructions (like a goto) cannot be included in a superinstruction. This is because the target of a goto instruction is the instruction's operand, and this value is not part of the instruction opcode itself and as such not known when concatenating a set of instructions to make a superinstruction. Converting a sequence of bytecode instructions into a superinstruction requires ignoring all the operands, as the operands themselves are the data the instruction handlers read from the bytecode stream. Rather, these operands become the operands of this new superinstruction. For example, the sequence iload-istore might be a superinstruction, but the sequence iload 10-istore 15 is not. These operands – the 10 and 15 in this case – are not lost. When the runtime superinstruction placement algorithm chooses to replace the iload 10-istore 15 sequence of instructions with this superinstruction, it preserves the operands. The placed iload-istore superinstruction has two operands: 10 and 15. The sequence becomes iload-istore 10 15 in the bytecode stream. In the case of the iload-istore superinstruction this role of the operands may seem inconsequential, and it is for most instructions. However, some instructions modify control flow, like the goto instruction. Even if such a jump instruction jumps to another location within the same piece of bytecode, the jump cannot be included in the superinstruction.

1 jump target
2 iload_1
3 iload 10
4 iadd
5 istore 10
6 goto line 1    ①
7 iload 10       ②
8 ireturn

Listing 4.1: Sequence of bytecode instructions with a goto

1 iload_1
2 iload
3 iadd
4 istore
5 goto     ③
6 iload    ④
7 ireturn

Listing 4.2: An illegal superinstruction candidate derived from Listing 4.1

In Listing 4.1 a piece of bytecode is shown that cannot be turned into a functional superinstruction as a whole because of such a goto. Listing 4.2 shows what a superinstruction would look like if Listing 4.1 were turned into one using the rules as just discussed: remove all the operands.

In Listing 4.1, the goto at ① jumps back within the same fragment to line 1. goto instructions are unconditional jump instructions, meaning that execution would never reach ② from just the goto. However, we assume that ② is reached by some other means – another jump not shown. This could for example be an exception handler or simply a jump earlier in the program. The goto is also present in the superinstruction in Listing 4.2 at ③. Now the code following this goto at ③ is the problem: due to the goto, execution will not continue within the superinstruction. Instead, the goto will require that the interpreter dispatch to some other instruction. This instruction is decided by the operand of the goto, which is not part of the superinstruction. As such, it isn't known at superinstruction set construction time, and this goto will force leaving the superinstruction control flow. Furthermore, the instructions at ④ – which are still part of the superinstruction – are completely unreachable. The interpreter cannot jump somewhere halfway into a superinstruction, meaning there is no value in including them in the superinstruction. Techniques exist which can help make it possible to contain the jump within the superinstruction, like instruction specialization where an instruction opcode with common operands is turned into its own dedicated opcode (this optimization is discussed in section 3.3.2). Such techniques could potentially be very interesting, as whole loops with their branches in the input bytecode could be completely compiled to machine code. However, these techniques are not considered in the QuickInterp architecture. For the dead code in the superinstruction the solution is simpler: cut the superinstruction in two between ③ and ④, starting a new superinstruction after the goto. This is the required treatment for all unconditional jump instructions. Not just goto is classified as an unconditional jump – the jsr and ret instructions also perform an unconditional jump (in the case of ret even with an unknown jump target). All instructions which cause the termination of the method (return, ireturn, lreturn, areturn, dreturn and freturn) have the same classification, although for the purpose of profiling they are easier to deal with – control leaves the method instead of carrying on somewhere else. Finally, two surprising instructions can be considered “unconditional jumps” – the tableswitch and lookupswitch instructions. These are used to implement Java switch statements (Java 14 switch expressions are compiled differently). Both of these instructions take a number of jump targets as their operands, representing the case statements from the original Java code plus a default branch. This default branch is always present in the bytecode, even if the switch statement was compiled without it. Because of the presence of this default jump, the tableswitch and lookupswitch end up always branching to somewhere. This also makes them unconditional jumps for the purpose of superinstructions. Instructions which invoke a method (the invoke family of bytecode instructions, e.g. invokevirtual) or could otherwise cause the execution of other JVM bytecode (e.g. putstatic, which could trigger class initialization) may also not be eligible for joining a superinstruction. This is because the execution of other JVM bytecode would require storing the current state of the interpreter onto the JVM call stack and starting to execute the other method.
If this happens in the middle of a superinstruction, the interpreter would somehow have to save that it’s in the middle of such a superinstruction to be able to carry on with executing the superinstruction when the invoke returns. These problems are not impossible to overcome, but for the design of the QuickInterp profile such invoke statements are considered to be impossible to transform to a superinstruction.

Runtime counters With the knowledge that invoke instructions, putstatic, getstatic and unconditional jump instructions like goto, jsr, ret and so on cannot be turned into superinstructions (implementation details may add more un-“superinstructionable” instructions like monitorenter – see appendix A.2), one might reason that in order to serialize all the information regarding the control flow into a profile, all that is needed is to save the number of executions of every “block” of consecutive, superinstructionable instructions that appears. By carefully inserting execution counters at the right places in the bytecode, one would have all the data in the profile. This reasoning however is not correct, for two reasons:

1. This requires a new execution counter at the target of a jump location, losing information about the original location from which that jump originated.

2. It is possible to embed a conditional jump inside a superinstruction, which requires access to how often a conditional jump actually ended up jumping: important information for superinstruction set construction.

A conditional jump is, well, conditional. And when that condition is not met, the jump will not happen and execution will continue with the next instruction. In case a conditional jump is part of a superinstruction, this would leave control within the superinstruction. This is why it is possible to embed a jump within a superinstruction. As such, the QuickInterp profile needs two different kinds of counters:

Local Execution Counters These are counters which, for a given bytecode offset (location in the method), count how often that location was executed.

Branch Counters These are counters which count how often a conditional jump instruction ends up jumping. They are stored at the location of the jump instruction and not at its target, enabling full static tracing of the control flow.

A method is started with a Local Execution Counter counting how many times the method was entered. Instructions receive either a local execution counter or a branch counter based on their type. All conditional jumps are taken care of with branch counters, which are also the only instructions having to be tracked with branch counters. The rest of the instructions are profiled with carefully placed local execution counters, with the exact placement depending on their type.

invoke instructions, putstatic and getstatic These instructions in a way behave like conditional jumps towards an exception handler. However, since an invoke cannot be part of a superinstruction in QuickInterp, there is no use in detecting how often an invoke jumps towards an exception handler. As such it is safe to treat them like unconditional jumps: exception handlers receive a local execution counter to count their executions, and the instructions themselves are immediately succeeded by another local execution counter.

Exception-generating instructions The same goes for instructions which can throw VM-internal exceptions. A monitorenter may for instance throw an IllegalMonitorStateException or a NullPointerException; likewise a new instruction may throw an OutOfMemoryError. These act very similarly to invoke instructions, and as such are profiled in the same way.

Unconditional jump instructions and return instructions For the unconditional jump instructions (goto, jsr, ret, athrow, tableswitch and lookupswitch) it does not make sense to count how often their branches are taken, as they are always taken. These unconditional jumps, together with return instructions (return, ireturn, lreturn, areturn, dreturn and freturn), have an “implicit” local execution counter immediately following them, always counting 0 executions. It is known that execution will never continue after these instructions, so this information need not be serialized in a profile. However, for unconditional jump instructions which keep execution within the method, a local execution counter placed at the target location of the unconditional jump instruction (e.g. the exception handler, subroutine start or every branch of a lookupswitch) needs to be available in the profile.
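The placement rules above can be summarized by a small classification, sketched below for illustration (the opcode sets are abbreviated and the names are not the QuickInterp implementation): conditional jumps get a branch counter, instructions that may transfer control to other bytecode are followed by a local execution counter, and unconditional jumps and returns need no counter of their own (only their targets do).

import java.util.Set;

// Sketch of the counter-placement rules (an illustration; opcode lists are not exhaustive).
enum CounterKind { BRANCH_COUNTER, LOCAL_COUNTER_AFTER, NONE_IMPLICIT_ZERO, NONE }

final class ProfilingPlacement {
    private static final Set<String> CONDITIONAL_JUMPS = Set.of(
            "ifeq", "ifne", "iflt", "ifge", "ifgt", "ifle",
            "if_icmpeq", "if_icmpne", "if_icmplt", "if_icmpge", "if_icmpgt", "if_icmple",
            "if_acmpeq", "if_acmpne", "ifnull", "ifnonnull");
    private static final Set<String> UNCONDITIONAL_OR_RETURN = Set.of(
            "goto", "jsr", "ret", "athrow", "tableswitch", "lookupswitch",
            "return", "ireturn", "lreturn", "freturn", "dreturn", "areturn");
    private static final Set<String> MAY_EXECUTE_OTHER_BYTECODE = Set.of(
            "invokevirtual", "invokestatic", "invokespecial", "invokeinterface", "invokedynamic",
            "putstatic", "getstatic", "new", "monitorenter");

    // Jump targets and exception handler entries additionally receive a local execution
    // counter of their own; that placement depends on the method's control flow rather
    // than on the opcode alone, and is therefore not shown here.
    static CounterKind counterFor(String opcode) {
        if (CONDITIONAL_JUMPS.contains(opcode)) return CounterKind.BRANCH_COUNTER;
        if (UNCONDITIONAL_OR_RETURN.contains(opcode)) return CounterKind.NONE_IMPLICIT_ZERO;
        if (MAY_EXECUTE_OTHER_BYTECODE.contains(opcode)) return CounterKind.LOCAL_COUNTER_AFTER;
        return CounterKind.NONE;
    }
}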

The profile now contains counters for all classes loaded during the profiling run. When the VM shuts down (or at some other implementation-dependent moment) the measured counters can be written to a profiling file together with the location at which they were measured in the bytecode. This “sparse” profile plus the original bytecode allows the construction of the full control flow graph with correct execution counts. This bytecode must also be provided in some implementation-dependent way. The implementation details of the profile are discussed in section 5.4.

4.4 QuickInterp compile time 4.4.1 Introduction The QuickInterp superinstruction architecture, as seen in section 4.2, takes an agnostic approach to the superinstruction set construction algorithm and the runtime substitution algorithm (design goal DG3). On this agnostic architecture, various superinstruction set construction algorithms can be implemented. The main goal of a superinstruction set construction algorithm is to deliver a candidate superinstruction set. This candidate can then be evaluated by the runtime substitution algorithm on the profiled bytecode. Variations in the superinstruction set can be made, which in turn can be evaluated against the profile, hopefully approaching the ideal superinstruction set.

4.4.2 Instruction selection Superinstruction candidates from the profile The very first step that needs to be taken by a superinstruction set construction algorithm is the construction of a set of superinstruction candidates. We want to have a set of all superinstruction candidates from which the superinstruction set construction algorithm can make a selection. In the previous section (section 4.3.2) it was already discussed that not all instructions can be transformed into a superinstruction. This limitation cuts the bytecode of a method into “blocks” of instructions where each block of consecutive instructions could be turned into a superinstruction. We call these superinstruction candidates the base superinstruction candidates. This distinction is important as the base superinstruction candidates serve a dual role: they’re not just used as superinstruction candidates, but they’re also used to test the efficacy of a particular superinstruction set. However, how base superinstruction candidates are used for evaluating a superinstruction set is discussed in the next section (section 4.4.3). First we discuss how the (base) superinstruction candidates are made from the profiled code.

1 iload_1
2 iload 10
3 iadd
4 istore 10

Listing 4.3: Sequence of bytecode instructions as found in a profile where the instructions are superinstructionable

1 iload_1
2 iload
3 iadd
4 istore

Listing 4.4: A superinstruction candidate derived from Listing 4.3

In Listing 4.3 a block of consecutive instructions from the profile is shown. To convert this into a base candidate superinstruction it is enough to remove the instruction operands, which is shown in Listing 4.4. Instruction operands cannot be part of a superinstruction as a superinstruction is simply a concatenation of instruction handlers. It may seem that these blocks would all make suitable superinstruction candidates, but jumps within the bytecode complicate things. Conditional jump instructions may leave such a block before having reached the end, and jump instructions from other places in the method may enter the block halfway through. Since superinstructions cannot be entered halfway through (this would require dispatching within the superinstruction), this last case requires its own candidate superinstruction. That is, the code following the jump target would have to be treated as its own superinstruction candidate.

1 iload_1
2 jump target
3 iconst_3
4 irem
5 dup
6 dup
7 istore_1
8 iflt line 2

Listing 4.5: Code fragment with a jump target

From this block of superinstructionable instructions, at least two candidates would have to be generated in order to cover the two cases: one where this block is entered from the top (candidate shown in Listing 4.6), and one where it is entered via the jump (candidate shown in Listing 4.7).

1 iload_1
2 iconst_3
3 irem
4 dup
5 dup
6 istore_1
7 iflt

Listing 4.6: First superinstruction candidate from Listing 4.5 including the instructions before the jump target

1 iconst_3
2 irem
3 dup
4 dup
5 istore_1
6 iflt

Listing 4.7: Second superinstruction candidate from Listing 4.5 with only the instructions after the jump target

Deriving multiple superinstruction candidates from a single block of superinstructionable instructions may appear to be caused by the presence of the iflt at line 8 in Listing 4.5, but this iflt is only added for illustrative purposes, so that the jump target at line 2 makes sense. The jump target at line 2 can be thought of as a "virtual" instruction: it is not present in the bytecode at that location, but it has its own special semantics for deriving candidate superinstructions from the code block. There are other reasons why a jump target may be present, e.g. because it is the start of an exception handler, because it is one of the branches of a switch statement (especially common when fall-through case blocks are used) or because it is the start of a Java subroutine. The number of candidate superinstructions derived from a single block of superinstructionable code is the number of incoming jumps plus one (the entry to the block could be considered as one "jump").
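For illustration, generating the base candidates of one block could be sketched as follows (a simplification under assumed names; operands are assumed to be stripped already): one candidate per entry point, where the entry points are the top of the block plus every incoming jump target.

import java.util.*;

// Sketch of base-candidate generation for one superinstructionable block (an illustration).
final class BaseCandidates {

    /** opcodes: instructions of the block, operands already removed;
     *  jumpTargets: indices within the block that some jump can land on. */
    static List<List<String>> candidates(List<String> opcodes, Set<Integer> jumpTargets) {
        SortedSet<Integer> entryPoints = new TreeSet<>(jumpTargets);
        entryPoints.add(0); // entering the block from the top counts as one "jump"
        List<List<String>> result = new ArrayList<>();
        for (int entry : entryPoints) {
            result.add(List.copyOf(opcodes.subList(entry, opcodes.size())));
        }
        return result;
    }

    public static void main(String[] args) {
        // Listing 4.5 without operands: the jump target sits before iconst_3 (index 1).
        List<String> block = List.of("iload_1", "iconst_3", "irem", "dup", "dup", "istore_1", "iflt");
        System.out.println(candidates(block, Set.of(1)));
        // prints the candidates of Listing 4.6 and Listing 4.7
    }
}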

Base superinstruction candidates The base superinstruction candidates seen previously that are directly derived from the profile are special. Their classification is important as they are also used to perform static evaluation, which will be elaborated on in the next section. Additional profiling information, like execution counters and jump counters, is retained in these base superinstruction candidates as it is needed for static evaluation. This profiling information is standalone: it contains jump counters, local execution counters and an additional counter counting the number of times the base superinstruction candidate was entered as per the profile. Take for example Listing 4.5 again, and assume it was entered 300 times from the top. Let's also assume that the jump target at line 2 was entered 100 times. This would associate an entry counter of 300 with the candidate superinstruction in Listing 4.6 and an entry counter of 100 with the candidate superinstruction in Listing 4.7. If the iflt has a jump counter of 40 (it jumped 40 times), the base superinstruction candidates divide this value up based on their entry counters: 40 · 300/(300 + 100) for Listing 4.6 and 40 · 100/(300 + 100) for Listing 4.7. Another way to think about this is that the probability that the jump is performed

is computed based on the total number of executions (300 + 100) and multiplied in each base superinstruction candidate by the number of executions of that candidate. When constructing a superinstruction from the (base) superinstruction candidate, the profiling information from any conditional jump within the superinstruction is discarded. This means this information is not available at runtime, as no improvement is possible by making this information available to the runtime substitution algorithm – there is no additional cost in substituting in a superinstruction that rarely gets executed entirely but was compiled into the VM. Future improvements are possible by making richer profiling information available to the runtime superinstruction placement algorithm, which could be used for speculative optimization. However for a runtime substitution algorithm to make effective use of a profile for speculative optimization, considerable improvements are needed to the superinstruction architecture as designed here. This is discussed as future work in section 7.2.3.
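As a small illustration of this proportional attribution (a hypothetical helper, not QuickInterp code): with the numbers from the example above, the jump counter of 40 is split into 30 for the candidate of Listing 4.6 and 10 for the candidate of Listing 4.7.

// Sketch of proportionally attributing a shared jump counter to base candidates.
final class JumpCounterAttribution {
    static double share(long jumpCount, long candidateEntries, long totalEntries) {
        return jumpCount * ((double) candidateEntries / totalEntries);
    }

    public static void main(String[] args) {
        System.out.println(share(40, 300, 400)); // Listing 4.6 candidate: 30.0
        System.out.println(share(40, 100, 400)); // Listing 4.7 candidate: 10.0
    }
}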

Other superinstruction candidates Further processing may be done to increase or change the number of superinstruction candidates. For example, common subsequences may be found within the candidate superinstructions to create more candidate superinstructions. Another example is the generation of prefix superinstruction candidates: if a, b, c, d is a base superinstruction candidate, generating the prefixes would add a, b and a, b, c to the list of candidate superinstructions. Deriving superinstruction candidates is critical: consider the two blocks of JVM bytecode shown in Listing 4.8 and Listing 4.9.

1 iload 10      load x
2 iload 11      load y
3 imul
4 iload 12      load z
5 iload 12      load z
6 iload 11      load y
7 iadd
8 iadd
9 imul
10 iconst_2     load literal 2
11 irem

Listing 4.8: Sequence of bytecode instructions computing (x · y · (z + z + y)) mod 2, where mod is the modulo operator

1 iconst_3      load 3
2 iload 4       load a
3 iload 7       load b
4 iload 8       load c
5 iadd
6 iadd
7 imul
8 iconst_2      load literal 2
9 isub

Listing 4.9: Another sequence of bytecode instructions computing (3 · (a + b + c)) − 2

With a good profile covering all code paths that the application would cover at runtime, such preprocessing of the superinstruction candidates may seem unnecessary – the number of instruction dispatches saved increases with the size of the superinstruction, so pursuing the longest possible superinstructions would yield the most performance improvement. However, the size of the superinstruction set is finite, with its size determined by some input parameter for the superinstruction set construction algorithm. As such, it may not be possible to convert all base superinstruction candidates to a superinstruction. In Listing 4.8 and Listing 4.9 such an example is shown – assume both bytecode blocks are executed N times. Assume that the superinstruction set construction algorithm can only allocate one superinstruction to cover both cases. Picking all of Listing 4.8 for the superinstruction would yield a speedup score(L4.8), where score(Lx) is measured in the number of saved instruction dispatches in Listing x. score(L4.8) is computed by taking the number of instructions in Listing 4.8 minus one, as one dispatch is still needed for entering the candidate. This number is then multiplied by the number of executions as per the profile, yielding score(L4.8) = N · (11 − 1) = 10N. Likewise, the speedup score(L4.9) = N · (9 − 1) = 8N. Limiting the superinstruction candidates to only the base superinstruction candidates would limit the maximum speedup to score(L4.8, L4.9) = max(8N, 10N) = 10N by taking Listing 4.8. However,

when examining the expressions or even the bytecode in the two listings, it is revealed how much they share – the whole sequence from line 4 to line 10 in Listing 4.8 is also present in Listing 4.9 from line 2 to line 8 (inclusive). If these seven instructions were turned into a superinstruction candidate c, its score would be score(c) = 2N · (7 − 1) = 12N. This is an improvement over the previous best of 10N, highlighting the importance of trying subsequences of the base superinstruction candidates. This way of computing score(c) – simply multiplying its base score of 7 − 1 = 6 by the number of possible executions 2N as per the profile – only works when it is the only superinstruction candidate and when the chosen runtime substitution algorithm is capable of making these substitutions (not all of them are, as we will see). If Listing 4.8 was already available as a superinstruction on its own, it would be preferred over the substitution of superinstruction c, lowering the score of c to score(c) = 6N. Superinstructions can have very complex exclusion relations with each other, which makes the superinstruction placement NP-hard [15]. The topic of how the superinstruction set is optimized will be discussed in section 4.4.5. The specific way of generating new superinstruction candidates from the base superinstruction candidates depends on the exact superinstruction set construction algorithm and the runtime substitution algorithm, as not all runtime substitution algorithms are capable of making the same substitutions. For example, particular naïve algorithms may fail to substitute any superinstruction whose prefix is not also a superinstruction in the superinstruction set (this is the case for the algorithm by Gregg et al. [6]), which would require preprocessing to make sure all the prefixes are available as superinstruction candidates. These algorithms are discussed in conjunction with their runtime substitution algorithms due to their close relation.
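For illustration, expanding the candidate set with prefixes and scoring a candidate by saved dispatches could be sketched as follows (assumed names, simplified; real candidate expansion may also extract arbitrary common subsequences):

import java.util.*;

// Sketch of candidate expansion and scoring (an illustration, simplified).
final class CandidateExpansion {

    /** Adds every proper prefix of length >= 2 of each base candidate, for substitution
     *  algorithms that can only place a superinstruction when its prefixes also exist. */
    static Set<List<String>> withPrefixes(Set<List<String>> baseCandidates) {
        Set<List<String>> all = new HashSet<>(baseCandidates);
        for (List<String> candidate : baseCandidates) {
            for (int len = 2; len < candidate.size(); len++) {
                all.add(List.copyOf(candidate.subList(0, len)));
            }
        }
        return all;
    }

    /** Dispatches saved by substituting a candidate of the given length, executed this often:
     *  one dispatch per covered instruction, minus the one dispatch still needed to enter it. */
    static long savedDispatches(int length, long executions) {
        return executions * (length - 1);
    }
}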

4.4.3 Static evaluation

Section 4.3.2 discussed the data in the profile, and that this data is needed to enable static evaluation of a particular combination of superinstruction set and superinstruction placement algorithm. For this task, the base superinstruction candidates are used in combination with the actual runtime superinstruction placement algorithm. From a candidate superinstruction set, a superinstruction set is generated. This process allocates opcodes and assigns names to the candidate superinstructions to enable their placement into a program.

Definitions

Let us first consider a table of definitions that lays the foundation and explains what data the algorithm operates on.

B (Set) – Set of all base JVM instructions as defined in the JVM specification, 11th edition [8], e.g. istore or nop. Does not include superinstructions.
b ∈ B (Element) – An instruction from B.
S (Set) – Set of all superinstructions active in the VM. This set is disjoint with B, that is, no instruction is both a base JVM instruction and a superinstruction.
X ∈ S, s ∈ S (Ordered set) – A superinstruction from S. Superinstructions are themselves ordered sets of instructions from B, as superinstructions are composites of base JVM instructions. Sometimes written as s ∈ S to indicate that it is an element of the larger set of superinstructions, sometimes written as X ∈ S to remind the reader that the superinstruction is in and of itself also a set.
|s|, |X| (Integer) – Number of base JVM instructions that make up the superinstruction X. For example, the superinstruction X = {istore, dup} has |X| = 2 as it consists of two elements.

I, J (Ordered set) – I and J typically refer to a list of instructions as seen in a program, and in that sense they are equal to X but used in a different context. J is typically the output of a substitution algorithm, and depending on the context may also include elements from S – that is, it may include superinstructions besides base JVM instructions.
P (Set) – Set which represents the profile. Elements in P are sequences of bytecode which could be converted to a superinstruction as they were found by the runtime profile. These sequences are called the base superinstruction candidates. One key property of such a base superinstruction candidate is that it retains profiling information regarding the number of executions of every instruction within the candidate.
p ∈ P (Ordered set) – An element of P is a base superinstruction candidate, and it is also an ordered set consisting of base JVM instructions (elements of B). p may be substituted for Ip, utilizing the fact that p is just a sequence of instructions and can be used in any place where a sequence of instructions I is expected.

Evaluation process

Armed with a superinstruction set, a profile (which is a set of base superinstruction candidates computed as per the previous section) and a runtime substitution algorithm, the static evaluation can begin. Let P be the profile, S the superinstruction set, and B the JVM base instruction set as per the JVM specification [8]. s ∈ S are the superinstructions in S, p ∈ P are the base superinstruction candidates (with profiling information) and b ∈ B are the base instructions. The regular instruction set and superinstruction set are disjoint, that is S ∩ B = ∅, and the interpreter would be compiled with the instructions S ∪ B. Static evaluation however does not require compilation of an interpreter – that is the whole point. Let R be the runtime substitution algorithm, where R(Ip, S) → Jp and Ip and Jp are sequences of bytecode instructions (either a method or a subsequence of a method). Ip is equivalent to Jp, that is, using Jp instead of Ip retains the semantics of Ip as defined by the Java Virtual Machine Instruction Set [8]. Ip does not contain superinstructions (∀i ∈ Ip : i ∈ B) but Jp may (∀j ∈ Jp : j ∈ B ∪ S). The set P is the set of base superinstruction candidates – this is where their special status comes in, as discussed in the previous section. The base superinstruction candidates are all the blocks of code that can be converted to a superinstruction, with their incoming branches taken care of just like any superinstruction candidate. But base superinstruction candidates are also equipped with profiling data as they are immediately derived from the profile, which the other superinstruction candidates do not have. A base superinstruction candidate p is a sequence of (i, n) tuples, where i ∈ B and n is the number of times that instruction was executed per the profile. It is required that p is a compatible substitute for Ip – that is, the runtime placement algorithm R must be able to operate both on a plain list of instructions and on a list of instructions with profiling information attached. The output Jp must retain this information if it was provided as input. This substitution relation (substituting p ↦ Ip) can be implemented using object-oriented programming and a common base type in a way that requires minimal case-dependent code in the implementation of the runtime substitution algorithm. Static evaluation computes the total score totalScore(P, S) = Σp∈P score(p, R(p, S)), where score(p, Jp) computes the number of dispatches saved by executing Jp over p ↦ Ip (p is substituted for Ip here when using static evaluation). The algorithm to compute the number of dispatches needed implements a very simple instruction visitor, which is shown in Listing 4.10.

score(p, Jp):                                  Pseudocode
    return dispatches(p) - dispatches(Jp)

dispatches(L):                                 Counts the number of instruction dispatches
    total := 0
    for instr ∈ L:
        c_instr := executionCount(L, instr)    Fetches the number of executions according to the profile
        total := total + c_instr
    return total

Listing 4.10: score(p, Jp) static evaluation algorithm

The score algorithm is a formalization of the algorithm seen in the previous section. It is quite simple – it computes the number of instruction dispatches saved by substituting in superinstructions. It does so by first computing the number of dispatches in the original code sequence and then subtracting the number of dispatches in the superinstruction version, both of which are computed with the dispatches function. The dispatches function implements a very simple algorithm: it loops over the bytecode without taking any branch, simply until every instruction has been visited. It assumes that the profiling information has been updated correctly if the runtime substitution algorithm R made modifications to the dataflow graph. If R produces an unreachable instruction, the runtime substitution algorithm must set the profiling counter of that instruction to zero.
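As an illustration of how the static evaluation could be wired together, the sketch below expresses the R(Ip, S) → Jp abstraction and the totalScore computation in Java. The type and method names (Instr, SubstitutionAlgorithm, StaticEvaluator) are hypothetical; the only assumption taken from the text is that each instruction carries its execution count from the profile.

import java.util.List;
import java.util.Set;

// Hypothetical minimal types; QuickInterp's real classes differ.
record Instr(String opcode, long executionCount) {}

interface SubstitutionAlgorithm {
    // R(I, S) -> J: rewrite a sequence using the given superinstruction set,
    // preserving the profiling counts attached to the instructions.
    List<Instr> substitute(List<Instr> input, Set<String> superinstructions);
}

final class StaticEvaluator {
    // dispatches(L): sum of per-instruction execution counts from the profile.
    static long dispatches(List<Instr> code) {
        return code.stream().mapToLong(Instr::executionCount).sum();
    }

    // totalScore(P, S) = sum over p in P of ( dispatches(p) - dispatches(R(p, S)) )
    static long totalScore(List<List<Instr>> profile, Set<String> set, SubstitutionAlgorithm r) {
        long total = 0;
        for (List<Instr> p : profile) {
            total += dispatches(p) - dispatches(r.substitute(p, set));
        }
        return total;
    }
}

The interface mirrors the requirement that R can operate on instruction lists with profiling information attached; a concrete substitution algorithm only has to preserve the counts when rewriting.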

Theoretical maximum

This algorithm also provides another tool: it allows computing the theoretical maximum score for a given profile. This score is computed in a straightforward way: assume that every base superinstruction candidate becomes a superinstruction in its entirety. This reduces the number of dispatches for every single candidate to just one dispatch per execution; multiplying this by the number of times each candidate was entered produces the minimal number of dispatches. By also computing the total number of dispatches without any superinstructions, the theoretical maximum score can be computed.

theoreticalMaximum(P):                         Pseudocode
    return Σp∈P ( dispatches(p) - executionCount(p) )

Listing 4.11: theoreticalMaximum(P ) static evaluation algorithm computing the maximum score

In Listing 4.11 this algorithm can be seen in its formal form. executionCount(p) computes the number of times that base superinstruction candidate was entered according to the profile (the number of times its first instruction was executed).

4.4.4 Superinstruction set construction

Proebsting [15] pointed out that determining the optimal superinstruction set is NP-hard. This makes it unlikely for any algorithm to find the absolute optimal superinstruction set within a reasonable amount of time (well, not without proving P = NP in the process). The complexity is due to the complex relation between particular superinstruction candidates as seen in section 4.4.2 – considering an additional superinstruction candidate may affect the individual score of all previously considered superinstruction candidates. This prevents the use of a greedy algorithm, where a superinstruction set is built up iteratively by always adding the next highest scoring superinstruction candidate. Earlier work did not necessarily deploy particularly sophisticated algorithms, as discussed in section 3.2.5. QuickInterp aims to change that. The superinstruction set construction algorithm has access to a static evaluation algorithm as seen earlier, and this algorithm can be used to iteratively improve the superinstruction set. The capabilities of the Superinstruction Set

Generator – as seen in Figure 4.1 – are twofold: first, it has to be able to generate an initial superinstruction set candidate from a list of candidate superinstructions. Then, it has to be able to mutate this superinstruction set candidate to derive a new superinstruction set candidate. In the Instruction Set Generation process, variations on the current best superinstruction set are evaluated.

Figure 4.2: Diagram of the Instruction Set Generation process as seen in Figure 4.1 (components shown: Profiling data, Compute Base Superinstruction Candidates, Preprocessor, Superinstruction Candidates, Superinstruction Set Generator with Seeder and Mutator, Bytecode Conversion with the Superinstruction Converter, Static Evaluation, and the resulting Instruction Set Definition)

Consider the diagram in Figure 4.2. This diagram is an elaboration of the Instruction Set Generation process as seen earlier in Figure 4.1 and shows many of the steps already discussed. The Superinstruction Set Generator, the Superinstruction Converter and the Preprocessor each have multiple implementations, which will be discussed later. These pluggable implementations allow various algorithms to be implemented and tuned. The process of generating an instruction set is as follows (a sketch of this loop in code is given below the list):

1. The Profiling Data is processed by the Compute Base Superinstruction Candidates step, which produces the Base Superinstruction Candidates.
2. The Preprocessor is used to generate more candidates, producing the Superinstruction Candidates. As discussed earlier, the role of the preprocessor is to increase the number of superinstruction candidates, for example by taking the prefix of each base superinstruction candidate. The preprocessor is not required to do anything – it may simply return the base superinstruction candidates without any processing.
3. The Superinstruction Candidates are fed into the Seeder part of the Superinstruction Set Generator. The seeder is capable of generating an initial superinstruction set definition. This process is part of the superinstruction set generator implementation. The generated Instruction Set Definition need not be a subset of the superinstruction candidates – the generator is allowed to add new superinstructions, modify instructions or perform any operation on them, allowing very powerful superinstruction set generators that have some awareness of the relation between superinstructions.
4. The Superinstruction Set Definition, as generated by the Superinstruction Set Generator, is fed into the Bytecode Conversion process, where the base superinstruction candidates are processed as bytecode-to-be-converted by the superinstruction converter with the provided superinstruction set definition.
5. The output of the Bytecode Conversion is provided to a static evaluation process, where it is scored according to the score(p, Jp) algorithm shown earlier in Listing 4.10, from which a decision is made.
6. If this score is considered adequate, the instruction set generation process terminates with the current instruction set definition. Otherwise, it returns the instruction set definition to the Mutator component of the Superinstruction Set Generator.
7. The Mutator component makes some changes to the superinstruction set. These changes may be random, although the way they are made is implementation-dependent. After the mutator, the process resumes at step 4.

The decision to continue or not must be made in a way that guarantees termination. An example of an implementation would be to not look at the score, but instead look at the time spent: if the time spent exceeds some threshold, the current best is returned. Figure 4.2 only covers the rough outline of the mutate-score-evaluate loop; it will be discussed in more detail in section 4.4.5.
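The following Java sketch shows how the seven steps above could be wired together. All interface names, the list-of-opcodes representation of a superinstruction, and the simple multiplicative heat decay are illustrative assumptions; QuickInterp's actual components and its time-based heat schedule are described in section 4.4.5.

import java.util.List;
import java.util.Set;

// Hypothetical interfaces mirroring the components in Figure 4.2.
interface Preprocessor { Set<List<String>> expand(Set<List<String>> baseCandidates); }
interface Seeder       { Set<List<String>> seed(Set<List<String>> candidates); }
interface Mutator      { Set<List<String>> mutate(Set<List<String>> current, double heat); }
interface Converter    { long evaluate(Set<List<String>> instructionSet); } // bytecode conversion + static evaluation

final class InstructionSetGeneration {
    static Set<List<String>> generate(Set<List<String>> baseCandidates,
                                      Preprocessor pre, Seeder seeder,
                                      Mutator mutator, Converter eval,
                                      long budgetMillis) {
        Set<List<String>> candidates = pre.expand(baseCandidates);   // step 2
        Set<List<String>> best = seeder.seed(candidates);            // step 3
        long bestScore = eval.evaluate(best);                        // steps 4-5
        long deadline = System.currentTimeMillis() + budgetMillis;
        double heat = 0.5;
        while (System.currentTimeMillis() < deadline) {              // step 6 (time-based termination)
            Set<List<String>> next = mutator.mutate(best, heat);     // step 7
            long score = eval.evaluate(next);                        // steps 4-5 again
            if (score > bestScore) { best = next; bestScore = score; }
            heat = Math.max(0.01, heat * 0.99);                      // gradually fine-tune
        }
        return best;
    }
}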

Superinstruction Set Generator

In Figure 4.2, the Superinstruction Set Generator component may have the most surprising structure. It has two sub-processes, the Seeder and the Mutator, which are closely related. The process description from earlier gives a decent description of its jobs: generate the initial superinstruction set, and modify a given superinstruction set. The process of optimizing the superinstruction set is a stochastic one. The superinstruction set generator does not have enough information to make a very good superinstruction set in the seed process – the superinstruction candidates it receives from the Preprocessor may not even possess usable profiling information. As such, all but the most trivial implementations opt to build an instruction set by randomly selecting candidates. Further mutation of a superinstruction set definition is also based on chance: a random number of superinstructions are changed. The amount of change is guided: the static evaluation process provides a number (between 0 and 1) that indicates how much change should be made, where 1 indicates a complete rewrite of the superinstruction set candidate and 0 means as little change as possible. The mutate process should never leave its input unchanged. How this "change" varies throughout the run of the evaluation loop is discussed in section 4.4.5.

Some further postprocessing may be performed on the superinstruction candidates – the reimplementation of the Gregg et al. [6] runtime substitution algorithm requires the prefix of every superinstruction to be available in the superinstruction set. Even though a preprocessor can make these prefixes available, relying solely on the preprocessor without making the superinstruction set generator aware of the need for the prefix would be inefficient: such a generator would end up producing many superinstruction set definitions that lack prefixes for some of their superinstructions. These definitions would score lower, as the runtime substitution algorithm is unable to use the prefixless superinstructions, and the generator would have to rely on the optimization process to correct this – not an efficient approach. Instead, the superinstruction set generator itself should add prefixes for all superinstructions it tries to add. In other words: when there are additional constraints on combinations of superinstructions, these constraints should be dealt with by the superinstruction set generator. The role of the preprocessor is only to increase the material it can work with in a pluggable way, so that different superinstruction set generators can be coupled to different preprocessors.

Due to the use of randomness, running the same optimization process twice may yield different results, although the exact source of entropy is an implementation detail of the superinstruction set generator. The theoretical maximum can be used to see if a particular run indeed did poorly; however, the theoretical maximum is not necessarily achievable. Instead, it provides an upper bound on the score of a particular superinstruction set.

Creating superinstruction candidates

The role of the preprocessor was already briefly covered: it receives the base superinstruction candidates as input – these are derived from the bytecode directly – and provides a full list of all superinstruction candidates. The preprocessor is the architectural component introduced to overcome the problem seen earlier in Listing 4.3 and Listing 4.4 – the instruction set may not be large enough to allow allocating all base superinstruction candidates. As such, parts of particular base superinstruction candidates may be more suitable, which the preprocessor can make available. The superinstruction set construction algorithm, when making a candidate superinstruction set, is not obliged to only pick superinstruction candidates provided by the preprocessor. It makes more sense that, when superinstruction X requires f(X) to be available for a particular runtime substitution algorithm, the superinstruction set construction algorithm itself generates f(X) and ensures its availability. If the superinstruction set construction algorithm were to only add X and fail to add f(X), the runtime substitution algorithm would not be able to use X and the score would be lower, so eventually the optimization loop would reject X unless f(X) is available. But this would take costly iterations of the optimization algorithm that need not be spent if the superinstruction set construction algorithm were aware of this relation. Likewise, assuming X ≡ X′, the runtime substitution algorithm may be able to recognize X and X′ as equivalent, and as such adding X′ to a superinstruction set already containing X will not improve its score. This is an example that can be solved both with a preprocessor and with a smarter superinstruction set construction algorithm: the preprocessor could remove either X or X′ from its list of candidates, and the superinstruction set construction algorithm could recognize the equivalence and not add one when the other is available.

The design decision to have two places where new superinstruction candidates can be constructed – (1) in the superinstruction set construction algorithm and (2) in the dedicated preprocessor component – is to increase the architecture's modularity. Design goal DS2 demands an architecture on which multiple superinstruction construction algorithms and multiple runtime substitution algorithms can be implemented and tested. Separating the preprocessor from the superinstruction construction algorithm allows common, algorithm-agnostic preprocessing steps to be usable across all superinstruction construction algorithms. For example, one preprocessor may take all the subsequences of the base superinstruction candidates. Such a preprocessor may be useful for any superinstruction construction algorithm. Other, more specific operations, like ensuring that f(X) is available when superinstruction X is present in the candidate superinstruction set, are more suitable to be implemented by the superinstruction set construction algorithm itself.
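As an example of such an algorithm-agnostic preprocessing step, the sketch below enumerates every contiguous subsequence (of length at least two) of each base superinstruction candidate. The class name and the plain list-of-opcodes representation are assumptions made for illustration; a real preprocessor would also have to carry the profiling information along.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of a preprocessor that adds every contiguous subsequence of length >= 2
// of each base candidate as an extra superinstruction candidate.
final class SubsequencePreprocessor {
    static Set<List<String>> expand(Set<List<String>> baseCandidates) {
        Set<List<String>> out = new HashSet<>(baseCandidates);
        for (List<String> candidate : baseCandidates) {
            for (int from = 0; from < candidate.size(); from++) {
                for (int to = from + 2; to <= candidate.size(); to++) {
                    out.add(List.copyOf(candidate.subList(from, to)));
                }
            }
        }
        return out;
    }
}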

4.4.5 Superinstruction generation evaluation loop

Figure 4.2 already showed the instruction set generator's mutate-score-evaluate loop. However, besides showing the general data flow in this workflow, it fails to show the specific details of this algorithm. The evaluation loop is a genetic algorithm in a very straightforward implementation; it is shown in pseudocode in Listing 4.12.

optimize(P, S):                                Pseudocode
    b := seed(S)                               b is the running best
    h := 0.5                                   h is the mutation rate ∈ [0, 1]
    while continue(b):                         e.g. start + 1 minute < now
        N := { mutate(b, h) : i ∈ {1 ... k} }  N contains k mutations of b
        for n ∈ N:
            if score(n) > score(b):
                b := n                         Found a new best → use it
        h := nexth(h)                          Let h approach 0, fine-tuning b, e.g. nexth(h) = h^1.01
    return b

Listing 4.12: optimize(P,S) genetic optimization algorithm

This algorithm chooses per iteration not one variation of the current best b, but k mutations. This enables the algorithm to work concurrently – on multiple CPU cores or even on multiple machines, variations of one particular superinstruction set can be evaluated. This in turn increases the hardware utilization of the algorithm and reduces the wall clock time it needs to find a good solution.

Another variable shows up: h, for heat. This is a parameter in the genetic algorithm controlling how much variation every mutation step is to introduce. With an h of 0.5, 50% of the instruction set definition is to be kept when mutating. As h approaches 0, fewer and fewer changes should be made, such that with an h of zero only one superinstruction is to be changed (the mutate(b, h) function should never return its input). The goal is to prevent the algorithm from getting stuck in a local optimum. It may find itself with a superinstruction set where three base superinstruction candidates have been promoted to real superinstructions, and be unable to allocate new superinstructions. However, these three base superinstructions fail to effectively optimize all profiled code, and if the algorithm were instead to select three smaller superinstructions that independently do not score very well, it may generate a higher-scoring superinstruction set. These situations are expected to be somewhat common due to the complex impact individual superinstructions have on the total score. As such it is important, when the optimization process starts, to start out by making big changes to the current best superinstruction set candidate to help it depart such a local optimum. As the optimization process progresses, the rate of change can be lowered to fine-tune the candidate it ended up finding. This is reflected in the value of h lowering towards 0, which is done by the nexth(h) function. A simple implementation of this function would be nexth(h) = h^1.01 or some other exponent.

In QuickInterp, nexth(h) does not derive its value from the current value of h, but from the time left. In this implementation the decision to continue is solely based on time. Let t_end be the end time of the optimization loop after which continue(b) returns false, let t be the current time and let the start time of the algorithm be t_start, with t_start ≤ t ≤ t_end. nexth(h) is then defined as nexth(h) = 0.5 · ((t_end − t) / (t_end − t_start))^3. Time is mapped to a value between 1 and 0, starting at 1 and linearly approaching 0 as t approaches t_end. This value is raised to the power three to spend more time at lower values, giving the algorithm more time to fine-tune. The decision process has been moved to the continue(b) function, whose implementation decides how long this iterative process continues. In QuickInterp, continue(b) is simply implemented as continue(b) = t < t_end. An earlier exit could be made if the continue(b) function assesses that, for example, score(b) has sufficiently approached the theoretical maximum; however, this is not done in QuickInterp.

The exact interpretation of this heat parameter depends on the implementation of the superinstruction set construction algorithm, as this algorithm provides the implementation of the mutate(b, h) algorithm. However, in general a heat of h ∈ [0, 1] should cause a change of roughly h · 100%. That is, an h of 0.3 should prompt a change of 30%.
If it is more efficient to implement an algorithm which has some variance in the amount of change it makes, this is acceptable. There is also a minimum change – the implementation of mutate(b, h) must never return the input b as-is; at least one change should always be made. In QuickInterp no extensive testing has been done with varying implementations of the continue(b) and nexth(h) functions, but the chosen implementations have been shown to converge quickly (within seconds) on small programs with k = 256. The runtime performance of the superinstruction set construction algorithm might be improved further, but there is no design goal requiring swift construction of a superinstruction set. Instead, as long as an optimal set is found within reasonable time, this design is in agreement with the design goals.
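The time-based continue(b) and nexth(h) implementations described above can be summarized in a few lines of Java. The class and method names are hypothetical; only the formula nexth(h) = 0.5 · ((t_end − t)/(t_end − t_start))^3 and the criterion continue(b) = t < t_end are taken from the text.

// Minimal sketch of the time-based heat schedule and termination check.
final class TimeBasedSchedule {
    private final long tStart;
    private final long tEnd;

    TimeBasedSchedule(long startMillis, long durationMillis) {
        this.tStart = startMillis;
        this.tEnd = startMillis + durationMillis;
    }

    boolean shouldContinue() {                 // continue(b) = t < t_end
        return System.currentTimeMillis() < tEnd;
    }

    double nextHeat() {                        // nexth(h) = 0.5 * ((t_end - t) / (t_end - t_start))^3
        double remaining = (double) (tEnd - System.currentTimeMillis()) / (tEnd - tStart);
        remaining = Math.max(0.0, Math.min(1.0, remaining));
        return 0.5 * Math.pow(remaining, 3);
    }
}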

4.4.6 Handling superinstruction operands

Most of the techniques used for generating the actual superinstruction handlers are considered implementation details and will be discussed in chapter 5. However, the way superinstructions obtain their instruction operands requires some special attention.

1 iload 10      load x                         JVM Bytecode
2 bipush 5      load literal 5
3 imul
4 istore 10     store

Listing 4.13: Sequence of bytecode instructions executing x := x · 5

Consider Listing 4.13. This example shows four bytecode instructions, three of which have instruction operands. Instructions read their operands from the bytecode stream – instruction operands are not part of the instruction opcode itself. As such, when a superinstruction is created from this code fragment, the superinstruction has to retrieve these operands from somewhere in the bytecode stream as well. The superinstruction architecture is one where the amount of hand-written change to interpreter source code is kept at a minimum. As such, superinstructions are created by the direct concatenation of multiple instruction handlers – for example the iload, bipush, imul and istore handlers from Listing 4.13. These handlers however expect their instruction operands to be readable from the bytecode stream – the bipush handler may be implemented as seen in Listing 4.14.

1 uint8_t* pc;                   C code
2 int32_t* topOfStack;           Operand stack slots are int-sized
3 ...
4 case bipush: {
5     *topOfStack = (int8_t) pc[1];
6     topOfStack++;              Increase stack
7     pc += 2;                   Progress the program counter by two bytes
8 }

Listing 4.14: bipush handler

bipush takes one byte as instruction operand, which it pushes onto the operand stack. The bipush handler reads the instruction operand on line 5 with pc[1]. This operand is stored in the bytecode stream immediately after the opcode (which is at pc[0], or *pc), which is why the pc variable is used. It then writes this value to the topOfStack location, increasing the stack pointer and the program counter. The program counter is incremented by two because, besides the one-byte operand, the opcode itself is also one byte. After concatenating instruction handlers like the one for bipush, they will still read their instruction operands relative to the pc program counter variable. Furthermore, the handlers will also increment the program counter with not just the size of their operands, but also with the size of their own opcodes. If the handlers from Listing 4.13 are simply concatenated to form one new superinstruction handler, this handler would not be able to read its instruction operands if they immediately followed the superinstruction opcode in the bytecode stream. Instead, the space where the instruction opcodes of all but the first instruction used to be has to be preserved, as the handlers will increment the program counter to skip over the locations of the bipush, imul and istore opcodes. This could be solved by modifying the instruction handlers, and such a compact encoding scheme was considered for the design of QuickInterp. However, there is a serious advantage in leaving these "spaces" in the bytecode stream – they can be filled with the original opcodes. Assume the existence of a new superinstruction super1, which implements all instructions from Listing 4.13.

1 super1 10     load x, load 5, multiply and store         JVM Bytecode
2 bipush 5      load literal 5
3 imul
4 istore 10     store

Listing 4.15: Listing 4.13 with a superinstruction

What can be seen in Listing 4.15 is that the opcode of the iload is replaced with super1 – as expected. However, the other instructions are still present, still with their instruction operands. Because super1 is the most simple concatenation of these handlers, it will read their operands: the bipush handler compiled into the super1 instruction will actually read the instruction operand of the bipush instruction at line 2. But this is not all: if a jump is performed from somewhere else (not shown) to the bipush at line 2, the interpreter will happily and correctly execute the bipush instruction. Finally, after executing all the handlers within the super1 superinstruction, the program counter has been incremented by 7, as each handler has done its own increment of the program counter while control was never returned to the interpreter. This means that when it jumps back to the interpreter's dispatch loop, it will execute the instruction after the istore instruction. This behavior is ideal: not only do jump targets keep working without any special consideration, the otherwise-unreachable code is also skipped over in the bytecode stream. Even though QuickInterp is designed from the ground up around more intensive code inspection, including the construction of control-flow graphs detailing where jumps go, this coincidental "feature" caused by the most naive way of compiling superinstructions is kept. It allows for the most faithful reimplementation of earlier work, and additionally causes minimal harm, as the only downside is that the bytecode is not compacted in memory.

4.5 QuickInterp runtime

4.5.1 Introduction

Runtime is the phase where instruction substitution happens. In this chapter we discuss various runtime substitution algorithms, including one which performs optimally (with some limitations to the domain). QuickInterp's approach to performing superinstruction substitution is closely linked to how, at compile time, the base superinstruction candidates are constructed as discussed in section 4.4.2. After all, static evaluation is performed by running the substitution algorithm on the base superinstruction candidates. As such, the first step in processing the bytecode of a method is to cut it into "blocks" of consecutive bytecode instructions, similarly to how the base superinstruction candidates are created. Unconditional jump instructions and other instructions which cannot be part of a superinstruction are used as delimiters, and blocks with a length ≤ 1 are skipped. Within these blocks, substitution is performed. Incoming jumps are dealt with in an implementation-dependent way, but most often they are simply not considered. By keeping the original instructions in place, they continue to work for free as discussed in section 4.4.6.
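A minimal sketch of this block-cutting step is given below, assuming instructions are represented by their opcode mnemonics and that the set of delimiter opcodes (unconditional jumps and other instructions that cannot be part of a superinstruction) is supplied by the caller. The names are illustrative and not QuickInterp's actual API.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Cut a method's instruction list into blocks of consecutive instructions,
// using delimiter opcodes as boundaries. Blocks of length <= 1 are dropped,
// mirroring the description above.
final class BlockSplitter {
    static List<List<String>> split(List<String> method, Set<String> delimiters) {
        List<List<String>> blocks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String opcode : method) {
            if (delimiters.contains(opcode)) {
                if (current.size() > 1) blocks.add(current);
                current = new ArrayList<>();
            } else {
                current.add(opcode);
            }
        }
        if (current.size() > 1) blocks.add(current);
        return blocks;
    }
}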

4.5.2 Basic runtime superinstruction placement

Definitions

Before continuing on with the somewhat abstract definitions of each substitution algorithm, let us first consider a table of definitions that can be reused across the substitution algorithms to define each algorithm more formally. This table reuses some of the definitions seen earlier in section 4.4.3.

B (Set) – Set of all base JVM instructions as defined in the JVM specification, 11th edition [8], e.g. istore or nop. Does not include superinstructions.
b ∈ B (Element) – An instruction from B.
S (Set) – Set of all superinstructions active in the VM. This set is disjoint with B, that is, no instruction is both a base JVM instruction and a superinstruction.

X ∈ S, s ∈ S (Ordered set) – A superinstruction from S. Superinstructions are themselves ordered sets of instructions from B, as superinstructions are composites of base JVM instructions. Sometimes written as s ∈ S to indicate that it is an element of the larger set of superinstructions, sometimes written as X ∈ S to remind the reader that the superinstruction is in and of itself also a set.
|s|, |X| (Integer) – Number of base JVM instructions that make up the superinstruction X. For example, the superinstruction X = {istore, dup} has |X| = 2 as it consists of two elements.
T (Object) – Tree of superinstructions used by the tree-based runtime substitution algorithm.
I, J (Ordered set) – I and J typically refer to a list of instructions as seen in a program, and in that sense they are equal to X but used in a different context. J is typically the output of a substitution algorithm, and depending on the context may also include elements from S – that is, it may include superinstructions besides base JVM instructions.

Triplet-based substitution

Triplet-based substitution is the name of our reimplementation of an existing mechanism for performing superinstruction substitution from Ertl et al. [4]. The authors use a straightforward substitution mechanism they call peephole optimization, a common name in compiler construction literature for examining a small window of statements or instructions and substituting in more optimized equivalent instructions. In the case of their implementation, the window always has a size of just two consecutive instructions, but with some clever superinstruction set constraints it is possible to substitute in longer superinstructions. To enable runtime substitution, their construction algorithm generates a triplet (i, j, x) for each superinstruction, with i ∈ B ∪ S, j ∈ B and x ∈ S, where S is the set of all superinstructions and B is the set of standard JVM instructions.

uint8_t TABLE[4][3] = {                        C code
    { new,    dup,    super1 },                new-dup
    { bipush, iadd,   super2 },                bipush-iadd
    { super2, istore, super3 },                bipush-iadd-istore
    { super2, iload,  super4 },                bipush-iadd-iload
};

Listing 4.16: Example of the triplet table used for triplet-based substitution, with four superinstructions defined

In Listing 4.16 four such triplets are shown in the triplet table. For readability, Listing 4.16 shows instruction labels instead of instruction opcodes, but for this strategy to work effectively the labels would be replaced with opcodes. The first two triplets define superinstructions of just two instructions. The first superinstruction, new-dup, is coupled to the label super1. The second superinstruction, merging bipush and iadd, gets to be called super2. This superinstruction is referenced in the next line: the bipush-iadd-istore superinstruction triplet composes the previous superinstruction super2 and adds the istore instruction to make a new superinstruction super3. The bipush-iadd-iload superinstruction also makes use of the definition of its prefix by referring to super2, showing how longer superinstructions can be defined while only using triplets. The runtime placement algorithm always looks at two consecutive instructions a and b, attempting to find a triplet (a, b, x) for some x in the triplet table. If this triplet is found, a is replaced with x and b is discarded as it is implemented by x. The process is repeated after reading the

next instruction, now with the just-placed superinstruction x and the newly read instruction c. Triplets may exist in the table where the first opcode in the triplet is itself a superinstruction opcode, which is also shown in Listing 4.16. Via this mechanism, longer superinstructions are possible as the algorithm bases a new superinstruction on an existing superinstruction and some extra instruction. If no triplet is found, the window shifts by one instruction. For example, if no triplet (a, b, x) for some x is found, the algorithm reads the next instruction c and repeats the process, now trying to find a triplet (b, c, x) for some x.

To construct a superinstruction set, their compile-time superinstruction set construction algorithm operates very similarly. A variation of the runtime placement algorithm is executed on the profiled bytecode, marking two instructions a, b for inclusion as a superinstruction when their number of executions sits above a particular threshold. If this action is performed, the next instruction c is read, and the just-formed superinstruction together with c is evaluated for inclusion in the superinstruction set. The decision to include a superinstruction is solely based on the number of executions in the profile, and no optimization loop or other mechanism is used to detect the complex effects adding new superinstructions can have on the performance of existing superinstructions, as discussed in section 4.4.2. Let X be a sequence of bytecode instructions that is a prefix of another sequence of bytecode instructions Y, and let count(A) be the number of times the sequence A was executed. If X is a prefix of Y, then count(X) ≥ count(Y): after all, for every execution of Y, X also has to be executed. As such, if the decision to include a particular sequence of instructions as a superinstruction is made by only looking at the number of executions, then whenever Y is a superinstruction, its prefix X must also have been promoted to a superinstruction as it has at least as many executions. This in turn causes their superinstruction set construction algorithm to always produce superinstruction sets where, for any superinstruction Y ∈ S spanning more than two instructions (|Y| ≥ 3), its prefix X is also a member of S.

Their superinstruction set construction algorithm is not reimplemented as part of QuickInterp. This mechanism is instead covered by a Superinstruction Set Converter aware of this side effect – that the prefix of a superinstruction always needs to be available. Besides this, the QuickInterp optimization loop using static evaluation is also used for this reimplementation, even though Ertl et al. [4] do not indicate using such a sophisticated mechanism for optimizing the superinstruction set, instead opting to just promote sequences of profiled bytecode to superinstructions based on their execution counts. It should be noted that Ertl et al. created their implementation in 2001 and were limited by the hardware of their time, having at most 512MB of RAM available on their most powerful test platform. While elegant and both time- and memory-efficient, this runtime substitution algorithm requires the prefix of every superinstruction to be present in the superinstruction set. The reimplementation keeps this limitation to establish a baseline and see what improvements better algorithms can bring, while using QuickInterp's more advanced superinstruction set construction algorithm to allow a comparison based purely on the runtime substitution algorithm.
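The core of the triplet-based peephole pass can be sketched as follows. The table is modeled as a map from an opcode pair to a superinstruction opcode, and opcodes are represented as strings for readability; these are simplifying assumptions, since the real implementation works on numeric opcodes in the bytecode stream and, as discussed in section 4.4.6, leaves the original instructions in place rather than producing a shortened instruction list.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the triplet-based peephole substitution, after Ertl et al. [4]:
// each pair (previous instruction, next instruction) is looked up in the table,
// and a hit replaces the previous instruction with the superinstruction.
final class TripletSubstitution {
    static List<String> substitute(List<String> block, Map<List<String>, String> triplets) {
        List<String> out = new ArrayList<>();
        for (String next : block) {
            if (!out.isEmpty()) {
                String prev = out.get(out.size() - 1);
                String superOp = triplets.get(List.of(prev, next));
                if (superOp != null) {
                    out.set(out.size() - 1, superOp); // merge prev and next into the superinstruction
                    continue;                         // next is now implemented by superOp
                }
            }
            out.add(next);
        }
        return out;
    }
}

Because the just-placed superinstruction becomes the left element of the next pair that is looked up, longer superinstructions are formed exactly when their prefix is also in the table.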

Tree-based substitution

Tree-based substitution is a simple improvement over triplet-based substitution, removing the requirement that all superinstructions X ∈ S with |X| ≥ 3 must have their prefix Y ∈ S. The algorithm constructs a tree of the superinstruction set. Every node in the tree carries an opcode o, optionally a superinstruction opcode s, and up to |B| child nodes, where B is the set of base JVM instructions. The root of the tree also has up to |B| child nodes, but no opcode o or superinstruction opcode s. An example of such a tree is shown in Listing 4.17 and Figure 4.3. Listing 4.17 defines three superinstructions; Figure 4.3 shows the tree derived from these superinstructions. In the tree, all nodes have exactly one opcode b ∈ B (B is the JVM base instruction set). Some nodes additionally have a superinstruction opcode s ∈ S associated with them (S is the superinstruction set with B ∩ S = ∅), but not all. Intermediate nodes without an s exist for superinstructions whose prefix is not itself a superinstruction, allowing superinstructions to be placed even when no prefix superinstruction exists.

super1: bipush iadd                            Superinstructions
super2: iload iload iadd istore
super3: iload iload

Listing 4.17: Example of three superinstruction definitions

Figure 4.3: Listing 4.17 represented as a tree (root → bipush → iadd with s = super1; root → iload → iload with s = super3 → iadd → istore with s = super2)

treeReplace(I, S):                             Pseudocode
    T := makeTree(S)                 Convert the superinstruction set to a tree as in Figure 4.3
    is := first(I)                   Start of the superinstruction window
    ie := first(I)                   End of the superinstruction window
    il := null                       Last position (is ≤ il ≤ ie) where {is, ..., il} is a valid superinstruction
    o := null                        The opcode of the {is, ..., il} superinstruction
    c := cursor(T)                   Cursor in the tree, starts at the root

    while is ≠ end(I):               True as long as is points to a valid instruction
        if hasSuperinstruction(c):   Test if the node at c has a superinstruction (s)
            il := ie
            o := opcode(c)           Read the superinstruction opcode s from the cursor
        c := moveCursor(c, opcode(ie))
        if ie = end(I) ∨ ¬isValid(c): The tree has no matches or the end has been reached
            if o ≠ null:             Has there been a valid superinstruction?
                replace({is, ..., il}, o)    Place the best superinstruction that was found
                is := next(il)       Restart the window at the end of the superinstruction
                ie := next(il)
            else:
                is := next(is)       Restart the window at the next instruction
                ie := is
            il := null
            o := null
            c := cursor(T)           Create a new cursor starting at the root
        else:
            ie := next(ie)           Increase the window

Listing 4.18: R(I,S) tree-based runtime substitution algorithm

The algorithm is shown in pseudocode in Listing 4.18. The placement algorithm creates a cursor c into the tree – a pointer to the current node in the tree. This cursor starts at the root, and can be advanced by giving it a new opcode (shown using the moveCursor function). It will then move to the child node with that opcode. Looking at the tree in Figure 4.3, it can be seen that the cursor can be in three states: (1) a valid node with superinstruction opcode s, (2) a valid node without a superinstruction opcode, and (3) an invalid node. The third case is triggered when the cursor is advanced to a child node that does not exist, e.g. the cursor points to the bipush node in Figure 4.3 and moveCursor is called with istore as opcode. This node has no child with that opcode, so the cursor moves to the invalid state. The algorithm keeps a window of instructions, with is tracking the start of the window and ie tracking the end. This window is considered as a superinstruction by consulting the cursor. If the cursor ends up at a valid node (either state 1 or 2), the window is increased by moving the end ie to the next instruction, after which the process is repeated. Whenever a node with a superinstruction is encountered (state 1), this superinstruction opcode and the location of this instruction are saved in o and il respectively. This is the largest superinstruction found so far from the start of

the window. When the cursor indicates it has reached an invalid node (state 3), or the end of the input program I has been reached, no superinstruction can be found by increasing the window size. As such, the search starting from is is terminated. If o has been set, o is the largest superinstruction starting at is that can be substituted in, stretching from is to il. This superinstruction is then substituted into the program.
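The makeTree(S) step used by this algorithm can be sketched as a small trie, where each node maps a base opcode to a child node and optionally carries a superinstruction opcode. The class below is an illustrative assumption, not the actual QuickInterp data structure.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Trie of superinstructions: each node maps a base opcode to a child, and the
// node at the end of a superinstruction's opcode sequence carries its opcode.
final class TrieNode {
    final Map<String, TrieNode> children = new HashMap<>();
    String superOpcode; // null when the path from the root is not a superinstruction

    // makeTree(S): insert every superinstruction (opcode -> sequence of base opcodes).
    static TrieNode makeTree(Map<String, List<String>> superinstructions) {
        TrieNode root = new TrieNode();
        superinstructions.forEach((superOp, opcodes) -> {
            TrieNode node = root;
            for (String opcode : opcodes) {
                node = node.children.computeIfAbsent(opcode, k -> new TrieNode());
            }
            node.superOpcode = superOp;
        });
        return root;
    }
}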

4.5.3 Instruction placement using shortest path

The tree-based approach is still not perfect: a very short superinstruction spanning just two instructions may exist and be substituted into the program, but by performing that placement, a much longer superinstruction starting just one instruction later can no longer be placed, hurting performance. Superinstructions cannot be picked eagerly, as taking a superinstruction may prevent the allocation of a more optimal superinstruction. One can treat the input program as a directed graph where every instruction is an edge. For this example we will not consider incoming jumps in the input program; instead, assume jumps are dealt with in some other way. Outgoing jumps are treated as normal instructions, which will be discussed later. For every instruction, a node is created. Furthermore, a special exit node is created. For every instruction, an edge is added from the instruction's node to the node of the next instruction as it was in the input bytecode program, and this edge receives the instruction opcode as label. The last instruction does not have a next instruction, so instead this edge goes from the instruction's node to the exit node. Conditional jump instructions (instructions where the normal control flow may not continue, which also includes any instruction that can throw an exception) are not treated differently, as for the purpose of a superinstruction these are allowed to jump out of the superinstruction. Finally, all available superinstructions are added to this graph as edges that skip over the regular instruction edges to which they are equivalent.

1  iload 10     load x                         JVM Bytecode
2  iload 11     load y
3  imul
4  iload 12     load z
5  iload 12     load z
6  iload 11     load y
7  iadd
8  iadd
9  imul
10 iconst_2     load literal 2
11 irem

Listing 4.19: Short bytecode program (repeat of Listing 4.8)

super1: iload iload                            Superinstructions
super2: iload iload imul
super3: iload iload iadd iadd imul
super4: iadd imul iconst_2

Listing 4.20: A few superinstruction definitions

Figure 4.4: Listing 4.19 represented as a graph, with the instructions on the edges and the superinstructions of Listing 4.20 added

An example of this transformation when applied to Listing 4.19 is shown in Figure 4.4. Regular instructions are displayed in black on the right, while superinstructions are displayed in blue. Figure 4.4 immediately reveals something interesting: a lot of superinstructions overlap. By using an eager superinstruction placement algorithm like the triplet-based or tree-based substitution

algorithms seen earlier, interesting superinstructions like the large super3 instruction may be missed.

1 super2 10 11     load x·y                    JVM Bytecode
2 super1 12 12     load z and load z
3 iload 11         load y
4 iadd
5 iadd
6 super4           multiple operations
7 irem

Listing 4.21: Listing 4.19 after superinstruction substitution by the tree-based substitution algorithm

1 super2 10 11     load x·y                    JVM Bytecode
2 iload 12
3 super3 12 11     multiple operations
4 iconst_2
5 irem

Listing 4.22: Listing 4.19 after optimal superinstruction substitution

Listing 4.21 shows what the tree-based substitution algorithm would do with Listing 4.19 – it would fail to utilize the super3 superinstruction due to the placement of the super1 superinstruction. The optimal (least dispatches) solution is shown in Listing 4.22, where the temptation of super1 is overcome in order to use the super3 superinstruction. From the title of this subsection it may already be obvious how the result of Listing 4.22 can be obtained: by using a shortest path algorithm. In the graph representation of the program as shown in Figure 4.4, the number of nodes visited directly reflects the number of instruction dispatches required. The goal of using superinstructions is to minimize the number of instruction dispatches, and this is exactly what a shortest path algorithm can offer us when setting the distance of every edge to one. This enables the use of a simple breadth-first search algorithm to find the optimal combination of superinstructions. This placement algorithm shares the superinstruction tree used by the tree-based substitution algorithm. However, breadth-first search is used to decide which node to evaluate next starting from the "start" node, and no actual substitutions are made until breadth-first search first reaches the end node (labeled "end"). Whenever a node is visited, the number of steps required to reach that node is saved in the node, including the (super)instruction used to reach it. This way, once the algorithm terminates, the full path from the end back to the start can be traced back, and the instructions along this path represent the optimized bytecode program with the optimal superinstructions.
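For the jump-free setting discussed so far, the shortest-path substitution can be sketched as a breadth-first search over instruction positions: position i is the node before instruction i, the position one past the last instruction is the exit node, and every plain instruction or applicable superinstruction is an edge of length one. The class below and its representation of superinstruction edges are assumptions for illustration only, not QuickInterp's actual implementation.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Breadth-first search over instruction positions; the first time the exit node
// is reached, the path back to the start uses the fewest possible dispatches.
final class ShortestPathSubstitution {
    record Edge(String opcode, int covered) {} // covered = number of base instructions the edge spans

    static List<String> substitute(List<String> block, Map<Integer, List<Edge>> superEdges) {
        int n = block.size();
        int[] prevNode = new int[n + 1];
        String[] prevOpcode = new String[n + 1];
        boolean[] visited = new boolean[n + 1];
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        visited[0] = true;
        queue.add(0);
        while (!queue.isEmpty()) {
            int at = queue.poll();
            if (at == n) break;                            // exit node reached
            List<Edge> edges = new ArrayList<>();
            edges.add(new Edge(block.get(at), 1));         // the plain instruction at this position
            edges.addAll(superEdges.getOrDefault(at, List.of()));
            for (Edge e : edges) {
                int to = at + e.covered();
                if (to <= n && !visited[to]) {
                    visited[to] = true;
                    prevNode[to] = at;
                    prevOpcode[to] = e.opcode();
                    queue.add(to);
                }
            }
        }
        List<String> result = new ArrayList<>();           // trace the path back from the exit node
        for (int at = n; at != 0; at = prevNode[at]) {
            result.add(prevOpcode[at]);
        }
        Collections.reverse(result);
        return result;
    }
}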

Shortest path with jumps

Jumps were ignored in the example from Listing 4.19 and Figure 4.4. At runtime, no profiling information is available in QuickInterp, meaning that no decision can be made on whether a conditional jump will be taken or not. Therefore, the assumption in the shortest path algorithm is that a conditional jump is not taken. In other words, the superinstruction placement algorithm treats conditional jumps no differently than regular instructions. They may leave the fragment under consideration, but this is dealt with at the jump target. Instructions which can never be part of a superinstruction need not be considered, and similar to the compile-time construction of the base superinstruction candidates (section 4.4.2) this cuts the program to be processed into "blocks" of consecutive bytecode instructions. Jump targets are dealt with quite naively by the shortest-path runtime placement algorithm: incoming jumps are ignored for the purposes of finding the shortest path.

Some further optimization would be possible by taking the frequency of outgoing conditional jumps into consideration. Not all parts of the fragment under examination may see the same number of executions. Conditional jumps may leave the fragment, and incoming jumps may enter the fragment further down. The shortest path algorithm as presented does not take into account that some pieces may be executed more often than others, and that these pieces should therefore gain some kind of priority in receiving superinstructions. Some superinstruction placements conflict with

each other, as seen in Figure 4.4, and the shortest path algorithm looks only at the number of instructions (the length) of the superinstruction to decide the placement. Instead, it would be more optimal to also consider how often each superinstruction would end up getting (fully) executed.

1 int[] scores = {-1, -1, ...};                Java code
2 int MAX_SCORE = 255;
3
4 void set(int id, int score) {
5     if (score > MAX_SCORE) {
6         score = MAX_SCORE;
7     }
8     scores[id] = score;
9 }

Listing 4.23: An example Java method set(int,int) with a conditional jump. The input score is clamped to at most MAX_SCORE.

1  iload_2                                     JVM Bytecode
2  aload_0
3  getfield MAX_SCORE
4  if_icmple line 8                            (1)
5  aload_0
6  getfield MAX_SCORE
7  istore_2
8                                              (2) jump target
9  aload_0
10 getfield scores
11 iload_1
12 iload_2
13 iastore
14 return

Listing 4.24: The set(int,int) method from Listing 4.23 shown as bytecode

Consider the Java method in Listing 4.23. The set(int,int) method allows for saving a "score" into an array, and this score is clamped to at most 255. Entering a higher score will cause the score to be lowered to this maximum with a conditional jump (an if statement). Listing 4.24 shows the bytecode of this method, with (1) marking the if_icmple instruction (if-integer-compare-less-than-or-equal), which performs a conditional jump if the score is ≤ MAX_SCORE. The Java compiler has inverted the condition, jumping over the clamping code on line 6 of the Java program if the score is smaller than or equal to the maximum score. The jump target of the if_icmple instruction is marked with (2) in the bytecode program. Now let us assume that the profile shows that this conditional jump was always performed (the condition score > MAX_SCORE was always false, i.e. no score value exceeded 255). Such a scenario might emerge as a result of defensive programming practices, where a developer writes such clamping code to preserve the integrity of the data even though there is no reason to suspect values larger than 255 are ever provided. The fact that the clamping is never needed means that even if there are very good superinstructions dealing with the code in the if statement, their placement does not provide any performance benefit.

In Listing 4.25 two superinstructions are defined. These two superinstructions, when applied to the bytecode of the set(int,int) method from Listing 4.24, are exclusive. This can be seen in Figure 4.5 – taking super2 can only be done if super1 is not substituted in. What can also be seen in this figure is that the if_icmple conditional instruction is treated exactly like any other instruction. When applying the shortest path algorithm as seen earlier, it would choose super2. This instruction saves four dispatches instead of the meager two saved by substituting super1, so clearly it is better when the runtime profile is not considered. However, knowing that the if_icmple instruction – which is part of super2 – always jumps changes things, as this severely degrades the performance improvement of super2. Instead of saving four dispatches, super2 is never executed entirely: the if_icmple handler compiled into the superinstruction jumps out of the superinstruction, skipping the remaining handlers. In other words, it ends up saving almost no instruction dispatches, making the substitution hardly better than not using superinstructions at all. For a runtime substitution algorithm to recognize this situation and pick super1 instead of super2, profiling information needs to be available at runtime. This is not implemented in QuickInterp and remains future work. However, considering the information available to the runtime substitution algorithm with the current design, the shortest path algorithm finds an optimal solution.

super1:                                        Superinstructions
    iload_2
    aload_0
    getfield
super2:
    getfield
    if_icmple
    aload_0
    getfield
    istore_2

Listing 4.25: A few superinstruction definitions

Figure 4.5: Listing 4.24 represented as a graph, with the instructions on the edges and the superinstructions of Listing 4.25 added

super3:                                        Superinstructions
    getfield
    istore_2
    aload_0
    getfield
    iload_1
    iload_2
super4:
    aload_0
    getfield
    iload_1
    iload_2

Listing 4.26: A few superinstruction definitions

Figure 4.6: Listing 4.24 represented as a graph, with the instructions on the edges and the superinstructions of Listing 4.26 added

While performing optimal superinstruction placement gets tricky when conditional jump instructions are involved, dealing with jump targets is simpler. The shortest path algorithm takes a very naive approach by ignoring jump targets while running the shortest path search. However, this does not affect the performance of the substitutions, as we will see. Listing 4.26 defines two new superinstructions, which can once again be applied to the bytecode of the set(int,int) method from Listing 4.24. This yields Figure 4.6. For clarity, the jump target node of the if_icmple instruction has been annotated with "jump target" and is shown in red. It may appear at a glance that the two superinstructions – super3 and super4 – are once again exclusive. Substituting in super3 would, under the assumption that the if_icmple instruction never jumps, completely go over super4, and as such super4 would never end up getting executed. The shortest path algorithm would naively choose the super3 instruction as it is longer, and when the jump instruction does end up jumping to the jump target node (in red) it would be unable to use the super4 superinstruction as this was not substituted in. However, the substitution algorithm can substitute both these

instructions in – placing super4 is entirely possible due to the way superinstructions read their operands, as discussed in section 4.4.6. If the branch is taken, this instruction can be used. The algorithm thus deals with incoming jumps in a rather simple way: after having decided on a shortest path from the start node to the end node and placing the appropriate superinstructions, it continues by determining the shortest path from every jump target to the end node. If such a path includes any superinstructions that were not part of the main path, they are also substituted in. This continues until all incoming jumps have been examined.
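The driver loop for incoming jumps described above could look roughly as follows; the PathFinder abstraction and the Placement record are hypothetical stand-ins for the actual shortest-path search and its output.

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch: repeat the shortest-path search from every jump target and collect
// any superinstruction placements not already made on the main path.
final class JumpTargetPass {
    interface PathFinder { List<Placement> shortestPath(int fromPosition); }
    record Placement(int position, String superOpcode) {}

    static Set<Placement> substituteAll(PathFinder finder, List<Integer> jumpTargets) {
        Set<Placement> placed = new LinkedHashSet<>(finder.shortestPath(0)); // main path
        for (int target : jumpTargets) {
            placed.addAll(finder.shortestPath(target)); // set semantics: duplicates are kept only once
        }
        return placed;
    }
}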

4.6 Equivalent superinstructions

In the previous section (section 4.5), various runtime substitution algorithms were discussed. In section 4.5.2 the straightforward triplet-based substitution algorithm was presented, which was then improved upon with the introduction of the tree-based substitution algorithm. The tree-based algorithm is an improvement over the triplet-based algorithm because it needs fewer superinstructions within the superinstruction set to make the same substitutions, freeing up superinstructions to cover other sequences of bytecode. In this section we introduce another technique for reducing the number of required superinstructions, using instruction equivalence. If two or more superinstructions are equivalent, there is no need to add all of them to the superinstruction set. To detect and use equivalent superinstructions, an equivalence algorithm is introduced which can determine whether two sequences of bytecode are equivalent. To test equivalence it first constructs two Data Dependency Graphs (DDGs) – one for each input. Then, the algorithm uses graph coloring to test whether the two graphs are isomorphic. Finally, if the two graphs are isomorphic, the two sequences of bytecode are equivalent, allowing the two inputs to be interchanged. This equivalence algorithm is used in two places: at compile time and at runtime. At compile time, the equivalence algorithm is used to remove equivalent superinstructions from the superinstruction set, freeing up space for other instructions. At runtime, an improvement of the shortest-path based substitution algorithm from section 4.5.3 uses it to find superinstructions that are equivalent to subsequences of the input program.

4.6.1 Superinstruction equivalence

Let E(I,J) → {0, 1} be the equivalence function, where I, J are sequences of instruction opcodes with ∀i ∈ I : i ∈ B and ∀j ∈ J : j ∈ B (all instructions fed into the equivalence function are regular base JVM instructions from B, not superinstructions). E maps to 1 to indicate that I is equivalent to J, and maps to 0 otherwise.

1  E(I,J):                                    Pseudocode
2    Ib := makeBlockGraph(I)                  Step 1: create a graph of atomic blocks
3    Ia := computeAttributesInGraph(Ib)       Step 2: compute and assign barrier attributes
4    Ig := makeDataDependencyGraph(Ia)        Step 3: add edges based on barrier attributes
5
6    Jb := makeBlockGraph(J)                  Step 1
7    Ja := computeAttributesInGraph(Jb)       Step 2
8    Jg := makeDataDependencyGraph(Ja)        Step 3
9
10   (Igc, Jgc) := colorGraphs(Ig, Jg)        Standard graph coloring
11   for pi ∈ Igc:                            For all partitions in the colored graph
12     if ¬(∃pj ∈ Jgc : pi = pj)              Find a partition in Jgc that has the same nodes
13       return 0                             If this cannot be found they're not equivalent
14   return 1

Listing 4.27: Equivalence algorithm in pseudocode

Listing 4.27 shows the algorithm in pseudocode. For both inputs a graph of atomic blocks as nodes is created (Ib and Jb). The atomic blocks are sequences of instructions which must always be considered as a whole, and the mechanism behind this will be discussed in section 4.6.2.

In the computeAttributesInGraph(...) step, the data dependencies of each block are analyzed and exposed as attributes: a canonicalization of the data dependencies of the block. The block graph, now enriched with these attributes, is made available as Ia and Ja. We will discuss these attributes in section 4.6.2 and their use in tracking data dependencies. With the canonicalized data dependencies available, the final step for each graph is to connect blocks to earlier blocks in the program based on data dependencies, which is done in makeDataDependencyGraph(...). This creates Ig and Jg, which are the data dependency graphs of I and J respectively. The data dependency graphs must be isomorphic to each other in order for the inputs to be considered equivalent by the algorithm. Standard graph coloring is used in the colorGraphs(Ig, Jg) step to partition the nodes (the atomic blocks). If all these partitions are equal, the graphs are isomorphic and thus equivalent. E(I,J) effectively takes two ordered sequences of bytecode instructions (opcodes) and determines their equivalence. It can be applied to a sequence of input program code under consideration (I) and a superinstruction s (taking J to be the opcode sequence of s), as the superinstruction is an ordered sequence of bytecode instructions by itself. However, the function E(I,J) can also be applied to two superinstruction candidates to determine equivalence during instruction set generation. We create a preprocessor component as seen in section 4.4.4 to remove equivalent superinstruction candidates during superinstruction set construction, reducing the number of iterations needed to approach the optimal superinstruction set.
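The top-level flow of this check can be sketched in Java as follows. The Graph and DdgBuilder types are hypothetical placeholders for the block-graph and coloring machinery described above, not the actual QuickInterp classes:

import java.util.*;

// Sketch of E(I, J): build both DDGs, color them, and compare the partitions.
// Partitions are represented as sets of canonical node labels so that partitions
// from the two graphs can be compared directly.
class EquivalenceCheck {
    static boolean equivalent(int[] opcodesI, int[] opcodesJ, DdgBuilder builder) {
        Graph gI = builder.buildDdg(opcodesI);   // steps 1-3 for I
        Graph gJ = builder.buildDdg(opcodesJ);   // steps 1-3 for J
        Map<Integer, Set<String>> partitionsI = builder.colorPartitions(gI);
        Map<Integer, Set<String>> partitionsJ = builder.colorPartitions(gJ);
        // Equivalent iff every color class in one graph also occurs in the other.
        return new HashSet<>(partitionsI.values()).equals(new HashSet<>(partitionsJ.values()));
    }
}

interface Graph {}
interface DdgBuilder {
    Graph buildDdg(int[] opcodes);                          // makeBlockGraph + attributes + DDG edges
    Map<Integer, Set<String>> colorPartitions(Graph g);     // color -> canonical node labels
}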

Example

To gain an initial understanding of what equivalence can look like, and to lay the groundwork for understanding how a data dependency graph can be constructed, let us consider two equivalent sequences of bytecode and examine how and why they are equivalent. Equivalence becomes more prevalent as superinstructions get longer, as the number of permutations which yield correct and equivalent code increases; however, even these small sequences can already show the property.

1 iconst_4    loads literal 4
2 istore_1    stores into slot 1
3 iconst_5    loads literal 5
4 istore_2    stores into slot 2

Listing 4.28: Code compiled from int x = 4; int y = 5;

1 iconst_5    loads literal 5
2 istore_2    stores into slot 2
3 iconst_4    loads literal 4
4 istore_1    stores into slot 1

Listing 4.29: Code compiled from int y = 5; int x = 4;

In Listing 4.28 and Listing 4.29 two such equivalent bytecode sequences are shown. It is easy to see why when looking at the code they were compiled from: both Listing 4.28 and Listing 4.29 are compiled from the same two statements x = 4; and y = 5;. There are no data dependencies between these statements – that is, they do not in any way refer to the same data. This lack of data dependencies is also visible in the bytecode: the x = 4; statement writes using istore_1 into local variable table slot 1, while the y = 5; statement uses istore_2 to write to local variable table slot 2. There is also no data dependency via the operand stack, as it reaches a depth of zero between the two statements (between line 2 and 3 in both Listing 4.28 and Listing 4.29). Note that even though these two sequences may be compiled from literally the same statements in different orders, the name of a variable is not retained in JVM bytecode. In other words, the two statements p = 4; q = 5; could yield the same bytecode as Listing 4.28 if they were assigned to the same local variable table slots by the Java compiler. The types of p and q need not even be ints, e.g. byte p = (byte) 4; short q = (short) 5; would compile to the same code as Listing 4.28. This is because the JVM operand stack represents short, char and byte primitives as int, making their bytecode instructions the same if p or q was any of those types. Finally, initialization is not explicit in JVM bytecode, meaning that x = 4; y = 5; compiles to the same bytecode as int x = 4; int y = 5;.

4.6.2 Discovering data dependencies

While the two code fragments from the previous section showed an example of what equivalence may look like, detailing why the two statements have no data dependencies, there are more places that can cause data dependencies. To create the DDG from the data dependencies, all sources of data dependencies must be traced. On the JVM, such data dependencies are caused by three categories of operations:

1. Local variable table operations. Java stack variables are transformed to indices in the local variable table by the Java compiler, and instructions can read and write from and to these slots.
2. Operand stack operations. Almost all Java instructions read or write to and from the operand stack.
3. Field operations, method calls, array reads, array writes, jumps and exceptions. Reading or writing from or to a field or array also creates a data dependency. Furthermore, reordering may not be possible because an instruction has side effects of any sort. Such a side effect may be a (conditional) jump, where moving the instruction would change the semantics of the program, but also method calls (which can do their own writes to fields) cannot be moved.

These three types each have their own complexities for tracing the data dependencies. The QuickInterp equivalence algorithm only constructs a data dependency graph of local variable table dependencies, simplifying the algorithm. However, in order to do this correctly dependencies via the operand stack still must be dealt with somehow. The basic idea is to group dependent stack operations together into an atomic block. As discussed in section 4.6.1, these blocks are always moved as a whole, and permit reordering based solely on local variable table dependencies. First, we’ll look at what causes local variable table dependencies and what implication each dependency has. Then, the operand stack is discussed and how based on the use of the operand stack atomic blocks can be created. Finally, the third category of instructions is briefly discussed: the field operations, method calls and other instructions which cannot be moved due to their side effects.

Local variable table dependencies

In JVM bytecode, the first four local variable table indices have dedicated store instructions – e.g. for integers there are istore_0, istore_1, istore_2 and istore_3. To store an integer at a larger local variable table index, the general istore instruction is used. As mentioned before, superinstructions are always just concatenations of the instruction handlers: a superinstruction with e.g. istore 10 cannot exist. Instead, it would be a superinstruction with just istore, reading the "10" from the bytecode stream, which means the 10 is also unavailable for determining data dependencies.

1 iconst_4    loads literal 4
2 istore 20   stores into slot 20 (x)
3 iconst_5    loads literal 5
4 istore 30   stores into slot 30 (y)

Listing 4.30: Code compiled from int x = 4; int y = 5;

1 iconst_5    loads literal 5
2 istore 8    stores into slot 8 (y)
3 iconst_4    loads literal 4
4 istore 9    stores into slot 9 (x)

Listing 4.31: Code compiled from int y = 5; int x = 4;

This has consequences for what can be considered equivalent. Consider Listing 4.30 and Listing 4.31, which show sequences of bytecode instructions once again compiled from the same two statements in a different order. What can be seen here is that the Java compiler assigned higher slot numbers to x and y – slots 20 and 30 in Listing 4.30 and slots 9 and 8 in Listing 4.31 for x and y respectively. At first glance it may appear that these two sequences of bytecode cannot be

equivalent, but remember that, when transformed to a superinstruction, the instruction operands are not included. So the actual local variable table slots are not considered, only the sequence of opcodes. In that case the two listings start to look more similar, but they still may not be considered equivalent.

1 iconst_4
2 istore      stores into any slot
3 iconst_5
4 istore      stores into any slot

Listing 4.32: Superinstruction derived from Listing 4.30

1 iconst_5
2 istore      stores into any slot
3 iconst_4
4 istore      stores into any slot

Listing 4.33: Superinstruction derived from Listing 4.31

In Listing 4.32 and Listing 4.33 two superinstructions derived from the earlier listings can be seen – sequences of bytecode without instruction operands. One crucial observation can be made from these two superinstructions: the istore operations no longer operate on a fixed local variable table slot; instead, each may operate on any local variable table slot. It is indeed possible for the superinstruction in Listing 4.32 to refer to the same local variable table slot twice – storing first a 4 there, then overwriting it with a 5. Such code is not impossible – on the contrary, it is the Java compiler output for the statements int x = 4; x = 5;. While it is a little odd to see two consecutive assignments to the same variable within a Java program, there is nothing illegal about it. With the observation that both istore instructions may be writing to the same slot, it becomes clear that the two superinstructions in Listing 4.32 and Listing 4.33 are not equivalent. Substituting int x = 5; x = 4; for int x = 4; x = 5; changes the semantics of the input program, as the value of x at the end of the two statements is not the same. As such, when a general istore instruction is encountered, it effectively creates a data dependency to every single local variable table slot – it could write to any slot. This is not limited to istore or other store instructions, but also applies to general load instructions and the iinc instruction. One way to think about equivalence is to consider it a reordering. The algorithm QuickInterp uses is limited to reordering, i.e. it does not support equivalence where a different set of instructions reaches the same result. With this limitation in place, two sequences of bytecode are not considered equivalent if they do not have the same number of instructions of the same types. For example, one sequence of bytecode with three dneg instructions will not be considered equivalent to another sequence with just one dneg instruction. This constraint allows a one-to-one mapping from the instructions of the first sequence of bytecode instructions to those of the second sequence.
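Because the algorithm is limited to reordering, a cheap pre-check is possible before any graph is built: the two sequences must contain exactly the same multiset of opcodes. A small sketch of such a check (plain Java, no QuickInterp types assumed):

import java.util.*;

class OpcodeMultisetCheck {
    // Returns false immediately if the two sequences cannot be reorderings of each
    // other, i.e. they differ in the number of occurrences of at least one opcode.
    static boolean sameOpcodeMultiset(int[] a, int[] b) {
        if (a.length != b.length) return false;
        Map<Integer, Integer> counts = new HashMap<>();
        for (int opcode : a) counts.merge(opcode, 1, Integer::sum);
        for (int opcode : b) {
            Integer remaining = counts.get(opcode);
            if (remaining == null || remaining == 0) return false;
            counts.put(opcode, remaining - 1);
        }
        return true;
    }
}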

A visualization of the reordering between the two superinstructions from the earlier listings can be seen in Listing 4.34 and Listing 4.35.

1 iconst_4
2 istore
3 iconst_5
4 istore

Listing 4.34: Superinstruction derived from Listing 4.30

1 iconst_5
2 istore
3 iconst_4
4 istore

Listing 4.35: Superinstruction derived from Listing 4.31

While this mapping relation may seem obvious, it was previously established that these two are not equivalent due to the istore potentially writing to the same slot twice. What these two listings also help to reveal is the general mechanism that dictates data dependency incompatibility. The istore instruction from line 2 in Listing 4.34 "crosses over" the istore from line 4 of the same listing – their mapping edges overlap. This is not allowed for the istore instruction, as the

last write to a given local variable table index ends up being visible. To formulate the general rule: a store instruction may not be moved across another store instruction. However, this is not the only rule:

• Read instructions that read data from the local variable table. These may not be moved over any write instructions, but read instructions may be moved over other read instructions of any opcode, as reading does not mutate the data. The read instruction opcodes are:

  – aload
  – dload
  – fload
  – iload
  – lload
  – ret, which reads from the local variable table, but cannot be part of a superinstruction anyway and as such is not considered

• Write instructions that write data to the local variable table. These may not be moved over write instructions and read instructions, because the last write becomes visible after the superinstruction as discussed earlier. The write instruction opcodes are:

  – astore
  – dstore
  – fstore
  – istore
  – lstore
  – iinc
  – jsr, which writes to the local variable table (counterpart of ret), but cannot be part of a superinstruction and as such is not considered

• Specialized read instructions are read instructions which read from a known local variable table index. These may not be moved across (1) regular write instructions, and (2) specialized write instructions that access the same local variable table slot. They may however be moved across other specialized write instructions which are for a different local variable table slot, and can be moved across any read instruction. These instruction opcodes and their families are:

  – aload family: aload_0, aload_1, aload_2 and aload_3
  – dload family: dload_0, dload_1, dload_2 and dload_3
  – fload family: fload_0, fload_1, fload_2 and fload_3
  – iload family: iload_0, iload_1, iload_2 and iload_3
  – lload family: lload_0, lload_1, lload_2 and lload_3

• Specialized write instructions are write instructions which write to a known local variable table index, similar to the specialized read instructions. These may not be moved across (1) regular write instructions, (2) regular read instructions, (3) specialized write instructions with the same local variable table index, and finally (4) specialized read instructions with the same local variable table index. The instruction opcodes and their families are:

  – astore family: astore_0, astore_1, astore_2 and astore_3
  – dstore family: dstore_0, dstore_1, dstore_2 and dstore_3
  – fstore family: fstore_0, fstore_1, fstore_2 and fstore_3
  – istore family: istore_0, istore_1, istore_2 and istore_3
  – lstore family: lstore_0, lstore_1, lstore_2 and lstore_3
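These rules can be collapsed into a small pairwise check. The following Java sketch is illustrative only (the Kind abstraction and slot encoding are not the actual QuickInterp representation); note that the data type of the instruction is deliberately ignored, for the reason explained below:

// Sketch of the pairwise reordering rules for local variable table accesses.
// An instruction is abstracted to a kind plus an optional known slot: 0-3 for the
// specialized forms, -1 for the generalized forms (which may touch any slot).
// iinc counts as a generalized store (kind STORE, slot -1).
class ReorderRules {
    enum Kind { LOAD, STORE, OTHER }

    static boolean mayReorder(Kind a, int slotA, Kind b, int slotB) {
        if (a == Kind.OTHER || b == Kind.OTHER) return true; // barriers are handled separately
        if (a == Kind.LOAD && b == Kind.LOAD) return true;   // reads never conflict with reads
        // At least one side is a store: only safe when both slots are known and
        // provably different; a generalized form (slot -1) may alias any slot.
        if (slotA < 0 || slotB < 0) return false;
        return slotA != slotB;
    }
}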

One constraint may seem odd considering the semantics of the JVM: a store operation of type x (e.g. istore) may not be moved across a store operation of type y (e.g. fstore) when x ≠ y. The JVM is a typesafe VM and disallows reinterpreting the same data in a local variable table slot. If the type of a local variable table slot is, say, an int, the verifier will block any code from loading that attempts to read the slot as if it were a float. As such, it may appear as if two type-incompatible store operations cannot refer to the same slot, and that there is therefore no harm in reordering them. However, this is not the case: the JVM allows the reuse of a slot. JVM slots are reused by writing to them using a store operation of a particular type, and this changes the slot to that type. Since the write overwrites the original data in the slot, no reinterpretation occurs. As a result, it is possible for a sequence of bytecode instructions to treat one slot first as an int and apply an istore operation on it, and then reuse it as a float by writing to it using fstore. This behavior forces the reordering algorithm to disallow reordering load and store instructions across other store operations, even when the types are incompatible. In section 4.6.3 we will discuss how these properties are used to create attributes that generalize the data dependency relation. These attributes have special rules that describe data dependency relations and allow adding data dependency edges to create a data dependency graph. However, first another source of data dependencies needs to be discussed: the operand stack.

Operand stack dependencies

The JVM operand stack is used for expressions – by having a standard place from which values can be read and to which values can be written, it is possible to keep the bytecode compact. Instructions like iadd do not need two local variable table indices for their input and one for their output in the bytecode stream. Instead, the iadd instruction simply reads (pops) two values from the operand stack, and pushes the result back onto the operand stack. The iload and istore instructions can be used to transfer values between the operand stack and the local variable table, but this is not the only source of values. Values may come from constants (e.g. bipush), field accessors, invoke instructions and other instructions. Java expressions without variables do not use the local variable table regardless of how large the expression is: the intermediate values within the expression are kept on the operand stack and are never stored in the local variable table. Our equivalence algorithm does not trace operand stack data dependencies: instead, instructions where the operand stack is not empty are grouped together into an atomic block. As discussed before, these atomic blocks are only ever moved as a whole, saving the complexity of tracing operand stack dependencies. We will take a look at why neglecting to trace data dependencies via the operand stack is not a huge loss. Additionally, performing data dependency analysis on a graph of atomic blocks creates a new challenge: figuring out where these atomic blocks start and end.

1 int a = x * y + 10;

Listing 4.36: A Java expression

1 iload 7     load x
2 iload 8     load y
3 imul
4 bipush 10   load literal 10
5 iadd
6 istore 9    store a

Listing 4.37: Bytecode compiled from 4.36

1 int b = 10 + x * y;

Listing 4.38: Another Java expression

1 bipush 10   load literal 10
2 iload 7     load x
3 iload 8     load y
4 imul
5 iadd
6 istore 10   store b

Listing 4.39: Bytecode compiled from 4.38

To get an idea of how the JVM operand stack works and how it is used, consider the two expressions in Listing 4.36 and Listing 4.38, and their respective bytecode in Listing 4.37 and

Listing 4.39. Before comparing the two examples, let us first focus on Listing 4.36 and its bytecode in Listing 4.37. Two values are loaded onto the operand stack using iload, which are then multiplied on line 3 in Listing 4.37. The multiplication instruction pops the two values and pushes the product of the two variables back onto the operand stack. The bipush instruction is used to push a literal 10 onto the operand stack, which is then added to the multiplication result already present on the stack. This leaves just one value on the operand stack – the sum. This is then stored by the istore in the local variable table slot of a, which has apparently been assigned to slot 9 by the Java compiler. The code in Listing 4.38 and Listing 4.39 is very similar but has some subtle differences. Observe that all values are first pushed onto the operand stack – the literal 10, the value of x and the value of y are all present on the operand stack at the same time. Then, they are aggregated: first an imul instruction multiplies the values of x and y. Finally, the literal and the result of this multiplication are summed by the iadd instruction, leaving just the sum to be stored with the istore instruction. Looking at the expressions and the code each expression produced, the order makes a lot of sense. The bytecode instructions are in Reverse Polish Notation (RPN) – the expression from Listing 4.36 can be written as x y · 10 + in RPN. Besides the transformation to RPN, the two Java expressions have been translated to bytecode without any optimization passes. The notation is a serialization of the Abstract Syntax Tree (AST). Without diving too far into the complexities of compiler front-ends, it is easy to see that the two expressions in Listing 4.36 and Listing 4.38 are equivalent. However, this equivalence is due to the symmetry of the iadd instruction: ∀a, b : a + b = b + a. In QuickInterp, this property and other mathematical properties of arithmetic instructions are not used to determine equivalence. With that restriction in place, the RPN for a given expression becomes unique. This has some interesting consequences for equivalence – two expressions are only equivalent if they are exactly the same. The expressions are effectively "atoms" within a larger superinstruction and can be considered as if they were their own instruction. This is why there is little gain in tracing data dependencies via the operand stack: the Java compiler is likely to leave the operand stack empty between statements, and will not place an independent expression right in the middle of another. As such, two equivalent expressions compiled by the Java compiler will always produce the exact same sequence of instructions. The only exception is when the two expressions are equivalent due to a mathematical property of one of the instructions, e.g. the symmetry of iadd. However, this detection is not implemented in the equivalence algorithm either.

1 int a = x + 2;
2 int b = x * 14;

Listing 4.40: Two Java assignments with expressions

1 iload_3     load x
2 iconst_2    load literal 2
3 iadd
4 istore_1    store a
5 iload_3     load x
6 bipush 14   load literal 14
7 imul
8 istore_2    store b

Listing 4.41: Bytecode compiled from 4.40

1 int b = x * 14;
2 int a = x + 2;

Listing 4.42: Two Java assignments with expressions

1 iload_3     load x
2 bipush 14   load literal 14
3 imul
4 istore_2    store b
5 iload_3     load x
6 iconst_2    load literal 2
7 iadd
8 istore_1    store a

Listing 4.43: Bytecode compiled from 4.42

How expressions can be treated as atoms for the purpose of equivalency can be seen in Listing 4.40 and Listing 4.42, where two equivalent Java assignments are shown. These code listings can

only be equivalent if the equivalence algorithm can determine that the variables a ≠ b, a ≠ x and b ≠ x, and it can when looking at the bytecode for these listings in Listing 4.41 and Listing 4.43. This is because the specialized store and load operations have been used by the Java compiler: istore_1 for a, istore_2 for b and iload_3 for x. Note that the expressions themselves are not just equivalent – they are the same. The two expressions have been reordered but otherwise match exactly. The operand stack can be very deep – even though instructions like iadd only pop two items, the JVM limits the depth of the operand stack at 2^16 − 1 [7]. This makes it entirely possible to create bytecode that "interleaves" an expression with a completely independent expression.

1 iload_3 load x Bytecode 2 iconst_2 load literal 2 3 iload_3 load x 4 bipush 14 load literal 14 5 imul 6 istore_2 store b 7 iadd 8 istore_1 store a

Listing 4.44: Two mixed expressions, equivalent to Listing 4.41 and Listing 4.43

Take for example the bytecode of Listing 4.44. It is equivalent to the earlier listings, and one of the expressions which was just declared atomic (a = x + 2) has been split up, placing the other expression (b = x * 14) right in the middle. This is possible in JVM bytecode because the operand stack is exactly that – a stack. As the bytecode for b = x * 14 leaves the operand stack completely balanced (pushes just as many items as it pops), it does not destroy or overwrite work done by the few instructions from the a = x + 2 expression. However, the Java compiler will not generate such code, as it is not valid Java to mix an unrelated expression-assignment within another expression. As such, the QuickInterp equivalence algorithm assumes that expressions where the operand stack is used can be seen as atomic blocks that are only ever moved as a whole, and this is in line with the code generation performed by the Java compiler. As mentioned, making this assumption greatly simplifies tracing data dependencies via the stack – there simply are none, because an atomic block of instructions always leaves the operand stack balanced, and as such all stack data dependencies are resolved within the block. While it is trivial to create examples like the equivalence between Listing 4.41 and Listing 4.44 where this assumption is broken, there is no harm in incorrectly classifying two equivalent sequences of bytecode as non-equivalent. At worst, the program fails to utilize the equivalence for a performance boost due to needing extra superinstructions, but the input program after superinstruction substitution stays correct. The exact mechanism for constructing these atomic blocks is discussed in a moment; however, it is important to observe that exactly between the two expressions in Listing 4.41 and Listing 4.43 the operand stack depth reaches 0 (relative to the start of the superinstruction). As such, the basic idea for constructing these atomic blocks is to find out where the operand stack reaches a depth of zero, as this indicates a cut point between multiple expressions. This is done by using an abstract interpreter.

1  markAtomicBlocks(I):                  Pseudocode
2    d := 0                              Track stack depth starting at d = 0 (relative)
3    dmin := d                           Running minimum
4    dmax := d                           Running maximum
5    for i ∈ I:
6      d -= stackPops(i)                 Subtract the number of values i pops
7      dmin := min(dmin, d)
8      d += stackPushes(i)               Add the number of values i pushes
9      di := d                           Track the depth for this instruction
10     dmax := max(dmax, d)

Listing 4.45: Abstract interpreter for marking expressions

The abstract interpreter can be seen in Listing 4.45. An abstract interpreter operates on types instead of on values, and in this case all it does is keep track of the stack depth, ignoring all outgoing jumps. The abstract interpreter is started at the beginning of the superinstruction, and for every instruction it encounters it determines how many values the instruction pops and how many values the instruction pushes, tracking the depth of the operand stack relative to the beginning of the superinstruction. The stack depth after a given instruction i is stored in di, which will be used later on.
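For concreteness, a direct Java transcription of Listing 4.45 could look as follows. The OpcodeInfo helper is a hypothetical stand-in for a table of pop/push counts per opcode; instructions whose stack behavior depends on their operands (such as invoke instructions) are not handled, as noted later in this section:

// Sketch of the abstract interpreter from Listing 4.45 in Java. The depth after
// each instruction (di) is recorded so a later pass can determine the cut points.
class StackDepthScan {
    int dMin = 0, dMax = 0;   // running minimum and maximum relative depth
    int[] depthAfter;         // di per instruction

    void scan(int[] opcodes, OpcodeInfo info) {
        depthAfter = new int[opcodes.length];
        int d = 0;
        for (int i = 0; i < opcodes.length; i++) {
            d -= info.stackPops(opcodes[i]);    // values popped by this instruction
            dMin = Math.min(dMin, d);           // lowest depth any value was read from
            d += info.stackPushes(opcodes[i]);  // values pushed back onto the stack
            depthAfter[i] = d;
            dMax = Math.max(dMax, d);
        }
    }
}

interface OpcodeInfo {
    int stackPops(int opcode);
    int stackPushes(int opcode);
}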

1 iload_3     load x
2 iconst_2    load literal 2
3 iadd
4 istore_1    store a
5 iload_3     load x
6 bipush 14   load literal 14
7 imul
8 istore_2    store b

Listing 4.46: Repeat of the bytecode compiled from 4.40

Figure 4.7: Graph showing the effect of instructions on the (relative) stack depth

In Figure 4.7, the relative depth of the stack for Listing 4.46 can be seen. For every instruction from Listing 4.46 two nodes are shown: an instruction pops zero or more values from the operand stack, which is shown by the first node, and pushes zero or more values back onto the stack, which is shown by the second node. The algorithm from Listing 4.45 saves the depth at the second node into variable di, where i is the instruction ∈ I. Note that even though the iadd instruction from line 3 pops two operands from the operand stack and with that lowers the stack depth to 0, it still ends with di = 1, because the same instruction also pushes its result again – this is reflected in the second node. Observe that since expressions themselves are balanced, the depth of the operand stack returns to 0 (di = 0) between expressions. For this reason, after the istore_1 instruction the stack depth is 0 (di = 0), as the first expression has been completed. Then, the second expression uses the operand stack to once again return to a depth of 0 after the istore_2 instruction. As mentioned before, this characteristic of expressions compiled by javac forms the basis of determining the atomic blocks, allowing another algorithm to cut the superinstruction into pieces based on where di is zero. Now, there is no reason why a superinstruction itself should leave the operand stack balanced – indeed, a superinstruction may be a concatenation of just two pop instructions or two dup instructions. This is why the algorithm from Listing 4.45 keeps track of the minimum value dmin. dmin is the lowest depth the operand stack had relative to the depth at the start of the superinstruction, and as such it is completely legal for it to be a negative number. An example is the superinstruction pop pop, which will have dmin = −2 as each pop instruction removes one element from the operand stack. The block construction algorithm makeBlockGraph(...) therefore does not consider cut points where the relative stack depth is zero; instead, it considers cut points where the relative stack depth is dmin. Due to the way dmin is tracked in Listing 4.45, it is updated after an instruction has popped values from the operand stack, but before any values it pushes have been counted, whereas di for a given instruction i is the depth of the operand stack after both pops and pushes. The iadd instruction visible in Figure 4.7, for example, already set dmin

to 0, even while its di ends at 1. dmin reflects the lowest relative stack offset from the start of the superinstruction from which a value was accessed in the superinstruction, while di for a given instruction i tracks the relative stack offset as it is after the instruction i. It is even possible to obtain a dmin which is lower than all di within a particular superinstruction, for example in the superinstruction dup pop. The dup instruction pops one value and then pushes that value twice onto the operand stack. As such, it sets dmin = −1, while the relative operand stack depth after any instruction never reaches below 0 (di = 1 after dup, di = 0 after pop). If the sequence dup pop were part of a larger superinstruction it could still not become an atomic block on its own, even though it leaves the operand stack balanced. This is because it does have a data dependency via the stack. The whole point of grouping instructions together into atomic blocks is that there are no external operand stack dependencies between them. Since the dup instruction pops one element and then pushes it twice, it effectively reads a value from a relative stack depth of −1. This is detected by setting dmin = −1 when visiting the dup instruction, meaning that cuts are only performed where di = dmin = −1. No instruction within just the dup pop sequence has di = dmin, and as such this sequence is not cut into its own block, correctly detecting this data dependency. With that, the definition of our block graph construction algorithm (makeBlockGraph(...)) is complete. By recording the depth of the operand stack, it is possible to isolate sections of bytecode (atomic blocks) which have no external data dependencies via the operand stack, allowing them to be moved as a whole. Note that this algorithm does not have access to the instruction operands – it can only see the opcodes themselves. Some instructions, like tableswitch or invokestatic, pop a variable number of values from the operand stack depending on their instruction operands.
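The cutting step itself can then be sketched as follows, building on the depthAfter array and dMin from the earlier sketch. The special start and end blocks described in section 4.6.3 are omitted for brevity, and the names are illustrative rather than the actual QuickInterp API:

import java.util.*;

// Sketch: cut the instruction sequence into atomic blocks at every position where
// the relative stack depth equals dMin, per the rule described above.
class BlockCutter {
    static List<int[]> cutBlocks(int[] opcodes, int[] depthAfter, int dMin) {
        List<int[]> blocks = new ArrayList<>();
        int blockStart = 0;
        for (int i = 0; i < opcodes.length; i++) {
            if (depthAfter[i] == dMin) {
                blocks.add(Arrays.copyOfRange(opcodes, blockStart, i + 1));
                blockStart = i + 1;
            }
        }
        if (blockStart < opcodes.length) {
            // Trailing partial expression: kept as its own block, which a later pass
            // would mark FULL_BARRIER so that it is never reordered.
            blocks.add(Arrays.copyOfRange(opcodes, blockStart, opcodes.length));
        }
        return blocks;
    }
}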

Other operations

Before diving into how these atomic "expression" blocks in the block graph are used together with the local variable table data dependencies, there is one more category of data dependencies that has to be discussed. This category contains the field operations, method calls, array reads, array writes, jumps and exceptions. The effects of these kinds of data dependencies are so profound that the instructions cannot be reordered in any way. Instead, these kinds of instructions form a sort of "barrier" – reordering can happen within the superinstruction, but instructions cannot be lifted over this class of instructions.

• (Conditional) jumps and exceptions: since these change the control flow, store operations cannot be moved over such jumps as it would affect what is visible after the jump. For example, if a store instruction is moved before a jump, the jump target will see the updated value. While a more advanced equivalence algorithm may work together with an advanced substitution algorithm to recognize this case and detect whether the jump target actually uses this stored value, the QuickInterp algorithm only ever considers equivalence when looking at two sequences of bytecode "in a vacuum" (without context). As such it has to disallow moving any kind of jump.

• Array reads and array writes can cause exceptions (ArrayIndexOutOfBoundsException, NullPointerException), and as such they must be treated just like jumps.

• Method calls likewise can cause exceptions, and also cannot be part of superinstructions anyway.

• Non-static field reads and writes can cause NullPointerExceptions, as they always have to be performed on an object, as well as security exceptions.

• Finally, static field reads and writes can cause class initialization (which can throw exceptions) and a host of security exceptions, which similarly disallows reordering them.

As such, none of the instructions in this category support reordering, and all act like a full barrier. Note that these instructions can be mixed into a regular expression, e.g. instead of a bipush there might be a getfield instruction in Listing 4.41. After all, it is perfectly legal to use

a field within a Java expression, and the same goes for a method call (although those cannot be part of a superinstruction anyway).

4.6.3 Data Dependency Graph Construction

It is finally time to put all the data dependency categories together and explain how the computeAttributesInGraph and makeDataDependencyGraph algorithms from Listing 4.27 work. The computeAttributesInGraph algorithm assigns a set of attributes to each block in the block graph, based on data dependencies as mentioned in section 4.6.1. The attributes serve as an abstraction, a canonicalization of what kind of data-related actions the instructions within the atomic block perform, simplifying the process of constructing the data dependency graph. These attributes are used by the makeDataDependencyGraph algorithm to wire up the complete DDG. Before discussing the attributes, let us revisit the categories of instructions affecting data dependencies as seen in the previous section:

• Instructions accessing the local variable table have specific constraints which allow some of them to be reordered.
• Instructions jumping or throwing exceptions (which is a rather large category) work as barriers and cannot be reordered.
• Instructions using the operand stack are grouped together into atomic blocks which have to be moved as a whole. These blocks can contain instructions using the local variable table and instructions throwing exceptions, but by moving them as a whole they do not have external data dependencies via the operand stack that have to be dealt with.

The goal of each attribute is to dictate what data dependencies the block it is assigned to has. As such, each attribute defines a kind of must-happen-before relation that is used by the makeDataDependencyGraph algorithm from Listing 4.27. The blocks are found using the atomic block marker algorithm from Listing 4.45. The attributes are:

FULL_BARRIER : no blocks of any kind (with any attribute) may be moved over blocks with this attribute. In other words, all blocks with any attribute that was already happening before this block, must happen before this block. This is assigned to any block with an instruction that jumps or throws exceptions.

STORE_BARRIER_ANY : blocks with FULL_BARRIER, STORE_BARRIER_ANY, STORE_BARRIER_0, STORE_BARRIER_1, STORE_BARRIER_2, STORE_BARRIER_3, LOAD_BARRIER_ANY, LOAD_BARRIER_0, LOAD_BARRIER_1, LOAD_BARRIER_2 or LOAD_BARRIER_3 must happen before this block. This is assigned to blocks with instructions in the generalized store family of instructions (e.g. istore or fstore, but not istore_1), and also to the iinc instruction.

STORE_BARRIER_0 : blocks with FULL_BARRIER, STORE_BARRIER_ANY, STORE_BARRIER_0, LOAD_BARRIER_ANY or LOAD_BARRIER_0 must happen before this block. This is assigned to blocks with instructions in the specialized store_0 family of instructions (e.g. istore_0 or fstore_0).

STORE_BARRIER_1 : blocks with FULL_BARRIER, STORE_BARRIER_ANY, STORE_BARRIER_1, LOAD_BARRIER_ANY or LOAD_BARRIER_1 must happen before this block. This is assigned to blocks with instructions in the specialized store_1 family of instructions (e.g. istore_1 or fstore_1).

STORE_BARRIER_2 : blocks with FULL_BARRIER, STORE_BARRIER_ANY, STORE_BARRIER_2, LOAD_BARRIER_ANY or LOAD_BARRIER_2 must happen before this block. This is assigned to blocks with instructions in the specialized store_2 family of instructions (e.g. istore_2 or fstore_2).

STORE_BARRIER_3 : blocks with FULL_BARRIER, STORE_BARRIER_ANY, STORE_BARRIER_3, LOAD_BARRIER_ANY or LOAD_BARRIER_3 must happen before this block. This is assigned to blocks with instructions in the specialized store_3 family of instructions (e.g. istore_3 or fstore_3).

LOAD_BARRIER_ANY : blocks with FULL_BARRIER, STORE_BARRIER_ANY, STORE_BARRIER_0, STORE_BARRIER_1, STORE_BARRIER_2 or STORE_BARRIER_3 must happen before this block. This is assigned to blocks with instructions in the generalized load family of instructions (e.g. iload or fload, but not iload_1).

LOAD_BARRIER_0 : blocks with FULL_BARRIER, STORE_BARRIER_ANY or STORE_BARRIER_0 must happen before this block. This is assigned to blocks with instructions in the specialized load_0 family of instructions (e.g. iload_0 or fload_0).

LOAD_BARRIER_1 : blocks with FULL_BARRIER, STORE_BARRIER_ANY or STORE_BARRIER_1 must happen before this block. This is assigned to blocks with instructions in the specialized load_1 family of instructions (e.g. iload_1 or fload_1).

LOAD_BARRIER_2 : blocks with FULL_BARRIER, STORE_BARRIER_ANY or STORE_BARRIER_2 must happen before this block. This is assigned to blocks with instructions in the specialized load_2 family of instructions (e.g. iload_2 or fload_2).

LOAD_BARRIER_3 : blocks with FULL_BARRIER, STORE_BARRIER_ANY or STORE_BARRIER_3 must happen before this block. This is assigned to blocks with instructions in the specialized load_3 family of instructions (e.g. iload_3 or fload_3).

Each block has zero or more of these attributes depending on the instructions within the block. Note that this technically allows reordering of a block with no attributes around another block with FULL_BARRIER, however a block with no attributes is effectively dead code – it does not operate on any input data and does not produce output data. Since a superinstruction may leave the operand stack unbalanced, there could be a partial expression at the beginning of the superinstruction, and similarly there could be a partial expression at the end of the superinstruction. These are dealt with by constructing special blocks for these partially-present parts and excluding them from any reordering. If there are instructions before the first cut point where di > dmin they are turned into a block and marked with FULL_BARRIER. Likewise, if there are instructions after the last cut point with di > dmin they are also turned into a block and similarly marked with FULL_BARRIER. These blocks may be empty, as would be the case with the bytecode from Listing 4.41 and Listing 4.43 where no partial expressions are contained in the superinstruction. However, they are always created. The block at the beginning is the start block, and the block at the end is the end block. An example of the whole procedure, going from blocks, marking them with attributes and finally creating the DDG can be seen in Figure 4.8, Figure 4.9 and Figure 4.10.
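As a sketch of how the attribute-based edge construction could be implemented (attribute names as strings and the AttributeRules predicate are illustrative placeholders, not the actual QuickInterp types):

import java.util.*;

// Sketch of makeDataDependencyGraph(...): for every pair of blocks, add a
// must-happen-before edge when any attribute pair between them conflicts
// according to the rules listed above.
class DdgConstruction {
    static Map<Integer, Set<Integer>> buildEdges(List<Set<String>> blockAttributes,
                                                 AttributeRules rules) {
        Map<Integer, Set<Integer>> edges = new HashMap<>(); // block -> earlier blocks it must follow
        for (int later = 0; later < blockAttributes.size(); later++) {
            edges.put(later, new HashSet<>());
            for (int earlier = 0; earlier < later; earlier++) {
                for (String a : blockAttributes.get(earlier)) {
                    for (String b : blockAttributes.get(later)) {
                        if (rules.mustHappenBefore(a, b)) {
                            edges.get(later).add(earlier);
                        }
                    }
                }
            }
        }
        return edges;
    }
}

interface AttributeRules { boolean mustHappenBefore(String earlierAttribute, String laterAttribute); }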

Step 1 Figure 4.8 shows the first step in constructing the DDG, performing the makeBlockGraph(...) algorithm from the pseudocode in Listing 4.27. First a graph is created with just those blocks, and a start and end block. The blocks are connected in the graph in the order they occurred in the superinstruction.

Step 2 Figure 4.9 shows an example of the second step, where attributes are determined based on the type of instructions within the block. In the pseudocode from Listing 4.27 this step is performed in computeAttributesInGraph(...). A block may have multiple of these attributes, and as mentioned it is technically possible for a block to have no attributes at all.

Step 3 Finally, Figure 4.10 carries on with the example: the makeDataDependencyGraph(...) step creates the DDG. Edges are added between the blocks based on their attributes from the previous step. These edges only point to earlier blocks within the superinstruction, using the after edges as they were in steps 1 and 2.

Figure 4.8: Step 1: The input program is converted to a graph where all the nodes are blocks

Figure 4.9: Step 2: Barrier attributes are added to each block

Figure 4.10: Step 3: Must-happen-after edges are added based on barrier attributes, replacing the original edges

Observe how the End node with its FULL_BARRIER must link to everything, as it must be executed after everything; likewise, every block must link back to the preceding Start node, as it also has a FULL_BARRIER.

The graph from Figure 4.10 can be colored using traditional graph coloring techniques, as is done in the pseudocode in Listing 4.27, which can then be used to ascertain graph isomorphism, allowing it to be compared to another graph to test for equivalence. In fact, if the DDGs of two superinstructions are merged into one graph simply by importing the nodes into a single graph, but leaving the two DDGs disconnected, the equivalence test becomes even easier. The single graph can be colored, and if the two End nodes end up with the same color, the two superinstructions are equivalent. All pseudocode function calls from Listing 4.27 have now been defined. With the construction of the graph completed and isomorphism determined, this concludes the definition of the QuickInterp equivalence algorithm as introduced at the start of this section.
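A minimal sketch of this merged-graph comparison, assuming hypothetical Ddg and Coloring types and the same partition-refinement coloring used on the individual graphs:

import java.util.Map;

// Sketch of the merged-graph equivalence test: import both DDGs into one
// disconnected graph, color it once, and compare the colors of the two End nodes.
class MergedGraphEquivalence {
    static boolean equivalentByEndColor(Ddg first, Ddg second, Coloring coloring) {
        Ddg merged = coloring.disjointUnion(first, second);  // no edges between the two halves
        Map<Object, Integer> colorOf = coloring.color(merged);
        return colorOf.get(first.endNode()).equals(colorOf.get(second.endNode()));
    }
}

interface Ddg { Object endNode(); }
interface Coloring {
    Ddg disjointUnion(Ddg a, Ddg b);
    Map<Object, Integer> color(Ddg graph);
}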

4.6.4 Using superinstruction equivalency

One of the uses of the equivalence algorithm is to facilitate the placement of superinstructions that are not identical to a sequence of bytecode, but only equivalent. Once such a superinstruction has been found, however, some preprocessing of the input bytecode is needed before it can be substituted in. Furthermore, searching for a superinstruction requires a linear search through the superinstruction set, potentially harming performance. In this section, we address these concerns surrounding the use of an equivalence algorithm by explaining how superinstruction substitution works when dealing with equivalent-but-not-equal superinstructions, and by addressing the performance impact of the use of a linear search.

Substituting bytecode based on equivalence

In order to make a substitution, the runtime substitution algorithm must change the input program including its instruction operands. As discussed, many instructions store additional data in the bytecode stream (e.g. the 10 in istore 10). Generally, this is easy to deal with as the

superinstruction is a straightforward concatenation of the instruction handlers. As such, it expects the instruction operands at the exact same place in the bytecode stream as the original program (discussed in section 4.4.6). However, when substituting in an equivalent superinstruction, this is no longer the case. As such, the instructions need to be reordered first. This reordering requires information from the equivalence algorithm. If the runtime substitution algorithm can obtain how the blocks from bytecode sequence I map to those in J, it can reorder the blocks in the input program including their instruction operands. After that reordering has been performed, the superinstruction can be substituted in like normal. As such, a practical implementation of the algorithm shown in Listing 4.27 will not map to {0, 1}: instead it will produce a one-to-one mapping detailing how every instruction from I has to be reordered to form J. The runtime substitution algorithm can then take this mapping and apply it to the bytecode sequence under consideration (I) to change it into J. Finally, superinstruction substitution can happen just as if this bytecode sequence under consideration were exactly J. The superinstruction for J can now find all the instruction operands at the places where it expects them. This has consequences: in Figure 4.6 from section 4.5.3, it is shown how superinstruction placement can be nested. That is, the same sequence of instructions can be covered by more than one superinstruction, and this is how jump targets are dealt with in the shortest path algorithm. When including equivalence, if the "smaller" instruction (super4 in Figure 4.6) were substituted based on a reordering of instructions, it would break the earlier substitution of super3, which refers to the same instruction operands. As such, equivalence substitution cannot be used when crossing a jump target.
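A sketch of applying such a mapping before the actual substitution, with each instruction carried together with its operand bytes (the representation is illustrative, not the actual QuickInterp data structures):

import java.util.*;

// Sketch: reorder the instructions of the matched input region (including their
// operand bytes) according to the mapping produced by the equivalence test, so the
// superinstruction handler finds every operand where it expects it.
class EquivalenceRewrite {
    // mapping[k] = index in the original region of the instruction that must end up
    // at position k to match the superinstruction's opcode sequence.
    static List<byte[]> reorder(List<byte[]> originalInstructions, int[] mapping) {
        List<byte[]> reordered = new ArrayList<>(originalInstructions.size());
        for (int sourceIndex : mapping) {
            reordered.add(originalInstructions.get(sourceIndex));
        }
        return reordered;
    }
}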

Speeding up equivalence matching

As slow substitution performance may be a show-stopper for the introduction of a superinstruction architecture in a production VM, some techniques are possible to speed up finding a matching superinstruction. One idea is to save as much metadata about each superinstruction as possible into a database as a key: for two graphs to be isomorphic, this metadata must match, so non-matching candidates can be rejected cheaply. This data can include the maximum and minimum operand stack depths (dmin and dmax from Listing 4.45), the number of blocks, and the groups of blocks which received the same color from the coloring algorithm. However, QuickInterp is not a production VM, and creating a substitution algorithm which itself has great runtime performance is not the goal here. As such, we do not address or optimize for this concern, and any optimizations in this sense are disregarded, as they do not affect the performance of the superinstruction VM after all code has been loaded.
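One possible shape for such a lookup key, shown purely as an illustration (none of these fields or names are mandated by the design):

import java.util.*;

// Sketch of a cheap pre-filter key: superinstructions whose key does not match the
// key computed for the bytecode under consideration cannot be equivalent, so the
// expensive graph comparison only runs for the remaining candidates.
final class EquivalenceKey {
    final int minStackDepth;                 // dmin from Listing 4.45
    final int maxStackDepth;                 // dmax from Listing 4.45
    final int blockCount;                    // number of atomic blocks
    final List<Integer> colorClassSizes;     // sorted sizes of the color partitions

    EquivalenceKey(int minStackDepth, int maxStackDepth, int blockCount, List<Integer> colorClassSizes) {
        this.minStackDepth = minStackDepth;
        this.maxStackDepth = maxStackDepth;
        this.blockCount = blockCount;
        this.colorClassSizes = List.copyOf(colorClassSizes);
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof EquivalenceKey)) return false;
        EquivalenceKey k = (EquivalenceKey) o;
        return minStackDepth == k.minStackDepth && maxStackDepth == k.maxStackDepth
            && blockCount == k.blockCount && colorClassSizes.equals(k.colorClassSizes);
    }

    @Override public int hashCode() {
        return Objects.hash(minStackDepth, maxStackDepth, blockCount, colorClassSizes);
    }
}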

4.7 Conclusion

In this chapter the entirety of the QuickInterp architecture has been laid out. Key points discussed include the intricacies and design of effective runtime profiling, the high-level software architecture of QuickInterp, how iterative optimization can improve superinstruction set construction, the shortest path runtime substitution algorithm, and finally how instruction equivalence can further improve performance. At the beginning of the chapter in section 4.1 various design goals were listed, and now is the time to revisit them.

DG1 Enable the VM to generate a rich application profile from a running application such that any superinstruction construction algorithm is able to prioritize which sequences of bytecodes make the most suitable superinstructions (requirements for G1, G2) DG2 Design a superinstruction architecture that is agnostic of the chosen construction (G1) and runtime placement (G2) algorithms to also support implementations of existing algorithms (required for benchmarking, ties into goal G4).

DG3 Design a new, optimized algorithm for the construction of the superinstruction set (G1). DG4 Design a new, optimized algorithm for the runtime placement of superinstructions (G2).

Design goal DG1 has been answered with the design of a powerful runtime profiling architecture, where information about how the control flow moved through the bytecode is stored, enabling more than just analysis of common patterns in the bytecode (section 4.3), but also allowing static evaluation of a superinstruction set. Next up, in response to DG2 a flexible and powerful architecture supporting multiple runtime superinstruction substitution algorithms has been designed (section 4.2), and in later sections it is shown how various algorithms can be implemented on top of this architecture (section 4.5). The third design goal DG3 prompted the design of a new optimized algorithm for the construction of the superinstruction set. This has taken the shape of an iterative optimization algorithm which itself is agnostic (but aware) of the runtime substitution algorithm. It uses the runtime substitution algorithm and powerful profiling to automatically evaluate a candidate superinstruction set, and with that it attempts to zero in on the optimal superinstruction set. Pluggable preprocessor components can be used to further tune the superinstruction set construction algorithm. Finally, design goal DG4 led to the creation of the shortest path runtime substitution algorithm discussed in section 4.5.3, but also to the use of superinstruction equivalence to reduce the number of superinstructions needed (section 4.6). The design is by no means perfect – that is, there is more that could be done given sufficient resources. For example, if the shortest-path based runtime substitution algorithm had access to profiling information, it could make more educated guesses about the impact of conditional jumps leaving superinstructions. Furthermore, the equivalence algorithm is not capable of utilizing symmetry and other mathematical properties of common arithmetic operations like iadd (integer addition). If it were capable of recognizing equivalence in that area, it might perform better. However, the design goals as listed in section 4.1 do not include any non-functional requirements on the performance of the architecture. Indeed, this design is attempting to find which modifications to the superinstruction architecture as seen in earlier work improve performance, without setting a bar for what is expected nor requiring the exploration of every single option or every single design derivation. The performance of the design is also linked to implementation decisions that will be discussed in the next chapter, and as such only the combination of the design, how it is implemented and what kind of benchmarks are used determines the measured performance. As such, the relation between the algorithm design and its actual performance characteristics is an indirect one at best. We discuss how this design actually performs as implemented in QuickInterp in chapter 6, where the algorithms are put to the test in benchmarks on real hardware.

Chapter 5

Implementing QuickInterp

5.1 Introduction

In this chapter the implementation of QuickInterp is discussed. This is the implementation of the architecture and the algorithms as designed in chapter 4, in such a way as to support the research goals from section 1.5. We selected OpenJDK 11 Zero [14] as a base VM to implement the superinstruction set architecture on top of, as this VM is a pure C++ ("zero-assembly") port of OpenJDK, maintained at the time of writing and completely up-to-date with all the features that can be expected in a modern JVM. The runtime substitution algorithms are implemented in Java to aid prototyping and make it easier to implement the various substitution algorithms from chapter 4 (triplet-based substitution, tree-based substitution and shortest-path based substitution). In the process of developing QuickInterp, we took various shortcuts to speed up development that would not be acceptable in a production VM: our patches break the garbage collector and class verifier. We consider the loss of these flagship JVM features acceptable as their absence does not interfere significantly with the ability to gather results and evaluate the design, as both the garbage collector and the class verifier operate independently of any superinstruction architecture. However, to make a proper comparison to a VM without superinstructions, the garbage collector and class verifier must be disabled in both virtual machines. QuickInterp is available at https://github.com/LukasMiedema/QuickInterp. The structure of this chapter largely matches that of the design from chapter 4. Section 5.2 discusses the goals for the implementation, with section 5.3 discussing the choice of OpenJDK Zero and the implications of this choice: it presents an overview of how superinstructions are added to the software architecture of OpenJDK Zero, while various key architectural decisions are discussed. OpenJDK Zero supports, without changes, up to 256 instructions in its bytecode format, which we considered insufficient for meaningful superinstruction experimentation. This section also discusses the mechanism and changes to OpenJDK Zero implemented to support up to 2^16 superinstructions as a new theoretical limit in QuickInterp. Runtime profiling as designed previously in section 4.3 is implemented in section 5.4, where the VM is instrumented to report on where code is executed and how control flow moves through the program. Section 5.5 discusses how the iterative superinstruction set construction algorithm is implemented, with section 5.6 following to explain how the runtime substitution algorithms are implemented. Then, section 5.7 loops back to discuss how a generated superinstruction set is turned into an actual interpreter by an interpreter generator. Finally, section 5.8 concludes the implementation and reflects back on the implementation goals as set out at the beginning of this chapter.

5.2 Implementation goals and non-goals

Before discussing the implementation, let us discuss the implementation goals first. The implementation is the link between the design and the benchmarks, and as such we derive the goals for the implementation not only from the research goals of this thesis, but also from the design of QuickInterp as seen in the previous chapter (chapter 4). As discussed earlier, the goal of QuickInterp is not to produce a production-ready VM competing with other mainstream implementations. It does not have to be secure, safe, make efficient use of hardware resources like disk space and memory, or be easy to use. The only priority for the implementation is to provide answers to the research questions from section 1.5.2, allowing corners to be cut. In the previous chapter, in section 4.1.1, we presented a set of design goals:

DG1 Enable the VM to generate a rich application profile from a running application such that any superinstruction construction algorithm is able to prioritize which sequences of bytecodes make the most suitable superinstructions (requirements for G1, G2) DG2 Design a superinstruction architecture that is agnostic of the chosen construction (G1) and runtime placement (G2) algorithms to also support implementations of existing algorithms (required for benchmarking, ties into goal G4).

DG3 Design a new, optimized algorithm for the construction of the superinstruction set (G1). DG4 Design a new, optimized algorithm for the runtime placement of superinstructions (G2).

For the implementation, most design goals can be translated to an implementation goal to implement that aspect of the QuickInterp architecture. The implementation also adds its own goals: the OpenJDK Zero VM needs to be modified to support superinstructions. At class load time, classes need to undergo a transformation by the active runtime substitution algorithm so that they contain superinstructions, and the interpreter itself needs to contain additional handlers for each superinstruction. This leads us to the following implementation goals:

IG1 The VM runtime needs to be capable of gathering profiling information from the running application, and write this to a file for the superinstruction set construction algorithm (DG1). IG2 Provide an implementation of the iterative superinstruction set construction algorithm discussed in section 4.4. It must produce a list of optimal superinstructions for a given profile, maximum instruction set size and a runtime substitution algorithm (DG3). IG3 Provide an interpreter generator that, from the list of superinstructions, generates C++ source code and other metadata to implement all superinstructions in the VM. The VM needs to be modified to use this generated C++ code and be refactored in such a way that it is possible to concatenate instruction handlers to create superinstructions.

IG4 Implement the three runtime superinstruction placement algorithms (triplet-based, tree-based and shortest-path runtime substitution) (DG4).

IG5 Modify the OpenJDK Zero class loading pipeline to include the runtime substitution of each class by the active runtime substitution algorithm.

IG6 Produce a VM that provides an accurate, testable reflection of the design of each of the components as laid out in chapter 4 (G4 from section 1.5).

The only design goal that is not explicitly translated to an implementation goal is design goal DG2. This design goal has set out the overarching architecture for QuickInterp and as such it is embedded in the design of QuickInterp and not present as its own implementation goal. Finally, implementation goal IG6 links back to research goal G4: this goal concerns the evaluation of the

proposed superinstruction algorithms and architecture. Implementation forms the link between the design and the evaluation, and as such it is vital to keep this goal in sight and not take shortcuts that impact benchmarking results in a way that is not reflective of the design.

5.3 QuickInterp on OpenJDK Zero

Building on OpenJDK Zero [14] introduces its own difficulties as this VM is (obviously) not written from the ground up to support superinstructions. In this section, we discuss various aspects of the OpenJDK Zero architecture and how we changed them to support our design. Section 5.3.2 discusses how Java classes are loaded and where our implementation modifies classes. Next up, we’ll take a look at how the OpenJDK Zero interpreter is structured and what changes were made to the actual runtime machinery to support concatenating instruction handlers in section 5.3.3. OpenJDK Zero, by nature of its interpreter design, supports only 2^8 = 256 total instructions (regular instructions + superinstructions), which is addressed in section 5.3.4 where we discuss how this architectural limit is raised to 2^16 total instructions. We opted to implement the runtime substitution algorithms in Java, running them in the same VM whose loaded code they modify. This is discussed in section 5.3.5, including how we deal with the inevitable circular loading errors caused by doing substitution in Java. This section will set the stage for the more platform-agnostic implementation of the profiling (section 5.4), the generation of the superinstruction set (section 5.5) and the implementation of the runtime superinstruction substitution algorithms (section 5.6).

5.3.1 Why OpenJDK Zero

To implement the design of QuickInterp we have to pick an existing VM implementation that meets a few criteria: the source code has to be accessible, it has to be written in C++ or another “high-level” language that would allow us to concatenate instruction handlers (anything written in assembly is out), and finally the VM should be part of the current generation of virtual machines in compliance with a recent version of the Java Virtual Machine specification. OpenJDK 11 Zero meets all these criteria: it is GPL-licensed with class-path exception and the code is freely available online. Furthermore, it claims compatibility with the Java Virtual Machine specification Java SE 11 Edition [11]. Finally, it is entirely written in C++, as this is the main goal of the OpenJDK Zero porting project: creating a VM free of machine-specific assembly code. These properties make it an excellent match to implement QuickInterp on top of.

5.3.2 OpenJDK Zero class-loading pipeline

To support superinstructions, the superinstructions must be substituted into the in-memory bytecode format when a class loads. Furthermore, they must be present in the interpreter as their own instruction handlers. There is no reason for a VM to keep the bytecode as it was on disk: it can be transformed to another intermediate language, which might be more efficient to interpret. OpenJDK Zero however does almost no preprocessing on the bytecode that is read from file. Bytecode is verified using a bytecode verifier, followed by some basic instruction rewriting (discussed in a moment) which leaves the bytecode structure basically untouched. The interpreter is run on this bytecode, which is very similar (the size is identical) to the bytecode provided by the user. This makes OpenJDK Zero a token-threaded interpreter, as discussed in background section 2.2. Some instruction rewriting is done by rewriter.cpp. For the purposes of creating a superinstruction architecture, the rewriter does fairly little. Various constant-pool references are rewritten to native machine endianness for the invoke family of instructions and the ldc (load constant from the constant pool) instruction. Besides the ldc instruction, these instructions cannot become a meaningful part of a superinstruction as discussed earlier.

Use of a JVMTI agent

Superinstruction substitution has to be done somewhere in the class loading pipeline, preferably as early as possible in the loading process to make sure all processing steps within the VM see only the substituted class with superinstructions. This is important because the more invasive substitution algorithms like the shortest-path substitution algorithm (when using equivalence) might reorder bytecode instructions. If information about the bytecode were collected prior to such a reordering, this information may become invalid. Furthermore, the process needs to be able to inspect all classes, including core classes like java.lang.Object (the root of the class inheritance tree for all objects) and hidden classes (e.g. lambda classes) defined via the confusingly named Unsafe.defineAnonymousClass, which has no relation to “anonymous classes” as they exist in the Java language. OpenJDK Zero provides an implementation of a tooling interface called the Java Virtual Machine Tooling Interface (JVMTI). The JVMTI exists to support the development of debuggers, loggers and other tools (“agents”) that wish to inspect the state of the JVM and even modify it. For QuickInterp, a key feature of the JVMTI is the ability to intercept and replace any class that gets loaded by the VM except hidden classes, which we will cover in a moment. With JVMTI it is even possible to replace java.lang.Object, enabling superinstructions in the deepest level of the VM. JVMTI has the wiring in place to intercept and process class substitution at all places where classes can be defined, and as such we opted to use JVMTI to implement our algorithms. We implemented our JVMTI client as part of the VM source code itself, modifying the JVMTI startup procedure (where it scans for JVMTI command line arguments) to always inject a reference to the embedded JVMTI agent. Hidden classes are not a language feature (yet¹) on the JVM. Instead, they are an implementation detail of how OpenJDK generates classes for e.g. lambdas, string concatenations and more. These classes are slipped into the class loader of the parent class (even when that class loader does not support defining new classes) and do not show up in the JVMTI interface. It makes sense that these are hidden from JVMTI out of the box: they are an implementation detail and are not expected to exist. Hidden classes are marked as such, and this marking is used to skip notifying JVMTI agents. Fortunately this made it fairly trivial to modify the OpenJDK Zero class loading mechanism to stop excluding these classes from JVMTI visibility. This means that within our modified OpenJDK Zero, the JVMTI is technically out-of-spec. However, considering we are not testing anything related to JVMTI, this has no effect when it comes to obtaining benchmarking results.

Determining class identity

Various processes within the superinstruction workflow require a way to uniquely identify a class. The profiling phase places execution counters within the class and needs a way to indicate which class. System classes need to be placed in a cache (discussed in section 5.3.5), which once again requires identifying them somehow. While JVM classes have a class name, this class name does not have to be unique. It is perfectly legal for two classes to get loaded with the same name but under different class loaders. Furthermore, the hidden classes do not have a name at all, and are instead assigned one further down the class loading pipeline. As such, we designed a simple but effective mechanism to assign a name based on the content of the class. Our approach is to compute a simple hash over the class content and append this to the name of the class as <class name>!<hash>, e.g. java.lang.Boolean!3795190246904940199. The hash code is computed by interpreting the whole class as an array of 8-byte integers and XOR’ing these together. While not secure, as it is trivial to create a hash collision with this algorithm, the goal of this hashing algorithm is not security but rather to create a simple mechanism for differentiating classes with the same name but different content within the scope of the QuickInterp implementation.
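
The hashing scheme is small enough to sketch. The following is a minimal sketch in Java, assuming the class bytes are folded into 8-byte words in file order with a zero-padded final word and an unsigned rendering of the result; the thesis does not spell out these details, so they are assumptions.

final class ClassIdentity {

    // Content hash: the class file is interpreted as a sequence of 8-byte words
    // which are XOR'ed together. Word order and padding of the final partial
    // word are assumptions of this sketch.
    static long contentHash(byte[] classBytes) {
        long hash = 0L;
        for (int i = 0; i < classBytes.length; i += 8) {
            long word = 0L;
            for (int j = 0; j < 8; j++) {
                long b = (i + j < classBytes.length) ? (classBytes[i + j] & 0xFFL) : 0L;
                word = (word << 8) | b;
            }
            hash ^= word;
        }
        return hash;
    }

    // Builds the unique name, e.g. "java.lang.Boolean!3795190246904940199".
    static String identityName(String className, byte[] classBytes) {
        return className + "!" + Long.toUnsignedString(contentHash(classBytes));
    }
}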

¹ JEP-371 aims to change that as of Java 15.

5.3.3 OpenJDK Zero Interpreter

While the lack of complex rewriting means that there is some room for other optimizations, skipping such processing steps makes it easier to implement the QuickInterp superinstruction architecture. Since the interpreter is a straightforward JVM bytecode interpreter with instruction handlers for each bytecode instruction, concatenation of them is simple. Each instruction handler modifies the program counter (pc) and top-of-stack (tos) variables, which are stored in CPU registers, after which the next opcode is read and dispatched to. The interpreter itself can be compiled as either a large switch-statement switching to each instruction handler, or as a large table with pointers to each instruction handler. The table approach is faster, and this approach is selected at build time when it is available (it requires special compiler features not part of regular C++). The platform on which QuickInterp will be evaluated (amd64 on Linux) can use the table approach and as such this is the only relevant dispatching technique for this thesis. While most instructions have dedicated handlers – dedicated sequences of C++ code that implement that handler and just that handler – some of the instruction handlers are reused or share code by using macros. To create superinstructions, the interpreter generator (discussed in section 5.7) needs access to standalone code for each instruction handler in order to be able to concatenate it.

CASE(_iload):   // execution falls through to the next case
CASE(_fload):
    SET_STACK_SLOT(LOCALS_SLOT(pc[1]), 0);
    UPDATE_PC_AND_TOS_AND_CONTINUE(2, 1);   // "and continue" = jump back to the dispatch loop

Listing 5.1: The iload and fload instruction handler in OpenJDK Zero

An example of this is shown in Listing 5.1 where one instruction handler is reused. The iload and fload instructions both make use of the same code by using the fall-through nature of case statements.

#define OPC_CONST_n(opcode, const_type, value)   \
    CASE(opcode):                                \
        SET_STACK_ ## const_type(value, 0);      \
        UPDATE_PC_AND_TOS_AND_CONTINUE(1, 1);

OPC_CONST_n(_iconst_m1, INT, -1);
OPC_CONST_n(_iconst_0, INT, 0);
OPC_CONST_n(_iconst_1, INT, 1);
OPC_CONST_n(_iconst_2, INT, 2);
OPC_CONST_n(_iconst_3, INT, 3);
OPC_CONST_n(_iconst_4, INT, 4);
OPC_CONST_n(_iconst_5, INT, 5);
OPC_CONST_n(_fconst_0, FLOAT, 0.0);
OPC_CONST_n(_fconst_1, FLOAT, 1.0);
OPC_CONST_n(_fconst_2, FLOAT, 2.0);

Listing 5.2: All const_n handlers are defined by invocations of one macro, which expands to the actual definition of that instruction handler

Furthermore, some instruction handlers are defined using a macro, which can be seen in Listing 5.2. The OPC_CONST_n macro contains the entire definition of the instruction handler, including the CASE statement, which prevents simple concatenation of such macros. As such, these macros need to be “expanded” first to capture the code fragments that make up each instruction handler such that they are available to the code generator. While OpenJDK Zero lends itself fairly well to concatenating instruction handlers, manual inspection of each instruction handler is still necessary to ensure that the above examples can work correctly.

Concatenating handlers

One of the benefits of superinstructions is that modifications to the top-of-stack (tos) and program counter (pc) variables can be coalesced within the superinstruction. Take for instance a superinstruction consisting of two instruction handlers that both push an element onto the operand stack – within this superinstruction, it is not necessary to modify the tos value twice. Instead, the tos value can be modified once at the end of the superinstruction, accounting for both instruction handlers. The second instruction handler has to operate on “tos+1” (assuming the operand stack grows down) instead of “tos” to reflect that the tos variable has not been updated yet after the first instruction handler. This is possible because most JVM instructions impact the operand stack (and program counter too) in a consistent way; for example, a bipush instruction handler always pushes one element onto the operand stack and increases the program counter by two.

1  /* bipush handler */
2  SET_STACK_INT((jbyte)(pc[1]), 0);
3  UPDATE_PC_AND_TOS(2, 1);
4
5  /* iload handler */
6  SET_STACK_SLOT(
7      LOCALS_SLOT(pc[1]), 0);
8  UPDATE_PC_AND_TOS(2, 1);
9  CONTINUE;

Listing 5.3: Straightforward concatenation of the bipush and iload instruction handlers

1  /* bipush handler */
2  SET_STACK_INT((jbyte)(pc[1]), 0);
3
4
5  /* iload handler */
6  SET_STACK_SLOT(
7      LOCALS_SLOT(pc[1+2]), 1);
8  UPDATE_PC_AND_TOS(2+2, 1+1);
9  CONTINUE;

Listing 5.4: Optimized concatenation of the bipush and iload instruction handlers by coalescing top-of-stack and program counter modifications

This example is shown in Listing 5.3 and Listing 5.4, where both listings implement the bipush-iload superinstruction by concatenating the instruction handlers, one without coalescing write operations and the other with. The macro UPDATE_PC_AND_TOS(x,y) adds x to pc and y to the tos (we assume the stack grows down), and the CONTINUE macro jumps back to the interpreter loop where the next instruction is read and executed. In Listing 5.1 and Listing 5.2 we saw the CONTINUE operation included in the UPDATE_PC_AND_TOS_AND_CONTINUE(x,y) macro call, but this would prematurely terminate execution within the superinstruction. Listing 5.3 performs two calls to UPDATE_PC_AND_TOS(x,y) – one after each instruction handler – while Listing 5.4 needs only one call. Observe how the code of the iload handler has been tweaked in the optimized superinstruction of Listing 5.4 – instead of accessing pc[1] on line 6, it reads the local variable table index from pc[1 + 2]. This is to make up for the missing write to the pc variable caused by omitting the UPDATE_PC_AND_TOS(x,y) call on line 3. Likewise, the second argument of SET_STACK_SLOT(value, offset) is an offset relative to the current top-of-stack, which due to the omission of the UPDATE_PC_AND_TOS(x,y) call is now 1. There is no concrete implementation goal to coalesce program counter and top-of-stack writes. However, we considered it relatively easy to add to the QuickInterp architecture, as the UPDATE_PC_AND_TOS_AND_CONTINUE(x,y) macro had to be modified anyway in all instruction handlers. Furthermore, earlier work (e.g. Ertl et al. [4] and Casey [1]) often combined superinstructions with other optimizations like top-of-stack caching. While we did not consider implementing this relatively unrelated optimization, coalescing writes to the program counter and top-of-stack values at least gives the C++ compiler the opportunity to detect that one instruction handler writes to the same location another instruction handler reads within the same instruction. The effects of this – in theory – are somewhat akin to the “static caching” from Casey [1]; however, static caching applied to the whole VM and not just the superinstructions. While static caching was implemented in their interpreter independent of superinstructions, our write-coalescing optimization requires that the compiler is capable of picking up a particular data dependency relation. Even though the

optimization is not nearly as strong as what has been discussed in earlier work, implementation goal IG6 asks for a VM that can determine the efficacy of the design, and we believe this optimization is such a good match for the superinstruction architecture that it must be included. It also helps that it is not that hard to implement on top of OpenJDK Zero. Not all instruction handlers are equal when it comes to concatenation: some instructions modify the operand stack depth in a variable way (e.g. invoke-statements), some instructions modify the program counter in a variable way (e.g. tableswitch), and some instructions cannot be part of a superinstruction due to other reasons related to the implementation of OpenJDK. In the implementation of QuickInterp, each instruction handler is characterized by zero or more flags that describe how (and if) the instruction handler can be part of a superinstruction. Appendix A.2 lists all bytecode instructions with their flags, including a full definition of each flag.

5.3.4 Code stretching

As mentioned in the introduction, OpenJDK Zero has one shortcoming when it comes to implementing the superinstruction architecture: it supports only up to 2^8 = 256 different instruction handlers. This is a consequence of it being a token-threaded interpreter – the bytecode is kept as-is, and the opcodes (“tokens”) are just one byte long. Added superinstructions each create their own instruction handler (even if that handler is a concatenation of existing handlers) and as such need their own opcode. In order to support more superinstructions, we set out to change the interpreter to use two-byte opcodes, giving us 2^16 = 65536 possible instruction handlers, in a process dubbed “code stretching” (stretching a one-byte opcode into two). Note that this need not mean that QuickInterp supports that many superinstructions – there might be other limitations that prevent the creation of a VM with that many instruction handlers, like insufficient resources for the C++ compiler to compile such a massive interpreter. While we ultimately chose to transform the whole interpreter (and VM) to use two-byte instruction opcodes, one easier-to-implement alternative requires some discussion. This alternative would be to add just one new instruction to the VM, called super, to cover all superinstructions. Following the super opcode in the bytecode stream, there would be two bytes that dictate what kind of superinstruction it is (a “superinstruction opcode” of sorts), followed by the regular instruction operands of that superinstruction. To use this, the instruction handler for super would have to include its own dispatching mechanism to execute a particular superinstruction. The goal of the superinstruction architecture is to save instruction dispatches, and this approach would add one additional instruction dispatch to every single superinstruction. As such, it only starts making sense to concatenate superinstructions consisting of three or more instruction handlers; a superinstruction with just two instruction handlers performs just as many dispatches as the two regular instructions that were concatenated. Furthermore, this design would likely impair the triplet-based runtime substitution algorithm designed by Ertl et al. [4] disproportionately, as this algorithm is more likely to make short superinstructions. With implementation goal IG6 in mind, this approach – while much easier to implement – was ultimately dismissed. Even though the amount of work required to change every single place where the size of an instruction opcode was assumed was rather large, we still chose the code stretching approach over the simpler one-instruction alternative discussed earlier. To implement the two-byte opcode architecture, the changes to the VM are twofold. One, the VM needs to be patched to work with two-byte opcodes, thus all places where one-byte opcodes are expected need to be modified. This assumption is made in every single instruction handler and many other places in the interpreter. Two, a code stretching processing step needs to be added that preprocesses all classes as they get loaded to use two-byte opcodes, as the VM now expects every opcode to be two bytes and is no longer compatible with one-byte opcodes.

Patching the interpreter

Let us first discuss how the interpreter has to be modified to work with two-byte opcodes. Within the interpreter itself, two main things need to be changed: (1) the interpreter must now read two

bytes and do a dispatch based on that, and (2) every instruction handler must be modified to account for the larger instruction opcode. Outside the interpreter there is one more place that needs some work: OpenJDK has a few utility methods for determining the size and instruction operands of various instructions. These are not used inside the interpreter, but are used by auxiliary code like the rewriter.cpp discussed earlier, or by the garbage collector and class verifier. The utility methods are the easiest to address as this basically amounts to reporting every instruction to be one byte larger than before. These are updated to report the correct size, and to take the larger instruction opcode into account when reading the instruction operands. While QuickInterp does not have a working garbage collector, it is not the use of two-byte opcodes that ended up breaking the garbage collector; instead it was broken by the superinstructions themselves (discussed in section 5.3.5). Back within the interpreter, changes to the dispatcher are straightforward. Performing a dispatch on two bytes is done by reading two bytes from the bytecode stream and shifting them together: uint16_t opcode = (pc[1] << 8) | pc[0]. It is not possible to read the two bytes at once as there is no reason for the two addresses (pc[0] and pc[1]) to be aligned to a 16-bit boundary. The dispatching table is increased in size to 2^16 entries, but other than that dispatching is mostly left unchanged. It is only when dealing with all the instruction handlers that things get a little more tricky. Let us revisit the fload instruction handler from section 5.3.3, which is shown in Listing 5.5.

CASE(_fload):
    SET_STACK_SLOT(LOCALS_SLOT(pc[1]), 0);
    UPDATE_PC_AND_TOS_AND_CONTINUE(2, 1);   // "and continue" = jump back to the dispatch loop

Listing 5.5: The fload instruction handler in OpenJDK Zero

Two locations need to be changed here: the pc[1] and the “2” argument in the call to UPDATE_PC_AND_TOS_AND_CONTINUE(2,1) (see Listing 5.5). pc[1] is the instruction operand at the memory location of the program counter plus one. The “1” in pc[1] is due to the one-byte length of the instruction opcode, but now that the opcode has grown, it must be replaced with pc[2]. Likewise, the UPDATE_PC_AND_TOS_AND_CONTINUE(x,y) call needs to be made with x = 3 to reflect the larger instruction size. This was done for every instruction handler, enabling the interpreter to work with two-byte opcodes.

Code stretch

Existing bytecode can now no longer be executed as it does not use the expected two bytes for opcodes. Our approach keeps the existing instruction handlers at their original opcodes; for example, iload in binary was 0001 0101 and now becomes 0000 0000 0001 0101. This means that we effectively have to insert 0000 0000 in front of every instruction opcode. The obvious place to do this is in the JVMTI agent discussed earlier. Note how this widening operation is essentially using big-endian encoding. The encoding scheme is fairly independent of the target machine architecture (the opcodes are not aligned, so native machine access cannot be used anyway), so we were free to choose between big-endian and little-endian for the two-byte opcode representation. Using big-endian encoding brings one massive advantage: the inserted 0000 0000 is effectively a nop instruction – an instruction which does nothing. The nop instruction in JVM bytecode has 0 as its opcode, and no instruction operands. In other words, inserting 0000 0000 in front of every instruction opcode effectively does not change the code, even for VMs which have not been modified to expect two-byte opcodes; it is just that half the instructions they are executing now do nothing. While compatibility with existing VMs is a nice gimmick, this also makes it possible to use existing libraries to simply add a nop instruction before every single existing instruction, which has the same effect as inserting the zeros but saves us from having to write a custom bytecode reader and writer.

We opted to use a third-party bytecode manipulation library called the Java Native Instrumentation Framework (JNIF) [12] – a C++ library created to modify bytecode in a JVMTI client by the Software and Programmer Efficiency Research Group (“sape”) from the University of Lugano in Switzerland. A C++ library is necessary because JVMTI is a native interface, and common bytecode manipulation libraries like ASM are written in Java. JNIF is a high-level library: it parses every instruction, creating an object graph that includes where jumps go within the program. It allows the insertion of new instructions at any point in the program, and updates all jump target locations (regular jumps, conditional jumps, exception handlers, etc.) automatically, which saved us a lot of time. We used the nop trick and added a nop instruction in front of every existing instruction, letting the library deal with updating jump targets. We only had to make some very minor changes to the library for a few cases where jumps targeted the second byte of a two-byte opcode instead of the first byte.
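
To illustrate the transformation itself, the following is a minimal sketch of code stretching using ASM's tree API. It is only an illustration: the actual implementation uses JNIF from C++ inside the JVMTI agent, and the method names here follow ASM rather than anything in QuickInterp.

import org.objectweb.asm.Opcodes;
import org.objectweb.asm.tree.AbstractInsnNode;
import org.objectweb.asm.tree.InsnList;
import org.objectweb.asm.tree.InsnNode;
import org.objectweb.asm.tree.MethodNode;

final class CodeStretchSketch {

    // Prefixes every real instruction with a nop, so that once the interpreter
    // reads two-byte opcodes the inserted 0x00 becomes the high byte of the
    // widened opcode. The library recomputes all jump offsets when the class is
    // written back out, which is the bookkeeping JNIF performs in the real VM.
    static void stretch(MethodNode method) {
        InsnList instructions = method.instructions;
        for (AbstractInsnNode insn = instructions.getFirst(); insn != null; insn = insn.getNext()) {
            if (insn.getOpcode() >= 0) {   // labels, frames and line numbers have no opcode
                instructions.insertBefore(insn, new InsnNode(Opcodes.NOP));
            }
        }
    }
}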

Dealing with long methods

The JVM has a rather low maximum method length: methods are limited to 2^16 bytes [7] of bytecode. When stretching a method near this limit, it may go over it. During testing, only one such case was identified: a generated class, part of Apache FastMath, containing large arrays of constants which were initialized in a very large static initializer method. Instead of trying to lift the 2^16 size limit, which would be very complex, we instead decided to cut the Apache FastMath static initializer into two methods that both initialize part of the array. While by no means an elegant solution, it has no runtime impact and as such it does not endanger the results of QuickInterp, and at the same time it was far less time consuming to implement.

5.3.5 Superinstruction placement in Java

Given the complexity of the runtime substitution algorithms, we decided to implement these in Java instead of in C++ to speed up development. Furthermore, developing in Java gives us access to a larger array of existing bytecode manipulation tools like ObjectWeb’s ASM (ASM is not an abbreviation). In this section we discuss the interface between each runtime substitution algorithm and the VM, and how Java is called from the JVMTI agent.

Modifying bytecode using ASM

To make swapping runtime substitution algorithms as seamless as possible, all algorithms implement a common interface with just two methods: one to set the superinstruction set (called once just after VM startup), and one to transform a list of instructions. Java is called from the JVMTI agent using the Java Native Interface (JNI). On the Java side, the bytes are parsed by the JVM bytecode manipulation library ASM. This tool is used to parse each method to create the list of instructions that is given to each runtime substitution algorithm. The list of instructions is an ASM list of instruction nodes read from the bytecode by ASM. ASM has various features for reading and transforming bytecode, including creating an object graph from the bytecode of a method using the ASM tree API. This object graph representation simplifies adding, removing or modifying instructions, and can be converted back into a compliant JVM class after all manipulations have been performed. The object graph representation has one object for each instruction, called the instruction node, holding the opcode of that instruction together with the instruction operands. Each instruction node links to the next and previous instruction node as they occurred in the bytecode, forming a linked list. Jumps are represented as special “label” instructions, and instructions which perform jumps (e.g. a goto) hold a reference to the label instruction that they jump to, instead of an actual offset. Only when transforming back to bytecode is the jump target resolved to an actual jump offset, which permits the insertion of instructions between a jump instruction and its target instruction without having to manually update the jump offset.
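
As a concrete reference point, a sketch of what this two-method interface could look like is given below. The method names follow the description above, while the parameter types (ASM's InsnList and a SuperinstructionDefinition descriptor) are assumptions made for illustration only.

import java.util.List;
import org.objectweb.asm.tree.InsnList;

public interface RuntimeSubstitutionAlgorithm {

    // Hypothetical descriptor of one superinstruction (its opcode and the sequence it replaces).
    interface SuperinstructionDefinition { }

    // Called once, just after VM startup, with the superinstruction set the interpreter supports.
    void setInstructions(List<SuperinstructionDefinition> instructionSet);

    // Substitutes superinstructions into one method; the instruction list is modified in place.
    void convert(InsnList instructionList);
}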

We modified ASM to work exclusively with two-byte opcodes. The bytecode that it reads has already been code stretched by the JVMTI agent, so all instruction opcodes are already two bytes in size. Furthermore, we added support for a special superinstruction node. This node wraps around an existing instruction node and copies the instruction operands of the existing instruction node. However, it has its own superinstruction opcode, and when written to the bytecode stream it will emit the superinstruction opcode together with the instruction operands of the wrapped node. To place a superinstruction, a runtime substitution algorithm simply removes the instruction node at the start of the superinstruction. Then, the algorithm must insert a superinstruction node at that location wrapping around the original instruction. This is all that is necessary; the “tail” of the superinstruction (the other instructions that are still part of the superinstruction) must remain in the object graph. The return type of the substitution method is void – the runtime substitution algorithms modify the object graph in-place. The runtime substitution algorithm must take care to ignore ASM label instruction nodes, and some (like the shortest-path algorithm) may even use the label nodes as they indicate a possible incoming jump at that location. However, we will take a deep dive into the inner workings of these algorithms in section 5.6.
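
The placement step can be sketched as follows. The SuperinstructionNode class below is only a stand-in for the custom node described above (the real node re-emits the wrapped instruction's operands when the method is written back to bytecode), and extending ASM's InsnNode is an assumption made to keep the sketch compact.

import org.objectweb.asm.tree.AbstractInsnNode;
import org.objectweb.asm.tree.InsnList;
import org.objectweb.asm.tree.InsnNode;

final class SuperinstructionPlacement {

    // Stand-in for the custom node type: carries the superinstruction opcode and
    // keeps a reference to the original head instruction whose operands it copies.
    static class SuperinstructionNode extends InsnNode {
        final AbstractInsnNode wrapped;

        SuperinstructionNode(int superOpcode, AbstractInsnNode wrapped) {
            super(superOpcode);
            this.wrapped = wrapped;
        }
    }

    // Only the head instruction is swapped for the wrapper; the tail instructions
    // of the superinstruction stay in the object graph untouched.
    static void place(InsnList instructions, AbstractInsnNode head, int superOpcode) {
        instructions.insertBefore(head, new SuperinstructionNode(superOpcode, head));
        instructions.remove(head);
    }
}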

Avoiding recursion

The runtime substitution algorithms are executed within the same VM as the application. This creates circular dependencies for VM core classes. For example, in order to place superinstructions into java.lang.Object, the runtime substitution algorithm is called, which requires java.lang.Object to be loaded, and so on. This problem is solved with a cache: a folder of classes that already contain superinstructions. By using these classes, the circular class loading cycle is broken while all classes can still receive superinstructions. Classes are written to the cache with a file name based on their original hash (before superinstructions or code stretching) as discussed in section 5.7.4.
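
A minimal sketch of the lookup is given below, assuming a cache directory named class-cache and reusing the identityName helper sketched earlier; the directory name, the file extension and the call into the substitution code are assumptions (how the cache is populated is covered in section 5.7.4).

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class ClassCacheSketch {

    // Returns superinstruction-substituted bytes, preferring a cached copy so the
    // circular dependency on the substitution code itself is broken.
    static byte[] loadWithCache(String className, byte[] originalBytes) throws IOException {
        // the file name is derived from the original class content, before code stretching
        Path cached = Path.of("class-cache", ClassIdentity.identityName(className, originalBytes) + ".class");
        if (Files.exists(cached)) {
            return Files.readAllBytes(cached);   // already contains superinstructions
        }
        return substituteInJava(originalBytes);  // hypothetical call into the Java algorithms
    }

    private static byte[] substituteInJava(byte[] classBytes) {
        throw new UnsupportedOperationException("stand-in for the JNI call into the substitution code");
    }
}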

5.3.6 Conclusion

In this section the integration between the QuickInterp design and an actual JVM has been discussed, explaining how the various facets of our superinstruction architecture are wired up in OpenJDK Zero while providing motivation for our choices. We have seen how a JVMTI agent is used as a staging ground for performing modifications on classes. Within the interpreter, we discussed how instruction handlers can be concatenated and how writes to the top-of-stack and program counter variables can be coalesced. The VM was modified to work with two-byte opcodes to support more instructions, with the JVMTI agent stepping up to convert all input classes to this two-byte format in a process called code stretching. Furthermore, we discussed a Java API for runtime superinstruction substitution, explaining how this API is invoked and how the bytecode of a class makes its way to the runtime substitution algorithm. Finally, we saw how the circular class loading problem caused by doing substitution in Java is solved by using a class cache.

5.4 Profiling in practice

In section 4.3 the profiling system for QuickInterp was designed. We discussed how profiling has to be done in a smart way – writing every opcode to disk is not feasible for long-running applications. As such, profiling was designed to use only select counters, saving only the information that is needed for static evaluation as discussed in section 4.3.2. In this section we will discuss the implementation of profiling using special “profiling” instructions for these counters. Section 5.4.1 discusses how special profiling instructions are implemented, while section 5.4.2 describes how the profile is written to disk. Finally, section 5.4.3 wraps up the implementation of profiling.

5.4.1 Specialized Instructions

Considering the existence of the rather extensive infrastructure to manipulate bytecode instructions needed to support superinstructions, it made a lot of sense to implement profiling by means of simply adding special “profiling” instructions. Profiling is enabled with a VM flag (-XX:+UnlockExperimentalVMOptions -XX:+EnableProfiler), and is mutually exclusive with the use of superinstructions. With this option enabled, an extra step is performed in the JVMTI agent after code stretching. Section 4.3.2 introduced two types of counters:

Local Execution Counters count the number of executions of a given bytecode instruction. These are implemented with a simple profile instruction which counts how many times it is executed. This instruction is placed before the instruction to be counted.

Branch Counters count how many times a conditional branch instruction ends up performing the jump. These are implemented with special “profiling” variants of each conditional jump instruction.

The JNIF library is used to place or modify the instructions. All conditional jump instructions are traced and replaced with the opcode of their special “profiling” variants to provide the branch counters. For every traceable conditional jump instruction, a profiling variant must be available. Furthermore, at all places in the bytecode where a local execution counter is expected, the special profile counter instruction is inserted.

Profiling information

Just replacing an instruction is not enough: when the interpreter is executing a profiling instruction, it must somehow be able to tell at which location it is executing to update the correct counter. As such, a profiling information table is maintained with extra information about each profiling location. Each profiling instruction gets its own unique 4-byte profiling id, which is an index into the profiling information table. When adding profiling instructions to the bytecode, a new profiling id gets generated and put in that table together with the exact location identifier of that profiling id. This location identifier consists of five parts:

1. The type of counter (local execution counter or branch counter)
2. The fully qualified name (FQN) of the class, e.g. java.lang.String
3. The hash code of that class (see section 5.3.3)
4. The name of the method including the type signature
5. The bytecode index of the original instruction within the method

This way, with a given profiling id it is always possible to trace the original location of that instruction. The last item – the bytecode index – requires a bit more discussion. When adding profiling instructions, the bytecode index of all following instructions will shift. As such, this refers to the original bytecode index within the method, prior to modifying any instructions. The profiling ids themselves must be available to the interpreter when executing a profiling instruction, and as such it makes a lot of sense to make them available as regular instruction operands. Each profiling instruction takes not just its regular instruction operands (e.g. the profiling variant of ifnull still needs a branch target as its instruction operand), but also the 4-byte profiling id. This profiling id is inserted into the bytecode stream after the regular instruction operands to minimize the amount of changes required for the profiling variant of the jump instruction. For example, “ifnull <branch offset>” now becomes “ifnull_p <branch offset> <profiling id>” in the bytecode stream (where ifnull_p is the opcode of the profiling version of ifnull). Since classes can be loaded concurrently by different threads, inserting profiling instructions must be thread-safe. The profiling table itself is implemented as a thread-safe concurrent hash

table implementation using the ConcurrentHashTable (utilities/concurrentHashTable.hpp) C++ template class available in the OpenJDK source tree. New profiling ids are generated from a shared counter that is incremented using compare-and-set atomic integer operations.

Instruction handlers

With all the substitution done by the JVMTI agent, all conditional branch instructions have been replaced with their branch counting counterparts, and the bytecode is littered with local execution counter instructions. The local execution counter instruction is the simplest: it reads the 4-byte profiling id from the bytecode stream, then uses that 4-byte profiling id to find the accompanying record in the profiling table. The profiling table entry for a given profiling id does not only contain the location identifier, it also contains an 8-byte counter. Since the same method can be executed at the same time by multiple threads, compare-and-set integer operations are used to increment this counter atomically. Whereas the local execution counters are implemented with a new instruction handler for the profile instruction, all conditional jump handlers need to be copied and modified to both count the number of executions and perform the functionality of the regular conditional jump instruction.

#define INSTR_ifnull(pc, offset)                            \
    const bool cmp = (STACK_OBJECT(offset-1) == NULL);      \
    if (cmp) {                                               \
        PROFILE_CONDITIONAL(WITH_PROFILE, (pc)+4);           \
        int skip = (int16_t)Bytes::get_Java_u2((pc)+2);      \
        address branch_pc = (pc);                            \
        SET_PC_AND_TOS(pc+skip, offset-1);                   \
        CONTINUE;                                            \
    }

Listing 5.6: Code implementing the ifnull instruction handler (simplified)

In the code, this is done by adding a call to a new PROFILE_CONDITIONAL macro to each conditional jump instruction. This macro increases the counter for that instruction, and an example of this can be seen in Listing 5.6. The macro takes a pointer to the profiling id, which is computed by taking an offset from the program counter. Furthermore, the macro is placed in such a way that it is only executed when the conditional jump handler is about to jump. When the macro is invoked with WITH_PROFILE set, it will include code that reads the profiling id from the bytecode stream and uses that to update the associated counter. To set this flag, the code generator capable of generating the superinstruction handlers (which will be discussed in detail in section 5.7) is used to generate these special conditional jump handlers with WITH_PROFILE set. This design makes use of the existing superinstruction infrastructure, simplifying the task of creating profiling variants of all conditional jump instructions.

5.4.2 The profile on disk

When the VM exits, it has to somehow serialize all the profiling information gathered to make it available for the superinstruction set construction algorithm. The values of the counters are written to a text file, while each class (prior to being instrumented with profiling instructions) is written to disk.

app.profile file format

When the profiling VM is terminated, it writes the value of each counter to a file called app.profile. We decided on using a simple textual representation of each location identifier (as seen in section 5.4.1) combined with the number of executions of that location. Using a textual format makes it easy to inspect the profile, which in turn simplifies debugging and development.

...
lec n.l.s.b.primes.PrimeBenchmark!13468019222500153135!isPrime(I)Z:0 150000
jc n.l.s.b.primes.PrimeBenchmark!13468019222500153135!isPrime(I)Z:8 12345
jc n.l.s.b.primes.PrimeBenchmark!13468019222500153135!isPrime(I)Z:18 3026641110
jc n.l.s.b.primes.PrimeBenchmark!13468019222500153135!isPrime(I)Z:30 3026641110
lec o.o.jmh.results.BenchmarkTaskResult!3896100830435373639!<init>(JJ)V:0 15
lec o.o.jmh.results.BenchmarkTaskResult!3896100830435373639!<init>(JJ)V:6 15
lec o.o.jmh.results.BenchmarkTaskResult!3896100830435373639!<init>(JJ)V:34 15
...

Listing 5.7: Snippet of an app.profile file

In Listing 5.7 a fragment of the profiling file can be seen. This profiling file was obtained by running the primes benchmark that will be discussed in section 6.4. The package names have been abbreviated to fit on the page. This example already reveals the format of each profiling line. One line is used per counter, and the location data and counter data are concatenated into a string as follows: <counter type> <class FQN>!<class hash>!<method name and signature>:<bytecode index> <counter value>. The entire file that Listing 5.7 is a part of weighs in at just 112,810 lines, summing up to 12.6 megabytes. This shows how our implementation of the profile saves us from the egregious disk space demand that writing every opcode to disk would have entailed, as predicted in section 4.3.
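
To make the format concrete, the following is a small sketch of parsing one such line in Java; the regular expression and the ProfileCounter holder are illustrative and not the interpreter generator's actual parser.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ProfileLineParser {

    // e.g. "lec n.l.s.b.primes.PrimeBenchmark!13468019222500153135!isPrime(I)Z:0 150000"
    private static final Pattern LINE =
            Pattern.compile("(lec|jc) (.+)!(\\d+)!(.+):(\\d+) (\\d+)");

    static ProfileCounter parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed profile line: " + line);
        }
        ProfileCounter c = new ProfileCounter();
        c.branchCounter = m.group(1).equals("jc");          // "jc" = branch counter, "lec" = local execution counter
        c.classFqn = m.group(2);
        c.classHash = Long.parseUnsignedLong(m.group(3));   // hashes are rendered as unsigned 64-bit values
        c.method = m.group(4);                              // method name including type signature
        c.bytecodeIndex = Integer.parseInt(m.group(5));     // index before instrumentation
        c.count = Long.parseLong(m.group(6));
        return c;
    }

    static final class ProfileCounter {
        boolean branchCounter;
        String classFqn;
        long classHash;
        String method;
        int bytecodeIndex;
        long count;
    }
}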

Class dump

Having the counters available is not enough to construct the superinstruction set – the bytecode must also be available. While it is possible to provide the superinstruction set construction tool with a copy of the application code of the profiled application, this would inevitably miss some code. The JVM itself generates code at runtime for lambdas and generates code for reflection, which would be missed. Furthermore, applications themselves may also generate code at runtime. Finally, tracing all the locations from which classes may be loaded is hard, as code may come from nested JARs, libraries scattered across the system, or from somewhere in the standard library. As such, we decided to write every class to disk when profiling is enabled. With profiling enabled, the JVMTI agent not only instruments each class with profiling instructions, but also writes the class to disk in a folder called class-dump. This is the class before receiving special profiling instructions, but after code stretching. Writing the classes after code stretching is key for static evaluation, as this is the input of the runtime substitution algorithms. To write the classes to disk, the same naming mechanism is used to identify the class. The class is dumped to a file, with the file name being the fully qualified name of the class combined with the hash code.

5.4.3 Conclusion

In this section we saw the implementation of the profiling as designed in section 4.3. The implementation is refreshingly simple compared to the rather complex design. We showed how specialized profiling instructions can be used to implement the various counters, and how their instruction handlers were implemented. Finally, we discussed how the profile is written to disk, making it ready to be used in the next section to finally construct a superinstruction set.

5.5 Constructing the superinstruction set

With the profiler implementation out of the way, it is time to construct the superinstruction set. To reiterate the design of the superinstruction set construction algorithm as it was discussed in section 4.4: an iterative optimization algorithm is used to select the optimal superinstruction candidates, while using a technique called static evaluation to score each randomly-generated candidate superinstruction set. The higher the score, the better the superinstruction set.

The superinstruction candidates that this iterative optimization algorithm picks from are derived from the profiled bytecode. Furthermore, the superinstruction candidates derived directly from the profile are called “base superinstruction candidates” and keep their profiling information. These base superinstruction candidates are then used to generate more superinstruction candidates by a preprocessor, which cuts the base superinstruction candidates into smaller pieces, increasing the number of superinstruction candidates available. To test the quality of a particular set of superinstruction candidates (a candidate superinstruction set), the runtime substitution algorithm is applied to all code that was profiled in a procedure called static evaluation, where the quality of a candidate superinstruction set is tested without having to rerun the profiled application. This is where the base superinstruction candidates come in again: since they are a direct derivative of the profiled bytecode and have kept their associated profiling information, they can be used to score the candidate superinstruction set. The runtime substitution algorithm is tasked with substituting superinstructions into the base superinstruction candidates, now using them as code into which superinstructions have to be substituted rather than as members of a superinstruction set. Then, the profiling information from the base superinstruction candidate is used to determine the number of instruction dispatches, and how this compares to the original version without superinstructions. The number of instruction dispatches saved makes up the score of a particular superinstruction set, and the higher the score the better the superinstruction set. In section 5.5.1 we discuss the architecture of the tool we created to generate the superinstruction set. We give an overview of the architecture of this tool, and how it can be used to read the profile, optimize the superinstruction set and finally generate the interpreter. While part of the same tool, the topic of interpreter generation is mostly reserved for section 5.7, as this is where an in-depth look will be taken at the various VM structures that need to be generated. Section 5.5.2 focuses on how the profile is loaded. The file format discussed in the previous section (section 5.4) is read from disk, and the (base) superinstruction candidates are created. Once the right data structures are in place, section 5.5.3 discusses how the optimization algorithm is implemented, including how the algorithm interfaces with the runtime substitution algorithms and how exactly to count the dispatches saved by the placed superinstructions. Section 5.5.4 finally wraps up the construction of the superinstruction set, summarizing and reflecting back on the topics just mentioned.
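
The scoring step can be summarized in a short sketch. Everything here, including the candidate type and its methods, is illustrative; the interface with setInstructions and convert matches the common substitution-algorithm interface described in section 5.3.5, and the real implementation is the subject of section 5.5.3.

import java.util.List;
import org.objectweb.asm.tree.InsnList;

final class StaticEvaluationSketch {

    // Illustrative view of a base superinstruction candidate: a piece of profiled
    // code that knows, per instruction, how often it was executed.
    interface BaseCandidate {
        InsnList copyInstructions();                   // work on a copy; the profile data stays intact
        long dispatchesSavedBy(InsnList substituted);  // saved dispatches, weighted by execution counts
    }

    static long score(List<BaseCandidate> baseCandidates, RuntimeSubstitutionAlgorithm algorithm) {
        long saved = 0;
        for (BaseCandidate candidate : baseCandidates) {
            InsnList code = candidate.copyInstructions();
            algorithm.convert(code);                   // place superinstructions in-place
            saved += candidate.dispatchesSavedBy(code);
        }
        return saved;                                  // higher score = better superinstruction set
    }
}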

5.5.1 Interpreter Generator tool implementation

As mentioned in the introduction, the whole workflow, starting with a profile and going all the way to a generated interpreter ready to be compiled, is implemented in one tool: the interpreter generator. This tool goes through three steps, implementing the whole interpreter generation workflow:

1. Reading the profile
   • Parsing command line arguments
   • Reading the profile, parsing the various counters in the app.profile file and associating this information with bytecode instructions from the profile class dump (discussed in section 5.5.2)
   • Creating the base superinstruction candidates from the profiled bytecode
   • Creating the full set of superinstruction candidates by using a preprocessor selected via command line arguments
2. Finding the optimal superinstruction set
   • Running the iterative superinstruction optimization algorithm including static evaluation (section 5.5.3)
3. Generating the interpreter
   • Generating the interpreter files from the optimized superinstruction set (discussed in section 5.7)

Using the Interpreter Generator

Before diving into the software architecture of the interpreter generator, let us first consider how the tool is used and what options it has. The workflow implemented by the tool is a relatively straightforward implementation of the design as was shown in Figure 4.2 back in section 4.4.4.

$ java -jar InterpreterGenerator.jar \
    /path/to/folder/with/profile \
    --preprocessor=fully.qualified.preprocessor.ClassName \
    --mutator=fully.qualified.mutator.ClassName \
    --si=fully.qualified.superinstruction.substitution.algorithm.ClassName \
    --instructionSetSize=20 \
    --time=12000 or --evaluations=1000000

Listing 5.8: Invocation of the Interpreter Generator tool

Listing 5.8 shows an invocation of the Interpreter Generator. The tool requires the following parameters:

• The path to the folder with the profile (app.profile file and class-dump folder).
• The preprocessor. This component creates more superinstruction candidates from the base superinstruction candidates.
• The active mutator. This component is responsible for generating the initial superinstruction set candidate, as well as deriving new superinstruction set candidates from a previous best.
• The implementation of the runtime superinstruction (“si”) substitution algorithm.
• The maximum superinstruction set size.
• An end condition as to when to terminate the optimization process: this can either be after a set time (in milliseconds) or after a number of evaluations.

The interpreter generator will continue to optimize the superinstruction set until either “evaluations” number of evaluations have been completed or until “time” number of milliseconds have elapsed. It will then take this superinstruction set and generate all the interpreter files for it.

Example invocation

In Listing 5.9, the full output of a run of the tool can be seen. Here, the steps that it takes are clearly visible: first the profile is read and the preprocessor is applied (here a special “no-op” preprocessor is active, which does not add extra superinstruction candidates). Various statistics about the read profile are shown, including the theoretical max skip. This is the number of instruction dispatches that can be saved if every single base superinstruction candidate were to be turned into a superinstruction, and this is also the theoretical maximum score discussed in section 4.4.3. We then see the optimizer configuration, with the chosen runtime substitution algorithm (called strategy) and the mutator used to derive new superinstruction set candidates. The optimizer automatically detects the number of hardware threads available, and has detected 16 threads in this case. The target superinstruction set size (target IS size) is 4, and the initial number of dispatches determined by static evaluation is 27,269,049,458. Then, the tool starts optimizing. This optimization takes 180 seconds, as set by the command line parameters. As it optimizes, the tool prints every second what its current best is, and how its current best stacks up against the theoretical maximum. For brevity, we cut most of this output from the listing. For this application, within just one second it finds a superinstruction set already achieving a significant reduction in the number of instruction dispatches (85.63 % of the theoretical maximum). It continues on until it reaches the time limit, stopping at a reduction of 21,194,124,039 instruction dispatches. In the final step the VM structures are generated and written to disk. We will cover those in detail in section 5.7.

$ java -jar InterpreterGenerator.jar \
    primes \
    --preprocessor=jdk.internal.vm.si.impl.profiler.processor.NoOpProcessor \
    --mutator=jdk.internal.vm.si.impl.optimizer.mutator.IndependentInstructionSetMutator \
    --si=jdk.internal.vm.si.runtime.ShortestPathAlgorithm \
    --instructionSetSize=4 \
    --time=180000

=== ProfileReader configuration ===
App profile path: primes/app.profile
Class dump path: primes/class-dump
Pre-processor: class jdk.internal.vm.si.impl.profiler.processor.NoOpProcessor

=== Reading profile... ===
=== Profile statistics ===
Unique segments: 5,067
Candidate superinstr: 5,936
Profiled instructions: 27,268,590,774
Total instructions: 32,838
Longest segment: 1,026
Average segment length: 4.5
Average weight: 1,195,889.739
Minimum dispatches: 6,059,573,308
Theoretical max skip: 21,209,017,466 (77.778%)

=== Optimizer configuration ===
Strategy: jdk.internal.vm.si.runtime.ShortestPathAlgorithm
Mutator: jdk.internal.vm.si.impl.optimizer.mutator.IndependentInstructionSetMutator
Time: 180,000 ms
Maximum evaluations: 9,223,372,036,854,775,807
Threads: 16
Target IS size: 4
Initial: 27,269,049,458

=== Optimizing... ===
Score: 18,161,292,502 (85.63 %), heat: 0.49, evaluations: 1,025
Score: 21,188,993,170 (99.906 %), heat: 0.481, evaluations: 2,561
... snip ...
Score: 21,194,124,039 (99.93 %), heat: 0, evaluations: 260,353
Final score: 21,194,124,039 / 21,209,017,466 (99.93 %) computed in 261,889 evaluations and 180,124 ms

=== Writing interpreter files ===
InstructionSetDefinition size: 212
Writing to jdk11/src/hotspot/share/interpreter/generated/bytecodes.generated.hpp
Writing to jdk11/src/hotspot/share/interpreter/generated/bytecodes.definitions.hpp
Writing to jdk11/src/hotspot/share/interpreter/generated/bytecodes.length.hpp
Writing to jdk11/src/hotspot/share/interpreter/generated/bytecodeInterpreter.handlers.hpp
Writing to jdk11/src/hotspot/share/interpreter/generated/bytecodeInterpreter.jumptable.hpp
Writing instructions to jdk.internal.vm.si.generator/superinstructions.list

Listing 5.9: Example invocation and output of the Interpreter Generator tool

Software architecture

[Class diagram: the Runtime Si Algorithms package (jdk.internal.vm.si.runtime) contains the RuntimeSubstitutionAlgorithm interface (setInstructions, convert) with its implementations TripletAlgorithm, TreeAlgorithm, ShortestPathAlgorithm and EquivalenceAlgorithm, plus the bundled ASM library (jdk.internal.vm.si.runtime.asm). The Interpreter Generator (jdk.internal.vm.si.impl) contains the Profile Reader package (jdk.internal.vm.si.impl.profiler) with the Profiler and ProfileReader classes and the Preprocessor interface with implementations in jdk.internal.vm.si.impl.profiler.processor (e.g. PrefixProcessor, EquivalenceAwareProcessor), the Optimizer package (jdk.internal.vm.si.impl.optimizer) with the InstructionSetOptimizer class and the InstructionSetMutator interface with implementations in jdk.internal.vm.si.impl.optimizer.mutator (e.g. PrefixMutator, IndependentMutator), and the Generator package (jdk.internal.vm.si.impl.generator) with the InterpreterGenerator and CacheBuilder classes.]

Figure 5.1: Simplified class diagram of the Interpreter Generator

In Figure 5.1, a simplified class diagram of the interpreter generator tool can be seen, showing the most important packages and classes within each package. All runtime algorithms (package Runtime Si Algorithms) are bundled into the interpreter generator tool: these need to be available to do static evaluation. This package is part of the QuickInterp VM standard library and available at runtime, and contains all the runtime substitution algorithms and other machinery (like ASM) to enable placing superinstructions into bytecode. We see the common interface (RuntimeSubstitutionAlgorithm) as mentioned in section 5.3.5, with the various implementations of that interface. The main method in the Profiler class is where execution starts in the Interpreter Generator package. In that package, the three steps from the start of this section are clearly visible: (1) the profile parsing phase in package Profile Reader, (2) the optimizer in package Optimizer and finally (3) the code generator in package Generator. In the next few sections we will discuss each package in more detail. Section 5.5.2 discusses how the profile is loaded, while section 5.5.3 discusses the optimization loop. And, as mentioned, the generation of the files is its own topic as it requires more discussion on how the VM works, which is reserved for section 5.7. There is one additional class that is not connected to anything: the CacheBuilder in the

Generator package. This class has its own main method, and it can be invoked as its own program. We will cover the functionality of this class in section 5.7; however, this class can be invoked to generate a class cache consisting of classes that already contain superinstructions. It is not part of the regular interpreter-generator workflow.

5.5.2 Loading the profile

The profile loading phase has to parse the profile, create the base superinstruction candidates and use these to create the full set of superinstruction candidates for the optimizer to use.

Creating base superinstruction candidates

The first step in loading the profile is simply parsing all the bytecode from the profile and cutting it into base superinstruction candidates. All bytecode instructions are visited in the order in which they occurred in the method, and base superinstruction candidates are built up incrementally. The places to “cut” and start a new base superinstruction candidate are based on the characteristics of the instruction (e.g. invokedynamic cannot be part of a superinstruction, so the candidate is cut there). For each bytecode instruction, a set of flags is provided that captures the characteristics of the bytecode instruction. Appendix A.2 contains a full listing of all the flags that exist, and which instructions have which flags. For the cutting of bytecode to create superinstruction candidates however, only two flags are important:

NO_SUPERINSTRUCTION This instruction cannot be part of a superinstruction. Usually these are instructions which leave the interpreter and expect to return by means of a dispatch, and since it is not possible to dispatch to within a superinstruction, these cannot be part of a superinstruction.

TERMINAL When placed in a superinstruction, no instruction would be executed after this instruction as it either unconditionally jumps away or exits the method.

When visiting an instruction with the NO_SUPERINSTRUCTION flag, the bytecode is cut at that point. The bytecode instructions gathered up until that point are turned into a base superinstruction candidate if there are at least two of them (single instructions make no sense to turn into a superinstruction or to use for evaluation); otherwise they are simply discarded. The instruction with the NO_SUPERINSTRUCTION flag does not end up in any superinstruction candidate. Besides the NO_SUPERINSTRUCTION flag, there is another group of instructions whose characteristics impact base superinstruction candidate construction: those with the TERMINAL flag. An instruction with the TERMINAL flag is not necessarily excluded from being part of a superinstruction, but instructions with this flag always jump away, so there is no use in continuing the superinstruction after them. As such, while the instruction with the TERMINAL flag is allowed into the base superinstruction candidate, the bytecode is cut right after it to start a new superinstruction candidate.
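In Java, this cutting pass could look roughly as follows. This is a minimal sketch under simplifying assumptions: the Instruction type and the class name are hypothetical stand-ins, not the actual QuickInterp classes, and only the two flags discussed above are modelled.

import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal instruction model: only the two flags that matter here.
record Instruction(String mnemonic, boolean noSuperinstruction, boolean terminal) {}

class CandidateCutter {
    // Cuts a method's instruction stream into base superinstruction candidates.
    static List<List<Instruction>> cut(List<Instruction> method) {
        List<List<Instruction>> candidates = new ArrayList<>();
        List<Instruction> current = new ArrayList<>();
        for (Instruction insn : method) {
            if (insn.noSuperinstruction()) {
                // The offending instruction is excluded; emit what we have so far.
                flush(candidates, current);
            } else if (insn.terminal()) {
                // A terminal instruction may still end a superinstruction,
                // but nothing useful can follow it, so cut right after it.
                current.add(insn);
                flush(candidates, current);
            } else {
                current.add(insn);
            }
        }
        flush(candidates, current);
        return candidates;
    }

    private static void flush(List<List<Instruction>> out, List<Instruction> current) {
        // Single instructions are not useful as superinstruction candidates.
        if (current.size() >= 2) out.add(new ArrayList<>(current));
        current.clear();
    }
}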

Retaining profiling information

The information from the profile is sparse – while it is possible to derive the number of executions for every instruction, it is not stored that way directly. The base superinstruction candidates carry, for each instruction, two values that are needed to enable static evaluation: the number of executions, and the number of incoming jumps. A simple algorithm is run over the bytecode of each method to fill in these values for every instruction. Listing 5.10 and Listing 5.11 show an example of profile data, which gives some insight into the algorithm. In Listing 5.10, three profile counters are shown in a simplified version of the app.profile file format. Listing 5.11 shows a sequence of bytecode instructions corresponding to the profile counters. In this listing, instruction operands are omitted for brevity; since this is not a superinstruction, they are still present in the actual bytecode. Blue arrows show how the if_icmplt

1 lec example(II)V:1 300
2 jc example(II)V:5 300
3 jc example(II)V:9 150

Listing 5.10: Simplified example of a profile file, using line numbers instead of bytecode offsets

1 iload_1
2 iload_2
3 imul
4 iconst_2
5 if_icmplt
6 iinc
7 bipush
8 iload_2
9 goto
10 return

Listing 5.11: Bytecode corresponding to the counters from Listing 5.10

Instruction   Executions              Incoming jumps
iload_1       300                     0
iload_2       300                     0
imul          300 + 150 = 450         150
iconst_2      450                     0
if_icmplt     450                     0
iinc          450 − 300 = 150         0
bipush        150                     0
iload_2       150                     0
goto          150                     0
return        150 − 150 + 300 = 300   300

Table 5.1: Table showing all interpolated executions and incoming jump values from Listing 5.11. Values in bold were present in the profile.

jumps to the return instruction, and likewise how the goto jumps to the imul instruction. Observe that for both of these jump instructions, jump counters are available in the profile. The output of the algorithm can be seen in Table 5.1. Values in bold are read directly from the profile file. The first instruction in the method must always have a local execution counter indicating how many times the method was executed, and that is also the case here. This local execution counter counts 300 executions, which is assigned to the first instruction (iload_1). No branching happens, so the interpreter moves on to the next instruction. As no branching has happened, the second instruction (iload_2 on line 2) also gets assigned an execution count of 300. We see this process at more places – without a mention in the profile by means of a local execution counter or jump counter, each instruction simply inherits the executions of its preceding instruction. The imul at line 3 is the branch target of the goto instruction from line 9. Since the branch counter reveals that this branch was taken 150 times, this 150 can be added to the number of executions coming in from the previous instruction: 300 + 150 = 450. Branch counters do not just impact the instructions they jump to, but also the instructions they jump away from, and this is visible in the iinc on line 6. The if_icmplt instruction jumps away 300 times, leaving just 450 − 300 = 150 executions at the iinc instruction. With these rules, the whole table can be filled in – most instructions just inherit the number of executions from their preceding instruction, with jumps adding to or subtracting from that. Some instructions always branch in a particular way; an example of this is the return instruction. While not shown in this example, the return instruction is always followed by an implicit local execution counter of 0 – the only way for execution to resume after a return instruction is by a jump coming from somewhere else. This also brings up another aspect of the algorithm that is missing from this example: a profile file may contain multiple local execution counters, and will do so for certain jumps that cannot be part of a superinstruction (e.g. exception handlers, switch-statements).

When encountering a local execution counter anywhere, the algorithm will simply take this value for the "executions" column, instead of computing it. The table is computed for each base superinstruction candidate, and made available for static evaluation. The implementation of static evaluation will be discussed in section 5.5.3.
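The interpolation rules described above can be condensed into a short sketch. This is a simplified illustration, not the actual QuickInterp code: instructions are indexed 0..count-1, and the sparse "lec" and "jc" counters are assumed to have been grouped into maps keyed by instruction index beforehand.

import java.util.Map;

class ProfileInterpolator {
    // Fills in executions[] and incomingJumps[] for one method.
    // localExecution holds "lec" counters; jumpsFrom/jumpsTo hold "jc" counters.
    static void interpolate(int count,
                            Map<Integer, Long> localExecution,
                            Map<Integer, Long> jumpsFrom,
                            Map<Integer, Long> jumpsTo,
                            long[] executions, long[] incomingJumps) {
        long previous = 0;         // executions of the preceding instruction
        long previousOutgoing = 0; // executions that jumped away from it
        for (int i = 0; i < count; i++) {
            long incoming = jumpsTo.getOrDefault(i, 0L);
            incomingJumps[i] = incoming;
            if (localExecution.containsKey(i)) {
                // A local execution counter overrides any computed value.
                executions[i] = localExecution.get(i);
            } else {
                // Inherit from the preceding instruction, minus the executions
                // that jumped away from it, plus the jumps landing here.
                executions[i] = previous - previousOutgoing + incoming;
            }
            previous = executions[i];
            previousOutgoing = jumpsFrom.getOrDefault(i, 0L);
        }
    }
}

Running this over the example of Listing 5.11 with the counters of Listing 5.10 reproduces Table 5.1: for instance, the return instruction receives 150 − 150 + 300 = 300 executions.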

Creating more superinstruction candidates

As mentioned in the design in section 4.4.4, it is the job of the preprocessor component to generate more superinstruction candidates from the base superinstruction candidates, and this component is configurable with the --preprocessor command line argument. There is one kind of preprocessing that is always done on the base superinstruction candidates regardless of the active preprocessor: splitting them up based on where incoming jumps happen. An example of this can be seen in Listing 5.12 and Listing 5.13.

1 iload            (Part X)
2 iload
3 (incoming jump)
4 imul             (Part Y)
5 iload
6 iconst_2
7 (incoming jump)
8 iadd             (Part Z)
9 irem

Listing 5.12: Short sequence of JVM bytecode instructions containing incoming jumps

1 superXYZ:
2 iload
3 iload
4 imul
5 iload
6 iconst_2
7 iadd
8 irem
9 superXY:
10 iload
11 iload
12 imul
13 iload
14 iconst_2
15 superYZ:
16 imul
17 iload
18 iconst_2
19 iadd
20 irem
21 superX:
22 iload
23 iload
24 superY:
25 imul
26 iload
27 iconst_2
28 superZ:
29 iadd
30 irem

Listing 5.13: Superinstruction candidates derived from Listing 5.12

In this example, the only base superinstruction candidate is superXYZ, which is all of Listing 5.12. The other superinstruction candidates are all the different cuts one can make by splitting superXYZ at one or more incoming jump markers. This transformation is needed to address the need for superinstructions at incoming jumps. Imagine that superXYZ was placed as a superinstruction into the code of Listing 5.12 – in this case, none of the incoming jumps can make use of that superinstruction, as it is not possible to jump to within a superinstruction. This processing step ensures that there are superinstructions available to be placed when starting at any of the incoming jump markers. And if the superYZ superinstruction exists, it competes with the whole superXYZ superinstruction, as superYZ can also be used by executions that start from the top. As such, our processing step does not just generate superinstruction candidates starting at each incoming jump, it

makes all combinations of consecutive "parts" delimited by incoming jumps. After this transformation is completed, the new list of superinstruction candidates is handed over to the preprocessor. We implemented three preprocessors:

NoOpProcessor which does nothing, i.e. it returns its input as the processed list of superinstruction candidates (this one is omitted from the class diagram in Figure 5.1).

PrefixProcessor generates every possible prefix of each superinstruction candidate. For example, when provided with a-b-c-d, it will add a-b-c and a-b. This preprocessor was mostly added for the triplet-based substitution algorithm, as that algorithm is not capable of placing longer superinstructions in some cases (a sketch of this preprocessor follows after this list).

EquivalenceAwareProcessor will remove superinstruction candidates that are equivalent.
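As a rough illustration of the Preprocessor interface from Figure 5.1, the PrefixProcessor could be sketched as follows. The candidate representation is simplified to a plain list of opcodes; the real classes carry more information (profiling data, flags), so this is an assumption-laden sketch rather than the actual implementation.

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified: a candidate is just its sequence of opcodes.
interface Preprocessor {
    List<List<Integer>> process(List<List<Integer>> candidates);
}

// Adds every proper prefix of length >= 2 of each candidate, deduplicating the result.
class PrefixProcessor implements Preprocessor {
    @Override
    public List<List<Integer>> process(List<List<Integer>> candidates) {
        Set<List<Integer>> out = new LinkedHashSet<>(candidates);
        for (List<Integer> candidate : candidates) {
            for (int len = 2; len < candidate.size(); len++) {
                out.add(new ArrayList<>(candidate.subList(0, len)));
            }
        }
        return new ArrayList<>(out);
    }
}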

We decided against splitting on outgoing conditional jump instructions, even though there is something to be said for it. Placing a superinstruction that contains a conditional jump instruction with a high probability of leaving the superinstruction (e.g. one that did so in every execution according to the profile) degrades the value of that superinstruction – it does not save as many dispatches, as the jump always leaves the superinstruction early. However, there is little harm done here – placing a shorter superinstruction is not "better", it is just "not worse". As such, we do not provide the superinstruction set optimization algorithm with these shorter superinstructions.

5.5.3 Optimizing the instruction set

With the superinstruction candidates available, it is time to discuss how the optimization loop works. The implementation of the loop is a straightforward implementation of the design from section 4.4.5: a "current-best" candidate superinstruction set is kept, which is randomly mutated a set number of times (256 times). These mutations are evaluated, and if any one of them beats the current best then it becomes the new current best. This is repeated until the end condition is met: either the timeout set by --time expires, or the number of evaluations set by --evaluations has been reached.

Iterative optimization algorithm

The InstructionSetMutator from Figure 5.1 sits at the heart of the iterative optimization algorithm. It implements the operations "seed" and "mutate", which generate new candidate superinstruction sets – either from scratch (seed), or based on the current best superinstruction set candidate (mutate), as designed in section 4.4.5. There are two implementations of the mutator: the IndependentMutator and the PrefixMutator. The IndependentMutator does not consider any relation between superinstruction candidates, nor does it generate new superinstruction candidates by itself – all superinstruction candidates that it provides come from the set of available superinstruction candidates. The PrefixMutator is a special mutator developed for the triplet-based substitution algorithm. This mutator ensures that each superinstruction added to its candidate superinstruction set either has a length of 2, or has its prefix present. To create the prefixes, the PrefixMutator actually generates new superinstruction candidates that might not be available. While it is possible to use the PrefixProcessor to make the prefixes available, this would cause many evaluation loops to be "wasted": the mutator would pick a candidate superinstruction but not its prefix, adding no value as the triplet-based substitution algorithm cannot place superinstructions without a prefix. The implementation is concurrent: as seen in the example invocation from section 5.5.1, it detects and makes use of all hardware threads available on the platform on which it is run. A thread pool is created with the detected number of hardware threads, and tasks are submitted to this pool. For each iteration of the optimization algorithm, it first spawns 256 tasks (a fixed number), each of which is submitted to the thread pool. These tasks perform as much work as possible concurrently: they mutate the current best candidate superinstruction set, use this to run static evaluation on all base superinstruction candidates, and then provide the static

evaluation score combined with the mutated candidate superinstruction set as their result. The main evaluation loop waits for each task to complete, and once all tasks have completed the best result is picked from the 256. Finding the best from a list of evaluated candidate superinstruction sets and updating the "current best" field is the only part that does not happen concurrently. If the end condition has not been met yet, the cycle repeats.
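The shape of this loop can be sketched as follows. The ScoredSet record and the mutate/evaluate placeholders are hypothetical stand-ins for the real mutator and static evaluator; only the evaluation-budget end condition is modelled, not the --time timeout.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.stream.IntStream;

class OptimizerLoop {
    record ScoredSet(List<List<Integer>> superinstructions, long dispatches) {}

    static ScoredSet optimize(ScoredSet seed, long evaluationBudget,
                              ExecutorService pool) throws Exception {
        ScoredSet best = seed;
        long evaluations = 0;
        while (evaluations < evaluationBudget) {
            final ScoredSet current = best;
            // Spawn a fixed batch of 256 independent mutate-and-evaluate tasks.
            List<Future<ScoredSet>> batch = IntStream.range(0, 256)
                    .mapToObj(i -> pool.submit(() -> evaluate(mutate(current))))
                    .toList();
            evaluations += batch.size();
            // Only this selection step runs sequentially on the main thread.
            for (Future<ScoredSet> f : batch) {
                ScoredSet candidate = f.get();
                if (candidate.dispatches() < best.dispatches()) best = candidate;
            }
        }
        return best;
    }

    // Placeholders for the mutator and the static evaluator described above.
    static ScoredSet mutate(ScoredSet parent) { return parent; }
    static ScoredSet evaluate(ScoredSet candidate) { return candidate; }
}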

Interfacing with ASM

While in essence the same, the implementation of static evaluation is a little more complex than what the design shows in section 4.4.3. One problem is invoking the runtime substitution algorithms on the base superinstruction candidates – these must be made available as an ASM object graph while still retaining their profiling information to allow comparing the before and after situation. Interfacing with the runtime substitution algorithm is made complex by the fact that regular ASM instruction nodes contain instruction operands. However, these are not present in the base superinstruction candidates. Fortunately, runtime substitution algorithms do not actually use the instruction operands. Additionally, the profiling information must be retained to allow counting how many dispatches are saved. To solve these problems, two new surrogate instruction nodes are added to ASM. Since we tweaked ASM to work with two-byte opcodes, all ASM source code is available in our project and can be modified. The instruction nodes we added are:
• A surrogate instruction node with only an opcode but no operands. Besides the opcode, it also stores profiling information regarding the number of executions.
• A surrogate jump target ("label node" in ASM), similarly holding profiling information.
The base superinstruction candidate is converted into a sequence of surrogate instruction nodes and surrogate jump targets, which creates a compatible format for each runtime substitution algorithm to perform substitution on.

Tracing paths to count dispatches

Another problem is how to deal with instructions that become unreachable as a result of placing a superinstruction. For example, in the instruction sequence "a b c", if one were to place the superinstruction "a-b", instruction "b" itself would be unreachable and thus never executed. Static evaluation needs to pick up on this and correctly score the substitution based on how many instruction dispatches are saved by never executing "b". This problem stems from the way superinstructions are added to the ASM object graph: placing a superinstruction replaces only the first instruction node, leaving the rest in the object graph. To repeat the example: in the instruction sequence "a b c", placing superinstruction "a-b" would lead to the instruction sequence "a-b b c" in the object graph, with "b" still present. For substitution, this is no problem, as the handler for a-b skips the separate b instruction anyway, and a runtime substitution algorithm may also substitute "b". To make matters worse, "b" may be reachable by a jump from elsewhere in the program, and this needs to be counted as well. As such, our static evaluation algorithm works by tracing paths through the output of the substitution algorithm using an abstract interpreter. A path is a sequence of instructions within the base superinstruction candidate (before and after passing through the substitution algorithm) that can be turned into a superinstruction. Multiple such paths can exist within one base superinstruction candidate, as the superinstruction candidate can be entered at multiple locations by jumps, each creating its own path. By working its way through every such path in the code – while using the profiling data to find out how many times that path was taken – the static evaluation algorithm is able to count how many dispatches are needed in the new bytecode with the superinstructions substituted in. The path tracing abstract interpreter is started once at the beginning of the base superinstruction candidate, and also once from every jump target within the base superinstruction candidate. For each starting location, it simulates execution while tracking the number of executions that it represents. This number can only go down – if execution jumps anywhere, it goes off the path, meaning

that those executions are no longer part of the path that it is tracking. If such a jump lands somewhere else within the base superinstruction candidate then this is not a problem, since that jump target is visited as its own path. And if it lands in another base superinstruction candidate, it will get processed as a path of that candidate when the evaluation gets there. To keep score, the abstract interpreter keeps a shared dispatch counter, counting how many instruction dispatches are needed to cover all the paths within the base superinstruction candidate. For each instruction that is visited, it adds the number of executions that the current path represents. This shared dispatch counter is the final score. When visiting a superinstruction, the abstract interpreter properly skips the instructions that follow. For example, if a superinstruction has a length of four, it will not count the next three instructions, counting the dispatch for just one. However, the abstract interpreter still looks at the profiling data for each superinstruction part – if it is a conditional jump instruction that jumps out, it will still adjust the number of executions the current path represents.
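A heavily simplified sketch of one such path trace is shown below. It assumes that, after substitution, every node knows how many original instructions it covers and how many executions jump away at it (the real implementation derives both from the surrogate nodes and per-part profiling data); the Node and DispatchCounter names are hypothetical.

import java.util.List;

// Hypothetical post-substitution view of one base superinstruction candidate:
// each node covers one original instruction, or several for a superinstruction.
record Node(int length, long jumpsAway) {}

class DispatchCounter {
    // Counts dispatches along one path, starting at index 'start' with the
    // number of executions that enter the path there. The caller runs this
    // once from the top and once per jump target, summing the results.
    static long count(List<Node> nodes, int start, long pathExecutions) {
        long dispatches = 0;
        int i = start;
        while (i < nodes.size() && pathExecutions > 0) {
            Node node = nodes.get(i);
            dispatches += pathExecutions;        // one dispatch per visit of this node
            pathExecutions -= node.jumpsAway();  // executions that leave the path here
            i += node.length();                  // a superinstruction skips its parts
        }
        return dispatches;
    }
}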

5.5.4 Conclusion

In this section, we discussed the interpreter generator tool, which reads the profile, determines the optimal superinstruction set and generates the interpreter – although how the generation works has to wait until section 5.7. The tool accepts a variety of command-line arguments to allow testing different configurations and parameters. We discussed how the tool reads the profile: the profile data is interpolated and made available in the base superinstruction candidates. We saw how the base superinstruction candidates themselves are cut based on instruction flags, and how preprocessing happens. Then, we discussed the implementation of the iterative optimization algorithm, how concurrency is used and how a pluggable mutator is used to efficiently work with the prefix constraint of the triplet-based runtime substitution algorithm. We discussed how the static evaluation interfaces with ASM – how surrogate instruction nodes enable us to present an ASM object graph to each runtime substitution algorithm, even when the bytecode itself is not used. Finally, we saw how static evaluation counts the dispatches saved by tracing "paths" through the output of the runtime substitution algorithm, wrapping up the optimization phase of the interpreter generation process. This leaves just section 5.7 to explain how exactly one goes from an optimized set of superinstructions to a compilable VM equipped with these superinstructions. But before we go there, it is time to discuss how runtime substitution is implemented in the next section.

5.6 Superinstruction placement

Superinstruction placement occurs in Java as discussed in section 5.3.5. That section left the problem of placement at a JNI call from the JVMTI agent to "some" Java method. In this section we will trace this method within JVM-managed code to the runtime substitution algorithm and then through it, until a converted class with superinstructions comes out the other end. We will discuss how the active runtime substitution algorithm and superinstruction set are loaded, and how each of the runtime substitution algorithms is implemented. Section 5.6.1 starts out with an overview of the software architecture, discussing the components and what exactly the JVMTI agent calls. Section 5.6.2 discusses the implementations of the various runtime substitution algorithms. These algorithms are a relatively straightforward implementation of their design, which already leads to the wrap-up of this section in section 5.6.3.

5.6.1 Overview

The implementation of the runtime is split into two pieces: one part belongs to the java.base module², which is the core module of the JVM. java.base contains classes like java.lang.Object and java.lang.String, which ensures that it is loaded with the very first code. The second part is

² "Module" in this context refers to the Java language feature introduced in Java 9.

its own module: jdk.internal.vm.si.runtime. Keeping it independent from java.base provides much quicker recompilation times, as java.base is rather large and takes a long time to compile.

[Figure content: the java.base module containing the ClassBytecodeFormatConverterProxy and the ClassBytecodeFormatConverter interface (both exposing convertClass(input : byte[]) : byte[]), and the Runtime Si Algorithms module (jdk.internal.vm.si.runtime) containing the ASM library (ClassReader, ClassWriter), InstructionSetConfiguration, ClassBytecodeFormatConverterImpl, and the RuntimeSubstitutionAlgorithm interface (setInstructions(instructionSet), convert(instructionList)) with its implementations ShortestPathAlgorithm, TreeAlgorithm, TripletAlgorithm and EquivalenceAlgorithm.]


Figure 5.2: Simplified class diagram of the runtime

Figure 5.2 is a class diagram showing how the various components are laid out. Note that parts of this class diagram were already visible earlier in section 5.5.1, however in this version the runtime is shown in more detail. The class diagram shows the two modules, and how java.base only contains two classes. The classes are:

ClassBytecodeFormatConverter is an interface for classes that can convert bytecode. Bytecode is provided and returned in its most basic format: the byte array. This is the bytecode for the whole class – not just one method or one Code segment within a class.

ClassBytecodeFormatConverterProxy is the entry point called by the JVMTI agent when it wants to have a class converted. The convertClass method here is static, and this class delegates to an implementation of the ClassBytecodeFormatConverter interface.

The other module (jdk.internal.vm.si.runtime) contains the implementation of the ClassBytecodeFormatConverter interface. But before we go there, let us first discuss how the ClassBytecodeFormatConverterProxy obtains an implementation of this interface: the two modules are joined via the service loading API. The java.base module declares that it uses the interface, and the jdk.internal.vm.si.runtime module supplies an implementation in its module-info.java file. This architecture also allows extra JARs on the class path to provide an alternative implementation of the whole runtime superinstruction architecture, which was a useful feature during development. In the jdk.internal.vm.si.runtime module we see the implementation of the interface: ClassBytecodeFormatConverterImpl. This class is the gateway to one of the runtime superinstruction substitution algorithms – it converts the bytes of the class to an ASM object graph, one for each method. It runs this object graph through the active runtime substitution algorithm before serializing it back to a byte array. To do this conversion from and to a byte array, it uses the ASM ClassReader and ClassWriter, as indicated in the class diagram.
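In outline, the proxy's use of the service loading API could look like the sketch below. The interface and method names are taken from the class diagram; the body is an assumption-laden illustration of the general ServiceLoader pattern, not the actual QuickInterp source.

import java.util.ServiceLoader;

// Interface as shown in Figure 5.2.
interface ClassBytecodeFormatConverter {
    byte[] convertClass(byte[] input);
}

final class ClassBytecodeFormatConverterProxy {
    // Resolve the implementation supplied by jdk.internal.vm.si.runtime once.
    private static final ClassBytecodeFormatConverter CONVERTER =
            ServiceLoader.load(ClassBytecodeFormatConverter.class)
                    .findFirst()
                    .orElseThrow(() -> new IllegalStateException(
                            "no ClassBytecodeFormatConverter implementation available"));

    // Static entry point invoked (via JNI) by the JVMTI agent.
    static byte[] convertClass(byte[] input) {
        return CONVERTER.convertClass(input);
    }
}

In the Java module system, the consuming module would declare a "uses" clause for the interface and the providing module a "provides ... with ..." clause in its module descriptor, which is what joins the two pieces together.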

The ClassBytecodeFormatConverterImpl must somehow know which runtime substitution algorithm to call. Additionally, the runtime substitution algorithm must have the list of all available superinstructions. This information is written to disk by the interpreter generator tool (from section 5.7) by serializing an instance of InstructionSetConfiguration using regular Java serialization. This class is a very simple plain Java object, holding a list of instruction definitions as well as the name of the runtime substitution class, and is deserialized by ClassBytecodeFormatConverterImpl on startup. ClassBytecodeFormatConverterImpl uses the information from this class to construct an instance of the runtime substitution algorithm and to provide the algorithm with the list of superinstructions. This approach keeps the runtime substitution algorithm pluggable without requiring each substitution algorithm to reimplement the conversion from a byte array to an object graph and back.

5.6.2 Algorithm implementations

As per the design (section 4.6), three algorithms are implemented: the triplet-based substitution algorithm, the tree-based substitution algorithm, and finally the shortest-path algorithm. The class diagram from Figure 5.2 already reveals a specialization of the shortest-path algorithm: the equivalence algorithm. The implementation of each of these algorithms is discussed below. The triplet-based substitution algorithm and the tree-based substitution algorithm are both greedy algorithms (they walk the instructions once and make every substitution that is found), but differ in their data structures: the triplet-based algorithm uses a table of triplets, while the tree-based algorithm uses a tree data structure. These are implemented in basically the same way, but with different data structures for finding new superinstructions. We implemented these data structures in their own classes (not shown in Figure 5.2). The shortest-path algorithm adds logic for when to substitute a superinstruction, refraining from placing superinstructions that make better superinstructions unreachable. Since it imposes no new requirements on the way superinstructions are stored, our implementation simply reuses the tree data structure from the tree-based substitution algorithm. Finally there is the equivalence-based variant of the shortest-path substitution algorithm. This algorithm is a variant of the shortest-path algorithm, but with a new data structure that enables it to find equivalent instructions. Since this algorithm is still a shortest-path algorithm, we implemented it as a specialization (class extension) of the shortest-path algorithm, replacing the use of the tree data structure with a hash table.

Triplet-based substitution algorithm

The triplet-based algorithm has the most straightforward implementation. As discussed in section 4.5.2, this algorithm is a reimplementation of the algorithm used by Ertl et al. [4], and works by repeatedly looking up two consecutive instructions in a table. If an entry is found, the first instruction is substituted with the superinstruction associated with that table entry and the search is repeated. The table can only express superinstructions with two opcodes; longer superinstructions are present in the table by referring to shorter superinstructions. The intricacies of this algorithm were described in section 4.5.2. We use the ASM object graph to do substitutions on, and superinstructions are implemented with special superinstruction nodes as discussed in section 5.3.5. A superinstruction node does not replace all instructions that make up the superinstruction; only the first instruction is replaced. This adds a bit more complexity to the algorithm, as it isn't just replacing two consecutive instructions. At the core of the triplet-based substitution algorithm are two nested substitution loops. The outer loop marks the beginning of the superinstruction, and the inner loop marks the end, each iterating over the instruction nodes. The inner loop always starts at the first instruction following the instruction the outer loop points to. For example, in the program "a b c" the outer loop would start at "a", while the inner loop would start at "b" – the first instruction after "a". What the outer and inner loop point to is used to construct the lookup for a superinstruction. If a superinstruction is found, it is substituted in at the location of the outer loop, and the inner

loop is moved forward by one instruction to start the search for a superinstruction based on the original one. If no superinstruction is found, the outer loop moves to the next instruction and the inner loop is restarted. The outer loop continues until all instructions in the program have been considered as a start instruction of a superinstruction. The lookup is handled by a separate class which implements the triplet mechanism: the TripletSuperinstructionTable. There are two aspects to this class: it has to construct the table from a list of superinstructions when it starts, and it has to perform lookups. Internally, the class does not use a flat table of triplets like the implementation by Ertl et al. [4]. Instead, the class uses a hash map, where the keys are tuples mapping to a third value. The tuples contain the start and end instruction opcodes of the superinstruction, while the value the tuple maps to is the opcode of the superinstruction. Lookup is implemented as a simple hash map search for a matching tuple, returning the superinstruction opcode, if any. Table construction is done by grouping all superinstructions into categories based on their length. First, all superinstructions with a length of two are added to the hash table (the shortest length). Then, all superinstructions with a length of three are added, looking up for each superinstruction the prefix superinstruction in the hash table, which must exist for this algorithm to work. This process repeats until all categories of superinstructions have been inserted. As we will see, this makes the triplet-based substitution algorithm the simplest to implement. Its data structures are straightforward and substitution is not greatly complicated by the "peephole optimization" aspect, where it substitutes the same instruction node multiple times.
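The table construction and lookup could be sketched along the following lines. The class and record names mirror the description above but the code itself is a simplified illustration; it assumes, as the algorithm requires, that every longer superinstruction has its prefix available.

import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TripletSuperinstructionTable {
    record Superinstruction(int opcode, List<Integer> parts) {}
    record OpcodePair(int first, int second) {}

    // (first opcode, second opcode) -> opcode of the resulting superinstruction.
    private final Map<OpcodePair, Integer> table = new HashMap<>();

    TripletSuperinstructionTable(List<Superinstruction> superinstructions) {
        // Insert the shortest superinstructions first so that the prefix of
        // every longer superinstruction is already present when it is added.
        superinstructions.stream()
                .sorted(Comparator.comparingInt(s -> s.parts().size()))
                .forEach(this::add);
    }

    private void add(Superinstruction s) {
        List<Integer> parts = s.parts();
        if (parts.size() == 2) {
            table.put(new OpcodePair(parts.get(0), parts.get(1)), s.opcode());
        } else {
            // The prefix (all parts except the last) must itself be a known
            // superinstruction; the new entry extends it by one instruction.
            int prefix = lookupSequence(parts.subList(0, parts.size() - 1));
            table.put(new OpcodePair(prefix, parts.get(parts.size() - 1)), s.opcode());
        }
    }

    private int lookupSequence(List<Integer> parts) {
        int opcode = parts.get(0);
        for (int i = 1; i < parts.size(); i++) {
            opcode = table.get(new OpcodePair(opcode, parts.get(i)));
        }
        return opcode;
    }

    // The lookup used during substitution; returns null when no entry exists.
    Integer lookup(int firstOpcode, int secondOpcode) {
        return table.get(new OpcodePair(firstOpcode, secondOpcode));
    }
}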

Tree-based substitution algorithm

Next up is the implementation of the tree-based substitution algorithm. This algorithm alleviates the main pain point of the triplet-based substitution algorithm by not using intermediate superinstructions to construct longer superinstructions. Instead, this algorithm navigates down a tree in parallel to reading the instruction nodes, using the tree to save the progress "out of band". The substitution algorithm itself is just as straightforward as the triplet-based algorithm, except this time an additional bit of data is kept: a cursor. This cursor indicates the current position in the tree, and can tell if a superinstruction is found for the current cursor or if a longer one is available. The cursor can point to intermediate nodes within the tree, which need not have a superinstruction associated with them. Just like the triplet-based algorithm, the tree-based substitution algorithm uses two loops in the same way. The cursor is reset for each iteration of the outer loop – this is when a new instruction is picked to start finding superinstructions from. The inner loop is incremented not only when a possible superinstruction is found, but also when the cursor indicates that it has not yet reached a leaf in the tree. Besides the cursor, the best (longest) superinstruction found so far is also kept, and it is substituted in when the cursor indicates it no longer points to a node. The tree data structure is also implemented in its own class: SuperinstructionTree (likewise not shown in Figure 5.2). The tree construction is straightforward: instructions are added to the tree with an "add" operation. This operation adds a superinstruction by inserting all intermediate nodes that are missing (those not carrying a superinstruction), or by converting an intermediate node into a node that carries a superinstruction opcode. No preprocessing or sorting of the superinstructions is required. The SuperinstructionTree has just one public method called lookup(), which takes no arguments and returns a new cursor object. The cursor can be moved by providing it an instruction opcode, and it starts out referencing the root of the superinstruction tree. Furthermore, the cursor has methods to test if it still points to a valid node in the tree and to read the superinstruction opcode of the current node, if available. Cursor objects themselves are not thread-safe and as such are never shared across threads, while the SuperinstructionTree itself is thread-safe after construction.
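The tree and its cursor can be sketched as follows. Method names follow the description above where the text gives them (lookup()); the rest of the shape is a simplified assumption, e.g. -1 standing in for "no superinstruction at this node".

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SuperinstructionTree {
    private static final class TreeNode {
        final Map<Integer, TreeNode> children = new HashMap<>();
        int superOpcode = -1; // -1: intermediate node without a superinstruction
    }
    private final TreeNode root = new TreeNode();

    // Add a superinstruction given the opcodes of its parts, creating any
    // missing intermediate nodes along the way.
    void add(List<Integer> parts, int superOpcode) {
        TreeNode node = root;
        for (int opcode : parts) {
            node = node.children.computeIfAbsent(opcode, k -> new TreeNode());
        }
        node.superOpcode = superOpcode;
    }

    Cursor lookup() { return new Cursor(root); }

    // The cursor walks the tree in lockstep with the instruction stream.
    static final class Cursor {
        private TreeNode node;
        private Cursor(TreeNode root) { this.node = root; }

        // Advance by one instruction; returns false once we fall off the tree.
        boolean advance(int opcode) {
            node = (node == null) ? null : node.children.get(opcode);
            return node != null;
        }
        boolean pointsToNode()        { return node != null; }
        boolean hasSuperinstruction() { return node != null && node.superOpcode != -1; }
        int superinstructionOpcode()  { return node.superOpcode; }
    }
}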

[Figure content: a bytecode sequence of iload, imul, iadd, iconst_2 and irem instructions drawn as a graph from a start node to an end node, with the instructions on the edges and with superinstruction edges super1 through super4 acting as shortcuts over multiple instruction edges.]


Figure 5.3: Copy of Figure 4.4 with the direction of the edges reversed, showing a bytecode program as a graph with instructions on the edges

Shortest path substitution algorithm

As mentioned in the introduction, the shortest-path substitution algorithm differs in when superinstructions are substituted. It simply reuses the SuperinstructionTree class to find which superinstructions are available for a given location. The design discussed how the algorithm should not just place all superinstructions on the shortest path from the start to the end of the program, it should also place all superinstructions starting from each jump target, as jump targets are also valid places to enter the program and should receive optimal substitutions as well. This means determining the shortest path multiple times, with one thing in common: the end point. As such, it is more efficient to do the shortest-path search in reverse, starting at the end point and finding the quickest way to get from the end to each instruction in the program. The superinstruction edges used to get to each instruction the quickest are the ones that need to be substituted in. This substitution algorithm is not a greedy algorithm that makes substitutions as it finds superinstructions; instead it first organizes all possible substitutions into a graph. To do this, it deploys what is essentially the same algorithm as the tree-based substitution algorithm, but instead of actually making the substitution, the found superinstructions are inserted into this graph. The structure of this graph matches that seen earlier in the design: a graph with the instructions on the edges. The nodes in this graph represent instruction dispatches, and superinstructions are special edges that skip multiple instruction dispatches. However, this graph is made in reverse, with all the instruction edges and superinstruction edges pointing back. Figure 5.3 shows an example of such a graph. The graph is constructed in the code as a hash table, where each key is a node in the graph and maps to all the outgoing edges of that node. The instructions themselves are used as the keys in the hash table. For example, the key of the start node in Figure 5.3 is the iload 10 instruction. The next node has iload 11 as a key, and so on. This is just a straightforward mechanism for identifying the instruction dispatch nodes, as the instructions themselves are on the edges and not the nodes. The end node is identified with "null" as key. Populating the hash table is done by walking all instructions and reusing the tree algorithm. When a superinstruction is found, let X be the instruction that would be executed immediately after the superinstruction in the program, but that is not part of the superinstruction. The dispatch node identified by X is the dispatch node that the superinstruction jumps to when finished. For example, the instruction sequence "a b c" with superinstruction "a-b" would have X = c. The superinstruction "a-b" is inserted into the hash table at dispatch node "c", as an edge from dispatch node "c" to dispatch node "a". The table is further populated by also including regular instructions "in reverse", giving them the same treatment as superinstructions to construct the full graph. To continue the example, in the instructions "a b c", instruction "b" would also be registered at dispatch node "c", this time as a way to get to dispatch node "b". Since this X instruction may not exist, "null" is used as key for instructions that have no next instruction.
To continue with the example once more, the superinstruction “a-b-c” has no next instruction in the program “a b c”, so it is stored at dispatch node “null” (end) as a way to get to “a”. With the backedges table fully constructed, it can be used to find a path from the end of

the program (indicated by the "null" magic value) to the beginning of the program (the first instruction in the program). We use Dijkstra's algorithm for this, using the Java PriorityQueue implementation. However, since all the edge weights are the same (each instruction dispatch saved is equally valuable), a simple breadth-first search could also have been used. Both shortest-path implementations are very similar in implementation difficulty and algorithmic complexity, and more complex optimizations that build on the shortest-path algorithm may require different edge weights. As such, we opted to implement Dijkstra's algorithm, even though no such optimizations were implemented for this thesis. With the shortest path from the end node to each dispatch node in the program determined, all selected superinstruction edges are placed. This does not take into account that some superinstructions may not be reached; however, there is no harm in substituting in a superinstruction that is never executed. Such a superinstruction does not have any negative (or positive) performance impact. Moreover, static evaluation was implemented in a way that can deal with this by tracing all paths through the code, and unreachable superinstructions are simply not on any path.
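The reverse search over the backedge table can be sketched as follows. Because every edge saves exactly one dispatch, the sketch uses breadth-first search, which yields the same result as Dijkstra's algorithm with uniform weights; the Edge record, the END sentinel (standing in for the "null" key) and the class name are hypothetical.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ReverseShortestPath {
    record Edge(Object toNode, Object instruction, boolean isSuperinstruction) {}

    static final Object END = new Object(); // stands in for the "null" end-of-program key

    // backedges maps each dispatch node to all backward edges leaving it.
    static Map<Object, Edge> chooseEdges(Map<Object, List<Edge>> backedges) {
        Map<Object, Edge> chosen = new HashMap<>();   // best incoming edge per dispatch node
        Set<Object> visited = new HashSet<>();
        Deque<Object> queue = new ArrayDeque<>();
        queue.add(END);
        visited.add(END);
        // Every edge saves exactly one dispatch, so plain breadth-first search
        // finds the shortest way from the end to every dispatch node.
        while (!queue.isEmpty()) {
            Object node = queue.poll();
            for (Edge edge : backedges.getOrDefault(node, List.of())) {
                if (visited.add(edge.toNode())) {
                    chosen.put(edge.toNode(), edge);
                    queue.add(edge.toNode());
                }
            }
        }
        // The entries in 'chosen' that are superinstruction edges are the ones
        // that get substituted into the method.
        return chosen;
    }
}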

Equivalence algorithm

The equivalence algorithm is a subclass of the shortest-path substitution algorithm. It overrides the lookup mechanism, and uses its own data structure: the EquivalenceSuperinstructionMap (not shown in Figure 5.2). In this data structure, each superinstruction is placed in a hash map identified by a lookup key to help speed up searching. Multiple superinstructions may share the same lookup key, and as such the algorithm uses graph coloring to narrow the search down and find the final match. To construct this superinstruction map data structure, each superinstruction is converted to a superinstruction graph. This is the graph consisting of atomic blocks as shown in the design in section 4.6.3. To test equivalence, two sequences of code need to be transformed to a graph and equality is tested using graph coloring. The construction of each superinstruction graph can happen at VM startup time, which is what our implementation does. While not exactly tuned for performance, each superinstruction is stored under a lookup key to help performance a little. This lookup key is simply an unordered set of the contents of each atomic block. Our equivalence algorithm is not capable of determining equivalence when the same instructions are not present in the two code fragments that are being compared. In fact, our algorithm requires that exactly the same atomic blocks are present in both code fragments, and only allows reordering of whole blocks. As such, this lookup key provides an effective tool to speed up a lookup. When the EquivalenceSuperinstructionMap finds a match, it provides not just the found superinstruction definition, but also the output of the graph coloring step. The output of this step gives a bijection, mapping the instructions as they are in the input program to where they are in the superinstruction. The EquivalenceAlgorithm uses this information to reorder the instructions in the input program. The equivalence algorithm needs to be able to perform not just equivalence-based lookups, but also identity-based lookups. This was discussed in the design to deal with jump targets: finding an equivalent superinstruction may require reordering instructions, and if a jump instruction jumps to somewhere within the reordered instructions, it would no longer execute the same code. As such, when trying to find a superinstruction for a sequence of code that contains a jump target, the substitution algorithm must perform a "regular" lookup based on identity and not on equivalence. To support this, two data structures are kept: besides the superinstruction map, an instance of the superinstruction tree as used by the shortest-path algorithm is also constructed. Since the superinstruction tree performs much better, it is always used as a sort of first line of defense. Only when a superinstruction is not found in the superinstruction tree does the algorithm consult the equivalence-aware superinstruction map. Figure 5.4 shows a DOT graph that was generated by our implementation. The two shown superinstruction graphs are equivalent, as is shown by the matching colors assigned to each block. "Color" is represented as a number on each node, and not actually rendered as a color. Note that some blocks are empty – as the code these graphs are derived from leaves the stack balanced, the first and last block are empty. Since the (empty) last block is a FULL_BARRIER block, it links to

[Figure content: two superinstruction graphs of five blocks each (b0–b4 and b5–b9), where every block lists its attributes (FULL_BARRIER, STORE_BARRIER and LOAD_BARRIER variants), the instructions it contains (iinc, iconst_0, iload_1, iload_2, iload_3, dup, imul, iadd, istore_2, istore_3) and the color assigned by the graph coloring step; matching blocks in the two graphs carry the same color.]


Figure 5.4: A DOT graph generated by our equivalence-aware runtime substitution algorithm of two equivalent sequences of code

all preceding blocks. As such, testing whether these two blocks have the same color is enough to test whether the whole graphs are equivalent, which they are (both have "color: 4").
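As a small illustration of the lookup-key idea, the key could be computed as an unordered multiset of the atomic blocks' contents, as sketched below. The representation (a block reduced to its instruction mnemonics) and the class name are assumptions for illustration; the real EquivalenceSuperinstructionMap carries more structure.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class EquivalenceLookupKey {
    // The key is a multiset: which atomic blocks occur, and how often, in any order.
    // Two fragments can only be equivalent under our definition if their keys match,
    // so the key prunes candidates before the expensive graph coloring test runs.
    static Map<List<String>, Integer> keyOf(List<List<String>> atomicBlocks) {
        Map<List<String>, Integer> multiset = new HashMap<>();
        for (List<String> block : atomicBlocks) {
            multiset.merge(block, 1, Integer::sum);
        }
        return multiset;
    }
}

Superinstructions are then bucketed by this key; a lookup computes the same key for the program fragment and only the superinstructions in the matching bucket go through the full coloring check.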

5.6.3 Conclusion

In this section we have seen how superinstruction placement is done in Java. The software architecture was discussed in section 5.6.1, showing how the Java substitution component is replaceable as a whole via the service API. We discussed how each substitution algorithm uses ASM, and how there is a common entry point in the jdk.internal.vm.si.runtime module which converts each method to an ASM object graph before delegating to the active runtime substitution algorithm. Section 5.6.2 went over the implementation of each algorithm, discussing how there are really two parts to each substitution algorithm: the way they store and look up superinstructions, and the way they place superinstructions. Effectively, four implementations were discussed to cover various possibilities between three ways to find superinstructions (triplet-based, tree-based and equivalence-based) and two ways to place superinstructions (greedy and shortest-path). It is up to chapter 6 to test which one performs best, but before moving on to the benchmarks, let us first consider how the superinstruction-capable interpreter is actually generated in section 5.7.

5.7 Generating the QuickInterp interpreter

With the instruction set defined and the runtime substitution algorithms in place, all that remains is generating an interpreter with the right superinstruction handlers. Section 5.3.3 already started

the discussion on how the OpenJDK interpreter works and how it can be modified with additional handlers. It was discussed that instruction handlers within the interpreter loop sometimes reuse code. Some instructions use the fall-through nature of the switch-statement to reuse instruction handlers, while others are defined by macro expansions. In this section, we will discuss how each handler is made available as its own snippet of C++ code that can be concatenated by the interpreter generator tool from section 5.5. Furthermore, we will also discuss how the interpreter generator output supports write coalescing of top-of-stack and program counter writes, as discussed in section 5.3.3. To implement write coalescing, each instruction handler within a superinstruction needs to be able to work with an offset relative to the start of the superinstruction. The way these offsets are computed is discussed in section 5.7.1, with section 5.7.2 following up on how the code generator plugs these values into generated C++ code fragments. The interpreter loop itself is not the only thing that needs to be modified, as the VM also requires information about all of its instructions. It needs the length and opcode to be able to iterate over them, and this metadata needs to be generated for each superinstruction, which is discussed in section 5.7.3. Next, we discuss the superinstruction class cache in section 5.7.4 – a component already seen back in section 5.5.1 that can be used to create a cache of classes with superinstructions to save the runtime from having to do this work. The superinstruction class cache is also the only way for the VM core classes to receive superinstructions, as certain classes (like java.lang.Object) need to be loaded before any runtime substitution algorithm can become active. Finally, this section – section 5.7 – is where we break compatibility with the garbage collector and class verifier, and section 5.7.5 discusses the implications of losing these flagship features.

5.7.1 Instruction primitives

To support generating superinstructions we introduce a new term: the instruction primitive. An instruction primitive is a sequence of C++ code that forms the part of an instruction handler that executes that instruction without writing to the top-of-stack or program counter variables, or jumping towards the next instruction (jump instructions themselves are still allowed to jump). Instruction primitives are effectively the "essence" of an instruction handler, and isolating them is required to create superinstructions. The superinstructions are simply sequences of these instruction primitives, and we also reimplement the regular VM instructions with instruction primitives. Recreating a regular VM instruction is as simple as combining one instruction primitive with a bit of generated boilerplate code. This boilerplate code completes the instruction handler, updating the top-of-stack and program counter, and jumping to the next instruction handler.

A macro for each instruction

While the next section will discuss in detail how the existing instruction handlers are converted to instruction primitives, we will first look at how to use them from a code generation point of view. Each instruction primitive is defined in its own C++ preprocessor macro: INSTR_<name>(pc, offset). The "<name>" part is replaced with the name of the instruction, e.g. INSTR_iload(pc, offset) will expand to the instruction primitive for the iload instruction. We will discuss the "pc" and "offset" parameters in a moment, but let us first consider how the instruction primitives are used to form an instruction handler. As discussed just now, it isn't just the superinstructions that make use of instruction primitives – with all the interpreter code restructured into instruction handlers, the regular VM instructions are also generated by the generator and as such also use the instruction primitives. Consider the regular pop instruction handler in Listing 5.14. This instruction handler is generated by the interpreter generator using the INSTR_pop instruction primitive.

1 CASE(_pop): {
2   INSTR_pop(pc+0,0);
3   INSTR_end_sequence(pc+2,-1);
4   puts("Overrun of pop handler");
5 }

Listing 5.14: Generated output for the regular pop instruction

What can be seen in this listing is that the generated instruction handler consists of three parts:

1. The INSTR_pop(pc+0,0) instruction primitive macro
2. An INSTR_end_sequence(pc+2,-1) macro
3. An overrun message print statement

Since the instruction primitives themselves do not jump to the next instruction (this was the whole point, and is necessary to create superinstructions), the instruction handler needs to be ended somehow and jump to the next instruction. This is done with the INSTR_end_sequence pseudo-instruction, which updates the actual values of the program counter and the top-of-stack variables, and then jumps to the next instruction. And the print statement is just there for debugging – if a handler manages to somehow not jump out and instead falls through to the next instruction, it will print “Overrun of pop handler”, which should never happen.

Creating superinstructions

Going from a regular instruction handler constructed with these primitives to a superinstruction handler with multiple instruction primitives is easy: simply add more instruction primitives. This is also where the use of "pc" and "offset" starts to come into play. The "pc" and "offset" macro parameters are needed to support write coalescing. Within the instruction primitive, "pc" is an expression that evaluates to a pointer which is taken as the program counter within that instruction primitive. The first instruction primitive always gets "pc+0", as we saw in Listing 5.14.

1 CASE(_super_aload_0_iload): {
2   INSTR_aload_0(pc+0,0);
3   INSTR_iload(pc+2,1);
4   INSTR_end_sequence(pc+5,2);
5   puts("Overrun of super_aload_0_iload handler");
6 }

Listing 5.15: Generated output for the superinstruction aload_0-iload

Listing 5.15 shows an example of the code generated for the superinstruction aload_0-iload, which can help illustrate why the macro arguments are needed. The iload instruction reads a one-byte value from the instruction stream, making its signature within the bytecode always <2-byte opcode><1-byte value>. Within a superinstruction, it cannot read this one-byte value from its standard location, which is two bytes after the program counter pointer (pc[2]). Instead, within this superinstruction it needs to read it four bytes after the program counter (pc[4]). The whole instruction primitive must be "shifted" as it were by two bytes, and this is done by providing it with a new definition of "pc": namely "pc+2". Now, it will read (pc+2)[2], which is the same as pc[4], as the data type of pc is a byte pointer. The second parameter – the "offset" – is used similarly, but now for the operand stack. The INSTR_aload_0 primitive pushes one value onto the operand stack (it writes to topOfStack[0]) but without updating the top-of-stack variable. As such, the INSTR_iload instruction – which also pushes a value onto the operand stack – must now write to topOfStack[1], as topOfStack was not updated between the primitives. This offset of "1" is communicated via the "offset" macro parameter: the instruction primitive will write to topOfStack[0 + offset].

This also explains the values provided to INSTR_end_sequence – this pseudo-instruction behaves exactly like a real instruction primitive that is an unconditional jump. It sets the "pc" value to "pc+5", and adds 2 to the "topOfStack" variable. The pop instruction handler from Listing 5.14 shows how the topOfStack offset can also be negative: the pop instruction handler removes one element from the top of the stack, so its INSTR_end_sequence operates at an offset of "-1". In section 5.7.2 we will look at the implementation of each of these macros.

Computing offsets

It is the code generator that has to compute the "pc" and "offset" arguments, as these depend on where the instruction primitive sits within the superinstruction. These values can only be computed for instruction handlers with a fixed impact on the program counter and top-of-stack variables. For each instruction primitive that can be part of a superinstruction, it is known what impact it has on the operand stack and on the program counter. These values are available in Appendix A.3. A very basic abstract interpreter constructs the superinstruction from a list of instruction primitives, keeping track of the relative program counter (starting at 0) and the relative stack pointer (also starting at 0). Each added instruction primitive modifies the relative program counter and relative stack pointer. Note that not all instructions have a fixed impact on the program counter or on the top-of-stack pointer – for example, the tableswitch does not have these values in Appendix A.3. And, looking at Appendix A.2, which is used to determine which instruction primitives to concatenate, we can also see that tableswitch cannot be part of a superinstruction anyway. But some of the instructions that have an unknown offset can be part of a superinstruction. For example, the ireturn instruction is marked as entirely unknown (unknown PC offset and unknown stack offset). This is because it is not relevant to track these sizes for terminal instructions – the ireturn will always leave the method, and as such there is no reason to track these properties for this instruction. If a superinstruction ends with ireturn, the generator omits the INSTR_end_sequence call, as the ireturn is already guaranteed to end the superinstruction, indicated by the TERMINAL flag shown in Appendix A.2. It could not even place INSTR_end_sequence if it wanted to, because it does not know the offsets after the ireturn needed to emit this statement. This is also where the overrun message statement as seen on line 5 in Listing 5.15 comes in – while testing whether INSTR_end_sequence really jumps to the next instruction is fairly trivial, not all superinstructions make use of it. Superinstructions which naturally jump out, like those ending in an ireturn or a goto, do not, and in those cases this statement is useful for asserting that the superinstruction handler does not fall through to the next handler, which would create a hard-to-debug bug.
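The generator-side bookkeeping could be sketched as follows (in Java, as the generator itself is a Java tool). The Primitive record, with its per-instruction byte length and stack delta taken from Appendix A.3, and the emitter class are hypothetical; the output format mirrors Listings 5.14 and 5.15.

import java.util.List;
import java.util.StringJoiner;

class SuperinstructionEmitter {
    record Primitive(String name, int byteLength, int stackDelta, boolean terminal) {}

    static String emit(String superName, List<Primitive> parts) {
        StringJoiner body = new StringJoiner("\n");
        int relativePc = 0;   // bytes consumed so far within the superinstruction
        int relativeTos = 0;  // net operand-stack effect so far
        boolean endsWithTerminal = false;
        for (Primitive p : parts) {
            body.add("  INSTR_" + p.name() + "(pc+" + relativePc + "," + relativeTos + ");");
            relativePc += p.byteLength();
            relativeTos += p.stackDelta();
            endsWithTerminal = p.terminal();
        }
        // A terminal last primitive (e.g. ireturn, goto) already leaves the
        // handler, so INSTR_end_sequence is omitted in that case.
        if (!endsWithTerminal) {
            body.add("  INSTR_end_sequence(pc+" + relativePc + "," + relativeTos + ");");
        }
        body.add("  puts(\"Overrun of " + superName + " handler\");");
        return "CASE(_" + superName + "): {\n" + body + "\n}";
    }
}

For the aload_0 (2 bytes, stack +1) followed by iload (3 bytes, stack +1) example, this produces exactly the "pc+0,0", "pc+2,1" and "pc+5,2" arguments seen in Listing 5.15.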

5.7.2 Instruction code as macros

Section 5.3.3 already mentioned the changes that are necessary to the instruction handlers, and we saw in the previous section how each instruction primitive gets its own macro. Implementing these macros is fairly straightforward for most instructions. The OpenJDK Zero interpreter structure already coalesces most writes within the instruction handlers themselves, and as such all that remains is removing the updates to the program counter and top of stack, and removing the jump to the next instruction.

1 CASE(_iload):   // iload and fload reuse the same handler
2 CASE(_fload):
3   SET_STACK_SLOT(LOCALS_SLOT(pc[2]), 0);
4   UPDATE_PC_AND_TOS_AND_CONTINUE(3, 1);

Listing 5.16: Original iload instruction handler

1 #define INSTR_iload(pc,offset) \
2   SET_STACK_SLOT(LOCALS_SLOT((pc)[2]), offset);

Listing 5.17: iload instruction primitive macro definition

Listing 5.16 and Listing 5.17 show the original instruction handler and the instruction primitive respectively, both for the iload instruction. The original handler shown in Listing 5.16 has already been modified to support working with two-byte opcodes instead of one-byte opcodes, and as such the only transformation that needs to be done is formatting it as an instruction primitive. To do so, the UPDATE_PC_AND_TOS_AND_CONTINUE macro is removed, and the arguments to this call (the 3 and 1) are recorded in Appendix A.3. The "pc" macro argument, as we saw in the previous section, is now an expression. This has consequences – e.g. in the expression pc[2], if "pc" is substituted with "pc + 4" it will expand to "pc + 4[2]", which does not compile. As such, all occurrences of "pc" are wrapped in brackets: "(pc)". To continue the example, after substitution it would read "(pc + 4)[2]", which is valid C++. The "offset" macro argument is also substituted in, but this argument expands to a number and not an expression. In the previous section we discussed how this argument provides an offset relative to the "topOfStack" variable. Most instruction handlers in OpenJDK Zero, however, do not interact directly with this variable, but do so via other macros like the SET_STACK_SLOT macro seen in Listing 5.17. This macro still takes an offset as argument (the second argument), and while this was set to 0 in the original handler (line 3 in Listing 5.16), it is now set to offset (line 2 in Listing 5.17). To also implement the fload instruction primitive, we simply copied the code for the iload instruction primitive, meaning that no code is reused. This pattern was repeated to write all instruction primitives. There were some instruction handlers where the program counter or top-of-stack pointers were updated within the instruction handler, and these were first rewritten to work with offsets and to only update these values at the end of the handler. The code generator generates all instruction handlers from instruction primitives, also those that cannot be included in a superinstruction anyway (e.g. invokedynamic or monitorenter). The handlers for these instructions were also transformed into macros, but without taking proper care to use the offset parameter or to prevent intermediate program counter or top-of-stack updates, as these instructions are never part of a superinstruction anyway, and as such will always be the first (and only) primitive within their instruction handler.

5.7.3 Code generation

With the structure and the technique for generating the instruction handlers discussed, it is time to look at how the VM files are written to disk and how they are integrated into the existing VM source code. Besides the instruction handlers, there are various other things that need to be generated for each instruction. The following items are generated:

1. The instruction handlers (as discussed), which are written to bytecodeInterpreter.handlers.hpp
2. A dispatch table, where “goto dispatch_table[opcode]” jumps to the instruction handler of “opcode”, which is generated as bytecodeInterpreter.jumptable.hpp
3. A VM constant and opcode for each instruction, written to bytecodes.generated.hpp
4. VM definitions for each instruction, written to bytecodes.definitions.hpp
5. A list of all superinstructions and the superinstruction substitution algorithm for the runtime to use, which is generated as superinstructions.list

Each of these items is generated as its own file. All of them except item 5 (the superinstruction list) are C++ source code, and these are included at select places in the VM by #include statements. The generated C++ source code files are not standalone compilation units that can be compiled to their own object files; instead they require the #include statements to be part of other VM files that rely on what is generated (this is why they have “.hpp” file extensions instead of “.cpp” extensions).

Instruction handlers The first item, the instruction handlers, are written to bytecodeInterpreter.handlers.hpp. The generated code could already be seen in the previous section, and replaces the sequence of instruction handlers in the bytecodeInterpreter.cpp file. In that file, we replaced all handlers with one preprocessor statement: #include "generated/bytecodeInterpreter.handlers.hpp".

 1 ...
 2     opcode = READ_BYTECODE(pc);
 3     DISPATCH(opcode);
 4
 5 #include "generated/bytecodeInterpreter.handlers.hpp"
 6
 7 DEFAULT:
 8     fatal("Unimplemented opcode");
 9     goto finish;
10 }
11 ...

Listing 5.18: Interpreter loop with handlers read from an external file

Listing 5.18 shows a simplified version of what the interpreter loop looks like. The DISPATCH statement on line 3 jumps to the first instruction handler, and each instruction handler will (often with the use of the INSTR_end_sequence macro) jump to the next instruction. All instruction handlers are included from the generated file, with only the default “not-implemented” handler remaining.

Dispatch table The DISPATCH(opcode) macro, as seen in use in Listing 5.18 (on line 3), is implemented using the GCC labels-as-values extension in OpenJDK Zero, and we made no changes to this mechanism.

#define DISPATCH(opcode) goto *(void*)dispatch_table[opcode]

Listing 5.19: Definition of the dispatch macro using a dispatch table

const static void* const dispatch_table[1 << 16] = {
#include "generated/bytecodeInterpreter.jumptable.hpp"
};

Listing 5.20: Definition of the dispatch table using a generated file

Listing 5.19 shows the dispatch macro definition and Listing 5.20 shows how the dispatch table is initialized. Both of these occur within the interpreter function in bytecodeInterpreter.cpp. The size of the table is set to 1 << 16, which is equal to 2^16 = 65,536. This is the maximum number of handlers supported with 2-byte opcodes, as discussed in section 5.3.4. The original VM prior to our modifications had a dispatch table size of 256 (the maximum number of handlers supported with 1-byte opcodes).

&&opc_nop,              // regular instructions ...
&&opc_aconst_null,
&&opc_iconst_m1,
&&opc_iconst_0,
...
&&opc_super_iinc_goto,  // ... followed by superinstructions ...
&&opc_super_iload_1_iconst_1_ishl_istore_1_aload_0_iload_1,
&&opc_super_ishl_ior,
...
&&opc_default           // ... and all unused space is padded with opc_default

Listing 5.21: Content of the dispatch table in bytecodeInterpreter.jumptable.hpp

An example of the content of the bytecodeInterpreter.jumptable.hpp file is shown in Listing 5.21. This file always contains 2^16 entries (one for every index in the table), with Listing 5.21 only showing a short snippet of this file. Note the “&&” syntax: this syntax is part of the GCC labels-as-values extension, and allows taking the address of the code behind a label. Each instruction handler has its own label. The file always starts out with all the regular VM instructions, which are at their standard locations. For example, the opcode of nop is 0, so its handler is the first entry in the table, at dispatch_table[0]. The regular VM instructions are followed by the superinstructions, which are always named “opc_super_” followed by a concatenation of all the instruction names within the superinstruction. The rest of the table is padded to 2^16 entries with opc_default – this is the default handler we saw in Listing 5.18 on line 7.
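The padding scheme described above amounts to a simple loop in the generator. The sketch below is a hypothetical illustration (the class, method and parameter names are not taken from the QuickInterp sources): it emits the regular handlers, then the superinstruction handlers, and pads the remainder of the 2^16-entry table with the default handler.

// Hypothetical sketch of emitting bytecodeInterpreter.jumptable.hpp.
import java.io.PrintWriter;
import java.util.List;

class JumpTableWriter {
    static final int TABLE_SIZE = 1 << 16; // 2-byte opcodes

    void write(PrintWriter out, List<String> regularLabels, List<String> superLabels) {
        int emitted = 0;
        // Regular VM instructions first, at their standard opcode positions.
        for (String label : regularLabels) {
            out.println("&&opc_" + label + ",");
            emitted++;
        }
        // Superinstructions follow, named after their constituent instructions.
        for (String label : superLabels) {
            out.println("&&opc_super_" + label + ",");
            emitted++;
        }
        // Pad the remainder of the table with the default ("unimplemented") handler.
        for (; emitted < TABLE_SIZE; emitted++) {
            out.println("&&opc_default,");
        }
    }
}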

VM constants and definitions Outside of the interpreter, there is one place that needs information about the new instructions: the bytecodes.cpp and bytecodes.hpp files. These files contain an enum constant for each bytecode instruction, mapping its name to an opcode, and track a definition of each instruction. A definition includes the length of the instruction, the types of the instruction operands, a “wide” definition (if available) and a few extra attributes regarding the characteristics of the instruction. The definitions enable abstract interpreters like the garbage collector and the verifier to work on instructions without dealing with the characteristics of each instruction separately, instead using the definition of each instruction. While powerful, this mechanism is not expressive enough to fully capture what we wanted to allow with superinstructions: for example, it does not allow particular combinations of instruction operand types – combinations that do not occur in the regular VM bytecode instructions, but do occur once instructions are merged into superinstructions. As such, this is the point where we broke compatibility with the garbage collector and the class verifier. With these features unsupported, all that the definitions are used for is determining the length of each instruction, which is necessary to allow the VM to iterate over the instruction stream. The capability of iterating over each instruction is so basic to the VM that it is not something we can drop support for. OpenJDK uses it to inject special instructions into core classes like java.lang.Object, and it is used in rewriter.cpp (discussed in section 5.3.2). As such, the code generator needs to generate an enum constant for each superinstruction, and generate a definition that allows iterating over the bytecode stream.

_super_iinc_goto = 239, // 0xef
_super_iload_1_iconst_1_ishl_istore_1_aload_0_iload_1 = 240, // 0xf0
_super_ishl_ior = 241, // 0xf1
_super_aload_0_iload = 242, // 0xf2
...

Listing 5.22: Content of the generated instruction constants file bytecodes.generated.hpp

Listing 5.22 shows an example of the generated instruction constants. These are included with an #include "generated/bytecodes.generated.hpp" statement in the bytecodes.hpp file.

This file already contains similar statements for each of the regular VM instructions. Each of the generated constants receives a generated opcode, e.g. in Listing 5.22 the _super_iinc_goto instruction is assigned opcode 239. This opcode matches the location in the dispatch table of the instruction handler for this instruction – that is, dispatch_table[239] is the location of the instruction handler for the _super_iinc_goto instruction.

def(_super_iinc_goto    , "super_iinc_goto"    , "bicbboo", NULL, T_ILLEGAL, 0, true);
def(_super_ishl_ior     , "super_ishl_ior"     , "bbb"    , NULL, T_ILLEGAL, 0, true);
def(_super_aload_0_iload, "super_aload_0_iload", "bbbi"   , NULL, T_ILLEGAL, 0, true);
def(_super_sipush_iand  , "super_sipush_iand"  , "bccbb"  , NULL, T_ILLEGAL, 0, true);
...

Listing 5.23: Content of the generated definitions file bytecodes.definitions.hpp

Listing 5.23 shows an example of the generated definitions file. Note that some of the longer superinstructions seen earlier in Listing 5.22 have been omitted due to their length not fitting on the page. Each instruction definition requires seven arguments, but almost none of those are relevant anymore as the garbage collector and the class verifier have been put out of service. We will revisit the topic and consequences of missing JVM features in section 5.7.5. The important arguments are the instruction opcode (first argument) and the instruction signature string (third argument, e.g. "bicbboo"). The role of the instruction signature string is only to provide the length of the instruction – each letter in the instruction signature is one byte. An instruction signature of five characters would yield an instruction length of six bytes in the bytecode stream (one extra byte for the now 2-byte opcode, which is not included in the signature to stay compatible with original definitions). Each of the letters in the signature indicates what kind of operand it is, but these are not actually used, and as such we will not be discussing the content of each signature string.
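Because only the length matters, the information extracted from a definition boils down to a one-line computation. The helper below is a hypothetical illustration of that rule, not code taken from the generator.

class InstructionLengths {
    // Each signature character is one byte; one extra byte accounts for the
    // second opcode byte, which is not part of the signature.
    static int instructionLength(String signature) {
        return signature.length() + 1;
    }
}

For example, the signature "bicbboo" (seven characters) gives an instruction length of eight bytes.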

Superinstructions and runtime substitution algorithm information With the generated C++ code discussed, there is one last item on the list: information for the runtime to use. The interpreter generator tool has information that the runtime needs: namely the selected runtime substitution algorithm (chosen with the --si command line argument), and the list of superinstructions that are compiled into the VM. This information is communicated by means of a serialized Java object written to a file. The class involved was already discussed back in section 5.6.1: the class diagram there shows an InstructionSetConfiguration component, which is a data class storing the list of superinstructions and the selected runtime substitution algorithm. At runtime, when the ClassBytecodeFormatConverterImpl is first started, it will deserialize the file, create an instance of the selected runtime substitution algorithm and provide this instance with the superinstruction definitions.
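In outline, this hand-off is plain Java serialization. The sketch below is hypothetical: only the class name InstructionSetConfiguration and the superinstructions.list file come from the design; the field names and the ConfigurationIo helper are assumptions.

import java.io.*;
import java.util.List;

// Data class serialized by the interpreter generator tool (field names assumed).
class InstructionSetConfiguration implements Serializable {
    String substitutionAlgorithmClass; // selected via the --si argument
    List<String> superinstructions;    // e.g. "super_iinc_goto"
}

// Assumed helper for writing and reading superinstructions.list.
class ConfigurationIo {
    static void save(InstructionSetConfiguration config, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(config);
        }
    }

    static InstructionSetConfiguration load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (InstructionSetConfiguration) in.readObject();
        }
    }
}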

5.7.4 Superinstruction class cache

The class cache was introduced in section 5.3.5 and mentioned in section 5.5.2. In the class diagram from Figure 5.1, there is an additional class with a main method: the CacheBuilder class. The class cache is a cache of classes that already contain superinstructions, and need no runtime superinstruction substitution. To reiterate why we need such a feature: implementing the substitution algorithms in Java has a consequence, as not all classes can be substituted at runtime. Some classes need to be loaded in order to do substitution, and while many of these classes are inconsequential, some of the required classes are important core VM classes like java.lang.String. Having a class cache with pre-substituted classes provides a workaround, and allows classes like java.lang.Object to be substituted. The usage of the CacheBuilder is simple: it requires one argument, which is a path to the folder where the profile of an application is stored. It will read the superinstructions.list configuration file written by the interpreter generator tool to figure out which runtime substitution

algorithm was used, and which superinstructions are available. It will then loop over all classes in the profile and convert them one by one to a version with superinstructions, and then write those classes to disk in a folder called “class-cache”. The JVMTI agent will look for any class that it is loading in the class cache, and if present, use the version from the class cache instead of sending the class through runtime substitution. If no class exists in the class cache, but the class loader of the class is the system class loader, the class will not be substituted. This is a simple mechanism to prevent circular loading errors, but it puts additional pressure on the quality of the profile. For example, if an important data structure that is part of the JVM standard library (e.g. LinkedList) is not used according to the profile, it will not be able to receive superinstructions: it is a core VM class (its class loader is the system class loader) and as such the JVMTI agent will not substitute it, and since it is not in the profile, it is not in the class cache either, so there is no version with superinstructions available. If this class does turn out to be needed by the application, the application cannot benefit from the performance improvements offered by superinstructions. However, in such a scenario, the quality of any superinstruction substitution within the class is debatable. Since it is not in the profile, the superinstruction set construction algorithm did not take it into consideration, and as such the VM may not be equipped with effective superinstructions for the class anyway. Considering we control the quality of the profiling for our own benchmarking results, we considered this a non-issue. However, it should be noted that the way we use the class cache for all system classes is a shortcut taken to simplify the implementation rather than a far-reaching consequence of the QuickInterp architecture. To create a more user-friendly implementation of QuickInterp, only the classes actually needed by the runtime substitution algorithm to work should be required in the class cache.
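In outline, the CacheBuilder performs a straightforward ahead-of-time loop. The sketch below is hypothetical: only the class name, the superinstructions.list file and the “class-cache” folder are taken from the text; the SubstitutionAlgorithm interface, the reflective instantiation and the assumption of a flat dump of .class files in the profile folder are illustrative (the sketch also reuses the InstructionSetConfiguration and ConfigurationIo classes sketched in the previous section).

// Hypothetical outline of the CacheBuilder tool.
import java.nio.file.*;

interface SubstitutionAlgorithm { // stand-in for the runtime substitution algorithms
    byte[] substitute(byte[] classBytes);
}

class CacheBuilder {
    public static void main(String[] args) throws Exception {
        Path profileDir = Paths.get(args[0]);               // folder holding the application profile
        Path cacheDir = profileDir.resolve("class-cache");  // output folder read by the JVMTI agent
        Files.createDirectories(cacheDir);

        // Read the configuration written by the interpreter generator tool and
        // instantiate the configured runtime substitution algorithm reflectively.
        InstructionSetConfiguration config =
                ConfigurationIo.load(profileDir.resolve("superinstructions.list").toFile());
        SubstitutionAlgorithm algorithm = (SubstitutionAlgorithm)
                Class.forName(config.substitutionAlgorithmClass).getDeclaredConstructor().newInstance();

        // Convert every profiled class ahead of time and write it to the cache
        // (assuming the profile folder contains a flat dump of .class files).
        try (DirectoryStream<Path> classes = Files.newDirectoryStream(profileDir, "*.class")) {
            for (Path classFile : classes) {
                byte[] original = Files.readAllBytes(classFile);
                byte[] withSuperinstructions = algorithm.substitute(original);
                Files.write(cacheDir.resolve(classFile.getFileName()), withSuperinstructions);
            }
        }
    }
}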

5.7.5 Loss of the garbage collector and class verifier

The VM definitions generated for each superinstruction as discussed in section 5.7.3 are not “good enough” to support garbage collection. The instruction definition format used internally in OpenJDK is not powerful enough to express all possible superinstructions, and as such the garbage collector is unable to function without expanding this format. The definitions could also be pointed to as the reason why the verifier no longer works, however this is a little more complex: we argue that the verifier should not attempt to verify code after superinstructions have been substituted in. Instead, it should be applied to the code before superinstructions. As such, it is the location of the JVMTI agent in the class loading pipeline that breaks the verifier, and not the inability to verify superinstructions. It is, however, entirely possible to verify bytecode with superinstructions, as each superinstruction is a direct replacement of a sequence of regular instructions, and making it possible to treat superinstructions in that way would restore both the verifier and the garbage collector. The topic of losing flagship JVM features was touched on in the introduction of this chapter (section 5.1): we consider it acceptable to take shortcuts when these shortcuts do not impact benchmarking results in a way that is not reflective of the design. This was discussed in section 5.2 in the discussion surrounding implementation goal IG6, which requires that the implementation is an accurate reflection of the design when it comes to testing. The garbage collector and class verifier are components separate from the performance of the interpreter: our superinstruction architecture does not affect the work a garbage collector has to do, and the class verifier is only executed at class load time. A garbage collector can however degrade runtime performance as the interpreter has to wait for the GC to finish, and since QuickInterp is not compatible with any garbage collector, any benchmark will run on it without garbage collection, sparing it from performance degradation due to GC overhead. This means that, in order to create an apples-to-apples comparison, any non-superinstruction VM that QuickInterp is compared against must also run benchmarks without a GC. As long as this is taken into consideration when designing and running benchmarks, we do not consider the loss of these JVM features a violation of implementation goal IG6.

5.7.6 Wrapping up

In this section, we have covered the actual generation of C++ code and the superinstructions.list file based on the output of the superinstruction set construction algorithm from section 5.5. We saw how instruction handlers are created from instruction primitives – the parts of instruction handlers that do not jump. These instruction primitives are concatenated to create larger instruction handlers. This is made possible by giving each instruction primitive its own macro definition, making concatenation simple. The instruction primitives support write coalescing, a feature discussed earlier in section 5.3.3, and here we discussed how write coalescing is implemented. Each instruction primitive within a superinstruction is compiled to read and write its data relative to the program counter and stack pointer at the start of the whole superinstruction. This is made possible by supplying each instruction primitive with offsets: the program counter is provided as an expression (e.g. pc + 5), and an offset is provided for accessing the top-of-stack. Besides generating the instruction handlers, the other files that the VM needs were also discussed: we generate constants for each new instruction, we generate a dispatch table to facilitate jumping to each instruction handler based on opcode, and we generate definitions for each instruction. The definitions are minimal however – they are only good enough to support basic iteration over code that includes superinstructions. The garbage collector and class verifier cannot make use of them to perform garbage collection or class verification, and this is the point where our code breaks compatibility with both of those features. The final file that is generated is not a source code file; instead it is the superinstructions.list file with the selected runtime substitution algorithm and the list of superinstructions. This file is not only used by the runtime, it is also used by the CacheBuilder tool. We discussed how the cache builder tool – which is part of the interpreter generator – is implemented. This tool implements the workaround for the circular class loading problem we discussed back in section 5.3.5 – it runs all profiled classes through the substitution algorithm and places them in a class-cache folder to be used by the runtime. The class cache provides the last piece of the puzzle: using the cache instead of substituting system classes at runtime enables system classes like java.lang.String to contain superinstructions, even when they need to be loaded before the runtime substitution algorithm can become active.

5.8 Conclusion

In this chapter we discussed the implementation of QuickInterp on top of OpenJDK Zero. We discussed the structure of OpenJDK Zero, and how we modified this VM to support superinstructions. Code stretching was discussed, which is the name of the feature that lets us use two-byte opcodes, raising the number of possible instructions to 2^16. Code stretching, together with other runtime class changes, is implemented in a JVMTI agent. We went over our implementation of the profiling system: a new “profile” instruction makes it possible to record the number of executions at select places in the bytecode, and together with profiling variants of each jump instruction we implemented the different counters as per the design from Chapter 4. A preprocessing step places these profiling instructions into the program if profiling is enabled. To construct the superinstruction set, an interpreter generator tool was created. This tool implements the iterative optimization algorithm from the design. Static evaluation is implemented by tracing paths, which is necessary to perform static evaluation on the ASM object graph after superinstruction substitution. We discussed how the runtime substitution algorithms are implemented by substituting superinstructions into an ASM object graph of the input program. We explained how the shortest-path algorithm can take jump targets in the program into consideration by performing the shortest-path lookup in reverse, and how we opted to use Dijkstra's algorithm with a priority queue to enable future experimentation with different edge weights, even though all edge weights in the superinstruction graph are equal in the implementation as presented in this thesis. This chapter also covered how the instruction handlers are actually generated using instruction primitives, which are sequences of C++ code that represent the essence of an instruction handler, omitting any boilerplate jump code to execute the next instruction.

By isolating these instruction primitives we can concatenate them to generate superinstructions. We explained how the existing handlers have been converted to instruction primitives, and how these instruction primitives support write coalescing by supplying them with two parameters: a relative program counter and an operand stack offset. Finally, the other generated C++ code was discussed, including VM metadata like the instruction definitions, as well as the list of superinstructions that makes the runtime aware of the available superinstructions.

5.8.1 Revisiting the implementation goals

At the start of this chapter in section 5.2, various goals were set for the implementation. These goals were mostly derived from the design goals from section 4.1.1, with implementation goal IG6 deriving from a research goal instead of a design goal. Let us revisit those goals as they were listed in section 5.2.

IG1 The VM runtime needs to be capable of gathering profiling information from the running application, and write this to a file for the superinstruction set construction algorithm (DG1).

IG2 Provide an implementation of the iterative superinstruction set construction algorithm discussed in section 4.4. It must produce a list of optimal superinstructions for a given profile, maximum instruction set size and a runtime substitution algorithm (DG3).

IG3 Provide an interpreter generator that, from the list of superinstructions, generates C++ source code and other metadata to implement all superinstructions in the VM. The VM needs to be modified to use this generated C++ code and be refactored in such a way that it is possible to concatenate instruction handlers to create superinstructions.

IG4 Implement the three runtime superinstruction placement algorithms (triplet-based, tree-based and shortest-path runtime substitution) (DG4).

IG5 Modify the OpenJDK Zero class loading pipeline to include the runtime substitution of each class by the active runtime substitution algorithm.

IG6 Produce a VM that provides an accurate, testable reflection of the design of each of the components as laid out in chapter 4 (G4 from section 1.5).

Our implementation meets each of these goals. Implementation goal IG1 has been answered in section 5.4 with the implementation of profiling. A profile file is generated (“app.profile”) and all classes are dumped to a folder when our modified VM is started with a particular command line argument, satisfying this goal. Implementation goal IG2 has been met with the implementation of the superinstruction set construction algorithm in the interpreter generator tool, as described in section 5.5. We discussed how the iterative superinstruction set construction algorithm works, together with our implementation of the static evaluation algorithm that traces paths through the code to work on the ASM object graph representation of bytecode. The profile, maximum instruction set size and the runtime substitution algorithm are provided to the interpreter generator tool as command line arguments, and the list of optimized superinstructions is made available to the code generator, which resides in the same tool. As such, all aspects of this goal are satisfied. The third implementation goal, IG3, has been satisfied by the code generator in the same interpreter generator tool. The interpreter generator tool, which was the answer to goal IG2 as it implements the superinstruction set optimization algorithm, has a code generator component that can generate the various C++ source code files as discussed in section 5.7. The instruction handlers are restructured into instruction primitives allowing the creation of superinstructions by the code generator, satisfying this goal. Continuing on, implementation goal IG4 has been satisfied as discussed in section 5.6, where all three runtime placement algorithms have been discussed including their implementation. To answer implementation goal IG5, a JVMTI agent has been made which can perform substitution of all classes – some at runtime and others ahead-of-time by

using a class cache as discussed in section 5.3.2 and section 5.6. Finally, implementation goal IG6 was reached. While goal IG6 is an overarching goal that affects the entire implementation, we believe that the implementation as presented in this chapter is a sufficiently faithful implementation of the design from chapter 4. The loss of flagship JVM features like the garbage collector and the class verifier is of little consequence to the testability of the design, as long as the lack of these features is taken into consideration when testing, as discussed in section 5.7.5 (e.g. running a garbage collector benchmark is not useful, but the superinstruction architecture is not expected to make any changes to GC performance). That concludes the implementation of the QuickInterp design on OpenJDK Zero, creating an implementation that allows for benchmarking the design, which is the topic of chapter 6.

Chapter 6

Benchmarks and results

6.1 Introduction

Finally it is time to evaluate the performance of QuickInterp by means of various benchmarks. To start off, the goals of benchmarking are discussed in section 6.2. Then, in section 6.3 two benchmarks are introduced. The first benchmark, the primes benchmark, represents an application with relatively little code where the bulk of the execution is limited to just a few methods. The second benchmark, the Spring benchmark, is a large benchmark with many libraries where a large number of methods are frequently executed, and we will motivate the need for two such different benchmarks based on the goals from section 6.2. Section 6.4 discusses the results of the primes benchmark, showing a best-case improvement of 45.6% over baseline. Section 6.5 discusses the Spring benchmark results, where a best-case improvement of 33.0% is achieved. Finally, these results are interpreted and put into context in section 6.6, wrapping up this chapter.

6.2 Benchmarking goals and non-goals

The overarching goal of benchmarking is not to test every single aspect of the QuickInterp architecture and its implementation. Certain features that one might expect of a production JVM (like a working garbage collector or class verifier) are not present. Furthermore, the runtime substitution algorithms implemented in QuickInterp are not optimized for performing substitution quickly – instead, they are designed to create optimal substitutions. The goals for the benchmarking phase are derived from the main research goals as seen previously in section 1.5.

G1 Collect, devise and adapt a set of algorithms that, when provided with profiling data of a particular application, generate an optimized superinstruction set.

G2 Collect, devise and adapt a set of algorithms that, when provided with a superinstruction set and application code, can insert appropriate superinstructions so as to maximize performance.

G3 Implement these algorithms, either on top of OpenJDK or by some other means.

G4 Evaluate performance on a set of sample applications and benchmarks. The benchmarks are discussed in section 6.3.

While goal G4 is the only goal that mentions benchmarking, it is not an independent goal, as the algorithms to evaluate have been implemented in response to goal G2, goal G3 and to a lesser extent goal G1. Goal G2 has been answered by the creation of a universal superinstruction

set construction algorithm (section 4.4) that uses static evaluation (section 4.4.5). As such, the data gathered from the benchmarks must give insight into how a particular static evaluation score relates to actual runtime performance. Goal G3 – the implementation of each algorithm – provides a platform to create comparative tests for. Various benchmarks can be run in the same environment where only the active runtime substitution algorithm is changed. These benchmarks should be representative of a range of applications – both applications where a small number of methods make up the bulk of the application's runtime, and applications where the execution is spread over a large number of methods. This leads to the following benchmarking goals:

BG1 Create and implement a benchmarking application where a few sequences of bytecode are executed frequently.

BG2 Create and implement a benchmarking application where large amounts of bytecode are executed frequently without such a prominent hotspot.

BG3 Compare, using the created benchmarks, the efficacy of the four implemented runtime substitution algorithms (triplet-based substitution, tree-based substitution, shortest path substitution, and shortest path substitution with instruction equivalence).

BG4 Compare, using the created benchmarks, the efficacy of the four implemented runtime substitution algorithms against a standard OpenJDK Zero JVM without superinstructions (this is the JVM QuickInterp is based on).

BG5 Test, using the created benchmarks, the accuracy of static evaluation as described in section 4.4.5.

Although not explicitly listed as a goal, the environment for the various tests must be standardized: the same hardware, the same compiler suite, the same software, and so on. Standardizing the environment for optimizing the superinstruction set, however, is not straightforward. We might for example give each runtime substitution algorithm the same amount of time to optimize in the superinstruction set construction algorithm. While this may seem fair, some runtime substitution algorithms perform substitution a lot faster, so one algorithm may complete a lot more optimization iterations within the superinstruction set construction evaluation loop. Another way to standardize in that case would be to standardize based on the number of evaluations. As such, both of these approaches for standardizing the optimization environment were tried.

6.3 Benchmark selection

6.3.1 Small benchmark: JMH primes benchmark

To answer benchmarking goal BG1 we produced a small arithmetic benchmark using the Java Microbenchmark Harness (JMH). This benchmark tests whether a particular integer is prime in a straightforward and naive way.

boolean isPrime(int possiblePrime) {
    for (int i = 2; i < possiblePrime; i++) {
        if (possiblePrime % i == 0)
            return false;
    }
    return true;
}

Listing 6.1: The prime test benchmark

The Java implementation of this prime test is shown in Listing 6.1. The method isPrime(int) is invoked on a range of numbers between 240,000 and 250,000 over and over again for 30 seconds.

Such a run of 30 seconds (an “iteration”) is repeated 20 times, for a total of 600 seconds worth of test data. Each iteration reports the throughput (in operations per second), and the overall throughput across iterations is computed by JMH. Because QuickInterp has no JIT compiler that optimizes the application as it runs, there is no need for an extensive warmup before the iterations that count towards the actual score can start. As such, warmup is limited to a single iteration of just 10 seconds, to help place the code in cache and make sure all the code has been run through the runtime substitution algorithm. JMH typically runs benchmarks in “forked” mode, where it spawns off a separate JVM instance to run the actual benchmark. However, QuickInterp profiling cannot work while forking (the host JVM would overwrite the profile of the forked JVM) and as such this was disabled. Both the benchmark run on which profiling was conducted and the benchmark runs used to gather results had forking disabled.
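The JMH settings described above roughly correspond to the following configuration. This is an approximation rather than the actual benchmark source: the class name is hypothetical, and exactly what constitutes one “operation” (here assumed to be a single isPrime call on the next number in the 240,000–250,000 range) is not specified by the text.

// Approximate JMH configuration matching the description above.
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 1, time = 10, timeUnit = TimeUnit.SECONDS)       // single short warmup
@Measurement(iterations = 20, time = 30, timeUnit = TimeUnit.SECONDS) // 20 iterations of 30 s each
@Fork(0) // forking disabled: the host JVM would overwrite the forked JVM's profile
@State(Scope.Benchmark)
public class PrimesBenchmark {

    int candidate = 240_000;

    @Benchmark
    public boolean primeTest() {
        int n = candidate++;
        if (candidate >= 250_000) {
            candidate = 240_000; // cycle through the range over and over
        }
        return isPrime(n);
    }

    // The prime test from Listing 6.1.
    boolean isPrime(int possiblePrime) {
        for (int i = 2; i < possiblePrime; i++) {
            if (possiblePrime % i == 0) {
                return false;
            }
        }
        return true;
    }
}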

6.3.2 Large benchmark: Spring pet clinic web app

The larger benchmark, implemented to answer benchmarking goal BG2, is based on the Spring Boot Pet Clinic example application [16]. Spring Boot is a modern Java web framework, handling HTTP requests, database access, dependency injection, view rendering, caching and more. The Pet Clinic application is a sample application that is used to demo the various features present in the Spring Framework. It models an application that one might find in the back office of an actual pet clinic: pet owners can be managed, each with multiple pets, and visits can be planned for those pets. The entity relation diagram of the database can be seen in Figure 6.1.


Figure 6.1: Entity relation diagram of the Pet Clinic database

The benchmark uses Apache JMeter to gather results. Apache JMeter is an external load testing application for various network protocols including HTTP, and it makes HTTP requests according to a test plan to the Spring Boot application. Our test plan tests the scenario of adding a new visit to a particular pet:

1. Visit the landing page (“home”)
2. Go to the “find owners” search page
3. Search for a particular pet owner by last name
4. On the owner information page, go to one of the pets registered to this owner (a screenshot of the owner information page can be seen in Figure 6.2)
5. Log a new visit at a random date and with a random message

Figure 6.2: Screenshot of the Pet Clinic web application, showing information about a pet owner

This scenario is repeated 70 times per simulated user. The test plan simulates 8 concurrent users, each of which follows the same test plan independently, for a total of 70 · 8 = 560 executions of the scenario. The Pet Clinic application comes preloaded with some sample data, including various pet owners and their pets. From this data, five different pet owners were selected, and visits are logged only for these five pet owners. This is an attempt at lowering database contention, as an increase in database contention causes request threads to wait for each other. Whether and when the threads handling requests wait for each other is not deterministic, as it depends on the interleaving of the particular threads, and as such it would increase the noise in the benchmarking data. The duration of each HTTP request is recorded, which is used to compute the average duration per request for the whole test plan, as well as the average duration per request per request type. This benchmark, unlike the primes benchmark from section 6.3.1, always performs a fixed number of operations rather than testing operations for a fixed amount of time. Since this benchmark adds items to the database, it will gradually slow down as the database becomes more populated. This

makes it important to ensure that each execution of the benchmark ends up with the same number of items in the database at the end, regardless of whether it can insert those items quickly or slowly. Due to the large number of classes loaded (there are 14,644 classes in the profile), the Pet Clinic application is first “warmed up” by executing the scenario once. This causes all the relevant classes to load prior to running the actual benchmark, which ensures that we are testing interpreter performance and not class loader performance.

6.3.3 Benchmarking environment and parameters

The environment for each benchmark was kept the same. We only conducted testing on one hardware platform.

CPU       Ryzen 3700X, 8 cores (16 threads), stock cooler (no overclocking)
Memory    32 GB at 3200 MT/s (CL16)
OS        Ubuntu 19.10
Compiler  gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)
Host JVM  OpenJDK 64-Bit Server VM (build 11.0.6+10-post-Ubuntu-1ubuntu119.10.1, mixed mode, sharing)

The Host JVM is the JVM used for optimizing the instruction set and building QuickInterp. We opted against running this on QuickInterp, as QuickInterp is still significantly slower than a JVM with a highly-optimized JIT compiler. Since there is no expected change in output besides producing the output faster, there is no gain in running it on QuickInterp versus a much faster JVM. To run the benchmark, only a minimal number of background processes may be running on the test machine. Furthermore, before starting each benchmark the test system must be idle for at least 30 seconds. This is to ensure the CPU has had time to cool down, as the selected CPU (Ryzen 3700X) makes great use of any thermal headroom it has available to boost its clock speeds. However, no monitoring of temperature was done. As mentioned in the goals section, the configuration for the superinstruction set construction algorithm can be standardized in two different ways: based on time and based on a fixed number of evaluations. While we did not write the runtime substitution algorithms with performance in mind, for practical testing we limited the optimization time to 180 seconds. The superinstruction set construction algorithm was also run on the test machine.

6.4 Results: JMH Primes Benchmark

6.4.1 Benchmark and static evaluation scores

In Figure 6.3 the result of running the benchmark with cache enabled can be seen across various superinstruction set sizes, with Figure 6.4 showing the results of the same benchmark but without cache. The benchmark was run for the superinstruction set sizes of 1, 2, 3, 4, 5, 10 and 20. The green line in both figures is the baseline result obtained from an OpenJDK Zero JVM without our patches applied. The superinstruction set construction algorithm uses an iterative optimization algorithm to find the optimal superinstruction set, which was discussed in section 4.4.5. As part of iterative optimization, each superinstruction set has a static evaluation score (see section 4.4.3). This score is an indication of the ultimate benchmark performance, and a higher score should relate to better runtime performance. Figure 6.5 plots the static evaluation score against the number of superinstructions used for each benchmark run. Figure 6.6 plots the static evaluation score against

the benchmark score. Ideally, we would expect to see a linear correlation between the benchmark score and the static evaluation score, as the static evaluation score is an indicator of the benchmark score. Finally, as discussed in the introduction (section 6.3.3), bounding the optimization time to 180 seconds for each algorithm disadvantages slower algorithms, as they cannot do the same number of optimization iterations. Figure 6.7 plots the number of static evaluations each of the algorithms managed to perform within their allowed 180 seconds.

Figure 6.3: Primes benchmark with cache enabled
Figure 6.4: Primes benchmark without cache enabled

Figure 6.5: Primes benchmark static evaluation score
Figure 6.6: Primes benchmark score vs static evaluation score


Figure 6.7: Number of static evaluations completed within 180 seconds for each algorithm

Equivalence algorithm results The above graphs do not show the results of the equivalence algorithm. The equivalence algorithm was found to be very slow: about three orders of magnitude slower than the shortest path algorithm. This makes it almost impossible to optimize the superinstruction set with this algorithm, as it cannot come close to the number of evaluations that the other algorithms can do in the optimization time (180 seconds). As such, we instead opted to use the superinstruction set optimized with the shortest-path algorithm. However, the primes benchmark can be effectively optimized with so few superinstructions that none of them are equivalent, and performing substitution with this superinstruction set on all profiled code yielded no substitutions due to equivalence. When running the equivalence algorithm in such a scenario, where it can only place superinstructions that exactly match the instructions they replace, it behaves exactly like the shortest-path algorithm that it is based on. As such, the equivalence algorithm is not included in the benchmark results, as the shortest-path algorithm is already shown.

6.4.2 Result interpretation

Figure 6.3 and Figure 6.4 both show an uptick in performance from baseline, peaking at 6959.537 operations/second (shortest path with 10 superinstructions, no cache) compared to the baseline of 4778.601 operations/second (a 45.6% improvement in throughput). Note that even though we show these numbers here with great precision, repeating the tests is likely to change them somewhat, especially when run on a different system. This result is achieved with very few superinstructions for the shortest path and tree substitution algorithms, and the triplet-based runtime substitution algorithm also catches up when given enough superinstructions. This appears to indicate that very few superinstructions are needed to cover the primes benchmark, which is exactly the expected behavior, as the primes benchmark tests just a very short method with few instructions over and over. The two figures (Figure 6.3 and Figure 6.4) are very similar, showing that caching makes almost no difference here. Caching is the only way system classes can receive superinstructions, but this appears to not make any difference within the primes benchmark. Once again, this is not surprising, as the primes benchmark does not use any system libraries by itself. Even though JMH may use such libraries, this appears to not have a significant effect on performance. Another thing is noticeable about the two figures: there appears to be some “noise” in the data. Sometimes, a run with more superinstructions (e.g. the triplet-based algorithm with two superinstructions) performs worse than a run with fewer superinstructions (e.g. the triplet-based algorithm with one superinstruction). This is unexpected, as adding a superinstruction should only improve the result. Note that the noise cannot simply be explained with run-to-run variance:

it shows up both in the cache-enabled and cache-disabled runs in exactly the same way, and we even ran the benchmark multiple times to see if it would go away. It did not. Figure 6.5 shows the static evaluation score, and what is most interesting is that the “noise” is not present here. In other words, the static evaluation algorithm believes it is actually building a better superinstruction set with each newly added superinstruction. And when looking at the benchmark score vs the static evaluation score (Figure 6.6), it generally looks like a linear relation (which is what you would expect), but with an outlier. The noise shows up here again, showing that while a better static evaluation score is generally a good predictor of better performance, it doesn't appear to capture the whole story. We will revisit the noise in section 6.6.

6.5 Spring Pet Clinic benchmark

For the Spring Pet Clinic benchmark, we did not run each configuration with and without cache, instead opting to only use the cache. It is the most representative of our solution, and it significantly reduces the amount of time taken to collect all the benchmark results, allowing us to investigate even larger superinstruction sets. We kept the superinstruction set construction time at 180 seconds, and ran the benchmark for the following superinstruction set sizes: 1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 100, 200, 300, 400, 500, 750 and 1000.

6.5.1 Benchmark and static evaluation scores

Figure 6.8 shows an overview of all data points collected by plotting the average request duration against the number of superinstructions for each algorithm. Each data point in this graph is the average duration of all the requests in the scenario. Since the scenario loads different pages, it can be expected that certain pages are systematically slower than others. Figure 6.10 shows just the result for the shortest path algorithm, still plotting duration against the number of superinstructions. However, in this graph, the various request types have not been averaged together. Instead, the average of each individual request type is shown. Figure 6.9 shows all algorithms again, but this time using only the duration of the Open “Landing” page request. Figure 6.11 shows the static evaluation score once again, showing how the static evaluation score improves with more superinstructions. Figure 6.12 shows the relation between the static evaluation score and the actual benchmark result. Note that for the Spring Pet Clinic benchmark, a lower time per request is better. As such, ideally, the static evaluation score would correlate inversely with the time per request. Finally, Figure 6.13 shows how many optimization evaluations were possible within the allotted 180 seconds. It shows how the time that a substitution algorithm needs is not independent of the number of superinstructions.

6.5.2 Result interpretation

Figure 6.8 and Figure 6.9 both show that the superinstruction optimization also benefits the Spring benchmark, generally showing faster requests when run with more superinstructions. The absolute best performer is the tree algorithm with 500 superinstructions, which manages an average request time of just 343.93 milliseconds. Comparing that to the baseline of 457.50 milliseconds yields a performance improvement of 33.0%. The theoretical maximum for the Spring Pet Clinic benchmark is 158,756,594,928 dispatches saved, while the shortest path runtime substitution algorithm with 1000 superinstructions tops out at a static evaluation score of 155,673,831,868 dispatches saved (98.052% of the maximum), which stopped us from testing even larger superinstruction sets. Only when the superinstruction set is tiny does our implementation actually perform worse than baseline. One explanation for this is that the baseline does not have to do code stretching (discussed in section 5.3.4), while our implementation always has to code stretch all classes, even when running with almost no superinstructions. However, given enough superinstructions, the baseline is always outperformed.


Figure 6.8: Spring Pet Clinic benchmark (average of all requests)


Figure 6.9: Spring Pet Clinic benchmark (average of all Open “Landing” page requests)


Figure 6.10: Spring Pet Clinic benchmark showing the duration of the individual request types (all using shortest path)


Figure 6.11: Spring benchmark static evaluation score
Figure 6.12: Spring benchmark average duration vs static evaluation score


Figure 6.13: Number of static evaluations completed within 180 seconds for each algorithm

Both figures (Figure 6.8 and Figure 6.9) show the same kind of noise that the primes benchmark results also showed – giving the superinstruction set construction algorithm more superinstructions doesn't always improve performance. This noise could be caused by the selected superinstructions, or it could be caused by something else (e.g. the CPU throttles down or some background process runs). Here, we do not have a run with cache and without cache to test if this noise is due to such background tasks, but we can examine whether this behavior is shown by all the different requests. Figure 6.10 shows the results of only the shortest path algorithm, but plots all the different request types independently. If the noise were caused by background tasks or another factor independent of the interpreter, we would not expect each request type to be equally affected. However, Figure 6.10 shows that for runs where the JVM performed better, it performed better for all request types. And likewise, where it performed worse, it performed worse across all request types. From this we can tell that the inconsistent performance must be somehow due to the selected superinstructions, as it is the only variable. It could be that the superinstruction set construction algorithm happened to get stuck in a local optimum: since the algorithm uses random mutations to find the best superinstruction set, it could find a worse superinstruction set when given more superinstructions simply due to chance. If this were the case however, we would expect a similar amount of noise in Figure 6.11, which is not present. In fact, the noise should even be more extreme in this figure, as the time spent dispatching

within the interpreter is not the only thing the interpreter does, while it is the only thing the static evaluation score is based on. Figure 6.12 continues the investigation by plotting the static evaluation score against the request duration. If the static evaluation score were a perfect predictor of the performance of a superinstruction set, this plot would show a perfect line, which it does not. This is the same conclusion we drew from the interpretation of the results of the primes benchmark (section 6.4.2), and we will discuss this noise more in section 6.6.

6.5.3 Interpreter size and superinstructions


Figure 6.14: Spring Pet Clinic benchmark libjvm.so size vs number of superinstructions
Figure 6.15: Spring Pet Clinic benchmark score vs libjvm.so size

Besides the benchmark score and static evaluation score, we also recorded the size of the JVM library (libjvm.so) for each compiled JVM. It is this library that contains the interpreter, and as the number of superinstructions increases, one would expect its size to increase. However, the superinstruction set construction algorithm isn't bound to a particular length of superinstruction: instead, it is limited only by the number of superinstructions it may make, without any constraints on their length. As such, it is possible for the superinstruction set construction algorithm to create a larger JVM binary while using fewer superinstructions. Figure 6.14 shows that this generally does not happen: it shows a linear correlation between the number of superinstructions and the size of the JVM library. The size of the JVM library does not appear to give an explanation for the noise seen earlier. A larger JVM may not fit in the cache, or cause other issues that prevent it from performing as well. However, Figure 6.15 does not show a strong correlation between size and time per request. In fact, a large JVM binary size appears to be a good indicator of a good benchmark score. This makes sense, as a large JVM binary likely contains more superinstructions, and section 6.5.1 already showed that more superinstructions generally improve performance. As such, we do not believe the size is the origin of the noise seen in the benchmark results.

6.6 Result interpretation

The previous sections have presented the results of the primes and Spring benchmarks. We saw how, for each of the benchmarks, adding more superinstructions generally improves performance (to a point), and saw how – when given enough superinstructions – each of the runtime substitution algorithms generally converges to about the same performance. On the topic of runtime substitution, we discussed another caveat: the equivalence substitution algorithm was in all benchmarks completely useless, falling back to “simple” shortest-path substitution. Even in the 1000-superinstruction equipped Spring benchmarks, it failed to make even a single substitution due to equivalence.

Furthermore, we saw that there is some “noise” in the results, noise that cannot be explained by simple run-to-run variance. Instead, it appears the static evaluation algorithm is not a perfect predictor of performance.

6.6.1 Best runtime substitution algorithm

We also saw how few superinstructions are actually necessary to make good substitutions. The very best result of the Spring benchmark was obtained with just 300 superinstructions, without larger superinstruction set sizes improving on this result. The results also show that – besides the triplet-based substitution algorithm – all runtime substitution algorithms perform comparably. And when given enough superinstructions, even the triplet-based substitution algorithm can catch up to the others. To understand why the new runtime substitution algorithms perform about the same, let us consider their design philosophies:

1. Tree-based substitution: improves over triplet-based substitution by requiring fewer superinstructions in the JVM to make a substitution
2. Shortest-path based substitution: improves over triplet-based substitution by making better (non-eager) substitutions
3. Equivalence-based substitution: improves over shortest-path based substitution by requiring fewer superinstructions due to equivalence

The philosophy of every substitution algorithm comes down to either reducing the number of necessary superinstructions (tree-based, equivalence-based), or improving the handling of multiple choices of superinstructions (shortest-path based). However, with just 300 superinstructions, these philosophies do not appear to apply. Both the shortest-path based and equivalence-based substitution algorithms operate under the assumption that there are a lot of superinstructions, and that these superinstructions are not necessarily generated for the locations where they can be placed. For example, a superinstruction may primarily be added to the superinstruction set because it occurs in method A (according to the profile), but it can be placed in method B, method C and method D. However, we suspect that this generally doesn't happen: a superinstruction generated for method A probably does not fit anywhere else. As such, a runtime substitution algorithm does not have to worry about placing this superinstruction somewhere where it blocks a better superinstruction (the problem the shortest-path algorithm attempts to solve). This explains why the shortest-path algorithm rarely outperforms the tree-based algorithm. Instead, they generally perform equally. Likewise, the equivalence algorithm is not going to do a good job placing this superinstruction anywhere else. When implementing a new runtime substitution algorithm, the only important requirement appears to be that it can place superinstructions that match exactly. A runtime substitution algorithm does not have to search for equivalence or even place superinstructions along a shortest path. Each superinstruction appears to have a place that it came from according to the profile, and only has to be substituted at that location. All the substitution algorithm needs to do when it sees that location again is place that superinstruction. Each of the proposed runtime substitution algorithms is capable of making such a substitution, which is likely why they perform about equally. Thus, the tree-based substitution algorithm appears to be enough for superinstruction set sizes up to 1000. While we did not test larger superinstruction set sizes, in applications that benefit from even larger superinstruction sets the shortest-path algorithm may make a comeback.
In such scenarios, while each superinstruction still has just one place where it needs to be inserted, other superinstructions may interfere when substitution is performed eagerly. Since this is the type of problem the shortest-path algorithm is designed for, it would be a better candidate in such a case.
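To make the observation above concrete, the sketch below shows what a minimal exact-match substitution pass could look like. This is an illustration only, not the QuickInterp implementation: the Superinstruction record and the plain opcode-array representation are hypothetical stand-ins for the real class-loading pipeline, and operands and branch targets are ignored for brevity.

import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model: a superinstruction is identified by the exact
// sequence of original opcodes it replaces, plus the new opcode assigned to it.
record Superinstruction(int[] pattern, int opcode) {}

final class ExactMatchSubstituter {

    // Replaces every exact occurrence of a known pattern by its superinstruction
    // opcode, scanning the method's opcode sequence from left to right.
    static int[] substitute(int[] code, List<Superinstruction> set) {
        ArrayList<Integer> out = new ArrayList<>();
        int pc = 0;
        while (pc < code.length) {
            Superinstruction match = longestMatchAt(code, pc, set);
            if (match != null) {
                out.add(match.opcode());          // one dispatch instead of many
                pc += match.pattern().length;     // skip the instructions it covers
            } else {
                out.add(code[pc]);                // keep the original instruction
                pc++;
            }
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }

    // Picks the longest superinstruction whose pattern starts exactly at pc.
    private static Superinstruction longestMatchAt(int[] code, int pc, List<Superinstruction> set) {
        Superinstruction best = null;
        for (Superinstruction si : set) {
            int[] p = si.pattern();
            if (pc + p.length > code.length) continue;
            boolean matches = true;
            for (int i = 0; i < p.length; i++) {
                if (code[pc + i] != p[i]) { matches = false; break; }
            }
            if (matches && (best == null || p.length > best.pattern().length)) best = si;
        }
        return best;
    }
}

A pass of this shape is essentially what the discussion above argues is sufficient in practice: it places a superinstruction wherever its originating sequence reappears, without any shortest-path or equivalence reasoning.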

6.6.2 Static evaluation: not a perfect predictor

The interpretation in section 6.4.2 and section 6.5.2 already discussed the noise in the performance data. To recap: this noise does not appear to be run-to-run variance caused by factors unrelated to the application. And, this noise is not present in the final static evaluation score of each JVM, so it is not caused by the superinstruction set construction getting stuck in a local optimum. This leaves one obvious culprit: the static evaluation score is not a perfect indicator of runtime performance. A superinstruction set with a better static evaluation score may in fact perform worse at runtime.

This may have many explanations. For example, it could be that the C++ compiler that compiles the interpreter can very effectively optimize certain instruction primitives when put together in a superinstruction, while it cannot do the same for others. It could be explained by CPU intricacies: maybe the CPU branch predictor is already capable of fetching the next instruction(s) in particular cases, but not all. Such CPU behavior would make certain instruction dispatches “cheaper” than others, and creating superinstructions that cover the “cheap” instruction dispatches would not provide the same benefit as covering other instructions. It could even be due to the size or the order of superinstruction handlers within the interpreter, with certain superinstruction handlers pushing others out of the level-1 CPU cache. We did a brief investigation into the relation between the size of the libjvm.so binary file and the benchmark performance, and found no correlation; however, this is by no means definitive. What is clear is that static evaluation is not a perfect predictor of runtime performance. Future work will be necessary to discover how it can be improved, which promises to offer larger performance improvements than further improving the runtime substitution algorithm.

6.7 Conclusion

In this chapter, we finally saw the performance of QuickInterp in two benchmarks. We discussed the two benchmarks (primes and Spring), what they are testing and how they are implemented. Then, we discussed the results of each. As it turns out, all the newly-introduced runtime substitution algorithms perform about the same, and each of them generally performs better than the triplet-based substitution based on the algorithm from Ertl et al. [4]. There is also some “noise” in the results, which is best explained by a mismatch between static evaluation and actual runtime performance. At the start of this chapter, five benchmark goals (BGs) were introduced:

BG1 Create and implement a benchmarking application where a few sequences of bytecode are executed frequently.

BG2 Create and implement a benchmarking application where large amounts of bytecode are executed frequently without such a prominent hotspot.

BG3 Compare, using the created benchmarks, the efficacy of the four implemented runtime substitution algorithms (triplet-based substitution, tree-based substitution, shortest-path substitution, and shortest-path substitution with instruction equivalence).

BG4 Compare, using the created benchmarks, the efficacy of the four implemented runtime substitution algorithms against a standard OpenJDK Zero JVM without superinstructions (this is the JVM QuickInterp is based on).

BG5 Test, using the created benchmarks, the accuracy of static evaluation as described in section 4.4.5.

The first two goals – BG1 and BG2 – are met with the primes JMH benchmark from section 6.3.1 and the Spring Pet Clinic benchmark from section 6.3.2 respectively. The first three runtime substitution algorithms (triplet-based substitution, tree-based substitution, and shortest-path based substitution) were all compared. We did not include or even run the equivalence substitution algorithm beyond letting it build its cache, where it consistently reported making no substitutions due to equivalence. Since this algorithm then behaves exactly like the shortest-path algorithm, we consider it included and benchmarking goal BG3 met. To make a comparison with a JVM without superinstructions, baseline results of a stock OpenJDK Zero JVM without superinstructions were included in the test results, meeting benchmarking goal BG4. And finally, we tested the accuracy of the static evaluation algorithm to meet benchmarking goal BG5. The noise we observed in the results appears to be caused by inaccuracy of the static evaluation algorithm, as discussed in section 6.6.2; however, we also saw the performance generally improve as the static evaluation score improves (e.g. Figure 6.6 and Figure 6.12 from chapter 6), making it not a useless indicator either.

That wraps up the benchmarking and results chapter. In chapter 7 we will revisit the original research questions and discuss possible future work.

Chapter 7

Final thoughts

With the design complete (chapter 4), implemented (chapter 5), and benchmarked and discussed (chapter 6), section 7.1 reflects on the research goals as originally written in section 1.5. With the research questions answered, we wrap up this chapter, and with it this thesis, by presenting several potential research directions based on this thesis in section 7.2.

7.1 Revisiting the research goals and questions

In section 1.5, we listed four research goals. These goals were derived from our research questions. In this section we discuss them in reverse order: first revisiting the research goals to see how they have been met, and then moving on to answer the questions.

7.1.1 Research goals

G1 Collect, devise and adapt a set of algorithms that, when provided with profiling data of a particular application, generate an optimized superinstruction set.

G2 Collect, devise and adapt a set of algorithms that, when provided with a superinstruction set and application code, can insert appropriate superinstructions as to maximize performance.

G3 Implement these algorithms, either on top of OpenJDK or by some other means.

G4 Evaluate performance on a set of sample applications and benchmarks. The benchmarks are discussed in section 6.3.

Research goal G1 was met with the design (section 4.4) and implementation (section 5.5) of the iterative optimization algorithm using static evaluation. The research goal also requires the presence of profiling data, which was made available by designing (section 4.3) and implementing (section 5.4) a runtime profiling system. Research goal G2 was met with the design and implementation of four runtime substitution algorithms (section 4.5 and section 4.6 for the design, section 5.6 for the implementation). Four algorithms were implemented, including a re-implementation of the original runtime substitution algorithm used by Ertl et al. [4]. All algorithms, including the superinstruction architecture, were implemented on top of OpenJDK Zero (chapter 5), meeting research goal G3. This also includes the code generation to generate the VM, modifications to the VM to support up to 2^16 superinstructions, and class loading code to actually perform substitutions. Finally, research goal G4 was met by evaluating performance on the benchmarks presented in chapter 6. With all research goals met, we can revisit and answer the research questions with the obtained results.

7.1.2 Answering research questions

The research goals were derived from a set of research questions originally listed in section 1.5.2:

RQ How can graph-based superinstruction matching algorithms and graph-based superinstruction set construction algorithms provide a performance speedup over existing algorithms in an interpreting JVM?

Which can be divided further into subquestions:

R1 What are the performance characteristics of a superinstruction VM with graph-based instruction matching algorithms on modern hardware?

R2 How can graph-based instruction matching algorithms be tuned and optimized for particular workloads?

R3 How can graph-based superinstruction set construction algorithms be tuned and optimized for particular workloads?

The answering of these questions will also answer another question:

RX How does the superinstruction architecture as implemented in earlier work perform on a modern JVM interpreter implementation?

In this section, we will first look at the subquestions (R1-R3), then use their answers to answer the main question (RQ). Finally, we will look at the extra question (RX).

R1: What are the performance characteristics of a superinstruction VM with graph-based instruction matching algorithms on modern hardware?

Two graph-based matching algorithms were implemented: the shortest-path runtime substitution algorithm and the equivalence runtime substitution algorithm. To answer the research question: section 6.4.1 and section 6.5.1 have shown that these algorithms – especially for small superinstruction sets – do boost performance. Furthermore, they also outperform the triplet-based substitution algorithm, which is a reimplementation of the algorithm designed by Ertl et al. [4]. But there is a caveat: we also implemented a fourth algorithm, the tree-based substitution algorithm, which alleviates the main pain point of the triplet-based algorithm. The same two sections (section 6.4.1 and section 6.5.1) show little difference between the tree-based substitution algorithm and the graph-based algorithms. Moreover, section 6.6.1 discusses how it appears that superinstructions generally aren't reused: they are derived from one location where they help performance, and the only thing the substitution algorithm has to do is place that superinstruction again when processing the same code. In other words, while graph-based instruction matching algorithms perform well on modern hardware, they do not offer a significant performance improvement over simpler substitution algorithms like the tree-based substitution algorithm. A superinstruction added to help performance in method A needs to be substituted into method A again when it is loaded, and more complex algorithms like shortest-path substitution or equivalence-based substitution are not required for such a substitution.

R2: How can graph-based instruction matching algorithms be tuned and optimized for particular workloads?

When originally asking this question, we expected the need for some mechanism to align the instruction substitution algorithm with the superinstruction set construction algorithm, which would have to be done by hand. For example, in the case of the tree-based runtime substitution algorithm, we expected that we would need to tune the runtime substitution algorithm to skip superinstructions that would degrade placement performance, when adding such superinstructions made the placement of others impossible (this is the problem solved by the shortest-path based algorithm).

If the superinstruction set is constructed by interpreting the problem as a Longest Common Subsequence (LCS) problem, independent of the runtime substitution algorithm, many superinstructions may be present in the set that cannot or should not be placed by the runtime substitution algorithm. However, we solved these problems with a special superinstruction set construction algorithm that uses static evaluation to measure the exact behavior of the runtime substitution algorithm and automatically matches that behavior. As such, to answer the question: the graph-based instruction matching algorithms need no tuning to deal with potential shortcomings caused by poor superinstruction set construction. The superinstruction set construction algorithm can instead be implemented in such a way that it automatically deals with the exact substitution behavior of the runtime substitution algorithm.

R3: How can graph-based superinstruction set construction algorithms be tuned and optimized for particular workloads?

This question was already partially answered in the previous section. When we originally asked this question, we were expecting to need dedicated superinstruction set construction algorithms for each substitution algorithm, to align with the needs of the substitution algorithm. However, in section 4.4.5 we introduced an iterative optimization algorithm, which automatically optimizes for the active runtime substitution algorithm, without having to be manually tuned for it. To answer the question: the construction of the superinstruction set can be implemented with an iterative optimization algorithm, which need not be tuned for the intricacies of the runtime substitution algorithm, but can instead adapt to the behavior that it observes from this algorithm. By using the runtime substitution algorithm itself during the construction of the superinstruction set, it is possible to construct a superinstruction set that is known to perform well with that algorithm.
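As a rough illustration of this idea, the sketch below outlines a greedy, hill-climbing style loop in which a candidate superinstruction is only admitted when the static evaluation score improves, where that score is obtained by running the actual runtime substitution algorithm over the recorded profile. This is a simplified sketch under our own assumptions, not the exact algorithm of section 4.4.5: the Candidate type parameter and the staticEvaluation function are hypothetical placeholders.

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToLongFunction;

// Illustrative only: the evaluator is assumed to run the real runtime substitution
// algorithm over the profile and return the number of instruction dispatches saved
// by the given superinstruction set (higher is better).
final class IterativeSetBuilder {

    static <C> Set<C> build(List<C> candidates,
                            ToLongFunction<Set<C>> staticEvaluation,
                            int maxSetSize) {
        Set<C> selected = new LinkedHashSet<>();
        long bestScore = staticEvaluation.applyAsLong(selected);

        while (selected.size() < maxSetSize) {
            C bestCandidate = null;
            long bestCandidateScore = bestScore;

            // Try every remaining candidate and keep the one that helps the most,
            // as measured by the same substitution algorithm used at class load time.
            for (C candidate : candidates) {
                if (selected.contains(candidate)) continue;
                Set<C> trial = new LinkedHashSet<>(selected);
                trial.add(candidate);
                long score = staticEvaluation.applyAsLong(trial);
                if (score > bestCandidateScore) {
                    bestCandidateScore = score;
                    bestCandidate = candidate;
                }
            }

            if (bestCandidate == null) break;   // no remaining candidate improves the score
            selected.add(bestCandidate);
            bestScore = bestCandidateScore;
        }
        return selected;
    }
}

Because the objective function invokes the substitution algorithm itself, a loop of this shape automatically avoids admitting superinstructions that the substitution algorithm would never actually place.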

RQ: How can graph-based superinstruction matching algorithms and graph-based superinstruction set construction algorithms provide a performance speedup over existing algorithms in an interpreting JVM?

This topic was already touched upon in the analysis of the results (section 6.6.1), and the general answer appears to be that graph-based runtime substitution algorithms do not provide a performance speedup over simpler algorithms. Given enough superinstructions and a good enough superinstruction set construction algorithm, for superinstruction set sizes up to 1000, graph-based superinstruction matching algorithms do not provide a substantial performance benefit over non-graph based algorithms. We showed that the superinstruction optimization itself is still highly relevant, achieving considerable performance improvements in both benchmarks. And we also showed that the solution based on a runtime substitution algorithm from Ertl et al. [4] – the triplet-based runtime substitution algorithm – leaves room for improvement.

Part of this behavior can be explained by the quality of the superinstruction set construction algorithm: our algorithm brings the runtime substitution algorithm into the equation, and will not add superinstructions that hurt performance. This may explain why Casey [1] reported a slight performance improvement using an “optimal” substitution over his greedy substitution algorithm, while using fewer superinstructions than we did: Casey used a greedy superinstruction set construction algorithm that did not take the limitations of the substitution algorithm into consideration, while ours does. It appears that, even when running with 1000 superinstructions, superinstructions are generally created for just one location. And as long as the runtime substitution algorithm can make that substitution, it will approach the maximum possible speedup. For such a substitution, graph-based substitution algorithms are not necessary. However, for applications that benefit from even larger numbers of superinstructions (not tested), it may happen that the superinstructions start to “interfere” when not using shortest-path based superinstruction substitution. As such, we cannot rule out the use of optimal superinstruction placement for even larger applications, even when using highly optimized superinstruction set construction algorithms.

RX: How does the superinstruction architecture as implemented in earlier work perform on a modern JVM interpreter implementation?

Considering the time between much of the earlier work and today, we expected to see some changes in the results. It is not possible to make a quantitative comparison between earlier work and ours: we did not repeat any benchmarks used by earlier work, and any performance difference could be explained by a wide range of sources (different interpreter architecture, different garbage collector, different Java version, etc.). Still, we did not expect any of the runtime substitution algorithm implementations to perform as well as they did, including, when given enough superinstructions, even the old triplet-based reimplementation of the algorithm proposed by Ertl et al. [4]. To answer the question: the triplet-based substitution algorithm still performs well given enough superinstructions, as the “pressure” for more superinstructions was not as strong as we anticipated. However, considering the simplicity of better algorithms like the tree-based algorithm, we still would not recommend reimplementing the triplet-based substitution algorithm in a new superinstruction VM.

7.2 Future work

In the previous section we reached an interesting conclusion: optimizing the way superinstructions are placed is not an avenue that appears to bring great performance improvements. However, this does not mean there is nothing to further explore, and we will discuss various approaches in this section.

7.2.1 Production-ready implementation

The implementation introduced in chapter 5 had one interesting implementation goal:

IG6: Produce a VM that provides an accurate, testable reflection of the design of each of the components as laid out in chapter 4 (G4 from section 1.5).

This goal was added specifically to enable cutting corners while creating a VM that could be tested, and as a result the QuickInterp implementation as we present it for this thesis is by no means a production-ready VM. The garbage collector and class verifier are broken, and the code base is full of bad programming patterns (e.g. hard-to-verify memory management, absolute paths to source code folders, commented-out assertions) which would have to be fixed.

To fix the garbage collector and the class verifier, we would suggest removing the JVMTI agent. The use of a JVMTI agent is already a shortcut, and using an external API like that internally causes problems when attempting to load other JVMTI agents (they would see classes with superinstructions). Instead, the code stretching and superinstruction conversion would need to happen after class loading. Moving the superinstruction substitution out of the JVMTI agent would fix class verification, leaving only the garbage collector. The VM would need to be changed to be aware of superinstructions: it needs to know which instructions are superinstructions, and what those superinstructions are made out of. Then, the garbage collector would need to be modified to recognize superinstructions, and interpret them as a sequence of regular instructions (which is what they are).

Extensive validation of the VM would also have to be performed, just like for a regular production build of OpenJDK. Oracle maintains the Java Compatibility Kit1 (JCK) – a test kit containing unit tests and other facilities for testing the correctness of a JVM implementation – which could be used to validate the VM. The JCK is proprietary, but it is made available to those wishing to contribute code to the OpenJDK project after signing various forms.

1https://openjdk.java.net/groups/conformance/JckAccess/

7.2.2 Better static evaluation heuristics

In section 6.6.2, we discussed that static evaluation as it stands today is not a perfect predictor of runtime performance. Static evaluation as presented in this thesis (discussed in section 4.4.3) measures how many instruction dispatches can be saved with a particular superinstruction set, with more being better. However, it appears that there is more to runtime performance, as some superinstructions perform worse than others even when they should save the same number of dispatches. The exact relation between creating a particular superinstruction and its impact on benchmark performance would have to be researched, to create better heuristics for use in the static evaluation algorithm. Note that this may require other changes to QuickInterp as well: it may turn out that the order of the instruction handlers within the VM matters, or that the occurrence of a particular instruction handler blocks particular C++ compiler optimizations. These would have to be identified as well, to solve the disparity between static evaluation score and benchmark score.
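One possible direction, sketched below under the assumption that per-dispatch costs could be measured or estimated, is to weight each saved dispatch instead of counting them uniformly. This is our own illustrative sketch: the SavedDispatch model and the dispatchCost function are hypothetical, and the current static evaluation effectively corresponds to a cost of 1 for every dispatch.

import java.util.List;
import java.util.function.ToDoubleFunction;

// Illustrative sketch: score a substitution outcome by the weighted cost of the
// dispatches it removes, rather than by their plain count.
final class WeightedStaticEvaluation {

    // A dispatch that a superinstruction substitution would eliminate, identified
    // here only by the opcode that used to be dispatched (hypothetical model).
    record SavedDispatch(int opcode, long executionCount) {}

    static double score(List<SavedDispatch> saved, ToDoubleFunction<SavedDispatch> dispatchCost) {
        double total = 0.0;
        for (SavedDispatch d : saved) {
            // Weight each eliminated dispatch by how expensive it is believed to be
            // (branch predictability, cache effects, ...), times how often it runs.
            total += dispatchCost.applyAsDouble(d) * d.executionCount();
        }
        return total;
    }
}

With dispatchCost fixed to 1.0 this degenerates to the current dispatch-count metric; the open research question is which cost model actually closes the gap between static evaluation score and benchmark score.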

7.2.3 Dynamic rewriting

A modern JIT compiler may do more work than just converting instructions to machine code. Since a JIT compiler operates at runtime, it can compile the code multiple times, making assumptions that only hold at the time the code is compiled and relying on the fact that the code can be recompiled later when those assumptions no longer hold. For example, a common optimization is devirtualization, where interface calls are replaced with regular method calls based on the runtime state of the VM. The VM may know that for a particular interface (e.g. List) only one implementation is loaded (e.g. ArrayList). As such, all code that operates on a List must be using an ArrayList, and it is therefore possible to replace code using the List interface with code using the ArrayList class. Virtual method lookups are faster than interface method lookups, and as such this benefits performance, as illustrated by the example at the end of this section. As soon as the assumption is broken (a new implementation is loaded), the JIT compiler has to generate new code.

Dynamic rewriting would bring such optimizations to a superinstruction VM, where the code is substituted with superinstructions multiple times based on the state of the VM. Here, the runtime substitution algorithm starts to behave more and more like a JIT compiler, but instead of targeting machine code, it targets all instructions in the superinstruction set.
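To illustrate the kind of assumption involved, consider the following small Java fragment. It is our own example, not taken from the QuickInterp code base: if the VM observes that ArrayList is the only loaded List implementation, a JIT compiler (or, speculatively, a rewriting superinstruction VM) could treat the interface call as a direct call until another implementation is loaded.

import java.util.ArrayList;
import java.util.List;

class Devirtualization {
    // The call below is an invokeinterface on List. If ArrayList is the only List
    // implementation loaded so far, the VM may speculatively treat it as a direct
    // (invokevirtual-style) call to ArrayList.size(), which is cheaper to dispatch.
    static int count(List<String> names) {
        return names.size();
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        names.add("duke");
        // As long as no other List implementation is loaded, the speculative
        // devirtualization of count() remains valid; loading e.g. LinkedList
        // would force the optimized code to be discarded and regenerated.
        System.out.println(count(names));
    }
}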

7.2.4 Dynamically-loadable superinstructions for OSGi applications

One problem of the superinstruction architecture is that it requires building a VM for an application. But what if a new application needs to be deployed to an existing VM? Typically, this new application would not be able to benefit from increased runtime performance due to superinstructions, as the VM does not contain the right superinstructions for the application. In section 6.6.1, we discussed the suspicion that superinstructions are typically not shared: they may be allocated for just one location, and the substitution algorithm only needs to insert them there. This creates an interesting opportunity for such applications: they could include their own superinstructions. These could be delivered to the VM in a shared library (a .dll or .so file), which would be loaded into the VM's process. Then, the interpreter tables are patched to contain the new superinstructions, and the runtime substitution algorithm is informed of the newly available superinstructions, solving the main drawback of having to recompile the whole VM.

One drawback is that this violates the security model of the JVM. Classes can be loaded in a sandboxed environment, limiting their access to file, network and other APIs. However, loading a shared library into the VM process would give the native code within that library free rein over everything the JVM has access to. But, as long as the shared library is generated by a trusted source, it can help solve the problem of requiring a JVM rebuild for every application.

7.2.5 Value-dependent superinstructions

While not shown in the results chapter, at some point our QuickInterp interpreter contained a bug where it could not create superinstructions containing any jump instruction (not even conditional jumps). With that limitation, it offered a much more modest performance improvement in the primes benchmark, scoring only about 12% better than baseline (versus the 45.6% improvement shown in section 6.4.2). This indicates to us that the ability to add more instructions to a superinstruction is of significant value to performance.

The idea of “value-dependent superinstructions” is to specialize instructions. For example, if “goto 20” is executed a lot according to the profile, we would create a new instruction called “goto_20”. Special instruction primitives would need to be available to deal with receiving a jump offset from the code generator. Since this instruction now has a fixed jump offset, it can be entirely included in a superinstruction. The superinstruction set code generator would have to be updated to allow jumps to take place within a superinstruction, and with that mechanism in place, larger sections of methods could be compiled into just one superinstruction. This mechanism could also be applied to transform often-executed “tableswitch” and “lookupswitch” instructions so that they too can be included in superinstructions.
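A minimal sketch of this idea on the bytecode-rewriting side is shown below, assuming a hypothetical specialized GOTO_20 opcode exists in the interpreter. How the matching C++ handler with a hard-coded jump offset would be generated is not shown; the opcode values and the Instruction type here are our own illustrative assumptions.

// Illustrative sketch of value-dependent specialization during class loading:
// if the profile says "goto 20" is hot and the interpreter was built with a
// specialized GOTO_20 opcode, rewrite that exact instruction to the specialized
// form so it can later be merged into a superinstruction.
final class ValueDependentRewriter {

    static final int GOTO = 0xa7;       // standard goto opcode
    static final int GOTO_20 = 0x1a7;   // hypothetical specialized two-byte opcode

    record Instruction(int opcode, int operand) {}

    static Instruction specialize(Instruction insn) {
        // Only the exact (opcode, operand) pair the profile asked for is specialized;
        // every other goto keeps its generic, operand-carrying form.
        if (insn.opcode() == GOTO && insn.operand() == 20) {
            return new Instruction(GOTO_20, 0); // the offset is now baked into the handler
        }
        return insn;
    }
}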

7.2.6 Superinstructions as a JIT target

The ideas of dynamically rewritten code (section 7.2.3) and value-dependent superinstructions (section 7.2.5) both encroach on the capabilities of a true JIT compiler. JIT compilers are typically capable of transforming entire methods to machine code, and with value-dependent superinstructions it would become possible to transform larger sequences – including loops – into a single superinstruction. Likewise, JIT compilers may use runtime profiling to rewrite the code based on the state of the VM, which is a capability that could be introduced with dynamically rewritten code. This raises the question: why not take a JIT compiler instead, and retarget it to work with superinstructions? An existing JIT compiler could be made to emit commonly executed sequences when the VM is built, making those sequences available to the interpreter as new instructions. At runtime, the JIT compiler would perform the role of a runtime substitution algorithm, where instead of generating new code it glues the ahead-of-time compiled pieces of code together. This would benefit from the many years of research into JIT compilers, while the runtime remains platform-independent.

Appendices

Appendix A

List of bytecode instructions

A.1 Bytecode instructions with their description

The following table lists all bytecode instructions part of the Java Virtual Machine Specification version 11 [8]. The description of each instruction is copied directly from the Java Virtual Machine Specification document.

Opcode Instruction Description 0x0 nop Do nothing 0x1 aconst_null Push null 0x2 iconst_m1 Push int constant 0x3 iconst_0 Push int constant 0x4 iconst_1 Push int constant 0x5 iconst_2 Push int constant 0x6 iconst_3 Push int constant 0x7 iconst_4 Push int constant 0x8 iconst_5 Push int constant 0x9 lconst_0 Push long constant 0xa lconst_1 Push long constant 0xb fconst_0 Push float constant 0xc fconst_1 Push float constant 0xd fconst_2 Push float constant 0xe dconst_0 Push double constant 0xf dconst_1 Push double constant 0x10 bipush Push byte 0x11 sipush Push short 0x12 ldc Push item from run-time constant pool 0x13 ldc_w Push item from run-time constant pool (wide index) 0x14 ldc2_w Push long or double from run-time constant pool (wide index) 0x15 iload Load int from local variable 0x16 lload Load long from local variable 0x17 fload Load float from local variable 0x18 dload Load double from local variable 0x19 aload Load reference from local variable 0x1a iload_0 Load int from local variable 0x1b iload_1 Load int from local variable 0x1c iload_2 Load int from local variable 0x1d iload_3 Load int from local variable 0x1e lload_0 Load long from local variable 0x1f lload_1 Load long from local variable

131 0x20 lload_2 Load long from local variable 0x21 lload_3 Load long from local variable 0x22 fload_0 Load float from local variable 0x23 fload_1 Load float from local variable 0x24 fload_2 Load float from local variable 0x25 fload_3 Load float from local variable 0x26 dload_0 Load double from local variable 0x27 dload_1 Load double from local variable 0x28 dload_2 Load double from local variable 0x29 dload_3 Load double from local variable 0x2a aload_0 Load reference from local variable 0x2b aload_1 Load reference from local variable 0x2c aload_2 Load reference from local variable 0x2d aload_3 Load reference from local variable 0x2e iaload Load int from array 0x2f laload Load long from array 0x30 faload Load float from array 0x31 daload Load double from array 0x32 aaload Load reference from array 0x33 baload Load byte or boolean from array 0x34 caload Load char from array 0x35 saload Load short from array 0x36 istore Store int into local variable 0x37 lstore Store long into local variable 0x38 fstore Store float into local variable 0x39 dstore Store double into local variable 0x3a astore Store reference into local variable 0x3b istore_0 Store int into local variable 0x3c istore_1 Store int into local variable 0x3d istore_2 Store int into local variable 0x3e istore_3 Store int into local variable 0x3f lstore_0 Store long into local variable 0x40 lstore_1 Store long into local variable 0x41 lstore_2 Store long into local variable 0x42 lstore_3 Store long into local variable 0x43 fstore_0 Store float into local variable 0x44 fstore_1 Store float into local variable 0x45 fstore_2 Store float into local variable 0x46 fstore_3 Store float into local variable 0x47 dstore_0 Store double into local variable 0x48 dstore_1 Store double into local variable 0x49 dstore_2 Store double into local variable 0x4a dstore_3 Store double into local variable 0x4b astore_0 Store reference into local variable 0x4c astore_1 Store reference into local variable 0x4d astore_2 Store reference into local variable 0x4e astore_3 Store reference into local variable 0x4f iastore Store into int array 0x50 lastore Store into long array 0x51 fastore Store into float array 0x52 dastore Store into double array 0x53 aastore Store into reference array 0x54 bastore Store into byte or boolean array 0x55 castore Store into char array

132 0x56 sastore Store into short array 0x57 pop Pop the top operand stack value 0x58 pop2 Pop the top one or two operand stack values 0x59 dup Duplicate the top operand stack value 0x5a dup_x1 Duplicate the top operand stack value and insert two values down 0x5b dup_x2 Duplicate the top operand stack value and insert two or three values down 0x5c dup2 Duplicate the top one or two operand stack values 0x5d dup2_x1 Duplicate the top one or two operand stack values and insert two or three values down 0x5e dup2_x2 Duplicate the top one or two operand stack values and insert two, three, or four values down 0x5f swap Swap the top two operand stack values 0x60 iadd Add int 0x61 ladd Add long 0x62 fadd Add float 0x63 dadd Add double 0x64 isub Subtract int 0x65 lsub Subtract long 0x66 fsub Subtract float 0x67 dsub Subtract double 0x68 imul Multiply int 0x69 lmul Multiply long 0x6a fmul Multiply float 0x6b dmul Multiply double 0x6c idiv Divide int 0x6d ldiv Divide long 0x6e fdiv Divide float 0x6f ddiv Divide double 0x70 irem Remainder int 0x71 lrem Remainder long 0x72 frem Remainder float 0x73 drem Remainder double 0x74 ineg Negate int 0x75 lneg Negate long 0x76 fneg Negate float 0x77 dneg Negate double 0x78 ishl Shift left int 0x79 lshl Shift left long 0x7a ishr Arithmetic shift right int 0x7b lshr Arithmetic shift right long 0x7c iushr Logical shift right int 0x7d lushr Logical shift right long 0x7e iand Boolean AND int 0x7f land Boolean AND long 0x80 ior Boolean OR int 0x81 lor Boolean OR long 0x82 ixor Boolean XOR int 0x83 lxor Boolean XOR long 0x84 iinc Increment local variable by constant 0x85 i2l Convert int to long 0x86 i2f Convert int to float 0x87 i2d Convert int to double

133 0x88 l2i Convert long to int 0x89 l2f Convert long to float 0x8a l2d Convert long to double 0x8b f2i Convert float to int 0x8c f2l Convert float to long 0x8d f2d Convert float to double 0x8e d2i Convert double to int 0x8f d2l Convert double to long 0x90 d2f Convert double to float 0x91 i2b Convert int to byte 0x92 i2c Convert int to char 0x93 i2s Convert int to short 0x94 lcmp Compare long 0x95 fcmpl Compare float 0x96 fcmpg Compare float 0x97 dcmpl Compare double 0x98 dcmpg Compare double 0x99 ifeq Branch if int comparison with zero succeeds 0x9a ifne Branch if int comparison with zero succeeds 0x9b iflt Branch if int comparison with zero succeeds 0x9c ifge Branch if int comparison with zero succeeds 0x9d ifgt Branch if int comparison with zero succeeds 0x9e ifle Branch if int comparison with zero succeeds 0x9f if_icmpeq Branch if int comparison succeeds 0xa0 if_icmpne Branch if int comparison succeeds 0xa1 if_icmplt Branch if int comparison succeeds 0xa2 if_icmpge Branch if int comparison succeeds 0xa3 if_icmpgt Branch if int comparison succeeds 0xa4 if_icmple Branch if int comparison succeeds 0xa5 if_acmpeq Branch if reference comparison succeeds 0xa6 if_acmpne Branch if reference comparison succeeds 0xa7 goto Branch always 0xa8 jsr Jump subroutine 0xa9 ret Return from subroutine 0xaa tableswitch Access jump table by index and jump 0xab lookupswitch Access jump table by key match and jump 0xac ireturn Return int from method 0xad lreturn Return long from method 0xae freturn Return float from method 0xaf dreturn Return double from method 0xb0 areturn Return reference from method 0xb1 return Return void from method 0xb2 getstatic Get static field from class 0xb3 putstatic Set static field in class 0xb4 getfield Fetch field from object 0xb5 putfield Set field in object 0xb6 invokevirtual Invoke instance method; dispatch based on class 0xb7 invokespecial Invoke instance method; direct invocation of instance initialization methods and methods of the current class and its supertypes 0xb8 invokestatic Invoke a class (static) method 0xb9 invokeinterface Invoke interface method 0xba invokedynamic Invoke a dynamically-computed call site 0xbb new Create new object

134 0xbc newarray Create new array 0xbd anewarray Create new array of reference 0xbe arraylength Get length of array 0xbf athrow Throw exception or error 0xc0 checkcast Check whether object is of given type 0xc1 instanceof Determine if object is of given type 0xc2 monitorenter Enter monitor for object 0xc3 monitorexit Exit monitor for object 0xc4 wide Extend local variable index by additional bytes 0xc5 multianewarray Create new multidimensional array 0xc6 ifnull Branch if reference is null 0xc7 ifnonnull Branch if reference not null 0xc8 goto_w Branch always (wide index) 0xc9 jsr_w Jump subroutine (wide index)

A.2 Instruction handler flags

The following table shows the flags of each JVM instruction. These flags relate to the ability of the instruction to be merged into a superinstruction in QuickInterp. The following flags exist:

UNKNOWN_PC_OFFSET The instruction changes the program counter by a variable amount. This applies to instructions with a variable amount of instruction operands, like tableswitch. If such an instruction is placed in a superinstruction, writes to the pc cannot be coalesced symbolically past this instruction as seen in section 5.3.3. An instruction with this flag can only be part of a superinstruction if it also has the UPDATES_PC_OFFSET flag.

UNKNOWN_STACK_OFFSET Similar to UNKNOWN_PC_OFFSET, but for the stack. Has the same consequences, but now stack writes cannot be coalesced. This applies to instructions like invokevirtual which pop (or push) a variable amount of values to the operand stack. An instruction with this flag can only be part of a superinstruction if it also has the UPDATES_STACK_OFFSET flag.

NO_SUPERINSTRUCTION This instruction cannot be part of a superinstruction. Usually these are instructions which leave the interpreter and expect to return by means of a dispatch, and since it is not possible to dispatch to within a superinstruction, these cannot be part of a superinstruction.

TERMINAL When placed in a superinstruction, no instruction would be executed after this instruction as it either unconditionally jumps away or exits the method.

JUMP This is a jump instruction. This is combined with TERMINAL to indicate the instruction is an unconditional jump.

UPDATES_PC_OFFSET This instruction writes to the program counter, updating it. This has consequences for program counter write coalescence, as reads after this instruction are now no longer relative to the start of the superinstruction, but relative to the last instruction with the UPDATES_PC_OFFSET flag. This flag is required for instructions with UNKNOWN_PC_OFFSET if they are allowed in a superinstruction.

UPDATES_STACK_OFFSET Similar to UPDATES_PC_OFFSET, but for the operand stack. This flag is required for instructions with UNKNOWN_STACK_OFFSET if they are allowed in a superinstruction.
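The interaction between these flags can be summarised as a small predicate. The sketch below is our own illustration with a hypothetical EnumSet-based Java model; it captures when an instruction may be merged into a superinstruction according to the rules above. The full per-opcode flag assignment follows in the table below.

import java.util.EnumSet;

// Hypothetical Java model of the QuickInterp instruction flags described above.
enum InstructionFlag {
    UNKNOWN_PC_OFFSET, UNKNOWN_STACK_OFFSET, NO_SUPERINSTRUCTION,
    TERMINAL, JUMP, UPDATES_PC_OFFSET, UPDATES_STACK_OFFSET
}

final class SuperinstructionRules {
    // An instruction may be merged into a superinstruction unless it is explicitly
    // excluded, or it has an unknown pc/stack effect without also updating the
    // corresponding offset itself.
    static boolean mayBePartOfSuperinstruction(EnumSet<InstructionFlag> flags) {
        if (flags.contains(InstructionFlag.NO_SUPERINSTRUCTION)) return false;
        if (flags.contains(InstructionFlag.UNKNOWN_PC_OFFSET)
                && !flags.contains(InstructionFlag.UPDATES_PC_OFFSET)) return false;
        if (flags.contains(InstructionFlag.UNKNOWN_STACK_OFFSET)
                && !flags.contains(InstructionFlag.UPDATES_STACK_OFFSET)) return false;
        return true;
    }
}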

Opcode Label QuickInterp flags 0x0 nop

135 0x1 aconst_null 0x2 iconst_m1 0x3 iconst_0 0x4 iconst_1 0x5 iconst_2 0x6 iconst_3 0x7 iconst_4 0x8 iconst_5 0x9 lconst_0 0xa lconst_1 0xb fconst_0 0xc fconst_1 0xd fconst_2 0xe dconst_0 0xf dconst_1 0x10 bipush 0x11 sipush 0x12 ldc NO_SUPERINSTRUCTION 0x13 ldc_w NO_SUPERINSTRUCTION 0x14 ldc2_w NO_SUPERINSTRUCTION 0x15 iload 0x16 lload 0x17 fload 0x18 dload 0x19 aload 0x1a iload_0 0x1b iload_1 0x1c iload_2 0x1d iload_3 0x1e lload_0 0x1f lload_1 0x20 lload_2 0x21 lload_3 0x22 fload_0 0x23 fload_1 0x24 fload_2 0x25 fload_3 0x26 dload_0 0x27 dload_1 0x28 dload_2 0x29 dload_3 0x2a aload_0 0x2b aload_1 0x2c aload_2 0x2d aload_3 0x2e iaload 0x2f laload 0x30 faload 0x31 daload 0x32 aaload 0x33 baload 0x34 caload 0x35 saload 0x36 istore

136 0x37 lstore 0x38 fstore 0x39 dstore 0x3a astore 0x3b istore_0 0x3c istore_1 0x3d istore_2 0x3e istore_3 0x3f lstore_0 0x40 lstore_1 0x41 lstore_2 0x42 lstore_3 0x43 fstore_0 0x44 fstore_1 0x45 fstore_2 0x46 fstore_3 0x47 dstore_0 0x48 dstore_1 0x49 dstore_2 0x4a dstore_3 0x4b astore_0 0x4c astore_1 0x4d astore_2 0x4e astore_3 0x4f iastore 0x50 lastore 0x51 fastore 0x52 dastore 0x53 aastore 0x54 bastore 0x55 castore 0x56 sastore 0x57 pop 0x58 pop2 0x59 dup 0x5a dup_x1 0x5b dup_x2 0x5c dup2 0x5d dup2_x1 0x5e dup2_x2 0x5f swap 0x60 iadd 0x61 ladd 0x62 fadd 0x63 dadd 0x64 isub 0x65 lsub 0x66 fsub 0x67 dsub 0x68 imul 0x69 lmul 0x6a fmul 0x6b dmul 0x6c idiv

137 0x6d ldiv 0x6e fdiv 0x6f ddiv 0x70 irem 0x71 lrem 0x72 frem 0x73 drem 0x74 ineg 0x75 lneg 0x76 fneg 0x77 dneg 0x78 ishl 0x79 lshl 0x7a ishr 0x7b lshr 0x7c iushr 0x7d lushr 0x7e iand 0x7f land 0x80 ior 0x81 lor 0x82 ixor 0x83 lxor 0x84 iinc 0x85 i2l 0x86 i2f 0x87 i2d 0x88 l2i 0x89 l2f 0x8a l2d 0x8b f2i 0x8c f2l 0x8d f2d 0x8e d2i 0x8f d2l 0x90 d2f 0x91 i2b 0x92 i2c 0x93 i2s 0x94 lcmp 0x95 fcmpl 0x96 fcmpg 0x97 dcmpl 0x98 dcmpg 0x99 ifeq JUMP 0x9a ifne JUMP 0x9b iflt JUMP 0x9c ifge JUMP 0x9d ifgt JUMP 0x9e ifle JUMP 0x9f if_icmpeq JUMP 0xa0 if_icmpne JUMP 0xa1 if_icmplt JUMP 0xa2 if_icmpge JUMP

138 0xa3 if_icmpgt JUMP 0xa4 if_icmple JUMP 0xa5 if_acmpeq JUMP 0xa6 if_acmpne JUMP 0xa7 goto JUMP, TERMINAL 0xa8 jsr JUMP, TERMINAL 0xa9 ret JUMP, TERMINAL 0xaa tableswitch JUMP, NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xab lookupswitch JUMP, NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xac ireturn NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xad lreturn NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xae freturn NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xaf dreturn NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xb0 areturn NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xb1 return NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xb2 getstatic NO_SUPERINSTRUCTION, UNKNOWN_STACK_OFFSET, UPDATES_STACK_OFFSET 0xb3 putstatic NO_SUPERINSTRUCTION, UNKNOWN_STACK_OFFSET, UPDATES_STACK_OFFSET 0xb4 getfield NO_SUPERINSTRUCTION, UNKNOWN_STACK_OFFSET, UPDATES_STACK_OFFSET 0xb5 putfield NO_SUPERINSTRUCTION, UNKNOWN_STACK_OFFSET, UPDATES_STACK_OFFSET 0xb6 invokevirtual NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xb7 invokespecial NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xb8 invokestatic NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xb9 invokeinterface NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xba invokedynamic NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xbb new 0xbc newarray 0xbd anewarray 0xbe arraylength 0xbf athrow NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xc0 checkcast NO_SUPERINSTRUCTION 0xc1 instanceof NO_SUPERINSTRUCTION 0xc2 monitorenter NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xc3 monitorexit NO_SUPERINSTRUCTION 0xc4 wide NO_SUPERINSTRUCTION, UNKNOWN_STACK_OFFSET, UPDATES_PC_OFFSET, UPDATES_STACK_OFFSET 0xc5 multianewarray NO_SUPERINSTRUCTION, UNKNOWN_STACK_OFFSET, UPDATES_STACK_OFFSET 0xc6 ifnull JUMP 0xc7 ifnonnull JUMP 0xc8 goto_w JUMP, NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xc9 jsr_w JUMP, NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET

A.3 Instruction pc and stack offsets

To enable write coalescing within a superinstruction, the size of the instruction in the bytecode stream and its modifications to the top-of-stack pointer need to be known at compile-time so succeeding instruction handlers can be generated with an offset. The instruction size with operands is shown in the PC offset column, and the stack offset is shown in the Stack offset column. For more information, see section 5.7.1. Note that all PC offset values are after code stretching, and as such are all one byte larger than what the JVM specification [8] describes.
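As an illustration of how these per-instruction offsets could drive handler generation, the sketch below accumulates pc and stack offsets over a candidate sequence, which is roughly the bookkeeping a code generator needs in order to emit each concatenated handler at the right relative offsets and defer the actual pc and top-of-stack writes to a single coalesced update. The types and the printed "generation plan" are hypothetical simplifications of the real generator described in section 5.7.1; the per-instruction offsets themselves are listed in the table below.

import java.util.List;

// Hypothetical, simplified model of one row of the table below: the opcode's size
// in the (stretched) bytecode stream and its net effect on the operand stack.
record InstructionInfo(String mnemonic, int pcOffset, int stackOffset) {}

final class OffsetCoalescing {
    // Walks a candidate instruction sequence and reports the cumulative pc and
    // stack offsets at which each handler body would be generated, so that pc and
    // top-of-stack writes can be coalesced into a single update at the end.
    static void printGenerationPlan(List<InstructionInfo> sequence) {
        int pc = 0, stack = 0;
        for (InstructionInfo insn : sequence) {
            System.out.printf("emit %-10s at pc+%d, stack%+d%n", insn.mnemonic(), pc, stack);
            pc += insn.pcOffset();
            stack += insn.stackOffset();
        }
        System.out.printf("single coalesced update: pc += %d, top-of-stack += %d%n", pc, stack);
    }

    public static void main(String[] args) {
        // Example sequence iload_1, iload_2, iadd, using the offsets from the table below.
        printGenerationPlan(List.of(
                new InstructionInfo("iload_1", 2, 1),
                new InstructionInfo("iload_2", 2, 1),
                new InstructionInfo("iadd", 2, -1)));
    }
}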

Opcode Instruction PC offset Stack offset 0x0 nop +2 0 0x1 aconst_null +2 +1 0x2 iconst_m1 +2 +1 0x3 iconst_0 +2 +1 0x4 iconst_1 +2 +1 0x5 iconst_2 +2 +1 0x6 iconst_3 +2 +1 0x7 iconst_4 +2 +1 0x8 iconst_5 +2 +1 0x9 lconst_0 +2 +2 0xa lconst_1 +2 +2 0xb fconst_0 +2 +1 0xc fconst_1 +2 +1 0xd fconst_2 +2 +1 0xe dconst_0 +2 +2 0xf dconst_1 +2 +2 0x10 bipush +3 +1 0x11 sipush +4 +1 0x12 ldc +3 +1 0x13 ldc_w +4 +1 0x14 ldc2_w +4 +2 0x15 iload +3 +1 0x16 lload +3 +2 0x17 fload +3 +1 0x18 dload +3 +2 0x19 aload +3 +1 0x1a iload_0 +2 +1 0x1b iload_1 +2 +1 0x1c iload_2 +2 +1 0x1d iload_3 +2 +1 0x1e lload_0 +2 +2 0x1f lload_1 +2 +2 0x20 lload_2 +2 +2 0x21 lload_3 +2 +2 0x22 fload_0 +2 +1 0x23 fload_1 +2 +1 0x24 fload_2 +2 +1 0x25 fload_3 +2 +1 0x26 dload_0 +2 +2 0x27 dload_1 +2 +2 0x28 dload_2 +2 +2 0x29 dload_3 +2 +2 0x2a aload_0 +2 +1

140 0x2b aload_1 +2 +1 0x2c aload_2 +2 +1 0x2d aload_3 +2 +1 0x2e iaload +2 -1 0x2f laload +2 0 0x30 faload +2 -1 0x31 daload +2 0 0x32 aaload +2 -1 0x33 baload +2 -1 0x34 caload +2 -1 0x35 saload +2 -1 0x36 istore +3 -1 0x37 lstore +3 -2 0x38 fstore +3 -1 0x39 dstore +3 -2 0x3a astore +3 -1 0x3b istore_0 +2 -1 0x3c istore_1 +2 -1 0x3d istore_2 +2 -1 0x3e istore_3 +2 -1 0x3f lstore_0 +2 -2 0x40 lstore_1 +2 -2 0x41 lstore_2 +2 -2 0x42 lstore_3 +2 -2 0x43 fstore_0 +2 -1 0x44 fstore_1 +2 -1 0x45 fstore_2 +2 -1 0x46 fstore_3 +2 -1 0x47 dstore_0 +2 -2 0x48 dstore_1 +2 -2 0x49 dstore_2 +2 -2 0x4a dstore_3 +2 -2 0x4b astore_0 +2 -1 0x4c astore_1 +2 -1 0x4d astore_2 +2 -1 0x4e astore_3 +2 -1 0x4f iastore +2 -3 0x50 lastore +2 -4 0x51 fastore +2 -3 0x52 dastore +2 -4 0x53 aastore +2 -3 0x54 bastore +2 -3 0x55 castore +2 -3 0x56 sastore +2 -3 0x57 pop +2 -1 0x58 pop2 +2 -2 0x59 dup +2 +1 0x5a dup_x1 +2 +1 0x5b dup_x2 +2 +1 0x5c dup2 +2 +2 0x5d dup2_x1 +2 +2 0x5e dup2_x2 +2 +2 0x5f swap +2 0 0x60 iadd +2 -1

141 0x61 ladd +2 -2 0x62 fadd +2 -1 0x63 dadd +2 -2 0x64 isub +2 -1 0x65 lsub +2 -2 0x66 fsub +2 -1 0x67 dsub +2 -2 0x68 imul +2 -1 0x69 lmul +2 -2 0x6a fmul +2 -1 0x6b dmul +2 -2 0x6c idiv +2 -1 0x6d ldiv +2 -2 0x6e fdiv +2 -1 0x6f ddiv +2 -2 0x70 irem +2 -1 0x71 lrem +2 -2 0x72 frem +2 -1 0x73 drem +2 -2 0x74 ineg +2 0 0x75 lneg +2 0 0x76 fneg +2 0 0x77 dneg +2 0 0x78 ishl +2 -1 0x79 lshl +2 -1 0x7a ishr +2 -1 0x7b lshr +2 -1 0x7c iushr +2 -1 0x7d lushr +2 -1 0x7e iand +2 -1 0x7f land +2 -2 0x80 ior +2 -1 0x81 lor +2 -2 0x82 ixor +2 -1 0x83 lxor +2 -2 0x84 iinc +4 0 0x85 i2l +2 +1 0x86 i2f +2 0 0x87 i2d +2 +1 0x88 l2i +2 -1 0x89 l2f +2 -1 0x8a l2d +2 0 0x8b f2i +2 0 0x8c f2l +2 +1 0x8d f2d +2 +1 0x8e d2i +2 -1 0x8f d2l +2 0 0x90 d2f +2 -1 0x91 i2b +2 0 0x92 i2c +2 0 0x93 i2s +2 0 0x94 lcmp +2 -3 0x95 fcmpl +2 -1 0x96 fcmpg +2 -1

142 0x97 dcmpl +2 -3 0x98 dcmpg +2 -3 0x99 ifeq +4 -1 0x9a ifne +4 -1 0x9b iflt +4 -1 0x9c ifge +4 -1 0x9d ifgt +4 -1 0x9e ifle +4 -1 0x9f if_icmpeq +4 -2 0xa0 if_icmpne +4 -2 0xa1 if_icmplt +4 -2 0xa2 if_icmpge +4 -2 0xa3 if_icmpgt +4 -2 0xa4 if_icmple +4 -2 0xa5 if_acmpeq +4 -2 0xa6 if_acmpne +4 -2 0xa7 goto +4 0 0xa8 jsr +4 0 0xa9 ret +4 0 0xaa tableswitch Unknown Unknown 0xab lookupswitch Unknown Unknown 0xac ireturn Unknown Unknown 0xad lreturn Unknown Unknown 0xae freturn Unknown Unknown 0xaf dreturn Unknown Unknown 0xb0 areturn Unknown Unknown 0xb1 return Unknown Unknown 0xb2 getstatic +4 Unknown 0xb3 putstatic +4 Unknown 0xb4 getfield +4 Unknown 0xb5 putfield +4 Unknown 0xb6 invokevirtual Unknown Unknown 0xb7 invokespecial Unknown Unknown 0xb8 invokestatic Unknown Unknown 0xb9 invokeinterface Unknown Unknown 0xba invokedynamic Unknown Unknown 0xbb new +4 +1 0xbc newarray +3 0 0xbd anewarray +4 0 0xbe arraylength +2 0 0xbf athrow Unknown Unknown 0xc0 checkcast +4 0 0xc1 instanceof +4 0 0xc2 monitorenter Unknown Unknown 0xc3 monitorexit +2 -1 0xc4 wide Unknown Unknown 0xc5 multianewarray +5 Unknown 0xc6 ifnull +4 -1 0xc7 ifnonnull +4 -1 0xc8 goto_w Unknown Unknown 0xc9 jsr_w Unknown Unknown

Definitions

abstract interpreter An abstract interpreter is used to statically analyze a program without running it. It operates on data types and not on values, and takes every branch until it has visited every instruction. See section 2.1.4. Used on pages 8, 17, 21, 56, 57

abstract syntax tree A tree representation of the syntactic structure of source code. Used on page 55

ASM ObjectWeb ASM is a library that provides an API for modifying JVM bytecode. For QuickInterp, ASM is modified to work with two-byte opcodes and used in the runtime substitution algorithms. Used on pages 72, 85

base superinstruction candidate A base superinstruction candidate is a superinstruction candidate directly derived from the profile and is used for static evaluation. Further processing may increase the number of superinstruction candidates for consideration. See section 4.4.2 and section 4.4.3. Used on pages 30–33, 41

bytecode The instruction set of the JVM is called bytecode. It is a very compact stack-based instruction set where every instruction opcode is just a single byte, making it very suitable for transport across the internet. In order to execute bytecode, a JVM implementation is required. Used on page 2

bytecode stream Alternative name for the Code attribute within a method, which is the source of all bytecode instructions. Used on page 16

control table In an interpreter the control table is the table of all instruction handlers. Instruction tokens are dispatched by referencing their offset from the control table and jumping to this location in a token-threaded interpreter. See section 1.2. Used on pages 19, 20

DDG A Data Dependency Graph is a graph constructed from a program with the instructions as nodes. Directed edges between instructions indicate a data dependency, where one instruction operates on the data provided by a previous instruction. Used on pages 49, 51, 59–61

instruction handler The code in an interpreter implementing the instruction semantics. The interpreter will dispatch to the right instruction handler after fetching the current instruction-to-be-executed. See section 1.2. Used on pages 2, 12, 19, 20, 26, 93, 138, 139

instruction operand The value an instruction operates on. Typically one that is embedded in the program code and not obtained from external sources. See section 1.2. Used on pages 16, 19, 20, 39, 40

interpreter An interpreter executes an input program without any prior conversion. See section 1.2. Used on pages 9, 10, 12, 138, 139

javac The Java compiler. Used on pages 15, 57

JIT compiler A method of converting bodies of virtual machine code (typically a single function or method) to target machine code just before the execution of said code. Typically faster than an interpreter for often-executed code. See section 1.3. Used on pages 9, 11, 12, 27, 106, 108, 122, 123

JNI Java Native Interface is an API enabling calling native code from Java code, and calling Java code from native code. QuickInterp uses JNI to call runtime substitution algorithms written in Java from the JVMTI agent. Used on page 72

JNIF JNIF [12] is a C++ library created to modify bytecode in a JVMTI client by the Software and Programmer Efficiency Research Group (“sape”) from the University of Lugano in Switzerland. Used on pages 72, 74

JVM A VM implementing the Java runtime environment. Since a JVM is typically not implemented in hardware, running Java applications (bytecode) almost always requires a JVM. See section 1.1 and 2. Used on pages 2, 10, 12, 15, 23, 34, 138, 140

JVMTI The Java Virtual Machine Tooling Interface (JVMTI) is a native API for tools that want to monitor or modify the state of the JVM. For QuickInterp, its key feature is the ability to intercept and modify all classes as they get loaded, which is used to implement superinstructions. Used on pages 67, 71–73, 139

profile A profile typically refers to a runtime profile of a particular application – this is a serialized log of program execution in such a way that it can help a superinstruction set construction algorithm in constructing the optimal superinstruction set for that application. Used on pages 25, 29, 138

QuickInterp The name of the new design and implementation of a superinstruction architecture created in response to the goals in section 1.5. Used on pages 2, 24–26, 28–30, 35, 39–41, 43, 46, 47, 62, 73

RPN A mathematical notation where operations are preceded by their operands. For example, instead of writing 1 + 2, in RPN one would write 1 2 +. Used on page 55

static evaluation Static evaluation refers to the ability to estimate the quality of a particular superinstruction set combined with a runtime superinstruction placement algorithm and a profile statically. See section 4.4.3. Used on pages 31, 33, 138

superinstruction A superinstruction implements the functionality of a set of regular instructions in a single instruction. Having a superinstruction in the interpreter instead of a sequence of smaller instructions saves a number of costly jumps, making it an optimization for the interpreter. See section 1.4. Used on pages 2, 9, 12, 139

superinstruction candidate A sequence of bytecode opcodes (without their associated operands) that is to be considered for inclusion in a superinstruction set. See section 4.4.2. Used on pages 7, 30–32, 138

superinstruction set A superinstruction-enabled interpreter is compiled with a set of additional superinstructions. This set is the superinstruction set. Used on pages 30, 32, 33, 139

threaded code interpreter An interpreter design where the input program is represented as an array of pointers. Executing an instruction of the program is done by executing the code at the location of the pointer. See section 1.2. Used on pages 7, 20

token-threaded interpreter A type of interpreter operating on a stream of tokens, decoding them on the fly to resolve which instruction handler to execute. See section 1.2. Used on pages 19, 66, 70, 138

VM An emulator of a computer system (hence “Virtual Machine”) – in the case of the JVM this system is not a system typically implemented in hardware, but rather an abstraction that is used to make software portable. See section 1.1 and 2. Used on pages 9, 10, 15, 139

Bibliography

[1] Kevin Casey. “Automatic Generation of Optimised Virtual Machine Interpreters”. PhD thesis. Citeseer, 2006.

[2] Kevin Casey, M Anton Ertl, and David Gregg. “Optimizing indirect branch prediction accuracy in virtual machine interpreters”. In: Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation. 2003, pp. 278–288.

[3] Kevin Casey et al. “Towards Superinstructions for Java Interpreters”. In: Software and Compilers for Embedded Systems. Ed. by Andreas Krall. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 329–343. isbn: 978-3-540-39920-9.

[4] M Anton Ertl et al. “vmgen—A Generator of Efficient Virtual Machine Interpreters”. In: Software: Practice and Experience 32.3 (2002), pp. 265–294.

[5] Martin Anton Ertl, Christian Thalinger, and Andreas Krall. “Superinstructions and Replication in the Cacao JVM interpreter”. In: Journal of .NET Technologies Vol. 4 (2006), pp. 25–32.

[6] David Gregg, M Anton Ertl, and Andreas Krall. “Implementing an efficient Java interpreter”. In: International Conference on High-Performance Computing and Networking. Springer. 2001, pp. 613–620.

[7] Tim Lindholm et al. “The Java® Virtual Machine Specification - Java SE 11 Edition”. In: Oracle, Aug. 2018. Chap. 2: The Structure of the Java Virtual Machine. url: https://docs.oracle.com/javase/specs/jvms/se11/html/index.html.

[8] Tim Lindholm et al. “The Java® Virtual Machine Specification - Java SE 11 Edition”. In: Oracle, Aug. 2018. Chap. 6: The Java Virtual Machine Instruction Set. url: https://docs.oracle.com/javase/specs/jvms/se11/html/index.html.

[9] Tim Lindholm et al. “The Java® Virtual Machine Specification - Java SE 11 Edition”. In: Oracle, Aug. 2018. Chap. 4.10: Verification of class Files. url: https://docs.oracle.com/javase/specs/jvms/se11/html/index.html.

[10] Tim Lindholm et al. “The Java® Virtual Machine Specification - Java SE 11 Edition”. In: Oracle, Aug. 2018. Chap. 3: Compiling for the Java Virtual Machine. url: https://docs.oracle.com/javase/specs/jvms/se11/html/index.html.

[11] Tim Lindholm et al. The Java® Virtual Machine Specification - Java SE 11 Edition. Aug. 2018. url: https://docs.oracle.com/javase/specs/jvms/se11/html/index.html.

[12] Luis Mastrangelo and Matthias Hauswirth. “JNIF: Java Native Instrumentation Framework”. In: Proceedings of the 2014 International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools. PPPJ '14. Cracow, Poland: Association for Computing Machinery, 2014, pp. 194–199. isbn: 9781450329262. doi: 10.1145/2647508.2647516. url: https://doi.org/10.1145/2647508.2647516.

[13] Kazunori Ogata, Hideaki Komatsu, and Toshio Nakatani. “Bytecode fetch optimization for a Java interpreter”. In: ACM SIGOPS Operating Systems Review. Vol. 36. 5. ACM. 2002, pp. 58–67.

[14] OpenJDK Zero-Assembler Project. url: https://openjdk.java.net/projects/zero/.

[15] Todd A. Proebsting. “Optimizing an ANSI C Interpreter with Superoperators”. In: Proceedings of the 22nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. POPL '95. San Francisco, California, USA: Association for Computing Machinery, 1995, pp. 322–332. isbn: 0897916921. doi: 10.1145/199448.199526. url: https://doi.org/10.1145/199448.199526.

[16] Antoine Rey et al. Spring PetClinic Sample Application. https://github.com/spring-projects/spring-petclinic. 2020.

[17] J. E. Smith and G. S. Sohi. “The microarchitecture of superscalar processors”. In: Proceedings of the IEEE 83.12 (1995), pp. 1609–1624.

[18] The Java Language Environment. url: https://www.oracle.com/technetwork/java/intro-141325.html.
