Implementation of a Graph Coloring Register Allocator for the Graal Compiler

Master Thesis to obtain the academic degree of Diplom-Ingenieur in the Master's Program Computer Science

Submitted by: Florian Schröckeneder
Submitted at: Institute for System Software
Supervisor: Prof. Dr. Dr. h.c. Hanspeter Mössenböck
Co-Supervisor: DI Josef Eisl

July 2017

JOHANNES KEPLER UNIVERSITY LINZ
Altenbergerstraße 69, 4040 Linz, Österreich
www.jku.at
DVR 0093696

Abstract

Register allocation is crucial for the performance of modern compilers. In this thesis we implemented a graph coloring register allocator and compared it to a linear scan approach. In cases where code quality is important, our approach potentially has an advantage. In cases where compile time is more important, the other approach has the advantage.

In this thesis we first explain the two register allocation approaches and take a look at where register allocation fits into the compilation process. Then we look at our version of register allocation and show a complete example of how it works.

For the comparison of the two approaches we use two benchmark suites. From the results of this comparison we learned that linear scan can perform better if it is highly optimized and the graph coloring allocator is not. In order to use the full advantages of graph coloring, a lot of further optimizations are needed.

From what we have learned creating this thesis, it is beneficial to have the possibility to choose between different kinds of register allocation. Depending on the situation, better code performance can be more important than a shorter compile time. But to really make a difference in practice, a lot of optimization effort has to be put into an implementation.

Kurzfassung

Register allocation is crucial for the performance of modern compilers. In this thesis we implemented one register allocation approach, namely graph coloring, and compared it with a linear scan approach. In cases where the quality of the code is important, our approach potentially has advantages. In cases where compile time is more important, the other approach has advantages.

In this thesis we first present both register allocation approaches and look at where exactly in the compilation process register allocation takes place. We then present our version of register allocation and show a complete example of how it works.

For the comparison of the two approaches we use two benchmark suites. From the results of this comparison we learned that a highly optimized linear scan approach achieves better peak performance than an unoptimized graph coloring approach. To exploit the advantages of graph coloring, further optimizations are necessary.

While creating this thesis we learned that it is beneficial to have the possibility to choose between different kinds of register allocation. Depending on the situation, better code performance can be more important than a shorter compile time. But to really make a difference in practice, a lot of optimization effort has to be put into an implementation.

Contents

1 Introduction 1
   1.1 Variables, Values and Live Ranges 1
   1.2 Register allocation 1
       1.2.1 Graph Coloring 2
       1.2.2 Linear Scan 2
   1.3 Trade-off between allocation methods 3
   1.4 Structure of this Thesis 4

2 System Overview 5
   2.1 Java VM 5
   2.2 HotSpot VM 5
   2.3 Graal 6
       2.3.1 Compilation with Graal 6
   2.4 Static Single Assignment Form 7

3 Implementation 8
   3.1 Chaitin-Briggs Algorithm 8
       3.1.1 Data Structure 9
       3.1.2 Renumber 9
       3.1.3 Build 10
       3.1.4 Coalesce 10
       3.1.5 Spill Costs 10
       3.1.6 Optimistic coloring 11
           3.1.6.1 Simplify 11
           3.1.6.2 Select 12
       3.1.7 Spill Code 13
   3.2 Graph Coloring Register Allocation For Graal 13
       3.2.1 Liveness Analysis 14
           3.2.1.1 Number Instructions 14
           3.2.1.2 Build Local Live Sets 15
           3.2.1.3 Build Global Live Sets 15
           3.2.1.4 Build Intervals 16
       3.2.2 Build Graph 17
       3.2.3 Coloring 19
           3.2.3.1 Simplify 19
           3.2.3.2 Select 20
       3.2.4 Spill Values 21
       3.2.5 After Spill 22
       3.2.6 Assign Location 23
       3.2.7 Phi-Resolution 24
       3.2.8 Data-Structure 25

4 Case Study 28
   4.1 Test Case 28
   4.2 LIR Code of the Test Case 28
   4.3 After lifetime analysis 29
   4.4 After build graph 31
   4.5 After simplify 31
   4.6 After spill 32
   4.7 After select 33
   4.8 After assign locations 33
   4.9 After resolve data flow 35

5 Evaluation 37
   5.1 Evaluation environment and benchmark 37
   5.2 DaCapo 39
   5.3 Peak Performance 39
   5.4 Compile Time 39
   5.5 ScalaDacapo 40
   5.6 Peak Performance 40
   5.7 Compile Time 40

6 Related Work 42

7 Future Work 44
   7.1 Spilling Strategy 44
   7.2 Heuristic for Choosing a Spill Candidate 45
   7.3 Identify Performance Issues 45

8 Conclusion 46

9 Acknowledgement 47

Bibliography 51

Chapter 1

Introduction

Computer programs can have an arbitrary number of variables, but processors only have a limited number of hardware registers. This means that some variables have to be kept in memory. The goal of register allocation is to find an assignment that keeps important variables in registers [4].

1.1 Variables, Values and Live Ranges

In this thesis we differentiate between these three notions. Variables can hold values, and these values have live ranges. A variable has a type, such as integer or floating point. A value is assigned to one or more variables. A live range starts at the definition of the value and ends at its last use.

1.2 Register allocation

Register allocation is a part of the compilation process. Its task is to decide which values to keep in hardware registers at each point of the generated code. Accessing a register is faster than accessing memory, which means the decision to keep a value in memory has a direct effect on run-time performance [4]. Different ways to make this decision exist. We focused on one way to solve this issue, the graph coloring approach. Another approach is linear scan.

1.2.1 Graph Coloring

Graph coloring is a method that assigns colors to all nodes of a graph. Any two nodes that are connected via an edge cannot receive the same color. Finding a coloring for a graph is NP-complete in general, which is why we need heuristics to find an approximation. We can treat register allocation as such a graph coloring problem by creating an interference graph. The interference graph contains a node for each value. If one value is alive at the same time as another one, they interfere with each other. In that case, we add an edge between the nodes of these values to the graph. Based on these interferences we can assign values to registers. We can only assign two values to the same register if they do not interfere with each other [6, 7]. Figure 1.1 and Figure 1.2 show an example of a colored interference graph.

Figure 1.1: Interference Example
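The coloring rule described above can be sketched in a few lines. The following is an illustrative greedy coloring over an adjacency-list interference graph; the class and method names are our own and are not Graal's API, and a production allocator would order nodes far more carefully.

```java
import java.util.*;

// Sketch of greedy interference-graph coloring (illustrative names,
// not Graal's implementation). Nodes are 0..n-1, adj holds the
// interference edges, k is the number of registers; -1 = no color found.
public class GreedyColoring {
    public static int[] color(List<Set<Integer>> adj, int k) {
        int n = adj.size();
        int[] color = new int[n];
        Arrays.fill(color, -1);
        for (int node = 0; node < n; node++) {
            boolean[] used = new boolean[k];
            // A color is blocked if an already-colored neighbour has it.
            for (int neighbour : adj.get(node)) {
                if (color[neighbour] >= 0) used[color[neighbour]] = true;
            }
            for (int c = 0; c < k; c++) {
                if (!used[c]) { color[node] = c; break; }
            }
        }
        return color;
    }

    public static void main(String[] args) {
        // Five values forming a chain of interferences, two registers.
        List<Set<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < 5; i++) adj.add(new HashSet<>());
        int[][] edges = {{0, 1}, {1, 2}, {2, 3}, {3, 4}};
        for (int[] e : edges) { adj.get(e[0]).add(e[1]); adj.get(e[1]).add(e[0]); }
        System.out.println(Arrays.toString(color(adj, 2)));
    }
}
```

Neighbouring nodes always end up with different colors; whether all nodes receive a color depends on the graph and on the visiting order, which is exactly why the later chapters spend so much effort on the simplify/select ordering.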

1.2.2 Linear Scan

Linear scan does not use an interference graph but allocates registers in a single scan over the live ranges of the values. The algorithm keeps these live ranges in a sorted list and jumps from start point to start point. In every step it keeps a list of alive values and checks whether the live range that begins at this point interferes with them [18]. Figure 1.3 shows how linear scan performs the allocation for the code in Figure 1.1. We again assume two registers. At point A the two registers contain v1 and v3. Then we jump to point B. v1 is no longer alive here, but v4 is. We continue to point C, where v2 becomes alive. The registers now contain v4 and v2. At point D v1 gets a register and v2 stays in the other one. And finally, at point E, v2 and v5 are assigned to registers.

Figure 1.2: Interference graph with five values and two registers

Figure 1.3: Linear scan example
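The single scan can be sketched as follows. This is a minimal, illustrative version of the classical linear scan idea (sorted intervals, an active list, expiring old intervals); it is not Graal's implementation, and for brevity it simply leaves an interval unassigned instead of choosing a spill victim.

```java
import java.util.*;

// Minimal linear-scan sketch (illustrative, not Graal's allocator).
// Each interval is {start, end}; the result maps interval index to a
// register index, with -1 meaning the interval got no register.
public class LinearScanSketch {
    public static int[] allocate(int[][] intervals, int k) {
        Integer[] order = new Integer[intervals.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingInt(i -> intervals[i][0]));

        int[] reg = new int[intervals.length];
        Arrays.fill(reg, -1);
        Deque<Integer> free = new ArrayDeque<>();
        for (int r = 0; r < k; r++) free.add(r);
        List<Integer> active = new ArrayList<>(); // intervals holding a register

        for (int i : order) {
            // Expire intervals that ended before the current start point.
            for (Iterator<Integer> it = active.iterator(); it.hasNext(); ) {
                int a = it.next();
                if (intervals[a][1] < intervals[i][0]) { free.add(reg[a]); it.remove(); }
            }
            if (!free.isEmpty()) { reg[i] = free.poll(); active.add(i); }
            // else: a real allocator would spill the interval ending last.
        }
        return reg;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(allocate(new int[][]{{0, 4}, {2, 6}, {5, 8}}, 2)));
    }
}
```

Note how each interval is visited exactly once, which is the source of linear scan's compile-time advantage discussed in the next section.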

1.3 Trade-off between allocation methods

The graph coloring algorithm passes over the live ranges more than once in order to build and maintain the graph. Linear scan makes the register assignment in a single pass over the live ranges and is therefore faster with regard to compile time. On the other hand, linear scan makes its assignment decisions based on local knowledge and does not reevaluate these decisions [12]. Graph coloring makes its decisions based on local and global knowledge and reevaluates them, which can lead to better allocation and spill decisions and therefore to better runtime performance.

Just-in-time compilers often use linear scan because of the faster compile time. Just-in-time compilers compile code during the execution time of the program, when it is needed. However, we can think of instances where better code is more important than compile time. One example would be the compilation of time-critical methods on a server that are compiled once but executed often.

For these reasons it makes sense to extend Graal with an additional allocation method. Graal is a just-in-time compiler for Java, written in Java [8]. We provide Graal with a graph coloring algorithm.

1.4 Structure of this Thesis

The rest of the thesis is structured as follows. Chapter 2 explains the structure of the whole system and where our work fits in. In Chapter 3 we explain our solution in detail and how it works. In Chapter 4 we show a non-trivial example of the results of our work. In Chapter 5 we evaluate our results. Chapter 6 and Chapter 7 deal with related and future work. Chapter 8 concludes the thesis and summarizes our work and the results.

Chapter 2

System Overview

This thesis is implemented for the OpenJDK Graal project [8]. In order to see where this work fits in, we take a look at the whole system in this chapter.

2.1 Java VM

The Java Virtual Machine (JVM) is part of a program execution environment which works identically on a variety of different platforms. It works as an abstract machine and has a specific set of instructions, the Java bytecode [14]. Every Java program is translated into this bytecode and executed on a JVM. The JVM can perform this bytecode execution by interpreting the bytecode or by translating it to machine code via just-in-time compilation.

2.2 HotSpot VM

The Java HotSpot VM [16] is an implementation of the Java VM. It has an interpreter and two just-in-time compilers, a server compiler and a client compiler.

The HotSpot client compiler [13, 16] aims to reduce the time and memory needed for compilation. Therefore the HotSpot client compiler uses a linear scan register allocator. It focuses on local code quality and less on global optimizations.

The HotSpot server compiler [16, 17] performs global optimizations and aims for code performance. This makes the server compiler slower with regard to compile time. This compiler uses a graph coloring register allocation approach.

Another important aspect of the HotSpot VM is tiered compilation. At first the program code is executed by the interpreter. The VM analyzes the code during execution and avoids the compilation of code that is infrequently executed. This lets the compiler focus on more performance-critical parts of the code, called hotspots. These parts are then compiled by either the client or the server compiler [16].

2.3 Graal

The Graal VM is a modified version of the HotSpot VM. Graal is an alternative compiler for the HotSpot VM. The Graal VM reuses all HotSpot components that are not used for compilation, such as the interpreter and the garbage collector. Graal is an aggressively optimizing compiler and often makes assumptions about the running program [20]. Figure 2.1 shows where Graal fits in. Graal compiles Java bytecode and is written in Java.

Figure 2.1: Graal and HotSpot [20]

2.3.1 Compilation with Graal

The Graal compilation process consists of a front end and a back end. The front end is used for optimizations and has a graph IR (intermediate representation). Our work belongs to the back end, where register allocation and code generation take place. In the back end the source code is in the form of a low-level IR (LIR) [20]. The LIR consists of a control flow graph with basic blocks. Each basic block contains a list of instructions. The LIR is in SSA form and already target-specific.

2.4 Static Single Assignment Form

Static single assignment (SSA) form is an intermediate representation of the source code in which each variable is written exactly once. A control flow graph acts as the intermediate representation; it is in SSA form if every value has only one definition but can have multiple uses [11].

If multiple branches reach a basic block, it cannot be statically determined which value definition is valid. Figure 2.2 shows an example where x has two definitions. In this example z is defined as x + y, but we do not know at this point which branch the program will take. So we cannot decide whether x will hold the value of the first or the second definition. In order to make this decision we need φ-functions. In the example, the φ-function makes this decision dynamically and creates a new value called x3. x3 now holds the correct value and the algorithm can use it from that point on.

(a) Example of code not in SSA form (b) Example of SSA form with φ-function

Figure 2.2: Transformation to SSA form

Chapter 3

Implementation

In this chapter we are going to take a look at register allocation via graph coloring. Our algorithm differs in some aspects from the original algorithm our work is based on. This is partly because we reuse some code from the existing linear scan register allocator of Graal, and also due to the LIR structure and the information it provides. In order to understand our work and these differences, we are going to look at the Chaitin-Briggs algorithm first and then take a look at our approach.

3.1 Chaitin-Briggs Algorithm

Figure 3.1 shows Chaitin's algorithm and Figure 3.2 shows the structure of the Chaitin-Briggs algorithm. It consists of seven steps and maintains an interference graph. We are going to take a look at each step in order to better understand what the algorithm does.

Figure 3.1: Chaitin's Algorithm

Figure 3.2: Chaitin-Briggs Algorithm

3.1.1 Data Structure

The interference graph is the main data structure and it has two different representations. The first representation is a bit matrix, where only the half below the diagonal is used in order to save memory. This can be done because the bit matrix contains a row and a column for every node, and a bit is set to one if an edge exists between two nodes. This means that the upper half is just a mirror of the lower half, and we can leave it out without losing information. The second representation is a list of adjacency vectors. The algorithm uses the bit matrix for fast random access to the nodes and the list of adjacency vectors for sequential traversal of the graph. The length of a vector is the number of neighbours, which is the degree of the node. The algorithm stores this information and does not need to recompute it [6].
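The lower-triangular trick can be sketched as follows. The indexing formula is the standard one for a triangular matrix; the class is illustrative and not taken from the Chaitin implementation.

```java
import java.util.BitSet;

// Sketch of the lower-triangular bit matrix described above (illustrative).
// For two distinct nodes the edge bit is stored exactly once, below the
// diagonal, so roughly half the space of a full n*n matrix is needed.
public class TriangularBitMatrix {
    private final BitSet bits;

    public TriangularBitMatrix(int nodes) {
        bits = new BitSet(nodes * (nodes - 1) / 2);
    }

    // Row hi, column lo (hi > lo) maps to position hi*(hi-1)/2 + lo.
    private static int index(int a, int b) {
        int hi = Math.max(a, b), lo = Math.min(a, b);
        return hi * (hi - 1) / 2 + lo;
    }

    public void addEdge(int a, int b) { bits.set(index(a, b)); }

    public boolean interferes(int a, int b) { return bits.get(index(a, b)); }
}
```

Because `index` sorts its two arguments, `interferes(a, b)` and `interferes(b, a)` look up the same bit, which is exactly the symmetry that makes the upper half redundant.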

3.1.2 Renumber

This step of the algorithm creates a new live range for each definition point and assigns a unique name to it. A live range that reaches a use point is unioned with every other live range that reaches this point. This union is a set consisting of all live ranges that are alive at this point. The Briggs paper [4] mentions that this is implemented in their algorithm as an instance of the classical disjoint-set union-find problem [21].
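A minimal disjoint-set structure of the kind referred to above can be sketched as follows; this is the textbook union-find with path compression, shown only to illustrate the operation the renumber step relies on, not the thesis's actual code.

```java
// Minimal union-find sketch for merging live ranges that reach the same
// point (illustrative; the Briggs implementation details differ).
public class LiveRangeUnionFind {
    private final int[] parent;

    public LiveRangeUnionFind(int n) {
        parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i; // each range is its own set
    }

    // Find the representative of x's set, halving paths as we go.
    public int find(int x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }

    // Merge the sets containing a and b.
    public void union(int a, int b) { parent[find(a)] = find(b); }
}
```

After unioning, all live ranges that reach a common use point share one representative, which is what gives the merged range its single unique name.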

3.1.3 Build

The build step constructs the interference graph using two passes over the code. The first pass calculates the degree of every node using the bit matrix. The second pass fills the adjacency vectors. The two-pass approach is used because non-segmented adjacency vectors can be handled more quickly. A node is added to the interference graph for every live range. An edge between two nodes is added if the two live ranges represented by the nodes intersect [6].

3.1.4 Coalesce

The coalescing step reduces the number of live ranges where possible. The algorithm combines two live ranges if the definition of one live range is a copy of the other (i.e. a move instruction) and they do not interfere at any other point. Combining live ranges reduces unnecessary moves: if the same register is assigned to the two live ranges, there is no need to copy the value from one register into another. However, coalescing also leads to more complex graphs with nodes of higher degrees. A graph that was colorable before coalescing may no longer be colorable afterwards [4].

In Figure 3.3 the two live ranges x and y can be coalesced. Figure 3.4 shows how coalescing affects the interference graph. Since one node represents one live range, the nodes of the coalesced live ranges are combined into a new node. The new node interferes with every node that either x or y interfered with. The increase in the number of interferences can make the graph uncolorable.

3.1.5 Spill Costs

This part of the algorithm estimates the run-time cost that would be incurred if the live range were spilled. The costs consist of the number of loads and stores added to the code, each weighted by Equation 3.1, where c denotes the cost of the load or store operation on the target architecture and d denotes the loop nesting depth of the instruction. This algorithm computes the spill costs of every node in advance [4].

c · 10^d    (3.1)
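Equation 3.1 can be applied per spill instruction and summed over a live range, as sketched below; the inputs are illustrative (each entry of `depths` stands for one load or store that spilling would insert).

```java
// Sketch of the spill-cost estimate from Equation 3.1: every load/store
// costs c * 10^d, where d is its loop nesting depth. Inputs illustrative.
public class SpillCost {
    public static double cost(int[] depths, double c) {
        double sum = 0;
        for (int d : depths) {
            sum += c * Math.pow(10, d); // deeply nested accesses dominate
        }
        return sum;
    }
}
```

The factor 10^d makes a single access inside a doubly nested loop a hundred times as expensive as one at the outermost level, so the heuristic strongly avoids spilling values used in inner loops.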

Figure 3.3: Coalescing example

3.1.6 Optimistic coloring

The coloring heuristic is where the Briggs algorithm [4] differs from Chaitin's [6]. The original algorithm used a pessimistic coloring approach, which makes the decision to spill a live range in the simplify phase. Optimistic coloring, as proposed by Briggs, moves this decision into the select step and only marks nodes as spill candidates. This allows the algorithm to reconsider a bad spill decision. Figure 3.5 depicts an example where optimistic coloring finds a coloring but the pessimistic approach does not. In this example every node has a degree equal to k, where k denotes the number of available registers and therefore the number of colors we want to use. A pessimistic approach would choose to spill one node, even though a coloring could be found.

3.1.6.1 Simplify

The simplify step starts by removing nodes with a degree less than k from the interference graph and placing them on a stack. It chooses these nodes in arbitrary order. After that, the interference graph only contains nodes that have a degree ≥ k. The algorithm chooses one node based on the ratio of spill costs divided by the current degree of the node and marks it as a spill candidate. Simplify then removes the spill candidate from the interference graph and pushes it onto the stack. Although the node has a degree greater than or equal to k, it could still receive a color if two of its neighbours are assigned the same color. By removing the node, simplify reduces the degree of the remaining nodes that interfere with the spill candidate.

(a) One node for each live range (b) Combined node for x and y after coalescing

Figure 3.4: Coalescing of two live ranges

Now the algorithm can continue removing nodes with a degree less than k or choose another spill candidate. Simplify continues until the interference graph is empty and all nodes are on the stack [4]. The example in Figure 3.6 and Figure 3.7 shows a situation where we remove nodes under the assumption k = 2. First we remove z, because it is the only node with a degree < k. After that, we choose a spill candidate, y in this case, remove it and push it onto the stack. Then the algorithm continues by removing the remaining nodes, which now have a degree < k.
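The simplify loop can be sketched as follows. The names and the spill-cost array are illustrative; the cost/degree ratio matches the rule described above, but this is a simplified stand-in rather than the Briggs implementation.

```java
import java.util.*;

// Sketch of the simplify phase: repeatedly remove nodes of degree < k;
// when none remain, remove the node with the lowest spillCost/degree
// ratio as a spill candidate (illustrative, not the Briggs code).
public class SimplifySketch {
    public static Deque<Integer> simplify(List<Set<Integer>> adj, double[] spillCost, int k) {
        int n = adj.size();
        List<Set<Integer>> g = new ArrayList<>();       // mutable working copy
        for (Set<Integer> s : adj) g.add(new HashSet<>(s));
        boolean[] removed = new boolean[n];
        Deque<Integer> stack = new ArrayDeque<>();

        for (int done = 0; done < n; done++) {
            int pick = -1;
            double best = Double.MAX_VALUE;
            for (int v = 0; v < n; v++) {
                if (removed[v]) continue;
                if (g.get(v).size() < k) { pick = v; break; } // trivially colorable
                double ratio = spillCost[v] / g.get(v).size();
                if (ratio < best) { best = ratio; pick = v; } // spill candidate
            }
            removed[pick] = true;
            for (int w : g.get(pick)) g.get(w).remove(pick); // lower neighbours' degrees
            stack.push(pick);
        }
        return stack; // top of stack = last node removed
    }
}
```

Removing a node, whether trivially colorable or a spill candidate, lowers its neighbours' degrees, which is what lets the loop make progress until the graph is empty.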

3.1.6.2 Select

The assignment of colors and the decision to spill live ranges take place in the select phase. This step takes a node from the stack, checks if a color is available, and adds the node back to the graph. A color is available if it is not assigned to any neighbour of the node. If the select step finds that no color is left, the node is not added to the graph and remains uncolored. This can only happen to spill candidates. The select step ends when the stack is empty. If any nodes remain uncolored in this step, they have to be spilled. If every node is colored, the allocator is finished at this point [4]. In Figure 3.8 we continue the example from Figure 3.6 and Figure 3.7 and color the interference graph. Although simplify marked y as a spill candidate, we can find a color that is available.
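The select phase can be sketched as the mirror image of simplify: pop nodes in reverse removal order and give each the lowest color its already-colored neighbours do not use. As before, the names are illustrative, not the Briggs implementation.

```java
import java.util.*;

// Sketch of the select phase: pop nodes from the simplify stack and give
// each the lowest color not used by an already-colored neighbour. Nodes
// that find no color stay at -1 and must be spilled (illustrative).
public class SelectSketch {
    public static int[] select(List<Set<Integer>> adj, Deque<Integer> stack, int k) {
        int[] color = new int[adj.size()];
        Arrays.fill(color, -1);
        while (!stack.isEmpty()) {
            int v = stack.pop();
            boolean[] used = new boolean[k];
            for (int w : adj.get(v)) {
                if (color[w] >= 0) used[color[w]] = true;
            }
            for (int c = 0; c < k; c++) {
                if (!used[c]) { color[v] = c; break; }
            }
        }
        return color;
    }
}
```

On the diamond-shaped graph of Figure 3.5 this optimistic pass colors all four nodes with k = 2, because the two neighbours of the "uncolorable" node happen to receive the same color, which is exactly the situation a pessimistic approach gives up on.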

Figure 3.5: Diamond shaped graph, k=2

3.1.7 Spill Code

This step inserts spill code for nodes that could not be colored by the select phase. Spilling a live range means it is kept in memory rather than in a register. A live range is spilled by inserting a store after every definition and a load before every use [4]. This splits the live range into multiple short live ranges. If a live range has been spilled, the algorithm starts again from the beginning with the renumber step. Figure 3.9 shows an example of spill code added for the live range x. The first part shows the code before spilling, the second part shows the code after spilling.

3.2 Graph Coloring Register Allocation For Graal

Our graph coloring register allocator consists of seven main steps. Some of these steps are divided into smaller steps, but we still consider them one step overall. Figure 3.10 shows the steps of our algorithm and their order. In this chapter of the thesis we are going to look at every step in detail.

(a) Interference graph before simplify (b) z is removed and put on the stack

Figure 3.6: Simplify example, k=2

3.2.1 Liveness Analysis

The liveness analysis is reused from the linear scan register allocator that Graal already has. Figure 3.11 shows the parts of the liveness analysis and their order of execution. The first three parts use the almost unaltered code of the linear scan allocator, whereas the fourth step is slightly modified to fit the needs of our own implementation of intervals. The reason for reusing code from linear scan is that it is already in use and proven to do what we need.

3.2.1.1 Number Instructions

This method iterates over the LIR instructions in a single forward pass and assigns an id to each instruction. These ids start at zero and are incremented by two: the first instruction of the LIR code gets the id zero, the id of the second instruction is set to two, and so on. We increment the ids by two because we have two operand modes, called def and use. This step is necessary to determine the start and end points of the live ranges, the use positions, the definition positions, and also the positions in the code for inserting spill instructions.
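The numbering scheme can be sketched as follows; the class is an illustrative stand-in, not Graal's LIR data model.

```java
// Sketch of the instruction numbering scheme: ids start at zero and are
// incremented by two, leaving room between instructions for the two
// operand modes (def and use). Illustrative, not Graal's classes.
public class InstructionNumbering {
    public static int[] assignIds(int instructionCount) {
        int[] ids = new int[instructionCount];
        for (int i = 0; i < instructionCount; i++) {
            ids[i] = 2 * i; // 0, 2, 4, ...
        }
        return ids;
    }

    // Recover an instruction's index within its block from its id; this
    // inverse mapping is what the spill phase later uses (see Listing 3.1).
    public static int indexOf(int id, int firstIdOfBlock) {
        return (id - firstIdOfBlock) / 2;
    }
}
```

Because the ids are evenly spaced, converting between an id and a position in the instruction list is a single subtraction and division, with no search required.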

(a) Empty graph, nodes on stack

Figure 3.7: Simplify example, k=2

3.2.1.2 Build Local Live Sets

This method iterates over the instructions from back to front and creates and fills two bit sets for every basic block of the LIR code. The first set is for the values that are live in that block, the second for values that are terminated in that block. For every use position of a value the algorithm sets a live bit, and for every definition position it sets a kill bit. This works because, looking at the LIR code from back to front, a value is alive in every block in which it has a use position until its definition is reached. In order to set these bits for every block, the algorithm has to run over the LIR code once. Its structure is shown in Figure 3.12.
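For a single block, the two bit sets can be sketched as below. The instruction encoding ({def, use} operand numbers, -1 for none) is an illustrative stand-in for the LIR, not Graal's representation.

```java
import java.util.BitSet;

// Sketch of building the local live sets for one block, iterating the
// instructions back to front. Each instruction is {def, use} with -1
// meaning "no such operand" (illustrative stand-in for the LIR).
public class LocalLiveSets {
    // Returns {liveGen, liveKill} for the block.
    public static BitSet[] build(int[][] instructions, int numValues) {
        BitSet liveGen = new BitSet(numValues);
        BitSet liveKill = new BitSet(numValues);
        for (int i = instructions.length - 1; i >= 0; i--) {
            int def = instructions[i][0], use = instructions[i][1];
            if (use >= 0) liveGen.set(use);   // live backwards from the use...
            if (def >= 0) {                   // ...until its definition is reached
                liveKill.set(def);
                liveGen.clear(def);           // defined here, so not live into the block
            }
        }
        return new BitSet[]{liveGen, liveKill};
    }
}
```

A value that is both defined and used inside the block ends up only in the kill set, so it does not leak into the block's live-in set in the next step.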

3.2.1.3 Build Global Live Sets

The step to build global live sets determines which values are alive at the beginning and at the end of each block [22]. This means it looks at the edges between the blocks. The algorithm creates another two bit sets per block, one for values that are alive after the block is executed (LiveOut) and one for values alive before (LiveIn). This is done by going over the blocks in reverse order. Figure 3.13 shows how the algorithm works. LiveOut is the union of the sets of values that are alive at the beginning of the successor blocks. For computing LiveIn, the algorithm uses LiveGen and LiveKill, the bit sets created in the step before: LiveIn is the union of LiveGen with those values of LiveOut that are not in LiveKill.

(a) Empty graph, nodes on stack (b) Colored graph

Figure 3.8: Select example, k=2

The algorithm has to iterate over the blocks more than once, because a change in the bit sets can influence the sets of a previous block. Thanks to the previous step of computing the local live sets, the algorithm does not have to iterate over the instructions, only over the blocks.
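The dataflow iteration described above can be sketched as a fixpoint loop over BitSets; the method and parameter names are illustrative, not Graal's.

```java
import java.util.BitSet;

// Sketch of the global live-set computation: LiveOut(B) is the union of
// LiveIn over B's successors, and LiveIn(B) = LiveGen(B) ∪ (LiveOut(B)
// minus LiveKill(B)). Blocks are visited in reverse order until nothing
// changes (a fixpoint). Illustrative names, not Graal's implementation.
public class GlobalLiveSets {
    public static void solve(int[][] successors, BitSet[] gen, BitSet[] kill,
                             BitSet[] liveIn, BitSet[] liveOut) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int b = successors.length - 1; b >= 0; b--) {
                BitSet out = new BitSet();
                for (int s : successors[b]) out.or(liveIn[s]);
                BitSet in = (BitSet) out.clone();
                in.andNot(kill[b]); // remove values defined in this block
                in.or(gen[b]);      // add values used before their definition
                if (!in.equals(liveIn[b]) || !out.equals(liveOut[b])) {
                    liveIn[b] = in;
                    liveOut[b] = out;
                    changed = true;
                }
            }
        }
    }
}
```

Visiting the blocks in reverse order means most of the propagation happens in the first pass; further passes are only needed when loops feed liveness information backwards.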

3.2.1.4 Build Intervals

The build intervals step creates an interval for every value and determines its live ranges. Since the LIR code is already in SSA form, a value has one definition point and might have multiple uses, but it can still have more than one live range. An interval is a representation of a single value and holds all the information about the value, such as its operand id, live ranges, and use positions.

The algorithm iterates once over the LIR code from bottom to top. It starts by creating an interval for every value that is alive at the end of the last block and extending their live ranges to start at the beginning of the block [22]. Then the algorithm proceeds to iterate over the instructions. If a use position of a value is found, its live range is extended to the beginning of the block; if the live range already reaches that far, it is merged. If the value is not alive, a new live range is added that ends at this point, with its start point set to the beginning of the block. After that, the use position is also stored in a list in the interval. If a definition of a value is found, its live range is set to start here instead of at the beginning of the block [22].

(a) Code before spilling (b) Code after spilling

Figure 3.9: Spilling example

Figure 3.10: Graph coloring register allocator for Graal

In short, the algorithm tracks the live ranges of a value from the last use position and extends them until the definition is reached. Figure 3.14 illustrates the algorithm. On the first encounter of a value, an interval is created. In addition, the algorithm adds a use position for every caller-saved register if the current instruction destroys them.

3.2.2 Build Graph

This part of the allocator builds the interference graph. For every interval, it creates a node in the graph. Since there is one interval for each value, every node in the interference graph represents one value. After adding the node to the graph, the algorithm checks if the value interferes with another value. Two values interfere if one of their live ranges overlaps with a live range of the other. As an example, consider Figure 3.15: if v1 has a live range that is defined at instruction 2 and used at instruction 4, and v2 has a live range that is defined at instruction 3 and used at instruction 5, they interfere. If two live ranges interfere, an edge between their corresponding nodes is added to the interference graph.

Figure 3.11: Structure of liveness analysis

Figure 3.12: Build local live sets

Figure 3.16 shows the idea of the algorithm. What we have not mentioned so far is that there is more than one interference graph, depending on the number of different register sets, e.g. integer and floating point registers. Values can only interfere with each other if they can occupy the same registers on the target architecture. We create one graph per register category, with no interferences between values that belong to different categories. For example, if an architecture has an integer register set and a floating point register set, the algorithm creates two interference graphs. Graal provides information about the target architecture and the values, so we know which value belongs to which register set. For this reason we store a register category number in the intervals.
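The interference test described above can be sketched as follows. Live ranges are represented here as simple {from, to} pairs of instruction ids; this encoding and the class are illustrative, not the thesis's interval data structure.

```java
// Sketch of the interference test used while building the graph: two
// values interfere if they belong to the same register category and any
// of their live ranges overlap. Ranges are {from, to} pairs of
// instruction ids (illustrative, not the thesis's interval class).
public class InterferenceCheck {
    public static boolean rangesOverlap(int[] a, int[] b) {
        return a[0] < b[1] && b[0] < a[1];
    }

    public static boolean interfere(int[][] rangesA, int[][] rangesB,
                                    int categoryA, int categoryB) {
        if (categoryA != categoryB) return false; // e.g. integer vs. floating point
        for (int[] ra : rangesA)
            for (int[] rb : rangesB)
                if (rangesOverlap(ra, rb)) return true;
        return false;
    }
}
```

With the example from the text, a range [2, 4] and a range [3, 5] overlap and therefore interfere, whereas the same ranges in different register categories never do, which is why the per-category graphs can be built and colored independently.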

Figure 3.13: Build global live sets

3.2.3 Coloring

The coloring part of the algorithm also consists of two phases and closely follows the implementation described in the Briggs paper [4]. However, the differences are in the details. For example, the Briggs paper does not mention fixed registers, but we have to take them into account. We handle fixed registers by leaving them in the graph: we ignore interferences between two fixed registers but take interferences with values into account. We can do this because fixed registers do not need to be assigned to another register, and two different registers are two different colors by definition and can therefore share an edge.

Before coloring the nodes of the interference graph, we need to determine k, the number of allocatable registers for a register category. We get an array of all allocatable registers of the target architecture from Graal. Our algorithm sorts this array into the different register categories. The number of registers in each category is then the k for the corresponding interference graph.

3.2.3.1 Simplify

Figure 3.14: Build intervals

Simplify determines which nodes can be colored trivially and which nodes may need to be spilled. Fixed registers are not removed or marked as spill candidates. When simplify chooses a spill candidate, it uses costs to determine the best candidate. The metric we use for computing the spill costs is the number of use positions divided by the current degree of the node. We choose the node with the lowest cost as spill candidate. Simplify is finished when only nodes of fixed registers remain in the interference graph and no more nodes were removed during the last run. These remaining nodes represent a register and are therefore already colored by definition. Figure 3.17 shows the structure of simplify.

3.2.3.2 Select

This step assigns the actual colors to the nodes and therefore determines which value occupies which register on the target architecture. The select step also decides which values are spilled. The algorithm, as shown in Figure 3.18, starts by popping a node from the stack. In the next step it checks if a color is available by going through the neighbouring nodes to see what colors they have. If a color is not assigned to one of the neighbours, it is assigned to the node and the node is added to the interference graph. If no color is available, the node is not added to the graph but put aside for spilling. This is where select decides to spill a value [4]. The algorithm continues from the beginning until the stack is empty. At this point the algorithm has either found a coloring and can go on to the resolve data flow step, or it did not find a coloring and has to spill the values of all uncolored nodes.

Figure 3.15: Build Graph Example

Figure 3.16: Build graph

3.2.4 Spill Values

The spill step receives a list of nodes that could not be colored by select. The values of these nodes need to be spilled. The interval contains the use positions, the live ranges, and the definition position of its value. Figure 3.19 shows the algorithm in detail. First, the algorithm creates a new stack slot, assigns it to the interval, and deletes all live ranges of the value. Next, the algorithm inserts a move instruction to the stack slot after the value's definition and a move instruction from the stack slot before every use, if the instruction needs a register [23]. For the algorithm to find the right position, it first gets the block of LIR code where the use occurs, then it finds the index at which to insert the move instruction. Listing 3.1 shows how to find the right position to insert a move. In some cases the target architecture can use a stack slot directly and does not need a register; in these cases we don't insert a move instruction. This can only be the case for use positions; the move instruction after the definition is always added. The information about the need for a register is stored with the use position in the interval.

Figure 3.17: Simplify

Consider for example Figure 3.20, with a use position at instruction 334. This instruction is in LIR block B3, which begins with instruction 330 and ends with instruction 340. We find the right block by looking at the first and the last instruction ids of every block until the use position lies between them. We can then compute the index by subtracting the id of the first instruction from the use position and dividing the result by two: (334 − 330) / 2 = 2. The use position in the example therefore occurs at index 2 and we can insert the move afterwards. After inserting a move, the algorithm adds a live range to the value that is only one instruction long. The algorithm then continues with the next use position. When it is finished, it goes on to the next value until it has handled every node that the select step could not color.

3.2.5 After Spill

By inserting instructions into the LIR code, the algorithm breaks the numbering of the instructions, but not their order. If we keep that in mind, we do not have to number them again. Inserted instructions are assigned an id of minus one. If we find an instruction with the id -1, we know that this instruction has been inserted by us, and we just continue to the next instruction until we find the right index. The live ranges of the added instructions are also not a problem, because they are live at the instruction, not before. This means that if we add two move instructions at the same index, they will

Figure 3.18: Select

int findIndex(int pos) {
    Block b = getBlock(pos);
    int index = pos - b.firstInstruction.id;
    index = index / 2;

    while (b.getInstruction(index).id == -1) {
        index++;
    }

    return index;
}
Listing 3.1: Finding spill position

interfere with each other but not with values from the instruction before, because their live ranges are inserted to begin at the use position. Figure 3.21 shows an example with inserted spill moves. Not numbering the instructions again has another big advantage: we do not need to start over with a complete lifetime analysis, provided that the intervals are also kept. The life ranges of values that are not spilled do not change, and neither does any use position. After the spill code is inserted we simply throw away the graph and rebuild it from the intervals by going back to the build graph step.

3.2.6 Assign Location

The assign location step traverses the LIR code once and replaces every value with its assigned register. The algorithm, as shown in Figure 3.22, does this by looking at every LIR instruction. If the instruction contains values, the algorithm gets the corresponding interval. If the value is not spilled, it is replaced by the assigned register. If the value has been spilled, the assign location step looks for the use position

Figure 3.19: Spill values

that matches the current instruction. If this use position requires a register, the value is replaced by the assigned register. If the use position only has should-have-register priority, the value is replaced by its stack slot.
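The decision made per instruction can be sketched as follows. Again this is a simplified model with our own names, not the actual Graal code.

```java
import java.util.*;

// Sketch of the assign-location decision for a single value (names are ours).
class AssignSketch {
    record Use(int pos, boolean needsRegister) {}

    // What a value is replaced with at a given instruction: unspilled values
    // always get their register; spilled values get the register only where
    // the use requires one, and the stack slot otherwise.
    static String locationAt(boolean spilled, String register, String stackSlot,
                             List<Use> uses, int pos) {
        if (!spilled) return register;
        for (Use use : uses) {
            if (use.pos() == pos && use.needsRegister()) return register;
        }
        return stackSlot; // should-have-register priority falls back to the slot
    }
}
```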

3.2.7 Phi-Resolution

Phi resolution is the purpose of the resolve data flow step. This step of the algorithm is reused from the existing linear scan register allocator and adapted to our needs. It is necessary to resolve the phi instructions. At this point we know which value is in which register at the point of a phi instruction. Phi instructions occur at the edges between basic blocks. They are resolved by adding moves at the end of a block. These moves are necessary to match the contents of registers to

(a) Index before spilling (b) Index after spilling

Figure 3.20: Spilling example

what the successor blocks are expecting to find. We use a move resolver to add these moves, to make sure we are not overwriting a register before its content is moved to where it belongs. Another reason for using the move resolver is to break up cyclic dependencies.
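A move resolver of this kind can be sketched as a parallel-move sequentializer. The sketch below is ours, not the Graal move resolver: it orders the moves so that no source is overwritten before it is read, and breaks cycles by saving one source to a scratch location.

```java
import java.util.*;

// Sketch of a move resolver for the moves inserted at block edges
// (a simplified parallel-move sequentializer; names are ours, not Graal's).
class MoveResolverSketch {
    // moves: destination -> source. Emits ordered "dst <- src" strings.
    static List<String> resolve(Map<String, String> moves) {
        Map<String, String> pending = new LinkedHashMap<>(moves);
        List<String> out = new ArrayList<>();
        while (!pending.isEmpty()) {
            boolean progress = false;
            for (Iterator<Map.Entry<String, String>> it = pending.entrySet().iterator(); it.hasNext();) {
                Map.Entry<String, String> m = it.next();
                // Safe to emit if no remaining move still reads the destination.
                if (!pending.containsValue(m.getKey())) {
                    out.add(m.getKey() + " <- " + m.getValue());
                    it.remove();
                    progress = true;
                }
            }
            if (!progress) {
                // Every destination is still needed as a source: a cycle.
                // Save one source to a scratch location to break it.
                String src = pending.entrySet().iterator().next().getValue();
                out.add("scratch <- " + src);
                for (Map.Entry<String, String> e : pending.entrySet()) {
                    if (e.getValue().equals(src)) e.setValue("scratch");
                }
            }
        }
        return out;
    }
}
```

For a swap of r1 and r2 this emits a save to scratch, the surviving move, and a restore from scratch, which is exactly the cycle-breaking behaviour described above.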

3.2.8 Data-Structure

Our algorithm is driven by two data structures, the interference graph and the intervals. The interference graph uses two representations, and the intervals are used to store additional information on the values. The two representations of the graph are similar to those in the Chaitin paper [6]. The first representation is a bit matrix, which is realized by an array of bit sets with the id of the node as index. As shown in the first graph of Figure 3.23, an interference between two nodes is stored in the node with the higher id. For example, in the provided graph there is an interference between v0 and v1 and an interference between v1 and v2. The second representation of the interference graph is an array of adjacency vectors, as shown in the second graph of Figure 3.23. Every vector contains all the neighbours of the corresponding node. The third data structure is an array that contains the intervals.
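The two graph representations can be sketched as follows; the class shape is ours and only illustrates the scheme described above.

```java
import java.util.*;

// Sketch of the two interference graph representations (names are ours).
class InterferenceGraphSketch {
    final BitSet[] matrix;               // bit matrix: row = node with higher id
    final List<List<Integer>> adjacency; // adjacency vectors, one per node

    InterferenceGraphSketch(int nodes) {
        matrix = new BitSet[nodes];
        adjacency = new ArrayList<>();
        for (int i = 0; i < nodes; i++) {
            matrix[i] = new BitSet(nodes);
            adjacency.add(new ArrayList<>());
        }
    }

    // An interference is stored in the bit set of the node with the higher id;
    // the adjacency vectors store it symmetrically on both nodes.
    void addInterference(int a, int b) {
        matrix[Math.max(a, b)].set(Math.min(a, b));
        adjacency.get(a).add(b);
        adjacency.get(b).add(a);
    }

    boolean interferes(int a, int b) {
        return matrix[Math.max(a, b)].get(Math.min(a, b));
    }
}
```

The bit matrix gives a constant-time interference test, while the adjacency vectors let simplify and select walk a node's neighbours without scanning a whole matrix row.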

(a) Code before spilling (b) Code after spilling

Figure 3.21: Spilling example

Figure 3.22: Assign Locations

(a) Graph represented as bit matrix (b) Graph represented as array of adjacency vectors

(c) Array of intervals

Figure 3.23: Data structure

Chapter 4

Case Study

In this chapter we present how our algorithm works on a non-trivial example, from the starting point of our algorithm until the end result. The test case is executed on an AMD64 architecture with two register categories.

4.1 Test Case

The example we chose to demonstrate our algorithm is the SpillLoopPhiVariableAtDefinition test case, which is part of the JUnit tests that come with the Graal source code. The SpillLoopPhiVariableAtDefinition class can be found in the com.oracle.graal.jtt.loop package. Listing 4.1 shows the Java code of the test case. We chose this example because it is not trivial, so our algorithm has to spill values, and at the same time it produces code and graphs of a size that is reasonable to show in this thesis.

We see in the Java code of the test case that it calls methods from GraalDirectives. Especially interesting for us is the spillRegisters() call, which forces the allocator to spill values at the point of the call. The bindToRegister() call just creates a use position that requires a register for the value.

4.2 LIR Code of the Test Case

Our algorithm starts with the LIR code and the control flow graph we get from the front end. Figure 4.1 shows the control flow graph for the SpillLoopPhiVariableAtDefinition test case. Listing 4.2 shows the LIR code. The code we show here is quite long, so we are only going to show the parts that change between the steps of the algorithm. At the end we look at the whole code again,

public class SpillLoopPhiVariableAtDefinition extends JTTTest {

    public static int test(int arg) {
        int count = arg;
        for (int i = 0; i < arg; i++) {
            GraalDirectives.bindToRegister(count);
            GraalDirectives.spillRegisters();
            GraalDirectives.bindToRegister(count);
            if (i == 0) {
                GraalDirectives.spillRegisters();
                continue;
            }
            GraalDirectives.spillRegisters();
            count++;
        }
        return count;
    }
}
Listing 4.1: Test case Java code

in order to see the result. The code shown in Listing 4.2 is from after the numbering step; as we can see, all instructions are already numbered. Our algorithm only changes the numbering in the LIR code during the lifetime analysis.

Figure 4.1: Control flow graph of the test case.

4.3 After lifetime analysis

This step of our algorithm calculates a set of live ranges for every value, stored in an array of intervals. The live ranges of the values from our test case are shown in Figure 4.2. Live ranges of fixed registers are shown in gray and those of values in blue. From these live ranges we can also see how the spillRegisters() call forces our allocator to spill values in blocks B2, B3 and B4. At every point in the LIR code where this call occurs, it sets a temporary live range for every fixed register.

B0 -> B1 [-1, -1]
_nr__instruction______(LIR)
 0 [rsi|d, rbp|q] = LABEL size: 2 align: false label: ?
 2 v5|q = MOVE rbp|q moveKind: QWORD
 4 [] = HOTSPOTLOCKSTACK frameMapBuilder: com.oracle.graal.lir.amd64.AMD64FrameMapBuilder@4667ae56 slotKind: QWORD
 6 v0|d = MOVE rsi|d moveKind: DWORD
 8 JUMP ~[v0|d, int[0|0x0]] size: 2 destination: B0 -> B1

B1 <- B0, B3, B4 -> B2, B5 [-1, -1] llh (loop 0 depth 1)
_nr__instruction______(LIR)
10 [v1|d, v2|d] = LABEL size: 2 align: true label: ?
12 CMP (x: v0|d, y: v2|d) size: DWORD
14 BRANCH ~[] size: 0 condition: > trueDestinationProbability: 0.5 trueDestination: B1 -> B2 falseDestination: B1 -> B5

B2 <- B1 -> B3, B4 [-1, -1] (loop 0 depth 1)
_nr__instruction______(LIR)
16 [] = LABEL size: 0 align: false label: ?
18 BINDTOREGISTER v1|d
20 SPILLREGISTERS
22 BINDTOREGISTER v1|d
24 v3|d = INC v2|d size: DWORD
26 TEST (x: v2|d, y: v2|d) size: DWORD
28 BRANCH ~[] size: 0 condition: = trueDestinationProbability: 0.5 trueDestination: B2 -> B3 falseDestination: B2 -> B4

B4 <- B2 -> B1 [-1, -1] lle (loop 0 depth 1)
_nr__instruction______(LIR)
36 [] = LABEL size: 0 align: false label: ?
38 SPILLREGISTERS
40 v4|d = INC v1|d size: DWORD
42 JUMP ~[v4|d, v3|d] size: 2 destination: B4 -> B1

B3 <- B2 -> B1 [-1, -1] lle (loop 0 depth 1)
_nr__instruction______(LIR)
30 [] = LABEL size: 0 align: false label: ?
32 SPILLREGISTERS
34 JUMP ~[v1|d, v3|d] size: 2 destination: B3 -> B1

B5 <- B1 [-1, -1]
_nr__instruction______(LIR)
44 [] = LABEL size: 0 align: false label: ?
46 rax|d = MOVE v1|d moveKind: DWORD
48 RETURN (savedRbp: v5|q, value: rax|d, ~outgoingValues: []) size: 0 isStub: false scratchForSafepointOnReturn: rcx config: HotSpotVMConfig
Listing 4.2: Test case LIR code before allocation

Figure 4.2: Live ranges of the test case after lifetime analysis

4.4 After build graph

In the build graph step, we take the intervals and their contained life ranges to produce our interference graphs. Since the machine we are running this test case on provides two register types, we get two interference graphs: one for integer registers and one for floating point registers. As we can see in Figure 4.3, these graphs tend to get rather large, even for small examples like our test case. We do not look at the floating point graph because it contains only fixed registers and there is no need to color them. We also do not insert edges between nodes of fixed registers, because they are registers by definition and each already represents a different color.

4.5 After simplify

Figure 4.4 shows the reduced graph after the simplify step, containing only fixed registers, and the stack containing the nodes of the values that need to be colored. We can see that v0, v1, v2, v3 and v5 are marked as spill candidates.

Figure 4.3: Interference graph for integer registers of test case

(a) Interference graph for the test case after simplify (b) Stack for the test case after simplify

Figure 4.4: Interference graph and stack for the test case after simplify

4.6 After spill

In the select step the algorithm realizes that there is no color available for v0, v1, v2, v3 and v5, so they need to be spilled. Listing 4.3 shows the LIR code after the spill step. The spill moves to the stack were inserted after the definitions of v5, v0 and v3 at instruction ids 2, 6 and 24. v1 and v2 are defined at instruction number 10, which is a phi instruction. Phi instructions can use stack slots directly, so we do not need to move a register to the stack. In block B1, v0 is restored before its usage at instruction number 12. In block B2, v1 is restored before its usages at instruction numbers 18 and 22, and v2 is restored before its usage at instruction number 26. Values only need to be restored before usages at instructions that require registers; they do not need to be restored before instructions that can handle stack slots. Figure 4.5 depicts the live ranges after spilling.

B0 -> B1 [-1, -1]
_nr__instruction______(LIR)
 0 [rsi|d, rbp|q] = LABEL size: 2 align: false label: ?
 2 v5|q = MOVE rbp|q moveKind: QWORD
-1 vstack:4|q = MOVE v5|q moveKind: QWORD
 4 [] = HOTSPOTLOCKSTACK frameMapBuilder: com.oracle.graal.lir.amd64.AMD64FrameMapBuilder@4667ae56 slotKind: QWORD
 6 v0|d = MOVE rsi|d moveKind: DWORD
-1 vstack:0|d = MOVE v0|d moveKind: DWORD
 8 JUMP ~[v0|d, int[0|0x0]] size: 2 destination: B0 -> B1

B1 <- B0, B3, B4 -> B2, B5 [-1, -1] llh (loop 0 depth 1)
_nr__instruction______(LIR)
10 [v1|d, v2|d] = LABEL size: 2 align: true label: ?
-1 v0|d = MOVE vstack:0|d moveKind: DWORD
12 CMP (x: v0|d, y: v2|d) size: DWORD
14 BRANCH ~[] size: 0 condition: > trueDestinationProbability: 0.5 trueDestination: B1 -> B2 falseDestination: B1 -> B5

B2 <- B1 -> B3, B4 [-1, -1] (loop 0 depth 1)
_nr__instruction______(LIR)
16 [] = LABEL size: 0 align: false label: ?
-1 v1|d = MOVE vstack:1|d moveKind: DWORD
18 BINDTOREGISTER v1|d
20 SPILLREGISTERS
-1 v1|d = MOVE vstack:1|d moveKind: DWORD
22 BINDTOREGISTER v1|d
24 v3|d = INC v2|d size: DWORD
-1 vstack:3|d = MOVE v3|d moveKind: DWORD
-1 v2|d = MOVE vstack:2|d moveKind: DWORD
26 TEST (x: v2|d, y: v2|d) size: DWORD
28 BRANCH ~[] size: 0 condition: = trueDestinationProbability: 0.5 trueDestination: B2 -> B3 falseDestination: B2 -> B4

B4 <- B2 -> B1 [-1, -1] lle (loop 0 depth 1) ...
B3 <- B2 -> B1 [-1, -1] lle (loop 0 depth 1) ...
B5 <- B1 [-1, -1] ...
Listing 4.3: Test case LIR code after spilling

4.7 After select

After spilling we rebuild the graph and repeat the coloring steps. This time we find a coloring; Figure 4.6 shows the colored graph. We can see in the graph that two nodes that share an edge never have the same color. Every color represents a single hardware register. What we also see is that the new interference graph has fewer edges than before: the values v1, v2, v3 and v4 do not have any interference in the colored graph.

4.8 After assign locations

After select has colored the graph, the assign locations step replaces every value by its register. This is shown in Listing 4.4. In cases where a spilled value does not need a register, it is replaced by its assigned stack slot. For example, v1 and v2 in block B1 at instruction number 10 are replaced by their stack slots.

B0 -> B1 [-1, -1]
_nr__instruction______(LIR)
 0 [rsi|d, rbp|q] = LABEL size: 2 align: false label: ?
 2 r10|q = MOVE rbp|q moveKind: QWORD
-1 vstack:4|q = MOVE r10|q moveKind: QWORD
 4 [] = HOTSPOTLOCKSTACK frameMapBuilder: com.oracle.graal.lir.amd64.AMD64FrameMapBuilder@4667ae56 slotKind: QWORD
 6 r10|d = MOVE rsi|d moveKind: DWORD
-1 vstack:0|d = MOVE r10|d moveKind: DWORD
 8 JUMP ~[vstack:0|d, int[0|0x0]] size: 2 destination: B0 -> B1

B1 <- B0, B3, B4 -> B2, B5 [-1, -1] llh (loop 0 depth 1)
_nr__instruction______(LIR)
10 [vstack:1|d, vstack:2|d] = LABEL size: 2 align: true label: ?
-1 r10|d = MOVE vstack:0|d moveKind: DWORD
12 CMP (x: r10|d, y: vstack:2|d) size: DWORD
14 BRANCH ~[] size: 0 condition: > trueDestinationProbability: 0.5 trueDestination: B1 -> B2 falseDestination: B1 -> B5

B2 <- B1 -> B3, B4 [-1, -1] (loop 0 depth 1)
_nr__instruction______(LIR)
16 [] = LABEL size: 0 align: false label: ?
-1 r10|d = MOVE vstack:1|d moveKind: DWORD
18 BINDTOREGISTER r10|d
20 SPILLREGISTERS
-1 r10|d = MOVE vstack:1|d moveKind: DWORD
22 BINDTOREGISTER r10|d
24 r10|d = INC vstack:2|d size: DWORD
-1 vstack:3|d = MOVE r10|d moveKind: DWORD
-1 r10|d = MOVE vstack:2|d moveKind: DWORD
26 TEST (x: r10|d, y: r10|d) size: DWORD
28 BRANCH ~[] size: 0 condition: = trueDestinationProbability: 0.5 trueDestination: B2 -> B3 falseDestination: B2 -> B4

B4 <- B2 -> B1 [-1, -1] lle (loop 0 depth 1)
_nr__instruction______(LIR)
36 [] = LABEL size: 0 align: false label: ?
38 SPILLREGISTERS
40 r10|d = INC vstack:1|d size: DWORD
42 JUMP ~[r10|d, vstack:3|d] size: 2 destination: B4 -> B1

B3 <- B2 -> B1 [-1, -1] lle (loop 0 depth 1)
_nr__instruction______(LIR)
30 [] = LABEL size: 0 align: false label: ?
32 SPILLREGISTERS
34 JUMP ~[vstack:1|d, vstack:3|d] size: 2 destination: B3 -> B1

B5 <- B1 [-1, -1]
_nr__instruction______(LIR)
44 [] = LABEL size: 0 align: false label: ?
46 rax|d = MOVE vstack:1|d moveKind: DWORD
48 RETURN (savedRbp: vstack:4|q, value: rax|d, ~outgoingValues: []) size: 0 isStub: false scratchForSafepointOnReturn: rcx config: HotSpotVMConfig
Listing 4.4: Test case LIR code after assigning of registers

Figure 4.5: Live ranges of the test case after spilling

Figure 4.6: Colored interference graph for our test case

4.9 After resolve data flow

The last part of our algorithm resolves the phi instructions that are still within the LIR code. In Listing 4.6 we can see that the algorithm inserts moves at the end of blocks B0, B4 and B3. Phi instructions always occur in pairs at the edges of basic blocks: a phi instruction at the end of a block corresponds to a phi instruction at the beginning of its successor block.

Listing 4.5 shows an example from our test case. Instruction number 8 in block B0 holds the results of the block. Instruction number 10 in block B1 holds the place where block B1 expects the results of its predecessors. In order to resolve the phi instructions, the algorithm moves the results of block B0 to where B1 expects them, as shown in Listing 4.6.

B0 -> B1 [-1, -1]
_nr__instruction______(LIR)
...
 8 JUMP ~[vstack:0|d, int[0|0x0]] size: 2 destination: B0 -> B1

B1 <- B0, B3, B4 -> B2, B5 [-1, -1] llh (loop 0 depth 1)
_nr__instruction______(LIR)
10 [vstack:1|d, vstack:2|d] = LABEL size: 2 align: true label: ?
...
Listing 4.5: Phi instruction

B0 -> B1 [-1, -1]
_nr__instruction______(LIR)
 0 [rsi|d, rbp|q] = LABEL size: 2 align: false label: ?
 2 r10|q = MOVE rbp|q moveKind: QWORD
-1 vstack:4|q = MOVE r10|q moveKind: QWORD
 4 [] = HOTSPOTLOCKSTACK frameMapBuilder: com.oracle.graal.lir.amd64.AMD64FrameMapBuilder@4667ae56 slotKind: QWORD
 6 r10|d = MOVE rsi|d moveKind: DWORD
-1 vstack:0|d = MOVE r10|d moveKind: DWORD
-1 vstack:2|d = MOVE input: int[0|0x0]
-1 vstack:1|d = STACKMOVE (input: vstack:0|d, ~backupSlot: vstack:5|q) scratch: rax
 8 JUMP ~[] size: 0 destination: B0 -> B1

B1 <- B0, B3, B4 -> B2, B5 [-1, -1] llh (loop 0 depth 1) ...
B2 <- B1 -> B3, B4 [-1, -1] (loop 0 depth 1) ...
B4 <- B2 -> B1 [-1, -1] lle (loop 0 depth 1)
_nr__instruction______(LIR)
36 [] = LABEL size: 0 align: false label: ?
38 SPILLREGISTERS
40 r10|d = INC vstack:1|d size: DWORD
-1 vstack:2|d = STACKMOVE (input: vstack:3|d, ~backupSlot: vstack:5|q) scratch: rax
-1 vstack:1|d = MOVE r10|d moveKind: DWORD
42 JUMP ~[] size: 0 destination: B4 -> B1

B3 <- B2 -> B1 [-1, -1] lle (loop 0 depth 1)
_nr__instruction______(LIR)
30 [] = LABEL size: 0 align: false label: ?
32 SPILLREGISTERS
-1 vstack:2|d = STACKMOVE (input: vstack:3|d, ~backupSlot: vstack:5|q) scratch: rax
34 JUMP ~[] size: 0 destination: B3 -> B1

B5 <- B1 [-1, -1] ...
Listing 4.6: Test case LIR after resolving Phi instructions

Chapter 5

Evaluation

In this chapter we evaluate how well our allocator works by comparing it to the existing linear scan register allocator. Our allocator provides a basic graph coloring approach to register allocation in Graal; the linear scan allocator, on the other hand, is highly optimized and thoroughly tested. What we expect to find is that we are not too far behind in terms of code performance, but significantly slower in terms of compile time.

5.1 Evaluation environment and benchmark

We run this evaluation on an Intel Core 2 Quad CPU Q6600 with 4 GB of RAM, running Ubuntu 14.04 as operating system. For measuring the performance we use the DaCapo [2] benchmarks and the Scala DaCapo [19] benchmarks. Since the benchmarks must be executed for some time until all methods have been compiled by Graal, we run a number of warmup iterations before collecting the result; the time of the last iteration is our result. We repeat this ten times and obtain ten results, which we then compare with the results of the same number of runs with the linear scan allocator. Table 5.1 shows the number of warmup iterations and the results of the peak performance measurements, and Table 5.2 shows the results of the compile time measurements. We run these tests with the code from commit cf439d2 from https://bitbucket.org/f_schroecki/graphcoloring/. For stability we also use the option -XX:JVMCIThreads=1 for all benchmarks.

We then normalize our results by dividing each number by the linear scan average, in order to display them in the following graphs. We also calculate a composite mean by computing the geometric mean of all normalized graph coloring results. This allows us to see the average difference between the two allocators.
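The composite mean computation can be sketched as follows; the class and method names are ours and only illustrate the normalization described above.

```java
// Sketch of the normalization and composite mean used in the evaluation
// (names are ours).
class CompositeMeanSketch {
    // Each graph coloring mean is divided by the corresponding linear scan
    // mean; the composite mean is the geometric mean of these ratios,
    // computed via logarithms to avoid overflow on long products.
    static double compositeMean(double[] gcMeans, double[] lscanMeans) {
        double logSum = 0.0;
        for (int i = 0; i < gcMeans.length; i++) {
            logSum += Math.log(gcMeans[i] / lscanMeans[i]);
        }
        return Math.exp(logSum / gcMeans.length);
    }
}
```

A composite mean of 1.0 would mean both allocators perform equally on average; values above 1.0 favor linear scan.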

Table 5.1: Peak performance measurements DaCapo and Scala DaCapo (ms)

benchmark | warm-ups | mean GC | mean Lscan | std deviation GC | std deviation Lscan
avrora | 20 | 33826.506 | 33456.454 | 170.165 | 86.525
batik | 30 | 1910.270 | 1790.052 | 215.143 | 224.091
fop | 50 | 447.644 | 382.603 | 32.606 | 40.999
jython | 30 | 5761.770 | 4807.824 | 545.350 | 446.505
luindex | 25 | 3226.005 | 2948.347 | 527.602 | 444.418
lusearch | 15 | 3174.800 | 3093.319 | 38.798 | 45.102
pmd | 15 | 5453.429 | 5464.347 | 314.620 | 333.652
sunflow | 20 | 4238.649 | 3364.383 | 166.782 | 143.113
tomcat | 20 | 3016.944 | 2625.430 | 56.777 | 62.750
xalan | 25 | 4944.346 | 4527.838 | 119.766 | 107.226
actors | 20 | 9090.634 | 8857.324 | 129.812 | 98.000
apparat | 20 | 26529.165 | 24809.948 | 1783.263 | 1439.273
factorie | 10 | 37384.071 | 33523.434 | 5351.856 | 2139.445
kiama | 25 | 2032.455 | 2020.297 | 319.354 | 222.746
scalac | 30 | 2966.090 | 2421.155 | 355.317 | 216.280
scaladoc | 25 | 4647.106 | 3684.904 | 423.604 | 488.410
scalap | 30 | 2607.116 | 2596.708 | 398.346 | 374.122
scalariform | 30 | 8111.191 | 8937.400 | 1274.634 | 927.892
scalaxb | 40 | 831.307 | 731.428 | 145.881 | 91.651
tmt | 30 | 9509.057 | 8416.926 | 416.422 | 281.894

Table 5.2: Compile time measurements DaCapo and Scala DaCapo (ms)

benchmark | mean GC | mean Lscan | std deviation GC | std deviation Lscan
avrora | 1579.715 | 563.969 | 39.111 | 6.215
batik | 28254.273 | 1974.360 | 1040.884 | 124.476
fop | 11944.187 | 2296.175 | 517.883 | 114.562
jython | 80418.537 | 14794.340 | 6303.278 | 2687.232
luindex | 8086.893 | 1165.610 | 1239.754 | 87.355
lusearch | 3449.553 | 1048.378 | 135.663 | 141.383
pmd | 26529.165 | 3508.696 | 2658.580 | 436.031
sunflow | 3733.121 | 913.241 | 513.616 | 70.485
tomcat | 23860.985 | 4113.378 | 2268.288 | 81.289
xalan | 13877.165 | 1571.836 | 523.774 | 140.700
actors | 3494.690 | 1055.742 | 204.790 | 38.559
apparat | 9966.652 | 1817.105 | 2579.016 | 86.058
factorie | 3105.717 | 775.881 | 807.667 | 38.625
kiama | 8751.671 | 1374.712 | 470.356 | 76.440
scalac | 37646.678 | 6342.320 | 2676.057 | 173.519
scaladoc | 24245.831 | 4007.809 | 2798.199 | 163.480
scalap | 2793.358 | 724.151 | 192.629 | 26.670
scalariform | 11956.137 | 1861.243 | 1588.473 | 231.026
scalaxb | 6310.688 | 1320.809 | 544.879 | 65.349
tmt | 2930.710 | 1033.803 | 223.776 | 53.502

5.2 DaCapo

5.3 Peak Performance

Here we show the peak performance results of the DaCapo benchmarks. As expected, our performance is not too far off, but linear scan shows overall better performance. Computing our composite mean, we see that the linear scan allocator performs on average 1.100 times better. Figure 5.1 shows the comparison of linear scan and graph coloring for the individual benchmarks.

Figure 5.1: Dacapo performance comparison

5.4 Compile Time

As expected, linear scan clearly outperforms our graph coloring allocator with regard to compile time, as we can see in Figure 5.2. Our calculated composite mean here is 5.745, which shows that the linear scan allocator is significantly faster.

Figure 5.2: Dacapo compile time comparison

5.5 ScalaDacapo

5.6 Peak Performance

The ScalaDacapo benchmarks show quite similar results. We get close to the performance of linear scan, but it is still a bit better, as we can see in Figure 5.3. Our composite mean for these benchmarks is 1.078.

5.7 Compile Time

The results of the compile time comparison also hold no surprises. As expected and shown in Figure 5.4, linear scan is clearly faster than graph coloring. Our composite mean in this case is 4.719.

Figure 5.3: ScalaDacapo performance comparison

Figure 5.4: ScalaDacapo compile time comparison

Chapter 6

Related Work

This chapter reviews the papers that are most relevant for our work. The implementation of a graph coloring register allocator was first described in the papers by Chaitin et al. [6,7]. The underlying principles described in these papers are still the basis of graph coloring in register allocation today. They describe the two data structures for the interference graph and how to build it. The concept of interference has not changed since the papers were released. The papers also deal with situations where a coloring was not found and spilling is needed. Chaitin showed that finding a coloring for a graph is NP-complete in general and uses heuristics to solve this issue.

We based our work mostly on the papers by Briggs et al. [3,4]. These papers build on the Chaitin papers but improve some aspects of them. They introduced the concept of optimistic coloring to register allocation, which prevents unnecessary spilling and delays the spilling decision to a later stage of the algorithm. The Briggs papers also describe the use of the SSA form to discover values with multiple live ranges.

A paper by Cooper et al. [9] describes the construction of an interference graph. This paper suggests multiple smaller graphs instead of a single graph. Since modern architectures mostly use different kinds of registers, values can only exist in one type of register and therefore cannot interfere with values of other types. This concept of multiple graphs was the obvious choice for our implementation.

Another paper, by Mössenböck [15], deals with the SSA form in the Java HotSpot Client Compiler. It describes the use of the SSA form and also the construction of an interference graph in modern compilers. Other papers by Buchwald et al. [5] and Cytron et al. [11] also gave us a better understanding of the SSA form.

Papers that we also consider important for this work deal with optimizations of graph coloring register allocators. A paper by Cooper and Simpson [10] describes an approach to splitting live ranges in graph coloring register allocators. The algorithm introduced in the paper uses an additional data structure called a containment graph.

Another paper by Bergner et al. [1] that deals with optimization is about interference region spilling. This optimization is interesting because it leads to better spill code. Spilling is limited to the actual points where the value interferes: a value is spilled at the point in the code where it has an interference and is restored after the interference ends. The concepts of interference region spilling and splitting can be combined.

Also important for our thesis is the paper by Poletto and Sarkar [18], which describes register allocation with linear scan. Another paper by Wimmer and Mössenböck [23] extends the linear scan algorithm with a method for splitting intervals. A further paper by Wimmer and Franz [22] describes the use of the SSA form with a linear scan register allocator. This paper is especially relevant since Graal uses an implementation of this approach.

Chapter 7

Future Work

In this chapter we describe how our work can be further improved in the future in order to deliver better performance. We have shown that our implementation is functional but lags behind the linear scan allocator in terms of performance.

7.1 Spilling Strategy

The main disadvantage of our allocator is the spilling strategy. As explained in previous chapters, we spill whole values. To improve the spilling strategy we suggest implementing a splitting approach. Our current data structure is suited for this improvement. What we need to do in order to implement splitting is to extend the Interval class: if the allocator decides to split an interval, we need to store that information in the interval. We suggest doing that by storing an array of the new intervals inside the original one.

Every part of the split interval needs a unique id. Our solution for that would be to take the highest value id we have and add one. We also need a mapping to identify which split interval belongs to which original interval. We then need to take the node of the original interval out of the graph and add the split intervals to it. What is then left to do is to adapt the handling of the intervals at every point in the code where an interval is accessed.
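The proposed extension could be sketched as follows. This is purely speculative future work: the class shape, field names and id scheme below are ours and do not exist in the current implementation.

```java
import java.util.*;

// Hedged sketch of the proposed Interval extension for splitting
// (class shape and names are ours, not existing Graal code).
class IntervalSplitSketch {
    static class Interval {
        final int id;
        final List<Interval> children = new ArrayList<>(); // split parts
        Interval parent;                                    // back-mapping

        Interval(int id) { this.id = id; }
    }

    // Assumed to start above the highest existing value id.
    static int nextId = 100;

    // Splitting stores the new interval inside the original one, gives it
    // a fresh unique id and records the mapping back to the original.
    static Interval split(Interval original) {
        Interval part = new Interval(nextId++);
        part.parent = original;
        original.children.add(part);
        return part;
    }
}
```

With this mapping in place, the original interval's node could be removed from the graph and replaced by one node per split part, as described above.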

Another improvement to our spilling strategy would be interference region spilling. For this optimization we need a method to identify the regions where an interval overlaps with its neighbors. This can be done by comparing the live ranges of the intervals. Then we would have to add the identified region to the interval and only spill it there in the spill method.

Both splitting and interference region spilling can be used together, but we would need a method to decide which one to use.

7.2 Heuristic for Choosing a Spill Candidate

We choose a spill candidate based on the number of use positions of the value and the degree of the node. We incorporate the degree because we want to avoid spilling insignificant nodes that are only live in a short range and do not have a high impact when spilled. We chose this method because it is easy to compute, does not need a lot of resources and gives us some basis for making beneficial spill decisions. However, it is possible that there is a stronger heuristic for choosing an optimal spill candidate. In order to improve the heuristic, further experimentation and comparison of benchmarks is needed.
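A heuristic of this shape can be sketched as follows. The exact weighting is not spelled out above, so the cost function below (use positions divided by degree, in the spirit of Chaitin's cost/degree heuristic) is an assumption of ours, not necessarily what the implementation computes.

```java
// Sketch of a spill-candidate heuristic combining use count and degree
// (the cost function is assumed; names are ours).
class SpillHeuristicSketch {
    // Lower cost = better spill candidate: few use positions make the value
    // cheap to spill, and a high degree means spilling it frees many
    // neighbours in the interference graph.
    static int chooseSpillCandidate(int[] useCount, int[] degree) {
        int best = 0;
        double bestCost = Double.MAX_VALUE;
        for (int node = 0; node < useCount.length; node++) {
            double cost = (double) useCount[node] / Math.max(1, degree[node]);
            if (cost < bestCost) { bestCost = cost; best = node; }
        }
        return best;
    }
}
```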

7.3 Identify Performance Issues

After optimizing the allocator we need to compare it again to the linear scan allocator; it should then achieve at least the same performance. If the linear scan allocator still achieves better performance than the graph coloring allocator in some cases, we need to identify these cases and analyze the code that is produced.

Chapter 8

Conclusion

In this thesis we gave an overview of register allocation and an introduction to Graal. We extended the Graal compiler with a graph coloring register allocator.

The thesis presents a functional prototype. We compared our graph coloring allocator to the linear scan allocator in the evaluation chapter, using the DaCapo and the Scala DaCapo benchmark suites.

The evaluation showed that the graph coloring allocator we implemented in this thesis is not that far behind the linear scan allocator in terms of performance. The remaining gap is due to the high optimization effort put into the linear scan allocator and to our simple spilling strategy.

By further optimizing the spilling strategy in the future, we are confident that the performance of our allocator will be able to match or even surpass that of the linear scan allocator. Our implementation is a rudimentary form of a graph coloring allocator; in order to perform sufficiently in practice, more optimization effort is needed.

If these optimizations are implemented in the future, Graal could benefit from being extended with this type of register allocation.

Chapter 9

Acknowledgement

First I would like to thank my advisor DI Josef Eisl for his endless patience and repeated explanations, which allowed me to complete this thesis.

I would also like to thank my family and friends for their moral support and understanding. When I worked on weekends or at night I could always count on words of encouragement.

List of Figures

1.1 Interference Example
1.2 Interference Graph with five values and two registers
1.3 Linear scan example

2.1 Graal and HotSpot [20]
2.2 Transformation to SSA form

3.1 Chaitin's Algorithm
3.2 Chaitin-Briggs Algorithm
3.3 Coalescing example
3.4 Coalescing of two live ranges
3.5 Diamond shaped graph, k=2
3.6 Simplify example, k=2
3.7 Simplify example, k=2
3.8 Select example, k=2
3.9 Spilling example
3.10 Graph coloring register allocator for Graal
3.11 Structure of liveness analysis
3.12 Build local live sets
3.13 Build global live sets
3.14 Build intervals
3.15 Build Graph Example
3.16 Build graph
3.17 Simplify
3.18 Select
3.19 Spill values
3.20 Spilling example
3.21 Spilling example
3.22 Assign Locations
3.23 Data structure

4.1 Control flow graph of the test case
4.2 Live ranges of the test case after lifetime analysis
4.3 Interference graph for integer registers of the test case
4.4 Interference graph and stack for the test case after simplify
4.5 Live ranges of the test case after spilling
4.6 Colored interference graph for our test case

5.1 DaCapo performance comparison
5.2 DaCapo compile time comparison
5.3 Scala DaCapo performance comparison
5.4 Scala DaCapo compile time comparison

Listings

3.1 Finding spill position

4.1 Test case Java code
4.2 Test case LIR code before allocation
4.3 Test case LIR code after spilling
4.4 Test case LIR code after assigning of registers
4.5 Phi instruction
4.6 Test case LIR after resolving Phi instructions

Bibliography

[1] Peter Bergner, Peter Dahl, David Engebretsen, and Matthew O’Keefe. Spill code minimization via interference region spilling. 1997.

[2] S. M. Blackburn, R. Garner, C. Hoffman, A. M. Khan, K. S. McKinley, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanović, T. VanDrunen, D. von Dincklage, and B. Wiedermann. The DaCapo benchmarks: Java benchmarking development and analysis. In OOPSLA '06: Proceedings of the 21st annual ACM SIGPLAN conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 169–190, New York, NY, USA, October 2006. ACM Press.

[3] Preston Briggs, Keith D. Cooper, Ken Kennedy, and Linda Torczon. Coloring heuristics for register allocation. 1989.

[4] Preston Briggs, Keith D. Cooper, and Linda Torczon. Improvements to graph coloring register allocation. 1994.

[5] Sebastian Buchwald, Denis Lohner, and Sebastian Ullrich. Verified construction of static single assignment form. 2016.

[6] Gregory J. Chaitin. Register allocation & spilling via graph coloring. 1982.

[7] Gregory J. Chaitin, Marc A. Auslander, Ashok K. Chandra, John Cocke, Martin E. Hopkins, and Peter W. Markstein. Register allocation via coloring. 1981.

[8] OpenJDK Community. Graal project. 2016. http://openjdk.java.net/projects/graal.

[9] Keith D. Cooper, Timothy J. Harvey, and Linda Torczon. How to build an interference graph. 1988.

[10] Keith D. Cooper and L. Taylor Simpson. Live range splitting in a graph coloring register allocator. 2005.

[11] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. Efficiently computing static single assignment form and the control dependence graph. 1991. DOI: 10.1145/115372.115320.

[12] Alkis Evlogimenos. Improvements to linear scan register allocation. 2004.

[13] Thomas Kotzmann, Christian Wimmer, Hanspeter Mössenböck, Thomas Rodriguez, Kenneth Russell, and David Cox. The Java HotSpot client compiler. 2001.

[14] Tim Lindholm, Frank Yellin, Gilad Bracha, and Alex Buckley. The Java Virtual Machine Specification, Java SE 8 Edition. 2015.

[15] Hanspeter Mössenböck. Adding static single assignment form and a graph coloring register allocator to the Java HotSpot client compiler. 2000.

[16] Sun Microsystems / Oracle. The Java HotSpot performance engine architecture. Visited August 2016. http://www.oracle.com/technetwork/java/whitepaper-135217.html.

[17] Michael Paleczny, Christopher Vick, and Cliff Click. The Java HotSpot server compiler. 2001.

[18] Massimiliano Poletto and Vivek Sarkar. Linear scan register allocation. 1999.

[19] Andreas Sewe, Mira Mezini, Aibek Sarimbekov, and Walter Binder. Da Capo con Scala: Design and analysis of a Scala benchmark suite for the Java virtual machine. In Proceedings of the 26th Conference on Object-Oriented Programming, Systems, Languages and Applications, OOPSLA '11, pages 657–676, New York, NY, USA, 2011. ACM.

[20] Lukas Stadler, Thomas Würthinger, and Hanspeter Mössenböck. Partial escape analysis and scalar replacement for Java. 2014.

[21] Robert Endre Tarjan. Efficiency of a good but not linear set union algorithm. 1975.

[22] Christian Wimmer and Michael Franz. Linear scan register allocation on SSA form. 2010.

[23] Christian Wimmer and Hanspeter Mössenböck. Optimized interval splitting in a linear scan register allocator. 2005.

Personal Information

Name: Schröckeneder, Florian
Address: Anastasius-Grün-Straße 16, 4020 Linz
Telephone: 0664/2756994
Email: [email protected]
Nationality: Austrian
Date of Birth: 15 October 1988

Professional Experience

2006–today: Various side jobs unrelated to my profession
2008–2009: Military service, Linz, Austria

Education

2003–2008: Bundeshandelsakademie, Schärding, Austria
2009–2014: BSc in Computer Science, Johannes Kepler University, Linz, Austria
2014–2016: MSc in Computer Science, Johannes Kepler University, Linz, Austria

Other Interests

Hobbies: Hiking, Archery, Fishing, Motorcycling

Statutory Declaration (Eidesstattliche Erklärung)

I hereby declare under oath that I have written this master's thesis independently and without outside assistance, that I have used no sources or aids other than those indicated, and that all passages taken verbatim or in substance from other sources are marked as such. This master's thesis is identical to the electronically submitted text document.

Linz, July 31, 2017

Florian Schröckeneder.