Simulation-Based Code Duplication in a Dynamic Compiler

Doctoral Thesis to obtain the academic degree of Doktor der technischen Wissenschaften in the Doctoral Program Technische Wissenschaften

Submitted by: DI David Leopoldseder, BSc.

Submitted at: Institute for System Software

Supervisor and First Examiner: o. Univ.-Prof. DI Dr. Dr. h. c. Hanspeter Mössenböck
Second Examiner: Prof. Michael O'Boyle
Co-Supervisor: Dr. Lukas Stadler

Linz, August 2019

JOHANNES KEPLER UNIVERSITY LINZ, Altenbergerstraße 69, 4040 Linz, Österreich, www.jku.at, DVR 0093696

Oracle, Java, HotSpot, and all Java-based trademarks are trademarks or registered trademarks of Oracle in the United States and other countries. All other product names mentioned herein are trademarks or registered trademarks of their respective owners.


Statutory Declaration

I hereby declare that the thesis submitted is my own unaided work, that I have not used other than the sources indicated, and that all direct and indirect sources are acknowledged as references.

This printed thesis is identical with the electronic version submitted.

Linz, August 14, 2019

Abstract

Dynamic compilers perform a wealth of optimizations to improve the performance of the generated machine code. They inline functions; unroll, peel, and vectorize loops; remove allocations; perform register allocation and instruction scheduling; and duplicate code. All optimizations serve the goal of improving the performance of the generated code along as many success metrics as possible, including latency, throughput, memory usage, cache behavior, micro-code usage, security, and many others. In this process of transforming a source program to optimized machine code, a typical compiler makes a multitude of decisions when applying optimizations. Many optimizations not only have positive impacts on a compilation unit, but can also have negative effects on any of the success metrics. Since it is infeasible for a compiler to produce the optimal code for a given source program on a given system, compilers resort to modeling optimization decisions via heuristic functions that are typically hand-tuned to a given set of benchmarks, in order to produce the fastest possible artifact.

Duplicating code into different control-flow branches opens the potential for applying context-specific optimizations, which would not be possible otherwise. Duplication-based optimizations, including tail-duplication and loop unrolling, can have negative impacts on the performance of the generated machine code of a program; however, in many cases they are still able to improve performance significantly. This imposes a challenge on modern compilers: duplicating instructions at every control-flow merge is not possible, because it leads to uncontrolled code growth and compile-time increases. Yet, not duplicating code and missing out on performance increases is also not a suitable option. Therefore, compilers need to determine the performance and code-size impacts of a duplication transformation before performing it. Answering the question of the impact of a single duplication transformation on the optimization potential of an entire compilation unit typically requires compile-time-intensive analysis unsuitable for dynamic compilation. Consequently, dynamic compilers commonly resort to simple heuristics modeling beneficial and harmful impacts of a duplication optimization. However, heuristics are never complete and often miss modeling aspects crucial for the performance of a program.

To tackle the shortcomings of duplication-based optimizations in a dynamic compiler, we propose simulation-based code duplication, a three-tier optimization scheme that allows a compiler to (1) find duplication optimization candidates, (2) trade off their expected impacts between different candidates, and (3) perform only those duplication transformations that are considered beneficial. Simulation-based code duplication is precise, meaning that all simulated performance improvements are applicable after duplication. Additionally, it is complete, meaning that it makes it possible to simulate the effect of any given duplication-dependent optimization on a compilation unit after duplication.

We implemented our simulation-based code duplication approach on top of the Graal Virtual Machine and applied it to two code-duplication-based optimizations: tail-duplication and loop unrolling for non-counted loops.

We show that our simulation-based code duplication scheme outperforms hard-coded heuristics and can significantly improve performance of the generated machine code.

Large parts of our work have been integrated into Oracle Labs' Graal Virtual Machine and are commercially available.

Kurzfassung

To improve the performance of generated code, dynamic compilers perform a multitude of optimizations, such as the inlining of functions and the unrolling of loops. They schedule the order of instructions to achieve optimal pipeline utilization, assign registers to temporary values to reduce memory accesses to a minimum, and they duplicate code. All of this serves to increase the efficiency of the generated code along multiple metrics, such as latency, throughput, memory, cache, and micro-code usage, and many others. In the course of translation and optimization, a compiler has to make numerous decisions, since not all transformations automatically improve the performance of a program. Some transformations interact: while they have positive effects on one particular metric, they can have negative effects on another. An optimal solution to a compilation problem is technically infeasible. Therefore, to produce the fastest possible code, compilers resort to heuristics that are manually tuned on benchmark programs.

Code duplication allows a compiler to perform context-sensitive optimizations that would otherwise not be possible. However, code-duplicating optimizations, including classical tail-duplication and loop unrolling, can have negative effects on the performance of generated machine code. In many cases they can nevertheless lead to significant performance improvements. This fact poses a problem for optimizing compilers: duplicating code at every control-flow merge is not possible, since it leads to uncontrolled code growth and increased compilation times. On the other hand, it is not desirable to forgo code duplication per se, since potential performance gains could be missed. Therefore, optimizing compilers must determine the potential effects of a duplication with respect to code size and performance gain. Determining the effects of a single duplication on an entire compilation unit requires complex and expensive data-flow and control-flow analyses, which are normally not applicable in a dynamic compilation context. Compilers therefore model such positive and negative effects of duplications with heuristics. Heuristics, however, are often not complete and do not model all performance-relevant concepts of a program.

To eliminate the deficits of duplicating optimizations in a dynamic compiler, we propose simulation-based code duplication, an optimization scheme that allows a compiler to (1) find duplication candidates amenable to optimization, (2) weigh their effects against each other, and (3) perform only beneficial transformations.

Simulation-based duplication is precise, meaning that all effects simulated beforehand can actually be optimized later. Additionally, our approach is complete, meaning that it allows a compiler to simulate the effects of arbitrary transformations on the optimization potential of a program.

We implemented simulation-based duplication on top of GraalVM for two optimizations: classical code duplication and the unrolling of non-counted loops.

In this thesis we show that simulation-based duplication outperforms manual heuristics and can significantly improve the performance of generated code.

Large parts of our work have been integrated into GraalVM by Oracle Labs and are commercially available.

Contents

1 Introduction
   1.1 Problem Setting
   1.2 Problem Statement
   1.3 State-of-the-Art
   1.4 Remaining Challenges
   1.5 Novel Solution
   1.6 Scientific Contributions
       1.6.1 Publications
       1.6.2 Technical Contributions
   1.7 Applicability
   1.8 Project Context
   1.9 Structure of this Thesis

2 Terminology
   2.1 Compilation
   2.2 Intermediate Representation
       2.2.1 Control Flow Graph
             2.2.1.1 Dominance
   2.3 Static Single Assignment Form

3 GraalVM System Overview
   3.1 Java
       3.1.1 HotSpot JVM
       3.1.2 Graal Compiler
             3.1.2.1 Graal IR
       3.1.3 Truffle
       3.1.4 GraalVM

4 Simulation-Based Code Duplication
   4.1 Problem Statement
       4.1.1 Code Duplication Triangle
   4.2 Solution
       4.2.1 Finding Optimization Opportunities after Duplication
             4.2.1.1 Heuristics
             4.2.1.2 Backtracking
             4.2.1.3 Simulation
             4.2.1.4 Comparison
   4.3 Necessities: Cost Model

5 Node Cost Model
   5.1 Problems of existing Cost Models
   5.2 Cost Model Requirements
   5.3 Node Cost Model
       5.3.1 Code-Size Estimation
       5.3.2 Relative Performance Prediction
       5.3.3 Discussion
   5.4 Implementation Aspects

6 Dominance-Based Duplication Simulation
   6.1 Optimization Opportunities after Duplication
       6.1.1 Canonicalizations
       6.1.2 Read Elimination
       6.1.3 Conditional Elimination
       6.1.4 Partial Escape Analysis and Scalar Replacement
       6.1.5 Lock Coarsening
       6.1.6 Devirtualization
   6.2 Simulation-based Duplication of Control Flow Merges
       6.2.1 DBDS Algorithm
       6.2.2 AC(m, pi, oi) in Graal
   6.3 Trade-off Functions

7 Fast-Path Loop Unrolling of Non-Counted Loops
   7.1 Counted-loop Unrolling
   7.2 A Word on Unrolling Non-Counted Loops
       7.2.1 Non-Counted Loop Construct
   7.3 Optimization Opportunities
       7.3.1 Safepoint Poll Reduction
       7.3.2 Canonicalization
       7.3.3 Loop-Carried Dependency
   7.4 Fast-Path Unrolling of Non-Counted Loops
       7.4.1 Fast-Path Loop Creation
       7.4.2 Algorithm
       7.4.3 Discussion
       7.4.4 Non-counted Loop Unrolling
       7.4.5 Fast-Path Loop Unrolling Algorithm
       7.4.6 Unrolling Trade-Off Heuristic
   7.5 Loop-Wide Lock Coarsening
       7.5.1 Use Cases
       7.5.2 Fast-Path Tiling
       7.5.3 Loop-Wide Lock Coarsening Algorithm
       7.5.4 Loop-Wide Lock Coarsening Heuristics
       7.5.5 Safepoint Tiling

8 Evaluation
   8.1 Evaluation Methodology
       8.1.1 Hardware
       8.1.2 Software
       8.1.3 Benchmarks
             8.1.3.1 Java SPECjvm2008
             8.1.3.2 Java DaCapo
             8.1.3.3 ScalaBench
             8.1.3.4 Renaissance
             8.1.3.5 JavaScript Octane
             8.1.3.6 JavaScript jetstream asm.js
       8.1.4 Benchmark Configuration
       8.1.5 Metrics
       8.1.6 Warmup & Metacircularity
   8.2 Experiments
       8.2.1 Dominance-Based Duplication Simulation (DBDS)
       8.2.2 Simulation vs. Heuristic-Based Solutions
             8.2.2.1 Interpretation
       8.2.3 Fast-Path Loop Unrolling
             8.2.3.1 Discussion
       8.2.4 Loop-wide Lock Coarsening
       8.2.5 Node Cost Model

9 Related Work
   9.1 Code Duplication
       9.1.1 Comparison
   9.2 Loop Unrolling
   9.3 Compiler Cost Models

10 Future Work
   10.1 Generalization of Simulation-based Optimizations
        10.1.1 Transactional Simulation-based IRs
        10.1.2 Optimization Tier Improvements
   10.2 Compiler Cost Model
        10.2.1 Machine Learning Cost Models
        10.2.2 Cost Model Success Functions
   10.3 Loop Unrolling

11 Conclusion

List of Tables

List of Figures

Listings

List of Algorithms

Glossary

A Fast-Path Loop Creation Algorithm

B DBDS Algorithm

C Evaluation Appendix
   C.1 Detailed Performance Plots
       C.1.1 Simulation vs. Heuristical Solutions
       C.1.2 DBDS
       C.1.3 Fast-Path Loop Unrolling
       C.1.4 Loop-wide Lock Coarsening
       C.1.5 Node Cost Model
       C.1.6 Combined Performance Impact
   C.2 Performance Significance Analysis
       C.2.1 Interpretation

Bibliography

Chapter 1

Introduction

In this chapter we discuss the current state of the art of duplication-based optimizations and the open challenges in the research domain. Then, we propose a novel solution to determine the impact of a duplication-based optimization prior to performing a transformation. This enables a compiler to apply advanced reasoning about duplication impacts.

1.1 Problem Setting

Dynamic compilers [25; 57; 92; 213; 218]1 perform optimizations at run time, concurrently with the execution of the application. They utilize profiling information [196] to apply speculative optimizations [56; 57; 58; 221] and only optimize those parts of a program that are frequently executed and thus important.

Compiling at run time imposes unique challenges on a dynamic execution system, since the process of compilation competes with the application workload for CPU resources. This makes it particularly challenging to perform complex optimizations [7], as they typically require compile-time-intensive control-flow and data-flow analysis of the program. However, over the last decades, dynamic compilers closed the performance gap between statically compiled and dynamically compiled languages [80; 129] by applying speculative optimizations. Dynamic compilers today [93; 140; 157] contain many optimizations specifically designed to generate highly efficient code for dynamic high-level programming language constructs. Additionally, dynamic compilers perform optimizations originally developed for statically compiled programming languages like C and C++, since high-level languages typically allow programmers to express programs in low-level semantics as well2. The number of optimizations applied by dynamic compilers is therefore typically very high. This manifests itself in the paradox of dynamic compilers: since compilation is performed at run time, dynamic compilers have to adhere to the requirement of low compilation times, while at the same time they often need to apply more complex optimizations in order to remove abstractions imposed by dynamic language constructs. In order to solve this dilemma, dynamic compilers often only optimize those parts of a program that are executed frequently. They typically use several execution tiers3 to only optimize the important (so-called hot) parts of a program. Dynamic compilers implement the premise "make the common case fast"4 in order to focus the optimization effort on the commonly executed parts of a program. On the other hand, infrequently executed code is often handled by a slower execution tier, e.g., an interpreter or baseline compiler.

1 We use the term dynamic compilation throughout the rest of this thesis to describe an execution system that optimizes code at run time utilizing dynamic context information, i.e., profiling information, for optimization. Other approaches often refer to dynamic compilation with the term just-in-time (JIT) compilation. We refrain from using this term, as just-in-time does not specify whether compilation uses any dynamic context information to optimize the generated machine code.
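To make the tiering idea concrete, the following minimal Java sketch shows how an execution system might detect hot methods; the class name, counter, and threshold are hypothetical and do not reflect HotSpot's actual mechanism, which additionally weighs loop back-edge counts:

    // Hedged sketch: hot-method detection via an invocation counter.
    // All names and the threshold value are made up for illustration.
    final class MethodProfile {
        private static final int COMPILE_THRESHOLD = 10_000; // hypothetical value
        private int invocations;

        // Called by the slow tier (e.g., the interpreter) on every invocation;
        // returns true once the method should be handed to the optimizing compiler.
        boolean recordInvocationAndCheckHot() {
            return ++invocations >= COMPILE_THRESHOLD;
        }
    }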

1.2 Problem Statement

During compilation, many optimization decisions have to be made. This includes complex questions like the inlining problem [156]5 as well as simpler decisions like which loops should be unrolled [7]. In many cases such optimization decisions cannot be made optimally, for several platform-, architecture-, and run-time-system-dependent reasons. Compilers do not control the entire execution stack, i.e., the operating system and the hardware vary, and it is impossible to provide optimal optimization decisions for each combination of both. Additionally, modern-day CPUs are based on complex hardware that is hard to model precisely and completely in a dynamic compiler, where compilation time must be as low as possible because compilation happens concurrently with the execution of the user program.

Therefore, compilers typically resort to modeling these optimization decisions as heuristic trade-off functions that try to guide decisions towards a more optimal program. Heuristic trade-off functions base their reasoning on limited knowledge of the compilation unit, sometimes causing transformations that heavily increase code size or even decrease performance.

2 For example, consider all possible implementations that iterate over a linked list in Java. Programmers can write index-based loops, for-each style loops, or they can use lambda expressions to evaluate a function per list element.
3 For example, the HotSpot Java virtual machine (JVM) [93] uses one interpreter and two dynamic compilers (C1 & C2) that are used with different configurations collecting profiling information, effectively resulting in 5 different tier configurations.
4 A principle used during the development of the MIPS [89] instruction set architecture.
5 The inlining problem can be trivially formulated as finding a set of inlining decisions that minimize the overall execution time of a program without having accurate predictions for run-time decrease and code-size increase of all involved candidate methods [156].

Code duplication6 is an optimization that can significantly improve the run-time performance of a program by allowing a compiler to specialize the duplicated code to the values and types used in predecessor branches7. Consider the code in Listing 1.1a, which shows a trivial Java [124] function foo that effectively returns, based on its input argument x, either the constant 2 or a computation 2 + y where y == x. In order to avoid writing the return statement twice, the programmer combined both cases in a control-flow merge block in the instruction return 2 + y. However, this introduces the unnecessary addition 2 + 0 in the case that x <= 0. Code duplication can optimize this program by taking the code in the merge block and duplicating it into both predecessor branches, resulting in the code in Listing 1.1b. This code can be further optimized to the code in Listing 1.1c. After duplication the compiler is capable of optimizing the instruction 2 + 0 away via constant folding [7], i.e., evaluating the addition at compilation time. In this example the compiler was able to apply code duplication to remove one unnecessary addition and one jump instruction, which, if the else branch is executed at run time, decreases execution time by reducing the number of executed instructions.

(a) Initial program:

    int foo(int x) {
        final int y;
        if (x > 0) {
            y = x;
        } else {
            y = 0;
        }
        ... δ ...
        return 2 + y;
    }

(b) After duplication:

    int foo(int x) {
        final int y;
        if (x > 0) {
            y = x;
            ... δ ...
            return 2 + y;
        } else {
            y = 0;
            ... δ ...
            return 2 + y;
        }
    }

(c) After duplication and optimization:

    int foo(int x) {
        if (x > 0) {
            ... δ ...
            return 2 + x;
        } else {
            ... δ ...
            return 2;
        }
    }

Listing 1.1: Sample program.

However, code duplication can also have negative effects on various success metrics of compiled code and on the compilation process itself. A compiler can decide to duplicate code excessively, resulting in code-size increases that can lead to performance degradation of any success metric.

6 Other approaches apply the theory of code duplication but use a different nomenclature. In general, many approaches duplicate code in order to improve certain success metrics of a compilation unit. The following is a list of the most common ones: duplication [13], replication [136; 137], tail duplication [26], inlining [7], procedure cloning [34], splitting (in the context of inlining) [25; 170], trace compilation [99], and advanced scheduling for Very-Long-Instruction-Word (VLIW) processors including trace scheduling [66], superblock scheduling [96], and hyperblock scheduling [127].
7 Whenever we use the term code duplication in this thesis we mean duplication into predecessor blocks. However, other optimizations, e.g., partial redundancy elimination, apply duplication of instructions into successor blocks of control-flow splits.

Excessive duplication leads to exponential code growth [36] and is considered harmful. To illustrate this, consider the example from Listing 1.1a and the code after duplication in Listing 1.1c. Before duplication the function contained a piece of code in the merge block marked as δ. δ represents code that cannot be specialized for each value of the variable y. If δ accounts for a significant portion of the code size of method foo, duplicating it can nearly double the method's code size. While some optimizations might justify large code-size increases, the constant folding of an addition most likely does not. However, for this to be decided on a conceptual level, a compiler needs to be aware of a duplication transformation's impact, i.e., the transformation's benefit in terms of performance increase and its cost in terms of code-size increase.
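A minimal Java sketch of such an impact check follows; the penalty weight and the probability-weighted benefit formula are hypothetical and only illustrate the shape of the decision, not the thesis's actual trade-off functions (see Chapter 6):

    // Hedged sketch: duplicate only if the probability-weighted cycle savings
    // outweigh a code-size penalty; the weighting factor is made up.
    static boolean worthDuplicating(double cyclesSaved, double branchProbability,
                                    int codeSizeIncreaseBytes) {
        double benefit = cyclesSaved * branchProbability;
        double cost = codeSizeIncreaseBytes * 0.01; // hypothetical penalty weight
        return benefit > cost;
    }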

Because of the potential negative implications of duplication, compilers need to carefully decide which parts of a program should be duplicated and which should not. In order to do so, a compiler needs to find beneficial duplication candidates, i.e., those transformation candidates that will result in the best enabled optimizations. Therefore, compilers need to find all optimization opportunities after a duplication transformation.

A remaining problem is to discover optimization opportunities after code duplication, which is itself a non-trivial task. It requires global knowledge about the data-flow and the control-flow of a program that can only be obtained by complex analysis. There are different approaches to find such opportunities. Yet, there is no unified approach that finds all kinds of optimizations that can be applied by a dynamic compiler.

1.3 State-of-the-Art

Many compilers implement some kind of duplication, and a large number of scientific approaches has been devised to study the impacts of duplication-based optimizations on the performance of generated machine code.

We group approaches proposed by related work into the following categories: explicit duplication approaches that use (tail-)duplication as an optimization (1), approaches utilizing duplication as an enabler for other optimizations (2), and approaches that design their compilation process around duplication in order to optimize specific success metrics (3). This section gives a short outline of each group to motivate the remaining challenges and the novel solution presented in this thesis. For a detailed reflection on related work we refer to Chapter 9, where we take an in-depth look into all mentioned approaches and why we believe they are not suitable in a dynamic compilation setting.

1) Explicit (Tail-)Duplication These approaches apply code duplication as a first-class optimization to remove trivial basic blocks [111], remove unnecessary jumps or reduce the number of jumps [36], and remove conditional and unconditional statements [136; 137]. The main idea is to use the direct effect of duplication as the optimization. E.g., the duplication of the return statement from Listing 1.1a effectively removed the control-flow merge from the control-flow graph (CFG). This removes one unconditional jump from the resulting machine code. Approaches in this category typically do not reason about advanced optimization impacts of a duplication and thus only model a fraction of the enabled optimizations.

2) Duplication to Enable Other Optimizations Such approaches utilize the indirect, i.e., enabling, effect of a code duplication to perform other optimizations. To illustrate this, we refer to Listing 1.1b, which shows the example from Listing 1.1a after duplication but before optimization. The direct effect of the duplication is the removal of the merge block. The indirect effect, however, is the constant folding the compiler can later perform to obtain the code in Listing 1.1c. This category includes work towards using duplication as a way to apply specialization. Prominent approaches in this group are inlining [7], procedure cloning [34], splitting in the context of specialized compilation for Self [25], and approaches that use duplication to perform special optimizations like complete partial-redundancy elimination [13]. Approaches in this category reason about the enabling effect of a duplication. However, they do not combine multiple enabled optimizations into one optimization model, i.e., they merely target one optimization or one class of optimizations. Additionally, such approaches do not allow the compiler to perform performance and code-size estimations in advance, incurring the risk of uncontrolled code growth due to excessive duplication8.

3) Advanced Scheduling This category combines approaches that use duplication as a supporting paradigm in their compilation pipeline. Advanced scheduling approaches [66; 96; 127] have been developed in this context to realize better code generation for VLIW processors, which require increased instruction-level parallelism (ILP) in their compilation units to utilize the higher number of computational units on VLIW architectures. Such approaches apply (tail-)duplication as an optimization to increase the ILP of their compilation units. Yet, they do not reason about the enabling capabilities of duplication for subsequent optimizations, nor do they model a code-size versus performance trade-off.

8 See Table 9.1 in Chapter 9.

In general, all of the presented approaches utilize code duplication in some way to perform optimizations. However, none of them applies a cost-benefit analysis prior to the actual transformation in order to only duplicate beneficial candidates. This prevents us from using them in the context of dynamic compilation to combine the direct and indirect effects of duplicating code while keeping code size and compile time at a minimal and controllable level.

1.4 Remaining Challenges

In order to support fine-grained code duplication in a dynamic compiler, we derived the following challenges, which we believe are not yet solved by related work and for which we present a solution in this thesis.

1) Heuristic-based optimization decisions are imprecise, i.e., they cannot reason about the impact of a transformation on the performance of the generated machine code. Therefore, a precise duplication approach is required that allows a compiler to reason about the impact of a single duplication transformation along multiple success metrics.

2) Excessive code duplication leads to uncontrolled code growth and unwanted side effects that often have negative impacts on performance. Therefore, performing code duplication without a code-size trade-off is not suitable for dynamic compilation, where a minimal compile time and code size is crucial for performance of the overall execution system.

3) A compiler should only perform those duplication transformations that improve performance.

4) In order to find suitable duplication candidates, the compiler needs to know the indirect impact of each single duplication transformation to be able to perform a realistic trade-off between a transformation’s code-size increase and its performance impact.

5) Therefore, the compiler needs to find all possible optimizations that are enabled by a code duplication.

6) To make this process suitable for dynamic compilation, the analysis has to be sufficiently fast.

1.5 Novel Solution

In order to tackle the remaining challenges, we propose a novel approach to perform duplication-based optimizations in a dynamic compiler: simulation-based code duplication, which allows a compiler to determine, prior to performing the actual transformation, what the impact of a duplication on the optimization potential of a compilation unit is.

[Figure: a program passes through three tiers (Simulation: several candidate simulations; Trade-Off: rank and evaluate success metrics; Optimization: apply the best duplications), yielding the optimized program.]

Figure 1.1: Simulation-based code duplication.

This allows a compiler to reason about which duplication operations should be performed in order to increase performance, without exploding code size or compilation time. Figure 1.1 depicts a schematic sketch of the approach, which is based on three tiers that work as follows:

1) Simulation: The simulation tier discovers optimization opportunities after code duplication by simulating the effect of a duplication on the compilation unit. The result is a simulation-based version of the program that, optimization-wise, acts like the original program. The compiler then performs partial optimizations on the simulated program, i.e., optimizations that behave like the real ones, however without the compile-time-intensive task of maintaining correct data- and control-flow dependencies.

2) Trade-Off: The trade-off tier fits those opportunities into an optimization cost model that tries to maximize peak performance while minimizing code size and compilation time. The outcome is a set of duplication transformations that should be performed as they lead to sufficient peak performance improvements. Transformations that are unnecessary for performance or that explode code size are discarded in this tier.

3) Optimization: The optimization tier then performs the selected duplications together with the subsequent optimizations whose potential was detected by the simulation tier. The result is the optimized program.
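The following self-contained Java sketch illustrates how tiers 2 and 3 could consume the candidates produced by the simulation tier: rank them by benefit per cost and apply them greedily within a code-size budget. All names, numbers, and the greedy policy are illustrative assumptions, not Graal's actual trade-off model (see Chapter 6):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Hedged sketch of the trade-off tier; names and policy are hypothetical.
    public class TradeOffSketch {
        // One simulated duplication candidate: estimated run-time benefit and
        // estimated code-size increase, both produced by the simulation tier.
        record Candidate(String merge, double benefit, double codeSizeIncrease) {}

        // Rank candidates by benefit per code-size unit and select greedily
        // until a code-size budget is exhausted; harmful candidates are discarded.
        static List<Candidate> select(List<Candidate> simulated, double sizeBudget) {
            List<Candidate> ranked = new ArrayList<>(simulated);
            ranked.sort(Comparator.comparingDouble(
                    (Candidate c) -> c.benefit() / c.codeSizeIncrease()).reversed());
            List<Candidate> chosen = new ArrayList<>();
            double used = 0;
            for (Candidate c : ranked) {
                if (c.benefit() > 0 && used + c.codeSizeIncrease() <= sizeBudget) {
                    chosen.add(c);
                    used += c.codeSizeIncrease();
                }
            }
            return chosen; // the optimization tier would now duplicate these merges
        }

        public static void main(String[] args) {
            List<Candidate> sims = List.of(
                    new Candidate("merge1", 8.0, 40.0),
                    new Candidate("merge2", 1.0, 200.0),
                    new Candidate("merge3", 5.0, 10.0));
            System.out.println(select(sims, 100.0)); // merge3 and merge1 fit the budget
        }
    }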

The simulation step enables a compiler to evaluate different success metrics per potential duplication. Using this information, the compiler can then select the most promising candidates for optimization. Additionally, our approach maps duplication candidates into an optimization cost model that allows the compiler to trade off between different success metrics including peak performance, code size, and compile time. To do so, we developed an architecture-agnostic cost model for a dynamic compiler that can be used by optimizations to make static performance and code-size predictions of compiler IR.
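As a foretaste of Chapter 5, the following hedged Java sketch shows the general shape of such a cost model: each IR node kind carries a relative latency estimate and a relative code-size estimate. The node kinds and numbers here are invented for illustration and are not the values of Graal's node cost model:

    // Hedged sketch: per-node-kind cost estimates; values are illustrative only.
    enum NodeCost {
        ADD(1, 1), MUL(2, 1), DIV(32, 1), LOAD(4, 1), STORE(4, 1), BRANCH(2, 2);

        final int relativeCycles; // architecture-agnostic latency estimate
        final int relativeSize;   // architecture-agnostic machine-code size estimate

        NodeCost(int relativeCycles, int relativeSize) {
            this.relativeCycles = relativeCycles;
            this.relativeSize = relativeSize;
        }
    }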

Simulation-based code duplication enables dynamic compilers to perform a fine-grained performance versus code-size trade-off in duplication-based optimizations. We implemented our approach in two optimizations and show that it outperforms heuristics by producing faster and often smaller code, while consuming only slightly more compilation time than heuristic solutions. In our implementation we show that simulation-based code duplication significantly increases the performance of Java-based applications without sacrificing code size or compilation time.

1.6 Scientific Contributions

In this section we summarize the contributions of this thesis by listing publications and other artifacts produced in the context of the simulation-based code duplication approach.

1.6.1 Publications

This thesis contributes the following novel aspects to the state of the art in compiler research:

• A three-tier simulation-based optimization scheme to first simulate the result of a duplication transformation, then trade off the various simulation results against each other, and finally perform only those duplication transformations that carry sufficient performance increases. Published in [116; 117; 120].

• An application of this scheme to tail-duplication of regular control-flow merges, including a dominance-based algorithm to simulate the effects of a duplication on the structure of a control-flow graph. We argue why simulation is favorable over backtracking to achieve a precise effect determination while still being complete in the number of supported optimizations. Published in [116; 120].

• The application of the three-tier simulation-based optimization scheme on duplication-based loop unrolling to show that the idea also applies to other duplication-based optimizations. Published in [117].

• An optimization to split loops into hot and cold paths, called fast-path loop creation (FP-loop creation), allowing partially aliasing memory operations to be pulled out of hot paths of loops by scheduling them in dominating blocks. This effectively lifts memory anti-dependencies into less frequently executed basic blocks. This work was motivated by our work on simulation-based loop unrolling. Published in [117].

• An algorithm, used in our simulation-based unrolling approach, to partially unroll the hot path of a non-counted loop using FP-loop creation. Published in [117].

• A set of simulation-based unrolling strategies building on the proposed three-tier duplication scheme to selectively apply loop unrolling of non-counted loops to improve peak performance. Published in [117].

• A platform- and architecture-agnostic cost model for a graph-based intermediate representation (IR) in a dynamic compiler that enables optimizations to perform code-size and latency estimations. Published in [118].

• A novel optimization to reduce the locking overhead of synchronization-heavy Java loops. High-level idea published in [158].

Additionally, the author of this thesis contributed to different components of the Graal compiler and collaborated on research in the GraalVM ecosystem. This research is presented in various other publications and is not part of the contributions of this thesis. Subsequently, a non-exhaustive list of the author's work in the project is given:

• A novel approach for source code generation from unstructured compiler IR [119]. This approach was later used in Graal AOT JS, the first Java bytecode to JavaScript compiler in the Graal ecosystem, published in [115].

• An extensive analysis of the impact of dynamic compiler optimizations on Scala collection performance [157].

• An extensive analysis of the usage of x86-64 inline assembly in C programs [165].

• A novel approach for parallel trace register allocation in a dynamic compiler [62].

• A novel algorithm for inline substitution in a JIT compiler [156].

• A novel benchmark suite for concurrency benchmarks on the JVM [158].

• A novel algorithm for loop detection in a bytecode-based partial evaluation algorithm (under review [132]).

1.6.2 Technical Contributions

Large parts of the contributions presented in this thesis have been integrated into Oracle Labs’ Graal Virtual Machine (GraalVM) and are running in production. The artifacts deployed into GraalVM are:

Dominance-Based Duplication Simulation (DBDS) DBDS is integrated into GraalVM and enabled by default in production builds.

Node-cost-model (NCM) The NCM is integrated into the open-source part of the Graal compiler and is also used in GraalVM. Several other optimizations migrated to use the cost model, for example Graal's conditional move [126] optimization.

1.7 Applicability

The optimizations presented in this thesis have been implemented in the Graal compiler [55; 57; 182; 185; 213; 219], which is part of the OpenJDK [139] project. Graal has been an experimental JIT compiler in HotSpot [93] since the Java 10 [142] release. Therefore, we have been able to thoroughly test all optimization implementations presented in this thesis with standard JVM workloads including Java DaCapo [11], ScalaBench [176], Java SPECjvm2008 [189], Renaissance [158] and many other real-world Java applications. Additionally, parts of the concepts presented in this thesis are implemented in GraalVM [140; 148], allowing us to execute the Truffle framework9, which builds on top of Graal. This allows the execution of various language implementations on top of GraalVM and enabled us to test our optimizations with languages other than Java, including JavaScript [59] via the GraalJS JavaScript runtime [147], Ruby [32] via the TruffleRuby [152] runtime, R [70] via the FastR runtime [145] and Python 3 [69] via the Graal Python project [146].

Many concepts presented in this thesis are specific to dynamic compilation, including the need for low compilation times. However, the algorithms and optimizations investigated in this thesis are applicable to any kind of compiler that supports the concept of CFGs and a notion of dominance [39]. All algorithms and optimizations that use profiling information for their optimization decisions do so by feeding it as an additional input into mathematical trade-off functions. We tested this applicability by disabling the usage of profiling information in our optimizations. While this can have some influence on the performance increases for pathological patterns, the general performance improvements generated by our approach can still be replicated.

9 See Chapter 3.

1.8 Project Context

GraalVM & Oracle Labs The work presented in this thesis was conducted together with Oracle Labs in the context of an on-going research collaboration between the Institute for System Software [179] at the Johannes Kepler University Linz [103] and Oracle Labs (formerly Sun Microsystems) [150]. The collaboration originated from a sabbatical Prof. Hanspeter Mössenböck, the head of the Institute for System Software, took in 2000 to develop a novel IR in static-single-assignment form (SSA) for the HotSpot client compiler [109] in order to implement a graph-coloring register allocator [133]. This work was the foundation for successful future collaborations, leading to several Bachelor's, Master's and PhD students working on enhancements for the Java HotSpot VM as well as on the two compilers in it. In the following, we give a non-exhaustive list of the most important publications resulting from this collaboration in chronological order:

• Mössenböck [133] added a new SSA intermediate representation to HotSpot's client compiler (C1) and implemented a graph-coloring register allocator [24].

• Mössenböck and Pfeiffer [134] developed a novel algorithm for linear scan register allocation, an algorithm especially suitable for low compile-time requirements in dynamic compilers.

• Wimmer and Mössenböck [203] improved upon the previously developed algorithm and implemented it in the HotSpot client compiler. Later Wimmer and Franz [202] extended this work to apply linear scan register allocation to an SSA-based IR to utilize the properties of only having one definition of a variable to simplify the linear scan data flow analysis.

• Kotzmann et al. [108; 109] continued the work on C1 and implemented a novel escape analysis algorithm for it.

• Wimmer and Mössenböck [204; 205] proposed the idea of automatic object co-location and developed it further to automatic object and array inlining [206; 207].

• Würthinger et al. [214; 216] developed a novel algorithm for array-bounds-check-elimination for HotSpot’s C1 compiler.

• Würthinger et al. [215] worked on the visualization of program dependence graphs. The resulting software was the first tool to give a scalable visual representation of HotSpot’s server compiler (C2) sea-of-nodes-based intermediate representation [30].

• Würthinger et al. [210; 211; 212; 220] developed dynamic code evolution for HotSpot, a mechanism that allows arbitrary re-definitions of Java classes at run time without shutting down the VM.

• Stadler et al. [184; 186] worked on extending HotSpot with continuations and co-routines, allowing the creation of thousands of continuations in one VM.

• Häubl et al. [85; 86; 87; 88] developed a novel trace-based dynamic compiler for HotSpot.

The preceding list of research led to the inception of the Graal project. Graal started as a project to develop a dynamic compiler for Java that was itself written in Java. This was the successful start of a novel compiler project that led to research in the area of dynamic compilation and programming language implementation. Subsequently, we summarize the most important research milestones grouped into two categories: research on the novel dynamic compiler Graal and research on the abstract-syntax-tree implementation framework Truffle running on top of Graal:

Graal

• Stadler et al. [181] studied methods for efficient compilation queuing and graph caching during compilation.

• Duboscq et al. [55; 57] developed a novel sea-of-nodes based IR for the Graal compiler that allows the compiler to flexibly combine the scheduling of side effects with the maintenance of deoptimization meta data [58].

• Stadler et al. [182] worked on analyzing the performance characteristics of Scala code with respect to dynamic compiler optimizations.

• Stadler et al. [180; 185] proposed partial escape analysis, a novel algorithm to solve the all-or-nothing approach of pre-existing escape analysis algorithms by moving allocations, if possible, into less-frequently executed branches.

• Simon et al. [178] presented Snippets, a high-level and architecture-independent way to represent low-level IR semantics in a dynamic compiler.

• Eisl et al. [61; 62; 63] developed trace-register allocation for the Graal compiler. A novel register allocation approach that divides a program into traces and does register allocation on a per-trace basis in descending execution probability, i.e., more important code first.

Truffle

• Würthinger et al. [213; 218; 219] proposed the Truffle framework for practical partial evaluation, the foundation on which various programming languages have been implemented on the GraalVM.

• Humer et al. [95] proposed the Truffle DSL, a domain-specific language for the implementation of AST-based Truffle interpreters that supports the specialization of operations based on input types and additional properties.

• Wöß et al. [209] introduced the object-storage model for Truffle, an object model that can be used by language implementers to develop Truffle-based language runtimes. It is language-agnostic, portable and supports object representations for dynamic languages.

• Grimmer et al. [81; 82; 83] proposed a language interoperability mechanism for Truffle [213].

• Rigger et al. [160; 161; 162; 163; 164; 166; 167; 168; 169] developed a managed execution environment called Sulong to execute unmanaged languages on the JVM. Rigger additionally studied the usage of non-standardized elements in C projects such as inline assembly and compiler built-ins.

• Daloze et al. [40; 41; 42] devised a thread-safe object model and efficient thread-safe data structures for dynamic languages such as Ruby.

• Stadler et al. [183] proposed aggressive speculative optimizations for the R language.

• Van-De-Vanter et al. [195] proposed the Truffle Instrumentation Framework, a language-agnostic extension to the Truffle framework that allows easy debugging and tool support for Truffle language implementations.

Graal supports code generation for x86-64 and SPARC. Additionally, a backend for arm64 is under experimental development. Over the years, experimental backends for PTX & HSAIL, OpenCL and CUDA have also been developed.

1.9 Structure of this Thesis

This thesis is structured as follows:

• Chapter 2 introduces theoretical concepts that lay the foundation of this thesis. It discusses terminology in the domain of compiler research necessary to reason about code duplication.

• Chapter 3 introduces the Java language, its virtual machine and basic execution paradigms. It gives an overview of the concepts of dynamic compilation and speculation. In the second part of this chapter we take a closer look at the Graal compiler and its ecosystem. All optimizations presented in this thesis have been designed for the Graal compiler, so we explain its paradigms, IR and characteristics.

• Chapter 4 introduces the problems of excessive code duplication in detail and motivates the non-heuristic solution to the problem. We propose a simulation-based solution to this problem and argue what is needed for this paradigm to work. We also motivate the need for a compiler cost model to perform more accurate decisions in duplication simulation.

• Chapter 5 introduces the node-cost model, a language- and architecture-agnostic cost model for the graph-based intermediate representation of the Graal compiler. This cost model is a fundamental requirement for the simulation-based code duplication approach to function properly.

• Chapter 6 presents our design and implementation of the simulation-based duplication scheme for traditional tail-duplication. We introduce a novel algorithm for duplication simulation called Dominance-Based Duplication Simulation.

• In Chapter 7 we propose an extension to the simulation-based duplication approach to apply simulation-based unrolling of non-counted loops via two novel algorithms: fast-path loop creation and fast-path loop unrolling.

• Chapter 8 contains an extensive performance evaluation of the algorithms and optimizations presented in this thesis. We conducted several experiments with different configurations of the compiler and the VM to show that our optimizations can improve the performance of modern Java applications as well as dynamic languages like JavaScript at moderate compile time and code-size increases.

• Chapter 9 discusses related work in the domain of code duplication, loop unrolling and compiler cost models. It presents a comparison of simulation-based code duplication with duplication-based optimizations proposed by related work.

• Chapter 10 contains an outlook to future work. We believe that simulation-based code duplication is a foundation for fruitful future research in the direction of specialized IRs for duplication and machine learning of cost-model parameters.

• We conclude the thesis in Chapter 11.


Chapter 2

Terminology

In this chapter we define a common terminology for the rest of this thesis by summarizing all theoretical concepts needed in order to understand the simulation-based code duplication approach.

While compiler construction is one of the oldest research domains in computer science, a common terminology is still vital for the exchange of ideas. Therefore, we define a minimal set of terms, concepts and algorithms we use throughout the rest of this thesis to explain simulation-based code duplication. We describe them in a semi-formal fashion and refrain from formal definitions, as they have been given by related work and are not part of the contribution of this thesis.

2.1 Compilation

Before we go into the basic concepts of intermediate representations, we define the basic terms of compilation that are used throughout the rest of this thesis.

Definition 1 (Compiler). A compiler C is a program that takes an input program p in input format S denoted as S(p) and produces a semantically equivalent program p′ in output format O denoted as O(p). Hence, a compiler is a function C(p): S(p) → O(p).

Definition 2 (Intermediate representation (IR)). During compilation C(p) the compiler constructs an IR of program p, denoted as IR(p). IR(p) is semantically equivalent to p and S(p); however, it is in a different format that is more suitable for optimizations.

Definition 3 (Optimization). During compilation C(p) the compiler performs optimizations (denoted as OP(IR)) on IR(p) that do not change the semantics of p. However, OP(IR) can change success metrics of the execution E(O(p)), including run-time performance, latency, memory usage, cache behavior, etc. The original semantics of p must be preserved by OP(IR).

Historically, a classical compiler implementation of C(p): S(p) → O(p) is a static compiler from a source language to machine code, e.g., C(p): SourceCode(p) → MachineCode(p). However, nowadays this is no longer generally valid, as there are arbitrarily many implementations of S and O that differ from source or machine code. Since we propose simulation-based code duplication for a Java compiler, we focus on one particular instantiation of C(p): a Java bytecode [124] to machine code compiler.

Definition 4 (Dynamic Java Compiler). A dynamic Java compiler is a function JavaDynCompiler(p): JavaBytecode(p) → MachineCode(p).

Every time we use the term compiler throughout the rest of this thesis we mean Definition 4.

2.2 Intermediate Representation

Compilers transform a program from its source representation S(p) into an IR IR(p) in order to be able to optimize1 it more easily. Source code or bytecode typically has many abstractions that are unnecessary or unsuitable for optimization. There are many reasons to use an intermediate representation, as well as many requirements a particular IR(p) must fulfill. Generally speaking, advanced optimizations like inlining [156] or duplication [120] require detailed information about the data and control flow of a program. Every optimization has different requirements when it comes to the kind of analysis it performs or the context information it requires. Therefore, modern-day Java compilers use multiple IRs to satisfy the many different needs of optimizations: JavaDynCompiler(p): JavaBytecode(p) → IR1(p) → IR2(p) → … → IRn(p) → MachineCode(p).

2.2.1 Control Flow Graph

In this thesis we focus on dynamic compilation in the Graal compiler, thus we focus on the IRs used by Graal. Graal uses two main intermediate representations for optimizations, a control-flow graph (CFG) and a special graph-based IR called GraalIR: GraalCompiler(p): JavaBytecode(p) → GraalIR(p) → CFG_LIR(p) → MachineCode(p).

We will explain the details of GraalIR in Chapter 3 and will now give a short definition of a CFG and some basic properties necessary to understand GraalIR. GraalIR, even though it is a special intermediate representation, has an equivalent 1:1 mapping to a CFG after scheduling2.

1 See Definition 3.
2 See Paragraph 3.1.2.1.1.

A control-flow graph is a standard intermediate representation used by many compilers like C1 [109], GCC [78], LLVM [111] and Graal [60; 201].

Definition 5 (Instruction). An instruction, denoted as i, is the smallest building block of a program. Depending on the IR, in which the program is represented, it typically represents a single operation as defined by the IR.

Definition 6 (Basic Block). A basic block, denoted as bb, is the longest possible sequence of instructions bb(i0, …, in) such that instructiontype3(i0…(n−1)) ≠ branch ∧ pred4(i1…n) ∈ bb ∧ ∀k ∈ [1, n]: pred(ik) = ik−1. That is, all instructions inside bb are branch-free instructions except the last instruction, and there is no outer branch instruction merging into bb. Therefore, bb is the longest possible sequence of branch-free instructions (except for the last one) without external branches leading to the instructions i1 … in.

Definition 7 (Edge). An edge is a 2-tuple (bbout, bbin) of two basic blocks. That is, an edge is a branch instruction between two basic blocks for which the following holds: successor(ilast(bbout)) = predecessor(ifirst(bbin)).

Definition 8 (Control Flow Graph). A control flow graph (CFG) is a directed graph composed of a 2-tuple CFG ≡ (Blocks, Edges), where Blocks is a set of basic blocks and Edges is a set of edges connecting those blocks. Per definition [3; 35; 76; 114; 159] a CFG has exactly one entry basic block, i.e., a block without a predecessor, and 1 to n exit blocks without a successor.
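Definitions 5 to 8 map almost directly onto data types. The following minimal Java sketch does so; the class and field names are illustrative and are not Graal's IR classes:

    import java.util.ArrayList;
    import java.util.List;

    // Hedged sketch of Definitions 5-8 as data types; names are illustrative.
    final class BasicBlock {
        final List<String> instructions = new ArrayList<>(); // only the last may branch
        final List<BasicBlock> successors = new ArrayList<>();   // outgoing edges
        final List<BasicBlock> predecessors = new ArrayList<>(); // incoming edges
    }

    final class ControlFlowGraph {
        BasicBlock entry; // exactly one block without predecessors
        final List<BasicBlock> blocks = new ArrayList<>();

        // An edge (bbOut, bbIn) connects the branch at the end of bbOut
        // with the first instruction of bbIn.
        void addEdge(BasicBlock out, BasicBlock in) {
            out.successors.add(in);
            in.predecessors.add(out);
        }
    }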

[Figure: the Java source of foo (reproduced below), its control-flow graph in three-address pseudo code with basic blocks bb0 to bb3, and the corresponding dominator tree.]

    static int S;
    static int foo(int a) {
        for (int i = 0; i < a; i++) {
            S += i * a;
        }
        return S;
    }

Figure 2.1: Sample program foo.

Figure 2.1 shows a simple Java method foo with its associated CFG in three-address high-level pseudo code. The CFG of foo consists of 4 basic blocks:

3 instructiontype(i) denotes the category a certain instruction belongs to, e.g., branching, expression and so on.
4 pred(i) denotes the predecessor instruction of i.

• The entry block bb0 of the method. Execution starts at instruction 0.

• The loop header bb1 that performs the comparison and a branch out of the loop in case the loop exit condition is reached.

• The loop body bb3, which contains the loop's instructions.

• The loop and method exit block bb2, which returns the variable S.

2.2.1.1 Dominance

A very important concept for optimizing compilers is the dominator tree [3; 35; 39; 76; 114; 159] that is based on the dominance relation of basic blocks.

Definition 9 (Dominance Relation). The dominance relation dominates(a, b) is a binary relation between the basic blocks of a program, dominates ⊆ Blocks × Blocks = {(a, b) | a ∈ Blocks, b ∈ Blocks}, and is defined by the following term: dominates(a, b) → ∀ paths p from bstart to b: a ∈ p, i.e., every execution path from the start block to block b has to go through a. The dominance relation is transitive, and a block trivially dominates itself. Additionally, direct dominance describes the dominance relation between two basic blocks a, b where a ≠ b.

Definition 10 (Dominator Tree). A dominator tree, denoted as domTree(CFG), is a 2-tuple tree domTree ≡ (Blocks, Edges), where Blocks is the set of basic blocks of the CFG and the Edges represent the direct dominance relation between all blocks.

We show a simple example of a dominator tree on the right-hand side of Figure 2.1 for program foo. An arrow in the dominator tree means that the source block of the arrow directly dominates the target block (which is also true transitively).
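To make the dominance relation concrete, the following Java sketch computes dominator sets with the classic iterative data-flow scheme (dom(b) = {b} united with the intersection of dom(p) over all predecessors p of b), reusing the ControlFlowGraph sketch from above. This quadratic formulation is for illustration only; production compilers typically use faster algorithms such as Lengauer-Tarjan or Cooper-Harvey-Kennedy:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Hedged sketch: iterative dominator-set computation; assumes all
    // blocks are reachable from the entry block.
    final class Dominance {
        static Map<BasicBlock, Set<BasicBlock>> dominators(ControlFlowGraph cfg) {
            Map<BasicBlock, Set<BasicBlock>> dom = new HashMap<>();
            for (BasicBlock b : cfg.blocks) {
                dom.put(b, new HashSet<>(cfg.blocks)); // start pessimistically: "all blocks"
            }
            dom.get(cfg.entry).clear();
            dom.get(cfg.entry).add(cfg.entry); // the entry is dominated only by itself

            boolean changed = true;
            while (changed) {
                changed = false;
                for (BasicBlock b : cfg.blocks) {
                    if (b == cfg.entry) continue;
                    Set<BasicBlock> next = new HashSet<>(cfg.blocks);
                    for (BasicBlock p : b.predecessors) {
                        next.retainAll(dom.get(p)); // intersect predecessor dominator sets
                    }
                    next.add(b); // a block always dominates itself
                    if (!next.equals(dom.get(b))) {
                        dom.put(b, next);
                        changed = true;
                    }
                }
            }
            return dom;
        }
    }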

The dominator tree is heavily used during various optimizations and during the building of static single assignment form5. The dominance relation is especially important for duplication, as all simulation-based algorithms presented in this thesis are based on the dominator tree.

2.3 Static Single Assignment Form

In the last section of this chapter special emphasis is put on static single assignment (SSA) form [28; 39; 133; 134; 202], a special property of intermediate representations.

5 See Section 2.3.

Definition 11 (SSA Form). An intermediate representation IR(p) of program p is in SSA form iff for every variable there is only a single instruction in which it is defined, i.e., for every assignment to a local variable in the source program, an artificial variable with a different name is introduced.

SSA form requires a special handling of variable assignments. Consider the following piece of code:

    a = 1
    if (c) {
        a = 2
    }
    use(a)

The variable a is assigned the constant 1 before the statement if (c) and the constant 2 in the true branch. In order to represent this program in SSA form, a compiler cannot assign to a twice. Therefore, it introduces a new version of a for every assignment, as seen below:

    a1 = 1
    if (c) {
        a2 = 2
    }
    // which a to use? a1 or a2
    use(...)

After the control flow merge the program uses a. The question is which version of a should be used: a1 or a2. Since the actual value of a after the if depends on which branch is executed at run time, the compiler cannot perform this decision. In SSA form, this is handled by introducing so called phi functions, denoted as ϕ, that model the union of all possible predecessor values of a variable at a control flow merge. Depending on which predecessor was executed the value of a ϕ is the value assigned in the respective predecessor. Therefore, to properly represent the example from before in SSA form the compiler needs to introduce the artificial ϕ instruction as seen below:

    a1 = 1
    if (c) {
        a2 = 2
    }
    a3 = ϕ(a1, a2)
    use(a3)
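In a CFG-based IR, a ϕ can be represented as a node of the merge block whose k-th input corresponds to the merge's k-th predecessor. The following Java sketch shows one possible shape; the class names are illustrative (reusing the BasicBlock sketch from Section 2.2.1) and do not mirror Graal's actual node classes:

    // Hedged sketch of a ϕ node; names are illustrative.
    interface Value {} // placeholder for IR values

    final class PhiNode implements Value {
        final BasicBlock merge; // the control-flow merge this ϕ belongs to
        final Value[] inputs;   // inputs[k] flows in over predecessor k of the merge

        PhiNode(BasicBlock merge, Value[] inputs) {
            this.merge = merge;
            this.inputs = inputs;
        }

        // The run-time value of a ϕ is the input that corresponds to the
        // predecessor over which control reached the merge.
        Value inputFor(int predecessorIndex) {
            return inputs[predecessorIndex];
        }
    }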

We illustrate the idea of SSA form in Figure 2.2, which shows the sample program foo from Figure 2.1 without SSA form and after rewriting it to SSA form (marked with the red instructions). The original local variable i, which is assigned twice in the program, once before the loop and once on the backedge of the loop, has been replaced with a ϕ function based on the values of i0 and i1.

[Figure 2.2 shows the control flow graph of foo twice: on the left without SSA form, where i is assigned in bb0 (i = 0) and on the backedge in bb3 (i = tmp3), and on the right in SSA form, where the assignments are split into i0 (bb0) and i1 (bb3) and the loop header bb1 contains the ϕ function i2 = ϕ(i0, i1); the rewritten instructions are marked in red.]

Figure 2.2: Sample program foo in SSA form.

There are several optimizations in Graal that utilize SSA form, including inlining, duplication, common subexpression elimination, loop invariant code motion, unrolling, scheduling, and many more.

Duplication especially benefits from SSA form, as our proposed simulation-based duplication approach6 utilizes the dominance relation and its value-flow properties to simulate the effects of duplication transformations.

6See Chapters 4, 6 and 7.

Chapter 3

GraalVM System Overview

This chapter introduces GraalVM: The Java virtual machine we used as our implementation platform for the algorithms and optimizations presented in this thesis. We present some major design principles of the Java programming language, the HotSpot JVM, JVMCI, the Graal compiler, the Truffle language implementation framework and Substrate VM. We reflect on the design and properties of the Graal compiler, its IR, optimization tiers, phases and capabilities.

Java [124] is a managed, general-purpose programming language maintained by Oracle [149]1 which gained industry popularity shortly after its first release in 1996. Java is, and has been for many years, a very popular programming language [192].

3.1 Java

Historically, Java has been a so-called interpreted language. Java programs are executed by a Java runtime environment (JRE). For this process, Java source code is first compiled to a platform- and architecture-agnostic intermediate representation called Java bytecode [124]. This is done to ensure that Java programs are platform independent. The original motto employed by the design of the Java Virtual Machine specification, which also contains the specification of the bytecode, was "Write once, run anywhere" [38]. A JRE, among many other components, typically contains a JVM which executes the Java bytecode. While Java bytecode is platform- and architecture-agnostic, JVMs are highly platform- and architecture-specific pieces of software. Over the decades, multiple JVMs have been deployed. Some evolved while others were discarded after several releases. Today the most notable ones are IBM's J9 [97], Jikes RVM [4], Sun Microsystems' HotSpot [93] and Oracle Labs'

1It was originally proposed and deployed by Sun Microsystems, which was acquired by Oracle in 2010.

GraalVM [140; 148]. We propose simulation-based code duplication for the Graal compiler which is the dynamic compiler used in GraalVM (and in HotSpot from Java 10 [142]). Therefore, we will focus on these two JVMs for the rest of this chapter.

3.1.1 HotSpot JVM

[Figure 3.1 shows a block diagram of the HotSpot JVM: a class loader parses .class files and .jar archives containing Java bytecode; the verifier and linker prepare the loaded classes; the interpreter executes the bytecode and produces profiling info, while the C1 and C2 compilers, and any JVMCI compiler implementing the compiler interface, produce compiled code that is installed in the code cache; the heap holds application data and class data and is managed via the GC interface by the garbage collector implementations Serial GC, Parallel GC, G1, ZGC and Shenandoah.]

Figure 3.1: HotSpot JVM overview.

HotSpot [30; 93; 109; 144; 153], the currently leading JVM on the market, is a complex piece of software comprising millions of lines of C++ code. However, over the decades it has proven to be the fastest and most reliable JVM implementation. Most notably, HotSpot closed the performance gap between statically compiled languages and interpreted languages like Java [80; 129]. Traditionally, Java bytecode has been executed by an interpreter upon invocation. The first releases of HotSpot did not contain any dynamic or just-in-time compiler; compilers were added incrementally over decades of releases. Current state-of-the-art JVM implementations apply multi-level dynamic compilation to optimize for latency and throughput.

Figure 3.1 shows a schematic overview of the HotSpot JVM and the major components that are necessary to execute Java with near native speed. Upon invocation of the JVM with a Java main method, the JVM parses the bytecode of the main method’s enclosing class file and creates the necessary VM data structures for the type definitions, static memory, etc. Then the interpreter2 starts executing the bytecode. During execution it collects profiling [143; 196] data for several bytecode instructions:

• Invocation Counts: It records the number of times a method has been executed.

• Loop Backedge Count: It records the number of times a loop backedge has been taken.

• Type Checks: It records a fixed number of different types for a type-check’s operand.

• Receiver Types: For virtual calls it records a fixed number of different receiver types and an entry for all other types.

• Branch Probabilities: For every branching instruction it records how often a conditional branch was taken, effectively recording branch probabilities.

This profiling information is used to guide compilation and optimization. Based on the execution counts of methods and loop backedges3, the interpreter queues currently executed methods for compilation. Depending on the compilation policy, one of the compiler implementations picks up a compilation task, performs the compilation and installs the generated code in a memory region called the code cache. The next invocation of this method will not call the interpreter but the compiled version of the method. During compilation, the compilers in HotSpot use the profiling information collected by the interpreter to apply speculative optimizations [56; 57; 58]. Speculative optimizations utilize the profiling information to only optimize the common parts of a program4. If speculative optimizations performed during compilation turn out to be wrong later, HotSpot applies deoptimization [47; 56; 92; 108; 216; 221] in order to revert the already optimized version of the code back to an un-optimized baseline version. This baseline version is the original bytecode, which will be executed by the interpreter again. HotSpot has a complex policy for the different levels of optimization called tiered compilation [151].

2The predominant interpreter implementation in HotSpot is a template-based handwritten assembler interpreter for every bytecode. 3Backedge counter overflow triggers on-stack-replacement [47] compilation where an interpreted stack frame is swapped to a compiled version during execution of a loop. 4The HotSpot JVM has its name from only optimizing important parts of a program. These important parts, i.e., the methods for which the execution counters are high, are called hotspots and get compiled.
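As a purely illustrative sketch of the counter-based compilation trigger described above (all names and thresholds are hypothetical, not HotSpot's actual values or code):

final class CompilationPolicy {
    static final int INVOCATION_THRESHOLD = 10_000; // hypothetical threshold
    static final int BACKEDGE_THRESHOLD = 60_000;   // hypothetical threshold

    void onMethodInvoked(MethodProfile m, CompileQueue queue) {
        if (++m.invocationCount >= INVOCATION_THRESHOLD && !m.compileRequested) {
            m.compileRequested = true;
            queue.enqueue(m); // a compiler thread picks this up asynchronously
        }
    }

    void onLoopBackedge(MethodProfile m, CompileQueue queue) {
        if (++m.backedgeCount >= BACKEDGE_THRESHOLD && !m.compileRequested) {
            m.compileRequested = true;
            queue.enqueue(m); // may trigger on-stack replacement (OSR)
        }
    }
}

final class MethodProfile {
    int invocationCount;
    int backedgeCount;
    boolean compileRequested;
}

interface CompileQueue {
    void enqueue(MethodProfile m);
}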

[Figure 3.2 shows the tiered execution scheme: execution starts in the interpreter, methods are first optimized by the C1 compiler and then by the top-tier compiler (C2 or a JVMCI compiler, i.e., Graal), while deoptimization transitions lead from compiled code back to the interpreter.]

Figure 3.2: HotSpot Tiered compilation.

We show a simplified scheme of tiered compilation in Figure 3.2. Execution starts for every method in the interpreter. Once the invocation counters of a method overflow, the interpreter triggers compilation of this method with the client compiler5, which is one of the two compilation tiers in HotSpot. C1 [109] is a simple dynamic compiler that targets fast warmup. It applies a few selective optimizations, but its main responsibility is to produce machine code quickly. C1-compiled code performs limited profiling collection of, e.g., invocation counters. If a C1-compiled method gets called very often, a final top-tier compilation will be scheduled by the VM. Depending on the configuration of the VM [142], the top-tier compiler is either the server compiler [29; 30; 144; 153]6 or the Graal compiler [55; 56; 57; 116; 117; 120; 156; 157; 181; 182; 185; 218]. From within every intermediate execution tier, code can be deoptimized back to the un-optimized version that is executed in the interpreter.

3.1.2 Graal Compiler

The Graal compiler project [55; 56; 57; 116; 117; 120; 156; 157; 181; 182; 185; 218] started as an endeavor to develop a novel dynamic compiler for the HotSpot JVM. It was originally designed as a replacement for the C2 compiler. However, in contrast to C2, Graal is itself implemented in Java. With the introduction of the Java Virtual Machine Compiler Interface [141] (JVMCI), it became possible to replace C2 with a dynamic compiler implemented in Java. As this thesis targets optimizations for the Graal compiler, for the rest of this thesis we will focus on the Graal compiler and GraalVM, the ecosystem developed by Oracle Labs around the compiler.

Choosing Java instead of C++ as the implementation language has many advantages, including automatic memory management, heavily optimized polymorphism, expressiveness of the language and tool support.

5Also called the C1 compiler. 6Also called the C2 compiler.

[Figure 3.3 shows the Graal compilation pipeline: bytecode is parsed into the high-level IR (HIR generation); the frontend then runs the high tier (constant folding, GVN, inlining, partial escape analysis, read elimination, conditional elimination, duplication, loop optimizations, ...), a lowering, the mid tier (memory optimizations, guard optimizations, lock optimizations, ...), another lowering, and the low tier (dead code elimination, scheduling, ...); after LIR generation, the backend performs LIR-level optimizations such as CFG optimizations, null check removal and PRE, and finally generates machine code. The frontend operates on the bytecode level and a platform-independent machine level, the backend on the platform-dependent machine level.]

Figure 3.3: Graal compiler schematic.

We present an overview of the Graal compiler in Figure 3.3. During compilation, Graal uses several layers of abstraction in order to allow optimizations to be expressed at different levels. The basic compilation process is divided into a frontend and a backend. Both have a dedicated intermediate representation: in the frontend Graal uses a graph-based intermediate representation (called Graal IR [55; 57]) that is based on the sea-of-nodes IR [29; 30], originally proposed for the C2 compiler. In the backend, Graal uses a CFG-based intermediate representation called LIR [201] (for low-level IR) that was originally proposed and deployed in the C1 compiler. During compilation, the bytecodes of a method are parsed and the high-level intermediate representation is generated. Then three frontend tiers, i.e., three different levels of conceptual abstraction, are exercised on the IR. Between each tier a so-called lowering is performed that de-sugars abstractions from the IR and prepares it for the next level of optimizations, converging towards a machine- and platform-dependent IR before LIR generation. Many optimizing compilers, including Graal, group optimizations into so-called phases. Phases analyze a program for optimizable patterns and perform the associated optimization transformation. Graal contains many such phases in all tiers. Below we summarize the most important ones per tier; a simplified sketch of the phase concept follows after the list:

1) High-Tier: The high-tier contains all high-level optimizations utilizing Java bytecode semantics, including inlining [156], global value numbering (GVN), constant folding [7], partial escape analysis [185], read elimination, conditional elimination, duplication [116; 117; 118; 120] and many more. Operations are typically represented as one IR node per bytecode. E.g., a field load is represented by one IR node modeling both the null check (as required by the JVM specification [124]) and the actual memory access.

2) Mid-Tier: Between the high- and mid-tier the IR is lowered to platform-independent machine- like operations. Null check semantics are expanded to guard nodes, Java heap accesses are modeled as memory accesses with address semantics, etc. In the mid-tier, Graal applies optimizations that are specific to Java memory semantics. Additionally, it applies lock optimizations as well as standard optimizations like GVN and constant folding along the way.

3) Low-Tier: The low-tier performs a limited number of optimizations and lowers platform-independent machine level semantics to platform-dependent machine level semantics. E.g., address operations are lowered to the associated architecture, e.g., amd64 or SPARC. The last phase schedules the IR graph and finally prepares it for LIR generation.
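As a hypothetical, heavily simplified sketch of the phase concept (this is not the actual Graal phase API; all types are our own stubs), consider a phase that scans the graph for additions of two constants and folds them:

interface Node {}
record ConstantNode(int value) implements Node {}
record AddNode(Node x, Node y) implements Node {}

interface Graph {
    Iterable<Node> nodes();                   // assumed to iterate a snapshot
    void replace(Node oldNode, Node newNode); // rewires all usages
}

interface Phase {
    void run(Graph graph);
}

// A toy constant-folding phase: analyze for a pattern, then transform.
final class ConstantFoldAddPhase implements Phase {
    @Override
    public void run(Graph graph) {
        for (Node n : graph.nodes()) {
            if (n instanceof AddNode add
                    && add.x() instanceof ConstantNode cx
                    && add.y() instanceof ConstantNode cy) {
                // Replace the add with a freshly computed constant.
                graph.replace(add, new ConstantNode(cx.value() + cy.value()));
            }
        }
    }
}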

We will focus on the frontend of Graal, because simulation-based code duplication is a high-level compiler optimization.

3.1.2.1 Graal IR

Simulation-based code duplication is performed in the high-tier of the Graal compiler and thus works on Graal's high-level IR [55; 58]. In this section we explain the IR as well as its properties in detail. We illustrate Graal IR with a simple code sample in Figure 3.4, which shows a source program foo along with its bytecode and the associated Graal IR. Graal IR is a sea-of-nodes-based [29; 30; 153] directed graph in SSA form [39]. Each IR node produces at most one value. Data flow is represented with upward edges and control flow with downward edges. The IR is a superposition of the control-flow and the data-flow graph. Control-flow nodes represent side-effecting instructions that are never re-ordered, whereas data-flow nodes are floating. Their final position in the generated code is purely determined by their data dependencies. Loop backedges and loop exits are modeled explicitly as instructions in the IR. In order to support deoptimization [47; 56; 92; 108; 216; 221] to the interpreter tier, Graal needs to model the interpreter state in the IR [58]. The interpreter state is defined [124] as the set of all local variables of a method, the set of all values on the expression stack, as well as the set of all locks held. Additionally, in order to support re-materialization of escape-analyzed objects [185] in the interpreter, the compiled code needs to maintain a mapping from virtualized objects to the IR nodes representing an object's fields [180; 185]. Graal models the interpreter state with so-called FrameState nodes as first-class nodes in the IR that have inputs representing locals, stack values, locks, and virtual object mappings for re-materialization. State nodes are introduced during bytecode parsing for all nodes with a visible side effect. In order to illustrate the different concepts of the IR we explain them with the example from Figure 3.4. In Figure 3.4, we first want to put emphasis on the bytecode. After compilation from source code to bytecode, the original high-level control flow constructs have been replaced with conditional jumps.

[Figure 3.4 shows the source code of foo, its Java bytecode annotated with the symbolic contents of the expression stack and the local variables at every bytecode index, and the corresponding Graal IR: control-flow nodes (Start, If, LoopBegin, LoopEnd, LoopExit, StoreField, Return), floating data-flow nodes (constants, Param a, Add, Mul, LoadField, and the ϕ for i), and FrameState nodes at bytecode indices 0, 2, 17 and 23 attached to side-effecting nodes via state edges.]

Figure 3.4: Graal IR example.

Each bytecode instruction has an index and a variable width7. In addition to the bytecode, we also show the stack and the local variables during execution of the bytecode with their symbolic values. This is important to understand how the frame state is represented in the Graal IR. In the IR we have the regular control-flow nodes, i.e., all nodes that are typically not re-ordered during optimization, as well as the data-flow (floating) nodes that represent side-effect-free operations in the program. For example, the addition and the multiplication at bytecode indices 12 & 13 are floating nodes, i.e., their final position in the generated code only depends on their inputs. The variable i from the original source program, in bytecode the local variable at index 1, is represented by a ϕ node in Graal as it is an SSA value. The Graal IR for foo has several frame states (state nodes)8 at different bytecode indices. The semantics of state nodes are defined as follows: the state node attached to a control-flow instruction is the state after the instruction, i.e., the state of the stack and locals in the interpreter at the associated bytecode index after executing the node that points to it. For example, the state after the start node is described by the state node at bytecode index 0. This means that if a deoptimization happens right after the start node, the correct interpreter state is represented by all inputs to the state node, i.e., only the local variable at index 09.

7This depends on the type of the instruction. 8We mark them with green borders in Figure 3.4. Nodes can have a state-after edge pointing to a frame state. Those edges are also marked in green. 9This is the parameter a.

The next bytecode instruction to be executed by the interpreter is iconst_0 at index 0. More interesting is the state node after the StoreField node in the loop: after storing the new value of S, deoptimization would force the interpreter to continue at bytecode index 17, which is the first instruction after the putstatic. The associated state node models the locals in the interpreter, i.e., the parameter a as well as the loop variable i which corresponds to the ϕ in the IR. Note that this is exactly the state of the interpreter at that bytecode index, i.e., there are no values on the stack and the locals are a and i. We want to point out that the modeling of interpreter state in the compiler has an influence on dead code elimination. This means that certain values deduced by optimizations to be effectively dead might still be needed by the interpreter, which executes un-optimized bytecode. However, given that frame states are not optimized, they naturally force all values needed by the interpreter to be kept alive.
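As a simplified illustration of the description above, a FrameState can be thought of as the following data structure (a sketch under our own naming assumptions, not Graal's actual class):

interface Node {}
interface VirtualObjectMapping {} // maps a virtualized object to its field values

final class FrameState {
    final int bci;       // bytecode index to resume at ("state after" semantics)
    final Node[] locals; // values of all local variable slots
    final Node[] stack;  // values on the expression stack
    final Node[] locks;  // monitors held at this point
    final VirtualObjectMapping[] virtualObjects; // for re-materialization

    FrameState(int bci, Node[] locals, Node[] stack, Node[] locks,
               VirtualObjectMapping[] virtualObjects) {
        this.bci = bci;
        this.locals = locals;
        this.stack = stack;
        this.locks = locks;
        this.virtualObjects = virtualObjects;
    }
}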

3.1.2.1.1 Scheduling

Graal IR is based on the sea-of-nodes [29; 30] approach originally proposed for HotSpot's C2 (server) compiler. Floating nodes in the IR represent side-effect-free instructions that can be executed (and re-executed) anywhere in the generated code as long as they respect the topological ordering of their inputs. This means that a floating node can be executed as soon as all its inputs are available. A valid position (called a schedule) for a floating node is anywhere in the generated program after its inputs have been computed. The process of assigning valid positions in the generated code to floating nodes is called scheduling [29]. There are many valid schedules, yet only a few of them are considered suitable performance-wise. Compilers try to schedule nodes as late as possible, i.e., directly before their usages or in the latest common dominator block of all usages. Floating nodes allow a compiler like Graal to easily apply very complex code motion, as operations scheduled as late as possible are only executed if they are actually needed, which reduces run time whenever they are not. However, for analyses and compiler optimizations it is often a complex burden to maintain correct, schedulable graphs, even more so when always modeling the interpreter state with state nodes. Many analysis algorithms become more complex without a schedule, but often also faster, as the process of scheduling a graph is very compile-time-intensive.
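A minimal sketch of the late-scheduling rule described above (hypothetical types; a real scheduler also has to respect loops and memory effects): a floating node is placed in the lowest block of the dominator tree that still dominates all of its usages.

final class Block {
    Block idom;   // immediate dominator
    int domDepth; // depth in the dominator tree (start block = 0)
}

final class Scheduler {
    // Latest legal position: the common dominator of all usage blocks.
    static Block latestLegalBlock(Iterable<Block> usageBlocks) {
        Block result = null;
        for (Block usage : usageBlocks) {
            result = (result == null) ? usage : commonDominator(result, usage);
        }
        return result; // null only if the node has no usages (it is dead)
    }

    // Walk up the dominator tree until both blocks meet.
    static Block commonDominator(Block a, Block b) {
        while (a != b) {
            if (a.domDepth >= b.domDepth) {
                a = a.idom;
            } else {
                b = b.idom;
            }
        }
        return a;
    }
}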

Graal implements special begin- and end-nodes in the IR that mark the beginning and the end of a basic block. The fixed nodes in the IR represent a CFG10 without a schedule for the floating nodes. After scheduling, every floating node is assigned a basic block as well as an index in the list of nodes of that block. Thus, there is a direct mapping from Graal IR after scheduling to a standard CFG.

10See Chapter 2.

3.1.3 Truffle

Truffle [209; 213; 218; 219] is a self-optimizing abstract syntax tree (AST) interpreter framework on top of GraalVM. Truffle language implementations apply partial evaluation [74] to AST programs, which are then compiled by the Graal compiler to reach near-native performance of the generated code. Truffle AST interpreters are themselves implemented in Java. Before compilation, their ASTs are combined into one compilation unit via partial evaluation, which effectively inlines the logic for each AST operation into the root AST node. This is performed on the Java bytecode level. Therefore, dynamic languages executed on top of Truffle produce Java compilation units themselves. However, Truffle compilation units are typically larger and more dynamic than classical Java workloads, requiring more elaborate compiler optimizations to reach near-native performance [120]. In our evaluation in Chapter 8 we present several experiments using benchmarks executed with GraalJS [147], Oracle Labs' JavaScript implementation on top of Truffle.

3.1.4 GraalVM

We conclude this chapter with Figure 3.5, which shows an overview of GraalVM. GraalVM is the virtual machine and ecosystem developed by Oracle Labs around the Graal compiler.

[Figure 3.5 shows the GraalVM ecosystem: JVM languages such as Java, Scala, Kotlin and Groovy run directly on the Graal compiler; Python, Ruby, R and JavaScript are implemented on top of Truffle; C and C++ are supported via Sulong (LLVM); deployment targets are the HotSpot JVM (Solaris, Windows, Linux and Mac on SPARC, x86-32bit, amd64 and arm) and the Substrate VM (Windows, Linux and Mac on amd64 and arm).]

Figure 3.5: Graal ecosystem.

To this day, there are several languages that are directly executed on GraalVM, i.e., those that compile to JVM bytecode. On top of Truffle [209; 213; 218; 219], there are language implementations for Python [146], Ruby [152], R [145], and JavaScript [147]. On top of Sulong [160; 162; 168; 169] there are language implementations for C/C++.

Normally, the Graal compiler and Truffle run on HotSpot. However, SubstrateVM [187; 188] is a different, ahead-of-time-compiled Java VM that also allows deployment of Graal and Truffle. Its advantages are instant start-up times and a low memory footprint.

Chapter 4

Simulation-Based Code Duplication

This chapter discusses the problems and implications of code duplication in a general way and then proposes a solution: simulation. We argue why simulation is favorable over other solutions and what a compiler must implement in order to support this paradigm.

Every optimization that increases code size [33; 36] for the sake of performance is potentially subject to alignment issues [18], cache issues, compile-time increases and many other harmful effects. Therefore, compilers are obliged to control the size of the produced code and keep it at a reasonable level. Yet, the definition of reasonable depends on the performance increase delivered by the optimization. If the negative effects are justified by a peak performance increase, a compiler may decide to accept them. However, this implies that compilers have the information to perform this kind of trade-off: they must be able to estimate the performance benefit of a single optimizing transformation. This can be done with heuristics or in a precise way.

This thesis proposes a novel approach for code duplication that allows a compiler to perform a fine-grained trade-off between code size and performance impacts. This is done in order to keep the negative impacts of duplication at a minimum. Why this is necessary is clear to domain experts, but for the general audience we summarize the problems in this chapter and motivate them with detailed examples.

4.1 Problem Statement

Excessive code duplication can lead to exponential code growth [36] and is considered harmful. However, duplication can also improve performance. In order for a compiler to decide how much code-size increase is justified by a single duplication transformation, it needs to be able to trade off the expected performance increase against the inevitable code-size increase. Subsequently, we discuss the remaining questions that define the problem space of code duplication.

Question 1 (Q1): How can code duplication improve performance?

Code duplication can increase the performance of the generated machine code by using the paradigm of optimization via specialization1. Every time a program contains a control flow merge2, the predecessors of the merge contain special semantics guarded via conditional logic. A merge has, by definition, more than one control-flow predecessor. In order for a program to split control flow, it needs to implement conditional logic. Conditional logic implies that there are paths in a program that carry more precise information than others. In practice this means that for every successor of a control flow split, a compiler can safely assume more specialized information for a given piece of code than for the other successors, because the branch is guarded by a condition. Code duplication utilizes this fact to perform optimizations. In order to optimize the code dominated by a control-flow merge, a compiler cannot use the guarded context information of a single predecessor branch; it can only use the union of the information of two or more predecessors.

Consider the example code in Figure 4.1, which shows a trivial program foo. On the right-hand side of the figure we show the knowledge the compiler has about the values of each variable in the program at any point when processing a basic block. In both branches the compiler not only has information about the variable y (because it is assigned first) but also about the variable x because of the conditional statement. In the true branch x has a value in the interval [1, MaxInt] as guarded by the condition, and in the false branch x must be smaller than or equal to 0, i.e., it falls in the interval [MinInt, 0]. The compiler can use the information for both variables in order to optimize the respective branch. However, after the control flow merge the knowledge about y is the union of the data for y across all predecessor branches3, because the compiler does not know which branch is dynamically executed. Thus, the information the compiler can assume about y after the merge is the union of both predecessor value ranges, i.e.,

ValueRange_TrueBranch ∪ ValueRange_FalseBranch ≡ [1, MaxInt] ∪ [0] = [0, MaxInt].

1This paradigm is used by many other optimizations that apply a form of code duplication including for example inlining. 2This also includes loop headers which merge the forward predecessor of a loop and a loop's backedges. 3In SSA form the union of both can only be as precise as the least precise ϕ input.

[Figure 4.1 shows the program foo (int foo(int x) { final int y; if (x > 0) { y = x; } else { y = 0; } return 2 + y; }) together with the compiler's knowledge of the values of x and y at every point: at the entry x ∈ [MinInt, MaxInt] and y is unassigned; in the true branch (y = x) both x and y are in [1, MaxInt]; in the false branch (y = 0) x ∈ [MinInt, 0] and y ∈ [0]; after the merge, at return 2 + y, x ∈ [MinInt, MaxInt] and y ∈ [1, MaxInt] ∪ [0] = [0, MaxInt].]

Figure 4.1: Control flow merge union of information.
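The union at the merge can be illustrated with a small sketch of an interval domain (our own illustration, not Graal's actual stamp implementation):

record Interval(int lo, int hi) {
    // Smallest interval containing both inputs: the join a compiler
    // applies to its value knowledge at a control flow merge.
    Interval join(Interval other) {
        return new Interval(Math.min(lo, other.lo), Math.max(hi, other.hi));
    }
}

// Example from Figure 4.1:
// new Interval(1, Integer.MAX_VALUE).join(new Interval(0, 0))
// yields [0, Integer.MAX_VALUE], the knowledge about y after the merge.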

int foo(int x) {
  final int y;
  if (x > 0) y = x;
  else y = 0;
  return 2 + y;
}

(a) Initial program.

int foo(int x) {
  if (x > 0) {
    return 2 + x;
  } else {
    return 2 + 0;
  }
}

(b) After duplication.

int foo(int x) {
  if (x > 0) {
    return 2 + x;
  } else {
    return 2;
  }
}

(c) After optimization.

Listing 4.1: Constant folding (CF) optimization opportunity.

Code duplication works by utilizing a compiler's knowledge about the values in predecessor branches and the code in the merge block. If instructions from the merge block were moved into one of its predecessor blocks, this might open new optimization potential based on the increased knowledge that is available there. Consider the sample program in Listing 4.1, where it is easy to see how code duplication can help to optimize the code. In the initial program in Listing 4.1a, the variable y has the value 0 in the false branch. The merge block contains an instruction that can be optimized if the compiler can assume y == 0: the instruction 2 + 0. Therefore, if the compiler lifts the return instruction into the predecessor blocks by duplicating it, both additions can be specialized to the value of y in the respective branch (Listing 4.1b), which allows the compiler to apply constant folding, yielding the code in Listing 4.1c.

This leads to the definition of the first observation of code duplication:

Observation 1 Control flow merges are optimization boundaries.

Observation 1 summarizes the root problem that is solved by code duplication. A compiler is always limited in its optimization potential if a program contains control flow merges, because merges always produce unions of the value information for all variables4. In SSA [39] form it is trivial to identify values for which a compiler has different knowledge in predecessor branches, because such variables always produce ϕ functions.

This leads to Observation 2 that summarizes the optimization potential for code duplication. All instructions dominated by a merge are potentially optimizable by code duplication and thus subject to code duplication and its transformations.

Observation 2 Every instruction dominated by a control flow merge is potentially optimizable if, after duplication, it is no longer dominated by the control flow merge but by one of the merge's direct predecessors.

Until now, this thesis discussed code duplication in the context of control flow merges after control flow splits. However, everything said so far also applies to loop headers. Loop headers are conceptually nothing more than control flow merges, i.e., they merge the non-backedge predecessor of a loop with the backedge predecessors. We illustrate this in Figure 4.2, which shows a simple loop program on the left. The body of the loop in block b2 contains a section of code Ω that represents the real computation. In the original program, the loop body block b2 is dominated by the control flow merge, i.e., the loop header b1. However, after unrolling one iteration of the loop5, the original, non-unrolled iteration of the loop in b2 also dominates the unrolled iteration in b1' and b2'.

From a duplication perspective, the optimization opportunities inside the duplicated blocks are important for a compiler. Opportunities are created because there is no basic block boundary between a merge predecessor and a merge after duplication. This is straightforward for classical tail duplication: a merge predecessor block is extended with the merge block (and its dominated blocks, depending on how much code is actually duplicated). The same applies, conceptually, to loop unrolling: unrolling a loop iteration has a similar effect on the dominator tree (from the perspective of optimization opportunities after duplication) as performing a normal code duplication.

4This also applies to the memory effects of a program and not only to the data-flow graph. Compilers like Graal model memory dependencies with explicit edges inside a graph [58], meaning that if there are different consumers for a memory location in different predecessors, Graal generates memory phi functions in control flow merge blocks. Memory phi nodes themselves can have usages that can be optimized if a more precise memory location can be assumed for a memory-effecting instruction. 5See the right side of Figure 4.2.

[Figure 4.2 shows a simple loop program p on the left: block b0 initializes i = 0, the loop header b1 merges the forward predecessor b0 and the backedge from the loop body b2 (which executes Ω and i++), and b3 returns i. On the right, the program after unrolling one iteration is shown together with both dominator trees: the original body b2 now dominates the unrolled header b1' and the unrolled body b2'.]

Figure 4.2: Loop header control flow merge: Unrolling as duplication.

A loop unrolling changes the dominator tree: it extends a merge predecessor (the basic block containing a backedge) with the merge (the loop header block) and its dominated blocks (the loop body again). Therefore, tail duplication and loop unrolling share a similar concept: enable optimizations by extending a merge predecessor basic block with code from the merge block (and potentially code dominated by the merge block). For tail duplication the extended blocks are a merge's direct predecessors, for loop unrolling they are the basic blocks containing the loop backedges.

This leads us to Observation 3 which expresses the optimization opportunities for instructions inside loops via unrolling.

Observation 3 Every instruction inside a loop (trivially dominated by the loop headera) is potentially optimizable if it is not dominated by the loop header but by the backedge predecessor block. In SSA form this trivially means that every instruction that has a loop-ϕ instruction as input is potentially optimizable if it is duplicated to a point where it is dominated by the predecessor, i.e., the body itself in a prior iteration. This can be achieved via loop unrolling.

aSee Chapter 2.
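To substitute for the graphical example in Figure 4.2, the following source-level sketch (our own illustration, not compiler output) shows the effect of unrolling one iteration: the body of iteration n then dominates the body of iteration n + 1, so facts established in one iteration can specialize the next one.

class LoopExample {
    int p(int a) {
        int i = 0;
        while (i < a) { // loop header b1: merge of entry b0 and the backedge
            /* Ω */     // loop body b2
            i++;
        }
        return i;       // b3
    }

    int pUnrolled(int a) {
        int i = 0;
        while (i < a) {
            /* Ω */ // iteration n: now dominates the unrolled iteration below
            i++;
            if (!(i < a)) {
                break;
            }
            /* Ω */ // unrolled iteration n + 1 (b2'), dominated by the code above
            i++;
        }
        return i;
    }
}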

For the rest of this chapter we will simply use the term code duplication. However, we always mean a broader set of duplication optimizations including classical (tail) duplication as well as loop unrolling.

Now that we discussed the reasons for code duplication (Q1), the next question sets the focus on the code-size impact of code duplication.

Question 2 (Q2): What is the code-size impact of a single code duplication transformation?

Every time a compiler duplicates code, it copies instructions and potentially changes the CFG. In general, every duplication transformation increases the code size by the amount of code that was duplicated. However, later transformations might reduce code size to a point where it is less than before the duplication. In order to reason about these transitive effects, a compiler requires complex analysis to determine the real code-size impact. Consider the code in Listing 4.2a, which shows a function k that assigns a value to the variable phi depending on the value of the parameter a. After the first control-flow merge (line 8 in Listing 4.2a), the program tests the value of phi and either returns 0 or 1. This example perfectly illustrates the code-size effects of duplication. After duplicating the entire control-flow diamond from Listing 4.2a (lines 8-12), the compiler produces the code in Listing 4.2b, which is significantly larger than the original program. After removing the trivial conditions via optimization in both branches, the compiler produces the code in Listing 4.2c, which is significantly smaller than the original program from Listing 4.2a6.

 1 int k(int a) {
 2   int phi;
 3   if (a > 0) {
 4     phi = 0;
 5   } else {
 6     phi = 1;
 7   }
 8   if (phi == 0) {
 9     return 0;
10   } else {
11     return 1;
12   }
13 }

(a) Initial program.

 1 int k(int a) {
 2   int phi;
 3   if (a > 0) {
 4     phi = 0;
 5     if (phi == 0) {
 6       return 0;
 7     } else {
 8       return 1;
 9     }
10   } else {
11     phi = 1;
12     if (phi == 0) {
13       return 0;
14     } else {
15       return 1;
16     }
17   }
18 }

(b) After duplication.

 1 int k(int a) {
 2   if (a > 0) {
 3     return 0;
 4   } else {
 5     return 1;
 6   }
 7 }

(c) After optimization.

Listing 4.2: Duplication code-size increase example7.

6Listing 4.2 is a trivial example that perfectly illustrates the code-size effect of a duplication transformation. For illustration purposes the example is presented in source code. However, we cannot easily reason about the real code size of the generated machine code for this example, as it depends on compiler optimizations, run-time support, etc. Note that, depending on the platform and the architecture, the machine code generated for Listing 4.2c can be larger than the machine code generated for Listing 4.2a: code generation for HotSpot [93], for example, needs to emit safepoint checks on every return.

This leads to Observation 4.

Observation 4 The program produced by a single code duplication is always equal in size to or larger than the original program. However, subsequent optimizations can reduce the program size again, to a size that is equal to or smaller than that of the original program.

Observation 4 motivates why reasoning about the code-size impact of a duplication transformation is non-trivial, requiring complex analyses by the compiler.

Therefore, in order to avoid a false-positive classification of the code-size impact of a single duplication transformation, a compiler needs to semantically understand the effect of subsequent optimizations on the code-size dimension of a duplicated program. We formulated Problem Implication 1 which states that a compiler will apply false reasoning if it does not understand the semantic impact of a duplication transformation on the optimization opportunities of a CFG.

Problem Implication 1

If a compiler reasons about the direct code-size impact without subsequent optimizations it always deduces that duplication transformations result in a code-size increase. This can result in a false-positive classification of duplication transformations that produce smaller code than the original program.

As stated before, code-size increase can always be a problem for a compiler, or is at least not a benefit. Other approaches have identified the problems of excessive code duplication and exponential code growth [36] as well. In order to be complete in the problem definition used throughout the rest of this thesis, we now explicitly define the problems of code-size increase in our context.

Question 3 (Q3): What are the negative implications of code-size increase in a dynamic compiler?

In order to answer this question, we use the original example from Chapter 1 shown in Listing 4.3. We slightly modified the original code by replacing concrete instructions with symbolic placeholders for a more generic reasoning about the sample.

7While it can be argued that this is not code programmers typically write, such patterns often occur as a result of inlining [156] or loop unrolling [117].

1 final int phi;
2 if (x > 0) {
3   phi = x;
4 } else {
5   phi = 0;
6 }
7 ... δ ...
8 ... Θ ...

(a) Initial program.

 1 final int phi;
 2 if (x > 0) {
 3   phi = x;
 4   ... δ ...
 5   ... Θ ...
 6 } else {
 7   phi = 0;
 8   ... δ ...
 9   ... Θ ...
10 }

(b) After duplication.

1 if (x > 0) {
2   ... δ ...
3   ... Θ′ ...
4 } else {
5   ... δ ...
6   ... Θ′ ...
7 }

(c) After duplication and optimization.

Listing 4.3: Sample program.

Before we discuss the implications of code-size increase, we look at Listing 4.3 again, this time from a compiler's perspective on the effect chain of the program. We assume that the compiler deduced that it can optimize the code in line 8, Θ, to a faster instruction Θ′ by duplicating the code from the merge block (lines 7-8) into both predecessor branches. This produces the code in Listing 4.3b, which can be further optimized8 to the code in Listing 4.3c. However, between the start of the merge block and Θ there is a section of code dominating Θ called δ. In order to duplicate Θ into the predecessors and to optimize it, the compiler also needs to duplicate δ, as it dominates Θ. If both δ and Θ are side-effecting instructions, the order of their observable effects must not be modified, as specified by the Java Memory Model (JMM) [128]. The compiler must not change the side-effect chain in the merge block via duplication; it has to preserve the semantics of the program. Therefore, the compiler is forced to duplicate δ in order to duplicate Θ. Note that this directly relates to how a compiler represents side effects in its IR and how it moves them during scheduling. If there is no side effect in δ, the compiler can schedule Θ before δ to limit the code-size increase during duplication. For sea-of-nodes-based IRs this is slightly easier, as they can decide it purely based on the effect dependency (edge)9 in their IR. This edge type directly chains together all side-effecting instructions. Deciding what to duplicate for non-side-effecting instructions is more complex, as they might require the duplication of side-effecting instructions in order to be optimizable when they are pulled into the set of duplicated instructions: a floating node10 is only duplicated if it either has a dependency on a fixed node or a transitive usage of a fixed node in the duplicated region of code.

A compiler often needs to duplicate code (δ) that cannot be optimized after duplication in order to optimize a section of code (Θ) which is optimizable after duplication. This is the root cause of the code-size increase problems of code duplication. In general, we refer to Θ as the target instruction for optimization after duplication. Depending on the code size of δ, the compiler might significantly increase the code size of the snippet after duplication just for the sake of optimizing Θ.

8For example, copy propagation can remove the ϕ instruction. 9Note that for Graal IR these are the control-flow edges, also called fixed nodes, which are marked in red in the figures of this thesis. 10See Chapter 3 about floating instructions in Graal IR.
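The δ/Θ constraint described above can be sketched as follows (hypothetical IR types): to duplicate Θ, all fixed (side-effecting) instructions between the merge and Θ — the δ prefix — end up in the duplication set, because their order must be preserved.

import java.util.ArrayList;
import java.util.List;

interface FixedNode {
    FixedNode next(); // successor in the chain of fixed instructions
}

final class DuplicationSet {
    // Collect every fixed instruction from the start of the merge block up to
    // and including theta; these are the instructions that must be duplicated.
    static List<FixedNode> instructionsToDuplicate(FixedNode mergeBlockFirst,
                                                   FixedNode theta) {
        List<FixedNode> result = new ArrayList<>();
        for (FixedNode cur = mergeBlockFirst; cur != null; cur = cur.next()) {
            result.add(cur); // the delta prefix dominating theta
            if (cur == theta) {
                return result;
            }
        }
        throw new IllegalArgumentException("theta not in merge block");
    }
}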

After discussing the reason for the code-size increase of a duplication transformation, we take a look at the implications of code-size increase. The implications of increased code size are:

• Increased compilation time: If a compiler duplicates a lot of code that is never relevant for performance it effectively spends time optimizing and compiling code that is never needed. This is particularly a problem for dynamic compilers that perform compilation at run time. Optimizing this code consumes compilation time, delaying the compilation of subsequent code and causing increased warmup time. We call this compilation-time increase direct compile-time increase as it is a direct product of the duplication transformation itself.

• Increased compilation time of subsequent compiler phases: Many optimization phases are dependent on the size of a program. If duplication increases the number of instructions, this indirectly increases the overall compilation time of all phases. This can increase application latency, as the time until peak performance [9] is reached, the so-called warmup time, is increased. For applications where performance relies on low compilation latency this can mean that overall application performance is reduced due to increased compilation time11. We call this compilation-time increase indirect compile-time increase as it is a product of the subsequent optimizations of the duplication transformation.

• Increased code cache memory usage: The JVM manages compiled code in the code cache12. If the code cache is full, this can lead to reduced compilation frequency, as compiled code must be "garbage collected", or the VM may abort execution13.

• Duplicating control flow can increase the complexity of a control flow graph in a way that subsequent compiler optimizations are inhibited: This is often a problem in real world compilers and indirectly relates to the phase-ordering problem [194]. If duplication changes the CFG, it can prohibit subsequent optimizations, but probably also enable others that are not modeled in advance. If duplication produces code that prohibits further optimization due to missing heuristics, this has to be taken into account.

11We want to mention function-as-a-service (FaaS)-based systems as applications from this category. A FaaS-based system often relies on low compilation time in order to reach good performance. This is the case as a FaaS runtime system often executes code, i.e., functions, that has not yet been seen by the dynamic compiler. In a dynamic compilation system this means that the unseen code has to become hot first before it is compiled. Thus, overall system performance, defined as latency until the result, is affected by compilation time. 12See Chapter 3. 13This depends on the configuration of the JVM. HotSpot's [93] default configuration [31] in Java 8 tries to sweep, i.e., remove compiled methods once the code cache is full and in the worst-case scenario disables compilation completely.

• Maximum compilation unit size: Certain run-time systems have restrictions on the maximum compilation unit size. Reasons for these restrictions are direct and indirect immediate jump patching, code relocation, etc. For example, compilers for HotSpot have a configurable maximum code size14 that must not be exceeded, otherwise the virtual machine (VM) will not install the code for a compilation unit.

• Increased image size: If Graal is deployed with SubstrateVM15, excessive duplication increases the size of the generated binary, which can prohibit embedded deployment on resource-constrained devices.

Given the above list of negative implications of code-size increase, a compiler must decide very precisely what to duplicate and why. It is obliged to know the implications of increasing code size. Therefore, we define Observation 5, which makes a statement about the minimal code-size increase of duplication, which is generally not decidable.

Observation 5 Code duplication always leads to code-size increases. However, determining the minimal code-size increase that is necessary in order to deliver a specific performance increase is infeasible in practice. It requires knowing the complete rest of the compilation pipeline. This includes knowing where the code is scheduled, how the branch predictor behaves, all the information about all callers, all parameter values (function inputs), and many other factors. In practice, a compiler would need a closed-world assumption to answer the question of the minimal possible code size given a required performance outcome, which is not given in a dynamic system like Java [124] that supports dynamic loading of classes at run timea.

a Defining the optimal solution to an optimization problem in the presence of multiple optimization phases is proven to be undecidable for real hardware [175; 194]. Thus, we refrain from trying to achieve an optimal solution and focus on realistic paradigms applicable in a dynamic compiler.

Given that we cannot define the minimal code size that is necessary to reach a certain performance, we use the code-size metrics defined by a given compiler in its default configuration, i.e., we measure code size without duplication and with duplication in order to reason about a duplication-based optimization's impact offline. Code duplication must not increase this number excessively, but in a predictable and parameterizable fashion. Observation 5 leads to the definition of Problem Implication 2.

14Note that this is configurable with the option -XX:JVMCINMethodSizeLimit= and is currently set to 80000 words. 15See Chapter 3.

Problem Implication 2

A compiler performing code-duplication-based optimizations needs to be aware of the code-size increase it causes in order to monitor unwanted side effects introduced via excessive code-size increase. If a compiler does not control the amount of code it creates, it will inevitably run into code explosion issues, at the latest when compiling a pathological source program.

In addition to the questions derived above, we define some smaller implementation-level questions for code duplication that depend on the IR of a given compiler but still have an influence on the code-size increase of the optimization.

• What is the granularity of code duplication? In general, a compiler can decide to duplicate only entire basic blocks or single instructions, where duplicating entire basic blocks may be more complex than duplicating one instruction at a time. A follow-up question is whether a compiler can only duplicate a single merge block or also other post-dominating code.

• Relating to the phase ordering problem [194] it is an interesting engineering problem to decide when in the compilation pipeline code duplication is most useful. This again heavily depends on the IR of a compiler as well as on its general phase plan design.

The above questions and observations mark the important problems as well as their implications for code duplication.

4.1.1 Code Duplication Triangle

To summarize the problem space and success metric dimensions of duplication-based optimizations, we propose the simplified model of the code duplication triangle (as seen in Figure 4.3), summarizing the three major success metrics of a dynamic compiler when it comes to code duplication and the possible solutions to the problem. We borrow the idea of a problem-space triangle, a notion originally proposed for the CAP theorem16 [14], which states that any shared-data system can only have two of three desired properties. The problem space of duplication is similar. Based on the "pick two" notion, it is straightforward to derive a code duplication optimization that optimizes for two success metrics. However, optimizing for all three means that a trade-off is necessary that explicitly sacrifices one dimension for another.

16Consistency, availability, partition tolerance: a theorem used for shared-data systems.

[Figure 4.3 shows the code duplication triangle: the corners are the three success metrics peak performance, code size and compile time; the edges mark the failure modes of optimizing for only two of them: large code, high compilation time, or no speedup.]

Figure 4.3: Code duplication success metric triangle.

Optimizing for two success metrics can be achieved by17:

• Optimizing for peak performance and reducing compile time to the absolute minimum by only performing duplications having a positive impact on peak performance. This reduces compile time to the application of heuristics for finding optimizable candidates. However, code size might grow.

• Optimizing for peak performance while keeping the code-size increase at a minimum by performing compile-time-intensive analysis to find those candidates that potentially improve performance with just a small code-size increase.

• Optimizing for code size18 and compilation time: this can be achieved by not performing any duplications at all. However, this will not result in any performance improvements.

Given Figure 4.3, there is no existing approach that unifies those three success metrics, i.e., a single duplication approach that tries to generate the maximum performance increase at the minimal possible code-size and compile-time increase.

17Note that duplication-based optimizations require some kind of decision function shouldDuplicate(merge) that decides if a piece of code should be duplicated. Even the simplest implementation of shouldDuplicate consumes compilation time. Therefore, when we talk about "optimizing for compile time" we actually mean keeping the compile-time increase to the absolute minimum, i.e., performing only those transformations justified by a sufficient peak performance increase. Reasoning about the minimal code size increase necessary to deliver a specific performance increase is generally not decidable, therefore the lower bound of shouldDuplicate is unknown. 18Note that duplication can enable dead code elimination [7] and can thus be used to actually decrease code size of the generated machine code. See Chapters 5 and 8 for details.

4.2 Solution

In order to satisfy the remaining challenges and to solve all presented problems defined in Section 4.1, a duplication approach is required that:

• Produces a solution comparable in performance to the optimal19 one, i.e., a solution which performs all duplications that significantly improve performance (relates to Q1, O1-3).

• Spends as little time as possible analyzing the program for optimizable patterns (relates to Q2, O4, Q3, I1).

• Performs as few code transformations as possible in order to keep the compilation-time increase and control flow complexity at a minimum (relates to Q3, O5, I2).

• Only performs duplication transformations for which it knows that a performance increase in a success metric is possible (relates to Q1, O1-3, I2).

• Allows a configurable trade-off between code-size and performance increase (relates to I2).

In order to solve all listed problems, we propose simulation-based code duplication, an optimization scheme that allows a compiler to perform only those duplication transformations for which it knows that they lead to a sufficient performance increase without a code-size or compile-time explosion. The approach is based on a three-tier algorithm. Each tier performs analysis and generates a result that is then consumed by the subsequent tier.

1) The simulation tier discovers optimization opportunities after code duplication.

2) The trade-off tier fits those opportunities into an optimization cost-model that tries to maximize peak performance while minimizing code size and compilation time. The outcome is a set of duplication transformations that should be performed as they lead to sufficient peak performance improvements.

3) The optimization tier then performs those duplications together with the subsequent opti- mizations whose potential was detected by the simulation tier.

Figure 4.4 illustrates the overall idea of the approach. Since there is no compile-time-efficient algorithm to analyze the impact of a duplication transformation, it is too costly to analyze the impact of every duplication transformation on the concrete program. Therefore, we propose to reason about duplication impacts on a simulated program.

19Note that deriving the optimal performance for a program on a given hardware is not decidable. However, ideally compiler optimizations produce code that converges towards a (local) optimum.

[Figure 4.4 illustrates the two layers of the approach: on the abstract layer, the compiler builds a simulated program, optimizes it, and captures the optimization potential in n simulation results; the trade-off step sorts and filters these results into the set of beneficial opportunities; on the concrete layer, the program is then optimized based on these results, yielding the optimized program.]

Figure 4.4: Simulation-based code duplication approach.

The compiler can use simulation to reason about the impact of a duplication transformation and to generate a program on an abstract layer that, with respect to the optimization potential after duplication, behaves like the original program. Analyzing the simulated program on the abstract layer is faster than optimizing the concrete program20. Once the compiler has found all optimization opportunities in the simulation step, it can trade them off against each other in a separate step before it uses this information to optimize the concrete program in the final tier of the approach.
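The three tiers can be summarized in the following high-level sketch (all types and method names are hypothetical; the concrete simulation algorithms are the subject of Chapters 6 and 7):

import java.util.ArrayList;
import java.util.List;

abstract class SimulationBasedDuplication {
    void optimize(Graph graph, CostModel costModel) {
        // Tier 1: simulation -- find optimization opportunities per merge.
        List<SimulationResult> results = new ArrayList<>();
        for (Merge merge : graph.merges()) {
            results.add(simulate(merge)); // abstract layer, no IR mutation
        }
        // Tier 2: trade-off -- keep only opportunities whose expected benefit
        // justifies their code-size and compile-time cost.
        List<SimulationResult> beneficial = costModel.sortAndFilter(results);
        // Tier 3: optimization -- duplicate and apply the enabled optimizations.
        for (SimulationResult r : beneficial) {
            r.merge().duplicateIntoPredecessors();
            r.applyEnabledOptimizations();
        }
    }

    abstract SimulationResult simulate(Merge merge);
}

interface Graph { Iterable<Merge> merges(); }
interface Merge { void duplicateIntoPredecessors(); }
interface SimulationResult { Merge merge(); void applyEnabledOptimizations(); }
interface CostModel { List<SimulationResult> sortAndFilter(List<SimulationResult> results); }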

For this simulation-based approach to be successful, it has to generate faster code than heuristic solutions, while ideally producing less code and not being much slower in compilation time. Luckily, code size and compilation time can be optimized together by only duplicating those parts of the program that can really be optimized later. In order to illustrate the compile-time complexity of a duplication-based optimization, we define the parts of the optimization that consume time:

• Optimization Applicability Check: This check determines whether a certain piece of code is optimizable after duplication. We denote this step as AC(m) for a single merge m and AC for all merges, therefore AC = Σ_{∀ merges m} AC(m).

20This depends on the concrete duplication optimization. We propose two concrete implementations of the simulation-based duplication approach, for tail duplication in Chapter 6 and for loop unrolling in Chapter 7. For each, we present algorithms to implement the simulation step efficiently in the respective chapter.

• Duplication: This part of the transformation actually duplicates the instructions into the preceding branches, recomputes SSA form and fixes the CFG of the program in the IR. We denote this step as D(m) for a single merge m and D for all merges, therefore D = Σ_{∀ merges m} D(m).

• Optimization: This part comes after the duplication step for each piece of code and performs all optimizations enabled by the duplication. This step must be done after every duplication, since subsequent optimizations can remove the need for later duplications. For example, consider the code from Listing 4.2 where a subsequent optimization changed the CFG. Such patterns are quite frequent. We denote this step as Opt(m) for a single merge m and Opt for all merges, therefore Opt = Σ_{∀ merges m} Opt(m).

This leads to the definition of the total compilation time for duplication of a given program with m control flow merges:

$CompileTime_{Duplication} = m \cdot AC + m_{optimizable} \cdot (D + Opt)$

The compile-time-intensive parts are $D$ and $Opt$21. In order to reduce the overall compile time of the optimization, $m_{optimizable}$ must be as small as possible, i.e., it should only contain duplication candidates that are definitely optimizable after duplication. This means the optimization applicability checks $AC$ must be as restrictive as possible, only allowing a small number of duplications. Assuming that the compiler would only allow a single duplication, the overall compilation time for this duplication would be:

$m_{optimizable} = 1 \Rightarrow CompileTime_{Duplication} = m \cdot AC + D + Opt$

However, the question remains what the compile-time cost of $AC$ is and how small it can be, given that the minimal compile time and code size for the optimization cannot be decided. If $AC$ were a constant-time check, its overall complexity would be negligible. Yet, to the best of our knowledge, there is no constant-time way of answering the question what the impact of a duplication on the optimization potential of a compilation unit is. This in turn means that reducing the number of identified optimizable merges, denoted by $m_{optimizable}$, only helps in reducing the overall compilation time if $AC$ is sufficiently fast. If $AC$ has the same or higher complexity than $D + Opt$, reducing $m_{optimizable}$ is irrelevant for the overall compilation time, since the complexity of $AC$ dominates the overall complexity of $CompileTime_{Duplication}$.
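To make this argument concrete, consider purely illustrative (not measured) per-merge costs:

\[ m = 10000, \quad AC = 1, \quad D + Opt = 100, \quad m_{optimizable} = 100 \]
\[ CompileTime_{Duplication} = 10000 \cdot 1 + 100 \cdot 100 = 20000 \]

compared to $10000 \cdot 100 = 1000000$ if the compiler duplicated and optimized at every merge. If, however, $AC$ itself cost $100$ per merge, the first term alone would already be $1000000$, and reducing $m_{optimizable}$ would gain nothing.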

21 For example, the duplication algorithm for the SSA-based sea-of-nodes IR in Graal has a worst-case asymptotic complexity of $O(n^2)$ over the number of duplicated nodes.

We classify the quality of $AC$ functions according to whether they are complete, i.e., all optimization capabilities after duplication can be found, and whether they are precise, i.e., all optimizations determined by $AC$ are actually possible after duplication. Heuristics often produce false positives, causing duplications that turn out not to be optimizable after duplication, due to the incomplete model of the real program in the heuristic.

In reality, if $AC$ is implemented heuristically, the size of $m_{optimizable}$ compared to a complete and precise implementation varies heavily. The reasons for that are simple (we denote a heuristic implementation of $AC$ as $AC_h$):

• If $AC_h$ is implemented in a very simplistic way, $m_{optimizable}$ is very small, as the heuristic does not find many optimization candidates. This in turn means that $CompileTime_{Duplication}$ is low. However, peak performance increases might also be low or nonexistent.

• If $AC_h$ is implemented in a more sophisticated way, capturing more possible opportunities, $m_{optimizable}$ can be small or large, depending on the heuristic. However, this implies that the time for $AC_h$ itself is increased, thus consuming more compilation time.

• If $AC_h$ is not complete, $m_{optimizable}$ will always be smaller than with a complete approach, because not all opportunities are found. This means that a heuristic approach will not require the same $CompileTime_{Duplication}$ as a complete approach (assuming that the relation $O(AC_h) < O(D + Opt)$ holds).

If $m_{optimizable}$ is large, the run time of the entire optimization is high. In our proposed simulation scheme, we try to reduce $m_{optimizable}$ as much as possible, with a near-constant-time implementation of $AC$ (making the complexity of $m \cdot AC$ dependent on $m$). We devised a dominance-based algorithm (both for tail duplication and loop unrolling) that allows us to reduce the asymptotic complexity of $AC$ to the number of instructions $i(m_{Block})$ in the merge block22.

In reality, the number of merges for which a compiler should23 optimize code after duplication ($m_{optimizable}$) is orders of magnitude smaller than the overall number of merges ($m$). This means that performing as few duplications as possible ($D(m)$) is always a compile-time improvement, considering that $AC$ is much faster than a duplication with a subsequent optimization: $O(AC) \ll O(D + Opt)$.

22 In reality, our approach estimates performance increases during simulation by analyzing the instructions in the rest of the compilation unit dominated by the merge. This is done as long as sufficient optimization potential is found, given the accumulating code-size increase while iterating the dominated code. 23 We do not say "can" here, as many duplications potentially improve performance but cause harmful code-size increases and should thus not be performed.

What if all merges $m$ can be optimized after duplication? In that case, the compiler cannot save the time $D(m)$ for performing the associated transformations. If every merge is optimizable after duplication, the compile time is $m \cdot (AC + D + Opt)$. However, just because an optimization is possible after duplication does not mean that it justifies the code-size increase induced by the duplication24. Therefore, in addition to implementing $AC$ completely and precisely, a compiler must also be able to reason about the trade-off between performance increase and code-size increase. For pathological cases, our approach allows the compiler to trade off the code-size impact against the performance impact in order to decide if the time $D(m) + Opt(m)$ is justified. The compiler can sort all optimizable merges $m_{optimizable}$ by the estimated performance increase and the lowest code size, and only allow a maximum factor of code-size increase per compilation unit. This effectively limits the maximum code-size increase for all compilation units, including pathological cases. A minimal sketch of such a budget-based selection is shown below.
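All types and names in the following sketch (Candidate, capGrowth, the growth factor) are hypothetical illustrations, not part of the Graal implementation:

import java.util.ArrayList;
import java.util.List;

// Hypothetical duplication candidate: estimated benefit (e.g., saved cycles)
// and cost (estimated code-size increase), as produced by the simulation tier.
record Candidate(double benefit, int cost) { }

final class CodeSizeBudget {
    /**
     * Walks candidates that are already sorted by decreasing benefit and keeps
     * only those that fit into a per-compilation-unit growth budget, e.g.,
     * maxGrowthFactor = 1.5 allows at most 50% estimated code-size increase.
     */
    static List<Candidate> capGrowth(List<Candidate> sortedByBenefit,
                                     int originalSize, double maxGrowthFactor) {
        int budget = (int) (originalSize * (maxGrowthFactor - 1.0));
        List<Candidate> accepted = new ArrayList<>();
        int spent = 0;
        for (Candidate c : sortedByBenefit) {
            if (spent + c.cost() <= budget) { // skip candidates that do not fit
                spent += c.cost();
                accepted.add(c);
            }
        }
        return accepted;
    }
}

Such a cap bounds the code-size increase of every compilation unit, including pathological ones in which nearly all merges are optimizable.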

4.2.1 Finding Optimization Opportunities after Duplication

The two root tasks of every code duplication optimization are finding optimizations after duplication25 and performing the real duplication step, i.e., duplicating all instructions, rewiring data flow and recomputing SSA form26. Both tasks are costly in time, space and code complexity. Therefore, we want to optimize those steps as far as possible. However, the fundamental duplication step cannot be optimized away, simply because at some point in time the compiler really needs to duplicate the code. The step that finds optimization opportunities, on the other hand, is subject to research to this day. For the rest of this thesis we will refer to the task of finding optimization opportunities after duplication as $AC(m, p_i)$ for a given control flow merge $m$ and one single predecessor $p_i$. This means that $AC(m) = \sum_{\forall \text{predecessors } p_i \text{ of } m} AC(m, p_i)$ and therefore $AC = \sum_{\forall \text{merges } m} AC(m) = \sum_{\forall \text{merges } m} \sum_{\forall \text{predecessors } p_i \text{ of } m} AC(m, p_i)$.

The implementation of $AC$ is discussed in the literature, and depending on the target of the duplication optimization, multiple possible implementations of $AC$ exist. Subsequently, we discuss the most common ones, namely heuristics and backtracking-based algorithms, which both suffer from shortcomings. We then propose simulation for implementing $AC$ and finally argue why we believe that it solves the task better than the other approaches.

24 See Problem Implication 2 in Section 4.1 for details. 25 Denoted as applicability check AC from before. 26 Denoted as D above.

4.2.1.1 Heuristics

Implementing the $AC$ step for a single predecessor of a merge, $AC(m, p_i)$, with heuristics is non-trivial and requires domain knowledge about the IR of a compiler and the suitable duplication patterns that occur frequently in a given source language. Algorithm 1 depicts a duplication optimization algorithm using a very simple heuristic for $AC$ (marked in color). It iterates over all merge blocks of a CFG. For each merge, it iterates over all instructions $i$ and checks if any of them has a ϕ-instruction as its input. If so, this opens the possibility for optimization after duplication, because $i$ then has the input of the ϕ-instruction at $p_i$ as input instead of the ϕ. Algorithm 1 is simple and fast. The asymptotic complexity for the entire duplication optimization using this heuristic is $m \cdot AC + m_{optimizable} \cdot (D + Opt)$ where $AC = 1$, yielding $m + m_{optimizable} \cdot (D + Opt)$. However, it also duplicates code that is not optimizable after duplication. Furthermore, it only considers the data flow of the program. In Chapters 6 and 7 we will look at all the specific optimizations enabled by a duplication, including control-flow-based and data-flow-based optimizations. We can conclude that Algorithm 1 is neither precise nor complete. Yet, it might be good enough for compilers where code size and compile time are not considered important.

Algorithm 1: Simple heuristic-based duplication.
Data: ControlFlowGraph cfg
Result: Optimized ControlFlowGraph cfg

for BasicBlock merge in cfg.mergeBlocks() do
    for BasicBlock pred in merge.predecessors() do
        for Instruction i in merge.instructions() do
            for PhiInstruction phi in merge.phis() do
                if i.inputs() contains phi then
                    duplicate(merge, pred);
                end
            end
        end
    end
end

Based on the simple heuristic from Algorithm 1, we derived Algorithm 2, which contains a more complex heuristic-based implementation of $AC$ (marked in color). It iterates over all merge blocks of a program and all predecessor blocks of each merge. For each instruction in the merge block that has a ϕ-instruction of the merge as input, it copies the instruction and replaces the ϕ-input of the copy with the value of the ϕ at the respective branch. It then runs a set of common optimizations on the copied instruction to see if the instruction can be optimized given its new input27. Algorithm 2 is precise, i.e., all optimizations found for the copy of instruction $i$ are definitely applicable after duplication. However, it is still not complete, as it does not cover all optimizations: optimizations that require knowledge about the dominance relation of the program after duplication cannot be covered, since the heuristic has no notion of the impact of the duplication on the dominator tree of the program.

27 For the simplicity of this example we assume all optimizations can be applied at instruction-level granularity.

Algorithm 2: Precise heuristic-based duplication.
Data: ControlFlowGraph cfg
Result: Optimized ControlFlowGraph cfg

bool progressMade ← true;
outer: while progressMade do
    progressMade ← false;
    for BasicBlock merge in cfg.mergeBlocks() do
        for BasicBlock pred in merge.predecessors() do
            for Instruction i in merge.instructions() do
                for PhiInstruction phi in merge.phis() do
                    if i.inputs() contains phi then
                        Instruction copy ← i.copy();
                        copy.replaceInput(phi, phi.valueAtPredecessor(pred));
                        for opt o in {CF, PEA, ReadElim, StrengthRedux} do
                            if o.apply(copy) then
                                duplicate(merge, pred);
                                progressMade ← true;
                                continue outer;
                            end
                        end
                    end
                end
            end
        end
    end
end

However, Algorithm 2 still has an acceptable compile time if we look at the complexity of $AC$. The overall compile time is $m \cdot AC + m_{optimizable} \cdot (D + Opt)$ where $O(AC) = O(Opt)$, i.e., the asymptotic complexity of $AC$ depends on the complexity of the optimizations it applies to the copied instruction.

While Algorithm 2 is already a fairly complex and well-performing heuristic, it still suffers from two major drawbacks:

• It does not reason about the semantics of the program in the presence of the optimizations applied after duplication (Q2-3, O3-4, I1-2). If an optimization after duplication changes the CFG of the program, Algorithm 2 cannot reason about the transitive impact of those transformations. Consider the example from Listing 4.2: after duplication, the compiler can completely optimize away the condition in both branches by subsequent optimizations. These transitive effects cannot be modeled with the heuristic from Algorithm 2.

• It does not reason about the code-size increase (or subsequent decrease, as Listing 4.2 illustrated) after duplication. Thus, it does not allow a trade-off between code-size increase and performance increase, permitting duplication transformations that explode code size (Q3, O4, I2).

4.2.1.2 Backtracking

One possible way of determining what to duplicate is to tentatively perform a duplication at a merge and then backtrack if no improvement was possible. Consider the illustration of a backtracking-based approach in Algorithm 3. To determine if progress was made, the compiler first copies the control-flow graph (CFG) as a backup and performs a duplication on the original CFG. Then it applies a set of selected optimizations on it. If any of those optimizations is able to optimize the IR, it restarts with the changed CFG. If no improvement was made, the compiler backtracks to the copied (original) CFG and advances to another merge.

Backtracking is as complete and precise as a duplication-based optimization can be. After duplication, it performs all optimizations potentially enabled by a duplication. Additionally, after optimization, the approach has full knowledge of the code-size impact and the performance impact. However, its compile-time impact is enormous. Given that a backtracking-based solution28 does not implement a real $AC(m, p_i)$ but a holistic transformation, the compilation time is $O(Duplication) = m \cdot (D + Opt) + m_{notOptimizable} \cdot BT$, where $BT$ marks the time needed for backtracking.

Backtracking has three disadvantages:

• Copying the CFG for every predecessor-merge pair is compile-time intensive and thus typically not suited for a dynamic compiler29.

• The compile-time impact of a duplication is not known in advance.

28 We marked the part of the algorithm that decides if a duplication was beneficial in color. 29 We did experiments in our implementation in Graal: the copy operation increased compilation time by a factor of 10. The main problem of the copy operation is that we need to copy the entire IR and not only the portions that are relevant for duplication. This is the case because we do not know which parts of the IR are changed by subsequent optimizations. The compile-time cost of backtracking in the IR could be reduced by, e.g., the application of transactional IRs that support shallow operations, only rewiring data flow if necessary. We consider this interesting for future work and reflect on it in Chapter 10.

• Large compilation units (>100k instructions) contain thousands of control-flow merges. Applying optimization phases30 after every duplication to determine if an optimization triggered is therefore not feasible. Even though many optimization opportunities31 can be computed in (nearly) linear time over the number of instructions in the IR, processing the full IR every time the compiler performs a duplication is not acceptable for JIT compilation.

Algorithm 3: Backtracking-based duplication.
Data: ControlFlowGraph cfg
Result: Optimized ControlFlowGraph cfg

bool progressMade ← true;
outer: while progressMade do
    progressMade ← false;
    for BasicBlock b in cfg.mergeBlocks() do
        bool change ← false;
        ControlFlowGraph copy ← cfg.copy();
        for BasicBlock p in b.predecessors() do
            duplicate(cfg, p, b);
            for opt o in {CE, CF, PEA, ReadElim, StrengthRedux} do
                change ← change or o.apply(cfg);
            end
        end
        if change then
            /* The CFG and basic block list changed, thus we need to restart. */
            progressMade ← true;
            continue outer;
        else
            /* Backtrack and advance one merge. */
            cfg ← copy;
        end
    end
end

4.2.1.3 Simulation

We argue that simulation-based approaches do not suffer from the problems of backtracking-based or heuristic-based approaches. The main idea of simulation-based duplication is to determine the impact of a duplication (in terms of optimization potential on the whole compilation unit) before performing any code transformation. This allows a compiler to only perform those transformations that promise a sufficient peak performance increase (benefit)32. The main requirement for this to be practical is that simulating a duplication is sufficiently less complex in compilation time than performing the actual transformation. Duplication simulation should avoid the complex part of the duplication transformation, maintaining the semantic correctness of all data dependencies, while still allowing valid decisions about the optimization potential of the involved instructions after duplication, i.e., being precise and complete.

30 We solve this problem with partial optimizations, i.e., optimizations on the simulated program. We present the idea behind such transformations in Chapter 6. 31 The detailed optimization opportunities are presented for tail duplication in Chapter 6 and for loop unrolling in Chapter 7. 32 See Q1, Q2 for details.

Algorithm 4 outlines the basic idea. Before performing any duplication, the compiler simulates each duplication and performs partial optimizations on the simulation result. Those optimizations are local to the merge and the predecessor blocks of the merge and are thus faster than a full optimization over the entire IR33. For each partial optimization, the compiler saves the optimization potential34. In addition, it stores an estimated code-size increase for the duplication. The compiler performs the simulation for each predecessor-merge pair and stores all of the results in a list. The results are sorted by descending benefit (optimization potential, i.e., expected performance increase) and ascending cost (code-size increase), so that the most promising candidates are optimized first. This is important in case not all candidates are duplicated due to code-size restrictions. The compiler then iterates over each simulation result and decides if the related transformation is worth it.

Algorithm 4: Simulation-based duplication.
Data: ControlFlowGraph cfg
Result: Optimized ControlFlowGraph cfg

simResults ← [];
for BasicBlock b in cfg.mergeBlocks() do
    for BasicBlock p in b.predecessors() do
        simCFG ← CFG after simulated duplication of b into p;
        simResult ← result of partial opt (CE, CF, ...) applied to simCFG;
        simResults.add(simResult);
    end
end
sort simResults ascending by cost;
sort simResults descending by benefit;
for SimResult s in simResults do
    if s.worthDuplicating() then
        duplicate(s.merge, s.predecessor);
        for opt o in {CE, CF, PEA, ReadElim, StrengthRedux} do
            o.apply(s.predecessor);
        end
    end
end
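The two consecutive stable sorts in Algorithm 4 produce an ordering by descending benefit with ties broken by ascending cost. The following sketch expresses the same ordering as a single two-key comparison; the SimResult record and its accessors are hypothetical stand-ins for the algorithm's data structure:

import java.util.Comparator;
import java.util.List;

// Hypothetical simulation result carrying the estimated benefit and cost.
record SimResult(double benefit, double cost) { }

final class SimResultOrder {
    // Equivalent to Algorithm 4's "sort ascending by cost" followed by a
    // stable "sort descending by benefit": primary key benefit (descending),
    // secondary key cost (ascending).
    static void order(List<SimResult> simResults) {
        simResults.sort(Comparator.comparingDouble(SimResult::benefit).reversed()
                                  .thenComparingDouble(SimResult::cost));
    }
}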

4.2.1.4 Comparison

In order to illustrate the capabilities of the different implementation possibilities of the AC step we compare them in Table 4.1 based on the following properties:

33 These partial optimizations have the same effect as global ones, as the CFG of the program after duplication only changes for blocks dominated by the merge, but not for dominating ones. The only exceptions are duplications inside loops that can impact loop backedges; these need to be handled specially, see Chapter 7 for details. 34 See Chapter 6.

• Completeness with respect to the number of supported optimizations.

• Preciseness with respect to the real applicability of the opportunities found during the AC step.

• Estimated run-time complexity.

• Extensibility to new optimizations.

• Estimation of code size increase.

Approach     | Complete | Precise | Complexity AC‡                         | Extensible$ | Code Size Estimation
Heuristics   | ✗        | ✗       | $O(AC) = O(m)$                         | ✗           | ✗
Backtracking | ✓        | ✓       | $O(AC) = O(m \cdot (D + Opt))$         | ✓           | ✓
Simulation   | ✓        | ✓       | $O(AC) = O(m \cdot (S + Opt_{sim}))$†  | ✓           | ✓

‡ With respect to the number of merges $m$ in the program. $ With respect to adding more optimizations in the future. † Where $S$ denotes the time needed to simulate a duplication and $Opt_{sim}$ the time for optimizing the simulation results with partial optimizations.

Table 4.1: Comparison of approaches to derive optimization capabilities.

Table 4.1 shows that simulation has similar characteristics to backtracking, while maintaining a less cost-intensive AC implementation.

In the rest of this chapter we summarize additional necessities in order to support a fine-grained performance vs. code-size trade-off. In Chapters 6 and 7 we introduce dominance-based duplication simulation and fast-path unrolling of non-counted loops as the two optimizations we implemented based on the three-tier simulation-based duplication scheme.

4.3 Necessities: Cost Model

Up to now, we presented algorithms and techniques for finding optimization opportunities after code duplication, i.e., possible implementations of the $AC(m, p_i)$ step. We presented them as binary analyses, e.g., after duplication, a constant folding is possible (true) or not (false). However, in order to perform a fine-grained performance vs. code-size trade-off, a duplication optimization needs to implement logic that is smarter than binary decisions.

Optimizations, although applying them is a binary decision, carry a continuous performance impact and code-size effect. For example, performing a simple constant folding of the addition a + 0 to the variable a is a binary decision from the compiler’s perspective, i.e., either perform it or not.

Yet, from the performance perspective, the reduced run time of the program, stemming from the fact that the generated code has to perform one instruction fewer, is a continuous value. Assuming a and the constant 0 are both in registers, a typical addition operation on x86 takes about 1 CPU cycle of latency [68; 118].

Since duplication often requires copying the instructions dominating the target instruction, the code-size impact varies for each single duplication transformation. Therefore, we refer to the performance increase of a single duplication transformation as its benefit, for example measured in reduced CPU cycles of latency, and to the code-size increase as its cost, typically measured in additional machine code size. In order for the compiler to apply a proper trade-off between cost and benefit in a duplication-based optimization, we propose the introduction of a cost model for code duplication. This cost model can be used in the trade-off tier of the simulation-based duplication approach. The cost model must allow a compiler to make precise estimations of code-size increase and performance increase. Furthermore, it has to allow the compiler to trade off different duplication candidates against each other, selecting the most promising ones for duplication.

In order to motivate the need for a proper cost model that allows a compiler to trade off between different success metrics, we show a sample program in Figure 4.5. The program contains a control flow merge with three predecessors, followed by a merge with two predecessors. Given these predecessors, there are 5 possible ways to duplicate code into a predecessor in this simple program. However, some duplications are better than others and require the compiler to track code-size metrics in order to make a useful decision. We plotted two possible duplications: the duplication of the code dominated by the first merge into the first predecessor and into the second predecessor35. Both options have optimization potential after duplication, as marked by the red colored source code. When duplicating into the first predecessor, the addition x + y can be optimized to 0 since y == -x. When duplicating into the second predecessor, the statement s2 = x + y; can be optimized to a plain assignment s2 = x; and one branch of the control flow diamond folds away completely. When duplicating into the first predecessor, code size increases significantly, as the if does not fold. However, when duplicating into the second predecessor, only 2 new instructions are effectively needed. Duplicating into the first predecessor might not be a problem in terms of code-size increase for such a small program. Yet, considering a full program scope with thousands of functions36 and thousands of control flow merges, such a code-size increase would be questionable. Therefore, a proper cost model allowing the compiler to trade off the estimated performance increase against the estimated code-size increase is essential for a good performance of a code duplication optimization.

35 Note that duplication on the IR level will not produce a control flow merge for the second predecessor after duplication, but will automatically generate a return. Therefore, we added a return instruction on the source level. 36 For example, the Graal compiler performs around 80000 compilations across all Java DaCapo [11] and ScalaBench [176] benchmarks [60; 61; 63].

[Figure 4.5 shows the source program foo(int x, int z), which merges three control flow paths before the statements assigning s1 and s2, together with the two highlighted duplication variants: duplicating the merge code into the first predecessor (where x + y folds to 0 because y == -x) and into the second predecessor (where s2 = x + y becomes s2 = x and one branch of the diamond folds away), each followed by its optimized form.]

Figure 4.5: Sample program duplication cost model motivation.


Chapter 5

Node Cost Model

In this chapter we propose the node cost model, a novel cost model for optimizations on the graph-based IR of the Graal compiler. We show that it allows a compiler like Graal to perform fine-grained trade-offs between the code size and the performance increase of a transformation.

Duplication-based optimizations, while having a positive impact on one aspect (e.g., run-time performance), can have a negative impact on another (e.g., code size) [120; 157] (Chapter 4). Therefore, a compiler needs to find a good compromise between benefits, such as performance, and negative effects, such as code size. In order to estimate any effect on the quality of the produced code, duplication-based optimizations typically apply abstract reasoning on models of a compilation unit in order to estimate a transformation’s benefits and costs [7]. Such reasoning is typically hand-crafted and highly specific to one particular optimization1. Thus, state-of-the-art compilers often contain multiple cost models (up to one for each optimization) applying specific trade-off functions. Changes to the structure of the compiler’s IR or the ordering of optimizations, as well as new optimizations and instruction types, are rarely accounted for correctly in all trade-off functions, leading to unnecessary misclassifications of transformations. Additionally, over time this increases maintenance costs.

1 For example, the inlining heuristics of the HotSpot [93] server compiler [153] in http://hg.openjdk.java.net/jdk10/hs/file/d85284ccd1bd/src/hotspot/share/opto/bytecodeInfo.cpp, see InlineTree::should_inline.

5.1 Problems of Existing Cost Models

In compilers for low-level languages, performance prediction models have been devised that precisely estimate a program’s performance at run time [172; 198]2. In theory, those models could be used to guide dynamic compilers in their optimizations’ trade-off decisions. However, they cannot be applied to dynamic compilation of high-level languages for several reasons, including:

• Dynamic compilers for high-level languages apply optimizations on abstraction levels where no notion of the architecture and platform is present in a compiler’s intermediate representation. This prohibits the compiler from using precise performance prediction approaches that model all hardware and software peculiarities of a system.

• High-level optimizations are typically executed early in the compilation pipeline, and the compiler can still significantly change the code by applying various high-level compiler optimizations [194] afterwards. This makes it difficult to predict the performance of the generated code.

• Precise performance prediction is often not suitable for dynamic compilation as it is run-time-intensive.

Consequently, we believe that precise performance models [198] are not suitable for use in high-level compiler optimizations. Optimizations often require cost models that allow them to compare IR instructions against each other on an abstract, relative level, without reasoning about absolute code-size or run-time metrics (e.g., cycles). Yet, such models should enable an optimization to make valid assumptions about the relative run time of two code fragments.

5.2 Cost Model Requirements

Graal performs many optimizations while a program is still in a platform- and architecture-agnostic, high-level representation. Optimizations at this level leverage knowledge about the semantics of the JVM specification. Prominent examples of such optimizations are inlining [157], partial escape analysis [185] and dominance-based duplication simulation [120]. While high-level optimizations can significantly improve program performance, they often lack knowledge of how a given piece of IR is eventually translated to machine code, especially when taking into account how subsequent compiler optimizations could transform it. Thus, dynamic compilers, including Graal, typically resort to heuristics when making optimization decisions. Since code duplication can lead to unintended code growth, the compiler requires a proper cost model that enables it to perform deterministic estimations of code size on the IR level, correlating with the final machine code size, in order to control code-size expansion. We summarize these needs as Problem 1 (P1).

2 LLVM [111] uses instruction costs in its vectorization cost model, https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Analysis/TargetTransformInfo.h, line 134.

Metric: Bytecode Size
Compilers using this metric: A metric used by many compilers, mostly for inlining [6]. In HotSpot, the C1 [109] and C2 [153] compilers are using it.
Disadvantages: Bytecode is not optimized. This means that debugging code, assertions, logging, etc., as well as redundant computations that are easily eliminated with global value numbering [7], also contribute to code size. This can result in a misclassification of code size.

Metric: IR Node Count
Compilers using this metric: The C1, C2 and Graal compiler use(d) this metric.
Disadvantages: IRs typically contain meta instructions / nodes [29; 55; 109] that are needed by the compiler to annotate certain instructions with additional information. However, such nodes often do not result in generated code, for example Graal’s nodes for partial escape analysis [185], which are required to represent virtualized objects. They are only needed to support re-materialization during deoptimization. Additionally, there are large differences in the number of machine code bytes generated for different types of IR nodes. For example, a field access typically lowers to a memory read of 4 byte size, whereas the invocation from Listing 5.1 can take up to 80 bytes of machine code. Notably, both operations are represented with a single IR node and thus count as 1.

Table 5.1: Compiler code size quantification.

P1 Machine Code Size Prediction
High-level, platform- and architecture-agnostic compiler optimizations cannot reason about the system-specific details that are necessary to predict which code will be generated for high-level language constructs (a). Additionally, optimizations cannot infer the final machine code, and thus the size of a compilation unit, if there are other optimizations executed in between.

(a) For example, see the code generated for the call instruction in Listing 5.1.

interface F { int foo(); }
abstract class C implements F { int result; }

void foo(C c, int a) {
    int r;
    if (c.result > 0) {
        r = a;
    } else {
        r = c.foo();
    }
    c.result = r;
}

Listing 5.1: Trivial Java program.

Compilers use different static and dynamic metrics extracted from a compilation unit to estimate code size. We summarize the metrics that are most common in current compilers in Table 5.1. Prior to our work, the predominant metric used in the Graal compiler was IR node count, which was often sufficient. However, as outlined in Table 5.1 and verified by structured experiments, it can also lead to a misclassification of the code size of a method, mainly because of nodes that expand to multiple machine instructions and because of compiler intrinsics. Intrinsics are code snippets which the compiler inserts to efficiently implement known methods of the standard library. For example, Graal uses intrinsics for array operations. They are represented by a node in the IR that later expands to the real operation, consisting of a large number of instructions inlined into the IR, or even to a call to a stub of code implementing the intrinsic.3 Thus, we formulate Problem 2 (P2).

P2 High-Level Instructions

IR nodes capturing abstract language semantics and compiler intrinsics can expand to multiple instructions during code generation. Thus, they require special treatment when classifying code size in a compiler.

In addition to the above problems, we also consider structural canonicalizations as a source of misclassification, i.e., transformations that are applied by a compiler to bring a program into a canonical4 form. Many compilers, including Graal, combine real optimizations such as constant folding [7] with structural canonicalizations such as enforcing a certain order of constant inputs on expression nodes. If compiler optimizations (i.e., enabling optimizations such as duplication) base their optimization decisions on the optimization potential of other optimizations, and the potential is of a structural nature, this can cause a misclassification of the optimization potential. For example, an optimization might perform a transformation (enabling a structural canonicalization) that will not cause a performance benefit. This requires optimizations to understand which transformations are performance-increasing and which are not (Problem 3 (P3)).

P3 Performance Estimation
Compiler optimizations typically cannot infer the performance increase of a particular transformation.

For the purpose of code duplication in the Graal compiler, our cost model has to be applicable to dynamic compilation, i.e., the worst-case asymptotic complexity should be at most linear over the number of nodes.

3 For example, in Graal, every method call to Arrays.equals([] a, [] b) in compiled code is intrinsified by the compiler with a special IR node expanding to more than 64 machine instructions (depending on the used CPU features). 4 Canonical in the sense that it is understood by all optimizations.

5.3 Node Cost Model

In order to solve the presented problems, we propose the introduction of an IR cost model for the Graal compiler, called the node cost model, that specifies costs for IR nodes. These costs are abstract metrics for code size (NodeSize) and operation latency (NodeCycles). Costs are specified together with the definition of an IR node.

NodeSize is a relative metric that we define as the average number of generated instructions on an abstract architecture after lowering the operation to machine code. We base the definition roughly on a double-word instruction format (32 bit) without modeling different architectures. For example, our assumption is that an addition always generates one machine code instruction, irrespective of the abstraction level. The addition operation is our baseline; we specify other operations relative to it. We deliberately keep the definition loose in order to avoid overspecification. For high-level optimizations, it is not important to have byte or word precision.

NodeCycles is an estimation of the latency of an instruction relative to an addition operation. We assume the latency of an addition to be one cycle on every platform and specify all other instruction latencies relative to it. The same rules as for the NodeSize metric apply. If a high-level operation (e.g., a field access) can expand to multiple instructions, we assume the most likely path through the compilation pipeline and specify the cycles metric as the average latency of the operation. We gathered the values for the latencies of instructions relative to the addition from several sources, mostly by empirical evaluation and by instruction classification approaches [68].

5.3.1 Code-Size Estimation

The cost model can be used to solve P1 from Section 5.2 by performing code-size estimations in optimizations. The compiler can use these estimates in duplication-based optimizations to trade off between several optimization candidates. Consider Figure 5.1, which shows a simple Java method with the associated Graal IR together with the resulting x86-64 Intel assembly code [101]. The compiler overestimates the NodeSize for the method writeField to be 8 machine instructions for two reasons:

1) It pessimistically assumes that the parameters reside on the stack and therefore calculates the costs of a stack access; however, in this case the parameters can be accessed in registers due to the calling convention.

2) The compiler assumes that an If node generates 2 jump instructions, namely one from the true branch to the merge and one from the comparison to the false branch; however, if both branches end in control flow sinks and never merge again, one jump is saved.

class Clazz { int field; }
void writeField(Clazz c, int a) {
    if (c != null) c.field = a;
}

[Figure 5.1 shows the Graal IR of writeField with per-node NodeSize values (Start, Param0, Param1, IsNull, If, StoreField and two Returns), summing to a size estimation of 8, next to the generated machine code with a real size of 5 instructions:]

test rsi, rsi
jz   L1
mov  [rsi + 12], edx
ret
L1: ret

Figure 5.1: Code-size estimation: red numbers represent NodeSize values. HotSpot-specific prologue and epilogue instructions are removed for simplicity.
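A minimal sketch of such an estimation, assuming a hypothetical node interface that exposes the NodeSize metric (the actual Graal API differs):

import java.util.List;

// Hypothetical minimal IR node interface exposing the NodeSize metric.
interface SizedNode {
    int nodeSize(); // abstract instruction count; 0 for IGNORED nodes
}

final class CodeSizeEstimator {
    // Estimates the code size of a compilation unit by summing the NodeSize
    // values of all IR nodes. The result is a relative estimate in abstract
    // instructions, not in machine code bytes.
    static int estimateSize(List<SizedNode> nodes) {
        int size = 0;
        for (SizedNode n : nodes) {
            size += n.nodeSize();
        }
        return size;
    }
}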

5.3.2 Relative Performance Prediction

The second major use case of the model is the relative performance prediction of optimization candidates, solving P3 from Section 5.2. The node cost model can be used to compare two pieces of code based on their relative performance estimation. This need is motivated by our overall work on code duplication. Many optimization patterns are hard-coded in compilers; this includes optimizations like algebraic simplifications and strength reductions. However, simplifications do not guarantee peak performance increases. Therefore, comparing two pieces of code by summing up their relative performance predictions, normalized to their profiled execution frequencies, allows compilers to perform better performance trade-offs.
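As a sketch, such a comparison can sum NodeCycles values weighted by profiled execution frequency; the interface and names here are hypothetical, not the Graal API:

import java.util.List;

// Hypothetical IR node interface exposing the NodeCycles metric.
interface CycledNode {
    double nodeCycles();      // relative latency; an addition counts as 1
    boolean latencyUnknown(); // e.g., calls, whose latency is opaque
}

final class RelativePerformance {
    // Sums the relative latencies of a block's nodes, weighted by the block's
    // profiled execution frequency. Nodes with unknown latency contribute
    // nothing, so they never create optimization potential on their own.
    static double weightedCycles(List<CycledNode> block, double frequency) {
        double cycles = 0;
        for (CycledNode n : block) {
            if (!n.latencyUnknown()) {
                cycles += n.nodeCycles();
            }
        }
        return cycles * frequency;
    }
}

Two duplication candidates can then be compared by the difference between their weighted cycle estimates before and after the simulated optimization.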

[Figure 5.2 shows Graal IR for a merge with three predecessors (Pred[0], Pred[1], Pred[2]) and a ϕ with the inputs Const 3331, Const 0 and Param1; the ϕ feeds an Add with Param0, followed by a Store and a Return, each annotated with its NodeCycles value. The right-hand side shows the three possible add operations after duplication into each predecessor.]

Figure 5.2: Cycles estimation: Blue numbers represent NodeCycles values.

Consider the example in Figure 5.2, which shows Graal IR for a control flow merge with three predecessors. In SSA form, a variable is only assigned once; to model multiple possible assignments, ϕ functions are used. Figure 5.2 contains one ϕ at the merge with 3 inputs. Code duplication tries to determine how the code on each path looks after duplication, in order to find the best candidate for optimizations. If the code is duplicated into pred[0], the resulting add operation with a constant and a parameter input cannot be optimized further. Yet, Graal would still swap the two inputs to have a canonical representation of arithmetic operations for global value numbering. This is not a performance-increasing optimization, but a structural canonicalization5. Graal combines such transformations with actual optimizations, making it impractical for an optimization to know the difference if it is not hard-coded, unless the optimization resorts to using the node cost model to calculate the relative performance improvement of a transformation. In this case, the structural canonicalization will not result in a performance gain. Duplicating into pred[1], however, would lead to an addition with 0 that can be folded away completely. This requires the optimization to understand what the real benefit of a transformation is in terms of performance. Using the node cost model, the compiler can precisely see that only the duplication into pred[1] can generate a real improvement in performance. This is important as the store and return instructions are duplicated as well, which increases the code size. The compiler has to trade off the benefit against the code size. In this example, only the duplication into pred[1] generates a sufficient improvement to justify the increase in code size.

5.3.3 Discussion

The node cost model is not a precise performance prediction framework. We do not model pipelined architectures, because we have no notion of the concrete hardware during our high-level optimizations. Therefore, we also make no distinction between throughput and latency. We avoid modeling out-of-order processors, cache misses, data misalignments and exceptions in our cost model. We assume a single-threaded, in-order execution environment with perfectly cached memory accesses. This is desired, since Graal is a multi-platform compiler, and modeling all possible kinds of hardware with all their peculiarities would involve too much implementation and compile-time effort.6 We assume that a precise estimation is not necessary to support compiler optimizations in their trade-off decisions. This claim is verified in Chapter 8. For code-size specifications, we also explicitly refrain from modeling debug information that is emitted by the code generator. Some instructions, for example implicit null checks, require the compiler to generate debug information for the VM during code generation. We also ignore register allocation effects by assuming an infinite number of virtual registers.

5 See P3 from Section 5.2. 6 Our implementation allows for architecture-specific NodeSize and NodeCycles values if a particular architecture has wildly different characteristics. However, we have not implemented architecture-specific characteristics so far.

5.4 Implementation Aspects

We specified NodeSize and NodeCycles for over 430 nodes in the Graal compiler. Graal IR nodes are implemented as extensible Java classes. Classes representing nodes in the IR must be annotated with a special NodeInfo annotation, to which we added additional fields for an abstract estimation of code size and run-time latency7.

public @interface NodeInfo {
    NodeCycles cycles() default CYCLES_UNSET;
    NodeSize size() default SIZE_UNSET;
}

Listing 5.2: Excerpt of the NodeInfo annotation.
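For illustration, a node class then declares its costs directly in the annotation. The enum constants follow the naming scheme of the excerpt above, but the node class itself is a simplified sketch, not the actual Graal source:

// Simplified sketch of a field-load node declaring its abstract costs
// (cf. Table 5.2: LoadField has NodeCycles 2 and NodeSize 1).
@NodeInfo(cycles = NodeCycles.CYCLES_2, size = NodeSize.SIZE_1)
final class LoadFieldNode extends FixedNode {
    // field access state and lowering logic elided for brevity
}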

[Figure 5.3 shows two histograms, Cycles and Size, of the percentage of nodes assigned each value (0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 1024) as well as the special values IGNORED and UNKNOWN.]

Figure 5.3: Node cost model value distributions.

Figure 5.3 shows the distribution of the NodeSize and NodeCycles values in the Graal compiler. We can see that a significant number of nodes are ignored in the cost model, as they do not generate code or do not consume CPU cycles8. Those nodes would be a source of misclassification for optimizations, as they are noise to the real value distributions of the code. Additionally, we solved P2 from Section 5.2 by also annotating IR nodes representing intrinsic operations with a code-size and latency estimation (if possible). However, two problems remain for the cost model properties. The first problem is that there are nodes for which we cannot make a static estimation. For example, Graal supports grouping allocations of multiple objects together. This is expressed as a group allocation node whose latency depends on the number of allocations and the allocation sizes; thus, the latency has to be inferred dynamically9. The second problem is that there are nodes for which we simply cannot make an estimation, because their code size / latency is completely unknown. There are two kinds of nodes for which we cannot predict latency at compile time:

7 See Listing 5.2. 8 They are assigned the value IGNORED. 9 This is implemented by modeling costs as dynamically dispatched getter functions in derived classes.

Call Nodes The latency of invocations is opaque to the compiler. The compiler would need to parse every possible callable method to estimate its run time. Thus, we ignore the latency of calls in the cost model10. We annotated call nodes with estimations for their NodeSize, but their run-time latency is set to UNKNOWN.

Deoptimization Nodes Deoptimization [92] causes the program to continue execution in the interpreter and is thus out of scope of any performance prediction.

The concept of unknown values is a potential threat of misclassification of code. If optimizations make decisions based on control paths containing UNKNOWN nodes, those decisions are potentially based on a misclassification of size or latency. However, in practice such cases do not happen often. Deoptimizations are rare and are treated as control flow sinks11 by all optimizations (i.e., a path leading to a deoptimization should not be optimized aggressively), and the compiler has an estimation for the code size of a deoptimization. Invokes are similar: the compiler can infer the generated code for an invocation, but the latency is unknown. Invokes themselves can have arbitrary side effects, so they can never be removed by the compiler. The latency estimation for a caller remains the same if we exclude the latency of the callee from the computation: the latency of a callee can be treated as a constant scaling the latency of the caller and is thus irrelevant for the performance estimation of the caller. The compiler uses the latency estimations in order to perform optimizations. Since it cannot estimate the latency of a callee, the callee’s actual run time is ignored in trade-off calculations. However, this does not have an impact on the calculated optimization potential for the caller, as the caller’s run time stays the same with or without the callee.

Node               | NodeCycles | NodeSize
AddNode            | 1          | 1
MulNode            | 2          | 1
LoadField          | 2          | 1
DivNode            | 32         | 1
Call (Static Call) | Unknown    | 2

Table 5.2: Important node cost model nodes.

Table 5.2 shows some important nodes in the Graal compiler together with their cost model values. The addition node is the baseline for the rest of the instructions. For example, a multiplication usually needs twice the latency of an addition. LoadField nodes access memory, and thus we estimate the time needed to mov a value from cached memory to a register. A field access in Java also requires a null check [128]. This check can be done implicitly or explicitly. However, after bytecode parsing the compiler does not yet know if it can emit an implicit null check, thus it estimates the latency to be 2 cycles: an explicit null check typically requires 1 instruction and 1 cycle. Integer division (div) is an instruction whose latency depends on the size of the operands and the value ranges, thus we take the worst-case latency estimation of the instruction as a reference value. Static calls spend an UNKNOWN amount of cycles in the callee, but typically take up to two machine instructions of code size.

10 This is correct as the latency metric is only used in optimizations. If calls are excluded from trade-off functions, the compiler never performs optimizations in favor of optimizing call instructions. In all cases, the latency of a call is ignored, never resulting in any optimization potential when doing trade-off calculations. 11 Nodes in the IR that end a particular control path.

We performed a detailed evaluation of the cost model and present it together with the overall performance evaluation of our simulation-based duplication approach in Chapter 8.

Chapter 6

Dominance-Based Duplication Simulation

This chapter presents DBDS, an application of the simulation-based duplication scheme for tail duplication. We first present optimization opportunities after duplication. Then we introduce our approach to find those opportunities, before we discuss the application in the Graal compiler and implementation aspects.

Duplication-based optimizations work under the premise of optimization via specialization. As discussed in Chapter 4, the general idea of code duplication is to optimize programs by duplicating code from merge blocks into predecessor blocks, where it can then be optimized for the types and values used in the predecessor blocks.

In this chapter we present our main application of the simulation-based duplication scheme: classical (tail-)duplication of control flow merges1. For implementing the applicability step $AC(m, p_i)$ in a simulation-based fashion, we present a novel algorithm to simulate the effects of a duplication on the dominator tree of a program, called dominance-based duplication simulation (DBDS).

6.1 Optimization Opportunities after Duplication

Before introducing our approach for code duplication at regular control flow merges, we present a complete list of optimization opportunities after duplication. The following sections discuss basic optimization opportunities enabled by a code duplication (denoted Opt in Chapter 4). We present each optimization opportunity in simplified source code as well as in Graal IR2, the IR of our implementation platform.

1 See Chapter 4 for details. 2 For the definition of the IR we refer to Chapter 2.

6.1.1 Canonicalizations

The Graal compiler groups simple optimizations, including constant folding, strength reduction [7] and algebraic simplifications [17], under the term canonicalizations3. Canonicalization optimizations (called canonicalizations from now on) typically work on a single instruction (or IR node) and only require knowledge about the direct data-flow or control-flow inputs of a node in order to optimize it. The common canonicalizations enabled by a duplication stem from the fact that data-flow inputs pointing to ϕ functions are replaced by the ϕ’s input at the respective predecessor.

int canon(int x) {
  final int phi;
  if (x > 0) {
    phi = x;
  } else {
    phi = 0;
  }
  return 2 + phi;
}

(a) Initial program.

int canon(int x) {
  if (x > 0) {
    return 2 + x;
  } else {
    return 2 + 0;
  }
}

(b) After duplication.

int canon(int x) {
  if (x > 0) {
    return 2 + x;
  } else {
    return 2;
  }
}

(c) After duplication and optimization.

[Graal IR of canon before duplication, after duplication, and after optimization; the ϕ at the merge is replaced by its per-predecessor inputs, allowing the Add of Const 2 and Const 0 to fold.]

(d) Graal IR canonicalization opportunity.

Figure 6.1: Canonicalization opportunity.

The target instruction4 of a duplication can be optimized because a data-flow input is replaced with a more concrete node: the input of the ϕ, which becomes the new input of the target instruction, is more precise in its value or type information than the ϕ itself. Typical target instructions are mathematical operations. We illustrate this in Figure 6.1, where we see a simple Java function canon in Figure 6.1a. After duplication of the merge into both predecessors, the compiler generates the code in Figure 6.1b, which can easily be optimized to the code in Figure 6.1c. Figure 6.1d shows the function canon in Graal IR before duplication, after duplication, and in the final optimized form.

3 As these optimizations attempt to transform a piece of IR into its canonical form. 4 See Section 4.1 for details. The target instruction is the instruction that can be optimized after duplication.

6.1.2 Read Elimination

Read Elimination is an optimization that eliminates redundant reads and writes in a program. Due to control-flow restrictions, only fully redundant reads [13] can be eliminated5. However, partially redundant6 reads can be promoted to fully redundant ones by duplication. Consider the example in Figure 6.2a: Read2 is redundant if the true branch (i > 0) is taken.

class A { int x; }
static int s;
int readE(A a, int i) {
  if (i > 0) {
    // Read1
    s = a.x;
  } else {
    s = 0;
  }
  // Read2
  return a.x;
}

(a) Initial program. (Note that s is a static variable, thus the compiler cannot remove it.)

class A { int x; }
static int s;
int readE(A a, int i) {
  if (i > 0) {
    // Read1
    s = a.x;
    // Read2
    return a.x;
  } else {
    s = 0;
    // Read2
    return a.x;
  }
}

(b) After duplication.

class A { int x; }
static int s;
int readE(A a, int i) {
  if (i > 0) {
    // Read1
    int tmp = a.x;
    s = tmp;
    return tmp;
  } else {
    s = 0;
    // Read2
    return a.x;
  }
}

(c) After duplication and optimization.

[Graal IR of readE before duplication, after duplication, and after optimization; after duplication, the second LoadField in the true branch is fully redundant and is replaced by the value of the first.]

(d) Graal IR read elimination opportunity.

Figure 6.2: Read elimination opportunity.

By duplicating Read2 into both predecessors, it becomes fully redundant in the true branch and can be eliminated. Figure 6.2b shows the program after duplication, where we can easily see that Read2 is fully redundant in the true branch. The compiler can optimize the second read of a.x away by keeping the read value in a virtual register from that point on, as seen in Figure 6.2c.

5 An operation is fully redundant if the value computed by the operation is available on all paths of the program. 6 An operation is partially redundant if the value computed by the operation is already available on some paths of a program, but not all [13].

6.1.3 Conditional Elimination

Conditional Elimination (CE) [182], i.e., the process of symbolically proving conditional statements, also profits from duplication. Consider the example in Figure 6.3a. In case the first condition does not hold and i <= 0, the second condition p > 12 is known to be true. The compiler can duplicate the rest of the method into the predecessor branches and eliminate the condition in the else branch, producing the code seen in Figure 6.3b, which can be optimized to the code in Figure 6.3c. We also group guard instructions into this optimization opportunity.

int ce(int i) {
  int p;
  if (i > 0) {
    p = i;
  } else {
    p = 13;
  }
  // merge block m
  if (p > 12) {
    return 12;
  }
  return i;
}

(a) Initial program.

int ce(int i) {
  int p;
  if (i > 0) {
    p = i;
    // merge block m
    if (p > 12) {
      return 12;
    }
    return i;
  } else {
    p = 13;
    // merge block m
    if (p > 12) {
      return 12;
    }
    return i;
  }
}

(b) After duplication.

int ce(int i) {
  if (i > 0) {
    if (i > 12) {
      return 12;
    }
    return i;
  } else {
    return 12;
  }
}

(c) After duplication and optimization.

[Graal IR of ce before duplication, after duplication, and after optimization; after duplication, the comparison of the ϕ-value Const 13 against Const 12 in the else branch is a tautology and folds away.]

(d) Graal IR conditional elimination opportunity.

Figure 6.3: Conditional elimination opportunity.

Guards are nodes that ensure that a certain condition is true and deoptimize the generated code otherwise. Guards are used to implement speculative optimizations. Guards in Graal have three stages in their lifecycle [58]. For optimizing guards via duplication we are only interested in the first stage, where guards start out as single instructions in the high tier. In the second stage they are fixed to control flow, but new deoptimizing nodes can still be added to the IR. In the last stage the compiler associates a frame state7 with every guard and disallows the introduction of new guards from that point on. In the first stage, guards have a condition input, i.e., the condition they are guarding. This condition can possibly be implied by dominating conditions. Thus, guards can also be optimized after duplication if a duplication transformation can establish a dominance relation that implies that a certain guard condition is a tautology or a contradiction.

7 See Chapter 3 for details.
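To make the guard case concrete, the following is a minimal sketch of this applicability check; the Condition interface and all names are ours for illustration and not Graal API. A guard can be folded on a path if some condition collected along the dominator-tree path implies either the guard's condition (tautology) or its negation (contradiction).

    import java.util.Set;

    // Sketch only: a symbolic condition that can report implication.
    interface Condition {
        boolean implies(Condition other);
        Condition negate();
    }

    final class GuardFolding {
        // A guard can be folded if a dominating condition on the current
        // traversal path proves it always true or always false.
        static boolean canFoldGuard(Condition guard, Set<Condition> dominating) {
            for (Condition c : dominating) {
                if (c.implies(guard) || c.implies(guard.negate())) {
                    return true;
                }
            }
            return false;
        }
    }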

6.1.4 Escape Analysis and Scalar Replacement

Escape analysis is a compiler analysis that determines the scope, i.e., the lifetime, of object allocations. If an object or pointer is known to only live inside a well-defined scope, i.e., a method, the compiler can optimize the respective allocation by removing it and replacing it with the object's fields. The most common optimization for non-escaping object allocations in Java compilers is scalar replacement. Scalar replacement removes non-escaping object allocations by replacing an object with its scalars, i.e., field values, which can be held in registers instead of the Java heap. Traditionally, escape analysis and scalar replacement are done on whole-method scope, i.e., objects are only replaced with their scalars if an allocation never escapes the entire method.

(a) Initial program:

    class A {
        int x;
        A(int x) { this.x = x; }
    }
    int ea(A a) {
        A p;
        if (a == null) {
            p = new A(0);
        } else {
            p = a;
        }
        return p.x;
    }

(b) After duplication:

    int ea(A a) {
        A p;
        if (a == null) {
            p = new A(0);
            return p.x;
        } else {
            p = a;
            return p.x;
        }
    }

(c) After duplication and optimization:

    int ea(A a) {
        if (a == null) {
            return 0;
        } else {
            return a.x;
        }
    }

(d) Graal IR escape analysis opportunity. [Figure: IR graphs of the initial program, after duplication, and after optimization; after duplication the allocation no longer flows into a φ and can be scalar-replaced.]

Figure 6.4: Escape analysis & scalar replacement opportunity.

Partial Escape Analysis (PEA) and Scalar Replacement [185] is an extension to escape analysis that allows scalar replacement to be performed on individual branches. Thus, it is typically able to remove, or defer, more allocations than whole-method escape analysis. PEA and scalar replacement opportunities arise when an object has to be materialized through its usage in a ϕ instruction [39]8. This frequently happens in Java and Scala because of auto-boxing [124]. Figure 6.4a illustrates an opportunity for PEA after duplication. There would be no need for the A object to be allocated except that it is used by a ϕ instruction. The allocation of A in the true branch cannot be removed via scalar replacement, as the else branch's object has already escaped. This forces the compiler to materialize the allocation in the true branch to ensure both operands of the ϕ have the same type. After duplication (Figure 6.4b), (P)EA can deduce that the allocation is no longer needed. Thus, scalar replacement can reduce the code to that in Figure 6.4c.
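Since auto-boxing is named above as a frequent source of such ϕ materializations, the following illustrative Java snippet (ours, not from the thesis) has the same shape as Figure 6.4: the boxed Integer created in one branch only has to be materialized because it meets an already-escaped object in a ϕ.

    // Illustrative analogue of Figure 6.4 using auto-boxing (hypothetical example).
    static int boxExample(Integer cached, int i) {
        Integer p;
        if (cached == null) {
            p = i;        // auto-boxing: Integer.valueOf(i) allocates
        } else {
            p = cached;   // object that has already escaped (caller-provided)
        }
        return p;         // unboxing after the merge; p is a phi of both branches
    }

After duplicating the unboxing return into both predecessors, the true branch reduces to return i;, removing the allocation, analogous to Figure 6.4c.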

6.1.5 Lock Coarsening

Java expresses high-level synchronization for concurrent programming with the synchronized statement. Code inside a synchronized block (including synchronized methods) can only be executed by one thread at a time. Inside the JVM, synchronized blocks are represented with a bytecode called monitorenter to enter the critical region and a bytecode called monitorexit to exit the block. Every synchronized block has a monitor, i.e., every Java object owns a lock that can be used to execute a critical region under mutual exclusion. Therefore, on the JVM level, monitorenter and monitorexit have a parameter that is the monitor of the critical region. Lock Coarsening [49; 158] is an optimization that reduces locking overhead by merging adjacent critical regions delimited by monitorenter and monitorexit [124]. We illustrate the optimization with the code in Figure 6.5. The associated Graal IR in Figure 6.5d contains the monitorenter and monitorexit nodes representing the critical region. Duplication can make non-adjacent synchronized regions adjacent to each other. In Figure 6.5a we see a snippet of Java code that has a synchronized critical region in the false branch and in the merge block. In case the false branch is taken, object a is locked twice. However, no code is executed between the end of the else branch and the merge. The unlocking of the monitor of a is immediately followed by re-locking it. Duplication can optimize this code by duplicating the merge block into both predecessors (Figure 6.5b), so that the two consecutive synchronized blocks are now adjacent to each other in the false branch. The compiler can now coarsen the lock in the false branch to obtain the optimized code in Figure 6.5c.

8 ϕ functions force object materialization since Graal's type system cannot express types of materialized and virtualized objects at the same time. If a ϕ has one non-virtualized input, all other virtualized inputs have to be materialized.

(a) Initial program:

    if (...) {
        ...
    } else {
        synchronized (a) {
            ...
        }
    }
    synchronized (a) {
        ...
    }

(b) After duplication:

    if (...) {
        ...
        synchronized (a) {
            ...
        }
    } else {
        synchronized (a) {
            ...
        }
        synchronized (a) {
            ...
        }
    }

(c) After duplication and optimization:

    if (...) {
        ...
        synchronized (a) {
            ...
        }
    } else {
        synchronized (a) {
            ...
            ...
        }
    }

(d) Graal IR lock coarsening opportunity. [Figure: IR graphs with MonitorEnter and MonitorExit nodes for the initial program, after duplication, and after optimization; the adjacent MonitorExit/MonitorEnter pair in the false branch is removed.]

Figure 6.5: Lock coarsening opportunity.

It can be argued that locking can be nearly a no-op in optimized code considering techniques like biased locking [171]. However, if a lock is heavily contended [91]9, reducing the number of redundant release/acquire operations can significantly improve performance10.

6.1.6 Devirtualization

Devirtualization [102] is the task of finding the concrete receiver type of a virtual method dispatch. In languages such as Java that support virtual dispatch, the concrete receiver type of a virtual dispatch is often only known at run time. In order for the compiler to better optimize a virtual dispatch via, e.g., inlining, the concrete receiver type must be inferred via context information inside a compilation unit. Consider the code in Listing 6.1a: the call to foo (line 17) can either call A::foo, B::foo, or a method of a type that is not yet loaded [121]11.

9 Threads contend for a lock if one thread has already acquired the lock and executes the critical region while others are blocked until the owner of the lock releases it.
10 See Section 7.5.
11 A JVM-based compiler may use global class-hierarchy [46] assumptions [109] to assume that only A and B are currently loaded.

(a) Initial program:

    1  interface I {
    2      void foo();
    3  }
    4  class A implements I {
    5      void foo() { ... }
    6  }
    7  class B implements I {
    8      void foo() { ... }
    9  }
    10 void callTest(I i) {
    11     if (i.getClass() == A.class) {
    12         ...
    13     } else {
    14         ...
    15     }
    16     // invokeinterface
    17     i.foo();
    18 }

(b) After duplication:

    1  void callTest(I i) {
    2      if (i.getClass() == A.class) {
    3          ...
    4          // invokeinterface
    5          i.foo();
    6      } else {
    7          ...
    8          // invokeinterface
    9          i.foo();
    10     }
    11 }

(c) After duplication and optimization:

    1  void callTest(I i) {
    2      if (i.getClass() == A.class) {
    3          ...
    4          // invoke direct
    5          A::foo(this=i);
    6      } else {
    7          ...
    8          // invokeinterface
    9          i.foo();
    10     }
    11 }

Listing 6.1: Devirtualization opportunity.

However, if the true branch is taken, the compiler knows that the type of i is A (and not a subtype). Thus, lifting the call into both predecessors via duplication (Listing 6.1b) can devirtualize the virtual call, and the compiler can replace the invokeinterface call with a direct invocation (Listing 6.1c) that can later be inlined. While devirtualization would be a suitable opportunity for code duplication, Graal's inliner [156] already performs complex transformations, including type-guarded inlining [156], which is later optimized by removing type guards via duplication12. It does so by introducing type-check switches at polymorphic callsites and duplicating invoke instructions for the checked types. Thus, we refrain from modeling devirtualization in our duplication implementation. However, other approaches heavily apply duplication to enable devirtualization, including, e.g., the Self compiler [25]13.

12 See Section 6.1.3 for details.
13 See Chapter 9.
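For illustration, the type-check switch that such type-guarded inlining introduces at a polymorphic callsite has roughly the following source-level shape (our sketch, not the actual IR transformation):

    // Illustrative shape (ours) of a type-check switch with duplicated invokes.
    void callTest(I i) {
        if (i.getClass() == A.class) {
            ((A) i).foo();   // checked type: can be treated as a direct call to A::foo
        } else if (i.getClass() == B.class) {
            ((B) i).foo();   // checked type: direct call to B::foo
        } else {
            i.foo();         // fallback: virtual dispatch (or deoptimization)
        }
    }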

6.2 Simulation-based Duplication of Control Flow Merges

In order to perform beneficial code duplications of control flow merges, we propose an implementation of the simulation-based duplication scheme from Chapters 1 and 4. We base the implementation of the AC(m, pi) step on a dominance-based algorithm called dominance-based duplication simulation (DBDS). The basic idea of DBDS follows Algorithm 4 from Chapter 4 and is depicted in Figure 6.6. DBDS is one of the two concrete implementations of the simulation-based duplication scheme from Figure 1.1 in Chapter 1.

[Figure: the simulation tier runs duplication simulation traversals over the dominator tree and produces a simulation result per predecessor-merge pair; the trade-off tier sorts the results by cycles saved, code size, and probability and filters them with shouldDuplicate(simResult); the optimization tier duplicates and optimizes the chosen candidates.]

Figure 6.6: DBDS algorithm schematic.

The approach works in the following way: The simulation tier discovers optimization opportunities after code duplication. The trade-off tier fits those opportunities into an optimization cost model that tries to maximize peak performance while minimizing code size and compilation time. The outcome is a set of duplication transformations that should be performed as they lead to sufficient peak performance improvements. The optimization tier then performs those duplications together with the subsequent optimizations whose potential was detected by the simulation tier.
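A minimal sketch of the three tiers as a driver loop is shown below; all class and method names are ours, not Graal API. The simulation tier fills the list of simulation results, the trade-off tier sorts and filters them, and the optimization tier performs the surviving duplications.

    import java.util.Comparator;
    import java.util.List;

    // Sketch only: drives trade-off and optimization over simulation results.
    final class DuplicationDriver {
        // One <predecessor, merge> pair with its simulated effects.
        static final class SimulationResult {
            double cyclesSaved;   // benefit, estimated via the node cost model
            int codeSize;         // code-size increase of the duplication
            double probability;   // relative probability of the predecessor
            void duplicateAndOptimize() { /* optimization tier */ }
        }

        void run(List<SimulationResult> results) {
            // Trade-off tier: sort by probability-weighted benefit, best first.
            results.sort(Comparator
                    .comparingDouble((SimulationResult r) -> r.cyclesSaved * r.probability)
                    .reversed());
            for (SimulationResult r : results) {
                if (shouldDuplicate(r)) {
                    r.duplicateAndOptimize(); // optimization tier
                }
            }
        }

        boolean shouldDuplicate(SimulationResult r) {
            return false; // trade-off function, see Section 6.3
        }
    }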

6.2.1 DBDS Algorithm

In order to illustrate the basic idea of DBDS we use the outline from Figure 6.6 and the example program f from Figure 6.7, which is given in Graal IR [55; 57; 119]. Additionally, we present the combined pseudo code for the algorithm and traversal in Appendix B in Algorithms 9 and 10.

In order to find optimization opportunities, the compiler simulates duplication operations for each predecessor-merge pair of the CFG and applies partial optimizations on them. Partial optimizations are local to the merge block14; they are AC(m, pi) implementations for all optimizations presented at the beginning of this chapter. If a partial optimization triggers during simulation, the compiler remembers the optimization potential for the predecessor-merge pair.

14 This includes blocks dominated by the merge block and is configurable by limiting the simulation depth.

[Figure: program f in Graal IR. Block bs evaluates a > b at an If; the merge block bm contains a φ that receives a from bp1 and the constant 2 from bp2, divides x by the φ, and returns the result.]

Figure 6.7: Example program f.

The result of the simulation is the optimization potential for each potential duplication. The entire approach is based on a depth-first traversal15 of the dominator tree16 [16], as outlined in the simulation part of Figure 6.6.

Simulation is only faster than backtracking if it avoids the expensive part of backtracking, namely the real duplication step17. Therefore, DBDS refrains from actually duplicating code and instead uses the dominance relation to act as if a certain merge block had been duplicated.

[Figure: on the left, a CFG in which a control-flow split dominates a merge, which in turn dominates the duplication target; on the right, the CFG after duplicating merge and target into both predecessors.]

Figure 6.8: Duplication effect on the dominance relation. Red arrows denote the dominance relation for the merge block(s) before and after duplication.

We illustrate the general paradigm that is the foundation of DBDS in Figure 6.8, which shows an arbitrary CFG for a program on the left. Before duplication, the merge instruction is dominated by the control-flow split. The merge dominates the target instruction for duplication. After duplicating the merge and the target instruction into both predecessors (as seen on the right), the dominance relation changes: the merge and the target in the respective branch are now dominated by the successor of the split in each branch instead of by the control-flow split. This changes the data flow of the program, allowing the compiler to optimize the target instruction after duplication. Our DBDS algorithm builds upon this idea: instead of actually duplicating code in order to determine if the target instruction is optimizable after duplication, we act as if the merge block were already dominated by one of the predecessor blocks in the respective branches. This way we can simulate how values flow after duplication, without actually changing the program. In the rest of this section we go into the details of the dominator tree traversal and the application of AC(m, pi, oi) to find optimizable candidates.

15 The depth-first traversal of the CFG can be seen on line 13 in function visit in the Appendix in Algorithms 9 and 10.
16 See Chapter 2.
17 See Chapter 4 for details.

We use a simplified depth-first traversal over the dominator tree of the program. In a normal depth-first traversal of the dominator tree, the compiler processes a block before it processes its dominated blocks, and this is done recursively. This means that for every instruction in a program, all inputs of an instruction are visited18 before the instruction itself. Additionally, as every definition must dominate all its usages, the depth-first traversal stack implicitly holds all type, value and condition information of the current basic block's control flow path. This allows our algorithm to use the information of dominating conditions for optimization. Every split in the control-flow graph narrows the information about a dominating condition's operands. For example, an instruction if (a != null) has two successors in the dominator tree: the true and the false branch. In a depth-first traversal of the true branch we know that (a != null) holds. This additional control-flow-sensitive information is used for optimization.
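The following sketch (types and names are ours, not Graal API) shows how the traversal can maintain this path information: conditions are pushed when the traversal descends into a conditional successor and popped when the subtree is left.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    // Sketch only: dominator-tree traversal carrying dominating conditions.
    interface Block {
        List<Block> dominatedBlocks();
        String impliedCondition(); // e.g. "a != null" in the true branch, or null
    }

    final class DominatorTraversal {
        final Deque<String> pathConditions = new ArrayDeque<>();

        void visit(Block block) {
            boolean pushed = false;
            if (block.impliedCondition() != null) {
                // Entering e.g. the true successor of if (a != null): the
                // condition holds for everything this block dominates.
                pathConditions.push(block.impliedCondition());
                pushed = true;
            }
            // ... process the block's instructions using pathConditions ...
            for (Block dominated : block.dominatedBlocks()) {
                visit(dominated);
            }
            if (pushed) {
                pathConditions.pop(); // the fact is invalid outside the subtree
            }
        }
    }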

During the depth-first traversal, every time the compiler processes a basic block bpi which has a merge block successor bm in the CFG (as indicated by the gray background in Figure 6.6), it pauses the current traversal and starts a so-called duplication simulation traversal (DST). DSTs follow the same depth-first traversal order as the normal traversal of a program. The DST re-uses the context information of the paused depth-first traversal and starts a new depth-first traversal at block bpi.

However, in the DST the compiler processes block bm directly after block bpi, as if bpi dominated bm. In other words, it extends the block bpi with the instructions of block bmi. The index i indicates a specialization of bm for the predecessor bpi. This way the compiler simulates a duplication operation: in the original CFG, duplicating instructions from bmi into bpi effectively appends them to the block bpi. As every block trivially dominates itself, the first instruction of bpi dominates its last instruction and therefore also the duplicated ones. Consider the sample program f in Figure 6.7 and its dominator tree in Figure 6.9.

18 See SSA in Chapter 2.

[Figure: the dominator tree of f; bs directly dominates bp1, bp2, and bm.]

Figure 6.9: Program f dominator tree.

Program f consists of four basic blocks: the start block bs, the true branch bp1, the false branch bp2 and the merge block bm. The compiler simulates a duplication operation by starting two DSTs at blocks bp1 and bp2. This can be seen in Figure 6.10 by the dashed arrows. It processes bmi in both DSTs as if it were dominated by bpi.

[Figure: the dominator tree of f extended with simulation edges; the DSTs process bm1 below bp1 and bm2 below bp2.]

Figure 6.10: Program f Duplication Simulation.

The compiler performs each DST until the first instruction after the next merge or split instruction19.

19 This is a restriction of Graal's duplication implementation. The current algorithm to perform tail duplication cannot duplicate over post-dominating merges.

During the DSTs the compiler needs to determine which optimizations are enabled after duplication. This represents the implementation of the AC(m, pi) step from Chapter 4. Therefore, we split our optimization phases into two parts: the precondition and the action step. This scheme was presented by Chang et al. [26]. The precondition is a predicate that holds if a given IR pattern can be optimized. The associated action step performs the actual transformation.

The precondition is a per-optimization implementation of the entire AC(m, pi) step. Based on the preconditions we derive boolean applicability checks that determine if a precondition holds on a given IR pattern for a single optimization oi. We denote this as AC(m, pi, oi). We build AC(m, pi, oi) and action steps for all optimizations presented in Sections 6.1.1 to 6.1.5. We denote the action step as Opt(m, pi). Additionally, we modify Opt(m, pi) to not change the underlying IR but to return new (sub)graphs containing the result of the optimization. We use the result of the Opt(m, pi) steps to estimate a peak-performance increase and a code-size increase as explained in Chapter 5. We compare the resulting IR nodes of the Opt(m, pi) step with the original IR. There are several possible results of the Opt(m, pi) step:

• Empty: Opt(m, pi) is able to eliminate the node(s).


• New Node: Opt(m, pi) returns new nodes that represent the semantics of the old ones and will replace them.

• Redundant Node: Opt(m, pi) deduces that an instruction in a dominating block has already computed the same result, so the redundant node can be eliminated.

Based on this comparison we compute a numeric peak performance increase estimation by using the node cost model from Chapter 5 for each IR node. The node cost model returns a cycle estimation as well as the node size as a code-size estimation for each IR node. From the difference between the original and the optimized IR we compute a cycles saved (CS) measure which tells us if a given Opt(m, pi) step might increase peak performance. We compute the code-size increase in a similar fashion.
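A minimal sketch of this estimation, assuming a node cost model interface in the spirit of Chapter 5 (all names are ours):

    import java.util.List;

    // Sketch only: benefit and size estimation over sets of IR nodes.
    final class CostEstimator {
        interface Node {
            int estimatedCycles(); // cycle estimate from the node cost model
            int estimatedSize();   // size estimate from the node cost model
        }

        static int cyclesSaved(List<Node> original, List<Node> optimized) {
            return sumCycles(original) - sumCycles(optimized);
        }

        static int codeSizeDifference(List<Node> original, List<Node> optimized) {
            return sumSize(optimized) - sumSize(original);
        }

        private static int sumCycles(List<Node> nodes) {
            return nodes.stream().mapToInt(Node::estimatedCycles).sum();
        }

        private static int sumSize(List<Node> nodes) {
            return nodes.stream().mapToInt(Node::estimatedSize).sum();
        }
    }

For the division example discussed later in this section, cyclesSaved would evaluate to 32 − 1 = 31.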

As stated, we want to avoid copying any code during a DST. However, the code in bm still contains ϕ instructions rather than the inputs of those ϕs on the respective branch. Therefore, we introduce the concept of so-called synonym mappings. A synonym map maps a ϕ node to its input on the respective DST predecessor of bmi. Before the compiler descends into bmi, it creates the synonym map for all ϕ instructions in bmi based on their inputs from bpi. Such a mapping can be seen in Figure 6.11, which shows the algorithm during DBDS for both DSTs of the blocks bs → bp1 → bm1 and bs → bp2 → bm2.

The synonym-of relation in Figure 6.11 shows a mapping from the constant 2 to the ϕ node.

In DST1: bs → bp1 → bm1 the synonym map holds the value a as a synonym for the ϕ node on the predecessor bp1. In DST2: bs → bp2 → bm2 the synonym map holds the constant 2 for the ϕ on the predecessor bp2, and a synonym for the division node, which can be optimized to a shift after duplication. This is the second use case for the synonym maps: holding new nodes returned from Opt(m, pi). These new nodes are registered as synonyms for the original ones and propagated as replacement nodes for subsequent invocations of Opt(m, pi).
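The synonym mechanism can be sketched as a simple map with chained lookup (names are ours, not Graal API):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch only: synonym mapping used during a DST.
    final class SynonymMap {
        private final Map<Object, Object> synonyms = new HashMap<>();

        // Before descending into bm_i: map each phi of bm to its input on bp_i.
        void registerPhiInput(Object phi, Object inputOnPredecessor) {
            synonyms.put(phi, inputOnPredecessor);
        }

        // When Opt(m, pi) produces a replacement node, it becomes a synonym.
        void registerOptimizedNode(Object original, Object replacement) {
            synonyms.put(original, replacement);
        }

        // Resolve a node to its current synonym (following chains), so that
        // AC(m, pi, oi) sees values as they would be after duplication.
        Object resolve(Object node) {
            Object current = node;
            while (synonyms.containsKey(current)) {
                current = synonyms.get(current);
            }
            return current;
        }
    }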

During the simulation traversal the compiler iterates over all instructions (nodes) of the block bm and applies AC(m, pi, oi) to them. If the AC(m, pi, oi) of an optimization triggers, the compiler performs the associated action step Opt(m, pi) of the optimization, computes the cycles saved (CS) and saves them in a data structure associated with the block pair <bpi, bmi>. Additionally, the compiler saves new IR nodes as synonyms for the original nodes. In order for AC(m, pi, oi) to work with the synonym mappings, AC(m, pi, oi) implementations must accept replacement nodes, i.e., they must answer the question whether a node is optimizable under the assumption that its original inputs are replaced by their synonyms. We outline this idea in Listing 6.2, which shows a simplified version of the AC(m, pi, oi) and Opt(m, pi) steps for a constant folding [7] optimization20 of arithmetic nodes. The AC(m, pi, oi) pseudo code in Listing 6.2 returns true if both inputs (the replacement inputs x and y) are constant, or if the operation is a division and y is 0, because in this case the entire operation will be folded to a deoptimization21.

20 Denoted as acArithConstFold in Listing 6.2.

[Figure: the IR of f during both DSTs. At the end of DST1 (bs → bp1 → bm1) the synonym map is {φ → a}; at the end of DST2 (bs → bp2 → bm2) it is {φ → 2, Div → >>}, since x / 2 can be strength-reduced to x >> 1.]

Figure 6.11: Example during duplication simulation. Orange arrows indicate the DSTs and how the traversal acts as if bm were dominated by a predecessor bpi. For both DSTs we also present the synonym maps at the end of the traversal. Gray nodes in the DSTs in bmi indicate that they were not created during Opt(m, pi) but were part of the IR graph before the simulation.

The Opt(m, pi) step22 in Listing 6.2 returns a new constant node which holds the constant-folded result of both constant operands x and y. Additionally, the Opt(m, pi) step also models the division-by-zero case and returns a deoptimization node in that case.

It could be argued that implementing AC(m, pi, oi) and Opt(m, pi) for all instructions/nodes and optimizations in a compiler implies a tremendous amount of engineering in order for DBDS to function properly. However, this is not the case, because it often requires just API changes. Consider the last part of Listing 6.2, where the constant folding optimization of a single addition node is outlined in lines 24-30: existing optimizations only have to be adapted slightly in order to re-use AC(m, pi, oi) and Opt(m, pi) for the real optimizations.

21 Graal does not handle division-by-zero exceptions in the generated code unless the division is in the fast path of a method, which we deliberately ignore for the sake of simplicity in this example.
22 Denoted as optArithConstFold in Listing 6.2.

    1  static boolean acArithConstFold(ArithNode a, Node x, Node y) {
    2      boolean xConst = x.isConstant();
    3      boolean yConst = y.isConstant();
    4      if (a.operation() == Division && yConst && y.asConstant() == 0) {
    5          return true;
    6      }
    7      return xConst && yConst;
    8  }
    9  static Node optArithConstFold(ArithNode a, Node x, Node y) {
    10     Operation op = a.operation();
    11     if (op == Division && y.asConstant() == 0) {
    12         return new DeoptimizeNode(DivByZero);
    13     }
    14     int constValX = x.asConstant();
    15     int constValY = y.asConstant();
    16     return new ConstantNode(op.fold(constValX, constValY));
    17 }
    18 ...
    19 class AddNode extends ArithNode {
    20     ...
    21     Node x; // x input of addition
    22     Node y; // y input of addition
    23     ...
    24     Node constantFold() {
    25         if (acArithConstFold(this, this.x, this.y)) {
    26             return optArithConstFold(this, this.x, this.y);
    27         }
    28         // no constant folding possible
    29         return this;
    30     }
    31     ...
    32 }

Listing 6.2: AddNode constant folding AC(m, pi, oi) & Opt(m, pi).

They just need to outsource the logic that checks if an optimization is possible into a dedicated method for the AC(m, pi, oi) step and the actual optimization into a method representing the Opt(m, pi) step23.

As in the original example from Figure 6.7, Figure 6.11 shows the IR and data structures of the algorithm during the traversal. The compiler saves value and type information for each involved IR node and updates it regularly via the synonym mapping. Eventually, it iterates over the division operation (x / ϕ) in bm2 and applies a strength reduction [7] AC to it. Since the AC (which is an implementation of AC(m, pi, oi) for the DivNode) returns true, the compiler performs the action step, which returns a new instruction (x >> 1) that is saved as a synonym for the division node. The node cost model yields that the original division needs 32 cycles to execute while the shift only takes 1 cycle. Therefore, the cycles saved (CS) is computed as 32 − 1 = 31, i.e., we estimate that performing the duplication and the subsequent optimizations reduces the run time of the program by 31 cycles.

23 In our implementation we performed these API changes in Graal for all optimizations presented in Section 6.1, many of which already supported a kind of AC(m, pi, oi) step. See Section 6.2.2 for details.

For completeness, we illustrate the optimized program f in Figure 6.12, which shows that all optimizations found during simulation are indeed performed after duplication.

[Figure: program f after duplication and optimization; bs still evaluates a > b, bp1 returns x / a, and bp2 returns x >> 1 instead of x / 2.]

Figure 6.12: Example after duplication.

The result of the simulation tier is a list of simulation results capturing the code-size effect and the optimization potential of each possible duplication in the compilation unit24. Then the trade-off tier checks if a simulation result is promising enough and, if so, performs the duplication and the subsequent optimization.

6.2.2 AC(m, pi, oi) in Graal

Graal already supported a limited form of ACs through the Canonicalizable interface, which implements simple optimizations such as constant folding and strength reduction as operations on IR nodes. We extended those ACs with the ones presented in Sections 6.1.1 to 6.1.5. Based on the result of the AC(m, pi, oi) step, the simulation result data structure in Algorithm 9 is updated with the associated cyclesSaved and codeSize estimations.

6.3 Trade-off Functions

We want to avoid code explosion and unnecessary duplication transformations that do not result in optimization potential. Thus, we consider the benefits of the duplication candidates discovered during the simulation tier. Based on their optimization potential (benefit) and their cost, we select the most promising ones for duplication. This can be seen in the middle part of Figure 6.6. We take the candidates from the simulation tier and sort them by benefit, cost and probability. We then decide for each candidate if it is beneficial enough to perform the associated duplication.

The decision is made by a trade-off function that tries to maximize peak performance while minimizing code-size increase. The trade-off function is based on cost and benefit.

24 See the trade-off part in the middle of Figure 6.6.

We formulate it as the boolean function shouldDuplicate(bpi, bm, benefit, cost), which decides if a given duplication transformation should be performed. All duplication candidates for which shouldDuplicate(bpi, bm, benefit, cost) returns true are passed to the optimization tier. The optimization tier performs the duplication of bm into bpi and performs all Opt(m, pi) steps on the result. Note that regarding compile time it would be better to perform the Opt(m, pi) steps on all duplicated pairs <bm, bpi> only after all duplications have been made. However, an Opt(m, pi) step can significantly change the CFG of a program, e.g., if the AC(m, pi, oi) of a conditional elimination (Section 6.1.3) triggers. In such cases, simulation results that became invalid have to be dismissed, because their associated parts of the CFG are no longer semantically equivalent. This is a general problem of optimizations, including duplication: the number of possible orderings of optimizations and subsequently enabled ones grows exponentially. Thus, it is infeasible for a compiler to model the entire possible transformation space. We consider this problem interesting for future work, where we plan to explore the simulation of subsequent duplications. Our simulation-tier implementation of DBDS is capable of recursively simulating subsequent duplications; however, we currently do not consider this in our trade-off and optimization tiers.

The most important part of our algorithm is the trade-off tier, as it decides what to duplicate. We developed a function shouldDuplicate(bpi, bm, benefit, cost) that decides, for one duplication at a time, if it should be performed. During the development of the DBDS algorithm, three factors turned out to be most important for the trade-off decision:

1) Compilation Unit Maximum Size: HotSpot limits the size of the installed code per compilation unit.25 Therefore, we cannot increase code size arbitrarily.

2) Code-Size Increase Budget: In theory, code-size increase is only limited by the number of duplication opportunities. However, we do not want to increase the code size beyond a certain threshold. More code always increases the workload for subsequent compiler phases.

3) Probability: We decided to use profiling information to guide duplication. We compute the product of the relative probability of an instruction with respect to the entire compilation unit and the estimated cycles saved.

We sort duplication candidates based on these values and optimize the most likely and most beneficial ones first.

25 This is configurable with the option -XX:JVMCINMethodSizeLimit, which is 655360 bytes by default.

Based on the above observations we derived the following trade-off heuristic:

    c  ... cost (in node cost model size)
    b  ... benefit (in cycles saved)
    p  ... probability of bpi
    cs ... compilation unit size (in node cost model size)
    is ... compilation unit initial size
    IB ... code-size increase budget = 0.5
    MS ... maximum compilation unit size
    BS ... benefit scale factor = 128

    shouldDuplicate := (b × p × BS) > c ∧ (cs < MS) ∧ (cs + c < is × IB)

We decide upon the value of duplications by computing their cost/benefit ratio. We allow the cost to be up to 128× higher than the probability-weighted benefit. We derived the constant 128 (the benefit scale factor BS) through empirical evaluation; it turned out to generate the best results for the benchmarks presented in Chapter 8. The increase budget IB (currently set to 0.5, i.e., 50%) represents the maximum code-size increase. Note that the values of BS and IB have a significant impact on the compile time and the performance of the entire duplication optimization. In addition to the overall performance evaluation, we also present different values for IB in Chapter 8 by testing different budget configurations.
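A direct transcription of the heuristic above into Java might look as follows; the parameter and constant names are ours, and the three conjuncts correspond one-to-one to the formula as given:

    // Sketch only: the shouldDuplicate trade-off function.
    final class TradeOff {
        static final double IB = 0.5; // code-size increase budget (50%)
        static final int    BS = 128; // benefit scale factor

        static boolean shouldDuplicate(double benefit, int cost, double probability,
                                       int unitSize, int initialSize, int maxUnitSize) {
            return (benefit * probability * BS) > cost   // scaled benefit beats cost
                    && unitSize < maxUnitSize            // HotSpot code-size limit
                    && unitSize + cost < initialSize * IB; // budget check as given above
        }
    }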

Chapter 7

Fast-Path Loop Unrolling of Non-Counted Loops

In this chapter we present fast-path loop unrolling of non-counted loops, the second implementation of the simulation-based duplication scheme we propose. We first motivate the need to optimize non-counted loops, as they are commonly ignored under the premise that they are difficult to optimize and unimportant for performance. We then present optimization opportunities that arise after unrolling them via duplication. Finally, we show how a compiler can apply simulation in order to estimate the impact of loop unrolling in terms of code size and peak performance.

Generating fast machine code for loops depends on a selective application of different loop optimizations to the main optimizable parts of a loop: the loop's exit condition(s), the back edges and the loop body. All these parts must be optimized to generate optimal code for a loop. Therefore, a multitude of loop optimizations has been proposed to reduce iteration counts, remove loop conditions [7], hoist invariant computations out of loop bodies [29], vectorize instructions [64], reverse iteration spaces and schedule loop code to utilize pipelined architectures [2].

7.1 Counted-loop Unrolling

In order to motivate the problems and challenges of unrolling non-counted loops, we first give a brief overview of counted-loop unrolling. Loop unrolling is a classical compiler optimization performed by nearly every optimizing compiler. The idea is simple, and we illustrate it in Listing 7.1. Counted loops evaluate their exit condition in every iteration of the loop. At the end of the body the induction variables are incremented and the generated code jumps to the loop header and evaluates the condition with the new induction variables.

(a) Counted loop:

    for (i = 0; i < 100; i++) {
        use(i)
    }

(b) Unrolled loop:

    for (i = 0; i < 100; i = i + 4) {
        use(i)
        use(i + 1)
        use(i + 2)
        use(i + 3)
    }

Listing 7.1: Counted loop unrolling. bounds of the loop, i.e., start and end values as well as the increment for all induction variables, it can unroll iterations of a loop by duplicating the body of the loop and appending it after each other. We can see such an unrolling in Listing 7.1b: The compiler unrolled 4 iterations of this loop by duplicating the body and using the respective values of i in between. It can do so, as it knows the bounds of the loop and increments. During the unrolled iteration there is no need to re-check the condition every time again, since the compiler knows that only in the last iteration the condition will fail. Thus, it can remove intermediate trip checks. This is also one of the main advantages of unrolling: the compiler can remove intermediate trip checks which results in less executed code, larger basic blocks for the loop body and better data locality if the body of a loop is accessing consecutive memory. However, a compiler can only remove intermediate trip checks if the unrolled loop is counted. We will focus on this dilemma in the rest of this chapter.

7.2 A Word on Unrolling Non-Counted Loops . . .

Non-counted loops [43], that is, loops for which a compiler cannot statically reason about induction variables and iteration count, are less amenable to optimization than counted loops. Optimizing such loops could potentially lead to large performance gains, given that many programs contain non-counted loops.

Java programs can contain non-counted loops. However, state-of-the-art compilers do not aggressively optimize them, since unrolling non-counted loops often involves duplicating a loop's exit condition as well, which thus only improves run-time performance if subsequent compiler optimizations can optimize the unrolled code.

As a generic way to optimize non-counted loops, we propose to unroll them without attempting to remove their exit conditions. However, this does not necessarily improve performance, as the number of loop exit condition checks is not reduced and existing compiler optimizations often cannot optimize them away. Therefore, an approach is required that determines other optimization opportunities enabled by unrolling a non-counted loop. We identified such optimization opportunities already for regular control flow merges in Chapter 6, showing significant increases in run-time performance if applied selectively. Loop unrolling is in fact a sequential code duplication transformation: the body of a loop is duplicated in front of itself. We developed an algorithm to duplicate the body of a non-counted loop, thus effectively unrolling the loop and enabling subsequent compiler optimizations. To do so, we developed a set of simulation-based unrolling strategies that analyze a loop for compiler optimizations enabled by loop unrolling.

7.2.1 Non-Counted Loop Construct

(a) Non-numerical exit condition:

    while (mem[...] != null) {
        ...
    }

(b) Loop body aliasing with condition:

    while (mem[x] > a) {
        ...
        // x and y might alias
        mem[y] = ...
        ...
    }

(c) Incompatible induction variables:

    while (a > 0) {
        if (...) {
            a++;
        } else {
            a -= b;
        }
    }

Listing 7.2: Non-Counted loops.

Non-counted loops are loops for which the number of loop iterations can be determined neither at compile time nor at run time. This can be due to several reasons; the most common ones are shown in Listing 7.2: the loop exit conditions are not numerical1, the loop body contains side-effecting instructions2 aliasing with the loop condition3, or loop analysis [208] cannot guarantee that the loop contains induction variables that lead to the termination of the loop4.

1 See Listing 7.2a. Counted loops, as their name suggests, always count from a start value to an end value, and the exit condition is a boolean expression over one of the following ordering relations: <, <=, ==, !=, >, >=. While condition expressions that do not involve an ordering relation can never be used in a counted loop, a compiler cannot automatically assume a loop is counted just because the condition expression has one of the previous forms. E.g., Listings 7.2b and 7.2c cannot be deduced to be counted, even though their exit conditions are mathematical ordering relations.
2 The term side-effect describes operations that can have an observable effect on the execution system / environment. Such operations include function calls, memory writes, sun.misc.Unsafe [130] usage, native calls and object locking [91]. In this thesis we typically use the term side-effect to describe operations that (potentially) write to the Java heap.
3 See Listing 7.2b.
4 See Listing 7.2c.

We informally observed that a significant amount of Java code contains non-counted loops. To further test this hypothesis, we instrumented the Graal compiler to count the number of counted and non-counted loops5 in the Java DaCapo [11] benchmark suite. The results in Table 7.1 demonstrate that non-counted loops are more frequent than counted loops, which provides high incentives for compiler developers to optimize them.

    Benchmark         # Counted Loops   # Non-Counted Loops   % Non-Counted Loops
    lusearch          219               449                   67%
    jython            1779              3333                  64%
    h2                538               4473                  89%
    pmd               1456              3379                  69%
    avrora            169               593                   77%
    luindex           364               660                   64%
    fop               653               1289                  66%
    xalan             510               1022                  66%
    batik             723               1857                  71%
    sunflow           249               389                   59%
    Arithmetic Mean                                           69.8%

Table 7.1: Number of counted and non-counted loops in the Java DaCapo benchmark suite.

Loop unrolling [43; 44; 45; 50; 94; 173] of non-counted loops [43] is only possible in a general way if a loop's exit conditions are unrolled together with the body of the loop. Previous work proposed to avoid unrolling non-counted loops together with their exit conditions; e.g., Huang and Leng [94] used the weakest pre-condition calculus [48] to insert a loop body's weakest pre-condition into a given loop's initial condition to maintain the loop's semantics after unrolling it. However, side-effects cause problems for optimizing compilers when trying to unroll non-counted loops. If the body of a loop contains side-effecting instructions, deriving the weakest pre-condition requires modeling the state of the virtual machine (VM), which we consider impractical for dynamic compilation. As the compiler cannot generally infer the exact number of loop iterations for non-counted loops, it cannot speculatively unroll them and generate code containing side-effects that are unknown. Removing intermediate trip checks that potentially interfere with the body of the loop is generally not possible. This is a general problem, as side-effects and multi-threaded applications6 prevent compilers in managed execution systems like the Java virtual machine (JVM) from statically reasoning about the state of the memory, thus preventing the implementation of generic loop unrolling approaches.

Consider the code in Listing 7.3. The loop's condition uses the value in memory at index x. However, the body of the loop contains a memory write at index y. If alias analysis [51; 122] fails to prove that x and y do not alias, there is no way, without duplicating the exit condition as well, to unroll the loop without violating the memory effect established by the read-after-write dependency of the read in iteration n on the write in iteration n − 1.

5 Graal's loop optimizations analyze every loop during compilation and try to deduce whether it is counted. This is done because many optimizations require additional information for counted loops. We used this analysis and instrumented the compiler to record how many of all compiled loops are counted. We counted the loops after inlining because the number of loop iterations of counted loops is often only inferable after this optimization.
6 Unknown side-effects are less of a problem in single-threaded systems, as the compiler can optimize global memory under the assumption that it is never changed by other threads.

    while (mem[x] != null) {
        mem[y] = ...
    }

Listing 7.3: Non-counted loop with side effects.

Unrolling the loop requires the compiler to respect the memory constraints of the loop. Traditional loop unrolling would duplicate the body of the loop and append it to the original body. We illustrate the unrolling of one iteration of the loop from Listing 7.3 in Listing 7.4. However, unrolling this loop as seen in Listing 7.4 violates the memory constraints of the loop and executes code that would not have been executed in the original version: the first iteration of the loop can write to mem[y]; if x and y alias, this causes the next check of the loop condition to evaluate to false, effectively exiting the loop. In the unrolled version of Listing 7.3, as seen in Listing 7.4, the loop would execute another iteration before exiting. Therefore, for such loops, a compiler cannot remove intermediate trip checks during unrolling.

    while (mem[x] != null) {
        mem[y] = ...
        // unrolled body; y and x might alias, thus mem[y] could be null after the assignment
        mem[y] = ...
    }

Listing 7.4: Non-counted loop with side effects after unrolling. Unrolling violates the memory constraints of the original loop from Listing 7.3.

    while (mem[x] != null) {
        mem[y] = ...
        if (mem[x] != null) {
            // unrolled body; y and x might alias, thus mem[y] could be null after the assignment
            mem[y] = ...
        }
    }

Listing 7.5: Non-counted loop with side effects after correct unrolling.

The only proper unrolling of Listing 7.3 is shown in Listing 7.5, where the compiler also duplicates the trip check during unrolling.

[Figure: memory dependency graph over two iterations of the loop from Listing 7.3; within each iteration a write-after-read dependency leads from the read of mem[x] to the write of mem[y], and a read-after-write dependency leads from the write of mem[y] in iteration 0 to the read of mem[x] in iteration 1.]

Figure 7.1: Possible side-effects in Listing 7.3: the directed arrows show memory dependencies between read and write operations in two iterations of the loop.

We again want to emphasize the memory effects of the loop: Figure 7.1 shows the memory dependencies of the loop from Listing 7.3 in a graph illustrating memory reads and writes over two loop iterations. It shows that the condition cannot be re-ordered with the loop's body without violating the memory constraints of the loop. Therefore, a compiler cannot derive a weakest pre-condition with respect to the body of the loop, as the post-condition of the loop body is unknown and the effect of the body of the loop on the state of the VM is unknown.

We propose to unroll non-counted loops with code duplication [120]. Duplicating the loop body together with its loop condition enables the unrolling of general loops. We can represent every loop as a while (true) loop with the exit condition moved into the loop body7.

    while (true) {
        if (...) {
            body
        } else {
            break
        }
    }

Listing 7.6: General loop construct. be unrolled by duplicating the body of the loop and inserting it before the initial loop body. Listing 7.7 shows the example from Listing 7.6 after unrolling one iteration. When applied to the source example from Listing 7.3, the unrolled source loop looks like Listing 7.8. In contrast to general loop unrolling [7], we cannot remove the second exit path (lines 3, 5, 6) of the loop as it is non-counted. Unrolling loops via duplication does not create traditional unrolling optimization opportunities, because loops are unrolled with their loop exit conditions. Therefore, there is no reduction in execution time due to a reduced number of loop exit condition checks. We propose to tackle this issue with the simulation-based scheme (proposed in Chapter 4) enabling other optimizations via unrolling. Loop unrolling can enable subsequent compiler optimizations in the

7 See Listing 7.6.

    while (true) {
        if (...) {
            body unrolled
            if (...) {
                original body
            } else {
                break
            }
        } else {
            break
        }
    }

Listing 7.7: General loop construct unrolled.

    1 while (mem[x] != null) {
    2     mem[y] = ...
    3     if (mem[x] != null) {
    4         mem[y] = ...
    5     } else {
    6         break;
    7     }
    8 }

Listing 7.8: Side-effect-full loop after unrolling. same way as dominance-based duplication simulation (Chapter 6). This enables optimizations such as constant folding, strength reduction [7], conditional elimination [182], and so forth. To determine the effects of unrolling a loop on the optimization potential of the compilation unit, we simulate the unrolling prior to the actual transformation; therefore, we can precisely find all profitable unrolling candidates and decide which ones are profitable enough to be unrolled. 94 Fast-Path Loop Unrolling of Non-Counted Loops

7.3 Optimization Opportunities

In this section we present some of the most important optimization opportunities that we consider for non-counted loop unrolling, namely safepoint poll reduction, canonicalization8, and loop-carried dependencies.

7.3.1 Safepoint Poll Reduction

HotSpot's [93] execution system relies on safepoints [41; 123] to perform operations that require all mutator threads to be in a well-defined state with respect to heap accesses. If the JVM requests a safepoint, it marks a dedicated memory page as non-readable9. Mutators periodically poll this safepoint page and segfault in case the VM requested a safepoint. The JVM handles the segfault using a segfault handler registered for that purpose, which then invokes the safepoint handler that performs the desired operation. Typical safepoint operations are garbage collections [113], class re-definitions [217], lock unbiasing, and monitor deflation. As the safepoint protocol is cooperative, it forces the code generated by the JIT compilers to periodically poll the safepoint page (the interpreter polls it after every interpreted bytecode). Generated code typically performs these polls at specific program locations:

• Method returns: Safepoint polls are performed at every method return.

• Loop back edges: Safepoint polls are performed at every loop back edge.

Safepoint polls inside loops typically impose many problems on optimizing compilers. They require compilers to balance run-time performance against the latency of safepoint operations. Optimizing compilers in the JVM aim to generate fast code; a safepoint poll in a hot loop is an additional memory read in every iteration, which leaves space for further optimization. For example, the safepoint poll can be optimized away if the iteration count of a loop is known and low enough not to have a negative effect on application latency. However, if a loop is long-running and the safepoint poll on the back edge is optimized away, the entire VM may be forced to wait for the loop to finish until the next safepoint is hit. In the worst case a full garbage collection (GC) cannot be performed because a mutator loop does not poll the safepoint page on its back edge. Stalling a full GC can crash the VM as it can, for example, run out of memory. There are multiple solutions to this dilemma; for example, the Graal compiler currently removes safepoint polls on a loop's back edge if it can prove that the loop is counted and the loop's iteration count is in the range of a 32-bit integer. For non-counted loops and for loops in the range of 64-bit integers (Java long), Graal only removes safepoint polls on back edges if the path leading to the back edge is guaranteed to perform a safepoint poll, for example, via a method call.

8 See Section 6.1.1.
9 Note that this mechanism is for Unix-based operating systems.

Otherwise, Graal does not remove safepoint polls for non-counted loops, which leaves further optimization opportunities. Consider the code in Listing 7.9. The non-counted loop has two back edges, one from the true and one from the false branch. The compiler removes the safepoint poll on the false branch, as there is a call inside which will poll on return. However, for the true branch the compiler does not remove the safepoint poll. Thus, if the true branch is the fast path, one additional memory read is performed in every iteration of the loop. There are multiple ways to optimize this safepoint poll without violating the implications of the safepoint protocol. The simplest solution is to unroll the loop u times, which reduces the number of safepoint polls from n to n/u.

    while (...) {
        if (...) {
            // fast-path
            // no call
            ...
            safepointPoll;
        } else {
            // safepoint poll on call return
            call();
        }
    }

Listing 7.9: Unrolling opportunity: Safepoint poll reduction.
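To illustrate, the following sketch (ours, in the pseudo-code style of Listing 7.9) shows the shape of that loop after unrolling once (u = 2): the fast path of two consecutive iterations performs a single safepoint poll, halving the poll count without violating the safepoint protocol.

    while (...) {
        if (...) {
            // fast-path, first unrolled iteration: poll elided
            ...
        } else {
            call();            // safepoint poll on call return
        }
        if (...) {             // duplicated loop condition
            if (...) {
                // fast-path, second iteration
                ...
                safepointPoll; // one poll covers both fast-path iterations
            } else {
                call();        // safepoint poll on call return
            }
        } else {
            break;
        }
    }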

7.3.2 Canonicalization

Instructions having loop ϕs [39] as inputs are potentially optimizable [120]. They can often be optimized by replacing their ϕ input with one of the ϕ's inputs. To optimize loop ϕs we simulate the unrolling of one iteration by checking if an instruction is optimizable under the assumption that it has the back edge input of a loop ϕ as input instead of the ϕ itself. Listing 7.10a shows a simple loop that follows this pattern. The loop ϕ loopPhi has three inputs: the unknown value a on the forward predecessor, the constant 0 on the first back edge and the unknown value b on the second back edge. In the body of the loop we check if (loopPhi == 0), which cannot be proven true in general. After unrolling the loop once, the loop ϕ has four inputs instead of three; one back edge was added by the unrolled loop body. Line 15 in Listing 7.10b was back edge 1 in the body of the original loop (Listing 7.10a). However, peeling the fast path of the loop in front of the original loop replaces the loop ϕ with the constant 0, which was the ϕ's value on the original back edge 1. Therefore, the check if (loopPhi == 0) in the original iteration of the loop becomes if (0 == 0): the if instruction can be eliminated and the call to doSth can be executed unconditionally.

(a) Before unrolling:

    1  loopPhi = a
    2  while (c)
    3  // loop phi: 3 inputs
    4  //   forward predecessor: a
    5  //   back edge 1: 0
    6  //   back edge 2: b
    7  loopPhi = ϕ(a, 0, b)
    8  {
    9      if (loopPhi == 0) {
    10         doSth();
    11     }
    12     if (...) {
    13         // back edge 1
    14         loopPhi = 0
    15         continue;
    16     } else {
    17         // back edge 2
    18         loopPhi = b
    19         continue;
    20     }
    21 }

(b) Loop after unrolling. The inner if (c) block corresponds to the original iteration of the loop (grey-shaded in the original figure). Line 17 shows the enabled optimization opportunity. Lines 14-15 show the replacement of a loop ϕ with the back edge value along the fast path:

    1  loopPhi = a
    2  while (c)
    3  // loop phi: 4 inputs
    4  //   forward predecessor: a
    5  //   back edge 1: 0
    6  //   back edge 2: b
    7  //   back edge 3: b
    8  loopPhi = ϕ(a, 0, b, b)
    9  {
    10     if (loopPhi == 0) {
    11         doSth();
    12     }
    13     if (...) {
    14         // original back edge 1: 0 replaces loopPhi
    15         loopPhi = 0
    16         if (c) {
    17             if (loopPhi == 0 /* 0 == 0 */) {
    18                 doSth();
    19             }
    20             if (...) {
    21                 // back edge 1
    22                 loopPhi = 0
    23                 continue;
    24             } else {
    25                 // back edge 2
    26                 loopPhi = b
    27                 continue;
    28             }
    29         } else {
    30             break;
    31         }
    32     } else {
    33         // back edge 3
    34         loopPhi = b
    35         continue;
    36     }
    37 }

Listing 7.10: Unrolling opportunity: canonicalization.

7.3.3 Loop-Carried Dependency

Loop-carried dependencies [27; 190] are dependencies between different iterations of a loop. Typically, they appear in array access patterns inside loops. Consider the example in Listing 7.11.

    while (...) {
        a[i] = a[i] * a[i + 1]
    }

Listing 7.11: Unrolling opportunity: Loop Carried Dependency.

In every iteration of the loop, we read the two array locations a[i] and a[i+1]. Unrolling one iteration of this non-counted loop allows us to avoid re-reading a[i+1] in the next iteration: in the first iteration, we read a[0] and a[1]; in the second iteration, we read a[1] and a[2]. If we unroll this loop once and there is no aliasing side-effect in the body of the loop, we can eliminate one redundant load.
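The following sketch (ours, in the pseudo-code style of Listing 7.11, assuming i advances by one per original iteration) shows the unrolled loop: the value loaded for a[i + 1] in the first copy is re-used as the a[i] of the second copy.

    while (...) {
        t1 = a[i]
        t2 = a[i + 1]
        a[i] = t1 * t2
        if (...) {             // duplicated exit condition
            t3 = a[i + 2]
            a[i + 1] = t2 * t3 // t2 re-used instead of re-reading a[i + 1]
        } else {
            break
        }
        // i is advanced by 2 per unrolled iteration
    }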

We implemented a simulation-based unrolling analysis that estimates the effects of loop unrolling and combined it with a read elimination optimization to determine if there are loop-carried dependencies in the original loop body that can be optimized.

In order to find loop-carried dependencies, we propose to compute a superblock [96]10 through the loop. For each memory location written and read in the superblock, the compiler can determine the associated instruction. It maps memory reads to virtual registers and tracks their values over all instructions in the superblock. The compiler first iterates over the superblock and records all memory reads and writes. Then, it iterates over all loop ϕ instructions of the loop and establishes a mapping from loop ϕ values to their back edge values in the superblock11. Finally, the compiler iterates over the superblock again, replacing the ϕ node inputs in the loop header according to the established back edge value mapping. For every instruction in the superblock, we try to eliminate memory accesses by re-using an already computed value. We repeat this process until no further eliminations are possible. Figure 7.2 illustrates our read elimination simulation. We track the values read from memory in registers and update them (as well as the memory locations) according to the instructions seen. We then replace i with i + 1, which is the value of the ϕ on the back edge, and repeat the simulation for the loop body.

10 We include structured control-flow diamonds if profiling information indicates that a split's successor paths are taken with equal probability.
11 The superblock ends in a loop back edge.

[Figure: read elimination simulation for a[i] = a[i] * a[i + 1]. In iteration i, a[0] and a[1] are read into registers r1 and r2, and r1 * r2 is written to a[0]; in iteration i + 1, the read of a[1] re-uses r2 and only a[2] has to be read into r3 before r2 * r3 is written to a[1].]

Figure 7.2: Loop-carried dependency read elimination simulation.

If we find a read or a write that is redundant, we use a static performance estimator and record the estimated run-time cycles saved for this instruction. We then trade off the cycles saved (via unrolling) against the overall size of the superblock and perform the unrolling if the benefit is sufficiently high12.
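A minimal sketch of this simulation as a location-to-value map (all names are ours; real IR memory locations are of course not strings):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch only: tracks memory locations seen in the superblock; a read
    // whose location already has a tracked value is redundant.
    final class ReadEliminationSimulation {
        private final Map<String, String> memToValue = new HashMap<>();
        private int cyclesSaved = 0;

        void simulateRead(String location, String resultValue, int readCost) {
            String known = memToValue.get(location);
            if (known != null) {
                cyclesSaved += readCost; // redundant load: re-use the known value
            } else {
                memToValue.put(location, resultValue);
            }
        }

        void simulateWrite(String location, String value) {
            // A known write updates the tracked value; an unknown side-effect
            // would instead have to invalidate the whole map.
            memToValue.put(location, value);
        }

        int estimatedCyclesSaved() {
            return cyclesSaved;
        }
    }

For Listing 7.11, after the ϕ mapping replaces i with i + 1 on the back edge, the simulated read of a[i + 1] in the second iteration hits the value tracked in the first iteration and is recorded as saved cycles.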

7.4 Fast-Path Unrolling of Non-Counted Loops

In this section, we present fast-path loop unrolling, our approach to simulation-based loop unrolling of non-counted loops. First, we propose a novel algorithm to perform loop unrolling via duplication and peeling, called fast-path loop creation, which simulates the unrolling of a loop and analyzes the unrolled loop for optimization opportunities. We do this prior to the actual code transformation and unroll only those loops whose optimization potential is high. We use the node cost model13, the same static performance estimator we used in Chapter 6, to estimate the entire run time of the fast path of a loop and compare it with the version computed during simulation. If the optimization potential is sufficiently high, we perform the unrolling transformation14.

7.4.1 Fast-Path Loop Creation

12See Section 7.4.6 for details. 13See Chapter 5 for details. 14See Section 7.4.5.

Before exploring the unrolling of non-counted loops, we started to experiment with a novel transformation that we call fast-path loop creation, an optimization for multi-back-edge loops. Multi-back-edge loops are very common in many applications; for example, they can arise from conditional statements at the end of a loop's body or from continue statements in loops. Additionally, a compiler often creates them as a result of inlining function calls into loops15. We devised this optimization for loops with multiple back edges because the probability of their back edges is often not equally distributed. If profiling information [196] indicates that a set of back edges is taken with a very high likelihood, the other back edges may hinder the optimization of the loop for two main reasons:

• A side-effect can influence the scheduling of anti-dependencies16 in the loop [77].

• A back-edge value flowing into a loop ϕ [39] can make a loop non-counted.

Consider the code in Listing 7.12. The loop-invariant read [29] in line 4 cannot be hoisted out of the loop, because the memory location may alias with the write in line 7. If alias analysis [51; 122] cannot prove that the loop variable i is different from k for all possible values of i and k, the compiler has to assume a write-after-read dependency from the read to the write within one iteration, which results in a read-after-write dependency from the write in iteration n to the read in iteration n + 1.

1 double result = 0;
2 for (int i = 0; i < arr.length; i++) {
3     if (...) {
4         result += arr[k];
5     } else {
6         // write that can alias
7         arr[i] = result;
8     }
9 }

Listing 7.12: Multi-back-edge Loop.

Assuming that, for example, the false branch in the loop has a low probability in comparison to the true branch, the read has to be performed in every iteration although the write is rarely executed. One possibility to still hoist the read out of the loop is to use fast-path loop creation. The general idea of fast-path loop creation is to create an outer, slow-path loop for all back edges of the loop that have a low probability. The inner, fast-path loop then only consists of frequently taken back edges. The outer loop is a while(true) loop that is exited under the same conditions as the inner loop. Listing 7.13 shows the loop from Listing 7.12 after fast-path loop creation in pseudo code.

15Callees with multiple returns can cause multiple loop back edges if a callsite is inside the body of a loop. Loop back edges can always be merged to a single one; however, this requires more jumps in the generated code, thus many compilers model loops with multiple back edges. 16Anti-dependencies are also known under the term write-after-read dependency and represent forward dependencies from read operations to writes. For example, a read has to be performed before an unknown side-effect that may write to the location of the read. Side-effects are always anti-dependencies for dominating reads inside a loop's body.

double result = 0;
int i = 0;
outer: while (true) {
    // initial read
    tmp = arr[k];
    inner: for (; i < arr.length; i++) {
        if (...) {
            result += tmp;
            continue inner;
        } else {
            // write that can alias
            arr[i] = result;
            // continue outer, perform read again
            continue outer;
        }
    }
    break outer;
}

Listing 7.13: Multi-back-edge loop after fast-path loop creation.

Once the compiler has created the outer loop, it can hoist the read from the inner loop to the outer loop. This promotes the write-after-read dependency from the inner fast-path loop to the outer slow-path loop. The slow-path loop is only executed if the else branch in the inner loop is taken; therefore, its probability is equal to that of the else branch.

7.4.2 Algorithm

We present a graphical representation of the fast-path creation algorithm in Figure 7.3, which shows the IR of a generic loop during the transformation. First, the compiler creates the outer loop header, which becomes the slow-path loop header. Then, the compiler re-routes the slow-path loop back edges to the slow-path (outer) loop header.

[Figure: IR of a generic loop in three stages: the original loop, the creation of the slow-path (outer) loop header, and the re-routing of the slow-path back edges to the new outer loop header; loop exits and loop ends are preserved. Legend: control-flow node, control-flow edge, fast path, slow path, loop header association.]

Figure 7.3: Fast-path loop creation example.

We present the detailed algorithm for fast-path loop creation in the Appendix in Chapter 11 in Algorithm 8.

7.4.3 Discussion

Fast-path loop creation has the advantage that the compiler does not need to duplicate code in order to optimize may-aliasing memory accesses. Although code duplication can often improve performance, it inherently increases code size, which a compiler tries to minimize. Fast-path loop creation is a useful optimization on its own. Although we did not evaluate the approach in detail, we determined peak performance improvements of up to 50% for micro-benchmarks. However, in our work, we focused on the unrolling of non-counted loops and used this optimization as a starting point for a fast-path loop unrolling algorithm described in Section 7.4.4.

7.4.4 Non-Counted Loop Unrolling via Fast-Path Loop Creation

We can extend the algorithm for fast-path loop creation to perform fast-path unrolling of non-counted loops. First, we identify those loops that should be unrolled. The compiler then creates the slow-path outer loop for these loops.17 After the slow-path loop is created, the compiler uses a loop peeling transformation [7] to duplicate the body of the inner fast-path loop (including its loop exit conditions) u times, where u denotes the number of desired unrollings. After the peeling step, the inner fast-path loop is removed by deleting inner loop exits and re-routing inner loop back edges to the outer loop. The loop ϕs of the inner loop are replaced with inputs coming from the single non-back-edge predecessor. In a final step, the compiler removes the inner loop header. Figure 7.4 shows the example from Figure 7.3 after fast-path loop creation, during peeling and inner loop removal. Unrolling loops with the proposed technique has one major advantage over general and partial unrolling of loops: as we re-route all back edges that we consider part of the slow path to the outer loop header, only the fast-path back edges remain connected to the inner, fast-path loop. Peeling the fast-path loop and removing it afterwards effectively unrolls only the fast path of the original code and not its slow path. Therefore, we keep the code-size increase at a minimum, assuming the correctness of the profiling information.

17This also works for loops with only one back edge. In this case, the inner loop temporarily has no back edge and is degenerated until the transformation is finished.

[Figure: the loop from Figure 7.3 after fast-path loop creation, shown first with one fast-path iteration peeled and then with the inner fast-path loop removed; the peeled iteration's loop exits are merged with those of the remaining loop. Legend: control-flow node, control-flow edge, fast path, slow path, loop header association.]

Figure 7.4: Path-based unrolling via fast-path loop creation and peeling.

We perform fast-path loop unrolling independently of the fast-path loop creation optimization. If a loop should be unrolled, we perform the unrolling transformation. Later in the compilation pipeline, we perform a dedicated analysis to detect optimizable patterns like the one from Listing 7.12, for which we create the fast-path loop independently of any prior unrollings. The fast-path loop creation optimization and fast-path loop unrolling share IR transformations, but semantically they are two separate optimizations.

7.4.5 Fast-Path Loop Unrolling Algorithm

In the following, we explain the fast-path unrolling algorithm in detail. Graal only performs a limited set of optimizations on non-counted loops18; however, it performs a multitude of optimizations on counted loops. Fast-path loop unrolling also works for counted loops without any changes, but we do not want to interfere with existing optimizations, that is, we want to prevent creating IR patterns that are not optimizable by other transformations. Therefore, we ignore counted loops as well as loops for which the compiler can infer the iteration count after hoisting loop-invariant instructions out of the loop's body. We use the notion of unrolling opportunities19: opportunities model optimizations enabled by unrolling a loop. We implemented a set of analysis algorithms determining optimization opportunities that can be enabled by loop unrolling20.
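A minimal sketch of this abstraction could look as follows (the names Loop and UnrollingOpportunity are our stand-ins mirroring the driver in Algorithm 5; the actual Graal interfaces differ):

// Stand-in for the compiler's loop data structure.
interface Loop { /* loop header, back edges, body, profiling info, ... */ }

// One unrolling opportunity bundles the applicability check and the
// simulation: it simulates unrolling the given loop and returns how
// many unrollings enable the optimization it models, or 0 if none do.
interface UnrollingOpportunity {
    int shouldUnroll(Loop loop);
}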

We follow the simulation-based duplication approach proposed in Chapters 1, 4 and 6 and group the unrolling optimization into the three simulation-based duplication steps:

1) Simulation: We simulate the unrolling of a loop and identify the optimization potential.

2) Trade-Off: We trade off the expected performance gains against the code-size increase of an unrolling transformation and decide which loops should be unrolled and how many times.

3) Optimization: Finally, we perform the unrolling transformations followed by the identified optimizations.

We first identify a loop's optimization potential after unrolling by simulating the transformation and then decide whether we want to unroll it. Algorithm 5 shows the non-counted loop unrolling optimization. We filter out all loops that are counted and therefore ignored by our approach. Then, we check a set of implemented opportunities on a loop. Each opportunity op performs a simulation of the loop after unrolling it (op.shouldUnroll(loop)) and analyzes the loop for optimizable patterns in unrolled iterations. This follows the notion established in Chapter 4: each opportunity op has an applicability check AC(m = LoopHeader, pi = loopBackedge) and an optimization step Opt(m = LoopHeader). If such an optimization is found, we use the node cost model to compute the overall run time of the loop's body and the overall run time of the loop's first (unrolled) iteration. If the estimator predicts that the unrolled iteration has a lower run time than the original loop's body, we judge whether the expected performance increase justifies the expected code-size increase21. We derived the upper bound of unrolling iterations via empirical evaluation. We performed our experiments from Chapter 8 with an upper limit of 2 to 64 unrollings. Although some special code patterns are sensitive to a high number of unrollings, we could not generally measure noticeable differences with an upper limit higher than 4. Thus, we unroll a loop by a maximum unrolling factor of 4. That means we first create the fast-path loop (Section 7.4.1) and then peel the inner fast-path loop at most four times. Finally, we remove the inner fast-path loop. Unrolling only those loops for which we know that there is sufficient optimization potential after unrolling allows us to keep the code-size and compile-time increase at a moderate level22.

18Graal performs loop peeling and inversion on non-counted loops. 19We established this term in Chapter 6 for optimization opportunities after duplication/unrolling. 20See Section 7.3. 21See Section 7.4.6.

Algorithm 5: Path-based unrolling algorithm.
Data: ControlFlowGraph cfg
Result: Optimized ControlFlowGraph cfg
[1]  outer: for Loop loop in cfg.loops () do
[2]      if isCounted (loop) then
[3]          continue outer;
[4]      for Opportunity op in UnrollingOpportunities do
[5]          int unrollings ← op.shouldUnroll (loop);
[6]          if unrollings > 0 then
[7]              createFastPathLoop (loop);
[8]              for i in 0 . . . unrollings do
[9]                  peelIteration (loop);
[10]             end
[11]             removeFastPathLoop (loop);
[12]     end
[13] end
[14] runCleanUpOptimizations (cfg);

7.4.6 Unrolling Trade-Off Heuristic

The final decision whether to unroll a loop is made by a trade-off function that takes several variables relevant to the unrolling decision into account:

• Maximum compilation unit size: The virtual machine imposes an upper bound for code size per compilation unit.

• Initial graph size: The heuristic must scale dynamically with the initial size of the compilation unit.

• Cycles saved: The number of estimated cycles saved inside the loop body by unrolling it. This value is calculated during the simulation of the unrolling of the loop. The compiler estimates the overall cycles of the loop and the cycles of the loop after unrolling and computes the difference.

22See Chapter 8. 7.4 Fast-Path Unrolling of Non-Counted Loops 105

• Code-Size Increase Budget: Prior work [120] on the Graal compiler derived constants for the maximal code-size increase of single optimizations. Therefore, all non-inlining-based optimizations in the Graal compiler are limited in their code-size increase to 50%. The reason is that code-size increases also increase compile time, as the workload for subsequent compiler optimizations grows.

• Byte per cycle spending: In order to relate an estimated benefit with its cost we compute the ratio between code-size increase and run-time improvement. Thus, we need to specify how much code-size increase (in bytes) we are willing to accept for every saved cycle. This is currently configured to be 512 bytes per saved cycle. We derived this value via a structured experiment running our benchmarks from Chapter 8 with all powers of 2 between 2 and 1024. The value 512 generated the best results for those benchmarks.

All operands are computed by the optimization opportunities23.

The final unrolling decision is made by the algorithm given as pseudo code in Algorithm 6 and is based on the following trade-off heuristic:

l ... Loop
u ... Unrollings
l.s ... Loop Size
l.cp ... Cycles saved per loop iteration
l.oc ... Overall cycles of a loop including condition
cs ... Compilation Unit Size
is ... Compilation Unit Initial Size
IB ... Code-size Increase Budget = 1.5
MS ... Max Compilation Unit Size
MU ... Max Unrollings = 4
BS ... Bytes/Cycle Spending = 512

canUnroll(loop) ↦ is ∗ IB < MS ∧ cs + u ∗ l.s < MS
shouldUnroll(loop) ↦ l.cp ∗ BS > l.s
nrOfUnrollings(loop) ↦ u ≡ min(MU, l.cp / l.oc ∗ MU ∗ 10)

23See Section 7.4.

Algorithm 6: Unrolling decision algorithm.
    /* Size Restriction */
[1] if canUnroll (loop) then
        /* Trade-off Heuristic */
[2]     if shouldUnroll (loop) then
            /* Compute Final Unrolling Factor */
[3]         return nrOfUnrollings (loop);
[4] return 0;

The compiler unrolls a loop nrOfUnrollings times if it has sufficient optimization potential. It determines this by comparing the cycles saved per iteration, multiplied by a constant factor expressing how much code size we are willing to spend for one cycle of run-time reduction, with the size of the loop.
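For illustration, consider a hedged Java transcription of the heuristic (the constants mirror the symbols above; the Loop record and all field names are our own stand-ins). With hypothetical numbers, a loop of l.s = 600 bytes that saves l.cp = 2 cycles per iteration passes shouldUnroll, since 2 ∗ 512 = 1024 > 600, and with l.oc = 20 overall cycles it receives min(4, 2/20 ∗ 4 ∗ 10) = 4 unrollings:

// Hedged transcription of the trade-off heuristic from Section 7.4.6.
final class UnrollingTradeOff {
    static final double IB = 1.5; // code-size increase budget
    static final int MU = 4;      // maximum number of unrollings
    static final int BS = 512;    // bytes spent per saved cycle

    // size in bytes, cycles saved per iteration, overall cycles per iteration
    record Loop(int size, int cyclesSaved, int overallCycles) {}

    static boolean canUnroll(Loop l, int u, int unitSize, int initialSize, int maxUnitSize) {
        return initialSize * IB < maxUnitSize
            && unitSize + u * l.size() < maxUnitSize;
    }

    static boolean shouldUnroll(Loop l) {
        return l.cyclesSaved() * BS > l.size();
    }

    static int nrOfUnrollings(Loop l) {
        return Math.min(MU, l.cyclesSaved() * MU * 10 / l.overallCycles());
    }
}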

7.5 Loop-Wide Lock Coarsening

Based on the work we did for benchmarking concurrent applications on the JVM [158] and our novel approach for duplication24 and unrolling of non-counted loops, we stumbled across a special code pattern in synchronized Java code within loops. We illustrate such an example in Listing 7.14, which shows a simple Java loop containing a synchronized statement. The critical region is entered in every iteration of the loop. However, the monitor of the synchronized block is loop-invariant, i.e., it does not depend on induction variables or memory effects of the loop and is thus scheduled before the loop.

Object monitor = ... /* loop invariant object */ ...;
for (...) {
    synchronized (monitor) {
        ... critical region ...
    }
}

Listing 7.14: Synchronization loop in Java.

The semantics of Listing 7.14 are clear: in every iteration of the loop a different thread has the chance to acquire the lock. However, the lock can also be held by a single thread throughout the whole loop, since the synchronized statement has no fairness guarantees [158]. Heavy contention on the lock can cause performance penalties because of the locking overhead of the synchronized statement. For short loop bodies, the locking overhead can be significant, especially if the lock is contended and inflated [90].

7.5.1 Use Cases

We observed that many standard Java workloads execute code similar to the example in Listing 7.14. Java's synchronization mechanisms25 allow methods to be marked with the synchronized keyword, which sequentializes concurrent access to these methods just like the synchronized statement does.

24See Chapter 6. 25See Section 6.1.5.

 1 class SynchronizedList {
 2     ...
 3     // synchronizes on the this object
 4     public synchronized void add(Object o) {
 5         ...
 6     }
 7     ...
 8 }
 9
10 SynchronizedList foo() {
11     Object[] objects = ... some source ...;
12     SynchronizedList l = new SynchronizedList();
13     for (Object o : objects) {
14         l.add(o);
15     }
16     return l;
17 }
18
19 SynchronizedList fooAfterInlining() {
20     Object[] objects = ... some source ...;
21     SynchronizedList l = new SynchronizedList();
22     for (Object o : objects) {
23         monitorenter(l);
24         ... critical region ...
25         monitorexit(l);
26     }
27     return l;
28 }

Listing 7.15: Synchronized list.

Synchronized methods are heavily used in synchronized Java collection libraries such as java.util.Vector or the lists returned by java.util.Collections.synchronizedList. These collections typically synchronize on the list object itself. We observed code like Listing 7.15, where a synchronized method is called inside a loop. After inlining the call of add into foo, the code in lines 23-25 is produced, exhibiting a critical region inside a loop with a loop-invariant monitor.

7.5.2 Fast-Path Tiling

In order to reduce the synchronization pressure on the monitor of such loops, we propose to hoist the synchronized block out of the loop. However, in order to avoid reducing the fairness of the lock too much, we propose to tile the iteration space of such loops with fast-path loop creation (Section 7.4.1). Based on fast-path loop creation we developed an optimization that allows us to pull a lock out of a loop. In order to explain the basic idea, we refer to Listing 7.16, which shows a minimal loop exhibiting the optimization potential. To reduce the synchronization overhead of performing the locking in every iteration of the loop, we want to hold the lock for multiple iterations, thus only performing a fraction of the lock-unlock operations. We propose to create an outer, slow-path loop without deleting the inner loop. This allows the compiler to pull the monitorenter and monitorexit out of the inner loop and to promote them into the outer slow-path loop. That way, the code does not need to acquire the lock in every iteration, but can keep it for a higher number of iterations. Consider the code in Listing 7.17, which shows the loop from Listing 7.16 after creating an outer slow-path loop, entered every 32nd iteration, around the otherwise unchanged inner fast-path loop. Every 32nd26 iteration, the generated code exits the inner loop, releases the lock and performs the back edge of the outer loop, leading to a re-acquisition of the lock before re-entering the inner loop.

for (...) {
    monitorenter(loopInvariantObject)
    ... critical region ...
    monitorexit(loopInvariantObject)
}

Listing 7.16: Loop-wide lock coarsening before.

int roundTrips = 0
while (true) {
    monitorenter(loopInvariantObject)
    roundTrips = 0 // start a new tile of 32 iterations
    for (...) {
        ... critical region ...
        if (roundTrips > 32) {
            break; // end of tile: release and re-acquire the lock
        }
        roundTrips++;
    }
    monitorexit(loopInvariantObject)
    // (the exit of the outer loop once the inner loop is done is elided)
}

Listing 7.17: Loop-wide lock coarsening after.

7.5.3 Loop-Wide Lock Coarsening Algorithm

We present our approach for loop-wide lock coarsening in Algorithm 7. First, the algorithm determines for which loops it is valid to have their locks coarsened, before coarsening heuristics decide if it is beneficial to do so. Finally, the actual transformation is carried out using the fast-path loop creation algorithm from Algorithm 8. It is important to mention that not all locks can be pulled out of a loop. There are a few restrictions that must be fulfilled for a compiler to perform the transformation:

• Number of loop back edges: All back edges of a loop must lock and unlock the same objects in the same order. This is necessary as the order of lock and unlock must be preserved when pulling a lock out of a loop.

26This is a constant we determined via empirical evaluation. It turned out to generate the best results for our benchmarks. For the evaluation we refer to Chapter 8.

• Deoptimization: Loop headers are control-flow merges. They merge the forward predecessor and the back edges of a loop. Thus, Graal maintains frame states27 at loop headers to support deoptimization. For deoptimization, the correct frame states must be maintained for the body of a loop. If a deoptimization happens inside a loop before or after a synchronized region, the interpreter state must be preserved. This imposes a challenge when pulling locks outside of a loop: if there is no interpreter state in which the code inside the loop is executed while a lock of the synchronized region is on the evaluation stack, the interpreter cannot continue execution, as it does not know that the optimization pulled the lock outside the loop and enlarged the scope of the lock. HotSpot run-time support for releasing the lock during deoptimization would be required, which is currently not implemented. Thus, we only pull locks out of loops that synchronize an entire loop body (see lines 5-6 in Algorithm 7).

Algorithm 7: Loop-wide lock coarsening algorithm.
Data: ControlFlowGraph cfg
Result: Optimized ControlFlowGraph cfg
     /* Find loops */
[1]  Loops[] candidates ← [];
[2]  Loops loops ← cfg.computeLoops ();
[3]  for Loop l in loops do
         /* See if there are monitors in the loop */
[4]      if l.instructions () contains MonitorEnter then
             /* The monitorenter must be the first instruction in the loop */
[5]          if l.instructions ()[0] isa MonitorEnter then
                 /* The exit must be the last instruction in the loop */
[6]              if l.instructions ()[l.instructions ().length - 1] isa MonitorExit then
                     /* Monitor enter and exit must operate on the same object */
[7]                  MonitorEnter enter ← l.instructions ()[0];
[8]                  MonitorExit exit ← l.instructions ()[l.instructions ().length - 1];
[9]                  if enter.object () == exit.object () then
[10]                     candidates.add (l);
[11] end
[12] for Loop l in candidates do
         /* See if the heuristics decide the loop should be tiled */
[13]     if heuristics.shouldTile (l) then
             /* Create the inner loop exit based on the tiling factor */
[14]         createExitCheck (l);
             /* Create the fast-path loop from Algorithm 8 */
[15]         createFastPathLoop (l);
             /* Perform the actual coarsening from fast-path to slow-path loop */
[16]         moveEnterExitToSlowPathLoop (l);
[17] end
[18] return cfg;

27See Chapter 3 for details.

7.5.4 Loop-Wide Lock Coarsening Heuristics

We implemented a very simple heuristic to decide if a lock inside a loop should be pulled outside via loop-wide lock coarsening. The basic idea can be seen in Listing 7.1828 and follows the assumption that the body of a loop must be sufficiently more expensive in latency than the additional roundTrips check that is added to each iteration of the loop29.

 1 boolean shouldTile(Loop l) {
 2     // Do not tile if the loop is never executed
 3     if (l.frequency() < nodeCostModel.latencyForType(IfInstruction)) return false;
 4     int latencySum = 0;
 5     // Collect a latency estimation for one iteration of the loop
 6     for (Instruction i : l.instructions()) {
 7         latencySum += nodeCostModel.latencyFor(i);
 8     }
 9     // The latency of the additional if added to the fast-path loop
10     int latencyTileCheck = nodeCostModel.latencyForType(Condition) +
11                            nodeCostModel.latencyForType(IfNode);
12     // Real heuristic: the body of the loop must be long enough to dominate the added
13     // tile check
14     return latencySum > 2 * latencyTileCheck;
15 }

Listing 7.18: Lock coarsening tiling heuristic.

We implemented the loop-wide lock coarsening algorithm on top of GraalVM and evaluated its performance in Chapter 8, where we give a detailed evaluation of the optimization, showing that it can increase performance by up to 50% for selected benchmarks.

7.5.5 Safepoint Tiling

Based on the loop-wide lock coarsening, we applied the same idea to safepoints. As mentioned before in Section 7.3, loops for which the compiler cannot deduce that their trip count is within 32-bit bounds require the compiler to generate safepoint polls in order to respect throughput-based service level agreements. This always applies to non-counted loops, as the compiler does not know how many iterations of a non-counted loop will be executed. Therefore, Graal generates safepoint polls on every loop back edge. However, in order to avoid reading the safepoint page in every iteration of a loop, the compiler could hoist the safepoint out into a slow-path loop and perform the actual safepoint poll only from time to time, depending on the tiling factor. Consider Listing 7.19, which shows a generic Java loop. The compiler is forced to generate a safepoint poll on the loop back edge, as it cannot prove that the loop will finish in acceptable time, thus demanding a safepoint in order for the application to be responsive.

28For simplicity we assume the body of a loop just contains a single block for which the compiler can iterate all instructions (lines 6-8). In our actual implementation we build a superblock through the body of the loop, merge all loop back edges in this process, and iterate the superblock. 29What is currently missing in the heuristic is a trade-off between the cost of the synchronization code itself and the cost of the added trip check. If a lock inside a loop is biased [90], replacing it with an additional check and a branch may be more expensive than the original biased lock. Since HotSpot and GraalVM currently do not implement profiling of the type of a lock, it is infeasible for the compiler to decide if a lock is biased, fast-path accessed or inflated at run time. Therefore, we refrain from modeling this in our heuristics for now.

while (a) {
    ... /* loop body */ ...
    safepointPoll()
}

Listing 7.19: Safepoint tiling opportunity before optimization.

However, the safepoint poll consumes cycles itself. We propose to use the same scheme as for loop-wide lock coarsening and apply fast-path loop creation to the loop to tile the iteration space and pull the safepoint poll into the slow-path loop. The compiler could apply this optimization and produce the code in Listing 7.20, which only performs the safepoint poll in every nth iteration.

int roundTrips = 0
while (true) {
    roundTrips = 0 // start a new tile of 32 iterations
    while (a) {
        ... /* loop body */ ...
        if (roundTrips > 32) {
            break; // end of tile: perform a safepoint poll
        }
        roundTrips++;
    }
    safepointPoll()
    if (!a) break; // leave once the original loop condition no longer holds
}

Listing 7.20: Safepoint tiling opportunity after optimization.

We have not evaluated this optimization. However, we consider it interesting for future work on applications of the fast-path unrolling/creation optimization.

Chapter 8

Evaluation

In this chapter we present an extensive performance evaluation of all the algorithms and optimizations presented in this thesis. We show that simulation-based code duplication can significantly improve performance of Java applications and that a simulation-based solution outperforms hard-coded heuristics.

Evaluating the functionality of compiler optimizations is vital to understand the implications created by optimizing a program during compilation. Optimizations are traditionally justified by the performance improvements they generate. Dynamic compilers, however, have additional requirements, namely to keep compilation time and code size small. Thus, multiple success metrics become relevant for the performance of a dynamic compiler. In our evaluation we seek to determine whether simulation-based duplication can improve the performance of Java applications and if the simulation scheme can outperform hard-coded heuristics. We evaluated our approach and our implementation of simulation-based duplication on top of the GraalVM by running and analyzing a set of industry-standard benchmarks. We present detailed evaluations for simulation-based duplication (Chapter 4), DBDS (Chapter 6), fast-path loop unrolling (Chapter 7) and loop-wide lock coarsening (Section 7.5).

8.1 Evaluation Methodology

In order to reduce external influences on our experiments, we followed state-of-the-art benchmarking methodology for dynamic compilers. All experiments were conducted with GraalVM [140; 148; 213], a modified version of the HotSpot [93] JVM; thus, we also try to adhere to common JVM benchmarking methodology [75]. In this section we present our experimental setup, including hardware configuration, software configuration, the measured metrics, the benchmarks and the benchmark setup.

8.1.1 Hardware

For our evaluation we used a dedicated desktop machine equipped with a desktop-class Intel i5 processor1 with 4 cores, featuring 16 GB of DDR RAM @ 800 MHz and a core speed of 3.4 GHz. We disabled frequency scaling and turbo boost [100] in the BIOS of the machine to obtain more stable results.

8.1.2 Software

The benchmark machine was running Fedora 29 with a 4.19.6-300 Linux kernel. The file system was standard ext4. The machine was running the minimal tiling-based window manager i3. We disabled Bluetooth, Ethernet and WLAN to reduce external influences. No software other than the minimal system requirements was running. We wrote a BASH harness that calls each benchmark individually and cleans up any remaining files (if there were any) in the working directory. The BASH harness called a Python-based benchmark harness that captured STDOUT and STDERR of each benchmark to extract performance numbers. On shutdown of each benchmark, an aggregated JSON result file was written by the Python harness and stored in a common results directory. For benchmark analysis we used R [70] with the tidyverse [191] data processing framework.

8.1.3 Benchmarks

In this section we present the set of industry-standard Java and JavaScript benchmarks we used for our experiments. For each benchmark we give a short description as well as some notes on the configuration of the benchmark in the experimental setting.

8.1.3.1 Java SPECjvm2008

Java SPECjvm2008 [189] is a Java benchmark suite designed to measure the performance of the Java runtime environment. It does so by combining real-world applications testing Java core features including JIT compilation, garbage collection, the memory subsystem, etc. The benchmark harness comprises a warmup and a measurement period, during each of which the benchmark workload is executed as often as possible. The warmup period is used to stabilize the application code with respect to dynamic compilation and deoptimization. Results of the warmup duration are discarded. The result of the benchmark is the number of finished iterations of the workload within the measurement duration. In addition to the standard workloads of the suite, SPECjvm2008 also contains workloads in which the benchmark workload is executed once in order to measure the warmup time. We removed those benchmarks from our measurements as we are interested in peak performance. We configured the suite to use 120 seconds for warmup and 120 seconds for measurement. The standard execution harness executes all benchmarks within one execution of the virtual machine. However, this introduces the problem of profile pollution [143]. HotSpot does not support context-sensitive profiling [5; 87]; therefore, executing the set of heterogeneous workloads inside one VM is sub-optimal to judge the compiler's effectiveness at optimizing a program. Thus, we executed every benchmark workload in a dedicated invocation of the JVM.

1Model: Intel Core i5 750.

8.1.3.2 Java DaCapo

The Java DaCapo suite was originally proposed by Blackburn et al. [11]. It was released because of the need for a Java benchmark suite that makes use of the rich and complex features of the Java programming language and its ecosystem. Benchmark suites such as SPECjvm2008 behave more like traditional C and C++ applications that do not heavily exercise dynamic features of the JVM. Java DaCapo comprises selected real-world applications, including a wide variety of different applications from PDF processing to server applications. The suite heavily tests the performance of JIT compilation, garbage collection, file I/O and networking. The original suite contains 14 benchmarks. However, we used version 9.12, for which we excluded the benchmarks eclipse, tomcat, tradebeans and tradesoap, as they are no longer supported by JDK versions > 8u92. There are newer DaCapo releases that fixed these issues. However, in order to be comparable with the numbers collected at the beginning of the work for this thesis, we refrained from updating.

All DaCapo benchmarks report performance as run time in milliseconds per iteration of the workload. The user of the suite is responsible for choosing sensible values for the number of iterations and for computing any final performance result. In order to ensure that the compiler as well as the application code are warmed up, we selected the number of warmup iterations based on the behavior of a benchmark with respect to JIT compilation. We define the number of warmup iterations w to be the number of iterations necessary for the compilation frequency to drop below a minimal stable rate. This is the point where we assume warmup to be over. We cannot warm up all benchmarks until all compilations are done, as several benchmarks perform bytecode generation, forcing compilation in every iteration of the benchmark.

Since Java DaCapo is a run-time-based benchmark suite, we use a fixed number of iterations to gather performance data from the benchmark execution. For the final measurement iterations, we took m iterations of the benchmark, where m = 2 ∗ w. We computed the arithmetic mean of those m iterations and used it as one result for a given benchmark. We executed every benchmark several times, each time with a dedicated execution of the VM. So the final results for one benchmark are n measurements, where each measurement ni is the arithmetic mean of the last m in-process iterations of a benchmark.
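As a hedged sketch of this aggregation (our own code, not the actual harness), one VM invocation is reduced to a single data point as follows:

import java.util.Arrays;

// One DaCapo data point: the arithmetic mean of the last m = 2 * w
// in-process iterations of a single VM invocation; earlier (warmup)
// iterations are discarded.
final class DaCapoAggregation {
    static double dataPoint(double[] iterationMillis, int w) {
        int m = 2 * w;
        double[] measured = Arrays.copyOfRange(
                iterationMillis, iterationMillis.length - m, iterationMillis.length);
        return Arrays.stream(measured).average().orElseThrow();
    }
}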

8.1.3.3 ScalaBench

ScalaBench, originally proposed by Sewe et al. [176], is the DaCapo benchmark suite for the Scala [65] programming language. The JVM was originally designed to execute bytecode [124] generated by the javac compiler, which compiles Java source code to bytecode. However, over time other languages started to compile to JVM bytecode. The most prominent one besides Java is Scala. Scala [138] is a statically-typed programming language that combines object-oriented and functional programming in one high-level language. This is the reason why Scala workloads typically differ from Java workloads in their type and class hierarchy behavior (as described by Stadler et al. [182]). Scala applications tend to be much more dynamic than Java applications. ScalaBench contains several real-world applications selected from open-source projects in the Scala community. The process to derive the number of warmup and measurement iterations was similar to that for the Java DaCapo benchmark suite.

8.1.3.4 Renaissance

The Renaissance [158]2 benchmark suite was developed because other benchmark suites lacked inherently parallel and concurrent Java and Scala workloads. Renaissance contains several benchmarks that exercise different parallel programming paradigms, different synchronization mechanisms and primitives, as well as atomic operation performance. In addition to the Renaissance benchmarks, we used a few other Java and Scala benchmarks addressing novel JVM features (since Java 8) such as streams and lambdas. The process of determining warmup and measurement iterations was the same as for Java DaCapo and ScalaBench.

8.1.3.5 JavaScript Octane

JavaScript Octane [22] was a widely used JavaScript benchmark suite containing workloads ranging from 500 LOC to 300 000 LOC. The suite addresses JIT compilation, garbage collection, code loading and parsing of the executing JavaScript VM. The suite was retired in 2017 by the V8 team [79] and is no longer actively maintained. The usage of the suite is controversial, as it primarily tests the VM, garbage collector and compiler and not how the VM behaves in the presence of a user interface3. However, for our use case the suite is still a good fit, especially because it puts a lot of pressure on the compiler. We measured Octane performance using Graal JS [213], the JavaScript implementation on top of Truffle, which is comparable in peak performance to Google's V8 [79]. Graal JS is on average 17% slower than JavaScript on the V8 VM [213]. The JavaScript Octane benchmarks are throughput-based. We run a warmup period to ensure that the application is compiled and compute the throughput for the suite in the measurement duration.

2Note that we proposed and developed loop-wide lock coarsening from Chapter 7 originally during our work on [158].

8.1.3.6 JavaScript jetstream asm.js

JavaScript jetstream asm.js [155] is a benchmark suite originally developed for the WebKit [98] JavaScript VM to also include novel features such as asm.js in a JavaScript benchmarking suite. It contains most of the Octane benchmarks together with new ones. Therefore, we excluded all non-asm.js [135] benchmarks from the jetstream suite. JetStream benchmarks are also throughput-based.

8.1.4 Benchmark Configuration

We derived a uniform setup for all benchmarks and executed them sequentially on our benchmark machine. Each benchmark was configured to use a start and max heap size of 10 GB. The benchmarks invoke System.gc() between iterations to remove the data of one iteration before doing another iteration. We also wanted to keep the overall GC impact as small as possible. Thus, we chose the Serial Collector, which only uses one GC thread.

8.1.5 Metrics

In order to have valid arguments about the impact of our optimizations, we measured three different metrics in our experiments: peak performance, code size and compile time.

Performance is reported by each benchmark suite itself. Java DaCapo, ScalaBench and Renaissance report performance in milliseconds per iteration. Java SPECjvm2008, JavaScript Octane and JavaScript jetstream asm.js measure throughput performance and report a score after execution.

3The requirements of UI-based applications are different from those of server applications, i.e., for UI-based systems low latency is more important than the highest possible peak performance.

Compile Time was measured programmatically in the compiler itself. Graal supports timing statements that are used throughout the compiler.

Code Size was measured programmatically with a counter in the compiler.

For each configuration presented in the subsequent sections we executed every benchmark suite 20 times4.

8.1.6 Warmup & Metacircularity

GraalVM is a metacircular system, i.e., the compiler is itself implemented in Java. We used the standard configuration of tiered compilation [10; 151]. This means that upon its first invocation, every method is executed by a template-based, handwritten assembler interpreter. During interpretation, code is profiled, and hot functions are first compiled by the baseline compiler C1 [109] and then by Graal. In this configuration, Graal is itself not compiled with Graal but only with C1, in order to reduce warmup time [9].

8.2 Experiments

In this section, we present a detailed description of all experiments, together with the result data and an interpretation. We report two different kinds of results: condensed results and raw data. For the sake of brevity, we present condensed results in the form of tables in Tables 8.1 to 8.8, 8.10 and 8.11, and we present the raw data in the form of boxplots [72] in the appendix (Figures C.1 to C.30). Additionally, we present data for the combined performance impact of the work of this thesis in Figures C.31 to C.36. The table-based results show the combined impact of compiler configurations on an entire benchmark suite. We normalized our experiment data to a baseline configuration of GraalVM with all optimizations proposed in this thesis disabled. To aggregate the data for a table-based representation, we computed the median5 of each benchmark and configuration from the normalized data and report the geometric mean [67] as well as the minimum and maximum value for compile time, code size and performance6 across an entire benchmark suite. Since all data has been normalized to the default configuration of GraalVM without any of the proposed optimizations enabled, a value of 1 means that the respective metric (code size, compile time, throughput or run time) is unchanged with respect to the unoptimized baseline. In order to properly plot the raw data of our experiments in Figures C.1 to C.30, we removed outliers that disturbed the scale of the plots. We therefore inspected all benchmarks and configurations and manually determined whether a data point is an outlier. For benchmarks where certain configurations produce data orders of magnitude higher or lower, we did not remove such points from the raw data. While we believe that this filtering is necessary to properly plot the data, outlier filtering may always be subject to bias and thus may contain errors.

4This accumulated to more than two months of machine time. 5See Section 8.1: For each benchmark and configuration we have 20 data points. For run-time-based benchmarks a single data point is computed as the arithmetic mean of the measurement iterations of a benchmark. For throughput-based benchmarks a data point is the throughput result of the measurement period. Most of the data samples are normally distributed, but not all of them; thus we used non-parametric statistics. 6Throughput and run time.
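A hedged sketch of this aggregation step (our own code; the actual analysis was done in R with tidyverse): each benchmark's median is normalized to the baseline median, and a suite is summarized by the geometric mean of those normalized medians:

import java.util.Arrays;

// Condensed-table aggregation: per-benchmark normalized medians,
// summarized across a suite with the geometric mean.
final class SuiteAggregation {
    static double median(double[] xs) {
        double[] s = xs.clone();
        Arrays.sort(s);
        int n = s.length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    // configRuns[b] and baselineRuns[b] hold the 20 data points of
    // benchmark b for the tested configuration and the baseline.
    static double suiteScore(double[][] configRuns, double[][] baselineRuns) {
        double logSum = 0;
        for (int b = 0; b < configRuns.length; b++) {
            logSum += Math.log(median(configRuns[b]) / median(baselineRuns[b]));
        }
        return Math.exp(logSum / configRuns.length);
    }
}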

In addition to the condensed analysis and the raw data, we conducted a significance analysis of the results in the appendix in Tables C.1 to C.6, which shows that simulation-based code duplication can significantly improve the performance of Java and JavaScript applications.

8.2.1 DBDS

In our first experiment, we want to demonstrate that simulation-based code duplication is useful and can significantly improve the performance of Java applications while maintaining a moderate code-size and compile-time increase. Additionally, we want to show that our proposed trade-off heuristics from Section 6.3 work as expected and deliver smaller code, with smaller compilation times, than an implementation of DBDS without any code-size or performance trade-off. Therefore, we tested all our benchmarks with the following two configurations:

• DBDS-0.25: This configuration enables DBDS from Chapter 6 with a code-size increase budget of 25%.

• DBDS-No-Model: This configuration implements the DBDS algorithm from Chapter 6 without any performance versus code-size trade-off from Section 6.3. In this configuration the optimization will perform transformations as long as it sees any kind of benefit.

The results of our experiments can be seen in Tables 8.1 and 8.2 which show the performance of both configurations along the three tested success metrics.

Interpretation The results show that the DBDS-0.25 configuration outperforms the DBDS-No-Model configuration in all benchmarks except Renaissance and JavaScript jetstream asm.js. In Renaissance, DBDS-No-Model is on average only 0.3% faster than DBDS-0.25, whereas in JavaScript jetstream asm.js DBDS-No-Model can optimize a read inside a loop that later can become loop invariant. This is an indirect effect enabled by DBDS-No-Model and is not intended; thus, we consider it to be a "lucky punch". Our proposed trade-off implementation also outperforms the DBDS-No-Model configuration in code size and compilation time for every single benchmark, always resulting in less code and smaller compilation times.

Benchmarks     Configuration    Compile Time ↓           Code Size ↓              Run Time ↓
                                Mean   Min    Max        Mean   Min    Max        Mean   Min    Max
Java DaCapo    DBDS-0.25        1.233  1.135  1.542      1.142  1.082  1.316      0.992  0.952  1.015
               DBDS-No-Model    1.996  0.978  2.833      1.573  0.663  2.09       1.021  0.948  1.085
ScalaBench     DBDS-0.25        1.136  1.078  1.224      1.089  1.021  1.137      0.988  0.92   1.02
               DBDS-No-Model    1.356  1.297  1.407      1.299  1.162  1.399      1.019  0.899  1.088
Renaissance    DBDS-0.25        1.161  1.049  1.309      1.096  0.825  1.243      0.995  0.904  1.028
               DBDS-No-Model    1.724  1.29   1.959      1.624  1.028  1.958      0.992  0.605  1.162

Table 8.1: DBDS performance run-time benchmarks. Best value of a given average metric is underlined and bold.

Benchmarks                   Configuration    Compile Time ↓           Code Size ↓              Throughput ↑
                                              Mean   Min    Max        Mean   Min    Max        Mean   Min    Max
Java SPECjvm2008             DBDS-0.25        1.147  1.078  1.452      1.101  1.026  1.438      1.009  0.979  1.092
                             DBDS-No-Model    1.693  1.293  2.188      1.666  1.272  2.319      0.943  0.722  1.057
JavaScript Octane            DBDS-0.25        1.204  1.123  1.299      1.135  1.086  1.2        1.065  0.817  1.324
                             DBDS-No-Model    1.61   1.412  1.797      1.508  1.317  1.676      0.931  0.263  1.423
JavaScript jetstream asm.js  DBDS-0.25        1.17   1.139  1.287      1.117  1.08   1.173      1.014  0.986  1.045
                             DBDS-No-Model    1.725  1.373  2.358      1.494  1.28   1.789      1.227  0.91   3.538

Table 8.2: DBDS performance throughput benchmarks. Best value of a given average metric is underlined and bold.

Compile-time increases for DBDS-0.25 range from 14% to 23% and code-size increases from 8% to 14%. We consider these increases to be moderate and a success. Performance-wise, the mean performance impact ranges from 0% to 6%, while no benchmark suite is on average slower with DBDS enabled. The best single performance increase, as can be seen in the min column for run-time-based benchmarks and the max column for throughput-based benchmarks, is about 32% for one of the JavaScript Octane benchmarks. We consider the results to be in favor of our trade-off tier. They clearly indicate that DBDS can improve the performance of Java applications and also show that a proper code-size versus compile-time trade-off is necessary. In Section 8.2.5 we present different parameterizations of the trade-off from Section 6.3 to show that it enables the compiler to perform a fine-grained code-size versus performance trade-off.

8.2.2 Simulation vs. Heuristic-Based Solutions

Our second category of experiments tries to validate the claim that simulation-based duplication outperforms heuristics, as presented in Chapter 4. We use DBDS (Chapter 6) as an optimization instance of simulation-based duplication and evaluate it against a very complex heuristic approach (Section 4.2.1.1) called path duplication. Path duplication is a tail-duplication optimization that implements an even more advanced form of a complex heuristic than the ones described in Section 4.2.1.1. Path duplication is based on the idea of precise heuristic-based duplication as presented in Section 4.2.1.1. In addition to the optimization opportunities presented in the complex heuristic in Section 4.2.1.1, path duplication also computes control-flow-dependent optimization opportunities such as partial-escape opportunities (Section 6.1.4) and conditional eliminations. Path duplication does not perform a precise or transitive code-size analysis. It only estimates code-size effects via the number of fixed nodes7 in the merge block. The implementation of the optimization in Graal was tuned on a large set of benchmarks8 to deliver performance improvements at acceptable code-size and compile-time trade-offs. Therefore, it was tuned over the years to not duplicate pathological cases that explode code size. This means that path duplication is already fairly good at keeping compile time and code size to a minimum. Additionally, path duplication was tuned to enable Graal-specific optimizations by analyzing optimizable patterns known to trigger in subsequent optimization phases. However, its duplication heuristic does not analyze transitive effects (like the ones from Chapter 4). Prior to our work on simulation-based duplication, it had been the predominant duplication optimization used in Graal. Therefore, we consider it to be a suitable candidate for evaluating the code-size and performance trade-offs of simulation-based duplication. We evaluated several configurations of DBDS and path duplication against a baseline version of GraalVM. In the baseline version, path duplication as well as all optimizations presented in this thesis were disabled. The configurations tested are:

• DBDS-0.05: This configuration uses DBDS from Chapter 6 with a code-size increase budget of 5%.

• DBDS-0.1: This configuration uses DBDS from Chapter 6 with a code-size increase budget of 10%.

• DBDS-0.25: This configuration uses DBDS from Chapter 6 with its default code-size increase budget of 25%.

• DBDS-1.0: This configuration uses DBDS from Chapter 6 with a code-size increase budget of 100%.

• PathDup: This configuration uses path duplication as a replacement for DBDS.

We normalized all results to a no-optimization baseline configuration. We evaluated all configurations from above with all benchmarks and report compile time, code size and performance. For the experiments we consider one of the following results to support our claims:

7See Chapter 3. 8Including all benchmarks presented in this chapter.

• PathDup: Should generate observably more code, because it applies code duplication. Therefore, it will also result in higher compilation times. However, it should also have a visible performance impact (positive or negative) for those benchmarks where duplication triggers.

• DBDS 0.05-1.0: For simulation-based duplication we expect the following (depending on the parameterization of the budget):

– Code Size: We expect code sizes to be generally smaller than for path duplication (depending on the configuration). However, this depends on the workload. For workloads with a limited number of duplication opportunities we expect DBDS to generate less code, because it only performs beneficial duplications. For workloads where DBDS finds more opportunities than path duplication (because it is aware of transitive effects), we expect the code size to be in the range of the maximum code-size increase budget (e.g. 25% increase for DBDS-0.25).

– Performance: We expect the performance to be as good as with path duplication and even better for benchmarks where path duplication misses opportunities or is unaware of transitive effects.

– Compilation Time: We expect the compilation time for DBDS to depend on the budget configuration. Simulation takes time, which means that large budgets can result in higher compilation time than path duplication.

We present the results of these experiments in Tables 8.3 and 8.4, which show the mean, min and max values for each configuration, each metric, and each benchmark. Detailed performance results for all benchmarks can be seen in the appendix in Figures C.1 to C.6.

8.2.2.1 Interpretation

We marked the best mean value of each configuration and metric in Tables 8.3 and 8.4 by underlining the respective value. Additionally, we marked the best performance, since duplication is an aggressive compiler optimization targeting long-running server applications.

Performance DBDS configurations are on par with or faster than PathDup in every benchmark suite except one, namely ScalaBench9. For ScalaBench, PathDup creates on average 1.1% faster code than the next-fastest DBDS configuration. For the other run-time-based benchmark suites,

9For ScalaBench, PathDup creates the best single performance increase. This can be seen in the column min.

PathDup is on par with the fastest DBDS configuration of the respective benchmark. Looking at individual benchmark results10, DBDS-0.25 outperforms PathDup by up to 14% in JavaScript Octane. The fastest DBDS configuration, DBDS-1.0, even improves performance for a single benchmark by up to 36%, while PathDup only improves performance by up to 18%. For Java DaCapo, PathDup creates code that is on average as fast as DBDS11.

Code Size Code-size effects are as expected for the DBDS configurations. DBDS-0.05 creates the smallest code-size increase of all DBDS configurations and also always creates smaller code than PathDup, while its performance is on average nearly as good as that of PathDup. Code-size increases of path duplication tend to be larger than those of the DBDS configurations up to DBDS-0.25.

Compilation Time Compile-time effects are as expected for the DBDS configurations. The slowest is DBDS-1.0, while the fastest is DBDS-0.05. This means that the trade-off function of DBDS from Section 6.3 works as expected: the more budget it has, the more duplications it allows and the longer the compilation takes. Interestingly, compilation time for PathDup is fairly good, given that its code-size increase is always larger than for DBDS-0.05. This means that the code-size trade-off, which is needed to reduce code size, requires more compilation time than not performing a precise code-size trade-off. The trade-off comes at a compile-time cost that is not alleviated by the reduced code size, i.e., reducing code size by simulation works, but consumes more compilation time than is saved via the reduced code size. This is a unique finding, and we consider it an interesting target for future work in a setting where code size is the crucial performance metric.

10Minimum and maximum values for a metric can be seen in columns min and max. 11For Java DaCapo, DBDS-1.0 creates the best single performance improvement.

Benchmarks    Configuration  Compile Time ↓           Code Size ↓              Run Time ↓
                             Mean   Min    Max        Mean   Min    Max        Mean   Min    Max
Java DaCapo   PathDup        1.132  1.067  1.17       1.156  1.092  1.254      0.99   0.954  1.005
              DBDS-0.05      1.127  1.084  1.185      1.06   1.021  1.122      0.994  0.955  1.015
              DBDS-0.1       1.179  1.107  1.335      1.098  1.05   1.215      0.994  0.955  1.01
              DBDS-0.25      1.233  1.135  1.542      1.142  1.082  1.316      0.992  0.952  1.015
              DBDS-1.0       1.28   1.178  1.71       1.183  1.12   1.41       0.99   0.939  1.02
ScalaBench    PathDup        1.082  1.026  1.238      1.099  1.028  1.255      0.977  0.904  1.024
              DBDS-0.05      1.096  1.061  1.157      1.05   1.013  1.096      0.994  0.941  1.024
              DBDS-0.1       1.123  1.074  1.198      1.069  1.019  1.12       0.989  0.915  1.014
              DBDS-0.25      1.136  1.078  1.224      1.089  1.021  1.137      0.988  0.92   1.02
              DBDS-1.0       1.16   1.071  1.266      1.112  1.034  1.181      0.99   0.933  1.02
Renaissance   PathDup        1.08   0.881  1.211      1.097  0.825  1.289      0.992  0.922  1.029
              DBDS-0.05      1.091  0.925  1.203      1.043  0.804  1.153      0.992  0.92   1.019
              DBDS-0.1       1.139  0.928  1.452      1.076  0.796  1.33       0.994  0.907  1.053
              DBDS-0.25      1.161  1.049  1.309      1.096  0.825  1.243      0.995  0.904  1.028
              DBDS-1.0       1.191  1.004  1.411      1.138  0.889  1.344      0.997  0.89   1.031

Table 8.3: Simulation performance run-time benchmarks. Best value for a given metric is underlined and bold. Best performance is marked in red. Default DBDS configuration is marked in green.

Benchmarks                   Configuration  Compile Time ↓           Code Size ↓              Throughput ↑
                                            Mean   Min    Max        Mean   Min    Max        Mean   Min    Max
Java SPECjvm2008             PathDup        1.117  0.996  1.505      1.136  1.032  1.356      1.006  0.919  1.09
                             DBDS-0.05      1.103  1.012  1.586      1.055  0.982  1.393      1.001  0.98   1.015
                             DBDS-0.1       1.137  1.064  1.562      1.083  1.002  1.365      1.007  0.979  1.05
                             DBDS-0.25      1.147  1.078  1.452      1.101  1.026  1.438      1.009  0.979  1.092
                             DBDS-1.0       1.188  1.088  1.659      1.14   1.037  1.443      0.999  0.897  1.045
JavaScript Octane            PathDup        1.092  1.059  1.166      1.121  1.064  1.172      1.038  0.812  1.183
                             DBDS-0.05      1.141  1.069  1.244      1.078  1.024  1.174      1.056  0.867  1.251
                             DBDS-0.1       1.174  1.103  1.244      1.098  1.052  1.171      1.053  0.826  1.197
                             DBDS-0.25      1.204  1.123  1.299      1.135  1.086  1.2        1.065  0.817  1.324
                             DBDS-1.0       1.229  1.146  1.316      1.159  1.086  1.254      1.062  0.789  1.364
JavaScript jetstream asm.js  PathDup        1.087  1.069  1.121      1.113  1.09   1.153      0.991  0.901  1.016
                             DBDS-0.05      1.109  1.085  1.202      1.058  1.028  1.093      1.016  1      1.043
                             DBDS-0.1       1.14   1.104  1.229      1.087  1.035  1.153      1.015  0.986  1.045
                             DBDS-0.25      1.17   1.139  1.287      1.117  1.08   1.173      1.014  0.986  1.045
                             DBDS-1.0       1.207  1.145  1.406      1.161  1.102  1.212      1.012  1      1.031

Table 8.4: Simulation performance throughput benchmarks. Best value for a given metric is underlined and bold. Best performance is marked in red. Default DBDS configuration is marked in green.

Synthesis The results show that simulation-based code duplication and PathDup are very close to each other with respect to performance. Most benchmarks for which duplication triggers show comparable performance improvements. However, on average DBDS often improves upon path duplication12, for example, when comparing path duplication with the default DBDS-0.25 configuration on specific benchmarks13: looking at individual benchmarks14, DBDS-0.25 improves upon path duplication by up to 14% in JavaScript Octane, 4% in JavaScript jetstream asm.js, and 2% in Renaissance. This also shows up in most of the average values of the configurations. We inspected the benchmarks where DBDS outperforms path duplication by a large margin. Many of the improvements stem from transitive effects, code-size trade-offs, and precise reasoning, which path duplication does not apply. Regarding complexity, path duplication is already the most complex heuristic-based implementation that does not use dedicated simulation or backtracking steps. Generating faster code than path duplication, given that it was tuned on the benchmarks presented in this thesis for a long time, is a notable success. Therefore, we conclude that simulation-based reasoning is beneficial, as it can outperform hard-coded and hand-optimized heuristics. Furthermore, we see that a precise code-size trade-off pays off, as, e.g., DBDS-0.25 creates smaller code than path duplication in 4 out of 6 benchmark suites15.

The experiments indicate that the following claims made in Chapters 4 to 6 are correct:

• Simulation-based code duplication is necessary to discover transitive effects of duplication opportunities in order to produce the best possible performance increase. Performance for PathDup and DBDS is similar for many benchmarks. Therefore, reasoning about average performance impacts is not very useful, as both optimizations trigger on benchmarks that benefit from duplication, and improvements on those benchmarks hardly show up in the averages. We therefore support this claim by inspecting the performance of individual benchmarks, i.e., the best individual performance increases. To illustrate the drawbacks of heuristics, we look for examples where the missing knowledge16 of the optimization causes missed optimization potential. Indeed, simulation-based duplication finds optimization opportunities that heuristic-based approaches like path duplication miss in many benchmarks: it outperforms path duplication in individual benchmarks by up to 14%17, while path duplication only outperforms simulation-based duplication by up to 2%18. However, this still comes at an average compile-time cost of 10% compared to path duplication and an average 2%

12 Except for the ScalaBench benchmark, for which PathDup creates the fastest single-code improvement.
13 Column min for run-time-based benchmarks and column max for throughput-based ones.
14 Columns min and max.
15 With a different parameterization for the code-size increase budget, e.g., DBDS-0.05, code size is always smaller than with PathDup.
16 For example, the missing transitive-effect determination of a heuristic results in missed optimization opportunities.
17 In the JavaScript Octane benchmark, column max.
18 See the fastest DBDS-0.25 configuration against path duplication for the benchmark ScalaBench.

code-size increase19. We marked the default configuration of DBDS in green in Tables 8.3 and 8.4. It demonstrates a reasonable trade-off between compilation time, code size, and performance. It outperforms path duplication in 3 out of 6 benchmark suites20. The default configuration also produces smaller code than path duplication in 4 out of 6 benchmark suites. Regarding compile time, DBDS-0.25 is always slower than path duplication, which is expected given that the DBDS implementation is complete and performs a trade-off step, which path duplication does not.

• A code-size trade-off is necessary in order to reduce code size to the minimal possible increase (which cannot be proven in general). This can be shown by comparing path duplication with DBDS-0.25. In the 0.25 configuration, DBDS produces faster code than path duplication on average in 3 benchmark suites21. It generates smaller code than path duplication in 4 out of 6 benchmark suites, increases code size compared to path duplication by only 1.4% in one benchmark suite, and creates on average up to 2.8% faster code22. Code size could probably be reduced even more, resulting in smaller and faster code than path duplication for every single benchmark. However, we have not yet worked on tuning our code-size trade-off from Section 6.3 towards smaller code size.

• Compilation time for simulation is higher than for heuristics, given that simulation is precise. However, code size is smaller and performance is better. This claim can be partially shown depending on the different configurations of DBDS. DBDS can be configured with different parameters; we chose to configure the budget parameter in our experiments, which can have an impact on all success metrics. The best trade-off between performance, code size, and compile time is the DBDS-0.25 configuration, as it generates on average the fastest code of all DBDS configurations, with nearly no code-size increase compared to path duplication. However, it requires more compilation time than path duplication: DBDS requires between 3% and 10% more compilation time than path duplication, while it nearly always produces the same or better performance improvements as path duplication without increasing code size. We consider this a success and a fact supporting our initial claim.

19 It can be argued that the compilation time of path duplication would also increase if it were able to find more duplication opportunities. However, we have no heuristic duplication implementation that is capable of reasoning about transitive effects, and thus we cannot prove this claim.
20 It generates the best individual performance increase (column min for run-time-based benchmarks and column max for throughput-based benchmarks) in 5 out of 6 benchmark suites compared to path duplication. If we compare the average performance of DBDS-0.25 against PathDup, DBDS-0.25 is on average 0.15% slower than PathDup for the run-time-based benchmarks and on average 1.8% faster for the throughput-based benchmarks.
21 By up to 3% on average; the other 3 are slower than path duplication by at most 1.2%.
22 With single benchmarks being up to 18% faster than path duplication.

Additionally, if we are willing to sacrifice a little peak performance, the DBDS-0.05 configuration performs very well compared to path duplication: it produces the lowest code size and only consumes up to 5% more compilation time, without being much slower than path duplication23. Additionally, if we look at the regressions produced by path duplication, DBDS-0.05 always produces fewer regressions on the benchmarks than path duplication24.

• The simulation-based duplication scheme allows for a fine-grained trade-off between code size, compilation time, and performance. This claim is nicely visible when looking at the configurations DBDS-0.05 to DBDS-1.0, where we see that increasing the budget comes with higher compilation times, more code, and (most of the time) also better performance25. However, performance improvements do not scale as linearly as code size and compile time. There is a natural plateau that is reached when it comes to performance improvements for a single optimization like duplication. This plateau seems to lie between DBDS-0.25 and DBDS-1.0. At a certain point, performing more duplication reduces performance, due to indirect effects such as phase-ordering issues and instruction-cache pollution.

In summary, the results support our initial claims: performing a trade-off is necessary to control code size and compilation time. While this might not be a problem for heuristics like path duplication, as they are limited in the possible number of their duplications, it is necessary for precise approaches like simulation-based duplication. All heuristics, including path duplication, are limited in the number of opportunities they capture, because they are hard-coded and do not analyze transitive effects. Precise approaches like DBDS, however, require profound trade-offs in order to avoid code-size explosion and compile-time increases. For example, DBDS finds many more duplication opportunities than path duplication. Based on our work, we believe that path duplication is already a very complex heuristic (with a complex implementation) that performs very well compared to DBDS. In the end, however, heuristics like path duplication are limited in their performance potential26, requiring approaches like DBDS in order to generate the best possible performance increases with duplication. If compilation time and code size are less important and performance is the most important success metric, DBDS-1.0 is a suitable configuration, as it creates the fastest single performance increase in 5 out of 6 benchmark suites.

23 The highest average slowdown relative to path duplication is 1.7%.
24 By up to 10% on JavaScript jetstream asm.js.
25 Looking at the maximum performance increases of the single benchmark suites.
26 This can be seen in the performance results, where DBDS produces performance improvements that are 2 times better than path duplication.

8.2.3 Fast-Path Loop Unrolling

In this section we evaluate the implementation of the fast-path unrolling of non-counted loops as proposed in Chapter 7. Our main hypothesis was that fast-path unrolling of non-counted loops can significantly increase the performance of Java applications. Therefore, our primary metric of interest is performance (average run time or throughput). We tested one configuration, FPU (fast-path unrolling), with our optimization enabled, against the baseline GraalVM without any of the optimizations proposed in this thesis. Tables 8.5 and 8.6 show the results of our experiments for the benchmarks. Detailed performance plots can be found in Figures C.13 to C.18.

Benchmarks    Configuration  Compile Time ↓          Code Size ↓             Run Time ↓
                             Mean   Min    Max       Mean   Min    Max       Mean   Min    Max
Java DaCapo   FPU            1.042  1.018  1.102     1.023  1.002  1.066     1      0.984  1.01
ScalaBench    FPU            1.088  1.01   1.538     1.04   0.998  1.163     1.006  0.991  1.02
Renaissance   FPU            1.025  0.821  1.138     1.005  0.749  1.121     0.936  0.5    1.019

Table 8.5: Fast-path loop unrolling performance run-time benchmarks.

Benchmarks                   Configuration  Compile Time ↓          Code Size ↓             Throughput ↑
                                            Mean   Min    Max       Mean   Min    Max       Mean   Min    Max
Java SPECjvm2008             FPU            1.053  0.991  1.281     1.032  0.98   1.297     0.995  0.971  1.007
JavaScript Octane            FPU            1.045  1.01   1.083     1.012  0.982  1.071     0.971  0.734  1.028
JavaScript jetstream asm.js  FPU            1.098  1.013  1.339     1.017  0.998  1.042     1.199  1      3.615

Table 8.6: Fast-path loop unrolling performance throughput benchmarks.

8.2.3.1 Discussion

Performance The performance impact of fast-path unrolling of non-counted loops varies over the different benchmark suites. The impact on Java DaCapo is mixed: we see slight improvements and regressions in run time of up to 2%27. The performance impact on ScalaBench is similar. Performance improvements on Renaissance are significant, with run-time reductions of up to 50%. The most significant performance improvements are reached on our subset of the JavaScript jetstream asm.js benchmark suite, with improvements of more than 300%. Our subset of the jetstream suite only contains asm.js benchmarks. Those benchmarks offer much optimization potential (mostly due to loop-carried dependencies), for example, array accesses that can be optimized if a loop is unrolled once.

27 Average performance is unchanged. Individual results in columns min and max show up to 2% changes.
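
To make this pattern concrete, the following is a hypothetical Java sketch of ours (not taken from the benchmarks) of a non-counted loop with a loop-carried dependency through an array. After unrolling one iteration, the load of a[i] in the second copy reads exactly the element the first copy just wrote, so the compiler can forward the stored value and eliminate the reload from the fast path.

    class LoopCarriedExample {
        // A non-counted loop: the exit depends on the data in a.
        static int accumulate(int[] a, int limit) {
            int i = 0;
            while (i < a.length - 1 && a[i] < limit) {
                a[i + 1] += a[i]; // writes the element the next iteration reads
                i++;
            }
            return i;
        }
    }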

Compile Time We consider the compile-time increase to be acceptable for most of the benchmarks. We measured the highest compile-time increase (a maximum of about 50%) in the ScalaBench benchmark apparat. We analyzed the benchmark in detail: several hot compilation units in this benchmark are very large. Unrolling them opens further opportunities for unrolling inner loops, which are then also unrolled by our heuristics. We consider this a suboptimal pattern that should be fixed in the future.

Code Size The code-size increase of fast-path loop unrolling is negligible in nearly all cases (except for the same problem in apparat, which is also reflected in code size: unrolling a loop causes inner-loop unrolling opportunities to open up). The low code-size increases are generally due to our trade-off function, which is configured to only optimize loops for which we know that there is sufficient optimization potential. Unrolling all loops that carry optimization potential would have a severe impact on code size; however, the loops for which the run-time impact justifies a certain code-size increase are much less frequent. Thus, the overall increase in code size is low. The current parameterization of the optimization never allows a code-size increase larger than 50%; a sketch of this cap follows below.
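
As a minimal sketch (hypothetical names; the real check operates on IR node counts and the node cost model from Chapter 5), the cap can be expressed as a simple predicate:

    class UnrollingBudget {
        // Sketch of the 50% code-size cap for fast-path unrolling.
        // currentSize and estimatedIncrease are abstract code-size
        // estimates produced by the node cost model.
        static boolean withinBudget(int currentSize, int estimatedIncrease) {
            // Reject any unrolling that would grow the compilation unit
            // by more than 50%.
            return 2 * estimatedIncrease <= currentSize;
        }
    }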

8.2.4 Loop-wide Lock Coarsening

In this experiment we show the performance results of evaluating loop-wide lock coarsening from Section 7.5 against our baseline configuration. We want to show that tiling the iteration space of loops inside locks for the sake of coarsening them can significantly improve performance. The results of the experiment can be seen in Tables 8.7 and 8.8, which show aggregated performance results of the loop-wide-lock-coarsening (LWLC) configuration against the GraalVM baseline. The shape of the transformation is sketched below.
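
The following Java sketch illustrates the shape of the transformation under evaluation; it is a simplified, hypothetical example with names of ours, not the compiler's actual IR-level rewrite. The lock around each iteration is coarsened to cover a tile of T iterations, trading lock acquisitions for longer critical sections:

    class LockCoarseningSketch {
        static final int T = 64; // hypothetical tile size; choosing it is not automated
        final Object lock = new Object();
        int work = 1_000_000;

        boolean hasWork() { return work > 0; }
        void processOne() { work--; }

        // Before: one lock acquisition per loop iteration.
        void original() {
            while (hasWork()) {
                synchronized (lock) { processOne(); }
            }
        }

        // After loop-wide lock coarsening: one acquisition per tile of T iterations.
        void coarsened() {
            while (hasWork()) {
                synchronized (lock) {
                    for (int t = 0; t < T && hasWork(); t++) { processOne(); }
                }
            }
        }
    }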

Benchmarks    Configuration  Compile Time ↓          Code Size ↓             Run Time ↓
                             Mean   Min    Max       Mean   Min    Max       Mean   Min    Max
Java DaCapo   LWLC           1.006  0.995  1.033     1      0.987  1.029     0.999  0.987  1.005
ScalaBench    LWLC           0.997  0.977  1.018     0.999  0.979  1.038     1      0.986  1.008
Renaissance   LWLC           0.997  0.88   1.062     0.986  0.87   1.037     0.971  0.549  1.018

Table 8.7: Loop-wide lock coarsening performance run-time benchmarks.

Benchmarks                   Configuration  Compile Time ↓          Code Size ↓             Throughput ↑
                                            Mean   Min    Max       Mean   Min    Max       Mean   Min    Max
Java SPECjvm2008             LWLC           1.021  0.989  1.306     1.01   0.951  1.352     1.002  0.971  1.084
JavaScript Octane            LWLC           1.01   0.987  1.043     1.002  0.98   1.072     0.983  0.755  1.016
JavaScript jetstream asm.js  LWLC           1.005  0.984  1.015     0.996  0.968  1.009     1      0.994  1.006

Table 8.8: Loop-wide lock coarsening performance throughput benchmarks.

Interpretation The average performance impact is not very high. For Java DaCapo, ScalaBench, and Java SPECjvm2008, the optimization does not create improvements or regressions above 2%. However, for the Renaissance benchmark suite, the impact is larger. This is expected, as Renaissance [158] is a concurrency-focused benchmark suite exhibiting many more optimizable synchronization patterns than Java DaCapo or ScalaBench. For Renaissance, the highest impact is on the FJ-Kmeans benchmark, with a run-time reduction of nearly 50%. This benchmark suffers from high locking overhead in a very hot loop that is contended between two threads.

Compilation-time and code-size impacts range from 0 to 30%, while the means are within 2%. We believe that applications exhibiting loops with synchronized statements can profit considerably from this optimization. However, choosing suitable values for the tiling size is currently not automated and done on a best-effort basis. We consider this an interesting target for future work, but have not yet investigated this direction.

8.2.5 Node Cost Model

In this section, we test the hypothesis that our node cost model allows optimizations to apply a finer-grained trade-off of code size versus peak performance, resulting in increased peak performance and reduced code size and compilation time. We tested this hypothesis by running our benchmarks with different compiler and cost model configurations. A change in the node cost model parameterization should be reflected in either performance, code size, or compile time. Therefore, we tested four configurations and parameterizations of the node cost model against a baseline (no-opt) with all optimizations from this thesis disabled:

GraalVM This is the default configuration of GraalVM, which is configured to deliver high performance at a medium code-size and compile-time increase. It uses the node cost model in duplication-based optimizations to perform code-size and run-time estimations.

no-model The cost model trade-off functions for DBDS and fast-path unrolling are disabled. Every time an optimization finds a potential improvement in peak performance, it performs the associated transformation without a trade-off against code size.

zeroSize In this configuration, we set all NodeSize values for IR nodes to 0. The zeroSize configuration should behave similarly to the no-model configuration, because the estimated code-size increase is always 0. However, the GraalVM configuration limits the maximum code-size increase per compilation unit to 50%, which is also applied in the zeroSize configuration. Therefore, we expect code sizes larger than with GraalVM but within the 50% limit.

zeroCycles In this configuration, we set all NodeCycles values for IR nodes to 0. The zeroCycles configuration should reduce the code size compared to the no-opt configuration, because the trade-off function of DBDS only triggers if the code-size effect of a duplication is negative (i.e., duplication can enable subsequent compiler optimizations that reduce code size, for example dead code elimination [7]).

Expected changes in performance for all configurations in relation to the baseline can be seen in Table 8.9.

Config       Performance   Code Size   Compile Time   Relative to
GraalVM      ↑             ↑           ↑              no-opt
no-model     ↑             ↑           ↑              GraalVM
zeroCycles   -             ↓           ↓              no-opt
zeroSize     -             ↑           ↑              GraalVM

Table 8.9: Expected performance impacts: upward-pointing arrows represent an increase of a given metric, downward-pointing arrows a decrease.

Benchmarks    Configuration  Compile Time ↓          Code Size ↓             Run Time ↓
                             Mean   Min    Max       Mean   Min    Max       Mean   Min    Max
Java DaCapo   GraalVM        1.279  1.164  1.61      1.151  1.079  1.332     0.991  0.942  1.02
              No-Model       2.139  1.568  3.053     1.713  1.389  2.087     1.04   0.951  1.193
              Zero-Cycles    1.05   1.031  1.097     0.999  0.981  1.034     0.997  0.969  1.01
              Zero-Size      1.378  1.199  1.851     1.243  1.115  1.452     0.997  0.958  1.05
ScalaBench    GraalVM        1.205  1.102  1.316     1.123  1.019  1.207     0.993  0.913  1.037
              No-Model       1.742  1.296  2.425     1.569  1.16   1.946     1.036  0.903  1.152
              Zero-Cycles    1.027  0.926  1.062     0.994  0.953  1.017     1.001  0.985  1.012
              Zero-Size      1.305  1.111  1.913     1.187  1.038  1.277     1      0.904  1.051
Renaissance   GraalVM        1.219  0.995  1.366     1.121  0.823  1.267     0.908  0.5    1.031
              No-Model       1.87   1.236  3.041     1.681  1.006  2.65      1.003  0.605  1.412
              Zero-Cycles    1.055  0.967  1.133     1      0.879  1.093     0.972  0.551  1.029
              Zero-Size      1.295  1.08   1.542     1.176  0.846  1.365     0.911  0.5    1.052

Table 8.10: Cost model performance run-time benchmarks.

Benchmarks                   Configuration  Compile Time ↓          Code Size ↓             Throughput ↑
                                            Mean   Min    Max       Mean   Min    Max       Mean   Min    Max
Java SPECjvm2008             GraalVM        1.241  1.131  1.867     1.146  1.06   1.647     1.003  0.971  1.059
                             No-Model       1.865  1.311  3.184     1.694  1.299  2.529     0.942  0.722  1.057
                             Zero-Cycles    1.077  1.005  1.619     1.021  0.947  1.532     0.994  0.859  1.053
                             Zero-Size      1.297  1.148  2.064     1.215  1.103  1.828     0.992  0.897  1.063
JavaScript Octane            GraalVM        1.261  1.146  1.393     1.152  1.091  1.258     1.035  0.81   1.334
                             No-Model       1.817  1.431  2.483     1.512  1.311  1.676     0.918  0.264  1.423
                             Zero-Cycles    1.103  1.031  2.078     1.08   0.991  2.524     1.006  0.989  1.038
                             Zero-Size      1.295  1.153  1.449     1.187  1.102  1.299     1.016  0.644  1.368
JavaScript jetstream asm.js  GraalVM        1.277  1.177  1.492     1.136  1.097  1.214     1.211  0.986  3.615
                             No-Model       1.726  1.376  2.356     1.492  1.281  1.776     1.227  0.91   3.538
                             Zero-Cycles    1.05   1.031  1.086     1.003  0.99   1.029     0.999  0.985  1.006
                             Zero-Size      1.297  1.172  1.501     1.163  1.082  1.254     1.204  0.99   3.615

Table 8.11: Cost model performance throughput benchmarks.

Interpretation The results of the experiments can be seen in Tables 8.10 and 8.11. We normalized the results to a mean computed from the no-opt configuration, i.e., without any node-cost-model-based optimization enabled. The results seem to confirm our hypothesis. The GraalVM configuration produces the best performance at a medium code-size and compile-time increase. For some benchmarks, GraalVM produces less optimal code than no-opt; however, this only applies to outliers28. The more interesting comparison is between GraalVM, no-model, zeroSize, and zeroCycles. The GraalVM configuration, which uses the node cost model to do relative performance estimations and code-size classification, always results in smaller code size and faster compilation than the no-model configuration. As expected, the zeroCycles configuration either produces no code-size increase or reduces the code size compared to the no-opt configuration. In benchmarks like zlib, zeroCycles results in less compile time than no-opt, because duplication can significantly reduce code size, resulting in less work for other compiler optimizations29. The zeroSize configuration also behaves as expected by producing code-size and compile-time increases between GraalVM and no-model.

28 For details about DBDS performance see [120].

The results indicate that the cost model is indeed beneficial to reduce code size and compilation time and to duplicate only the most promising candidates. The most interesting configuration is zeroCycles, as it shows that a fine-grained trade-off is possible with the proposed node cost model. In zeroCycles, each IR node has a NodeCycles value of 0. As shown in Chapter 6, DBDS only duplicates code iff benefit ∗ probability > cost. In this configuration, the left-hand side of the inequality is always 0, because the benefit is 0 (as all NodeCycles are 0). Thus, DBDS only duplicates if the cost is negative, which can happen when the impact of a duplication is an enabled dead code elimination resulting in a reduction of code size. This effectively leads to a reduction in code size compared to the no-opt configuration in some benchmarks. The zeroSize configuration behaves like the no-model configuration, except that it implements the upper code-size boundary of the GraalVM configuration for an entire compiler graph and thus produces less code than the no-model configuration.
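
As a minimal sketch of this decision logic (hypothetical names; the actual implementation operates on Graal IR and the node cost model), the zeroCycles behavior follows directly from the trade-off predicate:

    class DuplicationTradeOff {
        // benefitCycles: abstract cycles saved at the merge (sum of NodeCycles values).
        // probability:   profile-derived probability of the duplicated branch.
        // sizeCost:      estimated code-size effect; negative if duplication
        //                enables code-removing optimizations.
        static boolean shouldDuplicate(double benefitCycles, double probability,
                                       double sizeCost) {
            return benefitCycles * probability > sizeCost;
        }
    }

Under zeroCycles, benefitCycles is always 0, so the predicate only holds for negative sizeCost: duplication is performed only when it is expected to shrink the code.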

The results show that a simple cost model capturing relative instruction latencies and instruction sizes can be used in high-level optimizations to guide trade-off functions. This enabled us to implement the trade-off functions in our duplication optimizations in Graal [117; 120]. Without investing significant amounts of compile time, we enabled optimizations to perform fine-grained performance predictions at an abstract, architecture-agnostic optimization level. This allowed us to improve performance and to reduce code size and compile time.

29 Code duplication can enable dead code elimination, resulting in less code to be compiled. Therefore, code duplication can reduce compilation time in this benchmark.


Chapter 9

Related Work

In this chapter we reflect on the related work for the algorithms and optimizations proposed in this thesis.

Duplication works under the premise of optimization via specialization. Thus, there is a lot of related work in the domain of compiler optimization and performance estimation. We grouped related work into three major categories, namely code duplication, loop unrolling, and compiler cost models.

9.1 Code Duplication

A multitude of approaches have been devised to exploit the direct or indirect impact of duplication on subsequent optimization opportunities.

Mueller and Whalley [136; 137] proposed Replication: a special form of code duplication to optimize away conditional and unconditional branches. Their approach is related to simulation-based code duplication (and DBDS in particular) in that our approach also duplicates code to remove conditional branches1.

Many novel concepts have been applied in the design of the Self compiler. One of the novel optimizations developed for Self was Splitting [25]: an approach to tackle the problems of virtual dispatch [102]. Duplication can be used to devirtualize call sites for which a concrete receiver type is only known in specific contexts2. The Self compiler performs splitting to specialize code after control-flow merges to the values used in predecessor branches. This allows the splitting optimization to specialize calls to more concrete receiver types3. After splitting, virtual calls can be replaced by invocations of statically-known methods as outlined in Section 6.1.6, allowing them to be inlined. Conceptually, splitting is a special form of type-guarded inlining [156], except that no new type guards are added before invocations; instead, dominating conditions act as guards for ϕ input specialization. Splitting is related to simulation-based code duplication in many aspects, even though our approach refrains from applying devirtualization during duplication in favor of more elaborate inlining models [156]. However, the basic idea deployed by Self's splitting also applies to simulation-based code duplication. Additionally, the heuristics applied by splitting incorporate profiling information, which makes them useful in the context of dynamic compilation. Chambers [25] describes Self's splitting heuristics, which are based on the frequency of the code paths that are optimized (weight) and the cost of a duplication (code size), making splitting aware of its code-size impact and thus amenable to elaborate cost models. We extended their ideas (reluctant splitting, eager splitting, and the combination of both) by using a fast duplication-simulation algorithm (DBDS from Chapter 6) to estimate the peak performance impact of a duplication before performing it. Additionally, we improved upon their idea of weight and cost by using a static performance estimator4 to estimate the peak performance increase of a duplication transformation and by using real profiling information from the VM to only optimize those parts of the program that are frequently executed.

1 See conditional elimination opportunity in Section 6.1.3.
2 See devirtualization opportunity in Section 6.1.6.
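
To illustrate the splitting idea outlined above in source form, consider the following hypothetical Java example of ours (not taken from the Self literature; the actual transformation happens on compiler IR, not source code):

    interface Shape { double area(); }
    final class Circle implements Shape { public double area() { return Math.PI; } }
    final class Square implements Shape { public double area() { return 1.0; } }

    class SplittingExample {
        // Before splitting: the call after the merge is a virtual dispatch,
        // because the receiver type is only known per branch.
        static double merged(boolean condition) {
            Shape s = condition ? new Circle() : new Square();
            return s.area(); // receiver type unknown at the merge
        }

        // After splitting the merge: each copy sees a concrete receiver
        // type, so the calls can be statically bound and inlined.
        static double split(boolean condition) {
            if (condition) {
                return new Circle().area(); // statically known: Circle.area
            } else {
                return new Square().area(); // statically known: Square.area
            }
        }
    }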

Another approach closely related to simulation-based code duplication was developed by Bodík et al. [13]. They use duplication as a means to enable partial redundancy elimination (PRE) [15; 28; 104] and also pre-compute which duplications are necessary. We extended their ideas by computing, before duplication, the general set of optimizations enabled by it, covering all opportunities presented in Chapter 6.

Simulation-based duplication works by considering the effects on a simulated version of the program as an estimation of how the actual program can be optimized. There has been research in the direction of performance-impact estimation, most notably by Ball [8], who estimates the effects of subsequent optimizations on inlined functions by using a data-flow analysis on the inlinee to derive parameter dependency sets for all variables. Those dependency sets collect all expressions influencing the computation of a variable. The effect of an optimization can then be estimated by propagating constant parameters through the dependent computations. The approach could be adapted to work on ϕ inputs instead of parameters to estimate the impact of ϕ inputs on subsequent code. DBDS improves upon the original approach by enabling a larger set of optimizations, including optimizations that depend on control flow.

3 For example, see ϕ specialization in Chapter 4.
4 Node cost model from Chapter 5.

Advanced scheduling approaches like trace, superblock, and hyperblock scheduling [66; 96; 127] apply code duplication in order to increase the instruction-level parallelism (ILP) of a compilation unit. Those approaches differ from simulation-based code duplication in what they want to achieve. They apply tail duplication [26] in order to extend basic blocks by forming traces, superblocks, or hyperblocks. Basic blocks are sequences of instructions without any jump target (except for the first instruction) and without any jump (except for the last instruction). Superblocks are extended basic blocks: larger sequences of basic blocks that can contain jumps to outside code (merge blocks) but no control-flow merges. This means that a superblock is the longest possible sequence of instructions without control-flow merges. Superblocks increase ILP [197], which is needed to properly optimize code for VLIW processors, which in turn require elaborate compiler support to generate efficient instructions. Different heuristics have been proposed in this context, but the optimization potential established by duplication is not analyzed in advance, since it is not needed for instruction selection and scheduling.

An interesting approach that combines the ideas of tail duplication with loop unrolling was proposed by Rock and Koch [170] in what they called aggressive tail splitting (ATS). ATS combines the ideas of hyperblock generation (inside a loop) with loop unrolling; however, instead of merging the backedge with the original loop header, ATS can generate new loop headers specialized to the backedge values. They call this re-creation of a specialized loop header re-splicing. Their approach generates multi-level loops [115; 119] during re-splicing. We improved upon their idea by using fast-path loop creation instead of re-splicing, which gives us a place to schedule infrequent operations, namely the outer slow-path loop. Those infrequent operations often pollute the type and value information inside loop-ϕ functions5.

We often used the paradigm of optimization via specialization as the root idea of code duplication. There is another compiler optimization that embodies this idea even more: inlining [7; 154; 156; 174]. We consider related work in the context of inlining to be out of scope for this thesis; however, we still want to mention that approaches like procedure cloning [34] or splitting [25] employ many concepts similar to the ones that we developed or applied in our simulation-based duplication work.

Finally, we want to mention trace-based compilers [10; 85; 99] in contrast to method-based compilers like Graal. Trace-based compilers have to deal with a problem similar to code duplication: when to merge. A trace-based compiler typically records execution traces and compiles them to machine code. During the process of tracing an execution, it is necessary to detect backedges of loops to actually discover high-level constructs such as loops. In this process, a trace-based compiler can decide to create control-flow merges in a delayed fashion, effectively applying tail duplication and loop unrolling during trace creation. The question of when to merge and when to keep recording a trace is very similar to the question of when and where to duplicate code.

5 See Chapter 7.

Classification                      Profile-Driven   Impact-Aware   Opportunities                                          Code-Size Increase
Tail Duplication                    ✗                ✗              OPT-LATE1                                              Depends on heuristic
Replication                         ✓                ✓              Conditional Jumps, Unconditional Jumps, OPT-LATE       15%-50%
Trace Scheduling                    ✓                ✗              ILP2, OPT-LATE                                         15%-300%
Superblock Duplication              ✓                ✗              ILP, OPT-LATE                                          20%-450%
Hyperblock Duplication              ✓                ✗              ILP, OPT-LATE, If-Conversion                           X
PRE (with pruning)                  ✓                ✓              ILP, OPT-LATE, Complete PRE3                           14%-1300%
Simulation-Based Code Duplication   ✓                ✓              ILP, OPT-LATE, PRE, CE4, SR5, AS6, Global CSE7, PEA8   0%-25% (parameterizable)

1 OPT-LATE ≡ All optimizations enabled by a duplication which have not been identified before duplication.
2 ILP ≡ Increased instruction-level parallelism (ILP) of the generated code.
3 PRE ≡ Partial Redundancy Elimination.
4 CE ≡ Conditional Elimination.
5 SR ≡ Strength Reduction.
6 AS ≡ Algebraic Simplifications.
7 CSE ≡ Common Subexpression Elimination.
8 PEA ≡ Partial Escape Analysis and Scalar Replacement.

Table 9.1: Code duplication-based optimizations: feature comparison

9.1.1 Comparison

To summarize this section on related work in the context of tail duplication, we compiled an overview of the approaches most closely related to ours, capturing the most important success metrics of a duplication approach. The overview can be seen in Table 9.1. We classify related work based on the following criteria:

• Profile-driven: Duplication can utilize profiling information, i.e., execution frequencies or branch probabilities, to only duplicate code that is actually executed and performance relevant.

• Impact-aware: Duplication can be done heuristically, without knowing the optimization potential of the actual outcome of the transformation. Alternatively, an approach can use analysis to find out whether a piece of code is optimizable after duplication.

• Opportunities: We list the optimization opportunities enabled by a code duplication. This includes optimization opportunities that are directly enabled via the change in control flow or data flow as well as all optimizations that are accidentally enabled by a duplication. We denote the latter as OPT-LATE.

• Code-size increase: We report the code size increases produced by the duplication optimiza- tions as reported in the respective publications.

Based on the comparison in Table 9.1, we can see that simulation-based code duplication captures most of the optimization opportunities and capabilities of the other approaches while keeping the code-size increase small and parameterizable.

9.2 Loop Unrolling

Loop unrolling is a special form of code duplication, as it duplicates a loop's header along the back-edge of a loop6. Based on this fact, we proposed fast-path loop unrolling as a simulation-based loop unrolling technique for non-counted loops. In this section, we reflect on related work in the domain of loop unrolling.

Numerous approaches to generate faster code for loops have been proposed over the last decades. Loop Unrolling [43; 44; 45; 50; 94; 173] was extensively researched as a general-purpose compiler optimization. However, loop unrolling is usually only done on counted loops. There are many optimization opportunities that can be enabled by loop unrolling, including (but not limited to):

• Performance increases on pipelined architectures due to better ILP [44; 197]

• Reduction in loop trip checks [45], which leads to fewer computations

• Improved register allocation and pipelining [52]

• Improved memory locality [7]

Most related to our approach is the work of Huang and Leng [94]. They proposed a generalized unrolling of while(...) loops, based on adding the weakest precondition [48] of a loop to its condition. We take a different approach by selectively unrolling loops with arbitrary side-effecting instructions together with their loop conditions. We do not use the concept of weakest preconditions, since automatically deriving them is not possible in the presence of arbitrary side-effects. Therefore, we search for different optimization opportunities that can still be enabled by unrolling a loop, even if the loop conditions are unrolled together with the body.

6 See Chapter 7.

Loop unrolling combined with loop jamming [2] was researched by Carr et al. [20]. Their approach performs unroll-and-jam to increase the instruction-level parallelism of nested loops in order to enable better scheduling on pipelined architectures. Our approach is different in that we explicitly unroll non-counted loops to enable other compiler optimizations; we did not implement an instruction-level-parallelism-based optimization.

Knijnenburg et al. [107] applied unroll-and-jam in the context of iterative compilation, an approach that derives unrolling parameters for a given workload automatically via repeated compilation and execution. We have not investigated the application of iterative compilation to derive unrolling factors or other optimization and cost-model parameters, given that a dynamic compiler must compile at run time, which prohibits repeated compilation for the sake of performance. However, we consider the application of iterative compilation an interesting future research task and reflect on it in Chapter 10. Especially the presence of profiling information could be utilized, inspecting hot compilation units and performing online sampling to apply simplified versions of iterative compilation in a system like Graal. Long and O'Boyle [125] worked on employing instance-based learning for optimizing Java programs and combined it with online learning in order to obviate the need for iterative compilation. We believe that combining such approaches is an interesting future work direction that would also adhere to the requirements of dynamic compilers like GraalVM.

The HotSpot server compiler [153] applies various optimizations on counted and non-counted loops, including unrolling, tiling, and iteration splitting [144]. Graal implements loop optimizations similar to those of the server compiler. Our work extends Graal to also unroll non-counted loops.

Carminati et al. [19] studied a loop unrolling approach to reduce the worst-case execution time of real-time software. They use various unrolling strategies for counted and non-counted loops and use if-conversion to create branchless code. Our approach to non-counted loop unrolling allows us to effectively unroll only the fast path of a loop. Additionally, we improved upon their work by removing all restrictions (general side-effects) on the body of the target loop to be unrolled.

Krall and Lelait [110] proposed loop unrolling for SIMD-based auto-vectorization in an optimizing compiler. Counted loops are unrolled n times, where n equals the vector length. A scalar-to-SIMD transformation then transforms scalar code to its vectorized equivalent. This work is partially related to ours in that they unroll loops to enable subsequent optimizations, in their case scalar-to-SIMD transformations. However, their work only considers counted loops.

The Graal compiler already applies loop peeling [7], unswitching, partial unrolling of counted loops, and a limited form of vectorization on counted loops. Therefore, we limited our work to non-counted loops.

Stencil code generation has been extensively researched by Hagedorn et al. [84] to optimize stencil codes inside loops, which also covers many loop-carried dependencies. We could not apply or build on their approach, since most of those transformations cannot be applied in a Java compiler because of possible deoptimizations and side effects [56; 58].

Su and Wang [190] studied the effects of loop-carried dependencies on software pipelining, which was first researched by Charlesworth [27]. Many compilers apply loop unrolling together with, e.g., register renaming on superscalar systems. Such optimizations are not done in Graal; thus, we have not compared our approach to software pipelining.

Auto-vectorization for Java has been proposed for the Jikes RVM [4] by El-Shobaky et al. [64]. Unrolling and auto-vectorization go hand in hand, since many compilers first unroll a loop and then perform linear code vectorization on its body. The Graal compiler already performs auto-vectorization, however only on counted loops. Linear code vectorization in Graal could also profit from unrolling non-counted loops.

9.3 Compiler Cost Models

In the last part of this chapter, we reflect on related work in the context of compiler cost models. In Chapter 5 we proposed a novel cost model for the graph-based IR of the Graal compiler, in order to quantify duplication opportunities and to rank them based on their optimization potential, expressed as abstract cycles-saved and code-size-increase values. Various research efforts on compiler optimizations have applied cost models to reduce compilation time and code size, to increase or predict performance, or to select the best candidate instructions during code generation.

Production Compilers: Most industry-standard compilers (e.g., LLVM [111] and GCC [78]) use cost models to implement performance predictions. However, they typically use architecture-specific information in their models. In contrast, we propose an architecture-agnostic cost model for high-level optimizations.

The GNU C compiler GCC [78] applies several cost models at different compilation stages to guide optimizations, instruction selection, and code generation. However, the IR of GCC is not comparable to the sea-of-nodes IR of Graal. In the front end, GCC mostly uses SSA-based GIMPLE code, which is an abstract CFG-based representation of C programs. In the back end, a machine format (called register transfer language) is used, which precisely models instruction costs7. Both IRs are not comparable to Graal IR, and their level of abstraction is far lower than that of a graph-based IR. Thus, their ideas are not applicable to Graal.

The LLVM compiler project [111] has two cost models: one for inlining and loop unrolling, and one for vectorization [7]. We have not found publications on the considerations underlying their cost models; however, we have looked through the source code of the project8. The cost model used for vectorization precisely models all instructions of the IR. The values for the model are either derived from analysis and simulation or taken from the literature [68]. However, LLVM IR is significantly different from Graal IR; we cannot use precise models in Graal for multiple reasons specified in Chapter 5.

The HotSpot [93] JVM applies various cost models: the client compiler [109] does not apply global cost models at all. The server compiler [144], however, uses architecture description (AD) files for instruction scheduling and selection9. Those AD files contain instruction patterns for instruction selection. The patterns are defined together with a cost annotation that is used by the instruction selection algorithm to generate a minimal instruction sequence from a machine-independent instruction tree. The server compiler similarly assigns a cost of 0 to nodes that do not generate any code. We improved upon the idea of architecture description files by specifying abstract instruction costs to be used in high-level optimizations. While the server compiler does not use cost models during duplication or other high-level optimizations, it uses various predominant metrics for classification, including bytecode size, inlining depth, etc. We improved upon this work by selectively applying our node cost model in optimizations that can have potentially negative effects such as code-size increases.

The optimizing compiler of Google’s V8 [79] JavaScript engine uses a limited form of instruction costs for selection and scheduling. The usage is comparable to the one of the server compiler. We have not found published work on the cost-models used in V8’s optimizations.

(Precise) Performance Prediction: Research in performance estimation explores ways to estimate the run time of an application without executing it. Various approaches have been investigated, including precise performance prediction in compilers [21], compile-time prediction of single-to-multicore application migration [193], compile-time estimations for speculative parallelization [53], and execution-time derivations for benchmark characterization to estimate the performance of a program on a given hardware [172]. Performance prediction approaches are typically devised for compilers of statically-compiled languages, enabling a precise analysis and estimation at a late stage in the compilation pipeline. Graal optimizations cannot make performance-relevant transformation decisions under the assumption that the IR does not change any more until code generation. This prohibits modeling architecture-specific contexts, i.e., it effectively prohibits the use of such models in Graal.

7 See https://gcc.gnu.org/onlinedocs/gccint/Costs.html for the documentation of their cost-model API.
8 LLVM Git repository mirror https://github.com/llvm-mirror/llvm.
9 This is a common approach employed by code generator-generators in order to perform instruction selection [71].

In this context, we want to mention Wang [198], who worked on compile-time performance prediction for compiler transformations. Their work is similar to ours in many aspects: their approach is based on a precise run-time estimation of straight-line code. This allows compiler optimizations to trade off several possible transformations against each other and to select the ones with the largest peak performance improvement. Our approach is slightly different from theirs: we designed our cost model to support optimizations that trade off code size against peak performance. Our cost model captures code size and run-time estimations. Our run-time estimation is less precise than theirs, as we only reason about local transformations, which is necessary in a just-in-time compiler.

Fursin et al. [73] worked on finding the lower-bound execution time of scientific applications. Their algorithm applies profiling, instrumentation and re-execution of the instrumented program to derive a minimal10 execution time of memory-bound code. We could not apply their technique in our work, since the virtual machine does not allow us to heavily instrument user code. Given the potential side-effecting nature of most Java programs, we cannot re-execute a piece of code after instrumentation.

Instruction Scheduling & Selection: Other approaches use instruction cost models to find minimum-cost instruction sequences during code generation11. We improved upon this idea by using relative instruction latencies and size estimations for abstract high-level compiler optimizations. Instruction selection [12] optimizations typically try to reduce a given cost function by selecting the best12 candidate instructions for a given IR pattern. A typical specification of cost is instruction latency (similar to the cycles annotation of our node cost model). Thus, a minimal instruction selection for a given IR pattern results in a minimal execution time and therefore in the highest peak performance.

Miscellaneous: In the rest of this section on related work in the context of compiler cost models we discuss approaches that are related to our node cost model, but for which we could not re-use their assumptions/ideas, since our requirements are specialized to dynamic compilation and low compile times.

Ceng et al. [23] computed the cost of an application execution by using IR, trace, and static information. We cannot apply trace information in our cost model. Yet, we also apply profiling information to guide our optimizations, using the cost model for more advanced reasoning.

10 The lower bound.
11 For example, in the HotSpot server compiler [153].
12 The lowest overall cost.

Tolubaeva [193] worked on compile-time cost models for static compilation to support the per- formance estimation of programs that are migrated from single-core machines to multi-core ma- chines. While multi-threaded modeling is promising for parallel workloads like Renaissance [158], we currently avoid modeling complex properties in our cost model for the sake of low compilation times.

Dou and Cintra [53] worked on static compile-time estimations for speculative parallelization combining estimated overheads, scheduling restrictions and so on. Parallel cost models require elaborate modeling, prohibiting their usage in the current form in Graal. Thus, we designed our cost model to be single-threaded.

Cascaval et al. [21] worked on performance prediction of arbitrary programs at compile time. They built several models of the CPU, the memory hierarchy, and the I/O behavior of the application. While this is very promising, we cannot apply precise models during dynamic compilation due to compile-time constraints.

Saavedra and Smith [172] derived execution-time models to characterize benchmarks and target machines, which allows them to estimate the execution time of a given program on a given hardware. While the basic ideas overlap, execution-time modeling cannot be applied during dynamic compilation on the IR of a compiler, where we do not control the rest of the compilation pipeline, i.e., other optimizations can still change the code.

Cooper et al. [37] used objective success functions to optimize their code for given properties such as code size. However, they did not apply IR-based cost models. We believe that a cost model that captures latencies and code size is a suitable success function for peak-performance-relevant code.

Chapter 10

Future Work

In this chapter we give an outline of possible future work building on the concepts presented in this thesis.

The simulation-based duplication scheme we proposed in Chapter 4, and for which we showed implementations for tail duplication1 and loop unrolling of non-counted loops2, constitutes a solid foundation for interesting future work. Every optimization that not only has positive impacts on a compilation unit but can also have negative impacts is subject to heuristics and trade-offs. Thus, we believe that the simulation-based duplication scheme is an interesting basis for future work.

10.1 Generalization of Simulation-based Optimizations

We showed that the simulation-based duplication scheme can be applied to tail duplication and to loop unrolling. In the future, we want to explore its applicability to other optimizations that increase code size. This includes loop unswitching, peeling, unrolling of counted loops, splitting, and inlining. All those optimizations can have negative impacts on the code size of a compilation unit. Thus, a sophisticated simulation step prior to performing the actual transformation is an interesting target for future research. We want to investigate whether it is possible to generalize the simulation-based duplication scheme to all optimizations, applying a simulation and a trade-off step prior to actually performing a transformation. All optimizations that currently implement heuristics could benefit from simulation to properly estimate the effects of a transformation.

1 See Chapter 6.
2 See Chapter 7.

10.1.1 Transactional Simulation-based IRs

We believe that simulation-based duplication and optimization could benefit from the idea of transactional intermediate representations. The idea of transactional memory [1] can be extended to support operations on an intermediate representation. We believe that transactions can be implemented for compiler optimizations by grouping sets of related optimizations together. Such a set of optimizations can be exercised on the IR, potentially in a simulation-based fashion. The transformations apply actions on a model of the IR that are, once they are finished, either committed to the actual IR or rolled back if the outcome, i.e., the performance increase, does not justify the actual application of a transformation. We believe that this is an interesting possibility to save compilation time by only committing final results to the IR and keeping intermediate data-flow rewiring to a minimum. This would work nicely together with our simulation scheme, as simulation could be combined with the transactional IR, allowing more steps of the compilation pipeline to be simulated. Transactional IRs would enable a compiler to apply a cheap form of backtracking, also during simulation.
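
A speculative Java sketch of this idea follows; all names are hypothetical, and it deliberately abstracts from Graal IR. Edits are buffered against a graph and only replayed if the transaction is committed:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;

    // Speculative sketch of a transactional IR: edits are recorded and only
    // applied to the real graph on commit; rollback simply drops the buffer.
    final class IRTransaction<Graph> {
        private final List<Consumer<Graph>> pendingEdits = new ArrayList<>();

        void record(Consumer<Graph> edit) {
            pendingEdits.add(edit); // buffered; the real IR is untouched
        }

        void commit(Graph graph) {
            pendingEdits.forEach(edit -> edit.accept(graph)); // replay edits
            pendingEdits.clear();
        }

        void rollback() {
            pendingEdits.clear(); // cheap backtracking: nothing to undo
        }
    }

An optimization would simulate its transformations on a model of the IR, record the corresponding edits, and commit only if the estimated benefit justifies the cost; otherwise, rollback is free.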

10.1.2 Optimization Tier Improvements

In the future we want to tune our optimization tier. The current optimization tier implementation cannot duplicate across multiple merges along paths, although the simulation tier can simulate along paths. We want to conduct experiments evaluating how complex a path-based implementation would be and whether we can increase peak performance even further.

10.2 Compiler Cost Model

We believe that the node cost model proposed in Chapter 5 is an interesting target for future work. First, we want to explore whether our cost model can be used in other optimizations of the Graal compiler. This mainly includes inlining [156], since this optimization is strongly dependent on precise metrics for code size and IR size. If so, we want to migrate the other optimizations in Graal step by step so that they benefit from our cost model. We want to do an iterative migration, refining the cost model along the way.

10.2.1 Machine Learning Cost Models

The current implementation of the node cost model is based on domain experts' knowledge, data from instruction vendors, and implementation aspects of the IR. Defining appropriate node costs is a tedious and error-prone process and requires holistic knowledge of the IR. While the node cost model is the first attempt in Graal to have an architecture-agnostic cost model, assigning node costs by hand is still a non-optimal solution. While some operations in the IR have architecture-specific particularities, many do not. Those that behave alike on most architectures are amenable to automated, machine-learning-based derivation. For example, the size annotation of a node could be derived by a multi-variable regression that inspects a node, its surroundings, maybe its control-flow complexity, and the generated machine code for that node, and tries to learn, via regression, the function producing the code size for a given node. Other approaches apply this idea to static compilation. There is ongoing research on this topic for LLVM [111] by Mendis et al. [131], who try to apply machine learning techniques to learn cost model parameterizations for the backend of the LLVM optimizer. We believe that such work is vital and will be an advancement for the static compilation community. However, for dynamic compilers the story is even more complex, given the different levels on which optimizations are executed and the expressiveness of sea-of-nodes [30] IRs. Therefore, we believe that automatically learning the cost model parameters (which could be done offline) is an interesting future research direction.
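
As an illustration only, our hypothetical simplification of the idea reduces it to a single feature: fitting a node-size estimate from observed machine-code sizes could start from ordinary least squares, before moving to the multi-variable setting described above.

    // Hypothetical sketch: learn a linear code-size estimate for a node
    // from observations (featureValue, emittedBytes) by least squares.
    final class SizeModelFitter {
        // Returns {slope, intercept} for emittedBytes ≈ slope * feature + intercept.
        static double[] fit(double[] feature, double[] emittedBytes) {
            int n = feature.length;
            double sx = 0, sy = 0, sxx = 0, sxy = 0;
            for (int i = 0; i < n; i++) {
                sx += feature[i];
                sy += emittedBytes[i];
                sxx += feature[i] * feature[i];
                sxy += feature[i] * emittedBytes[i];
            }
            double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
            double intercept = (sy - slope * sx) / n;
            return new double[] { slope, intercept };
        }
    }

A real model would use many features per node (node class, surrounding control flow, target ISA) and would be trained offline, as discussed above.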

10.2.2 Cost Model Success Functions

We introduced the node cost model in order to attribute benefit and cost to duplication transformations and to properly rank them in our optimizations. However, every duplication decision is based on local benefit and cost, i.e., the impact on the merge block and the performance relation to the overall compilation unit. For example, one duplication leads to a code-size increase of 4 bytes while on average decreasing run time by 2 cycles. However, the overall impact of a single transformation on the performance of the whole compilation unit is hard to derive, given that we have no notion of evaluating a transformation with respect to application-wide performance. In order to empirically derive proper trade-off functions for our duplication-based optimizations, we believe that the combination of machine learning and iterative compilation [105; 106; 107] would be an interesting future research question. Iterative compilation is an approach to automatically select parameters of optimization phases as well as the ordering of phases along the way. The approach works by iteratively generating versions of a program that are evaluated against each other to find the best one. Iterative compilation has been shown to work over a wide variety of compilers and workloads. However, its application in Java-based systems is problematic, as the re-execution of user code with observable side-effects can cause program crashes and infinite runs. We believe that modified versions of iterative compilation, e.g., performing iterative compilations but not executing them, instead using our cost model to derive performance labels for the results, would be an interesting research direction; a sketch of this idea follows below. This would allow the generation of a lot of data that in turn could be used to guide the work on trade-off heuristics. Such an approach could be combined with knowledge bases as applied, for example, by instance-based learning [125] to accumulate a transformation-impact knowledge base over time. Alternatively, it can be combined with feature selection [112; 199] for performance-impact prediction [54]. In general, there are different possible directions to solve the task of finding duplication candidates that tend to have large overall performance impacts. Those could be derived implicitly via feature detection or explicitly via improving duplication trade-offs by learning them over time.
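
The following sketch (a hypothetical API of ours) shows the shape of such a cost-model-labeled search: candidate parameterizations are compiled, and the node cost model, rather than execution, supplies the fitness label:

    import java.util.Comparator;
    import java.util.List;
    import java.util.function.Function;
    import java.util.function.ToDoubleFunction;

    // Speculative sketch: iterative compilation without re-executing user
    // code; the cost model estimates cycles for each compiled candidate.
    final class CostModelSearch<Params, Graph> {
        Params best(List<Params> candidates,
                    Function<Params, Graph> compile,
                    ToDoubleFunction<Graph> estimatedCycles) {
            // Label every candidate with the cost model and pick the cheapest.
            return candidates.stream()
                    .min(Comparator.comparingDouble(
                            p -> estimatedCycles.applyAsDouble(compile.apply(p))))
                    .orElseThrow(IllegalArgumentException::new);
        }
    }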

10.3 Loop Unrolling

In Chapter 7 we presented novel algorithms for fast-path loop creation and fast-path unrolling of non-counted loops; we believe that many interesting transformations are possible on the resulting fast-path loop construct.

As part of our future work, we want to add more optimizations based on analyzing more non- counted loops in Java programs. Additionally, we are currently working on our implementation to further reduce compile time.

In the work we presented, we focused on non-counted loops, as Graal already performs many advanced optimizations on counted loops. However, we believe that fast-path unrolling can be extended to counted loops in order to allow a fully generic partial-unrolling optimization that captures both counted and non-counted loops. For fast-path unrolling it makes no difference whether a loop is counted or not. While non-counted loops always have to be unrolled with their trip check, the trip check of unrolled iterations of counted loops can be removed. We believe that the removal of the intermediate trip check can be expressed as a dedicated optimization, allowing the unrolling algorithm to be fully agnostic of the type of loop it is working on.
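The following hand-written Java sketch illustrates the intermediate trip check. It is a source-level rendering of what the compiler would do on the IR, not actual compiler output: the generic version must re-check the loop condition before every unrolled iteration, while for a counted loop the check between unrolled iterations can be removed.

// Original loop (counted if limit is loop-invariant and i only increments).
static int sum(int[] data, int limit) {
    int sum = 0;
    for (int i = 0; i < limit; i++) {
        sum += data[i];
    }
    return sum;
}

// Unrolled by 2 with the intermediate trip check kept: required for
// non-counted loops, where the condition may change in any iteration.
static int sumUnrolledGeneric(int[] data, int limit) {
    int sum = 0;
    int i = 0;
    while (i < limit) {
        sum += data[i++];
        if (i < limit) {        // intermediate trip check
            sum += data[i++];
        }
    }
    return sum;
}

// Unrolled by 2 for a counted loop: the main loop runs while two more
// iterations are provably in bounds, so the intermediate check disappears.
static int sumUnrolledCounted(int[] data, int limit) {
    int sum = 0;
    int i = 0;
    for (; i + 1 < limit; i += 2) { // two iterations per trip, no inner check
        sum += data[i];
        sum += data[i + 1];
    }
    for (; i < limit; i++) {        // post-loop for a remaining iteration
        sum += data[i];
    }
    return sum;
}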

Chapter 11

Conclusion

In this thesis we proposed simulation-based code duplication for a dynamic compiler, a novel approach to perform duplication-based optimizations such as tail duplication and loop unrolling.

We started out with a discussion of the main problem of duplication-based optimizations, i.e., the code-size increase. Along the way, we reflected on why inherently complex optimizations like duplication can hardly be fully expressed with heuristics that only model portions of the real world. We looked at backtracking, the most precise and complete duplication approach possible. Yet, we also discussed why it cannot be deployed in production by a dynamic compiler.

We proposed dominance-based duplication simulation, one of two applications of the simulation-based duplication scheme, and showed that it allows a compiler to make precise duplication decisions while modeling transitive effects of subsequent optimizations. For this to work, we presented a novel cost model for the graph-based intermediate representation of the Graal compiler and showed why it outperforms hard-coded, heuristic-based solutions.

Then, we presented our application of the simulation-based code duplication approach to loop unrolling for non-counted loops. We presented two novel algorithms: the fast-path loop concept, which splits the hot and cold parts of a loop into different loops, and a fast-path unrolling algorithm built on top of this idea. We showed that optimizing non-counted loops in Java can significantly improve performance, even though many compilers still fail to optimize them under the premise that they are not important. We also presented ongoing work on using the fast-path loop for lock and safepoint optimizations.

We implemented all concepts in the Graal compiler and tested them extensively with Java, JavaScript, R, Python, and Ruby programs, before we deployed our implementation of tail duplication as well as our cost model into production. They are now part of the GraalVM. Thus, our optimizations are heavily used and tested by the Graal project's users and developers.

We presented an extensive evaluation of all optimizations proposed in this thesis. We showed that our simulation-based implementation of tail duplication (DBDS) can outperform a hand-crafted, hard-coded tail-duplication optimization that was specifically tuned for the workloads we benchmarked. We also presented an evaluation of the cost model, validating the claim that a cost model allows duplication to make fine-grained code-size versus performance trade-offs.

Simulation-based code duplication is a novel approach to perform duplication-based optimizations that removes the need for hand-tuned, hard-coded heuristics, as it gives a compiler full control over a duplication optimization's impact. The basic idea of the approach is simple and generic and can be implemented in every state-of-the-art compiler. It outperforms hard-coded heuristics, is extensible, and shows significant performance improvements.

List of Tables

4.1 Optimization capability approaches ...... 55

5.1 Compiler code size quantification...... 61
5.2 Important node cost model nodes...... 67

7.1 Number of counted and non-counted loops in the Java DaCapo Benchmark suite. 90

8.1 DBDS performance run-time benchmarks...... 120
8.2 DBDS performance throughput benchmarks...... 120
8.3 Simulation performance run-time benchmarks...... 124
8.4 Simulation performance throughput benchmarks...... 124
8.5 Fast-path loop unrolling performance run-time benchmarks...... 128
8.6 Fast-path loop unrolling performance throughput benchmarks...... 128
8.7 Loop-wide lock coarsening performance run-time benchmarks...... 130
8.8 Loop-wide lock coarsening performance throughput benchmarks...... 130
8.9 NodeCost model expected performance impacts...... 131
8.10 Cost model performance run-time benchmarks...... 132
8.11 Cost model performance throughput benchmarks...... 132

9.1 Code duplication-based optimizations: feature comparison ...... 138

C.1 Significance Tests for Java DaCapo...... 186
C.2 Significance Tests for ScalaBench...... 187
C.3 Significance Tests for Renaissance...... 188
C.4 Significance Tests for Java SPECjvm2008...... 189
C.5 Significance Tests for JavaScript Octane...... 190
C.6 Significance Tests for JavaScript jetstream asm.js...... 190
C.7 Significance & Performance Results...... 191


List of Figures

1.1 Simulation-based code duplication...... 7

2.1 Sample program foo...... 19
2.2 Sample program foo in SSA form...... 22

3.1 HotSpot JVM overview...... 24
3.2 HotSpot tiered compilation...... 26
3.3 Graal compiler schematic...... 27
3.4 Graal IR example...... 29
3.5 Graal ecosystem...... 31

4.1 Control flow merge union of information...... 35
4.2 Loop header control flow merge: Unrolling as duplication...... 37
4.3 Code duplication success metric triangle...... 44
4.4 Simulation-based code duplication approach...... 46
4.5 Sample program duplication cost model motivation...... 57

5.1 Node cost model code-size estimation...... 64
5.2 Cycles estimation...... 64
5.3 Node cost model value distributions...... 66

6.1 Canonicalization opportunity...... 70
6.2 Read elimination opportunity...... 71
6.3 Conditional elimination opportunity...... 72
6.4 Escape analysis & scalar replacement opportunity...... 73
6.5 Lock coarsening opportunity...... 75
6.6 DBDS algorithm schematic...... 77
6.7 Example program f...... 78
6.8 Duplication effect on the dominance relation. Red arrows denote the dominance relation for the merge block(s) before and after duplication...... 78
6.9 Program f dominator tree...... 80
6.10 Program f duplication simulation...... 80
6.11 Example duplication simulation: DBDS...... 82

6.12 Example after duplication...... 84

7.1 Loop memory graph...... 92
7.2 Loop-carried dependency read elimination simulation...... 98
7.3 Fast-path loop creation example...... 101
7.4 Path-based unrolling via fast-path loop creation and peeling...... 102

C.1 Simulation Performance Java DaCapo...... 168
C.2 Simulation Performance ScalaBench...... 168
C.3 Simulation Performance Renaissance...... 169
C.4 Simulation Performance Java SPECjvm2008...... 169
C.5 Simulation Performance JavaScript Octane...... 170
C.6 Simulation Performance JavaScript jetstream asm.js...... 170
C.7 DBDS Performance Java DaCapo...... 171
C.8 DBDS Performance ScalaBench...... 171
C.9 DBDS Performance Renaissance...... 172
C.10 DBDS Performance Java SPECjvm2008...... 172
C.11 DBDS Performance JavaScript Octane...... 173
C.12 DBDS Performance JavaScript jetstream asm.js...... 173
C.13 Unrolling Performance Java DaCapo...... 174
C.14 Unrolling Performance ScalaBench...... 174
C.15 Unrolling Performance Renaissance...... 175
C.16 Unrolling Performance Java SPECjvm2008...... 175
C.17 Unrolling Performance JavaScript Octane...... 176
C.18 Unrolling Performance JavaScript jetstream asm.js...... 176
C.19 Lock Coarsening Performance Java DaCapo...... 177
C.20 Lock Coarsening Performance ScalaBench...... 177
C.21 Lock Coarsening Performance Renaissance...... 178
C.22 Lock Coarsening Performance Java SPECjvm2008...... 178
C.23 Lock Coarsening Performance JavaScript Octane...... 179
C.24 Lock Coarsening Performance JavaScript jetstream asm.js...... 179
C.25 Cost Model Performance Java DaCapo...... 180
C.26 Cost Model Performance ScalaBench...... 180
C.27 Cost Model Performance Renaissance...... 181
C.28 Cost Model Performance Java SPECjvm2008...... 181
C.29 Cost Model Performance JavaScript Octane...... 182
C.30 Cost Model Performance JavaScript jetstream asm.js...... 182
C.31 Overall Performance Impact Java DaCapo...... 183
C.32 Overall Performance Impact ScalaBench...... 183
C.33 Overall Performance Impact Renaissance...... 184
C.34 Overall Performance Impact Java SPECjvm2008...... 184

C.35 Overall Performance Impact JavaScript Octane...... 185
C.36 Overall Performance Impact JavaScript jetstream asm.js...... 185


Listings

1.1 Sample program...... 3

4.1 Constant folding (CF) optimization opportunity...... 35
4.2 Duplication code-size increase example...... 38
4.3 Sample program...... 40

5.1 Trivial Java program...... 61
5.2 Excerpt of the NodeInfo annotation...... 66

6.1 Devirtualization opportunity...... 76

6.2 AddNode constant folding AC(m, pi, oi) & Opt(m, pi)...... 83

7.1 Counted loop unrolling...... 88
7.2 Non-counted loops...... 89
7.3 Non-counted loop with side effects...... 91
7.4 Non-counted loop with side effects after unrolling. Unrolling violates the memory constraints of the original loop from Listing 7.3...... 91
7.5 Non-counted loop with side effects after correct unrolling...... 91
7.6 General loop construct...... 92
7.7 General loop construct unrolled...... 93
7.8 Side-effect-full loop after unrolling...... 93
7.9 Unrolling opportunity: safepoint poll reduction...... 95
7.10 Unrolling opportunity: canonicalization...... 96
7.11 Unrolling opportunity: loop-carried dependency...... 97
7.12 Multi-back-edge loop...... 99
7.13 Multi-back-edge loop after fast-path loop creation...... 100
7.14 Synchronization loop in Java...... 107
7.15 Synchronized list...... 108
7.16 Loop-wide lock coarsening before...... 109
7.17 Loop-wide lock coarsening after...... 109
7.18 Lock coarsening tiling heuristic...... 111
7.19 Safepoint tiling opportunity before optimization...... 112
7.20 Safepoint tiling opportunity after optimization...... 112


List of Algorithms

1 Simple heuristic-based duplication...... 50

2 Precise heuristic-based duplication...... 51

3 Backtracking-based duplication...... 53

4 Simulation-based duplication...... 54

5 Path-based unrolling algorithm...... 104

6 Unrolling decision algorithm...... 106

7 Loop-wide lock coarsening algorithm...... 110

8 Fast-path loop creation algorithm...... 163

9 DBDS algorithm first part...... 165

10 DBDS algorithm second part...... 166


Glossary

Abbreviations

C1 client compiler

C2 server compiler

CFG control-flow graph

DBDS Dominance-Based Duplication Simulation

SBCD simulation-based code duplication

FaaS function-as-a-service

FP-loop creation fast-path loop creation

GraalVM Graal Virtual Machine

ILP instruction-level parallelism

IR intermediate representation

JIT just-in-time

JMM Java Memory Model

JRE Java runtime environment

JVM Java virtual machine

JVMCI Java virtual machine compiler interface

NCM node cost model

SSA static-single-assignment form

VLIW Very-Long-Instruction-Word

VM virtual machine

Appendix A

Fast-Path Loop Creation Algorithm

We present our algorithm for fast-path loop creation in Algorithm 8.

Algorithm 8: Fast-path loop creation algorithm.
Data: Loop loop
Result: Loop slowPathLoop

LoopHeader fastPathLoopHeader ← loop.loopHeader();
/* Create a new loop header for the outer loop and place it before the fast-path loop. */
LoopHeader slowPathLoopHeader ← new LoopHeader();
fastPathLoopHeader.insertInstructionBefore(slowPathLoopHeader);
/* Create phis for the outer loop based on the types of the inner loop's phis. */
for PhiNode innerPhi in fastPathLoopHeader.phis() do
    PhiNode emptyPhi ← new PhiNode(innerPhi.type());
    slowPathLoopHeader.addPhi(emptyPhi);
end
/* All loop exit paths of the inner loop will also exit the outer loop. */
for LoopExitNode exit in fastPathLoopHeader.exits() do
    LoopExitNode outerExit ← new LoopExitNode();
    slowPathLoopHeader.addLoopExit(exit);
    exit.insertInstructionAfter(outerExit);
end
/* Determine all fast-path ends, e.g., the n most probable loop ends. */
Set fastPathEnds ← computeFastPathEnds(loop);
/* Update the phis of the inner and the outer loop. */
for LoopEndNode end in fastPathLoopHeader.ends() do
    if fastPathEnds.contains(end) then
        continue;    /* Fast-path back edges keep jumping back to the fast-path header. */
    else
        end.setLoopHeader(slowPathLoopHeader);
        /* Move the phi inputs of this loop end to the outer loop's phis. */
        int outerPhiIndex ← 0;
        for PhiNode phi in fastPathLoopHeader.phis() do
            slowPathLoopHeader.phis().get(outerPhiIndex++).addInput(phi.removeInputAtLoopEnd(end));
        end
    end
end
/* Update the phi inputs of the inner loop for its forward predecessor (index 0). */
int outerPhiIndex ← 0;
for PhiNode phi in fastPathLoopHeader.phis() do
    phi.replaceInputAt(0, slowPathLoopHeader.phis().get(outerPhiIndex++));
end
return new Loop(slowPathLoopHeader);


Appendix B

DBDS Algorithm

We present the DBDS algorithm in pseudo-code in Algorithms 9 and 10.

Algorithm 9: DBDS algorithm first part.

List dominatorInfo ← []; Map synonyms ← []; List simResults ← [];

Function dbds(cfg: ControlFlowGraph): void is
    DomTree dom ← cfg.computeDominatorTree();    ⊲ Compute the dominator tree of the program
    visit(dom.startBlock(), false);
    for SimResult s in simResults do
        if costModel.shouldDuplicate(s) then     ⊲ Perform trade-off
            s.duplicate();
            s.optimize();
        end
    end
end

Function visit(b: Block, isSim: bool): void is
    infosBefore ← dominatorInfo.size();
    process(b, isSim);
    if isSim then                                ⊲ For simplicity, the simulation depth is 1 in this algorithm
        simResults.add(new SimResult(b, b.successor()));    ⊲ Save the simulation result for the pred-merge pair
    else
        for Block dominated in b.getDominatedBlocks() do
            visit(dominated, isSim);             ⊲ Depth-first into dominated blocks
        end
    end
    dominatorInfo.trimToSize(infosBefore);       ⊲ Discard dominator info collected below this block
end

Algorithm 10: DBDS algorithm second part.

Function process(b: Block, isSim: bool): void is
    if b.isMergePredecessor() then
        propagateSynonyms(synonyms, b, b.successor());    ⊲ Register synonyms for phis
        visit(b.successor(), true);                       ⊲ Start DST
        synonyms ← [];
    end
    for Instruction i in b.instructions() do              ⊲ Iterate all instructions in this block
        if i.predecessor() instanceof IfInstruction then
            IfInstruction ifInsn ← i.predecessor();
            dominatorInfo.add(ifInsn.dataFlowInfo(i));    ⊲ Save info of dominating conditions
        end
    end
    if isSim then
        for Instruction i in b.instructions() do          ⊲ Iterate all instructions in this block
            for Optimization o in opts do                 ⊲ Try to optimize them using synonyms
                if o.acStep(i, synonyms.getFor(i.inputs())) then    ⊲ Check optimization applicability
                    Node result ← o.optStep(i, synonyms.getFor(i.inputs()));    ⊲ Get the optimization result
                    costModel.recordCodeSize(result);     ⊲ Save code size for the trade-off
                    Benefit benefit ← costModel.computeBenefit(i, result);      ⊲ Compute the benefit
                    simResults.get(b).rememberPotential(benefit);
                    registerSynonymFor(i, result);        ⊲ Register the optimized synonym
                end
            end
        end
    end
end

Appendix C

Evaluation Appendix

C.1 Detailed Performance Plots

In the appendix we present detailed performance evaluation figures for the experiments presented in Chapter 8.

C.1.1 Simulation vs. Heuristic Solutions

[Plot: per-benchmark results for the configurations no-op, dup-only, dup-only-0.05, dup-only-0.1, dup-only-1.0, and pathDup-only.]

Figure C.1: Simulation Performance Java DaCapo; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, dup-only-0.05, dup-only-0.1, dup-only-1.0, and pathDup-only.]

Figure C.2: Simulation Performance ScalaBench; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, dup-only-0.05, dup-only-0.1, dup-only-1.0, and pathDup-only.]

Figure C.3: Simulation Performance Renaissance; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, dup-only-0.05, dup-only-0.1, dup-only-1.0, and pathDup-only.]

Figure C.4: Simulation Performance Java SPECjvm2008; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, dup-only-0.05, dup-only-0.1, dup-only-1.0, and pathDup-only.]

Figure C.5: Simulation Performance JavaScript Octane; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, dup-only-0.05, dup-only-0.1, dup-only-1.0, and pathDup-only.]

Figure C.6: Simulation Performance JavaScript jetstream asm.js; compile time (lower is better), code size (lower is better), throughput (higher is better).

C.1.2 DBDS

[Plot: per-benchmark results for the configurations no-op, dup-only, and no-model.]

Figure C.7: DBDS Performance Java DaCapo; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, and no-model.]

Figure C.8: DBDS Performance ScalaBench; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, and no-model.]

Figure C.9: DBDS Performance Renaissance; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, and no-model.]

Figure C.10: DBDS Performance Java SPECjvm2008; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, and no-model.]

Figure C.11: DBDS Performance JavaScript Octane; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations no-op, dup-only, and no-model.]

Figure C.12: DBDS Performance JavaScript jetstream asm.js; compile time (lower is better), code size (lower is better), throughput (higher is better).

C.1.3 Fast-Path Loop Unrolling

[Plot: per-benchmark results for the configurations no-op and unroll-only.]

Figure C.13: Unrolling Performance Java DaCapo; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op and unroll-only.]

Figure C.14: Unrolling Performance ScalaBench; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op and unroll-only.]

Figure C.15: Unrolling Performance Renaissance; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op and unroll-only.]

Figure C.16: Unrolling Performance Java SPECjvm2008; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations no-op and unroll-only.]

Figure C.17: Unrolling Performance JavaScript Octane; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations no-op and unroll-only.]

Figure C.18: Unrolling Performance JavaScript jetstream asm.js; compile time (lower is better), code size (lower is better), throughput (higher is better).

C.1.4 Loop-wide Lock Coarsening

[Plot: per-benchmark results for the configurations no-op and tile-only.]

Figure C.19: Loop-Wide Lock Coarsening Performance Java DaCapo; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op and tile-only.]

Figure C.20: Loop-Wide Lock Coarsening Performance ScalaBench; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op and tile-only.]

Figure C.21: Loop-Wide Lock Coarsening Performance Renaissance; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations no-op and tile-only.]

Figure C.22: Loop-Wide Lock Coarsening Performance Java SPECjvm2008; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations no-op and tile-only.]

Figure C.23: Loop-Wide Lock Coarsening Performance JavaScript Octane; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations no-op and tile-only.]

Figure C.24: Loop-Wide Lock Coarsening Performance JavaScript jetstream asm.js; compile time (lower is better), code size (lower is better), throughput (higher is better).

C.1.5 Node Cost Model

[Plot: per-benchmark results for the configurations default, no-model, no-op, zeroCycles, and zeroSize.]

Figure C.25: Node-Cost Model Performance Comparison Java DaCapo; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations default, no-model, no-op, zeroCycles, and zeroSize.]

Figure C.26: Node-Cost Model Performance Comparison ScalaBench; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations default, no-model, no-op, zeroCycles, and zeroSize.]

Figure C.27: Node-Cost Model Performance Comparison Renaissance; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations default, no-model, no-op, zeroCycles, and zeroSize.]

Figure C.28: Node-Cost Model Performance Comparison Java SPECjvm2008; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations default, no-model, no-op, zeroCycles, and zeroSize.]

Figure C.29: Node-Cost Model Performance Comparison JavaScript Octane; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations default, no-model, no-op, zeroCycles, and zeroSize.]

Figure C.30: Node-Cost Model Performance Comparison JavaScript jetstream asm.js; compile time (lower is better), code size (lower is better), throughput (higher is better).

C.1.6 Combined Performance Impact

[Plot: per-benchmark results for the configurations default and no-op.]

Figure C.31: Overall Performance Impact Java DaCapo; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations default and no-op.]

Figure C.32: Overall Performance Impact ScalaBench; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations default and no-op.]

Figure C.33: Overall Performance Impact Renaissance; compile time (lower is better), code size (lower is better), run time (lower is better).

[Plot: per-benchmark results for the configurations default and no-op.]

Figure C.34: Overall Performance Impact Java SPECjvm2008; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations default and no-op.]

Figure C.35: Overall Performance Impact JavaScript Octane; compile time (lower is better), code size (lower is better), throughput (higher is better).

[Plot: per-benchmark results for the configurations default and no-op.]

Figure C.36: Overall Performance Impact JavaScript jetstream asm.js; compile time (lower is better), code size (lower is better), throughput (higher is better).

C.2 Performance Significance Analysis

In order to provide a complete overview of all performance implications of the optimizations proposed in this thesis, we report the impacts of simulation-based code duplication in combined result tables¹ for a configuration called all-opt that concurrently enables DBDS, fast-path loop unrolling, and lock coarsening². In order to assess the significance of the final results for the optimizations proposed in this thesis, we performed a simple significance analysis of the data for the all-opt configuration using the Wilcoxon-Mann-Whitney test [200]³. We report the results of this analysis in Tables C.1 to C.6. Each table reports, per benchmark and metric, the median for the baseline and all-opt configurations, the first and third quartiles, the difference in medians, the p-value of the Wilcoxon-Mann-Whitney test, as well as a column indicating whether the test attests statistical significance. The final column indicates whether the all-opt configuration is faster, i.e., better, than the baseline configuration.
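For reference, the following is a minimal sketch of such a per-benchmark test using the Mann-Whitney U test (the rank-sum form of the Wilcoxon-Mann-Whitney test) from Apache Commons Math; the sample arrays are made-up placeholder values, not our measurements.

import org.apache.commons.math3.stat.inference.MannWhitneyUTest;

/*
 * Sketch: compare two independent samples of benchmark results
 * (baseline vs. all-opt) without assuming a normal distribution.
 */
public final class SignificanceCheck {

    public static void main(String[] args) {
        // Placeholder run-time samples in milliseconds (not real measurements).
        double[] baselineRuns = {5180, 5130, 5210, 5175, 5198, 5165};
        double[] allOptRuns   = {5150, 5110, 5200, 5140, 5155, 5120};

        MannWhitneyUTest test = new MannWhitneyUTest();
        double pValue = test.mannWhitneyUTest(baselineRuns, allOptRuns);

        // Two-sided test at the 0.05 level: reject the null hypothesis
        // of equal distributions if p < 0.05.
        boolean significant = pValue < 0.05;
        System.out.printf("p = %.4f, significant = %b%n", pValue, significant);
    }
}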

Benchmark | Metric | Median Baseline [Q1;Q3] | Median All Opt [Q1;Q3] | D† | p-0.95∗ | S‡ | F$
avrora | codesize | 2.10e+06 [2.02e+06;2.14e+06] | 2.43e+06 [2.36e+06;2.48e+06] | 3.33e+05 | 2.17e-16 | X | ✗
avrora | compiletime | 1.45e+07 [1.41e+07;1.48e+07] | 1.87e+07 [1.84e+07;1.90e+07] | 4.16e+06 | 1.09e-16 | X | ✗
avrora | runtime | 5.18e+03 [5.13e+03;5.21e+03] | 5.15e+03 [5.11e+03;5.20e+03] | 3.00e+01 | 3.26e-01 | ✗ | X
batik | codesize | 7.47e+06 [7.36e+06;7.52e+06] | 8.80e+06 [8.65e+06;8.93e+06] | 1.33e+06 | 1.09e-16 | X | ✗
batik | compiletime | 4.15e+07 [4.10e+07;4.20e+07] | 5.38e+07 [5.24e+07;5.52e+07] | 1.23e+07 | 1.09e-16 | X | ✗
batik | runtime | 8.88e+02 [8.87e+02;8.89e+02] | 8.97e+02 [8.96e+02;8.98e+02] | 9.00e+00 | 9.90e-11 | X | ✗
fop | codesize | 7.53e+06 [7.44e+06;7.63e+06] | 8.43e+06 [8.39e+06;8.49e+06] | 9.02e+05 | 1.09e-16 | X | ✗
fop | compiletime | 4.47e+07 [4.41e+07;4.53e+07] | 5.52e+07 [5.48e+07;5.55e+07] | 1.05e+07 | 1.09e-16 | X | ✗
fop | runtime | 2.02e+02 [2.01e+02;2.04e+02] | 2.06e+02 [2.05e+02;2.07e+02] | 4.00e+00 | 1.89e-07 | X | ✗
h2 | codesize | 1.49e+07 [1.47e+07;1.50e+07] | 1.72e+07 [1.71e+07;1.74e+07] | 2.39e+06 | 1.83e-07 | X | ✗
h2 | compiletime | 8.26e+07 [8.16e+07;8.36e+07] | 1.03e+08 [1.02e+08;1.05e+08] | 2.05e+07 | 1.83e-07 | X | ✗
h2 | runtime | 3.85e+03 [3.81e+03;3.90e+03] | 3.82e+03 [3.79e+03;3.84e+03] | 3.50e+01 | 1.05e-02 | X | X
jython | codesize | 2.03e+07 [1.99e+07;2.07e+07] | 2.70e+07 [2.63e+07;2.78e+07] | 6.72e+06 | 3.21e-16 | X | ✗
jython | compiletime | 1.42e+08 [1.38e+08;1.45e+08] | 2.30e+08 [2.20e+08;2.33e+08] | 8.82e+07 | 3.21e-16 | X | ✗
jython | runtime | 1.60e+03 [1.59e+03;1.61e+03] | 1.53e+03 [1.52e+03;1.54e+03] | 7.45e+01 | 8.68e-10 | X | X
luindex | codesize | 2.25e+06 [2.17e+06;2.30e+06] | 2.48e+06 [2.39e+06;2.50e+06] | 2.28e+05 | 7.38e-10 | X | ✗
luindex | compiletime | 1.60e+07 [1.55e+07;1.63e+07] | 1.96e+07 [1.90e+07;2.04e+07] | 3.62e+06 | 1.09e-16 | X | ✗
luindex | runtime | 7.13e+02 [7.02e+02;7.36e+02] | 6.80e+02 [6.71e+02;7.13e+02] | 3.30e+01 | 5.65e-03 | X | X
lusearch | codesize | 2.36e+06 [2.32e+06;2.42e+06] | 2.69e+06 [2.63e+06;2.73e+06] | 3.24e+05 | 1.60e-10 | X | ✗
lusearch | compiletime | 1.71e+07 [1.68e+07;1.75e+07] | 2.12e+07 [2.07e+07;2.15e+07] | 4.06e+06 | 1.09e-16 | X | ✗
lusearch | runtime | 1.32e+03 [1.31e+03;1.33e+03] | 1.32e+03 [1.31e+03;1.33e+03] | 2.00e+00 | 7.94e-01 | ✗ | ✗
pmd | codesize | 1.46e+07 [1.42e+07;1.49e+07] | 1.63e+07 [1.60e+07;1.65e+07] | 1.64e+06 | 7.60e-16 | X | ✗
pmd | compiletime | 8.84e+07 [8.59e+07;9.00e+07] | 1.09e+08 [1.08e+08;1.12e+08] | 2.10e+07 | 1.09e-16 | X | ✗
pmd | runtime | 1.28e+03 [1.28e+03;1.29e+03] | 1.28e+03 [1.28e+03;1.29e+03] | 2.00e+00 | 3.41e-01 | ✗ | ✗
sunflow | codesize | 1.92e+06 [1.87e+06;2.00e+06] | 2.09e+06 [2.04e+06;2.17e+06] | 1.70e+05 | 3.59e-08 | X | ✗
sunflow | compiletime | 1.57e+07 [1.55e+07;1.63e+07] | 1.86e+07 [1.81e+07;1.91e+07] | 2.84e+06 | 1.26e-06 | X | ✗
sunflow | runtime | 1.85e+03 [1.83e+03;1.87e+03] | 1.84e+03 [1.83e+03;1.84e+03] | 1.40e+01 | 1.03e-02 | X | X
xalan | codesize | 5.19e+06 [5.13e+06;5.27e+06] | 5.97e+06 [5.87e+06;6.04e+06] | 7.82e+05 | 1.09e-16 | X | ✗
xalan | compiletime | 3.08e+07 [3.03e+07;3.12e+07] | 4.02e+07 [3.95e+07;4.08e+07] | 9.39e+06 | 1.09e-16 | X | ✗
xalan | runtime | 7.57e+02 [7.53e+02;7.63e+02] | 7.60e+02 [7.57e+02;7.64e+02] | 3.00e+00 | 1.16e-01 | ✗ | ✗
† Difference of medians. ∗ p-0.95 confidence for the Wilcoxon-Mann-Whitney test. ‡ Statistical significance. $ All-opt configuration median faster than baseline.

Table C.1: Significance Tests for Java DaCapo.

¹ The raw data can be seen in Figures C.31 to C.36.
² The entire set of optimizations proposed in this thesis.
³ We used the Wilcoxon-Mann-Whitney test since we have independent random samples (benchmark results) for which, due to potential noise in the data, we cannot guarantee a normal distribution; this prohibits the use of the standard t-test for independent random samples. Most of the data is normally distributed, but not all sample sets are. We used the Shapiro-Wilk test [177] to assess whether our experiment data is normally distributed.

Benchmark | Metric | Median Baseline [Q1;Q3] | Median All Opt [Q1;Q3] | D† | p-0.95∗ | S‡ | F$
actors | codesize | 6.52e+06 [6.36e+06;6.82e+06] | 7.01e+06 [6.89e+06;7.34e+06] | 4.89e+05 | 6.11e-06 | X | ✗
actors | compiletime | 4.60e+07 [4.42e+07;4.86e+07] | 5.25e+07 [5.09e+07;5.43e+07] | 6.57e+06 | 1.05e-07 | X | ✗
actors | runtime | 2.94e+03 [2.91e+03;2.96e+03] | 2.93e+03 [2.90e+03;2.95e+03] | 5.00e+00 | 5.47e-01 | ✗ | X
apparat | codesize | 1.66e+07 [1.51e+07;1.74e+07] | 1.73e+07 [1.69e+07;1.86e+07] | 7.44e+05 | 3.64e-03 | X | ✗
apparat | compiletime | 1.29e+08 [1.15e+08;1.36e+08] | 1.53e+08 [1.43e+08;1.63e+08] | 2.42e+07 | 8.18e-08 | X | ✗
apparat | runtime | 5.01e+03 [4.96e+03;5.06e+03] | 4.92e+03 [4.83e+03;4.96e+03] | 8.55e+01 | 2.58e-06 | X | X
factorie | codesize | 3.70e+06 [3.58e+06;3.83e+06] | 4.29e+06 [4.15e+06;4.51e+06] | 5.86e+05 | 2.37e-08 | X | ✗
factorie | compiletime | 2.97e+07 [2.88e+07;3.08e+07] | 3.91e+07 [3.74e+07;4.10e+07] | 9.38e+06 | 1.10e-14 | X | ✗
factorie | runtime | 1.00e+04 [9.88e+03;1.01e+04] | 9.12e+03 [8.98e+03;9.25e+03] | 9.03e+02 | 7.21e-09 | X | X
kiama | codesize | 4.30e+06 [4.21e+06;4.42e+06] | 5.08e+06 [4.94e+06;5.14e+06] | 7.87e+05 | 3.21e-16 | X | ✗
kiama | compiletime | 2.90e+07 [2.85e+07;2.97e+07] | 3.56e+07 [3.49e+07;3.67e+07] | 6.60e+06 | 3.21e-16 | X | ✗
kiama | runtime | 2.59e+02 [2.56e+02;2.60e+02] | 2.54e+02 [2.53e+02;2.56e+02] | 5.00e+00 | 1.26e-04 | X | X
scalac | codesize | 3.54e+07 [3.50e+07;3.57e+07] | 3.61e+07 [3.58e+07;3.64e+07] | 7.28e+05 | 1.80e-06 | X | ✗
scalac | compiletime | 2.04e+08 [2.01e+08;2.07e+08] | 2.25e+08 [2.24e+08;2.27e+08] | 2.12e+07 | 1.09e-16 | X | ✗
scalac | runtime | 1.03e+03 [1.03e+03;1.04e+03] | 1.06e+03 [1.05e+03;1.06e+03] | 2.50e+01 | 1.50e-07 | X | ✗
scaladoc | codesize | 1.87e+07 [1.84e+07;1.94e+07] | 1.97e+07 [1.95e+07;1.99e+07] | 9.72e+05 | 8.45e-07 | X | ✗
scaladoc | compiletime | 1.12e+08 [1.10e+08;1.15e+08] | 1.25e+08 [1.25e+08;1.27e+08] | 1.35e+07 | 7.60e-16 | X | ✗
scaladoc | runtime | 1.03e+03 [1.02e+03;1.04e+03] | 1.03e+03 [1.03e+03;1.04e+03] | 1.00e+00 | 6.88e-01 | ✗ | X
scalap | codesize | 3.76e+06 [3.69e+06;3.81e+06] | 4.50e+06 [4.40e+06;4.54e+06] | 7.44e+05 | 1.09e-16 | X | ✗
scalap | compiletime | 2.50e+07 [2.46e+07;2.54e+07] | 3.21e+07 [3.15e+07;3.27e+07] | 7.17e+06 | 1.09e-16 | X | ✗
scalap | runtime | 1.31e+02 [1.31e+02;1.33e+02] | 1.29e+02 [1.29e+02;1.30e+02] | 2.00e+00 | 1.50e-06 | X | X
scalariform | codesize | 8.66e+06 [8.48e+06;8.80e+06] | 9.86e+06 [9.78e+06;9.98e+06] | 1.20e+06 | 1.09e-16 | X | ✗
scalariform | compiletime | 6.40e+07 [6.27e+07;6.50e+07] | 7.48e+07 [7.38e+07;7.61e+07] | 1.08e+07 | 1.09e-16 | X | ✗
scalariform | runtime | 3.96e+02 [3.94e+02;3.98e+02] | 4.05e+02 [4.03e+02;4.06e+02] | 9.00e+00 | 3.43e-08 | X | ✗
scalatest | codesize | 1.99e+07 [1.94e+07;2.03e+07] | 2.22e+07 [2.16e+07;2.27e+07] | 2.30e+06 | 5.52e-14 | X | ✗
scalatest | compiletime | 1.51e+08 [1.48e+08;1.56e+08] | 1.81e+08 [1.76e+08;1.86e+08] | 2.92e+07 | 1.09e-16 | X | ✗
scalatest | runtime | 1.11e+03 [1.09e+03;1.12e+03] | 1.13e+03 [1.12e+03;1.16e+03] | 2.30e+01 | 1.45e-04 | X | ✗
scalaxb | codesize | 9.02e+06 [8.87e+06;9.21e+06] | 1.04e+07 [1.03e+07;1.05e+07] | 1.40e+06 | 1.09e-16 | X | ✗
scalaxb | compiletime | 6.28e+07 [6.22e+07;6.44e+07] | 7.91e+07 [7.85e+07;8.05e+07] | 1.62e+07 | 1.09e-16 | X | ✗
scalaxb | runtime | 2.80e+02 [2.79e+02;2.82e+02] | 2.77e+02 [2.75e+02;2.79e+02] | 3.00e+00 | 3.29e-05 | X | X
specs | codesize | 1.06e+07 [1.02e+07;1.08e+07] | 1.26e+07 [1.25e+07;1.28e+07] | 1.94e+06 | 1.43e-15 | X | ✗
specs | compiletime | 7.36e+07 [6.98e+07;7.54e+07] | 9.29e+07 [9.04e+07;9.49e+07] | 1.93e+07 | 1.43e-15 | X | ✗
specs | runtime | 1.35e+03 [1.35e+03;1.36e+03] | 1.35e+03 [1.35e+03;1.36e+03] | 3.50e+00 | 2.02e-01 | ✗ | ✗
tmt | codesize | 5.27e+06 [5.15e+06;5.38e+06] | 5.86e+06 [5.75e+06;6.18e+06] | 5.93e+05 | 1.05e-14 | X | ✗
tmt | compiletime | 3.49e+07 [3.42e+07;3.54e+07] | 4.20e+07 [4.12e+07;4.44e+07] | 7.08e+06 | 1.09e-16 | X | ✗
tmt | runtime | 3.63e+03 [3.62e+03;3.64e+03] | 3.57e+03 [3.56e+03;3.58e+03] | 6.00e+01 | 8.83e-10 | X | X
† Difference of medians. ∗ p-0.95 confidence for the Wilcoxon-Mann-Whitney test. ‡ Statistical significance. $ All-opt configuration median faster than baseline.

Table C.2: Significance Tests for ScalaBench.

Benchmark | Metric | Median Baseline [Q1;Q3] | Median All Opt [Q1;Q3] | D† | p-0.95∗ | S‡ | F$
AlternatingLeastSquares | codesize | 8.99e+06 [8.61e+06;9.36e+06] | 8.94e+06 [8.71e+06;9.09e+06] | 5.11e+04 | 7.91e-01 | ✗ | X
AlternatingLeastSquares | compiletime | 5.96e+07 [5.73e+07;6.09e+07] | 6.59e+07 [6.54e+07;6.72e+07] | 6.30e+06 | 3.81e-13 | X | ✗
AlternatingLeastSquares | runtime | 4.44e+03 [4.37e+03;4.51e+03] | 4.40e+03 [4.35e+03;4.49e+03] | 5.00e+01 | 4.61e-01 | ✗ | X
ChiSquare | codesize | 4.51e+06 [4.44e+06;4.57e+06] | 4.89e+06 [4.80e+06;4.99e+06] | 3.76e+05 | 1.09e-16 | X | ✗
ChiSquare | compiletime | 3.02e+07 [2.98e+07;3.07e+07] | 3.74e+07 [3.69e+07;3.82e+07] | 7.12e+06 | 1.09e-16 | X | ✗
ChiSquare | runtime | 1.40e+03 [1.38e+03;1.42e+03] | 1.38e+03 [1.37e+03;1.41e+03] | 1.70e+01 | 3.13e-02 | X | X
ClassificationDecisionTree | codesize | 1.19e+07 [1.15e+07;1.20e+07] | 1.16e+07 [1.14e+07;1.20e+07] | 2.48e+05 | 6.80e-01 | ✗ | X
ClassificationDecisionTree | compiletime | 7.97e+07 [7.85e+07;8.12e+07] | 8.53e+07 [8.48e+07;9.17e+07] | 5.63e+06 | 2.95e-14 | X | ✗
ClassificationDecisionTree | runtime | 2.53e+03 [2.49e+03;2.55e+03] | 2.47e+03 [2.45e+03;2.50e+03] | 6.10e+01 | 5.90e-02 | ✗ | X
db-shootout | codesize | 4.88e+06 [4.82e+06;4.93e+06] | 5.83e+06 [5.68e+06;5.95e+06] | 9.51e+05 | 1.09e-16 | X | ✗
db-shootout | compiletime | 3.19e+07 [3.15e+07;3.25e+07] | 4.19e+07 [4.14e+07;4.28e+07] | 9.96e+06 | 1.09e-16 | X | ✗
db-shootout | runtime | 7.56e+03 [7.47e+03;7.79e+03] | 7.63e+03 [7.50e+03;7.79e+03] | 7.00e+01 | 4.42e-01 | ✗ | ✗
Dotty | codesize | 2.29e+07 [2.26e+07;2.35e+07] | 2.58e+07 [2.54e+07;2.61e+07] | 2.82e+06 | 1.09e-16 | X | ✗
Dotty | compiletime | 1.52e+08 [1.51e+08;1.60e+08] | 1.83e+08 [1.80e+08;1.87e+08] | 3.03e+07 | 1.09e-16 | X | ✗
Dotty | runtime | 4.42e+03 [4.39e+03;4.47e+03] | 4.50e+03 [4.45e+03;4.53e+03] | 7.70e+01 | 2.31e-07 | X | ✗
finagle-chirper | codesize | 1.07e+07 [9.47e+06;1.37e+07] | 1.27e+07 [1.20e+07;1.55e+07] | 1.99e+06 | 4.64e-04 | X | ✗
finagle-chirper | compiletime | 7.66e+07 [6.51e+07;1.15e+08] | 9.52e+07 [8.68e+07;1.19e+08] | 1.85e+07 | 3.97e-03 | X | ✗
finagle-chirper | runtime | 2.22e+03 [2.16e+03;2.57e+03] | 2.27e+03 [2.21e+03;2.56e+03] | 5.30e+01 | 3.49e-02 | X | ✗
finagle-http | codesize | 3.16e+06 [3.09e+06;3.24e+06] | 3.52e+06 [3.40e+06;3.60e+06] | 3.62e+05 | 7.43e-14 | X | ✗
finagle-http | compiletime | 2.40e+07 [2.34e+07;2.47e+07] | 2.82e+07 [2.77e+07;2.90e+07] | 4.27e+06 | 1.09e-16 | X | ✗
finagle-http | runtime | 2.61e+03 [2.59e+03;2.68e+03] | 2.55e+03 [2.53e+03;2.59e+03] | 5.90e+01 | 6.11e-08 | X | X
fj-kmeans | codesize | 1.41e+06 [1.32e+06;1.54e+06] | 1.74e+06 [1.70e+06;1.79e+06] | 3.38e+05 | 7.96e-13 | X | ✗
fj-kmeans | compiletime | 1.04e+07 [9.89e+06;1.14e+07] | 1.41e+07 [1.33e+07;1.43e+07] | 3.68e+06 | 2.27e-13 | X | ✗
fj-kmeans | runtime | 5.95e+02 [5.89e+02;5.98e+02] | 3.23e+02 [3.19e+02;3.25e+02] | 2.72e+02 | 9.11e-20 | X | X
FoldLeftSum | codesize | 5.51e+05 [5.45e+05;5.62e+05] | 6.39e+05 [6.29e+05;6.51e+05] | 8.77e+04 | 1.09e-16 | X | ✗
FoldLeftSum | compiletime | 5.33e+06 [5.27e+06;5.41e+06] | 6.53e+06 [6.39e+06;6.64e+06] | 1.20e+06 | 1.09e-16 | X | ✗
FoldLeftSum | runtime | 2.88e+02 [2.88e+02;2.88e+02] | 2.88e+02 [2.88e+02;2.88e+02] | 0.00e+00 | 3.86e-01 | ✗ | X
future-genetic | codesize | 3.18e+06 [2.82e+06;3.74e+06] | 3.94e+06 [3.30e+06;4.11e+06] | 7.61e+05 | 1.09e-03 | X | ✗
future-genetic | compiletime | 2.38e+07 [2.10e+07;2.78e+07] | 3.15e+07 [2.67e+07;3.36e+07] | 7.70e+06 | 4.81e-06 | X | ✗
future-genetic | runtime | 1.40e+03 [1.37e+03;1.45e+03] | 1.45e+03 [1.42e+03;1.50e+03] | 4.60e+01 | 2.10e-05 | X | ✗
LogRegression | codesize | 7.92e+06 [7.80e+06;7.97e+06] | 8.75e+06 [8.62e+06;8.82e+06] | 8.33e+05 | 1.09e-16 | X | ✗
LogRegression | compiletime | 5.34e+07 [5.27e+07;5.41e+07] | 6.48e+07 [6.38e+07;6.59e+07] | 1.15e+07 | 1.09e-16 | X | ✗
LogRegression | runtime | 3.08e+03 [3.06e+03;3.26e+03] | 3.07e+03 [3.06e+03;3.13e+03] | 2.00e+00 | 4.50e-02 | X | X
movie-lens | codesize | 2.17e+07 [1.62e+07;2.37e+07] | 1.82e+07 [1.75e+07;2.26e+07] | 3.45e+06 | 7.23e-01 | ✗ | X
movie-lens | compiletime | 1.34e+08 [1.08e+08;1.47e+08] | 1.39e+08 [1.29e+08;1.56e+08] | 5.04e+06 | 3.26e-02 | X | ✗
movie-lens | runtime | 1.72e+04 [1.61e+04;1.82e+04] | 1.72e+04 [1.67e+04;1.81e+04] | 4.50e+01 | 5.36e-02 | ✗ | ✗
naive-bayes | codesize | 3.35e+06 [3.29e+06;3.41e+06] | 3.63e+06 [3.58e+06;3.67e+06] | 2.79e+05 | 2.12e-14 | X | ✗
naive-bayes | compiletime | 2.51e+07 [2.47e+07;2.57e+07] | 3.15e+07 [3.11e+07;3.17e+07] | 6.40e+06 | 1.09e-16 | X | ✗
naive-bayes | runtime | 1.97e+03 [1.96e+03;1.97e+03] | 1.94e+03 [1.94e+03;1.95e+03] | 2.30e+01 | 5.71e-12 | X | X
neo4j-analytics | codesize | 2.21e+07 [2.18e+07;2.23e+07] | 2.41e+07 [2.38e+07;2.44e+07] | 2.02e+06 | 1.73e-13 | X | ✗
neo4j-analytics | compiletime | 1.52e+08 [1.50e+08;1.54e+08] | 1.85e+08 [1.82e+08;1.87e+08] | 3.26e+07 | 1.09e-16 | X | ✗
neo4j-analytics | runtime | 1.21e+04 [1.18e+04;1.23e+04] | 1.16e+04 [1.16e+04;1.17e+04] | 4.13e+02 | 1.13e-05 | X | X
philosophers | codesize | 2.06e+06 [1.94e+06;2.13e+06] | 2.47e+06 [2.40e+06;2.59e+06] | 4.05e+05 | 7.28e-15 | X | ✗
philosophers | compiletime | 1.56e+07 [1.50e+07;1.62e+07] | 2.01e+07 [1.96e+07;2.12e+07] | 4.52e+06 | 4.34e-16 | X | ✗
philosophers | runtime | 1.04e+03 [1.02e+03;1.09e+03] | 1.05e+03 [1.02e+03;1.10e+03] | 6.00e+00 | 7.61e-01 | ✗ | ✗
reactors-savina | codesize | 4.84e+06 [4.65e+06;5.04e+06] | 5.61e+06 [5.39e+06;5.95e+06] | 7.74e+05 | 1.11e-10 | X | ✗
reactors-savina | compiletime | 3.26e+07 [3.12e+07;3.37e+07] | 3.91e+07 [3.75e+07;4.14e+07] | 6.53e+06 | 1.32e-13 | X | ✗
reactors-savina | runtime | 1.31e+04 [1.27e+04;1.35e+04] | 1.30e+04 [1.25e+04;1.35e+04] | 6.20e+01 | 3.82e-01 | ✗ | X
rx-scrabble | codesize | 3.00e+06 [2.89e+06;3.10e+06] | 3.44e+06 [3.27e+06;3.60e+06] | 4.48e+05 | 8.68e-12 | X | ✗
rx-scrabble | compiletime | 1.96e+07 [1.88e+07;2.01e+07] | 2.39e+07 [2.35e+07;2.48e+07] | 4.34e+06 | 4.34e-16 | X | ✗
rx-scrabble | runtime | 3.18e+02 [3.08e+02;3.29e+02] | 3.28e+02 [3.19e+02;3.35e+02] | 1.00e+01 | 2.28e-03 | X | ✗
scala-stm-bench7 | codesize | 3.67e+06 [3.60e+06;3.77e+06] | 4.64e+06 [4.55e+06;4.80e+06] | 9.74e+05 | 2.27e-13 | X | ✗
scala-stm-bench7 | compiletime | 2.62e+07 [2.55e+07;2.69e+07] | 3.59e+07 [3.51e+07;3.69e+07] | 9.67e+06 | 9.94e-14 | X | ✗
scala-stm-bench7 | runtime | 1.20e+03 [1.19e+03;1.21e+03] | 1.18e+03 [1.17e+03;1.20e+03] | 1.80e+01 | 6.30e-05 | X | X
Scrabble | codesize | 1.31e+06 [1.25e+06;1.34e+06] | 1.54e+06 [1.50e+06;1.55e+06] | 2.30e+05 | 1.09e-16 | X | ✗
Scrabble | compiletime | 1.12e+07 [1.08e+07;1.16e+07] | 1.38e+07 [1.36e+07;1.41e+07] | 2.59e+06 | 1.09e-16 | X | ✗
Scrabble | runtime | 3.70e+02 [3.65e+02;3.80e+02] | 3.61e+02 [3.58e+02;3.79e+02] | 9.00e+00 | 1.66e-03 | X | X
StreamsFoldLeftSum | codesize | 5.72e+05 [5.63e+05;5.82e+05] | 6.42e+05 [6.35e+05;6.54e+05] | 7.04e+04 | 1.09e-16 | X | ✗
StreamsFoldLeftSum | compiletime | 5.51e+06 [5.45e+06;5.59e+06] | 6.63e+06 [6.47e+06;6.79e+06] | 1.13e+06 | 1.09e-16 | X | ✗
StreamsFoldLeftSum | runtime | 8.60e+01 [8.60e+01;8.60e+01] | 4.30e+01 [4.30e+01;4.30e+01] | 4.30e+01 | 3.61e-28 | X | X
StreamsForeachSum | codesize | 5.60e+05 [5.46e+05;5.70e+05] | 6.43e+05 [6.35e+05;6.50e+05] | 8.25e+04 | 1.09e-16 | X | ✗
StreamsForeachSum | compiletime | 5.40e+06 [5.35e+06;5.43e+06] | 6.51e+06 [6.47e+06;6.60e+06] | 1.11e+06 | 1.09e-16 | X | ✗
StreamsForeachSum | runtime | 1.02e+02 [1.01e+02;1.02e+02] | 5.20e+01 [5.20e+01;5.20e+01] | 5.00e+01 | 1.85e-20 | X | X
StreamsPhoneMnemonics | codesize | 1.03e+06 [1.00e+06;1.06e+06] | 1.25e+06 [1.22e+06;1.28e+06] | 2.16e+05 | 1.09e-16 | X | ✗
StreamsPhoneMnemonics | compiletime | 8.87e+06 [8.65e+06;9.03e+06] | 1.13e+07 [1.11e+07;1.16e+07] | 2.43e+06 | 1.09e-16 | X | ✗
StreamsPhoneMnemonics | runtime | 4.49e+03 [4.46e+03;4.55e+03] | 4.08e+03 [4.02e+03;4.13e+03] | 4.11e+02 | 9.86e-20 | X | X
† Difference of medians. ∗ p-0.95 confidence for the Wilcoxon-Mann-Whitney test. ‡ Statistical significance. $ All-opt configuration median faster than baseline.

Table C.3: Significance Tests for Renaissance.
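Each row in these tables is derived from the raw per-iteration samples of the two configurations: D is the absolute difference of the two medians, p is the two-sided Wilcoxon-Mann-Whitney p-value, S marks statistical significance, and F marks whether the all-opt median is the faster one. The following sketch illustrates how such a row could be computed. It is not the evaluation script used for this thesis; the sample arrays are invented, and judging significance at the 0.05 level (the complement of the 0.95 confidence) is an assumption.

```python
# Illustrative sketch (not the thesis' evaluation script): computing one
# significance-table row (D, p, S, F) from raw per-iteration samples.
# The sample arrays and the 0.05 alpha level are assumptions.
import numpy as np
from scipy.stats import mannwhitneyu

def significance_row(baseline, all_opt, lower_is_better=True, alpha=0.05):
    base = np.asarray(baseline, dtype=float)
    opt = np.asarray(all_opt, dtype=float)
    d = abs(np.median(base) - np.median(opt))                # D: difference of medians
    _, p = mannwhitneyu(base, opt, alternative="two-sided")  # Wilcoxon-Mann-Whitney test
    s = p < alpha                                            # S: statistically significant
    if lower_is_better:   # runtime, compiletime, codesize: smaller median is better
        f = np.median(opt) < np.median(base)                 # F: all-opt median faster
    else:                 # throughput: larger median is better
        f = np.median(opt) > np.median(base)
    return d, p, s, f

# Invented runtime samples (ms) for a single benchmark:
baseline_runs = [4440, 4370, 4510, 4450, 4480, 4430]
all_opt_runs = [4400, 4350, 4490, 4380, 4410, 4360]
print(significance_row(baseline_runs, all_opt_runs))
```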

Columns: Benchmark, Metric, Median Baseline [Q1;Q3], Median All Opt [Q1;Q3], D†, p-0.95∗, S‡, F$ (X = yes, ✗ = no).
compiler.compiler codesize 1.26e+07[1.24e+07;1.27e+07] 1.46e+07[1.44e+07;1.47e+07] 2.00e+06 2.95e-13 X ✗
compiler.compiler compiletime 8.04e+07[7.95e+07;8.17e+07] 1.04e+08[1.02e+08;1.05e+08] 2.32e+07 1.09e-16 X ✗
compiler.compiler throughput 6.51e+02[6.48e+02;6.53e+02] 6.44e+02[6.40e+02;6.48e+02] 7.00e+00 9.75e-06 X ✗
compress codesize 8.37e+05[8.15e+05;8.60e+05] 9.07e+05[8.96e+05;9.28e+05] 7.02e+04 6.09e-08 X ✗
compress compiletime 7.23e+06[7.12e+06;7.35e+06] 8.39e+06[8.22e+06;8.47e+06] 1.16e+06 8.56e-10 X ✗
compress throughput 2.07e+02[2.07e+02;2.09e+02] 2.04e+02[2.03e+02;2.04e+02] 3.00e+00 2.73e-10 X ✗
crypto.aes codesize 2.83e+06[2.65e+06;3.06e+06] 3.30e+06[3.10e+06;3.73e+06] 4.71e+05 1.17e-04 X ✗
crypto.aes compiletime 2.94e+07[2.69e+07;3.23e+07] 3.54e+07[3.13e+07;3.90e+07] 5.97e+06 1.02e-04 X ✗
crypto.aes throughput 6.80e+01[6.60e+01;7.00e+01] 6.80e+01[6.60e+01;7.00e+01] 0.00e+00 9.34e-01 ✗ ✗
crypto.rsa codesize 2.12e+06[2.09e+06;2.16e+06] 2.30e+06[2.26e+06;2.34e+06] 1.79e+05 6.26e-13 X ✗
crypto.rsa compiletime 1.48e+07[1.45e+07;1.51e+07] 1.71e+07[1.70e+07;1.75e+07] 2.30e+06 1.09e-16 X ✗
crypto.rsa throughput 1.08e+03[1.07e+03;1.08e+03] 1.08e+03[1.07e+03;1.08e+03] 0.00e+00 7.43e-01 ✗ ✗
crypto.signverify codesize 2.61e+06[2.57e+06;2.66e+06] 3.00e+06[2.88e+06;3.08e+06] 3.92e+05 2.17e-16 X ✗
crypto.signverify compiletime 1.84e+07[1.81e+07;1.86e+07] 2.20e+07[2.11e+07;2.25e+07] 3.58e+06 7.43e-14 X ✗
crypto.signverify throughput 4.72e+02[4.69e+02;4.72e+02] 4.72e+02[4.71e+02;4.72e+02] 0.00e+00 9.01e-01 ✗ ✗
derby codesize 8.17e+06[8.02e+06;8.44e+06] 9.69e+06[9.44e+06;1.08e+07] 1.52e+06 1.09e-16 X ✗
derby compiletime 4.54e+07[4.50e+07;4.74e+07] 6.17e+07[6.03e+07;6.92e+07] 1.63e+07 3.21e-16 X ✗
derby throughput 4.24e+02[4.20e+02;4.26e+02] 4.25e+02[4.17e+02;4.30e+02] 1.00e+00 7.37e-01 ✗ X
mpegaudio codesize 1.54e+06[1.45e+06;1.61e+06] 1.70e+06[1.64e+06;1.74e+06] 1.60e+05 1.33e-05 X ✗
mpegaudio compiletime 1.65e+07[1.49e+07;1.73e+07] 1.90e+07[1.80e+07;2.01e+07] 2.52e+06 2.60e-06 X ✗
mpegaudio throughput 1.42e+02[1.42e+02;1.44e+02] 1.39e+02[1.39e+02;1.41e+02] 3.00e+00 3.96e-08 X ✗
scimark.fft.large codesize 5.79e+05[5.68e+05;5.89e+05] 6.75e+05[6.58e+05;6.98e+05] 9.67e+04 2.17e-16 X ✗
scimark.fft.large compiletime 7.23e+06[7.08e+06;7.35e+06] 8.89e+06[8.60e+06;9.09e+06] 1.66e+06 1.09e-16 X ✗
scimark.fft.large throughput 3.40e+01[3.40e+01;3.50e+01] 3.30e+01[3.20e+01;3.50e+01] 1.00e+00 6.88e-02 ✗ ✗
scimark.fft.small codesize 8.39e+05[8.24e+05;8.58e+05] 1.00e+06[9.89e+05;1.03e+06] 1.66e+05 1.09e-16 X ✗
scimark.fft.small compiletime 7.11e+06[6.99e+06;7.18e+06] 9.21e+06[9.10e+06;9.37e+06] 2.10e+06 1.09e-16 X ✗
scimark.fft.small throughput 1.99e+02[1.95e+02;2.22e+02] 2.14e+02[1.97e+02;2.17e+02] 1.55e+01 9.42e-01 ✗ X
scimark.lu.large codesize 5.97e+05[5.81e+05;6.12e+05] 6.42e+05[6.26e+05;6.58e+05] 4.46e+04 3.09e-12 X ✗
scimark.lu.large compiletime 7.78e+06[7.67e+06;7.95e+06] 9.01e+06[8.80e+06;9.28e+06] 1.23e+06 1.09e-16 X ✗
scimark.lu.large throughput 7.00e+00[7.00e+00;7.00e+00] 7.00e+00[7.00e+00;7.00e+00] 0.00e+00 - ✗ ✗
scimark.lu.small codesize 9.42e+05[9.19e+05;9.58e+05] 1.03e+06[1.03e+06;1.07e+06] 9.10e+04 4.05e-14 X ✗
scimark.lu.small compiletime 7.85e+06[7.76e+06;7.98e+06] 9.19e+06[9.03e+06;9.31e+06] 1.34e+06 1.09e-16 X ✗
scimark.lu.small throughput 3.13e+02[3.13e+02;3.13e+02] 3.14e+02[3.14e+02;3.14e+02] 1.00e+00 5.73e-07 X X
scimark.monte_carlo codesize 7.21e+05[7.04e+05;7.39e+05] 7.91e+05[7.76e+05;8.00e+05] 7.00e+04 9.94e-14 X ✗
scimark.monte_carlo compiletime 6.32e+06[6.22e+06;6.44e+06] 7.40e+06[7.25e+06;7.55e+06] 1.08e+06 1.09e-16 X ✗
scimark.monte_carlo throughput 2.06e+02[2.06e+02;2.07e+02] 2.15e+02[2.14e+02;2.25e+02] 9.00e+00 5.77e-11 X X
scimark.sor.large codesize 5.44e+05[5.33e+05;5.60e+05] 5.83e+05[5.56e+05;5.99e+05] 3.95e+04 8.79e-06 X ✗
scimark.sor.large compiletime 5.79e+06[5.66e+06;5.87e+06] 6.52e+06[6.39e+06;6.68e+06] 7.25e+05 4.34e-16 X ✗
scimark.sor.large throughput 4.30e+01[4.20e+01;4.30e+01] 4.30e+01[4.30e+01;4.30e+01] 0.00e+00 2.38e-01 ✗ ✗
scimark.sor.small codesize 6.40e+05[6.24e+05;6.56e+05] 6.82e+05[6.62e+05;7.01e+05] 4.14e+04 9.55e-08 X ✗
scimark.sor.small compiletime 6.02e+06[5.91e+06;6.10e+06] 6.71e+06[6.59e+06;6.84e+06] 6.97e+05 1.09e-16 X ✗
scimark.sor.small throughput 2.09e+02[2.09e+02;2.09e+02] 2.09e+02[2.09e+02;2.09e+02] 0.00e+00 6.01e-01 ✗ ✗
scimark.sparse.large codesize 5.91e+05[5.71e+05;6.12e+05] 6.31e+05[6.20e+05;6.64e+05] 3.95e+04 3.67e-05 X ✗
scimark.sparse.large compiletime 6.62e+06[6.44e+06;6.74e+06] 7.56e+06[7.33e+06;7.85e+06] 9.44e+05 4.90e-13 X ✗
scimark.sparse.large throughput 2.30e+01[2.30e+01;2.30e+01] 2.30e+01[2.20e+01;2.30e+01] 0.00e+00 5.75e-01 ✗ ✗
scimark.sparse.small codesize 7.41e+05[7.12e+05;7.49e+05] 8.11e+05[7.92e+05;8.26e+05] 6.95e+04 7.43e-14 X ✗
scimark.sparse.small compiletime 6.68e+06[6.55e+06;6.76e+06] 7.68e+06[7.51e+06;7.85e+06] 1.00e+06 1.09e-16 X ✗
scimark.sparse.small throughput 2.49e+02[2.48e+02;2.50e+02] 2.49e+02[2.48e+02;2.49e+02] 0.00e+00 7.14e-01 ✗ ✗
serial codesize 2.66e+06[2.58e+06;2.71e+06] 3.05e+06[2.98e+06;3.20e+06] 3.89e+05 1.08e-15 X ✗
serial compiletime 1.89e+07[1.86e+07;1.91e+07] 2.75e+07[2.64e+07;2.82e+07] 8.59e+06 1.08e-15 X ✗
serial throughput 1.46e+02[1.44e+02;1.47e+02] 1.46e+02[1.42e+02;1.47e+02] 0.00e+00 4.81e-01 ✗ ✗
sunflow codesize 1.95e+06[1.89e+06;2.00e+06] 2.09e+06[2.05e+06;2.13e+06] 1.40e+05 2.38e-06 X ✗
sunflow compiletime 1.58e+07[1.53e+07;1.63e+07] 1.83e+07[1.81e+07;1.86e+07] 2.51e+06 1.51e-14 X ✗
sunflow throughput 9.40e+01[9.30e+01;9.40e+01] 9.40e+01[9.40e+01;9.40e+01] 0.00e+00 2.52e-01 ✗ ✗
xml.transform codesize 9.07e+06[8.94e+06;9.17e+06] 1.04e+07[1.04e+07;1.06e+07] 1.36e+06 1.09e-16 X ✗
xml.transform compiletime 5.42e+07[5.35e+07;5.51e+07] 6.91e+07[6.88e+07;7.01e+07] 1.49e+07 1.09e-16 X ✗
xml.transform throughput 3.52e+02[3.51e+02;3.54e+02] 3.60e+02[3.59e+02;3.61e+02] 8.00e+00 2.63e-08 X X
xml.validation codesize 4.57e+06[3.91e+06;6.88e+06] 6.70e+06[4.54e+06;7.73e+06] 2.13e+06 1.40e-02 X ✗
xml.validation compiletime 2.96e+07[2.67e+07;4.68e+07] 5.10e+07[3.37e+07;5.40e+07] 2.14e+07 4.61e-03 X ✗
xml.validation throughput 7.17e+02[7.07e+02;7.21e+02] 7.29e+02[7.22e+02;7.38e+02] 1.20e+01 1.33e-05 X X
† Difference of medians. ∗ p-0.95 confidence for the Wilcoxon-Mann-Whitney test. ‡ Statistical significance. $ All-opt configuration median faster than baseline.

Table C.4: Significance Tests for Java SPECjvm2008.

Columns: Benchmark, Metric, Median Baseline [Q1;Q3], Median All Opt [Q1;Q3], D†, p-0.95∗, S‡, F$ (X = yes, ✗ = no).
Box2D codesize 3.31e+06[3.26e+06;3.36e+06] 3.84e+06[3.77e+06;3.93e+06] 5.23e+05 1.09e-16 X ✗
Box2D compiletime 3.73e+07[3.70e+07;3.77e+07] 5.03e+07[4.97e+07;5.07e+07] 1.29e+07 1.09e-16 X ✗
Box2D throughput 4.80e+04[4.78e+04;4.81e+04] 4.80e+04[4.79e+04;4.82e+04] 1.10e+01 4.71e-01 ✗ X
CodeLoad codesize 1.55e+07[1.53e+07;1.56e+07] 1.87e+07[1.86e+07;1.88e+07] 3.22e+06 1.09e-16 X ✗
CodeLoad compiletime 1.06e+08[1.05e+08;1.08e+08] 1.34e+08[1.33e+08;1.35e+08] 2.83e+07 1.09e-16 X ✗
CodeLoad throughput 8.09e+03[8.05e+03;8.13e+03] 7.99e+03[7.91e+03;8.04e+03] 1.04e+02 1.73e-04 X ✗
Crypto codesize 3.59e+06[3.52e+06;3.65e+06] 4.13e+06[4.07e+06;4.24e+06] 5.42e+05 1.09e-16 X ✗
Crypto compiletime 4.13e+07[3.99e+07;4.22e+07] 5.22e+07[5.00e+07;5.37e+07] 1.09e+07 1.09e-16 X ✗
Crypto throughput 1.89e+04[1.88e+04;1.90e+04] 1.93e+04[1.91e+04;1.94e+04] 3.41e+02 4.01e-06 X X
DeltaBlue codesize 2.37e+06[2.33e+06;2.40e+06] 2.64e+06[2.59e+06;2.70e+06] 2.76e+05 1.09e-16 X ✗
DeltaBlue compiletime 2.21e+07[2.19e+07;2.24e+07] 2.62e+07[2.59e+07;2.66e+07] 4.10e+06 3.26e-15 X ✗
DeltaBlue throughput 4.10e+04[4.09e+04;4.10e+04] 4.67e+04[4.66e+04;4.68e+04] 5.74e+03 1.60e-10 X X
EarleyBoyer codesize 3.47e+06[3.32e+06;3.77e+06] 4.28e+06[3.97e+06;4.38e+06] 8.04e+05 1.29e-11 X ✗
EarleyBoyer compiletime 3.35e+07[3.23e+07;3.51e+07] 4.60e+07[4.40e+07;4.70e+07] 1.25e+07 1.09e-16 X ✗
EarleyBoyer throughput 2.75e+04[2.74e+04;2.76e+04] 3.01e+04[3.00e+04;3.03e+04] 2.67e+03 1.60e-10 X X
Gameboy codesize 4.36e+06[4.27e+06;4.45e+06] 4.78e+06[4.73e+06;4.85e+06] 4.21e+05 2.27e-13 X ✗
Gameboy compiletime 4.03e+07[4.01e+07;4.09e+07] 4.94e+07[4.87e+07;4.98e+07] 9.04e+06 2.27e-13 X ✗
Gameboy throughput 5.77e+04[5.76e+04;5.79e+04] 7.01e+04[6.97e+04;7.03e+04] 1.23e+04 1.60e-10 X X
Mandreel codesize 4.11e+06[4.05e+06;4.14e+06] 4.57e+06[4.53e+06;4.62e+06] 4.61e+05 5.30e-13 X ✗
Mandreel compiletime 3.52e+07[3.50e+07;3.55e+07] 4.12e+07[4.08e+07;4.13e+07] 5.99e+06 5.30e-13 X ✗
Mandreel throughput 2.17e+04[2.16e+04;2.18e+04] 2.23e+04[2.22e+04;2.25e+04] 5.57e+02 1.60e-10 X X
NavierStokes codesize 1.58e+06[1.52e+06;1.61e+06] 1.83e+06[1.80e+06;1.87e+06] 2.46e+05 2.17e-16 X ✗
NavierStokes compiletime 1.61e+07[1.58e+07;1.64e+07] 2.04e+07[2.01e+07;2.06e+07] 4.32e+06 1.09e-16 X ✗
NavierStokes throughput 2.86e+04[2.85e+04;2.86e+04] 2.47e+04[2.47e+04;2.48e+04] 3.81e+03 1.58e-10 X ✗
PdfJS codesize 7.17e+06[7.07e+06;7.32e+06] 8.51e+06[8.38e+06;8.54e+06] 1.34e+06 1.09e-16 X ✗
PdfJS compiletime 7.29e+07[7.23e+07;7.40e+07] 9.75e+07[9.65e+07;9.78e+07] 2.45e+07 1.09e-16 X ✗
PdfJS throughput 2.04e+04[2.02e+04;2.05e+04] 2.18e+04[2.17e+04;2.18e+04] 1.40e+03 1.60e-10 X X
RayTrace codesize 2.12e+06[2.07e+06;2.16e+06] 2.36e+06[2.32e+06;2.43e+06] 2.33e+05 2.17e-16 X ✗
RayTrace compiletime 1.96e+07[1.94e+07;1.99e+07] 2.42e+07[2.38e+07;2.47e+07] 4.54e+06 1.09e-16 X ✗
RayTrace throughput 5.36e+04[5.28e+04;5.44e+04] 5.52e+04[5.49e+04;5.55e+04] 1.64e+03 6.64e-05 X X
RegExp codesize 4.72e+06[4.65e+06;4.81e+06] 5.69e+06[5.67e+06;5.72e+06] 9.74e+05 7.16e-16 X ✗
RegExp compiletime 1.32e+08[1.31e+08;1.33e+08] 1.69e+08[1.69e+08;1.70e+08] 3.74e+07 7.16e-16 X ✗
RegExp throughput 1.13e+03[1.12e+03;1.14e+03] 1.09e+03[1.08e+03;1.10e+03] 4.00e+01 1.30e-08 X ✗
Richards codesize 1.44e+06[1.41e+06;1.47e+06] 1.64e+06[1.59e+06;1.65e+06] 1.98e+05 1.56e-11 X ✗
Richards compiletime 1.30e+07[1.29e+07;1.32e+07] 1.51e+07[1.50e+07;1.54e+07] 2.09e+06 2.17e-16 X ✗
Richards throughput 2.47e+04[2.45e+04;2.49e+04] 2.74e+04[2.73e+04;2.74e+04] 2.68e+03 1.60e-10 X X
Splay codesize 2.30e+06[2.22e+06;2.35e+06] 2.71e+06[2.67e+06;2.78e+06] 4.14e+05 1.09e-16 X ✗
Splay compiletime 2.27e+07[2.21e+07;2.35e+07] 2.96e+07[2.91e+07;3.05e+07] 6.88e+06 1.09e-16 X ✗
Splay throughput 4.62e+03[3.94e+03;5.98e+03] 5.57e+03[5.48e+03;5.69e+03] 9.44e+02 9.60e-02 ✗ X
Typescript codesize 7.64e+06[7.46e+06;7.77e+06] 8.38e+06[8.18e+06;8.54e+06] 7.34e+05 2.12e-14 X ✗
Typescript compiletime 9.31e+07[9.23e+07;9.47e+07] 1.12e+08[1.11e+08;1.14e+08] 1.90e+07 1.09e-16 X ✗
Typescript throughput 9.12e+03[8.98e+03;9.24e+03] 7.36e+03[6.00e+03;7.47e+03] 1.76e+03 1.59e-10 X ✗
zlib codesize 2.37e+06[2.33e+06;2.42e+06] 2.70e+06[2.67e+06;2.75e+06] 3.35e+05 1.09e-16 X ✗
zlib compiletime 2.65e+07[2.63e+07;2.68e+07] 3.44e+07[3.40e+07;3.47e+07] 7.82e+06 1.09e-16 X ✗
zlib throughput 1.89e+04[1.88e+04;1.89e+04] 2.52e+04[2.51e+04;2.53e+04] 6.29e+03 1.59e-10 X X
† Difference of medians. ∗ p-0.95 confidence for the Wilcoxon-Mann-Whitney test. ‡ Statistical significance. $ All-opt configuration median faster than baseline.

Table C.5: Significance Tests for JavaScript Octane.

Columns: Benchmark, Metric, Median Baseline [Q1;Q3], Median All Opt [Q1;Q3], D†, p-0.95∗, S‡, F$ (X = yes, ✗ = no).
bigfib.cpp codesize 2.59e+06[2.54e+06;2.62e+06] 2.85e+06[2.78e+06;2.91e+06] 2.63e+05 2.95e-13 X ✗
bigfib.cpp compiletime 2.45e+07[2.41e+07;2.47e+07] 2.91e+07[2.88e+07;2.94e+07] 4.60e+06 1.09e-16 X ✗
bigfib.cpp throughput 1.57e+02[1.57e+02;1.58e+02] 1.69e+02[1.69e+02;1.69e+02] 1.20e+01 1.43e-11 X X
container.cpp codesize 4.51e+06[4.44e+06;4.56e+06] 4.97e+06[4.89e+06;5.04e+06] 4.62e+05 2.17e-16 X ✗
container.cpp compiletime 5.19e+07[5.11e+07;5.29e+07] 7.22e+07[7.08e+07;7.41e+07] 2.04e+07 1.09e-16 X ✗
container.cpp throughput 1.00e+02[9.90e+01;1.00e+02] 1.03e+02[1.03e+02;1.03e+02] 3.00e+00 1.03e-10 X X
dry.c codesize 9.00e+05[8.64e+05;9.31e+05] 1.05e+06[1.04e+06;1.09e+06] 1.50e+05 1.09e-16 X ✗
dry.c compiletime 8.35e+06[8.14e+06;8.48e+06] 9.92e+06[9.84e+06;1.02e+07] 1.57e+06 1.09e-16 X ✗
dry.c throughput 2.60e+01[2.60e+01;2.60e+01] 9.40e+01[9.40e+01;9.40e+01] 6.80e+01 3.32e-14 X X
float-mm.c codesize 1.30e+06[1.26e+06;1.31e+06] 1.44e+06[1.39e+06;1.47e+06] 1.37e+05 4.22e-07 X ✗
float-mm.c compiletime 1.05e+07[1.03e+07;1.06e+07] 1.24e+07[1.21e+07;1.27e+07] 1.85e+06 2.12e-14 X ✗
float-mm.c throughput 2.18e+02[2.18e+02;2.18e+02] 2.42e+02[2.42e+02;2.42e+02] 2.40e+01 3.17e-13 X X
gcc-loops.cpp codesize 2.51e+06[2.49e+06;2.54e+06] 2.86e+06[2.83e+06;2.95e+06] 3.46e+05 2.06e-15 X ✗
gcc-loops.cpp compiletime 2.13e+07[2.10e+07;2.17e+07] 3.18e+07[3.13e+07;3.25e+07] 1.05e+07 1.09e-16 X ✗
gcc-loops.cpp throughput 1.01e+02[1.01e+02;1.01e+02] 1.05e+02[1.05e+02;1.05e+02] 4.00e+00 1.47e-13 X X
hash-map codesize 1.73e+06[1.69e+06;1.76e+06] 1.93e+06[1.91e+06;1.96e+06] 2.00e+05 4.89e-15 X ✗
hash-map compiletime 1.51e+07[1.49e+07;1.53e+07] 1.82e+07[1.80e+07;1.84e+07] 3.11e+06 1.09e-16 X ✗
hash-map throughput 1.24e+05[1.23e+05;1.24e+05] 1.44e+05[1.35e+05;1.46e+05] 1.99e+04 2.75e-06 X X
n-body.c codesize 1.39e+06[1.35e+06;1.42e+06] 1.58e+06[1.56e+06;1.65e+06] 1.97e+05 2.17e-16 X ✗
n-body.c compiletime 1.15e+07[1.14e+07;1.18e+07] 1.52e+07[1.51e+07;1.58e+07] 3.63e+06 1.09e-16 X ✗
n-body.c throughput 8.40e+01[8.40e+01;8.40e+01] 8.80e+01[8.80e+01;8.80e+01] 4.00e+00 3.31e-14 X X
quicksort.c codesize 1.54e+06[1.49e+06;1.59e+06] 1.89e+06[1.83e+06;1.95e+06] 3.43e+05 4.34e-16 X ✗
quicksort.c compiletime 1.32e+07[1.30e+07;1.35e+07] 1.80e+07[1.77e+07;1.83e+07] 4.83e+06 1.09e-16 X ✗
quicksort.c throughput 1.60e+02[1.60e+02;1.60e+02] 1.61e+02[1.60e+02;1.61e+02] 1.00e+00 7.03e-05 X X
towers.c codesize 1.30e+06[1.26e+06;1.33e+06] 1.50e+06[1.44e+06;1.56e+06] 1.98e+05 1.03e-08 X ✗
towers.c compiletime 1.09e+07[1.07e+07;1.12e+07] 1.32e+07[1.27e+07;1.36e+07] 2.22e+06 1.09e-16 X ✗
towers.c throughput 7.30e+01[7.30e+01;7.30e+01] 7.20e+01[7.20e+01;7.30e+01] 1.00e+00 1.10e-04 X ✗
† Difference of medians. ∗ p-0.95 confidence for the Wilcoxon-Mann-Whitney test. ‡ Statistical significance. $ All-opt configuration median faster than baseline.

Table C.6: Significance Tests for JavaScript jetstream asm.js.

C.2.1 Interpretation

For all benchmark suites, the all-opt configuration produces faster code than the baseline configuration more often than it produces slower code. Table C.7 reports how often the all-opt configuration is faster or slower than the baseline. The best result is achieved on the JavaScript jetstream asm.js suite, where all-opt is significantly faster on 8 benchmarks and significantly slower on only 1. Results for Java SPECjvm2008 are mixed: 4 benchmarks are significantly faster and 3 significantly slower, while 11 benchmarks tend to be slower, though not significantly. For Renaissance, improvements are also more frequent than regressions, and the same holds for Java DaCapo and ScalaBench. Based on this data, we conclude that the optimizations presented in this thesis can significantly improve the performance of Java and JavaScript applications and are therefore suitable for use in production compilers such as GraalVM.

Counts of benchmarks per suite where the all-opt configuration is faster or slower than the baseline:

Benchmark Suite              Faster (Significant)  Faster (Not Significant)  Slower (Significant)  Slower (Not Significant)
Java DaCapo                  4                     1                         2                     3
ScalaBench                   6                     2                         3                     1
Renaissance                  11                    4                         4                     3
Java SPECjvm2008             4                     2                         3                     11
JavaScript Octane            9                     2                         4                     0
JavaScript jetstream asm.js  8                     0                         1                     0

Table C.7: Significance & Performance Results.
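The counts in Table C.7 are a straightforward aggregation of the per-benchmark S and F flags from the runtime and throughput rows of the preceding significance tables. A minimal sketch of this bucketing follows; the flag tuples are invented placeholders rather than the actual per-benchmark results.

```python
# Illustrative aggregation of per-benchmark (significant, faster) flags into
# the four categories of Table C.7; the flag tuples below are invented.
from collections import Counter

def bucket(significant: bool, faster: bool) -> str:
    speed = "faster" if faster else "slower"
    strength = "significant" if significant else "not significant"
    return f"all-opt {speed}, {strength}"

# One (S, F) pair per benchmark of a hypothetical suite:
flags = [(True, True), (True, True), (False, True), (True, False), (False, False)]
print(Counter(bucket(s, f) for s, f in flags))
```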

193

Bibliography

[1] Ali-Reza Adl-Tabatabai, Brian T. Lewis, Vijay Menon, Brian R. Murphy, Bratin Saha, and Tatiana Shpeisman. 2006. Compiler and Runtime Support for Efficient Software Transac- tional Memory. In Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’06). ACM, New York, NY, USA, 26–37. https://doi.org/10.1145/1133981.1133985

[2] Vicki H. Allan, Reese B. Jones, Randall M. Lee, and Stephen J. Allan. 1995. Software Pipelining. ACM Comput. Surv. 27, 3 (Sept. 1995), 367–432. https://doi.org/10. 1145/212094.212131

[3] Frances E. Allen. 1970. Control Flow Analysis. In Proceedings of a Symposium on Compiler Optimization. ACM, New York, NY, USA, 1–19. https://doi.org/10.1145/800028. 808479

[4] Bowen Alpern, Steve Augart, Stephen M Blackburn, Maria Butrico, Anthony Cocchi, Perry Cheng, Julian Dolby, Stephen Fink, David Grove, Michael Hind, et al. 2005. The Jikes research virtual machine project: building an open-source research community. IBM Systems Journal 44, 2 (2005), 399–417.

[5] Glenn Ammons, Thomas Ball, and James R. Larus. 1997. Exploiting Hardware Performance Counters with Flow and Context Sensitive Profiling. In Proceedings of the ACM SIGPLAN 1997 Conference on Programming Language Design and Implementation (PLDI ’97). ACM, New York, NY, USA, 85–96. https://doi.org/10.1145/258915.258924

[6] Matthew Arnold, Stephen Fink, Vivek Sarkar, and Peter F. Sweeney. 2000. A Comparative Study of Static and Profile-based Heuristics for Inlining. In DYNAMO’00. ACM, New York, NY, USA, 52–64. https://doi.org/10.1145/351397.351416

[7] David F. Bacon, Susan L. Graham, and Oliver J. Sharp. 1994. Compiler Transformations for High-performance Computing. ACM Comput. Surv. 26, 4 (1994), 345–420. https: //doi.org/10.1145/197405.197406 194 Bibliography

[8] J. Eugene Ball. 1979. Predicting the Effects of Optimization on a Procedure Body. In Proceedings of the 1979 SIGPLAN Symposium on Compiler Construction (SIGPLAN ’79). ACM, New York, NY, USA, 214–220. https://doi.org/10.1145/800229.806972

[9] Edd Barrett, Carl Friedrich Bolz-Tereick, Rebecca Killick, Sarah Mount, and Laurence Tratt. 2017. Virtual Machine Warmup Blows Hot and Cold. Proc. ACM Program. Lang. 1, OOPSLA, Article 52 (Oct. 2017), 27 pages. https://doi.org/10.1145/3133876

[10] Michael Bebenita, Mason Chang, Gregor Wagner, Andreas Gal, Christian Wimmer, and Michael Franz. 2010. Trace-based Compilation in Execution Environments Without Inter- preters. In Proceedings of the 8th International Conference on the Principles and Prac- tice of Programming in Java (PPPJ ’10). ACM, New York, NY, USA, 59–68. https: //doi.org/10.1145/1852761.1852771

[11] Stephen M. Blackburn, Robin Garner, Chris Hoffmann, Asjad M. Khang, Kathryn S. McKin- ley, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Mar- tin Hirzel, Antony Hosking, Maria Jump, Han Lee, J. Eliot B. Moss, Aashish Phansalkar, Darko Stefanović, Thomas VanDrunen, Daniel von Dincklage, and Ben Wiedermann. 2006. The DaCapo Benchmarks: Java Benchmarking Development and Analysis. In Proceed- ings of the 21st Annual ACM SIGPLAN Conference on Object-oriented Programming Sys- tems, Languages, and Applications (OOPSLA ’06). ACM, New York, NY, USA, 169–190. https://doi.org/10.1145/1167473.1167488

[12] Gabriel Hjort Blindell, Mats Carlsson, Roberto Castañeda Lozano, and Christian Schulte. 2017. Complete and Practical Universal Instruction Selection. ACM Transactions on Em- bedded Computing Systems (TECS) 16, 5s (2017), 119.

[13] Rastislav Bodík, Rajiv Gupta, and Mary Lou Soffa. 1998. Complete Removal of Redun- dant Expressions. In Proceedings of the ACM SIGPLAN 1998 Conference on Program- ming Language Design and Implementation (PLDI ’98). ACM, New York, NY, USA, 1–14. https://doi.org/10.1145/277650.277653

[14] E. Brewer. 2012. CAP Twelve Years Later: How the "Rules" Have Changed. Computer 45, 02 (feb 2012), 23–29. https://doi.org/10.1109/MC.2012.37

[15] Preston Briggs and Keith D. Cooper. 1994. Effective Partial Redundancy Elimination. In Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation (PLDI ’94). ACM, New York, NY, USA, 159–170. https://doi. org/10.1145/178243.178257 Bibliography 195

[16] Preston Briggs, Keith D. Cooper, and L. Taylor Simpson. 1997. Value Numbering. Softw. Pract. Exper. 27, 6 (June 1997), 701–724. https://doi.org/10.1002/(SICI) 1097-024X(199706)27:6<701::AID-SPE104>3.3.CO;2-S

[17] B. Buchberger and R. Loos. 1982. Algebraic Simplification. In Computer Algebra: Symbolic and Algebraic Computation, Bruno Buchberger, George Edwin Collins, and Rüdiger Loos (Eds.). Springer Vienna, Vienna, 11–43. https://doi.org/10.1007/ 978-3-7091-3406-1_2

[18] Brad Calder and Dirk Grunwald. 1994. Reducing Branch Costs via Branch Alignment. In Proceedings of the Sixth International Conference on Architectural Support for Pro- gramming Languages and Operating Systems (ASPLOS VI). ACM, New York, NY, USA, 242–251. https://doi.org/10.1145/195473.195553

[19] Andreu Carminati, Renan Augusto Starke, and Rômulo Silva de Oliveira. 2017. Combining loop unrolling strategies and code predication to reduce the worst-case execution time of real-time software. Applied Computing and Informatics 13, 2 (2017), 184 – 193. https: //doi.org/10.1016/j.aci.2017.03.002

[20] S. Carr, , and P. Sweany. 1996. Improving software pipelining with unroll-and-jam. In Proceedings of HICSS-29: 29th Hawaii International Conference on System Sciences, Vol. 1. 183–192 vol.1. https://doi.org/10.1109/HICSS.1996.495462

[21] Calin Cascaval, Luiz DeRose, David A Padua, and Daniel A Reed. 1999. Compile-time based performance prediction. In International Workshop on Languages and Compilers for Parallel Computing. Springer, 365–379. https://doi.org/10.1007/3-540-44905-1_23

[22] Stefano Cazzulani. 2012. Octane: The JavaScript benchmark suite for the modern web. Retrieved December 21 (2012), 2015. https://blog.chromium.org/2012/08/ octane-javascript-benchmark-suite-for.html

[23] J. Ceng, J. Castrillon, W. Sheng, H. Scharwächter, R. Leupers, G. Ascheid, H. Meyr, T. Is- shiki, and H. Kunieda. 2008. MAPS: An Integrated Framework for MPSoC Application Par- allelization. In Proceedings of the 45th Annual Design Automation Conference (DAC ’08). ACM, New York, NY, USA, 754–759. https://doi.org/10.1145/1391469.1391663

[24] G. J. Chaitin. 1982. Register Allocation & Spilling via Graph Coloring. In Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction (SIGPLAN ’82). ACM, New York, NY, USA, 98–105. https://doi.org/10.1145/800230.806984 196 Bibliography

[25] Craig David Chambers. 1992. The Design and Implementation of the Self Compiler, an Optimizing Compiler for Object-oriented Programming Languages. Ph.D. Dissertation. Stanford, CA, USA. UMI Order No. GAX92-21602.

[26] Pohua P. Chang, Scott A. Mahlke, and Wen-mei W. Hwu. 1991. Using Profile Information to Assist Classic Code Optimizations. Softw. Pract. Exper. 21, 12 (Dec. 1991), 1301–1321. https://doi.org/10.1002/spe.4380211204

[27] A. Charlesworth. 1981. An Approach to Scientific Array Processing: The Architectural Design of the AP-120B/FPS-164 Family. Computer 14, 09 (sep 1981), 18–27. https: //doi.org/10.1109/C-M.1981.220595

[28] Fred Chow, Sun Chan, Robert Kennedy, Shin-Ming Liu, Raymond Lo, and Peng Tu. 1997. A New Algorithm for Partial Redundancy Elimination Based on SSA Form. In Proceedings of the ACM SIGPLAN 1997 Conference on Programming Language De- sign and Implementation (PLDI ’97). ACM, New York, NY, USA, 273–286. https: //doi.org/10.1145/258915.258940

[29] Cliff Click. 1995. Global Code Motion/Global Value Numbering. In PLDI’95. ACM, New York, NY, USA, 246–257. https://doi.org/10.1145/207110.207154

[30] Cliff Click and Keith D. Cooper. 1995. Combining Analyses, Combining Optimizations. ACM Trans. Program. Lang. Syst. 17, 2 (March 1995), 181–196. https://doi.org/10. 1145/201059.201061

[31] Codecache Tuning 2018. Oracle Java SE Embedded Developer’s Guide. (2018). https:// docs.oracle.com/javase/8/embedded/develop-apps-platforms/codecache.htm

[32] Ruby Community. 2018. Ruby Programming Language. (2018). https://www.ruby-lang. org/en/

[33] Keith Cooper and Linda Torczon. 2011. Engineering a compiler. Elsevier.

[34] Keith D Cooper, Mary W Hall, and Ken Kennedy. 1993. A methodology for procedure cloning. Computer Languages 19, 2 (1993), 105–117. https://doi.org/10.1016/ 0096-0551(93)90005-L ICCL ’92.

[35] Keith D Cooper, Timothy J Harvey, and Ken Kennedy. 2001. A simple, fast dominance algorithm. Software Practice & Experience 4, 1-10 (2001), 1–8. Bibliography 197

[36] Keith D Cooper, Kathryn S Mckinley, and Linda Torczon. 1998. Compiler-Based Code- Improvement Techniques. (1998).

[37] Keith D Cooper, Devika Subramanian, and Linda Torczon. 2002. Adaptive optimizing compilers for the 21st century. The Journal of Supercomputing 23, 1 (2002), 7–22.

[38] Matt Curtin. 1998. Write once, run anywhere: Why it matters. Technical Article. http://java. sun. com/features/1998/01/wo (1998).

[39] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. 1991. Efficiently Computing Static Single Assignment Form and the Control Dependence Graph. ACM Trans. Program. Lang. Syst. 13, 4 (Oct. 1991), 40. https://doi.org/10. 1145/115372.115320

[40] Benoit Daloze, Stefan Marr, Daniele Bonetta, and Hanspeter Mössenböck. 2016. Efficient and Thread-safe Objects for Dynamically-typed Languages. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2016). ACM, New York, NY, USA, 642–659. https://doi. org/10.1145/2983990.2984001

[41] Benoit Daloze, Chris Seaton, Daniele Bonetta, and Hanspeter Mössenböck. 2015. Tech- niques and Applications for Guest-language Safepoints. In Proceedings of the 10th Work- shop on Implementation, Compilation, Optimization of Object-Oriented Languages, Pro- grams and Systems (ICOOOLPS ’15). ACM, New York, NY, USA, Article 8, 10 pages. https://doi.org/10.1145/2843915.2843921

[42] Benoit Daloze, Arie Tal, Stefan Marr, Hanspeter Mössenböck, and Erez Petrank. 2018. Parallelization of Dynamic Languages: Synchronizing Built-in Collections. Proc. ACM Program. Lang. 2, OOPSLA, Article 108 (Oct. 2018), 30 pages. https://doi.org/10. 1145/3276478

[43] Jack W Davidson and Sanjay Jinturkar. 1995. An aggressive approach to loop unrolling. Technical Report. Technical Report CS-95-26, Department of Computer Science, University of Virginia, Charlottesville.

[44] Jack W Davidson and Sanjay Jinturkar. 1995. Improving instruction-level parallelism by loop unrolling and dynamic memory disambiguation. In Microarchitecture, 1995., Proceedings of the 28th Annual International Symposium on. IEEE, 125–132. 198 Bibliography

[45] Jack W. Davidson and Sanjay Jinturkar. 1996. Aggressive loop unrolling in a retargetable, optimizing compiler. In Compiler Construction, Tibor Gyimóthy (Ed.). Springer Berlin Hei- delberg, Berlin, Heidelberg, 59–73.

[46] Jeffrey Dean, David Grove, and Craig Chambers. 1995. Optimization of Object-Oriented Programs Using Static Class Hierarchy Analysis. In Proceedings of the 9th European Confer- ence on Object-Oriented Programming (ECOOP ’95). Springer-Verlag, Berlin, Heidelberg, 77–101. http://dl.acm.org/citation.cfm?id=646153.679523

[47] Daniele Cono D’Elia and Camil Demetrescu. 2018. On-stack Replacement, Distilled. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018). ACM, New York, NY, USA, 166–180. https://doi. org/10.1145/3192366.3192396

[48] Edsger Wybe Dijkstra, Edsger Wybe Dijkstra, Edsger Wybe Dijkstra, Etats-Unis Informati- cien, and Edsger Wybe Dijkstra. 1976. A discipline of programming. Vol. 1. prentice-hall Englewood Cliffs.

[49] Pedro Diniz and Martin Rinard. 1997. Lock coarsening: Eliminating lock overhead in auto- matically parallelized object-based programs. Springer Berlin Heidelberg, Berlin, Heidelberg, 285–299. https://doi.org/10.1007/BFb0017259

[50] Thomas J. Watson IBM Research Center. Research Division, FE Allen, and J Cocke. 1971. A catalogue of optimizing transformations.

[51] Amer Diwan, Kathryn S. McKinley, and J. Eliot B. Moss. 1998. Type-based Alias Analysis. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation (PLDI ’98). ACM, New York, NY, USA, 106–117. https://doi. org/10.1145/277650.277670

[52] Lukasz Domagala, Duco van Amstel, Fabrice Rastello, and P. Sadayappan. 2016. Register Allocation and Promotion Through Combined Instruction Scheduling and Loop Unrolling. In Proceedings of the 25th International Conference on Compiler Construction (CC 2016). ACM, New York, NY, USA, 143–151. https://doi.org/10.1145/2892208.2892219

[53] Jialin Dou and Marcelo Cintra. 2007. A Compiler Cost Model for Speculative Parallelization. ACM TACO 4, 2, Article 12 (June 2007). https://doi.org/10.1145/1250727.1250732 Bibliography 199

[54] Christophe Dubach, John Cavazos, Björn Franke, Grigori Fursin, Michael F.P. O’Boyle, and Olivier Temam. 2007. Fast Compiler Optimisation Evaluation Using Code-feature Based Performance Prediction. In Proceedings of the 4th International Conference on Computing Frontiers (CF ’07). ACM, New York, NY, USA, 131–142. https://doi.org/10.1145/ 1242531.1242553

[55] Gilles Duboscq, Lukas Stadler, Thomas Würthinger, Doug Simon, Christian Wimmer, and Hanspeter Mössenböck. 2013. Graal IR: An Extensible Declarative Intermediate Representa- tion. In Proceedings of the Asia-Pacific Programming Languages and Compilers Workshop.

[56] Gilles Duboscq, Thomas Würthinger, and Hanspeter Mössenböck. 2014. Speculation With- out Regret: Reducing Deoptimization Meta-data in the Graal Compiler. In PPPJ ’14. ACM, New York, NY, USA, 7. https://doi.org/10.1145/2647508.2647521

[57] Gilles Duboscq, Thomas Würthinger, Lukas Stadler, Christian Wimmer, Doug Simon, and Hanspeter Mössenböck. 2013. An Intermediate Representation for Speculative Optimiza- tions in a Dynamic Compiler. In VMIL’13. https://doi.org/10.1145/2542142.2542143

[58] Gilles Marie Duboscq. 2016. Combining Speculative Optimizations with Flexible Scheduling of Side-effects. Ph.D. Dissertation. Linz, Upper Austria, Austria.

[59] ECMA. 2017. ECMASCRIPT 2017 JavaScript Specification. (2017). https://www. ecma-international.org/publications/standards/Ecma-262.htm

[60] Josef Eisl. 2018. Trace Register Allocation. Ph.D. Dissertation. Linz, Upper Austria, Aus- tria.

[61] Josef Eisl, Matthias Grimmer, Doug Simon, Thomas Würthinger, and Hanspeter Mössen- böck. 2016. Trace-based Register Allocation in a JIT Compiler. In PPPJ ’16. ACM, New York, NY, USA. https://doi.org/10.1145/2972206.2972211

[62] Josef Eisl, David Leopoldseder, and Hanspeter Mössenböck. 2018. Parallel Trace Register Allocation. In Proceedings of the 15th International Conference on Managed Languages & Runtimes (ManLang ’18). ACM, New York, NY, USA, Article 7, 7 pages. https: //doi.org/10.1145/3237009.3237010

[63] Josef Eisl, Stefan Marr, Thomas Würthinger, and Hanspeter Mössenböck. 2017. Trace Register Allocation Policies: Compile-time vs. Performance Trade-offs. In Proceedings of the 14th International Conference on Managed Languages and Runtimes (ManLang 2017). ACM, New York, NY, USA, 92–104. https://doi.org/10.1145/3132190.3132209 200 Bibliography

[64] Sara El-Shobaky, Ahmed El-Mahdy, and Ahmed El-Nahas. 2009. Automatic Vectoriza- tion Using Dynamic Compilation and Tree Pattern Matching Technique in Jikes RVM. In Proceedings of the 4th Workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems (ICOOOLPS ’09). ACM, New York, NY, USA, 63–69. https://doi.org/10.1145/1565824.1565833

[65] EPFL. 2018. Scala Programming Language. (2018). https://www.scala-lang.org/

[66] Joseph A. Fisher. 1995. Instruction-level Parallel Processors. IEEE Computer Society Press, Chapter Trace Scheduling: A Technique for Global Microcode Compaction, 186– 198. http://dl.acm.org/citation.cfm?id=201749.201766

[67] Philip J. Fleming and John J. Wallace. 1986. How Not to Lie with Statistics: The Correct Way to Summarize Benchmark Results. Commun. ACM 29, 3 (March 1986), 218–221. https://doi.org/10.1145/5666.5673

[68] Agner Fog. 2018. Instruction Tables for Intel, AMD and VIA CPUs. (2018). http: //www.agner.org/optimize/instruction_tables.pdf

[69] Python Software Foundation. 2018. Python 3 Programming Language. (2018). https: //www.python.org/download/releases/3.0/

[70] R Foundation. 2018. R Project. (2018). https://www.r-project.org/

[71] Christopher W. Fraser, Robert R. Henry, and Todd A. Proebsting. 1992. BURG: Fast Optimal Instruction Selection and Tree Parsing. SIGPLAN Not. 27, 4 (April 1992), 68–76. https://doi.org/10.1145/131080.131089

[72] Michael Frigge, David C. Hoaglin, and Boris Iglewicz. 1989. Some Implementations of the Boxplot. The American Statistician 43, 1 (1989), 50–54. http://www.jstor.org/ stable/2685173

[73] G. Fursin, M. F. P. O’Boyle, O. Temam, and G. Watts. 2004. A fast and accurate method for determining a lower bound on execution time. Concurrency and Computation: Practice and Experience 16, 2-3 (2004), 271–292. https://doi.org/10.1002/cpe.774

[74] Yoshihiko Futamura. 1999. Partial Evaluation of Computation Process–An Approach to a Compiler-Compiler. Higher-Order and Symbolic Computation 12, 4 (01 Dec 1999), 381– 391. https://doi.org/10.1023/A:1010095604496 Bibliography 201

[75] Andy Georges, Dries Buytaert, and Lieven Eeckhout. 2007. Statistically Rigorous Java Performance Evaluation. In Proceedings of the 22Nd Annual ACM SIGPLAN Conference on Object-oriented Programming Systems and Applications (OOPSLA ’07). ACM, New York, NY, USA, 57–76. https://doi.org/10.1145/1297027.1297033

[76] Loukas Georgiadis, Renato F. Werneck, Robert E. Tarjan, Spyridon Triantafyllis, and David I. August. 2004. Finding Dominators in Practice. (2004), 677–688.

[77] Philip B. Gibbons and Steven S. Muchnick. 1986. Efficient Instruction Scheduling for a Pipelined Architecture. In Proceedings of the 1986 SIGPLAN Symposium on Compiler Construction (SIGPLAN ’86). ACM, New York, NY, USA, 11–16. https://doi.org/10. 1145/12276.13312

[78] GNU. 2018. GCC, the GNU Compiler Collection. (2018). https://gcc.gnu.org/

[79] Google. 2012. V8 JavaScript Engine. (2012). http://code.google.com/p/v8/

[80] Isaac. Gouy. 2018. Computer Language Benchmark Game: Java vs C. (2018). https: //benchmarksgame-team.pages.debian.net/benchmarksgame/faster/java.html

[81] Matthias Grimmer, Roland Schatz, Chris Seaton, Thomas Würthinger, and Hanspeter Mössenböck. 2015. Memory-safe Execution of C on a Java VM. In Proceedings of the 10th ACM Workshop on Programming Languages and Analysis for Security (PLAS’15). ACM, New York, NY, USA, 16–27. https://doi.org/10.1145/2786558.2786565

[82] Matthias Grimmer, Chris Seaton, Roland Schatz, Thomas Würthinger, and Hanspeter Mössenböck. 2015. High-performance Cross-language Interoperability in a Multi-language Runtime. In Proceedings of the 11th Symposium on Dynamic Languages (DLS 2015). ACM, New York, NY, USA, 78–90. https://doi.org/10.1145/2816707.2816714

[83] Matthias Grimmer, Chris Seaton, Thomas Würthinger, and Hanspeter Mössenböck. 2015. Dynamically Composing Languages in a Modular Way: Supporting C Extensions for Dy- namic Languages. In Proceedings of the 14th International Conference on Modularity (MODULARITY 2015). ACM, New York, NY, USA, 1–13. https://doi.org/10.1145/ 2724525.2728790

[84] Bastian Hagedorn, Larisa Stoltzfus, Michel Steuwer, Sergei Gorlatch, and Christophe Dubach. 2018. High Performance Stencil Code Generation with Lift. In Proceedings of the 2018 International Symposium on Code Generation and Optimization (CGO 2018). ACM, New York, NY, USA, 100–112. https://doi.org/10.1145/3168824 202 Bibliography

[85] Christian Häubl and Hanspeter Mössenböck. 2011. Trace-based Compilation for the Java HotSpot Virtual Machine. In Proceedings of the International Conference on the Principles and Practice of Programming in Java. ACM Press, 129–138. https://doi.org/10.1145/ 2093157.2093176

[86] Christian Häubl, Christian Wimmer, and Hanspeter Mössenböck. 2012. Evaluation of Trace Inlining Heuristics for Java. In Proceedings of the ACM Symposium on Applied Computing. ACM Press, 1871–1876. https://doi.org/10.1145/2245276.2232084

[87] Christian Häubl, Christian Wimmer, and Hanspeter Mössenböck. 2013. Context-sensitive Trace Inlining for Java. Comput. Lang. Syst. Struct. 39, 4 (2013), 123–141. https: //doi.org/10.1016/j.cl.2013.04.002

[88] Christian Häubl, Christian Wimmer, and Hanspeter Mössenböck. 2013. Deriving Code Cov- erage Information from Profiling Data Recorded for a Trace-based Just-in-time Compiler. In Proceedings of the International Conference on the Principles and Practice of Programming in Java. ACM Press, 1–12. https://doi.org/10.1145/2500828.2500829

[89] John Hennessy, Norman Jouppi, Steven Przybylski, Christopher Rowen, Thomas Gross, For- est Baskett, and John Gill. 1982. MIPS: A microprocessor architecture. In ACM SIGMICRO Newsletter, Vol. 13. IEEE Press, 17–22.

[90] Peter Hofer. 2016. Method Profiling and Lock Contention Profiling on the Java Virtual Machine Level. Ph.D. Dissertation. Linz, Upper Austria, Austria.

[91] Peter Hofer, David Gnedt, Andreas Schörgenhumer, and Hanspeter Mössenböck. 2016. Efficient Tracing and Versatile Analysis of Lock Contention in Java Applications on the Virtual Machine Level. In Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering (ICPE ’16). ACM, New York, NY, USA, 263–274. https: //doi.org/10.1145/2851553.2851559

[92] Urs Hölzle, Craig Chambers, and David Ungar. 1992. Debugging Optimized Code with Dynamic Deoptimization. In PLDI’92. ACM, New York, NY, USA, 32–43. https://doi. org/10.1145/143095.143114

[93] HotSpot JVM 2018. Java Version History (J2SE 1.3). (2018). http://en.wikipedia. org/wiki/Java_version_history Bibliography 203

[94] Jung-Chang Huang and Tau Leng. 1999. Generalized loop-unrolling: a method for program speedup. In Application-Specific Systems and Software Engineering and Technology, 1999. ASSET’99. Proceedings. 1999 IEEE Symposium on. IEEE, 244–248.

[95] Christian Humer, Christian Wimmer, Christian Wirth, Andreas Wöß, and Thomas Würthinger. 2014. A Domain-specific Language for Building Self-optimizing AST In- terpreters. In Proceedings of the 2014 International Conference on Generative Program- ming: Concepts and Experiences (GPCE 2014). ACM, New York, NY, USA, 123–132. https://doi.org/10.1145/2658761.2658776

[96] Wen-Mei W. Hwu, Scott A. Mahlke, William Y. Chen, Pohua P. Chang, Nancy J. Warter, Roger A. Bringmann, Roland G. Quellette, Richard E. Hank, Tokuzo Kiyohara, Grant E. Haab, John G. Holm, and Daniel M. Lavery. 1995. Instruction-level Parallel Proces- sors. IEEE Computer Society Press, Chapter The Superblock: An Effective Technique for VLIW and Superscalar Compilation, 234–253. http://dl.acm.org/citation.cfm? id=201749.201774

[97] IBM. 2018. IBM J9 VM. (2018). https://www.ibm.com/support/knowledgecenter/ de/SSYKE2_8.0.0/welcome/welcome_javasdk_version.html

[98] Apple Inc. 2018. Webkit JavaScript Engine. (2018). https://webkit.org/

[99] Hiroshi Inoue, Hiroshige Hayashizaki, Peng Wu, and Toshio Nakatani. 2011. A trace-based Java JIT compiler retrofitted from a method-based compiler. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization. IEEE Computer Society, 246–256.

[100] Intel. 2018. Intel Turbo Boost. (2018). https://www.intel.com/content/www/us/en/ support/articles/000007359/processors/intel-core-processors.html

[101] Intel. 2019. Intel R 64 and IA-32 Architectures Software Developer’s Manual. (2019). https://software.intel.com/en-us/articles/intel-sdm

[102] Kazuaki Ishizaki, Motohiro Kawahito, Toshiaki Yasue, Hideaki Komatsu, and Toshio Nakatani. 2000. A Study of Devirtualization Techniques for a Java Just-In-Time Com- piler. In OOPSLA ’00. ACM, 294–310. https://doi.org/10.1145/353171.353191

[103] JKU. 2018. Johannes Kepler University Linz, Austria. (2018). https://www.jku.at/ 204 Bibliography

[104] Robert Kennedy, Sun Chan, Shin-Ming Liu, Raymond Lo, Peng Tu, and Fred Chow. 1999. Partial Redundancy Elimination in SSA Form. ACM Trans. Program. Lang. Syst. 21, 3 (May 1999), 627–676. https://doi.org/10.1145/319301.319348

[105] T. Kisuki, P. M. W. Knijnenburg, and M. F. P. O’Boyle. 2000. Combined selection of tile sizes and unroll factors using iterative compilation. In Proceedings 2000 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00622). 237– 246. https://doi.org/10.1109/PACT.2000.888348

[106] Toru Kisuki, Peter M. W. Knijnenburg, Mike F. P. O’Boyle, François Bodin, and Harry A. G. Wijshoff. 1999. A feasibility study in iterative compilation. In High Performance Computing, Constantine Polychronopoulos, Kazuki Joe Akira Fukuda, and Shinji Tomita (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 121–132.

[107] P. M. W. Knijnenburg, T. Kisuki, and M. F. P. O’Boyle. 2002. Iterative Compilation. Springer Berlin Heidelberg, Berlin, Heidelberg, 171–187. https://doi.org/10.1007/ 3-540-45874-3_10

[108] Thomas Kotzmann and Hanspeter Mössenböck. 2005. Escape Analysis in the Context of Dynamic Compilation and Deoptimization. In Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments (VEE ’05). ACM, New York, NY, USA, 111–120. https://doi.org/10.1145/1064979.1064996

[109] Thomas Kotzmann, Christian Wimmer, Hanspeter Mössenböck, Thomas Rodriguez, Ken- neth Russell, and David Cox. 2008. Design of the Java HotSpot&Trade; Client Compiler for Java 6. ACM Trans. Archit. Code Optim. 5, 1, Article 7 (May 2008), 32 pages. https://doi.org/10.1145/1369396.1370017

[110] Andreas Krall and Sylvain Lelait. 2000. Compilation Techniques for Multimedia Processors. International Journal of Parallel Programming 28, 4 (01 Aug 2000), 347–361. https: //doi.org/10.1023/A:1007507005174

[111] Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In CGO ’04 (CGO ’04). IEEE Computer Society. http://dl.acm.org/citation.cfm?id=977395.977673 Bibliography 205

[112] Hugh Leather, Edwin Bonilla, and Michael O’Boyle. 2009. Automatic Feature Generation for Machine Learning Based Optimizing Compilation. In Proceedings of the 7th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO ’09). IEEE Computer Society, Washington, DC, USA, 81–91. https://doi.org/10.1109/ CGO.2009.21

[113] Philipp Lengauer and Hanspeter Mössenböck. 2014. The Taming of the Shrew: Increasing Performance by Automatic Parameter Tuning for Java Garbage Collectors. In Proceedings of the 5th ACM/SPEC International Conference on Performance Engineering (ICPE ’14). ACM, New York, NY, USA, 111–122. https://doi.org/10.1145/2568088.2568091

[114] Thomas Lengauer and Robert Endre Tarjan. 1979. A fast algorithm for finding dominators in a flowgraph. ACM Transactions on Programming Languages and Systems (TOPLAS) 1, 1 (1979), 121–141.

[115] David Leopoldseder. 2015. Master Thesis: Graal AOT JS: A Java to JavaScript Compiler. http://epub.jku.at/obvulihs/content/titleinfo/912629. (2015).

[116] David Leopoldseder. 2017. Simulation-based Code Duplication for Enhancing Compiler Optimizations. In SPLASH Companion 2017. ACM, New York, NY, USA, 10–12. https: //doi.org/10.1145/3135932.3135935

[117] David Leopoldseder, Roland Schatz, Lukas Stadler, Manuel Rigger, and Hanspeter Mössen- böck. 2018. Fast-Path Loop Unrolling of Non-Counted Loops to Enable Subsequent Compiler Optimizations. In Proceedings of the 15th International Conference on Man- aged Languages and Runtimes (ManLang 2018). ACM, New York, NY, USA. https: //doi.org/10.1145/3237009.3237013

[118] David Leopoldseder, Lukas Stadler, Manuel Rigger, Thomas Würthinger, and Hanspeter Mössenböck. 2018. A Cost Model for a Graph-based Intermediate-representation in a Dynamic Compiler. In Proceedings of the 10th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages (VMIL 2018). ACM, New York, NY, USA, 26–35. https://doi.org/10.1145/3281287.3281290

[119] David Leopoldseder, Lukas Stadler, Christian Wimmer, and Hanspeter Mössenböck. 2015. Java-to-JavaScript Translation via Structured Control Flow Reconstruction of Compiler IR. In DLS’15. ACM, New York, NY, USA, 91–103. https://doi.org/10.1145/2816707. 2816715 206 Bibliography

[120] David Leopoldseder, Lukas Stadler, Thomas Würthinger, Josef Eisl, Doug Simon, and Hanspeter Mössenböck. 2018. Dominance-based Duplication Simulation (DBDS): Code Duplication to Enable Compiler Optimizations. In CGO 2018. ACM, New York, NY, USA, 126–137. https://doi.org/10.1145/3168811

[121] Sheng Liang and Gilad Bracha. 1998. Dynamic Class Loading in the Java Virtual Machine. In Proceedings of the 13th ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications (OOPSLA ’98). ACM, New York, NY, USA, 36–44. https://doi.org/10.1145/286936.286945

[122] Jin Lin, Tong Chen, Wei-Chung Hsu, Pen-Chung Yew, Roy Dz-Ching Ju, Tin-Fook Ngai, and Sun Chan. 2003. A Compiler Framework for Speculative Analysis and Optimizations. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI ’03). ACM, New York, NY, USA, 289–299. https://doi. org/10.1145/781131.781164

[123] Yi Lin, Kunshan Wang, Stephen M. Blackburn, Antony L. Hosking, and Michael Norrish. 2015. Stop and Go: Understanding Yieldpoint Behavior. In Proceedings of the 2015 In- ternational Symposium on Memory Management (ISMM ’15). ACM, New York, NY, USA, 70–80. https://doi.org/10.1145/2754169.2754187

[124] Tim Lindholm, Frank Yellin, Gilad Bracha, and Alex Buckley. 2015. The Java Virtual Machine Specification, Java SE 8 Edition. http://docs.oracle.com/javase/specs/ jvms/se8/jvms8.pdf

[125] Shun Long and Michael O’Boyle. 2004. Adaptive Java Optimisation Using Instance-based Learning. In Proceedings of the 18th Annual International Conference on Supercomputing (ICS ’04). ACM, New York, NY, USA, 237–246. https://doi.org/10.1145/1006209. 1006243

[126] Scott A. Mahlke, Richard E. Hank, James E. McCormick, David I. August, and Wen- Mei W. Hwu. 1995. A Comparison of Full and Partial Predicated Execution Support for ILP Processors. In Proceedings of the 22Nd Annual International Symposium on Computer Architecture (ISCA ’95). ACM, New York, NY, USA, 138–150. https://doi.org/10. 1145/223982.225965

[127] Scott A. Mahlke, David C. Lin, William Y. Chen, Richard E. Hank, and Roger A. Bringmann. 1995. Instruction-level Parallel Processors. IEEE Computer Society Press, Chapter Effective Compiler Support for Predicated Execution Using the Hyperblock, 161–170. http://dl. acm.org/citation.cfm?id=201749.201763 Bibliography 207

[128] Jeremy Manson, William Pugh, and Sarita V. Adve. 2005. The Java Memory Model. In Proceedings of the 32Nd ACM SIGPLAN-SIGACT Symposium on Principles of Program- ming Languages (POPL ’05). ACM, New York, NY, USA, 378–391. https://doi.org/ 10.1145/1040305.1040336

[129] Stefan Marr, Benoit Daloze, and Hanspeter Mössenböck. 2016. Cross-language Compiler Benchmarking: Are We Fast Yet?. In Proceedings of the 12th Symposium on Dynamic Languages (DLS 2016). ACM, New York, NY, USA, 120–131. https://doi.org/10. 1145/2989225.2989232

[130] Luis Mastrangelo, Luca Ponzanelli, Andrea Mocci, Michele Lanza, Matthias Hauswirth, and Nathaniel Nystrom. 2015. Use at Your Own Risk: The Java Unsafe API in the Wild. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2015). ACM, New York, NY, USA, 695–710. https://doi.org/10.1145/2814270.2814313

[131] Charith Mendis, Saman Amarasinghe, and Michael Carbin. 2018. Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks. PrePrint: arXiv preprint arXiv:1808.07412 (2018).

[132] Raphael Mosaner, Manuel Rigger, David Leopoldseder, Roland Schatz, and Hanspeter Mössenböck. 2018. On-Stack Replacement in Truffle Interpreters for Non-structured Lan- guages. (2018). Under Review.

[133] Hanspeter Mössenböck. 2000. Adding static single assignment form and a graph coloring register allocator to the Java HotSpotTM client compiler. Technical Report. Citeseer.

[134] Hanspeter Mössenböck and Michael Pfeiffer. 2002. Linear Scan Register Allocation in the Context of SSA Form and Register Constraints. In Compiler Construction, R. Nigel Horspool (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 229–246.

[135] Mozilla. 2018. asm.js. (2018). http://http://asmjs.org/

[136] Frank Mueller and David B. Whalley. 1992. Avoiding Unconditional Jumps by Code Replica- tion. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 322–330. https://doi.org/10.1145/143095.143144

[137] Frank Mueller and David B. Whalley. 1995. Avoiding Conditional Branches by Code Repli- cation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 56–66. https://doi.org/10.1145/207110.207116 208 Bibliography

[138] Martin Odersky, Lex Spoon, and Bill Venners. 2008. Programming in scala. Artima Inc.

[139] OpenJDK 2013. Graal Project. (2013). http://openjdk.java.net/projects/graal

[140] OpenJDK 2017. GraalVM -New JIT Compiler and Polyglot Runtime for the JVM;. (2017). http://www.oracle.com/technetwork/oracle-labs/program-languages/ overview/index-2301583.html

[141] OpenJDK. 2018. JEP 243: Java-Level JVM Compiler Interface. (2018). https: //openjdk.java.net/jeps/243

[142] OpenJDK. 2018. JEP 317: Experimental Java-Based JIT Compiler. (2018). https: //openjdk.java.net/jeps/317

[143] OpenJDK. 2018. Type Profile Pollution. (2018). https://wiki.openjdk.java.net/ display/HotSpot/MethodData#MethodData-Poll

[144] Oracle. 2015. Loop Optimizations in the Hotspot Server VM Compiler (C2). (2015). https://wiki.openjdk.java.net/pages/viewpage.action?pageId=20415918

[145] Oracle. 2018. FastR. (2018). https://github.com/oracle/fastr

[146] Oracle. 2018. Graal Python. (2018). https://github.com/graalvm/graalpython

[147] Oracle. 2018. GraalJS Repository. (2018). https://github.com/graalvm/graaljs

[148] Oracle. 2018. GraalVM. (2018). https://www.graalvm.org/

[149] Oracle. 2018. Oracle Java. (2018). https://www.oracle.com/de/java/

[150] Oracle. 2018. Oracle Labs. (2018). https://labs.oracle.com/

[151] Oracle. 2018. Tiered Compilation. (2018). https://docs.oracle.com/javase/7/docs/ technotes/guides/vm/performance-enhancements-7.html#tieredcompilation

[152] Oracle. 2018. Truffle Ruby. (2018). https://github.com/oracle/truffleruby

[153] Michael Paleczny, Christopher Vick, and Cliff Click. 2001. The Java HotSpotTM Server Compiler. In Proceedings of the Java Virtual Machine Research and Technology Symposium. USENIX, 1–12. Bibliography 209

[154] Simon Peyton Jones and Simon Marlow. 2002. Secrets of the Glasgow Haskell Compiler Inliner. J. Funct. Program. 12, 5 (July 2002), 393–434. https://doi.org/10.1017/ S0956796802004331

[155] Filip Pizlo. 2014. JetStream Benchmark Suite. (2014). http://browserbench.org/ JetStream/

[156] Aleksandar Prokopec, Gilles Duboscq, David Leopoldseder, and Thomas Würthinger. 2019. An Optimization-driven Incremental Inline Substitution Algorithm for Just-in-time Com- pilers. In Proceedings of the 2019 IEEE/ACM International Symposium on Code Gen- eration and Optimization (CGO 2019). IEEE Press, Piscataway, NJ, USA, 164–179. http://dl.acm.org/citation.cfm?id=3314872.3314893

[157] Aleksandar Prokopec, David Leopoldseder, Gilles Duboscq, and Thomas Würthinger. 2017. Making Collection Operations Optimal with Aggressive JIT Compilation. In SCALA 2017. ACM, New York, NY, USA, 29–40. https://doi.org/10.1145/3136000.3136002

[158] Aleksandar Prokopec, Andrea Rosà, David Leopoldseder, Gilles Duboscq, Petr Tůma, Mar- tin Studener, Lubomír Bulej, Yudi Zheng, Alex Villazón, Doug Simon, Thomas Würthinger, and Walter Binder. 2019. Renaissance: Benchmarking Suite for Parallel Applications on the JVM. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2019). ACM, New York, NY, USA, 31–47. https://doi.org/10.1145/3314221.3314637

[159] Ganesan Ramalingam. 2002. On Loops, Dominators, and Dominance Frontiers. ACM Trans. Program. Lang. Syst. 24, 5, 455–490. https://doi.org/10.1145/570886.570887

[160] Manuel Rigger. 2016. Sulong: Memory Safe and Efficient Execution of LLVM-Based Lan- guages. In ECOOP 2016 Doctoral Symposium (ECOOP DS 2016). http://ssw.jku.at/ General/Staff/ManuelRigger/ECOOP16-DS.pdf.

[161] Manuel Rigger. 2018. Sandboxed Execution of C and Other Unsafe Languages on the Java Virtual Machine (Extended Abstract). In Student Research Competition at the Intl. Conf. on the Art, Science, and Engineering of Programmings (Programming SRC 2018). https://doi.org/10.1145/3191697.3213795 210 Bibliography

[162] Manuel Rigger, Matthias Grimmer, and Hanspeter Mössenböck. 2016. Sulong - Execution of LLVM-based Languages on the JVM (Position Paper). In Proceedings of the 11th Workshop on Implementation, Compilation, Optimization of Object-Oriented Languages, Programs and Systems (ICOOOLPS 2016). ACM, New York, NY, USA, Article 7, 4 pages. https: //doi.org/10.1145/3012408.3012416

[163] Manuel Rigger, Matthias Grimmer, Christian Wimmer, Thomas Würthinger, and Hanspeter Mössenböck. 2016. Bringing Low-level Languages to the JVM: Efficient Execution of LLVM IR on Truffle. In Proceedings of the 8th International Workshop on Virtual Machines and Intermediate Languages (VMIL 2016). ACM, New York, NY, USA, 6–15. https://doi. org/10.1145/2998415.2998416

[164] Manuel Rigger, Stefan Marr, Bram Adams, and Hanspeter Mössenböck. 2019. Understand- ing GCC Builtins to Develop Better Tools. (2019). Under Review.

[165] Manuel Rigger, Stefan Marr, Stephen Kell, David Leopoldseder, and Hanspeter Mössen- böck. 2018. An Analysis of x86-64 Inline Assembly in C Programs. In Proceedings of the 14th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE ’18). ACM, New York, NY, USA, 84–99. https://doi.org/10.1145/3186411.3186418

[166] Manuel Rigger, Daniel Pekarek, and Hanspeter Mössenböck. 2018. Context-Aware Failure- Oblivious Computing as a Means of Preventing Buffer Overflows. In Proceedings of the 12th International Conference on Network and System Security (NSS 2018). https:// doi.org/10.1007/978-3-030-02744-5_28

[167] Manuel Rigger, Roland Schatz, Matthias Grimmer, and Hanspeter Mössenböck. 2017. Lenient Execution of C on a Java Virtual Machine: Or: How I Learned to Stop Wor- rying and Run the Code. In Proceedings of the 14th International Conference on Man- aged Languages and Runtimes (ManLang 2017). ACM, New York, NY, USA, 35–47. https://doi.org/10.1145/3132190.3132204

[168] Manuel Rigger, Roland Schatz, Jacob Kreindl, Cristian Häubl, and Hanspeter Mössenböck. 2018. Sulong, and Thanks for All the Fish (Extended Abstract). In Workshop on Modern Language Runtimes, Ecosystems, and VMs (MoreVMs 2018). ACM, New York, NY, USA, 35–44. https://doi.org/10.1145/3191697.3191726

[169] Manuel Rigger, Roland Schatz, Rene Mayrhofer, Matthias Grimmer, and Hanspeter Mössen- böck. 2018. Sulong, and Thanks for All the Bugs: Finding Errors in C Programs by Abstracting from the Native Execution Model. In Proceedings of the Twenty-Third Inter- Bibliography 211

national Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2018). ACM, New York, NY, USA, 377–391. https://doi.org/10. 1145/3173162.3173174

[170] Michael Rock and Andreas Koch. 2004. Architecture-Independent Meta-optimization by Aggressive Tail Splitting. Springer Berlin Heidelberg, Berlin, Heidelberg, 328–335. https: //doi.org/10.1007/978-3-540-27866-5_42

[171] Kenneth Russell and David Detlefs. 2006. Eliminating Synchronization-related Atomic Op- erations with Biased Locking and Bulk Rebiasing. In Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-oriented Programming Systems, Languages, and Applica- tions (OOPSLA ’06). ACM, New York, NY, USA, 263–272. https://doi.org/10.1145/ 1167473.1167496

[172] Rafael H Saavedra and Alan J Smith. 1996. Analysis of benchmark characteristics and benchmark performance prediction. ACM Transactions on Computer Systems (TOCS) 14, 4 (1996), 344–384.

[173] Vivek Sarkar. 2001. Optimized Unrolling of Nested Loops. Int. J. Parallel Program. 29, 5, 545–581. https://doi.org/10.1023/A:1012246031671

[174] Robert W. Scheifler. 1977. An Analysis of Inline Substitution for a Structured Programming Language. Commun. ACM 20, 9 (Sept. 1977), 647–654. https://doi.org/10.1145/ 359810.359830

[175] Uwe Schwiegelshohn, Franco Gasperoni, and Kemal Ebcioˇglu. 1991. On optimal paral- lelization of arbitrary loops. J. Parallel and Distrib. Comput. 11, 2 (1991), 130 – 134. https://doi.org/10.1016/0743-7315(91)90118-S

[176] Andreas Sewe, Mira Mezini, Aibek Sarimbekov, and Walter Binder. 2011. Da Capo Con Scala: Design and Analysis of a Scala Benchmark Suite for the Java Virtual Machine. In OOPSLA ’11. ACM, New York, NY, USA. https://doi.org/10.1145/2048066. 2048118

[177] Samuel Sanford Shapiro and Martin B. Wilk. 1965. An analysis of variance test for normality (complete samples). Biometrika 52, 3/4 (1965), 591–611.

[178] Doug Simon, Christian Wimmer, Bernhard Urban, Gilles Duboscq, Lukas Stadler, and Thomas Würthinger. 2015. Snippets: Taking the High Road to a Low Level. ACM Trans. Archit. Code Optim. 12, 2, Article 20 (June 2015), 25 pages. https://doi.org/10.1145/2764907

[179] SSW. 2012. Institute for System Software, Johannes Kepler University Linz, Austria. (2012). http://ssw.jku.at/

[180] Lukas Stadler. 2014. Partial Escape Analysis and Scalar Replacement for Java. Ph.D. Dissertation. Johannes Kepler University Linz, Linz, Austria.

[181] Lukas Stadler, Gilles Duboscq, Hanspeter Mössenböck, and Thomas Würthinger. 2012. Compilation Queuing and Graph Caching for Dynamic Compilers. In Proceedings of the ACM Workshop on Virtual Machines and Intermediate Languages. ACM Press, 49–58. https://doi.org/10.1145/2414740.2414750

[182] Lukas Stadler, Gilles Duboscq, Hanspeter Mössenböck, Thomas Würthinger, and Doug Simon. 2013. An Experimental Study of the Influence of Dynamic Compiler Optimizations on Scala Performance. In SCALA ’13. ACM, New York, NY, USA, Article 9, 8 pages. https://doi.org/10.1145/2489837.2489846

[183] Lukas Stadler, Adam Welc, Christian Humer, and Mick Jordan. 2016. Optimizing R Language Execution via Aggressive Speculation. In Proceedings of the 12th Symposium on Dynamic Languages (DLS 2016). ACM, New York, NY, USA, 84–95. https://doi.org/10.1145/2989225.2989236

[184] Lukas Stadler, Christian Wimmer, Thomas Würthinger, Hanspeter Mössenböck, and John Rose. 2009. Lazy Continuations for Java Virtual Machines. In Proceedings of the International Conference on the Principles and Practice of Programming in Java. ACM Press, 143–152. https://doi.org/10.1145/1596655.1596679

[185] Lukas Stadler, Thomas Würthinger, and Hanspeter Mössenböck. 2014. Partial Escape Analysis and Scalar Replacement for Java. In CGO ’14. ACM Press, 165–174. https://doi.org/10.1145/2544137.2544157

[186] Lukas Stadler, Thomas Würthinger, and Christian Wimmer. 2010. Efficient Coroutines for the Java Platform. In Proceedings of the International Conference on the Principles and Practice of Programming in Java. ACM Press, 20–28. https://doi.org/10.1145/1852761.1852765

[187] Codruţ Stancu, Christian Wimmer, Stefan Brunthaler, Per Larsen, and Michael Franz. 2014. Comparing Points-to Static Analysis with Runtime Recorded Profiling Data. In Proceedings of the 2014 International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools (PPPJ ’14). ACM, New York, NY, USA, 157–168. https://doi.org/10.1145/2647508.2647524

[188] Codruţ Stancu, Christian Wimmer, Stefan Brunthaler, Per Larsen, and Michael Franz. 2015. Safe and Efficient Hybrid Memory Management for Java. In Proceedings of the 2015 International Symposium on Memory Management (ISMM ’15). ACM, New York, NY, USA, 81–92. https://doi.org/10.1145/2754169.2754185

[189] Standard Performance Evaluation Corporation. 2008. SPECjvm2008. (2008). http://www.spec.org/jvm2008/

[190] Bogong Su and Jing Wang. 1991. Loop-carried dependence and the general URPR software pipelining approach (unrolling, pipelining and rerolling). In Proceedings of the Twenty-Fourth Annual Hawaii International Conference on System Sciences, Vol. 2. IEEE, 366–372.

[191] tidy. 2018. tidyverse: R packages for data science. (2018). https://www.tidyverse.org/

[192] TIOBE. 2018. TIOBE Programming Community Index. (2018). https://www.tiobe.com/tiobe-index/

[193] Munara Tolubaeva. 2014. Compiler Cost Model for Multicore Architectures. Ph.D. Dissertation.

[194] Sid-Ahmed-Ali Touati and Denis Barthou. 2006. On the Decidability of Phase Ordering Problem in Optimizing Compilation. In CF ’06. ACM, 10. https://doi.org/10.1145/1128022.1128042

[195] Michael L. Van-De-Vanter, Chris Seaton, Michael Haupt, Christian Humer, and Thomas Würthinger. 2018. Fast, Flexible, Polyglot Instrumentation Support for Debuggers and other Tools. Programming Journal 2, 3 (2018), 14. https://doi.org/10.22152/programming-journal.org/2018/2/14

[196] April W. Wade, Prasad A. Kulkarni, and Michael R. Jantz. 2017. AOT vs. JIT: Impact of Profile Data on Code Quality. In LCTES 2017. ACM, New York, NY, USA, 1–10. https://doi.org/10.1145/3078633.3081037

[197] David W. Wall. 1991. Limits of Instruction-level Parallelism. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV). ACM, New York, NY, USA, 176–188. https://doi.org/10.1145/106972.106991

[198] Ko-Yang Wang. 1994. Precise Compile-time Performance Prediction for Superscalar-based Computers. In PLDI ’94. ACM, New York, NY, USA, 73–84. https://doi.org/10.1145/178243.178250

[199] Z. Wang and M. O’Boyle. 2018. Machine Learning in Compiler Optimization. Proc. IEEE 106, 11 (Nov 2018), 1879–1901. https://doi.org/10.1109/JPROC.2018.2817118

[200] Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in statistics. Springer, 196–202.

[201] Christian Wimmer. 2004. Linear Scan Register Allocation for the Java HotSpot Client Compiler. Master’s Thesis, Johannes Kepler University Linz. http://www.christianwimmer.at/Publications/Wimmer04a/

[202] Christian Wimmer and Michael Franz. 2010. Linear Scan Register Allocation on SSA Form. In CGO ’10. ACM, New York, NY, USA, 170–179. https://doi.org/10.1145/1772954.1772979

[203] Christian Wimmer and Hanspeter Mössenböck. 2005. Optimized Interval Splitting in a Linear Scan Register Allocator. In Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments (VEE ’05). ACM, New York, NY, USA, 132–141. https://doi.org/10.1145/1064979.1064998

[204] Christian Wimmer and Hanspeter Mössenböck. 2006. Automatic Object Colocation Based on Read Barriers. In Proceedings of the Joint Conference on Modular Programming Languages. Springer-Verlag, 326–345. https://doi.org/10.1007/11860990_20

[205] Christian Wimmer and Hanspeter Mössenböck. 2007. Automatic Feedback-directed Object Inlining in the Java HotSpot™ Virtual Machine. In Proceedings of the International Conference on Virtual Execution Environments. ACM Press, 12–21. https://doi.org/10.1145/1254810.1254813

[206] Christian Wimmer and Hanspeter Mössenböck. 2008. Automatic Array Inlining in Java Virtual Machines. In Proceedings of the International Symposium on Code Generation and Optimization. ACM Press, 14–23. https://doi.org/10.1145/1356058.1356061

[207] Christian Wimmer and Hanspeter Mössenböck. 2010. Automatic Feedback-directed Object Fusing. ACM Transactions on Architecture and Code Optimization 7, 2, Article 7 (2010), 35 pages. https://doi.org/10.1145/1839667.1839669

[208] Michael Wolfe. 1992. Beyond Induction Variables. In Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation (PLDI ’92). ACM, New York, NY, USA, 162–174. https://doi.org/10.1145/143095.143131

[209] Andreas Wöß, Christian Wirth, Daniele Bonetta, Chris Seaton, Christian Humer, and Hanspeter Mössenböck. 2014. An Object Storage Model for the Truffle Language Implementation Framework. In Proceedings of the 2014 International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools. ACM, 133–144.

[210] Thomas Würthinger, Danilo Ansaloni, Walter Binder, Christian Wimmer, and Hanspeter Mössenböck. 2011. Safe and Atomic Run-time Code Evolution for Java and Its Application to Dynamic AOP. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications. ACM Press, 825–844. https://doi.org/10.1145/2048066.2048129

[211] Thomas Würthinger, Walter Binder, Danilo Ansaloni, Philippe Moret, and Hanspeter Mössenböck. 2010. Applications of Enhanced Dynamic Code Evolution for Java in GUI Development and Dynamic Aspect-oriented Programming. In Proceedings of the International Conference on Generative Programming and Component Engineering. ACM Press, 123–126. https://doi.org/10.1145/1868294.1868312

[212] Thomas Würthinger, Walter Binder, Danilo Ansaloni, Philippe Moret, and Hanspeter Mössenböck. 2010. Improving Aspect-oriented Programming with Dynamic Code Evolution in an Enhanced Java Virtual Machine. In Proceedings of the 7th Workshop on Reflection, AOP and Meta-Data for Software Evolution. ACM Press, Article 5, 5 pages. https://doi.org/10.1145/1890683.1890688

[213] Thomas Würthinger, Christian Wimmer, Christian Humer, Andreas Wöß, Lukas Stadler, Chris Seaton, Gilles Duboscq, Doug Simon, and Matthias Grimmer. 2017. Practical Partial Evaluation for High-performance Dynamic Language Runtimes. In PLDI 2017. ACM, New York, NY, USA, 662–676. https://doi.org/10.1145/3062341.3062381

[214] Thomas Würthinger, Christian Wimmer, and Hanspeter Mössenböck. 2007. Array Bounds Check Elimination for the Java HotSpot™ Client Compiler. In Proceedings of the International Conference on the Principles and Practice of Programming in Java. ACM Press, 125–133. https://doi.org/10.1145/1294325.1294343

[215] Thomas Würthinger, Christian Wimmer, and Hanspeter Mössenböck. 2008. Visualization of Program Dependence Graphs. In Proceedings of the International Conference on Compiler Construction. Springer-Verlag, 193–196.

[216] Thomas Würthinger, Christian Wimmer, and Hanspeter Mössenböck. 2009. Array bounds check elimination in the context of deoptimization. Science of Computer Programming 74, 5-6 (2009). https://doi.org/10.1016/j.scico.2009.01.002

[217] Thomas Würthinger, Christian Wimmer, and Lukas Stadler. 2010. Dynamic code evolution for Java. In Proceedings of the 8th International Conference on the Principles and Practice of Programming in Java. ACM, 10–19.

[218] Thomas Würthinger, Christian Wimmer, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Christian Humer, Gregor Richards, Doug Simon, and Mario Wolczko. 2013. One VM to Rule Them All. In Onward! 2013. ACM, New York, NY, USA, 187–204. https://doi.org/10.1145/2509578.2509581

[219] Thomas Würthinger, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Doug Simon, and Christian Wimmer. 2012. Self-optimizing AST interpreters. In Proceedings of the 8th symposium on Dynamic languages (DLS ’12). ACM Press, 73–82.

[220] Thomas Würthinger, Christian Wimmer, and Lukas Stadler. 2013. Unrestricted and Safe Dynamic Code Evolution for Java. Science of Computer Programming 78, 5 (May 2013), 481–498. https://doi.org/10.1016/j.scico.2011.06.005

[221] Yudi Zheng, Lubomír Bulej, and Walter Binder. 2017. An Empirical Study on Deoptimization in the Graal Compiler. In 31st European Conference on Object-Oriented Programming (ECOOP 2017) (Leibniz International Proceedings in Informatics (LIPIcs)), Peter Müller (Ed.), Vol. 74. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 30:1–30:30. https://doi.org/10.4230/LIPIcs.ECOOP.2017.30

Acknowledgments

Most of my gratitude belongs to my advisor, Prof. Hanspeter Mössenböck, who constantly challenged my thinking about compilers and computer science in general and supported me in every possible way. Next, I want to thank my second advisor, Prof. Michael O’Boyle from the University of Edinburgh, for his time and input on this thesis.

Then, I want to thank my co-advisor, Dr. Lukas Stadler. It is a pleasure coding, working, teaching, and enjoying a cold beer with you. This thesis, like all the work I did in the Graal project, would not have been possible without your constant support.

This work was funded by and performed in collaboration with Oracle Labs, specifically the GraalVM research group. I want to thank all the people who supported me and my work, first and foremost Thomas Würthinger, who provided me with all possible support from the Oracle side. Thank you for your support and for the chance to contribute to something that became as significant as Graal. In this context, I want to send special thanks to my previous manager, Doug Simon, who has been an awesome manager during my time at Oracle and beyond. I also want to mention Roland Schatz, Gilles Duboscq, Stefan Anzinger, and Tom Rodriguez, who always have an ear for even the most complex compiler graphs.

A special thank you goes out to Aleksandar Prokopec for inviting me into a broader set of research topics and projects. It is my pleasure.

I also want to thank my university and, specifically, our institute, the SSW. Special thanks to my fellow PhD students (current and previous): Josef Eisl, Manuel Rigger, Benoit Daloze, Florian Latifi, and Raphael Mosaner.

A PhD is a collaborative effort, and therefore I want to thank all the other people who were involved in and contributed to this journey.

Most importantly, I want to thank my parents, who have always been there for me. Mum, thank you for everything, for always encouraging me to keep pushing and for reminding me that this is still the thing I want to do in my life. Dad, all that is left to say is that I treasure all the time we had together and regret that you will not witness the future endeavors of my life, including this PhD of mine.

In the end this sentence is for her, she knows...