Master thesis QuickInterp - Improving interpreter performance with superinstructions

Lukas Miedema June 11, 2020


Exam committee
prof. dr. M. Huisman
dr.ir. T. van Dijk
dr.ir. A.B.J. Kokkeler

Cover design
Gerben Miedema

Abstract

The performance of Java Virtual Machine (JVM) bytecode interpreters can be severely limited by (1) the inability to perform optimizations over multiple instructions, and (2) the excessive level of branching within the interpreter loop, a result of having to do at least one jump per bytecode instruction. With the QuickInterp VM we mitigate these performance limitations of JVM interpreter design by means of superinstructions. Our interpreter supports not just the regular bytecode instructions, but also extra instructions which each perform the work of a sequence of bytecode instructions (superinstructions). At class load time, bytecode is preprocessed: sequences of bytecode instructions are replaced by equivalent superinstructions, requiring no alteration to, or loss of compatibility with, existing bytecode. The interpreter source code is generated automatically based on a profile of the running application. New instruction handlers are generated by concatenating the instruction handlers of existing instructions, removing the need for manual construction of the superinstruction handlers. Which sequences of instructions to convert into a superinstruction, and how to perform the most effective substitution of superinstructions into a loaded program, are the questions we answer in this thesis. Earlier work has shown that finding the optimal superinstruction set is NP-hard [15]. As such, we implement an iterative optimization algorithm to approximate the optimal superinstruction set. Furthermore, we design and test various runtime substitution algorithms. Our new shortest-path-based runtime substitution algorithm uses pathfinding through the input program to find the combination of superinstructions that lowers the number of required instruction dispatches as much as possible. We further enhance the shortest-path algorithm by performing substitution based on equivalence, extending the impact each individual superinstruction can make at runtime. We compare our new runtime substitution algorithms against a reimplementation of a substitution algorithm from earlier work [4]. Our results show that the superinstruction optimization is still valid in 2020, boasting a 45.6% performance improvement over baseline in a small arithmetic benchmark, and a 33.0% improvement over baseline in a larger Spring Boot-based web application benchmark. Our iterative superinstruction set construction algorithm manages to find near-optimal solutions to the NP-hard problem of constructing the superinstruction set. However, we also show that more advanced superinstruction placement algorithms do not offer the same return on investment: given enough superinstructions, each of the tested substitution algorithms is capable of achieving similar performance improvements. The code and benchmarks are available at https://github.com/LukasMiedema/QuickInterp.

Table of contents

Abstract
Table of contents
List of figures
List of listings

1 Introduction
    1.1 What is a VM
    1.2 What is an interpreter
        1.2.1 Anatomy of an interpreter
    1.3 What is a JIT Compiler
    1.4 Superinstructions
    1.5 Research goals
        1.5.1 Motivation
        1.5.2 Research question
        1.5.3 Goals and method
    1.6 Research contributions
    1.7 Overview

2 Background
    2.1 The JVM
        2.1.1 Introduction
        2.1.2 Bytecode structure
        2.1.3 Stack-based architecture
        2.1.4 Verifier, typesafety and abstract interpreters
        2.1.5 Dynamic class loading
    2.2 Modern interpreter dispatching
        2.2.1 Superscalar execution and branch prediction
        2.2.2 The interpreter
        2.2.3 Token-threaded interpreters
        2.2.4 Threaded-code interpreter
    2.3 Conclusion

3 Related work
    3.1 Introduction
    3.2 Superinstructions
        3.2.1 Workflow
        3.2.2 Superoperators
        3.2.3 vmgen
        3.2.4 Tiger
        3.2.5 Conclusion
    3.3 Other interpreter optimizations / research
        3.3.1 Static replication
        3.3.2 Instruction specialization
    3.4 Conclusion

4 Design of QuickInterp
    4.1 Introduction
        4.1.1 Design goals
        4.1.2 Overview
    4.2 Architecture and workflow overview
        4.2.1 From profile to runtime
    4.3 QuickInterp application profile
        4.3.1 Introduction
        4.3.2 Data in the profile
    4.4 QuickInterp compile time
        4.4.1 Introduction
        4.4.2 Instruction selection
        4.4.3 Static evaluation
        4.4.4 Superinstruction set construction
        4.4.5 Superinstruction generation evaluation loop
        4.4.6 Handling superinstruction operands
    4.5 QuickInterp runtime
        4.5.1 Introduction
        4.5.2 Basic runtime superinstruction placement
        4.5.3 Instruction placement using shortest path
    4.6 Equivalent superinstructions
        4.6.1 Superinstruction equivalence
        4.6.2 Discovering data dependencies
        4.6.3 Data Dependency Graph Construction
        4.6.4 Using superinstruction equivalency
    4.7 Conclusion

5 Implementing QuickInterp
    5.1 Introduction
    5.2 Implementation goals and non-goals
    5.3 QuickInterp on OpenJDK Zero
        5.3.1 Why OpenJDK Zero
        5.3.2 OpenJDK Zero class-loading pipeline
        5.3.3 OpenJDK Zero Interpreter
        5.3.4 Code stretching
        5.3.5 Superinstruction placement in Java
        5.3.6 Conclusion
    5.4 Profiling in practice
        5.4.1 Specialized Instructions
        5.4.2 The profile on disk
        5.4.3 Conclusion
    5.5 Constructing the superinstruction set
        5.5.1 Interpreter Generator tool implementation
        5.5.2 Loading the profile
        5.5.3 Optimizing the instruction set
        5.5.4 Conclusion
    5.6 Superinstruction placement
        5.6.1 Overview
        5.6.2 Algorithm implementations
        5.6.3 Conclusion
    5.7 Generating the QuickInterp interpreter
        5.7.1 Instruction primitives
        5.7.2 Instruction code as macros
        5.7.3 Code generation
        5.7.4 Superinstruction class cache
        5.7.5 Loss of the garbage collection and class verifier
        5.7.6 Wrapping up
    5.8 Conclusion
        5.8.1 Revisiting the implementation goals

6 Benchmarks and results
    6.1 Introduction
    6.2 Benchmarking goals and non-goals
    6.3 Benchmark selection
        6.3.1 Small benchmark: JMH primes benchmark
        6.3.2 Large benchmark: Spring pet clinic web app
        6.3.3 Benchmarking environment and parameters
    6.4 Results: JMH Primes Benchmark
        6.4.1 Benchmark and static evaluation scores
        6.4.2 Result interpretation
    6.5 Spring Pet Clinic benchmark
        6.5.1 Benchmark and static evaluation scores
        6.5.2 Result interpretation
        6.5.3 Interpreter size and superinstructions
    6.6 Result interpretation
        6.6.1 Best runtime substitution algorithm
        6.6.2 Static evaluation: not a perfect predictor
    6.7 Conclusion

7 Final thoughts
    7.1 Revisiting the research goals and questions
        7.1.1 Research goals
        7.1.2 Answering research questions
    7.2 Future work
        7.2.1 Production-ready implementation
        7.2.2 Better static evaluation heuristics
        7.2.3 Dynamic rewriting
        7.2.4 Dynamically-loadable superinstructions for OSGi applications
        7.2.5 Value-dependent superinstructions
        7.2.6 Superinstructions as a JIT target

Appendices

A List of bytecode instructions
    A.1 Bytecode instructions with their description
    A.2 Instruction handler flags
    A.3 Instruction pc and stack offsets

Definitions

Bibliography

List of figures

3.1 Diagram of the process described by Ertl et al. [4]

4.1 Diagram of the core architecture
4.2 Diagram of the Instruction Set Generation process as seen in Figure 4.1
4.3 Listing 4.17 represented as a tree
4.4 Listing 4.19 represented as a graph, with the instructions on the edges and the superinstructions of Listing 4.20 added
4.5 Listing 4.24 represented as a graph, with the instructions on the edges and the superinstructions of Listing 4.25 added
4.6 Listing 4.24 represented as a graph, with the instructions on the edges and the superinstructions of Listing 4.26 added
4.7 Graph showing the effect of instructions on the (relative) stack depth
4.8 Step 1: The input program is converted to a graph where all the nodes are blocks
4.9 Step 2: Barrier attributes are added to each block
4.10 Step 3: Must-happen-after edges are added based on barrier attributes, replacing the original edges

5.1 Simplified class diagram of the Interpreter Generator
5.2 Simplified class diagram of the runtime
5.3 Copy of Figure 4.4 with the direction of the edges reversed, showing a bytecode program as a graph with instructions on the edges
5.4 A DOT graph generated by our equivalence-aware runtime substitution algorithm of two equivalent sequences of code

6.1 Entity relation diagram of the Pet Clinic database
6.2 Screenshot of the Pet Clinic web application, showing information about a pet owner
6.3 Primes benchmark with cache enabled
6.4 Primes benchmark without cache enabled
6.5 Primes benchmark static evaluation score
6.6 Primes benchmark score vs static evaluation score
6.7 Number of static evaluations completed within 180 seconds for each algorithm
6.8 Spring Pet Clinic benchmark (average of all requests)
6.9 Spring Pet Clinic benchmark (average of all Open “Landing” page requests)
6.10 Spring Pet Clinic benchmark showing the duration of the individual request types (all using shortest path)
6.11 Primes benchmark static evaluation score
6.12 Spring benchmark average duration vs static evaluation score
6.13 Number of static evaluations completed within 180 seconds for each algorithm
6.14 Spring Pet Clinic benchmark libjvm.so size vs number of superinstructions
6.15 Spring Pet Clinic benchmark score vs libjvm.so size

List of listings

1.1 Basic interpreter loop
2.1 Sequence of bytecode instructions executing x = x · y
2.2 Illegal sequence of bytecode instructions with an unbounded operand stack
2.3 Basic interpreter loop
2.4 Basic interpreter loop with a control table
2.5 Basic interpreter loop in a threaded code interpreter

4.1 Sequence of bytecode instructions with a goto
4.2 An illegal superinstruction candidate derived from Listing 4.1
4.3 Sequence of bytecode instructions as found in a profile where instructions are superinstructionable
4.4 A superinstruction candidate derived from Listing 4.3
4.5 Code fragment with a jump target
4.6 First superinstruction candidate from Listing 4.5 including the instructions before the jump target
4.7 Second superinstruction candidate from Listing 4.5 with only the instructions after the jump target
4.8 Sequence of bytecode instructions computing (x · y · (z + z + y)) mod 2, where mod is the modulo operator
4.9 Another sequence of bytecode instructions computing (3 · (a + b + c)) − 2
4.10 score(p, Jp) static evaluation algorithm
4.11 theoreticalMaximum(P) static evaluation algorithm computing the maximum score
4.12 optimize(P, S) genetic optimization algorithm
4.13 Sequence of bytecode instructions executing x := x · 5
4.14 bipush handler
4.15 Listing 4.13 with a superinstruction
4.16 Example of the triplet table used for table-based substitution with three superinstructions defined
4.17 Example of three superinstruction definitions
4.18 R(I, S) tree-based runtime substitution algorithm
4.19 Short bytecode program (repeat of Listing 4.8)
4.20 A few superinstruction definitions
4.21 Listing 4.19 after superinstruction substitution by the tree-based substitution algorithm
4.22 Listing 4.19 after optimal superinstruction substitution
4.23 An example Java method set(int,int) with a conditional jump. The input score is clamped to at most MAX_SCORE.
4.24 The set(int,int) method from Listing 4.23 shown as bytecode
4.25 A few superinstruction definitions
4.26 A few superinstruction definitions
4.27 Equivalence algorithm in pseudocode
4.28 Code compiled from int x = 4; int y = 5;
4.29 Code compiled from int y = 5; int x = 4;
4.30 Code compiled from int x = 4; int y = 5;
4.31 Code compiled from int y = 5; int x = 4;
4.32 Superinstruction derived from Listing 4.30
4.33 Superinstruction derived from Listing 4.31
4.34 Superinstruction derived from Listing 4.30
4.35 Superinstruction derived from Listing 4.31
4.36 A Java expression
4.37 Bytecode compiled from 4.36
4.38 Another Java expression
4.39 Bytecode compiled from 4.38
4.40 Two Java assignments with expressions
4.41 Bytecode compiled from 4.40
4.42 Two Java assignments with expressions
4.43 Bytecode compiled from 4.42
4.44 Two mixed expressions, equivalent to Listing 4.41 and Listing 4.43
4.45 Abstract interpreter for marking expressions
4.46 Repeat of the bytecode compiled from 4.40

5.1 The iload and fload instruction handlers in OpenJDK Zero
5.2 All const_n handlers are defined by invocations of one macro, which expands to the actual definition of that instruction handler
5.3 Straightforward concatenation of the bipush and iload instruction handlers
5.4 Optimized concatenation of the bipush and iload instruction handlers by coalescing top-of-stack and program counter modifications
5.5 The fload instruction handler in OpenJDK Zero
5.6 Code implementing the ifnull instruction handler (simplified)
5.7 Snippet of an app.profile file
5.8 Invocation of the Interpreter Generator tool
5.9 Example invocation and output of the Interpreter Generator tool
5.10 Simplified example of a profile file, using line numbers instead of bytecode offsets
5.11 Bytecode corresponding to the counters from Listing 5.10
5.12 Short sequence of instructions containing incoming jumps
5.13 Superinstruction candidates derived from Listing 5.12
5.14 Generated output for the regular pop instruction
5.15 Generated output for the superinstruction aload_0-iload
5.16 Original iload instruction handler
5.17 iload instruction primitive macro definition
5.18 Interpreter loop with handlers read from an external file
5.19 Definition of the dispatch macro using a dispatch table
5.20 Definition of the dispatch table using a generated file
5.21 Content of the dispatch table in bytecodeInterpreter.jumptable.hpp
5.22 Content of the generated instruction constants file bytecodes.generated.hpp
5.23 Content of the generated definitions file bytecodes.definitions.hpp

6.1 The prime test benchmark

Chapter 1

Introduction

Long gone are the days when the only way to run software was by running it directly on the hardware. Today a plethora of mature software technologies are available which decouple the application from the target platform by means of a Virtual Machine (VM) for a number of reasons: increased portability, security, or simply to provide a richer runtime platform than what is possible directly in hardware. In this chapter an overview of this technology is presented. In section 1.1 the VM as a concept is introduced, with a special look at the interpreter in section 1.2 – a technique by which a non-native instruction set can be executed on a platform. Another technique for this purpose is discussed in section 1.3: the Just-In-Time compiler. Both have their strengths and weaknesses, which leads to the introduction of superinstructions in section 1.4 – an optimization attempting to mitigate the weaknesses of the interpreter, with some hints towards the state of the research domain of superinstructions. With the foundation laid the research goals are introduced in section 1.5, with the contribution of this work following suit in section 1.6. Finally, an overview is presented of this whole thesis in section 1.7.

1.1 What is a VM

Software is inherently linked to its target platform. After all, all software needs a machine to run on, and as such will have some adaptation for that machine. A “platform” or “machine” in this sense isn’t just the CPU or its instruction set; it encompasses the whole execution environment of the target platform. Software libraries and APIs available on the target machine are just as much part of “the platform”. The software can be executed by the target platform’s processor, and can also use function calls and syscalls to reach library code or kernel functions. This link may seem completely unavoidable, and it may even be the intended way of writing applications for a target platform. However, there are serious drawbacks when multiple distinct platforms need to be supported by a particular software application. Some differences between platforms are easily overcome: a different CPU instruction set, for example, can be dealt with by using a programming language that abstracts over the specifics of the instruction set (e.g. C++) and recompiling the software for the other CPU architecture. However, such abstractions can quickly break down when the differences become too large, and they typically do not offer much protection against differences in software APIs, which may stand in the way of reusing all code across platforms. A cryptography application may for instance choose not to use a rich cryptography library available on one platform, as an independent software implementation needs to be made available for the other platforms supported by the software anyway. And even when the decision is made to use the platform-provided encryption library on the one platform that supports it, for performance or other reasons, the developers of the application still have to support the embedded encryption algorithms for the other platforms. Now it is not only necessary to provide two different distributions of the same software application for different target platforms, it also

becomes necessary to maintain multiple versions of the codebase in the places where the programming language is unable to abstract away the differences. Ideally, the whole world would just use one type of machine with one CPU, one operating system and one set of libraries, and stick with it. Unfortunately, we do not live in that world, but with some software tricks we can close the distance. With software, it is possible to virtualize one standard platform with a Virtual Machine (VM). Applications can then be developed for this standard platform. All target platforms can, by acquiring an implementation of this VM, virtualize this standard platform and run all software available for it. In a sense, this way of running software carries the thin abstraction layer provided by a programming language like C++ into the runtime, and makes it more complete by abstracting the entire execution environment. Such a standard platform can take many forms. For example, in the world of Docker this standard platform is a Linux kernel with a selection of software libraries, and non-Linux platforms have to virtualize a whole kernel to run Docker applications (“containers”). Another, more relevant, example is the Java Virtual Machine (JVM), which will be discussed in detail in section 2.1. Given the focus on the JVM, a full list of all bytecode instructions and their definitions can be found in appendix A.1.

1.2 What is an interpreter

Not all virtual machines are the same, and while Docker and the JVM may both be virtual machines, they are different in various ways. One such way is how they abstract the CPU. The answer in the case of Docker can be summarized with “not”. An application for Docker is still bound to the host platform’s CPU and its instruction set; Docker only standardizes the software environment. This is in contrast to the JVM, which does virtualize the CPU by introducing its own instruction set with its own semantics, independent of the host platform. Such virtualization provides a new challenge for the virtual machine implementation to solve. Somehow the virtual machine needs to take instructions not compatible with the target platform’s CPU and execute them. There are multiple ways to solve this problem, but a very straightforward way is by using an interpreter. An interpreter can be thought of as a software implementation of a CPU. It keeps track of its current state (e.g. frame pointer and program counter), fetches the next instruction from the input program and performs the actions associated with that instruction. It then fetches the next instruction, following the semantics of the virtualized CPU. As such, an interpreter requires software implementations for all the different instructions in the instruction set of the VM. The interpreter, together with the rest of the VM, is written directly for the target platform, allowing it to bridge the gap between the virtualized platform and the actual host platform.

1.2.1 Anatomy of an interpreter

To better illustrate how an interpreter may be implemented, consider Listing 1.1.

void interpret(instruction* code, int* stack) {
    instruction* pc = &code[0];
    while (true) {
        switch (*pc) {
            case NOP:  pc++; continue;
            case STOP: ...;
            case JUMP: ...;
            case PUSH: ...;
            case POP:  pc++; stack--; continue;
            ...
        }
    }
}

Listing 1.1: Basic interpreter loop

This tiny program already shows the basic skeleton of an interpreter. It holds the state in a stack pointer (frame pointer) and a program counter, and iteratively processes the next instruction in an indefinite loop. In this program there are a few instructions: NOP, STOP, JUMP, PUSH and POP, each with code to implement that instruction. For brevity most implementations are omitted, but the NOP instruction with its simple implementation is shown: it simply increments the program counter (pc) and carries on. The POP instruction is also shown modifying the stack pointer; since popping decrements the stack pointer, the stack apparently grows upwards in this snippet. The program itself is represented as just an array of instruction values. The stack for the program is provided to the interpreter as an int pointer. Finally, the program counter (pc) is initialized to the start of the program. In section 2.2 this code fragment will be revisited, adding more complexity to the interpreter. However, even without that extra complexity, Listing 1.1 should shed some light on interpreter basics.

1.3 What is a JIT Compiler

The discussion regarding interpreter design did not mention the word “performance” even once, but as one might suspect, executing an instruction set by simulating a CPU is simply not nearly as fast as executing instructions directly in hardware. Another, more performant method of bridging the gap between the virtual machine and the host machine exists: the Just-In-Time (JIT) compiler. The idea is simple: convert (compile) whole swaths of virtual machine code directly to target machine code. Doing this for the whole application may, depending on the virtual machine instruction set, lead to an excessive amount of code or may simply take too long. As such, the idea is to only compile where it is necessary, and do it Just In Time – just before the execution of said code. This offers many advantages: performance can be brought closer to that of an application running directly on the hardware. Additionally, by bringing the problem of bridging the gap between the virtual machine and the target machine into the domain of compiler construction, all sorts of existing theory and optimizations can be reused. By running at runtime – while the application runs – it is also possible to “cheat” a bit when compiling the target machine code: if at a particular moment the application has never taken the else branch of an if-else, the JIT compiler can cheat by not compiling the else branch. Instead, it makes the else branch trap out of the generated code back to the JIT compiler, which generates new code if that branch does end up being taken, so as not to violate the semantics of the VM. This optimization may bring great performance improvements for the common case (where the else branch is not taken), while still providing a correct execution path for the other case. Generating code under such an extra assumption can also enable further optimizations – for example, once it is assumed that this else branch won’t be taken, other parts of the code fall away or can be simplified. These are optimizations which aren’t even possible without a VM. One obvious downside to a JIT compiler is the time it takes to start up. Code that is not often executed suffers a performance penalty, as the time to JIT compile may overshadow the total runtime that code would have had within the VM when run with just an interpreter. This is why virtual machines often combine an interpreter with a JIT compiler. The interpreter is there to provide a running start, with the JIT compiler able to optimize the “hotspots” – often executed pieces of code – leading to a best-of-both-worlds situation. JIT compilers have another downside: the cost of developing one is significantly higher than that of an interpreter. The code fragment from the previous section (Listing 1.1) is an example of the elegance and simplicity of an interpreter. Furthermore, this interpreter is written entirely in the C programming language, which offers some cross-CPU portability, making it easier to maintain an interpreter for multiple hardware platforms. A JIT compiler, in contrast, faces the same problems as a typical compiler, requiring a CPU-specific backend to generate machine code for that one CPU architecture. As such, porting it to a new architecture is considerably more expensive.
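To make this “cheating” idea concrete, a minimal sketch is shown below; the function and helper names (jitted_method, hot_path, uncommon_trap) are purely illustrative and do not correspond to any actual JIT compiler's API.

/* Sketch: generated code (shown here as C) for
 *   if (x > 0) { hotPath(); } else { coldPath(); }
 * under the assumption that the else branch has never been taken.
 * Instead of compiling the cold path, the generated code "traps" back
 * into the runtime, which can then compile or interpret that path. */
extern void hot_path(void);
extern void uncommon_trap(int bytecode_index);  /* re-enter the VM runtime */

void jitted_method(int x) {
    if (x > 0) {
        hot_path();          /* compiled normally: the common case         */
    } else {
        uncommon_trap(42);   /* rare case: bail out instead of compiling;
                                the index 42 is a made-up example          */
    }
}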

This work focuses on improving the JVM without a JIT compiler by designing and implementing optimizations for the interpreter.

1.4 Superinstructions

One of the reasons why a JIT compiler typically outperforms an interpreter is that an interpreter needs to do a lot of jumps. Looking at Listing 1.1, every instruction that is executed amounts to at least two jumps: one to the instruction handler, and one back to the start of the loop. These jumps are expensive, and especially when an instruction handler is only a few machine-code instructions long, the cost of these jumps can start to dominate the execution time of the VM. The exact mechanism behind this cost is explained later in section 2.2.1, but for now it suffices to understand that lowering the number of jumps in the interpreter can make a direct, positive impact on the performance of the interpreter. One mechanism for doing so is superinstructions [1, 3, 4, 5, 6, 15]. The idea assumes that some sequences of instructions are common. If the individual instructions in such a sequence are very short, it makes sense to adapt the virtual machine instruction set to contain a new “bulk” instruction which performs the work of the whole sequence at once. If such an instruction replaces a sequence of x instructions, it saves 2 · (x − 1) jumps in the interpreter design of Listing 1.1, which can lead to a substantial performance improvement. Such an instruction, the concatenation of multiple regular instructions, is called a superinstruction. However, adapting the virtual machine instruction set to contain common superinstructions is often not practical for an existing virtual machine. Furthermore, the ideal superinstructions may differ from one application to another. As such, the superinstruction architecture inserts superinstructions into the input program while it is being read into the virtual machine, requiring no change to the external instruction set. The ideal superinstructions depend on the kind of application being run. An application applying formulae to a large dataset may see the best performance improvements if each of its formulae were expressed as a superinstruction. A different application dealing with the conversion between two file formats may benefit the most from superinstructions spanning a sequence of LOAD and STORE instructions. As such, the superinstructions in a superinstruction architecture are typically not hand-picked. Instead, good superinstruction candidates are selected by examining the input program well ahead of the runtime. From the results of this examination the superinstructions can be generated automatically: by simply taking multiple switch cases of Listing 1.1 and concatenating them, a new handler can be created implementing all those instructions in sequence. With this technique, code generation is used to create the instruction handlers for the VM. The VM with this generated code can be compiled as normal with a C compiler (or similar), yielding a VM optimized for the application the superinstructions were selected for. This also highlights an immediate downside of the superinstruction workflow: a whole VM needs to be compiled for one particular application. This erodes the advantages a VM provides over running directly on the hardware, as now either the software vendor or the user needs to generate a VM for the application. If this responsibility falls on the software vendor, part of the portability advantage is lost: the vendor now has to provide one binary per supported platform. In the case of Java, however, a VM equipped with a JIT compiler is available for all mainstream platforms.
It is rather the rare and specialized platforms – embedded devices with unusual CPUs or operating systems – that are left without a JIT compiler. These platforms also require more specialized handling: deploying an application on one of these hardware platforms is most likely not a matter of the user downloading something from the internet. These kinds of platforms might be managed remotely, with strict security and careful vetting of the software that gets to run on them. In such an environment, we do not consider it unreasonable that the user of the software goes through the extra step of generating a VM for the application prior to deploying it. Instead of specializing for an application, the superinstruction architecture can also be tuned

towards a more generic set of code. While some patterns of instructions may be highly specific to a particular application, it is expected that some patterns occur across applications. These may be caused by sharing a common compiler, by common programming practices, or by using identical library code. Existing implementations of the superinstruction architecture [1, 3, 4, 5, 6, 15] use rather primitive algorithms to (1) construct the superinstruction set and (2) insert superinstructions into loaded program code. In this work we believe that both can be improved by using more advanced algorithms.
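To make the handler concatenation idea concrete, the sketch below extends the toy interpreter of Listing 1.1 with one generated superinstruction. The PUSH_POP opcode and its numeric value are hypothetical, and the rewritten program is assumed to keep its original instruction layout, so operands (and the now-unused POP opcode slot) remain at their original offsets.

#define NOP      0
#define STOP     1
#define PUSH     3
#define POP      4
#define PUSH_POP 5   /* hypothetical superinstruction opcode */

typedef int instruction;

/* Listing 1.1 with one generated superinstruction handler added. The
 * PUSH_POP handler is the PUSH and POP handler bodies concatenated, so
 * the pair costs a single dispatch instead of two. */
void interpret_with_super(instruction* code, int* stack) {
    instruction* pc = &code[0];
    while (1) {
        switch (*pc) {
            case NOP:  pc++; continue;
            case PUSH: *(stack++) = pc[1]; pc += 2; continue;
            case POP:  stack--; pc += 1; continue;
            case PUSH_POP:                        /* generated superinstruction */
                *(stack++) = pc[1]; pc += 2;      /* PUSH part                  */
                stack--;            pc += 1;      /* POP part                   */
                continue;
            case STOP: return;
        }
    }
}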

1.5 Research goals

1.5.1 Motivation

Much work has been done in the past on reducing the number of dispatches by implementing superinstructions [1, 3, 4, 5, 6]. However, the methods used to find sequences of instructions to concatenate are suspected to leave room for optimization. The algorithms for substituting superinstructions into loaded bytecode are not particularly sophisticated, and place extra requirements on the superinstruction set. For example, the superinstruction set of Gregg et al. [6] requires, for any superinstruction spanning more than two instructions, that its prefix – a superinstruction spanning just one fewer bytecode instruction – also be present in the superinstruction set. This decreases the number of useful superinstructions that can be available, as these prefixes occupy valuable space in the superinstruction set that could otherwise be used for more useful superinstructions. Furthermore, no earlier implementation of superinstructions was able to detect equivalences between superinstructions, further decreasing the number of useful superinstructions that can be available for a given maximum superinstruction set size. We aim to improve and broaden both with graph-based algorithms, changing (1) the way the superinstruction set is generated, and (2) the way existing bytecode is converted to this new instruction set at class load time. The work by Ogata et al. [13] on optimizing an interpreter shows that bytecode fetches are costly, and that optimizing them can bring great performance improvements. The number of bytecode fetches would be further reduced by effectively generating longer superinstructions. It should be noted that Ogata et al. conducted their experiments on a PowerPC CPU, which may behave differently from the modern amd64 (x86_64) CPUs on which we intend to validate our optimizations.

1.5.2 Research question

This leads to the general research question:

RQ How can graph-based superinstruction matching algorithms and graph-based superinstruction set construction algorithms provide a performance speedup over existing algorithms in an interpreting JVM?

This question can be divided further into subquestions:

R1 What are the performance characteristics of a superinstruction VM with graph-based instruction matching algorithms on modern hardware?

R2 How can graph-based instruction matching algorithms be tuned and optimized for particular workloads?

R3 How can graph-based superinstruction set construction algorithms be tuned and optimized for particular workloads?

Answering these questions will also answer another question:

RX How does the superinstruction architecture as implemented in earlier work perform on a modern JVM interpreter implementation?

Question RQ already attempts to re-establish whether the work of Casey et al. [3] and others remains valid on modern hardware when implemented on top of a different system, namely a modern JVM.

1.5.3 Goals and method

The following goals can be enumerated as per the research question above, and should provide results on that question.

G1 Collect, devise and adapt a set of algorithms that, when provided with profiling data of a particular application, generate an optimized superinstruction set.

G2 Collect, devise and adapt a set of algorithms that, when provided with a superinstruction set and application code, can insert appropriate superinstructions so as to maximize performance.

G3 Implement these algorithms, either on top of OpenJDK or by some other means.

G4 Evaluate performance on a set of sample applications and benchmarks. The benchmarks are discussed in section 6.3.

As per goal G3, it is clear that these goals require building an actual implementation.

1.6 Research contributions

This thesis aims to add to the body of research on improving interpreter performance, with a focus on JVM interpreters. JVM interpreters as an optimization target have fallen out of fashion for most mainstream VMs as a result of the emergence and now ubiquity of JIT compilers. However, for a select set of applications, including embedded VMs, JIT compilers remain out of reach. Furthermore, we have repeated the results reached in earlier work and established that the superinstruction optimization still makes sense on modern processor architectures and on modern JVM interpreters.

1.7 Overview

In chapter 2 we discuss the various facets of the JVM, the theory behind why superinstructions offer a speedup, and the basics of interpreter design needed to understand this thesis. In chapter 3 related work is discussed: we cover several other superinstruction implementations and some related optimizations. The design and the implementation of our interpreter – QuickInterp – are split over two chapters. Chapter 4 discusses the design in a way that is independent of the interpreter to which it is applied. Then, in chapter 5, we implement the design on top of OpenJDK Zero 11. With the design and implementation complete, chapter 6 introduces the two benchmarks (the primes benchmark and the Spring pet clinic benchmark) and uses those to test the implementation. Finally, chapter 7 concludes this thesis by revisiting the research questions and discussing possible future work.

Chapter 2

Background

The Java programming language is a language designed to be secure, performant, robust, and platform independent [18]. Section 2.1 looks at the JVM, which is the technology that allows Java to run on multiple platforms without recompilation, and the technology to which we apply the superinstruction optimization. Superinstructions can help speed up performance due to the way modern processors load and execute instructions within an interpreter, which we discuss in section 2.2.

2.1 The JVM

The Java language is designed to support the development of secure, high-performance, highly robust applications on multiple platforms [18]. The language is object-oriented and supported by a garbage collector. We focus on the “multiple platforms” part of the equation, which is the component that can benefit from superinstructions. Running on multiple platforms is enabled by the use of a Virtual Machine (VM): the Java Virtual Machine (JVM). In this section we will discuss various aspects of the JVM that are needed to understand how superinstructions may be implemented in a JVM interpreter.

2.1.1 Introduction

The JVM has an instruction set called bytecode, which we will discuss in section 2.1.2 and section 2.1.3. Java code is compiled by the Java compiler to this instruction set, which makes it similar to a machine language. However, the promise of a secure VM has led to the creation of a typesafe instruction set [18]. In JVM bytecode, each memory location has an associated type, and instructions may only operate on allowed types. Code is statically checked to not violate any of the typing rules prior to being executed, and the bytecode format is written in such a way that it allows this kind of verification, which we will discuss in section 2.1.4. Java, as a language, is strict in checking types at compile time. However, classes are lazily linked at runtime [18]. New classes can be loaded at any time as the application runs, and can even be generated by the application itself, which we will discuss in section 2.1.5.

2.1.2 Bytecode structure

The javac compiler compiles Java programs to .class files, which are the basic unit of code within the JVM. Each .class file contains information about one particular class: the fields of the class (including type information) and all the methods of the class (static methods, virtual methods, abstract methods, constructor methods and the static initializer method), together with optional metadata [7]. Each non-abstract method has a “Code” attribute containing the actual bytecode instructions. The ranges and exception types of exception handlers are stored separately from

the Code attribute, in an Exceptions attribute. More information about the class format and the other attributes can be found in the Java Virtual Machine Specification [7]. The instructions within the Code attribute are called “bytecode” as each instruction has a one-byte opcode. This restricts the number of instructions to only 2^8 = 256, and as of Java 11 there are 202 bytecode instructions [8]. A table of all 202 instructions can be found in Appendix A.1. While each opcode is only one byte, the instructions themselves can be longer than just one byte. Each instruction is followed by zero or more instruction operands. These values serve as additional input to the bytecode instructions. For example, the bipush instruction pushes a one-byte constant onto the operand stack (we will discuss the operand stack in a moment). It receives this constant as an instruction operand. The Code attribute is also called the instruction stream or bytecode stream, and the bipush instruction thus reads one additional value from the bytecode stream. To make the VM more secure, bytecode instructions do not have undefined behavior [18]. For example, the “iaload” instruction, which reads an integer from an array, will throw a java.lang.ArrayIndexOutOfBoundsException or a java.lang.NullPointerException if the index is out of bounds or if the array reference is null, respectively. Furthermore, the bytecode instructions are designed in such a way that they facilitate verification. Jump instructions read their jump offset from the bytecode stream, and do not take their jump offset as a variable. There is one exception: the “ret” instruction. However, this instruction requires that the variable it receives its jump target from is of a special “returnAddress” type, which can only be constructed by the “jsr” instruction and cannot be manipulated [9].
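As a small illustration of this encoding, the sketch below shows the two bytes that make up a single bipush instruction and a handler that consumes them. The opcode value 0x10 is bipush's opcode from the JVM specification; the handler itself is our own simplification and not an actual JVM implementation.

#include <stdint.h>

/* The two-byte encoding of "bipush 42": opcode 0x10 followed by one
 * signed operand byte read from the bytecode stream. */
uint8_t code[] = { 0x10, 42 };

/* Sketch of a handler that reads the operand, pushes it onto the operand
 * stack, and reports how many bytes the instruction occupied. */
int bipush(const uint8_t* pc, int32_t* stack, int* sp) {
    stack[(*sp)++] = (int8_t)pc[1];   /* push the sign-extended operand */
    return 2;                         /* bipush occupies two bytes      */
}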

2.1.3 Stack-based architecture

The JVM is a stack-based virtual machine [10]: it uses an operand stack to communicate values between instructions. Most instructions push values to and pop values from the operand stack, while special “load” and “store” instructions exist to read and write variables in a local variable table. Both the local variable table and the operand stack are part of the current stack frame. Each time a method is invoked, a new frame is allocated for that method with a new local variable table and operand stack, keeping the local variable table and the operand stack separated between method invocations. The frame is destroyed when the method returns. The local variable table consists of slots from which values can be read and to which values can be written. The slots are addressed by an index – the local variable table index. The name of a local variable is converted by the compiler into a local variable table index.

iload_3      // pushes x onto the operand stack
iload 4      // pushes y onto the operand stack
imul         // pops x and y, pushes the product
istore_3     // pops the product

Listing 2.1: Sequence of bytecode instructions executing x = x · y

An example sequence of bytecode instructions can be seen in Listing 2.1. Here it is assumed that the variable “x” is assigned to local variable table index 3 (slot 3), and “y” is assigned to local variable table index 4 (slot 4). Note how there is a dedicated instruction to read from slot 3, the “iload_3” instruction, while reading from slot 4 happens with the general iload instruction followed by a one-byte instruction operand. Such specializations of often-used instruction operands are common in the bytecode instruction set.
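To make the frame layout tangible, a minimal sketch is given below; the struct name, field names and fixed sizes are our own illustration and do not correspond to the actual OpenJDK frame layout.

#include <stdint.h>

/* One frame per method invocation: a local variable table addressed by
 * slot index, and an operand stack that instructions push to and pop from. */
typedef struct frame {
    int32_t locals[8];     /* local variable table (slots 0..7)     */
    int32_t operands[8];   /* operand stack storage                 */
    int     sp;            /* number of values on the operand stack */
} frame;

/* x = x * y with x in slot 3 and y in slot 4, mirroring Listing 2.1. */
static void imul_example(frame* f) {
    f->operands[f->sp++] = f->locals[3];   /* iload_3  */
    f->operands[f->sp++] = f->locals[4];   /* iload 4  */
    int32_t b = f->operands[--f->sp];
    int32_t a = f->operands[--f->sp];
    f->operands[f->sp++] = a * b;          /* imul     */
    f->locals[3] = f->operands[--f->sp];   /* istore_3 */
}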

2.1.4 Verifier, typesafety and abstract interpreters

As mentioned in the introduction, the JVM is typesafe, which is checked in part by static verification. To allow static verification, various constraints exist on what one can do with bytecode, as we will discuss in a moment. These constraints are interesting for superinstruction optimizations, as they

enable powerful analysis on the bytecode which may not be possible in other instruction sets. For example, all branch targets can be statically determined, making it easier to construct a control flow graph. Furthermore, since the JVM is typesafe, type information can always be statically determined for a given location in the code. The act of analyzing the code statically is often done with an abstract interpreter. An abstract interpreter is a bit like a regular interpreter, but instead of operating on values it operates on types. Furthermore, it takes all branches when there is an opportunity to branch. Where a regular interpreter may push the constant “10” onto the operand stack, an abstract interpreter would push “int” onto an operand type stack. The class verifier is a type of abstract interpreter that is used to verify the type correctness of the code prior to loading. Abstract interpreters can also be used to determine the types of values in the local variable table slots and on the operand stack for other purposes, like garbage collection. One of the constraints put on JVM bytecode is a limit on the depth of the operand stack. The “Code” attribute discussed in the previous section stores the largest slot index used by the method, together with the maximum operand stack depth. These values are used to determine the size of a new stack frame when calling a method, and are statically verified. The very fact that a maximum operand stack depth exists has some consequences: it is not possible for an instruction to be reachable in the code with different operand stack depths. For example, the code in Listing 2.2 is illegal.

iload_3      // pushes x onto the operand stack
goto -1      // jump back to the previous instruction

Listing 2.2: Illegal sequence of bytecode instructions with an unbounded operand stack

The code in Listing 2.2 repeatedly pushes “x” onto the operand stack, without ever popping it. In order to stay performant, the JVM does not perform checks against this kind of unbounded behavior at runtime, but will catch this illegal sequence of instructions with the class verifier discussed earlier [18].
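As a sketch of what such an abstract interpretation looks like in code, the function below tracks only the operand stack depth for the toy instruction set of Listing 2.3, rejecting straight-line programs that would underflow the stack or exceed the declared maximum depth. It is a deliberate simplification (branches are not handled) and not the actual JVM verifier.

#include <stdbool.h>

enum { NOP, STOP, JUMP, PUSH, POP };   /* toy opcodes, as in Listing 2.3 */

/* Abstract interpretation over stack depth only: instead of real values,
 * just track how many values would be on the operand stack. Returns false
 * if the program would underflow the stack or exceed max_depth. */
bool verify_stack_depth(const int* code, int length, int max_depth) {
    int depth = 0;
    for (int i = 0; i < length; ) {
        switch (code[i]) {
            case NOP:  i += 1; break;
            case STOP: return true;              /* execution ends here            */
            case PUSH: depth++; i += 2; break;   /* skip the operand               */
            case POP:  depth--; i += 1; break;
            case JUMP: return false;             /* branches omitted in this sketch */
            default:   return false;             /* unknown opcode                 */
        }
        if (depth < 0 || depth > max_depth)
            return false;
    }
    return true;
}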

2.1.5 Dynamic class loading

As discussed in section 2.1.2, the .class file is the basic unit of code within the JVM. Class files are offered to the JVM by means of a class loader. Multiple class loaders can exist at the same time, and each class loader creates its own effective namespace, enabling the loading of two classes with the same name but with different implementations under different class loaders. Furthermore, classes do not have to originate from disk: they can be loaded from a file, from the network, or even be generated by the application itself. Concrete examples of an application generating its own classes can already be found within the Java Standard Library itself. For example, when creating a lambda using the Java 8 lambda syntax, the JVM will generate a special lambda class at first invocation. This highly dynamic behavior poses interesting challenges for a superinstruction architecture: there isn’t a single “load” phase where all classes get loaded before starting the application. Instead, the process of loading classes happens concurrently with the runtime.

2.2 Modern interpreter dispatching

2.2.1 Superscalar execution and branch prediction

The core idea of superinstructions is to concatenate existing instruction handlers to create new instruction handlers that can take on the work of multiple instructions. At first glance, it may appear as if such an optimization is minimal, saving only a few jumps from having to be executed. However, modern CPUs are superscalar: they fetch and decode several instructions at a time [17]. Fetching

the next few instructions requires knowing what the next few instructions are going to be, and jumps without a fixed jump target make this harder. This brings us to the topic of branch prediction – predicting where the code will branch to. Processors may use various indicators to determine where a conditional branch instruction will branch to. If a prediction is wrong, the processor has to repeat work, negatively impacting performance. A predictor may use a table of previous branch results stored in a branch prediction table [17]. The table is typically organized as an associative cache, linking branch history to the memory address of a particular instruction. While such an approach may work well in a regular program, in an interpreter the branch target depends on the next bytecode instruction. If, in the bytecode instruction stream, the instruction sequences “a b” and “a c” are both common, the branch predictor will not be able to reliably infer which instruction comes after “a”. The superinstruction architecture helps to mitigate this problem by creating new “a-b” and “a-c” superinstructions. Since these superinstructions are executed as a whole, the processor is able to work efficiently without branch mispredictions, as there are no dispatch branches to predict within them.

2.2.2 The interpreter

To better illustrate how an interpreter may be implemented, consider Listing 2.3. This listing expands on the earlier Listing 1.1.

#define NOP  0
#define STOP 1
#define JUMP 2
#define PUSH 3
#define POP  4
...

typedef int instruction;

void interpret(instruction* code, int* stack) {
    instruction* pc = &code[0];
    while (true) {
        switch (*pc) {
            case NOP:  pc++; continue;
            case STOP: return;
            case JUMP: pc += pc[1]; continue;
            case PUSH: *(stack++) = pc[1]; pc += 2; continue;
            case POP:  stack--; pc += 1; continue;
            ...
        }
    }
}

Listing 2.3: Basic interpreter loop

In this short example we see five instructions: NOP, STOP, JUMP, PUSH and POP. Furthermore, we see a typedef for the type of instruction – it’s an int. The program is just an array of ints. The stack for the program is provided to the interpreter as an int pointer. Finally, the program counter (pc) is initialized to the start of the program. The program runs in an infinite while loop, switching over the current instruction by simply following where pc points. Within that loop, each instruction has a handler:

NOP This is a no-op instruction, and performs no work besides updating the program counter. The pc is updated by one to move the program to the next instruction.

STOP The STOP instruction shows how the interpreter may terminate. Here the interpret function is left when this instruction is encountered, exiting the interpreter.

18 JUMP This instruction reads a jump offset from the input program – it interprets the next value in the program not as an instruction, but as an instruction operand. This value is interpreted as a relative jump offset, and added to the program counter changing the control flow of the interpreted program.

PUSH This instruction retrieves an instruction operand and pushes it onto the stack. The value following the current instruction is read and written to the top of the stack, after which the stack pointer is incremented. The program counter is incremented by two as a result of the increased size of this instruction.

POP The last instruction pops the top of the stack by simply decreasing the stack pointer by one. It has no instruction operand, so the program counter is incremented by just one.

An actually usable VM may have many more instructions; however, this tiny interpreter provides a nice starting point for evaluating other interpreter designs.
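As a small usage sketch (not taken from the thesis), the program below encodes a handful of instructions for the interpreter of Listing 2.3 and runs it; the instruction sequence itself is made up for illustration and assumes the definitions from that listing.

int main(void) {
    /* push 7, push 3, pop both, stop – operands are interleaved with
     * the opcodes in the instruction stream, as the PUSH handler expects */
    instruction program[] = { PUSH, 7, PUSH, 3, POP, POP, STOP };
    int stack[16];
    interpret(program, stack);
    return 0;
}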

2.2.3 Token-threaded interpreters

The most straightforward and intuitive interpreter design might be the token-threaded interpreter, which is the kind of interpreter shown in Listing 2.3. The token-threaded interpreter directly operates on a stream of tokens (instructions) as its input – here these are stored in the code array. Every token is an instruction, and after dereferencing the program counter a decoding step is needed to actually find the corresponding instruction handler implementing that instruction. In Listing 2.3 this decoding is done by means of a switch statement, but this is not the only way to decode the tokens. Another way is by having a control table available in the interpreter, where every instruction (token) has an index in this table. Such an interpreter is shown in Listing 2.4.

typedef int instruction;

void interpret(instruction* code, int* stack) {
    /* Labels as values (a GCC extension); the table lives inside the
       function so it can reference the handler labels below. */
    static const void* control_table[] = {
        &&label_nop,
        &&label_stop,
        &&label_jump,
        &&label_push,
        &&label_pop
    };
    instruction* pc = &code[0];
    while (true) {
        goto *control_table[*pc];

    label_nop:
        pc++; continue;
    label_stop:
        return;
    label_jump:
        pc += pc[1]; continue;
    label_push:
        *(stack++) = pc[1]; pc += 2; continue;
    label_pop:
        stack--; pc += 1; continue;
    }
}

Listing 2.4: Basic interpreter loop with a control table

Referencing a label (the && operator) is not part of the C language, and requires out-of-spec compiler support (like GCC's labels-as-values extension). A switch statement, like in Listing 2.3, can't just do a simple table lookup. Instead, it has to do a range check on the input value to optionally

19 skip the whole switch statement. The control table approach gives more control over the exact dispatching mechanism and can forgo such a check if the virtualized instruction set allows it.

2.2.4 Threaded-code interpreter

Interpreters are associated with requiring little to no preprocessing of the instructions before executing them, and the ones from Listing 2.3 and Listing 2.4 can even run directly on the input program. However, with a bit of preprocessing it is possible to move the decoding step – where the right instruction handler is found for a given token – from runtime to load time. This is done by walking over the instructions at load time and replacing every instruction with a pointer to its instruction handler. An interpreter implemented like this is called a threaded-code interpreter, and an example of the interpretation loop can be seen in Listing 2.5. Note how the instruction datatype is now defined to be a void* instead of an int.

#include <stdint.h>

typedef void* instruction;

void interpret(instruction* code, int* stack) {
    instruction* pc = &code[0];
    while (true) {
        goto **pc;

    label_nop:
        pc++; continue;
    label_stop:
        return;
    label_jump:
        pc += (intptr_t)pc[1]; continue;               /* operand stored as a plain value */
    label_push:
        *(stack++) = (int)(intptr_t)pc[1]; pc += 2; continue;
    label_pop:
        stack--; pc += 1; continue;
    }
}

Listing 2.5: Basic interpreter loop in a threaded code interpreter

The conversion of the input program to threaded code is not shown, as an implementation depends on the specifics of the target platform. The threaded-code interpreter may not always be straightforward to implement. For example, the running example language we have seen in Listing 2.3 and Listing 2.4 uses an int as the instruction datatype. The exact length of an int is compiler-dependent, but on 32-bit platforms it will typically be 4 bytes. On 32-bit platforms the size of a pointer is also 32 bits, meaning that on such a platform it is possible to convert all instructions to instruction handler pointers by overwriting them in the input program in place. However, on other platforms (like amd64) a pointer may be 8 bytes, meaning that the threaded-code interpreter architecture requires copying all instructions to a new array with sufficient space for the larger datatype. Furthermore, potentially complex analysis of the code needs to be performed to ensure that the relative jumps are updated correctly. Finally, it may not always be possible to walk over all the instructions statically. The target VM may not make a distinction between the stack and the memory where the code lives (the von Neumann architecture). What is at one moment a stack write may at another moment be an instruction, making it impossible to visit all instructions ahead of the runtime. And even when the program memory is distinct and immutable, if the target architecture allows jumping “between” instructions – to the instruction operand of a particular instruction – it becomes much harder to generate the required threaded code for the interpreter. For the JVM, the complexity of walking over a program to inspect and potentially convert every instruction without executing it ties in closely with the abstract interpreter used by the verifier and the garbage collector, as discussed in section 2.1.4.
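Although the conversion step is deliberately not shown here, a minimal sketch of it for the toy instruction set is given below. It assumes the interpreter exports its handler label addresses into a table (here called handler_table), that operands are simply copied through as plain values, and that jump offsets are expressed in array slots so they need no rewriting; this sidesteps the sizing and jump-offset issues discussed above.

#include <stdint.h>

enum { NOP, STOP, JUMP, PUSH, POP };   /* token opcodes, as in Listing 2.3 */

/* Filled by the interpreter with its handler label addresses
 * (&&label_nop, &&label_stop, ...), indexed by opcode. */
extern void* handler_table[];

/* Turn a token program into threaded code: every opcode becomes the address
 * of its handler; operands are copied through as plain values. */
void to_threaded_code(const int* tokens, int length, void** threaded) {
    for (int i = 0; i < length; ) {
        int op = tokens[i];
        threaded[i] = handler_table[op];
        if (op == PUSH || op == JUMP) {                   /* one-operand instructions */
            threaded[i + 1] = (void*)(intptr_t)tokens[i + 1];
            i += 2;
        } else {
            i += 1;
        }
    }
}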

2.3 Conclusion

We discussed how Java and the JVM work. JVM bytecode – while similar – is not the same as machine code. We discussed how there are at most 256 instructions (currently 202 as of Java 11 [8]) due to the use of byte-sized opcodes. JVM bytecode is typesafe and statically verified at load time. The static verification of bytecode is part of the specification, and is made possible by certain restrictions like fixed branch targets and fixed operand stack depths. Having just 202 instructions, combined with an instruction set that lends itself well to static analysis, makes JVM bytecode a suitable candidate for superinstructions. We also looked at interpreter design. The superinstruction optimization works due to the high cost of branch mispredictions within the CPU, which are common within interpreters as the CPU may not be able to predict which instruction will be executed next. We discussed how an interpreter can be implemented, covering token-threaded interpreters (the simplest kind of bytecode interpreter) and threaded-code interpreters. Each interpreter type has its own upsides and downsides, but the number of handlers within the token-threaded interpreter is limited by the size of the token – in the case of bytecode one byte, or 256 handlers. These topics will come back in later chapters as they underpin the superinstruction optimization. In the next chapter we discuss related work, including various implementations of the superinstruction optimization and other optimizations which rely on the branch prediction behavior of the CPU.

Chapter 3

Related work

3.1 Introduction

The superinstruction optimization is not new, and various earlier implementations exist [1, 3, 4, 5, 6, 15]. In this chapter, we discuss the basics of the general superinstruction architecture in section 3.2: how profiling information is used, how the superinstruction set is constructed and how the superinstructions are placed into loaded bytecode. We discuss some of the major implementations from previous work (superoperators [15] in section 3.2.2, vmgen [4] in section 3.2.3, and Tiger [1] in section 3.2.4) and provide an overview of what they have done. In section 3.3, we also discuss two other VM optimizations that address the same branch misprediction problem targeted by superinstructions.

3.2 Superinstructions

In section 1.4 we already covered the basics of the superinstruction optimization: concatenating instruction handlers into superinstructions speeds up the interpreter as it spends less time dispatching instructions and dealing with costly branch mispredictions. Section 2.2.1 discussed how modern processors execute multiple instructions at the same time in a process called superscalar execution. However, superscalar execution requires predicting the next sequence of machine code instructions, and the superinstruction architecture helps facilitate this across bytecode instructions in a JVM interpreter, as multiple bytecode instructions are handled by the same instruction handler.

3.2.1 Workflow Section 1.4 already established that the superinstructions should be generated automatically. However, it did not cover the mechanism by which the superinstructions are generated. Generating the superinstructions involves creating new instruction handlers that end up in the VM interpreter. Figure 3.1 shows an overview of the main workflow as presented by Ertl et al. [4]. This workflow ends up generating an interpreter for a specific workload or application via an Interpreter Generator or vmgen (“VM Generator”) – a mechanism via which an optimized interpreter can be generated based on an external configuration, which is described by Ertl et al. [4] (“vmgen”) and by Casey [1] (“Tiger”). The core of the workflow and tooling developed by Ertl et al. [4] is used by the others [1, 3, 5], which is shown in Figure 3.1. The workflow is split into three phases:

Ahead of Time VM compile time, which happens long before the application is executed or is even known

Class Load Time which occurs before any code within the class is executed, but on the target system where the JVM runs

Runtime which is the phase in which the bytecode is interpreted

Figure 3.1: Diagram of the process described by Ertl et al. [4] (the vmgen workflow, adapted from multiple diagrams from Ertl et al.). Ahead of Time: profiling data feeds the Instruction Set Generator, which produces an instruction set definition; the Interpreter Generator combines this definition with the Base Instruction Set Implementation into generated code and keyhole tables, which C++ compilation turns into the interpreter. Class Load Time: bytecode loaded from class files undergoes Keyhole Conversion into a superinstruction representation. Runtime: execution, producing profiling data.

The interesting changes with respect to a standard (non-superinstruction) JVM are indicated in gray. Key here is the use of a "Base Instruction Set Implementation" which offers reference implementations for the standard bytecode instructions. These are, with automation, concatenated based upon an instruction set definition to generate the appropriate superinstructions. The work of Ertl et al. [4] calls the tables holding the mapping between base instruction sequences and superinstructions Keyhole tables, which are also compiled into the VM to be used at Class Load Time. At Class Load Time, said keyhole tables are used to perform Keyhole conversion, where the input bytecode is processed and sequences of bytecode are converted to superinstructions. These superinstructions can then be executed at runtime. The choice of which instructions to concatenate and include in a superinstruction is based on a profile of an application. The workflow used by Ertl et al. [4] can generate a profiling interpreter automatically, which can generate such a profile when run on an application. This output profile (at the very bottom of the diagram) can be used as input again for a new build of the VM (top of the diagram).

3.2.2 Superoperators Proesting [15] introduces superoperators as an optimization technique for bytecode interpreters. Superoperators are technologically similar to superinstructions, as they are automatically synthesized from smaller operations to avoid costly per-operation overheads. Proesting uses the optimization to reduce the size of ANSI C application binaries: instead of compiling the entire program to machine code, it is possible to reduce the binary size by expressing sequences of code as interpretable bytecode instructions. Note that these bytecode instructions are not Java bytecode instructions – they just refer to an instruction opcode size of one byte. Proesting indicates that such a change incurs a performance penalty: the interpreted C applications run about 8-16 times slower than unoptimized compiled code. However, with superoperators, the performance impact is brought down to a factor of 3-9, yielding a best-case performance improvement with superoperators over no superoperators of 2.6x. Generation happens with a tool called hti (hybrid translator/interpreter). The hti tool compiles C functions into a tiny amount of assembly code for the function prologue in order to stay compatible with the application binary interface. The rest of the function is compiled into bytecode instructions, which is interpreted by an included interpreter written in assembly. Since Proesting uses a bytecode instruction set, their virtual machine supports up to 256 instructions (including any superoperators). 109 base instructions are needed to cover every operation in the interpreter, leaving 147 bytecodes available for generated superoperators. The superoperators are generated using a greedy algorithm which promotes the highest scoring pair of operations to a superoperation. The new superoperation is then treated as its own operation, repeating the process until all 147 superoperators have been allocated. The scoring is based on the desired outcome: when optimizing for performance, the score is based on the number of (expected) executions. In this mode, the superoperator architecture is identical to the superinstruction architecture as described in section 3.2.1, with the exception that substitution of the superoperators into the program can already happen at compile time. The hti tool also supports another mode where it optimizes for disk space. In this mode, the score is based on the number of occurrences of each operation in the to-be-compiled program. This is not a relevant mode in the superinstruction architecture as described in section 3.2.1, as insertion of superinstructions in the program happens at class-load time and not at compile time. As such, slimming down the size of the program with superinstructions has no impact on the on-disk representation of the program.
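To make this greedy pair-promotion scheme concrete, the sketch below repeatedly merges the most frequently occurring adjacent pair of operations into a new operation until the opcode budget is spent. It is a simplified reconstruction, not Proesting's hti implementation; the names (GreedySuperoperators, selectSuperoperators) and the use of plain opcode strings are assumptions for illustration, and the execution-based score is approximated by counting pairs in profiled operation sequences.

import java.util.*;

// Sketch of greedy superoperator selection (an illustration, not Proesting's hti tool).
// Each traced sequence is a list of operation names; the most frequent adjacent pair is
// promoted to a new operation, which may itself be merged again in a later round.
final class GreedySuperoperators {

    static List<List<String>> selectSuperoperators(List<List<String>> tracedSequences, int budget) {
        // Work on a mutable copy so the caller's sequences are left untouched.
        List<List<String>> sequences = new ArrayList<>();
        for (List<String> s : tracedSequences) sequences.add(new ArrayList<>(s));

        List<List<String>> superoperators = new ArrayList<>();
        while (superoperators.size() < budget) {
            Map<List<String>, Integer> pairCount = new HashMap<>();
            for (List<String> seq : sequences) {
                for (int i = 0; i + 1 < seq.size(); i++) {
                    pairCount.merge(List.of(seq.get(i), seq.get(i + 1)), 1, Integer::sum);
                }
            }
            Optional<Map.Entry<List<String>, Integer>> best =
                    pairCount.entrySet().stream().max(Map.Entry.comparingByValue());
            if (best.isEmpty() || best.get().getValue() < 2) break; // nothing left worth merging
            List<String> pair = best.get().getKey();
            String merged = pair.get(0) + "+" + pair.get(1);
            superoperators.add(pair);
            // Treat the new superoperator as a single operation in all sequences, so that it
            // can participate in later (longer) merges.
            for (List<String> seq : sequences) {
                for (int i = 0; i + 1 < seq.size(); i++) {
                    if (seq.get(i).equals(pair.get(0)) && seq.get(i + 1).equals(pair.get(1))) {
                        seq.set(i, merged);
                        seq.remove(i + 1);
                    }
                }
            }
        }
        return superoperators;
    }
}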

3.2.3 vmgen Ertl et al. [4] presents a tool called vmgen. The tool was already discussed in the introduction (section 3.2.1) as it presents a clear example of the superinstruction architecture as also seen in

other work and that we reproduce as part of this thesis. The authors use the tool to generate two interpreters: one for the Gforth Forth language interpreter, and another for the Cacao JVM. The latter is interesting, as it targets a JVM implementation just like we do in this thesis. However, their interpreter is not a JVM bytecode interpreter; instead it interprets the (very similar) Cacao JIT-compiler intermediate representation. They saw a performance increase of 80% using superinstructions in Gforth; however, they report no such improvements when targeting the JVM. The authors explain that this is due to the large size of the generated JVM interpreter, with the JVM interpreter plus superinstructions no longer fitting in the instruction cache of the tested CPUs. The tool takes a VM instruction description file and generates C code from it for each VM instruction. It can generate profiling instructions and it can generate superinstructions based on information present in the VM instruction description file. In the VM instruction description file, each instruction definition is not written in C. Instead, it is written using a special syntax that can be transformed into C. This syntax gives the vmgen tool more information about the contents of an instruction handler, which allows it to very easily generate profiling instructions or generate instructions with other extras (like debugging). Their implementation of profiling instructions is very similar to our technique, which we will discuss in section 4.3. The vmgen tool produces optimized superinstruction handlers – they are not a simple concatenation of the machine code of multiple instruction handlers. One example the authors give is of the “sipush-iadd” superinstruction, where the intermediate value pushed by the “sipush” instruction is not written to the operand stack, but instead provided as a constant to the “iadd” instruction. However, this optimization is made possible not by the vmgen tool on its own. Instead, it relies on GCC's ability to detect, within the superinstruction handler, that writing the value to the top of the stack and reading it immediately after is equivalent to using the value as a constant. The authors show that this does indeed happen – GCC is capable of making that optimization. To determine which superinstructions to create, their approach uses profiling information to determine which instructions to concatenate, generating 2400 superinstructions in their largest run. The algorithm used to create the superinstructions does not optimize which sequences to promote to superinstructions. Instead, all sequences of instructions are turned into a superinstruction, but the maximum length of each superinstruction is limited to just 4 regular instructions. To lower the number of candidate superinstructions even further, the authors discarded superinstructions with fewer than 10,000 executions as per the profile when generating the JVM interpreter. This makes their superinstruction set construction algorithm effectively the same as the algorithm from Proesting [15], where superoperators are also selected based upon the number of executions. The runtime substitution algorithm is called peephole optimization, and uses the keyhole tables from Figure 3.1. It is a straightforward algorithm that repeatedly looks at two consecutive instructions. If the two instructions are in the keyhole table as a pair, the first instruction is replaced with the opcode of the superinstruction that is also in the table.
This process repeats, making it possible to create larger superinstructions. We reimplement the peephole optimization algorithm as part of this thesis in section 4.5.2 as triplet-based substitution, where we discuss this algorithm in more detail.
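A minimal reconstruction of this peephole idea is sketched below: scan the instruction stream and, whenever the current (possibly already substituted) instruction and the next one appear as a pair in the keyhole table, replace the pair with the superinstruction opcode. The table layout and names are assumptions for illustration, not vmgen's actual data structures, and instruction operands and bytecode offsets are ignored for brevity.

import java.util.*;

// Sketch of keyhole-table driven peephole substitution (an illustration, not vmgen's code).
// keyholeTable maps a pair (instruction, next instruction) to the opcode of the
// superinstruction replacing that pair. Because the replacement can itself be the first
// element of another pair, repeated application grows superinstructions step by step.
final class PeepholeSubstitution {

    static List<String> substitute(List<String> code, Map<List<String>, String> keyholeTable) {
        List<String> out = new ArrayList<>(code);
        int i = 0;
        while (i + 1 < out.size()) {
            String replacement = keyholeTable.get(List.of(out.get(i), out.get(i + 1)));
            if (replacement != null) {
                out.set(i, replacement); // the pair collapses into the superinstruction opcode
                out.remove(i + 1);
                // stay at i: the new superinstruction may pair up with the following instruction
            } else {
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<List<String>, String> table = Map.of(
                List.of("iload", "iload"), "iload_iload",
                List.of("iload_iload", "iadd"), "iload_iload_iadd");
        System.out.println(substitute(List.of("iload", "iload", "iadd", "istore"), table));
        // prints [iload_iload_iadd, istore]
    }
}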

3.2.4 Tiger The work from Casey [1] presents another interpreter-generator tool: Tiger. The Tiger interpreter generator specifically targets the JVM, and it supports a large number of optimization techniques. The optimizations are not implemented on an existing VM; rather, they are part of a new VM written by the author called Fastcore. The author splits the optimizations into two categories: static optimizations (which apply to all programs) and dynamic optimizations (specific to one program). Superinstructions fall into the dynamic optimizations category, together with other optimizations that try to assist modern processors in branch prediction. The combination of dynamic optimizations achieves a performance speedup of 2.76x for a broad range of programs. Casey tries several different strategies for selecting superinstructions. Superinstructions are selected both statically, based only on the program code and no runtime information, as well as dynamically by using runtime profiling information. Furthermore, Casey uses heuristics to turn

the number of occurrences (static selection) or number of executions (dynamic selection) into a score that can be optimized. Superinstructions are selected greedily based upon the highest scoring superinstruction candidates found in the program code. The heuristics used by Casey are:

1. Use the number of occurrences or the number of executions as-is

2. Divide the number of occurrences/executions by the length of each superinstruction (prefer short superinstructions over long superinstructions)

3. Multiply the number of occurrences/executions by the length of each superinstruction (prefer longer superinstructions over short superinstructions)

Each of these heuristics was tested both with a static profile and a dynamic profile, and tested with superinstruction set sizes of 8, 16, 32, 64, 128, 256 and 512. Casey reports better results with each increase in superinstruction set size. Furthermore, he also reports better results with a static profile than with a dynamic profile, as well as that preferring shorter superinstructions boosts performance. Casey observes that even slight changes in superinstruction selection can have a big impact on benchmark performance. To place superinstructions, Tiger uses a Deterministic Finite Automaton (DFA). When processing a sequence of bytecode instructions, the DFA moves to a new state taking the opcode as input. Certain states in the DFA are marked with a superinstruction opcode, indicating that, when such a state is reached, that superinstruction can be placed. This allows placing superinstructions without their prefix also being available as a superinstruction, as was required in the vmgen tool created by Ertl et al. [4]. The DFA is functionally very similar to the tree-based parsing technique introduced in this thesis in section 4.5.2. Casey also touches on the subject of greedy superinstruction placement, discussing how greedy superinstruction placement can miss better superinstructions that start later, which is a problem we touch upon in section 4.5.3. Casey implements “optimal” superinstruction placement, but does not discuss the implementation. Casey shows a slight performance improvement using optimal superinstruction placement over greedy superinstruction placement, but argues that this result can be impacted by the superinstruction selection algorithm.
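The sketch below illustrates how such an automaton-style matcher could work: the superinstruction sequences are stored in a trie, and while scanning the code the longest sequence ending in an accepting state is placed. This is a reconstruction for illustration only (Tiger's actual automaton and data structures are not described here), and operands are again ignored.

import java.util.*;

// Sketch of automaton-driven greedy placement (an illustration, not Tiger's implementation).
// The trie plays the role of the DFA: following edges labelled with opcodes, a state marked
// with a superinstruction opcode means that superinstruction can be placed at this point.
final class AutomatonPlacement {

    static final class State {
        final Map<String, State> edges = new HashMap<>();
        String superinstruction; // non-null if this state corresponds to a superinstruction
    }

    static State build(Map<List<String>, String> superinstructions) {
        State root = new State();
        superinstructions.forEach((sequence, opcode) -> {
            State s = root;
            for (String insn : sequence) s = s.edges.computeIfAbsent(insn, k -> new State());
            s.superinstruction = opcode;
        });
        return root;
    }

    static List<String> place(List<String> code, State root) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < code.size()) {
            State s = root;
            int matchedLength = 0;
            String matched = null;
            for (int j = i; j < code.size(); j++) { // follow the automaton as far as possible
                s = s.edges.get(code.get(j));
                if (s == null) break;
                if (s.superinstruction != null) {
                    matchedLength = j - i + 1;
                    matched = s.superinstruction;
                }
            }
            if (matched != null) { out.add(matched); i += matchedLength; }
            else { out.add(code.get(i)); i++; }
        }
        return out;
    }
}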

3.2.5 Conclusion Various superinstruction implementations have been discussed in this section, each of the implementations bringing something new to the table. Table 3.1 provides an overview of the previous sections, with an implementation in each column. The Superinstruction selection row indicates how sequences of instructions were promoted to superinstructions. All works used some form of frequency-based reduction of the number of superinstruction candidates, although Ertl et al. attempted to promote all candidates to a superinstruction. Only when that was not possible did they resort to reducing the number of candidates by removing candidates with fewer than 10,000 executions according to profiling data. The table also contains our results, which we will discuss in detail in the remainder of this thesis (the design is discussed in chapter 4 and the implementation in chapter 5). To summarize, we construct the superinstruction set using static evaluation, which is a heuristic discussed in section 4.4.3. The heuristic differs from the heuristics used by others. Where others use a heuristic to select candidate superinstructions directly, our heuristic can only assign a score to a particular superinstruction set as a whole. To create a superinstruction set while using this heuristic, an iterative optimization algorithm is used which starts with a random superinstruction set and iteratively improves it, optimizing the static evaluation score. To determine this static evaluation score, the runtime substitution algorithm is used, meaning that the superinstruction set is directly optimized for the runtime substitution algorithm. Note that the Results row cannot be easily compared: the different authors used different virtual machines, different hardware and different benchmarks.

Table 3.1: Overview of earlier work

Author: Proesting | Ertl et al. | Casey | Our results
Largest superinstruction set: 147 | 2400 | 512 | 1000
Interpreter: ANSI C interpreter | Cacao VM (C) | FastCore | OpenJDK Zero
Superinstruction selection: Based on frequency | No selection when possible / based on frequency | Based on frequency using heuristics | Iterative optimization using static evaluation
Runtime substitution: Unknown | Peephole (greedy) | DFA-based (greedy), Unknown (optimal) | Triplet-based (greedy), tree-based (greedy), shortest path (optimal), shortest-path with equivalence (optimal)
Results: 2.6x | none (JVM), 1.8x (Gforth) | 2.76x | 1.46x (Primes), 1.36x (Spring)

3.3 Other interpreter optimizations / research

As discussed in section 2.2.1, the superinstruction optimization hinges on the failure of a superscalar processor to predict the next instruction handler in the interpreter loop. Superinstructions help by concatenating handlers together, which makes it possible for a superscalar processor to predict the instructions ahead. There are two other optimizations which also hope to mitigate the same problem: instruction specialization and static replication, which we will discuss in this section. A core observation used by both optimizations is that the processor associates branching history with the location of the conditionally branching machine code instruction, recording if and where the jump went to. This memory is used for branch prediction based on historic results of a conditional branching instruction. However, in an interpreter, this historic data is only accurate if the same sequence of instructions is executed twice. For example, after the instructions “iload iadd” have been executed, the processor will predict that the “iload” handler will jump to the “iadd” handler next if it sees the “iload” instruction again. The iload instruction is a very common instruction that is followed by many different instructions, so this prediction is likely to cause a branch misprediction.

3.3.1 Static replication Static replication is a technique created by Casey et al. [2], and provides a speedup of up to 3.07x, compared to 3.39x with superinstructions in the same paper. Static replication works not by reducing the number of branches, but rather by giving the processor more information to help figure out where to jump to next. The idea is to “replicate” common instruction handlers multiple times within the interpreter binary, and assign a unique opcode to each instruction handler. Then, opcodes are replaced at load time with new opcodes that map to a copy of the same instruction handler. For example, the “iload” instruction handler is replicated two times within the interpreter

binary. Each copy has its own opcode, creating "iload-a" and "iload-b". Each copy does exactly the same thing as the regular iload handler. In the program "iload iadd iload imul iload iadd", the pattern "iload iadd" is identified as common (much like a superinstruction). But instead of replacing these two with a superinstruction, the iload instruction opcode in this pattern is consistently replaced with the alternative iload-a instruction. Furthermore, the iload in "iload imul" is similarly replaced, giving "iload-b imul". This creates the full program "iload-a iadd iload-b imul iload-a iadd". As iload-a is distinct from iload-b in the VM binary, the processor will associate separate branch prediction histories with both instructions. As such, the processor is able to learn that iload-a is followed by iadd, and likewise that iload-b is followed by imul.
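A load-time rewrite along these lines could look like the sketch below. It is an illustration of the idea rather than Casey et al.'s implementation: the replica for a replicated opcode is chosen per following instruction, so that each static context consistently maps to the same copy.

import java.util.*;

// Sketch of static replication at class load time (an illustration, not Casey et al.'s code).
// A replicated opcode is rewritten to one of its copies based on the instruction that
// follows it, so each copy accumulates its own branch-prediction history in the CPU.
final class StaticReplication {

    static List<String> rewrite(List<String> code, Map<String, List<String>> replicas) {
        Map<String, String> replicaForContext = new HashMap<>(); // "opcode->successor" -> replica
        Map<String, Integer> used = new HashMap<>();             // replicas handed out per opcode
        List<String> out = new ArrayList<>(code);
        for (int i = 0; i + 1 < out.size(); i++) {
            String opcode = out.get(i);
            List<String> copies = replicas.get(opcode);
            if (copies == null) continue;
            String context = opcode + "->" + out.get(i + 1);
            String replica = replicaForContext.get(context);
            if (replica == null) {
                int k = used.merge(opcode, 1, Integer::sum) - 1;
                replica = copies.get(Math.min(k, copies.size() - 1)); // reuse the last copy if exhausted
                replicaForContext.put(context, replica);
            }
            out.set(i, replica);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> replicas = Map.of("iload", List.of("iload-a", "iload-b"));
        System.out.println(rewrite(List.of("iload", "iadd", "iload", "imul", "iload", "iadd"), replicas));
        // prints [iload-a, iadd, iload-b, imul, iload-a, iadd]
    }
}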

3.3.2 Instruction specialization Instruction specialization is an entirely different optimization, and not primarily designed to help the processor with branch prediction. However, it does have this as a side-effect, as it adds more instructions to the VM. The technique is implemented in the Tiger tool created by Casey [1]. JVM bytecode has few instructions (202 as of Java 11 [8]). While the JVM could make do with even fewer, certain instructions with common operands have been specialized into their own dedicated instructions. Specializations of common instructions, like iload_1 for the iload instruction, exist to reduce the memory load and help performance. Such specializations also add extra instructions that have their own associated branch prediction information. The idea of instruction specialization is to push this approach further: why not generate significantly more of such specialized instructions. Casey reports that the performance benefit of this optimization is dominated by the reduction in branch prediction misses.

3.4 Conclusion

In this chapter we discussed three of the major existing superinstruction implementations, of which two are for the JVM and one is for ANSI C interpreters. We discussed the general superinstruction architecture using the implementation by Ertl et al. [4] as an example. Profiling data is used in an instruction set generator to generate an instruction set. This instruction set is used by an interpreter generator tool to generate code, which is compiled by a C++ (or other) compiler to create an interpreter binary. When classes are loaded, their bytecode is inspected and sequences of bytecode for which superinstructions exist are substituted with superinstructions. Finally, the bytecode with the new superinstructions is executed by the interpreter binary. While the superinstruction optimization works by reducing constant per-instruction overhead, part of this overhead is caused by branch mispredictions as explained in section 2.2.1. We also discussed two other optimizations which address branch mispredictions in another way. These optimizations create extra instructions that are replications or specializations of existing instructions. These instructions give the processor more opportunity to predict the next instruction, which lowers the number of branch mispredictions.

Chapter 4

Design of QuickInterp

4.1 Introduction

In this chapter the QuickInterp architecture design is introduced. QuickInterp is the name for an implementation of the superinstruction architecture as discussed in section 3.2, with new capabilities to satisfy the research questions. This architecture remains platform-agnostic for the purpose of this chapter – i.e. the architecture is laid out in such a way that it is not dependent on one particular VM implementation. However, it is assumed this platform is a Java Virtual Machine (JVM) with bytecode as its instruction set, with no assumptions on its interpreter design. The designed architecture should be able to satisfy research goals G1 and G2 from section 1.5, and aid in satisfying research goal G3 in the implementation:

G1 Collect, devise and adapt a set of algorithms that, when provided with profiling data of a particular application, generate an optimized superinstruction set.

G2 Collect, devise and adapt a set of algorithms that, when provided with a superinstruction set and application code, can insert appropriate superinstructions so as to maximize performance.

G3 Implement these algorithms, either on top of OpenJDK or by some other means.

G4 Evaluate performance on a set of sample applications and benchmarks. The benchmarks are discussed in section 6.3.

4.1.1 Design goals From these goals a set of design goals (DGs) can be derived for the superinstruction architecture:

DG1 Enable the VM to generate a rich application profile from a running application such that any superinstruction construction algorithm is able to prioritize which sequences of bytecodes make the most suitable superinstructions (requirements for G1, G2)

DG2 Design a superinstruction architecture that is agnostic of the chosen construction (G1) and runtime placement (G2) algorithms to also support implementations of existing algorithms (required for benchmarking, ties into goal G4).

DG3 Design a new, optimized algorithm for the construction of the superinstruction set (G1).

DG4 Design a new, optimized algorithm for the runtime placement of superinstructions (G2).

These goals make the purpose of this chapter clear: a superinstruction architecture needs to be designed ready to be implemented on some VM (DG1, DG2), and this superinstruction

architecture should support two new algorithms. These algorithms, one for the superinstruction set construction and one for the runtime placement of superinstructions, need to be designed as well for DG3 and DG4.

4.1.2 Overview In section 4.2 the workflow from section 3.2 is built upon, with some differences in approach that are more suitable to support different superinstruction algorithms. Then, a deep-dive into profiling is taken in section 4.3. Here it is discussed how a profile from a running application should be obtained, and how the requirements for such a profile dictate what kind of data goes in it. In section 4.4 the steps that have to be taken before compiling the interpreter are expanded upon – here a profile is used to construct a superinstruction set, laying out the requirements for such a construction algorithm. With this superinstruction set, the interpreter source code is generated, ready to be compiled. At runtime, covered in section 4.5, the available superinstructions need to be substituted into all classes that get loaded by a placement algorithm. The runtime placement algorithm from Gregg et al. [6] is discussed, including what dependencies it has on a superinstruction set construction algorithm. A slightly optimized version of this placement algorithm is discussed, and its limitations are outlined. In section 4.5.3 a further look is taken at how these algorithms could be improved, and what the characteristics would be of the "perfect" placement algorithm, including a design of this algorithm. An attempt at optimizing this runtime placement algorithm is made in section 4.6 – the two have an intrinsic dependency on each other and as such cannot be developed independently. Finally, in section 4.7 a second look is taken at the design goals set for the architecture design. The actual implementation of this design will have to wait for chapter 5, as the implementation of this architecture is only discussed there.

4.2 Architecture and workflow overview

In Figure 4.1 an overview of the superinstruction architecture is shown again, which is an evolution of the diagram from section 3.2 where the workflow from Ertl et al. [4] was shown. This time, various processes have been simplified and some detail has been removed – the process of generating the superinstruction set has been simplified to a single Superinstruction Set Generator component. The Superinstruction Converter – which is the runtime component capable of substituting superinstructions into classes as they get loaded – has also been transformed into a black box. Furthermore, the exact steps taken to go from an Instruction Set Definition to a runnable superinstruction-capable interpreter binary have also been hidden in a single procedure. All this hiding however reveals the constants and variables in the architecture to support goal DG2: algorithm-agnosticism towards both the superinstruction set construction algorithm (Superinstruction Set Generator) and the runtime substitution algorithm (Superinstruction Converter). An interesting aspect of this architecture immediately shows up: the superinstruction converter is already known and available to the superinstruction set generator at compile-time. This is no oversight: the QuickInterp architecture uses the superinstruction converter at compile-time to help guide the superinstruction set generator towards the ideal superinstruction set. However, how this cooperation happens and their exact synergy will be discussed later, in section 4.4.5.
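To make the algorithm-agnosticism and the compile-time availability of the converter concrete, the two variable components could be captured by interfaces along the lines of the sketch below. The names and signatures are illustrative assumptions, not the actual QuickInterp API.

import java.util.List;

// Illustrative interfaces for the two exchangeable components (names are assumptions).
interface SuperinstructionConverter {
    /** Rewrites one method's instruction sequence, substituting superinstructions from the set. */
    List<String> convert(List<String> instructions, List<List<String>> superinstructionSet);
}

interface SuperinstructionSetGenerator {
    /**
     * Builds a superinstruction set from profiling data. The converter is handed to the
     * generator so candidate sets can be evaluated with the exact substitution algorithm
     * that will later run at class load time.
     */
    List<List<String>> generate(Profile profile, SuperinstructionConverter converter, int maxSetSize);
}

// Placeholder for the profile described in section 4.3 (counters per bytecode location).
final class Profile { }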

4.2.1 From profile to runtime Before continuing into more detail regarding the architecture, please consider the workflow for creating an optimized VM for a particular application first. This workflow reveals various aspects (like the details of profiling) that are absent from the architectural overview diagram as shown in the previous section in Figure 4.1.

Figure 4.1: Diagram of the core architecture (the QuickInterp abstract workflow). Ahead of Time: profiling data feeds Instruction Set Generation (the Superinstruction Set Generator), which produces an Instruction Set Definition used, together with the Superinstruction Converter, for Interpreter Generation and Compilation. Class Load Time: bytecode loaded from class files undergoes Class Conversion into a superinstruction representation for the superinstruction-capable interpreter. Runtime: Execution.

1. Application and workflow selection At the very start of attempting to create an optimized VM, one must obtain the application that is to be optimized for, including any and all loadable modules (e.g. OSGI modules or other runtime-loadable libraries). This may seem like an obvious step, but since the VM optimizations will be very closely tied to the application and its code it is vital that all or as much of the code as possible is gathered prior to profiling. It is also important to prepare a realistic profiling run: if the application is a web application, ensure that organic HTTP requests are made by real browsers if possible. It may be tempting to send web requests via a synthetic load generation tool, but that may not trigger the same code paths during profiling as in production, leading to a bad profile.

2. Profiling The profiling step will end up producing a profile of a running application. This profile will contain detailed information about the executed code, allowing later steps to fully examine how execution moved through the code while it was run. The VM must be put into profiling mode by some means (this is an implementation detail discussed in section 5.4) and the application is run on this VM. After interacting with the application for a satisfactory amount of time the application can be closed. The profiling VM will generate a profile file or files (discussed in section 5.4.2).

Paying close attention to Figure 4.1 will reveal another difference: the profile is not shown in that diagram as a product of the Execution process. From Ertl et al. [4] it wasn't quite clear how and where in the workflow the application profile is created, but in the case of QuickInterp there is no reason to expect profiling capabilities from the same runtime image as the superinstruction VM – i.e. using superinstructions is allowed to prevent (effective) profiling. That removes the profile as a product coming out of the Execution process as shown in Figure 4.1, but shouldn't stop the VM from allowing profiling at all.

3. Instruction Set Generation With a profile generated, one must supply this profile to the Superinstruction Set Generation process, together with additional configuration selecting which implementation to use for the compile-time superinstruction construction algorithm and which runtime placement algorithm. Further implementation-level options may be specified, like the maximum size of the superinstruction set or the maximum length of the largest superinstruction. This process ends with an optimized superinstruction set for the next step to use.

4. Building the VM From the optimized superinstruction set it is possible to generate the interpreter. The interpreter handlers are generated, definitions for use within the interpreter are generated and other metadata is written to file to enable the interpreter to use the new superinstructions. Once the interpreter is generated, the VM can be compiled via a makefile or other means. After this process is finished, a binary with the VM is produced which is capable of using superinstructions.

5. Running the VM The VM is run like a normal VM – it is fully compliant and capable of loading any bytecode. At class-load time, the selected runtime superinstruction placement algorithm from step 3 is applied to the loaded classes. The interpreter has the right instruction handlers registered for these superinstructions, ideally enjoying a gain in performance due to the reduction in jumps.

4.3 QuickInterp application profile 4.3.1 Introduction The role of the application profile has been discussed in the previous section (section 4.2) and has been seen in detail in earlier work in section 3.2. The profile for QuickInterp is important, as it doesn't just reveal often-executed sequences of bytecode, it also enables evaluating a particular superinstruction set statically. Given a particular superinstruction set and a profile, the profile can help make an educated guess regarding the expected performance of this superinstruction set by examining the control flow as it is serialized in the profile. The ability to do a static evaluation of the performance of a particular superinstruction set is important in determining the optimal superinstruction set (design goal DG1) and is one of the key design elements of QuickInterp. The design of the superinstruction set construction algorithm will use this static evaluation capability in a later section (section 4.4.3).

4.3.2 Data in the profile Efficient profiling Designing the profile to hold the right data to enable static evaluation appears simple. The most trivial approach is simply to write every executed opcode to disk with one file per JVM thread. This would hold all information regarding how control moved through the program in a way that is directly usable to a superinstruction construction algorithm by finding often-occurring patterns in

these files. Furthermore, these files would also enable the static evaluation of a particular runtime superinstruction placement algorithm by letting it perform instruction placement on these files and examining the reduction in total instructions. However, early testing has revealed that the number of executed instructions is prohibitive: it very quickly runs into the tens or hundreds of billions of instructions for even small and short-lived programs, resulting in a profile size ranging from tens to hundreds of gigabytes. Not only would performing efficient static evaluation on such a large profile require extensive preprocessing to a more efficient intermediate format to prevent the runtime superinstruction placement algorithm from having to ingest the whole profile, hardware limitations would also put an upper bound on the maximum program execution time that can be profiled. As such, a smarter way of profiling is required. Applications themselves are not tens to hundreds of gigabytes: Java applications rarely exceed 100 megabytes. As such, the size of the code itself is not the problem, it's just the frequency with which code is executed that prevents us from writing every opcode to disk. This already hints at the solution we ended up implementing: instead of writing all the opcodes to disk, count how often sequences of instructions are executed with special counters and write the values of those counters to disk when the program exits. However, now a new problem emerges: what to count, and where to place these counters?

Superinstruction limitations One important observation is that there are some limitations to what can be included in a superinstruction. A superinstruction can rarely encompass an entire method – it is not a JIT compiler-with-extra-steps. For example, unconditional jump instructions (like a goto) cannot be included in a superinstruction. This is because the target of a goto instruction is the instruction's operand, and this value is not part of the instruction opcode itself and as such not known when concatenating a set of instructions to make a superinstruction. Converting a sequence of bytecode instructions into a superinstruction requires ignoring all the operands, as the operands themselves are the data the instruction handlers read from the bytecode stream. Rather, these operands become the operands of this new superinstruction. For example, the sequence iload-istore might be a superinstruction, but the sequence iload 10-istore 15 is not. These operands – the 10 and 15 in this case – are not lost. When the runtime superinstruction placement algorithm chooses to replace the iload 10-istore 15 sequence of instructions with this superinstruction, it preserves the operands. The placed iload-istore superinstruction has two operands: 10 and 15. The sequence becomes iload-istore 10 15 in the bytecode stream. In the case of the iload-istore superinstruction this role of the operands may seem inconsequential, and it is for most instructions. However, some instructions modify control flow, like the goto instruction. Even if such a jump instruction jumps to another location within the same piece of bytecode, the jump cannot be included in the superinstruction.

1 jump target
2 iload_1
3 iload 10
4 iadd
5 istore 10
6 goto line 1    ①
7 iload 10       ②
8 ireturn

Listing 4.1: Sequence of bytecode instructions with a goto

1 iload_1
2 iload
3 iadd
4 istore
5 goto     ③
6 iload    ④
7 ireturn

Listing 4.2: An illegal superinstruction candidate derived from Listing 4.1

In Listing 4.1 a piece of bytecode is shown that cannot be turned into a functional superinstruction as a whole because of such a goto. Listing 4.2 shows what a superinstruction would look like if Listing 4.1 were turned into one using the rules as just discussed: remove all the operands.

In Listing 4.1, the goto at ① jumps back within the same fragment to line 1. goto instructions are unconditional jump instructions, meaning that execution would never reach ② from just the goto. However, we assume that ② is reached by some other means – another jump not shown. This could for example be an exception handler or simply a jump earlier in the program. The goto is also present in the superinstruction in Listing 4.2 at ③. Now the code following this goto at ③ is the problem: due to the goto, execution will not continue within the superinstruction. Instead, the goto will require that the interpreter dispatch to some other instruction. This instruction is decided by the operand of the goto, which is not part of the superinstruction. As such, it isn't known at superinstruction set construction time, and this goto will force leaving the superinstruction control flow. Furthermore, the instructions at ④ – which are still part of the superinstruction – are completely unreachable. The interpreter cannot jump somewhere halfway into a superinstruction, meaning there is no value in including them in the superinstruction. Techniques exist which can help make it possible to contain the jump within the superinstruction, like instruction specialization where an instruction opcode with common operands is turned into its own dedicated opcode (this optimization is discussed in section 3.3.2). Such techniques could potentially be very interesting, as whole loops with their branches in the input bytecode could be completely compiled to machine code. However, these techniques are not considered in the QuickInterp architecture. For the dead code in the superinstruction the solution is simpler: cut the superinstruction in two between ③ and ④, starting a new superinstruction after the goto. This is the required treatment for all unconditional jump instructions. Not just goto is classified as an unconditional jump – the jsr and ret instructions also perform an unconditional jump (in the case of ret even with an unknown jump target). All instructions which cause the termination of the method (return, ireturn, lreturn, areturn, dreturn and freturn) have the same classification, although for the purpose of profiling they are easier to deal with – control leaves the method instead of carrying on somewhere else. Finally, two surprising instructions can be considered “unconditional jumps” – the tableswitch and lookupswitch instructions. These are used to implement Java switch statements (Java 14 switch expressions are compiled differently). Both of these instructions take a number of jump targets as their operands, representing the case statements from the original Java code plus a default branch. This default branch is always present in the bytecode, even if the switch statement was compiled without it. Because of the presence of this default jump, the tableswitch and lookupswitch end up always branching to somewhere. This also makes them unconditional jumps for the purpose of superinstructions. Instructions which invoke a method (the invoke family of bytecode instructions, e.g. invokevirtual) or could otherwise cause the execution of other JVM bytecode (e.g. putstatic, which could trigger class initialization) may also not be eligible for joining a superinstruction. This is because the execution of other JVM bytecode would require storing the current state of the interpreter onto the JVM call stack and starting to execute the other method.
If this happens in the middle of a superinstruction, the interpreter would somehow have to save that it’s in the middle of such a superinstruction to be able to carry on with executing the superinstruction when the invoke returns. These problems are not impossible to overcome, but for the design of the QuickInterp profile such invoke statements are considered to be impossible to transform to a superinstruction.

Runtime counters With the knowledge that invoke instructions, putstatic, getstatic and unconditional jump instructions like goto, jsr, ret and so on cannot be turned into superinstructions (implementation details may add more un-“superinstructionable” instructions like monitorenter – see appendix A.2), one might reason that in order to serialize all the information regarding the control flow into a profile, all that is needed is to save the number of executions of every “block” of consecutive, superinstructionable instructions that appears. By carefully inserting execution counters at the right places in the bytecode, one would have all the data in the profile. This reasoning however is not correct, for two reasons:

1. This requires a new execution counter at the target of a jump location, losing information about the original location from which that jump originated.

2. It is possible to embed a conditional jump inside a superinstruction, which requires access to how often a conditional jump actually ended up jumping: important information for superinstruction set construction.

A conditional jump is, well, conditional. And when that condition is not met, the jump will not happen and execution will continue with the next instruction. In case a conditional jump is part of a superinstruction, this would leave control within the superinstruction. This is why it is possible to embed a jump within a superinstruction. As such, the QuickInterp profile needs two different kinds of counters:

Local Execution Counters These are counters which, for a given bytecode offset (location in the method), count how often that location was executed.

Branch Counters These are counters which count how often a conditional jump instruction ends up jumping. They are stored at the location of the jump instruction and not at its target, enabling full static tracing of the control flow.

A method is started with a Local Execution Counter counting how many times the method was entered. Instructions receive either a local execution counter or a branch counter based on their type. All conditional jumps are taken care of with branch counters, which are also the only instructions having to be tracked with branch counters. The rest of the instructions are profiled with carefully placed local execution counters, with the exact placement depending on their type.

invoke instructions, putstatic and getstatic These instructions in a way behave like conditional jumps towards an exception handler. However, since an invoke cannot be part of a superinstruction in QuickInterp, there is no use in detecting how often an invoke jumps towards an exception handler. As such it is safe to treat them like unconditional jumps: exception handlers receive a local execution counter to count their executions, and the instructions themselves are immediately succeeded by another local execution counter.

Exception-generating instructions The same goes for instructions which can throw VM-internal exceptions. A monitorenter may for instance throw an IllegalMonitorStateException or a NullPointerException; likewise a new instruction may throw an OutOfMemoryError. These act very similarly to invoke instructions, and as such are profiled in the same way.

Unconditional jump instructions and return instructions For the unconditional jump instructions (goto, jsr, ret, athrow, tableswitch and lookupswitch) it does not make sense to count how often their branches are taken, as they are always taken. These unconditional jumps, together with return instructions (return, ireturn, lreturn, areturn, dreturn and freturn), have an “implicit” local execution counter immediately following them, always counting 0 executions. It is known that execution will never continue after these instructions, so this information need not be serialized in a profile. However, for unconditional jump instructions which keep execution within the method, a local execution counter placed at the target location of the unconditional jump instruction (e.g. the exception handler, subroutine start or every branch of a lookupswitch) needs to be available in the profile.
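The placement rules above can be summarized by a small classification, sketched below for illustration (the opcode sets are abbreviated and the names are not the QuickInterp implementation): conditional jumps get a branch counter, instructions that may transfer control to other bytecode are followed by a local execution counter, and unconditional jumps and returns need no counter of their own (only their targets do).

import java.util.Set;

// Sketch of the counter-placement rules (an illustration; opcode lists are not exhaustive).
enum CounterKind { BRANCH_COUNTER, LOCAL_COUNTER_AFTER, NONE_IMPLICIT_ZERO, NONE }

final class ProfilingPlacement {
    private static final Set<String> CONDITIONAL_JUMPS = Set.of(
            "ifeq", "ifne", "iflt", "ifge", "ifgt", "ifle",
            "if_icmpeq", "if_icmpne", "if_icmplt", "if_icmpge", "if_icmpgt", "if_icmple",
            "if_acmpeq", "if_acmpne", "ifnull", "ifnonnull");
    private static final Set<String> UNCONDITIONAL_OR_RETURN = Set.of(
            "goto", "jsr", "ret", "athrow", "tableswitch", "lookupswitch",
            "return", "ireturn", "lreturn", "freturn", "dreturn", "areturn");
    private static final Set<String> MAY_EXECUTE_OTHER_BYTECODE = Set.of(
            "invokevirtual", "invokestatic", "invokespecial", "invokeinterface", "invokedynamic",
            "putstatic", "getstatic", "new", "monitorenter");

    // Jump targets and exception handler entries additionally receive a local execution
    // counter of their own; that placement depends on the method's control flow rather
    // than on the opcode alone, and is therefore not shown here.
    static CounterKind counterFor(String opcode) {
        if (CONDITIONAL_JUMPS.contains(opcode)) return CounterKind.BRANCH_COUNTER;
        if (UNCONDITIONAL_OR_RETURN.contains(opcode)) return CounterKind.NONE_IMPLICIT_ZERO;
        if (MAY_EXECUTE_OTHER_BYTECODE.contains(opcode)) return CounterKind.LOCAL_COUNTER_AFTER;
        return CounterKind.NONE;
    }
}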

The profile now contains counters for all classes loaded during the profiling run. When the VM shuts down (or at some other implementation-dependent moment) the measured counters can be written to a profiling file together with the location at which they were measured in the bytecode. This “sparse” profile plus the original bytecode allows the construction of the full control flow graph with correct execution counts. This bytecode must also be provided in some implementation-dependent way. The implementation details of the profile are discussed in section 5.4.

4.4 QuickInterp compile time 4.4.1 Introduction The QuickInterp superinstruction architecture, as seen in section 4.2, takes an agnostic approach to the superinstruction set construction algorithm and the runtime substitution algorithm (design goal DG3). On this agnostic architecture, various superinstruction set construction algorithms can be implemented. The main goal of a superinstruction set construction algorithm is to deliver a candidate superinstruction set. This candidate can then be evaluated by the runtime substitution algorithm on the profiled bytecode. Variations in the superinstruction set can be made, which in turn can be evaluated against the profile, hopefully approaching the ideal superinstruction set.

4.4.2 Instruction selection Superinstruction candidates from the profile The very first step that needs to be taken by a superinstruction set construction algorithm is the construction of a set of superinstruction candidates. We want to have a set of all superinstruction candidates from which the superinstruction set construction algorithm can make a selection. In the previous section (section 4.3.2) it was already discussed that not all instructions can be transformed into a superinstruction. This limitation cuts the bytecode of a method into “blocks” of instructions where each block of consecutive instructions could be turned into a superinstruction. We call these superinstruction candidates the base superinstruction candidates. This distinction is important as the base superinstruction candidates serve a dual role: they’re not just used as superinstruction candidates, but they’re also used to test the efficacy of a particular superinstruction set. However, how base superinstruction candidates are used for evaluating a superinstruction set is discussed in the next section (section 4.4.3). First we discuss how the (base) superinstruction candidates are made from the profiled code.

1 iload_1
2 iload 10
3 iadd
4 istore 10

Listing 4.3: Sequence of bytecode instructions as found in a profile where the instructions are superinstructionable

1 iload_1
2 iload
3 iadd
4 istore

Listing 4.4: A superinstruction candidate derived from Listing 4.3

In Listing 4.3 a block of consecutive instructions from the profile is shown. To convert this into a base candidate superinstruction it is enough to remove the instruction operands, which is shown in Listing 4.4. Instruction operands cannot be part of a superinstruction as a superinstruction is simply a concatenation of instruction handlers. It may seem that these blocks would all make suitable superinstruction candidates, but jumps within the bytecode complicate things. Conditional jump instructions may leave such a block before having reached the end, and jump instructions from other places in the method may enter the block halfway through. Since superinstructions cannot be entered halfway through (this would require dispatching within the superinstruction), this last case requires its own candidate superinstruction. That is, the code following the jump target would have to be treated as its own superinstruction candidate.

1 iload_1
2 jump target
3 iconst_3
4 irem
5 dup
6 dup
7 istore_1
8 iflt line 2

Listing 4.5: Code fragment with a jump target

From this block of superinstructionable instructions, at least two candidates would have to be generated in order to cover the two cases: one where this block is entered from the top (candidate shown in Listing 4.6), and one where it is entered via the jump (candidate shown in Listing 4.7).

1 iload_1
2 iconst_3
3 irem
4 dup
5 dup
6 istore_1
7 iflt

Listing 4.6: First superinstruction candidate from Listing 4.5 including the instructions before the jump target

1 iconst_3
2 irem
3 dup
4 dup
5 istore_1
6 iflt

Listing 4.7: Second superinstruction candidate from Listing 4.5 with only the instructions after the jump target

Deriving multiple superinstruction candidates from a single block of superinstructionable instructions may appear to be caused by the presence of the iflt at line 8 in Listing 4.5, but this iflt is only added for illustrative purposes, so that the jump target at line 2 makes sense. The jump target at line 2 can be thought of as a "virtual" instruction: it is not present in the bytecode at that location, but it has its own special semantics for deriving candidate superinstructions from the code block. There are other reasons why a jump target may be present, e.g. because it is the start of an exception handler, because it is one of the branches of a switch statement (especially common when fall-through case blocks are used) or because it is the start of a Java subroutine. The number of candidate superinstructions derived from a single block of superinstructionable code is the number of incoming jumps plus one (the entry to the block could be considered as one "jump").
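For illustration, generating the base candidates of one block could be sketched as follows (a simplification under assumed names; operands are assumed to be stripped already): one candidate per entry point, where the entry points are the top of the block plus every incoming jump target.

import java.util.*;

// Sketch of base-candidate generation for one superinstructionable block (an illustration).
final class BaseCandidates {

    /** opcodes: instructions of the block, operands already removed;
     *  jumpTargets: indices within the block that some jump can land on. */
    static List<List<String>> candidates(List<String> opcodes, Set<Integer> jumpTargets) {
        SortedSet<Integer> entryPoints = new TreeSet<>(jumpTargets);
        entryPoints.add(0); // entering the block from the top counts as one "jump"
        List<List<String>> result = new ArrayList<>();
        for (int entry : entryPoints) {
            result.add(List.copyOf(opcodes.subList(entry, opcodes.size())));
        }
        return result;
    }

    public static void main(String[] args) {
        // Listing 4.5 without operands: the jump target sits before iconst_3 (index 1).
        List<String> block = List.of("iload_1", "iconst_3", "irem", "dup", "dup", "istore_1", "iflt");
        System.out.println(candidates(block, Set.of(1)));
        // prints the candidates of Listing 4.6 and Listing 4.7
    }
}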

Base superinstruction candidates The base superinstruction candidates seen previously that are directly derived from the profile are special. Their classification is important as they are also used to perform static evaluation, which will be elaborated on in the next section. Additional profiling information, like execution counters and jump counters, is retained in these base superinstruction candidates as it is needed for static evaluation. This profiling information is standalone: it contains jump counters, local execution counters and an additional counter counting the number of times the base superinstruction candidate was entered as per the profile. Take for example Listing 4.5 again, and assume it was entered 300 times from the top. Let's also assume that the jump target at line 2 was entered 100 times. This would associate an entry counter of 300 with the candidate superinstruction in Listing 4.6 and an entry counter of 100 with the candidate superinstruction in Listing 4.7. If the iflt has a jump counter of 40 (it jumped 40 times), the base superinstruction candidates divide this value up based on their entry counters: 40 · 300/(300 + 100) for Listing 4.6 and 40 · 100/(300 + 100) for Listing 4.7. Another way to think about this is that the probability that the jump is performed

is computed based on the total number of executions (300 + 100) and multiplied in each base superinstruction candidate by the number of executions of that candidate. When constructing a superinstruction from the (base) superinstruction candidate, the profiling information from any conditional jump within the superinstruction is discarded. This means this information is not available at runtime, as no improvement is possible by making this information available to the runtime substitution algorithm – there is no additional cost in substituting in a superinstruction that rarely gets executed entirely but was compiled into the VM. Future improvements are possible by making richer profiling information available to the runtime superinstruction placement algorithm, which could be used for speculative optimization. However for a runtime substitution algorithm to make effective use of a profile for speculative optimization, considerable improvements are needed to the superinstruction architecture as designed here. This is discussed as future work in section 7.2.3.
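As a small illustration of this proportional attribution (a hypothetical helper, not QuickInterp code): with the numbers from the example above, the jump counter of 40 is split into 30 for the candidate of Listing 4.6 and 10 for the candidate of Listing 4.7.

// Sketch of proportionally attributing a shared jump counter to base candidates.
final class JumpCounterAttribution {
    static double share(long jumpCount, long candidateEntries, long totalEntries) {
        return jumpCount * ((double) candidateEntries / totalEntries);
    }

    public static void main(String[] args) {
        System.out.println(share(40, 300, 400)); // Listing 4.6 candidate: 30.0
        System.out.println(share(40, 100, 400)); // Listing 4.7 candidate: 10.0
    }
}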

Other superinstruction candidates Further processing may be done to increase or change the number of superinstruction candidates. For example, common subsequences may be found within the candidate superinstructions to create more candidate superinstructions. Another example is the generation of prefix superinstruction candidates: if a, b, c, d is a base superinstruction candidate, generating the prefixes would add a, b and a, b, c to the list of candidate superinstructions. Deriving superinstruction candidates is critical: consider the two blocks of JVM bytecode shown in Listing 4.8 and Listing 4.9.

1 iload 10      load x
2 iload 11      load y
3 imul
4 iload 12      load z
5 iload 12      load z
6 iload 11      load y
7 iadd
8 iadd
9 imul
10 iconst_2     load literal 2
11 irem

Listing 4.8: Sequence of bytecode instructions computing (x · y · (z + z + y)) mod 2, where mod is the modulo operator

1 iconst_3      load 3
2 iload 4       load a
3 iload 7       load b
4 iload 8       load c
5 iadd
6 iadd
7 imul
8 iconst_2      load literal 2
9 isub

Listing 4.9: Another sequence of bytecode instructions computing (3 · (a + b + c)) − 2

With a good profile covering all code paths that the application would cover at runtime, such preprocessing of the superinstruction candidates may seem unnecessary – the number of instruction dispatches saved increases with the size of the superinstruction, so pursuing the longest possible superinstructions would yield the most performance improvement. However, the size of the superinstruction set is finite, with its size determined by some input parameter for the superinstruction set construction algorithm. As such, it may not be possible to convert all base superinstruction candidates to a superinstruction. In Listing 4.8 and Listing 4.9 such an example is shown – assume both bytecode blocks are executed N times. Assume that the superinstruction set construction algorithm can only allocate one superinstruction to cover both cases. Picking all of Listing 4.8 for the superinstruction would yield a speedup score(L4.8), where score(Lx) is measured in the number of saved instruction dispatches in Listing x. score(L4.8) is computed by taking the number of instructions in Listing 4.8 minus one, as one dispatch is still needed for entering the candidate. This number is then multiplied by the number of executions as per the profile, yielding score(L4.8) = N · (11 − 1) = 10N. Likewise, the speedup score(L4.9) = N · (9 − 1) = 8N. Limiting the superinstruction candidates to only the base superinstruction candidates would limit the maximum speedup to score(L4.8, L4.9) = max(8N, 10N) = 10N by taking Listing 4.8. However,

when examining the expressions or even the bytecode in the two listings, it is revealed how much they share – the whole sequence from line 4 to line 10 in Listing 4.8 is also present in Listing 4.9 from line 2 to line 8 (inclusive). If these seven instructions were turned into a superinstruction candidate c, its score would be score(c) = 2N · (7 − 1) = 12N. This is an improvement over the previous best of 10N, highlighting the importance of trying subsequences of the base superinstruction candidates. This way of computing score(c) – simply multiplying its base score of 7 − 1 = 6 by the number of possible executions 2N as per the profile – only works when it is the only superinstruction candidate and when the chosen runtime substitution algorithm is capable of making these substitutions (not all of them are, as we will see). If Listing 4.8 was already available as a superinstruction on its own, it would be preferred over the substitution of superinstruction c, lowering the score of c to score(c) = 6N. Superinstructions can have very complex exclusion relations with each other, which makes the superinstruction placement NP-hard [15]. The topic of how the superinstruction set is optimized will be discussed in section 4.4.5. The specific way of generating new superinstruction candidates from the base superinstruction candidates depends on the exact superinstruction set construction algorithm and the runtime substitution algorithm, as not all runtime substitution algorithms are capable of making the same substitutions. For example, particular naïve algorithms may fail to substitute any superinstruction whose prefix is not also a superinstruction in the superinstruction set (this is the case for the algorithm by Gregg et al. [6]), which would require preprocessing to make sure all the prefixes are available as superinstruction candidates. These algorithms are discussed in conjunction with their runtime substitution algorithms due to their close relation.
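For illustration, expanding the candidate set with prefixes and scoring a candidate by saved dispatches could be sketched as follows (assumed names, simplified; real candidate expansion may also extract arbitrary common subsequences):

import java.util.*;

// Sketch of candidate expansion and scoring (an illustration, simplified).
final class CandidateExpansion {

    /** Adds every proper prefix of length >= 2 of each base candidate, for substitution
     *  algorithms that can only place a superinstruction when its prefixes also exist. */
    static Set<List<String>> withPrefixes(Set<List<String>> baseCandidates) {
        Set<List<String>> all = new HashSet<>(baseCandidates);
        for (List<String> candidate : baseCandidates) {
            for (int len = 2; len < candidate.size(); len++) {
                all.add(List.copyOf(candidate.subList(0, len)));
            }
        }
        return all;
    }

    /** Dispatches saved by substituting a candidate of the given length, executed this often:
     *  one dispatch per covered instruction, minus the one dispatch still needed to enter it. */
    static long savedDispatches(int length, long executions) {
        return executions * (length - 1);
    }
}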

4.4.3 Static evaluation

Section 4.3.2 discussed the data in the profile, and that this data is needed to enable static evaluation of a particular combination of superinstruction set and superinstruction placement algorithm. For this task, the base superinstruction candidates are used in combination with the actual runtime superinstruction placement algorithm. From a candidate superinstruction set, a superinstruction set is generated. This process allocates opcodes and assigns names to the candidate superinstructions to enable their placement into a program.

Definitions

Let us first consider a table of definitions that lays the foundation and explains what data the algorithm operates on.

B (Set) – Set of all base JVM instructions as defined in the JVM specification, 11th edition [8], e.g. istore or nop. Does not include superinstructions.
b ∈ B (Element) – An instruction from B.
S (Set) – Set of all superinstructions active in the VM. This set is disjoint with B, that is, no instruction is both a base JVM instruction and a superinstruction.
X ∈ S, s ∈ S (Ordered set) – A superinstruction from S. Superinstructions are themselves ordered sets of instructions from B, as superinstructions are composites of base JVM instructions. Sometimes written as s ∈ S to indicate that it is an element of the larger set of superinstructions, sometimes written as X ∈ S to remind the reader that the superinstruction is in and of itself also a set.
|s|, |X| (Integer) – Number of base JVM instructions that make up the superinstruction X. For example, the superinstruction X = {istore, dup} has |X| = 2 as it consists of two elements.

I, J (Ordered set) – I and J typically refer to a list of instructions as seen in a program, and in that sense they are equal to X but used in a different context. J is typically the output of a substitution algorithm, and depending on the context may also include elements from S – that is, it may include superinstructions besides base JVM instructions.
P (Set) – Set which represents the profile. Elements in P are sequences of bytecode which could be converted to a superinstruction as they were found by the runtime profile. These sequences are called the base superinstruction candidates. One key property of such a base superinstruction candidate is that it retains profiling information regarding the number of executions of every instruction within the candidate.
p ∈ P (Ordered set) – An element of P is a base superinstruction candidate, and it is also an ordered set consisting of base JVM instructions (elements of B). p may be substituted for Ip, utilizing the fact that p is just a sequence of instructions and can be used in any place where a sequence of instructions I is expected.

Evaluation process

Armed with a superinstruction set, a profile (which is a set of base superinstruction candidates computed as per the previous section) and a runtime substitution algorithm, the static evaluation can begin. Let P be the profile, S the superinstruction set, and B the JVM base instruction set as per the JVM specification [8]. s ∈ S are the superinstructions in S, p ∈ P are the base superinstruction candidates (with profiling information) and b ∈ B are the base instructions. The regular instruction set and superinstruction set are disjoint, that is S ∩ B = ∅, and the interpreter would be compiled with the instructions S ∪ B. Static evaluation however does not require compilation of an interpreter – that is the whole point. Let R be the runtime substitution algorithm, where R(Ip, S) → Jp and Ip and Jp are sequences of bytecode instructions (either a method or a subsequence of a method). Ip is equivalent to Jp, that is, using Jp instead of Ip retains the semantics of Ip as defined by the Java Virtual Machine Instruction Set [8]. Ip does not contain superinstructions (∀i ∈ Ip : i ∈ B) but Jp may (∀j ∈ Jp : j ∈ B ∪ S). The set P is the set of base superinstruction candidates – this is where their special status comes in, as discussed in the previous section. The base superinstruction candidates are all the blocks of code that can be converted to a superinstruction, with their incoming branches taken care of just like any superinstruction candidate. But base superinstruction candidates are also equipped with profiling data as they are immediately derived from the profile, which the other superinstruction candidates do not have. A base superinstruction candidate p is a sequence of (i, n) tuples, where i ∈ B and n is the number of times that instruction was executed per the profile. It is required that p is a compatible substitute for Ip – that is, the runtime placement algorithm R must be able to operate both on a plain list of instructions and on a list of instructions with profiling information attached. The output Jp must retain this information if it was provided as input. This substitution relation (substituting p ↦ Ip) can be implemented using object-oriented programming and a common base type in a way that requires minimal case-dependent code in the implementation of the runtime substitution algorithm. Static evaluation computes the total score totalScore(P, S) = Σp∈P score(p, R(p, S)), where score(p, Jp) computes the number of dispatches saved by executing Jp over p ↦ Ip (p is substituted for Ip here when using static evaluation). The algorithm to compute the number of dispatches needed implements a very simple instruction visitor, which is shown in Listing 4.10.

score(p, Jp):                                  Pseudocode
    return dispatches(p) - dispatches(Jp)

dispatches(L):                                 Counts the number of instruction dispatches
    total := 0
    for instr ∈ L:
        c_instr := executionCount(L, instr)    Fetches the number of executions according to the profile
        total := total + c_instr
    return total

Listing 4.10: score(p, Jp) static evaluation algorithm

The score algorithm is a formalization of the algorithm seen in the previous section. It is quite simple – it computes the number of instruction dispatches saved by substituting in superinstructions. It does so by first computing the number of dispatches in the original code sequence and then subtracting the number of dispatches in the superinstruction version, both of which are computed with the dispatches function. The dispatches function implements a very simple algorithm: it loops over the bytecode without taking any branch, simply until every instruction has been visited. It assumes that the profiling information has been updated correctly if the runtime substitution algorithm R made modifications to the dataflow graph. If R produces an unreachable instruction, the runtime substitution algorithm must set the profiling counter of that instruction to zero.
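As an illustration of how the static evaluation could be wired together, the sketch below expresses the R(Ip, S) → Jp abstraction and the totalScore computation in Java. The type and method names (Instr, SubstitutionAlgorithm, StaticEvaluator) are hypothetical; the only assumption taken from the text is that each instruction carries its execution count from the profile.

import java.util.List;
import java.util.Set;

// Hypothetical minimal types; QuickInterp's real classes differ.
record Instr(String opcode, long executionCount) {}

interface SubstitutionAlgorithm {
    // R(I, S) -> J: rewrite a sequence using the given superinstruction set,
    // preserving the profiling counts attached to the instructions.
    List<Instr> substitute(List<Instr> input, Set<String> superinstructions);
}

final class StaticEvaluator {
    // dispatches(L): sum of per-instruction execution counts from the profile.
    static long dispatches(List<Instr> code) {
        return code.stream().mapToLong(Instr::executionCount).sum();
    }

    // totalScore(P, S) = sum over p in P of ( dispatches(p) - dispatches(R(p, S)) )
    static long totalScore(List<List<Instr>> profile, Set<String> set, SubstitutionAlgorithm r) {
        long total = 0;
        for (List<Instr> p : profile) {
            total += dispatches(p) - dispatches(r.substitute(p, set));
        }
        return total;
    }
}

The interface mirrors the requirement that R can operate on instruction lists with profiling information attached; a concrete substitution algorithm only has to preserve the counts when rewriting.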

Theoretical maximum

This algorithm also provides another tool: it allows computing the theoretical maximum score for a given profile. This score is computed in a straightforward way: assume that every base superinstruction candidate becomes a superinstruction in its entirety. This reduces the number of dispatches for every single candidate to just one dispatch per execution; multiplying this by the number of times each candidate was entered produces the minimal number of dispatches. By also computing the total number of dispatches without any superinstructions, the theoretical maximum score can be computed.

theoreticalMaximum(P):                         Pseudocode
    return Σp∈P ( dispatches(p) - executionCount(p) )

Listing 4.11: theoreticalMaximum(P ) static evaluation algorithm computing the maximum score

In Listing 4.11 this algorithm can be seen in its formal form. executionCount(p) computes the number of times that base superinstruction candidate was entered according to the profile (the number of times its first instruction was executed).

4.4.4 Superinstruction set construction

Proebsting [15] pointed out that determining the optimal superinstruction set is NP-hard. This makes it unlikely for any algorithm to find the absolute optimal superinstruction set within a reasonable amount of time (well, not without proving P = NP in the process). The complexity is due to the complex relation between particular superinstruction candidates as seen in section 4.4.2 – considering an additional superinstruction candidate may affect the individual score of all previously considered superinstruction candidates. This prevents the use of a greedy algorithm, where a superinstruction set is built up iteratively by always adding the next highest scoring superinstruction candidate. Earlier work did not necessarily deploy particularly sophisticated algorithms, as discussed in section 3.2.5. QuickInterp aims to change that. The superinstruction set construction algorithm has access to a static evaluation algorithm as seen earlier, and this algorithm can be used to iteratively improve the superinstruction set. The capabilities of the Superinstruction Set

Generator – as seen in Figure 4.1 – are twofold: first, it has to be able to generate an initial superinstruction set candidate from a list of candidate superinstructions. Then, it has to be able to mutate this superinstruction set candidate to derive a new superinstruction set candidate. In the Instruction Set Generation process, variations on the current best superinstruction set are evaluated.

Figure 4.2: Diagram of the Instruction Set Generation process as seen in Figure 4.1 (components shown: Profiling data, Compute Base Superinstruction Candidates, Preprocessor, Superinstruction Candidates, Superinstruction Set Generator with Seeder and Mutator, Bytecode Conversion with the Superinstruction Converter, Static Evaluation, and the resulting Instruction Set Definition)

Consider the diagram in Figure 4.2. This diagram is an elaboration of the Instruction Set Generation process as seen earlier in Figure 4.1 and shows many of the steps already discussed. The Superinstruction Set Generator, the Superinstruction Converter and the Preprocessor each have multiple implementations, which will be discussed later. These pluggable implementations allow various algorithms to be implemented and tuned. The process of generating an instruction set is as follows (a sketch of this loop in code is given below the list):

1. The Profiling Data is processed by the Compute Base Superinstruction Candidates step, which produces the Base Superinstruction Candidates.
2. The Preprocessor is used to generate more candidates, producing the Superinstruction Candidates. As discussed earlier, the role of the preprocessor is to increase the number of superinstruction candidates, for example by taking the prefix of each base superinstruction candidate. The preprocessor is not required to do anything – it may simply return the base superinstruction candidates without any processing.
3. The Superinstruction Candidates are fed into the Seeder part of the Superinstruction Set Generator. The seeder is capable of generating an initial superinstruction set definition. This process is part of the superinstruction set generator implementation. The generated Instruction Set Definition need not be a subset of the superinstruction candidates – the generator is allowed to add new superinstructions, modify instructions or perform any operation on them, allowing very powerful superinstruction set generators that have some awareness of the relation between superinstructions.
4. The Superinstruction Set Definition, as generated by the Superinstruction Set Generator, is fed into the Bytecode Conversion process, where the base superinstruction candidates are processed as bytecode-to-be-converted by the superinstruction converter with the provided superinstruction set definition.
5. The output of the Bytecode Conversion is provided to a static evaluation process, where it is scored according to the score(p, Jp) algorithm shown earlier in Listing 4.10, from which a decision is made.
6. If this score is considered adequate, the instruction set generation process terminates with the current instruction set definition. Otherwise, it returns the instruction set definition to the Mutator component of the Superinstruction Set Generator.
7. The Mutator component makes some changes to the superinstruction set. These changes may be random, although the way they are made is implementation-dependent. After the mutator, the process resumes at step 4.

The decision to continue or not must be made in a way that guarantees termination. An example of an implementation would be to not look at the score, but instead look at the time spent: if the time spent exceeds some threshold, the current best is returned. Figure 4.2 only covers the rough outline of the mutate-score-evaluate loop; it will be discussed in more detail in section 4.4.5.
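The following Java sketch shows how the seven steps above could be wired together. All interface names, the list-of-opcodes representation of a superinstruction, and the simple multiplicative heat decay are illustrative assumptions; QuickInterp's actual components and its time-based heat schedule are described in section 4.4.5.

import java.util.List;
import java.util.Set;

// Hypothetical interfaces mirroring the components in Figure 4.2.
interface Preprocessor { Set<List<String>> expand(Set<List<String>> baseCandidates); }
interface Seeder       { Set<List<String>> seed(Set<List<String>> candidates); }
interface Mutator      { Set<List<String>> mutate(Set<List<String>> current, double heat); }
interface Converter    { long evaluate(Set<List<String>> instructionSet); } // bytecode conversion + static evaluation

final class InstructionSetGeneration {
    static Set<List<String>> generate(Set<List<String>> baseCandidates,
                                      Preprocessor pre, Seeder seeder,
                                      Mutator mutator, Converter eval,
                                      long budgetMillis) {
        Set<List<String>> candidates = pre.expand(baseCandidates);   // step 2
        Set<List<String>> best = seeder.seed(candidates);            // step 3
        long bestScore = eval.evaluate(best);                        // steps 4-5
        long deadline = System.currentTimeMillis() + budgetMillis;
        double heat = 0.5;
        while (System.currentTimeMillis() < deadline) {              // step 6 (time-based termination)
            Set<List<String>> next = mutator.mutate(best, heat);     // step 7
            long score = eval.evaluate(next);                        // steps 4-5 again
            if (score > bestScore) { best = next; bestScore = score; }
            heat = Math.max(0.01, heat * 0.99);                      // gradually fine-tune
        }
        return best;
    }
}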

Superinstruction Set Generator

In Figure 4.2, the Superinstruction Set Generator component may have the most surprising structure. It has two sub-processes, the Seeder and the Mutator, which are closely related. The process description from earlier gives a decent description of its jobs: generate the initial superinstruction set, and modify a given superinstruction set. The process of optimizing the superinstruction set is a stochastic one. The superinstruction set generator does not have enough information to make a very good superinstruction set in the seed process – the superinstruction candidates it receives from the Preprocessor may not even possess usable profiling information. As such, all but the most trivial implementations opt to build an instruction set by randomly selecting candidates. Further mutation of a superinstruction set definition is also based on chance: a random number of superinstructions are changed. The amount of change is guided: the static evaluation process provides a number (between 0 and 1) that indicates how much change should be made, where 1 indicates a complete rewrite of the superinstruction set candidate and 0 means as little change as possible. The mutate process should never leave its input unchanged. How this "change" varies throughout the run of the evaluation loop is discussed in section 4.4.5.

Some further postprocessing may be performed on the superinstruction candidates – the reimplementation of the Gregg et al. [6] runtime substitution algorithm requires the prefix of every superinstruction to be available in the superinstruction set. Even though a preprocessor can make these prefixes available, relying solely on the preprocessor without making the superinstruction set generator aware of the need for the prefix would be inefficient: such a generator would end up producing many superinstruction set definitions that lack prefixes for some of their superinstructions. These definitions would score lower, as the runtime substitution algorithm is unable to use the prefixless superinstructions, and the generator would have to rely on the optimization process to correct this – not an efficient approach. Instead, the superinstruction set generator itself should add prefixes for all superinstructions it tries to add. In other words: when there are additional constraints on combinations of superinstructions, these constraints should be dealt with by the superinstruction set generator. The role of the preprocessor is only to increase the material it can work with in a pluggable way, so that different superinstruction set generators can be coupled to different preprocessors.

Due to the use of randomness, running the same optimization process twice may yield different results, although the exact source of entropy is an implementation detail of the superinstruction set generator. The theoretical maximum can be used to see if a particular run indeed did poorly; however, the theoretical maximum is not necessarily achievable. Instead, it provides an upper bound on the score of a particular superinstruction set.

Creating superinstruction candidates

The role of the preprocessor was already briefly covered: it receives the base superinstruction candidates as input – these are derived from the bytecode directly – and provides a full list of all superinstruction candidates. The preprocessor is the architectural component introduced to overcome the problem seen earlier in Listing 4.3 and Listing 4.4 – the instruction set may not be large enough to allow allocating all base superinstruction candidates. As such, parts of particular base superinstruction candidates may be more suitable, which the preprocessor can make available. The superinstruction set construction algorithm, when making a candidate superinstruction set, is not obliged to only pick superinstruction candidates provided by the preprocessor. It makes more sense that, when superinstruction X requires f(X) to be available for a particular runtime substitution algorithm, the superinstruction set construction algorithm itself generates f(X) and ensures its availability. If the superinstruction set construction algorithm were to only add X and fail to add f(X), the runtime substitution algorithm would not be able to use X and the score would be lower, so eventually the optimization loop would reject X unless f(X) is available. But this would take costly iterations of the optimization algorithm that need not be spent if the superinstruction set construction algorithm were aware of this relation. Likewise, assuming X ≡ X′, the runtime substitution algorithm may be able to recognize X and X′ as equivalent, and as such adding X′ to a superinstruction set already containing X will not improve its score. This is an example that can be solved both with a preprocessor and with a smarter superinstruction set construction algorithm: the preprocessor could remove either X or X′ from its list of candidates, and the superinstruction set construction algorithm could recognize the equivalence and not add one when the other is available.

The design decision to have two places where new superinstruction candidates can be constructed – (1) in the superinstruction set construction algorithm and (2) in the dedicated preprocessor component – is to increase the architecture's modularity. Design goal DS2 demands an architecture on which multiple superinstruction construction algorithms and multiple runtime substitution algorithms can be implemented and tested. Separating the preprocessor from the superinstruction construction algorithm allows common, algorithm-agnostic preprocessing steps to be usable across all superinstruction construction algorithms. For example, one preprocessor may take all the subsequences of the base superinstruction candidates. Such a preprocessor may be useful for any superinstruction construction algorithm. Other, more specific operations, like ensuring that f(X) is available when superinstruction X is present in the candidate superinstruction set, are more suitable to be implemented by the superinstruction set construction algorithm itself.
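As an example of such an algorithm-agnostic preprocessing step, the sketch below enumerates every contiguous subsequence (of length at least two) of each base superinstruction candidate. The class name and the plain list-of-opcodes representation are assumptions made for illustration; a real preprocessor would also have to carry the profiling information along.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of a preprocessor that adds every contiguous subsequence of length >= 2
// of each base candidate as an extra superinstruction candidate.
final class SubsequencePreprocessor {
    static Set<List<String>> expand(Set<List<String>> baseCandidates) {
        Set<List<String>> out = new HashSet<>(baseCandidates);
        for (List<String> candidate : baseCandidates) {
            for (int from = 0; from < candidate.size(); from++) {
                for (int to = from + 2; to <= candidate.size(); to++) {
                    out.add(List.copyOf(candidate.subList(from, to)));
                }
            }
        }
        return out;
    }
}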

4.4.5 Superinstruction generation evaluation loop

Figure 4.2 already showed the instruction set generator's mutate-score-evaluate loop. However, besides showing the general data flow in this workflow, it fails to show the specific details of this algorithm. The evaluation loop is a genetic algorithm in a very straightforward implementation; it is shown in pseudocode in Listing 4.12.

optimize(P, S):                                Pseudocode
    b := seed(S)                               b is the running best
    h := 0.5                                   h is the mutation rate ∈ [0, 1]
    while continue(b):                         e.g. start + 1 minute < now
        N := { mutate(b, h) : i ∈ {1 ... k} }  N contains k mutations of b
        for n ∈ N:
            if score(n) > score(b):
                b := n                         Found a new best → use it
        h := nexth(h)                          Let h approach 0, fine-tuning b, e.g. nexth(h) = h^1.01
    return b

Listing 4.12: optimize(P,S) genetic optimization algorithm

This algorithm chooses per iteration not one variation of the current best b, but k mutations. This enables the algorithm to work concurrently – on multiple CPU cores or even on multiple machines, variations of one particular superinstruction set can be evaluated. This in turn increases the hardware utilization of the algorithm and reduces the wall clock time it needs to find a good solution.

Another variable shows up: h, for heat. This is a parameter in the genetic algorithm controlling how much variation every mutation step is to introduce. With an h of 0.5, 50% of the instruction set definition is to be kept when mutating. As h approaches 0, fewer and fewer changes should be made, such that with an h of zero only one superinstruction is to be changed (the mutate(b, h) function should never return its input). The goal is to prevent the algorithm from getting stuck in a local optimum. It may find itself with a superinstruction set where three base superinstruction candidates have been promoted to real superinstructions, and be unable to allocate new superinstructions. However, these three base superinstructions fail to effectively optimize all profiled code, and if the algorithm were instead to select three smaller superinstructions that independently do not score very well, it may generate a higher-scoring superinstruction set. These situations are expected to be somewhat common due to the complex impact individual superinstructions have on the total score. As such it is important, when the optimization process starts, to start out by making big changes to the current best superinstruction set candidate to help it depart such a local optimum. As the optimization process progresses, the rate of change can be lowered to fine-tune the candidate it ended up finding. This is reflected in the value of h lowering towards 0, which is done by the nexth(h) function. A simple implementation of this function would be nexth(h) = h^1.01 or some other exponent.

In QuickInterp, nexth(h) does not derive its value from the current value of h, but from the time left. In this implementation the decision to continue is solely based on time. Let t_end be the end time of the optimization loop after which continue(b) returns false, let t be the current time and let the start time of the algorithm be t_start, with t_start ≤ t ≤ t_end. nexth(h) is then defined as nexth(h) = 0.5 · ((t_end − t) / (t_end − t_start))^3. Time is mapped to a value between 1 and 0, starting at 1 and linearly approaching 0 as t approaches t_end. This value is raised to the power three to spend more time at lower values, giving the algorithm more time to fine-tune. The decision process has been moved to the continue(b) function, whose implementation decides how long this iterative process continues. In QuickInterp, continue(b) is simply implemented as continue(b) = t < t_end. An earlier exit could be made if the continue(b) function assesses that, for example, score(b) has sufficiently approached the theoretical maximum; however, this is not done in QuickInterp.

The exact interpretation of this heat parameter depends on the implementation of the superinstruction set construction algorithm, as this algorithm provides the implementation of the mutate(b, h) algorithm. However, in general a heat of h ∈ [0, 1] should cause a change of roughly h · 100%. That is, an h of 0.3 should prompt a change of 30%.
If it is more efficient to implement an algorithm which has some variance in the amount of change it makes, this is acceptable. There is also a minimum change – the implementation of mutate(b, h) must never return the input b as-is; at least one change should always be made. In QuickInterp no extensive testing has been done with varying implementations of the continue(b) and nexth(h) functions, but the chosen implementations have been shown to converge quickly (within seconds) on small programs with k = 256. The runtime performance of the superinstruction set construction algorithm might be improved further, but there is no design goal requiring swift construction of a superinstruction set. Instead, as long as an optimal set is found within reasonable time, this design is in agreement with the design goals.
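The time-based continue(b) and nexth(h) implementations described above can be summarized in a few lines of Java. The class and method names are hypothetical; only the formula nexth(h) = 0.5 · ((t_end − t)/(t_end − t_start))^3 and the criterion continue(b) = t < t_end are taken from the text.

// Minimal sketch of the time-based heat schedule and termination check.
final class TimeBasedSchedule {
    private final long tStart;
    private final long tEnd;

    TimeBasedSchedule(long startMillis, long durationMillis) {
        this.tStart = startMillis;
        this.tEnd = startMillis + durationMillis;
    }

    boolean shouldContinue() {                 // continue(b) = t < t_end
        return System.currentTimeMillis() < tEnd;
    }

    double nextHeat() {                        // nexth(h) = 0.5 * ((t_end - t) / (t_end - t_start))^3
        double remaining = (double) (tEnd - System.currentTimeMillis()) / (tEnd - tStart);
        remaining = Math.max(0.0, Math.min(1.0, remaining));
        return 0.5 * Math.pow(remaining, 3);
    }
}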

4.4.6 Handling superinstruction operands

Most of the techniques used for generating the actual superinstruction handlers are considered implementation details and will be discussed in chapter 5. However, the way superinstructions obtain their instruction operands requires some special attention.

1 iload 10      load x                         JVM Bytecode
2 bipush 5      load literal 5
3 imul
4 istore 10     store

Listing 4.13: Sequence of bytecode instructions executing x := x · 5

Consider Listing 4.13. This example shows four bytecode instructions, three of which have instruction operands. Instructions read their operands from the bytecode stream – instruction operands are not part of the instruction opcode itself. As such, when a superinstruction is created from this code fragment, the superinstruction has to retrieve these operands from somewhere in the bytecode stream as well. The superinstruction architecture is one where the amount of hand-written change to interpreter source code is kept at a minimum. As such, superinstructions are created by the direct concatenation of multiple instruction handlers – for example the iload, bipush, imul and istore handlers from Listing 4.13. These handlers however expect their instruction operands to be readable from the bytecode stream – the bipush handler may be implemented as seen in Listing 4.14.

1 uint8_t* pc;                   C code
2 int32_t* topOfStack;           Operand stack slots are int-sized
3 ...
4 case bipush: {
5     *topOfStack = (int8_t) pc[1];
6     topOfStack++;              Increase stack
7     pc += 2;                   Progress the program counter by two bytes
8 }

Listing 4.14: bipush handler

bipush takes one byte as instruction operand, which it pushes onto the operand stack. The bipush handler reads the instruction operand on line 5 with pc[1]. This operand is stored in the bytecode stream immediately after the opcode (which is at pc[0], or *pc), which is why the pc variable is used. It then writes this value to the topOfStack location, increasing the stack pointer and the program counter. The program counter is incremented by two because, besides the one-byte operand, the opcode itself is also one byte. After concatenating instruction handlers like the one for bipush, they will still read their instruction operands relative to the pc program counter variable. Furthermore, the handlers will also increment the program counter with not just the size of their operands, but also with the size of their own opcodes. If the handlers from Listing 4.13 are simply concatenated to form one new superinstruction handler, this handler would not be able to read its instruction operands if they immediately followed the superinstruction opcode in the bytecode stream. Instead, the space where the instruction opcodes of all but the first instruction used to be has to be preserved, as the handlers will increment the program counter to skip over the locations of the bipush, imul and istore opcodes. This could be solved by modifying the instruction handlers, and such a compact encoding scheme was considered for the design of QuickInterp. However, there is a serious advantage in leaving these "spaces" in the bytecode stream – they can be filled with the original opcodes. Assume the existence of a new superinstruction super1, which implements all instructions from Listing 4.13.

1 super1 10     load x, load 5, multiply and store         JVM Bytecode
2 bipush 5      load literal 5
3 imul
4 istore 10     store

Listing 4.15: Listing 4.13 with a superinstruction

What can be seen in Listing 4.15 is that the opcode of the iload is replaced with super1 – as expected. However, the other instructions are still present, still with their instruction operands. Because super1 is the most simple concatenation of these handlers, it will read their operands: the bipush handler compiled into the super1 instruction will actually read the instruction operand of the bipush instruction at line 2. But this is not all: if a jump is performed from somewhere else (not shown) to the bipush at line 2, the interpreter will happily and correctly execute the bipush instruction. Finally, after executing all the handlers within the super1 superinstruction, the program counter has been incremented by 7, as each handler has done its own increment of the program counter while control was never returned to the interpreter. This means that when it jumps back to the interpreter's dispatch loop, it will execute the instruction after the istore instruction. This behavior is ideal: not only do jump targets keep working without any special consideration, the otherwise-unreachable code is also skipped over in the bytecode stream. Even though QuickInterp is designed from the ground up around more intensive code inspection, including the construction of control-flow graphs detailing where jumps go, this coincidental "feature" caused by the most naive way of compiling superinstructions is kept. It allows for the most faithful reimplementation of earlier work, and additionally causes minimal harm, as the only downside is that the bytecode is not compacted in memory.

4.5 QuickInterp runtime

4.5.1 Introduction

Runtime is the phase where instruction substitution happens. In this chapter we discuss various runtime substitution algorithms, including one which performs optimally (with some limitations to the domain). QuickInterp's approach to performing superinstruction substitution is closely linked to how, at compile time, the base superinstruction candidates are constructed as discussed in section 4.4.2. After all, static evaluation is performed by running the substitution algorithm on the base superinstruction candidates. As such, the first step in processing the bytecode of a method is to cut it into "blocks" of consecutive bytecode instructions, similarly to how the base superinstruction candidates are created. Unconditional jump instructions and other instructions which cannot be part of a superinstruction are used as delimiters, and blocks with a length ≤ 1 are skipped. Within these blocks, substitution is performed. Incoming jumps are dealt with in an implementation-dependent way, but most often they are simply not considered. By keeping the original instructions in place, they continue to work for free as discussed in section 4.4.6.
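A minimal sketch of this block-cutting step is given below, assuming instructions are represented by their opcode mnemonics and that the set of delimiter opcodes (unconditional jumps and other instructions that cannot be part of a superinstruction) is supplied by the caller. The names are illustrative and not QuickInterp's actual API.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Cut a method's instruction list into blocks of consecutive instructions,
// using delimiter opcodes as boundaries. Blocks of length <= 1 are dropped,
// mirroring the description above.
final class BlockSplitter {
    static List<List<String>> split(List<String> method, Set<String> delimiters) {
        List<List<String>> blocks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String opcode : method) {
            if (delimiters.contains(opcode)) {
                if (current.size() > 1) blocks.add(current);
                current = new ArrayList<>();
            } else {
                current.add(opcode);
            }
        }
        if (current.size() > 1) blocks.add(current);
        return blocks;
    }
}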

4.5.2 Basic runtime superinstruction placement

Definitions

Before continuing on with the somewhat abstract definitions of each substitution algorithm, let us first consider a table of definitions that can be reused across the substitution algorithms to define each algorithm more formally. This table reuses some of the definitions seen earlier in section 4.4.3.

B (Set) – Set of all base JVM instructions as defined in the JVM specification, 11th edition [8], e.g. istore or nop. Does not include superinstructions.
b ∈ B (Element) – An instruction from B.
S (Set) – Set of all superinstructions active in the VM. This set is disjoint with B, that is, no instruction is both a base JVM instruction and a superinstruction.

X ∈ S, s ∈ S (Ordered set) – A superinstruction from S. Superinstructions are themselves ordered sets of instructions from B, as superinstructions are composites of base JVM instructions. Sometimes written as s ∈ S to indicate that it is an element of the larger set of superinstructions, sometimes written as X ∈ S to remind the reader that the superinstruction is in and of itself also a set.
|s|, |X| (Integer) – Number of base JVM instructions that make up the superinstruction X. For example, the superinstruction X = {istore, dup} has |X| = 2 as it consists of two elements.
T (Object) – Tree of superinstructions used by the tree-based runtime substitution algorithm.
I, J (Ordered set) – I and J typically refer to a list of instructions as seen in a program, and in that sense they are equal to X but used in a different context. J is typically the output of a substitution algorithm, and depending on the context may also include elements from S – that is, it may include superinstructions besides base JVM instructions.

Triplet-based substitution

Triplet-based substitution is the name of our reimplementation of an existing mechanism for performing superinstruction substitution from Ertl et al. [4]. The authors use a straightforward substitution mechanism they call peephole optimization, a common name in compiler construction literature for examining a small window of statements or instructions and substituting in more optimized equivalent instructions. In the case of their implementation, the window always has a size of just two consecutive instructions, but with some clever superinstruction set constraints it is possible to substitute in longer superinstructions. To enable runtime substitution, their construction algorithm generates a triplet (i, j, x) for each superinstruction, with i ∈ B ∪ S, j ∈ B and x ∈ S, where S is the set of all superinstructions and B is the set of standard JVM instructions.

uint8_t TABLE[4][3] = {                        C code
    { new,    dup,    super1 },                new-dup
    { bipush, iadd,   super2 },                bipush-iadd
    { super2, istore, super3 },                bipush-iadd-istore
    { super2, iload,  super4 },                bipush-iadd-iload
};

Listing 4.16: Example of the triplet table used for triplet-based substitution, with four superinstructions defined

In Listing 4.16 four such triplets are shown in the triplet table. For readability, Listing 4.16 shows instruction labels instead of instruction opcodes, but for this strategy to work effectively the labels would be replaced with opcodes. The first two triplets define superinstructions of just two instructions. The first superinstruction, new-dup, is coupled to the label super1. The second superinstruction, merging bipush and iadd, gets to be called super2. This superinstruction is referenced in the next line: the bipush-iadd-istore superinstruction triplet composes the previous superinstruction super2 and adds the istore instruction to make a new superinstruction super3. The bipush-iadd-iload superinstruction also makes use of the definition of its prefix by referring to super2, showing how longer superinstructions can be defined while only using triplets. The runtime placement algorithm always looks at two consecutive instructions a and b, attempting to find a triplet (a, b, x) for some x in the triplet table. If this triplet is found, a is replaced with x and b is discarded as it is implemented by x. The process is repeated after reading the

next instruction, now with the just-placed superinstruction x and the newly read instruction c. Triplets may exist in the table where the first opcode in the triplet is itself a superinstruction opcode, which is also shown in Listing 4.16. Via this mechanism, longer superinstructions are possible as the algorithm bases a new superinstruction on an existing superinstruction and some extra instruction. If no triplet is found, the window shifts by one instruction. For example, if no triplet (a, b, x) for some x is found, the algorithm reads the next instruction c and repeats the process, now trying to find a triplet (b, c, x) for some x.

To construct a superinstruction set, their compile-time superinstruction set construction algorithm operates very similarly. A variation of the runtime placement algorithm is executed on the profiled bytecode, marking two instructions a, b for inclusion as a superinstruction when their number of executions sits above a particular threshold. If this action is performed, the next instruction c is read, and the just-formed superinstruction together with c is evaluated for inclusion in the superinstruction set. The decision to include a superinstruction is solely based on the number of executions in the profile, and no optimization loop or other mechanism is used to detect the complex effects adding new superinstructions can have on the performance of existing superinstructions, as discussed in section 4.4.2. Let X be a sequence of bytecode instructions that is a prefix of another sequence of bytecode instructions Y, and let count(A) be the number of times the sequence A was executed. If X is a prefix of Y, then count(X) ≥ count(Y): after all, for every execution of Y, X also has to be executed. As such, if the decision to include a particular sequence of instructions as a superinstruction is made by only looking at the number of executions, then whenever Y is a superinstruction, its prefix X must also have been promoted to a superinstruction as it has at least as many executions. This in turn causes their superinstruction set construction algorithm to always produce superinstruction sets where, for any superinstruction Y ∈ S spanning more than two instructions (|Y| ≥ 3), its prefix X is also a member of S.

Their superinstruction set construction algorithm is not reimplemented as part of QuickInterp. This mechanism is instead covered by a Superinstruction Set Converter aware of this side effect – that the prefix of a superinstruction always needs to be available. Besides this, the QuickInterp optimization loop using static evaluation is also used for this reimplementation, even though Ertl et al. [4] do not indicate using such a sophisticated mechanism for optimizing the superinstruction set, instead opting to just promote sequences of profiled bytecode to superinstructions based on their execution counts. It should be noted that Ertl et al. created their implementation in 2001 and were limited by the hardware of their time, having at most 512MB of RAM available on their most powerful test platform. While elegant and both time- and memory-efficient, this runtime substitution algorithm requires the prefix of every superinstruction to be present in the superinstruction set. The reimplementation keeps this limitation to establish a baseline and see what improvements better algorithms can bring, while using QuickInterp's more advanced superinstruction set construction algorithm to allow a comparison based purely on the runtime substitution algorithm.
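The core of the triplet-based peephole pass can be sketched as follows. The table is modeled as a map from an opcode pair to a superinstruction opcode, and opcodes are represented as strings for readability; these are simplifying assumptions, since the real implementation works on numeric opcodes in the bytecode stream and, as discussed in section 4.4.6, leaves the original instructions in place rather than producing a shortened instruction list.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the triplet-based peephole substitution, after Ertl et al. [4]:
// each pair (previous instruction, next instruction) is looked up in the table,
// and a hit replaces the previous instruction with the superinstruction.
final class TripletSubstitution {
    static List<String> substitute(List<String> block, Map<List<String>, String> triplets) {
        List<String> out = new ArrayList<>();
        for (String next : block) {
            if (!out.isEmpty()) {
                String prev = out.get(out.size() - 1);
                String superOp = triplets.get(List.of(prev, next));
                if (superOp != null) {
                    out.set(out.size() - 1, superOp); // merge prev and next into the superinstruction
                    continue;                         // next is now implemented by superOp
                }
            }
            out.add(next);
        }
        return out;
    }
}

Because the just-placed superinstruction becomes the left element of the next pair that is looked up, longer superinstructions are formed exactly when their prefix is also in the table.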

Tree-based substitution

Tree-based substitution is a simple improvement over triplet-based substitution, removing the requirement that all superinstructions X ∈ S with |X| ≥ 3 must have their prefix Y ∈ S. The algorithm constructs a tree of the superinstruction set. Every node in the tree carries an opcode o, optionally a superinstruction opcode s, and up to |B| child nodes, where B is the set of base JVM instructions. The root of the tree also has up to |B| child nodes, but no opcode o or superinstruction opcode s. An example of such a tree is shown in Listing 4.17 and Figure 4.3. Listing 4.17 defines three superinstructions; Figure 4.3 shows the tree derived from these superinstructions. In the tree, all nodes have exactly one opcode b ∈ B (B is the JVM base instruction set). Some nodes additionally have a superinstruction opcode s ∈ S associated with them (S is the superinstruction set with B ∩ S = ∅), but not all. Intermediate nodes without an s exist for superinstructions whose prefix is not itself a superinstruction, allowing superinstructions to be placed even when no prefix superinstruction exists.

super1: bipush iadd                            Superinstructions
super2: iload iload iadd istore
super3: iload iload

Listing 4.17: Example of three superinstruction definitions

Figure 4.3: Listing 4.17 represented as a tree (root → bipush → iadd with s = super1; root → iload → iload with s = super3 → iadd → istore with s = super2)

treeReplace(I, S):                             Pseudocode
    T := makeTree(S)                 Convert the superinstruction set to a tree as in Figure 4.3
    is := first(I)                   Start of the superinstruction window
    ie := first(I)                   End of the superinstruction window
    il := null                       Last position (is ≤ il ≤ ie) where {is, ..., il} is a valid superinstruction
    o := null                        The opcode of the {is, ..., il} superinstruction
    c := cursor(T)                   Cursor in the tree, starts at the root

    while is ≠ end(I):               True as long as is points to a valid instruction
        if hasSuperinstruction(c):   Test if the node at c has a superinstruction (s)
            il := ie
            o := opcode(c)           Read the superinstruction opcode s from the cursor
        c := moveCursor(c, opcode(ie))
        if ie = end(I) ∨ ¬isValid(c): The tree has no matches or the end has been reached
            if o ≠ null:             Has there been a valid superinstruction?
                replace({is, ..., il}, o)    Place the best superinstruction that was found
                is := next(il)       Restart the window at the end of the superinstruction
                ie := next(il)
            else:
                is := next(is)       Restart the window at the next instruction
                ie := is
            il := null
            o := null
            c := cursor(T)           Create a new cursor starting at the root
        else:
            ie := next(ie)           Increase the window

Listing 4.18: R(I,S) tree-based runtime substitution algorithm

The algorithm is shown in pseudocode in Listing 4.18. The placement algorithm creates a cursor c into the tree – a pointer to the current node in the tree. This cursor starts at the root, and can be advanced by giving it a new opcode (shown using the moveCursor function). It will then move to the child node with that opcode. Looking at the tree in Figure 4.3, it can be seen that the cursor can be in three states: (1) a valid node with superinstruction opcode s, (2) a valid node without a superinstruction opcode, and (3) an invalid node. The third case is triggered when the cursor is advanced to a child node that does not exist, e.g. the cursor points to the bipush node in Figure 4.3 and moveCursor is called with istore as opcode. This node has no child with that opcode, so the cursor moves to the invalid state. The algorithm keeps a window of instructions, with is tracking the start of the window and ie tracking the end. This window is considered as a superinstruction by consulting the cursor. If the cursor ends up at a valid node (either state 1 or 2), the window is increased by moving the end ie to the next instruction, after which the process is repeated. Whenever a node with a superinstruction is encountered (state 1), this superinstruction opcode and the location of this instruction are saved in o and il respectively. This is the largest superinstruction found so far from the start of

the window. When the cursor indicates it has reached an invalid node (state 3), or the end of the input program I has been reached, no superinstruction can be found by increasing the window size. As such, the search starting from is is terminated. If o has been set, o is the largest superinstruction starting at is that can be substituted in, stretching from is to il. This superinstruction is then substituted into the program.
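The makeTree(S) step used by this algorithm can be sketched as a small trie, where each node maps a base opcode to a child node and optionally carries a superinstruction opcode. The class below is an illustrative assumption, not the actual QuickInterp data structure.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Trie of superinstructions: each node maps a base opcode to a child, and the
// node at the end of a superinstruction's opcode sequence carries its opcode.
final class TrieNode {
    final Map<String, TrieNode> children = new HashMap<>();
    String superOpcode; // null when the path from the root is not a superinstruction

    // makeTree(S): insert every superinstruction (opcode -> sequence of base opcodes).
    static TrieNode makeTree(Map<String, List<String>> superinstructions) {
        TrieNode root = new TrieNode();
        superinstructions.forEach((superOp, opcodes) -> {
            TrieNode node = root;
            for (String opcode : opcodes) {
                node = node.children.computeIfAbsent(opcode, k -> new TrieNode());
            }
            node.superOpcode = superOp;
        });
        return root;
    }
}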

4.5.3 Instruction placement using shortest path

The tree-based approach is still not perfect: a very short superinstruction spanning just two instructions may exist and be substituted into the program, but by performing that placement, a much longer superinstruction starting just one instruction later can no longer be placed, hurting performance. Superinstructions cannot be picked eagerly, as taking a superinstruction may prevent the allocation of a more optimal superinstruction. One can treat the input program as a directed graph where every instruction is an edge. For this example we will not consider incoming jumps in the input program; instead, assume jumps are dealt with in some other way. Outgoing jumps are treated as normal instructions, which will be discussed later. For every instruction, a node is created. Furthermore, a special exit node is created. For every instruction, an edge is added from the instruction's node to the node of the next instruction as it was in the input bytecode program, and this edge receives the instruction opcode as label. The last instruction does not have a next instruction, so instead this edge goes from the instruction's node to the exit node. Conditional jump instructions (instructions where the normal control flow may not continue, which also includes any instruction that can throw an exception) are not treated differently, as for the purpose of a superinstruction these are allowed to jump out of the superinstruction. Finally, all available superinstructions are added to this graph as edges that skip over the regular instruction edges to which they are equivalent.

1  iload 10     load x                         JVM Bytecode
2  iload 11     load y
3  imul
4  iload 12     load z
5  iload 12     load z
6  iload 11     load y
7  iadd
8  iadd
9  imul
10 iconst_2     load literal 2
11 irem

Listing 4.19: Short bytecode program (repeat of Listing 4.8)

super1: iload iload                            Superinstructions
super2: iload iload imul
super3: iload iload iadd iadd imul
super4: iadd imul iconst_2

Listing 4.20: A few superinstruction definitions

Figure 4.4: Listing 4.19 represented as a graph, with the instructions on the edges and the superinstructions of Listing 4.20 added

An example of this transformation when applied to Listing 4.19 is shown in Figure 4.4. Regular instructions are displayed in black on the right, while superinstructions are displayed in blue. Figure 4.4 immediately reveals something interesting: a lot of superinstructions overlap. By using an eager superinstruction placement algorithm like the triplet-based or tree-based substitution

algorithms seen earlier, interesting superinstructions like the large super3 instruction may be missed.

1 super2 10 11     load x·y                    JVM Bytecode
2 super1 12 12     load z and load z
3 iload 11         load y
4 iadd
5 iadd
6 super4           multiple operations
7 irem

Listing 4.21: Listing 4.19 after superinstruction substitution by the tree-based substitution algorithm

1 super2 10 11     load x·y                    JVM Bytecode
2 iload 12
3 super3 12 11     multiple operations
4 iconst_2
5 irem

Listing 4.22: Listing 4.19 after optimal superinstruction substitution

Listing 4.21 shows what the tree-based substitution algorithm would do with Listing 4.19 – it would fail to utilize the super3 superinstruction due to the placement of the super1 superinstruction. The optimal (least dispatches) solution is shown in Listing 4.22, where the temptation of super1 is overcome in order to use the super3 superinstruction. From the title of this subsection it may already be obvious how the result of Listing 4.22 can be obtained: by using a shortest path algorithm. In the graph representation of the program as shown in Figure 4.4, the number of nodes visited directly reflects the number of instruction dispatches required. The goal of using superinstructions is to minimize the number of instruction dispatches, and this is exactly what a shortest path algorithm can offer us when setting the distance of every edge to one. This enables the use of a simple breadth-first search algorithm to find the optimal combination of superinstructions. This placement algorithm shares the superinstruction tree used by the tree-based substitution algorithm. However, breadth-first search is used to decide which node to evaluate next starting from the "start" node, and no actual substitutions are made until breadth-first search first reaches the end node (labeled "end"). Whenever a node is visited, the number of steps required to reach that node is saved in the node, including the (super)instruction used to reach it. This way, once the algorithm terminates, the full path from the end back to the start can be traced back, and the instructions along this path represent the optimized bytecode program with the optimal superinstructions.
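For the jump-free setting discussed so far, the shortest-path substitution can be sketched as a breadth-first search over instruction positions: position i is the node before instruction i, the position one past the last instruction is the exit node, and every plain instruction or applicable superinstruction is an edge of length one. The class below and its representation of superinstruction edges are assumptions for illustration only, not QuickInterp's actual implementation.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Breadth-first search over instruction positions; the first time the exit node
// is reached, the path back to the start uses the fewest possible dispatches.
final class ShortestPathSubstitution {
    record Edge(String opcode, int covered) {} // covered = number of base instructions the edge spans

    static List<String> substitute(List<String> block, Map<Integer, List<Edge>> superEdges) {
        int n = block.size();
        int[] prevNode = new int[n + 1];
        String[] prevOpcode = new String[n + 1];
        boolean[] visited = new boolean[n + 1];
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        visited[0] = true;
        queue.add(0);
        while (!queue.isEmpty()) {
            int at = queue.poll();
            if (at == n) break;                            // exit node reached
            List<Edge> edges = new ArrayList<>();
            edges.add(new Edge(block.get(at), 1));         // the plain instruction at this position
            edges.addAll(superEdges.getOrDefault(at, List.of()));
            for (Edge e : edges) {
                int to = at + e.covered();
                if (to <= n && !visited[to]) {
                    visited[to] = true;
                    prevNode[to] = at;
                    prevOpcode[to] = e.opcode();
                    queue.add(to);
                }
            }
        }
        List<String> result = new ArrayList<>();           // trace the path back from the exit node
        for (int at = n; at != 0; at = prevNode[at]) {
            result.add(prevOpcode[at]);
        }
        Collections.reverse(result);
        return result;
    }
}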

Shortest path with jumps

Jumps were ignored in the example from Listing 4.19 and Figure 4.4. At runtime, no profiling information is available in QuickInterp, meaning that no decision can be made on whether a conditional jump will be taken or not. Therefore, the assumption in the shortest path algorithm is that a conditional jump is not taken. In other words, the superinstruction placement algorithm treats conditional jumps no differently than regular instructions. They may leave the fragment under consideration, but this is dealt with at the jump target. Instructions which can never be part of a superinstruction need not be considered, and similar to the compile-time construction of the base superinstruction candidates (section 4.4.2) this cuts the program to be processed into "blocks" of consecutive bytecode instructions. Jump targets are dealt with quite naively by the shortest-path runtime placement algorithm: incoming jumps are ignored for the purposes of finding the shortest path.

Some further optimization would be possible by taking the frequency of outgoing conditional jumps into consideration. Not all parts of the fragment under examination may see the same number of executions. Conditional jumps may leave the fragment, and incoming jumps may enter the fragment further down. The shortest path algorithm as presented does not take into account that some pieces may be executed more often than others, and that these pieces should therefore gain some kind of priority in receiving superinstructions. Some superinstruction placements conflict with

each other, as seen in Figure 4.4, and the shortest path algorithm looks only at the number of instructions (the length) of the superinstruction to decide the placement. Instead, it would be more optimal to also consider how often each superinstruction would end up getting (fully) executed.

1 int[] scores = {-1, -1, ...};                Java code
2 int MAX_SCORE = 255;
3
4 void set(int id, int score) {
5     if (score > MAX_SCORE) {
6         score = MAX_SCORE;
7     }
8     scores[id] = score;
9 }

Listing 4.23: An example Java method set(int,int) with a conditional jump. The input score is clamped to at most MAX_SCORE.

1  iload_2                                     JVM Bytecode
2  aload_0
3  getfield MAX_SCORE
4  if_icmple line 8                            (1)
5  aload_0
6  getfield MAX_SCORE
7  istore_2
8                                              (2) jump target
9  aload_0
10 getfield scores
11 iload_1
12 iload_2
13 iastore
14 return

Listing 4.24: The set(int,int) method from Listing 4.23 shown as bytecode

Consider the Java method in Listing 4.23. The set(int,int) method allows for saving a "score" into an array, and this score is clamped to at most 255. Entering a higher score will cause the score to be lowered to this maximum with a conditional jump (an if statement). Listing 4.24 shows the bytecode of this method, with (1) marking the if_icmple instruction (if-integer-compare-less-than-or-equal), which performs a conditional jump if the score is ≤ MAX_SCORE. The Java compiler has inverted the condition, jumping over the clamping code on line 6 of the Java program if the score is smaller than or equal to the maximum score. The jump target of the if_icmple instruction is marked with (2) in the bytecode program. Now let us assume that the profile shows that this conditional jump was always performed (the condition score > MAX_SCORE was always false, i.e. no score value exceeded 255). Such a scenario might emerge as a result of defensive programming practices, where a developer writes such clamping code to preserve the integrity of the data even though there is no reason to suspect values larger than 255 are ever provided. The fact that the clamping is never needed means that even if there are very good superinstructions dealing with the code in the if statement, their placement does not provide any performance benefit.

In Listing 4.25 two superinstructions are defined. These two superinstructions, when applied to the bytecode of the set(int,int) method from Listing 4.24, are exclusive. This can be seen in Figure 4.5 – taking super2 can only be done if super1 is not substituted in. What can also be seen in this figure is that the if_icmple conditional instruction is treated exactly like any other instruction. When applying the shortest path algorithm as seen earlier, it would choose super2. This instruction saves four dispatches instead of the meager two saved by substituting super1, so clearly it is better when the runtime profile is not considered. However, knowing that the if_icmple instruction – which is part of super2 – always jumps changes things, as this severely degrades the performance improvement of super2. Instead of saving four dispatches, super2 is never executed entirely: the if_icmple handler compiled into the superinstruction jumps out of the superinstruction, skipping the remaining handlers. In other words, it ends up saving almost no instruction dispatches, making the substitution hardly better than not using superinstructions at all. For a runtime substitution algorithm to recognize this situation and pick super1 instead of super2, profiling information needs to be available at runtime. This is not implemented in QuickInterp and remains future work. However, considering the information available to the runtime substitution algorithm with the current design, the shortest path algorithm finds an optimal solution.

super1:                                        Superinstructions
    iload_2
    aload_0
    getfield
super2:
    getfield
    if_icmple
    aload_0
    getfield
    istore_2

Listing 4.25: A few superinstruction definitions

Figure 4.5: Listing 4.24 represented as a graph, with the instructions on the edges and the superinstructions of Listing 4.25 added

super3:                                        Superinstructions
    getfield
    istore_2
    aload_0
    getfield
    iload_1
    iload_2
super4:
    aload_0
    getfield
    iload_1
    iload_2

Listing 4.26: A few superinstruction definitions

Figure 4.6: Listing 4.24 represented as a graph, with the instructions on the edges and the superinstructions of Listing 4.26 added

While performing optimal superinstruction placement gets tricky when conditional jump instructions are involved, dealing with jump targets is simpler. The shortest path algorithm takes a very naive approach by ignoring jump targets while running the shortest path search. However, this does not affect the performance of the substitutions, as we will see. Listing 4.26 defines two new superinstructions, which can once again be applied to the bytecode of the set(int,int) method from Listing 4.24. This yields Figure 4.6. For clarity, the jump target node of the if_icmple instruction has been annotated with "jump target" and is shown in red. It may appear at a glance that the two superinstructions – super3 and super4 – are once again exclusive. Substituting in super3 would, under the assumption that the if_icmple instruction never jumps, completely go over super4, and as such super4 would never end up getting executed. The shortest path algorithm would naively choose the super3 instruction as it is longer, and when the jump instruction does end up jumping to the jump target node (in red) it would be unable to use the super4 superinstruction as this was not substituted in. However, the substitution algorithm can substitute both these

instructions in – placing super4 is entirely possible due to the way superinstructions read their operands, as discussed in section 4.4.6. If the branch is taken, this instruction can be used. The algorithm thus deals with incoming jumps in a rather simple way: after having decided on a shortest path from the start node to the end node and placing the appropriate superinstructions, it continues by determining the shortest path from every jump target to the end node. If such a path includes any superinstructions that were not part of the main path, they are also substituted in. This continues until all incoming jumps have been examined.
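The driver loop for incoming jumps described above could look roughly as follows; the PathFinder abstraction and the Placement record are hypothetical stand-ins for the actual shortest-path search and its output.

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch: repeat the shortest-path search from every jump target and collect
// any superinstruction placements not already made on the main path.
final class JumpTargetPass {
    interface PathFinder { List<Placement> shortestPath(int fromPosition); }
    record Placement(int position, String superOpcode) {}

    static Set<Placement> substituteAll(PathFinder finder, List<Integer> jumpTargets) {
        Set<Placement> placed = new LinkedHashSet<>(finder.shortestPath(0)); // main path
        for (int target : jumpTargets) {
            placed.addAll(finder.shortestPath(target)); // set semantics: duplicates are kept only once
        }
        return placed;
    }
}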

4.6 Equivalent superinstructions

In the previous section (section 4.5), various runtime substitution algorithms were discussed. In section 4.5.2 the straightforward triplet-based substitution algorithm was presented, which was then improved upon with the introduction of the tree-based substitution algorithm. The tree-based algorithm is an improvement over the triplet-based algorithm because it needs fewer superinstructions within the superinstruction set to make the same substitutions, freeing up superinstructions to cover other sequences of bytecode. In this section we introduce another technique for reducing the number of required superinstructions, using instruction equivalence. If two or more superinstructions are equivalent, there is no need to add all of them to the superinstruction set. To detect and use equivalent superinstructions, an equivalence algorithm is introduced which can determine whether two sequences of bytecode are equivalent. To test equivalence it first constructs two Data Dependency Graphs (DDGs) – one for each input. Then, the algorithm uses graph coloring to test whether the two graphs are isomorphic. Finally, if the two graphs are isomorphic, the two sequences of bytecode are equivalent, allowing the two inputs to be interchanged. This equivalence algorithm is used in two places: at compile time and at runtime. At compile time, the equivalence algorithm is used to remove equivalent superinstructions from the superinstruction set, freeing up space for other instructions. At runtime, an improvement of the shortest-path based substitution algorithm from section 4.5.3 uses it to find superinstructions that are equivalent to subsequences of the input program.

4.6.1 Superinstruction equivalence

Let E(I,J) → {0, 1} be the equivalence function, where I, J are sequences of instruction opcodes with ∀i ∈ I : i ∈ B and ∀j ∈ J : j ∈ B (all instructions fed into the equivalence function are regular base JVM instructions from B, not superinstructions). E maps to 1 to indicate that I is equivalent to J, and maps to 0 otherwise.

1  E(I,J):                                    Pseudocode
2    Ib := makeBlockGraph(I)                  Step 1: create a graph of atomic blocks
3    Ia := computeAttributesInGraph(Ib)       Step 2: compute and assign barrier attributes
4    Ig := makeDataDependencyGraph(Ia)        Step 3: add edges based on barrier attributes
5
6    Jb := makeBlockGraph(J)                  Step 1
7    Ja := computeAttributesInGraph(Jb)       Step 2
8    Jg := makeDataDependencyGraph(Ja)        Step 3
9
10   (Igc, Jgc) := colorGraphs(Ig, Jg)        Standard graph coloring
11   for pi ∈ Igc:                            For all partitions in the colored graph
12     if ¬(∃pj ∈ Jgc : pi = pj)              Find a partition in Jgc that has the same nodes
13       return 0                             If this cannot be found they're not equivalent
14   return 1

Listing 4.27: Equivalence algorithm in pseudocode

Listing 4.27 shows the algorithm in pseudocode. For both inputs a graph of atomic blocks as nodes is created (Ib and Jb). The atomic blocks are sequences of instructions which must always be considered as a whole, and the mechanism behind this will be discussed in section 4.6.2.

In the computeAttributesInGraph(...) step, the data dependencies of each block are analyzed and exposed as attributes: a canonicalization of the data dependencies of the block. The block graph, now enriched with these attributes, is made available as Ia and Ja. We will discuss these attributes in section 4.6.2 and their use in tracking data dependencies. With the canonicalized data dependencies available, the final step for each graph is to connect blocks to earlier blocks in the program based on data dependencies, which is done in makeDataDependencyGraph(...). This creates Ig and Jg, which are the data dependency graphs of I and J respectively. The data dependency graphs must be isomorphic to each other in order for the inputs to be considered equivalent by the algorithm. Standard graph coloring is used in the colorGraphs(Ig, Jg) step to partition the nodes (the atomic blocks). If all these partitions are equal, the graphs are isomorphic and thus equivalent. E(I,J) effectively takes two ordered sequences of bytecode instructions (opcodes) and determines their equivalence. It can be applied to a sequence of input program code under consideration (I) and a superinstruction s (taking J to be the opcode sequence of s), as the superinstruction is an ordered sequence of bytecode instructions by itself. However, the function E(I,J) can also be applied to two superinstruction candidates to determine equivalence during instruction set generation. We create a preprocessor component as seen in section 4.4.4 to remove equivalent superinstruction candidates during superinstruction set construction, reducing the number of iterations needed to approach the optimal superinstruction set.
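The top-level flow of this check can be sketched in Java as follows. The Graph and DdgBuilder types are hypothetical placeholders for the block-graph and coloring machinery described above, not the actual QuickInterp classes:

import java.util.*;

// Sketch of E(I, J): build both DDGs, color them, and compare the partitions.
// Partitions are represented as sets of canonical node labels so that partitions
// from the two graphs can be compared directly.
class EquivalenceCheck {
    static boolean equivalent(int[] opcodesI, int[] opcodesJ, DdgBuilder builder) {
        Graph gI = builder.buildDdg(opcodesI);   // steps 1-3 for I
        Graph gJ = builder.buildDdg(opcodesJ);   // steps 1-3 for J
        Map<Integer, Set<String>> partitionsI = builder.colorPartitions(gI);
        Map<Integer, Set<String>> partitionsJ = builder.colorPartitions(gJ);
        // Equivalent iff every color class in one graph also occurs in the other.
        return new HashSet<>(partitionsI.values()).equals(new HashSet<>(partitionsJ.values()));
    }
}

interface Graph {}
interface DdgBuilder {
    Graph buildDdg(int[] opcodes);                          // makeBlockGraph + attributes + DDG edges
    Map<Integer, Set<String>> colorPartitions(Graph g);     // color -> canonical node labels
}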

Example

To gain an initial understanding of what equivalence can look like, and to lay the groundwork for understanding how a data dependency graph can be constructed, let us consider two equivalent sequences of bytecode and examine how and why they are equivalent. Equivalence becomes more prevalent as superinstructions get longer, as the number of permutations which yield correct and equivalent code increases; however, even these small sequences can already show the property.

1 iconst_4    loads literal 4
2 istore_1    stores into slot 1
3 iconst_5    loads literal 5
4 istore_2    stores into slot 2

Listing 4.28: Code compiled from int x = 4; int y = 5;

1 iconst_5    loads literal 5
2 istore_2    stores into slot 2
3 iconst_4    loads literal 4
4 istore_1    stores into slot 1

Listing 4.29: Code compiled from int y = 5; int x = 4;

In Listing 4.28 and Listing 4.29 two such equivalent bytecode sequences are shown. It is easy to see why when looking at the code they were compiled from: both Listing 4.28 and Listing 4.29 are compiled from the same two statements x = 4; and y = 5;. There are no data dependencies between these statements – that is, they do not in any way refer to the same data. This lack of data dependencies is also visible in the bytecode: the x = 4; statement writes using istore_1 into local variable table slot 1, while the y = 5; statement uses istore_2 to write to local variable table slot 2. There is also no data dependency via the operand stack, as it reaches a depth of zero between the two statements (between line 2 and 3 in both Listing 4.28 and Listing 4.29). Note that even though these two sequences may be compiled from literally the same statements in different orders, the name of a variable is not retained in JVM bytecode. In other words, the two statements p = 4; q = 5; could yield the same bytecode as Listing 4.28 if they were assigned to the same local variable table slots by the Java compiler. The types of p and q need not even be ints, e.g. byte p = (byte) 4; short q = (short) 5; would compile to the same code as Listing 4.28. This is because the JVM operand stack represents short, char and byte primitives as int, making their bytecode instructions the same if p or q was any of those types. Finally, initialization is not explicit in JVM bytecode, meaning that x = 4; y = 5; compiles to the same bytecode as int x = 4; int y = 5;.

4.6.2 Discovering data dependencies

While the two code fragments from the previous section showed an example of what equivalence may look like, detailing why the two statements have no data dependencies, there are more places that can cause data dependencies. To create the DDG from the data dependencies, all sources of data dependencies must be traced. On the JVM, such data dependencies are caused by three categories of operations:

1. Local variable table operations. Java stack variables are transformed to indices in the local variable table by the Java compiler, and instructions can read and write from and to these slots.
2. Operand stack operations. Almost all Java instructions read or write to and from the operand stack.
3. Field operations, method calls, array reads, array writes, jumps and exceptions. Reading or writing from or to a field or array also creates a data dependency. Furthermore, reordering may not be possible because an instruction has side effects of any sort. Such a side effect may be a (conditional) jump, where moving the instruction would change the semantics of the program, but also method calls (which can do their own writes to fields) cannot be moved.

These three types each have their own complexities for tracing the data dependencies. The QuickInterp equivalence algorithm only constructs a data dependency graph of local variable table dependencies, simplifying the algorithm. However, in order to do this correctly dependencies via the operand stack still must be dealt with somehow. The basic idea is to group dependent stack operations together into an atomic block. As discussed in section 4.6.1, these blocks are always moved as a whole, and permit reordering based solely on local variable table dependencies. First, we’ll look at what causes local variable table dependencies and what implication each dependency has. Then, the operand stack is discussed and how based on the use of the operand stack atomic blocks can be created. Finally, the third category of instructions is briefly discussed: the field operations, method calls and other instructions which cannot be moved due to their side effects.

Local variable table dependencies

In JVM bytecode, the first four local variable table indices have dedicated store instructions – e.g. for integers there are istore_0, istore_1, istore_2 and istore_3. To store an integer at a larger local variable table index, the general istore instruction is used. As mentioned before, superinstructions are always just concatenations of the instruction handlers: a superinstruction with e.g. istore 10 cannot exist. Instead, it would be a superinstruction with just istore, reading the "10" from the bytecode stream, which means the 10 is also unavailable for determining data dependencies.

1 iconst_4    loads literal 4
2 istore 20   stores into slot 20 (x)
3 iconst_5    loads literal 5
4 istore 30   stores into slot 30 (y)

Listing 4.30: Code compiled from int x = 4; int y = 5;

1 iconst_5    loads literal 5
2 istore 8    stores into slot 8 (y)
3 iconst_4    loads literal 4
4 istore 9    stores into slot 9 (x)

Listing 4.31: Code compiled from int y = 5; int x = 4;

This has consequences for what can be considered equivalent. Consider Listing 4.30 and Listing 4.31, which show sequences of bytecode instructions once again compiled from the same two statements in a different order. What can be seen here is that the Java compiler assigned higher slot numbers to x and y – slots 20 and 30 in Listing 4.30 and slots 9 and 8 in Listing 4.31 for x and y respectively. At first glance it may appear that these two sequences of bytecode cannot be

equivalent, but remember that, when transformed to a superinstruction, the instruction operands are not included. So the actual local variable table slots are not considered, only the sequence of opcodes. In that case the two listings start to look more similar, but they still may not be considered equivalent.

1 iconst_4
2 istore      stores into any slot
3 iconst_5
4 istore      stores into any slot

Listing 4.32: Superinstruction derived from Listing 4.30

1 iconst_5
2 istore      stores into any slot
3 iconst_4
4 istore      stores into any slot

Listing 4.33: Superinstruction derived from Listing 4.31

In Listing 4.32 and Listing 4.33 two superinstructions derived from the earlier listings can be seen – sequences of bytecode without instruction operands. One crucial observation can be made from these two superinstructions: the istore operations no longer operate on a fixed local variable table slot; instead, each may operate on any local variable table slot. It is indeed possible for the superinstruction in Listing 4.32 to refer to the same local variable table slot twice – storing first a 4 there, then overwriting it with a 5. Such code is not impossible – on the contrary, it is the Java compiler output for the statements int x = 4; x = 5;. While it is a little odd to see two consecutive assignments to the same variable within a Java program, there is nothing illegal about it. With the observation that both istore instructions may be writing to the same slot, it becomes clear that the two superinstructions in Listing 4.32 and Listing 4.33 are not equivalent. Substituting int x = 5; x = 4; for int x = 4; x = 5; changes the semantics of the input program, as the value of x at the end of the two statements is not the same. As such, when a general istore instruction is encountered, it effectively creates a data dependency to every single local variable table slot – it could write to any slot. This is not limited to istore or other store instructions, but also applies to general load instructions and the iinc instruction. One way to think about equivalence is to consider it a reordering. The algorithm QuickInterp uses is limited to reordering, i.e. it does not support equivalence where a different set of instructions reaches the same result. With this limitation in place, two sequences of bytecode are not considered equivalent if they do not have the same number of instructions of the same types. For example, one sequence of bytecode with three dneg instructions will not be considered equivalent to another sequence with just one dneg instruction. This constraint allows a one-to-one mapping from the instructions of the first sequence of bytecode instructions to those of the second sequence.
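Because the algorithm is limited to reordering, a cheap pre-check is possible before any graph is built: the two sequences must contain exactly the same multiset of opcodes. A small sketch of such a check (plain Java, no QuickInterp types assumed):

import java.util.*;

class OpcodeMultisetCheck {
    // Returns false immediately if the two sequences cannot be reorderings of each
    // other, i.e. they differ in the number of occurrences of at least one opcode.
    static boolean sameOpcodeMultiset(int[] a, int[] b) {
        if (a.length != b.length) return false;
        Map<Integer, Integer> counts = new HashMap<>();
        for (int opcode : a) counts.merge(opcode, 1, Integer::sum);
        for (int opcode : b) {
            Integer remaining = counts.get(opcode);
            if (remaining == null || remaining == 0) return false;
            counts.put(opcode, remaining - 1);
        }
        return true;
    }
}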

A visualization of the reordering between the two superinstructions from the earlier listings can be seen in Listing 4.34 and Listing 4.35.

1 iconst_4
2 istore
3 iconst_5
4 istore

Listing 4.34: Superinstruction derived from Listing 4.30

1 iconst_5
2 istore
3 iconst_4
4 istore

Listing 4.35: Superinstruction derived from Listing 4.31

While this mapping relation may seem obvious, it was previously established that these two are not equivalent due to the istore potentially writing to the same slot twice. What these two listings also help to reveal is the general mechanism that dictates data dependency incompatibility. The istore instruction from line 2 in Listing 4.34 "crosses over" the istore from line 4 of the same listing – their mapping edges overlap. This is not allowed for the istore instruction, as the

last write to a given local variable table index ends up being visible. To formulate the general rule: a store instruction may not be moved across another store instruction. However, this is not the only rule:

• Read instructions that read data from the local variable table. These may not be moved over any write instructions, but read instructions may be moved over other read instructions of any opcode, as reading does not mutate the data. The read instruction opcodes are:

  – aload
  – dload
  – fload
  – iload
  – lload
  – ret, which reads from the local variable table, but cannot be part of a superinstruction anyway and as such is not considered

• Write instructions that write data to the local variable table. These may not be moved over write instructions and read instructions, because the last write becomes visible after the superinstruction as discussed earlier. The write instruction opcodes are:

  – astore
  – dstore
  – fstore
  – istore
  – lstore
  – iinc
  – jsr, which writes to the local variable table (counterpart of ret), but cannot be part of a superinstruction and as such is not considered

• Specialized read instructions are read instructions which read from a known local variable table index. These may not be moved across (1) regular write instructions, and (2) specialized write instructions that access the same local variable table slot. They may however be moved across other specialized write instructions which are for a different local variable table slot, and can be moved across any read instruction. These instruction opcodes and their families are:

  – aload family: aload_0, aload_1, aload_2 and aload_3
  – dload family: dload_0, dload_1, dload_2 and dload_3
  – fload family: fload_0, fload_1, fload_2 and fload_3
  – iload family: iload_0, iload_1, iload_2 and iload_3
  – lload family: lload_0, lload_1, lload_2 and lload_3

• Specialized write instructions are write instructions which write to a known local variable table index, similar to the specialized read instructions. These may not be moved across (1) regular write instructions, (2) regular read instructions, (3) specialized write instructions with the same local variable table index, and finally (4) specialized read instructions with the same local variable table index. The instruction opcodes and their families are:

  – astore family: astore_0, astore_1, astore_2 and astore_3
  – dstore family: dstore_0, dstore_1, dstore_2 and dstore_3
  – fstore family: fstore_0, fstore_1, fstore_2 and fstore_3
  – istore family: istore_0, istore_1, istore_2 and istore_3
  – lstore family: lstore_0, lstore_1, lstore_2 and lstore_3
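These rules can be collapsed into a small pairwise check. The following Java sketch is illustrative only (the Kind abstraction and slot encoding are not the actual QuickInterp representation); note that the data type of the instruction is deliberately ignored, for the reason explained below:

// Sketch of the pairwise reordering rules for local variable table accesses.
// An instruction is abstracted to a kind plus an optional known slot: 0-3 for the
// specialized forms, -1 for the generalized forms (which may touch any slot).
// iinc counts as a generalized store (kind STORE, slot -1).
class ReorderRules {
    enum Kind { LOAD, STORE, OTHER }

    static boolean mayReorder(Kind a, int slotA, Kind b, int slotB) {
        if (a == Kind.OTHER || b == Kind.OTHER) return true; // barriers are handled separately
        if (a == Kind.LOAD && b == Kind.LOAD) return true;   // reads never conflict with reads
        // At least one side is a store: only safe when both slots are known and
        // provably different; a generalized form (slot -1) may alias any slot.
        if (slotA < 0 || slotB < 0) return false;
        return slotA != slotB;
    }
}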

One constraint may seem odd considering the semantics of the JVM: a store operation of type x (e.g. istore) may not be moved across a store operation of type y (e.g. fstore) when x ≠ y. The JVM is a typesafe VM and disallows reinterpreting the same data in a local variable table slot. If the type of a local variable table slot is, say, an int, the verifier will block any code from loading that attempts to read the slot as if it were a float. As such, it may appear as if two type-incompatible store operations cannot refer to the same slot, and that there is therefore no harm in reordering them. However, this is not the case: the JVM allows the reuse of a slot. JVM slots are reused by writing to them using a store operation of a particular type, and this changes the slot to that type. Since the write overwrites the original data in the slot, no reinterpretation occurs. As a result, it is possible for a sequence of bytecode instructions to treat one slot first as an int and apply an istore operation on it, and then reuse it as a float by writing to it using fstore. This behavior forces the reordering algorithm to disallow reordering load and store instructions across other store operations, even when the types are incompatible. In section 4.6.3 we will discuss how these properties are used to create attributes that generalize the data dependency relation. These attributes have special rules that describe data dependency relations and allow adding data dependency edges to create a data dependency graph. However, first another source of data dependencies needs to be discussed: the operand stack.

Operand stack dependencies

The JVM operand stack is used for expressions – by having a standard place from which values can be read and to which values can be written, it is possible to keep the bytecode compact. Instructions like iadd do not need two local variable table indices for their input and one for their output in the bytecode stream. Instead, the iadd instruction simply reads (pops) two values from the operand stack, and pushes the result back onto the operand stack. The iload and istore instructions can be used to transfer values between the operand stack and the local variable table, but this is not the only source of values. Values may come from constants (e.g. bipush), field accessors, invoke instructions and other instructions. Java expressions without variables do not use the local variable table regardless of how large the expression is: the intermediate values within the expression are kept on the operand stack and are never stored in the local variable table. Our equivalence algorithm does not trace operand stack data dependencies: instead, instructions where the operand stack is not empty are grouped together into an atomic block. As discussed before, these atomic blocks are only ever moved as a whole, saving the complexity of tracing operand stack dependencies. We will take a look at why neglecting to trace data dependencies via the operand stack is not a huge loss. Additionally, performing data dependency analysis on a graph of atomic blocks creates a new challenge: figuring out where these atomic blocks start and end.

1 int a = x * y + 10;

Listing 4.36: A Java expression

1 iload 7     load x
2 iload 8     load y
3 imul
4 bipush 10   load literal 10
5 iadd
6 istore 9    store a

Listing 4.37: Bytecode compiled from 4.36

1 int b = 10 + x * y;

Listing 4.38: Another Java expression

1 bipush 10   load literal 10
2 iload 7     load x
3 iload 8     load y
4 imul
5 iadd
6 istore 10   store b

Listing 4.39: Bytecode compiled from 4.38

To get an idea of how the JVM operand stack works and how it is used, consider the two expressions in Listing 4.36 and Listing 4.38, and their respective bytecode in Listing 4.37 and

Listing 4.39. Before comparing the two examples, let us first focus on Listing 4.36 and its bytecode in Listing 4.37. Two values are loaded onto the operand stack using iload, which are then multiplied on line 3 in Listing 4.37. The multiplication instruction pops the two values and pushes the product of the two variables back onto the operand stack. The bipush instruction is used to push a literal 10 onto the operand stack, which is then added to the multiplication result already present on the stack. This leaves just one value on the operand stack – the sum. This is then stored by the istore in the local variable table slot of a, which has apparently been assigned to slot 9 by the Java compiler. The code in Listing 4.38 and Listing 4.39 is very similar but has some subtle differences. Observe that all values are first pushed onto the operand stack – the literal 10, the value of x and the value of y are all present on the operand stack at the same time. Then, they are aggregated: first an imul instruction multiplies the values of x and y. Finally, the literal and the result of this multiplication are summed by the iadd instruction, leaving just the sum to be stored with the istore instruction. Looking at the expressions and the code each expression produced, the order makes a lot of sense. The bytecode instructions are in Reverse Polish Notation (RPN) – the expression from Listing 4.36 can be written as x y · 10 + in RPN. Besides the transformation to RPN, the two Java expressions have been translated to bytecode without any optimization passes. The notation is a serialization of the Abstract Syntax Tree (AST). Without diving too far into the complexities of compiler front-ends, it is easy to see that the two expressions in Listing 4.36 and Listing 4.38 are equivalent. However, this equivalence is due to the symmetry of the iadd instruction: ∀a, b : a + b = b + a. In QuickInterp, this property and other mathematical properties of arithmetic instructions are not used to determine equivalence. With that restriction in place, the RPN for a given expression becomes unique. This has some interesting consequences for equivalence – two expressions are only equivalent if they are exactly the same. The expressions are effectively "atoms" within a larger superinstruction and can be considered as if they were their own instruction. This is why there is little gain in tracing data dependencies via the operand stack: the Java compiler is likely to leave the operand stack empty between statements, and will not place an independent expression right in the middle of another. As such, two equivalent expressions compiled by the Java compiler will always produce the exact same sequence of instructions. The only exception is when the two expressions are equivalent due to a mathematical property of one of the instructions, e.g. the symmetry of iadd. However, this detection is not implemented in the equivalence algorithm either.

1 int a = x + 2;
2 int b = x * 14;

Listing 4.40: Two Java assignments with expressions

1 iload_3     load x
2 iconst_2    load literal 2
3 iadd
4 istore_1    store a
5 iload_3     load x
6 bipush 14   load literal 14
7 imul
8 istore_2    store b

Listing 4.41: Bytecode compiled from 4.40

1 int b = x * 14;
2 int a = x + 2;

Listing 4.42: Two Java assignments with expressions

1 iload_3     load x
2 bipush 14   load literal 14
3 imul
4 istore_2    store b
5 iload_3     load x
6 iconst_2    load literal 2
7 iadd
8 istore_1    store a

Listing 4.43: Bytecode compiled from 4.42

How expressions can be treated as atoms for the purpose of equivalency can be seen in Listing 4.40 and Listing 4.42, where two equivalent Java assignments are shown. These code listings can

only be equivalent if the equivalence algorithm can determine that the variables a ≠ b, a ≠ x and b ≠ x, and it can when looking at the bytecode for these listings in Listing 4.41 and Listing 4.43. This is because the specialized store and load operations have been used by the Java compiler: istore_1 for a, istore_2 for b and iload_3 for x. Note that the expressions themselves are not just equivalent – they are the same. The two expressions have been reordered but otherwise match exactly. The operand stack can be very deep – even though instructions like iadd only pop two items, the JVM limits the depth of the operand stack at 2^16 − 1 [7]. This makes it entirely possible to create bytecode that "interleaves" an expression with a completely independent expression.

1 iload_3 load x Bytecode 2 iconst_2 load literal 2 3 iload_3 load x 4 bipush 14 load literal 14 5 imul 6 istore_2 store b 7 iadd 8 istore_1 store a

Listing 4.44: Two mixed expressions, equivalent to Listing 4.41 and Listing 4.43

Take for example the bytecode of Listing 4.44. It is equivalent to the earlier listings, and one of the expressions which was just declared atomic (a = x + 2) has been split up, placing the other expression (b = x * 14) right in the middle. This is possible in JVM bytecode because the operand stack is exactly that – a stack. As the bytecode for b = x * 14 leaves the operand stack completely balanced (pushes just as many items as it pops), it does not destroy or overwrite work done by the few instructions from the a = x + 2 expression. However, the Java compiler will not generate such code, as it is not valid Java to mix an unrelated expression-assignment within another expression. As such, the QuickInterp equivalence algorithm assumes that expressions where the operand stack is used can be seen as atomic blocks that are only ever moved as a whole, and this is in line with the code generation performed by the Java compiler. As mentioned, making this assumption greatly simplifies tracing data dependencies via the stack – there simply are none, because an atomic block of instructions always leaves the operand stack balanced, and as such all stack data dependencies are resolved within the block. While it is trivial to create examples like the equivalence between Listing 4.41 and Listing 4.44 where this assumption is broken, there is no harm in incorrectly classifying two equivalent sequences of bytecode as non-equivalent. At worst, the program fails to utilize the equivalence for a performance boost due to needing extra superinstructions, but the input program after superinstruction substitution stays correct. The exact mechanism for constructing these atomic blocks is discussed in a moment; however, it is important to observe that exactly between the two expressions in Listing 4.41 and Listing 4.43 the operand stack depth reaches 0 (relative to the start of the superinstruction). As such, the basic idea for constructing these atomic blocks is to find out where the operand stack reaches a depth of zero, as this indicates a cut point between multiple expressions. This is done by using an abstract interpreter.

1  markAtomicBlocks(I):                  Pseudocode
2    d := 0                              Track stack depth starting at d = 0 (relative)
3    dmin := d                           Running minimum
4    dmax := d                           Running maximum
5    for i ∈ I:
6      d -= stackPops(i)                 Subtract the number of values i pops
7      dmin := min(dmin, d)
8      d += stackPushes(i)               Add the number of values i pushes
9      di := d                           Track the depth for this instruction
10     dmax := max(dmax, d)

Listing 4.45: Abstract interpreter for marking expressions

The abstract interpreter can be seen in Listing 4.45. An abstract interpreter operates on types instead of on values, and in this case all it does is keep track of the stack depth, ignoring all outgoing jumps. The abstract interpreter is started at the beginning of the superinstruction, and for every instruction it encounters it determines how many values the instruction pops and how many values the instruction pushes, tracking the depth of the operand stack relative to the beginning of the superinstruction. The stack depth after a given instruction i is stored in di, which will be used later on.
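For concreteness, a direct Java transcription of Listing 4.45 could look as follows. The OpcodeInfo helper is a hypothetical stand-in for a table of pop/push counts per opcode; instructions whose stack behavior depends on their operands (such as invoke instructions) are not handled, as noted later in this section:

// Sketch of the abstract interpreter from Listing 4.45 in Java. The depth after
// each instruction (di) is recorded so a later pass can determine the cut points.
class StackDepthScan {
    int dMin = 0, dMax = 0;   // running minimum and maximum relative depth
    int[] depthAfter;         // di per instruction

    void scan(int[] opcodes, OpcodeInfo info) {
        depthAfter = new int[opcodes.length];
        int d = 0;
        for (int i = 0; i < opcodes.length; i++) {
            d -= info.stackPops(opcodes[i]);    // values popped by this instruction
            dMin = Math.min(dMin, d);           // lowest depth any value was read from
            d += info.stackPushes(opcodes[i]);  // values pushed back onto the stack
            depthAfter[i] = d;
            dMax = Math.max(dMax, d);
        }
    }
}

interface OpcodeInfo {
    int stackPops(int opcode);
    int stackPushes(int opcode);
}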

1 iload_3     load x
2 iconst_2    load literal 2
3 iadd
4 istore_1    store a
5 iload_3     load x
6 bipush 14   load literal 14
7 imul
8 istore_2    store b

Listing 4.46: Repeat of the bytecode compiled from 4.40

Figure 4.7: Graph showing the effect of instructions on the (relative) stack depth

In Figure 4.7, the relative depth of the stack for Listing 4.46 can be seen. For every instruction from Listing 4.46 two nodes are shown: an instruction pops zero or more values from the operand stack, which is shown by the first node, and pushes zero or more values back onto the stack, which is shown by the second node. The algorithm from Listing 4.45 saves the depth at the second node into variable di, where i is the instruction ∈ I. Note that even though the iadd instruction from line 3 pops two operands from the operand stack and with that lowers the stack depth to 0, it still ends with di = 1, because the same instruction also pushes its result again – this is reflected in the second node. Observe that since expressions themselves are balanced, the depth of the operand stack returns to 0 (di = 0) between expressions. For this reason, after the istore_1 instruction the stack depth is 0 (di = 0), as the first expression has been completed. Then, the second expression uses the operand stack to once again return to a depth of 0 after the istore_2 instruction. As mentioned before, this characteristic of expressions compiled by javac forms the basis of determining the atomic blocks, allowing another algorithm to cut the superinstruction into pieces based on where di is zero. Now, there is no reason why a superinstruction itself should leave the operand stack balanced – indeed, a superinstruction may be a concatenation of just two pop instructions or two dup instructions. This is why the algorithm from Listing 4.45 keeps track of the minimum value dmin. dmin is the lowest depth the operand stack had relative to the depth at the start of the superinstruction, and as such it is completely legal for it to be a negative number. An example is the superinstruction pop pop, which will have dmin = −2 as each pop instruction removes one element from the operand stack. The block construction algorithm makeBlockGraph(...) therefore does not consider cut points where the relative stack depth is zero; instead, it considers cut points where the relative stack depth is dmin. Due to the way dmin is tracked in Listing 4.45, it is updated after an instruction has popped values from the operand stack, but before any values it pushes have been counted, whereas di for a given instruction i is the depth of the operand stack after both pops and pushes. The iadd instruction visible in Figure 4.7, for example, already set dmin

to 0, even while its di ends at 1. dmin reflects the lowest relative stack offset from the start of the superinstruction from which a value was accessed in the superinstruction, while di for a given instruction i tracks the relative stack offset as it is after the instruction i. It is even possible to obtain a dmin which is lower than all di within a particular superinstruction, for example in the superinstruction dup pop. The dup instruction pops one value and then pushes that value twice onto the operand stack. As such, it sets dmin = −1, while the relative operand stack depth after any instruction never reaches below 0 (di = 1 after dup, di = 0 after pop). If the sequence dup pop were part of a larger superinstruction it could still not become an atomic block on its own, even though it leaves the operand stack balanced. This is because it does have a data dependency via the stack. The whole point of grouping instructions together into atomic blocks is that there are no external operand stack dependencies between them. Since the dup instruction pops one element and then pushes it twice, it effectively reads a value from a relative stack depth of −1. This is detected by setting dmin = −1 when visiting the dup instruction, meaning that cuts are only performed where di = dmin = −1. No instruction within just the dup pop sequence has di = dmin, and as such this sequence is not cut into its own block, correctly detecting this data dependency. With that, the definition of our block graph construction algorithm (makeBlockGraph(...)) is complete. By recording the depth of the operand stack, it is possible to isolate sections of bytecode (atomic blocks) which have no external data dependencies via the operand stack, allowing them to be moved as a whole. Note that this algorithm does not have access to the instruction operands – it can only see the opcodes themselves. Some instructions, like tableswitch or invokestatic, pop a variable number of values from the operand stack depending on their instruction operands.
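The cutting step itself can then be sketched as follows, building on the depthAfter array and dMin from the earlier sketch. The special start and end blocks described in section 4.6.3 are omitted for brevity, and the names are illustrative rather than the actual QuickInterp API:

import java.util.*;

// Sketch: cut the instruction sequence into atomic blocks at every position where
// the relative stack depth equals dMin, per the rule described above.
class BlockCutter {
    static List<int[]> cutBlocks(int[] opcodes, int[] depthAfter, int dMin) {
        List<int[]> blocks = new ArrayList<>();
        int blockStart = 0;
        for (int i = 0; i < opcodes.length; i++) {
            if (depthAfter[i] == dMin) {
                blocks.add(Arrays.copyOfRange(opcodes, blockStart, i + 1));
                blockStart = i + 1;
            }
        }
        if (blockStart < opcodes.length) {
            // Trailing partial expression: kept as its own block, which a later pass
            // would mark FULL_BARRIER so that it is never reordered.
            blocks.add(Arrays.copyOfRange(opcodes, blockStart, opcodes.length));
        }
        return blocks;
    }
}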

Other operations

Before diving into how these atomic "expression" blocks in the block graph are used together with the local variable table data dependencies, there is one more category of data dependencies that has to be discussed. This category contains the field operations, method calls, array reads, array writes, jumps and exceptions. The effects of these kinds of data dependencies are so profound that the instructions cannot be reordered in any way. Instead, these kinds of instructions form a sort of "barrier" – reordering can happen within the superinstruction, but instructions cannot be lifted over this class of instructions.

• (Conditional) jumps and exceptions: since these change the control flow, store operations cannot be moved over such jumps as it would affect what is visible after the jump. For example, if a store instruction is moved before a jump, the jump target will see the updated value. While a more advanced equivalence algorithm may work together with an advanced substitution algorithm to recognize this case and detect whether the jump target actually uses this stored value, the QuickInterp algorithm only ever considers equivalence when looking at two sequences of bytecode "in a vacuum" (without context). As such it has to disallow moving any kind of jump.

• Array reads and array writes can cause exceptions (ArrayIndexOutOfBoundsException, NullPointerException), and as such they must be treated just like jumps.

• Method calls likewise can cause exceptions, and also cannot be part of superinstructions anyway.

• Non-static field reads and writes can cause NullPointerExceptions, as they always have to be performed on an object, as well as security exceptions.

• Finally, static field reads and writes can cause class initialization (which can throw exceptions) and a host of security exceptions, which similarly disallows reordering them.

As such, none of the instructions in this category support reordering, and all act like a full barrier. Note that these instructions can be mixed into a regular expression, e.g. instead of a bipush there might be a getfield instruction in Listing 4.41. After all, it is perfectly legal to use

a field within a Java expression, and the same goes for a method call (although those cannot be part of a superinstruction anyway).

4.6.3 Data Dependency Graph Construction

It is finally time to put all the data dependency categories together and explain how the computeAttributesInGraph and makeDataDependencyGraph algorithms from Listing 4.27 work. The computeAttributesInGraph algorithm assigns a set of attributes to each block in the block graph, based on data dependencies as mentioned in section 4.6.1. The attributes serve as an abstraction, a canonicalization of what kind of data-related actions the instructions within the atomic block perform, simplifying the process of constructing the data dependency graph. These attributes are used by the makeDataDependencyGraph algorithm to wire up the complete DDG. Before discussing the attributes, let us revisit the categories of instructions affecting data dependencies as seen in the previous section:

• Instructions accessing the local variable table have specific constraints which allow some of them to be reordered.
• Instructions jumping or throwing exceptions (which is a rather large category) work as barriers and cannot be reordered.
• Instructions using the operand stack are grouped together into atomic blocks which have to be moved as a whole. These blocks can contain instructions using the local variable table and instructions throwing exceptions, but by moving them as a whole they do not have external data dependencies via the operand stack that have to be dealt with.

The goal of each attribute is to dictate what data dependencies the block it is assigned to has. As such, each attribute defines a kind of must-happen-before relation that is used by the makeDataDependencyGraph algorithm from Listing 4.27. The blocks are found using the atomic block marker algorithm from Listing 4.45. The attributes are:

FULL_BARRIER : no blocks of any kind (with any attribute) may be moved over blocks with this attribute. In other words, all blocks with any attribute that was already happening before this block, must happen before this block. This is assigned to any block with an instruction that jumps or throws exceptions.

STORE_BARRIER_ANY : blocks with FULL_BARRIER, STORE_BARRIER_ANY, STORE_BARRIER_0, STORE_BARRIER_1, STORE_BARRIER_2, STORE_BARRIER_3, LOAD_BARRIER_ANY, LOAD_BARRIER_0, LOAD_BARRIER_1, LOAD_BARRIER_2 or LOAD_BARRIER_3 must happen before this block. This is assigned to blocks with instructions in the generalized store family of instructions (e.g. istore or fstore, but not istore_1), and also to the iinc instruction.

STORE_BARRIER_0 : blocks with FULL_BARRIER, STORE_BARRIER_ANY, STORE_BARRIER_0, LOAD_BARRIER_ANY or LOAD_BARRIER_0 must happen before this block. This is assigned to blocks with instructions in the specialized store_0 family of instructions (e.g. istore_0 or fstore_0).

STORE_BARRIER_1 : blocks with FULL_BARRIER, STORE_BARRIER_ANY, STORE_BARRIER_1, LOAD_BARRIER_ANY or LOAD_BARRIER_1 must happen before this block. This is assigned to blocks with instructions in the specialized store_1 family of instructions (e.g. istore_1 or fstore_1).

STORE_BARRIER_2 : blocks with FULL_BARRIER, STORE_BARRIER_ANY, STORE_BARRIER_2, LOAD_BARRIER_ANY or LOAD_BARRIER_2 must happen before this block. This is assigned to blocks with instructions in the specialized store_2 family of instructions (e.g. istore_2 or fstore_2).

STORE_BARRIER_3 : blocks with FULL_BARRIER, STORE_BARRIER_ANY, STORE_BARRIER_3, LOAD_BARRIER_ANY or LOAD_BARRIER_3 must happen before this block. This is assigned to blocks with instructions in the specialized store_3 family of instructions (e.g. istore_3 or fstore_3).

LOAD_BARRIER_ANY : blocks with FULL_BARRIER, STORE_BARRIER_ANY, STORE_BARRIER_0, STORE_BARRIER_1, STORE_BARRIER_2 or STORE_BARRIER_3 must happen before this block. This is assigned to blocks with instructions in the generalized load family of instructions (e.g. iload or fload, but not iload_1).

LOAD_BARRIER_0 : blocks with FULL_BARRIER, STORE_BARRIER_ANY or STORE_BARRIER_0 must happen before this block. This is assigned to blocks with instructions in the specialized load_0 family of instructions (e.g. iload_0 or fload_0).

LOAD_BARRIER_1 : blocks with FULL_BARRIER, STORE_BARRIER_ANY or STORE_BARRIER_1 must happen before this block. This is assigned to blocks with instructions in the specialized load_1 family of instructions (e.g. iload_1 or fload_1).

LOAD_BARRIER_2 : blocks with FULL_BARRIER, STORE_BARRIER_ANY or STORE_BARRIER_2 must happen before this block. This is assigned to blocks with instructions in the specialized load_2 family of instructions (e.g. iload_2 or fload_2).

LOAD_BARRIER_3 : blocks with FULL_BARRIER, STORE_BARRIER_ANY or STORE_BARRIER_3 must happen before this block. This is assigned to blocks with instructions in the specialized load_3 family of instructions (e.g. iload_3 or fload_3).

Each block has zero or more of these attributes depending on the instructions within the block. Note that this technically allows reordering of a block with no attributes around another block with FULL_BARRIER, however a block with no attributes is effectively dead code – it does not operate on any input data and does not produce output data. Since a superinstruction may leave the operand stack unbalanced, there could be a partial expression at the beginning of the superinstruction, and similarly there could be a partial expression at the end of the superinstruction. These are dealt with by constructing special blocks for these partially-present parts and excluding them from any reordering. If there are instructions before the first cut point where di > dmin they are turned into a block and marked with FULL_BARRIER. Likewise, if there are instructions after the last cut point with di > dmin they are also turned into a block and similarly marked with FULL_BARRIER. These blocks may be empty, as would be the case with the bytecode from Listing 4.41 and Listing 4.43 where no partial expressions are contained in the superinstruction. However, they are always created. The block at the beginning is the start block, and the block at the end is the end block. An example of the whole procedure, going from blocks, marking them with attributes and finally creating the DDG can be seen in Figure 4.8, Figure 4.9 and Figure 4.10.
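As a sketch of how the attribute-based edge construction could be implemented (attribute names as strings and the AttributeRules predicate are illustrative placeholders, not the actual QuickInterp types):

import java.util.*;

// Sketch of makeDataDependencyGraph(...): for every pair of blocks, add a
// must-happen-before edge when any attribute pair between them conflicts
// according to the rules listed above.
class DdgConstruction {
    static Map<Integer, Set<Integer>> buildEdges(List<Set<String>> blockAttributes,
                                                 AttributeRules rules) {
        Map<Integer, Set<Integer>> edges = new HashMap<>(); // block -> earlier blocks it must follow
        for (int later = 0; later < blockAttributes.size(); later++) {
            edges.put(later, new HashSet<>());
            for (int earlier = 0; earlier < later; earlier++) {
                for (String a : blockAttributes.get(earlier)) {
                    for (String b : blockAttributes.get(later)) {
                        if (rules.mustHappenBefore(a, b)) {
                            edges.get(later).add(earlier);
                        }
                    }
                }
            }
        }
        return edges;
    }
}

interface AttributeRules { boolean mustHappenBefore(String earlierAttribute, String laterAttribute); }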

Step 1 Figure 4.8 shows the first step in constructing the DDG, performing the makeBlockGraph(...) algorithm from the pseudocode in Listing 4.27. First a graph is created with just those blocks, and a start and end block. The blocks are connected in the graph in the order they occurred in the superinstruction.

Step 2 Figure 4.9 shows an example of the second step, where attributes are determined based on the type of instructions within the block. In the pseudocode from Listing 4.27 this step is performed in computeAttributesInGraph(...). A block may have multiple of these attributes, and as mentioned it is technically possible for a block to have no attributes at all.

Step 3 Finally, Figure 4.10 carries on with the example: the makeDataDependencyGraph(...) step creates the DDG. Edges are added between the blocks based on their attributes from the previous step. These edges only point to earlier blocks within the superinstruction, using the after edges as they were in steps 1 and 2.

Figure 4.8: Step 1: The input program is converted to a graph where all the nodes are blocks

Figure 4.9: Step 2: Barrier attributes are added to each block

Figure 4.10: Step 3: Must-happen-after edges are added based on barrier attributes, replacing the original edges

Observe how the End node with its FULL_BARRIER must link to everything, as it must be executed after everything; likewise, every block must link back to the preceding Start node, as it also has a FULL_BARRIER.

The graph from Figure 4.10 can be colored using traditional graph coloring techniques, as is done in the pseudocode in Listing 4.27, which can then be used to ascertain graph isomorphism, allowing it to be compared to another graph to test for equivalence. In fact, if the DDGs of two superinstructions are merged into one graph simply by importing the nodes into a single graph, but leaving the two DDGs disconnected, the equivalence test becomes even easier. The single graph can be colored, and if the two End nodes end up with the same color, the two superinstructions are equivalent. All pseudocode function calls from Listing 4.27 have now been defined. With the construction of the graph completed and isomorphism determined, this concludes the definition of the QuickInterp equivalence algorithm as introduced at the start of this section.
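A minimal sketch of this merged-graph comparison, assuming hypothetical Ddg and Coloring types and the same partition-refinement coloring used on the individual graphs:

import java.util.Map;

// Sketch of the merged-graph equivalence test: import both DDGs into one
// disconnected graph, color it once, and compare the colors of the two End nodes.
class MergedGraphEquivalence {
    static boolean equivalentByEndColor(Ddg first, Ddg second, Coloring coloring) {
        Ddg merged = coloring.disjointUnion(first, second);  // no edges between the two halves
        Map<Object, Integer> colorOf = coloring.color(merged);
        return colorOf.get(first.endNode()).equals(colorOf.get(second.endNode()));
    }
}

interface Ddg { Object endNode(); }
interface Coloring {
    Ddg disjointUnion(Ddg a, Ddg b);
    Map<Object, Integer> color(Ddg graph);
}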

4.6.4 Using superinstruction equivalency

One of the uses of the equivalence algorithm is to facilitate the placement of superinstructions that are not identical to a sequence of bytecode, but only equivalent. Once such a superinstruction has been found, however, some preprocessing of the input bytecode is needed before it can be substituted in. Furthermore, searching for a superinstruction requires a linear search through the superinstruction set, potentially harming performance. In this section, we address these concerns surrounding the use of an equivalence algorithm by explaining how superinstruction substitution works when dealing with equivalent-but-not-equal superinstructions, and by addressing the performance impact of the use of a linear search.

Substituting bytecode based on equivalence

In order to make a substitution, the runtime substitution algorithm must change the input program including its instruction operands. As discussed, many instructions store additional data in the bytecode stream (e.g. the 10 in istore 10). Generally, this is easy to deal with as the

superinstruction is a straightforward concatenation of the instruction handlers. As such, it expects the instruction operands at the exact same place in the bytecode stream as the original program (discussed in section 4.4.6). However, when substituting in an equivalent superinstruction, this is no longer the case. As such, the instructions need to be reordered first. This reordering requires information from the equivalence algorithm. If the runtime substitution algorithm can obtain how the blocks from bytecode sequence I map to those in J, it can reorder the blocks in the input program including their instruction operands. After that reordering has been performed, the superinstruction can be substituted in like normal. As such, a practical implementation of the algorithm shown in Listing 4.27 will not map to {0, 1}: instead it will produce a one-to-one mapping detailing how every instruction from I has to be reordered to form J. The runtime substitution algorithm can then take this mapping and apply it to the bytecode sequence under consideration (I) to change it into J. Finally, superinstruction substitution can happen just as if this bytecode sequence under consideration were exactly J. The superinstruction for J can now find all the instruction operands at the places where it expects them. This has consequences: in Figure 4.6 from section 4.5.3, it is shown how superinstruction placement can be nested. That is, the same sequence of instructions can be covered by more than one superinstruction, and this is how jump targets are dealt with in the shortest path algorithm. When including equivalence, if the "smaller" instruction (super4 in Figure 4.6) were substituted based on a reordering of instructions, it would break the earlier substitution of super3, which refers to the same instruction operands. As such, equivalence substitution cannot be used when crossing a jump target.
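A sketch of applying such a mapping before the actual substitution, with each instruction carried together with its operand bytes (the representation is illustrative, not the actual QuickInterp data structures):

import java.util.*;

// Sketch: reorder the instructions of the matched input region (including their
// operand bytes) according to the mapping produced by the equivalence test, so the
// superinstruction handler finds every operand where it expects it.
class EquivalenceRewrite {
    // mapping[k] = index in the original region of the instruction that must end up
    // at position k to match the superinstruction's opcode sequence.
    static List<byte[]> reorder(List<byte[]> originalInstructions, int[] mapping) {
        List<byte[]> reordered = new ArrayList<>(originalInstructions.size());
        for (int sourceIndex : mapping) {
            reordered.add(originalInstructions.get(sourceIndex));
        }
        return reordered;
    }
}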

Speeding up equivalence matching

As slow substitution performance may be a show-stopper for the introduction of a superinstruction architecture in a production VM, some techniques are possible to speed up finding a matching superinstruction. One idea is to save as much metadata about each superinstruction as possible into a database as a key: for two graphs to be isomorphic, this metadata must match, so non-matching candidates can be rejected cheaply. This data can include the maximum and minimum operand stack depths (dmin and dmax from Listing 4.45), the number of blocks, and the groups of blocks which received the same color from the coloring algorithm. However, QuickInterp is not a production VM, and creating a substitution algorithm which itself has great runtime performance is not the goal here. As such, we do not address or optimize for this concern, and any optimizations in this sense are disregarded, as they do not affect the performance of the superinstruction VM after all code has been loaded.
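One possible shape for such a lookup key, shown purely as an illustration (none of these fields or names are mandated by the design):

import java.util.*;

// Sketch of a cheap pre-filter key: superinstructions whose key does not match the
// key computed for the bytecode under consideration cannot be equivalent, so the
// expensive graph comparison only runs for the remaining candidates.
final class EquivalenceKey {
    final int minStackDepth;                 // dmin from Listing 4.45
    final int maxStackDepth;                 // dmax from Listing 4.45
    final int blockCount;                    // number of atomic blocks
    final List<Integer> colorClassSizes;     // sorted sizes of the color partitions

    EquivalenceKey(int minStackDepth, int maxStackDepth, int blockCount, List<Integer> colorClassSizes) {
        this.minStackDepth = minStackDepth;
        this.maxStackDepth = maxStackDepth;
        this.blockCount = blockCount;
        this.colorClassSizes = List.copyOf(colorClassSizes);
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof EquivalenceKey)) return false;
        EquivalenceKey k = (EquivalenceKey) o;
        return minStackDepth == k.minStackDepth && maxStackDepth == k.maxStackDepth
            && blockCount == k.blockCount && colorClassSizes.equals(k.colorClassSizes);
    }

    @Override public int hashCode() {
        return Objects.hash(minStackDepth, maxStackDepth, blockCount, colorClassSizes);
    }
}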

4.7 Conclusion

In this chapter the entirety of the QuickInterp architecture has been laid out. Key points discussed include the intricacies and design of effective runtime profiling, the high-level software architecture of QuickInterp, how iterative optimization can improve superinstruction set construction, the shortest path runtime substitution algorithm, and finally how instruction equivalence can further improve performance. At the beginning of the chapter in section 4.1 various design goals were listed, and now is the time to revisit them.

DG1 Enable the VM to generate a rich application profile from a running application such that any superinstruction construction algorithm is able to prioritize which sequences of bytecodes make the most suitable superinstructions (requirements for G1, G2) DG2 Design a superinstruction architecture that is agnostic of the chosen construction (G1) and runtime placement (G2) algorithms to also support implementations of existing algorithms (required for benchmarking, ties into goal G4).

DG3 Design a new, optimized algorithm for the construction of the superinstruction set (G1). DG4 Design a new, optimized algorithm for the runtime placement of superinstructions (G2).

Design goal DG1 has been answered with the design of a powerful runtime profiling architecture, where information about how the control flow moved through the bytecode is stored, enabling more than just analysis of common patterns in the bytecode (section 4.3), but also allowing static evaluation of a superinstruction set. Next up, in response to DG2 a flexible and powerful architecture supporting multiple runtime superinstruction substitution algorithms has been designed (section 4.2), and in later sections it is shown how various algorithms can be implemented on top of this architecture (section 4.5). The third design goal DG3 prompted the design of a new optimized algorithm for the construction of the superinstruction set. This has taken the shape of an iterative optimization algorithm which itself is agnostic (but aware) of the runtime substitution algorithm. It uses the runtime substitution algorithm and powerful profiling to automatically evaluate a candidate superinstruction set, and with that it attempts to zero in on the optimal superinstruction set. Pluggable preprocessor components can be used to further tune the superinstruction set construction algorithm. Finally, design goal DG4 led to the creation of the shortest path runtime substitution algorithm discussed in section 4.5.3, but also to the use of superinstruction equivalence to reduce the number of superinstructions needed (section 4.6). The design is by no means perfect – that is, there is more that could be done given sufficient resources. For example, if the shortest-path based runtime substitution algorithm had access to profiling information, it could make more educated guesses about the impact of conditional jumps leaving superinstructions. Furthermore, the equivalence algorithm is not capable of utilizing symmetry and other mathematical properties of common arithmetic operations like iadd (integer addition). If it were capable of recognizing equivalence in that area, it might perform better. However, the design goals as listed in section 4.1 do not include any non-functional requirements on the performance of the architecture. Indeed, this design is attempting to find which modifications to the superinstruction architecture as seen in earlier work improve performance, without setting a bar for what is expected nor requiring the exploration of every single option or every single design derivation. The performance of the design is also linked to implementation decisions that will be discussed in the next chapter, and as such only the combination of the design, how it is implemented and what kind of benchmarks are used determines the measured performance. As such, the relation between the algorithm design and its actual performance characteristics is an indirect one at best. We discuss how this design actually performs as implemented in QuickInterp in chapter 6, where the algorithms are put to the test in benchmarks on real hardware.

Chapter 5

Implementing QuickInterp

5.1 Introduction

In this chapter the implementation of QuickInterp is discussed. This is the implementation of the architecture and the algorithms as designed in chapter 4, in such a way as to support the research goals from section 1.5. We selected OpenJDK 11 Zero [14] as a base VM to implement the superinstruction set architecture on top of, as this VM is a pure C++ ("zero-assembly") port of OpenJDK, maintained at the time of writing and completely up-to-date with all the features that can be expected in a modern JVM. The runtime substitution algorithms are implemented in Java to aid prototyping and make it easier to implement the various substitution algorithms from chapter 4 (triplet-based substitution, tree-based substitution and shortest-path based substitution). In the process of developing QuickInterp, we took various shortcuts to speed up development that would not be acceptable in a production VM: our patches break the garbage collector and class verifier. We consider the loss of these flagship JVM features acceptable as their absence does not interfere significantly with the ability to gather results and evaluate the design, as both the garbage collector and the class verifier operate independently of any superinstruction architecture. However, to make a proper comparison to a VM without superinstructions, the garbage collector and class verifier must be disabled in both virtual machines. QuickInterp is available at https://github.com/LukasMiedema/QuickInterp. The structure of this chapter largely matches that of the design from chapter 4. Section 5.2 discusses the goals for the implementation, with section 5.3 discussing the choice of OpenJDK Zero and the implications of this choice: it presents an overview of how superinstructions are added to the software architecture of OpenJDK Zero, while various key architectural decisions are discussed. OpenJDK Zero supports, without changes, up to 256 instructions in its bytecode format, which we considered insufficient for meaningful superinstruction experimentation. This section also discusses the mechanism and changes to OpenJDK Zero implemented to support up to 2^16 superinstructions as a new theoretical limit in QuickInterp. Runtime profiling as designed previously in section 4.3 is implemented in section 5.4, where the VM is instrumented to report on where code is executed and how control flow moves through the program. Section 5.5 discusses how the iterative superinstruction set construction algorithm is implemented, with section 5.6 following to explain how the runtime substitution algorithms are implemented. Then, section 5.7 loops back to discuss how a generated superinstruction set is turned into an actual interpreter by an interpreter generator. Finally, section 5.8 concludes the implementation and reflects back on the implementation goals as set out at the beginning of this chapter.

5.2 Implementation goals and non-goals

Before discussing the implementation, let us discuss the implementation goals first. The implementation is the link between the design and the benchmarks, and as such we derive the goals for the implementation not only from the research goals of this thesis, but also from the design of QuickInterp as seen in the previous chapter (chapter 4). As discussed earlier, the goal of QuickInterp is not to produce a production-ready VM competing with other mainstream implementations. It does not have to be secure, safe, make efficient use of hardware resources like disk space and memory, or be easy to use. The only priority for the implementation is to provide answers to the research questions from section 1.5.2, allowing corners to be cut. In the previous chapter, in section 4.1.1, we presented a set of design goals:

DG1 Enable the VM to generate a rich application profile from a running application such that any superinstruction construction algorithm is able to prioritize which sequences of bytecodes make the most suitable superinstructions (requirements for G1, G2) DG2 Design a superinstruction architecture that is agnostic of the chosen construction (G1) and runtime placement (G2) algorithms to also support implementations of existing algorithms (required for benchmarking, ties into goal G4).

DG3 Design a new, optimized algorithm for the construction of the superinstruction set (G1). DG4 Design a new, optimized algorithm for the runtime placement of superinstructions (G2).

For the implementation, most design goals can be translated to an implementation goal to implement that aspect of the QuickInterp architecture. The implementation also adds its own goals: the OpenJDK Zero VM needs to be modified to support superinstructions. At class load time, classes need to undergo a transformation by the active runtime substitution algorithm so that they contain superinstructions, and the interpreter itself needs to contain additional handlers for each superinstruction. This leads us to the following implementation goals:

IG1 The VM runtime needs to be capable of gathering profiling information from the running application, and write this to a file for the superinstruction set construction algorithm (DG1). IG2 Provide an implementation of the iterative superinstruction set construction algorithm discussed in section 4.4. It must produce a list of optimal superinstructions for a given profile, maximum instruction set size and a runtime substitution algorithm (DG3). IG3 Provide an interpreter generator that, from the list of superinstructions, generates C++ source code and other metadata to implement all superinstructions in the VM. The VM needs to be modified to use this generated C++ code and be refactored in such a way that it is possible to concatenate instruction handlers to create superinstructions.

IG4 Implement the three runtime superinstruction placement algorithms (triplet-based, tree-based and shortest-path runtime substitution) (DG4).

IG5 Modify the OpenJDK Zero class loading pipeline to include the runtime substitution of each class by the active runtime substitution algorithm.

IG6 Produce a VM that provides an accurate, testable reflection of the design of each of the components as laid out in chapter 4 (G4 from section 1.5).

The only design goal that is not explicitly translated to an implementation goal is design goal DG2. This design goal has set out the overarching architecture for QuickInterp and as such it is embedded in the design of QuickInterp and not present as its own implementation goal. Finally, implementation goal IG6 links back to research goal G4: this goal concerns the evaluation of the

proposed superinstruction algorithms and architecture. Implementation forms the link between the design and the evaluation, and as such it is vital to keep this goal in sight and not take shortcuts that impact benchmarking results in a way that is not reflective of the design.

5.3 QuickInterp on OpenJDK Zero

Building on OpenJDK Zero [14] introduces its own difficulties as this VM is (obviously) not written from the ground up to support superinstructions. In this section, we discuss various aspects of the OpenJDK Zero architecture and how we changed them to support our design. Section 5.3.2 discusses how Java classes are loaded and where our implementation modifies classes. Next up, we’ll take a look at how the OpenJDK Zero interpreter is structured and what changes were made to the actual runtime machinery to support concatenating instruction handlers in section 5.3.3. OpenJDK Zero, by nature of its interpreter design, supports only 2^8 = 256 total instructions (regular instructions + superinstructions), which is addressed in section 5.3.4 where we discuss how this architectural limit is raised to 2^16 total instructions. We opted to implement the runtime substitution algorithms in Java, running them in the same VM whose loaded code they modify. This is discussed in section 5.3.5, including how we deal with the inevitable circular loading errors caused by doing substitution in Java. This section will set the stage for the more platform-agnostic implementation of the profiling (section 5.4), the generation of the superinstruction set (section 5.5) and the implementation of the runtime superinstruction substitution algorithms (section 5.6).

5.3.1 Why OpenJDK Zero

To implement the design of QuickInterp we have to pick an existing VM implementation that meets a few criteria: the source code has to be accessible, it has to be written in C++ or another “high-level” language that would allow us to concatenate instruction handlers (anything written in assembly is out), and finally the VM should be part of the current generation of virtual machines in compliance with a recent version of the Java Virtual Machine specification. OpenJDK 11 Zero meets all these criteria: it is GPL-licensed with class-path exception and the code is freely available online. Furthermore, it claims compatibility with the Java Virtual Machine specification Java SE 11 Edition [11]. Finally, it is entirely written in C++, as this is the main goal of the OpenJDK Zero porting project: creating a VM free of machine-specific assembly code. These properties make it an excellent match to implement QuickInterp on top of.

5.3.2 OpenJDK Zero class-loading pipeline

To support superinstructions, the superinstructions must be substituted into the in-memory bytecode format when a class loads. Furthermore, they must be present in the interpreter as their own instruction handlers. There is no reason for a VM to keep the bytecode as it was on disk: it can be transformed to another intermediate language, which might be more efficient to interpret. OpenJDK Zero however does almost no preprocessing on the bytecode that is read from file. Bytecode is verified using a bytecode verifier, followed by some basic instruction rewriting (discussed in a moment) which leaves the bytecode structure basically untouched. The interpreter is run on this bytecode, which is very similar (the size is identical) to the bytecode provided by the user. This makes OpenJDK Zero a token-threaded interpreter, as discussed in background section 2.2. Some instruction rewriting is done by rewriter.cpp. For the purposes of creating a superinstruction architecture, the rewriter does fairly little. Various constant-pool references are rewritten to native machine endianness for the invoke family of instructions and the ldc (load constant from the constant pool) instruction. Besides the ldc instruction, these instructions cannot become a meaningful part of a superinstruction as discussed earlier.

Use of a JVMTI agent

Superinstruction substitution has to be done somewhere in the class loading pipeline, preferably as early as possible in the loading process to make sure all processing steps within the VM see only the substituted class with superinstructions. This is important because the more invasive substitution algorithms like the shortest-path substitution algorithm (when using equivalence) might reorder bytecode instructions. If information about the bytecode were collected prior to such a reordering, this information may become invalid. Furthermore, the process needs to be able to inspect all classes, including core classes like java.lang.Object (the root of the class inheritance tree for all objects) and hidden classes (e.g. lambda classes) defined via the confusingly named Unsafe.defineAnonymousClass, which has no relation to “anonymous classes” as they exist in the Java language. OpenJDK Zero provides an implementation of a tooling interface called the Java Virtual Machine Tooling Interface (JVMTI). The JVMTI exists to support the development of debuggers, loggers and other tools (“agents”) that wish to inspect the state of the JVM and even modify it. For QuickInterp, a key feature of the JVMTI is the ability to intercept and replace any class that gets loaded by the VM except hidden classes, which we will cover in a moment. With JVMTI it is even possible to replace java.lang.Object, enabling superinstructions in the deepest level of the VM. JVMTI has the wiring in place to intercept and process class substitution at all places where classes can be defined, and as such we opted to use JVMTI to implement our algorithms. We implemented our JVMTI client as part of the VM source code itself, modifying the JVMTI startup procedure (where it scans for JVMTI command line arguments) to always inject a reference to the embedded JVMTI agent. Hidden classes are not a language feature (yet¹) on the JVM. Instead, they are an implementation detail of how OpenJDK generates classes for e.g. lambdas, string concatenations and more. These classes are slipped into the class loader of the parent class (even when that class loader does not support defining new classes) and do not show up in the JVMTI interface. It makes sense that these are hidden from JVMTI out of the box: they are an implementation detail and are not expected to exist. Hidden classes are marked as such, and this marking is used to skip notifying JVMTI agents. Fortunately this made it fairly trivial to modify the OpenJDK Zero class loading mechanism to stop excluding these classes from JVMTI visibility. This means that within our modified OpenJDK Zero, the JVMTI is technically out-of-spec. However, considering we are not testing anything related to JVMTI, this has no effect when it comes to obtaining benchmarking results.

Determining class identity

Various processes within the superinstruction workflow require a way to uniquely identify a class. The profiling phase places execution counters within the class and needs a way to indicate which class. System classes need to be placed in a cache (discussed in section 5.3.5), which once again requires identifying them somehow. While JVM classes have a class name, this class name does not have to be unique. It is perfectly legal for two classes to get loaded with the same name but under different class loaders. Furthermore, the hidden classes do not have a name at all, and are instead assigned one further down the class loading pipeline. As such, we designed a simple but effective mechanism to assign a name based on the content of the class. Our approach is to compute a simple hash over the class content and append this to the name of the class as <class name>!<hash>, e.g. java.lang.Boolean!3795190246904940199. The hash code is computed by interpreting the whole class as an array of 8-byte integers and XOR’ing these together. While not secure, as it is trivial to create a hash collision with this algorithm, the goal of this hashing algorithm is not security but rather to create a simple mechanism for differentiating classes with the same name but different content within the scope of the QuickInterp implementation.
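
The hashing scheme is small enough to sketch. The following is a minimal sketch in Java, assuming the class bytes are folded into 8-byte words in file order with a zero-padded final word and an unsigned rendering of the result; the thesis does not spell out these details, so they are assumptions.

final class ClassIdentity {

    // Content hash: the class file is interpreted as a sequence of 8-byte words
    // which are XOR'ed together. Word order and padding of the final partial
    // word are assumptions of this sketch.
    static long contentHash(byte[] classBytes) {
        long hash = 0L;
        for (int i = 0; i < classBytes.length; i += 8) {
            long word = 0L;
            for (int j = 0; j < 8; j++) {
                long b = (i + j < classBytes.length) ? (classBytes[i + j] & 0xFFL) : 0L;
                word = (word << 8) | b;
            }
            hash ^= word;
        }
        return hash;
    }

    // Builds the unique name, e.g. "java.lang.Boolean!3795190246904940199".
    static String identityName(String className, byte[] classBytes) {
        return className + "!" + Long.toUnsignedString(contentHash(classBytes));
    }
}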

¹ JEP-371 aims to change that as of Java 15.

5.3.3 OpenJDK Zero Interpreter

While the lack of complex rewriting means that there is some room for other optimizations, skipping such processing steps makes it easier to implement the QuickInterp superinstruction architecture. Since the interpreter is a straightforward JVM bytecode interpreter with instruction handlers for each bytecode instruction, concatenation of them is simple. Each instruction handler modifies the program counter (pc) and top-of-stack (tos) variables, which are stored in CPU registers, after which the next opcode is read and dispatched to. The interpreter itself can be compiled as either a large switch-statement switching to each instruction handler, or as a large table with pointers to each instruction handler. The table approach is faster, and this approach is selected at build time when it is available (it requires special compiler features not part of regular C++). The platform on which QuickInterp will be evaluated (amd64 on Linux) can use the table approach and as such this is the only relevant dispatching technique for this thesis. While most instructions have dedicated handlers – dedicated sequences of C++ code that implement that handler and just that handler – some of the instruction handlers are reused or share code by using macros. To create superinstructions, the interpreter generator (discussed in section 5.7) needs access to standalone code for each instruction handler in order to be able to concatenate it.

CASE(_iload):   // execution falls through to the next case
CASE(_fload):
    SET_STACK_SLOT(LOCALS_SLOT(pc[1]), 0);
    UPDATE_PC_AND_TOS_AND_CONTINUE(2, 1);   // "and continue" = jump back to the dispatch loop

Listing 5.1: The iload and fload instruction handler in OpenJDK Zero

An example of this is shown in Listing 5.1 where one instruction handler is reused. The iload and fload instructions both make use of the same code by using the fall-through nature of case statements.

#define OPC_CONST_n(opcode, const_type, value)   \
    CASE(opcode):                                \
        SET_STACK_ ## const_type(value, 0);      \
        UPDATE_PC_AND_TOS_AND_CONTINUE(1, 1);

OPC_CONST_n(_iconst_m1, INT, -1);
OPC_CONST_n(_iconst_0, INT, 0);
OPC_CONST_n(_iconst_1, INT, 1);
OPC_CONST_n(_iconst_2, INT, 2);
OPC_CONST_n(_iconst_3, INT, 3);
OPC_CONST_n(_iconst_4, INT, 4);
OPC_CONST_n(_iconst_5, INT, 5);
OPC_CONST_n(_fconst_0, FLOAT, 0.0);
OPC_CONST_n(_fconst_1, FLOAT, 1.0);
OPC_CONST_n(_fconst_2, FLOAT, 2.0);

Listing 5.2: All const_n handlers are defined by invocations of one macro, which expands to the actual definition of that instruction handler

Furthermore, some instruction handlers are defined using a macro, which can be seen in Listing 5.2. The OPC_CONST_n macro contains the entire definition of the instruction handler, including the CASE statement, which prevents simple concatenation of such macros. As such, these macros need to be “expanded” first to capture the code fragments that make up each instruction handler such that they are available to the code generator. While OpenJDK Zero lends itself fairly well to concatenating instruction handlers, manual inspection of each instruction handler is still necessary to ensure that the above examples can work correctly.

Concatenating handlers

One of the benefits of superinstructions is that modifications to the top-of-stack (tos) and program counter (pc) variables can be coalesced within the superinstruction. Take for instance a superinstruction consisting of two instruction handlers that both push an element onto the operand stack – within this superinstruction, it is not necessary to modify the tos value twice. Instead, the tos value can be modified once at the end of the superinstruction, accounting for both instruction handlers. The second instruction handler has to operate on “tos+1” (assuming the operand stack grows down) instead of “tos” to reflect that the tos variable has not been updated yet after the first instruction handler. This is possible because most JVM instructions impact the operand stack (and program counter too) in a consistent way; for example, a bipush instruction handler always pushes one element onto the operand stack and increases the program counter by two.

1  /* bipush handler */
2  SET_STACK_INT((jbyte)(pc[1]), 0);
3  UPDATE_PC_AND_TOS(2, 1);
4
5  /* iload handler */
6  SET_STACK_SLOT(
7      LOCALS_SLOT(pc[1]), 0);
8  UPDATE_PC_AND_TOS(2, 1);
9  CONTINUE;

Listing 5.3: Straightforward concatenation of the bipush and iload instruction handlers

1  /* bipush handler */
2  SET_STACK_INT((jbyte)(pc[1]), 0);
3
4
5  /* iload handler */
6  SET_STACK_SLOT(
7      LOCALS_SLOT(pc[1+2]), 1);
8  UPDATE_PC_AND_TOS(2+2, 1+1);
9  CONTINUE;

Listing 5.4: Optimized concatenation of the bipush and iload instruction handlers by coalescing top-of-stack and program counter modifications

This example is shown in Listing 5.3 and Listing 5.4, where both listings implement the bipush-iload superinstruction by concatenating the instruction handlers, one without coalescing write operations and the other with. The macro UPDATE_PC_AND_TOS(x,y) adds x to pc and y to the tos (we assume the stack grows down), and the CONTINUE macro jumps back to the interpreter loop where the next instruction is read and executed. In Listing 5.1 and Listing 5.2 we saw the CONTINUE operation included in the UPDATE_PC_AND_TOS_AND_CONTINUE(x,y) macro call, but this would prematurely terminate execution within the superinstruction. Listing 5.3 performs two calls to UPDATE_PC_AND_TOS(x,y) – one after each instruction handler – while Listing 5.4 needs only one call. Observe how the code of the iload handler has been tweaked in the optimized superinstruction of Listing 5.4 – instead of accessing pc[1] on line 6, it reads the local variable table index from pc[1 + 2]. This is to make up for the missing write to the pc variable caused by omitting the UPDATE_PC_AND_TOS(x,y) call on line 3. Likewise, the second argument of SET_STACK_SLOT(value, offset) is an offset relative to the current top-of-stack, which due to the omission of the UPDATE_PC_AND_TOS(x,y) call is now 1. There is no concrete implementation goal to coalesce program counter and top-of-stack writes. However, we considered it relatively easy to add to the QuickInterp architecture, as the UPDATE_PC_AND_TOS_AND_CONTINUE(x,y) macro had to be modified anyway in all instruction handlers. Furthermore, earlier work (e.g. Ertl et al. [4] and Casey [1]) often combined superinstructions with other optimizations like top-of-stack caching. While we did not consider implementing this relatively unrelated optimization, coalescing writes to the program counter and top-of-stack values at least gives the C++ compiler the opportunity to detect that one instruction handler writes to the same location another instruction handler reads within the same instruction. The effects of this – in theory – are somewhat akin to the “static caching” from Casey [1]; however, static caching applied to the whole VM and not just the superinstructions. While static caching was implemented in their interpreter independent of superinstructions, our write-coalescing optimization requires that the compiler is capable of picking up a particular data dependency relation. Even though the

optimization is not nearly as strong as what has been discussed in earlier work, implementation goal IG6 asks for a VM that can determine the efficacy of the design, and we believe this optimization is such a good match for the superinstruction architecture that it must be included. It also helps that it is not that hard to implement on top of OpenJDK Zero. Not all instruction handlers are equal when it comes to concatenation: some instructions modify the operand stack depth in a variable way (e.g. invoke-statements), some instructions modify the program counter in a variable way (e.g. tableswitch), and some instructions cannot be part of a superinstruction due to other reasons related to the implementation of OpenJDK. In the implementation of QuickInterp, each instruction handler is characterized by zero or more flags that describe how (and if) the instruction handler can be part of a superinstruction. Appendix A.2 lists all bytecode instructions with their flags, including a full definition of each flag.

5.3.4 Code stretching

As mentioned in the introduction, OpenJDK Zero has one shortcoming when it comes to implementing the superinstruction architecture: it supports only up to 2^8 = 256 different instruction handlers. This is a consequence of it being a token-threaded interpreter – the bytecode is kept as-is, and the opcodes (“tokens”) are just one byte long. Added superinstructions each create their own instruction handler (even if that handler is a concatenation of existing handlers) and as such need their own opcode. In order to support more superinstructions, we set out to change the interpreter to use two-byte opcodes, giving us 2^16 = 65536 possible instruction handlers, in a process dubbed “code stretching” (stretching a one-byte opcode into two). Note that this need not mean that QuickInterp supports that many superinstructions – there might be other limitations that prevent the creation of a VM with that many instruction handlers, like insufficient resources for the C++ compiler to compile such a massive interpreter. While we ultimately chose to transform the whole interpreter (and VM) to use two-byte instruction opcodes, one easier-to-implement alternative requires some discussion. This alternative would be to add just one new instruction to the VM, called super, to cover all superinstructions. Following the super opcode in the bytecode stream, there would be two bytes that dictate what kind of superinstruction it is (a “superinstruction opcode” of sorts), followed by the regular instruction operands of that superinstruction. To use this, the instruction handler for super would have to include its own dispatching mechanism to execute a particular superinstruction. The goal of the superinstruction architecture is to save instruction dispatches, and this approach would add one additional instruction dispatch to every single superinstruction. As such, it only starts making sense to concatenate superinstructions consisting of three or more instruction handlers; a superinstruction with just two instruction handlers performs just as many dispatches as the two regular instructions that were concatenated. Furthermore, this design would likely impair the triplet-based runtime substitution algorithm designed by Ertl et al. [4] disproportionately, as this algorithm is more likely to make short superinstructions. With implementation goal IG6 in mind, this approach – while much easier to implement – was ultimately dismissed. Even though the amount of work required to change every single place where the size of an instruction opcode was assumed was rather large, we still chose the code stretching approach over the simpler one-instruction alternative discussed earlier. To implement the two-byte opcode architecture, the changes to the VM are twofold. One, the VM needs to be patched to work with two-byte opcodes, thus all places where one-byte opcodes are expected need to be modified. This assumption is made in every single instruction handler and many other places in the interpreter. Two, a code stretching processing step needs to be added that preprocesses all classes as they get loaded to use two-byte opcodes, as the VM now expects every opcode to be two bytes and is no longer compatible with one-byte opcodes.

Patching the interpreter

Let us first discuss how the interpreter has to be modified to work with two-byte opcodes. Within the interpreter itself, two main things need to be changed: (1) the interpreter must now read two

bytes and do a dispatch based on that, and (2) every instruction handler must be modified to account for the larger instruction opcode. Outside the interpreter there is one more place that needs some work: OpenJDK has a few utility methods for determining the size and instruction operands of various instructions. These are not used inside the interpreter, but are used by auxiliary code like the rewriter.cpp discussed earlier, or by the garbage collector and class verifier. The utility methods are the easiest to address as this basically amounts to reporting every instruction to be one byte larger than before. These are updated to report the correct size, and to take the larger instruction opcode into account when reading the instruction operands. While QuickInterp does not have a working garbage collector, it is not the use of two-byte opcodes that ended up breaking the garbage collector; instead it was broken by the superinstructions themselves (discussed in section 5.3.5). Back within the interpreter, changes to the dispatcher are straightforward. Performing a dispatch on two bytes is done by reading two bytes from the bytecode stream and shifting them together: uint16_t opcode = (pc[1] << 8) | pc[0]. It is not possible to read the two bytes at once as there is no reason for the two addresses (pc[0] and pc[1]) to be aligned to a 16-bit boundary. The dispatching table is increased in size to 2^16 entries, but other than that dispatching is mostly left unchanged. It is only when dealing with all the instruction handlers that things get a little more tricky. Let us revisit the fload instruction handler from section 5.3.3, which is shown in Listing 5.5.

CASE(_fload):
    SET_STACK_SLOT(LOCALS_SLOT(pc[1]), 0);
    UPDATE_PC_AND_TOS_AND_CONTINUE(2, 1);   // "and continue" = jump back to the dispatch loop

Listing 5.5: The fload instruction handler in OpenJDK Zero

Two locations need to be changed here: the pc[1] and the “2” argument in the call to UPDATE_PC_AND_TOS_AND_CONTINUE(2,1) (see Listing 5.5). pc[1] is the instruction operand at the memory location of the program counter plus one. The “1” in pc[1] is due to the one-byte length of the instruction opcode, but now that the opcode has grown, it must be replaced with pc[2]. Likewise, the UPDATE_PC_AND_TOS_AND_CONTINUE(x,y) call needs to be made with x = 3 to reflect the larger instruction size. This was done for every instruction handler, enabling the interpreter to work with two-byte opcodes.

Code stretch

Existing bytecode can now no longer be executed as it does not use the expected two bytes for opcodes. Our approach keeps the existing instruction handlers at their original opcodes; for example, iload in binary was 0001 0101 and now becomes 0000 0000 0001 0101. This means that we effectively have to insert 0000 0000 in front of every instruction opcode. The obvious place to do this is in the JVMTI agent discussed earlier. Note how this widening operation is essentially using big-endian encoding. The encoding scheme is fairly independent of the target machine architecture (the opcodes are not aligned, so native machine access cannot be used anyway), so we were free to choose between big-endian and little-endian for the two-byte opcode representation. Using big-endian encoding brings one massive advantage: the inserted 0000 0000 is effectively a nop instruction – an instruction which does nothing. The nop instruction in JVM bytecode has 0 as its opcode, and no instruction operands. In other words, inserting 0000 0000 in front of every instruction opcode effectively does not change the code, even for VMs which have not been modified to expect two-byte opcodes; it is just that half the instructions they are executing now do nothing. While compatibility with existing VMs is a nice gimmick, this also makes it possible to use existing libraries to simply add a nop instruction before every single existing instruction, which has the same effect as inserting the zeros but saves us from having to write a custom bytecode reader and writer.

We opted to use a third-party bytecode manipulation library called the Java Native Instrumentation Framework (JNIF) [12] – a C++ library created to modify bytecode in a JVMTI client by the Software and Programmer Efficiency Research Group (“sape”) from the University of Lugano in Switzerland. A C++ library is necessary because JVMTI is a native interface, and common bytecode manipulation libraries like ASM are written in Java. JNIF is a high-level library: it parses every instruction, creating an object graph that includes where jumps go within the program. It allows the insertion of new instructions at any point in the program, and updates all jump target locations (regular jumps, conditional jumps, exception handlers, etc.) automatically, which saved us a lot of time. We used the nop trick and added a nop instruction in front of every existing instruction, letting the library deal with updating jump targets. We only had to make some very minor changes to the library for a few cases where jumps targeted the second byte of a two-byte opcode instead of the first byte.
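
To illustrate the transformation itself, the following is a minimal sketch of code stretching using ASM's tree API. It is only an illustration: the actual implementation uses JNIF from C++ inside the JVMTI agent, and the method names here follow ASM rather than anything in QuickInterp.

import org.objectweb.asm.Opcodes;
import org.objectweb.asm.tree.AbstractInsnNode;
import org.objectweb.asm.tree.InsnList;
import org.objectweb.asm.tree.InsnNode;
import org.objectweb.asm.tree.MethodNode;

final class CodeStretchSketch {

    // Prefixes every real instruction with a nop, so that once the interpreter
    // reads two-byte opcodes the inserted 0x00 becomes the high byte of the
    // widened opcode. The library recomputes all jump offsets when the class is
    // written back out, which is the bookkeeping JNIF performs in the real VM.
    static void stretch(MethodNode method) {
        InsnList instructions = method.instructions;
        for (AbstractInsnNode insn = instructions.getFirst(); insn != null; insn = insn.getNext()) {
            if (insn.getOpcode() >= 0) {   // labels, frames and line numbers have no opcode
                instructions.insertBefore(insn, new InsnNode(Opcodes.NOP));
            }
        }
    }
}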

Dealing with long methods

The JVM has a rather low maximum method length: methods are limited to 2^16 bytes [7] of bytecode. When stretching a method near this limit, it may go over it. During testing, only one such case was identified: a generated class, part of Apache FastMath, containing large arrays of constants which were initialized in a very large static initializer method. Instead of trying to lift the 2^16 size limit, which would be very complex, we instead decided to cut the Apache FastMath static initializer into two methods that both initialize part of the array. While by no means an elegant solution, it has no runtime impact and as such it does not endanger the results of QuickInterp, and at the same time it was far less time consuming to implement.

5.3.5 Superinstruction placement in Java

Given the complexity of the runtime substitution algorithms, we decided to implement these in Java instead of in C++ to speed up development. Furthermore, developing in Java gives us access to a larger array of existing bytecode manipulation tools like ObjectWeb’s ASM (ASM is not an abbreviation). In this section we discuss the interface between each runtime substitution algorithm and the VM, and how Java is called from the JVMTI agent.

Modifying bytecode using ASM

To make swapping runtime substitution algorithms as seamless as possible, all algorithms implement a common interface with just two methods: one to set the superinstruction set (called once just after VM startup), and one to transform a list of instructions. Java is called from the JVMTI agent using the Java Native Interface (JNI). On the Java side, the bytes are parsed by the JVM bytecode manipulation library ASM. This tool is used to parse each method to create the list of instructions that is given to each runtime substitution algorithm. The list of instructions is an ASM list of instruction nodes read from the bytecode by ASM. ASM has various features for reading and transforming bytecode, including creating an object graph from the bytecode of a method using the ASM tree API. This object graph representation simplifies adding, removing or modifying instructions, and can be converted back into a compliant JVM class after all manipulations have been performed. The object graph representation has one object for each instruction, called the instruction node, holding the opcode of that instruction together with the instruction operands. Each instruction node links to the next and previous instruction node as they occurred in the bytecode, forming a linked list. Jumps are represented as special “label” instructions, and instructions which perform jumps (e.g. a goto) hold a reference to the label instruction that they jump to, instead of an actual offset. Only when transforming back to bytecode is the jump target resolved to an actual jump offset, which permits the insertion of instructions between a jump instruction and its target instruction without having to manually update the jump offset.
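
As a concrete reference point, a sketch of what this two-method interface could look like is given below. The method names follow the description above, while the parameter types (ASM's InsnList and a SuperinstructionDefinition descriptor) are assumptions made for illustration only.

import java.util.List;
import org.objectweb.asm.tree.InsnList;

public interface RuntimeSubstitutionAlgorithm {

    // Hypothetical descriptor of one superinstruction (its opcode and the sequence it replaces).
    interface SuperinstructionDefinition { }

    // Called once, just after VM startup, with the superinstruction set the interpreter supports.
    void setInstructions(List<SuperinstructionDefinition> instructionSet);

    // Substitutes superinstructions into one method; the instruction list is modified in place.
    void convert(InsnList instructionList);
}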

We modified ASM to work exclusively with two-byte opcodes. The bytecode that it reads has already been code stretched by the JVMTI agent, so all instruction opcodes are already two bytes in size. Furthermore, we added support for a special superinstruction node. This node wraps around an existing instruction node and copies the instruction operands of the existing instruction node. However, it has its own superinstruction opcode, and when written to the bytecode stream it will emit the superinstruction opcode together with the instruction operands of the wrapped node. To place a superinstruction, a runtime substitution algorithm simply removes the instruction node at the start of the superinstruction. Then, the algorithm must insert a superinstruction node at that location wrapping around the original instruction. This is all that is necessary; the “tail” of the superinstruction (the other instructions that are still part of the superinstruction) must remain in the object graph. The return type of the substitution method is void – the runtime substitution algorithms modify the object graph in-place. The runtime substitution algorithm must take care to ignore ASM label instruction nodes, and some (like the shortest-path algorithm) may even use the label nodes as they indicate a possible incoming jump at that location. However, we will take a deep dive into the inner workings of these algorithms in section 5.6.
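
The placement step can be sketched as follows. The SuperinstructionNode class below is only a stand-in for the custom node described above (the real node re-emits the wrapped instruction's operands when the method is written back to bytecode), and extending ASM's InsnNode is an assumption made to keep the sketch compact.

import org.objectweb.asm.tree.AbstractInsnNode;
import org.objectweb.asm.tree.InsnList;
import org.objectweb.asm.tree.InsnNode;

final class SuperinstructionPlacement {

    // Stand-in for the custom node type: carries the superinstruction opcode and
    // keeps a reference to the original head instruction whose operands it copies.
    static class SuperinstructionNode extends InsnNode {
        final AbstractInsnNode wrapped;

        SuperinstructionNode(int superOpcode, AbstractInsnNode wrapped) {
            super(superOpcode);
            this.wrapped = wrapped;
        }
    }

    // Only the head instruction is swapped for the wrapper; the tail instructions
    // of the superinstruction stay in the object graph untouched.
    static void place(InsnList instructions, AbstractInsnNode head, int superOpcode) {
        instructions.insertBefore(head, new SuperinstructionNode(superOpcode, head));
        instructions.remove(head);
    }
}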

Avoiding recursion

The runtime substitution algorithms are executed within the same VM as the application. This creates circular dependencies for VM core classes. For example, in order to place superinstructions into java.lang.Object, the runtime substitution algorithm is called, which requires java.lang.Object to be loaded, and so on. This problem is solved with a cache: a folder of classes that already contain superinstructions. By using these classes, the circular class loading cycle is broken while all classes can still receive superinstructions. Classes are written to the cache with a file name based on their original hash (before superinstructions or code stretching) as discussed in section 5.7.4.
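
A minimal sketch of the lookup is given below, assuming a cache directory named class-cache and reusing the identityName helper sketched earlier; the directory name, the file extension and the call into the substitution code are assumptions (how the cache is populated is covered in section 5.7.4).

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class ClassCacheSketch {

    // Returns superinstruction-substituted bytes, preferring a cached copy so the
    // circular dependency on the substitution code itself is broken.
    static byte[] loadWithCache(String className, byte[] originalBytes) throws IOException {
        // the file name is derived from the original class content, before code stretching
        Path cached = Path.of("class-cache", ClassIdentity.identityName(className, originalBytes) + ".class");
        if (Files.exists(cached)) {
            return Files.readAllBytes(cached);   // already contains superinstructions
        }
        return substituteInJava(originalBytes);  // hypothetical call into the Java algorithms
    }

    private static byte[] substituteInJava(byte[] classBytes) {
        throw new UnsupportedOperationException("stand-in for the JNI call into the substitution code");
    }
}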

5.3.6 Conclusion

In this section the integration between the QuickInterp design and an actual JVM has been discussed, explaining how the various facets of our superinstruction architecture are wired up in OpenJDK Zero while providing motivation for our choices. We have seen how a JVMTI agent is used as a staging ground for performing modifications on classes. Within the interpreter, we discussed how instruction handlers can be concatenated and how writes to the top-of-stack and program counter variables can be coalesced. The VM was modified to work with two-byte opcodes to support more instructions, with the JVMTI agent stepping up to convert all input classes to this two-byte format in a process called code stretching. Furthermore, we discussed a Java API for runtime superinstruction substitution, explaining how this API is invoked and how the bytecode of a class makes its way to the runtime substitution algorithm. Finally, we saw how the circular class loading problem caused by doing substitution in Java is solved by using a class cache.

5.4 Profiling in practice

In section 4.3 the profiling system for QuickInterp was designed. We discussed how profiling has to be done in a smart way – writing every opcode to disk is not feasible for long-running applications. As such, profiling was designed to use only select counters, saving only the information that is needed for static evaluation as discussed in section 4.3.2. In this section we will discuss the implementation of profiling using special “profiling” instructions for these counters. Section 5.4.1 discusses how special profiling instructions are implemented, while section 5.4.2 describes how the profile is written to disk. Finally, section 5.4.3 wraps up the implementation of profiling.

5.4.1 Specialized Instructions

Considering the existence of the rather extensive infrastructure to manipulate bytecode instructions needed to support superinstructions, it made a lot of sense to implement profiling by means of simply adding special “profiling” instructions. Profiling is enabled with a VM flag (-XX:+UnlockExperimentalVMOptions -XX:+EnableProfiler), and is mutually exclusive with the use of superinstructions. With this option enabled, an extra step is performed in the JVMTI agent after code stretching. Section 4.3.2 introduced two types of counters:

Local Execution Counters count the number of executions of a given bytecode instruction. These are implemented with a simple profile instruction which counts how many times it is executed. This instruction is placed before the instruction to be counted.

Branch Counters count how many times a conditional branch instruction ends up performing the jump. These are implemented with special “profiling” variants of each conditional jump instruction.

The JNIF library is used to place or modify the instructions. All conditional jump instructions are traced and replaced with the opcode of their special “profiling” variants to provide the branch counters. For every traceable conditional jump instruction, a profiling variant must be available. Furthermore, at all places in the bytecode where a local execution counter is expected, the special profile counter instruction is inserted.

Profiling information

Just replacing an instruction is not enough: when the interpreter is executing a profiling instruction, it must somehow be able to tell at which location it is executing to update the correct counter. As such, a profiling information table is maintained with extra information about each profiling location. Each profiling instruction gets its own unique 4-byte profiling id, which is an index into the profiling information table. When adding profiling instructions to the bytecode, a new profiling id gets generated and put in that table together with the exact location identifier of that profiling id. This location identifier consists of five parts:

1. The type of counter (local execution counter or branch counter)
2. The fully qualified name (FQN) of the class, e.g. java.lang.String
3. The hash code of that class (see section 5.3.3)
4. The name of the method including the type signature
5. The bytecode index of the original instruction within the method

This way, with a given profiling id it is always possible to trace the original location of that instruction. The last item – the bytecode index – requires a bit more discussion. When adding profiling instructions, the bytecode index of all following instructions will shift. As such, this refers to the original bytecode index within the method, prior to modifying any instructions. The profiling ids themselves must be available to the interpreter when executing a profiling instruction, and as such it makes a lot of sense to make them available as regular instruction operands. Each profiling instruction takes not just its regular instruction operands (e.g. the profiling variant of ifnull still needs a branch target as its instruction operand), but also the 4-byte profiling id. This profiling id is inserted into the bytecode stream after the regular instruction operands to minimize the amount of changes required for the profiling variant of the jump instruction. For example, “ifnull <branch offset>” now becomes “ifnull_p <branch offset> <profiling id>” in the bytecode stream (where ifnull_p is the opcode of the profiling version of ifnull). Since classes can be loaded concurrently by different threads, inserting profiling instructions must be thread-safe. The profiling table itself is implemented as a thread-safe concurrent hash

table implementation using the ConcurrentHashTable (utilities/concurrentHashTable.hpp) C++ template class available in the OpenJDK source tree. New profiling ids are generated from a shared counter that is incremented using compare-and-set atomic integer operations.

Instruction handlers

With all the substitution done by the JVMTI agent, all conditional branch instructions have been replaced with their branch counting counterparts, and the bytecode is littered with local execution counter instructions. The local execution counter instruction is the simplest: it reads the 4-byte profiling id from the bytecode stream, then uses that 4-byte profiling id to find the accompanying record in the profiling table. The profiling table entry for a given profiling id does not only contain the location identifier, it also contains an 8-byte counter. Since the same method can be executed at the same time by multiple threads, compare-and-set integer operations are used to increment this counter atomically. Whereas the local execution counters are implemented with a new instruction handler for the profile instruction, all conditional jump handlers need to be copied and modified to both count the number of executions and perform the functionality of the regular conditional jump instruction.

#define INSTR_ifnull(pc, offset)                            \
    const bool cmp = (STACK_OBJECT(offset-1) == NULL);      \
    if (cmp) {                                               \
        PROFILE_CONDITIONAL(WITH_PROFILE, (pc)+4);           \
        int skip = (int16_t)Bytes::get_Java_u2((pc)+2);      \
        address branch_pc = (pc);                            \
        SET_PC_AND_TOS(pc+skip, offset-1);                   \
        CONTINUE;                                            \
    }

Listing 5.6: Code implementing the ifnull instruction handler (simplified)

In the code, this is done by adding a call to a new PROFILE_CONDITIONAL macro to each conditional jump instruction. This macro increases the counter for that instruction, and an example of this can be seen in Listing 5.6. The macro takes a pointer to the profiling id, which is computed by taking an offset from the program counter. Furthermore, the macro is placed in such a way that it is only executed when the conditional jump handler is about to jump. When the macro is invoked with WITH_PROFILE set, it will include code that reads the profiling id from the bytecode stream and uses that to update the associated counter. To set this flag, the code generator capable of generating the superinstruction handlers (which will be discussed in detail in section 5.7) is used to generate these special conditional jump handlers with WITH_PROFILE set. This design makes use of the existing superinstruction infrastructure, simplifying the task of creating profiling variants of all conditional jump instructions.

5.4.2 The profile on disk

When the VM exits, it has to somehow serialize all the profiling information gathered to make it available for the superinstruction set construction algorithm. The values of the counters are written to a text file, while each class (prior to being instrumented with profiling instructions) is written to disk.

app.profile file format

When the profiling VM is terminated, it writes the value of each counter to a file called app.profile. We decided on using a simple textual representation of each location identifier (as seen in section 5.4.1) combined with the number of executions of that location. Using a textual format makes it easy to inspect the profile, which in turn simplifies debugging and development.

...
lec n.l.s.b.primes.PrimeBenchmark!13468019222500153135!isPrime(I)Z:0 150000
jc n.l.s.b.primes.PrimeBenchmark!13468019222500153135!isPrime(I)Z:8 12345
jc n.l.s.b.primes.PrimeBenchmark!13468019222500153135!isPrime(I)Z:18 3026641110
jc n.l.s.b.primes.PrimeBenchmark!13468019222500153135!isPrime(I)Z:30 3026641110
lec o.o.jmh.results.BenchmarkTaskResult!3896100830435373639!<init>(JJ)V:0 15
lec o.o.jmh.results.BenchmarkTaskResult!3896100830435373639!<init>(JJ)V:6 15
lec o.o.jmh.results.BenchmarkTaskResult!3896100830435373639!<init>(JJ)V:34 15
...

Listing 5.7: Snippet of an app.profile file

In Listing 5.7 a fragment of the profiling file can be seen. This profiling file was obtained by running the primes benchmark that will be discussed in section 6.4. The package names have been abbreviated to fit on the page. This example already reveals the format of each profiling line. One line is used per counter, and the location data and counter data are concatenated into a string as follows: <counter type> <class FQN>!<class hash>!<method name and signature>:<bytecode index> <counter value>. The entire file that Listing 5.7 is a part of weighs in at just 112,810 lines, summing up to 12.6 megabytes. This shows how our implementation of the profile saves us from the egregious disk space demand that writing every opcode to disk would have entailed, as predicted in section 4.3.
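
To make the format concrete, the following is a small sketch of parsing one such line in Java; the regular expression and the ProfileCounter holder are illustrative and not the interpreter generator's actual parser.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ProfileLineParser {

    // e.g. "lec n.l.s.b.primes.PrimeBenchmark!13468019222500153135!isPrime(I)Z:0 150000"
    private static final Pattern LINE =
            Pattern.compile("(lec|jc) (.+)!(\\d+)!(.+):(\\d+) (\\d+)");

    static ProfileCounter parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed profile line: " + line);
        }
        ProfileCounter c = new ProfileCounter();
        c.branchCounter = m.group(1).equals("jc");          // "jc" = branch counter, "lec" = local execution counter
        c.classFqn = m.group(2);
        c.classHash = Long.parseUnsignedLong(m.group(3));   // hashes are rendered as unsigned 64-bit values
        c.method = m.group(4);                              // method name including type signature
        c.bytecodeIndex = Integer.parseInt(m.group(5));     // index before instrumentation
        c.count = Long.parseLong(m.group(6));
        return c;
    }

    static final class ProfileCounter {
        boolean branchCounter;
        String classFqn;
        long classHash;
        String method;
        int bytecodeIndex;
        long count;
    }
}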

Class dump

Having the counters available is not enough to construct the superinstruction set – the bytecode must also be available. While it is possible to provide the superinstruction set construction tool with a copy of the application code of the profiled application, this would inevitably miss some code. The JVM itself generates code at runtime for lambdas and generates code for reflection, which would be missed. Furthermore, applications themselves may also generate code at runtime. Finally, tracing all the locations from which classes may be loaded is hard, as code may come from nested JARs, libraries scattered across the system, or from somewhere in the standard library. As such, we decided to write every class to disk when profiling is enabled. With profiling enabled, the JVMTI agent not only instruments each class with profiling instructions, but also writes the class to disk in a folder called class-dump. This is the class before receiving special profiling instructions, but after code stretching. Writing the classes after code stretching is key for static evaluation, as this is the input of the runtime substitution algorithms. To write the classes to disk, the same naming mechanism is used to identify the class. The class is dumped to a file, with the file name being the fully qualified name of the class combined with the hash code.

5.4.3 Conclusion

In this section we saw the implementation of the profiling as designed in section 4.3. The implementation is refreshingly simple compared to the rather complex design. We showed how specialized profiling instructions can be used to implement the various counters, and how their instruction handlers were implemented. Finally, we discussed how the profile is written to disk, making it ready to be used in the next section to finally construct a superinstruction set.

5.5 Constructing the superinstruction set

With the profiler implementation out of the way, it is time to construct the superinstruction set. To reiterate the design of the superinstruction set construction algorithm as it was discussed in section 4.4: an iterative optimization algorithm is used to select the optimal superinstruction candidates, while using a technique called static evaluation to score each randomly-generated candidate superinstruction set. The higher the score, the better the superinstruction set.

The superinstruction candidates that this iterative optimization algorithm picks from are derived from the profiled bytecode. Furthermore, the superinstruction candidates derived directly from the profile are called “base superinstruction candidates” and keep their profiling information. These base superinstruction candidates are then used to generate more superinstruction candidates by a preprocessor, which cuts the base superinstruction candidates into smaller pieces, increasing the number of superinstruction candidates available. To test the quality of a particular set of superinstruction candidates (a candidate superinstruction set), the runtime substitution algorithm is applied to all code that was profiled in a procedure called static evaluation, where the quality of a candidate superinstruction set is tested without having to rerun the profiled application. This is where the base superinstruction candidates come in again: since they are a direct derivative of the profiled bytecode and have kept their associated profiling information, they can be used to score the candidate superinstruction set. The runtime substitution algorithm is tasked with substituting superinstructions into the base superinstruction candidates, now using them as code into which superinstructions have to be substituted rather than as members of a superinstruction set. Then, the profiling information from the base superinstruction candidate is used to determine the number of instruction dispatches, and how this compares to the original version without superinstructions. The number of instruction dispatches saved makes up the score of a particular superinstruction set, and the higher the score the better the superinstruction set. In section 5.5.1 we discuss the architecture of the tool we created to generate the superinstruction set. We give an overview of the architecture of this tool, and how it can be used to read the profile, optimize the superinstruction set and finally generate the interpreter. While part of the same tool, the topic of interpreter generation is mostly reserved for section 5.7, as this is where an in-depth look will be taken at the various VM structures that need to be generated. Section 5.5.2 focuses on how the profile is loaded. The file format discussed in the previous section (section 5.4) is read from disk, and the (base) superinstruction candidates are created. Once the right data structures are in place, section 5.5.3 discusses how the optimization algorithm is implemented, including how the algorithm interfaces with the runtime substitution algorithms and how exactly to count the dispatches saved by the placed superinstructions. Section 5.5.4 finally wraps up the construction of the superinstruction set, summarizing and reflecting back on the topics just mentioned.
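
The scoring step can be summarized in a short sketch. Everything here, including the candidate type and its methods, is illustrative; the interface with setInstructions and convert matches the common substitution-algorithm interface described in section 5.3.5, and the real implementation is the subject of section 5.5.3.

import java.util.List;
import org.objectweb.asm.tree.InsnList;

final class StaticEvaluationSketch {

    // Illustrative view of a base superinstruction candidate: a piece of profiled
    // code that knows, per instruction, how often it was executed.
    interface BaseCandidate {
        InsnList copyInstructions();                   // work on a copy; the profile data stays intact
        long dispatchesSavedBy(InsnList substituted);  // saved dispatches, weighted by execution counts
    }

    static long score(List<BaseCandidate> baseCandidates, RuntimeSubstitutionAlgorithm algorithm) {
        long saved = 0;
        for (BaseCandidate candidate : baseCandidates) {
            InsnList code = candidate.copyInstructions();
            algorithm.convert(code);                   // place superinstructions in-place
            saved += candidate.dispatchesSavedBy(code);
        }
        return saved;                                  // higher score = better superinstruction set
    }
}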

5.5.1 Interpreter Generator tool implementation

As mentioned in the introduction, the whole workflow, starting with a profile and going all the way to a generated interpreter ready to be compiled, is implemented in one tool: the interpreter generator. This tool goes through three steps, implementing the whole interpreter generation workflow:

1. Reading the profile
   • Parsing command line arguments
   • Reading the profile, parsing the various counters in the app.profile file and associating this information with bytecode instructions from the profile class dump (discussed in section 5.5.2)
   • Creating the base superinstruction candidates from the profiled bytecode
   • Creating the full set of superinstruction candidates by using a preprocessor selected via command line arguments
2. Finding the optimal superinstruction set
   • Running the iterative superinstruction optimization algorithm including static evaluation (section 5.5.3)
3. Generating the interpreter
   • Generating the interpreter files from the optimized superinstruction set (discussed in section 5.7)

Using the Interpreter Generator

Before diving into the software architecture of the interpreter generator, let us first consider how the tool is used and what options it has. The workflow implemented by the tool is a relatively straightforward implementation of the design as was shown in Figure 4.2 back in section 4.4.4.

$ java -jar InterpreterGenerator.jar \
    /path/to/folder/with/profile \
    --preprocessor=fully.qualified.preprocessor.ClassName \
    --mutator=fully.qualified.mutator.ClassName \
    --si=fully.qualified.superinstruction.substitution.algorithm.ClassName \
    --instructionSetSize=20 \
    --time=12000 or --evaluations=1000000

Listing 5.8: Invocation of the Interpreter Generator tool

Listing 5.8 shows an invocation of the Interpreter Generator. The tool requires the following parameters:

• The path to the folder with the profile (app.profile file and class-dump folder).
• The preprocessor. This component creates more superinstruction candidates from the base superinstruction candidates.
• The active mutator. This component is responsible for generating the initial superinstruction set candidate, as well as deriving new superinstruction set candidates from a previous best.
• The implementation of the runtime superinstruction (“si”) substitution algorithm.
• The maximum superinstruction set size.
• An end condition as to when to terminate the optimization process: this can either be after a set time (in milliseconds) or after a number of evaluations.

The interpreter generator will continue to optimize the superinstruction set until either “evaluations” number of evaluations have been completed or until “time” number of milliseconds have elapsed. It will then take this superinstruction set and generate all the interpreter files for it.

Example invocation

In Listing 5.9, the full output of a run of the tool can be seen. Here, the steps that it takes are clearly visible: first the profile is read and the preprocessor is applied (here a special “no-op” preprocessor is active, which does not add extra superinstruction candidates). Various statistics about the read profile are shown, including the theoretical max skip. This is the number of instruction dispatches that can be saved if every single base superinstruction candidate were to be turned into a superinstruction, and this is also the theoretical maximum score discussed in section 4.4.3. We then see the optimizer configuration, with the chosen runtime substitution algorithm (called strategy) and the mutator used to derive new superinstruction set candidates. The optimizer automatically detects the number of hardware threads available, and has detected 16 threads in this case. The target superinstruction set size (target IS size) is 4, and the initial number of dispatches determined by static evaluation is 27,269,049,458. Then, the tool starts optimizing. This optimization takes 180 seconds, as set by the command line parameters. As it optimizes, the tool prints every second what its current best is, and how its current best stacks up against the theoretical maximum. For brevity, we cut most of this output from the listing. For this application, within just one second it finds a superinstruction set already achieving a significant reduction in the number of instruction dispatches (85.63 % of the theoretical maximum). It continues on until it reaches the time limit, stopping at a reduction of 21,194,124,039 instruction dispatches. In the final step the VM structures are generated and written to disk. We will cover those in detail in section 5.7.

$ java -jar InterpreterGenerator.jar \
    primes \
    --preprocessor=jdk.internal.vm.si.impl.profiler.processor.NoOpProcessor \
    --mutator=jdk.internal.vm.si.impl.optimizer.mutator.IndependentInstructionSetMutator \
    --si=jdk.internal.vm.si.runtime.ShortestPathAlgorithm \
    --instructionSetSize=4 \
    --time=180000

=== ProfileReader configuration ===
App profile path: primes/app.profile
Class dump path: primes/class-dump
Pre-processor: class jdk.internal.vm.si.impl.profiler.processor.NoOpProcessor

=== Reading profile... ===
=== Profile statistics ===
Unique segments: 5,067
Candidate superinstr: 5,936
Profiled instructions: 27,268,590,774
Total instructions: 32,838
Longest segment: 1,026
Average segment length: 4.5
Average weight: 1,195,889.739
Minimum dispatches: 6,059,573,308
Theoretical max skip: 21,209,017,466 (77.778%)

=== Optimizer configuration ===
Strategy: jdk.internal.vm.si.runtime.ShortestPathAlgorithm
Mutator: jdk.internal.vm.si.impl.optimizer.mutator.IndependentInstructionSetMutator
Time: 180,000 ms
Maximum evaluations: 9,223,372,036,854,775,807
Threads: 16
Target IS size: 4
Initial: 27,269,049,458

=== Optimizing... ===
Score: 18,161,292,502 (85.63 %), heat: 0.49, evaluations: 1,025
Score: 21,188,993,170 (99.906 %), heat: 0.481, evaluations: 2,561
... snip ...
Score: 21,194,124,039 (99.93 %), heat: 0, evaluations: 260,353
Final score: 21,194,124,039 / 21,209,017,466 (99.93 %) computed in 261,889 evaluations and 180,124 ms

=== Writing interpreter files ===
InstructionSetDefinition size: 212
Writing to jdk11/src/hotspot/share/interpreter/generated/bytecodes.generated.hpp
Writing to jdk11/src/hotspot/share/interpreter/generated/bytecodes.definitions.hpp
Writing to jdk11/src/hotspot/share/interpreter/generated/bytecodes.length.hpp
Writing to jdk11/src/hotspot/share/interpreter/generated/bytecodeInterpreter.handlers.hpp
Writing to jdk11/src/hotspot/share/interpreter/generated/bytecodeInterpreter.jumptable.hpp
Writing instructions to jdk.internal.vm.si.generator/superinstructions.list

Listing 5.9: Example invocation and output of the Interpreter Generator tool

Software architecture

[Class diagram: the Runtime Si Algorithms package (jdk.internal.vm.si.runtime) contains the RuntimeSubstitutionAlgorithm interface (setInstructions, convert) with its implementations TripletAlgorithm, TreeAlgorithm, ShortestPathAlgorithm and EquivalenceAlgorithm, plus the bundled ASM library (jdk.internal.vm.si.runtime.asm). The Interpreter Generator (jdk.internal.vm.si.impl) contains the Profile Reader package (jdk.internal.vm.si.impl.profiler) with the Profiler and ProfileReader classes and the Preprocessor interface with implementations in jdk.internal.vm.si.impl.profiler.processor (e.g. PrefixProcessor, EquivalenceAwareProcessor), the Optimizer package (jdk.internal.vm.si.impl.optimizer) with the InstructionSetOptimizer class and the InstructionSetMutator interface with implementations in jdk.internal.vm.si.impl.optimizer.mutator (e.g. PrefixMutator, IndependentMutator), and the Generator package (jdk.internal.vm.si.impl.generator) with the InterpreterGenerator and CacheBuilder classes.]

Figure 5.1: Simplified class diagram of the Interpreter Generator

In Figure 5.1, a simplified class diagram of the interpreter generator tool can be seen, showing the most important packages and classes within each package. All runtime algorithms (package Runtime Si Algorithms) are bundled into the interpreter generator tool: these need to be available to do static evaluation. This package is part of the QuickInterp VM standard library and available at runtime, and contains all the runtime substitution algorithms and other machinery (like ASM) to enable placing superinstructions into bytecode. We see the common interface (RuntimeSubstitutionAlgorithm) as mentioned in section 5.3.5, with the various implementations of that interface. The main method in the Profiler class is where execution starts in the Interpreter Generator package. In that package, the three steps from the start of this section are clearly visible: (1) the profile parsing phase in package Profile Reader, (2) the optimizer in package Optimizer and finally (3) the code generator in package Generator. In the next few sections we will discuss each package in more detail. Section 5.5.2 discusses how the profile is loaded, while section 5.5.3 discusses the optimization loop. And, as mentioned, the generation of the files is its own topic as it requires more discussion on how the VM works, which is reserved for section 5.7. There is one additional class that is not connected to anything: the CacheBuilder in the

Generator package. This class has its own main method, and it can be invoked as its own program. We will cover the functionality of this class in section 5.7; however, this class can be invoked to generate a class cache consisting of classes that already contain superinstructions. It is not part of the regular interpreter-generator workflow.

5.5.2 Loading the profile

The profile loading phase has to parse the profile, create the base superinstruction candidates and use these to create the full set of superinstruction candidates for the optimizer to use.

Creating base superinstruction candidates

The first step in loading the profile is simply parsing all the bytecode from the profile and cutting it into base superinstruction candidates. All bytecode instructions are visited in the order in which they occurred in the method, and base superinstruction candidates are built up incrementally. The places to “cut” and start a new base superinstruction candidate are based on the characteristics of the instruction (e.g. invokedynamic cannot be part of a superinstruction, so the candidate is cut there). For each bytecode instruction, a set of flags is provided that captures the characteristics of the bytecode instruction. Appendix A.2 contains a full listing of all the flags that exist, and which instructions have which flags. For the cutting of bytecode to create superinstruction candidates however, only two flags are important:

NO_SUPERINSTRUCTION This instruction cannot be part of a superinstruction. Usually these are instructions which leave the interpreter and expect to return by means of a dispatch, and since it is not possible to dispatch to within a superinstruction, these cannot be part of a superinstruction.

TERMINAL When placed in a superinstruction, no instruction would be executed after this instruction as it either unconditionally jumps away or exits the method.

When visiting an instruction with the NO_SUPERINSTRUCTION flag, the bytecode is cut at that point. The bytecode instructions gathered up until that point are turned into a base superinstruction candidate if there are at least two of them (single instructions make no sense to turn into a superinstruction or to use for evaluation); otherwise they are simply discarded. The instruction with the NO_SUPERINSTRUCTION flag does not end up in any superinstruction candidate. Besides the NO_SUPERINSTRUCTION flag, there is another group of instructions whose characteristics impact base superinstruction candidate construction: those with the TERMINAL flag. An instruction with the TERMINAL flag is not necessarily excluded from being part of a superinstruction, but instructions with this flag always jump away, so there is no use in continuing the superinstruction after them. As such, while the instruction with the TERMINAL flag is allowed into the base superinstruction candidate, the bytecode is cut right after it to start a new superinstruction candidate.
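In Java, this cutting pass could look roughly as follows. This is a minimal sketch under simplifying assumptions: the Instruction type and the class name are hypothetical stand-ins, not the actual QuickInterp classes, and only the two flags discussed above are modelled.

import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal instruction model: only the two flags that matter here.
record Instruction(String mnemonic, boolean noSuperinstruction, boolean terminal) {}

class CandidateCutter {
    // Cuts a method's instruction stream into base superinstruction candidates.
    static List<List<Instruction>> cut(List<Instruction> method) {
        List<List<Instruction>> candidates = new ArrayList<>();
        List<Instruction> current = new ArrayList<>();
        for (Instruction insn : method) {
            if (insn.noSuperinstruction()) {
                // The offending instruction is excluded; emit what we have so far.
                flush(candidates, current);
            } else if (insn.terminal()) {
                // A terminal instruction may still end a superinstruction,
                // but nothing useful can follow it, so cut right after it.
                current.add(insn);
                flush(candidates, current);
            } else {
                current.add(insn);
            }
        }
        flush(candidates, current);
        return candidates;
    }

    private static void flush(List<List<Instruction>> out, List<Instruction> current) {
        // Single instructions are not useful as superinstruction candidates.
        if (current.size() >= 2) out.add(new ArrayList<>(current));
        current.clear();
    }
}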

Retaining profiling information

The information from the profile is sparse – while it is possible to derive the number of executions for every instruction, it is not stored that way directly. The base superinstruction candidates carry, for each instruction, two values that are needed to enable static evaluation: the number of executions, and the number of incoming jumps. A simple algorithm is run over the bytecode of each method to fill in these values for every instruction. Listing 5.10 and Listing 5.11 show an example of profile data, which gives some insight into the algorithm. In Listing 5.10, three profile counters are shown in a simplified version of the app.profile file format. Listing 5.11 shows a sequence of bytecode instructions corresponding to the profile counters. In this listing, instruction operands are omitted for brevity; since this is not a superinstruction, they are still present in the actual bytecode. Blue arrows show how the if_icmplt

1 lec example(II)V:1 300
2 jc example(II)V:5 300
3 jc example(II)V:9 150

Listing 5.10: Simplified example of a profile file, using line numbers instead of bytecode offsets

1 iload_1
2 iload_2
3 imul
4 iconst_2
5 if_icmplt
6 iinc
7 bipush
8 iload_2
9 goto
10 return

Listing 5.11: Bytecode corresponding to the counters from Listing 5.10

Instruction   Executions              Incoming jumps
iload_1       300                     0
iload_2       300                     0
imul          300 + 150 = 450         150
iconst_2      450                     0
if_icmplt     450                     0
iinc          450 − 300 = 150         0
bipush        150                     0
iload_2       150                     0
goto          150                     0
return        150 − 150 + 300 = 300   300

Table 5.1: Table showing all interpolated executions and incoming jump values from Listing 5.11. Values in bold were present in the profile.

jumps to the return instruction, and likewise how the goto jumps to the imul instruction. Observe that for both of these jump instructions, jump counters are available in the profile. The output of the algorithm can be seen in Table 5.1. Values in bold are read directly from the profile file. The first instruction in the method must always have a local execution counter indicating how many times the method was executed, and that is also the case here. This local execution counter counts 300 executions, which is assigned to the first instruction (iload_1). No branching happens, so the interpreter moves on to the next instruction. As no branching has happened, the second instruction (iload_2 on line 2) also gets assigned an execution count of 300. We see this process at more places – without a mention in the profile by means of a local execution counter or jump counter, each instruction simply inherits the executions of its preceding instruction. The imul at line 3 is the branch target of the goto instruction from line 9. Since the branch counter reveals that this branch was taken 150 times, this 150 can be added to the number of executions coming in from the previous instruction: 300 + 150 = 450. Branch counters do not just impact the instructions they jump to, but also the instructions they jump away from, and this is visible in the iinc on line 6. The if_icmplt instruction jumps away 300 times, leaving just 450 − 300 = 150 executions at the iinc instruction. With these rules, the whole table can be filled in – most instructions just inherit the number of executions from their preceding instruction, with jumps adding to or subtracting from that. Some instructions always branch in a particular way; an example of this is the return instruction. While not shown in this example, the return instruction is always followed by an implicit local execution counter of 0 – the only way for execution to resume after a return instruction is by a jump coming from somewhere else. This also brings up another aspect of the algorithm that is missing from this example: a profile file may contain multiple local execution counters, and will do so for certain jumps that cannot be part of a superinstruction (e.g. exception handlers, switch-statements).

When encountering a local execution counter anywhere, the algorithm will simply take this value for the "executions" column, instead of computing it. The table is computed for each base superinstruction candidate, and made available for static evaluation. The implementation of static evaluation will be discussed in section 5.5.3.
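The interpolation rules described above can be condensed into a short sketch. This is a simplified illustration, not the actual QuickInterp code: instructions are indexed 0..count-1, and the sparse "lec" and "jc" counters are assumed to have been grouped into maps keyed by instruction index beforehand.

import java.util.Map;

class ProfileInterpolator {
    // Fills in executions[] and incomingJumps[] for one method.
    // localExecution holds "lec" counters; jumpsFrom/jumpsTo hold "jc" counters.
    static void interpolate(int count,
                            Map<Integer, Long> localExecution,
                            Map<Integer, Long> jumpsFrom,
                            Map<Integer, Long> jumpsTo,
                            long[] executions, long[] incomingJumps) {
        long previous = 0;         // executions of the preceding instruction
        long previousOutgoing = 0; // executions that jumped away from it
        for (int i = 0; i < count; i++) {
            long incoming = jumpsTo.getOrDefault(i, 0L);
            incomingJumps[i] = incoming;
            if (localExecution.containsKey(i)) {
                // A local execution counter overrides any computed value.
                executions[i] = localExecution.get(i);
            } else {
                // Inherit from the preceding instruction, minus the executions
                // that jumped away from it, plus the jumps landing here.
                executions[i] = previous - previousOutgoing + incoming;
            }
            previous = executions[i];
            previousOutgoing = jumpsFrom.getOrDefault(i, 0L);
        }
    }
}

Running this over the example of Listing 5.11 with the counters of Listing 5.10 reproduces Table 5.1: for instance, the return instruction receives 150 − 150 + 300 = 300 executions.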

Creating more superinstruction candidates

As mentioned in the design in section 4.4.4, it is the job of the preprocessor component to generate more superinstruction candidates from the base superinstruction candidates, and this component is configurable with the --preprocessor command line argument. There is one kind of preprocessing that is always done on the base superinstruction candidates regardless of the active preprocessor: splitting them up based on where incoming jumps happen. An example of this can be seen in Listing 5.12 and Listing 5.13.

1 iload            (Part X)
2 iload
3 (incoming jump)
4 imul             (Part Y)
5 iload
6 iconst_2
7 (incoming jump)
8 iadd             (Part Z)
9 irem

Listing 5.12: Short sequence of JVM bytecode instructions containing incoming jumps

1 superXYZ:
2 iload
3 iload
4 imul
5 iload
6 iconst_2
7 iadd
8 irem
9 superXY:
10 iload
11 iload
12 imul
13 iload
14 iconst_2
15 superYZ:
16 imul
17 iload
18 iconst_2
19 iadd
20 irem
21 superX:
22 iload
23 iload
24 superY:
25 imul
26 iload
27 iconst_2
28 superZ:
29 iadd
30 irem

Listing 5.13: Superinstruction candidates derived from Listing 5.12

In this example, the only base superinstruction candidate is superXYZ, which is all of Listing 5.12. The other superinstruction candidates are all the different cuts one can make by splitting superXYZ at one or more incoming jump markers. This transformation is needed to address the need for superinstructions at incoming jumps. Imagine that superXYZ was placed as a superinstruction into the code of Listing 5.12 – in this case, none of the incoming jumps can make use of that superinstruction, as it is not possible to jump to within a superinstruction. This processing step ensures that there are superinstructions available to be placed when starting at any of the incoming jump markers. And if the superYZ superinstruction exists, it competes with the whole superXYZ superinstruction, as superYZ can also be used by executions that start from the top. As such, our processing step does not just generate superinstruction candidates starting at each incoming jump, it

makes all combinations of consecutive "parts" delimited by incoming jumps. After this transformation is completed, the new list of superinstruction candidates is handed over to the preprocessor. We implemented three preprocessors:

NoOpProcessor which does nothing, i.e. it returns its input as the processed list of superinstruction candidates (this one is omitted from the class diagram in Figure 5.1).

PrefixProcessor generates every possible prefix of each superinstruction candidate. For example, when provided with a-b-c-d, it will add a-b-c and a-b. This preprocessor was mostly added for the triplet-based substitution algorithm, as that algorithm is not capable of placing longer superinstructions in some cases (a sketch of this preprocessor follows after this list).

EquivalenceAwareProcessor will remove superinstruction candidates that are equivalent.
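As a rough illustration of the Preprocessor interface from Figure 5.1, the PrefixProcessor could be sketched as follows. The candidate representation is simplified to a plain list of opcodes; the real classes carry more information (profiling data, flags), so this is an assumption-laden sketch rather than the actual implementation.

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified: a candidate is just its sequence of opcodes.
interface Preprocessor {
    List<List<Integer>> process(List<List<Integer>> candidates);
}

// Adds every proper prefix of length >= 2 of each candidate, deduplicating the result.
class PrefixProcessor implements Preprocessor {
    @Override
    public List<List<Integer>> process(List<List<Integer>> candidates) {
        Set<List<Integer>> out = new LinkedHashSet<>(candidates);
        for (List<Integer> candidate : candidates) {
            for (int len = 2; len < candidate.size(); len++) {
                out.add(new ArrayList<>(candidate.subList(0, len)));
            }
        }
        return new ArrayList<>(out);
    }
}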

We decided against splitting on outgoing conditional jump instructions, even though there is something to be said for it. Placing a superinstruction that contains a conditional jump instruction with a high probability of leaving the superinstruction (e.g. one that did so in every execution according to the profile) degrades the value of that superinstruction – it does not save as many dispatches, as the jump always leaves the superinstruction early. However, there is little harm done here – placing a shorter superinstruction is not "better", it is just "not worse". As such, we do not provide the superinstruction set optimization algorithm with these shorter superinstructions.

5.5.3 Optimizing the instruction set

With the superinstruction candidates available, it is time to discuss how the optimization loop works. The implementation of the loop is a straightforward implementation of the design from section 4.4.5: a "current-best" candidate superinstruction set is kept, which is randomly mutated a set number of times (256 times). These mutations are evaluated, and if any one of them beats the current best then it becomes the new current best. This is repeated until the end condition is met: either the timeout set by --time expires, or the number of evaluations set by --evaluations has been reached.

Iterative optimization algorithm

The InstructionSetMutator from Figure 5.1 sits at the heart of the iterative optimization algorithm. It implements the operations "seed" and "mutate", which generate new candidate superinstruction sets – either from scratch (seed), or based on the current best superinstruction set candidate (mutate), as designed in section 4.4.5. There are two implementations of the mutator: the IndependentMutator and the PrefixMutator. The IndependentMutator does not consider any relation between superinstruction candidates, nor does it generate new superinstruction candidates by itself – all superinstruction candidates that it provides come from the set of available superinstruction candidates. The PrefixMutator is a special mutator developed for the triplet-based substitution algorithm. This mutator ensures that each superinstruction added to its candidate superinstruction set either has a length of 2, or has its prefix present. To create the prefixes, the PrefixMutator actually generates new superinstruction candidates that might not be available. While it is possible to use the PrefixProcessor to make the prefixes available, this would cause many evaluation loops to be "wasted": the mutator would pick a candidate superinstruction but not its prefix, adding no value as the triplet-based substitution algorithm cannot place superinstructions without a prefix. The implementation is concurrent: as seen in the example invocation from section 5.5.1, it detects and makes use of all hardware threads available on the platform on which it is run. A thread pool is created with the detected number of hardware threads, and tasks are submitted to this pool. For each iteration of the optimization algorithm, it first spawns 256 tasks (a fixed number), each of which is submitted to the thread pool. These tasks perform as much work as possible concurrently: they mutate the current best candidate superinstruction set, use this to run static evaluation on all base superinstruction candidates, and then provide the static

evaluation score combined with the mutated candidate superinstruction set as their result. The main evaluation loop waits for each task to complete, and once all tasks have completed the best result is picked from the 256. Finding the best from a list of evaluated candidate superinstruction sets and updating the "current best" field is the only part that does not happen concurrently. If the end condition has not been met yet, the cycle repeats.
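The shape of this loop can be sketched as follows. The ScoredSet record and the mutate/evaluate placeholders are hypothetical stand-ins for the real mutator and static evaluator; only the evaluation-budget end condition is modelled, not the --time timeout.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.stream.IntStream;

class OptimizerLoop {
    record ScoredSet(List<List<Integer>> superinstructions, long dispatches) {}

    static ScoredSet optimize(ScoredSet seed, long evaluationBudget,
                              ExecutorService pool) throws Exception {
        ScoredSet best = seed;
        long evaluations = 0;
        while (evaluations < evaluationBudget) {
            final ScoredSet current = best;
            // Spawn a fixed batch of 256 independent mutate-and-evaluate tasks.
            List<Future<ScoredSet>> batch = IntStream.range(0, 256)
                    .mapToObj(i -> pool.submit(() -> evaluate(mutate(current))))
                    .toList();
            evaluations += batch.size();
            // Only this selection step runs sequentially on the main thread.
            for (Future<ScoredSet> f : batch) {
                ScoredSet candidate = f.get();
                if (candidate.dispatches() < best.dispatches()) best = candidate;
            }
        }
        return best;
    }

    // Placeholders for the mutator and the static evaluator described above.
    static ScoredSet mutate(ScoredSet parent) { return parent; }
    static ScoredSet evaluate(ScoredSet candidate) { return candidate; }
}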

Interfacing with ASM

While in essence the same, the implementation of static evaluation is a little more complex than what the design shows in section 4.4.3. One problem is invoking the runtime substitution algorithms on the base superinstruction candidates – these must be made available as an ASM object graph while still retaining their profiling information to allow comparing the before and after situation. Interfacing with the runtime substitution algorithm is made complex by the fact that regular ASM instruction nodes contain instruction operands. However, these are not present in the base superinstruction candidates. Fortunately, runtime substitution algorithms do not actually use the instruction operands. Additionally, the profiling information must be retained to allow counting how many dispatches are saved. To solve these problems, two new surrogate instruction nodes are added to ASM. Since we tweaked ASM to work with two-byte opcodes, all ASM source code is available in our project and can be modified. The instruction nodes we added are:
• A surrogate instruction node with only an opcode but no operands. Besides the opcode, it also stores profiling information regarding the number of executions.
• A surrogate jump target ("label node" in ASM), similarly holding profiling information.
The base superinstruction candidate is converted into a sequence of surrogate instruction nodes and surrogate jump targets, which creates a compatible format for each runtime substitution algorithm to perform substitution on.

Tracing paths to count dispatches

Another problem is how to deal with instructions that become unreachable as a result of placing a superinstruction. For example, in the instruction sequence "a b c", if one were to place the superinstruction "a-b", instruction "b" itself would be unreachable and thus never executed. Static evaluation needs to pick up on this and correctly score the substitution based on how many instruction dispatches are saved by never executing "b". This problem stems from the way superinstructions are added to the ASM object graph: placing a superinstruction replaces only the first instruction node, leaving the rest in the object graph. To repeat the example: in the instruction sequence "a b c", placing superinstruction "a-b" would lead to the instruction sequence "a-b b c" in the object graph, with "b" still present. For substitution, this is no problem, as the handler for a-b skips the separate b instruction anyway, and a runtime substitution algorithm may also substitute "b". To make matters worse, "b" may be reachable by a jump from elsewhere in the program, and this needs to be counted as well. As such, our static evaluation algorithm works by tracing paths through the output of the substitution algorithm using an abstract interpreter. A path is a sequence of instructions within the base superinstruction candidate (before and after passing through the substitution algorithm) that can be turned into a superinstruction. Multiple such paths can exist within one base superinstruction candidate, as the superinstruction candidate can be entered at multiple locations by jumps, each creating its own path. By working its way through every such path in the code – while using the profiling data to find out how many times that path was taken – the static evaluation algorithm is able to count how many dispatches are needed in the new bytecode with the superinstructions substituted in. The path tracing abstract interpreter is started once at the beginning of the base superinstruction candidate, and also once from every jump target within the base superinstruction candidate. For each starting location, it simulates execution while tracking the number of executions that it represents. This number can only go down – if execution jumps anywhere, it goes off the path, meaning

that those executions are no longer part of the path that it is tracking. If such a jump lands somewhere else within the base superinstruction candidate then this is not a problem, since that jump target is visited as its own path. And if it lands in another base superinstruction candidate, it will get processed as a path of that candidate when the evaluation gets there. To keep score, the abstract interpreter keeps a shared dispatch counter, counting how many instruction dispatches are needed to cover all the paths within the base superinstruction candidate. For each instruction that is visited, it adds the number of executions that the current path represents. This shared dispatch counter is the final score. When visiting a superinstruction, the abstract interpreter properly skips the instructions that follow. For example, if a superinstruction has a length of four, it will not count the next three instructions, counting the dispatch for just one. However, the abstract interpreter still looks at the profiling data for each superinstruction part – if it is a conditional jump instruction that jumps out, it will still adjust the number of executions the current path represents.
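A heavily simplified sketch of one such path trace is shown below. It assumes that, after substitution, every node knows how many original instructions it covers and how many executions jump away at it (the real implementation derives both from the surrogate nodes and per-part profiling data); the Node and DispatchCounter names are hypothetical.

import java.util.List;

// Hypothetical post-substitution view of one base superinstruction candidate:
// each node covers one original instruction, or several for a superinstruction.
record Node(int length, long jumpsAway) {}

class DispatchCounter {
    // Counts dispatches along one path, starting at index 'start' with the
    // number of executions that enter the path there. The caller runs this
    // once from the top and once per jump target, summing the results.
    static long count(List<Node> nodes, int start, long pathExecutions) {
        long dispatches = 0;
        int i = start;
        while (i < nodes.size() && pathExecutions > 0) {
            Node node = nodes.get(i);
            dispatches += pathExecutions;        // one dispatch per visit of this node
            pathExecutions -= node.jumpsAway();  // executions that leave the path here
            i += node.length();                  // a superinstruction skips its parts
        }
        return dispatches;
    }
}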

5.5.4 Conclusion

In this section, we discussed the interpreter generator tool, which reads the profile, determines the optimal superinstruction set and generates the interpreter – although how the generation works has to wait until section 5.7. The tool accepts a variety of command-line arguments to allow testing different configurations and parameters. We discussed how the tool reads the profile: the profile data is interpolated and made available in the base superinstruction candidates. We saw how the base superinstruction candidates themselves are cut based on instruction flags, and how preprocessing happens. Then, we discussed the implementation of the iterative optimization algorithm, how concurrency is used and how a pluggable mutator is used to efficiently work with the prefix constraint of the triplet-based runtime substitution algorithm. We discussed how the static evaluation interfaces with ASM – how surrogate instruction nodes enable us to present an ASM object graph to each runtime substitution algorithm, even when the bytecode itself is not used. Finally, we saw how static evaluation counts the dispatches saved by tracing "paths" through the output of the runtime substitution algorithm, wrapping up the optimization phase of the interpreter generation process. This leaves just section 5.7 to explain how exactly one goes from an optimized set of superinstructions to a compilable VM equipped with these superinstructions. But before we go there, it is time to discuss how runtime substitution is implemented in the next section.

5.6 Superinstruction placement

Superinstruction placement occurs in Java as discussed in section 5.3.5. That section left the problem of placement at a JNI call from the JVMTI agent to "some" Java method. In this section we will trace this method within JVM-managed code to the runtime substitution algorithm and then through it, until a converted class with superinstructions comes out the other end. We will discuss how the active runtime substitution algorithm and superinstruction set are loaded, and how each of the runtime substitution algorithms is implemented. Section 5.6.1 starts out with an overview of the software architecture, discussing the components and what exactly the JVMTI agent calls. Section 5.6.2 discusses the implementations of the various runtime substitution algorithms. These algorithms are a relatively straightforward implementation of their design, which already leads to the wrap-up of this section in section 5.6.3.

5.6.1 Overview

The implementation of the runtime is split into two pieces: one part belongs to the java.base module², which is the core module of the JVM. java.base contains classes like java.lang.Object and java.lang.String, which ensures that it is loaded with the very first code. The second part is

² "Module" in this context refers to the Java language feature introduced in Java 9.

its own module: jdk.internal.vm.si.runtime. Keeping it independent from java.base provides much quicker recompilation times, as java.base is rather large and takes a long time to compile.

[Figure content: the java.base module containing the ClassBytecodeFormatConverterProxy and the ClassBytecodeFormatConverter interface (both exposing convertClass(input : byte[]) : byte[]), and the Runtime Si Algorithms module (jdk.internal.vm.si.runtime) containing the ASM library (ClassReader, ClassWriter), InstructionSetConfiguration, ClassBytecodeFormatConverterImpl, and the RuntimeSubstitutionAlgorithm interface (setInstructions(instructionSet), convert(instructionList)) with its implementations ShortestPathAlgorithm, TreeAlgorithm, TripletAlgorithm and EquivalenceAlgorithm.]


Figure 5.2: Simplified class diagram of the runtime

Figure 5.2 is a class diagram showing how the various components are laid out. Note that parts of this class diagram were already visible earlier in section 5.5.1, however in this version the runtime is shown in more detail. The class diagram shows the two modules, and how java.base only contains two classes. The classes are:

ClassBytecodeFormatConverter is an interface for classes that can convert bytecode. Bytecode is provided and returned in its most basic format: the byte array. This is the bytecode for the whole class – not just one method or one Code segment within a class.

ClassBytecodeFormatConverterProxy is the entry point called by the JVMTI agent when it wants to have a class converted. The convertClass method here is static, and this class delegates to an implementation of the ClassBytecodeFormatConverter interface.

The other module (jdk.internal.vm.si.runtime) contains the implementation of the ClassBytecodeFormatConverter interface. But before we go there, let us first discuss how the ClassBytecodeFormatConverterProxy obtains an implementation of this interface: the two modules are joined via the service loading API. The java.base module declares that it uses the interface, and the jdk.internal.vm.si.runtime module supplies an implementation in its module-info.java file. This architecture also allows extra JARs on the class path to provide an alternative implementation of the whole runtime superinstruction architecture, which was a useful feature during development. In the jdk.internal.vm.si.runtime module we see the implementation of the interface: ClassBytecodeFormatConverterImpl. This class is the gateway to one of the runtime superinstruction substitution algorithms – it converts the bytes of the class to an ASM object graph, one for each method. It runs this object graph through the active runtime substitution algorithm before serializing it back to a byte array. To do this conversion from and to a byte array, it uses the ASM ClassReader and ClassWriter, as indicated in the class diagram.
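In outline, the proxy's use of the service loading API could look like the sketch below. The interface and method names are taken from the class diagram; the body is an assumption-laden illustration of the general ServiceLoader pattern, not the actual QuickInterp source.

import java.util.ServiceLoader;

// Interface as shown in Figure 5.2.
interface ClassBytecodeFormatConverter {
    byte[] convertClass(byte[] input);
}

final class ClassBytecodeFormatConverterProxy {
    // Resolve the implementation supplied by jdk.internal.vm.si.runtime once.
    private static final ClassBytecodeFormatConverter CONVERTER =
            ServiceLoader.load(ClassBytecodeFormatConverter.class)
                    .findFirst()
                    .orElseThrow(() -> new IllegalStateException(
                            "no ClassBytecodeFormatConverter implementation available"));

    // Static entry point invoked (via JNI) by the JVMTI agent.
    static byte[] convertClass(byte[] input) {
        return CONVERTER.convertClass(input);
    }
}

In the Java module system, the consuming module would declare a "uses" clause for the interface and the providing module a "provides ... with ..." clause in its module descriptor, which is what joins the two pieces together.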

The ClassBytecodeFormatConverterImpl must somehow know which runtime substitution algorithm to call. Additionally, the runtime substitution algorithm must have the list of all available superinstructions. This information is written to disk by the interpreter generator tool (from section 5.7) by serializing an instance of InstructionSetConfiguration using regular Java serialization. This class is a very simple plain Java object, holding a list of instruction definitions as well as the name of the runtime substitution class, and is deserialized by ClassBytecodeFormatConverterImpl on startup. ClassBytecodeFormatConverterImpl uses the information from this class to construct an instance of the runtime substitution algorithm and to provide the algorithm with the list of superinstructions. This approach keeps the runtime substitution algorithm pluggable without requiring each substitution algorithm to reimplement the conversion from a byte array to an object graph and back.

5.6.2 Algorithm implementations

As per the design (section 4.6), three algorithms are implemented: the triplet-based substitution algorithm, the tree-based substitution algorithm, and finally the shortest-path algorithm. The class diagram from Figure 5.2 already reveals a specialization of the shortest-path algorithm: the equivalence algorithm. The implementation of each of these algorithms is discussed below. The triplet-based substitution algorithm and the tree-based substitution algorithm are both greedy algorithms (they walk the instructions once and make every substitution that is found), but differ in their data structures: the triplet-based algorithm uses a table of triplets, while the tree-based algorithm uses a tree data structure. These are implemented in basically the same way, but with different data structures for finding new superinstructions. We implemented these data structures in their own classes (not shown in Figure 5.2). The shortest-path algorithm adds logic for when to substitute a superinstruction, refraining from placing superinstructions that make better superinstructions unreachable. Since it imposes no new requirements on the way superinstructions are stored, our implementation simply reuses the tree data structure from the tree-based substitution algorithm. Finally there is the equivalence-based variant of the shortest-path substitution algorithm. This algorithm is a variant of the shortest-path algorithm, but with a new data structure that enables it to find equivalent instructions. Since this algorithm is still a shortest-path algorithm, we implemented it as a specialization (class extension) of the shortest-path algorithm, replacing the use of the tree data structure with a hash table.

Triplet-based substitution algorithm

The triplet-based algorithm has the most straightforward implementation. As discussed in section 4.5.2, this algorithm is a reimplementation of the algorithm used by Ertl et al. [4], and works by repeatedly looking up two consecutive instructions in a table. If an entry is found, the first instruction is substituted with the superinstruction associated with that table entry and the search is repeated. The table can only express superinstructions with two opcodes; longer superinstructions are present in the table by referring to shorter superinstructions. The intricacies of this algorithm were described in section 4.5.2. We use the ASM object graph to do substitutions on, and superinstructions are implemented with special superinstruction nodes as discussed in section 5.3.5. A superinstruction node does not replace all instructions that make up the superinstruction; only the first instruction is replaced. This adds a bit more complexity to the algorithm, as it isn't just replacing two consecutive instructions. At the core of the triplet-based substitution algorithm are two nested substitution loops. The outer loop marks the beginning of the superinstruction, and the inner loop marks the end, each iterating over the instruction nodes. The inner loop always starts at the first instruction following the instruction the outer loop points to. For example, in the program "a b c" the outer loop would start at "a", while the inner loop would start at "b" – the first instruction after "a". What the outer and inner loop point to is used to construct the lookup for a superinstruction. If a superinstruction is found, it is substituted in at the location of the outer loop, and the inner

loop is moved forward by one instruction to start the search for a superinstruction based on the original one. If no superinstruction is found, the outer loop moves to the next instruction and the inner loop is restarted. The outer loop continues until all instructions in the program have been considered as a start instruction of a superinstruction. The lookup is handled by a separate class which implements the triplet mechanism: the TripletSuperinstructionTable. There are two aspects to this class: it has to construct the table from a list of superinstructions when it starts, and it has to perform lookups. Internally, the class does not use a flat table of triplets like the implementation by Ertl et al. [4]. Instead, the class uses a hash map, where the keys are tuples mapping to a third value. The tuples contain the start and end instruction opcodes of the superinstruction, while the value the tuple maps to is the opcode of the superinstruction. Lookup is implemented as a simple hash map search for a matching tuple, returning the superinstruction opcode, if any. Table construction is done by grouping all superinstructions into categories based on their length. First, all superinstructions with a length of two are added to the hash table (the shortest length). Then, all superinstructions with a length of three are added, looking up for each superinstruction the prefix superinstruction in the hash table, which must exist for this algorithm to work. This process repeats until all categories of superinstructions have been inserted. As we will see, this makes the triplet-based substitution algorithm the simplest to implement. Its data structures are straightforward and substitution is not greatly complicated by the "peephole optimization" aspect, where it substitutes the same instruction node multiple times.
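The table construction and lookup could be sketched along the following lines. The class and record names mirror the description above but the code itself is a simplified illustration; it assumes, as the algorithm requires, that every longer superinstruction has its prefix available.

import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TripletSuperinstructionTable {
    record Superinstruction(int opcode, List<Integer> parts) {}
    record OpcodePair(int first, int second) {}

    // (first opcode, second opcode) -> opcode of the resulting superinstruction.
    private final Map<OpcodePair, Integer> table = new HashMap<>();

    TripletSuperinstructionTable(List<Superinstruction> superinstructions) {
        // Insert the shortest superinstructions first so that the prefix of
        // every longer superinstruction is already present when it is added.
        superinstructions.stream()
                .sorted(Comparator.comparingInt(s -> s.parts().size()))
                .forEach(this::add);
    }

    private void add(Superinstruction s) {
        List<Integer> parts = s.parts();
        if (parts.size() == 2) {
            table.put(new OpcodePair(parts.get(0), parts.get(1)), s.opcode());
        } else {
            // The prefix (all parts except the last) must itself be a known
            // superinstruction; the new entry extends it by one instruction.
            int prefix = lookupSequence(parts.subList(0, parts.size() - 1));
            table.put(new OpcodePair(prefix, parts.get(parts.size() - 1)), s.opcode());
        }
    }

    private int lookupSequence(List<Integer> parts) {
        int opcode = parts.get(0);
        for (int i = 1; i < parts.size(); i++) {
            opcode = table.get(new OpcodePair(opcode, parts.get(i)));
        }
        return opcode;
    }

    // The lookup used during substitution; returns null when no entry exists.
    Integer lookup(int firstOpcode, int secondOpcode) {
        return table.get(new OpcodePair(firstOpcode, secondOpcode));
    }
}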

Tree-based substitution algorithm

Next up is the implementation of the tree-based substitution algorithm. This algorithm alleviates the main pain point of the triplet-based substitution algorithm by not using intermediate superinstructions to construct longer superinstructions. Instead, this algorithm navigates down a tree in parallel to reading the instruction nodes, using the tree to save the progress "out of band". The substitution algorithm itself is just as straightforward as the triplet-based algorithm, except this time an additional bit of data is kept: a cursor. This cursor indicates the current position in the tree, and can tell if a superinstruction is found for the current cursor or if a longer one is available. The cursor can point to intermediate nodes within the tree, which need not have a superinstruction associated with them. Just like the triplet-based algorithm, the tree-based substitution algorithm uses two loops in the same way. The cursor is reset for each iteration of the outer loop – this is when a new instruction is picked to start finding superinstructions from. The inner loop is incremented not only when a possible superinstruction is found, but also when the cursor indicates that it has not yet reached a leaf in the tree. Besides the cursor, the best (longest) superinstruction found so far is also kept, and it is substituted in when the cursor indicates it no longer points to a node. The tree data structure is also implemented in its own class: SuperinstructionTree (likewise not shown in Figure 5.2). The tree construction is straightforward: instructions are added to the tree with an "add" operation. This operation adds a superinstruction by inserting all intermediate nodes that are missing (those not carrying a superinstruction), or by converting an intermediate node into a node that carries a superinstruction opcode. No preprocessing or sorting of the superinstructions is required. The SuperinstructionTree has just one public method called lookup(), which takes no arguments and returns a new cursor object. The cursor can be moved by providing it an instruction opcode, and it starts out referencing the root of the superinstruction tree. Furthermore, the cursor has methods to test if it still points to a valid node in the tree and to read the superinstruction opcode of the current node, if available. Cursor objects themselves are not thread-safe and as such are never shared across threads, while the SuperinstructionTree itself is thread-safe after construction.
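The tree and its cursor can be sketched as follows. Method names follow the description above where the text gives them (lookup()); the rest of the shape is a simplified assumption, e.g. -1 standing in for "no superinstruction at this node".

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SuperinstructionTree {
    private static final class TreeNode {
        final Map<Integer, TreeNode> children = new HashMap<>();
        int superOpcode = -1; // -1: intermediate node without a superinstruction
    }
    private final TreeNode root = new TreeNode();

    // Add a superinstruction given the opcodes of its parts, creating any
    // missing intermediate nodes along the way.
    void add(List<Integer> parts, int superOpcode) {
        TreeNode node = root;
        for (int opcode : parts) {
            node = node.children.computeIfAbsent(opcode, k -> new TreeNode());
        }
        node.superOpcode = superOpcode;
    }

    Cursor lookup() { return new Cursor(root); }

    // The cursor walks the tree in lockstep with the instruction stream.
    static final class Cursor {
        private TreeNode node;
        private Cursor(TreeNode root) { this.node = root; }

        // Advance by one instruction; returns false once we fall off the tree.
        boolean advance(int opcode) {
            node = (node == null) ? null : node.children.get(opcode);
            return node != null;
        }
        boolean pointsToNode()        { return node != null; }
        boolean hasSuperinstruction() { return node != null && node.superOpcode != -1; }
        int superinstructionOpcode()  { return node.superOpcode; }
    }
}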

[Figure content: a bytecode sequence of iload, imul, iadd, iconst_2 and irem instructions drawn as a graph from a start node to an end node, with the instructions on the edges and with superinstruction edges super1 through super4 acting as shortcuts over multiple instruction edges.]


Figure 5.3: Copy of Figure 4.4 with the direction of the edges reversed, showing a bytecode program as a graph with instructions on the edges

Shortest path substitution algorithm

As mentioned in the introduction, the shortest-path substitution algorithm differs in when superinstructions are substituted. It simply reuses the SuperinstructionTree class to find which superinstructions are available for a given location. The design discussed how the algorithm should not just place all superinstructions on the shortest path from the start to the end of the program, it should also place all superinstructions starting from each jump target, as jump targets are also valid places to enter the program and should receive optimal substitutions as well. This means determining the shortest path multiple times, with one thing in common: the end point. As such, it is more efficient to do the shortest-path search in reverse, starting at the end point and finding the quickest way to get from the end to each instruction in the program. The superinstruction edges used to get to each instruction the quickest are the ones that need to be substituted in. This substitution algorithm is not a greedy algorithm that makes substitutions as it finds superinstructions; instead it first organizes all possible substitutions into a graph. To do this, it deploys what is essentially the same algorithm as the tree-based substitution algorithm, but instead of actually making the substitution, the found superinstructions are inserted into this graph. The structure of this graph matches that seen earlier in the design: a graph with the instructions on the edges. The nodes in this graph represent instruction dispatches, and superinstructions are special edges that skip multiple instruction dispatches. However, this graph is made in reverse, with all the instruction edges and superinstruction edges pointing back. Figure 5.3 shows an example of such a graph. The graph is constructed in the code as a hash table, where each key is a node in the graph and maps to all the outgoing edges of that node. The instructions themselves are used as the keys in the hash table. For example, the key of the start node in Figure 5.3 is the iload 10 instruction. The next node has iload 11 as a key, and so on. This is just a straightforward mechanism for identifying the instruction dispatch nodes, as the instructions themselves are on the edges and not the nodes. The end node is identified with "null" as key. Populating the hash table is done by walking all instructions and reusing the tree algorithm. When a superinstruction is found, let X be the instruction that would be executed immediately after the superinstruction in the program, but that is not part of the superinstruction. The dispatch node identified by X is the dispatch node that the superinstruction jumps to when finished. For example, the instruction sequence "a b c" with superinstruction "a-b" would have X = c. The superinstruction "a-b" is inserted into the hash table at dispatch node "c", as an edge from dispatch node "c" to dispatch node "a". The table is further populated by also including regular instructions "in reverse", giving them the same treatment as superinstructions to construct the full graph. To continue the example, in the instructions "a b c", instruction "b" would also be registered at dispatch node "c", this time as a way to get to dispatch node "b". Since this X instruction may not exist, "null" is used as key for instructions that have no next instruction.
To continue with the example once more, the superinstruction “a-b-c” has no next instruction in the program “a b c”, so it is stored at dispatch node “null” (end) as a way to get to “a”. With the backedges table fully constructed, it can be used to find a path from the end of

the program (indicated by the "null" magic value) to the beginning of the program (the first instruction in the program). We use Dijkstra's algorithm for this, using the Java PriorityQueue implementation. However, since all the edge weights are the same (each instruction dispatch saved is equally valuable), a simple breadth-first search could also have been used. Both shortest-path implementations are very similar in implementation difficulty and algorithmic complexity, and more complex optimizations that build on the shortest-path algorithm may require different edge weights. As such, we opted to implement Dijkstra's algorithm, even though no such optimizations were implemented for this thesis. With the shortest path from the end node to each dispatch node in the program determined, all selected superinstruction edges are placed. This does not take into account that some superinstructions may not be reached; however, there is no harm in substituting in a superinstruction that is never executed. Such a superinstruction does not have any negative (or positive) performance impact. Moreover, static evaluation was implemented in a way that can deal with this by tracing all paths through the code, and unreachable superinstructions are simply not on any path.
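The reverse search over the backedge table can be sketched as follows. Because every edge saves exactly one dispatch, the sketch uses breadth-first search, which yields the same result as Dijkstra's algorithm with uniform weights; the Edge record, the END sentinel (standing in for the "null" key) and the class name are hypothetical.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ReverseShortestPath {
    record Edge(Object toNode, Object instruction, boolean isSuperinstruction) {}

    static final Object END = new Object(); // stands in for the "null" end-of-program key

    // backedges maps each dispatch node to all backward edges leaving it.
    static Map<Object, Edge> chooseEdges(Map<Object, List<Edge>> backedges) {
        Map<Object, Edge> chosen = new HashMap<>();   // best incoming edge per dispatch node
        Set<Object> visited = new HashSet<>();
        Deque<Object> queue = new ArrayDeque<>();
        queue.add(END);
        visited.add(END);
        // Every edge saves exactly one dispatch, so plain breadth-first search
        // finds the shortest way from the end to every dispatch node.
        while (!queue.isEmpty()) {
            Object node = queue.poll();
            for (Edge edge : backedges.getOrDefault(node, List.of())) {
                if (visited.add(edge.toNode())) {
                    chosen.put(edge.toNode(), edge);
                    queue.add(edge.toNode());
                }
            }
        }
        // The entries in 'chosen' that are superinstruction edges are the ones
        // that get substituted into the method.
        return chosen;
    }
}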

Equivalence algorithm

The equivalence algorithm is a subclass of the shortest-path substitution algorithm. It overrides the lookup mechanism, and uses its own data structure: the EquivalenceSuperinstructionMap (not shown in Figure 5.2). In this data structure, each superinstruction is placed in a hash map identified by a lookup key to help speed up searching. Multiple superinstructions may share the same lookup key, and as such the algorithm uses graph coloring to narrow the search down and find the final match. To construct this superinstruction map data structure, each superinstruction is converted to a superinstruction graph. This is the graph consisting of atomic blocks as shown in the design in section 4.6.3. To test equivalence, two sequences of code need to be transformed to a graph and equality is tested using graph coloring. The construction of each superinstruction graph can happen at VM startup time, which is what our implementation does. While not exactly tuned for performance, each superinstruction is stored under a lookup key to help performance a little. This lookup key is simply an unordered set of the contents of each atomic block. Our equivalence algorithm is not capable of determining equivalence when the same instructions are not present in the two code fragments that are being compared. In fact, our algorithm requires that exactly the same atomic blocks are present in both code fragments, and only allows reordering of whole blocks. As such, this lookup key provides an effective tool to speed up a lookup. When the EquivalenceSuperinstructionMap finds a match, it provides not just the found superinstruction definition, but also the output of the graph coloring step. The output of this step gives a bijection, mapping the instructions as they are in the input program to where they are in the superinstruction. The EquivalenceAlgorithm uses this information to reorder the instructions in the input program. The equivalence algorithm needs to be able to perform not just equivalence-based lookups, but also identity-based lookups. This was discussed in the design to deal with jump targets: finding an equivalent superinstruction may require reordering instructions, and if a jump instruction jumps to somewhere within the reordered instructions, it would no longer execute the same code. As such, when trying to find a superinstruction for a sequence of code that contains a jump target, the substitution algorithm must perform a "regular" lookup based on identity and not on equivalence. To support this, two data structures are kept: besides the superinstruction map, an instance of the superinstruction tree as used by the shortest-path algorithm is also constructed. Since the superinstruction tree performs much better, it is always used as a sort of first line of defense. Only when a superinstruction is not found in the superinstruction tree does the algorithm consult the equivalence-aware superinstruction map. Figure 5.4 shows a DOT graph that was generated by our implementation. The two shown superinstruction graphs are equivalent, as is shown by the matching colors assigned to each block. "Color" is represented as a number on each node, and not actually rendered as a color. Note that some blocks are empty – as the code these graphs are derived from leaves the stack balanced, the first and last block are empty. Since the (empty) last block is a FULL_BARRIER block, it links to

[Figure content: two superinstruction graphs of five blocks each (b0–b4 and b5–b9), where every block lists its attributes (FULL_BARRIER, STORE_BARRIER and LOAD_BARRIER variants), the instructions it contains (iinc, iconst_0, iload_1, iload_2, iload_3, dup, imul, iadd, istore_2, istore_3) and the color assigned by the graph coloring step; matching blocks in the two graphs carry the same color.]


Figure 5.4: A DOT graph generated by our equivalence-aware runtime substitution algorithm of two equivalent sequences of code

all preceding blocks. As such, testing whether these two blocks have the same color is enough to test whether the whole graphs are equivalent, which they are (both have "color: 4").
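As a small illustration of the lookup-key idea, the key could be computed as an unordered multiset of the atomic blocks' contents, as sketched below. The representation (a block reduced to its instruction mnemonics) and the class name are assumptions for illustration; the real EquivalenceSuperinstructionMap carries more structure.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class EquivalenceLookupKey {
    // The key is a multiset: which atomic blocks occur, and how often, in any order.
    // Two fragments can only be equivalent under our definition if their keys match,
    // so the key prunes candidates before the expensive graph coloring test runs.
    static Map<List<String>, Integer> keyOf(List<List<String>> atomicBlocks) {
        Map<List<String>, Integer> multiset = new HashMap<>();
        for (List<String> block : atomicBlocks) {
            multiset.merge(block, 1, Integer::sum);
        }
        return multiset;
    }
}

Superinstructions are then bucketed by this key; a lookup computes the same key for the program fragment and only the superinstructions in the matching bucket go through the full coloring check.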

5.6.3 Conclusion

In this section we have seen how superinstruction placement is done in Java. The software architecture was discussed in section 5.6.1, showing how the Java substitution component is replaceable as a whole via the service API. We discussed how each substitution algorithm uses ASM, and how there is a common entry point in the jdk.internal.vm.si.runtime module which converts each method to an ASM object graph before delegating to the active runtime substitution algorithm. Section 5.6.2 went over the implementation of each algorithm, discussing how there are really two parts to each substitution algorithm: the way they store and look up superinstructions, and the way they place superinstructions. Effectively, four implementations were discussed to cover various possibilities between three ways to find superinstructions (triplet-based, tree-based and equivalence-based) and two ways to place superinstructions (greedy and shortest-path). It is up to chapter 6 to test which one performs best, but before moving on to the benchmarks, let us first consider how the superinstruction-capable interpreter is actually generated in section 5.7.

5.7 Generating the QuickInterp interpreter

With the instruction set defined and the runtime substitution algorithms in place, all that remains is generating an interpreter with the right superinstruction handlers. Section 5.3.3 already started

the discussion on how the OpenJDK interpreter works and how it can be modified with additional handlers. It was discussed that instruction handlers within the interpreter loop sometimes reuse code. Some instructions use the fall-through nature of the switch-statement to reuse instruction handlers, while others are defined by macro expansions. In this section, we will discuss how each handler is made available as its own snippet of C++ code that can be concatenated by the interpreter generator tool from section 5.5. Furthermore, we will also discuss how the interpreter generator output supports write coalescing of top-of-stack and program counter writes, as discussed in section 5.3.3. To implement write coalescing, each instruction handler within a superinstruction needs to be able to work with an offset relative to the start of the superinstruction. The way these offsets are computed is discussed in section 5.7.1, with section 5.7.2 following up on how the code generator plugs these values into generated C++ code fragments. The interpreter loop itself is not the only thing that needs to be modified, as the VM also requires information about all of its instructions. It needs the length and opcode to be able to iterate over them, and this metadata needs to be generated for each superinstruction, which is discussed in section 5.7.3. Next, we discuss the superinstruction class cache in section 5.7.4 – a component already seen back in section 5.5.1 that can be used to create a cache of classes with superinstructions to save the runtime from having to do this work. The superinstruction class cache is also the only way for the VM core classes to receive superinstructions, as certain classes (like java.lang.Object) need to be loaded before any runtime substitution algorithm can become active. Finally, this section – section 5.7 – is where we break compatibility with the garbage collector and class verifier, and section 5.7.5 discusses the implications of losing these flagship features.

5.7.1 Instruction primitives

To support generating superinstructions we introduce a new term: the instruction primitive. An instruction primitive is a sequence of C++ code that forms the part of an instruction handler that executes that instruction without writing to the top-of-stack or program counter variables, or jumping towards the next instruction (jump instructions themselves are still allowed to jump). Instruction primitives are effectively the "essence" of an instruction handler, and isolating them is required to create superinstructions. The superinstructions are simply sequences of these instruction primitives, and we also reimplement the regular VM instructions with instruction primitives. Recreating a regular VM instruction is as simple as combining one instruction primitive with a bit of generated boilerplate code. This boilerplate code completes the instruction handler, updating the top-of-stack and program counter, and jumping to the next instruction handler.

A macro for each instruction

While the next section will discuss in detail how the existing instruction handlers are converted to instruction primitives, we will first look at how to use them from a code generation point of view. Each instruction primitive is defined in its own C++ preprocessor macro: INSTR_<name>(pc, offset). The "<name>" part is replaced with the name of the instruction, e.g. INSTR_iload(pc, offset) will expand to the instruction primitive for the iload instruction. We will discuss the "pc" and "offset" parameters in a moment, but let us first consider how the instruction primitives are used to form an instruction handler. As discussed just now, it isn't just the superinstructions that make use of instruction primitives – with all the interpreter code restructured into instruction handlers, the regular VM instructions are also generated by the generator and as such also use the instruction primitives. Consider the regular pop instruction handler in Listing 5.14. This instruction handler is generated by the interpreter generator using the INSTR_pop instruction primitive.

1 CASE(_pop): {
2   INSTR_pop(pc+0,0);
3   INSTR_end_sequence(pc+2,-1);
4   puts("Overrun of pop handler");
5 }

Listing 5.14: Generated output for the regular pop instruction

What can be seen in this listing is that the generated instruction handler consists of three parts:

1. The INSTR_pop(pc+0,0) instruction primitive macro
2. An INSTR_end_sequence(pc+2,-1) macro
3. An overrun message print statement

Since the instruction primitives themselves do not jump to the next instruction (this was the whole point, and is necessary to create superinstructions), the instruction handler needs to be ended somehow and jump to the next instruction. This is done with the INSTR_end_sequence pseudo-instruction, which updates the actual values of the program counter and the top-of-stack variables, and then jumps to the next instruction. And the print statement is just there for debugging – if a handler manages to somehow not jump out and instead falls through to the next instruction, it will print “Overrun of pop handler”, which should never happen.

Creating superinstructions

Going from a regular instruction handler constructed with these primitives to a superinstruction handler with multiple instruction primitives is easy: simply add more instruction primitives. This is also where the use of "pc" and "offset" starts to come into play. The "pc" and "offset" macro parameters are needed to support write coalescing. Within the instruction primitive, "pc" is an expression that evaluates to a pointer which is taken as the program counter within that instruction primitive. The first instruction primitive always gets "pc+0", as we saw in Listing 5.14.

1 CASE(_super_aload_0_iload): {
2   INSTR_aload_0(pc+0,0);
3   INSTR_iload(pc+2,1);
4   INSTR_end_sequence(pc+5,2);
5   puts("Overrun of super_aload_0_iload handler");
6 }

Listing 5.15: Generated output for the superinstruction aload_0-iload

Listing 5.15 shows an example of the code generated for the superinstruction aload_0-iload, which can help illustrate why the macro arguments are needed. The iload instruction reads a one-byte value from the instruction stream, making its signature within the bytecode always <2-byte opcode><1-byte value>. Within a superinstruction, it cannot read this one-byte value from its standard location, which is two bytes after the program counter pointer (pc[2]). Instead, within this superinstruction it needs to read it four bytes after the program counter (pc[4]). The whole instruction primitive must be "shifted" as it were by two bytes, and this is done by providing it with a new definition of "pc": namely "pc+2". Now, it will read (pc+2)[2], which is the same as pc[4], as the data type of pc is a byte pointer. The second parameter – the "offset" – is used similarly, but now for the operand stack. The INSTR_aload_0 primitive pushes one value onto the operand stack (it writes to topOfStack[0]) but without updating the top-of-stack variable. As such, the INSTR_iload instruction – which also pushes a value onto the operand stack – must now write to topOfStack[1], as topOfStack was not updated between the primitives. This offset of "1" is communicated via the "offset" macro parameter: the instruction primitive will write to topOfStack[0 + offset].

This also explains the values provided to INSTR_end_sequence – this pseudo-instruction behaves exactly like a real instruction primitive that is an unconditional jump. It sets the "pc" value to "pc+5", and adds 2 to the "topOfStack" variable. The pop instruction handler from Listing 5.14 shows how the topOfStack offset can also be negative: the pop instruction handler removes one element from the top of the stack, so its INSTR_end_sequence operates at an offset of "-1". In section 5.7.2 we will look at the implementation of each of these macros.

Computing offsets

It is the code generator that has to compute the "pc" and "offset" arguments, as these depend on where the instruction primitive sits within the superinstruction. These values can only be computed for instruction handlers with a fixed impact on the program counter and top-of-stack variables. For each instruction primitive that can be part of a superinstruction, it is known what impact it has on the operand stack and on the program counter. These values are available in Appendix A.3. A very basic abstract interpreter constructs the superinstruction from a list of instruction primitives, keeping track of the relative program counter (starting at 0) and the relative stack pointer (also starting at 0). Each added instruction primitive modifies the relative program counter and relative stack pointer. Note that not all instructions have a fixed impact on the program counter or on the top-of-stack pointer – for example, the tableswitch does not have these values in Appendix A.3. And, looking at Appendix A.2, which is used to determine which instruction primitives to concatenate, we can also see that tableswitch cannot be part of a superinstruction anyway. But some of the instructions that have an unknown offset can be part of a superinstruction. For example, the ireturn instruction is marked as entirely unknown (unknown PC offset and unknown stack offset). This is because it is not relevant to track these sizes for terminal instructions – the ireturn will always leave the method, and as such there is no reason to track these properties for this instruction. If a superinstruction ends with ireturn, the generator omits the INSTR_end_sequence call, as the ireturn is already guaranteed to end the superinstruction, indicated by the TERMINAL flag shown in Appendix A.2. It could not even place INSTR_end_sequence if it wanted to, because it does not know the offsets after the ireturn needed to emit this statement. This is also where the overrun message statement as seen on line 5 in Listing 5.15 comes in – while testing whether INSTR_end_sequence really jumps to the next instruction is fairly trivial, not all superinstructions make use of it. Superinstructions which naturally jump out, like those ending in an ireturn or a goto, do not, and in those cases this statement is useful for asserting that the superinstruction handler does not fall through to the next handler, which would create a hard-to-debug bug.
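The generator-side bookkeeping could be sketched as follows (in Java, as the generator itself is a Java tool). The Primitive record, with its per-instruction byte length and stack delta taken from Appendix A.3, and the emitter class are hypothetical; the output format mirrors Listings 5.14 and 5.15.

import java.util.List;
import java.util.StringJoiner;

class SuperinstructionEmitter {
    record Primitive(String name, int byteLength, int stackDelta, boolean terminal) {}

    static String emit(String superName, List<Primitive> parts) {
        StringJoiner body = new StringJoiner("\n");
        int relativePc = 0;   // bytes consumed so far within the superinstruction
        int relativeTos = 0;  // net operand-stack effect so far
        boolean endsWithTerminal = false;
        for (Primitive p : parts) {
            body.add("  INSTR_" + p.name() + "(pc+" + relativePc + "," + relativeTos + ");");
            relativePc += p.byteLength();
            relativeTos += p.stackDelta();
            endsWithTerminal = p.terminal();
        }
        // A terminal last primitive (e.g. ireturn, goto) already leaves the
        // handler, so INSTR_end_sequence is omitted in that case.
        if (!endsWithTerminal) {
            body.add("  INSTR_end_sequence(pc+" + relativePc + "," + relativeTos + ");");
        }
        body.add("  puts(\"Overrun of " + superName + " handler\");");
        return "CASE(_" + superName + "): {\n" + body + "\n}";
    }
}

For the aload_0 (2 bytes, stack +1) followed by iload (3 bytes, stack +1) example, this produces exactly the "pc+0,0", "pc+2,1" and "pc+5,2" arguments seen in Listing 5.15.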

5.7.2 Instruction code as macros

Section 5.3.3 already mentioned the changes that are necessary to the instruction handlers, and we saw in the previous section how each instruction primitive gets its own macro. Implementing these macros is fairly straightforward for most instructions. The OpenJDK Zero interpreter structure already coalesces most writes within the instruction handlers themselves, and as such all that remains is removing the updates to the program counter and top of stack, and removing the jump to the next instruction.

1 CASE(_iload):   // iload and fload reuse the same handler
2 CASE(_fload):
3   SET_STACK_SLOT(LOCALS_SLOT(pc[2]), 0);
4   UPDATE_PC_AND_TOS_AND_CONTINUE(3, 1);

Listing 5.16: Original iload instruction handler

1 #define INSTR_iload(pc,offset) \
2   SET_STACK_SLOT(LOCALS_SLOT((pc)[2]), offset);

Listing 5.17: iload instruction primitive macro definition

Listing 5.16 and Listing 5.17 show the original instruction handler and the instruction primitive respectively, both for the iload instruction. The original handler shown in Listing 5.16 has already been modified to support working with two-byte opcodes instead of one-byte opcodes, and as such the only transformation that needs to be done is formatting it as an instruction primitive. To do so, the UPDATE_PC_AND_TOS_AND_CONTINUE macro is removed, and the arguments to this call (the 3 and 1) are recorded in Appendix A.3. The "pc" macro argument, as we saw in the previous section, is now an expression. This has consequences – e.g. in the expression pc[2], if "pc" is substituted with "pc + 4" it will expand to "pc + 4[2]", which does not compile. As such, all occurrences of "pc" are wrapped in brackets: "(pc)". To continue the example, after substitution it would read "(pc + 4)[2]", which is valid C++. The "offset" macro argument is also substituted in, but this argument expands to a number and not an expression. In the previous section we discussed how this argument provides an offset relative to the "topOfStack" variable. Most instruction handlers in OpenJDK Zero, however, do not interact directly with this variable, but do so via other macros like the SET_STACK_SLOT macro seen in Listing 5.17. This macro still takes an offset as argument (the second argument), and while this was set to 0 in the original handler (line 3 in Listing 5.16), it is now set to offset (line 2 in Listing 5.17). To also implement the fload instruction primitive, we simply copied the code for the iload instruction primitive, meaning that no code is reused. This pattern was repeated to write all instruction primitives. There were some instruction handlers where the program counter or top-of-stack pointers were updated within the instruction handler, and these were first rewritten to work with offsets and to only update these values at the end of the handler. The code generator generates all instruction handlers from instruction primitives, also those that cannot be included in a superinstruction anyway (e.g. invokedynamic or monitorenter). The handlers for these instructions were also transformed into macros, but without taking proper care to use the offset parameter or to prevent intermediate program counter or top-of-stack updates, as these instructions are never part of a superinstruction anyway, and as such will always be the first (and only) primitive within their instruction handler.

5.7.3 Code generation

With the structure and the technique for generating the instruction handlers discussed, it is time to look at how the VM files are written to disk and how they are integrated into the existing VM source code. Besides the instruction handlers, there are various other things that need to be generated for each instruction. The following items are generated:

1. The instruction handlers (as discussed), which are written to bytecodeInterpreter.handlers.hpp
2. A dispatch table, where “goto dispatch_table[opcode]” jumps to the instruction handler of “opcode”, which is generated as bytecodeInterpreter.jumptable.hpp
3. A VM constant and opcode for each instruction, written to bytecodes.generated.hpp
4. VM definitions for each instruction, written to bytecodes.definitions.hpp
5. A list of all superinstructions and the superinstruction substitution algorithm for the runtime to use, which is generated as superinstructions.list

Each of these items is generated as its own file. All of them except item 5 (the superinstruction list) are C++ source code, and these are included at select places in the VM by #include statements. The generated C++ source code files are not standalone compilation units that can be compiled to their own object files; instead they require the #include statements to be part of other VM files that rely on what is generated (this is why they have “.hpp” file extensions instead of “.cpp” extensions).

Instruction handlers The first item, the instruction handlers, are written to bytecodeInterpreter.handlers.hpp. The generated code could already be seen in the previous section, and replaces the sequence of instruction handlers in the bytecodeInterpreter.cpp file. In that file, we replaced all handlers with one preprocessor statement: #include "generated/bytecodeInterpreter.handlers.hpp".

 1 ...
 2     opcode = READ_BYTECODE(pc);
 3     DISPATCH(opcode);
 4
 5 #include "generated/bytecodeInterpreter.handlers.hpp"
 6
 7 DEFAULT:
 8     fatal("Unimplemented opcode");
 9     goto finish;
10 }
11 ...

Listing 5.18: Interpreter loop with handlers read from an external file

Listing 5.18 shows a simplified version of what the interpreter loop looks like. The DISPATCH statement on line 3 jumps to the first instruction handler, and each instruction handler will (often with the use of the INSTR_end_sequence macro) jump to the next instruction. All instruction handlers are included from the generated file, with only the default “not-implemented” handler remaining.

Dispatch table The DISPATCH(opcode) macro, as seen in use in Listing 5.18 (on line 3), is implemented using the GCC labels-as-values extension in OpenJDK Zero, and we made no changes to this mechanism.

#define DISPATCH(opcode) goto *(void*)dispatch_table[opcode]

Listing 5.19: Definition of the dispatch macro using a dispatch table

const static void* const dispatch_table[1 << 16] = {
#include "generated/bytecodeInterpreter.jumptable.hpp"
};

Listing 5.20: Definition of the dispatch table using a generated file

Listing 5.19 shows the dispatch macro definition and Listing 5.20 shows how the dispatch table is initialized. Both of these occur within the interpreter function in bytecodeInterpreter.cpp. The size of the table is set to 1 << 16, which is equal to 2^16 = 65,536. This is the maximum number of handlers supported with 2-byte opcodes, as discussed in section 5.3.4. The original VM prior to our modifications had a dispatch table size of 256 (the maximum number of handlers supported with 1-byte opcodes).

&&opc_nop,              // regular instructions ...
&&opc_aconst_null,
&&opc_iconst_m1,
&&opc_iconst_0,
...
&&opc_super_iinc_goto,  // ... followed by superinstructions ...
&&opc_super_iload_1_iconst_1_ishl_istore_1_aload_0_iload_1,
&&opc_super_ishl_ior,
...
&&opc_default           // ... and all unused space is padded with opc_default

Listing 5.21: Content of the dispatch table in bytecodeInterpreter.jumptable.hpp

An example of the content of the bytecodeInterpreter.jumptable.hpp file is shown in Listing 5.21. This file always contains 2^16 entries (one for every index in the table), with Listing 5.21 only showing a short snippet of this file. Note the “&&” syntax: this syntax is part of the GCC labels-as-values extension, and allows taking the address of the code behind a label. Each instruction handler has its own label. The file always starts out with all the regular VM instructions, which are at their standard locations. For example, the opcode of nop is 0, so its handler is the first entry in the table, at dispatch_table[0]. The regular VM instructions are followed by the superinstructions, which are always named “opc_super_” followed by a concatenation of all the instruction names within the superinstruction. The rest of the table is padded to 2^16 entries with opc_default – this is the default handler we saw in Listing 5.18 on line 7.
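The padding scheme described above amounts to a simple loop in the generator. The sketch below is a hypothetical illustration (the class, method and parameter names are not taken from the QuickInterp sources): it emits the regular handlers, then the superinstruction handlers, and pads the remainder of the 2^16-entry table with the default handler.

// Hypothetical sketch of emitting bytecodeInterpreter.jumptable.hpp.
import java.io.PrintWriter;
import java.util.List;

class JumpTableWriter {
    static final int TABLE_SIZE = 1 << 16; // 2-byte opcodes

    void write(PrintWriter out, List<String> regularLabels, List<String> superLabels) {
        int emitted = 0;
        // Regular VM instructions first, at their standard opcode positions.
        for (String label : regularLabels) {
            out.println("&&opc_" + label + ",");
            emitted++;
        }
        // Superinstructions follow, named after their constituent instructions.
        for (String label : superLabels) {
            out.println("&&opc_super_" + label + ",");
            emitted++;
        }
        // Pad the remainder of the table with the default ("unimplemented") handler.
        for (; emitted < TABLE_SIZE; emitted++) {
            out.println("&&opc_default,");
        }
    }
}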

VM constants and definitions Outside of the interpreter, there is one place that needs information about the new instructions: the bytecodes.cpp and bytecodes.hpp files. These files contain an enum constant for each bytecode instruction, mapping its name to an opcode, and track a definition of each instruction. A definition includes the length of the instruction, the types of the instruction operands, a “wide” definition (if available) and a few extra attributes regarding the characteristics of the instruction. The definitions enable abstract interpreters like the garbage collector and the verifier to work on instructions without dealing with the characteristics of each instruction separately, instead using the definition of each instruction. While powerful, this mechanism is not expressive enough to fully capture what we wanted to allow with superinstructions: for example, it does not allow particular combinations of instruction operand types – combinations that do not occur in the regular VM bytecode instructions, but do occur once instructions are merged into superinstructions. As such, this is the point where we broke compatibility with the garbage collector and the class verifier. With these features unsupported, all that the definitions are used for is determining the length of each instruction, which is necessary to allow the VM to iterate over the instruction stream. The capability of iterating over each instruction is so basic to the VM that it is not something we can drop support for. OpenJDK uses it to inject special instructions into core classes like java.lang.Object, and it is used in rewriter.cpp (discussed in section 5.3.2). As such, the code generator needs to generate an enum constant for each superinstruction, and generate a definition that allows iterating over the bytecode stream.

_super_iinc_goto = 239, // 0xef
_super_iload_1_iconst_1_ishl_istore_1_aload_0_iload_1 = 240, // 0xf0
_super_ishl_ior = 241, // 0xf1
_super_aload_0_iload = 242, // 0xf2
...

Listing 5.22: Content of the generated instruction constants file bytecodes.generated.hpp

Listing 5.22 shows an example of the generated instruction constants. These are included with an #include "generated/bytecodes.generated.hpp" statement in the bytecodes.hpp file.

This file already contains similar statements for each of the regular VM instructions. Each of the generated constants receives a generated opcode, e.g. in Listing 5.22 the _super_iinc_goto instruction is assigned opcode 239. This opcode matches the location in the dispatch table of the instruction handler for this instruction – that is, dispatch_table[239] is the location of the instruction handler for the _super_iinc_goto instruction.

def(_super_iinc_goto    , "super_iinc_goto"    , "bicbboo", NULL, T_ILLEGAL, 0, true);
def(_super_ishl_ior     , "super_ishl_ior"     , "bbb"    , NULL, T_ILLEGAL, 0, true);
def(_super_aload_0_iload, "super_aload_0_iload", "bbbi"   , NULL, T_ILLEGAL, 0, true);
def(_super_sipush_iand  , "super_sipush_iand"  , "bccbb"  , NULL, T_ILLEGAL, 0, true);
...

Listing 5.23: Content of the generated definitions file bytecodes.definitions.hpp

Listing 5.23 shows an example of the generated definitions file. Note that some of the longer superinstructions seen earlier in Listing 5.22 have been omitted due to their length not fitting on the page. Each instruction definition requires seven arguments, but almost none of those are relevant anymore as the garbage collector and the class verifier have been put out of service. We will revisit the topic and consequences of missing JVM features in section 5.7.5. The important arguments are the instruction opcode (first argument) and the instruction signature string (third argument, e.g. "bicbboo"). The role of the instruction signature string is only to provide the length of the instruction – each letter in the instruction signature is one byte. An instruction signature of five characters would yield an instruction length of six bytes in the bytecode stream (one extra byte for the now 2-byte opcode, which is not included in the signature to stay compatible with original definitions). Each of the letters in the signature indicates what kind of operand it is, but these are not actually used, and as such we will not be discussing the content of each signature string.
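Because only the length matters, the information extracted from a definition boils down to a one-line computation. The helper below is a hypothetical illustration of that rule, not code taken from the generator.

class InstructionLengths {
    // Each signature character is one byte; one extra byte accounts for the
    // second opcode byte, which is not part of the signature.
    static int instructionLength(String signature) {
        return signature.length() + 1;
    }
}

For example, the signature "bicbboo" (seven characters) gives an instruction length of eight bytes.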

Superinstructions and runtime substitution algorithm information With the generated C++ code discussed, there is one last item on the list: information for the runtime to use. The interpreter generator tool has information that the runtime needs: namely the selected runtime substitution algorithm (chosen with the --si command line argument), and the list of superinstructions that are compiled into the VM. This information is communicated by means of a serialized Java object written to a file. The class involved was already discussed back in section 5.6.1: the class diagram there shows an InstructionSetConfiguration component, which is a data class storing the list of superinstructions and the selected runtime substitution algorithm. At runtime, when the ClassBytecodeFormatConverterImpl is first started, it will deserialize the file, create an instance of the selected runtime substitution algorithm and provide this instance with the superinstruction definitions.
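In outline, this hand-off is plain Java serialization. The sketch below is hypothetical: only the class name InstructionSetConfiguration and the superinstructions.list file come from the design; the field names and the ConfigurationIo helper are assumptions.

import java.io.*;
import java.util.List;

// Data class serialized by the interpreter generator tool (field names assumed).
class InstructionSetConfiguration implements Serializable {
    String substitutionAlgorithmClass; // selected via the --si argument
    List<String> superinstructions;    // e.g. "super_iinc_goto"
}

// Assumed helper for writing and reading superinstructions.list.
class ConfigurationIo {
    static void save(InstructionSetConfiguration config, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(config);
        }
    }

    static InstructionSetConfiguration load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (InstructionSetConfiguration) in.readObject();
        }
    }
}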

5.7.4 Superinstruction class cache

The class cache was introduced in section 5.3.5 and mentioned in section 5.5.2. In the class diagram from Figure 5.1, there is an additional class with a main method: the CacheBuilder class. The class cache is a cache of classes that already contain superinstructions, and need no runtime superinstruction substitution. To reiterate why we need such a feature: implementing the substitution algorithms in Java has a consequence, as not all classes can be substituted at runtime. Some classes need to be loaded in order to do substitution, and while many of these classes are inconsequential, some of the required classes are important core VM classes like java.lang.String. Having a class cache with pre-substituted classes provides a workaround, and allows classes like java.lang.Object to be substituted. The usage of the CacheBuilder is simple: it requires one argument, which is a path to the folder where the profile of an application is stored. It will read the superinstructions.list configuration file written by the interpreter generator tool to figure out which runtime substitution

algorithm was used, and which superinstructions are available. It will then loop over all classes in the profile and convert them one by one to a version with superinstructions, and then write those classes to disk in a folder called “class-cache”. The JVMTI agent will look for any class that it is loading in the class cache, and if present, use the version from the class cache instead of sending the class through runtime substitution. If no class exists in the class cache, but the class loader of the class is the system class loader, the class will not be substituted. This is a simple mechanism to prevent circular loading errors, but it puts additional pressure on the quality of the profile. For example, if an important data structure that is part of the JVM standard library (e.g. LinkedList) is not used according to the profile, it will not be able to receive superinstructions: it is a core VM class (its class loader is the system class loader) and as such the JVMTI agent will not substitute it, and since it is not in the profile, it is not in the class cache either, so there is no version with superinstructions available. If this class does turn out to be needed by the application, the application cannot benefit from the performance improvements offered by superinstructions. However, in such a scenario, the quality of any superinstruction substitution within the class is debatable. Since it is not in the profile, the superinstruction set construction algorithm did not take it into consideration, and as such the VM may not be equipped with effective superinstructions for the class anyway. Considering we control the quality of the profiling for our own benchmarking results, we considered this a non-issue. However, it should be noted that the way we use the class cache for all system classes is a shortcut taken to simplify the implementation rather than a far-reaching consequence of the QuickInterp architecture. To create a more user-friendly implementation of QuickInterp, only the classes actually needed by the runtime substitution algorithm to work should be required in the class cache.
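In outline, the CacheBuilder performs a straightforward ahead-of-time loop. The sketch below is hypothetical: only the class name, the superinstructions.list file and the “class-cache” folder are taken from the text; the SubstitutionAlgorithm interface, the reflective instantiation and the assumption of a flat dump of .class files in the profile folder are illustrative (the sketch also reuses the InstructionSetConfiguration and ConfigurationIo classes sketched in the previous section).

// Hypothetical outline of the CacheBuilder tool.
import java.nio.file.*;

interface SubstitutionAlgorithm { // stand-in for the runtime substitution algorithms
    byte[] substitute(byte[] classBytes);
}

class CacheBuilder {
    public static void main(String[] args) throws Exception {
        Path profileDir = Paths.get(args[0]);               // folder holding the application profile
        Path cacheDir = profileDir.resolve("class-cache");  // output folder read by the JVMTI agent
        Files.createDirectories(cacheDir);

        // Read the configuration written by the interpreter generator tool and
        // instantiate the configured runtime substitution algorithm reflectively.
        InstructionSetConfiguration config =
                ConfigurationIo.load(profileDir.resolve("superinstructions.list").toFile());
        SubstitutionAlgorithm algorithm = (SubstitutionAlgorithm)
                Class.forName(config.substitutionAlgorithmClass).getDeclaredConstructor().newInstance();

        // Convert every profiled class ahead of time and write it to the cache
        // (assuming the profile folder contains a flat dump of .class files).
        try (DirectoryStream<Path> classes = Files.newDirectoryStream(profileDir, "*.class")) {
            for (Path classFile : classes) {
                byte[] original = Files.readAllBytes(classFile);
                byte[] withSuperinstructions = algorithm.substitute(original);
                Files.write(cacheDir.resolve(classFile.getFileName()), withSuperinstructions);
            }
        }
    }
}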

5.7.5 Loss of the garbage collector and class verifier

The VM definitions generated for each superinstruction as discussed in section 5.7.3 are not “good enough” to support garbage collection. The instruction definition format used internally in OpenJDK is not powerful enough to express all possible superinstructions, and as such the garbage collector is unable to function without expanding this format. The definitions could also be pointed to as the reason why the verifier no longer works, however this is a little more complex: we argue that the verifier should not attempt to verify code after superinstructions have been substituted in. Instead, it should be applied to the code before superinstructions. As such, it is the location of the JVMTI agent in the class loading pipeline that breaks the verifier, and not the inability to verify superinstructions. It is, however, entirely possible to verify bytecode with superinstructions, as each superinstruction is a direct replacement of a sequence of regular instructions, and making it possible to treat superinstructions in that way would restore both the verifier and the garbage collector. The topic of losing flagship JVM features was touched on in the introduction of this chapter (section 5.1): we consider it acceptable to take shortcuts when these shortcuts do not impact benchmarking results in a way that is not reflective of the design. This was discussed in section 5.2 in the discussion surrounding implementation goal IG6, which requires that the implementation is an accurate reflection of the design when it comes to testing. The garbage collector and class verifier are components separate from the performance of the interpreter: our superinstruction architecture does not affect the work a garbage collector has to do, and the class verifier is only executed at class load time. A garbage collector can however degrade runtime performance as the interpreter has to wait for the GC to finish, and since QuickInterp is not compatible with any garbage collector, any benchmark will run on it without garbage collection, sparing it from performance degradation due to GC overhead. This means that, in order to create an apples-to-apples comparison, any non-superinstruction VM that QuickInterp is compared against must also run benchmarks without a GC. As long as this is taken into consideration when designing and running benchmarks, we do not consider the loss of these JVM features a violation of implementation goal IG6.

5.7.6 Wrapping up

In this section, we have covered the actual generation of C++ code and the superinstructions.list file based on the output of the superinstruction set construction algorithm from section 5.5. We saw how instruction handlers are created from instruction primitives – the parts of instruction handlers that do not jump. These instruction primitives are concatenated to create larger instruction handlers. This is made possible by giving each instruction primitive its own macro definition, making concatenation simple. The instruction primitives support write coalescing, a feature discussed earlier in section 5.3.3, and here we discussed how write coalescing is implemented. Each instruction primitive within a superinstruction is compiled to read and write its data relative to the program counter and stack pointer at the start of the whole superinstruction. This is made possible by supplying each instruction primitive with offsets: the program counter is provided as an expression (e.g. pc + 5), and an offset is provided for accessing the top-of-stack. Besides generating the instruction handlers, the other files that the VM needs were also discussed: we generate constants for each new instruction, we generate a dispatch table to facilitate jumping to each instruction handler based on opcode, and we generate definitions for each instruction. The definitions are minimal however – they are only good enough to support basic iteration over code that includes superinstructions. The garbage collector and class verifier cannot make use of them to perform garbage collection or class verification, and this is the point where our code breaks compatibility with both of those features. The final file that is generated is not a source code file; instead it is the superinstructions.list file with the selected runtime substitution algorithm and the list of superinstructions. This file is not only used by the runtime, it is also used by the CacheBuilder tool. We discussed how the cache builder tool – which is part of the interpreter generator – is implemented. This tool implements the workaround for the circular class loading problem we discussed back in section 5.3.5 – it runs all profiled classes through the substitution algorithm and places them in a class-cache folder to be used by the runtime. The class cache provides the last piece of the puzzle: using the cache instead of substituting system classes at runtime enables system classes like java.lang.String to contain superinstructions, even when they need to be loaded before the runtime substitution algorithm can become active.

5.8 Conclusion

In this chapter we discussed the implementation of QuickInterp on top of OpenJDK Zero. We discussed the structure of OpenJDK Zero, and how we modified this VM to support superinstructions. Code stretching was discussed, which is the name of the feature that lets us use two-byte opcodes, raising the number of possible instructions to 2^16. Code stretching, together with other runtime class changes, is implemented in a JVMTI agent. We went over our implementation of the profiling system: a new “profile” instruction makes it possible to record the number of executions at select places in the bytecode, and together with profiling variants of each jump instruction we implemented the different counters as per the design from Chapter 4. A preprocessing step places these profiling instructions into the program if profiling is enabled. To construct the superinstruction set, an interpreter generator tool was created. This tool implements the iterative optimization algorithm from the design. Static evaluation is implemented by tracing paths, which is necessary to perform static evaluation on the ASM object graph after superinstruction substitution. We discussed how the runtime substitution algorithms are implemented by substituting superinstructions into an ASM object graph of the input program. We explained how the shortest-path algorithm can take jump targets in the program into consideration by performing the shortest-path lookup in reverse, and how we opted to use Dijkstra's algorithm with a priority queue to enable future experimentation with different edge weights, even though all edge weights in the superinstruction graph are equal in the implementation as presented in this thesis. This chapter also covered how the instruction handlers are actually generated using instruction primitives, which are sequences of C++ code that represent the essence of an instruction handler, omitting any boilerplate jump code to execute the next instruction.

By isolating these instruction primitives we can concatenate them to generate superinstructions. We explained how the existing handlers have been converted to instruction primitives, and how these instruction primitives support write coalescing by supplying them with two parameters: a relative program counter and an operand stack offset. Finally, the other generated C++ code was discussed, including VM metadata like the instruction definitions, as well as the list of superinstructions that makes the runtime aware of the available superinstructions.

5.8.1 Revisiting the implementation goals

At the start of this chapter in section 5.2, various goals were set for the implementation. These goals were mostly derived from the design goals from section 4.1.1, with implementation goal IG6 deriving from a research goal instead of a design goal. Let us revisit those goals as they were listed in section 5.2.

IG1 The VM runtime needs to be capable of gathering profiling information from the running application, and write this to a file for the superinstruction set construction algorithm (DG1).

IG2 Provide an implementation of the iterative superinstruction set construction algorithm discussed in section 4.4. It must produce a list of optimal superinstructions for a given profile, maximum instruction set size and a runtime substitution algorithm (DG3).

IG3 Provide an interpreter generator that, from the list of superinstructions, generates C++ source code and other metadata to implement all superinstructions in the VM. The VM needs to be modified to use this generated C++ code and be refactored in such a way that it is possible to concatenate instruction handlers to create superinstructions.

IG4 Implement the three runtime superinstruction placement algorithms (triplet-based, tree-based and shortest-path runtime substitution) (DG4).

IG5 Modify the OpenJDK Zero class loading pipeline to include the runtime substitution of each class by the active runtime substitution algorithm.

IG6 Produce a VM that provides an accurate, testable reflection of the design of each of the components as laid out in chapter 4 (G4 from section 1.5).

Our implementation meets each of these goals. Implementation goal IG1 has been answered in section 5.4 with the implementation of profiling. A profile file is generated (“app.profile”) and all classes are dumped to a folder when our modified VM is started with a particular command line argument, satisfying this goal. Implementation goal IG2 has been met with the implementation of the superinstruction set construction algorithm in the interpreter generator tool, as described in section 5.5. We discussed how the iterative superinstruction set construction algorithm works, together with our implementation of the static evaluation algorithm that traces paths through the code to work on the ASM object graph representation of bytecode. The profile, maximum instruction set size and the runtime substitution algorithm are provided to the interpreter generator tool as command line arguments, and the list of optimized superinstructions is made available to the code generator, which resides in the same tool. As such, all aspects of this goal are satisfied. The third implementation goal, IG3, has been satisfied by the code generator in the same interpreter generator tool. The interpreter generator tool, which was the answer to goal IG2 as it implements the superinstruction set optimization algorithm, has a code generator component that can generate the various C++ source code files as discussed in section 5.7. The instruction handlers are restructured into instruction primitives allowing the creation of superinstructions by the code generator, satisfying this goal. Continuing on, implementation goal IG4 has been satisfied as discussed in section 5.6, where all three runtime placement algorithms have been discussed including their implementation. To answer implementation goal IG5, a JVMTI agent has been made which can perform substitution of all classes – some at runtime and others ahead-of-time by

using a class cache as discussed in section 5.3.2 and section 5.6. Finally, implementation goal IG6 was reached. While goal IG6 is an overarching goal that affects the entire implementation, we believe that the implementation as presented in this chapter is a sufficiently faithful implementation of the design from chapter 4. The loss of flagship JVM features like the garbage collector and the class verifier is of little consequence to the testability of the design, as long as the lack of these features is taken into consideration when testing, as discussed in section 5.7.5 (e.g. running a garbage collector benchmark is not useful, but the superinstruction architecture is not expected to make any changes to GC performance). That concludes the implementation of the QuickInterp design on OpenJDK Zero, creating an implementation that allows for benchmarking the design, which is the topic of chapter 6.

Chapter 6

Benchmarks and results

6.1 Introduction

Finally it is time to evaluate the performance of QuickInterp by means of various benchmarks. To start off, the goals of benchmarking are discussed in section 6.2. Then, in section 6.3 two benchmarks are introduced. The first benchmark, the primes benchmark, represents an application with relatively little code where the bulk of the execution is limited to just a few methods. The second benchmark, the Spring benchmark, is a large benchmark with many libraries where a large number of methods are frequently executed, and we will motivate the need for two such different benchmarks based on the goals from section 6.2. Section 6.4 discusses the results of the primes benchmark, showing a best-case improvement of 45.6% over baseline. Section 6.5 discusses the Spring benchmark results, where a best-case improvement of 33.0% is achieved. Finally, these results are interpreted and put into context in section 6.6, wrapping up this chapter.

6.2 Benchmarking goals and non-goals

The overarching goal of benchmarking is not to test every single aspect of the QuickInterp architecture and its implementation. Certain features that one might expect of a production JVM (like a working garbage collector or class verifier) are not present. Furthermore, the runtime substitution algorithms implemented in QuickInterp are not optimized for performing substitution quickly – instead, they are designed to create optimal substitutions. The goals for the benchmarking phase are derived from the main research goals as seen previously in section 1.5.

G1 Collect, devise and adapt a set of algorithms that, when provided with profiling data of a particular application, generate an optimized superinstruction set.

G2 Collect, devise and adapt a set of algorithms that, when provided with a superinstruction set and application code, can insert appropriate superinstructions so as to maximize performance.

G3 Implement these algorithms, either on top of OpenJDK or by some other means.

G4 Evaluate performance on a set of sample applications and benchmarks. The benchmarks are discussed in section 6.3.

While goal G4 is the only goal that mentions benchmarking, it is not an independent goal, as the algorithms to evaluate have been implemented in response to goal G2, goal G3 and to a lesser extent goal G1. Goal G2 has been answered by the creation of a universal superinstruction

set construction algorithm (section 4.4) that uses static evaluation (section 4.4.5). As such, the data gathered from the benchmarks must give insight into how a particular static evaluation score relates to actual runtime performance. Goal G3 – the implementation of each algorithm – provides a platform to create comparative tests for. Various benchmarks can be run in the same environment where only the active runtime substitution algorithm is changed. These benchmarks should be representative of a range of applications – both applications where a small number of methods make up the bulk of the application's runtime, and applications where the execution is spread over a large number of methods. This leads to the following benchmarking goals:

BG1 Create and implement a benchmarking application where a few sequences of bytecode are executed frequently.

BG2 Create and implement a benchmarking application where large amounts of bytecode are executed frequently without such a prominent hotspot.

BG3 Compare, using the created benchmarks, the efficacy of the four implemented runtime substitution algorithms (triplet-based substitution, tree-based substitution, shortest path substitution, and shortest path substitution with instruction equivalence).

BG4 Compare, using the created benchmarks, the efficacy of the four implemented runtime substitution algorithms against a standard OpenJDK Zero JVM without superinstructions (this is the JVM QuickInterp is based on).

BG5 Test, using the created benchmarks, the accuracy of static evaluation as described in section 4.4.5.

Although not explicitly listed as a goal, the environment for the various tests must be standardized: the same hardware, the same compiler suite, the same software, and so on. Standardizing the environment for optimizing the superinstruction set, however, is not straightforward. We might for example give each runtime substitution algorithm the same amount of time to optimize in the superinstruction set construction algorithm. While this may seem fair, some runtime substitution algorithms perform substitution a lot faster, so one algorithm may complete a lot more optimization iterations within the superinstruction set construction evaluation loop. Another way to standardize in that case would be to standardize based on the number of evaluations. As such, both of these approaches for standardizing the optimization environment were tried.

6.3 Benchmark selection

6.3.1 Small benchmark: JMH primes benchmark

To answer benchmarking goal BG1 we produced a small arithmetic benchmark using the Java Microbenchmark Harness (JMH). This benchmark tests whether a particular integer is prime in a straightforward and naive way.

boolean isPrime(int possiblePrime) {
    for (int i = 2; i < possiblePrime; i++) {
        if (possiblePrime % i == 0)
            return false;
    }
    return true;
}

Listing 6.1: The prime test benchmark

The Java implementation of this prime test is shown in Listing 6.1. The method isPrime(int) is invoked on a range of numbers between 240,000 and 250,000 over and over again for 30 seconds.

Such a run of 30 seconds (an “iteration”) is repeated 20 times, for a total of 600 seconds worth of test data. Each iteration reports the throughput (in operations per second), and the overall throughput across iterations is computed by JMH. Because QuickInterp has no JIT compiler that optimizes the application as it runs, there is no need for an extensive warmup before the iterations that count towards the actual score can start. As such, warmup is limited to a single iteration of just 10 seconds, to help place the code in cache and make sure all the code has been run through the runtime substitution algorithm. JMH typically runs benchmarks in “forked” mode, where it spawns off a separate JVM instance to run the actual benchmark. However, QuickInterp profiling cannot work while forking (the host JVM would overwrite the profile of the forked JVM) and as such this was disabled. Both the benchmark run on which profiling was conducted and the benchmark runs used to gather results had forking disabled.
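The JMH settings described above roughly correspond to the following configuration. This is an approximation rather than the actual benchmark source: the class name is hypothetical, and exactly what constitutes one “operation” (here assumed to be a single isPrime call on the next number in the 240,000–250,000 range) is not specified by the text.

// Approximate JMH configuration matching the description above.
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 1, time = 10, timeUnit = TimeUnit.SECONDS)       // single short warmup
@Measurement(iterations = 20, time = 30, timeUnit = TimeUnit.SECONDS) // 20 iterations of 30 s each
@Fork(0) // forking disabled: the host JVM would overwrite the forked JVM's profile
@State(Scope.Benchmark)
public class PrimesBenchmark {

    int candidate = 240_000;

    @Benchmark
    public boolean primeTest() {
        int n = candidate++;
        if (candidate >= 250_000) {
            candidate = 240_000; // cycle through the range over and over
        }
        return isPrime(n);
    }

    // The prime test from Listing 6.1.
    boolean isPrime(int possiblePrime) {
        for (int i = 2; i < possiblePrime; i++) {
            if (possiblePrime % i == 0) {
                return false;
            }
        }
        return true;
    }
}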

6.3.2 Large benchmark: Spring pet clinic web app

The larger benchmark, implemented to answer benchmarking goal BG2, is based on the Spring Boot Pet Clinic example application [16]. Spring Boot is a modern Java web framework, handling HTTP requests, database access, dependency injection, view rendering, caching and more. The Pet Clinic application is a sample application that is used to demo the various features present in the Spring Framework. It models an application that one might find in the back office of an actual pet clinic: pet owners can be managed, each with multiple pets, and visits can be planned for those pets. The entity relation diagram of the database can be seen in Figure 6.1.


Figure 6.1: Entity relation diagram of the Pet Clinic database

The benchmark uses Apache JMeter to gather results. Apache JMeter is an external load testing application for various network protocols including HTTP, and it makes HTTP requests according to a test plan to the Spring Boot application. Our test plan tests the scenario of adding a new visit to a particular pet:

1. Visit the landing page (“home”)
2. Go to the “find owners” search page
3. Search for a particular pet owner by last name
4. On the owner information page, go to one of the pets registered to this owner (a screenshot of the owner information page can be seen in Figure 6.2)
5. Log a new visit at a random date and with a random message

Figure 6.2: Screenshot of the Pet Clinic web application, showing information about a pet owner

This scenario is repeated 70 times per simulated user. The test plan simulates 8 concurrent users, each of which follows the same test plan independently, for a total of 70 · 8 = 560 executions of the scenario. The Pet Clinic application comes preloaded with some sample data, including various pet owners and their pets. From this data, five different pet owners were selected, and visits are logged only for these five pet owners. This is an attempt at lowering database contention, as an increase in database contention causes request threads to wait for each other. Whether and when the threads handling requests wait for each other is not deterministic, as it depends on the interleaving of the particular threads, and as such it would increase the noise in the benchmarking data. The duration of each HTTP request is recorded, which is used to compute the average duration per request for the whole test plan, as well as the average duration per request per request type. This benchmark, unlike the primes benchmark from section 6.3.1, always performs a fixed number of operations rather than testing operations for a fixed amount of time. Since this benchmark adds items to the database, it will gradually slow down as the database becomes more populated. This

makes it important to ensure that each execution of the benchmark ends up with the same number of items in the database at the end, regardless of whether it can insert those items quickly or slowly. Due to the large number of classes loaded (there are 14,644 classes in the profile), the Pet Clinic application is first “warmed up” by executing the scenario once. This causes all the relevant classes to load prior to running the actual benchmark, which ensures that we are testing interpreter performance and not class loader performance.

6.3.3 Benchmarking environment and parameters

The environment for each benchmark was kept the same. We only conducted testing on one hardware platform.

CPU       Ryzen 3700X, 8 cores (16 threads), stock cooler (no overclocking)
Memory    32 GB at 3200 MT/s (CL16)
OS        Ubuntu 19.10
Compiler  gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)
Host JVM  OpenJDK 64-Bit Server VM (build 11.0.6+10-post-Ubuntu-1ubuntu119.10.1, mixed mode, sharing)

The Host JVM is the JVM used for optimizing the instruction set and building QuickInterp. We opted against running this on QuickInterp, as QuickInterp is still significantly slower than a JVM with a highly-optimized JIT compiler. Since there is no expected change in output besides producing the output faster, there is no gain in running it on QuickInterp versus a much faster JVM. To run the benchmark, only a minimal number of background processes may be running on the test machine. Furthermore, before starting each benchmark the test system must be idle for at least 30 seconds. This is to ensure the CPU has had time to cool down, as the selected CPU (Ryzen 3700X) makes great use of any thermal headroom it has available to boost its clock speeds. However, no monitoring of temperature was done. As mentioned in the goals section, the configuration for the superinstruction set construction algorithm can be standardized in two different ways: based on time and based on a fixed number of evaluations. While we did not write the runtime substitution algorithms with performance in mind, for practical testing we limited the optimization time to 180 seconds. The superinstruction set construction algorithm was also run on the test machine.

6.4 Results: JMH Primes Benchmark

6.4.1 Benchmark and static evaluation scores

In Figure 6.3 the result of running the benchmark with cache enabled can be seen across various superinstruction set sizes, with Figure 6.4 showing the results of the same benchmark but without cache. The benchmark was run for the superinstruction set sizes of 1, 2, 3, 4, 5, 10 and 20. The green line in both figures is the baseline result obtained from an OpenJDK Zero JVM without our patches applied. The superinstruction set construction algorithm uses an iterative optimization algorithm to find the optimal superinstruction set, which was discussed in section 4.4.5. As part of iterative optimization, each superinstruction set has a static evaluation score (see section 4.4.3). This score is an indication of the ultimate benchmark performance, and a higher score should relate to better runtime performance. Figure 6.5 plots the static evaluation score against the number of superinstructions used for each benchmark run. Figure 6.6 plots the static evaluation score against

the benchmark score. Ideally, we would expect to see a linear correlation between the benchmark score and the static evaluation score, as the static evaluation score is an indicator of the benchmark score. Finally, as discussed in the introduction (section 6.3.3), bounding the optimization time to 180 seconds for each algorithm disadvantages slower algorithms, as they cannot do the same number of optimization iterations. Figure 6.7 plots the number of static evaluations each of the algorithms managed to perform within their allowed 180 seconds.

Figure 6.3: Primes benchmark with cache enabled
Figure 6.4: Primes benchmark without cache enabled

Figure 6.5: Primes benchmark static evaluation score
Figure 6.6: Primes benchmark score vs static evaluation score


Figure 6.7: Number of static evaluations completed within 180 seconds for each algorithm

Equivalence algorithm results The above graphs do not show the results of the equivalence algorithm. The equivalence algorithm was found to be very slow: about three orders of magnitude slower than the shortest path algorithm. This makes it almost impossible to optimize the superinstruction set with this algorithm, as it cannot come close to the number of evaluations that the other algorithms can do in the optimization time (180 seconds). As such, we instead opted to use the superinstruction set optimized with the shortest-path algorithm. However, the primes benchmark can be effectively optimized with so few superinstructions that none of them are equivalent, and performing substitution with this superinstruction set on all profiled code yielded no substitutions due to equivalence. When running the equivalence algorithm in such a scenario, where it can only place superinstructions that exactly match the instructions they replace, it behaves exactly like the shortest-path algorithm that it is based on. As such, the equivalence algorithm is not included in the benchmark results, as the shortest-path algorithm is already shown.

6.4.2 Result interpretation

Figure 6.3 and Figure 6.4 both show an uptick in performance from baseline, peaking at 6959.537 operations/second (shortest path with 10 superinstructions, no cache) compared to the baseline of 4778.601 operations/second (a 45.6% improvement in throughput). Note that even though we show these numbers here with great precision, repeating the tests is likely to change them somewhat, especially when run on a different system. This result is achieved with very few superinstructions for the shortest path and tree substitution algorithms, and the triplet-based runtime substitution algorithm also catches up when given enough superinstructions. This appears to indicate that very few superinstructions are needed to cover the primes benchmark, which is exactly the expected behavior, as the primes benchmark tests just a very short method with few instructions over and over. The two figures (Figure 6.3 and Figure 6.4) are very similar, showing that caching makes almost no difference here. Caching is the only way system classes can receive superinstructions, but this appears to not make any difference within the primes benchmark. Once again, this is not surprising, as the primes benchmark does not use any system libraries by itself. Even though JMH may use such libraries, this appears to not have a significant effect on performance. Another thing is noticeable about the two figures: there appears to be some “noise” in the data. Sometimes, a run with more superinstructions (e.g. the triplet-based algorithm with two superinstructions) performs worse than a run with fewer superinstructions (e.g. the triplet-based algorithm with one superinstruction). This is unexpected, as adding a superinstruction should only improve the result. Note that the noise cannot simply be explained with run-to-run variance:

it shows up both in the cache-enabled and cache-disabled runs in exactly the same way, and we even ran the benchmark multiple times to see if it would go away. It did not. Figure 6.5 shows the static evaluation score, and what is most interesting is that the “noise” is not present here. In other words, the static evaluation algorithm believes it is actually building a better superinstruction set with each newly added superinstruction. And when looking at the benchmark score vs the static evaluation score (Figure 6.6), it generally looks like a linear relation (which is what you would expect), but with an outlier. The noise shows up here again, showing that while a better static evaluation score is generally a good predictor of better performance, it doesn't appear to capture the whole story. We will revisit the noise in section 6.6.

6.5 Spring Pet Clinic benchmark

For the Spring Pet Clinic benchmark, we did not run each configuration with and without cache, instead opting to only use the cache. It is the most representative of our solution, and it significantly reduces the amount of time taken to collect all the benchmark results, allowing us to investigate even larger superinstruction sets. We kept the superinstruction set construction time at 180 seconds, and ran the benchmark for the following superinstruction set sizes: 1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 100, 200, 300, 400, 500, 750 and 1000.

6.5.1 Benchmark and static evaluation scores

Figure 6.8 shows an overview of all data points collected by plotting the average request duration against the number of superinstructions for each algorithm. Each data point in this graph is the average duration of all the requests in the scenario. Since the scenario loads different pages, it can be expected that certain pages are systematically slower than others. Figure 6.10 shows just the result for the shortest path algorithm, still plotting duration against the number of superinstructions. However, in this graph, the various request types have not been averaged together. Instead, the average of each individual request type is shown. Figure 6.9 shows all algorithms again, but this time using only the duration of the Open “Landing” page request. Figure 6.11 shows the static evaluation score once again, showing how the static evaluation score improves with more superinstructions. Figure 6.12 shows the relation between the static evaluation score and the actual benchmark result. Note that for the Spring Pet Clinic benchmark, a lower time per request is better. As such, ideally, the static evaluation score would correlate inversely with the time per request. Finally, Figure 6.13 shows how many optimization evaluations were possible within the allotted 180 seconds. It shows how the time that a substitution algorithm needs is not independent of the number of superinstructions.

6.5.2 Result interpretation

Figure 6.8 and Figure 6.9 both show that the superinstruction optimization also benefits the Spring benchmark, generally showing faster requests when run with more superinstructions. The absolute best performer is the tree algorithm with 500 superinstructions, which manages an average request time of just 343.93 milliseconds. Comparing that to the baseline of 457.50 milliseconds yields a performance improvement of 33.0%. The theoretical maximum for the Spring Pet Clinic benchmark is 158,756,594,928 dispatches saved, while the shortest path runtime substitution algorithm with 1000 superinstructions tops out at a static evaluation score of 155,673,831,868 dispatches saved (98.052% of the maximum), which stopped us from testing even larger superinstruction sets. Only when the superinstruction set is tiny does our implementation actually perform worse than baseline. One explanation for this is that the baseline does not have to do code stretching (discussed in section 5.3.4), while our implementation always has to code stretch all classes, even when running with almost no superinstructions. However, given enough superinstructions, the baseline is always outperformed.


Figure 6.8: Spring Pet Clinic benchmark (average of all requests)


Figure 6.9: Spring Pet Clinic benchmark (average of all Open “Landing” page requests)


Figure 6.10: Spring Pet Clinic benchmark showing the duration of the individual request types (all using shortest path)


Figure 6.11: Spring benchmark static evaluation score
Figure 6.12: Spring benchmark average duration vs static evaluation score


Figure 6.13: Number of static evaluations completed within 180 seconds for each algorithm

Both figures (Figure 6.8 and Figure 6.9) show the same kind of noise that the primes benchmark results also showed – giving the superinstruction set construction algorithm more superinstructions doesn't always improve performance. This noise could be caused by the selected superinstructions, or it could be caused by something else (e.g. the CPU throttles down or some background process runs). Here, we do not have a run with cache and without cache to test if this noise is due to such background tasks, but we can examine whether this behavior is shown by all the different requests. Figure 6.10 shows the results of only the shortest path algorithm, but plots all the different request types independently. If the noise were caused by background tasks or another factor independent of the interpreter, we would not expect each request type to be equally affected. However, Figure 6.10 shows that for runs where the JVM performed better, it performed better for all request types. And likewise, where it performed worse, it performed worse across all request types. From this we can tell that the inconsistent performance must be somehow due to the selected superinstructions, as it is the only variable. It could be that the superinstruction set construction algorithm happened to get stuck in a local optimum: since the algorithm uses random mutations to find the best superinstruction set, it could find a worse superinstruction set when given more superinstructions simply due to chance. If this were the case however, we would expect a similar amount of noise in Figure 6.11, which is not present. In fact, the noise should even be more extreme in this figure, as the time spent dispatching

within the interpreter is not the only thing the interpreter does, while it is the only thing the static evaluation score is based on. Figure 6.12 continues the investigation by plotting the static evaluation score against the request duration. If the static evaluation score were a perfect predictor of the performance of a superinstruction set, this plot would show a perfect line, which it does not. This is the same conclusion we drew from the interpretation of the results of the primes benchmark (section 6.4.2), and we will discuss this noise more in section 6.6.

6.5.3 Interpreter size and superinstructions


Figure 6.14: Spring Pet Clinic benchmark libjvm.so size vs number of superinstructions
Figure 6.15: Spring Pet Clinic benchmark score vs libjvm.so size

Besides the benchmark score and static evaluation score, we also recorded the size of the JVM library (libjvm.so) for each compiled JVM. It is this library that contains the interpreter, and as the number of superinstructions increases, one would expect its size to increase. However, the superinstruction set construction algorithm isn't bound to a particular length of superinstruction: instead, it is limited only by the number of superinstructions it may make, without any constraints on their length. As such, it is possible for the superinstruction set construction algorithm to create a larger JVM binary while using fewer superinstructions. Figure 6.14 shows that this generally does not happen: it shows a linear correlation between the number of superinstructions and the size of the JVM library. The size of the JVM library does not appear to give an explanation for the noise seen earlier. A larger JVM may not fit in the cache, or cause other issues that prevent it from performing as well. However, Figure 6.15 does not show a strong correlation between size and time per request. In fact, a large JVM binary size appears to be a good indicator of a good benchmark score. This makes sense, as a large JVM binary likely contains more superinstructions, and section 6.5.1 already showed that more superinstructions generally improve performance. As such, we do not believe the size is the origin of the noise seen in the benchmark results.

6.6 Result interpretation

The previous sections have presented the results of the primes and Spring benchmarks. We saw how, for each of the benchmarks, adding more superinstructions generally improves performance (to a point), and saw how – when given enough superinstructions – each of the runtime substitution algorithms generally converges to about the same performance. On the topic of runtime substitution, we discussed another caveat: the equivalence substitution algorithm was in all benchmarks completely useless, falling back to “simple” shortest-path substitution. Even in the 1000-superinstruction equipped Spring benchmarks, it failed to make even a single substitution due to equivalence.

Furthermore, we saw that there is some “noise” in the results, noise that cannot be explained by simple run-to-run variance. Instead, it appears the static evaluation algorithm is not a perfect predictor of performance.

6.6.1 Best runtime substitution algorithm

We also saw how few superinstructions are actually necessary to make good substitutions. The very best result of the Spring benchmark was obtained with just 300 superinstructions, without larger superinstruction set sizes improving on this result. The results also show that – besides the triplet-based substitution algorithm – all runtime substitution algorithms perform comparably. And when given enough superinstructions, even the triplet-based substitution algorithm can catch up to the others. To understand why the new runtime substitution algorithms perform about the same, let us consider their design philosophies:

1. Tree-based substitution: improves over triplet-based substitution by requiring fewer superinstructions in the JVM to make a substitution
2. Shortest-path based substitution: improves over triplet-based substitution by making better (non-eager) substitutions
3. Equivalence-based substitution: improves over shortest-path based substitution by requiring fewer superinstructions due to equivalence

The philosophy of every substitution algorithm comes down to either reducing the number of necessary superinstructions (tree-based, equivalence-based), or improving the handling of multiple choices of superinstructions (shortest-path based). However, with just 300 superinstructions, these philosophies do not appear to apply. Both the shortest-path based and equivalence-based substitution algorithms operate under the assumption that there are a lot of superinstructions, and that these superinstructions are not necessarily generated for the locations where they can be placed. For example, a superinstruction may primarily be added to the superinstruction set because it occurs in method A (according to the profile), but it can be placed in method B, method C and method D. However, we suspect that this generally doesn't happen: a superinstruction generated for method A probably does not fit anywhere else. As such, a runtime substitution algorithm does not have to worry about placing this superinstruction somewhere where it blocks a better superinstruction (the problem the shortest-path algorithm attempts to solve). This explains why the shortest-path algorithm rarely outperforms the tree-based algorithm. Instead, they generally perform equally. Likewise, the equivalence algorithm is not going to do a good job placing this superinstruction anywhere else. When implementing a new runtime substitution algorithm, the only important requirement appears to be that it can place superinstructions that match exactly. A runtime substitution algorithm does not have to search for equivalence or even place superinstructions along a shortest path. Each superinstruction appears to have a place that it came from according to the profile, and only has to be substituted at that location. All the substitution algorithm needs to do when it sees that location again is place that superinstruction. Each of the proposed runtime substitution algorithms is capable of making such a substitution, which is likely why they perform about equally. Thus, the tree-based substitution algorithm appears to be enough for superinstruction set sizes up to 1000. While we did not test larger superinstruction set sizes, in applications that benefit from even larger superinstruction sets the shortest-path algorithm may make a comeback.
In such scenarios, while each superinstruction still has just one place where it needs to be inserted, other superinstructions may interfere when substitution is performed eagerly. Since this is the type of problem the shortest-path algorithm is designed for, it would be a better candidate in such a case.
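To make the observation above concrete, the sketch below shows what a minimal exact-match substitution pass could look like. This is an illustration only, not the QuickInterp implementation: the Superinstruction record and the plain opcode-array representation are hypothetical stand-ins for the real class-loading pipeline, and operands and branch targets are ignored for brevity.

import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model: a superinstruction is identified by the exact
// sequence of original opcodes it replaces, plus the new opcode assigned to it.
record Superinstruction(int[] pattern, int opcode) {}

final class ExactMatchSubstituter {

    // Replaces every exact occurrence of a known pattern by its superinstruction
    // opcode, scanning the method's opcode sequence from left to right.
    static int[] substitute(int[] code, List<Superinstruction> set) {
        ArrayList<Integer> out = new ArrayList<>();
        int pc = 0;
        while (pc < code.length) {
            Superinstruction match = longestMatchAt(code, pc, set);
            if (match != null) {
                out.add(match.opcode());          // one dispatch instead of many
                pc += match.pattern().length;     // skip the instructions it covers
            } else {
                out.add(code[pc]);                // keep the original instruction
                pc++;
            }
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }

    // Picks the longest superinstruction whose pattern starts exactly at pc.
    private static Superinstruction longestMatchAt(int[] code, int pc, List<Superinstruction> set) {
        Superinstruction best = null;
        for (Superinstruction si : set) {
            int[] p = si.pattern();
            if (pc + p.length > code.length) continue;
            boolean matches = true;
            for (int i = 0; i < p.length; i++) {
                if (code[pc + i] != p[i]) { matches = false; break; }
            }
            if (matches && (best == null || p.length > best.pattern().length)) best = si;
        }
        return best;
    }
}

A pass of this shape is essentially what the discussion above argues is sufficient in practice: it places a superinstruction wherever its originating sequence reappears, without any shortest-path or equivalence reasoning.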

6.6.2 Static evaluation: not a perfect predictor

The interpretation in section 6.4.2 and section 6.5.2 already discussed the noise in the performance data. To recap: this noise does not appear to be run-to-run variance caused by factors unrelated to the application. And, this noise is not present in the final static evaluation score of each JVM, so it is not caused by the superinstruction set construction getting stuck in a local optimum. This leaves one obvious culprit: the static evaluation score is not a perfect indicator of runtime performance. A superinstruction set with a better static evaluation score may in fact perform worse at runtime.

This may have many explanations. For example, it could be that the C++ compiler that compiles the interpreter can very effectively optimize certain instruction primitives when put together in a superinstruction, while it cannot do the same for others. It could be explained by CPU intricacies: maybe the CPU branch predictor is already capable of fetching the next instruction(s) in particular cases, but not all. Such CPU behavior would make certain instruction dispatches “cheaper” than others, and creating superinstructions that cover the “cheap” instruction dispatches would not provide the same benefit as covering other instructions. It could even be due to the size or the order of superinstruction handlers within the interpreter, with certain superinstruction handlers pushing others out of the level-1 CPU cache. We did a brief investigation into the relation between the size of the libjvm.so binary file and the benchmark performance, and found no correlation; however, this is by no means definitive. What is clear is that static evaluation is not a perfect predictor of runtime performance. Future work will be necessary to discover how it can be improved, which promises to offer larger performance improvements than further improving the runtime substitution algorithm.

6.7 Conclusion

In this chapter, we finally saw the performance of QuickInterp in two benchmarks. We discussed the two benchmarks (primes and Spring), what they are testing and how they are implemented. Then, we discussed the results of each. As it turns out, all the newly-introduced runtime substitution algorithms perform about the same, and each of them generally performs better than the triplet-based substitution based on the algorithm from Ertl et al. [4]. There is also some “noise” in the results, which is best explained by a mismatch between static evaluation and actual runtime performance. At the start of this chapter, five benchmark goals (BGs) were introduced:

BG1 Create and implement a benchmarking application where a few sequences of bytecode are executed frequently.

BG2 Create and implement a benchmarking application where large amounts of bytecode are executed frequently without such a prominent hotspot.

BG3 Compare, using the created benchmarks, the efficacy of the four implemented runtime substitution algorithms (triplet-based substitution, tree-based substitution, shortest-path substitution, and shortest-path substitution with instruction equivalence).

BG4 Compare, using the created benchmarks, the efficacy of the four implemented runtime substitution algorithms against a standard OpenJDK Zero JVM without superinstructions (this is the JVM QuickInterp is based on).

BG5 Test, using the created benchmarks, the accuracy of static evaluation as described in section 4.4.5.

The first two goals – BG1 and BG2 – are met with the primes JMH benchmark from section 6.3.1 and the Spring Pet Clinic benchmark from section 6.3.2 respectively. The first three runtime substitution algorithms (triplet-based substitution, tree-based substitution, and shortest-path based substitution) were all compared. We did not include or even run the equivalence substitution algorithm beyond letting it build its cache, where it consistently reported making no substitutions due to equivalence. Since this algorithm then behaves exactly like the shortest-path algorithm, we consider it included and benchmarking goal BG3 met. To make a comparison with a JVM without superinstructions, baseline results of a stock OpenJDK Zero JVM without superinstructions were included in the test results, meeting benchmarking goal BG4. And finally, we tested the accuracy of the static evaluation algorithm to meet benchmarking goal BG5. The noise we observed in the results appears to be caused by inaccuracy of the static evaluation algorithm, as discussed in section 6.6.2; however, we also saw the performance generally improve as the static evaluation score improves (e.g. Figure 6.6 and Figure 6.12 from chapter 6), making it not a useless indicator either.

That wraps up the benchmarking and results chapter. In chapter 7 we will revisit the original research questions and discuss possible future work.

Chapter 7

Final thoughts

With the design complete (chapter 4), implemented (chapter 5), and benchmarked and discussed (chapter 6), section 7.1 reflects on the research goals as originally written in section 1.5. With the research questions answered, we wrap up this chapter, and with it this thesis, by presenting several potential research directions based on this thesis in section 7.2.

7.1 Revisiting the research goals and questions

In section 1.5, we listed four research goals. These goals were derived from our research questions. In this section we discuss them in reverse order: first revisiting the research goals to see how they have been met, and then moving on to answer the questions.

7.1.1 Research goals

G1 Collect, devise and adapt a set of algorithms that, when provided with profiling data of a particular application, generate an optimized superinstruction set.

G2 Collect, devise and adapt a set of algorithms that, when provided with a superinstruction set and application code, can insert appropriate superinstructions as to maximize performance.

G3 Implement these algorithms, either on top of OpenJDK or by some other means.

G4 Evaluate performance on a set of sample applications and benchmarks. The benchmarks are discussed in section 6.3.

Research goal G1 was met with the design (section 4.4) and implementation (section 5.5) of the iterative optimization algorithm using static evaluation. The research goal also requires the presence of profiling data, which was made available by designing (section 4.3) and implementing (section 5.4) a runtime profiling system. Research goal G2 was met with the design and implementation of four runtime substitution algorithms (section 4.5 and section 4.6 for the design, section 5.6 for the implementation). Four algorithms were implemented, including a re-implementation of the original runtime substitution algorithm used by Ertl et al. [4]. All algorithms, including the superinstruction architecture, were implemented on top of OpenJDK Zero (chapter 5), meeting research goal G3. This also includes the code generation to generate the VM, modifications to the VM to support up to 2^16 superinstructions, and class loading code to actually perform substitutions. Finally, research goal G4 was met by evaluating performance on the benchmarks presented in chapter 6. With all research goals met, we can revisit and answer the research questions with the obtained results.

7.1.2 Answering research questions

The research goals were derived from a set of research questions originally listed in section 1.5.2:

RQ How can graph-based superinstruction matching algorithms and graph-based superinstruction set construction algorithms provide a performance speedup over existing algorithms in an interpreting JVM?

Which can be divided further into subquestions:

R1 What are the performance characteristics of a superinstruction VM with graph-based instruction matching algorithms on modern hardware?

R2 How can graph-based instruction matching algorithms be tuned and optimized for particular workloads?

R3 How can graph-based superinstruction set construction algorithms be tuned and optimized for particular workloads?

The answering of these questions will also answer another question:

RX How does the superinstruction architecture as implemented in earlier work perform on a modern JVM interpreter implementation?

In this section, we will first look at the subquestions (R1-R3), then use their answers to answer the main question (RQ). Finally, we will look at the extra question (RX).

R1: What are the performance characteristics of a superinstruction VM with graph-based instruction matching algorithms on modern hardware?

Two graph-based matching algorithms were implemented: the shortest-path runtime substitution algorithm and the equivalence runtime substitution algorithm. To answer the research question: section 6.4.1 and section 6.5.1 have shown that these algorithms – especially for small superinstruction sets – do boost performance. Furthermore, they also outperform the triplet-based substitution algorithm, which is a reimplementation of the algorithm designed by Ertl et al. [4]. But there is a caveat: we also implemented a fourth algorithm, the tree-based substitution algorithm, which alleviates the main pain point of the triplet-based algorithm. The same two sections (section 6.4.1 and section 6.5.1) show little difference between the tree-based substitution algorithm and the graph-based algorithms. Moreover, section 6.6.1 discusses how it appears that superinstructions generally aren't reused: they are derived from one location where they help performance, and the only thing the substitution algorithm has to do is place that superinstruction again when processing the same code. In other words, while graph-based instruction matching algorithms perform well on modern hardware, they do not offer a significant performance improvement over simpler substitution algorithms like the tree-based substitution algorithm. A superinstruction added to help performance in method A needs to be substituted into method A again when it is loaded, and more complex algorithms like shortest-path substitution or equivalence-based substitution are not required for such a substitution.

R2: How can graph-based instruction matching algorithms be tuned and optimized for particular workloads?

When originally asking this question, we expected the need for some mechanism to align the instruction substitution algorithm with the superinstruction set construction algorithm, which would have to be done by hand. For example, in the case of the tree-based runtime substitution algorithm, we expected that we would need to tune the runtime substitution algorithm to skip superinstructions that would degrade placement performance, when adding such superinstructions made the placement of others impossible (this is the problem solved by the shortest-path based algorithm).

If the superinstruction set is constructed by interpreting the problem as a Longest Common Subsequence (LCS) problem, independent of the runtime substitution algorithm, many superinstructions may be present in the set that cannot or should not be placed by the runtime substitution algorithm. However, we solved these problems with a special superinstruction set construction algorithm that uses static evaluation to measure the exact behavior of the runtime substitution algorithm and automatically matches that behavior. As such, to answer the question: the graph-based instruction matching algorithms need no tuning to deal with potential shortcomings caused by poor superinstruction set construction. The superinstruction set construction algorithm can instead be implemented in such a way that it automatically deals with the exact substitution behavior of the runtime substitution algorithm.

R3: How can graph-based superinstruction set construction algorithms be tuned and optimized for particular workloads?

This question was already partially answered in the previous section. When we originally asked this question, we were expecting to need dedicated superinstruction set construction algorithms for each substitution algorithm, to align with the needs of the substitution algorithm. However, in section 4.4.5 we introduced an iterative optimization algorithm, which automatically optimizes for the active runtime substitution algorithm, without having to be manually tuned for it. To answer the question: the construction of the superinstruction set can be implemented with an iterative optimization algorithm, which need not be tuned for the intricacies of the runtime substitution algorithm, but can instead adapt to the behavior that it observes from this algorithm. By using the runtime substitution algorithm itself during the construction of the superinstruction set, it is possible to construct a superinstruction set that is known to perform well with that algorithm.
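As a rough illustration of this idea, the sketch below outlines a greedy, hill-climbing style loop in which a candidate superinstruction is only admitted when the static evaluation score improves, where that score is obtained by running the actual runtime substitution algorithm over the recorded profile. This is a simplified sketch under our own assumptions, not the exact algorithm of section 4.4.5: the Candidate type parameter and the staticEvaluation function are hypothetical placeholders.

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToLongFunction;

// Illustrative only: the evaluator is assumed to run the real runtime substitution
// algorithm over the profile and return the number of instruction dispatches saved
// by the given superinstruction set (higher is better).
final class IterativeSetBuilder {

    static <C> Set<C> build(List<C> candidates,
                            ToLongFunction<Set<C>> staticEvaluation,
                            int maxSetSize) {
        Set<C> selected = new LinkedHashSet<>();
        long bestScore = staticEvaluation.applyAsLong(selected);

        while (selected.size() < maxSetSize) {
            C bestCandidate = null;
            long bestCandidateScore = bestScore;

            // Try every remaining candidate and keep the one that helps the most,
            // as measured by the same substitution algorithm used at class load time.
            for (C candidate : candidates) {
                if (selected.contains(candidate)) continue;
                Set<C> trial = new LinkedHashSet<>(selected);
                trial.add(candidate);
                long score = staticEvaluation.applyAsLong(trial);
                if (score > bestCandidateScore) {
                    bestCandidateScore = score;
                    bestCandidate = candidate;
                }
            }

            if (bestCandidate == null) break;   // no remaining candidate improves the score
            selected.add(bestCandidate);
            bestScore = bestCandidateScore;
        }
        return selected;
    }
}

Because the objective function invokes the substitution algorithm itself, a loop of this shape automatically avoids admitting superinstructions that the substitution algorithm would never actually place.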

RQ: How can graph-based superinstruction matching algorithms and graph-based superinstruction set construction algorithms provide a performance speedup over existing algorithms in an interpreting JVM?

This topic was already touched upon in the analysis of the results (section 6.6.1), and the general answer appears to be that graph-based runtime substitution algorithms do not provide a performance speedup over simpler algorithms. Given enough superinstructions and a good enough superinstruction set construction algorithm, for superinstruction set sizes up to 1000, graph-based superinstruction matching algorithms do not provide a substantial performance benefit over non-graph based algorithms. We showed that the superinstruction optimization itself is still highly relevant, achieving considerable performance improvements in both benchmarks. And we also showed that the solution based on a runtime substitution algorithm from Ertl et al. [4] – the triplet-based runtime substitution algorithm – leaves room for improvement.

Part of this behavior can be explained by the quality of the superinstruction set construction algorithm: our algorithm brings the runtime substitution algorithm into the equation, and will not add superinstructions that hurt performance. This may explain why Casey [1] reported a slight performance improvement using an “optimal” substitution over his greedy substitution algorithm, while using fewer superinstructions than we did: Casey used a greedy superinstruction set construction algorithm that did not take the limitations of the substitution algorithm into consideration, while ours does. It appears that, even when running with 1000 superinstructions, superinstructions are generally created for just one location. And as long as the runtime substitution algorithm can make that substitution, it will approach the maximum possible speedup. For such a substitution, graph-based substitution algorithms are not necessary. However, for applications that benefit from even larger numbers of superinstructions (not tested), it may happen that the superinstructions start to “interfere” when not using shortest-path based superinstruction substitution. As such, we cannot rule out the use of optimal superinstruction placement for even larger applications, even when using highly optimized superinstruction set construction algorithms.

RX: How does the superinstruction architecture as implemented in earlier work perform on a modern JVM interpreter implementation?

Considering the time between much of the earlier work and today, we expected to see some changes in the results. It is not possible to make a quantitative comparison between earlier work and ours: we did not repeat any benchmarks used by earlier work, and any performance difference could be explained by a wide range of sources (different interpreter architecture, different garbage collector, different Java version, etc.). Still, we did not expect any of the runtime substitution algorithm implementations to perform as well as they did, including, when given enough superinstructions, even the old triplet-based reimplementation of the algorithm proposed by Ertl et al. [4]. To answer the question: the triplet-based substitution algorithm still performs well given enough superinstructions, as the “pressure” for more superinstructions was not as strong as we anticipated. However, considering the simplicity of better algorithms like the tree-based algorithm, we still would not recommend reimplementing the triplet-based substitution algorithm in a new superinstruction VM.

7.2 Future work

In the previous section we reached an interesting conclusion: optimizing the way superinstructions are placed is not an avenue that appears to bring great performance improvements. However, this does not mean there is nothing to further explore, and we will discuss various approaches in this section.

7.2.1 Production-ready implementation

The implementation introduced in chapter 5 had one interesting implementation goal:

IG6: Produce a VM that provides an accurate, testable reflection of the design of each of the components as laid out in chapter 4 (G4 from section 1.5).

This goal was added specifically to enable cutting corners while creating a VM that could be tested, and as a result the QuickInterp implementation as we present it for this thesis is by no means a production-ready VM. The garbage collector and class verifier are broken, and the code base is full of bad programming patterns (e.g. hard-to-verify memory management, absolute paths to source code folders, commented-out assertions) which would have to be fixed.

To fix the garbage collector and the class verifier, we would suggest removing the JVMTI agent. The use of a JVMTI agent is already a shortcut, and using an external API like that internally causes problems when attempting to load other JVMTI agents (they would see classes with superinstructions). Instead, the code stretching and superinstruction conversion would need to happen after class loading. Moving the superinstruction substitution out of the JVMTI agent would fix class verification, leaving only the garbage collector. The VM would need to be changed to be aware of superinstructions: it needs to know which instructions are superinstructions, and what those superinstructions are made out of. Then, the garbage collector would need to be modified to recognize superinstructions, and interpret them as a sequence of regular instructions (which is what they are).

Extensive validation of the VM would also have to be performed, just like for a regular production build of OpenJDK. Oracle maintains the Java Compatibility Kit1 (JCK) – a test kit containing unit tests and other facilities for testing the correctness of a JVM implementation – which could be used to validate the VM. The JCK is proprietary, but it is made available to those wishing to contribute code to the OpenJDK project after signing various forms.

1https://openjdk.java.net/groups/conformance/JckAccess/

7.2.2 Better static evaluation heuristics

In section 6.6.2, we discussed that static evaluation as it stands today is not a perfect predictor of runtime performance. Static evaluation as presented in this thesis (discussed in section 4.4.3) measures how many instruction dispatches can be saved with a particular superinstruction set, with more being better. However, it appears that there is more to runtime performance, as some superinstructions perform worse than others even when they should save the same number of dispatches. The exact relation between creating a particular superinstruction and its impact on benchmark performance would have to be researched, to create better heuristics for use in the static evaluation algorithm. Note that this may require other changes to QuickInterp as well: it may turn out that the order of the instruction handlers within the VM matters, or that the occurrence of a particular instruction handler blocks particular C++ compiler optimizations. These would have to be identified as well, to solve the disparity between static evaluation score and benchmark score.
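One possible direction, sketched below under the assumption that per-dispatch costs could be measured or estimated, is to weight each saved dispatch instead of counting them uniformly. This is our own illustrative sketch: the SavedDispatch model and the dispatchCost function are hypothetical, and the current static evaluation effectively corresponds to a cost of 1 for every dispatch.

import java.util.List;
import java.util.function.ToDoubleFunction;

// Illustrative sketch: score a substitution outcome by the weighted cost of the
// dispatches it removes, rather than by their plain count.
final class WeightedStaticEvaluation {

    // A dispatch that a superinstruction substitution would eliminate, identified
    // here only by the opcode that used to be dispatched (hypothetical model).
    record SavedDispatch(int opcode, long executionCount) {}

    static double score(List<SavedDispatch> saved, ToDoubleFunction<SavedDispatch> dispatchCost) {
        double total = 0.0;
        for (SavedDispatch d : saved) {
            // Weight each eliminated dispatch by how expensive it is believed to be
            // (branch predictability, cache effects, ...), times how often it runs.
            total += dispatchCost.applyAsDouble(d) * d.executionCount();
        }
        return total;
    }
}

With dispatchCost fixed to 1.0 this degenerates to the current dispatch-count metric; the open research question is which cost model actually closes the gap between static evaluation score and benchmark score.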

7.2.3 Dynamic rewriting

A modern JIT compiler may do more work than just converting instructions to machine code. Since a JIT compiler operates at runtime, it can compile the code multiple times, making assumptions that only hold at the time the code is compiled and relying on the fact that the code can be recompiled later when those assumptions no longer hold. For example, a common optimization is devirtualization, where interface calls are replaced with regular method calls based on the runtime state of the VM. The VM may know that for a particular interface (e.g. List) only one implementation is loaded (e.g. ArrayList). As such, all code that operates on a List must be using an ArrayList, and it is therefore possible to replace code using the List interface with code using the ArrayList class. Virtual method lookups are faster than interface method lookups, and as such this benefits performance, as illustrated by the example at the end of this section. As soon as the assumption is broken (a new implementation is loaded), the JIT compiler has to generate new code.

Dynamic rewriting would bring such optimizations to a superinstruction VM, where the code is substituted with superinstructions multiple times based on the state of the VM. Here, the runtime substitution algorithm starts to behave more and more like a JIT compiler, but instead of targeting machine code, it targets all instructions in the superinstruction set.
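To illustrate the kind of assumption involved, consider the following small Java fragment. It is our own example, not taken from the QuickInterp code base: if the VM observes that ArrayList is the only loaded List implementation, a JIT compiler (or, speculatively, a rewriting superinstruction VM) could treat the interface call as a direct call until another implementation is loaded.

import java.util.ArrayList;
import java.util.List;

class Devirtualization {
    // The call below is an invokeinterface on List. If ArrayList is the only List
    // implementation loaded so far, the VM may speculatively treat it as a direct
    // (invokevirtual-style) call to ArrayList.size(), which is cheaper to dispatch.
    static int count(List<String> names) {
        return names.size();
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        names.add("duke");
        // As long as no other List implementation is loaded, the speculative
        // devirtualization of count() remains valid; loading e.g. LinkedList
        // would force the optimized code to be discarded and regenerated.
        System.out.println(count(names));
    }
}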

7.2.4 Dynamically-loadable superinstructions for OSGi applications

One problem of the superinstruction architecture is that it requires building a VM for an application. But what if a new application needs to be deployed to an existing VM? Typically, this new application would not be able to benefit from increased runtime performance due to superinstructions, as the VM does not contain the right superinstructions for the application. In section 6.6.1, we discussed the suspicion that superinstructions are typically not shared: they may be allocated for just one location, and the substitution algorithm only needs to insert them there. This creates an interesting opportunity for such applications: they could include their own superinstructions. These could be delivered to the VM in a shared library (a .dll or .so file), which would be loaded into the VM's process. Then, the interpreter tables are patched to contain the new superinstructions, and the runtime substitution algorithm is informed of the newly available superinstructions, solving the main drawback of having to recompile the whole VM.

One drawback is that this violates the security model of the JVM. Classes can be loaded in a sandboxed environment, limiting their access to file, network and other APIs. However, loading a shared library into the VM process would give the native code within that library free rein over everything the JVM has access to. But, as long as the shared library is generated by a trusted source, it can help solve the problem of requiring a JVM rebuild for every application.

7.2.5 Value-dependent superinstructions

While not shown in the results chapter, at some point our QuickInterp interpreter contained a bug where it could not create superinstructions containing any jump instruction (not even conditional jumps). With that limitation, it offered a much more modest performance improvement in the primes benchmark, scoring only about 12% better than baseline (versus the 45.6% improvement shown in section 6.4.2). This indicates to us that the ability to add more instructions to a superinstruction is of significant value to performance.

The idea of “value-dependent superinstructions” is to specialize instructions. For example, if “goto 20” is executed a lot according to the profile, we would create a new instruction called “goto_20”. Special instruction primitives would need to be available to deal with receiving a jump offset from the code generator. Since this instruction now has a fixed jump offset, it can be entirely included in a superinstruction. The superinstruction set code generator would have to be updated to allow jumps to take place within a superinstruction, and with that mechanism in place, larger sections of methods could be compiled into just one superinstruction. This mechanism could also be applied to transform often-executed “tableswitch” and “lookupswitch” instructions so that they too can be included in superinstructions.
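A minimal sketch of this idea on the bytecode-rewriting side is shown below, assuming a hypothetical specialized GOTO_20 opcode exists in the interpreter. How the matching C++ handler with a hard-coded jump offset would be generated is not shown; the opcode values and the Instruction type here are our own illustrative assumptions.

// Illustrative sketch of value-dependent specialization during class loading:
// if the profile says "goto 20" is hot and the interpreter was built with a
// specialized GOTO_20 opcode, rewrite that exact instruction to the specialized
// form so it can later be merged into a superinstruction.
final class ValueDependentRewriter {

    static final int GOTO = 0xa7;       // standard goto opcode
    static final int GOTO_20 = 0x1a7;   // hypothetical specialized two-byte opcode

    record Instruction(int opcode, int operand) {}

    static Instruction specialize(Instruction insn) {
        // Only the exact (opcode, operand) pair the profile asked for is specialized;
        // every other goto keeps its generic, operand-carrying form.
        if (insn.opcode() == GOTO && insn.operand() == 20) {
            return new Instruction(GOTO_20, 0); // the offset is now baked into the handler
        }
        return insn;
    }
}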

7.2.6 Superinstructions as a JIT target

The ideas of dynamically rewritten code (section 7.2.3) and value-dependent superinstructions (section 7.2.5) both encroach on the capabilities of a true JIT compiler. JIT compilers are typically capable of transforming entire methods to machine code, and with value-dependent superinstructions it would become possible to transform larger sequences – including loops – into a single superinstruction. Likewise, JIT compilers may use runtime profiling to rewrite the code based on the state of the VM, which is a capability that could be introduced with dynamically rewritten code. This raises the question: why not take a JIT compiler instead, and retarget it to work with superinstructions? An existing JIT compiler could be made to emit commonly executed sequences when the VM is built, making those sequences available to the interpreter as new instructions. At runtime, the JIT compiler would perform the role of a runtime substitution algorithm, where instead of generating new code it glues the ahead-of-time compiled pieces of code together. This would benefit from the many years of research into JIT compilers, while the runtime remains platform-independent.

Appendices

Appendix A

List of bytecode instructions

A.1 Bytecode instructions with their description

The following table lists all bytecode instructions part of the Java Virtual Machine Specification version 11 [8]. The description of each instruction is copied directly from the Java Virtual Machine Specification document.

Opcode Instruction Description 0x0 nop Do nothing 0x1 aconst_null Push null 0x2 iconst_m1 Push int constant 0x3 iconst_0 Push int constant 0x4 iconst_1 Push int constant 0x5 iconst_2 Push int constant 0x6 iconst_3 Push int constant 0x7 iconst_4 Push int constant 0x8 iconst_5 Push int constant 0x9 lconst_0 Push long constant 0xa lconst_1 Push long constant 0xb fconst_0 Push float constant 0xc fconst_1 Push float constant 0xd fconst_2 Push float constant 0xe dconst_0 Push double constant 0xf dconst_1 Push double constant 0x10 bipush Push byte 0x11 sipush Push short 0x12 ldc Push item from run-time constant pool 0x13 ldc_w Push item from run-time constant pool (wide index) 0x14 ldc2_w Push long or double from run-time constant pool (wide index) 0x15 iload Load int from local variable 0x16 lload Load long from local variable 0x17 fload Load float from local variable 0x18 dload Load double from local variable 0x19 aload Load reference from local variable 0x1a iload_0 Load int from local variable 0x1b iload_1 Load int from local variable 0x1c iload_2 Load int from local variable 0x1d iload_3 Load int from local variable 0x1e lload_0 Load long from local variable 0x1f lload_1 Load long from local variable

131 0x20 lload_2 Load long from local variable 0x21 lload_3 Load long from local variable 0x22 fload_0 Load float from local variable 0x23 fload_1 Load float from local variable 0x24 fload_2 Load float from local variable 0x25 fload_3 Load float from local variable 0x26 dload_0 Load double from local variable 0x27 dload_1 Load double from local variable 0x28 dload_2 Load double from local variable 0x29 dload_3 Load double from local variable 0x2a aload_0 Load reference from local variable 0x2b aload_1 Load reference from local variable 0x2c aload_2 Load reference from local variable 0x2d aload_3 Load reference from local variable 0x2e iaload Load int from array 0x2f laload Load long from array 0x30 faload Load float from array 0x31 daload Load double from array 0x32 aaload Load reference from array 0x33 baload Load byte or boolean from array 0x34 caload Load char from array 0x35 saload Load short from array 0x36 istore Store int into local variable 0x37 lstore Store long into local variable 0x38 fstore Store float into local variable 0x39 dstore Store double into local variable 0x3a astore Store reference into local variable 0x3b istore_0 Store int into local variable 0x3c istore_1 Store int into local variable 0x3d istore_2 Store int into local variable 0x3e istore_3 Store int into local variable 0x3f lstore_0 Store long into local variable 0x40 lstore_1 Store long into local variable 0x41 lstore_2 Store long into local variable 0x42 lstore_3 Store long into local variable 0x43 fstore_0 Store float into local variable 0x44 fstore_1 Store float into local variable 0x45 fstore_2 Store float into local variable 0x46 fstore_3 Store float into local variable 0x47 dstore_0 Store double into local variable 0x48 dstore_1 Store double into local variable 0x49 dstore_2 Store double into local variable 0x4a dstore_3 Store double into local variable 0x4b astore_0 Store reference into local variable 0x4c astore_1 Store reference into local variable 0x4d astore_2 Store reference into local variable 0x4e astore_3 Store reference into local variable 0x4f iastore Store into int array 0x50 lastore Store into long array 0x51 fastore Store into float array 0x52 dastore Store into double array 0x53 aastore Store into reference array 0x54 bastore Store into byte or boolean array 0x55 castore Store into char array

132 0x56 sastore Store into short array 0x57 pop Pop the top operand stack value 0x58 pop2 Pop the top one or two operand stack values 0x59 dup Duplicate the top operand stack value 0x5a dup_x1 Duplicate the top operand stack value and insert two values down 0x5b dup_x2 Duplicate the top operand stack value and insert two or three values down 0x5c dup2 Duplicate the top one or two operand stack values 0x5d dup2_x1 Duplicate the top one or two operand stack values and insert two or three values down 0x5e dup2_x2 Duplicate the top one or two operand stack values and insert two, three, or four values down 0x5f swap Swap the top two operand stack values 0x60 iadd Add int 0x61 ladd Add long 0x62 fadd Add float 0x63 dadd Add double 0x64 isub Subtract int 0x65 lsub Subtract long 0x66 fsub Subtract float 0x67 dsub Subtract double 0x68 imul Multiply int 0x69 lmul Multiply long 0x6a fmul Multiply float 0x6b dmul Multiply double 0x6c idiv Divide int 0x6d ldiv Divide long 0x6e fdiv Divide float 0x6f ddiv Divide double 0x70 irem Remainder int 0x71 lrem Remainder long 0x72 frem Remainder float 0x73 drem Remainder double 0x74 ineg Negate int 0x75 lneg Negate long 0x76 fneg Negate float 0x77 dneg Negate double 0x78 ishl Shift left int 0x79 lshl Shift left long 0x7a ishr Arithmetic shift right int 0x7b lshr Arithmetic shift right long 0x7c iushr Logical shift right int 0x7d lushr Logical shift right long 0x7e iand Boolean AND int 0x7f land Boolean AND long 0x80 ior Boolean OR int 0x81 lor Boolean OR long 0x82 ixor Boolean XOR int 0x83 lxor Boolean XOR long 0x84 iinc Increment local variable by constant 0x85 i2l Convert int to long 0x86 i2f Convert int to float 0x87 i2d Convert int to double

133 0x88 l2i Convert long to int 0x89 l2f Convert long to float 0x8a l2d Convert long to double 0x8b f2i Convert float to int 0x8c f2l Convert float to long 0x8d f2d Convert float to double 0x8e d2i Convert double to int 0x8f d2l Convert double to long 0x90 d2f Convert double to float 0x91 i2b Convert int to byte 0x92 i2c Convert int to char 0x93 i2s Convert int to short 0x94 lcmp Compare long 0x95 fcmpl Compare float 0x96 fcmpg Compare float 0x97 dcmpl Compare double 0x98 dcmpg Compare double 0x99 ifeq Branch if int comparison with zero succeeds 0x9a ifne Branch if int comparison with zero succeeds 0x9b iflt Branch if int comparison with zero succeeds 0x9c ifge Branch if int comparison with zero succeeds 0x9d ifgt Branch if int comparison with zero succeeds 0x9e ifle Branch if int comparison with zero succeeds 0x9f if_icmpeq Branch if int comparison succeeds 0xa0 if_icmpne Branch if int comparison succeeds 0xa1 if_icmplt Branch if int comparison succeeds 0xa2 if_icmpge Branch if int comparison succeeds 0xa3 if_icmpgt Branch if int comparison succeeds 0xa4 if_icmple Branch if int comparison succeeds 0xa5 if_acmpeq Branch if reference comparison succeeds 0xa6 if_acmpne Branch if reference comparison succeeds 0xa7 goto Branch always 0xa8 jsr Jump subroutine 0xa9 ret Return from subroutine 0xaa tableswitch Access jump table by index and jump 0xab lookupswitch Access jump table by key match and jump 0xac ireturn Return int from method 0xad lreturn Return long from method 0xae freturn Return float from method 0xaf dreturn Return double from method 0xb0 areturn Return reference from method 0xb1 return Return void from method 0xb2 getstatic Get static field from class 0xb3 putstatic Set static field in class 0xb4 getfield Fetch field from object 0xb5 putfield Set field in object 0xb6 invokevirtual Invoke instance method; dispatch based on class 0xb7 invokespecial Invoke instance method; direct invocation of instance initialization methods and methods of the current class and its supertypes 0xb8 invokestatic Invoke a class (static) method 0xb9 invokeinterface Invoke interface method 0xba invokedynamic Invoke a dynamically-computed call site 0xbb new Create new object

134 0xbc newarray Create new array 0xbd anewarray Create new array of reference 0xbe arraylength Get length of array 0xbf athrow Throw exception or error 0xc0 checkcast Check whether object is of given type 0xc1 instanceof Determine if object is of given type 0xc2 monitorenter Enter monitor for object 0xc3 monitorexit Exit monitor for object 0xc4 wide Extend local variable index by additional bytes 0xc5 multianewarray Create new multidimensional array 0xc6 ifnull Branch if reference is null 0xc7 ifnonnull Branch if reference not null 0xc8 goto_w Branch always (wide index) 0xc9 jsr_w Jump subroutine (wide index)

A.2 Instruction handler flags

The following table shows the flags of each JVM instruction. These flags relate to the ability of the instruction to be merged into a superinstruction in QuickInterp. The following flags exist:

UNKNOWN_PC_OFFSET The instruction changes the program counter by a variable amount. This applies to instructions with a variable amount of instruction operands, like tableswitch. If such an instruction is placed in a superinstruction, writes to the pc cannot be coalesced symbolically past this instruction as seen in section 5.3.3. An instruction with this flag can only be part of a superinstruction if it also has the UPDATES_PC_OFFSET flag.

UNKNOWN_STACK_OFFSET Similar to UNKNOWN_PC_OFFSET, but for the stack. Has the same consequences, but now stack writes cannot be coalesced. This applies to instructions like invokevirtual which pop (or push) a variable amount of values to the operand stack. An instruction with this flag can only be part of a superinstruction if it also has the UPDATES_STACK_OFFSET flag.

NO_SUPERINSTRUCTION This instruction cannot be part of a superinstruction. Usually these are instructions which leave the interpreter and expect to return by means of a dispatch, and since it is not possible to dispatch to within a superinstruction, these cannot be part of a superinstruction.

TERMINAL When placed in a superinstruction, no instruction would be executed after this instruction as it either unconditionally jumps away or exits the method.

JUMP This is a jump instruction. This is combined with TERMINAL to indicate the instruction is an unconditional jump.

UPDATES_PC_OFFSET This instruction writes to the program counter, updating it. This has consequences for program counter write coalescence, as reads after this instruction are now no longer relative to the start of the superinstruction, but relative to the last instruction with the UPDATES_PC_OFFSET flag. This flag is required for instructions with UNKNOWN_PC_OFFSET if they are allowed in a superinstruction.

UPDATES_STACK_OFFSET Similar to UPDATES_PC_OFFSET, but for the operand stack. This flag is required for instructions with UNKNOWN_STACK_OFFSET if they are allowed in a superinstruction.
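The interaction between these flags can be summarised as a small predicate. The sketch below is our own illustration with a hypothetical EnumSet-based Java model; it captures when an instruction may be merged into a superinstruction according to the rules above. The full per-opcode flag assignment follows in the table below.

import java.util.EnumSet;

// Hypothetical Java model of the QuickInterp instruction flags described above.
enum InstructionFlag {
    UNKNOWN_PC_OFFSET, UNKNOWN_STACK_OFFSET, NO_SUPERINSTRUCTION,
    TERMINAL, JUMP, UPDATES_PC_OFFSET, UPDATES_STACK_OFFSET
}

final class SuperinstructionRules {
    // An instruction may be merged into a superinstruction unless it is explicitly
    // excluded, or it has an unknown pc/stack effect without also updating the
    // corresponding offset itself.
    static boolean mayBePartOfSuperinstruction(EnumSet<InstructionFlag> flags) {
        if (flags.contains(InstructionFlag.NO_SUPERINSTRUCTION)) return false;
        if (flags.contains(InstructionFlag.UNKNOWN_PC_OFFSET)
                && !flags.contains(InstructionFlag.UPDATES_PC_OFFSET)) return false;
        if (flags.contains(InstructionFlag.UNKNOWN_STACK_OFFSET)
                && !flags.contains(InstructionFlag.UPDATES_STACK_OFFSET)) return false;
        return true;
    }
}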

Opcode Label QuickInterp flags 0x0 nop

135 0x1 aconst_null 0x2 iconst_m1 0x3 iconst_0 0x4 iconst_1 0x5 iconst_2 0x6 iconst_3 0x7 iconst_4 0x8 iconst_5 0x9 lconst_0 0xa lconst_1 0xb fconst_0 0xc fconst_1 0xd fconst_2 0xe dconst_0 0xf dconst_1 0x10 bipush 0x11 sipush 0x12 ldc NO_SUPERINSTRUCTION 0x13 ldc_w NO_SUPERINSTRUCTION 0x14 ldc2_w NO_SUPERINSTRUCTION 0x15 iload 0x16 lload 0x17 fload 0x18 dload 0x19 aload 0x1a iload_0 0x1b iload_1 0x1c iload_2 0x1d iload_3 0x1e lload_0 0x1f lload_1 0x20 lload_2 0x21 lload_3 0x22 fload_0 0x23 fload_1 0x24 fload_2 0x25 fload_3 0x26 dload_0 0x27 dload_1 0x28 dload_2 0x29 dload_3 0x2a aload_0 0x2b aload_1 0x2c aload_2 0x2d aload_3 0x2e iaload 0x2f laload 0x30 faload 0x31 daload 0x32 aaload 0x33 baload 0x34 caload 0x35 saload 0x36 istore

136 0x37 lstore 0x38 fstore 0x39 dstore 0x3a astore 0x3b istore_0 0x3c istore_1 0x3d istore_2 0x3e istore_3 0x3f lstore_0 0x40 lstore_1 0x41 lstore_2 0x42 lstore_3 0x43 fstore_0 0x44 fstore_1 0x45 fstore_2 0x46 fstore_3 0x47 dstore_0 0x48 dstore_1 0x49 dstore_2 0x4a dstore_3 0x4b astore_0 0x4c astore_1 0x4d astore_2 0x4e astore_3 0x4f iastore 0x50 lastore 0x51 fastore 0x52 dastore 0x53 aastore 0x54 bastore 0x55 castore 0x56 sastore 0x57 pop 0x58 pop2 0x59 dup 0x5a dup_x1 0x5b dup_x2 0x5c dup2 0x5d dup2_x1 0x5e dup2_x2 0x5f swap 0x60 iadd 0x61 ladd 0x62 fadd 0x63 dadd 0x64 isub 0x65 lsub 0x66 fsub 0x67 dsub 0x68 imul 0x69 lmul 0x6a fmul 0x6b dmul 0x6c idiv

137 0x6d ldiv 0x6e fdiv 0x6f ddiv 0x70 irem 0x71 lrem 0x72 frem 0x73 drem 0x74 ineg 0x75 lneg 0x76 fneg 0x77 dneg 0x78 ishl 0x79 lshl 0x7a ishr 0x7b lshr 0x7c iushr 0x7d lushr 0x7e iand 0x7f land 0x80 ior 0x81 lor 0x82 ixor 0x83 lxor 0x84 iinc 0x85 i2l 0x86 i2f 0x87 i2d 0x88 l2i 0x89 l2f 0x8a l2d 0x8b f2i 0x8c f2l 0x8d f2d 0x8e d2i 0x8f d2l 0x90 d2f 0x91 i2b 0x92 i2c 0x93 i2s 0x94 lcmp 0x95 fcmpl 0x96 fcmpg 0x97 dcmpl 0x98 dcmpg 0x99 ifeq JUMP 0x9a ifne JUMP 0x9b iflt JUMP 0x9c ifge JUMP 0x9d ifgt JUMP 0x9e ifle JUMP 0x9f if_icmpeq JUMP 0xa0 if_icmpne JUMP 0xa1 if_icmplt JUMP 0xa2 if_icmpge JUMP

138 0xa3 if_icmpgt JUMP 0xa4 if_icmple JUMP 0xa5 if_acmpeq JUMP 0xa6 if_acmpne JUMP 0xa7 goto JUMP, TERMINAL 0xa8 jsr JUMP, TERMINAL 0xa9 ret JUMP, TERMINAL 0xaa tableswitch JUMP, NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xab lookupswitch JUMP, NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xac ireturn NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xad lreturn NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xae freturn NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xaf dreturn NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xb0 areturn NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xb1 return NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xb2 getstatic NO_SUPERINSTRUCTION, UNKNOWN_STACK_OFFSET, UPDATES_STACK_OFFSET 0xb3 putstatic NO_SUPERINSTRUCTION, UNKNOWN_STACK_OFFSET, UPDATES_STACK_OFFSET 0xb4 getfield NO_SUPERINSTRUCTION, UNKNOWN_STACK_OFFSET, UPDATES_STACK_OFFSET 0xb5 putfield NO_SUPERINSTRUCTION, UNKNOWN_STACK_OFFSET, UPDATES_STACK_OFFSET 0xb6 invokevirtual NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xb7 invokespecial NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xb8 invokestatic NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xb9 invokeinterface NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xba invokedynamic NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xbb new 0xbc newarray 0xbd anewarray 0xbe arraylength 0xbf athrow NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xc0 checkcast NO_SUPERINSTRUCTION 0xc1 instanceof NO_SUPERINSTRUCTION 0xc2 monitorenter NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xc3 monitorexit NO_SUPERINSTRUCTION 0xc4 wide NO_SUPERINSTRUCTION, UNKNOWN_STACK_OFFSET, UPDATES_PC_OFFSET, UPDATES_STACK_OFFSET 0xc5 multianewarray NO_SUPERINSTRUCTION, UNKNOWN_STACK_OFFSET, UPDATES_STACK_OFFSET 0xc6 ifnull JUMP 0xc7 ifnonnull JUMP 0xc8 goto_w JUMP, NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET 0xc9 jsr_w JUMP, NO_SUPERINSTRUCTION, TERMINAL, UNKNOWN_STACK_OFFSET

A.3 Instruction pc and stack offsets

To enable write coalescing within a superinstruction, the size of the instruction in the bytecode stream and its modifications to the top-of-stack pointer need to be known at compile-time so succeeding instruction handlers can be generated with an offset. The instruction size with operands is shown in the PC offset column, and the stack offset is shown in the Stack offset column. For more information, see section 5.7.1. Note that all PC offset values are after code stretching, and as such are all one byte larger than what the JVM specification [8] describes.
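As an illustration of how these per-instruction offsets could drive handler generation, the sketch below accumulates pc and stack offsets over a candidate sequence, which is roughly the bookkeeping a code generator needs in order to emit each concatenated handler at the right relative offsets and defer the actual pc and top-of-stack writes to a single coalesced update. The types and the printed "generation plan" are hypothetical simplifications of the real generator described in section 5.7.1; the per-instruction offsets themselves are listed in the table below.

import java.util.List;

// Hypothetical, simplified model of one row of the table below: the opcode's size
// in the (stretched) bytecode stream and its net effect on the operand stack.
record InstructionInfo(String mnemonic, int pcOffset, int stackOffset) {}

final class OffsetCoalescing {
    // Walks a candidate instruction sequence and reports the cumulative pc and
    // stack offsets at which each handler body would be generated, so that pc and
    // top-of-stack writes can be coalesced into a single update at the end.
    static void printGenerationPlan(List<InstructionInfo> sequence) {
        int pc = 0, stack = 0;
        for (InstructionInfo insn : sequence) {
            System.out.printf("emit %-10s at pc+%d, stack%+d%n", insn.mnemonic(), pc, stack);
            pc += insn.pcOffset();
            stack += insn.stackOffset();
        }
        System.out.printf("single coalesced update: pc += %d, top-of-stack += %d%n", pc, stack);
    }

    public static void main(String[] args) {
        // Example sequence iload_1, iload_2, iadd, using the offsets from the table below.
        printGenerationPlan(List.of(
                new InstructionInfo("iload_1", 2, 1),
                new InstructionInfo("iload_2", 2, 1),
                new InstructionInfo("iadd", 2, -1)));
    }
}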

Opcode Instruction PC offset Stack offset 0x0 nop +2 0 0x1 aconst_null +2 +1 0x2 iconst_m1 +2 +1 0x3 iconst_0 +2 +1 0x4 iconst_1 +2 +1 0x5 iconst_2 +2 +1 0x6 iconst_3 +2 +1 0x7 iconst_4 +2 +1 0x8 iconst_5 +2 +1 0x9 lconst_0 +2 +2 0xa lconst_1 +2 +2 0xb fconst_0 +2 +1 0xc fconst_1 +2 +1 0xd fconst_2 +2 +1 0xe dconst_0 +2 +2 0xf dconst_1 +2 +2 0x10 bipush +3 +1 0x11 sipush +4 +1 0x12 ldc +3 +1 0x13 ldc_w +4 +1 0x14 ldc2_w +4 +2 0x15 iload +3 +1 0x16 lload +3 +2 0x17 fload +3 +1 0x18 dload +3 +2 0x19 aload +3 +1 0x1a iload_0 +2 +1 0x1b iload_1 +2 +1 0x1c iload_2 +2 +1 0x1d iload_3 +2 +1 0x1e lload_0 +2 +2 0x1f lload_1 +2 +2 0x20 lload_2 +2 +2 0x21 lload_3 +2 +2 0x22 fload_0 +2 +1 0x23 fload_1 +2 +1 0x24 fload_2 +2 +1 0x25 fload_3 +2 +1 0x26 dload_0 +2 +2 0x27 dload_1 +2 +2 0x28 dload_2 +2 +2 0x29 dload_3 +2 +2 0x2a aload_0 +2 +1

140 0x2b aload_1 +2 +1 0x2c aload_2 +2 +1 0x2d aload_3 +2 +1 0x2e iaload +2 -1 0x2f laload +2 0 0x30 faload +2 -1 0x31 daload +2 0 0x32 aaload +2 -1 0x33 baload +2 -1 0x34 caload +2 -1 0x35 saload +2 -1 0x36 istore +3 -1 0x37 lstore +3 -2 0x38 fstore +3 -1 0x39 dstore +3 -2 0x3a astore +3 -1 0x3b istore_0 +2 -1 0x3c istore_1 +2 -1 0x3d istore_2 +2 -1 0x3e istore_3 +2 -1 0x3f lstore_0 +2 -2 0x40 lstore_1 +2 -2 0x41 lstore_2 +2 -2 0x42 lstore_3 +2 -2 0x43 fstore_0 +2 -1 0x44 fstore_1 +2 -1 0x45 fstore_2 +2 -1 0x46 fstore_3 +2 -1 0x47 dstore_0 +2 -2 0x48 dstore_1 +2 -2 0x49 dstore_2 +2 -2 0x4a dstore_3 +2 -2 0x4b astore_0 +2 -1 0x4c astore_1 +2 -1 0x4d astore_2 +2 -1 0x4e astore_3 +2 -1 0x4f iastore +2 -3 0x50 lastore +2 -4 0x51 fastore +2 -3 0x52 dastore +2 -4 0x53 aastore +2 -3 0x54 bastore +2 -3 0x55 castore +2 -3 0x56 sastore +2 -3 0x57 pop +2 -1 0x58 pop2 +2 -2 0x59 dup +2 +1 0x5a dup_x1 +2 +1 0x5b dup_x2 +2 +1 0x5c dup2 +2 +2 0x5d dup2_x1 +2 +2 0x5e dup2_x2 +2 +2 0x5f swap +2 0 0x60 iadd +2 -1

141 0x61 ladd +2 -2 0x62 fadd +2 -1 0x63 dadd +2 -2 0x64 isub +2 -1 0x65 lsub +2 -2 0x66 fsub +2 -1 0x67 dsub +2 -2 0x68 imul +2 -1 0x69 lmul +2 -2 0x6a fmul +2 -1 0x6b dmul +2 -2 0x6c idiv +2 -1 0x6d ldiv +2 -2 0x6e fdiv +2 -1 0x6f ddiv +2 -2 0x70 irem +2 -1 0x71 lrem +2 -2 0x72 frem +2 -1 0x73 drem +2 -2 0x74 ineg +2 0 0x75 lneg +2 0 0x76 fneg +2 0 0x77 dneg +2 0 0x78 ishl +2 -1 0x79 lshl +2 -1 0x7a ishr +2 -1 0x7b lshr +2 -1 0x7c iushr +2 -1 0x7d lushr +2 -1 0x7e iand +2 -1 0x7f land +2 -2 0x80 ior +2 -1 0x81 lor +2 -2 0x82 ixor +2 -1 0x83 lxor +2 -2 0x84 iinc +4 0 0x85 i2l +2 +1 0x86 i2f +2 0 0x87 i2d +2 +1 0x88 l2i +2 -1 0x89 l2f +2 -1 0x8a l2d +2 0 0x8b f2i +2 0 0x8c f2l +2 +1 0x8d f2d +2 +1 0x8e d2i +2 -1 0x8f d2l +2 0 0x90 d2f +2 -1 0x91 i2b +2 0 0x92 i2c +2 0 0x93 i2s +2 0 0x94 lcmp +2 -3 0x95 fcmpl +2 -1 0x96 fcmpg +2 -1

142 0x97 dcmpl +2 -3 0x98 dcmpg +2 -3 0x99 ifeq +4 -1 0x9a ifne +4 -1 0x9b iflt +4 -1 0x9c ifge +4 -1 0x9d ifgt +4 -1 0x9e ifle +4 -1 0x9f if_icmpeq +4 -2 0xa0 if_icmpne +4 -2 0xa1 if_icmplt +4 -2 0xa2 if_icmpge +4 -2 0xa3 if_icmpgt +4 -2 0xa4 if_icmple +4 -2 0xa5 if_acmpeq +4 -2 0xa6 if_acmpne +4 -2 0xa7 goto +4 0 0xa8 jsr +4 0 0xa9 ret +4 0 0xaa tableswitch Unknown Unknown 0xab lookupswitch Unknown Unknown 0xac ireturn Unknown Unknown 0xad lreturn Unknown Unknown 0xae freturn Unknown Unknown 0xaf dreturn Unknown Unknown 0xb0 areturn Unknown Unknown 0xb1 return Unknown Unknown 0xb2 getstatic +4 Unknown 0xb3 putstatic +4 Unknown 0xb4 getfield +4 Unknown 0xb5 putfield +4 Unknown 0xb6 invokevirtual Unknown Unknown 0xb7 invokespecial Unknown Unknown 0xb8 invokestatic Unknown Unknown 0xb9 invokeinterface Unknown Unknown 0xba invokedynamic Unknown Unknown 0xbb new +4 +1 0xbc newarray +3 0 0xbd anewarray +4 0 0xbe arraylength +2 0 0xbf athrow Unknown Unknown 0xc0 checkcast +4 0 0xc1 instanceof +4 0 0xc2 monitorenter Unknown Unknown 0xc3 monitorexit +2 -1 0xc4 wide Unknown Unknown 0xc5 multianewarray +5 Unknown 0xc6 ifnull +4 -1 0xc7 ifnonnull +4 -1 0xc8 goto_w Unknown Unknown 0xc9 jsr_w Unknown Unknown

Definitions

abstract interpreter An abstract interpreter is used to statically analyze a program without running it. It operates on data types and not on values, and takes every branch until it has visited every instruction. See section 2.1.4. Used on pages 8, 17, 21, 56, 57

abstract syntax tree A tree representation of the syntactic structure of source code. Used on page 55

ASM ObjectWeb ASM is a library that provides an API for modifying JVM bytecode. For QuickInterp, ASM is modified to work with two-byte opcodes and used in the runtime substitution algorithms. Used on pages 72, 85

base superinstruction candidate A base superinstruction candidate is a superinstruction candidate directly derived from the profile and is used for static evaluation. Further processing may increase the number of superinstruction candidates for consideration. See section 4.4.2 and section 4.4.3. Used on pages 30–33, 41

bytecode The instruction set of the JVM is called bytecode. It is a very compact stack-based instruction set where every instruction opcode is just a single byte, making it very suitable for transport across the internet. In order to execute bytecode, a JVM implementation is required. Used on page 2

bytecode stream Alternative name for the Code attribute within a method, which is the source of all bytecode instructions. Used on page 16

control table In an interpreter the control table is the table of all instruction handlers. Instruction tokens are dispatched by referencing their offset from the control table and jumping to this location in a token-threaded interpreter. See section 1.2. Used on pages 19, 20

DDG A Data Dependency Graph is a graph constructed from a program with the instructions as nodes. Directed edges between instructions indicate a data dependency, where one instruction operates on the data provided by a previous instruction. Used on pages 49, 51, 59–61

instruction handler The code in an interpreter implementing the instruction semantics. The interpreter will dispatch to the right instruction handler after fetching the current instruction-to-be-executed. See section 1.2. Used on pages 2, 12, 19, 20, 26, 93, 138, 139

instruction operand The value an instruction operates on. Typically one that is embedded in the program code and not obtained from external sources. See section 1.2. Used on pages 16, 19, 20, 39, 40

interpreter An interpreter executes an input program without any prior conversion. See section 1.2. Used on pages 9, 10, 12, 138, 139

javac The Java compiler. Used on pages 15, 57

JIT compiler A method of converting bodies of virtual machine code (typically a single function or method) to target machine code just before the execution of said code. Typically faster than an interpreter for often-executed code. See section 1.3. Used on pages 9, 11, 12, 27, 106, 108, 122, 123

JNI Java Native Interface is an API enabling calling native code from Java code, and calling Java code from native code. QuickInterp uses JNI to call runtime substitution algorithms written in Java from the JVMTI agent. Used on page 72

JNIF JNIF [12] is a C++ library created to modify bytecode in a JVMTI client by the Software and Programmer Efficiency Research Group (“sape”) from the University of Lugano in Switzerland. Used on pages 72, 74

JVM A VM implementing the Java runtime environment. Since a JVM is typically not implemented in hardware, running Java applications (bytecode) almost always requires a JVM. See section 1.1 and 2. Used on pages 2, 10, 12, 15, 23, 34, 138, 140

JVMTI The Java Virtual Machine Tooling Interface (JVMTI) is a native API for tools that want to monitor or modify the state of the JVM. For QuickInterp, its key feature is the ability to intercept and modify all classes as they get loaded, which is used to implement superinstructions. Used on pages 67, 71–73, 139

profile A profile typically refers to a runtime profile of a particular application – this is a serialized log of program execution in such a way that it can help a superinstruction set construction algorithm in constructing the optimal superinstruction set for that application. Used on pages 25, 29, 138

QuickInterp The name of the new design and implementation of a superinstruction architecture created in response to the goals in section 1.5. Used on pages 2, 24–26, 28–30, 35, 39–41, 43, 46, 47, 62, 73

RPN A mathematical notation where operations are preceded by their operands. For example, instead of writing 1 + 2, in RPN one would write 1 2 +. Used on page 55

static evaluation Static evaluation refers to the ability to estimate the quality of a particular superinstruction set combined with a runtime superinstruction placement algorithm and a profile statically. See section 4.4.3. Used on pages 31, 33, 138

superinstruction A superinstruction implements the functionality of a set of regular instructions in a single instruction. Having a superinstruction in the interpreter instead of a sequence of smaller instructions saves a number of costly jumps, making it an optimization for the interpreter. See section 1.4. Used on pages 2, 9, 12, 139

superinstruction candidate A sequence of bytecode opcodes (without their associated operands) that is to be considered for inclusion in a superinstruction set. See section 4.4.2. Used on pages 7, 30–32, 138

superinstruction set A superinstruction-enabled interpreter is compiled with a set of additional superinstructions. This set is the superinstruction set. Used on pages 30, 32, 33, 139

threaded code interpreter An interpreter design where the input program is represented as an array of pointers. Executing an instruction of the program is done by executing the code at the location of the pointer. See section 1.2. Used on pages 7, 20

token-threaded interpreter A type of interpreter operating on a stream of tokens, decoding them on the fly to resolve which instruction handler to execute. See section 1.2. Used on pages 19, 66, 70, 138

VM An emulator of a computer system (hence “Virtual Machine”) – in the case of the JVM this system is not a system typically implemented in hardware, but rather an abstraction that is used to make software portable. See section 1.1 and 2. Used on pages 9, 10, 15, 139

Bibliography

[1] Kevin Casey. “Automatic Generation of Optimised Virtual Machine Interpreters”. PhD thesis. Citeseer, 2006.

[2] Kevin Casey, M Anton Ertl, and David Gregg. “Optimizing indirect branch prediction accuracy in virtual machine interpreters”. In: Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation. 2003, pp. 278–288.

[3] Kevin Casey et al. “Towards Superinstructions for Java Interpreters”. In: Software and Compilers for Embedded Systems. Ed. by Andreas Krall. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 329–343. isbn: 978-3-540-39920-9.

[4] M Anton Ertl et al. “vmgen—A Generator of Efficient Virtual Machine Interpreters”. In: Software: Practice and Experience 32.3 (2002), pp. 265–294.

[5] Martin Anton Ertl, Christian Thalinger, and Andreas Krall. “Superinstructions and Replication in the Cacao JVM interpreter”. In: Journal of .NET Technologies Vol. 4 (2006), pp. 25–32.

[6] David Gregg, M Anton Ertl, and Andreas Krall. “Implementing an efficient Java interpreter”. In: International Conference on High-Performance Computing and Networking. Springer. 2001, pp. 613–620.

[7] Tim Lindholm et al. “The Java® Virtual Machine Specification - Java SE 11 Edition”. In: Oracle, Aug. 2018. Chap. 2: The Structure of the Java Virtual Machine. url: https://docs.oracle.com/javase/specs/jvms/se11/html/index.html.

[8] Tim Lindholm et al. “The Java® Virtual Machine Specification - Java SE 11 Edition”. In: Oracle, Aug. 2018. Chap. 6: The Java Virtual Machine Instruction Set. url: https://docs.oracle.com/javase/specs/jvms/se11/html/index.html.

[9] Tim Lindholm et al. “The Java® Virtual Machine Specification - Java SE 11 Edition”. In: Oracle, Aug. 2018. Chap. 4.10: Verification of class Files. url: https://docs.oracle.com/javase/specs/jvms/se11/html/index.html.

[10] Tim Lindholm et al. “The Java® Virtual Machine Specification - Java SE 11 Edition”. In: Oracle, Aug. 2018. Chap. 3: Compiling for the Java Virtual Machine. url: https://docs.oracle.com/javase/specs/jvms/se11/html/index.html.

[11] Tim Lindholm et al. The Java® Virtual Machine Specification - Java SE 11 Edition. Aug. 2018. url: https://docs.oracle.com/javase/specs/jvms/se11/html/index.html.

[12] Luis Mastrangelo and Matthias Hauswirth. “JNIF: Java Native Instrumentation Framework”. In: Proceedings of the 2014 International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools. PPPJ '14. Cracow, Poland: Association for Computing Machinery, 2014, pp. 194–199. isbn: 9781450329262. doi: 10.1145/2647508.2647516. url: https://doi.org/10.1145/2647508.2647516.

[13] Kazunori Ogata, Hideaki Komatsu, and Toshio Nakatani. “Bytecode fetch optimization for a Java interpreter”. In: ACM SIGOPS Operating Systems Review. Vol. 36. 5. ACM. 2002, pp. 58–67.

[14] OpenJDK Zero-Assembler Project. url: https://openjdk.java.net/projects/zero/.

[15] Todd A. Proebsting. “Optimizing an ANSI C Interpreter with Superoperators”. In: Proceedings of the 22nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. POPL '95. San Francisco, California, USA: Association for Computing Machinery, 1995, pp. 322–332. isbn: 0897916921. doi: 10.1145/199448.199526. url: https://doi.org/10.1145/199448.199526.

[16] Antoine Rey et al. Spring PetClinic Sample Application. https://github.com/spring-projects/spring-petclinic. 2020.

[17] J. E. Smith and G. S. Sohi. “The microarchitecture of superscalar processors”. In: Proceedings of the IEEE 83.12 (1995), pp. 1609–1624.

[18] The Java Language Environment. url: https://www.oracle.com/technetwork/java/intro-141325.html.
