High Performance Python Through Workload Acceleration with OMR JitBuilder
by
Dayton J. Allen
Bachelor of Science, Computer Science, University of the West Indies, 2017

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Master of Computer Science
In the Graduate Academic Unit of Computer Science

Supervisor(s): David Bremner, PhD, Computer Science
Examining Board: Michael Fleming, PhD, Computer Science, Chair; Suprio Ray, PhD, Computer Science
External Examiner: Monica Wachowicz, PhD, Geodesy & Geomatics Engineering

This thesis is accepted by the Dean of Graduate Studies
THE UNIVERSITY OF NEW BRUNSWICK
June, 2020
© Dayton J. Allen, 2020

Abstract

Python remains one of the most popular programming languages in many domains, including scientific computing. Its reference implementation, CPython, is by far the most used version. CPython's runtime is bytecode-interpreted and leaves much to be desired when it comes to performance. Several attempts have been made to improve CPython's performance, such as reimplementing performance-critical code in a higher-performance language (e.g. C, C++, Rust), or transpiling Python source code to a higher-performance language, which is then called from within CPython through some form of FFI mechanism. Another approach is to JIT compile performance-critical Python methods or to use alternate implementations that include a JIT compiler.

JitBuilder provides a simplified interface to the underlying compiler technology available in Eclipse OMR. We propose using JitBuilder to accelerate performance-critical workloads in Python. By creating Python bindings to JitBuilder's public interface, we can generate native code callable from within CPython without any modifications to its runtime.

Results demonstrate that our approach rivals, and in many cases outperforms, state-of-the-art JIT-compiler-based approaches in the current ecosystem, namely Numba and PyPy.
Acknowledgements

First and foremost, I would like to thank my thesis supervisor, Dr. David Bremner, for his guidance and patience during my thesis project. Many times I pivoted or ventured off into some alternate project, and he would nudge me back on course. I would also like to thank all my CASA colleagues and IBM contacts who provided valuable help and feedback on my project. Thanks to all my Canadian friends who helped me navigate being in a foreign country, thousands of kilometres away from home and family. Thanks to my family for their endless support. Last but not least, thanks to everyone I know outside Computer Science who showed interest in my thesis project and kept asking questions. I know this is not easy.

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Abbreviations
1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Organization
2 Background & Related Work
2.0.1 CPython's Performance Woes
2.0.1.1 Bytecode Optimizations or the Lack Thereof
2.1 Making CPython Faster
2.1.1 C Extension Modules
2.1.2 Transpilers
2.1.3 JIT Compilation & Other Implementations
2.1.3.1 Repurposed JIT Compilers
2.1.3.2 Bolt-On JIT Compilers
2.2 Regular Expressions
2.3 JitBuilder
3 Python Bindings to JitBuilder
3.1 JitBuilder's Polyglot Ambitions
3.1.1 Python Language Bindings
3.1.1.1 Mapping C & C++ Language Features to Python
3.1.2 Autojit
3.1.2.1 Workflow
4 Regular Expression Engine
4.1 Design & Implementation
4.1.1 Bindings to Rust's RegEx bytecode compiler
4.1.2 RegEx Interpreter
4.1.2.1 Regular Expressions & Finite Automata
4.1.2.2 Matching Engine
4.1.3 Specializing Regex Compiler
5 Evaluation
5.1 Experimental Setup
5.2 Testing our Implementations
5.3 Discussion
5.3.1 AutoJIT
5.3.2 RegEx Engine
6 Conclusions & Future Work
6.1 Conclusions
6.2 Future Work
Bibliography
A Appendix A
A.1 Code Repositories
Vita

List of Tables

4.1 Summary of Supported Regex Features.
4.2 Regex Engine Matching (h|H)ello against "Hello World". See Section 4.1.1 for a discussion of the bytecodes used in the table above.
5.1 Relative Speedups versus CPython on our Iterative Fibonacci Benchmark.
5.2 Relative Speedups versus CPython on our Recursive Fibonacci Benchmark.
5.3 Relative Speedups versus CPython on our Dot Product Benchmark.
5.4 Relative Speedups versus CPython on our Mandelbrot Benchmark.
5.5 Relative Speedups versus CPython on our Matrix Multiplication Benchmark.
5.6 Relative Speedups versus our Pure Python Regex Interpreter on Regex Benchmark Set #1.
5.7 Relative Speedups versus our Pure Python Regex Interpreter on Regex Benchmark Set #2.
5.8 Relative Speedups versus our Pure Python Regex Interpreter on Regex Benchmark Set #3.
A.1 Code Repositories.

List of Figures

3.1 JitBuilder Client versus Implementation API.
3.2 C API Virtual Function Emulation [5].
3.3 Auto JIT Workflow.
3.4 Lowering 2D Arrays to JitBuilder.
4.1 Partial NFA for matching a single character: c.
4.2 Partial NFA that describes the alternation: n1|n2.
4.3 Partial NFA that describes matching n once or zero times: n?.
4.4 Partial NFA that describes matching n zero or more times: n*.
4.5 Partial NFA that describes matching n one or more times: n+.
4.6 Partial NFA that describes the concatenation: n1n2.
4.7 (h|H)ello NFA Representation.
4.8 Regex Matching Engine Flowchart.
5.1 Iterative Fibonacci Benchmark.
5.2 Recursive Fibonacci Benchmark.
5.3 Dot Product Benchmark.
5.4 Mandelbrot Benchmark.
5.5 Matrix Multiplication Benchmark.
5.6 Compilation Times: Python+JitBuilder versus Numba.
5.7 Regex Benchmark Set #1.
5.8 Regex Benchmark Set #2.
5.9 Regex Benchmark Set #3.
List of Symbols, Nomenclature or Abbreviations

JIT Just-in-Time Compiler
Regex Regular Expression
IL Intermediate Language
IR Intermediate Representation
DFA Deterministic Finite Automata
NFA Non-Deterministic Finite Automata
FFI Foreign Function Interface
JVM Java Virtual Machine
CLR Common Language Runtime
ASDL Abstract Syntax Description Language
MSIL Microsoft Intermediate Language
WIP Work in Progress
CASA Centre of Advanced Studies - Atlantic
VARARGs Variable Arguments
IP Instruction Pointer
SP String Pointer
PC Program Counter
LIFO Last in, First out
STDLIB Standard Library
VM Virtual Machine
API Application Programming Interface
CPU Central Processing Unit
AST Abstract Syntax Tree
EOL End of Life
RRC Rust's Regex Crate
ε Epsilon

Chapter 1 Introduction

Python [44] is a high-level, dynamically-typed programming language. It is currently one of the most popular programming languages, a feat that can be attributed to its simplicity and relatively gentle learning curve. These attributes and its rich ecosystem of high-quality libraries make it an attractive option within many domains.

Despite its many pros, one of Python's major weaknesses is the relatively poor performance of its reference implementation (CPython). This lack of performance is mainly due to its runtime being bytecode-interpreted, as well as to the language's dynamic nature. This dynamism renders many compiler optimizations infeasible.

Several attempts have been made to overcome CPython's performance limitations, such as developing native C extension modules for the CPython runtime. C extension modules necessitate a deviation from writing pure Python code by requiring code to be developed in a lower-level programming language (C/C++). This can be both a tedious and error-prone activity, even for experts. A related approach is to implement general-purpose performance-critical code in high-performance languages (e.g. C++, C, Rust) and interface with Python through some form of foreign function interface (FFI) layer. An FFI is an interface that allows code written in one programming language to be called from another, potentially unrelated, programming language. In the end, this also leads to a C extension module, but the difference is that the module is a binding to functionality from the general-purpose library rather than a fully Python-specific module. This layer of indirection positions Python as a frontend to high-performance libraries rather than as a language in which one can implement high-performance code.

Finally, we can also improve the performance of CPython by using just-in-time (JIT) compilers. Aycock [4] describes JIT compilation as compilation that occurs after a program begins execution. This allows one to collect information during execution to better guide compilation and optimization decisions.

Eclipse OMR [23] provides a rich set of language-agnostic, reusable components for implementing high-performance programming language runtimes. OMR historically stood for "Open Managed Runtimes", but the name is no longer interpreted as an acronym and has persisted despite the project not being a managed runtime. Amongst the components that OMR provides is a compiler based on the IBM J9 Java Virtual Machine (JVM). Despite its JVM roots, the OMR compiler component is also programming-language agnostic. Another component, and perhaps the most pertinent to this thesis, is JitBuilder. JitBuilder is a native acceleration library that provides a simplified interface to the underlying compiler technology available in OMR.

1.1 Motivation

The current approaches to accelerating Python code leave room for improvement. Approaches that deviate too much from the CPython runtime [37] struggle with adoption because they do not support CPython's C application programming interface (API), on which many popular Python libraries are based.